Incident Response in the Cloud

Many organizations have moved a majority of their services to the cloud in recent years, and attackers have shifted their focus accordingly. This has resulted in new techniques and methods specifically designed to compromise cloud infrastructure, like the recent SaltStack vulnerability that was widely exploited [1]. It is therefore critical for these organizations to have an incident response team that understands the new risks attached to the cloud and how the cloud can make incident response easier or harder.


In this blog post we will walk you through each phase you may encounter in traditional incident response and highlight the differences when adopting cloud computing. It is aimed at those who are new to incident response, to cloud computing, or to both. We’ve included insights that will benefit organizations taking that step towards the cloud that want to ensure they are prepared to respond efficiently to cloud incidents.


Traditional Incident Response

Before jumping into the cloud, let’s look at the phases of traditional incident response:


  • Preparation: This phase is about making sure you have a standing incident response team that is fully prepared to respond to incidents, regardless of whether you’ve had one yet. Response plans need to be created, your team trained, logs collected, detections/alerts created, and technical controls implemented and tested.

  • Identification: This is the initial trigger that alerts the incident response team to a potential incident. In traditional incident response, alerts usually come from detections in a security information and event management (SIEM) system [2] or from an internal or external user reporting suspicious activity. It’s critical for your organization to have both methods, detection and reporting, for identifying incidents, and for everyone in your organization to be trained regularly on how to report incidents.

  • Triage: After receiving the initial alert, an analyst should review it and assess whether it is an incident. The analyst should also review the severity of the incident and decide the appropriate level of response required; this includes who to loop into the incident, depending on the needs of your organization.

  • Investigation: Once the response has been initiated, the incident response team begins investigating the incident. This can involve collecting forensic data, reviewing logs, or reaching out to relevant stakeholders to scope the impact. At this point the incident response team will need access to impacted assets. In a traditional environment, members of the response team are usually granted access via a systems administrator (Sysadmin), or may be able to remotely collect data themselves using an EDR solution [3].

  • Containment: Once the investigation has concluded and the impact has been scoped, the team will move on to containment. Technical containment in traditional incident response is about removing compromised assets from the network, updating anti-malware signatures, and implementing policies on the network perimeter to remove an attacker's access. Non-technical containment can involve a number of things, but often involves containing the media story, ensuring employees don’t disclose information, or physically securing an attacker on the premises. In traditional incident response, technical containment actions are likely to be centralized, with a few teams working in close proximity. 

  • Recovery: When the incident has been contained, the next phase focuses on recovery. As with containment, technical recovery actions in traditional incident response are usually centralized in a few teams. These actions usually involve reimaging compromised hosts/servers, fixing vulnerabilities that were exploited, and removing backdoors/malicious payloads. Non-technical actions can include legal actions, communications around incident closure, and follow-up actions your organization will take to enhance its security posture.

  • Lessons Learned: The final phase, which we don’t discuss in depth in this post, is the lessons learned phase. A postmortem of the incident is performed and actions identified to improve your response and prevent similar incidents from occurring again.


It’s important to note that during an incident, different parts of your response can be in different phases. For example, the technical team may have contained the incident and be remediating any compromised instances, while the communications team could be trying to contain a media storm.


Incident Response in the Cloud

Now let’s dive into how the cloud has affected these phases. 


1. Preparation

To prepare for cloud incidents, it’s important to understand that cloud infrastructure can be created by anyone, at any moment, in a fully automated way, both within and outside your organization. This infrastructure can be set up within your realm of control, but more likely it is not. If it is created outside your realm of control, it’s less likely that you have the necessary security controls, such as monitoring, in place. There are a few ways to combat this:

  • Internal policies that state cloud infrastructure must be created within approved virtual private networks

  • Training and education for all technical staff on how to securely set up cloud infrastructure and the risks involved when they don’t

  • Using services from major cloud providers that identify and group an organization's cloud infrastructure under a single umbrella


Technical controls are extremely important for mitigation of automated and low-level compromises. To mitigate high-level attacks, you’ll rely on promoting a secure culture within your organization through training and education.


2. Identification

The identification phase is substantially easier for cloud infrastructure within your realm of control. There are a few aspects that make it easier, but the most important is that parts of your infrastructure are continuously monitored. Most cloud providers will monitor for CPU usage, beaconing activity and malicious network traffic, and notify you if they detect something abnormal or malicious. This process acts as an additional layer of security that you would normally have to set up and maintain manually within on-premises infrastructure.


A difficult aspect of identification comes when individuals within your organization set up cloud infrastructure outside your realm of control. These individuals will be notified of incidents by the cloud provider, so it falls on them to understand how to respond appropriately or to notify their own internal incident response team. A pro tip for identifying shadow IT cloud infrastructure is to set up mail rules in your SIEM (if you are able to) that alert on the sending addresses used by cloud providers. Most cloud providers also offer a service that identifies potential infrastructure belonging to your organization and lets you add it to your realm of control.
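
As a rough sketch of the mail-rule idea above, the Python snippet below flags messages whose sender domain is on a watch list and whose recipient is not a mailbox the incident response team already tracks. The domains, the message schema ("from" and "to" keys) and the function names are hypothetical placeholders, not real provider addresses; substitute whatever your providers and mail pipeline actually use.

```python
from email.utils import parseaddr

# Hypothetical placeholder domains; replace them with the notification
# domains your cloud providers really send from.
PROVIDER_SENDER_DOMAINS = {
    "alerts.cloud-provider.example",
    "abuse.other-cloud.example",
}


def is_provider_notification(from_header: str) -> bool:
    """Return True if the From header's domain is on the watch list."""
    _, address = parseaddr(from_header)
    domain = address.rpartition("@")[2].lower()
    return domain in PROVIDER_SENDER_DOMAINS


def find_shadow_it_candidates(messages, known_recipients):
    """Yield provider notifications sent to mailboxes the team doesn't track.

    `messages` is an iterable of dicts with "from" and "to" keys (an assumed
    schema). `known_recipients` is the set of lowercase mailboxes that already
    receive notifications for infrastructure within your realm of control.
    """
    for msg in messages:
        if not is_provider_notification(msg["from"]):
            continue
        if msg["to"].lower() not in known_recipients:
            yield msg
```

Any hit is only a lead: it tells you who to talk to about infrastructure you may not know about, not that an incident has occurred.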


Cloud providers have ready-to-go logging capabilities that can be enabled on demand. There is a cost associated with the storage and processing of these logs, so you’ll want to tune the logging to keep only those which are required for alerting. These logs can be pushed to an external SIEM or to the cloud provider’s security center, which comes with a number of pre-built alerts. This is a great option for organizations that don’t have the resources to commit to building out a full detection platform. Alerts from these logs are going to be one of your team’s primary detection mechanisms for incidents within the cloud. A majority of the providers will have threat intelligence feeds you can incorporate to provide additional detection opportunities for a variety of Indicators of Compromise (IOCs) [4]. Logs are also useful for triaging incidents, as they can provide a wealth of additional information. This information may not have been useful for creating a detection, but it will help the responder build a timeline of the malicious activity that occurred before and after the detection.
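
To make the IOC matching described above more concrete, here is a minimal sketch in Python. It assumes a plain-text feed with one IP address or domain per line and log entries represented as dicts with "remote_ip" and "domain" keys; both formats, and the function names, are assumptions made for illustration rather than any provider’s real schema.

```python
import ipaddress


def load_iocs(lines):
    """Split raw feed lines into a set of IP addresses and a set of domains."""
    ips, domains = set(), set()
    for line in lines:
        value = line.strip().lower()
        if not value or value.startswith("#"):
            continue  # skip blank lines and comments
        try:
            ips.add(str(ipaddress.ip_address(value)))
        except ValueError:
            # Not a valid IP address, so treat it as a domain indicator.
            domains.add(value)
    return ips, domains


def match_log_entries(entries, ips, domains):
    """Return log entries whose remote IP or queried domain matches an IOC."""
    hits = []
    for entry in entries:
        domain = (entry.get("domain") or "").lower()
        if entry.get("remote_ip") in ips or domain in domains:
            hits.append(entry)
    return hits
```

Running the same matcher over older logs is one way to build the before-and-after timeline once an indicator is known.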


Everything we have discussed can aid in identifying security incidents, but these capabilities must be implemented before they can be used.


3. Triage

Triaging incidents in the cloud to determine the severity and impact can be challenging for multiple reasons. Let’s look closer at two of them. First, it can be challenging to understand what the platform is used for and who uses it. The names of the cloud instances do not necessarily identify their purpose, so you need to track down an owner. It is therefore good practice to regularly “take stock” of the projects in your cloud environment, their owners and (if possible) their purpose. This will shorten the time it takes to contain the compromise.
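
One way to “take stock” is to keep a small inventory and periodically flag the entries that would slow you down during an incident. The sketch below is illustrative only: the ProjectRecord fields and the 90-day review window are assumptions, and in practice the records would be populated from your cloud provider’s project listing.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class ProjectRecord:
    project_id: str
    owner: Optional[str]      # contact who can grant access and answer questions
    purpose: Optional[str]    # short description of what the project is for
    last_reviewed: date       # when the owner and purpose were last confirmed


def needs_attention(records, today, max_age_days=90):
    """Return IDs of projects with no owner, no purpose, or a stale review."""
    cutoff = today - timedelta(days=max_age_days)
    return [
        record.project_id
        for record in records
        if not record.owner or not record.purpose or record.last_reviewed < cutoff
    ]
```

For example, calling needs_attention(records, date.today()) on a schedule gives you a list of projects to chase before an incident forces the question.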


A second challenge is getting access to the project. Since cloud environments have tight access controls, it is often difficult to get access without explicitly getting the owner to add you. This is vastly different from the traditional on-premises environment, where an overarching system administrator can add and remove access quickly. There are a few options to combat this, but the most efficient is to ensure your incident response team has the ability to request emergency access to a majority of projects in your environment.


Upon reaching the owner, you can judge the severity of the problem more easily by asking the following questions:

  • What is the purpose of this project?

  • Is it a test project or in production?

  • Does it have access to other projects? This helps you determine whether an attacker could use it to move laterally [6].

  • Does it have access back to the corporate environment?

  • What kind of data is in use? 

  • Where is the project geographically located?


The last question is important with respect to data breach regulations. In cloud computing, resources can be split across multiple regions for redundancy purposes. This can have grave implications if there is a data breach, so it’s critical to train your teams on how to securely and properly set up environments for the type of data they are hosting. We strongly recommend that you obtain clear guidelines from your legal counsel on where data can and can’t be hosted, and that you make clear which countries’ data legislation you’ll have to abide by.


In a cloud environment, your incident response team can investigate and process data without the data ever having to leave its region. This is particularly useful when you have a geographically dispersed incident response team that is investigating the incident, but you have data locality regulations that you need to abide by.
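
Returning to the owner questions listed earlier in this section, their answers can be turned into a coarse severity label so that different analysts triage consistently. The sketch below is only an illustration: the field names, weights and thresholds are assumptions and should be tuned to your organization’s own risk appetite.

```python
from dataclasses import dataclass


@dataclass
class TriageAnswers:
    in_production: bool
    reaches_other_projects: bool
    reaches_corporate_network: bool
    holds_sensitive_data: bool
    region_allowed_for_data: bool


def severity(answers: TriageAnswers) -> str:
    """Map the owner's answers to a coarse severity label (illustrative weights)."""
    score = 0
    score += 2 if answers.in_production else 0
    score += 2 if answers.reaches_other_projects else 0       # lateral movement risk
    score += 3 if answers.reaches_corporate_network else 0
    score += 3 if answers.holds_sensitive_data else 0
    # Sensitive data hosted in a region where it should not be adds regulatory exposure.
    if answers.holds_sensitive_data and not answers.region_allowed_for_data:
        score += 2
    if score >= 6:
        return "high"
    if score >= 2:
        return "medium"
    return "low"
```

A label like this does not replace the analyst’s judgment; it simply gives a starting point for deciding who to loop in.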


4. Investigation

Investigating incidents involving virtual machines is simpler in the cloud. It no longer requires PCs to be shipped for imaging or EDR images to be pulled over the network. You can quickly “snapshot” a running virtual machine (VM) to create an identical image of the compromised VM. As best practice dictates, this snapshot can be exported to a security-owned project, where you can forensically investigate while the main VM remains running. The added benefit is that you aren’t tipping off the attackers that you are aware of their compromise, or exposing system administrator level credentials for live on-system response. If the snapshot is required as evidence, you’ll need to follow chain of custody principles, including hashing the snapshot so you can later show it hasn’t been tampered with during the investigation.
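
The hashing step can be sketched with Python’s standard hashlib module. This assumes the snapshot has already been exported to a local file; the function names and the fields in the custody record are hypothetical and should follow whatever your own evidence-handling procedure requires.

```python
import hashlib
from datetime import datetime, timezone


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large disk images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_record(path, examiner):
    """Record who hashed which file, and when, at the time of acquisition."""
    return {
        "file": str(path),
        "sha256": sha256_of_file(path),
        "examiner": examiner,
        "recorded_at_utc": datetime.now(timezone.utc).isoformat(),
    }


def verify_snapshot(path, record):
    """Return True if the file still matches the hash recorded at acquisition."""
    return sha256_of_file(path) == record["sha256"]
```

Create the record as soon as the export finishes, and re-run verify_snapshot before and after each analysis session to demonstrate that the image has not changed.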


This, coupled with the abundance of logs available to you, makes it easier to quietly review compromised assets. The actual analysis of compromised VMs doesn’t differ much from traditional forensics.


There are a number of open source tools available that will allow you to analyze assets like VMs live. One such example is GRR [7], an open source tool that will provide a variety of information from running VMs across your cloud so you can quickly identify other compromised assets.


Note that in this section we’ve only scratched the surface of the technologies that the cloud uses. There are a number of others, such as containers, Kubernetes and microservices, that come with their own set of challenges to investigate. For the sake of brevity, we will save those technologies for a future blog post.


5. Containment

Similar to investigating cloud incidents, containment actions are more straightforward in the cloud. As mentioned previously, the difficult part is getting access. However, by the point of containment, you should already have access or be in contact with the owner who can contain the asset.


Containment of compromised assets can occur in multiple ways, all of which can be completed from the cloud provider’s portal. These include:

  • Pausing the compromised VM, therefore disabling access to it but maintaining valuable forensic artifacts such as active network connections and contents of memory.

  • Shutting/powering down the VM. This option loses valuable forensic artifacts, such as active network connections and contents of memory.

  • Network isolation of the compromised asset. This action is taken by removing the firewall rules associated with the project or limiting them to only your internal network.

  • Revoking compromised credentials such as SSH keys.

  • Completely deleting the compromised asset. This drastic option should be taken only once you have all the forensic copies you need and are sure the original asset is no longer needed.


Containment is typically difficult to do remotely, but as we’ve shown, it’s much easier when everything you need to do is centered in one console. Make sure you know the full extent of the compromise before attempting containment; otherwise you’ll likely tip off the attacker that you are aware of the compromise.


6. Recovery

Within the cloud there are common root causes to most incidents; the good news is these are easy to recover from and contain. Some of the most common causes are discussed below.


6.1 Leaked Keys/Credentials: Credentials and keys are often uploaded to source code repositories, where they are scraped by bots that then access your infrastructure and drop coin miners or other malware. Leaked keys are easy to revoke and rotate, and restoring a VM to a previous (uncompromised) state, or even redeploying it in a fresh state, is simple to do through cloud consoles.
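
As part of recovery, it can help to check what else was committed alongside the leaked key. The sketch below is a minimal illustration that looks only for PEM-style private key headers in text you have already read from a repository; the function name and the filename-to-content mapping are assumptions, and a real scan would need to cover many more credential formats.

```python
import re

# Matches PEM headers such as "-----BEGIN PRIVATE KEY-----",
# "-----BEGIN RSA PRIVATE KEY-----" and "-----BEGIN OPENSSH PRIVATE KEY-----".
PRIVATE_KEY_HEADER = re.compile(r"-----BEGIN (?:[A-Z0-9]+ )*PRIVATE KEY-----")


def find_private_keys(files):
    """Return (filename, line_number) pairs where a private key header appears.

    `files` maps a filename to its text content (an assumed input format).
    """
    findings = []
    for name, text in files.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if PRIVATE_KEY_HEADER.search(line):
                findings.append((name, number))
    return findings
```

Every key found this way should be treated as compromised and revoked, even if there is no evidence yet that it was used.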


6.2 Insecure Firewall Rules: To make things easy for themselves, developers will often set up cloud infrastructure with broadly permissive remote connection rules so that they can access the infrastructure from anywhere (often “any any” rules). Attackers are constantly scanning to identify infrastructure that is accessible to them, which they then attempt to brute force to gain access. They typically succeed, since the username and password combination is often easy for them to guess. Firewall rules are quick and easy to delete and replace with more secure ones. If you have the logs available, setting up alerts for the enabling of “risky” firewall rules will save your incident response team a lot of time and effort.
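
A “risky” rule check can be sketched as follows. The rule schema used here (dicts with "name", "action", "source_ranges" and "ports" keys) is a simplified assumption and not any specific provider’s format, and treating an empty port list as “all ports” is likewise an assumption you should verify against your provider’s semantics.

```python
import ipaddress

RISKY_PORTS = {22, 3389}  # SSH and RDP, the usual remote-access targets


def is_open_to_world(source_range: str) -> bool:
    """True for ranges covering the whole address space: 0.0.0.0/0 or ::/0."""
    return ipaddress.ip_network(source_range, strict=False).prefixlen == 0


def risky_rules(rules):
    """Return names of allow rules exposing a risky port (or all ports) to any source."""
    flagged = []
    for rule in rules:
        if rule.get("action") != "allow":
            continue
        ports = set(rule.get("ports", []))
        # Assumption: an empty port list means the rule applies to every port.
        exposes_risky_port = not ports or bool(ports & RISKY_PORTS)
        open_source = any(is_open_to_world(r) for r in rule.get("source_ranges", []))
        if exposes_risky_port and open_source:
            flagged.append(rule["name"])
    return flagged
```

Run against firewall change logs, a check like this can raise an alert at the moment an “any any” rule is created rather than after it has been brute forced.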


6.3 Vulnerable Software: A significant number of cloud incidents are due to deployed software that has not been updated and has therefore become vulnerable. The easiest way to detect these is to utilize cloud-provided vulnerability scanners. To recover from these sorts of incidents, cloud providers let you quickly deploy fresh assets and apply updates to the vulnerable software.


As with the other phases, we’ve only scratched the surface of recovery. There are many more potential recovery activities that your team may need to understand and be able to perform.


Conclusion

Cloud computing presents an additional level of challenges for incident response teams. We’ve highlighted the need for more technical training to ensure your employees are aware of the risks of deploying within the cloud. This is a great opportunity to increase your security posture and revamp your incident response plans, including how incidents are identified and reported. Other challenges revolve around the fact that cloud instances can be created by anyone, anywhere, and can impact your organization both financially and reputationally if they aren’t implemented securely.


Despite the challenges we’ve highlighted in the article, we’ve also shown the positive impact cloud computing can have on the way you respond to incidents. With the cloud, you can contain and recover instances from a single portal, create forensically sound snapshots, and investigate instances without alerting attackers that you are onto them. But the most important benefit of working in the cloud is the alerts from cloud providers on malicious activity. These often provide early warning of compromises that may otherwise go undetected.


In conclusion, incident response in the cloud has its challenges, but the benefits far outweigh them. If your team is prepared and understands how to avoid the pitfalls, it will adapt easily to working in the cloud and respond to incidents quickly.

