Chapter 14 - Monitoring and Incident Response.pdf

Chapter 14 Monitoring and Incident Response THE COMPTIA SECURITY+ EXAM OBJECTIVES COVERED IN THIS CHAPTER INCLUDE: Domain 2.0: Threats, Vulnerabilities, and Mitigations 2.4. Given a scenario, analyze indicators of malicious activity. Indicators (Account lockout, Concurrent session usage, Blocked content, Impossible travel, Resource consumption, Resource inaccessibility, Out- of-cycle logging, Published/documented, Missing logs) 2.5. Explain the purpose of mitigation techniques used to secure the enterprise. Application allow list Isolation Monitoring Domain 4.0: Security Operations 4.4. Explain security alerting and monitoring concepts and tools. Monitoring computing resources (Systems, Applications, Infrastructure) Activities (Log aggregation, Alerting, Scanning, Reporting, Archiving, Alert response and remediation/validation (Quarantine, Alert tuning)) Tools (Benchmarks, Agents/agentless, Security information and event management (SIEM), NetFlow) 4.8. Explain appropriate incident response activities. Process (Preparation, Detection, Analysis, Containment, Eradication, Recovery, Lessons learned) Training Testing (Tabletop exercise, Simulation) Root cause analysis Threat hunting 4.9. Given a scenario, use data sources to support an investigation. Log data (Firewall logs, Application logs, Endpoint logs, OS-specific security logs, IPS/IDS logs, Network logs, Metadata) Data sources (Vulnerability scans, Automated reports, Dashboards, Packet captures) When things go wrong, organizations need a way to respond to incidents to ensure that their impact is limited and that normal operations can resume as quickly as possible. That means you need to know how to detect and analyze an incident given a series of events or data points, how to contain the incident, and then what to do about it. In this chapter you'll learn about the components of a typical incident response process, including the incident response cycle. Incident response isn't just about how to stop an attacker or remove their tools. It includes preparation and learning processes to ensure that the organizations continuously learn and improve based on the incidents they have resolved. You'll also learn about incident response teams, the types of exercises you can conduct to get ready for incidents, and the incident response plans you may want to have in place. With the basics of incident response under your belt, your next step will be to examine threat detection and incident response data and tools and techniques. You'll explore indicators and data sources like logs that are commonly used to identify whether something has happened and what may have occurred. Along the way, you'll learn about common capabilities and uses for SIEM tools and NetFlow data, as well as the logs and data that SIEM systems ingest and analyze to help incident responders. Incident Response No matter how strong an organization's security protections are, eventually something will go wrong. Whether that involves a direct attack by a malicious actor, malicious software, an insider threat, or even just a simple mistake, a security incident is an eventuality for all organizations. Organizations therefore need an incident response (IR) plan, process, and team, as well as the technology, skills, and training to respond appropriately. A strong incident response process is not just a one-time action or something that is applied only in emergencies. Instead, IR is an ongoing process that improves organizational security using information that is learned from each security incident. Although individual organizations may define them differently, in general an incident is a violation of the organization's policies and procedures or security practices. Events, on the other hand, are an observable occurrence, which means that there are many events, few of which are likely to be incidents. These definitions can become confusing, because IT service management standards define incidents differently, which means that some organizations specify security incidents to keep things straight. The Incident Response Process The first step toward a mature incident response capability for most organizations is to understand the incident response process and what happens at each stage. Although organizations may use slightly different labels or steps and the number of steps may vary, the basic concepts remain the same. Organizations must prepare for incidents, identify incidents when they occur, and then contain and remove the artifacts of the incidents. Once the incident has been contained, the organization can work to recover and return to normal, and then make sure that the lessons learned from the incident are baked into the preparation for the next time something occurs. Figure 14.1 shows the six steps that the Security+ exam outline describes for the incident response process. FIGURE 14.1 The incident response cycle The six steps you will need to know for the Security+ exam are as follows: 1. Preparation. In this phase, you build the tools, processes, and procedures to respond to an incident. That includes building and training an incident response team, conducting exercises, documenting what you will do and how you will respond, and acquiring, configuring, and operating security tools and incident response capabilities. 2. Detection. This phase involves reviewing events to identify incidents. You must pay attention to indicators of compromise, use log analysis and security monitoring capabilities, and have a comprehensive awareness and reporting program for your staff. 3. Analysis. Once an event has been identified as potentially being part of an incident, it needs to be analyzed. That includes identifying other related events and what their target or impact is or was. 4. Containment. Once an incident has been identified, the incident response team needs to contain it to prevent further issues or damage. Containment can be challenging and may not be complete if elements of the incident are not identified in the initial identification efforts. This can involve quarantine, which places a system or device in an isolated network zone or removes it from a network to ensure that it cannot impact other devices. 5. Eradication. The eradication stage involves removing the artifacts associated with the incident. In many cases, that will involve rebuilding or restoring systems and applications from backups rather than simply removing tools from a system since proving that a system has been fully cleaned can be very difficult. Complete eradication and verification is crucial to ensuring that an incident is over. 6. Recovery. Restoration to normal is the heart of the recovery phase. That may mean bringing systems or services back online or other actions that are part of a return to operations. Recovery requires eradication to be successful, but it also involves implementing fixes to ensure that whatever security weakness, flaw, or action that allowed the incident to occur has been remediated to prevent the event from immediately reoccurring. In addition to these six steps, organizations typically conduct a lessons learned session. These sessions are important to ensure that organizations improve and do not make the same mistakes again. They may be as simple as patching systems or as complex as needing to redesign permission structures and operational procedures. Lessons learned are then used to inform the preparation process, and the cycle continues. Although this list may make it appear as if incidents always proceed in a linear fashion from item to item, many incidents will move back and forth between stages as additional discoveries are made or as additional actions are taken by malicious actors. So, you need to remain nimble and understand that you may not be in the phase you think you are, or that you need to operate in multiple phases at once as you deal with components of an incident—or multiple incidents at once! Preparing for Incident Response The next step after understanding and defining an organization's IR process is to determine who will be on the organization's IR team, who will be in charge of the IR process, and who will lead the IR team. Next, plans are built, and then the plans are tested via exercises. Incident Response Team Building an IR team involves finding the right members for the team. Typical teams often include the following: A member of management or organizational leadership. This individual will be responsible for making decisions for the team and will act as a primary conduit to senior management for the organization. Ideally, teams should have a leader with enough seniority to make decisions for the organization in an emergency. Information security staff members are likely to make up the core of the team and will bring the specialized IR and analysis skills needed for the process. Since containment often requires immediate action using security tools like firewalls, intrusion prevention systems, and other security tools, the information security team can also help speed up the IR process. The team will need technical experts such as systems administrators, developers, or others from disciplines throughout the organization. The composition of the IR team may vary depending on the nature of the incident, and not all technical experts may be pulled in for every incident. Knowing the systems, software, and architecture can make a huge difference in the IR process, and familiarity can also help responders find unexpected artifacts that might be missed by someone who does not work with a specific system every day. Communications and public relations staff are important to help make sure that internal and external communications are handled well. Poor communications—or worse, no communications—can make incidents worse or severely damage an organization's reputation. Legal and human resources (HR) staff may be involved in some, but not all, incidents. Legal counsel can advise on legal issues, contracts, and similar matters. HR may be needed if staff were involved, particularly if the incident involves an insider or is an HR-related investigation. Law enforcement is sometimes added to a team, but in most cases only when specific issues or attacks require their involvement. Regardless of the specific composition of your organization's team, you will also need to ensure that team members have proper training. That may mean IR training for security professionals and technical staff, or it could include exercises and practice for the entire team as a group to ensure that they are ready to work together. Exercises There are two major types of exercises that incident response teams use to prepare included in the Security+ exam outline: Tabletop exercises are used to talk through processes. Team members are given a scenario and are asked questions about how they would respond, what issues might arise, and what they would need to do to accomplish the tasks they are assigned in the IR plan. Tabletop exercises can resemble a brainstorming session as team members think through a scenario and document improvements in their responses and the overall IR plan. Simulations can include a variety of types of events. Exercises may simulate individual functions or elements of the plan, or only target specific parts of an organization. They can also be done at full scale, involving the entire organization in the exercise. It is important to plan and execute simulations in a way that ensures that all participants know that they are engaged in an exercise so that no actions are taken outside of the exercise environment. When you conduct an exercise, start every call, text, or email with “This is an exercise” or a similar cue to let the person who is responding know that they should not take actual action. Of course, doing so can lead to biases and incorrect estimates on what effort or time would be required to perform an action or response. In most cases, keeping exercises properly under control is more important than detailed testing. In those cases where specific performance is needed, you may want to ensure that the person has a script or can perform a task that is scoped and limited to the needs of the simulation without causing problems or issues with normal operations. Building Incident Response Plans Incident response plans can include several subplans to handle various stages of the response process. Your organization may choose to combine them all into a single larger document or may break them out to allow the response team to select the components that they need. Individual plans may also be managed or run by different teams. Regardless of the structure of your response plans, they need to be regularly reviewed and tested. A plan that is out of date, or that the team is not familiar with, can be just as much of a problem as not having a plan at all. Several subplans include: Communication plans are critical to incident response processes. A lack of communication, incorrect communication, or just poor communication can cause significant issues for an organization and its ability to conduct business. At the same time, problematic communications can also make incidents worse, as individuals may not know what is going on or may take undesired actions, thinking they are doing the right thing due to a lack of information or with bad or partial information available to them. Because of the importance of getting communication right, communication plans may also need to list roles, such as who should communicate with the press or media, who will handle specific stakeholders, and who makes the final call on the tone or content of the communications. Stakeholder management plans are related to communication plans and focus on groups and individuals who have an interest or role in the systems, organizations, or services that are impacted by an incident. Stakeholders can be internal or external to an organization and may have different roles and expectations that need to be called out and addressed in the stakeholder management plan. Many stakeholder management plans will help with prioritization of which stakeholders will receive communications, what support they may need, and how they will be provided with options to offer input or otherwise interact with the IR process, communications and support staff, or others involved in the response process. Business continuity (BC) plans focus on keeping an organization functional when misfortune or incidents occur. In the context of IR processes, BC plans may be used to ensure that systems or services that are impacted by an incident can continue to function despite any changes required by the IR process. That might involve ways to restore or offload the services or use of alternate systems. BC plans have a significant role to play for larger incidents, whereas smaller incidents may not impact an organization's ability to conduct business in a significant way. Disaster recovery (DR) plans define the processes and procedures that an organization will take when a disaster occurs. Unlike a BC plan, a DR plan focuses on natural and human-made disasters that may destroy facilities or infrastructure, or otherwise prevent an organization from functioning normally. A DR plan focuses on restoration or continuation of services despite a disaster. Policies Organizations define policies as formal statements about organizational intent. In short, they explain why an organization wishes to operate in a certain way, and they define things like the purpose or objective of an activity or program. Incident response policies are commonly defined as part of building an IR capability. Well-written incident response policies will include important components of the IR process. They will identify the team and the authority that the team operates under. They will also require the creation and maintenance of incident handling and response procedures and practices, and they may define the overall IR process used by the organization. In some cases, they may also have specific communication or compliance requirements that are included in the overall policy based on organizational needs. It helps to bear in mind that a policy is a high-level statement of management intent that is used to convey the organization's expectations and direction for a topic. Standards will then point to a policy for their authority, while providing specific guidance about what should be done. Procedures are then used to implement standards or to guide how a task is done. Policies tend to be slow to change, whereas standards change more frequently, and procedures and guidelines may be updated frequently to handle organizational needs or technology change, or for other business-related reasons. An IR policy isn't the only policy that your organization may rely on to have a complete incident response capability. In fact, organizations often have many IT policies that can impact response. Training Appropriate and regular training is required for incident responders to be ready to handle incidents of all types. Organizations often invest in training for their staff including incident response certifications. In addition, organizations like the Cybersecurity & Infrastructure Security Agency (CISA) offer training for incident response covering a broad range of topics like preventing attacks, understanding indicators of compromise, and managing logs. You can read more about CISA's free training at www.cisa.gov/resources-tools/programs/Incident-Response-Training. Threat Hunting Threat hunting helps organizations achieve the detection and analysis phases of the incident response process. Threat hunters look for indicators of compromise (IoCs) that are commonly associated with malicious actors and incidents. The Security+ exam outline specifically notes a handful of these indicators: Account lockout, which is often due to brute-force login attempts or incorrect passwords used by attackers. Concurrent session usage when users aren't likely to use concurrent sessions. If a user is connected from more than one system or device, particularly when the second device is in an unexpected or uncommon location or the application is one that isn't typically used on multiple devices at once, this can be a strong indicator that something is not right. Blocked content is content that the organization has blocked, often via a DNS filter or other tool that prohibits domains, IP addresses, or types of content from being viewed or accessed. If this occurs, it may be because a malicious actor or malware is attempting to access the resource. Impossible travel, which involves a user connecting from two locations that are far enough apart that the time between the connections makes the travel impossible to have occurred, typically indicates that someone else has access to the user's credentials or devices. Resource consumption like filling up a disk or using more bandwidth than usual for uploads or downloads, can be an indicator of compromise. Unlike some of the other IoCs here, this one often requires other actions to become concerning unless it is much higher than usual. Resource inaccessibility can indicate that something unexpected is happening. If a resource like a system, file, or service isn't available identifying the underlying cause and ensuring that the cause isn't malicious, can be important. Out-of-cycle logging occurs when an event that happens at the same time or on a set cycle occurs at an unusual time. This might be a worker logging in at 2 a.m. who normally works 9–5, or a cleanup process that gets activated when it normally runs once a week. Missing logs may indicate that an attacker has wiped the logs to attempt to hide their actions. This is one reason that many organizations centralize their log collection so that a protected system will retain logs even if they are wiped on a server or workstation. There are many other types of indicators—in fact, any behavior that an attacker may perform that can be observed could be an indicator. Indicators are often analyzed together as part of the detection and analysis phases of the incident response process. Exam Note The Security+ exam outline includes one other indicator type: published/documented. That type describes indicators that have been discovered and published or documented. Descriptions of IoCs are commonly distributed via threat feeds as well as through information sharing organizations. As you prepare for the exam, you should also make sure you know the incident response process as well as the list of common indicators. Be ready to analyze an indicator through a log entry or a scenario to determine what may have happened and if an incident should be declared. Understanding Attacks and Incidents Incident responders frequently need ways to describe attacks and incidents using common language and terminology. Attack frameworks are used to understand adversaries, document techniques, and categorize tactics. As you review the ATT&CK framework, consider how you would apply it as part of an incident response process. For example, if you find an attack tool as part of an incident response effort, what would considering that tool via the ATT&CK framework do? What information might you seek next, and why? MITRE ATT&CK MITRE provides the ATT&CK, or Adversarial Tactics, Techniques, and Common Knowledge knowledgebase of adversary tactics and techniques. The ATT&CK matrices includes detailed descriptions, definitions, and examples for the complete threat life cycle from reconnaissance through execution, persistence, privilege escalation, and impact. At each level, it lists techniques and components, allowing threat assessment modeling to leverage common descriptions and knowledge. ATT&CK matrices include pre-attack, enterprise matrices focusing on Windows, macOS, Linux, and cloud computing, as well as iOS and Android mobile platforms. It also includes details of data sources, threat actor groups, software, and a host of other useful details. All of this adds up to make ATT&CK the most comprehensive freely available database of adversary techniques, tactics, and related information that the authors of this book are aware of. Figure 14.2 shows an example of an ATT&CK technique definition for attacks against cloud instances via their metadata APIs. It provides an ID number, as well as classification details like the tactic, platforms it applies to, what user permissions are required, the data sources it applies to, who contributed it, and the revision level of the specific technique. FIGURE 14.2 MITRE's ATT&CK framework example of attacks against cloud instances The ATT&CK framework is the most popular of the three models discussed here and has broad support in a variety of security tools, which means that analysts are most likely to find ATT&CK-related concepts, labels, and tools in their organizations. You can find the full ATT&CK website at http://attack.mitre.org. In addition to the ATT&CK framework, the Diamond Model and Lockheed Martin's Cyber Kill Chain are sometimes used by organizations. You find details of them at https://apps.dtic.mil/sti/pdfs/ADA586960.pdf and www.lockheedmartin.com/content/dam/lockheed- martin/rms/documents/cyber/Gaining_the_Advantage_Cyber_Kill_Chain.pdf Incident Response Data and Tools Incident responders rely on a wide range of data for their efforts. As a security professional, you need to be aware of the types of data you may need to conduct an investigation and to determine both what occurred and how to prevent it from happening again. Monitoring Computing Resources The Security+ exam outline specifically notes three types of monitoring that test takers should be familiar with: systems, applications, and infrastructure. System monitoring is typically done via system logs as well as through central management tools, including those found in cloud services. System health and performance information may be aggregated and analyzed through those management tools in addition to being gathered at central logging servers or services. Application monitoring may involve application logs, application management interfaces, and performance monitoring tools. This can vary significantly based on what the application provides, meaning that each application and application environment will need to be analyzed and designed to support monitoring. Infrastructure devices can also generate logs. SNMP and syslog are both commonly used for infrastructure devices. In addition, hardware vendors often sell management tools and systems that are used to monitor and control infrastructure systems and devices. This complex set of devices that each generate their own logs and have different log levels and events that may be important drives the importance of devices like security information and event management (SIEM) devices, which have profiles for each type of device or service and which can correlate and alert on activity based on rules and heuristic analysis. Security Information and Event Management Systems In many organizations, the central security monitoring tool is a security information and event management (SIEM) tool. SIEM devices and software have broad security capabilities, which are typically based on the ability to collect and aggregate log data from a variety of sources and then to perform correlation and analysis activities with that data. This means that organizations will send data inputs—including logs and other useful information from systems, network security devices, network infrastructure, and many other sources—to a SIEM for it to ingest, compare to the other data it has, and then to apply rules, analytical techniques, and machine learning or artificial intelligence to the data. SIEM systems may include the ability to review and alert on user behavior or to perform sentiment analysis, a process by which they look at text using natural language processing and other text analysis tools to determine emotions from textual data. Another data input for SIEM devices is packet capture. The ability to capture and analyze raw packet data from network traffic, or to receive packet captures from other data sources, can be useful for incident analysis, particularly when specific information is needed about a network event. Correlating raw packet data with IDS or IPS events, firewall and WAF logs, and other security events provides a powerful tool for security practitioners. You may also encounter terms like SIM (security information management) or SEM (security event management). As the market has matured and converged, SIEM has become the most common term, but some tools may still be described as SIM or SEM due to a narrower focus or specialized capabilities. SIEM devices also provide alerting, reporting, and response capabilities, allowing organizations to see when an issue needs to be addressed and to track the response to that issue through its life cycle. This may include forensic capabilities, or it may be more focused on a ticketing and workflow process to handle issues and events. SIEM Dashboards The first part of a SIEM that many security practitioners see is a dashboard like the AlienVault SIEM dashboard shown in Figure 14.3. Dashboards can be configured to show the information considered most useful and critical to an organization or to the individual analyst, and multiple dashboards can be configured to show specific views and information. The key to dashboards is understanding that they provide a high-level, visual representation of the information they contain. That helps security analysts quickly identify likely problems, abnormal patterns, and new trends that may be of interest or concern. SIEM dashboards have a number of important components that provide elements of their display. These include sensors that gather and send information to the SIEM, trending and alerting capabilities, correlation engines and rules, and methods to set sensitivity and levels. Sensors Although devices can send data directly to a SIEM, sensors are often deployed to gather additional data. Sensors are typically software agents, although they can be a virtual machine or even a dedicated device. Sensors are often placed in environments like a cloud infrastructure, a remote datacenter, or other locations where volumes of unique data are being generated, or where a specialized device is needed because data acquisition needs are not being met by existing capabilities. Sensors gather useful data for the SIEM and may either forward it in its original form or do some preprocessing to optimize the data before the SIEM ingests it. Choosing where to deploy sensors is part of network and security architecture and design efforts, and sensors must be secured and protected from attack and compromise just like other network security components. FIGURE 14.3 The AlienVault SIEM default dashboard Sensitivity and Thresholds Organizations can create a massive amount of data, and security data is no exception to that rule. Analysts need to understand how to control and limit the alerts that a SIEM can generate. To do that, they set thresholds, filter rules, and use other methods of managing the sensitivity of the SIEM. Alerts may be set to activate only when an event has happened a certain number of times, or when it impacts specific high-value systems. Or, an alert may be set to activate once instead of hundreds or thousands of times. Regardless of how your SIEM handles sensitivity and thresholds, configuring and managing them so that alerts are sent only on items that need to be alerted on helps avoid alert fatigue and false positives. Trends The ability to view trend information is a valuable part of a SIEM platform's capabilities. A trend can point to a new problem that is starting to crop up, an exploit that is occurring and taking over, or simply which malware is most prevalent in your organization. In Figure 14.4, you can see an example of categorizing malware activity, identifying which signatures have been detected most frequently, which malware family is most prevalent, and where it sends traffic to. This can help organizations identify new threats as they rise to the top. FIGURE 14.4 Trend analysis via a SIEM dashboard Alerts and Alarms Alerts and alarms are an important part of SIEM systems. Figure 14.5 shows an example from AlienVault's demonstration system. Note that the alarms are categorized by their time and severity, and then provide detailed information that can be drilled down into. Events like malware beaconing and infection are automatically categorized, prioritized, marked by source and destination, and matched to an investigation by an analyst as appropriate. They also show things like which sensor is reporting the issue. An important activity for security professionals is alert tuning, the process of modifying alerts to only alarm on important events. Alerts that are not properly tuned can cause additional work, often alerting responders over and over until they begin to be ignored. Properly tuning alerts is a key part of using alerts and alarms so that responders know that the events they will be notified about are worth responding to. Alert tuning often involves setting thresholds, removing noise by identifying false alarms and normal behaviors, and ensuring that tuning is not overly broad so that it ignores actual issues and malicious activity. One of the biggest threats to SIEM deployments is alert fatigue. Alert fatigue occurs when alerts are sent so often, for so many events, that analysts stop responding to them. In most cases, these alerts aren't critical, high urgency, or high impact and are in essence just creating noise. Or, there may be a very high proportion of false positives, causing the analyst to spend hours chasing ghosts. In either case, alert fatigue means that when an actual event occurs it may be missed or simply disregarded, resulting in a much worse security incident than if analysts had been ready and willing to handle it sooner. That's why alert tuning is so important. FIGURE 14.5 Alerts and alarms in the AlienVault SIEM Log Aggregation, Correlation, and Analysis Individual data points can be useful when investigating an incident, but matching data points to other data points is a key part of most investigations. Correlation requires having data such as the time that an event occurred, what system or systems it occurred on, what user accounts were involved, and other details that can help with the analysis process. A SIEM can allow you to search and filter data based on multiple data points like these to narrow down the information related to an incident. Automated correlation and analysis is designed to match known events and indicators of compromise to build a complete dataset for an incident or event that can then be reviewed and analyzed. As you can see in Figure 14.5 from the AlienVault SIEM, you can add tags and investigations to data. Although each SIEM tool may refer to these by slightly different terms, the basic concepts and capabilities remain the same. Log aggregation isn't only done with SIEM devices or services, however. Centralized logging tools like syslog-ng, rsyslog, and similar tools provide the ability to centralize logs and perform analysis of the log data. As with many security tools, many solutions exist to gather and analyze logs, with the lines blurring between various solutions like SIEM, SOAR (Security Orchestration, Automation, and Response systems), and other security response and analysis tools. Rules The heart of alarms, alerts, and correlation engines for a SIEM is the set of rules that drive those components. Figure 14.6 shows an example of how an alarm rule can be built using information the SIEM gathers. Rule conditions can use logic to determine if and when a rule will be activated, and then actions can trigger based on the rule. Results may be as simple as an alert or as complex as a programmatic action that changes infrastructure, enables or disables firewall rules, or triggers other defenses. Rules are important but can also cause issues. Poorly constructed rule logic may miss events or cause false positives or overly broad detections. If the rule has an active response component, a mis-triggered rule can cause an outage or other infrastructure issue. Thus, rules need to be carefully built, tested, and reviewed on a regular basis. Although SIEM vendors often provide default rules and detection capabilities, the custom-built rules that organizations design for their environments and systems are key to a successful SIEM deployment. Finally, SIEM devices also follow the entire life cycle for data. That means most have the ability to set retention and data lifespan for each type of data and have support for compliance requirements. In fact, most SIEM devices have prebuilt rules or modules designed to meet specific compliance requirements based on the standards they require. SIEM devices typically have built-in integrations for cloud services like Google, ServiceNow, Office 365, Okta, Sophos, and others. That means you can import data directly from those services to get a better view of your security environment. Log Files Log files provide incident responders with information about what has occurred. Of course, that makes log files a target for attackers as well, so incident responders need to make sure that the logs they are using have not been tampered with and that they have time stamp and other data that is correct. Once you're sure the data you are working with is good, logs can provide a treasure trove of incident-related information. Figure 14.7 shows the Windows Event Viewer, one of the most common ways to view logs for a single Windows system. In many enterprise environments, specific logs or critical log entries will be sent to a secure logging infrastructure to ensure a trustworthy replica of the logs collected at endpoint systems exists. Security practitioners will still review local logs, particularly because the volume of log data at endpoints throughout an organization means that complete copies of all logs for every system are not typically maintained. Common logs used by incident responders that are covered in the Security+ exam outline include the following: Firewall logs, which can provide information about blocked and allowed traffic, and with more advanced firewalls like NGFW or UTM, devices can also provide application-layer details or IDS/IPS functionality along with other security service– related log information. FIGURE 14.6 Rule configuration in AlienVault Application logs for Windows include information like installer information for applications, errors generated by applications, license checks, and any other logs that applications generate and send to the application log. Web servers and other devices also generate logs like those from Apache and Internet Information Services (IIS), which track requests to the web server and related events. These logs can help track what was accessed, when it was accessed, and what IP address sent the request. Since requests are logged, these logs can also help identify attacks, including SQL injection (SQLi) and other web server and web application–specific attacks. Endpoint logs such as application installation logs, system and service logs, and any other logs available from endpoint systems and devices. OS-specific security logs for Windows systems store information about failed and successful logins, as well as other authentication log information. Authentication and security logs for Linux systems are stored in /var/log/auth.log and /var/log/secure. FIGURE 14.7 The Windows Event Viewer showing a security log with an audit event Remember that Windows logs can be viewed and exported using the Event Viewer and that Linux log locations can vary based on the distribution that you're using. For Linux systems, /var/log/ is usually a good starting place, but Debian and Ubuntu store lots of useful syslog messages in /var/log/syslog, whereas Red Hat will put the same messages into /var/log/messages. IDS/IPS logs provide insight into attack traffic that was detected or, in the case of IPS, blocked. Network logs can include logs for routers and switches with configuration changes, traffic information, network flows, and data captured by packet analyzers like Wireshark. Security practitioners will use SIEM tools as well as manual search tools like grep and tail to review logs for specific log entries that may be relevant to an event or incident. Lists of important Windows event IDs are commonly available, and many Linux log entries can be easily identified by the text they contain. Going With the Flow Tracking your bandwidth utilization using a bandwidth monitor can provide trend information that can help spot both current problems and new behaviors. Network flows, either using Cisco's proprietary NetFlow protocol, which is a software-driven capability, or sFlow, which is broadly implemented on devices from many vendors, are an important tool in an incident responder's toolkit. In addition to NetFlow and sFlow, you may encounter IPFIX, an open standard based on NetFlow v9 that many vendors support. The hardware deployed in your environment is likely to drive the decision about which to use, with each option having advantages and disadvantages. Network flows are incredibly helpful when you are attempting to determine what traffic was sent on your network, where it went, or where it came from. Flows contain information such as the source and destination of traffic, how much traffic was sent, and when the traffic occurred. You can think of flow information like phone records—you know what number was called and how long the conversation took, but not what was said. Thus, although flows like those shown in the following graphic are useful hints, they may not contain all the information about an event. Flows may not show all the traffic for another reason, too: keeping track of high- volume traffic flows can consume a large amount of network device processing power and storage, and thus many flows are sampled at rates like 10:1 or even 1000:1. That means flows may not capture all traffic, and you may lose some resolution and detail in your flow analysis. Even though flows may only show part of the picture, they are a very useful diagnostic and incident response tool. If you're tasked with providing network security for an organization, you may want to consider setting up flows as part of your instrumentation efforts. FIGURE 14.8 The Windows Event Viewer showing an application log event Logging Protocols and Tools In addition to knowing how to find and search through logs, you need to know how logs are sent to remote systems, what tools are used to collect and manage logs, and how they are acquired. Traditional Linux logs are sent via syslog, with clients sending messages to servers that collect and store the logs. Over time, other syslog replacements have been created to improve upon the basic functionality and capabilities of syslog. When speed is necessary, the rocket-fast system for log processing, or rsyslog, is an option. It supports extremely high message rates, secure logging via TLS, and TCP-based messages as well as multiple backend database options. Another alternative is syslog-ng, which provides enhanced filtering, direct logging to databases, and support for sending logs via TCP protected by TLS. The enhanced features of syslog replacements like rsyslog and syslog- ng mean that many organizations replace their syslog infrastructure with one of these options. A final option for log collection is NXLog, an open source and commercially supported syslog centralization and aggregation tool that can parse and generate log files in many common formats while also sending logs to analysis tools and SIEM solutions. Digging Into systemd's Journal in Linux Most Linux distributions rely on systemd to manage services and processes and, in general, manage the system itself. Accessing the systemd journal that records what systemd is doing using the journald daemon can be accomplished using journalctl. This tool allows you to review kernel, services, and initrd messages as well as many others that systemd generates. Simply issuing the journalctl command will display all the journal entries, but additional modes can be useful. If you need to see what happened since the last boot, the -b flag will show only those entries. Filtering by time can be accomplished with the --since flag and a time/date entry in the format “year-month-day hour:minute:seconds”. Regardless of the logging system you use, you will have to make decisions about retention on both local systems and central logging and monitoring infrastructure. Take into account operational needs; likely scenarios where you may need the logs you collect; and legal, compliance, or other requirements that you need to meet. In many cases organizations choose to keep logs for 30, 45, 90, or 180 days depending on their needs, but some cases may even result in some logs being kept for a year or more. Retention comes with both hardware costs and potential legal challenges if you retain logs that you may not wish to disclose in court. Exam Note The Security+ exam outline includes a large number of types of logging systems, logs, analysis tools, and other data sources. You should focus on thinking about why you might need each of them. Although you don't have to master each of these log types, if one is completely unfamiliar to you, you may want to learn more about it so that you can read it and understand it if it shows up on the exam. Going Beyond Logs: Using Metadata Log entries aren't the only useful data that systems contain. Metadata generated as a normal part of system operations, communications, and other activities can also be used for incident response. Metadata is data about other data—in the case of systems and services, metadata is created as part of files, embedded in documents, used to define structured data, and included in transactions and network communications, among many other places you can find it. While the Security+ exam outline simply mentions metadata as a broad category, it helps to think of metadata related to various data types. Four common examples of metadata are: Email metadata includes headers and other information found in an email. Email headers provide details about the sender, the recipient, the date and time the message was sent, whether the email had an attachment, which systems the email traveled through, and other header markup that systems may have added, including antispam and other information. Mobile metadata is collected by phones and other mobile devices as they are used. It can include call logs, SMS and other message data, data usage, GPS location tracking, cellular tower information, and other details found in call data records. Mobile metadata is incredibly powerful because of the amount of geospatial information that is recorded about where the phone is at any point during each day. Web metadata is embedded into websites as part of the code of the website but is often invisible to everyday users. It can include metatags, headers, cookies, and other information that help with search engine optimization, website functionality, advertising, and tracking, or that may support specific functionality. File metadata can be a powerful tool when reviewing when a file was created, how it was created, if and when it was modified, who modified it, the GPS location of the device that created it, and many other details. The following code shows selected metadata recovered from a single photo using ExifTool (http://exiftool.org). The output shows that the photo was taken with a digital camera, which inserted metadata such as the date the photo was taken, the specific camera that took it, the camera's settings, and even the firmware version. Mobile devices may also include the GPS location of the photo if they are not set to remove that information from photos, resulting in even more information leakage. File Size : 2.0 MB File Modification Date/Time : 2009:11:28 14:36:02-05:00 Make : Canon Camera Model Name : Canon PowerShot A610 Orientation : Horizontal (normal) X Resolution : 180 Y Resolution : 180 Resolution Unit : inches Modify Date : 2009:08:22 14:52:16 Exposure Time : 1/400 F Number : 4.0 Date/Time Original : 2009:08:22 14:52:16 Create Date : 2009:08:22 14:52:16 Flash : Off, Did not fire Canon Firmware Version : Firmware Version 1.00 Metadata is commonly used for forensic and other investigations, and most forensic tools have built-in metadata-viewing capabilities. Other Data Sources In addition to system and service logs, other data sources can also be used, either through SIEM and other log and event management systems or manually. They can be acquired using agents, special-purpose software deployed to systems and devices that send the logs to a log aggregator or management system, or they can be agentless and simply send the logs via standardized log interfaces like syslog. The Security+ exam outline specifically points to vulnerability scans that provide information about scanning activities and packet captures, which can be used to review network traffic as part of incident response or troubleshooting activities. Other useful data can be found in automated reports from various systems and services and dashboards that are available via management tools and administrative control panels. Benchmarks and Logging A key tool included in the Security+ exam outline as part of alerting and monitoring is the use of benchmarks. You're already familiar with the concept of using benchmarks to configure systems to a known standard security configuration, but benchmarks also often include log settings. That means that a well-constructed benchmark might require central logging, configuring log and alerting levels, and that endpoints or servers log critical and important events. As you consider how an organization manages systems, services, and devices at scale, benchmarks are a useful means of ensuring that each of them is configured to log important information in useful ways to support security operations. Reporting and Archiving Once you've gathered logs, two key actions remain: reporting and archiving. Reporting on log information is part of the overall log management process, including identifying trends and providing visibility into changes in the logs that may indicate issues or require management oversight. Finally, it is important that organizations consider the full lifespan of their log data. That includes setting data retention life cycles and archiving logs when they must be retained but are not in active use. This helps to make sure space is available in SIEM and other devices and also keeps the logs to a manageable size for analysis. Organizations often pick a time frame like 30, 60, 90, or 180 days for log retention before archiving or deletion. Exam Note As you consider monitoring, SIEM, log aggregation, and log and event analysis, make sure you understand why log aggregation is important, and how SIEM and NetFlow play into understanding what an organization is doing. Be prepared to explain alerts, reporting, archiving, and how logs and analysis are used throughout the entire incident response process. Mitigation and Recovery An active incident can cause disruptions throughout an organization. The organization must act to mitigate the incident and then work to recover from it without creating new risks or vulnerabilities. At the same time, the organization may want to preserve incident data and artifacts to allow forensic analysis by internal responders or law enforcement. Exam Note The Security+ exam focuses on mitigation efforts and does not delve into recovery. As you read this section of the chapter, remember the incident response flow from the beginning of the chapter and think about how you would support recovery and incident response goals as you mitigate the incident. But remember that the focus of the exam will be on how to stop the incident and secure systems, not on how to bring them back to normal. Security Orchestration, Automation, and Response (SOAR) Managing multiple security technologies can be challenging, and using the information from those platforms and systems to determine your organization's security posture and status requires integrating different data sources. At the same time, managing security operations and remediating issues you identify is also an important part of security work. SOAR platforms seek to address these needs. As a mitigation and recovery tool, SOAR platforms allow you to quickly assess the attack surface of an organization, the state of systems, and where issues may exist. They also allow automation of remediation and restoration workflows. Containment, Mitigation, and Recovery Techniques In many cases, one of the first mitigation techniques will be to quickly block the cause of the incident on the impacted systems or devices. That means you may need to reconfigure endpoint security solutions: Application allow lists (sometimes referred to as whitelisting) list the applications and files that are allowed to be on a system and prevent anything that is not on the list from being installed or run. Application deny lists or block lists (sometimes referred to as blacklists) list applications or files that are not allowed on a system and will prevent them from being installed or copied to the system. Isolation or quarantine solutions can place files in a specific safe zone. Antimalware and antivirus often provide an option to quarantine suspect or infected files rather than deleting them, which can help with investigations. Monitoring is a key part of containment and mitigation efforts because security professionals and system administrators need to validate their efforts. Monitoring a system, service, or device can provide information about whether there are still issues or the device remains compromised. Monitoring can also show other actions taken by attackers after remediation is completed, helping responders identify the rest of the attacker's compromised resources. Exam Note As you prepare for the exam, you should pay particular attention to the concepts of segmentation, access control through ACLs and permissions, application allow lists, and isolation. Quarantine or Delete? One of the authors of this book dealt with a major issue caused by an antivirus update that incorrectly identified all Microsoft Office files as malware. That change resulted in thousands of machines taking their default action on those files. Fortunately, most of the organization used a quarantine, and then deleted settings for the antivirus product. One division, however, had set their systems to delete as the primary action. Every Office file on those systems was deleted within minutes of the update being deployed to them, causing chaos as staff tried to access their files. Although most of the files were eventually restored, some were lost as systems overwrote the deleted files with other information. This isn't a typical scenario, but understanding the settings you are using and the situations where they may apply is critical. Quarantine can be a great way to ensure that you still have access to the files, but it does run the danger of allowing the malicious files to still be on the system, even if they should be in a safe location. Configuration changes are also a common remediation and containment technique. They may be required to address a security vulnerability that allowed the incident to occur, or they may be needed to isolate a system or network. In fact, configuration changes are one of the most frequently used tools in containment and remediation efforts. They need to be carefully tracked and recorded, since responders can still make mistakes, and changes may have to be rolled back after the incident response process to allow a return to normal function. Common examples of remediation actions include: Firewall rule changes, either to add new firewall rules, modify existing firewall rules, or in some cases, to remove firewall rules. Mobile device management (MDM) changes, including applying new policies or changing policies; responding by remotely wiping devices; locating devices; or using other MDM capabilities to assist in the IR process. Data loss prevention (DLP) tool changes, which may focus on preventing data from leaving the organization or detecting new types or classifications of data from being sent or shared. DLP changes are likely to be reactive in most IR processes, but DLP can be used to help ensure that an ongoing incident has a lower chance of creating more data exposure. Content filter and URL filtering capabilities, which can be used to ensure that specific sites are not able to be browsed or accessed. Content filter and URL filtering can help prevent malware from phoning home or connecting to C2 sites, and it can also prevent users from responding to phishing attacks and similar threats. Updating or revoking certificates, which may be required if the certificates were compromised, particularly if attackers had access to the private keys for the certificates. At the same time, removing certificates from trust lists can also be a useful tool, particularly if an upstream service provider is not responding promptly and there are security concerns with their services or systems. Of course, there are many other configuration changes that you may need to make. When you're faced with an incident response scenario, you should consider what was targeted; how it was targeted; what the impact was; and what controls, configuration changes, and tools you can apply to first contain and then remediate the issue. It is important to bear in mind the operational impact and additional risks that the changes you are considering may result in, and to ensure that stakeholders are made aware of the changes or are involved in the decision, depending on the urgency of the situation. At times, broader action may also be necessary. Removing systems, devices, or even entire network segments or zones may be required to stop further spread of an incident or when the source of the incident cannot be quickly identified. The following techniques support this type of activity: Isolation moves a system into a protected space or network where it can be kept away from other systems. Isolation can be as simple as removing a system from the network or as technically complex as moving it to an isolation VLAN, or in the case of virtual machines or cloud infrastructure, it may require moving the system to an environment with security rules that will keep it isolated while allowing inspection and investigation. Containment leaves the system in place but works to prevent further malicious actions or attacks. Network-level containment is frequently accomplished using firewall rules or similar capabilities to limit the traffic that the system can send or receive. System and application-level containment can be more difficult without shutting down the system or interfering with the functionality and state of the system, which can have an impact on forensic data. Therefore, the decisions you make about containment actions can have an impact on your future investigative work. Incident responders may have different goals than forensic analysts, and organizations may have to make quick choices about whether rapid response or forensic data is more important in some situations. Segmentation is often employed before an incident occurs to place systems with different functions or data security levels in different zones or segments of a network. Segmentation can also be done in virtual and cloud environments. In essence, segmentation is the process of using security, network, or physical machine boundaries to build separation between environments, systems, networks, or other components. Incident responders may choose to use segmentation techniques as part of a response process to move groups of systems or services so that they can focus on other areas. You might choose to segment infected systems away from the rest of your network or to move crucial systems to a more protected segment to help protect them during an active incident. Root Cause Analysis Once you've mitigated issues and are on the path to recovery, organizations typically perform a root cause analysis (RCA). This process focuses on identifying the underlying cause for an issue or compromise, identifying how to fix the problems that allowed the event or incident to occur, and ensuring that any systemic issues that led to the problem are also addressed. Common techniques used in RCA efforts include: Five whys, which asks why multiple times to get to the underlying reason for an event or issue. Event analysis, which examines each event and determines if it's the root cause or occurred because of the root cause. Diagramming cause and effect is used to help determine whether each event was a cause or an effect. Fishbone diagrams are often used for this purpose. Regardless of the process chosen, root cause analysis is an important step in the incident response process and feeds the preparation phase of the cycle to avoid future issues of the same type. Exam Note Mitigation and recovery processes require an understanding of allow and deny lists, isolation, quarantine, and, of course, ongoing monitoring to ensure that the remediation efforts were successful. Finally, be ready to explain what root cause analysis is and why it's important as part of the recovery and preparation process. Summary Every organization will eventually experience a security incident, and having a solid incident response plan in place with a team who knows what they need to do is critical to appropriately handling incidents. Incident response typically follows a response cycle with preparation, detection, analysis, containment, eradication, recovery, and lessons learned phases. Although incident response may involve all of these phases, they are not always conducted as distinct elements, and organizations and IR teams may be in multiple phases at the same time, depending on the status of the incident in question. Preparation for incident response includes building a team, putting in place policies and procedures, conducting exercises, and building the technical and information-gathering infrastructure that will support incident response needs. Incident response plans don't exist in a vacuum. Instead, they are accompanied by communications and stakeholder management plans, business continuity and disaster recovery plans, and other detailed response processes unique to each organization. Threat hunting is used to help identify information that may indicate an issue or compromise has occurred. Looking for key events like account lockouts, concurrent session usage, attempts to access blocked content, unusual resource consumption, resource inaccessibility, missing logs, and out-of-cycle logging are all examples of data that can help indicate compromise. Once they've identified an issue, responders also

Chapter 14 - Monitoring and Incident Response.pdf

Document Details

Tags

Related

Full Transcript

Upgrade to continue