Chapter 14
Using Statistics and Sensors to Ensure Network Availability

THE FOLLOWING COMPTIA NETWORK+ EXAM OBJECTIVES ARE COVERED IN THIS CHAPTER:

Domain 3.0 Network Operations
3.2 Given a scenario, use network monitoring technologies.
- Methods
  - SNMP
    - Traps
    - Management information base (MIB)
    - Versions: v2c, v3
    - Community strings
    - Authentication
  - Flow data
  - Packet capture
  - Baseline metrics
    - Anomaly alerting/notification
  - Log aggregation
    - Syslog collector
    - Security information and event management (SIEM)
  - Application programming interface (API) integration
  - Port mirroring
- Solutions
  - Network discovery
    - Ad hoc
    - Scheduled
  - Traffic analysis
  - Performance monitoring
  - Availability monitoring
  - Configuration monitoring

All organizations detest downtime. It costs money, and it damages their brand and reputation. So, they spend millions trying to solve the issue. One of the keys to stopping downtime is to listen to what the devices may be telling you about their current state of health. Doing so forms a sort of early warning system that lets you know before a system goes down so there is time to avoid it. In this chapter, you'll learn what sort of data you should be monitoring and some of the ways to do so. To find Todd Lammle CompTIA videos and practice questions, please see www.lammle.com.

Performance Monitoring/Metrics/Sensors

Let's imagine you were just brought from the 1800s to the present by a time machine, and on your first trip in a car you examine the dashboard. Speed, temperature, tire inflation, tachometer…what does all that stuff mean? It would be meaningless to you and useless for monitoring the state of the car's health. Likewise, you cannot monitor the health of a device or a network unless you understand the metrics. In these opening sections, you will learn what these metrics are and how to use them.

Device/Chassis

There are certain basic items to monitor about physical computing devices, regardless of whether it's a computer, router, or switch.
While not the only things to monitor, these items would be on the dashboard if such devices had dashboards.

Temperature

Heat and computers do not mix well. Many computer systems require both temperature and humidity control for reliable service. The larger servers, communications equipment, and drive arrays generate considerable amounts of heat; this is especially true of mainframes and older minicomputers. An environmental system for this type of equipment is a significant expense beyond the actual computer system costs. Fortunately, newer systems operate in a wider temperature range. Most new systems are designed to operate in an office environment.

Overheating is also a big cause of reboots. When CPUs get overheated, a cycle of reboots can ensue. Make sure the fan on the heat sink is working and that the system fan is also working. If required, vacuum the dust from around the vents.

Under normal conditions, the PC cools itself by pulling in air. That air is used to dissipate the heat created by the processor (and absorbed by the heat sink). When airflow is restricted by clogged ports, a bad fan, and so forth, heat can build up inside the unit and cause problems. Chip creep—the unseating of components—is one of the more common byproducts of a cycle of overheating and cooling inside the system.

Since air is being pulled into the machine, excessive heat can originate from outside the PC as well because of a hot working environment. The heat can be pulled in and cause the same problems. Take care to keep the ambient air within normal ranges (approximately 60 to 90 degrees Fahrenheit) and at a constant temperature.

Replacing slot covers is vital. Computers are designed to circulate air with slot covers in place or cards plugged into the ports. Leaving slots on the back of the computer open alters the air circulation and causes more dust to be pulled into the system. Finally, note whether the fan is working; if it stops, that is a major cause of overheating.
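Temperature can be watched programmatically as well as on a gauge. The following is a minimal sketch, assuming a Linux host (which exposes sensors under /sys/class/thermal) and an illustrative 70 °C warning limit that is not a vendor specification; check your hardware's documented operating range before adopting a threshold:

```python
from pathlib import Path

WARN_LIMIT_C = 70.0  # example threshold only; consult your hardware vendor's specs

def evaluate_temps(readings, limit=WARN_LIMIT_C):
    """Return the names of sensors whose Celsius readings meet or exceed the limit."""
    return [name for name, celsius in readings.items() if celsius >= limit]

def read_linux_thermal_zones():
    """Collect temperatures from sysfs (Linux-only; values are in millidegrees C)."""
    readings = {}
    for zone in Path("/sys/class/thermal").glob("thermal_zone*"):
        try:
            millideg = int((zone / "temp").read_text().strip())
            readings[zone.name] = millideg / 1000.0
        except (OSError, ValueError):
            continue  # sensor not readable; skip it
    return readings

if __name__ == "__main__":
    temps = read_linux_thermal_zones()
    for hot in evaluate_temps(temps):
        print(f"WARNING: {hot} at {temps[hot]:.1f} C")
```

Separating the threshold check from the sensor read keeps the alerting logic testable even on hosts without accessible sensors.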
Central Processing Unit Usage

When monitoring the central processing unit (CPU), the specific counters you use depend on the server role. Consult the vendor's documentation for information about those counters and what they mean to the performance of the service or application. The following counters are commonly monitored:

- Processor\% Processor Time: The percentage of time the CPU spends executing a non-idle thread. This should not be more than 85% on a sustained basis.
- Processor\% User Time: Represents the percentage of time the CPU spends in user mode, meaning it is doing work for an application. If this value is higher than the baseline you captured during normal operation, the service or application is dominating the CPU.
- Processor\% Interrupt Time: The percentage of time the CPU receives and services hardware interrupts during specific sample intervals. If this is more than 15%, there could be a hardware issue.
- System\Processor Queue Length: The number of threads (smaller pieces of an overall operation) waiting in the processor queue. If this value is more than two times the number of CPUs, the server is not keeping up with the workload.

Memory

Different system roles place different demands on memory, so there may be specific counters of interest that you can learn about by consulting the documentation provided by the vendor of the specific service. Some of the most common counters monitored by server administrators are listed here:

- Memory\% Committed Bytes in Use: The percentage of virtual memory in use. If this is more than 80%, you need more memory.
- Memory\Available MBytes: The amount of physical memory, in megabytes, currently available. If this is less than 5% of total RAM, you need more memory.
- Memory\Free System Page Table Entries: The number of entries in the page table not currently in use by the system. If the number is less than 5,000, there may well be a memory leak.
- Memory\Pool Nonpaged Bytes: The size, in bytes, of the nonpaged pool, which contains objects that cannot be paged to disk. If the value is greater than 175 MB, you may have a memory leak (an application is not releasing its allocated memory when it is done).
- Memory\Pool Paged Bytes: The size, in bytes, of the paged pool, which contains objects that can be paged to disk. If this value is greater than 250 MB, there may be a memory leak.
- Memory\Pages/Sec: The rate at which pages are written to and read from the disk during paging. If the value is greater than 1,000 as a result of excessive paging, there may be a memory leak.

Network Metrics

The health of a network's operation can also be monitored, and it should be, to maintain its performance at peak efficiency. Just as you can head off a problem with a workstation or server, you can react to network conditions before they cause an issue by monitoring these items.

Bandwidth

In a perfect world, there would be unlimited bandwidth, but in reality, you're more likely to find Bigfoot. So, it's helpful to have some great strategies up your sleeve. If you look at what computers are used for today, there's a huge difference between the files we transfer now and those transferred even three to five years ago. Now we do things like watch movies online without them stalling, and we can send huge email attachments. Video teleconferencing is almost more common than Starbucks locations. The point is that the files we transfer today are really large compared to what we sent back and forth just a few years ago. And although bandwidth has increased to allow us to do what we do, there are still limitations that cause network performance to suffer miserably.

Bandwidth can be measured in two different ways: the first is available bandwidth, and the second is bandwidth utilization, or throughput. Simply put, bandwidth is the connection's capacity, and throughput is the utilized bandwidth.
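The Processor and Memory counter thresholds described earlier lend themselves to simple automated checks. The sketch below is illustrative only: the sampled values are invented, and on a real Windows host you would pull the counters from Performance Monitor or a monitoring agent rather than hard-code them:

```python
# Rule-of-thumb thresholds from this chapter; tune them to your own baseline.
THRESHOLDS = {
    r"Processor\% Processor Time": 85.0,      # sustained % busy
    r"Processor\% Interrupt Time": 15.0,      # possible hardware issue above this
    r"Memory\% Committed Bytes in Use": 80.0, # consider adding RAM above this
}

def check_counters(samples, thresholds=THRESHOLDS):
    """Return (counter, value, limit) tuples for every sample over its threshold."""
    alerts = []
    for counter, value in samples.items():
        limit = thresholds.get(counter)
        if limit is not None and value > limit:
            alerts.append((counter, value, limit))
    return alerts

# Hypothetical polled values, e.g., gathered by an agent every 60 seconds.
polled = {
    r"Processor\% Processor Time": 91.2,
    r"Processor\% Interrupt Time": 3.4,
    r"Memory\% Committed Bytes in Use": 62.0,
}
for counter, value, limit in check_counters(polled):
    print(f"ALERT: {counter} = {value:.1f} (limit {limit:.1f})")
```

Keeping the thresholds in a data table rather than scattered through code makes it easy to replace the book's rules of thumb with limits derived from your own baseline.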
When we talk about bandwidth utilization or throughput, we simply refer to it as utilization. The bandwidth of a connection is often referred to as the connection's speed, and it is measured in bits per second (bps). The throughput can never exceed the bandwidth for a given connection. Throughput measurement is often performed on a connection such as an Internet link by measuring the path to a given host on the Internet. It's not a perfect measurement, because a host can sometimes be down or overwhelmed with other traffic. However, most of the time the destination host has a much bigger connection than the Internet connection you are measuring from.

The following are the counters to watch for bandwidth on a system:

- Network Interface\Bytes Total/Sec: The rate at which bytes are sent and received over the network interface. If this value is more than 70% of the interface's bandwidth, the interface is saturated or not keeping up.
- Network Interface\Output Queue Length: The number of packets waiting in the output queue. If this value is over 2, the NIC is not keeping up with the workload.

Latency

Latency is the delay typically incurred in the processing of network data. A low-latency network connection is one that generally experiences short delay times, while a high-latency connection generally suffers from long delays.

Many security solutions can negatively affect latency. For example, routers take a certain amount of time to process and forward any communication. Configuring additional rules on a router generally increases latency, thereby resulting in longer delays. An organization may decide not to deploy certain security solutions because of the negative effects they will have on network latency. Auditing is a great example of a security solution that affects latency and performance. When auditing is configured, it records certain actions as they occur, and the recording of these actions may affect latency and performance.
Measuring latency is typically done using a metric called round-trip time (RTT). It is calculated using ping, a command-line tool that bounces a request off a remote host and measures how long the reply takes to return to the user's device.

Jitter

Jitter occurs when the data flow in a connection is not consistent; that is, the delay increases and decreases in no discernible pattern. Jitter results from network congestion, timing drift, and route changes. Jitter is especially problematic in real-time communications like IP telephony and videoconferencing.

Loss

Loss occurs when packets are dropped or otherwise lost! Loss should never occur on an organization's internal network. However, it is somewhat common on public networks like the Internet, where problems outside of the organization's control can cause packet loss from end to end. Packet loss then requires retransmission of data and causes slowdowns for applications. Even a fraction of a percent can affect the performance of real-time applications such as VoIP, where lost data is never retransmitted.

Additional Monitoring Solutions

While monitoring for acceptable performance is certainly important, there are other monitoring techniques that provide benefits and can help to keep the network and its systems operating correctly. In this section, we'll look at some of these.

Network Discovery

Even if you think you know the topology of your network, you may be surprised to find that it has changed over time. Network discovery is a process that scans the network, identifies all devices and systems, and maps their network relationships to one another. A number of tools are designed to do just this, including Discovery Profile in OpManager. There are two different approaches to conducting network discovery, covered in the next sections.

Ad Hoc

An ad hoc scan is simply one that is performed on demand as a one-time event.
The issue with ad hoc scans is that someone must initiate the process each time, and the network can change significantly without notice between these manual scans.

Scheduled

Network discovery tools can also run scheduled, automated scans, which provide the benefit of not relying on a human to remember that this needs to be done. Alerts trigger when unexpected devices are detected on the network, signaling rogue devices or misconfigurations.

Baseline Metrics

When collecting performance information, there must be a standard against which the data can be measured. This is called a baseline. Baseline metrics should represent the normal and expected functioning of the network or of the device being measured. Therefore, the data that will represent this baseline should be gathered during normal operation and not during periods of unusual stress on the system or network.

Anomaly Alerting/Notification

Once a baseline has been established, it becomes possible to identify when performance deviates from it. Anomaly detection and alerting monitor the incoming performance data and create a notification via email or text message to alert the team of the issue. This frees the team from constantly poring over log files for performance issues.

Traffic Analysis

The most common use of network traffic analysis is to identify root causes of performance issues. For example, it can help determine that there is a bottleneck on one router or switch. It can also identify the systems and users who are creating the most traffic. Armed with this information, technicians can take steps to alter the flow of traffic to eliminate bottlenecks or, in the case of a user who is creating excessive traffic, block their access to that gaming site they use all day!

While traffic analysis can focus on network performance, it also entails determining what type of traffic is on the network. Why is that important?
Some types of traffic may indicate that malware is present on the network. Other types of traffic may indicate the use of services that are against policy. It may also show that encryption is not in use when policy requires it for certain operations.

Performance Monitoring

Performance problems are the toughest and often take a great deal of time to resolve. When you experience a performance problem, the first question to come to mind is "How was it performing before?" This is where a performance baseline comes in handy. The performance baseline should be captured over a period that reflects normal activity, such as Monday through Sunday, covering both business hours and normal periods of inactivity. The best scenario is to constantly monitor and record the performance of selected network services.

There are several tools you can use to monitor performance and compile baselines of the performance of the network. Microsoft includes a tool called Performance Monitor that is built into the operating system. Other performance monitoring tools are Multi Router Traffic Grapher (MRTG) and Paessler AG's PRTG, which use the Simple Network Management Protocol, which you will learn about later in this chapter. There are also many different applications that you can purchase to monitor performance—too many to mention.

Availability Monitoring

Some organizations have systems that must be available 24/7. Consider an e-commerce server that processes an average of $5,000 of transactions per hour. If the team is not made aware that the server is down, the loss is $5,000 an hour, not to mention the damage done to users' confidence in the security and functionality of your site.

A function of a network management system/station is to calculate uptime and downtime for the network; uptime/downtime is also called availability. The network as a whole is made up of many different services on the various servers.
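The uptime arithmetic behind availability reporting is straightforward. As a concrete illustration (the outage figures here are hypothetical, reusing the $5,000-per-hour example above), availability is the fraction of a measurement period during which the service responded:

```python
def availability_percent(total_hours, downtime_hours):
    """Availability = time up / total time, expressed as a percentage."""
    if downtime_hours > total_hours:
        raise ValueError("downtime cannot exceed the measurement period")
    return (total_hours - downtime_hours) / total_hours * 100.0

# A 30-day month with 6 hours of outage on the e-commerce server:
hours_in_month = 30 * 24                 # 720 hours
availability = availability_percent(hours_in_month, 6)
revenue_lost = 6 * 5000                  # $5,000 per hour, per the example above

print(f"Availability: {availability:.2f}%")  # Availability: 99.17%
print(f"Revenue lost: ${revenue_lost:,}")    # Revenue lost: $30,000
```

Running the same math against targets like "three nines" (99.9%) shows why even a few hours of monthly downtime matters: 99.9% allows only about 43 minutes per month.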
The uptime calculated by the network management system is the total time all network services are up and responding to requests. The downtime is how long network services were down and not responding to requests. The network management system will check each service and keep a running tally of down services, as shown in Figure 14.1. It will then calculate the outages and display the total uptime of the network.

FIGURE 14.1 Network management system/station availability

Configuration Monitoring

When multiple technicians are manually implementing system configurations and making changes to them, it becomes easy for systems to fall out of compliance with the intended configuration. While some incorrect settings will become immediately noticeable (an incorrect IP address, for example), others can take some time to show their effects. Configuration monitoring is the process of either manually checking settings on a schedule or using a monitoring tool to check configurations. Atera is just one example of a network monitoring tool that can verify that a system conforms to the appropriate configuration.

EXERCISE 14.1
Working with Performance Monitor

In this exercise, you will use a built-in tool in the Windows operating system to view performance. Performance Monitor exists in every version of Windows and can be used to establish performance metrics for monitoring and establishing baselines.

1. Select Start, type perfmon, and then select Performance Monitor from the search results.
2. Choose the Performance Monitor section under Monitoring Tools.
3. Click the plus sign [+] or right-click in the graphical display area and select Add Counters.
4. Expand the Processor section, and then select the % Processor Time object.
5. Click Add >> and then OK.
6. Open Windows File Explorer, click the C: drive, type * into the search box, and then press Enter.
7. Quickly change to Performance Monitor and watch the impact of this search on the processor.
This action is time-consuming and therefore will help you notice the changes that take place in Performance Monitor.

8. Run the same operation again. This time, however, change your view within Performance Monitor to a histogram bar by clicking the button directly to the left of the plus sign [+].
9. Run the same operation again, changing your view within Performance Monitor to Report.
10. Exit Performance Monitor.

SNMP

The Simple Network Management Protocol (SNMP) allows the collection of metrics, also known as counters. SNMP can also be used for reporting events from a device back to a centralized network management station (NMS). Although the expansion of the acronym SNMP includes the word simple, it is not a simple protocol, because there are many components, as shown in Figure 14.2. But don't worry; I will discuss each component in detail in the following sections. As always, you might feel like you need to know everything about this subject before you can understand a specific topic within it. For that reason, I recommend reading the following sections twice.

Agent

The SNMP agent is a small piece of software that resides on the device or operating system to be monitored. The agent is responsible for answering requests from a network management station and for sending messages to the NMS. The agent is configured with a specific set of counters (metrics) called object identifiers (OIDs) for which it is responsible. It is responsible for collecting the values for these counters and presenting them upon request. The agent can also be set up to transmit to an NMS if a counter crosses a threshold value.

FIGURE 14.2 SNMP components

NMS

OpenNMS and PRTG Network Monitor are two examples of network management systems. An NMS is responsible for collecting statistics such as bandwidth, memory, and CPU from devices and operating systems.
An NMS can also be configured to store these counters in a database so that you can review past performance. When a service goes down or stops responding to NMS queries, an NMS can send an alert or notification for network administrator intervention. We can also set thresholds for specific counters, and when a counter crosses a threshold, alerts and notifications are sent to the administrator from the NMS. For instance, if we are collecting the temperature metric and the value crosses a threshold we set in the NMS, an email can be sent to the administrator with a warning.

Network management systems are generally used for the ongoing collection of statistics from network devices and operating systems. This constant recording of statistics creates baselines for comparison over time. It also helps us identify trends and problematic periods of time. As shown in Figure 14.3, at around 17:00 hours bandwidth spiked up to 9.2 Mbps. Looking at the weekly graph, these spikes seem normal for brief periods of time.

Commands

Network management stations can operate with two basic command methods: the SNMP get command and the SNMP trap command, as shown in Figure 14.4. An SNMP get command is a solicited request to the OS or network device for an object ID (OID) value; SNMP get commands are considered polling requests, since they happen at a set interval. An SNMP trap command is unsolicited information from the OS or network device. An SNMP trap is sent when a threshold on the device has been exceeded, such as a bandwidth or disk-space setting, or in the event an interface goes down. These SNMP traps can be configured on the SNMP monitor to create alerts and notifications for the administrator.

FIGURE 14.3 SNMP monitor graph

FIGURE 14.4 SNMP get and trap methods

There is a third command method called the SNMP set command, but it is not normally used by the NMS. It allows a variable to be set on the device or operating system.
It functions similarly to an SNMP get command, with the exception that you are setting a value. The SNMP set command is normally initiated by a technician or script when setting a value, such as a password, on a group of network devices.

SNMP operates on two different ports, 161 and 162, depending on whether you are sending an SNMP get command to retrieve a value for a counter or a device is reporting an event. All SNMP get and set commands (polling-type commands) are sent via UDP port 161. SNMP trap commands are sent from the agent to the NMS on UDP port 162. By default, SNMP uses UDP, since the messages sent are simple commands that require simple responses. However, TCP can be configured for moving data in an environment where data delivery is not always assured, such as across the Internet.

Community Name

The community name is a shared passphrase used for authentication in SNMP versions 1 and 2c. SNMPv3 still supports community names, but they are not its main authentication method, as you will learn later. The SNMP community name allows an NMS or technician to send SNMP commands to an SNMP instance running on a device or operating system. The default community name for read-only access (get commands) is public. The default community name for read-write access (set commands) is private. So, we often refer to the community names as public and private regardless of the actual community name. Obviously, if someone obtains the community name, they can read sensitive information such as configuration data, and they can possibly write new configurations if they obtain a read-write community name.

Versions

SNMP version 1, known as SNMPv1, is obviously the first version of SNMP, and it is the oldest. SNMPv1 was defined in RFCs 1155 and 1157 back in 1990. It is old and should no longer be used; it is covered here for historical purposes only. SNMPv2 expanded on SNMPv1 by adding support for 64-bit counters to handle large counter values. SNMPv2c added support for proxy agents.
Both version 1 and version 2c lack any kind of encryption or authentication beyond the community name string. They should both be avoided when setting up SNMP, but many vendors still promote setting up SNMPv2c.

SNMPv3 was released in 2002 and added the much-needed encryption and authentication that prior versions lacked. User accounts can be set up, along with a type of access control called an SNMP view. The SNMP view is a way of scoping access down to a specific OID or group of OIDs. SNMPv3 is a lot more difficult to set up, but it is a lot more secure than prior versions. My personal recommendation is to use it over SNMPv1 and v2c, but every situation has its considerations. SNMPv3 is defined in RFCs 3413 to 3415.

Here's a summary of the three versions of SNMP:

- SNMPv1: Supports only plaintext authentication with community strings and uses only UDP.
- SNMPv2c: Also supports only plaintext community-string authentication with no encryption, but it provides GETBULK, a way to gather many pieces of information at once and minimize the number of GET requests. It offers more detailed error-message reporting, but it is not more secure than v1. It uses UDP, though it can be configured to use TCP.
- SNMPv3: Supports strong authentication with MD5 or SHA, providing confidentiality (encryption) and data integrity of messages via DES or AES encryption between agents and managers. GETBULK is a supported feature of SNMPv3, and this version can also use TCP. (Note: MD5 and DES are no longer considered secure.)

OIDs and the MIB

Object identifiers (OIDs) are uniquely managed objects on a device or operating system that can be queried or configured. The OIDs are organized into a hierarchical tree and are written in dotted notation, such as .1.3.6.1.2.1.2.2. Each number represents a level of the hierarchy, from the most general on the left to the most specific on the right.
For example, the OID .1.3.6.1.2.1.2.2 breaks down to iso(1) identified-organization(3) dod(6) internet(1) mgmt(2) mib-2(1) interfaces(2) ifTable(2). So, if this OID is queried, a value will be returned for each interface on the system. From there, you can choose which interface you want to query and the specific attribute. To query interface 3, the OID would look like .1.3.6.1.2.1.2.2.3, adding a 3 to the rightmost position of the OID string. The attributes would then follow, such as .1.3.6.1.2.1.2.2.3.5 for ifSpeed.

This might look like dark magic, but it is all very well documented in the management information base (MIB). The guide to these attributes is the MIB, a database of OIDs published by the vendor of the OS or network device. The MIB defines the OID counters along with the type of data each OID offers for collection; otherwise, the value you get back is just an arbitrary number. The MIB gives definition to the value, such as an interface error rate, bandwidth, or many other attributes of an interface. The NMS requires the specific MIB for the device or OS in order to collect statistics for a counter. Without the proper MIB installed, the SNMP process on the NMS cannot be configured to retrieve the values.

Authentication

Imagine the following scenario: One day, you receive an alarming email from your security team notifying you of a potential breach in your network. Panic sets in as you realize that confidential information could be compromised, jeopardizing your business and its reputation. This is where SNMP authentication comes into play. By implementing authentication measures, you can ensure that only authorized devices and users can access your network. SNMP authentication is crucial for securing your network and protecting sensitive data from unauthorized access. There are various authentication methods for SNMP, including community strings and the additional authentication methods that were added in SNMPv3.
Implementing SNMP authentication involves configuring SNMP agents and managers to ensure secure network communication. Securing SNMP in a network environment is important to protect against vulnerabilities and ensure reliable network management. Authentication is essential for verifying the identity and integrity of SNMP entities, preventing unauthorized access, and ensuring secure communication.

SNMPv1 and SNMPv2c provide limited security features, relying on community strings for access control. SNMPv3 introduced enhanced authentication and security features, including encryption, to address the vulnerabilities of previous versions. SNMPv3 introduced the user-based security model (USM) as a replacement for the community-string method. The user-based security model supports three security levels: no authentication, authentication without privacy, and authentication with privacy. It supports authentication protocols such as MD5 and SHA and ensures that only authorized users can access and manage SNMP agents. SNMPv3 can also employ encryption protocols like DES and AES to ensure confidential transmission of SNMP messages. The use of SNMPv3 adds an extra layer of authenticity and confidentiality to SNMP communications. SNMP authentication methods should be used in conjunction with other security measures, such as access control lists (ACLs) and firewall rules, to strengthen network security.

As mentioned, SNMP is not a simple protocol as its name states, mainly because it has many different components. For the exam, you should know the various components, not their intricacies, as an entire book could be devoted to those topics. That being said, if you would like to learn more about the MIB file referenced in this section, visit www.net-snmp.org/docs/mibs/IF-MIB.txt and browse the file. Net-SNMP is the de facto standard used by many devices and operating systems. The website at www.net-snmp.org is a great source for documentation on the subject.
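To make the dotted OID hierarchy from the preceding section concrete, here is a small, self-contained sketch. The name table covers only the handful of MIB-2 nodes discussed in the text; a real NMS resolves these names from vendor-supplied MIB files rather than a hard-coded dictionary:

```python
# Partial OID tree covering the MIB-2 interfaces example from the text.
# A real NMS would load these names from vendor-supplied MIB files.
OID_NAMES = {
    "1": "iso",
    "1.3": "identified-organization",
    "1.3.6": "dod",
    "1.3.6.1": "internet",
    "1.3.6.1.2": "mgmt",
    "1.3.6.1.2.1": "mib-2",
    "1.3.6.1.2.1.2": "interfaces",
    "1.3.6.1.2.1.2.2": "ifTable",
}

def describe_oid(oid):
    """Translate a dotted OID into name(number) pairs, as far as the table reaches."""
    parts = oid.strip(".").split(".")
    labels = []
    for i, part in enumerate(parts):
        prefix = ".".join(parts[: i + 1])
        name = OID_NAMES.get(prefix, "?")
        labels.append(f"{name}({part})")
    return " ".join(labels)

print(describe_oid(".1.3.6.1.2.1.2.2"))
# iso(1) identified-organization(3) dod(6) internet(1) mgmt(2) mib-2(1) interfaces(2) ifTable(2)
```

Nodes deeper than the table, such as a specific interface index, simply come back as ?(n), which mirrors what an NMS shows when the proper MIB is not installed.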
Application Programming Interface Integration

An alternative to using SNMP to extract data from systems is to use application programming interfaces (APIs). An API is a way for two or more computer programs to communicate with each other; a closely related mechanism, the webhook, pushes event data to a listener instead of waiting to be polled. Utilizing APIs makes automating network tasks much more accessible than using SNMP. By utilizing an API, network engineers can create repeatable code blocks that can be interconnected and operate automatically based on certain conditions.

Protocol Analyzer/Packet Capture

Packet sniffers are software-based tools for capturing network traffic, a practice also known as packet capturing. Packet sniffers can be used with wireless and wired network connections to capture packets. An example of a packet sniffer is the open-source Wireshark packet sniffer and analyzer. The packet sniffer's ability to capture network traffic is also directly dependent on the NIC. The NIC must support promiscuous mode, which allows the capture of frames with any destination MAC address. Although most newer NICs allow promiscuous mode, the feature should be checked against the vendor's specifications for the model of NIC.

Protocol analyzers decipher frames of data that have been captured with a packet sniffer. Protocol analyzers such as Wireshark and Microsoft Message Analyzer also provide packet-sniffing capabilities. The protocol analyzer allows the network administrator to see the details of the data being transmitted or received. This allows the administrator to confirm that the data is being transmitted or received correctly, or it enables the administrator to focus on the problem area. The protocol analyzer comes preloaded with parsers that help decipher the captured data. A parser is nothing more than a list of protocol numbers that define which protocols are in use from layer 2 through layer 7. Once the protocol is known, the data contained in the rest of the layer can be deciphered and displayed in a readable format.
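As a simplified sketch of what such a parser does at layer 2, the Type field of an Ethernet II header can be decoded as shown below. Real analyzers know hundreds of protocol numbers and continue de-encapsulating into the upper layers; only the two EtherType values discussed in this section are mapped here, and the sample frame is fabricated for illustration:

```python
import struct

# EtherType values discussed in the text; real parsers know hundreds more.
ETHERTYPES = {0x0800: "IPv4", 0x0806: "ARP"}

def parse_ethernet_header(frame: bytes) -> dict:
    """Decode the 14-byte Ethernet II header: dst MAC, src MAC, and Type field."""
    if len(frame) < 14:
        raise ValueError("truncated Ethernet frame")
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    return {
        "dst": dst.hex(":"),
        "src": src.hex(":"),
        "type": ethertype,
        "protocol": ETHERTYPES.get(ethertype, "unknown"),
    }

# A fabricated frame: broadcast destination, made-up source MAC, Type 0x0806 (ARP).
frame = bytes.fromhex("ffffffffffff" "001122334455" "0806") + b"\x00" * 28
info = parse_ethernet_header(frame)
print(info["protocol"])  # ARP
```

The `!6s6sH` format string mirrors the on-the-wire layout: two 6-byte MAC addresses followed by the big-endian 16-bit Type field, exactly the fields a protocol analyzer shows in its frame summary.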
For example, if a frame is captured with a Type field of 0x0806, the frame carries an ARP message, and the rest of the data is parsed according to the ARP layout. If the Type field is 0x0800, the frame carries an IPv4 packet, and the data is parsed according to the IPv4 header layout. The data can be parsed further for the Transport and Application layers. In effect, the protocol analyzer de-encapsulates the captured data so it is readable, as shown in Figure 14.5.

Port Mirroring

When you need to use a sniffer or protocol analyzer to capture packets traversing the network, you face a challenge when connecting to one of your switchports. Switches create a separate collision domain for each switchport, which is great for performance but not so good when you are capturing traffic. A consequence of creating a separate collision domain for each port is that only traffic destined for the port to which the sniffer is connected will be captured. To capture all traffic, you must send a copy of all traffic from the other ports to the port to which you have connected the sniffer. This is called port mirroring; Cisco calls it Switched Port Analyzer (SPAN). You can even send packets to the sniffer port from other switches, a process Cisco calls Remote SPAN (RSPAN) when the switches are in the same subnet and Encapsulated Remote Switched Port Analyzer (ERSPAN) when the switches are not in the same subnet.

FIGURE 14.5 Protocol analyzer of a TCP packet

Flow Data

While SNMP is used to extract data from systems, tools that create flow data, such as NetFlow, capture conversations between systems.
A flow is a unidirectional sequence of packets that all share seven values, which together define a unique key for the flow:

Ingress interface
Source IP address
Destination IP address
IP protocol number
Source port for UDP or TCP, 0 for other protocols
Destination port for UDP or TCP, type and code for ICMP, or 0 for other protocols
IP type of service

Routers and switches that support NetFlow can collect flows on all interfaces where NetFlow is enabled and later export those flows to at least one NetFlow collector, typically a server that does the actual traffic analysis. NetFlow is a Cisco proprietary protocol used to collect IP traffic information and to monitor network flows of data.

Log Aggregation

All the systems in the network have log files that record system events. Monitoring these logs individually wherever they exist is a monumental task. One technique that has been developed is to aggregate those logs in one place to make the job easier. The most common tool used for this is Syslog, a system to which all log files are directed. You will learn more about Syslog in this section.

Network Device Logs

While SNMP should be in your toolbox when monitoring the network, there is also a wealth of information to be found in the logs on the network devices. You will now learn about the main log types and about methods to manage the volume of data that exists in these logs.

In networking, a baseline can refer to the standard level of performance of a certain device or to the normal operating capacity for your whole network. For instance, a specific server's baseline describes norms for factors such as how busy its processors are, how much memory it uses, and how much data usually goes through the NIC at a given time. A network baseline delimits the amount of bandwidth available and when.
For networks and networked devices, baselines include information about these four key components:

Processor
Memory
Hard-disk (or other storage) subsystem
Wired/wireless utilization

After everything is up and running, it's a good idea to establish performance baselines on all vital devices and on your network in general. To do this, measure things such as network usage at three different strategic times to get an accurate assessment. For instance, peak usage usually happens around 8 a.m. Monday through Friday, or whenever most people log in to the network in the morning. After hours or on weekends is often when usage is lowest.

Knowing these values can help you troubleshoot bottlenecks or determine why certain system resources are more limited than they should be. Knowing what your baseline is can even tell you if someone's complaints about the network running like a slug are really valid—nice!

It's good to know that you can use network-monitoring software to establish baselines. Even some server operating systems come with software to help with network monitoring, which can help find baselines, perform log management, and graph network activity so you can compare the logs and graphs at a later point in time.

In my experience, it's wise to re-baseline network performance at least once a year. And always pinpoint new performance baselines after any major upgrade to your network's infrastructure.

Log Reviews

High-quality documentation should include a baseline for network performance because you and your client need to know what “normal” looks like to detect problems before they develop into disasters. Don't forget to verify that the network conforms to all internal and external regulations and that you've developed and itemized solid management procedures and security policies for future network administrators to refer to and follow.
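Returning to baselines for a moment, the sampling-and-comparison routine described earlier can be sketched in a few lines of Python (the sample numbers and the two-standard-deviation tolerance are invented for illustration):

```python
import statistics

def build_baseline(samples):
    """Compute a simple baseline from utilization samples taken at the
    same strategic time each day (e.g., percent NIC utilization at 8 a.m.)."""
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),
    }

def is_anomalous(baseline, reading, tolerance=2.0):
    """Flag a reading that deviates more than `tolerance` standard
    deviations from the baseline mean."""
    return abs(reading - baseline["mean"]) > tolerance * baseline["stdev"]

# A week of hypothetical 8 a.m. utilization samples, in percent.
week = [62.0, 58.5, 64.2, 60.1, 61.7, 59.9, 63.0]
baseline = build_baseline(week)
spike = is_anomalous(baseline, 95.0)   # far above the weekly norm
```

Real monitoring suites do far more (percentiles, seasonality, per-interface history), but the principle is the same: no reading means anything until you know what normal looks like.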
Traffic Logs

Some of your infrastructure devices will have logs that record the network traffic that has traversed the device. Examples include firewalls and intrusion detection and prevention devices. Many organizations choose to direct the traffic logs from these devices to a Syslog server or to a security information and event management (SIEM) system (both covered later in this section).

Audit Logs

Audit logs are those that record the activities of users. Windows Server 2022 (and most other Windows operating systems) comes with a tool called Event Viewer that provides several logs containing vital information about events happening on your computer. Other server operating systems have similar logs, and many connectivity devices like routers and switches also have graphical logs that gather statistics on what's happening to them. These logs can go by various names, like history logs, general logs, or server logs. Figure 14.6 shows an Event Viewer Security log display from a Windows Server 2022 machine.

FIGURE 14.6 Event Viewer Security log

On Windows servers and client systems, a minimum of three separate logs hold different types of information:

Application Log: Contains events triggered by applications or programs, as determined by their programmers. Example applications include LiveUpdate, the Microsoft Office suite, and SQL and Exchange servers.

Security Log: Contains security events like valid or invalid logon attempts and potential security problems.

System Log: Contains events generated by Windows system components, including drivers and services that started or failed to start.

The basic “big three” can give us lots of juicy information about who's logging on, who's accessing the computer, and which services are running properly (or not). If you want to find out whether your Dynamic Host Configuration Protocol (DHCP) server started up its DHCP service properly, just check out its System log.
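Pulling logs like these together from many hosts is the essence of log aggregation. As a minimal sketch (the record format here is invented; real Event Viewer and syslog entries are much richer), merging per-host logs into one time-ordered stream looks like this:

```python
import heapq
from datetime import datetime

# Hypothetical per-host event records: (timestamp, host, log, message).
host_a = [("2024-05-01T08:00:02", "hostA", "System", "DHCP service started"),
          ("2024-05-01T08:00:09", "hostA", "Security", "Logon success: admin")]
host_b = [("2024-05-01T08:00:05", "hostB", "Application", "SQL backup finished")]

def aggregate_logs(*sources):
    """Merge already-sorted per-host logs into one time-ordered stream."""
    return list(heapq.merge(*sources,
                            key=lambda e: datetime.fromisoformat(e[0])))

merged = aggregate_logs(host_a, host_b)
# hostB's 08:00:05 event now sits between hostA's two events.
```

Seeing events from every device interleaved on one timeline is exactly what makes a central collector so much more useful for troubleshooting than hopping from console to console, which brings us to Syslog.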
Syslog

Syslog is a client-server protocol that allows just about any device on the network to send logs as events happen. The protocol operates on UDP port 514 and is considered a “fire-and-forget” protocol: the device sending the message never receives an acknowledgment that the message was received. So you really need to make sure that the logs are actually being collected at the Syslog server, also known as the Syslog collector.

It is also important to note that, by default, network devices and Linux/UNIX operating systems write a local file called syslog that contains local events for the device. This comes in handy when troubleshooting, but it also causes challenges if the device completely fails. Therefore, it's always best to ship the logs off the network device or operating system with the Syslog protocol, pointed at a Syslog collector.

Syslog Collector

The Syslog server is also called the Syslog collector. This is the server to which log files are sent from the various servers, routers, switches, and other devices. Figure 14.7 shows the data flow.

FIGURE 14.7 Syslog collector

Network devices can be configured to generate a Syslog message and forward it to various destinations. These four examples are popular ways to gather messages from Cisco devices:

Logging buffer (on by default)
Console line (on by default)
Terminal lines (using the terminal monitor command)
Syslog server

Syslog Messages

The Syslog message format is standardized, as shown in Figure 14.8. The message starts with a timestamp so you know when it was created. On some network devices, sequence numbers can be used in lieu of timestamps. Sequence numbers are useful because some events can happen almost simultaneously, and the sequence number helps sort out which happened first. The next field, called the Device-id, is optional; by default, most network devices do not send it.
However, it's useful to send the Device-id if you are forwarding messages to a centralized Syslog server. The Device-id can be a hostname, an IP address, or any string that identifies the device.

The next field actually comprises three different parts: the facility, the severity, and the mnemonic. The facility is the internal system inside the device that generated the log message. The severity is standardized on a 0–7 scale that we will cover in the next section. The mnemonic is nothing more than the action the facility took, expressed as a simple string. The last section of a Syslog message is the message text itself: exactly what happened to generate the message.

FIGURE 14.8 Anatomy of a Syslog message

Logging Levels/Severity Levels

Most services like DNS and DHCP have some sort of debug feature to help you diagnose problems. The debug feature will produce some form of logs, either on the screen or to a file. Keep in mind that when logs are produced, you will end up with a certain level of noise from normal events. So some services allow you to specify a logging level in an effort to reduce the noise or dive deeper into the problem. This, of course, all depends on what you specify, your mode of diagnosis, and your tolerance for noise in the log file.

The Syslog protocol/service is used solely for logging events on network devices and operating systems. Therefore, it has a built-in logging level called a severity level. These severity levels range from the most critical level of 0 (emergency) to the least critical level of 7 (debug), giving you a total of eight levels to choose from, as shown in Table 14.1.

TABLE 14.1 Syslog severity levels

Level  Severity       Description
0      Emergency      System is unusable.
1      Alert          Action must be taken immediately.
2      Critical       Critical conditions.
3      Error          Error conditions.
4      Warning        Warning conditions.
5      Notice         Normal, but significant conditions.
6      Informational  Informational messages.
7      Debug          Debug-level messages.
A level can be throttled back and forth depending on what you are trying to capture. The severity level is also inclusive of the more critical, numerically lower levels. This means that if you choose a level of 3, it will also include the logging that would be produced at levels 2, 1, and 0. For example, if you configure the severity level to the lowest value of 0, you will receive only the emergency messages. However, if you configure the severity level to 4, you will receive all of the warning (4), error (3), critical (2), alert (1), and emergency (0) messages.

SIEM

Security information and event management (SIEM) is a term for software products and services combining security information management (SIM) and security event management (SEM). SIEM technology provides real-time analysis of security alerts generated by network hardware and applications. You can get this as a software solution or a hardware appliance, and some businesses sell managed services built on SIEM. Any one of these solutions provides security log data and can generate reports for compliance purposes.

The acronyms SEM, SIM, and SIEM are sometimes used interchangeably; however, SEM typically describes the real-time monitoring and correlation of events, notifications, and console views, while SIM describes long-term storage, analysis, and reporting of log data. More recently, the term voice security information and event management (vSIEM) was introduced to provide visibility into voice data.

SIEM can collect useful data for the following functions:

Data aggregation
Correlation
Alerting
Dashboards
Compliance
Retention
Forensic analysis
Notifications

SIEM systems not only assess the aggregated logs in real time but also generate alerts or notifications when an issue is discovered. This allows for continuous monitoring of the environment in a way not possible with other log-centralization approaches such as Syslog.
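The severity-level inclusivity described earlier can be captured in a few lines of Python (a simplified model for illustration, not actual device firmware):

```python
# Standard Syslog severity levels, 0 (most critical) to 7 (least critical).
SEVERITIES = {0: "Emergency", 1: "Alert", 2: "Critical", 3: "Error",
              4: "Warning", 5: "Notice", 6: "Informational", 7: "Debug"}

def should_log(configured_level: int, event_severity: int) -> bool:
    """A configured level is inclusive of the more critical levels below it:
    a device set to level 4 emits severities 0 through 4 and drops 5-7."""
    return event_severity <= configured_level

# Configured at level 4 (warning):
assert should_log(4, 3)       # error (3): emitted
assert not should_log(4, 6)   # informational (6): suppressed
```

The same comparison sits behind alerting rules in a SIEM: the aggregated stream is filtered by severity, and anything at or below the configured threshold raises a notification.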
Summary

In this chapter, you learned that one of the keys to stopping downtime is to listen to what devices may be telling you about their current state of health. You learned how to use performance metrics to monitor the health of a device's CPU, memory, and NIC, about the metrics used to monitor network interface performance, and about settings that may impact that performance. Finally, you were introduced to the use of SNMP and NetFlow to monitor both device health and network traffic from a central location, and you learned how to send log files either to a Syslog server or to a SIEM system.