Chapter 16
High Availability and Disaster Recovery

THE FOLLOWING COMPTIA NETWORK+ EXAM OBJECTIVES ARE COVERED IN THIS CHAPTER:

Domain 2.0 Network Implementations
2.4 Explain important factors of physical installations.
  Power
    Uninterruptible power supply (UPS)
    Power distribution unit (PDU)
    Power load
    Voltage
  Environmental factors
    Humidity
    Fire suppression
    Temperature

Domain 3.0 Network Operations
3.3 Explain disaster recovery (DR) concepts.
  DR metrics
    Recovery point objective (RPO)
    Recovery time objective (RTO)
    Mean time to repair (MTTR)
    Mean time between failures (MTBF)
  DR sites
    Cold site
    Warm site
    Hot site
  High-availability approaches
    Active-active
    Active-passive
  Testing
    Tabletop exercises
    Validation tests

High availability is a system-design protocol that guarantees a certain amount of operational uptime during a given period. The design attempts to minimize unplanned downtime—the time users are unable to access resources. In almost all cases, high availability is provided through the implementation of duplicate equipment (multiple servers, multiple NICs, etc.). Organizations that serve critical functions obviously need this; after all, you really don't want to blaze your way to a hospital ER only to find that they can't treat you because their network is down!

Fault tolerance means that even if one component fails, you won't lose access to the resource it provides. To implement fault tolerance, you need to employ multiple devices or connections that all provide a way to access the same resource(s). A familiar form of fault tolerance is configuring an additional hard drive to be a mirror image of the original so that if either fails, there's still a copy of the data available. In networking, fault tolerance means that you have multiple paths from one point to another. What's cool is that fault-tolerant connections can be configured to be available either on a standby basis only or all the time if you intend to use them as part of a load-balancing system.

In this chapter, you will learn about redundancy concepts, fault tolerance, and the disaster recovery process.

To find Todd Lammle CompTIA videos and practice questions, please see www.lammle.com/network+.

Load Balancing

Load balancing is a technique for spreading work out to multiple computers, network links, or other devices. Contrast this with an active/passive server cluster, in which only one server is active and handling requests at a time. For example, your favorite Internet site might consist of 20 servers that all appear to be the same site because its owner wants to ensure that its users always experience quick access. You can accomplish this on a network by installing multiple redundant links to ensure network traffic is spread across several paths, maximizing the bandwidth on each link. Think of this as similar to having two or more different freeways that will both get you to your destination equally well—if one is really busy, take the other one.

Multipathing

Multipathing is the process of configuring multiple network connections between a system and its storage device. The idea behind multipathing is to provide a backup path in case the preferred connection goes down. For example, a SCSI hard disk drive may connect to two SCSI controllers on the same computer, or a disk may connect to two Fibre Channel ports. The ease with which multipathing can be set up in a virtual environment is one of the advantages a virtual environment provides. Figure 16.1 shows a multipath configuration.

FIGURE 16.1 Multipathing

Both Host A and Host B have multiple host bus adapters (NICs) and multiple connections through multiple switches and are mapped to multiple storage processors as well. This is a highly fault-tolerant arrangement that can survive an HBA failure, a path failure, a switch failure, and a storage processor failure.
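If you want to picture the failover half of this in code, the short Python sketch below walks a list of redundant paths in preference order and falls back when one fails. It's a minimal illustration of the multipathing idea only; the path names and the send_io placeholder are hypothetical, not part of any real multipath driver.

class PathDown(Exception):
    """Raised when an I/O attempt fails on a given path."""

def send_io(path, data):
    # Placeholder for a real write over one HBA/switch/storage-processor path.
    print(f"wrote {len(data)} bytes via {path}")

def write_with_failover(paths, data):
    """Try each redundant path in preference order; return the one that worked."""
    for path in paths:
        try:
            send_io(path, data)
            return path
        except PathDown:
            continue  # this path is down; fall back to the next redundant path
    raise RuntimeError("all paths to the storage device are down")

# Host A in Figure 16.1: two HBAs, two switches, two storage processors.
write_with_failover(["hba1-switchA-sp1", "hba2-switchB-sp2"], b"block of data")

A real multipath driver (Linux dm-multipath, for example) adds continuous health checking and can load balance across paths rather than only failing over.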
Network Interface Card (NIC) Teaming

NIC teaming allows multiple network interfaces to be placed into a team for the purposes of bandwidth aggregation and/or traffic failover to prevent connectivity loss in the event of a network component failure. The cards can be set to an active/active state, where both cards are load balancing, or active/passive, where one card is on standby in case the primary card fails. Most of the time, the NIC team will use a multicast address to send and receive data, but it can also use a broadcast address so all cards receive the data at the same time. Teaming can be done with a single switch or multiple switches.

Figure 16.2 shows what is called static teaming, where a single switch is in use. This would provide failover only for the connection and would not protect against a switch failure.

FIGURE 16.2 Static teaming

Figure 16.3 shows a more redundant arrangement, where a switch-independent setup is in use. This provides fault tolerance for both switches and connections.

FIGURE 16.3 Switch-independent setup

Redundant Hardware/Clusters

By now it must be clear that redundancy is a good thing. While this concept can be applied to network connections, it can also be applied to hardware components and even complete servers. In the following sections, you'll learn how this concept is applied to servers and infrastructure devices.

Switches

As you saw in the previous section, multiple switches can be deployed to provide for failover if a switch fails. When this is done, it sometimes creates what is called a switching loop. Luckily, as you learned in Chapter 11, "Switching and Virtual LANs," Spanning Tree Protocol (STP) can prevent these loops from forming. There are two forms of switch redundancy: switch stacking and switch clusters.

Switch Stacking

Switch stacking is the process of connecting multiple switches (usually in a stack) and managing them as a single switch. Figure 16.4 shows a typical configuration.

FIGURE 16.4 Switch stacking

The stack members work together as a unified system. Layer 2 and layer 3 protocols present the entire switch stack as a single entity to the network. A switch stack always has one active switch and one standby switch. If the active switch becomes unavailable, the standby switch assumes the role of the active switch and continues to keep the stack operational. The active switch controls the operation of the switch stack and is the single point of stack-wide management.

A typical access closet contains one or more access switches placed next to each other in the same rack and uses high-speed redundant links with copper, or more typically fiber, to the distribution layer switches. Here are three big drawbacks to a typical switch topology:

There is management overhead.
STP will block half of the uplinks.
There is no direct communication between switches.

Cisco StackWise technology connects switches that are mounted in the same rack together so they basically become one larger switch. By doing this, you clearly get more access ports for each closet while avoiding the cost of upgrading to a bigger switch.
So you're adding ports as you grow your company instead of front-loading the investment into a pricier, larger switch all at once. And since these stacks are managed as a single unit, stacking reduces the management burden in your network. All switches in a stack share configuration and routing information, so you can easily add or remove switches at any time without disrupting your network or affecting its performance.

To create a StackWise unit, you combine switches into a single, logical unit using special stack interconnect cables, as shown in Figure 16.4. This creates a bidirectional closed-loop path in the stack. Here are some other features of StackWise:

Any changes to the network topology or routing information are updated continuously through the stack interconnect.
A master switch manages the stack as a single unit. The master switch is elected from one of the stack member switches.
You can join up to nine separate switches in a stack.
Each stack of switches has only a single IP address, and the stack is managed as a single object. You'll use this single IP address for all the management of the stack, including fault detection, VLAN database updates, security, and QoS controls.
Each stack has only one configuration file, which is distributed to each switch in the StackWise unit.

Using Cisco StackWise reduces management overhead, and at the same time, multiple switches in a stack can create an EtherChannel connection, eliminating the need for STP. Here's a list of the benefits of using StackWise technology:

StackWise provides a method to join multiple physical switches into a single logical switching unit.
Switches are united by special interconnect cables.
The master switch is elected.
The stack is managed as a single object and has a single management IP address.
It reduces management overhead.
STP is no longer needed if you use EtherChannel.
Up to nine switches can be in a StackWise unit.

One more very cool thing: When you add a new switch to the stack, the master switch automatically configures the unit with the currently running IOS image as well as the configuration of the stack. So you don't have to do anything to bring up the switch before it's ready to operate. Nice!

Switch Clustering

A switch cluster is another option. This is a set of connected and cluster-capable switches that are managed as a single entity without interconnecting stack cables. This is possible by using the Cluster Management Protocol (CMP). The switches in the cluster use switch clustering technology so that you can configure and troubleshoot a group of different switch platforms through a single IP address. In the cluster, one switch plays the role of cluster command switch, and the other switches are cluster member switches that are managed by the command switch. Figure 16.5 shows a switch cluster.

FIGURE 16.5 Switch cluster

Notice that the cluster is managed by using the CMP address of the cluster commander.

Routers

Routers can also be set up in a redundant fashion. When we provide router redundancy, we call it first-hop redundancy, since the router is the first hop from any system on its way to a destination. Accomplishing first-hop redundancy requires a first-hop redundancy protocol (FHRP). FHRPs work by giving you a way to configure more than one physical router to appear as if they were only a single logical one.
This makes client configuration and communication easier because you can configure a single default gateway, and the host machine can use its standard protocols to communicate. First hop is a reference to the default router being the first router, or first router hop, through which a packet must pass.

So, how does a redundancy protocol accomplish this? The protocols I'm going to describe to you do this basically by presenting a virtual router to all of the clients. The virtual router has its own IP and MAC addresses. The virtual IP address is the address that's configured on each of the host machines as the default gateway. The virtual MAC address is the address that will be returned when an ARP request is sent by a host. The hosts don't know or care which physical router is actually forwarding the traffic, as you can see in Figure 16.6.

FIGURE 16.6 FHRPs use a virtual router with a virtual IP address and virtual MAC address.

It's the responsibility of the redundancy protocol to decide which physical router will actively forward traffic and which one will be placed in standby in case the active router fails. Even if the active router fails, the transition to the standby router will be transparent to the hosts because the virtual router, identified by the virtual IP and MAC addresses, is now used by the standby router. The hosts never change their default gateway information, so traffic keeps flowing.

Fault-tolerant solutions provide continued operation in the event of a device failure, and load-balancing solutions distribute the workload over multiple devices. Later in this chapter you will learn about the two most common FHRPs.

Firewalls

Firewalls can also be clustered, and some can also use FHRPs. A firewall cluster is a group of firewall nodes that work as a single logical entity to share the load of traffic processing and provide redundancy. Clustering guarantees the availability of network services to the users. Cisco Adaptive Security Appliance (ASA) and Cisco Firepower next-generation firewall (NGFW) clustering allow you to group multiple ASA nodes as a single logical device to provide high availability and scalability. The two main clustering options discussed in this chapter are active/standby and active/active. In both cases, the firewall cluster looks like a single logical device (a single MAC/IP address) to the network. Later in this chapter, you will learn more about active/active and active/standby operations.

Servers

Fault tolerance is the ability of a system to remain running after a component failure. Redundancy is the key to fault tolerance. When systems are built with redundancy, a component can suffer a failure and an identical component will resume its functionality. Systems should be designed with fault tolerance from the ground up.

Power Supply Redundancy

If a power supply in a piece of network equipment malfunctions, the equipment is dead. Even with the best support contracts, you could wait up to 4 hours before a new power supply arrives and you are back up and running again. Therefore, dual power supplies are a requirement if high availability is desired. Fortunately, most networking equipment can be purchased with an optional second power supply. Dual power supplies operate in a few different ways:

Active/passive dual power supplies allow only one power supply to supply power at a time. When a power fault occurs, the entire load of the device is shifted to the passive power supply, which then becomes the active power supply.
One problem with active/passive dual power supplies is that only one power supply operates at a time. If the passive power supply is worn with age when the load is transferred, it has a higher chance of not functioning properly.

Load-balancing dual power supplies allow both power supplies to operate in an active/active configuration. Both power supplies supply a portion of the power to balance out the load. Load-balancing dual power supplies have a problem similar to active/passive dual power supplies, because one will eventually have to carry the entire load.

Load-shifting dual power supplies are found in servers and data center equipment. As power is supplied by one power supply, the load, or a portion of the load, is slowly transferred to the other power supply and then back again. This method allows for testing of both power supplies, so problems are identified before an actual power outage.

Storage Redundancy

When installing the operating system on a hard drive or Secure Digital (SD) card, you should mirror the operating system onto an identical device, as shown in Figure 16.7. Redundant Array of Independent Disks level 1 (RAID-1), also called mirroring, supports the fault tolerance of the operating system in the event of a drive or card failure.

FIGURE 16.7 RAID-1 (mirroring)

The data drives should be placed on a RAID level as well, but mirroring is often too expensive for data, since it requires each drive to be mirrored to a drive of equal size. Striping with parity, also called RAID-5, is often used for data drives. RAID-5 requires three or more drives and operates by slicing the data being written into blocks, as shown in Figure 16.8. With three drives, the first two drives receive the first two sequential blocks of data, and the third receives a parity calculation of those two blocks. The parity information and data blocks alternate across the drives so that each drive holds an equal share of parity blocks. In the event of a failure, a parity block and a data block can re-create the missing block of data. Read performance is enhanced because several blocks (drives) are read at once. However, write performance is decreased because the parity information must be calculated. The capacity overhead of RAID-5 is 1/N: If three drives are used, one-third of the capacity is used for parity; if four drives are used, one-fourth of the capacity is used for parity; and so on.

FIGURE 16.8 RAID-5 (striping with parity)

RAID-5 has its disadvantages. Because of today's larger data sets, when a drive fails, the other drives must work longer to rebuild the missing drive. This puts the other drives under a severe stress level. If another drive fails during this process, you are at risk of losing your data completely. Luckily, RAID-6 helps ease the burden of large data sets. As shown in Figure 16.9, RAID-6 achieves this by striping two blocks of parity information with two independent parity schemes, which requires at least four drives. RAID-6 allows you to lose a maximum of two drives and not suffer a total loss of data. The first parity block and another block can rebuild the missing block of data. If, under the severe load of rebuilding, a second drive fails, the separately calculated copy of parity can achieve the same goal of rebuilding. The overhead of RAID-6 is 2/N: If four drives are used, two-fourths, or one-half, of the capacity is used for parity; if five drives are used, two-fifths of the capacity is used for parity; and so on.

FIGURE 16.9 RAID-6 (striping with two parity schemes)
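The 1/N and 2/N overhead figures are easy to sanity-check with a few lines of Python. This is just the text's arithmetic worked out, using hypothetical 4 TB drives:

# Worked example of the parity overhead figures quoted above:
# RAID-5 loses 1/N of raw capacity to parity, RAID-6 loses 2/N.

def usable_capacity(drives, drive_tb, parity_drives):
    """Raw capacity minus the parity overhead (parity_drives / drives)."""
    raw = drives * drive_tb
    return raw * (drives - parity_drives) / drives

for n in (3, 4, 5):
    print(f"RAID-5, {n} x 4 TB -> {usable_capacity(n, 4, 1):.1f} TB usable")

for n in (4, 5):
    print(f"RAID-6, {n} x 4 TB -> {usable_capacity(n, 4, 2):.1f} TB usable")

# RAID-5 with 4 drives: 1/4 of 16 TB raw is parity, so 12 TB is usable.
# RAID-6 with 4 drives: 2/4 of 16 TB raw is parity, so 8 TB is usable.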
Server hardware can manage faults related to a CPU and the motherboard and will switch processing to a second CPU in the event of a failure. Memory faults can also be predicted and managed so that information is not lost in the event of memory failure. All of these redundant systems can operate with the only noticeable event being an amber warning light on the server to notify the administrator that the system requires attention.

Clusters

Today's servers can be purchased with full redundancy so they maintain functionality in the event of any component failure. However, component redundancy does not address periods in which you need to take the system down for maintenance, nor does it address the load of processes that require several servers processing together. Clusters are redundant groupings of servers that can balance the load of processes and allow maintenance or complete failure of a node without consequence to overall operation. Microsoft Windows Server 2019 contains a feature called Failover Clustering that allows applications to run in a high-availability mode. If one server fails or is put into maintenance mode, then the application will fail over to another server. The application must be written for failover clustering, and although this approach was popular 5 to 10 years ago, today it has been upstaged by virtualization clusters.

Mean Time to Repair

One of the metrics that's used in planning both SLAs and IT operations in general is mean time to repair (MTTR). This value describes the average length of time it takes a vendor to repair a device or component. By building these values into SLAs, IT can ensure that the time taken to repair a component or device will not be a factor that causes them to violate the SLAs' requirements. Sometimes MTTR is considered to be from the point at which the failure is first discovered until the point at which the equipment returns to operation. In other cases, it is a measure of the elapsed time between the point where repairs actually begin and the point at which the equipment returns to operation. It is important that all parties have a clear understanding of when the clock starts and ends when calculating MTTR.

Mean Time Between Failures

Another valuable metric typically provided is the mean time between failures (MTBF), which describes the amount of time that elapses between one failure and the next. Mathematically, this is the sum of mean time to failure (MTTF) and MTTR, the total time required to get the device fixed and back online.

Planned Downtime

There's a difference between planned downtime and unplanned downtime. Planned downtime is good—it's occasionally scheduled for system maintenance and routine upgrades. Unplanned downtime is bad—it's a lack of access due to system failure, which is exactly the issue high availability resolves.
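These metrics combine into an availability estimate. Since the chapter defines MTBF as MTTF plus MTTR, the fraction of each failure cycle a device spends working is MTTF divided by MTBF. Here's a quick sketch using a hypothetical device that fails about once a year and the 4-hour repair window mentioned earlier:

# Availability from the chapter's metrics: MTBF = MTTF + MTTR, and the
# fraction of time a device is up is MTTF / MTBF.

def availability(mttf_hours, mttr_hours):
    mtbf = mttf_hours + mttr_hours   # mean time between failures
    return mttf_hours / mtbf         # uptime fraction per failure cycle

# Hypothetical device: runs ~8,760 hours (a year) between failures and
# takes 4 hours to repair under the support contract mentioned earlier.
a = availability(8760, 4)
print(f"availability: {a:.5%}")                          # about 99.954%
print(f"expected downtime per year: {(1 - a) * 8760:.1f} hours")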
Facilities and Infrastructure Support

When infrastructure equipment is purchased and deployed, the ultimate success of the deployment can depend on selecting the proper equipment, determining its proper location in the facility, and installing it correctly. Let's look at some common data center and server room equipment and a few best practices for managing these facilities.

Uninterruptible Power Supply

One risk that all organizations should prepare for is the loss of power. All infrastructure systems should be connected to uninterruptible power supplies (UPSs). These devices can immediately supply power from a battery backup when a loss of power is detected. You should keep in mind, however, that these devices are not designed as a long-term solution. They are designed to provide power long enough for you to either shut the system down gracefully or turn on a power generator. In scenarios where long-term backup power is called for, a gas-powered generator should be installed.

There are several types of UPS systems that you may encounter. The main types are as follows:

A standby UPS is the most common UPS, the kind you find under a desk protecting a personal computer. It operates by transferring the load from the AC line to the battery-supplied inverter, and capacitors in the unit help to keep the power sag to a minimum. These units work well, but they are not generally found in server rooms.

A line interactive UPS is commonly used for small server rooms and racks of networking equipment. It operates by supplying power from the AC line to the inverter. When a power failure occurs, the line signals the inverter to draw power from the batteries. This might seem similar to a standby UPS, but the difference is that the load is not shifted. In a standby UPS, the load must shift from AC to a completely different circuit (the inverter), whereas in a line interactive UPS, the inverter is always wired to the load; only during the power outage is the inverter running on batteries. This arrangement allows for a much smoother transition of power.

An online UPS is the standard for data centers. It operates by supplying AC power to a rectifier/charging circuit that maintains a charge for the batteries. The batteries then supply the inverter with a constant DC power source, and the inverter converts the DC power back into AC power that supplies the load. The benefit of an online UPS is that the power is constantly supplied from the batteries, so when there is a power loss, the unit maintains a constant supply of power to the load. The other benefit is that the online UPS always supplies a clean AC signal.

Power Distribution Units

Power distribution units (PDUs) provide a means of distributing power from the input to multiple outlets. Intelligent PDUs normally have an intelligence module that allows for remote management of power metering information, power outlet on/off control, and/or alarms. Some advanced PDUs allow users to manage external sensors such as temperature, humidity, and airflow. While a PDU can be as simple as a power strip, larger PDUs are needed in data centers to power multiple server cabinets. Each server cabinet or row of cabinets may require multiple high-current circuits, possibly from different phases of incoming power or different UPSs. Stand-alone cabinet PDUs are self-contained units that include main circuit breakers, individual circuit breakers, and power monitoring panels. Figure 16.10 shows a standard rack-mount PDU.

FIGURE 16.10 Rack-mounted PDU
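Power load and voltage, two of the exam objectives for physical installations, come together in a simple calculation: watts equal volts times amps, and a common North American practice (from the NEC) is to keep continuous load at or below 80 percent of a branch circuit's rating. The sketch below applies that rule of thumb to a hypothetical rack; the circuit rating and wattage figures are illustrative only:

# Back-of-the-envelope power-load check for a rack PDU circuit.
# Watts = volts x amps; derate the circuit to 80% for continuous load.

CIRCUIT_VOLTS = 208
CIRCUIT_AMPS = 30
SAFE_FRACTION = 0.80   # common NEC continuous-load rule of thumb

equipment_watts = {   # hypothetical equipment list for one rack
    "server-1": 450,
    "server-2": 450,
    "top-of-rack switch": 150,
    "firewall": 120,
}

total = sum(equipment_watts.values())
capacity = CIRCUIT_VOLTS * CIRCUIT_AMPS * SAFE_FRACTION
print(f"load: {total} W of {capacity:.0f} W usable "
      f"({total / capacity:.0%} of the derated circuit)")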
Generator

Power generators supply power during a power outage. They consist of three major components: fuel, an engine, and a generator. The engine burns the fuel to turn the generator and create power. The three common sources of fuel are natural gas, gasoline, and diesel. Diesel-fueled generators are the most common type supplying data centers around the world.

As mentioned earlier, generators require a startup period before they can supply a constant source of electricity. In addition to the startup period, there is also a switchover lag. When a power outage occurs, the transfer switch moves the load from the street power to the generator circuit. UPSs help bridge both the lag and the sag in the electricity supply during the switchover and startup periods.

HVAC

Like any device with a CPU, infrastructure devices such as routers, switches, and specialty appliances must have a cool area in which to operate. When temperatures rise, servers start rebooting and appliance CPUs start overworking. The room(s) where these devices are located should be provided with heavy-duty heating, ventilation, and air conditioning (HVAC) systems and ample ventilation. It is advisable to dedicate a suite for this purpose and put the entire system on a UPS with a backup generator in case of a loss of power. The heating and air-conditioning systems must support the massive amounts of computing equipment most enterprises deploy. Computing equipment and infrastructure devices like routers and switches do not like the following conditions:

Heat: Excessive heat causes reboots and crashes.
High humidity: It causes corrosion problems with connections.
Low humidity: Dry conditions encourage static electricity, which can damage equipment.

The American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) publishes standards for indoor air quality and humidity. Their latest recommendations for a class A1 data center are as follows:

Temperature can range from 59°F to 89.6°F.
Relative humidity can range from 20 percent to 80 percent.

Also keep in mind:

At 175°F, damage starts occurring to computers and peripherals.
At 350°F, damage starts occurring to paper products.
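The ASHRAE A1 ranges above translate directly into monitoring thresholds. Here's a minimal sketch that checks hypothetical sensor readings against them; in practice these values would come from the environmental sensors on an intelligent PDU or dedicated probes:

# Check sensor readings against the ASHRAE class A1 ranges quoted above:
# 59-89.6 degrees F and 20-80 percent relative humidity.

A1_TEMP_F = (59.0, 89.6)
A1_RH_PCT = (20.0, 80.0)

def in_range(value, bounds):
    low, high = bounds
    return low <= value <= high

readings = {"temp_f": 92.3, "rh_pct": 18.0}   # hypothetical sensor values

if not in_range(readings["temp_f"], A1_TEMP_F):
    print(f"ALARM: temperature {readings['temp_f']} F is outside the A1 range")
if not in_range(readings["rh_pct"], A1_RH_PCT):
    print(f"ALARM: humidity {readings['rh_pct']}% is outside the A1 range "
          "(dry air invites static discharge)")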
Fire Suppression

While fire extinguishers are important and should be placed throughout a facility, when large numbers of computing devices are present, it is worth the money to protect them with a fire-suppression system. There are five basic types of fire suppression you may find in a facility:

Wet Pipe System This is the most common fire-suppression system found in facilities such as office complexes and even residential buildings. The wet pipe system is constantly charged with water from a holding tank or the city water supply. The sprinkler head contains a small glass capsule that holds a glycerin-based liquid that keeps the valve shut. When the glass capsule is heated to between 135°F and 165°F, the liquid expands, breaking the glass and opening the valve. Gallons of water will dump in that area until either the fire is extinguished or another head opens from excessive heat.

Dry Pipe System Although the name is deceiving, a dry pipe system uses water, similar to a wet pipe system. The difference is that a dry pipe system does not initially contain water. The pipes in a dry pipe system are charged with air or nitrogen. When a pressure drop occurs because a sprinkler head has been heated to between 135°F and 165°F, the air escapes out of the sprinkler head. The water is then released behind the initial air charge, and the system operates similarly to a wet pipe system.

Preaction Systems The preaction system is identical to the dry pipe system in operation, but it employs an additional mechanism: an independent thermal link that must also trip before the system charges with water. The system will not dump water unless the sprinkler head is heated to between 135°F and 165°F and the thermal link is tripped by smoke or fire. This is an additional factor of safety for the equipment, so a sprinkler head is not tripped by an accident such as a ladder banging into it.

Deluge Systems Deluge systems are some of the simplest systems, and they are often used in factory settings. They do not contain a valve in the sprinkler head, just a deflector for the water. When a fire breaks out, the entire system dumps water from all of the sprinkler heads.

Clean Agent There are many different clean agents available on the market today. These systems are deployed in data centers worldwide because they do not damage equipment in the event of a fire. The principle of operation is simple: The system displaces oxygen in the air to below 15 percent to suppress the fire. The clean agent is always a gas, and these systems are often mislabeled as halon systems. At one time, fire-suppression systems used halon gas, which works well by suppressing combustion through a chemical reaction. However, the US Environmental Protection Agency (EPA) banned halon manufacturing in 1994 because it has been found to damage the ozone layer. The EPA has approved the following replacements for halon:

Water
Argon
NAF-S-III
FM-200
A mixture of gases

EXERCISE 16.1

Designing Facilities and Infrastructure

In this exercise, you will design a mock facility with all of the items you learned about in this section and explain their various functions and why you chose each item. You need to support a dedicated data center of servers. There are three racks of servers in the data center. In the event of a power failure, the data center needs to remain in operation, and in the event of a fire, the equipment must remain unharmed. After you have made your list of items that you would recommend building the data center with, go back through this section and check whether each one was appropriate.

Redundancy and High Availability Concepts

All organizations should identify and analyze the risks they face. This is called risk management. In the following sections, you'll find a survey of topics that all relate in some way to addressing risks that can be mitigated with redundancy and high availability techniques.

Disaster Recovery Sites

Although a secondary site that is identical in every way to the main site, with data kept synchronized up to the minute, would be ideal, the cost cannot be justified for most organizations. Cost-benefit analysis must be applied to every business issue, even disaster recovery. Thankfully, not all secondary sites are created equal. They can vary in functionality and cost. We're going to explore four types of sites: cold sites, warm sites, hot sites, and cloud sites.

Cold Site

A cold site is a leased facility that contains only electrical and communications wiring, air conditioning, plumbing, and raised flooring. No communications equipment, networking hardware, or computers are installed at a cold site until it is necessary to bring the site to full operation. For this reason, a cold site takes much longer to restore than a hot or warm site. A cold site provides the slowest recovery, but it is the least expensive to maintain. It is also the most difficult to test.

Warm Site

The restoration time and cost of a warm site are somewhere between those of a hot site and a cold site. It is the most widely implemented alternate leased location. Although it is easier to test a warm site than a cold site, a warm site requires much more effort for testing than a hot site. A warm site is a leased facility that contains electrical and communications wiring, full utilities, and networking equipment. In most cases, the only things that need to be restored are the software and the data.
A warm site takes longer to restore than a hot site but less time than a cold site.

Hot Site

A hot site is a leased facility that contains all the resources needed for full operation. This environment includes computers, raised flooring, full utilities, electrical and communications wiring, networking equipment, and uninterruptible power supplies. The only resource that must be restored at a hot site is the organization's data, and usually only partially. It should take only a few minutes to bring a hot site to full operation. Although a hot site provides the quickest recovery, it is the most expensive to maintain. In addition, it can be administratively hard to manage if the organization requires proprietary hardware or software. A hot site requires the same security controls as the primary facility and full redundancy, including hardware, software, and communication wiring.

Cloud Site

A cloud recovery site is an extension of the cloud backup services that have developed over the years. These are sites that, while mimicking your on-premises network, are totally virtual, as shown in Figure 16.11.

FIGURE 16.11 Cloud recovery site

Organizations that lack the expertise to develop even a cold site may benefit from engaging with a cloud vendor of these services.

Active/Active vs. Active/Passive

When systems are arranged for fault tolerance or high availability, they can be set up in either an active/active arrangement or an active/passive configuration. Earlier in this chapter you learned that in an active/active state, both or all devices (servers, routers, switches, etc.) are performing work, and in an active/passive state, at least one device is on standby in case a working device fails. Active/active increases availability by providing more systems for work, while active/passive provides fault tolerance by holding at least one system in reserve in case of a system failure.

Multiple Internet Service Providers/Diverse Paths

Redundancy may also be beneficial when it comes to your Internet connection. There are two types of redundancy that can be implemented. Path redundancy is accomplished by configuring multiple paths to the Internet service provider (ISP), as shown in Figure 16.12. There is a single ISP with two paths extending to the ISP from two different routers.

FIGURE 16.12 Path redundancy

That's great, but what if the ISP suffers a failure (it does happen)? To protect against that, you could engage two different ISPs with a path to each from a single router, as shown in Figure 16.13.

FIGURE 16.13 ISP redundancy

For complete protection, you could combine the two by using a separate router connection to each ISP, thus protecting against an issue with a single router or path in your network, as shown in Figure 16.14.

FIGURE 16.14 Path and ISP redundancy

First-Hop Redundancy Protocols

Earlier in this chapter I mentioned first-hop redundancy protocols (FHRPs) and said we would come back to them. Now is the time. There are two first-hop redundancy protocols: Hot Standby Router Protocol (HSRP) and Virtual Router Redundancy Protocol (VRRP). HSRP is a Cisco proprietary protocol, while VRRP is a standards-based protocol. Both rely on the same basic hello-and-failover mechanism, sketched below, so let's look at them.
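Here is that mechanism reduced to a few lines of Python: the standby router promotes itself once it has gone a full hold time without hearing a hello from the active router. The 3-second and 10-second values are the HSRP defaults covered in the next section; the simulation itself is only an illustration, not protocol code:

# Minimal simulation of the FHRP failover idea: the standby router takes
# over when no hello has been heard for the hold time.

HELLO_INTERVAL = 3.0   # seconds; cadence at which hellos normally arrive
HOLD_TIME = 10.0       # standby declares the active router dead after this

def standby_should_take_over(now, last_hello):
    return (now - last_hello) > HOLD_TIME

# The active router sent its last hello at t=100 s and then failed.
last_hello = 100.0
for t in (103.0, 106.0, 109.0, 111.0):
    takeover = standby_should_take_over(t, last_hello)
    role = "take over as active" if takeover else "stay standby"
    print(f"t={t:5.1f}s  no hello for {t - last_hello:4.1f}s -> {role}")

# The virtual IP and MAC move with the role, so hosts never change gateways.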
Hot Standby Router Protocol

Hot Standby Router Protocol is a Cisco proprietary protocol that can be run on most, but not all, of Cisco's router and multilayer switch models. It defines a standby group, and each standby group that you define includes the following routers:

Active router
Standby router
Virtual router
Any other routers that may be attached to the subnet

The problem with HSRP is that only one router is active, and two or more routers just sit there in standby mode and won't be used unless a failure occurs—not very cost-effective or efficient! Figure 16.15 shows how only one router is used at a time in an HSRP group.

FIGURE 16.15 HSRP active and standby routers

The standby group will always have at least two routers participating in it. The primary players in the group are the one active router and one standby router that communicate with each other using multicast Hello messages. The Hello messages provide all of the required communication between the routers. The Hellos contain the information required to accomplish the election that determines the active and standby router positions. They also hold the key to the failover process. If the standby router stops receiving Hello packets from the active router, it takes over the active router role, as shown in Figure 16.16.

FIGURE 16.16 HSRP active and standby routers

As soon as the active router stops responding to Hellos, the standby router automatically becomes the active router and starts responding to host requests.

VIRTUAL MAC ADDRESS

A virtual router in an HSRP group has a virtual IP address and a virtual MAC address. So where does that virtual MAC address come from? The virtual IP address isn't that hard to figure out; it just has to be a unique IP address on the same subnet as the hosts defined in the configuration. But MAC addresses are a little different, right? Or are they? The answer is yes—sort of. With HSRP, you create a totally new, made-up MAC address in addition to the IP address. The HSRP MAC address has only one variable piece in it. The first 24 bits still identify the vendor that manufactured the device (the organizationally unique identifier, or OUI). The next 16 bits in the address tell us that the MAC address is a well-known HSRP MAC address. Finally, the last 8 bits of the address are the hexadecimal representation of the HSRP group number. Let me clarify all this with an example of what an HSRP MAC address would look like:

0000.0c07.ac0a

Here's how it breaks down:

The first 24 bits (0000.0c) are the vendor ID of the address; because HSRP is a Cisco protocol, the ID is assigned to Cisco.
The next 16 bits (07.ac) are the well-known HSRP ID. This part of the address was assigned by Cisco in the protocol, so it's always easy to recognize that this address is for use with HSRP.
The last 8 bits (0a) are the only variable bits and represent the HSRP group number that you assign. In this case, the group number is 10, which becomes 0a when converted to hexadecimal and placed in the MAC address.

You can see this MAC address added to the ARP cache of every router in the HSRP group, along with the translation from the IP address to the MAC address and the interface on which it's located.
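Because the layout is fixed, you can assemble the HSRP version 1 virtual MAC for any group number with a couple of lines of Python. This little helper is just an illustration of the bit layout described above:

# Build the well-known HSRPv1 virtual MAC from a group number, following
# the breakdown above: 0000.0c (Cisco OUI) + 07.ac (HSRP ID) + group in hex.

def hsrp_virtual_mac(group):
    if not 0 <= group <= 255:
        raise ValueError("HSRPv1 group numbers fit in the last 8 bits")
    raw = f"00000c07ac{group:02x}"
    # Render in Cisco's dotted-triple format, e.g., 0000.0c07.ac0a
    return ".".join(raw[i:i + 4] for i in range(0, 12, 4))

print(hsrp_virtual_mac(10))   # 0000.0c07.ac0a, matching the example above
print(hsrp_virtual_mac(1))    # 0000.0c07.ac01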
HSRP TIMERS

Before we get deeper into the roles that each of the routers can have in an HSRP group, I want to define the HSRP timers. The timers are very important to HSRP function because they ensure communication between the routers, and if something goes wrong, they allow the standby router to take over. The HSRP timers include hello, hold, active, and standby.

Hello timer: The hello timer is the defined interval during which each of the routers sends out Hello messages. The default interval is 3 seconds, and the messages identify the state that each router is in. This is important because the particular state determines the specific role of each router and, as a result, the actions each will take within the group. Figure 16.17 shows the Hello messages being sent; the routers use the hello timer to keep network traffic flowing in case of a failure. This timer can be changed, and people used to avoid doing so because it was thought that lowering the hello value would place an unnecessary load on the routers. That isn't true with most routers today; in fact, you can configure the timers in milliseconds, meaning the failover time can be in milliseconds! Still, keep in mind that increasing the value will cause the standby router to wait longer before taking over for the active router when it fails or can't communicate.

FIGURE 16.17 HSRP active and standby routers

Hold timer: The hold timer specifies the interval the standby router uses to determine whether the active router is offline or out of communication. By default, the hold timer is 10 seconds, roughly three times the default for the hello timer. If one timer is changed for some reason, I recommend using this multiplier to adjust the other timers too. By setting the hold timer at three times the hello timer, you ensure that the standby router doesn't take over the active role every time there's a short break in communication.

Active timer: The active timer monitors the state of the active router. The timer resets each time a router in the standby group receives a Hello packet from the active router. This timer expires based on the hold time value that's set in the corresponding field of the HSRP Hello message.

Standby timer: The standby timer is used to monitor the state of the standby router. The timer resets anytime a router in the standby group receives a Hello packet from the standby router and expires based on the hold time value that's set in the respective Hello packet.

Real World Scenario

Large Enterprise Network Outages with FHRPs

Years ago, when HSRP was all the rage and before VRRP, enterprises used hundreds of HSRP groups. With the hello timer set to 3 seconds and a hold time of 10 seconds, these timers worked just fine, and we had great redundancy with our core routers. However, as we've seen in the last few years, and will certainly see in the future, 10 seconds is now a lifetime! Some of my customers have been complaining about the failover time and loss of connectivity to their virtual server farms. So lately I've been changing the timers to well below the defaults. Cisco has since changed the timers so you can use subsecond times for failover. Because these are multicast packets, the overhead seen on a current high-speed network is almost nothing. The hello timer is typically set to 200 milliseconds (msec), and the hold time to 700 msec. The command is as follows:

(config-if)#standby 1 timers msec 200 msec 700

This almost ensures that not even a single packet is lost when there is an outage.
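The three-to-one relationship between the hold and hello timers is worth internalizing, since it applies whether you are working in seconds or milliseconds. A trivial sketch of the rule of thumb:

# Apply the "hold time is roughly three times the hello time" guideline
# from the text, for both second- and millisecond-based timers.

def recommended_hold(hello_seconds):
    return 3 * hello_seconds   # wait ~3 missed hellos before declaring failure

print(recommended_hold(3.0), "seconds")   # near the classic 3 s / 10 s defaults
print(recommended_hold(0.2), "seconds")   # 200 msec hello -> 600 msec hold

# Note the scenario above pairs a 200 msec hello with a 700 msec hold:
# slightly more than 3x, which adds a little margin for one delayed hello.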
Virtual Router Redundancy Protocol

Like HSRP, Virtual Router Redundancy Protocol allows a group of routers to form a single virtual router. In an HSRP or VRRP group, one router is elected to handle all requests sent to the virtual IP address. With HSRP, this is the active router. An HSRP group has only one active router, at least one standby router, and many listening routers. A VRRP group has one master router and one or more backup routers, and VRRP is the open-standard implementation of HSRP.

COMPARING VRRP AND HSRP

The LAN workstations are configured with the address of the virtual router as their default gateway, just as they are with HSRP, but VRRP differs from HSRP in these important ways:

VRRP is an open IETF standard (RFC 2338) for router redundancy; HSRP is a Cisco proprietary protocol.
The virtual router that represents a group of routers is known as a VRRP group.
The active router is referred to as the master virtual router.
The master virtual router may have the same IP address as the virtual router group.
Multiple routers can function as backup routers.
VRRP is supported on Ethernet, Fast Ethernet, and Gigabit Ethernet interfaces as well as on Multiprotocol Label Switching (MPLS), virtual private networks (VPNs), and VLANs.

VRRP REDUNDANCY CHARACTERISTICS

VRRP has some unique features:

VRRP provides redundancy for the real IP address of a router or for a virtual IP address shared among the VRRP group members.
If a real IP address is used, the router with that address becomes the master. If a virtual IP address is used, the master is the router with the highest priority.
A VRRP group has one master router and one or more backup routers. The master router uses VRRP messages to inform group members of its status.

Backups

Backups are not just there for disasters. For example, you may need a backup simply because a user deleted files by mistake. However, backups are typically used for larger problems such as malicious data loss or failures of disk subsystems.

Administrators will adopt a rotation schedule for long-term archiving of data. The most popular backup rotation is grandfather, father, son (GFS). The GFS rotation specifies that the daily backups will be rotated on a first-in, first-out (FIFO) basis. One of the daily backups will become the weekly backup, and one of the weekly backups will become the month-end backup. Retention policies should be created, such as retaining 6 daily backups, 4 weekly backups, and 12 monthly backups. As you progress further back than the first six days, the RPO jumps to a weekly basis and then to a monthly basis. However, the benefit is that you can retain data over a longer period of time with the same number of tapes.
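Retention policies like 6 daily, 4 weekly, and 12 monthly are easy to reason about in code. The sketch below shows one hypothetical way to pick which dated backups to keep under a GFS-style policy, using Sunday as the weekly "father" and the last backup of each month as the "grandfather"; real backup software implements this logic for you:

# Sketch of a grandfather-father-son retention check: which dated backups
# to keep under a 6-daily / 4-weekly / 12-monthly policy like the one above.

from datetime import date, timedelta

def gfs_keep(backups, today):
    keep = set()
    daily = sorted(d for d in backups if d <= today)[-6:]      # sons
    keep.update(daily)
    sundays = sorted(d for d in backups if d.weekday() == 6)   # fathers
    keep.update(sundays[-4:])
    month_ends = {}                                            # grandfathers
    for d in sorted(backups):
        month_ends[(d.year, d.month)] = d   # last backup seen in each month
    keep.update(sorted(month_ends.values())[-12:])
    return keep

# Hypothetical run: daily backups from January 1 through April 29, 2024.
days = [date(2024, 1, 1) + timedelta(n) for n in range(120)]
print(len(gfs_keep(days, date(2024, 4, 29))), "tapes retained")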
Three types of media are commonly used for backups:

Disk-to-tape backups have evolved quite a bit throughout the years. Today, Linear Tape-Open (LTO) technology has become the successor to earlier tape formats. LTO can provide 6 TB of raw capacity per tape, with plans for 48 TB per tape in the near future. Tapes are portable enough to rotate off-site for safekeeping. However, time is required to record the data, resulting in lengthy backup windows. Restores also require time to tension the tape, locate the data, and restore the data, making the RTO a lengthy process.

Disk-to-disk backups have become standard in data centers as well because of the short RTO. They can record the data more quickly, thus shortening backup windows. They also do not require tensioning and do not require seeking for the data as tape media does. However, the capacity of disk is much smaller than tape because the drives remain in the backup unit. Data deduplication can provide a nominal 10:1 compression ratio, depending on the data. This means that 10 TB of data can be compressed onto 1 TB of disk storage, so a 10 TB storage unit can potentially back up 100 TB of data; this depends on the types of files you are backing up.

Disk-to-cloud is another popular and emerging backup technology. It is often used with disk-to-disk backups to provide an off-site storage location for end-of-week backups or monthly backups. The two disadvantages of disk-to-cloud are the ongoing cost and the lengthy RTO. The advantage is that expensive backup equipment does not need to be purchased, along with the ongoing purchase of tapes.

Network Device Backup/Restore

Files are not the only thing that should be backed up on the network. Network devices should be backed up as well, since their configuration is usually completely unique. Configurations such as the various port configurations on a network switch can be a nightmare to reconfigure. Configurations can be lost because they were erased by accident or overwritten, or due to just plain failure of the equipment. There are automated appliances and software that can automatically back up the configuration of switches on a daily basis. Many vendors also have mechanisms so that the equipment can back itself up to a TFTP, FTP, or SFTP server, or even a flash card.

In the case of a cluster host or virtualization host, configuration is not the only thing you will need to back up in the event of failure. The overall state of the device should be saved as well, in case the device needs to be completely replaced. The software installed on the device expects MAC addresses and disk configuration to be the same when it is moved to new hardware; otherwise, the software could need to be completely reinstalled. Thankfully, many vendors allow for state to be saved. This allows a complete forklift of the operating system and data without reinstalling.

Recovery

The recovery point objective (RPO) defines the point in time that you can restore to in the event of a disaster. The RPO is often the night before, since backup windows are often scheduled at night. The recovery time objective (RTO) defines how fast you can restore the data. Some of the backup methods discussed earlier can speed up the backup process; the trade-off is that they can also increase the recovery time.

Recovery Point Objective

An RPO is a measurement of time from the failure, disaster, or comparable loss-causing event. RPOs measure back in time to when your data was last preserved in a usable format, usually to the most recent backup.

Recovery Time Objective

The RTO is the shortest time period after a disaster or disruptive event within which a resource or function must be restored in order to avoid unacceptable consequences. RTO assumes that an acceptable period of downtime exists.
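RPO and RTO only mean something when you measure against them. Here's a minimal sketch of that check, with made-up timestamps from a hypothetical restore test:

# DR-metric sanity check: did the last restore test meet the RPO and RTO
# targets? All times below are hypothetical.

from datetime import datetime, timedelta

RPO_TARGET = timedelta(hours=24)   # lose at most one day of data
RTO_TARGET = timedelta(hours=4)    # be back online within four hours

failure        = datetime(2024, 6, 1, 9, 30)
last_backup    = datetime(2024, 5, 31, 23, 0)   # "the night before"
service_online = datetime(2024, 6, 1, 12, 45)

data_loss = failure - last_backup        # actual recovery point achieved
downtime  = service_online - failure     # actual recovery time achieved

print(f"RPO: lost {data_loss} of data, target {RPO_TARGET} ->",
      "OK" if data_loss <= RPO_TARGET else "MISSED")
print(f"RTO: down for {downtime}, target {RTO_TARGET} ->",
      "OK" if downtime <= RTO_TARGET else "MISSED")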
Testing

Your network plan needs testing, and to do that, the industry uses what are called tabletop exercises. Tabletop exercises, also called walk-throughs or TTXs, are a cost-effective way to test and validate your plan, procedures, network, and security policies. A tabletop exercise is a discussion-based event where staff or personnel with roles and responsibilities in a particular IT plan meet in a classroom setting or in breakout groups to discuss their roles during an emergency and their responses to a particular emergency. These are meant to be very informal environments, with an open discussion guided by an administrator through a scenario designed to meet predefined objectives. They can be a cost-effective tool to validate IT plans such as backups, contingency plans, and incident response plans. This ensures the plan content is viable and implementable in an emergency.

Tabletop Exercises

TTXs are a great way to test processes and plans. By communicating with the group members, you can stress-test all the procedures and other processes. You can use a TTX to perform a tabletop exercise, verify the existing plan, and identify areas that would break if you were to implement it. Tabletop exercises can do the following:

Help assess plans, policies, and procedures.
Identify gaps and challenges.
Clarify roles and responsibilities.
Identify additional mitigation and preparedness needs.
Provide hands-on training.
Highlight flaws in incident response planning.

Validation Tests

Once your plan is in place, next comes validation. Validation of the plan comes from listening to and reading all group participants' feedback. These sessions allow the plan, procedures, and policies to be revised. Before you design and execute any exercise, you must clearly understand what you want to achieve and plan your testing methods. Consider these questions:

What are the specific goals and outcomes?
What are the potential threats and risks that could disrupt your operations?
How will you measure and evaluate your performance and improvement?

You can focus your exercise on the most relevant and essential aspects by identifying your objectives first. Here is a process to use when running validation and tabletop tests:

Set Up the Initial Meeting Gather all cross-functional leads and domain experts and lay out the scenario. Provide all the facts regarding the scenario for evaluation and input for the next meeting.

Listen to Feedback In this second meeting, listen to the cross-functional teams. They will be able to provide good insights into how the plan will affect their respective functions. If it's too complicated, a third meeting might be required.

Conclusion Come to conclusions about what is possible. Everyone should be clear on what can be safely implemented, the risks that come with it, and the mitigation plans.

Document the Process A new practice, process, or function should be generated at the end of these exercises. Document it for future reference.

Summary

In this chapter, you learned the importance of providing both fault tolerance and high availability. You also learned about disaster recovery concepts. We discussed ensuring continued access to resources with load balancing, multipathing, and NIC teaming. Expanding on that concept, we looked at setting up clusters of routers, switches, and firewalls. Finally, we took up facilities redundancy with techniques such as UPS systems, PDUs, and generators and environmental issues such as HVAC systems and fire-suppression systems.

In disaster recovery you learned about hot, cold, warm, and cloud sites and how they fit into a disaster recovery plan. You also learned terms critical to planning for disaster recovery, such as MTTR, MTBF, RTO, and RPO. Finally, we covered backup operations for both configurations and system state, as well as planning and validating a tabletop test.

Exam Essentials

Understand the importance of fault tolerance and high availability techniques. These include load balancing, multipathing, NIC teaming, and router, switch, and firewall clusters.

Describe facilities and infrastructure redundancy techniques.
Among these are uninterruptible power supplies (UPSs), power distribution units (PDUs), generators, HVAC systems, fire suppression, and multiple Internet service providers (ISPs)/diverse paths.

Utilize disaster recovery techniques. These include physical cold sites, warm sites, hot sites, and cloud sites. Doing so also requires an understanding of RPO, RTO, MTTR, and MTBF.

Identify applications of active/active and active/passive configurations. These include switch clusters, VRRP and HSRP, and firewall clusters.

Written Lab