
Chapter 18
Network Troubleshooting Methodology

THE FOLLOWING COMPTIA NETWORK+ EXAM OBJECTIVES ARE COVERED IN THIS CHAPTER:

Domain 5.0 Network Troubleshooting

5.1 Explain the troubleshooting methodology.
  - Identify the problem
    - Gather information
    - Question users
    - Identify symptoms
    - Determine if anything has changed
    - Duplicate the problem, if possible
    - Approach multiple problems individually
  - Establish a theory of probable cause
    - Question the obvious
    - Consider multiple approaches
      - Top-to-bottom/bottom-to-top OSI model
      - Divide and conquer
  - Test the theory to determine the cause
    - If the theory is confirmed, determine the next steps to resolve the problem
    - If the theory is not confirmed, reestablish a new theory or escalate
  - Establish a plan of action to resolve the problem and identify potential effects
  - Implement the solution or escalate as necessary
  - Verify full system functionality and implement preventive measures if applicable
  - Document findings, actions, outcomes, and lessons learned throughout the process

5.2 Given a scenario, troubleshoot common cabling and physical interface issues.
  - Cable issues
    - Incorrect cable
      - Single mode vs. multimode
      - Shielded twisted pair (STP) vs. unshielded twisted pair (UTP)
    - Signal degradation
      - Crosstalk
      - Interference
      - Attenuation
    - Improper termination
    - Transmitter (TX)/Receiver (RX) transposed
  - Interface issues
    - Increasing interface counters
      - Cyclic redundancy check (CRC)
      - Runts
      - Giants
      - Drops
    - Port status
      - Error disabled
      - Administratively down
      - Suspended
  - Hardware issues
    - Power over Ethernet (PoE)
      - Power budget exceeded
      - Incorrect standard
    - Transceivers
      - Mismatch
      - Signal strength

5.3 Given a scenario, troubleshoot common issues with network services.
  - Switching issues
    - Incorrect VLAN assignment
    - ACLs
  - Route selection
    - Routing table
  - Incorrect default gateway
  - Incorrect IP address
    - Duplicate IP address
  - Incorrect subnet mask

5.4 Given a scenario, troubleshoot common performance issues.
  - Congestion/contention
  - Bottlenecking
  - Bandwidth
    - Throughput capacity
  - Latency
  - Packet loss
  - Jitter
  - Wireless
    - Interference
      - Channel overlap
    - Signal degradation or loss
    - Insufficient wireless coverage
    - Client disassociation issues
    - Roaming misconfiguration

There is no way around it. Troubleshooting computers and networks is a combination of art and science, and the only way to get really good at it is by doing it, a lot! So it's practice, practice, and practice with the basic yet vitally important skills you'll attain in this chapter. Of course, I'm going to cover all the troubleshooting topics you'll need to sail through the Network+ exam, but I'm also going to add some juicy bits of knowledge that will really help you to tackle the task of troubleshooting successfully in the real world.

First, you'll learn to check quickly for problems in the “super simple stuff” category, and then we'll move into a hearty discussion about a common troubleshooting model that you can use like a checklist to go through and solve a surprising number of network problems. We'll finish the chapter with a good briefing about some common troubleshooting resources, tools, tips, and tricks to keep up your sleeve and equip you even further.

I won't be covering any new networking information in this chapter because you've gotten all the foundational background material you need for troubleshooting in the previous chapters. But no worries. I'll go through each of the issues described in this chapter's objectives, one at a time, in detail, so that even if you've still got a bit of that previous material to nail down, you'll be good to get going and fix some networks anyway.

To find Todd Lammle CompTIA videos and practice questions, please see www.lammle.com.

Narrowing Down the Problem

When initially faced with a network problem in its entirety, it's easy to get totally overwhelmed. That's why it's a great strategy to start by narrowing things down to the source of the problem.
To help you achieve that goal, it's always wise to ask the right questions. You can begin doing just that with this list of questions to ask yourself: Did you check the super simple stuff (SSS)? Is hardware or software causing the problem? Is it a workstation or server problem? Which segments of the network are affected? Are there any cabling issues? Did You Check the Super Simple Stuff? Yes—it sounds like a snake's hiss (appropriate for a problem, right?), but exactly what's on the SSS list that you should be checking first, and why? Well, as the saying goes, “All things being equal, the simplest explanation is probably the correct one,” so you probably won't be stunned and amazed when I tell you that I've had people call me in and act like the sky is falling when all they needed to do was check to make sure their workstation was plugged in or powered on. (I didn't say “super simple stuff” for nothing!) Your SSS list really does include things that are this obvious—sometimes so obvious no one thinks to check for them. Even though anyone experienced in networking has their own favorite “DUH” events to tell about, almost everyone can agree on a few things that should definitely be on the SSS list: Check to verify login procedures and rights. Look for link lights and collision lights. Check all power switches, cords, and adapters. Look for user errors. The Correct Login Procedure and Rights You know by now that if you've set up everything correctly, your network's users absolutely have to follow the proper login procedure to the letter (or number, or symbol) to successfully gain access to the network resources they're after. If they don't do that, they will be denied access, and considering that there are truly tons of opportunities to blow it, it's a miracle, or at least very special, that anyone manages to log into the network correctly at all. Think about it. First, a user must enter their username and password flawlessly. 
Sounds easy, but as they say, “in a perfect world….” In this one, people mess up, don't realize it, and freak out at you about the “broken network” or the imaginary IT demon that changed their password on them while they went to lunch and now they can't log in. (The latter could be true—you may have done exactly that. If you did, just gently remind them about that memo you sent about the upcoming password-change date and time that they must have spaced about due to the tremendous demands on them.) Anyway, it's true. By far the most common problem is bad typing—people accidentally enter the wrong username or password, and they do that a lot. With some operating systems, a slight brush of the Caps Lock key is all it takes: The user's username and password are case sensitive, and suddenly, they're trying to log in with what's now all in uppercase instead—oops. Plus, if you happen to be running one of the shiny new operating systems around today, you can also restrict the times and conditions under which users can log in, right? So, if your user spent an unusual amount of time in the bathroom upon returning from lunch or if they got distracted and tried to log in from their BFF's workstation instead of their own, the network's operating system would've rejected their login request even though they still can type impressively well after two martinis. Remember, you can also restrict how many times a user can log in to the network simultaneously. If you've set that up and your user tries to establish more connections than you've allowed, access will again be denied. Just know that most of the time, if a user is denied access to the network and/or its resources, they're probably going to interpret that as a network problem even though the network operating system is doing what it should. Real World Scenario Can the Problem Be Reproduced? 
The first question to ask anyone who reports a network or computer problem is, “Can you show me what ‘not working’ looks like?” This is because if you can reproduce the problem, you can identify when it happens, which may give you all the information you need to determine the source of the problem and maybe even solve it in a snap. The hardest problems to solve are those of the random variety that occur intermittently and can't be easily reproduced. Let's pause for a minute to outline the steps to take during any user-oriented network problem-solving process: 1. Make sure the username and password are being entered correctly. 2. Check that Caps Lock key. 3. Try to log in yourself from another workstation, assuming that doing this doesn't violate the security policy. If it works, go back to the user-oriented login problems, and go through them again. 4. If none of this solves the problem, check the network documentation to find out whether any of the aforementioned kinds of restrictions are in place; if so, find out whether the user has violated any of them. Remember, if intruder detection is enabled on your network, a user will get locked out of their account after a specific number of unsuccessful login attempts. If this happens, either they'll have to wait until a predetermined time period has elapsed before their account will unlock and give them another chance or you'll have to go in and manually unlock it for them. Network Connection LED Status Indicators The link light is that little light-emitting diode (LED) found on both the network interface card (NIC) and the switch. It's typically green and labeled Link or some abbreviation of that. If you're running 100/1000BaseT, a link light indicates that the NIC and switch are making a logical (Data Link layer) connection. 
If the link lights are lit up on both the workstation's NIC and the switch port to which the workstation is connected, it's usually safe to assume that the workstation and switch are communicating just fine. The link lights on some NICs don't activate until the driver is loaded. So, if the link light isn't on when the system is first turned on, you'll have to wait until the operating system loads the NIC driver. But don't wait forever! The collision light is also a small LED, but it's typically amber in color, and it can usually be found on both Ethernet NICs and hubs. When lit, it indicates that an Ethernet collision has occurred. If you've got a busy Ethernet network on which collisions are somewhat common, understand that this light is likely to blink occasionally; if it stays on continuously, though, it could mean that there are way too many collisions happening for legitimate network traffic to get through. Don't assume this is really what's happening without first checking that the NIC or other network device is working properly because one or both could simply be malfunctioning. Don't confuse the collision light with the network-activity or network-traffic light (which is usually green) because the latter indicates that a device is transmitting. This particular light should be blinking on and off continually as the device transmits and receives data on the network. The Power Switch Clearly, to function properly, all computer and network components must be turned on and powered up first. Obvious, yes, but if I had a buck for each time I've heard, “My computer is on, but my monitor is all dark,” I'd be rolling in money by now. When this kind of thing happens, just keep your cool and politely ask, “Is the monitor turned on?” After a little pause, the person calling for help will usually say, “Ohhh … ummmm … thanks,” and then hang up ASAP. 
The reason I said to be nice is that, embarrassing as it is, this, or something like it, will probably happen to you, too, eventually. Most systems include a power indicator (a Power or PWR light). The power switch typically has an On indicator, but the system or device could still be powerless if all the relevant power cables aren't actually plugged in—including the power strip. Remember that every cable has two ends, and both must be plugged into something. If you're thinking something like, “Sheesh—a four-year-old knows that,” you're probably right. But again, I can't count the times this has turned out to be the root cause of a “major system failure.” The best way to go about troubleshooting power problems is to start with the most obvious device and work your way back to the power-service panel. There could be a number of power issues between the device and the service panel, including a bad power cable, bad outlet, bad electrical wire, tripped circuit breaker, or blown fuse, and any of these things could be the actual cause of the problem that appears to be device-death instead. Operator Error Or, the problem may be that you've got a user who simply doesn't know how to be one. Maybe you're dealing with someone who doesn't have the tiniest clue about the equipment they're using or about how to perform a certain task correctly—in other words, the problem may be due to something known as operator error (OE). Here's a short list of the most common types of OEs and their associated acronyms: Equipment exceeds operator capability (EEOC) Problem exists between chair and keyboard (PEBCAK) ID Ten T error (an ID10T) A word of caution here, though—assuming that all your problems are user-related can quickly make an ID10T error out of you. Although it can be really tempting to take the easy way out and blow things off, remember that the network's well-being and security are ultimately your responsibility. 
So, before you jump to the operator-error conclusion, ask the user in question to reproduce the problem in your presence, and pay close attention to what they do. Understand that doing this can require a great deal of patience, but it's worth your time and effort if you can prevent someone who doesn't know what they're doing from causing serious harm to pricey devices or leaving a gaping hole in your security. You might even save the help-desk crew's sanity from the relentless calls of a user with the bad habit of flipping off the power switch without following proper shutdown procedures. You just wouldn't know they always do that if you didn't see it for yourself, right? And what about finding out that that pesky user was, in fact, trained really badly by someone and that they aren't the only one? This is exactly the kind of thing that can turn the best security policy to dust and leave your network and its resources as vulnerable to attack as that goat in Jurassic Park. The moral here is, always check out the problem thoroughly. If the problem and its solution aren't immediately clear to you, try the procedure yourself, or ask someone else at another workstation to do so. Don't just leave the issue unsettled or make the assumption that it is user error or a chance abnormality because that's exactly what the bad guys out there are hoping you'll do. This is only a partial list of super simple stuff. No worries. Rest assured you'll come up with your own expanded version over time. Is Hardware or Software Causing the Problem? A hardware problem often rears its ugly head when some device in your computer skips a beat and/or dies. This one is pretty easy to discern because when you try to do something requiring that particular piece of hardware, you can't do it and instead get an error telling you that you can't do it. Even if your hard disk fails, you'll probably get warning signs before it actually kicks, like a Disk I/O error or something similar. 
Other problems drop out of the sky and hit you like something from the wrong end of a seagull. No warning at all—just splat! Components that were humming along fine a second ago can and do suddenly fail, usually at the worst possible time, leaving you with a mess of lost data, files, everything—you get the idea. Solutions to hardware problems usually involve one of three things: Changing hardware settings Updating device drivers Replacing dead hardware If your hardware has truly failed, it's time to get out your tools and start replacing components. If this isn't one of your skills, you can either send the device out for repair or replace it. Your mantra here is “back up, back up, back up,” because in either case, a system could be down for a while—anywhere from an hour to several days—so it's always good to keep backup hardware around. And I know everyone and your momma has told you this, but here it is one more time: Back up all data, files, hard drive, everything, and do so on a regular basis. Software problems are muddier waters. Sometimes you'll get General Protection Fault messages, which indicate a Windows or Windows program (or other platform) error of some type, and other times the program you're working in will suddenly stop responding and hang. At their worst, they'll cause your machine to randomly lock up on you. When this type of thing happens, I'd recommend visiting the manufacturer's support website to get software updates and patches or searching for the answer in a knowledge base. Sometimes you get lucky and the ailing software will tell the truth by giving you a precise message about the source of the problem. Messages saying the software is missing a file or a file has become corrupt are great because you can usually get your problem fixed fast by providing that missing file or by reinstalling the software. 
Neither solution takes very long, but the downside is that whatever you were doing before the program hosed will probably be at least partially lost; so again, back up your stuff, and save your data often. It's time for you to learn how to troubleshoot your workstations and servers. Is It a Workstation or a Server Problem? The first thing you've got to determine when troubleshooting this kind of problem is whether it's only one person or a whole group that's been affected. If the answer is only one person (think, a single workstation), solving the issue will be pretty straightforward. More than that and your problem probably involves a chunk of the network, like a segment. A clue that the source of your grief is the latter case is if there's a whole bunch of users complaining that they can't discover neighboring devices/nodes. So either way, what do you do about it? Well, if it's the single-user situation, your first line of defense is to try to log in from another workstation within the same group of users. If you can do that, the problem is definitely the user's workstation, so look for things like cabling faults, a bad NIC, power issues, and OSs. But if a whole department can't access a specific server, take a good, hard look at that particular server, and start by checking all user connections to it. If everyone is logged in correctly, the problem may have something to do with individual rights or permissions. If no one can log in to that server, including you, the server probably has a communication problem with the rest of the network. And if the server has totally crashed, either you'll see messages telling you all about it on the server's monitor or you'll find its screen completely blank—screaming indicators that the server is no longer running. And keep in mind that these symptoms do vary among network operating systems. Which Segments of the Network Are Affected? Figuring this one out can be a little tough. 
If multiple segments are affected, you may be dealing with a network-address conflict. If you're running Transmission Control Protocol/Internet Protocol (TCP/IP), remember that IP addresses must be unique across an entire network. So, if two of your segments have the same static IP subnet addresses assigned, you'll end up with duplicate IP errors—an ugly situation that can be a real bear to troubleshoot and can make it tough to find the source of the problem. If all of your network's users are experiencing the problem, it could be a server everyone accesses. Thank the powers that be if you nail it down to that because if not, other network devices like your main router or hub may be down, making network transmissions impossible and usually meaning a lot more work on your part to fix. Adding wide area network (WAN) connections to the mix can complicate matters exponentially, and you don't want to go there if you can avoid it, so start by finding out if stations on both sides of a WAN link can communicate. If so, get the champagne—your problem isn't related to the WAN—woo hoo! But if those stations can't communicate, it's not a happy thing: You've got to check everything between the sending station and the receiving one, including the WAN hardware, to find the culprit. The good news is that most of the time, WAN devices have built-in diagnostics that tell you whether a WAN link is working okay, which really helps you determine if the failure has something to do with the WAN link itself or with the hardware involved instead. Is It Bad Cabling? Back to hooking up correctly…. Once you've figured out whether your plight is related to one workstation, a network segment, or the whole tamale (network), you must then examine the relevant cabling. Are the cables properly connected to the correct port? More than once, I've seen a Digital Subscriber Line (DSL) modem connection to the wall cabled all wrong—it's an easy mistake to make and an easy one to fix. 
And you know that nothing lasts forever, so check those patch cables running between a workstation and a wall jack. Just because they don't come with expiration dates written on them doesn't mean they don't expire. They do go bad, especially if they get moved, trampled on, or tripped over a lot. (I did tell you that it's a bad idea to run cabling across the office floor, didn't I?) Connection problems are the tell here: if you check the NIC and the link light isn't lit, you may have a bad patch cable to blame.

It gets murkier if your cable in the walls or ceiling is toast or hasn't been installed correctly. Maybe you've got a user or two telling you the place is haunted because they only have problems with their workstations after dark when the lights go on. Haunted? No … some genius probably ran a network cable over a fluorescent light, which just happens to produce lots of electromagnetic interference (EMI) that can really mess up communications in that cable.

Next on your list is to check the medium dependent interface/medium dependent interface-crossover (MDI/MDI-X) port setting on small workgroup hubs and switches. This is a potential source of trouble that's often overlooked, but it's important because this port is the one that's used to uplink to a switch on the network's backbone. First, understand that the port has to be set to either MDI or MDI-X depending on the type of cable used for your hub-to-hub or switch-to-switch connection. For instance, assuming the far end is an ordinary (MDI-X) switch port, the crossover cables I talked about way back in Chapter 3, “Networking Connectors and Wiring Standards,” require that the uplink port be set to MDI-X, and a standard straight-through patch cable requires that the port be set to MDI. You can usually adjust the setting via a regular switch or a dual inline package (DIP) switch, but to be sure, if you're still using hubs, check out the hub's documentation. (You did keep that, right?)
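The rule underneath the MDI/MDI-X setting can be sketched as a simple parity check: a link comes up only when the transmit pair at one end lands on the receive pair at the other, which means the path must contain an odd number of crossovers. The following is a minimal illustration under that assumption; the function and dictionary names are made up for this example, and it models fixed ports only (no automatic MDI-X sensing).

```python
# An MDI-X port crosses TX/RX internally (counts as 1 crossover); an MDI
# port does not (0). A crossover cable adds 1; a straight-through adds 0.
PORT_CROSSOVERS = {"MDI": 0, "MDI-X": 1}
CABLE_CROSSOVERS = {"straight-through": 0, "crossover": 1}


def link_expected(port_a: str, cable: str, port_b: str) -> bool:
    """Return True if TX at each end reaches RX at the other end."""
    total = (PORT_CROSSOVERS[port_a]
             + CABLE_CROSSOVERS[cable]
             + PORT_CROSSOVERS[port_b])
    return total % 2 == 1  # an odd number of crossovers is required


def cable_for(port_a: str, port_b: str) -> str:
    """Pick the cable type that makes the crossover count odd."""
    for cable in CABLE_CROSSOVERS:
        if link_expected(port_a, cable, port_b):
            return cable
    raise ValueError("no cable type works")  # unreachable for valid ports


if __name__ == "__main__":
    print(cable_for("MDI", "MDI-X"))    # straight-through
    print(cable_for("MDI-X", "MDI-X"))  # crossover
```

Unlike port types pair with a straight-through cable, and like port types need a crossover cable, which is why flipping the uplink setting changes which cable you reach for.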
Cable Considerations Cable installs should fall within the specifications for a successful installation, such as for speed and distance. However, the type of installation should also be considered when planning an installation of cabling. The cabling might require flexibility or strength running up a wall. If the cable is run in a ventilated space, there may also be fire code considerations. The following sections will discuss the common considerations for cable installation. Shielded and Unshielded Unshielded twisted-pair (UTP) is the most common cabling for Ethernet networks today due to its cost and ease of installation. However, it is unshielded from electromagnetic interference (EMI), which is why its use can be problematic in areas where EMI exists. Therefore, UTP should be avoided in close proximity to heavy industrial equipment that can emit EMI. Shielded twisted-pair (STP) is not as common for Ethernet cabling as UTP, due to its cost and difficult installation. However, it is shielded for EMI, and therefore it should be used in industrial settings where EMI is present. It is important to note that there are several different types of STP cable. Some STP cabling has a foil shielding around all four pairs of wires, some is foil shielded around each pair of wires with an overall foil shielding, and some cabling is shielded with a wire mesh. The consideration is more shielding, and a heavier shield increases the cost and lowers the chance that EMI will affect data transfer. When installing cable in an industrial setting such as a factory where cabling is exposed to vibrations, chemicals, temperature, and EMI, the MICE (Mechanical, Ingress, Climatic, Chemical, and Electromagnetic) classification should be followed. The standard is defined in an ISO/IEC (International Organization for Standardization/International Electrotechnical Commission) publication. 
It is best to engage an engineer to define the type of cabling to use when in doubt for an industrial setting because safety can be compromised.

Plenum and Riser-Rated

Riser-rated cable is used when cable is run between floors in a non-plenum area. The cable is made with a polyvinyl chloride (PVC) plastic jacket. However, if PVC cabling catches fire, it emits a toxic black smoke and hydrochloric acid that irritates the lungs and eyes. Therefore, when installing cabling in circulated airspace such as HVAC ducting and air returns, also called plenum areas, the type of cabling should be considered. Plenum cable is jacketed with Teflon or another fire-retardant material. It does not emit toxic vapors when burned or heated, and it is more expensive than PVC cable. It is specified in the National Electrical Code (NEC), which is published by the National Fire Protection Association (NFPA). Because what counts as a circulated airspace is highly subjective, when in doubt use plenum cabling. You will not want to be responsible for failing a code inspection because a code inspector defines a cabling passage as an airspace.

Cable Application

Cables can be used for many different applications. The most common is obviously Ethernet host connectivity. However, a network cable can be used for several other purposes, as I will describe.

Rollover Cable/Console Cable

A rollover cable is typically flat stock cable that contains eight wires. A rollover cable is unmistakably different from an Ethernet cable, mainly because it is flat and each wire is a different color. A rollover cable is crimped with an RJ-45 end with pin 1 matched with wire 1 on one side. The other side is also crimped with an RJ-45; however, pin 1 is matched with wire 8. So pin 1 is connected to pin 8 on the other side, pin 2 is connected to pin 7, pin 3 is connected to pin 6, pin 4 is connected to pin 5, and so on, until pin 8 is connected with pin 1 on the other side, as shown in Table 18.1.
TABLE 18.1 Rollover cable pinouts

Side A    Side B
Pin 1     Pin 8
Pin 2     Pin 7
Pin 3     Pin 6
Pin 4     Pin 5
Pin 5     Pin 4
Pin 6     Pin 3
Pin 7     Pin 2
Pin 8     Pin 1

Rollover cables are used with an EIA/TIA adapter that converts a DB-9 serial port to an RJ-45 end. The opposite end plugs directly into the router or switch for console access, as shown in Figure 18.1.

FIGURE 18.1 A router/switch console connection

Over the years I've seen variations on the cable used for a console connection. The EIA/TIA adapter can also be wired to use a standard Ethernet cable, so it is always best to read the manual before making any connections. It is also becoming very popular for routers and switches to furnish a mini-USB connection so that when they are plugged into a laptop or computer, the operating system detects the device as a USB serial COM port.

Crossover Cable

When connecting a switch to a switch, a router to a router, or a host to a host, the cabling often needs to be crossed over. This means that the transmit pairs are crossed over to the receive pairs on the other side of the cable and vice versa. This is easily achieved with a crossover cable, which has the EIA/TIA 568A wiring specification crimped on one end and the EIA/TIA 568B wiring specification on the other end. Table 18.2 lists the EIA/TIA 568A and 568B wiring specifications.

TABLE 18.2 EIA/TIA 568 crossover cabling

RJ-45 pin    568A            568B
Pin 1        White/green     White/orange
Pin 2        Green           Orange
Pin 3        White/orange    White/green
Pin 4        Blue            Blue
Pin 5        White/blue      White/blue
Pin 6        Orange          Green
Pin 7        White/brown     White/brown
Pin 8        Brown           Brown

Problems can arise when a straight-through cable connects a switch to a switch, a router to a router, or a host to a host. You simply won't get a link light. However, most newer routers and switches have automatic medium dependent interface crossover (auto-MDI-X) ports that sense when a similar device is plugged in and automatically cross the signals over.
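The two pinout tables above can be checked with a few lines of code. This is only a sketch: the function names are invented for illustration, and the color lists are copied from Table 18.2. It derives the pin-to-pin mapping of a cable by following each colored wire from one end to the other.

```python
# 568A and 568B color order for pins 1-8, as listed in Table 18.2.
T568A = ["white/green", "green", "white/orange", "blue",
         "white/blue", "orange", "white/brown", "brown"]
T568B = ["white/orange", "orange", "white/green", "blue",
         "white/blue", "green", "white/brown", "brown"]


def rollover_map() -> dict[int, int]:
    """Pin n on side A connects to pin 9 - n on side B (Table 18.1)."""
    return {pin: 9 - pin for pin in range(1, 9)}


def pin_map(end_a: list[str], end_b: list[str]) -> dict[int, int]:
    """Follow each colored wire from its pin on end A to its pin on end B."""
    return {i + 1: end_b.index(color) + 1 for i, color in enumerate(end_a)}


if __name__ == "__main__":
    print(rollover_map())
    # 568A on one end, 568B on the other: pins 1<->3 and 2<->6 swap.
    print(pin_map(T568A, T568B))
    # Same standard on both ends: every pin maps to itself.
    print(pin_map(T568B, T568B))
```

With 568A on one end and 568B on the other, pins 1 and 2 land on pins 3 and 6, which is exactly the transmit-to-receive swap a crossover cable exists to provide; pins 4, 5, 7, and 8 pass straight through.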
A valuable tool to have in your toolbox is a small length of cable that acts as a crossover cable and a female-to-female RJ-45 adapter. If there is doubt that the connection requires a crossover cable, you can pop this small crossover cable onto the existing cable and verify that the link light comes on. Other Important Cable Issues Causing Performance Issues They may be basic, but they're still vital—an understanding of the physical issues that can happen on a network when a user is connected via cable (usually Ethernet) is critical information to have in your troubleshooting repertoire. Because many of today's networks still consist of large amounts of copper cable, they suffer from the same physical issues that have plagued networking since the very beginning and have throughput capacity issues and signal degradation. Newer technologies and protocols have helped to a degree, but they haven't made these issues a thing of the past yet. Some physical issues that still affect networks are listed and defined next: Incorrect Pinout/TX/RX Reverse/Damaged Cable The first things to check when working on cabling are the cable connectors to make sure they haven't gone bad. After that, look to make sure the wiring is correct on both ends by physically checking the cable pinouts. Important to remember is that if you have two switches, you need a crossover cable where you cross pins 1 and 2 with 3 and 6. On the other hand, if you have a PC going into a switch, you need a straight-through cable where pins 1 and 2 correspondingly connect to pins 1 and 2 on each side—the same with 3 and 6. Finally, make sure the termination pins on both ends are the correct type for the kind of cable you're using. Bad Port In some cases, the issue is not the cable but the port into which the cable is connected. On many devices, ports have LEDs that can alert you to a bad port. 
For example, a Cisco router or switch will have an LED for each port, and the color of the LED will indicate its current state. In most cases, a lack of any light whatsoever indicates an issue with the port. Loopback plugs can be used to test the functionality of a port. These devices send a signal out and then back in the port to test it.

Transceiver Mismatch

Interfaces that send and receive are called transceivers. When a NIC is connected to a port, the two transceivers must agree on two settings, speed and duplex, or issues will occur. If the speed settings do not match, there will be no communication. If the duplex settings do not match, the link may function, but the performance will be poor. A mismatch in transceiver standards (for example, optics built for a different fiber type or wavelength than the far end expects) can cause signal strength issues.

Crosstalk

Again, looking back to Chapter 3, remember that crosstalk is what happens when there's signal bleed between two adjacent wires that are carrying a current. Network designers minimize crosstalk inside network cables by twisting the wires of each pair together so that the interference picked up along one twist is largely canceled by the next. The tighter the wires are twisted, the less crosstalk you have, and newer cables like Cat 6 cable really make a difference. But the problem hasn't gone away completely: crosstalk still exists and affects communications, especially in high-speed networks. It is often caused by using the wrong category of cable or by pairing a wire from one pair with a wire from another when terminating a cable.

Near-End/Far-End Crosstalk

Near-end crosstalk (NEXT) is a specific type of crosstalk measurement that has to do with the EMI bled from a wire to adjoining wires at the end where the current originates. This particular point has the strongest potential to create crosstalk issues because the crosstalk signal itself degrades as it moves down the wire. If you have a problem with it, it's probably going to show up in the first part of the wire where it's connected to a switch or a NIC.
Far-end crosstalk is the interference between two pairs of a cable measured at the far end of the cable with respect to the interfering transmitter. This condition is often caused by improperly terminating a cable. For example, it's important to maintain the twist right up to the punch-down or crimp connector. In the case of crimp connectors, it's critical to select the correct grade of connector even though one grade may look identical to another.

Attenuation/dB Loss/Distance Limitation
As a signal moves through any medium, the medium itself will degrade the signal, a phenomenon known as attenuation that's common in all kinds of networks. True, signals traversing fiber-optic cable don't attenuate as fast as those on copper cable, but they still do eventually. You know that all copper twisted-pair cables have a maximum segment distance of 100 meters before the signal needs to be amplified, or repeated, by a hub or a switch, but single-mode fiber-optic cables can sometimes carry signals for miles before attenuation degrades them too far. If you need to go far, use fiber, not copper.

Latency
Latency is the delay typically incurred in the processing of network data. A low-latency network connection is one that generally experiences short delay times, while a high-latency connection generally suffers from long delays. Many security solutions may negatively affect latency. For example, routers take a certain amount of time to process and forward any communication. Configuring additional rules on a router generally increases latency, thereby resulting in longer delays. An organization may decide not to deploy certain security solutions because of the negative effects they will have on network latency. Auditing is a great example of a security solution that affects latency and performance.
When auditing is configured, it records certain actions as they occur. The recording of these actions may affect the latency and performance.

Packet Loss
Protocol data units (PDUs) describe how data is encapsulated or de-encapsulated at each layer of the OSI and/or DoD model. At the Network or Internet layer, the PDU is a packet, a small unit of the data stream transmitted on a network from a source to a destination. You'll see packet loss when a network packet fails to reach its destination. Common causes are low wireless signal strength, interference, excessive noise, software corruption, and high CPU utilization on hosts. Network congestion can cause packet loss too, and lost packets that must be retransmitted add to the congestion.

Jitter
Jitter occurs when the delay between packets in a connection is inconsistent; that is, it increases and decreases in no discernible pattern. Jitter results from network congestion, timing drift, and route changes. Jitter is especially problematic in real-time communications like IP telephony and videoconferencing.

Collisions
A network collision happens when two devices try to transmit on the same physical segment simultaneously. Collisions like this were a big problem in the early Ethernet networks, and an access method known as carrier sense multiple access with collision detection (CSMA/CD) was used to detect and respond to them. Nowadays, we use switches in place of hubs because they separate the network into multiple collision domains, learn the Media Access Control (MAC) addresses of the devices attached to them, and forward each frame only out the port where the destination device is attached, which prevents collisions.

Shorts
Basically, a short circuit, or short, happens when the current flows through a different path within a circuit than it's supposed to; in networks, they're usually caused by some physical fault in the cable.
You can find shorts with circuit-testing equipment, but because sooner is better when it comes to getting a network back up and running, replacing the ailing cable until it can be fixed (if it can be) is your best option. Open Impedance Mismatch (echo) Open impedance on cable-testing equipment tells you that the cable or wires connect into another cable, and there is an impedance mismatch. When that happens, some of the signal will bounce back in the direction it came from, degrading the strength of the signal, which ultimately causes the link to fail. Interference/Cable Placement EMI and radio frequency interference (RFI) occur when signals interfere with the normal operation of electronic circuits. Computers happen to be really sensitive to sources of this, such as TV and radio transmitters, which create a specific radio frequency as part of their transmission process. Two other common culprits are two-way radios and cellular phones. Your only way around this is to use shielded network cables like shielded twisted-pair (STP) and coaxial cable (rare today) or to run EMI/RFI-immune but pricey fiber-optic cable throughout your entire network. Split Pairs A split pair is a wiring error where two connections that are supposed to be connected using the two wires of a twisted pair are instead connected using two wires from different pairs. Such wiring causes errors in high-rate data lines. If you buy your cables precut, you won't have this problem. TX/RX Transposed Just like the first item discussed in this section, incorrect pinout, when connecting from a PC-type device into a switch, for the PC use pins 1 and 2 to transmit and 3 and 6 for receiving a digital signal. This means that the pins must be reversed on the switch, using pins 1 and 2 for receiving and 3 and 6 for transmitting the digital signal. If your connection isn't working, check the cable end pinouts. 
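The pinout rules described in this section can be sketched in a few lines of Python. This is only an illustration for the two device types the text mentions (a PC and a switch) on the 10/100 Mbps pins 1, 2, 3, and 6; the names STRAIGHT_THROUGH, CROSSOVER, TX_PINS, classify_cable, and expected_cable are hypothetical helpers, and the sketch assumes the ports don't automatically correct for the cable type.

```python
# Near-end pin -> far-end pin for the two pairs used by 10/100 Mbps Ethernet.
STRAIGHT_THROUGH = {1: 1, 2: 2, 3: 3, 6: 6}
CROSSOVER = {1: 3, 2: 6, 3: 1, 6: 2}

# Which pins each device type transmits on (per the text: a PC transmits
# on 1 and 2, so a switch must transmit on 3 and 6).
TX_PINS = {"pc": (1, 2), "switch": (3, 6)}


def classify_cable(pin_map):
    """Classify a tested cable from its near-end -> far-end pin map."""
    used = {pin: pin_map.get(pin) for pin in (1, 2, 3, 6)}
    if used == STRAIGHT_THROUGH:
        return "straight-through"
    if used == CROSSOVER:
        return "crossover"
    return "miswired"


def expected_cable(device_a, device_b):
    """Return the cable type needed between two device types."""
    for device in (device_a, device_b):
        if device not in TX_PINS:
            raise ValueError(f"unknown device type: {device}")
    # Devices that transmit on the same pins need TX and RX crossed.
    if TX_PINS[device_a] == TX_PINS[device_b]:
        return "crossover"
    return "straight-through"
```

For example, expected_cable("switch", "switch") returns "crossover", and a cable tester reading of {1: 1, 2: 2, 3: 6, 6: 3} is classified as "miswired", which is the kind of result that should send you back to check the cable end pinouts.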
Bent Pins
Many of the connectors you will encounter have small pins on the end that must go into specific holes on the interface to which they plug. If these pins get bent, either they won't go into the correct hole or they won't go into a hole at all. When this occurs, the cable either will not work at all or will not work correctly. Taking care not to bend these fragile pins when working with these cable types will prevent this issue from occurring.

Bottlenecks/Bottlenecking
Bottlenecks are areas of the network where the physical infrastructure cannot handle the traffic. In some cases, this is a temporary issue caused by an unusual burst of traffic. In other scenarios, the bottleneck is a wake-up call to upgrade the infrastructure or reorganize the network. The apparent result of a bottleneck is poor performance.

Fiber Cable Issues
Fiber is definitely the best kind of wiring to use for long-distance runs because it has the least attenuation at long distances compared to copper. The bad news is that it's also the hardest to troubleshoot. First, let's understand the difference between the fiber modes and then go on to troubleshooting. There are two major types of fiber optics: single-mode and multimode. Figure 18.2 shows the differences between multimode and single-mode fibers.

FIGURE 18.2 Multimode and single-mode fibers

Single mode is more expensive, has tighter cladding around a smaller core, and can go much farther distances than multimode. The smaller core means that only one mode of light will propagate down the fiber. Multimode has a larger core, so it allows multiple modes of light to travel down the glass. These modes arrive at the receiving end at slightly different times, so the usable distance is less than that with single-mode fiber.
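To put rough numbers on why one fiber type reaches farther than another, here's a minimal sketch of a loss budget: total dB loss is the fiber's attenuation per kilometer times the length, plus a little loss for each connector and splice. Every name and default value here (link_loss_db, link_ok, 0.5 dB per connector, a 3 dB safety margin) is an assumption for illustration only; real figures come from the datasheets for your cable and transceivers.

```python
def link_loss_db(length_km, fiber_db_per_km, connectors=2,
                 connector_loss_db=0.5, splices=0, splice_loss_db=0.1):
    """Estimate total loss (dB) across a fiber run.

    All default loss values are illustrative assumptions.
    """
    return (length_km * fiber_db_per_km
            + connectors * connector_loss_db
            + splices * splice_loss_db)


def received_power_dbm(tx_power_dbm, loss_db):
    """Power arriving at the receiver after the link loss is subtracted."""
    return tx_power_dbm - loss_db


def link_ok(tx_power_dbm, rx_sensitivity_dbm, loss_db, margin_db=3.0):
    """True if the received power, less a safety margin, still meets
    the receiver's minimum sensitivity."""
    return received_power_dbm(tx_power_dbm, loss_db) - margin_db >= rx_sensitivity_dbm


# Hypothetical comparison: the same 10 km run on a low-loss fiber
# (assumed 0.4 dB/km) and a higher-loss fiber (assumed 3.0 dB/km).
low_loss = link_loss_db(10, 0.4)
high_loss = link_loss_db(10, 3.0)
print(link_ok(-3.0, -20.0, low_loss))
print(link_ok(-3.0, -20.0, high_loss))
```

With these assumed numbers, the low-loss run stays inside the budget and the high-loss run does not, which is the arithmetic behind picking the right fiber for the distance.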
Here are some common fiber issues to be aware of: SFP/GBIC (Cable Mismatch) The small form-factor pluggable (SFP) is a compact, hot-pluggable transceiver used for networking and other types of equipment. It interfaces a network device motherboard for a switch, router, media converter, or similar device to a fiber-optic or copper networking cable. Due to its smaller size, SFP obsolesces the formerly ubiquitous gigabit interface converter (GBIC), so SFP is sometimes referred to as a mini-GBIC. Always make sure you have the right cable for each type of connector type and that they are not mismatched. Bad SFP/GBIC (Cable or Transceiver) If your link is down, verify that your cable or transceiver hasn't gone bad. Also, if the termination end is GBIC or SFP based, many network systems have console commands that output statistics on the status of the device. You can also swap out the SFP/GBIC to verify if it is faulty or not. Wavelength Mismatch One of the more confusing terms used regarding fiber networks is wavelength. Though it sounds very complicated and scientific, it's actually just the term used to define what we think of as the color of light. Wavelength mismatch occurs when two different fiber transmitters at each end of the cable are using either a longer or shorter wavelength. This means you've got to make sure your transmitters match on both ends of the cable. Fiber Type Mismatch Fiber type mismatches, at each of the transceivers, can cause wavelength issues, massive attenuation, and dB loss. Dirty Connectors It's important to verify your connectors to make sure no dirt or dust has corrupted the cable end. You need to polish your cable ends with a soft cloth, but do not look into the cable if the other end is transmitting—it could damage your eyes! Connector Mismatch Just because it fits doesn't mean it works. Make sure you have precisely the right connectors for each type of cable end or transceiver. 
Bend Radius Limitations Fiber, whether it is made of glass or plastic, can break. You need to make sure you understand the bend radius limitations of each type of fiber you purchase and that you don't exceed the specifications when installing fiber in your rack. Distance Limitations The pros of fiber are that it's completely immune to EMI and RFI and that it can transmit up to 40 kilometers—about 25 miles! Add some repeater stations and you can go between continents. But all fiber types aren't created equally. For example, single mode can perform at much greater distances than multimode can. And again, make sure you have the right cable for the distance you'll require to run your fiber! Unbounded Media Issues (Wireless) Now let's say your problem-ridden user is telling you they use only a wireless connection. Well, you can definitely take crosstalk and shorts off the list of suspects, but don't get excited, because with wireless, you've got a whole new bunch of possible Physical layer problems to sort through. Wireless networks are really convenient for the user but not so much for administrators. They can require a lot more configuration; with wireless networks, you don't just get to substitute one set of challenges for another—you pretty much add all those fresh new issues on top of the wired challenges you already have on your plate. The following are some of those new wireless challenges: Interference Because wireless networks rely on radio waves to transmit signals, they're more subject to interference, even from other wireless devices like Bluetooth keyboards, mice, or cell phones that are all close in frequency ranges. Any of these—even microwave ovens!—can cause signal bleed that can slow down or prevent wireless communications. Factors like the distance between a client and a wireless access point (WAP) and the stuff between the two can also affect signal strength and even intensify the interference from other signals. 
So, careful placement of that WAP is a must.

Device Saturation/Bandwidth Saturation Limits the Throughput Capacity
Clearly, it's important to design and implement your wireless network correctly. Be sure to understand the number of hosts connecting to each AP you'll be installing. If you have too much device saturation on an AP, it will result in low available bandwidth. Just think about when you're in a hotel and how slow the wireless is because of throughput issues. This is directly due to device/bandwidth saturation for each AP. And more APs don't always solve the problem; you need to design correctly!

Simultaneous Wired/Wireless Connections
It's not unusual to find that a laptop today will have a wired and wireless connection simultaneously. Typically this doesn't create a problem, but don't think you get more bandwidth or better results because of it. It's possible that the configurations can cause a problem, although that's rare today. For instance, if each connection provides a DNS server with a different address, it can cause name resolution issues, or even default gateway issues. Most of the time, it just causes confusion in your laptop, making it work harder to determine the correct DNS or default gateway address to use. And the laptop can give up and stop communicating completely! Because of this, you need to remind the user to turn off their wireless when they take the laptop into their office and connect it to their dock.

Configurations
Mistakes in the configuration of the wireless access point or wireless router or inconsistencies between the settings on the AP and the stations can also be the source of problems. The following list describes some of the main sources of configuration problems.
Incorrect Encryption/Security Type Mismatch
You know that wireless networks can use encryption to secure their communications and that different encryption flavors are used for wireless networks, like Wired Equivalent Privacy (WEP) and Wi-Fi Protected Access 3 (WPA3) with Advanced Encryption Standard (AES). WPA3 is the latest standard, and it is common now. To ensure the tightest security, configure your wireless networks with the highest encryption protocol that both the WAP and the clients can support. Oh, and make sure the AP and its clients are configured with the same type of encryption. This is why it's a good idea to temporarily disable security while troubleshooting client problems, because if the client can connect once you've done that, you know you're dealing with a security configuration error.

Incorrect, Overlapping, or Mismatched Channels
Wireless networks use many different frequencies within the 2.4 GHz or 5 GHz band, and I'll bet you didn't know that these frequencies are sometimes combined to provide greater bandwidth for the user. Actually, you do know about this, because that combination of frequencies is exactly what a channel is. It's also the reason some radio stations come in better than others: they have more bandwidth because their channel has more combined frequencies. You also know what happens when the AP and the client aren't quite matching up. Have you ever hit the scan button on your car's radio and only kind of gotten a station's static-ridden broadcast? That's because the AP (radio station) and the client (your car's radio) aren't quite on the same channel. Most of the time, wireless networks use channel 1, 6, or 11, and because clients auto-configure themselves to any channel the AP is broadcasting on, it's not usually a configuration issue unless someone has forced a client onto an incorrect channel. Also, be sure not to use the same channel on APs within the same area.
Overlapping channels cause your signal-to-noise ratio to drop because you'll get a ton of interference and signal loss! Incorrect Frequency/Incompatibilities So, setting the channel sets the frequency or frequencies that wireless devices will use. But some devices, such as an AP running 802.11ac and ax, allow you to tweak those settings and choose a specific frequency such as 2.4 GHz or 5 GHz. As with any relationship, it works best if things are mutual. So if you do this on one device, you've got to configure the same setting on all the devices with which you want to communicate, or they won't—they'll argue, and you don't want that. Incorrect-channel and frequency-setting problems on a client are rare, but if you have multiple APs and they're in close proximity, you need to make sure they're on different channels/frequencies to avoid potential interference problems. Wrong Passphrase When a passphrase is used as an authentication method, the correct passphrase must be entered when authenticating to the AP or to the controller. When an incorrect passphrase is provided, access will be denied. This is another issue that will impair functionality. SSID Mismatch When a wireless device comes up, it scans for service set identifiers (SSIDs) in its immediate area. These can be basic service set identifiers (BSSIDs) that identify an individual access point or extended service set identifiers (ESSIDs) that identify a set of APs. In your own wireless LAN, you clearly want the devices to find the ESSID that you're broadcasting, which isn't usually a problem: Your broadcast is closer than the neighbor's, so it should be stronger—unless you're in an office building or apartment complex that has lots of different APs assigned to lots of different ESSIDs because they belong to lots of different tenants in the building. 
This can definitely give you some grief because it's possible that your neighbor's ESSID broadcast is stronger than yours, depending on where the clients are in the building. So if a user reports that they're connected to an AP but still can't access the resources they need or authenticate to the network, you should verify that they are, in fact, connected to your ESSID and not your neighbor's. This is very typical in an open security wireless network. You can generally just look at the information tool tip on the wireless software icon to find this out. However, you can easily solve this problem today by making the office SSID the preferred network in the client software.

Wireless Standard Mismatch
As you found out in Chapter 12, “Wireless Networking,” wireless networks have many standards that have evolved over time, like 802.11a, 802.11b, 802.11g, 802.11n, 802.11ac, and 802.11ax. Standards continue to develop that make wireless networks even faster and more powerful. The catch is that some of these standards are backward compatible and others aren't. For instance, most devices you buy today support several of these standards, which means they can communicate with older devices as well as current ones. So, make sure the standards on the AP match the standards on the client or that they're at least backward compatible. It's either that or tell all your users to buy new cards for their machines. Be sure to understand the throughput, frequency, distance capabilities, and available channels for each standard you use.

Untested Updates
It's really important to push updates to the APs in your wireless network, but not before you test them. Just as you might wait weeks or months after an update from Microsoft or Apple becomes available before you install it, you should hold off on new OS or patch updates for your AP. Then, you need to test the updates thoroughly on your bench before pushing them to your live network.
Distance/Signal Strength/Power Levels Causing Signal Degradation or Loss Location, location, location. You've got only two worries with this one: Your clients are either not far enough away or too far from the AP. Suppose your AP doesn't seem to have enough power to provide a connectivity point for your clients. In that case, you can move it closer to them, increase the distance that the AP can transmit by changing the type of antenna it uses, or use multiple APs connected to the same switch or set of switches to solve the problem. If the power level or signal is too strong, and it reaches out into the parking area or farther out to other buildings and businesses, place the AP as close as possible to the center of the area it's providing service for. And don't forget to verify that you've got the latest security features in place to keep bad guys from authenticating to and using your network. Client Disassociation Issues Wireless clients randomly disassociating with an AP can be a difficult item to troubleshoot, especially if it's random or intermittent. It's frustrating for users, and they'll take it out on the engineer. There are a couple of reasons that I have found that caused this, although this may not be the answer for all the issues you see. Disabling the 802.11b client access actually prevented a customer from having them intermittently disconnect and then reconnect. Most clients are 802.11a or g or higher, so when the AP encoding goes down to 802.11b, then all other clients may run that same encoding, which disconnects them. By not allowing 802.11b clients, this solved this customer issue. Another client had a Meraki AP with automatic channel planning set to on. This caused the channels to randomly change, which disconnects the clients, so we had to disable this feature. Roaming Misconfiguration When you walk through a building and lose connection on your AP, it's common to initially blame signal coverage and assume a dead spot. 
As you know, moving from one AP to another is referred to as roaming, and there are several reasons why devices can have problems transitioning from one access point to the next. Here are a few of the causes of roaming issues I have found:

Excessive coverage
Poor signal coverage
Re-authentication
Mismatching configuration
Hidden SSIDs

Latency and Overcapacity
When wireless users complain that the network is slow (latency) or that they are losing their connection to applications during a session, it is usually a capacity or distance issue. Remember, 802.11 is a shared medium, and as more users connect, all user throughput goes down. Suppose this becomes a constant problem as opposed to the occasional issue where 20 people with laptops gather for a meeting every six months in the conference room. In that case, it may be time to consider placing a second AP in the area. When you do this, place the second AP on a different non-overlapping channel from the first and make sure the second AP uses the same SSID as the first. In the 2.4 GHz frequency, the three non-overlapping channels are 1, 6, and 11. Now the traffic can be divided between them, and users will get better performance. It is also worth noting that when clients move away from the AP, the data rate drops until at some point it is insufficient to maintain the connection.

Bounce
For a wireless network spanning large geographical distances, you can install repeaters and reflectors to bounce a signal and boost it to cover about a mile. This can be a good thing, but if you don't tightly control signal bounce, you could end up with a much bigger network than you wanted. To determine exactly how far and wide the signal will bounce, make sure you conduct a thorough wireless site survey. However, bounce can also refer to multipath issues, where the signal reflects off objects and arrives at the client degraded because it is arriving out of phase. The solution is pretty simple.
APs use two antennas, both of which sample the signal; the AP then uses the strongest signal and ignores the out-of-phase signal. However, 802.11ac and ax take advantage of multipath and can combine the out-of-phase signals to increase the distance hosts can be from the AP.

Incorrect Antenna Type or Switch Placement Can Cause Insufficient Wireless Coverage
Most of the time, the best place to put an AP and/or its antenna is as close to the center of your wireless network as possible. But you can position some antennas a distance from the AP and connect to it with a cable, a method used for a lot of the outdoor installations around today. If you want to use multiple APs, you've also got to be a little more sophisticated about deciding where to put them all; you can use third-party tools like the packet sniffers Wireshark and AirMagnet on a laptop to survey the site and establish how far your APs are actually transmitting. You can also hire a consultant to do this for you; there are many companies that specialize in assisting organizations with their wireless networks and the placement of antennas and APs. This is important because poor placement can lead to interference and poor performance, or even no performance at all.

Environmental Factors
It's vital to understand your environmental factors when designing and deploying your wireless network. Do you have concrete walls, window film, or metal studs in the walls? All of these will reduce the signal's power level (dB) and result in connectivity issues. Again, plan your wireless network carefully!

Reflection
When a wave hits a smooth object that is larger than the wave itself, depending on the media the wave may bounce in another direction. This behavior is categorized as reflection. Reflection can be the cause of serious performance problems in a WLAN. As a wave radiates from an antenna, it broadens and disperses. If portions of this wave are reflected, new wave fronts will appear from the reflection points.
If these multiple waves all reach the receiver, the multiple reflected signals cause an effect called multipath. Multipath can degrade the strength and quality of the received signal or even cause data corruption or canceled signals. APs mitigate this behavior by using multiple antennas and constantly sampling the signal to avoid a degraded signal.

Refraction
Refraction is the bending of an RF signal as it passes through a medium with a different density, thus causing the direction of the wave to change. RF refraction most commonly occurs as a result of atmospheric conditions. In long-distance outdoor wireless bridge links, refraction can be an issue. An RF signal may also refract through certain types of glass and other materials that are found in an indoor environment.

Absorption
Some materials will absorb a signal and reduce its strength. While there is not much that can be done about this, this behavior should be noted during a site survey, and measures such as additional APs or additional antenna types may be called for.

Signal-to-Noise Ratio
Signal-to-noise ratio (SNR) is the difference in decibels between the received signal and the background noise level (noise floor). If the amplitude of the noise floor is too close to the amplitude of the received signal, data corruption will occur and result in layer 2 retransmissions, negatively affecting both throughput and latency. An SNR of 25 dB or greater is considered good signal quality, and an SNR of 10 dB or lower is considered very poor signal quality.

Now that you know all about the possible physical network horrors that can befall you on a typical network, it's a good time for you to memorize the troubleshooting steps that you've got to know to ace the CompTIA Network+ exam.

Troubleshooting Steps
In the Network+ troubleshooting model, there are seven steps you've got to have dialed in:

1. Identify the problem.
2. Establish a theory of probable cause.
3. Test the theory to determine cause.
4. Establish a plan of action to resolve the problem and identify potential effects.
5. Implement the solution or escalate as necessary.
6. Verify full system functionality and implement preventative measures if applicable.
7. Document findings, actions, outcomes, and lessons learned throughout the process.

To get things off to a running start, let's assume that the user has called you yet again, but now they're almost in tears because they can't connect to the server on the intranet and they also can't get to the Internet. (By the way, this happens a lot, so pay attention; it's only a matter of time before it happens to you!)

Absolutely, positively make sure you memorize this seven-step troubleshooting process/methodology in the right order when studying for the Network+ exam!

Step 1: Identify the Problem
Before you can solve the problem, you've got to figure out what it is, right? Again, asking the right questions can get you far along this path and really help clarify the situation. Identifying the problem involves steps that together constitute information gathering.

Gather Information by Questioning Users
A good way to start is by asking the user the following questions:

Exactly which part of the Internet can't you access? A particular website? A certain address? A type of website? None of it at all?
Can you use your web browser?
Is it possible to duplicate the problem?

If the hitch has to do with an internal server to the company, ask the user if they can ping the server and talk them through doing that. Ask the user to try to SSH/telnet or SFTP/FTP to an internal server to verify local network connectivity; if they don't know how, talk them through it. If there are multiple complaints of problems occurring, look for the big stuff first and then isolate and approach each problem individually.

Here's another really common trouble ticket that just happens to build on the last scenario: Now let's say you've got a user who's called you at the help desk.
By asking the previous questions, you found out that this user can't access the corporate intranet or get out to any sites on the Internet. You also established that the user can use their web browser to access the corporate FTP site, but only by IP address, not by the FTP server name. This information tells you two important things: that you can rule out the host and the web browser (application) as the source of the problem and that the physical network is working.

Duplicate the Problem, If Possible
When a user reports an issue, you should attempt to duplicate the issue. When this is possible, it will aid in discovering the problem. When you cannot duplicate the issue, your challenge becomes harder because you are dealing with an intermittent problem. These issues are difficult to solve because they don't happen consistently.

Determine If Anything Has Changed
Moving right along, if you can reproduce the problem, your next step is to verify what has changed and how. Drawing on your knowledge of networking, you ask yourself and your user questions like these:

Were you ever able to do this? If not, then maybe it just isn't something the hardware or software is designed to do. You should then tell the user exactly that, as well as advise them that they may need additional hardware or software to pull off what they're trying to do.

If so, when did you become unable to do it? If, once upon a time, the computer was able to do the job and then suddenly could not, whatever conditions surrounded and were involved in this turn of events become extremely important. You have a really good shot at unearthing the root of the problem if you know what happened right before things changed. Just know that there's a high level of probability that the cause of the problem is directly related to the conditions surrounding the change when it occurred.

Has anything changed since the last time you could do this? This question can lead you right to the problem's cause.
Seriously—the thing that changed right before the problem began happening is almost always what caused it. It's so important that if you ask it and your user tells you, “Nothing changed…it just happened,” you should rephrase the question and say something like, “Did anyone add anything to your computer?” or “Are you doing anything differently from the way you usually do it?” Were any error messages displayed? These are basically arrows that point directly at the problem's origin; error messages are designed by programmers for the purpose of pointing them to exactly what it is that isn't working properly in computer systems. Sometimes error messages are crystal clear, like Disk Full, or they can be cryptically annoying little puzzles in and of themselves. If you pulled the short straw and got the latter variety, it's probably best to hit the software or hardware vendor's support site, where you can usually score a translation from the “programmerese” in which the error message is written into plain English so you can get back to solving your riddle. Are other people experiencing this problem? You've got to ask this one because the answer will definitely help you target the cause of the problem. First, try to duplicate the problem from your own workstation because if you can't, it's likely that the issue is related to only one user or group of users—possibly their workstations. (A solid hint that this is the case is if you're being inundated with calls from a bunch of people from the same workgroup.) Is the problem always the same? It's good to know that when problems crop up, they're almost always the same each time they occur. But their symptoms can change slightly as the conditions surrounding them change. 
A related question would be, “If you do x, does the problem get better or worse?” For example, ask a user, “If you use a different file, does the problem get better or worse?” If the symptoms lighten up, it's an indication that the problem is related to the original file that's being used. It's important to try to duplicate the problem to find the source of the issue as soon as possible! Understand that these are just a few of the questions you can use to get to the source of a problem. Okay, so let's get back to our sample scenario. So far, you've determined that the problem is unique to one user, which tells you that the problem is specific to this one host. Confirming that is the fact that you haven't received any other calls from other users on the network. And when watching the user attempt to reproduce the problem, you note that they're typing the address correctly. Plus, you've got an error message that leads you to believe that the problem has something to do with Domain Name System (DNS) lookups on the user's host. Time to go deeper… Identify Symptoms I probably don't need to tell you that computers and networks can be really fickle—they can hum along fine for months, suddenly crash, and then continue to work fine again without ever seizing in that way again. That's why it's so important to be able to reproduce the problem and identify the affected area to narrow things down so you can cut to the chase and fix the issue fast. This really helps—when something isn't working, try it again, and write down exactly what is and is not happening. Most users' knee-jerk reaction is to straight up call the help desk the minute they have a problem. This is not only annoying but also inefficient, because you're going to ask them exactly what they were doing when the problem occurred and most users have no idea what they were doing with the computer at the time because they were focused on doing their jobs instead. 
This is why if you train users to reproduce the problem and jot down some notes about it before calling you, they'll be much better prepared to give you the information you need to start troubleshooting it and help them.

So, with that, here we go. The problem you've identified results in coughing out an error message to your user when they try to access the corporate intranet. It looks like Figure 18.3. And when this user tries to ping the server using its hierarchical web name, it also fails (see Figure 18.4). You're going to respond by checking to see whether the server is up by pinging the server by its IP address (see Figure 18.5).

FIGURE 18.3 Cannot connect

FIGURE 18.4 Host could not be found.

FIGURE 18.5 Successful ping

Nice, that worked, so the server is up, but you could still have a server problem. Just because you can ping a host, it doesn't mean that host is 100% up and running, but in this case, it's a good start. And you're in luck because you've been able to re-create this problem from this user's host machine. By doing that, you now know that the URL name is not being resolved from Internet Explorer, and you can't ping it by the name either. But you can ping the server IP address from your limping host, and when you try this same connection to the internal.lammle.com server from another host nearby, it works fine, meaning the server is working fine. So, you've succeeded in isolating the problem to this specific host!

It is a huge advantage if you can watch the user try to reproduce the problem themselves because then you know for sure whether the user is performing the operation correctly. It's a really bad idea to assume the user is typing in what they say they are.

Approach Multiple Problems Individually

You should never mix possible solutions when troubleshooting. When multiple changes are made, interactions can occur that muddy the results. Always attack one issue at a time and make only a single change at a time.
When a change does not have a beneficial effect, reverse the change before making another change. You've now nailed down the problem. This leads us to step 2. Step 2: Establish a Theory of Probable Cause After you observe the problem and identify the symptoms, next on the list is to establish its most probable cause. (If you're stressing about it now, don't, because though you may feel overwhelmed by all this, it truly does get a lot easier with time and experience.) You must come up with at least one possible cause, even though it may not be completely on the money. And you don't always have to come up with it yourself. Someone else in the group may have the answer. Also, don't forget to check online sources and vendor documentation. Again, let's get back to our scenario, in which you've determined the cause is probably an improperly configured DNS lookup on the workstation. The next thing to do is to verify the configuration (and probably reconfigure DNS on the workstation; we'll get to this solution later, in step 4). Understand that there are legions of problems that can occur on a network—and I'm sorry to tell you this, but they're typically not as simple as the example we've been using. They can be, but I just don't want you to expect them to be. Always consider the physical aspects of a network, but look beyond them into the realm of logical factors like the DNS lookup issue we've been using. 
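The reasoning in the scenario (the name lookup fails while the server's IP address still answers a ping) can be sketched as a small decision function. This is only an illustration of the logic, not a diagnostic tool: the helper name `classify_reachability` and its result strings are made up for this example, and the resolver is passed in as a parameter so the logic can be exercised without touching a real network.

```python
import socket
from typing import Callable, Optional


def resolve(name: str) -> Optional[str]:
    """Return the IPv4 address for name, or None if the lookup fails."""
    try:
        return socket.gethostbyname(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def classify_reachability(
    name: str,
    ip_answers_ping: bool,
    resolver: Callable[[str], Optional[str]] = resolve,
) -> str:
    """Hypothetical helper: turn two observations (does the name resolve,
    does the server's IP answer a ping) into a theory of probable cause."""
    resolved = resolver(name)
    if resolved is None and ip_answers_ping:
        return "name resolution problem (check DNS settings on this host)"
    if resolved is None:
        return "no name resolution and no ping reply (check connectivity first)"
    if not ip_answers_ping:
        return "name resolves but server does not answer (check server or path)"
    return "name resolves and server answers (look higher up the stack)"
```

In the chapter's scenario, the lookup for internal.lammle.com fails while the ping by IP address succeeds, which lands in the first branch and matches the DNS theory established above.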
Question the Obvious

The probable causes that you've got to thoroughly understand to meet the Network+ objectives are as follows:

Port speed
Port duplex mismatch
Mismatched MTU
Incorrect virtual local area network (VLAN)
Interface shutdown/disabled/suspended
Interface issues/increasing counters
Incorrect IP address/duplicate IP address
Wrong gateway
Wrong DNS
Incorrect subnet mask
Incorrect interface/interface misconfiguration
Duplicate MAC addresses
Expired IP address
Rogue DHCP server
Untrusted SSL certificate
Incorrect time
DHCP address pool exhaustion
Blocked TCP/UDP ports
Incorrect host-based firewall settings
Incorrect ACL settings
Unresponsive service
Multicast flooding
Asymmetrical routing
Low optical link budget
Network Time Protocol issues
Hardware issues with Power over Ethernet (PoE)
Bring your own device challenges
Licensed features
Network performance issues

Let's talk about these logical issues, which can cause an abundance of network problems. Most of these happen because a device has been improperly configured.

Port Speed

Because networks have been evolving for many years, there are various levels of speed and sophistication mixed into them, often within the same network. Most of the newest NICs can be used at 10 Mbps, 100 Mbps, and 1000 Mbps. Most switches can support at least 10 Mbps and 100 Mbps, and an increasing number of switches can also support 1 Gbps or 25/40/100 Gbps. Plus, many switches can also autosense the speed of the NIC that's connected and use different speeds on various ports. As long as the switches are allowed to autosense the port speed, it's rare to have a problem develop that results in a complete lack of communication. But if you decide to set the port speed manually, make positively sure to set the same speed on both sides of a link.

Port Duplex Mismatch

There are generally three duplex settings on each port of a network switch: full, half, and auto.
For two devices to connect effectively, the duplex setting has to match on both sides of the connection. If one side of a connection is set to full and the other is set to half, they're mismatched. More elusively, if both sides are set to auto but the devices are different, you can also end up with a mismatch because the device on one side defaults to full and the other one defaults to half. Duplex mismatches can cause lots of network and interface errors, and even a lack of a network connection. This is partially because setting the interfaces to full duplex disables the CSMA/CD protocol. This is definitely not a problem in a network that has no hubs (and therefore no shared segments in which there could be collisions), but it can make things really ugly in a network where hubs are still being used. This means the settings you choose are based on the type of devices you have populating your network. If you have all switches and no hubs, feel free to set all interfaces to full duplex, but if you've got hubs in the mix, you have shared networks, so you're forced to keep the settings at half duplex. With all new switches produced today, leaving the speed and duplex setting to auto (the default on both switches and hosts) is the recommended way to go. Mismatched MTU Ethernet LANs enforce what is called a maximum transmission unit (MTU). This is the largest size packet that is allowed across a segment. In most cases, this is 1500 bytes. Left alone this is usually not a problem, but it is possible to set the MTU on a router interface, which means it is possible for a mismatch to be present between two router interfaces. This can cause problems with communications between the routers, resulting in the link failing to pass traffic. To check the MTU on an interface, execute the command show interface. Incorrect VLAN Switches can have multiple VLANs each, and they can be connected to other switches using trunk links. 
As you now know, VLANs are often used to represent departments or the occupations of a group of users. This makes the configurations of security policies and network access lists much easier to manage and control. On the other hand, if a port is accidentally assigned to the wrong VLAN in a switch, it's as if that client was magically transported to another place in the network. If that happens, the security policies that should apply to the client won't anymore, and other policies will be applied to the client that never should have been. The correct VLAN port assignment of a client is as important as air; when I'm troubleshooting a single-host problem, this is the first place I look.

It's pretty easy to tell if you have a port configured with a wrong VLAN assignment. If this is the case, it won't be long before you'll get a call from some user screaming something at you that makes the building shake, like, “I can get to the Internet, but I can't get to the Sales server, and I'm about to lose a huge sale. DO SOMETHING!” When you check the switch, you will invariably see that this user's port has a membership in another VLAN, like Marketing, which has no access to the Sales server.

Interface Shutdown/Disabled/Suspended

You can check the status of a switch or router interface with the show interfaces command. You can administratively disable a switch or router port with the shutdown command.

Switch(config)#int f0/1
Switch(config-if)#shutdown
%LINK-5-CHANGED: Interface FastEthernet0/1, changed state to administratively down

Understand that on switches you can also configure a feature called port security, which is used mostly to stop people from plugging multiple devices into the same port. But how do we actually prevent someone from simply plugging a host into one of our switch ports, or worse, adding a hub, switch, or access point into the Ethernet jack in their office?
By default, MAC addresses will dynamically appear in your MAC forward/filter database, but you can stop them in their tracks by using port security! Figure 18.6 shows two hosts connected to the single switch port Fa0/3 via either a hub or access point (AP). Port Fa0/3 is configured to observe and allow only certain MAC addresses to associate with the specific port. So in this example, Host A is denied access, but Host B is allowed to associate with the port.

FIGURE 18.6 Port Security on a switch port restricts port access by MAC address.

You can configure the device to take one of the following actions when a security violation occurs by using the switchport port-security command:

Switch(config)#int f0/3
Switch(config-if)#switchport port-security violation ?
  protect   Security violation protect mode
  restrict  Security violation restrict mode
  shutdown  Security violation shutdown mode

These are the three options for port security:

Protect: The protect violation mode drops packets with unknown source addresses until you remove enough secure MAC addresses to drop below the maximum value.

Restrict: The restrict violation mode also drops packets with unknown source addresses until you remove enough secure MAC addresses to drop below the maximum value. Unlike protect, it also logs the violation and increments the security violation counter.

Shutdown: Shutdown is the default violation mode. The shutdown violation mode puts the interface into an error-disabled state immediately.
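To make the difference between the three violation modes concrete, here is a toy model of a single secured port. It is a deliberately simplified sketch, not how a switch implements the feature: the class name `SecurePort`, its fields, and the MAC strings in the usage note are invented for illustration, and it assumes that restrict and shutdown both count the violation while protect drops silently.

```python
from dataclasses import dataclass, field


@dataclass
class SecurePort:
    """Toy model of one switch port with port security enabled."""
    max_macs: int = 1
    violation_mode: str = "shutdown"  # "protect", "restrict", or "shutdown"
    secure_macs: set = field(default_factory=set)
    violation_count: int = 0
    err_disabled: bool = False

    def receive(self, src_mac: str) -> bool:
        """Return True if a frame from src_mac would be forwarded."""
        if self.err_disabled:
            return False  # an error-disabled port passes nothing
        if src_mac in self.secure_macs:
            return True   # already a known secure address
        if len(self.secure_macs) < self.max_macs:
            self.secure_macs.add(src_mac)  # learn the address dynamically
            return True
        # Unknown source and the table is full: this is a violation.
        if self.violation_mode == "restrict":
            self.violation_count += 1      # drop and count
        elif self.violation_mode == "shutdown":
            self.violation_count += 1
            self.err_disabled = True       # drop and disable the port
        return False                       # protect: drop silently
```

With the default of one address in shutdown mode, the first host to send a frame is learned, and a second host plugged in through a hub trips the violation and takes the port down for both of them. That is why shutdown is the most disruptive of the three modes.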
Interface Issues/Increasing Counters

Let's zoom in on interface Fa0/0 and talk about what to expect if there were errors on this interface that cause increasing counters:

R2#sh int fa0/0
FastEthernet0/0 is up, line protocol is up
[output cut]
  Full-duplex, 100Mb/s, 100BaseTX/FX
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:05, output 00:00:01, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 0 bits/sec, 0 packets/sec
  5 minute output rate 0 bits/sec, 0 packets/sec
     1325 packets input, 157823 bytes
     Received 1157 broadcasts (0 IP multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog
     0 input packets with dribble condition detected
     2294 packets output, 244630 bytes, 0 underruns
     0 output errors, 0 collisions, 3 interface resets
     347 unknown protocol drops
     0 babbles, 0 late collision, 0 deferred
     4 lost carrier, 0 no carrier
     0 output buffer failures, 0 output buffers swapped out

CRC Errors

Cyclic redundancy check (CRC) errors happen when the frame check sequence (FCS) carried in the last 4 bytes of an incoming frame fails to verify that frame. As you can see in the following example, this interface has both input errors and CRC errors. Input errors count any error encountered on incoming frames, whereas CRC errors count only failed FCS checks. Both of these counters are cumulative and need to be cleared manually with the clear counters command, specifying the interface name and number, such as gigabitEthernet 4/27.
es-29#show interface gigabitEthernet 4/27
GigabitEthernet4/27 is up, line protocol is up (connected)
[ Output Cut ]
     Received 72917 broadcasts (99 multicasts)
     232 runts, 0 giants, 0 throttles
     112085 input errors, 111853 CRC, 0 frame, 0 overrun, 0 ignored
     0 input packets with dribble condition detected
     161829497 packets output, 20291962434 bytes, 0 underruns
     0 output errors, 0 collisions, 3 interface resets
[ Output Cut ]

It is important to understand that you can view only incoming CRC errors. Outgoing frames are checked against the FCS by the other side of the connection, so errors on them show up in that device's counters. CRC errors are most often caused by wiring problems, but a duplex setting that has been manually configured so that the two sides no longer match can also cause them.

Giants

Giant frames are just what their name suggests; they are large frames. When the interface receives an incoming frame larger than the configured maximum transmission unit (MTU) for an interface or VLAN, the giant frame counter will increment. The giants counter can be found in the output of the show interface command, as shown in the following example. The default MTU for Ethernet is 1500 bytes. It is very common to see this counter increment if the connected host is sending jumbo frames with an MTU of 9000 bytes.

es-29#show interface gigabitEthernet 4/27
GigabitEthernet4/27 is up, line protocol is up (connected)
[ Output Cut ]
     Received 72917 broadcasts (99 multicasts)
     232 runts, 0 giants, 0 throttles
     112085 input errors, 111853 CRC, 0 frame, 0 overrun, 0 ignored
     0 input packets with dribble condition detected
     161829497 packets output, 20291962434 bytes, 0 underruns
     0 output errors, 0 collisions, 3 interface resets
[ Output Cut ]

Runts

If giant frames are large frames, then logically, runts are small frames. When an interface receives an incoming frame smaller than 64 bytes, the frame is considered a runt. This commonly happens when there are collisions, but it can also happen if you have a faulty connection.
In the following example, the interface has received a number of runt frames, but no collisions were detected. However, the interface has received a number of CRC errors, so this is probably a bad physical connection.

es-29#show interface gigabitEthernet 4/27
GigabitEthernet4/27 is up, line protocol is up (connected)
[ Output Cut ]
     Received 72917 broadcasts (99 multicasts)
     232 runts, 0 giants, 0 throttles
     112085 input errors, 111853 CRC, 0 frame, 0 overrun, 0 ignored
     0 input packets with dribble condition detected
     161829497 packets output, 20291962434 bytes, 0 underruns
     0 output errors, 0 collisions, 3 interface resets
[ Output Cut ]

Drops

Drops are counted when the output queue is full. For example, when you receive traffic on a 1000Mb interface and forward it through a 100Mb interface, you'll see congestion, which causes packet loss and high delay.

Input Queue Drops

If the input queue drops counter increments, this signifies that more traffic is being delivered to the router than it can process. If this is consistently high, try to determine exactly when these counters are increasing and how the events relate to CPU usage. You'll see the ignored and throttle counters increment as well.

Output Queue Drops

This counter indicates that packets were dropped because the interface was congested, which also adds queuing delay. When this occurs, applications like VoIP will experience performance issues. If you observe this counter constantly incrementing, consider QoS.

Input Errors

A rising input error count often reflects a large number of underlying errors such as CRC failures. This can point to cabling problems, hardware issues, or duplex mismatches.

Output Errors

This is the total number of frames that the port tried to transmit when an issue such as a collision occurred.

You've got to be able to analyze interface statistics to find problems if they exist, so let's pick out the important factors relevant to meeting that challenge effectively now.
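Reading these counters by eye gets tedious across many ports, so the interpretation rules from this section can be sketched in a few lines of code. The snippet below is a rough illustration that works on pasted text only. The function names `parse_counters` and `suggest_causes` are hypothetical, the sample lines are taken from the es-29 output shown earlier, and the hints simply restate the rules of thumb from the text rather than any vendor logic.

```python
import re

# Three counter lines copied from the es-29 example output.
SAMPLE = """\
     232 runts, 0 giants, 0 throttles
     112085 input errors, 111853 CRC, 0 frame, 0 overrun, 0 ignored
     0 output errors, 0 collisions, 3 interface resets
"""

COUNTER_RE = re.compile(
    r"(\d+) (runts|giants|input errors|CRC|output errors|collisions)\b"
)


def parse_counters(text: str) -> dict:
    """Map each recognized counter name to its integer value."""
    return {name: int(value) for value, name in COUNTER_RE.findall(text)}


def suggest_causes(counters: dict) -> list:
    """Apply the rules of thumb from the text to a set of counters."""
    hints = []
    if counters.get("CRC", 0) > 0:
        hints.append("CRC errors: check cabling, then duplex settings")
    if counters.get("runts", 0) > 0 and counters.get("collisions", 0) == 0:
        hints.append("runts without collisions: suspect a bad physical connection")
    if counters.get("giants", 0) > 0:
        hints.append("giants: check for an MTU or jumbo-frame mismatch")
    return hints


if __name__ == "__main__":
    for hint in suggest_causes(parse_counters(SAMPLE)):
        print(hint)
```

Because the counters are cumulative, a single snapshot only says that errors happened at some point. Comparing two snapshots taken a few minutes apart shows whether a counter is still increasing.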
Speed and Duplex Settings

It's good to know that the most common cause of interface errors is a mismatched duplex mode between two ends of an Ethernet link. This is why it's so important to make sure that the switch and its hosts (PCs, router interfaces, etc.) have the same speed and duplex settings. If the speeds don't match, they just won't connect. And if they have mismatched duplex settings, you'll receive a legion of errors, which cause nasty performance issues, intermittent connectivity, and even total loss of communication!

Using autonegotiation for speed and duplex is a very common practice, and it's enabled by default. But if this fails for some reason, you'll have to set the configuration manually like this:

Switch(config)#int gi0/1
Switch(config-if)#speed ?
  10    Force 10 Mbps operation
  100   Force 100 Mbps operation
  1000  Force 1000 Mbps operation
  auto  Enable AUTO speed configuration
Switch(config-if)#speed 1000
Switch(config-if)#duplex ?
  auto  Enable AUTO duplex configuration
  full  Force full duplex operation
  half  Force half-duplex operation
Switch(config-if)#duplex full

If you have a duplex mismatch, a telling sign is that the late collision counter will increment.

Incorrect IP Address/Duplicate IP Address

The most common addressing protocol in use today is IPv4, which provides a unique IP address for each host on a network. Client computers usually get their addresses from Dynamic Host Configuration Protocol (DHCP) servers. But sometimes, especially in smaller networks, IP addresses for servers and router interfaces are statically assigned by the network's administrator. An incorrect or duplicate IP address on a client will keep that client from being able to communicate and may even cause a conflict with another client on the network, and a bad address on a server or router interface can be disastrous and affect a multitude of users.
This is exactly why you need to be super careful to set up DHCP servers correctly and also when configuring the static IP addresses assigned to servers and router interfaces. Wrong Gateway A gateway, sometimes called a default gateway or an IP default gateway, is a router interface's address that's configured to forward traffic with a destination IP address that's not in the same subnet as the device itself. Let me clarify that one for you: If a device compares where a packet wants to go with the network it's currently on and finds that the packet needs to go to a remote network, the device will send that packet to the gateway to be forwarded to the remote network. Because every device needs a valid gateway to obtain communication outside its own network, it's going to require some careful planning when considering the gateway configuration of devices in your network. If you're configuring a static IP address and default gateway, you need to verify the router's address. Not doing so is a really common “wrong gateway” problem that I see all the time. Wrong DNS DNS servers are used by networks and their clients to resolve a computer's hostname to its IP addresses and to enable clients to find the server they need to provide the resources they require, like a domain controller during the login and authentication process. Most of the time, DNS addresses are automatically configured by a DHCP server, but sometimes these addresses are statically configured instead. Because lots of applications rely on hostname resolution, a botched DNS configuration usually causes a computer's network applications to fail just like the user's applications in our example scenario. If you can ping a host using its IP address but not its name, you probably have some type of name-resolution issue. It's probably lurking somewhere within a DNS configuration. 
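For the static gateway case described above, one quick sanity check is whether the configured gateway actually sits inside the host's own subnet, since a host can only hand packets to a router it can reach directly. The sketch below uses Python's standard `ipaddress` module. The function names and all of the addresses are made up for illustration, and `next_hop` only mirrors the local-versus-remote comparison described in the text, not a full routing table lookup.

```python
import ipaddress


def gateway_is_on_link(host_ip: str, mask: str, gateway_ip: str) -> bool:
    """True if the gateway address falls inside the host's own subnet."""
    network = ipaddress.ip_interface(f"{host_ip}/{mask}").network
    return ipaddress.ip_address(gateway_ip) in network


def next_hop(host_ip: str, mask: str, gateway_ip: str, dest_ip: str) -> str:
    """Local destinations are reached directly; remote ones go to the gateway."""
    network = ipaddress.ip_interface(f"{host_ip}/{mask}").network
    if ipaddress.ip_address(dest_ip) in network:
        return dest_ip
    return gateway_ip


if __name__ == "__main__":
    # Hypothetical static configuration with a mistyped gateway.
    print(gateway_is_on_link("192.168.10.25", "255.255.255.0", "192.168.1.1"))
    print(next_hop("192.168.10.25", "255.255.255.0", "192.168.10.1", "8.8.8.8"))
```

The same inputs with a different mask can flip the answer, so a gateway that looks wrong may really be a symptom of a wrong subnet mask on the host.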
Incorrect Subnet Mask

When network devices look at an IP address configuration, they see a combination of the IP address and the subnet mask. The device uses the subnet mask to establish which part of the address represents the network address and which part represents the host address. So clearly, a subnet mask that is configured wrong has the same nasty effect as a wrong IP address configuration does on communications. Again, a subnet mask is generally configured by the DHCP server; if you're going to enter it manually, make sure the subnet mask is right or you'll end up tangling with the fallout caused by the entire address's misconfiguration.

Incorrect Interface/Interface Misconfiguration

If a host is plugged into a misconfigured switch port or if it's plugged into the wrong switch port that's configured for the wrong VLAN, the host won't function correctly. Make sure the speed and duplex settings are correct and that the right Ethernet cable is used. Get any of that wrong and either you'll get interface errors on the host and switch port or, worse, things just won't work at all!

Duplicate MAC Addresses

There should never be duplicate MAC addresses in your environment. Each interface vendor is issued an organizationally unique identifier (OUI), which will match on all interfaces produced by that vendor, and then the vendor is responsible for ensuring unique MAC addresses. That means duplicate MAC addresses usually indicate a MAC spoofing attack, in which some malicious individual changes their MAC address, which can be done quite easily in the properties of the NIC.

Expired IP Address

In almost all cases, when DHCP is used to allocate IP configurations to devices, the configuration is supplied to the DHCP client on a temporary basis. The lease period is configurable, and when the lease period and a grace period transpire, the lease is expired.
The effect of an expired lease is that the next time that client computer starts, it must enter the initialization state and obtain new TCP/IP configuration information from a DHCP server. There is nothing, however, to prevent the client from obtaining a new lease for the same IP address.

Rogue DHCP Server

Dynamic Host Configuration Protocol (DHCP) is used to automate the process of assigning IP configurations to hosts. When configured properly, it reduces administrative overhead, reduces the human error inherent in manual assignment, and enhances device mobility. But it introduces a vulnerability that, when leveraged by a malicious individual, can result in an inability of hosts to communicate (constituting a DoS attack) and in peer-to-peer attacks.

When an illegitimate DHCP server (called a rogue DHCP server) is introduced to the network, unsuspecting hosts may accept DHCP Offer packets from the illegitimate DHCP server rather than the legitimate DHCP server. When this occurs, the rogue DHCP server will not only issue the host an incorrect IP address, subnet mask, and default gateway address (which makes a peer-to-peer attack possible), it can also issue an incorrect DNS server address. The host then relies on the attacker's DNS server for the IP addresses of websites (such as major banks), which opens the door to phishing attacks. Figure 18.7 shows an example of how this can occur.

In Figure 18.7, after receiving an incorrect IP address, subnet mask, default gateway, and DNS server address from the rogue DHCP server, the DHCP client uses the attacker's DNS server to obtain the IP address of his bank. This leads the client to unwittingly connect to the attacker's copy of the bank's website. When the client enters his credentials to log in, the attacker now has the client's bank credentials and can proceed to empty out his account.

Untrusted SSL Certificate

An untrusted SSL certificate error message can appear for several reasons.
Figure 18.8 shows the possible reasons for the warning message. The first reason, “A trusted certificate authority did not issue the Security certificate presented by this website,” means the CA that issued the certificate is not trusted by the local machine. This will occur if the certificate of the CA that issued the certificate is not found in the Trusted Root Certification Authorities folder on the local machine.

The second reason this might occur is that the certificate is not valid. The certificate may have been presented before the validity period begins, or it may have expired, meaning the validity period is over.

The third reason is that the name on the certificate does not match the name of the website being visited.

FIGURE 18.7 Rogue DHCP

FIGURE 18.8 Certificate error

Incorrect Time

Incorrect time on a device can be the cause of several issues. First, in a Windows environment using Active Directory, a clock skew of more than 5 minutes between a client and server will, by default, cause authentication between the two to fail. Second, proper time synchronization is critical for successful operation when certificates are in use. Finally, when system logs are sent to a central server such as a syslog server, proper time synchronization is critical to understand the order of events.

DHCP Address Pool Exhaustion

When a DHCP server is implemented, it is configured with a limited number of IP addresses. When the IP addresses in a scope are exhausted, any new DHCP clients will be unable to obtain an IP address and will be unable to function on the network.

DHCP servers can be set up to provide backup to another DHCP server for a scope. When this is done, it is important to ensure that while the two DHCP servers service the same scope, they do not have any duplicate IP addresses.
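Both checks in this section, how many addresses are left in a pool and whether two servers sharing a scope hand out any of the same addresses, come down to simple set arithmetic. The following sketch shows the idea with Python's standard `ipaddress` module. The helper names and the address ranges are assumptions chosen for the example, and a real DHCP server would also account for exclusions and reservations, which are ignored here.

```python
import ipaddress


def expand(first: str, last: str) -> set:
    """All IPv4 addresses from first to last, inclusive."""
    low = int(ipaddress.IPv4Address(first))
    high = int(ipaddress.IPv4Address(last))
    return {ipaddress.IPv4Address(n) for n in range(low, high + 1)}


def overlap(range_a: tuple, range_b: tuple) -> set:
    """Addresses that two servers would both be able to hand out."""
    return expand(*range_a) & expand(*range_b)


def free_addresses(pool: tuple, leased: list) -> int:
    """Number of pool addresses not currently leased."""
    in_use = {ipaddress.IPv4Address(a) for a in leased}
    return len(expand(*pool) - in_use)


if __name__ == "__main__":
    # Hypothetical split of one scope between two servers.
    server_a = ("10.0.0.10", "10.0.0.100")
    server_b = ("10.0.0.90", "10.0.0.200")
    print(len(overlap(server_a, server_b)), "addresses are offered by both servers")
```

An empty result from `overlap` is what you want to see for two servers covering the same scope, and a `free_addresses` result of zero means the next new client will not get a lease.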
Blocked TCP/UDP Ports When the ports used by common services and applications are blocked, either on the network firewall or on the personal firewall of a device, it will be impossible to make use of the service or application. One easy way to verify the open ports on a device is to execute the netstat command. Figure 18.9 shows an example of the output. FIGURE 18.9 Netstat -a output Incorrect Host-Based Firewall Settings As you saw in the explanation of blocked TCP/UDP ports, incorrect host-based firewall settings can either prevent transmissions or allow unwanted communications. Neither of these outcomes is desirable. One of the best ways to ensure that firewall settings are consistent and correct all the time is to control these settings with a group policy. When you do this, the settings will be checked and reset at every policy refresh interval. Incorrect ACL Settings Access control lists (ACLs) are used to control which traffic types can enter and exit ports on the router. When mistakes are made either in the construction of the ACLs or in their application, many devices may be affected. The creation and application of these tools should be done only by those who have been trained in their syntax and in the logic ACLs use in their operation. Unresponsive Service Services can fail for several reasons. Many services depend on other services for their operation. Therefore, the failure of one service sometimes causes a domino effect, taking down other services that depend on it. You can use the Services snap-in inside the Computer Management Microsoft Management Console (MMC) to identify these dependencies as well as start and stop services. To identify the services upon which a particular service depends, use the Dependencies tab on the Services snap-in, as shown in Figure 18.10. FIGURE 18.10 Service dependencies In Figure 18.10, the spooler service is selected and the Dependencies tab is displayed. 
Here we see that the spooler service depends on the HTTP and RPC services. Therefore, if the spooler service will not start, we may need to restart one of these two services first.

Multicast Flooding

Multicasting is used for network devices to communicate with each other and to save network capacity by having only one sender but many listeners. This is handy for video because the content server does not have to generate an individual stream for every subscribed listener. However, multicast can flood a network with packets as they are sent to every networking device and potentially out every port, even if there are no listeners on the switch port. To resolve these issues, you must enable the multicast features on modern switches and routers, such as IGMP snooping, that are designed to lessen the impact of multicast flooding.

Asymmetrical Routing

Asymmetrical routing is when a session takes different paths through a network in each direction. Generally a routed network will have only one path for both send and receive traffic from a client to a server and vice versa. However, conditions can exist where a router sends traffic out and it comes back using another path. Check that you are not running multiple routing protocols inside your network and that your ISP is not returning traffic via a different path than the one you are sending it out on.

Low Optical Link Budget

When you're troubleshooting fiber-optic links, a test set should be used to make sure the received light level is not too low to be detected. If there is too much loss over a fiber link, whether from too many interconnects that each add loss, a distance greater than the standard allows, or dirty connections, a link may not be able to be established.

Network Time Protocol Issues

Networks and computers have the ability to sync their time clock to a central server called a Network Time Protocol (NTP) server.
If communications are lost, or NTP was never configured to begin with, time stamps for logging, application synchronization, and licenses based on dates can all cause major headaches. It is always a good practice to use a central time source and make sure all of your devices get their date and time from the NTP servers.

Hardware Issues with PoE

As the networking world evolved and it became common to attach devices such as IP phones and remote Wi-Fi access points to the network, the requirement to supply power to these devices arose. Many IP phones could be powered from a central access Ethernet switch to save having to find a power outlet at every desk. Many Wi-Fi access points are located in office ceilings or remote locations where local power may not be available or may be costly to install. The network switch manufacturers responded with a PoE option for these use cases. PoE allows for both the power and data to be transmitted
