Troubleshooting Microsoft Windows OS Problems

Windows is mind-bogglingly complex. Other operating systems are complex too, but the mere fact that the Windows operating system has nearly 60 million lines of code (and thousands of developers worked on it!) makes you pause and shake your head. Fortunately, you just need to take a systematic approach to solving software-related issues. The steps you learned in Chapter 13 are still relevant to software issues.

Windows-based issues can be grouped into several categories based on their cause, such as boot problems, missing files (such as system files), configuration files, and virtual memory. If you’re troubleshooting a boot problem, it’s imperative that you understand the Windows boot process. Some common Windows problems don’t fall into any category other than “common Windows problems.” We cover those in the following sections, followed by a discussion of the tools that can be used to fix them.

Common Symptoms

There are numerous “common symptoms” that CompTIA asks you to be familiar with for the exam. They range from the dreaded Blue Screen of Death (BSOD) to spontaneous restarts and everything in between. They are discussed here in the order in which they appear in the objective list.

Slow Performance

The performance of your systems will inevitably slow down over time. This could be due to a multitude of causes, ranging from bad Windows Update patches to malware. Slow performance is one of the hardest problems to solve on a Windows operating system, because many of the symptoms are related to each other. The first step to solving the problem is identifying the component that is affected by the performance issue. The following critical components can be affected by slow performance:

CPU: A symptom of poor CPU performance is the slow execution of applications. The operating system GUI will be unresponsive and sluggish. You may also hear the CPU fan running faster than normal if the processor starts to overheat. CPU problems can be caused by an application that requires high CPU usage, such as movie rendering.

RAM: A symptom of the operating system running out of RAM is high disk activity. As physical memory is used up, the less active pages of memory are swapped out to the hard disk drive (HDD). The performance symptoms closely resemble CPU-related issues, in that applications are slow to load. RAM problems can be caused by too many applications being open at once or by an application that has high RAM requirements, such as a database.

Disk: A symptom of poor hard disk drive performance is thrashing of the drive heads on the platters. Thrashing occurs when the drive arm moves excessively to locate information on the drive. Disk problems can be caused by excessive fragmentation, high RAM usage, or a high volume of drive usage by applications, such as video capture.

Network: Symptoms of poor network performance are slow-loading web pages, network applications that load slowly, and even timeouts. If you are using wireless, network issues can be caused by poor signal strength. If you are connected through Ethernet, poor network performance can be directly related to the local area network (LAN). The problems can also be outside of your network.

Graphics: Symptoms of poor graphics (GPU) performance usually show up as slow-running video games and choppy video playback. The frames per second (FPS) will be excessively low as the computer tries to render the screen. Graphics-related issues are not common unless you just downloaded the latest shoot-em-up game. Graphics-related problems are usually the hardest to solve because they require third-party tools from the graphics card vendor.

As you can see from the list of possibly affected components, many of the symptoms are closely related, such as RAM, CPU, and disk.
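To reason about the five critical areas the way the Performance tab presents them, a small triage routine can help. The following Python sketch is illustrative only: the 90 percent threshold, the area names, and the sample readings are assumptions, not values Windows uses. It flags the saturated areas and, following the RAM entry above, treats heavy disk activity alongside high memory use as a hint that paging is the real culprit.

```python
from typing import Optional

# Illustrative triage of the five critical areas. The threshold and the
# sample readings are hypothetical; Windows does not expose this logic.
CRITICAL_AREAS = ("CPU", "Memory", "Disk", "Network", "GPU")


def saturated_areas(snapshot: dict[str, float], threshold: float = 90.0) -> list[str]:
    """Return areas whose utilization (percent) is at or above threshold, busiest first."""
    hot = [(area, pct) for area, pct in snapshot.items()
           if area in CRITICAL_AREAS and pct >= threshold]
    hot.sort(key=lambda item: item[1], reverse=True)
    return [area for area, _ in hot]


def likely_root_cause(hot: list[str]) -> Optional[str]:
    """Heuristic: RAM exhaustion shows up as heavy disk activity (paging),
    so when both are saturated, investigate memory first."""
    if "Memory" in hot and "Disk" in hot:
        return "Memory"
    return hot[0] if hot else None


# Hypothetical readings resembling Figure 24.1: CPU pinned, everything else normal.
snapshot = {"CPU": 98.0, "Memory": 45.0, "Disk": 3.0, "Network": 1.0, "GPU": 2.0}
hot = saturated_areas(snapshot)
print(hot, likely_root_cause(hot))  # prints: ['CPU'] CPU
```

The memory-before-disk rule is a judgment call, not a law; it simply encodes the relationship between RAM and disk described in the list above.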
Excessive RAM usage can create performance symptoms with your hard disk drive. If left for a long period of time, both can lead to an increase in CPU activity. Several tools can help you identify the problem area so that you can narrow your focus.

The first tool you should start up is the Task Manager, as shown in Figure 24.1. You can launch Task Manager in several different ways, such as right-clicking the Start menu and selecting Task Manager, right-clicking the taskbar and selecting Task Manager, or (my personal favorite) pressing Ctrl+Shift+Esc. The Performance tab shows four of the five critical areas (detailed previously) on the left side. In this example, you can see that the processor is spiked at almost 100% and all other systems are within tolerance.

Figure 24.1 The Performance tab in Task Manager

Now that you’ve isolated the problem to the CPU, you can narrow it down further by looking at the Processes tab, as shown in Figure 24.2. You can see that VMware Tools Core Service is using 40.3% of the CPU. The other 59.7% is most likely distributed among other processes. By clicking the CPU, Memory, Disk, and Network column headings, you can sort usage from high to low or from low to high. In this particular instance, the operating system was caught booting up, so that particular process was displaying high CPU.

Figure 24.2 The Processes tab in Task Manager

Starting in Windows 8, Task Manager gained a few tabs. One tab in particular is called Details. The Details tab allows you to see the details of the processes, such as the user that executed the process, the process ID (PID), and the name of the process executable. Just as on the Processes tab, you can sort by any of the headings by clicking them, as shown in Figure 24.3.
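Clicking a column heading on the Processes or Details tab amounts to sorting the rows by one field, and clicking it again reverses the order. The Python sketch below models that behavior. The ProcessRow fields and every value in the sample rows (names, PIDs, users, percentages) are hypothetical, except that the 40.3% figure echoes Figure 24.2.

```python
from dataclasses import dataclass


@dataclass
class ProcessRow:
    # Hypothetical stand-in for one row of the Details tab.
    name: str
    pid: int
    user: str
    cpu: float        # percent of total CPU
    memory_mb: float


def sort_by(rows: list[ProcessRow], column: str, descending: bool = True) -> list[ProcessRow]:
    """Mimic clicking a column heading: order rows by a single field.
    Clicking the same heading again corresponds to flipping `descending`."""
    return sorted(rows, key=lambda row: getattr(row, column), reverse=descending)


# Sample data for illustration only.
rows = [
    ProcessRow("vmtoolsd.exe", 2716, "SYSTEM", 40.3, 25.1),
    ProcessRow("explorer.exe", 4120, "alice", 1.2, 88.0),
    ProcessRow("msedge.exe", 5532, "alice", 6.5, 310.4),
]

for row in sort_by(rows, "cpu"):
    print(f"{row.name:<14}{row.pid:>6}  {row.user:<8}{row.cpu:5.1f}%")
```

Sorting descending by CPU puts the busiest process at the top, which is exactly where your eye lands first in Task Manager.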
On the Details tab, you can also right-click the column headings and add columns, such as peak memory usage and the exact command line, just to name a few.

Figure 24.3 The Details tab in Task Manager

Using Resource Monitor, you can get a much more detailed view than what is displayed in Task Manager. You can open Resource Monitor with the shortcut on the lower left of the Performance tab in Task Manager, as shown in Figure 24.4. This tool allows you to read real-time performance data on every process on the operating system. Resource Monitor also allows you to sort details, the same as Task Manager. You can click each critical area and drill down to the performance issue.

A unique feature of Resource Monitor is its selective visualization of data. When you select a process in the upper view, Resource Monitor automatically filters the activity of the critical area, as shown in Figure 24.5. In this example, the Edge browser processes have been selected and then the Network tab chosen to display their network activity and connections. The result is the isolation of network activity for these processes. This can be done for any of the critical areas.

Figure 24.4 Resource Monitor
Figure 24.5 Selective isolation in Resource Monitor

Now that you’ve isolated the problem to an action or process in the operating system, you need to do the following:

- Formulate a theory of probable cause.
- Test the theory to determine the cause.
- Establish a plan of action to resolve the problem and implement the solution.
- Verify full system functionality and, if applicable, implement preventive measures.
- Document findings, actions, and outcomes.

If this all sounds familiar, these are the steps from Chapter 13. The theory of probable cause may be that a hardware upgrade is required because a new version of the application demands more resources. Or it could be as simple as the job the application is running carrying a heavier load than normal.
Remember to question the obvious and do the simple things, such as rebooting, to see if the problem goes away. It’s a running joke that problems always go away after a reboot, and it’s not too far from the truth. Sometimes a hung process is affecting another process, and a reboot fixes them both so the symptoms go away. More likely than not, though, the problem will still be there. This is where you need to start testing your theory of probable cause to determine the cause. You might find that after running a large query the hard drive is extremely stressed. Your action plan might be to upgrade the hard drive or move the workload to a faster machine. In either case, you need to verify that the process is functioning the way the user expects it to. If you determine that a certain report in the database is the cause, educate the user that the report takes time, and perhaps schedule it for when the computer is not immediately needed. Or schedule the system for a hardware upgrade to prevent these problems in the future.

Ultimately, you should document your findings so that other technicians do not waste time on the same issue. The more intricate the problem, the more time is wasted when you have already solved it but can’t remember the answer. Always document the actions taken, such as upgrades or changes to the user’s process. You should also note the outcome, whether it was successful or showed no immediate performance increase, so that another technician working on the same or a similar issue can gauge whether the solution is effective.

Limited Connectivity

When you have limited connectivity issues, the Windows operating system will display a yellow triangle on the network icon in the lower-right corner of the screen. If you hover over the network icon, it will display the error No Internet Access. Clicking the icon will bring up the wireless connectivity settings, as shown in Figure 24.6.
In Figure 24.6, a wireless icon is displayed, but limited connectivity such as this is not exclusive to wireless connections. Limited connectivity problems can also happen with wired Ethernet connections, although the icon will be different to reflect the connection type.

Figure 24.6 No Internet access

Let’s look deeper at what is happening in the operating system to make the connectivity status work the way it does. When a connection (either wireless or wired) is detected, the built-in firewall will try to set the network location by asking the default gateway for its MAC address. After a network location is established, Windows checks whether the Internet is accessible via the Network Connection Status Indicator (NCSI). It performs this check by sending a simple web request to http://www.msftncsi.com/ncsi.txt. If the text Microsoft NCSI is received back, you have Internet connectivity. If the text does not come back, you have limited connectivity. This means that although you have a network connection or wireless association, you do not have Internet access.

The Network Connection Status Indicator (NCSI) can also detect captive portals. During the Internet connectivity check, if something is received in lieu of the text Microsoft NCSI, this means you are behind a captive portal. You will be prompted to click a balloon that pops up by the network icon, which will take you to the captive portal page. Depending on the version of Windows, the web browser may automatically start and take you to the login page for the captive portal.

The first step to solving this problem is to identify the problem. When you see the No Internet access message in the notification area, it’s because you have a network connection but no connection to the public Internet. The first thing you should do is sketch out the network, as shown in Figure 24.7. Here, you can see the laptop on which the message was received.
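The NCSI check described earlier in this section boils down to a three-way decision: no answer, the expected text, or something else. The Python sketch below mirrors that decision. It is a simplified model, not Windows’ actual implementation; the probe URL is the one given in the text, while the five-second timeout and the choice to treat any network error as “no answer” are assumptions.

```python
import http.client
import urllib.request
from typing import Optional

PROBE_URL = "http://www.msftncsi.com/ncsi.txt"  # probe URL from the text
EXPECTED_TEXT = "Microsoft NCSI"                # reply body that signals Internet access


def probe(url: str = PROBE_URL, timeout: float = 5.0) -> Optional[str]:
    """Fetch the probe URL; return the body, or None if nothing usable came back."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):
        # URLError, timeouts, and refused connections are all OSError subclasses.
        return None


def ncsi_status(body: Optional[str]) -> str:
    """Classify a probe result the way the text describes."""
    if body is None:
        return "limited connectivity"  # network present, but no Internet reply
    if body.strip() == EXPECTED_TEXT:
        return "Internet access"
    return "captive portal"            # something answered in lieu of the expected text
```

On a live machine, ncsi_status(probe()) would report which of the three states applies. Because urlopen follows redirects, a captive portal that redirects the request shows up as an unexpected body (its login page), which lands in the captive-portal branch.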
First, identify whether the laptop is actually connected to the local network. You can do this by opening a command prompt and typing ipconfig, as shown in Figure 24.8. After identifying the network connection, verify that a valid IP address is configured on the interface. Also verify that a correct default gateway is configured. Remember, the problem could be as simple as the user connecting to the wrong service set identifier (SSID).

Figure 24.7 Diagram of the problem network

Assuming the problem is not that simple, you can use this information to fill out your network map, as shown in Figure 24.9. Put as much detail as possible on this drawing. You should also add the information for a known connected host so that you have something to test against. Don’t worry about neatness; just make sure that you understand the drawing. It can even be scratched out on a lined notepad. The goal is to have something you can use to visualize the problem.

Figure 24.8 Output of the ipconfig command
Figure 24.9 Diagram with details of the problem

Now you can use the ping command to check the local host by pinging the IP address 192.168.8.100 and checking for a reply, as shown in Figure 24.10. If that succeeds, ping the router (default gateway) at 192.168.8.1. If that succeeds, ping the known good host at 192.168.8.101. If everything succeeds, you can conclude that you are connected to the local network.

Figure 24.10 Testing connectivity with the ping command

The next step is to ping a host on the Internet that is known to respond to ping. You should, however, do this by its IP address, to avoid name resolution problems. Google’s public DNS server at 8.8.8.8 is a convenient test target because it’s easy to remember. If the test fails, the problem may be your router or something external to your network. Don’t discount testing from a known good host.
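The walkthrough above checks the ipconfig output first and then pings outward one hop at a time, stopping at the first hop that doesn’t answer. The Python sketch below captures that ordering. The addresses come from the example network; the /24 prefix, the check for a self-assigned 169.254.x.x address (what Windows falls back to when no DHCP server answers), and the ping results passed in are assumptions you would replace with what ipconfig and ping actually report.

```python
import ipaddress

# Ping targets from the example network, in the order the text tests them.
LADDER = [
    ("local host", "192.168.8.100"),
    ("default gateway", "192.168.8.1"),
    ("known good host", "192.168.8.101"),
    ("Internet host by IP", "8.8.8.8"),
]


def config_problems(ip: str, prefix: int, gateway: str) -> list[str]:
    """Sanity-check what ipconfig reports for the interface."""
    problems = []
    iface = ipaddress.ip_interface(f"{ip}/{prefix}")
    if iface.ip.is_link_local:
        # 169.254.x.x means the address was self-assigned (no DHCP reply).
        problems.append("self-assigned 169.254.x.x address")
    if ipaddress.ip_address(gateway) not in iface.network:
        problems.append("default gateway is not on the local subnet")
    return problems


def first_failure(replies: dict[str, bool]) -> str:
    """Walk the ladder in order and report the first hop with no reply."""
    for label, addr in LADDER:
        if not replies.get(addr, False):
            return f"no reply from {label} ({addr})"
    return "every hop replied; test name resolution (nslookup) next"


print(config_problems("192.168.8.100", 24, "192.168.8.1"))
print(first_failure({"192.168.8.100": True, "192.168.8.1": True,
                     "192.168.8.101": True, "8.8.8.8": False}))
```

Stopping at the first failure is the point of the ladder: once the gateway doesn’t answer, a failed ping to 8.8.8.8 tells you nothing new.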
If a known good host also fails a simple ping to the Internet, then the problem is bigger than one host.

The idea of this example is to show you how to systematically collect information to begin solving the problem of limited connectivity. There can be many different causes. The following are some common causes you should be aware of:

Wrong SSID: The problem client may be connected to the wrong SSID. Connecting to the wrong SSID could place the client on the wrong network, where it may not have access to the Internet.

Static IP Address Configured: A static IP address may be configured that prevents the client from communicating on the local network and the Internet.

Internet Router Problems: If all clients are having issues getting to the Internet, then the Internet router may be at fault. The cause is sometimes a firewall rule for outbound traffic. Comparing the nonworking host to a working host can help diagnose this cause.

External Internet Problems: Tools such as ping and tracert can help to verify a problem on the external network. These tools were covered in Chapter 16, “Operating System Administration.”

DNS Resolution Problems: Name resolution can sometimes be the cause of limited connectivity issues. You need DNS to resolve fully qualified domain names (FQDNs) to IP addresses. The nslookup tool can be used to test name resolution.

Third-Party Software: Third-party software, such as antivirus and anti-malware products, has been known to interfere with network connectivity. These programs should be temporarily disabled to see if the symptoms go away. If they do, contact the vendor for support of the product.

Failure to Boot

With the introduction of Windows Vista, the boot process changed from the one used by earlier versions of Windows (XP, 2000, and NT 4.0). The boot process introduced with Windows Vista is still used today in Windows 10, and it allows for the adoption of UEFI firmware.
In order to troubleshoot a failure to boot, you need to understand the complete boot process, starting with either the BIOS or UEFI. The process will be slightly different depending on which firmware you have on the motherboard. However, the outcome is the same: the hardware hands control over to the operating system so that the operating system can boot.

BIOS: Legacy BIOS systems perform a power-on self-test (POST), and then the BIOS bootstrap routine looks at the master boot record (MBR) at the beginning of the disk. The MBR then reads the boot sector of the active (bootable) primary partition. This boot sector then loads BOOTMGR.

UEFI: UEFI firmware performs a similar POST. Then the UEFI bootstrap begins by loading drivers for the hardware. One difference is that UEFI can contain drivers that allow it to boot across a network or from other nonstandard devices. Unlike the legacy BIOS, UEFI firmware reads the GUID Partition Table (GPT) rather than the MBR. The GPT identifies partitions by globally unique identifiers (GUIDs), which the firmware uses to locate the partition containing BOOTMGR. Therefore, Windows booting in UEFI mode requires the GPT partitioning scheme and cannot use the standard MBR partitioning scheme.

The initial boot sequence from hardware control to software control is almost identical between BIOS and UEFI firmware. UEFI firmware does give you many more options, because UEFI drivers can be loaded before control is handed over to the software. This allows UEFI to treat all locations containing an operating system the same. Up to the point the hardware hands control over to the software, there is no difference between a network boot and a local disk boot.

After control is handed over to the software, several files are used to complete the operating system bootup. The most important files are as follows:

Windows Boot Manager: The Windows Boot Manager (BOOTMGR) bootstraps the system. In other words, this file starts the loading of an operating system on the computer.

BCD: The Boot Configuration Data (BCD) holds information about operating systems installed on the computer, such as the location of the operating system files.

winload.exe: Winload.exe is the program used to boot Windows. It loads the operating system kernel (ntoskrnl.exe).

winresume.exe: If the system is not starting fresh but resuming a previous session, then winresume.exe is called by BOOTMGR.

ntoskrnl.exe: The Windows OS kernel is the heart of the operating system. The kernel is responsible for allowing applications shared access to the hardware through drivers.

ntbtlog.txt: The Windows boot log stores a log of boot-time events. It is not enabled by default.

System Files: In addition to the previously listed files, Windows needs a number of files from its system folders (for example, SYSTEM and SYSTEM32), such as the hardware abstraction layer (hal.dll), the Session Manager (smss.exe), the logon process (winlogon.exe), and the security subsystem (lsass.exe). Numerous other dynamic link library (DLL) files are also required, but usually the lack or corruption of one of them produces a noncritical error, whereas the absence of hal.dll causes the system to be nonfunctional.

We’ll now look at the complete Windows boot process. It’s a long and complicated process, but keep in mind that these are complex operating systems, providing you with a lot more functionality than older versions of Windows:

1. The system self-checks and enumerates hardware resources. Each machine has a different startup routine, called the POST (power-on self-test), which is executed from instructions stored in the motherboard’s firmware. Newer PnP boards not only check memory and processors, but also poll the system for other devices and peripherals.
2. The master boot record (MBR) loads and finds the boot sector. Once the system has finished its housekeeping, the MBR is located on the first hard drive and loaded into memory. The MBR finds the bootable partition and searches it for the boot sector of that partition.
3. The MBR determines the filesystem and loads BOOTMGR. Information in the boot sector allows the system to locate the system partition and to find and load into memory the file located there.
4. BOOTMGR reads the Boot Configuration Data (BCD) to get a list of boot options for the next step. The BCD contains multiboot information or options on how the boot process should continue.
5. BOOTMGR then executes winload.exe. This switches the system from real mode (which lacks multitasking, memory protection, and those things that make Windows so great) to protected mode (which offers memory protection, multitasking, and so on) and enables paging. Protected mode enables the system to address all the available physical memory. If Windows is returning from a hibernated (suspended) state, winresume.exe is responsible for reading the hiberfil.sys file into memory and passing control to the kernel after this file is loaded.
6. The OS kernel loads the executive subsystems. Executive subsystems are software components that parse the Registry for configuration information and start needed services and drivers.
7. The HKEY_LOCAL_MACHINE\SYSTEM Registry hive and device drivers are loaded. The drivers that load at this time serve as boot drivers, using an initial value called a start value.
8. Control is passed to the kernel, which initializes loaded drivers.
9. The kernel loads the Session Manager, which then loads the Windows subsystem and completes the boot process.
10. Winlogon.exe loads. At this point, you are presented with the login screen. After you enter a username and password, you’re taken to the Windows Desktop.

Now that you understand the boot process, let’s look at how you can collect information to identify the problem.
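Because the troubleshooting approach splits the boot into a hardware portion (starting with POST) and a software portion (starting with BOOTMGR), it can help to keep the chain as an ordered list and ask which link comes right after the last one you saw succeed. The Python sketch below does that. The stage names follow the steps above; the evidence column is a rough, hypothetical guide drawn from this section (SEL viewers for firmware, Windows Recovery Environment repair tools for the boot manager stages, ntbtlog.txt once the kernel starts loading drivers), not an official mapping.

```python
from typing import NamedTuple, Optional


class Stage(NamedTuple):
    name: str
    phase: str     # "hardware" (firmware) or "software" (BOOTMGR onward)
    evidence: str  # where to look if this stage fails (hypothetical guide)


BOOT_CHAIN = [
    Stage("POST", "hardware", "firmware system event log (SEL viewer)"),
    Stage("MBR/GPT read", "hardware", "firmware system event log (SEL viewer)"),
    Stage("BOOTMGR", "software", "Windows RE: Startup Repair"),
    Stage("BCD", "software", "Windows RE: bootrec /REBUILDBCD"),
    Stage("winload.exe or winresume.exe", "software", "Windows RE: Startup Repair"),
    Stage("kernel and boot drivers", "software", "ntbtlog.txt boot log"),
    Stage("Session Manager", "software", "ntbtlog.txt boot log"),
    Stage("winlogon.exe", "software", "ntbtlog.txt boot log"),
]


def next_suspect(last_ok: Optional[str]) -> Optional[Stage]:
    """Return the stage after the last one known to succeed.
    Pass None if nothing happened at all; returns None if the boot completed."""
    if last_ok is None:
        return BOOT_CHAIN[0]
    names = [stage.name for stage in BOOT_CHAIN]
    index = names.index(last_ok)  # raises ValueError for an unknown stage name
    return BOOT_CHAIN[index + 1] if index + 1 < len(BOOT_CHAIN) else None


suspect = next_suspect("BOOTMGR")
print(suspect.name, "|", suspect.phase, "|", suspect.evidence)
```

The phase column is what matters most: anything before BOOTMGR is a firmware question, and everything after it is a Windows question.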
We’ll consider information collection in two parts: hardware and software. The hardware process begins with the POST, and the software portion of the bootstrap begins with BOOTMGR. You can collect information from the BIOS/UEFI firmware boot with third-party system event log (SEL) viewers. However, it is very unlikely that a failure to boot is caused by a BIOS/UEFI firmware issue. It’s not impossible, but it is highly unlikely.

To collect information on the software portion of the boot process, you can use boot logging. The ntbtlog.txt file is located in the root of the C:\Windows folder, as shown in Figure 24.11. Boot logging is off by default and needs to be turned on. To enable boot logging, issue the command bcdedit /set {current} bootlog Yes. You can also use the System Configuration utility (msconfig.exe) by selecting the Boot Log option on the Boot tab, as shown in Figure 24.12. Because the BCD is read by BOOTMGR, this point of the boot process is where logging begins, and the first entries will be the loading of the kernel.

Figure 24.11 The ntbtlog.txt file
Figure 24.12 System Configuration options for boot logging

Chances are, if you’re having trouble booting into Windows, you won’t be able to access the command prompt to issue bcdedit commands, nor will you be able to access msconfig.exe. Not to worry. You can still enable logging by allowing the operating system to fail to boot two times in a row. The third time, the computer will boot into the Windows Recovery Environment. From there, click Troubleshooting ➢ Advanced Options ➢ See More Recovery Options ➢ Startup Settings ➢ Restart. When the computer restarts, it will boot into the Startup Settings menu, as shown in Figure 24.13, where you can choose Enable Boot Logging. Of course, if the computer still won’t boot, you won’t be able to retrieve the log file from within Windows, but you can use the command prompt in the Windows Recovery Environment.

Figure 24.13 Startup Settings menu

The idea is to collect information to identify the problem and, above all, to fix the problem.
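Once you’ve copied ntbtlog.txt off the machine (or opened it from the Windows Recovery Environment command prompt), a quick scan separates the drivers that loaded from those that didn’t. The Python sketch below assumes the usual line formats, “Loaded driver <path>” and “Did not load driver <path>”, and assumes the file may be saved as UTF-16 with a byte-order mark; check both assumptions against your own log. Keep in mind that a healthy system also logs plenty of “Did not load driver” entries, so the list is a starting point rather than a verdict.

```python
from pathlib import Path
from typing import Optional

LOADED = "Loaded driver "
NOT_LOADED = "Did not load driver "


def read_boot_log(path: str) -> list[str]:
    """Read ntbtlog.txt, assuming UTF-16 when a byte-order mark is present."""
    data = Path(path).read_bytes()
    encoding = "utf-16" if data[:2] in (b"\xff\xfe", b"\xfe\xff") else "utf-8"
    return data.decode(encoding, errors="replace").splitlines()


def summarize_boot_log(lines: list[str]) -> dict:
    """Split driver entries into loaded and not loaded, keeping log order."""
    loaded, not_loaded = [], []
    for line in lines:
        line = line.strip()
        if line.startswith(NOT_LOADED):
            not_loaded.append(line[len(NOT_LOADED):])
        elif line.startswith(LOADED):
            loaded.append(line[len(LOADED):])
    last_loaded: Optional[str] = loaded[-1] if loaded else None
    # If the boot hangs, the last driver that loaded is one place to start looking.
    return {"loaded": loaded, "not_loaded": not_loaded, "last_loaded": last_loaded}


# Hypothetical excerpt for illustration.
sample = [
    r"Loaded driver \SystemRoot\system32\ntoskrnl.exe",
    r"Loaded driver \SystemRoot\system32\hal.dll",
    r"Did not load driver \SystemRoot\System32\drivers\sacdrv.sys",
]
report = summarize_boot_log(sample)
print(report["last_loaded"], len(report["not_loaded"]))
```

On a real system you would pass read_boot_log(r"C:\Windows\ntbtlog.txt") to summarize_boot_log instead of the sample list.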
Sometimes you need to let Windows repair itself. The Windows Recovery Environment contains a Startup Repair option. Using the Startup Repair option is similar to issuing bootrec /REBUILDBCD at the command prompt, which will rebuild the BCD. If that fails, the ultimate solution might be to use the Reset This PC option in the Windows Recovery Environment or to install the operating system from scratch.

Operating System Not Found

When it’s reported that an operating system is missing, the first thing to check is that no media is in the machine (DVD, CD, and so on). The system may be reading this media during boot before accessing the hard drive. If that is the case, remove the media and reboot. (Down the road, change the BIOS/UEFI settings to boot from the hard drive before any other media.)
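Why does a forgotten disc cause this error? The Python sketch below models one common firmware behavior: it walks the boot order, stops at the first device that has media present, and reports failure if that media isn’t bootable. That stopping rule is an assumption; plenty of firmware falls through to the next device instead, and the device names are hypothetical. The model also shows why moving the hard drive ahead of removable media avoids the problem.

```python
from typing import Optional


def pick_boot_device(boot_order: list[str],
                     media_present: dict[str, bool],
                     bootable: dict[str, bool]) -> tuple[Optional[str], bool]:
    """Return (device tried, booted?). Assumes the firmware gives up at the
    first device with media instead of falling through to the next one."""
    for device in boot_order:
        if media_present.get(device, False):
            return device, bootable.get(device, False)
    return None, False  # nothing to boot from at all


media = {"DVD": True, "HDD": True}      # a data disc was left in the tray
bootable = {"DVD": False, "HDD": True}

print(pick_boot_device(["DVD", "HDD"], media, bootable))  # disc first: fails
print(pick_boot_device(["HDD", "DVD"], media, bootable))  # hard drive first: boots
```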
