(The Morgan Kaufmann Series in Computer Architecture and Design) David A. Patterson, John L. Hennessy - Computer Organization and Design RISC-V Edition_ The Hardware Software Interface-Morgan Kaufmann-24-101-28-39.pdf

1.6 Performance 29 1.6 Performance Assessing the performance of computers can be quite challenging. The scale and intricacy of modern software systems, together with the wide range of perfor...

1.6 Performance 29 1.6 Performance Assessing the performance of computers can be quite challenging. The scale and intricacy of modern software systems, together with the wide range of performance improvement techniques employed by hardware designers, have made performance assessment much more difficult. When trying to choose among different computers, performance is an important attribute. Accurately measuring and comparing different computers is critical to purchasers and therefore, to designers. The people selling computers know this as well. Often, salespeople would like you to see their computer in the best possible light, whether or not this light accurately reflects the needs of the purchaser’s application. Hence, understanding how best to measure performance and the limitations of those measurements is important in selecting a computer. The rest of this section describes different ways in which performance can be determined; then, we describe the metrics for measuring performance from the viewpoint of both a computer user and a designer. We also look at how these metrics are related and present the classical processor performance equation, which we will use throughout the text. Defining Performance When we say one computer has better performance than another, what do we mean? Although this question might seem simple, an analogy with passenger airplanes shows how subtle the question of performance can be. Figure 1.14 lists some typical passenger airplanes, together with their cruising speed, range, and capacity. If we wanted to know which of the planes in this table had the best performance, we would first need to define performance. For example, considering different measures of performance, we see that the plane with the highest cruising speed was the Concorde (retired from service in 2003), the plane with the longest range is the Boeing 777- 200LR, and the plane with the largest capacity is the Airbus A380-800. Passenger Cruising range Cruising speed Passenger throughput Airplane capacity (miles) (m.p.h.) (passengers × m.p.h.) Boeing 737 240 3000 0564 135,360 BAC/Sud Concorde 132 4000 01350 178,200 Boeing 777-200LR 301 9395 554 166,761 Airbus A380-800 853 8477 0587 500,711 FIGURE 1.14 The capacity, range, and speed for a number of commercial airplanes. The last column shows the rate at which the airplane transports passengers, which is the capacity times the cruising speed (ignoring range and takeoff and landing times). Let’s suppose we define performance in terms of speed. This still leaves two possible definitions. You could define the fastest plane as the one with the highest 30 Chapter 1 Computer Abstractions and Technology response time Also cruising speed, taking a single passenger from one point to another in the least time. called execution time. If you were interested in transporting 500 passengers from one point to another, The total time required however, the Airbus A380-800 would clearly be the fastest, as the last column of the for the computer to figure shows. Similarly, we can define computer performance in several distinct ways. complete a task, including disk accesses, memory If you were running a program on two different desktop computers, you’d say accesses, I/O activities, that the faster one is the desktop computer that gets the job done first. If you were operating system running a datacenter that had several servers running jobs submitted by many users, overhead, CPU execution you’d say that the faster computer was the one that completed the most jobs during time, and so on. a day. As an individual computer user, you are interested in reducing response time—the time between the start and completion of a task—also referred to as throughput Also called execution time. Datacenter managers often care about increasing throughput or bandwidth. Another bandwidth—the total amount of work done in a given time. Hence, in most cases, measure of performance, we will need different performance metrics as well as different sets of applications it is the number of tasks to benchmark personal mobile devices, which are more focused on response time, completed per unit time. versus servers, which are more focused on throughput. Throughput and Response Time Do the following changes to a computer system increase throughput, decrease EXAMPLE response time, or both? 1. Replacing the processor in a computer with a faster version 2. Adding additional processors to a system that uses multiple processors for separate tasks—for example, searching the web Decreasing response time almost always improves throughput. Hence, in case ANSWER 1, both response time and throughput are improved. In case 2, no one task gets work done faster, so only throughput increases. If, however, the demand for processing in the second case was almost as large as the throughput, the system might force requests to queue up. In this case, increasing the throughput could also improve response time, since it would reduce the waiting time in the queue. Thus, in many real computer systems, changing either execution time or throughput often affects the other. In discussing the performance of computers, we will be primarily concerned with response time for the first few chapters. To maximize performance, we want to minimize response time or execution time for some task. Thus, we can relate performance and execution time for a computer X: 1 Performance X Execution time X This means that for two computers X and Y, if the performance of X is greater than the performance of Y, we have 1.6 Performance 31 Performance X Performance Y 1 1 Execution time X Execution time Y Execution time Y Execution time X That is, the execution time on Y is longer than that on X, if X is faster than Y. In discussing a computer design, we often want to relate the performance of two different computers quantitatively. We will use the phrase “X is n times faster than Y”—or equivalently “X is n times as fast as Y”—to mean Performance X n Performance Y If X is n times as fast as Y, then the execution time on Y is n times as long as it is on X: Performance X Execution time Y n Performance Y Execution time X Relative Performance If computer A runs a program in 10 seconds and computer B runs the same EXAMPLE program in 15 seconds, how much faster is A than B? We know that A is n times as fast as B if ANSWER Performance A Execution timeB n PerformanceB Execution time A Thus the performance ratio is 15 1.5 10 and A is therefore 1.5 times as fast as B. In the above example, we could also say that computer B is 1.5 times slower than computer A, since Performance A 1. 5 PerformanceB 32 Chapter 1 Computer Abstractions and Technology means that Performance A PerformanceB 1. 5 For simplicity, we will normally use the terminology as fast as when we try to compare computers quantitatively. Because performance and execution time are reciprocals, increasing performance requires decreasing execution time. To avoid the potential confusion between the terms increasing and decreasing, we usually say “improve performance” or “improve execution time” when we mean “increase performance” and “decrease execution time.” Measuring Performance Time is the measure of computer performance: the computer that performs the same amount of work in the least time is the fastest. Program execution time is measured in seconds per program. However, time can be defined in different ways, depending on what we count. The most straightforward definition of time is called wall clock time, response time, or elapsed time. These terms mean the total time to complete a task, including disk accesses, memory accesses, input/output (I/O) activities, operating system overhead—everything. Computers are often shared, however, and a processor may work on several programs simultaneously. In such cases, the system may try to optimize throughput rather than attempt to minimize the elapsed time for one program. Hence, we might want to distinguish between the elapsed time and the time over which the processor is CPU execution working on our behalf. CPU execution time or simply CPU time, which recognizes time Also called CPU this distinction, is the time the CPU spends computing for this task and does not time. The actual time the CPU spends computing include time spent waiting for I/O or running other programs. (Remember, though, for a specific task. that the response time experienced by the user will be the elapsed time of the program, not the CPU time.) CPU time can be further divided into the CPU time spent in the user CPU time The program, called user CPU time, and the CPU time spent in the operating system CPU time spent in a performing tasks on behalf of the program, called system CPU time. Differentiating program itself. between system and user CPU time is difficult to do accurately, because it is often hard system CPU time The to assign responsibility for operating system activities to one user program rather than CPU time spent in another and because of the functionality differences between operating systems. the operating system For consistency, we maintain a distinction between performance based on performing tasks on elapsed time and that based on CPU execution time. We will use the term system behalf of the program. performance to refer to elapsed time on an unloaded system and CPU performance to refer to user CPU time. We will focus on CPU performance in this chapter, although our discussions of how to summarize performance can be applied to either elapsed time or CPU time measurements. Understanding Different applications are sensitive to different aspects of the performance of a computer system. Many applications, especially those running on servers, depend Program as much on I/O performance, which, in turn, relies on both hardware and software. Performance Total elapsed time measured by a wall clock is the measurement of interest. In 1.6 Performance 33 some application environments, the user may care about throughput, response time, or a complex combination of the two (e.g., maximum throughput with a worst-case response time). To improve the performance of a program, one must have a clear definition of what performance metric matters and then proceed to find performance bottlenecks by measuring program execution and looking for the likely bottlenecks. In the following chapters, we will describe how to search for bottlenecks and improve performance in various parts of the system. Although as computer users we care about time, when we examine the details clock cycle Also called of a computer it’s convenient to think about performance in other metrics. In tick, clock tick, clock particular, computer designers may want to think about a computer by using a period, clock, or cycle. measure that relates to how fast the hardware can perform basic functions. Almost The time for one clock period, usually of the all computers are constructed using a clock that determines when events take processor clock, which place in the hardware. These discrete time intervals are called clock cycles (or runs at a constant rate. ticks, clock ticks, clock periods, clocks, cycles). Designers refer to the length of a clock period both as the time for a complete clock cycle (e.g., 250 picoseconds, or 250 ps) and as the clock rate (e.g., 4 gigahertz, or 4 GHz), which is the inverse clock period The length of each clock cycle. of the clock period. In the next subsection, we will formalize the relationship between the clock cycles of the hardware designer and the seconds of the computer user. 1. Suppose we know that an application that uses both personal mobile Check devices and the Cloud is limited by network performance. For the following Yourself changes, state whether only the throughput improves, both response time and throughput improve, or neither improves. a. An extra network channel is added between the PMD and the Cloud, increasing the total network throughput and reducing the delay to obtain network access (since there are now two channels). b. The networking software is improved, thereby reducing the network communication delay, but not increasing throughput. c. More memory is added to the computer. 2. Computer C’s performance is four times as fast as the performance of computer B, which runs a given application in 28 seconds. How long will computer C take to run that application? CPU Performance and Its Factors Users and designers often examine performance using different metrics. If we could relate these different metrics, we could determine the effect of a design change on the performance as experienced by the user. Since we are confining ourselves to CPU performance at this point, the bottom-line performance measure is CPU 34 Chapter 1 Computer Abstractions and Technology execution time. A simple formula relates the most basic metrics (clock cycles and clock cycle time) to CPU time: CPU execution time CPU clock cycles for a program for a program Clock cycle time Alternatively, because clock rate and clock cycle time are inverses, CPU execution time CPU clock cycles for a program for a program Clock rate This formula makes it clear that the hardware designer can improve performance by reducing the number of clock cycles required for a program or the length of the clock cycle. As we will see in later chapters, the designer often faces a trade-off between the number of clock cycles needed for a program and the length of each cycle. Many techniques that decrease the number of clock cycles may also increase the clock cycle time. Improving Performance EXAMPLE Our favorite program runs in 10 seconds on computer A, which has a 2 GHz clock. We are trying to help a computer designer build a computer, B, which will run this program in 6 seconds. The designer has determined that a substantial increase in the clock rate is possible, but this increase will affect the rest of the CPU design, causing computer B to require 1.2 times as many clock cycles as computer A for this program. What clock rate should we tell the designer to target? ANSWER Let’s first find the number of clock cycles required for the program on A: CPU clock cycles A CPU time A Clock rate A CPU clock cycles A 10 seconds cycles 2 109 second cycles CPU clock cycles A 10 seconds 2 109 20 109 cycles second 1.6 Performance 35 CPU time for B can be found using this equation: 1. 2 CPU clock cycles A CPU timeB Clock rateB 1. 2 20 109 cycles 6 seconds Clock rateB 1. 2 20 109 cycles 0. 2 20 109 cycles 4 109 cycles Clock rateB 4 GHz 6 seconds second second To run the program in 6 seconds, B must have twice the clock rate of A. Instruction Performance The performance equations above did not include any reference to the number of instructions needed for the program. However, since the compiler clearly generated instructions to execute, and the computer had to execute the instructions to run the program, the execution time must depend on the number of instructions in a program. One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction. Therefore, the number of clock cycles required for a program can be written as Average clock cycles CPU clock cycles Instructions for a program per instruction The term clock cycles per instruction, which is the average number of clock clock cycles cycles each instruction takes to execute, is often abbreviated as CPI. Since different per instruction instructions may take different amounts of time depending on what they do, CPI (CPI) Average number of clock cycles per is an average of all the instructions executed in the program. CPI provides one instruction for a program way of comparing two different implementations of the identical instruction set or program fragment. architecture, since the number of instructions executed for a program will, of course, be the same. Using the Performance Equation Suppose we have two implementations of the same instruction set architecture. Computer A has a clock cycle time of 250 ps and a CPI of 2.0 for some program, and computer B has a clock cycle time of 500 ps and a CPI of 1.2 for the same EXAMPLE program. Which computer is faster for this program and by how much? 36 Chapter 1 Computer Abstractions and Technology We know that each computer executes the same number of instructions for ANSWER the program; let’s call this number I. First, find the number of processor clock cycles for each computer: CPU clock cycles A I 2.0 CPU clock cycles B I 1.2 Now we can compute the CPU time for each computer: CPU time A CPU clock cycles A Clock cycle time I 2.0 250 ps 500 I ps Likewise, for B: CPU timeB I 1.2 500 ps 600 I ps Clearly, computer A is faster. The amount faster is given by the ratio of the execution times: CPU performance A Execution timeB 600 I ps 1.2 CPU performanceB Execution time A 500 I ps We can conclude that computer A is 1.2 times as fast as computer B for this program. The Classic CPU Performance Equation instruction count The We can now write this basic performance equation in terms of instruction count number of instructions (the number of instructions executed by the program), CPI, and clock cycle time: executed by the program. CPU time Instructioncount CPI Clock cycle time or, since the clock rate is the inverse of clock cycle time: Instruction count CPI CPU time Clock rate These formulas are particularly useful because they separate the three key factors that affect performance. We can use these formulas to compare two different implementations or to evaluate a design alternative if we know its impact on these three parameters. 1.6 Performance 37 Comparing Code Segments A compiler designer is trying to decide between two code sequences for a computer. The hardware designers have supplied the following facts: EXAMPLE CPI for each instruction class A B C CPI 1 2 3 For a particular high-level language statement, the compiler writer is considering two code sequences that require the following instruction counts: Instruction counts for each instruction class Code sequence A B C 1 2 1 2 2 4 1 1 Which code sequence executes the most instructions? Which will be faster? What is the CPI for each sequence? Sequence 1 executes 2 + 1 + 2 = 5 instructions. Sequence 2 executes 4 + 1 + 1 = 6 instructions. Therefore, sequence 1 executes fewer instructions. ANSWER We can use the equation for CPU clock cycles based on instruction count and CPI to find the total number of clock cycles for each sequence: n CPU clock cycles ∑ (CPIi Ci ) i=1 This yields CPU clock cycles1 (2 1) (1 2) (2 3) 2 2 6 10 cycles CPU clock cycles2 (4 1) (1 2) (1 3) 4 2 3 9 cycles So code sequence 2 is faster, even though it executes one extra instruction. Since code sequence 2 takes fewer overall clock cycles but has more instructions, it must have a lower CPI. The CPI values can be computed by CPU clock cycles CPI Instruction count CPU clock cycles1 10 CPI1 2. 0 Instruction count1 5 CPU clock cycles2 9 CPI2 1. 5 Instruction count 2 6 38 Chapter 1 Computer Abstractions and Technology Figure 1.15 shows the basic measurements at different levels in the computer and what is being measured in each case. We can see how these factors are combined to yield execution time measured in seconds per program: Instructions Clock cycles Seconds Time Seconds/Program Program Instrucction Clock cycle TheBIG Always bear in mind that the only complete and reliable measure of Picture computer performance is time. For example, changing the instruction set to lower the instruction count may lead to an organization with a slower clock cycle time or higher CPI that offsets the improvement in instruction count. Similarly, because CPI depends on the type of instructions executed, the code that executes the fewest number of instructions may not be the fastest. CPU execution time for a program Seconds for the program Instruction count Instructions executed for the program Clock cycles per instruction (CPI) Average number of clock cycles per instruction Clock cycle time Seconds per clock cycle FIGURE 1.15 The basic components of performance and how each is measured. How can we determine the value of these factors in the performance equation? We can measure the CPU execution time by running the program, and the clock cycle time is usually published as part of the documentation for a computer. The instruction count and CPI can be more difficult to obtain. Of course, if we know the clock rate and CPU execution time, we need only one of the instruction count or the CPI to determine the other. We can measure the instruction count by using software tools that profile the execution or by using a simulator of the architecture. Alternatively, we can use hardware counters, which are included in most processors, to record a variety of measurements, including the number of instructions executed, the average CPI, and often, the sources of performance loss. Since the instruction count depends on the architecture, but not on the exact implementation, we can measure the instruction count without knowing all the details of the implementation. The CPI, however, depends on a wide variety of design details in the computer, including both the memory system and the processor structure (as we will see in Chapter 4 and Chapter 5), as well as on the mix of instruction types executed in an application. Thus, CPI varies by application, as well as among implementations with the same instruction set. 1.6 Performance 39 The above example shows the danger of using only one factor (instruction count) to assess performance. When comparing two computers, you must look at all three components, which combine to form execution time. If some of the factors are identical, like the clock rate in the above example, performance can be determined by comparing all the nonidentical factors. Since CPI varies by instruction mix, instruction mix both instruction count and CPI must be compared, even if clock rates are equal. A measure of the dynamic frequency of instructions Several exercises at the end of this chapter ask you to evaluate a series of computer across one or many and compiler enhancements that affect clock rate, CPI, and instruction count. In programs. Section 1.10, we’ll examine a common performance measurement that does not incorporate all the terms and can thus be misleading. The performance of a program depends on the algorithm, the language, the Understanding compiler, the architecture, and the actual hardware. The following table summarizes Program how these components affect the factors in the CPU performance equation. Performance Hardware or software component Affects what? How? Algorithm Instruction count, The algorithm determines the number of source program CPI instructions executed and hence the number of processor instructions executed. The algorithm may also affect the CPI, by favoring slower or faster instructions. For example, if the algorithm uses more divides, it will tend to have a higher CPI. Programming Instruction count, The programming language certainly affects the instruction language CPI count, since statements in the language are translated to processor instructions, which determine instruction count. The language may also affect the CPI because of its features; for example, a language with heavy support for data abstraction (e.g., Java) will require indirect calls, which will use higher CPI instructions. Compiler Instruction count, The efficiency of the compiler affects both the instruction CPI count and average cycles per instruction, since the compiler determines the translation of the source language instructions into computer instructions. The compiler’s role can be very complex and affect the CPI in varied ways. Instruction set Instruction count, The instruction set architecture affects all three aspects of architecture clock rate, CPI CPU performance, since it affects the instructions needed for a function, the cost in cycles of each instruction, and the overall clock rate of the processor. Elaboration: Although you might expect that the minimum CPI is 1.0, as we’ll see in Chapter 4, some processors fetch and execute multiple instructions per clock cycle. To reflect that approach, some designers invert CPI to talk about IPC, or instructions per clock cycle. If a processor executes on average two instructions per clock cycle, then it has an IPC of 2 and hence a CPI of 0.5. 40 Chapter 1 Computer Abstractions and Technology Elaboration: Although clock cycle time has traditionally been fixed, to save energy or temporarily boost performance, today’s processors can vary their clock rates, so we would need to use the average clock rate for a program. For example, the Intel Core i7 will temporarily increase clock rate by about 10% until the chip gets too warm. Intel calls this Turbo mode. Check A given application written in Java runs 15 seconds on a desktop processor. A new Yourself Java compiler is released that requires only 0.6 as many instructions as the old compiler. Unfortunately, it increases the CPI by 1.1. How fast can we expect the application to run using this new compiler? Pick the right answer from the three choices below: 15 0. 6 a. 8.2 sec 1. 1 b. 15 0. 6 1. 1 9.9 sec 1. 5 1. 1 c. 27.5 sec 0. 6 1.7 The Power Wall Figure 1.16 shows the increase in clock rate and power of nine generations of Intel microprocessors over 36 years. Both clock rate and power increased rapidly for decades and then flattened or dropped off recently. The reason they grew together is that they are correlated, and the reason for their recent slowing is that we have run into the practical power limit for cooling commodity microprocessors. 10,000 120 3600 2667 3300 3500 3500 3600 2000 100 1000 103 Clock Rate (MHz) 95 91 95 Clock Rate 200 Power (watts) 87 80 75.3 84 100 66 60 25 12.5 16 Power 40 10 10.1 29.1 20 3.3 4.1 4.9 1 0 Pentium Pro (1997) Pentium (1993) 80286 (1982) 80386 (1985) 80486 (1989) Pentium 4 Willamette (2001) Pentium 4 Prescott (2004) Core 2 Kentsfield (2007) Core i5 Coffee Lake (2018) Core i5 Clarkdale (2010) Core i5 Haswell (2013) Core i5 Skylake (2016) FIGURE 1.16 Clock rate and power for Intel x86 microprocessors over nine generations and 36 years. The Pentium 4 made a dramatic jump in clock rate and power but less so in performance. The Prescott thermal problems led to the abandonment of the Pentium 4 line. The Core 2 line reverts to a simpler pipeline with lower clock rates and multiple processors per chip. The Core i5 pipelines follow in its footsteps.

(The Morgan Kaufmann Series in Computer Architecture and Design) David A. Patterson, John L. Hennessy - Computer Organization and Design RISC-V Edition_ The Hardware Software Interface-Morgan Kaufmann-24-101-28-39.pdf

Document Details

Tags

Related

Full Transcript