(The Morgan Kaufmann Series in Computer Architecture and Design) David A. Patterson, John L. Hennessy - Computer Organization and Design RISC-V Edition_ The Hardware Software Interface-Morgan Kaufmann-24-101-42-45.pdf
1.8 The Sea Change: The Switch from Uniprocessors to Multiprocessors

“Up to now, most software has been like music written for a solo performer; with the current generation of chips we’re getting a little experience with duets and quartets and other small ensembles; but scoring a work for large orchestra and chorus is a different kind of challenge.”
Brian Hayes, Computing in a Parallel Universe, 2007.

The power limit has forced a dramatic change in the design of microprocessors. Figure 1.17 shows the improvement in response time of programs for desktop microprocessors over time. Since 2002, the rate has slowed from a factor of 1.5 per year to a factor of only 1.03 per year.

Rather than continuing to decrease the response time of one program running on the single processor, as of 2006 all desktop and server companies are shipping microprocessors with multiple processors per chip, where the benefit is often more on throughput than on response time. To reduce confusion between the words processor and microprocessor, companies refer to processors as “cores,” and such microprocessors are generically called multicore microprocessors. Hence, a “quadcore” microprocessor is a chip that contains four processors or four cores.

In the past, programmers could rely on innovations in hardware, architecture, and compilers to double performance of their programs every 18 months without having to change a line of code. Today, for programmers to get significant improvement in response time, they need to rewrite their programs to take advantage of multiple processors. Moreover, to get the historic benefit of running faster on new microprocessors, programmers will have to continue to improve the performance of their code as the number of cores increases.
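To put the growth factors quoted above in perspective, the following minimal Python sketch converts an annual improvement factor into a doubling time, and a doubling period into an annual factor. The function names are hypothetical and the model assumes a constant compound growth rate, which is a simplification of the measured curve in Figure 1.17.

```python
import math


def doubling_time_years(annual_factor: float) -> float:
    """Years needed for performance to double at a constant annual growth factor."""
    if annual_factor <= 1.0:
        raise ValueError("annual_factor must be greater than 1 for growth")
    return math.log(2.0) / math.log(annual_factor)


def annual_factor_for_doubling(months: float) -> float:
    """Annual growth factor implied by doubling once every `months` months."""
    return 2.0 ** (12.0 / months)


if __name__ == "__main__":
    # Growth factors quoted in the text: 1.5 per year before 2002, 1.03 per year after.
    for factor in (1.5, 1.03):
        years = doubling_time_years(factor)
        print(f"factor {factor} per year: doubles in {years:.1f} years")
    # "Double performance every 18 months" expressed as an annual factor.
    print(f"doubling every 18 months: factor {annual_factor_for_doubling(18):.2f} per year")
```

Under this assumption, a factor of 1.5 per year doubles performance in roughly 1.7 years, whereas a factor of 1.03 per year needs more than 23 years, which is why waiting for faster uniprocessors is no longer a viable strategy.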
To reinforce how the software and hardware systems work together, we use a special section, Hardware/Software Interface, throughout the book, with the first one appearing below. These elements summarize important insights at this critical interface.

Hardware/Software Interface

Parallelism has always been crucial to performance in computing, but it was often hidden. Chapter 4 will explain pipelining, an elegant technique that runs programs faster by overlapping the execution of instructions. Pipelining is one example of instruction-level parallelism, where the parallel nature of the hardware is abstracted away so the programmer and compiler can think of the hardware as executing instructions sequentially.

Forcing programmers to be aware of the parallel hardware and to rewrite their programs to be parallel had been the “third rail” of computer architecture, for companies in the past that depended on such a change in behavior failed (see Section 6.15). From this historical perspective, it’s startling that the whole IT industry bet its future that programmers will successfully switch to explicitly parallel programming.

Chapter 1 Computer Abstractions and Technology

[Figure 1.17: plot of performance relative to the VAX-11/780 (vertical axis, logarithmic scale from 1 to 100,000) against year (horizontal axis, 1978 to 2018), with data points for individual machines and annotated growth rates of 25%/year, 52%/year, 23%/year, 12%/year, and 3.5%/year.]

FIGURE 1.17 Growth in processor performance since the mid-1980s. This chart plots performance relative to the VAX 11/780 as measured by the SPECint benchmarks (see Section 1.11). Prior to the mid-1980s, processor performance growth was largely technology-driven and averaged about 25% per year. The increase in growth to about 52% since then is attributable to more advanced architectural and organizational ideas. The higher annual performance improvement of 52% since the mid-1980s meant performance was about a factor of seven larger in 2002 than it would have been had it stayed at 25%. Since 2002, the limits of power, available instruction-level parallelism, and long memory latency have slowed uniprocessor performance recently, to about 3.5% per year.

Why has it been so hard for programmers to write explicitly parallel programs? The first reason is that parallel programming is by definition performance programming, which increases the difficulty of programming.
Not only does the program need to be correct, solve an important problem, and provide a useful interface to the people or other programs that invoke it; the program must also be fast. Otherwise, if you don’t need performance, just write a sequential program.

The second reason is that fast for parallel hardware means that the programmer must divide an application so that each processor has roughly the same amount to do at the same time, and that the overhead of scheduling and coordination doesn’t fritter away the potential performance benefits of parallelism.

As an analogy, suppose the task was to write a newspaper story. Eight reporters working on the same story could potentially write a story eight times faster. To achieve this increased speed, one would need to break up the task so that each reporter had something to do at the same time. Thus, we must schedule the sub-tasks. If anything went wrong and just one reporter took longer than the seven others did, then the benefits of having eight writers would be diminished. Thus, we must balance the load evenly to get the desired speedup. Another danger would be if reporters had to spend a lot of time talking to each other to write their sections. You would also fall short if one part of the story, such as the conclusion, couldn’t be written until all the other parts were completed. Thus, care must be taken to reduce communication and synchronization overhead. For both this analogy and parallel programming, the challenges include scheduling, load balancing, time for synchronization, and overhead for communication between the parties. As you might guess, the challenge is stiffer with more reporters for a newspaper story and more processors for parallel programming.
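The newspaper analogy can be made concrete with a small Python sketch. It uses a deliberately simple model that is an assumption of this example, not something taken from the text: all workers start together, the parallel elapsed time is the slowest worker's time plus a fixed coordination overhead, and the sequential time is the sum of all the pieces. The function names and the minute values are hypothetical.

```python
def parallel_time(task_times, overhead=0.0):
    """Elapsed time when every worker runs its own piece concurrently.

    Assumed model: everyone starts together, so the slowest worker
    determines the finish time, and `overhead` is added on top for
    communication and synchronization.
    """
    return max(task_times) + overhead


def speedup(task_times, overhead=0.0):
    """Speedup over one worker doing all the pieces back to back."""
    sequential_time = sum(task_times)
    return sequential_time / parallel_time(task_times, overhead)


if __name__ == "__main__":
    balanced = [10] * 8          # eight reporters, 10 minutes each (80 minutes of work)
    unbalanced = [24] + [8] * 7  # same 80 minutes of work, but one reporter has 24

    print("balanced:", speedup(balanced))
    print("one slow reporter:", speedup(unbalanced))
    print("balanced with 6 minutes of coordination:", speedup(balanced, overhead=6))
```

In this model the balanced split reaches the ideal speedup of 8, a single overloaded reporter cuts it to about 3.3, and six minutes of coordination reduces even the balanced case to 5. Real programs are messier, since overhead usually grows with the number of workers, but the sketch shows why load balance and synchronization cost both limit the benefit.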
To reflect this sea change in the industry, the next five chapters in this edition of the book each has a section on the implications of the parallel revolution to that chapter:

Chapter 2, Section 2.11: Parallelism and Instructions: Synchronization. Usually independent parallel tasks need to coordinate at times, such as to say when they have completed their work. This chapter explains the instructions used by multicore processors to synchronize tasks.

Chapter 3, Section 3.6: Parallelism and Computer Arithmetic: Subword Parallelism. Perhaps the simplest form of parallelism to build involves computing on elements in parallel, such as when multiplying two vectors. Subword parallelism relies on wider arithmetic units that can operate on many operands simultaneously.

Chapter 4, Section 4.10: Parallelism via Instructions. Given the difficulty of explicitly parallel programming, tremendous effort was invested in the 1990s in having the hardware and the compiler uncover implicit parallelism, initially via pipelining. This chapter describes some of these aggressive techniques, including fetching and executing multiple instructions concurrently and guessing on the outcomes of decisions, and executing instructions speculatively using prediction.

Chapter 5, Section 5.10: Parallelism and Memory Hierarchies: Cache Coherence. One way to lower the cost of communication is to have all processors use the same address space, so that any processor can read or write any data. Given that all processors today use caches to keep a temporary copy of the data in faster memory near the processor, it’s easy to imagine that parallel programming would be even more difficult if the caches associated with each processor had inconsistent values of the shared data. This chapter describes the mechanisms that keep the data in all caches consistent.

Chapter 5, Section 5.11: Parallelism and Memory Hierarchy: Redundant Arrays of Inexpensive Disks.
This section describes how using many disks in conjunction can offer much higher throughput, which was the original inspiration of Redundant Arrays of Inexpensive Disks (RAID). The real popularity of RAID proved to be the much greater dependability offered by including a modest number of redundant disks. The section explains the differences in performance, cost, and dependability between the various RAID levels.

In addition to these sections, there is a full chapter on parallel processing. Chapter 6 goes into more detail on the challenges of parallel programming; presents the two contrasting approaches to communication of shared addressing and explicit message passing; describes a restricted model of parallelism that is easier to program; discusses the difficulty of benchmarking parallel processors; introduces a new simple performance model for multicore microprocessors; and, finally, describes and evaluates four examples of multicore microprocessors using this model.

As mentioned above, Chapters 3 to 6 use matrix vector multiply as a running example to show how each type of parallelism can significantly increase performance.

Appendix B describes an increasingly popular hardware component that is included with desktop computers, the graphics processing unit (GPU). Invented to accelerate graphics, GPUs are becoming programming platforms in their own right. As you might expect, given these times, GPUs rely on parallelism. Appendix B describes the NVIDIA GPU and highlights parts of its parallel programming environment.

1.9 Real Stuff: Benchmarking the Intel Core i7

“I thought [computers] would be a universally applicable idea, like a book is. But I didn’t think it would develop as fast as it did, because I didn’t envision we’d be able to get as many parts on a chip as we finally got. The transistor came along unexpectedly. It all happened much faster than we expected.”
J. Presper Eckert, coinventor of ENIAC, speaking in 1991

Each chapter has a section entitled “Real Stuff” that ties the concepts in the book with a computer you may use every day. These sections cover the technology underlying modern computers. For this first “Real Stuff” section, we look at how integrated circuits are manufactured and how performance and power are measured, with the Intel Core i7 as the example.

SPEC CPU Benchmark

A computer user who runs the same programs day in and day out would be the perfect candidate to evaluate a new computer. The set of programs run would form a workload. To evaluate two computer systems, a user would simply compare the execution time of the workload on the two computers. Most users, however, are not in this situation. Instead, they must rely on other methods that measure the performance of a candidate computer, hoping that the methods will reflect how well the computer will perform with the user’s workload. This alternative is usually followed by evaluating the computer using a set of benchmarks: programs specifically chosen to measure performance. The benchmarks form a workload that the user hopes will predict the performance of the actual workload. As we noted above, to make the common case fast, you first need to know accurately which case is common, so benchmarks play a critical role in computer architecture.

workload: A set of programs run on a computer that is either the actual collection of applications run by a user or constructed from real programs to approximate such a mix. A typical workload specifies both the programs and the relative frequencies.

benchmark: A program selected for use in comparing computer performance.

SPEC (System Performance Evaluation Cooperative) is an effort funded and supported by a number of computer vendors to create standard sets of benchmarks for modern computer systems.
In 1989, SPEC originally created a benchmark set focusing on processor performance (now called SPEC89), which has evolved