CSC-25 High Performance Architectures
Lecture Notes and Exercises
Third Edition

Denis Loubach
[email protected]
http://www.comp.ita.br/~dloubach
Department of Computer Systems
Computer Science Division - IEC
Aeronautics Institute of Technology - ITA
© Denis Loubach, July 2022

Contents

Part I: Theory
1 Introduction
  A Brief History of Computer Evolution
  Concepts, Performance & Amdahl Law
  High Performance Computing & Chip Manufacturing
2 Processor Operation, RISC & CISC
  Processor Operation
  Instruction Set Architecture Approaches
  Control of a GPR Processor - MIPS
3 Instruction-Level Parallelism
  Exploiting Instruction-Level Parallelism
  Major Hurdles of Pipelining: Concepts and Hazards
  Datapaths and Stalls
  MIPS Simple Multiple-Cycle Implementation
  The MIPS Pipeline
4 Memory Systems
  Overview
  Memory Hierarchy
  Cache Memory
  Interleaved Memory
  RAM Construction Technology
  Virtual Memory
5 Superscalar Processors
  Advances in Instruction-Level Parallelism Exploration
  Out-of-Order Execution
  WAR and WAW Dependencies & Out-of-Order Completion
  Tomasulo's Algorithm
6 Branch Prediction & Speculative Superscalar Processors
  Speculative Execution
  Dynamic Branch Prediction
  Hardware-Based Speculation
  Multiple Issue Superscalar Processors
  Alternatives to Tomasulo's Algorithm
  Summary on Some ILP Techniques
7 Vector Computers & Graphics Processing Units
  Vector Computers
  Graphics Processing Units
8 Multiple Instruction, Multiple Data (MIMD) Systems
  Parallel Architectures
  Cache Coherence
  Thread-Level Parallelism
9 Storage and I/O Systems
  Storage
  Clusters (I/O Servers) Evaluation
  Buses
  Bus Design
10 Warehouse-scale Computers
  Overview
  Examples
  Cloud Computing

Part II: Exercises
A Introduction (Parts B and C)
B Processor Operation, RISC & CISC
C Instruction-Level Parallelism (Parts A and B)
D Memory Systems (Parts A and B)
E Superscalar Processors
F Branch Prediction & Speculative Superscalar Processors
G Vector Computers & Graphics Processing Units
H Multiple Instruction, Multiple Data Systems
I Storage & I/O Systems (Parts A and B)
J Warehouse-scale Computers

Bibliography
Part I: Theory

Chapter 1: Introduction

A Brief History of Computer Evolution

Computers: A Bit of History

From prehistory, one can mention the Chinese abacus, used more than 2,000 years ago. Advancing in time, the mechanical calculators of Pascal [1] and Leibniz [2], from around 1650, take their place. Pascal intended to help his father with tax work, and Pascal's machine later inspired Leibniz's stepped-drum design.

[1] Arithmetic machine, or Pascaline.
[2] Leibniz wheel, or stepped drum.

Turning to the "history" period proper, there is Charles Babbage's mechanical general-purpose computer, the "Analytical Engine". An innovation of Babbage's project was the use of punched cards. Around 1840, Ada Lovelace became recognized as the first programmer.

In the 1890 United States census, the Herman Hollerith machine was used. That machine reduced the time to compile the census data from 7 to 2.5 years. Later, Hollerith founded a company to produce tabulating machines, which came to be named International Business Machines - IBM.

Machines in the 19th century used base 10. Boolean logic (base 2) simplified the general scheme, and base 2 has been used in the design of almost all computers from the 20th century until today.

In the '40s, an electronic computer design became reality: von Neumann's computer. In that computer, the instructions were stored in memory instead of on punched cards. This model is still in use today.

Some of the first computers, from around 1945, were the Z1, Mark-I, and ENIAC. The Z1 was from Germany (Fig. 1.1) and the Harvard Mark-I from the US. The term "Harvard architecture" was coined to describe this type of machine, i.e., separate memories for instructions and data. Presently, it stands for a single main memory with separate instruction and data caches.

The IBM Automatic Sequence Controlled Calculator - ASCC, called Mark-I by Harvard University's staff, is illustrated in Fig. 1.2. The Electronic Numerical Integrator and Computer - ENIAC is shown in Figs. 1.3 and 1.4. ENIAC was laid out in a "U" shape about 24 meters long. It had 20 ten-digit registers, each 60 centimeters long. Programming was done through switches and cable connections, and data came from punched cards.

Figure 1.1: Replica of the Z1 in the German Museum of Technology in Berlin. Source: de.wikipedia.org, 22:33, 27 Dez 2005, ComputerGeek (Diskussion).

Figure 1.2: The right end included data and program readers, and automatic typewriters. Source: https://commons.wikimedia.org/wiki/File:Harvard_Mark_I_Computer_-_Right_Segment.JPG
ENIAC was able to perform about 1,900 additions or 500 multiplications per second.

Figure 1.3: ENIAC in Philadelphia, Pennsylvania. Glen Beck (background) and Betty Snyder (foreground) program the ENIAC in building 328 at the Ballistic Research Laboratory. Source: https://commons.wikimedia.org/wiki/File:Eniac.jpg

Figure 1.4: Cpl. Irwin Goldstein (foreground) sets the switches on one of the ENIAC's function tables at the Moore School of Electrical Engineering (U.S. Army photo). Source: https://commons.wikimedia.org/wiki/File:Classic_shot_of_the_ENIAC.jpg

Evolution

Since 1946, digital electronics technology has evolved significantly and continues to evolve. The list includes electromechanical relays, electronic valves, transistors, printed circuits, integrated circuits (MSI [3], LSI [4], VLSI [5], ULSI [6]), fiber optics, superconductivity, laser beams, and gallium arsenide in lieu of silicon.

[3] Medium-scale integration - MSI (1968): 1 to 500 transistors.
[4] Large-scale integration - LSI (1971): 500 to 20x10^3 transistors.
[5] Very large-scale integration - VLSI (1980): 20x10^3 to 1x10^6 transistors.
[6] Ultra large-scale integration - ULSI (1984): more than 1x10^6 transistors.

Microprocessors

In the early 1970s, Intel produced the first microprocessor, an LSI integrated circuit (IC) containing all the CPU logic in a fingernail-sized chip. These advancements continue: examples include Deep Blue [7], based on VLSI and ULSI, with performance greater than 500 million instructions per second - MIPS. Another example is the program AlphaGo. It has 40 search threads, 48 CPUs, and 8 GPUs; its distributed version has 40 search threads, 1,202 CPUs, and 176 GPUs.

[7] Chess-playing computer developed by IBM, based on custom VLSI chips. It won a chess game against Kasparov in 1996.

Computer Generations

Computer generations can be divided into four. The first generation dates from 1945 to 1959. It was mainly based on electronic valves and miles of wires; the machines were slow, huge, and had heating issues.

The second generation dates from 1959 to 1964 and was based on transistors and printed circuits. These changes led to faster, smaller, and cheaper computers.

The third generation dates from 1964 to 1970 and was based on ICs with higher density, leading to cost reduction and processing times on the order of microseconds. Operating systems started to be used in that generation.

The fourth generation, from 1970 to around 2019, took advantage of and improved the existing technology, leading to architectures optimized for user problems, miniaturization, higher reliability, and higher performance, on the order of nanoseconds. New trends are now emerging, such as domain-specific architectures, multicore systems, and quantum computing, to name a few.

Functional Units

The first computers had functional hardware units only to add and subtract integers. Integer multiplication and division, together with floating-point operations, were left to software. Nowadays, those functions, along with others such as trigonometric, exponential, and logarithmic operations, are implemented in hardware, and that hardware is subject to continual improvement.

The First Microprocessors

The Intel 4004 was the first chip containing all CPU elements. It was manufactured in 1971 and started the microcomputer era. It was a 4-bit design: addition was done in hardware, while multiplication was carried out in software using "addition loops". The Intel 8008 was the first 8-bit microprocessor.
Roughly twice as complex as the 4004, the 8008 was introduced to the market in 1972. Notice that both the 4004 and the 8008 were designed to be application-specific.

The x86 Processors

The 8088, with an 8-bit external bus, and the 8086, with a 16-bit external bus, entered the market in 1978. There was also the optional 8087 co-processor to support floating-point arithmetic. The 8088 had a 4-byte instruction buffer; the 8086's was 6 bytes.

8086 Registers

Table 1.1 illustrates the 8086 registers.

Table 1.1: Some 8086 registers. Data registers are separated into 8-bit low (L) and 8-bit high (H) parts.

  Data registers (general-purpose, 16 bits = 8-bit high + 8-bit low):
    AX = AH : AL
    BX = BH : BL
    CX = CH : CL
    DX = DH : DL
  Address registers (16 bits):
    SP (stack pointer), PC (program counter), BP (base pointer), SI (source index), DI (destination index)
  Segment registers (16 bits):
    CS (code segment), DS (data segment), SS (stack segment), ES (extra segment)
  Status (16 bits):
    Flags

Other x86 Processors

The Intel 80286 (1982) added multitasking support, inter-task and per-task memory protection, and 16-bit arithmetic. The Intel 80386 (1985) was the first Intel processor with 32-bit arithmetic and support for virtual memory. The Intel 80486 (1989) had sophisticated cache memory technology, an instruction pipeline, and a floating-point co-processor integrated into the main processor.

The Intel Pentium (1992) was a superscalar processor. It had the instruction cache separated from the data cache. That chip made intensive use of technology to reduce the negative impact of conditional branch instructions on superscalar performance.

From 2006 on, the Core 2 appeared, followed by other multi-core parts, e.g., Duo, Quad, Extreme. Nowadays, Intel offers the Core i9 Extreme, along with the i3, i5, and i7.

8086 and Pentium Registers

Figure 1.5 shows a register comparison. The original set is shown in black and the extended set in gray. The 8086 divided the first four registers in half, so each can be used either as one 16-bit register or as two 8-bit registers. Starting with the 80386, the top eight registers were extended to 32 bits and could also be used as general-purpose registers - GPR. The floating-point registers at the bottom are 80 bits wide. They implement an FP stack, with the top of stack - TOS pointed to by the status register. One operand must be the TOS, and the other can be any of the seven registers below the TOS.

Figure 1.5: Some 8086 registers. Data registers separated into 8-bit low and 8-bit high parts.
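As a side illustration (not from the lecture notes), the aliasing of a 16-bit data register with its two 8-bit halves can be modeled in C with a union. The type and field names are hypothetical, and the layout shown assumes a little-endian host.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of one 8086 data register (e.g., AX = AH:AL).
 * On a little-endian host, the low byte of the 16-bit value overlaps AL. */
typedef union {
    uint16_t x;              /* full 16-bit register, e.g., AX */
    struct {
        uint8_t l;           /* low byte,  e.g., AL */
        uint8_t h;           /* high byte, e.g., AH */
    } b;
} reg16_t;

int main(void) {
    reg16_t ax;
    ax.x = 0x1234;           /* write the 16-bit view */
    printf("AX=%04X AH=%02X AL=%02X\n", ax.x, ax.b.h, ax.b.l);

    ax.b.h = 0xAB;           /* writing AH updates the same storage */
    printf("AX=%04X\n", ax.x); /* prints AX=AB34 on little-endian hosts */
    return 0;
}
```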
x86 Architecture

"The x86 really isn't all that complex - it just doesn't make a lot of sense."

This is a quote from Mike Johnson, leader of the 80x86 project at AMD, in Microprocessor Report, 1994. The quote reflects the long-standing market competition between Intel and AMD. Notice: in view of this, RISC processors will be used as examples throughout most of these lecture notes.

Instruction Set

Designing Instructions

Instructions with large differences in execution time or in the number of phases are not suitable for a pipeline [8]. In this case, why not create simple instructions with small execution-time differences and even the same number of phases? Or is it not better to create powerful instructions that handle regular problems instead of simple ones that handle almost nothing?

[8] The detailed concept of a pipeline is introduced later.

Instruction Set Approaches

The high-level language computer architecture - HLLCA idea came up in the late '70s. It was a movement aiming at eliminating the gap between high-level languages and computer hardware. However, it never had significant commercial impact. On the other hand, the increase in memory size eliminated the code-size problem arising from high-level languages and enabled operating systems to be written in high-level languages. In that context, the combination of simpler architectures and software offered greater performance and more flexibility at lower cost and lower complexity.

At the beginning of the '80s, Ditzel and Patterson argued that simple architectures would be the best way to advance instruction set architecture - ISA development. They then introduced the reduced instruction set computer - RISC idea. At the same time, some VAX [9] architects did not follow the RISC idea and continued developing complex instruction set computers - CISC.

[9] Virtual Address Extension - VAX is a discontinued instruction set architecture developed by DEC in the '70s.

Development

RISC and CISC development followed in parallel, competing in the market. The RISC computers had three major seminal projects:

- the Berkeley RISC processor, led by Patterson: RISC-I and RISC-II (1980 to about 1983), with 16- and 32-bit instructions;
- the Stanford MIPS [10] processor, led by Hennessy: 1981 to 1984, with 32-bit instructions; and
- the IBM 801.

[10] Microprocessor without Interlocked Pipeline Stages, reflecting the lack of hardware to stall the pipeline, as the compiler would handle dependencies.

The university projects, Berkeley and Stanford, were widely adopted by industry after their conclusion. IBM never launched the 801 into the market, but it created the RS/6000 in 1990, the first superscalar RISC. In 1986, the industry started launching computers based on that technology. In 1987, Sun Microsystems launched the Scalable Processor Architecture - SPARC computers, based on the Berkeley RISC-II. The RISC-based PowerPC [11] was a joint effort of IBM, Motorola, and Apple.

[11] Performance Optimization With Enhanced RISC - Performance Computing. PowerPC was recently discontinued.

RISC vs. CISC

In the end, just one CISC survived the debate: the x86. That was mainly due to its high chip volume, binary compatibility with PC software, internal translation from CISC to RISC-like operations, and enough scaling headroom to support the extra hardware.

Important: do not confuse the RISC processor with a RISC instruction set architecture. The same applies to the MIPS processor and the MIPS ISA.

Today's Trends

About 40 Years Ago...

The improvement in computer performance over the last 40-50 years has been driven by computer architecture advancements, notably Moore's Law [12] and Dennard scaling [13]. Indeed, those facts supported the development of larger and more parallel systems. But what now, with the end of Dennard scaling a decade ago and the slowdown of Moore's Law due to a combination of physical limitations and economic factors?

[12] Roughly speaking, the maximum number of transistors in an IC doubles approximately every two years.
[13] This is about the reduction of MOS supply voltage in combination with the scaling of feature sizes: as transistors get smaller, their power density stays almost constant.

A Paradigm Shift?

Is it time for a paradigm shift now? To face this challenge, domain-specific architectures - DSA can provide performance and power benefits equivalent to three or more generations of Moore's Law and Dennard scaling, as well as better implementations than may ever be possible with future scaling of general-purpose architectures.
Some potential areas for architectural innovation within DSA include high-quality implementations of open-source architectures, which now have a much longer lifetime given the slowdown in Moore's Law, as well as reconfigurable computing, FPGA-based accelerators, domain-specific co-processors, and so on.

Concepts, Performance & Amdahl Law

Computer Architecture Definitions

A few decades ago, the definition of computer architecture was simply "instruction set design". However, that is a myopic view of the subject. The instruction set architecture - ISA refers to the actual programmer-visible instruction set and serves as the boundary between software and hardware, as illustrated in Fig. 1.6. The key constraints, or goals, of computer architecture are to maximize performance and energy efficiency and to minimize cost, power, and size. The genuine computer architecture view is about designing the organization and hardware to meet goals and functional requirements.

Figure 1.6: ISA as an interface between software and hardware.

The organization addresses the high-level aspects of a computer design, i.e., the memory system and interconnect; the internal processor, where arithmetic, logic, branching, and data transfer take place; and the microarchitecture. The latter term (microarchitecture) can be used in place of organization. The hardware is about the specifics of a computer, e.g., detailed logic design and packaging technology. Basically, architecture aims to cover the three aspects of computer design: (i) instruction set architecture, (ii) organization/microarchitecture, and (iii) hardware.

The Five Basic Parts

As illustrated in Fig. 1.7, the original von Neumann machine has five basic parts: memory, arithmetic logic unit (data path), program control unit (control path), input equipment, and output equipment.

Figure 1.7: The original von Neumann machine.

In this sense, the instruction set architecture defines the processor's instructions, which in turn guide the design of the control and data paths.

Computer Organization

Fig. 1.8 shows a possible computer organization, illustrating a direct interface between the processor and cache memory. The memory bus enables communication from the main memory to the cache, and the input/output - I/O bus allows the peripherals (disks, graphics card, network) to communicate with the processor by using an interrupt mechanism.

Figure 1.8: Computer organization.

Classes of Computers

Main Characteristics

From personal mobile devices to embedded systems, Fig. 1.9 shows the differences in price of the system and of the microprocessor. It also highlights the key design issues of each class of computer.

Figure 1.9: The five mainstream computing classes.

Notice that servers and embedded systems have wide price ranges: embedded systems span from USB keys to network routers, and the servers' range arises from the need for very-large-scale multiprocessor systems for high-end transaction processing.

The sales in 2015 (approximate figures) were about 1.6 billion PMDs (90% of them cell phones), 275 million desktop PCs, 15 million servers, and 19 billion embedded processors.
In addition, 14.8 billion ARM-technology-based chips were shipped. ARM-based devices continue to take more and more places, such as the TOP500 list and NASA's Mars 2021 mission.

Processor Performance

As illustrated in Fig. 1.10, in 2003 the power wall [14], the memory wall [15], the end of Dennard scaling, and the limits of instruction-level parallelism slowed uniprocessor performance; there was no remarkable performance improvement compared to the previous years. From 2011 to 2015, the limits of parallelism stated by Amdahl's law became noticeable in the almost flat curve. By 2015, the end of Moore's law was evident.

[14] Power dissipation, in the form of heat, prevents further frequency increases.
[15] Memory access frequency/performance is slower than the processor's.

Figure 1.10: Growth in processor performance over the last 40 years (SPEC integer benchmarks).

Clock Rate

Between 1978 and 1986, the clock rate improved by less than 15% per year (Fig. 1.11), while performance improved by 25% per year (Fig. 1.10). From 1986 to 2003, also known as the "renaissance period", clock rates shot up almost 40% per year (Fig. 1.11), while performance improved 52% per year (Fig. 1.10). Since that last period, the clock rate has been nearly flat, growing at less than 2% per year, and single-processor performance has improved just 3.5% per year.

Figure 1.11: Microprocessor clock rate growth.

Memory Performance

For a comparison in terms of memory performance, Fig. 1.12 shows the numbers from 1990 to 2011. This range spans from the first edition of the book "Computer Architecture" (by Hennessy and Patterson) to the fifth edition.

Figure 1.12: Memory performance growth.

Why Study Computer Architecture?

Computer architecture is fundamental to the development of high-performance computing and parallel processing. It allows understanding the computer's structure and operation in detail. One learns which key performance-optimization techniques exist for computer systems, covering both hardware and software, and how to apply them. And, in the worst-case scenario, it may at least help you buy a good computer.

Measuring Performance

How to measure performance

It is important to know about the performance of a given system, and the following questions arise. How do we assess the impact of a computer performance improvement? Should this measure be given in clock cycles, floating-point operations per second - FLOPS, or something else? Must performance be measured by the time spent to complete a given task? In that case, the shorter, the better: the minimum possible time to complete a task is a good performance indicator. Or must performance be measured by a specific set of tasks executed within a time frame? Then the greater, the better: executing the maximum possible number of tasks in the specified time frame is a good performance indicator.

Performance Measures

To measure the performance of a system, one can use response time and throughput. Response time is the time between the start and the completion of an event, e.g., the time consumed to complete a task or a task set. Throughput, on the other hand, is the total amount of work done in a given time frame, e.g., instructions per second, MFLOPS, Mbps.

Now, think about these questions. Does increasing throughput always improve response time? Does reducing response time always improve throughput? Indeed, this is not so trivial.
A Simple Producer-Consumer Model

Take, for instance, the simple producer-consumer model shown in Fig. 1.13.

Figure 1.13: Producer-consumer model.

In this model, from the response-time point of view, the problem is a minimization problem: ideally the queue is empty and the server is idle, so a request is served immediately. From the throughput perspective, the problem is a maximization problem: the queue should never be empty and the server should never be idle. The concern here is to get as much work done as possible.

Increasing the Throughput

Generally, throughput can be enhanced by adding extra hardware, which reduces the latency with respect to the load (Fig. 1.14). Response time, on the other hand, is much harder to reduce: it requires optimizing the architecture or even the implementation technology.

Figure 1.14: Throughput enhancement by having more than one queue.

Throughput Enhancement vs. Response Time

The curve illustrated in Fig. 1.15 shows the trade-off between throughput and response time. As throughput is maximized, response time increases.

Figure 1.15: Throughput enhancement vs. response time.

Performance

Whether considering throughput or response time, the most important measure is the total time to complete the user's tasks. Therefore, the faster computer is always the one that responds to the user in less time. It is possible to "improve the response time" when the number of tasks is high and relevant intermediate results are produced; this is achievable through parallelization.

Speedup

Consider the performance definition in Eq. (1.1):

\[ \text{Performance}_X = \frac{1}{\text{ExecutionTime}_X} \tag{1.1} \]

Then, stating that "X is n times as fast as Y" means:

\[ n = \frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{ExecutionTime}_Y}{\text{ExecutionTime}_X} \tag{1.2} \]

Speedup is the observed improvement in performance due to an enhancement E, as in Eq. (1.3):

\[ \text{Speedup}(E) = \frac{\text{ExecutionTime}(\text{without } E)}{\text{ExecutionTime}(E)} = \frac{\text{Performance}(E)}{\text{Performance}(\text{without } E)} \tag{1.3} \]

Example: what is the observed speedup in execution time when using p processors? The answer is given by Eq. (1.4):

\[ S = \frac{T_{\text{SERIAL}}}{T_{\text{PARALLEL}}} = \frac{T_s}{T_p} \tag{1.4} \]

Or, using the notation for p processors, as in Eq. (1.5):

\[ S(p) = \frac{T(1)}{T(p)} \tag{1.5} \]

In theory, most cases exhibit at most linear speedup, i.e., S(p) ≤ p. Some cases can present super-linear speedup, S(p) > p. This happens when the serial algorithm is non-optimal or when some hardware feature places the serial algorithm at a disadvantage.
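To make Eqs. (1.4) and (1.5) concrete, here is a minimal C sketch (not from the notes; the timing values are made-up examples) that computes the observed speedup from two measured execution times.

```c
#include <stdio.h>

/* Illustrative sketch: speedup S(p) = T(1) / T(p), as in Eqs. (1.4)-(1.5). */
static double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}

int main(void) {
    double t1 = 120.0;   /* hypothetical serial time, T(1), in seconds   */
    double tp = 18.5;    /* hypothetical time on p processors, T(p)      */
    int    p  = 8;

    double s = speedup(t1, tp);
    printf("S(%d) = %.2f (linear speedup would be %d)\n", p, s, p);
    /* S(p) <= p is the usual case; S(p) > p indicates super-linear speedup. */
    return 0;
}
```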
Possible Targets and Strategies

There are several techniques, components, and communication interconnections that can be targeted for enhancement. On the other hand, there are also several "types" of instructions which should (preferably) be improved. In this case, what to choose? What should be prioritized for possible enhancements? The answers are the common case and Amdahl's Law.

The common case is simply the part of a system that is most frequently used in a specific scenario. For example, suppose there is a floating-point co-processor and another functional hardware unit that takes care of integer operations. The problem addressed by this system will determine the common case: is it about floating-point or integer operations?

Amdahl's Law

According to Amdahl's Law, the performance improvement to be obtained from using a given faster mode of execution is limited by the fraction of the time that this faster mode can be used. Therefore, it defines the speedup that can be gained by using a particular feature (i.e., an enhanced part). Suppose that an enhancement E gives a speedup S for a fraction F ≤ 1 of the execution time. What is the overall performance improvement in that case?

\[ \text{ExecutionTime}(E) = \text{ExecutionTime}(\text{without } E) \times \left[ (1 - F) + \frac{F}{S} \right] \tag{1.6} \]

\[ \text{Speedup}(E) = \frac{1}{(1 - F) + \dfrac{F}{S}} \]

Amdahl's Law Demonstration

Let's define the "unenhanced" portion T_u using the fraction F_e of the time during which the enhancement is used:

\[ T_u = (1 - F_e) \times \text{ExecutionTime}_{\text{old}} \tag{1.7} \]

Now, the "enhanced" portion T_e:

\[ T_e = \frac{F_e}{S_e} \times \text{ExecutionTime}_{\text{old}} \tag{1.8} \]

Finally, the overall speedup is given by Eq. (1.9):

\[ \text{Speedup}_{\text{overall}} = \frac{\text{ExecutionTime}_{\text{old}}}{\text{ExecutionTime}_{\text{new}}} \tag{1.9} \]

By replacing Eq. (1.7) and Eq. (1.8) in Eq. (1.9):

\[ \text{Speedup}_{\text{overall}} = \frac{\text{ExecutionTime}_{\text{old}}}{T_u + T_e} = \frac{\text{ExecutionTime}_{\text{old}}}{(1 - F_e)\,\text{ExecutionTime}_{\text{old}} + \frac{F_e}{S_e}\,\text{ExecutionTime}_{\text{old}}} \]

Then:

\[ \text{Speedup}_{\text{overall}} = \frac{1}{(1 - F_e) + \dfrac{F_e}{S_e}} \tag{1.10} \]

Important: notice that Amdahl's Law is used to estimate the performance improvement to be obtained in a system that does not yet have the faster mode of execution implemented.

Amdahl's Law Comments

(Un)fortunately, there is the real scenario, where the speedup does not scale linearly with the number of processors p, new hardware, or any other enhancement E applied to the system. At some point, the speedup tends to saturate and become "flat".

"Unless virtually all of a serial program is parallelized, the possible speed-up is going to be very limited, regardless of the available number of processors."
Quote from Pacheco, P. S. An Introduction to Parallel Programming. Morgan Kaufmann, 2011.

Let's take a look at the curves shown in Fig. 1.16.

Figure 1.16: Speedup examples. Source: https://en.wikipedia.org/wiki/Amdahl%27s_law.

In that case, the upper bound of the overall speedup is reached when p → ∞, and therefore:

\[ \text{Speedup}_{\text{overall}} \leq \frac{1}{1 - F_e} \tag{1.11} \]

From Fig. 1.16, it can be concluded that when up to 95% parallelism is obtained (green curve), the maximum speedup is:

\[ S(p) \leq \frac{1}{1 - 0.95} = 20 \]

which is approached at roughly p = 2048 processors; placing more processors than that makes no sense. Typically, the "non-parallelizable" fraction (1 - F_e) decreases as the problem size increases.

Processor Performance

Every component has some impact on performance, but processor performance is the most critical for the performance of the whole computer. The following gives the CPU time using the clock period duration, Eq. (1.12), or the clock rate, Eq. (1.13):

\[ \text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time} \tag{1.12} \]

\[ \text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}} \tag{1.13} \]

The average number of clock cycles per instruction - CPI, considering the instruction count - IC, is given by Eq. (1.14):

\[ \text{CPI} = \frac{\text{CPU clock cycles for a program}}{\text{IC}} \tag{1.14} \]

By replacing Eq. (1.14) in Eq. (1.12):

\[ \text{CPU time} = \text{IC} \times \text{CPI} \times \text{Clock cycle time} \tag{1.15} \]

And finally, with the units of measurement made explicit:

\[ \text{CPU time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{ClockCycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{ClockCycle}} = \frac{\text{Seconds}}{\text{Program}} \tag{1.16} \]

What impacts processor performance? Table 1.2 gives a hint.

Table 1.2: Summary relating the design level to the areas it impacts in processor performance.

  Design level     | Instruction Count | CPI | Clock Period
  Program          | X                 |     |
  Compiler         | X                 | X   |
  Instruction Set  | X                 | X   |
  Organization     |                   | X   | X
  Technology       |                   |     | X
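As a complementary illustration (not part of the original notes), the following C sketch evaluates Amdahl's Law, Eq. (1.10), and the CPU-time relation, Eq. (1.15), for made-up parameter values.

```c
#include <stdio.h>

/* Eq. (1.10): overall speedup by Amdahl's Law.
 * f_e: fraction of time the enhancement is used (0..1)
 * s_e: speedup of the enhanced portion */
static double amdahl_speedup(double f_e, double s_e) {
    return 1.0 / ((1.0 - f_e) + f_e / s_e);
}

/* Eqs. (1.13)/(1.15): cycles = IC x CPI; time = cycles / clock rate */
static double cpu_time(double ic, double cpi, double clock_hz) {
    return (ic * cpi) / clock_hz;
}

int main(void) {
    /* Amdahl: 95% of the time can use a very fast mode (Se = 1000) */
    printf("Speedup (Fe=0.95, Se=1000) = %.2f\n", amdahl_speedup(0.95, 1000.0));
    printf("Upper bound 1/(1-Fe)       = %.2f\n", 1.0 / (1.0 - 0.95));

    /* CPU time: 2e9 instructions, CPI = 1.5, 2 GHz clock */
    printf("CPU time = %.3f s\n", cpu_time(2e9, 1.5, 2e9));
    return 0;
}
```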
Benchmarks

How can computer performance be measured with respect to the user's tasks of interest? Performance is measured based on program or task execution time. Thus, how to choose? In the ideal case, users know exactly which programs they run, together with the input data, and can simply test computer performance with them. But this is rarely feasible.

The best choice of benchmark to measure performance is a real application. Here there are some performance pitfalls, such as running programs that are much simpler than a real application, for example: kernels, which are small key pieces of real applications; toy programs, i.e., 100-line programs from beginning programming assignments, e.g., quicksort; and synthetic benchmarks, which are fake programs invented to try to match the profile or behavior of real applications, e.g., Dhrystone and Whetstone. All three (kernels, toy programs, and synthetic benchmarks) are discredited today: the compiler writer and the architect can make the computer appear faster on these stand-in programs than on real applications.

Considering the conditions under which benchmarks run, benchmark-specific compiler flags are one way to "improve" performance. Another is applying transformations that would be illegal on many programs or that would slow down performance on others. To face these challenges and make results more meaningful, the recommendation is to use one compiler and one set of flags for all programs in the same language, e.g., C or C++.

Regarding whether source-code modification is allowed, several positions exist: no modifications are allowed at all; modifications are allowed but are essentially impossible to perform, e.g., those requiring ten million lines of code to be changed, as in database programs; modifications are allowed provided the changed version produces exactly the same results; and, in any case, one should ask whether the modifications reflect real practice and provide useful insight to users.

Another possibility is to use collections of benchmark applications, also known as benchmark suites. The goal is to characterize the real relative performance of two different computers, including programs that users are unlikely to run themselves. One example of a suite is the Electronic Design News Embedded Microprocessor Benchmark Consortium - EEMBC, containing 41 kernels for embedded applications in different domains. However, it does not have the reputation of being a "good predictor" of relative performance, due to the small kernels; this benchmark is trying to replace Dhrystone, which is still in use. Another example is the Standard Performance Evaluation Corporation - SPEC, dating back to the late '80s. It was originally aimed at workstations and covers many different applications in use today.

Although data from a benchmark is frequently used to compare two computers, always remember: performance is relative to the program. Fig. 1.10 was built using the SPEC integer benchmarks. However, this benchmark has changed over the years. For this reason, the performance of newer computers is estimated by a scaling factor that relates the performance of different versions of SPEC, e.g., SPEC89, SPEC92, SPEC95, SPEC2000, and SPEC2006 (Fig. 1.17).

Figure 1.17: SPEC benchmarks over time. Integer programs above the line and floating-point programs below.

However, there are still risks related to the use of benchmarks. Since many buying decisions are made on a benchmark basis, designers are pressured to optimize their designs for the benchmark.
High Performance Computing & Chip Manufacturing

HPC Applications

High performance computing - HPC systems are often used to perform numerical simulations of physical processes, such as simulations of nuclear explosion effects, tidal evolution and ocean movement, atmospheric/weather simulation, fluid flow, and subsoil mapping by wave-reflection calculation. Servers with a high volume of transactions processed per unit of time are another area that demands high computational power.

HPC Systems

Massively parallel computers are computers specially designed to work with hundreds or even thousands of processors. They use distributed-memory mechanisms and are often employed in mathematical simulations, e.g., the IBM Blue Gene project. That project was about a petaflop supercomputer aiming to attack problems such as protein folding. In the mentioned project, two application-specific integrated circuits - ASICs were designed: the BlueGene/L Compute - BLC (Fig. 1.18) and the BlueGene/L Link - BLL.

Figure 1.18: BlueGene/L Compute ASIC general information.

Fig. 1.19 shows the spectrum ranging from the chip (with two processors) to the full system (with 64 cabinets). Each cabinet has 2,048 processors (8 × 8 × 16 × 2) and 256 GB of DDR memory (512 MB per compute card). The system has a total of 131,072 processors (64 × 32 × 32 × 2) and 16 TB of DDR memory. The computing power ranges from 180 to about 360 TFLOPS. The power consumption per cabinet [16], assuming 50 W per processor, is approximately 102 kW; assuming 10 h/day and 20 days/month of operation, it consumes about 20 MWh/month.

[16] For comparison, a Pentium 4 consumes about 60 W.

Figure 1.19: From the chip to the system.

In this context, clusters are multiple computers connected to operate as a single system. A cluster can achieve high processing power. Generally, clusters use specific middleware (software) so that the computers can exchange information efficiently; examples are the message-passing interface - MPI and the parallel virtual machine - PVM. Clusters are often used to handle large transaction volumes, e.g., the Google clusters.

HPC in Brazil

Some Brazilian HPC examples are introduced next, courtesy of Prof. Jairo Panetta/CPTEC and Petrobras. The CPTEC/INPE [17] examples are related to weather forecasting, while Petrobras' are about geological studies.

[17] Centro de Previsão do Tempo e Estudos Climáticos/Instituto Nacional de Pesquisas Espaciais.

CPTEC/INPE

This example (Fig. 1.20) concerns operational numerical weather and climate forecasting, an area where command of the application's technology domain is actively pursued. People incessantly seek to improve forecast quality and accuracy by applying increasingly powerful computers to exploit as much parallelism as possible. Table 1.3 shows some of the computers used by INPE over the years.

Table 1.3: Computers used by INPE over the years.

  Year             | 1994     | 1998      | 2004       | 2007         | 2012
  Computer         | NEC SX3  | NEC SX4   | NEC SX6    | NEC SUN      | Cray XE6 (Tupã)
  Processors       | 1        | 8         | 96         | 1,100        | 31,296
  Max performance  | 3 GFLOPS | 16 GFLOPS | 768 GFLOPS | 5,748 GFLOPS | 258 TFLOPS

Notice that the Tupã computer ranked 29th in the Top 500 list of November 2010 [18].

[18] https://top500.org, https://www.cptec.inpe.br/supercomputador/historico.php

Figure 1.20: A forecast example considering different region sizes and, consequently, different responses in terms of accuracy. More processing power is required for better accuracy.
Petrobras

This example (Figs. 1.21 and 1.22) shows seismic processing and well positioning. Here, command of the application's technology domain is also pursued: the seismic migration method was developed in-house. Technical people from the company are equally and incessantly seeking to improve image quality and accuracy, and the parallelism of increasingly powerful computers is essential to this problem.

Figure 1.21: Seismic method illustration showing a ship and a couple of sensors.

Figure 1.22: Seismic method: Kirchhoff migration, also known as diffraction summation. Migration is also called imaging.

The following are some available numbers related to the company's research. In 1998, they had about 76 PCs for software development. In 2006, they had a cluster with 5,000 processors, 1,000 Kirchhoff migration jobs, and about 10 jobs using 787 cores each, taking 22 days per job on average. In 2008, they had around 14,000 processors and 3,000 jobs.

Wrap-up

High performance computing systems have many applications in different real-world domains. The techniques seen in computer architecture can be used to optimize performance even in software systems.

Chip Manufacturing

The processor chip is fundamental to any computer's good performance. Chips are built on wafers, and raw wafers are made of silicon, which comes from beach sand. Everything starts with single crystals created through the Czochralski process, a method of crystal growth. In it, a seed crystal [19] is mounted on a rod and then dipped into molten silicon. The rod is pulled upwards and rotated at the same time, producing a big cylindrical piece of silicon crystal, also known as an ingot (right-hand side of Fig. 1.23).

[19] A small piece of silicon crystal.

Figure 1.23: Silicon cylinder (ingot) used to make wafers.

The ingot resulting from this process measures from one to two meters long and can be up to 300 millimeters in diameter; this is where terms like "300-mm wafers" come from. The ingot is then sliced into wafers. These wafers are polished (left-hand side of Fig. 1.23) and sent to the chip manufacturers. Wafers are usually less than 1 mm thick. These raw ("virgin") wafers are where the chips will be manufactured.

Chips are patterned on the wafer through a process called photolithography (Fig. 1.24). Chemicals sensitive to ultraviolet light are used; when exposed to ultraviolet light, they become either "soft" or "hard". Layers (SiO2 and photoresist) are grown one on top of the other, with an insulator between them, and masks are used to create the shapes. The process consists of blocking the ultraviolet light from reaching parts of the chemicals applied to the wafer, using masks created by the designers. The "soft" parts are then removed, and the process is repeated with another mask until the chip is finished. Each mask has a different pattern, and together they define how the transistors and wires inside the chip will be manufactured. The number of masks varies depending on the project; e.g., a Pentium 4 processor uses 26 masks. After this process, the transistors and circuits are built on the wafer.

In this whole process, absolutely any particle can ruin the silicon, even dust, as illustrated in Fig. 1.25. The same goes for vibrations affecting the wafer, light emitter, masks, or lenses. That explains the need for clean rooms in a chip foundry (Fig. 1.26).

Figure 1.24: Photolithography process overview.
Figure 1.25: Problems detected in the chip manufacturing process.

Figure 1.26: Clean room illustration.

The chips on the wafer are then tested, and the wafer is sent to the next step, where the chips are cut from the wafer (Fig. 1.27), have their terminals attached, and are packaged. After that, they are tested again, labeled, and ready to be sold (Fig. 1.28).

Figure 1.27: From the wafer to the die.

Figure 1.28: A chip example.

Dies per Wafer

Despite all other advances in the manufacturing process, a wafer is still tested and chopped into dies that are then packaged. The number of dies per wafer is estimated by Eq. (1.17):

\[ \text{DiesPerWafer} = \frac{\pi \times (\text{WaferDiameter}/2)^2}{\text{DieArea}} - \frac{\pi \times \text{WaferDiameter}}{\sqrt{2 \times \text{DieArea}}} \tag{1.17} \]

where the first term is the ratio of the wafer area (πr²) to the die area, and the second term compensates for the problem of rectangular dies near the periphery of round wafers, by dividing the circumference (πd) by the diagonal of a square die.

The next two figures (Figs. 1.29 and 1.30) show a die photograph, followed by an illustration of a wafer (Fig. 1.31).

Figure 1.29: Intel Skylake microprocessor die photograph, a 2015 14 nm processor microarchitecture.

Figure 1.30: The components of the Intel Skylake microprocessor die labeled with their functions.

Figure 1.31: A 200 mm diameter wafer of RISC-V dies designed by SiFive. The wafer contains 3,712 chips.

Bose-Einstein Formula

How many of those chips will actually work? This is an important question. No matter how good the manufacturing process is, some dies will not work, and more masks and layers mean more errors. There is a model that estimates the fraction of good dies on a wafer, the die yield. The Bose-Einstein formula is an empirical model developed by looking at the yield of many manufacturing lines (Sydow, 2006), and it still applies today.

A model of integrated circuit yield, assuming that defects are randomly distributed over the wafer and that yield is inversely proportional to the complexity of the fabrication process, is described by Eq. (1.18):

\[ \text{DieYield} = \text{WaferYield} \times \frac{1}{(1 + \text{DefectsPerUnitArea} \times \text{DieArea})^N} \tag{1.18} \]

where WaferYield accounts for wafers that are completely bad and do not need to be tested. DefectsPerUnitArea is a measure of the random manufacturing defects: typically 0.08 to 0.10 defects per square inch for a 28 nm node and 0.10 to 0.30 for the newer 16 nm node, as of around 2017; it also depends on process maturity. The metric equivalents are 0.012 to 0.016 defects per square centimeter for 28 nm and 0.016 to 0.047 for 16 nm. Finally, N is the process-complexity factor, a measure of manufacturing difficulty: for 28 nm processes N is 7.5 to 9.5, and for a 16 nm process N ranges from 10 to 14.
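To make Eqs. (1.17) and (1.18) concrete, here is a small C sketch (the wafer, die, and process parameters are illustrative assumptions, not values from the notes) that estimates dies per wafer and die yield.

```c
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Eq. (1.17): dies per wafer */
static double dies_per_wafer(double wafer_diameter_cm, double die_area_cm2) {
    double r = wafer_diameter_cm / 2.0;
    return (M_PI * r * r) / die_area_cm2
         - (M_PI * wafer_diameter_cm) / sqrt(2.0 * die_area_cm2);
}

/* Eq. (1.18): Bose-Einstein die yield */
static double die_yield(double wafer_yield, double defects_per_cm2,
                        double die_area_cm2, double n_complexity) {
    return wafer_yield
         / pow(1.0 + defects_per_cm2 * die_area_cm2, n_complexity);
}

int main(void) {
    /* assumed: 30 cm (300 mm) wafer, 1.5 cm^2 die, 28 nm-like parameters */
    double dpw   = dies_per_wafer(30.0, 1.5);
    double yield = die_yield(1.0, 0.014, 1.5, 8.5);
    printf("Dies per wafer ~ %.0f\n", dpw);
    printf("Die yield      ~ %.2f\n", yield);
    printf("Good dies      ~ %.0f\n", dpw * yield);
    return 0;
}
```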
Processors Information

Table 1.4 shows some processor data from 2007.

Table 1.4: Some processor data, according to Hennessy & Patterson, 2007.

  Chip        | Die Size [cm²] | Estimated defect rate [per cm²] | Feature Size [nm] | Transistors [millions]
  IBM Power 5 | 3.89           | 0.30                            | 130               | 276
  Sun Niagara | 3.80           | 0.75                            | 90                | 279
  AMD Opteron | 1.99           | 0.75                            | 90                | 233

For comparison, there are some numbers from the 2017 Symposia on VLSI Technology and Circuits [20], held in Kyoto. IBM and its research alliance partners, GlobalFoundries and Samsung, built a new type of transistor for chips at the 5 nanometer node. This represents a 40% performance improvement over 10 nm chips, using the same amount of power. The result came after more than 5 years of research and development and a US$ 3 billion investment. Fig. 1.32 illustrates the transistor gate length compared to the density of transistors in microprocessors over time.

[20] Source: https://www.ibm.com/blogs/think/2017/06/5-nanometer-transistors/

Figure 1.32: The evolution of transistor gate length (minimum feature size) and of the density of transistors in microprocessors over time. The gate length shrank from 10 µm to 28 nm, and the number of transistors increased from 200 to over 1 million.

Chip Manufacturing in Brazil

There are efforts to produce chips in Brazil as well. One example is CEITEC, a federal public company attached to the Ministry of Science, Technology and Innovation - MCTI. The company is focused on the development and production of integrated circuits for radiofrequency identification - RFID and specific applications. It is located in Porto Alegre, RS. CEITEC was founded in 2008. In 2011, it signed a 600 nm CMOS technology-transfer contract with the German company X-Fab. The first chip sales took place in 2012. Their RFID chips target cattle identification; the product is known as the "Ox Chip".

There is also Unitec Semiconductors (formerly Six), which was incorporated by the Argentine group "Corporation America". The industrial unit will be installed in Ribeirão das Neves, MG. The company was founded in 2012 and will receive an investment of around R$ 1 billion. The plant will manufacture custom chips for niche markets, including industrial and medical applications, working with IBM's licensed 100 nm technology.

Other examples are HT Micron, from RS; Instituto Eldorado and Smart Modular Technologies, from SP; and Qualcomm and Advanced Semiconductor Engineering (ASE), from Campinas.

Chapter 2: Processor Operation, RISC & CISC

Processor Operation

The Five Basic Parts

The five basic parts introduced in Chapter 1 are those of the original von Neumann machine: memory, arithmetic logic unit (part of the datapath), program control unit (control path), input equipment, and output equipment. In this sense, the instruction set architecture defines the processor's instructions, which in turn lead to the design of the control and data paths.

The Control Unit

Generally, the execution of an instruction can be split into different phases: (i) operation code fetch, (ii) operation code decode, (iii) operand fetch, (iv) effective instruction execution, and (v) result store. The phases involving memory access, e.g., writing data back to memory, can be 10 times slower than the other phases; thus, they have a major impact on program performance.

Instruction Set Architecture Classes

The type of internal storage in a processor is the most basic differentiator among instruction set architecture classes. The main classes are the following, as shown in Fig. 2.1.

Figure 2.1: Operand locations for the four major ISA classes. Lighter shades stand for inputs, darker ones for outputs.

Stack

In the stack storage type, both instructions and operands are stored in memory following this abstract data type. The arithmetic logic unit - ALU operands are implicitly on the top of stack - TOS.

Accumulator

In the accumulator storage type, the instructions involve this special register (the ACC register) and, in some cases, the memory. One operand is implicitly the accumulator (ACC) register.
Register

In the register storage type, instructions involve operands in various general-purpose registers - GPR (load-store architecture) or also in memory (register-memory architecture). Here, only explicit operands are allowed.

Some Code Examples

Let's consider the expression C = A + B and examine the corresponding code sequence in the four major ISA classes.

The code for the stack class is in Listing 2.1. First, the operands must be pushed from memory onto the stack; the ALU operands are then implicitly on the top of the stack.

Listing 2.1: Stack.
    Push A
    Push B
    Add
    Pop  C

The code for the accumulator class is in Listing 2.2. It needs to fetch just one operand from memory, since the ALU operates with one implicit operand, the ACC register, and the other as a memory position.

Listing 2.2: Accumulator.
    Load  A
    Add   B
    Store C

The code for the register-memory class is in Listing 2.3. Operands are explicitly registers and memory positions.

Listing 2.3: Register-memory.
    Load  R1, A
    Add   R3, R1, B
    Store R3, C

The code for the load-store class is in Listing 2.4. Operands are explicit registers only. First, the operands must be loaded from memory into registers before the Add instruction can execute.

Listing 2.4: Load-store.
    Load  R1, A
    Load  R2, B
    Add   R3, R1, R2
    Store R3, C

Notice that both the stack and load-store classes have separate instructions to access memory: the stack class has push and pop, and the load-store class has load and store.

Processor with Accumulator and Microprogram

Fig. 2.2 illustrates the architecture of a processor with an accumulator. The accumulator (ACC) register is connected to the ALU as its implicit operand. There are also the data register - DR, the address register - AR, the program counter - PC, and the instruction register - IR.

Figure 2.2: Processor with accumulator architecture.

Fig. 2.3 shows the microprogram of this architecture as a flowchart: a fetch loop on the left-hand side, and on the right-hand side one branch per operation code (#1 load, #2 store, #5 addition, #6 subtraction, #7 multiplication, #15 jump, #16 jump if positive, #21 stop).

Figure 2.3: Microprogram of the processor-with-accumulator architecture.

In the figure's left-hand side (Fig. 2.3), the microprogram behaves as shown in Listing 2.5.

Listing 2.5: Microprogram pseudo-code, first half.
    BEGIN                      // processor starts
    if (processor is active)   // check whether the processor is ACTIVE
        AR = PC                // address register gets the program counter value
        DR = Mem[AR]           // data register gets the memory position
                               // indexed by the address register
        IR = DR[opcode]        // instruction register gets the operation code
                               // field from the data register contents
        PC = PC + 1            // program counter is incremented to the
                               // next instruction
        IR                     // dispatch on the operation code to execute it
    else                       // processor is NOT active
        END                    // execution ends

In the right-hand side of Fig. 2.3, the microprogram behaves as shown in Listing 2.6.

Listing 2.6: Microprogram pseudo-code, second half.
    switch (IR)           // verify which operation code to execute
    case 1:               // operation "LOAD"
        AR = DR[addr]     // address register gets the address field
                          // contents from the data register
        DR = Mem[AR]      // data register gets the contents of the memory
                          // position indexed by the address register
        ACC = DR          // accumulator register gets the data register value
    case 2:               // operation "STORE"
        AR = DR[addr]
        DR = ACC          // data register gets the accumulator value
        Mem[AR] = DR      // memory position indexed by the address
                          // register gets the data register value
    ...
    case 5:               // operation "ADDITION"
        AR = DR[addr]
        DR = Mem[AR]
        ACC = ACC + DR    // accumulator gets the accumulator plus
                          // the data register value
    ...
    case 15:              // operation "JUMP"
        PC = DR[addr]     // program counter gets the address field
                          // value of the data register
    case 16:              // operation "JUMP IF POSITIVE"
        if (ACC > 0)      // if the accumulator is greater than zero
            PC = DR[addr]
    ...
    case 21:              // operation "STOP"
        halt the processor
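As an illustration only, the following C sketch simulates the fetch-decode-execute loop of Listings 2.5 and 2.6 for a tiny accumulator machine. The opcode numbers follow Fig. 2.3, but the memory-word layout (8-bit opcode, 8-bit address) and the tiny program are hypothetical choices, not the notes' design.

```c
#include <stdint.h>
#include <stdio.h>

enum { OP_LOAD = 1, OP_STORE = 2, OP_ADD = 5, OP_SUB = 6,
       OP_JUMP = 15, OP_JPOS = 16, OP_STOP = 21 };

/* assumed encoding: high byte = opcode, low byte = address */
#define WORD(op, addr) ((uint16_t)(((op) << 8) | ((addr) & 0xFF)))

int main(void) {
    uint16_t mem[256] = {
        WORD(OP_LOAD,  100),   /* ACC = Mem[100]       */
        WORD(OP_ADD,   101),   /* ACC = ACC + Mem[101] */
        WORD(OP_STORE, 102),   /* Mem[102] = ACC       */
        WORD(OP_STOP,  0),
    };
    mem[100] = 7; mem[101] = 35;          /* data: A = 7, B = 35 */

    uint16_t pc = 0, ar, dr, ir;
    int16_t  acc = 0;
    int running = 1;

    while (running) {
        ar = pc;                /* AR <- PC         */
        dr = mem[ar];           /* DR <- Mem[AR]    */
        ir = dr >> 8;           /* IR <- DR[opcode] */
        pc = pc + 1;            /* PC <- PC + 1     */
        ar = dr & 0xFF;         /* AR <- DR[addr]   */

        switch (ir) {
        case OP_LOAD:  acc = (int16_t)mem[ar];       break;
        case OP_ADD:   acc = acc + (int16_t)mem[ar]; break;
        case OP_SUB:   acc = acc - (int16_t)mem[ar]; break;
        case OP_STORE: mem[ar] = (uint16_t)acc;      break;
        case OP_JUMP:  pc = ar;                      break;
        case OP_JPOS:  if (acc > 0) pc = ar;         break;
        case OP_STOP:  running = 0;                  break;
        default:       running = 0;                  break;
        }
    }
    printf("Mem[102] = %d\n", mem[102]);   /* prints 42 */
    return 0;
}
```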
Now, take the program pseudo-code example shown in Listing 2.7.

Listing 2.7: The result of X.
    T1 = F + G
    T1 = (H - I) * T1
    T2 = E * F
    X = A + B
    X = ((C - D) * X - T2) / T1

The previous pseudo-code is equivalent to the following expression:

\[ X = \frac{(C - D) \times (A + B) - (E \times F)}{(H - I) \times (F + G)} \tag{2.1} \]

How can the corresponding assembly code be executed on a processor with an accumulator? At the circuit level, this can be implemented with a 4-bit adder (Fig. 2.4), a 1-bit multiplexer (Fig. 2.5), and binary multiplication built from "shift left" and additions.

Figure 2.4: A 4-bit adder circuit.

Figure 2.5: A 1-bit mux circuit.

As a reminder of the basic logic gates used to implement such circuits, Figs. 2.6 to 2.8 show their schematics along with their truth tables.

Figure 2.6: AND logic gate.

Figure 2.7: NOT logic gate.

Figure 2.8: NAND logic gate.

Consider again the following phases: (i) operation code fetch, (ii) operation code decode, (iii) operand fetch, (iv) effective instruction execution, and (v) result store. There is an intrinsic performance problem in the accumulator architecture. Listing 2.8 is the accumulator code version implementing the expression "X" from Eq. (2.1), i.e., the example from Listing 2.7.

Listing 2.8: Code for the accumulator architecture.
    Ld   F
    Add  G
    Sto  T1
    Ld   H
    Sub  I
    Mult T1
    Sto  T1
    Ld   E
    Mult F
    Sto  T2
    Ld   A
    Add  B
    Sto  X
    Ld   C
    Sub  D
    Mult X
    Sub  T2
    Div  T1
    Sto  X

The code in Listing 2.8 has 19 instructions. It performs 19 instruction fetches and 19 operand fetches, i.e., 38 memory accesses. The GPR (register-memory) code version, in turn, is given in Listing 2.9.

Listing 2.9: Code for the GPR (register-memory) architecture.
    Ld   R1, F
    Add  R1, G
    Ld   R2, H
    Sub  R2, I
    Mult R1, R2
    Ld   R2, E
    Mult R2, F
    Ld   R3, A
    Add  R3, B
    Ld   R4, C
    Sub  R4, D
    Mult R3, R4
    Sub  R3, R2
    Div  R3, R1
    Sto  X, R3

The code in Listing 2.9 has 15 instructions instead of the 19 of the accumulator version. It performs 15 instruction fetches and 11 operand fetches, i.e., 26 memory accesses. That is about 31.6% (12/38) fewer memory accesses than the accumulator version.
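For reference, here is a C version of Eq. (2.1), a sketch that uses the same temporaries T1 and T2 as Listing 2.7; this is the computation that both assembly versions above implement. The operand values are arbitrary examples, not from the notes.

```c
#include <stdio.h>

/* Eq. (2.1) computed in C, mirroring the temporaries of Listing 2.7. */
static double compute_x(double a, double b, double c, double d,
                        double e, double f, double g, double h, double i) {
    double t1 = (h - i) * (f + g);      /* denominator, as in Listing 2.7 */
    double t2 = e * f;
    return ((c - d) * (a + b) - t2) / t1;
}

int main(void) {
    double x = compute_x(1, 2, 7, 3, 2, 4, 1, 9, 4);
    /* ((7-3)*(1+2) - 2*4) / ((9-4)*(4+1)) = (12 - 8) / 25 = 0.16 */
    printf("X = %.2f\n", x);
    return 0;
}
```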
Instruction Set Architecture Approaches

Instruction Design: RISC vs. CISC

Instructions with very different execution times or very different numbers of phases are not suitable for a production line, i.e., a pipeline. Given this, why not create simple instructions with small differences in phase execution time and the same number of phases? Or what about creating powerful instructions that solve common problems rather than simple instructions that solve almost nothing?

Here is some background. In the late '70s, the idea of high-level language computer architecture - HLLCA came up. In the early '80s, Ditzel and Patterson argued that simpler architectures would be the best approach and presented the idea of the reduced instruction set computer - RISC. At the same time, some designers related to the VAX refuted that idea and continued to build computers based on the complex instruction set computer - CISC. RISC and CISC development followed in parallel, competing in the market.

To help settle the RISC vs. CISC debate, VAX designers compared the VAX 8700 and the MIPS M2000 in the early '90s. The VAX had powerful addressing modes, powerful instructions, efficient instruction encoding, and few registers. The MIPS had simple instructions, simple addressing modes, a fixed-length instruction format, a large number of registers, and pipelining. The comparison assumed that the computers had similar organizations and equal clock cycle times. On average, MIPS executes about twice as many instructions as the VAX, but the CPI for the VAX is almost six times the MIPS CPI, yielding almost a threefold performance advantage for MIPS, as Fig. 2.9 shows. The VAX was then discontinued and replaced by the Alpha, a 64-bit MIPS-like architecture.

Figure 2.9: Ratio of MIPS to VAX in instructions executed and in performance in clock cycles, based on SPEC89 programs.

Overview Analysis

The following analysis shows the number of addresses used when performing an operation in the four major ISA classes. For the stack (which uses just the top of the stack): 0 address | add | tos

Use Quizgecko on...
Browser
Browser