Chapter 1 Introduction

A Brief History of Computer Evolution

Computers: A Bit of History

From prehistory, one can mention the Chinese abacus, used more than 2,000 years ago. Advancing in time, the mechanical calculators of Pascal (the Arithmetic Machine, or Pascaline) and Leibniz (the Leibniz wheel, or stepped drum), from around 1650, take their place. Pascal intended to help his father with tax computations, and Pascal's design also inspired Leibniz's stepped-drum mechanism.

Moving now to the "history" period, there is Charles Babbage's mechanical general-purpose computer, named the "Analytical Engine". An innovation of Babbage's project was the use of punched cards. Also, around 1840, Ada Lovelace became recognized as the first programmer.

In the 1890 United States census, Herman Hollerith's tabulating machine was used. That machine reduced the time to compile the census data from 7 to 2.5 years. Later, Hollerith founded a company to produce tabulating machines, which eventually came to be named International Business Machines - IBM.

Machines in the 19th century used base 10. Boolean logic (base 2) simplified the general scheme, and base 2 has been used in the design of almost all computers from the 20th century to the present day.

In the 1940s, an electronic computer design became reality: von Neumann's computer. In that computer, the instructions were stored in memory instead of on punched cards. This model is still in use today.

Some of the first computers, from around 1945, were the Z1, the Mark-I, and the ENIAC. The Z1 was from Germany (Fig. 1.1) and the Harvard Mark-I from the US. The term "Harvard architecture" was coined to describe this type of machine, i.e., separate memories for instructions and data. Presently, it stands for a single main memory with separate instruction and data caches. The IBM Automatic Sequence Controlled Calculator - ASCC, called Mark-I by Harvard University's staff, is illustrated in Fig. 1.2. The Electronic Numerical Integrator and Computer - ENIAC is shown in Figs. 1.3 and 1.4. ENIAC was laid out in a "U" shape about 24 meters long. It had 20 registers of 10 digits each, and each register was 60 centimeters long. Programming was done through switches and cable connections, and data came from punched cards. The computer was able to perform about 1,900 additions or 500 multiplications per second.

Figure 1.1: Replica of the Z1 in the German Museum of Technology in Berlin. Source: de.wikipedia.org: 22:33, 27. Dez 2005. ComputerGeek (Diskussion).

Figure 1.2: The right end included data and program readers, and automatic typewriters. Source: https://commons.wikimedia.org/wiki/File:Harvard_Mark_I_Computer_-_Right_Segment.JPG

Figure 1.3: ENIAC in Philadelphia, Pennsylvania. Glen Beck (background) and Betty Snyder (foreground) program the ENIAC in building 328 at the Ballistic Research Laboratory. Source: https://commons.wikimedia.org/wiki/File:Eniac.jpg

Figure 1.4: Cpl. Irwin Goldstein (foreground) sets the switches on one of the ENIAC's function tables at the Moore School of Electrical Engineering. (U.S. Army photo). Source: https://commons.wikimedia.org/wiki/File:Classic_shot_of_the_ENIAC.jpg

Evolution

Since 1946, digital electronics technology has evolved significantly and continues to evolve.
The list of enabling technologies includes electromechanical relays, electronic valves (vacuum tubes), transistors, printed circuits, integrated circuits (medium-scale integration - MSI, 1968: 1 to 500 transistors; large-scale integration - LSI, 1971: 500 to 20×10^3 transistors; very large-scale integration - VLSI, 1980: 20×10^3 to 1×10^6 transistors; ultra large-scale integration - ULSI, 1984: more than 1×10^6 transistors), fiber optics, superconductivity, lasers, and gallium arsenide in lieu of silicon.

Microprocessors

In 1971, Intel produced the first microprocessor, an LSI integrated circuit (IC) containing all the CPU logic in a fingernail-sized chip. These advancements continue to this day. One example is Deep Blue (a chess-playing computer developed by IBM, built with custom VLSI chips, which won a chess game against Kasparov in 1996), based on VLSI and ULSI and with performance greater than 500 million instructions per second - MIPS. Another example is the program AlphaGo. It has 40 search threads, 48 CPUs, and 8 GPUs; its distributed version has 40 search threads, 1,202 CPUs, and 176 GPUs.

Computer Generations

The computer generations can be divided into four. The first generation dates from 1945 to 1959. It was mainly based on electronic valves and miles of wires, with low (i.e., slow) performance, huge size, and heating issues. The second generation dates from 1959 to 1964 and was based on transistors and printed circuits. These changes led to faster, smaller, and cheaper computers. The third generation dates from 1964 to 1970 and was based on ICs with higher density, leading to cost reduction and processing times on the order of microseconds. Operating systems started to be used in that generation. The fourth generation, from 1970 to around 2019, took advantage of and improved the existing technology, leading to architecture optimization with respect to user problems, miniaturization, higher reliability, and higher performance, on the order of nanoseconds. New trends are now emerging, such as domain-specific architectures, multicore systems, and quantum computing, to name a few.

Functional Units

The first computers had functional hardware units only to add and subtract integers. Integer multiplication and division, together with floating-point operations, were left to software. Nowadays, those functions, along with others such as trigonometric, exponential, and logarithmic operations, are implemented in hardware. And that hardware is subject to continual improvement.

The First Microprocessors

The Intel 4004 was the first chip containing all CPU elements. It was manufactured in 1971 and started the microcomputer era. That version was based on 4-bit addition; multiplication was carried out in software using "addition loops". The Intel 8008 was the first 8-bit microprocessor. Twice as complex as the 4004, it was introduced to the market in 1972. Notice that both the 4004 and the 8008 were designed to be application-specific.

The x86 Processors

The 8088, with an 8-bit external bus, along with the 8086, with a 16-bit external bus, entered the market in 1978. There was also the optional 8087 to support floating-point arithmetic. The 8088 had an instruction buffer of 4 bytes; the 8086's was 6 bytes.

8086 Registers

Table 1.1 illustrates the 8086 registers.

Table 1.1: Some 8086 registers. Data registers are separated into 8-bit low (L) and 8-bit high (H) parts.

Data registers (general-purpose, 16 bits = 8-bit high + 8-bit low):
  AX = AH : AL
  BX = BH : BL
  CX = CH : CL
  DX = DH : DL
Address registers (16 bits): SP (stack pointer), PC (program counter), BP (base pointer), SI (source index), DI (destination index)
Segment registers (16 bits): CS (code segment), DS (data segment), SS (stack segment), ES (extra segment)
Status register (16 bits): Flags
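As a small illustration of the high/low split in Table 1.1 (a sketch in Python rather than 8086 assembly, with hypothetical helper names), a 16-bit data register such as AX can be viewed as the concatenation of its 8-bit halves AH and AL:

# Sketch: model the AX = AH:AL aliasing described in Table 1.1.
def split_16bit(ax):
    """Return the (high, low) 8-bit halves of a 16-bit register value."""
    assert 0 <= ax <= 0xFFFF
    return (ax >> 8) & 0xFF, ax & 0xFF

def join_16bit(high, low):
    """Rebuild the 16-bit value from its 8-bit halves."""
    return ((high & 0xFF) << 8) | (low & 0xFF)

ah, al = split_16bit(0x1234)        # ah == 0x12, al == 0x34
assert join_16bit(ah, al) == 0x1234
# Writing only the low half (e.g., AL = 0x56) changes AX to 0x1256,
# since the 8-bit names alias parts of the same 16-bit register.
assert join_16bit(ah, 0x56) == 0x1256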
Other x86 Processors

The Intel 80286 (1982) had multitasking support, inter-task and per-task memory protection, and 16-bit arithmetic. The Intel 80386 (1985) was the first Intel processor with 32-bit arithmetic and support for virtual memory. The Intel 80486 (1989) had sophisticated cache memory technology, an instruction pipeline, and a floating-point co-processor integrated into the main processor. The Intel Pentium (1992) was a superscalar processor, with the instruction cache separated from the data cache. That chip made intensive use of techniques to reduce the negative impact of conditional branch instructions on superscalar performance. From 2006, the Core 2 appeared, followed by other multi-core parts, e.g., Duo, Quad, Extreme. Nowadays, Intel has the Core i9 Extreme, along with the i3, i5, and i7.

8086 and Pentium Registers

Figure 1.5 shows a register comparison. The original set is in black and the extended set is represented in gray. The 8086 divided the first four registers in half, so each can be used either as one 16-bit register or as two 8-bit registers. Starting with the 80386, the top eight registers were extended to 32 bits and could also be used as general-purpose registers - GPR. The floating-point registers at the bottom are 80 bits wide. They implement an FP stack, with the top of stack - TOS pointed to by the status register. One operand must be the TOS, and the other can be any of the seven registers below the TOS.

Figure 1.5: Some 8086 registers. Data registers separated into 8-bit low and 8-bit high parts.

x86 Architecture

"The x86 really isn't all that complex - it just doesn't make a lot of sense." This is a quote from Mike Johnson, 80x86 project leader at AMD, Microprocessor Report, 1994. The quote symbolizes the long-standing market competition between Intel and AMD. Notice: in view of this, RISC processors will be used as examples throughout most of these lecture notes.

Instruction Set

Designing Instructions

Instructions with large differences in execution time or in the number of phases are not suitable for pipelining (the pipeline concept is detailed later). In that case, why not create simple instructions with small differences in execution time and even the same number of phases? Or is it better to create powerful instructions that handle substantial problems instead of simple ones that accomplish almost nothing?

Instruction Set Approaches

The high-level language computer architecture - HLLCA idea came up in the late '70s. It was a movement aiming at eliminating the gap between high-level languages and computer hardware. However, it never had significant commercial impact. On the other hand, the increase in memory size eliminated the code-size problem arising from high-level languages and enabled operating systems to be written in high-level languages. In that case, the combination of simpler architectures with software offered greater performance and more flexibility at lower cost and lower complexity. At the beginning of the '80s, Ditzel and Patterson argued that simple architectures would be the best way to advance instruction set architecture - ISA development.
Then, they introduced the reduced instruction set computer - RISC idea. At the same time, some VAX architects (Virtual Address Extension - VAX is a discontinued instruction set architecture developed by DEC in the '70s) did not follow the RISC idea and continued developing complex instruction set computers - CISC.

Development

RISC and CISC development continued in parallel, competing for the market. The RISC computers had three major seminal projects:
- the Berkeley RISC processor, led by Patterson: RISC-I and RISC-II (1980 to about 1983), with 16- and 32-bit instructions;
- the Stanford MIPS processor, led by Hennessy (Microprocessor without Interlocked Pipeline Stages, reflecting the lack of hardware to stall the pipeline, as the compiler would handle dependencies): 1981 to 1984, with 32-bit instructions; and
- the IBM 801.

The university projects, Berkeley and Stanford, were widely adopted by industry after their conclusion. IBM never launched the IBM 801 into the market, but created the RS 6000 in 1990, the first superscalar RISC. In 1986, the industry started launching computers based on that technology. In 1987, Sun Microsystems launched the Scalable Processor Architecture - SPARC computers, based on Berkeley RISC-II. The RISC-based PowerPC (Performance Optimization With Enhanced RISC - Performance Computing, recently discontinued) came from the joint efforts of IBM, Motorola, and Apple.

RISC vs. CISC

Finally, only one CISC survived the debate, and the winner was the x86. This was mainly due to its high chip volume, binary compatibility with PC software, internal translation from CISC to RISC, and enough scaling to support extra hardware. Important: do not confuse the RISC processor with the RISC instruction set architecture. The same applies to the MIPS processor and the MIPS ISA.

Today's Trends

About 40 Years Ago...

The improvement in computer performance over the last 40-50 years has been provided by computer architecture advancements, notably Moore's Law (roughly speaking, the maximum number of transistors in an IC doubles approximately every two years) and Dennard scaling (the reduction of MOS supply voltage in combination with the scaling of feature sizes: as transistors get smaller, their power density stays almost constant). Indeed, those facts supported the development of larger and more parallel systems. But what now, with the end of Dennard scaling a decade ago and the slowdown of Moore's Law due to a combination of physical limitations and economic factors?

A Paradigm Shift?

Is it time for a paradigm shift now? To face this challenge, domain-specific architectures - DSA can provide performance and power benefits equivalent to 3-plus generations of Moore's Law and Dennard scaling, as well as better implementations than may ever be possible with future scaling of general-purpose architectures. Some potential areas for architectural innovation within DSA include high-quality implementations of open-source architectures, which now have a much longer lifetime (given the slowdown of Moore's Law), reconfigurable computing, FPGA-based accelerators, domain-specific co-processors, and so on.

Concepts, Performance & Amdahl Law

Computer Architecture Definitions

A few decades ago, the definition of computer architecture was simply "instruction set design". However, this is a myopic view of the subject. Instruction set architecture - ISA refers to the actual programmer-visible instruction set and serves as the boundary between software and hardware, as illustrated in Fig. 1.6. The main constraints, or goals, of computer architecture are to maximize performance and energy efficiency and to minimize cost, power, and size.
The genuine computer architecture view is about designing the organization and hardware to meet goals and functional requirements.

Figure 1.6: ISA as an interface between software and hardware.

The organization addresses the high-level aspects of a computer design, i.e., the memory system and interconnect; the internal processor, where arithmetic, logic, branching, and data transfer take place; and the microarchitecture. The latter term (microarchitecture) can be used in place of the term organization. The hardware is about the specifics of a computer, e.g., detailed logic design and packaging technology. Basically, architecture aims to cover the three aspects of computer design: (i) instruction set architecture, (ii) organization/microarchitecture, and (iii) hardware.

The Five Basic Parts

As illustrated in Fig. 1.7, the original von Neumann machine has five basic parts: memory, arithmetic logic unit (data path), program control unit (control path), input equipment, and output equipment.

Figure 1.7: The original von Neumann machine (memory, control unit, arithmetic logical unit with accumulator, input, output).

In this sense, the instruction set architecture defines the processor's instructions, which in turn guide the design of the control and data paths.

Computer Organization

Fig. 1.8 shows a possible computer organization illustrating a direct interface between the processor and the cache memory. The memory bus enables communication from the main memory to the cache, and the input/output - I/O bus allows the peripherals (disks, graphics card, network) to communicate with the processor through an interrupt mechanism.

Figure 1.8: Computer organization.

Classes of Computers

Main Characteristics

From personal mobile devices to embedded systems, Fig. 1.9 shows the differences in price of the system and of the microprocessor. It also highlights the key design issues of each class of computer.

Figure 1.9: The five mainstream computing classes.

Notice that servers and embedded systems have a wide price range: embedded systems go from USB keys to network routers, while the servers' range arises from the need for very large-scale multiprocessor systems for high-end transaction processing. The sales in 2015 (approximate figures) were about 1.6 billion PMDs (90% cell phones), 275 million desktop PCs, 15 million servers, and 19 billion embedded processors. In addition, 14.8 billion ARM-technology-based chips were shipped. ARM-based devices continue to take more and more places, such as the TOP 500 list and NASA's 2021 Mars mission.

Processor Performance

As illustrated in Fig. 1.10, in 2003 the power wall (power dissipation, in the form of heat, prevents further frequency increases), the memory wall (memory access performance improves more slowly than the processor's), the end of Dennard scaling, and the limits of instruction-level parallelism slowed uniprocessor performance growth; there was no remarkable performance improvement compared to previous years. From 2011 to 2015, the limits of parallelism given by Amdahl's law became noticeable in the almost flat curve. In 2015, the end of Moore's law was evident.

Figure 1.10: Growth in processor performance over the last 40 years (SPEC integer benchmarks).

Clock Rate

Between 1978 and 1986, the clock rate improved less than 15% per year (Fig. 1.11), while performance improved by 25% per year (Fig. 1.10). From 1986 to 2003, also known as the "renaissance period", clock rates shot up almost 40% per year (Fig. 1.11), while performance improved 52% per year (Fig. 1.10). Since that last period, the clock rate has been nearly flat, growing at less than 2% per year, and single-processor performance has improved just 3.5% per year.

Figure 1.11: Microprocessors clock rate growth.
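To get a feeling for what these annual rates mean when compounded over the corresponding periods, here is a quick back-of-the-envelope sketch in Python; the rates and period lengths are the ones quoted above, and the results should be read as rough orders of magnitude, not measurements:

# Compounding the annual performance growth rates quoted above.
def cumulative_growth(annual_rate, years):
    """Overall improvement factor after 'years' of compound growth."""
    return (1.0 + annual_rate) ** years

print(cumulative_growth(0.25, 1986 - 1978))   # ~6x over 1978-1986 at 25%/year
print(cumulative_growth(0.52, 2003 - 1986))   # ~1200x over 1986-2003 at 52%/year
print(cumulative_growth(0.035, 10))           # only ~1.4x per decade at 3.5%/year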
Memory Performance

For a comparison in terms of memory performance, Fig. 1.12 shows the numbers from 1990 to 2011. This spans from the first edition of the book "Computer Architecture" (by Hennessy and Patterson) to the fifth edition.

Figure 1.12: Memory performance growth.

Why Study Computer Architecture?

Computer architecture is fundamental to the development of high-performance computing and parallel processing. It allows one to understand computer structure and operation in detail. One learns the key performance optimization techniques for computer systems, covering both hardware and software, and how to apply them. And, in the worst-case scenario, it may at least help you buy a good computer.

Measuring Performance

How to measure performance

It is important to know the performance of a given system. In this context, the following questions arise. How do we assess the impact of a computer performance improvement? Should this measure be given in clock cycles, floating-point operations per second - FLOPS, or something else? Should performance be measured by the time spent to complete a given task? In that case, the less, the better: the minimum possible time to complete a task is a good performance indicator. Or should performance be measured by a specific set of tasks executed within a time frame? Then, the greater, the better: executing the maximum possible number of tasks in the specified time frame is a good performance indicator.

Performance Measures

To measure the performance of a system, one can use response time and throughput. Response time is the time between the start and the completion of an event, e.g., the time consumed to complete a task or a task set. Throughput, on the other hand, is the total amount of work done in a given time frame, e.g., instructions per second, MFLOPS, Mbps. Now, think about these questions: Does increasing throughput always improve response time? Does reducing response time always improve throughput? Indeed, this is not so trivial.

A Simple Producer-Consumer Model

Take, for instance, the simple producer-consumer model shown in Fig. 1.13.

Figure 1.13: Producer-consumer model.

In this model, from the response time point of view, the problem is a minimization problem: the queue has to be empty and the server should be idle. From the throughput perspective, however, the problem is a maximization problem: the queue should never be empty and the server should never be idle. The concern here is to get as much work done as possible.

Increasing the Throughput

Generally, throughput can be enhanced by adding extra hardware, which reduces the latency with respect to the load (Fig. 1.14).

Figure 1.14: Throughput enhancement by having more than one queue.

Response time, on the other hand, is much harder to reduce: one needs to consider optimizing the architecture or even the implementation technology.

Throughput Enhancement vs. Response Time

The curve illustrated in Fig. 1.15 shows the trade-off involving throughput and response time: as throughput is maximized, response time increases.

Figure 1.15: Throughput enhancement vs. response time.
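One simple way to see why the curve in Fig. 1.15 bends upward is the textbook single-queue (M/M/1) model, which is not derived in these notes and is used here only as an illustration: as the offered load approaches the server's capacity, the mean response time grows without bound. A minimal sketch in Python:

# Illustration only: M/M/1 queue response time as utilization grows.
def mean_response_time(service_time, utilization):
    """Mean response time of a single server at the given utilization (< 1)."""
    assert 0.0 <= utilization < 1.0
    return service_time / (1.0 - utilization)

for rho in (0.1, 0.5, 0.9, 0.99):
    print(rho, mean_response_time(1.0, rho))
# The output grows from ~1.1 to 100: pushing throughput toward capacity
# (utilization -> 1) makes response time explode, as in Fig. 1.15.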
Performance

Whether one considers throughput or response time, the most important factor is the total time to complete the user's tasks. Therefore, the faster computer is always the one that responds to the user in less time. It is possible to "improve the response time" when the number of tasks is high and relevant intermediate results are produced. This can be achieved through parallelization.

Speedup

Consider the execution time as in Eq. (1.1):

Performance_X = 1 / ExecutionTime_X    (1.1)

Then, stating that "X is n times as fast as Y" means:

n = Performance_X / Performance_Y = ExecutionTime_Y / ExecutionTime_X    (1.2)

Speedup is the observed improvement in performance due to an enhancement E, as in Eq. (1.3):

Speedup(E) = Performance(with E) / Performance(without E) = ExecutionTime(without E) / ExecutionTime(with E)    (1.3)

Example: what is the observed speedup in execution time when using p processors? The answer is given by Eq. (1.4):

S = T_serial / T_parallel = T_s / T_p    (1.4)

Or, using the notation for p processors, as in Eq. (1.5):

S(p) = T(1) / T(p)    (1.5)

Theoretically, most cases will show at most a linear speedup, where S(p) ≤ p. Some cases can present a super-linear speedup, S(p) > p. This happens when the serial algorithm is non-optimal or when some hardware feature places the serial algorithm at a disadvantage.
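A small Python sketch of Eqs. (1.1)-(1.5); the task times and processor count below are hypothetical values chosen only for the example:

# Sketch of the speedup definitions in Eqs. (1.1)-(1.5).
def speedup(time_without, time_with):
    """n = ExecutionTime_Y / ExecutionTime_X: how many times as fast X is."""
    return time_without / time_with

T1 = 120.0           # serial execution time T(1), hypothetical
Tp = 15.0            # parallel execution time T(p) on p = 16 processors, hypothetical
S = speedup(T1, Tp)  # S(p) = T(1) / T(p) = 8.0
print(S, S <= 16)    # sub-linear here: S(p) <= p, i.e., no super-linear effect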
Possible Targets and Strategies

There are several techniques, components, and communication interconnects that can be targeted for enhancement. On the other hand, there are also several "types" of instructions which should (preferably) be improved. In this case, what should be chosen? What should be prioritized for possible enhancements? The answers are the common case and Amdahl's Law. The common case is simply the part of a system that is most frequently used in a specific scenario. For example, suppose there is a floating-point co-processor and another functional hardware unit that takes care of integer operations. The problem addressed by this system determines the common case: is it about floating-point or integer operations?

Amdahl's Law

According to Amdahl's Law, the performance improvement to be obtained from using a given faster mode of execution is limited by the fraction of the time that this faster mode can be used. Therefore, it defines the speedup that can be gained by using a particular feature (i.e., an enhanced part). Suppose that an enhancement E provides a speedup S over a time fraction F ≤ 1 of the execution. What is the overall performance improvement in that case?

ExecutionTime(E) = ExecutionTime(without E) × ((1 − F) + F/S)

Speedup(E) = 1 / ((1 − F) + F/S)    (1.6)

Amdahl's Law Demonstration

Let's define the "unenhanced" portion T_u, where F_e is the fraction of time during which the enhancement is used:

T_u = (1 − F_e) × ExecutionTime_old    (1.7)

Now, the "enhanced" portion T_e, sped up by S_e:

T_e = (F_e / S_e) × ExecutionTime_old    (1.8)

Finally, the overall speedup is given by Eq. (1.9):

Speedup_overall = ExecutionTime_old / ExecutionTime_new    (1.9)

By replacing Eq. (1.7) and Eq. (1.8) in Eq. (1.9):

Speedup_overall = ExecutionTime_old / (T_u + T_e) = ExecutionTime_old / ((1 − F_e) × ExecutionTime_old + (F_e / S_e) × ExecutionTime_old)

Then:

Speedup_overall = 1 / ((1 − F_e) + F_e / S_e)    (1.10)

Important: notice that Amdahl's Law is applied to estimate the performance improvement of a system that does not yet have the faster mode of execution implemented or running.

Amdahl's Law Comments

(Un)fortunately, there is the real-world scenario, in which the speedup does not scale linearly with the number of processors p, or with new hardware, or with any other enhancement E made to the system. At some point, the speedup tends to saturate and become "flat".

"Unless virtually all of a serial program is parallelized, the possible speed-up is going to be very limited, regardless of the available number of processors." Quote from Pacheco, P. S. An Introduction to Parallel Programming. Morgan Kaufmann, 2011.

Let's take a look at the curves shown in Fig. 1.16.

Figure 1.16: Speedup examples. Source: https://en.wikipedia.org/wiki/Amdahl%27s_law.

The upper bound of Speedup_overall is reached when p → ∞, and therefore:

Speedup_overall ≤ 1 / (1 − F_e)    (1.11)

From Fig. 1.16, it can be concluded that when up to 95% parallelism is obtained (green curve), the maximum speedup is

S(p) ≤ 1 / (1 − 0.95) = 20,

which is approached at roughly p = 2048. Adding more processors than that makes no sense. Typically, the "non-parallelizable" fraction (1 − F_e) decreases as the problem size increases.

Processor Performance

Every component has some impact on performance, but processor performance is the most critical for the performance of the whole computer. The following gives the CPU time using the clock period duration, Eq. (1.12), or the clock rate, Eq. (1.13):

CPU time = (# CPU clock cycles for a program) × Clock cycle time    (1.12)

CPU time = (# CPU clock cycles for a program) / Clock rate    (1.13)

The average number of clock cycles per instruction - CPI, considering the instruction count - IC, is given by Eq. (1.14):

CPI = (# CPU clock cycles for a program) / IC    (1.14)

By replacing Eq. (1.14) in Eq. (1.12):

CPU time = IC × CPI × Clock cycle time    (1.15)

And finally, with the units of measurement made explicit:

CPU time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle) = Seconds / Program    (1.16)

What impacts processor performance? Table 1.2 gives a hint.

Table 1.2: Summary relating the design level with possible areas that impact processor performance.

Design level      Instruction Count   CPI   Clock Period
Program           X
Compiler          X                   X
Instruction Set   X                   X
Organization                          X     X
Technology                                  X
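A minimal Python sketch putting Eq. (1.10), Eq. (1.11), and Eq. (1.15) into code; the program parameters in the CPU-time example (instruction count, CPI, clock rate) are hypothetical values used only for illustration:

# Amdahl's Law, Eqs. (1.10) and (1.11).
def amdahl_speedup(f_enhanced, s_enhanced):
    """Overall speedup when a fraction f_enhanced of the time is sped up by s_enhanced."""
    return 1.0 / ((1.0 - f_enhanced) + f_enhanced / s_enhanced)

def amdahl_upper_bound(f_enhanced):
    """Limit of the overall speedup as the enhancement factor goes to infinity."""
    return 1.0 / (1.0 - f_enhanced)

print(amdahl_upper_bound(0.95))      # 20.0 -- the 95% parallelism example of Fig. 1.16
print(amdahl_speedup(0.95, 2048))    # ~19.8 -- already close to the bound at p = 2048

# CPU time, Eq. (1.15): CPU time = IC x CPI x clock cycle time = IC x CPI / clock rate.
def cpu_time(instruction_count, cpi, clock_rate_hz):
    return instruction_count * cpi / clock_rate_hz

print(cpu_time(2e9, 1.5, 2e9))       # hypothetical: 2e9 instructions, CPI 1.5, 2 GHz -> 1.5 s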
Benchmarks

How can computer performance be measured with respect to the user's tasks of interest? Performance is measured based on program or task execution time. Thus, how should the programs be chosen? In the ideal case, a user knows exactly which programs s/he uses, together with their input data, and then tests computer performance with no problem. But this is usually not feasible. The best choice of benchmark to measure performance is a real application. Here there are some performance pitfalls, such as running programs that are much simpler than a real application. Examples are kernels, which are small key pieces of real applications; toy programs, i.e., 100-line programs from beginning programming assignments, e.g., quicksort; and synthetic benchmarks, which are fake programs invented to try to match the profile or behavior of real applications, e.g., Dhrystone and Whetstone.

All three (kernels, toy programs, and synthetic benchmarks) are discredited today. The compiler writer and the architect can make the computer appear faster on these stand-in programs than on real applications. Considering the conditions under which benchmarks run, benchmark-specific compiler flags are one way to "improve" performance. Another way is to apply transformations that would be illegal on many programs or that would even slow down performance on others. To face these challenges and increase the significance of the results, the recommendation is to use one compiler and one set of flags for all programs in the same language, e.g., C or C++.

When reasoning about whether source code modification should be allowed, several positions exist: no modifications are allowed at all; modifications are allowed but are essentially impossible to perform (e.g., those requiring changes to ten million lines of code, such as database programs); modifications are allowed, provided the changed version produces the very same results; and, finally, do the modifications reflect real practice and provide useful insight to users?

Another possibility is to use collections of benchmark applications, also known as benchmark suites. The goal is to characterize the real relative performance of two different computers, including programs that users are unlikely to run by themselves. One example of a suite is the Electronic Design News Embedded Microprocessor Benchmark Consortium - EEMBC, containing 41 kernels for embedded applications in different domains. However, it does not have the reputation of being a "good predictor" of relative performance, due to its small kernels. This benchmark is trying to replace Dhrystone, which is still in use. Another example is the Standard Performance Evaluation Corporation - SPEC suite, dating back to the late '80s. It was originally used for workstations and covers many different applications in use today.

Although data from a benchmark is frequently used to compare two computers, always remember: performance is relative to the program. Fig. 1.10 was built using the SPEC integer benchmarks. However, this benchmark has changed over the years. For this reason, the performance of newer computers is estimated by a scaling factor that relates the performance of different versions of SPEC, e.g., SPEC89, SPEC92, SPEC95, SPEC2000, and SPEC2006 (Fig. 1.17).

Figure 1.17: SPEC benchmarks over time. Integer programs above the line and floating-point programs below.

There are still risks related to the use of benchmarks. Since many buying decisions are made on a benchmark basis, designers are pressured to optimize their designs for the benchmark.

High Performance Computing & Chip Manufacturing

HPC Applications

High performance computing - HPC systems are often used to perform numerical simulations of physical processes, such as the simulation of nuclear explosion effects, tidal evolution and ocean movement, atmospheric/weather behavior, fluid flow, and subsoil mapping by wave-reflection calculations. Servers with a high volume of transactions processed per unit of time are another area that demands high computational power.

HPC Systems

Massively parallel computers are computers specially designed to work with hundreds or even thousands of processors. They use distributed-memory mechanisms and are often used in mathematical simulations, e.g., the IBM Blue Gene project.
That project was about a petaflop supercomputer aimed at attacking problems such as protein folding. In that project, two application-specific integrated circuits - ASICs were designed: the BlueGene/L Compute - BLC (Fig. 1.18) and the BlueGene/L Link - BLL.

Figure 1.18: BlueGene/L Compute ASIC general information.

Fig. 1.19 shows the spectrum ranging from the chip (with two processors) to the full system (with 64 cabinets). Each cabinet has 2,048 processors (8 × 8 × 16 × 2) and 256 GB of DDR memory (512 MB per compute card). The system has a total of 131,072 processors (64 × 32 × 32 × 2) and 16 TB of DDR memory. The computing power ranges from 180 to about 360 TFLOPS. The power consumption per cabinet, assuming 50 W/processor (for comparison, a Pentium 4 consumes about 60 W), is approximately 102 kW; assuming 10 h/day, 20 days/month of operation, a cabinet consumes about 20 MWh/month.

Figure 1.19: From the chip to the system.

In view of this, clusters are multiple computers connected to operate as a single system, and they can achieve high processing power. Generally, they use specific middleware (software) so that the computers can exchange information efficiently; examples are the message-passing interface - MPI and the parallel virtual machine - PVM. Clusters are often used to handle large transaction volumes, e.g., the Google clusters.

HPC in Brazil

Some Brazilian HPC examples are introduced next, courtesy of Prof. Jairo Panneta/CPTEC and Petrobras. The CPTEC/INPE (Centro de Previsão do Tempo e Estudos Climáticos/Instituto Nacional de Pesquisas Espaciais) examples are related to weather forecasting, while Petrobras' are about geological studies.

CPTEC/INPE

This example (Fig. 1.20) shows operational numerical weather and climate forecasting, an application domain that constantly pulls the technology forward. People incessantly seek to improve forecast quality and accuracy by applying increasingly powerful computers and as much parallelism as possible. Table 1.3 shows some of the computers used by INPE over the years.

Table 1.3: Computers used over the years.

Year   Computer          Processors   Max Performance
1994   NEC SX3           1            3 GFLOPS
1998   NEC SX4           8            16 GFLOPS
2004   NEC SX6           96           768 GFLOPS
2007   NEC SUN           1,100        5,748 GFLOPS
2012   Cray XE6 (Tupã)   31,296       258 TFLOPS

Notice that the Tupã computer was in 29th place in the Top 500 list of November 2010 (sources: https://top500.org; https://www.cptec.inpe.br/supercomputador/historico.php).

Figure 1.20: A forecast example considering different region sizes and, consequently, different responses in terms of accuracy. More processing power is required for better accuracy.

Petrobras

This example (Figs. 1.21 and 1.22) shows seismic processing and well positioning; here, too, the application domain pulls the technology. The seismic migration method was developed in-house. Technical people from the company likewise seek incessantly to improve image quality and accuracy, and the parallelism of increasingly powerful computers is essential in this problem.

Figure 1.21: Seismic method illustration showing a ship and a couple of sensors.

Figure 1.22: Seismic method - Kirchhoff migration method, also known as diffraction summation. Migration is also named imaging.

The following are some available numbers related to the company's research. In 1998, they had about 76 PCs for software development.
In 2006, they had a cluster with 5,000 processors and 1,000 Kirchhoff migration jobs, with about 10 jobs using 787 cores/job and taking 22 days/job on average. In 2008, they had around 14,000 processors and 3,000 jobs.

Wrap-up

High performance computing systems have many applications in different real-world domains. The techniques seen in computer architecture can be used to optimize performance even in software systems.

Chip Manufacturing

The processor chip is fundamental to any computer's good performance. Chips are built on wafers, and raw wafers are made of silicon, which comes from beach sand. Everything starts with single crystals created through the Czochralski process, a method of crystal growth. In it, a seed crystal (a small piece of silicon crystal) is mounted on a rod and then dipped into molten silicon. The rod is pulled upwards and rotated at the same time, producing a big cylindrical piece of silicon crystal, also known as an ingot (as on the right-hand side of Fig. 1.23).

Figure 1.23: Silicon cylinder used to make wafers.

The ingot resulting from this process measures from one to two meters in length and can have up to 300 millimeters in diameter; this is where terms like "300-mm wafers" come from. The ingot is then sliced into wafers, which are polished (as on the left-hand side of Fig. 1.23) and sent to the chip manufacturers. Wafers are usually less than 1 mm thick. These raw ("virgin") wafers are where the chips will be manufactured.

Chips are built on the wafer through a process called photolithography (Fig. 1.24). Chemicals sensitive to ultraviolet light are used; when exposed to ultraviolet light, they become "soft" or "hard". Layers (SiO2 and photoresist) are grown one on top of the other, with an insulator between them, and masks are used to create the shapes. The process consists of blocking the ultraviolet light from reaching parts of the chemicals applied to the wafer, using masks created by the designers. The "soft" parts are then removed, and the process is repeated with another mask until the chip is finished. Each mask has a different pattern, and together they determine how the transistors and wires inside the chip will be manufactured. The number of masks varies depending on the project; e.g., a Pentium 4 processor uses 26 masks. After this process, the transistors and the circuit are built on the wafer.

In this whole process, absolutely any particle can ruin the silicon, even dust, as illustrated in Fig. 1.25, not to mention vibrations affecting the wafer, the light source, the masks, or the lenses. That explains the need for clean rooms in a chip foundry (Fig. 1.26).

Figure 1.24: Photolithography process overview.

Figure 1.25: Problems detected in the chip manufacturing process.

Figure 1.26: Clean room illustration.

The chips on the wafer are then tested, and the wafer is sent to the next step, where the chips are cut from the wafer (Fig. 1.27), have their terminals attached, and are packaged. After that, they are tested, labeled, and ready to be sold (Fig. 1.28).

Figure 1.27: From the wafer to the die.

Figure 1.28: A chip example.

Dies per Wafer

Despite all the other advances in the manufacturing process, a wafer is still tested and chopped into dies that are then packaged. The number of dies per wafer is more accurately estimated by Eq. (1.17).
DiesPerWafer = (π × (WaferDiameter / 2)^2) / DieArea − (π × WaferDiameter) / sqrt(2 × DieArea)    (1.17)

where the first term is the ratio of the wafer area (π r^2) to the die area, and the second term compensates for the problem of rectangular dies near the periphery of round wafers by dividing the circumference (π d) by the diagonal of a square die. The next two figures (Figs. 1.29 and 1.30) show a die photograph, followed by an illustration of a wafer (Fig. 1.31).

Figure 1.29: Intel Skylake microprocessor die photograph, a 2015 14 nm processor microarchitecture.

Figure 1.30: The components of the Intel Skylake microprocessor die labeled with their functions.

Bose-Einstein Formula

How many of those chips will really work? This is an important question to ask. No matter how good the manufacturing process is, some dies will not work, and more masks and layers mean more errors. There is a model that states the fraction of good dies on a wafer, or the die yield. The Bose-Einstein formula is an empirical model developed by looking at the yield of many manufacturing lines (Sydow, 2006), and it still applies today.

Figure 1.31: A 200 mm diameter wafer of RISC-V dies designed by SiFive. The wafer contains 3,712 chips.

A model of integrated circuit yield, which assumes that defects are randomly distributed over the wafer and that yield is inversely proportional to the complexity of the fabrication process, is described in Eq. (1.18):

DieYield = WaferYield × 1 / (1 + DefectsPerUnitArea × DieArea)^N    (1.18)

where the term WaferYield accounts for wafers that are completely bad and do not need to be tested. DefectsPerUnitArea is a measure of the random manufacturing defects. It was typically 0.08 to 0.10 defects per square inch for a 28 nm node and 0.10 to 0.30 for the newer 16 nm node around 2017; it also depends on process maturity. The metric versions are 0.012 to 0.016 defects per square centimeter for 28 nm and 0.016 to 0.047 for 16 nm. Finally, N is the process-complexity factor, a measure of manufacturing difficulty; e.g., for 28 nm processes N is 7.5 to 9.5, and for a 16 nm process N ranges from 10 to 14.

Processors Information

Table 1.4 shows some data on processors from 2007.

Table 1.4: Some processor data, according to Hennessy & Patterson, 2007.

Chip          Die Size [cm^2]   Estimated defect rate [per cm^2]   Feature Size [nm]   Transistors [millions]
IBM Power 5   3.89              0.30                               130                 276
Sun Niagara   3.80              0.75                               90                  279
AMD Opteron   1.99              0.75                               90                  233

For comparison, there are some numbers from the 2017 Symposia on VLSI Technology and Circuits, held in Kyoto (source: https://www.ibm.com/blogs/think/2017/06/5-nanometer-transistors/). IBM and its research alliance partners, GlobalFoundries and Samsung, built a new type of transistor for chips at the 5 nanometer node. This represents a 40% performance improvement over 10 nm chips, using the same amount of power. It took more than 5 years of research and development and a US$ 3 billion investment. Fig. 1.32 illustrates the transistor gate length compared to the density of transistors in microprocessors over time.

Figure 1.32: The evolution of transistor gate length (minimum feature size) and the density of transistors in microprocessors over time. The gate length shrank from 10 µm to 28 nm, and the number of transistors increased from 200 to over 1 million.
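A short Python sketch of Eq. (1.17) and Eq. (1.18). The 300 mm wafer diameter and the 28 nm figures (0.016 defects/cm^2, N = 9.5) come from the text above; the 2.0 cm^2 die area and the 100% wafer yield are assumptions made only for this example:

# Sketch of Eqs. (1.17) and (1.18).
import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Eq. (1.17): usable dies on a round wafer, discounting edge losses."""
    wafer_area = math.pi * (wafer_diameter_cm / 2.0) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2.0 * die_area_cm2)
    return int(wafer_area / die_area_cm2 - edge_loss)

def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, n):
    """Eq. (1.18): Bose-Einstein empirical yield model."""
    return wafer_yield / (1.0 + defects_per_cm2 * die_area_cm2) ** n

dies = dies_per_wafer(30.0, 2.0)                # 300 mm wafer, assumed 2.0 cm^2 die
good = dies * die_yield(1.0, 0.016, 2.0, 9.5)   # 28 nm-node figures from the text
print(dies, round(good))                        # roughly 306 dies, ~227 expected good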
Chip Manufacturing in Brazil

There are efforts even in Brazil to produce chips. One example is CEITEC, a federal public company attached to the Ministry of Science, Technology and Innovation - MCTI. The company is focused on the development and production of integrated circuits for radio-frequency identification - RFID and specific applications, and it is located in Porto Alegre, RS. CEITEC was founded in 2008. In 2011, it signed a 600 nm CMOS technology-transfer contract with a German company named X-Fab. The first chip sales were in 2012. Its RFID chips target cattle identification; the product is known as the "Ox Chip".

There is also Unitec Semiconductors (formerly Six), which was incorporated by the Argentine group "Corporation America". The industrial unit will be installed in Ribeirão das Neves, MG. The company was founded in 2012 and will have an investment of around R$ 1 billion. The plant will manufacture custom chips for niche markets, including industrial and medical applications, working with IBM's 100 nm licensed technology.

Other examples are HT Micron, from RS; the Instituto Eldorado and Smart Modular Technologies, from SP; and Qualcomm and Advanced Semiconductor Engineering (ASE), from Campinas.

Appendix A
Introduction

Part B

1. Assume a computer runs a program P in 100 seconds, where 30% of the operations are memory accesses, with an average access time of 80 ms, and 60% are floating-point operations. What is the impact on the overall speedup of the system:
a) if the average memory access time is reduced by half?
b) if the performance of the floating-point operations is doubled?

2. A computer runs a program P in 100 seconds, where 80% of the time is spent on memory operations. By how much must the memory performance be improved so that the system achieves a five-times performance gain?

3. A common transformation required in graphics processors is the square root - SQRT. Implementations of floating-point - FP SQRT vary significantly in performance, especially among processors designed for graphics. Suppose FP SQRT is responsible for 20% of the execution time of a critical graphics benchmark. One proposal is to enhance the FP SQRT hardware and speed up this operation by a factor of 10. The other alternative is just to try to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for half of the execution time of the application. The design team believes that it can make all FP instructions run 1.6 times faster with the same effort as required for the FP SQRT. Compare these two design alternatives.

4. Comment on the sentence: "Multiprocessors are the new salvation for performance improvement".

Part C

1. Find the number of dies per 300 mm wafer for a die that is 1.5 cm on a side and for one that is 1.0 cm on a side.

2. Find the die yield for dies that are 1.5 cm on a side and 1.0 cm on a side, assuming a defect density of 0.031 per cm^2 and a process-complexity factor of 13.5. Consider also the dies per wafer from the last exercise and assume the wafer yield is 100%.