
Full Transcript


1 Introduction

RISC vs. CISC
In the end, just one CISC survived the debate, and the winner was the x86. This was mainly due to its high chip volume, binary compatibility with PC software, internal translation from CISC to RISC operations, and enough scaling to support extra hardware. Important: do not confuse the RISC processor with the RISC instruction set architecture. The same applies to the MIPS processor and the MIPS ISA.

Today's Trends
About 40 years ago... The improvement in computer performance over the last 40-50 years has been driven by advancements in computer architecture, notably Moore's Law [12] and Dennard scaling [13]. Those two facts supported the development of larger and more parallel systems. But what now, with the end of Dennard scaling a decade ago and the slowdown of Moore's Law due to a combination of physical limitations and economic factors?

A Paradigm Shift?
Is it time for a paradigm shift? To face this challenge, domain-specific architectures (DSA) can provide performance and power benefits equivalent to 3-plus generations of Moore's Law and Dennard scaling, as well as better implementations than may ever be possible with future scaling of general-purpose architectures. Potential areas for architectural innovation within DSA include high-quality implementations of open-source architectures, which now have a much longer lifetime given the slowdown of Moore's Law, as well as reconfigurable computing, FPGA-based accelerators, domain-specific co-processors, and so on.

[12] Roughly speaking, the maximum number of transistors in an IC doubles approximately every two years.
[13] Dennard scaling is about the reduction of the MOS supply voltage in combination with the scaling of feature sizes: as transistors get smaller, their power density stays almost constant.

Concepts, Performance & Amdahl's Law

Computer Architecture Definitions
A few decades ago, the definition of computer architecture was simply "instruction set design". However, this is a myopic view of the subject. The instruction set architecture (ISA) refers to the actual programmer-visible instruction set and serves as the boundary between software and hardware, as illustrated in Fig. 1.6. The key constraints, or goals, of computer architecture are to maximize performance and energy efficiency while minimizing cost, power, and size. The genuine computer architecture view is about designing the organization and hardware to meet goals and functional requirements.

Figure 1.6: ISA as an interface between software and hardware.

The organization addresses the high-level aspects of a computer design, i.e., the memory system and interconnect; the internal processor, where arithmetic, logic, branching, and data transfer take place; and the microarchitecture. The latter term (microarchitecture) can be used in place of organization. The hardware is about the specifics of a computer, e.g., detailed logic design and packaging technology. In short, architecture covers three aspects of computer design: (i) instruction set architecture, (ii) organization/microarchitecture, and (iii) hardware.

The Five Basic Parts
As illustrated in Fig. 1.7, the original von Neumann machine has five basic parts: memory, arithmetic logic unit (data path), program control unit (control path), input equipment, and output equipment.

Figure 1.7: The original von Neumann machine (memory, control unit, arithmetic logic unit with accumulator, input, and output).
In this sense, the instruction set architecture defines the processor's instructions, which in turn guide the design of the control and data paths.

Computer Organization
Fig. 1.8 shows a possible computer organization, illustrating a direct interface between the processor and the cache memory. The memory bus enables communication between the main memory and the cache, and the input/output (I/O) bus allows the peripherals (disks, graphics card, network) to communicate with the processor through an interrupt mechanism.

Figure 1.8: Computer organization.

Classes of Computers
Main Characteristics. From personal mobile devices to embedded systems, Fig. 1.9 shows the differences in price of the system and of the microprocessor. It also highlights the key design issues of each class of computer.

Figure 1.9: The five mainstream computing classes.

Notice that servers and embedded systems have a wide range in price: embedded systems go from USB keys to network routers, and the server range arises from the need for very large-scale multiprocessor systems for high-end transaction processing. The sales in 2015 (approximate figures) were about 1.6 billion PMDs (90% of them cell phones), 275 million desktop PCs, 15 million servers, and 19 billion embedded processors. In addition, 14.8 billion ARM-technology-based chips were shipped. ARM-based devices continue to take more and more places, such as the TOP500 list and NASA's 2021 Mars mission.

Processor Performance
As illustrated in Fig. 1.10, in 2003 the power wall [14], the memory wall [15], the end of Dennard scaling, and the diminishing returns of instruction-level parallelism slowed uniprocessor performance; there was no remarkable improvement compared to previous years. From 2011 to 2015, the limits on parallelism imposed by Amdahl's Law became visible in the almost flat curve. In 2015, the end of Moore's Law was evident.

Figure 1.10: Growth in processor performance over the last 40 years (SPEC integer benchmarks).

Clock Rate
Between 1978 and 1986, the clock rate improved by less than 15% per year (Fig. 1.11), while performance improved by 25% per year (Fig. 1.10). From 1986 to 2003, also known as the "renaissance period", clock rates shot up almost 40% per year (Fig. 1.11), while performance improved by 52% per year (Fig. 1.10). Since the end of that period, the clock rate has been nearly flat, growing at less than 2% per year, and single-processor performance has improved by just 3.5% per year.

Figure 1.11: Microprocessor clock rate growth.

[14] Power dissipation, in the form of heat, prevents further frequency increases.
[15] Memory access frequency/performance is slower than the processor's.

Memory Performance
For a comparison in terms of memory performance, Fig. 1.12 shows the numbers from 1990 to 2011. These dates span from the first edition of the book "Computer Architecture" (by Hennessy and Patterson) to the fifth edition.

Figure 1.12: Memory performance growth.

Why Study Computer Architecture?
Computer architecture is fundamental to the development of high-performance computing and parallel processing. It allows a detailed understanding of the computer's structure and operation. One learns what the key performance optimization techniques for computer systems are and how to apply them, covering both hardware and software. And, in the worst-case scenario, it may help you buy a good computer.

Measuring Performance
How to measure performance? It is important to know about the performance of a given system.
In this context, the following questions arise. How do we assess the impact of a computer performance improvement? Should this measure be given in clock cycles, floating-point operations per second (FLOPS), or something else? Should performance be measured by the time spent to complete a given task? In that case, the lower, the better: the minimum possible time to complete a task is a good performance indicator. Or should performance be measured by a specific set of tasks executed within a time frame? In that case, the greater, the better: executing the maximum possible number of tasks in the specified time frame is a good performance indicator.

Performance Measures
To measure the performance of a system, one can use response time and throughput. Response time is the time between the start and the completion of an event, e.g., the time consumed to complete a task or a task set. Throughput, on the other hand, is the total amount of work done in a given time frame, e.g., instructions per second, MFLOPS, or Mbps. Now, think about these questions: does increasing throughput always improve response time? Does reducing response time always improve throughput? Indeed, this is not so trivial.

A Simple Producer-Consumer Model
Take, for instance, the simple producer-consumer model shown in Fig. 1.13.

Figure 1.13: Producer-consumer model.

In this model, from the response-time point of view, the problem is a minimization problem: the queue should be empty and the server should be idle. From the throughput perspective, however, the problem is a maximization problem: the queue should never be empty and the server should never be idle. The concern here is to get as much work done as possible.

Increasing the Throughput
Generally, throughput can be enhanced by adding extra hardware, which reduces the latency with respect to the load (Fig. 1.14). Response time, on the other hand, is much harder to reduce: it requires optimizing the architecture or even the implementation technology.

Figure 1.14: Throughput enhancement by having more than one queue.

Throughput Enhancement vs. Response Time
The curve illustrated in Fig. 1.15 shows the trade-off between throughput and response time. As throughput is pushed toward its maximum, the response time increases.

Figure 1.15: Throughput enhancement vs. response time.

Performance
Whether considering throughput or response time, what matters most is the total time to complete the user's tasks. Therefore, the faster computer is always the one that responds to the user in less time. It is possible to "improve the response time" when the number of tasks is high and relevant intermediate results are produced; this can be achieved through parallelization.

Speedup
Consider the execution time as in Eq. (1.1):

\[ \text{Performance}_X = \frac{1}{\text{ExecutionTime}_X} \tag{1.1} \]

Then, stating that "X is n times as fast as Y" means:

\[ n = \frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{ExecutionTime}_Y}{\text{ExecutionTime}_X} \tag{1.2} \]

Speedup is the observed improvement in performance due to an enhancement E, as in Eq. (1.3), where "without E" denotes the system without the enhancement:

\[ \text{Speedup}(E) = \frac{\text{Performance}(E)}{\text{Performance}(\text{without } E)} = \frac{\text{ExecutionTime}(\text{without } E)}{\text{ExecutionTime}(E)} \tag{1.3} \]

Example: what is the observed speedup in execution time when using p processors? The answer is given by Eq. (1.4):

\[ S = \frac{T_{\text{SERIAL}}}{T_{\text{PARALLEL}}} = \frac{T_s}{T_p} \tag{1.4} \]

Or, using the notation for p processors, as in Eq. (1.5):

\[ S(p) = \frac{T(1)}{T(p)} \tag{1.5} \]

In theory, most cases exhibit at most linear speedup, i.e., S(p) ≤ p.
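As a quick illustration of Eqs. (1.4) and (1.5), here is a minimal Python sketch (not part of the original notes; the timing values are invented for the example) that computes the speedup of parallel runs over a serial run:

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S = T_serial / T_parallel, as in Eq. (1.4)."""
    return t_serial / t_parallel

# Hypothetical measured execution times, in seconds (illustrative values only).
t1 = 120.0                          # T(1): single-processor run
tp = {2: 63.0, 4: 34.0, 8: 21.0}    # T(p) for p = 2, 4, 8 processors

for p, t in tp.items():
    print(f"p = {p}: S(p) = {speedup(t1, t):.2f} (linear speedup would be {p})")
```

Note that the measured speedups stay below p, consistent with the S(p) ≤ p bound above.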
Some cases can present super-linear speedup, S(p) > p. This happens when the serial algorithm is non-optimal or when some hardware feature places the serial execution at a disadvantage.

Possible Targets and Strategies
There are several techniques, components, and communication interconnects that can be targeted for enhancement. On the other hand, there are also several "types" of instructions which could (preferably) be improved. In this case, what to choose? What should be prioritized for possible enhancements? The answers are the common case and Amdahl's Law. The common case is simply the part of the system that is most frequently used in a specific scenario. For example, suppose there is a floating-point co-processor and another functional unit that handles integer operations. The problem addressed by this system determines the common case: is it about floating-point or integer operations?

Amdahl's Law
According to Amdahl's Law, the performance improvement to be gained from using a faster mode of execution is limited by the fraction of the time that this faster mode can be used. Therefore, it defines the speedup that can be gained by using a particular feature (i.e., an enhanced part). Suppose that an enhancement E gives a speedup S for a fraction F ≤ 1 of the execution time. What is the overall performance improvement in that case?

\[ \text{ExecutionTime}(E) = \text{ExecutionTime}(\text{without } E) \times \left[ (1 - F) + \frac{F}{S} \right], \qquad \text{Speedup}(E) = \frac{1}{(1 - F) + \frac{F}{S}} \tag{1.6} \]

Amdahl's Law Demonstration
Let's define the "unenhanced" portion T_u, where F_e is the fraction of time the enhancement is used:

\[ T_u = (1 - F_e) \times \text{ExecutionTime}_{old} \tag{1.7} \]

Now, the "enhanced" portion T_e:

\[ T_e = \frac{F_e}{S_e} \times \text{ExecutionTime}_{old} \tag{1.8} \]

Finally, the overall speedup is given by Eq. (1.9):

\[ \text{Speedup}_{overall} = \frac{\text{ExecutionTime}_{old}}{\text{ExecutionTime}_{new}} \tag{1.9} \]

By replacing Eq. (1.7) and Eq. (1.8) in Eq. (1.9):

\[ \text{Speedup}_{overall} = \frac{\text{ExecutionTime}_{old}}{T_u + T_e} = \frac{\text{ExecutionTime}_{old}}{(1 - F_e)\,\text{ExecutionTime}_{old} + \frac{F_e}{S_e}\,\text{ExecutionTime}_{old}} = \frac{1}{(1 - F_e) + \frac{F_e}{S_e}} \tag{1.10} \]

Important: notice that Amdahl's Law is used to predict the performance improvement to be obtained in a system that does not yet have the faster mode of execution implemented.

Amdahl's Law Comments
(Un)fortunately, in real scenarios the speedup does not scale linearly with the number of processors p, with new hardware, or with any other enhancement E applied to the system. At some point, the speedup tends to saturate and become "flat".

"Unless virtually all of a serial program is parallelized, the possible speedup is going to be very limited, regardless of the available number of processors." (Pacheco, P. S. An Introduction to Parallel Programming. Morgan Kaufmann, 2011.)

Let's take a look at the curves shown in Fig. 1.16.

Figure 1.16: Speedup examples. Source: https://en.wikipedia.org/wiki/Amdahl%27s_law.

In that case, the upper bound of Speedup_overall is reached when p → ∞, and therefore:

\[ \text{Speedup}_{overall} \leq \frac{1}{1 - F_e} \tag{1.11} \]

From Fig. 1.16, it can be concluded that when 95% parallelism is obtained (green curve), the maximum speedup is

\[ S(p) \leq \frac{1}{1 - 0.95} = 20, \]

which is approximately reached with p = 2048 processors; using more processors than that makes no sense. Typically, the "non-parallelizable" fraction (1 − F_e) decreases as the problem size increases.
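To make Eqs. (1.10) and (1.11) concrete, the following short Python sketch (not from the original notes) evaluates Amdahl's speedup for a parallel fraction F_e when the enhanced fraction is sped up by p processors (i.e., S_e = p), and shows how the 95% curve saturates near its upper bound of 20:

```python
def amdahl_speedup(f_parallel: float, p: int) -> float:
    """Overall speedup from Eq. (1.10), taking S_e = p: the parallel
    (enhanced) fraction runs p times faster on p processors."""
    return 1.0 / ((1.0 - f_parallel) + f_parallel / p)

f = 0.95                        # assume 95% of the program is parallelizable
upper_bound = 1.0 / (1.0 - f)   # Eq. (1.11): limit as p -> infinity
print(f"Upper bound on speedup: {upper_bound:.1f}")

for p in (16, 256, 2048, 65536):
    print(f"p = {p:6d}: speedup = {amdahl_speedup(f, p):.2f}")
```

Running the loop shows the speedup approaching, but never exceeding, the 1/(1 − F_e) bound, which is exactly the flattening seen in Fig. 1.16.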
Processor Performance
Every component has some impact on performance, but processor performance is the most critical for overall computer performance. The CPU time can be expressed using the clock cycle time, Eq. (1.12), or the clock rate, Eq. (1.13):

\[ \text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time} \tag{1.12} \]

\[ \text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}} \tag{1.13} \]

The average number of clock cycles per instruction (CPI), given the instruction count (IC), is defined by Eq. (1.14):

\[ \text{CPI} = \frac{\text{CPU clock cycles for a program}}{\text{IC}} \tag{1.14} \]

By replacing Eq. (1.14) in Eq. (1.12):

\[ \text{CPU time} = \text{IC} \times \text{CPI} \times \text{Clock cycle time} \tag{1.15} \]

And finally, with the units of measurement made explicit:

\[ \text{CPU time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Seconds}}{\text{Program}} \tag{1.16} \]

What impacts processor performance? Table 1.2 gives a hint.

Table 1.2: Summary relating the design level with the factors of processor performance it affects.

Design level     | Instruction Count | CPI | Clock Period
Program          | X                 |     |
Compiler         | X                 | X   |
Instruction Set  | X                 | X   |
Organization     |                   | X   | X
Technology       |                   |     | X

Benchmarks
How can computer performance be measured with respect to the user's tasks of interest? Performance is measured based on program or task execution time. So, how to choose the programs? In the ideal case, a user knows exactly which programs and input data s/he uses, and then simply tests computer performance with them. But this is rarely feasible. The best choice of benchmark is a real application, and there are some performance pitfalls, such as running programs that are much simpler than a real application: kernels, which are small key pieces of real applications; toy programs, i.e., 100-line programs from beginning programming assignments, e.g., quicksort; and synthetic benchmarks, which are fake programs invented to try to match the profile or behavior of real applications, e.g., Dhrystone and Whetstone. All three (kernels, toy programs, and synthetic benchmarks) are discredited today, since the compiler writer and architect can make the computer appear faster on these stand-in programs than on real applications. Regarding the conditions under which benchmarks run, benchmark-specific compiler flags are one way to "improve" performance; another is to apply transformations that would be illegal on many programs or would even slow down performance on others. To face these challenges and increase the significance of the results, the recommendation is to use one compiler and one set of flags for all programs in the same language, e.g., C or C++. When reasoning about whether source-code modification should be allowed, several positions are possible: no modifications are allowed at all; modifications are allowed but essentially impossible to perform, e.g., those requiring changes to ten million lines of code, such as database programs; modifications are allowed provided the changed version produces the very same results; and, in any case, do the modifications reflect real practice and provide useful insight to users? Another possibility is to use collections of benchmark applications, also known as benchmark suites. The goal is to characterize the real relative performance of two different computers, including programs that users are unlikely to run themselves.
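As a small illustration of Eq. (1.15), here is a Python sketch (the instruction count, CPI, and clock rate are invented values, used only to show the arithmetic) that computes the CPU time of a hypothetical program:

```python
def cpu_time(instruction_count: int, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = IC x CPI x clock cycle time (Eq. (1.15)),
    with clock cycle time = 1 / clock rate (Eq. (1.13))."""
    return instruction_count * cpi * (1.0 / clock_rate_hz)

# Hypothetical workload: 2 billion instructions, average CPI of 1.5, 3 GHz clock.
t = cpu_time(instruction_count=2_000_000_000, cpi=1.5, clock_rate_hz=3.0e9)
print(f"CPU time = {t:.2f} s")   # 2e9 * 1.5 / 3e9 = 1.0 s
```

The example also makes the roles in Table 1.2 visible: the program, compiler, and instruction set determine IC and CPI, while the organization and technology determine CPI and the clock rate.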
One example of a suite is the Electronic Design News Embedded Microprocessor Benchmark Consortium (EEMBC) suite, containing 41 kernels for embedded applications in different domains. However, it does not have the reputation of being a "good predictor" of relative performance, because of its small kernels. This benchmark tries to replace Dhrystone, which is still in use. Another example is the Standard Performance Evaluation Corporation (SPEC) suite, dating back to the late '80s. It was originally used for workstations and covers many different applications in use today. Although data from a benchmark is frequently used to compare two computers, always remember: performance is relative to the program. Fig. 1.10 was built using the SPEC integer benchmarks; however, this benchmark has changed over the years. For this reason, the performance of newer computers is estimated by a scaling factor relating the performance of different versions of SPEC, e.g., SPEC89, SPEC92, SPEC95, SPEC2000, and SPEC2006 (Fig. 1.17).

Figure 1.17: SPEC benchmarks over time. Integer programs above the line and floating-point programs below.

However, there are still risks related to the use of benchmarks. Since many buying decisions are made on the basis of benchmark results, designers are pressured to optimize their designs for the benchmark.
