Computer Organization and Design Unit 4 PDF

Summary

This document is a presentation about Computer Organization and Design, focusing on Unit 4 - Computer Abstractions and Technology. It covers topics such as Introduction, Eight Great Ideas in Computer Architecture, Technologies for Building Processors and Memory, Performance, The Power Wall, The Switch from Uniprocessor to Multiprocessor, Benchmarking the Intel i7, Fallacies and Pitfalls, and Concluding Remarks.

Full Transcript


Computer Organization and Design
Prof. Mahesh Awati, Dr. Vanamala H R
Department of Electronics and Communication Engineering

UNIT 4 – Computer Abstractions and Technology

Syllabus/Topics
Unit 4: Computer Abstractions and Technology: Introduction, Eight Great Ideas in Computer Architecture, Technologies for Building Processors and Memory, Performance, The Power Wall, The Switch from Uniprocessor to Multiprocessor, Benchmarking the Intel i7, Fallacies and Pitfalls, Concluding Remarks.

Reference: Computer Organization and Design - The Hardware/Software Interface: RISC-V Edition, by David A. Patterson and John L. Hennessy.

Seven Great Ideas in Computer Architecture
- Abstraction
- Pipelining
- Parallelism
- Make the Common Case Fast
- Prediction
- Hierarchy of Memories
- Dependability

1. Use Abstraction to Simplify Design / Below Your Program
- A simplified view of hardware and software as hierarchical layers, with hardware in the center and application software outermost.
- A variety of systems software sits between the hardware and the application software: software that provides commonly useful services, including operating systems, compilers, loaders, and assemblers.
- An operating system interfaces between a user's program and the hardware and provides a variety of services and supervisory functions:
  - handling basic input and output operations,
  - allocating storage and memory,
  - providing protected sharing of the computer's resources among multiple applications.
- A compiler is a program that translates high-level language statements into assembly language statements.

Use Abstraction to Simplify Design / Below Your Program
Both computer architects and programmers had to invent techniques to make themselves more productive as resources grew dramatically, for otherwise design time would have lengthened. Abstraction hides implementation details: each level of representation hides the details of the level below it.

From the highest level of abstraction (most detail hidden) to the lowest (most detail exposed):
- High-level language program (e.g., C/C++): no need to remember the physical addresses of variables a, b, and c.
- Assembly language program (RISC-V): must know the registers in which a, b, and c are stored.
- Machine language program (RISC-V): loading a and b from memory, storing the result in memory.
- Instruction set architecture (ISA).
- Hardware architecture description (e.g., block diagram).
- Logic circuit description (e.g., circuit schematic).

Why use abstraction? It simplifies the design and helps us cope with enormous complexity.

Summary
- A compiler enables a programmer to write a high-level language expression such as A + B.
- The compiler compiles it into an assembly language statement such as add A, B.
- The assembler then translates this statement into the binary instructions that tell the computer to add the two numbers A and B.
Benefits of high-level programming:
- Allows the programmer to think in a more natural language, using English words and algebraic notation.
- Improves programmer productivity.
- Programming languages allow programs to be independent of the computer on which they were developed.

Abstraction and Inefficiency
Takeaway: abstraction is good, but it should not result in inefficiency.
[Figure: potential speedup of matrix multiply in Python for four successive optimizations.]
Reference: "A New Golden Age for Computer Architecture," John L. Hennessy and David A. Patterson, Communications of the ACM, February 2019, Vol. 62, No. 2, pp. 48-60.

Pipelined Datapath
[Figure: a 5-stage pipelined processor, with pipeline registers IF/ID, ID/IE, IE/MA, and MA/WB between the stages. The stages are: ❶ Instruction fetch, ❷ Instruction decode and register file read, ❸ Execution or address calculation, ❹ Memory access, ❺ Write back.]

Performance via Pipelining
- If these stages perform their tasks independently and in sequence, then a pipelined approach to execution can be used.
- A processor may have a single, two, three, five, or six pipeline stages.
[Figure: instructions I1-I5 flowing through the IF, ID, EX, MA, and WB stages, advancing one stage per clock cycle, so up to five instructions are in flight at once.]

Comparison of Non-Pipelined versus Pipelined Mode

Non-pipelined (single-cycle processor):
1. Clock period of the single-cycle processor: Tclk = T1 + T2 + T3 + T4 + T5
   Example:
   Instruction fetch      T1 = 160 ns
   Instruction decoding   T2 = 120 ns
   Execution              T3 = 130 ns
   Memory access          T4 = 100 ns
   Write back             T5 = 140 ns
   Tclk = (160 + 120 + 130 + 100 + 140) ns = 550 ns
2. Time taken by a program with N instructions: Ttotal = N × Tclk

5-stage pipelined processor:
1. Clock period of the pipelined processor: Tclk = max{T1, T2, T3, T4, T5}
   With the same stage times: Tclk = max{160, 120, 130, 100, 140} ns = 160 ns
2. Time taken by a program with N instructions (ideally): Ttotal = (5 + N - 1) × Tclk
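As a quick check of these formulas, here is a minimal Python sketch using the stage times from the example above; the instruction count N is an arbitrary choice, and the ideal fill-and-drain model ignores the hazards and stalls of a real pipeline:

```python
# Stage latencies in nanoseconds, from the example above.
stages = {"IF": 160, "ID": 120, "EX": 130, "MA": 100, "WB": 140}
N = 1000  # number of instructions in the program (arbitrary choice)

# Single-cycle: the clock period must fit all five stages.
t_clk_single = sum(stages.values())         # 550 ns
t_single = N * t_clk_single

# 5-stage pipeline: the clock period must fit only the slowest stage.
t_clk_pipe = max(stages.values())           # 160 ns
t_pipe = (5 + N - 1) * t_clk_pipe           # fill (5 cycles) + one per remaining instruction

print(f"single-cycle: {t_single} ns, pipelined: {t_pipe} ns")
print(f"speedup: {t_single / t_pipe:.2f}x") # approaches 550/160 = 3.44x for large N
```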
Performance via Parallelism
Doing different parts of a task in parallel accomplishes the task in less time than doing them sequentially.
- Sequential computing is a computational model in which operations are performed in order, one at a time, on one processor or computer.
- Parallel computing is a computational model where a problem or program is broken into multiple smaller sequential computing operations, some of which are performed simultaneously in parallel.

Exercise: find how many steps are needed in the case of (a) sequential execution and (b) parallel execution if two adders are available. Watch for data dependency and resource dependency. (A scheduling sketch follows the list.)
P1: C = D × E
P2: M = G + C
P3: A = B + C
P4: C = L + M
P5: F = G / E
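A minimal sketch of the dependence analysis, assuming one multiplier, one divider, and the stated two adders, with every operation taking one step; P4's write to C is renamed C2 here to expose the dependence chain:

```python
# Each operation: (name, destination, sources, functional unit).
ops = [
    ("P1", "C",  {"D", "E"}, "mul"),
    ("P2", "M",  {"G", "C"}, "add"),
    ("P3", "A",  {"B", "C"}, "add"),
    ("P4", "C2", {"L", "M"}, "add"),
    ("P5", "F",  {"G", "E"}, "div"),
]
units = {"mul": 1, "div": 1, "add": 2}   # resource constraint: two adders

ready = {"B", "D", "E", "G", "L"}        # operands available at the start
pending, step = ops[:], 0
while pending:
    step += 1
    budget = dict(units)
    fire = []
    for op in pending:                   # ops whose sources are ready, unit permitting
        name, dst, srcs, unit = op
        if srcs <= ready and budget[unit] > 0:
            budget[unit] -= 1
            fire.append(op)
    for op in fire:
        pending.remove(op)
        ready.add(op[1])                 # results become available next step
    print(f"step {step}: {[op[0] for op in fire]}")
# step 1: P1, P5   step 2: P2, P3   step 3: P4  -> 3 steps, versus 5 sequentially
```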
Performance via Prediction
In some cases it can be faster on average to guess and start working rather than wait until you know for sure, assuming that the mechanism to recover from a misprediction is not too expensive and your prediction is relatively accurate.

Example: a conditional branch (BZ Next, taken if Z == 1) and the instructions around it:
Address  Label  Instruction
0x1000          I1
0x1004          I2
0x1008          BZ Next
0x100C          I4
0x1010          I5
0x1014          I6
0x1018          I7
0x101C          I8
0x1020   Next:  I9
If the branch is not taken, execution continues with I4, I5, and so on; if it is taken, it continues at Next with I9. The processor has to guess which path to fetch before the condition is known.

Branch Prediction
In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch will go before this is known definitively. Is it possible to avoid flushing the pipeline? Branch prediction reduces the effect of a pipeline flush by predicting likely branches and loading the new branch address prior to the execution of the instruction.

Make the Common Case Fast
Making the common case fast will tend to enhance performance better than optimizing the rare case. Ironically, the common case is often simpler than the rare case and hence is usually easier to enhance.
Example: suppose a task is 80% addition and 20% multiplication. Even cutting the multiplication time to one tenth (20% down to 2%) still leaves 82% of the original time, whereas improving addition by just 50% (80% down to 40%) leaves only 60%. Improving the common case wins.

Amdahl's Law:
- A program needs 20 hours to complete.
- No matter what, 1 hour must run sequentially.
- Only the remaining 19 hours can run in parallel.
(A sketch of the resulting speedup limit follows below.)

Make the Common Case Fast
Suppose a program runs in 100 seconds on a computer, with multiply operations responsible for 80 seconds of this time. How much do I have to improve the speed of multiplication if I want my program to run five times faster? (The answer appears under Pitfalls, later in this unit.)
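To make the 20-hour example concrete, here is a minimal sketch assuming the 19 parallelizable hours divide perfectly across p processors:

```python
# Amdahl's Law for the example above: 1 hour serial, 19 hours parallelizable.
serial, parallel = 1.0, 19.0

def speedup(p):
    # time with p processors = serial part + parallel part split p ways
    return (serial + parallel) / (serial + parallel / p)

for p in (1, 2, 10, 100, 1_000_000):
    print(f"{p:>7} processors: speedup {speedup(p):.2f}x")
# Even with unlimited processors the speedup only approaches (1+19)/1 = 20x.
```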
  3. areas that can conduct or insulate under specific conditions (acting as a switch).
- Transistors fall into the last category.
- A VLSI circuit, then, is just billions of combinations of conductors, insulators, and switches manufactured in a single small package.

The chip manufacturing process:
- A silicon crystal ingot is a rod composed of a silicon crystal, 8 to 12 inches in diameter and 12 to 24 inches long.
- A wafer is a slice from the ingot, no more than 0.1 inch thick.
- Patterns of chemicals are placed on each wafer, creating the transistors, conductors, and insulators; this places many independent components on a single wafer.
- The wafer is then chopped up, or diced, into dies: the individual rectangular sections cut from a wafer (for example, a processor die).
- A defect is a microscopic flaw in a wafer or in the patterning steps.
- Good dies are connected to the input/output pins of a package.
- Why are wafers circular? The process of converting silicon sand into a silicon crystal involves spinning, and the most efficient spinning is circular.

Cost and Price of the Chip
Yield: the percentage of good dies from the total number of dies on the wafer.
The cost of an integrated circuit can be expressed in three simple equations:

Cost per die = Cost per wafer / (Dies per wafer × Yield)   (1)
Dies per wafer ≈ Wafer area / Die area                     (2)
Yield = 1 / (1 + Defects per area × Die area)^N            (3)

Equation (2) is an approximation, since it does not subtract the area near the border of the round wafer that cannot accommodate the rectangular dies. In equation (3), N is the process-complexity factor and depends on the technology used; N must be known to calculate the yield.

Numerical question: find the number of dies per 300 mm (30 cm) wafer for a die that is 1.5 cm on a side and for a die that is 1.0 cm on a side.
Case 1 (1.5 cm die): Number of dies ≈ Wafer area / Die area = πr² / (1.5 × 1.5) = (3.14 × 15²) / 2.25 ≈ 314
Case 2 (1.0 cm die): Number of dies ≈ πr² / (1.0 × 1.0) = (3.14 × 15²) / 1.0 ≈ 706
Numerical question: for a 300 mm diameter wafer, find the die yield for dies that are 1.5 cm on a side and 1.0 cm on a side, assuming a defect density of 0.047 per cm² and N = 12.

Case 1: die area = (1.5 × 1.5) cm² = 2.25 cm²
Yield = 1 / (1 + 0.047 × 2.25)^12 = 0.2993
Out of 314 dies, about 94 dies will be good.

Case 2: die area = (1.0 × 1.0) cm² = 1.00 cm²
Yield = 1 / (1 + 0.047 × 1.00)^12 = 0.5762
Out of 706 dies, about 407 dies will be good.

Numerical question:
Wafer 1: diameter 15 cm, cost 12, contains 84 dies, 0.020 defects/cm².
Wafer 2: diameter 20 cm, cost 15, contains 100 dies, 0.031 defects/cm².
a. Find the yield for both wafers.
b. Find the cost per die for both wafers.
c. If the number of dies per wafer is increased by 10% and the defects per area unit increase by 15%, find the die area and yield.
d. Assume a fabrication process improves the yield from 0.92 to 0.95. Find the defects per area unit for each version of the technology, given a die area of 200 mm².
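A minimal sketch reproducing the 300 mm worked example above (and reusable for the exercise just stated); as noted earlier, the simple dies-per-wafer approximation ignores the unusable dies at the wafer edge:

```python
import math

def dies_per_wafer(wafer_diameter_cm, die_side_cm):
    # Approximation (2): wafer area / die area, ignoring edge losses.
    r = wafer_diameter_cm / 2
    return (math.pi * r**2) / (die_side_cm**2)

def yield_rate(defects_per_cm2, die_area_cm2, n):
    # Equation (3): yield = 1 / (1 + defects-per-area x die-area)^N
    return 1 / (1 + defects_per_cm2 * die_area_cm2) ** n

for side in (1.5, 1.0):
    dies = dies_per_wafer(30, side)           # 300 mm = 30 cm wafer
    y = yield_rate(0.047, side * side, 12)    # defect density and N from above
    print(f"{side} cm die: {dies:.0f} dies, yield {y:.4f}, good dies {dies * y:.0f}")
# 1.5 cm die: 314 dies, yield 0.2993, ~94 good dies
# 1.0 cm die: 707 dies (the slides round pi to 3.14 and get 706), yield 0.5762, ~407 good dies
```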
Performance

Why is the performance assessment of a computer so difficult? Basic definitions:
- Response time / execution time: the total time required for the computer to complete a task, including disk accesses, memory accesses, I/O activities, operating system overhead, CPU execution time, and so on. Response time is what matters to individual users.
- Throughput / bandwidth: the number of tasks completed per unit time.

Task: throughput and response time. Do the following changes to a computer system increase throughput, decrease response time, or both?
Case 1: replacing the processor in a computer with a faster version. Decreasing response time almost always improves throughput; hence, in this case, both response time and throughput are improved.
Case 2: adding additional processors to a system that uses multiple processors for separate tasks. In this case, no one task gets work done faster, so only throughput increases.

Performance of a Computer
To maximize performance, we want to minimize response time or execution time for some task: the lower the response or execution time, the higher the performance (performance and response or execution time are inversely proportional). Thus, we can relate performance and execution time for a computer X:

Performance_X = 1 / Execution time_X

For two computers X and Y, if the performance of X is greater than the performance of Y, then the execution time on Y is longer than the execution time on X.

Relative Performance
How do we relate the performance of two different computers quantitatively? If X is n times as fast as Y, then the execution time on Y is n times as long as it is on X:

Performance_X / Performance_Y = Execution time_Y / Execution time_X = n

(Try it: run the same C program on two different laptops, check the time taken by each, and identify the reason for the difference.)

Example: computer X runs a program in 10 seconds, and computer Y runs the same program in 15 seconds. How much faster is X than Y? n = 15 / 10 = 1.5, so X is 1.5 times as fast as Y.
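The same comparison as a minimal Python sketch, with the execution times from the example above:

```python
# Relative performance of the two machines in the example above.
time_x, time_y = 10.0, 15.0            # seconds for the same program

perf_x, perf_y = 1 / time_x, 1 / time_y
n = perf_x / perf_y                    # equivalently time_y / time_x
print(f"X is {n:.1f} times as fast as Y")   # 1.5
```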
Measuring Performance
"The computer that performs the same amount of work in the least time is the fastest."
Wall clock time, response time, or elapsed time: the total time to complete a task, including disk accesses, memory accesses, input/output (I/O) activities, and operating system overhead: everything.
[Figure: three sequential uni-programming processes.]

Computers are often shared, however, and a processor may work on several programs simultaneously (a multiprogramming environment). In such cases, the system may try to optimize throughput rather than attempt to minimize the elapsed time for one program.
[Figure: a multiprogramming timeline in which green regions indicate CPU execution and yellow regions indicate I/O operations. Processes B and C can run while A is waiting on its I/O operation; similarly, A and C execute while B is waiting on I/O. As a result, the CPU is completely idle only while C's I/O operation is performed at time 15, because A and B have already run to completion.]

CPU execution time (CPU time): the time the CPU spends computing a particular task, not including time spent waiting for I/O or running other programs. CPU time can be further divided into:
- User CPU time: the amount of time the processor spends running your application code.
- System CPU time: the time spent running code in the operating system kernel on behalf of your program.
We will use "CPU performance" to refer to user CPU time and concentrate on it.

CPU Performance and Its Factors
Different applications are sensitive to different aspects of the performance of a computer system. For example, applications running on servers depend as much on I/O performance, which in turn relies on both hardware and software.
CPU performance: the bottom-line performance measure is CPU execution time. A simple formula relates the most basic metrics (clock cycles and clock cycle time) to CPU time:

CPU execution time for a program = CPU clock cycles for the program × Clock cycle time

Alternatively, because clock rate and clock cycle time are inverses:

CPU execution time for a program = CPU clock cycles for the program / Clock rate

Who is responsible for improving the number of clock cycles required for a program, or the length of the clock cycle? The hardware designer.

Example: our favorite program runs in 10 seconds on computer A, which has a 2 GHz clock. A computer designer wants to build a computer B that will run this program in 6 seconds. The designer has determined that a substantial increase in the clock rate is possible, but this increase will affect the rest of the CPU design, causing computer B to require 1.2 times as many clock cycles as computer A for this program. What clock rate should we tell the designer to target?

First find the number of clock cycles for A:
CPU clock cycles_A = 10 s × 2 GHz = 20 × 10^9 cycles
Then solve for B's clock rate:
Clock rate_B = (1.2 × 20 × 10^9 cycles) / 6 s = 4 GHz
Answer: 4 GHz.
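The same calculation as a minimal sketch, using the numbers from the example above:

```python
# Reproducing the clock-rate example above.
time_a = 10.0                 # seconds on computer A
clock_a = 2e9                 # 2 GHz
cycles_a = time_a * clock_a   # 20e9 clock cycles for the program

time_b = 6.0                  # target execution time on computer B
cycles_b = 1.2 * cycles_a     # B needs 1.2x as many cycles
clock_b = cycles_b / time_b   # required clock rate
print(f"target clock rate: {clock_b / 1e9:.0f} GHz")   # 4 GHz
```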
Instruction Performance
- The performance equations above did not include any reference to the number of instructions needed for the program, yet a program is basically a set of instructions.
- The computer has to execute the instructions to run the program, so execution time must depend on the number of instructions in a program.
- Therefore, the number of clock cycles required for a program can be written as:

CPU clock cycles = Instructions for a program × Average clock cycles per instruction

Example: if 2000 instructions are retired and the average clock cycles per instruction is 0.75, then CPU clock cycles = 2000 × 0.75 = 1500 clock cycles.

Clock cycles per instruction (CPI): one way of comparing two different implementations.
- The average number of clock cycles each instruction takes to execute.
- Since different instructions may take different amounts of time depending on what they do, CPI is an average over all the instructions executed in the program.
- CPI provides one way of comparing two different implementations of the identical instruction set architecture, since the number of instructions executed for a program will, of course, be the same.

Example: suppose we have two implementations of the same instruction set architecture. Computer A has a clock cycle time of 250 ps and a CPI of 2.0 for some program; computer B has a clock cycle time of 500 ps and a CPI of 1.2 for the same program. Which computer is faster for this program, and by how much?

First, find the number of processor clock cycles for each computer, where I is the program's instruction count:
CPU clock cycles_A = I × 2.0
CPU clock cycles_B = I × 1.2
Now compute the CPU time for each computer:
CPU time_A = I × 2.0 × 250 ps = 500 × I ps
CPU time_B = I × 1.2 × 500 ps = 600 × I ps
Conclusion: computer A is 600/500 = 1.2 times as fast as computer B for this program.
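A minimal sketch of the same comparison; the instruction count cancels out of the ratio, so any value works:

```python
# CPU time = instruction count x CPI x clock cycle time, for the example above.
I = 1.0                       # instruction count (cancels out in the ratio)
time_a = I * 2.0 * 250e-12    # computer A: CPI 2.0, 250 ps cycle -> 500 ps per instruction
time_b = I * 1.2 * 500e-12    # computer B: CPI 1.2, 500 ps cycle -> 600 ps per instruction
print(f"A is {time_b / time_a:.1f} times as fast as B")   # 1.2
```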
The Classic CPU Performance Equation
The basic performance equation in terms of instruction count (the number of instructions executed by the program), CPI, and clock cycle time:

CPU time = Instruction count × CPI × Clock cycle time
         = Instruction count × CPI / Clock rate

- The three key factors that affect performance are instruction count, CPI, and clock rate.
- We can use these formulas to compare two different implementations, or to evaluate a design alternative if we know its impact on these three parameters.

Comparing Code Segments
A compiler writer is choosing between two code sequences for a computer with three instruction classes, A, B, and C, requiring 1, 2, and 3 clock cycles respectively. Code sequence 1 uses 2, 1, and 2 instructions of classes A, B, and C; code sequence 2 uses 4, 1, and 1.

- Number of instructions in code sequence 1: 2 + 1 + 2 = 5
- Number of instructions in code sequence 2: 4 + 1 + 1 = 6
Code sequence 1 executes fewer instructions.

- Total number of clock cycles for sequence 1: 2×1 + 1×2 + 2×3 = 10 clock cycles
- Total number of clock cycles for sequence 2: 4×1 + 1×2 + 1×3 = 9 clock cycles
So code sequence 2 is faster, even though it executes one extra instruction.

- CPI of sequence 1: 10 / 5 = 2.0
- CPI of sequence 2: 9 / 6 = 1.5
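A minimal sketch of the comparison, assuming the per-class cycle counts stated above:

```python
# Instruction classes A, B, C take 1, 2, and 3 cycles respectively.
cpi_class = {"A": 1, "B": 2, "C": 3}
seq1 = {"A": 2, "B": 1, "C": 2}   # code sequence 1
seq2 = {"A": 4, "B": 1, "C": 1}   # code sequence 2

for name, seq in (("sequence 1", seq1), ("sequence 2", seq2)):
    count = sum(seq.values())                                 # instruction count
    cycles = sum(n * cpi_class[c] for c, n in seq.items())    # total clock cycles
    print(f"{name}: {count} instructions, {cycles} cycles, CPI {cycles / count}")
# sequence 1: 5 instructions, 10 cycles, CPI 2.0
# sequence 2: 6 instructions,  9 cycles, CPI 1.5
```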
The basic components of performance and how each is measured:

Component                                Units of measure
CPU execution time for a program         Seconds for the program
Instruction count                        Instructions executed for the program
Clock cycles per instruction (CPI)       Average number of clock cycles per instruction
Clock cycle time                         Seconds per clock cycle

IPC (instructions per clock cycle):
- Although you might expect that the minimum CPI is 1.0, some processors fetch and execute multiple instructions per clock cycle.
- To reflect that approach, some designers invert CPI to talk about IPC, or instructions per clock cycle.
- If a processor executes on average two instructions per clock cycle, then it has an IPC of 2 and hence a CPI of 0.5.
As a rule of thumb: a single-cycle processor always has a CPI of 1; a pipelined processor has a CPI greater than 1; and some processors have a CPI of less than 1, which means multiple instructions complete per clock cycle.

The Power Wall
[Figure: clock rate and power for Intel x86 microprocessors over nine generations and 36 years. Both clock rate and power first increased rapidly, then flattened or dropped off.]
How could clock rates grow by a factor of 1000 while power increased by only a factor of 30? Energy, and thus power, can be reduced by lowering the voltage, which occurred with each new generation of technology, and power is a function of the voltage squared.

Parameters Affecting the Performance of a Program
The performance of a program depends on the algorithm, the language, the compiler, the architecture, and the actual hardware.
- Algorithm: affects instruction count, and possibly CPI. The algorithm determines the number of source program instructions executed; it may also affect CPI by favoring slower or faster instructions (for example, an algorithm that uses more divides).
- Programming language: affects instruction count and CPI. Statements in the language are translated into processor instructions, which determines the instruction count; a language with heavy support for data abstraction (e.g., Java) will require indirect calls, which use higher-CPI instructions.
- Compiler: affects instruction count and CPI, and may increase or decrease either. The efficiency of the compiler affects both the instruction count and the average cycles per instruction, since the compiler determines the translation of the source language statements into computer instructions.
- Instruction set architecture (ISA): affects instruction count, CPI, and clock rate, since it affects the instructions needed for a function, the cost in cycles of each instruction, and the overall clock rate of the processor.
Turbo mode:
- Although clock cycle time has traditionally been fixed, to save energy or temporarily boost performance, today's processors can vary their clock rates, so we would need to use the average clock rate for a program.
- For example, the Intel Core i7 will temporarily increase its clock rate by about 10% until the chip gets too warm. Intel calls this Turbo mode.

The Power Wall (continued)
Returning to the plot of clock rate and power over nine generations of Intel x86 microprocessors: the reason clock rate and power grew together is that they are correlated, and the reason for their recent slowing is that we have run into the practical power limit for cooling commodity microprocessors.
- The dominant technology for integrated circuits is called CMOS.
- For CMOS, the primary source of energy consumption is dynamic energy, i.e., the energy consumed when transistors switch states.
- The dynamic energy depends on the capacitive loading of each transistor and the voltage applied:

Energy ∝ Capacitive load × Voltage²
Power ∝ Capacitive load × Voltage² × Frequency switched

Relative Power
Suppose we developed a new, simpler processor that has 85% of the capacitive load of the more complex older processor. Further, assume that it can adjust voltage so that it can reduce voltage by 15% compared to the older processor, which results in a 15% shrink in frequency. What is the impact on dynamic power?

P_new / P_old = ((0.85 × C) × (0.85 × V)² × (0.85 × f)) / (C × V² × f) = 0.85⁴ ≈ 0.52

The new processor uses about half the power of the old one.
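The same arithmetic as a minimal sketch, with the three 15% reductions from the example above:

```python
# Dynamic power scales with capacitive load x voltage^2 x frequency.
cap_ratio, volt_ratio, freq_ratio = 0.85, 0.85, 0.85

power_ratio = cap_ratio * volt_ratio**2 * freq_ratio
print(f"new power / old power = {power_ratio:.2f}")   # 0.85^4 = 0.52
```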
The Sea Change: The Switch from Uniprocessors to Multiprocessors
[Figure: growth in processor performance since the mid-1980s, plotted relative to the VAX 11/780, the first of a range of popular and influential computers implementing the VAX ISA.]
- Prior to the mid-1980s, processor performance growth was largely technology driven and averaged about 25% per year.
- The increase in growth to about 52% per year since then is attributable to more advanced architectural and organizational ideas.
- Since 2002, the limits of power, available instruction-level parallelism, and long memory latency have slowed uniprocessor performance, recently to about 3.5% per year.

What do we do? From the mid-1980s, the industry focused on decreasing the response time of one program running on a single processor. As of 2006, all desktop and server companies were instead shipping microprocessors with multiple processors per chip; the benefit is often more on throughput than on response time. Companies refer to the processors as "cores," and such microprocessors are generically called multicore microprocessors. Hence, a "quad-core" microprocessor is a chip that contains four processors, or four cores.

Challenges for Programmers
- Programmers must rewrite their programs to take advantage of multiple processors.
- Programmers will have to continue to improve the performance of their code as the number of cores increases.
Why has it been so hard for programmers to write explicitly parallel programs?
- Parallel programming is by definition performance programming, which increases the difficulty of programming.
- To be fast on parallel hardware, the programmer must divide an application so that each processor has roughly the same amount to do at the same time, and so that the overhead of scheduling and coordination doesn't fritter away the potential performance benefits of parallelism.

To reflect this sea change in the industry:
- Parallelism and Instructions: Synchronization.
- Parallelism and Computer Arithmetic: Subword Parallelism.
- Parallelism via Instructions.
- Parallelism and Memory Hierarchies: Cache Coherence.
- Parallelism and Memory Hierarchy: Redundant Arrays of Inexpensive Disks.

Benchmarking the Intel Core i7

SPEC CPU Benchmark
- Workload: a set of programs run on a computer. To evaluate two computer systems, a user would simply compare the execution time of the workload on the two computers.
- Benchmarks: programs specifically chosen to measure performance. The benchmarks form a workload that the user hopes will predict the performance of the actual workload.
- SPEC (System Performance Evaluation Cooperative) is an effort funded and supported by a number of computer vendors to create standard sets of benchmarks for modern computer systems.
- In 1989, SPEC originally created a benchmark set focusing on processor performance (now called SPEC89).
- The latest is SPEC CPU2017, which consists of a set of 10 integer benchmarks (SPECspeed 2017 Integer) and 13 floating-point benchmarks (CFP2017).
- The integer benchmarks vary from part of a C compiler to a chess program to a quantum computer simulation.
- The floating-point benchmarks include structured grid codes for finite element modeling, particle method codes for molecular dynamics, and sparse linear algebra codes for fluid dynamics.

SPECspeed 2017 Integer benchmarks running on a 1.8 GHz Intel Xeon E5-2650L
[Table: the SPEC integer benchmarks and their execution times on the Intel Core i7, with the factors that explain execution time: instruction count, CPI, and clock cycle time.]

Execution time:
Execution time = Instruction count × CPI × Clock cycle time
Example (Perlbench): 2684 × 10^9 instructions × 0.42 CPI × 0.556 × 10^-9 s ≈ 627 seconds.

Reference time:
The reference time, supplied by SPEC, is the execution time of the benchmark on a reference machine. Example (Perlbench): 1774 seconds.

SPECratio:
Results are reported as the ratio of the reference time to the system run time:
SPECratio = Reference time / Execution time
where the reference time is the execution time of benchmark x on the reference machine and the execution time is that of benchmark x on the system under test.
Example (Perlbench): SPECratio = 1774 / 627 ≈ 2.83.

Geometric mean:
The single number quoted as SPECspeed 2017 Integer is the geometric mean of the SPECratios: the overall performance is calculated by averaging the ratios for all the integer benchmarks using the geometric mean,

Geometric mean = (SPECratio_1 × SPECratio_2 × … × SPECratio_n)^(1/n)
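A minimal sketch of these two calculations; only the Perlbench numbers come from the table above, and the remaining SPECratios are hypothetical placeholders for illustration:

```python
import math

def spec_ratio(reference_time, execution_time):
    # SPECratio = reference time / measured execution time.
    return reference_time / execution_time

ratios = [spec_ratio(1774, 627)]   # Perlbench, ~2.83
ratios += [2.5, 3.1, 2.0]          # hypothetical SPECratios for other benchmarks

# The quoted SPECspeed number is the geometric mean of the per-benchmark ratios.
geo_mean = math.prod(ratios) ** (1 / len(ratios))
print(f"geometric mean: {geo_mean:.2f}")
```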
SPEC Power Benchmark
- Given the increasing importance of energy and power, SPEC added a benchmark to measure power.
- SPECpower_ssj2008 was the first industry-standard benchmark for measuring both the performance and the power consumption of servers. (SSJ stands for Server Side Java.)
  http://www.spec.org/power/docs/SPECpower_ssj2008-Design_ssj.pdf
- SPECpower reports the power consumption of servers at different workload levels, divided into 10% increments, over a period of time. SPEC boils these numbers down to a single number, called "overall ssj_ops per watt":

Overall ssj_ops per watt = (Σ_i ssj_ops_i) / (Σ_i power_i)

where ssj_ops_i is the performance at each 10% workload increment and power_i is the power consumed at that performance level.

Pitfalls and Fallacies
- Pitfall (literally, a covered pit for use as a trap): a mistake, for example the use of a subset of the performance equation as a performance metric.
- Fallacy: reasoning that is logically invalid; a misunderstanding.
- Corollary: a statement that follows from one already proved, for example "make the common case fast."

Pitfall: Amdahl's Law
Expecting the improvement of one aspect of a computer to increase overall performance by an amount proportional to the size of the improvement: an improvement in x by a factor f does not improve overall performance by the factor f.

Suppose a program runs in 100 seconds on a computer, with multiply operations responsible for 80 seconds of this time. How much do I have to improve the speed of multiplication if I want my program to run five times faster?

Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

To run five times faster, the total must drop to 100 / 5 = 20 seconds:
20 = 80 / n + 20, which requires 80 / n = 0.
That is, there is no amount by which we can enhance multiply to achieve a fivefold increase in performance if multiply accounts for only 80% of the workload. It can't be done! Corollary: make the common case fast.
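The impossibility falls out directly when you solve for the required speedup n, as in this minimal sketch:

```python
# Execution time after improvement = affected time / n + unaffected time.
total, affected = 100.0, 80.0
unaffected = total - affected        # 20 s that cannot be improved

target = total / 5                   # five times faster -> 20 s
# Solve target = affected / n + unaffected for the multiply speedup n:
# 20 = 80 / n + 20  ->  80 / n = 0, so no finite n works.
needed = target - unaffected
print("impossible" if needed <= 0 else f"n = {affected / needed:.1f}")
```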
Fallacy: computers at low utilization use little power.
- Look back at the i7 power benchmark:
  - at 100% load: 258 W,
  - at 50% load: 170 W (66% of peak),
  - at 10% load: 121 W (47% of peak).
- Google data centers mostly operate at 10%-50% load, and are at 100% load less than 1% of the time.
- This argues for designing processors to make power proportional to load; yet a specially configured computer with the best results in 2020 still used 33% of its peak power at 10% of the load.

Fallacy: designing for performance and designing for energy efficiency are unrelated goals.
Since energy is power over time, it is often the case that hardware or software optimizations that take less time save energy overall, even if the optimization takes a bit more energy while it is being used.

Pitfall: MIPS as a Performance Metric
Using a subset of the performance equation as a performance metric.
- We have already warned about the danger of predicting performance based on simply one of clock rate, instruction count, or CPI. Another common mistake is to use only two of the three factors to compare performance.
- MIPS (millions of instructions per second) is an instruction execution rate. For a given program:

MIPS = Instruction count / (Execution time × 10^6)

- MIPS specifies performance inversely to execution time; faster computers have a higher MIPS rating.
There are three problems with using MIPS as a measure for comparing computers:
1. MIPS specifies the instruction execution rate but does not take into account the capabilities of the instructions.
2. We cannot compare computers with different instruction sets using MIPS, since the instruction counts will certainly differ.
3. MIPS varies between programs on the same computer; thus, a computer cannot have a single MIPS rating. (A change in CPI changes the MIPS rating.)

Fallacy 1: one usual fallacy is to consider the computer with the largest clock rate as having the highest performance. Check whether this is true for P1 and P2 by calculating their CPU times:

Processor P1: clock rate 4 GHz, average CPI 0.9, instructions to be executed 5.0E9.
Processor P2: clock rate 3 GHz, average CPI 0.75, instructions to be executed 1.0E9.

Fallacy 2: "the processor executing the largest number of instructions will need a larger CPU time." Check whether this is true for P1 and P2: considering that processor P1 executes a sequence of 1.0E9 instructions and that the CPIs of P1 and P2 do not change, determine the number of instructions that P2 can execute in the same time that P1 needs to execute 1.0E9 instructions. (Both fallacies are worked in the sketch below.)
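A minimal sketch working fallacies 1 and 2 with the P1/P2 parameters given above:

```python
# CPU time = instructions x CPI / clock rate, for processors P1 and P2 above.
clock_p1, cpi_p1 = 4e9, 0.9
clock_p2, cpi_p2 = 3e9, 0.75

# Fallacy 1: the highest clock rate does not guarantee the highest performance.
t_p1 = 5.0e9 * cpi_p1 / clock_p1          # 1.125 s
t_p2 = 1.0e9 * cpi_p2 / clock_p2          # 0.25 s
print(f"P1: {t_p1} s, P2: {t_p2} s")      # P2 finishes first despite the slower clock

# Fallacy 2: P1 runs 1.0E9 instructions; how many can P2 retire in that time?
t_ref = 1.0e9 * cpi_p1 / clock_p1         # 0.225 s
instr_p2 = t_ref * clock_p2 / cpi_p2      # 0.9E9 instructions
print(f"P2 executes {instr_p2:.1e} instructions in {t_ref} s")
```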
Fallacy 3: "use MIPS (millions of instructions per second) to compare the performance of two different processors, and consider that the processor with the largest MIPS has the largest performance." Check whether this is true for P1 and P2:

Processor P1: clock rate 4 GHz, average CPI 0.9, instructions to be executed 1.0E9; execution time: ?
Processor P2: clock rate 3 GHz, average CPI 0.75, instructions executed in the same time as P1: ?

Remember the three problems with using MIPS as a measure for comparing computers: if a new program executes more instructions but each instruction is faster, MIPS can vary independently from performance!

Consider the following performance measurements for a program:
[Table of measurements not reproduced in this transcript.]
a. Which computer has the higher MIPS rating?
b. Which computer is faster?

Exercise questions: 1.6, 1.7, 1.8, 1.9, 1.10, 1.12, 1.15

THANK YOU
Prof. Mahesh Awati
Dr. Vanamala H R
Department of Electronics and Communication
