lec3-4471029-computer-abstraction-and-technologies(performance).pdf
Kangwon National University
Computer Abstractions and Technology: Performance
471029: Introduction to Computer Architecture
3rd Lecture
Disclaimer: Slides are mainly based on the COD 5th edition textbook and were also developed in part by Profs. Dohyung Kim @ KNU and the computer architecture courses @ KAIST and SKKU

Understanding Performance
  Algorithm
    Determines number of operations executed
  Programming language, compiler, architecture
    Determine number of machine instructions executed per operation
  Processor and memory system
    Determine how fast instructions are executed
  I/O system (including OS)
    Determines how fast I/O operations are executed

Defining Performance
  Which airplane has the best performance?
  [Figure: four bar charts comparing the Boeing 777, Boeing 747, BAC/Sud Concorde, and Douglas DC-8-50 by passenger capacity, cruising range (miles), cruising speed (mph), and passengers x mph]

Execution Time and Throughput
  Execution time
    How long it takes to do a task
  Throughput
    Total work done per unit time
    E.g., tasks / transactions / … per hour
  How are execution time and throughput affected by
    Replacing the processor with a faster version?
    Adding more processors?
  We'll focus on execution time for now…

Relative Performance
  Define Performance = 1 / Execution Time
  "X is n times faster than Y":
    $\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$
  Example: time taken to run a program
    If 10s on A, 15s on B, how much faster is A than B?
    Performance_A / Performance_B = Execution Time_B / Execution Time_A = 15s / 10s = 1.5
    So A achieves 1.5 times the performance of B

Measuring Execution Time
  Elapsed time
    Total execution time, including all aspects: processing, I/O, OS overhead, idle time
    Determines system performance
  CPU time
    Time spent processing a given job in a shared system
    Discounts I/O time and other jobs' shares
    Comprises user CPU time and system CPU time
    Different programs are affected differently by CPU and system performance (e.g., batch program vs. interactive program)

CPU Clocking
  Operation of digital hardware is governed by a constant-rate clock
  [Figure: timing diagram showing the clock period, clock cycles, data transfer and computation, and state update]
  Clock period: duration of a clock cycle
    E.g., 250ps = 0.25ns = 250 x 10^-12 s
    Millisecond (ms): 10^-3 s
    Microsecond (us): 10^-6 s
    Nanosecond (ns): 10^-9 s
    Picosecond (ps): 10^-12 s
  Clock frequency (rate): cycles per second
    E.g., 4.0GHz = 4000MHz = 4000000kHz = 4.0 x 10^9 Hz

[Aside] Scaling Governor in Linux
  /sys/devices/system/cpu/cpuN/cpufreq
    scaling_max_freq
    scaling_cur_freq
    scaling_min_freq
    scaling_governor
  Documentation/admin-guide/pm/cpufreq.rst

CPU Time
  CPU execution time (CPU time)
    Time the CPU spends working on a task
    Does not include time waiting for I/O or running other programs
  CPU time can be improved (decreased) by
    Reducing the number of clock cycles
    Increasing the clock rate
    Hardware designers often trade off clock rate against cycle count
  $\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$

CPU Time Example
  Computer A: 2GHz clock, 10s CPU time
  Designing Computer B
    Aim for 6s CPU time
    A faster clock is possible, but it causes 1.2 x as many clock cycles
  How fast must Computer B's clock be?
  kHz: 10^3 Hz
  MHz: 10^6 Hz
  GHz: 10^9 Hz
  Answer
    Computer A: 10s = cycles_A (C) x interval_A = C / rate_A
      C = 10s x rate_A = 10s x 2GHz = 20G cycles
    Computer B: 6s = 1.2C x interval_B = 24G / rate_B
      rate_B = 24G / 6s = 4GHz

Instruction Performance
  The number of instructions executed is called the instruction count (IC)
    Determined by program, ISA and compiler
  Clock cycles per instruction is referred to as CPI
    If the instruction count and the clock rate are the same, the CPU time is determined by the value of CPI
    Determined by CPU hardware
    Different instructions have different CPI
  $\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$

CPI Example
  Computer A: Cycle Time = 250ps, CPI = 2.0
  Computer B: Cycle Time = 500ps, CPI = 1.2
  Same ISA
  Which is faster, and by how much?
    CPU Time_A = Instruction Count x CPI_A x Cycle Time_A = I x 2.0 x 250ps = I x 500ps (A is faster)
    CPU Time_B = Instruction Count x CPI_B x Cycle Time_B = I x 1.2 x 500ps = I x 600ps
    CPU Time_B / CPU Time_A = (I x 600ps) / (I x 500ps) = 1.2, so A is 1.2 times faster

CPI in More Detail
  If different instruction classes take different numbers of cycles
    $\text{Clock Cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{Instruction Count}_i)$
  Weighted average CPI
    $\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$
    where $\frac{\text{Instruction Count}_i}{\text{Instruction Count}}$ is the relative frequency of class $i$

CPI Example
  Let's assume the hardware designers have supplied the following facts
  Alternative compiled code sequences using instructions in classes A, B, C

    Instruction class:   A  B  C
    CPI for class:       1  2  3
    IC in Sequence 1:    2  1  2
    IC in Sequence 2:    4  1  1

  If you are a compiler writer (designer), which code sequence would you adopt in your compiler?
    Which code sequence executes the most instructions?
    Which will be faster?
    What is the CPI for each sequence?

CPI Example (Cont'd)
  Sequence 1: IC = 5
    Clock Cycles = 2x1 + 1x2 + 2x3 = 10
    Avg. CPI = 10/5 = 2.0
  Sequence 2: IC = 6 (the most instructions)
    Clock Cycles = 4x1 + 1x2 + 1x3 = 9 (faster)
    Average CPI (better than Sequence 1):
      CPI = 9/6 = 1.5

Pitfall: Amdahl's Law
  One case of the pitfall
    Improving an aspect of a computer and expecting a proportional improvement in overall performance
  Amdahl's Law: "the performance enhancement possible with a given improvement is limited by the amount that the improved feature is used"
  Example
    Let's assume multiply operations account for 80s of a 100s run
    How much do I have to improve the speed of multiplication to get 5x overall?
  Based on Amdahl's Law
    $\text{Execution Time}_{\text{improved}} = \frac{\text{Execution Time}_{\text{affected}}}{\text{Amount of Improvement}} + \text{Execution Time}_{\text{unaffected}}$
    $\text{Execution Time}_{\text{improved}} = \frac{80}{n} + (100 - 80)$
    Five times faster means 100s / 5 = 20s:
    $20 = \frac{80}{n} + (100 - 80)$
    This requires 80/n = 0, which no finite n satisfies. Can't be done!

Fallacies
  Commonly held misconceptions
  Counter-examples can be effective when discussing fallacies
  Example
    Case 1: Computers at low utilization use little power
      Utilization of servers in Google's warehouse
        10~50% most of the time
        100% less than 1% of the time
      Does 10% of the workload mean 10% of the peak power? No!
      SPEC power benchmark: 33% of the peak power at 10% of the load
    Case 2: Designing for performance and designing for energy efficiency are unrelated goals
      Hardware/software optimization itself takes more energy, but
      the results may lead to energy reduction because of the reduced execution time

An example of optimization

[Note] SPEC CPU Benchmark
  Programs used to measure performance
    Supposedly typical of actual workload
    Workload: the set of programs run on a computer that is either the actual collection of applications run by a user or constructed from real programs to approximate such a mix
  [Remind] Common case fast
    To find the bottleneck and which case is common, benchmarks play a critical role in computer architecture
  Standard Performance Evaluation Corp (SPEC)
    Develops benchmarks for CPU, I/O, Web, …
    Supported by a number of computer vendors
  SPEC CPU 2006
    12 integer benchmarks (CINT2006) and 17 floating-point benchmarks (CFP2006)

[Note] SPEC CPU Benchmark (cont'd)
  [Table: CINT2006 for Intel Core i7]

Performance Summary
  $\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$
  Performance depends on
    Algorithm: affects IC, possibly CPI
      e.g., if the algorithm uses more divides, it will tend to have a higher CPI
    Programming language: affects IC, CPI
      e.g., a language with heavy support for data abstraction (e.g., Java) will require indirect calls, which in turn will use higher-CPI instructions
    Compiler: affects IC, CPI
      e.g., compilers use several optimization techniques and affect the CPI in complex ways, as we saw in the CPI example with two code sequences
    Instruction set architecture: affects IC, CPI, clock rate
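
The formulas in this lecture can be checked with a few lines of Python. The following is a minimal sketch, not part of the original slides: the function names and the instruction count of 1,000,000 are assumptions chosen for illustration, and the per-class CPI values (A = 1, B = 2, C = 3) are read off the arithmetic in the CPI example. The other numbers come from the CPI and Amdahl's Law examples.

```python
# Sketch of the lecture's performance formulas (all names are illustrative).

def cpu_time(instruction_count, cpi, cycle_time):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time


def average_cpi(class_cpi, class_counts):
    """Weighted average CPI = total clock cycles / total instruction count."""
    total_cycles = sum(class_cpi[name] * count for name, count in class_counts.items())
    total_instructions = sum(class_counts.values())
    return total_cycles / total_instructions


def improved_time(affected, unaffected, improvement):
    """Amdahl's Law: only the affected part of the run shrinks."""
    return affected / improvement + unaffected


# CPI example: same ISA, so both machines execute the same instruction count.
# The count is an assumed placeholder; it cancels out of the ratio.
INSTRUCTIONS = 1_000_000
time_a_ps = cpu_time(INSTRUCTIONS, 2.0, 250)  # cycle time in picoseconds
time_b_ps = cpu_time(INSTRUCTIONS, 1.2, 500)
print(f"A is {time_b_ps / time_a_ps:.2f}x faster than B")

# Two compiled code sequences over instruction classes A, B, C.
CLASS_CPI = {"A": 1, "B": 2, "C": 3}
SEQUENCE_1 = {"A": 2, "B": 1, "C": 2}
SEQUENCE_2 = {"A": 4, "B": 1, "C": 1}
for label, sequence in (("Sequence 1", SEQUENCE_1), ("Sequence 2", SEQUENCE_2)):
    print(label, "average CPI:", average_cpi(CLASS_CPI, sequence))

# Amdahl: 80 s of multiplies in a 100 s run; a 5x speedup would need 20 s total.
for n in (2, 10, 1000):
    print(f"multiply {n}x faster -> {improved_time(80, 20, n):.2f} s")
```

However large the multiply speedup becomes, the total stays above the 20 s that is unaffected, which is why the 5x target cannot be reached.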