lec3-4471029-computer-abstraction-and-technologies(performance).pdf


Full Transcript

Computer Abstractions and Technology: Performance
471029: Introduction to Computer Architecture, 3rd Lecture
Disclaimer: Slides are mainly based on the COD 5th edition textbook and were also developed in part by Profs. Dohyung Kim @ KNU and the computer architecture courses @ KAIST and SKKU

Understanding Performance
• Algorithm
  – Determines the number of operations executed
• Programming language, compiler, architecture
  – Determine the number of machine instructions executed per operation
• Processor and memory system
  – Determine how fast instructions are executed
• I/O system (including OS)
  – Determines how fast I/O operations are executed

Defining Performance
• Which airplane has the best performance?
[Figure: four bar charts comparing the Boeing 777, Boeing 747, BAC/Sud Concorde, and Douglas DC-8-50 by passenger capacity, cruising range (miles), cruising speed (mph), and passengers × mph]

Execution Time and Throughput
• Execution time
  – How long it takes to do a task
• Throughput
  – Total work done per unit time
  – E.g., tasks / transactions / … per hour
• How are execution time and throughput affected by
  – Replacing the processor with a faster version?
  – Adding more processors?
• We'll focus on execution time for now…

Relative Performance
• Define Performance = 1 / Execution Time
• "X is n times faster than Y" means

$$ \frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n $$

• Example: time taken to run a program
  – If it takes 10 s on A and 15 s on B, how much faster is A than B?
  – Performance_A / Performance_B = Execution Time_B / Execution Time_A = 15 s / 10 s = 1.5
  – So A achieves 1.5 times higher performance than B

Measuring Execution Time
• Elapsed time
  – Total execution time, including all aspects: processing, I/O, OS overhead, idle time
  – Determines system performance
• CPU time
  – Time spent processing a given job in a shared system
  – Discounts I/O time and other jobs' shares
  – Comprises user CPU time and system CPU time
• Different programs are affected differently by CPU and system performance
  – E.g., batch programs vs. interactive programs

CPU Clocking
• Operation of digital hardware is governed by a constant-rate clock
[Figure: clock timing diagram showing the clock period, clock cycles, data transfer and computation, and state update]
• Clock period: duration of a clock cycle
  – E.g., 250 ps = 0.25 ns = $250 \times 10^{-12}$ s
• Clock frequency (rate): cycles per second
  – E.g., 4.0 GHz = 4000 MHz = 4,000,000 kHz = $4.0 \times 10^{9}$ Hz
• Time units
  – Millisecond (ms): $10^{-3}$ s
  – Microsecond (µs): $10^{-6}$ s
  – Nanosecond (ns): $10^{-9}$ s
  – Picosecond (ps): $10^{-12}$ s

[Aside] Scaling Governor in Linux
• /sys/devices/system/cpu/cpuN/cpufreq
  – scaling_max_freq
  – scaling_cur_freq
  – scaling_min_freq
  – scaling_governor
• Documentation/admin-guide/pm/cpufreq.rst

CPU Time
• CPU execution time (CPU time)
  – Time the CPU spends working on a task
  – Does not include time waiting for I/O or running other programs
• CPU time can be improved (decreased) by
  – Reducing the number of clock cycles
  – Increasing the clock rate
  – Hardware designers often trade off clock rate against cycle count

$$ \text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}} $$

CPU Time Example
• Computer A: 2 GHz clock, 10 s CPU time
• Designing Computer B
  – Aim for 6 s CPU time
  – Can use a faster clock, but doing so causes 1.2× as many clock cycles
• How fast must Computer B's clock be?
• Frequency units: kHz = $10^{3}$ Hz, MHz = $10^{6}$ Hz, GHz = $10^{9}$ Hz
• Answer
  – Computer A: 10 s = cycles_A × interval_A = C / rate_A, so C = 10 s × rate_A = 10 s × 2 GHz = 20 G cycles
  – Computer B: 6 s = 1.2C × interval_B = 24 G / rate_B, so rate_B = 24 G / 6 s = 4 GHz

Instruction Performance
• The number of instructions executed is called the instruction count (IC)
  – Determined by program, ISA, and compiler
• Clock cycles per instruction is referred to as CPI
  – If the number of instructions in a program is the same (and the clock rate is fixed), the CPU time is determined by the value of CPI
  – Determined by CPU hardware
  – Different instructions have different CPI

$$ \text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}} $$

CPI Example
• Computer A: Cycle Time = 250 ps, CPI = 2.0
• Computer B: Cycle Time = 500 ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?
• Answer (I is the instruction count, identical on both machines because the ISA is the same)
  – CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = I × 2.0 × 250 ps = I × 500 ps, so A is faster…
  – CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = I × 1.2 × 500 ps = I × 600 ps
  – CPU Time_B / CPU Time_A = (I × 600 ps) / (I × 500 ps) = 1.2, … by this much

CPI in More Detail
• If different instruction classes take different numbers of cycles

$$ \text{Clock Cycles} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \text{Instruction Count}_i \right) $$

• Weighted average CPI

$$ \text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right) $$

  – The ratio Instruction Count_i / Instruction Count is the relative frequency of class i

CPI Example
• Let's assume the hardware designers have supplied the following facts
[Table: CPI of each instruction class and the instruction counts of two code sequences. The values used in the calculation on the next slide are CPI = 1, 2, 3 for classes A, B, C, with instruction counts 2, 1, 2 for sequence 1 and 4, 1, 1 for sequence 2]
• Alternative compiled code sequences using instructions in classes A, B, C
• If you are a compiler writer (designer), which code sequence would you adopt in your compiler?
  – Which code sequence executes the most instructions?
  – Which will be faster?
  – What is the CPI for each sequence?

CPI Example (Cont'd)
• Sequence 1: IC = 5
  – Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  – Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 6 (the most instructions)
  – Clock Cycles = 4×1 + 1×2 + 1×3 = 9 (faster)
  – Avg. CPI = 9/6 = 1.5 (better CPI)

Pitfall – Amdahl's Law
• One case of the pitfall
  – Improving an aspect of a computer and expecting a proportional improvement in overall performance
• Amdahl's Law
  – "The performance enhancement possible with a given improvement is limited by the amount that the improved feature is used"
• Example
  – Assume multiply operations account for 80 s of a 100 s execution time
  – How much must the speed of multiplication improve to run 5× faster overall?
• Based on Amdahl's Law

$$ \text{Execution Time}_{\text{improved}} = \frac{\text{Execution Time}_{\text{affected}}}{\text{Amount of Improvement}} + \text{Execution Time}_{\text{unaffected}} $$

$$ \text{Execution Time}_{\text{improved}} = \frac{80}{n} + (100 - 80) $$

  – Five times faster means an improved time of 100 s / 5 = 20 s:

$$ 20 = \frac{80}{n} + (100 - 80) $$

  – This requires 80/n = 0, which no finite n satisfies. Can't be done!

Fallacies
• Commonly held misconceptions
• Counter-examples can be effective when discussing fallacies
• Example
  – Case 1: Computers at low utilization use little power
    · Utilization of servers in Google's warehouse: 10–50% most of the time, 100% less than 1% of the time
    · 10% of the workload → 10% of the peak power? No!!
    · SPEC power benchmark: 33% of the peak power at 10% of the load
  – Case 2: Designing for performance and designing for energy efficiency are unrelated goals
    · Hardware/software optimization itself takes more energy, but
    · The results may lead to energy reduction because of the reduced execution time

An example of optimization

[Note] SPEC CPU Benchmark
• Programs used to measure performance
  – Supposedly typical of the actual workload
  – Workload: the set of programs run on a computer, which is either the actual collection of applications run by a user or constructed from real programs to approximate such a mix
• [Remind] Make the common case fast
  – To find the bottleneck and determine which case is common, benchmarks play a critical role in computer architecture
• Standard Performance Evaluation Corp (SPEC)
  – Supported by a number of computer vendors
  – Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU 2006
  – 12 integer workloads (CINT2006) and 17 floating-point workloads (CFP2006)

[Note] SPEC CPU Benchmark (cont'd)
• CINT2006 for Intel Core i7

Performance Summary

$$ \text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} $$

• Performance depends on
  – Algorithm: affects IC, possibly CPI
    · e.g., if the algorithm uses more divides, it will tend to have a higher CPI
  – Programming language: affects IC, CPI
    · e.g., a language with heavy support for data abstraction (e.g., Java) will require indirect calls, which in turn will use higher-CPI instructions
  – Compiler: affects IC, CPI
    · e.g., compilers use several optimization techniques and affect the CPI in complex ways, as seen in the previous example
  – Instruction set architecture: affects IC, CPI, clock rate
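
The formulas in this lecture can be checked numerically. The following is a minimal Python sketch, not part of the original slides, that recomputes the deck's worked examples (relative performance, the CPU time example, the two CPI examples, and the Amdahl's Law example). All function names are hypothetical and chosen only for illustration.

```python
import math


def relative_performance(time_x, time_y):
    """How many times faster X is than Y: Execution time_Y / Execution time_X."""
    return time_y / time_x


def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU Time = Instruction Count x CPI / Clock Rate, in seconds."""
    return instruction_count * cpi / clock_rate_hz


def average_cpi(cpi_per_class, count_per_class):
    """Weighted average CPI = total clock cycles / total instruction count."""
    cycles = sum(cpi * n for cpi, n in zip(cpi_per_class, count_per_class))
    return cycles / sum(count_per_class)


def amdahl_time(total, affected, speedup):
    """Improved time = affected time / speedup + unaffected time."""
    return affected / speedup + (total - affected)


def required_speedup(total, affected, overall_speedup):
    """Speedup needed on the affected part to reach an overall speedup.

    Returns None when no finite speedup is enough (the Amdahl's Law pitfall).
    """
    target_time = total / overall_speedup
    remaining = target_time - (total - affected)  # time budget left for the affected part
    if remaining <= 0:
        return None
    return affected / remaining


if __name__ == "__main__":
    # Relative performance: 10 s on A, 15 s on B.
    print("A vs. B:", relative_performance(10, 15))

    # CPU time example: A runs 10 s at 2 GHz; B needs 1.2x the cycles in 6 s.
    cycles_a = 10 * 2e9
    print("Clock rate of B (GHz):", 1.2 * cycles_a / 6 / 1e9)

    # CPI example: same instruction count I (any value works), different CPI and cycle time.
    i = 1_000_000
    time_a = cpu_time(i, 2.0, 1 / 250e-12)
    time_b = cpu_time(i, 1.2, 1 / 500e-12)
    print("A is faster than B by:", time_b / time_a)

    # CPI per code sequence: class CPIs 1, 2, 3.
    print("Sequence 1 CPI:", average_cpi([1, 2, 3], [2, 1, 2]))
    print("Sequence 2 CPI:", average_cpi([1, 2, 3], [4, 1, 1]))

    # Amdahl's Law: multiply takes 80 s of 100 s; is 5x overall reachable?
    print("Speedup needed for 5x overall:", required_speedup(100, 80, 5))
```

Returning None from required_speedup is one way to represent the "Can't be done!" case: the unaffected 20 s alone already equals the 20 s target, so no time budget remains for the multiplications.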
