Computer Abstractions and Technology : Performance (Lecture 3) - KNU

Summary

These lecture notes cover performance, part of the "Computer Abstractions and Technology" topic in an introductory computer architecture course. The lecture discusses the factors that affect performance, from algorithms and programming languages to the processor and I/O system. The notes also include worked examples (CPU time, CPI, Amdahl's Law) and a discussion of the SPEC benchmarks.

Full Transcript


Computer Abstractions and Technology: Performance
471029: Introduction to Computer Architecture, 3rd Lecture
Disclaimer: Slides are mainly based on the COD 5th edition textbook and were also developed in part by Profs. Dohyung Kim @ KNU and the computer architecture courses @ KAIST and SKKU.

Understanding Performance
- Algorithm
  - Determines number of operations executed
- Programming language, compiler, architecture
  - Determine number of machine instructions executed per operation
- Processor and memory system
  - Determine how fast instructions are executed
- I/O system (including OS)
  - Determines how fast I/O operations are executed

Defining Performance
- Which airplane has the best performance?
[Figure: four bar charts comparing the Boeing 777, Boeing 747, BAC/Sud Concorde, and Douglas DC-8-50 by passenger capacity, cruising range (miles), cruising speed (mph), and passengers x mph]

Execution Time and Throughput
- Execution time
  - How long it takes to do a task
- Throughput
  - Total work done per unit time
  - E.g., tasks / transactions / … per hour
- How are execution time and throughput affected by
  - Replacing the processor with a faster version?
  - Adding more processors?
- We'll focus on execution time for now…

Relative Performance
- Define Performance = 1 / Execution Time
- "X is n times faster than Y":
  $$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$
- Example: time taken to run a program
  - If 10s on A, 15s on B, how much faster is A than B?
  - $\text{Performance}_A / \text{Performance}_B = \text{Execution Time}_B / \text{Execution Time}_A = 15\,\text{s} / 10\,\text{s} = 1.5$
  - So A achieves 1.5 times higher performance than B

Measuring Execution Time
- Elapsed time
  - Total execution time, including all aspects: processing, I/O, OS overhead, idle time
  - Determines system performance
- CPU time
  - Time spent on a given job in a shared system
  - Discounts I/O time and other jobs' shares
  - Comprises user CPU time and system CPU time
  - Different programs are affected differently by CPU and system performance (e.g., batch program, interactive program)

CPU Clocking
- Operation of digital hardware is governed by a constant-rate clock
[Figure: clock timing diagram showing the clock period; data transfer and computation happen within each cycle, followed by a state update]
- Clock period: duration of a clock cycle
  - E.g., 250ps = 0.25ns = $250 \times 10^{-12}$ s
  - Units: millisecond (ms) = $10^{-3}$ s, microsecond (us) = $10^{-6}$ s, nanosecond (ns) = $10^{-9}$ s, picosecond (ps) = $10^{-12}$ s
- Clock frequency (rate): cycles per second
  - E.g., 4.0GHz = 4000MHz = 4000000kHz = $4.0 \times 10^{9}$ Hz

[Aside] Scaling Governor in Linux
- /sys/devices/system/cpu/cpuN/cpufreq
  - scaling_max_freq
  - scaling_cur_freq
  - scaling_min_freq
  - scaling_governor
- Documentation/admin-guide/pm/cpufreq.rst

CPU Time
- CPU execution time (CPU time)
  - Time the CPU spends working on a task
  - Does not include time waiting for I/O or running other programs
- CPU time can be improved (decreased) by
  - Reducing number of clock cycles
  - Increasing clock rate
  - Hardware designers often trade off clock rate against cycle count
$$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$$

CPU Time Example
- Computer A: 2GHz clock, 10s CPU time
- Designing Computer B
  - Aim for 6s CPU time
  - Can do faster clock, but causes 1.2 x clock cycles
- How fast must Computer B clock be?
- Unit prefixes: kHz = $10^{3}$ Hz, MHz = $10^{6}$ Hz, GHz = $10^{9}$ Hz
- Answer
  - Computer A: $10\,\text{s} = C \times \text{interval}_A = C / \text{rate}_A$, where $C$ is the number of clock cycles on A
    - $C = 10\,\text{s} \times \text{rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20\text{G}$ cycles
  - Computer B: $6\,\text{s} = 1.2C \times \text{interval}_B = 24\text{G} / \text{rate}_B$
    - $\text{rate}_B = 24\text{G} / 6\,\text{s} = 4\,\text{GHz}$

Instruction Performance
- The number of instructions is called the instruction count (IC)
  - Determined by program, ISA and compiler
- Clock cycles per instruction is referred to as CPI
  - If the number of instructions in a program is the same, the CPU time at a given clock rate is determined by the value of CPI
  - Determined by CPU hardware
  - Different instructions have different CPI
$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$

CPI Example
- Computer A: Cycle Time = 250ps, CPI = 2.0
- Computer B: Cycle Time = 500ps, CPI = 1.2
- Same ISA
- Which is faster, and by how much?
  - $\text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps}$
  - $\text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps}$
  - A is faster, by a factor of $\text{CPU Time}_B / \text{CPU Time}_A = (I \times 600\,\text{ps}) / (I \times 500\,\text{ps}) = 1.2$

CPI in More Detail
- If different instruction classes take different numbers of cycles
  $$\text{Clock Cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{Instruction Count}_i)$$
- Weighted average CPI
  $$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$$
  - $\text{Instruction Count}_i / \text{Instruction Count}$ is the relative frequency of class $i$

CPI Example
- Let's assume the hardware designers have supplied the following facts
- Alternative compiled code sequences using instructions in classes A, B, C

  Class               A   B   C
  CPI for class       1   2   3
  IC in sequence 1    2   1   2
  IC in sequence 2    4   1   1

- If you are a compiler writer (designer), which code sequence would you adopt in your compiler?
  - Which code sequence executes the most instructions?
  - Which will be faster?
  - What is the CPI for each sequence?

CPI Example (Cont'd)
- Sequence 1: IC = 5
  - Clock Cycles = 2x1 + 1x2 + 2x3 = 10
  - Avg. CPI = 10/5 = 2.0
- Sequence 2: IC = 6 (the most instructions)
  - Clock Cycles = 4x1 + 1x2 + 1x3 = 9 (faster)
  - Avg. CPI = 9/6 = 1.5 (better CPI)

Pitfall – Amdahl's Law
- One case of the pitfall
  - Improving an aspect of a computer and expecting a proportional improvement in overall performance
- Amdahl's Law:
  - "the performance enhancement possible with a given improvement is limited by the amount that the improved feature is used"
- Example
  - Let's assume multiply operations account for 80s of a 100s run
  - How much do I have to improve the speed of multiplication to get 5x overall?
- Based on Amdahl's Law
  $$\text{Execution Time}_{\text{improved}} = \frac{\text{Execution Time}_{\text{affected}}}{\text{Amount of Improvement}} + \text{Execution Time}_{\text{unaffected}}$$
  $$\text{Execution Time}_{\text{improved}} = \frac{80}{n} + (100 - 80)$$
  - Five times faster means an improved time of 20s: $20 = \frac{80}{n} + (100 - 80)$, which requires $\frac{80}{n} = 0$
  - Can't be done!

Fallacies
- Commonly held misconceptions
- Counter-examples can be effective when discussing fallacies
- Example
  - Case 1: Computers at low utilization use little power
    - Utilization of servers in Google's warehouse: 10~50% most of the time, 100% less than 1% of the time
    - Does 10% of the workload mean 10% of the peak power? No!!
    - SPEC power benchmark: 33% of the peak power at 10% of the load
  - Case 2: Designing for performance and designing for energy efficiency are unrelated goals
    - Hardware/software optimization itself takes more energy, but
    - The results may lead to energy reduction because of the reduced execution time

An example of optimization
[Figure: not recoverable from the transcript]

[Note] SPEC CPU Benchmark
- Programs used to measure performance
  - Supposedly typical of actual workload
  - Workload: the set of programs run on a computer that is either the actual collection of applications run by a user or constructed from real programs to approximate such a mix
- [Reminder] Make the common case fast
  - To find the bottleneck and determine which case is common, benchmarks play a critical role in computer architecture
- Standard Performance Evaluation Corp (SPEC)
  - Supported by a number of computer vendors
  - Develops benchmarks for CPU, I/O, Web, …
- SPEC CPU 2006
  - CINT2006: 12 integer workloads
  - CFP2006: 17 floating-point workloads

[Note] SPEC CPU Benchmark (cont'd)
- CINT2006 for Intel Core i7
[Table: not recoverable from the transcript]

Performance Summary
$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$
- Performance depends on
  - Algorithm: affects IC, possibly CPI
    - e.g., if the algorithm uses more divides, it will tend to have a higher CPI
  - Programming language: affects IC, CPI
    - e.g., a language with heavy support for data abstraction (e.g., Java) will require indirect calls, which in turn will use higher-CPI instructions
  - Compiler: affects IC, CPI
    - e.g., compilers use several optimization techniques and affect the CPI in complex ways, as we saw in the previous example
  - Instruction set architecture: affects IC, CPI, clock rate
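
The formulas in this lecture can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the helper names (`cpu_time`, `average_cpi`, `amdahl_time`) and the instruction count of 10^9 are assumptions made for illustration. It reruns the CPU time, CPI, code sequence, and Amdahl's Law examples from the lecture.

```python
# Illustrative sketch of the lecture's performance formulas.
# All function names are hypothetical helpers, not taken from the slides.


def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


def average_cpi(class_cpi, class_counts):
    """Weighted average CPI = total clock cycles / total instruction count."""
    cycles = sum(class_cpi[c] * n for c, n in class_counts.items())
    return cycles / sum(class_counts.values())


def amdahl_time(affected, unaffected, improvement):
    """Improved execution time = affected / improvement + unaffected."""
    return affected / improvement + unaffected


# CPU time example: A runs 10 s at 2 GHz; B needs 1.2x the cycles in 6 s.
cycles_a = 10 * 2e9
rate_b = 1.2 * cycles_a / 6
print(f"Computer B clock rate: {rate_b / 1e9:.1f} GHz")

# CPI example: same ISA, so both machines run the same instruction count.
ic = 1e9  # assumed value; it cancels out in the ratio
time_a = cpu_time(ic, 2.0, 1 / 250e-12)  # 250 ps cycle time
time_b = cpu_time(ic, 1.2, 1 / 500e-12)  # 500 ps cycle time
print(f"A is {time_b / time_a:.1f}x faster than B")

# Code sequence example: class CPIs inferred from the slide's arithmetic.
class_cpi = {"A": 1, "B": 2, "C": 3}
print("Sequence 1 CPI:", average_cpi(class_cpi, {"A": 2, "B": 1, "C": 2}))
print("Sequence 2 CPI:", average_cpi(class_cpi, {"A": 4, "B": 1, "C": 1}))

# Amdahl's Law: 80 s of multiplication in a 100 s run.
for n in (2, 10, 1000):
    print(f"multiply {n}x faster -> {amdahl_time(80, 20, n)} s total")
```

However large the improvement factor passed to `amdahl_time`, the result stays above the 20 s of unaffected time, which is exactly the target for a 5x overall speedup. That is why the Amdahl's Law example concludes the goal cannot be reached.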
