Chapter 2 Performance Issues PDF
Summary
This document discusses computer performance issues, covering topics such as designing for performance, microprocessor speed, and performance balance. It explores improvements in chip organization and architecture, including multicore processors, MICs, and GPUs, and introduces Amdahl's Law, Little's Law, the arithmetic, harmonic, and geometric means, and the SPEC benchmark suites.
+ Chapter 2 Performance Issues

+ Designing for Performance
◼ The cost of computer systems continues to drop dramatically, while the performance and capacity of those systems continue to rise equally dramatically
◼ Today’s laptops have the computing power of an IBM mainframe from 10 or 15 years ago
◼ Processors are so inexpensive that we now have microprocessors we throw away
◼ Desktop applications that require the great power of today’s microprocessor-based systems include:
  ◼ Image processing
  ◼ Three-dimensional rendering
  ◼ Speech recognition
  ◼ Videoconferencing
  ◼ Multimedia authoring
  ◼ Voice and video annotation of files
  ◼ Simulation modeling
◼ Businesses are relying on increasingly powerful servers to handle transaction and database processing and to support massive client/server networks that have replaced the huge mainframe computer centers of yesteryear
◼ Cloud service providers use massive high-performance banks of servers to satisfy high-volume, high-transaction-rate applications for a broad spectrum of clients

+ Microprocessor Speed
Techniques built into contemporary processors include:
◼ Pipelining – the processor moves data or instructions into a conceptual pipe, with all stages of the pipe processing simultaneously
◼ Branch prediction – the processor looks ahead in the instruction code fetched from memory and predicts which branches, or groups of instructions, are likely to be processed next
◼ Superscalar execution – the ability to issue more than one instruction in every processor clock cycle (in effect, multiple parallel pipelines are used)
◼ Data flow analysis – the processor analyzes which instructions are dependent on each other’s results, or data, to create an optimized schedule of instructions
◼ Speculative execution – using branch prediction and data flow analysis, some processors speculatively execute instructions ahead of their actual appearance in the program execution, holding the results in temporary locations and keeping execution engines as busy as possible

+ Performance Balance
◼ Adjust the organization and architecture to compensate for the mismatch among the capabilities of the various components
◼ Architectural examples include:
  ◼ Increase the number of bits that are retrieved at one time by making DRAMs “wider” rather than “deeper” and by using wide bus data paths
  ◼ Reduce the frequency of memory access by incorporating increasingly complex and efficient cache structures between the processor and main memory
  ◼ Change the DRAM interface to make it more efficient by including a cache or other buffering scheme on the DRAM chip
  ◼ Increase the interconnect bandwidth between processors and memory by using higher-speed buses and a hierarchy of buses to buffer and structure data flow

+ Improvements in Chip Organization and Architecture
◼ Increase hardware speed of the processor
  ◼ Fundamentally due to shrinking logic gate size
  ◼ More gates, packed more tightly, increasing clock rate
  ◼ Propagation time for signals reduced
◼ Increase size and speed of caches
  ◼ Dedicating part of the processor chip
  ◼ Cache access times drop significantly
◼ Change processor organization and architecture
  ◼ Increase effective speed of instruction execution
  ◼ Parallelism

+ Problems with Clock Speed and Logic Density
◼ Power
  ◼ Power density increases with density of logic and clock speed
  ◼ Dissipating heat becomes difficult
◼ RC delay
  ◼ Speed at which electrons flow is limited by the resistance and capacitance of the metal wires connecting them
  ◼ Delay increases as the RC product increases
  ◼ As components on the chip decrease in size, the wire interconnects become thinner, increasing resistance
  ◼ Also, the wires are closer together, increasing capacitance
◼ Memory latency
  ◼ Memory speeds lag processor speeds

+ Multicore
◼ The use of multiple processors on the same chip provides the potential to increase performance without increasing the clock rate
◼ Strategy is to use two simpler processors on the chip rather than one more complex processor
◼ With two processors, larger caches are justified
◼ As caches became larger, it made performance sense to create two and then three levels of cache on a chip

+ Many Integrated Core (MIC) and Graphics Processing Unit (GPU)
◼ MIC
  ◼ Leap in performance, as well as challenges in developing software to exploit such a large number of cores
  ◼ The multicore and MIC strategy involves a homogeneous collection of general-purpose processors on a single chip
◼ GPU
  ◼ Core designed to perform parallel operations on graphics data
  ◼ Traditionally found on a plug-in graphics card, it is used to encode and render 2D and 3D graphics as well as process video
  ◼ Used as vector processors for a variety of applications that require repetitive computations

+ Amdahl’s Law
◼ Gene Amdahl
◼ Deals with the potential speedup of a program using multiple processors compared to a single processor
◼ Illustrates the problems facing industry in the development of multicore machines
  ◼ Software must be adapted to a highly parallel execution environment to exploit the power of parallel processing
◼ Can be generalized to evaluate and design technical improvement in a computer system
[Figure 2.4 Amdahl’s Law for Multiprocessors – speedup versus number of processors for parallel fractions f = 0.95, 0.90, 0.75, and 0.5]
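The curves in Figure 2.4 follow from the usual statement of Amdahl’s Law: if a fraction f of a program’s execution time can be spread across N processors and the remaining (1 − f) stays serial, the speedup is 1 / ((1 − f) + f/N). Below is a minimal sketch (illustrative code, not from the slides) that reproduces the trend of the figure.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Speedup on n processors when a fraction f of execution time is
    parallelizable and the rest remains serial (Amdahl's Law)."""
    return 1.0 / ((1.0 - f) + f / n)

# Reproduce the trend of Figure 2.4 for the same parallel fractions
for f in (0.95, 0.90, 0.75, 0.5):
    print(f"f = {f:<4} ->",
          [round(amdahl_speedup(f, n), 1) for n in (1, 2, 8, 64, 1024)])
```

As N grows, the speedup is bounded above by 1 / (1 − f); even with f = 0.95 it can never exceed 20, which is the saturation the figure illustrates and the reason software must expose a very large parallel fraction to profit from many cores.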
+ Little’s Law
◼ Fundamental and simple relation with broad applications
◼ Can be applied to almost any system that is statistically in steady state and in which there is no leakage
◼ Queuing system
  ◼ If the server is idle, an arriving item is served immediately; otherwise an arriving item joins a queue
  ◼ There can be a single queue for a single server or for multiple servers, or multiple queues with one for each of multiple servers
◼ The average number of items in a queuing system equals the average rate at which items arrive multiplied by the time that an item spends in the system
◼ The relationship requires very few assumptions
◼ Because of its simplicity and generality it is extremely useful
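Written as a formula, the statement above is L = λW, where λ is the average arrival rate and W the average time an item spends in the system. A minimal sketch (not from the slides; the numbers are made up for illustration):

```python
def littles_law_items(arrival_rate: float, time_in_system: float) -> float:
    """Average number of items in the system: L = lambda * W."""
    return arrival_rate * time_in_system

# Example: requests arrive at 200 per second and each spends 0.05 s in the
# system (queueing plus service), so on average 10 requests are in the
# system at any instant.
print(littles_law_items(arrival_rate=200.0, time_in_system=0.05))  # 10.0
```

The relation can be rearranged to estimate any one of the three quantities from the other two, for example W = L / λ when the queue length and arrival rate are the measured values.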
[Figure 2.5 System Clock – a quartz crystal generates an analog waveform that is converted to a digital clock signal; from Computer Desktop Encyclopedia, 1998, The Computer Language Co.]

+ Calculating the Mean
◼ The use of benchmarks to compare systems involves calculating the mean value of a set of data points related to execution time
◼ The three common formulas used for calculating a mean are:
  ◼ Arithmetic
  ◼ Geometric
  ◼ Harmonic

+ Arithmetic Mean
◼ An arithmetic mean (AM) is an appropriate measure if the sum of all the measurements is a meaningful and interesting value
◼ The AM is a good candidate for comparing the execution time performance of several systems
◼ For example, suppose we were interested in using a system for large-scale simulation studies and wanted to evaluate several alternative products. On each system we could run the simulation multiple times with different input values for each run, and then take the average execution time across all runs. The use of multiple runs with different inputs should ensure that the results are not heavily biased by some unusual feature of a given input set. The AM of all the runs is a good measure of the system’s performance on simulations, and a good number to use for system comparison.
◼ The AM used for a time-based variable, such as program execution time, has the important property that it is directly proportional to the total time
  ◼ If the total time doubles, the mean value doubles

Table 2.2 A Comparison of Arithmetic and Harmonic Means for Rates

|  | Computer A time (secs) | Computer B time (secs) | Computer C time (secs) | Computer A rate (MFLOPS) | Computer B rate (MFLOPS) | Computer C rate (MFLOPS) |
| --- | --- | --- | --- | --- | --- | --- |
| Program 1 (10^8 FP ops) | 2.0 | 1.0 | 0.75 | 50 | 100 | 133.33 |
| Program 2 (10^8 FP ops) | 0.75 | 2.0 | 4.0 | 133.33 | 50 | 25 |
| Total execution time | 2.75 | 3.0 | 4.75 | – | – | – |
| Arithmetic mean of times | 1.38 | 1.5 | 2.38 | – | – | – |
| Inverse of total execution time (1/sec) | 0.36 | 0.33 | 0.21 | – | – | – |
| Arithmetic mean of rates | – | – | – | 91.67 | 75.00 | 79.17 |
| Harmonic mean of rates | – | – | – | 72.72 | 66.67 | 42.11 |

Table 2.3 A Comparison of Arithmetic and Geometric Means for Normalized Results

(a) Results normalized to Computer A

|  | Computer A time | Computer B time | Computer C time |
| --- | --- | --- | --- |
| Program 1 | 2.0 (1.0) | 1.0 (0.5) | 0.75 (0.38) |
| Program 2 | 0.75 (1.0) | 2.0 (2.67) | 4.0 (5.33) |
| Total execution time | 2.75 | 3.0 | 4.75 |
| Arithmetic mean of normalized times | 1.00 | 1.58 | 2.85 |
| Geometric mean of normalized times | 1.00 | 1.15 | 1.41 |

(b) Results normalized to Computer B

|  | Computer A time | Computer B time | Computer C time |
| --- | --- | --- | --- |
| Program 1 | 2.0 (2.0) | 1.0 (1.0) | 0.75 (0.75) |
| Program 2 | 0.75 (0.38) | 2.0 (1.0) | 4.0 (2.0) |
| Total execution time | 2.75 | 3.0 | 4.75 |
| Arithmetic mean of normalized times | 1.19 | 1.00 | 1.38 |
| Geometric mean of normalized times | 0.87 | 1.00 | 1.22 |

Table 2.4 Another Comparison of Arithmetic and Geometric Means for Normalized Results

(a) Results normalized to Computer A

|  | Computer A time | Computer B time | Computer C time |
| --- | --- | --- | --- |
| Program 1 | 2.0 (1.0) | 1.0 (0.5) | 0.20 (0.1) |
| Program 2 | 0.4 (1.0) | 2.0 (5.0) | 4.0 (10) |
| Total execution time | 2.4 | 3.00 | 4.2 |
| Arithmetic mean of normalized times | 1.00 | 2.75 | 5.05 |
| Geometric mean of normalized times | 1.00 | 1.58 | 1.00 |

(b) Results normalized to Computer B

|  | Computer A time | Computer B time | Computer C time |
| --- | --- | --- | --- |
| Program 1 | 2.0 (2.0) | 1.0 (1.0) | 0.20 (0.2) |
| Program 2 | 0.4 (0.2) | 2.0 (1.0) | 4.0 (2) |
| Total execution time | 2.4 | 3.00 | 4.2 |
| Arithmetic mean of normalized times | 1.10 | 1.00 | 1.10 |
| Geometric mean of normalized times | 0.63 | 1.00 | 0.63 |
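The summary rows of the tables above can be recomputed with a few lines of code. This is a minimal sketch (not from the slides) contrasting the three means on the data of Tables 2.2 and 2.4; the function names are illustrative.

```python
from math import prod

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

def geometric_mean(xs):
    return prod(xs) ** (1.0 / len(xs))

# MFLOPS rates from Table 2.2 (Programs 1 and 2 on Computers A, B, C)
rates = {"A": [50.0, 133.33], "B": [100.0, 50.0], "C": [133.33, 25.0]}
for name, r in rates.items():
    print(name, round(arithmetic_mean(r), 2), round(harmonic_mean(r), 2))
# The harmonic means match the "Harmonic mean of rates" row of Table 2.2
# (within rounding) and rank the machines in the same order as total
# execution time; the arithmetic mean of rates does not.

# Execution times from Table 2.4, normalized first to Computer A, then to B
times = {"A": [2.0, 0.4], "B": [1.0, 2.0], "C": [0.20, 4.0]}
for ref in ("A", "B"):
    norm = {m: [t / tr for t, tr in zip(ts, times[ref])]
            for m, ts in times.items()}
    print("normalized to", ref,
          {m: round(geometric_mean(v), 2) for m, v in norm.items()})
# The geometric-mean results are consistent regardless of which machine is
# used as the reference, which is why SPEC reports a geometric mean of ratios.
```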
+ Benchmark Principles
◼ Desirable characteristics of a benchmark program:
  1. It is written in a high-level language, making it portable across different machines
  2. It is representative of a particular kind of programming domain or paradigm, such as systems programming, numerical programming, or commercial programming
  3. It can be measured easily
  4. It has wide distribution

+ System Performance Evaluation Corporation (SPEC)
◼ Benchmark suite
  ◼ A collection of programs, defined in a high-level language
  ◼ Together they attempt to provide a representative test of a computer in a particular application or system programming area
◼ SPEC
  ◼ An industry consortium
  ◼ Defines and maintains the best-known collection of benchmark suites aimed at evaluating computer systems
  ◼ Performance measurements are widely used for comparison and research purposes

+ SPEC CPU2006
◼ Best-known SPEC benchmark suite
◼ Industry-standard suite for processor-intensive applications
◼ Appropriate for measuring performance for applications that spend most of their time doing computation rather than I/O
◼ Consists of 17 floating-point programs written in C, C++, and Fortran and 12 integer programs written in C and C++
◼ Suite contains over 3 million lines of code
◼ Fifth generation of processor-intensive suites from SPEC

Table 2.5 SPEC CPU2006 Integer Benchmarks

| Benchmark | Reference time (hours) | Instr count (billion) | Language | Application Area | Brief Description |
| --- | --- | --- | --- | --- | --- |
| 400.perlbench | 2.71 | 2,378 | C | Programming Language | PERL programming language interpreter, applied to a set of three programs. |
| 401.bzip2 | 2.68 | 2,472 | C | Compression | General-purpose data compression with most work done in memory, rather than doing I/O. |
| 403.gcc | 2.24 | 1,064 | C | C Compiler | Based on gcc Version 3.2, generates code for Opteron. |
| 429.mcf | 2.53 | 327 | C | Combinatorial Optimization | Vehicle scheduling algorithm. |
| 445.gobmk | 2.91 | 1,603 | C | Artificial Intelligence | Plays the game of Go, a simply described but deeply complex game. |
| 456.hmmer | 2.59 | 3,363 | C | Search Gene Sequence | Protein sequence analysis using profile hidden Markov models. |
| 458.sjeng | 3.36 | 2,383 | C | Artificial Intelligence | A highly ranked chess program that also plays several chess variants. |
| 462.libquantum | 5.76 | 3,555 | C | Physics / Quantum Computing | Simulates a quantum computer, running Shor's polynomial-time factorization algorithm. |
| 464.h264ref | 6.15 | 3,731 | C | Video Compression | H.264/AVC (Advanced Video Coding) video compression. |
| 471.omnetpp | 1.74 | 687 | C++ | Discrete Event Simulation | Uses the OMNet++ discrete event simulator to model a large Ethernet campus network. |
| 473.astar | 1.95 | 1,200 | C++ | Path-finding Algorithms | Pathfinding library for 2D maps. |
| 483.xalancbmk | 1.92 | 1,184 | C++ | XML Processing | A modified version of Xalan-C++, which transforms XML documents to other document types. |

Table 2.6 SPEC CPU2006 Floating-Point Benchmarks

| Benchmark | Reference time (hours) | Instr count (billion) | Language | Application Area | Brief Description |
| --- | --- | --- | --- | --- | --- |
| 410.bwaves | 3.78 | 1,176 | Fortran | Fluid Dynamics | Computes 3D transonic transient laminar viscous flow. |
| 416.gamess | 5.44 | 5,189 | Fortran | Quantum Chemistry | Quantum chemical computations. |
| 433.milc | 2.55 | 937 | C | Physics / Quantum Chromodynamics | Simulates behavior of quarks and gluons. |
| 434.zeusmp | 2.53 | 1,566 | Fortran | Physics / CFD | Computational fluid dynamics simulation of astrophysical phenomena. |
| 435.gromacs | 1.98 | 1,958 | C, Fortran | Biochemistry / Molecular Dynamics | Simulates Newtonian equations of motion for hundreds to millions of particles. |
| 436.cactusADM | 3.32 | 1,376 | C, Fortran | Physics / General Relativity | Solves the Einstein evolution equations. |
| 437.leslie3d | 2.61 | 1,273 | Fortran | Fluid Dynamics | Models fuel injection flows. |
| 444.namd | 2.23 | 2,483 | C++ | Biology / Molecular Dynamics | Simulates large biomolecular systems. |
| 447.dealII | 3.18 | 2,323 | C++ | Finite Element Analysis | Program library targeted at adaptive finite elements and error estimation. |
| 450.soplex | 2.32 | 703 | C++ | Linear Programming, Optimization | Test cases include railroad planning and military airlift models. |
| 453.povray | 1.48 | 940 | C++ | Image Ray-tracing | 3D image rendering. |
| 454.calculix | 2.29 | 3,04` | C, Fortran | Structural Mechanics | Finite element code for linear and nonlinear 3D structural applications. |
| 459.GemsFDTD | 2.95 | 1,320 | Fortran | Computational Electromagnetics | Solves the Maxwell equations in 3D. |
| 465.tonto | 2.73 | 2,392 | Fortran | Quantum Chemistry | Quantum chemistry package, adapted for crystallographic tasks. |
| 470.lbm | 3.82 | 1,500 | C | Fluid Dynamics | Simulates incompressible fluids in 3D. |
| 481.wrf | 3.10 | 1,684 | C, Fortran | Weather | Weather forecasting model. |
| 482.sphinx3 | 5.41 | 2,472 | C | Speech Recognition | Speech recognition software. |

+ Terms Used in SPEC Documentation
◼ Benchmark
  ◼ A program written in a high-level language that can be compiled and executed on any computer that implements the compiler
◼ System under test
  ◼ This is the system to be evaluated
◼ Reference machine
  ◼ This is a system used by SPEC to establish a baseline performance for all benchmarks
  ◼ Each benchmark is run and measured on this machine to establish a reference time for that benchmark
◼ Base metric
  ◼ These are required for all reported results and have strict guidelines for compilation
◼ Peak metric
  ◼ This enables users to attempt to optimize system performance by optimizing the compiler output
◼ Speed metric
  ◼ This is simply a measurement of the time it takes to execute a compiled benchmark
  ◼ Used for comparing the ability of a computer to complete single tasks
◼ Rate metric
  ◼ This is a measurement of how many tasks a computer can accomplish in a certain amount of time
  ◼ This is called a throughput, capacity, or rate measure
  ◼ Allows the system under test to execute simultaneous tasks to take advantage of multiple processors

[Figure 2.7 SPEC Evaluation Flowchart: Start → Get next program → Run program three times → Select median value → Ratio(prog) = Tref(prog)/TSUT(prog) → if more programs remain, repeat; otherwise compute the geometric mean of all ratios → End]
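A minimal sketch (not from the slides) of the procedure in Figure 2.7: for each benchmark, the median of three runs on the system under test is compared against the SPEC reference time, and the overall metric is the geometric mean of the per-benchmark ratios. The benchmark names and run times below are hypothetical placeholders, not SPEC results.

```python
from math import prod
from statistics import median

def spec_ratio(ref_time: float, sut_times: list[float]) -> float:
    """Ratio(prog) = Tref(prog) / TSUT(prog), using the median of three runs."""
    return ref_time / median(sut_times)

def spec_metric(results: dict[str, tuple[float, list[float]]]) -> float:
    """Overall metric: geometric mean of the per-benchmark ratios."""
    ratios = [spec_ratio(ref, runs) for ref, runs in results.values()]
    return prod(ratios) ** (1.0 / len(ratios))

# Hypothetical data: benchmark -> (reference time, three run times), in seconds
results = {
    "prog_a": (9000.0, [410.0, 405.0, 420.0]),
    "prog_b": (6000.0, [310.0, 300.0, 305.0]),
}
print(round(spec_metric(results), 2))
```

Because each ratio is reference time divided by measured time, a higher value means the system under test completed the benchmark faster relative to the reference machine.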
+ Summary – Chapter 2: Performance Issues
◼ Designing for performance
◼ Microprocessor speed
◼ Performance balance
◼ Improvements in chip organization and architecture
  ◼ Multicore
  ◼ MICs
  ◼ GPGPUs
◼ Amdahl’s Law
◼ Little’s Law
◼ Basic measures of computer performance
  ◼ Clock speed
  ◼ Instruction execution rate
◼ Calculating the mean
  ◼ Arithmetic mean
  ◼ Harmonic mean
  ◼ Geometric mean
◼ Benchmark principles
◼ SPEC benchmarks