High-Performance Computing Lecture Notes PDF
University of Wollongong in Dubai
Summary
These lecture notes provide an outline of high-performance computing, covering microprocessor evolution, parallelism, and performance measurement. They discuss topics such as designing for performance, Amdahl's Law, Little's Law, and performance measurement techniques. The notes also include examples and applications.
Full Transcript
Outline: High-Performance Computing: Microprocessor Evolution, Parallelism, and Performance Measurement
1. Designing for performance
– Microprocessor speed and cost
– Microprocessor speed: pipelining
– Example
– Performance balance and imbalance
2. Multicore and Many Integrated Core (MIC)
– MIC rationale
– MIC strategy
– GPU
3. Amdahl's Law
– Speeding up program execution
– Sequential code and parallelizable code
– Speedup factor
– Example
4. Little's Law
– Rationale of Little's Law
– Little's Law formula
– Example
5. Basic measures of performance
– Performance as a key parameter in processor design and selection
– Main problems concerning performance measurement of a computer
– The traditional measures of performance: clock speed and instruction execution rate
6. The system clock, clock speed, and frequency
7. Instruction execution rate
8. MIPS and MFLOPS

Designing for performance
The cost of computer systems continues to drop dramatically, while the performance and capacity of those systems continue to rise equally dramatically. Today's laptops have the computing power of an IBM mainframe from 10 or 15 years ago. This continuing technological revolution has enabled the development of applications of astounding complexity and power.

Desktop applications that require the great power of today's microprocessor-based systems include:
– Image processing
– Three-dimensional rendering
– Speech recognition
– Video conferencing
– Multimedia authoring
– Voice and video annotation of files
– Simulation modeling

Today's workstation systems support highly sophisticated engineering and scientific applications and have the capacity to support image and video applications. Mainframes such as the IBM 370 and IBM 390 are examples of increasingly powerful servers; businesses rely on farms of increasingly powerful servers to handle transaction and database processing and to support massive client/server networks.

Microprocessor speed: pipelining
Several techniques are built into today's processors to increase speed; the traditional technique is pipelining. The execution of an instruction involves multiple stages of operation:
– Fetching the instruction
– Decoding the opcode
– Fetching operands
– Performing a calculation

Pipelining enables a processor to work simultaneously on multiple instructions by performing a different phase for each of the multiple instructions at the same time. The processor overlaps operations by moving data or instructions into a conceptual pipe, with all stages of the pipe processing simultaneously. For example, while one instruction is being executed, the computer is decoding the next instruction. This is the same principle as seen in an assembly line. (A minimal timing sketch follows below.)
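The notes describe pipelining only qualitatively. The following minimal Python sketch is not from the notes: it assumes an idealized k-stage pipeline with one clock cycle per stage and no stalls or hazards, and compares total cycle counts with and without pipelining (the k + n - 1 figure is the standard ideal-pipeline result).

    def cycles_without_pipeline(n_instructions: int, k_stages: int) -> int:
        # Each instruction must pass through all k stages before the
        # next instruction can begin.
        return n_instructions * k_stages

    def cycles_with_pipeline(n_instructions: int, k_stages: int) -> int:
        # The first instruction takes k cycles to fill the pipe; after
        # that, one instruction completes every cycle.
        return k_stages + (n_instructions - 1)

    # 100 instructions on a 4-stage pipe (fetch, decode, fetch operands, execute):
    print(cycles_without_pipeline(100, 4))  # 400 cycles
    print(cycles_with_pipeline(100, 4))     # 103 cycles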
Performance balance or imbalance
While processor speed has grown rapidly, the speed with which data can be transferred between main memory and the processor has lagged badly. Likewise, the speed with which data can be transferred between peripherals (I/O devices) and the processor has lagged badly.

Multicore and Many Integrated Core (MIC)
Today's trend in cores: chip manufacturers are now in the process of making a huge leap forward in the number of cores per chip, with more than 50 cores per chip. The multicore and MIC strategy involves a homogeneous collection of general-purpose processors on a single chip.

At the same time, there is also the GPU. Chip manufacturers are pursuing another design option: a chip with multiple general-purpose processors plus graphics processing units (GPUs), specialized cores for video processing and other tasks. A GPU is a core designed to perform parallel operations on graphics data. Traditionally found on a plug-in graphics card (display adapter), it is used to encode and render 2D and 3D graphics as well as to process video. Since GPUs perform parallel operations on multiple sets of data, they are increasingly being used as vector processors for a variety of applications that require repetitive computations.

Amdahl's Law (Gene Amdahl, 1967)
Amdahl's Law deals with the potential speedup of a program using multiple processors compared to a single processor. Consider a program running on a single processor such that a fraction (1 - f) of the execution time involves code that is inherently sequential, and a fraction f involves code that is infinitely parallelizable with no scheduling overhead.

Let T be the total execution time of the program using a single processor. Then the speedup using a parallel processor with N processors that fully exploits the parallel portion of the program is:

    Speedup = T / ((1 - f)T + (fT)/N) = 1 / ((1 - f) + f/N)

What is the impact of the number of parallel processors? When f is small, the use of parallel processors has little effect on the speedup. As N approaches infinity, the speedup is bound by 1/(1 - f), so there are diminishing returns for using more processors. (A small Python sketch of this formula follows.)
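The formula above is easy to tabulate. Here is a minimal Python sketch (the function name is mine, not from the notes) that evaluates it and shows the approach to the 1/(1 - f) bound:

    def amdahl_speedup(f: float, n: int) -> float:
        # Speedup when a fraction f of execution time is infinitely
        # parallelizable and (1 - f) is inherently sequential,
        # with no scheduling overhead.
        return 1.0 / ((1.0 - f) + f / n)

    # With f = 0.9, the bound is 1 / (1 - 0.9) = 10:
    for n in (1, 2, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.9, n), 3))
    # 1 1.0, 2 1.818, 10 5.263, 100 9.174, 1000 9.911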
Parallel processors and the law of diminishing returns
At a certain point, adding another processor yields a relatively smaller increase in speedup. The table below gives the speedup 1 / ((1 - f) + f/N) for N processors and several values of f, the fraction of the program that can be parallelized (representative rows; for f = 1 the speedup is simply N):

    N     | f=0.95 | f=0.9 | f=0.85 | f=0.8 | f=0.75 | f=0.7 | f=0.5 | f=0.1
    1     | 1.000  | 1.000 | 1.000  | 1.000 | 1.000  | 1.000 | 1.000 | 1.000
    2     | 1.905  | 1.818 | 1.739  | 1.667 | 1.600  | 1.538 | 1.333 | 1.053
    4     | 3.478  | 3.077 | 2.759  | 2.500 | 2.286  | 2.105 | 1.600 | 1.081
    8     | 5.926  | 4.706 | 3.902  | 3.333 | 2.909  | 2.581 | 1.778 | 1.096
    16    | 9.143  | 6.400 | 4.923  | 4.000 | 3.368  | 2.909 | 1.882 | 1.103
    100   | 16.807 | 9.174 | 6.309  | 4.808 | 3.883  | 3.257 | 1.980 | 1.110
    1000  | 19.627 | 9.911 | 6.629  | 4.980 | 3.988  | 3.326 | 1.998 | 1.111
    20000 | 19.981 | 9.996 | 6.665  | 4.999 | 3.999  | 3.333 | 2.000 | 1.111

[Figure: speedup plotted against the number of processors (up to about 20,000) for f = 1, 0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.5, and 0.1; each curve flattens toward its asymptote 1/(1 - f).]

Speedup estimation
Suppose that a feature of the system is used during execution a fraction f of the time before enhancement, and that the speedup of that feature after enhancement is SUf. Then the overall speedup of the system is:

    Overall speedup = 1 / ((1 - f) + f/SUf)

(A short numeric sketch of this formula follows.)
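A minimal Python sketch of the enhancement formula (the function name is mine; it uses f = 0.4 to anticipate the floating-point example below):

    def overall_speedup(f: float, su_f: float) -> float:
        # Overall system speedup when a feature used a fraction f of
        # the time is sped up by a factor su_f.
        return 1.0 / ((1.0 - f) + f / su_f)

    # A feature used 40% of the time, enhanced by a factor K:
    for k in (2, 4, 10, 100):
        print(k, round(overall_speedup(0.4, k), 3))
    # 2 1.25, 4 1.429, 10 1.562, 100 1.656; the cap is 1/0.6, about 1.667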
Example of speedup estimation
Assume that a given company makes extensive use of floating-point operations in its daily processing activity, with 40% of the time consumed by floating-point operations. The company is interested in acquiring a new hardware floating-point module, designed so that floating-point operations will be sped up by a factor of K. What is the overall speedup of the system? With f = 0.4 and SUf = K, the overall speedup is 1 / (0.6 + 0.4/K); as K grows without bound, it approaches 1/0.6, or about 1.67.

Little's Law
Little's Law is a fundamental and simple relation with broad applications. Very often it is applied to determine the performance of queueing systems. A queueing system consists of discrete objects, called items, which arrive at some rate to the system.

Little's Law (Rust 2008): rationale of a queueing system
The stream of arrivals enters the system, joins one or more queues, eventually receives service, and exits in a stream of departures. In most cases, service is the bottleneck that creates the queue, so we usually have a service operation with a service time.

What does Little's Law say? Given:
L = average number of items in the queueing system
W = average waiting time in the system for an item
λ = average number of items arriving per unit of time

Little's Law says that L, the average number of items in a queueing system, equals the average arrival rate λ of items to the system, multiplied by the average waiting time W of an item in the system: L = λW.

Application of Little's Law
Case at hand: your organization processes travel reimbursements and wants to know the average processing time (latency) of a reimbursement. The data collected so far show that 200 requests arrive per day and that there are 10,000 requests pending inside the system. In this case the arrival rate is λ = 200/day and the average number of tasks in the system is L = 10,000. We seek W, the latency a reimbursement task spends in the system. Since L = λW, we have W = L/λ, so W = 10,000/200 = 50 days. (A one-line check in Python follows.)
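The computation above is one line of Python; a tiny sketch (variable names are mine, nothing here beyond the L = λW identity):

    def average_latency(items_in_system: float, arrival_rate: float) -> float:
        # Little's Law solved for W: since L = lambda * W, W = L / lambda.
        return items_in_system / arrival_rate

    # 10,000 pending reimbursement requests, 200 arriving per day:
    print(average_latency(10_000, 200))  # 50.0 days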
The basic measures of computer performance
When deciding on the design of processor hardware and setting requirements for new systems, performance is one of the key parameters to consider, along with cost, size, security, reliability, and, in some cases, power consumption. The main problem concerning performance measurement is that it is difficult to make meaningful performance comparisons among different processors, even among processors in the same family. Raw speed is far less important than how a processor performs when executing a given application. Unfortunately, application performance depends not just on the raw speed of the processor, but also on the instruction set, the choice of implementation language, the efficiency of the compiler, and the skill of the programming done to implement the application.

Two traditional measures of processor performance are clock speed and instruction execution rate.

The system clock and the clock speed
The system clock governs all operations performed by a processor, such as fetching an instruction, decoding an instruction, performing an arithmetic operation, etc. Typically, all operations begin with the pulse of the clock; thus the speed of a processor is dictated by the pulse frequency produced by the clock, measured in cycles per second, or Hertz (Hz). The rate of pulses is known as the clock rate, or clock speed. For example, a 1-GHz processor receives 1 billion pulses per second. One increment, or pulse, of the clock is referred to as a clock cycle, or a clock tick; the time between pulses is the cycle time.

The cycle time, the clock frequency, and their role in measuring the instruction execution rate
A processor is driven by a clock with a constant frequency f or, equivalently, a constant cycle time τ, where τ = 1/f.

Let us define the instruction count, Ic, for a program as the number of machine instructions executed for that program until it runs to completion.

IMPORTANT: note that Ic is the number of instruction executions, not the number of instructions in the object code of the program.

For example, suppose your Python program consists of 100 lines of Python code, and assume that when translated into machine language it consists of 450 machine-language instructions. Ic is not necessarily equal to 450, because it depends on the logic of your program and its input data. The machine-language instructions obtained after "compilation" of a program still represent the logic of the program at hand, but not the number of instructions that will actually be executed by the CPU. The 450 instructions are the "size" of the program, the low-level instructions stored in memory when the program is to be executed; Ic, by contrast, is the number of instructions actually executed by the CPU, which depends on the logic of the program as well as on the input being processed.

More clarification: consider the following piece of pseudocode, a test-and-jump example:

    S = 0
    i = 1
    loop: S = S + i
    i = i + 1
    if (i <= 10) goto loop

In this simple pseudocode we can count five instructions, but the actual number of instructions executed, Ic, is NOT five.
Why is Ic NOT five in our example? We need to determine the actual number of times each instruction is executed by the CPU when the program runs:

    Instruction            | Number of times it is executed
    S = 0                  | 1
    i = 1                  | 1
    S = S + i              | executed for i = 1, 2, ..., 10, therefore 10 times
    i = i + 1              | 10 times
    if (i <= 10) goto loop | 10 times

Therefore Ic = 1 + 1 + 10 + 10 + 10 = 32 instruction executions, even though the object code contains only five instructions. (A Python sketch that counts these executions follows.)
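As a sanity check on the count above, this small Python sketch (mine, not from the notes) simulates the five-instruction pseudocode and tallies every instruction execution:

    def run_and_count():
        # Simulate the test-and-jump pseudocode, incrementing ic once
        # per instruction execution.
        ic = 0
        s = 0; ic += 1            # S = 0
        i = 1; ic += 1            # i = 1
        while True:
            s = s + i; ic += 1    # S = S + i
            i = i + 1; ic += 1    # i = i + 1
            ic += 1               # if (i <= 10) goto loop
            if not i <= 10:
                break
        return s, ic

    s, ic = run_and_count()
    print(s)   # 55, the sum 1 + 2 + ... + 10
    print(ic)  # 32 instruction executions from 5 static instructions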