
3- Analytical Modeling of Parallel Programs.pptx


Transcript


Analytical Modeling of Parallel Programs
Parallel & Distributed Processing - 1502412
Prof. Ali El-Moursy

Topic Overview
- Speedup
  - Performance measurement in uniprocessor
  - Sources of Overhead in Parallel Programs
  - Performance Metrics for Parallel Systems
- Efficiency & Cost of Parallel Systems
- Effect of Granularity on Performance
- Scalability of Parallel Systems
- Minimum Execution Time and Minimum Cost-Optimal Execution Time
- Analysis of Parallel Programs
- Other Scalability Metrics

Measuring, Reporting, and Summarizing Performance
What do we mean by one computer being faster than another?
- The computer user is interested in reducing response time.
- The administrator of a large data processing center may be interested in increasing throughput.
In comparing design alternatives, we often want to relate the performance of two different computers, say, X and Y. The phrase "X is n times faster than Y" means

$$n = \frac{\text{Execution time}_Y}{\text{Execution time}_X}$$

i.e., n is the relative execution time (a speedup or slowdown).

Performance is the reciprocal of execution time, so

$$n = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{1/\text{Performance}_Y}{1/\text{Performance}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y}$$

is also the performance ratio.
Baseline: this term usually refers to the basic/original computer system against which the computer designer compares newly designed systems.

Quantitative Principles of Computer Design
Amdahl's Law:
- It helps calculate the overall performance gain that can be obtained by improving some portion of a computer system.
- The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
- For example, suppose a computer designer has enhanced the floating-point (FP) functional unit in a microprocessor. This will ONLY improve (reduce) the execution time of FP operations. How do we calculate the net performance gain?

The old and new execution times decompose as

$$\text{Execution time}_{\text{old}} = \text{Execution time}_{\text{non-enhanced}} + \text{Execution time}_{\text{old-enhanced}}$$

$$\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{non-enhanced}} + \text{Execution time}_{\text{new-enhanced}}$$

- Execution time_non-enhanced is the same before and after the enhancement.
- Execution time_enhanced is reduced by the enhancement: Execution time_new-enhanced is its value after the enhancement, Execution time_old-enhanced its value before.

- Execution time_old-enhanced represents a percentage of the total old execution time, called Fraction_enhanced:

$$\text{Fraction}_{\text{enhanced}} = \frac{\text{Execution time}_{\text{old-enhanced}}}{\text{Execution time}_{\text{old}}}$$

- After implementing the enhancement, the ratio of Execution time_old-enhanced to Execution time_new-enhanced (the speedup of the enhanced part) is

$$\text{Speedup}_{\text{enhanced}} = \frac{\text{Execution time}_{\text{old-enhanced}}}{\text{Execution time}_{\text{new-enhanced}}}$$

To use Amdahl's law, two factors are needed:
1. The fraction of the computation time in the original computer that can be converted to take advantage of the enhancement.
2. The improvement gained by the enhanced execution mode; that is, how much faster the enhanced part of the program would run if the enhanced mode were used for the entire program.

$$\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$

Example: We want to enhance the processor used for Web serving. The new processor is 10 times faster on computation in the Web serving application than the original processor. The original processor is busy with computation 40% of the time and is waiting for I/O 60% of the time. What is the overall speedup gained by the enhancement?
With Fraction_enhanced = 0.4 and Speedup_enhanced = 10:

$$\text{Speedup}_{\text{overall}} = \frac{1}{(1 - 0.4) + \dfrac{0.4}{10}} = \frac{1}{0.64} \approx 1.56$$

Note: the theoretical maximum, as Speedup_enhanced grows without bound, is

$$\text{Speedup}_{\text{overall}} = \frac{1}{1 - 0.4} = \frac{1}{0.6} \approx 1.67$$

(see the Python sketch below).

Analytical Modeling - Basics
A sequential algorithm is evaluated by its runtime (in general, runtime as a function of input size). The parallel runtime of a program depends on the input size, the number of processors, and the communication parameters of the machine. An algorithm must therefore be analyzed in the context of the underlying platform. A parallel system is a combination of a parallel algorithm and an underlying platform; the performance of the parallel system is affected by BOTH factors, the parallel algorithm AND the platform used.

A number of performance measures are intuitive. Wall clock time is the time from the start of the first processor to the stopping time of the last processor in a parallel ensemble. But how does this scale when the number of processors is changed or the program is ported to another machine altogether? How much faster is the parallel version? This begs the obvious follow-up question: what is the baseline serial version with which we compare? Can we use a suboptimal serial program to make our parallel program look fast?

Uniprocessing vs. Parallel Processing Performance Evaluation
Factors that do NOT exist in uniprocessing must be considered when evaluating a parallel processing system. In a uniprocessor, time is divided into two components ONLY: enhanced and non-enhanced. In parallel processing, the overhead of parallelism must also be accounted for.

Sources of Overhead in Parallel Programs
If I use two processors, shouldn't my program run twice as fast? No - a number of overheads, including wasted computation, communication, idling, and contention, cause degradation in performance.
- Interprocess interactions: Processors working on any non-trivial parallel problem will need to talk to each other.
- Idling: Processes may idle because of load imbalance, synchronization, or serial components.
- Excess computation: This is computation not performed by the serial version. It may arise because the serial algorithm is difficult to parallelize, or because some computations are repeated across processors to minimize communication, to prepare for communication, or to identify the role of each processing member.
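A minimal Python sketch of the Amdahl's-law arithmetic above; the function name is mine, and the input values are those of the Web-serving example.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of execution time benefits
    from an enhancement (Amdahl's Law)."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Web-serving example: 40% of the time is computation, made 10x faster.
print(amdahl_speedup(0.4, 10))   # ~1.56
# Theoretical maximum: the enhanced part takes (almost) no time at all.
print(amdahl_speedup(0.4, 1e9))  # ~1.67
```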
Performance Metrics for Parallel Systems: Execution Time
The serial runtime of a program is the time elapsed between the beginning and the end of its execution on a sequential computer. The parallel runtime is the time that elapses from the moment the first processor starts to the moment the last processor finishes execution. We denote the serial runtime by TS and the parallel runtime by TP.

Performance Metrics for Parallel Systems: Total Parallel Overhead
TS is the serial time and TP is the parallel time. Ideally, a system of p processors should reduce the execution time to TS/p, but parallelization has an overhead. The overhead function To is therefore given by

$$T_o = T_P - \frac{T_S}{p}$$

Performance Metrics for Parallel Systems: Speedup
What is the benefit of parallelism? Speedup (S) is the ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with p identical processing elements:

$$S = \frac{T_S}{T_P}$$

Assume an ideal case with fully parallelized execution and no parallelization overhead. Is the speedup equal to the number of processing nodes? The answer is: NO! Why? Net speedup is highly dependent on the parallel algorithm.

Performance Metrics: Example
Consider the problem of adding n numbers by using n processing elements. If n is a power of two, we can perform this operation in log n steps by propagating partial sums up a logical binary tree of processors.
Serial time: $T_S = \Theta(n)$
Parallel time: $T_P = \Theta(\log n)$
Speedup: $S = \Theta(n / \log n)$
(The sketch below evaluates S and To numerically for this example.)

Performance Metrics: Speedup
For a given problem, there might be many serial algorithms available. These algorithms may have different runtimes and may be parallelizable to different degrees. For the purpose of computing speedup, we always consider the best sequential program as the baseline.

Performance Metrics: Speedup Example
Consider the problem of parallel bubble sort. The serial time for bubble sort is 150 seconds. The parallel time for odd-even sort (an efficient parallelization of bubble sort) is 40 seconds. The speedup would appear to be 150/40 = 3.75. But is this really a fair assessment of the system? What if serial quicksort took only 30 seconds? In that case, the speedup is 30/40 = 0.75. This is a more realistic assessment of the system.

Performance Metrics: Speedup Bounds
Speedup can be as low as 0 (the parallel program never terminates). Speedup, in theory, should be upper bounded by p - after all, we can only expect a p-fold speedup if we use p times as many resources. A speedup greater than p is possible only if each processing element spends less than time TS/p solving the problem. In this case, a single processor could be time-sliced to achieve a faster serial program, which contradicts our assumption of the fastest serial program as the basis for speedup.
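A small Python sketch of these metrics for the tree-addition example above. The unit-cost model (one time unit per addition and per communication step, so TP = 2 log2 n rather than the bare log n step count) is an assumption of mine, borrowed from the p < n example later in this deck.

```python
import math

def addition_metrics(n):
    """Speedup and overhead for adding n numbers on p = n processing
    elements, assuming one time unit per addition and per communication."""
    p = n
    t_s = n - 1                # serial: n - 1 additions
    t_p = 2 * math.log2(n)     # log2(n) tree levels: one send + one add each
    speedup = t_s / t_p        # S = T_S / T_P
    overhead = t_p - t_s / p   # T_o = T_P - T_S / p
    return speedup, overhead

for n in (16, 256, 4096):
    s, o = addition_metrics(n)
    print(f"n = {n:5d}: S = {s:7.1f}, To = {o:6.2f}")
```

Note how the speedup grows with n but stays well short of p = n, matching S = Θ(n / log n).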
Performance Metrics: Speedup Bounds
If TS is the serial runtime of the algorithm, then the problem CANNOT be solved in less than time TS on a single processing element. A speedup greater than p is nonetheless sometimes observed (a phenomenon known as superlinear speedup or superscaling). This usually happens when the work performed by a serial algorithm is greater than that of its parallel formulation, or due to hardware features that put the serial implementation at a disadvantage.

Revision: Parallel / Distributed Systems
- Multiprocessor Architecture
- Distributed-Memory Multicomputer Architecture

Performance Metrics: Speedup Bounds
For example, the data for a problem might be too large to fit into the cache of a single processing element, thereby degrading its performance due to the use of slower memory elements (hard disk). But when partitioned among several processing elements, the individual data partitions can be small enough to fit into the respective processing elements' caches. In the remainder of this book, we disregard superlinear speedup due to hierarchical memory.

Performance Metrics: Efficiency
Efficiency is a measure of the fraction of time for which a processing element is usefully employed. Mathematically, it is given by

$$E = \frac{S}{p} = \frac{T_S}{p\,T_P} \quad (2)$$

where S is the speedup and p is the number of processors. Following the bounds on speedup, efficiency can be as low as 0 and as high as 1.

Performance Metrics: Efficiency Example
The speedup of adding n numbers on n processors is given by $S = \Theta(n / \log n)$. Efficiency is given by

$$E = \frac{S}{p} = \frac{\Theta(n / \log n)}{n} = \Theta\!\left(\frac{1}{\log n}\right)$$

Cost of a Parallel System
Cost is the product of parallel runtime and the number of processing elements used (p x TP). Cost reflects the sum of the time that each processing element spends solving the problem. A parallel system is said to be cost-optimal if the cost of solving a problem on a parallel computer is identical to the serial cost (TS = p TP). Since E = TS / (p TP), for cost-optimal systems E = O(1). Cost is sometimes referred to as work or processor-time product.

Cost of a Parallel System: Example
Consider the problem of adding n numbers on n processors. We have TP = log n (for p = n). The cost of this system is given by p TP = n log n. Since the serial runtime of this operation is Θ(n), the algorithm is not cost-optimal (the sketch below makes this concrete).

Scalability of Parallel Systems
How do we extrapolate performance from small problems and small systems to larger problems on larger configurations? Consider three parallel algorithms for computing an n-point Fast Fourier Transform (FFT) on 64 processing elements. [Figure: a comparison of the speedups obtained by the binary-exchange, 2-D transpose, and 3-D transpose algorithms on 64 processing elements with tc = 2, tw = 4, ts = 25, and th = 2.] Clearly, it is difficult to infer scaling characteristics from observations on small datasets on small machines.
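A brief Python sketch of the cost and efficiency bookkeeping for the same addition example. It uses the slide's idealized TP = log2 n (communication folded into the step count); the unit cost per operation is my assumption.

```python
import math

def cost_and_efficiency(n):
    """Cost (p * T_P) and efficiency for adding n numbers on p = n
    processing elements, with the idealized T_P = log2(n)."""
    p = n
    t_s = n                 # Theta(n) serial work, unit cost per addition
    t_p = math.log2(n)      # Theta(log n) parallel steps
    cost = p * t_p          # n log n, which grows faster than T_S = Theta(n)
    eff = t_s / cost        # E = T_S / (p * T_P) = Theta(1 / log n)
    return cost, eff

for n in (16, 1024, 65536):
    c, e = cost_and_efficiency(n)
    print(f"n = {n:6d}: cost = {c:9.0f}, E = {e:.3f}")  # E -> 0: not cost-optimal
```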
Scaling Characteristics of Parallel Programs
The efficiency of a parallel program can be written as

$$E = \frac{S}{p} = \frac{T_S}{p\,T_P} = \frac{T_S}{p\left(\frac{T_S}{p} + T_o\right)} = \frac{T_S}{T_S + p\,T_o} = \frac{1}{1 + \dfrac{p\,T_o}{T_S}}$$

using $T_P = T_S/p + T_o$. The overhead function To is an increasing function of p, so efficiency decreases as the processor count p increases.

For a given problem size (i.e., the value of TS remains constant), as we increase the number of processing elements, the total overhead pTo increases, and the overall efficiency of the parallel program goes down. This is the case for all parallel programs.

Scaling Characteristics of Parallel Programs: Example
Consider the problem of adding n numbers on p processing elements. Assume the number of processors p < n and that both p and n are powers of two. For the highest concurrency level:
1. Each processor first performs n/p additions, producing p partial results, one per processor.
2. A binary tree over the p partial results sums them up into the final result.
Assume p = 4 and n = 16. [Figure: 16 numbers summed on 4 processors - local sums followed by a binary tree over the 4 partial results.]
If each addition takes one time unit and each communication also takes one time unit, then:

$$T_P = \frac{n}{p} + 2\log p \quad (5)$$

$$S = \frac{T_S}{T_P} = \frac{n}{\dfrac{n}{p} + 2\log p} \quad (6)$$

$$E = \frac{S}{p} = \frac{1}{1 + \dfrac{2p\log p}{n}} \quad (7)$$

Plotting the speedup for various input sizes gives us: [Figure: speedup curves for various input sizes.]

Scaling Characteristics of Parallel Programs
The total overhead function To is a function of both the problem size TS and the number of processing elements p. In many cases, To grows sublinearly with respect to TS. In such cases, the efficiency increases if the problem size is increased while keeping the number of processing elements constant. For such systems, we can simultaneously increase the problem size and the number of processors to keep efficiency constant. We call such systems scalable parallel systems.

Recall that cost-optimal parallel systems have an efficiency of Θ(1). Scalability and cost-optimality are therefore related: a scalable parallel system can always be made cost-optimal if the number of processing elements and the size of the computation are chosen appropriately.

Backup

Isoefficiency Metric of Scalability
For a given problem size, as we increase the number of processing elements, the overall efficiency of the parallel system goes down; this holds for all systems. For some systems, the efficiency of a parallel system increases if the problem size is increased while keeping the number of processing elements constant.
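A quick numeric check of equation (7) in Python (the helper function is illustrative, not from the deck): at fixed n, efficiency falls as p grows; at fixed p, it recovers as n grows. This is exactly the trade-off the isoefficiency question below formalizes.

```python
import math

def efficiency(n, p):
    """Equation (7): E = 1 / (1 + 2 p log2(p) / n), the efficiency of
    adding n numbers on p processing elements under unit costs."""
    return 1.0 / (1.0 + 2.0 * p * math.log2(p) / n)

# Fixed problem size, growing processor count: efficiency drops.
print([round(efficiency(256, p), 3) for p in (4, 8, 16, 32)])
# Fixed processor count, growing problem size: efficiency recovers.
print([round(efficiency(n, 16), 3) for n in (256, 1024, 4096)])
```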
Isoefficiency Metric of Scalability
What is the rate at which the problem size must increase with respect to the number of processing elements to keep the efficiency fixed? This rate determines the scalability of the system: the slower this rate, the better. Before we formalize this rate, we define the problem size W as the number of operations associated with the best serial algorithm that solves the problem.

We can write the parallel runtime as

$$T_P = \frac{W + T_o(W, p)}{p} \quad (8)$$

where $T_o(W, p)$ here denotes the total overhead, $p\,T_P - W$. The resulting expression for speedup is

$$S = \frac{W}{T_P} = \frac{W\,p}{W + T_o(W, p)} \quad (9)$$

Finally, we write the expression for efficiency as

$$E = \frac{S}{p} = \frac{W}{W + T_o(W, p)} = \frac{1}{1 + T_o(W, p)/W} \quad (10)$$

For scalable parallel systems, efficiency can be maintained at a fixed value (between 0 and 1) if the ratio To/W is maintained at a constant value. For a desired value E of efficiency,

$$\frac{T_o(W, p)}{W} = \frac{1 - E}{E}, \quad\text{i.e.,}\quad W = \frac{E}{1 - E}\,T_o(W, p) \quad (11)$$

If K = E / (1 - E) is a constant depending on the efficiency to be maintained, then, since To is a function of W and p, we have

$$W = K\,T_o(W, p) \quad (12)$$

The problem size W can usually be obtained as a function of p by algebraic manipulation to keep efficiency constant. This function is called the isoefficiency function. It determines the ease with which a parallel system can maintain a constant efficiency and hence achieve speedups increasing in proportion to the number of processing elements (see the sketch below for the running addition example).
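To make the definition concrete, here is a small sketch. The total overhead To(W, p) = 2 p log2(p) is my own derivation from equation (5) for the addition example (p TP - W with W = n), so the isoefficiency function there is W = 2 K p log2(p) = Θ(p log p).

```python
import math

def isoefficiency_w(p, e=0.8):
    """Problem size W needed to hold efficiency e on p processing elements
    for the addition example, assuming To(W, p) = 2 p log2(p).
    From (12): W = K * To(W, p), with K = e / (1 - e)."""
    k = e / (1.0 - e)
    return k * 2.0 * p * math.log2(p)

for p in (4, 16, 64, 256):
    print(f"p = {p:3d}: W = {isoefficiency_w(p):8.0f}")  # W grows as Theta(p log p)
```

As a check, substituting this W back into equation (7) gives E = 1 / (1 + 1/K) = 0.8 for every p, i.e., the efficiency is indeed held constant.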
