Lectures on Parallel Computing PDF
Summary
These are lecture notes on parallel computing. The document covers the motivation for (and against) parallelism, speedup, efficiency, and scalability, limits to speedup such as Amdahl's and Gustafson's laws, data dependences and memory locality, and an introduction to uniprocessor architecture and compiler-related optimization.
Full Transcript
Introduction

Books
- [HPC] "Software Optimization for High Performance Computing: Creating Faster Applications" by K.R. Wadleigh and I.L. Crawford, Hewlett-Packard Professional Books, Prentice Hall.
- [OPT] "The Software Optimization Cookbook" (2nd ed.) by R. Gerber, A. Bik, K. Smith, and X. Tian, Intel Press.
- [PP2] "Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers" (2nd ed.) by B. Wilkinson and M. Allen, Prentice Hall.
- [PSC] "Parallel Scientific Computing in C++ and MPI" by G. Karniadakis and R. Kirby II, Cambridge University Press.
- [SPC] "Scientific Parallel Computing" by L.R. Scott, T. Clark, and B. Bagheri, Princeton University Press.
- [SRC] "Sourcebook of Parallel Computing" by J. Dongarra, I. Foster, G. Fox, W. Gropp, K. Kennedy, L. Torczon, and A. White (eds.), Morgan Kaufmann.

Introduction
- Why parallel?
- ... and why not!
- Benefit depends on speedup, efficiency, and scalability of parallel algorithms
- What are the limitations to parallel speedup?
- The future of computing
- Lessons learned
- Further reading

Why Parallel?
- It is not always obvious that a parallel algorithm has benefits, unless we want to do things ...
  - faster: doing the same amount of work in less time
  - bigger: doing more work in the same amount of time
- Both of these reasons can be argued to produce better results, which is the only meaningful outcome of program parallelization

Faster, Bigger!
- There is an ever-increasing demand for computational power to improve the speed or accuracy of solutions to real-world problems through faster computations and/or bigger simulations
- Computations must be completed in an acceptable time (real-time computation), hence must be "fast enough"

Faster, Bigger!
- An illustrative example: a weather prediction simulation should not take more time than the real event
- Suppose the atmosphere of the earth is divided into 5x10^8 cubes, each 1x1x1 mile, stacked 10 miles high
- It takes 200 floating-point operations per cube to complete one time step
- 10^4 time steps are needed for a 7-day forecast
- Then 10^15 floating-point operations must be performed
- This takes 10^6 seconds (about 10 days) on a 1 GFLOPS machine

Grand Challenge Problems
- Big problems
  - A "Grand Challenge" problem is a problem that cannot be solved in a reasonable amount of time with today's computers
  - Examples of Grand Challenge problems:
    - Applied Fluid Dynamics
    - Meso- to Macro-Scale Environmental Modeling
    - Ecosystem Simulations
    - Biomedical Imaging and Biomechanics
    - Molecular Biology
    - Molecular Design and Process Optimization
    - Fundamental Computational Sciences
    - Nuclear power and weapons simulations

CPU Energy Consumption & Heat Dissipation
- "[Without revolutionary new technologies ...] increased power requirements of newer chips will lead to CPUs that are hotter than the surface of the sun by 2010." (Intel CTO)
- Parallelism can help to conserve energy. In the figure, one CPU at f = 3 GHz runs Task 1 and Task 2 back to back with E ∝ f^2 ≈ 9, while two CPUs at f = 2 GHz run the two tasks in parallel with E ∝ f^2 ≈ 4 and still finish before the sequential finish time (tp < ts)

Important Factors
- Important considerations in parallel computing
  - Physical limitations: the speed of light, CPU heat dissipation
  - Economic factors: cheaper components can be used to achieve comparable levels of aggregate performance
  - Scalability: allows data to be subdivided over CPUs to obtain a better match between algorithms and resources to increase performance
  - Memory: allows aggregate memory bandwidth to be increased together with processing power at a reasonable cost
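As a quick check of the arithmetic in the weather example, here is a small C sketch added for illustration (not part of the original slides); all constants are taken from the slide, and a sustained rate of 1 GFLOPS is assumed.

    #include <stdio.h>

    int main(void) {
        double cubes     = 5e8;   /* 5 x 10^8 atmospheric cubes            */
        double flops_per = 200.0; /* floating-point ops per cube per step  */
        double steps     = 1e4;   /* time steps for a 7-day forecast       */
        double rate      = 1e9;   /* assumed sustained rate: 1 GFLOPS      */

        double total = cubes * flops_per * steps;  /* = 10^15 operations   */
        double secs  = total / rate;               /* about 10^6 seconds,
                                                      on the order of 10 days */

        printf("total operations: %.1e\n", total);
        printf("time at 1 GFLOPS: %.1e s (%.1f days)\n", secs, secs / 86400.0);
        return 0;
    }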
... and Why not Parallel?
- Bad parallel programs can be worse than their sequential counterparts
  - Slower: because of communication overhead
  - Scalability: some parallel algorithms are only faster when the problem size is very large
- Understand the problem and use common sense!
- Not all problems are amenable to parallelism
- In this course we will also devote a significant part to non-parallel optimizations and to getting better performance from our sequential programs

... and Why not Parallel?
- Some algorithms are inherently sequential
- Consider for example the Collatz conjecture, implemented by

    int Collatz(int n)
    {
      int step;
      for (step = 1; n != 1; step++)
      {
        if (n % 2 == 0)  // is n even?
          n = n / 2;
        else
          n = 3*n + 1;
      }
      return step;
    }

- Given n, Collatz returns the number of steps to reach n = 1
- Conjecture: the algorithm terminates for any integer n > 0
- This algorithm is clearly sequential
- Note: given a vector of k values, we can compute k Collatz numbers in parallel

Speedup
- Suppose we want to compute in parallel

    for (i = 0; i < N; i++)
      z[i] = x[i] + y[i];

- Then the obvious choice is to split the iteration space into P equal-sized chunks of N/P iterations and let each processor share the work (worksharing) of the loop:

    for each processor p from 0 to P-1 do:
      for (i = p*N/P; i < (p+1)*(N/P); i++)
        z[i] = x[i] + y[i];

  (a minimal OpenMP sketch of this worksharing pattern follows the scalability discussion below)
- We would assume that this parallel version runs P times faster, that is, we hope for linear speedup
- Unfortunately, in practice this is typically not the case because of overhead in communication, resource contention, and synchronization

Speedup
- Definition: the speedup of an algorithm using P processors is defined as SP = ts / tP, where ts is the execution time of the best available sequential algorithm and tP is the execution time of the parallel algorithm
- The speedup is perfect or ideal if SP = P
- The speedup is linear if SP ≈ P
- The speedup is superlinear when, for some P, SP > P

Relative Speedup
- Definition: the relative speedup is defined as S1P = t1 / tP, where t1 is the execution time of the parallel algorithm on one processor
- Similarly, SkP = tk / tP is the relative speedup with respect to k processors, where k < P
- The relative speedup SkP is used when k is the smallest number of processors on which the problem will run

An Example: Parallel Search
- Search in parallel by partitioning the search space into P chunks
- SP = ((x × ts/P) + Δt) / Δt, where x × ts/P is the time a sequential search spends scanning the x chunks that precede the one holding the item, and Δt is the time needed to locate the item inside its chunk (which is also the parallel search time)
- Worst case for the sequential search (item in the last chunk): SP → ∞ as Δt tends to zero
- Best case for the sequential search (item in the first chunk): SP = 1

Effects that can Cause Superlinear Speedup
- Cache effects: when data is partitioned and distributed over P processors, the individual data sets are (much) smaller and may fit entirely in the data cache of each processor
- For an algorithm with linear speedup, the extra reduction in cache misses may lead to superlinear speedup

Efficiency
- Definition: the efficiency of an algorithm using P processors is EP = SP / P
- Efficiency estimates how well-utilized the processors are in solving the problem, compared to how much effort is lost in idling and communication/synchronization
- Ideal (or perfect) speedup means 100% efficiency, EP = 1
- Many difficult-to-parallelize algorithms have an efficiency that approaches zero as P increases

Scalability
- Speedup describes how the parallel algorithm's performance changes with increasing P
- Scalability concerns the efficiency of the algorithm with changing problem size N, by choosing P dependent on N so that the efficiency of the algorithm is bounded below
- Definition: an algorithm is scalable if there is a minimal efficiency ε > 0 such that, given any problem size N, there is a number of processors P(N), which tends to infinity as N tends to infinity, such that the efficiency EP(N) > ε > 0 as N is made arbitrarily large
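The worksharing loop above can be written directly in OpenMP on a shared-memory machine. The sketch below is a minimal illustration added to these notes (it assumes an OpenMP-capable compiler, e.g. building with -fopenmp, and the array length N is an arbitrary choice); it times a sequential run and a worksharing run of the vector addition and reports the speedup SP = ts/tP and efficiency EP = SP/P defined in this section.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 10000000   /* arbitrary problem size for the illustration */

    int main(void) {
        double *x = malloc(N * sizeof *x);
        double *y = malloc(N * sizeof *y);
        double *z = malloc(N * sizeof *z);
        if (!x || !y || !z) return 1;

        for (int i = 0; i < N; i++) { x[i] = i; y[i] = 2.0 * i; }

        /* sequential run: ts */
        double t0 = omp_get_wtime();
        for (int i = 0; i < N; i++) z[i] = x[i] + y[i];
        double ts = omp_get_wtime() - t0;

        /* worksharing run: the iteration space is split over the threads */
        t0 = omp_get_wtime();
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++) z[i] = x[i] + y[i];
        double tp = omp_get_wtime() - t0;

        int P = omp_get_max_threads();
        printf("P = %d, ts = %.4f s, tp = %.4f s, SP = %.2f, EP = %.2f\n",
               P, ts, tp, ts / tp, (ts / tp) / P);

        free(x); free(y); free(z);
        return 0;
    }

Because this loop does very little arithmetic per memory access, the measured speedup is usually well below P, which is exactly the kind of overhead- and bandwidth-limited behavior discussed above.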
Amdahl's Law
- Several factors can limit the speedup
  - Processors may be idle
  - Extra computations are performed in the parallel version
  - Communication and synchronization overhead
- Let f be the fraction of the computation that is sequential and cannot be divided into concurrent tasks

Amdahl's Law
- Amdahl's law states that the speedup given P processors is

    SP = ts / ( f × ts + (1-f) × ts / P ) = P / ( 1 + (P-1) × f )

- As a consequence, the maximum speedup is limited by SP → 1/f as P → ∞

Gustafson's Law
- Amdahl's law is based on a fixed workload or fixed problem size per processor, i.e. it analyzes constant-problem-size scaling
- Gustafson's law defines the scaled speedup by keeping the parallel execution time constant (i.e. time-constrained scaling), adjusting P as the problem size N changes:

    SP,N = P + (1-P) × α(N)

  where α(N) is the non-parallelizable fraction of the normalized parallel time tP,N = 1 given problem size N
- To see this, let β(N) = 1 - α(N) be the parallelizable fraction, so that

    tP,N = α(N) + β(N) = 1

  Then the scaled sequential time is

    ts,N = α(N) + P × β(N)

  giving

    SP,N = α(N) + P × (1 - α(N)) = P + (1-P) × α(N)
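To make the contrast between the two laws concrete, the short C sketch below (added for illustration; the 5% serial fraction is a made-up example value) tabulates the Amdahl speedup P / (1 + (P-1)f) and the Gustafson scaled speedup P + (1-P)α for growing P.

    #include <stdio.h>

    /* Amdahl: fixed problem size, sequential fraction f */
    static double amdahl(int P, double f)    { return P / (1.0 + (P - 1) * f); }

    /* Gustafson: fixed parallel time, non-parallelizable fraction alpha */
    static double gustafson(int P, double a) { return P + (1 - P) * a; }

    int main(void) {
        const double f = 0.05, alpha = 0.05;   /* example fractions: 5% serial */
        printf("   P   Amdahl  Gustafson\n");
        for (int P = 1; P <= 1024; P *= 4)
            printf("%4d  %7.2f  %9.2f\n", P, amdahl(P, f), gustafson(P, alpha));
        return 0;
    }

With f = α = 0.05, the Amdahl speedup saturates near 1/f = 20 while the scaled speedup keeps growing almost linearly with P, which is exactly the difference between constant-problem-size scaling and time-constrained scaling.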
Limitations to Speedup: Data Dependences
- The Collatz iteration loop has a loop-carried dependence
  - The value of n is carried over to the next iteration
  - Therefore, the algorithm is inherently sequential
- Loops with loop-carried dependences cannot be parallelized
- To increase parallelism:
  - Change the loops to remove dependences when possible
  - Thread-privatize scalar variables in loops
  - Apply algorithmic changes by rewriting the algorithm (this may slightly change the result of the output or even approximate it)

Limitations to Speedup: Data Dependences
- Consider for example the update step in a Gauss-Seidel iteration for solving a two-point boundary-value problem:

    do i=1,n
      soln(i)=(f(i)-soln(i+1)*offdiag(i)-soln(i-1)*offdiag(i-1))/diag(i)
    enddo

- By contrast, the Jacobi iteration for solving a two-point boundary-value problem does not exhibit loop-carried dependences:

    do i=1,n
      snew(i)=(f(i)-soln(i+1)*offdiag(i)-soln(i-1)*offdiag(i-1))/diag(i)
    enddo
    do i=1,n
      soln(i)=snew(i)
    enddo

- In this case the iteration space of the loops can be partitioned and each processor given a chunk of the iteration space

Limitations to Speedup: Data Dependences
- Three example loops to initialize a finite difference matrix:

    1. do i=1,n
         diag(i)=(1.0/h(i))+(1.0/h(i+1))
         offdiag(i)=-(1.0/h(i+1))
       enddo

    2. do i=1,n
         dxo=1.0/h(i)
         dxi=1.0/h(i+1)
         diag(i)=dxo+dxi
         offdiag(i)=-dxi
       enddo

    3. dxi=1.0/h(1)
       do i=1,n
         dxo=dxi
         dxi=1.0/h(i+1)
         diag(i)=dxo+dxi
         offdiag(i)=-dxi
       enddo

- Which loop(s) can be parallelized?
- Which loop probably runs more efficiently on a sequential machine?

Limitations to Speedup: Data Dependences
- The presence of dependences between events in two or more tasks also means that parallelization strategies to conserve energy are limited
- Say we have two tasks that can be split up into smaller parts to run in parallel, but with synchronization points to exchange dependent values. In the figure, one CPU at f = 3 GHz runs Task 1 and Task 2 back to back with E ∝ f^2 ≈ 9, while two CPUs at f = 2 GHz interleave the task parts with E ∝ (1+Δ) × f^2 ≈ 7.2, where Δ = 80% is the CPU idle time, but finish after the sequential finish time (tp > ts): the parallel version missed the deadline!

Limitations to Speedup: Data Dependences
- The presence of dependences between events in two or more tasks also means that parallelization strategies to conserve energy are limited
- Say we have the same two tasks, split up into smaller parts to run in parallel, with the clock raised so that the deadline is met. In the figure, one CPU at f = 3 GHz again has E ∝ f^2 ≈ 9, while two CPUs at f = 2.25 GHz finish exactly on time (ts = tp) with E ∝ (1+Δ) × f^2 ≈ 9.11, where Δ = 80% is the CPU idle time: the parallel version consumes more energy!

Efficient Parallel Execution
- Trying to construct a parallel version of an algorithm is not the end-all do-all of high-performance computing
  - Recall Amdahl's law: the maximum speedup is bounded by SP → 1/f as P → ∞
  - Thus, efficient execution of the non-parallel fraction f is extremely important
  - Efficient execution of the parallel tasks is important, since tasks may end at different times (the slowest horse in front of the carriage)
  - We can reduce f by improving the sequential code execution (e.g. algorithm initialization parts), I/O, communication, and synchronization
- To achieve high performance, we should highly optimize the per-node sequential code and use profiling techniques to analyze the performance of our code and investigate the causes of overhead

Limitations to Speedup: Locality
- The memory hierarchy forms a barrier to performance when locality is poor
- Temporal locality
  - The same memory location is accessed frequently and repeatedly
  - Poor temporal locality results from frequent accesses to fresh new memory locations
- Spatial locality
  - Consecutive (or "sufficiently near") memory locations are accessed
  - Poor spatial locality means that memory locations are accessed in a more random pattern
- The memory wall refers to the growing disparity in speed between the CPU and memory outside the CPU chip

Efficient Sequential Execution
- Memory effects are the greatest concern for optimal sequential execution
  - Store-load dependences, where data has to flow through memory
  - Cache misses
  - TLB misses
  - Page faults
- CPU resource effects can also limit performance
  - Limited number of floating-point units
  - Unpredictable branching (if-then-else, loops, etc.) in the program
- Use common sense when allocating and accessing data
- Use compiler optimizations effectively
- Execution is best analyzed with performance analyzers
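As a small, self-contained illustration of spatial locality (an example added here, not taken from the slides; the matrix size is arbitrary), the two functions below sum the same C matrix. The row-wise walk visits consecutive memory locations, while the column-wise walk strides through memory, so on a row-major C array it typically incurs far more cache and TLB misses even though both versions do the same arithmetic.

    #include <stdio.h>

    #define N 2048
    static double a[N][N];   /* C stores this array row by row (row-major) */

    /* good spatial locality: consecutive addresses are accessed in order */
    static double sum_rowwise(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* poor spatial locality: each access jumps N*sizeof(double) bytes ahead */
    static double sum_columnwise(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0;
        /* both sums are identical; only the memory access pattern differs */
        printf("row-wise: %.0f, column-wise: %.0f\n",
               sum_rowwise(), sum_columnwise());
        return 0;
    }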
Any Questions?

Uniprocessors

Overview

PART I: Uniprocessors and Compiler Optimizations
- Uniprocessors
  - Processor architectures
  - Instruction set architectures
  - Instruction scheduling and execution
- Data storage
  - Memory hierarchy
  - Caches, TLB
- Compiler optimizations

PART II: Multiprocessors and Parallel Programming Models
- Multiprocessors
  - Pipeline and vector machines
  - Shared memory
  - Distributed memory
  - Message passing
- Parallel programming models
  - Shared vs distributed memory
  - Hybrid
  - BSP

Levels of Parallelism
(figure: levels of parallelism, from superscalar processors (fine grain) to multicore processors (stacked, fine grain to coarse grain))

Instruction Pipeline
- An instruction pipeline increases the instruction bandwidth
- Classic 5-stage pipeline:
  - IF: instruction fetch
  - ID: instruction decode
  - EX: execute (functional units)
  - MEM: load/store
  - WB: write back to registers and forward results into the pipeline when needed by another instruction
- (figure: five instructions are in the pipeline at different stages)

Instruction Pipeline Example
- Example 4-stage pipeline
- Four instructions (green, purple, blue, red) are processed in this order:
  - Cycle 0: instructions are waiting
  - Cycle 1: green is fetched
  - Cycle 2: purple is fetched, green decoded
  - Cycle 3: blue is fetched, purple decoded, green executed
  - Cycle 4: red is fetched, blue decoded, purple executed, green in write-back
  - Etc.

Instruction Pipeline Hazards
- Pipeline hazards
  - From data dependences
  - Forwarding in the WB stage can eliminate some data dependence hazards
  - From instruction fetch latencies (e.g. I-cache miss)
  - From memory load latencies (e.g. D-cache miss)
- A hazard is resolved by stalling the pipeline, which causes a bubble of one or more cycles
- Example
  - Suppose a stall of one cycle occurs in the IF stage of the purple instruction
  - Cycle 3: purple cannot be decoded and a no-operation (NOP) is inserted

N-way Superscalar
- When instructions have the same word size
  - Can fetch multiple instructions without having to know the instruction content of each
- N-way superscalar processors fetch N instructions each cycle
  - This increases the instruction-level parallelism
- (figure: 2-way superscalar RISC pipeline)

Instruction Latency Case Study
- Consider two versions of Euclid's algorithm:
  1. Modulo version
  2. Repetitive subtraction version
- Which is faster?

    // Modulo version
    int find_gcf1(int a, int b)
    {
      while (1) {
        a = a % b;
        if (a == 0) return b;
        if (a == 1) return 1;
        b = b % a;
        if (b == 0) return a;
        if (b == 1) return 1;
      }
    }

    // Repetitive subtraction version
    int find_gcf2(int a, int b)
    {
      while (1) {
        if (a > b)
          a = a - b;
        else if (a < b)
          b = b - a;
        else
          return a;
      }
    }

Data Dependences
- Instruction-level parallelism is limited by data dependences
- Types of dependences:
  - RAW: read-after-write, also called flow dependence
  - WAR: write-after-read, also called anti dependence
  - WAW: write-after-write, also called output dependence
- The example expression (w*x)*(y*z) shows a RAW dependence: the outer multiplication reads the results written by the two inner multiplications
- WAR and WAW dependences exist because of storage location reuse (overwrite with a new value)
  - WAR and WAW are sometimes called false dependences
  - RAW is a true dependence

Case Study (1)
- Intel Core 2 Duo 2.33 GHz

    dxi = 1.0/h;
    for (i=1; i