Lecture 1: Introduction to Parallel Computing PDF
Document Details
Uploaded by LegendarySense8316
Tags
Summary
This lecture introduces the concepts of parallel computing, exploring its potential benefits and limitations. It discusses various aspects of parallelization, from the fundamental reasons for pursuing it to the challenges and constraints involved.
Full Transcript
Introduction 1 Books n [HPC] "Software Optimization for High Performance Computing: Creating Faster Applications" by K.R. Wadleigh and I.L. Crawford, Hewlett-Packard professional books, Prentice Hall. n [OPT] "The Software Optimization Cookbook"...
Introduction 1 Books n [HPC] "Software Optimization for High Performance Computing: Creating Faster Applications" by K.R. Wadleigh and I.L. Crawford, Hewlett-Packard professional books, Prentice Hall. n [OPT] "The Software Optimization Cookbook" (2nd ed.) by R. Gerber, A. Bik, K. Smith, and X. Tian, Intel Press. n [PP2] "Parallel Programming: Techniques and Applications using Networked Workstations and Parallel Computers" (2nd ed.) by B. Wilkinson and M. Allen, Prentice Hall. n [PSC] "Parallel Scientific Computing in C++ and MPI" by G. Karniadakis and R. Kirby II, Cambridge University Press. n [SPC] "Scientific Parallel Computing" by L.R. Scott, T. Clark, and B. Bagheri, Princeton University Press. n [SRC] "Sourcebook of Parallel Programming" by J. Dongara, I. Foster, G. Fox, W. Gropp, K. Kennedy, L. Torczon, and A. White (eds), Morgan Kaufmann. Introduction n Why parallel? n … and why not! n Benefit depends on speedup, efficiency, and scalability of parallel algorithms n What are the limitations to parallel speedup? n The future of computing n Lessons Learned n Further reading Why Parallel? n It is not always obvious that a parallel algorithm has benefits, unless we want to do things … ¨ faster: doing the same amount of work in less time ¨ bigger: doing more work in the same amount of time n Both of these reasons can be argued to produce better results, which is the only meaningful outcome of program parallelization Faster, Bigger! n There is an ever increasing demand for computational power to improve the speed or accuracy of solutions to real-world problems through faster computations and/or bigger simulations n Computations must be completed in acceptable time (real-time computation), hence must be “fast enough” Faster, Bigger! n An illustrative example: a weather prediction simulation should not take more time than the real event n Suppose the atmosphere of the earth is divided into 5´108 cubes, each 1´1´1 mile and stacked 10 miles high n It takes 200 floating point operations per cube to complete one time step n 104 time steps are needed for a 7 day forecast n Then 1015 floating point operations must be performed n This takes 106 seconds (= 10 days) on a 1 GFLOP machine Grand Challenge Problems n Big problems ¨ A “Grand Challenge” problem is a problem that cannot be solved in a reasonable amount of time with today’s computers ¨ Examples of Grand Challenge problems: n Applied Fluid Dynamics n Meso- to Macro-Scale Environmental Modeling n Ecosystem Simulations n Biomedical Imaging and Biomechanics n Molecular Biology n Molecular Design and Process Optimization n Fundamental Computational Sciences n Nuclear power and weapons simulations Physical Limits n Which tasks are fundamentally too big to compute with one CPU? n Suppose we have to calculate in one second for (i = 0; i < ONE_TRILLION; i++) z[i] = x[i] + y[i]; n Then we have to perform 3x1012 memory moves per second n If data travels at the speed of light (3x108 m/s) between the CPU and memory and r is the average distance between the CPU and memory, then r must satisfy 3´1012 r = 3´108 m/s ´ 1 s which gives r = 10-4 meters n To fit the data into a square so that the average distance from the CPU in the middle is r, then the length of each memory cell will be 2´10-4 m / (Ö3´106) = 10-10 m which is the size of a relatively small atom! CPU Energy Consumption & Heat Dissipation n “[without revolutionary new technologies …] increased power requirements of newer chips will lead to CPUs that are hotter than the surface of the sun by 2010.” – Intel CTO n Parallelism can help to conserve energy: 1 CPU (f = 3 GHz) Task 1 Task 2 E µ f2 » 9 2 CPUs Task 1 (f = 2 GHz) E µ f2 » 4 Task 2 t0 tp ts Important Factors n Important considerations in parallel computing ¨ Physical limitations: the speed of light, CPU heat dissipation ¨ Economic factors: cheaper components can be used to achieve comparable levels of aggregate performance ¨ Scalability: allows data to be subdivided over CPUs to obtain a better match between algorithms and resources to increase performance ¨ Memory: allow aggregate memory bandwidth to be increased together with processing power at a reasonable cost … and Why not Parallel? n Bad parallel programs can be worse than their sequential counterparts ¨ Slower: because of communication overhead ¨ Scalability: some parallel algorithms are only faster when the problem size is very large n Understand the problem and use common sense! n Not all problems are amenable to parallelism n In this course we will also focus a significant part on non- parallel optimizations and how to get better performance from our sequential programs … and Why not Parallel? n Some algorithms are inherently sequential n Consider for example the Collatz conjecture, implemented by int Collatz(int n) { int step; for (step = 1; n != 1; step++) { if (n % 2 == 0) // is n is even? n = n / 2; else n = 3*n + 1; } return step; } n Given n, Collatz returns the number of steps to reach n = 1 n Conjecture: algorithm terminates for any integer n > 0 n This algorithm is clearly sequential n Note: given a vector of k values, we can compute k Collatz numbers in parallel Speedup n Suppose we want to compute in parallel for (i = 0; i < N; i++) z[i] = x[i] + y[i]; n Then the obvious choice is to split the iteration space in P equal-sized N/P chunks and let each processor share the work (worksharing) of the loop: for each processor p from 0 to P-1 do: for (i = p*N/P; i < (p+1)*(N/P); i++) z[i] = x[i] + y[i]; n We would assume that this parallel version runs P times faster, that is, we hope for linear speedup n Unfortunately, in practice this is typically not the case because of overhead in communication, resource contention, and synchronization Speedup n Definition: the speedup of an algorithm using P processors is defined as SP = ts / tP where ts is the execution time of the best available sequential algorithm and tP is the execution time of the parallel algorithm n The speedup is perfect or ideal if SP = P n The speedup is linear if SP » P n The speedup is superlinear when, for some P, SP > P Speedup is bigger than the number of processors. Relative Speedup n Definition: The relative speedup is defined as S1P = t1 / tP where t1 is the execution time of the parallel algorithm on one processor n Similarly, SkP = tk / tP is the relative speedup with respect to k processors, where k < P n The relative speedup SkP is used when k is the smallest number of processors on which the problem will run An Example Parallel search n Search in parallel by partitioning the search space into P chunks n SP = ( (x ´ ts/P) + Dt ) / Dt n Worst case for sequential search (item in last chunk): SP®¥ as Dt tends to zero n Best case for sequential search (item in first chunk): SP = 1 Sequential search Effects that can Cause Superlinear Speedup n Cache effects: when data is partitioned and distributed over P processors, then the individual data items are (much) smaller and may fit entirely in the data cache of each processor n For an algorithm with linear speedup, the extra reduction in cache misses may lead to superlinear speedup Efficiency n Definition: the efficiency of an algorithm using P processors is EP = SP / P n Efficiency estimates how well-utilized the processors are in solving the problem, compared to how much effort is lost in idling and communication/synchronization n Ideal (or perfect) speedup means 100% efficiency EP = 1 n Many difficult-to-parallelize algorithms have efficiency that approaches zero as P increases Scalability n Speedup describes how the parallel algorithm’s performance changes with increasing P n Scalability concerns the efficiency of the algorithm with changing problem size N by choosing P dependent on N so that the efficiency of the algorithm is bounded below n Definition: an algorithm is scalable if there is minimal efficiency e > 0 such that given any problem size N there is a number of processors P(N) which tends to infinity as N tends to infinity, such that the efficiency EP(N) > e > 0 as N is made arbitrarily large Amdahl’s Law n Several factors can limit the speedup ¨ Processors may be idle ¨ Extra computations are performed in the parallel version ¨ Communication and synchronization overhead n Let f be the fraction of the computation that is sequential and cannot be divided into concurrent tasks Amdahl’s Law n Amdahl’s law states that the speedup given P processors is SP = ts / ( f ´ ts + (1-f)ts / P ) = P / ( 1 + (P-1)f ) n As a consequence, the maximum speedup is limited by SP ® f -1 as P ® ¥ Gustafson’s Law n Amdahl's law is based on a fixed workload or fixed problem size per processor, i.e. analyzes constant problem size scaling n Gustafson’s law defines the scaled speedup by keeping the parallel execution time constant (i.e. time-constrained scaling) by adjusting P as the problem size N changes SP,N = P + (1-P)a(N) where a(N) is the non-parallelizable fraction of the normalized parallel time tP,N = 1 given problem size N n To see this, let b(N) = 1- a(N) be the parallelizable fraction tP,N = a(N) + b(N) = 1 then, the scaled sequential time is ts,N = a(N) + P b(N) giving SP,N = a(N) + P (1- a(N)) = P + (1-P)a(N) HPC Spring 2017 Limitations to Speedup: Data Dependences n The Collatz iteration loop has a loop-carried dependence ¨ The value of n is carried over to the next iteration ¨ Therefore, the algorithm is inherently sequential n Loops with loop-carried dependences cannot be parallelized n To increase parallelism: ¨ Change the loops to remove dependences when possible ¨ Thread-privatize scalar variables in loops ¨ Apply algorithmic changes by rewriting the algorithm (this may slightly change the result of the output or even approximate it) Limitations to Speedup: Data Dependences n Consider for example the update step in a Gauss-Seidel iteration for solving a two-point boundary-value problem: do i=1,n soln(i)=(f(i)-soln(i+1)*offdiag(i) -soln(i-1)*offdiag(i-1))/diag(i) enddo n By contrast, the Jacobi iteration for solving a two-point boundary- value problem does not exhibit loop-carried dependences: do i=1,n snew(i)=(f(i)-soln(i+1)*offdiag(i) -soln(i-1)*offdiag(i-1))/diag(i) enddo do i=1,n soln(i)=snew(i) enddo n In this case the iteration space of the loops can be partitioned and each processor given a chunk of the iteration space Limitations to Speedup: Data Dependences 1. do i=1,n n Three example loops diag(i)=(1.0/h(i))+(1.0/h(i+1)) offdiag(i)=-(1.0/h(i+1)) to initialize a finite enddo difference matrix 2. do i=1,n dxo=1.0/h(i) n Which loop(s) can be dxi=1.0/h(i+1) parallelized? diag(i)=dxo+dxi offdiag(i)=-dxi enddo n Which loop probably runs more efficient on 3. dxi=1.0/h(1) a sequential do i=1,n machine? dxo=dxi dxi=1.0/h(i+1) diag(i)=dxo+dxi offdiag(i)=-dxi Limitations to Speedup: Data Dependences n The presence of dependences between events in two or more tasks also means that parallelization strategies to conserve energy are limited n Say we have two tasks, which can be split up into smaller parts to run in parallel, however here we have synchronization points to exchange dependent values: 1 CPU (f = 3 GHz) Task 1 Task 2 E µ f2 » 9 missed the deadline! 2 CPUs Task 1 Task 1 E µ (1+Δ) f 2 » 7.2 (f = 2 GHz) Task 2 Task 2 Task 2 where Δ = 80% is the CPU idle time t0 ts tp Limitations to Speedup: Data Dependences n The presence of dependences between events in two or more tasks also means that parallelization strategies to conserve energy are limited n Say we have two tasks, which can be split up into smaller parts to run in parallel, however: 1 CPU (f = 3 GHz) Task 1 Task 2 E µ f2 » 9 consumes more energy! 2 CPUs Task 1 Task 1 E µ (1+Δ) f 2 » 9.11 (f = 2.25 GHz) Task 2 Task 2 Task 2 where Δ = 80% is the CPU idle time t0 ts = tp Efficient Parallel Execution n Trying to construct a parallel version of an algorithm is not the end-all do-all of high-performance computing ¨ Recall Amdahl’s law: the maximum speedup is bounded by SP ® f -1 as P ® ¥ ¨ Thus, efficient execution of the non-parallel fraction f is extremely important ¨ Efficient execution of the parallel tasks is important since tasks may end at different times (slowest horse in front of the carriage) ¨ We can reduce f by improving the sequential code execution (e.g. algorithm initialization parts), I/O, communication, and synchronization n To achieve high performance, we should highly optimize the per-node sequential code and use profiling techniques to analyze the performance of our code to investigate the causes of overhead Limitations to Speedup: Locality n Memory hierarchy forms a barrier to performance when locality is poor n Temporal locality ¨ Same memory location accessed frequently and repeatedly ¨ Poor temporal locality results from frequent access to fresh new memory locations n Spatial locality ¨ Consecutive (or “sufficiently near”) memory locations are accessed ¨ Poor spatial locality means that memory locations are accessed in a more random pattern n Memory wall refers to the growing disparity of speed between CPU and memory outside the CPU chip Efficient Sequential Execution n Memory effects are the greatest concern for optimal sequential execution ¨ Store-load dependences, where data has to flow through memory ¨ Cache misses ¨ TLB misses ¨ Page faults n CPU resource effects can limit performance ¨ Limited number of floating point units ¨ Unpredictable branching (if-then-else, loops, etc) in the program n Use common sense when allocating and accessing data n Use compiler optimizations effectively n Execution best analyzed with performance analyzers Lessons Learned from the Past n Applications ¨ Parallel computing can transform science and engineering and answer challenges in society ¨ To port or not to port is NOT the question: a complete redesign of an application may be necessary ¨ The problem is not the hardware: hardware can be significantly underutilized when software and applications are suboptimal n Software and algorithms ¨ Portability remains elusive ¨ Parallelism isn’t everything ¨ Community acceptance is essential to the success of software ¨ Good commercial software is rare at the high end Any Questions ?