Lecture 1: Introduction to Parallel Computing

Summary

This lecture introduces the concepts of parallel computing, exploring its potential benefits and limitations. It discusses various aspects of parallelization, from the fundamental reasons for pursuing it to the challenges and constraints involved.

Full Transcript

Books

- [HPC] "Software Optimization for High Performance Computing: Creating Faster Applications" by K.R. Wadleigh and I.L. Crawford, Hewlett-Packard Professional Books, Prentice Hall.
- [OPT] "The Software Optimization Cookbook" (2nd ed.) by R. Gerber, A. Bik, K. Smith, and X. Tian, Intel Press.
- [PP2] "Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers" (2nd ed.) by B. Wilkinson and M. Allen, Prentice Hall.
- [PSC] "Parallel Scientific Computing in C++ and MPI" by G. Karniadakis and R. Kirby II, Cambridge University Press.
- [SPC] "Scientific Parallel Computing" by L.R. Scott, T. Clark, and B. Bagheri, Princeton University Press.
- [SRC] "Sourcebook of Parallel Computing" by J. Dongarra, I. Foster, G. Fox, W. Gropp, K. Kennedy, L. Torczon, and A. White (eds.), Morgan Kaufmann.

Introduction

- Why parallel?
- ... and why not!
- Benefit depends on speedup, efficiency, and scalability of parallel algorithms
- What are the limitations to parallel speedup?
- The future of computing
- Lessons learned
- Further reading

Why Parallel?

- It is not always obvious that a parallel algorithm has benefits, unless we want to do things ...
  - faster: doing the same amount of work in less time
  - bigger: doing more work in the same amount of time
- Both of these reasons can be argued to produce better results, which is the only meaningful outcome of program parallelization.

Faster, Bigger!

- There is an ever increasing demand for computational power to improve the speed or accuracy of solutions to real-world problems through faster computations and/or bigger simulations.
- Computations must be completed in acceptable time (real-time computation), hence must be "fast enough".

Faster, Bigger!

- An illustrative example: a weather prediction simulation should not take more time than the real event.
- Suppose the atmosphere of the earth is divided into 5×10^8 cubes, each 1×1×1 mile, stacked 10 miles high.
- It takes 200 floating-point operations per cube to complete one time step.
- 10^4 time steps are needed for a 7-day forecast.
- Then 10^15 floating-point operations must be performed.
- This takes 10^6 seconds (about 10 days) on a 1 GFLOPS machine.

Grand Challenge Problems

- Big problems
  - A "Grand Challenge" problem is a problem that cannot be solved in a reasonable amount of time with today's computers.
  - Examples of Grand Challenge problems:
    - Applied Fluid Dynamics
    - Meso- to Macro-Scale Environmental Modeling
    - Ecosystem Simulations
    - Biomedical Imaging and Biomechanics
    - Molecular Biology
    - Molecular Design and Process Optimization
    - Fundamental Computational Sciences
    - Nuclear power and weapons simulations

Physical Limits

- Which tasks are fundamentally too big to compute with one CPU?
- Suppose we have to calculate, in one second,

    for (i = 0; i < ONE_TRILLION; i++)
        z[i] = x[i] + y[i];

- Then we have to perform 3×10^12 memory moves per second.
- If data travels at the speed of light (3×10^8 m/s) between the CPU and memory, and r is the average distance between the CPU and memory, then r must satisfy 3×10^12 × r = 3×10^8 m/s × 1 s, which gives r = 10^-4 meters.
- To fit the data into a square so that the average distance from the CPU in the middle is r, the length of each memory cell would have to be 2×10^-4 m / √(3×10^12) ≈ 10^-10 m, which is the size of a relatively small atom!
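Restating the two back-of-envelope estimates above compactly (this only repeats the slides' arithmetic; the factor of 3 memory moves per iteration counts the loads of x[i] and y[i] and the store of z[i]):

    \[
      5\times 10^{8}\ \text{cubes} \times 200\ \text{flop} \times 10^{4}\ \text{steps}
        = 10^{15}\ \text{flop}, \qquad
      \frac{10^{15}\ \text{flop}}{10^{9}\ \text{flop/s}} = 10^{6}\ \text{s}
    \]
    \[
      3\times 10^{12}\cdot r = 3\times 10^{8}\ \text{m/s} \cdot 1\ \text{s}
        \;\Rightarrow\; r = 10^{-4}\ \text{m}, \qquad
      \text{cell length} \approx \frac{2\times 10^{-4}\ \text{m}}{\sqrt{3\times 10^{12}}}
        \approx 10^{-10}\ \text{m}
    \]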
CPU Energy Consumption and Heat Dissipation

- "[Without revolutionary new technologies ...] increased power requirements of newer chips will lead to CPUs that are hotter than the surface of the sun by 2010." (Intel CTO)
- Parallelism can help to conserve energy:
  [Figure: one CPU at f = 3 GHz runs Task 1 and Task 2 back to back, with E ∝ f^2 ≈ 9; two CPUs at f = 2 GHz run the two tasks concurrently and finish earlier (t_p < t_s), with E ∝ f^2 ≈ 4.]

Important Factors

- Important considerations in parallel computing:
  - Physical limitations: the speed of light, CPU heat dissipation
  - Economic factors: cheaper components can be used to achieve comparable levels of aggregate performance
  - Scalability: allows data to be subdivided over CPUs to obtain a better match between algorithms and resources to increase performance
  - Memory: allows aggregate memory bandwidth to be increased together with processing power at a reasonable cost

... and Why Not Parallel?

- Bad parallel programs can be worse than their sequential counterparts:
  - Slower: because of communication overhead
  - Scalability: some parallel algorithms are only faster when the problem size is very large
- Understand the problem and use common sense!
- Not all problems are amenable to parallelism.
- In this course we will also devote a significant part to non-parallel optimizations and how to get better performance from our sequential programs.

... and Why Not Parallel?

- Some algorithms are inherently sequential.
- Consider for example the Collatz conjecture, implemented by

    int Collatz(int n)
    {
        int step;
        for (step = 1; n != 1; step++) {
            if (n % 2 == 0)   // is n even?
                n = n / 2;
            else
                n = 3*n + 1;
        }
        return step;
    }

- Given n, Collatz returns the number of steps to reach n = 1.
- Conjecture: the algorithm terminates for any integer n > 0.
- This algorithm is clearly sequential.
- Note: given a vector of k values, we can compute k Collatz numbers in parallel.

Speedup

- Suppose we want to compute in parallel

    for (i = 0; i < N; i++)
        z[i] = x[i] + y[i];

- Then the obvious choice is to split the iteration space into P equal-sized chunks of N/P iterations and let the processors share the work of the loop (worksharing):

    for each processor p from 0 to P-1 do:
        for (i = p*N/P; i < (p+1)*(N/P); i++)
            z[i] = x[i] + y[i];

- We would assume that this parallel version runs P times faster, that is, we hope for linear speedup.
- Unfortunately, in practice this is typically not the case because of overhead in communication, resource contention, and synchronization.
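The worksharing pattern written out above is exactly what a parallel-loop construct automates. A minimal C/OpenMP version (OpenMP is not mentioned on this slide, so this is only an illustrative sketch):

    /* Worksharing version of the vector add: the OpenMP runtime splits the
       iteration space 0..N-1 over the available threads, much like the
       explicit p*N/P .. (p+1)*(N/P) chunking shown above. */
    void vector_add(int N, const double *x, const double *y, double *z)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            z[i] = x[i] + y[i];
    }

Compiled with OpenMP support (e.g. -fopenmp), timing this loop against the sequential version (for instance with omp_get_wtime()) shows how close the measured speedup gets to the hoped-for factor of P.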
Speedup

- Definition: the speedup of an algorithm using P processors is defined as

    S_P = t_s / t_P

  where t_s is the execution time of the best available sequential algorithm and t_P is the execution time of the parallel algorithm.
- The speedup is perfect or ideal if S_P = P.
- The speedup is linear if S_P ≈ P.
- The speedup is superlinear when, for some P, S_P > P, i.e. the speedup is larger than the number of processors.

Relative Speedup

- Definition: the relative speedup is defined as

    S^1_P = t_1 / t_P

  where t_1 is the execution time of the parallel algorithm on one processor.
- Similarly, S^k_P = t_k / t_P is the relative speedup with respect to k processors, where k < P.
- The relative speedup S^k_P is used when k is the smallest number of processors on which the problem will run.

An Example: Parallel Search

- Search in parallel by partitioning the search space into P chunks, one per processor; Δt denotes the time the parallel search needs to locate the item within its chunk.
- If the sequential search scans x chunks' worth of the space before reaching the item, then

    S_P = ((x × t_s / P) + Δt) / Δt

- Worst case for sequential search (item in the last chunk): S_P → ∞ as Δt tends to zero.
- Best case for sequential search (item in the first chunk): S_P = 1.

Effects that can Cause Superlinear Speedup

- Cache effects: when data is partitioned and distributed over P processors, the data assigned to each processor is (much) smaller and may fit entirely in its data cache.
- For an algorithm with linear speedup, the extra reduction in cache misses may lead to superlinear speedup.

Efficiency

- Definition: the efficiency of an algorithm using P processors is

    E_P = S_P / P

- Efficiency estimates how well-utilized the processors are in solving the problem, compared to how much effort is lost in idling and communication/synchronization.
- Ideal (or perfect) speedup means 100% efficiency, E_P = 1.
- Many difficult-to-parallelize algorithms have efficiency that approaches zero as P increases.

Scalability

- Speedup describes how the parallel algorithm's performance changes with increasing P.
- Scalability concerns the efficiency of the algorithm with changing problem size N, by choosing P dependent on N so that the efficiency of the algorithm is bounded from below.
- Definition: an algorithm is scalable if there is a minimal efficiency ε > 0 such that, given any problem size N, there is a number of processors P(N), which tends to infinity as N tends to infinity, such that the efficiency E_P(N) > ε as N is made arbitrarily large.

Amdahl's Law

- Several factors can limit the speedup:
  - Processors may be idle
  - Extra computations are performed in the parallel version
  - Communication and synchronization overhead
- Let f be the fraction of the computation that is sequential and cannot be divided into concurrent tasks.

Amdahl's Law

- Amdahl's law states that the speedup given P processors is

    S_P = t_s / (f × t_s + (1 - f) × t_s / P) = P / (1 + (P - 1) f)

- As a consequence, the maximum speedup is limited by S_P → 1/f as P → ∞.
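A quick worked example of Amdahl's bound (the sequential fraction f = 0.05 is chosen purely for illustration):

    \[
      S_{10} = \frac{10}{1 + 9 \cdot 0.05} \approx 6.9, \qquad
      S_{100} = \frac{100}{1 + 99 \cdot 0.05} \approx 16.8, \qquad
      \lim_{P \to \infty} S_P = \frac{1}{f} = 20
    \]

Even with only 5% sequential work, no number of processors can push the speedup past 20.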
Gustafson's Law

- Amdahl's law is based on a fixed workload or fixed problem size per processor, i.e. it analyzes constant-problem-size scaling.
- Gustafson's law defines the scaled speedup by keeping the parallel execution time constant (time-constrained scaling), adjusting P as the problem size N changes:

    S_{P,N} = P + (1 - P) α(N)

  where α(N) is the non-parallelizable fraction of the normalized parallel time t_{P,N} = 1 given problem size N.
- To see this, let β(N) = 1 - α(N) be the parallelizable fraction, so that

    t_{P,N} = α(N) + β(N) = 1

  Then the scaled sequential time is

    t_{s,N} = α(N) + P β(N)

  giving

    S_{P,N} = α(N) + P (1 - α(N)) = P + (1 - P) α(N)

Limitations to Speedup: Data Dependences

- The Collatz iteration loop has a loop-carried dependence:
  - The value of n is carried over to the next iteration.
  - Therefore, the algorithm is inherently sequential.
- Loops with loop-carried dependences cannot be parallelized.
- To increase parallelism:
  - Change the loops to remove dependences when possible
  - Thread-privatize scalar variables in loops
  - Apply algorithmic changes by rewriting the algorithm (this may slightly change the result of the output or even approximate it)

Limitations to Speedup: Data Dependences

- Consider for example the update step in a Gauss-Seidel iteration for solving a two-point boundary-value problem, which carries the freshly updated soln(i-1) into the next iteration:

    do i=1,n
      soln(i)=(f(i)-soln(i+1)*offdiag(i)
              -soln(i-1)*offdiag(i-1))/diag(i)
    enddo

- By contrast, the Jacobi iteration for solving a two-point boundary-value problem does not exhibit loop-carried dependences:

    do i=1,n
      snew(i)=(f(i)-soln(i+1)*offdiag(i)
              -soln(i-1)*offdiag(i-1))/diag(i)
    enddo
    do i=1,n
      soln(i)=snew(i)
    enddo

- In this case the iteration space of the loops can be partitioned and each processor given a chunk of the iteration space.

Limitations to Speedup: Data Dependences

- Three example loops to initialize a finite difference matrix:

    1. do i=1,n
         diag(i)=(1.0/h(i))+(1.0/h(i+1))
         offdiag(i)=-(1.0/h(i+1))
       enddo

    2. do i=1,n
         dxo=1.0/h(i)
         dxi=1.0/h(i+1)
         diag(i)=dxo+dxi
         offdiag(i)=-dxi
       enddo

    3. dxi=1.0/h(1)
       do i=1,n
         dxo=dxi
         dxi=1.0/h(i+1)
         diag(i)=dxo+dxi
         offdiag(i)=-dxi
       enddo

- Which loop(s) can be parallelized?
- Which loop probably runs more efficiently on a sequential machine?

Limitations to Speedup: Data Dependences

- The presence of dependences between events in two or more tasks also means that parallelization strategies to conserve energy are limited.
- Say we have two tasks which can be split up into smaller parts to run in parallel, but with synchronization points to exchange dependent values:
  [Figure: one CPU at f = 3 GHz runs Task 1 and Task 2 back to back, with E ∝ f^2 ≈ 9; two CPUs at f = 2 GHz run the parts of Task 1 and Task 2 with synchronization in between, E ∝ (1 + Δ) f^2 ≈ 7.2 where Δ = 80% is the CPU idle time, but they miss the deadline (t_p > t_s).]

Limitations to Speedup: Data Dependences

- With the same dependences, raising the clock frequency so that the deadline is met cancels the energy savings:
  [Figure: one CPU at f = 3 GHz gives E ∝ f^2 ≈ 9; two CPUs at f = 2.25 GHz finish exactly on time (t_s = t_p) but consume more energy, E ∝ (1 + Δ) f^2 ≈ 9.11, where Δ = 80% is the CPU idle time.]
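For reference, the energy figures on the two slides above follow directly from the slides' E ∝ (1 + Δ) f^2 model, with Δ = 0.8 and f in GHz:

    \[
      (1 + 0.8) \cdot 2^{2} = 7.2, \qquad
      (1 + 0.8) \cdot 2.25^{2} \approx 9.11, \qquad
      \text{compared with } 3^{2} = 9 \text{ for the single 3 GHz CPU}
    \]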
Efficient Parallel Execution

- Constructing a parallel version of an algorithm is not the be-all and end-all of high-performance computing:
  - Recall Amdahl's law: the maximum speedup is bounded by S_P → 1/f as P → ∞.
  - Thus, efficient execution of the non-parallel fraction f is extremely important.
  - Efficient execution of the parallel tasks is also important, since tasks may end at different times (the slowest horse is in front of the carriage).
  - We can reduce f by improving the sequential code execution (e.g. algorithm initialization parts), I/O, communication, and synchronization.
- To achieve high performance, we should highly optimize the per-node sequential code and use profiling techniques to analyze the performance of our code and investigate the causes of overhead.

Limitations to Speedup: Locality

- The memory hierarchy forms a barrier to performance when locality is poor.
- Temporal locality:
  - The same memory location is accessed frequently and repeatedly.
  - Poor temporal locality results from frequent accesses to fresh new memory locations.
- Spatial locality:
  - Consecutive (or "sufficiently near") memory locations are accessed.
  - Poor spatial locality means that memory locations are accessed in a more random pattern.
- The memory wall refers to the growing disparity in speed between the CPU and the memory outside the CPU chip.

Efficient Sequential Execution

- Memory effects are the greatest concern for optimal sequential execution:
  - Store-load dependences, where data has to flow through memory
  - Cache misses
  - TLB misses
  - Page faults
- CPU resource effects can also limit performance:
  - Limited number of floating-point units
  - Unpredictable branching (if-then-else, loops, etc.) in the program
- Use common sense when allocating and accessing data.
- Use compiler optimizations effectively.
- Execution is best analyzed with performance analyzers.

Lessons Learned from the Past

- Applications
  - Parallel computing can transform science and engineering and answer challenges in society.
  - To port or not to port is NOT the question: a complete redesign of an application may be necessary.
  - The problem is not the hardware: hardware can be significantly underutilized when software and applications are suboptimal.
- Software and algorithms
  - Portability remains elusive.
  - Parallelism isn't everything.
  - Community acceptance is essential to the success of software.
  - Good commercial software is rare at the high end.

Any Questions?
