HPC-7lec-merged (Excluded Slides) PDF
Summary
These notes cover an introduction to high-performance computing: the motivations for parallel computation (speed and problem size), its limitations (physical limits, communication overhead, data dependences), and the key concepts of speedup, efficiency, scalability, and Amdahl's and Gustafson's laws. They then turn to uniprocessor performance (pipelining, superscalar execution, instruction latency, the memory hierarchy, caches, and the TLB) and to multiprocessors and parallel programming patterns such as master/worker examples, processor farms, and load balancing. The focus is on understanding when and how parallel programming can be applied.
Full Transcript
Introduction

Books
- [HPC] "Software Optimization for High Performance Computing: Creating Faster Applications" by K.R. Wadleigh and I.L. Crawford, Hewlett-Packard professional books, Prentice Hall.
- [OPT] "The Software Optimization Cookbook" (2nd ed.) by R. Gerber, A. Bik, K. Smith, and X. Tian, Intel Press.
- [PP2] "Parallel Programming: Techniques and Applications using Networked Workstations and Parallel Computers" (2nd ed.) by B. Wilkinson and M. Allen, Prentice Hall.
- [PSC] "Parallel Scientific Computing in C++ and MPI" by G. Karniadakis and R. Kirby II, Cambridge University Press.
- [SPC] "Scientific Parallel Computing" by L.R. Scott, T. Clark, and B. Bagheri, Princeton University Press.
- [SRC] "Sourcebook of Parallel Programming" by J. Dongarra, I. Foster, G. Fox, W. Gropp, K. Kennedy, L. Torczon, and A. White (eds), Morgan Kaufmann.

Introduction
- Why parallel?
- … and why not!
- Benefit depends on speedup, efficiency, and scalability of parallel algorithms
- What are the limitations to parallel speedup?
- The future of computing
- Lessons learned
- Further reading

Why Parallel?
- It is not always obvious that a parallel algorithm has benefits, unless we want to do things
  - faster: doing the same amount of work in less time
  - bigger: doing more work in the same amount of time
- Both of these reasons can be argued to produce better results, which is the only meaningful outcome of program parallelization

Faster, Bigger!
- There is an ever increasing demand for computational power to improve the speed or accuracy of solutions to real-world problems through faster computations and/or bigger simulations
- Computations must be completed in acceptable time (real-time computation), hence must be "fast enough"

Faster, Bigger!
- An illustrative example: a weather prediction simulation should not take more time than the real event
- Suppose the atmosphere of the earth is divided into 5×10^8 cubes, each 1×1×1 mile and stacked 10 miles high
- It takes 200 floating point operations per cube to complete one time step
- 10^4 time steps are needed for a 7 day forecast
- Then 10^15 floating point operations must be performed
- This takes 10^6 seconds (about 10 days) on a 1 GFLOP machine

Grand Challenge Problems
- Big problems
  - A "Grand Challenge" problem is a problem that cannot be solved in a reasonable amount of time with today's computers
  - Examples of Grand Challenge problems:
    - Applied Fluid Dynamics
    - Meso- to Macro-Scale Environmental Modeling
    - Ecosystem Simulations
    - Biomedical Imaging and Biomechanics
    - Molecular Biology
    - Molecular Design and Process Optimization
    - Fundamental Computational Sciences
    - Nuclear power and weapons simulations

Physical Limits
- Which tasks are fundamentally too big to compute with one CPU?
- Suppose we have to calculate in one second
    for (i = 0; i < ONE_TRILLION; i++)
      z[i] = x[i] + y[i];
- Then we have to perform 3×10^12 memory moves per second
- If data travels at the speed of light (3×10^8 m/s) between the CPU and memory, and r is the average distance between the CPU and memory, then r must satisfy 3×10^12 × r = 3×10^8 m/s × 1 s, which gives r = 10^-4 meters
- To fit the data into a square so that the average distance from the CPU in the middle is r, the length of each memory cell would have to be 2×10^-4 m / (√3 × 10^6) ≈ 10^-10 m, which is the size of a relatively small atom!
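The arithmetic behind both back-of-the-envelope arguments above can be checked with a few lines of C. The sketch below only reuses the figures quoted in the slides (cube count, operations per cube, time steps, a 1 GFLOP/s machine, the speed of light); the variable names are mine.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* Weather forecast example from the slides */
        double cubes = 5e8;           /* 1x1x1 mile cells, stacked 10 miles high */
        double flops_per_cube = 200;  /* floating point operations per time step */
        double steps = 1e4;           /* time steps for a 7 day forecast         */
        double total = cubes * flops_per_cube * steps;  /* = 1e15 operations     */
        double seconds = total / 1e9;                   /* on a 1 GFLOP/s machine */
        printf("operations: %.3g, time: %.3g s (about %.0f days)\n",
               total, seconds, seconds / 86400.0);      /* slides round to ~10 days */

        /* Physical-limit example: 3e12 memory moves/s at the speed of light */
        double moves = 3e12, c = 3e8;
        double r = c / moves;                /* average CPU-memory distance: 1e-4 m */
        double cell = 2.0 * r / sqrt(3e12);  /* memory cell size: about 1e-10 m     */
        printf("average distance r = %.1e m, cell size = %.1e m\n", r, cell);
        return 0;
    }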
CPU Energy Consumption and Heat Dissipation
- "[without revolutionary new technologies …] increased power requirements of newer chips will lead to CPUs that are hotter than the surface of the sun by 2010." – Intel CTO
- Parallelism can help to conserve energy:
  - 1 CPU (f = 3 GHz) running Task 1 then Task 2: E ∝ f² ≈ 9
  - 2 CPUs (f = 2 GHz) running Task 1 and Task 2 in parallel: E ∝ f² ≈ 4
  (Figure: execution timeline from t_0 to t_p and t_s.)

Important Factors
- Important considerations in parallel computing
  - Physical limitations: the speed of light, CPU heat dissipation
  - Economic factors: cheaper components can be used to achieve comparable levels of aggregate performance
  - Scalability: allows data to be subdivided over CPUs to obtain a better match between algorithms and resources to increase performance
  - Memory: allows aggregate memory bandwidth to be increased together with processing power at a reasonable cost

… and Why not Parallel?
- Bad parallel programs can be worse than their sequential counterparts
  - Slower: because of communication overhead
  - Scalability: some parallel algorithms are only faster when the problem size is very large
- Understand the problem and use common sense!
- Not all problems are amenable to parallelism
- In this course we will also focus a significant part on non-parallel optimizations and how to get better performance from our sequential programs

… and Why not Parallel?
- Some algorithms are inherently sequential
- Consider for example the Collatz conjecture, implemented by
    int Collatz(int n)
    { int step;
      for (step = 1; n != 1; step++)
      { if (n % 2 == 0)   // is n even?
          n = n / 2;
        else
          n = 3*n + 1;
      }
      return step;
    }
- Given n, Collatz returns the number of steps to reach n = 1
- Conjecture: the algorithm terminates for any integer n > 0
- This algorithm is clearly sequential
- Note: given a vector of k values, we can compute k Collatz numbers in parallel

Speedup
- Suppose we want to compute in parallel
    for (i = 0; i < N; i++)
      z[i] = x[i] + y[i];
- Then the obvious choice is to split the iteration space into P equal-sized chunks of N/P iterations and let each processor share the work (worksharing) of the loop (a threads-based sketch of this scheme is given below):
    for each processor p from 0 to P-1 do:
      for (i = p*N/P; i < (p+1)*(N/P); i++)
        z[i] = x[i] + y[i];
- We would assume that this parallel version runs P times faster, that is, we hope for linear speedup
- Unfortunately, in practice this is typically not the case because of overhead in communication, resource contention, and synchronization

Speedup
- Definition: the speedup of an algorithm using P processors is defined as S_P = t_s / t_P, where t_s is the execution time of the best available sequential algorithm and t_P is the execution time of the parallel algorithm
- The speedup is perfect or ideal if S_P = P
- The speedup is linear if S_P ≈ P
- The speedup is superlinear when, for some P, S_P > P

Relative Speedup
- Definition: the relative speedup is defined as S^1_P = t_1 / t_P, where t_1 is the execution time of the parallel algorithm on one processor
- Similarly, S^k_P = t_k / t_P is the relative speedup with respect to k processors, where k < P
- The relative speedup S^k_P is used when k is the smallest number of processors on which the problem will run

An Example: Parallel Search
- Search in parallel by partitioning the search space into P chunks
- S_P = ((x × t_s/P) + Δt) / Δt
- Worst case for sequential search (item in last chunk): S_P → ∞ as Δt tends to zero
- Best case for sequential search (item in first chunk): S_P = 1
  (Figure: sequential search compared with parallel search over P chunks.)
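The blocked worksharing scheme above is given only as pseudocode; one possible shared-memory realization uses POSIX threads, as sketched below. The thread count P, the problem size N, and the assumption that N is divisible by P are illustrative choices, not part of the course material.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000     /* problem size (assumed divisible by P) */
    #define P 4           /* number of worker threads              */

    static double x[N], y[N], z[N];

    /* Each worker adds its own contiguous chunk of the iteration space,
       exactly as in the pseudocode: iterations p*N/P .. (p+1)*N/P - 1.  */
    static void *worker(void *arg)
    {
        long p = (long)arg;
        long lo = p * (N / P), hi = (p + 1) * (N / P);
        for (long i = lo; i < hi; i++)
            z[i] = x[i] + y[i];
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[P];
        for (long i = 0; i < N; i++) { x[i] = i; y[i] = 2.0 * i; }
        for (long p = 0; p < P; p++)
            pthread_create(&tid[p], NULL, worker, (void *)p);
        for (long p = 0; p < P; p++)
            pthread_join(tid[p], NULL);
        printf("z[N-1] = %g\n", z[N - 1]);   /* expect 3*(N-1) */
        return 0;
    }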
Effects that can Cause Superlinear Speedup (S_P > P)
- Cache effects: when data is partitioned and distributed over P processors, the individual data items are (much) smaller and may fit entirely in the data cache of each processor
- For an algorithm with linear speedup, the extra reduction in cache misses may lead to superlinear speedup

Efficiency
- Definition: the efficiency of an algorithm using P processors is E_P = S_P / P
- Efficiency estimates how well-utilized the processors are in solving the problem, compared to how much effort is lost in idling and communication/synchronization
- Ideal (or perfect) speedup means 100% efficiency, E_P = 1
- Many difficult-to-parallelize algorithms have efficiency that approaches zero as P increases

Scalability
- Speedup describes how the parallel algorithm's performance changes with increasing P
- Scalability concerns the efficiency of the algorithm with changing problem size N, by choosing P dependent on N so that the efficiency of the algorithm is bounded below
- Definition: an algorithm is scalable if there is a minimal efficiency ε > 0 such that for any problem size N there is a number of processors P(N), tending to infinity as N tends to infinity, such that the efficiency E_P(N) > ε > 0 as N is made arbitrarily large

Amdahl's Law
- Several factors can limit the speedup
  - Processors may be idle
  - Extra computations are performed in the parallel version
  - Communication and synchronization overhead
- Let f be the fraction of the computation that is sequential and cannot be divided into concurrent tasks

Amdahl's Law
- Amdahl's law states that the speedup given P processors is
    S_P = t_s / (f·t_s + (1-f)·t_s / P) = P / (1 + (P-1)·f)
- As a consequence, the maximum speedup is limited by S_P → 1/f as P → ∞

Gustafson's Law
- Amdahl's law is based on a fixed workload (fixed total problem size), i.e. it analyzes constant-problem-size scaling
- Gustafson's law defines the scaled speedup by keeping the parallel execution time constant (time-constrained scaling), adjusting P as the problem size N changes:
    S_{P,N} = P + (1-P)·α(N)
  where α(N) is the non-parallelizable fraction of the normalized parallel time t_{P,N} = 1 given problem size N
- To see this, let β(N) = 1 - α(N) be the parallelizable fraction, so that
    t_{P,N} = α(N) + β(N) = 1
  Then the scaled sequential time is
    t_{s,N} = α(N) + P·β(N)
  giving
    S_{P,N} = α(N) + P·(1 - α(N)) = P + (1-P)·α(N)
  (A small C sketch evaluating both laws follows below.)

Limitations to Speedup: Data Dependences
- The Collatz iteration loop has a loop-carried dependence
  - The value of n is carried over to the next iteration
  - Therefore, the algorithm is inherently sequential
- Loops with loop-carried dependences cannot be parallelized
- To increase parallelism:
  - Change the loops to remove dependences when possible
  - Thread-privatize scalar variables in loops
  - Apply algorithmic changes by rewriting the algorithm (this may slightly change the result of the output or even approximate it)

Limitations to Speedup: Data Dependences
- Consider for example the update step in a Gauss-Seidel iteration for solving a two-point boundary-value problem:
    do i=1,n
      soln(i)=(f(i)-soln(i+1)*offdiag(i)
              -soln(i-1)*offdiag(i-1))/diag(i)
    enddo
- By contrast, the Jacobi iteration for solving a two-point boundary-value problem does not exhibit loop-carried dependences:
    do i=1,n
      snew(i)=(f(i)-soln(i+1)*offdiag(i)
              -soln(i-1)*offdiag(i-1))/diag(i)
    enddo
    do i=1,n
      soln(i)=snew(i)
    enddo
- In this case the iteration space of the loops can be partitioned and each processor given a chunk of the iteration space
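To compare the two laws side by side, the small sketch below evaluates both formulas exactly as defined above. Using the same value for Amdahl's f and Gustafson's α(N), and the 5% figure itself, are arbitrary illustrative choices.

    #include <stdio.h>

    /* Amdahl: S_P = P / (1 + (P-1)f), with f the sequential fraction of the work */
    static double amdahl(double f, int P) { return P / (1.0 + (P - 1) * f); }

    /* Gustafson: S_{P,N} = P + (1-P)*alpha, with alpha the sequential fraction
       of the normalized parallel time t_{P,N} = 1                              */
    static double gustafson(double alpha, int P) { return P + (1 - P) * alpha; }

    int main(void)
    {
        double f = 0.05;   /* example: 5% sequential */
        for (int P = 1; P <= 1024; P *= 4)
            printf("P=%4d  Amdahl=%6.2f  Gustafson=%7.2f\n",
                   P, amdahl(f, P), gustafson(f, P));
        /* Amdahl saturates near 1/f = 20; Gustafson keeps growing with P */
        return 0;
    }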
Limitations to Speedup: Data Dependences
- Three example loops to initialize a finite difference matrix:
  1.
    do i=1,n
      diag(i)=(1.0/h(i))+(1.0/h(i+1))
      offdiag(i)=-(1.0/h(i+1))
    enddo
  2.
    do i=1,n
      dxo=1.0/h(i)
      dxi=1.0/h(i+1)
      diag(i)=dxo+dxi
      offdiag(i)=-dxi
    enddo
  3.
    dxi=1.0/h(1)
    do i=1,n
      dxo=dxi
      dxi=1.0/h(i+1)
      diag(i)=dxo+dxi
      offdiag(i)=-dxi
    enddo
- Which loop(s) can be parallelized?
- Which loop probably runs more efficiently on a sequential machine?

Limitations to Speedup: Data Dependences
- The presence of dependences between events in two or more tasks also means that parallelization strategies to conserve energy are limited
- Say we have two tasks which can be split up into smaller parts to run in parallel, but with synchronization points to exchange dependent values:
  - 1 CPU (f = 3 GHz) running Task 1 then Task 2: E ∝ f² ≈ 9
  - 2 CPUs (f = 2 GHz) running the task parts in parallel: E ∝ (1+Δ)·f² ≈ 7.2, where Δ = 80% is the CPU idle time — but the deadline is missed (t_p > t_s)!
  (Figure: execution timeline from t_0 to t_s and t_p.)

Limitations to Speedup: Data Dependences
- The presence of dependences between events in two or more tasks also means that parallelization strategies to conserve energy are limited
- Say we have two tasks which can be split up into smaller parts to run in parallel, however:
  - 1 CPU (f = 3 GHz) running Task 1 then Task 2: E ∝ f² ≈ 9
  - 2 CPUs (f = 2.25 GHz) running the task parts in parallel: E ∝ (1+Δ)·f² ≈ 9.11, where Δ = 80% is the CPU idle time — the deadline is met (t_s = t_p), but more energy is consumed!
  (A short check of these energy figures follows below.)
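The energy figures in the scenarios above follow directly from the E ∝ (1+Δ)f² model; the sketch below simply recomputes them, normalizing the proportionality constant to 1.

    #include <stdio.h>

    /* Relative energy under the slides' model: E proportional to (1+idle)*f^2,
       with the proportionality constant normalized to 1.                       */
    static double energy(double f_ghz, double idle) { return (1.0 + idle) * f_ghz * f_ghz; }

    int main(void)
    {
        printf("1 CPU,  f=3.00 GHz, no idle  : E = %.2f\n", energy(3.00, 0.0));  /* 9.00 */
        printf("2 CPUs, f=2.00 GHz, 80%% idle: E = %.2f\n", energy(2.00, 0.8));  /* 7.20 */
        printf("2 CPUs, f=2.25 GHz, 80%% idle: E = %.2f\n", energy(2.25, 0.8));  /* 9.11 */
        return 0;
    }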
Efficient Parallel Execution
- Trying to construct a parallel version of an algorithm is not the end-all do-all of high-performance computing
  - Recall Amdahl's law: the maximum speedup is bounded by S_P → 1/f as P → ∞
  - Thus, efficient execution of the non-parallel fraction f is extremely important
  - Efficient execution of the parallel tasks is important, since tasks may end at different times (the slowest horse in front of the carriage)
  - We can reduce f by improving the sequential code execution (e.g. algorithm initialization parts), I/O, communication, and synchronization
- To achieve high performance, we should highly optimize the per-node sequential code and use profiling techniques to analyze the performance of our code and investigate the causes of overhead

Limitations to Speedup: Locality
- The memory hierarchy forms a barrier to performance when locality is poor
- Temporal locality
  - The same memory location is accessed frequently and repeatedly
  - Poor temporal locality results from frequent access to fresh new memory locations
- Spatial locality
  - Consecutive (or "sufficiently near") memory locations are accessed
  - Poor spatial locality means that memory locations are accessed in a more random pattern (a loop-order example follows below)
- The memory wall refers to the growing disparity of speed between the CPU and memory outside the CPU chip

Efficient Sequential Execution
- Memory effects are the greatest concern for optimal sequential execution
  - Store-load dependences, where data has to flow through memory
  - Cache misses
  - TLB misses
  - Page faults
- CPU resource effects can limit performance
  - Limited number of floating point units
  - Unpredictable branching (if-then-else, loops, etc.) in the program
- Use common sense when allocating and accessing data
- Use compiler optimizations effectively
- Execution is best analyzed with performance analyzers

Lessons Learned from the Past
- Applications
  - Parallel computing can transform science and engineering and answer challenges in society
  - To port or not to port is NOT the question: a complete redesign of an application may be necessary
  - The problem is not the hardware: hardware can be significantly underutilized when software and applications are suboptimal
- Software and algorithms
  - Portability remains elusive
  - Parallelism isn't everything
  - Community acceptance is essential to the success of software
  - Good commercial software is rare at the high end

Any Questions?

Uniprocessors

Overview
PART I: Uniprocessors and Compiler Optimizations
- Uniprocessors
  - Processor architectures
  - Instruction set architectures
  - Instruction scheduling and execution
- Data storage
  - Memory hierarchy
  - Caches, TLB
- Compiler optimizations
PART II: Multiprocessors and Parallel Programming Models
- Multiprocessors
  - Pipeline and vector machines
  - Shared memory
  - Distributed memory
  - Message passing
- Parallel programming models
  - Shared vs distributed memory
  - Hybrid
  - BSP

Levels of Parallelism
(Figure: levels of parallelism, from superscalar processors (fine grain) to multicore processors (stacked, fine grain to coarse grain).)

Superscalar Architectures
(Figure: superscalar architectures.)

Instruction Pipeline
- An instruction pipeline increases the instruction bandwidth
- Classic 5-stage pipeline:
  - IF: instruction fetch
  - ID: instruction decode
  - EX: execute (functional units)
  - MEM: load/store
  - WB: write back to registers and forward results into the pipeline when needed by another instruction
- Five instructions are in the pipeline at different stages

Instruction Pipeline Example
- Example 4-stage pipeline
- Four instructions (green, purple, blue, red) are processed in this order:
  - Cycle 0: instructions are waiting
  - Cycle 1: green is fetched
  - Cycle 2: purple is fetched, green decoded
  - Cycle 3: blue is fetched, purple decoded, green executed
  - Cycle 4: red is fetched, blue decoded, purple executed, green in write-back
  - Etc.
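Spatial locality is easiest to see with a two-dimensional array, since C stores arrays in row-major order. The sketch below contrasts a row-wise traversal (consecutive addresses, good spatial locality) with a column-wise traversal (large strides, poor spatial locality); the array size and the summing workload are arbitrary illustrations, not taken from the slides.

    #include <stdio.h>

    #define ROWS 2048
    #define COLS 2048

    static double a[ROWS][COLS];   /* row-major: a[i][j] and a[i][j+1] are adjacent */

    int main(void)
    {
        double sum = 0.0;

        /* Good spatial locality: the inner loop walks consecutive addresses,
           so each fetched cache line is fully used before it is evicted.     */
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                sum += a[i][j];

        /* Poor spatial locality: the inner loop strides by COLS*sizeof(double)
           bytes, touching a new cache line (and possibly TLB page) each time. */
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                sum += a[i][j];

        printf("%f\n", sum);   /* keep the compiler from removing the loops */
        return 0;
    }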
Instruction Pipeline Hazards
- Pipeline hazards
  - From data dependences
  - Forwarding in the WB stage can eliminate some data dependence hazards
  - From instruction fetch latencies (e.g. I-cache miss)
  - From memory load latencies (e.g. D-cache miss)
- A hazard is resolved by stalling the pipeline, which causes a bubble of one or more cycles
- Example
  - Suppose a stall of one cycle occurs in the IF stage of the purple instruction
  - Cycle 3: purple cannot be decoded and a no-operation (NOP) is inserted

N-way Superscalar
- When instructions have the same word size
  - Can fetch multiple instructions without having to know the instruction content of each
- N-way superscalar processors fetch N instructions each cycle
  - Increases the instruction-level parallelism
  (Figure: 2-way superscalar RISC pipeline.)

The Instruction Set in a Superscalar Architecture
- Pentium processors translate CISC instructions to RISC-like µOps
- Advantages:
  - Higher instruction bandwidth
  - Maintains instruction set architecture (ISA) compatibility
- The Pentium 4 has a 31-stage pipeline divided into three main stages:
  - Fetch and decode
  - Execution
  - Retirement
  (Figure: simplified block diagram of the Intel Pentium 4.)

Instruction Fetch and Decode
- The Pentium 4 decodes instructions into µOps and deposits the µOps in a trace cache
  - Allows the processor to fetch the µOps trace of an instruction that is executed again (e.g. in a loop)
- Instructions are fetched:
  - Normally in the same order as stored in memory
  - Or fetched from branch targets predicted by the branch prediction unit
- The Pentium 4 only decodes one instruction per cycle, but can deliver up to three µOps per cycle to the execution stage
- RISC architectures typically fetch multiple instructions per cycle

Instruction Execution Stage
- Executes multiple µOps in parallel
  - Instruction-level parallelism (ILP)
- The scheduler marks a µOp for execution when all operands of the µOp are ready
  - The µOps on which a µOp depends must be executed first
  - A µOp can be executed out of the order in which it appeared
- Pentium 4: a µOp is re-executed when its operands were not ready
- On the Pentium 4 there are 4 ports to send a µOp into
- Each port has one or more fixed execution units
  - Port 0: ALU0, FPMOV
  - Port 1: ALU1, INT, FPEXE
  - Port 2: LOAD
  - Port 3: STORE

Retirement
- Looks for instructions to mark completed
  - Are all µOps of the instruction executed?
  - Are all µOps of the preceding instruction retired? (putting instructions back in order)
- Notifies the branch prediction unit when a branch was incorrectly predicted
  - The processor stops executing the wrongly predicted instructions and discards them (takes ≈10 cycles)
- The Pentium 4 retires up to 3 instructions per clock cycle

Software Optimization to Increase CPU Throughput
- Processors run at maximum speed (a high instructions-per-cycle rate (IPC)) when
  1. There is a good mix of instructions (with low latencies) to keep the functional units busy
  2. Operands are available quickly from registers or the D-cache
  3. The FP to memory operation ratio is high (FP : MEM > 1)
  4. The number of data dependences is low
  5. Branches are easy to predict
- The processor can only improve #1 to a certain level with out-of-order scheduling, and partly #2 with hardware prefetching
- Compiler optimizations effectively target #1-3
- The programmer can help improve #1-5 (a small example for #1 and #4 follows below)
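One way a programmer can help with points #1 and #4 above is to break long dependence chains so that independent operations can keep the functional units busy. The sketch below contrasts a single-accumulator sum (every add depends on the previous one) with a four-accumulator version; the function names and the choice of four accumulators are illustrative assumptions, and reassociating the floating-point sum slightly changes rounding.

    #include <stdio.h>
    #include <stddef.h>

    /* Single accumulator: every add depends on the previous one, so the FP add
       latency (about 3 cycles on the processors discussed above) is exposed.   */
    static double sum1(const double *x, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += x[i];
        return s;
    }

    /* Four independent accumulators: up to four adds can be in flight at once,
       hiding the latency of each individual add.                               */
    static double sum4(const double *x, size_t n)
    {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += x[i];     s1 += x[i + 1];
            s2 += x[i + 2]; s3 += x[i + 3];
        }
        for (; i < n; i++)          /* leftover elements */
            s0 += x[i];
        return (s0 + s1) + (s2 + s3);
    }

    int main(void)
    {
        static double x[1 << 20];
        for (size_t i = 0; i < sizeof(x) / sizeof(x[0]); i++)
            x[i] = 1.0;
        printf("%f %f\n", sum1(x, 1 << 20), sum4(x, 1 << 20));
        return 0;
    }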
Instruction Latency and Throughput
- Latency: the number of clocks to complete an instruction when all of its inputs are ready
- Throughput: the number of clocks to wait before starting an identical instruction
  - Identical instructions are those that use the same execution unit
- The example a=u*v; b=w*x; c=y*z shows three multiply operations, assuming there is only one multiply execution unit
- Actual typical latencies (in cycles):
  - Integer add: 1
  - FP add: 3
  - FP multiplication: 3
  - FP division: 31

Instruction Latency Case Study
- Consider two versions of Euclid's GCD algorithm
  1. Modulo version: the modulo operation has a high latency, but few iterations are needed
  2. Repetitive subtraction version: subtraction has a small latency, but many more iterations are needed
- Which is faster?
    int find_gcf1(int a, int b)        /* modulo version */
    { while (1)
      { a = a % b;
        if (a == 0) return b;
        if (a == 1) return 1;
        b = b % a;
        if (b == 0) return a;
        if (b == 1) return 1;
      }
    }

    int find_gcf2(int a, int b)        /* repetitive subtraction version */
    { while (1)
      { if (a > b)
          a = a - b;
        else if (a < b)
          b = b - a;
        else
          return a;
      }
    }

Instruction Latency Case Study
- Cycles estimated for the case a=48 and b=40:

    Modulo version                      Repetitive subtraction
    Instruction   #  Latency  Cycles    Instruction   #  Latency  Cycles
    Modulo        2       68     136    Subtract      5        1       5
    Compare       3        1       3    Compare       5        1       5
    Branch        3        1       3    Branch       14        1      14
    Other         6        1       6    Other         0
    Total        14              148    Total        24               24

- Execution time for all values of a and b in [1..9999]:

    Modulo version    Repetitive subtraction    Blended version
    18.55 sec         14.56 sec                 12.14 sec

  (A simple timing harness for the two versions is sketched below.)

Data Dependences
- Instruction-level parallelism is limited by data dependences
- Types of dependences:
  - RAW: read-after-write, also called flow dependence
  - WAR: write-after-read, also called anti dependence
  - WAW: write-after-write, also called output dependence
- The example (w*x)*(y*z) shows a RAW dependence: the outer multiply must read the results written by the two inner multiplies
- WAR and WAW dependences exist because of storage location reuse (overwrite with a new value)
  - WAR and WAW are sometimes called false dependences
  - RAW is a true dependence

Data Dependence Case Study
- Removing redundant operations by (re)using (temporary) space may increase the number of dependences
- Example: two versions to initialize a finite difference matrix
  1. Recurrent version with a lower FP operation count (WAR shown; cross-iteration dependence not shown):
    dxi=1.0/h(1)
    do i=1,n
      dxo=dxi
      dxi=1.0/h(i+1)
      diag(i)=dxo+dxi
      offdiag(i)=-dxi
    enddo
  2. Non-recurrent version with fewer dependences (RAW shown; cross-iteration dependence not shown):
    do i=1,n
      dxo=1.0/h(i)
      dxi=1.0/h(i+1)
      diag(i)=dxo+dxi
      offdiag(i)=-dxi
    enddo
- Which is fastest depends on the effectiveness of loop optimization and instruction scheduling by the compiler (and processor) to hide latencies, and on the number of distinct memory loads

Case Study (1)
- Intel Core 2 Duo 2.33 GHz
    dxi = 1.0/h;
    for (i=1; i …

Branches
- … 0 less frequently
- Rewrite conjunctions to logical expressions:
    if (t1==0 && t2==0 && t3==0)  ⇒  if ( (t1|t2|t3) == 0 )
- Use max/min or arithmetic to avoid branches:
    if (a >= 255) a = 255;  ⇒  a = min(a, 255);
- Note that in C/C++ the cond?then:else operator and the && and || operators result in branches!
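The case study quotes wall-clock times for all a and b in [1..9999]; one plausible way to reproduce such a measurement is the harness below. The loop structure, the volatile sink, and the use of clock() are my additions, the blended version mentioned in the table is not shown in the slides and is not attempted here, and absolute times will of course differ by machine and compiler.

    #include <stdio.h>
    #include <time.h>

    /* The two versions from the case study above. */
    int find_gcf1(int a, int b)        /* modulo version */
    {
        while (1) {
            a = a % b;
            if (a == 0) return b;
            if (a == 1) return 1;
            b = b % a;
            if (b == 0) return a;
            if (b == 1) return 1;
        }
    }

    int find_gcf2(int a, int b)        /* repetitive subtraction version */
    {
        while (1) {
            if (a > b)      a = a - b;
            else if (a < b) b = b - a;
            else            return a;
        }
    }

    int main(void)
    {
        volatile int sink = 0;          /* prevent the calls from being optimized away */
        clock_t t0 = clock();
        for (int a = 1; a <= 9999; a++)
            for (int b = 1; b <= 9999; b++)
                sink += find_gcf1(a, b);
        clock_t t1 = clock();
        for (int a = 1; a <= 9999; a++)
            for (int b = 1; b <= 9999; b++)
                sink += find_gcf2(a, b);
        clock_t t2 = clock();
        printf("modulo: %.2f s, subtraction: %.2f s (sink=%d)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, sink);
        return 0;
    }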
Data Storage
- Memory hierarchy
- Performance of storage
- CPU and memory
- Virtual memory, TLB, and paging
- Cache

Memory Hierarchy
- Storage systems are organized in a hierarchy by
  - Speed: faster toward the top of the hierarchy
  - Cost: cheaper per byte toward the bottom
  - Volatility: the upper levels are volatile

Performance of Storage
- Registers are fast, typically one clock cycle to access data
- Cache access takes tens of cycles
- Memory access takes hundreds of cycles
- Movement between levels of the storage hierarchy can be explicit or implicit

Memory Access
- A logical address is translated into a physical address in virtual memory using a page table
  - The translation lookaside buffer (TLB) is an efficient on-chip address translation cache
  - Memory is divided into pages
  - Virtual memory systems store pages in memory (RAM) and on disk
  - Page fault: the page is fetched from disk
- L1 caches (on chip):
  - I-cache stores instructions
  - D-cache stores data
- L2 cache (E-cache, on/off chip):
  - Typically unified: stores instructions and data

Translation Lookaside Buffer
- Logical-to-physical address translation is done with a TLB lookup to limit page table reads (the page table is stored in physical memory)
- Logical pages are mapped to physical pages in memory (typical page size is 4 KB)

Caches
- Direct mapped cache: each location in main memory can be cached by just one cache location
- N-way associative cache: each location in main memory can be cached by one of N cache locations
- Fully associative: each location in main memory can be cached by any cache location
- Experiment shows SPEC CPU2000 benchmark cache miss rates

Cache Details
- An N-way associative cache has a set of N cache lines per row
- A cache line can be 8 to 512 bytes
  - Longer lines increase memory bandwidth performance, but space can be wasted when applications access data in a random order
- A typical 8-way L1 cache (on-chip) has 64 rows with 64-byte cache lines
  - Cache size = 64 rows × 8 ways × 64 bytes = 32768 bytes
- A typical 8-way L2 cache (on/off chip) has 1024 rows with 128-byte cache lines

Cache Misses
- Compulsory misses
  - Caused by the first reference to a datum
  - Misses are effectively reduced by prefetching
- Capacity misses
  - Cache size is finite
  - Misses are reduced by limiting the working-set size of the application
- Conflict misses
  - Replacement misses are caused by the choice of victim cache line to evict by the replacement policy
  - Mapping misses are caused by the level of associativity
- Example:
    for (i = 0; i < 100000; i++)
      X[i] = some_static_data[i];   // compulsory miss: first read of some_static_data
    for (i = 0; i < 100000; i++)
      X[i] = X[i] + Y[i];           // capacity miss: X[i] no longer in cache
                                    // conflict miss: X[i] and Y[i] are mapped to the
                                    // same cache line (e.g. when the cache is direct mapped)

False Sharing
- On a multi-core processor each core has an L1 cache and shares the L2 cache with the other cores
- False sharing occurs when two caches of processors (or cores) cache two different non-shared data items that reside on the same cache line
- The cache coherency protocol marks the cache line (on all cores) dirty to force a reload
- To avoid false sharing (see the padding sketch below):
  - Allocate non-shared data on different cache lines (using malloc)
  - Limit the use of global variables
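The advice to keep non-shared data on separate cache lines can be illustrated with per-thread counters. The sketch below pads each counter to a full cache line so the threads never write to the same line; the 64-byte line size, the pthreads setup, and the struct name are assumptions for illustration, not from the slides.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS   4
    #define CACHE_LINE 64            /* assumed line size; 64 bytes is typical */
    #define ITERS      10000000L

    /* Without the padding, the four counters would share one cache line and
       every increment by one thread would invalidate the line in the others. */
    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    };

    static _Alignas(CACHE_LINE) struct padded_counter counter[NTHREADS];

    static void *work(void *arg)
    {
        long id = (long)arg;
        for (long i = 0; i < ITERS; i++)
            counter[id].value++;      /* each thread touches only its own line */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, work, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        long total = 0;
        for (int t = 0; t < NTHREADS; t++)
            total += counter[t].value;
        printf("total = %ld\n", total);
        return 0;
    }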
Multiprocessors
HPC, Prof. Robert van Engelen

Overview
- The PMS model
- Shared memory multiprocessors
  - Basic shared memory systems
  - SMP, multicore, and COMA
- Distributed memory multicomputers
  - MPP systems
  - Network topologies for message-passing multicomputers
  - Distributed shared memory
- Pipeline and vector processors
- Comparison
- Taxonomies

PMS Architecture Model
- Processor (P): a device that performs operations on data
- Memory (M): a device that stores data
- Switch (S): a device that facilitates transfer of data between devices
- Arcs denote connectivity
- Each component has different performance characteristics
  (Figures: a simple PMS model; an example computer system with CPU and peripherals.)

Shared Memory Multiprocessor
- Processors access shared memory via a common switch, e.g. a bus
  - Problem: a single bus results in a bottleneck
- Shared memory has a single address space
- The architecture is sometimes known as a "symmetric multiprocessor" (SMP)

Shared Memory: the Bus Contention Problem
- Each processor competes for access to shared memory
  - Fetching instructions
  - Loading and storing data
- Performance of a single bus S: bus contention
  - Access to memory is restricted to one processor at a time
  - This limits the speedup and scalability with respect to the number of processors
  - Assume that each instruction requires 0 …

Example 1: Geometrical Transformation
Master:
    …
      if (!((newrow < 0) || (newrow >= 480) || (newcol < 0) || (newcol >= 640)))
        temp_map[newrow][newcol] = map[oldrow][oldcol];
    }
    for (i = 0; i < 480; i++)
      for (j = 0; j < 640; j++)
        map[i][j] = temp_map[i][j];
Worker:
    recv(&row, master);
    for (oldrow = row; oldrow < row + 480/P; oldrow++)
    { for (oldcol = 0; oldcol < 640; oldcol++)
      { newrow = oldrow + delta_x;
        newcol = oldcol + delta_y;
        send(oldrow, oldcol, newrow, newcol, master);
      }
    }
- Each worker computes, for all row < x < row + 480/P and 0 < y < 640:
    x' = x + Δx
    y' = y + Δy
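For reference, the transformation in the master/worker pseudocode above is straightforward on a single processor. The sketch below is a plain sequential C version of the same 480×640 pixmap shift (the delta_x and delta_y values and the int pixel type are arbitrary choices); it corresponds to the sequential baseline t_s against which the parallel version's speedup is measured.

    #include <stdio.h>
    #include <string.h>

    #define ROWS 480
    #define COLS 640

    static int map[ROWS][COLS], temp_map[ROWS][COLS];

    int main(void)
    {
        int delta_x = 5, delta_y = 10;      /* example shift amounts */

        /* Shift every pixel, dropping those that fall outside the pixmap. */
        for (int oldrow = 0; oldrow < ROWS; oldrow++)
            for (int oldcol = 0; oldcol < COLS; oldcol++) {
                int newrow = oldrow + delta_x;
                int newcol = oldcol + delta_y;
                if (newrow >= 0 && newrow < ROWS && newcol >= 0 && newcol < COLS)
                    temp_map[newrow][newcol] = map[oldrow][oldcol];
            }

        /* Copy the shifted image back, as the master does in the slides. */
        memcpy(map, temp_map, sizeof(map));
        printf("done\n");
        return 0;
    }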
Example 1: Geometrical Transformation Speedups?
- Assume in the general case the pixmap has n² points
- Sequential time of the pixmap shift: t_s = 2n²
- Communication: t_comm = P(t_startup + t_data) + n²(t_startup + 4·t_data) = O(P + n²)
- Computation: t_comp = 2n²/P = O(n²/P)
- Computation/communication ratio: r = O((n²/P) / (P + n²)) = O(n² / (P² + n²P))
- This is not good!
  - The asymptotic computation time should be an order higher than the asymptotic communication time, e.g. O(n²) versus O(n)
  - … or there must be a very large constant in the computation time
- Performance on a shared memory machine can be good
  - No communication time

Example 2: Mandelbrot Set
- A pixmap is generated by iterating the complex-valued recurrence z_{k+1} = z_k² + c, with z_0 = 0 and c = x + yi, until |z| > 2
- The Mandelbrot set is shifted and scaled for display: x = xmin + xscale·col and y = ymin + yscale·row for each of the pixmap's pixels at row and col location
- The number of iterations it takes for z to end up at a point outside the complex circle with radius 2 determines the pixmap color
  (Figure: the Mandelbrot set in the complex plane, with real and imaginary axes.)

Example 2: Mandelbrot Set Color Computation
    int pix_color(float x0, float y0)
    { float x = x0, y = y0;
      int i = 0;
      while (x*x + y*y < (2*2) && i < maxiter)
      { float xtemp = x*x - y*y + x0;
        float ytemp = 2*x*y + y0;
        x = xtemp;
        y = ytemp;
        i++;
      }
      return i;
    }
  (A runnable sequential driver for pix_color is sketched below.)

Example 2: Mandelbrot Set Simple Master and Worker
Master (send row numbers; recv (x,y) pixel colors):
    row = 0;
    for (p = 0; p < P; p++)
    { send(row, p);
      row += 480/P;
    }
    for (i = 0; i < 480 * 640; i++)
    { recv(&x, &y, &color, anyP);
      display(x, y, color);
    }
Worker:
    recv(&row, master);
    for (y = row; y < row + 480/P; y++)
    { for (x = 0; x < 640; x++)
      { x0 = xmin + x * xscale;
        y0 = ymin + y * yscale;
        color = pix_color(x0, y0);
        send(x, y, color, master);
      }
    }

Processor Farms
- Processor farms (also called the work-pool approach)
- A collection of workers, where each worker repeats:
  - Take a new task from the pool
  - Compute the task
  - Return the results into the pool
- Achieves load balancing
  - Tasks differ in the amount of work
  - Workers can differ in execution speed (viz. a heterogeneous cluster)
  (Figure: a master puts tasks into a work pool and gets results; workers P0..P3 get tasks and return results.)

Load Balancing
- Load balancing attempts to spread tasks evenly across processors
- Load imbalance is caused by
  - Tasks of different execution cost, e.g. the Mandelbrot example
  - Processors operating at different execution speeds or being busy
- When tasks and processors are not load balanced:
  - Some processes finish early and sit idle waiting
  - The global computation is finished when the slowest processor(s) completes its task
  (Figure: load balanced vs. not load balanced schedules for P0..P3, from start to finish.)

Static Load Balancing
- Load balancing can be viewed as a form of "bin packing"
- Static scheduling of tasks amounts to optimal bin packing
  - Round-robin algorithm
  - Randomized algorithms
  - Recursive bisection
  - Optimized scheduling with simulated annealing and genetic algorithms
- Problem: it is difficult to estimate the amount of work per task and to deal with changes in processor utilization and communication latencies
  (Figure: not load balanced vs. load balanced schedules for P0..P3, from start to finish.)
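The pix_color routine above can be exercised with a small sequential driver. The sketch below prints an ASCII rendering of the set; the window bounds, scales, maxiter value, and character ramp are illustrative choices, not values from the slides.

    #include <stdio.h>

    static int maxiter = 64;

    /* pix_color from the slides: iterate z = z^2 + c until |z| > 2 or maxiter */
    int pix_color(float x0, float y0)
    {
        float x = x0, y = y0;
        int i = 0;
        while (x*x + y*y < (2*2) && i < maxiter) {
            float xtemp = x*x - y*y + x0;
            float ytemp = 2*x*y + y0;
            x = xtemp;
            y = ytemp;
            i++;
        }
        return i;
    }

    int main(void)
    {
        /* Display window and scale are example values (not from the slides). */
        int rows = 40, cols = 100;
        float xmin = -2.5f, ymin = -1.25f;
        float xscale = 3.5f / cols, yscale = 2.5f / rows;

        for (int row = 0; row < rows; row++) {
            for (int col = 0; col < cols; col++) {
                float x0 = xmin + col * xscale;
                float y0 = ymin + row * yscale;
                int c = pix_color(x0, y0);
                putchar(c >= maxiter ? '*' : " .:-=+%"[c % 7]);
            }
            putchar('\n');
        }
        return 0;
    }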
Centralized Dynamic Load Balancing
- Centralized: a work pool with replicated workers
- A master process or central queue holds the incomplete tasks
  - First-in-first-out or priority queue (e.g. priority based on task size)
- Terminates when the queue is empty or the workers receive a termination signal
  (Figure: a work-pool task queue; workers P0..P3 get tasks and return results or add new tasks.)
  (A shared-memory work-pool sketch follows at the end of this section.)

Decentralized Dynamic Load Balancing
- A disadvantage of the centralized approach is the central queue, through which tasks move one by one
- Decentralized: distributed work pools
  (Figure: a master with the initial tasks feeds mini-masters, each holding its own task queue for its workers.)

Fully Distributed Work Pool
- Receiver-initiated poll method: an (idle) worker process requests a task from another worker process
- Sender-initiated push method: an (overloaded) worker process sends a task to another (idle) worker
- Workers maintain local task queues
- Process selection
  - Topology-based: select nearest neighbors
  - Round-robin: try each of the other workers in turn
  - Random polling/pushing: pick an arbitrary worker
- Absence of starvation: assuming infinitely many tasks, how can we guarantee that each one is eventually executed?
  (Figure: workers with local task queues send/recv tasks to/from a selected worker.)

Worker Pipeline
- Workers are organized in an array (or ring) with the master on one end (or in the middle)
  - The master feeds the pipeline
  - When the buffer of a worker is idle, it sends a request to the left
  - When the buffer of a worker is full, incoming tasks are shifted to the worker on the right (passing tasks along until an empty slot is found)
  (Figure: master feeding a pipeline of workers p0, p1, …, Pn-1.)
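On a shared-memory machine, the centralized work pool above can be sketched with a mutex-protected queue of task indices standing in for the master's queue. Everything here (pthreads, the task array, squaring numbers as the "computation") is an illustrative assumption; a message-passing farm would use send/recv between master and workers as in the earlier examples.

    #include <pthread.h>
    #include <stdio.h>

    #define NTASKS   100
    #define NWORKERS 4

    /* Centralized work pool: a shared index into the task array plays the role
       of the master's FIFO queue; workers repeatedly take the next task.       */
    static int next_task = 0;
    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
    static long result[NTASKS];

    static int get_task(void)
    {
        int t;
        pthread_mutex_lock(&pool_lock);
        t = (next_task < NTASKS) ? next_task++ : -1;   /* -1 means the pool is empty */
        pthread_mutex_unlock(&pool_lock);
        return t;
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            int t = get_task();
            if (t < 0) break;              /* termination: the queue is empty      */
            result[t] = (long)t * t;       /* "compute task" (placeholder workload) */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NWORKERS];
        for (int w = 0; w < NWORKERS; w++)
            pthread_create(&tid[w], NULL, worker, NULL);
        for (int w = 0; w < NWORKERS; w++)
            pthread_join(tid[w], NULL);
        printf("result[%d] = %ld\n", NTASKS - 1, result[NTASKS - 1]);
        return 0;
    }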