Introduction to Parallel Computing Fall 2024 PDF
Document Details
Nile University
2024
CSCI465/ECEN433
Summary
These lecture notes from Nile University cover Introduction to Parallel Computing during the Fall 2024 semester. The document explores workload-driven performance evaluation, several approaches to parallelizing particle binning on a 2D grid, problem-, time-, and memory-constrained scaling models, and practical techniques for measuring and tuning parallel performance.
Full Transcript
CSCI465/ECEN433 Introduction to Parallel Computing, Fall 2024
Lecture (11): Workload-Driven Performance Evaluation

You are hired by [insert your favorite chip company here]. You walk in on day one, and your boss says: "All of our senior architects have decided to take the year off. Your job is to lead the design of our next parallel processor." What questions might you ask? Your boss selects the application that matters most to the company: "I want you to demonstrate good performance on this application."

How do you know if you have a good design?
▪ Absolute performance?
- Often measured as wall-clock time
- Another example: operations per second
▪ Speedup: performance improvement due to parallelism?
- Execution time of the sequential program / execution time on P processors
- Operations per second on P processors / operations per second of the sequential program
▪ Efficiency?
- Performance per unit resource
- e.g., operations per second per chip area, per dollar, per watt

Example: create a grid-of-particles data structure on a large parallel machine (e.g., a GPU)
▪ Problem: place 1M point particles into a 16-cell uniform grid based on their 2D positions
- A parallel data structure manipulation problem: build a 2D array of lists
▪ GTX 980 GPU: up to 2048 CUDA threads per SMM core, and 16 SMM cores on the GPU
[Figure: a 4 x 4 grid of cells numbered 0-15, with particles scattered across the cells]

Common use of this structure: N-body problems
▪ A common operation is to compute interactions with neighboring particles
▪ Example: given a particle, find all particles within radius R
- Create a grid with cells of size R
- Only the particles in the surrounding grid cells need to be inspected
[Figure: a particle with a circle of radius R overlapping its cell and the neighboring cells]

Solution 1: parallelize over cells
▪ One possible answer is to decompose work by cells: for each cell, independently compute the particles within it (this eliminates contention because no synchronization is required)
- Insufficient parallelism: only 16 parallel tasks, but thousands of independent tasks are needed to efficiently utilize the GPU
- Work-inefficient: performs 16 times more particle-in-cell computations than the sequential algorithm

  list cell_lists[16];     // 2D array of lists
  for each cell c          // in parallel
    for each particle p    // sequentially
      if (p is within c)
        append p to cell_lists[c]

Solution 2: parallelize over particles
▪ Another answer: assign one particle to each CUDA thread. The thread computes the cell containing the particle, then atomically updates the list.
- Massive contention: thousands of threads contend for access to update a single shared data structure

  list cell_list[16];      // 2D array of lists
  lock cell_list_lock;
  for each particle p      // in parallel
    c = compute cell containing p
    lock(cell_list_lock)
    append p to cell_list[c]
    unlock(cell_list_lock)

Solution 3: use finer-granularity locks
▪ Alleviate contention for the single global lock by using per-cell locks
- Assuming a uniform distribution of particles in 2D space, roughly 16x less contention than Solution 2 (a runnable sketch of this scheme follows below)

  list cell_list[16];         // 2D array of lists
  lock cell_list_lock[16];    // one lock per cell
  for each particle p         // in parallel
    c = compute cell containing p
    lock(cell_list_lock[c])
    append p to cell_list[c]
    unlock(cell_list_lock[c])
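A minimal C++ sketch of the per-cell-lock scheme (Solution 3), written for CPU threads rather than CUDA so that the example is self-contained; the names Particle and bin_particles, the 4 x 4 grid, and the thread count are illustrative assumptions, not from the lecture:

  // Minimal CPU sketch of Solution 3: per-cell locks reduce contention because
  // threads only serialize when they append to the same cell.
  #include <algorithm>
  #include <cstddef>
  #include <mutex>
  #include <thread>
  #include <vector>

  struct Particle { float x, y; };            // positions assumed to lie in [0, 1) x [0, 1)
  constexpr int GRID_DIM  = 4;                // 4 x 4 grid = 16 cells, as in the example
  constexpr int NUM_CELLS = GRID_DIM * GRID_DIM;

  void bin_particles(const std::vector<Particle>& particles,
                     std::vector<std::vector<int>>& cell_lists,   // one list of particle ids per cell
                     int num_threads) {
      std::mutex cell_locks[NUM_CELLS];       // finer-granularity locks: one per cell
      cell_lists.assign(NUM_CELLS, {});

      auto worker = [&](std::size_t begin, std::size_t end) {
          for (std::size_t i = begin; i < end; i++) {
              // compute the cell containing particle i
              int cx = std::min(GRID_DIM - 1, (int)(particles[i].x * GRID_DIM));
              int cy = std::min(GRID_DIM - 1, (int)(particles[i].y * GRID_DIM));
              int c  = cy * GRID_DIM + cx;
              // contention only with threads that touch the same cell
              std::lock_guard<std::mutex> guard(cell_locks[c]);
              cell_lists[c].push_back((int)i);
          }
      };

      std::vector<std::thread> threads;
      std::size_t chunk = (particles.size() + num_threads - 1) / num_threads;
      for (int t = 0; t < num_threads; t++) {
          std::size_t begin = std::min(particles.size(), t * chunk);
          std::size_t end   = std::min(particles.size(), begin + chunk);
          threads.emplace_back(worker, begin, end);
      }
      for (auto& th : threads) th.join();
  }

  int main() {
      std::vector<Particle> ps = {{0.1f, 0.2f}, {0.8f, 0.9f}, {0.15f, 0.22f}};
      std::vector<std::vector<int>> cells;
      bin_particles(ps, cells, 4);            // 4 worker threads
  }

The same structure could also be built lock-free with a per-cell atomic counter into preallocated storage; the lecture's Solution 5 below removes the fine-grained synchronization altogether.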
Solution 4: compute partial results + merge
▪ Yet another answer: generate M "partial" grids in parallel, then combine them
- Example: create M thread blocks (at least as many thread blocks as SMM cores)
- Each block is assigned N/M particles
- All threads in a thread block update the same grid
- Enables faster synchronization: contention is reduced by a factor of M, and the cost of synchronization is lower because it is performed on block-local variables (in CUDA shared memory)
- Requires extra work: merging the M grids at the end of the computation
- Requires an extra memory footprint: M grids of lists are stored, rather than 1

Solution 5: data-parallel approach
▪ Step 1: compute the cell containing each particle (parallel over input particles)

  particle_index: 0 1 2 3 4 5
  grid_index:     9 6 6 4 6 4

▪ Step 2: sort the results by cell (the particle index array is permuted based on the sort)

  particle_index: 3 5 1 2 4 0
  grid_index:     4 4 6 6 6 9

▪ Step 3: find the start/end of each cell (parallel over the particle_index elements). The code below is run for each element of the particle_index array (each instance has a unique value of 'index'):

  cell = grid_index[index]
  if (index == 0)
    cell_starts[cell] = index
  else if (cell != grid_index[index-1]) {
    cell_starts[cell] = index
    cell_ends[grid_index[index-1]] = index
  }
  if (index == numParticles-1)   // special case for the last cell
    cell_ends[cell] = index+1

  For the example above: cell_starts[4] = 0, cell_starts[6] = 2, cell_starts[9] = 5
                         cell_ends[4]   = 2, cell_ends[6]   = 5, cell_ends[9]   = 6   (end indices are exclusive)

▪ This solution maintains a large amount of parallelism and removes the need for fine-grained synchronization... at the cost of a sort and extra passes over the data (extra bandwidth). A runnable sketch of this approach appears after the recap below.

Parallel implementations of particle binning (recap)
▪ Sequential algorithm:

  list cell_lists;               // 2D array of lists
  for each particle p
    c = compute cell containing p
    append p to cell_lists[c]

▪ Implementation 1: parallel over particles with a single global lock (Solution 2)
▪ Implementation 2: parallel over grid cells (Solution 1)
▪ Implementation 3: build separate grids and merge (Solution 4)
▪ Implementation 4: data-parallel sort (Solution 5)
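A minimal C++ sketch of the sort-based approach (Solution 5), assuming the same 16-cell grid; it uses a sequential std::sort for clarity where a GPU implementation would use a parallel sort, and the names compute_cell, sorted_cells, etc. are illustrative rather than taken from the lecture:

  // Sort-based particle binning: compute cells, sort particle indices by cell,
  // then find each cell's [start, end) range in the sorted order.
  #include <algorithm>
  #include <cstdio>
  #include <vector>

  struct Particle { float x, y; };
  constexpr int GRID_DIM  = 4;
  constexpr int NUM_CELLS = GRID_DIM * GRID_DIM;

  static int compute_cell(const Particle& p) {
      int cx = std::min(GRID_DIM - 1, (int)(p.x * GRID_DIM));
      int cy = std::min(GRID_DIM - 1, (int)(p.y * GRID_DIM));
      return cy * GRID_DIM + cx;
  }

  int main() {
      std::vector<Particle> particles = {{0.9f,0.6f},{0.6f,0.3f},{0.7f,0.4f},
                                         {0.1f,0.3f},{0.55f,0.45f},{0.2f,0.35f}};
      int n = (int)particles.size();

      // Step 1: compute the cell containing each particle (parallel over particles on a GPU)
      std::vector<int> particle_index(n), grid_index(n);
      for (int i = 0; i < n; i++) { particle_index[i] = i; grid_index[i] = compute_cell(particles[i]); }

      // Step 2: sort particle indices by cell (a parallel sort on a GPU)
      std::sort(particle_index.begin(), particle_index.end(),
                [&](int a, int b) { return grid_index[a] < grid_index[b]; });
      std::vector<int> sorted_cells(n);
      for (int i = 0; i < n; i++) sorted_cells[i] = grid_index[particle_index[i]];

      // Step 3: find the start/end of each cell (parallel over elements on a GPU);
      // this is the same per-element logic as the lecture's pseudocode
      std::vector<int> cell_starts(NUM_CELLS, -1), cell_ends(NUM_CELLS, -1);
      for (int index = 0; index < n; index++) {
          int cell = sorted_cells[index];
          if (index == 0)                           cell_starts[cell] = index;
          else if (cell != sorted_cells[index-1]) { cell_starts[cell] = index;
                                                    cell_ends[sorted_cells[index-1]] = index; }
          if (index == n - 1)                       cell_ends[cell] = index + 1;   // exclusive end
      }

      for (int c = 0; c < NUM_CELLS; c++)
          if (cell_starts[c] >= 0)
              std::printf("cell %d: particles [%d, %d)\n", c, cell_starts[c], cell_ends[c]);
  }

The only global coordination left is the sort itself; steps 1 and 3 are embarrassingly parallel over the particle array, exactly as described above.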
Common pitfall: comparing a parallel program's speedup against the parallel algorithm running on one core (this makes it easier to make yourself look good) rather than against the best sequential algorithm.

Speedup of the Ocean simulation application, 258 x 258 grid
[Figure: speedup vs. processor count (1 to 32) for execution on a 32-processor SGI Origin 2000. Figure credit: Culler, Singh, and Gupta]

Pitfalls of fixed problem size speedup analysis
▪ Ocean execution on a 32-processor SGI Origin 2000: no benefit beyond a point (even a slight slowdown)
- The problem size is simply too small for the machine (large communication-to-computation ratio)
- Scaling the performance of a small problem may not be all that important anyway (it might already execute fast enough on a single core)
- 258 x 258 grid on 32 processors: ~2,000 grid cells per processor
- 1K x 1K grid on 32 processors: ~32K grid cells per processor
[Figure: measured speedup vs. the ideal line. Figure credit: Culler, Singh, and Gupta]

Pitfalls of fixed problem size speedup analysis (continued)
▪ Execution on a 32-processor SGI Origin 2000: here, super-linear speedup!
- With enough processors, the chunk of the grid assigned to each processor begins to fit in cache (the key working set fits in the per-processor cache)
- Another example: if the problem size is too large for a single machine, the working set may not fit in memory, causing thrashing to disk (this would make speedup on a bigger parallel machine with more memory look amazing!)
[Figure credit: Culler, Singh, and Gupta]

Understanding scaling: size matters!
▪ There can be complex interactions between the size of the problem and the size of the parallel computer.
- These interactions can impact load balance, overhead, arithmetic intensity, and locality of data access
- The effects can be dramatic and application dependent
▪ Evaluating a machine with a fixed problem size can be problematic
- Too small a problem: parallelism overheads dominate parallelism benefits (and may even result in slowdowns); the problem size may be appropriate for small machines but inappropriate for large ones (it does not reflect realistic usage of a large machine!)
- Too large a problem (a problem size chosen to be appropriate for a large machine): the key working set may not "fit" in a small machine (causing thrashing to disk, or the key working set exceeds cache capacity, or the problem cannot run at all); when the problem's working set "fits" in a large machine but not in a small one, super-linear speedups can occur
▪ It can be desirable to scale the problem size as machine sizes grow (buy a bigger machine to compute more, rather than just to compute the same problem faster)

Architects also think about scaling
▪ A common question: "Does an architecture scale?"
▪ Scaling up: how does the architecture's performance scale with increasing core count? Will the design scale to the high end?
▪ Scaling down: how does the architecture's performance scale with decreasing core count? Will the design scale to the low end?
▪ Parallel architectures are designed to work in a range of contexts
- The same architecture is used for low-end, medium-scale, and high-end systems
- GPUs are a great example: the same SMM core architecture, with different numbers of SMM cores per chip
- Tegra X1: 2 SMM cores (mobile SoC); GTX 950: 6 SMM cores (90 watts); GTX 980: 16 SMM cores (165 watts); Titan X: 24 SMM cores (250 watts)

Common scaling terminology
▪ For mapping problem X onto a system with P processors:
▪ Strong scaling
- Runtime of X on P processors vs. runtime of X on 1 processor
- Goal = P
- Does having more processors get the job done faster?
▪ Weak scaling
- Runtime of (P*X) on P processors vs. runtime of X on 1 processor
- Goal = 1.0
- Does having more processors let me do bigger jobs?
(A short worked example of both metrics follows below.)
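A short worked example of the two definitions, using made-up timings; the numbers are illustrative assumptions, not measurements from the lecture:

  // Hypothetical timings used to illustrate strong vs. weak scaling.
  #include <cstdio>

  int main() {
      const double t1        = 64.0;   // runtime of problem X on 1 processor (seconds)
      const int    P         = 16;
      const double tP_strong = 5.0;    // runtime of the SAME problem X on P processors
      const double tP_weak   = 70.0;   // runtime of a P-times-larger problem (P*X) on P processors

      // Strong scaling: same problem, more processors; ideal value = P
      double strong_speedup = t1 / tP_strong;      // 64 / 5 = 12.8 (ideal: 16)

      // Weak scaling: problem grows with P; ideal value = 1.0 (values above 1.0 indicate overhead)
      double weak_ratio = tP_weak / t1;            // 70 / 64 ~= 1.09 (ideal: 1.0)

      std::printf("strong-scaling speedup: %.2f (ideal %d)\n", strong_speedup, P);
      std::printf("weak-scaling time ratio: %.2f (ideal 1.0)\n", weak_ratio);
  }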
Questions to ask when scaling a problem
▪ Under what constraints should the problem be scaled?
- "Work done by the program" may no longer be the quantity that is held fixed
- Fixed data set size, fixed memory usage per processor, fixed execution time, etc.?
▪ How should the problem be scaled?
- The problem size is often determined by more than one parameter
- Ocean example: the problem is defined by (N, ε, Δt, T): the grid size (N x N), the convergence threshold of the solver, the time step size, and the total time simulated by the program (one day of ocean flow)

Scaling constraints
▪ Application-oriented scaling properties (specific to the application)
- Particles per processor
- Transactions per processor
▪ Resource-oriented scaling properties
1. Problem-constrained scaling (PC)
2. Memory-constrained scaling (MC)
3. Time-constrained scaling (TC)
▪ User-oriented properties are often more intuitive, but resource-oriented properties are more general and apply across domains (so we'll talk about them today)

Problem-constrained scaling
▪ Focus: use a parallel computer to solve the same problem faster

  Speedup = time (1 processor) / time (P processors)

▪ Recall the pitfalls from earlier in the lecture (small problems may not be realistic workloads for large machines; big problems may not fit on small machines)
▪ Examples of problem-constrained scaling: almost everything we have considered parallelizing in class so far

Time-constrained scaling
▪ Focus: completing more work in a fixed amount of time
- Execution time is kept fixed as the machine (and the problem) scales

  Speedup = work done by P processors / work done by 1 processor

▪ How should "work" be measured?
- Challenge: "work done" may not be a linear function of the problem's input values (e.g., matrix multiplication is O(N^3) work for O(N^2)-sized inputs)
- One approach: "work done" is defined by the execution time of the same computation on a single processor (but consider the effects of thrashing if the problem is too big)
- Ideally, a measure of work is simple to understand and scales linearly with the sequential run time (so that ideal speedup remains linear in P)

More time-constrained scaling examples
▪ Computational finance: run the most sophisticated model possible in 1 ms, 1 minute, overnight, etc.
▪ Modern websites: generate a complex page and respond to the user within X milliseconds (studies show site usage directly corresponds to page load latency)
▪ Real-time computer vision for robotics: consider a self-driving car: the best-quality pedestrian detection achievable in 10 ms

Memory-constrained scaling
▪ Focus: run the largest problem possible without overflowing main memory**
▪ Memory per processor is held fixed (e.g., add more machines to a cluster)
▪ Neither work nor execution time is held constant!

  Speedup = [work (P processors) x time (1 processor)] / [time (P processors) x work (1 processor)]
          = work per unit time on P processors / work per unit time on 1 processor

▪ Note: scaling the problem size can make runtimes very large
- Consider O(N^3) matrix multiplication on O(N^2) matrices (a short worked example follows the memory-constrained scaling examples below)
** Assumptions: (1) memory resources scale with processor count; (2) spilling to disk is infeasible behavior (too slow)

Memory-constrained scaling examples
▪ One motivation for using supercomputers and large clusters is simply to be able to fit large problems in memory
▪ Large N-body problems
- 2012 Supercomputing Gordon Bell Prize winner: a 1,073,741,824,000-particle N-body simulation on the K computer (MPI implementation)
[Figure: 2D domain decomposition of an N-body simulation]
▪ Large-scale machine learning: billions of clicks, documents, etc.
▪ Memcached (an in-memory caching system for web apps): more servers = more available cache
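A small worked example (not from the lecture) of why memory-constrained scaling can make runtimes grow, using the O(N^3)-work, O(N^2)-memory matrix multiplication mentioned above; the baseline size N and the assumption of perfectly parallelized work are illustrative:

  // Memory-constrained scaling of matrix multiplication: memory grows by P,
  // so the matrix dimension grows by sqrt(P) and per-processor work grows too.
  #include <cmath>
  #include <cstdio>

  int main() {
      const double N = 4096.0;                   // baseline matrix dimension on 1 processor
      for (int P : {1, 4, 16, 64, 256}) {
          // MC scaling: total memory grows by P, so N'^2 = P * N^2  =>  N' = N * sqrt(P)
          double N_scaled = N * std::sqrt((double)P);
          // total work O(N'^3) divided across P processors, relative to the 1-processor run
          double work_per_proc  = std::pow(N_scaled, 3.0) / P;
          double runtime_growth = work_per_proc / std::pow(N, 3.0);   // = sqrt(P)
          std::printf("P=%3d  scaled N=%8.0f  runtime grows by ~%.1fx\n",
                      P, N_scaled, runtime_growth);
      }
  }

Even with perfect parallel efficiency, the runtime grows as sqrt(P), because each processor's share of the total work grows under MC scaling.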
Scaling examples at Pixar
▪ Rendering a "shot" (a sequence of frames) in a movie
- Goal: minimize time to completion (problem-constrained)
- Assign each frame to a different machine in the cluster
▪ Artists working to design the lighting for a scene
- Provide an interactive frame rate to the artist (time-constrained)
- More performance = a higher-fidelity representation shown to the artist in the allotted time
▪ Physical simulation, e.g., fluid simulation
- Parallelize the simulation across multiple machines to fit the simulation grid in the aggregate memory of the processors (memory-constrained)
▪ Final render of the images for the movie
- Scene complexity is typically bounded by the memory available on the farm machines
- One barrier to exploiting additional parallelism within a machine is that the required footprint often increases with the number of processors

Case study: our 2D grid solver
▪ For an N x N grid:
- O(N^2) memory requirement
- O(N^2) grid elements x O(N) convergence iterations... so total work increases as O(N^3)
▪ Problem-constrained scaling (N^2 elements divided among P processors; 2D-blocked assignment):
- Execution time: O(1/P)
- Elements computed per processor: N^2/P
- Elements communicated per processor: O(N/P^(1/2))
- Available concurrency: fixed (the problem does not grow with P)
- Comm-to-comp ratio: O(P^(1/2))

Scaling the 2D grid solver (N is held constant below; a 2D-blocked assignment is assumed throughout)
▪ Problem-constrained scaling:
- Execution time: O(1/P)
- Elements per processor: O(1/P)
- Communication per processor: O(1/P^(1/2))
- Comm-to-comp ratio: O(P^(1/2))
▪ Memory-constrained scaling:
- Let the scaled grid size be N*P^(1/2) x N*P^(1/2)
- Execution time: O((N*P^(1/2))^3 / P) = O(P^(1/2)); notice that execution time increases with MC scaling
- Elements per processor: fixed! (N^2)
- Comm-to-comp ratio: O(1)
▪ Time-constrained scaling:
- Let the scaled grid size be K x K
- Assume linear speedup: K^3/P = N^3, so K = N*P^(1/3) (recall: the computation time for a K x K grid on P processors equals the computation time for an N x N grid on 1 processor)
- Execution time: fixed at O(N^3) (by the definition of TC scaling)
- Elements per processor: K^2/P = N^2/P^(1/3)
- Communication per processor: K/P^(1/2) = O(1/P^(1/6))
- Comm-to-comp ratio: O(P^(1/6))
▪ Implications: expect the best "speedup" with MC scaling, then TC scaling, and the worst with PC scaling (due to communication overheads); a small numeric sketch of these ratios follows below
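A small numeric sketch (not from the lecture) of the three cases above: per-processor element counts, per-processor communication approximated as the block perimeter, and the resulting comm-to-comp ratio under a 2D-blocked assignment; the baseline grid dimension N is an illustrative assumption:

  // Grid-solver scaling: compare per-processor work and communication under
  // problem-constrained (PC), memory-constrained (MC), and time-constrained (TC) scaling.
  #include <cmath>
  #include <cstdio>

  static void report(const char* model, double grid_dim, double P) {
      double elems = grid_dim * grid_dim / P;          // elements per processor
      double comm  = 4.0 * grid_dim / std::sqrt(P);    // ~block perimeter per processor
      std::printf("%-3s P=%4.0f  elems/proc=%10.0f  comm/proc=%8.0f  comm-to-comp=%.4f\n",
                  model, P, elems, comm, comm / elems);
  }

  int main() {
      const double N = 1024.0;                         // baseline grid dimension
      for (double P : {4.0, 16.0, 64.0, 256.0}) {
          report("PC", N, P);                          // grid stays N x N
          report("MC", N * std::sqrt(P), P);           // grid grows to N*sqrt(P) x N*sqrt(P)
          report("TC", N * std::cbrt(P), P);           // grid grows to N*P^(1/3) x N*P^(1/3)
      }
  }

As P grows, the printed comm-to-comp ratio grows like sqrt(P) for PC scaling, stays flat for MC scaling, and grows slowly (like P^(1/6)) for TC scaling, matching the asymptotics above.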
Word of caution about problem scaling
▪ The problem size in the previous examples was a single parameter, n
▪ In practice, problem size is a combination of parameters
- Recall the Ocean example: the problem is (n, ε, Δt, T)
▪ Problem parameters are often related (not independent)
- Example from Barnes-Hut: increasing the particle count n changes the required simulation time step and the force-calculation accuracy parameter ϴ
▪ You must be cognizant of these relationships under TC or MC scaling

Scaling summary
▪ Performance improvement due to parallelism is measured by speedup
▪ But speedup metrics take different forms for different scaling models
- Which model matters most is application/context specific
▪ In addition to assignment and orchestration, the behavior of a parallel program depends significantly on the scaling properties of the problem and also of the machine.
- When analyzing performance, be careful to analyze realistic regimes of operation (realistic sizes and realistic problem size parameters)
- This requires application knowledge

The challenges of scaling problems up or down also apply to software developers debugging/tuning parallel programs on real machines. Common examples:
▪ You may want to log the behavior of code when debugging
- Debug logs become untenable in size when running the full problem
- Instrumentation slows down the code significantly
- Instrumentation can remove contention by changing the timing of the application
▪ You may want to debug/test/tune code on a small problem size on a small machine before running a two-week simulation on the full problem size on a supercomputer

There is simply no substitute for the experience of writing and tuning your own parallel programs. But today's take-away is: BE CAREFUL! It can be very tricky to evaluate and tune parallel software and parallel machines. It can be very easy to obtain misleading results or to tune code (or a billion-dollar hardware design) for a workload that is not representative of real-world use cases. It is helpful to start by precisely stating your application performance goals, and then to determine whether your evaluation approach is consistent with these goals.

Here are some tricks for understanding the performance of parallel software. Always, always, always try the simplest parallel solution first, then measure performance to see where you stand.

A useful performance analysis strategy
▪ Determine whether your performance is limited by computation, by memory bandwidth (or memory latency), or by synchronization
▪ Try to establish "high watermarks"
- What is the best you can do in practice?
- How close is your implementation to a best-case scenario?

Roofline model
▪ Use microbenchmarks to compute the peak performance of a machine as a function of the arithmetic intensity of the application
▪ Then compare the application's performance to the known peak values
▪ The diagonal region of the roofline is memory-bandwidth limited; the horizontal region is compute limited (a small numeric sketch of this bound appears at the end of these notes)
[Figure credit: Williams et al., 2009]

Roofline model: optimization regions
▪ Use various levels of optimization in the benchmarks (e.g., best performance with and without using SIMD instructions)
[Figure credit: Williams et al., 2009]

Establishing high watermarks*
▪ Add "math" (non-memory instructions): does execution time increase linearly with operation count as math is added? (If so, this is evidence that the code is instruction-rate limited.)
▪ Remove almost all math, but load the same data: how much does execution time decrease? If not by much, suspect a memory bottleneck.
▪ Change all array accesses to a single fixed element (e.g., A[0]): how much faster does the code get? (This establishes an upper bound on the benefit of improving the locality of data access.)
▪ Remove all atomic operations or locks: how much faster does the code get, provided it still does approximately the same amount of work? (This establishes an upper bound on the benefit of reducing synchronization overhead.)
* Computation, memory access, and synchronization are almost never perfectly overlapped. As a result, overall performance will rarely be dictated entirely by compute, by bandwidth, or by synchronization. Even so, the sensitivity of performance to the above program modifications can be a good indication of the dominant costs.

Use profilers/performance monitoring tools
▪ "CPU usage" graphs, such as the Activity Monitor in OS X while browsing the web in Chrome on a quad-core Core i7 laptop, plot the percentage of time the OS has scheduled a process thread onto a processor execution context; they are not very helpful for optimizing performance
▪ All modern processors have low-level event "performance counters": registers that count important details such as instructions completed, clock ticks, L2/L3 cache hits/misses, bytes read from the memory controller, etc.
▪ Example: Intel's Performance Counter Monitor tool provides a C++ API for accessing these registers:

  PCM *m = PCM::getInstance();
  SystemCounterState begin = getSystemCounterState();
  // code to analyze goes here
  SystemCounterState end = getSystemCounterState();
  printf("Instructions per clock: %f\n", getIPC(begin, end));
  printf("L3 cache hit ratio: %f\n", getL3CacheHitRatio(begin, end));
  printf("Bytes read: %llu\n", (unsigned long long)getBytesReadFromMC(begin, end));

▪ Also see Intel VTune, PAPI, oprofile, etc.

Questions
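As a closing illustration of the roofline bound discussed above, here is a tiny sketch (not from the lecture) that computes attainable performance as min(peak compute, arithmetic intensity x peak memory bandwidth); the peak numbers are assumed for illustration only:

  // Roofline bound: attainable GFLOP/s for a range of arithmetic intensities.
  #include <algorithm>
  #include <cstdio>

  int main() {
      const double peak_gflops = 500.0;   // assumed peak compute throughput (GFLOP/s)
      const double peak_bw_gbs = 50.0;    // assumed peak memory bandwidth (GB/s)

      // arithmetic intensity = FLOPs performed per byte of memory traffic
      for (double ai : {0.25, 1.0, 4.0, 16.0, 64.0}) {
          double attainable   = std::min(peak_gflops, ai * peak_bw_gbs);
          const char* regime  = (ai * peak_bw_gbs < peak_gflops) ? "bandwidth-limited"
                                                                 : "compute-limited";
          std::printf("AI = %5.2f FLOP/byte -> attainable %6.1f GFLOP/s (%s)\n",
                      ai, attainable, regime);
      }
  }

For these assumed peaks the ridge point is at 10 FLOP/byte: applications to its left fall in the diagonal, bandwidth-limited region of the roofline, and applications to its right fall in the horizontal, compute-limited region.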