CCS 2522 High Performance Computing PDF

Document Details


Uploaded by SubstantiveBixbite7424

University of Colombo School of Computing

2024

Malik Silva

Tags

high performance computing, computer science, parallel computing, programming

Summary

These lecture notes cover the fundamental concepts and techniques of high performance computing. The topics include parallel programming using MPI and OpenMP, load balancing, cache coherence, and false sharing, as well as various optimization strategies.

Full Transcript


CCS 2522 High Performance Computing
Malik Silva, University of Colombo School of Computing
October 4, 2024
[email protected]

Contents
1 Introduction
2 Serial Program Performance Optimization
3 Hardware Environments for Parallel Computing
4 MPI (Message Passing Interface)
5 Estimating the Cost of Message Passing Programs
6 Load Balancing
7 Shared Memory Programming using Pthreads
8 Shared Memory Performance Concerns: Cache Coherence and False Sharing
9 Shared Memory Programming using OpenMP
10 Parallel Program Design
11 Dependency Analysis for Parallelism

Chapter 1: Introduction

Objectives:
♣ What is a process?
♣ Describe the effect of a "fork".
♣ Define "speedup" and "efficiency" of a parallel computation.
♣ Outline factors that limit the speedup of a parallel computation.
♣ What is an "embarrassingly parallel computation"?
♣ Derive Amdahl's law.
♣ Explain why coarse granularity is preferred.

1. Process:
(a) A process is an instance of an executing program.
(b) Each process is given a unique process id.
    Note: In Linux, the list of running processes can be obtained by typing "ps" at a command line, e.g.,
        % ps
          PID TIME COMMAND
         2345 0.12 inventory
         2346 0.01 payroll
(c) Process creation: The fork() function creates a new process by duplicating the calling process, e.g.,

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int x;
        printf("Just one process so far\n");
        printf("calling fork...\n");
        x = fork();
        if (x == 0)
            printf("I am the child\n");
        else if (x > 0)
            printf("I am the parent. Child has pid: %d\n", x);
        else
            printf("Fork returned error code; no child\n");
        return 0;
    }

(d) The new process runs a copy of the same program as its creator, with its variables having the same values as those in the calling process: the address space of the child is a replica of its parent's.
    The process that called fork() is the parent; the newly created process is the child.
    The parent resumes execution, and the child starts execution at the same place, where the call to fork() returns.
(e) Return values of fork():
    - for the parent: the process id of the child, or a negative value on error
    - for the child: 0
    Note: Do not confuse the spellings of these two terms: processes are tasks; processors are the hardware processing units.

2. Parallel processing: multiple processes operating on a single problem.

3. Embarrassingly parallel computations: An ideal parallel computation is one that can be immediately divided into completely independent parts that can be executed simultaneously. A truly embarrassingly parallel computation requires no communication between the separate processes: each process works on its own (or shared) input data and produces its results without needing results from other processes. An example of an embarrassingly parallel computation is image processing.

4. Two important metrics:
(a) Speedup (S):
    S(n) = (execution time using one processor, i.e., the best possible serial time) / (execution time on a parallel computer with n processors) = ts / tp
    e.g., if the serial time of a computation is 4 minutes and the time taken on a parallel computer is 2 minutes, then the achieved speedup is 4/2 = 2.
(b) Efficiency (E):
    E(n) = S(n) / n, i.e., the fraction of the ideal speedup that is actually achieved.

5. Factors that limit speedup:
- Periods when not all processors can perform useful work and some are idle (including when only one processor is active on inherently serial parts of the computation)
- Extra computations in the parallel version that do not appear in the serial version
- Time spent on communication/synchronization

6. Maximum speedup? If the fraction of the computation that cannot be divided into concurrent tasks is f, and no overhead is incurred when the computation is divided into concurrent parts, the time to perform the computation with n processors is:
    tp = f*ts + (1 - f)*ts / n
Hence the speedup is:
    S(n) = ts / (f*ts + (1 - f)*ts / n) = n / (1 + (n - 1)f)    (Amdahl's law)
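To see what Amdahl's law predicts in practice, here is a small C sketch; the serial fraction f = 0.1 and the processor counts are made-up illustrative values, not taken from the notes:

    #include <stdio.h>

    /* Speedup predicted by Amdahl's law: S(n) = n / (1 + (n - 1) * f) */
    static double amdahl_speedup(double f, int n)
    {
        return n / (1.0 + (n - 1) * f);
    }

    int main(void)
    {
        double f = 0.1;   /* assume 10% of the computation is inherently serial */
        for (int n = 1; n <= 64; n *= 2)
            printf("n = %2d  ->  S(n) = %.2f\n", n, amdahl_speedup(f, n));
        return 0;
    }

Even with arbitrarily many processors the speedup is bounded by 1/f (here 10), which is why the serial fraction dominates the achievable speedup.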
7. Granularity: the size of the computation between communication and synchronization points. (Large computation size: "coarse granularity"; small computation size: "fine granularity".) Coarse granularity is preferred.

Figure 1.1: A parallel process that has coarse granularity and another one that has fine granularity

8. This course: The objective of this course is to learn how to write serial and parallel programs with good performance.
Course content:
- Serial program performance optimization
- Hardware environments for parallel computing
- Message passing interface
- Estimating the cost of message passing programs
- Load balancing
- Mid-term exam
- Shared memory programming using Pthreads
- Cache coherence and false sharing
- Shared memory programming using OpenMP
- Parallel program design
- Dependency analysis for parallelism
- Applications

9. Evaluation: Weekly quizzes (10%); Mid-term (20%); Final exam (70%)

Chapter 2: Serial Program Performance Optimization

Objectives:
♣ Outline the basic serial performance improvement guidelines.
♣ What is "common subexpression elimination", "strength reduction", "loop invariant code motion", "constant value propagation and evaluation"?
♣ Appreciate the importance of data locality.
♣ What is meant by the memory hierarchy?
♣ What is a cache line?
♣ Outline the difference between direct-mapped, set-associative and fully-associative caches.
♣ Define compulsory, conflict and capacity cache misses.
♣ What is meant by write-through and write-back caches?
♣ Define spatial and temporal locality.
♣ Use fusion, loop interchange, blocking, and data layout to improve data locality.
♣ Define prefetching and give the requirements for good prefetching.
♣ Use padding and array merging to avoid cache thrashing.
♣ Use loop interchange, loop fusion, loop fission, loop peeling, loop unrolling.

10. Basic Serial Performance Optimization Guidelines
- Selection of the best algorithm: For large data sets, the advantage of using a low-complexity algorithm (with a lower operation count) will be large.
- Use of efficient libraries: e.g., BLAS (Basic Linear Algebra Subprograms) (www.netlib.org). A good-quality public-domain library: LAPACK.
- Optimal data locality: It is important that the faster, higher levels of the memory hierarchy are used efficiently.
- Use of compiler optimizations: e.g., cc -O3 myprogram.c. Using a medium optimization level (-O2 on many machines) typically gives a speedup by a factor of 2 to 3 without a significant increase in compilation time.
- Use of a better computing environment: e.g., a more powerful system with good systems software.

11. Compiler Optimizations
- Compilers perform optimizations on units called basic blocks. A basic block is a segment of code that does not contain branches.
- Common subexpression elimination: If a subexpression is common to two or more expressions, the compiler will pull it out and precalculate it. e.g.,
      s1 = a + b + c
      s2 = a + b - c
  will be replaced by:
      t  = a + b
      s1 = t + c
      s2 = t - c
- Strength reduction: Replacement of an arithmetic expression by another equivalent expression that can be evaluated faster. e.g., replacement of 2*i by i + i, since integer additions are faster than integer multiplications.
- Loop invariant code motion: e.g., the compiler will replace:
      do i=1,n
        a(i) = r*s*a(i)
      enddo
  with:
      t1 = r*s
      do i=1,n
        a(i) = t1*a(i)
      enddo
- Constant value propagation and evaluation: Whenever the compiler can determine that a variable takes on a constant value, it will replace all occurrences of that variable by the constant value. Also, when an expression involves several constant values, they will be calculated and replaced by a new constant value. e.g.,
      two = 2;
      x = 3*two*y;
  will be replaced by the compiler by:
      x = 6*y;
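As a purely illustrative C rendering of two of the transformations above, the following sketch applies common subexpression elimination and strength reduction by hand, using the same expressions as in the examples (the numeric values are arbitrary):

    #include <stdio.h>

    int main(void)
    {
        double a = 1.5, b = 2.5, c = 0.5;

        /* Common subexpression elimination:
         * original: s1 = a + b + c;  s2 = a + b - c; */
        double t  = a + b;   /* a + b is computed only once */
        double s1 = t + c;
        double s2 = t - c;

        /* Strength reduction:
         * original: y = 2 * i; the multiplication is replaced by an addition */
        int i = 7;
        int y = i + i;

        printf("s1 = %g, s2 = %g, y = %d\n", s1, s2, y);
        return 0;
    }

A compiler at -O2 will normally perform both rewrites automatically within a basic block; the sketch only shows what the transformed code looks like.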
12. Data Locality Improvement

(a) The memory wall problem:
- Processors are very powerful and capable of high performance. However, this high performance is useless if the processor has to wait for data to be brought in from distant places. It is like a very efficient worker (the "processor") idling while the data for him to process are being brought from distant places (memory, hard disk) (Figure 2.1).

Figure 2.1: The memory wall problem

- "It is not processing data, but moving data, that is costly."
- The gap between processor and memory speeds continues to widen (Figure 2.2). It is a major hindrance to performance, and the gap is projected to keep increasing, so it is important to overcome it.

Figure 2.2: The processor and memory access speed gap

(b) Definitions:
- Memory hierarchy: A hierarchy of memory is provided: L1 cache, L2 cache, main memory, hard disk, etc. Fast cache sits at the top of the hierarchy; access times increase down the hierarchy. Typical access times:
      Cache      10-20 ns
      Memory     90-120 ns
      Hard disk  10,000,000 to 20,000,000 ns
  Fast caches are expensive and thus small. An example of memory hierarchy usage is shown in Figures 2.3 and 2.4.

Figure 2.3: The memory access example. The processor wants to work with the variable "total". There could be both an instruction cache and a data cache, or there could be a unified cache in which both instructions and data are stored.

Figure 2.4: The memory access example. The value of the variable "total" is brought into the L2 and L1 caches.

- Cache line: The smallest unit of data that can be transferred to or from memory (it is the data that fits in the suitcase of Figure 2.1). It holds the contents of a contiguous block of main memory. Cache lines are usually between 32 and 128 bytes.
- If the data requested by the processor are found in the cache, it is a cache hit; if not, it is a cache miss. On a cache miss, the new data must be brought into the cache.
- Cache associativity: Specifies the number of different cache slots in which a cache line can reside. Three types:
      Direct-mapped: If the size of the cache is n words, then the i-th word in memory can be stored only in the position given by mod(i, n).
      Set-associative: In an N-way set-associative cache, the cache is broken into sets, where each set contains N cache lines. Each memory address is assigned a set, and the relevant cache line can be cached in any one of the slots within that set.
      Fully-associative: a cache line can be placed in any free slot in the cache.
- Cache misses:
      Compulsory miss: occurs when a cache line has to be brought into the cache on its first access.
      Capacity miss: a miss that is not a compulsory miss but that would also occur in a fully-associative cache of the same size (the cache simply cannot hold all the data in use).
      Conflict miss: a miss that is neither a compulsory miss nor would occur in a fully-associative cache; it occurs only in set-associative or direct-mapped caches, due to cache accesses that conflict with each other.
- Cache line replacement: Not an issue for direct-mapped caches, but for set-associative and fully-associative caches some replacement mechanism must be used. Ways of replacement: least-recently used, random, round-robin.
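A small C sketch of the direct-mapped mapping rule above; the 32 KB cache size and 64-byte line size are hypothetical values chosen only for illustration. For a byte address, the slot is (address / line size) mod (number of lines):

    #include <stdio.h>
    #include <stdint.h>

    #define LINE_SIZE  64                         /* bytes per cache line (assumed) */
    #define CACHE_SIZE (32 * 1024)                /* total cache size (assumed)     */
    #define NUM_LINES  (CACHE_SIZE / LINE_SIZE)   /* 512 slots                      */

    /* Slot of a direct-mapped cache that a byte address maps to: the mod(i, n) rule. */
    static unsigned slot_of(uintptr_t addr)
    {
        return (addr / LINE_SIZE) % NUM_LINES;
    }

    int main(void)
    {
        static double a[8192];                    /* 64 KB array of doubles */
        /* a[0] and a[4096] are exactly CACHE_SIZE bytes apart, so they map to the
         * same slot and would evict each other: a source of conflict misses. */
        printf("a[0]    -> slot %u\n", slot_of((uintptr_t)&a[0]));
        printf("a[4096] -> slot %u\n", slot_of((uintptr_t)&a[4096]));
        return 0;
    }

A set-associative cache relaxes this rule by letting the line occupy any of N slots within its set, which is why it suffers fewer conflict misses.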
- Write-back and write-through caches: When a processor performs a store instruction, it typically writes the data into the cache-resident line containing the address. There are two policies for dealing with a cache line after data have been written to it:
      Write-through: the data are written both to the cache line in the cache and to main memory.
      Write-back: the data are written only to the cache line in the cache. This modified cache line is written to memory only when necessary (usually when it is replaced in the cache by another cache line).
- Data locality: The memory hierarchy can be a solution to the memory wall problem if data become cache-resident. One way to improve cache residence is to improve the data locality in the program. There are two types of data locality: spatial and temporal. Consider variables A B C D E ...:
      Spatial locality: nearby memory locations are accessed (e.g., A B C ...)
      Temporal locality: the same memory location is accessed multiple times (e.g., A A A ...)

(c) Improving Data Locality

- Fusion: When the same data are used in separate sections of the code, bring those sections close together if possible. This improves temporal locality.
  e.g. 1, Before:  ... A ............... A ...
          After:   ... A ... A .........
  e.g. 2, Before:
      Matrix A usage
      Matrix B usage
      Matrix A usage
      Matrix B usage
  After:
      Matrix A usage
      Matrix A usage
      Matrix B usage
      Matrix B usage

- Prefetching: Moving data up the memory hierarchy before they are actually needed. See Figure 2.5.

Figure 2.5: Prefetching example. The processor is presently working with "total". The variable "pressure", having the value 9.1, will be used in the future; it is "prefetched" into the cache hierarchy.

  Good data prefetching has:
      Little overhead.
      Correctness: the prefetched data should actually be used in the future.
      No "pollution": it should not evict from the cache the data that are presently being used.
      "Just-in-time" timing: neither too early (the data may be evicted before use) nor too late.

- Loop interchange: Reorder loops so that the access pattern in the loop is aligned with the pattern of data storage in memory. In C, arrays are stored in row-major order, so the innermost loop should run over the rightmost (contiguous) index, as in the sketch below.
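A minimal sketch of loop interchange for a row-major C array; the array size N is an arbitrary illustrative value:

    #include <stdio.h>

    #define N 1024

    static double a[N][N];

    int main(void)
    {
        double sum = 0.0;

        /* Poor spatial locality: the inner loop walks down a column, so
         * consecutive accesses are N*sizeof(double) bytes apart and touch
         * a different cache line on every iteration. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        /* After loop interchange: the inner loop walks along a row, matching
         * C's row-major storage, so every element of a fetched cache line is
         * used before the line is evicted. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }

The same reasoning applies in reverse for Fortran, which stores arrays column by column, so there the innermost loop should run over the leftmost index.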
