20240212-DS642-Lecture-04.pdf
New Jersey Institute of Technology
2024
DS 642: Applications of Parallel Computing, Lecture 4 (02/12/2024)
http://www.cs.njit.edu/~bader
Applications of Parallel Computers: Matrix Multiplication and the Roofline Model

A Simple Model of Memory
Assume just 2 levels in the memory hierarchy, fast and slow, with all data initially in slow memory.
– m = number of memory elements (words) moved between fast and slow memory
– tm = time per slow memory operation (inverse bandwidth in the best case)
– f = number of arithmetic operations
– tf = time per arithmetic operation
Minimum possible time = f * tf, when all data is in fast memory.
Actual time = f * tf + m * tm.
Machine Balance: tm / tf, the key to machine efficiency.
Computational Intensity (CI): q = f / m, the key to algorithm efficiency; the larger the CI, the closer the actual time gets to the minimum f * tf.
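To make the model concrete, here is a minimal C sketch that plugs rough operation and traffic counts for a matrix-vector multiply into the two-level model. The machine parameters tf and tm and the word counts are illustrative assumptions, not measurements of any particular machine.

```c
#include <stdio.h>

/* Two-level memory model: time ~= f*tf + m*tm.
 * The machine parameters below are illustrative assumptions only. */
int main(void) {
    double tf = 1e-9;   /* time per flop (s), assumed              */
    double tm = 1e-7;   /* time per slow-memory word (s), assumed  */
    double n  = 1000.0; /* problem size                            */

    /* Matrix-vector multiply y = A*x: f = 2n^2 flops,
     * m ~= n^2 + 3n words moved (read A once, read x, read and write y). */
    double f = 2.0 * n * n;
    double m = n * n + 3.0 * n;

    double ci      = f / m;           /* computational intensity q  */
    double t_min   = f * tf;          /* all data in fast memory    */
    double t_model = f * tf + m * tm; /* two-level model prediction */

    printf("q = %.2f, minimum time = %.4g s, modeled time = %.4g s\n",
           ci, t_min, t_model);
    printf("machine balance tm/tf = %.1f\n", tm / tf);
    return 0;
}
```

With these assumed numbers the computational intensity is about 2, so the modeled time is dominated by the memory term m * tm rather than the flop term.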
Recursive Matrix Multiplication
Define C = RMM(A, B, n):
    if (n == 1) { C = A * B }
    else {
        C00 = RMM(A00, B00, n/2) + RMM(A01, B10, n/2)
        C01 = RMM(A00, B01, n/2) + RMM(A01, B11, n/2)
        C10 = RMM(A10, B00, n/2) + RMM(A11, B10, n/2)
        C11 = RMM(A10, B01, n/2) + RMM(A11, B11, n/2)
    }
    return C

f = Arith(n) = # arithmetic operations in RMM(., ., n)
    = 8 * Arith(n/2) + 4 * (n/2)^2 if n > 1, else 1
    = 2n^3 - n^2 ~ 2n^3 ... the same operations as the usual algorithm, just in a different order; this is our f = # flops.

m = W(n) = # words moved between fast and slow memory by RMM(., ., n)
    = 8 * W(n/2) + 4 * 3 * (n/2)^2 if 3n^2 > Mfast, else 3n^2
      (4 lines of code, 3 matrices per line)
    = O(n^3 / (Mfast)^(1/2) + n^2) ... the same as blocked matmul.
We don't need to know Mfast for this to work!

Cache Oblivious
In practice, cut off the recursion well before 1x1 blocks and call an optimized "micro-kernel" on the small blocks (see the sketch after the communication lower bound below).
– Pingali et al. report about 2/3 of peak using recursion plus an optimized micro-kernel.
  See: https://www.slideserve.com/lazar/acomparison-of-cache-conscious-and-cacheoblivious-programs
– ATLAS with "unleashed" autotuning comes close to vendor performance.

Alternate Data Layouts
We may also use blocked or recursive data layouts; there are several possible recursive layouts, depending on the order of the sub-blocks, and copy optimization may be used to move data into such a layout (for example, a blocked row-major layout).
Z-Morton order (recursive) works well for any cache size, but the index calculations needed to find A[i,j] are expensive, so implementations may switch to column- or row-major layout for small sizes.

Theory: Communication lower bound
Theorem (Hong & Kung, 1981): any reorganization of matmul (using only commutativity and associativity) has computational intensity q = O((Mfast)^(1/2)), so
    # words moved between fast and slow memory = Ω(n^3 / (Mfast)^(1/2)).
Extensions, both lower bounds and (some) optimal algorithms, appear later in the course:
– Parallel matrix multiply, optimizing latency as well as bandwidth
– The rest of linear algebra (Gaussian elimination, least squares, tensors, ...)
– Nested loops accessing arrays (e.g., All-Pairs Shortest Paths, N-body, ...)
– Open problems: small loop bounds (e.g., matrix-vector vs. matrix-matrix multiply) and dependencies, i.e., when only some reorganizations are correct.
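The following is a minimal C sketch of the recursive, cache-oblivious matmul with a micro-kernel cutoff described above. The cutoff of 32 and the plain triple-loop micro-kernel are illustrative assumptions (a tuned micro-kernel is machine specific), the layout is row-major, and n is assumed to be a power of two.

```c
#include <stdio.h>
#include <stdlib.h>

#define CUTOFF 32  /* recursion cutoff; the tuned value is machine-dependent (assumed here) */

/* Micro-kernel: C += A*B on an n x n block, row-major with leading dimension ld. */
static void base_mm(int n, int ld, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double aik = A[i*ld + k];
            for (int j = 0; j < n; j++)
                C[i*ld + j] += aik * B[k*ld + j];
        }
}

/* Recursive matmul C += A*B; n must be a power of two. */
static void rmm(int n, int ld, const double *A, const double *B, double *C) {
    if (n <= CUTOFF) { base_mm(n, ld, A, B, C); return; }
    int h = n / 2;
    /* Pointers to the four sub-blocks of each operand. */
    const double *A00 = A,        *A01 = A + h;
    const double *A10 = A + h*ld, *A11 = A + h*ld + h;
    const double *B00 = B,        *B01 = B + h;
    const double *B10 = B + h*ld, *B11 = B + h*ld + h;
    double *C00 = C,        *C01 = C + h;
    double *C10 = C + h*ld, *C11 = C + h*ld + h;

    /* C00 += A00*B00 + A01*B10, etc., as in the RMM pseudocode above. */
    rmm(h, ld, A00, B00, C00);  rmm(h, ld, A01, B10, C00);
    rmm(h, ld, A00, B01, C01);  rmm(h, ld, A01, B11, C01);
    rmm(h, ld, A10, B00, C10);  rmm(h, ld, A11, B10, C10);
    rmm(h, ld, A10, B01, C11);  rmm(h, ld, A11, B11, C11);
}

int main(void) {
    int n = 256;  /* must be a power of two for this sketch */
    double *A = calloc((size_t)n * n, sizeof *A);
    double *B = calloc((size_t)n * n, sizeof *B);
    double *C = calloc((size_t)n * n, sizeof *C);  /* starts at zero; rmm accumulates into it */
    if (!A || !B || !C) return 1;
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }
    rmm(n, n, A, B, C);
    printf("C[0][0] = %g (expected %g)\n", C[0], 2.0 * n);
    free(A); free(B); free(C);
    return 0;
}
```

Notice that no cache size appears anywhere in the routine; only the cutoff is chosen for performance, which is exactly the cache-oblivious property discussed above.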
Strassen's Matrix Multiply
The traditional algorithm (with or without tiling) takes O(n^3) flops. Strassen discovered an algorithm with an asymptotically lower flop count, O(n^2.81).
Consider a 2x2 matrix multiply, which normally takes 8 multiplies and 4 adds; Strassen does it with 7 multiplies and 18 adds.

Let M = [ m11 m12 ; m21 m22 ] = A * B = [ a11 a12 ; a21 a22 ] * [ b11 b12 ; b21 b22 ]

Let p1 = (a12 - a22) * (b21 + b22)        p5 = a11 * (b12 - b22)
    p2 = (a11 + a22) * (b11 + b22)        p6 = a22 * (b21 - b11)
    p3 = (a11 - a21) * (b11 + b12)        p7 = (a21 + a22) * b11
    p4 = (a11 + a12) * b22

Then m11 = p1 + p2 - p4 + p6
     m12 = p4 + p5
     m21 = p6 + p7
     m22 = p2 - p3 + p5 - p7

This extends to n x n matrices by divide and conquer.

Strassen (continued)
T(n) = cost of multiplying n x n matrices = 7 * T(n/2) + 18 * (n/2)^2 = O(n^(log2 7)) = O(n^2.81).
It is asymptotically faster, and several times faster for large n in practice; the cross-over point depends on the machine. See "Tuning Strassen's Matrix Multiplication for Memory Efficiency", M. S. Thottethodi, S. Chatterjee, and A. Lebeck, Proceedings of Supercomputing '98.
It is possible to extend the communication lower bound to Strassen:
    # words moved between fast and slow memory = Ω(n^(log2 7) / M^((log2 7)/2 - 1)) ~ Ω(n^2.81 / M^0.4)
(Ballard, Demmel, Holtz, Schwartz, 2011; SPAA Best Paper Prize). The bound is attainable too; more on the parallel version later.

Other Fast Matrix Multiplication Algorithms
The world's record was O(n^2.375477...): Coppersmith & Winograd, 1987.
New record: 2.375477 reduced to 2.3729269 (Virginia Vassilevska Williams, UC Berkeley & Stanford, 2011).
Newer record: 2.3729269 reduced to 2.3728639 (Francois Le Gall, 2014).
Latest record: 2.3728639 reduced to 2.3728596 (Virginia Vassilevska Williams and Josh Alman, 2020).
The lower bound on # words moved can be extended to (some of) these algorithms (2015 thesis of Jacob Scott): Ω(n^w / M^(w/2 - 1)).
They can all be made numerically stable (Demmel, Dumitriu, Holtz, Kleinberg, 2007), and the rest of linear algebra (solving Ax = b, Ax = λx, etc.) can be done just as fast, and stably (Demmel, Dumitriu, Holtz, 2008).
Fast methods (besides Strassen) may need unrealistically large n.

Basic Linear Algebra Subroutines (BLAS)
Industry standard interface: www.netlib.org/blas, www.netlib.org/blas/blast--forum
Vendors and others supply optimized implementations.
– BLAS1 (1970s): 15 operations; vector ops such as dot product, saxpy (y = alpha*x + y), root-sum-squared, etc.
– BLAS2 (mid-1980s): 25 operations; matrix-vector ops such as matvec. Computational intensity: m = n^2, f = 2n^2, so CI ~ 2, with less overhead.
– BLAS3 (late-1980s): 9 operations; matrix-matrix ops such as matmul. Computational intensity: m
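As a minimal illustration of calling a BLAS3 matrix multiply through the standard CBLAS binding discussed above, the sketch below computes C = alpha*A*B + beta*C for a small row-major example. It assumes some CBLAS implementation (for example the Netlib reference BLAS or OpenBLAS) is installed and linked, e.g. with -lcblas or -lopenblas.

```c
#include <stdio.h>
#include <cblas.h>

int main(void) {
    /* C = 1.0 * A * B + 0.0 * C for 2x2 row-major matrices. */
    double A[4] = {1, 2, 3, 4};
    double B[4] = {5, 6, 7, 8};
    double C[4] = {0, 0, 0, 0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,      /* M, N, K       */
                1.0, A, 2,    /* alpha, A, lda */
                B, 2,         /* B, ldb        */
                0.0, C, 2);   /* beta, C, ldc  */

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);  /* expect [19 22; 43 50] */
    return 0;
}
```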
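Returning to the seven Strassen products listed earlier, this small C program (with arbitrary test entries, purely for illustration) evaluates p1 through p7 for a 2x2 multiply and compares the result with the conventional eight-multiply product.

```c
#include <stdio.h>

int main(void) {
    /* Arbitrary test entries for A and B. */
    double a11 = 1, a12 = 2, a21 = 3, a22 = 4;
    double b11 = 5, b12 = 6, b21 = 7, b22 = 8;

    /* Strassen's seven products, as given on the slide. */
    double p1 = (a12 - a22) * (b21 + b22);
    double p2 = (a11 + a22) * (b11 + b22);
    double p3 = (a11 - a21) * (b11 + b12);
    double p4 = (a11 + a12) * b22;
    double p5 = a11 * (b12 - b22);
    double p6 = a22 * (b21 - b11);
    double p7 = (a21 + a22) * b11;

    /* Recombination into the entries of M = A*B. */
    double m11 = p1 + p2 - p4 + p6;
    double m12 = p4 + p5;
    double m21 = p6 + p7;
    double m22 = p2 - p3 + p5 - p7;

    /* Conventional eight-multiply product for comparison. */
    double c11 = a11*b11 + a12*b21, c12 = a11*b12 + a12*b22;
    double c21 = a21*b11 + a22*b21, c22 = a21*b12 + a22*b22;

    printf("Strassen:     [%g %g; %g %g]\n", m11, m12, m21, m22);
    printf("Conventional: [%g %g; %g %g]\n", c11, c12, c21, c22);
    return 0;
}
```

Both prints give [19 22; 43 50] for these inputs, showing that the seven-multiply recombination reproduces the standard 2x2 product.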