
20240219-DS642-Lecture-05.pdf


Full Transcript


DS 642: Applications of Parallel Computing
Lecture 5, 02/19/2024
http://www.cs.njit.edu/~bader

More on Communication-optimal Matmul (and beyond)
Slide credits: James Demmel

Outline
Communication = moving data
– Between main memory and cache
– Between processors over a network
– Most expensive operation (in time or energy)
Goal: Provably minimize communication for algorithms that look like nested loops accessing arrays
– Includes matmul, linear algebra (dense and sparse), n-body, convolutional neural nets (CNNs), …
Simple case: n-body (sequential, with main memory and cache)
– Communication lower bound and optimal algorithm
Extension to Matmul
Extension to algorithms that look like nested loops accessing 3 arrays, like CNNs (and open questions)

Data access for n-body
A() = array of structures
– A(i) contains position, charge on particle i
Usual n-body
– for i = 1:n, for j = 1:n except i, F(i) = F(i) + force(A(i), A(j))
Simplify to make counting easier
– Let B() = array of a disjoint set of particles
– for i = 1:n, for j = 1:n, e = e + potential(A(i), B(j))
Simplify more
– for i = 1:n, for j = 1:n, access A(i) and B(j)

Data access for n-body (picture: the (i, j) iteration space as a grid)
for i = 0:n, for j = 0:n, access A(i), B(j)
– Ex: executing the loop iteration i = 3, j = 5 accesses A(3), B(5)
– Executing the loop for multiple pairs (i, j) accesses multiple entries A(i), B(j)
– If we can only access 10 entries of A and B, what is the max number of loop iterations we can do? 25 = 5 x 5
– If we can access M = cache size entries, then we can do (M/2)^2 = M^2/4 loop iterations

Communication lower bound for n-body (intuition)
for i = 1:n, for j = 1:n, access A(i), B(j)
– With a cache of size M full of data, we can only perform M^2/4 loop iterations
– To perform all n^2 loop iterations, we need to (re)fill the cache n^2 / (M^2/4) = 4(n/M)^2 times
– Filling the cache costs M reads from slow memory
– Need to do at least 4(n/M)^2 * M = 4n^2 / M reads
– Can improve the constant slightly
– Write as Ω(n^2 / M) = Ω(#loop iterations / M)

Optimal tiling for usual n-body (picture: square tiles covering the (i, j) iteration space)
for i = 0:n, for j = 0:n, access A(i), B(j)
Tiling (M = 10): read 5 entries of A: A([0,1,2,3,4]); read 5 entries of B: B([0,1,2,3,4]); perform 5^2 = 25 loop iterations per tile
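As a concrete illustration of the square tiling above, here is a minimal C sketch of the tiled two-array loop nest. The array size N, the placeholder potential() function, and the tile size TILE (standing in for roughly M/2 cached entries of each array) are assumptions for illustration, not part of the lecture slides.

    #include <stdio.h>

    #define N    1024   /* number of particles; an arbitrary choice for illustration */
    #define TILE 64     /* tile edge, standing in for roughly M/2 cached entries of each array */

    /* Placeholder potential; the lecture only counts data accesses, not the physics. */
    static double potential(double a, double b) { return a * b; }

    int main(void) {
        static double A[N], B[N];
        double e = 0.0;
        for (int i = 0; i < N; i++) { A[i] = i; B[i] = N - i; }

        /* Tiled traversal: each (ii, jj) tile reads TILE entries of A and TILE entries
           of B, then performs TILE*TILE loop iterations, the (M/2)^2 reuse per cache
           fill described on the slides. N is a multiple of TILE here, so no cleanup
           loops are needed. */
        for (int ii = 0; ii < N; ii += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++)
                        e += potential(A[i], B[j]);

        printf("e = %g\n", e);
        return 0;
    }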
Generalizing to other algorithms
Many algorithms look like nested loops accessing arrays
– Linear algebra (dense and sparse)
– Grids (structured and unstructured)
– Convolutional neural nets (CNNs), …
Matmul: C = A*B
– for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) = C(i,j) + A(i,k) * B(k,j)

Proof of Communication Lower Bound on C = A·B (1/4)
Analogous to n-body:
– Only M entries of A, B and C are available in cache
– Find an upper bound F on the number of different iterations C(i,j) = C(i,j) + A(i,k)*B(k,j) we can perform
– Need to refill the cache n^3/F times to complete the algorithm
– Need to read/write at least M * n^3/F words to/from cache
Like n-body, represent iterations and data geometrically

Proof of Communication Lower Bound on C = A·B (2/4)
(picture: the iteration space as an n x n x n cube with axes i, j, k; the unit cube at (i, j, k) represents the update C(i,j) += A(i,k)·B(k,j), e.g. C(1,1) += A(1,3)·B(3,1); entries of C, A and B appear as squares on the "C face", "A face" and "B face")
If we have at most M "A squares", "B squares", and "C squares" on the faces, how many cubes can we have?

Proof of Communication Lower Bound on C = A·B (3/4)
(picture: a 3D set of cubes and its shadows on the A, B and C faces)
– #cubes in a black box with side lengths x, y and z = volume of the box = x·y·z = (xz · zy · yx)^(1/2) = (#A squares · #B squares · #C squares)^(1/2)
– (i,k) is in the A shadow if (i,j,k) is in the 3D set; (j,k) is in the B shadow if (i,j,k) is in the 3D set; (i,j) is in the C shadow if (i,j,k) is in the 3D set
– Thm (Loomis & Whitney, 1949): #cubes in the 3D set = volume of the 3D set ≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2)

Proof of Communication Lower Bound on C = A·B (4/4)
– #loop iterations doable with M words of data = #cubes ≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2) ≤ (M · M · M)^(1/2) = M^(3/2) = F
– Need to read/write at least M * n^3/F = Ω(n^3 / M^(1/2)) = Ω(#loop iterations / M^(1/2)) words to/from cache

Recall optimal Matmul Algorithm
Analogous to n-body:
– What is the largest set of C(i,j) += A(i,k)*B(k,j) we can perform given M entries A(i,k), B(k,j), C(i,j)?
– What is the largest set of (i,j,k) we can have, given a bound M on the number of (i,k), (k,j), (i,j)?
– What is the shape of the largest 3D volume we can have, given a bound M on the area of its shadows in 3 directions?
– Answer: a cube, with edge length O(M^(1/2)) and volume O(M^(3/2))
– Optimal "blocked" algorithm: 6 nested loops; the 3 innermost loops do a b x b matmul with b = O(M^(1/2))

Parallel case: apply the same reasoning to one processor out of P
– "Fast memory" = local processor, "slow memory" = the other processors
– Goal: lower bound on #"reads/writes" = #words moved between one processor and the others
– #loop iterations = n^3 / P (load balanced)
– M = 3n^2 / P (each processor gets an equal fraction of the data)
– #"reads/writes" ≥ M · (n^3/P) / M^(3/2) = Ω(n^2 / P^(1/2))

Recursive Matrix Multiplication (RMM) (1/2)
C = [C11 C12; C21 C22] = A·B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
  = [A11·B11 + A12·B21   A11·B12 + A12·B22;
     A21·B11 + A22·B21   A21·B12 + A22·B22]
– Eventually, the matrices will fit in cache
– Don't need to optimize for the cache size
– But call overhead is high

Recursive Matrix Multiplication (2/2)
func C = RMM(A, B, n)
  if n = 1, C = A * B, else {
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  }
  return
A(n) = #arithmetic operations in RMM(·,·,n)
     = 8·A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 - n^2 … same operations as usual, in a different order
W(n) = #words moved between fast and slow memory by RMM(·,·,n)
     = 8·W(n/2) + 4·3(n/2)^2 if 3n^2 > M_fast, else 3n^2
     = O(n^3 / M_fast^(1/2) + n^2) … same as blocked matmul
Don't need to know M_fast for this to work!
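To make the "6 nested loops" blocked algorithm from the Recall optimal Matmul Algorithm slide concrete, here is a minimal C sketch. The matrix dimension N, the block size BLK (standing in for b = O(sqrt(M))), and the initialization values are assumptions for illustration; a real implementation would also handle dimensions that are not multiples of the block size.

    #include <stdio.h>

    #define N   256   /* matrix dimension; an arbitrary choice for illustration */
    #define BLK 32    /* block size b, standing in for b = O(sqrt(M)); N is a multiple of BLK */

    static double A[N][N], B[N][N], C[N][N];

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; C[i][j] = 0.0; }

        /* The outer 3 loops walk over b-by-b blocks of C, A and B; the inner 3 loops
           do a b-by-b matmul whose three operand blocks can fit in cache together. */
        for (int ib = 0; ib < N; ib += BLK)
            for (int jb = 0; jb < N; jb += BLK)
                for (int kb = 0; kb < N; kb += BLK)
                    for (int i = ib; i < ib + BLK; i++)
                        for (int j = jb; j < jb + BLK; j++)
                            for (int k = kb; k < kb + BLK; k++)
                                C[i][j] += A[i][k] * B[k][j];

        printf("C[0][0] = %g (expect %g)\n", C[0][0], 2.0 * N);
        return 0;
    }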
Strassen's Matrix Multiply
– The traditional algorithm (with or without tiling) has O(n^3) flops
– Strassen discovered an algorithm with asymptotically lower flops: O(n^2.81)
– Consider a 2x2 matrix multiply: it normally takes 8 multiplies and 4 adds; Strassen does it with 7 multiplies and 18 adds
Let C = [C11 C12; C21 C22] = [A11 A12; A21 A22] · [B11 B12; B21 B22]
Let P1 = (A12 - A22) * (B21 + B22)
    P2 = (A11 + A22) * (B11 + B22)
    P3 = (A11 - A21) * (B11 + B12)
    P4 = (A11 + A12) * B22
    P5 = A11 * (B12 - B22)
    P6 = A22 * (B21 - B11)
    P7 = (A21 + A22) * B11
Then C11 = P1 + P2 - P4 + P6
     C12 = P4 + P5
     C21 = P6 + P7
     C22 = P2 - P3 + P5 - P7
Extends to n x n by divide & conquer

Strassen (continued)
T(n) = cost of multiplying n x n matrices = 7·T(n/2) + 18·(n/2)^2 = O(n^(log2 7)) = O(n^2.81)
– Asymptotically faster
– Several times faster for large n in practice
– The cross-over point depends on the machine
– "Tuning Strassen's Matrix Multiplication for Memory Efficiency", M. S. Thottethodi, S. Chatterjee, and A. Lebeck, Proceedings of Supercomputing '98
Possible to extend the communication lower bound to Strassen:
– #words moved between fast and slow memory = Ω(n^(log2 7) / M^((log2 7)/2 - 1)) ≈ Ω(n^2.81 / M^0.4)
– (Ballard, Demmel, Holtz, Schwartz, 2011, SPAA Best Paper Prize)
– Attainable too; more on the parallel version later
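Here is one level of the Strassen recursion as a runnable C sketch, using the seven products P1..P7 from the slide on scalar 2x2 blocks and checking them against the classical 8-multiply result. The example operand values are assumptions for illustration; in the real algorithm each multiply below is itself a recursive (n/2) x (n/2) matmul.

    #include <stdio.h>

    int main(void) {
        double A11 = 1, A12 = 2, A21 = 3, A22 = 4;   /* example operands, an arbitrary choice */
        double B11 = 5, B12 = 6, B21 = 7, B22 = 8;

        /* 7 multiplies instead of 8 */
        double P1 = (A12 - A22) * (B21 + B22);
        double P2 = (A11 + A22) * (B11 + B22);
        double P3 = (A11 - A21) * (B11 + B12);
        double P4 = (A11 + A12) * B22;
        double P5 = A11 * (B12 - B22);
        double P6 = A22 * (B21 - B11);
        double P7 = (A21 + A22) * B11;

        double C11 = P1 + P2 - P4 + P6;
        double C12 = P4 + P5;
        double C21 = P6 + P7;
        double C22 = P2 - P3 + P5 - P7;

        /* Check against the classical 2x2 product. */
        printf("C11 = %g (expect %g)\n", C11, A11*B11 + A12*B21);
        printf("C12 = %g (expect %g)\n", C12, A11*B12 + A12*B22);
        printf("C21 = %g (expect %g)\n", C21, A21*B11 + A22*B21);
        printf("C22 = %g (expect %g)\n", C22, A21*B12 + A22*B22);
        return 0;
    }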
Other Fast Matrix Multiplication Algorithms
– World's record was O(n^2.37548...) – Coppersmith & Winograd, 1987
– New record: 2.37548 reduced to 2.37293 – Virginia Vassilevska Williams, UC Berkeley & Stanford, 2011
– Newer record: 2.37293 reduced to 2.37286 – Alman and Williams, 2020
– The lower bound on #words moved can be extended to (some) of these algorithms (2015 thesis of Jacob Scott)
– Possibility of an O(n^(2+ε)) algorithm! – Cohn, Umans, Kleinberg, 2003
– Can show they all can be made numerically stable – Demmel, Dumitriu, Holtz, Kleinberg, 2007
– Can do the rest of linear algebra (solve Ax=b, Ax=λx, etc.) as fast, and numerically stably – Demmel, Dumitriu, Holtz, 2008
– Fast methods (besides Strassen) may need unrealistically large n

Approach to generalizing lower bounds
Matmul:
  for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
General case:
  for i1=1:n, for i2=i1:m, …, for ik=i3:i4
    C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
    D(something else) = func(something else), …
  => for (i1,i2,…,ik) in S = subset of Z^k, access locations indexed by "projections", e.g.
     φ_C(i1,i2,…,ik) = (i1+2*i3-i7), φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
Goal: communication lower bounds and optimal algorithms for any program that looks like this

General Communication Lower Bound
Thm: Given a program with array references given by projections φ_j, there is an s_HBL ≥ 1 such that
  #words_moved = Ω(#iterations / M^(s_HBL - 1))
where s_HBL is the value of a linear program:
  minimize s_HBL = Σ_j e_j
  subject to rank(H) ≤ Σ_j e_j · rank(φ_j(H)) for all subgroups H < Z^k
Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett
– Generalization of the Hölder-Brascamp-Lieb (HBL) inequality to Abelian groups
– HBL generalizes Cauchy-Schwarz, Loomis-Whitney, …

Is this bound attainable (1/2)?
But first: can we write it down?
– One inequality per subgroup H < Z^d, but still finitely many!
– Thm (bad news): writing down all the inequalities in the LP reduces to Hilbert's 10th problem over Q (decidable or not? open question)
– Thm (good news): another LP has the same solution and is decidable (but could be expensive)
– Thm (better news): easy to write the LP down explicitly in many cases of interest:
  – When there are at most 3 arrays
  – When there are at most 5 loop indices
  – When the subscripts are subsets of the indices

Is this bound attainable (2/2)?
Thm: We can always construct an optimal tiling that attains the lower bound
Assumptions / caveats / open questions:
– Attains the lower bound Ω(#iterations / M^(s_HBL - 1)) in the O() sense; lots of recent work in the compiler community to get the best constants
– Depends on loop dependencies: not all tilings may compute the right answer; best case: no dependencies, or just reductions (like matmul)
– Assumes the loop bounds are large enough to fit a tile; e.g., the same lower bound for matmul applies to matrix-vector multiply, but is not attainable there
– Recent extension to arbitrary loop bounds, assuming all subscripts are "projective", e.g. (i), (i,j), (i,j,k), etc.

What CNNs compute
(pictures: a W x H x C image with B copies, K filters each R x S x C, and a W x H x K output with B copies)
for k=1:K, for h=1:H, for w=1:W, for r=1:R, for s=1:S, for c=1:C, for b=1:B
  Out(k, h, w, b) += Image(r + w, s + h, c, b) * Filter(k, r, s, c)
With strides σW, σH:
  Out(k, h, w, b) += Image(r + σW*w, s + σH*h, c, b) * Filter(k, r, s, c)

How a CNN is often done – 1D case
Ex: one 1x3 filter, one 1x5 image, shift σ = 1
– [f1, f2, f3], [im1, im2, im3, im4, im5]
– 3 dot products of length 3
Convert the image to a matrix, then do vector * matrix (BLAS2):
  [f1, f2, f3] * [ im1 im2 im3
                   im2 im3 im4
                   im3 im4 im5 ]
Multiple filters -> matrix * matrix (BLAS3)

How a CNN is often done – 2D case
Same idea:
– Convert each 2D image to a matrix: im2col (Matlab)
– Convert each 2D filter into a row vector, and stack them
– Do a matrix-matrix multiply
Ex: 2x2 filter => 1x4 vector [f11, f21, f12, f22]
Ex: 5x5 image => 4x20 matrix (one column per convolution):
  im11 im21 im31 im41 im12 … im44
  im21 im31 im41 im51 im22 … im54
  im12 im22 im32 im42 im13 … im45
  im22 im32 im42 im52 im23 … im55

CNN using im2col
– Same operations as the 7 nested loops
– Can exploit optimized matmul
– Need to replicate data
– Can we communicate less by doing convolutions directly? Ex: Intel MKL-DNN, some NVIDIA libraries
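Here is the 1D im2col example from the slide as a small C sketch: the 1x5 image is expanded into a 3x3 matrix whose columns are the windows, and the convolution becomes a vector-matrix product. The concrete image and filter values are assumptions for illustration.

    #include <stdio.h>

    #define IMLEN  5
    #define FLEN   3
    #define OUTLEN (IMLEN - FLEN + 1)   /* 3 outputs with shift sigma = 1, no padding */

    int main(void) {
        double im[IMLEN] = {1, 2, 3, 4, 5};   /* example image values, an arbitrary choice */
        double f[FLEN]   = {1, 0, -1};        /* example filter values, an arbitrary choice */

        /* im2col: column w holds the FLEN image entries that output w needs,
           i.e. the matrix [im1 im2 im3; im2 im3 im4; im3 im4 im5] from the slide. */
        double col[FLEN][OUTLEN];
        for (int r = 0; r < FLEN; r++)
            for (int w = 0; w < OUTLEN; w++)
                col[r][w] = im[w + r];

        /* The convolution is now the row vector f (1 x FLEN) times col (FLEN x OUTLEN):
           BLAS2 here, and BLAS3 once several filters are stacked into a matrix. */
        for (int w = 0; w < OUTLEN; w++) {
            double out = 0.0;
            for (int r = 0; r < FLEN; r++)
                out += f[r] * col[r][w];
            printf("out[%d] = %g\n", w, out);
        }
        return 0;
    }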
Communication Lower Bound for CNNs
Let N = #iterations = K·H·W·R·S·C·B and M = cache size. Then
  #words moved = Ω( max of 5 terms:
    B·K·H·W,                                  … size of Out
    σH·σW·B·C·W·H,                            … size of Image
    C·K·R·S,                                  … size of Filter
    N / M,                                    … same lower bound as n-body
    N / (M^(1/2) · (RS/(σH·σW))^(1/2)) )      … new lower bound
New lower bound:
– Beats matmul by a factor of (RS/(σH·σW))^(1/2)
– Applies in the common case when the data does not fit in cache, but one R x S filter does
– The tile needed to attain N/M is too big to fit in the loop bounds
– Attainable (in many cases, solved using Mathematica)

Optimal tiling for "slanted" n-body (picture: parallelogram tiles in the (i, j) iteration space)
for i = 0:n, for j = 0:n, access A(i), B(i+j)
Tiling: read 5 entries of A: A([0,1,2,3,4]); read 5 entries of B: B([4,5,6,7,8]); perform 5^2 = 25 loop iterations per tile

Optimal tiling for "twisted" n-body (picture: skewed tiles in the (i, j) iteration space)
for i = 0:n, for j = 0:n, access A(3*i - j), B(i - 2*j)
Tiling: read 5 entries of A: A([0,5,10,15,20]); read 5 entries of B: B([0,-5,-10,-15,-20]); perform 5^2 = 25 loop iterations per tile
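As a sketch of the "slanted" case above, the C code below tiles the (i, b = i+j) plane with squares, which corresponds to parallelogram-shaped tiles in the original (i, j) loop nest while keeping the same per-tile reuse. The sizes N and TILE, and the array values, are assumptions for illustration.

    #include <stdio.h>

    #define N    64   /* loop bound; an arbitrary choice for illustration */
    #define TILE 8    /* entries of A and of B read per tile, standing in for M/2 */

    int main(void) {
        static double A[N], B[2 * N];   /* B(i+j) is indexed up to 2N-2 */
        double e = 0.0;
        for (int i = 0; i < N; i++)     A[i] = i;
        for (int b = 0; b < 2 * N; b++) B[b] = 2 * N - b;

        /* "Slanted" access A(i), B(i+j): each full tile reads TILE entries of A and
           TILE entries of B and performs TILE*TILE iterations, matching the reuse of
           the square tiling for A(i), B(j). */
        for (int it = 0; it < N; it += TILE)
            for (int bt = it; bt < it + N + TILE; bt += TILE)
                for (int i = it; i < it + TILE; i++)
                    for (int b = bt; b < bt + TILE; b++) {
                        int j = b - i;   /* recover j; skip (i, j) outside the loop bounds */
                        if (j >= 0 && j < N)
                            e += A[i] * B[b];
                    }

        printf("e = %g\n", e);
        return 0;
    }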
Shared Memory Programming: Mostly OpenMP
(Acknowledgements to Tim Mattson and Aydin Buluc)

A generic parallel architecture
(picture: processors connected to memories through an interconnection network)
– Where is the memory physically located?
– Is it connected directly to processors?
– What is the connectivity of the network?

A Brief History of Parallel Languages
– When vector machines were king: parallel "languages" were loop annotations (IVDEP); performance was fragile, but user support was good
– When SIMD machines were king: data parallel languages were popular and successful (CMF, *Lisp, C*, …); mapping to MPPs/clusters proved hard; irregular data (sparse mat-vec multiply) was OK, but irregular computation (divide and conquer, adaptive meshes, etc.) was less clear
– When shared memory multiprocessors (SMPs) were king: shared memory models, e.g., POSIX Threads and OpenMP, became popular
– When clusters took over: message passing (MPI) became dominant
– With the addition of accelerators: OpenACC and CUDA were added
– In cloud computing: Hadoop, Spark, …
You'll see the most popular in each category

Outline
– Shared memory parallelism with threads
– What and why OpenMP?
– Parallel programming with OpenMP
– Introduction to OpenMP
  1. Creating parallelism
  2. Parallel loops
  3. Synchronizing
  4. Data sharing
– Beneath the hood – shared memory hardware
– Summary

Recall Programming Model 1: Shared Memory
– A program is a collection of threads of control; they can be created dynamically, mid-execution, in some languages
– Each thread has a set of private variables, e.g., local stack variables
– Also a set of shared variables, e.g., static variables, shared common blocks, or the global heap
– Threads communicate implicitly by writing and reading shared variables
– Threads coordinate by synchronizing on shared variables
(picture: shared memory holding s, with private memories for threads P0, P1, …, Pn each holding a private i)

Parallel Programming with Threads

Overview of POSIX Threads
– POSIX: Portable Operating System Interface; an interface to operating system utilities
– PThreads: the POSIX threading interface; system calls to create and synchronize threads; should be relatively uniform across UNIX-like OS platforms
– PThreads contain support for creating parallelism and for synchronizing
– No explicit support for communication, because shared memory is implicit; a pointer to shared data is passed to a thread

Forking POSIX Threads
Signature:
  int pthread_create(pthread_t *, const pthread_attr_t *, void * (*)(void *), void *);
Example call:
  errcode = pthread_create(&thread_id, &thread_attribute, &thread_fun, &fun_arg);
– thread_id is the thread id or handle (used to halt, etc.)
– thread_attribute holds various attributes; standard default values are obtained by passing a NULL pointer; sample attributes: minimum stack size, priority
– thread_fun is the function to be run (it takes and returns void*)
– fun_arg is an argument that can be passed to thread_fun when it starts
– errcode will be set nonzero if the create operation fails

"Simple" Threading Example (compile using gcc -lpthread)
(the transcript breaks off inside main(); the loop body, the thread count of 16, and the join loop below are a minimal completion, not taken from the original slide)

  #include <stdio.h>
  #include <pthread.h>

  void* SayHello(void *foo) {
      printf("Hello, world!\n");
      return NULL;
  }

  int main() {
      pthread_t threads[16];
      int tn;
      for (tn = 0; tn < 16; tn++)
          pthread_create(&threads[tn], NULL, SayHello, NULL);
      for (tn = 0; tn < 16; tn++)
          pthread_join(threads[tn], NULL);
      return 0;
  }
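The transcript ends before the OpenMP material promised in the outline. As a point of comparison with the Pthreads example, here is a minimal OpenMP sketch of the same "hello" pattern; it is not taken from the lecture slides, and the requested thread count of 4 is an arbitrary choice for illustration.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* Request a few threads; compile with gcc -fopenmp. */
        omp_set_num_threads(4);

        /* Every thread in the team executes the parallel region once. */
        #pragma omp parallel
        {
            printf("Hello, world from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }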
