
CSDS 451 T10 - Midterm Review.pdf


Full Transcript


CSDS 451: Designing High Performant Systems for AI (Theory 9, 10/10/2023)
Sanmukh Kuppannagari ([email protected], https://sanmukh.github.io/), Case Western Reserve University

Announcements
• Midterm – Oct 17

Key AI/ML Kernels
• Kernels: Matrix-Vector Multiplication, Matrix-Matrix Multiplication, LU Decomposition
• Understand how they work, their number of operations, and their data access pattern

Performance Modeling
• Learn how to model 3 platforms: Processor-Memory, Systolic Arrays, Cluster
• Modeling (also known as performance modeling): identifying the important parameters that determine the performance of an algorithm mapped to the platform, and identifying the equations that govern that performance
• There won't be questions on sustained-performance best- or worst-case calculations, but you still need to understand how they are calculated for conceptual questions

Basic Execution Model
• Fetch-Execute, a simple view: Memory → Fetch → Execute (Processor)
• Parameters: memory access = 1 cycle for any location; execution = 1 cycle

Basic Execution Model: Calculating Algorithm Performance
• C[i] = Σ_k A[i][k] * B[k]
• Stages for one iteration (i = 0, k = 0): RA: Read A[i][k]; RB: Read B[k]; M: Res ← A[i][k] * B[k]; RC: Read C[i]; A: Add C[i] + Res; S: Store C[i]
• Rules to create a pipeline:
  - Cycles for memory operations = Fetch = 1 cycle
  - Cycles for execution operations = Execute = 1 cycle
  - All stages should have equal size, i.e., number of cycles
  - If one stage takes more cycles than the others, all stages execute at the longer cycle time
• Overlapping successive iterations (i = 0, k = 0 and i = 0, k = 1): for executing n loop iterations, total time = II·(n − 1) + L, with latency L = 6 (six stages) and initialization interval II = 1

Basic Execution Model: Data Dependencies in Pipelining
• RC is dependent upon S: RC needs to fetch the most up-to-date value of C[i]
• Impact on performance: latency – stalling in the pipeline, increased latency; initialization interval – the next iteration starts later
• With the dependency, II = 3 (L is still 6), so n loop iterations take 3·(n − 1) + 6 cycles

Basic Execution Model: Summary
• Given a loop-based program, create a pipeline for one iteration of the loop
• Determine the latency from the number of instructions in the loop
• Determine the initialization interval by handling dependencies

System Performance Metrics: Homework Example
• Assumption #3: Pipelined model (Read/Div/Store pipeline timing diagram)
• Latency = 8, initialization interval = 2, total instructions = n − k − 1 (for loop i = k)
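The pipelined-loop timing formula above is easy to check numerically. Below is a minimal Python sketch (my own illustration, not course code) that computes total cycles from the latency, the initialization interval, and the iteration count, using the matrix-vector example's six stages:

```python
def pipelined_loop_cycles(latency: int, ii: int, iterations: int) -> int:
    """Total cycles for `iterations` loop iterations: II*(n - 1) + L.

    latency: cycles for one iteration to flow through the pipeline (L)
    ii:      initialization interval, cycles between successive iteration starts
    """
    return ii * (iterations - 1) + latency

# Matrix-vector inner loop: 6 stages (RA, RB, M, RC, A, S), so L = 6.
print(pipelined_loop_cycles(latency=6, ii=1, iterations=10))  # no dependency stall
print(pipelined_loop_cycles(latency=6, ii=3, iterations=10))  # RC depends on S -> II = 3
```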
System Performance Metrics: Homework Example (continued)
• Problem – we only have two pipelines, and the previous model doesn't consider that
• Solution #1: try to handle hardware dependencies in addition to data dependencies
• Solution #2: simply divide by the maximum number of concurrent instructions and multiply by the number of pipelines
• How to find the maximum number of concurrent instructions? Idea: find the instruction k that starts just after the first instruction ends; maximum concurrent ops = k − 1; so total time = 2·T/(k − 1)

Memory System
• System performance (execution time) depends on the "rate" at which data can be accessed by the program (processor P connected to DRAM M over a 64-bit bus)
• With DRAM access latencies of 100s of cycles, we need latency-hiding mechanisms

Updated Execution Model
• Latency: time to receive the first word after the fetch instruction
• Bandwidth: rate of data transfer = bitwidth of the interconnect × frequency
• Execute – 1 cycle
• Fetch – latency + data size/bandwidth cycles

Memory System: Modeling
• We will consider two scenarios:
  1. Sustained performance (best case): data is pre-fetched and arrives at the rate of the bandwidth; no latency, data is available as fast as the bandwidth can support; if the data is available, accessing it takes 1 clock cycle (as in the basic execution model), otherwise we wait until it arrives
  2. Sustained performance (worst case): none of the data is pre-fetched; every access takes "latency" clock cycles

Limitations of Processor-Memory Platforms
• Processor performance ≫ memory performance, so the processor may idle waiting for data
• Machine balance = peak performance (FLOPs/sec, the number of floating-point ops per second) / peak bandwidth (GB/sec, the number of data items transferred per second)

The Roofline Model
• Samuel Williams, Lawrence Berkeley National Lab: https://escholarship.org/content/qt0qs3s709/qt0qs3s709.pdf

Compute Bound vs Memory Bound
• Arithmetic Intensity (AI) = for every data item fetched from memory, how many ops make use of it = (# of ops performed) / (amount of data moved from external memory)
• Algorithm dependent. Examples: AI of vector inner product = 2n / 2n = O(1); AI of dense matrix multiplication = 2n³ / 2n² = O(n)

Roofline Model
• Understand system bottlenecks – computation vs memory
• A high-level model using few parameters (ignores ISA etc.)
• Can be used to understand and evaluate an application's performance on a target
• Can be used to make architectural decisions for custom accelerators
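As a quick numerical companion to the model, here is a minimal sketch of the roofline bound: attainable performance is the minimum of peak compute and AI × peak bandwidth. The peak numbers below are placeholders I chose for illustration, not figures from the lecture:

```python
def roofline_bound(ai: float, peak_flops: float, peak_bw_bytes: float) -> float:
    """Maximum attainable FLOP/s for a given arithmetic intensity (FLOPs per byte)."""
    return min(peak_flops, ai * peak_bw_bytes)

peak_flops = 10e12      # hypothetical 10 TFLOP/s machine
peak_bw = 1e12          # hypothetical 1 TB/s memory bandwidth
machine_balance = peak_flops / peak_bw   # ridge point of the roofline: 10 FLOPs/byte

for ai in [0.1, 1, 10, 100]:
    region = "memory bound" if ai < machine_balance else "compute bound"
    bound = roofline_bound(ai, peak_flops, peak_bw)
    print(f"AI = {ai:>5}: attainable = {bound:.1e} FLOP/s ({region})")
```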
Compute Bound vs Memory Bound: Roofline Model (for a given platform)
• (Log-log plot: x-axis = arithmetic intensity, y-axis = sustained performance in ops/sec; the machine balance marks the ridge between the memory-bound and compute-bound regions, and peak performance is the horizontal roof)
• The bold red line denotes the maximum achievable performance for a given arithmetic intensity
• Memory-bound region: maximum achievable performance is bounded by memory bandwidth
• Compute-bound region: maximum achievable performance is bounded by compute capability

Compute Bound vs Memory Bound: Roofline Model (for a given platform)
• Compute bound: maximum performance is limited by the available compute resources; adding more compute resources may improve performance (for a given bandwidth); useful in designing accelerators, but not a focus of this class
• Memory bound: maximum performance is limited by memory bandwidth; reorder computations to use the bandwidth more efficiently, in other words improve arithmetic intensity to move right on the x-axis, which also improves the y-axis (we will discuss this in the next class)

Roofline Model – What about actual performance?
• For a fixed AI, improving performance (going up in the plot) requires better parallelism (we will study such techniques later in this course)
• Once the bold red line is reached:
  - Green region (memory bound): write a better algorithm to improve arithmetic intensity (studied later in this course)
  - Yellow region (compute bound): no algorithmic solution; you need to add more compute power

Techniques to Improve Application Performance
• Compute parallelization: decompose tasks to enable concurrent processing (we will discuss this later in this course)
• Memory optimizations: optimizations addressing latency, optimizations targeted toward maximizing bandwidth utilization, and optimizations that reduce the amount of data transferred

Memory Optimizations: Summary
• Two main techniques:
  - #1: Improve bandwidth or reduce latency by optimizing the data layout
  - #2: Reorder the computations of an algorithm to leverage the cache and improve data reuse

Optimizations Targeting Latency
• Latency hiding using data transfers
• Key takeaway for this course: transfers of sequential data structures (such as arrays) lead to streaming data transfers, i.e., transfers in which the latency of subsequent data items is hidden
• (Timeline figure: a read request, the initial latency, and the arrival times of subsequent data items)

Optimizations to Improve Bandwidth
• In practice, accessing sequential data leads to better packing: the full bus width (e.g., 64 bits) should be filled with useful data every cycle to maximize bandwidth
• Memories usually have a much larger bitwidth than 64 bits; the Nvidia A100's memory has a 5120-bit (640-byte) wide bus
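To make the streaming-versus-latency distinction concrete, here is a small sketch that estimates fetch time under the two sustained-performance scenarios: every access paying the full latency versus one latency followed by data arriving at the bandwidth rate. The latency and bandwidth values are assumptions for illustration only:

```python
LATENCY_NS = 100.0        # assumed DRAM access latency per request
BW_BYTES_PER_NS = 8.0     # assumed 8 GB/s bandwidth = 8 bytes per ns
WORD_BYTES = 8

def worst_case_ns(n_words: int) -> float:
    """Worst case: nothing is pre-fetched, every access pays the full latency."""
    return n_words * LATENCY_NS

def streaming_ns(n_words: int) -> float:
    """Best case: one latency, then sequential words arrive at the bandwidth rate."""
    return LATENCY_NS + n_words * WORD_BYTES / BW_BYTES_PER_NS

n = 1_000_000
print(f"worst case (per-access latency): {worst_case_ns(n) / 1e6:.1f} ms")
print(f"streaming (latency hidden):      {streaming_ns(n) / 1e6:.3f} ms")
```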
#1 Data Layout Optimization
• Storing a 2-dimensional array in memory:
  - Row-major order: A(i, j) → Memory(i·n + j), 0 ≤ i, j < n
  - Column-major order: A(i, j) → Memory(i + n·j), 0 ≤ i, j < n

#1 Data Layout Optimization
• Assume A is stored in row-major order
• For k = 1 to n / For i = 1 to n / C[i] ← A[i][k] + B[k] + C[i]: the program's access pattern over A is column-major, so data access is not sequential and each read of A[i][k] will incur the latency
• For i = 1 to m / For k = 1 to n / C[i] ← A[i][k] + B[k] + C[i]: the access pattern over A is row-major, so data access is sequential and A[i][k] is read at the rate of the bandwidth

Cache Locality of References (algorithm characteristic)
• Spatial locality: if location i is referenced at time t, then locations near i are referenced in a small window of time following t
• Temporal locality: in a small window of time, there are repeated references to a small set of data items; storing these in the cache reduces traffic to main memory

Execution Model with Cache
• Example system: DRAM (4 GB, 10 ns latency) connected over a 64-bit bus at 1 GHz (64 Gbit/s = 8 GB/s bandwidth) to a 2 GHz processor with 2 execution pipelines and a cache that is small (16 MB) but fast (2 GHz)
• Fetch_Memory – latency + data size/bandwidth cycles; Fetch_Cache – 1 cycle; Execute – 1 cycle

Cache Limitations
• In practice, the cache size is limited: previously fetched data is evicted from the cache if there is no space
• Circling back to Technique #2

Reordering Computations for Improved Data Reuse
• Objective: reorder the computations to maximize reuse of the fetched data
• Definition: Data Reuse Factor (for an algorithm) = number of computations / number of data transfers; can be used to determine the effectiveness of an algorithm

Block Matrix Multiplication, C = (A × B), both N × N, cache size S
• Data reuse factor with an infinite cache:
  - Fetch matrix A – N² data transfers; fetch matrix B – N² data transfers; fetch matrix C – N² data transfers
  - Compute C – 2N³ computation operations; save C back to memory – N² data transfers
  - Data reuse factor: 2N³ / 4N² = O(N)
• Key idea: partition matrices A/B/C into b × b blocks, denoted Ab[1:N/b][1:N/b], Bb[1:N/b][1:N/b], Cb[1:N/b][1:N/b]
• Cb[i][j] = BlockedMM(Ab[i][:], Bb[:][j]): for k = 0 to N/b, fetch Ab[i][k] and Bb[k][j], then multiply and accumulate into Cb[i][j] (PA 1: actual code)
• The three resident blocks must fit in cache: S = 3b², so b = √(S/3) ≈ O(√S); data reuse factor = ?
• In each BlockedMM call: fetching Ab[i][k], Bb[k][j] for all k – O(S × N/√S) data transfers; computations for all k – O(√S³ × N/√S); storing Cb[i][j] – O(S) data transfers; number of BlockedMM calls – (N/b) × (N/b)
• Totals: computations – O(S × N × N/b × N/b) = O(N³); data transfers – O(S × N/√S × N/b × N/b) = O(N³/√S)
• Data reuse factor: O(√S), versus O(1) for the naïve case (no blocking with a limited cache)
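A minimal NumPy sketch of the blocked loop nest just described; the block size b is a free parameter here (in the cache-blocked setting it would be chosen so the three b × b blocks fit in cache, S ≈ 3b²):

```python
import numpy as np

def blocked_matmul(A: np.ndarray, B: np.ndarray, b: int) -> np.ndarray:
    """C = A @ B computed block by block; N is assumed to be a multiple of b."""
    N = A.shape[0]
    C = np.zeros((N, N), dtype=A.dtype)
    for i in range(0, N, b):
        for j in range(0, N, b):
            # Cb[i][j] stays resident while we sweep over all k-blocks.
            for k in range(0, N, b):
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
assert np.allclose(blocked_matmul(A, B, b=32), A @ B)
```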
Systolic Arrays
• A 1-, 2-, or 3-dimensional array of locally connected data processing units
• Input is received from the top and the left; output comes out of the bottom and the right
• Programming systolic arrays → determining the flow of data that achieves the desired computation

Matrix-Vector Multiplication using a 1D Systolic Array
• MV using a systolic array of size p < n: repeat the previous (p-sized) steps; total time = 3p × ⌈n/p⌉ × ⌈n/p⌉
• With a full-size array: MV number of ops = n², MV total time on the systolic array = 3n, number of processors = n, bandwidth needed = n

Matrix Multiplication using 2D Systolic Arrays
• (Sequence of figures: A of size m × k is fed from the left and B of size k × n from the top, with rows/columns staggered by one cycle, and the products flow through the array)
• Total time for matrix multiplication with A of size m × k and B of size k × n: number of rounds = ⌈n/p⌉ × ⌈k/p⌉; total time = (3p + m − 2) × ⌈n/p⌉ × ⌈k/p⌉

Systolic Arrays – Items to Note
• In each cycle, all p × p processors work; it may be non-useful work (idling), but it still counts as cost/work in our cost-optimality definition
• Won't ask you to design a new systolic-array-based algorithm like you did in the written assignment; may ask questions on the algorithms we studied in class

Cluster of Accelerators
• N processors (memory + accelerator), each with local compute and local memory, connected using an interconnection network; communication is through message passing (interconnection network linking P0, P1, …, Pn−1)
• Notes: don't worry about blocking vs non-blocking send/receive (won't ask questions on these); review the two algorithms we studied: addition using recursive doubling, and Cannon's algorithm

Adding Using Message Passing on a 1D Mesh
• Key idea: recursive doubling over A(0), …, A(n − 1)
• In iteration i, processors j and j + 2^i communicate, where j = k·2^(i+1) for some k, and processor j computes the partial sum
• The set of active (receiving) processors shrinks each iteration: i = 0 → 0, 2, 4, 6, …; i = 1 → 0, 4, …; i = 2 → 0
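Before the message-passing pseudocode that follows, here is a minimal single-process Python sketch (my own illustration, not the course code) of the recursive-doubling pattern: in iteration i, element j absorbs element j + 2^i exactly when j is a multiple of 2^(i+1):

```python
import math

def recursive_doubling_sum(values: list[float]) -> float:
    """Sum n = 2^m values using the recursive-doubling communication pattern."""
    a = list(values)
    n = len(a)
    for i in range(int(math.log2(n))):
        step = 2 ** i
        # The "receivers" are the indices j = k * 2^(i+1).
        for j in range(0, n, 2 * step):
            a[j] += a[j + step]        # models: receive A(j + 2^i), then add locally
    return a[0]

print(recursive_doubling_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36, after log2(8) = 3 iterations
```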
Adding Using Message Passing on a 1D Mesh: Message-Passing Algorithm (SPMD model)
Program in process j, 0 ≤ j ≤ n − 1:
1. Do i = 0 to log2(n) − 1
2.   If j = k·2^(i+1) + 2^i for some k ∈ N
3.     Send A(j) to process j − 2^i   (a 2^i-distance communication)
4.   Else if j = k·2^(i+1) for some k ∈ N
5.     Receive A(j + 2^i) from process j + 2^i
6.     A(j) ← A(j) + A(j + 2^i)
7.   End
8.   Barrier
9. End
Note: A(j) is local to process j; N = the set of natural numbers = {0, 1, …}

Performance Analysis of Clusters
• Total time = total computation time + total communication time
• Total computation time = number of rounds × execution time per round
• Total communication time = number of rounds × communication time per round
• Also check the total bandwidth needed to support the communication

Performance Analysis (1)
• Total computation time = number of rounds × computation time per round; note that each processor does the same computation in each round
• Computation time per round is calculated using the processor-memory or systolic-array models; here it is O(1)
• Number of rounds = log N, so total computation time = O(log N)

Performance Analysis (2)
• Total communication time depends on the underlying interconnection topology; we will assume a fully connected network in this class
• A transfer of k data items from any processor to any processor takes O(k) time; assuming t_w is the per-word transfer time, this equals k × t_w
• Note: actual interconnection modeling is much more complicated and could be a lecture or two in itself

Performance Analysis (3)
• Total communication time = number of rounds × communication time per round = log N × (1 × t_w) = (log N)·t_w

Performance Analysis (4)
• Challenge: does the system have enough communication bandwidth to support the communication requirement of the algorithm?
• Iteration 0 has the most (N/2) processors active, so the maximum communication requirement is (N/2)·1 < B, where B is the maximum bandwidth supported by the system

MM using Message Passing (1): Cannon's Algorithm
• C ← A × B with n × n matrices
• √p × √p processors (processes) in a 2D mesh, P_ij with 0 ≤ i, j < √p and 1 ≤ √p ≤ n
• Blocking send/receive, SPMD model of concurrency
• Processor P_i,j is assigned A_i,j, B_i,j, C_i,j (this data is local to the processor), the (i, j)-th block of size (n/√p) × (n/√p)

MM using Message Passing (2)
• Primitive operations on the processor grid: circular left shift within a row and circular up shift within a column (positions 0 … √p − 1)

MM using Message Passing (3): Initial data alignment
• For A: the i-th row is circularly left-shifted by i (0 ≤ i < √p)
• For B: the j-th column is circularly up-shifted by j (0 ≤ j < √p)
• Example (4 × 4 blocks on a 4 × 4 processor array), A and B after the initial alignment:
  A00 B00 | A01 B11 | A02 B22 | A03 B33
  A11 B10 | A12 B21 | A13 B32 | A10 B03
  A22 B20 | A23 B31 | A20 B02 | A21 B13
  A33 B30 | A30 B01 | A31 B12 | A32 B23
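The initial alignment is just a per-row and per-column circular shift, which is easy to express with NumPy on a grid of block labels. A minimal sketch (labels only, 4 × 4 processor array as in the example above):

```python
import numpy as np

q = 4  # sqrt(p): the processor grid is q x q
A = np.array([[f"A{i}{j}" for j in range(q)] for i in range(q)])
B = np.array([[f"B{i}{j}" for j in range(q)] for i in range(q)])

# Initial alignment: row i of A circularly shifted left by i,
# column j of B circularly shifted up by j.
for i in range(q):
    A[i, :] = np.roll(A[i, :], -i)
for j in range(q):
    B[:, j] = np.roll(B[:, j], -j)

print(A)  # second row becomes A11 A12 A13 A10, matching the layout above
print(B)  # second column becomes B11 B21 B31 B01
```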
MM using Message Passing (4): Parallel algorithm (global view)
1. Initial data alignment
2. Repeat √p times (one super step):
   ➢ All processors P_i,j perform an (n/√p) × (n/√p) matrix multiplication in parallel using local data
   ➢ In parallel, every processor P_i,j circularly left-shifts its A block by 1 position within its row
   ➢ In parallel, every processor P_i,j circularly up-shifts its B block by 1 position within its column
   End
• The data realignment is done with message passing (a permutation within each row and within each column)
• Note: A, B, C are partitioned into (n/√p) × (n/√p) blocks, local to each processor

MM using Message Passing (5): Parallel algorithm (local view from P_i,j)
Repeat √p times (one super step):
 ➢ C ← C + A × B   (an (n/√p) × (n/√p) matrix multiplication)
 ➢ A ← read from the right neighbor {i, (j + 1) mod √p}
 ➢ B ← read from the neighbor below {(i + 1) mod √p, j}
End

MM using Message Passing (6)–(9): Illustration (4 × 4 blocks, 4 × 4 processor array)
• Super steps 0, 1, 2: compute using local data, then circularly left-shift A and up-shift B; super step 3: compute using local data
• After each super step every A block has moved one position left and every B block one position up (circularly), so after the four super steps each processor has accumulated its full C block
• Note: it is easy to visualize/understand using p = n

Performance Analysis (1)
• Total computation time = number of rounds × time per round
• Number of rounds = √p; time per round = O((n/√p)³)
• Total computation time = O(n³/p)

Performance Analysis (2)
• Total communication time = number of rounds × communication time per round
• Number of rounds = √p; communication time per round on a fully connected network = O((n/√p) × (n/√p)) = O(n²/p)
• Total communication time = O(n²/√p)

Performance Analysis (3)
• Communication bandwidth needed on a fully connected network: total data transferred in each communication step = 2 · (n/√p) · (n/√p) · p = O(n²)
• To ensure communication is not a bottleneck, the system bandwidth must satisfy B > O(n²)

Performance Analysis (4)
• In other words, the largest matrix that can be solved using p processors on this system, with this algorithm, is √B × √B
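A minimal single-process NumPy sketch of the local view above, simulating the √p super steps on a grid of blocks (the list rotations stand in for the message passing); this is my own illustration under the same blocking assumptions, not the course's reference code:

```python
import numpy as np

def cannon_matmul(A: np.ndarray, B: np.ndarray, q: int) -> np.ndarray:
    """Simulate Cannon's algorithm on a q x q grid of (n/q) x (n/q) blocks (q = sqrt(p))."""
    n = A.shape[0]
    b = n // q
    block = lambda M, i, j: M[i*b:(i+1)*b, j*b:(j+1)*b].copy()
    # Initial alignment: row i of A shifted left by i, column j of B shifted up by j.
    Ab = [[block(A, i, (i + j) % q) for j in range(q)] for i in range(q)]
    Bb = [[block(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    for _ in range(q):                                    # sqrt(p) super steps
        for i in range(q):
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]           # local compute
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]  # A: shift left
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # B: shift up
    return np.block(Cb)

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(cannon_matmul(A, B, q=4), A @ B)
```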
Analyzing Performance of Parallel Programs
• Speedup & Amdahl's law
• Scaled speedup & Gustafson's law
• Efficiency
• Scalability
• Cost/work performed by a parallel algorithm, and cost/work optimality

Speedup
• Improvement in time due to parallelization: Speedup = serial time (on a uniprocessor system) / time after parallelization
• Example: serial time = 1, parallel time = 0.8, so speedup = 1/0.8 = 1.25

Amdahl's Law
• Amdahl's law gives a limit on the speedup achievable when a program is run on a parallel machine
• Given an input program with serial portion S and parallelizable portion P, total execution time = S + P
• Time on a uniprocessor machine: 1 = S + P; time on a parallel machine: S + P/f, where f is the speedup factor
• Speedup = serial time / parallel time = (S + P)/(S + P/f) = 1/(S + P/f) ≤ 1/S
• If the serial portion is 50%, then the overall speedup is ≤ 2
• Note: the speedup factor f ≤ the number of processors

Scalability
• Speedup = serial time (on a uniprocessor system) / parallel time using p processors
• If speedup = O(p), then it is a scalable solution

Scaled Speedup (Gustafson's Law)
• Amdahl's law assumes a fixed amount of computation: serial time = 1 = S + (1 − S), parallel time = S + (1 − S)/p, with p the number of processors
• Gustafson's law increases both p and the amount of computation: parallel time = 1 = S′ + P′ with P′ = 1 − S′; if the parallelism scales linearly with p, the corresponding serial time is S′ + (1 − S′)·p
• Scaled speedup = serial time / parallel time = (S′ + (1 − S′)·p) / (S′ + P′) = S′ + (1 − S′)·p

Strong or Weak Scaling
• Strong scaling: governed by Amdahl's law; the number of processors is increased while the problem size remains constant, which reduces the workload per processor
• Weak scaling: governed by Gustafson's law; both the number of processors and the problem size are increased, keeping the workload per processor constant

Performance (1): Efficiency
• Question: if we use p processors, is the speedup equal to p? (In a typical execution each processor spends some time on useful work and some time idle or on overhead)
• Efficiency = the fraction of time a processor is usefully employed during the computation: E = speedup / number of processors used
• E is the average efficiency over all the processors; the efficiency of each individual processor can differ from the average value

Performance (3): Cost
• Cost = total amount of work done by a parallel system = parallel execution time × number of processors = T_p × p; also called the processor-time product
• A parallel algorithm is COST OPTIMAL (or WORK OPTIMAL) if its total work equals the serial complexity of the problem
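These metrics are all short formulas; a small sketch (with illustrative values of my choosing, not numbers from the course) putting Amdahl's law, Gustafson's scaled speedup, efficiency, and cost side by side:

```python
def amdahl_speedup(s: float, f: float) -> float:
    """Speedup with serial fraction s and parallel speedup factor f (f <= #processors)."""
    return 1.0 / (s + (1.0 - s) / f)

def gustafson_speedup(s: float, p: int) -> float:
    """Scaled speedup when the parallel part grows linearly with p."""
    return s + (1.0 - s) * p

p, s = 16, 0.5
print(amdahl_speedup(s, p))       # ~1.88, bounded above by 1/s = 2
print(gustafson_speedup(s, p))    # 8.5, keeps growing with p

speedup = amdahl_speedup(s, p)
efficiency = speedup / p          # E = speedup / number of processors
cost = (1.0 / speedup) * p        # T_p * p, with the serial time normalized to 1
print(efficiency, cost)
```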
Writing Parallel Programs
• Task parallelism: the recursive doubling algorithm; task graphs, calculating various metrics, and scheduling
• Data parallelism: data decomposition techniques

Task Parallelism Techniques
• Recursive doubling – good for aggregation operations such as the sum of elements
• In iteration i, processors j and j + 2^i communicate, where j = k·2^(i+1), and processor j computes; the set of active processors shrinks each iteration

Tasks and Dependencies
• Computation = decompose into tasks; a task = a set of instructions (a program segment) with inputs and outputs
• A task can begin execution once all of its inputs are available
• Task size = the weight of the node (e.g., the number of instructions executed); note that tasks need not be of the same size (fine grain vs. coarse grain)

Example (1)
• Code: A[2] = A[0] + 1; B[0] = A[2] + 1
• Instructions as tasks: T0: Load R0 ← A[0]; T1: Add R1 ← R0 + 1; T2: Add R2 ← R1 + 1; T3: Store A[2] ← R1; T4: Store B[0] ← R2
• Task dependency graph: T0 → T1; T1 → T2 and T1 → T3; T2 → T4

Example (2)
• Matrix-vector multiplication, C[i] = Σ_k A[i][k] * B[k], with an n × n matrix
• Task i computes C[i]: parallel for i = 1 to n, compute C[i] (tasks T0 … T_{n−1}), then output C

Maximum Degree of Concurrency
• Given a task dependency graph, the maximum number of tasks that can be executed concurrently = the maximum parallelism achievable at any point in the code → used to decide the number of processors
• Example: level-by-level ordering (topological sort): order tasks level by level; any level-i task depends on some task in level i − 1 (and possibly other lower levels) but on no task in level i, so all tasks within a level are independent; execute the level-i tasks in parallel, then the level-(i + 1) tasks
• Find the number of tasks in each level and take the maximum over all levels = the maximum degree of concurrency (when scheduled using level-by-level ordering)
• Example: a 10-task dependency graph organized into levels 0–3 has maximum degree of concurrency = 4

Critical Path
• In a dependency graph, start nodes have in-degree 0 and finish nodes have out-degree 0
• Critical path = a longest path from a start node to a finish node (counted in edges); critical path length = the sum of the task weights of the nodes along the critical path
• For a given number of tasks, a longer critical path ⇒ a longer execution time (and may also mean less concurrency)
• Two extremes: a chain T0 → … → T_{n−1} has n tasks and critical path length n, while a balanced reduction tree (n/2, n/4, …, 1 tasks per level) has n − 1 tasks and critical path length log2 n
• Weighted example (figure with Tasks 1–7): critical path length = 10 + 6 + 11 + 7 = 34

Task Dependency Graph to Parallel Program
• Given a task dependency graph, assume the weight of each node is 1; let the maximum degree of concurrency be c and the critical path length be l
• Idea: organize the DAG into levels 0, 1, …, l (level-by-level ordering, l + 1 levels) and execute it level by level, from 0 to l; the total number of processors needed is ≤ c
• (Figure: two possible mappings of tasks T1–T7 to processes P0–P3)

Task Parallelism Example – Merge Sort
• MS(A(0), …, A(n − 1)): if |A| = 1, return; otherwise do in parallel MS(A(0), …, A(n/2 − 1)) and MS(A(n/2), …, A(n − 1)); then merge the two sorted sequences of size n/2 (decompose, sort, merge)
• Task dependency graph for n = 8 (node weight = number of operations: decompose tasks of weight 1, merge tasks of growing weight such as 3 and 7)
• Maximum degree of concurrency = 8; critical path length (# of edges) = 7
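A small sketch (my own helper, not course code) that computes these two metrics for a DAG given as adjacency lists, using the level-by-level ordering described above; node weights are taken as 1, matching the assumption on the slides, so the critical path length is reported in edges:

```python
from collections import defaultdict

def dag_metrics(edges: dict[str, list[str]]) -> tuple[int, int]:
    """Return (maximum degree of concurrency, critical path length in edges)."""
    nodes = set(edges) | {v for targets in edges.values() for v in targets}
    preds = defaultdict(list)
    for u, targets in edges.items():
        for v in targets:
            preds[v].append(u)
    level: dict[str, int] = {}
    remaining = set(nodes)
    while remaining:                      # level-by-level (topological) ordering
        ready = {n for n in remaining if all(p in level for p in preds[n])}
        for n in ready:
            level[n] = max((level[p] + 1 for p in preds[n]), default=0)
        remaining -= ready
    width = defaultdict(int)
    for lvl in level.values():
        width[lvl] += 1
    return max(width.values()), max(level.values())

# The 5-task example above: T0 -> T1 -> {T2, T3}, T2 -> T4.
print(dag_metrics({"T0": ["T1"], "T1": ["T2", "T3"], "T2": ["T4"]}))  # (2, 3)
```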
Cost of the Parallel (Merge Sort) Algorithm
• Cost = parallel execution time × number of processors = T_p × p; assume p = n
• Let T_p(n) be the parallel time for merge sort using p processes on n data items: T_n(n) = T_{n/2}(n/2) + O(n) (the serial merge), with T_1(1) = O(1), so T_n(n) = O(n)
• Note: the decomposition is fast; the merge takes most of the time
• Cost = n × O(n) = O(n²), while the cost of serial merge sort is O(n log n), so this algorithm is not cost optimal for n processors

Data Parallelism – Approaches
• #1 Partition the output data: each task is responsible for computing one output partition (output split across P0 … P_{p−1})
• Useful when the partitions of the output can be computed independently of each other; example: blocked matrix multiplication
• Matrix multiplication example: P0 computes C00 from A0,: and B:,0, and P1 computes C01 from A0,: and B:,1; the input data may be shared, but the output data is distinct
• #2 Partition the input data: each task is responsible for performing "all" operations on its partition (input split across P0 … P_{p−1})
• The tasks can either directly produce the final output, or produce intermediate output that requires another set of tasks to produce the final output
• Suitable when the outputs cannot be computed independently; examples: aggregation functions, hashing, …

Data Parallelism – Hash Table Example
• Problem: constructing a hash table; input: a set of keys from the universe U; output: a mapping of each key to an index in {0, 1, …, m − 1}
• Can we do data parallelism based on output partitioning? Yes, but:
  - What if a partition has no keys mapped to it? – an idle processor
  - What if a partition has too many keys mapped to it? – a load-balancing issue
  - Does each processor need to look at the entire dataset? – wasteful computations
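A small sketch of the output-partitioned variant just described (my own illustration): every worker owns a slice of the buckets and scans the whole key set, keeping only the keys that hash into its range, which is why each key's hash ends up being computed p times:

```python
def output_partitioned_hash(keys: list[str], p: int, m: int) -> list[dict[str, int]]:
    """Each of the p workers owns a contiguous slice of the m buckets and scans ALL keys."""
    tables: list[dict[str, int]] = [{} for _ in range(p)]
    for worker in range(p):                      # conceptually runs in parallel
        lo, hi = worker * m // p, (worker + 1) * m // p
        for key in keys:                         # every worker reads the full input
            idx = hash(key) % m                  # the hash is recomputed by every worker
            if lo <= idx < hi:
                tables[worker][key] = idx
    return tables

print(output_partitioned_hash(["a", "b", "c", "d"], p=2, m=8))
```

No synchronization is needed because the workers write to disjoint tables, at the price of redundant scans and hash computations; input partitioning trades this for synchronization on a shared structure, as discussed next.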
Data Parallelism – Approaches
• #2 Partition the input data: divide the keys equally among the processors P0 … P_{p−1}
• Pros: no wasteful computations, no idle processors
• Cons: may need synchronization; load-balancing issues – processors that see more conflicts will complete later than processors that see fewer conflicts
• Revisiting output partitioning for hash tables: cons – wasteful computations (notice that the hash value of each key is computed p times); pros – no synchronization
• #3 Partition both the inputs and the outputs: for each input-output pair, assign a processor; with k_i input partitions and k_o output partitions, p = k_i × k_o
• Number of hash computations per key = k_o (versus p in output-only partitioning); number of processors that may conflict = k_i (versus p in input-only partitioning)

Block Distribution
• A block = a contiguous portion of a 2-D array assigned to one process; example with four processes P0–P3: row-wise distribution vs. column-wise distribution
• With p = p1 × p2 processes, the block size is (n/p1) × (n/p2); examples: a 4 × 4 process grid (P0 … P15) and a 2 × 8 process grid

Cyclic and Block Distribution
• Problem: distribute a 1-D array of n > p elements to p processes; example n = 8, p = 2, elements A(0) … A(7)
• Cyclic: P0 P1 P0 P1 P0 P1 P0 P1; block: P0 P0 P0 P0 P1 P1 P1 P1, with block size n/p = 4

Block Cyclic Distribution
• A variation of block distribution: partition the work into many more blocks (αp) than the number of processes, where n is the problem size, p the number of processes, L = n/(αp) the block size, and 1 ≤ α ≤ n/p (so L ranges from n/p down to 1)
• Distribute the blocks in a wraparound fashion: block b_i → P_{i mod p} (mod is the modulo operator)
• n = 16, p = 4, α = 1: block size L = 4, blocks go to P0 P1 P2 P3; if the work grows from left to right, P0 does less work than P3
• α = 2: block size L = 2, blocks cyclically distributed as P0 P1 P2 P3 P0 P1 P2 P3 – more balanced than α = 1
• α = 4: block size L = 1, distribution P0 P1 P2 P3 repeated four times – best load balance; note that block size 1 is exactly the cyclic distribution

Convolution Layer
• Input feature map of size H × W × C_in (H: height, W: width, C_in: number of channels), and C_in × C_out kernels, each of size k × k

Performance Modeling of a Convolution Layer
• Number of operations = H × W × C_out output pixels × (k × k × C_in) operations per pixel
• Memory transfer (with an infinite cache) = H × W × C_in + C_in × C_out × k × k + H × W × C_out ≈ H × W × (C_in + C_out)
• Theoretical arithmetic intensity = k² C_in C_out / (C_in + C_out); example value for k = 9, C_in = 3, C_out = 64: ≈ 232
• Example machine intensity = 75 (Nvidia A100*) – how?
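A quick sketch reproducing the arithmetic-intensity number above, plus one possible machine-balance comparison. The A100 figures used below (roughly 156 TFLOP/s TF32 tensor-core peak and about 2 TB/s HBM bandwidth) are my assumption for how a "machine intensity" near 75 could be obtained, not values stated on the slides:

```python
def conv_arithmetic_intensity(k: int, c_in: int, c_out: int) -> float:
    """Ops per data element moved, assuming an infinite cache (the H*W factor cancels)."""
    return (k * k * c_in * c_out) / (c_in + c_out)

print(round(conv_arithmetic_intensity(k=9, c_in=3, c_out=64)))   # ~232

# One assumed way to land near a machine balance of 75 (NOT from the slides):
# tensor-core peak divided by HBM bandwidth, in FLOPs per byte.
peak_flops = 156e12        # assumed ~156 TFLOP/s (TF32 tensor cores)
peak_bw_bytes = 2.039e12   # assumed ~2 TB/s HBM2e bandwidth
print(round(peak_flops / peak_bw_bytes))                          # ~77
```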
*https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-nvidia-us-2188504-web.pdf

CNN Basics
• Won't go into training or backpropagation concepts

Next Classes
• 10/17 Midterm

Thank You
• Questions? Email: [email protected]
