COMP 426: Multicore Programming (CUDA) Fall 2024 PDF
Document Details
Uploaded by RiskFreeAlien
Concordia University
2024
COMP 426
David Kirk, NVIDIA
Tags
Summary
These lecture notes cover COMP 426: Multicore Programming, focusing on CUDA (Compute Unified Device Architecture) and GPGPU concepts. They provide an overview of the technology and related information.
Full Transcript
COMP 426: Multicore Programming CUDA (Compute Unified Device Architecture) Based on the slides of David Kirk, NVIDIA COMP 426, Fall 2024 CUDA Compute Unified Device Architecture (CUDA) What is GPGPU? General-Purpose computing on a Graphics...
COMP 426: Multicore Programming CUDA (Compute Unified Device Architecture) Based on the slides of David Kirk, NVIDIA COMP 426, Fall 2024 CUDA Compute Unified Device Architecture (CUDA) What is GPGPU? General-Purpose computing on a Graphics Processing Unit Using graphic hardware for non-graphic computations What is CUDA? Compute Unified Device Architecture Software architecture for managing data- parallel programming COMP 426, Fall 2024 CUDA 1 GPU vs. CPU COMP 426, Fall 2024 CUDA 2 CPU vs. GPU - Hardware More transistors devoted to data processing COMP 426, Fall 2024 CUDA 3 Traditional Graphics Pipeline zy z y y x x z y x z Geometry Rasterization y Processing Processing z x y x x Performed by (GP)GPU y Geometry Pipeline Processing Vertices x Mainly floating-point operations Rasterization Pipeline Processing Pixels Mainly dealing with Integer operations COMP 426, Fall 2024 CUDA 4 Pixel / Thread Processing COMP 426, Fall 2024 CUDA 5 G80 Device Thread Execution Manager issues threads 128 Thread Processors grouped into 16 multiprocessors (SMs) Parallel Data Cache enables thread cooperation COMP 426, Fall 2024 CUDA 6 Processing Element Processing element = thread processor = ALU COMP 426, Fall 2024 CUDA 7 Hardware implementation A set of SIMD Multiprocessors with On-Chip shared memory COMP 426, Fall 2024 CUDA 8 Streaming Multiprocessor (SM) Processing elements 8 scalar thread processors 32 GFLOPS peak at 1.35GHz 8192 32-bit registers (32KB) usual ops: float, int, branch,... Hardware multithreading up to 8 blocks (3 active) residents at once up to 768 active threads in total 16KB on-chip memory supports thread communication shared amongst threads of a block COMP 426, Fall 2024 CUDA 9 Data-parallel Programming Think of the GPU as a massively- threaded co-processor Write “kernel” functions that execute on the device/GPU -- processing multiple data elements in parallel Keep it busy! massive threading Keep your data close! local memory COMP 426, Fall 2024 CUDA 10 What is GPGPU ? General Purpose computation using GPU in applications other than 3D graphics GPU accelerates critical path of application Data parallel algorithms leverage GPU attributes Large data arrays, streaming throughput Fine-grain SIMD parallelism Low-latency floating point (FP) computation Applications – see //GPGPU.org Game effects (FX) physics, image processing Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting COMP 426, Fall 2024 CUDA 11 GPU vs. CPU CPU Fast caches Branching adaptability High performance GPU Multiple ALUs Fast onboard memory High throughput on parallel tasks Executes program on each fragment/vertex CPUs are great for task parallelism GPUs are great for data parallelism COMP 426, Fall 2024 CUDA 12 Previous GPGPU Constraints Dealing with graphics API To get general purpose code working, you had to re-write entire program as a collection of shaders and polygons COMP 426, Fall 2024 CUDA 13 CUDA “Compute Unified Device Architecture” General purpose programming model User kicks off batches of threads on the GPU GPU = dedicated super-threaded, massively data parallel co-processor Targeted software stack Compute oriented drivers, language, and tools Driver for loading computation programs onto GPU COMP 426, Fall 2024 CUDA 14 CUDA – C with a Co-processor One program, two devices Serial or modestly parallel parts in host C code Highly parallel parts in device kernel C code Serial Code (host) Parallel Kernel (device)... KernelA>(args); Serial Code (host) Parallel Kernel (device) KernelB>(args);... COMP 426, Fall 2024 CUDA 15 CUDA Devices and Threads A CUDA compute device Is a coprocessor to the CPU or host Has its own DRAM (device memory) Runs many threads in parallel Is typically a GPU but can also be another type of parallel processing device Data-parallel portions of an application are expressed as device kernels which run on many threads COMP 426, Fall 2024 CUDA 16 Differences between GPU and CPU threads GPU threads are extremely lightweight Very little creation overhead GPU needs 1000s of threads for full efficiency Multi-core CPU needs only a few (and is hurt by having too many) COMP 426, Fall 2024 CUDA 17 Buzzword: Kernel In CUDA, a kernel is code (typically a function) that can be run inside the GPU. The kernel code runs on many of the stream processors in the GPU in parallel. Each processor runs the code over different data (SPMD) COMP 426, Fall 2024 CUDA 18 Buzzword: Thread In CUDA, a thread is an execution of a kernel with a given index. Each thread uses its index to access a specific subset of the data, such that the collection of all threads cooperatively processes the entire data set. Think: Process ID These operate very much like threads in OpenMP they even have shared and private variables. So what’s the difference with CUDA? Threads are free COMP 426, Fall 2024 CUDA 19 Buzzword: Block In CUDA, a block is a group of threads. Blocks are used to organize threads into manageable (and schedulable) chunks. Can organize threads in 1D, 2D, or 3D arrangements What best matches your data? Some restrictions, based on hardware Threads within a block can do a bit of synchronization, if necessary. COMP 426, Fall 2024 CUDA 20 Buzzword: Grid In CUDA, a grid is a group of blocks no synchronization at all between the blocks. Grids are used to organize blocks into manageable (and schedulable) chunks. Can organize blocks in 1D or 2D arrangements What best matches your data? A grid is the set of threads created by a call to a CUDA kernel COMP 426, Fall 2024 CUDA 21 Mapping Buzzwords to GPU Hardware Grids map to GPUs Blocks map to the MultiProcessors (MP) Blocks are never split across MPs, but an MP can have multiple blocks Threads map to Stream Processors (SP) Warps are groups of (32) threads that execute simultaneously Completely forget these exist Image Source: until you get good at this NVIDIA CUDA Programming Guide COMP 426, Fall 2024 CUDA 22 What is CUDA General purpose programming model User kicks off batches of threads on the GPU GPU = dedicated super-threaded, massively data parallel co-processor Targeted software stack Compute oriented drivers, language, and tools Driver for loading computation programs into GPU Standalone Driver - Optimized for computation Interface designed for compute - graphics free API Data sharing with OpenGL buffer objects Guaranteed maximum download & readback speeds Explicit GPU memory management COMP 426, Fall 2024 CUDA 23 CUDA C Programming Heterogeneous programming model CPU and GPU are separate devices with separate memory spaces CPU code is standard C/C++ Driver API: low-level interface Runtime API: high-level interface (one extension to C) GPU code Subset of C with extensions CUDA goals Scale GPU code to 100s of cores, 1000s of parallel threads Facilitate heterogeneous computing COMP 426, Fall 2024 CUDA 24 CUDA Kernels and Threads Parallel portions of an application are executed on the device as kernels One kernel is executed at a time Many threads execute in each kernel Differences between CUDA and CPU threads CUDA threads are extremely lightweight Very little creation overhead Fast switching CUDA uses 1000s of threads to achieve efficiency Multi-core CPUs can use only a few COMP 426, Fall 2024 CUDA 25 Software Stack Libraries CUFFT & CUBLAS Runtime Common component Device component Host component Driver Driver API COMP 426, Fall 2024 CUDA 26 Single Instruction Multiple Thread (SIMT) Execution Groups of 32 threads formed into warps always executing same instruction share instruction fetch/dispatch some become inactive when code path diverges hardware automatically handles divergence Warps are primitive unit of scheduling pick 1 of 24 warps for each instruction slot. all warps from all active blocks are time-sliced COMP 426, Fall 2024 CUDA 27 Execution Model COMP 426, Fall 2024 CUDA 28 Thread/Warp Divergence Performance will be degraded because of warp divergence. COMP 426, Fall 2024 CUDA 29 CUDA API The CUDA source file uses.cu as extension. It contains host and device source codes. The CUDA Compiler Driver nvcc can compile it and generate CPU/PTX binary code. (PTX: Parallel Thread Execution, a device independent VM code) PTX code may be further translated for special GPU-Arch. COMP 426, Fall 2024 CUDA 30 C SAXPY void saxpy_serial(int n, float a, float *x, float *y) { int i; for(i=0; i < n; i++) { y[i] = a*x[i] + y[i]; } } … //invoke the kernel saxpy_serial(n, 2.0, x, y); COMP 426, Fall 2024 CUDA 31 SAXPY on a GPU Doing anything across an entire vector is threadID perfect for massively 0 1 2 3 4 5 6 7 parallel computing. Instead of one function … y[tid] = a*x[tid] + y[tid]; looping over the data … set, we’ll use many threads, each doing one calculation COMP 426, Fall 2024 CUDA 32 CUDA SAXPY __global__ void saxpy_cuda(int n, float a, float *x, float *y) { int i = (blockIdx.x * blockDim.x) + threadIdx.x; if(i < n) y[i] = a*x[i] + y[i]; } … int nblocks = (n + 255) / 256; //invoke the kernel with 256 threads per block saxpy_cuda(n, 2.0, x, y); COMP 426, Fall 2024 CUDA 33 Thread Life Cycle in HW Host Device Grid 1 Grid is launched on the SPA Kernel Block Block Block 1 (0, 0) (1, 0) (2, 0) Thread Blocks are serially distributed to all the SM’s Block Block Block (0, 1) (1, 1) (2, 1) Potentially >1 Thread Block per SM Grid 2 Each SM launches Warps of Kernel Threads 2 2 levels of parallelism Block (1, 1) SM schedules and executes Warps that are ready to run Thread Thread Thread Thread Thread (0, 0) (1, 0) (2, 0) (3, 0) (4, 0) As Warps and Thread Blocks Thread Thread Thread Thread Thread complete, resources are freed (0, 1) (1, 1) (2, 1) (3, 1) (4, 1) SPA can distribute more Thread Thread Thread Thread Thread Thread Blocks (0, 2) (1, 2) (2, 2) (3, 2) (4, 2) COMP 426, Fall 2024 CUDA 34 SM Executes Blocks Threads are assigned to Blocks Blocks SMs in Block granularity SM 0 SM 1 t0 t1 t2 … tm t0 t1 t2 … tm Up to 8 Blocks to each MT IU MT IU SM as resource allows SP SP SM in G80 can take up to 768 threads Could be 256 (threads/block) * 3 blocks Or 128 (threads/block) Shared Shared * 6 blocks, etc. Memory Memory Threads run TF concurrently Texture L1 SM assigns/maintains thread id #s SM manages/schedules L2 thread execution COMP 426, Fall 2024 CUDA Memory 35 Thread Scheduling/Execution Each Thread Blocks is divided in 32- … Block 1 Warps …Block 2 Warps thread Warps t0 t1 t2 … t31 t0 t1 t2 … t31 This is an implementation decision, … … not part of the CUDA programming model Warps are scheduling units in SM Streaming Multiprocessor If 3 blocks are assigned to an SM and Instruction L1 Data L1 each Block has 256 threads, how many Instruction Fetch/Dispatch Warps are there in an SM? Shared Memory Each Block is divided into 256/32 = 8 SP SP Warps SP SP There are 8 * 3 = 24 Warps SFU SFU SP SP At any point in time, only one of the SP SP 24 Warps will be selected for instruction fetch and execution. COMP 426, Fall 2024 CUDA 36 SM Warp Scheduling SM hardware implements zero- overhead Warp scheduling Warps whose next instruction has its SM multithreaded operands ready for consumption are Warp scheduler eligible for execution time Eligible Warps are selected for execution on a prioritized scheduling policy warp 8 instruction 11 All threads in a Warp execute the same instruction when selected warp 1 instruction 42 4 clock cycles needed to dispatch the warp 3 instruction 95 same instruction for all threads in a.. Warp in G80. If one global memory access is needed for warp 8 instruction 12 every 4 instructions A minimal of 13 Warps are needed to fully warp 3 instruction 96 tolerate 200-cycle memory latency COMP 426, Fall 2024 CUDA 37 Simple example (Matrix addition) cpu c program cuda program COMP 426, Fall 2024 CUDA 38 Thread Batching: Grids and Blocks Host Device Grid 1 A kernel is executed as a grid Kernel Block Block Block of thread blocks 1 (0, 0) (1, 0) (2, 0) All threads share data Block Block Block memory space (0, 1) (1, 1) (2, 1) A thread block is a batch of Grid 2 threads that can cooperate with each other by: Kernel 2 Synchronizing their execution For hazard-free shared Block (1, 1) memory accesses Efficiently sharing data Thread Thread Thread Thread Thread (0, 0) (1, 0) (2, 0) (3, 0) (4, 0) through a low latency shared memory Thread Thread Thread Thread Thread (0, 1) (1, 1) (2, 1) (3, 1) (4, 1) Two threads from two different Thread Thread Thread Thread Thread blocks cannot cooperate (0, 2) (1, 2) (2, 2) (3, 2) (4, 2) COMP 426, Fall 2024 CUDA 39 Block and Thread IDs Threads and blocks have Device Grid 1 IDs Block Block Block So each thread can decide (0, 0) (1, 0) (2, 0) what data to work on Block Block Block Block ID: 1D or 2D (0, 1) (1, 1) (2, 1) Thread ID: 1D, 2D, or 3D Simplifies memory Block (1, 1) addressing when Thread Thread Thread Thread Thread processing (0, 0) (1, 0) (2, 0) (3, 0) (4, 0) multidimensional data Thread Thread Thread Thread Thread (0, 1) (1, 1) (2, 1) (3, 1) (4, 1) Image processing Thread Thread Thread Thread Thread (0, 2) (1, 2) (2, 2) (3, 2) (4, 2) Solving PDEs on volumes COMP 426, Fall 2024 CUDA 40 Memory Model There are 6 Memory Types : COMP 426, Fall 2024 CUDA 41 Memory Model Memory Model There are 6 Memory Types : Registers on chip fast access per thread limited amount 32 bit COMP 426, Fall 2024 CUDA 42 Memory Model Memory Model There are 6 Memory Types : Registers Local Memory in DRAM slow non-cached per thread relative large COMP 426, Fall 2024 CUDA 43 Memory Model Memory Model There are 6 Memory Types : Registers Local Memory Shared Memory on chip fast access per block 16 KByte synchronize between threads COMP 426, Fall 2024 CUDA 44 Memory Model Memory Model There are 6 Memory Types : Registers Local Memory Shared Memory Global Memory in DRAM slow non-cached per grid communicate between grids COMP 426, Fall 2024 CUDA 45 Memory Model Memory Model There are 6 Memory Types : Registers Local Memory Shared Memory Global Memory Constant Memory in DRAM cached per grid read-only COMP 426, Fall 2024 CUDA 46 Memory Model Memory Model There are 6 Memory Types : Registers Local Memory Shared Memory Global Memory Constant Memory Texture Memory in DRAM cached per grid read-only COMP 426, Fall 2024 CUDA 47 Memory Model Memory Model Registers Shared Memory on chip Local Memory Global Memory Constant Memory Texture Memory in Device Memory COMP 426, Fall 2024 CUDA 48 Memory Model Memory Model Global Memory Constant Memory Texture Memory managed by host code persistent across kernels COMP 426, Fall 2024 CUDA 49 CUDA Device Memory Space Overview (Device) Grid Each thread can: Block (0, 0) Block (1, 0) R/W per-thread registers Shared Memory Shared Memory R/W per-thread local memory Registers Registers Registers Registers R/W per-block shared memory Thread (0, 0) Thread (1, 0) Thread (0, 0) Thread (1, 0) R/W per-grid global memory Read only per-grid constant Local Local Local Local memory Memory Memory Memory Memory Read only per-grid texture Host Global memory Memory The host can R/W global, Constant Memory constant, and texture Texture memories Memory COMP 426, Fall 2024 CUDA 50 Global, Constant, and Texture Memories (Device) Grid Global memory Block (0, 0) Block (1, 0) Main means of Shared Memory Shared Memory communicating R/W Data Registers Registers Registers Registers between host and device Contents visible to all Thread (0, 0) Thread (1, 0) Thread (0, 0) Thread (1, 0) threads Texture and Constant Local Local Local Local Memories Memory Memory Memory Memory Constants initialized by Host Global Memory host Constant Contents visible to all Memory threads Texture Memory COMP 426, Fall 2024 CUDA 51 A Small Detour: A Matrix Data Type NOT part of CUDA It will be frequently used in typedef struct { int width; many code examples int height; 2 D matrix int pitch; single precision float elements float* elements; } Matrix; width * height elements pitch meaningful when the matrix is actually a sub-matrix of another matrix data elements allocated and attached to elements COMP 426, Fall 2024 CUDA 52 CUDA Device Memory Allocation (Device) Grid cudaMalloc() Block (0, 0) Block (1, 0) Allocates object in the device Global Memory Shared Memory Shared Memory Register Register Register Register Requires two parameters s s s s Address of a pointer to the Thread (0, Thread (1, Thread (0, Thread (1, allocated object 0) 0) 0) 0) Size of of allocated object Local Local Local Local Memor Memor Memor Memor cudaFree() y y y y Host Global Memory Frees object from device Constant Global Memory Memory Pointer to freed object Texture Memory COMP 426, Fall 2024 CUDA 53 CUDA Device Memory Allocation Code example: Allocate a 64 * 64 single precision float array Attach the allocated storage to Md.elements “d” is often used to indicate a device data structure BLOCK_SIZE = 64; Matrix Md int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float); cudaMalloc((void**)&Md.elements, size); cudaFree(Md.elements); COMP 426, Fall 2024 CUDA 54 CUDA Host-Device Data Transfer (Device) Grid cudaMemcpy() Block (0, 0) Block (1, 0) memory data transfer Shared Memory Shared Memory Requires four Register Register Register Register parameters s s s s Pointer to source Thread (0, Thread (1, Thread (0, Thread (1, 0) 0) 0) 0) Pointer to destination Number of bytes copied Local Memor Local Memor Local Memor Local Memor y y y y Type of transfer Host Global Host to Host Memory Host to Device Constant Memory Device to Host Texture Device to Device Memory COMP 426, Fall 2024 CUDA 55 CUDA Host-Device Data Transfer Code example: Transfer a 64 * 64 single precision float array M is in host memory and Md is in device memory cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice); cudaMemcpy(M.elements, Md.elements, size, cudaMemcpyDeviceToHost); COMP 426, Fall 2024 CUDA 56 CUDA Function Declarations Executed on Only callable the: from the: __device__ float DeviceFunc() device device __global__ void KernelFunc() device host __host__ float HostFunc() host host __global__ defines a kernel function Must return void __device__ and __host__ can be used together COMP 426, Fall 2024 CUDA 57 CUDA Function Declarations __device__ functions cannot have their address taken For functions executed on the device: No recursion No static variable declarations inside the function No variable number of arguments COMP 426, Fall 2024 CUDA 58 Calling a Kernel Function – Thread Creation A kernel function must be called with an execution configuration: __global__ void KernelFunc(...); dim3 DimGrid(100, 50); // 5000 thread blocks dim3 DimBlock(4, 8, 8); // 256 threads per block size_t SharedMemBytes = 64;// 64 bytes of shared memory KernelFunc>(...); Any call to a kernel function is asynchronous from CUDA 1.0 on, explicit synch needed for blocking COMP 426, Fall 2024 CUDA 59 A Simple Running Example Matrix Multiplication A straightforward matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs Leave shared memory usage until later Local, register usage Thread ID usage Memory data transfer API between host and device COMP 426, Fall 2024 CUDA 60 Programming Model: Matrix Multiplication Example N P = M * N of size WIDTH x WIDTH Without tiling: WIDTH One thread handles one element of P M and N are loaded WIDTH times from global memory M P WIDTH WIDTH WIDTH COMP 426, Fall 2024 CUDA 61 Step 1: Matrix Data Transfers // Allocate the device memory where we will copy M to Matrix Md; Md.width = WIDTH; Md.height = WIDTH; Md.pitch = WIDTH; int size = WIDTH * WIDTH * sizeof(float); cudaMalloc((void**)&Md.elements, size); // Copy M from the host to the device cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice); // Read M from the device to the host into P cudaMemcpy(P.elements, Md.elements, size, cudaMemcpyDeviceToHost);... // Free device memory cudaFree(Md.elements); COMP 426, Fall 2024 CUDA 62 Step 2: Matrix Multiplication A Simple Host Code in C // Matrix multiplication on the (CPU) host in // double precision. For simplicity, we assume // that all dimensions are equal void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P) { for (int i = 0; i < M.height; ++i) for (int j = 0; j < N.width; ++j) { double sum = 0; for (int k = 0; k < M.width; ++k) { double a = M.elements[i * M.width + k]; double b = N.elements[k * N.width + j]; sum += a * b;} P.elements[i * N.width + j] = sum;} } COMP 426, Fall 2024 CUDA 63 Multiply Using One Thread Block One Block of threads compute Grid 1 N Block 1 matrix P 2 Each thread computes one 4 element of P Thread 2 Each thread (2, 2) Loads a row of matrix M 6 Loads a column of matrix N Perform one multiply and addition for each pair of M and N elements Compute to off-chip memory 3 2 5 4 48 access ratio close to 1:1 (not very high) Size of matrix limited by the number of threads allowed in a BLOCK_SIZE thread block M P COMP 426, Fall 2024 CUDA 64 Step 3: Matrix Multiplication Host-side Main Program Code int main(void) { // Allocate and initialize the matrices Matrix M = AllocateMatrix(WIDTH, WIDTH, 1); Matrix N = AllocateMatrix(WIDTH, WIDTH, 1); Matrix P = AllocateMatrix(WIDTH, WIDTH, 0); // M * N on the device MatrixMulOnDevice(M, N, P); // Free matrices FreeMatrix(M); FreeMatrix(N); FreeMatrix(P); return 0; } COMP 426, Fall 2024 CUDA 65 Step 3: Matrix Multiplication Host-side code // Matrix multiplication on the device void MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P) { // Load M and N to the device Matrix Md = AllocateDeviceMatrix(M); CopyToDeviceMatrix(Md, M); Matrix Nd = AllocateDeviceMatrix(N); CopyToDeviceMatrix(Nd, N); // Allocate P on the device Matrix Pd = AllocateDeviceMatrix(P); CopyToDeviceMatrix(Pd, P); // Clear memory COMP 426, Fall 2024 CUDA 66 Step 3: Matrix Multiplication Host-side Code (cont.) // Setup the execution configuration dim3 dimBlock(WIDTH, WIDTH); dim3 dimGrid(1, 1); // Launch the device computation threads! MatrixMulKernel(Md, Nd, Pd); // Read P from the device CopyFromDeviceMatrix(P, Pd); // Free device matrices FreeDeviceMatrix(Md); FreeDeviceMatrix(Nd); FreeDeviceMatrix(Pd); } COMP 426, Fall 2024 CUDA 67 Step 4: Matrix Multiplication Device-side Kernel Function // Matrix multiplication kernel – // thread specification __global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P) { // 2D Thread ID int tx = threadIdx.x; int ty = threadIdx.y; // Pvalue is used to store the element of the matrix // that is computed by the thread float Pvalue = 0; COMP 426, Fall 2024 CUDA 68 Step 4: Matrix Multiplication Device-Side Kernel Function N for (int k = 0; k < M.width; ++k) { float Melement = WIDTH M.elements[ty * M.pitch + k]; float Nelement = Nd.elements[k * N.pitch + tx]; Pvalue += Melement * Nelement; } // Write the matrix to device M memory; P // each thread writes one element P.elements[ty * P.pitch + tx] = ty Pvalue; } WIDTH tx WIDTH WIDTH COMP 426, Fall 2024 CUDA 69 Step 5: Some Loose Ends // Allocate a device matrix of same size as M. Matrix AllocateDeviceMatrix(const Matrix M) { Matrix Mdevice = M; int size = M.width * M.height * sizeof(float); cudaMalloc((void**)&Mdevice.elements, size); return Mdevice; } // Free a device matrix. void FreeDeviceMatrix(Matrix M) { cudaFree(M.elements); } void FreeMatrix(Matrix M) { free(M.elements); } COMP 426, Fall 2024 CUDA 70 Step 5: Some Loose Ends (cont.) // Copy a host matrix to a device matrix. void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost) { int size = Mhost.width * Mhost.height * sizeof(float); cudaMemcpy(Mdevice.elements, Mhost.elements, size, cudaMemcpyHostToDevice); } // Copy a device matrix to a host matrix. void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice){ int size = Mdevice.width * Mdevice.height * sizeof(float); cudaMemcpy(Mhost.elements, Mdevice.elements, size, cudaMemcpyDeviceToHost); } COMP 426, Fall 2024 CUDA 71 Step 6: Handling Arbitrary Sized Square Matrices N Have each 2D thread block to compute a (BLOCK_WIDTH)2 sub- matrix (tile) of the result matrix WIDTH Each has (BLOCK_WIDTH)2 threads Generate a 2D Grid of (WIDTH/BLOCK_WIDTH)2 blocks M P by You still need to put a loop around the kernel call for ty WIDTH cases where WIDTH is greater than Max grid size! bx tx WIDTH WIDTH COMP 426, Fall 2024 CUDA 72 CUDA Advantages over Legacy GPGPU Random access byte-addressable memory Thread can access any memory location Unlimited access to memory Thread can read/write as many locations as needed Shared memory (per block) and thread synchronization Threads can cooperatively load data into shared memory Any thread can then access any shared memory location Low learning curve Just a few extensions to C No knowledge of graphics is required COMP 426, Fall 2024 CUDA 73 CUDA Disadvantages Slightly low precision Limited support for IEEE-754 No recursive function call Hard to use for irregular join/fork logic No concurrency between jobs COMP 426, Fall 2024 CUDA 74 A Common Programming Strategy Global memory resides in device memory (DRAM) much slower access than shared memory (200x!) … but also much larger So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory COMP 426, Fall 2024 CUDA 75 Tiled Data Strategy Partition data into subsets that fit into shared memory Each block will then: Load its subset from global memory to shared memory using multiple threads to exploit memory-level parallelism Perform the computation on the subset from shared memory each thread can efficiently multi-pass over any data element Copy results from shared memory back to global memory COMP 426, Fall 2024 CUDA 76 Simple Matrix Multiplication __global__ void MatrixMulKernel (float* Md, float* Nd, float* Pd, int Width) { // Calculate the row index of the Pd element and M int Row = blockIdx.y*TILE_WIDTH + threadIdx.y; // Calculate the column index of Pd and N int Col = blockIdx.x*TILE_WIDTH + threadIdx.x; float Pvalue = 0; // each thread computes one element of the block sub-matrix for (int k = 0; k < Width; ++k) Pvalue += Md[Row*Width+k] * Nd[k*Width+Col]; Pd[Row*Width+Col] = Pvalue; } COMP 426, Fall 2024 CUDA 77 How about performance on G80? Grid All threads access global memory for Block (0, 0) Block (1, 0) their input matrix elements Two memory accesses (8 bytes) Shared Memory Shared Memory per floating point multiply-add Registers Registers Registers Registers 4 B/s of memory bandwidth/FLOPS 4*346.5 = 1386 GB/s required to achieve peak FLOP rating Thread (0, 0) Thread (1, 0) Thread (0, 0) Thread (1, 0) 86.4 GB/s limits the code at 21.6 GFLOPS Host Global Memory The actual code runs at about 15 Constant Memory GFLOPS Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS COMP 426, Fall 2024 CUDA 78 Idea: Use Shared Memory to reuse global memory data N Each input element is read by WIDTH threads. WIDTH Load each element into Shared Memory and have several M P threads use the local version to ty reduce the memory WIDTH bandwidth tx Tiled algorithms WIDTH WIDTH COMP 426, Fall 2024 CUDA 79 bx 0 1 2 Tiled Multiply tx 0 1 2 TILE_WIDTH-1 Nd TILE_WIDTH Break up the execution of the WIDTH kernel into phases so that the TILE_WIDTH data accesses in each phase is focused on one subset (tile) of Md and Nd Md Pd 0 0 TILE_WIDTHE Pdsub 1 WIDTH 2 by 1 ty TILE_WIDTH-1 TILE_WIDTH TILE_WIDTH TILE_WIDTH 2 WIDTH WIDTH COMP 426, Fall 2024 CUDA 80 Device Runtime Component: Synchronization Function void __syncthreads(); Synchronizes all threads in a block (Barrier) Once all threads have reached this point, execution resumes normally Used to avoid race conditions when accessing shared or global memory Allowed in conditional constructs only if the conditional is uniform across the entire thread block COMP 426, Fall 2024 CUDA 81 First-order Size Considerations in G80 Each thread block should have many threads TILE_WIDTH of 16 gives 16*16 = 256 threads There should be many thread blocks A 1024*1024 Pd gives 64*64 = 4096 Thread Blocks Each thread block perform 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations. Compute to memory ratio is now 16:1 !! Memory bandwidth no longer a limiting factor COMP 426, Fall 2024 CUDA 82 CUDA Code: Kernel Execution Configuration // Setup the execution configuration dim3 dimBlock(TILE_WIDTH, TILE_WIDTH); dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH); COMP 426, Fall 2024 CUDA 83 Tiled Matrix Multiplication Kernel __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) { __shared__float Mds[TILE_WIDTH][TILE_WIDTH]; __shared__float Nds[TILE_WIDTH][TILE_WIDTH]; int bx = blockIdx.x; int by = blockIdx.y; int tx = threadIdx.x; int ty = threadIdx.y; // Identify the row and column of the Pd element to work on int Row = by * TILE_WIDTH + ty; int Col = bx * TILE_WIDTH + tx; float Pvalue = 0; // Loop over the Md and Nd tiles required to compute the Pd element for (int m = 0; m < Width/TILE_WIDTH; ++m) { // Collaborative loading of Md and Nd tiles into shared memory Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)]; Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width]; __syncthreads(); for (int k = 0; k < TILE_WIDTH; ++k) Pvalue += Mds[ty][k] * Nds[k][tx]; __syncthreads(); } Pd[Row*Width+Col] = Pvalue; } COMP 426, Fall 2024 CUDA 84 G80 Shared Memory and Threading Each SM in G80 has 16KB shared memory SM size is implementation dependent! For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory. Can potentially have up to 8 Thread Blocks actively executing This allows up to 8*512 = 4,096 pending loads. (2 per thread, 256 threads per block) TILE_WIDTH 32 would lead to 2*32*32*4B= 8KB shared memory usage per thread block, allowing only up to two thread blocks active at the same time per SM Using 16x16 tiling, we reduce the accesses to the global memory by a factor of 16 The 86.4B/s bandwidth can now support (86.4/4)*16 = 347.6 GFLOPS! COMP 426, Fall 2024 CUDA 85 Tiling Size Effects 100 90 80 70 60 GFLOPS 50 40 30 20 10 0 unrolled unrolled unrolled unrolled only only only only tiled tiled tiled tiled tiled & tiled & tiled & tiled & not tiled 4x4 tiles 8x8 tiles 12x12 tiles 16x16 tiles COMP 426, Fall 2024 CUDA 86 What’s Limiting My Code? Am I bandwidth bound? (How do I tell?) Make sure I have high thread occupancy to tolerate latencies (lots of threads) These threads can get some work done while we wait for memory Move re-used values to closer memories Shared Constant/Texture Am I not bandwidth bound – what is now my limit? Take a closer look at the instruction stream Unroll loops Minimize branch divergence COMP 426, Fall 2024 CUDA 87 OpenGL Rendering OpenGL buffer objects can be mapped into the CUDA address space and then used as global memory Vertex buffer objects Pixel buffer objects Allows direct visualization of data from computation No device to host transfer Data stays in device memory – very fast compute / visualization cycle Data can be accessed from the kernel like any other global data (in device memory) COMP 426, Fall 2024 CUDA 88 OpenGL Interoperability 1. Register a buffer object with CUDA cudaGLRegisterBufferObject(GLuintbuffObj); OpenGL can use a registered buffer only as a source Unregister the buffer prior to rendering to it by OpenGL 2. Map the buffer object to CUDA memory cudaGLMapBufferObject(void**devPtr, GLuintbuffObj); Returns an address in global memory. Buffer must be registered prior to mapping 3. Launch a CUDA kernel to process the buffer Unmap the buffer object prior to use by OpenGL cudaGLUnmapBufferObject(GLuintbuffObj); 4. Unregister the buffer object cudaGLUnregisterBufferObject(GLuintbuffObj); Optional: needed if the buffer is a render target 5. Use the buffer object in OpenGL code COMP 426, Fall 2024 CUDA 89 Example from simpleGL in SDK 1. GL calls to create and initialize buffer, then register with CUDA: // create buffer object glGenBuffers( 1, vbo); glBindBuffer( GL_ARRAY_BUFFER, *vbo); // initialize buffer object unsigned int size = mesh_width * mesh_height * 4 * sizeof( float)*2; glBufferData( GL_ARRAY_BUFFER, size, 0, GL_DYNAMIC_DRAW); glBindBuffer( GL_ARRAY_BUFFER, 0); // register buffer object with CUDA cudaGLRegisterBufferObject(*vbo); COMP 426, Fall 2024 CUDA 90 Example from simpleGL in SDK, cont. 2. Map OpenGL buffer object for writing from CUDA float4 *dptr; cudaGLMapBufferObject((void**)&dptr, vbo)); 3. Execute the kernel to compute values for dptr dim3 block(8, 8, 1); dim3 grid(mesh_width / block.x, mesh_height / block.y, 1); kernel>(dptr, mesh_width, mesh_height, anim); 4. Unregister the OpenGL buffer object and return to Open GL cudaGLUnmapBufferObject(vbo); COMP 426, Fall 2024 CUDA 91