Parallel Computing: GPU Programming using CUDA
Document Details
BITS Pilani
Dr. Gargi Alavani Prabhu
Summary
This presentation covers GPU programming with CUDA: why GPUs outperform CPUs on compute-intensive tasks such as image processing and 3D rendering, GPU architecture (Tesla K20 and V100), the CUDA programming model (grids, blocks, threads, and the device memory hierarchy), and worked examples including Hello World, vector addition, and SAXPY.
Full Transcript
Parallel Computing: GPU Programming using CUDA
BITS Pilani, Dr. Gargi Alavani Prabhu, CS & IS Department, Pilani Campus

Why GPUs?
CPUs have low throughput on compute-intensive tasks
CPUs are slower at enhancing images and rendering graphics
GPUs outdo CPUs at 3D rendering due to the complexity of those tasks
GPU cores are specialized processors for handling graphics manipulations

GPU Programming
General-purpose computation using the GPU and graphics API in applications other than 3D graphics
– GPU accelerates the critical path of the application
Data-parallel algorithms leverage GPU attributes
– Large data arrays, streaming throughput
– Fine-grain SIMD parallelism
– Low-latency floating-point (FP) computation
Applications – see GPGPU.org
– Game effects (FX) physics, image processing
– Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

Key milestones in evolution
Graphics Rendering
Programmability
General-Purpose Computing
CUDA and GPGPU
GPU Libraries and Frameworks:
– cuDNN (CUDA Deep Neural Network)
– TensorRT
– CUDA Toolkit
AI and Deep Learning
Heterogeneous Computing

Heterogeneous Computing
(Figure: the application code is split; the rest of the sequential code runs on the CPU (host), while compute-intensive functions are offloaded to the GPU (device) and parallelized.)

GPU and CPU: The Differences
(Figure: the CPU die is dominated by control logic and cache above its DRAM; the GPU die is dominated by ALUs above its DRAM.)
GPU: more transistors devoted to computation instead of caching or flow control
Suitable for data-intensive computation
High arithmetic-to-memory-operation ratio

GPU as a co-processor
(Figure: the CPU and its memory connected to the GPU and its own GPU memory.)

Outline of GPGPU

GPU Architecture – Tesla K20
7.1B transistors
15 SMX units

Streaming Multiprocessor: K20 / FERMI

Understanding GP-GPU Architecture

V100 GPU
V: Volta architecture
6 GPU Processing Clusters (GPCs)
7 Texture Processing Clusters (TPCs)
Memory controllers
Tesla V100 uses 80 SMs
It has 5120 cores/GPU

V100 SM
Each SM has:
– 64 FP32 cores
– 64 INT32 cores
– 32 FP64 cores
– 8 Tensor Cores
– 4 Texture units
High-level performance of V100 is impressive:
– 15.7 teraflops of FP32
– 7.8 teraflops of FP64
– 125 teraflops for dedicated tensor operations

Tensor Cores

Tensor Cores with Turing

Execution Model

Data Model

Programming Models for GPGPU

CUDA
The GPU is viewed as a compute device that:
– Is a coprocessor to the CPU (host)
– Has its own DRAM (device memory)
– Runs many threads in parallel
Hardware switches between threads (in 1 cycle) on a long-latency memory reference
Overprovision (1000s of threads) → hide latencies
Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads
Differences between GPU and CPU threads
– GPU threads are extremely lightweight: very little creation overhead
– A GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
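To make the thread model just described concrete, here is a minimal sketch (not taken from the slides; the kernel name scale, the element count N, and the launch configuration are illustrative assumptions) of launching a kernel across far more threads than the GPU has cores, one element per thread:

    #include <cuda_runtime.h>

    // Each thread scales a single element; the GPU keeps thousands of such
    // lightweight threads in flight to hide long-latency memory accesses.
    __global__ void scale(float *y, float a, int n)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;  // global thread index
        if (i < n)
            y[i] = a * y[i];
    }

    int main(void)
    {
        const int N = 1 << 20;                  // ~1M elements -> ~1M threads
        float *d_y = NULL;
        cudaMalloc((void **)&d_y, N * sizeof(float));
        // 4096 blocks of 256 threads: far more threads than cores, so the
        // hardware can switch to ready threads while others wait on memory.
        scale<<<(N + 255) / 256, 256>>>(d_y, 2.0f, N);
        cudaDeviceSynchronize();
        cudaFree(d_y);
        return 0;
    }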
Applications of CUDA

Thread Batching: Grids and Blocks
A kernel is executed as a grid of thread blocks
– All threads share the data memory space
A thread block is a batch of threads that can cooperate with each other by:
– Synchronizing their execution, for hazard-free shared memory accesses
– Efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate
– (Unless through slow global memory)
Threads and blocks have IDs
(Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks, e.g. Block (0,0) through Block (2,1), and each block is a 2D array of threads, e.g. Thread (0,0) through Thread (4,2). Courtesy: NVIDIA)

CUDA Function Declaration

                                      Executed on the:   Only callable from the:
    __device__ float DeviceFunc()     device             device
    __global__ void KernelFunc()      device             host
    __host__   float HostFunc()       host               host

__global__ defines a kernel function
– Must return void
__device__ and __host__ can be used together

CUDA Device Memory Space Overview
Each thread can:
– R/W per-thread registers
– R/W per-thread local memory
– R/W per-block shared memory
– R/W per-grid global memory
– Read-only per-grid constant memory
– Read-only per-grid texture memory
The host can R/W global, constant, and texture memories
(Figure: a grid of blocks; each block has its own shared memory, and each thread has its own registers and local memory; global, constant, and texture memory are per-grid and accessible from the host.)

Global, Constant, and Texture Memories (Long-Latency Accesses)
Global memory
– Main means of communicating R/W data between host and device
– Contents visible to all threads
Texture and constant memories
– Constants initialized by the host
– Contents visible to all threads
(Courtesy: NVIDIA)

Calling Kernel Function – Thread Creation
A kernel function must be called with an execution configuration:

    __global__ void KernelFunc(...);
    dim3 DimGrid(100, 50);        // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);       // 256 threads per block
    size_t SharedMemBytes = 64;   // 64 bytes of shared memory
    KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(...);

Automatic Scalability

V100 Login
INTERNAL IP: 10.1.19.37
Username: parallel_computing
Password: user123
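As an aside not in the slides: once logged in, the CUDA runtime's device-query API can confirm the V100 SM count mentioned earlier and illustrates the automatic scalability point, since blocks are scheduled onto whichever SMs the installed GPU provides. A small sketch, assuming device 0 is the V100:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        printf("%s: %d SMs, compute capability %d.%d\n",
               prop.name, prop.multiProcessorCount, prop.major, prop.minor);
        // On a Tesla V100 this should report 80 SMs; the same grid of blocks
        // runs unchanged on GPUs with fewer or more SMs.
        return 0;
    }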
CUDA: Hello, World! example

    #define NUM_BLOCKS 4
    #define BLOCK_WIDTH 8

    int main(void)
    {
        printf("Hello Cuda!\n");
        hello<<<NUM_BLOCKS, BLOCK_WIDTH>>>();
        cudaDeviceSynchronize();
        printf("Welcome back to CPU!\n");
        return 0;
    }

    __global__ void hello(void)
    {
        printf("\tHello from GPU: thread %d and block %d\n",
               threadIdx.x, blockIdx.x);
    }

CUDA: Vector Addition
(built up step by step across the following snippets)

    int main(void)
    {
        ...
        return 0;
    }

    // Allocate device memory
    float *d_A = NULL;
    if (cudaMalloc((void **)&d_A, size) != cudaSuccess)
        exit(EXIT_FAILURE);
    float *d_B = NULL;
    cudaMalloc((void **)&d_B, size);
    float *d_C = NULL;
    cudaMalloc((void **)&d_C, size);

    // Copy the input vectors from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch the kernel with enough blocks to cover all elements
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    cudaDeviceSynchronize();

    // Copy the result back to the host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    __global__ void vectorAdd(const float *A, const float *B, float *C,
                              int numElements)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < numElements)
            C[i] = A[i] + B[i];
    }

Single-Precision A·X Plus Y (SAXPY)

    void saxpy(int n, float a, float * restrict x, float * restrict y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

Thank You
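For comparison with the CPU saxpy loop shown above, a CUDA port follows the same one-thread-per-element pattern as vectorAdd. This is an editor's sketch, not part of the slides; the kernel name saxpy_kernel and the launch parameters are assumptions:

    // One thread per element, same indexing pattern as vectorAdd above.
    __global__ void saxpy_kernel(int n, float a, const float *x, float *y)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Hypothetical launch over n elements with 256 threads per block,
    // where d_x and d_y are device pointers:
    // saxpy_kernel<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);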