Introduction to CUDA Programming
High Performance Computing
Master in Applied Artificial Intelligence
Nuno Lopes

Parallel Programming on GPU
- GPUs are devices capable of massive data parallelism through SIMD: single instruction on multiple data.
- Compared to a CPU, a GPU has more parallel ALU units, focused on arithmetic data operations.
- The goal is to apply the same operation to many different data items.
- GPUs are not suitable for task parallelism, i.e., running different operations concurrently.

Parallel Programming on GPU
Figures from Kirk, Hwu, Programming Massively Parallel Processors

CUDA Programming Architecture
- A CUDA program integrates two different types of processors:
  - host: the CPU
  - device: the GPU
- Application code includes both host and device code.
- A typical C program is a valid CUDA program.
- Device variables and code are marked with keywords.
- A CUDA-annotated program is not a valid C program.

CUDA Programming Architecture
- The CUDA compiler, the Nvidia C Compiler (nvcc), takes a CUDA (.cu) input program and outputs:
  - standard C host-only source (for the host compiler)
  - device code (for the GPU compiler)
- The resulting application runs as usual:
  - $ nvcc -o output.exe input.cu
  - $ ./output.exe

CUDA Kernel and Thread

CUDA Kernel Definition
- A kernel is the device function that will be "launched" onto the GPU on multiple threads.
- It is written as a standard C/C++ function with the __global__ tag.
- By default, all functions are tagged __host__.

CUDA Kernel Definition
- A kernel function is specified as C code.
- It returns void; parameters serve as input and output.
- Local variables are independent for each thread.

Kernel Launch
- The execution of a kernel function on the device is called a "kernel launch".
- It is the host that starts the launch.
- The kernel specifies the code for a single thread.
- Launching a kernel specifies the number of threads to be used.

Thread Hierarchy
- Threads are organised in a two-level hierarchy: grid & block.
- When a kernel is launched, a grid is created.
- A grid can have a number of blocks of threads (in up to 3 dimensions).
- A block can have a number of threads (in up to 3 dimensions).
- Example of launching a grid with 4 blocks, each with 32 threads:
  KernelFunc<<<4, 32>>>(parameters);

Thread Hierarchy
- Example: a grid of 3 x 2 (6) blocks, each with 4 x 3 (12) threads, totalling 72 threads.
  ...
  dim3 blocksPerGrid(3, 2);
  dim3 threadsPerBlock(4, 3);
  KernelFunc<<<blocksPerGrid, threadsPerBlock>>>(params);
  ...
Source: Nvidia, CUDA C++ Programming Guide, v11.6

Thread Identification
Each thread has the following (private) variables available (see the sketch after the scheduling slides below):
- gridDim: grid dimensions (x, y, z coordinates)
- blockDim: block dimensions (x, y, z coordinates)
- blockIdx: block identification (per thread)
- threadIdx: thread identification

Block-oriented Kernel Scheduling
- The scheduling of a kernel within a grid is block-oriented.
- Blocks are required to run independently from each other, so that the device can schedule them in any order.
- Depending on the number of hardware resources, the Streaming Multiprocessors (SMs), it is possible to run more than one block at a time.
- An SM schedules a group of threads of a block in parallel, called a warp.
- Warps have a size that is a multiple of 32 threads, depending on the hardware.
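To tie together the kernel definition, launch syntax, and thread-identification variables described above, here is a minimal runnable sketch; the kernel name whoAmI and its printf body are illustrative, not from the slides. It uses the same 3 x 2 blocks / 4 x 3 threads configuration as the hierarchy example:

  #include <cstdio>

  // __global__ marks a kernel: device code, callable from the host.
  // The body describes the work of a single thread; here each thread just
  // prints its own block and thread coordinates.
  __global__ void whoAmI(void) {
      printf("block (%d,%d) thread (%d,%d)\n",
             blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
  }

  int main(void) {
      // Grid of 3 x 2 blocks, each with 4 x 3 threads: 72 threads in total.
      dim3 blocksPerGrid(3, 2);
      dim3 threadsPerBlock(4, 3);
      whoAmI<<<blocksPerGrid, threadsPerBlock>>>();
      cudaDeviceSynchronize();   // wait for the kernel (and its output) to finish
      return 0;
  }

Compiled with nvcc and run on the host, this prints one line per thread; the order of the lines is not fixed, since blocks (and the warps within them) may be scheduled in any order.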
Block-oriented Kernel Scheduling
Source: Nvidia, CUDA C++ Programming Guide

CUDA Memory Model

Memory Model CPU-GPU
- The host has its own RAM: host memory.
- The device also has its own RAM: global memory.
- It is necessary to transfer data between these memories:
  - initially from the host to the device (with the input data)
  - at the end, from the device to the host (with the result)

Device Memory Initialisation & Transfer
- Memory on the device (global memory) needs to be allocated from the host:
  - cudaMalloc( address of pointer, size )
  - cudaFree( pointer )
- An additional transfer function is available:
  - cudaMemcpy( destination, source, bytes, direction )
- Direction:
  - cudaMemcpyHostToDevice
  - cudaMemcpyDeviceToHost

CUDA Device Memory Layout
- The storage type, scope, and lifetime of device variables are specified by tags.
- Automatic scalar (non-array) variables are register-based.
- Local thread memory physically resides in global memory.

CUDA Application Example

Structure of Vector Addition program
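The code of the vector addition slides was not captured in this transcript; what follows is a minimal sketch of the typical structure, assuming a kernel named vecAdd and illustrative sizes, with error checking omitted. It follows the steps above: allocate device memory, copy the inputs host-to-device, launch the kernel, copy the result device-to-host, and free the memory.

  #include <cstdlib>
  #include <cuda_runtime.h>

  // Kernel: one thread per output element; threads whose global index falls
  // past the end of the arrays do nothing.
  __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) c[i] = a[i] + b[i];
  }

  int main(void) {
      const int n = 1 << 20;
      const size_t bytes = n * sizeof(float);

      // 1. Allocate and fill host memory.
      float *h_a = (float *)malloc(bytes);
      float *h_b = (float *)malloc(bytes);
      float *h_c = (float *)malloc(bytes);
      for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

      // 2. Allocate device (global) memory from the host.
      float *d_a, *d_b, *d_c;
      cudaMalloc(&d_a, bytes);
      cudaMalloc(&d_b, bytes);
      cudaMalloc(&d_c, bytes);

      // 3. Transfer the input data host -> device.
      cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
      cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

      // 4. Launch the kernel: enough blocks of 256 threads to cover n elements.
      int threadsPerBlock = 256;
      int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
      vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

      // 5. Transfer the result device -> host.
      cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

      // 6. Free device and host memory.
      cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
      free(h_a); free(h_b); free(h_c);
      return 0;
  }

The block count is rounded up so that the grid covers all n elements, which is why the kernel guards against indices past the end of the arrays.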