

Full Transcript

Introduction to CUDA Programming
High Performance Computing
Master in Applied Artificial Intelligence
Nuno Lopes

Parallel Programming on GPU
- GPUs are devices capable of massive data parallelism through SIMD: a single instruction applied to multiple data.
- Compared to a CPU, a GPU has more parallel ALU units, focused on arithmetic data operations.
- The goal is to apply the same operation to many different data elements.
- GPUs are not suitable for task parallelism, i.e., running different operations concurrently.

Parallel Programming on GPU
[Figures from Kirk, Hwu, Programming Massively Parallel Processors]

CUDA Programming Architecture
- A CUDA program integrates two different types of processors:
  - host: the CPU
  - device: the GPU
- The application code includes both host and device code.
- A typical C program is a valid CUDA program.
- Device variables and code are marked with keywords.
- A CUDA-annotated program is not a valid C program.

CUDA Programming Architecture
- The CUDA compiler, the Nvidia C Compiler (nvcc), takes a CUDA (.cu) input program and outputs:
  - standard C host-only source (for the host compiler)
  - device code (for the GPU compiler)
- The output application is built and run as usual:
  - $ nvcc -o output.exe input.cu
  - $ ./output.exe

CUDA Kernel and Thread

CUDA Kernel Definition
- A kernel is the device function that will be "launched" onto the GPU on multiple threads.
- It is written as a standard C/C++ function with the __global__ tag.
- By default, all functions are tagged __host__.

CUDA Kernel Definition
- A kernel function is specified as C code.
- It returns void; its parameters serve as input and output.
- Local variables are independent for each thread.

Kernel Launch
- The execution of a kernel function on the device is called a "kernel launch".
- It is the host that starts the launch.
- The kernel specifies the code for a single thread.
- Launching a kernel specifies the number of threads to be used.

Thread Hierarchy
- Threads are organised in a two-level hierarchy: grid and block.
- When a kernel is launched, a grid is created.
- A grid can have a number of blocks of threads (in up to 3 dimensions).
- A block can have a number of threads (in up to 3 dimensions).
- Example of launching a grid with 4 blocks, each with 32 threads (a runnable sketch appears after the scheduling slides below):
  - KernelFunc<<<4, 32>>>(parameters);

Thread Hierarchy
- Example: a grid of 3 x 2 (6) blocks, each with 4 x 3 (12) threads, totalling 72 threads.

    ...
    dim3 blocksPerGrid(3, 2);
    dim3 threadsPerBlock(4, 3);
    KernelFunc<<<blocksPerGrid, threadsPerBlock>>>(params);
    ...

- Source: Nvidia, CUDA C++ Programming Guide, web link, v11.6 PDF.

Thread Identification
Each thread has the following (private) variables available:
- gridDim: grid dimensions (x, y, z coordinates)
- blockDim: block dimensions (x, y, z coordinates)
- blockIdx: block identification (per thread)
- threadIdx: thread identification

Block-oriented Kernel Scheduling
- The scheduling of a kernel within a grid is block-oriented.
- Blocks are required to run independently from each other, so that the device can schedule them in any order.
- Depending on the number of hardware resources, the Streaming Multiprocessors (SMs), it is possible to run more than one block at a time.
- An SM schedules a group of threads of a block in parallel, called a warp.
- The warp size is hardware dependent; on current Nvidia GPUs it is 32 threads, so block sizes are normally chosen as multiples of 32.
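
The launch syntax and the thread-identification variables from the slides above can be combined in a small runnable example. The sketch below reproduces the 3 x 2 grid of 4 x 3-thread blocks from the Thread Hierarchy slide; the kernel name whoAmI and its body (a device-side printf of the identification variables) are illustrative additions, not taken from the slides.

    #include <cstdio>

    // Illustrative kernel (not from the slides): each of the 72 threads
    // prints the per-thread identification variables listed on the
    // "Thread Identification" slide.
    __global__ void whoAmI(void)
    {
        printf("block (%d,%d) of grid (%d,%d)  thread (%d,%d) of block (%d,%d)\n",
               blockIdx.x, blockIdx.y, gridDim.x, gridDim.y,
               threadIdx.x, threadIdx.y, blockDim.x, blockDim.y);
    }

    int main(void)
    {
        dim3 blocksPerGrid(3, 2);    // 6 blocks, as in the slide example
        dim3 threadsPerBlock(4, 3);  // 12 threads per block -> 72 threads total

        // Kernel launch: started by the host, run on the device
        whoAmI<<<blocksPerGrid, threadsPerBlock>>>();

        cudaDeviceSynchronize();     // wait for the device before exiting
        return 0;
    }

Saved as whoami.cu (an assumed file name), it compiles and runs with the nvcc workflow shown on the CUDA Programming Architecture slide.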
Block-oriented Kernel Scheduling
[Figure: Source: Nvidia, CUDA C++ Programming Guide]

CUDA Memory Model

Memory Model CPU-GPU
- The host has its own RAM: host memory.
- The device also has its own RAM: global memory.
- It is necessary to transfer data between these memories:
  - initially from host to device (with the input data)
  - at the end, from device to host (with the result)

Device Memory Initialisation & Transfer
- Memory on the device (global memory) needs to be allocated from the host:
  - cudaMalloc( address of pointer, size )
  - cudaFree( pointer )
- An additional transfer function is available:
  - cudaMemcpy( destination, source, bytes, direction )
- Direction:
  - cudaMemcpyHostToDevice
  - cudaMemcpyDeviceToHost

CUDA Device Memory Layout
- The storage type, scope and lifetime of device variables are defined by tags.
- Automatic scalar (non-array) variables are register based.
- Local thread memory physically resides in global memory.

CUDA Application Example

Structure of Vector Addition Program
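
The slides only name the vector addition program, so the sketch below is an assumed, conventional implementation of that structure: allocate device memory, copy the inputs host-to-device, launch the kernel, copy the result device-to-host, and free the memory. The kernel name vecAdd, the vector length N and the block size of 256 threads are illustrative choices.

    #include <cstdio>
    #include <cstdlib>

    // Kernel: one thread adds one pair of elements.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                  // the grid may contain a few extra threads
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int N = 1 << 20;
        const size_t bytes = N * sizeof(float);

        // Host memory (host RAM)
        float *h_a = (float *)malloc(bytes);
        float *h_b = (float *)malloc(bytes);
        float *h_c = (float *)malloc(bytes);
        for (int i = 0; i < N; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        // Device global memory, allocated from the host
        float *d_a, *d_b, *d_c;
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);

        // Transfer input data: host -> device
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // Kernel launch: enough blocks of 256 threads to cover N elements
        int threadsPerBlock = 256;
        int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
        vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);

        // Transfer the result: device -> host
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);   // expected: 3.0

        // Release device and host memory
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }

As on the earlier compilation slide, it can be built and run with, e.g., $ nvcc -o vecadd.exe vecadd.cu followed by $ ./vecadd.exe (file and executable names assumed).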
