CUDA C Introduction Lecture Notes
Université de Technologie de Belfort-Montbéliard
J.Gaber
Summary
These are lecture notes on CUDA C, an extension of C for programming graphics processing units (GPUs). The notes cover the basics, including function type qualifiers, built-in vector types, and how to run code in parallel on the device. They also discuss memory management and thread synchronization, with examples that include memory allocation.
Introduction to CUDA C
J.Gaber
General-purpose GPU computing

What will you learn?
– Start from "Hello, World!"
– Write and launch CUDA C kernels
– Manage GPU memory
– Run parallel kernels in CUDA C
– Parallel communication and synchronization
– Race conditions and atomic operations

CUDA C Prerequisites
You (probably) need experience with C or C++.
You do not need any GPU experience.
You do not need any graphics experience.
You do not need any parallel programming experience.

CUDA C
CUDA C extends standard C as follows:
– Function type qualifiers to specify whether a function executes on the host or on the device
– Variable type qualifiers to specify the memory location on the device
– A new directive to specify how a kernel is executed on the device
– Four built-in variables that specify the grid and block dimensions and the block and thread indices
– Built-in vector types derived from the basic integer and float types

Built-in Vector Types
Vector types derived from the basic integer and float types:
char1, char2, char3, char4
uchar1, uchar2, uchar3, uchar4
short1, short2, short3, short4
ushort1, ushort2, ushort3, ushort4
int1, int2, int3, int4
uint1, uint2, uint3 (dim3), uint4
ulong1, ulong2, ulong3, ulong4
longlong1, longlong2
float1, float2, float3, float4
double1, double2

They are all structures, like this:

typedef struct {
    float x, y, z, w;
} float4;

They all come with a constructor function of the form make_<type>, e.g.:

int2 make_int2(int x, int y);

dim3 dimBlock(width, height);
dim3 dimGrid(10);   // same as dimGrid(10, 1, 1): unspecified dimensions default to 1

Function Type Qualifiers
                                    Executed on the:   Only callable from the:
__device__ float DeviceFunc()       device             device
__global__ void  KernelFunc()       device             host
__host__   float HostFunc()         host               host

– __device__ and __global__ functions do not support recursion, cannot declare static variables inside their body, and cannot have a variable number of arguments
– __device__ functions cannot have their address taken
– The __host__ and __device__ qualifiers can be used together, in which case the function is compiled for both the host and the device
– The __global__ and __host__ qualifiers cannot be used together
– A __global__ function must have a void return type, its execution configuration must be specified, and the call is asynchronous
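The following minimal sketch, which is not part of the original notes, ties these extensions together; the function names twice, triple, and demo and the launch configuration are illustrative choices.

// __device__: runs on the device, callable only from device code
__device__ float twice(float v) { return 2.0f * v; }

// __host__ __device__: compiled for both the host and the device
__host__ __device__ float triple(float v) { return 3.0f * v; }

// __global__: a kernel; launched from the host, must return void
__global__ void demo(void) {
    float4 v = make_float4(1.0f, 2.0f, 3.0f, 4.0f);   // built-in vector type and its constructor
    v.x = twice(v.x);
    v.y = triple(v.y);
    // the result is discarded; this sketch only illustrates the qualifiers and launch syntax
}

int main(void) {
    dim3 dimBlock(8, 8);   // 8 x 8 threads per block (z defaults to 1)
    dim3 dimGrid(10);      // 10 blocks (y and z default to 1)
    demo<<< dimGrid, dimBlock >>>();
    cudaDeviceSynchronize();   // the launch is asynchronous; wait for it to finish
    return 0;
}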
CUDA C: The Basics
Terminology:
– Host: the CPU and its memory (host memory)
– Device: the GPU and its memory (device memory)

Hello, World!

#include <stdio.h>
#include <stdlib.h>

int main() {
    printf("Hello World!\n");
    return EXIT_SUCCESS;
}

This basic program is just standard C that runs on the host.
NVIDIA's compiler (nvcc) will not complain about CUDA programs with no device code.
At its simplest, CUDA C is just C!

Hello, World! with Device Code

#include <stdio.h>
#include <stdlib.h>

__global__ void kernel(void) {
}

int main() {
    kernel<<< 1, 1 >>>();
    printf("Hello World!\n");
    return EXIT_SUCCESS;
}

Hello, World! with Device Code

__global__ void kernel(void) {
}

__global__ indicates that the function
– runs on the device
– is called from host code
nvcc splits the source file into host and device components.

Hello, World! with Device Code

int main(void) {
    kernel<<< 1, 1 >>>();
    printf("Hello, World!\n");
    return 0;
}

Triple angle brackets mark a call from host code to device code.
This is all that is required to execute a function on the GPU!

A More Complex Example
A simple kernel to add two integers:

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

add() runs on the device; a, b, and c must point to device memory.
How do we allocate memory on the GPU?

Memory Management
Host and device memory are distinct entities:
– Device pointers point to GPU memory
– Host pointers point to CPU memory
Basic CUDA API for dealing with device memory:
– cudaMalloc(), cudaFree(), cudaMemcpy()
– Similar to their C equivalents malloc(), free(), memcpy()

A More Complex Example: main()

int main(void) {
    int a, b, c;                     // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;      // device copies of a, b, c
    int size = sizeof(int);

    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    cudaMemcpy(dev_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, &b, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, &c, size, cudaMemcpyHostToDevice);

    // launch add() kernel on GPU, passing parameters
    add<<< 1, 1 >>>(dev_a, dev_b, dev_c);

    // copy device result back to host copy of c
    cudaMemcpy(&c, dev_c, size, cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}

Parallel Programming in CUDA C
So how do we run code in parallel on the device?
– Via the parameters between the triple angle brackets:

add<<< N, 1 >>>(dev_a, dev_b, dev_c);

Instead of executing add() once, add() is executed N times in parallel.

Parallel Addition: add()
Terminology:
– Each parallel invocation of add() is referred to as a block
– A kernel can refer to its block's index with the variable blockIdx.x
– Each block adds a value from a[] and b[], storing the result in c[]:

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

Parallel Addition: main()

#define N 512

int main(void) {
    int *a, *b, *c;                  // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;      // device copies of a, b, c
    int size = N * sizeof(int);

    // allocate device copies of a, b, c
    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

Parallel Addition: main() (cont.)

    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(size);

    // copy inputs to device
    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, c, size, cudaMemcpyHostToDevice);

    // launch add() kernel on GPU with N parallel blocks
    add<<< N, 1 >>>(dev_a, dev_b, dev_c);

Parallel Addition: main() (cont.)

    // copy device result back to host copy of c
    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    free(a);
    free(b);
    free(c);
    return 0;
}

Threads
Terminology: a block can be split into parallel threads.
Let's change vector addition to use parallel threads instead of parallel blocks.
We use threadIdx.x instead of blockIdx.x in add():

__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

Threads
If we have M threads per block, a unique array index for each entry is given by

int index = threadIdx.x + blockIdx.x * M;

which is the usual 2D-to-1D linearization

int index = x + y * width;

Addition with Threads and Blocks
blockDim.x is a built-in variable for the number of threads per block:

int index = threadIdx.x + blockIdx.x * blockDim.x;

A combined version of our vector addition kernel that uses both blocks and threads:

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

Addition with Threads and Blocks

// launch kernel with blocks and threads
add<<< N / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>(dev_a, dev_b, dev_c);
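For completeness, here is a minimal host-side sketch for this combined launch. It is not part of the original notes: the values of N and THREADS_PER_BLOCK, the sample input values, and the assumption that N is a multiple of THREADS_PER_BLOCK are illustrative choices.

#include <stdlib.h>

#define N 2048
#define THREADS_PER_BLOCK 512

// uses the combined add() kernel defined above
int main(void) {
    int *a, *b, *c;                  // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;      // device copies of a, b, c
    int size = N * sizeof(int);

    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(size);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }   // sample inputs

    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    // one thread per element: N / THREADS_PER_BLOCK blocks of THREADS_PER_BLOCK threads
    add<<< N / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    free(a); free(b); free(c);
    return 0;
}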
Dot Product
Unlike vector addition, the dot product is a reduction from two vectors to a scalar:

c = a · b = a1*b1 + a2*b2 + ... + an*bn

Dot Product

__global__ void dot(int *a, int *b, int *c) {
    // Each thread computes a pairwise product
    int temp = a[threadIdx.x] * b[threadIdx.x];
}

How do we compute the final sum?

Sharing Data Between Threads
Terminology:
– There is a shared memory for the threads within a block
– Declared with the __shared__ CUDA keyword
– Extremely fast, on-chip memory
– Not visible to threads in other blocks running in parallel

Synchronization
We can synchronize the threads of a block with the function __syncthreads().
Threads are only synchronized within a block.

Synchronization

__global__ void dot(int *a, int *b, int *c) {
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
    __syncthreads();
    if (0 == threadIdx.x) {
        int sum = 0;
        for (int i = 0; i < N; i++)
            sum += temp[i];
        *c = sum;
    }
}

// launch dot() kernel with 1 block and N threads
dot<<< 1, N >>>(dev_a, dev_b, dev_c);

Dot Product
When the dot product is split across several blocks, each block produces a partial sum that must be added to the single result *c; to avoid race conditions on that shared result, many atomic operations on memory are available with CUDA C, such as atomicAdd().

__global__ void dot(int *a, int *b, int *c) {
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();
    if (0 == threadIdx.x) {
        int sum = 0;
        for (int i = 0; i < THREADS_PER_BLOCK; i++)
            sum += temp[i];
        atomicAdd(c, sum);   // atomically accumulate this block's partial sum
    }
}
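A minimal host-side sketch for launching this multi-block dot product is given below; it is not part of the original notes, and the values of N and THREADS_PER_BLOCK, the sample inputs, and the zero-initialization of the result before the atomic accumulation are illustrative assumptions.

#include <stdlib.h>

#define N 2048
#define THREADS_PER_BLOCK 512

// uses the multi-block dot() kernel defined above
int main(void) {
    int *a, *b, c = 0;               // host copies; c holds the scalar result
    int *dev_a, *dev_b, *dev_c;      // device copies
    int size = N * sizeof(int);

    a = (int *)malloc(size);
    b = (int *)malloc(size);
    for (int i = 0; i < N; i++) { a[i] = 1; b[i] = 2; }   // sample inputs: result should be 2 * N

    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, sizeof(int));

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, &c, sizeof(int), cudaMemcpyHostToDevice);   // result starts at 0

    // each block reduces its pairwise products in shared memory, then atomically adds its partial sum
    dot<<< N / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>(dev_a, dev_b, dev_c);

    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    free(a); free(b);
    return 0;
}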