7 Vector Computers & Graphics Processing Units

Graphics Processing Units

Vector instructions and registers were also included in microprocessors such as the x86 and PowerPC, and in machines like the Nintendo 64 and PlayStation 2, to accelerate graphical operations.

Overview

A graphics processing unit (GPU) is similar to a set of vector processors sharing hardware. The multiple SIMD processors in a GPU act as independent MIMD cores, just as vector computers have multiple vector processors. The main difference is multithreading, which is fundamental to GPUs and missing from most vector processors.

Programming the GPU

NVIDIA developed a C-like programming language, the Compute Unified Device Architecture (CUDA), to program its GPUs. CUDA generates C/C++ code for the system processor, known as the "host", and a C/C++ dialect for the GPU, known as the "device".

Side note. The Open Computing Language (OpenCL) is, in a general and rough sense, a programming language similar to CUDA. Several companies are developing OpenCL to offer a vendor-independent language for multiple platforms, in contrast to CUDA. However, CUDA is much better known.

The CUDA thread. The CUDA thread is the lowest level of parallelism, the programming primitive. It follows the "single instruction, multiple threads" (SIMT) paradigm.

The thread block. In CUDA, threads are grouped into blocks and executed together.

Multithreaded SIMD processor. The multithreaded SIMD processor is the hardware that executes a whole block of threads.

Functions

CUDA functions can carry different modifiers: device, global, or host.

Listing 7.12: Executed on the device, launched by the device.
    __device__

Listing 7.13: Executed on the device, launched by the host.
    __global__

Listing 7.14: Executed on the host, launched by the host.
    __host__

Variables

Like functions, CUDA variables can also take modifiers such as device.
Listing 7.15: Device modifier.
    __device__

A variable declared with this modifier is allocated in GPU memory and is accessible by all multithreaded SIMD processors.

Extended function call

CUDA extends the function-call syntax with an execution configuration, as described in Listing 7.16.

Listing 7.16: CUDA execution configuration.
    fname<<<dimGrid, dimBlock>>>(... parameter list ...)

dimGrid specifies the dimensions of the grid, in terms of thread blocks; dimBlock specifies the dimensions of a block, in terms of threads. blockIdx is the identifier/index for blocks. threadIdx is the identifier/index of the current thread within its block. blockDim is the number of threads in a block, which comes from the dimBlock parameter.

Specific Terms

Some CUDA-specific terms are compared in Fig. 7.5.

Figure 7.5: CUDA specific terms comparison.

Simple Example

Let's consider a conventional C code example, as in Listing 7.17.

Listing 7.17: Conventional C language code.
    // invoke DAXPY
    daxpy(n, 2.0, x, y);

    // DAXPY in C
    void daxpy(int n, double a, double *x, double *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

The corresponding CUDA version is presented in Listing 7.18.

Listing 7.18: CUDA corresponding version.
    // launch n threads, one per vector element
    // invoke DAXPY with 256 threads per thread block
    // each block will be allocated to one multithreaded SIMD proc.
    __host__
    int iBlockSize = 256;
    int nblocks = (n + iBlockSize - 1) / iBlockSize; // how many blocks
    daxpy<<<nblocks, iBlockSize>>>(n, 2.0, x, y);

    // DAXPY in CUDA
    __global__
    void daxpy(int n, double a, double *x, double *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // idiomatic CUDA
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

This code (Listing 7.18) launches n threads, one per vector element, with 256 threads per thread block on a multithreaded SIMD processor.
The GPU function begins by computing the corresponding element index i from the block ID, the number of threads per block, and the thread ID. The multiply and add is performed only if the index i is within the array bounds.

Simple Example 2

As another example, let's multiply two vectors together, element by element, each vector having 8,192 elements:

    B⃗ = A⃗ × C⃗

The GPU code that works on all 8,192 elements of the multiply is called a grid, or vectorized loop. A grid is composed of thread blocks, i.e., bodies of the vectorized loop. In this case, each thread block handles up to 512 elements, i.e., 16 SIMD threads per block, and each SIMD instruction executes 32 elements at a time. With 8,192 elements in the vectors, this example has 16 thread blocks:

    8192 elements / 512 elements per block = 16 blocks
    512 elements per block = 16 SIMD threads × 32 elements per SIMD thread

Wrapping this up, 1 grid with 8,192 elements (illustrated in Fig. 7.6):
    16 thread blocks
        16 SIMD threads each, operating on 32 elements at a time

Here, the hardware thread block scheduler assigns thread blocks to multithreaded SIMD processors. Then, within a SIMD processor, the hardware thread scheduler picks which thread of SIMD instructions to run each clock cycle.

Figure 7.6: CUDA grid mapping, or vectorizable loop.

In this example, each thread of SIMD instructions computes 32 elements per instruction.
