

Chapter 7
Vector Computers & Graphics Processing Units

Vector Computers

Flynn's Taxonomy

There exists a well-known classification of computer architectures named Flynn's taxonomy, proposed by Michael Flynn in 1966. It is briefly described as follows.

"Single instruction, single data" - SISD is mainly related to single processors, as there is just one instruction working on one data item at a time.

"Single instruction, multiple data" - SIMD is mainly related to computers such as vector architectures and graphics processing units. In this case, one instruction is able to work on different data items at the same time.

"Multiple instructions, single data" - MISD is related to the use of different instructions to handle just one data item. This is used for fault tolerance, for example.

"Multiple instructions, multiple data" - MIMD is mainly related to multiprocessors, where different instructions are able to work on different data in parallel.

SIMD vs. MIMD

The MIMD architecture needs to fetch one instruction per data operation, which gives it more flexibility. On the other hand, the SIMD architecture is potentially more energy-efficient than MIMD, since a single instruction can launch many data operations. This makes SIMD especially attractive for personal mobile devices and servers, where power consumption really makes the difference. Moreover, in SIMD, the programmer continues to think sequentially and still achieves parallel speedup by having parallel data operations.

SIMD Vector Processors

SIMD vector processors are processors with high-level instructions operating on vectors, such as in Eq. (7.1):

Y = a × X + Y    (7.1)

where X and Y are vectors of size n, and a is a scalar. Is that kind of instruction following the RISC or CISC approach? Here, a single instruction specifies a large amount of work to be performed. As a historical note, the first vector processors were commercialized even before the superscalar processors.

Common Applications

Vector processors are particularly useful for scientific and engineering applications. Examples include simulations of physical phenomena, weather forecasting, and applications that operate on large structured data, i.e., matrices and vectors. Multimedia applications can also benefit from vector processing, as they typically manipulate large matrices and vectors, and so can machine learning algorithms.

Multimedia extensions, i.e., vector extensions, were introduced into microprocessor ISAs over time. Some examples are the Pentium multimedia extensions - MMX; the streaming SIMD extensions - SSE, SSE2, SSE3; and the advanced vector extensions - AVX.
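To make this concrete, the sketch below implements the DAXPY operation of Eq. (7.1) with AVX compiler intrinsics in C. This is a minimal illustrative sketch, not part of the original notes: it assumes an x86-64 compiler with AVX enabled (e.g., gcc -mavx), and the function name daxpy_avx is hypothetical.

#include <immintrin.h>  // AVX intrinsics

// Y = a*X + Y using 256-bit AVX registers (4 doubles per instruction).
void daxpy_avx(int n, double a, const double *x, double *y)
{
    __m256d va = _mm256_set1_pd(a);                     // broadcast scalar a
    int i = 0;
    for (; i + 4 <= n; i += 4) {                        // 4 elements per step
        __m256d vx = _mm256_loadu_pd(&x[i]);            // load X[i..i+3]
        __m256d vy = _mm256_loadu_pd(&y[i]);            // load Y[i..i+3]
        vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);  // a*X[i] + Y[i]
        _mm256_storeu_pd(&y[i], vy);                    // store into Y
    }
    for (; i < n; i++)                                  // scalar cleanup of the tail
        y[i] = a * x[i] + y[i];
}

Note how the scalar cleanup loop handles the elements left over when n is not a multiple of 4; the strip mining technique discussed later in this chapter plays the same role for vector registers.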
Main Characteristics

The parallelism of loops can be exposed by the programmer, or even by the compiler, through the usage of vector instructions. In this case, the memory system is adapted to provide memory access to a whole vector instead of to each element at a time, i.e., through interleaved memory.

The hardware only needs to check data hazards between two vector instructions once per vector operand, and not once for each element within the vectors. The dependency verification logic needed between two vector instructions is thus almost equal to the one required between two scalar instructions, yet many more elementary operations execute under the same control logic complexity. Moreover, since the entire loop is replaced by a vector instruction with predetermined behavior, the control hazards that would arise in the loop are non-existent here. And that is really positive.

Basic Architecture

Generally, a vector processor consists of a scalar unit (typically a superscalar pipeline) with a common pipeline, and also some vector units. The example shown in Fig. 7.1 considers 32 vector registers, and all the functional units are vector functional units.

Figure 7.1: RV64V – RISC-V, Cray-1 based.

In the RISC-V vector instruction set extension - RV64V, both the vector and scalar registers have a considerable number of read/write ports to accommodate parallel vector operations. A set of switches (gray lines in the figure) connects those ports to the inputs/outputs of the vector functional units.

RV64 stands for the RISC-V base instruction set considering 64 bits. There is also the RV32 for 32 bits:

RV32I – base integer instruction set, 32-bit, 32 registers (x0 - x31);
RV32E – base integer instruction set, 32-bit "embedded" version with 16 registers (x0 - x15);
RV64I – base integer instruction set, 64-bit; and
RV128I – base integer instruction set, 128-bit.

Some standard extensions are named as follows:

M – integer multiplication and division;
A – atomic operations;
F – single-precision floating-point;
D – double-precision floating-point;
G – shorthand for the base and for the "MAFD" standard extensions; and
V – vector operations.

Vector registers. Each vector register holds a single vector. RV64V has 32 registers, each of which is 64 bits wide. The vector register file is required to provide a sufficient number of ports to feed all the vector functional units; these ports enable enough overlap among vector operations to different vector registers.

Scalar registers. Scalar registers provide data as input to the vector functional units, besides the computed addresses to pass to the vector load/store unit. There are 31 general-purpose registers and 32 floating-point registers in this particular architecture.

Vector functional units. Each unit is fully pipelined and able to start a new operation on every clock cycle. A control unit is needed to detect hazards: structural hazards for the functional units, and data hazards on register accesses.

Vector load/store unit. This unit loads and stores vectors to and from memory. It is also fully pipelined: words can be moved between the vector registers and memory with a bandwidth of one word per clock cycle, after an initial latency. This unit also handles scalar loads and stores.

Some Vector Instructions – RISC-V ISA

Some details on the vector add instruction (Listing 7.1).

Listing 7.1: Vector add instruction.
vadd  // add elements of V[rs1] and V[rs2],
      // then put each result (each vector element) in V[rd]

Some details on the vector subtract instruction (Listing 7.2).

Listing 7.2: Vector subtract instruction.
vsub  // subtract elements of V[rs2] from V[rs1],
      // then put each result in V[rd]

Some details on the vector division instruction (Listing 7.3).

Listing 7.3: Vector division instruction.
vdiv  // divide elements of V[rs1] by V[rs2],
      // then put each result in V[rd]

Some details on the vector load instruction (Listing 7.4).

Listing 7.4: Vector load instruction.
vld   // load vector register V[rd] from memory,
      // starting at address R[rs1]

Some details on the vector store instruction (Listing 7.5).

Listing 7.5: Vector store instruction.
vst   // store vector register V[rd] into memory,
      // starting at address R[rs1]
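The element-wise semantics of these instructions can be modeled in plain C, as in the minimal sketch below. This sketch is not part of the original notes: the function names and the use of double elements are illustrative assumptions, with vl standing for the current vector length.

// Illustrative C model of vector instruction semantics (hypothetical names).
void vadd_sem(double *vd, const double *vs1, const double *vs2, int vl)
{
    for (int i = 0; i < vl; i++)
        vd[i] = vs1[i] + vs2[i];   // vadd: element-wise addition
}

void vld_sem(double *vd, const double *mem, int vl)
{
    for (int i = 0; i < vl; i++)
        vd[i] = mem[i];            // vld: contiguous load from memory
}

void vst_sem(const double *vd, double *mem, int vl)
{
    for (int i = 0; i < vl; i++)
        mem[i] = vd[i];            // vst: contiguous store to memory
}

In the real hardware, of course, each such loop is a single instruction whose element operations run in a pipelined (or multi-lane) fashion rather than sequentially.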
Operation Example

Let's take a look at a vector loop for the RV64V, such as in Eq. (7.1), repeated here for convenience:

Y = a × X + Y

where X and Y are vectors of size n, and a is a scalar. This problem is known as the SAXPY (single-precision "a × X plus Y") or DAXPY (double-precision "a × X plus Y") loop that forms the inner loop of the Linpack benchmark, a collection of linear algebra routines.

Let's also assume that the number of elements, or the length, of a vector register is 32, and that it matches the length of the vector operation. In this case, X and Y have 32 elements, and the starting addresses of X and Y are in the registers x5 and x6, respectively.

The code in Listing 7.6 shows the implementation of the DAXPY loop using the RV64G ISA.

Listing 7.6: RV64G general-purpose registers, RISC-V code for DAXPY.
      fld    f0, a          // load scalar a
      addi   x28, x5, #256  // last address to load
Loop: fld    f1, 0(x5)      // load X[i]
      fmul.d f1, f1, f0     // a * X[i]
      fld    f2, 0(x6)      // load Y[i]
      fadd.d f2, f2, f1     // a * X[i] + Y[i]
      fsd    f2, 0(x6)      // store into Y[i]
      addi   x5, x5, #8     // increment index to X
      addi   x6, x6, #8     // increment index to Y
      bne    x28, x5, Loop  // check if done

On the other hand, the code shown in Listing 7.7 refers to the DAXPY implementation based on the RV64V ISA.

Listing 7.7: RV64V code for DAXPY.
      vsetdcfg 4*FP64       // enable 4 DP FP vregs
      fld      f0, a        // load scalar a
      vld      v0, x5       // load vector X
      vmul     v1, v0, f0   // vector-scalar mult
      vld      v2, x6       // load vector Y
      vadd     v3, v1, v2   // vector-vector add
      vst      v3, x6       // store the sum
      vdisable              // disable vector regs

In Listing 7.7, the initial instruction (vsetdcfg) sets the first four vector registers to hold 64-bit (double-precision) floating-point data, while the last instruction (vdisable) disables all vector registers. In this case, a context switch after this last instruction needs no additional state to be saved.

RV64V vs. RV64G. Compare the codes: 8 instructions in Listing 7.7 versus 258 = (32 iterations × 8 instructions) + 2 setup instructions in Listing 7.6. This represents a huge difference between RV64V and RV64G when it comes to dynamic instruction bandwidth: 8 instructions versus 258. The reduction is due to the fact that the vector operations work on 32 elements at a time. Notice also that the overhead instructions present in the RV64G implementation represent almost half of the loop, and they are not present in the RV64V code.

When the compiler creates the vector instructions for such a problem (DAXPY), and the resulting code spends much of its time running in "vector mode", this code is called vectorized or vectorizable. This is the case when there is no dependence between the loop's iterations, i.e., no loop-carried dependences.
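As an illustrative contrast (not part of the original notes), the two C loops below show a loop without loop-carried dependences and one with them; only the first can be turned directly into vector instructions.

// Vectorizable: each iteration is independent of the others.
void daxpy_c(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   // no iteration reads another's result
}

// Not directly vectorizable: iteration i reads y[i-1], which was
// written by iteration i-1, i.e., a loop-carried dependence.
void running_sum(int n, const double *x, double *y)
{
    for (int i = 1; i < n; i++)
        y[i] = y[i - 1] + x[i];
}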
Reasoning

Can vector instructions have the same latency as scalar instructions? If so, what would be the expected performance gain? These questions are discussed in the next sections.

Vector Instructions Optimizations

The way vector instructions may require stalls is different from the RV64G ISA. There is also a forwarding mechanism specific to vector processor architectures, named chaining, along with other optimizations such as convoys and lanes.

Stalls

In the straightforward RV64G implementation from Listing 7.6, every fadd.d instruction must wait for an fmul.d, and every fsd instruction must wait for the fadd.d, both to avoid RAW hazards. On the other hand, the RV64V code from Listing 7.7 stalls just for the first element in each vector; the subsequent elements run smoothly through the pipeline. In vector mode, then, the pipeline stalls once per vector instruction rather than once per vector element, which lowers the frequency of pipeline interlocks.

Chaining

In the context of vector operations, the forwarding of element-dependent operations is named chaining. Consider the simple example of chaining in Listing 7.8.

Listing 7.8: Chaining example.
vmul v1, v0, f0  // vector-scalar mult
vadd v3, v1, v2  // vector-vector add, RAW hazard on "v1"

The RAW hazard related to vector register v1 causes chaining, i.e., forwarding of the element-dependent results to the operation that needs them (vadd). Thus, the vector processor architecture also has a forwarding mechanism, and it is named chaining.

Convoys

A convoy is a set of vector instructions that could potentially execute together. The set must not contain any structural hazards and must not contain any RAW hazards. However, with respect to RAW hazards, chaining allows such instructions to be in the same convoy: the results from the first functional unit in the chain are forwarded to the second functional unit.

In the code example from Listing 7.7, three different convoys are possible:

1. vld and vmul – vector register v0 can be chained;
2. vld and vadd – vector register v2 can be chained; and
3. vst.

Lanes

Modern vector computers have vector functional units with multiple parallel pipelines, i.e., lanes, able to produce two or more results per clock cycle. They may also have some functional units that are not fully pipelined. Fig. 7.2 illustrates two different computers and compares them in terms of additions completed per clock cycle.

Figure 7.2: Parallel pipelines to execute the vector add instruction. Vector processor (A) completes one addition per clock cycle. Vector processor (B), with four add pipelines, can complete up to four additions per clock cycle. Processor (A) is shown with C[1] already computed, while processor (B) has already computed C[1], C[2], C[3], and C[4].

With multiple parallel lanes, as shown in Fig. 7.3, each lane contains one portion of the vector register file and one execution pipeline from each vector functional unit. RV64V only allows element n of one vector register to take part in operations with element n of other vector registers; adding A[n+1] to B[n], for example, is not allowed. A vector arithmetic unit thus holds four execution pipelines, one per lane, which operate together to execute and complete a single vector instruction.

Figure 7.3: Vector processor functional unit structure with four lanes. The vector register file is divided across the lanes, with each lane holding every fourth element of each vector register. Three different vector functional units are illustrated: (i) floating-point add, (ii) floating-point multiply, and (iii) load-store. The vector register file ports for the pipelines are local to each lane.
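As a rough sequential sketch (not part of the original notes), the element-to-lane assignment of Fig. 7.3 can be modeled in C as follows; the constant NLANES and the function name are illustrative assumptions.

#define NLANES 4   // four-lane configuration, as in Fig. 7.3

// Element i of every vector register lives in lane i % NLANES, so a
// vector add advances all lanes in lockstep, one element per lane per
// clock. The outer loop runs in parallel in the real hardware.
void vadd_lanes(double *vd, const double *vs1, const double *vs2, int vl)
{
    for (int lane = 0; lane < NLANES; lane++)
        for (int i = lane; i < vl; i += NLANES)  // this lane's elements
            vd[i] = vs1[i] + vs2[i];
}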
Vector-Length Registers

A vector processor has a natural vector length determined by the maximum vector length - MVL. The code snippet in Listing 7.9 illustrates the MVL concept.

Listing 7.9: Simple code to exemplify the MVL concept.
for (i = 0; i < n; i++)
    Y[i] = a * X[i] + Y[i];

In that case (Listing 7.9), the vector size depends on n, which may not be known until runtime. Note also that n can be a parameter to the procedure containing the loop, and is therefore subject to change during execution. In this sense, the vector-length register - VL, as in Eq. (7.2), controls the length of any vector operation, including a vector load/store:

vl ≤ mvl    (7.2)

Strip Mining

What if n is unknown at compile time, and possibly greater than mvl? For this case, there is the strip mining technique. Strip mining consists in generating code where each vector operation is done for a size less than or equal to mvl: one loop handles any number of iterations that is a multiple of mvl, and another loop handles the remainder iterations, which must be fewer than mvl. Fig. 7.4 illustrates the strip mining concept along with Listing 7.10.

Figure 7.4: In the odd case, all blocks but the first have length mvl, and m = (n mod mvl).

Listing 7.10: MVL code example.
low = 0;
VL = (n % MVL);                                // find odd-size piece using modulo
for (j = 0; j <= (n / MVL); j = j + 1) {       // outer loop
    for (i = low; i < (low + VL); i = i + 1)   // runs for length VL
        Y[i] = a * X[i] + Y[i];                // main operation
    low = low + VL;                            // start of the next vector
    VL = MVL;                                  // reset the length to the maximum
}

Graphics Processing Units

Simple Example 1

Consider the DAXPY loop, Eq. (7.1), written in CUDA (Listing 7.18).

Listing 7.18: DAXPY in CUDA.
// invoke DAXPY with 256 threads per thread block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

// DAXPY in CUDA
__global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // idiomatic CUDA
    if (i < n)
        y[i] = a * x[i] + y[i];
}

This code (Listing 7.18) launches n threads, one per vector element, with 256 threads per thread block in a multithreaded SIMD processor. The GPU function begins by computing the corresponding element index i based on the block ID, the number of threads per block, and the thread ID. The multiply-and-add operation is performed as long as the index i is within the array.

Simple Example 2

As another example, let's multiply two vectors together, considering each vector has 8,192 elements:

A = B × C

The GPU code that works on the whole 8,192-element multiply is called a grid, or vectorized loop. A grid is composed of thread blocks, i.e., bodies of the vectorized loop, each block covering up to 512 elements, i.e., 16 SIMD threads per block. A SIMD instruction executes 32 elements at a time. With 8,192 elements in the vectors, this example has 16 thread blocks:

8192 elements ÷ 512 elements per thread block = 16 thread blocks
512 elements per block = 16 SIMD threads × 32 elements per SIMD thread

Wrapping this up, 1 grid with 8,192 elements (illustrated in Fig. 7.6):

16 thread blocks – each with 16 SIMD threads operating on 32 elements at a time.

Here, the hardware thread block scheduler assigns thread blocks to multithreaded SIMD processors. Then, the hardware thread scheduler picks which thread of SIMD instructions to run each clock cycle within a SIMD processor.

Figure 7.6: CUDA grid mapping, or vectorizable loop. In this example, each thread of SIMD instructions computes 32 elements per instruction.
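As a sequential C sketch (not part of the original notes), the decomposition of Simple Example 2 corresponds to the loop nest below; the constants and the function name are illustrative assumptions that mirror the numbers above.

#define BLOCKS           16   // thread blocks in the grid
#define SIMD_THREADS     16   // SIMD threads per thread block
#define ELEMS_PER_THREAD 32   // elements per SIMD instruction

// One grid: 16 blocks × 16 SIMD threads × 32 elements = 8,192 elements.
void grid_multiply(double *a, const double *b, const double *c)
{
    for (int blk = 0; blk < BLOCKS; blk++)          // scheduled onto SIMD processors
        for (int t = 0; t < SIMD_THREADS; t++)      // threads of SIMD instructions
            for (int e = 0; e < ELEMS_PER_THREAD; e++) {  // SIMD lanes, in lockstep
                int i = (blk * SIMD_THREADS + t) * ELEMS_PER_THREAD + e;
                a[i] = b[i] * c[i];                 // one element of A = B × C
            }
}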
For reference: CUDA Terminology

Fig. 7.7 gives the main CUDA terminology currently used.

Figure 7.7: CUDA detailed terminology.
