cap7-lecture-notes-133-146-5-10.pdf

Full Transcript


Vector Computers Operation Example

Let's take a look at a vector loop for RV64V, such as the one in Eq. (7.1), repeated here for convenience:

Y = a × X + Y  (7.1)

where X and Y are vectors of size n, and a is a scalar. This problem is known as the SAXPY (single-precision "a × X plus Y") or DAXPY (double-precision "a × X plus Y") loop that forms the inner loop of the Linpack benchmark, a collection of linear algebra routines. Let's also assume that the number of elements, or length, of a vector register is 32, and that it matches the length of the vector operation. In this case, X and Y have 32 elements, and the starting addresses of X and Y are in registers x5 and x6, respectively. The code in Listing 7.6 shows the implementation of the DAXPY loop using the RV64G ISA.

Listing 7.6: RV64G (general-purpose registers) RISC-V code for DAXPY.

          fld    f0, a          // load scalar a
          addi   x28, x5, #256  // last address to load
    Loop: fld    f1, 0(x5)      // load X[i]
          fmul.d f1, f1, f0     // a * X[i]
          fld    f2, 0(x6)      // load Y[i]
          fadd.d f2, f2, f1     // a * X[i] + Y[i]
          fsd    f2, 0(x6)      // store into Y[i]
          addi   x5, x5, #8     // increment index to X
          addi   x6, x6, #8     // increment index to Y
          bne    x28, x5, Loop  // check if done

On the other hand, the code shown in Listing 7.7 is the DAXPY implementation based on the RV64V ISA.

Listing 7.7: RV64V code for DAXPY.

    vsetdcfg 4*FP64      // enable 4 DP FP vregs
    fld      f0, a       // load scalar a
    vld      v0, x5      // load vector X
    vmul     v1, v0, f0  // vector-scalar mult
    vld      v2, x6      // load vector Y
    vadd     v3, v1, v2  // vector-vector add
    vst      v3, x6      // store the sum
    vdisable             // disable vector regs

In Listing 7.7, the initial instruction (vsetdcfg) sets the first four vector registers to hold 64-bit (double-precision) floating-point data, while the last instruction (vdisable) disables all vector registers. In this case, a context switch after this last instruction needs no additional vector state to be saved.

RV64V vs. RV64G: Compare the two codes: 8 instructions in Listing 7.7 versus 258 = (32 iterations × 8 instructions) + 2 setup instructions in Listing 7.6. This is a huge difference between RV64V and RV64G in dynamic instruction bandwidth: 8 instructions versus 258. The reduction comes from the fact that each vector operation works on 32 elements at once. Notice that the overhead instructions in the RV64G implementation (the address increments and the branch) represent almost half of the loop, and they are not present in the RV64V code.

When the compiler creates vector instructions for such a problem (DAXPY), and the resulting code spends much of its time running in "vector mode", the code is called vectorized or vectorizable. This is the case when there is no dependence between the loop's iterations, i.e., no loop-carried dependences.
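As a concrete illustration of a vectorizable loop, the computation that both listings implement can be written in plain C as below. This is a minimal sketch (the function name and signature are ours, not from the lecture notes): each iteration touches only X[i] and Y[i], so there are no loop-carried dependences and the loop maps directly onto vector instructions.

    #include <stddef.h>

    // DAXPY: Y = a*X + Y on n-element vectors (n = 32 in the example above).
    // No iteration reads a value written by another iteration, so there are
    // no loop-carried dependences and the loop is vectorizable.
    void daxpy(size_t n, double a, const double *X, double *Y) {
        for (size_t i = 0; i < n; i++)
            Y[i] = a * X[i] + Y[i];
    }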
Reasoning: Can vector instructions have the same latency as scalar instructions? If so, what would be the expected performance gain? These questions are discussed in the next sections.

Vector Instructions Optimizations

The way vector instructions may require stalls is different from the RV64G ISA. There is also a forwarding mechanism that is specific to vector processor architectures, along with other optimizations such as convoys and lanes.

Stalls

In the straightforward RV64G implementation from Listing 7.6, every fadd.d instruction must wait for the fmul.d because of the RAW dependence on f1, and every fsd instruction must wait for the fadd.d because of the RAW dependence on f2. On the other hand, the RV64V code from Listing 7.7 stalls only for the first element of each vector; the subsequent elements run smoothly through the pipeline. This difference is about the frequency of pipeline interlocks: in a vector processor, the pipeline stalls once per vector instruction rather than once per vector element.

Chaining

In the context of vector operations, the forwarding of element-dependent operations is named chaining. Consider the simple example in Listing 7.8.

Listing 7.8: Chaining example.

    vmul v1, v0, f0  // vector-scalar mult
    vadd v3, v1, v2  // vector-vector add, RAW hazard on "v1"

The RAW hazard on vector register v1 causes chaining, i.e., each element of v1 is forwarded to the dependent operation (vadd) as soon as it is produced. Thus, the vector processor architecture also has a forwarding mechanism, and it is named chaining.

Convoys

A convoy is a set of vector instructions that could potentially execute together. The set must not contain any structural hazard and must not contain any RAW hazard; however, with respect to RAW hazards, chaining allows dependent instructions to be placed in the same convoy, with the results from the first functional unit in the chain forwarded to the second. In the code from Listing 7.7, three convoys are possible:

1. vld and vmul — vector register v0 can be chained;
2. vld and vadd — vector register v2 can be chained; and
3. vst.

Lanes

Modern vector computers have vector functional units with multiple parallel pipelines, i.e., lanes, which can produce two or more results per clock cycle. They may also have some functional units that are not fully pipelined. Fig. 7.2 illustrates two different computers and compares them in terms of additions completed per clock cycle. With multiple parallel lanes, as shown in Fig. 7.3, each lane contains one portion of the vector register file and one execution pipeline from each vector functional unit. RV64V only allows element n of one vector register to take part in operations with element n of the other vector registers; for example, adding A[n+1] to B[n] is not allowed. A vector arithmetic unit then holds four execution pipelines, one per lane, and they operate together to execute and complete a single vector instruction.

Figure 7.2: Parallel pipelines executing a vector add instruction. Vector processor (A) completes one addition per clock cycle, while vector processor (B), with four add pipelines, can complete up to four additions per clock cycle. In the snapshot, processor (A) has computed one element of C, while processor (B) has already computed the first four elements of C.

Figure 7.3: Vector processor functional-unit structure with four lanes. The vector register file is divided across the lanes, with each lane holding every fourth element of each vector register. Three different vector functional units are illustrated: (i) floating-point add, (ii) floating-point multiply, and (iii) load-store. Notice also that the vector register file ports for the pipelines are local to the lane.
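To make the lane organization concrete, the sketch below models the element distribution of Fig. 7.3 in C. It is only an illustration under our own assumptions (the names VLEN, LANES, and vector_add_by_lanes are made up): lane k owns elements k, k + 4, k + 8, ... of each vector register, and in real hardware the four lanes execute in parallel rather than in an outer loop.

    #include <stdio.h>

    #define VLEN  32  // vector length, matching the 32-element vectors above
    #define LANES 4   // number of lanes, as in Fig. 7.3

    // One vector add instruction, modeled lane by lane: lane k processes
    // elements k, k+4, k+8, ... (in hardware, the lanes run in parallel).
    void vector_add_by_lanes(const double *a, const double *b, double *c) {
        for (int lane = 0; lane < LANES; lane++)
            for (int i = lane; i < VLEN; i += LANES)
                c[i] = a[i] + b[i];
    }

    int main(void) {
        double a[VLEN], b[VLEN], c[VLEN];
        for (int i = 0; i < VLEN; i++) { a[i] = i; b[i] = 2.0 * i; }
        vector_add_by_lanes(a, b, c);
        printf("c[5] = %.1f\n", c[5]);  // 5 + 10 = 15
        return 0;
    }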
Vector-Length Registers

A vector-register processor has a natural vector length determined by the maximum vector length (mvl). The following code snippet (Listing 7.9) illustrates the concept.

Listing 7.9: Simple code to exemplify the MVL concept.

    for (i = 0; i < n; i++)
        Y[i] = a * X[i] + Y[i];

In Listing 7.9, the vector size depends on n, which may not be known until runtime. Note also that n can be a parameter to the procedure containing the preceding loop, and is therefore subject to change during execution. For this reason, the vector-length register (vl), as in Eq. (7.2), controls the length of any vector operation, including a vector load/store.

vl ≤ mvl  (7.2)

Strip Mining

What if n is unknown at compile time and may take a value greater than mvl? For this case, there is the strip-mining technique. Strip mining consists of generating code in which each vector operation is done on a piece of size less than or equal to mvl: one loop handles any number of iterations that is a multiple of mvl, and another loop handles the remainder, which must be less than mvl. Fig. 7.4 illustrates the strip-mining concept along with Listing 7.10.

Figure 7.4: In the odd case (n not a multiple of mvl), all blocks but the first have length mvl, and m = (n mod mvl).

Listing 7.10: MVL code example.

    low = 0;
    VL = (n % MVL);  // find odd-size piece using modulo
    for (j = 0; j
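Listing 7.10 breaks off in the notes right after the start of the outer loop. The sketch below shows one common way to complete the strip-mined DAXPY under the same variable names (low, VL, MVL); it is our reconstruction, not the original listing. The first, possibly empty piece handles the n mod MVL leftover elements, and every later piece has exactly MVL elements.

    #include <stddef.h>

    #define MVL 32  // maximum vector length of the hardware (assumed)

    // Strip-mined DAXPY (reconstruction): the first piece handles the
    // odd-size remainder of n mod MVL elements; all later pieces have
    // exactly MVL elements, so each one fits in the vector registers.
    void daxpy_strip_mined(size_t n, double a, const double *X, double *Y) {
        size_t low = 0;
        size_t VL = n % MVL;                     // find odd-size piece using modulo
        for (size_t j = 0; j <= n / MVL; j++) {  // one pass per piece
            for (size_t i = low; i < low + VL; i++)
                Y[i] = a * X[i] + Y[i];          // main operation, length VL
            low = low + VL;                      // start of the next piece
            VL = MVL;                            // reset length to the maximum
        }
    }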
