csc25-chapter_07.pdf
Document Details
Uploaded by ManageableSatire
2024
Tags
Full Transcript
CSC-25 High Performance Architectures Lecture Notes – Chapter VII Vector Computers & Graphics Processing Units (GPU) Denis Loubach [email protected] Department of Computer Systems Computer Science Division – IEC...
CSC-25 High Performance Architectures Lecture Notes – Chapter VII Vector Computers & Graphics Processing Units (GPU) Denis Loubach [email protected] Department of Computer Systems Computer Science Division – IEC Aeronautics Institute of Technology – ITA 1st semester, 2024 Detailed Contents Vector Computers Vector-Length Registers Flynn’s Taxonomy Handling if Statements in Vector Loops SIMD vs MIMD Memory Banks SIMD Vector Processors Addressing Common Applications SIMD Implementations Main Characteristics Wrap-up Basic Architecture Graphics Processing Units - GPU Some Vector Instructions Overview Operation Example Programming the GPU Questions Simple Example Vector Instructions Optimization - Chaining Simple Example 2 Vector Instructions Optimization - Convoys CUDA Terminology Vector Instructions Optimization - Lanes References 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 2/48 Outline Vector Computers Graphics Processing Units - GPU References 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 3/48 Vector Computers Flynn’s Taxonomy Classification of computer architectures1 I multiple instructions, single data - MISD I single instruction, single data - SISD I fault tolerance I single processors I multiple instructions, multiple data - I single instruction, multiple data - SIMD MIMD I vector architectures, GPU I multiprocessors 1 Proposed by Michael Flynn, 1966 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 4/48 Vector Computers (cont.) SIMD vs MIMD MIMD architecture I needs to fetch one instruction per data operation I more flexibility SIMD architecture I potentially more energy-efficient than MIMD, i.e., a single instruction can launch many data operations I more attractive than MIMD, e.g., personal mobile devices and servers I programmer continues to think sequentially, and still achieves parallel speedup by having parallel data 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 5/48 Vector Computers (cont.) SIMD Vector Processors Processors with high-level instructions on vectors Y = a×X +Y where X and Y are vectors of size n; a is scalar RISC or CISC approach? A single instruction specifies a large amount of work to be performed The first vector processors were commercialized before the superscalar processors 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 6/48 Vector Computers (cont.) SIMD Vector Processors Processors with high-level instructions on vectors Y = a×X +Y where X and Y are vectors of size n; a is scalar RISC or CISC approach? A single instruction specifies a large amount of work to be performed The first vector processors were commercialized before the superscalar processors 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 6/48 Vector Computers (cont.) Common Applications Vector processors are particularly useful for scientific and engineering applications I simulations of physical phenomena I weather forecasts I applications that operate on large structured data, i.e., matrices/vectors Multimedia applications can also benefit from vector processing, i.e., they typically contain matrices/vectors, and also machine learning algorithms Multimedia extensions, i.e., vectors, were introduced in microprocessors ISA over the time I Pentium multimedia extensions - MMX I streaming SIMD extensions - SSE, SSE2, SSE3 I advanced vector extensions - AVX 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 7/48 Vector Computers (cont.) Main Characteristics Loops parallelism exposed by the programmer or compiler through the vector instructions Memory system adapted to provide a memory access to a whole vector instead of to each element, i.e., interleaved memory 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 8/48 Vector Computers (cont.) Main Characteristics The hardware only needs to check data hazards between two vector instructions once per vector operand I not once for each element within the vectors Since a whole loop is replaced by a vector instruction, control hazards that would arise are eliminated 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 9/48 Vector Computers (cont.) Basic Architecture Generally, a vector processor consists of I a scalar unit2 , with a common pipeline I vector units In RV64V, the vector extension I adds 32 architectural vector registers v0 - v31 I all the functional units - FU are vector FU RV64V – RISC-V/Cray-1 based 2 Typically a superscalar pipeline 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 10/48 Vector Computers (cont.) RISC-V vector instruction set extension – RVV Vector registers I 8 vector registers3 I each register holds a vector with I 32 elements I 64 bits/element I register file with I 16x read ports I 8x write pots Scalar registers I 31 GPR I 32 FPR RV64V – RISC-V/Cray-1 based 3 As illustrated 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 11/48 Vector Computers (cont.) Basic Architecture Vector registers I each vector register holds a single vector Scalar registers I provide data as input to the vector functional units I hold computed addresses to pass to the vector load/store unit 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 12/48 Vector Computers (cont.) Basic Architecture Vector functional units I each unit is fully pipelined I able to start a new operation on every clock cycle I a control unit is needed to detect hazards I structural hazards for functional units I data hazards on register accesses Vector load/store unit I loads/stores a vector from/to memory I fully pipelined I words can be moved between the vector registers and memory with a bandwidth of one word per clock cycle, after an initial latency I handle scalar load/store 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 13/48 Vector Computers (cont.) Some Vector Instructions ADD 1 vadd ;add elements of V[rs1] and V[rs2] 2 ;then put each result in V[rd] SUBtract 1 vsub ;subtract elements of V[rs2] from V[rs1] 2 ;then put each result in V[rd] DIVide 1 vdiv ;divide elements of V[rs1] by V[rs2] 2 ;then put each result in V[rd] 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 14/48 Vector Computers (cont.) Some Vector Instructions Load 1 vld ;load vector register V[rd] from memory 2 ;starting at address R[rs1] Store 1 vst ;store vector register V[rd] into memory 2 ;starting at address R[rs1] 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 15/48 Vector Computers (cont.) Operation Example Vector loop for RV64V Y = a×X +Y where X and Y are vectors of size n; a is scalar This problem is the SAXPY4 or DAXPY5 loop that forms the inner loop of the Linpack6 benchmark Let’s assume the number of elements, or length, of a vector register (i.e., 32) matches the length of the vector operation 4 single-precision a × X plus Y 5 double-precision a × X plus Y 6 collection of linear algebra routines 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 16/48 Vector Computers (cont.) Operation Example Consider that X and Y have 32 elements and the starting addresses of X and Y are in x5 and x6 RV64G general-purpose register, RISC-V code for DAXPY 1 fld f0,a ; load scalar a 2 addi x28,x5,#256 ; last address to load 3 Loop: fld f1,0(x5) ; load X[i] 4 fmul.d f1,f1,f0 ; a * X[i] 5 fld f2,0(x6) ; load Y[i] 6 fadd.d f2,f2,f1 ; a * X[i] + Y[i] 7 fsd f2,0(x6) ; store into Y[i] 8 addi x5,x5,#8 ; increment index to X 9 addi x6,x6,#8 ; increment index to Y 10 bne x28,x5,Loop ; check if done 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 17/48 Vector Computers (cont.) Operation Example Consider that X and Y have 32 elements and the starting addresses of X and Y are in x5 and x6 RV64V code for DAXPY 1 vsetdcfg 4*FP64 ; enable 4 DP FP vregs 2 fld f0,a ; load scalar a 3 vld v0,x5 ; load vector X 4 vmul v1,v0,f0 ; vector-scalar mult 5 vld v2,x6 ; load vector Y 6 vadd v3,v1,v2 ; vector-vector add 7 vst v3,x6 ; store the sum 8 vdisable ; disable vector regs 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 18/48 Vector Computers (cont.) Operation Example - Compare codes, 8 instructions versus 258=32 × 8 + 2 1 vsetdcfg 4*FP64 ; enable 4 DP FP vregs 2 fld f0,a ; load scalar a 3 vld v0,x5 ; load vector X 4 vmul v1,v0,f0 ; vector-scalar mult 5 vld v2,x6 ; load vector Y 6 vadd v3,v1,v2 ; vector-vector add 7 vst v3,x6 ; store the sum 8 vdisable ; disable vector regs 1 fld f0,a ; load scalar a 2 addi x28,x5,#256 ; last address to load 3 Loop: fld f1,0(x5) ; load X[i] 4 fmul.d f1,f1,f0 ; a * X[i] 5 fld f2,0(x6) ; load Y[i] 6 fadd.d f2,f2,f1 ; a * X[i] + Y[i] 7 fsd f2,0(x6) ; store into Y[i] 8 addi x5,x5,#8 ; increment index to X 9 addi x6,x6,#8 ; increment index to Y 10 bne x28,x5,Loop ; check if done 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 19/48 Vector Computers (cont.) Questions Can vector instructions have the same latency of scalar instructions? If so, what would be the expected performance gain? 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 20/48 Vector Computers (cont.) Vector Instructions Optimization - Chaining Straightforward RV64G I every fadd.d must wait for a fmul.d I every fsd must wait for the fadd.d RV64V I will stall just for the first element in each vector I stalls once per vector instruction rather than once per vector element Chaining – forwarding of element-dependent operations 1 vmul v1,v0,f0 ;vector-scalar mult 2 vadd v3,v1,v2 ;vector-vector add, RAW hazard on ``v1'' 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 21/48 Vector Computers (cont.) Vector Instructions Optimization - Convoys Convoy – set of vector instructions that could potentially execute together I must not contain any structural hazards I not contain RAW hazard I however, chaining allows them to be in the same convoy I results from the first FU in the chain are forwarded to the second FU 3 convoys in the last code example 1. vld and vmul 2. vld and vadd 3. vst 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 22/48 Vector Computers (cont.) Vector Instructions Optimization - Lanes Lane – modern vector computers have vector FU with multiple parallel pipelines I produce two or more results per clock cycle I may also have some FU that are not fully pipelined Parallel pipelines to execute a vector instruction add. Vector processor (A) completes one addition per clock cycle. Vector processor (B), with four pipelines ADD, can complete up to four additions per clock cycle 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 23/48 Vector Computers (cont.) Vector Instructions Optimization - Lanes In multiple parallel lanes, each lane contains I one portion of the vector register file I one execution pipeline from each vector FU RV64V only allow element N of one vector register to take part in operations with element N from other vector registers Vector FU structure with 4 lanes. Vector register is divided across the lanes, with each lane holding every 4th element of each vector register 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 24/48 Vector Computers (cont.) Vector-Length Registers Maximum vector length - MVL I processor’s vector register has a natural vector length determined by MVL 1 for(i=0; i mvl? Strip mining – generation of code, i.e., each vector operation is done for a size ≤ mvl I one loop handles any number of iterations that is a multiple of mvl I another loop that handles any remaining iteration, and must be less than mvl 1 low = 0; 2 3 VL = (n % MVL); 4 5 for(j=0; j