cap7-lecture-notes-133-146-1-4.pdf

Chapter 7 Vector Computers & Graphics Processing Units

Vector Computers

Flynn's Taxonomy

There exists a well-known classification of computer architectures named Flynn's taxonomy (proposed by Michael Flynn in 1966). It is briefly described as follows.

"Single instruction, single data" (SISD) mainly corresponds to single processors, as there is just one instruction working on one data item at a time.

"Single instruction, multiple data" (SIMD) mainly corresponds to computers such as vector architectures and graphics processing units. In this case, one instruction is able to work on different data items at the same time.

"Multiple instructions, single data" (MISD) corresponds to the use of different instructions to handle just one data item. This is used for fault tolerance, for example.

"Multiple instructions, multiple data" (MIMD) mainly corresponds to multiprocessors, where different instructions are able to work on different data items in parallel.

SIMD vs. MIMD

A MIMD architecture needs to fetch one instruction per data operation, which gives it more flexibility. On the other hand, a SIMD architecture is potentially more energy-efficient than MIMD, because a single instruction can launch many data operations. This makes SIMD attractive for personal mobile devices and for servers, where power consumption makes a real difference. Moreover, in SIMD the programmer continues to think sequentially and still achieves parallel speedup through parallel data operations.

SIMD Vector Processors

SIMD vector processors are processors with high-level instructions that operate on vectors, as in Eq. (7.1).

$\vec{Y} = a \times \vec{X} + \vec{Y}$    (7.1)

where $\vec{X}$ and $\vec{Y}$ are vectors of size $n$, and $a$ is a scalar. Does that kind of instruction follow the RISC or the CISC approach? Here, a single instruction specifies a large amount of work to be performed.
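
As a rough illustration of Eq. (7.1), the sketch below contrasts an element-by-element loop with a single whole-vector call. The function names `axpy_scalar_loop` and `axpy_vector` and the use of plain Python lists are assumptions made for this example only; they do not come from any ISA and merely mimic the idea that one vector operation stands in for an entire loop.

```python
def axpy_scalar_loop(a, x, y):
    """Scalar style: one multiply-add per loop iteration (hypothetical helper)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    result = list(y)                      # do not modify the caller's list
    for i in range(len(x)):
        result[i] = a * x[i] + result[i]  # one element per iteration
    return result


def axpy_vector(a, x, y):
    """Vector style: the whole update Y = a*X + Y is requested in one call."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
print(axpy_scalar_loop(2.0, x, y))  # [12.0, 24.0, 36.0, 48.0]
print(axpy_vector(2.0, x, y))       # [12.0, 24.0, 36.0, 48.0]
```

Both functions compute the same values; the difference is only in how the work is expressed. On real vector hardware the second form would map to a handful of vector instructions, whereas the first would issue a multiply, an add, and a branch for every element.
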
As a historical note, the first vector processors were commercialized even before superscalar processors.

Common Applications

Vector processors are particularly useful for scientific and engineering applications. Examples include simulations of physical phenomena, weather forecasting, and applications that operate on large structured data, e.g., matrices and vectors. Multimedia applications and machine learning algorithms can also benefit from vector processing, since they typically work on large matrices and vectors. Multimedia extensions, i.e., vector-style instructions, were introduced into microprocessor ISAs over time. Some examples are the Pentium multimedia extensions (MMX); the streaming SIMD extensions (SSE, SSE2, SSE3); and the advanced vector extensions (AVX).

Main Characteristics

The parallelism of loops can be exposed by the programmer, or even by the compiler, through the use of vector instructions. The memory system is adapted to provide access to a whole vector instead of one element at a time, e.g., through interleaved memory. The hardware only needs to check data hazards between two vector instructions once per vector operand, and not once for each element within the vectors. Consequently, the dependency verification logic needed for vector instructions is almost the same as the logic required to verify dependencies between two scalar instructions, yet many more elementary operations execute for the same control-logic complexity. In addition, since an entire loop is replaced by a vector instruction with predetermined behavior, the control hazards that would otherwise arise in the loop are eliminated.

Basic Architecture

Generally, a vector processor consists of a scalar unit with an ordinary pipeline (typically a superscalar pipeline), plus some vector units. The example shown in Fig. 7.1 has 32 vector registers, and all the functional units are vector functional units.

Figure 7.1: RV64V – RISC-V, Cray-1 based.

In the RISC-V vector instruction set extension (RV64V), both the vector and scalar registers have a considerable number of read/write ports to accommodate parallel vector operations. A set of switches (gray lines in the figure) connects those ports to the inputs and outputs of the vector functional units.

RV64 denotes the 64-bit RISC-V base instruction set. There is also RV32 for 32-bit. The base instruction sets are:

RV32I: base integer instruction set, 32-bit, 32 registers (x0 - x31);
RV32E: base integer instruction set, 32-bit "embedded" version with 16 registers (x0 - x15);
RV64I: base integer instruction set, 64-bit; and
RV128I: base integer instruction set, 128-bit.

Some standard extensions are named as follows:

M: integer multiplication and division;
A: atomic operations;
F: single-precision floating-point;
D: double-precision floating-point;
G: shorthand for the base plus the "MAFD" standard extensions; and
V: vector operations.

Vector registers. Each vector register holds a single vector. RV64V has 32 registers, each of which is 64 bits wide. In this vector architecture, the vector register file is required to provide a sufficient number of ports to feed all the vector functional units. These ports enable enough overlap among vector operations that use different vector registers.

Scalar registers. Scalar registers provide data as input to the vector functional units, as well as the computed addresses passed to the vector load/store unit. There are 31 general-purpose registers and 32 floating-point registers in this particular architecture.

Vector functional units. Each unit is fully pipelined and able to start a new operation on every clock cycle.
A control unit is needed to detect hazards: structural hazards for functional units, and data hazards on register accesses.

Vector load/store unit. This unit loads a vector from memory and stores a vector to memory. It is also fully pipelined: after an initial latency, words can be moved between the vector registers and the memory with a bandwidth of one word per clock cycle. This unit also handles scalar loads and stores.

Some Vector Instructions (RISC-V ISA)

Some details on the vector add instruction (Listing 7.1).

Listing 7.1: Vector add instruction.

    vadd    // add elements of V[rs1] and V[rs2],
            // then put each result (each vector element) in V[rd]

Some details on the vector subtract instruction (Listing 7.2).

Listing 7.2: Vector subtract instruction.

    vsub    // subtract elements of V[rs2] from V[rs1],
            // then put each result in V[rd]

Some details on the vector division instruction (Listing 7.3).

Listing 7.3: Vector division instruction.

    vdiv    // divide elements of V[rs1] by V[rs2],
            // then put each result in V[rd]

Some details on the vector load instruction (Listing 7.4).

Listing 7.4: Vector load instruction.

    vld     // load vector register V[rd] from memory,
            // starting at address R[rs1]

Some details on the vector store instruction (Listing 7.5).

Listing 7.5: Vector store instruction.

    vst     // store vector register V[rd] into memory,
            // starting at address R[rs1]
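
To make the semantics in Listings 7.1 to 7.5 concrete, here is a minimal behavioral sketch. Everything about the model is an assumption made for illustration: the class name `ToyVectorMachine`, the fixed vector length, the word-addressed list used as memory, and the argument order of each method. It models only what each instruction computes, not pipelining, hazards, or timing.

```python
class ToyVectorMachine:
    """Hypothetical model: 32 vector registers V, 32 scalar registers R."""

    def __init__(self, memory, vector_length):
        self.mem = list(memory)      # one list slot per memory word
        self.vl = vector_length      # elements per vector (assumed fixed)
        self.V = [[0.0] * vector_length for _ in range(32)]
        self.R = [0] * 32            # scalar registers, used here for addresses

    def _base(self, rs1):
        """Return the start address in R[rs1], checking it stays in memory."""
        base = self.R[rs1]
        if base < 0 or base + self.vl > len(self.mem):
            raise IndexError("vector access outside memory")
        return base

    def vld(self, rd, rs1):
        """Load V[rd] from memory starting at address R[rs1]."""
        base = self._base(rs1)
        self.V[rd] = self.mem[base:base + self.vl]

    def vst(self, rd, rs1):
        """Store V[rd] into memory starting at address R[rs1]."""
        base = self._base(rs1)
        self.mem[base:base + self.vl] = self.V[rd]

    def vadd(self, rd, rs1, rs2):
        """V[rd][i] = V[rs1][i] + V[rs2][i] for every element i."""
        self.V[rd] = [p + q for p, q in zip(self.V[rs1], self.V[rs2])]

    def vsub(self, rd, rs1, rs2):
        """V[rd][i] = V[rs1][i] - V[rs2][i] for every element i."""
        self.V[rd] = [p - q for p, q in zip(self.V[rs1], self.V[rs2])]

    def vdiv(self, rd, rs1, rs2):
        """V[rd][i] = V[rs1][i] / V[rs2][i] for every element i."""
        self.V[rd] = [p / q for p, q in zip(self.V[rs1], self.V[rs2])]


# Two 4-element vectors followed by 4 words reserved for the result.
m = ToyVectorMachine([1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0,
                      0.0, 0.0, 0.0, 0.0], vector_length=4)
m.R[1], m.R[2], m.R[3] = 0, 4, 8   # base addresses of the three vectors
m.vld(1, 1)                        # V[1] <- mem[0..3]
m.vld(2, 2)                        # V[2] <- mem[4..7]
m.vadd(3, 1, 2)                    # V[3] <- V[1] + V[2]
m.vst(3, 3)                        # mem[8..11] <- V[3]
print(m.mem[8:12])                 # [11.0, 22.0, 33.0, 44.0]
```

Four calls replace what would otherwise be a loop of scalar loads, adds, stores, and branches, which is the point made earlier about a single vector instruction specifying a large amount of work.
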
