Chapter 5
Superscalar Processors

Advances in Instruction-Level Parallelism Exploration

Hardware and Software
Basically, there are two largely distinct approaches to exploiting ILP: hardware-based and software-based. The first approach aims to discover and exploit the parallelism dynamically, i.e., at runtime. The second aims to find the parallelism statically, at compile time, i.e., at design time. Aggressive compiler-based proposals have been tried numerous times since the '80s. One of the most famous attempts was related to the Intel Itanium series, in 1999. Despite the huge efforts, such approaches have been successful only in domain-specific environments. The pipelining approach creates a theoretical barrier, i.e., clock cycles per instruction - CPI equal to 1, as seen in Chapter 3.

Pipelining Limits
When considering the pipelining limits, the CPI value for a pipelined processor is the sum of the base CPI and all contributions from stalls, as described in Eq. (5.1).

CPI_pipeline = CPI_ideal_pipeline + Stalls_structural + Stalls_data_hazard + Stalls_control    (5.1)

The ideal, or theoretical, pipeline CPI is approximately 1, so the actual pipeline CPI is greater than 1. For example, with an ideal CPI of 1 and 0.10 structural, 0.20 data-hazard, and 0.15 control stalls per instruction, the actual CPI would be 1.45. And now comes an interesting question: how could the processor be changed to get a CPI less than 1?

How to Advance in ILP Exploration
The advances in ILP exploration rely on techniques such as pipelining, superpipelining, and multiple issue. The last covers superscalar processors and also the very long instruction word - VLIW approach.

Pipeline. As seen in the previous chapters, the pipeline technique divides instructions into phases, or stages, and overlaps these stages to improve performance. This is completely transparent at the software level, as the hardware takes care of it. Fig. 5.1 illustrates a 5-stage pipeline.

Figure 5.1: 5-stage pipeline.

Superpipeline. The superpipeline approach exploits the fact that various pipeline stages require less than half a clock cycle to complete their operations. Thus, a doubled internal clock frequency enables two instances of the same stage to run within one external clock cycle, that is, two of the same pipeline stage per clock cycle. Each pipeline stage function can be divided into two non-overlapping parts (red box in Fig. 5.2), each executed in half a clock cycle, as illustrated in Fig. 5.2. This is named superpipeline level 2, i.e., a deeper pipeline.

Figure 5.2: Superpipeline level 2. Two instances of the same pipeline stage per clock cycle: in the first clock cycle, two instructions can be fetched.

Superscalar. A superscalar processor performs multiple issue. In other words, the processor issues multiple scalar instructions per cycle and is then able to execute independent instructions in parallel in different pipelines, as illustrated in Fig. 5.3. In that case, the processor has multiple functional units. To be really effective, out-of-order instruction execution has to be allowed, and with out-of-order execution the processor has to handle "new" problems such as structural hazards, e.g., with respect to registers, and data hazards, i.e., WAW and WAR.

Figure 5.3: Superscalar processor with functional units in parallel. Modern processors use just one pipeline, but with multiple functional units, e.g., two ALUs, one LOAD unit, one STORE unit, and one floating point - FP unit.
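To make the multiple-issue idea concrete, the sketch below is a minimal illustration, not something taken from the chapter: it groups a straight-line instruction sequence into issue packets, letting an instruction join the current packet only while the packet is below the issue width and has no register conflict with it. The instruction format, the dual-issue width, and the example program are all assumptions; hazards across packets are left to the pipeline's normal stall logic.

# Minimal multiple-issue sketch (illustrative assumptions throughout).
ISSUE_WIDTH = 2  # hypothetical dual-issue processor

def conflicts(instr, packet):
    """True if instr has a RAW, WAR, or WAW conflict with an instruction in the packet."""
    for other in packet:
        if other["dest"] in instr["srcs"]:   # RAW: reads a value produced in this packet
            return True
        if instr["dest"] in other["srcs"]:   # WAR: overwrites a value still being read
            return True
        if instr["dest"] == other["dest"]:   # WAW: two writes to the same register
            return True
    return False

def form_packets(program):
    packets, current = [], []
    for instr in program:
        if len(current) == ISSUE_WIDTH or conflicts(instr, current):
            packets.append(current)          # close the packet and start a new one
            current = []
        current.append(instr)
    if current:
        packets.append(current)
    return packets

program = [
    {"op": "LOAD",  "dest": "R1", "srcs": set()},
    {"op": "ADD",   "dest": "R2", "srcs": {"R1", "R3"}},  # RAW on R1: opens a new packet
    {"op": "MUL",   "dest": "R4", "srcs": {"R5", "R6"}},  # independent: shares the packet
    {"op": "STORE", "dest": "",   "srcs": {"R2"}},        # packet full: issued next cycle
]

for cycle, packet in enumerate(form_packets(program)):
    print("cycle", cycle, [i["op"] for i in packet])
# cycle 0 ['LOAD']
# cycle 1 ['ADD', 'MUL']
# cycle 2 ['STORE']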
The superscalar processor makes sense if the instruction issue frequency is much greater than the execution frequency:

instruction issue frequency >> execution frequency

Very Long Instruction Word - VLIW. The VLIW approach encodes several different instructions in one "long" word, including the various opcodes and their operands. In the VLIW context, there is the notion of packing. Modern processors have an "end of group" bit in the packing; this bit identifies the group of instructions comprising a package. Packing is handled at design time by the compiler. The processor then fetches the package and issues it entirely at once. A VLIW processor also has multiple functional units, as illustrated in Fig. 5.4.

Figure 5.4: Very long instruction word, encoding more than one instruction.

The Quest for CPI < 1
The quest for a CPI less than 1 involves multiple-issue processors such as the superscalar and VLIW designs. In a superscalar processor, the instructions can be ordered, or scheduled, by either software or hardware. Software scheduling relies on modern compiler optimizations. Register optimizations can increase contention by maximizing register usage. In this case, there is a static schedule based on knowledge of the dependency-checking mechanisms. The static scheduling also has to take branch prediction into consideration. In hardware scheduling, there is Tomasulo's algorithm [1], detailed in the next sections.

Among the various superscalar processor architectures are the PowerPC, Sun SPARC, IBM Cell Broadband Engine (PlayStation), and Intel x86. In a VLIW instruction set architecture, the instructions are ordered, i.e., scheduled, in packages by the software, i.e., the compiler. This is often used in embedded systems because the architecture requires less complex hardware. There is one example of a VLIW processor not related to the embedded systems area. Intel named it explicitly parallel instruction computing - EPIC: the Itanium [2].

Out-of-Order Execution

Overview
Even in RISC processors, there are instructions with necessarily very different execution times. Examples are an integer add (ADD) and a floating-point division (DIV.D). The problem is that a simple pipeline does not work very well with different instruction execution times. A possible alternative is to allow more than one instruction to be issued, i.e., to enter the execution phase. The simple reasoning behind this idea is to keep the in-order issue, accommodate the different execution times, and finally allow a potential out-of-order instruction completion.

[1] Robert Tomasulo, ACM-IEEE CS Eckert-Mauchly Award, 1997: "For the ingenious Tomasulo's algorithm, which enabled out-of-order execution processors to be implemented."
[2] Itanium 9700 (Kittson), 2017, the last family generation. In January 2019, Intel announced "end of life" production, with shipping until 2021.

Let's consider the following code snippet (Listing 5.1).

Listing 5.1: Out-of-order execution example.
1 DIV.D F0, F2, F4
2 ADD.D F10, F0, F8
3 SUB.D F12, F8, F14

The problem in this sequence (Listing 5.1) is that SUB.D is delayed despite its independence with respect to the other instructions. This is due to a RAW data hazard on F0, which involves the instructions DIV.D and ADD.D. A possible solution is out-of-order execution, e.g., executing SUB.D before ADD.D. Notice that SUB.D depends neither on DIV.D nor on ADD.D.
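The independence claim can be checked mechanically. The sketch below is an illustrative assumption, not part of the chapter: it records the destination and source registers of the three instructions of Listing 5.1 and lists, for each one, the earlier instructions whose results it actually reads (true RAW dependencies); SUB.D comes out with none, so it is safe to execute it ahead of ADD.D.

# Minimal RAW-dependency check for Listing 5.1 (illustrative sketch).
listing_5_1 = [
    ("DIV.D", "F0",  ["F2", "F4"]),
    ("ADD.D", "F10", ["F0", "F8"]),
    ("SUB.D", "F12", ["F8", "F14"]),
]

def raw_dependencies(code):
    """For each instruction, list the earlier instructions whose result it reads."""
    deps = {}
    for i, (op_i, _dest_i, srcs_i) in enumerate(code):
        deps[op_i] = [op_j for j, (op_j, dest_j, _srcs_j) in enumerate(code)
                      if j < i and dest_j in srcs_i]
    return deps

print(raw_dependencies(listing_5_1))
# {'DIV.D': [], 'ADD.D': ['DIV.D'], 'SUB.D': []}  -> SUB.D may be hoisted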
WAR and WAW Dependencies & Out-of-Order Completion

Overview
More hardware in the execution stage - EX is required to accommodate the longer latency of floating-point - FP operations in the pipeline, as illustrated in Fig. 5.5. In this case (shown in Fig. 5.5), the main integer unit handles loads and stores, integer ALU operations, and branches. There is also an FP and integer multiplier unit; an FP adder, which handles FP add, subtract, and conversion; and an FP and integer divider unit. Except for memory-related stalls, for instance, the slowest phase will be the execution stage EX.

Figure 5.5: RISC-V pipeline with different functional units in the execution phase.

In this new context, there are two important modifications. The first is to allow the EX phase to repeat as many times as required until the operation is completed, with the number of repetitions varying among operations. The second is that there are various FP functional units. Thus, a stall can occur should the instruction to be issued cause a structural or a data hazard.

Dynamic Scheduling Problems
Let's consider the code snippet in Listing 5.2.

Listing 5.2: Dynamic scheduling example 1, original code.
1 DIV.D F0, F2, F4
2 ADD.D F6, F0, F8
3 SUB.D F8, F10, F14
4 MUL.D F6, F10, F8

In that example (Listing 5.2), the following dependencies are present: F0, relating instructions 1 and 2, a RAW hazard; F8, instructions 2 and 3, a WAR hazard; F8, instructions 3 and 4, a RAW hazard; and F6, instructions 2 and 4, a WAW hazard. If one modifies the code in Listing 5.2 to change the execution order of the instructions (e.g., i1, i3, i2, i4 from the original code), the resulting code (Listing 5.3) would decrease delays.

Listing 5.3: Dynamic scheduling example 1, changed code.
1 DIV.D F0, F2, F4
2 SUB.D F8, F10, F14
3 ADD.D F6, F0, F8
4 MUL.D F6, F10, F8

However, the code in Listing 5.3 still has problems: for example, the register F8 that ADD.D should read has already been overwritten by SUB.D, so ADD.D would use the wrong value (the reordering violates the original WAR dependency).

False Dependencies Solution
The basic premise of the false-dependencies solution is to allow out-of-order execution as long as the very same result of the in-order execution is preserved. In this sense, a technique called register renaming can be applied. It is used to avoid false WAR and WAW dependencies.

WAW Analysis
Let's consider the original code in Listing 5.4 for a WAW dependency analysis. Here, there is a false WAW dependency involving F1.

Listing 5.4: WAW dependency example, original code.
1 MUL.D F1, F2, F3
2 ADD.D F2, F1, F3
3 SUB.D F1, F4, F5

To fix the false dependency present in Listing 5.4, the register renaming technique replaces F1 with F5 (for example) in instruction 3, i.e., another register not used or read until its next writing point. The resulting code is in Listing 5.5.

Listing 5.5: WAW dependency example, modified code.
1 MUL.D F1, F2, F3
2 ADD.D F2, F1, F3
3 SUB.D F5, F4, F5
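A minimal sketch of the renaming idea applied to Listing 5.4 is shown below. It is an illustration only, not the hardware mechanism: the free-register pool and the choice of F9 are assumptions (the text's example picks F5 instead). The same idea extends to the WAR case of the next subsection by also tracking pending reads.

# Illustrative WAW-removal sketch via register renaming.
def rename_waw(code, free_regs):
    mapping = {}     # architectural register -> renamed register
    written = set()  # destinations already written earlier in the sequence
    out = []
    for op, dest, srcs in code:
        srcs = [mapping.get(r, r) for r in srcs]   # later reads follow the renaming
        if dest in written:                        # a second write: false WAW dependency
            mapping[dest] = free_regs.pop(0)
        new_dest = mapping.get(dest, dest)
        written.add(dest)
        out.append((op, new_dest, srcs))
    return out

listing_5_4 = [
    ("MUL.D", "F1", ["F2", "F3"]),
    ("ADD.D", "F2", ["F1", "F3"]),
    ("SUB.D", "F1", ["F4", "F5"]),   # false WAW on F1 with instruction 1
]

for op, dest, srcs in rename_waw(listing_5_4, free_regs=["F9"]):
    print(op, dest + ",", ", ".join(srcs))
# MUL.D F1, F2, F3
# ADD.D F2, F1, F3
# SUB.D F9, F4, F5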
WAR Analysis
Now, let's consider the code snippet in Listing 5.6 for a WAR dependency verification. In that code, there is a false WAR dependency involving instructions 1 and 3 with respect to register F3, and that false dependency impacts the following instructions if the code remains this way.

Listing 5.6: WAR dependency example, original code.
1 MUL.D F1, F2, F3
2 ADD.D F6, F1, F2
3 SUB.D F3, F4, F5
4 ...
5 ADD.D F9, F4, F3

To fix that false WAR dependency, so that a safe out-of-order execution can happen, register renaming acts by replacing F3 with F8 in instructions 3 and 5, i.e., another register not used or read until its next writing point, as seen in Listing 5.7.

Listing 5.7: WAR dependency example, modified code.
1 MUL.D F1, F2, F3
2 ADD.D F6, F1, F2
3 SUB.D F8, F4, F5
4 ...
5 ADD.D F9, F4, F8

Basic Requirements
The solution's basic requirement is to identify instructions with no dependencies and allow them to pass in front of instructions with dependencies. Another requirement is to identify and block instructions with data or structural dependencies. Finally, the solution has to keep the pipeline as efficient, i.e., busy, as possible.

The solution methods can involve both software and hardware. In software, given a large set of registers, the compiler can eliminate WAR and WAW hazards by renaming registers; eventually, it can use MOV instructions between registers. However, errors or even omissions in the compiler's decisions can lead to poor overall processor efficiency. On the other hand, the hardware cannot observe the instructions ahead, but it can still eliminate dependencies on registers. The scoreboard technique aids in that sense: it allows instructions to execute out of order when there are enough resources and no data dependence, and it also avoids RAW hazards. However, a more advanced technique is Tomasulo's algorithm.

Tomasulo's Algorithm

Definitions
Tomasulo's algorithm is a hardware solution for dynamically scheduling instructions, allowing out-of-order execution by using different functional units, as illustrated in Fig. 5.6. The RISC-V instruction set is used here to describe the ideas of Tomasulo's algorithm.

Figure 5.6: The basic structure of a RISC-V FP unit using Tomasulo's algorithm.

Here, the instructions go from the instruction unit into the instruction queue and are issued in FIFO order from that queue. The reservation stations (3 attached to the FP adders and 2 to the FP multipliers) hold the operation along with its operands and the data needed to detect and resolve possible hazards. The load buffers (5, as illustrated here) are in charge of (i) holding the parts of the effective address until it is calculated, (ii) tracking the unresolved loads waiting on memory, and (iii) holding the results of completed loads waiting on the common data bus - CDB. The store buffers (also 5, as illustrated here) are responsible for (i) holding the parts of the effective address until it is calculated, (ii) holding the destination memory addresses of unresolved stores that are waiting for a data value to store, and (iii) holding the address and value to store until the memory unit is ready to perform the operation. All results from the FP units (adders/multipliers) or the load unit are placed on the CDB, which goes to the FP register file, the reservation stations, and the store buffers (represented by bold lines).

In this algorithm, RAW hazards are avoided because an instruction is executed only when its operands are available. WAR and WAW hazards are handled by the register renaming provided by the reservation stations. The algorithm has three basic steps: issue (dispatch), execute (issue), and write result.

Issue (dispatch). This step is sometimes named dispatch. In this step, the instruction is taken from the head of the instruction queue, which is maintained in FIFO order to keep the defined (correct) data flow.
If a corresponding reservation station has an empty slot, the instruction is issued to that reservation station along with the operands' values, provided they are already in the registers. Otherwise, if there is no empty reservation station slot, there is a structural hazard, and the instruction stalls until a slot or a buffer becomes free. Another point to consider is when the operands are not yet in the registers. In that case, the algorithm has to keep track of the functional units that will produce those operands. This step renames registers, thereby removing possible WAR and WAW hazards.

Execute (issue). This step is sometimes named "issue". In this step, whenever operands are not yet available, the CDB is monitored to obtain them. When an operand becomes ready, it is placed in every reservation station waiting for it. As soon as all operands are available, the operation is executed in the matching functional unit. RAW hazards are prevented by delaying the instruction's execution until its operands are available. More than one instruction may become ready in the same clock cycle for the same functional unit, in which case the unit has to choose one to execute; independent units can start executing in the same cycle.

Loads and stores execute a bit differently, as they require a two-part step. The first part computes the effective address once the base register is available; the effective address then goes to the load or store buffer. From the load buffer, loads execute as soon as the memory unit is free. In the store buffer, stores have to wait for the value to be written to memory. The effective addresses are computed in program order, which keeps loads and stores ordered with respect to each other and prevents hazards. Furthermore, no instruction can start executing until the branch preceding it (in program order) has completed; this preserves the exception behavior.

Write Result. In this step, when a result is available, it is written to the CDB, and from there it goes to the register file and to any reservation stations and buffers waiting for it. Stores remain buffered until both the store address and the store value are available; the result is then written as soon as the memory unit is free.

Key Principles
Although some variations exist, two key principles remain. One is dynamically determining when an instruction is ready to execute, which helps to minimize RAW hazards. The other is renaming registers to avoid unnecessary hazards, i.e., false dependencies, which helps to minimize WAR and WAW hazards.

Reservation Station Details
The reservation station - RS has the following seven fields. The field Op holds the operation to perform on the source operands S1 and S2. The fields Qj and Qk hold the reservation stations that will produce the corresponding source operands; a value of zero (or an empty value) in these fields indicates that the source operand is already available in Vj or Vk, or is not needed. The fields Vj and Vk hold the values of the source operands when they are already available; only one of the V or Q fields is valid for each operand. For loads, the field Vk is used to hold the offset. The field A holds the information for the memory-address calculation of a load or store: initially, the instruction's immediate field is held here; after the address calculation, the effective address is held. Finally, the field Busy indicates that the reservation station and its accompanying functional unit are occupied.
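As a concrete data-structure view, the sketch below models one reservation-station entry with exactly these seven fields. The class, its defaults, and the ready() helper are assumptions made for illustration; only the field names come from the text.

# Illustrative model of a reservation-station entry (field names per the text).
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str                      # e.g., "Add1" or "Mult2"; also used as a renaming tag
    busy: bool = False             # Busy: station and functional unit occupied
    op: Optional[str] = None       # Op: operation to perform on S1 and S2
    vj: Optional[float] = None     # Vj: value of source operand S1, when available
    vk: Optional[float] = None     # Vk: value of source operand S2 (or load offset)
    qj: Optional[str] = None       # Qj: station that will produce S1 (None = available)
    qk: Optional[str] = None       # Qk: station that will produce S2 (None = available)
    a: Optional[int] = None        # A: immediate, then effective address (loads/stores)

    def ready(self) -> bool:
        # The operation may execute once both operands are available.
        return self.busy and self.qj is None and self.qk is None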
Register File Details
The field Qi is related to the register file. It indicates the number of the reservation station that contains the operation whose result should be stored into this register. If the value of Qi is blank (or 0), no currently active instruction is computing a result destined for this register, i.e., the value is simply the register content.

Simple Example
Let's consider the following RISC-V code snippet, shown in Listing 5.8.

Listing 5.8: RISC-V code example running according to Tomasulo's algorithm.
1 fld f6, 32(x2)
2 fld f2, 44(x3)
3 fmul.d f0, f2, f4
4 fsub.d f8, f2, f6
5 fdiv.d f0, f0, f6
6 fadd.d f6, f8, f2

In a first assumed scenario, all instructions have been issued, but only the first load (fld) has completed and written its result to the CDB. This scenario is illustrated in detail by Fig. 5.7.

Figure 5.7: Tomasulo example showing the instruction status, the reservation station fields, and the register file status.

Notice that all instructions have been issued, as the issue checkmark is present in all code lines. As mentioned, only the two loads (fld) have completed their execution (also marked with a checkmark), but only the first has had its result written. In the reservation stations part, the load buffer Load1 is no longer busy, which reflects this status. On the other hand, buffer Load2 is still handling a load; there, the field A holds the information to compute the effective address. The reservation station Add1 is handling the subtraction between f2 and f6. Register f2 does not have a value yet, as it is waiting for station Load2 (Qj = Load2) to produce it. Conversely, the contents of register f6 are already available, as the first load wrote its result (Vk = Mem[32+Regs[x2]]). Reservation station Add2 is taking care of the addition of f8 and f2, whose operands depend on Qj = Add1 and Qk = Load2, respectively. Station Mult1 is handling the multiplication involving f2 and f4: station Load2 (Qj = Load2) will provide the value of f2, and f4 is already available (Vk = Regs[f4]). Finally, station Mult2 holds the division involving registers f0 and f6: f0 is pending on station Mult1 (Qj = Mult1), and f6 is available (Vk = Mem[32+Regs[x2]]). The field Qi in the register file part indicates which reservation station is computing which register value. For example, register f0 will come out of reservation station Mult1, and so on.

Now, Fig. 5.8 illustrates a different scenario, in which all instructions have finished except the multiply and the divide; the multiplication, however, is ready to write its result.

Figure 5.8: Tomasulo tables overview with a different scenario where all instructions are finished, except multiplication and division.

Now the multiplication has been executed, as its operands' values were available (as shown by Vj and Vk), and it is ready to write its value to register f0. The division is still pending on Qj = Mult1 for the f0 value.
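The issue-time bookkeeping behind these tables can be sketched in a few lines. The snippet below is a simplified, hypothetical illustration, not the processor's implementation: it performs only the issue step for Listing 5.8, it lumps both operands of a station into single V and Q lists instead of the Vj/Vk and Qj/Qk pairs, and it ignores the load-address handling. Each source either copies a ready register value or records the producing station, and the destination's register-status entry is overwritten with the issuing station's name, which is the renaming that removes the WAR and WAW hazards. A later CDB write-back from Load1 would replace the waiting Load1 tags with the loaded value, giving the state shown in Fig. 5.7.

# Simplified issue-step sketch for Listing 5.8 (hypothetical data structures).
code = [                          # (issuing station, op, dest, FP sources)
    ("Load1", "fld",    "f6", []),
    ("Load2", "fld",    "f2", []),
    ("Mult1", "fmul.d", "f0", ["f2", "f4"]),
    ("Add1",  "fsub.d", "f8", ["f2", "f6"]),
    ("Mult2", "fdiv.d", "f0", ["f0", "f6"]),
    ("Add2",  "fadd.d", "f6", ["f8", "f2"]),
]

qi = {r: None for r in ["f0", "f2", "f4", "f6", "f8"]}   # register-status table
stations = {}

for rs, op, dest, srcs in code:          # in-order (FIFO) issue
    entry = {"op": op, "V": [], "Q": []}
    for src in srcs:
        if qi[src] is None:              # operand already in the register file
            entry["V"].append(f"Regs[{src}]")
        else:                            # wait on the station that will produce it
            entry["Q"].append(qi[src])
    stations[rs] = entry
    qi[dest] = rs                        # renaming: dest now comes from this station

print(stations["Add1"])   # {'op': 'fsub.d', 'V': [], 'Q': ['Load2', 'Load1']}
print(stations["Mult2"])  # {'op': 'fdiv.d', 'V': [], 'Q': ['Mult1', 'Load1']}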
Loop-Based Example
Now, let's consider the following code (Listing 5.9), which multiplies the elements of an array by the scalar held in register f2.

Listing 5.9: RISC-V loop-based example running according to Tomasulo's algorithm.
1 Loop: fld f0, 0(x1)
2 fmul.d f4, f0, f2
3 fsd f4, 0(x1)
4 addi x1, x1, -8    // x1 <- x1 - 8
5 bne x1, x2, Loop   // branches if x1 != x2

By eliminating all WAW and WAR hazards through the dynamic renaming of registers, and by assuming, i.e., predicting, that the branches are taken, multiple loop iterations can execute at once, thanks to the reservation stations. The loop is unrolled dynamically by the hardware, with the reservation stations and the renamed registers acting as additional registers. Fig. 5.9 illustrates the case of two active loop iterations with no instruction completed yet.

Figure 5.9: Tomasulo loop-based example showing a snapshot status.

Notice that, as the branch is predicted taken, two loop iterations have been issued, and the loads have been executed, since the effective address can be computed, as seen in the field A of load buffers Load1 and Load2. The multiplications are waiting on Qj = Load1 and Qj = Load2, respectively, and register f2 is available (Vk = Regs[f2]). The stores are pending on stations Qk = Mult1 and Qk = Mult2. As the register renaming acted on x1, the corresponding values are available in the field Vj. However, the "write result" step really depends on the verification of the branch outcome.

Summary
Tomasulo's algorithm allows out-of-order completion with in-order issue. It also reduces the delays caused by differences in execution time among instructions, e.g., integer and FP. This hardware algorithm eliminates WAR and WAW hazards by renaming registers, thanks to its reservation stations, and it blocks instructions affected by RAW hazards until their operands are ready. Finally, it allows loop unrolling even without speculative execution. Speculative execution is the subject of the next chapters.

Algorithm's Reference Details
Fig. 5.10 presents the detailed operations in each step of Tomasulo's algorithm.
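To complement Fig. 5.10, the sketch below shows the write-result step in the same simplified style used above. It is an illustrative assumption, not the book's reference pseudocode: a finishing station broadcasts its tag and value on the CDB, every reservation station waiting on that tag captures the value, and any register whose status entry still names that station is updated and un-renamed.

# Illustrative write-result (CDB broadcast) step with hypothetical structures.
def cdb_broadcast(tag, value, stations, qi, regs):
    """Broadcast (tag, value); update waiting stations and the register file."""
    for entry in stations.values():
        for operand in ("j", "k"):
            if entry.get("Q" + operand) == tag:   # operand was waiting on this tag
                entry["V" + operand] = value
                entry["Q" + operand] = None
    for reg, producer in list(qi.items()):
        if producer == tag:                       # register still renamed to this tag
            regs[reg] = value
            qi[reg] = None

# Hypothetical snapshot: instructions 1-4 of Listing 5.8 issued, Load1 finishing.
stations = {
    "Mult1": {"op": "fmul.d", "Qj": "Load2", "Vj": None, "Qk": None, "Vk": "Regs[f4]"},
    "Add1":  {"op": "fsub.d", "Qj": "Load2", "Vj": None, "Qk": "Load1", "Vk": None},
}
qi = {"f6": "Load1", "f2": "Load2", "f0": "Mult1", "f8": "Add1"}
regs = {}

cdb_broadcast("Load1", "Mem[32+Regs[x2]]", stations, qi, regs)
print(stations["Add1"])   # Qk cleared; Vk now holds the loaded value
print(qi["f6"], regs)     # f6 is no longer renamed; its value is in the register file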