Chapter 5
Superscalar Processors

Advances in Instruction-Level Parallelism Exploration

Hardware and Software

Basically, there are two largely distinct approaches to exploiting ILP: hardware-based and software-based. The first approach discovers and exploits parallelism in a dynamic fashion, i.e., at runtime. The second aims to find parallelism in a static fashion at compile time, i.e., design time. Aggressive compiler-based proposals have been tried numerous times since the '80s. One of the most famous attempts was related to the Intel Itanium series, in 1999. Despite the huge efforts, such approaches have been successful only in domain-specific environments.

The pipelining approach has a theoretical barrier, i.e., a clock cycles per instruction - CPI value equal to 1, as seen in Chapter 3.

Pipelining Limits

When considering the pipelining limits, the CPI value for a pipelined processor is the sum of the base CPI and all contributions from stalls, as described in Eq. (5.1).

CPI_{Pipeline} = CPI_{IdealPipeline} + Stalls_{Structural} + Stalls_{DataHazard} + Stalls_{Control}    (5.1)

The ideal, or theoretical, pipeline CPI is approximately 1. In this sense, the actual pipeline CPI is greater than 1. And now comes an interesting question: how could the processor be changed to get a CPI less than 1?

How to Advance in ILP Exploration

The advances in ILP consider techniques such as pipelining, superpipelining, and multiple issue. The last covers superscalar processors and also the very long instruction word - VLIW approach.

Pipeline. As seen in the previous chapters, the pipeline technique divides instructions into phases, or stages, and overlaps these stages to improve performance. This is completely transparent at the software level, as the hardware takes care of it. Fig. 5.1 illustrates a 5-stage pipeline.

Figure 5.1: 5-stage pipeline.

Superpipeline. The superpipeline approach exploits the fact that various pipeline stages require less than half a clock cycle to complete their operations. Thus, a doubled internal clock frequency enables two instances of the same stage to run within one external clock cycle, i.e., two of the same pipeline stage per clock cycle. Each pipeline stage function can be divided into two non-overlapping parts (red box in Fig. 5.2), each executed in half a clock cycle, as illustrated in Fig. 5.2. This is named superpipeline level 2, i.e., a deeper pipeline.

Figure 5.2: Superpipeline level 2. Two of the same pipeline stage per clock cycle: in the first clock cycle, two instructions can be fetched.

Superscalar. In the superscalar processor, there are multiple issues. In other words, the processor issues multiple scalar instructions per cycle. The processor is then able to execute independent instructions in parallel in different pipelines, as illustrated in Fig. 5.3. In that case, the processor has multiple functional units. To be really effective, out-of-order instruction execution has to be allowed and supported. And considering out-of-order execution, the processor has to handle "new" problems such as structural hazards, e.g., with respect to registers, and data hazards, i.e., WAW and WAR.

Figure 5.3: Superscalar processor with functional units in parallel.

Modern processors use just one pipeline, but with multiple functional units, e.g., two ALUs, one LOAD unit, one STORE unit, and one floating point - FP unit.
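To make the multiple-issue idea concrete, below is a minimal Python sketch of a 2-way superscalar issue policy: two consecutive instructions go out in the same cycle only when the second has no RAW, WAR, or WAW dependence on the first. This is an illustration only, not how real issue logic is built in hardware, and the (mnemonic, dest, src1, src2) instruction format is an assumption made for the example.

```python
# Minimal sketch: greedy dual-issue grouping for a 2-way superscalar.
# An instruction is (mnemonic, dest, src1, src2); registers are strings.

def depends(first, second):
    """True if 'second' has a RAW, WAR, or WAW dependence on 'first'."""
    _, d1, s1a, s1b = first
    _, d2, s2a, s2b = second
    raw = d1 in (s2a, s2b)   # second reads what first writes
    war = d2 in (s1a, s1b)   # second writes what first reads
    waw = d1 == d2           # both write the same register
    return raw or war or waw

def dual_issue_groups(program):
    """Greedily pack up to two independent consecutive instructions per cycle."""
    groups, i = [], 0
    while i < len(program):
        if i + 1 < len(program) and not depends(program[i], program[i + 1]):
            groups.append([program[i], program[i + 1]])  # issue the pair together
            i += 2
        else:
            groups.append([program[i]])                  # issue alone
            i += 1
    return groups

program = [
    ("DIV.D", "F0", "F2", "F4"),
    ("ADD.D", "F10", "F0", "F8"),   # RAW on F0 -> cannot pair with DIV.D
    ("SUB.D", "F12", "F8", "F14"),  # independent of ADD.D -> dual-issued with it
]
for cycle, group in enumerate(dual_issue_groups(program)):
    print(cycle, [ins[0] for ins in group])
```

Running the sketch issues DIV.D alone and then pairs ADD.D with SUB.D, showing how independent instructions fill the extra issue slot.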
The superscalar processor makes sense if the instruction issue frequency is much greater than the execution frequency:

Frequency_{InstructionIssue} >> Frequency_{Execution}

Very Long Instruction Word - VLIW. The VLIW approach encodes various different instructions in one "long" word including several opcodes and their operands. In the VLIW context, instructions are packed together; modern processors have an "end of group" bit in the packing. This bit identifies a group of instructions comprising a package. Packing is handled at design time by the compiler. The processor then fetches the package and issues it entirely at once. A VLIW processor also has multiple functional units, as illustrated in Fig. 5.4.

Figure 5.4: Very long instruction word, encoding more than one instruction.

The Quest for CPI < 1

The quest for a CPI less than 1 includes multiple-issue processors such as the superscalar and VLIW ones. In a superscalar processor, the instructions can be ordered, or scheduled, by either software or hardware. Software scheduling relies on modern compiler optimizations. Register optimizations can increase contention by maximizing register usage. In this case, there is a static schedule based on knowledge of the dependency-checking mechanisms. Static scheduling also has to take branch prediction into consideration. In hardware scheduling, there is Tomasulo's algorithm¹, detailed in the next sections. Among the various superscalar processor architectures are the PowerPC, Sun SPARC, IBM Cell Broadband Engine (PlayStation 3), and Intel x86.

In a VLIW instruction set architecture processor, the instructions are ordered, i.e., scheduled, in packages by the software, i.e., the compiler. This is often used in embedded systems because the architecture requires less complex hardware. There is one example of a VLIW processor not related to the embedded systems area. Intel named it explicitly parallel instruction computing - EPIC: the Itanium².

Out-of-Order Execution

Overview

Even in RISC processors, there are instructions with necessarily very different execution times. Examples are an integer add (ADD) and a floating-point division (DIV.D). The problem is that a simple pipeline does not work very well with different instruction execution times. A possible alternative is to allow more than one instruction to be issued, i.e., to enter the execution phase. The simple reasoning behind this idea is to keep the in-order issue, accommodate different execution times, and finally have a potential out-of-order instruction completion.

¹ Robert Tomasulo, ACM-IEEE CS Eckert-Mauchly Award, 1997: "For the ingenious Tomasulo's algorithm, which enabled out-of-order execution processors to be implemented."
² Itanium 9700 (Kittson), 2017, the last family generation. In January 2019, Intel announced "end of life" production, with shipping until 2021.

Let's consider the following code snippet (Listing 5.1).

Listing 5.1: Out-of-order execution example.
1 DIV.D F0, F2, F4
2 ADD.D F10, F0, F8
3 SUB.D F12, F8, F14

The problem in this sequence (Listing 5.1) is that SUB.D is delayed despite its independence with respect to the other instructions. This is due to a RAW data hazard on F0, which involves the instructions DIV.D and ADD.D. A possible solution is out-of-order execution, e.g., executing SUB.D before ADD.D. Notice that SUB.D depends neither on DIV.D nor on ADD.D.
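The hazard reasoning above can be mechanized. Below is a minimal Python sketch, an illustration rather than production dependence analysis, that reports which hazard (RAW, WAR, or WAW) each register creates between every pair of instructions; applied to Listing 5.1 it flags only the RAW on F0.

```python
# Minimal sketch: classify register hazards between instruction pairs.
# Each instruction is (mnemonic, dest, src1, src2), as in Listing 5.1.
from itertools import combinations

def hazards(earlier, later):
    """Return the set of hazards 'later' has with respect to 'earlier'."""
    found = set()
    _, d1, *srcs1 = earlier
    _, d2, *srcs2 = later
    if d1 in srcs2:
        found.add(("RAW", d1))   # later reads a register earlier writes
    if d2 in srcs1:
        found.add(("WAR", d2))   # later writes a register earlier reads
    if d1 == d2:
        found.add(("WAW", d1))   # both write the same register
    return found

listing_5_1 = [
    ("DIV.D", "F0", "F2", "F4"),
    ("ADD.D", "F10", "F0", "F8"),
    ("SUB.D", "F12", "F8", "F14"),
]
for (i, a), (j, b) in combinations(enumerate(listing_5_1, start=1), 2):
    for kind, reg in hazards(a, b):
        print(f"instructions {i} and {j}: {kind} hazard on {reg}")
# Only "instructions 1 and 2: RAW hazard on F0" is printed, so SUB.D
# is free to execute out of order, ahead of the stalled ADD.D.
```

The same check, applied to Listing 5.2 in the next section, reproduces the four hazards enumerated there.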
WAR and WAW Dependencies & Out-of-Order Completion

Overview

More hardware in the execution stage - EX is required to cope with the longer latency of floating point - FP operations in the pipeline, as illustrated in Fig. 5.5. In this case (shown in Fig. 5.5), the main integer unit handles loads and stores, integer ALU operations, and branches. There is also an FP and integer multiplier unit; an FP adder, which handles FP add, subtract, and conversion; and an FP and integer divider unit. Except for stalls related to memory problems, for instance, the slowest phase will be the execution stage EX.

Figure 5.5: RISC-V pipeline with different functional units in the execution phase.

In this new context, there are two important modifications. The first is to allow the EX phase to repeat as many times as required until the operation is completed, with the repetition count varying across operations. The second is that there are various FP functional units. Thus, a stall can occur if the instruction to be issued would cause a structural or a data hazard.

Dynamic Scheduling Problems

Let's consider the code snippet in Listing 5.2.

Listing 5.2: Dynamic scheduling example 1, original code.
1 DIV.D F0, F2, F4
2 ADD.D F6, F0, F8
3 SUB.D F8, F10, F14
4 MUL.D F6, F10, F8

In that example (Listing 5.2), the following dependencies are present: F0, instructions 1 and 2, a RAW hazard; F8, instructions 2 and 3, a WAR hazard; F8, instructions 3 and 4, a RAW hazard; and F6, instructions 2 and 4, a WAW hazard.

If one modifies the code in Listing 5.2 to change the execution order of the instructions (e.g., i1, i3, i2, i4 from the original code), the resulting code (Listing 5.3) would decrease delays.

Listing 5.3: Dynamic scheduling example 1, changed code.
1 DIV.D F0, F2, F4
2 SUB.D F8, F10, F14
3 ADD.D F6, F0, F8
4 MUL.D F6, F10, F8

However, the code in Listing 5.3 still has problems: the register F8 read by ADD.D has already been overwritten by SUB.D, i.e., the reordering violates the original WAR dependency on F8, so ADD.D reads the wrong value.

False Dependencies Solution

The basic premise for solving false dependencies is to allow out-of-order execution as long as the very same result of in-order execution is preserved. In this sense, a technique called register renaming can be applied. It is used to avoid false WAR and WAW dependencies.

WAW Analysis

Let's consider the original code in Listing 5.4 for a WAW dependency analysis. Here, there is a false WAW dependency involving F1.

Listing 5.4: WAW dependency example, original code.
1 MUL.D F1, F2, F3
2 ADD.D F2, F1, F3
3 SUB.D F1, F4, F5

To fix the false dependency present in Listing 5.4, the register renaming technique replaces F1 with F5 (for example) in instruction 3, i.e., another register not used or read until its next writing point. The resulting code is in Listing 5.5.

Listing 5.5: WAW dependency example, modified code.
1 MUL.D F1, F2, F3
2 ADD.D F2, F1, F3
3 SUB.D F5, F4, F5
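Register renaming can be sketched in a few lines of Python. The sketch below is illustrative only: real hardware renames to a pool of hidden physical registers, while here the fresh names ("P1", "P2", ...) and the free-list handling are simplifying assumptions. Every destination gets a fresh name, and later reads are redirected to the most recent name of each source.

```python
# Minimal sketch of register renaming: every instruction's destination
# gets a fresh name from a free list, and later reads are redirected.
def rename(program, free_regs):
    """program: list of (mnemonic, dest, src1, src2); free_regs: fresh names."""
    current = {}      # architectural register -> current (renamed) name
    renamed, free = [], list(free_regs)
    for op, dest, s1, s2 in program:
        # Reads use the most recent name of each source register.
        s1, s2 = current.get(s1, s1), current.get(s2, s2)
        new_dest = free.pop(0)            # fresh destination removes WAW/WAR
        current[dest] = new_dest
        renamed.append((op, new_dest, s1, s2))
    return renamed

listing_5_4 = [
    ("MUL.D", "F1", "F2", "F3"),
    ("ADD.D", "F2", "F1", "F3"),
    ("SUB.D", "F1", "F4", "F5"),
]
for ins in rename(listing_5_4, ["P1", "P2", "P3"]):
    print(ins)
# MUL.D P1, F2, F3
# ADD.D P2, P1, F3   <- reads the renamed F1
# SUB.D P3, F4, F5   <- the false WAW on F1 is gone
```

All WAW and WAR hazards disappear while every true RAW dependence is preserved, which is exactly the in-order result the technique must keep.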
WAR Analysis

Now, let's consider the code snippet in Listing 5.6 for a WAR dependency verification. In that code, there is a false WAR dependency involving instructions 1 and 3 with respect to the register F3. And that false dependency impacts the subsequent instructions if the code remains this way.

Listing 5.6: WAR dependency example, original code.
1 MUL.D F1, F2, F3
2 ADD.D F6, F1, F2
3 SUB.D F3, F4, F5
4 ...
5 ADD.D F9, F4, F3

To fix that false WAR dependency, so that a safe out-of-order execution can happen, register renaming replaces F3 with F8 in instructions 3 and 5, i.e., another register not used or read until its next writing point, as seen in Listing 5.7.

Listing 5.7: WAR dependency example, modified code.
1 MUL.D F1, F2, F3
2 ADD.D F6, F1, F2
3 SUB.D F8, F4, F5
4 ...
5 ADD.D F9, F4, F8

Basic Requirements

The solution's basic requirement is to identify instructions with no dependencies and allow them to pass in front of instructions with dependencies. Another requirement is to identify and block instructions with data or structural dependencies. Finally, the solution has to keep the pipeline as efficient, i.e., as busy, as possible.

The solution methods can involve both software and hardware. In software, given a large set of registers, the compiler can eliminate WAR and WAW hazards by renaming registers. It can also use MOV instructions between registers. However, errors or even omissions in the compiler's decisions can lead to poor overall processor efficiency. On the other hand, the hardware cannot observe the instruction stream as far ahead as the compiler can, but it can still eliminate dependencies on registers. The scoreboard technique aids in that sense. It is a technique that allows instructions to execute out of order when there are enough resources and no data dependence; it also avoids RAW hazards by making dependent instructions wait for their operands. However, a more advanced technique is Tomasulo's algorithm.

Tomasulo's Algorithm

Definitions

Tomasulo's algorithm is a hardware solution for dynamically scheduling instructions, allowing out-of-order execution by using different functional units, as illustrated in Fig. 5.6. The RISC-V instruction set is used here to describe the ideas behind Tomasulo's algorithm.
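As a bridge to that discussion, the scoreboard-style issue check described under Basic Requirements can be sketched as follows. This is a simplified illustration under assumed structures, not Tomasulo's actual mechanism (which, as the next sections show, adds register renaming) and not a full CDC 6600-style scoreboard.

```python
# Minimal sketch of a scoreboard-style issue check: an instruction may
# issue only if its functional unit is free (no structural hazard) and
# no in-flight instruction writes the same destination (no WAW hazard).
class Scoreboard:
    def __init__(self, units):
        self.free_units = dict(units)   # e.g., {"FPdiv": 1, "FPadd": 1}
        self.pending_writes = set()     # destination registers in flight

    def can_issue(self, unit, dest):
        return self.free_units.get(unit, 0) > 0 and dest not in self.pending_writes

    def issue(self, unit, dest):
        self.free_units[unit] -= 1
        self.pending_writes.add(dest)

    def ready(self, srcs):
        """RAW check at the read-operands step: are all sources produced?"""
        return not any(s in self.pending_writes for s in srcs)

    def complete(self, unit, dest):
        self.free_units[unit] += 1
        self.pending_writes.discard(dest)

sb = Scoreboard({"FPdiv": 1, "FPadd": 1})
sb.issue("FPdiv", "F0")                  # DIV.D F0, F2, F4 issues
print(sb.can_issue("FPadd", "F10"))      # True: ADD.D F10, F0, F8 may issue...
print(sb.ready(["F0", "F8"]))            # False: ...but must wait for F0 (RAW)
```

Because a plain scoreboard blocks on false WAW and WAR dependencies instead of renaming them away, Tomasulo's algorithm goes further by renaming registers in hardware.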