Advanced Pipelining PDF

Advanced Pipelining 1 EXPLOITING ILP TECHNIQUES S I R A A STOUR , PHD, ENG. Preview 2  Pipeline has three types of hazards  Structural  Data  Control  Solutions-most common  Add HW (More resources - forwarding)  Stalls - (NOP)  Code Reordering - Code Scheduling  In standard MIPS  Load-use data hazard Problem!!  Branches problem. Dependences & Hazards A General Overview 3  Three different types of dependences:  Data Dependences (also called true data dependences)  If two instructions are data dependent, they must execute in order and cannot execute simultaneously or be completely overlapped.  Name Dependences  Control Dependences 1- Data Dependent 4  An instruction j is data dependent on instruction i if either of the following holds:  Instruction i produces a result that may be used by instruction j  Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i. (chain of dependencies)  A data value may flow between instructions either through:  Registers  The dependence is straightforward since the register names are fixed in instructions, or through,  Memory Locations  Dataflow through memories are difficult to detect:  100(R4) and 20(R6) may be identical memory addresses! 2- Name Dependent 5  Occurs when two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name.  Two types of Name Dependences  Anti-dependence  Instruction j writes a register or memory location that instruction i reads  Output dependence  Instruction i and instruction j write the same register or memory location-ordering between the instructions must be preserved.  Is not a true dependence  Solution: Register Renaming  Done either statically by a compiler, or dynamically by the hardware Dependences: An Example 6 Loop: L.D F0,0(R1) ; #F0 = array element ADD.D F4,F0,F2 ; # add scalar in F2 S.D F4,0(R1) ; # store result DADDUI R1,R1,#-8 ; # decrement pointer 8 bytes BNE R1,R2,Loop ; # branch R1!=R2 1- Data Dependences Loop: L.D F0,0(R1) ; # F0=array element ADD.D F4,F0,F2 ; # add scalar in F2 S.D F4,0(R1) ; # store result DADDUI R1,R1,#-8 ; # decrement pointer 8 bytes BNE R1,R2,Loop ; # branch R1!=R2 Dependences: An Example 7 2- Name Dependencies Anti-dependence Loop: L.D F0,0(R1) ; # F0=array element ADD.D F4,F0,F2 ; # add scalar in F2 S.D F4,0(R1) ; # store result DADDUI R1, R1,#-8 ; # decrement pointer 8 bytes BNE R1, R2, Loop ; # branch R1!=R2 Solution: Register Renaming -> change R1 register by another one like R4 Loop: L.D F0,0(R1) ; # F0=array element ADD.D F4,F0,F2 ; # add scalar in F2 S.D F4,0(R1) ; # store result DADDUI R4, R1,#-8 ; # decrement pointer 8 bytes BNE R4, R2, Loop ; # branch R1!=R2 Data Hazards 8  A Hazard: prevent the next instruction in the instruction stream from executing during its designated clock cycle.  It exists whenever there is a name or data dependence between close enough instructions.  May be classified as one of three types  RAW (Read after Write) — j tries to read a source before i writes it,  Is the most common type and corresponds to a true data dependence Data Hazards 9  WAW (Write after Write)— j tries to write an operand before it is written by i.  Writesend up being performed in the wrong order and corresponds to an output dependence  WAW hazards are present only in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled.  WAR (Write after Read)—j tries to write a destination before it is read by i, so i incorrectly gets the new value.  This hazard arises from an anti-dependence. WAR hazards cannot occur in most static issue pipelines:  All reads are early (in ID) and all writes are late (in WB).  Note RAR (Read after Read) case is not a hazard. 3- Control Dependent 10  A control dependence determines the ordering of an instruction, i, with respect to a branch instruction so that the instruction i is executed in correct program order and only when it should be.  In general, two constraints are imposed by control dependences: 1. An instruction that is control dependent on a branch cannot be moved before the branch because its execution is no longer controlled by the branch 2. An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch. Control Dependent & Reordering 11  Preserving strict program order ensure that control dependences are also preserved  We may be willing to violating the control dependences by executing instructions that should not have been executed, if we can do so without affecting the correctness of the program.  Two points critical to program correctness and preserved by maintaining both data and control dependences are:  Exception Behavior  Reordering of instruction execution must not cause any new exceptions in the program  Data Flow  Branches make the data flow dynamic  It is insufficient to just maintain data dependences; Program Order is what determines which predecessor will actually deliver a data value to an instruction Examples: Program Correctness 12 Example 1: Exceptions R2 is Data dependent -> changing DADDU R2,R3,R4 the ordering may change the results!! BEQZ R2,L1 Lw and Branch cannot be LW R1, 0(R2) interchanged -> Control dependent L1: => memory protection exception Solution: HW Speculation Example 2: Data flow DADDU R1,R2,R3 Preserving data dependent by preserving BEQZ R4,L only the Sequence of DADDU DSUBU R1,R5,R6 DSUBU L:... is not sufficient!! (by moving them before OR R7,R1,R8 or after the branch) Control dependent also needed Solution: Speculation Examples: Program Correctness 13 Example 3: Do the Reordering  Determine that violating the control dependence cannot affect either the exception behavior or the data flow DADDU R1,R2,R3 BEQZ R12, skip DSUBU R4,R5,R6 DADDU R5,R4,R9 skip: OR R7,R8,R9  Suppose we knew that register (R4) was unused after the instruction labeled skip (dead). we could move the DSUBU instruction before the branch, since the data flow cannot be affected by this change. * Liveness: a value will be used by an upcoming instruction Examples: Program Correctness 14  If the branch is taken, the DSUBU instruction will execute and will be useless, but it will not affect the program results.  This type of code scheduling is also a form of Speculation, called Software Speculation, since the compiler is betting on the branch outcome;  In this case, the bet is that the branch is usually not taken. Parallelism via Instructions 15  Pipelining exploits the potential parallelism among instructions. This parallelism is called instruction-level parallelism (ILP).  Two approach to increase the ILP  Increasing the depth of the pipeline to overlap more instructions  More hazards usually -> needs to use scheduling techniques  Multiple issuing  Replicating the internal components so the computer can launch multiple instructions in every pipeline stage.  Launching multiple instructions per stage allows the execution rate to exceed the clock rate => CPI < 1. Multiple Issue Processors 16 In multiple issue pipeline responsibilities must be taken of: 1. Issue Slots: the processor to determines how many and which instructions can be issued in a CCT?  In Static issuing and VLIW the layout of simultaneously issuing instructions is restricted to simplify the decoding and instruction issue. 2. Data and Control Hazards 3. Instructions Issue Policy  In-Order Issue with In-Order Completion  In Order Issue with Out-of-Order Completion  Out-of-Order issue with Out-of-Order Completion Multiple Issue Processors 17 Two kinds of Multiple Issuing Static and Dynamic 1. Static Multiple Issue  Many decisions are made by the compiler before the execution.  Compiler is used to assist with packaging instructions and handling hazards.  Issue Packet: The set of instructions that issues together in 1 clock cycle;  The packet may be determined statically by the compiler.  The issue packet can be viewed as one large instruction with multiple operands:  VLIW: Very Long Instructions Word Processors  The Intel IA-64 Architecture (EPIC): Explicitly Parallel Instruction Computer  Itanuim, Itanuim-2 processors Multiple Issue Processors 18 2. Dynamic Multiple Issue  Many decisions are made during execution (on run-time) by the processor.  Issue Packet: The packet may be determined dynamically by the processor.  Superscalar Processors  Dynamic pipeline scheduling  Hardware support to instructions reordering to avoid hazard and stalls. Instruction Issue Policy 19  instruction issue: refer to the process of initiating instruction execution in the processor’s functional units.  Instruction issue occurs when instruction moves from the decode stage of the pipeline to the first execute stage of the pipeline.  instruction issue policy: refer to the protocol used to issue instructions.  Three types of orderings are important in this regard:  The order in which instructions are fetched.  The order in which instructions are executed.  The order in which instructions update the contents of register and memory locations. Instruction Issue Policy 20  We can group superscalar instruction issue policies into these categories:  In-order issue with in-order completion.  In-order issue with out-of-order completion.  Out-of-order issue with out-of-order completion. Speculation 21  An approach based on prediction, whereby the compiler or processor guesses the outcome of an instruction to remove it as a dependence in executing other instructions.  Compiler or processor guesses about the properties of an instruction so as to enable execution to begin for other instructions that may depend on the speculated instruction.  Speculating about the outcome of a branch so the instructions after the branch could be executed earlier  Speculating that a store before a load doesn't refer to same address. So we could execute load before the store.  Speculation Difficulty?  May be wrong  Need a checking mechanism and ability to unroll or back-out the effects of the instructions that been executed speculatively.  It may introduce exceptions. Speculation Types 22  Two Types:  Software Speculation  Compiler can use speculation to reorder instructions  For recovery from incorrect speculation: Compiler inserts additional instructions that check the accuracy of the speculation and provide a fix-up routine to use  Hardware Speculation  The processor hardware can perform the same transformation at runtime using different techniques.  For incorrect speculation, the hardware buffers the speculative results until it knows they are no longer speculative.  If speculation correct the instructions are completed by allowing the contents of the buffers to be written to the registers or memory  If speculation was wrong, the hardware flushes the buffers and re- executes the correct instruction sequence. Hardware Speculation 23  Hardware-based speculation combines three key ideas: (1) Dynamic Branch Prediction: to choose which instructions to execute (2) Speculation: to allow the execution of instructions before the control dependences are resolved (with the ability to undo the effects of an incorrectly speculated sequence) (3) Dynamic Scheduling: to deal with the scheduling of different combinations of basic blocks.  Techniques used  In addition to Dynamic scheduling techniques (seen later )  ReorderBuffer (ROB)  Commit Stage An Example: Static Multiple Issue with the MIPS ISA 24  Consider a simple static two-issue MIPS processor,  Where one of the instructions can be an integer ALU operation or branch and the other can be a load or store.  Issuing two instructions per cycle will require fetching and decoding 64 bits of instructions.  Many static multiple-issue processors,(essentially all VLIW processors) the layout of simultaneously issuing instructions is restricted to simplify the decoding and instruction issue.  will require that the instructions be paired and aligned on a 64-bit boundary, with the ALU or branch portion appearing first.  Instructions always issue in pairs  if one instruction of the pair cannot be used, we require that it be replaced with a nop. An Example: Static Multiple Issue with the MIPS ISA 25  Static ideal two issue processors at execution An Example: Static Multiple Issue with the MIPS ISA 26  Additional hardware is needed for issuing an ALU and a data transfer operation in parallel to get rid of structural hazards:  Extra ports in the register file.  Inone clock cycle we may need to read 2 registers for the ALU operation and 2 more for a store, and also one write port for an ALU operation and one write port for a load.  Additional Adder  Sincethe ALU is tied up for the ALU operation, we also need a separate adder to calculate the effective address for data transfers.  Best performance can be achieved?? Twice the original  This requires that twice as many instructions be overlapped in execution.  Increasing data & control hazards Static Multiple Issue 27  Remember Load-Use problem in one issue processors  Load needs one stall (e.g. use latency of 1 Clock Cycle)  Use latency: is the number of clock cycles between a load instruction and an instruction that can use the result of the load without stalling the pipeline.  In two issue processors use latency for loads is also 1 clock cycle  This means that the next two instructions cannot use the load result without stalling. Loop Unrolling 28  Example: schedule the loop below on a static two-issue pipeline for MIPS? Reorder the instructions to avoid as many pipeline stalls as possible. Assume branches are predicted so it is handled by HW. Loop: lw $t0, 0($s1) # $t0 = array element addu $t0, $t0, $s2 # add scalar in $s2 sw $t0, 0($s1) # store result addi $s1, $s1, -4 # decrement pointer bne $s1, $zero, Loop # branch $s1 != 0 3 data dependences Loop Unrolling 29  Needs 4 clock cycles to execute 5 instructions => CPI = 0.8. NOP NOP NOP  one pair of instructions has both issue slots used  CPI is in this case 0.8 versus the best case of 0.5  Can we improve it more??  Loop unrolling to 4 copies and register renaming. Loop Unrolling 30  Is an Important compiler technique to get more performance from loops that access arrays; where multiple copies of the loop body are made.  After unrolling, there is more ILP available by overlapping instructions from different iterations.  During the unrolling process, the compiler introduced additional registers ($t1, $t2, $t3) for register renaming:  The goal of this process is to eliminate Name Dependences. Loop Unrolling 31 NOP NOP  It takes 8 clock cycles to execute 14 instructions => CPI = 8/14 = 0.57 Basic Pipeline Scheduling and Loop Unrolling 32  Best performance of pipelining is by keeping it full all the times  Finding sequences of unrelated instructions that can be overlapped in the pipeline.  The execution of a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction  Example  Adding a scalar to a vector for (i = 999; i >= 0; i = i–1) x[i] = x[i] + s; Basic Pipeline Scheduling and Loop Unrolling 33  Assume R1 is initially the address of the element in the array with the highest address, and F2 contains the scalar value s Code in MIPS Loop: L.D F0,0(R1) # F0= array element ADD.D F4,F0,F2 # add scalar in F2 S.D F4,0(R1) # store result DADDUI R1,R1,-8 # decrement pointer;8 bytes (per DW) BNE R1,R2,Loop # branch R1!=R2 Basic Pipeline Scheduling and Loop Unrolling 34  Latencies of FP operations used;  First column shows the producing instruction types; second column the consuming instructions types and third column is the number of intervening clock cycles needed to avoid a stall. Basic Pipeline Scheduling and Loop Unrolling 35 Without Any Scheduling (ignoring delayed branch for now) Clock cycle issued Loop: L.D F0,0(R1) 1 stall 2 ADD.D F4,F0,F2 3 stall 4 stall 5 S.D F4,0(R1) 6 DADDUI R1,R1,#-8 7 # Results are in exe stage stall 8 BNE R1,R2,Loop 9 # Comparison is in ID stage Unscheduled code requires nine clock cycles Basic Pipeline Scheduling and Loop Unrolling 36 Basic Scheduling Loop: L.D F0,0(R1) 1 DADDUI R1,R1,#-8 2 ADD.D F4,F0,F2 3 stall 4 stall 5 S.D F4,8(R1) 6 BNE R1,R2,Loop 7 The clock cycles are reduced to Seven!! But only three instructions are actually doing the work inside the loop!! Loop Example with Delayed Branch Slot 37  One delayed branch slot  Unscheduled code: 10 clock cycles 1 Loop: L.D F0,0(R1) 2 stall #load interlock 3 ADD.D F4,F0,F2 4 stall #data hazard (F4) 5 stall #data hazard (F4) 6 S.D F4,0(R1) 7 DADDUI R1,R1,#-8 8 stall #branch interlock 9 BNE R1,R2,Loop 10 stall #delayed branch slot Loop Example with Delayed Branch Slot 38  Scheduled code: we need 6 cycles 1 Loop: L.D F0,0(R1) 2 DADDUI R1,R1,#-8 3 ADD.D F4,F0,F2 4 stall #data hazard (F4) 5 BNE R1,R2,Loop 6 S.D F4,8(R1) # 0+8 = 8 LOOP Unrolling Example 39  Unroll the loop  Replicate the body of the loop many times  Adjust the loop termination code  Eliminating the branch allows instructions from different iterations to be scheduled together  In this case we can eliminate the data stall  We also increase the ratio of useful work to overhead  Doing so requires more registers LOOP Unrolling Example 40  First Step: Copy the Loop  Assuming R1 – R2 (that is, the size of the array) is initially a multiple of 32, which means that the number of loop iterations is a multiple of 4. 1 Loop: L.D F0,0(R1) 2 ADD.D F4,F0,F2 3 S.D F4,0(R1) #drop DADDUI & BNE 4 L.D F0,-8(R1) 5 ADD.D F4,F0,F2 6 S.D F4,-8(R1) #drop DADDUI & BNE 7 L.D F0,-16(R1) 8 ADD.D F4,F0,F2 9 S.D F4,-16(R1) #drop DADDUI & BNE 10 L.D F0,-24(R1) 11 ADD.D F4,F0,F2 12 S.D F4,-24(R1) 13 DADDUI R1, R1, #-32 #-8 *4 = 32 14 BNE R1,R2,LOOP 15 NOP LOOP Unrolling Example 41  Second Step: Find Name Dependencies 1 Loop: L.D F0,0(R1) #1st iteration 2 ADD.D F4,F0,F2 3 S.D F4,0(R1) 4 L.D F0,-8(R1) #2nd iteration 5 ADD.D F4,F0,F2 6 S.D F4,-8(R1) 7 L.D F0,-16(R1) #3ed iteration 8 ADD.D F4,F0,F2 9 S.D F4,-16(R1) 10 L.D F0,-24(R1) #4th iteration 11 ADD.D F4,F0,F2 12 S.D F4,-24(R1) 13 DADDUI R1, R1, #-32 14 BNE R1,R2,LOOP 15 NOP LOOP Unrolling Example 42  Third Step: Register Renaming 1 Loop: L.D F0,0(R1) 2 ADD.D F4,F0,F2 3 S.D F4,0(R1) 4 L.D F6,-8(R1) # Renamed F0 5 ADD.D F8,F6,F2 # Renamed F4, F0 6 S.D F8,-8(R1) # Renamed F4 7 L.D F10,-16(R1) # Renamed F0 8 ADD.D F12,F10,F2 # Renamed F4, F0 9 S.D F12,-16(R1) # Renamed F4 10 L.D F14,-24(R1) # Renamed F0 11 ADD.D F16,F 14,F2 # Renamed F4, F0 12 S.D F16,-24(R1) # Renamed F4 13 DADDUI R1, R1, #-32 14 BNE R1,R2,LOOP 15 NOP LOOP Unrolling Example: Unrolled, Unscheduled 43 Loop: L.D F0, 0(R1) 2 Stalls 1Stall ADD.D F4, F0, F2 S.D F4, 0(R1) ; L.D F6, -8(R1) 2 Stalls 1Stall ADD.D F8, F6, F2 S.D F8, -8(R1) ; L.D F10, -16(R1) 2 Stalls 1Stall ADD.D F12, F10, F2 S.D F12, -16(R1) ; L.D F14, -24(R1) 2 Stalls 1Stall ADD.D F16,F14,F2 S.D F16, -24(R1) DADDUI R1, R1, #-32 1 Stalls BNE R1, R2, Loop 1 Stalls Total of 28 clock cycles for every 4 elements in the loop OR 7 cycles per iteration! LOOP Unrolling Example: Unrolled, Scheduled 44 After Scheduling the Unrolled Loop 1 Loop: L.D F0, 0(R1) 2 L.D F6, -8(R1) 3 L.D F10, -16(R1) 4 L.D F14,- 24(R1) 5 ADD.D F4, F0, F2 6 ADD.D F8, F6, F2 7 ADD.D F12, F10, F2 8 ADD.D F16, F14, F2 9 S.D F4, 0(R1) 10 S.D F8, -8(R1) 11 DADDUI R1, R1,#-32 12 S.D F12, 16(R1) # -16+32 = 16 13 BNE R1,R2,Loop 14 S.D F16, 8(R1) # -24+32 = 8 In total only 14 Clock cycles for 4 elements!! or 3.5 cycles per iteration!  The gain from scheduling on the unrolled loop is even larger than on the original loop!! Basic Pipeline Scheduling and Loop Unrolling 45  Loop unrolling limitations:  A decrease in the amount of overhead amortized with each unroll  Code size limitations  Growth in size  Compiler limitations  Potential shortfall in registers that is created by aggressive unrolling and aggressive scheduling – Register Pressure. Compiler Perspectives: Unrolling Loops 46  We don’t usually know the upper bound of a loop  Suppose it is n, and we want k copies of the loop  Don’t generate a single unrolled loop!  Generate a pair of consecutive loops:  First executes the original loop (n mod k) Times  Second executes the unrolled loop body (n/k) Times  For large n, most iterations occur in unrolled loop Compiler Perspectives: Dependencies 47  Compilers must preserve data dependencies  Determine if loop iterations are independent  Rename registers during unrolling  Eliminate extra test and branch instructions  Adjust loop maintenance and termination accordingly  Determine if loads and stores can be interchanged  Schedule the code, preserving dependencies Compiler Perspectives: Renaming 48  Dependent instructions can’t execute in parallel  Easy to determine for registers (fixed names)  Much harder for memory  This is the ―memory disambiguation‖ problem  Does 100(R4) = 20(R6)?  In different loop iterations, does 20(R6) = 20(R6)?  In our example, compiler must determine that if R1 doesn’t change then: 0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)  In this case, loads and stores can be interchanged Dynamic Multiple-Issue Processors 49  known as superscalar processors  An advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle by selecting them during execution.  In the simplest superscalar processors, instructions issue in order, and the processor decides whether zero, one, or more instructions can issue in a given clock cycle.  achieving good performance on such a processor still requires the compiler to try to schedule instructions to move dependences apart and thereby improve the instruction issue rate.  important difference between this simple superscalar and a VLIW processor: the code whether scheduled or not, is guaranteed by the hardware to execute correctly.  compiled code will always run correctly independent of the issue rate or pipeline structure of the processor. In some VLIW designs, this has not been the case, and recompilation was required when moving across different processor models Superscalar Processors 50  In simple pipelining:  To execute 4 instructions we need 7 CCT. F D E W F D E W F D E W F D E W Superscalar Processors 51  Super-Pipelining  Many pipeline stages perform tasks require less than half a CCT => doubled internal clock speed allows the performance of two tasks in one external clock cycle.  To execute 4 instructions we need 5.5 CCT. F D E W F D E W F D E W F D E W Superscalar Processors 52  Superscalar  To have multiple pipelines  Two pipelines:  To execute 4 instructions we need 5 CCT. F D E W F D E W F D E W F D E W Scalar Organization 53 SuperScalar Organization 54 Dynamic Scheduling 55  Many superscalars extend the basic framework of dynamic issue decisions to include dynamic pipeline scheduling.  Dynamic pipeline scheduling chooses which instructions to execute in a given clock cycle while trying to avoid hazards and stalls.  In other words, the hardware rearranges the instruction execution to reduce the stalls while maintaining data flow and exception behavior.  It tries to avoid stalling when dependences are present. In contrast, static pipeline scheduling by the compiler tries to minimize stalls by separating dependent instructions so that they will not lead to hazards.  Example DIV.D F0,F2,F4 ADD.D F10,F0,F8 SUB.D F12,F8,F14  SUB cannot proceed even though it has no data dependency!!  Solution? Out-of-order execution!! Dynamic Scheduling 56  In such processors, the pipeline is divided into three major units:  An instruction fetch and issue unit,  Multiple functional units (a dozen or more in high-end designs in 2013).  and a Commit unit. Dynamic Scheduling 57 Dynamic Scheduling 58  Instruction fetch and issue unit  Fetches instructions, decodes them, and sends each instruction to a corresponding functional unit for execution  Functional units  Each functional unit has buffers, called reservation stations, which hold the operands and the operation.  As soon as the buffer contains all its operands and the functional unit is ready to execute, the result is calculated.  When the result is completed, it is sent to any reservation stations waiting for this particular result as well as to the commit unit, Dynamic Scheduling 59  Commit Unit  which buffers (often called reorder buffer) the result until it is safe to put the result into the register file or, for a store, into memory.  The reorder buffer, is also used to supply operands, in much the same way as forwarding logic does in a statically scheduled pipeline  Once a result is committed to the register file, it can be fetched directly from there, just as in a normal pipeline.  The combination of buffering operands in the reservation stations and results in the reorder buffer provides a form of register renaming, just like that used by the compiler in our earlier loop-unrolling example Dynamic Scheduling 60  These steps effectively use the reorder buffer and the reservation stations to implement register renaming. 1. When an instruction issues, it is copied to a reservation station for the appropriate functional unit.  Any operands that are available in the register file or reorder buffer are also immediately copied into the reservation station.  The instruction is buffered in the reservation station until all the operands and the functional unit are available.  For the issuing instruction, the register copy of the operand is no longer required, and if a write to that register occurred, the value could be overwritten. 2. If an operand is not in the register file or reorder buffer, it must be waiting to be produced by a functional unit.  The name of the functional unit that will produce the result is tracked.  When that unit eventually produces the result, it is copied directly into the waiting reservation station from the functional unit bypassing the registers. Dynamic Scheduling Out-of-Order Execution 61  This style of execution is called an out-of-order execution, since the instructions can be executed in a different order than they were fetched.  A situation in pipelined execution when an instruction blocked from executing does not cause the following instructions to wait.  To make programs behave as if they were running on a simple in-order pipeline, the instruction fetch and decode unit is required to issue instructions in order,  which allows dependences to be tracked, and the commit unit is required to write results to registers and memory in program fetch order.  This conservative mode is called in-order commit. Dynamic Scheduling Out-of-Order Execution 62  In-Order-Issue / Out-of-Order-Execution  Implementation  Separate the issue process into two parts: 1. Issue—Decode instructions, check for structural hazards. 2. Read operands—Wait until no data hazards, then read operands.  Out-of-order execution implies out-of-order completion.  Out-of-order execution introduces the possibility of WAR and WAW hazards!!  Usually pipeline allows multiple instructions to be in execution at the same time; by using multiple functional units to exploit dynamic scheduling Dynamic Scheduling Out-of-Order-Execution 63 Example DIV.D F0,F2,F4 Anti- dependence ADD.D F6,F0,F8 Output SUB.D F8,F10,F14 dependence MUL.D F6,F10,F8  If the pipeline executes SUB.D before ADD.D (which is waiting for the DIV.D), it will violate the anti-dependence, yielding a WAR hazard.  Both of these hazards are avoided by the use of Register Renaming.  In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue); however, they can be stalled or bypass each other in the second stage (read operands) and thus enter execution out of order. Dynamic Scheduling Out-of-Order-Execution 64  Out of Order Execution Techniques  Scoreboarding  Allowing instructions to execute out of order when there are sufficient resources and no data dependences  Tomasulo’s Algorithm  Handles anti-dependences and output dependences by effectively renaming the registers dynamically.  It can be extended to handle Speculation  Dynamic scheduling is often extended by including hardware- based speculation, especially for branch outcomes. A technique to reduce the effect of control dependences by predicting the outcome of a branch, executing instructions at the predicted destination address, and taking corrective actions when the prediction was wrong. Dynamic Scheduling Tomasulo’s Algorithm 65  Tracking when operands for instructions are available to allow the execution as soon as possible which minimizes RAW hazards,  Tracking instruction dependences and introducing renaming registers in hardware to minimize WAW and WAR hazards  Renaming all destination registers, including those with a pending read or write for an earlier instruction, so that the out- of-order write does not affect any instructions that depend on an earlier value of an operand. Renaming Registers Example 66 DIV.D F0, F2, F4 ADD.D F6, F0, F8 S.D F6, 0(R1) SUB.D F8, F10, F14 MUL.D F6, F10, F8  Two anti-dependences -> WAR  ADD.D and SUB.D  S.D and MUL.D.  Output dependence -> WAW  ADD.D and the MUL.D  Three true data dependences -> RAW  DIV.D and ADD.D,  SUB.D and MUL.D,  ADD.D and S.D. Renaming Registers Example 67 Assume the existence of two temporary registers, S and T DIV.D F0, F2, F4 ADD.D S, F0, F8 S.D S, 0(R1) SUB.D T, F10, F14 MUL.D F6, F10, T Tomasulo’s Algorithm Renaming Technique 68  In Tomasulo’s Algorithm Register Renaming is provided by Reservation Stations:  A reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register.  Pending instructions designate the reservation station that will provide their input.  Finally, when successive writes to a register overlap in execution, only the last one is actually used to update the register.  As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation station, which provides register renaming.  Since there can be more reservation stations than real registers, the technique can even eliminate hazards arising from name dependences that could not be eliminated by a compiler. In-order issue/in-order completion 69  The simplest instruction issue policy: to have sequential execution (in-order issue) and to write results in that same order (in-order completion).  We assume a superscalar pipeline capable of:  fetching and decoding two instructions at a time,  having three separate functional units (e.g., two integer arithmetic and one floating-point arithmetic), and  having two instances of the write-back pipeline stage.  Example: A program with 5 Instructions: I1: needs two cycles to execute I2: I3 and I4 conflict on the same functional unit I5 depends on the value produced by I4 I5 and I6 conflict on the same functional unit In-Order Issue/In-Order Completion 70 Decoding Executing Writing CCT I1: needs two cycles to execute x y z 1 I2: I1 I2 I3 and I4 conflict on the same functional unit I3 I4 I1 I2 2 I5 depends on the value produced by I4 I3 I4 I1 3 I5 and I6 conflict on the same functional unit I4 I3 I1 I2 4 Instructions are fetched I5 I6 I4 5 two at a time, the next two must wait until the pair of I6 I5 I3 I4 decode stage are empty. 6 I6 To guarantee in-order- 7 completion, conflicts or multiple cycles exe. will stall I5 I6 8 the issuing process. In-Order Issue/Out-Order Completion 71  With out-of-order completion, any number of instructions may be in the execution stage at any one time, up to the maximum degree of machine parallelism across all functional units.  A new dependency, output dependency, arises.  Out-of-order completion requires more complex instruction issue logic than in-order completion.  it is more difficult to deal with instruction interrupts and exceptions.  When an interrupt occurs, instruction execution at the current point is suspended, to be resumed later. In-Order Issue/Out-Order Completion 72  Consider the following code fragment illustrates this dependency (op represents any operation): I1: R3 ← R3 op R5 I2: R4 ← R3 + 1 I3: R3 ← R5 + 1 I4: R7 ← R3 op R4  Instruction I2 cannot execute before instruction I1, because it needs the result in register R3 produced in I1; this is an example of a true data dependency.  I4 must wait for I3, because it uses a result produced by I3.  What about the relationship between I1 and I3?  There is no data dependency here. However, if I3 executes to completion prior to I1, then the wrong value of the contents of R3 will be fetched for the execution of I4.  So, I3 must complete after I1 to produce the correct output values.  To ensure this, the issuing of the third instruction must be stalled if its result might later be overwritten by an older instruction that takes longer to complete. In-Order Issue/Out-Order Completion 73 Decoding Executing Writing CCT I1: needs two cycles to execute 1 I2: I1 I2 I3 and I4 conflict on the same functional unit I3 I4 I1 I2 2 I5 depends on the value produced by I4 I4 I1 I3 I2 3 I5 and I6 conflict on the same functional unit I5 I6 I4 I1 I3 4 Any number of instructions may be in exe. stage at any I6 I5 I4 5 time. Instructions issuing is stalled I6 I5 6 by a resource conflicts, data dependency, control I6 7 dependency and name dependency (WAW) Out-Order Issue/Out-Order Completion 74  With in-order issue, the processor will only decode instructions up to the point of a dependency or conflict.  To allow out-of-order issue, it is necessary to decouple the decode and execute stages of the pipeline.  This is done with a buffer referred to as an instruction window.  However, this window is not an additional pipeline stage. An instruction being in the window simply implies that the processor has sufficient information about that instruction to decide when it can be issued.  After a processor has finished decoding an instruction, it is placed in the instruction window. As long as this buffer is not full, the processor can continue to fetch and decode new instructions. Out-Order Issue/Out-Order Completion 75  Any instruction may be issued, provided that 1. it needs the particular functional unit that is available, 2. No conflicts or dependencies block this instruction.  The result is that the processor has a lookahead capability, allowing it to identify independent instructions that can be brought into the execute stage.  The out-of-order issue, out-of-order completion policy is subject to the same constraints described earlier; an instruction cannot be issued if it violates a dependency or conflict.  The difference is that more instructions are available for issuing, reducing the probability that a pipeline stage will have to stall.  In addition, a new dependency, anti-dependency (also called read after write (RAW) dependency), arises. Out-Order Issue/Out-Order Completion 76  The code fragment considered earlier illustrates this dependency: I1: R3 ← R3 op R5 I2: R4 ← R3 + 1 I3: R3 ← R5 + 1 I4: R7 ← R3 op R4  Instruction I3 cannot complete execution before instruction I2 begins execution and has fetched its operands. This is so because I3 updates register R3, which is a source operand for I2. Out-Order Issue/Out-Order Completion 77 Decoding Window Executing Writing CCT 1 I1 I2 I3 I4 I1 I2 2 I1, I2 I5 I6 I3, I4 I1 I3 I2 3 I4, I5, I6 I6 I4 I1 I3 4 I5 I5 I4 I6 5 I5 6 To allow out-of-order-execution we need to decouple decode and execute stages by using a buffer called instructions window Issuing can proceed if the FU is available and no conflicts or dependences existed. Implementation: Intel Core i7 78  Intel Core i7, a high-end, dynamically scheduled, speculative processor, intended for high-end desktops and server applications  It uses an aggressive out-of-order speculative micro- architecture with reasonably deep pipelines with the goal of achieving high instruction throughput by combining multiple issue and high clock rates  Intel Core i7 Pipeline  The total pipeline depth is 14 stages, with branch miss- predictions costing 17 cycles. There are 48 load and 32 store buffers. The six independent functional units can each begin execution of a ready micro-op in the same cycle. 79 Intel Core i7 pipeline

Advanced Pipelining PDF

Document Details

Tags

Related

Summary

Full Transcript

Upgrade to continue