Full Transcript

Chapter 3 Instruction-Level Parallelism Exploiting Instruction-Level Parallelism Pipeline Concepts From 1985, almost all processors have used pipelining to overlap the execution of instructions and, consequently, improve performance. This potential overlapping among instructions enables the instruct...

Chapter 3 Instruction-Level Parallelism Exploiting Instruction-Level Parallelism Pipeline Concepts From 1985, almost all processors have used pipelining to overlap the execution of instructions and, consequently, improve performance. This potential overlapping among instructions enables the instruction-level parallelism - ILP concept, since the instructions can be evaluated in parallel. Other techniques, detailed in the following sections and chapters, also contribute to ILP. The pipeline is a natural idea. Consider, for example, an assembly line. Basically, it is about dividing the task into sequential sub-tasks, then allocate resources for each sub-task, and finally, control phase changes. The laundry example. It takes four hours in total to wash and dry. Therefore, four baskets would take 16 hours to get done. If one separates this whole process into phases, such as wash and dry, and assuming that the wash phase takes two hours and dry phase takes two hours... Then, how long it will take to finish four baskets? The answer is 10 hours, instead of the first 16 hours. The key is the phase overlapping. Following this idea, in a cycle pipeline, an instruction can be broken into the following cycles, as illustrated in Fig. 3.1. RI DI OO EXE AR Figure 3.1: Instruction cycle pipeline. 1. instruction fetch - RI – here, the instruction is fetched from a memory; 2. instruction decode - DI – once the instruction is available, it has to be decoded to find out what to do; 49 3 Instruction-Level Parallelism 3. operands fetch - OO – in this cycle, the operands are fetched; 4. execution - EXE – with the instruction decoded and operands fetched, this is time to perform the execution of that instruction; and 5. result store - AR – finally, this is the cycle to write the result, either in a memory position or a register. Tables 3.1 and 3.2 shows a comparison considering the instruction parallelism aspect. Next, Table 3.1 shows three instructions starting and finishing all the pipeline cycles. Each instruction takes five cycles to finish serially, i.e., without overlapping. This scenario gives a CPI equals to five. This behavior results in idle time and also inefficiency. Notice that there are lots of empty spaces in Table 3.1. CPI = 15 cyles 3 instructions Table 3.1: Without pipeline, idle time and inefficiency. RI DI OO EXE AR I1 I2 I1 I1 I1 I3 I2 I2 I2 I1 5 I3 I3 I2 10 I3 I3 15 On the other hand, Table 3.2 shows the same 15 clock cycles as previously mentioned, but now filled with 11 instructions starting and finishing all the pipeline cycles, therefore overlapping the cycles. In this five-stage pipeline, it is possible to overlap up to five cycles as seen in cycle number five (Table 3.2). In this case, the CPI (lower) limit is equal to one. Refer to the clock cycle number five to 15. There are 10 clock cycles and 10 instructions completed. Table 3.2: With pipeline. CPI = 15 cycles/11 instructions. CPI limit = 1, notice from the clock cycle CC 5 to 15. How many clocks and how many instructions? RI DI OO EXE AR I1 I2 I1 I3 I2 I1 I4 I3 I2 I1 I5 I4 I3 I2 I1 5 I6 I5 I4 I3 I2 I7 I6 I5 I4 I3 I8 I7 I6 I5 I4 I9 I8 I7 I6 I5 I10 I9 I8 I7 I6 10 I11 I10 I9 I8 I7 I11 I10 I9 I8 I11 I10 I9 I11 I10 I11 15 Single-Cycle, Multiple-Cycle and Pipeline Chapter 2 introduced the details of the single-cycle datapath - SCD. Fig. 3.2 illustrates the SCD along with the multiple-cycle and pipeline. Notice that in SCD, every instruction takes one cycle to complete. Given this, the cycle has to fit the longest instruction (i.e., load) causing a partial time waste when executing other instructions, e.g., as in clock cycle number 2. To avoid this time-wasting, the multiple-cycle implementation divides the instructions into various small cycles. For example, the instruction load takes five cycles to complete, while store takes four cycles. Then, in the clock cycle number 10, another instruction can start the “instruction fetch” cycle, instead of wasting that cycle. Finally, in the pipeline implementation, there is an overlapping of the pipeline stages: ifetch, reg, exec, mem, and wr. Therefore, the pipeline allows the instructions to execute in parallel causing a true performance improvement, i.e., speedup. By the end of clock cycle number 7, there are three instructions completed. In the pipeline implementation, every instruction has the same number of stages. 50 Exploiting Instruction-Level Parallelism Figure 3.2: single-cycle, multiple-cycle and pipeline overview comparison. Speedup What is the expected enhancement in performance when using the pipeline implementation? Let’s take a look in Table 3.3. Table 3.3: “Stages vs. time” diagram in the pipeline implementation. Stage S1 S2 S3 S4 S5 I1 I2 I1 I3 I2 I1 I4 I3 I2 I1 I5 I4 I3 I2 I1 5 I6 I5 I4 I3 I2 I7 I6 I5 I4 I3 I8 I7 I6 I5 I4 I9 I8 I7 I6 I5 I10 I9 I8 I7 I6 10 I11 I10 I9 I8 I7 I11 I10 I9 I8 I11 I10 I9 I11 I10 I11 15 t in clock cycles To compute the expected enhancement, the following definitions are used. n – number of instructions; p – number of pipeline stages (also known as the “pipeline depth”); Tj (1 ≤ j ≤ p) – waiting time in the pipeline stage Sj ; T L – stage transition time; TM AX = max{Ti } ; T = TM AX + T L : clock period (clock cycle); and frequency f = 1 T. 51 3 Instruction-Level Parallelism Eq. (3.1) gives the speedup measure from pipelining. Speedup Pipeline = AverageInstructionTimeUnpipelined AverageInstructionTimePipelined (3.1) AverageInstructionTimeUnpipelined = n × p × T (3.2) AverageInstructionTimePipelined = (p + n − 1) × T (3.3) Then, Eq. (3.1) can be written as in Eq. (3.4). Speedup Pipeline = n×p p+n−1 (3.4) If n  p, Speedup Pipeline ≈ p. Therefore, the maximum Speedup Pipeline ≤ p. Question: the higher the p, the higher the Speedup Pipeline ? Take a look at the definition of efficiency. Efficiency The efficiency η is described by the relation between the “occupied area” and “total area” considering the stages vs. time diagram (Table 3.3). The observed improvement when using several pipeline stages p is given by Eq. (3.5). Speedup Pipeline η= = p n×p p+n−1 p = n p+n−1 (3.5) When n → ∞, η = 1 (maximum efficiency). Throughput The throughput W is the number of completed tasks per time unit, as in Eq. (3.6). W = When η → 1, W → 1 T η n = (p + n − 1) × T T (3.6) = f (maximum throughput). Major Hurdles of Pipelining: Concepts and Hazards The key barriers to achieve the maximum values related to speedup, efficiency and throughput are stages with different times, e.g., related to fails in cache memory or instructions with different lengths. Another important point to consider is regarding hazards and dependencies. In that sense, structural hazards are about resource conflicts, and data hazards (an instruction that depends on the results of a previous instruction) involves: read after write - RAW; write after read - WAR; and write after write WAW. Finally, control hazards involve issues such as the branch prediction, delayed branches, and changes in the program counter - PC. 52 Major Hurdles of Pipelining: Concepts and Hazards Example: A conditional branch instruction in the pipeline. Assume that I5 is a conditional branch instruction. Other instructions are blocked when the conditional branch is decoded, which causes a bunch of idle stages (inefficiency) in the pipeline, as seen in Table 3.4. In that case, there are two possibilities to follow the pipeline: decode I6 as the next instruction, or take the branch and fetch I19. Until the conditional branch is decided, the pipeline is negatively impacted. Table 3.4: Example of a conditional branch instruction (I5 ) in the instruction decoding phase, clock cycle number 6. RI DI OO EXE AR I1 I2 I1 I3 I2 I1 I4 I3 I2 I1 I5 I4 I3 I2 I1 5 I6 I5 * I4 I3 I2 I19 I5 I4 I3 I5 I4 I5 I20 I19 I21 I20 I19... I21 I20 I19...... I21 I20 I19 10......... I21 I20 15............ I21............... Unbalanced Length in Pipeline Stages Another problem is the unbalanced length in the pipeline stages, as shown in Table 3.5. It causes lots of idle time and inefficiency, represented by the empty spaces. Table 3.5: Example of a pipeline with different stage lengths. RI takes four cycles, while DI takes one. RI DI OO EXE AR I1 I1 I1 I1 I2 I1 I2 I2 I2 I1 I1 I1 I3 I2 I1 5 I3 I3 I3 I2 I1 I2 I2 I4 I3 I2 I1 I1 I1 I4 I4 I4 I3 I2 I1 I3 I3 I2 15 I2 10 I5 I4 I3 I3 I2 I5 I5 I5 I4 I4 I4 I2 I3 I3 20 I6 I5 I4......... I3... Structural Hazard The structural hazard is also known as functional dependency. It is the result of a race condition involving two or more instructions competing for the same resource at the same time. Some possible solutions to workaround this problem are: one of the instructions must wait, or it is necessary to increase the number of resources available in the system. Table 3.6 shows an example where the instruction I3 is in the “instruction fetch” stage from clock cycle number three to six. Given this, no other instruction can be fetched, i.e., they have to wait until I3 moves to the next functional unit, i.e., “instruction decode”. Table 3.6: The instruction I4 has to wait before proceeding to avoid a structural harzard. RI DI OO EXE AR I1 I2 I1 I3 I2 I1 I3 I2 I1 I3 I2 I1 5 I3 I2 I4 I3 I5 I4 I3 I5 I4 I3 I5 I5 I4 I3 10 I4 I6 I7 15 53 3 Instruction-Level Parallelism Control of Resources Usage When a dependency involves only processor resources, such as registers or functional units, a resource reservation table is generally used. There, every instruction makes a check in the table about the resources that it will use. If all of them are marked, the instruction is blocked until there is a release. After using it, the instruction frees the resource by unchecking it in the reservation table. Data Hazard Data hazards occur when the pipeline changes the order of read/write accesses of instruction operands. Then, the order differs from the order seen by sequentially executing instructions on an unpipelined processor. Write after read - WAR Related to the WAR data hazard, is there any problem related to the code in Listing 3.1? Here, the highlight goes to B that is written after being read. Listing 3.1: WAR data hazard code snippet. 1 A = B + C 2 B = C + D There is a false dependency in this case, as long as there is no out-of-order execution, i.e., line number two before line one. WAR hazards are impossible in the simple five-stage pipeline, but they occur when instructions are reordered. Out-of-order execution is taken into account in the next chapters. Write after write - WAW Related to the WAW data hazard, is there any problem related to the code in Listing 3.2? Here, the highlight goes to A that is written (line two) after being written in the first place (line one). Listing 3.2: WAW data hazard code snippet. 1 A = B + C 2 A = D + E There is a false dependency in this case, as long as there is no out-of-order execution. WAW hazards are also impossible in the simple five-stage pipeline, but they occur when instructions are reordered. Read after write - RAW Related to the RAW data hazard, is there any problem related to the code in Listing 3.3? Here, the highlight goes to A that is read (line two) after being written (line one). Listing 3.3: RAW data hazard code snippet. 1 A = B + C 2 E = A + D 54 Datapaths and Stalls Yes, here there is a problem. The instruction from line two can only read the value of A after the completion of the instruction from line one. The simple solution is to wait the execution of the instruction in line two, that is until the value of A is ready and available to be used. But, wait for how many clock cycles? The answer is in the following sections. Control Hazard Control hazards come from the pipelining of conditional branches and other instructions that cause a change in the program counter register. Common Solution A common possible solution to those hazards is “simply” stall the pipeline as long as the hazard is present. This is done by inserting one or more “bubbles” into it, as shown next. Datapaths and Stalls The MIPS Pipeline Now, let’s consider a concrete case: the MIPS pipeline. The stages (or cycles) of MIPS pipeline are: 1. instruction fetch - IF cycle; 2. instruction decode/register fetch - ID cycle; 3. execution/effective address - EX cycle; 4. memory access - MEM cycle; and 5. write-back - WB cycle. An Overview of Pipeline and Functional Units Fig. 3.3 illustrates the MIPS pipeline and the functional units responsible to perform the functions of each pipeline stage. Time (Clock Cycles) Load IF ID Mem Reg EX A L MEM WB Mem Reg U Instruction Order Inst. #1 IF ID Mem Reg L IF ID Mem Reg EX A MEM WB Mem Reg U Inst. #2 EX A L MEM WB Mem Reg U Inst. #3 IF ID Mem Reg EX A L MEM WB Mem Reg U Inst. #4 IF ID Mem Reg EX A L MEM WB Mem Reg U Figure 3.3: MIPS pipeline and functional units overview. 55 3 Instruction-Level Parallelism Structural Hazard in MIPS Notice, as illustrated in Fig. 3.4, that a single memory (Mem) configures a structural hazard in this pipeline. In clock cycle number four, one instruction is trying to access the memory to load some data at the same time that another instruction is trying to access that memory to fetch an instruction. This is a hazard. Time (Clock Cycles) Load IF ID Mem Reg MEM WB Mem Reg EX A MEM WB Mem Reg EX A L U Instruction Order Inst. #1 IF ID Mem Reg L U Inst. #2 IF ID Mem Reg L IF ID Mem Reg MEM WB Mem Reg EX A MEM WB Mem Reg EX A U Inst. #3 L U Inst. #4 IF ID Mem Reg MEM WB Mem Reg EX A L U Figure 3.4: Structural hazard in MIPS: the memory. Solution 1 to Fix the Structural Hazard A possible solution to this problem is to stall the pipeline to resolve the memory structural hazard, by inserting a “bubble” on it, as shown in Fig. 3.5. In this case, the instruction #3 is delayed (stalled) by one clock cycle. Time (Clock Cycles) Load IF ID Mem Reg EX A L MEM WB Mem Reg U Instruction Order Inst. #1 IF ID Mem Reg EX A L MEM WB Mem Reg U Inst. #2 IF ID Mem Reg EX A L MEM WB Mem Reg U Inst. #3 (stall) bubble IF ID Mem Reg EX A L MEM WB Mem Reg EX A MEM WB Mem Reg U Inst. #4 IF ID Mem Reg L U Figure 3.5: Structural hazard in MIPS: stall the pipeline. 56 Datapaths and Stalls Solution 2 to Fix the Structural Hazard A second solution is about adding new hardware (memory), i.e., to have separated memories: the instruction cache - Im and data cache - Dm. Fig. 3.6 illustrates the two different memories that fix the structural hazard. Now, in the clock cycle number four, the instruction load can access the data memory, while at the same clock the instruction #3 is accessing the instruction memory. Time (Clock Cycles) Load IF ID Im Reg EX A L MEM WB Dm Reg EX A MEM WB Dm Reg EX A MEM WB Dm Reg EX A MEM WB Dm Reg EX A MEM WB Dm Reg U Instruction Order Inst. #1 IF ID Im Reg L U Inst. #2 IF ID Im Reg L U Inst. #3 IF ID Im Reg L U Inst. #4 IF ID Im Reg L U Figure 3.6: Structural hazard in MIPS: add extra resource in the system. Data Hazard in MIPS Let’s consider the following code, as in Listing 3.4. Listing 3.4: A code snippet to analyze data hazards. 1 add r1 , r2 , r3 2 sub r4 , r1 , r3 3 and r6 , r1 , r7 4 or 5 xor r10 , r1 , r11 r8 , r1 , r9 Notice that all instructions after the add (line one) use the result of the instruction add. The instruction add writes the value of r1 in the write-back pipeline stage, clock cycle number five, as illustrated in Fig. 3.7. Given this, there is a data hazard with respect to the register r1. The register file (Reg) is used as a source in the ID stage, and as a destination in the WB stage. Therefore, a register is read in the second half of the stage, and written in the first half (halves shown in gray color on the register file Reg, Fig. 3.7). Thus, there is a RAW hazard with respect to the instructions sub and and from Listing 3.4, as shown in Fig. 3.7. This is because r1 was not actually written when instructions sub and and reads it. The register r1 is written in the first half of the WB stage related to the add instruction. 57 3 Instruction-Level Parallelism Time (Clock Cycles) add r1,r2,r3 IF ID Im Reg EX A L MEM WB Dm Reg U Instruction Order sub r4,r1,r3 IF ID Im Reg EX A L MEM WB Dm Reg U and r6,r1,r7 IF ID Im Reg EX A L MEM WB Dm Reg EX A MEM WB Dm Reg EX A MEM WB Dm Reg U or r8,r1,r9 IF ID Im Reg L U xor r10,r1,r11 IF ID Im Reg L U Figure 3.7: Data hazards in MIPS. A possible workaround is the hardware stall. In this case, the hardware does not change the program counter - PC value, instead it keeps fetching the same instruction and sets control signals to harmless values, such as ZERO. As illustrated in Fig. 3.8, there is a two-stall penalty to fix the referred data hazard. Thus, r1 is written in the first half of WB and read in the second half of the ID stage (clock cycle number five). Time (Clock Cycles) add r1,r2,r3 IF ID Im Reg EX A L MEM WB Dm Reg U Instruction Order stall stall sub r4,r1,r3 Im bubble bubble bubble bubble Im bubble bubble bubble IF ID Im Reg EX A L bubble MEM WB Dm Reg EX A MEM WB Dm Reg U and r6,r1,r7 IF ID Im Reg L U Figure 3.8: A possible workaround for the data hazard, hardware stalls. Another possible workaround for this data hazard can be the fix by software. In other words, by inserting independent instructions in between. In the worst-case scenario, it inserts instructions nop, as illustrated in Fig. 3.9. 58 Datapaths and Stalls Time (Clock Cycles) add r1,r2,r3 IF ID Im Reg L IF ID Im Reg EX A MEM WB Dm Reg EX A MEM WB Dm Reg U Instruction Order nop L U nop IF ID Im Reg EX A L MEM WB Dm Reg EX A MEM WB Dm Reg U sub r4,r1,r3 IF ID Im Reg L U and r6,r1,r7 IF ID Im Reg EX A L MEM WB Dm Reg U Figure 3.9: Another possible workaround for the data hazard, software nop’s. Minimizing Data Hazard Stalls by Forwarding There is the forwarding technique to minimize the stalls due to the data hazard in the pipeline. Forwarding is a simple hardware technique, also known as hardware bypassing or short-circuiting. Let’s consider the previous RAW hazard from Listing 3.4. The key point in forwarding is: the result is not really needed by the instruction sub until after the instruction add actually produces it. Then, what if that result could be moved from the pipeline register where the add stores it to where the sub needs it? In this case, the necessity for a stall can be avoided. As seen in Fig. 3.10, the ALU’s result from EX/MEM and MEM/WB pipeline registers is always fed back to one ALU input. The forwarding hardware detects that the previous ALU operation wrote the register corresponding to a source for the current ALU operation. Then, the control logic selects the forwarded result as one of the ALU inputs, instead of the value read from the register file. If, for some reason, the sub is stalled, the add will be completed and the bypass will not be activated. This is also true when an interrupt occurs between these two instructions. But unfortunately, not all potential data hazards can be managed by the forwarding technique. Let’s consider the following code, as shown in Listing 3.5 Listing 3.5: A case not managed by the forwarding technique. 1 ld r1 ,0( r2 ) 2 sub r4 , r1 , r5 3 and r6 , r1 , r7 4 or r8 , r1 , r9 The problem there is that the instruction ld (load) does not have the data until the end of MEM cycle (clock cycle number four, Fig. 3.11). And, the instruction sub needs to have the data by the beginning of that clock cycle. Therefore, this case needs a stall. 59 3 Instruction-Level Parallelism Figure 3.10: Pipeline registers (gray rectangles) from left to rigth: IF/ID; ID/EX; EX/MEM; and MEM/WB. In this case, the ALU output stored in the register EX/MEM is forwarded to one ALU input in the instruction sub. Also, the ALU result in the register MEM/WB is forwarded to one ALU input in the instruction and. In the clock cycle number five, there is no forwarding. The register is written in the first half and read in the second. Figure 3.11: The instruction sub needs the data in the beginning of the clock cycle four, but the data is only ready at the end of that cycle, i.e., when there is an effective memory access. Hardware Changing for Forwarding To implement the forwarding technique, the hardware has to accomodate the following changes: (i) ALU output at the end of EX stage; (ii) memory output at the end of MEM stage; and (iii) ALU output at the end of MEM stage, as shown in Fig. 3.12. The hardware also has to increase the multiplexers to include new paths from the pipeline registers (dotted lines in Fig. 3.12). Assuming register writing occurs in the first half, and reading occurs in the second half of the cycle, there is no need for the WB output. Otherwise it would need an extra forwarding. 60

Use Quizgecko on...
Browser
Browser