Chapter 3
Instruction-Level Parallelism

Exploiting Instruction-Level Parallelism

Pipeline Concepts

Since 1985, almost all processors have used pipelining to overlap the execution of instructions and, consequently, improve performance. This overlapping among instructions enables the concept of instruction-level parallelism - ILP, since the instructions can be evaluated in parallel. Other techniques, detailed in the following sections and chapters, also contribute to ILP.

The pipeline is a natural idea. Consider, for example, an assembly line. Basically, it amounts to dividing a task into sequential sub-tasks, allocating resources to each sub-task, and controlling the phase changes.

The laundry example. Washing and drying one basket takes four hours in total, so four baskets done serially take 16 hours. Now separate the whole process into two phases, wash and dry, each taking two hours. How long do four baskets take? The answer is 10 hours, instead of the original 16 hours. The key is the phase overlapping.

Following this idea, an instruction can be broken into the following cycles, as illustrated in Fig. 3.1.

Figure 3.1: Instruction cycle pipeline: RI -> DI -> OO -> EXE -> AR.

1. instruction fetch - RI: the instruction is fetched from memory;
2. instruction decode - DI: once the instruction is available, it is decoded to find out what to do;
3. operands fetch - OO: the operands are fetched;
4. execution - EXE: with the instruction decoded and the operands fetched, it is time to execute that instruction; and
5. result store - AR: finally, the result is written, either to a memory position or to a register.

Tables 3.1 and 3.2 show a comparison considering the instruction parallelism aspect.

Table 3.1 shows three instructions starting and finishing all the pipeline cycles. Each instruction takes five cycles to finish serially, i.e., without overlapping. This scenario gives a CPI of 15 cycles / 3 instructions = 5. This behavior results in idle time and inefficiency: notice the many empty slots in Table 3.1.

Table 3.1: Without pipeline: idle time and inefficiency. CPI = 15 cycles / 3 instructions = 5.

CC:   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
RI    I1                  I2                  I3
DI        I1                  I2                  I3
OO            I1                  I2                  I3
EXE               I1                  I2                  I3
AR                    I1                  I2                  I3

On the other hand, Table 3.2 shows the same 15 clock cycles, but now filled with 11 instructions starting and finishing the pipeline cycles, therefore overlapping them. In this five-stage pipeline, it is possible to overlap up to five cycles, as seen in clock cycle number five (Table 3.2). In this case, the CPI (lower) limit is equal to one: from clock cycle five to 15, there are 10 clock cycles and 10 instructions completed.

Table 3.2: With pipeline. CPI = 15 cycles / 11 instructions; the CPI limit is 1 (from clock cycle - CC 5 to 15, 10 instructions complete in 10 cycles).

CC:   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
RI    I1  I2  I3  I4  I5  I6  I7  I8  I9  I10 I11
DI        I1  I2  I3  I4  I5  I6  I7  I8  I9  I10 I11
OO            I1  I2  I3  I4  I5  I6  I7  I8  I9  I10 I11
EXE               I1  I2  I3  I4  I5  I6  I7  I8  I9  I10 I11
AR                    I1  I2  I3  I4  I5  I6  I7  I8  I9  I10 I11
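The cycle counts above follow directly from the two execution models: a serial machine needs n x p cycles for n instructions and p stages, while a pipeline needs p + n - 1. A minimal Python sketch (the helper names are ours, not from the notes) reproduces the CPI figures of Tables 3.1 and 3.2:

def serial_cycles(n, p):
    # Without pipelining: each of the n instructions occupies
    # all p stages before the next one may start.
    return n * p

def pipelined_cycles(n, p):
    # With pipelining: p cycles to fill the pipe, then one
    # instruction completes per cycle for the remaining n - 1.
    return p + n - 1

print(serial_cycles(3, 5) / 3)       # CPI = 5.0, as in Table 3.1
print(pipelined_cycles(11, 5) / 11)  # CPI = 15/11 ~ 1.36, as in Table 3.2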
Single-Cycle, Multiple-Cycle and Pipeline

Chapter 2 introduced the details of the single-cycle datapath - SCD. Fig. 3.2 illustrates the SCD along with the multiple-cycle and pipeline implementations. Notice that in the SCD, every instruction takes one cycle to complete. Given this, the cycle has to fit the longest instruction (i.e., load), causing a partial waste of time when executing the other instructions, e.g., as in clock cycle number 2.

To avoid this time-wasting, the multiple-cycle implementation divides the instructions into several short cycles. For example, the instruction load takes five cycles to complete, while store takes four. Then, in clock cycle number 10, another instruction can start its "instruction fetch" cycle, instead of wasting that cycle.

Finally, in the pipeline implementation, there is an overlapping of the pipeline stages: ifetch, reg, exec, mem, and wr. The pipeline therefore allows the instructions to execute in parallel, bringing a true performance improvement, i.e., speedup. By the end of clock cycle number 7, three instructions have completed. In the pipeline implementation, every instruction has the same number of stages.

Figure 3.2: Single-cycle, multiple-cycle and pipeline overview comparison.

Speedup

What is the expected enhancement in performance when using the pipeline implementation? Let's take a look at Table 3.3.

Table 3.3: "Stages vs. time" diagram in the pipeline implementation: stages S1 to S5 overlap instructions I1 to I11 diagonally over t (in clock cycles), with the same shape as Table 3.2.

To compute the expected enhancement, the following definitions are used:

n – number of instructions;
p – number of pipeline stages (also known as the "pipeline depth");
T_j (1 \le j \le p) – waiting time in the pipeline stage S_j;
T_L – stage transition time;
T_{MAX} = \max\{T_j\};
T = T_{MAX} + T_L – clock period (clock cycle); and
f = 1/T – clock frequency.

Eq. (3.1) gives the speedup measure from pipelining.

Speedup_{pipeline} = \frac{AverageInstructionTimeUnpipelined}{AverageInstructionTimePipelined}    (3.1)

AverageInstructionTimeUnpipelined = n \times p \times T    (3.2)

AverageInstructionTimePipelined = (p + n - 1) \times T    (3.3)

Then, Eq. (3.1) can be written as Eq. (3.4).

Speedup_{pipeline} = \frac{n \times p}{p + n - 1}    (3.4)

If n \gg p, then Speedup_{pipeline} \approx p. Therefore, the maximum Speedup_{pipeline} \le p.

Question: the higher the p, the higher the Speedup_{pipeline}? Take a look at the definition of efficiency.

Efficiency

The efficiency \eta is the ratio between the "occupied area" and the "total area" in the stages vs. time diagram (Table 3.3). The improvement observed when using p pipeline stages is given by Eq. (3.5).

\eta = \frac{Speedup_{pipeline}}{p} = \frac{n \times p}{(p + n - 1) \times p} = \frac{n}{p + n - 1}    (3.5)

When n \to \infty, \eta \to 1 (maximum efficiency).

Throughput

The throughput W is the number of completed tasks per time unit, as in Eq. (3.6).

W = \frac{n}{(p + n - 1) \times T} = \frac{\eta}{T}    (3.6)

When \eta \to 1, W \to 1/T = f (maximum throughput).
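Eqs. (3.4) to (3.6) are easy to evaluate numerically. A small sketch (function names are ours) shows how, for a fixed depth p, the speedup approaches p and the efficiency approaches 1 as n grows:

def speedup(n, p):
    return (n * p) / (p + n - 1)  # Eq. (3.4)

def efficiency(n, p):
    return n / (p + n - 1)        # Eq. (3.5)

def throughput(n, p, T):
    return n / ((p + n - 1) * T)  # Eq. (3.6); tends to f = 1/T

for n in (5, 50, 500):
    print(n, round(speedup(n, 5), 2), round(efficiency(n, 5), 2))
# 5 2.78 0.56 / 50 4.63 0.93 / 500 4.96 0.99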
Major Hurdles of Pipelining: Concepts and Hazards

The key barriers to achieving the maximum speedup, efficiency, and throughput are stages with different durations, e.g., due to cache memory misses or to instructions of different lengths. Another important point is hazards and dependencies. Structural hazards are about resource conflicts. Data hazards (an instruction depends on the result of a previous instruction) involve: read after write - RAW; write after read - WAR; and write after write - WAW. Finally, control hazards involve issues such as branch prediction, delayed branches, and changes in the program counter - PC.

Example: a conditional branch instruction in the pipeline. Assume that I5 is a conditional branch instruction. The other instructions are blocked when the conditional branch is decoded, which causes a number of idle stages (inefficiency) in the pipeline, as seen in Table 3.4. There are two possibilities for the pipeline to follow: decode I6 as the next instruction, or take the branch and fetch I19. Until the conditional branch is decided, the pipeline is negatively impacted.

Table 3.4: Example of a conditional branch instruction (I5) in the instruction decoding phase, clock cycle number 6. Once the branch is taken, fetching restarts at I19; * marks the slot wasted by the discarded I6.

CC:   1   2   3   4   5   6   7   8   9   10  ...
RI    I1  I2  I3  I4  I5  I6  I19 I20 I21 ...
DI        I1  I2  I3  I4  I5  *   I19 I20 I21 ...
OO            I1  I2  I3  I4  I5  *   I19 I20 ...
EXE               I1  I2  I3  I4  I5  *   I19 ...
AR                    I1  I2  I3  I4  I5  *   ...

Unbalanced Length in Pipeline Stages

Another problem is an unbalanced length among the pipeline stages, as shown in Table 3.5. It causes plenty of idle time and inefficiency, represented by the empty slots.

Table 3.5: Example of a pipeline with different stage lengths. RI takes four cycles, while DI takes one.

CC:   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21
RI    I1  I1  I1  I1  I2  I2  I2  I2  I3  I3  I3  I3  I4  I4  I4  I4  I5  I5  I5  I5  I6
DI                    I1              I2              I3              I4              I5
OO                        I1  I1  I1  I1  I2  I2  I2  I2  I3  I3  I3  I3  I4  I4  I4  I4
EXE                                       I1              I2              I3
AR                                            I1  I1  I1  I1  I2  I2  I2  I2  I3  I3  I3

Structural Hazard

The structural hazard is also known as functional dependency. It results from a race condition in which two or more instructions compete for the same resource at the same time. Possible workarounds for this problem: one of the instructions must wait, or the number of resources available in the system must be increased.

Table 3.6 shows an example where the instruction I3 occupies the "instruction fetch" stage from clock cycle three to six. During that time, no other instruction can be fetched: they have to wait until I3 moves on to the next functional unit, i.e., "instruction decode".

Table 3.6: The instruction I4 has to wait before proceeding, to avoid a structural hazard.

CC:   1   2   3   4   5   6   7   8   9   10  11  12  13
RI    I1  I2  I3  I3  I3  I3  I4  I5  I5  I5  I5  I6  I7
DI        I1  I2              I3  I4
OO            I1  I2              I3  I4
EXE               I1  I2              I3  I4
AR                    I1  I2              I3  I4

Control of Resources Usage

When a dependency involves only processor resources, such as registers or functional units, a resource reservation table is generally used. Every instruction checks the table for the resources it will use. If any of them is already marked, the instruction is blocked until there is a release. After using a resource, the instruction frees it by unchecking it in the reservation table, as sketched below.
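As a rough illustration of this mechanism (an invented model, not the notes' implementation), the reservation table can be seen as a set of busy resources that instructions mark on entry and uncheck on release:

class ReservationTable:
    def __init__(self):
        self.busy = set()  # resources currently marked as in use

    def try_acquire(self, resources):
        # Block (return False) if any requested resource is already marked.
        if self.busy & set(resources):
            return False
        self.busy |= set(resources)  # mark all requested resources
        return True

    def release(self, resources):
        # Uncheck the resources once the instruction is done with them.
        self.busy -= set(resources)

table = ReservationTable()
print(table.try_acquire({"ALU", "r1"}))  # True: resources marked
print(table.try_acquire({"r1"}))         # False: blocked until a release
table.release({"ALU", "r1"})
print(table.try_acquire({"r1"}))         # True again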
Data Hazard

Data hazards occur when the pipeline changes the order of read/write accesses to instruction operands, so that this order differs from the one seen by executing the instructions sequentially on an unpipelined processor.

Write after read - WAR

Regarding the WAR data hazard, is there any problem with the code in Listing 3.1? The highlight goes to B, which is written after being read.

Listing 3.1: WAR data hazard code snippet.
1 A = B + C
2 B = C + D

There is only a false dependency in this case, i.e., no problem as long as there is no out-of-order execution (line two before line one). WAR hazards are impossible in the simple five-stage pipeline, but they occur when instructions are reordered. Out-of-order execution is taken into account in the next chapters.

Write after write - WAW

Regarding the WAW data hazard, is there any problem with the code in Listing 3.2? The highlight goes to A, which is written (line two) after being written in the first place (line one).

Listing 3.2: WAW data hazard code snippet.
1 A = B + C
2 A = D + E

This is also a false dependency, harmless as long as there is no out-of-order execution. WAW hazards are likewise impossible in the simple five-stage pipeline, but they occur when instructions are reordered.

Read after write - RAW

Regarding the RAW data hazard, is there any problem with the code in Listing 3.3? The highlight goes to A, which is read (line two) after being written (line one).

Listing 3.3: RAW data hazard code snippet.
1 A = B + C
2 E = A + D

Yes, here there is a problem. The instruction in line two can only read the value of A after the completion of the instruction in line one. The simple solution is to make the instruction in line two wait until the value of A is ready and available to be used. But wait for how many clock cycles? The answer is in the following sections.
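The three hazard classes differ only in how the source and destination sets of two instructions in program order intersect. A small classifier (an illustrative sketch; the tuple representation is ours) makes the definitions concrete:

def hazards(first, second):
    # first/second: (reads, writes) pairs of register-name sets,
    # with `first` earlier in program order.
    r1, w1 = first
    r2, w2 = second
    found = []
    if w1 & r2:
        found.append("RAW")  # second reads what first writes (true dependency)
    if r1 & w2:
        found.append("WAR")  # second writes what first reads (false dependency)
    if w1 & w2:
        found.append("WAW")  # both write the same location (false dependency)
    return found

# Listing 3.3: A = B + C ; E = A + D  ->  RAW on A
print(hazards(({"B", "C"}, {"A"}), ({"A", "D"}, {"E"})))  # ['RAW']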
Control Hazard

Control hazards come from the pipelining of conditional branches and other instructions that change the program counter register.

Common Solution

A common possible solution to those hazards is to "simply" stall the pipeline as long as the hazard is present. This is done by inserting one or more "bubbles" into it, as shown next.

Datapaths and Stalls

The MIPS Pipeline

Now, let's consider a concrete case: the MIPS pipeline. The stages (or cycles) of the MIPS pipeline are:

1. instruction fetch - IF cycle;
2. instruction decode/register fetch - ID cycle;
3. execution/effective address - EX cycle;
4. memory access - MEM cycle; and
5. write-back - WB cycle.

An Overview of Pipeline and Functional Units

Fig. 3.3 illustrates the MIPS pipeline and the functional units responsible for performing the work of each pipeline stage.

Figure 3.3: MIPS pipeline and functional units overview: over the IF, ID, EX, MEM and WB stages, each instruction flows through the memory (Mem), the register file (Reg), the ALU, the memory again, and the register file.

Structural Hazard in MIPS

Notice, as illustrated in Fig. 3.4, that a single memory (Mem) creates a structural hazard in this pipeline. In clock cycle number four, one instruction tries to access the memory to load data at the same time that another instruction tries to access that memory to fetch an instruction. This is a hazard.

Figure 3.4: Structural hazard in MIPS: the single memory.

Solution 1 to Fix the Structural Hazard

A possible solution is to stall the pipeline to resolve the memory structural hazard, by inserting a "bubble" into it, as shown in Fig. 3.5. In this case, the instruction #3 is delayed (stalled) by one clock cycle.

Figure 3.5: Structural hazard in MIPS: stall the pipeline.

Solution 2 to Fix the Structural Hazard

A second solution is to add new hardware (memory), i.e., to have separate memories: the instruction cache - Im, and the data cache - Dm. Fig. 3.6 illustrates the two different memories fixing the structural hazard. Now, in clock cycle number four, the load instruction accesses the data memory while, in the same cycle, the instruction #3 accesses the instruction memory.

Figure 3.6: Structural hazard in MIPS: add an extra resource to the system.

Data Hazard in MIPS

Let's consider the code in Listing 3.4.

Listing 3.4: A code snippet to analyze data hazards.
1 add r1 , r2 , r3
2 sub r4 , r1 , r3
3 and r6 , r1 , r7
4 or r8 , r1 , r9
5 xor r10 , r1 , r11

Notice that all instructions after the add (line one) use its result. The add writes the value of r1 in the write-back pipeline stage, clock cycle number five, as illustrated in Fig. 3.7. Given this, there is a data hazard with respect to the register r1.

The register file (Reg) is used as a source in the ID stage and as a destination in the WB stage. A register is read in the second half of the cycle and written in the first half (halves shown in gray on the register file Reg, Fig. 3.7). Thus, there is a RAW hazard involving the instructions sub and and from Listing 3.4, as shown in Fig. 3.7: r1 has not actually been written yet when sub and and read it. The register r1 is only written in the first half of the WB stage of the add instruction.

Figure 3.7: Data hazards in MIPS.

A possible workaround is the hardware stall. In this case, the hardware does not change the program counter - PC; instead, it keeps fetching the same instruction and sets the control signals to harmless values, such as zero. As illustrated in Fig. 3.8, there is a two-stall penalty to fix the referred data hazard. Thus, r1 is written in the first half of WB and read in the second half of the ID stage (clock cycle number five).

Figure 3.8: A possible workaround for the data hazard: hardware stalls.

Another possible workaround for this data hazard is a fix by software, in other words, inserting independent instructions in between. In the worst-case scenario, nop instructions are inserted, as illustrated in Fig. 3.9 and sketched right after it.
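A compiler-side sketch of that worst-case software fix: when no independent instruction can be found, pad with enough nops so that a consumer reads a register only after its producer has written it (two slots in this pipeline, since ID reads in the second half of the cycle while WB writes in the first). The encoding of instructions as (dest, sources) tuples is an assumption for illustration:

def insert_nops(instructions, slots=2):
    # instructions: list of (dest, sources) tuples in program order.
    out = []
    for dest, srcs in instructions:
        # Look back at the last `slots` emitted instructions for a
        # producer of any source register of the current instruction.
        for back in range(1, slots + 1):
            if len(out) >= back:
                prev = out[-back]
                if prev != "nop" and prev[0] in srcs:
                    out.extend(["nop"] * (slots - back + 1))  # pad
                    break
        out.append((dest, srcs))
    return out

# add r1,r2,r3 ; sub r4,r1,r3  ->  add, nop, nop, sub (as in Fig. 3.9)
print(insert_nops([("r1", {"r2", "r3"}), ("r4", {"r1", "r3"})]))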
Figure 3.9: Another possible workaround for the data hazard: software nop's.

Minimizing Data Hazard Stalls by Forwarding

The forwarding technique minimizes the stalls due to data hazards in the pipeline. Forwarding is a simple hardware technique, also known as hardware bypassing or short-circuiting.

Let's consider the previous RAW hazard from Listing 3.4. The key point in forwarding is: the result is not really needed by the instruction sub until after the instruction add actually produces it. Then, what if that result could be moved from the pipeline register where the add stores it to where the sub needs it? In this case, the need for a stall can be avoided.

As seen in Fig. 3.10, the ALU result from the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs. The forwarding hardware detects whether the previous ALU operation wrote the register corresponding to a source of the current ALU operation. If so, the control logic selects the forwarded result as one of the ALU inputs, instead of the value read from the register file.

If, for some reason, the sub is stalled, the add completes and the bypass is not activated. This also holds when an interrupt occurs between these two instructions.

Figure 3.10: Pipeline registers (gray rectangles) from left to right: IF/ID, ID/EX, EX/MEM, and MEM/WB. The ALU output stored in the EX/MEM register is forwarded to one ALU input for the instruction sub. Also, the ALU result in the MEM/WB register is forwarded to one ALU input for the instruction and. In clock cycle number five there is no forwarding: the register is written in the first half of the cycle and read in the second.

But, unfortunately, not all potential data hazards can be managed by the forwarding technique. Let's consider the code shown in Listing 3.5.

Listing 3.5: A case not managed by the forwarding technique.
1 ld r1 ,0( r2 )
2 sub r4 , r1 , r5
3 and r6 , r1 , r7
4 or r8 , r1 , r9

The problem is that the instruction ld (load) does not have the data until the end of its MEM cycle (clock cycle number four, Fig. 3.11), while the instruction sub needs the data by the beginning of that same clock cycle. Therefore, this case needs a stall.

Figure 3.11: The instruction sub needs the data at the beginning of clock cycle four, but the data is only ready at the end of that cycle, i.e., when the memory access actually completes.

Hardware Changes for Forwarding

To implement the forwarding technique, the hardware has to accommodate the following changes: (i) ALU output at the end of the EX stage; (ii) memory output at the end of the MEM stage; and (iii) ALU output at the end of the MEM stage, as shown in Fig. 3.12. The hardware also has to widen the multiplexers to include the new paths from the pipeline registers (dotted lines in Fig. 3.12). Assuming register writing occurs in the first half of the cycle and reading occurs in the second half, there is no need to forward the WB output; otherwise, an extra forwarding path would be needed.

Figure 3.12: Hardware forwarding including the three new paths.
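The decision made by such a forwarding unit fits in a few lines. This is an illustrative model, not the notes' hardware description: each ALU operand is taken from EX/MEM if that register's destination matches, then from MEM/WB, and only otherwise from the register file; the one case forwarding cannot fix, a load followed immediately by a consumer, is flagged for a stall:

def select_operand(src, ex_mem, mem_wb, regfile):
    # src: source register name; ex_mem / mem_wb: (dest, value) pairs
    # held in those pipeline registers, or None if empty.
    if ex_mem and ex_mem[0] == src:
        return ex_mem[1]   # bypass from EX/MEM (youngest result wins)
    if mem_wb and mem_wb[0] == src:
        return mem_wb[1]   # bypass from MEM/WB
    return regfile[src]    # no hazard: use the register file value

def needs_load_use_stall(prev_is_load, prev_dest, cur_sources):
    # A load's data arrives only at the end of MEM, one cycle too
    # late for the very next instruction (Fig. 3.11): stall once.
    return prev_is_load and prev_dest in cur_sources

regs = {"r2": 7}
print(select_operand("r2", ("r1", 42), None, regs))    # 7: register file
print(select_operand("r1", ("r1", 42), None, regs))    # 42: EX/MEM bypass
print(needs_load_use_stall(True, "r1", {"r1", "r5"}))  # True: must stall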
Control Hazard

Control hazards can cause greater performance losses than data hazards. When a conditional branch is executed, it may or may not change the program counter - PC to something other than its current value plus four (PC = PC + 4), i.e., the next instruction. If a conditional branch changes the PC to the branch's target address, it is a taken branch; otherwise, it is an untaken branch. If the instruction i is a taken branch, then the PC is normally not changed until the end of the ID cycle, i.e., after the completion of the address calculation and the comparison.

A simple method for coping with branches is to redo the fetch of the instruction following the branch. Table 3.7 illustrates this case. The first IF of the branch successor may be seen as a stall, since the branch is only detected during ID. This scheme is regarded as freezing or flushing the pipeline.

Table 3.7: Conditional branch causing a one-cycle stall in a five-stage pipeline.

CC:                   1   2   3   4   5   6   7
Branch instruction    IF  ID  EX  MEM WB
Branch successor          IF  IF  ID  EX  MEM WB
Branch successor + 1              IF  ID  EX  MEM
Branch successor + 2                  IF  ID  EX

The performance loss is around 10% to 30% for one stall cycle per branch (see the short cost estimate sketched at the end of this section). Some schemes can be used in this case, namely predicted-not-taken, predicted-taken, and delayed branch.

The Predicted-Not-Taken Scheme

The predicted-not-taken scheme is illustrated in Fig. 3.13. When a branch is untaken, which is verified during the ID cycle, the scheme fetches and falls through ordinarily (top of Fig. 3.13). When a branch is taken, also verified during ID, the fetch is redone at the branch target.

Figure 3.13: The predicted-not-taken scheme.

The Predicted-Taken Scheme

As the name says, this scheme treats every branch as taken. In this five-stage MIPS pipeline, however, the target address is not known any earlier than the branch outcome (i.e., taken or untaken). Therefore, there is no advantage in this approach for this pipeline.

Delayed Branch Scheme

The delayed branch scheme was heavily used in the early RISC processors (Listing 3.6).

Listing 3.6: Example of pseudo-code for the delayed branch scheme.
1 branch instruction
2 sequential successor ( i + 1)
3 branch target if taken

The sequential successor sits in the branch delay slot, i.e., that instruction is executed whether or not the branch is taken. Most processors implementing this technique use a single instruction delay, although longer delays are possible. As seen in Fig. 3.14, the instructions in the delay slot are always executed. If the branch is untaken, execution continues with the instruction after the branch delay instruction (top of Fig. 3.14). Otherwise, execution continues at the branch target (bottom of Fig. 3.14).

Figure 3.14: Delayed branch scheme.

Question: what if the instruction in the branch delay slot is also a branch? Think about this...

There is a lot of room for compilers to play in the delayed branch technique. Optimizing compilers should choose the instructions to be placed after these branch instructions (i.e., in the branch delay slot), and those instructions must be effectively executed.
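The 10-30% figure quoted above follows from the effective CPI: each branch adds its stall cycles to the ideal CPI of one. A back-of-the-envelope sketch (the 20% branch frequency is an assumed figure, not from the notes):

def effective_cpi(base_cpi, branch_freq, branch_penalty):
    # Each branch contributes `branch_penalty` stall cycles on average.
    return base_cpi + branch_freq * branch_penalty

# Ideal pipelined CPI of 1, one stall cycle per branch:
print(effective_cpi(1.0, 0.20, 1))  # 1.2 -> 20% slower, within 10-30%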
Is it possible to eliminate the delay in branch instructions? A branch-prediction scheme[a] guesses the outcome of the branch condition and proceeds as if that guess were correct. The premise here is that the processor state cannot be affected should any errors occur in the prediction. Prediction is very good for performance when the prediction hit rates are good. In many cases this is possible, such as: the end-of-loop test is always false, except in the last iteration; and searches fail in all iterations, except possibly the last. These techniques are generally used in superscalar processors, which are addressed in later chapters.

[a] A Critical Intel Flaw Breaks Basic Security for Most Computers. Wired, 2018. https://www.wired.com/story/critical-intel-flaw-breaks-basic-security-for-most-computers

Wrap-up

The pipelined processor can be considered an enhancement of the multiple clock cycle processor. In this processor, each functional unit can be used only once per instruction, and every instruction must use the functional units at the same stages as all other instructions. This brings considerable performance improvements, but it also brings new "problems", e.g., the structural, data, and control hazards.

MIPS Simple Multiple-Cycle Implementation

Every MIPS instruction can be implemented in at most five clock cycles, namely:

1. instruction fetch - IF;
2. instruction decode/register fetch - ID;
3. execution/effective address - EX;
4. memory access/branch completion - MEM; and
5. write-back - WB.

If needed, review the instruction types and layouts as described in Chapter 2. Fig. 3.15 illustrates the multiple-cycle datapath. Next, the operations in each cycle are detailed from the multiple-cycle implementation perspective, along with illustrative pseudo-code.

Figure 3.15: Instruction flow through the datapath, considering the multiple-cycle implementation. Notice that the PC register is written/updated during the memory access clock cycle, and the other registers are written/updated during the write-back clock cycle.

The IF Cycle

During the instruction fetch cycle (Listing 3.7), the program counter register PC is sent out to fetch an instruction from the instruction memory (IMem) into the instruction register IR. Then, a temporary register, the new program counter NPC, gets the PC incremented by 4. NPC holds the address of the next sequential instruction in memory. IR holds the instruction that will be worked on in the subsequent clock cycles, while NPC holds the next sequential PC.

Listing 3.7: Internal operations during the IF cycle.
1 IR <- IMem [ PC ]
2 NPC <- PC + 4
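A minimal executable rendering of the two transfers in Listing 3.7, with an invented dictionary-based machine state purely for illustration:

# Hypothetical machine state: registers plus an instruction memory
# indexed by byte address (one instruction word every 4 bytes).
state = {"PC": 0, "IR": None, "NPC": None}
imem = {0: "add r1, r2, r3", 4: "sub r4, r1, r3"}

def if_cycle(state, imem):
    state["IR"] = imem[state["PC"]]  # IR <- IMem[PC]
    state["NPC"] = state["PC"] + 4   # NPC <- PC + 4

if_cycle(state, imem)
print(state)  # {'PC': 0, 'IR': 'add r1, r2, r3', 'NPC': 4}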
