
Datapaths and Stalls

Figure 3.12: Hardware forwarding including three new paths.

Control Hazard

Control hazards can cause greater performance losses than data hazards. When a conditional branch is executed, it may or may not change the program counter (PC) to a value other than its current value plus four (PC=PC+4), i.e., the address of the next instruction. If a conditional branch changes the PC to the branch’s target address, it is a taken branch; otherwise it is an untaken branch. If instruction i is a taken branch, the PC is normally not changed until the end of the ID cycle, i.e., after the address calculation and comparison have completed.

A simple method for coping with branches is to redo the fetch of the instruction following a branch instruction. Table 3.7 illustrates this case. The first IF (in the first branch successor) may be seen as a stall, as the branch is only detected during ID. This scheme is known as freezing or flushing the pipeline.

Table 3.7: Conditional branch causing a one-cycle stall in a five-stage pipeline.

| Instruction | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 |
|---|---|---|---|---|---|---|---|
| Branch instruction | IF | ID | EX | MEM | WB | | |
| Branch successor | | IF | IF | ID | EX | MEM | WB |
| Branch successor + 1 | | | | IF | ID | EX | MEM |
| Branch successor + 2 | | | | | IF | ID | EX |

The performance loss is around 10-30% for one stall cycle per branch. Some schemes can be used to reduce it, namely predicted-not-taken, predicted-taken, and delayed branch.

The Predicted-Not-Taken Scheme

The predicted-not-taken scheme is illustrated in Fig. 3.13. When a branch turns out to be untaken, as determined during the ID cycle, the already-fetched fall-through instruction simply proceeds as usual (top of Fig. 3.13). When a branch turns out to be taken, as determined during ID, the fetch is redone at the branch target.

The Predicted-Taken Scheme

As the name suggests, this scheme treats every branch as taken. In this five-stage MIPS pipeline, however, the target address is not known any earlier than the branch outcome (i.e., taken or untaken). Therefore, this approach offers no advantage for this pipeline.
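The cost of the one-cycle branch stall, and the saving from the predicted-not-taken scheme, can be estimated with a small calculation. The following is a minimal sketch, assuming an ideal CPI of 1 with branch stalls as the only source of extra cycles. The function name `cpi_with_branch_stalls`, its parameters, and the workload numbers (20% branches, 60% of them taken) are illustrative assumptions, not values from the text.

```python
def cpi_with_branch_stalls(branch_freq, penalty_cycles=1, stalling_fraction=1.0):
    """Average cycles per instruction: ideal CPI of 1 plus branch stalls.

    branch_freq       -- fraction of instructions that are branches (assumed)
    penalty_cycles    -- stall cycles paid by a branch that stalls
    stalling_fraction -- fraction of branches that actually pay the penalty:
                         1.0 for freeze/flush (every branch stalls),
                         the taken fraction for predicted-not-taken
    """
    return 1.0 + branch_freq * stalling_fraction * penalty_cycles


# Hypothetical workload: 20% branches, 60% of them taken.
flush_cpi = cpi_with_branch_stalls(0.20)
pnt_cpi = cpi_with_branch_stalls(0.20, stalling_fraction=0.60)

print(f"freeze/flush CPI:        {flush_cpi:.2f}")
print(f"predicted-not-taken CPI: {pnt_cpi:.2f}")
```

Under these assumed numbers, every branch costs one extra cycle with freeze/flush, whereas predicted-not-taken only pays that cycle for taken branches, which is why its CPI comes out lower.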
3 Instruction-Level Parallelism

Figure 3.13: The predicted-not-taken scheme.

Delayed Branch Scheme

The delayed branch scheme was heavily used in the early RISC processors (Listing 3.6).

Listing 3.6: Example of pseudo-code for the delayed branch scheme.

    branch instruction
    sequential successor (i + 1)
    branch target if taken

The sequential successor is in the branch delay slot, i.e., that instruction is executed whether or not the branch is taken. Most processors implementing this technique use a single-instruction delay, although longer delays are possible.

As seen in Fig. 3.14, the instructions in the delay slot are always executed. If the branch is untaken, execution continues with the instruction after the branch delay instruction (top of Fig. 3.14). Otherwise, execution continues at the branch target (bottom of Fig. 3.14).

Figure 3.14: Delayed branch scheme.

Question: what if the instruction in the branch delay slot is also a branch? Think about this...

The delayed branch technique leaves compilers a lot of room to work. An optimizing compiler should choose which instructions to place after each branch instruction (i.e., in the branch delay slot), and those instructions must be valid and useful, since they are always executed.

Is it possible to eliminate the delay in branch instructions? A branch-prediction scheme [a] guesses the outcome of the branch condition and proceeds as if the guess were correct. The premise is that the processor state must not be affected if the prediction turns out to be wrong.

Prediction is very good for performance when the prediction hit rate is high. In many cases this is achievable, for example: an end-of-loop test is always false, except at the last iteration; and a search fails in all iterations, except possibly in the last. These techniques are generally used in superscalar processors, which are addressed in later chapters.
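The end-of-loop example above can be made concrete by measuring how often a fixed guess matches the actual outcomes of a branch condition. This is a minimal sketch of a static (always the same guess) predictor, not a description of any particular processor. The helper `static_hit_rate` and the 10-iteration loop are hypothetical.

```python
def static_hit_rate(outcomes, predicted):
    """Fraction of branch-condition outcomes that match a fixed guess."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    hits = sum(1 for outcome in outcomes if outcome == predicted)
    return hits / len(outcomes)


# End-of-loop test for a hypothetical 10-iteration loop:
# false on the first 9 iterations, true only on the last one.
end_of_loop_test = [False] * 9 + [True]

rate = static_hit_rate(end_of_loop_test, predicted=False)
print(f"hit rate when always guessing 'false': {rate:.0%}")
```

The longer the loop runs, the closer the hit rate of this fixed guess gets to 100%, since only the final iteration is mispredicted.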
[a] A Critical Intel Flaw Breaks Basic Security for Most Computers. Wired. 2018. https://www.wired.com/story/critical-intel-flaw-breaks-basic-security-for-most-computers

Wrap-up

The pipelined processor can be seen as an enhancement of the multiple-cycle processor. In this processor, each functional unit can be used only once per instruction, and every instruction must use a given functional unit in the same stage as all other instructions. This brings considerable performance improvements, but it also brings new “problems”, e.g., structural, data, and control hazards.

MIPS Simple Multiple-Cycle Implementation

Every MIPS instruction can be implemented in at most five clock cycles, namely:

1. instruction fetch - IF;
2. instruction decode/register fetch - ID;
3. execution/effective address - EX;
4. memory access/branch completion - MEM; and
5. write-back - WB.

If needed, review the instruction types and layouts described in Chapter 2.

Fig. 3.15 illustrates the multiple-cycle datapath. Next, the operations in each cycle, along with illustrative pseudo-code, are detailed from the multiple-cycle implementation perspective.

The IF Cycle

During the instruction fetch cycle (Listing 3.7), the program counter register PC is sent out to fetch an instruction from the instruction memory (IMem), which is placed into the instruction register IR. Then, a temporary register named new program counter (NPC) gets the PC incremented by 4, i.e., the address of the next sequential instruction in memory. The IR holds the instruction that will be worked on in subsequent clock cycles, while NPC holds the next sequential PC.

Figure 3.15: Instructions flow through the datapath, considering the multiple-cycle implementation. Notice that the PC register is written/updated during the memory access clock cycle, and the other registers are written/updated during the write-back clock cycle.

Listing 3.7: Internal operations during the IF cycle.

    IR
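The IF-cycle operations described in the prose above can be mimicked in a few lines of Python. This is a minimal sketch based only on that description and is not a reconstruction of the book's listing. The function name `instruction_fetch`, the dictionary used as instruction memory, and the instruction words are all hypothetical placeholders.

```python
def instruction_fetch(imem, pc):
    """IF cycle: read the instruction at PC and compute the next sequential PC."""
    ir = imem[pc]   # IR gets the instruction word stored at address PC
    npc = pc + 4    # NPC gets the address of the next sequential instruction
    return ir, npc


# Hypothetical instruction memory: byte address -> 32-bit instruction word.
# The words are arbitrary placeholders, not real MIPS encodings.
imem = {0: 0x11111111, 4: 0x22222222, 8: 0x33333333}

ir, npc = instruction_fetch(imem, pc=4)
print(hex(ir), npc)
```

Note that the sketch returns NPC instead of overwriting PC, mirroring the text: the PC itself is only updated later, during the memory access cycle.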
