Full Transcript

Appendix C Pipelining: Basic and Intermediate Concepts

[Figure C.17: bar chart of the frequency of mispredictions (axis 0% to 18%) for the SPEC89 benchmarks, given as 4096 entries / unlimited entries, 2 bits per entry: nasa7 1% / 0%; matrix300 0% / 0%; tomcatv 1% / 0%; doduc 5% / 5%; spice 9% / 9%; fpppp 9% / 9%; gcc 12% / 11%; espresso 5% / 5%; eqntott 18% / 18%; li 10% / 10%.]

Figure C.17 Prediction accuracy of a 4096-entry 2-bit prediction buffer versus an infinite buffer for the SPEC89 benchmarks. Although these data are for an older version of a subset of the SPEC benchmarks, the results would be comparable for newer versions, with perhaps as many as 8K entries needed to match an infinite 2-bit predictor.

C.3 How Is Pipelining Implemented?

Before we proceed to basic pipelining, we need to review a simple implementation of an unpipelined version of RISC V.

A Simple Implementation of RISC V

In this section we follow the style of Section C.1, showing first a simple unpipelined implementation and then the pipelined implementation. This time, however, our example is specific to the RISC V architecture.

In this subsection, we focus on a pipeline for an integer subset of RISC V that consists of load-store word, branch equal, and integer ALU operations. Later in this appendix we will incorporate the basic floating-point operations. Although we discuss only a subset of RISC V, the basic principles can be extended to handle all the instructions; for example, adding store involves some additional computing of the immediate field. We initially use a less aggressive implementation of a branch instruction and show how to implement the more aggressive version at the end of this section.

Every RISC V instruction can be implemented in, at most, 5 clock cycles. The 5 clock cycles are as follows:

1. Instruction fetch cycle (IF):

   IR ← Mem[PC];
   NPC ← PC + 4;

   Operation: Send out the PC and fetch the instruction from memory into the instruction register (IR); increment the PC by 4 to address the next sequential instruction. The IR is used to hold the instruction that will be needed on subsequent clock cycles; likewise, the register NPC is used to hold the next sequential PC.

2. Instruction decode/register fetch cycle (ID):

   A ← Regs[rs1];
   B ← Regs[rs2];
   Imm ← sign-extended immediate field of IR;

   Operation: Decode the instruction and access the register file to read the registers (rs1 and rs2 are the register specifiers). The outputs of the general-purpose registers are read into two temporary registers (A and B) for use in later clock cycles. The immediate field of the IR (bits 31:20 for loads and ALU immediates) is also sign extended and stored into the temporary register Imm, for use in the next cycle.

   Decoding is done in parallel with reading registers, which is possible because the register specifier fields are at a fixed location in the RISC V instruction format. Because the immediate portion of a load and of an ALU immediate is located in the same place in every RISC V instruction, the sign-extended immediate is also calculated during this cycle in case it is needed in the next cycle. For stores, a separate sign extension is needed, because the immediate field is split into two pieces.

3. Execution/effective address cycle (EX):

   The ALU operates on the operands prepared in the prior cycle, performing one of four functions depending on the RISC V instruction type:

   Memory reference:
   ALUOutput ← A + Imm;
   Operation: The ALU adds the operands to form the effective address and places the result into the register ALUOutput.

   Register-register ALU instruction:
   ALUOutput ← A func B;
   Operation: The ALU performs the operation specified by the function code (a combination of the funct3 and funct7 fields) on the value in register A and on the value in register B. The result is placed in the temporary register ALUOutput.

   Register-immediate ALU instruction:
   ALUOutput ← A op Imm;
   Operation: The ALU performs the operation specified by the opcode on the value in register A and on the value in register Imm. The result is placed in the temporary register ALUOutput.

   Branch:
   ALUOutput ← NPC + (Imm
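
The register transfers described above for the IF, ID, and EX cycles can be sketched in software. The following is a minimal illustration in Python, not the book's implementation: the function names (fetch, decode, execute, run_three_cycles) and the encode_i and encode_r helpers are hypothetical, memory is assumed to be a dictionary from byte address to 32-bit instruction word, and only load, addi, add, and sub are handled. The opcode and field positions follow the standard RV32I encoding. The branch case and the remaining cycles are left out because the text above does not describe them in full.

```python
MASK32 = 0xFFFFFFFF

# Standard RV32I major opcodes (bits 6:0 of the instruction).
OP_LOAD, OP_IMM, OP_REG = 0x03, 0x13, 0x33


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def encode_i(opcode, rd, funct3, rs1, imm):
    """Hypothetical helper: build an I-type word (immediate in bits 31:20)."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def encode_r(opcode, rd, funct3, rs1, rs2, funct7):
    """Hypothetical helper: build an R-type word."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def fetch(mem, pc):
    """IF: IR <- Mem[PC]; NPC <- PC + 4."""
    return mem[pc], pc + 4


def decode(regs, ir):
    """ID: A <- Regs[rs1]; B <- Regs[rs2]; Imm <- sign-extended immediate."""
    rs1 = (ir >> 15) & 0x1F
    rs2 = (ir >> 20) & 0x1F
    # The immediate is computed unconditionally, in case EX needs it.
    imm = sign_extend(ir >> 20, 12)
    return regs[rs1], regs[rs2], imm


def execute(ir, a, b, imm):
    """EX: compute ALUOutput for the instruction types handled here."""
    opcode = ir & 0x7F
    funct3 = (ir >> 12) & 0x7
    funct7 = (ir >> 25) & 0x7F
    if opcode == OP_LOAD:                      # memory reference: A + Imm
        return (a + imm) & MASK32
    if opcode == OP_IMM and funct3 == 0:       # addi: A op Imm
        return (a + imm) & MASK32
    if opcode == OP_REG and funct3 == 0:       # add/sub: A func B
        if funct7 == 0x00:
            return (a + b) & MASK32
        if funct7 == 0x20:
            return (a - b) & MASK32
    raise NotImplementedError("instruction outside this sketch's subset")


def run_three_cycles(mem, regs, pc):
    """Run IF, ID, and EX for the instruction at pc; return (ALUOutput, NPC)."""
    ir, npc = fetch(mem, pc)
    a, b, imm = decode(regs, ir)
    return execute(ir, a, b, imm), npc
```

Each function corresponds to one clock cycle, and the values passed between them play the role of the temporary registers IR, NPC, A, B, Imm, and ALUOutput. Note that decode reads both registers and computes the immediate for every instruction, whether or not the instruction uses them, mirroring the fixed-field decoding described in the ID cycle.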
