RISC-V Processor Implementation PDF
Document Details
Uploaded by MagnanimousCloisonnism
Vilnius University
Summary
This document details the design and implementation of a simple RISC-V processor datapath. It covers topics such as ALU control and the organization of instructions (I-type, R-type) and includes a truth table and multiple figures related to processor architecture and implementation. The keywords also relate to the fields of computer science and computer architecture.
Full Transcript
Now that we have completed this simple datapath, we can add the control unit. The control unit must be able to take inputs and generate a write signal for each state element, the selector control for each multiplexor, and the ALU control. The ALU control is different in a number of ways, and it will be useful to design it first before we design the rest of the control unit.
Elaboration: The immediate generation logic must choose between sign-extending a 12-bit field in instruction bits 31:20 for load instructions, bits 31:25 and 11:7 for store instructions, or bits 31, 7, 30:25, and 11:8 for the conditional branch. Since the input is all 32 bits of the instruction, it can use the opcode bits of the instruction to select the proper field. RISC-V opcode bit 6 happens to be 0 for data transfer instructions and 1 for conditional branches, and RISC-V opcode bit 5 happens to be 0 for load instructions and 1 for store instructions. Thus, bits 5 and 6 can control a 3:1 multiplexor inside the immediate generation logic that selects the appropriate 12-bit field for load, store, and conditional branch instructions.
4.4 A Simple Implementation Scheme
In this section, we look at what might be thought of as a simple implementation of our RISC-V subset. We build this simple implementation using the datapath of the last section and adding a simple control function. This simple implementation covers load word (lw), store word (sw), branch if equal (beq), and the arithmetic-logical instructions add, sub, and, and or.
The ALU Control
The RISC-V ALU in Appendix A defines the following four combinations of four control inputs:
ALU control lines    Function
0000                 AND
0001                 OR
0010                 add
0110                 subtract
Depending on the instruction class, the ALU will need to perform one of these four functions. For load and store instructions, we use the ALU to compute the memory address by addition.
For the R-type instructions, the ALU needs to perform one of the four actions (AND, OR, add, or subtract), depending on the value of the 7-bit funct7 field (bits 31:25) and 3-bit funct3 field (bits 14:12) in the instruction (see Chapter 2). For the conditional branch if equal instruction, the ALU subtracts two operands and tests to see if the result is 0.
We can generate the 4-bit ALU control input using a small control unit that has as inputs the funct7 and funct3 fields of the instruction and a 2-bit control field, which we call ALUOp. ALUOp indicates whether the operation to be performed should be add (00) for loads and stores, subtract and test if zero (01) for beq, or be determined by the operation encoded in the funct7 and funct3 fields (10). The output of the ALU control unit is a 4-bit signal that directly controls the ALU by generating one of the 4-bit combinations shown previously.
In Figure 4.12, we show how to set the ALU control inputs based on the 2-bit ALUOp control, funct7, and funct3 fields. Later in this chapter, we will see how the ALUOp bits are generated from the main control unit.
This style of using multiple levels of decoding (that is, the main control unit generates the ALUOp bits, which then are used as input to the ALU control that generates the actual signals to control the ALU) is a common implementation technique. Using multiple levels of control can reduce the size of the main control unit. Using several smaller control units may also potentially reduce the latency of the control unit. Such optimizations are important, since the latency of the control unit is often a critical factor in determining the clock cycle time.
There are several different ways to implement the mapping from the 2-bit ALUOp field and the funct fields to the four ALU operation control bits.
Because only a small number of the possible funct field values are of interest and funct fields are used only when the ALUOp bits equal 10, we can use a small piece of logic that recognizes the subset of possible values and generates the appropriate ALU control signals.
FIGURE 4.12 How the ALU control bits are set depends on the ALUOp control bits and the different opcodes for the R-type instruction. The instruction, listed in the first column, determines the setting of the ALUOp bits. All the encodings are shown in binary. Notice that when the ALUOp code is 00 or 01, the desired ALU action does not depend on the funct7 or funct3 fields; in this case, we say that we “don’t care” about the value of the opcode, and the bits are shown as Xs. When the ALUOp value is 10, then the funct7 and funct3 fields are used to set the ALU control input. See Appendix A.
As a step in designing this logic, it is useful to create a truth table for the interesting combinations of funct fields and the ALUOp signals, as we’ve done in Figure 4.13; this truth table shows how the 4-bit ALU control is set depending on these input fields.
truth table: From logic, a representation of a logical operation by listing all the values of the inputs and then in each case showing what the resulting outputs should be.
Since the full truth table is very large, and we don’t care about the value of the ALU control for many of these input combinations, we show only the truth table entries for which the ALU control must have a specific value. Throughout this chapter, we will use this practice of showing only the truth table entries for outputs that must be asserted and not showing those that are all deasserted or don’t care. (This practice has a disadvantage, which we discuss in Section C.2 of Appendix C.) Because in many instances we do not care about the values of some of the inputs, and because we wish to keep the tables compact, we also include don’t-care terms.
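The two-level decoding described above can be sketched in a few lines of Python. This is only an illustration, not the gate-level design the text defers to Appendix C: the function name alu_control is our own, and the funct7/funct3 encodings for add, sub, and, and or are taken from the RISC-V base ISA rather than from this transcript, since the table of Figure 4.12 is not reproduced here. The 4-bit output codes are those listed in the ALU control table.

```python
def alu_control(alu_op: int, funct7: int = 0, funct3: int = 0) -> int:
    """Map the 2-bit ALUOp plus the funct fields to the 4-bit ALU control code.

    Hypothetical helper for illustration; encodings assumed from the RISC-V base ISA.
    """
    if alu_op == 0b00:
        # lw / sw: always add to form the address; funct fields are don't-cares.
        return 0b0010
    if alu_op == 0b01:
        # beq: always subtract; funct fields are don't-cares.
        return 0b0110
    if alu_op == 0b10:
        # R-type: only instruction bit 30 (bit 5 of funct7) and funct3 matter.
        bit30 = (funct7 >> 5) & 1
        r_type = {
            (0, 0b000): 0b0010,  # add
            (1, 0b000): 0b0110,  # sub
            (0, 0b111): 0b0000,  # and
            (0, 0b110): 0b0001,  # or
        }
        try:
            return r_type[(bit30, funct3)]
        except KeyError:
            raise ValueError("funct fields outside the add/sub/and/or subset")
    raise ValueError("ALUOp encoding 11 is not used in this design")


# Usage: the funct inputs do not affect the result when ALUOp is 00.
print(format(alu_control(0b00, funct7=0b1111111, funct3=0b111), "04b"))
print(format(alu_control(0b10, funct7=0b0100000, funct3=0b000), "04b"))
```

Written this way, the don’t-care inputs show up as arguments that a branch of the function never reads.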
A don’t-care term in this truth table (represented by an X in an input column) indicates that the output does not depend on the value of the input corresponding to that column. For example, when the ALUOp bits are 00, as in the first row of Figure 4.13, we always set the ALU control to 0010, independent of the funct fields. In this case, then, the funct inputs will be don’t cares in this line of the truth table. Later, we will see examples of another type of don’t-care term. If you are unfamiliar with the concept of don’t-care terms, see Appendix A for more information.
don’t-care term: An element of a logical function in which the output does not depend on the values of all the inputs. Don’t-care terms may be specified in different ways.
Once the truth table has been constructed, it can be optimized and then turned into gates. This process is completely mechanical. Thus, rather than show the final steps here, we describe the process and the result in Section C.2 of Appendix C.
FIGURE 4.13 The truth table for the 4 ALU control bits (called Operation). The inputs are the ALUOp and funct fields. Only the entries for which the ALU control is asserted are shown. Some don’t-care entries have been added. For example, the ALUOp does not use the encoding 11, so the truth table can contain entries 1X and X1, rather than 10 and 01. While we show all 10 bits of funct fields, note that the only bits with different values for the four R-format instructions are bits 30, 14, 13, and 12. Thus, we only need these four funct field bits as input for ALU control instead of all 10.
Designing the Main Control Unit
Now that we have described how to design an ALU that uses the opcode and a 2-bit signal as its control inputs, we can return to looking at the rest of the control. To start this process, let’s identify the fields of an instruction and the control lines that are needed for the datapath we constructed in Figure 4.11. To understand how to connect the fields of an instruction to the datapath, it is useful to review the formats of the four instruction classes: arithmetic, load, store, and conditional branch instructions. Figure 4.14 shows these formats.
FIGURE 4.14 The four instruction classes (arithmetic, load, store, and conditional branch) use four different instruction formats. (a) Instruction format for R-type arithmetic instructions (opcode = 51 in base ten), which have three register operands: rs1, rs2, and rd. Fields rs1 and rs2 are sources, and rd is the destination. The ALU function is in the funct3 and funct7 fields and is decoded by the ALU control design in the previous section. The R-type instructions that we implement are add, sub, and, and or. (b) Instruction format for I-type load instructions (opcode = 3 in base ten). The register rs1 is the base register that is added to the 12-bit immediate field to form the memory address. Field rd is the destination register for the loaded value. (c) Instruction format for S-type store instructions (opcode = 35 in base ten). The register rs1 is the base register that is added to the 12-bit immediate field to form the memory address. (The immediate field is split into a 7-bit piece and a 5-bit piece.) Field rs2 is the source register whose value should be stored into memory. (d) Instruction format for SB-type conditional branch instructions (opcode = 99 in base ten). The registers rs1 and rs2 are compared. The 12-bit immediate address field is sign-extended, shifted left 1 bit, and added to the PC to compute the branch target address. Figures 4.17 and 4.18 give the rationale for the unusual bit ordering for SB-type.
opcode: The field that denotes the operation and format of an instruction.
There are several major observations about this instruction format that we will rely on:
- The opcode field, as we saw in Chapter 2, is always in bits 6:0.
- Depending on the opcode, the funct3 field (bits 14:12) and funct7 field (bits 31:25) serve as an extended opcode field.
- The first register operand is always in bit positions 19:15 (rs1) for R-type instructions and branch instructions. This field also specifies the base register for load and store instructions.
- The second register operand is always in bit positions 24:20 (rs2) for R-type instructions and branch instructions. This field also specifies the register operand that gets copied to memory for store instructions.
- Another operand can also be a 12-bit offset for branch or load-store instructions.
- The destination register is always in bit positions 11:7 (rd) for R-type instructions and load instructions.
The first design principle from Chapter 2, simplicity favors regularity, pays off here by simplifying control of the datapath.
Hardware/Software Interface: Compared with MIPS, RISC-V has instruction formats that look more complicated but actually simplify the hardware. This can even improve the clock cycle time of some RISC-V implementations, especially the pipelined versions we see in Section 4.6. Since compilers, assemblers, and debuggers hide details of the instruction format from the programmer, why not pick formats that help the hardware?
The first example is the store instruction format. Figure 4.15 shows the MIPS instruction formats for data transfer and arithmetic instructions and their impact on the datapath. MIPS requires a 2:1 multiplexor to specify which field supplies the number of the register to be written, which is unnecessary in RISC-V. That multiplexor could be on a critical timing path that would stretch the clock cycle time. To keep the destination register always in bits 11 to 7 of all instructions, the RISC-V S format has to split the immediate field into two pieces: bits 31 to 25 have immediate[11:5] and bits 11 to 7 have immediate[4:0].
It looks odd compared with MIPS, which keeps the immediate field contiguous, but the RISC-V assembler hides this complexity, and the hardware benefits.
MIPS R format (arithmetic):
op        rs        rt        rd        shamt     funct
6 bits    5 bits    5 bits    5 bits    5 bits    6 bits
MIPS I format (data transfers, immediates):
op        rs        rt        constant or address
6 bits    5 bits    5 bits    16 bits
FIGURE 4.15 The MIPS arithmetic instruction format, data transfer instruction format, and their impact on the MIPS datapath. For MIPS arithmetic instructions using the R format, rd is the destination register, rs is the first register operand, and rt is the second register operand. For MIPS load and immediate instructions, rs is still the first register operand, but rt is now the destination register. Hence the need for the 2:1 multiplexor to pick between the rd and rt fields to write the correct register.
The second example looks even odder. Figure 4.16 shows that RISC-V has two formats, SB and UJ, whose fields have the same sizes and serve as immediates just as in two other formats (S and U, respectively), but with the immediate bits swirled around. The SB and UJ formats once again simplify hardware by giving the assembler more work to do. The figures below show what the immediate generator hardware must do for RISC-V. Figure 4.17 shows which bits of the instruction correspond to the immediate field bits depending on the type of instruction, if hypothetically the conditional branch instructions used the S format instead of SB and the unconditional branch instructions used the U format instead of UJ. The last row shows the number of unique inputs per output bit, which determines the number of ports to a multiplexor in the immediate generator.
In contrast, Figure 4.18 shows the actual formats for the branch and jumps, and the reduced number of unique inputs. The SB and UJ formats reduce the multiplexors for immediate bits 19 to 12 from 3:1 to 2:1 multiplexors and immediate bits 10 to 1 from 4:1 to 2:1. Once again, RISC-V architects designed odd-looking but efficient formats, simplifying 18 1-bit multiplexors. The savings are insignificant for high-end processors but helpful for very low-end processors, and the only cost is borne by the assembler (and your author).
Using this information, we can add the instruction labels to the simple datapath. Figure 4.19 shows these additions plus the ALU control block, the write signals for state elements, the read signal for the data memory, and the control signals for the multiplexors. Since all the multiplexors have two inputs, they each require a single control line. Figure 4.19 shows six single-bit control lines plus the 2-bit ALUOp control signal.
FIGURE 4.16 The actual RISC-V formats. Figure 4.16 introduces R-, I-, S-, and U-types, which are straightforward.
FIGURE 4.17 Inputs to immediate if, hypothetically, conditional branches use the S format and jumps use the U format. Columns are immediate output bits; entries are the instruction input bits that drive them (iN is instruction bit N, and a range such as i30:i25 supplies one instruction bit per output bit, in order).
Instruction (format)        31     30:20      19:12      11     10:5       4:1        0
Load, Arith. Imm. (I)       i31    i31        i31        i31    i30:i25    i24:i21    i20
Store (S)                   i31    i31        i31        i31    i30:i25    i11:i8     i7
Cond. Branch (S)            i31    i31        i31        i30    i29:i24    i11:i8     0
Uncond. Jump (U)            i31    i31        i30:i23    i22    i21:i16    i15:i12    0
Load Upper Imm. (U)         i31    i30:i20    i19:i12    0      0          0          0
Unique inputs               1      2          3          4      4          4          3
FIGURE 4.18 Inputs to immediate given that branches use the SB format and jumps use the UJ format, which is what RISC-V uses.
Instruction (format)        31     30:20      19:12      11     10:5       4:1        0
Load, Arith. Imm. (I)       i31    i31        i31        i31    i30:i25    i24:i21    i20
Store (S)                   i31    i31        i31        i31    i30:i25    i11:i8     i7
Cond. Branch (SB)           i31    i31        i31        i7     i30:i25    i11:i8     0
Uncond. Jump (UJ)           i31    i31        i19:i12    i20    i30:i25    i24:i21    0
Load Upper Imm. (U)         i31    i30:i20    i19:i12    0      0          0          0
Unique inputs               1      2          2          4      2          3          3
FIGURE 4.19 The datapath of Figure 4.11 with all necessary multiplexors and all control lines identified. The control lines are shown in color. The ALU control block has also been added, which depends on the funct3 field and part of the funct7 field. The PC does not require a write control, since it is written once at the end of every clock cycle; the branch control logic determines whether it is written with the incremented PC or the branch target address.
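As a rough illustration of the bit selection that Figure 4.18 summarizes, the following Python sketch extracts the immediate for the three instruction classes in this subset that carry one (lw, sw, and beq). The helper names bits, sign_extend, and imm_gen are our own, not the book’s. One deliberate simplification: imm_gen returns the branch offset already shifted left by 1, whereas the datapath in the text performs that shift in separate hardware.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Return instruction bits hi..lo (inclusive) as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value: int, width: int) -> int:
    """Interpret an unsigned `width`-bit value as a two's-complement number."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def imm_gen(inst: int) -> int:
    """Select and sign-extend the immediate, using the opcode to pick the field."""
    opcode = bits(inst, 6, 0)
    if opcode == 0b0000011:  # lw (I-type): immediate is inst[31:20]
        return sign_extend(bits(inst, 31, 20), 12)
    if opcode == 0b0100011:  # sw (S-type): inst[31:25] and inst[11:7]
        return sign_extend((bits(inst, 31, 25) << 5) | bits(inst, 11, 7), 12)
    if opcode == 0b1100011:  # beq (SB-type): inst[31], inst[7], inst[30:25], inst[11:8]
        imm = (
            (bits(inst, 31, 31) << 12)
            | (bits(inst, 7, 7) << 11)
            | (bits(inst, 30, 25) << 5)
            | (bits(inst, 11, 8) << 1)
        )
        return sign_extend(imm, 13)  # 13 bits because bit 0 is the implied zero
    raise ValueError("instruction has no immediate in this subset")


# Usage: lw x1, 8(x2), assembled by hand from its fields.
lw = (8 << 20) | (2 << 15) | (0b010 << 12) | (1 << 7) | 0b0000011
print(imm_gen(lw))
```

Note that the S and SB cases take immediate bits 10:5 from the same instruction bits (30:25) and bits 4:1 from the same instruction bits (11:8), which is the sharing that lets the hardware use narrower multiplexors.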
We have already defined how the ALUOp control signal works, and it is useful to define informally what the six other control signals do before we determine how to set these control signals during instruction execution. Figure 4.20 describes the function of these six control lines.
FIGURE 4.20 The effect of each of the six control signals. When the 1-bit control to a two-way multiplexor is asserted, the multiplexor selects the input corresponding to 1. Otherwise, if the control is deasserted, the multiplexor selects the 0 input. Remember that the state elements all have the clock as an implicit input and that the clock is used in controlling writes. Gating the clock externally to a state element can create timing problems. (See Appendix A for further discussion of this problem.)
Now that we have looked at the function of each of the control signals, we can look at how to set them. The control unit can set all but one of the control signals based solely on the opcode and funct fields of the instruction. The PCSrc control line is the exception. That control line should be asserted if the instruction is branch if equal (a decision that the control unit can make) and the Zero output of the ALU, which is used for the equality test, is asserted. To generate the PCSrc signal, we will need to AND together a signal from the control unit, which we call Branch, with the Zero signal out of the ALU.
These eight control signals (six from Figure 4.20 and two for ALUOp) can now be set based on the input signals to the control unit, which are the opcode bits 6:0. Figure 4.21 shows the datapath with the control unit and the control signals.
Before we try to write a set of equations or a truth table for the control unit, it will be useful to try to define the control function informally. Because the setting of the control lines depends only on the opcode, we define whether each control signal should be 0, 1, or don’t care (X) for each of the opcode values.
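An informal definition of this kind can be written down as a lookup table keyed by opcode. The Python sketch below is a reconstruction from the behaviour this section describes for each instruction class, not a copy of the book’s table; the names CONTROL, main_control, pc_src, and next_pc are our own, and None stands in for a don’t-care (X) entry.

```python
# Control settings per 7-bit opcode. Values are assumptions reconstructed from
# the prose: R-type writes a register and uses funct decoding (ALUOp = 10);
# lw and sw add for the address (ALUOp = 00); beq subtracts (ALUOp = 01).
CONTROL = {
    0b0110011: dict(ALUSrc=0, MemtoReg=0, RegWrite=1, MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),  # R-type
    0b0000011: dict(ALUSrc=1, MemtoReg=1, RegWrite=1, MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),  # lw
    0b0100011: dict(ALUSrc=1, MemtoReg=None, RegWrite=0, MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),  # sw
    0b1100011: dict(ALUSrc=0, MemtoReg=None, RegWrite=0, MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),  # beq
}


def main_control(opcode: int) -> dict:
    """Return the control settings for one of the four supported opcodes."""
    return CONTROL[opcode]


def pc_src(branch: int, zero: int) -> int:
    """PCSrc is derived: the Branch control signal ANDed with the ALU Zero output."""
    return branch & zero


def next_pc(pc: int, branch_target: int, branch: int, zero: int) -> int:
    """Choose between PC + 4 and the branch target, as the PC multiplexor does."""
    return branch_target if pc_src(branch, zero) else pc + 4


# Usage: a beq whose operands are equal (Zero = 1) takes the branch.
signals = main_control(0b1100011)
print(next_pc(pc=100, branch_target=84, branch=signals["Branch"], zero=1))
```

Keeping PCSrc out of the table mirrors the text: it is the one signal that cannot be set from the opcode alone.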
Figure 4.22 defines how the control signals should be set for each opcode; this information follows directly from Figures 4.12, 4.20, and 4.21.
FIGURE 4.21 The simple datapath with the control unit. The input to the control unit is the 7-bit opcode field from the instruction. The outputs of the control unit consist of two 1-bit signals that are used to control multiplexors (ALUSrc and MemtoReg), three signals for controlling reads and writes in the register file and data memory (RegWrite, MemRead, and MemWrite), a 1-bit signal used in determining whether to possibly branch (Branch), and a 2-bit control signal for the ALU (ALUOp). An AND gate is used to combine the branch control signal and the Zero output from the ALU; the AND gate output controls the selection of the next PC. Notice that PCSrc is now a derived signal, rather than one coming directly from the control unit. Thus, we drop the signal name in subsequent figures.
FIGURE 4.22 The setting of the control lines is completely determined by the opcode fields of the instruction. The first row of the table corresponds to the R-format instructions (add, sub, and, and or). For all these instructions, the source register fields are rs1 and rs2, and the destination register field is rd; this defines how the signal ALUSrc is set. Furthermore, an R-type instruction writes a register (RegWrite = 1), but neither reads nor writes data memory. When the Branch control signal is 0, the PC is unconditionally replaced with PC + 4; otherwise, the PC is replaced by the branch target if the Zero output of the ALU is also high. The ALUOp field for R-type instructions is set to 10 to indicate that the ALU control should be generated from the funct fields. The second and third rows of this table give the control signal settings for lw and sw. These ALUSrc and ALUOp fields are set to perform the address calculation. The MemRead and MemWrite are set to perform the memory access. Finally, RegWrite is set for a load to cause the result to be stored in the rd register. The ALUOp field for branch is set for subtract (ALUOp = 01), which is used to test for equality. Notice that the MemtoReg field is irrelevant when the RegWrite signal is 0: since the register is not being written, the value of the data on the register data write port is not used. Thus, the entry MemtoReg in the last two rows of the table is replaced with X for don’t care. This type of don’t care must be added by the designer, since it depends on knowledge of how the datapath works.
Operation of the Datapath
With the information contained in Figures 4.20 and 4.22, we can design the control unit logic, but before we do that, let’s look at how each instruction uses the datapath. In the next few figures, we show the flow of three different instruction classes through the datapath. The asserted control signals and active datapath elements are highlighted in each of these. Note that a multiplexor whose control is 0 has a definite action, even if its control line is not highlighted. Multiple-bit control signals are highlighted if any constituent signal is asserted.
Figure 4.23 shows the operation of the datapath for an R-type instruction, such as add x1, x2, x3. Although everything occurs in one clock cycle, we can think of four steps to execute the instruction; these steps are ordered by the flow of information:
1. The instruction is fetched, and the PC is incremented.
2. Two registers, x2 and x3, are read from the register file; also, the main control unit computes the setting of the control lines during this step.
3. The ALU operates on the data read from the register file, using portions of the opcode to generate the ALU function.
4. The result from the ALU is written into the destination register (x1) in the register file.
FIGURE 4.23 The datapath in operation for an R-type instruction, such as add x1, x2, x3. The control lines, datapath units, and connections that are active are highlighted.
Similarly, we can illustrate the execution of a load word, such as lw x1, offset(x2), in a style similar to Figure 4.23. Figure 4.24 shows the active functional units and asserted control lines for a load. We can think of a load instruction as operating in five steps (similar to how the R-type executed in four):
1. An instruction is fetched from the instruction memory, and the PC is incremented.
2. A register (x2) value is read from the register file.
3. The ALU computes the sum of the value read from the register file and the sign-extended 12 bits of the instruction (offset).
4. The sum from the ALU is used as the address for the data memory.
5. The data from the memory unit is written into the register file (x1).
FIGURE 4.24 The datapath in operation for a load instruction. The control lines, datapath units, and connections that are active are highlighted. A store instruction would operate very similarly. The main difference would be that the memory control would indicate a write rather than a read, the second register value read would be used for the data to store, and the operation of writing the data memory value to the register file would not occur.
Finally, we can show the operation of the branch-if-equal instruction, such as beq x1, x2, offset, in the same fashion. It operates much like an R-format instruction, but the ALU output is used to determine whether the PC is written with PC + 4 or the branch target address. Figure 4.25 shows the four steps in execution:
1. An instruction is fetched from the instruction memory, and the PC is incremented.
2. Two registers, x1 and x2, are read from the register file.
3. The ALU subtracts one data value from the other data value, both read from the register file. The value of PC is added to the sign-extended 12 bits of the instruction (offset) left shifted by one; the result is the branch target address.
4. The Zero status information from the ALU is used to decide which adder result to store in the PC.
FIGURE 4.25 The datapath in operation for a branch-if-equal instruction. The control lines, datapath units, and connections that are active are highlighted. After using the register file and ALU to perform the compare, the Zero output is used to select the next program counter from between the two candidates.
Finalizing Control
Now that we have seen how the instructions operate in steps, let’s continue with the control implementation. The control function can be precisely defined using the contents of Figure 4.22. The outputs are the control lines, and the inputs are the opcode bits. Thus, we can create a truth table for each of the outputs based on the binary encoding of the opcodes.
Figure 4.26 defines the logic in the control unit as one large truth table that combines all the outputs and that uses the opcode bits as inputs. It completely specifies the control function, and we can implement it directly in gates in an automated fashion. We show this final step in Section C.2 in Appendix C.
FIGURE 4.26 The control function for the simple single-cycle implementation is completely specified by this truth table. The top seven rows of the table give the combinations of input signals that correspond to the four instruction classes, one per column, that determine the control output settings. The bottom portion of the table gives the outputs for each of the four opcodes. Thus, the output RegWrite is asserted for two different combinations of the inputs. If we consider only the four opcodes shown in this table, then we can simplify the truth table by using don’t cares in the input portion.
For example, we can detect an R-format instruction with the expression Op4 ∙ Op5, since this is sufficient to distinguish the R-format instructions from lw, sw, and beq. We do not take advantage of this simplification, since the rest of the RISC-V opcodes are used in a full implementation.
Why a Single-Cycle Implementation is not Used Today
Although the single-cycle design will work correctly, it is too inefficient to be used in modern designs. To see why this is so, notice that the clock cycle must have the same length for every instruction in this single-cycle design. Of course, the longest possible path in the processor determines the clock cycle. This path is most likely a load instruction, which uses five functional units in series: the instruction memory, the register file, the ALU, the data memory, and the register file. Although the CPI is 1 (see Chapter 1), the overall performance of a single-cycle implementation is likely to be poor, since the clock cycle is too long.
The penalty for using the single-cycle design with a fixed clock cycle is significant, but might be considered acceptable for this small instruction set. Historically, early computers with very simple instruction sets did use this implementation technique. However, if we tried to implement the floating-point unit or an instruction set with more complex instructions, this single-cycle design wouldn’t work well at all.
Because we must assume that the clock cycle is equal to the worst-case delay for all instructions, it’s useless to try implementation techniques that reduce the delay of the common case but do not improve the worst-case cycle time. A single-cycle implementation thus violates the great idea from Chapter 1 of making the common case fast.
In Section 4.6, we’ll look at another implementation technique, called pipelining, that uses a datapath very similar to the single-cycle datapath but is much more efficient by having a much higher throughput.
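To make the worst-case argument concrete, the small Python sketch below computes the clock period a single-cycle design would need. The latencies are hypothetical round numbers chosen only for illustration (this section gives none), and the per-instruction unit lists are our reading of the step-by-step descriptions earlier in the section; only the load path of five units in series is stated explicitly in the text.

```python
# Hypothetical functional-unit latencies in picoseconds (illustrative only).
LATENCY_PS = {
    "instruction memory": 200,
    "register read": 100,
    "ALU": 200,
    "data memory": 200,
    "register write": 100,
}

# Functional units each instruction class uses in series (assumed from the
# execution steps described for R-type, load, store, and branch).
PATHS = {
    "R-type": ["instruction memory", "register read", "ALU", "register write"],
    "lw": ["instruction memory", "register read", "ALU", "data memory", "register write"],
    "sw": ["instruction memory", "register read", "ALU", "data memory"],
    "beq": ["instruction memory", "register read", "ALU"],
}


def path_delay(units: list) -> int:
    """Total delay of one instruction: the sum of the units it passes through."""
    return sum(LATENCY_PS[u] for u in units)


def clock_period() -> int:
    """A single-cycle clock must cover the slowest instruction."""
    return max(path_delay(units) for units in PATHS.values())


# Usage: every instruction pays for the slowest one.
for name, units in PATHS.items():
    print(name, path_delay(units), "ps needed,", clock_period(), "ps charged")
```

Under these assumed numbers the load sets the clock period, so a branch that needs far less time still occupies a full cycle, which is the inefficiency the text describes.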
Pipelining improves efficiency by executing multiple instructions simultaneously.
Check Yourself: Look at the control signals in Figure 4.26. Can you combine any together? Can any control signal output in the figure be replaced by the inverse of another? (Hint: take into account the don’t cares.) If so, can you use one signal for the other without adding an inverter?
4.5 A Multicycle Implementation
In the prior section, we broke each instruction into a series of steps corresponding to the functional unit operations that were needed. We can use these steps to create a multicycle implementation. In a multicycle implementation, each step in the execution will take 1 clock cycle. The multicycle implementation allows a functional unit to be used more than once per instruction, as long as it is used on different clock cycles. This sharing can help reduce the amount of hardware required. The ability to allow instructions to take different numbers of clock cycles and the ability to share functional units within the execution of a single instruction are the major advantages of a multicycle design. This online section describes the multicycle implementation of MIPS.
Although it could reduce hardware costs, almost all chips today use pipelining instead to increase performance over a single-cycle implementation, so some readers may want to skip multicycle and go directly to pipelining. However, some