Pipelined datapath and control 5.2 (1).pdf

Pipelined datapath and control
▪ Now we’ll see a basic implementation of a pipelined processor.
— The datapath and control unit share similarities with both the single-cycle and multicycle implementations that we already saw.
— An example execution highlights important pipelining concepts.
▪ In future lectures, we’ll discuss several complications of pipelining that we’re hiding from you for now.
January 26, 2009

Pipelining concepts
▪ A pipelined processor allows multiple instructions to execute at once, and each instruction uses a different functional unit in the datapath.
▪ This increases throughput, so programs can run faster.
— One instruction can finish executing on every clock cycle, and simpler stages also lead to shorter cycle times.

Clock cycle         1   2   3   4   5   6   7   8   9
lw  $t0, 4($sp)     IF  ID  EX  MEM WB
sub $v0, $a0, $a1       IF  ID  EX  MEM WB
and $t1, $t2, $t3           IF  ID  EX  MEM WB
or  $s0, $s1, $s2               IF  ID  EX  MEM WB
add $t5, $t6, $0                    IF  ID  EX  MEM WB

Pipelined Datapath
▪ The whole point of pipelining is to allow multiple instructions to execute at the same time.
▪ We may need to perform several operations in the same cycle.
— Increment the PC and add registers at the same time.
— Fetch one instruction while another one reads or writes data.
▪ Thus, like the single-cycle datapath, a pipelined processor will need to duplicate hardware elements that are needed several times in the same clock cycle.

One register file is enough
▪ We need only one register file to support both the ID and WB stages.
[Figure: register file with inputs Read register 1, Read register 2, Write register, and Write data, and outputs Read data 1 and Read data 2.]
▪ Reads and writes go to separate ports on the register file.
▪ Writes occur in the first half of the cycle, reads occur in the second half, so an instruction in ID can read a value written by the instruction in WB during the same cycle.
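The overlap shown in the pipeline diagram under "Pipelining concepts" can be reproduced with a short Python sketch. It assumes an ideal five-stage pipeline with no stalls; the function names (stage_of, total_cycles, occupancy, print_diagram) are made up for this illustration and do not come from the lecture.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

# The five-instruction sequence used in the lecture's pipeline diagram.
PROGRAM = [
    "lw  $t0, 4($sp)",
    "sub $v0, $a0, $a1",
    "and $t1, $t2, $t3",
    "or  $s0, $s1, $s2",
    "add $t5, $t6, $0",
]


def stage_of(instr_index, cycle):
    """Stage occupied by instruction instr_index (0-based) in cycle (1-based), or None."""
    offset = cycle - 1 - instr_index  # instruction i enters IF in cycle i + 1
    if 0 <= offset < len(STAGES):
        return STAGES[offset]
    return None


def total_cycles(n_instructions):
    """Cycles for an ideal pipeline: (depth - 1) to fill, then one instruction per cycle."""
    return n_instructions + len(STAGES) - 1


def occupancy(program, cycle):
    """Map each busy stage to the instruction that uses it in the given cycle."""
    result = {}
    for i, instr in enumerate(program):
        stage = stage_of(i, cycle)
        if stage is not None:
            result[stage] = instr
    return result


def print_diagram(program):
    """Print the staircase diagram, one row per instruction."""
    cycles = range(1, total_cycles(len(program)) + 1)
    print(f"{'Clock cycle':<20}" + "".join(f"{c:<4}" for c in cycles).rstrip())
    for i, instr in enumerate(program):
        cells = "".join(f"{stage_of(i, c) or '':<4}" for c in cycles)
        print(f"{instr:<20}{cells}".rstrip())


if __name__ == "__main__":
    print_diagram(PROGRAM)
```

Calling occupancy(PROGRAM, 5) lists all five stages as busy, which is the "pipeline full" situation: each functional unit is working on a different instruction in that cycle.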
Single-cycle datapath, slightly rearranged
[Figure: the single-cycle datapath, redrawn. Components: PC, two adders (PC + 4 and branch target), shift left 2, instruction memory, register file, sign extend, ALU, data memory. Control signals: PCSrc, RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, MemToReg.]

What’s been changed?
▪ Almost nothing! This is equivalent to the original single-cycle datapath.
— There are separate memories for instructions and data.
— There are two adders for PC-based computations and one ALU.
— The control signals are the same.
▪ Only some cosmetic changes were made to make the diagram smaller.
— A few labels are missing, and the muxes are smaller.
— The data memory has only one Address input. The actual memory operation can be determined from the MemRead and MemWrite control signals.
▪ The datapath components have also been moved around in preparation for adding pipeline registers.

Pipeline registers
▪ We’ll add intermediate registers to our pipelined datapath too.
▪ There’s a lot of information to save, however. We’ll simplify our diagrams by drawing just one big pipeline register between each stage.
▪ The registers are named for the stages they connect.
IF/ID   ID/EX   EX/MEM   MEM/WB
▪ No register is needed after the WB stage, because after WB the instruction is done.

Pipelined datapath
[Figure: the rearranged datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers inserted between the stages.]

Propagating values forward
▪ Any data values required in later stages must be propagated through the pipeline registers.
▪ The most extreme example is the destination register.
— The rd field of the instruction word (rt for I-type instructions such as lw), retrieved in the first stage (IF), determines the destination register. But that register isn’t updated until the fifth stage (WB).
— Thus, the destination register number must be passed through all of the pipeline stages, as shown on the next slide.
▪ Notice that we can’t keep a single “instruction register” like we did before in the multicycle datapath, because the pipelined machine needs to fetch a new instruction every clock cycle.

The destination register
[Figure: the pipelined datapath, highlighting the destination register number as it is carried through the pipeline registers to the Write register input of the register file.]

What about control signals?
▪ The control signals are generated in the same way as in the single-cycle processor: after an instruction is fetched, the processor decodes it and produces the appropriate control values.
▪ But just like before, some of the control signals will not be needed until some later stage and clock cycle.
▪ These signals must be propagated through the pipeline until they reach the appropriate stage. We can just pass them in the pipeline registers, along with the other data.
▪ Control signals can be categorized by the pipeline stage that uses them.

Pipelined datapath and control
[Figure: the pipelined datapath with a Control unit in the ID stage. Its outputs are grouped as EX, M, and WB and carried through the ID/EX, EX/MEM, and MEM/WB pipeline registers.]

What about control signals?
▪ Control signals can be categorized by the pipeline stage that uses them.

Stage   Control signals needed
EX      ALUSrc     ALUOp      RegDst
MEM     MemRead    MemWrite   PCSrc
WB      RegWrite   MemToReg

Pipelined datapath and control
[Figure: the pipelined datapath and control diagram, repeated.]

Notes about the diagram
▪ The control signals are grouped together in the pipeline registers, just to make the diagram a little clearer.
▪ Not all of the registers have a write enable signal.
— Because the datapath fetches one instruction per cycle, the PC must also be updated on each clock cycle. Including a write enable for the PC would be redundant.
— Similarly, the pipeline registers are also written on every cycle, so no explicit write signals are needed.

An example execution sequence
▪ Here’s a sample sequence of instructions to execute (addresses in decimal).
1000: lw  $8, 4($29)
1004: sub $2, $4, $5
1008: and $9, $10, $11
1012: or  $16, $17, $18
1016: add $13, $14, $0
▪ We’ll make some assumptions, just so we can show actual data values.
— Each register contains its number plus 100. For instance, register $8 contains 108, register $29 contains 129, and so forth. The exception is $0, which always contains 0.
— Every data memory location contains 99.
▪ Our pipeline diagrams will follow some conventions.
— An X indicates values that aren’t important, like the constant field of an R-type instruction.
— Question marks ??? indicate values we don’t know, usually resulting from instructions coming before and after the ones in our example.

Cycle 1 (filling)
IF: lw $8, 4($29)   ID: ???   EX: ???   MEM: ???   WB: ???
[Figure: pipelined datapath with cycle 1 values. IF: PC = 1000, PC + 4 = 1004. All later stages hold unknown values.]

Cycle 2
IF: sub $2, $4, $5   ID: lw $8, 4($29)   EX: ???   MEM: ???   WB: ???
[Figure: pipelined datapath with cycle 2 values. IF: PC = 1004, PC + 4 = 1008. ID: Read register 1 = 29, Read data 1 = 129, sign-extended constant = 4, Instr [20 - 16] = 8; the second register read and Instr [15 - 11] are X.]

Cycle 3
IF: and $9, $10, $11   ID: sub $2, $4, $5   EX: lw $8, 4($29)   MEM: ???   WB: ???
[Figure: pipelined datapath with cycle 3 values. IF: PC = 1008, PC + 4 = 1012. ID: registers 4 and 5 are read as 104 and 105. EX: ALUOp = add, ALUSrc = 1, RegDst = 0; ALU result = 129 + 4 = 133; destination register = 8.]
Cycle 4
IF: or $16, $17, $18   ID: and $9, $10, $11   EX: sub $2, $4, $5   MEM: lw $8, 4($29)   WB: ???
[Figure: pipelined datapath with cycle 4 values. IF: PC = 1012, PC + 4 = 1016. ID: registers 10 and 11 are read as 110 and 111. EX: ALUOp = sub, ALUSrc = 0, RegDst = 1; ALU result = 104 - 105 = -1; destination register = 2. MEM: MemRead = 1, MemWrite = 0; address = 133, read data = 99; destination register = 8.]

Cycle 5 (full)
IF: add $13, $14, $0   ID: or $16, $17, $18   EX: and $9, $10, $11   MEM: sub $2, $4, $5   WB: lw $8, 4($29)
[Figure: pipelined datapath with cycle 5 values. IF: PC = 1016, PC + 4 = 1020. ID: registers 17 and 18 are read as 117 and 118. EX: ALUOp = and, ALUSrc = 0, RegDst = 1; ALU result = 110 AND 111 = 110; destination register = 9. MEM: MemRead = 0, MemWrite = 0; the ALU result -1 passes through; destination register = 2. WB: RegWrite = 1, MemToReg = 1; 99 is written to register 8.]

Cycle 6 (emptying)
IF: ???   ID: add $13, $14, $0   EX: or $16, $17, $18   MEM: and $9, $10, $11   WB: sub $2, $4, $5
[Figure: pipelined datapath with cycle 6 values. IF: PC = 1020, instruction unknown. ID: registers 14 and 0 are read as 114 and 0. EX: ALUOp = or, ALUSrc = 0, RegDst = 1; ALU result = 117 OR 118 = 119; destination register = 16. MEM: MemRead = 0, MemWrite = 0; the ALU result 110 passes through; destination register = 9. WB: RegWrite = 1, MemToReg = 0; -1 is written to register 2.]

Cycle 7
IF: ???   ID: ???   EX: add $13, $14, $0   MEM: or $16, $17, $18   WB: and $9, $10, $11
[Figure: pipelined datapath with cycle 7 values. EX: ALUOp = add, ALUSrc = 0, RegDst = 1; ALU result = 114 + 0 = 114; destination register = 13. MEM: MemRead = 0, MemWrite = 0; the ALU result 119 passes through; destination register = 16. WB: RegWrite = 1, MemToReg = 0; 110 is written to register 9.]
Cycle 8
IF: ???   ID: ???   EX: ???   MEM: add $13, $14, $0   WB: or $16, $17, $18
[Figure: pipelined datapath with cycle 8 values. MEM: MemRead = 0, MemWrite = 0; the ALU result 114 passes through; destination register = 13. WB: RegWrite = 1, MemToReg = 0; 119 is written to register 16.]

Cycle 9
IF: ???   ID: ???   EX: ???   MEM: ???   WB: add $13, $14, $0
[Figure: pipelined datapath with cycle 9 values. WB: RegWrite = 1, MemToReg = 0; 114 is written to register 13.]

That’s a lot of diagrams there

Clock cycle         1   2   3   4   5   6   7   8   9
lw  $t0, 4($sp)     IF  ID  EX  MEM WB
sub $v0, $a0, $a1       IF  ID  EX  MEM WB
and $t1, $t2, $t3           IF  ID  EX  MEM WB
or  $s0, $s1, $s2               IF  ID  EX  MEM WB
add $t5, $t6, $0                    IF  ID  EX  MEM WB

▪ Compare the last nine slides with the pipeline diagram above.
— You can see how instruction executions are overlapped.
— Each functional unit is used by a different instruction in each cycle.
— The pipeline registers save control and data values generated in previous clock cycles for later use.
— When the pipeline is full in clock cycle 5, all of the hardware units are utilized. This is the ideal situation, and what makes pipelined processors so fast.
▪ Try to understand this example or the similar one in the book at the end of Section 6.3.

Performance Revisited
▪ Assuming the following functional unit latencies:

Inst mem   Reg Read   ALU   Data Mem   Reg Write
3ns        2ns        2ns   3ns        2ns

▪ What is the cycle time of a single-cycle implementation?
— What is its throughput?
▪ What is the cycle time of an ideal pipelined implementation?
— What is its steady-state throughput?
▪ How much faster is pipelining?

Ideal speedup

Clock cycle         1   2   3   4   5   6   7   8   9
lw  $t0, 4($sp)     IF  ID  EX  MEM WB
sub $v0, $a0, $a1       IF  ID  EX  MEM WB
and $t1, $t2, $t3           IF  ID  EX  MEM WB
or  $s0, $s1, $s2               IF  ID  EX  MEM WB
add $sp, $sp, -4                    IF  ID  EX  MEM WB

▪ In our pipeline, we can execute up to five instructions simultaneously.
— This implies that the maximum speedup is 5 times.
— In general, the ideal speedup equals the pipeline depth.
▪ Why was our speedup on the previous slide “only” 4 times (a 12ns single-cycle clock versus a 3ns pipelined clock)?
— The pipeline stages are imbalanced: register file and ALU operations can be done in 2ns, but we must stretch that out to 3ns to keep the ID, EX, and WB stages synchronized with IF and MEM.
— Balancing the stages is one of the many hard parts in designing a pipelined processor.

The pipelining paradox
▪ Pipelining does not improve the execution time of any single instruction. Each instruction here actually takes longer to execute than in a single-cycle datapath (15ns vs. 12ns)!
▪ Instead, pipelining increases the throughput, or the amount of work done per unit time.
  Here, several instructions are executed together in each clock cycle.
▪ The result is improved execution time for a sequence of instructions, such as an entire program.

Instruction set architectures and pipelining
▪ The MIPS instruction set was designed especially for easy pipelining.
— All instructions are 32 bits long, so the instruction fetch stage just needs to read one word on every clock cycle.
— Fields are in the same position in different instruction formats: the opcode is always the first six bits, rs is the next five bits, etc. This makes things easy for the ID stage.
— MIPS is a register-to-register architecture, so arithmetic operations cannot contain memory references. This keeps the pipeline shorter and simpler.
▪ Pipelining is harder for older, more complex instruction sets.
— If different instructions had different lengths or formats, the fetch and decode stages would need extra time to determine the actual length of each instruction and the position of the fields.
— With memory-to-memory instructions, additional pipeline stages may be needed to compute effective addresses and read memory before the EX stage.

Summary
▪ The pipelined datapath combines ideas from the single and multicycle processors that we saw earlier.
— It uses separate instruction and data memories and dedicated adders alongside the ALU, as in the single-cycle datapath.
— Instruction execution is split into several stages, as in the multicycle datapath.
▪ Pipeline registers propagate data and control values to later stages.
▪ The MIPS instruction set architecture supports pipelining with uniform instruction formats and simple addressing modes.
▪ Next lecture, we’ll start talking about hazards.
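The questions on the "Performance Revisited" slide and the 15ns vs. 12ns comparison in "The pipelining paradox" can be checked with a small calculation. The Python sketch below assumes an ideal pipeline (no hazards, no pipeline register overhead); the stage names used as dictionary keys and all function names are chosen for this illustration only.

```python
# Functional unit latencies in nanoseconds, from the "Performance Revisited" slide.
# Assumption for this sketch: one functional unit per pipeline stage.
LATENCY_NS = {"IF": 3, "ID": 2, "EX": 2, "MEM": 3, "WB": 2}


def single_cycle_time(latency):
    """One long clock cycle must cover every functional unit in turn."""
    return sum(latency.values())


def pipelined_cycle_time(latency):
    """Every stage gets the same clock period, set by the slowest stage."""
    return max(latency.values())


def pipelined_instruction_latency(latency):
    """Time for a single instruction to pass through all pipeline stages."""
    return len(latency) * pipelined_cycle_time(latency)


def single_cycle_total(n, latency):
    """Total time for n instructions on the single-cycle datapath."""
    return n * single_cycle_time(latency)


def pipelined_total(n, latency):
    """Ideal pipeline: (depth - 1) cycles to fill, then one instruction per cycle."""
    return (n + len(latency) - 1) * pipelined_cycle_time(latency)


def speedup(n, latency):
    """Single-cycle time divided by pipelined time for n instructions."""
    return single_cycle_total(n, latency) / pipelined_total(n, latency)


if __name__ == "__main__":
    print("single-cycle clock (ns):", single_cycle_time(LATENCY_NS))
    print("pipelined clock (ns):", pipelined_cycle_time(LATENCY_NS))
    print("latency of one pipelined instruction (ns):",
          pipelined_instruction_latency(LATENCY_NS))
    for n in (5, 1000, 1_000_000):
        print(n, "instructions, speedup:", round(speedup(n, LATENCY_NS), 3))
```

Under these assumptions the five-instruction example takes 60ns on the single-cycle datapath and 27ns when pipelined. The speedup for a short sequence is well below 4 because of the fill time, and it approaches 4 (12ns / 3ns) only as the instruction count grows.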
