SSC Course 5 CPU PDF
Summary
This document gives an overview of the Central Processing Unit (CPU) in computer architecture, covering three techniques for handling pipeline hazards: the scoreboard method, Tomasulo's algorithm, and branch prediction.
Full Transcript
Structure of Computer Systems, Course 5
The Central Processing Unit (CPU)

Solutions for hazard cases:
- Scoreboard method
- Tomasulo's method
- Branch prediction

Scoreboard method

General considerations (wiki):
- first used in the CDC 6600 computer (1964)
- used for dynamically scheduling a pipeline so that instructions can execute out of order when there are no conflicts and the hardware is available (no structural hazard is present)
- the data dependencies of every instruction are logged
- instructions are released only when the scoreboard determines that there are no conflicts with previously issued and incomplete instructions
- if an instruction is stalled because it is unsafe to continue, the scoreboard monitors the flow of executing instructions until all dependencies have been resolved before the stalled instruction is issued

Implementation of the scoreboard method. Every instruction goes through 4 stages:
- Issue (ID1): decode the instruction; check for structural and WAW hazards; stall until structural and WAW hazards are resolved
- Read operands (ID2): wait until there are no RAW hazards, then read the operands
- Execution (EX): operate on the operands; may take multiple cycles; notify the scoreboard when done
- Write result (WB): finish execution; stall if there is a WAR hazard

Scoreboard structure:
- Instruction status: indicates which of the 4 steps the instruction is in (ID1, ID2, EX or WB)
- Functional unit status: indicates the state of the functional unit (FU)
  - Busy: indicates whether the unit is busy or not
  - Op: operation to perform in the unit (e.g. + or -)
  - Fi: destination register
  - Fj, Fk: source-register numbers
  - Qj, Qk: functional units producing source registers Fj, Fk
  - Rj, Rk: flags indicating when Fj, Fk are ready
- Register result status: indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register

Speedup from the scoreboard:
- 1.7 for FORTRAN programs
- 2.5 for hand-coded assembly language programs

Hardware:
- the scoreboard hardware is approximately the same size as one FPU
- main cost: buses (4 times the normal amount)
- the cost could be more severe for modern processors

Scoreboard and Tomasulo's algorithm

Issues with the scoreboard method:
- it does not solve structural hazards
- it has no forwarding logic
- it introduces stall phases when a required functional unit is busy; the stall affects the following instructions too

Tomasulo's algorithm avoids structural hazards and also resolves WAR and WAW dependencies by means of register renaming and a common data bus (CDB):
- first used in the IBM 360/91 computer (1967)
- Register renaming: keep multiple copies of the same physical register. This avoids data dependencies that are caused by the limited number of registers and not by a real data dependency.
- Common data bus: a value is put on a common bus as soon as it is available, avoiding an unnecessary stall until the value is written into the destination register.

Tomasulo's algorithm: instruction stages
- Issue: an instruction is issued if the required functional unit and all operands are available; otherwise it is stalled and the next instruction is tested and, if possible, issued. If a real value is not yet available, a virtual value is used until the real value becomes available. Registers are renamed to avoid WAR and WAW hazards.
- Execute: the instruction is carried out as soon as the necessary operands are available or present on the CDB. Special care must be given to Load and Store instructions, which require access to memory.
- Write result: the result of the executed instruction is written back into the destination register, and Store operations are performed on memory (see the commit stage later).

Tomasulo's algorithm: reservation stations
- buffers that fetch and store instruction operands as they become available
- a reservation station holds the data and the result of an instruction
- it points to registers (if the data is available) or to other reservation stations that will contain the necessary data as soon as it becomes available (before it is written back into the register)
- the reservation station stores the result of an instruction and releases the functional unit as soon as the instruction is executed; the result becomes available to other reservation stations; in this way WAR and RAW stalls are avoided

Tomasulo's algorithm: redundant functional units
- to avoid structural hazards, redundant functional units are used, such as multiple integer ALUs, floating-point ALUs or address-computing ALUs
- example: the P6 architecture (Pentium II and III) contains 7 units: 2 IEU, 1 FEU, 1 MMX, 3 AGU
- in front of every functional unit a buffer or a list may store the requests (instructions) destined for that unit; e.g. the Netburst architecture (Pentium 4) has a list of requests for every reservation station; in this way every functional unit is scheduled in advance and can work almost without stalling

Tomasulo's algorithm: commit
- Commit is an extra stage in the instruction execution sequence, besides issue, execute and write result; it is used to further improve Tomasulo's solution
- in the Write result stage the result is written into the re-order buffer (ROB) and not directly into the destination register or memory; all data in the ROB may be used by other instructions; in this way some stall periods may be avoided
- Re-order buffer (ROB): used to commit instructions that were executed out of order
  - contains data about the instructions in their original order; some entries may be filled in ahead of time as a result of out-of-order execution
  - the instructions are committed in their original order
  - the ROB is useful for roll-back procedures in case of a branch misprediction or an exception
- in the commit stage, data from the re-order buffer is copied into the real registers or into memory in the order specified by the program and not in the order of execution

Branch prediction

A method for solving control hazards.
- Problem: a branch in the program disturbs pipeline execution; if the branch is taken, the pipeline must be flushed and reinitialized with instructions from the target address
- Principle: try to guess the direction of a branch instruction (mainly a conditional branch) and load the pipeline with instructions from the correct path
- Methods:
  - Static prediction: based on the nature of the branch instruction
  - Dynamic prediction: takes into consideration the history of the branch instructions (whether they were taken or not in the past may predict their future behavior)

Static prediction: based on the nature of the branch instruction
- Cases:
  - procedure calls: taken
  - unconditional jumps: taken
  - backward branches: taken (considered as loops in the program)
  - forward branches: not taken (considered exceptions from normal execution)
- Advantages: it is simple and fast; it works well for programs with many loops
- Drawback: it does not work well if there are a lot of conditional jumps

Dynamic prediction: takes into consideration the history of the branch instructions
- Principle: use previous executions of a conditional jump to better predict the next executions
- Methods:
  - Next line predictor: stores the pointer to the next instruction (or group of instructions, if multiple instructions are fetched at the same time); the method stores the decision as well as the target (pointer) of the branch
  - Saturating counters: store the decisions made before in 1 or 2 bits; a 2-bit counter has 4 states:
    - Strongly not taken (00): "not taken" is predicted
    - Weakly not taken (01): "not taken" is predicted
    - Weakly taken (10): "taken" is predicted
    - Strongly taken (11): "taken" is predicted
    Every occurrence of the branch updates the state of the counter: a taken branch moves it toward 11, a not-taken branch moves it toward 00.
    [Figure: state diagram of the 2-bit counter with states 00, 01, 10, 11 and Taken / Not taken transitions]
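The 2-bit saturating counter described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular processor: the names `TwoBitCounter` and `count_mispredictions`, and the choice of "weakly not taken" as the initial state, are made up for the example.

```python
class TwoBitCounter:
    """2-bit saturating counter (hypothetical helper for illustration).

    States: 0 = strongly not taken, 1 = weakly not taken,
            2 = weakly taken,       3 = strongly taken.
    """

    def __init__(self, state=1):  # assumption: start in "weakly not taken"
        self.state = state

    def predict(self):
        # States 10 and 11 predict "taken"; 00 and 01 predict "not taken".
        return self.state >= 2

    def update(self, taken):
        # Move one step toward 11 on a taken branch, toward 00 otherwise,
        # saturating at both ends.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, counter):
    """Predict each outcome, then train the counter with the real outcome."""
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# A loop branch: taken 9 times, then not taken; the loop is run twice.
loop_branch = ([True] * 9 + [False]) * 2
print(count_mispredictions(loop_branch, TwoBitCounter()))
```

With this sequence the counter misses once while warming up and once at each loop exit. Because a single not-taken outcome only moves the state from 11 to 10, the first iteration of the second run is still predicted as taken, which is the advantage of 2 bits over 1.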
- Store the decision and the target address for every executed conditional jump in a BHT (Branch History Table) and a BTB (Branch Target Buffer); this information helps predict the next executions of the same instructions with approximately 90% probability. The BHT and BTB are indexed with the low-order bits of the address (of the PC); the number of bits used determines the size of the tables.
- Two-level adaptive predictor
  - necessary for alternating and nested conditional jumps
  - idea: memorize jump sequence patterns; the prediction is based on a pattern of taken (1) and not taken (0) branches
  - a two-level adaptive predictor with an n-bit history can predict any repetitive sequence with any period if all n-bit sub-sequences are different
  [Figure: an n-bit history register (e.g. 0100) indexes a pattern history table of 2-bit counters, which delivers the prediction]
- Local branch prediction
  - a separate history buffer for each conditional jump instruction
  - it may use a two-level branch predictor with a common or an individual pattern history table
  - Pentium II and III have local branch predictors with a local 4-bit history and a local pattern history table with 16 entries for each conditional jump
- Global branch predictor
  - keeps a shared (global) history of all conditional jumps
  - any correlation between two branches is used for prediction; results are poor if branches are not correlated; usually not as good as local predictors
  - variants: "gshare" predictor, "gselect" predictor
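The two-level adaptive predictor described above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: a single branch, one history register, and one pattern history table of 2-bit counters initialized to "weakly not taken". The names `TwoLevelPredictor` and `run` are hypothetical.

```python
class TwoLevelPredictor:
    """n-bit history register indexing a table of 2-bit counters (sketch)."""

    def __init__(self, history_bits):
        self.n = history_bits
        self.history = 0                      # last n outcomes, 1 = taken
        self.pht = [1] * (1 << history_bits)  # assumption: weakly not taken

    def predict(self):
        # The current history pattern selects one 2-bit counter.
        return self.pht[self.history] >= 2

    def update(self, taken):
        # Train the selected counter, then shift the outcome into the history.
        c = self.pht[self.history]
        self.pht[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        mask = (1 << self.n) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def run(predictor, outcomes):
    """Return the number of mispredictions over a sequence of outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Repeating pattern taken, taken, not taken: the 2-bit sub-sequences
# 11, 10 and 01 are each followed by a single, fixed outcome.
pattern = [True, True, False]

two_level = TwoLevelPredictor(history_bits=2)
run(two_level, pattern * 10)          # warm-up
print(run(two_level, pattern * 10))   # mispredictions after warm-up

plain = TwoLevelPredictor(history_bits=0)  # degenerates to one 2-bit counter
run(plain, pattern * 10)
print(run(plain, pattern * 10))
```

Setting `history_bits=0` leaves a single 2-bit counter, which keeps mispredicting the not-taken branch once per period, while the 2-bit history learns the pattern completely because every 2-bit sub-sequence is followed by one fixed outcome.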
Global branch predictor, possible implementation:
- a two-level adaptive predictor with a globally shared history buffer and pattern history table
- "gshare" predictor: the index into the pattern history table is an XOR between the global history buffer and the jump address
- "gselect" predictor: the index is obtained by concatenating the history buffer and the jump's address
- Pentium M, Core 2 and AMD processors use global branch prediction

Combinations of local and global predictors:
- Alloyed branch prediction: concatenates the local and global branch history buffers, sometimes also with the address of the jump
- Agree predictor: makes an XOR between the local and global predictor (used in Pentium 4)
- Hybrid predictor: a combination of predictors; the result is selected through voting or from the predictor with the best hit rate
- Loop predictor: detects whether a conditional jump is a loop; it is taken N-1 times and not taken 1 time; it may use a counter for the loop; it may be part of a hybrid predictor
- Prediction of indirect jumps: when the target of a jump has multiple choices, store the previous targets and more bits in the prediction history buffer for such a jump
- Prediction of function returns: stores a copy of the stack that contains the return addresses of the executed functions

Correlated prediction
- an example of a combination between local and global prediction
- how it works:
  - every entry in the history table has 4 predictors (e.g. 2-bit counters)
  - the 2-bit global history buffer selects between the 4 predictors
  - the state of the selected predictor is updated according to the decision made
  - the global branch history gives the context and the local predictors store the behavior of different jump instructions
- (2,2) predictor: 2-bit counters and a 2-bit history buffer
[Figure: the branch address (4 bits) selects a row of 2-bit per-branch local predictors; the 2-bit recent global branch history (01 = not taken then taken) selects the prediction within the row]

Misprediction statistics
[Figure: bar chart, frequency of mispredictions (0% to 20%) on SPEC tests, including eqntott and gcc, for three configurations:
1. 4096 entries, 2-bit BHT
2. unlimited entries, 2-bit BHT
3. 1024 entries, local and global prediction, (2,2) BHT]
Configurations 1 and 3 require the same amount of memory: 8 Kbits.

Tournament predictor
- a 2-bit local predictor fails on important branches; by adding global information, performance may be improved
- tournament predictors use two predictors, one based on global information and one based on local information, and combine them with a selector
- the aim is to select the right predictor for the right branch (or the right context of a branch)
[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for three schemes: local 2-bit counters, correlating (2,2) scheme, tournament]

Branch Target Buffer (BTB)
- contains the targets of taken branches
- an associative-access memory
- each entry contains: the jump instruction address, the target address and the prediction state
[Figure: BTB flowchart across the IF, ID and EX stages:
- IF: send the PC to memory and to the branch-target buffer. Is an entry found in the branch-target buffer?
- No entry found: in ID, is the instruction a taken branch?
  - No: normal instruction execution
  - Yes: in EX, enter the branch instruction address and next PC into the branch-target buffer
- Entry found: send out the predicted PC; in ID/EX, is the branch taken?
  - No: mispredicted branch; kill the fetched instruction, restart the fetch at the other target, delete the entry from the target buffer
  - Yes: branch correctly predicted; continue execution with no stalls]
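
The branch-target buffer flow described above can be sketched as follows. This is a simplified Python model, not a hardware description: the class name `BranchTargetBuffer` and its methods are hypothetical, the buffer has unlimited capacity, and the handling of a hit with a changed target is an assumption that does not appear in the flowchart.

```python
class BranchTargetBuffer:
    """Toy branch-target buffer: branch address -> predicted target (sketch)."""

    def __init__(self):
        self.entries = {}  # assumption: unlimited, fully associative

    def predict_next_pc(self, pc, sequential_pc):
        """IF stage: on a hit send out the predicted PC, else fetch in sequence."""
        return self.entries.get(pc, sequential_pc)

    def resolve(self, pc, taken, target):
        """Called when the branch outcome is known.

        Returns True if the fetch went the right way (no stall),
        False if the fetched instruction must be killed.
        """
        if pc in self.entries:
            if taken and self.entries[pc] == target:
                return True                # correctly predicted, no stalls
            if taken:
                # Assumption (not in the flowchart): refresh a stale target.
                self.entries[pc] = target
            else:
                del self.entries[pc]       # mispredicted: delete the entry
            return False
        if taken:
            self.entries[pc] = target      # enter branch address and next PC
            return False                   # fetch had continued in sequence
        return True                        # normal instruction execution


btb = BranchTargetBuffer()
branch_pc, fall_through, loop_top = 0x40, 0x44, 0x10

print(hex(btb.predict_next_pc(branch_pc, fall_through)))  # miss: sequential
print(btb.resolve(branch_pc, taken=True, target=loop_top))
print(hex(btb.predict_next_pc(branch_pc, fall_through)))  # hit: predicted PC
print(btb.resolve(branch_pc, taken=True, target=loop_top))
```

Only taken branches are stored, so a miss in the buffer implicitly predicts "not taken"; a stored branch that turns out to be not taken is removed again, as in the flowchart.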