04.-Computer-organization-Architecture_.pdf
Document Details
Tags
Full Transcript
Computer Organization & Architecture Published By: Physics Wallah ISBN: 978-93-94342-39-2 Mobile App: Physics Wallah (Available on Play Store) Website: www.pw.live Email: [email protected] Rights All rights will be reserved by Publ...
Computer Organization & Architecture Published By: Physics Wallah ISBN: 978-93-94342-39-2 Mobile App: Physics Wallah (Available on Play Store) Website: www.pw.live Email: [email protected] Rights All rights will be reserved by Publisher. No part of this book may be used or reproduced in any manner whatsoever without the written permission from author or publisher. In the interest of student's community: Circulation of soft copy of Book(s) in PDF or other equivalent format(s) through any social media channels, emails, etc. or any other channels through mobiles, laptops or desktop is a criminal offence. Anybody circulating, downloading, storing, soft copy of the book on his device(s) is in breach of Copyright Act. Further Photocopying of this book or any of its material is also illegal. Do not download or forward in case you come across any such soft copy material. Disclaimer A team of PW experts and faculties with an understanding of the subject has worked hard for the books. While the author and publisher have used their best efforts in preparing these books. The content has been checked for accuracy. As the book is intended for educational purposes, the author shall not be responsible for any errors contained in the book. The publication is designed to provide accurate and authoritative information with regard to the subject matter covered. This book and the individual contribution contained in it are protected under copyright by the publisher. (This Module shall only be Used for Educational Purpose.) Design Against Static Load INDEX 1. Machine Instructions........................................................................................................... 4.1 – 4.5 2. Addressing Modes................................................................................................................ 4.6 – 4.8 3. ALU Data Path...................................................................................................................... 4.9 – 4.13 4. CPU Control Unit Design....................................................................................................... 4.14 – 4.19 5. Memory Interfacing............................................................................................................. 4.20 – 4.22 6. Input Output Interfacing...................................................................................................... 4.23 – 4.28 7. Pipelining............................................................................................................................. 4.29 – 4.43 8. Cache Memory..................................................................................................................... 4.44 – 4.48 9. Main Memory...................................................................................................................... 4.49 – 4.51 10. Secondary Storage................................................................................................................ 4.52 – 4.56 GATE-O-PEDIA COMPUTER SCIENCE & INFORMATION TECHNOLOGY Design Against Static Load 1 MACHINE INSTRUCTIONS 1.1 Introduction Opcode – Specifies the operation code. The number of bits in opcode depends on total operations. Address – Specifies the operand address. The basic computer has three instruction code formats. (i) Memory – Reference instruction I = O Direct addressing I = 1 Indirect addressing 15 14 12 11 0 I opcode address Instruction format (opcode = 000 to 110) (ii) Register – Reference instruction 15 12 11 0 0111 Register Operation (opcode = 111, I = 1) (iii) Input – output Instruction 15 12 11 0 1111 I/O Operation (op code = 111, I = 1) A stack can be placed in a portion of a large memory or it can be organized as a collection of a memory words or registers. i.e., it may be (i) Register stack (ii) Memory stack 1.1.1. Register Stack The following figure shows the organization of a 64-word register stack. The stack pointer register SP contains a binary number whose value is equal to the address of the word that is currently on top of stack. Initially SP is cleared, EMPTY is set to 1, FULL is cleared to 0. If the stack is not full (if FULL = 0), a new item is inserted. With a push operation. GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.1 Computer Organizations and Architecture Address FULL EMPTY 63 4 C 3 SP B 2 A 1 0 DR PUSH: SP SP + 1 [increment stack pointer] M{SP} DR [write item on top of stack] If (SP = 0) then FULL 1 [check if stack is full] EMTY 0 [mark the stack not empty] A item is deleted from the stack if the stack is not empty (if EMTY = 0), called POP operation DR M{SP} [Read item from top of stack] SP Sp – 1 [Decrement stack pointer] If (SP = 0) then EMTY – 1 1.1.2. Memory Stack A stack can be implemented in a random-access memory. The stack can be implemented by assigning a portion of memory to a stack operation and using a processor register as a stack pointer. i.e., A new item is inserted with the push operation as follows: 1000 Program PC (instructions) PUSH: SP SP – 1 2000 Data M[SP] DR AR (operands) A new item is deleted with a POP operation as follows 3000 Stack DR M[SP] SP SP SP + 1 3997 3998 3999 4000 Memory unit GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.2 Computer Organizations and Architecture A stack pointer is loaded with an initial value. This initial value must be the bottom address of an assigned stack in memory. SP is automatically incremented or decremented with every PUSH or POP operation. 1.1.3. Reverse Polish Notation An expression in post fix form is often called reverse polish notation. The infix expression (A + B) * [C*(D + E) + F] in reverse polish notation as AB + DE + C * F + * A reverse polish expression can be implemented or evaluated using stack for the expression X (A + B) * (C + D) With values A = 2, B = 3, C = 4, d = 2 is X (2 + 3) * (4 + 2) Infix expression x 23 + 42 + * Reverse polish expression 1.2 Instruction Types The basic unit of program is the instruction. Instructions are classified based on (1) Opcode – called functional classification (2) Based on number of references. 1.3 Functional Classification Data transfer instructions – (LOAD, STOR, MOV, EXCT, IN, OUT, PUSH, POP) Arithmetic instructions – (INC, DEC, ADD, SUB, MUL, DIV) Logical Instructions & Shift instructions. (CLR, COM, AND, XOR, OR, SHR, SHL, SHRA, SHLA, RO, ROL….) Branching instructions. (ABR, JMP, SKP, CALL, RETURN) 1.4 Based on Number of References Based on number of references, the instructions can be classified as 4 – address instructions 3- address instructions 2 – address instructions 1 – address instructions 0 – address instructions RISC Instructions GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.3 Computer Organizations and Architecture (i) 4 – Address Instructions: Opcode A1 A2 A3 A4 A1, A2 refers operands A3 refers destination A4 next instruction address. Since A4 is like program counter, the processor which supports 4-address instructions need not use PC. The length of instruction is more, hence requires more than one reference. (ii) 3 – address Instructions Opcode A1 A2 A3 Each address field specify a register or memory operand. It results in short program when evaluating arithmetic expressions. Requires too many bits to specify three addresses Example: To evaluate X = (A + B) * (C + D) Add R1, A, B R1 M[A] + M[B] Add R2, C, D R2 M[C] + M[D] MUL X, R1, R2 M[X] R1 * R2 For these instructions program counter must be required. (iii) 2 – address Instructions: Opcode A1 A2 A1 – first operand and destination A2 – second operand Uses MOV instruction for data transfer Example: X (A + B) * (C + D) MOV R1, A ADD R1, B MOV R2, C ADD R2, D MUL R1, R2 MOV X R1 One of the operand permanently lost. GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.4 Computer Organizations and Architecture (iv) 1 – address Instructions Opcode A1 Uses an implied accumulator (AC) register for all data manipulation. Easily decoded and processed instructions. Example: X (A + B) * (C + D) LOAD A AC M[A] ADD B AC AC + M[B] STOR T M[T] AC LOAD C AC M[C] ADD D AC AC + M[D] MUL T AC AC * M[T] STOR X M[X] AC (v) Zero – address Instructions: Opcode The operands are referenced complexity from stack. More complex approach than others. Any changes in order of operands effect the result. Example: X = (A + B) * (C + D) PUSH A PUSH B ADD PUSH C PUSH D ADD MUL POP X (vi) RISC Instructions: All instructions are executed with in the registers of CPU, without referring to memory (Except LOAD, STOR) Uses in RISC processor LOAD & STOR are used between data transfer. Example: X = (A + B) * (C + D) LOAD R1, A LOAD R2, B LOAD R3, C LOAD R4, D ADD R1, R1, R2 ADD R3, R3, R4 MUL R1, R1, R3 STOR X, R1 GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.5 Computer Organizations and Architecture 2 ADDRESSING MODES 2.1 Introduction The various addressing modes commonly used are: Implied Mode: In this mode the operands are specified implicitly in the definition of the instruction. Example: Complement Accumulator (CMA) [All register reference instructions that use an accumulator are implied - mode instructions] Zero-address instructions in a stack organized computer. [here the operands are implied to be on top of the stack] Immediate Mode: In this mode the operand is specified in the instruction itself. It is very faster Useful for initializing register to a constant value. Example: ADD 10 i.e., AC ← AC + 10 The range of value limited by the address field Register Mode: In this mode the operands are in registers that reside within the CPU. A K-bit field can specify any one of 2k registers. Reduces the length of the address field. Example: ADD R1 GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.6 Computer Organizations and Architecture Register Indirect Mode: In this mode the instruction specifies a register in the CPU whose contents give the address of an operand in memory. Before using this mode, the programmer must ensure that the memory address of the operand is placed in the processor register with a previous instruction. The advantage is the address field uses fewer bits to select or register. Example: ADD (R1) R1 R0 R1 3000 3000 10 Autoincrement or Autodecrement Mode: It is similar to register indirect mode except that the register is incremented or decremented after its value is used to access memory. Example: ADD (R1) R1 R1 3000 10 3000 20 30 Direct Addressing Mode: In this mode the effective address is equal to the address part of the instructions. The operand resides in memory and its address is given directly Example. ADD 3000 by the address field of the instruction. 3000 Used to represent global variables in a program. In a branch type instruction, the address field specifies the actual 3000 10 branch address space. Allows to access limited address space. The effective address is defined to be the memory address obtained from the computation dictated by the given addressing mode. i.e. AC AC + M The direct addressing mode is also called “Absolute mode”. GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.7 Computer Organizations and Architecture Indirect Addressing mode: Example ADD 3000 3000 In this mode, the address field of the instruction gives the address where the effective address is stored in memory. 3000 8400 Allows to access larger address space Requires 2 memory cycles to read an operand 8400 10 2.1.1 Displacement addressing modes The address field of the instruction added to the content of a specific register in the CPU. i.e, Effective Address = Address part of instruction + content of Register. 2.1.2 The Various displacement addressing modes are Relative addressing mode: In this mode the content of program counter (PC) is added to the address part of the instruction. The address part is a signed number (2’s complement) that can be either positive or negative. The result produces effective address whose position in memory is relative to the address of the next instruction. EA = Address Part (off set) + PC value Used with branch-type instructions when the branch address is in the area surrounding the instruction word itself. The advantage is, address field can be specified with a small number of bits compared to direct address. Indexed Addressing mode: In this mode the content of an index register is added to the address part of the instruction. Index register contains index value. Address part specifies the beginning address of a data array in memory. EA = Address Part (base address of data array) + Index register value (index value) Used for accessing array of data. The index register can be special CPU register or any general purpose register. Base register Addressing mode: In this mode the content of a base register is added to the address part of the instruction. The base register is assumed to hold base address. The address field gives the displacement relative to this base address. EA = Address Part (displacement/offset) + Base register value (Base address) GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.8 Computer Organizations and Architecture 3 3.1 Introduction ALU DATA PATH The sequence of operations involved in processing an instruction constitutes an instruction cycle. The instruction cycle consists of phases like. (1) Fetch cycle (2) Decode cycle (3) Operand fetch cycle (4) Execute cycle To perform these, the processor unit has to perform set of operations called “Micro-Operations”. 3.2 The Basic Operations Performed are 3.2.1 Register transfers: In general, the input & output of register Ri are connected to the bus via switched controlled by the signals Ri in and Riout Ri in B u s Ri Ri out When Ri in is set to 1, the data on the bus are loaded into Ri. When Ri out is set to 1, the contents of register Ri are placed on the bus. To transfer the contents of R1 to register R4 : R4 R1. Enable the output of register R1 by setting R1 out to 1. This places the contents of R1 on the processor bus. Enable the input to register R4 by setting R4 in to 1. This loads data from the processor bus into register R4. 3.2.2 Performing ALU operations The ALU has two inputs A and B and one output A gets operand from output of MUX B gets operand from bus The result is gets stored in Z-register. GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.9 Computer Organizations and Architecture The sequence of steps for ALU operation R3 R1 (1) The contents of R1 are loaded in Y. (2) The contents of Y1R2 are applied to A & B inputs of ALU, performs ALU operation & stores the result in Z- register. (3) The contents of Z are stored in R3. The sequence of operations is (1) R1 out, Yin (2) R2 out, select Y, Add, Zin (3) Zout, R3 in The functions performed by ALU depends on the signals applied to its control lines. (Here Add line is set to 1). Only one register output can be connected to the bus during any clock cycle. The no of steps indicates no of clocks. 3.2.3 Fetching a word from memory (read operation) To read a memory word, consider MOV (R1), R2. The action needed to execute this instruction are (1) MAR [R1] (2) Start a read operation on the memory bus (3) Wait for the MFC (Memory function to complete) response from memory. (4) Load MDR from the memory bus (5) R2 [MDR] Hence the memory read operation sequence of operations is (1) R1 out, MARin, Read (2) WMFC (wait for memory operation to complete), MDR inE (3) MDRout, R2 in MDRoutE MDRout processor bus Memory bus data lines Internal MDR MDRinE MDRin GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.10 Computer Organizations and Architecture 3.2.4 Storing a word in Memory (write operation) The sequence of steps for write operation MOV R2, (1) The desired address is loaded into MAR (2) The data to be written is loaded into MDR (3) Write command is issued. (4) Wait for memory operation to complete. The sequence of operations is (1) R1 out, MAR in (2) R2out, MDRin, write (3) MDRoutE, WMFC 3.2.5 Branch Instructions A branch instruction replaces the content of PC with the branch target address. The address is obtained by adding an offset X, which is given in the branch instruction, to the updated value of PC. The control sequence for branching (unconditional is (1) PCout, PCin, Yin, WMFC (2) Zout, PCin, Yin, WMFC (3) MDRout, IRin (4) Offset – of – IRout, Add, Zin 3.2.6 Execution of complete Instruction Executing an instruction requires (Add (R3), R1) (1) Fetch the instruction (2) Fetch the first operand (memory location specified by R3) (3) Perform the addition (4) Load the result into R1 The control sequence for the execution of ADD (R3), R1 in a single-bus organization is (1) PCout, MARin, Read, Select, Add, Zin (2) Zout, PCin, Yin, WMFC (3) MDRout, IRin 4) R3out, MARin, Read (5) R1 out, Yin, WMFC (6) MDRout, select Y, Add, Zin (7) Zout, R1 in, End 3.3 Multiple – Bus Organization With simple single bus structure, the resulting control sequence is quite long because only one data item can be transferred over the bus in a clock cycle. To reduce the number of steps needed, most commercial processors provide multiple internal paths that enable several transfers to take place in parallel. The three bus organization of data path is GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.11 Computer Organizations and Architecture Bus A Bus B Bus C Incrementer PC Register file A ALU R B Instruction decoder IR MDR MAR Memory bus All general – purpose registers are combined into a single block called register file. It has two – outputs, allowing the contents of two different registers to be accessed simultaneously and have their contents placed on buses A and B. The data on bus ‘C’ to be loaded into third register during same clock. GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.12 Computer Organizations and Architecture Buses A and B are used to transfer the source operands to the A and B inputs of ALU, the result is transferred to the destination over bus C. The control sequence for instruction ADD R4, R5, R6 (1) PCout, R = B, MARin, Read, Inc PC (2) WMFC (3) MDTout B, R = B, IRin (4) R4 outA, R5 outB, select A, Add, R6 in, end. Example (1) MAR MDR IR PC GPRS ALU The ALU, bus & registers in data path are of identical size. All operations including incrementing of PC and GPRSs are to be carried out in the ALU. 2-clock cycles are needed for memory read operation. (one for loading address in MAR and loading data from memory into the MDR). ❑❑❑ GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.13 Computer Organizations and Architecture 4 4.1 Introduction CPU CONTROL UNIT DESIGN To execute instructions, the processor must have some means of generating the control signals needed in the proper sequence. The two approaches used for this purpose are (1) Hardwired control (2) Micro programmed control The purpose of control unit is to generate accurate timing & control signals. 4.1 Hardwired Control The control unit uses fixed logic circuits to interpret instructions and generate control signals from them. Every control signal is expressed as SOP (sum of products) expression and realized using digital hardware. CLK Control step counter External Inputs IR Condition codes Control signals GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.14 Computer Organizations and Architecture The below figure shows the detailed hardwired control unit design. Control step counter CLK Step decoder T1 T2 Tn INS1 External Inputs INS2 Encoder Instruction decoder IR Condition codes INSM Control signals For executing an instruction completely each step is completed in one clock period. A counter is used to keep track of the control steps. The required control signals are determined with the following information. (1) Contents of control step counter. (2) Contents of instruction register (3) Contents of condition code flags (4) External input signals, such as MFC and interrupt requests. The instruction decoder decodes the instruction loaded in IR The step decoder provides a separate signal line for each step or time slot in a control sequence. The encoder gets input from instruction decoder step decoder, external inputs and condition codes, thus uses to generate the individual control signals. After execution of each instruction, the end signal is generated which resets control step counter and makes it ready for generation of control step for next instruction. For example: (1) The encoder circuit implements the following logic function to generate Yin. Yin = T1+ T5 Add + T4 BR +….. GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.15 Computer Organizations and Architecture i.e. Here Yin signal is asserted during time interval T1 for all instructions, during T5 for add instruction, during T4 for branch instruction. The generation of Yin control signal is BR T4 T5 Add T1 Yin (2) The logic function to generate Zout signal can be given by Zout = T2 + T7 add + T6 BR + …… i.e., the Zout signal is asserted during time interval T2 for all instructions, during T7 for add instruction, during T6 for unconditional branch …….. etc. BR T6 Add T7 T1 Zout 4.2 Microprogrammed Control Every instruction in a processor is implemented by a sequence of one or more sets of micro operations. Each micro operation is associated with a specific set of control lines which when activated, causes that micro operation to take place. Using micro programmed control, control signals are generated by a program. Using this the control signal selection and sequencing information is stored in ROM or RAM called control memory (CM). GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.16 Computer Organizations and Architecture The control memory is part of control unit; it contains fixed programs (micro programs) that cannot be altered by occasional user. A control word (CW) is a word whose individual bits represent the various control signals. A sequence of CWs corresponding to the control sequence of a machine instruction constitutes the micro routine for that instruction, (micro program) The individual control words in the microprogram are referred as micro instruction. The basic organization of a microprogrammed control unit is Starting Address IR Generator Clock UPC Control CW store To read the control words sequentially from the control memory a “micro program counter” (PC) is used. sequencer UPC/CAR CDR External I/P CW Control Control Next Control Address Data address memory register register Generator (CAR) CDR The sequence of operations are: (1) CAR holds the address of next micro instruction to be read. (2) When address is available in UPC, the sequencer issues the READ command to CM. (3) The word from addressed location is read into CDR/UIR. (4) The UPC is incremented automatically by the clock, causing successive micro instructions to be read from CM. (5) The contents of UIR generates control signals which are delivered to various parts of the processor in correct sequence. The PC or CAR can be updated with various options using the address sequencer circuit as: GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.17 Computer Organizations and Architecture (1) When a new instruction is loaded in IR, the PC is loaded with the starting address of the micro routine for that instruction (called mapping process). (2) When a branch instruction is (micro) encountered and branch condition satisfied, the PC is loaded with the branch address. (3) When END instruction is encountered, the PC is loaded with address of first word. (4) In any other situation, the PC is incremented every time a new micro instruction is fetched from CM. 1011 Address Instruction code XXXX Mapping Mapping Process 0101100 Micro instruction address The transformation from the instruction code bits to an address in control memory where the routine is located is called Mapping process. The basic microinstruction format is F1 F2 F3 CD BR Address The function fields (F1, F2, F3,), condition 3 3 3 2 2 7 field (CD) & Branch field (BR) may be optional. 20 bits Comparison between Hardwired & Microprogramed Hardwired Micro programmed 1. Speed Fast Slow 2. Control functions Implemented in Hardwire Implemented in software 3. Ability to handle complex instructions Complex Easier 4. Design process Complicated Orderly and systematic 5. Instruction set size Under 100 Over 100 6. ROM size NIL 2k to 10k (20–400 bit micro instruction) 7. Applications RISC processors CISC, Main frames 4.2.1. The Micro Programmed Control Unit Can be (1) Horizontal Microprogramming: One bit per control signal. Maximum parallelism i.e., more than one signal can be simultaneously. No extra decoders are required for decoding. The length of control word is large, need to access more than once from control memory. GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.18 Computer Organizations and Architecture (2) Vertical Microprogramming: The control signals are encoded in form for k-bits 2k signals. Maximum degree of parallelism is 1. The length of micro instruction is small. Response is relatively slower. Requires a decoder additionally. To increase the degree of parallelism soft vertical microprogramming is used which divides the control signals into mutually exclusive groups. Each group is associated with associated number of control bits. (i.e., combination of both vertical and horizontal microprogramming). ❑❑❑ GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.19 Computer Organizations and Architecture 5 5.1 Memory Hierarchy MEMORY INTERFACING The memory hierarchy system consists of all storage devices. The purpose of memory hierarchy is to bridge the speed mismatch between the fastest processor to the slowest memory component at reasonable cost. Registers Primary th In the structure of memory hierarchy I level memory Cache L1 is physically positioned higher than (I + 1)H level Secondary Cache L2 Ith memory. Let Ti, Si, Ci, & Fi are the access time, size, cost per Man memory (RAM / ROM) I+1th bit and frequency of references to the Ith level memory. Therefore Magnetic disks Magnetic Tapes Ti < Ti + 1 Si < Si + 1 Ci > Ci + 1 fi > fi + 1 Since same data presents at different levels. Ith level memory is the subset of I i + 1th level. i.e. Ii Ii+1 5.2 Memory Characteristics Location: Memory can be placed in 3 locations like registers, main memory, Auxiliary (or) secondary storage. Capacity: Word size, number of words i.e. capacity = number of words * word size. Unit of transfer: Maximum number of bits that can be read or written (blocks, bytes…) Access method: Random or sequential GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.20 Computer Organizations and Architecture Performance: Access time, memory cycle time, transfer rate, physical type. Physical: Volatile / non volatile erasable / non erasable Serial Access Random Access (1) Memory is organized into units of data called (1) Each storage location has an address uniquely records/blocks accessed sequentially (2) Access time depends on position of storage (2) Access time is independent of storage location location (3) Slower to Access (3) Faster to access (4) Cheaper (4) Costly relatively (5) Nonvolatile memories (5) May be relative or non (6) Example: Magnetic tapes (6) Example: Magnetic disks. When processor reads Ith level memory, if it is found in that level, his will occur otherwise it will be fault. 5.2.1. In a Two – Level Memory System Case – I: When fault occurs, in L1 reads from L2 (Proxy Hierarchy) T1, S1, H1, C1 L1 T2, S2, C2 L2 Tavg = H1 * T1 + (1–H1) * T2 Case – II: When fault occurs in L1, must be brought from L2 to L1 memory. (strict hierarchy) Tavg = H1 * T1 + (1–H1) * (T2 + T1) 5.2.2. In a Three – Level Memory System Case – I: When fault occurs in one level, then reads from its down level. T1, S1, H1, C1 L1 GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.21 Computer Organizations and Architecture T2, S2, H2, C2 L2 T2, S2, H2, C2 L3 Tavg = H1 * T1 + (1 – H1) * H2 * T2 + (1 – H1) * (1 – H2) * T3 Case – II: When fault occurs, must access from L1. Tavg = H1 * T1 + (1–H1) (H2) (T2 + T1) + (1–H1) (1–H2) (T3 + T2 + T1) * Average cost per bit C1S1 + C 2S2 CAvg = (Two - level) S1 + S2 C1S1 + C 2S2 + C3S3 CAvg = (Three - level) S1 + S2 + S3 ❑❑❑ GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.22 Computer Organizations and Architecture 6 6.1 Input – Output Organization INPUT OUTPUT INTERFACING The input – output subsystem (I/O) provides an efficient mode of communication b/n system and outside environment. The Commonly used peripheral devices are. Keyboard, monitors, printer & magnetic tape, magnetic disc….. The input & output devices that communicates with computer & people usually with alphanumeric character from ASCII – 128-character set, 7 – bits are used represent each character. It contains 26 upper case letters, 26 lower case letters, 10 numerals, 32 special chars such as %, *, $ …… In general ASCII chars are stored in a single unit called 1 byte (8-bits). The 8th bit may be used for parity bit for error detection. 6.1.1. Input – Output Interface The I/o interface provides a method of transferring information b/n internal storage (main memory) & external I/o devices. The devices/ peripherals connected to a computer need special communication links for interfacing with CPU because. Peripherals are electromechanical & electromagnetic devices, whereas CPU & memory are electronic devices hence the operations are different. The data transfer rate is shown then the transfer rate of CPU. The data codes & formats are different. The operating modes of peripherals is different and must be controlled. The four types of I/o command an interface will receive are 6.1.2. Control Command Issued to activate the peripheral & and to inform it what to do. Depending on mode of operation a control command sequence is issued. 6.1.3. Status Command Used to test various status conditions in the interface & the peripheral. Eg. Checking status of device, errors detected by interface. The status information is maintained in status register. 6.1.4. Data Output Causes the interface to respond by transferring data from the bus into one of its registers buffer. GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.23 Computer Organizations and Architecture 6.1.5. Data Input With this command, interface receives an item of data from peripheral & places it in its buffer register. Then transfers to data bus of processor. The I/o read & I/o write are enabled during I/o transfer, memory ready & memory write are enabled during memory transfer. The two configurations possible for communication are. (a) Isolated I/o: The CPU has distinct input & output instructions, each instruction is associated with the address of an interface register. When CPU fetches & decodes the opcode, it places address associated with the instruction into the common address lines, at the same time, it enables the I/o read or I/o write control line. An isolated I/o method isolates memory & I/o addresses each has its own address space, hence memory address values are not affected by interface address. Uses one common bus for memory & I/o, with common control lines. (b) Memory – Mapped I/o: In this configuration, only one set of read & write signals and do not distinguish b/n memory & I/o addresses. and I/o. 6.2 Modes of Data Transfer Data transfer b/n the central computer (main memory) and I/o device may be handled in a variety of modes like programmed I/o, Interrupt – Initiated I/o & Direct memory access. 6.2.1. Programmed I/O Programmed I/o operations are the result of I/o instruction written in computer program: Example: In programmed I/o, the I/o device doesn’t have direct access to memory. A transfer from I/o device to memory requires the CPU to extents several instruction. Data bus Interface I/O bus Address bus Data valid I/O CPU Data Regr device I/O read Data Staters accepted I/O write register F – flag bit (Data transfer from I/o device to CPU) The device transfers bytes of data one at a time as they are available. When a byte of data is available the device places it on I/o bus and enables its data valid line. The Interface accepts the byte into its data register and enables the data accepted line. The interface sets the Flag bit. Then the device disables the data valid line, but it will not transfer another byte until the data accepted line is disabled by the interface. GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.24 Computer Organizations and Architecture 6.2.2. Interrupt – Initiated I/O In programmed I/O, the CPU stays in a program loop until the I/o unit indicates that it is ready to transfer. Hence this process keeps processor busy needlessly. meanwhile keeps monitoring the device. When the interface determining that the device is ready, it generates an interrupt request Priority Interrupt: It is a system that establishes a priority over the various sources to determine which condition is to be serviced first when two or more requests arrive simultaneously. It also determines which conditions are permitted during processing of an 6.2.3. Daisy – Chaining Priority (Serial – Priority Interrupt) The system consists of a serial connection of all devices that request an interrupt. The device with the highest priority is placed in first position followed by lower – priority devices. The lowest priority will be placed last in the chain. Processor data bus Device 1 Device 2 Device 3 PI PO PI PO PI PO To next device Interrupt regst INT CPU interrupt Acknowledgement INTACK 6.2.4. Parallel Priority Interrupt This method uses a register whose bits are set separately by the interrupt signal from each device Priority is established according to the position of bits in the register. The CKT will also include a mask register to control the starting of each interrupt request. The mask register can be programmed to disable low-priority interrupts while a high – priority device is being revised. If will also allow a high-priority device to interrupt the CPU while low – priority device is being serviced. GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.25 Computer Organizations and Architecture Disk 0 I0 Printer 1 Printing I1 Vector Adding to encoder CPU Reader 2 I2 Keyboard 3 I3 0 1 2 3 Mask Regr 6.2.5. Direct Memory Access (DMA) The speed of data transfer can be increased by removing CPU from the path and the peripheral device to manage the memory buses directly. This kind of transfer technique is called DMA transfer during DMA transfer the CPU is idle and has no control of the memory buses. The DMA controller takes control of buses to manage the transfer directly b/n i/o device & memory. To keep CPU idle to use bus, the “bus request (BR) input is used by the DMA controller to request the bus to relinquish control of the buses. When BR is active, CPU terminates current execution and places data, control & address lines in high impedance state. Then the bus behaves like an open circuit. The CPU then activates. “Bus grant” (BG) output to DMA, then DMA takes control of the bus to transfer without CPU intervention, when the transfer completes DMA disables the Bus request & CPU disables the bus grant. Then the CPU gets the control of the buses. GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.26 Computer Organizations and Architecture 6.3. The DMA Transfer Take Place in Two Modes (1) Burst transfer: A block sequence consisting of a number of memory words is transferred in a continuous burst while DMA controller is master of the memory buses. Used to transfer fast devices like magnetic disks for large volumes. In this mode the data transfer will not be stopped, until an entire block is transferred. (2) Cycle Stealing: In this mode, DMA controller transfers one word at a time, after which it must return control of the buses to the CPU, later it will ‘steal’ memory cycle when CPU is idle. Address bus Address bus Data bus Data bus buffers buffers Data bus Address DMA register DS BR Data bus select RS Control Word count CPU Address Register bus select RD logic register BG Read WR Write write BR Bus BG Control grant register DMA request I/O device DMA acknowledgment DMA controller The DMA controller communicates with the CPU the data bus and control lines. The register in DMA are selected by CPU through address bus by enabling DS & RS inputs. When BG = 1 (bus grant) the CPU has relinquished the buses and the DMA can communicate directly with the memory by specifying address in address bus & activating the RD or WR control. The DMA communicates with the external device through the request and acknowledge lines. GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.27 Computer Organizations and Architecture 6.4. The CPU Initialize the DMA by Sending (1) The starting address of memory block where data are available (for read) or where data are to be stored (for write) (2) Word count, the no of words in memory block. (3) Control of specify mode of transfer such as read or write. (4) A control to start the DMA transfer. Once DMA is initialized, the CPU stops communicating with the DMA unless it receivers an interrupt signal. 6.5. DMA Transfer When the peripheral device sends a DMA request, the DMA controller activates the BR line, informing the CPU to relinquish the buses. The CPU responds with its BG line, informing DMA that its buses are disabled. DMA then puts the current value of its address register into the adder bus and initiates RD or WR signal. It then sends DMA acknowledge to the peripheral device. When BG = 0, then CPU communicates with the internal DMA registers, when BG = 1 then RD & WR lines are used from DMA controller to RAM to specify read or write operation. Interrupt RAM (Random Access BG memory) BR RD WR ADD Data RD WR ADD Data Read control Write control Address bus Data bus bus Address select RD WR ADD Data DMA acknowledge DS I/O RS Peripheral device BR DMA request BG Interrupt When the device receives DMA acknowledge, it puts a word in the data bus (write) or receives on word from the data bus (read). In this way the peripheral communicates with memory without any involvement of CPU. For each word transfer DMA increments the address register and decrements its word count register. When count reaches to zero, the DMA disables bus request line so that the CPU can continue to execute its program DMA transfer is very useful for fast transfer of data. ❑❑❑ GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.28 Computer Organizations and Architecture 7 PIPELINING 7.1 Introduction “Parallel processing” denotes a large class of techniques that are used to provide simultaneous data–processing tasks for the purpose of increasing the computational speed of a computer system. The purpose of parallel processing is to speed up the computer processing capability and increases its throughput. “Through put” is the amount of processing that can be accomplished during a given interval of time”. Parallel processing can be viewed from various levels as (i) At registers level, Registers with parallel load operate with all the bits of the word simultaneously. (Instead of shift registers). This is at lowest level. (ii) Higher level of complexity can be achieved by having a multiplicity of functional units that perform identical or different operations simultaneously. (iii) By distributing the data among the functional units in multiple. Eg: The arithmetic logical & shift operations can be separated. (iv) Using multiple processors. (Flynn’s classification) (SISD, SIMD, MISD, MIMD). (v) Using pipelining in unit processor systems. 7.2 Pipelining Pipelining is a technique of decomposing a sequential process into sub operations, with each sub process being executed in a special dedicated segment that operates concurrently with all other segments. The result obtained from one segment is transferred to the next segment in the pipeline. The final result is obtained after the data have passed through all segments. Example: The perform Ai * Bi + Ci , for i = 1 to 6. Each sub operation multiply & add implemented in a segment with in a pipeline. Each segment has one or more registers. GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.29 Computer Organizations and Architecture Ai Bi Ci Seg1 R1 R2 Multiplier Seg2 R4 R3 addder Seg3 R5 R1 through R5 are registers that receive new data with every clock pulse. The effect of each clock pulse can be shown as Segment – 1 Segment - 2 Segment - 3 Clock pulse Number R1 R2 R3 R4 R5 1 A1 B1 --- --- --- 2 A2 B2 A1 * B1 C1 --- 3 A3 B3 A2 * B2 C2 A1 * B1 + C1 4 A4 B4 A3 * B3 C3 A2 * B2 + C2 5 A5 B5 A4 * B4 C4 A3 * B3 + C3 6 A6 B6 A5 * B5 C5 A4 * B4 + C4 7 --- --- A6 * B6 C6 A5 * B5 + C5 8 --- --- --- --- A6 * B6 + C6 The “General structure” of a “4 – segment pipeline” is Clock input S1 S2 R1 R2 S3 R3 S4 R4 GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.30 Computer Organizations and Architecture The operands pass through all four segments in a fixed sequence. Each segment consists of combinational circuit Si that performs a sub operation. The segments are separated by registers Ri that holds the intermediate results between stages. Information flows between adjacent stages under the control of common clock applied to all registers simultaneously. A task as the total operation performed going through all the segments in the pipeline. The behaviour of a pipeline can be illustrated with “Space – time diagram”. It shows the segment utilization as a function of time. The following diagram is for tasks T1 to T6 executed in 4 – segment pipeline. Segments 1 2 3 4 5 6 7 8 9 Clock cycles S–1 T1 T2 T3 T4 T5 T6 S–2 T1 T2 T3 T4 T5 T6 S–3 T1 T2 T3 T4 T5 T6 S–4 T1 T2 T3 T4 T5 T6 “An Arithmetic pipeline divides an arithmetic operation into sub operations for execution in the pipeline segments.” 7.2.1 Pipeline performance evaluation Consider a K – segment pipeline with a clock cycle time tp is used to execute n tasks. To complete n – tasks using a K – segment pipeline requires K + (n–1) clock cycles. (1) The number of clock cycles needed in a pipeline to execute 100 tasks in 6 segments. Ans. K=6 n = 100 6 + (100-1) = 105 clocks The processing time in each stage is called as “stage delay”, In a pipeline. (tp). The time delay due to interstage transfer of data is “interstage delay” (td) The interstage delays can be same between the stages, stage delay very from stage to stage based on segment operation. The time period for clock cycle, ‘t’p is tp = max {ti} + td tp = t m + t d Since tm > > td, maximum stage delay denotes the clock period. i.e. tp = t m The total time required in pipeline execution is TP = [K + (n–1)] * tp. GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.31 Computer Organizations and Architecture 7.2.2 Speed up Ratio The speed up of a K – stage pipeline over an equivalent non-pipelined processor is defined as time without pipeline S= time with pipeline Consider a non-pipeline processor that performs the same operation as pipelined and takes a time equal to “tn” to complete each task. n*tn S= (K + (n -1))* t p As the number of tasks increases, k + n - 1 approaches to n Since tn ~ K * tp n *K * tp tn ~ ntp (K + (n -1))* t p S= nK S= K + (n -1) n tn S= K+n–1 ~ n n *t p tn S= tp For large number of tasks nk tn S= S= K + n -1 tp Or nK K tp = S= n tp S=K S=K The maximum speed up that can be achieved in a pipelined processor is equal to “number of stages”. [as n is large, K + n –1 is K]. (K). Sideal = Smax = K 7.2.3 Efficiency The efficiency of a pipeline is defined as the ratio of speed up factor and the number of stages in the pipeline. S nk Ek = = /K K K + (n - 1) GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.32 Computer Organizations and Architecture n EK = K + n -1 S EK = K 7.2.4 Throughput It is the number of tasks that can be completed by a pipeline per unit time. n HK = (K + (n -1)) t p 7.2.5 Stall cycle The performance of a pipeline is influence by Number of instructions Uneven stage delays Buffer overhead (Interstage delay) Dependencies. The objective of pipelines is CPIavg = 1. (i.e. Clocks per instruction – CPI) Depending on control mechanism used, the pipelines can be categorized as 7.2.6 I Asynchronous pipeline Data flow along the pipeline stages is controlled by handshaking protector i.e. when Si is ready to send/transmit, it sends ready signal to Si+1 and Si+1 send A. ck to Si after receiving. Input output data data S1 Ready S2 Ready Sn Ack Ack 7.2.7 II Synchronous pipeline Clocked high speed latches are used to interface between stages. At the falling edge of the clock pulse, all latches transfer data to the next stages simultaneously. Data L L L Input A A A T S1 T S2 T Sn C C C H H H Data Clock output GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.33 Computer Organizations and Architecture 7.2.8 Instruction pipelining An instruction pipeline operates on a stream of instructions by overlapping the Fetch, Decode, Execute and other phases of instruction cycle. The instruction cycle with 4 – segments is Instructions 1 2 3 4 5 6 7 8 9 Clock I1 FI DA FO FX I2 FI DA FO EX I3 FI DA FO EX I4 FI DA FO EX I5 FI DA FO EX Here FI - fetches an instruction DA - Decodes the instruction & calculates effective address FO - fetches the operand EX - Executes the instruction “In general, each stage in a pipeline is expected to complete its operation in one clock cycle, hence the clock period must allow the longest task to be completed. “The performance of a pipeline is high if different stages require about same amount of time.” The use of cache solves the memory access problem. (5) Consider a 4–stage pipeline, where different instructions require different number of clock cycles, at different stages. The total number of clocks required for execution of 4 instructions is ______ Ans.: 1 2 3 4 I1 2 1 2 2 7 I2 1 3 3 2 9 I3 2 2 2 2 8 I4 1 3 1 1 6 30 Through sequential process 30 clocks. But using pipeline the number of clock cycles required is 14. That is, 1 2 3 4 5 6 7 8 9 10 11 12 13 14 I1 S1 S1 S2 S3 S3 S4 S4 I2 S1 S2 S2 S2 S3 S3 S3 S4 S4 I3 S1 S1 - S2 S2 - S3 S3 S4 S4 I4 S1 - - S2 S2 S2 S3 - S4 GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.34 Computer Organizations and Architecture 7.2.9 Data Hazards A data hazard is situation in which the pipeline is stalled because the data to be operated on are delayed for some reason, as illustrated in Figure. We will now examine the issue of availability of data in some detail. Consider a program that contains two instructions, I1 followed by I2. When this program is executed in a pipeline, the execution of I2 can begin before the execution of I1 is completed. This means that the results generated by I1 may not be available for use by I2. We must ensure that the results obtained when instructions are executed in a pipelined processor are identical to those obtained when the same instructions are executed sequentially. The potential for obtaining incorrect results when operations are performed concurrently can be demonstrated by a simple example. Assume that A = 5, and consider the following two operations: A3+A B4A When these operations are performed in the order given, the result is B = 32. But if they are performed concurrently, the value of A used in computing B would be the original value, 5, leading to an incorrect result. If these two operations are performed by instructions in a program, then the instructions must be executed one after the other, because the data used in the second instruction depend on the result of the first instruction. On the other hand, the two operations. A5C B 20 + C Can be performed concurrently, because these operations are independent. Time Clock cycle 1 2 3 4 5 6 7 8 9 Instruction I1 (Mul) F1 D1 E1 W1 I2 (Add) F2 D2 D2A E2 W2 I3 F3 D3 E3 W3 I4 F4 D4 E4 W4 Fig: Pipeline stalled by data dependency between D2 and W1 This example illustrates a basic constraint that must be enforced to guarantee correct results. When two operations depend on each other, they must be performed sequentially in the correct order. This rather obvious condition has far-reaching consequences. Understanding its implications is the key to understanding the variety of design alternatives and trade-offs encountered in pipelined computers. Consider the pipeline in Figure. The data dependency just described arises when the destination of one instruction is used as a source in the next instruction. For example, the two instructions Mul R2.R3.R4 Mul R5.R4.R6 GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.35 Computer Organizations and Architecture Give rise to a data dependency. The result of the multiply instruction is placed into register R4. which in turn is one of the two source operands of the Add instruction. Assuming that the multiply operation takes one clock cycle to complete; execution would proceed as shown in Figure. As the Decode unit decodes the Add instruction in cycle 3, it realizes that R4 is used as a source operand. Hence, the D step of that instruction cannot be completed unit the W step of multiply instruction has been completed. Completion of step D2 must be delayed to clock cycle 5, and is shown as step D2A in figure. Instruction I3 is fetched in cycle 3, but its decoding must be delayed because step D3 cannot precede D2. Hence, pipelined execution is stalled for two cycles. 7.2.10 Operand Forwarding The data hazard just described arises because one instruction, instruction I2 in Figure, is waiting for data to be written in the register file. However, these data are available at the output of the ALU once the Execute stage completes step E 1. Hence, the delay can be reduced, or possibly eliminated, if we arrange for the result of instruction I1 to be forwarded directly for use in step E2. Figure shows a part of the processor data path involving the ALU and the register file. Source 1 Source 2 SRC1 SRC2 Register file ALU RSLT Destination (a) Datapath SRC1, SRC2 RSLT E: Execute W: Write (ALU) (Register file) Forwarding Path (b) Position of the source and result registers in the processor pipeline Interstage buffers needed for pipelined operation, as illustrated in Figure. With reference to Figure, registers SRC1 and SRC2 are part of buffer B2 and RSLT is part of B3. The data forwarding mechanism is provided by the blue connection GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.36 Computer Organizations and Architecture lines. The two multiplexes connected at the inputs to the ALU allow the data on the destination bust to be selected instead of the contents of either the SRC1 or SRC2 register. When the instructions in Figure 8.6 are executed in the data path of Figure, the operations performed in each clock cycle are as follows. After decoding instruction I2 and detecting the data dependency, a decision is made to use data forwarding. The operand not involved in the dependency, register R2, is read and loaded in register SRC1 in clock cycle 3. In the next clock cycle, the product produced by instruction I1 is available in register RSLT, and because of the forwarding connection, it can be used in step E2. Hence, execution of I2 proceeds without interruption. 7.2.11 Handling Data Hazards in Software In Figure, we assumed the data dependency is discovered by the hardware while the instruction is being decoded. The control hardware delays reading register R4 unit cycle 5, thus introducing a 2-cycle stall unless operand forwarding is used. An alternative approach is to leave the task of detecting data dependence and dealing with them to the software. In this case, the compiler can introduce the two-cycle delay needed between instructions I1 and I2 by inserting NOP (No- operation) instructions, as follows: I1 : Mul R2, R3, R4 NOP NOP I2 : Add R5, R4, R6 If the responsibility for detecting such dependencies is left entirely to the software, the compiler must insert the NOP instructions to obtain a correct result. This possibility illustrates the close link between the compiler and the hardware. A particular feature can be either implemented in hardware or left to the compiler. Leaving tasks such as inserting NOP instructions to the compiler leads to simpler hardware. Being aware of the need for a delay, the compiler can attempt to reorder instructions to perform useful tasks in the NOP slots, and thus achieve better performance. On the other hand, the insertion of NOP instructions leads to larger code size. Also, it is often the case that a given processor architecture has several hardware implementations, offering different features. 7.3 INSTRUCTION HAZARDS The purpose of the instruction fetch unit is to supply the execution units with a steady stream of instructions. Whenever this stream is interrupted, the pipeline stalls, as Figure illustrates for the case of cache miss. A branch instruction may also cause the pipeline to stall. We will now examine the effect of branch instructions and the techniques that can be used for mitigating their impact. We start with unconditional branches. 7.4 UNCONDITIONAL BRANCHES Figure shows a sequence of instructions being executed in a two-stage pipeline. Instructions I1 to I3 are stored at successive memory addresses, and I2 is a branch instruction. Let the branch target be instruction Ih. In clock cycle 3, the fetch operation for instruction I3 is in progress at the same time that the branch instruction is being decoded and the target address computed. In clock cycle 4, the processor must discard I3, which has been incorrectly fetched, and fetch instruction Ih. In the meantime, the hardware unit responsible for the Execute (E) step must be told to do nothing during that clock period. Thus, the pipeline is stalled for one clock cycle. The time lost as a result of a branch instruction is often referred to as the branch penalty. In Figure, the branch penalty is one clock cycle. For a longer pipeline, the branch penalty may be higher. For example, Figure shows the effect of a branch GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.37 Computer Organizations and Architecture instruction on a four-stage pipeline. We have assumed that the branch address is computed in step E2. Instructions I3 and I4 must be discarded, and the target instruction, Ih, is fetched in clock cycle 5. Thus, the branch penalty is two clock cycles. Time Clock cycle 1 2 3 4 5 6 Instruction I1 F1 E1 I2 (Branch ) F2 E2 Execution unit idle I3 F3 X Ik Fk Ek Ik+1 Fk+1 Ek+1 Reducing the branch penalty requires the branch address to be computed earlier in the pipeline. Typically, the instruction fetch unit has dedicated hardware to identify a branch instruction and compute the branch target address as quickly as possible after an instruction is fetched. With this additional hardware, both of these tasks can be performed in step D 2, leading to the sequence of events shown in Figure. In this case, the branch penalty is only one clock cycle. Time Clock cycle 1 2 3 4 5 6 7 8 I1 F1 D1 E1 W1 I2 (Branch ) F2 D2 E2 I3 F3 D3 X I4 F4 X Ik Fk Dk Ek Wk Ik+1 Fk+1 Dk+1 Ek+1 (a) Branch address computed in Execute stage GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.38 Computer Organizations and Architecture Time Clock cycle 1 2 3 4 5 6 7 I1 F. D1 E1 W1 I2 (Branch ) F2 D2 I3 F3 X I4 Fk Dk Ek Wk Ik Fk+1 Dk+1 Ek+1 Ik+1 (b) Branch address computed in Decode stage 7.4.1 Instruction Queue and Prefetching Either a cache miss or a branch instruction stalls the pipeline for one or more clock cycles. To reduce the effect of these interruptions, many processors employ sophisticated fetch units that can fetch instructions before they are needed and put them in a queue. Typically, the instruction queue can store several instructions. A separate unit, which we call the dispatch unit, takes instructions from the front of the queue and Instruction fetch unit Instruction queue F:Fetch... instruction D: Dispatch E: Execute W: Write Decode instruction results unit Fig: Use of an instruction queue in the hardware organization of fig (b) Sends them to the execution unit. This leads to the organization shown in Figure 8.10. The dispatch unit also performs the decoding function. To be effective, the fetch unit must have sufficient decoding and processing capability to recognize and execute branch instructions. It attempts to keep the instruction queue filled at all times to reduce the impact of occasional delays when fetching instructions. When the pipeline stalls because of a data hazard, for example, the dispatch unit is not able to issue instructions from the instruction queue. However, the fetch unit continues to fetch instructions and add them to the queue. Conversely, if there is a delay in fetching instructions because of a branch or a cache miss, the dispatch unit continues to issue instructions from the instruction queue. GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.39 Computer Organizations and Architecture 7.5 CONDITIONAL BRANCHES AND BRANCH PREDICTION A conditional branch instruction introduces the added hazard caused by the dependency of the branch condition on the result of a preceding instruction. The decision to branch cannot be made until the execution of that instruction has been completed. Branch instructions occur frequently. In fact, they represent about 20 percent of the dynamic instruction count of most programs. (The dynamic count is the number of instruction executions, taking into account the fact that some program instructions are executed many times because of loops.) Because of the branch penalty, this large percentage would reduce the gain in performance expected from pipelining. Fortunately, branch instructions can be handled in several ways to reduce their negative impact on the rate of execution of instructions. 7.5.1 Delayed Branch In Figure the processor fetches instruction I3 before it determines whether the current instruction, I2, is a branch instruction. When execution of I2 is completed and a branch is to be made, the processor must discard I3 and fetch the instruction at the branch target. The location following a branch instruction is called a branch delay slot. There may be more than one branch delay slot, depending on the time it takes to execute a branch instruction. For example, there are two branch delay slots in Figure and one delay slot in Figure. The instructions in the delay slots are always fetched and atleast partially executed before the branch decision is made and the branch target address is computed. A technique called delayed branching can minimize the penalty incurred as a result of conditional branch instructions. The idea is simple. The instructions in the delay slots are always fetched. Therefore, we would like to arrange for them to be fully executed whether or not the branch is taken. The objective is to be able to place useful instructions in these slots. If no useful instructions can be placed in the delay slots, these slots must be filled with NOP instructions. LOOP Shift left R1 Decrement R2 Branch=0 LOOP NEXT Add R1.R3 (a) Original program loop LOOP Decrement R2 Branch=0 LOOP Shift left R1 NEXT Add R1.R3 (b) Reordered instructions GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.40 Computer Organizations and Architecture Time Clock cycle 1 2 3 4 5 6 7 8 Instruction Decrement F E Branch F E Shift (delay slot) F E Decrement (Branch taken) F E Branch F E Shift (delay slot) F E Add (Branch not taken) F E The effectiveness of the delayed branch approach depends on how often it is possible to reorder instructions as in Figure. Experimental data collected from many programs indicate that sophisticated compilation techniques can use one branch delay slot in as many as 85 percent of the cases. For a processor with two branch delay slots, the compiler attempts o find two instructions preceding the branch instruction that it can move into the delay slots without introducing a logical error. The changes of finding two such instructions are considerably less than the changes of finding one. Thus, if increasing the number of pipeline stages involves an increase in the number of branch delay slots, the potential gain in performance may not be fully realized. 7.5.2 Branch Prediction Another technique for reducing the branch penalty associated with conditional branches is to attempt to predict whether or not a particular branch will be taken. The simplest form of branch prediction is to assume that the branch will not take place and to continue to fetch instructions in sequential address order. Until the branch condition is evaluated, instruction execution along the predicted path must be done on a speculative basis. Speculative execution means that instructions are executed before the processor is certain that they are in the correct execution sequence. Hence, care must be taken that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed by executed. If the branch decision indicates otherwise, the instructions and all their associated data in the execution units must be purged, and the correct instructions fetched and executed. GATE WALLAH COMPUTER SCIENCE & INFORMATION TECHNOLOGY HANDBOOK 4.41 Computer Organizations and Architecture Time Clock cycle 1 2 3 4 5 6