Computer Architecture PDF
Document Details
Uploaded by HonestDryad
UNSTPB, FILS
Bujor Pavaloiu
Tags
Related
Summary
These notes cover computer architecture, including topics like computer logic, instruction sets, assembly language, and high-performance computing. The author is Bujor Pavaloiu, and the notes are from UNSTPB, FILS.
Full Transcript
Computer Architecture Bujor Pavaloiu 1 Computer Architecture 1 Introduction 2-4 Computer Logic + Computer Arithmetic 5 Digital Representation/ Access/ Scalability 6 Instruction Set 7 Assembly Language 8 Processing Unit 9 Memory System...
Computer Architecture Bujor Pavaloiu 1 Computer Architecture 1 Introduction 2-4 Computer Logic + Computer Arithmetic 5 Digital Representation/ Access/ Scalability 6 Instruction Set 7 Assembly Language 8 Processing Unit 9 Memory System 10 Input-Output 11 High Performance Computing Computer Architecture - UNSTPB, FILS 2 Computer Architecture 1 Introduction 2 Computer Logic + Computer Arithmetic 3 Instruction Set 4 Assembly Language 5 Processing Unit 6 Memory System 7 Input-Output 8 High Performance Computing Computer Architecture - UNSTPB, FILS 3 Historical Evolution Year Name Made by Comments 1834 Analytical Eng. Babbage First attempt to build a digital computer 1936 Z1 Zuse First working relay calculating machine 1943 COLOSSUS British gov't First electronic computer 1944 Mark I Aiken First American general-purpose computer 1946 ENIAC I Eckert/Mauchl Modern computer history starts here 1949 EDSAC Wilkes First stored-program computer 1951 Whirlwind I M.I.T First real-time computer 1952 IAS Von Neumann Most current machines use this design 1960 PDP-1 DEC First minicomputer (50 sold) 1961 1401 IBM Enormously popular small business machine 1962 7094 IBM Dominated scientific computing in the early 1960s 1963 B5000 Burroughs First machine designed for a high-level language 1964 360 IBM First product line designed as a family 1964 6600 CDC First scientific supercomputer 1965 PDP-8 DEC First mass-market minicomputer (50,000 sold) 1970 PDP-11 DEC Dominated minicomputers in the 1970s 1974 8080 Intel First general-purpose 8-bit computer on a chip 1974 CRAY-1 Cray First vector supercomputer 1978 VAX DEC First 32-bit superminicomputer 1981 IBM PC IBM Started the modern personal computer era 1985 MIPS MIPS First commercial RISC machine 1987 SPARC Sun First SPARC-based RISC workstation 1990 RS6000 IBM First superscalar machine Computer Architecture - UNSTPB, FILS 4 ENIAC ◼ 1943-1946 ◼ Electronic Numerical Integrator And Computer ◼ 18.000 vacuum tubes ◼ 30 tones ◼ 150 m2 ◼ 140 kW ◼ Decimal ◼ 5000 additions per second ◼ John von Neumann Computer Architecture - UNSTPB, FILS 5 Chip Year MHz Transistors Memory Notes 4004 1971 0.108 2,300 640 First microprocessor on a chip 8008 1972 0.108 3,500 16KB First 8-bit microprocessor 8080 1974 2 6,000 64KB First general-purpose CPU on a chip 8086 1978 5-10 29,000 1 MB First 16-bit CPU on a chip 8088 1979 5-8 29,000 1 MB Used in IBM PC 80286 1982 8-12 134,000 16MB Memory protection present 80386 1985 16-33 275,000 4 GB First 32-bit CPU 80486 1989 25-100 1.2M 4 GB Built-in 8K cache memory, FPU Pentium 1993 60-233 3.1M 4 GB Two pipelines; later models had MMX Pentium Pro 1995 150-200 5.5M 4 GB Two levels of cache built in Pentium II 1997 233-400 7.5M 4 GB Pentium Pro plus MMX (SIMD) Pentium III 1999 450-1200 10M 4 GB Streaming SIMD Extention (SSE), 80 bit Pentium 4 2000 1500-4000 42M 64GB Hyperthreading, Br. Prd., SSE2,3, 128 bit Itanium 2001 800-2000 25M 64GB 64-bit CPU, SSE3 Computer Architecture - UNSTPB, FILS 6 Computer Types Type Price (USD) Example Disposable Computer 1 Gadgets Embeded Computer 10 Home appliances Game Computer 100 Game stations Personal Computer 1K Desktop or Notebook Server 10K Network Server Collection of WSs 100K Department Mainframe 1M Bank Data Processing Supercomputer 10M Weather forecasting Computer Architecture - UNSTPB, FILS 7 Abstract layers of computer use User Interface Application Operating System Drivers, BIOS Hardware Computer Architecture - UNSTPB, FILS 8 Abstract layers of computer Level 5 Problem-oriented language level Translation (compiler) Level 4 Assembly language level Translation (assembler) Partial interpretation Level 3 OS machine level (operating system) Interpretation (mprog) Level 2 Instruction set architecture level or direct execution Level 1 Microarchitecture level Hardware Level 0 Digital Logic Level Computer Architecture - UNSTPB, FILS 9 Computer Structure CPU Main Memory Data Bus Address Bus Control Bus I/O Computer Architecture - UNSTPB, FILS 10 PC structure ◼ Power Source ◼ Circuit Board - MotherBoard ◼ Processor ◼ Memory Slots ◼ Chips ◼ Rear/Front Connectors ◼ Sockets where the edge connectors of I/O boards can be inserted Computer Architecture - UNSTPB, FILS 11 PC structure Monitor Keyboard HD Network Main CPU Video Keyboard HD Network Memory Controller Controller Controller Controller … Bus Computer Architecture - UNSTPB, FILS 12 Device Controllers ◼ The job of a controller is to control its I/O device and handle bus access for it. ◼ When a program wants data from the disk for example, it gives a command to the disk controller, which then issues seeks and other commands to the drive. When the proper track and sector have been located, the drive begins outputting the data as a serial bit stream to the controller. The controller breaks the bit stream up into units and sends them further. ◼ A controller that reads or writes data to or from memory without CPU intervention performs DMA (Direct Memory Access). ◼ When the transfer is completed, the controller normally causes an interrupt, forcing the CPU to suspend running its current program and start running a special procedure, called an interrupt handler, to check for errors, take any special action needed, and inform the operating system that the I/O is now finished. When the interrupt handler is finished, the CPU continues with the program that was suspended when the interrupt occurred Computer Architecture - UNSTPB, FILS 13 Later PC Structure Main CPU Memory Monitor SCSI chain USB chain Network Memory Bus PCI Bridge Video SCSI USB Network Controller Controller Controller Controller … PCI Bus ISA Bridge ISA Bus Printer Sound … Modem Controller Card Computer Architecture - UNSTPB, FILS 14 CPU Structure ALU Registers Control Computer Architecture - UNSTPB, FILS 15 Computer Architecture Bujor Pavaloiu 16 Computer Architecture 1 Introduction 2-4 Computer Logic + Computer Arithmetic 5 Digital Representation/ Access/ Scalability 6 Instruction Set 7 Assembly Language 8 Processing Unit 9 Memory System 10 Input-Output 11 High Performance Computing Computer Architecture - UNSTPB, FILS 17 Recap ◼ See the lectures in Digital Integrated Circuits ◼ Accent on: ◼ Implementation with transistors. BP and CMOS. ◼ Logical function of N variables ◼ Karnaugh minimization ◼ Scalability (using a solution for more complicated problems). Examples: MUX, Encoders and Decoders, ALUs ◼ Complexity – number of gates used Computer Architecture - UNSTPB, FILS 18 Recap ◼ Fan in and fan out ◼ Depth of a circuit (delays). Compare AND gate with 2N inputs made using “serial” addition of 2 inputs AND gates and the “pyramidal” construction. ◼ MUXes and Decoders used for implementation of logic functions ◼ Gates used for logical functions. Basic/Fundamental gates vs Universal ones. ◼ Gates used for arithmetical comparisons. Computer Architecture - UNSTPB, FILS 19 Recap ◼ Half Adder ◼ Full Adder ◼ 2 bit Full Adder using 2 1 bit Full Adders ◼ Scaling addition operation ◼ Carry Look-Ahead Computer Architecture - UNSTPB, FILS 20 Recap ◼ Combinational circuits vs. Sequential circuits ◼ Implementation of Latches using NAND and NOR Gates. Forbidden Inputs. ◼ JK, D-type, T Flip-Flops. ◼ Flip-Flop vs Latches. ◼ Reset and Clear for FFs. ◼ Races. ◼ Finite State Machines Computer Architecture - UNSTPB, FILS 21 Recap ◼ Asynchronous counters. Frequency dividers. ◼ Synchronous vs. Asynchronous circuits. Synchronous counters. Ring counter. Johnson counter. ◼ Decade counters. Special conditions. ◼ Shift registers. ◼ Parallel in – serial out shift Register ◼ Serial in – parallel out shift Register ◼ 4*3 Ram Memory ◼ Static RAM vs. Dynamic RAM Computer Architecture - UNSTPB, FILS 22 Computer Architecture Bujor Pavaloiu 23 Computer Architecture 1 Introduction 2-4 Computer Logic + Computer Arithmetic 5 Digital Representation/ Access/ Scalability 6 Instruction Set 7 Assembly Language 8 Processing Unit 9 Memory System 10 Input-Output 11 High Performance Computing Computer Architecture - UNSTPB, FILS 24 Computer Arithmetic ◼ Numeration Bases ◼ Binary Representation ◼ Fixed Point Representation ◼ Floating Point Representation ◼ Fixed Point Arithmetic ◼ Floating Point Arithmetic Computer Architecture - UNSTPB, FILS 25 Binary, Octal, Decimal and Hexa(decimal) 2 8 10 16 0000 00 00 0 0001 01 01 1 0010 02 02 2 0011 03 03 3 0100 04 04 4 0101 05 05 5 0110 06 06 6 0111 07 07 7 1000 10 08 8 1001 11 09 9 1010 12 10 A 1011 13 11 B 1100 14 12 C 1101 15 13 D 1110 16 14 E 1111 17 15 F Computer Architecture - UNSTPB, FILS 26 Remainder and Multiplication Methods 193.2310 = ?2 Integer Part Fraction Part 193 /2 = 96 1 0.23 *2 = 0.46 96 /2 = 48 0 0.46 *2 = 0.92 48 /2 = 24 0 0.92 *2 = 1.84 24 /2 = 12 0 0.84 *2 = 1.68 12 /2 = 6 0 0.68 *2 = 1.36 6 /2 = 3 0 0.36 *2 = 0.72 3 /2 = 1 1 0.72 *2 = 1.44 1 /2 = 0 1 0.44 *2 = 0.88 193.2310 = 11000001.001110102 INFINITE Computer Architecture - UNSTPB, FILS 27 Fixed Point Representation ◼ In this representation, the position of the decimal point is fixed and the value of each digit in the number depends on its position relative to it. ◼ The positional form of a number is a set of side-by-side digits given generally in fixed-point form as: MSD Radix Point LSD n r= nk-1 nk-2 … n1 n0. n-1 n-2 … n-m+1 n-m r Integer Part Fraction Part Basis ◼ The first k bits represent the integer part and the last m bits represent the fraction part. The number is represented in basis r, so it can be represented in polynomial form as: k −1 nr = nk −1r k −1 + nk − 2r k−2 −1 +... + n1r + n0 + n−1r + n− 2r 1 −2 +... + n− m +1r − m +1 + n− mr −m = n r i i i = −m Computer Architecture - UNSTPB, FILS 28 Examples and conversions to base 10 123.45610 = 1*102+2*101+3*100+4*10-1+5*10-2+6*10-3 123.4568 = 1*82+2*81+3*80+4*8-1+5*8-2+6*8-3=83.58984375 12.3416 = 1*161+2*160+3*16-1+4*16-2=18.203125 ABC.DEF16 = 10*162+11*161+12*160+13*16-1+14*16-2+15*16-3 = 2748. 870849609375 11000001.001112 = 1*27+ 1*26+ 0*25+ 0*24 +0*23+ 0*22+ 0*21+ 1*20+ 0*2-1 + 0*2-2 + 1*2-3+ 1*2-4+ 1*2-5=193. 2188 Computer Architecture - UNSTPB, FILS 29 Positional Form ◼ We will try represent the number π in a 4.4 fixed point base 2 representation. We have that π 10 3.1416 0 2 3 + 0 2 2 + 1 2 1 + 1 2 0 + 0 2 −1 + 0 2 −2 + 1 2 −3 + 0 2 −4 10 ◼ So the positional form representation of π10 in (4.4)2 is: 0011.0010 Computer Architecture - UNSTPB, FILS 30 Signed Fixed Point Numbers Representations: ◼ Signed Magnitude ◼ One’s Complement ◼ Two’s Complement ◼ Excess (Biased) Computer Architecture - UNSTPB, FILS 31 Signed Magnitude Representation ◼ The sign is stored in MSB, the leftmost bit. ◼ It is used 0 for plus and 1 for minus. 310=0 0000011signed (8.0)2 ; -310=1 0000011signed (8.0)2; 6.2510=0 110 0100signed (4.4)2 ; -6.2510=1 110 0100signed (4.4)2 ; 010=0 0000000signed (8.0)2 ; -010=1 0000000signed (8.0)2 ; ◼ It has the disadvantage that 0 has two different representations Computer Architecture - UNSTPB, FILS 32 One’s Complement Representation ◼ The sign is stored in MSB , the leftmost bit. ◼ It is used 0 for plus and 1 for minus. ◼ The negative of a number is obtained by complementing each bit (from 0 to 1 or from 1 to 0). 310=0 00000111sComp (8.0)2 ; -310=1 11111001sComp (8.0)2; 6.2510=0 110 01001sComp (4.4)2; -6.2510=1 001 10111sComp (4.4)2 010=0 00000001sComp (8.0)2 ; -010=1 11111111sComp (8.0)2 ; ◼ It has the disadvantage that 0 has two different representations Computer Architecture - UNSTPB, FILS 33 Two’s Complement Representation ◼ The sign is stored in MSB , the leftmost bit. ◼ It is used 0 for plus and 1 for minus. ◼ The negative of a number is obtained by adding 1LSB to the one’s complement negative. 310=0 00000112sComp (8.0)2 ; -310=1 11111012sComp (8.0)2; 6.2510=0 110 01002sComp (4.4)2; -6.2510=1 001 11002sComp (4.4)2 010=0 00000002sComp (8.0)2 ; -010=0 00000002sComp (8.0)2 ; Computer Architecture - UNSTPB, FILS 34 Excess Representation ◼ A given bias (offset) is added to our own in order to make it positive (usually B=rk-1-1 or B=rk-1) 310=1010excess7(4.0)2 ; -310=0100excess7 (4.0)2; 310=1011excess8(4.0)2 ; -310=0101excess8 (4.0)2; 310=10000010excess127(8.0)2 ; -310=1 1111100excess127 (8.0)2; 310=10000011excess128 (8.0)2 ; -310= 01111101excess128 (8.0)2; Computer Architecture - UNSTPB, FILS 35 4.0 bit Signed Representation SignM 1sComp 2sComp Excess8 -8 - - 1000 0000 -7 1111 1000 1001 0001 -6 1110 1001 1010 0010 -5 1101 1010 1011 0011 -4 1100 1011 1100 0100 -3 1011 1100 1101 0101 -2 1010 1101 1110 0110 -1 1001 1110 1111 0111 1000 1111 0 0000 1000 0000 0000 1 0001 0001 0001 1001 2 0010 0010 0010 1010 3 0011 0011 0011 1011 4 0100 0100 0100 1100 5 0101 0101 0101 1101 6 0110 0110 0110 1110 7 0111 0111 0111 1111 Computer Architecture - UNSTPB, FILS 36 Floating Point Representation ◼ A floating-point number n in radix r has the general form: n = (−1) M r S E where S gives the sign (+ or – for respectively 0 or 1), M is the mantissa or fraction (smaller than 1) and E is the exponent (for radix r) Computer Architecture - UNSTPB, FILS 37 Floating Point Representation ◼ A number can have various floating-point representations for a given radix. For example, the number π can be represented in basis 10 as: π = 0.314159 101 π = 0.031415 10 2 π = 0.003141 103 ◼ From these representations it is chosen the normalized representation – the one for which M is the largest (still smaller than 1) π = 0.314159 101 with S= 0, M= 0.314259 and E =1, for radix 10 Computer Architecture - UNSTPB, FILS 38 IEEE-754 Standard Single Precision 31 30 23 22 0 S Exponent Mantissa 1 bit 8 bits 23 bits Double Precision 63 62 52 51 0 Sign Exponent Mantissa 1 bit 11 bits 52 bits Computer Architecture - UNSTPB, FILS 39 ◼ The exponent is stored in excess 127 code. ◼ Exponents 00000000 and 11111111 are reserved. ◼ Prove that in a radix 2 representation, the MSB of M is nonzero. What would you do with it? Computer Architecture - UNSTPB, FILS 40 Example for Conversion -123456789.2510 = SEM2 1. Take sign: S=1 (negative) 2. Convert to base 2: 123456789.25 can be taken as integer/fraction part or multiplied with a convenient power of 2: 123456789.25= 123456789.25*4*2-2=493827157*2-2 = 11101011011110011010001010101*2-2 3. Normalize: 1.1101011011110011010001010101*228*2-2 =1.1101011011110011010001010101*226 4. Get mantissa, ignoring the leading 1 and the trailing bits: M = 1101011011110011010001010101 5. Get exponent E, biasing with 127: E=26+127=153=100110012 11001100111010110111100110100010 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Computer Architecture - UNSTPB, FILS 41 Value of PI 3.141610 = SEM2 ??? 01000000010010010000111111111001 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Computer Architecture - UNSTPB, FILS 42 Complement Addition and Subtraction 0000 1111 0001 1110 0 0010 -1 1 -2 2 0011 1101 -3 3 1100 -4 4 0100 -5 5 0101 1011 -6 6 -7 7 0110 1010 -8 1001 0111 1000 Subtraction Addition Computer Architecture - UNSTPB, FILS 43 System bus Computer Architecture - UNSTPB, FILS 44 Problem We have an 8-bit data bus computer with 16bit address lines. 1. Compare with 8086. 2. How much is the maximum memory of the computer (in bits and bytes). 3. How much is the nonzero memory of computer You must connect 8KB of ROM and 16KB of RAM to the CPU in this order, with memory chip size of 4KB for both the ROM and RAM. 4. How much is the minimum address bus width? 5. What are the first addresses for ROM and RAM? 6. Show the implementation. Use decoders to select the right chip. 7. Remake the problem for a 16 bit computer. Computer Architecture - UNSTPB, FILS 45 Computer Architecture Bujor Pavaloiu 46 Computer Architecture 1 Introduction 2-4 Computer Logic + Computer Arithmetic 5 Digital Representation/ Access/ Scalability 6 Instruction Set 7 Assembly Language 8 Processing Unit 9 Memory System 10 Input-Output 11 High Performance Computing Computer Architecture - UNSTPB, FILS 47 4. Instruction Set Architecture Is that part of the architecture that is visible to the programmer ▪ opcodes (available instructions) ▪ number and types of registers ▪ instruction formats ▪ storage access, addressing modes ▪ exceptional conditions Computer Architecture - UNSTPB, FILS 48 4. Instruction Set Architecture ▪ The Instruction Set Architecture (ISA) view of a machine corresponds to the machine and assembly language levels. ▪ A compiler translates a high level language, which is architecture independent, into assembly language, which is architecture dependent. ▪ An assembler translates assembly language programs into executable binary codes. ▪ For fully compiled languages like C, the binary codes are executed directly by the target machine. ▪ Java stops the translation at the byte code level. The Java virtual machine, which is at the assembly language level, interprets the byte codes (hardware implementations of the JVM also exist, in which Java byte codes are executed directly.) Computer Architecture - UNSTPB, FILS 49 ▪ A compiled program is loaded into the memory using I/O. The CPU reads instructions and data from the memory, executes the instructions, and stores the results back into the memory CPU Memory Data Bus Address Bus Control Bus I/O Computer Architecture - UNSTPB, FILS 50 The Fetch-Execute Cycle ▪ The steps that the control unit carries out in executing a program are: (1) Fetch the next instruction to be executed from memory (2) Decode the opcode (3) Read operand(s) from main memory, if any (4) Execute the instruction (5) Store results, if any (6) Go to (1). Computer Architecture - UNSTPB, FILS 51 x86 Registers ▪ General Purpose Registers ▪ Stack Registers ▪ Special Purpose Registers ▪ Flags Register Computer Architecture - UNSTPB, FILS 52 General Purpose Registers ▪ 8086 CPU has 8 general purpose registers: AX - the accumulator register (divided into AH / AL) BX - the base address register (divided into BH / BL) CX - the count register (divided into CH / CL) DX - the data register (divided into DH / DL) SI - source index register DI - destination index register BP - base pointer SP - stack pointer Computer Architecture - UNSTPB, FILS 53 General Purpose Registers ◼ 4 general purpose AX AH AL registers (AX, BX, CX, DX) are made of two BX separate 8 bit registers BH BL ◼ "H" is for high and "L" is for low part CX ◼ If you modify any of the CH CL two 8 bit registers, the 16 DX bit register is also DH DL updated, and vice-versa. ◼ Ex: AX=1011100000001110 1 0 1 1 1 0 0 0 0 0 0 0 1 1 1 0 AH=10111000; AL=0001110 AX Register Computer Architecture - UNSTPB, FILS 54 Segment Registers ◼ 8086 CPU has 4 segment registers: CS - points at the segment containing the current program. DS - generally points at segment where variables are defined. ES - extra segment register, it's up to a coder to define its usage. SS - points at the segment containing the stack. The segment registers have a very special purpose - pointing at accessible blocks of memory. Computer Architecture - UNSTPB, FILS 55 Segment Registers ◼ Initially, memory locations were addressed using a single 8 bit register - so only 216 memory locations were accessible. ◼ Latter, add an offset stored in a given register to the ◼ Effective Address = Segment*10h+Offset (Other General Purpose Register) ◼ Works for - DS with BX, SI, DI - SS with BP, SP Ex: To access address 012345h set BX=01234h, DS=00005h Effective Address = 01234h * 010h + 05h = 012345h Computer Architecture - UNSTPB, FILS 56 Special Purpose Registers ◼ Instruction Pointer ◼ Flags Register Instruction Pointer Together with CS segment register points to currently executing instruction. Computer Architecture - UNSTPB, FILS 57 The Flags Register ◼ The processor flags shows whether a performed operation was successful or not are used to perform this function. ◼ Flags are important to assembly language programs, as they are the only means available to determine whether a program’s function succeeded or not. ◼ The flags are divided into three groups - Status flags - Control flags - System flags Computer Architecture - UNSTPB, FILS 58 The Flags Register 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 IOP IOP NTF LF LF OF DF IF TF SF ZF AF PF CF Status Flags Bit FL Name Description Is 1 if a mathematical operation on an unsigned integer 0 CF Carry flag generates a carry or a borrow for the most significant bit. Is 1 if the total number of 1 bits in the result is even, and it 2 PF Parity flag is 0 if the total number of 1 bits in the result is odd Is 1 if a carry or borrow operation occurs from bit 3 of the 4 AF Adjust flag register used for the calculation of BCD mathematical ops. 5 ZF Zero flag Is 1 if the result of an operation is zero Is the same as the sign of the result, stored as its MSB (0 7 SF Sign flag for positive, 1 for negative) Is 1 in signed integer arithmetic when a positive value is 11 OF Overflow flag too large, or a negative value is too small, to be properly represented in the register Computer Architecture - UNSTPB, FILS 59 Control Flags Bit FL Name Description Is 1 if a mathematical operation on an unsigned integer 10 DF Direction flag generates a carry or a borrow for the most significant bit. System Flags Bit FL Name Description Is 1 if a mathematical operation on an unsigned integer 8 TF Trap flag generates a carry or a borrow for the most significant bit. Interrupt Is 1 if the total number of 1 bits in the result is even, and it 9 IF enable flag is 0 if the total number of 1 bits in the result is odd 12 IOP I/O privilege Is 1 if a carry or borrow operation occurs from bit 3 of the 13 LF level flag register used for the calculation of BCD mathematical ops. Nested task 14 NTF Is 1 if the result of an operation is zero flag Computer Architecture - UNSTPB, FILS 60 Registers and flags AX Accumu Mostly used for calculations and for CS Code Active Code Segment lator input/output Segment BX Base Only register that can be used as an DS Data Active Data Segment index Segment CX Count Register used for the loop instruction SS Stack Active Stack Segment Segment DX Data Input/output and used by multiply and divide ES Extra Active Extra Segment (user Segment defined) SI Source Used by string operations as Index source C Carry Is 1 for unsigned overflow (carry) DI Destinatio Used by string operations as Z Zero Is 1 for result 0 n Index destination S Sign Is 1 for negative result IP Instructio Offset of the current instruction O Overflow Is 1 for signed overflow n Pointer P Parity Is 1 for even parity of ones in result SP Stack Offset of stack Pointer A Auxiliary Is 1 for unsigned overflow in low nibble BP Base Used to pass data to and from the I Interrupt 1 for interrupts from external devices Pointer stack D Direction direction in data chains, is 0 for forwards Computer Architecture - UNSTPB, FILS 61 5. Assembly Language I. Data Transfer Operations II. Conversion Operations III. Arithmetic Operations IV. Logic Operations V. String Operations VI. I/O Operations VII. Flow Control Operations VIII. Other Operations Assembler We will use processor emulator emu8086 + FLAT Assembler Computer Architecture - UNSTPB, FILS 62 Chapters (not all and not in this order) Computer Architecture - UNSTPB, FILS 63 Registers and flags – AX Accumulator Mostly used for calculations and Emu8086 Interface Register for input/output BX Base Only register that can be used Register as an index CX Count Register used for the loop Register instruction DX Data Input/output and used by Register multiply and divide CS Code Active Code Segment Segment IP Instruction Offset of the current instruction Pointer SS Stack Active Stack Segment Segment SP Stack Offset of stack Pointer CF Carry Is 1 for unsigned overflow (carry) BP Base Pointer Used to pass data to and from the stack ZF Zero Is 1 for result 0 SI Source Used by string operations as SF Sign Is 1 for negative result Index source OF Overflow Is 1 for signed overflow DI Destination Used by string operations as Index destination PF Parity Is 1 for even parity of ones in result DS Data Active Data Segment AF Auxiliary Is 1 for unsigned overflow in low nibble Segment IF Interrupt 1 for interrupts from external devices ES Extra Active Extra Segment (user Segment defined) DF Direction direction in data chains, is 0 for forwards Computer Architecture - UNSTPB, FILS 64 Statements ◼ An assembly language source code file consists of a collection of statements ◼ Comments can be used with any statement. A semicolon (;) begins the comment, and the comment then extends until the end of the line. ◼ There are three types of functional assembly language statements: ◼ Instructions ◼ Directives ◼ Macros Computer Architecture - UNSTPB, FILS 65 Statements ◼ Instructions are translated into machine code and will be executed when the program is run ◼ Directives are not translated into code, but just specify the assembler parameters of functioning. ◼ Macros are sets of instructions, directives and other macros that are expanded by the assembler and assembled individually Computer Architecture - UNSTPB, FILS 66 I. Data Transfer Operations ◼ Memory Access Registers: BX, SI, DI, BP, combined with d8 (8 bit signed immediate displacement) and d16 (16 bit signed immediate displacement) inside [ ] [SI] [SI+d8] [SI+d16] [BX+SI] [BX+SI+d8] [BX+SI+d16] [DI] [DI+d8] [DI+d16] [BX+DI] [BX+DI+d8] [BX+DI+d16] d16 [BP+d8] [BP+d16] [BP+SI] [BP+SI+d8] [BP+SI+d16] [BX] [BX+d8] [BX+d16] [BP+DI] [BP+DI+d8] [BP+DI+d16] BX SI disp BP DI Work in segment DS, excepting those with BP, working in SS Computer Architecture - UNSTPB, FILS 67 Data Transfer Operations Important Data Transfer instructions MOV, LEA Stack instructions PUSH, POP The MOV instruction ◼ copies the second operand (source) to the first operand (destination). ◼ the source operand can be an immediate value, general- purpose register or memory location. ◼ the destination register can be a general-purpose register, or memory location. ◼ both operands must be the same size, (byte or a word) Computer Architecture - UNSTPB, FILS 68 MOV instruction MOV REG, memory MOV memory, REG MOV REG, REG MOV memory, immediate MOV REG, immediate REG: AX, BX, CX, DX, AH, AL, BL, BH, CH, CL, DH, DL, DI, SI, BP, SP. memory: [BX], [BX+SI+7], variable, etc... immediate: 5, -24, 3Fh, 10001101b, etc... MOV SREG, memory MOV memory, SREG MOV REG, SREG MOV SREG, REG SREG: DS, ES, SS, and only as second operand: CS. MOV cannot be used to set the value of the CS and IP registers MOV cannot use segment register with immediate value Computer Architecture - UNSTPB, FILS 69 MOV Examples Ex: MOV AX,1 MOV AL,1 Any difference? MOV AH,1 Variables VarName DW VarValue VarName DB VarValue Ex: MOV AL,1 MOV AL, X1 X1 DB 2 Computer Architecture - UNSTPB, FILS 70 Writing char “A” to video memory Ex: Write in the video memory ;MOV DS, 0B800h ; set segm. to B800h. wrong! why? MOV AX, 0B800h ; set AX to B800h. MOV DS, AX ; copy value of AX (B800h) to DS. MOV [140h], 0000111101000001b ; Write A (41h) with white on black in segment B800h, at offset 140h Computer Architecture - UNSTPB, FILS 71 Writing char “A” to video memory Ex: ORG 100h ; this directive is required for a com program. MOV AX, 0B800h ; set AX to B800h. MOV DS, AX ; copy value of AX (B800h) to DS. MOV CL, 'A' ; set CL to ASCII code of 'A' MOV CH, 0000_1111b ; set CH to binary value. MOV BX, 140h ; set BX to 15Eh. MOV [BX], CX ; copy contents of CX to memory ; at B800:0140 RET ; returns to operating system. Computer Architecture - UNSTPB, FILS 72 Computer Architecture Bujor Pavaloiu 73 Computer Architecture 1 Introduction 2-4 Computer Logic + Computer Arithmetic 5 Digital Representation/ Access/ Scalability 6 Instruction Set 7-8 Assembly Language 9 Processing Unit 10 Memory System 11 Input-Output 12 High Performance Computing Computer Architecture - UNSTPB, FILS 74 Opcodes ◼ Assembly is a programming language that gives the programmer control over the machine code. ◼ The assembly code has DIRECT mapping in the opcodes (commands in the executable code). assembler code binary code mov AL, 1 B0 01 10110000 00000001 Computer Architecture - UNSTPB, FILS 75 Codes for register operands Register Code Register Code AL 000 AH 100 CL 001 CH 101 DL 010 DH 110 BL 011 BH 111 Computer Architecture - UNSTPB, FILS 76 LEA Instruction ◼ The LEA (Load Effective Address) instruction and alternative OFFSET operator are used to get the offset address of a variable. LEA is more powerful because it works also for indexed variables. Computer Architecture - UNSTPB, FILS 77 II. Stack Operations ◼ The stack is a memory space used to store data in a last in – first out manner. ◼ There are two typical instructions that work with the stack: PUSH - stores 16 bit word in the stack. POP - gets 16 bit word from the stack. PUSH POP Computer Architecture - UNSTPB, FILS 78 PUSH AX ; store value of AX in stack. PUSH BX ; store value of BX in stack. POP AX ; get AX from the stack (former value of BX) POP BX ; get BX from the stack (former value of AX) PUSH BX POP AX PUSH AX POP BX Computer Architecture - UNSTPB, FILS 79 II. Conversion Operations cbw, cwd, xlat Computer Architecture - UNSTPB, FILS 80 III. Arithmetic Operations add, inc sub, dec, cmp, neg, mul, imul, div, idiv Computer Architecture - UNSTPB, FILS 81 IV. Logic Operations and, or, xor, not, shl, shr, rcl, rcr Computer Architecture - UNSTPB, FILS 82 V. String Operations movs, stos, lods Computer Architecture - UNSTPB, FILS 83 VI. I/O Operations in, out Computer Architecture - UNSTPB, FILS 84 VII. Flow Control Operations jmp, call, ret, conditional jumps Computer Architecture - UNSTPB, FILS 85 VIII. Other Operations clc, stc, cmc Computer Architecture - UNSTPB, FILS 86 Simple program to add two variables ORG 100h ; This program adds two numbers MOV AL, x MOV BL, y ADD AL, BL ; ADD AL, y MOV z, AL RET ; stops the program. x DB 1 ; define byte initialized to 1 y DB 2 ; define byte initialized to 2 z DB 0 ; define byte initialized to 0 Computer Architecture - UNSTPB, FILS 87 Interrupts ◼ INT 10h / AH = 0 - set video mode. INT 10h / AH = 2 - set cursor position ◼ INT 10h / AH = 05h - select active video page ◼ INT 10h / AH = 08h - read character and attribute at cursor position ◼ INT 10h / AH = 09h - write character and attribute at cursor position Computer Architecture - UNSTPB, FILS 88 Computer Architecture Bujor Pavaloiu 89 Computer Architecture 1 Introduction 2-4 Computer Logic + Computer Arithmetic 5 Digital Representation/ Access/ Scalability 6 Instruction Set 7-8 Assembly Language 9 Processing Unit 10 Memory System 11 Input-Output 12 High Performance Computing Computer Architecture - UNSTPB, FILS 90 CPU with System Bus CPU Data Bus Address Bus SYSTEM BUS Control Bus Computer Architecture - UNSTPB, FILS 91 Computer Architecture https://en.wikipedia.org/wiki/Computer_architecture Black lines indicate data flow, whereas red lines indicate control flow. Arrows indicate the direction of flow Computer Architecture - UNSTPB, FILS 92 Fetch/Execution Cycle ◼ While the processor is running, instructions must be retrieved from memory and loaded for the processor to handle. The job of the control unit is to perform the fetch/execution cycle: ◼ The instruction counter retrieves the next instruction code from memory and prepares it to be processed. ◼ The instruction decoder is used to decode the retrieved instruction code into a micro-operation. The micro- operation is the code that controls the specific signals within the processor chip to perform the function of the instruction code. ◼ When the prepared micro-operation is ready, the control unit passes it along to the execution unit for processing, and retrieves any results to store in an appropriate location. Computer Architecture - UNSTPB, FILS 93 Pipelining – Instruction Prefetch ◼ We can take advantage of the fact that we can retrieve the next instruction from memory and start to execute it before the prior one is ended (like on the assembly line in a plant) - PIPELINING Instruction Prefetch Instr. N+2 Fetch Instr. N+1 Execute Instr. N Computer Architecture - UNSTPB, FILS 94 Pipelining - example ◼ Your processor has a 3-stage pipeline where the delay is 34 ns through the first stage, 40 ns through the second stage, and 26 ns through the third stage. ◼ What is the speedup, over an infinite number of instructions, of the given pipeline over an unpipelined version assuming the logic is the same for both versions Speedup = the relative performance S1 S2 S3 Computer Architecture - UNSTPB, FILS 95 Pipelining - example Direct execution S3 S2 S1 S3 S2 S1 S3 S2 S1 I3 I2 I1 S3 S2 S1 I1 S3 S2 S1 I1 S3 S2 S1 I2 S3 S2 S1 I2 S3 S2 S1 I3 S3 S2 S1 I3 Would-be pipeline S3 S2 S1 I4 Real Pipeline Computer Architecture - UNSTPB, FILS 96 Pipeline Speedup ◼ Direct execution time is 34 ns + 40 ns + 26 ns = 100 ns for instruction, IE at each 100 ns a new instruction is executed. ◼ Pipeline execution time depends on the maximum stage delay (here 40 ns) IE at each 40 ns a new instruction is executed. ◼ The speedup is thus 100/40 = 2.5 Computer Architecture - UNSTPB, FILS 97 Problems for Real Computer Systems ◼ 1. Execution takes a longer time than fetching ◼ 2. Sometimes the next instruction depends on what happens with the current one (ex. branching) ◼ Solution to 1. More stages (more equilibrated) Computer Architecture - UNSTPB, FILS 98 Fetch/Execute Expanded – 1. Retrieve (fetch) Instruction from memory. 2. Decode Instruction for operation. 3. Compute Operands 4. Retrieve Operands, if needed. 5. Execute Instruction 6. Store the Results, if needed. Computer Architecture - UNSTPB, FILS 99 Instruction Execution Pipeline t(1) SR EI RO CO DI RI I1 t(2) SR EI RO CO DI RI I2 t(3) SR EI RO CO DI RI I3 t(4) SR EI RO CO DI RI I4 t(5) SR EI RO CO DI RI I5 t(6) SR EI RO CO DI RI I6 t(7) SR EI RO CO DI RI I7 t(8) SR EI RO CO DI RI I8 Computer Architecture - UNSTPB, FILS 100 Pipeline with branch t(1) SR EI RO CO DI RI I1 t(2) RO CO DI RI I2 t(3) CO DI RI I3 t(4) DI RI I4 t(5) RI I5 t(6) SR EI RO CO DI RI I16 t(7) SR EI RO CO DI RI I17 t(8) SR EI RO CO DI RI I18 Computer Architecture - UNSTPB, FILS 101 Pipelines ◼ No use in increasing indefinitely the number of stages because: - The time to transfer data from stage to stage increases (bad especially for branches) - The complexity of the circuits needed to control the pipeline increases rapidly – the control becomes more complicated than the process Computer Architecture - UNSTPB, FILS 102 Pipelines Performance ◼The cycle time is the time to advance the set of instructions one step through the pipeline T=max(ti)+d, 1≤i ≤M where M is the number of stages, d is the delay between stages and ti is the time to pass stage i. T=tM+d, Where tM is the maximum stage delay. Computer Architecture - UNSTPB, FILS 103 Pipelines Performance Usually tM >>d, so ignore d. For N instructions TN=(M+N-1))T Taking for example M=6, N=8, we get T8=13T The speedup factor is S=MNT/((M+N-1))T)= MN/(M+N-1) For N tending to infinite, S=M Computer Architecture - UNSTPB, FILS 104 Solution to 2. - Conditional Branches 1. Multiple streams 2. Prefetch branch target 3. Loop Buffer 4. Branch prediction 5. Delayed Branch In 1. duplicate the pipeline – problems with the simultaneous addressing of registers and with further branching before the decision is make. Computer Architecture - UNSTPB, FILS 105 Solution to 2. - Conditional Branches (2) In 2., prefetch the branch target together with the next instruction. If branch is selected, the target is already retrieved. In 3. there exists a very fast memory space (loop buffer), allocated to the fetching unit, containing the last N most recently retrieved instructions, in sequence. If a branch is taken, the loop buffer is checked first. This strategy is excellent for loops – from there comes the title. Computer Architecture - UNSTPB, FILS 106 Solution to 2. - Conditional Branches (3) In 4., static and dynamic approaches - Prediction never taken - Prediction always taken - Prediction using opcode - Taken switch - Branch history table The last two consider history: some bits as records of the last branching or a small cache. In 5., change the addresses (rearrange) to put the branches later. Computer Architecture - UNSTPB, FILS 107 8086 Processor Bus Interface Unit Execution Unit Why do you think the prefetch que is 6 bytes long? How many stages of pipelining does 8086 have? Computer Architecture - UNSTPB, FILS 108 Intel CPU microarchitectures partially from https://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures Max Addres Process No Micro- Data Bus Pipeline Transistors Year clck s Bus node architecture (Bits) stages (MHz) (Bits) (nm) 1971 4004 0.75 4 12 1 10,000 2,300 1972 8008 0.8 8 14 1 10,000 3,500 1974 8080 3 8 16 1 6,000 6,000 1978 8086 (8086, 8088)* 5 16 20 2 3,000 29,000 1982 186 (80186, 80188)* 25 16 20 2 3,000 55,000 1982 286 (80286)* 25 16 24 3 1,500 134,000 1985 386 (80386)* 33 32 32 6 1,000 275,000 1989 486 (80486) 100 32 32 5 600 1.6M 1991 i860 (80860)** 50 64, 32/64/128 rgs 32 Lots, 50ms 1,000 1M++ 1993 P5 (Pentium)*** 300 64 32 5 350 4.5M 2022 Raptor Cove**** 6000 64 37 19 10 12B * There existed separate co-processors chips ** RISC, HARWARD arch. - instruction cache is 32 bits wide, data cache is 128 bits wide, SIMD, first 1M transistors chip *** First superscalar processor, separate code and data caches, in 1996 - MMX instruction set (SIMD). 32 bits internal registers ☺ **** 8 P-Cores and 16 E-Cores; Lots of cache… Computer Architecture - UNSTPB, FILS 109 Pentium 4 ◼ Intel processor (the Pentium 4) uses a control unit technology called NetBurst. The NetBurst technology incorporates four separate techniques to help speed up processing in the control unit. Knowing how these techniques operate can help you optimize your assembly language programs. The NetBurst features are as follows: 1. Instruction prefetch and decoding 2. Branch prediction 3. Out-of-order execution 4. Retirement Computer Architecture - UNSTPB, FILS 110 Pentium procesor https://electronicsdesk.com/pentium-microprocessor.html Computer Architecture - UNSTPB, FILS 111 Raptor Lake Architecture (1) Computer Architecture - UNSTPB, FILS 112 Raptor Lake Architecture (2) Computer Architecture - UNSTPB, FILS 113 Bus - Front side bus FSB does not exist now - HyperTransport, Intel QuickPath Interconnect or Direct Media Interface Use instead Base Clock, Bus Speed, etc… All devices are PLL multiple of this frequency Current high-speed serial computer expansion bus – PCI Express (Peripheral Component Interconnect Express) - PCIe Computer Architecture - UNSTPB, FILS 114 Memory Hierarchy Computer Architecture - UNSTPB, FILS 115 Memory Hierarchy Access Capacity Latency Bandwidth Cost/MB type CPU Random 64–1024 1 ns BS*NB High registers bytes Cache Random 8KB–16 MB 5–10 ns 200–500 $50 memory MB/s Main Random 16, 64, 512 30–50 ns 100–200 $0.1 memory GB MB/s Hard Disk Direct 100GB - 0.01-0.1 40–150 $0.005 20TB ms MB/s Computer Architecture - UNSTPB, FILS 116 The transmission speeds for devices Ethernet – Terabit Ethernet 319*1012 bps PCIe 6.0 x16 (Graphic Cards…) 1012 bps DDR5 - 64GB/s 5*1011 bps SSD (NVMe, PCIe) - 12.4 GB/s 1011 bps SATA 3 6*109 bps Ethernet – Gigabit Ethernet 109 bps Hard Disk – 120 MB/s 109 bps Ethernet, DVD, Scanner 107 bps Laser Printer, Floppy 106 bps Modem 105 bps Mouse, Keyboard 102 bps Computer Architecture - UNSTPB, FILS 117 Principle of memory locality ◼ Locality of reference: within a given period of time, programs tend to reference a certain space of memory repeatedly. ◼ - Spatial locality refers to the fact that when a given address has been referenced, it is most likely that the addresses next to it will be referenced within a short period of time, for example, consecutive instructions in a simple program. ◼ - Temporal locality, on the other hand, refers to the fact that once a particular memory address has been referenced, it is most likely that it will be referenced again in a short time, for example, an instruction in a program loop. Computer Architecture - UNSTPB, FILS 118 Cache ◼ Caching –> good Hit Ratios HRk = Hit Ratio (k) = probability to find required data at level k MRk = Miss Ratio (k) = 1-HRk ◼ For a two level memory, the average memory access time is: ta =HR1t1+MR1(t1+t2)= HR1t1+(1-HR1)(t1+t2)= =t1+(1-HR1) t2= t1+MR1t2 Computer Architecture - UNSTPB, FILS 119 Exercise: Find the average memory access time for a three level memory. Hint: You should obtain: ta = t1+MR1 (t2+MR2t3) Exercise: Find the average memory access time for an N level memory. Computer Architecture - UNSTPB, FILS 120 Computer Architecture Bujor Pavaloiu 121 Computer Architecture 1 Introduction 2-4 Computer Logic + Computer Arithmetic 5 Digital Representation/ Access/ Scalability 6 Instruction Set 7-8 Assembly Language 9 Processing Unit 10 Memory System 11 Input-Output 12 High Performance Computing Computer Architecture - UNSTPB, FILS 122 Cache Policy Memory Hierarchy -> Cache Policy Keep the information expected to be used more frequently (by the CPU) on upper (closer to CPU), faster levels * Faster implies smaller and more expensive ** More expensive implies smaller *** Smaller implies faster (easier to look for) Computer Architecture - UNSTPB, FILS 123 For a two level memory, the average memory access time is: ta = t1+MR1t2 To reduce ta , you must reduce the times (especially t2 including additional costs) and increase HRs 90/10 Rule (from empirical observation): "A program spends 90% of its time in 10% of its code" Computer Architecture - UNSTPB, FILS 124 Block: the unit of data transfer from one level to the next Example of calculus: 100 MHz clock. Instructions in cache execute in two clock cycles. Instructions not in cache must be loaded from main memory in a 64-byte burst. Reading from main memory requires 10 clock cycles to set up the data transfer but once set-up, the processor can read a 32-bit wide word at one word per clock cycle. The cache hit rate is 90%. Computer Architecture - UNSTPB, FILS 125 1. Compute Miss Penalty a. 100 MHz clock -> 10 ns clock period b. 10 cycles to set up the burst = 10 × 10 ns = 100 ns c. 32-bit wide word = 4 bytes -> 16 data transfers to load 64 bytes d. 16 × 10 ns = 160 ns e. t2 = 100 ns + 160 ns = 260 ns 2. Each instruction takes 2 clocks, t1 =20 ns to execute. 3. Effective execution time = 0.9x20 + 0.1x280 =18 + 28 = 46 ns = 20 + 0.1x260 = 20 + 26 = 46 ns Computer Architecture - UNSTPB, FILS 126 In programs – memory addresses Therefore, we need to devise a method to map the addresses in main memory to the addresses in the cache. Problems when miss and we want to replenish the cache – cache coherency, keeping a good hit ratio - cache efficiency. Computer Architecture - UNSTPB, FILS 127 Block Transfer Block Placement Where should a block be placed in the cache? Block Identification How is a block found if it is in the cache? Block Replacement Which block frame in the cache should be replaced upon a miss? Interaction Policies with Main Memory What happens on reads and writes to the cache Computer Architecture - UNSTPB, FILS 128 Block Placement Direct mapped : if each block has only one place it can appear in the cache, the cache is said to be direct mapped. The mapping is usually (Block address) MOD (Number of blocks in cache) Fully Associative : if a block can be placed anywhere in the cache, the cache is said to be fully associative. Set associative : if a block can be placed in a restricted set of places in the cache, the cache is said to be set associative. A set is a group of blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set. Computer Architecture - UNSTPB, FILS 129 Direct Mapping Example: cache is 1,024 words (1K) and main memory contains 1,048,576 words (1M). Each cache location 000000h maps to 1,024 main memory locations. 000400h 0000h 03FFh 0FFFFFh Computer Architecture - UNSTPB, FILS 130 Most caches are really divided into : Cache memory: holds the memory image Tag memory: holds the address information and validity bit. Determines if the data is in the cache and if the cache data and memory data are coherent. Algorithmic state machine: the cache control mechanism. Its primary function is to guarantee that the data requested by the CPU is in the cache. Computer Architecture - UNSTPB, FILS 131 ◼ Caches and main memory are divided into equally sized quantities called blocks (or refill lines ). A refill line is typically between four and 256 bytes long (power of 2) and is the minimum quantity that the cache will deal with in terms of its interaction with main memory. ◼ Missing a single byte from main memory will result in a full filling of the block containing that byte. ◼ Cached processors have burst modes to access memory and usually never read a single byte from memory. Computer Architecture - UNSTPB, FILS 132 The direct-mapped cache partitions main memory into a matrix consisting of K columns of N blocks per column. The cache is one-column wide and N refill lines long. The Mth row of the cache can hold the Mth refill line of any one of the K columns of main memory. The tag address holds the address of the memory column. 0000h 03FFh 0000h 0000h 1 M N=210 03FFh 03FFh 1 K=210 Computer Architecture - UNSTPB, FILS 133 Exercise Example: Processor with: 4 GB memory 256KB direct-mapped cache memory. Blocks on 64 bytes. How does it look like? Computer Architecture - UNSTPB, FILS 134 a. Memory size= 232 bytes ; Cache size= 28+10=218 bytes; Block size = 26 bytes b. Number of Blocks in memory = 232/ 26 = 226 blocks c. Number of Blocks in cache = 218/ 26 = 212 blocks d. Number of columns = 226/ 212 = 214 => 14 bits in tag => 212x14 =56kb of additional tag memory e. Validity bit => another 212 = 4 Kbits cache, so we need about 8KB cache Valid Tag Cache Memory 1 0000000000000h 1 212 N= 212 1 bit 14 bits 26 bytes 1 K= 214 Computer Architecture - UNSTPB, FILS 135 Direct Mapped Block Placement Problems: Attempts to work with memory locations addressed by the same cache block (on a memory line) will give high miss rate. Solution: (Fully) associative block placement Computer Architecture - UNSTPB, FILS 136 Fully Associative Block Placement 033h 000000h 0FFh 000001h 000h 000002h Cache 0112h 000003h Tag 01Ah 000004h Memory 01Ah 000005h 0000h 000005h 01Ah 000002h 000h 000000h 033h 000003h 012h 0FFFFFh 000h 000004h 01Ah 03FFh 000001h 0FFh 000h 0FFFFFh Computer Architecture - UNSTPB, FILS 137 Fully Associative Block Placement Problems: 1. When all the available rows in the cache contain valid rows from main memory, how does the cache control hardware decide where in the cache to place the next block from main memory? 2. Since any block from main memory can be located at any block position in the cache, how does the cache control hardware determine if a main memory block is currently in the cache? Computer Architecture - UNSTPB, FILS 138 Solutions: 1- Place a counter for each block of the cache. On every clock cycle we advance all of the counters. When we access a given block, we reset the counter associated with it to zero. When a counter reaches the maximum count, it remains at that value. A block from the memory (missed in the cache) will fill the cache block with the largest count. -> least recently used (LRU) algorithm Computer Architecture - UNSTPB, FILS 139 2 -Have to check all tags in cache to find out if one of them stores the block address in RAM -> contents addressable memory (CAM) Each memory cell of the CAM memory also contains a data comparator circuit. When the tag address is sent to the CAM by the cache control unit all of the comparators do a parallel search of their contents. If the input tag address matches the address stored in a particular cache tag address location, the circuit indicates an address match (hit) and outputs the cache row address of the main memory tag address Problem: large cache -> heavy control… Computer Architecture - UNSTPB, FILS 140 Set Associative Block Placement The set-associative cache combines the properties of the direct-mapped and associative cache into one system -> More cache columns 0000h 03FFh 0000h 0000h 03FFh 03FFh Computer Architecture - UNSTPB, FILS 141 Set Associative Block Placement 000000h Cache 000001h 000002h Memory 000003h Tag 000004h Blocks = 4 per set 000005h 0000h 0003FFh 000400h 000401h Sets 000402h 03FFh The set is (Block address) MOD (Number of Sets in cache) 0FFFFFh Computer Architecture - UNSTPB, FILS 142 Block Identification ◼ How do you know if the data you are looking for is placed in Cache? Tag Index Offset http://www.cs.iastate.edu/~prabhu/Tutorial/CACHE/bl_ident_applet.html Computer Architecture - UNSTPB, FILS 143 Block Replacement When a miss happens you will have to load in cache the missing data ◼ direct-mapped placement -> there is no choice: only one block frame is checked for a hit and only that block can be replaced ◼ fully-associative or set-associative placement -> more than one block to choose from on a miss Computer Architecture - UNSTPB, FILS 144 Block Replacement ◼ Primary strategies: ◼ Random - to spread allocation uniformly, candidate blocks are randomly selected. Advantage: simple to implement in hardware Disadvantage: ignores principle of locality ◼ Least-Recently Used (LRU) - to reduce a chance of throwing out information that will be needed soon, accesses to blocks are recorded. The block replaced is the one that has been unused for the longest time. Advantage: takes locality into account Disadvantage: as the number of blocks to keep track of increases, LRU becomes more expensive (harder to implement, slower and often just approximated). Computer Architecture - UNSTPB, FILS 145 Block Replacement Other strategies: Pseudo LRU First In First Out (FIFO) Most-Recently Used (MRU) Least-Frequently Used (LFU) Adaptive Replacement Cache (LRU+LFU) Computer Architecture - UNSTPB, FILS 146 Interaction with Memory The READ policies are: Read Through - reading a word from main memory to CPU No Read Through - reading a block from main memory to cache and then from cache to CPU Computer Architecture - UNSTPB, FILS 147 WRITE policies ◼ Write Through - the information is written to both the block in the cache and to the block in the lower-level memory. Advantage: - read miss never results in writes to main memory - easy to implement - main memory always has the most current copy of the data (consistent) Disadvantage: - write is slower - every write needs a main memory access - as a result uses more memory bandwidth Computer Architecture - UNSTPB, FILS 148 WRITE policies ◼ Write back - the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced. To reduce the frequency of writing back blocks on replacement, a dirty bit is commonly used. This status bit indicates whether the block is dirty (modified while in the cache) or clean (not modified). If it is clean the block is not written on a miss. Advantage: - writes occur at the speed of the cache memory - multiple writes within a block require only one write to main memory - as a result uses less memory bandwidth Disadvantage: - harder to implement - main memory is not always consistent with cache - reads that result in replacement may cause writes of dirty blocks to main memory Computer Architecture - UNSTPB, FILS 149 WRITE policies ◼ Write Miss: Write Allocate - the block is loaded on a write miss, followed by the write-hit action. No Write Allocate - the block is modified in the main memory and not loaded into the cache. ◼ Although either write-miss policy could be used with write through or write back, write-back caches generally use write allocate (hoping that subsequent writes to that block will be captured by the cache) and write-through caches often use no-write allocate (since subsequent writes to that block will still have to go to memory). Computer Architecture - UNSTPB, FILS 150 Memory_stall_cycles = IC * Mem_Refs * Miss_Rate * Miss_Penalty where IC = Instruction count Mem_Refs = Memory References per Instruction Miss_Rate = the fraction of accesses that are not in the cache Miss_Penalty = the additional time to service the miss Computer Architecture - UNSTPB, FILS 151