4CS015 Lecture 5: CPU Architecture - PDF
Document Details

Uttam Acharya
Summary
This is lecture 5 covering CPU architecture, presented by Uttam Acharya. It includes topics such as the history of computers, CPU design types (accumulator, stack, register), CISC vs RISC architectures, modern enhancements like caching and pipelining, and co-processors. The lecture also contains the fetch decode execute cycle.
Full Transcript
4CS015 Lecture 5: CPU Architecture
Prepared by: Uttam Acharya 1

Objectives
By the end of this session, you will be able to:
Describe the history of the development of the CPU.
Explain how cache, pipelines and superscalar operation can improve performance. 2

History of Computers
Charles Babbage is considered to be the father of the computer. He designed the "Difference Engine" in 1822, and in 1842 a fully automatic "Analytical Engine" for performing basic arithmetic functions. His efforts established several principles that are fundamental to the design of any digital computer. 3

History of Computers
The first computers were programmed by wiring.
1944: EDVAC (Electronic Discrete Variable Automatic Computer) – the first conceptual stored-program computer. Von Neumann, Eckert & Mauchly.
1948: The first (prototype) stored-program computer – the Mark I at Manchester University.
Harvard Mark III & IV. Howard Aiken. Harvard architecture / fully automatic / controlled by paper tape / reliable / the beginning of the modern computer era. 4

Computer Architecture
Von Neumann
The classic computer architecture. Data and program instructions are stored in the same memory. Fetch–execute cycle. Suffers from the Von Neumann 'bottleneck': the rate at which data can move between CPU and main memory is small compared with the rate at which today's CPUs work. 5

Harvard
Data and programs are stored in separate memory areas. Allows for faster operation: simultaneous access to both data and programs, and the data and instruction buses can be different sizes. 6

Modified Harvard
A hybrid approach – a combination of von Neumann and Harvard. A single main memory holds both data and program; the CPU has separate high-speed caches for data and instructions, allowing simultaneous access to both data and instructions from the caches. 7

CPU
The CPU consists of transistors, combined as gates (Control Unit, Arithmetic & Logic Unit, registers). The CPU works in cycles – fetch, decode, execute. Usually, each step is handled by a different part of the CPU. Where would adders be found? 
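The fetch–decode–execute cycle mentioned above can be sketched as a toy accumulator machine in Python (a sketch only: the three-instruction program, LOAD/ADD/STORE, mirrors the worked example later in the lecture, and all names are illustrative):

```python
# Toy accumulator machine: fetch, decode, execute in a loop.
# One memory holds both instructions and data (a von Neumann layout).
memory = {
    100: ("LOAD", 10),   # ACC <- mem[10]
    101: ("ADD", 11),    # ACC <- ACC + mem[11]
    102: ("STORE", 12),  # mem[12] <- ACC
    103: ("HALT", None),
    10: 2, 11: 3, 12: 0,
}

pc, acc = 100, 0                   # program counter, accumulator
while True:
    opcode, operand = memory[pc]   # fetch the instruction and split it (decode)
    pc += 1                        # PC now points at the next instruction
    if opcode == "LOAD":           # execute
        acc = memory[operand]
    elif opcode == "ADD":
        acc += memory[operand]
    elif opcode == "STORE":
        memory[operand] = acc
    elif opcode == "HALT":
        break

print(memory[12])  # 2 + 3 = 5
```

Because instructions and data share one `memory` here, this sketch is a von Neumann machine; a Harvard design would keep them in two separate stores.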
8

CPU
CPU Architecture 9

Types of CPU Design
Accumulator: all ALU operations work on data in the accumulator (a special register).
Stack: all ALU operations work on data stored on the stack.
Register–register: all ALU operations work on data stored in registers. 10

Example Architecture
[Diagram: memory (RAM) – which holds the stack – connects over the data bus to the CPU: Control Unit, Arithmetic Logic Unit and registers. The accumulator is a special register, stored in the register file.] 11

Code Sequence Examples
Examples of C = A + B:
Stack  | Acc     | Reg–Mem       | Load–Store
Push A | Load A  | Load R1, A    | Load R1, A
Push B | Add B   | Add R3, R1, B | Load R2, B
Add    | Store C | Store R3, C   | Add R3, R2, R1
Pop C  |         |               | Store R3, C
12

Architecture Bit Sizing
Most CPUs are rated by the number of bits they have: 16-bit (8086), 32-bit (80386), 64-bit (Core 2).
Intel 8086 – released June 8, 1978. Intel 80386 (i386) – released October 1985. Intel Core i7 – released November 2008. 13

BUS
A bus is a high-speed internal connection. Buses are used to send control signals and data between the processor and other components. Three types of bus are used:
Address bus – carries memory addresses from the processor to other components such as primary storage and input/output devices. The address bus is unidirectional.
Data bus – carries data between the processor and other components. The data bus is bidirectional.
Control bus – carries control signals from the processor to other components. The control bus also carries the clock's pulses. 14

Data BUS
Determines how much data can be copied from / to memory at a time (per cycle).
[Diagram: the PC holds the address of the current instruction (e.g., 4003); the address travels over the address bus to memory, and the instruction word at that address returns over the data bus into the IR.]
Advantages of Harvard vs. Von Neumann?
PC = program counter: a register that holds the address of the current instruction.
IR = instruction register: holds the current instruction while it is executed. 15

Address BUS
Determines how much memory can be accessed. 
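The stack-architecture sequence from the code-sequence examples above (Push A, Push B, Add, Pop C) can be sketched as a minimal stack machine in Python (names are illustrative; this is a sketch, not a real instruction set):

```python
# Minimal stack machine evaluating C = A + B, as in the
# code-sequence examples: Push A, Push B, Add, Pop C.
memory = {"A": 2, "B": 3, "C": 0}
stack = []

def push(addr):   # copy a value from memory onto the stack
    stack.append(memory[addr])

def add():        # pop the top two values, push their sum
    stack.append(stack.pop() + stack.pop())

def pop(addr):    # store the top of the stack back to memory
    memory[addr] = stack.pop()

push("A")
push("B")
add()
pop("C")
print(memory["C"])  # 5
```

The accumulator and register–register styles differ only in where the operands live: a single special register, or a register file, rather than the stack.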
The address is matched with the program counter / pointer.
[Diagram: an n-bit-wide address bus (A0…An) feeds a decoder that selects one row of the memory chip; chip-select, read and write signals arrive on the control bus, and Data 0…Data n transfer over the data bus.] 16

Address BUS
Each 'line' of the address bus corresponds to a bit in a register (the MAR – memory address register).
So, 8 lines = 8 bits = 2^8 = 256 locations.
32 lines = 32 bits = 2^32 = 4 GB.
64 lines = 64 bits = 2^64 = 18.4 EB (exabytes), about 1.84×10^19 bytes. 17

Register Sizing
Registers are the internal 'memory' of a CPU. They determine the maximum memory that can be addressed (PC) and the size of ALU operations. They can be 8-bit, 16-bit, 32-bit, 64-bit (or more!).
EAX, EBX, ECX, EDX: general-purpose registers. ESP: stack pointer (top of stack). EBP: stack base pointer. EFLAGS: condition codes. EIP: program counter (instruction pointer). 18

Next Steps…
In the late fifties / early sixties the number of computers available increased rapidly. Generally, each generation had a different architecture from the previous one (enhancements / improvements): different instruction sets, machines customised for individual customers. Was this a problem? IBM's solution: the System/360. 19

System 360
The System/360 popularised the use of microcode across the 'range', i.e. the same instruction set on every model. Example: multiplication. Any program written for one model would work on all – binary compatibility. 20

CISC – Complex Instruction Set Computer
Originally, CPU speed and memory speed were the same. Instructions varied in size, 1–5 words (16-bit). Depending on the data bus size, an instruction could take up to 5 cycles. Decoding may involve microcode. CISC was designed to make programming easier – either for the assembly programmer or for the compiler writer. 21

Evolution
In the mid seventies, John Cocke at IBM initiated research on the performance of microcoded CPUs. In 1980, Patterson built on this work and coined the term RISC. Conclusions: most programs use only a small percentage of the available instructions, and microcoded instructions are not always the best / most efficient way of performing a task. 
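The address-bus arithmetic above (2^n addressable locations for an n-line bus) is easy to verify:

```python
# An n-line address bus can select 2**n distinct memory locations.
def addressable(lines: int) -> int:
    return 2 ** lines

assert addressable(8) == 256             # 8 lines -> 256 locations
assert addressable(32) == 4 * 1024 ** 3  # 32 lines -> 4 GB
print(f"{addressable(64):.3e}")          # 64 lines -> ~1.845e+19 bytes (~18.4 EB)
```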
22

RISC – Reduced Instruction Set Computer
RISC should really stand for Reduced Complexity Instruction Set.
Ideal specification:
Each instruction executes in one cycle – increases processing efficiency.
No microcode.
All instructions work on registers, except load / store.
May need more storage, as programs contain more instructions. 23

RISC vs. CISC
Performance: RISC optimises cycles per instruction; CISC optimises instructions per program. A trade-off. How do we optimise cycle time? 24

Instruction Usage
Which is the most important type of instruction to optimise? 25

Example: VAX vs. MIPS
Description                  | Value (mean)
VAX CPI                      | 9.9
MIPS CPI                     | 1.7
CPI ratio (VAX/MIPS)         | 5.8
Instruction ratio (MIPS/VAX) | 2.2
RISC factor                  | 2.7
Cycles Per Instruction (CPI): instructions may take one or more cycles to complete. VAX – a CISC architecture. MIPS – a RISC architecture. RISC factor = CPI ratio / instruction ratio. 26

Modern Enhancements
How do we improve cycle times? Cycle time = I-time + E-time, where I-time = time taken to fetch the instruction from memory (or cache) and E-time = time taken by the ALU to execute the instruction. 27

Clock Speed
The system clock is like a 'heartbeat' for the system. Everything in the system is synchronised to the clock. Ideally the CPU performs one instruction (or more!) per clock cycle. 28

Clock Cycles
Cycles are measured in Hertz. 1 MHz = 1 million cycles/sec. 1 GHz = 1,000 million cycles/sec. Remember the ideal: > 1 instruction/cycle! 29

Fetch Decode Execute Cycle 30

Fetch Decode Execute Cycle
Fetch: the PC points to address 100 (instruction: LOAD 10). Address 100 is loaded into the MAR; the instruction is fetched into the MDR. The instruction is moved to the CIR; the PC increments to 101.
Decode: the Control Unit decodes LOAD 10 as "load data from address 10 into the Accumulator".
Execute: the MAR is updated with address 10. The value at address 10 (e.g., 2) is fetched into the MDR and transferred to the Accumulator.
Next instruction (ADD 11): address 101 is fetched. The value at address 11 (e.g., 3) is added to the Accumulator (2 + 3 = 5). 
Store (STORE 12): the result in the Accumulator (5) is stored at memory address 12. 31

Operations (Original CISC)
Analogy: the fast-food outlet…
1. Fetch – press the button, open the door.
2. Decode – take the order. 32

Operations
3. Execute – make the food.
4. Store – hand over to the customer. 33

CACHE
Expensive, high-speed memory used to speed up memory retrieval. Present in a relatively small amount, holding instructions (and data). 34

CACHE and Memory Hierarchy 35

Pipeline
[Diagram: fetching, decoding, executing and storing proceed simultaneously for successive instructions.] 36

Pipelines
The concept of using each functional unit simultaneously. Theoretically, an N-stage pipeline gives an N-times increase in speed. Problems in practice? Pipeline hazards: instruction 2's operand is the output of instruction 1 (a stall / read-after-write hazard); branching. 37

Superscalar
[Diagram: multiple, independent Decode / Execute / Store units.] Multiple, independent functional units. Instruction-level parallelism. 38

Superscalar
The concept of adding more functional units. Which instructions can be performed simultaneously? E = a + b; F = c + d; G = e * f. Theoretically, N units = N instructions processing in parallel. Problems in practice? The order of instructions? A common instruction fetch unit. 39

Co-Processors
The concept of providing hardware to perform specialist tasks (floating point, multimedia, etc.). Theoretically, 'special' tasks are processed simultaneously, and faster than in the ALU. Problems in practice? Examples – MMX (Flynn, SIMD). Single Instruction, Multiple Data streams (SIMD): one instruction operates on multiple data items at the same time. MMX – a hardware boost for multimedia performance. 40

Post CISC
Most modern processors combine the best of CISC (backward compatibility) with the best of RISC (performance). x86 processors (Intel, AMD, etc.) tend to use a hybrid design. 41

SUMMARY
Storage options (stack, accumulator or register). CISC vs. RISC. Modern enhancements (cache, pipelines, superscalar). Modern processor examples. 42

Thank you… 43
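As a closing worked example, the RISC factor from the VAX vs. MIPS comparison earlier in the lecture can be recomputed from the table's own figures:

```python
# RISC factor = CPI ratio / instruction ratio, using the VAX vs. MIPS table.
vax_cpi, mips_cpi = 9.9, 1.7
cpi_ratio = vax_cpi / mips_cpi           # ~5.8: VAX needs ~5.8x the cycles per instruction
instruction_ratio = 2.2                  # MIPS executes ~2.2x as many instructions per program
risc_factor = cpi_ratio / instruction_ratio
# ~2.6-2.7 depending on rounding; the lecture quotes 2.7
print(round(risc_factor, 2))  # 2.65
```

The point of the ratio: even though a RISC program runs more than twice as many instructions, each instruction costs so few cycles that MIPS still comes out well ahead overall.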