Microprocessor Evolution: 4004 to Pentium-4
Joel Emer
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Based on the material prepared by Krste Asanovic and Arvind
November 2, 2005
6.823 L15

First Microprocessor
Intel 4004, 1971
4-bit accumulator architecture
8µm pMOS
2,300 transistors
3 x 4 mm²
750kHz clock
8-16 cycles/inst.
[Image removed due to copyright restrictions. To view image, visit http://news.com.com/Images+Moores+Law+turns+40/2009-1041_3-5649019-5.html]

Microprocessors in the Seventies
Initial target was embedded control
– First micro, 4-bit 4004 from Intel, designed for a desktop printing calculator
Constrained by what could fit on single chip
– Single accumulator architectures
8-bit micros used in hobbyist personal computers
– Micral, Altair, TRS-80, Apple-II
Little impact on conventional computer market until VISICALC spreadsheet for Apple-II (6502, 1MHz)
– First “killer” business application for personal computers

DRAM in the Seventies
Dramatic progress in MOSFET memory technology
1970, Intel introduces first DRAM (1Kbit 1103)
1979, Fujitsu introduces 64Kbit DRAM
=> By mid-Seventies, obvious that PCs would soon have > 64KBytes physical memory

Microprocessor Evolution
Rapid progress in size and speed through 70s
– Fueled by advances in MOSFET technology and expanding markets
Intel i432
– Most ambitious seventies’ micro; started in 1975 - released 1981
– 32-bit capability-based object-oriented architecture
– Instructions variable number of bits long
– Severe performance, complexity, and usability problems
Motorola 68000 (1979, 8MHz, 68,000 transistors)
– Heavily microcoded (and nanocoded)
– 32-bit general purpose register architecture (24 address pins)
– 8 address registers, 8 data registers
Intel 8086 (1978, 8MHz, 29,000 transistors)
– “Stopgap” 16-bit processor, architected in 10 weeks
– Extended
accumulator architecture, assembly-compatible with 8080
– 20-bit addressing through segmented addressing scheme

Intel 8086
Class      Register   Purpose
Data:      AX, BX     “general” purpose
           CX         string and loop ops only
           DX         mult/div and I/O only
Address:   SP         stack pointer
           BP         base pointer (can also use BX)
           SI, DI     index registers
Segment:   CS         code segment
           SS         stack segment
           DS         data segment
           ES         extra segment
Control:   IP         instruction pointer (low 16 bit of PC)
           FLAGS      C, Z, N, B, P, V and 3 control bits
Typical format R

allows cheaper system
Estimated sales of 250,000
100,000,000s sold
Software
– Microsoft negotiates to provide OS for IBM. Later buys and modifies QDOS from Seattle Computer Products.
Open System
– Standard processor, Intel 8088
– Standard interfaces
– Standard OS, MS-DOS
– IBM permits cloning and third-party software

The Eighties: Personal Computer Revolution
Personal computer market emerges
– Huge business and consumer market for spreadsheets, word processing and games
– Based on inexpensive 8-bit and 16-bit micros: Zilog Z80, MOS Technology 6502, Intel 8088/86, …
Minicomputers replaced by workstations
– Distributed network computing and high-performance graphics for scientific and engineering applications (Sun, Apollo, HP,…)
– Based on powerful 32-bit microprocessors with virtual memory, caches, pipelined execution, hardware floating-point
– Commercial RISC processors developed for workstation market
Massively Parallel Processors (MPPs) appear
– Use many cheap micros to approach supercomputer performance (Sequent, Intel, Parsytec)

The Nineties
Advanced superscalar microprocessors appear
– first superscalar microprocessor is IBM POWER in 1990
MPPs have limited success in supercomputing market
– Highest-end mainframes and vector supercomputers survive “killer micro” onslaught
64-bit addressing becomes essential at high-end
– In 2004, 4GB DRAM costs
2x size of P-III
Clock frequency rising faster than
transistor speed
– deeper pipelines, fewer logic gates per cycle
– more advanced circuit designs (each gate goes faster)
⇒ Takes multiple cycles for signal to cross chip

Visible Wire Delay in P-4 Design
Stage   Function
1-2     TC Next IP
3-4     TC Fetch
5       Drive
6       Alloc
7-8     Rename
9       Queue
10      Schedule 1
11      Schedule 2
12      Schedule 3
13      Dispatch 1
14      Dispatch 2
15      Register File 1
16      Register File 2
17      Execute
18      Flags
19      Branch Check
20      Drive
Pipeline stages dedicated to just driving signals across chip!

P-4 Microarchitecture
[Block diagram. Figure by MIT OCW. Labels recovered from the figure:]
System Bus; Bus Interface Unit
Front-End BTB (4K Entries); Instruction TLB/Prefetcher
Instruction Decoder; Microcode ROM
Trace Cache BTB (512 Entries); Trace Cache (12K µops)
µop Queue
Allocator/Register Renamer
Memory µop Queue; Integer/Floating Point µop Queue
Schedulers: Memory Scheduler, Fast, Slow/General FP Scheduler, Simple FP
Integer Register File/Bypass Network; FP Register/Bypass
Execution units: AGU (Load Address), AGU (Store Address), 2x ALU (Simple Instr.), 2x ALU (Simple Instr.), Slow ALU (Complex Instr.), FP MMX SSE SSE2, FP Move
L1 Data Cache (8Kbyte 4-way); L2 Cache (256K byte 8-way)
Annotations: 64 bits wide; Quad Pumped, 3.2 GB/s; 48 GB/s; 256 bits

Microarchitecture Comparison
In-Order Execution | Out-of-Order Execution
Fetch, Br. Pred. | Fetch, Br. Pred.
Decode, Resolve | Decode, Resolve
Execute | Execute, ROB
Commit | Commit
In-Order throughout | In-Order fetch and decode, Out-of-Order execute, In-Order commit
Speculative fetch but not speculative execution - branch resolves before later instructions complete | Speculative execution, with branches resolved after later instructions complete
Completed values held in bypass network until commit | Completed values held in rename registers in ROB or unified physical register file until commit
Both styles of machine can use same branch predictors in front-end fetch pipeline, and both can execute multiple instructions per cycle
Common to have 10-30 pipeline stages in either style of design

MIPS R10000 (1995)
0.35µm CMOS, 4 metal layers
Four instructions per cycle
Out-of-order execution
Register renaming
Speculative execution past 4 branches
On-chip 32KB/32KB split I/D cache, 2-way set-associative
Off-chip L2 cache
Non-blocking caches
[Image removed due to copyright restrictions. To view the image, visit http://www-vlsi.stanford.edu/group/chips_micropro_body.html]
Compare with simple 5-stage pipeline (R5K series)
– ~1.6x performance SPECint95
– ~5x CPU logic area
– ~10x design effort
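
The R10000 comparison with the simple 5-stage pipeline can be read as a cost/benefit ratio. The following is a minimal sketch in Python that uses only the approximate ratios quoted for the R10000. The variable names and the "performance per unit of cost" metric are illustrative assumptions, not something the lecture defines.

```python
# Approximate ratios quoted for the R10000 relative to a simple
# 5-stage pipeline (R5K series). All names here are illustrative.
PERF_RATIO = 1.6     # ~1.6x SPECint95 performance
AREA_RATIO = 5.0     # ~5x CPU logic area
EFFORT_RATIO = 10.0  # ~10x design effort


def relative_efficiency(perf_ratio: float, cost_ratio: float) -> float:
    """Performance gained per unit of cost, relative to the baseline (1.0)."""
    return perf_ratio / cost_ratio


perf_per_area = relative_efficiency(PERF_RATIO, AREA_RATIO)
perf_per_effort = relative_efficiency(PERF_RATIO, EFFORT_RATIO)

print(f"performance per unit area:   {perf_per_area:.2f}x baseline")
print(f"performance per unit effort: {perf_per_effort:.2f}x baseline")
```

Under these assumed metrics, the out-of-order design delivers about a third of the baseline's performance per unit of logic area and about a sixth per unit of design effort, which is one way to see why the extra single-thread performance is expensive.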