Lec 04: Pentium 2 and Beyond
Summary
This document provides an overview of the Intel Pentium II and later processors, including their architecture, instruction sets, and branch prediction strategies. The lecture notes discuss the functionality and underlying principles of these microprocessors.
Full Transcript
Intel Pentium Processors – Outline: Introduction (Zhijian) – Willamette (11/2000); Instruction Set Architecture (Zhijian); Instruction Stream (Steve); Data Stream (Zhijian); What went wrong (Steve); Pentium 4 revisions – Northwood (1/2002), Xeon (Prestonia, ~2002), Prescott (2/2004); Dual Core – Smithfield.

Introduction. Intel architecture 32-bit (IA-32) – 80386 instruction set (1985) – CISC, 32-bit addresses, "flat" memory model. Registers – eight 32-bit general-purpose registers, eight FP stack registers, six segment registers.

IA-32 (cont'd). Addressing modes – register indirect (mem[reg]); base + displacement (mem[reg + const]); base + scaled index (mem[reg + (2^scale x index)]); base + scaled index + displacement (mem[reg + (2^scale x index) + displacement]); a short C sketch of this address calculation appears at the end of this overview. SIMD instruction sets – MMX (Pentium II): eight 64-bit MMX registers, integer operations only; SSE (Streaming SIMD Extensions, Pentium III): eight 128-bit registers.

As can be seen from the previous diagram, the integer unit has two pipelines (U and V), while the floating-point unit (FPU) has one pipeline. The Pentium's pipelined integer unit has five stages: 1) Prefetch, 2) Decode, 3) Address generate, 4) EX (Execute – ALU and cache access), 5) WB (Writeback). Although later processors such as the Pentium MMX modified these five execution steps (by adding intermediate buffer structures to hold groups of instructions), the steps remain the core foundation of the pipelining.

1) In the Prefetch stage, two prefetch buffers read the instructions to be executed. Instructions can be fetched into the U or V pipeline; the U pipeline handles the more complex instructions. 2) In the Decode stage, two decoders decode the instructions and try to pair them so they can run in parallel, since the Pentium has a superscalar architecture. Even though the Pentium is superscalar, two instructions can run concurrently, as in the diagram below, only if they satisfy certain rules: essentially, the instructions must be independent, otherwise they cannot be paired. 3) In the second decode stage, the address-generate stage, the addresses of memory operands are calculated; after these calculations, the EX stage of the pipeline is ready to execute. A floating-point instruction cannot be paired with an integer instruction. 4) In the Execute stage, the ALU is reached. 5) In the Writeback stage, results are written back to the registers.

For two instructions to be paired in the Decode stage, they must have no dependencies. The two paired instructions must also be basic, in the sense that they contain no displacements or immediate addressing. As a result, the pipelines will sometimes execute only one instruction at a time, despite the superscalar capability. If two instructions are executing concurrently in the pipeline (having satisfied the pairing conditions and being independent) and one of them stalls because of a hazard, the other one stalls as well.

Besides the superscalar capability of the Pentium processor, the branch prediction mechanism is a much-discussed improvement. Predicting the behavior of branches can have a very strong impact on the performance of a machine, since a wrong prediction results in a flush of the pipelines and wasted cycles. Branch prediction is done through a branch target buffer (BTB), which holds information about the branches seen so far.

Branch Prediction. The prediction of whether a jump will occur or not is based on the branch's previous behavior.
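As referenced under the addressing modes above, here is a minimal C sketch of the most general form, base + scaled index + displacement. The function name and the treatment of addresses as plain 32-bit integers are illustrative only; in the real processor this calculation happens in the address-generate stage.

    #include <stdint.h>

    /* Effective address for mem[base + (2^scale * index) + displacement].
       scale is 0..3, so the index is multiplied by 1, 2, 4, or 8. */
    uint32_t effective_address(uint32_t base, uint32_t index,
                               unsigned scale, int32_t displacement) {
        return base + (index << scale) + (uint32_t)displacement;
    }

For example, accessing element i of an array of 4-byte integers whose start address is held in a base register uses scale = 2: address = base + (i << 2).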
There are four possible states that describe a branch's disposition to jump. State 0: a jump is very unlikely. State 1: a jump is unlikely. State 2: a jump is likely. State 3: a jump is very likely. When a branch has its address in the branch target buffer, its behavior is tracked. If a branch does not jump two times in a row, it goes down to State 0. Once in State 0, the algorithm will not predict another jump unless the branch jumps twice in a row (which takes it from State 0 to State 2). Once in State 3, the algorithm will not predict a non-jump unless the branch is not taken twice in a row.

Branch Prediction. The diagram below portrays the four states associated with branch prediction. It is actually believed that the Pentium's branch prediction algorithm is incorrect. As can be seen in the diagram to the right, State 0 jumps directly to State 3, instead of following the usual path through State 1 and State 2. This anomaly can be attributed to the way the branch target buffer operates: if a branch is not found in the branch target buffer, it is predicted not to jump; a branch does not get an entry in the branch target buffer until the first time it jumps, and when it does, it goes straight into State 3. Because the branch gets no entry in the branch target buffer until its first taken jump, the actual state diagram is altered, as can clearly be seen. (Figure: the four states associated with branch prediction.)

Branch Prediction (in later Pentium models). The Intel Pentium branch prediction algorithm is indeed better than a 50% guess, but it has limitations. To increase the accuracy of branch predictions, the processors that followed the Pentium adopted a different branch prediction algorithm. Some loops have repetitive patterns, and these need to be recognized; a single two-bit counter cannot capture patterns of that complexity. Later-generation processors, such as the Pentium MMX, Pentium Pro, and Pentium II, use another mechanism for branch prediction. A 4-bit register records the previous behavior of the branch: if the 4-bit register held 0001, it would mean that out of the last four executions the branch jumped only the most recent time. A 4-bit register would not be of much use without additional logic, so in addition to the 4-bit register there are sixteen 2-bit counters like the ones shown previously. With a 4-bit register that records the behavior of the branch, together with sixteen 2-bit counters, the mechanism is able to give more accurate branch predictions. Since the register has 4 bits, it has 16 possible values, so the current value of the 4-bit register can always be associated with one of the sixteen 2-bit counters, as shown in the diagram. Each value of the 4-bit register represents a trend of that branch, and for each trend we must be able to predict the next outcome. Since each register value points to a different 2-bit counter, the state of that counter will most likely return the correct prediction for that particular register pattern. Therefore, by combining a 4-bit register that records past trends with sixteen individually updated 2-bit counters, we end up with a much stronger prediction mechanism, which is used in the Pentium MMX, Pentium II, and others.

Branch Prediction. The 4-bit register has 16 possible values, and the current value of the 4-bit register can always be associated with one of the sixteen 2-bit counters.
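To make this two-level scheme concrete, here is a minimal C sketch of a per-branch predictor in which a 4-bit history register selects one of sixteen 2-bit saturating counters. The structure and names (branch_entry, bp_predict, bp_update) are illustrative, not Intel's actual implementation.

    #include <stdbool.h>
    #include <stdint.h>

    /* Per-branch predictor state: a 4-bit history of recent outcomes
       (1 = taken) and sixteen 2-bit saturating counters, one per history
       pattern.  Counter values 0..1 predict not taken, 2..3 predict taken. */
    typedef struct {
        uint8_t history;       /* only the low 4 bits are used */
        uint8_t counter[16];   /* 2-bit saturating counters */
    } branch_entry;

    bool bp_predict(const branch_entry *e) {
        return e->counter[e->history & 0xF] >= 2;
    }

    void bp_update(branch_entry *e, bool taken) {
        uint8_t *c = &e->counter[e->history & 0xF];
        if (taken  && *c < 3) (*c)++;                      /* saturate at 3 */
        if (!taken && *c > 0) (*c)--;                      /* saturate at 0 */
        e->history = ((e->history << 1) | taken) & 0xF;    /* shift in the outcome */
    }

For a loop branch that is taken three times and then falls through (outcome pattern 1110 repeating), the history register settles into four recurring patterns, and the counter selected by each pattern learns the outcome that follows it, so both the taken iterations and the final not-taken iteration are predicted correctly, which a single 2-bit counter cannot do.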
New Generation Chips. The next step up from the Pentium was the Pentium MMX. The Pentium MMX includes new instructions, registers, and data types aimed at maximizing the speed of multimedia computations. Since multimedia work requires massive data manipulation, SIMD instructions were added in the MMX set. SIMD instructions work on multiple data values at once, in order to maximize the amount of work done by each instruction. The improved multimedia support of MMX, along with lower power consumption, larger caches, and new branch prediction mechanisms, brought about the new generations of Pentiums (II and III).

Superpipelined vs. Superscalar. Superpipelining: divide the instruction execution pipeline into smaller stages, e.g. a 5-stage pipeline (80486, Pentium) versus 12 stages (P6 processors). Superscalar: execute two or more instructions per clock cycle by using multiple execution units (including ALUs), e.g. the Pentium executes two instructions simultaneously (2-way superscalar); the Pentium II, III and Celeron are 3-way superscalar.

MMX (Multimedia Extension) provides two architectural enhancements over the non-MMX Pentium. ① 57 instructions are added for multimedia (audio, video, and graphics) applications. ② SIMD (Single-Instruction stream, Multiple-Data stream) allows the same operation to be performed on multiple data items; because many multimedia applications manipulate large blocks of data, SIMD provides a significant performance enhancement (a plain-C illustration of the SIMD model follows at the end of this section). For general applications, performance improves by roughly 10-20%; for multimedia applications, by nearly 70%. (Figures: data types for MMX technology; the SIMD execution model.)

P6 family processors (1995-1999). Intel Pentium Pro processor – three-way superscalar: decodes, dispatches, and completes execution of (retires) three instructions per clock cycle on average – introduced dynamic execution (micro-dataflow analysis, out-of-order execution, superior branch prediction, and speculative execution) in a superscalar implementation – enhanced by caches: two on-chip 8-KB first-level caches and a 256-KB second-level cache in the same package (two chips in the same package) – 36 address lines, for a maximum of 64 GB of memory. (FIGURE 1-14: The Pentium Pro is two chips in one. The larger die is the processor, the smaller a 256K L2 cache. Courtesy of Intel Corporation; figure from John Uffenbeck, The 80x86 Family: Design, Programming, and Interfacing, 3e, Pearson Education, 2002.)

Pentium Pro Processor Basic Execution Environment (figure): eight 32-bit general-purpose registers, six 16-bit segment registers, the 32-bit EFLAGS register, and the 32-bit EIP (instruction pointer register), with an address space running from 0 to 2^32 - 1; the address space can be flat or segmented.

Dynamic Execution: a new approach to processing software instructions that reduces idle processor time. ① Multiple branch prediction: the Pentium Pro can look as far as 30 instructions ahead to anticipate conditional branches, reducing wasted pipeline clocks. ② Dataflow analysis: looks at upcoming software instructions to find the optimal processing sequence. ③ Speculative execution: allows instructions to execute in a different order from the one in which they entered the processor ("out-of-order execution"); the results of these instructions are stored as speculative results until their final state can be determined.

Superscalar processor of degree three: the Pentium Pro has three instruction decoders and can execute three instructions simultaneously.
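As referenced above, a plain-C illustration of the SIMD execution model: one conceptual operation applies the same unsigned saturating add to eight packed bytes, the way a single MMX PADDUSB instruction operates across a 64-bit register. The function name is made up, and real MMX code would use one instruction rather than a loop.

    #include <stdint.h>

    /* One "SIMD operation": the same saturating add applied to all
       eight byte lanes of two packed 64-bit operands. */
    void packed_add_unsigned_saturate(const uint8_t a[8], const uint8_t b[8],
                                      uint8_t result[8]) {
        for (int lane = 0; lane < 8; lane++) {
            unsigned sum = (unsigned)a[lane] + b[lane];
            result[lane] = (sum > 255) ? 255 : (uint8_t)sum;  /* saturate at 255 */
        }
    }

Saturation matters for pixel data: adding brightness to an already bright pixel clamps at 255 instead of wrapping around to a dark value.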
Pentium Pro (cont'd): internal cache – the L2 cache is in the same package.

P6 family processors (cont'd). Pentium II processor – added Intel MMX technology – the processor core is packaged in the single edge contact cartridge (SECC) – the first-level (L1) caches are enlarged (16 KB each) – second-level (L2) cache sizes of 256 KB, 512 KB, and 1 MB are supported – a half-clock-speed backside bus connects the second-level cache and the processor – multiple low-power states such as AutoHALT, Stop-Grant, Sleep, and Deep Sleep are supported to conserve power when idle.

P6 family processors (cont'd). Pentium II Xeon processor – includes 4-way and 8-way versions with a 2-MB second-level cache running on a dual-clock-speed backside bus. Intel Celeron processor – focused on the PC market – a Pentium II without the L2 cache – uses the Slot 1 connector without the plastic cover, hence called a "naked CPU". Celeron A: includes a 128-KB L2 cache on the same die as the processor. Drawback: 66-MHz bus cycle. 370-pin PGA package (called Socket 370). (Figure: Celeron board; from John Uffenbeck, The 80x86 Family, 3e.)

P6 family processors (cont'd). Pentium III processor – introduced Streaming SIMD Extensions (SSE): expands the SIMD execution model by providing a new set of 128-bit registers and the ability to perform SIMD operations on packed single-precision floating-point values. Pentium III Xeon processor – enhanced with a full-speed, on-die Advanced Transfer Cache. Intel Pentium 4 processor – the latest IA-32 processor, equipped with the full set of IA-32 SIMD operations; the first implementation of a new micro-architecture called "NetBurst" by Intel (11/2000). (Figure: Pentium III with integrated L2 cache, more than 22 million transistors; from John Uffenbeck, The 80x86 Family, 3e.)
Pentium 4 Processor Family (2000-2005). Based on the Intel NetBurst microarchitecture; introduced Streaming SIMD Extensions 2 (SSE2). The Pentium 4 processor at 3.40 GHz supports Hyper-Threading Technology and Streaming SIMD Extensions 3 (SSE3). The Pentium 4 Processor Extreme Edition supports Intel Extended Memory 64 Technology and Hyper-Threading Technology. The Pentium 4 Processor 6xx series supports Intel Extended Memory 64 Technology. (Figures: NetBurst micro-architecture; Streaming SIMD Extensions 2 (SSE2); Streaming SIMD Extensions 3 (SSE3); horizontal and asymmetric processing; horizontal data movement in ADDSUBPD.)

Intel Xeon Processor (2001-2005). Based on the Intel NetBurst microarchitecture. As a family, this group of IA-32 processors is designed for use in multiprocessor server systems and high-performance workstations. The Intel Xeon processor MP supports Hyper-Threading Technology; the 64-bit Intel Xeon processor at 3.60 GHz with an 800-MHz system bus introduced Intel Extended Memory 64 Technology.

Intel Pentium M Processor (2003-2005). A low-power mobile processor family designed for extended battery life and seamless integration. Its extended microarchitecture includes: support for dynamic execution; a low-power core with copper interconnect; an on-die primary 32-KB instruction cache and 32-KB write-back data cache, plus a second-level 2-MB cache with the Advanced Transfer Cache architecture; advanced branch prediction and data prefetch logic; and support for MMX technology, Streaming SIMD instructions, and the SSE2 instruction set.

Intel Pentium Processor Extreme Edition (2005). Introduced dual-core technology, which provides advanced hardware multithreading support. Based on the Intel NetBurst microarchitecture; supports SSE, SSE2, SSE3, Hyper-Threading Technology, and Intel Extended Memory 64 Technology. (Figures: Pentium III vs. Pentium 4; pipeline comparison between Pentium 3 and Pentium 4; execution on MPEG-4 benchmarks at 1 GHz.)

Instruction Set Architecture. Pentium 4 ISA = Pentium III ISA + SSE2 (Streaming SIMD Extensions 2). SSE2 is an architectural enhancement to the IA-32 architecture. SSE2 extends MMX and the SSE extensions with 144 new instructions: 1. 128-bit SIMD integer arithmetic operations; 2. 128-bit SIMD double-precision floating-point operations; 3. enhanced cache and memory management operations.

Comparison between SSE and SSE2. Both support operations on the 128-bit XMM registers. SSE only supports 4 packed single-precision floating-point values. SSE2 supports more: 2 packed double-precision floating-point values, 16 packed byte integers, 8 packed word integers, 4 packed doubleword integers, 2 packed quadword integers, or one double quadword. (Figure: Pentium 4 data types.)

Hardware support for SSE2. The adder and multiplier units in the SSE2 engine are 128 bits wide, twice the width of those in the Pentium III. There is increased load/store bandwidth for floating-point values: loads and stores are 128 bits wide.
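A small example of the packed double-precision model described above, written with the standard SSE2 compiler intrinsics (emmintrin.h) rather than assembly; each _mm_add_pd corresponds to one packed-double ADDPD. The function name is invented, and an even array length is assumed for brevity.

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* Add two arrays of doubles two elements at a time using 128-bit
       XMM registers: load a pair, add the pair, store the pair. */
    void add_packed_doubles(const double *a, const double *b,
                            double *out, int n /* assumed even */) {
        for (int i = 0; i < n; i += 2) {
            __m128d va = _mm_loadu_pd(&a[i]);    /* 128-bit load: a[i], a[i+1] */
            __m128d vb = _mm_loadu_pd(&b[i]);
            _mm_storeu_pd(&out[i], _mm_add_pd(va, vb));   /* 128-bit store */
        }
    }

The same 128-bit register can instead be treated as 16 bytes, 8 words, 4 doublewords, or 2 quadwords by the SIMD integer instructions; the instruction chosen, not the register, determines how the bits are partitioned.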
One load plus one store can be completed between an XMM register and the L1 cache in one clock cycle.

SSE2 Instructions (1). Data movement: move data between XMM registers, and between XMM registers and memory. Double-precision floating-point operations: arithmetic instructions on both scalar and packed values. Logical instructions: perform logical operations on packed double-precision floating-point values. Compare instructions: compare packed and scalar double-precision floating-point values. Shuffle and unpack instructions: shuffle or interleave double-precision floating-point values in packed double-precision operands.

SSE2 Instructions (2). Conversion instructions: convert between doubleword integers and double-precision floating-point values, or between single-precision and double-precision floating-point values. Packed single-precision floating-point instructions: convert between single-precision floating-point and doubleword integer operands. 128-bit SIMD integer instructions: operations on integers held in XMM registers. Cacheability control and instruction ordering: additional operations for controlling the caching of data stored from XMM registers to memory, and additional control of instruction ordering on store operations.

Instruction Stream. What's new? – Added the Trace Cache – Improved branch predictor. Terminology: a micro-op (µop) is a decoded, RISC-like instruction; the front end handles instruction fetch and issue. The front end: 1. prefetches instructions that are likely to be executed; 2. fetches instructions that have not been prefetched; 3. decodes instructions into µops; 4. generates µops for complex instructions and special-purpose code; 5. predicts branches.

Prefetch. Three methods of prefetching: 1. instructions only (hardware); 2. data only (software); 3. code or data (hardware).

Decoder. A single decoder that can operate at a maximum of one instruction per cycle and receives instructions from the L2 cache 64 bits at a time. Some complex instructions must enlist the help of the microcode ROM.

Trace Cache. This is the primary instruction cache in the NetBurst architecture. It stores decoded µops and has a capacity of 12K µops. On a Trace Cache miss, instructions are fetched and decoded from the L2 cache. (Figure: a traditional instruction cache holds instructions in static program order, e.g. I1 I2 I3 I4; the trace cache holds them in the dynamically executed order across taken branches, e.g. I1 I2 I6 I7.) The Pentium 4 Trace Cache has its own branch predictor that directs where instruction fetching needs to go next within the Trace Cache. It removes the decoding cost of frequently decoded instructions and the extra latency of re-decoding instructions after branch mispredictions. The microcode ROM is used for complex IA-32 instructions (more than 4 µops), such as string moves, and for fault and interrupt handling. When a complex instruction is encountered, the Trace Cache jumps into the microcode ROM, which then issues the µops; after the microcode ROM finishes, the front end resumes fetching µops from the Trace Cache.

Branch Prediction. It predicts ALL near branches, including conditional branches, unconditional calls and returns, and indirect branches. It does not predict far transfers, such as far calls, irets, and software interrupts. It dynamically predicts the direction and target of branches based on the PC, using the BTB. If no dynamic prediction is available, it statically predicts "taken" for backward (looping) branches and "not taken" for forward branches (see the sketch after this subsection). Traces are built across predicted branches in order to avoid branch penalties. The branch target buffer mechanism uses a branch history table and a branch target buffer for prediction; updating occurs when the branch is retired.
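A minimal sketch of that static fallback rule (backward taken, forward not taken); the function name and the use of raw 32-bit addresses are illustrative only.

    #include <stdbool.h>
    #include <stdint.h>

    /* Static prediction used when the BTB has no entry for the branch:
       a target below the branch address is a backward (loop) branch and
       is predicted taken; a forward target is predicted not taken. */
    bool static_predict_taken(uint32_t branch_addr, uint32_t target_addr) {
        return target_addr < branch_addr;
    }

The heuristic works because backward conditional branches are overwhelmingly loop-closing branches, which are taken on every iteration except the last.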
The Return Address Stack contains 16 entries and predicts the return addresses for procedure calls. It allows branches and their targets to coexist in a single cache line, which increases parallelism since decode bandwidth is not wasted. Pentium 4 branch hints permit software to give hints to the branch prediction and trace formation hardware to enhance its performance. They take the form of prefixes added to conditional branch instructions, are used only at trace-build time, and have no effect on already built traces.

Out-of-Order Execution is designed to optimize performance by handling the most common operations in the most common context as fast as possible. It can handle 126 µops at once, with up to 48 loads and 24 stores. Issue refers to instructions being fetched and decoded by a translation engine; the translation engine builds instructions into sequences of micro-ops and stores the micro-ops in the trace cache. The trace cache can issue 3 micro-ops per cycle. The execution units can dispatch up to 6 µops per cycle, which exceeds the trace cache issue bandwidth and the retirement bandwidth; this allows greater flexibility in issuing µops to different execution units. Double-pumped ALUs execute an operation on both the rising and the falling edge of the clock cycle. Retirement can retire 3 µops per cycle; it provides precise exceptions and uses the reorder buffer to organize completed µops. It also keeps track of branches and sends updated branch information to the BTB. (Figures: execution units; execution pipeline.)

Register Renaming. There is an 8-entry architectural register file (tracked through the RAT and ROB) and a 128-entry physical register file for data. There are two RATs (Register Alias Tables), referred to as the frontend RAT and the retirement RAT. Data does not need to be copied between register files when an instruction retires.

On-chip Caches. There are three on-chip caches: the L1 instruction cache (the Trace Cache), the L1 data cache, and the unified L2 cache. The caches are not inclusive, and a pseudo-LRU replacement algorithm is used. The L1 instruction cache is the Execution Trace Cache, which stores decoded instructions, removes decoder latency from the main execution loops, and integrates a path of program execution into a single line. The L1 data cache is non-blocking and supports up to 4 outstanding load misses; load latency is 2 clocks for integer and 6 clocks for floating-point, with 1 load and 1 store per clock. Loads are speculative: an access is assumed to hit in the cache, and a "replay" mechanism re-executes dependent instructions if the access turns out to miss. The L2 cache is also non-blocking, with a net load access latency of 7 cycles and a bandwidth of one load and one store per cycle; a new cache operation can begin every 2 cycles, and a 256-bit-wide bus between L1 and L2 delivers 48 GB per second at 1.5 GHz.

Data Prefetcher in the L2 Cache. A hardware prefetcher monitors reference patterns and brings in cache lines automatically. It attempts to stay 256 bytes ahead of the current data access location and performs prefetching for up to 8 simultaneous, independent streams.
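A heavily simplified C sketch of a streaming prefetcher of that kind. The 256-byte lookahead and the 8-stream limit come from the description above; the data structures, the stream-matching test, and the names (stream_t, on_l2_access, issue_prefetch) are invented for illustration and do not describe Intel's actual hardware.

    #include <stdint.h>

    #define NSTREAMS  8      /* up to 8 independent streams tracked */
    #define LOOKAHEAD 256    /* try to stay 256 bytes ahead of the access */
    #define LINE      64     /* cache line size in bytes */

    typedef struct { uint64_t last_addr; int valid; } stream_t;
    static stream_t stream[NSTREAMS];

    /* Stand-in for the hardware prefetch request port. */
    static void issue_prefetch(uint64_t line_addr) { (void)line_addr; }

    /* Called on each L2 data access. */
    void on_l2_access(uint64_t addr) {
        for (int i = 0; i < NSTREAMS; i++) {
            if (stream[i].valid && addr >= stream[i].last_addr &&
                addr - stream[i].last_addr <= LOOKAHEAD) {
                /* Access continues a known forward stream: fetch the line
                   LOOKAHEAD bytes ahead of it. */
                issue_prefetch((addr + LOOKAHEAD) & ~(uint64_t)(LINE - 1));
                stream[i].last_addr = addr;
                return;
            }
        }
        /* Otherwise start tracking a new stream (round-robin replacement). */
        static int victim = 0;
        stream[victim] = (stream_t){ .last_addr = addr, .valid = 1 };
        victim = (victim + 1) % NSTREAMS;
    }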
Store and Load. Store and load operations execute out of order, but stores always commit in program order. 48 loads and 24 stores can be in flight; store buffers and load buffers are allocated at the allocation stage, with a total of 24 store buffers and 48 load buffers.

Store. Store operations are divided into two parts: store data and store address. Store data is dispatched to the fast ALU, which operates twice per cycle; the store address is dispatched to the store AGU once per cycle.

Store-to-Load Forwarding. Data is forwarded from a pending store buffer to a dependent load. Load stalls still happen when the bytes required by the load are not exactly the same as the bytes in the pending store buffer.

Small L1 Cache. Only 8 KB! The size of the L2 cache was doubled to compensate. Compare with: AMD Athlon – 128 KB; Alpha 21264 – 64 KB; Pentium III – 32 KB; Itanium – 16 KB.

System Bus. Delivers data at 3.2 GB/s over a 64-bit-wide bus: four data phases per clock cycle (quad-pumped) on a 100-MHz clocked system bus.

L3 cache (virtual). Original plans called for a 1-MB cache. Intel's idea was to strap a separate memory chip, perhaps an SDRAM, onto the back of the processor to act as the L3. But that added another 100 pads to the processor, and it would also have forced Intel to devise an expensive cartridge package to contain the processor and cache memory.

The Intel P5 and P6 family (transistor counts in thousands, technology in µm, clock in MHz):
Year | Type | Transistors | Technology | Clock | Issue | Word format | L1 cache | L2 cache
P5:
1993 | Pentium | 3100 | 0.8 | 66 | 2 | 32-bit | 2 x 8 kB | –
1994 | Pentium | 3200 | 0.6 | 75-100 | 2 | 32-bit | 2 x 8 kB | –
1995 | Pentium | 3200 | 0.6/0.35 | 120-133 | 2 | 32-bit | 2 x 8 kB | –
1996 | Pentium | 3300 | 0.35 | 150-166 | 2 | 32-bit | 2 x 8 kB | –
1997 | Pentium MMX | 4500 | 0.35 | 200-233 | 2 | 32-bit | 2 x 16 kB | –
1998 | Mobile Pentium MMX | 4500 | 0.25 | 200-233 | 2 | 32-bit | 2 x 16 kB | –
P6:
1995 | PentiumPro | 5500 | 0.35 | 150-200 | 3 | 32-bit | 2 x 8 kB | 256/512 kB
1997 | PentiumPro | 5500 | 0.35 | 200 | 3 | 32-bit | 2 x 8 kB | 1 MB
1998 | Intel Celeron | 7500 | 0.25 | 266-300 | 3 | 32-bit | 2 x 16 kB | --
1998 | Intel Celeron | 19000 | 0.25 | 300-333 | 3 | 32-bit | 2 x 16 kB | 128 kB
1997 | Pentium II | 7000 | 0.25 | 233-450 | 3 | 32-bit | 2 x 16 kB | 256 kB/512 kB
1998 | Mobile Pentium II | 7000 | 0.25 | 300 | 3 | 32-bit | 2 x 16 kB | 256 kB/512 kB
1998 | Pentium II Xeon | 7000 | 0.25 | 400-450 | 3 | 32-bit | 2 x 16 kB | 512 kB/1 MB
1999 | Pentium II Xeon | 7000 | 0.25 | 450 | 3 | 32-bit | 2 x 16 kB | 512 kB/2 MB
1999 | Pentium III | 8200 | 0.25 | 450-1000 | 3 | 32-bit | 2 x 16 kB | 512 kB
1999 | Pentium III Xeon | 8200 | 0.25 | 500-1000 | 3 | 32-bit | 2 x 16 kB | 512 kB
NetBurst:
2000 | Pentium 4 | 42000 (including L2 cache) | 0.18 | 1500 | 3 | 32-bit | 8 kB / 12k µOps | 256 kB

Northwood and Prescott. Code name Willamette was announced for mid-2000: a native IA-32 processor with a Pentium III processor core, running at 1.5 GHz, 0.18 µm, about 42 million transistors, 20 pipeline stages (integer pipeline, IF and ID not included), and a trace execution cache (TEC) for the decoded µops.

NetBurst micro-architecture. Rapid Execution Engine – Intel: "Arithmetic Logic Units (ALUs) run at twice the processor frequency"; in fact there are two ALUs running at the processor frequency, connected by a multiplexer running at twice the processor frequency. Hyper-Pipelined Technology – a twenty-stage pipeline to enable high clock rates, frequency headroom, and performance scalability. A very deep, out-of-order, speculative execution engine – up to 126 instructions in flight (3 times more than the Pentium III processor) – up to 48 loads and 24 stores in the pipeline (2 times more than the Pentium III processor). Branch prediction – based on µops – a 4K-entry branch target array (8 times larger than the Pentium III processor's) – a new algorithm that reduces mispredictions by about one third compared to the gshare of the P6 generation (a minimal gshare sketch follows below).
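For reference, a minimal C sketch of a gshare predictor, the scheme used here as the baseline: the global branch history is XORed with the branch address to index a single table of 2-bit saturating counters. The table size and names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define GSHARE_BITS 12                    /* 4096-entry pattern history table */
    #define GSHARE_SIZE (1u << GSHARE_BITS)

    static uint8_t  pht[GSHARE_SIZE];         /* 2-bit saturating counters */
    static uint32_t global_history;           /* recent branch outcomes, 1 = taken */

    static uint32_t gshare_index(uint32_t pc) {
        return (pc ^ global_history) & (GSHARE_SIZE - 1);   /* XOR fold */
    }

    bool gshare_predict(uint32_t pc) {
        return pht[gshare_index(pc)] >= 2;
    }

    void gshare_update(uint32_t pc, bool taken) {
        uint8_t *c = &pht[gshare_index(pc)];   /* same index as at prediction time */
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        global_history = (global_history << 1) | taken;      /* shift in the outcome */
    }

Because all branches share one counter table, two branches whose PC-XOR-history indices collide interfere with each other; the hybrid predictors discussed later are one response to that.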
Foster: a Pentium 4 with an external L3 cache and DDR-SDRAM support, provided for servers; clock rate 1.7-2 GHz; to be launched in Q2/2001.

Northwood version, released 1/2002. Differences from Willamette – Socket 478 with a 21-stage pipeline – 512-KB L2 cache – 2.0-GHz and 2.2-GHz clock frequencies – 0.13-µm fabrication process (130 nm) – about 55 million transistors.

Prescott version, released 2/2004. Differences – a 31-stage pipeline! – 1-MB L2 cache – 3.8-GHz clock frequency – 0.09-µm fabrication process (90 nm) – SSE3. (Figures: Northwood; Prescott.)

Intel Celeron Processor. Overview of the Celeron: introduced in 1999; equipped with MMX, with newer versions adding SSE; originally based on the Pentium II, while current versions use a core similar to the Pentium III. Original Celeron: slower than other chips of the time, no L2 cache, lacked casing, easily overclocked, inexpensive. Later Celerons: moved to Socket 370; based on the Mendocino core; SSE instruction set; half the L2 cache of the PIII. Current Celerons are based on the Coppermine core. Packaging: Single Edge Contact Cartridge 2 (S.E.C.C.2), Flip-Chip Pin Grid Array (FC-PGA), Flip-Chip Pin Grid Array 2 (FC-PGA2).

Celeron Features. OS compatibility: 1. all Windows (including MS-DOS); 2. most Linux; 3. SCO Unix / UnixWare; 4. Sun Solaris; 5. OS/2; 6. OPENSTEP. Based on the P6 micro-architecture: 10-stage pipeline, x86 decoder (CISC to RISC), 36-bit address space (64 GB of addressable memory), MMX instruction set (SIMD) and Streaming SIMD Extensions.

Processor comparison. Duron: 64-bit cache-to-core pathway; Celeron: 256-bit cache-to-core pathway.
Feature | AMD Duron | Intel Celeron
L1 cache | 128 KB | 32 KB
L2 cache | 64 KB | 128 KB
Bus speed | 200 MHz (100 MHz x 2) | 66 MHz
Interface | Socket A | Socket 370

Mobile Celeron: voltage difference, reduced power consumption, variable power consumption (Intel SpeedStep), 0.13-micron process, smaller chip size.

Benchmarks.
Benchmark | AMD Duron | Intel Celeron
SYSmark 2000 | 127 | 124
SiSoft Sandra 2000 | 1947 MIPS / 973 MFLOPS | 1883 MIPS / 935 MFLOPS
3DMark 2000 | 3726 | 3629
CPU Mark | 328 | 283
MP3 encoding | 122 seconds | 123 seconds
Inspire3D | 366 seconds | 339 seconds

Overclocking the Celeron. The original Celeron had no L2 cache and used Pentium II cores on a 66-MHz bus (100-MHz cores). Current Celerons have locked clock multipliers, so the FSB must be changed to overclock the chip. Risks of overclocking: increased heat production, broken-down connections, other components may not work with the faster bus, and DESTROYING THE CHIP!!!!

Hybrid Predictors. The second strategy of McFarling is to combine multiple separate branch predictors, each tuned to a different class of branches. A combining or hybrid predictor needs two or more predictors and a predictor selection mechanism: McFarling combined a two-bit predictor with a gshare two-level adaptive predictor; Young and Smith combined compiler-based static branch prediction with a two-level adaptive predictor. Hybrid predictors are often better than single-type predictors.

Simulations by Grunwald (1998). Table 1.1: SAg, gshare, and McFarling's combining predictor.
Application | committed instructions (millions) | conditional branches (millions) | taken branches (%) | misprediction rate (%) SAg | gshare | combining
compress | 80.4 | 14.4 | 54.6 | 10.1 | 10.1 | 9.9
gcc | 250.9 | 50.4 | 49.0 | 12.8 | 23.9 | 12.2
perl | 228.2 | 43.8 | 52.6 | 9.2 | 25.9 | 11.4
go | 548.1 | 80.3 | 54.5 | 25.6 | 34.4 | 24.1
m88ksim | 416.5 | 89.8 | 71.7 | 4.7 | 8.6 | 4.7
xlisp | 183.3 | 41.8 | 39.5 | 10.3 | 10.2 | 6.8
vortex | 180.9 | 29.1 | 50.1 | 2.0 | 8.3 | 1.7
jpeg | 252.0 | 20.0 | 70.0 | 10.3 | 12.5 | 10.4
mean | 267.6 | 46.2 | 54.3 | 8.6 | 14.5 | 8.1

Results. A simulation by Keeton et al.
(1998), using an OLTP (online transaction processing) workload on a PentiumPro multiprocessor, reported a misprediction rate of 14% with a branch instruction frequency of about 21%. The speculative execution factor, given by the number of instructions decoded divided by the number of instructions committed, is 1.4 for the database programs. Two different conclusions may be drawn from these simulation results: branch predictors should be further improved, and/or branch prediction is only effective if the branch is predictable. If a branch outcome depends on irregular data inputs, the branch often shows irregular behavior.

Predicated Instructions and Multipath Execution. Confidence estimation is a technique for assessing the quality of a particular prediction. Applied to branch prediction, a confidence estimator attempts to assess the prediction made by a branch predictor. A low-confidence branch is a branch that frequently changes its direction in an irregular way, making its outcome hard to predict or even unpredictable. Four classes are possible: correctly predicted with high confidence C(HC), correctly predicted with low confidence C(LC), incorrectly predicted with high confidence I(HC), and incorrectly predicted with low confidence I(LC).

Implementation of a confidence estimator. Information from the branch prediction tables is used. Saturating-counter information can be used to construct a confidence estimator, so that the processor speculates more aggressively when the confidence level is higher. A miss distance counter table (MDC) can be used: each time a branch is predicted, the value in the MDC is compared to a threshold; if the value is above the threshold, the branch is considered to have high confidence, and low confidence otherwise. Alternatively, since a small number of branch history patterns typically leads to correct predictions in a PAs predictor scheme, the confidence estimator can assign high confidence to a fixed set of patterns and low confidence to all others. Confidence estimation can be used for speculation control, for thread switching in multithreaded processors, or for multipath execution.

Predicated Instructions. Provide predicated or conditional instructions and one or more predicate registers. Predicated instructions use a predicate register as an additional input operand. The Boolean result of a condition test is recorded in a (one-bit) predicate register. Predicated instructions are fetched, decoded, and placed in the instruction window like non-predicated instructions. It depends on the processor architecture how far a predicated instruction proceeds speculatively in the pipeline before its predicate is resolved. A predicated instruction may execute only if its predicate is true, and otherwise be discarded; in this case predicated instructions are not executed before the predicate is resolved. Alternatively, as reported for Intel's IA-64 ISA, the predicated instruction may be executed, but it commits only if the predicate is true; otherwise the result is discarded.

Predication example. Original code:
  if (x == 0) { a = b + c; d = e - f; }
  g = h * i;
Predicated (if-converted) form:
  Pred = (x == 0);
  if Pred then a = b + c;
  if Pred then d = e - f;
  g = h * i;
(A C sketch of the "execute, then conditionally commit" variant follows below.)

Predication. Predication can eliminate a branch and therefore the associated branch prediction, increasing the distance between mispredictions. The run length of a code block is increased, which enables better compiler scheduling. Predication affects the instruction set, adds a port to the register file, and complicates instruction execution.
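A hedged C rendering of the IA-64-style alternative mentioned above, in which both guarded operations are computed unconditionally and the predicate only decides whether their results are committed. The variable roles follow the example; the function wrapper is invented.

    /* If-converted form of:  if (x == 0) { a = b + c; d = e - f; }  g = h * i;
       Both results are computed regardless of the predicate; the conditional
       expressions only decide which value is written back (compare IA-64,
       where the instruction executes but commits only if its predicate is true). */
    void predicated_block(int x, int *a, int b, int c,
                          int *d, int e, int f,
                          int *g, int h, int i) {
        int pred = (x == 0);           /* one-bit predicate register */
        int t1 = b + c;                /* executed unconditionally */
        int t2 = e - f;
        *a = pred ? t1 : *a;           /* commit only when pred is true */
        *d = pred ? t2 : *d;
        *g = h * i;                    /* not predicated */
    }

The discarded work is visible here: when pred is false, t1 and t2 are still computed, which is exactly the fetch and execution bandwidth that the next paragraph notes predication consumes.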
Predicated instructions that are discarded still consume processor resources, especially fetch bandwidth. Predication is most effective when control dependences can be completely eliminated, such as in an if-then with a small then-body. The use of predicated instructions is limited when the control flow involves more than a simple alternative sequence.

Eager (Multipath) Execution. Execution proceeds down both paths of a branch, and no prediction is made. When the branch resolves, all operations on the non-taken path are discarded. Oracle execution, i.e. eager execution with unlimited resources, gives the same theoretical maximum performance as perfect branch prediction. With limited resources, the eager execution strategy must be employed carefully, so a mechanism is needed that decides when to employ prediction and when eager execution, e.g. a confidence estimator. Eager execution is rarely implemented (IBM mainframes), but there are research projects: the Dansoft processor, the Polypath architecture, selective dual-path execution, simultaneous speculation scheduling, and disjoint eager execution. (Figure: (a) single-path speculative execution, (b) full eager execution, (c) disjoint eager execution, with the cumulative probability of each speculated path.)

Prediction of Indirect Branches. Indirect branches, which transfer control to an address stored in a register, are harder to predict accurately. Indirect branches occur frequently in machine code compiled from object-oriented programs such as C++ and Java. One simple solution is to extend the PHT to include the branch target addresses.

Branch handling techniques and implementations:
Technique | Implementation examples
No branch prediction | Intel 8086
Static prediction: always not taken | Intel i486
Static prediction: always taken | Sun SuperSPARC
Static prediction: backward taken, forward not taken | HP PA-7x00
Static prediction: semistatic with profiling | early PowerPCs
Dynamic prediction: 1-bit | DEC Alpha 21064, AMD K5
Dynamic prediction: 2-bit | PowerPC 604, MIPS R10000, Cyrix 6x86 and M2, NexGen 586
Dynamic prediction: two-level adaptive | Intel PentiumPro, Pentium II, AMD K6, Athlon
Hybrid prediction | DEC Alpha 21264
Predication | Intel/HP Merced and most signal processors, e.g. ARM processors, TI TMS320C6201, and many others
Eager execution (limited) | IBM mainframes: IBM 360/91, IBM 3090
Disjoint eager execution | none yet

High-Bandwidth Branch Prediction. Future microprocessors will require more than one prediction per cycle, starting speculation over multiple branches in a single cycle; e.g. a GAg predictor is independent of the branch address. When multiple branches are predicted per cycle, instructions must be fetched from multiple target addresses per cycle, which complicates I-cache access. A possible solution is a trace cache in combination with next-trace prediction. Most likely a combination of branch handling techniques will be applied, e.g. a multi-hybrid branch predictor combined with support for context switching, indirect jumps, and interference handling.

PentiumPro and Pentium II/III. The Pentium II/III processors use the same dynamic execution microarchitecture as the other members of the P6 family. This three-way superscalar, pipelined micro-architecture features a decoupled, multi-stage superpipeline, which trades less work per pipestage for more stages. The Pentium II/III processor has twelve stages, with a pipestage time 33 percent shorter than the Pentium processor's, which helps achieve a higher clock rate on any given manufacturing process. It has a wide instruction window using an instruction pool.
Optimized scheduling requires the fundamental "execute" phase to be replaced by decoupled "issue/execute" and "retire" phases. This allows instructions to be started in any order but always to be retired in the original program order. Processors in the P6 family may be thought of as three independent engines coupled with an instruction pool.

(Figure: Pentium Pro and Pentium II/III microarchitecture block diagram, showing the external bus, L2 cache, bus interface unit, memory reorder buffer, D-cache unit, memory interface unit, instruction fetch unit (with I-cache), branch target buffer, instruction decode unit, microcode instruction sequencer, register alias table, reservation station unit, functional units, and the reorder buffer and retirement register file.)

Pentium II/III: The In-Order Section. The instruction fetch unit (IFU) accesses a non-blocking I-cache; it contains the Next IP unit. The Next IP unit provides the I-cache index (based on inputs from the BTB), trap/interrupt status, and branch-misprediction indications from the integer FUs. Branch prediction uses the two-level adaptive scheme of Yeh and Patt; the BTB contains 512 entries and maintains branch history information and the predicted branch target address. The branch misprediction penalty is at least 11 cycles, 15 cycles on average. The instruction decoder unit (IDU) is composed of three separate decoders.

Pentium II/III: The In-Order Section (continued). A decoder breaks an IA-32 instruction down into µops, each comprising an opcode, two source operands, and one destination operand. These µops are of fixed length. Most IA-32 instructions are converted directly into single µops (by any of the three decoders), some instructions are decoded into one to four µops (by the general decoder), and the more complex instructions are used as indices into the microcode instruction sequencer (MIS), which generates the appropriate stream of µops. The µops are sent to the register alias table (RAT), where register renaming is performed, i.e. the logical IA-32 register references are converted into references to physical registers. Then, with added status information, the µops continue to the reorder buffer (ROB, 40 entries) and to the reservation station unit (RSU, 20 entries).

(Figure: the fetch/decode unit. IA-32 instructions flow from the instruction fetch unit (I-cache, Next_IP, alignment) and branch target buffer through the instruction decode unit (one general decoder and two simple decoders), the microcode instruction sequencer, and the register alias table, producing µop1, µop2, µop3; panels (a) in-order section, (b) instruction decoder unit (IDU).)

The Out-of-Order Execute Section. When the µops flow into the ROB, they effectively take a place in program order. The µops also go to the RSU, which forms a central instruction window with 20 reservation stations (RS), each capable of hosting one µop. µops are issued to the FUs according to dataflow constraints and resource availability, without regard to the original ordering of the program. After completion, the result goes to two different places, the RSU and the ROB. The RSU has five ports and can issue at a peak rate of 5 µops per cycle.
Latencies and throughput for Pentium II/III functional units:
RSU port | Functional unit | Latency | Throughput
0 | Integer arithmetic/logical | 1 | 1
0 | Shift | 1 | 1
0 | Integer multiply | 4 | 1
0 | Floating-point add | 3 | 1
0 | Floating-point multiply | 5 | 0.5
0 | Floating-point divide | long | non-pipelined
0 | MMX arithmetic/logical | 1 | 1
0 | MMX multiply | 3 | 1
1 | Integer arithmetic/logical | 1 | 1
1 | MMX arithmetic/logical | 1 | 1
1 | MMX shift | 1 | 1
2 | Load | 3 | 1
3 | Store address | 3 | 1
4 | Store data | 1 | 1

(Figure: the reservation station unit issues to Port 0 (integer, floating-point, and MMX functional units), Port 1 (integer, MMX, and jump functional units), Port 2 (load unit), Port 3 (store-address unit), and Port 4 (store-data unit), with results returning to the reorder buffer.)

The In-Order Retire Section. A µop can be retired if its execution is completed, if it is its turn in program order, and if no interrupt, trap, or misprediction occurred. Retirement means taking data that was speculatively created and writing it into the retirement register file (RRF). Three µops per clock cycle can be retired. (Figure: the retire unit, connected to the D-cache, the memory interface unit, the reservation station unit, the retirement register file, and the reorder buffer.)

(Figure: the Pentium II/III pipeline. (a) Fetch and predecode: BTB0/BTB1 access, I-cache access, IFU0-IFU2, decode IDU0/IDU1, register renaming (RAT), reorder buffer write. (b) Issue: reorder buffer read, reservation station (RSU), execution and completion on Ports 0-4, reorder buffer write-back. (c) Retirement: reorder buffer read, retirement to the RRF.)

Pentium II/III summary and offsprings. The Pentium III appeared in 1999, initially at 450 MHz (0.25-micron technology); its former name was Katmai. It has two 32-kB caches and faster floating-point performance. Coppermine is a shrink of the Pentium III down to 0.18 micron.

First-level caches (Pentium 4). A 12K-µOP Execution Trace Cache (~100 K) that removes decoder latency from the main execution loops and integrates the path of program execution flow into a single line; a low-latency 8-kByte data cache with 2-cycle latency.

Second-level caches. Included on the die: 256 kB, a full-speed, unified, 8-way second-level on-die Advanced Transfer Cache with a 256-bit data bus to the level-2 cache, delivering ~45 GB/s of data throughput (at a 1.4-GHz processor frequency); bandwidth and performance increase with processor frequency.

400-MHz Intel NetBurst micro-architecture system bus. Provides 3.2 GB/s of throughput (3 times faster than the Pentium III processor). A quad-pumped, scalable 100-MHz bus clock achieves an effective speed of 400 MHz. Split-transaction and deeply pipelined, with 128-byte lines and 64-byte accesses.