csc25-chapter_05.pdf

CSC-25 High Performance Architectures Lecture Notes – Chapter V Superscalar Processors Denis Loubach [email protected] Department of Computer Systems Computer Science Division – IEC Aeronautics Institute of Technology – ITA 1st semester, 2024 Detailed Contents Advances in ILP exploration Dynamic Scheduling Problems Instruction-Level Parallelism - ILP False Dependencies Solution Pipelining Limits Tomasulo’s Algorithm How to Advance in ILP Exploration Overview The quest for CPI < 1 Reservation Station Advancing in ILP Simple Example Overview Loop-Based Example Out-of-order Execution Summary WAR and WAW Dependencies, and Out-of-order Algorithm’s Details Completion Overview References 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 2/41 Outline Advances in ILP exploration Advancing in ILP WAR and WAW Dependencies, and Out-of-order Completion Tomasulo’s Algorithm References 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 3/41 Advances in ILP exploration Instruction-Level Parallelism - ILP There are two largely distinct approaches to exploiting ILP 1. hardware-based approach to help discover and exploit the parallelism dynamically, i.e., at runtime 2. software-based approach to find parallelism statically at compile time, i.e., at design time 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 4/41 Advances in ILP exploration (cont.) Instruction-Level Parallelism - ILP Aggressive compiler-based proposals have been tried out numerous times, since the ’80s I and a famous one was the Intel Itanium series, 1999 Despite huge efforts, such approaches have been successful only in domain-specific environments 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 5/41 Advances in ILP exploration (cont.) Instruction-Level Parallelism - ILP Pipelining creates a theoretical barrier, i.e., clock cycles per instruction - CPI equal to 1 RI I1 I2 I3 DI I1 I2 I3 OO I1 I2 I3 EXE I1 I2 I3 AR I1 I2 I3 5 10 15 Without pipeline, idle time and inefficiency RI I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 DI I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 OO I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 EXE I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 AR I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 5 10 15 With pipeline, CPI limit = 1. With no branch instructions and no dependencies among operands, speedup can be equal to the pipeline depth, e.g., 5 in this case 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 6/41 Advances in ILP exploration (cont.) Pipelining Limits Finally, the CPI value for a pipelined processor is the sum of the base CPI and all contributions from stalls Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls (1) Ideal/theoretical pipeline CPI ≈ 1 I then, (real) pipeline CPI > 1 How could the processor be changed to get CPI < 1 ? Any Ideas? 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 7/41 Advances in ILP exploration (cont.) How to Advance in ILP Exploration Pipeline Superpipeline Multiple issue I superscalar I very long instruction word - VLIW 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 8/41 Advances in ILP exploration (cont.) How to Advance in ILP Exploration - Pipeline As previously seen 5-stage pipeline 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 9/41 Advances in ILP exploration (cont.) How to Advance in ILP Exploration - Superpipeline Explores the fact that various pipeline stages require less than half clock cycle to execute/complete Thus, the doubled internal clock frequency enables two (same) stages to be executed in one external clock cycle Superpipeline level 2 Two (same) pipeline stages per clock cycle Each pipe stage function can be divided into two This is named superpipeline level 2, i.e., parts with no overlap, and each part is executed in deeper the pipeline half clock cycle 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 10/41 Advances in ILP exploration (cont.) How to Advance in ILP Exploration - Superscalar Issues multiple scalar instructions per cycle Able to execute independent instructions in parallel in different pipelines I multiple functional units Superscalar, w/ functional units in parallel Out-of-order execution, to be effective Modern processors use just one pipeline, but with multiple functional units, e.g., two ALU, one LOAD, It has to handle “new” problems one STORE, one floating point - FP I structural hazards, e.g., wrt to registers Makes sense if: I data hazards, i.e., WAW and WAR InstructionIssueFREQUENCY ExecutionFREQUENCY 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 11/41 Advances in ILP exploration (cont.) How to Advance in ILP Exploration - VLIW Encodes various different instructions in one “long” word I various opcodes and their operands Packing I modern processors have an “end of group” bit I this bit identifies a group of instructions comprising a package Very long instruction word, encoding more than one instruction Packing is handled at design time, by the compiler Multiple functional units I processor fetches the package, and issues it entirely at once 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 12/41 Advances in ILP exploration (cont.) The quest for CPI < 1 I Superscalar I VLIW 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 13/41 Advances in ILP exploration (cont.) The quest for CPI < 1 Superscalar, instructions ordered/scheduled by I software, i.e., modern compiler optimizations I register optimization can increase contention by maximizing registers usage I static schedule based on knowledge of dependency-checking mechanisms I branch prediction I hardware I Tomasulo’s algorithm1 Processors examples: PowerPC, Sun SPARC, IBM Cell Broadband Engine (Playstation), Intel x86 1 Robert Tomasulo, ACM-IEEE CS Eckert-Mauchly Award, 1997. "For the ingenious Tomasulo’s algorithm, which enabled out-of-order execution processors to be implemented.” 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 14/41 Advances in ILP exploration (cont.) The quest for CPI < 1 VLIW I instructions ordered/scheduled in packages by the software, i.e., compiler I used in embedded systems, i.e., less complex hardware I named by Intel as explicitly parallel instruction computing - EPIC I example: Itanium2 2 Itanium 9700 (Kittson) 2017, the last family generation. On January 2019, Intel announced “end of life” production, w/ shipping until 2021 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 15/41 Outline Advances in ILP exploration Advancing in ILP WAR and WAW Dependencies, and Out-of-order Completion Tomasulo’s Algorithm References 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 16/41 Advancing in ILP Overview Even in RISC processors, there are instructions with (necessarily) very different execution times I integer add – ADD I floating point division – DIV.D What is the problem with that? What is the solution? 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 17/41 Advancing in ILP (cont.) Overview Allow more than one instruction to be issued, i.e., to enter the execution phase Simple reasoning about this idea I in-order issue I different execution times I then, potential out-of-order completion 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 18/41 Advancing in ILP (cont.) Out-of-order Execution Example 1 DIV. D F0 , F2 , F4 2 ADD. D F10 , F0 , F8 3 SUB. D F12 , F8 , F14 Any problem? I SUB.D is delayed, despite its independence wrt other instructions Solution I out-of-order execution, e.g., execute SUB.D before ADD.D I SUB.D does not depend on DIV.D nor on ADD.D 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 19/41 Advancing in ILP (cont.) Out-of-order Execution Example 1 DIV. D F0 , F2 , F4 2 ADD. D F10 , F0 , F8 3 SUB. D F12 , F8 , F14 Any problem? I SUB.D is delayed, despite its independence wrt other instructions Solution I out-of-order execution, e.g., execute SUB.D before ADD.D I SUB.D does not depend on DIV.D nor on ADD.D 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 19/41 Advancing in ILP (cont.) Out-of-order Execution Example 1 DIV. D F0 , F2 , F4 2 ADD. D F10 , F0 , F8 3 SUB. D F12 , F8 , F14 Any problem? I SUB.D is delayed, despite its independence wrt other instructions Solution I out-of-order execution, e.g., execute SUB.D before ADD.D I SUB.D does not depend on DIV.D nor on ADD.D 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 19/41 Outline Advances in ILP exploration Advancing in ILP WAR and WAW Dependencies, and Out-of-order Completion Tomasulo’s Algorithm References 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 20/41 WAR and WAW Dependencies, and Out-of-order Completion Overview More hardware (EX) is required to afford longer latency from FP operations 1. main integer unit I handles loads and stores, integer ALU operations, and branches 2. FP and integer multiplier 3. FP adder I handles FP add, subtract, and conversion 4. FP and integer divider Except for memory problems/stalls, slower phase will be the EX RISC-V pipeline 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 21/41 WAR and WAW Dependencies, and Out-of-order Completion (cont.) Dynamic Scheduling Problems Example (original code) Change execution order (i1, i3, i2, i4) 1 DIV. D F0 , F2 , F4 1 DIV. D F0 , F2 , F4 2 ADD. D F6 , F0 , F8 2 SUB. D F8 , F10 , F14 3 SUB. D F8 , F10 , F14 3 ADD. D F6 , F0 , F8 4 MUL. D F6 , F10 , F8 4 MUL. D F6 , F10 , F8 would decrease delays Dependencies? I F0, instructions 1 and 2 Any other problems? I F8, instructions 2 and 3 How to fix? I F8, instructions 3 and 4 I F6, instructions 2 and 4 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 22/41 WAR and WAW Dependencies, and Out-of-order Completion (cont.) Dynamic Scheduling Problems Example (original code) Change execution order (i1, i3, i2, i4) 1 DIV. D F0 , F2 , F4 1 DIV. D F0 , F2 , F4 2 ADD. D F6 , F0 , F8 2 SUB. D F8 , F10 , F14 3 SUB. D F8 , F10 , F14 3 ADD. D F6 , F0 , F8 4 MUL. D F6 , F10 , F8 4 MUL. D F6 , F10 , F8 would decrease delays Dependencies? I F0, instructions 1 and 2 Any other problems? I F8, instructions 2 and 3 How to fix? I F8, instructions 3 and 4 I F6, instructions 2 and 4 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 22/41 WAR and WAW Dependencies, and Out-of-order Completion (cont.) Dynamic Scheduling Problems Example (original code) Change execution order (i1, i3, i2, i4) 1 DIV. D F0 , F2 , F4 1 DIV. D F0 , F2 , F4 2 ADD. D F6 , F0 , F8 2 SUB. D F8 , F10 , F14 3 SUB. D F8 , F10 , F14 3 ADD. D F6 , F0 , F8 4 MUL. D F6 , F10 , F8 4 MUL. D F6 , F10 , F8 would decrease delays Dependencies? I F0, instructions 1 and 2 Any other problems? I F8, instructions 2 and 3 How to fix? I F8, instructions 3 and 4 I F6, instructions 2 and 4 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 22/41 WAR and WAW Dependencies, and Out-of-order Completion (cont.) Dynamic Scheduling Problems Example (original code) Change execution order (i1, i3, i2, i4) 1 DIV. D F0 , F2 , F4 1 DIV. D F0 , F2 , F4 2 ADD. D F6 , F0 , F8 2 SUB. D F8 , F10 , F14 3 SUB. D F8 , F10 , F14 3 ADD. D F6 , F0 , F8 4 MUL. D F6 , F10 , F8 4 MUL. D F6 , F10 , F8 would decrease delays Dependencies? I F0, instructions 1 and 2 Any other problems? I F8, instructions 2 and 3 How to fix? I F8, instructions 3 and 4 I F6, instructions 2 and 4 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 22/41 WAR and WAW Dependencies, and Out-of-order Completion (cont.) Dynamic Scheduling Problems Example (original code) Change execution order (i1, i3, i2, i4) 1 DIV. D F0 , F2 , F4 1 DIV. D F0 , F2 , F4 2 ADD. D F6 , F0 , F8 2 SUB. D F8 , F10 , F14 3 SUB. D F8 , F10 , F14 3 ADD. D F6 , F0 , F8 4 MUL. D F6 , F10 , F8 4 MUL. D F6 , F10 , F8 would decrease delays Dependencies? I F0, instructions 1 and 2 Any other problems? I F8, instructions 2 and 3 How to fix? I F8, instructions 3 and 4 I F6, instructions 2 and 4 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 22/41 WAR and WAW Dependencies, and Out-of-order Completion (cont.) Dynamic Scheduling Problems Example (original code) Change execution order (i1, i3, i2, i4) 1 DIV. D F0 , F2 , F4 1 DIV. D F0 , F2 , F4 2 ADD. D F6 , F0 , F8 2 SUB. D F8 , F10 , F14 3 SUB. D F8 , F10 , F14 3 ADD. D F6 , F0 , F8 4 MUL. D F6 , F10 , F8 4 MUL. D F6 , F10 , F8 would decrease delays Dependencies? I F0, instructions 1 and 2 Any other problems? I F8, instructions 2 and 3 How to fix? I F8, instructions 3 and 4 I F6, instructions 2 and 4 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 22/41 WAR and WAW Dependencies, and Out-of-order Completion (cont.) Dynamic Scheduling Problems Example (original code) Change execution order (i1, i3, i2, i4) 1 DIV. D F0 , F2 , F4 1 DIV. D F0 , F2 , F4 2 ADD. D F6 , F0 , F8 2 SUB. D F8 , F10 , F14 3 SUB. D F8 , F10 , F14 3 ADD. D F6 , F0 , F8 4 MUL. D F6 , F10 , F8 4 MUL. D F6 , F10 , F8 would decrease delays Dependencies? I F0, instructions 1 and 2 Any other problems? I F8, instructions 2 and 3 How to fix? I F8, instructions 3 and 4 I F6, instructions 2 and 4 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 22/41 WAR and WAW Dependencies, and Out-of-order Completion (cont.) False Dependencies Solution Basic premise Allow for out-of-order execution, as long as the very same result of in-order execution is preserved Technique used – registers renaming I to avoid false WAR and WAW dependencies 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 23/41 WAR and WAW Dependencies, and Out-of-order Completion (cont.) False Dependencies Solution WAW example Original code Modified code 1 MUL. D F1 , F2 , F3 1 MUL. D F1 , F2 , F3 2 ADD. D F2 , F1 , F3 2 ADD. D F2 , F1 , F3 3 SUB. D F1 , F4 , F5 3 SUB. D F5 , F4 , F5 Replace F1 by F5 in instruction 3, i.e., another register not used/read until its next writing point 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 24/41 WAR and WAW Dependencies, and Out-of-order Completion (cont.) False Dependencies Solution WAR example Original code Modified code 1 MUL. D F1 , F2 , F3 1 MUL. D F1 , F2 , F3 2 ADD. D F6 , F1 , F2 2 ADD. D F6 , F1 , F2 3 SUB. D F3 , F4 , F5 3 SUB. D F8 , F4 , F5 4... 4... 5 ADD. D F9 , F4 , F3 5 ADD. D F9 , F4 , F8 Replace F3 by F8 in instructions 3 and 5, i.e., another register not used/read until its next writing point 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 25/41 WAR and WAW Dependencies, and Out-of-order Completion (cont.) False Dependencies Solution Solution’s basic requirements I identify instructions with no dependencies and allow them to pass in front of instructions with dependencies I identify and block instructions with data or structural dependencies I keep pipeline as efficient (busy) as possible 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 26/41 WAR and WAW Dependencies, and Out-of-order Completion (cont.) False Dependencies Solution Solution methods Software I with large sets of registers, the compiler can eliminate WAR and WAW hazards by renaming registers I eventually, it can use “MOV”s between registers I errors (or omissions) in the compiler can lead to poor processor efficiency Hardware I it cannot observe instructions ahead, but it can eliminate dependencies on registers I scoreboard I technique that allows instructions to execute out-of-order when there are enough resources and no data dependence, avoids RAW hazards I Tomasulo’s algorithm 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 27/41 Outline Advances in ILP exploration Advancing in ILP WAR and WAW Dependencies, and Out-of-order Completion Tomasulo’s Algorithm References 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 28/41 Tomasulo’s Algorithm (cont.) Overview Steps 1. issue/(dispatch) 2. execute/(issue) 3. write result 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 29/41 Tomasulo’s Algorithm (cont.) Overview Although variations exist, two key principles remains 1. dynamically determining when an instruction is ready to execute I minimize RAW hazards 2. renaming registers to avoid unnecessary hazards I minimize WAR and WAW hazards 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 30/41 Tomasulo’s Algorithm (cont.) Reservation Station Reservation station - RS seven fields I Op – operation to perform on source operands S1 and S2 I Qj, Qk – reservation stations that will produce the corresponding source operand I a value of zero indicates that the source operand is already available in Vj or Vk, or is unnecessary I Vj, Vk – value of the source operands I only one of the V fields or the Q fields is valid for each operand I for loads, Vk field is used to hold the offset field I A – information for the memory address calculation for a load or store I initially, the immediate field of the instruction is stored here I after the address calculation, the effective address is stored here I Busy – reservation station and its accompanying functional unit are occupied 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 31/41 Tomasulo’s Algorithm (cont.) Reservation Station Register file field I Qi – number of the reservation station that contains the operation whose result should be stored into this register I if the value of Qi is blank/0, no currently active instruction is computing a result destined for this register I i.e., the value is simply the register contents 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 32/41 Tomasulo’s Algorithm (cont.) Simple Example Let’s consider the following code 1 fld f6 ,32( x2 ) 2 fld f2 ,44( x3 ) 3 fmul. d f0 , f2 , f4 4 fsub. d f8 , f2 , f6 5 fdiv. d f0 , f0 , f6 6 fadd. d f6 , f8 , f2 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 33/41 Tomasulo’s Algorithm (cont.) Simple Example - All instructions issued, but just first load completed/write result in CDB 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 34/41 Tomasulo’s Algorithm (cont.) Simple Example - All instructions finished, but multiply and divide; multiply is ready to write its result 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 35/41 Tomasulo’s Algorithm (cont.) Loop-Based Example Multiplying the elements of an array by a scalar in f2 1 Loop : fld f0 ,0( x1 ) 2 fmul. d f4 , f0 , f2 3 fsd f4 ,0( x1 ) 4 addi x1 , x1 , -8 // x1 < - x1 -8 5 bne x1 , x2 , Loop // branches if x1 != x2 Eliminating WAW and WAR hazards through dynamic renaming of registers Let’s consider, i.e., predict, the branches are taken I means multiple loop execution/iterations at once, thanks to the reservation stations I loop is unrolled dynamically by the hw I reservation stations and registers renaming acts as additional registers 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 36/41 Tomasulo’s Algorithm (cont.) Loop-Based Example - Two active loop iterations with no instruction completed yet 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 37/41 Tomasulo’s Algorithm (cont.) Summary Allows out-of-order completion, but with in-order issue I allows to reduce delays caused by differences in execution times among instructions Eliminates WAR and WAW hazards by renaming registers I tanks to reservation stations Blocks instructions due to RAW hazards Allows loop unrolling, even without speculative execution 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 38/41 Tomasulo’s Algorithm (cont.) Algorithm’s Details 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 39/41 Outline Advances in ILP exploration Advancing in ILP WAR and WAW Dependencies, and Out-of-order Completion Tomasulo’s Algorithm References 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 40/41 Information to the reader Lecture notes mainly based on the following references Castro, Paulo André. Notas de Aula da disciplina CES-25 Arquiteturas para Alto Desempenho. ITA. 2018. Hennessy, J. L. and D. A. Patterson. Arquitetura de Computadores: Uma Abordagem Quantitativa. 5a. Campus, 2014. –.Computer Architecture: A Quantitative Approach. 6th. Morgan Kaufmann, 2017. Stallings, W. Arquitetura e Organização de Computadores. 10a. Pearson, 2017. Tanenbaum, A. S. Organização Estruturada de Computadores. 6a. Pearson, 2015. 1st semester, 2024 Loubach CSC-25 High Performance Architectures ITA 41/41

csc25-chapter_05.pdf

Document Details

Tags

Related

Full Transcript