Advanced Computer Architecture Lecture Notes PDF
Document Details
Technische Universität München
Michael Gerndt
Summary
These lecture notes detail advanced computer architecture, covering current processor designs, system architecture, and parallel computing. The document also outlines a suggested learning methodology, course structure, and resources. Topics include instruction set architecture, pipelining, caches, memory technologies, and diverse processor architectures like VLIW, shared memory, and distributed memory systems. The document also notes the availability of lecture recordings and materials on a Moodle platform.
Full Transcript
Advanced Computer Architecture. Michael Gerndt, Technische Universität München 1
Michael Gerndt CV: Diploma and Ph.D. in CS at Bonn (1989); 1990/1991 Postdoc in Vienna; 1992-2000 researcher at Forschungszentrum Jülich; since 2000 Professor for Parallel Computer Architecture at TUM. Previous research: Automatic Parallelization for Distributed Memory Systems (SUPERB), High Performance Fortran, Shared Virtual Memory (SVM-FORTRAN), Periscope: Automatic Performance Analysis and Tuning for HPC Systems, Invasive Computing. Current research: Cloud Computing: Function-as-a-Service and the Function Delivery Network; Internet of Things - embedded processors; Serverless IoT Framework. Contact: [email protected], Office: 01.04.058 2
Chair for Computer Architecture and Parallel Systems (CAPS), I10: Prof. Dr. Martin Schulz, Prof. Dr. Michael Gerndt, Prof. Carsten Trinitis. Bachelor course Einführung in die Rechnerarchitektur, bachelor lab course Einführung in die Rechnerarchitektur, Parallel Programming for Computational Science and Engineering, and courses on parallel programming, parallel computer architecture, and cloud computing in the Master Informatics. My teaching: Advanced Computer Architecture (WS, 4+0, 6 credits), Cloud Computing (SS, 2+1, 4 credits), Efficient Programming of Multicore Processors and Supercomputers (lab for the CS and CSE master programs, SS), IoT Sensor Node bachelor (SS) and master lab (WS), Seminar on Cloud Computing (WS), Seminar Existenzgründung (SS) 3
Lecture Organization. Time and place: Tuesday, 12:30 - 14:00, HS2; Friday, 12:15 - 13:45, HS2. 6 ECTS, 4 SWS lecture only. Resources: the script is provided as PDF and will be updated during the lecture; the lecture will be recorded (no guarantee); all materials will be available in Moodle. Student presentations: topics will be offered in Moodle and you can suggest some; 10-15 minute presentations; at most one presentation per lecture; gives a bonus (0.3); first presentation next Friday, October 25th 4
Lecture Organization. Course quizzes: quiz during the lecture with a small number of multiple-choice questions about the last lecture. Goal: motivate you to attend the lecture and review the last lecture. 60% of the points have to be reached; similar number of points for each quiz; gives a bonus (0.3), boni can be combined; quizzes will start Tuesday, October 29th. Exam: 90 minutes, Moodle based, on-site, screen recording, multiple choice and free text; planned for 11.02.24, 16:00-17:30. The repetition exam will take place at the end of the summer semester; boni do not apply to the repetition exams 5
Goals of the Lecture. Students... know the architecture of current processors as well as of entire IT systems... can evaluate and assess different designs... understand the interaction of architecture, compiler technology and applications... know the different classes of parallel architectures and can describe their implementation concepts... should be able to start research work in that area... should be able to judge, categorize, and rank new information. 6
Application of the Knowledge. A foundation for other lectures, e.g., OS, distributed systems, computer graphics, microprocessors, parallel programming, cloud computing, IoT. Directly applicable in industry as a hardware, OS, or compiler developer. Also for developers and programmers of embedded systems.
Applicable in the context of selection and configuration of computer systems as well as in areas where the efficiency of programs is important: realtime systems, servers, games and graphics, IoT 7
Computer Architecture Definition. In computer science and engineering, computer architecture is the practical art of selecting and interconnecting hardware components to create computers that meet functional, performance and cost goals; and the formal modeling of those systems. Three main subcategories of computer architecture: Instruction Set Architecture (ISA): the machine language including the instruction set, word size, memory address modes, processor registers, and address and data formats. Microarchitecture: the implementation of the ISA (within a processor); there might be different implementations of an ISA (binary compatibility). System architecture: the implementation of an entire computer system by assembling interacting hardware components. 8
Contents: Introduction, Cross-cutting aspects, Instruction set architecture, Instruction pipelining, Caches, Memory technologies, VLIW processors, Data parallel architectures, Shared memory systems, Distributed memory systems 9
How to learn: Go over the material at home and identify key terms and concepts. Think about open questions, try to look them up on the internet; use ChatGPT and Wikipedia. Terms pop up several times in the lecture, make connections between these slides. Learn terms and concepts; I recommend flash cards. On Apple you may want to use MarginNote 4 10
[Figure: Edge-Cloud Continuum - Cloud Layer (data centers), Privacy Barrier, Edge Layer (edge server clusters), Device Layer (devices with sensors, actuators, accelerator, ULP core, core)]
[Figure: Serverless IoT Framework - triggers, functions, data across the Cloud Layer (cloud data centers), Edge Layer (edge server clusters), and Device Layer (devices)]
[Figure: Senior Homecare - costs of a nursing home, private nurse: 2000 €]
Intel Knights Landing. MIC (Many Integrated Cores) architecture targeted at highly parallel, high-throughput workloads: massive thread parallelism, massive data parallelism, high memory bandwidth, compatible with the x86 ISA. Intel Xeon Phi: Knights Landing, second generation (June 2016), standard processor/coprocessor, ~3 TFLOPS (DP). Architecture: grid of 54 physical blocks, 2D mesh interconnect, 38 tiles (2 cores, 1 MB shared L2, VPUs, CHA, ...) of which up to 36 are active, 8 MCDRAM controllers, 2 DDR4 memory controllers, I/O controllers (PCIe, DMI), power management and miscellaneous functions 15
Intel Sapphire Rapids. Processor used in Aurora (Argonne) and SuperMUC-NG Phase 2. Launched in January 2023. Golden Cove core architecture. Marketed as Intel® Xeon® CPU Max Series 16
Sapphire Rapids 17
EMIB: Embedded Multi-die Interconnect Bridge 18
Intel AMX: Advanced Matrix Extensions - supports operations like TMUL for the multiplication of 2D dense matrices; the hardware works on a set of special registers called Tile Registers. IAA: In-Memory Analytics Accelerator - accelerates compression and decompression for loading data of large databases into memory and accelerates search operations in in-memory databases. DSA: Data Streaming Accelerator - data copying and gather/scatter in memory. QAT: QuickAssist Technology - accelerates compression and decompression as well as encryption for network communication 19 https://youtu.be/TQzR_RMgx9Y
Intel Sapphire Rapids. Intel(R) Xeon(R) Platinum 9480 specifications; Intel(R) Xeon(R) Platinum 8480L (used in SuperMUC-NG Phase 2) specifications 20
Books. Computer Architecture - A Quantitative Approach.
John Hennessy, David Patterson Structured Computer Organization Andrew Tanenbaum Parallel Computer Architecture A Hardware / Software Approach David E. Culler, Jasweinder Pal Singh, Anoop Gupta Morgan Kaufmann, 1999, ISBN 1-55860-343-3 21 Books Computer Organization and Design: RISC-V Edition Patterson & Hennessy Morgan Kaufmann / Elsevier 2021 Processor Microarchitecture An Implementation Perspective Antonio Gonzáles, Fernando Latorre, Grigorios Magklis Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers 2011 22... and the Web Wikipedia ChatGPT 23 Computer History 24 Computer System as Multilevel Machine 25 Evolution of Multilevel Machines Clear distinction between hardware and software Hardware implements machine instructions Software uses those machine instructions for programming The invention of microprogramming 1951 by Maurice Wilkes, University of Cambridge Simpli ed hardware, reduced tube count and enhanced reliability. By 1970 using a microprogram was dominant Used in Intel Xeon … In the early days, people: reserved the computer for a block of time brought in their FORTRAN program as a desk of punched cards took the FORTRAN compiler out of a cabinet inserted rst the compiler and then the program inserted the program again for a second pass of the compiler the compiler punched out the machine program. put in the machine program … This procedure was normal in many compute centers for years. 26 fi fi Evolution of Multilevel Machines In 1960 the operator's job was automated by the invention of the operating system Operating system was kept in memory It read and executed jobs consisting of the Fortran code and the data cards for execution. Interspersed with OS commands like compile and execute One of the rst widespread operating systems was FMS (FORTRAN Monitor System) Additions, e.g., IO, were called operating system macros or supervisor calls (today system calls) 1960s the timesharing system was developed at MIT Multiple users could access the machine simultaneously from remote terminals (1960s-1970s) 27 fi Computer Generations Zeroth Generation -1945: Mechanical Computers First Generation - 1955: Vacuum Tubes Second Generation - 1965: Transistors Third Generation – 1980: Integrated Circuits Fourth Generation - …: Very Large Scale Integration Fifth Generation: Ubiquitous computing 28 Zeroth Generation -1945: Mechanical Computers Wilhelm Schickard 1620 astronomer who developed his own tools Designed machine for addition, subtraction Probably never nished 29 fi Zeroth Generation -1945: Mechanical Computers Blaise Pascal built 1642 when he was 19 years old a computer to help his father in tax collection Addition only About 50 versions were built. Pascaline 1652 30 Zeroth Generation -1945: Mechanical Computers Gottfried Leibniz 1670 Build a mechanical machine that could also multiply and divide. He developed the binary number system. 31 Study Software MarginNote 4 (Mac only) REMnote (remnote.com) all platforms 32 Zeroth Generation -1945: Mechanical Computers Charles Babbage (1792-1871) Designed in 1822 the di erence engine to compute polynomials for navy navigation. Di erence Engine 2 (1847-1849) had an output unit where results were engraved in a copper plate (write-once medium) Film Computer History Museum in Palo Alto 33 ff ff Charles Babbage Analytical Engine Sydney Padua Charles Babbage started the design of the analytical machine in 1834 consisting of store (for variables), the mill (processing unit), input and output section. 
It should execute di erent programs on punched cards instead of only one (di erence engine). (http://www.fourmilab.ch/babbage/) Film at youtube Charles Babbage's Analytical Engine The programs were developed by Ada Augusta Lovelace. 34 ff ff Zeroth Generation -1945: Mechanical Computers Hollerith (1890) developed a calculator based on punched cards. It was used in the US census in 1890. He sold his patents to an investor and nally this lead to the company IBM. 35 fi Zeroth Generation -1945: Mechanical Computers Konrad Zuse (late 1930) 1938: Z1 mechanical engine, binary system, oating point arithmetic, programs on punched tape 1941: Z3 machine based on electromagnetic relays, binary arithmetic, program on punched tape. It had registers, CPU with oating point arithmetic, and IO unit. Work lost in the war. 36 fl fl Zeroth Generation -1945: Mechanical Computers Howard Aiken (1944) built a relay-based machine (Mark 1) with 72 words of 23 decimal digits, and an instruction time of 6 sec. Programmable via punched tapes. His work was based on Babbage's work he found in the library. 37 First Generation - 1955: Vacuum Tubes British intelligence (1944) built COLOSSUS the world rst electronic digital computer Designed to decode German messages encrypted with the Lorenz Machine (more advanced than the Enigma) Alan Turing developed the mathematical foundations Special purpose computer John Mauchley and Presper Eckert (1946) built ENIAC (Electronic Integrator and Computer) 18.000 tubes, 1.500 relays. It weighed 30 tons and consumed 140 kilowatt. Decimal arithmetic, each digit represented by 10 tubes No oating point computation Programmed via physical links Programmed by a group of young women, that worked before as "computers" performing the computations manually. Grant by the army. 38 fl fi First Generation - 1955: Vacuum Tubes John von Neumann designed the von Neumann Architecture in 1945 EDVAC (Electronic Discrete Variable Automatic Computer) a successor of ENIAC also developed by Mauchley and Eckert Maurice Wilkes (1946) built EDSAC (Electronic Delay Storage Automatic Calculator) using binary arithmetic in the UK. First stored program computer Data and instructions are stored as sound waves in a Mecury Delay Line mercury delay lines were used as primary memory, each delay line typically measuring about 5 feet (1.5 meters) in length. This length allowed each delay line to store around 32 words of data (18 bits per word), with a total storage capacity of 512 words initially, later expanded to 1024 words. 39 First Generation - 1955: Vacuum Tubes John von Neumann built 1952 the IAS machine at Princeton Program and data were in the same memory. It used binary arithmetic with parallel addition. 40 Second Generation - 1965: Transistors Transistor was invented at Bell Labs in 1948 Within 10 years vacuum tube computers were obsolete. The rst minicomputer PDP1 was built by DEC in 1961. It cost $120.000 compared to millions for the twice as fast transistorized computer of IBM, the IBM 7090. CDC introduced CDC 6600 It was 10 times faster than IBM 7090. The CPU had multiple functional units that could run in parallel. It also had smaller computers inside to do IO etc. 
The designer was Seymour Cray 41 fi Third Generation – 1980: Integrated Circuits Integrated circuits were co-invented by Jack Kilby and Robert Noyce in 1958 Jack Kilby demonstrated it half a year before Noyce on September 12th @ Texas Instruments Got Nobel Prize in physics in 2000 Germanium based Demonstrated continuous sine wave Robert Noyce @ Fairchild Semiconductors (Co-Founder) Silicon based 1968 he and Gordon Moore founded Intel 42 Third Generation – 1980: Integrated Circuits IBM build the System/360 (1964) It was the rst machine with multiprogramming for better CPU utilization. It could emulate other machines, e.g., the IBM 7094, via special microcode. It had a huge address space of 224 bytes (16 MB) which was su cient until mid 1980s Remote access through terminals Key features of lasting impact: 8 bit byte byte addressable memory 32 bit word two’s complement EBCDIC character set Story behind the development of the 360 system Main frames are still in use! 43 fi ffi Third Generation – 1980: Integrated Circuits DEC developed the PDP-11 Xerox designed the Alto I 1973 Palo Alto Research Center First computer with a graphical interface. Birth of the mouse WYSIWYG printing cut-and-paste Ethernet 44 Fourth Generation - …: Very Large Scale Integration Personal computer started due to price drop Apple, Commodore, Atari IBM PC in 1981 based on Intel CPU In 1984 the Apple Macintosh rst personal computer with a Graphical User Interface. Mid 1980s RISC was born 1992 DEC produced the rst 64-bit RISC processor, the Alpha processor. 45 fi fi Fifth Generation: Ubiquitous Computing In 1993 the Apple Newton was the rst PDA Small embedded processors are changing the world Going towards ubiquitous computing or pervasive computing Apple Newton Grid Systems: Tablet IBM: Simon 46 fi Milestones Year Name Developer Remarks 1834 Analytical Babbage First rst attempt to build a digital computer Engine 1941 Z3 Zuse First working relay-based machine 1943 COLOSSUS Britische First electronic computer Regierung 1944 Mark I Aiken First American general-purpose computer 1946 ENIAC I Eckert/ Modern computer history starts here Mauchley 1949 EDSAC Wilkes First stored program computer 1951 Whirlwind I M.I.T. First real-time computer 1952 IAS Von Most current machines use this design Neumann 1960 PDP-1 DEC First minicomputer (50 sold) 1961 1401 IBM Popular small business machine 1962 7094 IBM Dominated scienti c computing in the early 1960s. 
1963 B5000 Burroughs First machine designed for a high-level language 1964 360 IBM First product line designed as a family 47 fi fi Milestones 1965 PDP‑8 DEC First mass-market minicomputer (50.000 sold) 1970 PDP‑11 DEC Dominating minicomputer in the 1970s 1974 8080 Intel First general-purpose 8-bit computer on a chip 1974 CRAY-1 Cray First vector supercomputer 1978 VAX DEC First 32-bit superminicomputer 1981 IBM PC IBM Started the modern personal computer era 1981 Osborne-1 Osborne First portable computer 1983 Lisa Apple First personal computer with a GUI 1985 386 Intel First 32-bit ancestor of the Pentium line 1985 MIPS MIPS First commercial RISC machine 1987 SPARC Sun First SPARC-based RISC workstation 1989 GridPad Grid First commercial tablet computer Systems 1990 RS6000 IBM First superscalar machine 1992 Alpha DEC First 64-bit personal computer 48 Milestones 1992 Simon IBM First smartphone 1993 Newton Apple First palmtop computer 2001 POWER4 IBM First dual-core chip multiprocessor 2004 ARMv7 arm 32 bit arm architecture 2006 Tesla / Nvidia Nvidia introduced the tesla architecture and CUDA CUDA programming API (2007) 2007 iphone Apple First commercial smartphone 2008 Android Google First version of the free OS 2010 iPad Apple First successful tablet computer 2011 ARMv8 arm 64 bit arm architecture 49 Von-Neumann Architecture (1946) Computer consists of 4 units Memory (programs and data) Control unit (interprets the program) Arithmetic unit I/O unit Aspects The structure is independent of the problem (programmable). Program and data are in the same memory. The memory is structured into cells of a xed length. The program consists of instructions that are executed sequentially. There can be (conditional) jumps. The machine is based on binary representation. Original article by John von Neumann: First draft of a report on the EDVAC, June 30th 1945 50 fi Predecessors of Von-Neumann-Computer 51 Historical Predecessors 52 Resources Computer Exhibit in Deutsches Museum Computer History Museum in Palo Alto Silicon Valley Alto I Computer System, Xerox (PARC), US, 1973 Collection of Computer Museums in Germany Timeline of computers 53 Cray 1 - 1976 54 Cray 2 - 1985 55 Cray 2 - 1985 56 Process Technology Moore‘s Law Stated in 1965 The density of transistors doubles every 18 months. Various versions of Moore‘s law exist: (12/18/24) months. https://en.wikipedia.org/wiki/Transistor_count 59 What is 10 nm? Wikipedia Fin of a FinFET of Intel‘s 10nm process. Wikipedia Multi-FIN-Multi-Gate https://www.techdesignforums.com/blog/2012/10/11/intel-tsmc- nfet-iedm/ 61 fi Process technologies Processor Feature size Year Nahalem 45 nm 2008 Sandy Bridge 32 nm 2011 Haswell 22 nm 2013 Skylake 14 nm 2015 Kabylake 14 nm 2016 Cannonlake 10 nm 2018 Cascade Lake 14 nm 2019 Ice Lake 10 nm 2020 Further Shrinking and Leakage Control anandtech.com Gate All Around FET (GAAFET) Multi Bridge Channel FET (MBCFET) 63 Further Shrinking and Leakage Control 2020: Samsung (MBCFET) and TSMC (FINFET) introduced the 5nm process used in Apple M1 May 2021: IBM announces 2 nm (MBCFET) 64 VLSI and Beyond Large dies Nividia GH100 GPU Hopper has 814 mm² with 80 billion transistors fabricated in TSMC 4 nm process 2.5D Integration Chiplets Combines dies with a silicon interposer AMD and Nvidia, for example, use it. 
Apple M1 Ultra: only a narrow interposer slice connecting the two M1 dies. 3D packaging: entire dies are stacked and communicate via off-chip connections. 3D stacking: connect dies with through-silicon vias. Monolithic 3D ICs: use multiple layers of silicon on top of a single substrate; no commercial ICs available 65
Semiconductor Producers: TSMC Taiwan, Samsung Korea, Intel USA, Global Foundries USA, SMIC China 66
Cross-Cutting Aspects 67
Performance 68
Performance. It is not about "What can the computer do?" but "How well does it do it?" (non-functional property). Aspects of performance: speed, precision, memory, storage, graphics, communication, reliability, availability... 69
Performance Metrics. Speed: latency (seconds), throughput (#ops/second), MFLOPS (million floating point operations per second). Memory: size (GB), access latency (#cycles), bandwidth (GB/s). Disk: size (TB), access latency (msec), bandwidth (Gbit/s), I/O ops/second... 70
Performance. Evaluation of performance is used to select a computer system, to optimize its configuration, in designing computer systems, and to improve individual programs. Techniques for performance evaluation: simple hardware characteristics, benchmarking, measuring individual applications, analytical modeling, simulation 71
How to measure time? Wall-clock time, response time, or elapsed time: latency of the execution, includes wait time for disk access etc. CPU time: time where the CPU is active; user time: CPU is executing the program; system time: CPU is executing OS services for the program. Unix time command 72
performance = 1 / execution time 73
How to search for Performance Bottlenecks? Clock cycle: single cycle of the hardware clock. Clock period or clock cycle time: length of a clock cycle. Clock rate: number of cycles per second. exec time = #cycles × clock period = #cycles / clock rate. Cycles per instruction (CPI): average #cycles per instruction of a program. exec time = (#instructions × CPI) / clock rate. CPI vs #cycles for certain instruction classes: e.g., a multiplication might take 2 cycles while an addition takes only 1 cycle; compiler optimizations might switch to faster instructions 74
What influences the basic performance metrics? exec time = (#instructions × CPI) / clock rate
Algorithm - affects #instructions and CPI: The algorithm determines the number of source program instructions executed and hence the number of processor instructions executed. The algorithm may also affect the CPI, by favoring slower or faster instructions. For example, if the algorithm uses more divides, it will tend to have a higher CPI.
Programming Language - affects #instructions and CPI: The programming language certainly affects the instruction count, since statements in the language are translated to processor instructions, which determine the instruction count. The language may also affect the CPI because of its features; for example, a language with heavy support for data abstraction (e.g., Java) will require indirect calls, which will use higher-CPI instructions.
Compiler - affects #instructions and CPI: The efficiency of the compiler affects both the instruction count and the average cycles per instruction, since the compiler determines the translation of the source language instructions into computer instructions. The compiler's role can be very complex and affect the CPI in varied ways.
Instruction Set Architecture - affects #instructions, CPI, and clock rate: The instruction set architecture affects all three aspects of CPU performance, since it affects the instructions needed for a function, the cost in cycles of each instruction, and the overall clock rate of the processor.
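To make the CPU-time relation above concrete, here is a minimal sketch (not from the lecture; the instruction count, CPI, and clock rate are invented numbers) that evaluates exec time = (#instructions × CPI) / clock rate and performance = 1 / exec time:

```c
#include <stdio.h>

/* Worked example of the formulas above:
 *   exec_time   = (#instructions * CPI) / clock_rate
 *   performance = 1 / exec_time
 * All numbers are made up for illustration only. */
int main(void) {
    double instructions = 2.0e9;   /* dynamic instruction count (assumed) */
    double cpi          = 1.5;     /* average cycles per instruction (assumed) */
    double clock_rate   = 3.0e9;   /* 3 GHz */

    double cycles    = instructions * cpi;
    double exec_time = cycles / clock_rate;   /* seconds */
    double perf      = 1.0 / exec_time;       /* runs per second */

    printf("cycles      = %.2e\n", cycles);
    printf("exec time   = %.3f s\n", exec_time);
    printf("performance = %.3f 1/s\n", perf);
    return 0;
}
```

With these assumed numbers the program needs 3×10⁹ cycles, i.e. 1 s at 3 GHz; halving the CPI or the instruction count halves the execution time, which is exactly what the table above attributes to algorithm, language, compiler, and ISA.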
Book: Computer Organization and Design 75
Power Wall 76
Power of processor generations 77
Dynamic Energy. The power used by a processor determines the cooling requirements; energy is what counts. Mobile devices: battery lifetime. Servers: electricity bill. Dynamic energy in CMOS technology is the energy used to switch a transistor: energy ∝ capacitive load × voltage². The capacitive load depends on the capacitance of the number of transistors connected to an output and the capacitance of the wires, and depends on the technology. power ∝ capacitive load × voltage² × frequency switched. Although the frequency went up as well as the number of transistors, the power did not increase proportionally. This is because the voltage could be reduced with each new technology (Dennard scaling: power density stays constant). Further reduction of voltage will no longer follow this rule. Reduction of the clock frequency allows a reduced voltage: power reduces with the voltage difference squared and the reduced frequency. 78
Static Energy. Even if transistors do not switch there is leakage of current: power_static ∝ current_static × voltage. Further reduction of voltage increases leakage. Even today, 40% of power is due to leakage. 79
Counter Measures. Lower clock rates. Dynamic Voltage and Frequency Scaling (DVFS): the processor switches clock rate and voltage according to program characteristics. Switch to multiple cores with a lower clock rate. Clock gating: switch the clock off for certain components. Power gating: switch power off entirely. Dark silicon: due to power limitations not all transistors can be used at a point in time; therefore, use those transistors for specialized units 80
Parallelism 81
Parallelism. What is a parallel computer? Task system: a set of tasks (certain operations with input and output) with a dependence relation solving a given problem. Concurrent tasks: two tasks are called concurrent if they are not dependent on each other. Simultaneous or parallel tasks: two tasks are executed in parallel if at some point in time both tasks have been started and not terminated. Parallel computer: a computer executing concurrent tasks in parallel. 82
Parallelism in the hardware: bit-level arithmetic, instruction pipelining, multiple memory banks, RAID systems, multicore processors, multicomputers, network interface... 83
Parallel Metrics. Speedup. Scientific computing: performance = work/time. Strong vs weak scaling. Efficiency. Speedup in terms of throughput: performance = throughput = transactions/minute 84
Hardware/Software Interface. Parallelism in processors is a way to speed up considering the power wall. It is not for free. 85
Memory Hierarchy 86
Memory Hierarchy. Memory has some technological and economic limitations: large memories are slow and cheap; fast memories are expensive and thus small. Therefore a hierarchy of memories is used in current systems 87
Memory Hierarchy. Why does it work? Programs exhibit temporal and spatial locality. Temporal locality: addresses recently accessed will be accessed again in the near future with high probability. Spatial locality: addresses near to recently accessed addresses will be accessed in the near future with high probability. Example: an archival system - do we have temporal and spatial locality? 88
Announcements. Lecture on Friday: Hörsaal 1A, „Zelt". Study software: REMnote (remnote.com), all platforms 89
Fault Tolerance, Reliability, Availability 90
Fault Tolerance, Reliability, Availability. Based on Encyclopedia of Parallel Computing, Springer: entry on Fault Tolerance by H.
Zima, A. Nikora De nition Fault Defect of a system that may cause an error. Faults can be dormant in a system like incorrect code. Internal vs external fault Internal: fault of a component External: fault is a propagated failure of another component or from outside the system, e.g., cosmic particles. Fault categories Physical Malfunctioning of hardware Can be permanent or transient. Design May originate in hardware or software. e.g. illegal condition in a while loop. Interaction Occur during operation and are caused by the environment. e.g. illegal input of system operator or radiation hitting the system. 91 fi Fault Tolerance, Reliability, Availability De nition Error Illegal system state Errors can be propagated through a system generating other errors. e.g. a wrong assignment to a variable may lead to wrong iteration space in a subsequent loop 92 fi Fault Tolerance, Reliability, Availability De nition Failure Occurs when an error reaches the service interface of a system, resulting in system behavior that is inconsistent with its speci cation. Classi cation of failures Many non-orthogonal categories Domain: value or timing Capability for control, signalling: signalled or unsignalled failures Consequences: minor to catastrophic … 93 fi fi fi Fault Tolerance, Reliability, Availability Execution of a system is modeled by a sequence of states Correct states Error states Tolerated error states Failure states State transitions being caused as the result of atomic actions. Correct actions Fault actions Recovery actions 94 Fault Tolerance, Reliability, Availability Fault tolerant system It never enters a failure state. Errors may occur but never reach the service boundary. It works according to its speci cation (maybe slower or with restricted functionality) in spite of the presence of faults. Fault tolerance needs redundancy (functional or temporal) Implementation of fault tolerance implies three steps Error detection Error analysis Recovery Fail-controlled system Allow recovery from a failure using special protocols. e.g. it goes into a state from which corrective actions can be taken by an external controller. 95 fi Fault Tolerance, Reliability, Availability Fault prevention Methods preventing faults to being incorporated into a system. e.g. coding standards, rewalls, shielding against radiation Fault removal Techniques that eliminate faults during the design and development process. Validation („Are we building the right product?“) Validation checks the speci cation in order to determine if it correctly expresses the needs of the system's users. Veri cation („Are we building the product right?“) Veri cation proves that the system complies with the speci cation and thus can help to detect faults. Debugging Triggered by errors or failures Identify and remove faults. 96 fi fi fi fi fi Fault Tolerance, Reliability, Availability Reliability Probability that a system is free of failures up to a certain time. Safety Probability that the system is operating correctly or will fail in a safe manner at a certain time. e.g. tra c lights 97 ffi Fault Tolerance, Reliability, Availability Availability Percentage of the planned operation time, in which the system is available. What is 99.99% availability? http://vikashazrati.wordpress.com/2008/10/24/truth-about-availabilit/ 98 Virtualization 99 Virtualization Virtualization, in computing, is the creation of a virtual (rather than actual) version of something, such as a hardware platform, operating system, a storage device or network resources. 
(Wikipedia) 100 Virtualization Examples Virtual machine Virtual desktop Virtual memory Virtual CD/DVD drive Virtual private network Virtual machine Hides the physical characteristics of a computer platform and provides a Virtual Computer. Implementation is based on a virtual machine monitor (hypervisor). The hypervisor organizes the operation of several well isolated virtual machines on a single physical computer. The VMs are booted with their own operating system. The hypervisor controls the distribution of real resources among the VMs 101 Virtualization VM1 VMn Guests with (may) different OS non-priviledged Application Application OS OS Memory Management IO Scheduler priviledged CPU Scheduler Driver Hypervisor Physical Hardware 102 Virtualization Levels of virtualization Bare-metal Hypervisor runs on top of the hardware Hosted Hypervisor runs inside a hosting OS e.g. Virtual Box, Parallels OS-level Containerization Containers share the kernel of the hosting OS 103 Virtualization Virtualization techniques Issue: handling privileged instructions Full virtualization: OS is unchanged, binary rewriting Paravirtualization: OS needs to be changed. Hardware assisted virtualization Hardware-assisted virtualization Hardware supports an orthogonal mode of CPU operation: VM mode and hypervisor mode. Instructions, events etc. can be reserved for hypervisor mode, occurrence in VM mode leads to an implicit switch to hypervisor mode. Applications run in ring 3 in VM mode, Guest OS in ring 0 in VM mode, hypervisor in ring 0 in hypervisor mode. Example: IA-32 VT-x, AMD Paci ca Advantage: VM can run any OS, reduced overhead Disadvantage: old processors have no support 104 fi Virtualization Xen is an Open Source virtualization solution x86, x86-64, Itanium, ARM, and PowerPC Developed by University of Cambridge (Ian Pratt) Paravirtualization and, since Version 3.0, support for VT-x. Amazon is biggest Xen installation 105 Processor Classes 106 General Purpose Processors General purpose processors Used in desktops, laptops and servers Execution of any program; not restricted or specialized to certain application classes 107 Network Processors Network processors Designed for packet processing in communication networks, both in access hardware and switches/routers Replacing ASICs (Application-speci c circuit) or FPGAs (Field-Programmable Gate Arrays) Optimized for processing large numbers of incoming and outgoing packets. Consist of multiple Packet Processing Engines (PPEs). Steps in packet processing Checksum veri cation Field extraction Packet classi cation Path selection... 108 fi fi fi Example: Intel To no 2 Intel Presentation at HotChips 32 in 2020 Barefoot Networks "Intel To no2 – A 12.9 Tbps P4-Programmable Ethernet Switch" 7 nm, chiplet architecture PISA: Protocol Independent Switch Architecture Pictures from HotChips conference slides 109 fi fi To no 2 110 fi 2.5D Integration TSMC Chip-on-Wafer-on-Substrate (CoWoS) https://en.wikichip.org/wiki/tsmc/cowos 111 112 Programming Network Processors in P4 Domain-speci c language for the packet forwarding plane P4 programming language was introduced in 2014 - 2016 P4 stands for „Programming Protocol-Independent Packet Processors“ https://p4.org P4 targets General purpose processors, SoC, FPGAs, ASICs, network processors, SmartNICs Protocol-independent No native support for protocols like IP etc. 
Programs specify header format and eld names which are interpreted and processed Programs Parsing logic, headers, match action tables, tables for forwarding, actions The control plane can modify the various tables via the program API. Support for stateful processing via registers, counters and meters 113 fi fi All about Parallelism To no 2 is a highly parallel architecture Pipelining of packets Four parallel pipelines Each stage applies multiple MAU to a packet Each MAU uses VLIW instructions to run multiple functional units in parallel 114 fi Embedded Processors Embedded processors, microcontrollers Used in car engine controller, washing machine,... Simple: typically no oating point operations Very cheap Optimized for low power consumption Always used in a system-on-chip setting Example: Intel 8051 60.000 transistors Introduced in 1980 8 Bit CPU 4kB ROM, 128 Byte RAM (Harvard Architecture) no cache 32 IO-links 10-15 cents, 32 bit microcontroller costs 30 times more Updated variants are still produced by di erent companies (higher clock speeds, power-saving features, more peripherals and interfaces) 115 fl ff Espressif ESP32 https://www.espressif.com/en/products/socs/esp32 116 ESP32 Block Diagram 117 ESP32 Memory ESP32-WROOM-32E 520 KB SRAM for data and instructions 448 KB ROM for boot and kernel functions 8 KB SRAM in RTC (RTC-FAST Memory) code to run after deep sleep before any initialization or bootloader 8 KB SRAM in RTC (RTC-SLOW Memory) safe data over deep sleep, code and data for ULP processor 4 MB QSPI Flash memory (on-package) 118 ESP32-WROOM 119 ESP32 WROOM Board RBG LED application Reset Button USB to serial mirco USB 3.3 V input WIFI Antenna LI-ION Battery Charger 3.3V regulator RBG LED Power/Charge Communication over serial interface 120 ESP32 Family ESP32 ESP32-C6 High Power Domain: single 32 bit RISC-V core with 160 MHz, 320 KB ROM, 512 KB SRAM, 4 way set-associative cache with 32KB Low Power Domain: ultra low power RISC-V (20 MHz, two stage pipeline) and 16 KB SRAM Supports low-power mesh networking technologies Zigbee 3.0 and Thread 1.3 ESP32-S3 Xtensa 32-bit LX7 dual-core microprocessor with up to 240 MHz and SIMD support, shared caches instruction: 16-32 KB, 4/8 way set associative, line size 16/32 bytes data cache 32-64 KB, 4/8 way set associative, line size 16/32/64 bytes RISC-V and Finite State Machine coprocessors 121 Power Consumption 122 Average Current Consumption 123 Average Latency 124 Average Energy 125 RAM vs PSRAM 126 Cache Settings 64 KByte data cache 16 byte cache line size 64 byte 127 More Processor Classes Signal processors Optimized for signal processing Audio, image, or video processing Graphics processors Specialized for graphics processing consists of many simple cores sometimes specialized for certain computations in the graphics pipeline Example: Nvidia Fermi Also termed manycore processors Vector processors Highly pipelined architecture Designed for data parallel operation in numerical simulations Extreme memory bandwidth 128 Announcements No lecture on Friday Please send me your slides after your presentation as PDF Stick to the time limit for the presentations 129 Classes of Parallel Architectures M. Flynn, Very High-Speed Computing Systems, Proceedings of the IEEE, 54, 1966 130 Classes of Parallel Architectures SIMD (Single Instruction Multiple Data): Synchronized execution of the same instruction on a set of data MIMD (Multiple Instruction Multiple Data): Asynchronous execution of di erent instructions. 
Distributed Memory - DM (multicomputer): Building blocks are nodes with private physical address space. Communication is based on messages. Shared Memory - SM (multiprocessor): System provides a shared address space. Communication is based on read/write operation to global addresses. Uniform Memory Access – UMA : (symmetric multiprocessors - SMP): centralized shared memory, accesses to global memory from all processors have same latency. Non-Uniform Memory Access Systems - NUMA (Distributed Shared Memory Systems - DSM): memory is distributed among the nodes, local accesses much faster than remote accesses. 131 ff Instruction Set Architecture (ISA) 132 The ISA de nes the capabilities of the processor for the programmer. Processors are binary compatible if they support the same ISA. 133 fi ISA: Operands Datatypes and formats Integer of di erent length Fixed point and oating point numbers Character and string format Bit Operands Number of operands 0/1/2/3 address instructions Types of operands Immediate operands Register operands Memory operands Data alignment Memory operand is aligned at an n-byte boundary if the address is a multiple of n, i.e., A mod n = 0 Typical requirement is to align operands with a format of s bytes at an s-byte boundary. Advantages: optimized bus transfer and t within cache lines 134 ff fl fi ISA: Addressing Addressing modes Immediate: Operand is part of the instruction Register: Operand is in a register. Register address is given. Direct: memory address is part of the instruction Register indirect: memory address is in a register Indexed, Displacement: register indirect plus xed o set PC-relative: used for jumps 135 fi ff ISA: Instructions Classes of instructions data transfer arithmetic and logic operations control ow: procedure call, jumps, loop control I/O instructions System level instructions: interrupts, privilege level Instruction format Structured in opcode and operands single format for all instructions: von-Neumann di erent formats exible orthogonal design, e.g., all addressing modes possible for all operands. complex decoding 136 fl ff fl ISA: Registers Registers are memories in the processor fast access short address (register number) Register types General registers for integers, addresses, bit data Floating point registers Special registers for multimedia, loop control register.. System registers: program counter, status, control register Register le This term is frequently used for a set of registers Sliding window register le Short addresses for large set of registers Special operations to slide the window of addresses over the real registers. 137 fi fi ISA: Execution Modes Motivation Systems are used in multiuser and multiprogramming mode. Applications and data need to be protected. OS is managing the hardware. Two execution modes Supervisor or system mode Full access to all components reserved for OS User mode Limited access Access to memory only through memory translation No privileged instruction for I/O and access to control registers 138 ISA: Exception Handling An exception (software interrupt) occurs if an application executes a privileged instruction in user mode has an arithmetic error, e.g. division by zero accesses a virtual page with wrong permissions executes an emulated instruction... 
Exception is handled by the OS: interrupt of the currently executed instruction, save the processor state, jump to a predefined address for the handler, switch to supervisor mode, handle the exception, return to the original instruction in user mode 139
ISA: Exceptions and Interrupts. Precise exceptions: all previous instructions have been terminated, all subsequent instructions have not been executed. Imprecise exceptions: exceptions can occur out of order. Interrupt handling: interrupts allow the processor to handle external asynchronous signals, e.g., I/O signals; handled in the same way as exceptions 140
CISC vs. RISC. Complex Instruction Set Computer (CISC): variable instruction format, memory and register operands, many specialized instructions. Example: ia32. Reduced Instruction Set Computer (RISC): single instruction format, load/store architecture, limited addressing modes, one-cycle machine instructions. Examples: ARM, RISC-V. Comparison: CISC leads to shorter programs; CISC is more flexible; RISC is faster; RISC needs to emulate complex instructions. Important: optimizing compilers for RISC processors 141
Little and Big Endian. Order in which the bytes are stored. Little endian: least significant byte first. Big endian: most significant byte first (a short example follows after the x86/ARM/RISC-V comparison below). http://en.wikipedia.org/wiki/Endianness Note: In 1726, Jonathan Swift described in his satirical novel Gulliver's Travels tensions in Lilliput and Blefuscu: whereas royal edict in Lilliput requires cracking open one's soft-boiled egg at the small end, inhabitants of the rival kingdom of Blefuscu crack theirs at the big end (giving them the moniker Big-Endians). 142
Comparison x86, ARM, RISC-V. Registers: x86: 16 general purpose, 16 floating point; RISC-V, ARMv8: 32 general purpose, 32 floating point. Memory access: x86: operands of instructions in registers or memory, no alignment; RISC-V: load-store, no alignment; ARMv8: load-store, alignment; all: byte addressing. Addressing modes: x86: register, immediate, displacement, based index displacement, based scaled index plus displacement; RISC-V: register, immediate, displacement; ARM: register, immediate, displacement, PC-relative, sum of two registers, multiplication of the first register with the size of the operand in bytes, autoincrement and autodecrement addressing 143
Comparison x86, ARM, RISC-V. Types and size of operands: all: 8 bit (ASCII), 16 bit (Unicode and half word), 32 bit (integer and word), 64 bit (double word and long int), 32 and 64 bit floating point (single and double precision IEEE 754); x86: 80 bit floating point (extended double precision). Operations: RISC-V and ARMv8 are RISC; x86 is CISC. Control flow instructions: all: conditional and unconditional jumps, procedure calls, returns; all: PC-relative addressing; RISC-V: BEQ and BNE... compare the contents of two registers; ARMv8 and x86: test condition code bits set as a side effect of an arithmetic/logical instruction; ARMv8 and RISC-V: place the return address in a register; x86: places the return address on the stack 144
Comparison x86, ARM, RISC-V. Instruction format: ARMv8 and RISC-V: all instructions are 32 bit long. Hennessy Patterson p. 16 145
Comparison x86, ARM, RISC-V. Instruction format: x86: variable length ranging from 1 to 18 bytes, which leads to shorter programs. RISC-V and ARMv8 later introduced extensions, Thumb-2 and RV64IC, providing a mix of 16 and 32 bit instructions.
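As a small illustration of the byte-order slide above, the following sketch (not part of the lecture material) inspects how a 32-bit value is laid out in memory and reports whether the machine is little or big endian:

```c
#include <stdio.h>
#include <stdint.h>

/* Check the byte order of this machine by looking at the memory layout
 * of a 32-bit constant. Little endian stores the least significant byte
 * at the lowest address, big endian the most significant byte. */
int main(void) {
    uint32_t word = 0x01020304;
    const unsigned char *bytes = (const unsigned char *)&word;

    printf("bytes in memory (low to high address): %02x %02x %02x %02x\n",
           bytes[0], bytes[1], bytes[2], bytes[3]);
    if (bytes[0] == 0x04)
        printf("little endian (e.g. x86)\n");
    else if (bytes[0] == 0x01)
        printf("big endian\n");
    return 0;
}
```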
146 Instruction Pipelining 147 Instruction Pipelining Pipelining concept Pipeline Hazards Superscalar processors Implementation of pipelining Instruction Fetch Decode Allocation/Renaming Issue Execute Commit Compiler support 148 Instruction Cycle IF - Instruction Fetch Fetch instruction from memory into instruction register Increment PC ID - Instruction Decode Decode instruction Fetch register operands from register le EX - Execute Perform ALU operation For load or store operations compute the address MA - Memory Access Read or write from/to memory in case of load or store. WB - Write Back Write result back to register le 149 fi fi Pipeline Production of Ford's Modell T (1913) Characteristics Many assembling stages Each stage is simple High degree of parallelism High throughput 150 Pipelining De nitions Pipelining is used in the processing of an object if the processing is split in steps the steps are processed sequentially for a single object the steps can be overlapped for di erent objects. A pipeline stage is the processing unit for a single step. The pipeline consists of all stages. 151 fi ff Pipelined Instruction Cycle Serial execution Pipelined execution Stages are the steps in the instruction cycle Each stage has to be implemented by own hardware. Overlapping of stages for subsequent instructions 152 Pipelined Instruction Cycle Advantages Better utilization of hardware Higher throughput. The instruction still takes 5 cycles but in each cycle one instruction nishes if each stage takes one cycle and the pipeline is lled. Speedup Number of stages K=5, number of instructions N Filling the pipeline takes K-1 cycles Then one instruction nishes per cycle Total number of cycles is N+K-1 Throughput is asymptotically 1 instruction per cycle Speedup: K*N/(N+K-1)->K 153 fi fi fi Pipelined Instruction Cycle Disadvantages Slowest stage de nes throughput. Design stages to be of the same length (1 cycle). Same or even higher latency All instructions have to go through all stages. All stages take the same time. More complex control logic 154 fi Implementation 155 MIPS Instruction Format 156 MIPS Pipeline 157 Pipeline Hazards Pipeline hazards These are situations that inhibit that the next instruction can be processed in the next stage of the pipeline. This leads to an interrupt of the synchronous execution in the pipeline and thus to a performance decrease. Classes Resource hazards Data hazards Control hazards Solution: suspend the execution of the instruction (pipeline stall) If an instruction is suspended in a certain stage of the pipeline, all subsequent instructions are also stopped. The pipeline logic inserts NOP operations into the next pipeline stage. The processing of all earlier instructions is continued. Many other advanced hardware and software techniques have been developed to reduce the e ect of hazards on the performance. 158 ff Hazard Classes Resource or structural hazards Result from two instructions that are processed in di erent stages which require the same resource. Not all of the components can be replicated to make sure that this never happens. Examples Parallel writes to the register le, e.g., if arithmetic operations can write directly and loads in the memory access phase. Parallel access to memory in IF and MA Subsequent instructions need the FP division hardware that is not implemented as a pipeline. Data hazards Instruction access the same data as earlier instructions and these are not yet nished, e.g., an operand computed by a previous instruction is not yet available. 
Data hazards result from data dependences between the instructions. Control or branch hazards: the next instruction cannot be fetched due to a jump in the control flow. 159
Data Dependences. An instruction j is data dependent on instruction i if there is a path from i to j and the data accessed by both instructions overlap, i.e., O(i) ∩ I(j) ≠ ∅, I(i) ∩ O(j) ≠ ∅, or O(i) ∩ O(j) ≠ ∅, where I(i) = set of read data and O(i) = set of written data. Classification of data dependences: true dependence: write ... read; anti dependence: read ... write; output dependence: write ... write. Anti and output dependences are called name dependences (see the small classification example below). Data dependences are properties of the program. They determine the execution order of instructions; independent instructions can be reordered and even executed in parallel. They determine the maximum degree of parallelism. 160
Data Hazards. Data hazards can occur if data-dependent instructions are executed with only a short delay in the pipeline, so that their accesses can overlap in the pipeline. Example: true dependence 161
Data Hazard Classification. Read-after-write (RAW): happens if instruction j would read a source register before an earlier instruction i wrote its result; implied by a true dependence. Write-after-read (WAR): happens if instruction j would write the target register before an earlier instruction i read the operand; implied by an anti dependence. Write-after-write (WAW): happens if instruction j would write its target register before an earlier instruction i wrote its result to the same register; implied by an output dependence. Can happen in pipelines where multiple stages can write or an instruction can proceed without waiting for a stalled previous instruction. 162
Dependences and Data Hazards. It depends on the pipeline organization and the temporal execution of instructions whether data dependences lead to pipeline hazards or not. 163
Software Solutions (Static). Implemented by the compiler. Insertion of NOPs: detection of potential data hazards and insertion of NOPs after instructions that might lead to hazards. Reordering of instructions: the instruction scheduling phase of the compiler reorders instructions so that independent instructions are executed between dependent instructions. 164
Hardware Solutions (Dynamic). Detection of conflicts via appropriate hardware logic. Handling: interlocking or stalling, forwarding, forwarding with interlocking 165
Interlocking or Stalling. Stops instruction j and all subsequent instructions for multiple cycles. 166
Forwarding. Direct forwarding of ALU results to the ALU input. Eliminates stall cycles. Requires additional hardware (forwarding logic) 167
Forwarding and Interlocking. Not all hazards can be handled by forwarding. Example: true dependence with a load operation 168
Control Hazards. Computation of the target and condition is done in the EX phase and it replaces the PC in the MA phase. The condition typically depends on the EX phase of the previous instruction, requiring forwarding. Thus, only after three cycles can the correct instruction be loaded 169
Handling Control Hazards. Early computation in ID: condition and target should be computed already in ID. Structural hazard: the ALU cannot be used for the computation of the target, so an additional ALU is required in ID. Data dependence with a previous arithmetic instruction: RAW hazard. The critical path in the ID phase is lengthened: decoding, computation of the branch target, and updating the PC lie on the critical path.
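To make the dependence classes above concrete, here is a small illustrative sketch (not from the lecture; the three-instruction sequence and the single-destination instruction format are invented) that classifies true, anti, and output dependences from the read set I(i) and write set O(i) of each instruction:

```c
#include <stdio.h>

/* Toy classifier for the dependence classes defined above, using the
 * read set I(i) and write set O(i) of each instruction. Example sequence:
 *   i1: r1 = r2 + r3
 *   i2: r4 = r1 - r5    (reads r1 written by i1)
 *   i3: r1 = r6 * r7    (rewrites r1)
 */
typedef struct {
    const char *text;
    int write_reg;        /* O(i): single destination register */
    int read_reg[2];      /* I(i): two source registers        */
} Instr;

static void classify(const Instr *a, const Instr *b) {
    /* b follows a in program order */
    for (int k = 0; k < 2; k++)
        if (b->read_reg[k] == a->write_reg)
            printf("true dependence (write ... read) : %s -> %s\n", a->text, b->text);
    for (int k = 0; k < 2; k++)
        if (b->write_reg == a->read_reg[k])
            printf("anti dependence (read ... write) : %s -> %s\n", a->text, b->text);
    if (b->write_reg == a->write_reg)
        printf("output dependence (write ... write): %s -> %s\n", a->text, b->text);
}

int main(void) {
    Instr prog[3] = {
        { "i1: r1 = r2 + r3", 1, {2, 3} },
        { "i2: r4 = r1 - r5", 4, {1, 5} },
        { "i3: r1 = r6 * r7", 1, {6, 7} },
    };
    for (int i = 0; i < 3; i++)
        for (int j = i + 1; j < 3; j++)
            classify(&prog[i], &prog[j]);
    return 0;
}
```

For this sequence it reports a true dependence i1→i2, an output dependence i1→i3, and an anti dependence i2→i3; the latter two are name dependences, which register renaming (see the renaming slides later in the notes) can remove, while the true dependence constrains the execution order.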
170 Handling Control Hazards Delay slot: Insertion of independent instructions by the instruction scheduling of compiler Fill the stall cycle with an independent instruction (Delay Slot) Can be used in combination with previous technique. 171 Branch Prediction Prediction of branch decision when a jump is encountered. Speculative execution of instructions dependent on the predicted outcome. After the condition was computed Either continue without delay since the prediction was correct or delete the started instructions and fetch the correct ones. Two classes Static branch prediction by hardware or compiler Dynamic branch prediction by the hardware 172 Static Branch Prediction Hardware Static prediction in processor, backward jumps are predicted to be always taken. Compiler Speci cation via a bit in the jump opcode Prediction can be guided by program analysis or pro ling (feedback directed compilation) 173 fi fi Dynamic Branch Prediction Properties Based on dynamic behavior of the application. The history of a jump is taken into account. Leads to more precise predictions Expensive in terms of hardware 174 Dynamic Branch Prediction Branch Prediction Bu er Cache for information about conditional jumps Requires that the target can be computed fast Stores address tag and a prediction 175 ff Single Bit Predictor Single bit predictor If the bit is set, the brunch is predicted to be taken. If the prediction is wrong the bit is inverted. Single Bit Predictor is suboptimal for nested loops Wrong prediction in the rst iteration of inner loop. 176 fi Two Bit Predictor Two Bit Predictor Two bits allow to have four states strongly taken weakly taken weakly not taken strongly not taken Requires two mispredictions to switch prediction. 177 Two Bit Predictor with Saturation Scheme Two-Bit Predictor with Saturation Scheme Count the taken jumps If sum >= 2, predict taken jump Extensible to n Bit, but no big impact. 178 Prediction Buffer Size In uence of the size of the prediction bu er in SPEC89 A relativ small prediction bu er is su cient. 179 fl ff ffi ff Correlation Predictor Prediction is also based on the history of other jumps. If (aa==2) aa=0; If (bb==2) bb=0; If (aa!=bb){ … } Simple two bit predictor is not su cient to predict third branch. Taking into account the preceding jumps, enables a correct prediction. (m,n)-Predictors Uses the history of the last m jumps to select one of 2m n-bit predictors. Branch History Register (BHR) m-Bit shift register Store the global history of the last m jumps. Bits determine whether the jump was taken. After each jump the outcome is shifted into the BHR The BHR gives the index in the Pattern History Table (PHT) 180 ffi Correlation Predictor 181 Branch Target Buffer Branch Target Address Cache, Branch Target Bu er Required, if the computation of the target address is late in the pipeline. Stores the jump address and the target address Can be used in the IF phase. Can be combined with a predictor. 182 ff Branch Target Buffer 183 Superscalar Processor A superscalar processor can execute more than one instruction at the same time in all pipeline stages. It can achieve a throughput of multiple instructions per cycle. 184 Pipelining in Modern Processors 185 Pipelining in Modern Processors 186 Extended Pipeline Instruction Fetch Fetching instructions Main components are the instruction cache and the branch prediction logic. Decode Identify instruction type (e.g. 
control ow) and required resources (e.g., register ports, functional units) Decoders based on ROMs or ad-hoc circuitry Rename/Allocation Register renaming for removing name dependences. Allocation of entries in the reorder bu er, the issue queue, and the load/store queue. If resources are not available, the instruction is stalled until other instruction releases the required resources. Then the instruction is dispatched. 187 fl ff Extended Pipeline Instruction Issue Instructions wait in the queue until all operands and the functional unit are available. For in-order processors only the rst element is checked, for out-of-order processors all the instructions need to be checked per clock cycle. Execution Instructions are executed. Variety of units: integer, oating-point and logical operations, and also SIMD units (e.g. SSE) Bypass logic (forward logic): wires that move results from one unit to the input of other units. The design of the bypass network is critical since wire delays do not scale at the same pace as gate delays. They have an important contribution to the cycle time of the processor. Write back Forwards results to the registers or L/S queue. Commit Ensure sequential order. An instruction can only commit if older instructions have committed. Removes e ects of any instructions that were executed due to misspeculation. 188 ff fl fi Instruction Fetch 189 Instruction Fetch High performance instruction fetch unit Requires: one fetch every cycle a fetch can return a block of instructions (required for superscalar processors) Components: Branch predictors and instruction cache Organized as a pipeline (here 4 stages) Parallel access to I-TLB, cache tag array and data array Access based on virtual addresses Tag based on physical address Set-associative cache -> way multiplexer Multiple branch predictors in the rst cycle The one to be used is decided in the next cycle. 190 fi Instruction Fetch Branch Target Bu er determines for each instruction in the block whether it is a branch and possibly which type (conditional, call, return,...) and the target address if the branch is taken. 191 ff Instruction Fetch Branch Predictor Static vs dynamic prediction k-Bit predictors Last n bits can be used to index the table. Aliasing of branches Two branches might be mapped to the same entry. This impacts the accuracy if the branches have opposite bias. 192 Instruction Fetch Correlation predictor A 10 to 20 bit history is combined with the PC of the branch to generate an index to a table. A bitwise OR between the least signi cant bits of the PC and the BHR has proven to be simple and e ective. 193 ff fi Instruction Fetch Hybrid Branch Predictors E ectiveness of predictors depends on the code and other circumstances. For example, whether correlation predictors are better than local predictors depends on the application, e.g., warm-up time is required after every context switch. Therefore, another predictor (selector) can be used to predict which of multiple available predictors will be used. E.g., the local predictor after the context switch and after some time the correlation predictor. The selector is updated according to the correctness of the predictors. 194 ff Instruction Fetch Return Address Stack Special hardware predicting subroutine returns A BTB could also predict it, if the function is called from the same call site every time. Hardware LIFO structure For a call site the address of the next instruction is pushed onto the stack. 
Instruction Fetch
Return Address Stack
Special hardware predicting subroutine returns
A BTB could also predict them if the function is called from the same call site every time.
Hardware LIFO structure (a small C sketch follows below, after the decode slides)
For a call, the address of the instruction following the call site is pushed onto the stack.
For a return, the topmost address is taken as the prediction (popped off).
If the stack were unlimited, all return addresses would be predicted correctly.
Usually the stack is limited to a relatively small number of entries, which leads to mispredictions only if the call nesting depth is larger.
Not good for recursive functions.
195

Decode
196

Pipelining in Modern Processors
197

Decode
The task is to decode instructions
Input is a raw stream of bytes
Output is a series of control signals
Complexity depends on the ISA
RISC decoding
Fixed instruction length makes it easy to identify instructions.
Few encoding formats simplify the separation of opcode and operands.
Instructions are simple, which means that they generate only a few control signals in the pipeline.
Simplicity enables single-cycle decoding using simple Programmable Logic Array (PLA) circuits and/or small look-up tables.
198

Decode
CISC decoding - x86 instruction format
Variable-length instructions; the length depends on the opcode
The opcode can start anywhere within the first 5 bytes due to prefixes
Prefix for string instructions
Atomic operation
Operand type override
The opcode is of variable size: up to three bytes of primary opcode and, sometimes, bits 3 to 5 of ModR/M.
199

Example for String Prefix
200

Example for String Prefix
201

Decode
https://wiki.osdev.org/X86-64_Instruction_Encoding#ModR.2FM_and_SIB_bytes
202

Decode
Dynamic translation of x86 instructions
x86 instructions are very complex, e.g., add [eax], ebx adds register EBX to the value at the address specified in EAX and writes the result back to this address.
This requires the processor to:
Calculate the memory address from EAX and the data segment register DS.
Bring the value into the core and add EBX.
Store the result to the memory location calculated before.
Control of these steps requires a lot of control signals.
In a RISC ISA this would be:
load r0, ds:[eax]
add r1, r0, ebx
store ds:[eax], r1
To enable much simpler pipeline control, modern x86 processors dynamically translate x86 instructions into an internal RISC-like instruction format.
The P6 (Pentium Pro) microoperations (uops) had a fixed length of 118 bits (3-address format).
Info on the number of uops per x86 instruction: https://www.agner.org/optimize/instruction_tables.pdf
203

Decode
High-performance x86 decoding
Two decoupled phases: instruction length decode (ILD) and dynamic translation.
Coupled via an instruction queue that hides bubbles in the ILD and allows the ILD to proceed when complex translations are required in the subsequent decoder.
Instruction length decode
Identifies instructions
Decodes all the prefixes
Marks various properties that simplify dynamic translation
Most common instructions are handled in a single cycle. Only two cases (length-changing prefixes) require six cycles.
Dynamic translation unit
Simple decoder: IA-32 instructions that lead to a single uop
Complex decoder: handles instructions that lead to up to 4 uops, or stops the queue and triggers the microsequencer (MSROM) unit.
204

Decode
Nehalem adds the Loop Stream Detector
Identifies loops
Switches off front-end pipeline stages
Saves power
http://www.anandtech.com/show/2594/4
205
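The return address stack from slide 195 can be modelled with a few lines of code. This is a minimal sketch under assumptions of my own (stack depth, names, and circular overwriting of the oldest entry on overflow), not the exact hardware organization.

```c
#include <stdint.h>

/* Sketch of a Return Address Stack (RAS): a small hardware LIFO that predicts
 * the targets of return instructions. The depth and the overflow behaviour
 * (silently overwriting the oldest entry) are illustrative assumptions. */

#define RAS_DEPTH 16

static uint64_t ras[RAS_DEPTH];
static unsigned ras_top;                        /* index of the next free slot */

/* On a call: push the address of the instruction following the call. */
void ras_push(uint64_t return_address) {
    ras[ras_top] = return_address;
    ras_top = (ras_top + 1) % RAS_DEPTH;        /* overflow overwrites the oldest entry */
}

/* On a return: pop the topmost address and use it as the predicted target. */
uint64_t ras_predict_return(void) {
    ras_top = (ras_top + RAS_DEPTH - 1) % RAS_DEPTH;
    return ras[ras_top];
}
```

Once the call depth exceeds RAS_DEPTH, older entries are overwritten and the corresponding returns are mispredicted; this is why deep recursion defeats the predictor, as noted on slide 195.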
uOP cache
Intel Sandy Bridge added the uOP cache
Stores the uOP sequence for instructions
If an instruction is found (cache hit), this is faster than decoding the instruction.
Integrated with branch prediction: fetch continues directly from the uOP cache if the predicted instruction is in the cache; in case of a misprediction, the correct target might also be in the cache.
Efficiency depends on code characteristics
206

Allocation / Renaming
207

Pipelining in Modern Processors
208

Tasks
Allocation of the resources needed for execution
Entries in the issue queue, the reorder buffer, and the load/store queue.
The reorder buffer keeps track of the program order of instructions.
If resources are not available, the instruction stalls until a previous instruction commits and releases resources.
Register renaming
The limited number of architectural registers (also called logical registers) leads to name dependences.
These can be removed by using a much larger set of physical registers.
Two techniques:
Additional registers in the issue queue and the reorder buffer.
Additional registers in a merged register file.
Instructions are dispatched into the issue queue.
209

Renaming via Merged Register File
Dependences: true, anti, output
Write to an architectural register: select a new physical register and update the mapping of architectural to physical registers.
Read from an architectural register: replace the architectural register by the current physical register to which it is mapped.
A physical register can be freed when it has been written and there is no further instruction that reads the architectural register.
Difficult to decide, since future instructions are not known.
When to free it instead: when an instruction is decoded that writes to the same architectural register? Or when the next instruction that writes to the same architectural register commits?
210

Merged Register File
A single register file (merged register file) stores speculative and committed values.
Each physical register is either free or allocated.
Free registers are stored in a free list.
The register map table stores for each architectural register the latest physical register assigned to it.
When an instruction is renamed:
Input registers are looked up in the register map table and their ids are replaced.
The output register is replaced with the id of a new physical register.
If no free physical register is available for the result, the instruction is stalled.
(A minimal C sketch of this renaming scheme follows after slide 213 below.)
211

Renaming via Reorder Buffer
Instructions enter the reorder buffer (ROB) in program order.
The ROB provides a register for the result and stores the speculative state.
When the instruction commits, the value is written back to the architectural register file.
The register map table determines for each architectural (logical) register whether the latest value is in the ROB or in the architectural register file.
Renaming
The output register is renamed into the ROB entry.
The input registers are either unchanged (value already available) or renamed into the producing ROB entry.
212

Comparison
Allocation and release of registers
ROB: no free list; releasing the register in the ROB entry requires the id of the architectural register; instructions in the issue queue need to be updated.
Merged register file: allocation based on the free list; release is also complex due to the decision when the register is no longer needed.
Transfer of data
ROB: operands from multiple sources require fewer read ports for the register file but more interconnect. Data are first stored in the ROB, then transferred to the register file and the issue queue entry, and then to the functional unit. All these transfers require energy.
Merged register file: requires more read ports for the register file; fewer data transfers.
213
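To make the renaming step of slides 210 and 211 concrete, here is a minimal sketch of renaming with a merged register file: a map table from architectural to physical registers plus a free list. The register counts, the names, and the simplified three-operand instruction format are assumptions of this sketch; the release of physical registers, discussed above, is deliberately omitted.

```c
#include <stdbool.h>

/* Sketch of register renaming with a merged register file:
 * map[a] is the physical register currently holding architectural register a,
 * free_list holds physical registers not mapped to any architectural register.
 * Sizes and the simplified instruction format are illustrative assumptions. */

#define NUM_ARCH 32
#define NUM_PHYS 128

static int map[NUM_ARCH];                       /* architectural -> physical mapping */
static int free_list[NUM_PHYS];
static int free_count;

typedef struct { int dst, src1, src2; } insn_t; /* register ids of one instruction */

void rename_init(void) {
    for (int a = 0; a < NUM_ARCH; a++) map[a] = a;             /* identity mapping at reset */
    free_count = 0;
    for (int p = NUM_ARCH; p < NUM_PHYS; p++) free_list[free_count++] = p;
}

/* Rename one instruction; returns false if it must stall (no free physical register). */
bool rename_insn(insn_t *in) {
    /* Sources first: replace them by the physical registers currently mapped. */
    in->src1 = map[in->src1];
    in->src2 = map[in->src2];

    /* Destination: allocate a new physical register and update the mapping,
     * which removes anti and output dependences on the architectural register. */
    if (free_count == 0)
        return false;                           /* stall until a register is released */
    int p = free_list[--free_count];
    map[in->dst] = p;
    in->dst = p;
    return true;
}
```

Note that the sources are renamed before the destination, so an instruction that reads and writes the same architectural register still sees the old value. Freeing the previous mapping of the destination is the hard part discussed on slide 210 and is not shown here.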
Issue
214

Pipelining in Modern Processors
215

Issue
Instructions are issued to the functional units.
Two main classes:
In-order issue
Out-of-order issue: most frequently used.
Many different implementations: reservation stations, unified issue queues, distributed issue queues
216

In-Order Instruction Scheduling
Instructions are issued to the functional units in program order.
Also called static instruction scheduling.
Check the oldest non-issued instruction and issue it whenever its operands and the functional unit are available.
div f0,f2,f4
add f10,f0,f8
load f12,f8,10
The division blocks all following instructions.
The add is truly dependent on the division.
The load is independent.
Implementation typically via a scoreboard.
Resource table: keeps track of the status of the functional units.
217

Out-of-order Scheduling
Instructions may be issued out of order (with respect to program order).
Also called dynamic instruction scheduling, since the order is determined at runtime based on the dependences.
Early concept based on a scoreboard: it keeps track of the data dependences (true, anti, output) and the state of the functional units. Instructions are issued as soon as all data dependences are fulfilled.
Tomasulo algorithm (the classical dynamic scheduling algorithm), based on reservation stations.
218

Tomasulo Algorithm
Proposed by R. Tomasulo for the IBM 360/91 in 1967
Each functional unit has a private buffer called a reservation station. It stores the instructions together with the operand ids and operand values.
Instructions are forwarded immediately in the allocation stage to a free reservation station, which takes over control until the instruction is issued to the functional unit.
The results are forwarded directly via the result bus to the other functional units that require the data.
219

Tomasulo Algorithm
220

Instruction Cycle
Allocation: select a free reservation station.
Copy available operands from the registers into the reservation station, or store the id of the reservation station that generates a required operand (resolving RAW and WAR hazards).
Store in the register which reservation station generates its next value (handling of WAW hazards).
Issue
If operands are missing, wait; insert newly arriving operands into the reservation station.
If all operands are available, start the operation.
Execute
Write result
As soon as the result is available, it is broadcast on the result bus (common data bus).
The target register and the other reservation stations take the result from the bus.
221

Tomasulo Algorithm
Reservation stations are stored in the reservation table.
222

Tomasulo Algorithm
In addition to the value, registers store:
Valid bit: indicates whether the value is valid or is currently being computed.
Reservation station: the station that produces the next value.
223

Tomasulo Algorithm
224-236

Tomasulo Algorithm
The reservation stations and the result propagation perform an implicit register renaming.
RAW, WAR and WAW hazards are resolved.
No support for speculative execution.
237

Unified Issue Queue
238

Unified Issue Queue
Unified issue queue schemes have a single issue queue for all renamed in