Computer Abstractions and Technology Lecture Notes
Summary
These lecture notes cover computer abstractions and technology in the study of computer architecture, discussing topics such as computer components, processors, memory, memory controllers, and interconnect. The notes are a transcript of PDF lecture slides.
Full Transcript
Computer Abstractions and Technology
471029: Introduction to Computer Architecture, 2nd Lecture
Disclaimer: Slides are mainly based on the COD 5th edition textbook and were developed in part by Prof. Dohyung Kim @ KNU and the computer architecture courses @ KAIST and SKKU.

Opening the Box
- Display, fans, speakers
- Computer board + DRAM + CPU + GPU
- Batteries, I/O devices, storage

Components of a Computer
- Processor
- Memory
- Interconnects: NoC (Network-on-Chip), processor interconnect, large-scale networks
- I/O:
  - User-interface devices: display, keyboard, mouse, sound, camera, ...
  - Storage devices: HDD, SSD, CD/DVD
  - Network adapters: Ethernet, 3G/4G/5G, Wi-Fi, Bluetooth, NFC, ...
- The same components appear in all classes of computers.

Inside the Processor (CPU)
- Datapath: performs operations on data
- Control: tells the datapath, memory, and I/O devices what to do
- Early example: ARM1 (Acorn RISC Machine 1), 1985
- Sources: http://www.righto.com/2016/02/reverse-engineering-arm1-processors.html and https://en.wikichip.org/wiki/acorn/microarchitectures/arm1
- [Figure: ARM1 die photo with labeled blocks, including program counter, fetch, instruction pipe, pipe status, instruction decode (ALU, register, and shift decode), and register file]

Inside the Memory
- Off-chip memory controller (up to ~2008): the memory controller sat outside the CPU, between the processor, PCI, and DRAM
- Source: "Memory Systems: Cache, DRAM, Disk", Jacob et al., 2010
- DRAM technology

What Happened?
- FLOPS (flop/s) = floating-point operations per second; KFLOPS = 10^3 FLOPS, MFLOPS = 10^6 FLOPS, GFLOPS = 10^9 FLOPS
- 1996: Hitachi CP-PACS/2048 supercomputer, ~368.2 GFLOPS at 257 kW (2048 processing units, 3D hyper-crossbar network)
- 2019: AMD EPYC 7702P 64-core processor, ~388 GFLOPS at ~0.2 kW
- Comparable performance at roughly 1/1000 of the power: 3 orders of magnitude! (Source: top500.org)
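The "3 orders of magnitude" claim on the "What Happened?" slide refers to energy efficiency, and can be checked with a few lines of arithmetic using only the figures quoted on the slide. A minimal sketch in Python:

```python
# Performance and power figures as quoted on the "What Happened?" slide.
cp_pacs_gflops, cp_pacs_kw = 368.2, 257.0   # Hitachi CP-PACS/2048, 1996
epyc_gflops, epyc_kw = 388.0, 0.2           # AMD EPYC 7702P, 2019

# Energy efficiency in GFLOPS per kilowatt.
cp_pacs_eff = cp_pacs_gflops / cp_pacs_kw   # ~1.4 GFLOPS/kW
epyc_eff = epyc_gflops / epyc_kw            # ~1940 GFLOPS/kW

ratio = epyc_eff / cp_pacs_eff
print(f"Efficiency improvement: {ratio:.0f}x")  # ~1354x, i.e., 3 orders of magnitude
```

Raw performance barely moved between the two machines; what improved by a factor of ~1000 is performance per watt.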
Reviewing 40 Years of Moore's Law
- 40 years of stunning progress in microprocessor design: ~1.4x annual performance improvement for 40+ years
- Word width: 8 -> 16 -> 32 -> 64 bits (~4x)
- Instruction-level parallelism: from 4-10 cycles per instruction to 4+ instructions per cycle (~10-20x)
- Multicore: from one core to 128+ cores (~128x+)
- Clock rate: 3 MHz to 4 GHz (through both technology and architecture)

Now: Inside the Processor (CPU)
- Status quo (Intel Core i7-3960X): further integrated, more functionality on chip
- Status quo (AMD Ryzen 5000, Zen 3 architecture)

Now: Inside the Memory
- High-Bandwidth Memory (HBM): 3D-stacked memory
- Source: https://community.cadence.com/cadence_blogs_8/b/fv/posts/what-s-new-with-hybrid-memory-cube-hmc
- Examples: NVIDIA Volta V100 GPU + HBM2 (2017), AMD Radeon Pro 5600 GPU + HBM2 (2020)

Interconnect in the CPU
- Interconnect matters as the number of compute units increases (e.g., Intel Ice Lake)
- Network-on-Chip (on-chip network); see "Accelerating Fibre Orientation Estimation from Diffusion Weighted Magnetic Resonance Imaging Using GPUs", Hernandez et al., 2013
- Example of a mesh topology (Sources: Intel; "On-Chip Networks", Mark Hill, 2009)

Eight Great Ideas
- Design for Moore's Law
- Use abstraction to simplify design
- Make the common case fast
- Performance via parallelism
- Performance via pipelining
- Performance via prediction
- Hierarchy of memories
- Dependability via redundancy

Abstractions
- Abstraction helps us deal with complexity by hiding lower-level details.
- Application Programming Interface (API): the interface between applications and libraries (e.g., a programming language's standard library)
- Application Binary Interface (ABI): the system software interface, or the interface between two binary programs (e.g., a calling convention)
- Instruction Set Architecture (ISA): the hardware/software interface
- Implementation: the details underlying an interface
- Source: https://www.computer.org/csdl/mags/co/2005/05/r5032-abs.html

Parallelism
- Implicit parallelism: instruction-level parallelism (ILP)
  - From the programmer's perspective, a sequence of instructions executes sequentially; the hardware executes it in parallel
  - Techniques: pipelining, speculation (prediction), caching, superscalar execution (multiple instructions per cycle), dynamic scheduling (out-of-order execution), ...
- Explicit parallelism: data- and thread-level parallelism
  - The hardware provides parallel resources to execute instructions simultaneously
  - Why? Diminishing returns on instruction-level parallelism

Everything goes well and looks fine... but we now have new challenges.

[Note] Uniprocessor Performance (Single Core)

The End of Moore's Law

End of Dennard Scaling
- Dennard scaling: as transistors get smaller, power density stays constant
- Power = α × C × F × V², where α is the fraction of time spent switching, C is capacitance, F is frequency, and V is voltage
- Capacitance is proportional to area, so as transistor size shrank and voltage was reduced, circuits could operate at higher frequencies at the same power
- Dennard scaling ignored leakage current and threshold voltage, which establish a baseline of power per transistor
- These created a "power wall" that has limited practical processor frequency

Running Into the Power Wall

End of Dennard Scaling Is a Crisis
- Processors have reached their power limit: thermal dissipation is maxed out
- Energy consumption has become more important to users, e.g., in mobile, IoT, and datacenters
- E.g., the global IT industry's 2012 electricity consumption, in billions of kilowatt-hours, rivaled that of the world's largest energy-consuming countries (Source: Greenpeace)

"New Golden Age of Computer Architecture"
- Hennessy & Patterson, 2018 Turing Lecture
- The end of Dennard scaling and Moore's Law means no more free performance
- "The next decade will see a Cambrian explosion of novel computer architectures"

Future Opportunities
- Domain-specific Architecture (DSA): design
architectures tailored to a specific problem domain; also called "accelerators"
  - GPUs for graphics processing
  - Neural-network processors for deep learning
  - Processors for software-defined networking
  - [Note] These share similar architectural techniques with general-purpose processors
- Domain-specific Language (DSL)
  - A DSA requires targeting higher-level operations to the architecture, but extracting such structure and information from general-purpose languages like Python, Java, and C is simply too difficult
  - DSLs make it possible to program DSAs efficiently
  - Matlab: a language for operating on matrices
  - TensorFlow: a dataflow language used for programming DNNs
  - P4: a language for programming SDNs
  - Halide: a language for image processing

Future Opportunities
- Secure architecture and software
  - Control isolation, data isolation
  - Constant-time programming
  - Avoiding speculative execution
  - Avoiding shared resources
  - See: https://www.neowin.net/news/microsoft-no-longer-suggests-overlooking-downfall-of-intel-7th-8th-9th-10th-11th-gen-cpus/

Future Opportunities
- Energy-efficient architecture and software
  - Minimizing instruction counts (e.g., the same function in less code)
  - Less data movement, fewer communications
  - Data-centric architecture, e.g., PIM (processing-in-memory)
  - "Energy Efficiency across Programming Languages", SLE 2017
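The "constant-time programming" idea mentioned under secure architecture can be illustrated with a short sketch. A naive byte-string comparison returns as soon as it finds a mismatch, so its running time leaks how many leading bytes matched; a constant-time version touches every byte regardless of where the first difference is. The function names below are illustrative, not from the slides; Python's standard library already provides `hmac.compare_digest` for exactly this purpose.

```python
import hmac

def leaky_equal(a: bytes, b: bytes) -> bool:
    # Early exit: running time depends on the position of the first
    # mismatch, which an attacker can measure to guess a secret byte by byte.
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True

def ct_equal(a: bytes, b: bytes) -> bool:
    # Constant time for equal-length inputs: always scans every byte,
    # accumulating differences with XOR/OR instead of branching on the data.
    if len(a) != len(b):
        return False
    acc = 0
    for x, y in zip(a, b):
        acc |= x ^ y
    return acc == 0

# In real code, prefer the standard library:
assert hmac.compare_digest(b"secret", b"secret")
```

Both functions compute the same result; the difference is purely in how much timing information the control flow exposes, which is the architectural concern the slide is pointing at.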