COMP 426: Multicore Programming Introduction PDF

Document Details


Uploaded by RiskFreeAlien

Concordia University

2024


Summary

This document is an introduction to COMP 426: Multicore Programming, covering Moore's Law, the limits of single-core scaling (the power, ILP, and memory walls), and the move to multicore architectures.

Full Transcript


COMP 426: Multicore Programming Introduction
Based on the References
COMP 426, Fall 2024

[Slide: Photomicrograph of an Intel Pentium CPU]

Technology Trends: Microprocessor Capacity

Moore's Law: 2X transistors per chip every 1.5 years. Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months; the prediction came to be called "Moore's Law". Microprocessors have become smaller, denser, and more powerful. (Slide source: Jack Dongarra)

Increasing Processor Speed

◼ Pipelining to reduce the clock period
◼ Increase the clock rate / processor speed

Microprocessor Transistors and Clock Rate

[Figure: two charts, 1970-2005. Growth in transistors per chip, from the i4004 and i8080 through the i8086, i80286, i80386, R2000, R3000, and Pentium to the R10000 (roughly 1,000 to 100,000,000 transistors); and increase in clock rate, from under 1 MHz to about 1000 MHz.]

Limitation #1: Power Wall

Scaling clock speed (business as usual) will not work: power density is the obstacle.

[Figure: power density (W/cm2) vs. year, 1970-2010, rising from the 4004, 8008, 8080, 8085, and 8086 through the 286, 386, 486, Pentium, and P6 toward the levels of a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface. Source: Patrick Gelsinger, Intel]

Instruction-Level Parallelism

◼ Multiple instructions executed/completed every clock cycle
◼ Parallelism at the machine-instruction level
◼ The processor can re-order and pipeline instructions, split them into microinstructions, perform aggressive branch prediction, etc.
Instruction-level parallelism enabled rapid increases in processor performance over the last 15 years.

Very Large Instruction Word (VLIW)

SIMD and Vector Processing

Hardware Multithreading

Limitation #2: ILP Wall
Hidden Parallelism Tapped Out

[Figure: performance relative to the VAX-11/780, 1978-2006: about 25%/year before 1986, 52%/year from 1986 to around 2002 (roughly half due to transistor density and half due to architecture changes such as instruction-level parallelism), then flattening (??%/year). From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]

◼ Superscalar designs were the state of the art; many forms of parallelism are not visible to the programmer
  ◼ multiple instruction issue
  ◼ dynamic scheduling: hardware discovers parallelism between instructions
  ◼ speculative execution: look past predicted branches
  ◼ non-blocking caches: multiple outstanding memory ops
◼ Unfortunately, these sources have been used up

Limitation #3: Memory Wall

◼ Growing disparity between memory access times and CPU speed
◼ Handled by increasing the cache size
◼ Cache size increases show diminishing returns
◼ Multithreading can help overcome minor delays

Why Multicore?
◼ Diminishing performance improvement in a single-core architecture
◼ Memory wall: the increasing gap between processor and memory speeds; larger and more numerous caches help only to the extent that memory bandwidth is not a bottleneck
◼ ILP wall: the superscalar approach is getting saturated due to the limited parallelism in a single instruction stream
◼ Power wall: increasing the clock frequency consumes more power, which is hard to justify given the diminishing performance improvement

Multicore Architecture

◼ Key idea: increase the number of processors (cores), decrease the cache size, and decrease the clock rate
◼ Each core is simpler
◼ Using more processors with small caches achieves higher performance than enlarging the caches and keeping the same number of processors

Multicore in Products

◼ All microprocessor companies switched to CMP (2X CPUs every 2 years); procrastination is penalized: only 2X sequential performance every 5 years

Manufacturer/Year    AMD/'05   Intel/'06   IBM/'04   Sun/'07
Processors/chip         2          2           2         8
Threads/processor       1          2           2        16
Threads/chip            2          4           4       128

◼ The STI Cell processor (PS3) has 9 cores
◼ The Fermi NVIDIA Graphics Processing Unit (GPU) has 512 cores
◼ Intel has demonstrated an 80-core research chip

Advantages

◼ Because multiple cores sit on one chip, signals (data) between processors (cores) travel shorter distances, giving high-quality signals and allowing higher bandwidth (for cache-snooping circuitry, for example)
◼ Smaller size (PCB space) than multi-chip SMP designs
◼ Less power than multi-chip SMP designs, since signals have to be driven off the chip less often
◼ Multiple cores can share resources such as the L2 cache

Disadvantages/Challenges

◼ New OS and software support is needed to utilize multiple cores optimally
◼ The actual performance improvement may not be proportional to the number of cores
◼ If an application is memory-bandwidth-limited, the improvement on a shared-bus multicore system is low
◼ If inter-processor communication is the limiting factor, a dual-core chip is faster than two separate CPUs

Flynn's Taxonomy

◼ Single Instruction, Single Data (SISD) architecture
◼ Single Instruction, Multiple Data (SIMD) architecture
◼ Multiple Instruction, Multiple Data (MIMD) architecture

Processor Structures

Multi-core processor

A multi-core processor is a special kind of multiprocessor in which all processors are on the same chip; hence it is called a Chip Multiprocessor (CMP). Multi-core processors are MIMD: different cores execute different threads (multiple instructions), operating on different parts of memory (multiple data).
A multi-core processor is also a shared-memory multiprocessor: all cores share the same memory.

Intel Nehalem (Homogeneous Multicore)

AMD Shanghai (Homogeneous Multicore)

Homogeneous Multicore

◼ All cores are exactly the same, with the same instruction set architecture
◼ Easy to design and fabricate
◼ Easy to schedule coarse-grained computation on the cores
◼ Complex synchronization
◼ May not be highly efficient
◼ Memory latency is a major bottleneck

NVIDIA Fermi Architecture: 512 Processing Elements

Throughput-Oriented Architecture

◼ The computation should be abundantly (embarrassingly) parallel and fine-grained, with very little synchronization
◼ A massive number of threads can be run on a large number of cores
◼ Highly efficient
◼ Not suitable for general-purpose computation

IBM Cell Processor (Heterogeneous Multicore)

Heterogeneous Multicore

◼ Heterogeneous cores can provide different levels of compatibility between the processors
◼ More efficient designs at no expense in backward compatibility: slower backward-compatible cores can be combined with faster ones
◼ Cores with different instruction sets can be combined through a programmable layer that translates one into the other
◼ Decreased power consumption
  ◼ Low-power processors are usually more efficient
  ◼ Heterogeneous cores can balance performance and power consumption
◼ Application-specific instruction sets
  ◼ Higher efficiency
◼ High-performance cores
  ◼ A specialized instruction set for each core, tailored for a specific application
  ◼ High flexibility through software programmability
  ◼ High performance at low power consumption
