GPU Teaching Kit – Accelerated Computing

Lecture 1.1 – Course Introduction: Course Introduction and Overview

Course Goals
– Learn how to program heterogeneous parallel computing systems and achieve
  – High performance and energy-efficiency
  – Functionality and maintainability
  – Scalability across future generations
  – Portability across vendor devices
– Technical subjects
  – Parallel programming APIs, tools and techniques
  – Principles and patterns of parallel algorithms
  – Processor architecture features and constraints

People
– Wen-mei Hwu (University of Illinois)
– David Kirk (NVIDIA)
– Joe Bungo (NVIDIA)
– Mark Ebersole (NVIDIA)
– Abdul Dakkak (University of Illinois)
– Izzat El Hajj (University of Illinois)
– Andy Schuh (University of Illinois)
– John Stratton (Colgate College)
– Isaac Gelado (NVIDIA)
– John Stone (University of Illinois)
– Javier Cabezas (NVIDIA)
– Michael Garland (NVIDIA)

Course Content
Module 1 – Course Introduction
  – Course Introduction and Overview
  – Introduction to Heterogeneous Parallel Computing
  – Portability and Scalability in Heterogeneous Parallel Computing
Module 2 – Introduction to CUDA C
  – CUDA C vs. CUDA Libs vs. OpenACC
  – Memory Allocation and Data Movement API Functions
  – Data Parallelism and Threads
  – Introduction to CUDA Toolkit
Module 3 – CUDA Parallelism Model
  – Kernel-Based SPMD Parallel Programming
  – Multidimensional Kernel Configuration
  – Color-to-Greyscale Image Processing Example
  – Blur Image Processing Example
Module 4 – Memory Model and Locality
  – CUDA Memories
  – Tiled Matrix Multiplication
  – Tiled Matrix Multiplication Kernel
  – Handling Boundary Conditions in Tiling
  – Tiled Kernel for Arbitrary Matrix Dimensions
Module 5 – Kernel-based Parallel Programming
  – Histogram (Sort) Example
  – Basic Matrix-Matrix Multiplication Example
  – Thread Scheduling
  – Control Divergence
Module 6 – Performance Considerations: Memory
  – DRAM Bandwidth
  – Memory Coalescing in CUDA
Module 7 – Atomic Operations
  – Atomic Operations
Module 8 – Parallel Computation Patterns (Part 1)
  – Convolution
  – Tiled Convolution
  – 2D Tiled Convolution Kernel
Module 9 – Parallel Computation Patterns (Part 2)
  – Tiled Convolution Analysis
  – Data Reuse in Tiled Convolution
Module 10 – Performance Considerations: Parallel Computation Patterns
  – Reduction
  – Basic Reduction Kernel
  – Improved Reduction Kernel
  – Scan (Parallel Prefix Sum)
Module 11 – Parallel Computation Patterns (Part 3)
  – Work-Inefficient Parallel Scan Kernel
  – Work-Efficient Parallel Scan Kernel
  – More on Parallel Scan
Module 12 – Performance Considerations: Scan Applications
  – Scan Applications: Per-thread Output Variable Allocation
  – Scan Applications: Radix Sort
  – Performance Considerations (Histogram (Atomics) Example)
  – Performance Considerations (Histogram (Scan) Example)
Module 13 – Advanced CUDA Memory Model
  – Constant Memory
  – Texture Memory
Module 14 – Floating Point Considerations
  – Floating Point Precision Considerations
  – Numerical Stability
Module 15 – GPU as Part of the PC Architecture
  – GPU as Part of the PC Architecture
Module 16 – Efficient Host-Device Data Transfer
  – Data Movement API vs. Unified Memory
  – Pinned Host Memory
  – Task Parallelism/CUDA Streams
  – Overlapping Transfer with Computation
Module 17 – Application Case Study: Advanced MRI Reconstruction
  – Advanced MRI Reconstruction
Module 18 – Application Case Study: Electrostatic Potential Calculation
  – Electrostatic Potential Calculation (Part 1)
  – Electrostatic Potential Calculation (Part 2)
Module 19 – Computational Thinking for Parallel Programming
  – Computational Thinking for Parallel Programming
Module 20 – Related Programming Models: MPI
  – Joint MPI-CUDA Programming (Vector Addition - Main Function)
  – Joint MPI-CUDA Programming (Message Passing and Barrier)
  – Joint MPI-CUDA Programming (Data Server and Compute Processes)
  – Joint MPI-CUDA Programming (Adding CUDA)
  – Joint MPI-CUDA Programming (Halo Data Exchange)
Module 21 – CUDA Python Using Numba
  – CUDA Python Using Numba
Module 22 – Related Programming Models: OpenCL
  – OpenCL Data Parallelism Model
  – OpenCL Device Architecture
  – OpenCL Host Code (Part 1)
  – OpenCL Host Code (Part 2)
Module 23 – Related Programming Models: OpenACC
  – Introduction to OpenACC
  – OpenACC Subtleties
Module 24 – Related Programming Models: OpenGL
  – OpenGL and CUDA Interoperability
Module 25 – Dynamic Parallelism
  – Effective Use of Dynamic Parallelism
  – Advanced Architectural Features: Hyper-Q
Module 26 – Multi-GPU
  – Multi-GPU
Module 27 – Using CUDA Libraries
  – Example Applications Using Libraries: CUBLAS
  – Example Applications Using Libraries: CUFFT
  – Example Applications Using Libraries: CUSOLVER
Module 28 – Advanced Thrust
  – Advanced Thrust
Module 29 – Other GPU Development Platforms: QwickLABS
  – Other GPU Development Platforms: QwickLABS
  – Where to Find Support
Lecture 1.2 – Course Introduction: Introduction to Heterogeneous Parallel Computing

Objectives
– To learn the major differences between latency devices (CPU cores) and throughput devices (GPU cores)
– To understand why winning applications increasingly use both types of devices

Heterogeneous Parallel Computing
– Use the best match for the job (heterogeneity in a mobile SoC): latency cores, throughput cores, DSP cores, configurable logic/cores, on-chip memories, HW IPs, and cloud services

CPU and GPU Are Designed Very Differently
– CPU: latency-oriented cores (each core has its own control logic, large cache/local memory, registers, and SIMD unit)
– GPU: throughput-oriented compute units (small local caches, threading hardware, large register files, and many SIMD units per chip)

CPUs: Latency-Oriented Design
– Powerful ALUs
  – Reduced operation latency
– Large caches
  – Convert long-latency memory accesses into short-latency cache accesses
– Sophisticated control
  – Branch prediction for reduced branch latency
  – Data forwarding for reduced data latency

GPUs: Throughput-Oriented Design
– Small caches
  – To boost memory throughput
– Simple control
  – No branch prediction
  – No data forwarding
– Energy-efficient ALUs
  – Many, long-latency but heavily pipelined for high throughput
– Require a massive number of threads to tolerate latencies
  – Threading logic
  – Thread state

Winning Applications Use Both CPU and GPU
– CPUs for sequential parts where latency matters
  – CPUs can be 10X+ faster than GPUs for sequential code
– GPUs for parallel parts where throughput wins
  – GPUs can be 10X+ faster than CPUs for parallel code
– A minimal sketch of this division of labor follows this lecture's notes

GPU Computing Reading Resources
– 90 articles in two volumes

Heterogeneous Parallel Computing in Many Disciplines
– Financial analysis, scientific simulation, engineering simulation, data-intensive analytics, medical imaging, electronic design automation, digital audio processing, digital video processing, computer vision, biomedical informatics, statistical modeling, numerical methods, ray tracing, interactive physics, and rendering
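Note on Lecture 1.2: the division of labor described above (sequential, latency-sensitive work on the CPU; massively threaded, throughput-bound work on the GPU) can be sketched in a few lines of CUDA C. The example below is illustrative only and is not taken from the Teaching Kit; the kernel name scaleAdd, the array size, and the launch configuration are assumptions chosen to show the pattern of CPU-side setup plus a kernel launched across thousands of threads.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Throughput-oriented part: one GPU thread per element. Thousands of
// threads in flight let the hardware hide long DRAM access latencies.
__global__ void scaleAdd(const float *x, const float *y, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * x[i] + y[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Latency-oriented part: the CPU handles allocation, initialization,
    // data movement, and all sequential control flow.
    float *hx = (float *)malloc(bytes);
    float *hy = (float *)malloc(bytes);
    float *hout = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy, *dout;
    cudaMalloc((void **)&dx, bytes);
    cudaMalloc((void **)&dy, bytes);
    cudaMalloc((void **)&dout, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // Launch far more threads than a CPU has cores: 4096 blocks of 256 threads.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleAdd<<<blocks, threadsPerBlock>>>(dx, dy, dout, n);

    cudaMemcpy(hout, dout, bytes, cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", hout[0]);

    cudaFree(dx); cudaFree(dy); cudaFree(dout);
    free(hx); free(hy); free(hout);
    return 0;
}

The CUDA details here (memory allocation, data movement, kernel configuration) are the subject of Modules 2 and 3; the point of the sketch is only that the sequential orchestration stays on the latency-oriented CPU while the data-parallel loop becomes thousands of GPU threads.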
Lecture 1.3 – Course Introduction: Portability and Scalability in Heterogeneous Parallel Computing

Objectives
– To understand the importance and nature of scalability and portability in parallel programming

Software Dominates System Cost
– SW lines per chip increase at 2x per 10 months
– HW gates per chip increase at 2x per 18 months
– Future systems must minimize software redevelopment

Keys to Software Cost Control
– Scalability
  – The same application runs efficiently on new generations of cores
  – The same application runs efficiently on more of the same cores

More on Scalability
– Performance growth with HW generations
  – Increasing number of compute units (cores)
  – Increasing number of threads
  – Increasing vector length
  – Increasing pipeline depth
  – Increasing DRAM burst size
  – Increasing number of DRAM channels
  – Increasing data movement latency
– The programming style we use in this course supports scalability through fine-grained problem decomposition and dynamic thread scheduling (a minimal sketch follows this lecture's notes)

Keys to Software Cost Control (continued)
– Portability
  – The same application runs efficiently on different types of cores
  – The same application runs efficiently on systems with different organizations and interfaces

More on Portability
– Portability across many different HW types
  – Across ISAs (Instruction Set Architectures): x86 vs. ARM, etc.
  – Latency-oriented CPUs vs. throughput-oriented GPUs
  – Across parallelism models: VLIW vs. SIMD vs. threading
  – Across memory models: shared memory vs. distributed memory

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
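Note on Lecture 1.3: the scalability claim above (fine-grained problem decomposition plus dynamic thread scheduling) is often illustrated with a grid-stride loop. The sketch below is an assumption, not part of the Teaching Kit; the names saxpy and launchSaxpy and the choice of 32 blocks per multiprocessor are illustrative. The idea is that one unchanged kernel, decomposed into many small independent work items, runs efficiently on devices with different numbers of compute units because thread blocks are scheduled onto the hardware dynamically.

#include <cuda_runtime.h>

// Grid-stride loop: the same kernel runs unchanged whether the GPU has
// 2 compute units or 100. Each thread handles elements i, i + stride,
// i + 2*stride, ..., so the decomposition stays fine-grained and the
// hardware scheduler places thread blocks on whatever compute units
// the device provides.
__global__ void saxpy(float a, const float *x, float *y, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

// Host side: only the launch configuration is scaled to the device; the
// kernel source stays the same across hardware generations.
void launchSaxpy(float a, const float *dx, float *dy, int n) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);           // query the current device
    int threadsPerBlock = 256;
    int blocks = 32 * prop.multiProcessorCount;  // more SMs, bigger grid
    saxpy<<<blocks, threadsPerBlock>>>(a, dx, dy, n);
}

This mirrors the scalability goal stated above: the kernel expresses what each fine-grained work item does, and how those items map onto a particular chip is decided by the launch configuration and the hardware scheduler.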