20240122-DS642-Lecture-01.pdf
DS 642: Applications of Parallel Computing
Lecture 1, 01/22/2024
http://www.cs.njit.edu/~bader

David A. Bader
Distinguished Professor and Director, Institute for Data Science
IEEE Fellow, ACM Fellow, SIAM Fellow, AAAS Fellow
IEEE Sidney Fernbach Award
2022 inductee into University of Maryland's Innovation Hall of Fame, A. James Clark School of Engineering
Recent service: White House's National Strategic Computing Initiative (NSCI) panel; Computing Research Association Board; Chair, NSF Committee of Visitors for Office of Advanced Cyberinfrastructure; NSF Advisory Committee on Cyberinfrastructure; Council on Competitiveness HPC Advisory Committee; IEEE Computer Society Board of Governors; IEEE IPDPS Steering Committee; Editor-in-Chief, ACM Transactions on Parallel Computing; Editor-in-Chief, IEEE Transactions on Parallel and Distributed Systems
Over $188M of research awards; 300+ publications, ≥ 13,600 citations, h-index ≥ 66; National Science Foundation CAREER Award recipient
Directed: Facebook AI Systems; NVIDIA GPU Center of Excellence, NVIDIA AI Lab (NVAIL); Sony-Toshiba-IBM Center for the Cell/B.E. Processor
Founder: Graph500 List, benchmarking "Big Data" platforms
Recognized as a "RockStar" of High Performance Computing by InsideHPC in 2012 and as one of HPCwire's People to Watch in 2012 and 2014.

2021 IEEE Sidney Fernbach Award
David Bader cited for the development of Linux-based massively parallel production computers and for pioneering contributions to scalable discrete parallel algorithms for real-world applications. 2022 IEEE Computer Society President Bill Gropp presents David Bader with the Sidney Fernbach Award at SC21.

1998: Bader Invents the Linux Supercomputer Roadrunner
"This effort of yours has enormous historic resonance." – Larry Smarr, Distinguished Professor Emeritus, UC San Diego; Founding Director of NCSA; Founding Director of Calit2
Source: UC San Diego, https://ucsdnews.ucsd.edu/pressrelease/pioneering-scientist-and-innovator-larry-smarr-retires

Impact: Top500 Supercomputers Running Linux
"Today, 100% of the Top 500 supercomputers in the world are Linux HPC systems, based on Bader's technical contributions and leadership. This is one of the most significant technical foundations of HPC." – Steve Wallach, guest scientist for Los Alamos National Laboratory and 2008 IEEE CS Seymour Cray Computer Engineering Award recipient. Photo credit: Information Week, 2008.
[Chart: share of Top500 systems running Linux rising from 50% to 90% to 100%. Source: http://www.Top500.org/]

Outline
Syllabus
Canvas
My research

New Jersey Institute of Technology
"NJIT Climbs the Rankings of U.S. News & World Report, A Top 50 Public University" – 13 Sep 2021
"NJIT Named As One of Nation's 'Best Colleges' for 2022, The Princeton Review Says" – 6 Sep 2021
"Wall Street Journal/College Pulse Ranks NJIT No. 2 Public University in the US" – 6 Sep 2023

DS 642, Spring 2024: Applications of Parallel Computing
This course will teach students how to design, analyze, and implement parallel programs for high performance computational science and engineering applications. The course focuses on advanced computer architectures, parallel algorithms, parallel languages, and performance-oriented computing. Students will develop knowledge and skills to efficiently solve challenging problems in science and engineering, where very fast computers are required either to perform complex simulations or to analyze enormous datasets.
Course Based on
Georgia Tech CSE 6140 (Computational Science and Engineering Algorithms) and Berkeley CS267 (Applications of Parallel Computing)
https://sites.google.com/lbl.gov/cs267-spr2023

Course Plan
See Canvas Calendar; 14 meetings.

Exams (during class)
Midterm: Monday, March 4, 2024
Final Projects: Monday, May 6, 2024
Please let me know in advance if you have a conflict with these exam dates.

Evaluation / Grading
Attendance 5%
Homework 20%
Midterm 25%
Final Project 50%

Policies
Late Policy: Students are expected to complete work on schedule. Late work is not accepted unless prior arrangements are made with the instructor.
Academic Integrity and Student Conduct: "Academic Integrity is the cornerstone of higher education and is central to the ideals of this course and the university. Cheating is strictly prohibited and devalues the degree that you are working on. As a member of the NJIT community, it is your responsibility to protect your educational investment by knowing and following the academic code of integrity policy that is found at: http://www.njit.edu/policies/sites/policies/files/academic-integrity-code.pdf"

Lectures
Slides will be uploaded to Canvas.

Office Hours
Monday, 5-5:30pm, in Jersey City, or by appointment.

Contact
Email: [email protected]. Use email for questions and help.
Teaching Assistant: A M Muntasir Rahman
TA Office Hours: Tuesday, 11:30 am to 1:00 pm, https://njit.webex.com/meet/ar238

NJIT Verification of Students
Canvas roll-call

DreamWorks Presents the Power of Supercomputing
https://www.youtube.com/watch?v=TGSRvV9u32M (8 minutes)

Topics
- Introduction to Single Processor Machines and Parallel Computing
- Optimizing/Tuning Matrix Multiplication
- Shared-Memory Programming, Memory Hierarchies, Multicore and Manycore
- An Introduction to GPGPU Programming with CUDA
- Distributed Memory Machines and Programming, Advanced MPI and Collective Communication
- Parallel Matrix Multiply, Dense Linear Algebra, Sparse Matrix-Vector Multiplication
- Fast Fourier Transform
- Parallel Graph Algorithms
- Partitioning Applications for Heterogeneous Resources, Dynamic Load Balancing
- Machine Learning, Cloud Computing and Big Data Processing
- Measuring Performance, Identifying Bottlenecks
- Advanced Topics in Parallel Programming
- Project Presentations

Cyber Innovations for Solving Global Grand Challenges
LANL Roadrunner with IBM Cell B.E.: Top500 No. 1 system from June 2008 to June 2009
Intel HIVE processor (2019)
IBM Watson with POWER7/8: won Jeopardy! in Feb 2011
NVIDIA GPUs: used in 127 of the Top500 systems, incl. the top 2 (in the USA) and the fastest in Europe and Japan (Nov. 2018)
Cray XMT with ThreadStorm processors: massively multithreaded architecture
IBM BlueGene/Q: record-breaking performance, over 10 PF sustained on science apps

Exascale Computing and Big Data
By Daniel A. Reed and Jack Dongarra, July 2015, Communications of the ACM
https://vimeo.com/129742718

Supercomputers aid researchers in hunt for COVID-19 answers
https://youtu.be/qXHAu7HDhNE
Graph500 Benchmark, www.graph500.org
Defining a new set of benchmarks to guide the design of hardware architectures and software systems intended to support such applications and to help procurements. Graph algorithms are a core part of many analytics workloads.
Executive Committee: D.A. Bader, R. Murphy, M. Snir, A. Lumsdaine
Five business-area data sets:
- Cybersecurity: 15 billion log entries/day (for large enterprises); full data scan with end-to-end join required
- Medical informatics: 50M patient records, 20-200 records/patient, billions of individuals; entity resolution important
- Data enrichment: easily PB of data; example: Maritime Domain Awareness, with hundreds of millions of transponders, tens of thousands of cargo ships, and tens of millions of pieces of bulk cargo; may involve additional data (images, etc.)
- Symbolic networks: example, the human brain, with 25B neurons and 7,000+ connections/neuron
- Social networks: example, Facebook, Twitter; nearly unbounded dataset size

Special Lecture: February 5 (5pm to 6pm) (watch for class announcement)
Solving Global Grand Challenges with High Performance Data Analytics
Data science aims to solve grand global challenges such as: detecting and preventing disease in human populations; revealing community structure in large social networks; protecting our elections from cyberthreats; and improving the resilience of the electric power grid. Unlike traditional applications in computational science and engineering, solving these social problems at scale often raises new challenges because of the sparsity and lack of locality in the data, the need for research on scalable algorithms and architectures, and development of frameworks for solving these real-world problems on high performance computers, and for improved models that capture the noise and bias inherent in the torrential data streams. In this talk, Bader will discuss the opportunities and challenges in massive data science for applications in social sciences, physical sciences, and engineering.

What's a Parallel Computer?

It's all about the need for speed.

Parallel Computing: Faster Solutions
Using multiple processors in parallel to solve problems more quickly than with a single processor.
Example: compute the prime factors of 1 billion numbers (45, 12, 66, ..., 13). If we had 1 million processors...

Sum (reduction) in parallel
Add n values, e.g., 1 3 1 0 4 -6 3 2.
Serial: O(n). Parallel: O(log n), using n processors. Takes advantage of associativity in +.

"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" – Seymour Cray

What is a Parallel Computer?
[Diagrams of three organizations: processors sharing one memory over a network (Shared Memory (SMP) or Multicore); processors each with their own memory connected by a network (High Performance Computing (HPC) or Distributed Memory); one processor driving multiple +/x/- functional units (Single Instruction Multiple Data (SIMD))]

What is a Parallel Computer? Shared Memory (SMP) or Multicore
A shared memory multiprocessor (SMP*) is built by connecting multiple processors to a single memory system. A multicore processor contains multiple processors (cores) on a single chip. (*Technically, SMP stands for "Symmetric Multi-Processor.")

What is a Parallel Computer? High Performance Computing (HPC) or Distributed Memory
A distributed memory multiprocessor has processors with their own memories connected by a high speed network; it is also called a cluster. A high performance computing (HPC) system contains 100s or 1000s of such processors (nodes).
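To make the distributed-memory picture concrete, here is a minimal sketch (illustrative, not from the course materials) of the earlier sum-reduction slide written for a cluster using MPI. Each rank sums its own slice of the data, and MPI_Reduce combines the partial sums across ranks, typically with the same O(log P) tree combination the reduction slide describes; the per-rank values below are placeholders.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Placeholder data: each rank owns n values living in its own memory. */
    const int n = 1000000;
    double local_sum = 0.0;
    for (int i = 0; i < n; i++)
        local_sum += 1.0;                      /* stand-in for real values */

    /* Combine the partial sums; associativity of + lets MPI use any
       combining order, e.g., a log(P)-depth tree. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %.1f over %d ranks\n", global_sum, nprocs);
    MPI_Finalize();
    return 0;
}

Built and launched in the usual way (for example, mpicc sum.c -o sum and mpirun -np 4 ./sum), each process runs this same program with its own memory, which is exactly the distributed-memory model sketched above.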
What is a Parallel Computer? Single Instruction Multiple Data (SIMD)
A Single Instruction Multiple Data (SIMD) computer has multiple processors (or functional units) that perform the same operation on multiple data elements at once. Most single processors have SIMD units with ~2-8 way parallelism. Graphics processing units (GPUs) use this as well.

Why is high-performance computing often synonymous with parallel computing? Why do we care so much about interconnect and communication?
"Performance = parallelism. Efficiency = locality." – Bill Dally (NVIDIA and Stanford)

What's not a Parallel Computer?

Concurrency vs. Parallelism
Concurrency: multiple tasks are logically active at one time.
Parallelism: multiple tasks are actually active at one time.
[Diagram: concurrent, non-parallel execution vs. concurrent, parallel execution]
Slide source: Tim Mattson, Intel

Parallel Computer vs. Distributed System
A distributed system is inherently distributed, i.e., serving clients at different locations (clients send requests to a server and receive responses). A parallel computer may use distributed memory (multiple processors with their own memory) for more performance.

The Fastest Computers (for Science) Have Been Parallel for a Long Time
Fastest computers in the world: top500.org. LBNL's Cori computer has over 680,000 cores and ~30 petaflops (10^15 math operations / second). Supercomputing is done by parallel programming.

Some of the World's Fastest Computers: The Top500 List

Units of Measure for HPC
High Performance Computing (HPC) units are:
- Flop: floating point operation, usually double precision unless noted
- Flop/s: floating point operations per second
- Bytes: size of data (a double precision floating point number is 8 bytes)
Typical sizes are millions, billions, trillions...
Kilo:  Kflop/s = 10^3 flop/s;  Kbyte = 10^3 ~ 2^10 = 1024 bytes (KiB)
Mega:  Mflop/s = 10^6 flop/s;  Mbyte = 10^6 ~ 2^20 bytes (MiB)
Giga:  Gflop/s = 10^9 flop/s;  Gbyte = 10^9 ~ 2^30 bytes (GiB)
Tera:  Tflop/s = 10^12 flop/s; Tbyte = 10^12 ~ 2^40 bytes (TiB)
Peta:  Pflop/s = 10^15 flop/s; Pbyte = 10^15 ~ 2^50 bytes (PiB)
Exa:   Eflop/s = 10^18 flop/s; Ebyte = 10^18 ~ 2^60 bytes (EiB)
Zetta: Zflop/s = 10^21 flop/s; Zbyte = 10^21 ~ 2^70 bytes (ZiB)
Yotta: Yflop/s = 10^24 flop/s; Ybyte = 10^24 ~ 2^80 bytes (YiB)
We are here: current fastest (public) machines are petaflop systems; up-to-date list at www.top500.org.
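As a small, hedged illustration of these units (not part of the lecture), the sketch below times a naive double-precision matrix multiply in C. Multiplying two n x n matrices takes 2*n^3 flops, so dividing by the elapsed time gives Gflop/s, the same kind of number (at a vastly smaller scale) that Linpack's Rmax reports for the TOP500 machines discussed next; the size n = 512 is an arbitrary choice.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const int n = 512;                         /* small enough to run anywhere */
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = calloc((size_t)n * n, sizeof(double));
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++)                /* naive triple loop: 2*n^3 flops */
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    double flops = 2.0 * n * n * n;
    printf("%.3f s, %.2f Gflop/s\n", secs, flops / secs / 1e9);
    free(A); free(B); free(C);
    return 0;
}

A single laptop core typically reaches only a few Gflop/s with this untuned loop, which is one reason the course devotes a lecture to optimizing and tuning matrix multiplication.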
The TOP500 Project
The 500 most powerful computers in the world, updated twice a year:
- ISC'xy in June in Germany
- SCxy in November in the U.S.
All information available from the TOP500 web site at: www.top500.org
Yardstick: floating point operations per second (Flop/s), the Rmax of Linpack: solve Ax = b, where matrix A is dense with random entries; the computation is dominated by dense matrix-matrix multiply.

The TOP10, November 2021 list:
1. Fugaku (Supercomputer Fugaku: A64FX 48C 2.2GHz, Tofu interconnect D), RIKEN Center for Computational Science, Fujitsu, Japan: 7,630,848 cores, 442.0 Pflop/s, 29.9 MW
2. Summit (IBM Power System: P9 22C 3.07GHz, Mellanox EDR, NVIDIA GV100), Oak Ridge National Laboratory, IBM, USA: 2,414,592 cores, 148.6 Pflop/s, 10.1 MW
3. Sierra (IBM Power System: P9 22C 3.1GHz, Mellanox EDR, NVIDIA GV100), Lawrence Livermore National Laboratory, IBM, USA: 1,572,480 cores, 94.6 Pflop/s, 7.4 MW
4. Sunway TaihuLight (Sunway SW26010 260C 1.45GHz), National Supercomputing Center in Wuxi, NRCPC, China: 10,649,600 cores, 93.0 Pflop/s, 15.4 MW
5. Perlmutter (HPE Cray EX235n: AMD EPYC 7763 64C 2.45GHz, NVIDIA A100, Slingshot-10), Lawrence Berkeley National Laboratory (NERSC), HPE, USA: 761,856 cores, 70.8 Pflop/s, 2.59 MW
6. Selene (DGX A100 SuperPOD: AMD 64C 2.25GHz, NVIDIA A100, Mellanox HDR), NVIDIA Corporation, NVIDIA, USA: 555,520 cores, 63.5 Pflop/s, 2.65 MW
7. Tianhe-2A (NUDT TH-IVB-FEP: Xeon 12C 2.2GHz, Matrix-2000), National University of Defense Technology, NUDT, China: 4,981,760 cores, 61.4 Pflop/s, 18.5 MW
8. JUWELS Booster Module (BullSequana XH2000: AMD EPYC 24C 2.8GHz, NVIDIA A100, Mellanox HDR), Forschungszentrum Jülich (FZJ), Atos, Germany: 449,280 cores, 44.1 Pflop/s, 1.76 MW
9. HPC5 (PowerEdge C4140: Xeon 24C 2.1GHz, NVIDIA Tesla V100, Mellanox HDR), Eni S.p.A, Dell EMC, Italy: 669,760 cores, 35.5 Pflop/s, 2.25 MW
10. Voyager-EUS2 (ND96amsr_A100_v4: AMD EPYC 7V12 48C 2.45GHz, NVIDIA A100, Mellanox HDR), Azure East US 2, MS Azure, USA: 253,440 cores, 30.0 Pflop/s

The TOP10, June 2022 list:
1. Frontier (HPE Cray EX235a: AMD EPYC 64C 2.0GHz, Instinct MI250X, Slingshot-10), Oak Ridge National Laboratory, HPE, USA: 8,730,112 cores, 1,102 Pflop/s, 21.1 MW
2. Fugaku (Supercomputer Fugaku: A64FX 48C 2.2GHz, Tofu interconnect D), RIKEN Center for Computational Science, Fujitsu, Japan: 7,630,848 cores, 442.0 Pflop/s, 29.9 MW
3. LUMI (HPE Cray EX235a: AMD EPYC 64C 2.0GHz, Instinct MI250X, Slingshot-10), EuroHPC / CSC, HPE, Finland: 1,268,736 cores, 151.9 Pflop/s, 2.9 MW
4. Summit (IBM Power System: P9 22C 3.07GHz, Mellanox EDR, NVIDIA GV100), Oak Ridge National Laboratory, IBM, USA: 2,414,592 cores, 148.6 Pflop/s, 10.1 MW
5. Sierra (IBM Power System: P9 22C 3.1GHz, Mellanox EDR, NVIDIA GV100), Lawrence Livermore National Laboratory, IBM, USA: 1,572,480 cores, 94.6 Pflop/s, 7.4 MW
6. Sunway TaihuLight (Sunway SW26010 260C 1.45GHz), National Supercomputing Center in Wuxi, NRCPC, China: 10,649,600 cores, 93.0 Pflop/s, 15.4 MW
7. Perlmutter (HPE Cray EX235n: AMD EPYC 64C 2.45GHz, NVIDIA A100, Slingshot-10), NERSC - Lawrence Berkeley National Laboratory, HPE, USA: 761,856 cores, 70.9 Pflop/s, 2.6 MW
8. Selene (DGX A100 SuperPOD: AMD 64C 2.25GHz, NVIDIA A100, Mellanox HDR), NVIDIA Corporation, NVIDIA, USA: 555,520 cores, 63.5 Pflop/s, 2.7 MW
9. Tianhe-2A (NUDT TH-IVB-FEP: Xeon 12C 2.2GHz, Matrix-2000), National University of Defense Technology, NUDT, China: 4,981,760 cores, 61.4 Pflop/s, 18.5 MW
10. Adastra (HPE Cray EX235a: AMD EPYC 64C 2.0GHz, Instinct MI250X, Slingshot-10), GENCI-CINES, HPE, France: 319,072 cores, 46.1 Pflop/s, 0.9 MW
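A useful number to derive from these tables (a reader's back-of-the-envelope calculation, not a figure from the slides) is energy efficiency, Rmax divided by power. For Frontier on the June 2022 list: 1,102 Pflop/s = 1.102 x 10^18 flop/s and 21.1 MW = 2.11 x 10^7 W, so 1.102e18 / 2.11e7 ≈ 5.2 x 10^10 flop/s per watt, or roughly 52 Gflop/s per watt; for comparison, the mixed-precision HPL run described below reports 67 Gflop/s per watt on Fugaku.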
The TOP10, November 2022 list:
1. Frontier (HPE Cray EX235a: AMD EPYC 64C 2.0GHz, Instinct MI250X, Slingshot-11), Oak Ridge National Laboratory, HPE, USA: 8,730,112 cores, 1,102 Pflop/s, 21.1 MW
2. Fugaku (Supercomputer Fugaku: A64FX 48C 2.2GHz, Tofu interconnect D), RIKEN Center for Computational Science, Fujitsu, Japan: 7,630,848 cores, 442.0 Pflop/s, 29.9 MW
3. LUMI (HPE Cray EX235a: AMD EPYC 64C 2.0GHz, Instinct MI250X, Slingshot-11), EuroHPC / CSC, HPE, Finland: 2,069,760 cores, 309.1 Pflop/s, 6.0 MW
4. Leonardo (Atos BullSequana XH2000: Xeon 32C 2.6GHz, NVIDIA A100, HDR InfiniBand), EuroHPC / CINECA, Atos, Italy: 1,463,616 cores, 174.7 Pflop/s, 5.6 MW
5. Summit (IBM Power System: P9 22C 3.07GHz, Mellanox EDR, NVIDIA GV100), Oak Ridge National Laboratory, IBM, USA: 2,414,592 cores, 148.6 Pflop/s, 10.1 MW
6. Sierra (IBM Power System: P9 22C 3.1GHz, Mellanox EDR, NVIDIA GV100), Lawrence Livermore National Laboratory, IBM, USA: 1,572,480 cores, 94.6 Pflop/s, 7.4 MW
7. Sunway TaihuLight (Sunway SW26010 260C 1.45GHz), National Supercomputing Center in Wuxi, NRCPC, China: 10,649,600 cores, 93.0 Pflop/s, 15.4 MW
8. Perlmutter (HPE Cray EX235n: AMD EPYC 64C 2.45GHz, NVIDIA A100, Slingshot-10), NERSC - Lawrence Berkeley National Laboratory, HPE, USA: 761,856 cores, 70.9 Pflop/s, 2.6 MW
9. Selene (DGX A100 SuperPOD: AMD 64C 2.25GHz, NVIDIA A100, Mellanox HDR), NVIDIA Corporation, NVIDIA, USA: 555,520 cores, 63.5 Pflop/s, 2.7 MW
10. Tianhe-2A (NUDT TH-IVB-FEP: Xeon 12C 2.2GHz, Matrix-2000), National University of Defense Technology, NUDT, China: 4,981,760 cores, 61.4 Pflop/s, 18.5 MW

Top 500, June 2023 list (US-5, EU-2, China-2, Japan-1); columns: rank, system and site, # cores, Rmax (Pflop/s), Rpeak (Pflop/s), power (kW):
1. Frontier - HPE Cray, DOE/SC/Oak Ridge National Laboratory, United States: 8,699,904; 1,194.00; 1,679.82; 22,703
2. Supercomputer Fugaku - RIKEN Center for Computational Science, Japan: 7,630,848; 442.01; 537.21; 29,899
3. LUMI - EuroHPC/CSC, Finland: 2,220,288; 309.10; 428.70; 6,016
4. Leonardo - EuroHPC/CINECA, Italy: 1,824,768; 238.70; 304.47; 7,404
5. Summit - DOE/SC/Oak Ridge National Laboratory, United States: 2,414,592; 148.60; 200.79; 10,096
6. Sierra - DOE/NNSA/LLNL, United States: 1,572,480; 94.64; 125.71; 7,438
7. Sunway TaihuLight - National Supercomputing Center in Wuxi, China: 10,649,600; 93.01; 125.44; 15,371
8. Perlmutter - DOE/SC/LBNL/NERSC, United States: 761,856; 70.87; 93.75; 2,589
9. Selene - NVIDIA Corporation, United States: 555,520; 63.46; 79.22; 2,646
10. Tianhe-2A - National Super Computer Center in Guangzhou, China: 4,981,760; 61.44; 100.68; 18,482
© Hyperion Research 2023

Frontier (#1) System Overview
System performance: peak performance of 1.6 double-precision exaflops; measured Top500 performance (Rmax) was 1.102 exaflops.
Each node has: a 3rd Gen AMD EPYC CPU with 64 cores; 4 purpose-built AMD Instinct MI250X GPUs; 4 x 128 GB of fast memory (one per GPU); 5 terabytes of flash memory.
The system includes 9,472 nodes connected by the Slingshot interconnect.

Fugaku (#2) System Overview
RIKEN Center for Computational Science (R-CCS).
System performance: 442 petaflops (the TOP500 Rmax); 2.0 Eflop/s on a different, mixed-precision benchmark.
Each node has: a Fujitsu A64FX CPU (48+4 cores) and 32 GiB of HBM2.
The system includes 158,976 nodes and the custom Tofu Interconnect D; storage is 1.6 TB NVMe SSD per 16 nodes (L1), a 150 PB Lustre filesystem (L2), and cloud storage (L3).

Perlmutter at NERSC (#1 in Berkeley)
GPU partition: 1,792 nodes; peak 63.8 Pflop/s; per node: 1 AMD EPYC 7763, 2 NVIDIA A100 GPUs, 256 GB memory, 40 GB HBM GPU memory and 1.55 TB/s memory bandwidth per GPU, 4 Slingshot 11 NICs.
CPU partition: 3,072 nodes; peak 7.7 Pflop/s; per node: 2 AMD EPYC 7763, 512 GB memory, 204.8 GB/s memory bandwidth per CPU, 1 Slingshot 11 NIC.

Performance History and Projection
[Chart: TOP500 performance development and projection on a log scale from 100 Mflop/s to 100 Eflop/s, with curves for SUM, N=1, and N=500.]

Other Algorithms / Arithmetic
A mixed-precision iterative refinement approach solved a matrix of order 16,957,440 on Fugaku.
- Fugaku is composed of nodes built around Fujitsu's Arm A64FX processor.
- The run used 158,976 nodes of Fugaku (7,630,848 cores).
- It used a random matrix with large diagonal elements to ensure convergence.
Mixed-precision HPL achieved 2.004 Eflop/s:
- 4.5x over double-precision HPL (442 Pflop/s)
- 67 Gflops/Watt
- the same accuracy compared to full 64-bit precision.
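The sketch below is an illustrative toy (not the Fugaku mixed-precision HPL code) showing the core idea of mixed-precision iterative refinement on a small dense system: factor and solve in single precision (the fast step), then compute the residual r = b - Ax in double precision and solve for a correction in single precision again. The matrix is made strongly diagonally dominant, mirroring the slide's note about large diagonal elements, so elimination without pivoting is safe; the size N = 64 and the three refinement sweeps are arbitrary choices.

#include <stdio.h>
#include <stdlib.h>

#define N 64

/* Solve A*x = b by Gaussian elimination carried out in single precision
   (no pivoting; safe here because A is strongly diagonally dominant). */
static void solve_sp(const double *A, const double *b, double *x) {
    float a[N][N], y[N];
    for (int i = 0; i < N; i++) {
        y[i] = (float)b[i];
        for (int j = 0; j < N; j++) a[i][j] = (float)A[i * N + j];
    }
    for (int k = 0; k < N; k++)
        for (int i = k + 1; i < N; i++) {
            float m = a[i][k] / a[k][k];
            for (int j = k; j < N; j++) a[i][j] -= m * a[k][j];
            y[i] -= m * y[k];
        }
    for (int i = N - 1; i >= 0; i--) {
        float s = y[i];
        for (int j = i + 1; j < N; j++) s -= a[i][j] * (float)x[j];
        x[i] = s / a[i][i];
    }
}

int main(void) {
    double A[N * N], b[N], x[N], d[N], r[N];
    srand(1);
    for (int i = 0; i < N; i++) {
        b[i] = rand() / (double)RAND_MAX;
        for (int j = 0; j < N; j++)           /* large diagonal => convergence */
            A[i * N + j] = rand() / (double)RAND_MAX + (i == j ? 2.0 * N : 0.0);
    }
    solve_sp(A, b, x);                        /* initial solve in single precision */
    for (int it = 0; it < 3; it++) {
        for (int i = 0; i < N; i++) {         /* residual r = b - A*x in double */
            r[i] = b[i];
            for (int j = 0; j < N; j++) r[i] -= A[i * N + j] * x[j];
        }
        solve_sp(A, r, d);                    /* correction, again in single */
        double rnorm = 0.0;
        for (int i = 0; i < N; i++) { x[i] += d[i]; rnorm += r[i] * r[i]; }
        printf("sweep %d: ||r||^2 = %.3e\n", it, rnorm);
    }
    return 0;
}

Each sweep drives the residual toward double-precision accuracy even though all of the expensive O(N^3) work is done in single precision; the Fugaku run applies the same idea at enormous scale with even lower precision in the fast step, which is where the 4.5x speedup over double-precision HPL comes from.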
From Vector Supercomputers to Massively Parallel Accelerator Systems
Vector supercomputers were programmed by "annotating" serial programs; massively parallel accelerator systems are programmed by completely rethinking algorithms and software for parallelism.

[Chart: number of TOP500 systems with accelerators or co-processors per year, 2006-2022, by type: NVIDIA Fermi, Kepler, Pascal, Volta, Turing, Ampere; IBM Cell; ClearSpeed; Intel Xeon Phi (co-processor and main); AMD Instinct; ATI Radeon; others.]
Over 1/3 of all systems have accelerators or co-processors; the actual "performance share" is higher.

New Systems 2022 / Accelerators
By number of systems: None 61%, NVIDIA 28%, AMD GPU 11%. By sum of performance: AMD GPU 78%, NVIDIA 15%, None 7%.

New Systems 2022 / Main Processor
By number of systems: Intel 61%, AMD 36%, NEC 3%. By sum of performance: AMD 84%, Intel 16%, NEC 0%. Some real competition is (re)-emerging.

Average System Age
[Chart: average age in months of TOP500 systems, 1995-2020, rising from roughly 7.6 months to over 20 months.]
Less incentive to upgrade systems leads to older systems.

Highlights of the 62nd Top 500 List (SC23)
https://youtu.be/CzufXS2jY20?si=xFo1U0xLqj569rXy&t=1780 (30 minutes, Erich Strohmaier)

Gordon Bell Prizes: Science at Scale
Established in 1987, with a cash award of $10,000 (since 2011), funded by Gordon Bell, a pioneer in HPC. Awarded for innovation in applying HPC to applications in science, engineering, and data analytics.
Gordon Bell Prizes vs Top 500
Flop/s: floating-point operations per second.
[Chart: Gordon Bell Prize performance and Top500 #1 performance, 1990-2020, on a log scale from 1e8 to 1e19 flop/s; the most recent Gordon Bell points report "AI-flops".]

The Exascale Market (System Acceptances)
© Hyperion Research 2023

94.3% of Sites Have Accelerators in Their Largest System Today
Up from 82.7% having accelerators in 2021. [Charts: in mid-2021 vs. in late 2022.]
© Hyperion Research 2023

Accelerator Plans for Next Purchases
From our recent end-user MCS study.
© Hyperion Research 2023

GPU/Accelerator Forecast
Anticipated high growth for accelerators over the next 5 years.
© Hyperion Research 2023

HPC Cloud Usage Forecast
17.9% growth over the next 5 years.
© Hyperion Research 2023

AI Forecast
22.7% growth over the next 5 years.
© Hyperion Research 2023

SCIENCE USING HIGH PERFORMANCE COMPUTING

Simulation: The Third Pillar of Science
Theory, Experiment, Simulation.

Simulation in Science and Engineering
High performance simulation is used to understand things that are too big, too small, too fast, too slow, too expensive, or too dangerous for experiments. Examples: understanding the universe, proteins and diseases, energy-efficient jet engines, climate change.

Simulations Show the Effects of Climate Change in Hurricanes
Michael Wehner and Prabhat, Berkeley Lab

Faster Computers: More Detail
Michael Wehner, Prabhat, Chris Algieri, Fuyu Li, Bill Collins, Lawrence Berkeley National Laboratory; Kevin Reed, University of Michigan; Andrew Gettelman, Julio Bacmeister, Richard Neale, National Center for Atmospheric Research

HPC for Astrophysics
Expanding debris from a supernova explosion (red) running over and shredding a nearby star (blue). LIGO and Virgo observations match earlier simulations of gravitational waves from a neutron star merger; simulations predict ~200 earth masses of gold and ~500 of platinum.
Dan Kasen, UCB Astronomy/Physics + LBNL

HPC for Energy Efficiency in Industry
The paper industry is the 4th largest energy consumer in the US. Chombo-Pulp: apply an adaptive embedded boundary solver to resolve flow around pulp fibers and in the felt pore space, using adaptive mesh refinement and interface tracking.

High Throughput HPC for Materials Design
Design of materials for batteries, solar panels, and more. Software, supercomputers, data, and screening; > 40,000 users. 530,243 nanoporous materials, 131,613 inorganic compounds, 76,194 band structures, 49,705 molecules. Use of Bayesian optimization for layered materials [Bassman et al., npj Computational Materials 2018].
Kristin Persson, Gerd Ceder, MSE UCB and LBNL, Materials Project

HPC for Carbon Capture
Metal Organic Frameworks (MOFs) to capture carbon in natural gas plants. They remove >90% of CO2 from flue gas, 6x more than current (amine) technology, and steam is used to regenerate the MOF for reuse. Exploring the MOF design space with Density Functional Theory (DFT).
Jeff Long, College of Chemistry / UC Berkeley and LBNL

Screening Known Drugs for COVID-19
Molecular docking to the SARS-CoV-2 spike protein: screened 8,000 compounds and identified 77 of the most promising.
Jeremy C. Smith, Micholas Smith, U. Tenn. and ORNL
"Exascale" Applications at Berkeley Lab (LBNL)
Accelerators, cosmology, chemistry, subsurface, astrophysics, carbon capture, combustion, urban, genomics, climate, earthquakes, photon science.

The Fourth Paradigm of Science
Theory, Experiment, Simulation, Data analysis.

Data Analytics in Science and Engineering
High Performance Data Analytics (HPDA) is used for data sets that are too big, too complex, too fast (streaming), too noisy, or too heterogeneous for measurement alone. Examples: images from telescopes, particles from detectors, genomes from sequencers, sensor data.

Data Growth is Outpacing Computing Growth
[Chart based on average growth, 2010-2015: detector and sequencer data rates are growing faster than processor and memory capability.]
This gap is why we care about data movement (aka communication).

High Performance Data Analytics (HPDA) for Genomics
What happens to microbes after a wildfire? (1.5 TB) What are the seasonal fluctuations in a wetland mangrove? (1.6 TB) What are the microbial dynamics of soil carbon cycling? (3.3 TB) How do microbes affect disease and growth of switchgrass for biofuels? (4 TB) Combining genomics with isotope tracing methods gives improved functional understanding (8 TB).
JGI-NERSC-KBase FICUS projects, MetaHipMer assembler, ExaBiome project

Analysis of Genomic Data
Correlations of gene expression in plants that use different photosynthesis strategies; Kalanchoë and pineapple both use water-sparing photosynthesis. Dark green nodes: Kalanchoë genes. Yellow nodes: pineapple genes. Light green: a model plant that uses a different photosynthesis strategy. Blue edges: positive correlations. Red edges: negative correlations.
2.36 exaops/second on the Summit computer (an exaop is 10^18 16-bit ops); Gordon Bell Prize at SC18.
Sharlee Climer, Kjiersten Fagnan, Daniel Jacobson, Wayne Joubert, Amy Justice, David Kainer, Deborah Weighill

The Fifth Paradigm of Science?
Theory, Experiment, Simulation, Data analysis, Machine Learning.

Machine learning demands more computing
300,000x increase in training compute from 2011 (AlexNet) to 2018 (AlphaGoZero), measured in exaFLOP/s for one day. From 2011-2017, the fastest Top500 machine grew by less than 10x.
https://blog.openai.com/ai-and-compute/

Machine learning demands more computing
"Compute Trends Across Three Eras of Machine Learning", Sevilla et al., 2022

Data Analytics via Supervised Learning
Classification; classification + localization; object detection; instance segmentation. Extending image-based methods to complex, 3D, scientific data sets is non-trivial!
Slide source: Prabhat

Big Data, Big Model, and Big Iron
Predicted extreme weather vs. ground truth extreme weather: deep learning results are smoother than heuristic labels. Achieved over 1 EF peak on OLCF Summit; Gordon Bell Prize in 2018.
Thorsten Kurth, Sean Treichler, Joshua Romero, Mayur Mudigonda, Nathan Luehr, Everett Phillips, Ankur Mahesh, Michael Matheson, Jack Deslippe, Massimiliano Fatica, Prabhat, Michael Houston

GANs to Build Convergence Maps of Weak Gravitational Lensing
CosmoGAN: Mustafa Mustafa, Deborah Bard, Wahid Bhimji, Zarija Lukić, Rami Al-Rfou, Jan M. Kratochvil