High-Performance Computing (HPC) Architectures - L1.4
Summary
This document provides an overview of high-performance computing (HPC) architectures and technologies. It explores different types of supercomputers and their key components, discussing metrics like FLOPS. The document also touches upon the advancements and future trends in the field.
Full Transcript
DEFINITION
HPC is a field of endeavor that relates to all facets of technology, methodology, and application associated with achieving the greatest computing capability possible at any point in time. The machines that deliver this capability are referred to as "supercomputers" and are used to perform a wide array of computational problems, or "applications" (alternatively "workloads"), as fast as possible. The action of performing an application on a supercomputer is widely termed "supercomputing" and is synonymous with HPC. HPC systems are distinguished from a conventional computer by the organization, interconnectivity, and scale of their many component resources. A "node" incorporates all the functional elements required for computation, and is highly replicated to achieve large scale.
ANATOMY OF A SUPERCOMPUTER
NEODIGITAL AGE AND BEYOND MOORE'S LAW
The international HPC development community will extend many-core heterogeneous system technologies, architectures, system software, and programming methods from the petaflops generation to exascale in the early part of the next decade. But the semiconductor fabrication trends that have driven the exponential growth of device density and peak performance are coming to an end as feature size approaches the nanoscale (approximately 5 nm). This is often referred to as the "end of Moore's law". This does not mean that system performance will also stop growing, but that the means of achieving it will rely on other innovations through alternative device technologies, architectures, and even paradigms. The exact forms these advances will take are unknown at this time, but exploratory research suggests several promising directions - some based on new ways of using refined semiconductor devices, and others on complete paradigm shifts based on alternative methods.
Supercomputer performance is measured using FLOPS (floating-point operations per second). The most powerful supercomputer in the world now exceeds 1 exaFLOPS, that is, 1 quintillion (10^18) FLOPS, and we refer to machines like this as exascale supercomputers. PCs and laptops usually deliver several hundred gigaFLOPS; 1 gigaFLOPS is 1 billion (10^9) FLOPS.
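To put the FLOPS figures above in perspective, the short C sketch below (not part of the original notes; the 500-gigaFLOPS laptop figure is an assumed value) works out how many laptop-class machines would match a single 1-exaFLOPS system.

/* Comparing the FLOPS magnitudes quoted above: how many laptop-class
 * machines (a few hundred gigaFLOPS each) match one exascale system?
 * The 500-gigaFLOPS laptop figure is an illustrative assumption. */
#include <stdio.h>

int main(void) {
    double exa    = 1e18;           /* 1 exaFLOPS  = 10^18 FLOPS */
    double giga   = 1e9;            /* 1 gigaFLOPS = 10^9  FLOPS */
    double system = 1.0 * exa;      /* a 1-exaFLOPS supercomputer */
    double laptop = 500.0 * giga;   /* assumed laptop peak        */

    printf("One exascale system is roughly %.1f million laptops\n",
           system / laptop / 1e6);
    return 0;
}

Running this prints roughly 2.0 million, which is the scale gap the node replication described above has to bridge.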
Seven most powerful supercomputers online today:
Frontier
· Location: Oak Ridge National Laboratory — Tennessee, U.S.
· Performance: 1,194 petaFLOPS (1.2 exaFLOPS)
· Components: AMD EPYC 64-core CPUs and AMD Instinct MI250X GPUs
· First online: August 2022
Built by supercomputing giant HPE Cray, Frontier became the first exascale computer in the world. Scientists initially planned to use Frontier for cancer research, drug discovery, nuclear fusion, exotic materials, designing superefficient engines, and modeling stellar explosions.
Aurora
· Location: Argonne National Laboratory — Illinois, U.S.
· Performance: 585 petaFLOPS (0.59 exaFLOPS)
· Components: Intel Xeon Max Series CPUs and Intel Data Center Max Series GPUs
· First online: June 2023
One of the youngest supercomputers on the list, Aurora may also one day become the most powerful. Based at the Argonne Leadership Computing Facility (ALCF), Aurora became the second exascale supercomputer ever, and ALCF representatives said it has the potential to reach 2 exaFLOPS of computing power, roughly double Frontier's. It was built in partnership with Intel and HPE.
Eagle
· Location: Microsoft Azure — The Cloud / Distributed
· Performance: 561 petaFLOPS (0.56 exaFLOPS)
· Components: Intel Xeon Platinum 8480C 48-core CPUs and Nvidia H100 GPUs
· First online: August 2023
Microsoft's Eagle supercomputer is based in the cloud, and anybody can access it through the Microsoft Azure cloud platform. It is a distributed network of systems that collectively boast enough power to make it the third-fastest supercomputer in the world. Eagle can in theory be accessed by anybody willing to pay a fee.
Fugaku
· Location: Riken Center for Computational Science — Kobe, Japan
· Performance: 442 petaFLOPS (0.44 exaFLOPS)
· Components: A64FX CPUs
· First online: June 2020
Once the most powerful supercomputer in the world, between June 2020 and June 2022, Fugaku is one of the oldest top-five systems on this list. It takes its name from Mount Fuji. Fugaku is currently training Japanese AI large language models, in the mold of ChatGPT.
LUMI
· Location: CSC Data Center — Kajaani, Finland
· Performance: 380 petaFLOPS (0.38 exaFLOPS)
· Components: AMD 3rd-Gen EPYC 64-core CPUs and AMD Instinct MI250X GPUs
· First online: June 2021
LUMI, based in Finland, is Europe's most powerful supercomputer and the fifth-fastest in the world. It uses 100% renewable hydroelectric energy, and its waste heat is used to warm nearby buildings. It began running pilot operations three years ago and became fully operational in February 2023.
Leonardo
· Location: CINECA data center — Bologna, Italy
· Performance: 239 petaFLOPS (0.24 exaFLOPS)
· Components: Intel Xeon Platinum 8358 32-core CPUs and Nvidia A100 GPUs
· First online: November 2022
Summit
· Location: Oak Ridge National Laboratory — Tennessee, U.S.
· Performance: 149 petaFLOPS (0.15 exaFLOPS)
· Components: IBM POWER9 22-core CPUs and Nvidia Tesla V100 GPUs
· First online: June 2018
Summit was the most powerful supercomputer in the world for two years until it was replaced by Fugaku.
PARAM Rudra (2024)
Based on Intel Xeon 2nd Generation Cascade Lake dual-socket processors and Nvidia A100 GPUs, with 35 TB memory and 2 PB storage; cost ₹130 crore. The three installations and their performance:
· Giant Metrewave Radio Telescope, Pune: 1 PFLOPS
· Inter-University Accelerator Centre, New Delhi: 3 PFLOPS
· S. N. Bose National Centre for Basic Sciences, Kolkata: 838 TFLOPS
Supercomputer Vs HPC
Supercomputer: A supercomputer is a powerful subset of HPC that is made up of thousands of compute nodes that work together to solve complex problems. They are number-crunching machines.
High-performance computing: HPC is the practice of using multiple computers to process data and perform complex calculations at high speeds. HPC systems are made up of large clusters of servers that work together as a single computing resource. HPC and supercomputing are often used interchangeably, but in some contexts, "supercomputer" refers to a more powerful subset of HPC.
Key Properties of HPC Architecture
Speed: the speed of the components and their clock rates.
Parallelism: the ability to do many things at once.
Efficiency: the performance achieved on the user workload in terms of FLOPS.
Power: the speed of processor cores is proportional to clock rate, and this in turn relates to the power applied.
Reliability: fault tolerance must be built in for hard and soft errors.
Programmability: how difficult it is to write or develop a complex application code reflects the programmability of the system.
VECTOR PROCESSING
Vector-processing architecture exploits pipelining to achieve the advantages of fine-grain parallelism, latency hiding, and amortized control overheads. Pipelining also permits a high clock rate for vector-based computer architecture.
Flynn's taxonomy of Parallel Architectures
Flynn's taxonomy classifies architectures by their instruction and data streams: SISD, SIMD, MISD, and MIMD.
Amdahl's Law (see the sketch after the list below)
Parallel computing:
· Increase in the number of "compute elements" (cores)
· Architectures depend heavily on parallelism
· Increase in the number of CPUs
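Amdahl's Law, listed above, bounds the speedup obtainable from parallelism: if a fraction f of a program's run time can be parallelized and the rest stays serial, the speedup on p processors is S(p) = 1 / ((1 - f) + f / p). The C sketch below is illustrative only; the parallel fraction of 0.95 and the core counts are assumed values, not figures from these notes.

/* Amdahl's Law: speedup of a program on p processors when a fraction f
 * of its execution time can be parallelized perfectly.
 * The value of f and the core counts are illustrative assumptions. */
#include <stdio.h>

static double amdahl_speedup(double f, double p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    double f = 0.95;                         /* assumed parallel fraction */
    int cores[] = {1, 8, 64, 1024, 65536};

    for (int i = 0; i < 5; i++)
        printf("p = %6d  ->  speedup = %.2f\n",
               cores[i], amdahl_speedup(f, cores[i]));
    /* Even with 65,536 cores the speedup stays below 1/(1-f) = 20,
     * which is why large-scale HPC codes must keep the serial
     * fraction extremely small. */
    return 0;
}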
Shared-memory computers
A shared-memory parallel computer is a system in which a number of CPUs work on a common, shared physical address space. There are two types:
· Uniform Memory Access (UMA) systems exhibit a "flat" memory model: latency and bandwidth are the same for all processors and all memory locations. This is also called symmetric multiprocessing (SMP).
· On cache-coherent Nonuniform Memory Access (ccNUMA) machines, memory is physically distributed but logically shared. The physical layout of such systems is quite similar to the distributed-memory case, but network logic makes the aggregated memory of the whole system appear as one single address space.
The shared-memory multiprocessor architecture. Nonuniform memory access (NUMA) architectures.
NUMA and UMA Shared-Address-Space Platforms
Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.
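On the shared-memory systems just described, every thread sees the same address space, so a loop can simply be divided among cores; OpenMP is the common programming model for this. The sketch below is a minimal illustration, not code from these notes; the array size, file name, and compile command are assumptions.

/* Shared-memory parallelism on a UMA/ccNUMA node: all threads share one
 * address space, so each loop is simply divided among them with OpenMP.
 * Compile with, e.g.: gcc -fopenmp saxpy.c (file name is illustrative). */
#include <stdio.h>
#include <omp.h>

#define N 10000000   /* arbitrary problem size for illustration */

int main(void) {
    static float x[N], y[N];
    float a = 2.0f;

    /* Each thread initializes its own slice of the shared arrays. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    /* Each thread updates its slice: SAXPY, y = a*x + y. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("threads available: %d, y[0] = %.1f\n",
           omp_get_max_threads(), y[0]);
    return 0;
}

On a ccNUMA machine, first-touch placement is the reason the initialization loop is parallelized the same way as the compute loop: each thread's slice of the arrays then tends to reside in the memory local to its own socket.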
MPP Architecture
Distributed-memory computers
In a distributed-memory computer, each processing node has its own local memory and a network interface, and the nodes are connected by a communication network.
Network connection/topology: point-to-point, buses, cross-bar switched, fat tree network, mesh, hybrid.
Classification of interconnection networks: (a) a static network; and (b) a dynamic (indirect) network, in which processing nodes attach through network interfaces to switching elements.
Switches map a fixed number of inputs to outputs. The total number of ports on a switch is the degree of the switch. The cost of a switch grows as the square of the degree of the switch, the peripheral hardware linearly as the degree, and the packaging costs linearly as the number of pins.
Network Topologies: Buses
Bus-based interconnects: (a) with no local caches; (b) with local memory/caches. Since much of the data accessed by processors is local to the processor, a local memory can improve the performance of bus-based machines.
Network Topologies: Crossbars
A completely non-blocking crossbar network connecting p processors to b memory banks. A crossbar network uses a p × m grid of switches to connect p inputs to m outputs in a non-blocking manner.
Network Topologies: Completely Connected and Star Connected Networks
(a) A completely-connected network of eight nodes; (b) a star-connected network of nine nodes.
Network Topologies: Linear Arrays
Linear arrays: (a) with no wraparound links; (b) with a wraparound link.
Network Topologies: Two- and Three-Dimensional Meshes
Two- and three-dimensional meshes: (a) a 2-D mesh with no wraparound; (b) a 2-D mesh with wraparound links (a 2-D torus); and (c) a 3-D mesh with no wraparound.
Network Topologies: Hypercubes and their Construction
Construction of hypercubes (0-D through 4-D) from hypercubes of lower dimension.
Network Topologies: Properties of Hypercubes
The distance between any two nodes is at most log p. Each node has log p neighbors. The distance between two nodes is given by the number of bit positions at which the two nodes differ (a small code sketch at the end of this section illustrates this).
Network Topologies: Tree-Based Networks
Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network, with processing nodes at the leaves and switching nodes at the interior levels.
Network Topologies: Tree Properties
The distance between any two nodes is no more than 2 log p. Links higher up the tree potentially carry more traffic than those at the lower levels. For this reason, a variant called a fat tree fattens the links as we go up the tree. Trees can be laid out in 2D with no wire crossings.
Network Topologies: Fat Trees
A fat tree network of 16 processing nodes.
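Returning to the hypercube properties above: since the distance between two nodes equals the number of bit positions in which their labels differ, it can be computed as the Hamming distance of the labels. The C sketch below is illustrative; the node labels are arbitrary examples, not taken from these notes.

/* Distance between two hypercube nodes = number of differing bit positions
 * (the Hamming distance of their labels). Labels below are arbitrary. */
#include <stdio.h>

static int hypercube_distance(unsigned a, unsigned b) {
    unsigned diff = a ^ b;   /* set bits mark dimensions where a and b differ */
    int hops = 0;
    while (diff) {
        hops += diff & 1u;
        diff >>= 1;
    }
    return hops;
}

int main(void) {
    /* In a 4-D hypercube (16 nodes), nodes 0101 and 1110 differ in
     * 3 bit positions, so a message between them needs 3 hops. */
    printf("distance(0101, 1110) = %d\n", hypercube_distance(0x5, 0xE));
    printf("distance(0000, 1111) = %d\n", hypercube_distance(0x0, 0xF));
    return 0;
}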