TTTK 1153 Computer Organization & Architecture: Parallel Architectures PDF

Document Details


Uploaded by PrestigiousSchorl2599

Ts. Dr. Mohd Nor Akmal Khalid

Tags

computer architecture, parallel processing, multicore processors, computer organization

Summary

This document is a presentation on computer organization and architecture, focusing on parallel architectures and multicores. It covers various processor organizations like SISD, SIMD, and MIMD, along with concepts of parallelism, multi-threaded design, and cluster computing. The presentation also explores heterogeneous multicore organization and its applications in real-world scenarios.

Full Transcript


TTTK 1153 Computer Organization & Architecture
Topic 7: Parallel Architectures & Multicores
Ts. Dr. Mohd Nor Akmal Khalid

A Process
❑ A complete process includes numerous things
▪ Address space (all the code and data pages)
▪ OS resources and accounting information
▪ A "thread of control", which defines where the process is currently executing
▪ The Program Counter
▪ CPU registers

Taxonomy of Parallel Processor Architectures
(Figure slide.)

Multiple Processor Organization
Single instruction, single data (SISD) stream
❑ A single processor executes a single instruction stream to operate on data stored in a single memory
❑ Uniprocessors fall into this category
Single instruction, multiple data (SIMD) stream
❑ A single machine instruction controls the simultaneous execution of several processing elements on a lockstep basis
❑ Vector and array processors fall into this category
Multiple instruction, single data (MISD) stream
❑ A sequence of data is transmitted to a set of processors, each of which executes a different instruction sequence
❑ Not commercially implemented
Multiple instruction, multiple data (MIMD) stream
❑ A set of processors simultaneously execute different instruction sequences on different data sets
❑ SMPs, clusters, and NUMA systems fit this category

Alternative Computer Organization
(Figure 17.2, Alternative Computer Organizations: block diagrams of (a) SISD, (b) SIMD with distributed memory, (c) MIMD with shared memory, and (d) MIMD with distributed memory. CU = control unit, IS = instruction stream, PU = processing unit, DS = data stream, MU = memory unit, LM = local memory; SISD = single instruction, single data stream; SIMD = single instruction, multiple data stream; MIMD = multiple instruction, multiple data stream.)

Parallelism
❑ With multiple paths of execution, we can implement (or simulate) simultaneous actions
❑ Why build a parallel program?
▪ Responsiveness to the user, e.g., the user interface always responds quickly
▪ Server handling simultaneous requests (web, etc.), e.g., each request is handled independently
▪ Execute faster on a multiprocessor, e.g., two CPUs can run two programs at once
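To make the "execute faster on a multiprocessor" point concrete, here is a minimal sketch (an illustration, not from the slides) of a data-parallel loop in C with OpenMP; the file name and the arrays a, b, c are invented for the example. Each thread executes the same operation on its own slice of the data, in the shared-memory MIMD style of Figure 17.2(c).

```c
/* Build with: gcc -fopenmp vecadd.c -o vecadd (assumes an OpenMP-capable compiler). */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];   /* static: kept off the stack */
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* OpenMP splits the iteration space across the available cores. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("up to %d threads, c[N-1] = %.1f\n",
           omp_get_max_threads(), c[N - 1]);   /* prints 2999997.0 */
    return 0;
}
```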
Chip-Level Parallelism
(Figure: (a) on-chip parallelism, (b) a coprocessor, (c) a multiprocessor, (d) a multicomputer, (e) a grid.)

Tightly Coupled Multiprocessor
(Figure: processors and I/O modules connected to main memory through an interconnection network.)

Symmetric Multiprocessor (SMP) Organization
(Figure: processors, each with an L1 and L2 cache, on a shared bus with main memory, an I/O subsystem, and I/O adapters.)

Bus Organization
Advantages
❑ Simplicity: simplest approach to multiprocessor organization
❑ Flexibility: it is generally easy to expand the system by attaching more processors to the bus
❑ Reliability: the bus is essentially a passive medium, and the failure of any attached device should not cause the failure of the whole system
Disadvantages
❑ All memory references pass through the common bus
❑ Performance is limited by bus cycle time
❑ Each processor should have cache memory to reduce the number of bus accesses
❑ Leads to problems with cache coherence
▪ If a word is altered in one cache, it could conceivably invalidate a word in another cache (other processors must be alerted that an update has taken place)
▪ Typically addressed in hardware rather than by the operating system

Instruction-level Parallelism
(Figure: (a) a CPU pipeline, (b) a sequence of VLIW instructions, (c) an instruction stream with bundles marked.)

Threads vs. Process
❑ Thread: a flow of control (lightweight)
▪ Most operating systems now support two entities
▪ Process: defines the address space and general process attributes
▪ Thread: defines one or more execution paths within a process
▪ Threads are the unit of scheduling
▪ Processes are the "containers" in which threads execute
❑ Process: a heavyweight unit of control
▪ Creating a new process is costly: lots of data must be allocated and initialized (operating system control data structures, memory allocation for the process)
▪ Communicating between processes is costly: most communication goes through the OS, and a context switch is needed for each process

Multi-threaded Design
❑ Separating execution path from address space simplifies the design of parallel applications
❑ Some benefits of threaded designs
▪ Improved responsiveness to user actions
▪ Handling concurrent events (e.g., web requests)
▪ Simplified program structure (code, data)
▪ More efficient, and so less impact on the system
▪ Map easily to multi-processor systems

Multi-threaded Design
(Figure: one process shown with one thread and with three threads; the code segment and heap are shared, while each thread has its own stack ($sp1, $sp2, $sp3) and program counter (PC1, PC2, PC3).)

Cookbook Analogy
❑ Think of a busy kitchen
▪ 3 cooks and one cookbook
❑ Each cook maintains a pointer to where they are in the cookbook (the Program Counter)
❑ Two cooks could both be making the same thing (threads running the same procedure)
❑ The cooks must coordinate access to the kitchen appliances (resource access control); the sketch below puts this analogy in code
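A minimal POSIX threads sketch of the three-cooks picture (an illustration, not from the slides; the names cook, dishes_done, and kitchen are invented for the example): three threads share one address space and one copy of the code, each has its own stack and program counter, and a mutex coordinates access to the shared resource.

```c
/* Build with: gcc -pthread cooks.c -o cooks (assumes a POSIX system). */
#include <pthread.h>
#include <stdio.h>

static int dishes_done = 0;                      /* shared data: the "kitchen" */
static pthread_mutex_t kitchen = PTHREAD_MUTEX_INITIALIZER;

static void *cook(void *arg) {                   /* the shared "cookbook" (code) */
    long id = (long)arg;                         /* lives on this thread's stack */
    pthread_mutex_lock(&kitchen);                /* coordinate appliance access */
    dishes_done++;
    pthread_mutex_unlock(&kitchen);
    printf("cook %ld finished a dish\n", id);
    return NULL;
}

int main(void) {
    pthread_t cooks[3];
    for (long i = 0; i < 3; i++)                 /* three cooks, one cookbook */
        pthread_create(&cooks[i], NULL, cook, (void *)i);
    for (int i = 0; i < 3; i++)
        pthread_join(cooks[i], NULL);
    printf("dishes done: %d\n", dishes_done);    /* always 3, thanks to the mutex */
    return 0;
}
```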
On-Chip Multithreading I
(Figure: (a) to (c) three threads, where the empty boxes indicate that the thread has stalled waiting for memory; (d) fine-grained multithreading; (e) coarse-grained multithreading.)

On-Chip Multithreading II
(Figure: multithreading with a dual-issue superscalar CPU; (a) fine-grained multithreading, (b) coarse-grained multithreading, (c) simultaneous multithreading.)

Thread Implementation
❑ A thread is bound to the process that provides its address space
❑ Each process has one or more threads
❑ How are threads implemented?
▪ Kernel threads: in the kernel (OS) and user-mode libraries combined
▪ User threads: in user-mode libraries alone

Kernel Threads
❑ The operating system knows about and manages the threads in every program
❑ Thread operations (create, yield, ...) all require kernel involvement
❑ The major benefit is that threads in a process are scheduled independently
▪ One blocked thread does not block the others
▪ Threads in a process can run on different CPUs
❑ Kernel threads have performance issues
❑ Even though threads avoid process overhead, operations on kernel threads are still slow
▪ A thread operation requires a kernel call
▪ Kernel threads may be overly general, to support different users' needs, languages, etc.
▪ The kernel can't trust the user, so there must be lots of checking on kernel calls

User Threads
❑ To make thread operations faster, they can be implemented at the user level
▪ Each thread is managed by the run-time system
▪ User-mode libraries are linked with your program
❑ Each thread is represented simply by a PC, registers, a stack, and a control block, managed in the user's address space
❑ All activities happen in user address space, so thread operations can be faster (see the sketch below)
❑ But OS scheduling takes place at the process level
▪ The entire process blocks if a single thread is I/O blocked
▪ The OS may run a process that is just running an idle thread
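A toy sketch of the user-threads idea (an illustration, not from the slides), using the POSIX ucontext API, which is deprecated but still shipped by glibc; the names thread_body and thr_stack are invented for the example. The "thread" is nothing more than saved registers, a saved PC, and a private stack, and switching is done by the program itself rather than by the kernel scheduler.

```c
/* Build with: gcc uthread.c -o uthread (assumes glibc's ucontext support). */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, thr_ctx;
static char thr_stack[64 * 1024];        /* the user thread's private stack */

static void thread_body(void) {
    puts("user thread: first run");
    swapcontext(&thr_ctx, &main_ctx);    /* yield back to main */
    puts("user thread: resumed");
}                                        /* returning resumes uc_link */

int main(void) {
    getcontext(&thr_ctx);                /* initialize the context */
    thr_ctx.uc_stack.ss_sp = thr_stack;
    thr_ctx.uc_stack.ss_size = sizeof thr_stack;
    thr_ctx.uc_link = &main_ctx;         /* where to go when the body returns */
    makecontext(&thr_ctx, thread_body, 0);

    swapcontext(&main_ctx, &thr_ctx);    /* "schedule" the user thread */
    puts("main: user thread yielded");
    swapcontext(&main_ctx, &thr_ctx);    /* resume it */
    puts("main: user thread finished");
    return 0;
}
```

This also shows why the OS cannot help here: the kernel never sees these threads, so if thread_body blocked in a system call, the whole process would block, exactly as the slide warns.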
Explicit and Implicit Multithreading
❑ All commercial processors and most experimental ones use explicit multithreading
▪ Concurrently execute instructions from different explicit threads
▪ Interleave instructions from different threads on shared pipelines, or execute in parallel on parallel pipelines
❑ Implicit multithreading is the concurrent execution of multiple threads extracted from a single sequential program
▪ Implicit threads are defined statically by the compiler or dynamically by hardware

Approaches to Explicit Multithreading
❑ Interleaved: fine-grained; the processor deals with two or more thread contexts at a time, switching threads at each clock cycle; if a thread is blocked, it is skipped
❑ Blocked: coarse-grained; a thread is executed until an event causes a delay; avoids pipeline stalls; effective on an in-order processor
❑ Simultaneous (SMT): instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor
❑ Chip multiprocessing: the processor is replicated on a single chip; each processor handles separate threads; the advantage is that the available logic area on a chip is used effectively

Cluster
❑ Alternative to SMP as an approach to providing high performance and high availability
❑ Particularly attractive for server applications
❑ Defined as:
▪ A group of interconnected whole computers working together as a unified computing resource that can create the illusion of being one machine
▪ (The term whole computer means a system that can run on its own, apart from the cluster)
▪ Each computer in a cluster is called a node
❑ Benefits:
▪ Absolute scalability
▪ Incremental scalability
▪ High availability
▪ Superior price/performance

Cluster Configurations
(Figure 17.8, Cluster Configurations: (a) standby server with no shared disk, connected by a high-speed message link; (b) shared disk with RAID.)
❑ How failures are managed depends on the clustering method used
❑ Two approaches: highly available clusters and fault-tolerant clusters
❑ Failover: the function of switching applications and data resources over from a failed system to an alternative system in the cluster
❑ Failback: restoration of applications and data resources to the original system once it has been fixed
❑ Load balancing:
▪ Incremental scalability
▪ Automatically include new computers in scheduling
▪ Middleware needs to recognize that processes may switch between machines

Nonuniform Memory Access (NUMA)
❑ Alternative to SMP and clustering
Uniform memory access (UMA)
❑ All processors have access to all parts of main memory using loads and stores
❑ Access time to all regions of memory is the same
❑ Access time to memory is the same for all processors
Nonuniform memory access (NUMA)
❑ All processors have access to all parts of main memory using loads and stores
❑ Access time differs depending on which region of main memory is being accessed
❑ Different processors access different regions of memory at different speeds
Cache-coherent NUMA (CC-NUMA)
❑ A NUMA system in which cache coherence is maintained among the caches of the various processors

CC-NUMA Organization
(Figure 17.11, CC-NUMA Organization: N nodes, each containing processors 1 to m with per-processor L1 and L2 caches, a directory, I/O, and a local main memory, joined by an interconnect network.)

NUMA Pros and Cons
Advantages
❑ The main advantage of a CC-NUMA system is that it can deliver effective performance at higher levels of parallelism than SMP, without requiring major software changes
❑ Bus traffic on any individual node is limited to a demand that the bus can handle
Disadvantages
❑ If many of the memory accesses are to remote nodes, performance begins to break down (the sketch below makes this measurable)
❑ Does not transparently look like an SMP: software changes will be required to move an operating system and applications from an SMP to a CC-NUMA system
❑ Concern with availability
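A small Linux-only sketch (an illustration, not from the slides) that makes the local-versus-remote distinction observable, assuming the libnuma library is installed (link with -lnuma); touch_ms is an invented helper, and on a single-node machine both buffers land on the same node, so the two times should match.

```c
/* Build with: gcc numa_demo.c -o numa_demo -lnuma (assumes Linux + libnuma). */
#include <numa.h>
#include <stdio.h>
#include <time.h>

/* Walk the buffer one cache line at a time and report elapsed milliseconds. */
static double touch_ms(volatile char *buf, size_t n) {
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (size_t i = 0; i < n; i += 64)
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void) {
    if (numa_available() < 0) { puts("no NUMA support"); return 1; }
    size_t n = 64 * 1024 * 1024;
    int far = numa_max_node();                  /* highest-numbered node */
    numa_run_on_node(0);                        /* pin this thread to node 0 */
    char *local  = numa_alloc_onnode(n, 0);     /* memory on our own node */
    char *remote = numa_alloc_onnode(n, far);   /* memory on another node */
    if (!local || !remote) return 1;
    printf("local:  %.1f ms\n", touch_ms(local, n));
    printf("remote: %.1f ms\n", touch_ms(remote, n));
    numa_free(local, n);
    numa_free(remote, n);
    return 0;
}
```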
Cloud Computing Elements & Context
(Figure 17.12, Cloud Computing Elements, and Figure 17.14, Cloud Computing Context: enterprise and cloud-user LAN switches and routers connecting through a network or the Internet to a cloud service provider's servers.)
❑ Essential characteristics: broad network access, rapid elasticity, measured service, on-demand self-service, resource pooling
❑ Service models: Software as a Service (SaaS), Platform as a Service (PaaS), Infrastructure as a Service (IaaS)
❑ Deployment models: public, private, hybrid, community

Deployment Models of Cloud Computing
❑ Public Cloud
▪ The cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services
▪ Major advantage is cost
❑ Private Cloud
▪ A cloud infrastructure implemented within the internal IT environment of the organization
▪ A key motivation for opting for a private cloud is security
❑ Hybrid Cloud
▪ The cloud infrastructure is a composition of two or more clouds that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability
▪ Sensitive information can be placed in a private area of the cloud, and less sensitive data can take advantage of the cost benefits of the public cloud
❑ Community Cloud
▪ Like a private cloud, it is not open to any subscriber
▪ Like a public cloud, the resources are shared among several independent organizations

Resource Sharing between Threads in Core i7 Microarchitecture
(Figure slide.)

Homogeneous Multiprocessors on a Chip
❑ Single-chip multiprocessors:
▪ A dual-pipeline chip
▪ A chip with two cores

Multiprocessors
❑ Multiprocessor with 16 CPUs sharing a common memory
❑ An image is partitioned into 16 sections, each analyzed by a different CPU

Multi-Computers
❑ A multicomputer with 16 CPUs, each with its private memory
❑ The bit-map image of Fig. 8-19 split up among the 16 memories

Multi-Computers
❑ Various layers where shared memory can be implemented:
▪ (a) The hardware
▪ (b) The operating system
▪ (c) The language run-time system

Google's Cluster Computing
(Figure slide.)

Cluster Scheduling
❑ Scheduling a cluster; the shaded areas indicate idle CPUs
▪ FIFO
▪ Without head-of-line blocking
▪ Tiling

Grid Computing Layers
(Figure slide.)

Alternative Chip Organizations
(Figure 18.1, Alternative Chip Organizations: (a) superscalar, with one program counter and a single-thread register file feeding issue logic, an instruction fetch unit, execution units and queues, L1 instruction and data caches, and an L2 cache; (b) simultaneous multithreading, with n program counters and n register files sharing the same pipeline; (c) multicore, with n cores, each having private L1 instruction and data caches, sharing an L2 cache.)

The BlueGene/P Custom Processor Chip
(Figure slide.)

The BlueGene/P
(Figure: (a) chip, (b) card, (c) board, (d) cabinet, (e) system.)

Effective Applications of Multicore Processors
❑ Multi-threaded native applications
▪ Thread-level parallelism
▪ Characterized by having a small number of highly threaded processes
❑ Multi-process applications
▪ Process-level parallelism
▪ Characterized by the presence of many single-threaded processes
❑ Java applications
▪ Embrace threading in a fundamental way
▪ The Java Virtual Machine is a multi-threaded process that provides scheduling and memory management for Java applications
❑ Multi-instance applications
▪ If multiple application instances require some degree of isolation, virtualization technology can be used to provide each of them with its own separate and secure environment

Multicore Organization Alternatives
(Figure: (a) dedicated L1 cache, (b) dedicated L2 cache, (c) shared L2 cache, (d) shared L3 cache; in each case, n CPU cores with L1-I/L1-D caches, per-core or shared L2/L3 caches, main memory, and I/O.)

Heterogeneous Multicore Organization
❑ Refers to a processor chip that includes more than one kind of core
❑ The most prominent trend is the use of both CPUs and graphics processing units (GPUs) on the same chip
▪ This mix, however, presents issues of coordination and correctness
❑ GPUs are characterized by the ability to support thousands of parallel execution threads
❑ Thus, GPUs are well matched to applications that process large amounts of vector and matrix data

Heterogeneous Multicore Chip Elements
(Figure 18.7, Heterogeneous Multicore Chip Elements: CPU and GPU cores with caches, an on-chip interconnection network, last-level caches, and DRAM controllers.)

Operating Parameters of Heterogeneous Multicore Chips Compared

                          CPU      GPU
Clock frequency (GHz)     3.8      0.8
Cores                     4        384
FLOPS/core                8        2
GFLOPS                    121.6    614.4

FLOPS = floating-point operations per second
FLOPS/core = number of parallel floating-point operations that can be performed per core
Note that each GFLOPS figure is the product of the other three rows (clock × cores × FLOPS/core): 3.8 × 4 × 8 = 121.6 for the CPU and 0.8 × 384 × 2 = 614.4 for the GPU, so the GPU delivers about five times the throughput despite its much lower clock.

Performance Effect of Multiple Cores
(Figure: (a) relative speedup versus number of processors, 1 to 8, with 0%, 2%, 5%, and 10% sequential portions; (b) speedup with 5%, 10%, 15%, and 20% overheads.)
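The curves in panel (a) are the standard Amdahl's-law speedup (the slide does not name the formula, so this is an inference). Writing f for the fraction of the program that can be parallelized and N for the number of processors:

$$\text{Speedup} = \frac{\text{time on 1 processor}}{\text{time on } N \text{ processors}} = \frac{1}{(1 - f) + \dfrac{f}{N}}$$

For example, with a 10% sequential portion (f = 0.9) and N = 8, the speedup is 1 / (0.1 + 0.9/8) ≈ 4.7, which is why the 10% curve flattens well below the ideal speedup of 8.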
NVIDIA GPU Architecture (Simplified)
(Figure; source: http://dx.doi.org/10.1186/1756-0500-4-158. The grey rectangles of thread blocks are for illustration purposes and are not a physical part of the architecture. MP = multiprocessor, SM = shared memory, SFU = special functions unit, IU = instruction unit, SP = streaming processor (core).)

Applications in Real-World
❑ AI and Machine Learning: NVIDIA GPUs are used for training deep learning models with frameworks like TensorFlow and PyTorch, using thousands of parallel cores to enable efficient training of large neural networks
❑ Gaming and Multimedia Applications: Valve's Source Engine employs multithreading for tasks like rendering, physics, and AI, improving performance on multicore systems
❑ Cloud Computing and Data Processing: Google's cluster computing architecture handles billions of search queries using parallel processing techniques across multicore processors in massive server farms
❑ HPC in Sciences: systems like BlueGene/P and IBM's Watson use multicore processors for large-scale scientific computations, such as weather modeling and simulations
❑ Multimedia Processing: Adobe Premiere Pro utilizes multicore processors for faster video rendering and effects processing, allowing real-time video editing

THANK YOU!
Next Lecture: Advanced Memory Systems
