CSC-25 High Performance Architectures
Lecture Notes – Chapter VIII: MIMD Systems
Denis Loubach
[email protected]
Department of Computer Systems, Computer Science Division – IEC
Aeronautics Institute of Technology – ITA
1st semester, 2024

Contents
- Parallel Architectures: Overview; Memory Organization; Memory Architecture; Communication Models; Market Share; SMP
- Cache Coherence: SMP Problems; Coherence; Snooping Coherence Protocols; Write Invalidate Protocol; Write Update Protocol; Brief Protocols Comparison; Write Invalidate Implementation; Directory-based Protocol; Multicomputers
- Thread-Level Parallelism - TLP: General Overview; Hardware Approaches for Multithreading
- References

Parallel Architectures

Overview
Multiple instruction streams, multiple data streams - MIMD.
Multiprocessors are computers consisting of tightly coupled processors whose coordination and usage are generally controlled by a single OS, and that share memory through a shared address space.

Memory Organization
Multiple processors have several sharing options:
1. share cache, memory, and the I/O system
2. share memory and the I/O system
3. share the I/O system only
4. share nothing, usually communicating through a network

Are all of these options feasible or interesting? Remember that it is important to avoid bottlenecks.

Multiprocessors are classified by their memory organization as:
1. symmetric (shared-memory) multiprocessors - SMP
2. distributed shared memory - DSM
Symmetric (shared-memory) multiprocessors - SMP, a.k.a. centralized shared-memory multiprocessors:
- roughly 32 cores or fewer
- share a single centralized memory to which all processors have equal access, i.e., symmetric
- the memory/bus may become a bottleneck
- rely on large caches and multiple buses
- uniform memory access - UMA: access time to all of the memory is the same from every processor

(Figure: basic SMP structure.)

Distributed shared memory - DSM:
- larger processor counts, e.g., 16-64 processor cores
- memory is distributed in order to increase bandwidth and reduce access latency
- communicating data among processors becomes more complex
- more effort is required in software to take advantage of the increased memory bandwidth
- the I/O system is also distributed
- each node can be a small system with its own centralized memory
- nonuniform memory access - NUMA: access time depends on the location of a data word in memory

(Figure: basic DSM structure.)

Memory Architecture
- SMP (e.g., following UMA): processors share a single memory and have uniform access times to it
- DSM (e.g., following NUMA): processors share the same address space, but not necessarily the same physical memory
- Multicomputers: processors with independent memories and address spaces, communicating through interconnection networks; they may even be complete computers connected in a network, i.e., clusters
Communication Models
SMP - centralized memory:
- threads and the fork-join model (also applicable to DSM)
- open multi-processing - OpenMP
- implicit communication, i.e., through memory accesses

DSM - distributed memory:
- the message passing model (also applicable to SMP)
- message passing interface - MPI
- explicit communication, i.e., message passing
- synchronization problems

Market Share
- SMP: the bigger market share, both in dollars and in units; multiprocessors on a chip
- Multicomputers: popularization of clusters for internet-facing systems; >100 processors, i.e., massively parallel processors - MPP

SMP
Large and efficient cache systems can greatly reduce the need for memory bandwidth. SMPs provide some cost benefits, as they need little extra hardware and are based on general-purpose processors - GPP.
Caches provide not only locality but also replication. Is that a problem?

Cache Coherence

SMP Problems
Assume that P1 writes memory position X in its cache, and P2 then reads Mem[X]. What happens?
This is the cache coherence problem: informally, a system is coherent if it returns the last value written to a data item.
Memory system coherence and consistency are complementary:
- coherence – ensures the use of the most current data
- consistency – synchronizes reads/writes between processors
Consistency example:

    P1:           P2:
    a = 0;        b = 0;
    ...           ...
    a = 1;        b = 1;
    if (b == 0)   if (a == 0)
      ...           ...

Initially, both a and b are in the caches with value zero. Which one of the ifs will be taken? Or both of them? Most of the time this has to be handled by the programmer, i.e., through synchronization.

Coherence
A memory system is coherent if:
1. a read by processor P from location X that follows a write by P to X, with no writes to X by another processor between the write and the read, always returns the value written by P
2. a read by processor P1 from location X that follows a write by processor P2 to X returns the written value, if the write and the read are sufficiently separated in time and no other writes to X occur between the two accesses
3.
writes to the same location are serialized, i.e., two writes to the same location by any two processors are seen in the same order by all processors (e.g., if the values 1 and then 2 are written to a location, processors can never read the location as 2 and later read it as 1)

Basic schemes for enforcing coherence keep track of the status of any sharing of a data block; a cache block's status is kept in status bits associated with that block.

Cache coherence protocols are the hardware-based solution for multiprocessors to maintain cache coherence:
- Snooping: every cache containing a copy of the data from a physical memory block tracks the sharing status of that block
- Directory-based: the sharing status of a particular block of physical memory is kept in one location, the directory; SMP uses a single directory, while DSM uses distributed directories

Snooping Coherence Protocols
Each cache holds a copy of the memory data block together with the sharing status of that block, e.g., shared/non-shared. Since the caches share the memory bus, they snoop on the memory traffic to check whether they hold a copy of the "in-transit" block.
Protocols:
1. write invalidate protocol
2. write update (or write broadcast) protocol

Write Invalidate Protocol
Writing to a shared block invalidates the copies of that block in the other processors' caches. When a processor later tries to access an invalidated block:
- there is a cache miss, and
- the data comes from the "dirty" cache block and also updates memory, i.e., the write back case
Writes to non-shared blocks do not cause problems. Why?
What about the write through approach?

Example: assume that neither cache initially holds X and that the value of X in memory is 0; assume also write back caches.

Write Update Protocol
It differs only in how writes are handled: all cached copies of a data item are updated when that item is written. This requires broadcasting every write to shared cache lines, which consumes considerably more bandwidth. Therefore, virtually all recent multiprocessors have opted to implement a write invalidate protocol.

Brief Protocols Comparison
In favor of the write invalidate protocol:
- multiple writes to the same word with no intervening reads require multiple broadcasts under write update, but just one initial block invalidation under write invalidate
- every word written in a cache block requires a write broadcast in the write update protocol, whereas only the first write to any word within the block needs to generate an invalidation
- the write invalidate protocol acts on cache blocks, while the write update protocol must act on individual words
- write invalidate therefore tends to generate less traffic on the memory bus, and the memory bus may be a bottleneck in SMP

In favor of the write update protocol:
- the delay between writing a word on one processor and reading that value on another processor is smaller, since the written data is updated immediately in the reader's cache
Write Invalidate Implementation
Block invalidation:
- the key to the implementation is getting access to the memory bus
- the bus is used to invalidate a block: the processor sends the block address over the bus
- the other processors snoop on the bus, watching whether they hold that block in their caches, i.e., checking the address and invalidating their copies

Serialized writes: the need to acquire the bus, an exclusive resource, forces the serialization of writes.

Locating a data item on a cache miss:
- write through cache: all written data are always sent to memory, so the most recent value of a data item can always be fetched from memory
- write back cache: if the block is clean, it acts like write through; if the block is dirty, the owning cache sends the dirty block in response to the read request and aborts the memory read/fetch operation

A simple protocol with three states:
- invalid – as if the block were not present in the cache
- shared – present in one or more private caches (potentially shared), not necessarily in the cache of the processor that requested the data
- modified/exclusive – updated in the private cache / present in only one cache

The status changes from invalid to shared already on the first read of the block, even if there is only one copy; on the first write it becomes exclusive.

(Figure: write invalidate cache coherence protocol FSM for a private write back cache, showing hits/misses in the local cache.)
Directory-based Protocol
Is there any cache coherence problem wrt DSM, i.e., with a shared address space?

The directory protocol is an alternative to a snooping-based coherence protocol. A directory keeps the state of every block that may be cached. The information in the directory includes:
- which caches have copies of the block
- whether it is dirty
- the block status, i.e., shared, uncached, or modified

Solution – distribute the directory along with the memory:
- different coherence requests can go to different directories, just as different memory requests go to different memories
- each directory is responsible for tracking the caches that share the memory addresses of the portion of memory in its node

(Figure: a directory added to each node to implement cache coherence in DSM.)

The state diagrams are the same as those used in snooping. The implementation is based on message passing between nodes rather than snooping on a bus, so the protocol needs to know which nodes hold the block in order to send invalidations. For each memory block, the directory keeps a field with one bit per system processor indicating which processors have a copy of that block.

Multicomputers
Is there any cache coherence problem wrt multicomputers based on message passing?
Thread-Level Parallelism - TLP

General Overview
Multithreading is exploited more and more, as ILP is already very well explored. Threads are a common way of expressing explicit parallelism, "controlled" by the programmer:
- different from the transparency of ILP and vector instructions
- multithreading "comes together" with those other advancements

Multithreading allows multiple threads to share the functional units of a single processor in an overlapping way.

A more general method to exploit TLP is a multiprocessor, with multiple independent threads operating at once and in parallel. Multithreading, in contrast:
- does not duplicate the entire processor, as a multiprocessor does
- shares most of the processor core among a set of threads
- duplicates only the private state, i.e., the registers and the program counter

Hardware Approaches for Multithreading
- Fine-grained: switches between threads on each clock cycle, so the execution of instructions from multiple threads is interleaved; uses round-robin selection, skipping stalled threads
- Coarse-grained: switches threads only on costly stalls, e.g., L2 or L3 cache misses
- Simultaneous multithreading - SMT: the most common implementation; a variation on fine-grained multithreading implemented on top of a multiple-issue, dynamically scheduled processor

What is the relationship between superscalar processors and multithreading?
SMT. Register renaming and dynamic scheduling allow multiple instructions from independent threads to be executed without noticing the dependences among them.

Use of execution slots:
- superscalar with no MT support
- superscalar with coarse and fine MT
- superscalar with SMT
Intel Core i7 and IBM Power7 processors use SMT.

(Figure: conceptual illustration, greatly simplified; shades of gray correspond to 4 different threads in the multithreaded processor, and a white box indicates that the corresponding execution slot is unused in that clock cycle.)

References
Lecture notes mainly based on:
- Castro, Paulo André. Notas de Aula da disciplina CES-25 Arquiteturas para Alto Desempenho [lecture notes for CES-25, Architectures for High Performance]. ITA, 2018.
- Hennessy, J. L. and D. A. Patterson. Computer Architecture: A Quantitative Approach. 6th ed. Morgan Kaufmann, 2017.