8 Multiple Instruction, Multiple Data (MIMD) Systems

Cache Coherence

SMP Problems

Consider the memory organization of an SMP. Let's assume that processor P1 writes the memory position X on its local cache, and processor P2 then reads Mem[X]. What happens? This relates to the consistency and coherence problems. In the cache coherence problem context, informally, a system is coherent if it returns the last value written to a data item. The memory system's coherence and consistency are complementary: coherence ensures the use of the most current data, and consistency synchronizes the reads and writes between processors.

Consistency

Consistency example. Let's take the following example related to consistency, involving the codes in Listings 8.3 and 8.4.

Listing 8.3: Code snippet example to verify consistency, running on processor P1.

    a = 0;
    ...
    a = 1;
    if (b == 0)
        ...

Listing 8.4: Code snippet example to verify consistency, running on processor P2.

    b = 0;
    ...
    b = 1;
    if (a == 0)
        ...

Initially, both variables a and b are in the cache with value equal to zero. Which one of the if statements will be taken? Or both of them, from P1 and P2? In most cases, this has to be handled by the programmer, i.e., through synchronization mechanisms.

Coherence

A memory system is coherent if the following three aspects are fulfilled.

1. Program order preservation – a read by processor P of location X, following a write by processor P to location X, with no writes to X by another processor between the write and the read by P, always returns the value written by P.

    // code running on P
    write A to Mem[X]
    read Mem[X]

2. Coherent view of memory – a read by processor P1 of location X, following a write by processor P2 to X, returns the written value, if the write and the read are sufficiently separated in time and no other writes to X occur between the two accesses.

    P2: write A to Mem[X]   // code running on P2
    ...
    P1: read Mem[X]         // code running on P1

3. Serialization notion – writes to the same location are serialized, i.e., two writes to the same location by any two processors are seen in the same order by all processors. Consider the following: if the value 1 and then the value 2 are written to a location, processors can never read the location as 2 and then later read it as 1.

    P2: write 1 to Mem[Y]   // code running on P2
    P1: write 2 to Mem[Y]   // code running on P1

Cache Coherence Protocols

There are some basic schemes to enforce coherence. One is to keep track of the sharing status of each data block, using a status bit associated with that block in the cache. Another one is a hardware-based solution for multiprocessors to maintain cache coherence, named cache coherence protocols. Mainly, there exist the snooping and the directory-based protocols. In snooping, every cache containing a copy of the data from a physical memory block tracks the sharing status of that block. In the directory-based protocol, the sharing status of a particular block of physical memory is kept in one location, the directory. Generally, SMP uses a single directory and DSM uses distributed directories.

Snooping Coherence Protocols

Each cache has a copy of both the memory block data and the "share status" of the block, e.g., shared/non-shared. As the caches also share the memory bus, they snoop on the memory traffic to check whether they have copies of the "in-transit" block.
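To make this per-block bookkeeping concrete, the following is a minimal sketch of the metadata a snooping cache could keep for each block and the check performed when an address is observed on the bus. It is only an illustration of the idea; the type and function names are hypothetical and not from the lecture notes.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical per-block metadata kept by a snooping cache. */
    typedef struct {
        uint64_t tag;    /* which memory block is cached here                 */
        bool     valid;  /* block present in this cache                       */
        bool     shared; /* sharing status bit: other caches may hold a copy  */
        bool     dirty;  /* local copy modified (write back pending)          */
    } cache_block_t;

    /* Snoop check: does this cache hold the block whose address
     * was just observed "in transit" on the memory bus?          */
    bool snoop_hit(const cache_block_t *blk, uint64_t bus_tag)
    {
        return blk->valid && blk->tag == bus_tag;
    }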
Snooping can be based on the write invalidate protocol, or on the write update protocol, also known as the write broadcast protocol.

Write Invalidate Protocol. In this case, writing to a shared block invalidates the copies of that block in the other caches. Thus, when trying to access an invalidated block, there is a cache miss, and the correct/most up-to-date data comes from the "dirty" cache block. Finally, the protocol handles the memory update following the write back approach. Writing to non-shared blocks does not cause any problem, since there is no possibility of coherence loss within non-shared blocks.

What about the write through approach? The memory would always be coherent when following the write through approach. In that case, accessing/reading an invalidated block would not require a write to memory, i.e., read misses do not cause writes to memory; the block is always fetched from memory. However, a read miss in the write back approach causes a memory write if the block to be replaced has its dirty bit set.

Write invalidate protocol example. Let's assume that neither cache initially holds the data block from memory location X and that the value of X in memory is equal to 0. This is illustrated in Fig. 8.3. The illustration shows the values in the processors' caches and in memory after both the processor activity and the bus activity have been completed. A blank entry stands for "no activity" or "no copy cached".

As processor A reads X, there is a cache miss and this is communicated on the bus. The data block from memory is then cached. In the same way as processor A, processor B reads memory location X, and that cache miss is also communicated through the bus. The data is cached. Next, processor A writes the value 1 to memory location X. The new data goes only to its own cache at this moment; remember, this follows the write back approach. And, considering that memory location X was shared, an "invalidation for X" is communicated on the bus. Finally, processor B reads X again, but this time its copy of X was invalidated by processor A's last write. Then, it is a cache miss for X. But the data block will come from processor A instead of from memory, as processor A cancels the response from memory since it has the correct data value. At last, both processor B's cache and the memory location X are updated.

Figure 8.3: Write invalidate protocol example assuming write back caches, to have a better performance.

Write Update Protocol. In this case, the difference is only in the write fashion. The write update protocol updates all the cached copies of a data item when that item is written. It must therefore broadcast all writes to shared cache lines, which consumes considerably more bandwidth than the write invalidate protocol. Therefore, virtually all recent multiprocessors have opted to implement a write invalidate protocol.

Brief Protocols Comparison. Points in favor of the write invalidate protocol:

1. Multiple writes to the same word with no intervening reads require multiple broadcasts in the write update protocol, but just one initial block invalidation in the write invalidate protocol, as seen in the previous example;

2. Each word written in a cache block requires a write broadcast in the write update protocol, although only the first write to any word within the block needs to generate an invalidation. The write invalidate protocol acts on cache blocks, while the write update protocol must act on individual words; and

3. The write invalidate protocol tends to generate less traffic on the memory bus. Remember that the memory bus can be seen as a bottleneck in SMP.

Point in favor of the write update protocol:

1. The delay between writing a word on one processor and reading the written value on another processor is smaller in the write update protocol, as the written data is updated immediately in the reader's cache.
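The traffic argument can be made concrete with a back-of-the-envelope model. The sketch below is my own simplified count, not from the lecture notes, of bus messages generated by n consecutive writes by one processor to the same shared block, assuming no intervening reads from other processors.

    #include <stdio.h>

    /* Write invalidate: one invalidation on the first write; later
     * writes hit the now-exclusive block and stay local.           */
    unsigned bus_msgs_invalidate(unsigned n_writes)
    {
        return n_writes > 0 ? 1 : 0;
    }

    /* Write update: every write must be broadcast to keep the
     * other cached copies up to date.                              */
    unsigned bus_msgs_update(unsigned n_writes)
    {
        return n_writes;
    }

    int main(void)
    {
        unsigned n = 8;
        printf("invalidate: %u bus messages, update: %u bus messages\n",
               bus_msgs_invalidate(n), bus_msgs_update(n));
        return 0;
    }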
Write Invalidate Implementation. Regarding block invalidation, the key to the implementation is to get access to the memory bus and to use this access to invalidate a block, i.e., the processor sends the block address through the bus. The other processors are snooping on the bus and watching whether they have that block in their caches, i.e., by checking the address and invalidating the block. The mechanism relies on serialized writes: the need to get access to the bus, as an exclusive resource, forces the serialization of the writes.

When locating a data item on a cache miss, write through or write back based caches can be used. With write through caches, all written data are always sent to memory, so the most recent value of a data item can always be fetched from memory. With write back caches, if the block is clean (shared), the protocol simply acts like the write through approach; otherwise, if the block is dirty (modified/exclusive), the protocol sends the dirty block in response to the read request and aborts the memory read/fetch operation.

Considering this context, a simple protocol with three states is discussed next. The protocol states are:

1. invalid – the block is considered as not present in the cache;

2. shared – the block is present in one or more local/private caches (potentially shared), but not necessarily in the cache of the processor that requested the data;

3. modified or exclusive – the block was updated in a local/private cache. Here, the most up-to-date information is present in only one cache, but not necessarily in the cache of the processor that requested the data.

The state changes from invalid to shared already on the first read of the block, even if there is only one copy. And, on the first write, it becomes exclusive. Fig. 8.4 illustrates the states along with the triggers activating the state transitions. These stimulus actions are applicable to a private cache block, rather than to a specific address in the cache. For the sake of clarity, Fig. 8.4 shows the same finite-state machine from two different perspectives: one based on the "requests from the CPU" (left-hand side), and the other considering the "requests from the bus" perspective (right-hand side). The finite-state machine controller is usually implemented in each core. Table 8.1 details and explains the state transitions from the CPU perspective. Table 8.2 details and explains the state transitions from the bus perspective.

Figure 8.4: Write invalidate, cache coherence protocol FSM for a private write back cache. Hits and misses in the local cache.

Table 8.1: Finite-state machine cache transition details based on requests from the CPU.

  Addressed cache block's state | Trigger            | Cache action type | Details
  Invalid                       | CPU read miss      | Regular miss      | Place read miss on bus.
  Invalid                       | CPU write miss     | Regular miss      | Place write miss on bus.
  Shared                        | CPU read hit       | Regular hit       | Read data in local cache.
  Shared                        | CPU read miss      | Replacement       | Place read miss on bus. There was an address conflict, needing a block replacement.
  Shared                        | CPU write hit      | Coherence         | Place invalidate on bus. There is no data fetch here, as this is mainly intended to change state.
  Shared                        | CPU write miss     | Replacement       | Place write miss on bus. There was an address conflict, needing a block replacement.
  Exclusive                     | CPU read/write hit | Regular hit       | Read/write data in local cache.
  Exclusive                     | CPU write miss     | Replacement       | Write back the cache block, and place a write miss on bus. There was an address conflict, needing a block replacement.
  Exclusive                     | CPU read miss      | Replacement       | Write back the cache block, and place a read miss on bus. There was an address conflict, needing a block replacement.

Table 8.2: Finite-state machine cache transition details based on requests from the bus.

  Addressed cache block's state | Trigger                   | Cache action type | Details
  Shared                        | Write miss for this block | Coherence         | Invalidate this block, as there was an attempt to write a shared block.
  Shared                        | Invalidate for this block | Coherence         | Invalidate this block, as there was an attempt to write a shared block.
  Shared                        | Read miss for this block  | No action         | The shared cache or memory will service this read miss.
  Exclusive                     | Write miss for this block | Coherence         | Write back the cache block, and abort the memory access. There was an attempt to write a block that is exclusive elsewhere.
  Exclusive                     | Read miss for this block  | Coherence         | Place the cache block on the bus, write back the block, and abort the memory access. There was an attempt to read shared data.
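The three-state controller described by Tables 8.1 and 8.2 can be sketched in a few lines of C. This is a minimal illustration of the transitions only, assuming one private block and ignoring addresses and data; the names and the printed "bus actions" are hypothetical, not an actual controller implementation.

    #include <stdio.h>

    /* States of the simple write invalidate protocol. */
    typedef enum { INVALID, SHARED, EXCLUSIVE } block_state_t;
    typedef enum { CPU_READ, CPU_WRITE } cpu_op_t;

    /* CPU-side transition for one private cache block (cf. Table 8.1).
     * `hit` tells whether the request hit this block.                   */
    block_state_t cpu_request(block_state_t s, cpu_op_t op, int hit)
    {
        if (!hit) {                      /* regular miss or replacement      */
            if (s == EXCLUSIVE)
                puts("write back dirty block");
            puts(op == CPU_READ ? "place read miss on bus"
                                : "place write miss on bus");
            return op == CPU_READ ? SHARED : EXCLUSIVE;
        }
        if (s == SHARED && op == CPU_WRITE) {   /* coherence action          */
            puts("place invalidate on bus");
            return EXCLUSIVE;
        }
        return s;                        /* read/write hit: state unchanged   */
    }

    /* Bus-side transition (cf. Table 8.2). `wants_write` is nonzero for a
     * write miss or invalidate observed on the bus, zero for a read miss. */
    block_state_t bus_request(block_state_t s, int wants_write)
    {
        if (s == SHARED && wants_write)
            return INVALID;              /* another processor will write       */
        if (s == EXCLUSIVE) {
            puts("place block on bus, write back, abort memory access");
            return wants_write ? INVALID : SHARED;
        }
        return s;                        /* otherwise, no action               */
    }

    int main(void)
    {
        block_state_t s = INVALID;
        s = cpu_request(s, CPU_READ, 0);   /* read miss: INVALID -> SHARED    */
        s = cpu_request(s, CPU_WRITE, 1);  /* write hit: SHARED -> EXCLUSIVE  */
        s = bus_request(s, 1);             /* remote write miss: -> INVALID   */
        printf("final state: %d\n", s);
        return 0;
    }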
Directory-based Protocol

Is there any cache coherence problem with respect to DSM, i.e., with a shared address space? In DSM, the memory bandwidth is increased as the memory becomes distributed. This separates the local memory traffic from the remote traffic and reduces the bandwidth demands on the overall memory system and interconnection network. When using snooping-based protocols in DSM, a considerable amount of messages would be broadcast on the bus, which would negate the benefits of the DSM memory organization. The need for broadcasting messages has to be eliminated.

Given this, the directory protocol is an alternative to the snooping-based coherence protocol. A directory keeps the state of every block that may be cached. The information contained in the directory includes which caches have copies of the block, whether it is dirty, and the block status, i.e., shared, uncached, or modified. The solution is based on distributing the directory along with the memory, as shown in Fig. 8.5, since a single directory is not a scalable solution, especially in a multicore environment. Therefore, different coherence requests can go to different directories, just as different memory requests go to different memories. Each directory is responsible for tracking the caches that share the memory addresses of the memory portion in its node.

Figure 8.5: Directory added to each node to implement cache coherence in DSM.

The directory-based protocol's state diagrams are the same as those used in snooping, as depicted in Fig. 8.4. However, the implementation is based on message passing between nodes, rather than snooping on the bus. The protocol needs to know which node has the block in order to perform the invalidation. In this case, each node keeps information on which processors have a copy of each of its memory blocks: a field with an associated bit for each system processor, for each memory block.
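A directory entry with one presence bit per processor could look like the sketch below. This is only an illustration of the data structure implied by the text; the names, the fixed processor count, and the callback for sending invalidation messages are assumptions of mine, not part of the lecture notes.

    #include <stdint.h>

    #define NUM_PROCESSORS 64   /* hypothetical system size */

    /* Block status as seen by its home directory. */
    typedef enum { UNCACHED, DIR_SHARED, DIR_MODIFIED } dir_state_t;

    /* One directory entry per memory block of the node. The presence
     * field holds one bit per processor, set when that processor's
     * cache holds a copy of the block.                               */
    typedef struct {
        dir_state_t state;
        uint64_t    presence;   /* bit i set => processor i has a copy */
    } dir_entry_t;

    /* On a write request, the home node sends an invalidation message
     * to every processor whose presence bit is set (sketch only); the
     * caller then records the writer as the new exclusive holder.     */
    void invalidate_sharers(dir_entry_t *e, void (*send_inval)(int proc))
    {
        for (int p = 0; p < NUM_PROCESSORS; p++)
            if (e->presence & (1ULL << p))
                send_inval(p);
        e->presence = 0;
    }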
Multicomputers

Is there any cache coherence problem with respect to multicomputers based on message passing? In this case, as long as the definition of multicomputers is "processors with independent memories and address spaces", no other computer writes to a given computer's memory, only the computer itself. Thus, there is no coherence problem between computers. However, inside each computer, the problems with cache coherence are still present.

Thread-Level Parallelism

Overview

Multithreading is exploited and used more and more, as ILP is already very well explored and limited nowadays. Given this, threads are a very common way of explicit parallelism, "controlled" by the programmer. They are also a very different approach compared with the transparency found in ILP and vector instructions. However, multithreading comes together with these other advancements in performance.

Multithreading allows multiple threads to share the functional units of a single processor in an overlapping way. A more general method to exploit thread-level parallelism (TLP) considers a multiprocessor with multiple independent threads operating at once and in parallel. Multithreading does not duplicate the entire processor as a multiprocessor does. Instead, multithreading shares most of the processor core among a set of threads and duplicates only the private state, i.e., the registers and the program counter.

Hardware Approaches for Multithreading

There are some approaches when dealing with multithreading, including fine-grained, coarse-grained, and simultaneous multithreading. The fine-grained approach switches between threads on each clock cycle, causing the execution of instructions from multiple threads to be interleaved; it uses round-robin scheduling and skips stalled threads. The coarse-grained approach switches threads only on costly stalls, e.g., L2 or L3 cache misses. Finally, the simultaneous multithreading (SMT) approach is the most common implementation. It is a variation of fine-grained multithreading implemented on top of a multiple-issue, dynamically scheduled processor. SMT is where superscalar processors and multithreading meet: the register renaming and dynamic scheduling allow multiple instructions from independent threads to be executed without noticing dependencies among them.

Fig. 8.6 illustrates a view of a superscalar processor with no multithreading (MT) support, along with a superscalar processor with coarse and fine MT, and a superscalar processor with SMT. The Intel Core i7 and IBM Power7 processors use SMT.

Figure 8.6: Conceptual illustration, greatly simplified. Shades of gray correspond to 4 different threads in the multithreading processors; a white box indicates that the corresponding execution slot is unused in that clock cycle. In the superscalar, only one thread can be executed. The coarse MT is able to run two different threads in the considered timeframe. Fine MT can execute all four threads by switching between them on each clock cycle. Finally, SMT can execute up to two threads in parallel. Notice also the number of "white boxes" in each case.
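As the overview notes, thread-level parallelism is explicit and controlled by the programmer. The sketch below is my own minimal illustration of that pattern using POSIX threads (not part of the lecture notes): shared read-only data, a private per-thread task with its own result, and an explicit split of the work across threads.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define N 64

    /* Each thread sums its own slice of the array: the programmer
     * decides how the work is divided (explicit TLP).              */
    typedef struct {
        const int *data;
        int        begin, end;   /* slice boundaries              */
        long       partial_sum;  /* private result of this thread */
    } task_t;

    static void *worker(void *arg)
    {
        task_t *t = arg;
        t->partial_sum = 0;
        for (int i = t->begin; i < t->end; i++)
            t->partial_sum += t->data[i];
        return NULL;
    }

    int main(void)
    {
        int data[N];
        for (int i = 0; i < N; i++)
            data[i] = i;

        pthread_t threads[NUM_THREADS];
        task_t    tasks[NUM_THREADS];
        int chunk = N / NUM_THREADS;

        for (int t = 0; t < NUM_THREADS; t++) {
            tasks[t] = (task_t){ data, t * chunk, (t + 1) * chunk, 0 };
            pthread_create(&threads[t], NULL, worker, &tasks[t]);
        }

        long total = 0;
        for (int t = 0; t < NUM_THREADS; t++) {
            pthread_join(threads[t], NULL);
            total += tasks[t].partial_sum;
        }
        printf("total = %ld\n", total);   /* expect 2016 for 0..63 */
        return 0;
    }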
