Cache Coherence Protocols PDF

Summary

This document discusses advanced issues in cache coherence protocol design, focusing on directory protocols as a scalable alternative to broadcast/snoopy protocols, along with contemporary multicore design issues.

Full Transcript


In Chapter 7, we discussed basic issues in cache coherence protocols, focusing on the design of broadcast/snoopy protocols on various interconnects (shared bus and point-to-point). In this chapter, we will look into more advanced issues in cache coherence protocol design. One of the issues is how cache coherence protocols can scale to a larger system size. Broadcast and snoopy protocols hit a scalability limit relatively early because traffic and snoop frequency scale at least linearly with the number of processors, so the available interconnect bandwidth is quickly saturated by broadcast traffic. We will discuss directory cache coherence protocols, which allow for a scalable implementation by avoiding broadcasts. We will use them to discuss implementation issues for coherence protocols, such as how to deal with protocol races, the use of transient states, etc. Finally, we will discuss contemporary multicore design issues, such as dealing with imprecise directory information, whether coherence should be tracked at a single or multiple granularities, how coherence can be designed to allow a multicore system to be partitioned, and how thread migration cost may be reduced.

10.1 Directory Coherence Protocols

The most scalable organization uses a point-to-point interconnection and, rather than relying on broadcasting and snooping, relies on a directory approach. In a directory approach, the information about which caches have a copy of a block is maintained in a structure called the directory. With such an approach, locating cached copies of a block does not rely on broadcasting a request to all processors, but rather on inquiring the directory about which caches may have copies of the block, and sending the request only to those caches. If a block is cached by all processors, then the request must be sent to all processors, which yields traffic comparable to a broadcast/snoopy approach. However, if a block is cached by only a few processors, the traffic savings of the directory approach over a broadcast/snoopy approach are substantial.

Which case is more likely in reality? Well-tuned applications exhibit little data sharing for blocks that are frequently read and written, because if there were a lot of such data, the applications would suffer from excessive serialization due to the critical sections needed to protect accesses to the data, as well as from coherence misses due to invalidations of the data. Thus, in well-tuned applications, most data sharing occurs for data that is read only, and there is little sharing of data that is frequently read and written. For such data, a directory approach can yield substantial traffic savings compared to the broadcast/snoopy approach because most writes will cause only a few caches to be invalidated. One case in which there may be little traffic saving is highly-contended synchronization, in which a variable is frequently read and written by all processors. But again, this case implies a high number of invalidations, which restricts performance and scalability in other ways and cannot be solved by using a bus either.
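For intuition, the traffic difference can be sketched with a simple message count: a broadcast/snoopy write must be seen by every other cache, while a directory write only contacts the directory plus the current sharers. The following back-of-the-envelope sketch uses an illustrative message-count model that is not taken from the text:

```python
def broadcast_write_msgs(num_caches: int) -> int:
    """Broadcast/snoopy: the write request is observed by every other cache."""
    return num_caches - 1

def directory_write_msgs(num_sharers: int) -> int:
    """Directory: one request to the directory, then an invalidation and an
    acknowledgement per current sharer (acks folded in for simplicity)."""
    return 1 + 2 * num_sharers

# A block shared by only 2 of 64 caches: the directory avoids most of the traffic.
print(broadcast_write_msgs(64))   # 63 messages reach other caches under broadcast
print(directory_write_msgs(2))    # 5 messages under the directory protocol
```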
10.2 Overview of Directory Coherence Protocol

Figure 10.1 shows the basic working of a directory cache coherence protocol. The top diagram shows how a read request is handled, while the bottom diagram shows how a write request is handled. Let us start with the top diagram. Initially, assume that a block B is cached in the modified state in the cache belonging to processor P3, and is uncached elsewhere. Suppose that P0 wants to read block B and suffers a cache miss. The miss triggers an inquiry to the directory (Step 1). Since the cached copy is the only valid copy in the system, the data block B must be obtained from P3's cache. The directory contains the information that enables this: it records that B is cached in the modified state at P3. The directory therefore forwards the request (or intervenes) to P3's cache (Step 2). Finally, P3's cache responds by supplying the block to the requester (Step 3), either directly or through the directory. Note that in this case, the directory has saved traffic compared to a broadcast coherence protocol. A broadcast protocol would have resulted in P0 sending its read miss request to all other processors/caches, since it does not know which cache may have a copy of the block.

The bottom part of Figure 10.1 shows how a write request is handled, assuming that initially a block C is cached in the shared state in P1's, P2's, and P3's caches. First, the requestor (P1) sends a request to upgrade the state of block C to the directory (Step 1). The requestor may or may not already have the block in its cache (the figure shows the case in which the requestor has the block in its cache). The directory looks up its information and discovers that, other than the requestor, only P2 and P3 have a copy of block C. Thus, it sends invalidations to the sharers P2 and P3 (Step 2). The sharers (P2 and P3) respond by sending an invalidation acknowledgment either to the requestor or to the directory, indicating that they have received the invalidation request and have invalidated the block (Step 3). As in the prior case, the directory has saved traffic compared to a broadcast coherence protocol, which would have sent invalidations to all other processors/caches.
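The two cases in Figure 10.1 boil down to simple directory-side logic: on a read miss, forward the request to the owner if the block is dirty somewhere; on a write/upgrade, invalidate every sharer other than the requestor. Below is a minimal sketch of that decision logic, with simplified states and illustrative class/method names that are not from the text:

```python
class DirectoryEntry:
    """Tracks, for one memory block, which caches hold it and in what state."""
    def __init__(self):
        self.state = "U"       # "U" = uncached, "S" = shared, "M" = modified
        self.sharers = set()   # IDs of caches currently holding a copy

    def handle_read_miss(self, requestor):
        if self.state == "M":
            owner = next(iter(self.sharers))
            # Step 2: intervene at the owner, which supplies the block (Step 3).
            messages = [("intervene", owner)]
            self.state = "S"
            self.sharers.add(requestor)
            return messages
        # Block is clean or uncached: the directory/memory supplies it directly.
        self.state = "S"
        self.sharers.add(requestor)
        return [("data_reply", requestor)]

    def handle_write(self, requestor):
        # Step 2: invalidate every sharer other than the requestor.
        messages = [("invalidate", s) for s in self.sharers if s != requestor]
        self.state = "M"
        self.sharers = {requestor}
        return messages

# Figure 10.1, bottom: block C shared by P1, P2, P3; P1 requests an upgrade.
entry = DirectoryEntry()
entry.state, entry.sharers = "S", {"P1", "P2", "P3"}
print(entry.handle_write("P1"))   # invalidations go only to P2 and P3
```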
10.2.1 Directory Format and Location

In the previous section, we discussed the basic protocol for serving a read or write request. Essentially, when the read/write request reaches the directory, the directory must know: (1) which cache(s) keep a copy of the block, and (2) in what state the block is cached. To satisfy the first requirement, the directory ideally reflects the state of the block in the caches. For example, if the caches use MESI states for each block, then ideally the directory should also keep a MESI state that perfectly reflects the state of the block in the cache(s). However, the directory state cannot always fully reflect the cache state of a block. For example, in MESI, if a block is in the exclusive state in a local cache, the processor can write to it silently, since no other cached copies exist. Hence, the block state changes from exclusive to modified silently, and the block value also changes silently. The directory never sees this transition from exclusive to modified, and thus cannot distinguish between a block cached in the exclusive state and one cached in the modified state. Therefore, to cater to MESI cache states, the directory can keep only three states: exclusive/modified (EM), shared (S), or invalid (I).

To satisfy the requirement of keeping the information of which caches have a copy of a block, there are several possible formats the directory can keep. The important factors in choosing a format for a particular case are the amount of storage overhead and whether the information can overflow the allocated storage format. One format is to keep a bit corresponding to each cache that indicates whether the block is cached there or not. This format is referred to as the full bit vector directory format. If there are p caches, each block in memory requires p bits of information. Note that since the cache block size does not change with the number of caches, the ratio of directory storage overhead to the storage for a data block grows linearly with the number of caches. The overhead can be quite high for a larger machine. For example, assuming a cache block size of 64 bytes (512 bits), with 64 caches the storage overhead ratio is 64/512 = 12.5%, which is substantial. It becomes worse quickly with a larger number of caches. For 1024 caches, the storage overhead ratio becomes 1024/512 = 200%. If the directory information is kept in the main memory, that means that two thirds of the main memory capacity is used up for storing directory information, and only one third is available for storing data. Clearly, the full bit vector is too costly to use in large systems.

A simple way to reduce the storage overhead of the directory is to use one bit to represent a group of caches, rather than a single cache. If any cache in a group has a copy of a block, the bit corresponding to the group is set to "1"; if the block is not cached anywhere in the group, the bit is set to "0". This format is referred to as the coarse bit vector directory format. Since the information is kept per group, when there is an invalidation, the invalidation must be sent to all caches in the group, even when only one cache in the group has the block. Thus, traffic increases compared to the full bit vector approach, but the storage requirement for the directory decreases. If the number of caches per group is g, the sharing bit vector requires p/g bits per block. In the example above, for a 1024-cache system with a 64-byte block size and a group size of 4, the per-block vector shrinks to 1024/4 = 256 bits, lowering the storage overhead ratio to 256/512 = 50%.

A different approach to reducing the storage overhead is to not keep full directory information for all blocks in the main memory, but instead keep partial directory information, or keep full directory information for only a limited number of blocks. One format that uses this approach is the limited pointer format, which keeps a limited number of pointers to sharers in the system. For example, for a given block in memory, suppose we keep pointers to only up to n caches. Each pointer requires log2(p) bits to represent the ID of the node, so the storage requirement per block is n × log2(p). For a 1024-node system with a 64-byte block size and 8 pointers per block, the storage overhead ratio is (8 × log2(1024))/512 = 80/512 ≈ 15.6%, which is significantly lower than the full bit vector format. Of course, the amount of storage overhead reduction depends on how many pointers are kept. However, a particular block is rarely cached in many locations, so most of the time four to eight pointers are sufficient.
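These overhead ratios follow directly from the bit counts above; the small script below makes the comparison concrete (function names are illustrative only):

```python
import math

BLOCK_BITS = 64 * 8   # a 64-byte cache block holds 512 bits of data

def full_bit_vector_overhead(p):
    """One presence bit per cache: p bits per block."""
    return p / BLOCK_BITS

def coarse_bit_vector_overhead(p, g):
    """One presence bit per group of g caches: p/g bits per block."""
    return (p / g) / BLOCK_BITS

def limited_pointer_overhead(p, n):
    """n pointers of log2(p) bits each."""
    return n * math.log2(p) / BLOCK_BITS

print(f"{full_bit_vector_overhead(64):.1%}")         # 12.5%
print(f"{full_bit_vector_overhead(1024):.1%}")       # 200.0%
print(f"{coarse_bit_vector_overhead(1024, 4):.1%}")  # 50.0%
print(f"{limited_pointer_overhead(1024, 8):.1%}")    # 15.6%
```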
One question is how the directory handles the rare case in which the number of sharers exceeds the number of pointers, so that the directory runs out of IDs it can keep. There are several strategies. One strategy is to not allow the case to occur: when a cache wants to keep a copy of a block but all pointers are in use, the directory sends an invalidation to a cache that currently keeps a copy of the block, to make room for the ID of the new cache. This strategy can backfire when a block has many sharers because it contains data used for global synchronization, such as a centralized barrier. In that case, many sharers are naturally present, and when the number of sharers is kept artificially low, the barrier can become significantly slower. Other strategies, such as reverting to broadcast when the number of sharers exceeds the number of pointers, suffer from a similar problem. Another possible strategy is to keep a pool of free pointers that can be allocated when such heavy sharing occurs. Since not many blocks are heavily shared, the number of such free pointers does not need to be large.

The limited pointer format keeps limited directory information (when overflowed) for all blocks in memory. Another way to rely on partial information is to keep full directory information for only a subset of blocks in memory. Note that in a typical configuration, the number of blocks in the main memory greatly exceeds the number of blocks that can be cached, because the main memory size greatly exceeds the cache size. Thus, for most blocks, the directory state will indicate that they are not cached at all. One way to compress the directory storage is to remove the directory information for such blocks, and keep directory information only for blocks that are cached. This leads to a format referred to as the sparse directory format. In the sparse directory format, when there is a request for a block, the request is checked against the entries in a directory cache. If the entry for the block is found in the cache, the directory information is retrieved. If it is not found, the block must not have been cached before, so a new entry is created in the directory cache for that block. With a sufficient directory cache size, the directory information of all cached blocks should fit in it. There is, however, a risk of overflow. The reason is that a clean block may be replaced silently from a cache, without notifying the directory cache. Thus, the directory cache may still keep the directory entry for that block, preventing the entry from being used for a block that is actually cached. When the directory cache runs out of space, it may evict an entry that corresponds to a block that is still cached, while keeping entries corresponding to blocks that are no longer cached. Therefore, an overflow is a possibility, and it must be handled with strategies similar to those used for handling the overflow of the limited pointer scheme.
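As one illustration, the "do not allow overflow" strategy for the limited pointer format can be sketched as follows: when all pointers are in use and a new sharer arrives, an existing sharer is invalidated to free a pointer. The class name and the choice of victim below are illustrative assumptions, not details from the text:

```python
class LimitedPointerEntry:
    """Directory entry that tracks at most max_ptrs sharer IDs per block."""
    def __init__(self, max_ptrs=8):
        self.max_ptrs = max_ptrs
        self.sharers = []   # cache IDs currently pointed to

    def add_sharer(self, cache_id):
        """Record a new sharer, invalidating an existing one if no pointer is free."""
        invalidations = []
        if cache_id in self.sharers:
            return invalidations
        if len(self.sharers) == self.max_ptrs:
            # Overflow: evict an arbitrary current sharer to make room.
            victim = self.sharers.pop(0)
            invalidations.append(("invalidate", victim))
        self.sharers.append(cache_id)
        return invalidations

entry = LimitedPointerEntry(max_ptrs=2)
for cache in ["P0", "P1", "P2"]:
    print(cache, entry.add_sharer(cache))
# The third sharer forces an invalidation of P0, keeping the pointer count at 2.
```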
Another issue in the design of the directory is where it is located. There are several aspects to this. One aspect is whether the directory resides in a single location (a centralized configuration) or is split into parts, each responsible for the addresses it keeps track of (a distributed configuration). A centralized directory is simple to design. However, anything centralized eventually becomes a bottleneck to scalability: all cache miss requests go to the same place, and all invalidations are sent out from the same place. Thus, a scalable implementation of a directory requires a distributed organization. How the directory should be distributed, however, is still an open question, and the decision is affected by several factors. One factor is the minimum number of parts needed to sufficiently diffuse traffic. There is no convenient rule of thumb here, as the answer depends on the characteristics of the network, e.g., topology, bandwidth, latency, etc. Another factor is that, in the event that a block is not cached, the proximity of the directory to the supplier of data is important, as it determines the cache miss latency. Let us consider a few examples, shown in Figure 10.2.
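In a distributed configuration, each directory slice is responsible for a portion of the block addresses. A common assumption (not specified in the text) is to interleave blocks across the directory slices by block address, as in this minimal sketch with illustrative constants:

```python
BLOCK_SIZE = 64         # bytes per cache block, matching the earlier examples
NUM_DIRECTORIES = 16    # number of distributed directory slices (illustrative)

def home_directory(addr: int) -> int:
    """Map a physical address to the directory slice that tracks its block."""
    block_number = addr // BLOCK_SIZE
    return block_number % NUM_DIRECTORIES

print(home_directory(0x0000))  # 0
print(home_directory(0x0040))  # 1: the next 64-byte block maps to the next slice
```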
