Chapter 4: Memory Systems

Overview

The von Neumann Bottleneck

The main memory is considered the bottleneck of the von Neumann architecture. This is because the main memory's response time can be much longer than the processor's execution time. What does the processor do if it does not receive instructions as fast as it can process them? Is this performance gap decreasing or increasing? These issues are addressed in the next sections.

Memory and Processor: Performance Gap

Fig. 4.1 shows two performance curves, from 1980 to 2015, for the processor and the memory. Notice that processor performance improved much more than memory performance. This holds even though the processor curve became almost flat from 2005 on.

Figure 4.1: Performance gap measured as the difference in time between processor memory requests (for a single processor or core) and the latency of a dynamic random-access memory - DRAM access.

Other Considerations

Why can all the memory in a computer not be built with the same technology used inside the CPU? Basically because of cost: the volume of information to store in memory is very large compared to what the processor holds, so building it all from CPU technology would be very expensive. Also, CPU technology typically produces volatile memories, and a lot of information must persist even after the machine is powered off; that is the role of non-volatile memories. On the other hand, volatile memories tend to be faster than non-volatile memories. Another key fact is that much of the stored information is not used over short periods of time.

Principle of Locality

The principle of locality states that programs tend to reuse both data and instructions they have used recently. It is the most important program property regularly exploited. A rule of thumb says that a program spends 90% of its execution time in only 10% of its code. This also applies to data, although not as strongly as to instructions. Given this, it is possible to predict, with reasonable accuracy, which instructions and data a program will use in the near future based on its accesses in the recent past.

Generally, two types of locality are observed. Temporal locality states that recently accessed items are likely to be accessed again soon. Spatial locality says that items whose addresses are near one another tend to be referenced close together in time.

How to take advantage of this characteristic to improve performance? The principle of locality, together with other guidelines (for a given implementation technology and power budget, smaller hardware can be made faster), led to memory hierarchies with different latencies and sizes.

Memory Hierarchy

Overview

Fig. 4.2 illustrates a memory hierarchy for personal mobile devices, with information on size and speed. The memory closest to the processor is the register file, and the hierarchy goes all the way down to storage. As the memory gets farther from the processor, its speed decreases (from picoseconds to microseconds) and its size increases (from Bytes to GigaBytes).

Figure 4.2: Personal mobile devices. The memory in the next lower level becomes slower and larger when moving farther away from the processor.

In the same sense, Fig. 4.3 presents a memory hierarchy for laptops. Notice that now the memory system has a third level of cache, the L3.

Figure 4.3: Laptops. The memory in the next lower level becomes slower and larger when moving farther away from the processor.
In turn, Fig. 4.4 shows a memory hierarchy for servers. There, besides having an L3 cache with more capacity, servers can have both hard disks and flash memories as storage. Notice the difference in speed between disk and flash, from milliseconds to microseconds.

Figure 4.4: Servers. The memory in the next lower level becomes slower and larger when moving farther away from the processor.

Based on all this information on memory systems and hierarchy, some questions arise. (a) Should the higher levels always be faster, and the lower ones larger? (b) In what situation is it advantageous to introduce new levels into the memory hierarchy? (c) When placing information in the higher levels, should this information be deleted from the lower levels?

Faster memory is more expensive. Therefore, a memory hierarchy is organized into several levels, each one smaller, faster, and more expensive per byte than the next lower level, which is farther from the processor. The memory hierarchy aims to provide a memory system with a cost per byte almost as low as the cheapest level of memory, and with a speed almost as fast as the fastest level.

Inclusion Property

In most cases, the data contained in a lower level is a superset of the data in the next higher level. (Not in all cases, though: modern multiprocessors with small L1 caches and different block sizes have sometimes chosen not to enforce inclusion.) The inclusion property is always required for the lowest level of the hierarchy, which can be the main memory in the case of caches, or the secondary storage (disk or flash) in the case of virtual memory.

Visible Memory Levels

Regarding the visible memory levels, the CPU registers are the first level. Registers are the CPU working memory for temporarily holding instructions and data, in the form of general-purpose registers - GPR (e.g., r1, r2, r3) and special-purpose registers - SPR (e.g., AR, DR, IR, SP, PC).

Compared to the CPU registers, the main memory technology is usually inferior and its storage capacity is much higher. This is because it must be large enough to store the running programs and their respective data.

The storage memory, in turn, is non-volatile storage for both code (i.e., instructions) and data. It is also used as an overflow memory (related to virtual memory). The storage is accessed by the CPU through specific controllers and I/O, which is a considerable gap with respect to the main memory, accessed through a memory bus, in addition to the gaps in size (TB vs. GB) and speed (µs vs. ns).

From the lowest level up to the processor, the memory size decreases from TB to Bytes, the speed increases from microseconds to picoseconds, and the cost increases.

Cache Memory

Overview

The cache memory is intended to store copies, for a short period of time, of the regions of main memory most frequently used by the CPU. Unlike other memories, the cache is transparent to programmers, even in assembly language: the programmer cannot address the cache in the regular sense.

Computers with multiple hierarchical cache levels are common today. Typically, the highest levels are in the same CPU chip (e.g., L1, L2, L3). In the multiprocessor case, several processors demand information from memory, so the cache is critical; multiprocessors thus face additional complexity to control accesses and keep the caches coherent.

The cache is often divided into a data cache and an instruction cache. As seen in previous chapters, this is intended to avoid structural hazards.
Basically, the next instruction to be executed is fetched from the instruction cache. If it is not found there, a new block is copied into the cache from the main memory. If the code has nested loops with a large number of iterations, a considerable number of main memory accesses can be spared thanks to the cache.

A cache hit happens whenever the processor finds the requested data in the cache. Conversely, a cache miss happens whenever the processor does not find a data item it needs in the cache.

Cache Info Example - Linux Commands

Listing 4.1 illustrates the command lscpu, which gives information about the caches in a computer.

Listing 4.1: Example of the command lscpu result.

% lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 158
Model name:            Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
Stepping:              9
CPU MHz:               900.184
CPU max MHz:           4200,0000
CPU min MHz:           800,0000
BogoMIPS:              7200.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7

Listing 4.2 gives the output of the command getconf, filtered to show the cache levels.

Listing 4.2: Example of the command getconf to get information about cache memory.

% getconf -a | grep CACHE
LEVEL1_ICACHE_SIZE       32768
LEVEL1_ICACHE_ASSOC      8
LEVEL1_ICACHE_LINESIZE   64
LEVEL1_DCACHE_SIZE       32768
LEVEL1_DCACHE_ASSOC      8
LEVEL1_DCACHE_LINESIZE   64
LEVEL2_CACHE_SIZE        262144
LEVEL2_CACHE_ASSOC       4
LEVEL2_CACHE_LINESIZE    64
LEVEL3_CACHE_SIZE        8388608
LEVEL3_CACHE_ASSOC       16
LEVEL3_CACHE_LINESIZE    64
LEVEL4_CACHE_SIZE        0
LEVEL4_CACHE_ASSOC       0
LEVEL4_CACHE_LINESIZE    0
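The same values reported by getconf can typically be queried from a C program through sysconf. Note that the _SC_LEVEL*_CACHE_* names are a glibc extension, not POSIX, so the minimal sketch below assumes a glibc-based Linux system; on other platforms these constants may not exist.

/* Minimal sketch: query cache parameters via sysconf (glibc extension). */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Each call returns the value (bytes, ways, or line size), or -1 if unknown. */
    long l1d_size  = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    long l1d_assoc = sysconf(_SC_LEVEL1_DCACHE_ASSOC);
    long l1d_line  = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    long l2_size   = sysconf(_SC_LEVEL2_CACHE_SIZE);
    long l3_size   = sysconf(_SC_LEVEL3_CACHE_SIZE);

    printf("L1d: %ld bytes, %ld-way, %ld-byte lines\n", l1d_size, l1d_assoc, l1d_line);
    printf("L2:  %ld bytes\n", l2_size);
    printf("L3:  %ld bytes\n", l3_size);
    return 0;
}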
Basic Hierarchy Operation

The levels of the memory hierarchy (cache, main memory, secondary storage) are divided into blocks (or lines), where a block groups multiple words. At any point in time, some blocks have copies in two or more levels. Every memory access request is presented first to the highest level. If the word is found there, access to the lower level is avoided. Otherwise, the requested word (actually, the block containing the word) is fetched from the lower level into the higher one.

The Four Memory Hierarchy Questions on Cache

Q1 - Where can a block be placed in the upper level? This is related to block placement.
Q2 - How is a block found if it is in the upper level? This is related to block identification.
Q3 - Which block should be replaced on a miss? This is related to block replacement.
Q4 - What happens on a write? This is related to the write strategy.

Q1 - Block Placement

In terms of block placement, there are three categories of cache organization: fully associative, direct mapped, and set associative.

In the fully associative organization, a block can be placed anywhere in the cache. The place is usually chosen by a replacement policy.

In the direct mapped organization, each block has only one place where it can appear in the cache. The mapping is given by Eq. (4.1).

(Block Address) mod (# of Blocks in Cache) (4.1)

In the set associative organization, a block can be placed in a restricted set (i.e., a group of blocks) in the cache. A block is first mapped onto a set, and then it can be placed anywhere within that set, according to Eq. (4.2).

(Block Address) mod (# of Sets in Cache) (4.2)

When there are n blocks in a set, the cache placement is called n-way set associative.

Fig. 4.5 shows an example of a fully associative cache, followed by a direct mapped and a two-way set associative cache, each with eight blocks. As illustrated, block address #12 from the main memory (highlighted in gray) can go anywhere in the fully associative cache; there are no placement restrictions. In the direct mapped cache with eight blocks, memory block address #12 can go only into cache block #4, according to Eq. (4.1). Finally, in the two-way set associative cache, memory block address #12 can go into set #0, blocks #0 or #1, as described by Eq. (4.2).

Figure 4.5: Example considering a cache with 8 blocks and a memory with 32 blocks. The set associative case considers 4 sets, i.e., two-way set associative.

Table 4.1 shows some pros and cons regarding these cache organizations.

Table 4.1: Pros and cons with respect to the categories of cache organization.
Organization      | Pros                                     | Cons
Fully associative | Full flexibility.                        | Complexity and implementation cost.
Direct mapped     | Simplicity.                              | Possible inefficiency due to inflexibility.
Set associative   | Middle ground between the previous two.  |

The direct mapped organization can be viewed as "one-way set associative", i.e., each set has just one block. The fully associative cache with m blocks can be viewed as "m-way set associative", i.e., one set with m blocks.

Q2 - Block Identification

The cache memory is divided into blocks with a fixed number of memory words. In the fully associative organization, the address is divided into two parts: (i) the block address, also known as the tag; and (ii) the word position inside the block, named the block offset, as shown in the top left-hand part of Fig. 4.6. The tag of every cache block is checked to verify whether it matches the block address coming from the processor. The number of bits of the block offset is given by Eq. (4.3), and the number of bits of the block address by Eq. (4.4).

#Bits Block Offset = log2(Block Size [in Bytes]) (4.3)

#Bits Block Address = (#Bits Memory Address) − (#Bits Block Offset) (4.4)

For the set associative and direct mapped organizations, the block frame address is divided into a tag field, which is the field compared on a cache hit, and an index field that selects the set (when the cache is set associative) or the block (when the cache is direct mapped), besides the block offset. This is illustrated in the bottom left-hand part of Fig. 4.6.

Figure 4.6: Fields of an address (i.e., tag and index) considering a set associative or direct mapped cache.

As previously seen, a fully associative cache does not need an index field; it has only the block address and the block offset. The number of bits of the index field is given by Eq. (4.5).

#Bits Index = (#Bits Memory Address) − (#Bits Tag) − (#Bits Block Offset) (4.5)
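The placement and field-width relations in Eqs. (4.1)-(4.5) are easy to check numerically. Below is a minimal C sketch, assuming a 32-bit address and power-of-two sizes, that reproduces the Fig. 4.5 placement of block address #12 (8 blocks, 4 sets) and prints the tag/index/offset widths for the 1 kB direct mapped cache with 32-Byte blocks worked out next; the helper names are illustrative, not part of any real cache controller.

/* Sketch: block placement (Eqs. 4.1 and 4.2) and address field widths
 * (Eqs. 4.3-4.5) for a cache with power-of-two geometry. */
#include <stdio.h>

static unsigned log2u(unsigned x) {        /* x assumed to be a power of two */
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    /* Fig. 4.5 scenario: 8-block cache, 4 sets (two-way set associative). */
    unsigned block_addr = 12;
    printf("direct mapped  : block %u -> cache block %u\n",
           block_addr, block_addr % 8);    /* Eq. (4.1): 12 mod 8 = 4 */
    printf("2-way set assoc: block %u -> set %u\n",
           block_addr, block_addr % 4);    /* Eq. (4.2): 12 mod 4 = 0 */

    /* Field widths for a 1 kB direct mapped cache with 32-Byte blocks. */
    unsigned addr_bits = 32, cache_bytes = 1024, block_bytes = 32;
    unsigned offset = log2u(block_bytes);                 /* Eq. (4.3)        */
    unsigned index  = log2u(cache_bytes / block_bytes);   /* # blocks in cache */
    unsigned tag    = addr_bits - index - offset;         /* Eq. (4.5)        */
    printf("offset=%u bits, index=%u bits, tag=%u bits\n", offset, index, tag);
    return 0;
}

The printed widths (5, 5, and 22 bits) match the worked example that follows.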
Example: consider a 1 kB direct mapped cache with 32-Byte blocks and a 32-bit memory address. For a 2^N-Byte cache, the layout is: from the MSB (most significant bit) side, (32 − N) bits form the tag; from the LSB (least significant bit) side, the M lowest bits form the block offset, where the block size is 2^M Bytes. Here, this is a 2^10-Byte direct mapped cache with 2^5-Byte blocks. Therefore:

bits [4:0] are the block offset → M = log2(Block Size) = 5 bits, as in Eq. (4.3)
bits [9:5] are the index → (32 − #Bits Tag − #Bits Block Offset) = 5 bits, as in Eq. (4.5)
bits [31:10] are the tag → (32 − N) = 22 bits

Example of block identification. Consider the tags, contents, and block numbers in Table 4.2. In one case, the address register AR holds 01173. Assuming the tag is 0117 and the block offset is 3, this is a cache hit (or simply, a hit). Therefore, the data register DR gets the value 12 (word number 3 in block number 1, as given by the block offset). In another case, the address register AR points to 01163. Assuming the tag is 0116 and the block offset is 3, this is a cache miss (or simply, a miss), since that tag value is not found in any block. Therefore, a new search is required in the lower level of the memory hierarchy.

Table 4.2: Cache with tags and their related contents. Tag 0116 is not present in this cache, i.e., no block has this tag address.
Tag  | Content                     | Block no.
7620 | 11 76 DB 05 2E ...          | 0
0117 | 35 8B F3 12 81 7D 25 AF ... | 1
3656 | 71 72 B8 AA ...             | 2
...  | ...                         | ...
1741 | 02 8C FF 36 ...             | m

The tag marks the memory address to which the cache block corresponds. The tag field uses an additional bit to indicate the validity of the block data, known as the valid bit. Tags are searched in parallel, since speed is critical in this context.
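A minimal sketch of this identification step is shown below, mirroring the Table 4.2 example: the tag extracted from the address is compared against every valid block tag, and a match yields the requested word. The tags and contents are the illustrative values from the table; array sizes and function names are assumptions for the sketch only.

/* Sketch: associative tag check, mirroring the Table 4.2 example. */
#include <stdio.h>

#define NBLOCKS 4
#define WORDS_PER_BLOCK 8

struct cache_block {
    int valid;
    unsigned tag;
    unsigned char data[WORDS_PER_BLOCK];
};

static struct cache_block cache[NBLOCKS] = {
    {1, 0x7620, {0x11, 0x76, 0xDB, 0x05, 0x2E}},
    {1, 0x0117, {0x35, 0x8B, 0xF3, 0x12, 0x81, 0x7D, 0x25, 0xAF}},
    {1, 0x3656, {0x71, 0x72, 0xB8, 0xAA}},
    {1, 0x1741, {0x02, 0x8C, 0xFF, 0x36}},
};

/* Returns 1 on a hit and writes the word to *dr; returns 0 on a miss. */
static int lookup(unsigned tag, unsigned offset, unsigned char *dr) {
    for (int i = 0; i < NBLOCKS; i++) {        /* in hardware: done in parallel */
        if (cache[i].valid && cache[i].tag == tag) {
            *dr = cache[i].data[offset];
            return 1;
        }
    }
    return 0;
}

int main(void) {
    unsigned char dr;
    if (lookup(0x0117, 3, &dr))                /* AR = 01173 -> hit  */
        printf("hit, DR = %02X\n", dr);        /* prints 12          */
    if (!lookup(0x0116, 3, &dr))               /* AR = 01163 -> miss */
        printf("miss, go to the lower level\n");
    return 0;
}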
Q3 - Block Replacement

In the case of a miss, the cache controller must select a block to be replaced with the required new data.

Direct mapped. For a direct mapped cache, the hardware decision is simple. There is essentially no choice: only one block frame is checked for a hit, and only that block can be replaced. Refer to the Fig. 4.5 example and Eq. (4.1).

Fully associative or set associative. For these two organizations, there are three main replacement strategies: (i) random, (ii) least recently used - LRU, and (iii) first in, first out - FIFO.

The random replacement aims to spread the allocation uniformly by randomly selecting a candidate block. Some controllers generate pseudorandom block numbers to obtain reproducible behavior, which is useful for debugging (one wants reproducible results to track possible bugs).

The LRU replacement keeps track of block accesses to reduce the chance of throwing out data that is likely to be needed soon. This strategy uses the "past to predict the future", i.e., the block replaced is the one that has not been used for the longest time. It follows the corollary of locality: if recently used blocks are likely to be used again, a good candidate for disposal is the least recently used block. A drawback of LRU is that it can be complex to keep track of the blocks' usage. In this sense, the FIFO strategy is an approximation of LRU: it picks the oldest block rather than the least recently used one.

Moreover, as the number of blocks to keep track of increases, LRU becomes increasingly expensive and is generally only approximated, e.g., by pseudo-LRU. A common approximation keeps several bits per set, one bit per way of the cache. A "way" is a bank in a set associative cache; e.g., there are k ways in a k-way set associative cache. When a set is accessed, the bit corresponding to the way containing the desired block is turned ON. If all the bits of a set become turned ON, they are all reset except for the most recently turned ON bit. If a block then has to be replaced, the controller chooses a block from a way whose bit is turned OFF, picking randomly if more than one choice is available.

Next, Fig. 4.7 shows a comparison among the LRU, random, and FIFO strategies in terms of the number of cache misses per 1,000 instructions.

Figure 4.7: Cache misses per 1,000 instructions. As the cache size gets bigger, fewer misses occur (as expected). The random replacement strategy scored the lowest miss count for the 256 kByte two-way associative cache.

Comparing the numbers given in Fig. 4.7, one can analyze the different trade-offs between implementation complexity and the pay-off in terms of the number of cache misses.
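A minimal sketch of the bit-per-way pseudo-LRU approximation just described is given below, for one set of a 4-way set associative cache. Resetting the bits when they all become ON follows the description above; the random tie-break uses rand() purely for illustration.

/* Sketch: bit-per-way pseudo-LRU for one set of a 4-way set associative cache. */
#include <stdio.h>
#include <stdlib.h>

#define WAYS 4
static unsigned char used[WAYS];   /* one "recently used" bit per way */

/* Mark a way as recently used; reset the others if all bits become ON. */
static void touch(int way) {
    used[way] = 1;
    int all_on = 1;
    for (int w = 0; w < WAYS; w++)
        if (!used[w]) all_on = 0;
    if (all_on) {
        for (int w = 0; w < WAYS; w++)
            used[w] = 0;
        used[way] = 1;             /* keep only the most recently turned ON bit */
    }
}

/* Choose a victim among the ways whose bit is OFF (random tie-break). */
static int victim(void) {
    int candidates[WAYS], n = 0;
    for (int w = 0; w < WAYS; w++)
        if (!used[w]) candidates[n++] = w;
    return candidates[rand() % n];
}

int main(void) {
    touch(0); touch(2); touch(1);
    printf("victim way: %d\n", victim());  /* only way 3 is OFF here */
    return 0;
}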
Q4 - Write Strategy

An important fact regarding cache memory is that reads dominate processor cache accesses; writes are approximately 10% of the memory traffic. Therefore, the point is to enhance the common case by making reads faster.

A block can be read from the cache at the same time its tag is read and compared. In the case of a hit, the requested part of the block (e.g., a word) is handed to the processor right away. In the case of a miss, there is no benefit, but also no harm.

Could this strategy also be applied to writes? No. The modification of a block cannot start until the tag is checked to confirm whether the address is a hit. Also, writes usually take longer than reads, and the tag checking cannot occur in parallel with the write itself. The processor also specifies the size of the write (e.g., 1-8 bytes), so only that portion of the block can be changed, whereas reads can safely access more bytes than necessary.

The write policies distinguish cache designs into: (i) write through and (ii) write back.

Write through. In the write through policy, data is written both to the block in the cache and to the block in the lower-level memory, in parallel.

Write back. In the write back policy, data is written only to the block in the cache. The modified cache block is written to the main memory only when it is replaced, that is, when a cache miss evicts it.

To aid the write back policy, there is the dirty bit, which helps reduce the frequency of writing back blocks on replacement events. The dirty bit is a status bit that indicates whether the cache block is dirty (i.e., modified while in the cache) or clean (i.e., not modified). If the block is clean, there is no need to write it back when it is replaced, since the very same information is already in the lower levels.

Table 4.3 presents the main pros and cons of the write back and write through strategies.

Table 4.3: Write strategy pros and cons summary.

Write back
Pros: Writes occur at the speed of the cache. Multiple writes within a block require only one write to the lower-level memory. Uses less memory bandwidth, as some cache writes never go to memory (good for multiprocessors). Saves power, as it uses less of the memory hierarchy and interconnects (good for embedded systems).
Cons: Read misses may result in writes to the lower level, depending on the dirty bit: if the block is dirty, a write back is needed. Coherency problem: the cache can differ from the lower level (bad for multiprocessors).

Write through
Pros: Easier to implement. The cache is always clean, so read misses never result in writes to the lower level. Simplifies data coherency, since the next lower level has the most up-to-date copy of the data (good for multiprocessors). More suitable for the upper-level caches in multilevel cache systems, as writes need only propagate to the next lower level rather than all the way to main memory.
Cons: Writes occur at the speed of the lower level.

Multiprocessors are tricky in this respect, as they want both: (i) the write back strategy for processor caches, to reduce the memory traffic (power, bandwidth), and at the same time (ii) the write through strategy, to keep the caches consistent with the lower levels of the memory hierarchy (data coherency).

Write Stall. A write stall occurs when the processor has to wait for a write to complete in the write through policy. A common optimization to reduce write stalls is a write buffer: it allows the processor to continue as soon as the data is written into the buffer, i.e., processor execution proceeds in parallel with the memory write/update. With a write buffer (Fig. 4.8), the processor writes data to the cache and to the buffer, and the memory controller drains the buffer contents to the lower level. This scheme works fine as long as the store frequency (with respect to time) is much lower than 1 / (DRAM write cycle time). Otherwise, if the store frequency approaches or exceeds 1 / (DRAM write cycle time), there is a problem: write buffer saturation.

Figure 4.8: Write buffer used in the write through approach. It is just a FIFO with typically four entries.

If the condition of "processor cycle too fast" or "many writes in a row" persists for a long period, a buffer overflow will happen no matter how big the buffer is. Remember that the processor cycle time is always equal to or less than the DRAM write cycle time. Possible solutions in this case are to use a write back cache, or to install a second-level cache - L2, as illustrated in Fig. 4.9. Typically, L1 is write through and L2 is write back.

Figure 4.9: Write buffer used in the write through approach, with two cache levels to improve write performance.

Write Miss

A write miss happens when trying to write data that is currently not present in the cache. Basically, there are two options to handle write misses: (i) "write allocate", which fetches the addressed block from the lower level into the cache, or (ii) "no-write allocate", in which the data is updated only in the lower level and the cache is not affected.

Example. A 16-bit write to memory location 0x0 causes a miss. Should the rest of the block (Bytes 2, 3, ...) be read, that is, fetched into the cache? If the answer is YES, the scheme is write allocate. If the cache does not need to be changed, the answer is NO, meaning no-write allocate.
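To make the write-buffer behavior concrete, here is a toy sketch of a four-entry FIFO buffer in front of a slow lower level: a store stalls only when the buffer is full. The entry count and the notion of draining one entry per DRAM write cycle are illustrative assumptions, not a model of any specific controller.

/* Toy sketch: a 4-entry write buffer for a write-through cache. */
#include <stdio.h>

#define ENTRIES 4
static unsigned buf_addr[ENTRIES];
static unsigned buf_data[ENTRIES];
static int count = 0;               /* occupied entries */

/* Called once per DRAM write cycle: retire the oldest entry. */
static void drain_one(void) {
    if (count > 0) {
        /* here the oldest (addr, data) pair would be written to memory */
        for (int i = 1; i < count; i++) {
            buf_addr[i - 1] = buf_addr[i];
            buf_data[i - 1] = buf_data[i];
        }
        count--;
    }
}

/* Processor-side store: returns 0 if accepted, 1 if the processor must stall. */
static int store(unsigned addr, unsigned data) {
    if (count == ENTRIES)
        return 1;                   /* write buffer saturated -> write stall */
    buf_addr[count] = addr;
    buf_data[count] = data;
    count++;
    return 0;
}

int main(void) {
    for (unsigned i = 0; i < 6; i++)
        printf("store #%u %s\n", i, store(i * 4, i) ? "stalls" : "accepted");
    drain_one();                    /* one DRAM write cycle elapses */
    printf("after drain: store %s\n", store(0x100, 7) ? "stalls" : "accepted");
    return 0;
}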
Performance Optimizations

There are several possibilities to improve memory performance. It is possible to decrease the memory cycle time by employing a cache memory, or to improve the cache hit rate. Another possibility is to increase the word size, accessing multiple words in parallel with the interleaved memory approach.

Some definitions follow. The memory cycle is the time spent to give a word back to the processor, e.g., 50 ns. The word size is the number of bytes given to the processor at each memory request, e.g., 32 bits or 4 Bytes. The transfer rate is the number of bytes delivered by the memory per unit of time, e.g., 100 Mbps.

Not every memory access is a cache hit: misses and hits both occur. Thus, the hit rate is fundamental to the performance of a memory system with cache. The effective cycle EC of such a system is an intermediate value between the cache cycle CC and the main memory cycle MC, given a hit rate (i.e., success probability) h, with 0 ≤ h ≤ 1, as in Eq. (4.6).

EC = h × CC + (1 − h) × MC (4.6)

EC can also be written as in Eq. (4.7), in terms of the miss rate µ, Eq. (4.8), and the miss penalty ρ (the cost per miss), Eq. (4.9).

EC = CC + µ × ρ (4.7)
µ = 1 − h (4.8)
ρ = MC − CC (4.9)

It is common for h to be close to 1, and therefore EC is quite sensitive to small changes in h.

Example. Considering Eq. (4.6), let's examine the following cases.
If CC = MC/10 and h decreases from 0.99 to 0.98 (1%), EC increases by almost 10% (i.e., 8.3%).
If CC = MC/10 and h decreases from 0.99 to 0.89 (≈10%), EC almost doubles (i.e., +82.6%).
If CC = MC/20 and h decreases from 0.99 to 0.89 (≈10%), EC is multiplied by ≈2.5 (i.e., +159.7%).

Small improvements in h may therefore lead to substantial improvements in the overall memory system performance. Some key points impacting h are the number of words per block and the number of blocks (spatial locality), and the decision criteria about which block contents give room to new data coming into the cache on a miss (block replacement strategy).

As previously seen:

CPU time = CPU clock cycles for a program × Clock cycle time

But now we also need to account for the number of cycles during which the processor is stalled waiting for a memory access, as in Eq. (4.10).

CPU time = (CPU clock cycles + Memory stall clock cycles) × Clock cycle time (4.10)

Eq. (4.10) assumes that the CPU clock cycles already include the time to handle a cache hit, and that the processor is stalled during a cache miss. The memory stall cycles are then:

Memory stall cycles = Number of misses × Miss penalty (4.11)
= IC × (Misses / Instruction) × Miss penalty (4.12)
= IC × (Memory accesses / Instruction) × Miss rate × Miss penalty (4.13)
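The relations in Eqs. (4.6)-(4.9) and (4.11)-(4.13) are plain arithmetic, so a short sketch is enough to reproduce the sensitivity example above; the printed ratios are the 8.3%, 82.6%, and 159.7% figures quoted in the text, and the stall-cycle line uses the 1.2 accesses per instruction, 3% miss rate, and 64-cycle miss penalty that appear in the interleaved memory example later on.

/* Sketch: effective cycle (Eq. 4.6) and its sensitivity to the hit rate h. */
#include <stdio.h>

static double ec(double h, double cc, double mc) {
    return h * cc + (1.0 - h) * mc;          /* Eq. (4.6) */
}

int main(void) {
    double mc = 100.0;                       /* arbitrary main memory cycle */

    /* CC = MC/10, h from 0.99 to 0.98 and from 0.99 to 0.89 */
    printf("%.1f%%\n", 100.0 * (ec(0.98, mc / 10, mc) / ec(0.99, mc / 10, mc) - 1.0));
    printf("%.1f%%\n", 100.0 * (ec(0.89, mc / 10, mc) / ec(0.99, mc / 10, mc) - 1.0));
    /* CC = MC/20, h from 0.99 to 0.89 */
    printf("%.1f%%\n", 100.0 * (ec(0.89, mc / 20, mc) / ec(0.99, mc / 20, mc) - 1.0));

    /* Memory stall cycles per instruction (Eq. 4.13):
     * 1.2 accesses/instruction, 3% miss rate, 64-cycle miss penalty. */
    printf("stall cycles per instruction: %.2f\n", 1.2 * 0.03 * 64.0);
    return 0;
}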
Interleaved Memory

Overview

There are many techniques to improve the main memory performance, including increasing the data width, e.g., a wider memory data bus (Fig. 4.10-b), and memory interleaving (Fig. 4.10-c). It is also possible to improve the access time through the construction technology, which has an impact on the clock and latency. Another approach is double data rate - DDR, where data transfers occur on both the rising and the falling clock edges.

Figure 4.10: Memory improvements considering: (a) one-word-wide, (b) wide, and (c) interleaved memory organizations.

Definitions

The interleaved memory approach allows simultaneous access to as many words as there are independent banks. As illustrated in Fig. 4.10-c, there are four memory banks. With a number of different banks, it is possible to have several instructions and several operands in the fetch phase, and several results in the storage phase. However, to really take advantage of this, each memory access needs to address a different bank at a given moment. That is the meaning of interleaving.

Wider Bus Width vs. Interleaved Memory

Let's consider the following case. Sending an address on the address bus takes 4 CPU clock cycles. A word access in the memory takes 56 clock cycles. Sending a word on the data bus takes 4 clock cycles. All of this considers a 64-bit word. Let's analyze three situations.

Situation (A), for a block of 1 word. The miss rate is 3%, the miss penalty is 64 cycles (4 for addressing + 56 for the word access + 4 for sending the word back). The average number of cycles per instruction, without cache misses, is 2. Finally, there are 1.2 memory accesses per instruction.

Situation (B), for a block of 2 words. Here, the miss rate is 2%, while the other numbers remain as in situation (A).

Situation (C), for a block of 4 words. Here, the miss rate is 1.2%. The other figures are the same as in situation (A).

Original situation with a simple bus. Now comes the big question: what is the improvement over the original situation with a simple bus when having (i) memory interleaving with 2 or 4 banks, or (ii) a system with doubled bus width? And all of this considering blocks of 1, 2, or 4 words. By applying Eq. (4.7) it is possible to obtain the answer in numbers. The original situation with a simple bus, i.e., the effective cycle for a memory system with one-word blocks, is given by Eq. (4.15).

EC = CC + µ × ρ (4.14)
   = 2 + 0.03 × (1.2 × 64) = 4.30 (4.15)

Analyzing the Block Duplication Effects

Considering a block of 2 words (128 bits), the block duplication effects are analyzed for: (i) a 64-bit bus without memory interleaving, (ii) a 128-bit bus without interleaving, and (iii) a 64-bit bus with memory interleaved across 2 banks.

The effective cycle for a 64-bit bus without interleaving is given by Eq. (4.17). This case is fully sequential, as the bus carries only one word at a time. Note the decrease in performance compared to the simple bus, i.e., from 4.30, Eq. (4.15), to 5.07.

EC = CC + µ × ρ (4.16)
   = 2 + 0.02 × [1.2 × (2 × 64)] = 5.07 (4.17)

On the other hand, the effective cycle for a 128-bit bus, also without interleaving, is given by Eq. (4.19). This case is fully parallel: two words go at a time over the 128-bit bus.

EC = CC + µ × ρ (4.18)
   = 2 + 0.02 × (1.2 × 64) = 3.54 (4.19)

The doubled bus width makes it 1.22 times faster than the simple bus, Eq. (4.15).

Speedup = 4.30 / 3.54 = 1.22

Now consider a 64-bit bus with an interleaved memory of 2 banks. The effective cycle is given by Eq. (4.21).

EC = CC + µ × ρ (4.20)
   = 2 + 0.02 × {1.2 × [4 + 56 + (2 × 4)]} = 3.63 (4.21)

Here, 4 cycles are related to the initial request, 56 cycles stand for a word access in the memory (done in parallel across the 2 banks), and (2 × 4) cycles are required to send the two words back. Sending the words back is done sequentially, since the 64-bit bus accommodates one word at a time. Finally, this memory interleaving is 1.19 times faster than the simple bus, Eq. (4.15).

Speedup = 4.30 / 3.63 = 1.19
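The same bookkeeping extends directly to the 4-word-block configurations analyzed next, so a small sketch computing EC and the speedups for the cases above and below may help; it just encodes Eq. (4.7) with the CPI of 2, the 1.2 memory accesses per instruction, and the cycle counts used in the text.

/* Sketch: effective cycles for the bus-width vs. interleaving comparison. */
#include <stdio.h>

/* Eq. (4.7) with CPI = 2 and 1.2 memory accesses per instruction. */
static double ec(double miss_rate, double penalty) {
    return 2.0 + miss_rate * (1.2 * penalty);
}

int main(void) {
    double base  = ec(0.03, 64);               /* 1-word block, simple 64-bit bus */
    double seq2  = ec(0.02, 2 * 64);           /* 2 words, 64-bit bus             */
    double wide2 = ec(0.02, 64);               /* 2 words, 128-bit bus            */
    double il2   = ec(0.02, 4 + 56 + 2 * 4);   /* 2 words, 2 interleaved banks    */
    double il4   = ec(0.012, 4 + 56 + 4 * 4);  /* 4 words, 4 interleaved banks    */

    printf("base=%.2f seq2=%.2f wide2=%.2f il2=%.2f il4=%.2f\n",
           base, seq2, wide2, il2, il4);
    printf("speedup wide2=%.2f il2=%.2f il4=%.2f\n",
           base / wide2, base / il2, base / il4);
    return 0;
}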
Analyzing the Block Quadruplication Effects

Considering a block of 4 words (256 bits), the block quadruplication effects are analyzed for: (i) a 64-bit bus without memory interleaving, (ii) a 128-bit bus without interleaving, and (iii) a 64-bit bus with memory interleaved across 4 banks.

The effective cycle for a 64-bit bus without interleaving is given by Eq. (4.23). This case is fully sequential, i.e., one word at a time over the 64-bit data bus. Notice the decrease in performance compared to the simple bus, i.e., from 4.30, Eq. (4.15), to 5.69.

EC = CC + µ × ρ (4.22)
   = 2 + 0.012 × [1.2 × (4 × 64)] = 5.69 (4.23)

The effective cycle for a 128-bit bus without interleaving is given by Eq. (4.25). This case is partially parallel, since only two words can go at a time.

EC = CC + µ × ρ (4.24)
   = 2 + 0.012 × [1.2 × (2 × 64)] = 3.84 (4.25)

The doubled bus width makes it 1.12 times faster than the simple bus, Eq. (4.15).

Speedup = 4.30 / 3.84 = 1.12

Now, with a 64-bit bus and an interleaved memory of 4 banks, the effective cycle is given by Eq. (4.27).

EC = CC + µ × ρ (4.26)
   = 2 + 0.012 × {1.2 × [4 + 56 + (4 × 4)]} = 3.09 (4.27)

The 4 cycles are for the initial request, 56 cycles for a word access in the memory (done in parallel, as there are 4 banks), and (4 × 4) cycles for sending the words back sequentially over the 64-bit bus. The memory interleaving with 4 banks is 1.39 times faster than the simple bus, Eq. (4.15).

Speedup = 4.30 / 3.09 = 1.39

Wrap-up

The cost of quadrupling the memory bus may become prohibitive (one can imagine how much wiring would be needed) and would not yield that much performance improvement. A 256-bit bus without interleaving would give an effective cycle of:

EC = CC + µ × ρ = 2 + 0.012 × (1.2 × 64) = 2.92

This is a speedup of just 1.06 with respect to the memory interleaving configuration with 4 banks, Eq. (4.27).

Speedup = 3.09 / 2.92 = 1.06

RAM Construction Technology

Technologies Overview

The read-only memory - ROM is a non-volatile memory that can be written just once. Some variations can be electrically erased and reprogrammed at slow speed, e.g., the electrically erasable programmable read-only memory - EEPROM.

Regarding the static random-access memory - SRAM, the priority is basically speed and capacity. The term static means the data does not need to be refreshed periodically. SRAM has non-multiplexed address lines. SRAM is about 8 to 16 times more expensive than DRAM, and it is used to minimize the access time of caches.

When it comes to the dynamic random-access memory - DRAM, the priority is the cost per bit and capacity. The term dynamic means the data needs to be refreshed (i.e., re-written) periodically, even without reading it. DRAM has multiplexed address lines.

Traditionally, DRAM had an asynchronous interface with its controller, and thus a synchronization overhead was incurred. A clock signal was then introduced to the DRAM chips, making them synchronous, hence the name "synchronous DRAM" - SDRAM. DDR SDRAM was an innovation where memory data is transferred on both the rising and the falling edges of the SDRAM clock signal, hence the term double data rate - DDR.

Here follow some comparisons among DDR2, DDR3, and DDR4. The evolution of the DDR technology brought increased clock rates and reduced chip voltages. From DDR1 to DDR2, the voltage dropped from 2.5 to 1.8 V, with higher clock rates, e.g., 266, 333, and 400 MHz. From DDR2 to DDR3, the voltage dropped to 1.5 V, with a maximum clock rate of 800 MHz. From DDR3 to DDR4, the voltage dropped to 1-1.2 V, with a maximum expected clock rate of 1600 MHz. The DDR5 standard was released in 2020 (at the time of writing, there was speculation about Intel and AMD support in 2021 and 2022, respectively).

Memories are usually sold in small boards, i.e., dual inline memory modules - DIMM. Those boards can contain from 4 to 16 DRAM chips and are usually arranged to provide 8-Byte words.

DRAM Organization

Modern DRAM is organized in banks, with up to 16 banks in DDR4. Each bank has a number of rows, as illustrated in Fig. 4.11.

Figure 4.11: DRAM basic internal organization.
DRAM Operation

The DRAM is organized as a matrix addressed by banks, rows, and columns (Fig. 4.11). The address multiplexing is managed by the DRAM controller, which first sends the bank and row numbers, followed by the column address; finally, the data is accessed for reading or writing.

To initiate a new access, the DRAM controller issues the following commands. The activate command ACT (formerly named "row address strobe" - RAS) opens a bank and a row and loads the entire row into the row buffer. The column address ("column address strobe" - CAS) can then be sent for a single item or for a burst. The precharge command PRE closes the bank and row and prepares them for a new access. Commands and block transfers are synchronized with a clock.

Before accessing a new row, the bank must be precharged. If the new row is in the same bank, the precharge delay must be paid; if it is in a different bank, closing the row and precharging can overlap with accessing the new row. In SDRAMs, each of those command phases requires an integral number of clock cycles.

DRAM uses just a single transistor, effectively acting as a capacitor, to store a bit. This makes it possible to have more bits per chip, but it has some implications. Reads from the DRAM require a write back, since the read "destroys" (consumes) the information. Also, even when not reading or writing, DRAM requires a periodic refresh due to charge leakage: every row must be accessed within a time frame, such as 64 ms, and DRAM controllers include hardware to do that. The bits in a row are refreshed in parallel.

When a row is in the row buffer, it can be transferred either as single items, with successive CAS commands at whatever width the DRAM has (e.g., 4, 8, or 16 bits in DDR4), or as a burst, which specifies a block transfer from a starting memory address.

DRAM Performance

In terms of DRAM performance, the access time is non-uniform (cf. non-uniform memory access - NUMA), due to the data location and the refresh/write back requirements previously seen. Typically, the time is divided into: (i) the RAS precharge, i.e., the row selection; (ii) the RAS-to-CAS delay, where the column selection occurs; (iii) the CAS latency, which is the data read or write itself; and (iv) the cycle time, standing for the average completion time. The details are shown and commented in Figs. 4.12 and 4.13.

Figure 4.12: DRAM write cycle time. First, the operation enable signal OE_L goes from low to high. Before that (time highlighted in red), the row address has to be placed on the address bus. Then, the signal RAS_L goes from high to low, indicating that the row address information must be stable on the address bus. Notice that the write enable signal WE_L goes low, indicating that a write operation is in course. Then, the data in is placed on the data bus to be written. The row address is taken from the address bus (highlighted in cyan) and the column address is then placed there (red again). At that moment (CAS_L from high to low, cyan), the write access time starts; it finishes when RAS_L and CAS_L go high again, after WE_L.

Figure 4.13: DRAM read cycle time. First, the write enable signal WE_L goes from low to high, as this is a read operation. Before that (time highlighted in red), the row address has to be placed on the address bus. Then, the signal RAS_L goes from high to low, indicating that the row address information must already be stable on the address bus.
Notice that the operation enable signal OE_L goes low, indicating that a read operation is in course. Then, the data out will be placed on the data bus to be read. The row address is taken from the address bus (highlighted in cyan) and the column address is then placed there (red again). At that moment (CAS_L from high to low, cyan), the read access time starts; it finishes when RAS_L and CAS_L go high again, after OE_L.

The "L" in the signal names (Figs. 4.12 and 4.13) means the signal is active LOW, e.g., RAS_L.

Fig. 4.14 illustrates some DDR SDRAM capacities along with access times.

Figure 4.14: DDR SDRAM capacity and access times. The information is for random memory words, assuming a new row must be opened. If the row is in a different bank, that row is assumed to be already precharged. If the row is not open, a precharge is required, leading to a longer access time.

Memory Physical Modules

To facilitate handling and also to exploit memory interleaving, memory modules are used in the form factor of a DIMM (Fig. 4.15), with 4 to 16 memory chips on a 64-bit bus. The number of pins varies among the different DDRn generations.

Figure 4.15: An example of a 168-pin module.

Fig. 4.16 shows some examples of DRAM clock rates. The first row shows the value of 2128 MBytes per second for DDR1. That number is given by:

2128 MiB/s = 266 Mtransfers/s × 8 Bytes/transfer

Figure 4.16: DRAM clock rates examples.

Virtual Memory

Main Concepts

As in Fig. 4.4, the memory in the lower levels becomes slower and larger when moving farther away from the processor. The virtual memory - VM automatically manages the two levels of the memory hierarchy represented by the main memory and the secondary storage (disk or flash), relieving programmers from managing overlays. Fig. 4.17 shows an overview of virtual addresses and their mapping to either the main memory or the secondary storage (a disk, in this case). Other key features provided by virtual memory are memory space sharing and protection, and memory relocation.

Figure 4.17: Mapping of virtual memory to physical memory for a program with four pages.

Here, the focus on virtual memory is specific to the computer architecture point of view. Several general memory hierarchy ideas about caches are analogous in virtual memory: for example, page or segment is used for block, and page fault or address fault is used for miss.

Memory mapping, also known as address translation: considering virtual memory, as represented in Fig. 4.17, the processor produces virtual addresses that are translated (by a combination of hardware and software) into physical addresses to access the main memory.

Virtual Memory and Cache

The main differences between virtual memory and cache can be highlighted as follows. The replacement on cache misses is primarily controlled by hardware, while the virtual memory replacement is primarily controlled by the operating system - OS. In this context, the much longer miss penalty means it is more important to make a good decision: the operating system can be involved and can take some time deciding what to replace, to obtain a better outcome in terms of performance.

The size of the processor address determines the size of the virtual memory, whereas the cache memory size is independent of it.
Besides acting as the lower-level backing store for the main memory in the hierarchy, the secondary storage is also used for the file system - FS, and that has to be taken into account when computing the available size for virtual memory. The file system occupies most of the secondary storage, and that part is usually not in the address space.

Virtual Memory Classes

Virtual memory can be categorized into two classes, as seen in Fig. 4.18: pages, with fixed-size blocks, and segments, with variable-size blocks. Pages are typically fixed at 4096-8192 Bytes. Segments range from 1 Byte, in the smallest case, up to 2^16 to 2^32 Bytes in the largest examples.

Figure 4.18: Virtual memory pages and segments.

The Four Memory Hierarchy Questions on Virtual Memory

Q1 - Where can a block be placed in the main memory? This is related to block placement.
Q2 - How is a block found if it is in the main memory? This is related to block identification.
Q3 - Which block should be replaced on a virtual memory miss? This is related to block replacement.
Q4 - What happens on a write? This is related to the write strategy.

Q1 - Block Placement

Notice that the miss penalty for virtual memory involves access to devices in the lower level of the hierarchy, as shown in Fig. 4.4. Such devices, e.g., magnetic disks, are very slow (thousands of clock cycles), so the miss penalty in this case is very high. Given this, in a choice between a lower miss rate and a simpler placement algorithm, the former is preferred because of the miss penalty cost. In this sense, the operating system allows blocks to be placed anywhere in the main memory, as in the fully associative organization of cache memory.

Q2 - Block Identification

Here there are two classes, page and segment. In paged addressing, a single fixed-size address word is divided into a page number and a page offset within the page, similar to cache addressing. This is the virtual address, as in the top left-hand side of Fig. 4.19. In segmented addressing, due to the variable segment size, two words are needed: one for the segment number and another for the offset within the segment.

Both paging and segmentation rely on a data structure (the page table - PT) indexed by the page or segment number. This data structure holds the physical address of the block, as Fig. 4.19 shows. In segmented addressing, the offset is added to the segment's physical address to obtain the final physical address. In paged addressing, the offset is simply concatenated to the physical page address.

Figure 4.19: Page table, the data structure containing the physical page address.

Example. Let's consider a 32-bit virtual address, a page size of 4 KiB, and 4 Bytes per page table entry - PTE. The page table - PT size is computed as:

PT size = (Virtual Memory Size / Page Size) × PTE size
        = (2^32 / 2^12) × 2^2
        = 2^22 (4 MiB)
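This sizing arithmetic, and the inverted page table variant computed next, is easy to script. Below is a tiny C sketch using the 32-bit virtual address, 4 KiB page, 4-Byte PTE, and 512 MiB physical memory figures from the examples in this section; the variable names are illustrative only.

/* Sketch: page table sizing for the examples in this section. */
#include <stdio.h>

int main(void) {
    unsigned long long virt_size = 1ULL << 32;   /* 32-bit virtual address  */
    unsigned long long phys_size = 1ULL << 29;   /* 512 MiB physical memory */
    unsigned long long page_size = 1ULL << 12;   /* 4 KiB pages             */
    unsigned long long pte_size  = 4;            /* 4 Bytes per PTE         */
    unsigned long long va_size   = 4;            /* stored virtual address  */

    /* Conventional page table: one entry per virtual page. */
    unsigned long long pt = (virt_size / page_size) * pte_size;
    /* Inverted page table: one entry per physical page, each entry also
     * storing the virtual address (see the example that follows). */
    unsigned long long ipt = (phys_size / page_size) * (pte_size + va_size);

    printf("PT  size = %llu MiB\n", pt  >> 20);   /* 4 MiB */
    printf("IPT size = %llu MiB\n", ipt >> 20);   /* 1 MiB */
    return 0;
}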
An alternative approach to the previous example is to apply a hash function to the virtual address. This allows the data structure (i.e., the page table) to have a size equal to the number of physical pages in the main memory, i.e., smaller than the number of virtual pages. With the same page size of 4 KiB and 4 Bytes per page table entry - PTE, and considering a 512 MiB physical memory, the page table size in this inverted page table - IPT scheme is given by:

IPT size = (Physical Memory Size / Page Size) × (PTE size + Virtual Address Size)
         = (2^29 / 2^12) × (2^2 + 2^2)
         = 2^20 (1 MiB)

As this approach uses the size of the physical memory instead of the virtual memory, it can suffer from collisions, since two different keys can point to the same index. There is the open addressing technique (among others) for collision resolution in hash tables.

Finally, there is the translation look-aside buffer - TLB, used to reduce the address translation time. Computers use a cache dedicated to this task.

Q3 - Block Replacement

The basic operating system guideline is to minimize page faults. Almost all operating systems try to replace the least recently used - LRU block. This approach uses a "use bit" or "reference bit", which is set whenever a page is accessed. Periodically, the operating system resets the reference bits and records them, so it can tell which pages were accessed during that interval.

Q4 - Write Strategy

Let's keep it simple: the write strategy is write back, with a dirty bit. Write through is too costly; according to Hennessy and Patterson (2017), no one has yet built a virtual memory operating system using write through.

Translation Look-aside Buffer - TLB

Page tables are stored in the main memory, as they can be large; sometimes they are even paged themselves. In that case, it would take twice as long to obtain the physical address, going through the two levels of paging. To minimize this performance issue, a cache of the page table is implemented: the translation look-aside buffer - TLB, as Fig. 4.20 shows.

A TLB entry is like a regular cache entry, with a tag and data. The tag holds part of the virtual address (without the offset). The data includes the physical page frame number, a protection field, a valid bit, a use bit, and a dirty bit related to the page. Changing the physical page frame number or the protection of a page table entry requires the operating system to make sure the old entry is not in the TLB; otherwise, the system would behave improperly. Therefore, the operating system changes the value in the page table and then invalidates the corresponding TLB entry. The dirty bit means the corresponding page is dirty, rather than stating that the address translation in the TLB is dirty, or that a given block in the cache is dirty. When the entry is reloaded from the page table, the TLB gets an accurate copy of the bits.

Figure 4.20: An example of virtual memory with a TLB and a page table. The first search is on the virtual address in the TLB. Should it fail, a search in the page table is done.
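A highly simplified sketch of this translation path, first probing a small TLB and falling back to a flat, single-level page table on a TLB miss, is shown below. The tiny fully associative TLB, the flat table, the round-robin refill, and the field names are illustrative simplifications; real TLBs also carry the protection, valid, use, and dirty bits described above.

/* Sketch: virtual-to-physical translation with a tiny TLB in front of a
 * flat page table. 4 KiB pages -> 12-bit page offset. */
#include <stdio.h>

#define PAGE_SHIFT 12
#define TLB_ENTRIES 4
#define NUM_PAGES 1024                    /* illustrative, small address space */

struct tlb_entry { int valid; unsigned vpn, pfn; };
static struct tlb_entry tlb[TLB_ENTRIES];
static unsigned page_table[NUM_PAGES];    /* VPN -> PFN, assumed all resident  */
static unsigned next_victim;              /* trivial round-robin replacement   */

static unsigned translate(unsigned vaddr) {
    unsigned vpn = vaddr >> PAGE_SHIFT;
    unsigned off = vaddr & ((1u << PAGE_SHIFT) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++)          /* TLB probe (parallel in HW) */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].pfn << PAGE_SHIFT) | off;

    unsigned pfn = page_table[vpn];                /* TLB miss: walk the table  */
    tlb[next_victim] = (struct tlb_entry){1, vpn, pfn};
    next_victim = (next_victim + 1) % TLB_ENTRIES; /* refill the TLB            */
    return (pfn << PAGE_SHIFT) | off;
}

int main(void) {
    for (unsigned vpn = 0; vpn < NUM_PAGES; vpn++)
        page_table[vpn] = NUM_PAGES - 1 - vpn;     /* arbitrary mapping         */
    printf("0x%x -> 0x%x\n", 0x3ABC, translate(0x3ABC));  /* TLB miss, then fill */
    printf("0x%x -> 0x%x\n", 0x3DEF, translate(0x3DEF));  /* now a TLB hit       */
    return 0;
}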
Page Size Selection

The page size selection is a matter of balance. The main points in favor of a larger page size are the following. The size of the page table is inversely proportional to the page size, so memory can be saved by making pages bigger. A larger page size allows for larger caches, leading to fast cache hit times, according to the principle of locality. Transferring larger pages to and from the secondary storage is more efficient than transferring smaller pages. Finally, as the number of TLB entries is limited, a larger page size means that more memory can be mapped efficiently, which helps reduce the number of TLB misses.

On the other hand, the main points in favor of a smaller page size are the following. It conserves storage, i.e., there is less wasted space when a contiguous region of virtual memory is not a multiple of the page size; in other words, a smaller page size avoids unused memory within a page, i.e., internal fragmentation. It also relates to the process start-up time: many processes are small, so a larger page size would lengthen the time to invoke a process.

Virtual and Physical Caches

Even a small and simple cache must deal with the translation of a virtual address from the processor into a physical address to access the memory. A "good call" is to make the common case fast; here, in the virtual memory context, that means using virtual addresses in the cache, since hits are much more common than misses. Virtual caches are caches that use virtual addresses; physical caches are the traditional caches that use physical addresses.

Virtual caches are no longer popular. The operating system and user programs may use two different virtual addresses for the same physical address, and this happens fairly frequently, e.g., when two programs (two distinct virtual addresses) share the same data object. These duplicate addresses (synonyms or aliases) could result in two copies of the same data in a virtual cache. Such duplicated information in the cache is not good at all: if one copy is modified, the other will hold the wrong value, to mention just the simplest problem.

One alternative to get the best of both virtual and physical caches is to use part of the page offset (the part that is identical in both the virtual and physical addresses) to index the cache (Fig. 4.21). In parallel with the cache being read using that index, the virtual part of the address is translated, and the tag match uses physical addresses. This approach allows the cache read to begin immediately, while the tag comparison is still done with physical addresses.

Figure 4.21: A hypothetical memory hierarchy illustrating the path from the virtual address to the L2 cache. From the top, the 64-bit virtual address is split into: (i) a 51-bit virtual page number and (ii) a 13-bit page offset. The virtual page number is sent to the TLB to be translated into a physical address, while the high 7-bit part of the page offset is sent to the L1 cache, acting as an index. If the TLB match is a hit (indicated in the figure by the symbol "=?"), the physical page number is sent to the L1 cache tag for comparison (L1 tag compare address "=?"). Should it match, it is an L1 cache hit, and the block offset is used to select the word for the processor/CPU. If the search in the L1 cache results in a miss, the physical address is then used to try the L2 cache. The middle part (16 bits) of the 41-bit physical address is taken as the index into the L2 cache. The resulting L2 cache tag is compared to the 19-bit upper part of the physical address to verify a match (bottom symbol "=?"). Should it match, it is an L2 cache hit, and the data is sent to the processor/CPU; the 6-bit block offset points to the word. Finally, in the event of another miss (i.e., an L2 miss), the physical address is used to get the block from the memory. L1 is a direct mapped 8 KiB cache, and L2 is a direct mapped 4 MiB cache; both have a 64-Byte block size.
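The field widths quoted in Fig. 4.21 follow directly from the cache parameters. A small sketch, assuming the same 64-bit virtual and 41-bit physical addresses and direct mapped caches as the figure, recomputes them.

/* Sketch: recompute the address field widths of the Fig. 4.21 hierarchy. */
#include <stdio.h>

static unsigned log2u(unsigned long long x) {   /* x assumed a power of two */
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    unsigned va_bits = 64, pa_bits = 41;
    unsigned long long page = 8192, block = 64;
    unsigned long long l1 = 8ULL << 10, l2 = 4ULL << 20;   /* direct mapped */

    unsigned page_off = log2u(page);                 /* 13 */
    unsigned blk_off  = log2u(block);                /*  6 */
    unsigned l1_index = log2u(l1 / block);           /*  7 */
    unsigned l2_index = log2u(l2 / block);           /* 16 */

    printf("virtual page number: %u bits\n", va_bits - page_off);      /* 51 */
    printf("L1 index: %u bits, block offset: %u bits\n", l1_index, blk_off);
    printf("L2 index: %u bits, L2 tag: %u bits\n",
           l2_index, pa_bits - l2_index - blk_off);                    /* 19 */
    return 0;
}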
Example of Intel Core i7

The i7 has 48-bit virtual addresses and 36-bit physical addresses. Its memory management uses a two-level TLB (Fig. 4.22).

Figure 4.22: i7 TLB-related data.

The i7 has a three-level cache hierarchy (Fig. 4.23). The first-level caches are virtually indexed and physically tagged, while the L2 and L3 caches are physically indexed.

Figure 4.23: i7 cache information.