Memory Hierarchy
EEE40013 Topic 4, 20/06/2017
Summary
These lecture notes cover memory hierarchy in computer architecture. They explore caches, locality, write policies and virtual memory, with examples and diagrams. The material is suitable for undergraduate-level computer science courses.
Full Transcript
Memory Hierarchy

From fastest and smallest to slowest and biggest:
- Processor registers: ~0.25 ns
- L1 cache: ~1 ns
- L2 cache: ~5 ns
- Main memory (semiconductor): ~40-100 ns
- Disk cache
- Secondary storage (magnetic etc.): ~ms

Processor - Memory Speed Gap

There is an increasing gap between the speed of the CPU and the next level of the memory hierarchy.

What is a Cache?

Small, fast storage used to improve the average access time to a memory system that uses slower memory for most of its storage space. It holds a subset of the data and instructions used by the executing program, and relies on locality for efficient operation.

Locality

Programs usually exhibit locality:
- Spatial locality: programs tend to access memory locations near the locations currently being accessed.
- Temporal locality: programs tend to access the same memory location close together in time, i.e. accessed memory locations are likely to be accessed again in the near future.

Examples of Program Locality

    int isub, iarray[100];
    for (isub = 1; isub < 100; isub++)
        iarray[isub] = 0;

The sequential sweep of iarray exhibits spatial locality; the repeated accesses to isub and to the loop instructions exhibit temporal locality.

Block Placement

A memory block can be stored in the cache in exactly one frame (direct mapped), in any frame (fully associative), or in one of N possible frames where N is a power of 2 (N-way set associative).

Placement - Direct Mapped

(Diagram: memory locations 0FF-108 mapped onto cache frames 0-7.) Each memory location can be placed in only one of the cache locations.

Placement - Fully Associative

Each memory location can be placed in any of the cache locations.

Placement - Set Associative

(2-way set associative.) Each memory location can be placed in either of two cache frames.

Block Identification

To identify whether a block is present in a cache line it is necessary to 'remember' which memory location the block was loaded from. This is done through the use of frame tags, which contain part or all of the memory address of the location. The size of the tag is influenced by the size of the cache frame and by the cache organisation.

Identification - Direct Mapped

(Diagram example: tag 234, index 2, offset 3.) The index selects a frame within the cache; only one frame can possibly contain the required block from memory. The tag is used to verify the contents, since multiple memory blocks map to a single frame. This only requires 'boring' memory with one external comparator.

Identification - Fully Associative

(Diagram example: tag 23456, offset 2.) In a fully associative cache any frame may hold the needed memory block, so all frames are candidates. The address tag must be compared with all frame tags at the same time, which requires special memory incorporating N comparators. If there is a hit, the data is retrieved from that frame.
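As a rough sketch of the direct-mapped identification above (assuming the 8-frame organisation of the example, with hypothetical 2-bit offset and 3-bit index fields; these widths are illustrative, not from the slides):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical direct-mapped cache: 8 frames, 4-byte blocks
   (2 offset bits, 3 index bits; the remaining bits form the tag). */
#define OFFSET_BITS 2
#define INDEX_BITS  3
#define NUM_FRAMES  (1u << INDEX_BITS)

static uint32_t frame_tag[NUM_FRAMES];
static bool     frame_valid[NUM_FRAMES];

/* Returns true on a hit: the single frame selected by the index is
   valid and its stored tag matches the tag field of the address. */
bool dm_lookup(uint32_t addr)
{
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_FRAMES - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    return frame_valid[index] && frame_tag[index] == tag;
}
```

A fully associative lookup would instead compare the tag against every frame's tag in parallel, which is why it needs the special memory with N comparators.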
Identification - Set Associative

(Diagram example: tag 234, index 4, offset 2; candidate frame tags 145 and 234.) The index selects a set of frames in the cache (two frames for the 2-way case). The address tag is compared with only those tags; if there is a hit, the data is retrieved from the matching frame via a 2:1 multiplexer. This requires 2x 'boring' memory with two external comparators.

Identification - 2-way Mapped

(Diagram: two tag/data arrays searched in parallel, with left/right hit signals selecting the data through a multiplexer.) This again only requires 'boring' memory with two external comparators.

Replacement - Direct Mapped

Simple! There is only one possible frame in the cache to use, so there is no choice and hence no need for a replacement policy.

Replacement - Fully Associative

Complicated! The block may be placed in any frame of the cache. How to choose?
- Random: simple to implement in hardware; speed is important.
- Least Recently Used (LRU): intuitively good but too complicated to implement.
- First In First Out (FIFO): simpler than LRU but yields similar results.
- Optimal: replace the line not needed for the longest time into the future. Not possible, but useful as a reference (requires a TARDIS to implement!).

Replacement - Set Associative

Also complicated. In an N-way set associative cache there are N possible frames to select from. The same policies as for fully associative apply, but within a set of frames.

Cache Reads - Some Details

Reads dominate processor cache accesses: a good proportion of data accesses (~79%) and all instruction accesses. So optimise the cache for reads (Amdahl's Law!). The processor must wait for data to be read from memory on a cache miss, so read misses have a big influence on performance.

Some time is required to access the tag and determine whether the frame contains the correct line of data. The cache operation can be overlapped with a speculative read access to memory, provided the fetched line can be discarded on a cache hit. It is also desirable to access the data from all possible cache frames while checking the tags, even though only one frame will eventually be used; some consideration for power consumption may be needed.

Overlapped Accesses on Reads

(Diagram: CPU, cache and memory operating in parallel.) Access all candidate cache frames and tags; at most one will be used. At the same time start the access to memory, which may turn out to be unnecessary if a hit occurs.

Write Policy

Writes present special considerations. A write cannot be initiated speculatively, either to the cache frame or to memory: unlike reads, it is necessary to check the cache first, so writes will typically be slower than reads. But it may not be necessary for the processor to wait for a write to complete. Two basic policies:
- Write through: the information is written both to the block in the cache frame and to the block in memory.
- Write back: the information is written only to the block in the cache frame; the modified frame is written back to memory only on eviction.
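A minimal sketch contrasting the two policies on a store, assuming a single cache frame and a stand-in mem_write for the slow path to memory (both names are illustrative, not from the slides):

```c
#include <stdint.h>
#include <stdbool.h>

/* One cache frame, for illustration only. */
static uint32_t frame_data;
static bool     frame_dirty;

static void mem_write(uint32_t value)
{
    (void)value;   /* stand-in for the slow write to main memory */
}

/* Write through: update the frame and memory together. */
void store_write_through(uint32_t value)
{
    frame_data = value;
    mem_write(value);            /* memory never holds stale data */
}

/* Write back: update the frame and mark it dirty;
   memory is updated only when the frame is evicted. */
void store_write_back(uint32_t value)
{
    frame_data  = value;
    frame_dirty = true;
}

void evict_frame(void)
{
    if (frame_dirty) {
        mem_write(frame_data);   /* one write covers multiple stores */
        frame_dirty = false;
    }
}
```

The write-back version turns several stores to the same block into a single memory write, which is the speed argument made under Write Back below.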
Write Through

The information is written both to the block in the cache frame and to the block in memory. This is simpler to implement, and it maintains cache-memory coherency, i.e. the memory always contains the latest copy of the data. That is a consideration if there are multiple CPUs or other devices accessing memory.

Write Back

The information is written only to the block in the cache frame, and the modified frame is written back to memory on eviction. This usually requires a 'dirty bit' associated with each cache frame to mark frames needing to be written back on eviction. It is possible for the memory and cache to contain different data: the memory contains stale data. Write back is faster, since writes occur at cache speed and multiple writes to the same cache block result in only a single write to main memory.

Write Buffers

(Diagram: the CPU and cache feed a write buffer in front of memory.) A write buffer allows writes to memory to be queued so that the processor can continue execution. The write buffer is important for write-through caches, where the write penalty would otherwise be excessive. This method gives priority to reads, for which the processor must wait immediately.

Write Buffer - Considerations

What happens if there is a read from a location with a pending write in the buffer? We need to check the buffer as well as the cache for the data (this needs a tag comparison similar to a fully associative cache!).

Pending writes to the same memory location can be coalesced. This effectively caches the writes, since only the last of multiple writes gets written to memory. As mentioned, the memory interface is organised as blocks for efficient access, so individual writes that access parts of the same block can be merged if the earlier access is still in the write buffer.

Write Merging

(Figure not reproduced in the transcript.)

When to Write the Buffer

- As soon as possible: helps stop the buffer filling and prevents stalls.
- Less often: allows a greater possibility of coalescing pending writes and write merging.

These considerations are influenced by reads from locations that have pending writes. Do you flush the buffer and access the lower level of the memory hierarchy, or read directly from the write buffer? How do you search the buffer? Some form of associative memory is needed, as sketched after this list of slides.

Write Policy - Frame Allocation

Do we allocate a cache frame on a write to a memory location that is not already in the cache (a write miss)?
- Write allocate: allocate a frame on the write miss. Note that this requires the block to be read from memory, then modified in the cache and possibly in memory (write through).
- No write allocate: just write to memory; don't allocate a frame, so the cache does not change.

I+D or Separate I & D Caches

Instructions and data have different characteristics. Should we have separate caches for data and instructions, or a unified cache? An instruction cache needs no write policy. Does unified allow more balanced use of the cache? Separate caches allow larger bandwidth, since there are individual ports for instructions and data, and they avoid a possible structural hazard from collisions between data and instruction accesses. Both possibilities have been used.
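The associative search and coalescing described under Write Buffer - Considerations above might look like this in outline (a sketch; the 4-entry size and one-word-per-entry granularity are assumptions):

```c
#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4

struct wb_entry {
    bool     valid;
    uint32_t block_addr;   /* address of the memory block */
    uint32_t data;         /* simplified: one word per entry */
};

static struct wb_entry wbuf[WB_ENTRIES];

/* Queue a write, coalescing with a pending write to the same block.
   Returns false if the buffer is full (the processor would stall). */
bool wb_put(uint32_t block_addr, uint32_t data)
{
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wbuf[i].valid && wbuf[i].block_addr == block_addr) {
            wbuf[i].data = data;          /* coalesce: last write wins */
            return true;
        }
    for (int i = 0; i < WB_ENTRIES; i++)
        if (!wbuf[i].valid) {
            wbuf[i] = (struct wb_entry){ true, block_addr, data };
            return true;
        }
    return false;                          /* buffer full */
}

/* A read must search the buffer too, like a fully associative cache. */
bool wb_lookup(uint32_t block_addr, uint32_t *data)
{
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wbuf[i].valid && wbuf[i].block_addr == block_addr) {
            *data = wbuf[i].data;
            return true;
        }
    return false;
}
```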
Direct Cache (8 slots) Example

Reference string (each address maps to slot = address mod 8):
3, 0, 1, 2, 6, 4, 5, 7, 8, 9, 0, 0, 0, 2, 2, 2, 4, 9, 1, 9, 1, 3, 12, 5

- 3, 0, 1, 2, 6, 4, 5, 7: all miss, filling slots 3, 0, 1, 2, 6, 4, 5, 7.
- 8: miss (evicts 0 from slot 0); 9: miss (evicts 1 from slot 1); 0: miss (evicts 8); 0: hit; 0: hit; 2: hit; 2: hit; 2: hit.
- 4: hit; 9: hit; 1: miss (evicts 9); 9: miss (evicts 1); 1: miss (evicts 9); 3: hit; 12: miss (evicts 4 from slot 4); 5: hit.

Spatial Locality - Example

With two-word blocks, each miss also loads the block's neighbour ({0,1}, {2,3}, {4,5}, {6,7}), so the in-order string 0, 1, 2, 3, 4, 5, 6, 7 gives: 0 miss, 1 hit, 2 miss, 3 hit, 4 miss, 5 hit, 6 miss, 7 hit.

Cache Performance

Average Memory Access Time (AMAT):

AMAT = Hit Time + Miss Rate * Miss Penalty

Example: Hit Time = 1 ns, Miss Rate = 4%, Miss Penalty = 10 ns:
AMAT = 1 + 0.04 * 10 = 1.4 ns

Reducing Cache Miss Rate

Two (conflicting?) considerations:
- Speed of the cache: as high as possible, to match the CPU clock.
- Size: as large as possible, to capture as many memory accesses as possible and keep the miss rate small.

Higher associativity leads to better utilisation of the cache memory but is expensive, in silicon area for the comparators and in speed for the multiplexers required. How to cope? What is the best balance?

Two-level Caches

Add a second-level cache. The 1st-level cache operates at processor speed and provides most of the performance; the 2nd-level cache helps reduce the miss penalty of the first level.

AMAT = Hit Time L1 + Miss Rate L1 * Miss Penalty L1
Miss Penalty L1 = Hit Time L2 + Miss Rate L2 * Miss Penalty L2
AMAT = Hit Time L1 + Miss Rate L1 * (Hit Time L2 + Miss Rate L2 * Miss Penalty L2)

Performance of L2 Cache

Definitions:
- Local miss rate = misses / references to that cache.
- Global miss rate = misses / references made by the CPU.

The local miss rate of an L2 cache will be quite poor, since the L1 cache will have most of the hot items; the L2 cache has a different emphasis. The L1 cache emphasises low latency, to keep up with the CPU on most accesses (a fair hit rate). The L2 cache emphasises catching the misses from the 1st level:
- Larger size (reduces capacity misses)
- Higher associativity (reduces conflict misses)
- Larger block sizes (reduces compulsory misses through spatial locality)

Latency is less important than for the 1st level, which allows the higher associativity.

Victim Caches

A victim cache improves the performance of direct-mapped or small N-way caches that have problems with conflict misses. It is a small (4-8 entry) fully associative cache located between the 1st-level cache and the lower levels. Items evicted from the cache are kept there for a short time, giving items that are still hot a second chance: items found in the victim cache are restored to the 1st-level cache!
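The direct-mapped trace earlier in this section can be checked mechanically; this small program (a sketch, using the same slot = address mod 8 mapping and a -1 sentinel for empty slots) prints the same hit/miss sequence:

```c
#include <stdio.h>

int main(void)
{
    int refs[] = {3,0,1,2,6,4,5,7,8,9,0,0,0,2,2,2,4,9,1,9,1,3,12,5};
    int n = sizeof refs / sizeof refs[0];
    int slot[8];
    for (int i = 0; i < 8; i++)
        slot[i] = -1;                       /* -1 marks an empty slot */

    for (int i = 0; i < n; i++) {
        int s = refs[i] % 8;                /* direct mapping */
        printf("%2d -> slot %d: %s\n", refs[i], s,
               slot[s] == refs[i] ? "hit" : "miss");
        slot[s] = refs[i];                  /* load the block on a miss */
    }
    return 0;
}
```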
Cache Coherency

(Diagram: a CPU with a cache, an I/O processor or second CPU, and shared memory.) X is modified in the cache to become X'. When is the memory updated? Until then, another processor's access to X obtains stale data: there can be a lack of coherency between the cache and memory.

Multilevel Inclusion

Inclusion for a multilevel cache means that all the data in the higher-level cache is also maintained in the lower-level one. It is a common requirement, since it somewhat simplifies the cache coherency problem (reduces it to a simpler cache problem). Inclusion requires blocks to be flushed from the higher level to the lower level when replaced.

Reducing Miss Penalty

Cache blocks are relatively large, while the CPU usually only requires a particular word before it can continue, so:
- Critical word first: fetch the critical word of the block first (out-of-order fetch of data within the block).
- Early restart: fetch in conventional order, but forward the required word to the CPU as soon as it becomes available.

Effect of Cache Size

(Figures: total miss rate versus cache size, and miss rate broken down by miss type; not reproduced in the transcript.)

Changes in Block Size

(Figure: as block size increases, compulsory misses fall while conflict misses rise; not reproduced in the transcript.)

Hardware Pre-fetch

This relies on there being spatial locality that has not been completely exhausted by the block size in use, combined with spare (average) memory bandwidth. On a cache miss, load the required block as usual but also load the next consecutive block, and keep it in a stream buffer. The idea is that sequential access is very common (spatial locality), and storing the next block or two is inexpensive in time and hardware. On a hit in the stream buffer the block is moved into the cache; the buffer is re-used on misses. (There may be several buffers.)

Where does diminishing returns set in? Consecutive instruction sequences have not increased much in size over time: programs have grown larger, but the distance between flow changes (branches etc.) has not changed much, and much of this locality may already have been absorbed by the increase in block sizes associated with larger memories. Data is a different matter. Data structures have grown dramatically with the increase in memory sizes and computing power (consider video and image processing), which has also increased the likelihood of pre-fetching working for data, since these structures are often scanned sequentially.

Virtual Memory

Main memory can be viewed as a "cache" for the disk. Virtual memory is required for modern machines that must support multiple processes with finite memory. It maps the Virtual Addresses the CPU 'sees' to the Physical Addresses presented to memory.
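In outline, the translation splits the address at the page boundary; a sketch assuming the 4KB pages used in the examples that follow, with lookup_frame as a hypothetical stand-in for the page-table machinery:

```c
#include <stdint.h>

#define PAGE_BITS 12u                 /* 4 KB pages */
#define PAGE_SIZE (1u << PAGE_BITS)

/* Stand-in for the page table: maps a virtual page number to a
   physical frame number (identity mapping, for illustration only). */
static uint32_t lookup_frame(uint32_t vpn)
{
    return vpn;
}

uint32_t translate(uint32_t va)
{
    uint32_t vpn    = va >> PAGE_BITS;        /* virtual page number  */
    uint32_t offset = va & (PAGE_SIZE - 1);   /* passes through as-is */
    return (lookup_frame(vpn) << PAGE_BITS) | offset;
}
```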
(Diagram: the CPU issues a virtual address, the address-translation unit produces a physical address, and the physical address goes to memory.)

Address Mapping

(Diagram: virtual addresses are translated to physical addresses in memory or to disk addresses.)

Paged Virtual Addresses

Each process has its own virtual address space, which is mapped into the single physical address space. Memory is divided into equal-size pages, and a per-process page table does the mapping; the mapping may change with time.

Paging benefits:
- VA > PA: not all pages need to reside in memory at once.
- Flexible placement of code and data; dynamic relocation is possible.
- (External) memory fragmentation is prevented.
- Fast start-up: not all of a program need be loaded before execution commences.
- Protection and sharing between processes.

Page Tables

(Figure not reproduced in the transcript.)

Paging

Where is paging in the memory hierarchy? Resident pages are in main memory; non-resident pages are on disk. The timing requirements are very different from those of a cache: much more sophisticated algorithms, implemented in software, are possible, but the penalties are much higher. A write-back policy is appropriate.

Paging Costs

Two overheads:
- How to do the page lookup? This is basically a table lookup. The number of active entries is relatively small, so hot ones can be kept in a translation cache that operates at hardware (~single-instruction) speed. A miss in the translation cache requires a hardware/software table walk: pretty expensive in time, but not as bad as the next!
- What happens when a page is not in memory? We need to go to disk, which is horribly slow. The likelihood of this must be kept small, so deciding which pages to keep in memory justifies a relatively complex but slow software algorithm, say LRU.

Compare this to the cache requirements. Relative access times: CPU = 1, memory ≈ 100x, disk ≈ 20,000 * 100x.

Address Translation

(Figure not reproduced in the transcript.)

Translation Look-aside Buffer

(Flowchart, linearised:) Use the virtual address to access the TLB. On a TLB miss, raise an exception (to walk the page table); on a TLB hit, form the physical address. For a read, try to read the data from the cache: on a hit, deliver the data to the CPU; on a miss, stall. For a write, check the write-access bit: if it is off, raise a write-protection exception; if it is on, write the data into the cache, update the tag, and put the data and the address into the write buffer.

Page Table Overheads

Page table size:
- Example 1: 32-bit VA, 4KB pages, 4-byte PTEs: 1M pages, 4MB page table.
- Example 2: 64-bit VA, 4KB pages, 4-byte PTEs: 4P pages, 16PB page table.

These tables are too large to keep in physical memory! But processes don't actually use all of the virtual address space, so:
- use multi-level page tables, without all entries being resident (or used), or
- use inverted page tables.

Multi-level PT Size Example

32-bit address space, 4KB pages, 4-byte PTEs, and a 2-level page table whose 2nd-level tables are each the size of one data page. The program uses only the upper and lower 1MB of the address space. How much memory does the page table take?
- 4KB pages require 12 address bits.
- A 4KB table contains 4KB / 4B = 1K PTEs, so a 2nd-level table uses 10 address bits and covers 1K * 4KB = 4MB of memory.
- That leaves 32 - 12 - 10 = 10 address bits for the 1st-level table, so it has 1K entries and is 1K * 4B = 4KB in size.
- We need one 2nd-level table for the top 1MB and another for the bottom 1MB.
- Memory = 1st-level table (4KB) + 2 * 2nd-level tables (4KB each) = 12KB.
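A sketch of the two-level walk implied by this 10/10/12 split (the table layout and names are illustrative; a real PTE would also carry valid and protection bits):

```c
#include <stdint.h>
#include <stddef.h>

#define L1_SHIFT 22u          /* top 10 bits index the 1st-level table */
#define L2_SHIFT 12u          /* next 10 bits index a 2nd-level table  */
#define L2_MASK  0x3FFu
#define OFF_MASK 0xFFFu       /* low 12 bits: offset within a 4KB page */

/* 1st-level table: each entry points to a 2nd-level table, or is NULL
   when that 4MB region is unused (most of it, in this example). */
static uint32_t *l1_table[1024];

/* Returns the physical address, or 0 if the region is unmapped. */
uint32_t translate2(uint32_t va)
{
    uint32_t *l2 = l1_table[va >> L1_SHIFT];
    if (l2 == NULL)
        return 0;                               /* no 2nd-level table */
    uint32_t frame = l2[(va >> L2_SHIFT) & L2_MASK];
    return (frame << L2_SHIFT) | (va & OFF_MASK);
}
```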
Address Translation

(Figure not reproduced in the transcript.)

Inverted Page Table

Use a hashing function to look up the required page table entry. The table size is proportional to the memory in use, not to the virtual address space. There is some overhead from chained entries on hashing collisions.

Hardware Details

Where should the VA-to-PA translation be done? It seems sensible to do it after the 1st-level cache, so that it doesn't impact performance too much. But then the 1st-level cache would be using virtual addresses; so what? Synonyms! Two VAs may well map to the same PA, causing all manner of problems: what happens when one is modified? Cache coherency: how do you invalidate cache or memory contents to keep them consistent? And protection, which is normally incorporated into the page-table setup, would be negated for the 1st-level cache.

Use PA Caches

Translating before the 1st-level cache means the overhead of the page-table lookup can't be hidden, but it prevents the other problems. The translation must be done at 1st-level cache speed, so a cache of recent translations is used. There are still two cache lookups in series, halving the speed of the 1st-level cache.

Best of Both Worlds?

The page-table lookup translates only the upper part of the physical address; there is no translation of the lower bits. The cache lookup initially uses only the lower part of the address, with the upper part used for the tag. So the translation and the cache lookup can be done in parallel, with the physical-address tag checked against the translated address afterwards. This requires (block index + offset) to fit within the page offset.

Virtual Index - Physical Tag

(Diagram: the virtual page number indexes the translation cache while the block index and block offset, taken from the untranslated page offset, index the data cache; the physical-address tags from the two lookups are compared to detect a hit.)
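The closing constraint can be expressed as a quick check: the bytes covered by one way of the cache (its index plus offset bits) must not exceed the page size. A sketch with illustrative parameters:

```c
#include <stdio.h>

/* Index + offset bits must fit within the page offset for a
   virtually indexed, physically tagged cache to work. */
int vipt_ok(unsigned cache_bytes, unsigned ways, unsigned page_bytes)
{
    /* cache_bytes / ways is the span addressed before translation */
    return cache_bytes / ways <= page_bytes;
}

int main(void)
{
    /* e.g. with 4KB pages: an 8KB 2-way cache spans 4KB per way (OK),
       while a 16KB 2-way cache would span 8KB per way (too big). */
    printf("%d\n", vipt_ok(8  * 1024, 2, 4096));  /* prints 1 */
    printf("%d\n", vipt_ok(16 * 1024, 2, 4096));  /* prints 0 */
    return 0;
}
```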