CSC-25 High Performance Architectures
Lecture Notes – Chapter IV-A
Memory Systems, Hierarchy and Concepts

Denis Loubach
[email protected]
Department of Computer Systems
Computer Science Division – IEC
Aeronautics Institute of Technology – ITA
1st semester, 2024

Detailed Contents

Memory Systems
  The von Neumann Bottleneck
  Memory and Processor: Performance Gap
  Other Considerations
Principle of Locality
Memory Hierarchy
  Overview
  Inclusion Property
  Visible Memory Levels
Cache Memory
  Overview
  Cache Info Example
  Basic Hierarchy Operation
  Four Memory Hierarchy Questions
  Q1 - Block Placement
  Q2 - Block Identification
  Q3 - Block Replacement
  Q4 - Write Strategy
  Performance Optimizations
References

Memory Systems

The von Neumann Bottleneck

Main memory is considered the bottleneck of the von Neumann architecture: its response time can be much longer than the processor's execution time.

If the processor does not receive instructions as fast as it can process them, what does it do? And is this performance gap decreasing or increasing?

Memory and Processor: Performance Gap

The performance gap is measured as the difference between the time between processor memory requests (for a single processor or core) and the latency of a DRAM access.

Other Considerations

Why can't all the memory in a computer be built from the same technology as the CPU?
- its cost would be very high, since the volume of information to store is very large
- CPU technology typically produces volatile memories

A lot of information must persist even after the machine is powered off, which calls for non-volatile memory.

Volatility vs. performance: volatile memories tend to be faster than non-volatile memories.

Important – much of the information is not used within short periods of time.

Principle of Locality

Programs tend to reuse both data and instructions they have used recently. This is the most important program property regularly exploited.

Rule of thumb: a program spends 90% of its execution time in only 10% of the code. This also applies to data, although not as strongly as to instructions.

It is therefore possible to predict, with reasonable accuracy, which instructions and data a program will use in the near future, based on its accesses in the recent past.

Two types of locality are observed (see the sketch after this list):
1. temporal locality – recently accessed items are likely to be accessed again soon
2. spatial locality – items whose addresses are near one another tend to be referenced close together in time
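As an illustration (not from the notes), a minimal C sketch of an ordinary array-summing loop that exhibits both kinds of locality; the array size is an arbitrary choice:

    #include <stdio.h>

    /* sum and i are reused on every iteration (temporal locality);
       a[i] walks consecutive addresses, so each fetched cache block
       serves several subsequent accesses (spatial locality). */
    int main(void) {
        static int a[1024];
        long sum = 0;
        for (int i = 0; i < 1024; i++)
            sum += a[i];            /* sequential, cache-friendly access */
        printf("%ld\n", sum);
        return 0;
    }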
How can this characteristic be exploited to improve performance? The principle of locality, together with other guidelines¹, led to memory hierarchies with different latencies and sizes.

¹ For a given implementation technology and power budget, smaller hardware can be made faster.

Memory Hierarchy

Overview

The memory at the next lower level becomes slower and larger as it moves farther away from the processor.

Questions:
1. The higher levels should always be faster, and the lower ones larger. Why?
2. In what situation is it advantageous to introduce new levels in the hierarchy?
3. When information is placed in the higher levels, should it be deleted from the lower levels?

Fast memory is more expensive, so the memory hierarchy is organized into several levels, each one smaller, faster, and more expensive per byte than the next lower level, which is farther from the processor.

The memory hierarchy aims to provide a memory system with (a worked example after the level descriptions below makes this concrete):
- a cost per byte almost as low as the cheapest level of memory, and
- a speed almost as fast as the fastest level

Inclusion Property

In most cases², the data contained in a lower level is a superset of the data in the next higher level. This inclusion property is always required for the lowest level of the hierarchy:
- main memory, in the case of caches
- secondary storage (disk or flash), in the case of virtual memory

² But not all. Modern multiprocessors with smaller L1 caches and different block sizes have sometimes chosen not to enforce inclusion.

Visible Memory Levels

CPU registers
- the CPU's working memory for temporarily holding instructions and data
- general-purpose registers (GPR), e.g., r1, r2, r3
- special-purpose registers (SPR), e.g., AR, DR, IR, SP, PC

Memory
- compared to CPU registers, the technology is usually inferior and the storage capacity is much higher
- must be large enough to store the running programs and their respective data

Storage
- non-volatile storage (code, data)
- overflow memory (related to virtual memory)
- accessed by the CPU through specific controllers and I/O
- considerable gap w.r.t. memory: size in TB vs. GB; speed in µs vs. ns
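The notes do not give the formula, but the standard average memory access time (AMAT) relation shows why a hierarchy approaches the speed of its fastest level; the numbers below are assumed round values, not measurements:

    AMAT = hit time + miss rate x miss penalty

With an assumed 1 ns cache hit time, 5% miss rate, and 100 ns miss penalty:

    AMAT = 1 + 0.05 x 100 = 6 ns

i.e., the average access is much closer to cache speed than to DRAM speed, even though most of the capacity (and cost per byte) lies in the slower level.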
Visible Memory Levels (summary)

From the lowest level up to the processor:
- size decreases – from TB down to bytes
- speed increases – from µs down to ps
- cost increases

Cache Memory

Overview

Cache: stores copies, for a short period of time, of the regions of main memory most frequently used by the CPU.

Unlike other memories, the cache is transparent to programmers, even in assembly language.

Computers with multiple hierarchical cache levels are common today; typically the highest levels are on the same chip as the CPU (e.g., L1, L2, L3).

Multiprocessors
- several processors demand information from memory, so the cache is also critical
- there is additional complexity to control access and keep cache coherence

The cache is often divided into a data cache and an instruction cache.

Instruction cache
- the next instruction to be executed is fetched from the instruction cache
- if it is not found there, a new block is copied into the cache
- if the code contains nested loops with a large number of iterations, a considerable number of main memory accesses can be spared

Cache hit: the processor finds the requested data in the cache.
Cache miss: the processor does not find a data item it needs in the cache.

Cache Info Example

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       39 bits physical, 48 bits virtual
CPU(s):              20
On-line CPU(s) list: 0-19
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               165
Model name:          Intel(R) Core(TM) i9-10900F CPU @ 2.80GHz
Stepping:            5
CPU MHz:             2800.000
CPU max MHz:         5200,0000
CPU min MHz:         800,0000
BogoMIPS:            5599.85
Virtualization:      VT-x
L1d cache:           320 KiB
L1i cache:           320 KiB
L2 cache:            2,5 MiB
L3 cache:            20 MiB
NUMA node0 CPU(s):   0-19

$ getconf -a | grep CACHE
LEVEL1_ICACHE_SIZE     32768
LEVEL1_ICACHE_ASSOC    8
LEVEL1_ICACHE_LINESIZE 64
LEVEL1_DCACHE_SIZE     32768
LEVEL1_DCACHE_ASSOC    8
LEVEL1_DCACHE_LINESIZE 64
LEVEL2_CACHE_SIZE      262144
LEVEL2_CACHE_ASSOC     4
LEVEL2_CACHE_LINESIZE  64
LEVEL3_CACHE_SIZE      20971520
LEVEL3_CACHE_ASSOC     16
LEVEL3_CACHE_LINESIZE  64
LEVEL4_CACHE_SIZE      0
LEVEL4_CACHE_ASSOC     0
LEVEL4_CACHE_LINESIZE  0
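The same parameters getconf reports can also be read programmatically; a minimal sketch, assuming a Linux/glibc system where the _SC_LEVEL*_CACHE_* sysconf names are available (they are glibc extensions, and return -1 or 0 where unsupported):

    #include <stdio.h>
    #include <unistd.h>

    /* Query cache geometry from the C library, mirroring getconf. */
    int main(void) {
        printf("L1d size:  %ld B\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
        printf("L1d assoc: %ld\n",   sysconf(_SC_LEVEL1_DCACHE_ASSOC));
        printf("L1d line:  %ld B\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
        printf("L2 size:   %ld B\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
        printf("L3 size:   %ld B\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
        return 0;
    }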
Basic Hierarchy Operation

The levels of the memory hierarchy (cache, main memory, secondary storage) are divided into blocks, or lines³. At any point in time, some blocks have copies in two or more levels.

Every access request is presented first to the highest level; if the data is found there, access to the lower level is avoided. Otherwise, the requested word (actually, the block containing the word) is fetched from the lower level into the higher one.

³ A group of multiple words is called a block (or line).

Four Memory Hierarchy Questions

1. Where can a block be placed in the upper level? (block placement)
2. How is a block found if it is in the upper level? (block identification)
3. Which block should be replaced on a miss? (block replacement)
4. What happens on a write? (write strategy)

Q1 - Block Placement

Categories of cache organization:
1. fully associative
2. direct mapped
3. set associative

Fully associative
- a block can be placed anywhere in the cache
- the place is usually chosen by the replacement policy

Direct mapped – each block has only one place where it can appear in the cache. The mapping is given by

    (block address) mod (# of blocks in the cache)    (1)

Set associative – a block can be placed in a restricted set⁴ of places in the cache. A block is first mapped onto a set, and then it can be placed anywhere within that set, according to

    (block address) mod (# of sets in the cache)    (2)

When there are n blocks in a set, the cache placement is called n-way set associative.

⁴ Set = a group of blocks in the cache.

Example: consider a cache with 8 blocks and a memory with 32 blocks, set associative with 4 sets, i.e., two-way set associative. (The C sketch after this slide illustrates mappings (1) and (2) with these parameters.)

Pros and cons of the cache organization categories:

Organization       Pros              Cons
Fully associative  full flexibility  complexity and implementation cost
Direct mapped      simplicity        possible inefficiency due to inflexibility
Set associative    a middle ground between the two previous

A direct-mapped cache may be viewed as one-way set associative, i.e., each set has just one block. A fully associative cache with m blocks may be viewed as m-way set associative, i.e., a single set with m blocks.
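A minimal sketch of mappings (1) and (2), hard-coding the 8-block, 4-set (two-way) parameters from the slide's example; the block address is an arbitrary choice:

    #include <stdio.h>
    #include <stdint.h>

    #define NUM_BLOCKS 8   /* cache blocks, as in the example  */
    #define NUM_SETS   4   /* 4 sets -> two-way set associative */

    int main(void) {
        uint32_t block_addr = 12;  /* memory block number (arbitrary) */
        /* (1) direct mapped: exactly one possible frame */
        printf("direct mapped   -> block %u\n", block_addr % NUM_BLOCKS);
        /* (2) set associative: any way within the selected set */
        printf("2-way set assoc -> set   %u\n", block_addr % NUM_SETS);
        return 0;
    }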
Q2 - Block Identification

The cache is divided into blocks with a fixed number of memory words (the upper level). The memory address is then divided into two parts (fully associative case):
1. the block address, a.k.a. the tag
2. the word position inside the block, named the block offset

The tag of every cache block is checked to verify whether it matches the block address issued by the processor.

Sizes:
- bits of the offset = log2(block size)
- bits of the block address = (# bits of the memory address) - (# bits of the offset)

For a set-associative or direct-mapped cache, the block frame address is further divided. For a 2^N-byte cache with 2^M-byte blocks and 32-bit addresses:
- tag field – the field compared for a hit; from the MSB side, the upper (32 - N) bits
- index field – selects the set (set associative) or the block (direct mapped)
- offset field – from the LSB side, the M lowest bits, i.e., block size = 2^M

As seen before, a fully associative cache has no need for an index field.

Example: for a 1 kB direct-mapped cache with 32 B blocks (2^10-byte cache, 2^5-byte blocks):
- [31:10] tag = (32 - N) = 22 bits
- [9:5]   index = (32 - # tag bits - M) = 5 bits
- [4:0]   offset = M = 5 bits

Example of block identification:
- AR = 01173 (tag = 0117, block offset = 3): hit, then DR = 12
- AR = 01163 (tag = 0116, block offset = 3): miss, then search in the lower level

Tag    Content                      Block no.
7620   11 76 DB 05 2E ...           0
0117   35 8B F3 12 81 7D 25 AF ...  1
3656   71 72 B8 AA ...              2
...    ...                          ...
1741   02 8C FF 36 ...              m

The tag marks the memory address to which the cache block corresponds; a valid bit indicates whether the block's data is valid. Tags are searched in parallel, since speed is critical.
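A minimal C sketch of this field extraction, hard-coding the 1 kB direct-mapped, 32 B-block parameters from the example above (the address value is arbitrary):

    #include <stdio.h>
    #include <stdint.h>

    #define OFFSET_BITS 5   /* 32 B blocks  -> M = 5          */
    #define INDEX_BITS  5   /* 1 kB / 32 B = 32 blocks -> 5   */

    int main(void) {
        uint32_t addr   = 0x12345678;  /* arbitrary 32-bit address */
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("tag=0x%06x index=%u offset=%u\n", tag, index, offset);
        return 0;
    }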
Q3 - Block Replacement

On a miss, the cache controller must select a block to be replaced by the required data.

Direct mapped
- hardware decisions are simplified
- there is no choice: only one block frame is checked for a hit, and only that block can be replaced

Fully associative or set associative – three main strategies:
1. random
2. least recently used (LRU)
3. first in, first out (FIFO)

Random replacement
- aims to spread allocations uniformly
- candidate blocks are selected at random
- some controllers generate pseudorandom block numbers to get reproducible behavior, which is useful for debugging

LRU replacement
- keeps track of block accesses to reduce the chance of throwing out data that is likely to be needed soon
- "use the past to predict the future": the block replaced is the one that has gone unused for the longest time
- a corollary of locality: if recently used blocks are likely to be used again, a good candidate for disposal is the least recently used block

FIFO replacement
- LRU can be complicated and expensive to track
- FIFO approximates LRU by evicting the oldest block rather than the least recently used one

Pseudo-LRU
- as the number of blocks to keep track of increases, LRU becomes increasingly expensive and is generally only approximated

A common approximation (sketched in code after this list):
- a number of bits is associated with each set in the cache
- each bit corresponds to a single way; a way is a bank in a set-associative cache, e.g., there are k ways in a k-way set-associative cache
- when a set is accessed, the bit corresponding to the way containing the desired block is turned on
- if all the bits of a set become turned on, they are all reset except the most recently turned-on bit
- when a block has to be replaced, the controller chooses a block from a way whose bit is turned off, picking randomly if there is more than one choice

(The slide shows a table of cache misses per 1,000 instructions at this point.)
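A minimal sketch of the bit-per-way pseudo-LRU approximation just described, for one set of a hypothetical 4-way set-associative cache; all names are illustrative:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define WAYS 4

    static uint8_t used = 0;  /* bit i on => way i recently accessed */

    /* Mark a way as accessed; if every bit becomes on, reset all
       bits except the most recently turned-on one. */
    void touch(int way) {
        used |= (uint8_t)(1u << way);
        if (used == (1u << WAYS) - 1)
            used = (uint8_t)(1u << way);
    }

    /* Pick a victim: any way whose bit is off, chosen at random.
       By construction at least one bit is always off. */
    int victim(void) {
        int candidates[WAYS], n = 0;
        for (int w = 0; w < WAYS; w++)
            if (!(used & (1u << w)))
                candidates[n++] = w;
        return candidates[rand() % n];
    }

    int main(void) {
        touch(0); touch(2); touch(3);          /* ways 0, 2, 3 used */
        printf("victim way: %d\n", victim());  /* only way 1 is off */
        return 0;
    }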
Q4 - Write Strategy

Reads dominate processor cache accesses; writes are roughly 10% of memory traffic. Enhancing the common case means making reads faster:
- blocks can be read from the cache at the same time the tag is read and compared
- on a hit, the requested part of the block is handed to the processor right away
- on a miss, there is no benefit, but no harm either

Could this strategy also be applied to writes?

Changing a block cannot start until the tag is checked for a hit, so writes usually take longer than reads: tag checking cannot occur in parallel with the write itself. The processor also specifies the size of the write (1-8 bytes), and only that portion of the block can be changed; reads, in contrast, can access more bytes than necessary without any problem.

Write policies distinguish cache designs:
- write through – data is written both to the block in the cache AND to the block in the lower-level memory (in parallel)
- write back – data is written only to the block in the cache; the modified cache block is written to main memory ONLY when it is replaced (i.e., on a cache miss)

Dirty bit – a status bit that helps reduce the frequency of write-backs on replacement. It indicates whether the cache block is
- dirty, i.e., modified while in the cache, or
- clean, i.e., not modified

If the block is clean, there is no need to write it back on a miss, since the very same information is already found in the lower levels.

Write back
- pros
  - writes occur at the speed of the cache memory
  - multiple writes within a block require only one write to the lower-level memory
  - uses less memory bandwidth, as some cache writes never go to memory⁵
  - saves power, as it exercises less of the hardware, i.e., the memory hierarchy and interconnects⁶
- cons
  - read misses may result in writes to the lower level (depending on the dirty bit)
  - coherency problem: the cache differs from the lower level⁷

⁵ Good for multiprocessors.
⁶ Good for embedded systems.
⁷ Bad for multiprocessors.

Write through
- pros
  - easier to implement
  - the cache is always clean: read misses never result in writes to the lower level
  - simplifies data coherency, as the next lower level always has the most current copy of the data⁸
  - makes multilevel caches more viable for the upper-level caches, as writes need only propagate to the next lower level rather than all the way to main memory
- cons
  - writes occur at the speed of the lower level

⁸ Good for multiprocessors.

Multiprocessors are tricky, as they want both:
- write back – for processor caches, to reduce memory traffic (power, bandwidth)
- write through – to keep the caches consistent with the lower levels of the memory hierarchy (data coherency)

Write stall – when the processor must wait for a write to complete under write through.

A common optimization to reduce write stalls is a write buffer: the processor continues as soon as the data is written to the buffer, i.e., processor execution proceeds in parallel with the memory write/update.

Write Buffer
- the processor writes data to the cache and to the write buffer
- the memory controller drains the buffer contents to the lower level

The write buffer is used in the write-through approach; it is just a FIFO, typically with four entries.

It works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
Otherwise: store frequency (w.r.t. time) > 1 / DRAM write cycle is a problem – write buffer saturation.

If the condition of "processor cycle too fast" or "many writes in a row" persists for a long period:
- buffer overflow, no matter how big it is
- processor cycle time
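A minimal sketch of the write-through-with-write-buffer idea, assuming the four-entry FIFO from the slide; all types and names are illustrative, and the cache update and DRAM write are stubbed out:

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 4  /* typical size, per the slide */

    typedef struct { uint32_t addr; uint32_t data; } WbEntry;

    static WbEntry buf[WB_ENTRIES];
    static int head = 0, count = 0;

    /* Returns true if the store was absorbed; false means the
       buffer is saturated and the processor must stall. */
    bool cpu_store(uint32_t addr, uint32_t data) {
        if (count == WB_ENTRIES)
            return false;  /* write stall */
        /* write through: update the cache (omitted) and enqueue
           the lower-level write instead of waiting for it */
        buf[(head + count) % WB_ENTRIES] = (WbEntry){ addr, data };
        count++;
        return true;
    }

    /* Called by the memory controller once per DRAM write cycle. */
    void drain_one(void) {
        if (count > 0) {
            printf("DRAM write %08x = %u\n", buf[head].addr, buf[head].data);
            head = (head + 1) % WB_ENTRIES;
            count--;
        }
    }

    int main(void) {
        /* six stores in a row against a 4-entry buffer: the last
           ones stall until the controller drains an entry */
        for (uint32_t i = 0; i < 6; i++)
            while (!cpu_store(0x1000 + 4 * i, i))
                drain_one();  /* stalled: let memory catch up */
        while (count > 0)
            drain_one();
        return 0;
    }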