csc25-chapter_04b.pdf

CSC-25 High Performance Architectures
Lecture Notes – Chapter IV-B
Memory construction technology & RAM, interleaved and virtual memories

Denis Loubach
Department of Computer Systems
Computer Science Division – IEC
Aeronautics Institute of Technology – ITA
1st semester, 2024

Detailed Contents

Main Memory Performance
  - Overview
Interleaved Memory
  - Definition
  - Wider bus width vs. interleaved memory
  - Block Duplication Effects
  - Block Quadruplication Effects
RAM Construction Technology
  - Technologies Overview
  - DRAM Organization
  - DRAM Operation
  - DRAM Performance
  - Modules
Virtual Memory
  - Recap: Memory Hierarchy
  - Main Concepts
  - Virtual Memory and Cache
  - Virtual Memory Classes
  - Four Questions, now on Virtual Memory
  - Q1 - Block Placement
  - Q2 - Block Identification
  - Q3 - Block Replacement
  - Q4 - Write Strategy
  - Translation Lookaside Buffer - TLB
  - Page Size Selection
  - Virtual and Physical Caches
Putting It All Together: Intel Core i7
References

Outline

- Main Memory Performance
- Interleaved Memory
- RAM Construction Technology
- Virtual Memory
- Putting It All Together: Intel Core i7
- References

Main Memory Performance

Overview

Techniques to improve main memory performance
- increase data width
  - wider memory data bus
  - memory interleaving
- improve access time
  - construction technology, which impacts, e.g., clock and latency
  - double data rate - DDR

Interleaved Memory

Definition

Interleaved memory allows simultaneous access to as many words as the number of independent banks

With a number of banks, it is possible to have several instructions and several operands in the fetch phase, and several results in the storage phase

To really take advantage of that, the memory accesses issued at a given moment must each address a different bank

Wider bus width vs. interleaved memory

Let's consider the following
- sending the address on the address bus: 4 CPU clock cycles
- word access in the memory: 56 clocks
- sending a word on the data bus: 4 clocks
- word size of 64 bits

(A) For a block of 1 word
- average number of cycles per instruction (without cache misses): 2
- miss penalty: 64 clocks (4 + 56 + 4)
- memory accesses per instruction: 1.2
- miss rate: 3%

(B) For a block of 2 words
- miss rate: 2%

(C) For a block of 4 words
- miss rate: 1.2%

What is the improvement in the system, compared to the original one with a simple bus, when using
- interleaving of 2 or 4 banks
- a system with doubled bus width
and considering blocks of 1, 2 or 4 words?

Solution

\[ EC0 = CC + \mu \times \rho \qquad (1) \]

where EC0 is the effective cycle; CC is the cache cycle; \mu is the miss rate; and \rho is the miss penalty. In the calculations that follow, \rho is the miss penalty per instruction, i.e., the 1.2 memory accesses per instruction multiplied by the penalty of a single miss

CPI for a memory system of 1 word (original system)

\[ EC0 = CC + \mu \times \rho = 2 + 0.03 \times (1.2 \times 64) = 4.30 \]
Block Duplication Effects

Block of 2 words (128 bits)

- 64-bit bus, without interleaving (fully sequential)
  \[ EC0 = CC + \mu \times \rho = 2 + 0.02 \times [1.2 \times (2 \times 64)] = 5.07 \]
- 128-bit bus, without interleaving (fully parallel)
  \[ EC0 = CC + \mu \times \rho = 2 + 0.02 \times (1.2 \times 64) = 3.54 \]
- 64-bit bus, interleaved (2 banks)
  \[ EC0 = CC + \mu \times \rho = 2 + 0.02 \times \{1.2 \times [4 + 56 + (2 \times 4)]\} = 3.63 \]
  where 4 is the initial request, 56 is the word access in the memory, and 2 × 4 is sending the words back sequentially

Performance decreases on the 64-bit bus system without interleaving, i.e., from 4.30 to 5.07

Doubling the bus width makes it 1.22× faster

\[ \text{Speedup} = \frac{4.30}{3.54} = 1.22 \]

Memory interleaving is 1.19× faster

\[ \text{Speedup} = \frac{4.30}{3.63} = 1.19 \]

The speedup ratios are computed from the unrounded EC0 values (e.g., 4.304 / 3.632)

Block Quadruplication Effects

Block of 4 words (256 bits)

- 64-bit bus, without interleaving (fully sequential)
  \[ EC0 = CC + \mu \times \rho = 2 + 0.012 \times [1.2 \times (4 \times 64)] = 5.69 \]
- 128-bit bus, without interleaving (partially parallel)
  \[ EC0 = CC + \mu \times \rho = 2 + 0.012 \times [1.2 \times (2 \times 64)] = 3.84 \]
- 64-bit bus, interleaved (4 banks)
  \[ EC0 = CC + \mu \times \rho = 2 + 0.012 \times \{1.2 \times [4 + 56 + (4 \times 4)]\} = 3.09 \]
  where 4 is the initial request, 56 is the word access in the memory, and 4 × 4 is sending the words back sequentially

Performance decreases on the 64-bit bus system without interleaving, i.e., from 4.30 to 5.69

Doubling the bus width makes it 1.12× faster

\[ \text{Speedup} = \frac{4.30}{3.84} = 1.12 \]

Memory interleaving is 1.39× faster

\[ \text{Speedup} = \frac{4.30}{3.09} = 1.39 \]
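The calculations above can be reproduced with a short script. This is a minimal sketch that only restates the slide arithmetic; the function and variable names are illustrative assumptions, not part of the lecture notes.

```python
# Timing parameters from the worked example, in CPU clock cycles.
ADDRESS_CYCLES = 4     # sending the address on the address bus
ACCESS_CYCLES = 56     # word access in the memory
TRANSFER_CYCLES = 4    # sending one word on the data bus
WORD_PENALTY = ADDRESS_CYCLES + ACCESS_CYCLES + TRANSFER_CYCLES  # 64

CACHE_CYCLES = 2             # CC: cycles per instruction without misses
ACCESSES_PER_INSTR = 1.2     # memory accesses per instruction
MISS_RATE = {1: 0.03, 2: 0.02, 4: 0.012}  # block size (words) -> miss rate


def effective_cycles(block_words, penalty_per_miss):
    """EC0 = CC + miss rate x (accesses per instruction x miss penalty)."""
    return CACHE_CYCLES + MISS_RATE[block_words] * ACCESSES_PER_INSTR * penalty_per_miss


def plain_penalty(block_words, bus_words):
    """No interleaving: one full 64-cycle access per bus-wide chunk."""
    return (block_words // bus_words) * WORD_PENALTY


def interleaved_penalty(banks):
    """Banks are accessed in parallel; words are sent back one at a time."""
    return ADDRESS_CYCLES + ACCESS_CYCLES + banks * TRANSFER_CYCLES


baseline = effective_cycles(1, plain_penalty(1, 1))
results = {
    "2 words, 64-bit bus": effective_cycles(2, plain_penalty(2, 1)),
    "2 words, 128-bit bus": effective_cycles(2, plain_penalty(2, 2)),
    "2 words, 2 banks": effective_cycles(2, interleaved_penalty(2)),
    "4 words, 64-bit bus": effective_cycles(4, plain_penalty(4, 1)),
    "4 words, 128-bit bus": effective_cycles(4, plain_penalty(4, 2)),
    "4 words, 4 banks": effective_cycles(4, interleaved_penalty(4)),
}

print(f"baseline (1 word, 64-bit bus): EC0 = {baseline:.2f}")
for name, ec in results.items():
    print(f"{name}: EC0 = {ec:.2f}, speedup vs. baseline = {baseline / ec:.2f}")
```

Keeping the per-miss penalty in separate helper functions makes it easy to try other bus widths or bank counts without touching the EC0 formula itself.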
The cost of quadrupling the memory bus may become prohibitive and would not yield much better performance

256-bit bus, without interleaving

\[ EC0 = CC + \mu \times \rho = 2 + 0.012 \times (1.2 \times 64) = 2.92 \]

This is a speedup of just 1.06 with respect to the memory interleaving configuration with a 64-bit bus and 4 banks

\[ \text{Speedup} = \frac{3.09}{2.92} = 1.06 \]

RAM Construction Technology

Technologies Overview

Read-only memory - ROM
- non-volatile
- can be written just once
- some variations can be electrically erased and reprogrammed at slow speed, i.e., electrically erasable programmable read-only memory - EEPROM

Static random-access memory - SRAM
- priority on speed and capacity
- static: data does not need to be refreshed periodically
- non-multiplexed address lines
- 8 to 16× more expensive than DRAM

SRAM aims to minimize the access time of caches

Dynamic random-access memory - DRAM
- priority on cost per bit and capacity
- dynamic: data needs to be refreshed/re-written periodically (even without reading the data)
- multiplexed address lines

Traditionally, DRAMs had an asynchronous interface with their controller and thus a synchronization overhead
- a clock signal was introduced to the DRAM chips, making them synchronous
- this was named synchronous DRAM - SDRAM
DDR SDRAM: innovation where memory data is transferred on both the rising and falling edges of the SDRAM clock signal, thereby doubling the data rate

DDR2, DDR3 and DDR4
- evolution of DDR technology with increased clock rates and reduced chip voltage
- (1 → 2): voltage drops from 2.5 to 1.8 V, with higher clock rates, e.g., 266, 333, and 400 MHz
- (2 → 3): voltage drops to 1.5 V, with a maximum clock speed of 800 MHz
- (3 → 4): voltage drops to 1-1.2 V, with a maximum expected clock rate of 1600 MHz
- the DDR5 standard was released in 2020 [1]; DDR6 to be launched in 2024/25 [2]

[1] Intel supports DDR5 in the 12th and 13th generation cores, and AMD supports DDR5 in the Ryzen 7000-series architectures
[2] https://www.pcworld.com/article/2237799/ddr6-ram-what-you-should-already-know-about-the-upcoming-ram-standard.html

Dual inline memory module - DIMM
- memories are usually sold on small boards, i.e., DIMMs
- each contains from 4 to 16 DRAM chips, and
- the chips are usually arranged to provide 8-byte words

DRAM Organization

Modern DRAMs are organized in banks (up to 16 for DDR4)

Each bank has a number of rows

[Figure: DRAM basic internal organization]

DRAM Operation

Memory is organized as a matrix addressed by banks, rows and columns

Address multiplexing, managed by the DRAM controller
- send bank and row numbers
- send column address
- data is accessed, i.e., read/write
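The address multiplexing steps above can be illustrated by splitting a flat address into the bank, row and column fields that the controller sends in turn. This is only a sketch: the field widths and the row | bank | column bit layout are hypothetical assumptions chosen for illustration, since real controllers use device-specific mappings.

```python
from typing import NamedTuple


class DramAddress(NamedTuple):
    bank: int
    row: int
    column: int


# Hypothetical field widths (assumptions, not taken from the notes).
COLUMN_BITS = 10   # 1024 columns per row
BANK_BITS = 4      # 16 banks, the DDR4 maximum mentioned above
ROW_BITS = 15      # 32768 rows per bank


def split_address(addr: int) -> DramAddress:
    """Split a flat address laid out as row | bank | column (high to low bits)."""
    column = addr & ((1 << COLUMN_BITS) - 1)
    bank = (addr >> COLUMN_BITS) & ((1 << BANK_BITS) - 1)
    row = (addr >> (COLUMN_BITS + BANK_BITS)) & ((1 << ROW_BITS) - 1)
    return DramAddress(bank=bank, row=row, column=column)


# Step 1: the controller sends bank and row; step 2: it sends the column.
example = split_address((5 << 14) | (3 << 10) | 7)
print(f"send bank={example.bank}, row={example.row}; then column={example.column}")
```

Placing the bank bits just above the column bits is one possible design choice: consecutive rows' worth of addresses then fall into different banks, which favours interleaved access.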
To initiate a new access, the DRAM controller handles these commands

The activate (ACT) command [3]
- opens a bank and a row, and
- loads the entire row into the row buffer

The column address [4] can then be sent, e.g., for a single item or a burst

The precharge (PRE) command
- closes the bank and row, and
- prepares it for a new access

Commands and block transfers are synchronized with a clock

[3] Formerly named "row address strobe" - RAS
[4] Formerly named "column address strobe" - CAS

Before accessing a new row, the bank must be precharged
- if the row is in the same bank, then the precharge delay is incurred
- if the row is in a different bank, closing the row and precharging can overlap with accessing the new row

In SDRAMs, each of those command cycles requires an integral number of clock cycles

DRAMs use just a single transistor, effectively acting as a capacitor, to store a bit (more bits per chip)

Implications
1. a read from DRAM requires a write back, since the information is "destroyed"
2. while not being read/written, DRAM requires a periodic refresh [5], due to leakage

Bits in a row are updated in parallel

[5] Every row must be accessed within a time frame, such as 64 ms. DRAM controllers include hardware to do that

When the row is in the buffer, it can be transferred by
- single item: successive CAS at whatever the width of the DRAM is (4, 8, or 16 bits in DDR4), or
- burst: specifying a block transfer and the starting address
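The ACT / column access / PRE sequence described above can be sketched as a toy model of one bank with its row buffer. The class, the "COL" command label and the open-row policy (a row stays open until a different row is requested) are assumptions made for illustration; no timing is modelled.

```python
class Bank:
    """Toy model of one DRAM bank with a single row buffer."""

    def __init__(self):
        self.open_row = None   # row currently held in the row buffer
        self.commands = []     # log of commands issued by the controller

    def access(self, row, column):
        """Access (row, column) and report whether the row buffer was hit."""
        if self.open_row == row:
            # Row already in the buffer: only the column address is needed.
            self.commands.append(("COL", column))
            return "row hit"
        if self.open_row is not None:
            # A different row is open: close it before opening the new one.
            self.commands.append(("PRE", self.open_row))
        self.commands.append(("ACT", row))   # load the entire row into the buffer
        self.open_row = row
        self.commands.append(("COL", column))
        return "row miss"


bank = Bank()
for row, column in [(1, 0), (1, 4), (2, 0)]:
    print(f"row {row}, column {column}: {bank.access(row, column)}")
print(bank.commands)
```

The second access reuses the open row and needs a single command, while the third must precharge and activate first, which is the same-bank precharge delay mentioned above.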
DRAM Performance

Non-uniform access time due to
- location
- refresh/write back

Typically, time is divided into
- RAS precharge: row selection
- RAS-to-CAS delay: column selection
- CAS latency: data read/write
- cycle time: average complete time

[Figures: DRAM read time; DRAM write time]

[Table: DDR SDRAM capacity and access times. Access times are for a random memory word, assuming a new row must be opened. If the row is in a different bank, that bank is assumed to be already precharged. If the row is not open, a precharge is required, leading to a longer access time]

Modules

To facilitate handling and also to exploit memory interleaving, memory modules are used
- DIMM
- 4 to 16 memory chips
- 64-bit bus
- the number of pins varies among different DDRn

[Figure: 168-pin module]

\[ 2128\ \frac{\text{MiB}}{\text{s}} = 266\ \frac{\text{M transfers}}{\text{s}} \times 8\ \frac{\text{B}}{\text{transfer}} \]
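The module bandwidth figure above is the transfer rate multiplied by the module width. A minimal sketch follows; the 133 MHz base clock is an assumption used to show where 266 M transfers/s comes from under double data rate, and the result keeps the unit convention of the notes.

```python
def peak_bandwidth(clock_mhz, bytes_per_transfer, transfers_per_clock=2):
    """Peak module bandwidth: clock x transfers per clock x bytes per transfer.

    transfers_per_clock=2 models DDR (data on both clock edges).
    """
    mega_transfers_per_s = clock_mhz * transfers_per_clock
    return mega_transfers_per_s * bytes_per_transfer


# Assumed 133 MHz clock -> 266 M transfers/s on a 64-bit (8-byte) DIMM.
print(peak_bandwidth(clock_mhz=133, bytes_per_transfer=8))
```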
Virtual Memory

Recap: Memory Hierarchy

Moving farther away from the processor, the memory in each lower level becomes slower and larger

Main Concepts

Virtual memory (VM) automatically manages [6] the two levels of the memory hierarchy represented by
- main memory, and
- secondary storage

Other key points
- memory space sharing and protection
- memory relocation

[Figure: Mapping of virtual memory to physical memory for a program with four pages]

[6] Relieves programmers from managing overlays

Several general memory hierarchy ideas about caches are analogous to VM
- page or segment is used for block
- page fault or address fault is used for miss

Memory mapping or address translation: with VM, the processor produces virtual addresses that are translated [7] into physical addresses, which access the main memory

[7] By a combination of hardware and software

Virtual Memory and Cache

Main differences between VM and cache

Replacement on cache misses is primarily controlled by hardware, while VM replacement is primarily controlled by the operating system (OS)

The longer miss penalty makes a good replacement decision more important, so the OS can be involved and take time deciding what to replace

The size of the processor address determines the size of VM, but the cache size is independent of it
Besides acting as the lower-level backing store for main memory in the hierarchy, secondary storage is also used for the file system (FS)

The FS occupies most of secondary storage, and it is not usually in the address space

Virtual Memory Classes

VM can be categorized into two classes
- pages: fixed-size blocks
- segments: variable-size blocks

Pages are typically fixed at 4096-8192 bytes

For segments, the largest examples range from 2^16 to 2^32 bytes; the smallest segment is 1 byte

Four Questions, now on Virtual Memory

Q1 Where can a block be placed in main memory?
Q2 How is a block found if it is in main memory?
Q3 Which block should be replaced on a virtual memory miss?
Q4 What happens on a write?

Q1 - Block Placement

The miss penalty for VM involves access to low-speed memory devices in the lower level, e.g., magnetic disks, costing thousands of clock cycles

The miss penalty in this case is very high

Between a lower miss rate and a simpler placement algorithm, the former is preferred due to the miss penalty cost

Thus, the OS allows blocks to be placed anywhere in main memory, i.e., fully associative placement
Q2 - Block Identification

Paged addressing: a single fixed-size address (one word) divided into page number and offset within the page, like cache addressing

Segmented addressing: due to the variable segment size, it needs two words, one for the segment number and another for the offset within the segment

Paging and segmentation are both based on a data structure indexed by the page/segment number

This data structure holds the physical address of the block

Offset
- for segmentation, it is added to the segment's physical address to obtain the final physical address
- for paging, it is simply concatenated to the physical page address

Page table: the data structure containing the physical page address

Let's consider a 32-bit virtual address, and
- page size of 4 KiB
- 4 bytes per page table entry - PTE

What is the page table size?

\[ \text{PT size} = \frac{\text{Virtual Memory Size}}{\text{Page Size}} \times \text{PTE size} = \frac{2^{32}}{2^{12}} \times 2^{2} = 2^{22}\ \text{(4 MiB)} \]

An alternative to the previous scheme is to apply a hash function to the virtual address

This allows the data structure to have as many entries as there are physical pages in main memory, i.e., fewer than the number of virtual pages
- page size of 4 KiB
- 4 bytes per page table entry - PTE

What is the page table size in this inverted page table scheme, considering a 512 MiB physical memory?

\[ \text{IPT size} = \frac{\text{Physical Memory Size}}{\text{Page Size}} \times (\text{PTE size} + \text{Virtual Address Size}) = \frac{2^{29}}{2^{12}} \times (2^{2} + 2^{2}) = 2^{20}\ \text{(1 MiB)} \]
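The two page table sizes above, and the "concatenate the offset" rule for paging, can be checked with a short sketch. The function names and the dictionary used as a page table are illustrative assumptions; a missing entry simply raises `KeyError`, standing in for a page fault.

```python
KIB = 2**10
MIB = 2**20


def page_table_size(virtual_space, page_size, pte_size):
    """One PTE per virtual page."""
    return (virtual_space // page_size) * pte_size


def inverted_page_table_size(physical_memory, page_size, pte_size, vaddr_size):
    """One entry per physical page; each entry also stores the virtual address."""
    return (physical_memory // page_size) * (pte_size + vaddr_size)


def translate(vaddr, page_table, page_size):
    """Paged translation: look up the page number, keep the offset unchanged."""
    page_number, offset = divmod(vaddr, page_size)
    frame = page_table[page_number]    # KeyError here stands in for a page fault
    # With a power-of-two page size this is the same as concatenating
    # the frame number and the offset.
    return frame * page_size + offset


print(page_table_size(2**32, 4 * KIB, 4) // MIB, "MiB")                    # page table
print(inverted_page_table_size(512 * MIB, 4 * KIB, 4, 4) // MIB, "MiB")    # inverted
print(hex(translate(0x1234, {1: 7}, 4 * KIB)))
```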
Translation lookaside buffer - TLB: to reduce address translation time, computers use a cache dedicated to this task

Q3 - Block Replacement

The basic OS guideline is to minimize page faults

Almost all OSs try to replace the least recently used (LRU) block

The approach uses a use bit, or reference bit, which is set whenever the page is accessed

Periodically, the OS records the reference bits and then resets them, so that it can check which pages were accessed during that period

Q4 - Write Strategy

Simply, write back with the dirty bit

Write through is too costly [8]

[8] According to Hennessy and Patterson, 2017, no one has yet built a VM OS using write through

Translation Lookaside Buffer - TLB

Page tables are stored in main memory, as they can be large. Sometimes they are even paged themselves

In that case, it would take 2× longer to obtain the physical address, going through these two levels of paging

To minimize this performance issue, a cache of the page table is implemented: the translation lookaside buffer - TLB

A TLB entry is like a regular cache entry
- tag
  - holds part of the virtual address (without the offset)
- data
  - physical page frame number
  - protection field
  - valid bit
  - use bit
  - dirty bit (with respect to the page)
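The TLB entry fields listed above can be sketched as a small data structure with a lookup that falls back to the page table on a miss. Everything here is an illustrative assumption: the class names, the string encoding of the protection field, the page table as a dictionary of (frame, protection) pairs, and the simple oldest-first eviction, which is not the policy of any particular processor.

```python
from dataclasses import dataclass


@dataclass
class TLBEntry:
    tag: int             # virtual page number (virtual address without the offset)
    frame: int           # physical page frame number
    protection: str      # hypothetical encoding, e.g. "r" or "rw"
    valid: bool = True
    use: bool = False
    dirty: bool = False  # refers to the page, not to the TLB entry


class TLB:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}            # tag -> TLBEntry (models associative lookup)
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number, page_table):
        """Return the frame for page_number, refilling from the page table on a miss."""
        entry = self.entries.get(page_number)
        if entry is not None and entry.valid:
            self.hits += 1
        else:
            self.misses += 1
            frame, protection = page_table[page_number]
            if page_number not in self.entries and len(self.entries) >= self.capacity:
                # Evict the oldest inserted entry (dicts keep insertion order).
                self.entries.pop(next(iter(self.entries)))
            entry = TLBEntry(tag=page_number, frame=frame, protection=protection)
            self.entries[page_number] = entry
        entry.use = True
        return entry.frame


tlb = TLB(capacity=2)
page_table = {0: (5, "rw"), 1: (9, "r"), 2: (3, "rw")}
for page in [0, 0, 1, 2]:
    print(f"page {page} -> frame {tlb.lookup(page, page_table)}")
print(f"hits={tlb.hits}, misses={tlb.misses}")
```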
Changing the physical page frame number or the protection of a page table entry requires the OS to make sure the old entry is not in the TLB; otherwise, the system would behave improperly

The OS resets these bits by changing the value in the page table, and then invalidates the corresponding TLB entry

As soon as the entry is reloaded from the page table, the TLB gets an accurate copy of the bits

[Figure: VM with TLB and page table. First, the virtual address is searched in the TLB; if that fails, it is searched in the page table]

Page Size Selection

It is a matter of balance

Points in favor of a larger page size
- the size of the page table is inversely proportional to the page size; memory can be saved by making pages bigger
- a larger page size allows for larger caches with fast cache hit times (principle of locality)
- transferring larger pages to/from secondary storage is more efficient than transferring smaller pages
- the number of TLB entries is restricted; a larger page size means that more memory can be mapped efficiently, reducing the number of TLB misses

Points in favor of a smaller page size
- conserving storage: less storage is wasted when a contiguous region of virtual memory is not equal in size to a multiple of the page size
- avoiding unused memory in a page, i.e., internal fragmentation
- process start-up time: many processes are small, so a large page size would lengthen the time to invoke a process
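The balance described above can be made concrete with a few numbers. This is a rough sketch under stated assumptions: a 32-bit virtual address space with 4-byte PTEs (as in the earlier page table example), a hypothetical 64-entry TLB, a hypothetical 10,000-byte region, and a 64 KiB page size added only for contrast.

```python
KIB = 2**10


def page_table_bytes(virtual_space, page_size, pte_size=4):
    """Page table size shrinks as the page size grows."""
    return (virtual_space // page_size) * pte_size


def tlb_reach(entries, page_size):
    """Amount of memory that can be mapped by the TLB at once."""
    return entries * page_size


def wasted_bytes(region_size, page_size):
    """Internal fragmentation: unused bytes in the last page of a region."""
    return -region_size % page_size


for page_size in (4 * KIB, 8 * KIB, 64 * KIB):
    print(
        f"page {page_size // KIB:>2} KiB: "
        f"page table {page_table_bytes(2**32, page_size) // KIB} KiB, "
        f"TLB reach {tlb_reach(64, page_size) // KIB} KiB, "
        f"waste for a 10000-byte region {wasted_bytes(10_000, page_size)} B"
    )
```

Larger pages shrink the page table and extend the TLB reach, while the wasted space in the last page of each region grows, which is the trade-off listed in the two sets of points above.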
Virtual and Physical Caches

Even a small/simple cache must deal with the translation of a virtual address from the processor to a physical address to access memory

Making the common case fast: use virtual addresses for the cache, since hits are much more common than misses
- virtual caches: caches that use virtual addresses
- physical caches: traditional caches that use physical addresses

Why virtual caches are not more popular
- the OS and user programs may use two different virtual addresses for the same physical address
  - this happens fairly frequently, e.g., when two programs (two distinct virtual addresses) share the same data object
- these duplicate addresses [9] could result in two copies of the same data in a virtual cache
  - duplicated info in the cache
  - if one copy is modified, the other will have the wrong value

[9] Synonyms or aliases

One alternative to get the best of both virtual and physical caches
- use part of the page offset [10] to index the cache
- in parallel with the cache being read using that index, the virtual part of the address is translated, and the tag match uses physical addresses

This approach allows
- the cache read to begin immediately, while
- the tag comparison still uses physical addresses

[10] The part that is identical in both virtual and physical addresses

[Figure: Hypothetical memory hierarchy, from virtual address to L2 cache]

Putting It All Together: Intel Core i7

- the i7 uses 48-bit virtual addresses and 36-bit physical addresses
- memory management uses a two-level TLB
- the i7 has a three-level cache hierarchy
- first-level caches are virtually indexed and physically tagged
- L2 and L3 caches are physically indexed

References

Information to the reader: these lecture notes are mainly based on the following references

- Castro, Paulo André. Notas de Aula da disciplina CES-25 Arquiteturas para Alto Desempenho. ITA, 2018.
- Hennessy, J. L. and D. A. Patterson. Arquitetura de Computadores: Uma Abordagem Quantitativa. 5th ed. Campus, 2014.
- Hennessy, J. L. and D. A. Patterson. Computer Architecture: A Quantitative Approach. 6th ed. Morgan Kaufmann, 2017.
- Patterson, D. and S. Kong. Lecture notes, CS152 Computer Architecture and Engineering, Lecture 16: Memory Systems. Online, 1995.
