full-lecture-93-159.pdf
Document Details
Uploaded by RockStarCherryTree
Tags
Full Transcript
97 Chapter 3: Memory Management Operating Systems and System So ware 98 Overview 1. Why is memory management needed? 2. Address spaces 3. Virtual memory 4. Se...
97 Chapter 3: Memory Management Operating Systems and System So ware 98 Overview 1. Why is memory management needed? 2. Address spaces 3. Virtual memory 4. Segmentation 5. Allocators Operating Systems and System So ware 99 A Process is its Memory As seen in the previous chapter A process is a set of memory mappings with one or more threads 0xFFFFFFFF The memory layout of a process contains various sections that store different types of data Kernel reserved for the kernel, cannot be accessed by user space Runtime Memory Files 0xC0000000 Stack Process ID (PID) Address space Current working directory (CWD) environment, local variables Segments Parent Process ID (PPID) Page table User ID (UID) Memory mappings Registers Group ID (GID) Page table pointer (CR3) File descriptors Thread table[ ] Thread ID Heap Registers dynamically allocated memory Instruction pointer Stack pointer General purpose BSS Signals uninitialized data Scheduler metadata Used CPU time Time of next alarm Data initialized static variables Text program code 0x00000000 How does the operating system map these sections in memory? Operating Systems and System So ware 100 Early Days: Direct Physical Mapping Until the early 1980s, there was no memory abstraction: programs directly used physical memory. Memory addresses are hardcoded directly in the code: mov r0, 4242 moves the data contained in the register r0 into memory at the address 4242 jmp 1234 jumps directly to the instruction located at address 1234 Programs can use any memory address, and only one process can be in memory at a time In other words, memory is managed directly by the user program(s) and not by the OS! Computer manufacturers/OS developers may provide guidelines to users to avoid programs touching kernel memory. Operating Systems and System So ware 101 Single Memory Layout With direct physical mapping, processes had a simple memory layout (from the OS perspective) Three main layouts: 1. Everything mapped in RAM 0xFFFFFFFF 0xFFFFFFFF 0xFFFFFFFF Programs can access OS memory as they wish. Device drivers ROM Operating system Used in early mainframes. ROM 2. OS in Read-Only Memory (ROM) OS mapped in ROM, protecting it from user programs. User program User program OS cannot be updated or extended. RAM RAM Used on some embedded devices. 3. Device drivers in ROM User program RAM OS mapped in regular RAM, and only device drivers are mapped in ROM for protection. OS can be updated, but also damaged by faulty/malicious Operating system Operating system RAM RAM user programs. 0x00000000 0x00000000 0x00000000 Used in early personal computers (BIOS). Model 1: Everything mapped in RAM Model 2: OS in ROM Model 3: Drivers-only in ROM How do we handle multiple processes? Operating Systems and System So ware 102 Multiprocessing: Swapping Memory Naive solution: copy user memory to/from a storage device, e.g., hard drive. If we want to context switch between processes A and B: 1. Move the currently running process’ memory to storage 0xFFFFFFFF A’s memory: RAM → storage User program B RAM 2. Move the newly scheduled process’ memory into RAM B’s memory: storage → RAM A C Storage device E Limitations F Copying to/from storage is very costly D Not suitable for multi-core systems Operating system RAM 0x00000000 Operating Systems and System So ware 103 Multiprocessing: Hardware Memory Keys One way to limit the cost of context switching and support multi-core systems is via hardware support. 0xFFFFFFFF IBM 360 provided protection keys: 2 KB 1000 0101 Memory is split in fixed size blocks (2 KB), each with a 4-bit key 1101 1101 P0: 0101 Each process has a key assigned to it fault 1101 P1: 1000 1000 P2: 1101 When the OS allocates a block, it assigns to it the key of the process asking for memory 0101 0101 When a process accesses a block with a key different from his, a fault is triggered 1000 (and most likely a crash of the process) 0101 1101 0101 1000 Multiple processes can simultaneously live in memory while being isolated from each other 1000 Operating system RAM 0x00000000 Operating Systems and System So ware 104 Limitations of Direct Physical Mapping Take two programs A and B (16 KiB in size) We first load A into memory: it is placed at address 0, with its own protection key We then load B into memory: it is placed at address 1000, with its own protection key When B executes the jmp 28 instruction, it will jump into A’s memory, triggering a fault thanks to the protection keys! IBM 360 fixed this issue with static relocation: when a program is loaded into memory, all addresses were patched on the fly by adding the base address where the program was loaded. Here, the OS would add 16380 to all addresses in the program’s instructions. 16416 cmp 16412 16408 16404 Static relocation is slow and requires metadata to differentiate between addresses and regular values 16396 16392 16388 jmp jmp16412 28 16384... 32 32 32 add 28 cmp 28 add 28 mov 24 24 mov 24 16 16 16 12 12 12 8 8 8 4 4 4 jmp 24 0 jmp 28 0 jmp 24 0 Operating Systems and System So ware 105 Address Spaces Operating Systems and System So ware 106 The Address Space Abstraction To efficiently and safely manipulate memory, we need two properties: Protection: user programs cannot access each other’s memory as well as kernel memory IBM 360’s protection keys provided protection, and fixed relocation with a costly mechanism Relocation: programs can be located anywhere in memory To provide both properties, OS designers came up with an abstraction: address spaces. Definition An address space is an abstract view of the memory as seen by a process. They are independent from each other and may be mapped to any physical address in memory. Note Address spaces being independent from each other means that address 1000 of two different address spaces can be mapped to different physical addresses. Manipulating address spaces can be done in various ways, with different hardware support Operating Systems and System So ware 107 Dynamic Relocation Dynamic relocation provides a similar solution to IBM 360’s static relocation, but performed fully in hardware. 16416 cmp 16412 16408 Each process is mapped in physical memory and assigned two registers: jmp 16384 + 28 16404 16396 Base which contains the base address where the process’s address space is mapped 16392 16388 Limit which contains the size of the process’ address space jmp 28 16384 Base: 0 Base: 16384... Limit: 16384 Limit: 16384 32 32 32 When the process accesses a memory address X, add 28 cmp 28 add 28 the CPU automatically: mov 24 24 mov 24 16 16 16 Adds the value of the base register: 12 12 12 addr = X + base 8 4 8 4 8 4 Checks that the result is contained in the address space: jmp 24 0 jmp 28 0 jmp 24 0 base ≤ addr < base + limit Memory Program A Program B This provides both protection and relocation without incurring a loading cost. Limitation Each access is costly because of the addition and comparison Operating Systems and System So ware 108 Handling Limited Memory We can now have multiple processes stored in memory at the same time, but physical memory is not unlimited! Modern systems run hundreds of processes concurrently. As of writing these slides: All these processes may not all fit in the memory at the same time ps aux | wc -l 477 Two main techniques to solve this issue: 1. Swapping 2. Virtual memory Both rely on exploiting the memory hierarchy of computers: Fast memory: caches, RAM Fast but small and expensive Slow memory: storage disks (HDD, SSD, etc.) Slow but large and cheap The general idea is to keep recently/frequently used processes/memory in RAM and move unused processes/memory to storage devices. Operating Systems and System So ware 109 Swapping We actually saw this mechanism earlier with multiprocessing with direct physical mapping. 0xFFFFFFFF User program B Definition RAM Swapping is a technique that consists in moving memory between RAM and storage devices. Moving (usually unused) memory from memory to storage is called swapping out, A C while moving memory back from disk into memory when it is needed is called swapping in. Storage device E F D Size of the swap on the storage device Usually, the OS allocates a predefined area on the storage device for swapping memory. This could be a partition (static, fixed size) or a swap file (dynamic, may be created/ Operating system destroyed/grow on-the-fly). RAM 0x00000000 Let’s see how this applies to systems with proper relocation! Operating Systems and System So ware 110 Swapping with Dynamic Relocation 1. Initial state: We have processes B, F and C in memory 2. B blocks for a long time and is swapped out to disk to free up memory 3. D is scheduled, so it it swapped into memory 0xFFFFFFFF 4. F blocks for a long time and is swapped out to disk to free up memory C 5. B is ready again and swapped back into memory 6. E becomes ready, but there is not enough space in memory 7. The OS decides to evict B from memory to free enough space for E The choice of which memory to swap out depends on the eviction strategy. We will see that later. E 8. E is swapped into memory D Operating system kernel Combining dynamic relocation with swapping and eviction allows to fit multiple processes 0x00000000 in memory simultaneously while preserving the protection and relocation properties! Operating Systems and System So ware 111 Memory Fragmentation Definition Fragmentation is a phenomenon that happens on memory or storage devices when free space is spread into multiple small blocks, preventing new allocations of blocks larger than these free blocks, even if there is enough total free memory available. 1. A er multiple swaps in/out, we have fragmented memory 0xFFFFFFFF Here, we have 6 small free blocks 2. A becomes ready and must be swapped into memory A 3. There is enough free space overall, but no free chunk is large enough for A! D 4. The OS can perform compaction, or defragmentation, C to merge the small free chunks into one large enough for A F 5. Now, A has a large enough free chunk of memory 6. A is swapped into memory E B Operating system kernel 0x00000000 Operating Systems and System So ware 112 How Much Does This Cost? Swapping requires a large amount of copies between storage devices and memory. Compaction requires large amount of copies between different memory locations. As a rule of thumb:1 Reading 1 MiB of memory ≈ 10.000 ns = 0.010 ms Reading 1 MiB from an SSD ≈ 1.000.000 ns = 1 ms Reading 1 MiB from an HDD ≈ 5.000.000 ns = 5 ms If your scheduler has a quantum of 4 ms, you cannot swap your process in/out frequently! Compaction takes a long time, and requires to pause moving processes! Operating Systems and System So ware 113 Address Space Size Up until now, we haven’t discussed how large an address space should be. Is the address space size fixed,.i.e., decided upon creation? Can it change over time? Fixed-size address spaces: If the size of the address space is fixed, it is easy to manage by the OS. Whenever a process is created or swapped in, a contiguous chunk of memory of the size of the process’ address space is allocated. Note that eviction or compaction may happen to make this possible. Variable-size address space: If the address space can grow, e.g., because of too many allocations on the stack/heap, the OS must find enough space to allow this. Operating Systems and System So ware 114 Variable-size Address Spaces Depending on the memory layout of your address spaces, they could grow in different directions. Classic UNIX layout Limit Limit Stack and heap grow towards each other Stack When they touch each other, the process is Out-of-Memory (OoM): environment, local variables it can be killed or grow. Heap dynamically allocated memory Growing requires to: Move the limit register to increase the size of the address space Heap Stack dynamically allocated memory environment, local variables Move the stack up to the limit to allow stack and heap to grow again BSS BSS uninitialized data uninitialized data Fixed-stack layout Data Data initialized static variables initialized static variables Stack has a fixed maximum size, and heap grows towards the limit Text Text When the stack reaches the limit, the process is OoM program code program code Base Base Growing requires to move the limit register (a) UNIX layout (b) Fixed-stack layout This is from the address space perspective! The OS may still need to move/evict/compact memory to find a free chunk large enough! Operating Systems and System So ware 115 Free Memory Management In order to know if an address space fits in memory, or if eviction/compaction will provide enough free memory, the OS needs to keep track of all the free memory in the system There are two main techniques to track free memory: Bitmaps Free lists These techniques are usually implemented within memory allocators (which we will see later) We will discuss them briefly now as they are quite general techniques. Tip Note that this also applies to storage systems. Operating Systems and System So ware 116 Free Memory Management: Bitmaps The memory is split into small chunks, e.g., 8 bytes. Each bit in the bitmap represents the state of one chunk: Bit 0: chunk 0, bit 1: chunk 1, etc. 72 The state of a chunk can have two values: 64 0 means that the chunk is free 1 means that the chunk is in use 56 48 When the OS needs to allocate memory, it will iterate over the bitmap until it finds a large enough contiguous free space, i.e., enough consecutive 0s 40 There are different allocation strategies: first fit, next fit, best fit, etc. 32 24 Design issues: 16 Smaller chunks mean finer granularity ⟹ less waste But also a larger bitmap 8 (bitmapsize = memorysize /chunksize ) 0 Larger bitmaps mean long search times Operating Systems and System So ware 117 Free Memory Management: Free Lists The memory is split into small chunks, e.g., 8 bytes. The OS maintains a sorted list of allocations and free memory chunks. Each element of the list contains: State: U: used, or F: free 72 F | 72 | 8 Address: the starting address of the chunk Size: the size of the chunk 64 U | 64 | 8 56 There are variations of this structure, e.g., different lists for allocations and free chunks, more efficient data structures, etc., but the general idea 48 F | 48 | 16 is more or less the same. 40 U | 24 | 24 32 When the OS needs to allocate memory, it iterates on this list searching for a large enough free chunk according to an allocation strategy F | 16 | 8 24 When an allocated chunk is freed, the list is updated to merge the adjacent elements (if they are free) 16 U | 0 | 16 8 alloc/free list 0 Operating Systems and System So ware 118 Allocation Strategies First fit: The allocator returns the first free chunk where the allocation fits. The free chunk is split into two smaller chunks: the newly allocated one, and a free one with the remaining bytes of the chosen free chunk. Next fit: Same as first fit, except that the search starts from where the last search stopped (instead of from the beginning). This is supposed to avoid iterating over a lot of allocated chunks in the beginning of the list (they should statistically be more present there). In practice, it has been shown that this performs worse than first fit (because of frequent allocations and free, fragmentation creates free space in the beginning of the memory too). Best fit: Search for the smallest free chunk where the allocation fits. The rationale is to avoid breaking up large free chunks for small allocations, thus avoiding waste. In practice, this creates more fragmentation with lots of small holes compared to first fit. It is also slow since it requires iterating over the whole list. Worst fit: Exact opposite of best fit: always allocate in the largest available hole. The idea is to avoid the small hole fragmentation of best fit. All strategies can be improved with various optimizations: In practice, it doesn’t really work well either. Keeping a separate list for free chunks speeds up search but makes list updates costly Sorting the list to avoid long searches, at the cost of expensive insertions Storing the free list directly in the free chunks themselves Operating Systems and System So ware 119 Virtual Memory Operating Systems and System So ware 120 Limitations of Dynamic Relocation When using dynamic relocation as previously defined: Address spaces are not flexible. They can grow, but they are still a single piece of contiguous memory. Fragmentation cannot be avoided, only mitigated. Compaction is also costly as entire address spaces must be moved in memory. Swapping is extremely costly. Address spaces must be copied entirely to/from storage devices. Address spaces cannot be larger than physical memory. Processes using more need to explicitly mange their data between memory and storage. Operating Systems and System So ware 121 The Virtual Memory Abstraction Virtual memory was first proposed in 1961-62 for the Atlas supercomputer (University of Manchester). The core ideas are the following: Each program has its own address space, which size’s does not depend on the size of physical memory Address spaces are split into small chunks called pages A page is a contiguous range of memory addresses Address spaces are addressed with virtual addresses that may be mapped anywhere in physical memory An address space can be partially mapped into memory i.e., some pages are in memory while others are swapped out in storage When a process uses a virtual address that is not mapped into physical memory, the OS swaps in the necessary data before letting the process continue its execution With some hardware support, this efficiently provides protection and relocation! Operating Systems and System So ware 122 Paging Most systems implement virtual memory with paging. The addresses manipulated by programs are virtual addresses. In our previous examples, virtual addresses were directly used to address physical memory. With paging, a Memory Management Unit (MMU) sits between the CPU and the bus to translate virtual addresses into physical addresses. CPU CPU Memory vaddr Memory MMU paddr paddr paddr paddr Bus Bus (a) Without virtual memory (b) With virtual memory Operating Systems and System So ware 123 Paging (2) Definitions Example: A page is a fixed-size unit of virtual memory. Address space size: 64 KiB Physical memory: 32 KiB A page frame is a fixed-size unit of physical memory. Page size: 4 KiB Number of pages: 64 KiB / 4 KiB = 16 Number of page frames: 32 KiB / 4 KiB = 8 Pages are mapped by the operating system into a page frame of the same size. When the CPU tries to access a page, the MMU translates the virtual address into the corresponding physical address, i.e., page → page frame 65 536 Page 61 440 Page frame If the page is not present in physical memory: 57 344 53 248 1. A page frame is reclaimed, i.e., a page is swapped out 49 152 2. The requested page is mapped into the newly available page frame 45 056 i.e., the page is swapped in 40 960 36 864 32 768 32 768 28 672 28 672 24 576 24 576 20 480 20 480 16 384 16 384 12 288 12 288 8192 8192 4096 4096 Note that we need to keep track of which pages are absent from physical 0 0 memory to swap them in when needed. Operating Systems and System So ware 124 Memory Management Unit The Memory Management Unit (MMU) is a hardware component that transparently performs address translations. It contains a table with the translations from virtual addresses to physical ones: the page table. Let’s see an example with 64 KiB address spaces (virtual addresses are 16 bits long), 32 KiB physical memory (physical addresses are 15 bits long) and 4 KiB pages: Mapped page Unmapped page 1. The CPU issues a memory access to a virtual address, 41 772 MMU 011 0011 0010 1100 Outgoing physical 2. The address goes through the MMU that splits it into two parts: address: 13 100 4 Page number: we have 16 pages = 2 , so the 4 most significant bits identify 15 000 the page. This will be used for the translation. 14 000 13 000 Offset: the remaining 12 bits identify the memory location in the page. This 12 111 will be kept as is in the physical address. 11 000 10 011 Page frame 3. The page number is used as an index in the page table to find the 9 101 number associated page frame number. 8 000 7 000 4. The page frame number will become the most significant bits of the 6 000 physical address. 5 000 4 000 5. The offset is propagated as is in the physical address. 3 110 2 001 6. The physical address, 13 100, is used to issue the memory access in 1 100 memory through the bus. 0 010 Offset 1010 0011 0010 1100 Incoming virtual Operating Systems and System So ware Page number address: 41 772 125 MMU and Unmapped Pages In our example, physical memory is smaller than the address space, which means not all pages can be mapped into page frames simultaneously! Thus, the page table must also know if a page is mapped in memory or not. Let’s take our previous example again (with a slightly modified page table): 1. The CPU wants to access the virtual address 41 772, which has the page number 10. Mapped page Unmapped page 2. Page 10 is not mapped in memory, thus triggering a page fault. MMU 011 0011 0010 1100 Outgoing physical The OS must map page 10 into a page frame. address: 13 100 3. The OS decides to evict page 11 (swap out) to make space in physical memory. 15 000 14 000 4. The OS then allocates page 10 in the newly freed page frame. 13 000 12 111 The page fault is now resolved. 11 000 011 10 000 011 5. Translation continues normally by concatenating the page frame number Page frame 9 101 number and the offset into the physical address 13 100. 8 000 7 000 6 000 5 000 4 000 3 110 2 001 1 100 0 010 Offset 1010 0011 0010 1100 Incoming virtual Operating Systems and System So ware Page number address: 41 772 126 Page Table Entries The page table contains the translations between virtual and physical addresses as well as some metadata regarding the page. Page table entries (PTE) are architecture-dependent, but usually contain the following information: Present Protection Supervisor Page frame number (PFN) Page frame number: the page frame corresponding to this page, i.e., the actual translation Referenced Dirty Uncached Protection bits: the access permissions of this page (read/write/execute). Can be 1 bit (read/write or read-only) or 3 bits (read, write, execute). Modified/dirty bit: set to 1 if the page has been modified a er being swapped in. This is needed when the OS swaps out a page: if the page is not dirty, the OS does not need to copy it into storage. Referenced bit: set to 1 when the page is read from/written to. This is used for page reclamation algorithms to choose which page to evict. Supervisor bit: set to 1 if the page is only accessible in supervisor mode, i.e., by the kernel. A user process accessing a page with this bit set will trigger a fault. Present bit: set to 1 if the page is mapped onto a page frame, 0 if it is only on storage. Accessing a page with this bit set to 0 will trigger a page fault to allow the OS to swap the page in. Uncached bit: set to 1 if the page must not be cached in the CPU caches and always be accessed from/to memory. Used for pages mapping physical devices to avoid using outdated cached values. Operating Systems and System So ware 127 Mapping Address Spaces to Physical Memory Let’s see how paging works with multiple address spaces. We have our physical memory split into page frames. Process A has its pages mapped into page frames, in a location decided by the allocation strategy. Same for processes B, C and D. Kernel memory Virtual address spaces are spread into physical page frames, they don’t need to be fully contiguous: ⟹ Fragmentation is not a problem anymore Virtual address spaces can be swapped in and out at page granularity, transparently and on-demand: ⟹ Swapping is much more efficient Kernel memory is shared for all address spaces, so it is usually mapped in all of them at the same place, with the supervisor bit set Process A Process B to 1 in the PTEs. If the kernel had its own page table, every system call would require switching to it and back to the user page table. Note that is is not completely true anymore to mitigate side- channel attacks such as Meltdown. Process C Operating Systems and System So ware 128 Optimizing Paging With this basic view of paging, we may face some major problems: The address space can be very large as it does not depend on the size of the physical memory. Modern machines use 32-bit or 64-bit address spaces, 20 52 We need an efficient way of storing page tables which means 2 (1 Mi) or 2 (1 Mi Gi) PTEs respectively, for each process. This is too large to fit in memory. Every memory access triggers a translation from virtual to physical address. This means that for every memory access, we need to look up the page We need mechanisms to make translation fast frame number corresponding to the accessed page and build the physical address before actually performing the access. Operating Systems and System So ware 129 Page Table Layout The size of the page table grows with the size of virtual address spaces. 32 On 32-bit machines, address spaces can grow up to 4 GiB ( 2 ). With 4 KiB pages, we can have as much as 4 GiB / 4 KiB = 1 048 576 pages (1 Mi). And this is replicated for each address space, i.e., for each process. Since the page table stores one page table entry (PTE) per page, let’s see how large the page table is for a 32-bit machine. Each PTE stores: Page frame number: assuming machines with 4 GiB of physical memory, this would be 20 bits Metadata, e.g., present, dirty, etc.: 12 bits on x86 Thus, the page table would be 32 Mib = 4 MiB for each process. This is reasonable for machines with multiple GiB of memory. With 64-bit machines, this becomes impossible. Even though x86-64 processors use only 48-bit physical addresses, there would be 64 Bi PTEs with 4 KiB pages, each entry being 64-bit long. Each process would need a page table of hundreds of GiB of memory, which is obviously not reasonable. Operating Systems and System So ware 130 Multilevel Page Table One solution is to split page tables into smaller pieces, in a hierarchical fashion, and only allocate the parts that are actually allocated for the process, i.e., processes do not use 100% of their address space at all times, so we don’t need the entire page table to exist in memory. Second level page tables 1023 Let’s take 32 bit address spaces (4 GiB) with 4 KiB pages as an example. Since we use 4 KiB pages, let’s split our page table into chunks of the same size... (this will also simplify memory management of the page table). AS 232 20 5 Total number of PTEs: P T Enr = P = 12 = 2 4 3 2 2 P212 10 1 PTEs per page: P T Epage = = 4 = 2 = 1024 E First level page table 0 PTEnr 220 10 Number of page tables: P Tnr = PTE = 10 = 2 = 1024 1023 1023 page 2...... We can use a two-level page table scheme with: One first level page table where each valid entry points to a page table managing 4 MiB of memory 5 5 4 4 3 3 1024 second level page tables where each entry is a PTE pointing to a page frame 2 2 1 1 0 0 1023 Cheat sheet:... 32 Address space size: AS = 4 GiB =2 B 12 Page size: P = 4 KiB = 2 B 5 4 PTE size: E = 4 B Operating Systems and System So ware 3 131 Two-Level Page Table Second level page tables Let’s see how a virtual to physical address translation works with this type of page table. 1023 Assume this process using only 12 MiB of address space (3 second level page tables = 3x 4 MiB). Note that we only need 4 pages (16 KiB) for page tables instead of 4 MiB.... 1. When an virtual address vaddr is used by the CPU, it goes through the MMU like previously Incoming vaddr = 0x4041FB 5 4 0000 0000 0100 0000 0100 0001 1111 1011 3 2. The MMU splits the virtual address int three parts: PT1 PT2 Offset 2 1 0 PT1: 10 bits for the index in the first level page table First level page table 1023 1023 PT2:: 10 bits for the index in the level 2 page table Offset:: 12 bits for the offset in the page...... 3. PT1 is used as an index in the first level page table to find the address of the corresponding second level page table: the Page Table Address (PTA) 5 5 4 4 PFN + metadata 3 3 Page frame 2 2 number 4. PT2 is used as an index in the second level page table to 1 PTA 1 0 0 find the page frame number 1023 5. The physical address paddr is built by concatenating the page frame number with the offset.... 1110 0010 1000 0010 0000 0001 1111 1011 Outgoing paddr: 0xE28201FB This lookup process is called a page walk. 5 4 Operating Systems and System So ware 3 2 132 Multilevel Page Table (4-level) For 64-bit architectures, the same concept can be extended by using more page table levels. For example, the x86-64 architecture uses the following a 4-level page table layout: Note 1 x86-64 uses 48-bit virtual addresses (256 TiB address spaces), not all 64 bits of the address. Virtual address 16 bits 9 bits 9 bits 9 bits 9 bits 12 bits unused PML4 PDP PD PT Offset Note 2 PML4 Page Directory Pointer Table Page Directory Page Table x86-64 has an extension to add a fi h page PD entry table level, using 9 bits from the unused part of the virtual address............. PDP entry PTE PML4 entry Note 3 The reserved bits in the physical address are used for other features such as protection keys. x86-64 only supports 50 usable bits in the physical address, thus supporting at most reserved Page frame number Offset 1 PiB = 1024 TiB. 14 bits 38 bits 12 bits Operating Systems and System So ware 133 Page Tables and Context Switching If we go back to our Process Control Block from previous lectures Runtime Memory Files Process ID (PID) Address space Current working directory (CWD) Segments Parent Process ID (PPID) Page table User ID (UID) Memory mappings Registers Group ID (GID) Page table pointer (CR3) File descriptors Thread table[ ] Thread ID Registers Instruction pointer Stack pointer General purpose Signals Scheduler metadata Used CPU time Time of next alarm During a context switch between processes, the OS just needs to replace the page table pointer register to switch to the new address space. The MMU then automatically uses the value in this register to locate the first level of the page table. Operating Systems and System So ware 134 Cost of a Page Table Walk Whenever a virtual address needs to be translated into a physical address (potentially multiple times per instruction), the MMU must perform a page walk to recover the page frame number. Virtual address 16 bits 9 bits 9 bits 9 bits 9 bits 12 bits unused PML4 PDP PD PT Offset PML4 Page Directory Pointer Table Page Directory Page Table PD entry............ PDP entry PTE PML4 entry reserved Page frame number Offset 14 bits 38 bits 12 bits Physical address With a 4-level page table, a page walk requires four memory accesses, oneandtoSystem Operating Systems read the entry in each level of the page table. So ware 135 Optimizing Paging Let’s go back to our list of optimizations: The address space can be very large as it does not depend on the size of the physical memory. Modern machines use 32-bit or 64-bit address spaces, 20 52 Multilevel page tables which means 2 (1 Mi) or 2 (1 Mi Gi) PTEs respectively, for each process. This is too large to fit in memory. Every memory access triggers a translation from virtual to physical address. This means that for every memory access, we need to look up the page frame number corresponding to the accessed page and build the physical We need mechanisms to make translation fast even more! address before actually performing the access. With multilevel paging, it can take four memory accesses! Operating Systems and System So ware 136 Translation Lookaside Buffer Observation Most programs frequently reference the same addresses. With that in mind, the obvious solution to our translation speed problem is to use caching! The MMU usually contains a small associative cache called the Translation Lookaside Buffer (TLB). It stores, for each entry: Page Page frame Protection Valid Modified The mapping from a virtual page number to the corresponding physical page frame number 489 741 RW 1 1 Valid bit: set to 1 if the entry in the TLB is valid 965 282 R X 1 0 61 645 RW 1 1 Metadata from the page table entry: 579 412 RW 1 0 ▪ Protection bits from the page table entry (read/write/execute) 39 410 R X 1 0 ▪ Modified bit from the page table 410 355 R X 1 0 61 832 RW 1 1 14 526 RW 1 0 Note TLBs are small and can keep only ~256 entries in modern machines. Operating Systems and System So ware 137 TLB Workflow When a virtual address vaddr goes through the MMU, it follows this path: Return physical vaddr Page in the TLB? Valid entry? Check protection? paddr hit yes ok address miss ko no Page walk Segmentation fault Page mapped? Save in TLB yes no Page fault Swap page in Operating Systems and System So ware 138 TLB Management Most of the TLB management is performed by the MMU: Page Page frame Protection Valid Modified If the TLB is full when a new entry must be saved, the MMU will evict an entry 489 741 RW 1 1 When a TLB entry is loaded from the page table, the necessary metadata are copied into the TLB entry 965 282 R X 1 0 When a TLB entry is evicted, the MMU writes back the metadata in the TLB into the page table entry 61 645 RW 1 1 579 412 RW 1 0 39 410 R X 1 0 Some operations must still be triggered by the OS: 410 355 R X 1 0 61 832 RW 1 1 When a context switch occurs, the TLB must be flushed to avoid wrong translations, 14 526 RW 1 0 e.g., processes A and B might both use virtual page X, but they point to different page frames. When updating a page table entry, the OS needs to invalidate the related TLB entries to force a synchronization with the page table in memory, e.g., page 61 is swapped out, making the translation in the TLB invalid. Note Modern TLBs also store a Process Context IDentifier (PCID) that records the process owning a translation. There is therefore no need to flush the TLB during context switches. Operating Systems and System So ware 139 Reducing the Translation Overhead With 4 KiB pages, we need one translation per 4 KiB of memory. If a process actively uses GiB of memory, a large number of translations must be kept in the TLB. However, the TLB is: Small, i.e., ~256 entries pointing to 1 MiB of memory in total Shared by all processes (with PCID), i.e., processes thrash each other’s translations These processes are not TLB friendly (for them and other processes)! Observation If a program uses multiple GiB of memory, there is a good chance it uses large contiguous memory blocks, in the order of MiB or GiB. ⟹ It would benefit from having larger pages. This is surprisingly easy to do with multilevel page tables! Operating Systems and System So ware 140 Large and Huge Pages 64-bit systems have a 4-level page table The translation process goes through all levels to reach the PTE corresponding to a 4 KiB page. If we bypass the last level, we can index a large page of 2 MiB! If we bypass the last two levels, we can index a huge page of 1 GiB! PML4 Page Directory Pointer Table Page Directory Page Table............ Operating Systems and System So ware 141 Fork and Copy-on-Write When a fork() happens, we must copy the address space of the parent to create the child process. With virtual memory and paging, this just means creating a copy of the page table! As long as both processes only read the mapped pages, they can share the page frames! If one process writes to a page, the page is copied and remapped: this is called copy-on-write. copy write Process A Process B fork() itself is nearly free (no full address space copy). The cost of copy is paid on-demand. Physical memory Operating Systems and System So ware 142 Virtual Memory Is Not Necessarily Paging 16416 cmp 16412 Up until now, we only describe virtual memory through the paging mechanism. 16408 16404 jmp 16384 + 28 However, dynamic relocation was already a kind of (very limited) virtual memory. 16396 16392 16388 jmp 28 16384 Base: 0 Base: 16384... Limit: 16384 Limit: 16384 32 32 32 add 28 cmp 28 add 28 mov 24 24 mov 24 A general version of dynamic relocation, segmentation can be used as an alternative to paging. 16 12 16 12 16 12 8 8 8 4 4 4 jmp 24 0 jmp 28 0 jmp 24 0 Program A Program B Memory Unfortunately, support for segmentation has been dropped in modern architectures, in favor of paging. The concept, though, is still applicable in paging. Let’s see how segmentation works! Operating Systems and System So ware 143 Segmentation Instead of providing a single virtual address space per process, segmentation offers many independent virtual address spaces called segments. Each segment starts at address 0 and has a length that can change over time. The OS maintains a segment table for each process that stores segment descriptors: Base address: the physical address where the segment is located Length: the length of the segment Present Protection Metadata (non-exhaustive list) Base address Length ▪ Present bit: is the segment mapped in memory? ▪ Supervisor bit: is the segment only accessible in supervisor mode? Dirty Supervisor ▪ Protection bits: is the segment read-only, read-write, executable? Example of a segment descriptor Operating Systems and System So ware 144 Segmentation (2) Programs can split their memory into segments as they wish. 16 Ki For example, one could use four segments as follows: A text segment of size 4 KiB 12 Ki A stack segment of size 4 KiB 8 Ki Data 8 Ki A data segment of size 16 KiB A heap segment of size 8 KiB 4 Ki 4 Ki 4 Ki 4 Ki Heap Text Stack 0 0 0 0 Segment 1 Segment 9 Segment 10 Segment 12 Note This is usually done transparently by the compiler, not explicitly by the programmer. Most languages assume a flat address space, i.e., a single address space per process. The OS creates an entry per segment in the process’ segment table describing the translation to physical addresses. Operating Systems and System So ware 145 Segmentation: Addressing In a segmented system, addressing is done by supplying a two-part address that contains a segment number and the address within the segment. Similar to paging systems, the MMU performs the virtual to physical address translation using the process’ segment table: 1. The incoming virtual address is split into two parts: Segment number MMU 0001 0011 0010 1100 Outgoing physical Address within the segment address: 0x132C Limit Base 15 000 0 000 0x0 2. The MMU looks up the segment information in the segment table, 14 000 0 000 0x0 using the segment number as an index 13 000 0 0x0 000 Segment CHECK limit 12 8 KiB 0x6000 3. The physical address is built by adding the base address of the segment 11 0 011 000 0x0 011 000 and the address from the virtual address 10 16 KiB 0x1000 ADD Segment base 9 4 KiB 0x5000 4. The MMU checks that the address does not exceed the limit of the segment 8 000 0 000 7 000 0 000 5. The MMU uses the physical address to perform the memory access 6 0 0x0 000 5 000