Summary

This document contains lecture slides covering virtualization, cloud, and storage, and includes relevant quotations from researchers.

Full Transcript

Virtualization
Virtualization, Cloud, and Storage
Thomas Fischer, https://www.his.se/fish
Autumn/Winter 2024

Table of Contents
History
Concepts
Full x86 Virtualization
Operating System Virtualization
Real-World Examples
Exam Reading Instructions

Quotations fitting Virtualization (1)
"All problems in computer science can be solved by another level of indirection." (David Wheeler, †2004)
"... except for the problem of too many layers of indirection." (Kevlin Henney)

Quotations fitting Virtualization (2)
"Virtualization essentially introduces a level of indirection to a system to decouple applications from the underlying host system." (Laadan and Nieh, 2010)

Quotations fitting Virtualization (3)
"x86 virtualization is about basically placing another nearly full kernel, full of new bugs, on top of a nasty x86 architecture which barely has correct page protection. Then running your operating system on the other side of this brand new pile of shit. You are absolutely deluded, if not stupid, if you think that a worldwide collection of software engineers who can't write operating systems or applications without security holes, can then turn around and suddenly write virtualization layers without security holes." (Theo de Raadt about virtualization)

History

A Short and Biased History (1)
Early computers: only one program and only one user. The program ran directly on the hardware; later, an operating system sat between program and hardware.
"These programs, hand crafted for the slow and costly hardware, required the entire machine. The users were normally present to make sure things were going well and, if not, reacted in real time to correct things or to collect information useful for diagnosis. The running time of a program was quite long compared to these user actions and to the time required to set up the machine for the next problem." (Creasy, 1981)
"Machines continued to grow in capability and speed and to decline in cost per computing unit. With larger memories and independent I/O operation came the possibility of more efficient machine utilization. A portion of the machine could be dedicated to the programs which assisted in machine operation, the operating system." (Creasy, 1981)

A Short and Biased History (2)
Research questions at this time (mid-1960s ± 10 years):
- How to run multiple programs at the same time?
- How to allow multiple users to use a machine at the same time? (users think they have exclusive access to the computer)
[Diagram: processes P1 ... Pn on one operating system on hardware (very complex operating system, very simple hardware) versus operating systems OpSys1 ... OpSysn on a virtualization layer on hardware (somewhat complex)]

Concepts

What is Virtualization?
Virtualization means recreating a (potentially) existing, real entity:
(a) The virtualized entity behaves like the real entity
(b) The virtualized entity does not exist as such
What can be virtualized? (selection)
- Whole computers (most relevant for this lecture)
- A computer's main memory (all modern operating systems do this)
- Instruction sets (e.g. Java, CLR, Python's byte code)
- Storage (SAN, Storage Area Networks)
Why Computer Virtualization?
Server consolidation (cost savings expected)
- Under-utilization: even cheap servers are too 'powerful'
- Multiple servers get virtualized and combined on one physical machine
- Less hardware, less energy consumption, less space,...
- New server instances can be easily created
Flexibility
- Move server instances between physical machines
- Better load balancing, efficient use of available resources (hardware)
Reliability and availability
- Each service can be put into its own virtual server
- Whole servers can be saved (snapshots), moved, and restored
Testing and debugging
- Software can get developed and tested in a known environment (recreating a server-like environment on a developer's desktop)
- Software can get inspected in a restricted environment (relevant for security researchers)

Popek and Goldberg (1974)
Virtual Machine, strict definition: an isolated duplicate (with certain limitations) of the real machine it is running on.
A Virtual Machine Monitor (VMM) is a software realizing VMs, called hypervisor nowadays.
1. Programs perform in the VM indistinguishably from real hardware execution (exception: timing and resource constraints). Equivalence property, a.k.a. fidelity.
2. The VMM should interfere as little as possible. Example: virtualized instructions are executed directly on the real processor. Efficiency property, a.k.a. performance.
3. The VMM is in complete control of system resources: (1) no VM can access hardware without the VMM's approval, (2) the VMM can preempt resources previously allocated to VMs. Resource control property, a.k.a. safety ('safety' ≠ 'security').

Categorizing Virtualization (1)
Type-1, a.k.a. native or bare metal
- VMM resides directly 'on metal', i.e. has full, direct access to hardware
- VMM has responsibilities similar to an operating system (hardware access, resource allocation,...)
- Example: VMware ESXi, Hyper-V
- Avoids the overhead of a host operating system / direct hardware access
[Diagram: Guest1 ... Guestn on the VMM on hardware]

Categorizing Virtualization (2)
Type-2, a.k.a. hosted
- VMM runs on top of or inside an existing operating system
- Examples: VirtualBox, KVM, VMware Workstation
- Machine can be used for something else, too / closely tied to the operating system (kernel modules,...)
[Diagram: Guest1 ... Guestn on the VMM, next to e.g. sshd, on the operating system on hardware]
The difference between type-1 and type-2 is fuzzy, as a 'type-1 hypervisor' may be realized as a stripped-down, locked-down regular operating system with some type-2 hypervisor code.

Levels of Virtualization (1)
Full Virtualization
- Guest does not realize it is virtualized (clever guests can still guess it from how the 'hardware' looks; see the sketch after the next slide)
- CPU instructions are run on real hardware to the largest possible extent (performance)
- Access to hardware (memory, NIC, storage,...) is controlled by the hypervisor
- Guarantees virtualization properties (isolation,...)
- Example: VMware ESXi

Levels of Virtualization (2)
Full Virtualization (continued)
- Pure computation performance close to a non-virtualized setup (main memory access and disk or network I/O may be slower)
- No modification necessary in the guest
- Guest has to support/run on the available hardware (CPU,...)
- Guest gets interrupted at each sensitive hardware access; the virtualization monitor validates the request and handles the hardware component (performance penalty)
- Hardware support required for virtualization (most server-grade hardware provides this type of support, though)
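As an aside to the 'clever guests can still guess it' remark above: one well-known hint is the CPUID 'hypervisor present' bit. The following is a minimal sketch, not part of the original slides; it assumes x86 with GCC/Clang and their <cpuid.h> helper, and relies on the convention that bit 31 of ECX for leaf 1 is zero on bare metal and set by hypervisors (most, but not all, follow it).

    #include <stdio.h>
    #include <cpuid.h>   /* GCC/Clang helper header, x86 only */

    int main(void) {
        unsigned int eax, ebx, ecx, edx;
        __cpuid(1, eax, ebx, ecx, edx);
        /* ECX bit 31 is reserved as 0 on real hardware; hypervisors set it */
        if (ecx & (1u << 31)) {
            unsigned int id[4] = {0};
            /* leaf 0x40000000 holds a 12-byte vendor string in EBX:ECX:EDX,
               e.g. "KVMKVMKVM" or "VMwareVMware" */
            __cpuid(0x40000000, eax, id[0], id[1], id[2]);
            printf("hypervisor detected: %.12s\n", (char *)id);
        } else {
            printf("no hypervisor bit set\n");
        }
        return 0;
    }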
Levels of Virtualization (3)
Para-Virtualization
Like full virtualization, but...
- Special device drivers or kernel modifications allow the guest kernel to access the virtualization monitor's functions directly ('hyper calls', in correspondence to system calls)
- Guest is aware of virtualization
- Example: VirtualBox
- More efficient than full virtualization
- No hardware support for virtualization necessary if all critical system calls got replaced by hyper calls
- Necessary customization may not be possible for certain guest operating systems
- Guest has to support/run on the available hardware (CPU,...)

Levels of Virtualization (4)
Operating System Virtualization
- Same kernel instance for both host and guest systems, but different 'user land' (libraries, applications)
- Kernel manages access to hardware for both host and guest (that is what kernels do)
- Examples: OpenVZ, BSD Jails, Solaris Containers, LXC, systemd-nspawn, Docker, Podman, basically everything with 'containers'
- Low overhead for the actual virtualization
- Host and guest use the same kernel
- Kernel has to be modified to support this level of virtualization (modern server-grade operating system kernels support this)
- Isolation between guests and towards the host is not as 'strong' as with other virtualization techniques (less secure)
- Guest has to support/run on the available hardware (CPU,...)

Levels of Virtualization (5)
Application-level Virtualization
- VM runs like a normal process on the host
- If the application's code does not match the host's ABI or ISA: (a) the VM interprets the application (concept is close to emulation), or (b) the VM translates the application's code to the local system's equivalent before/during execution
- Also known as 'process virtualization' or 'language VM'
- Examples: Java's VM, .NET's CLR, FX!32
- Instead of virtualizing a full machine, only the parts necessary to run applications are virtualized
- Tweaks allow almost the same performance as native applications
- VM has to be ported to each host / VM provides a common denominator for the supported platforms

Levels of Virtualization (6)
Emulation
Idea: emulate a complete set of hardware including the CPU
- Software interprets every single hardware instruction
- Guest inside the emulated environment never sees any real hardware, may not even realize that it is virtualized
- Examples: DOSBox, Bochs; not WINE
- Any hardware can be emulated even if not supported by the native hardware, e.g. an x86 PC on a Solaris SPARC machine
- Able to cover no longer existing or future designs, e.g. to play Gameboy games from the 1990s
- Slooow

Full x86 Virtualization

Rings on x86 Architectures
Ring 0: operating system (supervisor mode); rings 1 and 2: rarely used; ring 3: user applications (user mode).
Each ring is a permission state in which the CPU operates during the execution of instructions. Higher-numbered rings have restrictions on which CPU instructions may get executed and which memory segments or pages may get accessed.
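A minimal sketch of this ring mechanism, not from the original slides: the privileged CLI instruction ('clear interrupt flag') executed in ring 3 raises a general protection fault, which a Linux kernel delivers to the process as SIGSEGV. Assumes x86 Linux and GCC/Clang inline assembly.

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static void on_fault(int sig) {
        /* the CPU raised #GP; the kernel turned it into SIGSEGV */
        static const char msg[] = "CLI refused in ring 3\n";
        (void)sig;
        write(STDOUT_FILENO, msg, sizeof msg - 1);
        _exit(0);
    }

    int main(void) {
        signal(SIGSEGV, on_fault);
        puts("executing privileged CLI from user mode...");
        __asm__ volatile("cli");   /* ring 0 only (with the default IOPL of 0) */
        puts("never reached");
        return 0;
    }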
x86 Instructions (1)
The x86 architecture has 250+ instructions (depending on how you count: different manufacturers, different models, variants including SSE, the mathematical co-processor,...).
Some CPU instructions are privileged:
- Includes memory access, I/O access, CPU state changes,...
- Excludes normal logical or mathematical operations,...
Only code executed in ring 0 may invoke all privileged instructions.
Attempting to execute privileged instructions while in ring 3?
1. The CPU will invoke a handler/dispatcher, i.e. switch to ring 0 and execute code at a memory address pre-configured by the kernel during boot time
2. Kernel code will investigate the situation and may... (a) reject it (ignore it, return an error code, kill the process,...), or (b) perform the operation on behalf of the user code, optionally after modifying it
Still not talking about virtualization here...

x86 Virtualization (1)
The relation between operating system and VMM is similar to that between user process and operating system:
(a) Hardware access needs to be controlled, filtered, or redirected
(b) Error conditions and invalid behavior need to be handled
Challenges arising here:
1. The virtualized operating system may not tamper with the VMM; the VMM must handle all exceptions caused by attempts to execute privileged instructions
2. How does this map to the ring architecture? Will be investigated soon

x86 Virtualization (2)
Sensitive instructions allow interfering with the VMM. If executed in a VM, they must be intercepted and handled by the VMM.
Types of sensitivity (according to Popek and Goldberg, 1974). An instruction is...
- control sensitive 'if it attempts to change the amount of (memory) resources available, or affects the processor mode [...]' (the name 'control' comes from the requirement that the VMM needs to stay in control of resources)
- behavior sensitive 'if the effect of its execution depends on the value of the relocation-bounds register, i.e. upon its location in real memory, or on the mode'. Example: instructions using some 'base register' to access memory.

x86 Virtualization (3)
Sensitive instructions vs. privileged instructions vs. all instructions:
"When executing in a virtual machine, some processor instructions can not be executed directly on the processor. These instructions would interfere with the state of the underlying VMM or host OS and are called sensitive instructions. The key to implementing a VMM is to prevent the direct execution of sensitive instructions." (Robin and Irvine, 2000)
"If all sensitive instructions of a processor are privileged, the processor is considered to be 'virtualizable': then, when executed in user mode, all sensitive instructions will trap to the VMM. After trapping, the VMM will execute code to emulate the proper behavior of the privileged instruction for the virtual machine." (Robin and Irvine, 2000)
"[T]he Pentium instruction set contains sensitive, unprivileged instructions. The processor will execute unprivileged, sensitive instructions without generating an interrupt or exception. Thus, a VMM will never have the opportunity to simulate the effect of the instruction." (Robin and Irvine, 2000)
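To see why such unprivileged sensitive instructions break trap-and-emulate, a sketch that is not from the slides: SIDT reads the interrupt descriptor table register, whose value differs under a VMM, yet the instruction classically executes in ring 3 without trapping (the basis of the old 'redpill' VM-detection trick). Assumes x86-64 Linux with GCC/Clang; note that recent CPUs can forbid this via UMIP, in which case the instruction faults or the kernel returns a dummy base.

    #include <stdio.h>
    #include <stdint.h>

    /* sidt stores 10 bytes on x86-64: 16-bit limit + 64-bit base */
    struct __attribute__((packed)) dtr { uint16_t limit; uint64_t base; };

    int main(void) {
        struct dtr idtr;
        __asm__ volatile("sidt %0" : "=m"(idtr));  /* no trap, no exception */
        printf("IDT base: 0x%llx\n", (unsigned long long)idtr.base);
        return 0;
    }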
Rings and Virtualization (before 2006)
- VMM runs in ring 0 (obviously)
- Guests' user processes run in ring 3 (obviously)
- What about guests' kernel processes? (ring 0 in non-virtualized settings) They have to be located in ring 1 or 2, yet the guest kernel still contains instructions that only run in ring 0.
Possible solutions:
1. Rewrite the guest kernel: (a) before execution (e.g. in the kernel's source code) as done with Xen, or (b) during execution as done with binary translation
2. Change the hardware to add an even higher-privileged ring for the VMM: the modern approach, currently state-of-the-art

Binary Translation
Problem: no virtualization is possible as long as there are non-privileged sensitive instructions in guests.
Idea: rewrite code to not contain sensitive instructions; virtualization becomes closer to emulation (here, 'code' means binary x86 instructions, not C or even assembler code).
Problems:
- The virtualized operating system shall not see its code being rewritten
- Translated code has a different length than the original code, so jump/branch instructions' addresses are no longer valid
Fortunately, most guest code doesn't need to be translated, and most code that needs translation belongs to the operating system.

Hardware-Virtualization Co-Evolution
So far... no support by hardware beyond features already existing for non-virtualization purposes: the ring-based privilege architecture and memory protection depending on privilege.
Around 2005 or 2006... hardware-assisted VMMs to support virtualization:
- New CPU instructions to switch between guest and host
- In-memory data structures to save/restore states and to configure hardware behavior in guest mode
- Marketed as VT-x (Intel) or AMD-V (Pacifica)

First Generation Extensions (1)
Two modes of operation: guest mode and host mode, complementing the existing ring architecture. Privileged instructions are treated differently in either mode.
[Diagram: rings 0 through 3 in guest mode, below them root mode, a.k.a. host mode, a.k.a. 'Ring –1']

First Generation Extensions (2)
- Virtual Machine Control Block (VMCB) on AMD, Virtual Machine Control Structure (VMCS) on Intel
- In-memory data structure, one instance per virtual machine
- Stores the state of the VMM and the VM, tells the hardware how to react on sensitive instructions in the guest
- New instructions like VMRUN or VMLAUNCH switch from host to guest mode as configured in the VMCB/VMCS, saving the host state first
- When returning from guest to host mode: restore the host state and update the VMCB/VMCS with the guest's state (incl. the reason for the exit)

First Generation Extensions (3)
Still unsolved: access to I/O devices, interrupt handling, memory management,...
Operations VMRUN, VMLAUNCH, VMMCALL,... are very costly; the number of switches has to be minimized.
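These extensions are what Linux exposes through /dev/kvm. The following is a minimal sketch of that API (not how any particular VMM implements things, and not from the slides): it creates a VM whose single vCPU executes one HLT instruction in real mode. The KVM_RUN ioctl is where the CPU performs the VMRUN/VMLAUNCH-style switch into guest mode; it returns to host mode on the HLT exit. Error handling is omitted; requires access to /dev/kvm.

    #include <fcntl.h>
    #include <linux/kvm.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int kvm = open("/dev/kvm", O_RDWR);
        int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

        /* one page of guest 'physical' memory holding a single HLT */
        unsigned char *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        mem[0] = 0xf4;  /* hlt */
        struct kvm_userspace_memory_region region = {
            .slot = 0, .guest_phys_addr = 0x1000, .memory_size = 0x1000,
            .userspace_addr = (unsigned long)mem,
        };
        ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
        int mmap_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
        struct kvm_run *run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, vcpu, 0);

        /* start executing in real mode at guest physical address 0x1000 */
        struct kvm_sregs sregs;
        ioctl(vcpu, KVM_GET_SREGS, &sregs);
        sregs.cs.base = 0; sregs.cs.selector = 0;
        ioctl(vcpu, KVM_SET_SREGS, &sregs);
        struct kvm_regs regs;
        memset(&regs, 0, sizeof regs);
        regs.rip = 0x1000; regs.rflags = 2;  /* bit 1 of RFLAGS is always 1 */
        ioctl(vcpu, KVM_SET_REGS, &regs);

        ioctl(vcpu, KVM_RUN, 0);  /* switch to guest mode; returns on VM exit */
        printf("guest exited, reason = %d (KVM_EXIT_HLT = %d)\n",
               run->exit_reason, KVM_EXIT_HLT);
        return 0;
    }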
First Generation Extensions (4)
Benchmarking (Adams and Agesen (2006) on a Pentium 4):

    Forkwait (40 000 × fork/waitpid)
      Native (no virtualization)      6.0 s
      Software VMM                   36.9 s
      Hardware-assisted VMM         106.4 s

    Division by 0 (faults without memory involved)
      Native (no virtualization)    889 cycles
      Software VMM                 3223 cycles
      Hardware-assisted VMM        1014 cycles

    Page faults (faults with memory involved)
      Native (no virtualization)   1093 cycles
      Software VMM                 3927 cycles
      Hardware-assisted VMM       11242 cycles

First Generation Extensions (5)

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        for (int n = 0; n < 40000; ++n) {
            errno = 0;
            const pid_t pid = fork();
            if (pid == -1) {
                /// Error, fork() failed
                fprintf(stderr, "fork() failed, errno=%d\n", errno);
            } else if (pid > 0) {
                /// Still inside parent, 'pid' contains child's PID
                int status = 0;
                waitpid(pid, &status, 0);
            } else {
                exit(0); ///< Child: just exit without doing anything
            }
        }
        return 0;
    }

Memory Management (2010s) (1)
Repetition from the Operating Systems course:
- Physical memory is divided into frames, logical memory into pages
- Frames and pages have the same size (e.g. 4 KiB or 2 MiB)
- The Memory Management Unit (hardware) performs the translation
1. Page table: an (often hierarchical) list, elements found by offset. Complete overview, but slow to use ('walk through'). Located in main memory, one per process.
2. Translation lookaside buffer (TLB): a table of key-value pairs. Small selection of mappings (e.g. LRU), but fast. Located in fast special hardware, one for the whole system. Context switches invalidate the TLB, unless the TLB is context-aware.

Memory Management (2010s) (2)
[Figure: hierarchical page table walk on 32-bit x86; the linear address is split into 9 + 9 + 12 bits indexing the page-directory-pointer table, page directory, and page table, starting from CR3 and ending at a 4K memory page; GNU FDL, RokerHRO, Wikimedia Commons]
Example for the 32-bit x86 architecture; 64-bit architectures use at least four indirections (a small sketch of the 64-bit split follows after the next slide).

Memory Management (2010s) (3)
1. The page table is in the operating system's custody
2. The page table contains hardware information (physical addresses)
Inside virtualized operating systems, page tables must not contain 'real' physical addresses (otherwise the OpSys could 'escape' the VMM). An additional translation is necessary from 'guest physical addresses' (gPA) to 'host physical addresses' (hPA):
(a) Shadow Page Tables (SPT): software-based, but makes use of first-generation HW virtualization (VMRUN,...)
(b) Hardware-Assisted Page Tables (HAPT): hardware-based, ca. 2007, sometimes referred to as Two-Dimensional Paging (TDP); (1) 'Extended Page Table' (EPT, Intel), (2) 'Rapid Virtualization Indexing' (RVI) or 'Nested Page Table' (NPT, both AMD)
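Before turning to shadow and nested paging, a small runnable sketch (not from the slides) of the walk itself: how a 64-bit virtual address is decomposed into the four 9-bit table indices plus the 12-bit page offset used by the four-level tables mentioned above. The example address is arbitrary.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint64_t va = 0x00007f1234567abcULL;   /* arbitrary user-space address */
        printf("PML4 index:  %llu\n", (unsigned long long)((va >> 39) & 0x1ff));
        printf("PDPT index:  %llu\n", (unsigned long long)((va >> 30) & 0x1ff));
        printf("PD   index:  %llu\n", (unsigned long long)((va >> 21) & 0x1ff));
        printf("PT   index:  %llu\n", (unsigned long long)((va >> 12) & 0x1ff));
        printf("page offset: %llu\n", (unsigned long long)(va & 0xfff));
        return 0;
    }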
Memory Management (2010s) (4)
Shadow Page Tables (SPT)
- The virtualized operating system maintains a page table for each of its processes, mapping gVA to gPA (this is what you learned in the operating systems course)
- The VMM maintains a 'shadow page table' for every page table in a guest, mapping gVA to hPA (this is new here)
- The SPT's entries correspond to their guest page table's counterparts, including status bits (e.g. the invalid bit)
- The TLB still maps gVA to hPA, so most lookups are fast
- TLB miss? Now it becomes complicated...

Memory Management (2010s) (5)
Shadow Page Tables (continued; a toy code simulation follows after the steps)
1. Guest user process requests memory from the guest kernel
2. Guest kernel assembles/updates the page table for this process (mapping from gVA to gPA), returns a gVA to the user process
3. Guest user process tries to write the memory region, providing the address (gVA) to the hardware (i.e. the MMU)
So far, exactly what you learned in the operating systems course, except for the 'g'.
4. The MMU only knows about the shadow page table for this process and starts looking there. Remember: the MMU tries to resolve 'some' virtual address (VA) to a corresponding physical address (PA) by walking a page table. That page table is located at the memory address specified in the CR3 register, which happens to be the SPT for the current guest process.
5. As the entry for the requested gVA is not (yet) in the SPT, this causes a page fault which is handled by the VMM. The mapping is not yet in the SPT because our premise was that this is memory only recently assigned to the guest process by the guest operating system; the VMM was not informed about this. This page fault is also called a 'shadow page fault' as it only happens in the SPT and goes unnoticed by the guest. 'Regular' page faults can still happen in the guest's page tables, just as you learned in the operating systems course.
6. The VMM inspects the page table in the guest for this process, sees the previously unknown mapping from gVA to gPA, updates the SPT accordingly, allocates physical memory, and allows the MMU to resume. The MMU, being unaware of virtualization, does not care whether it is an operating system or a VMM that maintains a page table or handles page faults.
7. The physical address is provided by the MMU to the hardware (CPU) for the write operation; the TLB gets updated with the mapping from gVA to hPA.
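A toy, self-contained simulation of the walkthrough above, not from the slides: arrays stand in for real page tables, a single level instead of four, and the frame numbers are invented. It only illustrates the control flow: SPT hit, shadow page fault filled from the guest page table, and a regular page fault reflected into the guest.

    #include <stdio.h>

    #define NPAGES 16
    static int guest_pt[NPAGES];   /* gVA page -> gPA frame, -1 = unmapped (guest's table) */
    static int shadow_pt[NPAGES];  /* gVA page -> hPA frame, -1 = unmapped (VMM's table)   */
    static int gpa_to_hpa[NPAGES]; /* VMM's private gPA -> hPA assignment */
    static int next_host_frame = 100;

    /* what the MMU does: it only ever sees the shadow page table */
    static int mmu_translate(int gva_page) {
        if (shadow_pt[gva_page] >= 0) {
            printf("SPT hit: gVA %d -> hPA %d (no VMM involved)\n",
                   gva_page, shadow_pt[gva_page]);
            return shadow_pt[gva_page];
        }
        /* shadow page fault: VMM inspects the guest page table */
        int gpa = guest_pt[gva_page];
        if (gpa < 0) {
            printf("regular page fault, reflected into the guest\n");
            return -1;
        }
        if (gpa_to_hpa[gpa] < 0)
            gpa_to_hpa[gpa] = next_host_frame++;  /* allocate physical memory */
        shadow_pt[gva_page] = gpa_to_hpa[gpa];    /* update SPT, resume MMU */
        printf("shadow page fault handled: gVA %d -> gPA %d -> hPA %d\n",
               gva_page, gpa, shadow_pt[gva_page]);
        return shadow_pt[gva_page];
    }

    int main(void) {
        for (int i = 0; i < NPAGES; ++i)
            guest_pt[i] = shadow_pt[i] = gpa_to_hpa[i] = -1;
        guest_pt[3] = 7;   /* step 2: guest kernel maps gVA page 3 to gPA frame 7 */
        mmu_translate(3);  /* steps 3 to 6: first access faults, VMM fills the SPT */
        mmu_translate(3);  /* step 7 onward: subsequent accesses hit the SPT */
        mmu_translate(5);  /* unmapped in the guest: regular page fault */
        return 0;
    }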
Memory Management (2010s) (6)
Instead of waiting for page faults, the VMM can make the memory used by the guest for its page tables write-protected (a user-space sketch of this trick follows below):
0. The MMU gets configured to invoke the VMM to handle any writes to write-protected memory
1. The guest operating system tries to update the page table for a guest process
2. The VMM intercepts the write, updates the SPT, then updates the guest's page table
[Figure: (a) address translation with SPT, (b) address translation with HAPT; from Zhang et al. (2014)]
Summary
- No hardware support specifically for memory management necessary (the MMU does not know about virtualization)
- Complexity to maintain shadow pages up-to-date
- Memory consumption doubled, as all page tables exist twice
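A user-space analogue of this write-protection trick, as a hedged sketch that is not from the slides (Linux, assumes a 4 KiB page size): write-protect a page with mprotect, let the first write fault, and 'handle' the fault before allowing the write to proceed. A VMM does the same with guest page table pages, updating the SPT in its handler.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    static char *page;

    static void on_fault(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)si; (void)ctx;
        /* a VMM would update the shadow page table here, then let the write proceed */
        mprotect(page, 4096, PROT_READ | PROT_WRITE);
    }

    int main(void) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        strcpy(page, "initial");        /* fill, then protect */
        mprotect(page, 4096, PROT_READ);
        page[0] = 'X';                  /* faults once; handler unprotects; write retried */
        printf("after intercepted write: %s\n", page);
        return 0;
    }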
Memory Management (2010s) (7)
Hardware-Assisted Page Tables (HAPT)
(a) Guest page table (GPT): one per process in the guest, maps gVA to gPA (same as before)
(b) Nested/extended page table (RVI a.k.a. NPT/EPT): one per VM, maps gPA to hPA, looks pretty much like a normal page table
Both tables have the same 4-level structure.
[Figure: generic 4-level page table walk on 64-bit x86; sign extension plus 9 + 9 + 9 + 9 + 12 bits indexing the PML4 table, page-directory-pointer table, page directory, and page table, starting from CR3 and ending at a 4K memory page; the diagram does not care about 'guest', 'host', or 'extended'; GNU FDL, RokerHRO, Wikimedia Commons]
Memory hardware (MMU, TLB) is aware of virtualization.
Two registers: nCR3 (for the NPT) and gCR3 (for the GPT).

Memory Management (2010s) (8)
Hardware-Assisted Page Tables (continued)
The MMU walks the second dimension for each level in the guest page table:
1. A process in the guest provides a gVA to the MMU. Bits 39–47 determine the offset in the gPML4, resulting in the gPDPT's gPA. The MMU walks all 4 levels of the NPT (via nCR3) to map the gPDPT's gPA to its hPA.
2. Bits 30–38 determine the offset in the gPDPT, resulting in the gPDT's gPA. Again, the MMU walks all 4 levels to map the gPDT's gPA to its hPA.
3. Bits 21–29 determine the offset in the gPDT, resulting in the gPT's gPA. Again, the MMU walks all 4 levels to map the gPT's gPA to its hPA.
4. Bits 12–20 determine the offset in the gPT, resulting in the page's frame's gPA. Again, the MMU walks all 4 levels to map this gPA to its hPA.
24 memory access operations to resolve a gVA to an hPA (see the sketch below).
Memory management happens between VM and MMU; no costly switches to the VMM.

Memory Management (2010s) (9)
[Figure: two-dimensional (2D) page walk for virtualization on x86; each of the 4 guest page table levels (gL4 to gL1) requires a nested walk (nL4 to nL1) via nCR3, 24 steps in total; from Yaniv and Tsafrir (2016) and Ahn et al. (2012)]
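The 24 accesses follow from a small formula, shown here as a sketch (the formula itself is a standard way to count 2D walk steps, not stated on the slides): each of the g guest levels costs one nested walk of n accesses to translate the table's gPA plus one access to read the guest entry itself, and the final data address costs one more nested walk, i.e. g*(n+1) + n.

    #include <stdio.h>

    int main(void) {
        int g = 4, n = 4;   /* guest and nested page table levels */
        /* 4 * (4 + 1) + 4 = 24, matching the count above */
        printf("2D walk:     %d memory accesses\n", g * (n + 1) + n);
        printf("native walk: %d memory accesses\n", g);  /* for comparison */
        return 0;
    }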
Memory Management (2010s) (9, continued)
[Figure 4: Translating a guest linear address to a system physical address using nested page tables; a nested walk at each of the four guest page table levels; from the AMD-V Nested Paging white paper, Advanced Micro Devices, Inc. (2008)]

Memory Management (2010s) (10)
What are expensive workloads for SPT or HAPT?
- SPT: updating/writing to the page table is expensive due to switching between VM and VMM; reading from the page table is OK
- HAPT: both reading from and writing to the page table are expensive, as 'two dimensions' have to be walked
No clear winner. Contributing factors:
- Software realization of the VMM and SPT
- Hardware realization of HAPT and/or of switching between VM and VMM
- Caching: TLB, page walk cache (PWC, relevant for HAPT),...
- Pattern of memory usage, such as more read than write operations
It is technically possible for a VMM to use both and switch between them, but this requires predicting future access patterns to know when to switch; experiments show only marginal gains, not worth the effort.

Device I/O (1)
1. Real device that exists in hardware: the guest must have a hardware driver for this hardware. Realization:
(a) Dedicated devices
- VM gets full, exclusive access to an existing device
- No device support necessary in the VMM
- Guest gets best-possible hardware access
- Hardware cannot be shared among other VMs or the VMM
- CPU↔hardware: complex memory access and signaling paths
(b) VMM emulates real hardware in software; commands and data are filtered and translated onto existing hardware
- Emulates hardware that does not exist in reality
- Multiplexing (sharing of resources) comes for free
- Increases flexibility, e.g. to mimic old hardware when migrating physical machines to virtual ones
- Considerable overhead
- VMM must emulate hardware bugs as well

Device I/O (2)
2. Para-virtual devices
- VM can 'talk directly' to the VMM without going through a device driver for physical hardware
- Good performance, less overhead in the VMM
- Modifications in the guest necessary (e.g. 'virtual hardware drivers')

Device I/O (3)
How do VMMs/hypervisors talk to hardware?
(a) Type-1/bare metal VMMs must have their own drivers
(b) Type-2/hosted VMMs use the host operating system for (at least some) hardware access (it provides drivers and multiplexing)
(c) The VMM employs a service VM with a regular operating system plus hardware drivers and direct/full hardware access. Example: Xen, Hyper-V.
[Diagram: service VM (service OpSys) and guest VM side by side on the VMM on hardware]

Resource Management: Overcommitment (1)
Problem: multiple virtual machines share one physical machine (its CPU, RAM, disks,...).
Observation: just like physical machines, virtual machines often do not make use of all assigned resources and idle most of the time.
Idea: assign more 'resources' to the VMs than are available.
Example: four VMs, each gets 40% CPU. Most of the time this is no problem; only in high-load situations do VMs get less than 40%.
Operating systems always make full use of their assigned main memory (why?), so the VMM must be able to remove memory allocated to one machine and give it to another.

Resource Management: Overcommitment (2)
Memory Sharing: overcommitment possible without taking memory away?
Observation: VM instances are similar: same operating system, same applications,...
Idea: two identical regions of memory in two different VMs require only a single representation in physical memory. The concept already exists in 'normal' operating systems when processes are forked/cloned.
Challenge 1: no obvious relation between VMs like between fork'ed processes.
Challenge 2: comparing n × m many pages in memory is expensive (and must be repeated regularly).
Savings of 5 – 30% are possible according to VMware and Kingston Technology (2006).

Resource Management: Overcommitment (3)
Ballooning
Problem: to take memory away from a VM, the VMM must know which pages/frames are 'unused'.
Idea: the VMM has a driver in the VM, possibly realized as a regular user process, that allocates non-swappable/non-pageable memory from the guest kernel (a code sketch follows after slide (5)).
- The VMM communicates with the driver and knows which memory its driver allocated
- The VMM can re-assign the corresponding physical memory to another VM while the driver holds the allocated memory in the first VM
- The VMM's driver can allocate and deallocate memory in each VM to adjust the available memory
- The strategy of (de)allocations depends on the relative priority between VMs

Resource Management: Overcommitment (4)
[Figure 1: Memory balloon driver in action; from VMware and Kingston Technology (2006), p. 4]

Resource Management: Overcommitment (5)
All good with ballooning?
- Works, solves the problem of overcommitment (just like memory sharing did, as far as possible)
- Ballooning will take memory away only with some delay (operating systems are reluctant to swap/page)
- The guest where the balloon expands may swap away 'good' pages, impacting performance
- Difficult to determine the right size of the balloon for each VM
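The guest half of a balloon driver can be sketched in a few lines; this is a loud simplification and not from the slides: a real driver talks to the VMM and reports the page frame numbers it holds, the 256 MiB target here is an arbitrary assumption, and mlock may require a raised RLIMIT_MEMLOCK or CAP_IPC_LOCK.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        size_t balloon_bytes = 256 * 1024 * 1024;  /* target size: an assumption */
        char *balloon = malloc(balloon_bytes);
        if (!balloon) return 1;
        memset(balloon, 0, balloon_bytes);   /* force frames to actually be allocated */
        if (mlock(balloon, balloon_bytes) != 0)  /* non-swappable, as required above */
            perror("mlock");
        /* a real driver would now report these frames to the VMM, which can hand
           the backing host memory to another VM while we keep holding them */
        printf("holding %zu bytes; press Ctrl-C to deflate the balloon\n",
               balloon_bytes);
        pause();
        return 0;
    }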
Operating System Virtualization
Laurén et al. (2017)

Operating System Virtualization
Idea: use the same core operating system (kernel) for multiple virtual machines.
Implementation: the operating system already does a job similar to virtualization (separation of processes or users,...).
- Good resource utilization possible due to lower overhead
- Control over guests' individual processes, not just whole machines as in full virtualization
- No hardware support required
- The operating system (kernel and its tools) must support it to fulfill the virtualization requirements
- Not possible to mix operating systems: no Windows guest on a Linux host
- Vulnerabilities in the kernel may allow guests to escape

The Mother of All Containers (1)
"Run a command with a different root directory" (Coreutils manual on chroot)

    chroot /tmp/test ls -Rl /

All files (libraries, device nodes,...) required to run the program must exist inside the 'jail' (e.g. /tmp/test/usr/lib/libc.so.6). Use mount to prepare the jail filesystem:

    mount -t proc proc /tmp/test/proc/
    mount --rbind /sys /tmp/test/sys/
    mount --rbind /dev /tmp/test/dev/
    mount --rbind /run /tmp/test/run/
    cp /etc/resolv.conf /tmp/test/etc/resolv.conf

The Mother of All Containers (2)
Use cases:
- Enter the 'real' system once booted from a live/rescue medium: repair the bootloader, install/remove packages, reset passwords,...
- Launch programs in a separate filesystem
Properties:
- Ubiquitously available on Linux and Unix, has been around since the early 1980s
- Simple and low overhead
- No resource control except for standard Unix mechanisms
- Only separates the filesystem, but not other resources (memory, processes, network,...)
- Well documented how a moderately skilled adversary can escape (see the 'chdir("..") escape technique')

Linux Fundamentals for Containers (1)
Containers make use of the following Linux kernel features (the features exist and are useful for other things than containers, too):
1. Namespaces
2. Control Groups
3. Capabilities
Namespaces
Traditionally, all processes see the same: which processes are running, which filesystems are mounted where, which network interfaces are available and their IP addresses, which users and groups exist,...
Necessary for virtualization: containers have their own 'view' on the system.

Linux Fundamentals for Containers (2)
Namespaces (continued)
Each process may have its own namespace of each category (see the list below and the sketch after it). A new namespace can be created at the time of process creation or login (via PAM), or a process migrates between namespaces.
1. Process identifiers: processes may or may not be visible in certain namespaces and may have different identifiers
2. Mounts: mounted filesystems may or may not be visible
3. Network: different network interfaces available, with their own configuration
4. Interprocess communication: whether processes can share memory for data exchange

Linux Fundamentals for Containers (3)
Namespaces (continued)
5. UTS: controls hostname and domain
6. Users: which users and groups exist and which IDs they have
7. Control groups: which control groups exist and which processes are members of which
8. Time: since March 2020; controls monotonic clocks, not wall clocks
See also namespaces(7).
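As referenced in the list above, a minimal sketch (not from the slides) of creating a namespace with unshare(2): the process gets its own UTS namespace and changes its hostname without affecting the host. Linux only; needs root or CAP_SYS_ADMIN.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        if (unshare(CLONE_NEWUTS) != 0) {  /* detach from the host's UTS namespace */
            perror("unshare");
            return 1;
        }
        const char *name = "container-test";
        sethostname(name, strlen(name));   /* only visible inside this namespace */
        char buf[64];
        gethostname(buf, sizeof buf);
        printf("hostname in new UTS namespace: %s\n", buf);
        return 0;   /* the host's hostname is unchanged */
    }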
Linux Fundamentals for Containers (4)
Control Groups
Traditionally: resource limitations put on a process did not apply to its child processes (setrlimit(2)).
Necessary for virtualization: resource limitations set for a container apply to all its processes as a group, i.e. the container's whole process hierarchy.
- Control groups can be nested; children inherit the limitations of their parents
- Supported resource limitations (selection): main memory consumption, CPU utilization, disk I/O throughput
- Allows freezing a whole process group
- Allows measuring resource consumption, e.g. for billing

Linux Fundamentals for Containers (5)
Capabilities
Traditionally: processes launched as root (e.g. via the SetUID bit) have unlimited power (execve(2)).
Necessary for virtualization: even if containerized processes are launched as root, certain operations should be forbidden (e.g. changing the system time, reboot).
Idea: categorize superuser privileges into 'capabilities' that can be enabled/disabled per process/thread and are inherited by children (a sketch follows below).
- About 40 such capabilities exist in the Linux kernel, limiting access to certain kernel functions
- For containers, white-list capabilities as necessary, forbid all others
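A hedged sketch, not from the slides, of inspecting and dropping a capability from the bounding set via prctl(2): this is one of several capability interfaces on Linux. The drop requires CAP_SETPCAP (e.g. running as root) and is irreversible for this process and its children.

    #define _GNU_SOURCE
    #include <linux/capability.h>
    #include <stdio.h>
    #include <sys/prctl.h>

    static void show(void) {
        /* returns 1 if the capability is in the bounding set, 0 if not */
        int r = prctl(PR_CAPBSET_READ, CAP_SYS_TIME, 0, 0, 0);
        printf("CAP_SYS_TIME %s in the bounding set\n", r == 1 ? "is" : "is NOT");
    }

    int main(void) {
        show();
        /* forbid changing the system time for this process tree */
        if (prctl(PR_CAPBSET_DROP, CAP_SYS_TIME, 0, 0, 0) != 0)
            perror("PR_CAPBSET_DROP");
        show();
        return 0;
    }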
Docker (1)
Docker is a management solution for operating system containers:
- Configuring operating systems to launch containers
- Maintaining and deploying containers
The Docker daemons dockerd and containerd manage and 'own' containers; the Docker command line tools are used to direct the daemons to start/stop/modify containers.
[Diagram: Container1 and Container2 (processes, libraries) on the Docker daemon(s) on the operating system, steered via the Docker command line tools]

Docker (2)
Images and Overlays
An image is just an archive of a Linux filesystem tree plus some metadata (version, author, origin,...). Images can overlay other images:
1. Create a sparse filesystem tree only containing added or modified files, and remember which files 'got deleted'
2. Impose the sparse filesystem tree on an already existing image
The 'already existing image' stays unmodified and is actually read-only. Overlays can be stacked upon each other. (Illustration: Docker Inc.)

Docker (3)
How to Build an Image
Specification in a text file typically named Dockerfile. Various commands in uppercase specify actions, such as...
Base installation (i.e. the underlying image) to start from:

    FROM ubuntu:18.04

Statements to copy files into the container image and run commands (e.g. to build software):

    COPY . /app
    RUN make -C /app

Specification which command to run by default when the container is launched:

    CMD python /app/app.py

Build the actual image from the Dockerfile:

    docker build -t my_hello_world path-where-Dockerfile-is

Docker (4)
What happens when a container starts?
1. Assemble the container instance's filesystem: images stack upon each other, a writable/volatile layer is put on top for the container instance to write into, and more filesystems are mounted into the volatile layer as configured
2. Set up the environment: control groups and namespaces, network configuration
3. Launch the container's initial process: the command from CMD, or an argument from the docker command line

Docker (5)
Processes when a container runs
Host (extract of output from ps axf):

    17728 pts/0  Sl+  0:00 /usr/bin/docker run --interactive --tty ubuntu bash
    16150 ?      Ssl  0:00 /usr/bin/containerd
    17781 ?      Sl   0:00  \_ containerd-shim (long line cut away)
    17804 pts/0  Ss   0:00      \_ bash
    17860 pts/0  S+   0:00          \_ top
    17204 ?      Ssl  0:06 /usr/bin/dockerd -H fd://

Guest:

    PID TTY   STAT TIME COMMAND
      1 pts/0 Ss   0:00 bash
     12 pts/0 R+   0:00 ps axf

Same bash process; different PIDs depending on the 'point of view' (i.e. the PID namespace).

Docker (6)
Networking with Docker
Port forwarding:

    docker run ... -p 3333:4444 ...

Outside clients can connect to 3333/TCP on the host; the connection will be forwarded into the container to a listening service on 4444/TCP.

Docker (7)
Filesystems in Docker
Bind-mount a volume, i.e. make a local directory accessible inside:

    docker run ... -v /tmp/a:/tmp/a ...

    Host:  -rw-rw-r-- 1 thomas thomas 4 Jan 14 14:48 aaa.txt
    Guest: -rw-rw-r-- 1 1000   1000   4 Jan 14 13:48 aaa.txt

The container does not know any user with UID=1000 or GID=1000 (unless such a user got configured in one of the used images' Dockerfiles).

Docker and Security
Containers are by design less secure than full virtualization:
- Large interface directly interacting with the host operating system
- Hard to really get secure
Practical example of how to escape a privileged Docker container:
"PWD uses a privileged container and, prior to the fix, failed to secure it properly. This makes an escape from the PWD container to the host difficult – but not impossible as we've shown in this post. Injecting Linux kernel modules is only one of the paths open to a persistent attacker. Other attack paths do exist and must be securely dealt with when using privileged containers." (How I Hacked Play-with-Docker and Remotely Ran Code on the Host)

Criticism of Docker
- Docker daemon(s) run constantly in the background, with root permissions
- Docker daemon(s) 'own' all running container instances (instead of the user who launched the container)
- Makes it hard/impossible to start/stop containers via systemd (the Docker daemon itself can be started/stopped via systemd)
- Resource control via systemd impossible (only via Docker directly)

Podman – A Docker Alternative
Idea: a functional alternative to Docker, but with a different internal design. Containers are 'owned' by the command line tool used to launch the container, not by some daemon.
- Driven by Red Hat, which also maintains systemd (L. Poettering)
- Released February 2018 (Docker got released March 2013)
- Easier to have rootless containers than with Docker
- Supports building images from Docker's Dockerfiles
- Command line client virtually identical to Docker's client: alias docker=podman
- Less 'battle hardened' than Docker
- Trying to be a replacement for Docker, so there is some delay until new Docker features become available in Podman
Containers on Windows (1)
Getting up-to-date, technically correct, non-marketing, non-tutorial materials on containers on Windows is hard.
Most of Microsoft's virtualization is done via Hyper-V (a type-1 hypervisor), such as...
- Virtual machines running various flavors of Linux or BSD
- Windows Subsystem for Linux 2 (WSL 2)
- Whatever Microsoft calls a container which runs Linux
Exceptions are Windows Server Containers (WSC):
"Windows Server Containers provide application isolation through process and namespace isolation technology. A Windows Server container shares a kernel with the container host and all containers running on the host. Hyper-V Containers expand on the isolation provided by Windows Server Containers by running each container in a highly optimized virtual machine. In this configuration, the kernel of the container host is not shared with the Hyper-V Containers." (Performance tuning Windows Server Containers)

Containers on Windows (2)
- Microsoft advertises Docker as the low-level container solution
- Docker on Windows adds a management layer on top of Microsoft's virtualization solutions
- Microsoft advertises various base images via Docker Hub
- Docker supports both Hyper-V-based and WSC-based containers
- Docker can make use of WSL 2 or Hyper-V/WSC
- Docker Inc. and Microsoft cooperate closer than Docker Inc. does with any Linux distribution

Real-World Examples
VMware ESXi, WSL 1 & 2

VMware ESXi (1)
Getting up-to-date, technically correct, non-marketing, non-tutorial materials on VMware ESXi is hard.
- Type-1 hypervisor running on bare metal
- Primary component of VMware's enterprise virtualization offerings
- Closed-source kernel VMkernel with a built-in Linux emulation layer to make use of Linux hardware drivers
[Diagram: VM1, VM2, ... on the VMkernel; hardware drivers attached via the Linux emulation layer]

VMware ESXi (2)
- Mature, feature-complete solution
- Hypervisor has a clear type-1, bare-metal architecture
- Said to be small (less than 200 MB)
- Commercial, closed-source offering (costs money)
- Take-over by Broadcom and subsequent licensing changes burned some bridges
- Limited support from operating system vendors or Linux distributions (those push their own virtualization solutions)

Windows Subsystem for Linux
tl;dr: lets you run Linux programs on Windows (with some practical limitations).
Two versions exist which are technically completely different:
- WSL 1, which is actually a 'Windows subsystem'; announced 2016, released 2017
- WSL 2, which is a subsystem by name only (details on the following slides); announced 2018, released 2019
Availability and feature set (e.g. support for Linux GUI programs) differ across Windows versions and builds.
WSL 1 (1)
What is a 'subsystem' for Windows?
The Windows NT kernel (used in Windows 10 and 11) has a layered design where, at the top level, other operating systems' kernel interfaces ('subsystems') can be provided to user-space processes.
Historic examples: 'Windows on Windows' to run 16-bit programs in 32-bit Windows, OS/2 to run OS/2 1.x programs, POSIX.
Current examples: 'WoW64' to run 32-bit programs in 64-bit Windows, WSL 1, Windows Subsystem for Android (WSA).
WSL 1 implements a Linux kernel interface as a subsystem without using Linux code.

WSL 1 (2)
- Windows host filesystem is easily accessible from the Linux guest via /mnt/driveletter/
- Windows programs can transparently launch Linux programs and vice versa
- Access to 'exotic' hardware like serial ports from the Linux guest
- Linux guest networking realized as a bridge; the Windows host can access the Linux guest's services via localhost
- Filesystem semantics differ between the NT kernel and POSIX; worst-case performance is bad
- Only common Linux API calls are implemented; tools using 'exotic' APIs are not supported: OpenVPN, Docker,...
- As there is no Linux kernel, no modules can be loaded
- Other (minor) practical issues and limitations

WSL 1 (3)
(Possible/probable) reasons for abandoning WSL 1:
- Considerable engineering task to re-implement the Linux kernel interface on top of the Windows NT kernel (the task was never completed)
- Abysmal I/O performance due to fundamental differences between Windows NT kernel and POSIX semantics
Missed opportunity: using Linux' top and kill to manage Windows processes.

WSL 2 (1)
Boring design:
1. Hyper-V virtual machine virtualization, probably stripped down and customized for WSL 2
2. Customized Linux kernel provided by Microsoft (yes, the real Linux kernel, with source code available for download)
3. WSL 2 Linux 'instances' run as containers; technical details are well-hidden, Linux instances can be installed from the Microsoft Store
A single Linux VM is started when the first container starts and stopped when the last container stops.
No 'subsystem' any longer; the name was kept for continuity/marketing.

WSL 2 (2)
- Better I/O performance than WSL 1 and supposedly better than generic 'Linux in Hyper-V' virtualization due to customizations
- Full Linux compatibility due to using an actual Linux kernel
- Active development with new features added; Example: graphical Linux programs in Windows via RDP
- Custom init process inside the container with limited functionality; systemd was only supported via hacks until official support in September 2022
- Network realized via NAT: accessing services in Linux instances requires manual configuration of port forwarding
- Accessing the host filesystem is more difficult, probably via some network filesystem

Exam Reading Instructions
For the exam, the following materials will be relevant:
(a) This slide set, in its full extent (even if parts were skipped during the lecture)

References (1)
Adams, K., & Agesen, O. (2006, October). A Comparison of Software and Hardware Techniques for x86 Virtualization. In J. P. Shen & M. Martonosi (Eds.), Proceedings of the Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XII) (pp. 2–13). ACM. https://doi.org/10.1145/1168857.1168860
Advanced Micro Devices, Inc. (2008). AMD-V Nested Paging [White paper].
Ahn, J., Jin, S., & Huh, J. (2012). Revisiting Hardware-assisted Page Walks for Virtualized Systems. Proceedings of the 39th Annual International Symposium on Computer Architecture, 40(3), 476–487. http://dl.acm.org/citation.cfm?id=2337159.2337214
Chernoff, A., Herdeg, M., Hookway, R., Reeve, C., Rubin, N., Tye, T., Bharadwaj Yadavalli, S., & Yates, J. (1998). FX!32 – A Profile-Directed Binary Translator. Micro, IEEE, 18(2), 56–64. https://doi.org/10.1109/40.671403

References (2)
Creasy, R. J. (1981). The Origin of the VM/370 Time-Sharing System. IBM Journal of Research and Development, 25(5), 483–490.
Laadan, O., & Nieh, J. (2010, June 21). Operating System Virtualization: Practice and Experience. In G. Haber, D. D. Silva, & E. L. Miller (Eds.), Proceedings of the 3rd Annual Haifa Experimental Systems Conference (17:1–17:12). ACM. https://doi.org/10.1145/1815695.1815717
Laurén, S., Memarian, M. R., Conti, M., & Leppänen, V. (2017). Analysis of Security in Modern Container Platforms. In S. Chaudhary, R. Buyya, & G. Somani (Eds.), Research Advances in Cloud Computing (pp. 351–369). Springer Nature Singapore Pte Ltd. https://doi.org/10.1007/978-981-10-5026-8_14

References (3)
Popek, G. J., & Goldberg, R. P. (1974). Formal Requirements for Virtualizable Third Generation Architectures. Communications of the ACM, 17(7), 412–421. https://doi.org/10.1145/361011.361073
Robin, J. S., & Irvine, C. E. (2000). Analysis of the Intel Pentium's Ability to Support a Secure Virtual Machine Monitor. Proceedings of the 9th Conference on USENIX Security Symposium (SSYM'00). http://www.usenix.org/events/sec2000/full_papers/robin/robin.pdf
VMware & Kingston Technology. (2006). The Role of Memory in VMware ESX Server 3 [Revision 20060926].
Yaniv, I., & Tsafrir, D. (2016). Hash, Don't Cache (the Page Table). Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science, 337–350. https://doi.org/10.1145/2896377.2901456

References (4)
Zhang, Y., Oertel, R., & Rehm, W. (2014). Paging Method Switching for QEMU-KVM Guest Machine. Proceedings of the 2014 International Conference on Big Data Science and Computing, 22:1–22:8. https://doi.org/10.1145/2640087.2645709

Created with LaTeX Beamer, which is free and open source software.
Thank you for your attention. Questions?
How to contact: https://www.his.se/fish
Unless otherwise noted, all materials on these slides are licensed under the Creative Commons Attribution-Share Alike 4.0 International License.
