LINFO2145 — Cloud Computing Lecture Notes PDF
Document Details
UCL
Etienne Rivière
Summary
These lecture notes cover systems aspects of IaaS management: memory sharing and deduplication between co-hosted VMs, deduplication for VM image storage (the LiveDFS case study in OpenStack), and live VM migration.
Full Transcript
LINFO2145 — Cloud Computing. Lesson 8: Systems Aspects of IaaS Management. Pr. Etienne Rivière, [email protected]

Announcements. The quiz on lectures #5 and #6 is closed; peer review is available until Wednesday, Nov. 20 (before class). Another quiz on lectures #7 and #8 (this lecture and the next one) will be available next week. The final quiz will cover the two "Big Data" lectures. Donatien will give lecture #10 on Big Data processing / stream processing (his Ph.D. topic).

Course objectives. Present advanced IaaS management techniques. Describe how economies of scale can be achieved in virtualized environments. Show how VM migration allows fully decoupled management of the OS and of physical resources.

Outline. Part A: techniques for saving storage space: VM memory sharing, deduplication at the main-memory level, deduplication for VM image storage. Part B: dynamic IaaS management: VM migration across hosts.

Part A: saving storage space

Recap: virtualization. (Figure: a non-virtualized system, applications on a single OS on hardware, versus a virtualized system with several guest OSes on top of a virtualization layer.) The virtualization layer (hypervisor) orchestrates the sharing of physical resources between virtual machines (VMs). Each guest OS "sees" only the resources allocated to it.

Recap: virtualization modes. First generation: full virtualization by binary rewriting (software based), with the VMM running on top of a host OS. Second generation: paravirtualization (software-based, collaborative virtualization), with the VMM as a hypervisor; requires a modified guest OS. Third generation: hardware-assisted full virtualization (software + hardware based), with the VMM as a hypervisor.

VM memory sharing. The hypervisor exposes a fixed amount of memory to each VM: 😀 predictable and simple, ☹ but it is difficult to estimate needs and some VMs may not use all of their memory. Ballooning provides dynamic memory allocation: the hypervisor dynamically changes the number of memory frames allocated to a VM. This requires paravirtualization, with a specific driver in each VM's OS kernel, because freeing memory frames requires access to the processes' memory context (page tables) and semantics (page cache vs. pages used by applications). When one VM needs more memory, its balloon is deflated and the balloons of other VMs are inflated. (Figure: VM 1 can page in and use physical memory; VM 2 must page out to disk and free physical memory.)
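To make ballooning concrete, here is a minimal, hypothetical sketch (not the actual Xen or VMware balloon driver API; all class and method names are invented) of how inflating the balloon of a VM with spare memory lets the hypervisor hand the reclaimed frames to a VM under pressure:

```python
# Toy model of balloon-based memory reallocation between two VMs.
# Illustrative sketch only; real drivers operate on page frames inside the guest kernel.

class VM:
    def __init__(self, name, allocated_mb, used_mb):
        self.name = name
        self.allocated_mb = allocated_mb  # frames currently given by the hypervisor
        self.used_mb = used_mb            # memory the guest actually needs
        self.balloon_mb = 0               # frames pinned by the in-guest balloon driver

    def inflate_balloon(self, mb):
        """Guest driver pins free pages so the hypervisor can reclaim them."""
        spare = self.allocated_mb - self.used_mb - self.balloon_mb
        grabbed = min(mb, max(spare, 0))
        self.balloon_mb += grabbed
        return grabbed

    def deflate_balloon(self, mb):
        """Guest driver releases pinned pages back to the guest OS."""
        released = min(mb, self.balloon_mb)
        self.balloon_mb -= released
        return released


def rebalance(needy: VM, donor: VM, needed_mb: int):
    """Move physical frames from donor to needy via ballooning."""
    reclaimed = donor.inflate_balloon(needed_mb)   # donor gives up free frames
    donor.allocated_mb -= reclaimed                # hypervisor takes them back
    needy.allocated_mb += reclaimed                # ...and maps them to the needy VM
    needy.deflate_balloon(reclaimed)               # if the needy VM had an inflated balloon, it shrinks
    return reclaimed


if __name__ == "__main__":
    vm1 = VM("vm1", allocated_mb=2048, used_mb=1900)   # under memory pressure
    vm2 = VM("vm2", allocated_mb=2048, used_mb=1000)   # has spare memory
    moved = rebalance(vm1, vm2, needed_mb=512)
    print(f"moved {moved} MB from {vm2.name} to {vm1.name}")
```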
Other approaches. Puma: inter-VM memory sharing. Guest OSes can borrow and lend their free memory, not necessarily on the same physical host, using inter-VM caching for the page cache (the cache of disk pages used to speed up I/O); application pages are never paged out this way. Performance gains come from the fact that network speed >> disk speed. Puma requires the same guest OS on both sides and modifications of the OS kernel. RDMA (Remote Direct Memory Access) allows direct, very fast access between VM memories, but requires specific NICs (network interface cards).

Data duplication. Cloud platforms handle large amounts of data from multiple companies, multiple applications and multiple users. Many copies of the same data exist across clouds, and even within a single cloud multiple copies coexist. Duplication wastes storage space. Deduplication is the automatic elimination of duplicate data in a storage system. General principle: detect common data, keep a single copy, and replace the other copies by pointers to that single copy. There is a large spectrum of applications and systems. Deduplication ratio: size before ÷ size after.

Deduplication targets (storage system: operations, mutable data, performance target).
RAM: R/W, mutable, minimal overhead.
File system: R/W, mutable, low delays and high throughput.
VM storage: R/W, mutable (but rarely), low delays and high throughput.
Backups: R/mostly W, mutable (but rarely), high throughput.
Archives: only W, immutable, high throughput.
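The general principle (single copy, references, deduplication ratio) can be sketched with a toy content-addressed store. This is an illustrative example only; names and structure are not taken from any of the systems discussed here:

```python
import hashlib

class DedupStore:
    """Toy block store: identical blocks are stored once and reference-counted."""

    def __init__(self):
        self.blocks = {}     # fingerprint -> block content (single physical copy)
        self.refcount = {}   # fingerprint -> number of logical references
        self.logical = 0     # bytes written by clients
        self.physical = 0    # bytes actually stored

    def write(self, data: bytes) -> str:
        fp = hashlib.sha1(data).hexdigest()   # fingerprint of the block content
        self.logical += len(data)
        if fp in self.blocks:                 # duplicate: just add a reference
            self.refcount[fp] += 1
        else:                                 # new content: store a single copy
            self.blocks[fp] = data
            self.refcount[fp] = 1
            self.physical += len(data)
        return fp                             # callers keep the fingerprint as a pointer

    def dedup_ratio(self) -> float:
        """Deduplication ratio = size before / size after."""
        return self.logical / self.physical if self.physical else 0.0


if __name__ == "__main__":
    store = DedupStore()
    for block in [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]:
        store.write(block)
    print(f"dedup ratio: {store.dedup_ratio():.2f}")   # 4 logical blocks, 2 stored -> 2.00
```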
Example 1: deduplicating RAM

Deduplicating RAM between co-hosted VMs. Several VMs run on the same physical host, often with the same 'type' of OS (GNU/Linux) or even the same OS distribution (base image from the IaaS provider). They contain many shared libraries and even identical executable files, and in some cases the same data (for instance downloaded by several VMs). In Docker this is addressed by re-using common FS layers, but a hypervisor has no knowledge of the file systems used by its VMs. Each memory-mapped file occupies an integer number of pages in memory. ☹ Pages holding the same data are present multiple times, once in each VM's memory space. 💡 Store each page only once, using deduplication. Example: the VMware ESX hypervisor.

VMware ESX. A third-generation, "bare-metal" hypervisor, supporting paravirtualization and hardware-assisted virtualization. Memory management: the guest OS "sees" and manages what it thinks is its own physical memory, but this is in reality virtual memory at the hypervisor level. The hypervisor maps guest "physical memory" to actual physical memory; modern processors' MMUs support this operation in hardware (two levels of page tables). (From the VMware whitepaper: virtual memory gives applications a uniform virtual address space and lets the OS and hardware handle translation between virtual and physical addresses. When running a virtual machine, the hypervisor creates a contiguous addressable memory space for it, with the same properties as the virtual address space presented to applications by the guest OS; this allows the hypervisor to run multiple virtual machines simultaneously while protecting the memory of each one from being accessed by the others. There are therefore three memory layers in ESX: guest virtual memory, guest physical memory, and host physical memory. Figure 2: virtual memory levels and memory address translation in ESX; image © VMware Inc.)

Deduplication in VMware ESX: Transparent Page Sharing (TPS). Offline deduplication: scan for duplicates in a lazy manner. The hypervisor periodically scans physical memory and looks for duplicates. Checking the content of every pair of pages is intractable; instead, a hash function is applied to each page's content and page references are kept in a hash table. If two hash values match, the pages might have the same content; rarely, two different pages hash to the same value (a collision), so the actual contents are checked for an exact match. The second-level page table (guest physical to host physical) is then updated, transparently to the guest OS. Copy-on-write automatically creates a private copy upon a write to a shared page.

Hash function. A function over the content of the block, yielding a value in a bounded hash space (e.g., 128 bits). The distribution of hash values should be uniform regardless of the input keys. Non-cryptographic hash functions are enough (no need for MD5, for instance) to check whether two pages might be the same. (Illustration by Jorge Stolfi.)

Deduplication in VMware ESX. (Figure 4: content-based page sharing in ESX; image © VMware Inc.) A hash value is generated from the candidate guest physical page's content and used as a key to look up a global hash table, in which each entry records a hash value and the physical page number of a shared page. If the hash value of the candidate page matches an existing entry, the page contents are compared in full, as described above.

Copy-on-write. (Figures: before VM 1 modifies page C, both VMs reference the same shared copy; after the modification, VM 1 has its own private copy of C.)

Effectiveness of TPS in ESX. Real-world page sharing in production deployments (data from VMware):
10x WinNT guests: 2048 MB total, 880 MB shared (42.9%), 673 MB reclaimed (32.9%).
9x Linux guests: 1846 MB total, 539 MB shared (29.2%), 345 MB reclaimed (18.7%).
5x Linux guests: 1658 MB total, 165 MB shared (10.0%), 120 MB reclaimed (7.2%).

Performance impact. In addition to the default Mem.ShareScanTime of 60 minutes, the minimal value of 10 minutes was tested, which potentially introduces the highest page-scanning overhead ("Pshare" = TPS). Figure 10 ("Performance impact of transparent page sharing", image © VMware Inc.) confirms that enabling page sharing introduces negligible performance overhead in the default setting.
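The scan-hash-verify-share cycle of TPS can be sketched as follows. This is a simplified illustration, not VMware's implementation: the second-level page table and copy-on-write marking are simulated with plain dictionaries, and each host frame is assumed to be referenced by a single guest page before the scan.

```python
import hashlib

def scan_and_share(pages, page_table, cow_pages):
    """pages: host frame number -> page content (bytes)
    page_table: guest physical page -> host frame number (each frame private initially)
    cow_pages: set of host frames that must be copied before being written."""
    hash_table = {}                                  # content hash -> frame holding the shared copy
    for gpp, frame in list(page_table.items()):
        h = hashlib.sha1(pages[frame]).digest()      # the hash only selects candidates
        if h in hash_table:
            candidate = hash_table[h]
            # Hashes can collide: confirm with an exact content comparison.
            if candidate != frame and pages[candidate] == pages[frame]:
                page_table[gpp] = candidate          # remap the guest page to the shared frame
                cow_pages.add(candidate)             # future writes must copy-on-write
                del pages[frame]                     # reclaim the duplicate frame
        else:
            hash_table[h] = frame

def write_page(gpp, data, pages, page_table, cow_pages, free_frame):
    """Copy-on-write: make a private copy before modifying a shared page."""
    frame = page_table[gpp]
    if frame in cow_pages:
        pages[free_frame] = pages[frame]             # private copy in a free frame
        page_table[gpp] = free_frame
        frame = free_frame
    pages[frame] = data

if __name__ == "__main__":
    pages = {0: b"A" * 4096, 1: b"A" * 4096, 2: b"B" * 4096}
    page_table = {"vm1:p0": 0, "vm2:p0": 1, "vm2:p1": 2}
    cow = set()
    scan_and_share(pages, page_table, cow)
    print(page_table, sorted(cow))                   # vm2:p0 now shares frame 0, frame 0 is COW
    write_page("vm2:p0", b"C" * 4096, pages, page_table, cow, free_frame=3)
    print(page_table)                                # vm2:p0 got a private copy in frame 3
```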
Example 2: deduplicating a file system for VM image storage

Storing VM images in an IaaS platform. VM images are stored in Glance (OpenStack) or in OpenNebula's image service. The number of VM images can be high: multiple base VM images (different OSes, different versions) and many VM images per user. The size of one VM image is typically several GB. (Figure: distribution of VM image sizes in a private cloud, with outliers up to 160 GB.)

Potential for deduplication in VM images. A study by A. Liguori and E. van Hensbergen (IBM) analyses the amount of duplicate file data within and between VM images: same and different OS images (Fedora 9) from separate installations, different OSes from the GNU/Linux family, and Windows XP; the metric is the percentage of duplicate file data. Method: walk the file systems, produce a SHA-1 cryptographic hash for each file and associate it with the file and its number of hard links (hard-linked copies are discounted as false duplicates); duplicate hashes are then looked for within an image, or across the concatenated file lists of two images, and the total size of duplicate files is compared to the total size of all files to obtain the percentage of duplicates.

Findings. Typical root disk images have around 5% of duplicate file data within a single image after the initial installation, and this amount seems to be increasing (Fedora 7 had 4.1%, Fedora 9 has 5.3%, or 116 MB). Comparing the installation of a 32-bit Fedora 9 system with a 64-bit Fedora 9 system shows roughly 60% overlap between the two images, consisting primarily of the non-binary portions of the installation (configuration files, fonts, icons, documentation, etc.) (Figure 2: different architectures). Figure 1 (different Fedora 9 flavors) shows the similarity between separate installs of several configurations of the Fedora 9 x86-64 distribution: Base vs. Base 96%, Base vs. Office 88%, Base vs. SDK 85%, Base vs. Web 95%; Office vs. Office 96%, Office vs. SDK 79%, Office vs. Web 87%; SDK vs. SDK 96%, SDK vs. Web 85%; Web vs. Web 96%. Figure 3 (different distributions) compares versions of Fedora as well as 32-bit versions of Ubuntu and OpenSuSe: Fedora 7 vs. Fedora 8 34%, vs. Fedora 9 22%, vs. Ubuntu 8%, vs. OpenSuSe 11 15%; Fedora 8 vs. Fedora 9 31%, vs. Ubuntu 10%, vs. OpenSuSe 11 16%; Fedora 9 vs. Ubuntu 11%, vs. OpenSuSe 11 21%; Ubuntu 8.04 vs. OpenSuSe 11 8%. As one might expect, adjacent versions of the same distribution have relatively high degrees of overlap, ranging from 22% to 34%, despite about a year between their respective releases. The effect is cumulative: across the three distributions, which total about 6 GB of root file system data, 2 GB of that data is duplicated, resulting in approximately 1.2 GB of wasted space. The overlap between the Fedora installations and those of other distribution vendors is less striking.
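The kind of comparison described above (hash every file, discount hard links, compare the hash sets of two images) can be approximated with a short script. This is a hypothetical sketch of such an analysis, not the authors' actual tool; it assumes the two images are mounted or extracted under the given directories:

```python
import hashlib, os

def file_hashes(root: str) -> dict:
    """Map SHA-1 digest -> total bytes of files with that content under root.
    Hard-linked copies are counted once (same inode), as in the study."""
    sizes = {}
    seen_inodes = set()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                st = os.stat(path)
            except OSError:
                continue
            if (st.st_dev, st.st_ino) in seen_inodes:
                continue                      # hard link: not a real duplicate
            seen_inodes.add((st.st_dev, st.st_ino))
            with open(path, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            sizes[digest] = sizes.get(digest, 0) + st.st_size
    return sizes

def overlap_percent(image_a: str, image_b: str) -> float:
    """Share of file data (by size) duplicated between two extracted images."""
    a, b = file_hashes(image_a), file_hashes(image_b)
    duplicated = sum(min(a[h], b[h]) for h in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    return 100.0 * 2 * duplicated / total if total else 0.0

if __name__ == "__main__":
    # Hypothetical paths to two unpacked root file systems.
    print(overlap_percent("/mnt/fedora9-x86", "/mnt/fedora9-x86_64"))
```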
Case study: LiveDFS. An adaptation of a legacy file system (ext3) to support deduplication. Inline deduplication: duplicates are detected directly when processing write operations. Adapted to commodity hardware (no specialized hardware, regular amounts of RAM). Integrated and evaluated in OpenStack.

LiveDFS design goals. Performance of VM operations: it should save storage, but with a small impact on VM startup performance (reads). Compatibility with existing tools: a POSIX file system interface and support for deletion. Target "low-cost" commodity hardware: memory is limited.

LiveDFS implementation. An extension of an existing file system (ext3), implemented at a low level as a kernel module. It should modify the original FS as little as possible, with no modifications at all to the FS interface. Deduplication works at the level of fixed-size blocks: Linux and ext3 manipulate memory and files in fixed-size pages/blocks of 4 KB.

Structure of an ext3 inode. (Figure.)

ext3 file system structure. A partition is split into several groups to maximize locality between metadata and data and to reduce disk-seek overheads. Structure of a group: super-block, inode and block bitmaps (which allow knowing what is free or used), metadata, data blocks.

Goal: deduplicate data blocks (the data blocks within each group of the partition structure above).

How to: know whether a block already exists in a group, so as to avoid storing it twice? Handle modifications to a block referenced by several files after deduplication? Detect when a block is no longer referenced and should be freed to claim back disk space?

Detect that a block already exists. Associate a 16-byte MD5 fingerprint to each block and maintain a listing of fingerprints. When a new block is to be written: hash its content and check the fingerprint store; if the fingerprints match, check the content bit by bit, since a false positive is possible (two different blocks generating the same fingerprint is rare but possible); if the content matches, store a reference to the existing block. How should the list of fingerprints be maintained?

Maintaining and accessing the list of fingerprints. For a single 4 TB disk with 4 KB blocks and 16-byte fingerprints: 2^42 / 2^12 = 2^(42-12) = 2^30 blocks (about one billion blocks), and 2^30 x 16 bytes = 2^30 x 2^4 = 2^34 bytes = 16 GB. ☹ Up to 16 GB of RAM just for the deduplication of one disk is a lot: no space would be left for the page cache. 💡 Store the fingerprints on disk instead. ☹ But then each write requires two disk seeks (fingerprint lookup, then the actual write): the overhead is too high. Solution: a two-step approach. The FP filter is a small in-memory data structure that answers no/maybe; the FP store keeps the fingerprints on disk and is accessed only when the FP filter answers 'maybe'.
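A quick back-of-the-envelope check of these figures, generalized to arbitrary disk and block sizes (an illustrative helper, not part of LiveDFS):

```python
def fingerprint_metadata_bytes(disk_bytes: int, block_bytes: int = 4096,
                               fingerprint_bytes: int = 16) -> int:
    """RAM needed to keep one fingerprint per block entirely in memory."""
    num_blocks = disk_bytes // block_bytes
    return num_blocks * fingerprint_bytes

if __name__ == "__main__":
    TB, GB = 2 ** 40, 2 ** 30
    # 4 TB disk, 4 KB blocks, 16-byte MD5 fingerprints -> 2^30 blocks, 16 GB of RAM
    print(fingerprint_metadata_bytes(4 * TB) / GB, "GB")   # prints 16.0 GB
```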
LiveDFS block writing procedure. (Figure 2 of the LiveDFS paper by Chun-Ho Ng, Mingcao Ma, Tsz-Yeung Wong, Patrick P. C. Lee, and John C. S. Lui: "Every block written to LiveDFS goes through the same decision process"; reproduced with publisher permission.) A data block to be written (coming from the VFS layer) is first checked against the in-memory FP filter. If the filter answers "no", the verdict is NEW BLOCK: block allocation, FP store update, FP filter update. If it answers "maybe", a disk access checks the FP store: on a match the verdict is EXISTING BLOCK (FP store update, FP filter update); otherwise the block is handled as a new block.

Fingerprint store (on disk). Some blocks are allocated to store fingerprints (16 GB in total for our 4 TB disk). The store is indexed by block number: it does not allow finding a block directly by its fingerprint, which is the role of the in-memory FP filter. Each entry holds the MD5 checksum (16 bytes) and a reference counter (4 bytes). The counter is increased when the block is referenced, and the block is removed when the counter drops to 0. (Figure 3: deployment of a fingerprint store in a block group; from the original paper with publisher permission.)

Fingerprint filter (in memory). A hash table whose key is made of the first n+k bits of the fingerprint (an n-bit index key and a k-bit bucket key). Having multiple values for the same key is possible, particularly if n+k is small. If there is no match, the block is written normally and the FP store is updated with a new entry; otherwise, the FP store is checked first. (Figure 4: design of the fingerprint filter; from the original paper with publisher permission. In ext3 the default block size is 4 KB and the default block group size is 128 MB, so there are 32,768 blocks per block group and the fingerprint store consumes 655,360 bytes per group, i.e., 32,768 entries of 20 bytes each.)

Generation and size of the FP filter. The FP filter must be regenerated in RAM when mounting the file system; the authors report about 6 minutes per TB of data (this would be faster with today's SSD drives). The size of the FP filter depends on the values of n and k: small values mean more false positives, large values mean more RAM used. The authors use n = 19 and k = 24, for a total structure size of at most 2 GB. This size does not depend on the size of the disk (the same memory is used for 2 TB, 4 TB, ...), but it must be multiplied by the number of disks.

Handling modifications to deduplicated blocks. If a block is modified, the change must not be visible to the other files that reference the same block. The copy-on-write principle is used again, as for modifications to shared pages in ESX TPS. If the reference counter is ≥ 2: the block is first copied to a new location, the reference is updated in the inode, the reference counter is decremented, and the write is applied to the new block. Issue: this may create higher fragmentation and more disk seeks.
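Below is a compact, illustrative model of this write path: an in-memory filter answering no/maybe, an on-disk fingerprint store with reference counts, byte-level verification, and copy-on-write when a shared block is overwritten. It is a sketch under simplifying assumptions (everything lives in Python dictionaries, and the filter key is a fixed 4-byte prefix standing in for the first n+k bits), not the LiveDFS kernel code:

```python
import hashlib

class ToyLiveDFS:
    def __init__(self):
        self.filter = set()      # in-memory FP filter: fingerprint prefixes (answers no/maybe)
        self.fp_store = {}       # "on-disk" store: fingerprint -> (block number, refcount)
        self.blocks = {}         # block number -> content
        self.next_block = 0

    @staticmethod
    def fingerprint(data: bytes) -> bytes:
        return hashlib.md5(data).digest()            # 16-byte fingerprint, as in LiveDFS

    def write_block(self, data: bytes) -> int:
        fp = self.fingerprint(data)
        prefix = fp[:4]                              # stands in for the first n+k bits
        if prefix in self.filter:                    # filter says "maybe": consult the FP store
            entry = self.fp_store.get(fp)
            if entry is not None:
                blk, refs = entry
                if self.blocks[blk] == data:         # verify content (collisions are possible)
                    self.fp_store[fp] = (blk, refs + 1)
                    return blk                       # EXISTING BLOCK: only a reference is added
        # NEW BLOCK: allocate it, then update the FP store and the FP filter.
        blk = self.next_block
        self.next_block += 1
        self.blocks[blk] = data
        self.fp_store[fp] = (blk, 1)
        self.filter.add(prefix)
        return blk

    def overwrite_block(self, blk: int, new_data: bytes) -> int:
        """Copy-on-write: a block shared by several files (refcount >= 2) is not
        modified in place; the write goes to a (possibly new) block instead."""
        old_fp = self.fingerprint(self.blocks[blk])
        _, refs = self.fp_store[old_fp]
        if refs >= 2:
            self.fp_store[old_fp] = (blk, refs - 1)  # drop one reference to the shared block
        else:
            del self.fp_store[old_fp]                # sole owner: the old content is freed
            del self.blocks[blk]
        return self.write_block(new_data)            # deduplicated write of the new content


if __name__ == "__main__":
    fs = ToyLiveDFS()
    a = fs.write_block(b"x" * 4096)
    b = fs.write_block(b"x" * 4096)     # duplicate: same block number, refcount becomes 2
    c = fs.overwrite_block(b, b"y" * 4096)
    print(a, b, c)                      # a == b, c is a new block
```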
Prefetching and journaling. Write accesses for a VM are typically performed as a linear stream of blocks: when the FP store is accessed for one block, there is a high probability that the following FP entries will be accessed next. The prefetching mechanism therefore pre-reads multiple FP entries from the FP store instead of a single one. Journaling writes updates to the disk as a series of transactions that allow recovering a stable state (most modern file systems use it); updates to the FP store are also integrated in the journal.
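Since writes arrive as a mostly linear stream of blocks, a fingerprint-store reader can fetch a window of consecutive entries on a miss instead of a single one. A hypothetical sketch of that idea (names and sizes invented for illustration):

```python
class FPStoreReader:
    """Reads fingerprint-store entries with simple sequential prefetching."""

    def __init__(self, read_entries_from_disk, window=64):
        self._read = read_entries_from_disk   # function: (first_block, count) -> list of entries
        self.window = window                  # how many consecutive entries to prefetch
        self.cache = {}                       # block number -> fingerprint entry

    def get(self, block_no):
        if block_no not in self.cache:
            # One disk access brings in the requested entry plus the following ones,
            # so later lookups in the same linear stream hit in memory.
            for i, entry in enumerate(self._read(block_no, self.window)):
                self.cache[block_no + i] = entry
        return self.cache[block_no]


if __name__ == "__main__":
    # Fake on-disk store: the entry for block b is just a tuple, for demonstration.
    fake_disk = lambda first, count: [(first + i, f"md5-of-{first + i}") for i in range(count)]
    reader = FPStoreReader(fake_disk, window=8)
    print(reader.get(100))   # triggers one "disk" read of 8 entries
    print(reader.get(103))   # served from the prefetched window, no extra read
```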
Integration in OpenStack. (Figure 6: LiveDFS deployment in an OpenStack cloud; from the original paper with publisher permission.) Users interact with the Compute service (Nova), which controls the compute nodes, and with the Image service (Glance), which stores and retrieves image information. Images are stored through Glance on a VM image storage backend (for instance Amazon S3, or a local server on which Glance is deployed); the partition of that backend is mounted using LiveDFS, and compute nodes fetch VM images from it. LiveDFS is implemented as a Linux kernel-space module, POSIX compliant and operating at the VFS level, so its deployment is transparent: it serves as a storage layer between Glance and the VM image storage backend. Administrators upload VM images through Glance (OpenStack uses the euca2ools command-line tools provided by Eucalyptus to add and delete VM images), the images are stored in the LiveDFS partition, and when a user wants to start a VM instance the corresponding image is fetched from that partition.

Performance results. Tests on a single low-end server: a Dell Optiplex 980 with an Intel Core i5 760 CPU at 2.8 GHz, 8 GB of DDR-III RAM, and a 1 TB dedicated HDD for the disk partition. Comparison with an unmodified ext3 file system. Several LiveDFS variants are evaluated (Table 1, with spatial locality / prefetching / journaling enabled or disabled): LiveDFS-J enables journaling only; LiveDFS-S enables spatial locality only; LiveDFS-SJ enables spatial locality and journaling; LiveDFS-all enables all three. When prefetching is disabled, the step of prefetching a fingerprint store into the page cache is bypassed; when journaling is disabled, alternative calls write fingerprints and reference counts directly to the fingerprint stores on disk.

Sequential writes. (Figure 7: throughput of sequential write, in MB/s, for the LiveDFS variants and ext3; plot from the original paper with publisher permission.) The impact on write throughput is limited.

Sequential reads. (Figure 8: throughput of sequential read; plot from the original paper with publisher permission.) Read throughput is not impacted.

Sequential duplicate write. (Figure 9: throughput of sequential duplicate write; plot from the original paper with publisher permission.) This workload writes a file whose content is exactly the same as one already existing on disk: the cost is only writing the entries in the FP store (small) instead of writing full blocks. Without journaling, however, there are many disk seeks and no write combining.

OpenStack integration. Are the space-saving objectives achieved? What is the impact on VM startup and saving times? In these experiments, at most one compute node retrieves a VM image at a time; the scenario where a compute node retrieves multiple VM images simultaneously is evaluated separately (Experiment B3). Dataset: the authors use a collection of 42 deployable VM images in Amazon Machine Image (AMI) format (Table 2). The operating systems include ArchLinux, CentOS, Debian, Fedora, OpenSUSE, and Ubuntu, with images of both x86 and x64 architectures for each distribution, using the recommended configuration for a basic server. Networked installation is chosen so that all installed software packages are up to date. Each VM image has a size of 2 GB and is created as a single monolithic flat file.

Table 2. VM images used in the experiments (distribution: versions, each in x86 and x64, and total number of images).
ArchLinux: 2009.08, 2010.05 (4 images).
CentOS: 5.5, 5.6 (4).
Debian: 5.0.8, 6.0.1 (4).
Fedora: 11, 12, 13, 14 (8).
OpenSUSE: 11.1, 11.2, 11.3, 11.4 (8).
Ubuntu: 6.06, 8.04, 9.04, 9.10, 10.04, 10.10, 11.04 (14).

Space usage. Storing the 42 images yields about 40% overall space saving over ext3 when zero-filled blocks are not counted in the ext3 baseline. LiveDFS uses around 21 GB of space in total, while plain ext3 consumes 84 GB, so the saving can even reach 75%. Per distribution, savings range from 33% to 60%. (Figure 10: (a) cumulative space usage, (b) average space usage of a VM image for each distribution; plots from the original paper with publisher permission.)

Store/startup time. Time to write a VM image (Figure 11: average time required for inserting a VM image, per distribution) and time to start a VM (Figure 12(a): startup time for a VM instance, per distribution); plots from the original paper with publisher permission. Startup time is measured from the moment the euca2ools command launching the instance is issued until the VM starts running in the KVM hypervisor, after the compute node has fetched the corresponding VM image from the Glance server. Writes are faster than with ext3: LiveDFS benefits from not writing duplicate blocks. Reads (VM startup) are slower: with deduplication, duplicated blocks are not in sequential order on the disk and require disk seeks (expensive), i.e., fragmentation. The benefit/cost ratio nevertheless remains high.
Part B: dynamic IaaS management

VM migration. ✓ Virtualization decouples the OS from the physical hardware. ❌ Still, a VM is tied to the physical host on which it was launched. VM migration allows dynamically moving a VM from one physical host to another: scheduled maintenance, workload consolidation or re-balancing, deployment on a better machine, moving to another data center (costly), etc. In 2018, Google reported performing more than 1,000,000 migrations a month, with a 50 ms median (300 ms tail) "blackout" (unresponsiveness of the VM). How can they achieve this?

VM migration vs. process migration. Migrating an entire virtual machine also migrates the kernel and all the machine's resources: kernel data structures, user-space processes, active network (TCP) connections. Process migration was well studied in the 90s; it requires important changes to the target and source kernels and must deal with the complexity of shared resources (file descriptors, shared memory regions, etc.). VM migration is simpler because control lies with the hypervisor.

Steps in VM migration. Migrating memory; migrating local resources (network connections, local storage); migrating hypervisor-level VM state; starting the VM on the destination host.

Migrating memory. The VM is typically running live services, so the memory transfer must minimize the impact of migration on three metrics: the downtime, during which the VM must completely halt and restart on the destination host (service downtime for clients); the total migration time; and the performance degradation during migration preparation, since the memory transfer uses bandwidth and CPU on both the source and the destination node and the service can be degraded. The transfer takes advantage of the guest-physical to host-physical mapping, which is performed and known by the hypervisor.

Memory copy models: pure stop-and-copy. The easiest to implement: the VM is halted, all its memory pages are copied by the hypervisor from the source host to the destination host, and the VM is started on the destination. It does not require maintaining a copy of the VM on the source host after the transfer, but the downtime is significant and proportional to the VM's memory state: the downtime equals the total migration time, after which the VM runs at full performance.
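To see why pure stop-and-copy downtime is proportional to the VM's memory size, here is a tiny calculation helper (illustrative, with made-up parameter values):

```python
def stop_and_copy_downtime_s(memory_bytes: float, link_bytes_per_s: float) -> float:
    """With stop-and-copy, the VM is down for the whole memory transfer."""
    return memory_bytes / link_bytes_per_s

if __name__ == "__main__":
    gbit_s = 1e9 / 8                                   # 1 Gbit/s link, in bytes per second
    # An 800 MB VM (as in the Xen live-migration experiments) over a 1 Gbit/s link:
    print(f"{stop_and_copy_downtime_s(800 * 2**20, gbit_s):.1f} s of downtime")  # ~6.7 s
```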
Memory copy models: pure demand-migration. A short stop-and-copy phase transfers only the essential kernel data structures to the destination, the VM is then started at the destination, and page faults cause pages to be copied from the source on demand. The downtime is small (setup only), but performance is degraded by repeated, synchronous page transfers, and the source machine must be kept up until all pages have been copied.

Memory copy models: pre-copy. A bounded, iterative push (pre-copy) phase is followed by a short stop-and-copy phase. Pre-copy occurs in rounds: the pages transferred at round n are only those that have been modified since round n-1. This is the model used for VM migration in Xen, and the same mechanisms are implemented in VMware ESX. Timeline: pages are pre-copied from source to destination while the VM keeps running (with some performance degradation), the VM is then halted on the source, the downtime covers only the small remaining transfer plus setup, and the VM restarts on the destination at full performance.

Rationale for pre-copy. A process typically actively uses only a subset of its allocated memory in a given time period (the working-set model): some pages are frequently used, many pages are not currently used. The same behavior is expected for a complete computer system (VM = OS + processes): the writable VM working set. This generally holds for server VMs; an exception is rogue processes constantly filling memory.

Writable working set size. Examples include SPEC CINT2000 and a Linux kernel compilation benchmark. (Figure 2: WWS curve for a complete run of SPEC CINT2000 in a 512 MB VM, covering gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2 and twolf; number of pages vs. elapsed time in seconds.)

Implementing pre-copy. Modified pages are tracked with a shadow page table (SPT), an overlay copy of the guest-physical to host-physical page table in which all pages are marked read-only. Write accesses therefore cause page faults that are intercepted by Xen: the dirty page is recorded, the page is marked as in the original table (e.g., R/W), and the memory access is retried. Dynamic rate limiting: the bandwidth used for page transfers affects the performance of the VM and its applications, so the bandwidth is adapted across pre-copy rounds. The first pre-copy round transfers all pages at a minimal rate; the following rounds transfer only the modified pages, at a rate adjusted according to the number of pages dirtied per second during the previous round.

Finalizing VM migration. The guest OS is halted and the last dirtied pages are copied to the destination. The hypervisor-level state of the VM is transferred from the source hypervisor to the destination hypervisor, and the guest-physical to host-physical page table is reconstructed by the hypervisor on the destination machine (guest physical pages now map to different host physical pages). The VM is instructed to restart where it halted; it restarts its device drivers and updates its local clock, which requires a small paravirtualization layer to handle the restart.
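A toy simulation of the iterative pre-copy schedule: each round resends only the pages dirtied during the previous round, and the migration ends with a short stop-and-copy of the remaining dirty pages. The parameters (page size, dirty rate, bandwidth, stop conditions) are invented for illustration; real Xen/ESX policies are more elaborate, for example the dynamic rate limiting described above:

```python
def simulate_precopy(total_pages=200_000, page_kb=4, link_mb_s=500,
                     dirty_pages_per_s=5_000, max_rounds=10, stop_threshold=2_000):
    """Return (total_migration_time_s, downtime_s) for an iterative pre-copy migration."""
    pages_per_s = link_mb_s * 1024 // page_kb      # pages transferable per second
    to_send = total_pages                          # round 1: transfer everything
    elapsed = 0.0
    for _ in range(max_rounds):
        round_time = to_send / pages_per_s
        elapsed += round_time
        # Pages dirtied while this round was being transferred must be resent.
        dirtied = min(int(dirty_pages_per_s * round_time), total_pages)
        to_send = dirtied
        if dirtied <= stop_threshold:              # small enough: switch to stop-and-copy
            break
    downtime = to_send / pages_per_s               # VM halted only for the final transfer
    return elapsed + downtime, downtime

if __name__ == "__main__":
    total, down = simulate_precopy()
    print(f"total migration time ~{total:.1f} s, downtime ~{down * 1000:.0f} ms")
```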
Migration of an Apache web server VM. A VM with 800 MB of memory, serving 512 KB static files to 100 concurrent clients. (Figure 8: effect of migration on web server transmission rate, sampled over 100 ms and 500 ms; results of migrating a running web server VM.) Throughput is about 870 Mbit/s before migration; during the first pre-copy iteration, which lasts 62 seconds, it drops to about 765 Mbit/s; the further iterations take 9.8 seconds at about 694 Mbit/s; the total downtime is 165 ms.

Migrating a complex web server. SPECweb99 is an application-level benchmark for web servers (dynamic content, CGI, etc.), intensive in disk (and therefore network) throughput. Setup: 350 concurrent client connections, 90% server load, VM with 800 MB of memory. (Figure 9: results of migrating a running SPECweb VM; iterative progress of live migration, SPECweb99 with 350 clients at 90% of the maximum load, 800 MB VM.) Total data transmitted: 960 MB (x1.20 the VM memory size). The first iteration involves a long, relatively low-rate transfer of the VM's memory: 676.8 MB transferred in 54.1 seconds. These early phases allow non-writable working-set data to be transferred with a low impact on the active services. In the final iteration the domain is suspended, the remaining 18.2 MB of dirty pages are sent, and the VM resumes execution on the remote machine; in addition to the 201 ms required to copy this last round of data, another 9 ms elapse while the VM starts up, for a total downtime of 210 ms.

Interactive application: migrating a Quake 3 server VM. A 64 MB VM with 6 concurrent players. (Figure 11: results of migrating a running Quake 3 server VM; iterative progress of live migration, 6 clients, 64 MB VM.) Total data transmitted: 88 MB (x1.37 the VM memory size). The final iteration leaves only 148 KB of data to transmit; in addition to the 20 ms needed to copy this last round, 40 ms are spent on start-up overhead, for a total downtime of 60 ms.

Quake 3 server VM migration: impact on clients. (Figure 10: effect on packet response time, measured as packet inter-arrival time, of migrating a running Quake 3 server VM.) Migration 1 causes a downtime of 50 ms and migration 2 a downtime of 48 ms.

Paravirtualized optimizations. An OS-level driver can help improve migration performance by freeing page-cache pages: only the OS can know which of its pages are useful or free, namely truly free pages available to the memory allocator and page-cache pages holding unmodified disk data. These pages are returned to the Xen hypervisor (and thus made available to other VMs) by inflating a balloon, which reduces the length (in number of pages) of the first copy iteration.

Paravirtualized optimizations: stun rogue processes. Some processes keep dirtying memory too fast for the network to catch up; the kernel can monitor the writable working set of individual processes. (Figure: dynamic rate-limiting, transferred pages; it is not always appropriate to select a single network bandwidth limit for migration traffic.)