Cloud Computing (part I)
Instituto Superior Técnico
Rodrigo Bruno
Summary
These notes cover the basics of cloud computing, focusing on the concept of virtual machines (VMs). They discuss the evolution of virtualization from mainframe systems to modern cloud architectures, highlighting the role of hypervisors in managing VMs. The notes also analyze hypervisor scheduling and virtualization problems.
Full Transcript
Computer Systems Engineering
Cloud Computing (part I)
Rodrigo Bruno

Recap (Protocols)

Recap (Routing)

Recap (OSI Layers)
○ Application Layer: HTTP, FTP, SSH, etc.
○ Presentation Layer: encoding, compression, encryption/decryption, etc.
○ Session Layer: communication sessions
○ Transport Layer: reliable communication (TCP)
○ Network Layer: multi-network communication (IP)
○ Data Link Layer: frames between two nodes
○ Physical Layer: bits over a physical medium

Recap (Internet)

For today!
Cloud Computing (part I)!
○ Virtual Machines

Mainframe Virtualization (early days)
Logical VMs on a mainframe: VM #1 to VM #7, each running one task (Task A to Task G)

Stagnation of virtualization (1980s-1990s)
Emergence of the PC:
○ Cheap hardware: sharing not needed
○ Single-user applications
Servers:
○ High volume manufacturing changed server paradigm from custom-built to general-purpose processors
○ Applications share the same OS
   If an application crashes, the whole server can crash (poor isolation)

Infrastructures in the late 1990s – early 2000s

Pre-virtualization scenario (circa 2000)
Four x86 servers, each running two applications on its own operating system:
○ Windows XP: 12% hardware utilization
○ Windows 2003: 15% hardware utilization
○ Suse Linux: 18% hardware utilization
○ Red Hat Linux: 10% hardware utilization
Problem?

Solution: virtualization!
One physical server hosts several Virtual Machines (VMs)
Each VM runs its own operating system and applications, under the illusion of having its own physical server
A VM is unaware that it is sharing the underlying hardware with other VMs
The Virtual Machine Monitor (VMM) and Hypervisor isolate and orchestrate various VMs

Post-virtualization scenario
Four virtual machines on one x86 physical server, each running two applications on its own operating system:
○ Windows XP (12%)
○ Windows 2003 (15%)
○ Suse Linux (18%)
○ Red Hat Linux (10%)
Total: 55% hardware utilization
Each application runs on its own operating system
Each operating system does not know it is sharing the underlying hardware with others
A hypervisor manages the virtual machines
Much better hardware utilization, without compromising application protection

Hypervisor scheduling
Each OS “thinks” it has the host's processor, memory, and other resources, all to itself
Actually, the hypervisor controls the host processor and resources, allocating what is needed to each operating system, in turn (round-robin)
It also makes sure that the virtual machines (guest OS and applications) cannot disrupt each other

Hypervisor types

Why VMs instead of more processes?
Stronger isolation
○ VMM is a simpler component compared to an OS: easier to verify security
No shared file system
Performance isolation
Different software requirements
○ different libraries
○ different OS versions
Figure: applications on a shared OS on hardware, versus VMs (each with its own OS) on a VMM on hardware

Consolidation by virtualization
Higher efficiency (higher hardware utilization), lower energy footprint

System-level virtualization
A physical machine multiplexes the underlying CPU among multiple VMs
○ Similar to how an OS multiplexes processes on CPU (time-sharing)
The VMM performs machine switch (much like context switch)
○ Run a VM for a bit, save its state, switch to another VM, and so on…
What is the problem?
○ Unlike user processes, a guest OS expects to have the highest privileges in the machine, for example, to have unrestricted access to hardware
○ To keep VMs isolated from each other, guest OSs must be less privileged than the VMM

Privilege rings (no virtualization)
○ Ring 0 (most privileged): OS
○ Ring 1: device drivers
○ Ring 2: device drivers
○ Ring 3 (least privileged): applications
OS kernels expect to run in Ring 0 with full privileges
If a privileged instruction executes in Ring > 0, it traps (an exception occurs and the OS is invoked)
Access control is enforced by the CPU

De-privileging the guest OS
The VMM runs in ring 0 (with full privileges)
The guest OS runs in ring 1 (cannot execute privileged instructions, which trap into ring 0)
System calls by the application that used to trap into the guest OS now trap into the VMM
○ The VMM doesn’t know how to handle the trap
○ The VMM calls the usual guest OS trap handler
○ The guest OS trap handler returns (privileged instruction, so it traps to the VMM)
○ The VMM returns to the application
All privileged instructions executed by the guest OS trap to the VMM, which emulates their
effect (but under its control): trap and emulate

Gerald Popek and Robert Goldberg formalized the requirements for a processor to support virtualization efficiently by trap and emulate:
○ Sensitive instruction: can change system state, or its semantics depend on system state
○ Privileged instruction: can run in privileged mode only (traps otherwise)
○ Theorem: virtualization is possible if all sensitive instructions are privileged
“A virtual machine is an efficient, isolated duplicate of the real machine”
○ Efficient multiplexing: non-sensitive instructions should execute directly on the hardware (virtualization should have a minor effect on performance)
○ Isolation: executed programs may not affect the system resources or other programs
○ Transparency: the behavior of a program executing under the VMM should be the same as if the program is executed directly on the hardware (except possibly for timing and resource availability)

Virtualization problems
Sensitive instructions must be a subset of privileged instructions
– The x86 (32-bit) architecture does not satisfy this!
This means that:
– some sensitive instructions are not privileged
– they behave differently when executed at different privilege levels
– the x86 instruction set contains 17 sensitive, unprivileged instructions that do not trap if ring > 0
– the x86 was not designed with virtualization in mind!
These instructions are harmless when running multiple processes but break the VM abstraction

Root Mode Privilege Level

Virtual Memory (no virtualization)

Virtualizing Virtual Memory
Figure: guest pseudo-physical memory of VM1 to VM4, mapped by the hypervisor to machine physical memory

Two-level address translation
Guest Virtual Address (gVA) -> [1: Guest Page Table] -> Guest Physical Address (gPA) -> [2: Hypervisor Page Table] -> System Physical Address (sPA)

Guest virtual addresses (managed by processes)
Guest physical addresses (managed by guest OS)
System physical addresses (managed by hypervisor)
Guest physical addresses are hypervisor virtual addresses

Figure: two-dimensional page walk from the guest virtual address (gVA), through guest physical memory (gcr3), to the system physical address (sPA) in system physical memory
Maximum possible memory accesses for one two-dimensional page walk: 5 + 5 + 5 + 5 + 4 = 24

Summary
Virtual Machines
○ Privilege rings
○ Virtualizing Virtual Memory

Where to go now?
Cloud Computing (part II)!
○ Virtual Machines as a catalyst for Cloud Computing

Computer Systems Engineering
Cloud Computing (part II)
Rodrigo Bruno

Recap (hypervisor)!

Recap (privilege rings)!

Recap (x86 virtualization problem)!
popf
○ Pops 16 bits from stack to the %eflags register
○ Bit 9 of %eflags enables/disables interrupts, i.e., it is a sensitive instruction
○ However, popf is not privileged…
What happens if guest OS (ring 1) runs popf?
○ In Ring 0, popf can set bit 9
○ In Ring 1-3, the CPU will silently ignore the attempt by popf to change bit 9!
○ What should happen is a trap, so that the VMM can emulate the change to the interrupt flag and decide which interrupts to forward to the guest OS

Recap (virtualizing virtual memory)!

For today!
Cloud Computing (part II)!
○ What is Cloud Computing
○ Deployment models
○ Service models
○ Amazon Web Services

Pre-cloud enterprise infrastructures
On-premises Era
Companies need to build their own data center (even if small)
This includes servers, networks, storage
All this needs to be managed
○ costs time…and money

Suppose you want to host a website for your business
This is what you need to do:
○ Buy a stack of servers
○ Monitor and maintain these servers
Do you know anything about this?
More traffic? More servers!
Less traffic? Idle servers
Fixed costs…

Instead of owning a data center, a company:
– rents a remote infrastructure (servers, networks, storage)
– uses it via an API, accessible through the Internet
The cloud provider takes care of managing the actual infrastructure, guaranteeing its services

Figure: total cost (CapEx + OpEx) versus use rate, with a minimum cost (CapEx)

Figure: total cost (OpEx) versus use rate

Nowadays, virtually every digital-based service:
– Is accessed via the Internet
– Uses a Web interface (or some app, e.g.
set top boxes)
– Is hosted on some type of cloud

Main Features of Cloud Computing
I can use it whenever I want
I don’t need to contact anyone
Available from anywhere
Anyone (authorized) can use it
Multi-tenancy (shared)
Unused by some, used by others
Users don’t worry about capacity
I only provision what I really need
I can see what I use and spend
Scale effect (sharing is cheaper)

Advantages of Cloud Computing
Trade Capital Expenses (CapEx) for Operational Expenses (OpEx)
– Don’t own infrastructure and avoid CapEx
– Pay per use and concentrate on OpEx
Benefit from economy of scale
– Prices are reduced by sharing a large infrastructure across many users
Stop guessing and anticipating how much capacity is needed
– Elastic infrastructure is there to be scaled whenever required
Increase speed and agility
– Provision resources when needed and within minutes
Stop spending money running and maintaining data centers
– Concentrate on what is really important: your core business
Go global fast and easy
– Leverage the global infrastructure of large cloud computing providers

Cloud Computing (a definition)
NIST - National Institute of Standards and Technology
– Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction
The owner of the resources (cloud provider) has an API (REST, typically) through which the resources can be provisioned and services can be used
Models:
– Deployment: how the cloud is organized (public, private, hybrid, etc.)
– Service: the type of services offered (basic infrastructure, SW platform, complete apps)

Deployment models
Public Cloud
Private Cloud
Hybrid Cloud

Public Cloud
The cloud provider makes resources (e.g., compute, storage, databases) and applications available to the general public
Easy and inexpensive setup because resources and underlying expenses are shared by a large number of users (economy of scale)
No investment or wasted (unused) resources because users pay only what they use

Private Cloud
This is a single tenant cloud environment that runs on dedicated infrastructure (exclusive control, no shared resources with other tenants)
It may reside on-premises, in a dedicated off-site data center, or with a managed private cloud provider
Note that:
– Public cloud is elastic and easily scalable
– Private cloud is constrained by fixed infrastructure

Hybrid Cloud
Cloud computing environment with a mix of on-premises, private cloud, and public cloud services
Leverages the best of the various worlds
Used for:
– Data compliance
– Critical data control
– Privacy (backend)
– Avoid provider dependency
– Performance (bursts)
Today, 82% of companies use hybrid clouds

Service Models
IaaS – Infrastructure as a Service
PaaS – Platform as a Service
FaaS – Function as a Service
SaaS – Software as a Service
StaaS – Storage as a Service
DBaaS – Database as a Service
NaaS – Network as a Service
...
XaaS – Everything as a Service

CaaS (Car as a Service): is it a “cloud” service?
○ Owning a car (CapEx)
○ Leasing (long term rental, avoiding CapEx)
   It is provided as a service, but not a cloud service:
   No Virtualization
   No Pay-Per-Use
○ Taxi (closer, but not a cloud service):
   Multi-tenant, elastic, pay-per-use, widely available
   Virtualized (in the sense of time sharing), but no self-service
○ Uber, Lyft, etc. (almost a “cloud” service)
   Multi-tenant, elastic, pay-per-use, widely available
   Virtualized (time sharing), and (sort-of) self-service
○ Self-driving Uber?

Cloud service models

Infrastructure as a Service (IaaS)

Platform as a Service (PaaS)

Software as a Service (SaaS)

Data Centers
Apple’s Mesa Data center (120000 m² in Arizona, USA)

Inside a Data Center
Apple’s Mesa Data center (120000 m² in Arizona, USA)

Levels of Security in a Data Center

Sines 4.0
Connects Europe to America and Africa
Power Usage Effectiveness of 1.1
Water Usage Effectiveness of 0
85% of electricity generated from renewables
○ Net Zero by 2030
○ 1.2 GW Green Giant Campus

PUE (Power Usage Effectiveness)
PUE (Power Usage Effectiveness) = Total Facility Power / IT Equipment Power
PUE of 2.0 indicates that for every watt of IT power, an additional watt is consumed to cool and distribute power to the IT equipment

Evolution of Data Centers
PUE: Power Usage Effectiveness (lower is better, ideally 1)

Aspects to Consider
24/7/365 accessibility
Physical risks, natural or caused by man
Extensibility
Low electricity cost
Energy consumption
Heat dissipation
Access to high-speed networks

The Importance of Electricity

Data Center Efficiency

IT Power Load Share

Annual Electrical Cost of a 1MW Data Center
NCPI: Network-Critical Physical Infrastructure

Improving Data Center Efficiency

Improvements

Efficiency of a Typical Data Center

Energy vs Performance
*From “Treehouse: A Case For Carbon-Aware Datacenter Software” (HotCarbon’22)

Energy Carbon Intensity
Energy Carbon Intensity = emitted greenhouse gases per kWh of energy
Lower is better!
How to reduce?
○ consume less energy
○ use energy from carbon-free sources
○ reduce embodied carbon (carbon footprint of HW production)

Energy Carbon Intensity
*From “The War of the Efficiencies: Understanding the Tension between Carbon and Energy Optimization” (HotCarbon’23)

Cloud computing providers

AWS global region map (2023)

The cloud computing market

Summary
Cloud Computing
○ definition
○ deployment models
○ service models
○ AWS

Where to go now?
Next four lectures:
○ How can we achieve secure communication over computer networks?
   Information Security (Chapter 11)
○ How can we measure the performance of our computer system?
   Performance (Chapter 6)
○ How to tolerate faults?
   Fault Tolerance (Chapter 8)
○ How to deal with system replication?
   Consistency (Chapter 10)

Computer Systems Engineering
Information Security
Rodrigo Bruno

Recap!
○ What is Cloud Computing
○ Deployment models
○ Service models
○ Amazon Web Services

For today!
Fundamentals of secure systems
Threats
Design Principles
Secure message exchange
○ integrity
○ authentication
○ confidentiality

Secure Systems
Authenticity
○ Is the agent’s claimed identity authentic?
Integrity
○ Is the request actually the one the agent made?
Authorization
○ Does the agent have the appropriate permission to perform the action?

Threats
Unauthorized information release
○ read a file without permission
○ read network messages destined to someone else
Unauthorized information modification
○ alter or replace a file or a network packet
   may go undetected
Unauthorized denial of use
○ forcing the system to crash
○ flooding the system with messages
○ blocking network traffic

High d(technology)/dt Poses Challenges For Security
Figure: security challenges versus d(technology)/dt

Usability and Security

Security is a Negative Goal

Design Principles

Security model based on complete mediation

Trusted Computing Base
Collection of trusted modules
in the system
○ the correctness of untrusted modules does not affect the security of the system
Guidelines
○ minimum number of modules
   fewer modules make it easier to reason about security
○ TCB modules should be as simple as possible
○ simple interactions between trusted and untrusted modules
   complex modules/interactions make it harder to detect security vulnerabilities

Real-world problem: how to exchange messages securely
Authentication
Integrity
Confidentiality

Cryptography
Cryptography uses mathematical tools that allow information to be exchanged securely with minimal overhead to the rightful participants and with extreme overhead to attackers.

Cryptographic Hash
For a given M, it should be easy to compute H(M) -> V
○ M is the input/message
○ H is the hash function
○ V is the hash (fixed size)
It must be difficult to compute M knowing only V
It must be difficult to find another M’ such that H(M’) = H(M)
V must be as short as possible but still long enough to avoid collisions
○ typical size falls between 160 and 256 bits ($2^{256}$ possible hashes for 256 bits)
Hashes can be used to verify integrity

Cryptographic Hash (Secure Hash Algorithm)
SHA1
SHA2

Key-Based Authentication Model (authentication + integrity)

Confidentiality

Cryptographic Ciphers - Shared-Secret Encryption / Symmetric Encryption
Figure labels: Sign, Verify

Cryptographic Ciphers (Advanced Encryption Standard)

Cryptographic Ciphers - Public-key /
Asymmetric Encryption
Symmetric encryption doesn’t guarantee non-repudiation
Pairs of private and public keys
○ data encrypted with a public key can only be decrypted with the corresponding private key
Public keys can be distributed freely
○ anyone can send an encrypted document to the user that has the corresponding private key
If you encrypt with a public key (of the recipient), only the recipient will be able to decrypt
More computationally expensive than symmetric encryption
Figure labels: Encrypt, Decrypt

Cryptographic Ciphers (Rivest–Shamir–Adleman)

Authentication + Integrity + Confidentiality
Figure: Alice encrypts M with the symmetric key E and signs Encrypt(M) with S_private-key(Alice); Bob verifies Sign(Encrypt(M)) with S_public-key(Alice) and decrypts Encrypt(M) with E

Authentication + Integrity + Confidentiality (with hashing)
Figure: Alice sends Encrypt(M), encrypted with E, together with Sign(Hash(M)), signed with S_private-key(Alice); Bob decrypts with E, hashes M, and verifies the signature with S_public-key(Alice)

Authentication + Integrity + Confidentiality (with hashing + recv can verify)
Figure: Alice encrypts M with E, hashes M, and applies Sign twice, with S_private-key(Alice) and S_public-key(Bob); Bob decrypts with E, applies Verify twice, with S_private-key(Bob) and S_public-key(Alice), and compares with the hash of M

Summary
Fundamentals of secure systems
Threats
Design Principles
Secure message exchange
○ integrity
○ authentication
○ confidentiality

Where to go now?
Next four lectures:
✓ How can we achieve secure communication over computer networks?
✓ Information Security (Chapter 11)
○ How can we measure the performance of our computer system?
   Performance (Chapter 6)
○ How to tolerate faults?
   Fault Tolerance (Chapter 8)
○ How to deal with system replication?
   Consistency (Chapter 10)

Computer Systems Engineering
Performance
Rodrigo Bruno

Recap!
Authentication
Integrity
Confidentiality
Figure: secure message exchange between Alice and Bob (Hash, Sign, Encrypt, Decrypt, Verify) using the keys E_A, E_B, S_private-key(Alice), S_public-key(Bob), S_private-key(Bob), S_public-key(Alice)

For today!
Throughput and Latency
Plotting
When and how to measure
Performance bottlenecks
How to overcome bottlenecks

“In computing, computer performance is the amount of useful work accomplished by a computer system.”
https://en.wikipedia.org/wiki/Computer_performance

How do we measure performance?
Main metrics:
○ latency: time it takes to reply to a request
○ throughput: number of requests answered per amount of time
○ efficiency: useful work per resource consumption
Other metrics:
○ memory utilization, power consumption, bandwidth, …

Measuring throughput

Measuring throughput (speedup)

Measuring response time

Measuring response time (histogram)
Figure axis: Latency (ms)

Measuring response time (SLA)

Measuring response time (percentiles)

Measuring response time (CDF)

Average and Stdev

When to measure

Patterns to look for
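The response-time slides above (histogram, percentiles, average and standard deviation) can be illustrated with a small sketch. The latency samples are made up for illustration, the helper name `percentile` is hypothetical, and the nearest-rank definition used here is only one common way of computing percentiles, not necessarily the one used in the lecture.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample such that at least p% of samples are <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


# Hypothetical response times in milliseconds, with one slow outlier (250 ms).
latencies_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 14]

mean_ms = statistics.mean(latencies_ms)
stdev_ms = statistics.stdev(latencies_ms)  # sample standard deviation
p50_ms = percentile(latencies_ms, 50)
p90_ms = percentile(latencies_ms, 90)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, stdev={stdev_ms:.1f} ms")
print(f"p50={p50_ms} ms, p90={p90_ms} ms, p99={p99_ms} ms")
```

With these samples the mean (37 ms) is almost three times the median (13 ms) because a single outlier dominates it, which is one reason to report percentiles alongside the average.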
Throughput and Latency under Load
Figure: throughput (req/sec) versus user load (req/sec), with three regions: underloaded, saturated, thrashing
Figure: latency (sec) versus user load (req/sec), with the same three regions: underloaded, saturated, thrashing

Studying Throughput and Latency
$\text{latency}_{\text{system}} = \text{latency}_1 + \dots + \text{latency}_n$
$\text{throughput}_{\text{system}} = 1 / \text{latency}_{\text{system}}$

Bottlenecks
Figure: three stages taking 2 ms, 20 ms, and 1 ms

Using concurrency to overcome bottlenecks
Inter-request parallelism
Intra-request parallelism
Pipelining

Pipelining throughput and latency
$\text{latency}_{\text{request}} = \text{latency}_1 + \dots + \text{latency}_n$
$\text{throughput}_{\text{system}} = \min(\text{throughput}_1, \dots, \text{throughput}_n)$

Can performance scale indefinitely?
Figure: speedup versus number of threads

Scalability in real life (parallel sort)
Input: 1, 10, 12, 6, 110, 102, 321, 2, 77, 100, 55, 99, 443, 235, 569
Split: [1, 10, 12, 6, 110] [102, 321, 2, 77, 100] [55, 99, 443, 235, 569]
Sort: [1, 6, 10, 12, 110] [2, 77, 100, 102, 321] [55, 99, 235, 443, 569]
Merge: 1, 2, 6, 10, 12, 55, 77, 99, 100, 102, 110, 235, 321, 443, 569

Amdahl's law

Is there any hope?

Summary
Throughput and Latency
Plotting
When and how to measure
Performance bottlenecks
How to overcome bottlenecks

Where to go now?
Next four lectures:
✓ How can we achieve secure communication over computer networks?
✓ Information Security (Chapter 11)
✓ How can we measure the performance of our computer system?
✓ Performance (Chapter 6)
○ How to tolerate faults?
   Fault Tolerance (Chapter 8)
○ How to deal with system replication?
   Consistency (Chapter 10)

Computer Systems Engineering
Fault Tolerance
Rodrigo Bruno

Recap!
Figures: throughput (req/sec) and latency (sec) versus user load (req/sec)
Inter-request parallelism
Intra-request parallelism
Pipelining

For today!
Faults, errors, failures
Reliable systems
Availability
Fault tolerance model
Incremental redundancy
Massive redundancy

“Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components.”
https://en.wikipedia.org/wiki/Fault_tolerance

How to build a reliable computer system if computer systems are made of modules that can (and will) fail?

Terminology
A fault is an underlying defect, imperfection, or flaw that may cause problems
An error is an incorrect value or signal that results from the fault
A failure happens when a fault is not handled and error(s) cause a disruption:
○ software faults
○ hardware faults
○ …
In a system built of subsystems,
○ the failure of a subsystem is a fault from the point of view of the larger subsystem

What to do in case of errors?
(bad) Do nothing,
○ failing without warning, potentially leading to a complete crash;
(ok) Immediately stop,
○ hoping to limit the propagation of the error;
(good) Detect and report the fault at its interface with the higher-level system (aka fail-fast),
○ this allows subsystems to react quickly and avoids further contamination;
(best) Mask the error,
○ so that the higher-level subsystem does not realize there is a fault;

Classification of errors
detectable
○ we can detect/identify the error
maskable
○ we can mask the error
tolerated
○ the error is masked

Fault tolerance model
1. Categorize each possible error into reliably detectable or not;
   a. all errors are untolerated at start;
2. For each undetectable error, evaluate the probability and risk of occurrence
   a. if probability is not negligible, improve the system to reliably detect the error
3. For each detectable error, devise a way to mask it
   a. if masking is possible, classify the error as maskable
4. For maskable errors, evaluate the cost of failure versus the cost of masking errors
   a. if worthwhile, implement a masking method and classify the error as tolerated

Key steps to building reliable systems
Prepare a fault tolerance model (as explained in the previous slide)
○ estimate the risk for each error
○ if risk is too high, design methods to detect, contain, and mask errors
Detecting errors
○ differentiating an error from a data value or signal. How? Redundancy
Containing errors
○ limit the propagation of the error. How? Modularity
Masking errors
○ offering correct operation despite the error. How? Extra redundancy

Bathtub curve
Figure: failure rate over time, with three phases: infant mortality, constant failure rate, wear-out

Tolerating faults with incremental redundancy
Problem: how can we tolerate bit flips?
○ Message sent: 100101
○ Message received: 000101
Forward error correction (a form of incremental redundancy)
○ encodes values in a way that allows detection of a number of bit flips
Hamming distance
○ 100101 -> 000101 (Hamming distance of 1)

Tolerating faults with incremental redundancy
Suppose we have a binary encoding scheme that produces values with a Hamming distance of 2. For example:
○ Valid values: 000, 011, 101, 110
○ Message sent: 000
○ Message received: 100
○ We can detect an error but can we recover the original content?
   No, there are several legitimate encodings at the same distance
   Was the correct value 000 or 101?

Tolerating faults with incremental redundancy
What if we have an encoding scheme that supports a Hamming distance of 3?
  ○ all legitimate values are 3 bit flips away from each other
  ○ the decoder can correct the received value into the legitimate value closest to the received one
    - if 1 bit flips, a decoder can correct into the original value
    - if 2 bits flip, a decoder can correct into the wrong value!
- For a Hamming distance d
  ○ d - 1 errors can be detected
  ○ floor((d - 1) / 2) errors can be corrected

Tolerating faults with massive redundancy
- What if, instead of bit flips, an entire server crashes?
  ○ we need full redundancy, aka replication
- Using replication, a set of replicas work in coordination
  ○ if one replica fails
  ○ other replicas can be used to mask the error
- In general terms:
  ○ a module can be replaced by a set of replicas of the same module, all operating in parallel with the same inputs
  ○ their output is connected to a voter
  ○ N-modular redundancy

N-modular redundancy
[Figure: Triple-modular redundancy (TMR): Replica 1, Replica 2, and Replica 3 feed a Voter]
- N replicas participate in some computation
- The voter receives the output of each replica
  ○ if all replicas agree, the voter outputs the value that all replicas presented
  ○ if the majority of replicas agree on a value
    - the voter outputs the value that the majority presented
    - the voter will raise an alert for the minority that presented a different value
  ○ if no majority is achieved, the voter reports a failure
- How many replicas are necessary to detect F failures? And to mask F failures?
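
The voter behaviour described above can be sketched in a few lines of Python. This is a minimal illustration and not part of the original slides: the names `vote` and `VoteResult` are hypothetical, and replica outputs are assumed to be hashable values that can be compared for equality.

```python
from collections import Counter
from typing import Hashable, List, NamedTuple, Sequence


class VoteResult(NamedTuple):
    value: Hashable        # value presented by the majority of replicas
    dissenters: List[int]  # indices of replicas that presented a different value


def vote(outputs: Sequence[Hashable]) -> VoteResult:
    """Majority voter for N-modular redundancy (hypothetical helper).

    Raises RuntimeError when no strict majority exists, which mirrors
    "the voter reports a failure" on the slide.
    """
    if not outputs:
        raise ValueError("at least one replica output is required")
    # Most frequent output and how many replicas presented it.
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority means more than half of the replicas agree.
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    # Replicas in the minority are reported so an alert can be raised.
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return VoteResult(value, dissenters)


if __name__ == "__main__":
    print(vote([42, 42, 42]))  # all replicas agree
    print(vote([42, 7, 42]))   # replica at index 1 is in the minority
```

With three replicas (TMR), `vote([42, 7, 42])` returns the majority value 42 and flags the replica at index 1, so a single faulty replica is masked. With three different outputs the sketch raises an error instead of guessing a value.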
Fully replicated TMR
[Figure: fully replicated TMR]

Availability and Reliability (memoryless systems)
- Downtime = (1 - Availability)
- TTF = time to failure
- MTTF = mean time to failure
- MTTR = mean time to repair

Reliability of a TMR
- If R is the reliability of a single module replica, the reliability of a TMR (i.e., the system keeps working with a majority) is estimated as follows:
  ○ $R^3 + 3R^2(1 - R) = 3R^2 - 2R^3$
- For R = 0.999, the TMR reliability moves to 0.999997
  ○ the failure probability goes from one in a thousand to three in a million!
- Assuming independent failures and memoryless processes.

Reliability of a TMR (long mission time)
- MTTF of 6000 hours of operation
  ○ with a single replica;
- With 3 replicas, the system accumulates 6000 replica-hours of operation in 2000 hours
  ○ mean time to first failure is 2000 hours
- After the first failure, with 2 replicas, the system takes 3000 hours to accumulate 6000 replica-hours
  ○ mean time to second failure is 3000 hours
- Total mean time to system failure is
  ○ 2000 + 3000 = 5000 hours (which is < 6000)

Summary
- Faults, errors, failures
- Reliable systems
- Availability
- Fault tolerance model
- Incremental redundancy
- Massive redundancy

Where to go now?
Next four lectures:
✓ How can we achieve secure communication over computer networks?
  ✓ Information Security (Chapter 11)
✓ How can we measure the performance of our computer system?
  ✓ Performance (Chapter 6)
✓ How to tolerate faults?
  ✓ Fault Tolerance (Chapter 8)
○ How to deal with system replication?
  ○ Consistency (Chapter 10)

Computer Systems Engineering
Consistency
Rodrigo Bruno

Recap (fault, error, failure)!
- A fault is an underlying defect, imperfection, or flaw that may cause problems
- An error is an incorrect value or signal that results from the fault
- A failure happens when a fault is not handled and error(s) cause a disruption:
  ○ software faults
  ○ hardware faults
  ○ …
- In a system built of subsystems,
  ○ the failure of a subsystem is a fault from the point of view of the larger subsystem

Recap (fault tolerance model)!
1. Categorize each possible error into reliably detectable or not;
   a. all errors are untolerated at start;
2. For each undetectable error, evaluate the probability of occurrence
   a. if probability is not negligible, improve the system to reliably detect the error
3. For each detectable error, devise a way to mask it
   a. if masking is possible, classify the error as maskable
4. For maskable errors, evaluate the cost of failure versus the cost of masking errors
   a. if worthwhile, implement a masking method and classify the error as tolerated

Recap (incremental redundancy)!

Recap (massive redundancy)!

For today!
- State machine replication
- Active vs passive replication
- Consensus (Paxos)
- CAP theorem
- Causal consistency
- Eventual consistency

State machine replication (massive replication)

Active replication
[Figure: Op 1 (withdraw 100$) arrives first at Server 1 and Op 2 (deposit 100$) arrives first at Server 2; both servers start with value = 50.
 Server 1: op 1 fails, then value = 150 after op 2.
 Server 2: value = 150 after op 2, then value = 50 after op 1.
 The replicas end in different states (150 vs 50).]

Passive replication
[Figure: Op 1 (withdraw 100$) and Op 2 (deposit 100$); both servers start with value = 50.
 Server 1: value = 150 after op 2, then value = 50 after op 1.
 Server 2: value = 150 after op 2, then value = 50 after op 1.]

Consensus (Paxos)
“Paxos is a family of protocols for solving consensus in a network of unreliable or fallible processors.”
https://en.wikipedia.org/wiki/Paxos_(computer_science)

Paxos
- N processes/servers want to agree on a value
  ○ want to tolerate F faults (fault tolerance model)
    - tolerate F processes stopping
    - tolerate F messages delayed or lost
- If there are at most F faults in a window
  ○ then consensus is achieved (all processes agree on a value)
- Needs 2F+1 processes
  ○ stalls but stays safe if there are more than F faults

Paxos
[Figure. Source: Performance Comparison Between the Paxos and Chandra-Toueg Consensus Algorithms (report)]

Paxos
- Two phases:
  ○ Prepare phase
    - transaction manager proposes a ballot number n
    - the number is sent to a quorum that must acknowledge it.
    - used to discover any competing or previous values
    - block previous values (in the same window)
  ○ Accept phase
    - after receiving an ack from each quorum participant
    - the transaction manager disseminates the new value v

CAP theorem

Strong consistency

Relaxing strong consistency - Causal Consistency
- Synchronize operations? Only causally related!

Causal Consistency
Causal operations:
- Local execution: if a and b are executed in the same process and a is executed before b, then a -> b;
- Read a write: if a is a put/write operation and b is a get/read operation that returns the value written by a, then a -> b;
- Transitivity: for operations a, b, and c, if a -> b and b -> c, then a -> c.

Eventual Consistency
“given no updates (writes) all clients will see exactly the same state of a system in some time”
- Difference from causal consistency?
  ○ Does not preserve causal relationships
- Pros:
  ○ super duper highly available
- Cons:
  ○ no safety guarantees
  ○ need conflict resolution

Summary
- State machine replication
- Active vs passive replication
- Consensus (Paxos)
- CAP theorem
- Causal consistency
- Eventual consistency

Where to go now?
Next four lectures:
✓ How can we achieve secure communication over computer networks?
  ✓ Information Security (Chapter 11)
✓ How can we measure the performance of our computer system?
  ✓ Performance (Chapter 6)
✓ How to tolerate faults?
  ✓ Fault Tolerance (Chapter 8)
✓ How to deal with system replication?
  ✓ Consistency (Chapter 10)

The end.