Distributed Systems
Suchithra Ravi
Last Updated: May 27, 2021

"The future is already here, it is just not evenly distributed yet."
— William Gibson, Not in Cyberpunk

Contents

0  Preface
1  Introduction
   1.1  Why study Distributed Systems?
   1.2  What is a distributed system?
        1.2.1  Intuition
   1.3  Simple Model of a Distributed System
        1.3.1  More complex model of a Distributed System
        1.3.2  Importance of Model
   1.4  What is hard about Distributed Systems
   1.5  Properties of a Distributed System
        1.5.1  Correctness
   1.6  Brewer's CAP Theorem
2  Remote Procedure Call
   2.1  Client-Server Architecture
        2.1.1  Challenges
   2.2  Role of RPC
   2.3  Architecture of RPC System
   2.4  Anatomy of an RPC Call
   2.5  Invocation Semantics of RPC Operation
   2.6  Examples of RPC Systems
        2.6.1  gRPC
3  Time in Distributed Systems
   3.1  The Time Problem
        3.1.1  Why do we need to measure time in DS?
        3.1.2  Why is measuring time hard in DS?
        3.1.3  Logical Time
   3.2  Representing Time and Sequence
        3.2.1  Time Diagrams
   3.3  Clock Consistency
   3.4  Lamport's Scalar Clock
        3.4.1  Clock Definition
        3.4.2  Clock Correctness
   3.5  Vector Clock
        3.5.1  Clock Definition
   3.6  Matrix Clock
4  State in Distributed Systems
   4.1  The problem of State
        4.1.1  What is state?
        4.1.2  Cuts
        4.1.3  Challenges in capturing state
   4.2  System Model
   4.3  Finding a Consistent Cut
        4.3.1  Assumptions of the algorithm
        4.3.2  Properties of state captured
   4.4  Global State
        4.4.1  Formal definition
        4.4.2  Benefits of Global State
5  Consensus in Distributed Systems
   5.1  What is Consensus?
   5.2  Theoretical Possibility of Consensus
        5.2.1  System Model
        5.2.2  FLP Theorem
        5.2.3  Is Consensus Really Impossible?
6  Consensus Protocols
   6.1  Goals of Consensus Protocols
   6.2  2-Phase Commit (2PC)
   6.3  3-Phase Commit (3PC)
   6.4  Paxos
        6.4.1  Basics of Paxos
        6.4.2  Phases of Paxos
        6.4.3  Paxos vs. FLP
        6.4.4  Paxos in Practice
   6.5  RAFT
        6.5.1  Phases of RAFT
        6.5.2  RAFT Correctness
        6.5.3  RAFT in Practice
7  Replication
   7.1  What is Replication
        7.1.1  Goals
        7.1.2  Replication Models
        7.1.3  Replication Techniques
        7.1.4  Replication and Consensus
        7.1.5  How to choose replication method
   7.2  Chain Replication
        7.2.1  Pros and Cons
   7.3  CRAQ
        7.3.1  CRAQ Performance comparison with Chain Replication
8  Fault Tolerance
   8.1  Basics of Failures
        8.1.1  How to deal with failures
   8.2  Rollback-Recovery
   8.3  Checkpointing
        8.3.1  Uncoordinated Checkpointing
        8.3.2  Coordinated Checkpointing
        8.3.3  Communication-Induced Checkpoints
   8.4  Logging
   8.5  Which Method to Use?
9  Distributed Transactions
   9.1  Transactions and Distributed Transactions
   9.2  Google Spanner
        9.2.1  Spanner Stack
        9.2.2  Consistency Requirements for read operations
   9.3  True Time
        9.3.1  Ordering Write Transactions
        9.3.2  Ordering Read Transactions
        9.3.3  TrueTime alternatives
   9.4  AWS Aurora
10 Consistency in Distributed Data Stores
   10.1 Consistency Models
   10.2 Look-Aside Cache
        10.2.1 Look-Aside Cache Read Operation
        10.2.2 Look-Aside Cache Update Operation
   10.3 Memcached
        10.3.1 Features of Memcache
        10.3.2 Mechanisms in Memcached
        10.3.3 Scaling Memcache
   10.4 Causal+ Consistency
11 Peer-to-Peer and Mobility
   11.1 Communication Support assumed so far
   11.2 Interconnect Support
   11.3 Peer to Peer Systems
   11.4 Connectivity in P2P
        11.4.1 Approach 1: Centralized entity
        11.4.2 Approach 2: Flood or Gossip based protocols
        11.4.3 Approach 3: Distributed Hash Table
   11.5 Distributed Hash Table (DHT)
12 Distributed Machine Learning
   12.1 Distributed Machine Learning Approaches
   12.2 Geo-Distributed ML with Gaia
        12.2.1 ASP
        12.2.2 Results from the paper
   12.3 Collaborative Learning
        12.3.1 Tradeoffs of Using Global Model
        12.3.2 Collaborative Learning with Cartel
   12.4 Other stages of ML Pipeline
13 Byzantine Fault Tolerance
   13.1 Byzantine Failure and Byzantine Generals
   13.2 Practical Byzantine Fault Tolerance: pBFT
   13.3 pBFT Algorithm
   13.4 Byzantine Consensus vs. Blockchain
14 Edge Computing and the Internet of Things (IoT)
   14.1 Edge Computing?
        14.1.1 Closing the Latency/Bandwidth Gap
        14.1.2 Edge Computing Drivers
   14.2 Distributed Edge Computing
   14.3 IoT and Distributed Transactions
   14.4 Transactuations
        14.4.1 Evaluation of Transactuations
Index of Terms

Preface

The editor looked at his clothes and asked, "Can you spell cat?". The boy looked at him and said, "Can you spell anthropomorphology?"
— A controversial lady, Left as an exercise to the reader!

These are my notes for CS7210, the Distributed Computing course taught at Georgia Tech. The intent of these notes is to allow for a review of concepts and act as a supplement for the lecture material. These are not official course materials. Also, though I strive hard to ensure the correctness of the material, these notes were created throughout my time as a student, so there may still be some errors that slipped through. You can refer to the official recommended textbooks, such as Distributed Systems for Fun and Profit or Distributed Systems, for a better, deeper look at the concepts covered here.

If you encounter typos; incorrect, misleading, or poorly-worded information; or simply want to contribute a better explanation or extend a section, please contact me on Piazza, Slack, or via email.

Here, I must take a moment to thank George Kudrayvtsev for creating this incredibly beautiful LaTeX template and being generous enough to make it open source (thus inspiring me to make LaTeX notes at all). Thank you!

Introduction

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
— Leslie Lamport, Somewhere

1.1 Why study Distributed Systems?

Because they are everywhere!! Examples of DS applications:
- OMSCS
- FANG: Facebook (social media), Amazon (online shopping), Netflix (streaming, CDN), Google (Cloud), etc.
- Enterprise systems (banking, SaaS) hosted on public and private clouds
- Telecommunication industry
- Newer domains: AR/VR, self-driving cars
- HPC: massive multi-core

1.2 What is a distributed system?

A system consisting of multiple independent components that can fail in some way (intentionally or unintentionally, transiently or permanently) and that interact/collaborate in some manner to complete some common task.

1.2.1 Intuition

- These independent components do not share all information with each other.
- They interact via some form of message passing, and these messages allow components to influence each other's actions.
- Messages are not perfectly reliable: they can get delayed or lost, leading the intended receiver to act differently, which might even lead the whole system to become unstable.
- The independent computing units must appear as a single coherent computing entity, which implies that these units work on a common task/goal.

Definition 1.1 (Distributed System): A Distributed System is a collection of computing units that interact by exchanging messages via an interconnection network and appear to external users as a single coherent computing facility.

Note that the definition does not explicitly mention failures!!

1.3 Simple Model of a Distributed System

What we show:
- Nodes:
  * Receive a message from a channel
  * Take some time to act on it
  * Send messages to one or more channels
- Messages: Spend some time in the channel and are delivered zero or more times.
- Time axis: Shows a series of messages and when they happened.

What we abstract out:
- The actual underlying network: communication channels between nodes, number of hops, etc.
- Directionality of channels: Treat channels as unidirectional, i.e., messages may actually use the same communication channel underneath, but we treat each direction as a separate channel.
- Actual processing at the nodes: any processing manifests itself as (a) a time delay at the node and (b) future messages to the rest of the system.

What we can represent: This model is general enough to represent very different types of systems.
- Node failure: Set the time taken to process at the node to ∞.
- Unreliable communication: Set the number of times a message is delivered to 0 if dropped, to >1 if retransmitted.
- Point-to-point vs. multicast: Set the appropriate number of channels to which the messages are sent.

1.3.1 More complex model of a Distributed System

Typically, we think of a DS as the set of processing actions that happen at the nodes and the state changes that result from those actions. We can include this in our model by adding a state variable to each node. Note that we are still not representing the individual processing actions at each node. Actions are triggered in response to messages, and the outcome of an action is a change in state at the node.

1.3.2 Importance of Model

Much of the research in DS is a combination of theoretical predictions based on building and analyzing models, as well as deployment and practical evaluation of real systems. But why use models at all? Because the alternative is too complex: the alternative is to build a prototype and test it under all possible scenarios, and it would be hard to mimic the correct number and distribution of nodes.

How to choose a model: Any model is defined by elements, rules, and assumptions or model invariants.
- Model invariants: Statements that are always true for a model. E.g., "Every message is delivered on time" -> this assumes lossless message delivery and no network failure.

To choose a model, we have to ensure it is:
- Accurate: Represents the actual system being studied, i.e., some problems can be studied using the model.
- Tractable: Analysis of the current problem is possible using the model.

In other words, while choosing a model, pay attention to: What problems can we study using this model? Can we build and analyze solutions for this problem using the model? A simple model may be sufficient: as long as we can adequately represent the current system we are studying and all the possible transitions in the system, we can use the model.

1.4 What is hard about Distributed Systems

- Asynchrony: Message latency can be zero, bounded, unpredictable, or infinite. (Most systems have unpredictable or infinite latency.)
- Failures: Failures can be "fail-stop" (sudden stop), transient, or Byzantine (the system is performing incorrect actions). Also, failures can be in the nodes or in the network links.
- Consistency: We want a single up-to-date copy of the data, with all nodes agreeing on this copy. But to get there, we need to consider the concurrency/ordering in which various events happened, whether the data is replicated/cached, etc. All of these introduce various tradeoffs in the system design.

An alternate way to look at this is in terms of the 8 fallacies (i.e., statements that aren't always true in a distributed system), and how each is violated:
- The network is reliable: The network may be unreliable, and messages can be delayed or lost.
- Latency is zero: Real networks have non-zero (sometimes unbounded) latency for packet delivery.
- Bandwidth is infinite: Real networks have finite bandwidth, so there is a limit to the number of simultaneous messages sent/received.
- The network is secure: Naturally, the network may be insecure, and this can affect message passing in the form of bandwidth choking or even malicious manipulation of the messages passed.
- Topology doesn't change: If some components fail or new components are added (or a failed component is repaired), the network topology changes. There may also be other changes to topology.
- There is one administrator: In a large system, having a single administrator may be inefficient, forcing systems to use consensus (and maybe no administrators!).
- Transport cost is zero: Transport cost, in terms of infrastructure/bandwidth as well as energy, is realistically non-zero.
- The network is homogeneous: Some parts of the network may be slow or even shut down for various reasons, making the network non-homogeneous.

We will have to build systems that will work despite these assumptions becoming false.

1.5 Properties of a Distributed System

- Consistency: The system always gives correct answers.
- Availability: The system always provides responses.
- Partition Tolerance: The system provides responses irrespective of failures or delays in nodes/network.

Consistency also implies that the system should act as a single unit. For this, it should coordinate actions by multiple components in the presence of failures and/or concurrency. Other desirable properties include:
- Fault Tolerance: The system should recover from component failures without performing wrong actions.
- High Availability: The system should restore operations and resume services even after components have failed.
- Recoverability: Failed components can restart and rejoin the system after the failure has been resolved.
- Scalability: The system should operate correctly even after some aspect of the system (including load/users) is scaled to a larger size.
- Predictable Performance: The system should maintain performance despite failures.
- Security: The system should authenticate access to the data and state it maintains and the services it provides.

1.5.1 Correctness

Intuition: If we gave the same set (or series) of inputs to a distributed system and to a single computing entity, the distributed system should give the same outputs as the single entity for it to be correct. The problem is that each node in the DS actually receives a separate set of inputs. So, to act like a single entity, the inputs should be delivered in the same order, i.e., we are concerned not just with the order of inputs to a single node but with the global order of inputs across all nodes. Also, all participants should agree that this is the order in which events occurred.

Consistency Model: a set of guarantees provided by the system about the ordering of events. Different types of consistency models are possible:
- Strict Consistency: Guarantee that all events and changes in the system will have a single, uniform order and all parties agree upon this order.
  - Almost impossible to achieve: it is impossible to guarantee all nodes will have the same notion of time.
- Linearizable: Transactions (i.e., groups of operations that read/modify some shared state) will not appear to be interleaved with other ongoing transactions, i.e., the transactions will appear to be in a linear order, even though individual operations are not.
- Serializable: The system guarantees that the outputs correspond to some ordering of the transactions, but this need not correspond to the real-time ordering itself.
  (This ordering could have occurred on a single node, but need not have.)
  - All nodes in the system should still perceive the same ordering.

1.6 Brewer's CAP Theorem

Brewer originally stated this as a conjecture; a formal version was later proven by Gilbert and Lynch.

Theorem 1.1 (CAP Theorem): A Distributed System can never simultaneously meet all 3 properties: Consistency, Availability, and Partition Tolerance. If a network partition occurs, you can have Consistency or Availability, but not both.

Note: We can have both if there is no network partition (i.e., when partition tolerance is not needed).

Systems are often classified by the CAP tradeoffs they make (or the guarantees they provide):
- P+A: Key-value stores like Cassandra, DynamoDB
- P+C: Megastore, MySQL Cluster

Despite what the CAP theorem says, in practice a slow response = no response!
=> If a system is "available" but has high latency, it is not really available.
=> The actual tradeoff is between Latency and Consistency.

Theorem 1.2 (PACELC): If there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C)?

Remote Procedure Call

2.1 Client-Server Architecture

A common architectural pattern for building distributed systems.

- Clients: Nodes in the system that send requests and/or data to the server nodes.
  - The client has to find the address of the server it will contact.
  - The client may need to establish a connection with that server (depends on the protocol).
  - The client has to copy the "arguments" for its request from its memory into network packets.
- Servers: Nodes that receive the request and "process" it.
  - Processing may include accessing a database to retrieve data.
  - The server copies data from its memory into network packets to send to the client.

Note that the server and client could reside on the same machine.

2.1.1 Challenges

- Discovery and binding: The client needs to find the server and establish a connection.
- Identifying interface and parameter types: The client needs to know the operation and the parameters required for it.
- Data representation: The client and server should have previously agreed on how data will be represented.
- Explicit data management necessary: Data needs to be explicitly copied from the network packet buffer to server/client memory.
- Unpredictable execution time: Results may arrive at an arbitrary time depending on network/server delays.
- Unknown cause of failure: The client cannot tell whether the failure was in the network or the server (or even the type of failure).

All the problems above must be explicitly handled.

2.2 Role of RPC

- Address the challenges above and simplify client-server interaction.
- Hide the complexity of distributed programming and make it similar to programming single-node systems.
  - Why?
    When these were developed, local programming was centered around procedures and procedure calls.

To achieve its goals, RPC needs to provide the following:
- Service registration mechanism: Servers should be able to register the services they provide, and clients should be able to find out that information.
- Connection establishment: Support for any connections that need to be established.
- Interface specification: The client should be able to find out what parameters are required and what results are returned.
- Type specification: The client should be able to find the data type and order of the parameters and results.
- Data management: The system should manage the data transfer from memory to the network buffer and back.
  - Also referred to as serialization (marshalling) and deserialization (unmarshalling).
  - The serialized bytestream has service descriptors (metadata) along with the parameters/results (data).
- Dealing with failures: Support to deal with failures.
  - Timeout and retry for transient failures.
  - Timeout and return some error message for permanent failures.

2.3 Architecture of RPC System

- Topmost level - API: The programming interface that client and server applications use, e.g., making an RPC call.
- Second level - Stubs: The RPC call jumps into the stub layer. The stub layer knows about the procedure arguments/results -> performs data marshalling/unmarshalling.
- Lowest level - RPC Runtime: Connection management, sending/receiving data, failure management.
- Other components:
  - Interface Definition Language (IDL): Used to create the interface specification (servers describe services, arguments, etc.).
  - RPC Compiler: Compiles the IDL and spits out stub (and other) code.
  - Service registry: Establishes rules on how servers announce their services.

How this is actually used:
1. Server developer develops the app.
2. Server developer provides a specification in the IDL.
3. The RPC compiler compiles this spec.
4. The RPC compiler generates stub code + skeleton server code.
5. The implementation of the app service is added to the skeleton server code.
6. The service is registered in the registry.
7. The client writes client-side code that calls the app.
8. The client compiles this code with the code generated by the RPC compiler.
9. The RPC runtime takes care of everything else at runtime!!

2.4 Anatomy of an RPC Call

Assume the server implements some operation op(arg1, arg2) that the client cannot perform by itself, i.e., to do op, the client needs to send op, arg1, and arg2 to the server. The goal of RPC is to make this work like a regular function call (and do everything else under the hood).

Client side:
1. The client makes a function-like call: res = op(arg1, arg2).
2. The program counter jumps to the stub implementation of this function (instead of the local memory address of the procedure implementation).
3. The stub creates a buffer with the function descriptor and the arguments required for the procedure, in the format expected by the server.
4. The RPC runtime takes care of connection establishment, etc., required to send the message.
5. The RPC runtime sends a message to the server with the buffer contents.

Server side:
1. The received message is given to the RPC runtime, which gives it to the stub.
2. The stub unmarshals the arguments and determines which function has to be called.
3. The stub calls the actual local implementation of the function.
4. The function executes and returns its results to the stub.
5. Control comes back to the server-side stub.
6. The stub populates the message buffer with the result.
7. The stub sends the message back to the client.

Back on the client side:
1. The client stub unpacks the result.
2. The client stub quietly places it in the memory location expected.
3. Control returns to the original calling function (voila!).
4. The calling function can extract the result from the memory location as if all of this had just happened locally!
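To make the stub's job concrete, here is a minimal sketch (not part of the official course material) of what a client stub does: marshal the operation name and arguments into a byte stream, ship it over a socket, and unmarshal the reply. The length-prefixed JSON wire format and the names op/args/result are assumptions made only for illustration; real RPC systems use their own (usually binary) formats.

```python
import json
import socket
import struct

def send_msg(sock, payload: dict) -> None:
    """Marshal a dict into a length-prefixed JSON byte stream."""
    data = json.dumps(payload).encode()
    sock.sendall(struct.pack("!I", len(data)) + data)

def recv_msg(sock) -> dict:
    """Read one length-prefixed JSON message and unmarshal it."""
    (length,) = struct.unpack("!I", sock.recv(4))
    data = b""
    while len(data) < length:
        data += sock.recv(length - len(data))
    return json.loads(data)

def rpc_call(server_addr, op, *args):
    """Client stub: looks like a local call, but marshals and sends over the network."""
    with socket.create_connection(server_addr) as sock:
        send_msg(sock, {"op": op, "args": list(args)})  # client steps 3-5
        reply = recv_msg(sock)                           # wait for the server's reply
    return reply["result"]                               # placed where the caller expects it

# Hypothetical usage (assumes a matching server exists):
# res = rpc_call(("server.example", 9000), "add", 2, 3)
```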
2.5 Invocation Semantics of RPC Operation

RPC operations can be classified along different dimensions.

Classification based on control transfer:
- Synchronous RPC operations: Blocking operation where the client thread waits for the RPC call to complete.
  - The client makes the RPC call -> waits for the response.
  - The calling thread cannot move forward until the RPC response is received. Other threads might continue to run, even make other RPC calls, and wait for those!
- Asynchronous RPC operations: Non-blocking operation where the client makes the RPC call but continues with other operations.
  - The client makes the RPC call -> does other actions (not dependent on the call) while waiting.
  - Once all independent tasks are completed -> the client checks if the RPC response is available -> processes the result or waits.
  - Obvious advantage: hides latency.
  - Registering a callback: The client can choose to be notified when the response arrives and specify the things that need to be done when it does.

Classification based on guarantees about message delivery:
A local procedure call executes exactly once and returns a result. Here, a lack of response from the procedure implies deadlock or failure. If the procedure hangs or crashes, we can always assume (correctly) that the procedure did not complete, and the solution is to restart and redo the operation. In other words, the caller of the procedure always knows whether the call was executed or not.

In a distributed system, there is no guarantee that the server will respond. But a lack of response does not imply failure. It might simply be due to network issues: the request was lost, the response was lost, etc. But it may also be a real server failure. Rerunning may have undesirable effects: if the command was previously executed, rerunning it can corrupt the data (like the append operation in Project 2).

Distributed systems may give different types of guarantees about execution:

Invariant Definition 2.1 (Exactly Once Execution): The ideal scenario: commands are executed exactly once (similar to local procedure calls).
- Client side: timeout and retransmit if no response.
- Server side: distinguish between repeated requests and filter them if needed.
  - Filtering may not be needed, e.g., for an add operation with both arguments specified on the call, redoing the operation is ok.
- Must also deal with persistent failures (e.g., server node failure).

Invariant Definition 2.2 (At Most Once Execution): A slightly worse scenario: executed once or not at all.
- Server side: distinguish between repeated requests and filter them if needed.
- Client side: timeout and retransmit if no response.
- The client knows the command may not have been executed.

Invariant Definition 2.3 (At Least Once Execution): A third option: executed once or more (but never zero times).
- Client side: timeout and retransmit if no response.
- Server side: no guarantee that duplicates will be eliminated.

Other semantics are also possible: e.g., if the system works with multiple replicas, it may guarantee that calls will be replicated in all replicas, or in at least one replica, etc.
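One common way to get at-most-once behavior is for the server to tag each request with a client-chosen request ID and cache the result of every request it has executed, so that a retransmission is answered from the cache rather than re-executed. The sketch below illustrates only that bookkeeping; the request-ID scheme and handler table are assumptions for illustration, not a prescribed design.

```python
class AtMostOnceServer:
    """Filters duplicate requests so a retried RPC is not re-executed."""

    def __init__(self, handlers):
        self.handlers = handlers      # operation name -> function
        self.completed = {}           # request_id -> cached result

    def handle(self, request_id, op, args):
        # A retransmission carries the same request_id as the original,
        # so we return the cached result instead of executing again.
        if request_id in self.completed:
            return self.completed[request_id]
        result = self.handlers[op](*args)
        self.completed[request_id] = result
        return result

# Hypothetical usage: a duplicate "append" is not applied twice.
server = AtMostOnceServer({"append": lambda log, x: log + [x]})
r1 = server.handle("req-42", "append", ([], "a"))
r2 = server.handle("req-42", "append", ([], "a"))  # duplicate of the same request
assert r1 == r2 == ["a"]
```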
2.6 Examples of RPC Systems

- sunRPC: The original RPC, developed by Sun in the '80s.
- SOAP, CORBA: Older systems used in enterprise solutions.
- Apache Thrift
- gRPC

Some of these RPC systems are specialized for specific contexts, e.g., high-speed (low latency) reliable networks, embedded environments (optimizations to ensure a small footprint), etc.

2.6.1 gRPC

Released by Google around 2016, and inspired by sunRPC. It relies on protocol buffers, which allow us to describe the interface, perform data serialization, etc. The interface is specified in a .proto file, which is compiled (with the protoc compiler) to generate stub code. (Refer to the gRPC code example.)

Time in Distributed Systems

"People like us who believe in physics know that the distinction between past, present and future is only a stubbornly persistent illusion."
— Albert Einstein, Letter to Michele Besso

3.1 The Time Problem

Physical Time: Time that we read off some "physical" clock.

3.1.1 Why do we need to measure time in DS?

In a single node, it is easy to determine the sequence (i.e., ordering) of operations uniquely. Why do we need to know the order?
- Causality: Helps determine causality, i.e., whether one operation affected another -> needed for correctness and consistency.
- Resource Allocation: Important for resource allocation, especially if we want to maintain properties like fairness.
- Garbage Collection: Useful to perform garbage collection. We can deduce whether the results of an operation are no longer needed and free up resources.

3.1.2 Why is measuring time hard in DS?

Why can't we simply read the local clock at each node to determine the ordering of operations? For messages, whose clock should we read?

Option 1: Receiver-based Timing. Use the time at the receiver's end to maintain ordering. Problem: because of network delays, messages may arrive out of order! E.g., node n3 receives message m1 from n1, then receives m2 from n2, and concludes that m1 happened before m2. But this need not be the correct order, because:
- There are no guarantees about network delays and packet losses, i.e., m2 might have happened earlier, but its packet might have taken longer to arrive.
- Different nodes may infer different orders, e.g., if the same messages are also sent to a node n4 that is, say, closer to n2, it might receive message m2 before m1 and infer the exact opposite order from n3!!
- Some messages may never arrive!! Messages can get lost forever (infinite delay), in which case no order can be inferred for those events.

So clearly, we cannot rely on the timestamp at the receiving node.

Option 2: Sender-based Timing. Stamp each message with the local timestamp at the sender. Expect the order to be unique and the same as the actual occurrence of events at the sender. Not so fast! Problem: clocks may not be globally synchronized. If there is a large enough skew between 2 senders, the ordering may be wrongly reported. E.g., if the clock at n1 is running slightly ahead and reports a later timestamp than the clock at n2, then even though message m1 actually happened before m2, the timestamps received by n3 indicate the opposite, so the wrong order will be inferred.

Thus we can see that measuring time is hard in distributed systems for the following reasons:
- Consensus/Synchronicity: All nodes need not agree on what the global time is.
- Network delays: Message propagation need not take a fixed (or predictable) amount of time. Network delays need not be constant or even consistent across nodes (i.e., some nodes may see greater delays in sending/receiving messages than other nodes).
- Failures: Both nodes and network connections can fail at any time.
- Malicious nodes: For instance, we can't trust all the timestamps we receive.
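A tiny numeric sketch of the sender-based problem: with made-up skews (n1's clock runs 200 ms fast), the message that was really sent first carries the larger timestamp, so a receiver that sorts by sender timestamps infers the wrong order.

```python
# Hypothetical clock skews, for illustration only: n1 is 200 ms fast, n2 is accurate.
skew = {"n1": 0.200, "n2": 0.000}

def sender_timestamp(node, true_time):
    """Timestamp the sender attaches, as read from its own (skewed) clock."""
    return true_time + skew[node]

# m1 is really sent before m2...
m1 = ("m1", "n1", sender_timestamp("n1", true_time=10.00))   # stamped 10.20
m2 = ("m2", "n2", sender_timestamp("n2", true_time=10.05))   # stamped 10.05

# ...but a receiver ordering by sender timestamps concludes m2 came first.
inferred = sorted([m1, m2], key=lambda m: m[2])
print([m[0] for m in inferred])   # ['m2', 'm1'] -- the wrong order
```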
3.1.3 Logical Time

Physical clocks are clearly unusable, but we still need some concept of time measurement. Solution: logical clocks! Logical clocks do not measure the same kind of "time" as physical clocks. Instead, logical clocks generate timestamps which can be used to infer the real ordering and relationship of events in a distributed system. There are 3 types of logical clocks:
- Scalar Clocks (Lamport Clocks)
- Vector Clocks
- Matrix Clocks

3.2 Representing Time and Sequence

This section describes some common notation and terms we shall use later on.

Processes, represented by p_i, generate events, represented by e_i^k.

Happens before: A process p_i generates events e_i^0, e_i^1, e_i^2, ..., e_i^k, e_i^(k+1), ..., e_i^n. Here each event e_i^k happens before e_i^(k+1), represented by e_i^k → e_i^(k+1). Thus,
    e_i^k → e_i^(k+j)  for all j ≥ 1

Process history: the ordered sequence of events in process p_i, represented by H_i.

Note: We often focus on messaging events (sending or receiving), rather than internal events, since the impact of messaging events on other nodes is clear.

Definition 3.1 (Happens Before Relationship): The happens before relationship is a relationship between 2 events defined by the following rules:
- At any node i, internal event k happens before internal event k+1, i.e., internal_i(k) → internal_i(k+1)
- At any node i, receiving message k happens before sending the next message, i.e., recv_i(m_k) → send_i(m_(k+1))
- Across two nodes i and j, the sending of a message happens before the receipt of that message elsewhere, i.e., send_i(m_k) → recv_j(m_k)

3.2.1 Time Diagrams

An alternate way to represent the sequence of events is using time diagrams.
- Processes are shown as horizontal lines.
- Dots on the lines indicate different events.
- Messaging events are represented by arrows that connect the send event on one process with the receive event on another (with the arrowhead pointing towards the receiver).
- Arrow-less dots represent internal events.

Definition 3.2 (Concurrent Events): Two events may be such that there is NO happens before relationship between them. Such events are said to be concurrent. E.g., message m1 goes from node p1 to p2 and m2 goes from p3 to p4. If these are completely unrelated processes, there is no relation between these events, i.e., if e1 = send_1(m1) and e2 = send_3(m2), then e1 ↛ e2 and e2 ↛ e1 ⟹ e1 ∥ e2.
- Either event could have actually happened before the other.
- Swapping the order between the events has no impact on the rest of the system.
- Both events may still have happened before a different event e3: e1 → e3 and e2 → e3.
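Since happens-before is just program-order and send→receive edges closed under transitivity, it can be computed mechanically by reachability over the event graph. The event names and edges below are invented for illustration; two events are concurrent exactly when neither reaches the other.

```python
from collections import defaultdict, deque

def happens_before(edges, a, b):
    """True if event a reaches event b via program-order or message edges."""
    graph = defaultdict(list)
    for u, v in edges:            # each edge is one application of a happens-before rule
        graph[u].append(v)
    seen, queue = {a}, deque([a])
    while queue:
        for v in graph[queue.popleft()]:
            if v == b:
                return True
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False

def concurrent(edges, a, b):
    return not happens_before(edges, a, b) and not happens_before(edges, b, a)

# Hypothetical events: e1 = send_1(m1), e2 = recv_2(m1), e3 = send_3(m2).
edges = [("e1", "e2")]                 # send happens before the matching receive
assert happens_before(edges, "e1", "e2")
assert concurrent(edges, "e1", "e3")   # unrelated sends are concurrent
```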
3.3 Clock Consistency

Invariant Definition 3.3 (Clock Consistency Condition): To be useful, a logical clock should produce timestamps that reflect the actual relationship between events. This is stated as the clock consistency condition. Say logical clock C produces timestamp C(e_i) for each event e_i.
- Monotonicity: If 2 events are connected by a happens before relationship, their timestamps need to reflect that:
    e1 → e2  ⟹  C(e1) < C(e2)
  This means that a clock cannot produce the same timestamp repeatedly, or timestamps decreasing in value.
- No implication for concurrent events: e1 ∥ e2 places no constraint on C(e1) vs. C(e2).

Strong Clock Consistency: Logical clocks that satisfy this property guarantee that we can determine the order of events uniquely by looking at their timestamps, i.e., with strongly consistent clocks,
    e1 → e2  ⟺  C(e1) < C(e2)

Logical clocks are clocks used to map the history of events in a process to a partially ordered time domain T. Why partial? Because we cannot have a "complete" order if there are concurrent events. A clock function C is defined as the set of rules that must be followed to produce proper, consistent timestamps. To implement a logical clock, we need to decide upon:
- A data structure to represent the timestamps
- Rules that will be followed to advance the time

3.4 Lamport's Scalar Clock

Each node has its own implementation of the clock, which executes the clock rules to produce timestamps. A node only knows the value of the clock that it computed, but all nodes see the clock as a single scalar value.

3.4.1 Clock Definition

Data structure: Each process p_i has its own clock C_i, and the timestamp produced is a scalar.

Rules to generate timestamps: The rules can be derived by considering the rules for the happens before relationship.
- For 2 internal events a and b where a happens before b, if a = e_i^k and b = e_i^(k+1):
    e_i^k → e_i^(k+1)  ⟹  C(e_i^k) < C(e_i^(k+1))
  Therefore, p_i increments C_i between successive events: C_i(b) = C_i(a) + 1.
- For 2 related messaging events a = send_i(m_k) and b = recv_j(m_k) in processes p_i and p_j respectively:
    send_i(m_k) → recv_j(m_k)  ⟹  C_i(send_i(m_k)) < C_j(recv_j(m_k))
  But C_j(recv_j(m_k)) also has to be greater than the timestamp of any other event at p_j that happened before the arrival of the message m_k at p_j. Therefore:
    C_j(b) = max(C_i(a) + 1, C_j + 1)

3.4.2 Clock Correctness

Note that 3 types of time relationships are possible with this clock (see the lecture example):
1. One event happens before another.
2. One event is concurrent with another.
3. There is no clear order between 2 events. These will also be inferred to be concurrent!!

Note:
1. The last point above is not a problem for Lamport's clock, because it only satisfies the clock consistency condition and not the strong consistency condition (see below).
2. Even if we conclude from the timestamps that one event happened before another (e.g., 3@p1 → 4@p2) when they are actually concurrent, it is ok because:
   - Concurrency implies that swapping the order between them will not impact the rest of the system.
   - All observers in the system will make the same, uniform conclusion about the ordering.

Thus, Lamport's clock gives a mapping from all events in the system to a partially ordered list of timestamped events. (Partial because there is no order among concurrent events.) To establish a total order, we need some tie-breaking rule, e.g., events with the same timestamp from 2 different processes can be ordered based on the process ID, i.e., 3@p1 → 3@p2. As long as everyone in the system uses the same tie-breaking rule, the actual rule itself doesn't matter, since any consistent order between these events is acceptable.

Event counting: If timestamps are always incremented by 1, the clock can be used to estimate the minimum number of events in the system that occurred before the current event, including events in all nodes, even nodes that have not previously communicated with this node. Of course, this is a minimum estimate, so the actual number of events in the system may be much larger.

Lamport's clock is consistent, but not strongly consistent!
    e1 → e2  ⟹  C(e1) < C(e2)
    C(e1) < C(e2)  ⇏  e1 → e2

Impact of consistency vs. strong consistency:
- We cannot rely on Lamport's clock to get complete information about causality.
- Consistency alone is sufficient for correctness, since ordering can be inferred everywhere that it is needed.
- Lack of strong consistency leads to some loss of efficiency, because unrelated events may appear to be related and to require a specific ordering, which may be enforced by the system. E.g., 5@p2 appears to be earlier than 6@p3 even though they are concurrent; the system may try to enforce an ordering even though it is unnecessary.
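A minimal sketch of the two rules above: increment the local clock before an internal or send event, and on receive take the maximum of the incoming timestamp and the local clock, plus one. The process setup at the bottom is illustrative only.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Rule for an internal or send event: increment before the event."""
        self.time += 1
        return self.time

    def on_receive(self, msg_timestamp):
        """Receive rule: C_j(b) = max(C_i(a) + 1, C_j + 1)."""
        self.time = max(msg_timestamp + 1, self.time + 1)
        return self.time

# Illustrative run with two processes.
p1, p2 = LamportClock(), LamportClock()
t_send = p1.tick()              # p1 sends a message carrying this timestamp
p2.tick(); p2.tick()            # p2 has already had two local events
t_recv = p2.on_receive(t_send)  # receive is stamped 3
assert t_send < t_recv          # clock consistency: send happens before receive
```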
3.5 Vector Clock

A vector clock is a vector of scalars, with the dimensionality of the vector proportional to the number of nodes in the system (⟹ size overhead: O(N), unlike the O(1) size of the scalar clock). It is more powerful than the scalar clock, but also a more complex clock.

3.5.1 Clock Definition

Basic idea: Each node maintains its own view of its own time as well as of the time at the other nodes. Each node has a copy of the vector of size n, and the j'th element of the vector corresponds to that node's perception of the time at the j'th node.

How it works: Say process p_i has clock vt_i.
- If C_i is the Lamport clock of process p_i, then vt_i[i] = C_i.
- If C'_j is p_i's perception/knowledge of the Lamport clock in process p_j, then vt_i[j] = C'_j.

Rules to update the vector clock:
1. Before executing any event, update your own entry: vt_i[i] = vt_i[i] + d, for some d > 0.
2. Each message carries the vector clock vt of the sender at sending time. When a message is received, the receiver:
   (a) updates its knowledge of global time, i.e., of all other processes: vt_i[k] = max(vt_i[k], vt[k]) for 1 ≤ k ≤ n
   (b) executes rule 1 for itself, i.e., updates its own time
   (c) delivers/processes the message

- Rule 1 ensures that later events in the same process will always have a larger timestamp than older events.
- Rule 2 ensures that before a message is processed, the recipient has the latest view of the system (either the same view as the sender, or a more advanced view). This means that when the message is actually processed, its timestamp will be greater than those of all the events at the sender as well as at the recipient (since rule 1 is also executed!).

Rules to compare vector clocks:
- If every element of vector vt1 is less than or equal to the corresponding element of vt2:
    vt1 ≤ vt2  if  vt1[i] ≤ vt2[i] for all i
- If every element of vt1 is less than or equal to the corresponding element of vt2 AND at least one element of vt1 is strictly less than the corresponding element of vt2:
    vt1 < vt2  if  vt1[i] ≤ vt2[i] for all i  AND  vt1[k] < vt2[k] for some k
- If 2 timestamps are such that neither is strictly less than the other, they are considered concurrent:
    vt1 ∥ vt2  if  !(vt1 < vt2) AND !(vt2 < vt1)

The vector clock is both consistent and strongly consistent!
- Consistent, because if one event occurs before another, its timestamp is definitely less.
- Strongly consistent, because if the vector clocks indicate concurrency of events, the events are guaranteed to be concurrent.

Thus, vector clocks provide stronger consistency and higher efficiency (since we no longer reorder unrelated concurrent events; see the consistency discussion) at the cost of maintaining O(N) extra state, i.e., additional clock state at each node, but also additional data sent with each message. Some of this overhead can be reduced by using compression techniques.
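A sketch of the vector-clock update and comparison rules (taking d = 1). The three-process scenario at the end is an assumption for illustration; note that, unlike with the scalar clock, comparing whole vectors lets us detect concurrency.

```python
class VectorClock:
    def __init__(self, my_id, n):
        self.my_id = my_id
        self.vt = [0] * n

    def tick(self):
        """Rule 1: advance our own entry before any event."""
        self.vt[self.my_id] += 1
        return list(self.vt)

    def on_receive(self, msg_vt):
        """Rule 2: merge the sender's view, then advance our own entry."""
        self.vt = [max(a, b) for a, b in zip(self.vt, msg_vt)]
        return self.tick()

def leq(a, b):
    return all(x <= y for x, y in zip(a, b))

def concurrent(a, b):
    """Neither timestamp strictly precedes the other."""
    return not (leq(a, b) and a != b) and not (leq(b, a) and a != b)

p0, p1, p2 = VectorClock(0, 3), VectorClock(1, 3), VectorClock(2, 3)
m = p0.tick()                  # p0 sends a message carrying its vector [1,0,0]
recv = p1.on_receive(m)        # p1's vector becomes [1,1,0]
other = p2.tick()              # an unrelated event at p2: [0,0,1]
assert leq(m, recv) and m != recv   # send happened before the receive
assert concurrent(recv, other)      # and is concurrent with p2's event
```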
3.6 Matrix Clock

Assume process p_i has matrix clock mt_i. We extend the vector clock to a matrix -> represent time as a matrix.
- The clock value at (i, i) in the matrix clock of the i'th process holds its own scalar clock: mt_i[i, i] = C_i
- The clock values in row i of the i'th process hold its vector clock (similar to above): mt_i[i] = vt_i
- The clock values in every other row (say k) hold what process p_i thinks the vector clock of process p_k is: mt_i[k] = vt'_k

Every process maintains its view of every process' view of time (and not just its own view of every process' time). This is obviously more complicated, and more storage is required. So what do we gain? Since each clock knows what the other clocks think about everyone's view of time, garbage collection is possible. E.g., if process 1 can see that every process (itself included) already knows about process 2's events up to some time t, then any state about those events of process 2 that process 1 is still holding can be cleaned up. In other words, if mt_i[k, j] > t - that is, if "i's perception of what k thinks j's time is" is greater than t - for every k, then everything at p_j before t can be deleted:
    min over k of mt_i[k, j] > t  ⟹  everything at p_j before time t can be deleted!
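The garbage-collection rule can be read straight off the matrix: the minimum of column j over all rows of mt_i is a lower bound on what every process (as far as p_i knows) has seen of p_j's progress. A small sketch with an invented 3x3 matrix:

```python
def gc_bound(mt, j):
    """Timestamp up to which (as far as this node knows) everyone has seen p_j's events."""
    return min(row[j] for row in mt)

# Hypothetical matrix clock held at p_i for a 3-process system:
# row k = what p_i thinks p_k's vector clock is.
mt_i = [
    [7, 5, 6],
    [6, 5, 4],
    [5, 5, 6],
]
t = gc_bound(mt_i, 2)   # look at column j = 2
print(t)                # 4: state about p_2's events up to time 4 can be garbage-collected
```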
State in Distributed Systems

"Everything is in a state of flux, including the status quo."
— Robert Byrne, The 637 Best Things Anybody Ever Said

4.1 The problem of State

The goal is to capture a correct snapshot of the global state of a distributed system.

4.1.1 What is state?

Intuition: A distributed system is defined as a set of nodes connected by communication channels. The state of this system is then the state of the nodes and the state of the channels. To execute a program, each node goes through a series of events, both internal events and message send/receive events. Therefore, the state of the distributed system is defined by the sequence of these events. So, we also need to capture state transitions and their sequence.

Thus, state consists of:
- Process state: determined by all events that happened on that node until that point.
- Channel state: determined by the messages currently in-flight, i.e., send events not paired with a corresponding receive event on another node.
- State transitions: Any event changes the state of at least one of the components, i.e., nodes or channels.
  - Internal events → modify the state of a single node.
  - Message send/receive events → modify the state of 1 node + 1 channel.
  - Ordering of events = sequence of state transitions.
- Actual run: The actual sequence of events (or state transitions) that happened in the system.
- Observed run: The sequence of events that we observe.
  - IMPORTANT: This may not correspond to the actual run! We may not observe one of the events that actually happened, or we may not know the strict ordering between 2 events.

4.1.2 Cuts

- Cut: A curve that "cuts" the execution time in some manner so that events that occur before the cut are assumed to be completed.
  - In the case of a message in-flight, this means the send event was completed even if the receive wasn't.
- Consistent Cut: A snapshot of the system that divides the execution events so as to provide a possible ordering of events in the system.
  - A consistent cut need NOT correspond to a real situation that the system has actually been in, but it captures one possible situation that is "consistent" with the ordering of the events in the system.
- Pre-recording events: All events that happened before the time of taking the snapshot (i.e., before the cut).
- Post-recording events: All events that happened after the time of taking the snapshot (i.e., after the cut).

Intuition for cuts: Think of the cut as a line we draw on the time-diagram representation of the distributed system.
- It is not a strict vertical line, because we cannot guarantee a precise cut at the exact same time on all nodes. Our best guarantee is a cut that indicates the completion of certain events on each node.
- An arbitrary cut is NOT necessarily a consistent cut, i.e., a general cut does not have to give a "correct" description of the system.
- We can make deductions about events:
  - In-flight messages: By looking at the completed messages at any point, we can also deduce which messages are in-flight (i.e., the send completed but the receive did not).
  - Older events: If a message receive event happens prior to the cut, the send event also definitely happened before the cut. Similarly, if event e_i^(k+1) occurs before the cut, event e_i^k definitely does.
- Note that a cut that has, say, a receive event for a message preceding it, but not the corresponding send event, would not be considered consistent.

4.1.3 Challenges in capturing state

Capturing state in a distributed system is hard for multiple reasons.

No instantaneous recording:
- No globally synchronized clock: We can never capture events at each node at exactly the same time.
- No centrally initiated snapshot: For the same reason, we cannot assume that a centralized node will be able to instantaneously initiate a snapshot of all the other nodes.
  - Random network delays imply that even if node X initiates a snapshot at t0, nodes Y and Z may take the snapshot at times t1 and t2 with no guaranteed ordering between them.

Non-determinism:
- Non-deterministic computation: Because of concurrent events, at any point we can have multiple possible next events → makes ordering, consistency, etc. hard even within a single, multithreaded system.
  - Deterministic computation means that at any point in the execution, there is at most one event that can happen next.

4.2 System Model

For the remainder of this chapter, we will use the following model of a distributed system: the system consists of processes p_i that communicate by exchanging messages via channels c_i.

Channel properties:
- Directed: any channel carries messages in only one direction ⟹ it is possible to have 2 processes p and q where p can send messages to q but not the other way around!
- FIFO: messages are delivered in the order in which they are sent.
- Error-free: messages will not be corrupted.

Though the channel assumptions are not applicable to all systems, any system can be made to satisfy these properties by using TCP as the communication protocol (since TCP guarantees these properties).

4.3 Finding a Consistent Cut

Goal of the algorithm:
- Capture a snapshot of all distributed components, i.e., processes and channels.
- Ensure the snapshot forms a consistent cut.

Example (see the figure in lecture): Consider 2 processes exchanging a sequence of messages. For the example, ignore all internal events and consider only message sends/receives.
1. The processes start in initial states Sp0, Sq0. Say we try to capture a snapshot of the system at some point, Sq1.
2. We send a marker to the other node p (to capture a snapshot). Say the marker arrives at p at Sp2. This point is after m3 has been sent by p.
3. To find out what happened to m3, another marker message is sent from p to q. Say this message arrives at Sq3.
4. Now, at q, we realize we already took a snapshot of this node. This snapshot indicates that m3 has not arrived at q yet.
5. This means the message m3 must be in-flight.

Based on these steps, the recorded snapshot is: P is at Sp2, Q is at Sq1, channel PQ has m3 in flight, and channel QP is empty. Recorded global state = (Sp2, Sq1), (m3, ∅).

Chandy-Lamport Snapshot Algorithm:
The initiator node is the node that triggers the algorithm for capturing the state of the distributed system. It:
- saves its local state, and
- sends a marker token on all of its outgoing channels.

All of the other nodes in the system participate in the algorithm. On receiving the first marker on any incoming channel, a node:
- saves its local state,
- marks the state of that incoming channel as empty,
- propagates a marker message on all of its outgoing channels, and
- resumes execution, but also records the messages arriving on each of its other incoming channels until it sees a marker arrive on that particular channel. When a marker does arrive on one of those incoming channels, the process records the state of that channel as the set of messages received on it since the process originally captured its own state; these are the messages that were in-flight.

This algorithm guarantees a consistent state (but not necessarily a state the system actually went through previously, or the state the system is currently in!). A sketch of the per-node logic appears at the end of this section.

4.3.1 Assumptions of the algorithm

- There are no failures, and all messages arrive intact and only once. Current technologies such as TCP/IP allow this assumption to hold.
- The communication channels are unidirectional. This really just means that we need to separately consider each direction of the point-to-point message exchanges among the processes.
- The communication channels are FIFO ordered. TCP can give us the FIFO property.
- The algorithm does not interfere with the normal execution of the processes, meaning that the markers don't stall or reorder the processing of the other messages. This is also practically doable in principle: you can have a separate task or thread that uses a separate socket and port number for the marker messages.
- Each process in the system records its local state and the state of all of its incoming channels. The assumption here is that there are enough resources to be able to do this.

4.3.2 Properties of state captured

- Captures a consistent state.
- Doesn't necessarily tell us where exactly the system is in its execution at this specific point in time.
- The observed global state is a permutation of the actual state, or one of the other global states in the tree of possible states.

What does the last point mean? If we draw a tree of the possible states of the system (i.e., start at the initial state, then choose who sends the first message, etc., with multiple child branches for different concurrent next events), the algorithm results in one of the possible states in this tree. If we execute the algorithm at one point in time, we can capture different sequences of states. E.g., if we execute it at S21, we can capture either (a) S10, S11, S21 for a run that consists of events e11, e21, e12, or (b) S01, S11, S21 for a run that consists of events e21, e11, e12. Obviously, both end in S21, and both correspond to consistent narratives of the sequence of events so far.
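Below is a sketch of the per-process marker-handling logic described above. The transport (a send(channel, msg) callable) and the channel naming are assumptions made for illustration; only the bookkeeping that the algorithm prescribes is shown.

```python
class SnapshotProcess:
    """Per-node state for one run of the Chandy-Lamport snapshot."""

    def __init__(self, pid, incoming, outgoing, send):
        self.pid = pid
        self.incoming = incoming          # ids of incoming channels
        self.outgoing = outgoing          # ids of outgoing channels
        self.send = send                  # send(channel, msg) -- assumed transport
        self.local_snapshot = None
        self.channel_state = {}           # channel -> recorded in-flight messages
        self.recording = set()            # channels we are still recording

    def initiate(self, local_state):
        """Called on the initiator node only."""
        self._record(local_state)

    def on_message(self, channel, msg, local_state):
        if msg == "MARKER":
            if self.local_snapshot is None:
                self._record(local_state)          # first marker: snapshot now
                self.channel_state[channel] = []   # that channel is recorded as empty
            self.recording.discard(channel)        # stop recording this channel
        elif channel in self.recording:
            self.channel_state[channel].append(msg)  # message was in flight

    def _record(self, local_state):
        self.local_snapshot = local_state
        self.recording = set(self.incoming)
        self.channel_state = {c: [] for c in self.incoming}
        for c in self.outgoing:                     # propagate markers downstream
            self.send(c, "MARKER")
```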
4.4 Global State

4.4.1 Formal definition

Let the state recorded be S*, the sequence of computations done so far be seq, and the true initial and final states of the system be Si and Sj.
- The recorded state S* must be on some path from Si to Sj. In other words, S* is reachable from Si, and Sj is reachable from S*.
- There exists some sequence seq*, which is a permutation of seq, which can take the system from Si to Sj passing through S*.
- In this sequence, we know that S* is either the initial state, or it happened after the initial state Si, i.e., S* = Si or Si → S*.
- Similarly, S* is either the final state, or it happened before the final state Sj, i.e., S* = Sj or S* → Sj.

Theorem: If we follow the Chandy-Lamport algorithm, then for an execution of the system (seq*) which takes the system from an initial state (Si) to a termination state (Sj), the recorded state (S*) is such that it is reachable from the initial state, and the termination state is reachable from the recorded state.

Again, the recorded state need not be a real state that the system actually went through!

4.4.2 Benefits of Global State

The algorithm allows stable property detection, which lets us do better garbage collection, and unstable property detection, which can help detect correctness/consistency issues.

A stable property is a property of the system which, once it becomes true, remains true for the remainder of the execution of the system. E.g., deadlock, termination, token loss.
- If we know that some phase completed by S*, then it definitely completed before the final state Sj, so we can do garbage collection for it now.
- If a stable property is true in S*, then it is definitely true in Sj (since it remains valid for the rest of the execution).
- If a stable property is false in S*, then it is definitely false in Si (because if it were true in Si, it would be true for the rest of the execution).

An unstable property is one for which there is no guarantee that, once it becomes true, it remains true forever. E.g., buffer overflow, race condition, load spike, etc.
- Since S* might not have really occurred, if an unstable property is true in S*, it need not be true in the actual system or in the final state Sj (since unstable properties are transient).
- If we observe an unstable property to be true in S*, which is a possible state, then we know that this property can occur in the system for some valid sequence of actions (which is obviously a bad thing!) - this is like bug detection!

To summarize:
- For a stable property y: y(S*) = true ⟹ y(Sj) = true
- For an unstable property y: y(S*) = true ⟹ y(Sj) could possibly be true

Consensus in Distributed Systems

"Science is really about individual experts reaching a consensus."
— Alan Stern, Interview about Pluto's planet status

5.1 What is Consensus?

Consensus is the ability of multiple distributed processes to reach agreement about something - maybe the value of some shared state of the system, some action to be taken, or even the current timestamp. Why do we need it? It is critical for making forward progress in a distributed system. A common scenario: nodes need to agree upon the outcome of a transaction. E.g., in Project 3, the first process executes a command, responds to the client, then sends a backup request and dies. If the backup request is lost, the backup server never executes the command, and all future clients will see a different result, i.e., one indicating this command was never executed. Reaching consensus makes it possible for the system to be correct.
Challenges in reaching consensus (discussed earlier):
- Non-determinism
- Lack of a global clock
- Network delays
- Failures
- Malicious behavior

The ability of a system to reach consensus implies 3 key properties:
- Termination/Liveness: Guarantee of forward progress, i.e., each process should either move forward toward a consensus or terminate; the non-faulty processes eventually decide on a value.
- Correctness/Agreement/Safety: Guarantee that all processes decide on a single value (I mean, if we are calling it a consensus...).
- Validity/Safety: The value decided on must have been proposed by some of the processes (like, it can't be some arbitrary value pulled out of thin air!).

5.2 Theoretical Possibility of Consensus

5.2.1 System Model

We will consider the theoretical possibility of achieving consensus by studying the following model:
- Asynchronous: Messages may be reordered and delayed, but not corrupted.
- At most one faulty process.
- Fail-stop model: the faulty node simply stops working.
  - Indistinguishable from an infinite message delay in the system, so we can ignore the actual failure type/cause.

Obviously, these are simplifying assumptions, and real systems are more complex (and often break these assumptions). Why is this model still useful?
- If we prove the possibility of consensus: we can try to extend the result to real, complex systems through further investigation.
- If we prove impossibility: if consensus is impossible even in the simple model, it is obviously not possible in a more complex model.

Some relevant terms:
- Run: An ordering of events in the system.
- Admissible Run: A run with at most one faulty processor and all messages eventually delivered, i.e., matching the model defined above.
- Deciding Run: A run in which some non-faulty processors reach a decision.
- Totally correct consensus protocol: A protocol where all admissible runs are also deciding runs, i.e., for any ordering of messages where the model assumptions are met, the non-faulty processors will be able to reach a decision.
- Univalent configuration: A system configuration from which the system can reach only a single decision (a single value). Obviously, this would be part of a deciding run.
- Bivalent configuration: A system configuration from which multiple (at least 2) decisions are still possible. This means a consensus hasn't yet been reached, so this cannot be part of a deciding run.

5.2.2 FLP Theorem

Presented in the paper "Impossibility of distributed consensus with one faulty process" by Michael Fischer, Nancy Lynch, and Michael Paterson (F.L.P.).

Theorem 5.1 (FLP Theorem): In a system with one faulty process, no consensus protocol can be totally correct.

Proof intuition: Under the model assumptions, if it is possible to identify a starting configuration and an admissible run where the system does NOT reach a deciding state (i.e., it stays in a bivalent configuration), then consensus cannot be reached in at least one configuration, i.e., the protocol is not totally correct.
0. Consider a system where nodes are capable of making one of two decisions: 0 or 1. The assumptions from before hold: 1 faulty processor, messages are not corrupted, etc.
1. (Lemma 2 of the paper) In a distributed system, there is at least one initial configuration where the final decision is not known already (i.e., where the result is not known beforehand) ⟹ there is at least one initial bivalent configuration.
2. There must be a single event or message whose delivery converts the bivalent configuration into a univalent one.
5.2 Theoretical Possibility of Consensus

5.2.1 System Model

We will consider the theoretical possibility of achieving consensus by studying the following model:
- Asynchronous: messages may be reordered and delayed, but not corrupted.
- At most one faulty process.
- Fail-stop model: the node simply stops working.
  - Indistinguishable from infinite message delay in the system, so we can ignore the actual failure type/cause.

Obviously, these are simplifying assumptions and real systems are more complex (and often break these assumptions). Why is this model still useful?
- If we prove the possibility of consensus: we can try to extend the result to real, complex systems through further investigation.
- If we prove impossibility: if consensus is impossible even in this simple model, it is obviously not possible in a more complex model.

Some relevant terms:
- Run: an ordering of events in the system.
- Admissible run: a run with one faulty processor and all messages eventually delivered, i.e. a run matching the model defined above.
- Deciding run: a run in which some non-faulty processors reach a decision.
- Totally correct consensus protocol: a protocol where all admissible runs are also deciding runs, i.e. for any ordering of messages where the model assumptions are met, the non-faulty processors will be able to reach a decision.
- Univalent configuration: a system configuration from which the system can reach only a single decision (a single value). Obviously, this would be part of a deciding run.
- Bivalent configuration: a system configuration from which multiple (at least 2) decisions are possible. This means a consensus hasn't yet been reached, so this cannot be part of a deciding run.

5.2.2 FLP Theorem

Presented in the paper "Impossibility of distributed consensus with one faulty process" by Michael Fischer, Nancy Lynch and Michael Paterson (FLP).

Theorem 5.1: FLP Theorem
In an asynchronous system with one faulty process, no consensus protocol can be totally correct.

Proof intuition: Under the model assumptions, if it is possible to identify a starting configuration and an admissible run in which the system does NOT reach a deciding state (i.e. it ends up in a bivalent configuration), then consensus cannot be reached in at least one configuration, i.e. the protocol is not totally correct.

0. Consider a system where nodes are capable of making one of two decisions: 0 or 1. The assumptions from before hold: 1 faulty processor, messages are not corrupted, etc.
1. (Lemma 2 of the paper) In a distributed system, there is at least one initial configuration where the final decision is not already known (i.e. where the result is not known beforehand) =⇒ there is at least one initial bivalent configuration.
2. There must be a single event or single message (whose delivery) converts the bivalent system into a univalent system.
3. It is possible to delay the delivery of the message in point (2) so that it is never delivered, and thus the system never transitions to a univalent state =⇒ an admissible non-deciding schedule does exist for this system!

From these points, we can conclude that for a system with 1 faulty node where messages can be delayed or reordered, there is always an initial bivalent state for which an admissible non-deciding schedule exists.

Deriving Lemma 2:
- The initial configuration depends on the schedule of events; some configurations might have a predetermined solution. We don't have to consider these, because the definition of consensus is to choose a proposed (not predetermined) value.
- If we list all the configurations with their initial states, then any 2 adjacent configurations differ in the initial state of at least one process. If there is a fault in that differing process, the remaining processes form a bivalent configuration:
  - The admissible runs in both configurations consist of the same values (i.e. the initial states of the non-faulty processes).
  - The final state can have 2 different values (since the initial state of the faulty process is unknown or can't be deduced).
  - =⇒ there are 2 admissible runs that can result in different decisions, so this is a non-deciding (bivalent) configuration.
- Example: if config0 refers to "p0 has 0 as initial state, p1 has 0 as initial state", while config1 refers to "p0 has 0 as initial state, p1 has 1 as initial state", the two configurations differ in the initial state of p1. So if p1 is the faulty process, we have 2 admissible runs, both with p0 having 0 as its initial state, but they cannot accurately decide what the initial state of p1 was, leading to a bivalent configuration.

5.2.3 Is Consensus Really Impossible?

Given that failures are inevitable and networks have delays, this is our best system model. So, can we never have a correct distributed system? Don't give up yet! The theorem above is based on certain assumptions, i.e. any system that satisfies these assumptions cannot guarantee consensus. But real-world systems that change these assumptions or system properties can! This also means that the consensus protocols we shall see may fail to terminate if their specific conditions are not met (i.e. if the assumptions above are not broken).

Consensus Protocols

6.1 Goals of Consensus Protocols

Continuing our discussion of consensus from the last chapter, the goals of consensus protocols are:
- Agreement: Multiple parties should agree upon the same value.
- Validity: The chosen value must be valid.
- Safety: The term safety refers to:
  - Only a value that was proposed is chosen
  - Only a single value is chosen, and only a single chosen value is learnt
- Liveness: The term liveness refers to:
  - Some proposed value is chosen
  - Any chosen value is learnt

Liveness is related to forward progress, i.e. the system should not be stuck in the process of choosing. Safety is related to correctness, i.e. the system should choose correctly and consistently. From the FLP theorem, we know that we cannot guarantee both safety and liveness under the FLP assumptions.

3 types of nodes in a consensus protocol:
- Proposers: Nodes that propose a new value for the shared state
- Acceptors: Nodes that evaluate the proposals for the value and choose, often based on some kind of ordering (like timestamps)
- Learners: Nodes that need to access (read) the value

6.2 2-Phase Commit (2PC)

2-Phase Commit (2PC) originated from the database community.
- There is always a coordinator, who is assumed to not fail.
- Phase 1 - Vote Collection Phase: The coordinator proposes a value, participants vote.
- Phase 2 - Decision Phase: The coordinator tallies the votes, makes the decision and communicates the decision (commit).
- Blocking protocol: it blocks if there is a failure =⇒ it does NOT guarantee liveness.
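A minimal 2PC sketch to make the two phases concrete. The Participant class and its vote/act methods are hypothetical stand-ins; there is deliberately no failure handling, which is exactly why 2PC blocks when a node dies mid-protocol.

    # Minimal sketch (assumptions: participants expose hypothetical vote() and
    # act() methods; failures and timeouts are not modeled).

    class Participant:
        def __init__(self, name, will_commit=True):
            self.name = name
            self.will_commit = will_commit

        def vote(self, proposal):
            # Phase 1: each participant votes YES/NO on the proposed value.
            return self.will_commit

        def act(self, decision):
            # Phase 2: apply the coordinator's decision locally.
            print(f"{self.name}: {decision}")

    def two_phase_commit(coordinator_value, participants):
        # Phase 1 - Vote collection: coordinator proposes, participants vote.
        votes = [p.vote(coordinator_value) for p in participants]
        # Phase 2 - Decision: commit only if every participant voted YES.
        decision = "COMMIT" if all(votes) else "ABORT"
        for p in participants:
            p.act(decision)
        return decision

    print(two_phase_commit(42, [Participant("A"), Participant("B", will_commit=False)]))
    # prints "A: ABORT", "B: ABORT", then "ABORT"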
6.3 3-Phase Commit (3PC)

3-Phase Commit (3PC) also originated from the database community. It tries to solve the blocking problem above.
- Phase 1 - Prepare Phase: Votes are solicited.
- Phase 2 - Pre-Commit Phase: The decision is communicated (resources might be locked here).
- Phase 3 - Decision/Commit Phase: The actual commit is performed here.
- Non-blocking: a timeout during phase 2 or 3 causes the protocol to abort.
- Problem: Only works with the fail-stop model; it assumes that a failed node won't restart (and that a timed-out message won't be eventually delivered). This means there will be safety issues on fail-restart.

6.4 Paxos

Fun Fact: Original Paxos Paper
The original Paxos paper was written by Leslie Lamport in 1990. However, the original paper was initially not accepted for publication (a lot of people who don't get humor on this planet, smh!). Anyway, the original paper used a metaphor to describe the following problem:
- The Paxos parliament has members who pass decrees
- The members only work part-time
- They communicate by messages that may be delayed (or someone may not vote)
- Nobody is malicious
The paper then provided an algorithm with a set of rules the members must follow to agree on a single decree (and on multiple decrees, i.e. Multi-Paxos). Essentially, the paper described the consensus problem, provided a solution - an algorithm that is a state machine of rules/state transitions that must be followed to achieve consensus - and gave a proof of correctness of the algorithm.

6.4.1 Basics of Paxos

This section describes the workings of the Paxos consensus protocol.

Assumptions/System Model:
- Systems with asynchronous communication and non-Byzantine failures
- Agents operate at arbitrary speed
- Agents can have fail-stop failures (stop and restart)
- Agents have some persistent memory to recover information on restart
- Messages can take arbitrary times to reach their destination
- Messages can be duplicated, lost or reordered
- Messages cannot be corrupted

Important underlying ideas:
- State Machine Replication: Each node is a replica of the same state machine (or algorithm) and updates the state following the same rules.
- Majority Quorum: All decisions are based on a majority quorum.
  - Each decision is based on a majority quorum, so any 2 decisions will have some common members who agreed to both → even when some nodes fail, it is possible to disseminate the consensus decision.
  - If some nodes fail and don't participate in a consensus round, when they restart they will join a quorum that has at least 1 participant who knows about past decisions.
  - Helps tolerate fail-stop and fail-restart failures.
- Ordering: Order among messages is maintained using timestamps (so we can tolerate arbitrary message delays and message reordering).

6.4.2 Phases of Paxos

The protocol has 3 phases:
1. Prepare Phase: The node that initiates proposes an agreement round.
   - The proposal message is timestamped with an order number → allows proposals to be distinguished/ordered.
2. Accept Phase: Gather votes on whether agreement is possible and the value is agreed upon.
   - The initiator gathers responses from the participants.
   - Responses indicate whether they agree to participate, what value they agree to, and the timestamp of the proposal being agreed to.
3. Learn Phase: The agreed-upon value is learned by all.
   - Starts once the leading node gets enough nodes to agree to commit (a quorum).
   - The initiator communicates the agreement to all participants with the agreement round and agreed value.
   - If the initiator receives acks from a quorum of participants, it knows this round is complete.

Note: It is possible to learn in 2 rounds. Prepare and Accept form the write phase of the value; Learn is the read phase.

Why use a proposal number?
- The proposal number being part of the messages ensures ordering.
- It helps with dealing with fail-restart and delayed messages.
- There may be multiple ongoing proposals at a time; the number helps to keep track of messages.
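One common way to generate proposal numbers that are unique and totally ordered across all processes is to interleave a per-process round counter with the process id. This particular scheme is an assumption for illustration, not something the notes prescribe.

    # Minimal sketch (assumed scheme: n = round * num_processes + process_id).

    class ProposalNumberGenerator:
        def __init__(self, process_id, num_processes):
            self.process_id = process_id
            self.num_processes = num_processes
            self.round = 0

        def next(self):
            # Two processes can never collide (different remainders mod num_processes),
            # and the same process never reuses a number (round only increases).
            self.round += 1
            return self.round * self.num_processes + self.process_id

    gen0 = ProposalNumberGenerator(process_id=0, num_processes=3)
    gen1 = ProposalNumberGenerator(process_id=1, num_processes=3)
    print(gen0.next(), gen0.next(), gen1.next())   # 3 6 4 -> all distinct, totally ordered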
Prepare Phase
0. Each round is started by an initiator/leader/proposer.
1. The proposer selects a proposal number n and sends a prepare request (with number n) to the acceptors.
   - n is totally ordered among all processes, i.e. unique across all proposals and all processes:
     - No two processes can have the same n
     - The same process can never have 2 proposals with the same n
2. The acceptors who receive the message respond with an acceptor response. If an acceptor receives a prepare request with n greater than any it has received so far, it responds with a promise not to accept any more proposals numbered less than n.

5x read throughputs even at high write loads. As write throughput increases, CRAQ replicas have to maintain 2 copies, check with the tail, etc., which is why read throughput drops as writes increase.

Fault Tolerance

And oftentimes excusing of a fault
Doth make the fault the worse by the excuse
William Shakespeare, King John

8.1 Basics of Failures

From fault to failure:
1. A failure first starts with a fault.
   - The fault can be in hardware or software.
   - The system may function correctly even with the fault, until the fault is activated.
2. An activated fault leads to an error.
3. The error propagates through the system as it executes and leads to a failure.

Types of faults:
- Transient: Manifest once, then disappear
- Intermittent: Manifest occasionally
- Permanent: Once activated, the fault persists until it is removed

Types of failures:
- Fail-Stop: One or more components stop working
- Omission: Some actions are missing, i.e. components fail to send or receive some messages
- Timing: Affected system components may not meet timing requirements (i.e. this might cause delays).
  - Can lead to failures if the system relies on retry or reconfiguration on timeout.
- Byzantine: Arbitrary failures, i.e. the system continues to function but produces incorrect results.
  - Can be because of malicious nodes, software bugs, etc.

8.1.1 How to deal with failures

- Avoidance: Ideally, avoid all failures by detecting problems early (before the failure happens) and taking corrective action.
  - Problem: tends to be too expensive.
  - Also, prediction may be hard - so this is impractical.
- Detection: Detect that a failure has occurred (a minimal heartbeat sketch follows this list).
  - Common method: a heartbeat mechanism (or ping) to check whether nodes are responsive. A heartbeat can detect fail-stop but not Byzantine behavior (i.e. whether a node is behaving incorrectly).
  - Error Correction Codes: can be used to detect incorrect execution.
- Removal: Once detected, we want to remove the root cause of the failure and revert the system to the last clean state.
  - Rollback: take the system to a point before the fault manifested. (If the fault is transient, maybe we won't hit it again.)
  - Rollback may not be possible if the system performed some external actions.
- Recovery: The system should be able to recover from failure, i.e. detect, remove the root cause, revert to the last clean state and resume clean execution.
  - A system that can do all of these is considered fault-tolerant.
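A minimal heartbeat-based detector sketch, referenced from the Detection bullet above. It assumes each node's last heartbeat arrival time is tracked locally and that a node silent for longer than a timeout has fail-stopped; a Byzantine node that keeps sending heartbeats would not be caught.

    # Minimal sketch (assumptions: node names and the 2-second timeout are
    # illustrative; real detectors also have to tolerate message delays).

    import time

    class HeartbeatDetector:
        def __init__(self, timeout):
            self.timeout = timeout
            self.last_seen = {}            # node -> time of last heartbeat

        def record_heartbeat(self, node, now):
            self.last_seen[node] = now

        def suspected(self, now):
            # Nodes silent for longer than `timeout` are suspected to have failed.
            return [n for n, t in self.last_seen.items() if now - t > self.timeout]

    detector = HeartbeatDetector(timeout=2.0)
    t0 = time.monotonic()
    detector.record_heartbeat("node-1", now=t0)
    detector.record_heartbeat("node-2", now=t0)
    detector.record_heartbeat("node-1", now=t0 + 5.0)   # only node-1 keeps pinging
    print(detector.suspected(now=t0 + 5.0))             # ['node-2']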
8.2 Rollback-Recovery

Basic idea: When a fault is detected, roll back to a previous state (where the system was known to be correct), then re-execute with the fault removed. Rollback includes:
- Rolling back the effect of any messages transmitted after the fault was detected
- Rolling back any updates made to the system state

How to implement this:
- What state do we roll back to? From Consistent Cuts, we know the system has to roll back to a "consistent" state (from a time before the fault occurred).
  - The system need not roll back to an actual state that ever happened (a consistent state doesn't mean it was a real state, just a state that is consistent with the operations in the system).
- How far back should we take the consistent cut from? (There may be many consistent cut points before the current point; which one do we choose?)
  - We could try to find one by progressively rolling back changes in the system until we find a consistent cut. This can potentially roll back to the beginning, and all data can be lost!
  - Instead, choose one of 2 methods (see the sections below): Checkpointing, or Logging.

The granularity of operation may vary:
- Transparent/Full-System: Does not require any application-level modification, i.e. transparent to the application. The rollback-recovery system needs to track every individual message send/receive (and their ordering) and every state update (and its success/ordering). Obviously, a large overhead.
- Transaction-level: The application is modified to use transactional APIs (the system will ensure transactions are executed atomically). Executing distributed transactions atomically is a whole other problem (see Chapter 9).
- Application-specific: System-level tracking might record too many things; the application has better insight into when/what needs to be saved/checkpointed. For example, in the HPC domain there is a massive number of operations/state updates. Only useful when the performance degradation caused by saving too much state is a real concern; typically, this might be overkill.

The following sections discuss transparent systems, but the discussion can be generalized to transactions or even application-initiated checkpoints.

Definition 8.1: Checkpointing
Checkpointing is the process of saving the application state periodically so that the system can revert to the checkpoint on failure.
- During normal operation: periodically save the state of the application/node -> flush the checkpoint to disk or persistent storage.
- On failure: do any repair activities -> restore the checkpoint from persistent storage -> restart the system.
- On hardware failure: might repair/replace parts, or simply start a different node to replace the failed node.
- Advantage: Can restart immediately after the restore.
- Disadvantage: At each checkpoint, a lot of I/O is needed to save the full system state (this is where application-specific checkpointing helps).
  - Can reduce this with application-specific checkpoints that save only the necessary data.
  - Can track delta changes since the last checkpoint and only save those.
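A minimal checkpointing sketch. It assumes the application state is a picklable dict and that "persistent storage" is just a local file named checkpoint.pkl (a made-up path); a real system would flush to durable, possibly replicated storage and coordinate with other nodes.

    # Minimal sketch (assumptions: single node, picklable dict state,
    # local file standing in for persistent storage).

    import pickle

    CHECKPOINT_FILE = "checkpoint.pkl"   # hypothetical path

    def take_checkpoint(state, path=CHECKPOINT_FILE):
        # Normal operation: periodically save the full application state.
        with open(path, "wb") as f:
            pickle.dump(state, f)

    def restore_checkpoint(path=CHECKPOINT_FILE):
        # On failure: repair, restore the saved state, then restart from it.
        with open(path, "rb") as f:
            return pickle.load(f)

    state = {"balance": 100, "applied_ops": 7}
    take_checkpoint(state)
    state["balance"] = -999        # a fault corrupts the in-memory state
    state = restore_checkpoint()   # roll back to the last checkpoint
    print(state)                   # {'balance': 100, 'applied_ops': 7}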
Definition 8.2: Logging
Logging is the capturing of information about operations performed, so that the operations can be replayed to recover the latest state on failure.

The basic idea is to log information about the operations performed, i.e. the changes to different state variables. 2 styles are possible:
- UNDO: Store the original value of changed variables - can be used if we want to "undo" operations one by one.
- REDO: Store the new value of changed variables - the system is "rolled back" to the original application state, then the operations are "redone".

Obviously, the log has to be written to persistent storage.
- Advantages:
  - We are not storing the whole system state, so there is less I/O to persistent storage (and not much storage required).
  - The I/O happens while the application executes, so it is good to have less of it.
- Disadvantages:
  - Recovery takes much longer.
  - With a REDO log, even regular application operations become expensive - we need to look through the log to find the most recent value of dependent parameters.

Combining logging and checkpointing:
- The system can periodically perform a checkpoint.
- Between checkpoints, use logging to save updates (earlier logs can be discarded). Message sends/receives are also considered updates here.
- Advantages:
  - Limits the duration of recovery, i.e. the system doesn't have to go back to the beginning, it just needs to return to the most recent checkpoint with a consistent cut.
  - Limits space and bandwidth usage.
- Disadvantage: Need to detect a checkpoint that is a stable consistent cut.
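A minimal sketch of the combination just described: periodic checkpoints plus a REDO log of updates since the last checkpoint. The keys, values and in-memory "storage" are illustrative assumptions; a real system would persist both the checkpoint and the log.

    # Minimal sketch (assumptions: operations are simple key/value assignments;
    # both checkpoint and log live in memory here instead of durable storage).

    class RecoverableStore:
        def __init__(self):
            self.state = {}
            self.checkpoint = {}
            self.redo_log = []                    # updates since the last checkpoint

        def apply(self, key, value):
            self.redo_log.append((key, value))    # log the new value (REDO style)
            self.state[key] = value

        def take_checkpoint(self):
            self.checkpoint = dict(self.state)
            self.redo_log = []                    # earlier log entries can be discarded

        def recover(self):
            # Roll back to the most recent checkpoint, then redo the logged updates.
            self.state = dict(self.checkpoint)
            for key, value in self.redo_log:
                self.state[key] = value

    store = RecoverableStore()
    store.apply("x", 1)
    store.take_checkpoint()
    store.apply("y", 2)
    store.state.clear()               # simulate losing the volatile state
    store.recover()
    print(store.state)                # {'x': 1, 'y': 2}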
8.3 Checkpointing

System model for the next few sections:
- A fixed number of processors that may interact with the outside world.
- Processors communicate among each other only via messages.
- The network is non-partitionable, but other assumptions vary (FIFO or not? Reliable communication or not? Remember we can achieve these using TCP).
- The number of tolerated failures may vary depending on the protocol.

8.3.1 Uncoordinated Checkpointing

- Processes take checkpoints independently (see figure).
- On failure, the recovery line needs to be computed and each process reverts to it.
  - If the failing process rolls back to a particular checkpoint, the others have to roll back to consistent checkpoints. E.g. if P3 rolls back to a time before m3 was sent from P2, P2 should roll back to a point before the send event of message m3. This process continues until all processes are consistent.
  - To achieve this, we need to store dependency information (or track which messages were sent, so we can decide whether we have found a valid recovery line).
- Disadvantages:
  - Domino effect: Since the checkpoints are uncoordinated, rolling back to a point for P3 may force us to go to an earlier point for P2, which itself might force us to move to an earlier point for P1, etc., i.e. we might roll back a lot more than we need, maybe even to the beginning!
  - Useless checkpoints: (like in the example above, where many had to be discarded) many checkpoints are useless, as they can never be part of a globally consistent state.
  - Multiple checkpoints per process: Since we need to keep rolling back until we find a consistent state, we need to store multiple checkpoints for each process. Particularly bad since many of these checkpoints might also be useless - extra storage AND wasteful storage.
  - Garbage collection: To clean up these extra (and maybe useless) checkpoints, we need to run GC - but this may be complex and time-consuming.

8.3.2 Coordinated Checkpointing

Processes coordinate when they take the checkpoint so that the checkpoint results in a consistent state.
- Advantages:
  - No longer need a dependency graph to calculate the recovery line: the most recent checkpoint is valid!
  - No domino effect: all checkpoints taken are relevant checkpoints!
  - This means we need only one checkpoint per process.
  - No garbage collection needed.
- Disadvantages:
  - The coordination itself.
  - Delay of the initiator message. E.g. what if the initiator message is received at P3 after message m3 arrived, but at P2 before message m3 was sent?
    - Synchronous system: can simply take checkpoints every T units of time.
    - If message delivery is reliable and bounded: can come up with a round-robin sort of scheme.
  - Unnecessary checkpoints: Nodes may be forced to take checkpoints even though nothing has changed since that node's previous checkpoint.

8.3.3 Communication-Induced Checkpoints

Use a consensus protocol to decide whether or not to take the checkpoint now. Different approaches are possible:
- Blocking approach: Nodes do not process any other messages while this is going on. The initiator starts a 2PC or another protocol to run the consensus process.
- Non-blocking approach: the global snapshot algorithm.
  - Relies on special marker messages and needs the network to be FIFO.
  - To avoid the FIFO requirement, piggyback the marker on a regular message.
  - To capture the state of nodes not communicating with anyone else, periodic independent checkpoints are taken.
  - If a node receives a message carrying a marker, it first takes a snapshot, then processes the actual message. (All nodes snapshot a state from before this message was sent/received.)
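A minimal sketch of the non-blocking, marker-piggybacking rule above: snapshot first, then process the message that carried the marker. The message format and node state are assumptions for illustration, and the recording of channel state is omitted to keep the sketch short.

    # Minimal sketch (assumptions: messages are dicts, a piggybacked marker is
    # just a boolean flag, and only the local state is recorded).

    class Node:
        def __init__(self, name):
            self.name = name
            self.state = {"count": 0}
            self.snapshot = None

        def take_snapshot(self):
            self.snapshot = dict(self.state)

        def receive(self, message):
            # If the message carries a marker, snapshot first, then process it,
            # so the recorded state does not include the effect of this message.
            if message.get("marker") and self.snapshot is None:
                self.take_snapshot()
            self.state["count"] += message["delta"]

    n = Node("P2")
    n.receive({"delta": 1})                    # ordinary message
    n.receive({"delta": 1, "marker": True})    # marker piggybacked on a regular message
    print(n.snapshot, n.state)                 # {'count': 1} {'count': 2}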
8.4 Logging

Logging saves storage but needs more complex recovery (i.e. it saves I/O but needs more compute). Again, we need the logs to capture information such that a consistent cut can be obtained. This should not lead to orphaned events, i.e. a receive is in the log but the corresponding send isn't. Different approaches to achieve this:
- Pessimistic Logging: Each process logs everything to storage before the event's effects are propagated (basically, a message send is logged before the message is actually sent).
  - Obviously high overhead - writes to persistent memory are slow, and this is on the critical path. Can be improved slightly by using faster persistent memories.
- Optimistic Logging: Assume that the log will always be persisted before the failure, and allow effects to be reversed.
  - This assumption is hard to meet - we need to track dependencies.
  - We have to identify incomplete operations and remove their effects during recovery.
  - Any operation that has external effects (e.g. a robot movement) has to be delayed until the system can capture the information (and ensure this operation will not have to be reversed later).
- Causality Tracking: This is what we really need.
  - Operate optimistically when there are no dependencies.
  - Capture dependencies so that causally related events are deterministically recorded.
  - No risk of delaying external events indefinitely (causality tracking itself depends on some message exchange - once the causality messages are all delivered, it is safe to execute).

8.5 Which Method to Use?

The right choice depends on multiple factors:
- Workload type: How often is data updated, how big are the updates, are most updates to shared data, is fast recovery important, etc.
- Failure types: The kinds of possible failures and how they can be recovered from.
- System configuration: Cost/overhead of communication vs. storage, system scale, etc. (These may change over time.)

Changes from when the paper was written (see the table from the paper):
- Pessimistic logging was considered inefficient because writes to persistent memory are slow, but this is becoming faster with time.
- Coordinated checkpointing (the default for HPC) is favorable for most features, but delays operations - it requires system-wide coordination. This might become too much overhead depending on the application (or on network costs in the future).

Distributed Transactions

Those Are Not Transactions (Cassandra 2.0)
Dave Scherer, quoted in Martin Kleppmann's blog and Designing Data-Intensive Applications

9.1 Transactions and Distributed Transactions

A transaction is a group of operations that need to be applied together in an indivisible manner. Transactions are usually expected to provide the ACID properties, i.e. Atomicity, Consistency, Isolation and Durability.
- Atomicity: Either all the operations in the transaction are applied or none of them are. One transaction cannot execute with inputs that are the results of a partially applied, different transaction.
- Consistency: Described differently in different contexts. Typically, later transactions should see the effects of transactions committed in the past.
- Isolation: Concurrent transactions leave the database in a state that could be obtained if those transactions were executed in some serial order (similar to serializability).
- Durability: Once a transaction is committed, it stays committed even after a system failure. (It should be written to persistent storage.)

Example: Consider a transaction Tx defined by 2 operations A and B that update variables a and b. Assume 2 clients, C1 and C2, executing this transaction on the database.
- It is OK to have both C1 and C2 completed, or neither.
- It is not OK to have situations where C1 completed A but not B. C2 cannot see a value of a from C1 that was the result of A without also seeing the updated value of b (i.e. the result of operation B).

Thus, there are only 2 possible outcomes for a transaction: either it is committed (permanently visible) or it is aborted (all intermediate updates are lost).

Transactions are useful for:
- Concurrency control: If they complete atomically in isolation, then we can come up with an ordering among the different transactions to get a consistent output.
- Fault tolerance: We may not want to save partial state from transactions (e.g. the debit shows up in one bank account but the corresponding credit doesn't show up for the bank transfer!).

Definition 9.1: Distributed Transaction
A distributed transaction is similar to a regular transaction but executed across multiple nodes. We still need to guarantee ACID, but now across multiple nodes. Common solution: assign a leader/initiator that initiates transactions and executes consensus protocols across a) the multiple participants of a single distributed transaction and b) multiple transactions.
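A minimal sketch of atomicity for the Tx = {A, B} example above. The dict "database" and the staged-copy commit are illustrative assumptions, showing only that either both updates become visible or neither does; isolation, durability and the distributed case are out of scope.

    # Minimal sketch (assumptions: the database is a flat dict and operations
    # are plain functions that mutate a staged copy of it).

    def run_transaction(db, operations):
        staged = dict(db)                 # private copy: intermediate updates stay invisible
        try:
            for op in operations:
                op(staged)
        except Exception:
            return "ABORT", db            # all intermediate updates are lost
        db.update(staged)                 # commit: make all updates visible at once
        return "COMMIT", db

    def op_a(state): state["a"] = state["a"] + 10
    def op_b(state): state["b"] = state["b"] - 10

    db = {"a": 0, "b": 100}
    print(run_transaction(db, [op_a, op_b]))           # ('COMMIT', {'a': 10, 'b': 90})
    print(run_transaction(db, [op_a, lambda s: 1/0]))  # ('ABORT', {'a': 10, 'b': 90})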
9.2 Google Spanner

Google, Facebook, etc. deal with millions of user requests by building an underlying data management layer that is geographically distributed across the world. Spanner is:
- The global data management layer used by Google for Ads, Play, etc. Also available as a Cloud DB service.
- A layer that allows applications to interact with data through SQL queries (but unlike MySQL, it offers much higher scalability).

How does it work?
- Data is stored globally at geographically separated zones (say the US, Brazil, Russia, etc.).
- At each location, the database is sharded (partitioned) across 1000s of servers.
- At each location, data is also replicated across multiple sites (say datacenters in different cities in the US) to provide fault tolerance and availability.

9.2.1 Spanner Stack

Spanner is made of a stack of multiple components:
1. Bottom layer: Persistent storage - a distributed file system (like GFS, Colossus, etc.).
   - Ensures data is written to persistent storage, replicated, written out to disk, etc.
   - Data is organized as files (an extension of GFS).