Replication for Fault Tolerance
Document Details

FEUP
2024
Pedro F. Souto
Summary
This document provides an overview of replication for fault tolerance using quorum consensus. It covers read/write quorums, ensuring consistency with transactions, quorum size trade-offs, and Dynamo's quorum-based replication.
Full Transcript
Replication for Fault Tolerance
Quorum Consensus
Pedro F. Souto ([email protected])
October 2, 2024

Roadmap
▶ Quorums and Quorum Consensus Replication
▶ Ensuring Consistency with Transactions
▶ Playing with Quorums
▶ Dynamo Quorums
▶ Further Reading

Quorum Consensus Protocols
▶ Each (replicated) operation (e.g. read/write) requires a quorum, i.e. a set of replicas
▶ The fundamental property of these quorums is that, if the result of one operation depends on the result of another, then their quorums must overlap, i.e. have common replicas
▶ A simple way to define quorums is to consider all replicas as peers
  ▶ In this case quorums are determined by their size, i.e. the number of replicas in the quorum
  ▶ This is equivalent to assigning 1 vote to each replica
▶ In his work, Gifford proposed the use of weighted voting, i.e. the assignment of a different number of votes to each replica, so as to obtain different trade-offs between the performance and availability of the different operations

Read/Write Quorums Must Overlap
▶ The replicas provide only read and write operations
▶ These operations apply to the whole object
▶ Because the output of a read operation depends on previous write operations, the read quorum must overlap the write quorum: NR + NW > N, where
  NR is the size of the read quorum
  NW is the size of the write quorum
  N is the number of replicas
[Figure: three read/write quorum layouts over N = 12 replicas A..L: (a) NR = 3, NW = 10; (b) NR = 7, NW = 6; (c) NR = 1, NW = 12]

Quorum Consensus Implementation
IMP: Each object's replica has a version number
Read. The client:
1. Polls a read quorum, to find out the current version
  ▶ A server replies with the current version
2. Reads the object value from an up-to-date replica
  ▶ If the size of the object is small, it can be read as the read quorum is assembled
Write. The client:
1. Polls a write quorum, to find out the current version
  ▶ A server replies with the current version
2. Writes the new value with the new version to a write quorum
  ▶ We assume that writes modify the entire object, not parts of it
IMP: A write operation depends on previous write operations (via the version) and therefore write quorums must overlap: NW + NW > N
  ▶ Quorum (b) above (NR = 7, NW = 6, N = 12) violates this requirement
  ▶ This is not needed if, in step 1, the client polls a read quorum

Naïve Implementation with Faults
[Message diagram: initially A, B and C all hold (1, 2.3); the left client writes (2, 5.4) to A only; the right client writes (2, 1.7) to B and C; after the partition heals, each client reads back the other's value (1.7 and 5.4)]
▶ N = 3, NR = 2, NW = 2
▶ The first/left client attempts to write, but because of a partition it updates only one replica (A)
▶ The second/right client, in a different partition, attempts to write and succeeds
  ▶ The variable has different values for the same version
▶ The partition heals and each client does a read
  ▶ Each client gets a value different from the one it wrote, i.e. the protocol does not ensure read-your-writes
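The naïve protocol above can be captured in a few lines of code. The sketch below is a minimal Python illustration of the read and write steps described in the Quorum Consensus Implementation slide, with versioned replicas and fixed-size quorums; the Replica class and its get_version/read/write methods are assumptions made for this sketch and stand in for RPCs to real servers. Because it implements exactly those steps, it shares the weaknesses shown in the fault scenario above and the concurrency scenario below.

```python
# A replicated object's server-side state; get_version/read/write stand in for
# RPCs to a real replica server (an assumption of this sketch).
class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

    def get_version(self):
        return self.version

    def read(self):
        return self.version, self.value

    def write(self, version, value):
        self.version, self.value = version, value


def quorum_read(replicas, n_r):
    """Poll a read quorum for versions, then read from an up-to-date replica."""
    quorum = replicas[:n_r]                      # any N_R reachable replicas
    versions = [r.get_version() for r in quorum]
    newest = quorum[versions.index(max(versions))]
    return newest.read()


def quorum_write(replicas, n_w, value):
    """Poll a write quorum for the current version, then write version+1 to it."""
    quorum = replicas[:n_w]                      # any N_W reachable replicas
    current = max(r.get_version() for r in quorum)
    for r in quorum:
        r.write(current + 1, value)              # whole-object write, as assumed above


# N = 3, N_R = 2, N_W = 2, the configuration used in the two scenarios
replicas = [Replica() for _ in range(3)]
quorum_write(replicas, 2, 2.3)
print(quorum_read(replicas, 2))                  # (1, 2.3)
```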
Naïve Implementation with Concurrent Writes
[Message diagram: initially A, B and C all hold (1, 2.3); one client writes (2, 5.4) and the other (2, 1.7), to overlapping but unequal write quorums; afterwards each client reads back the other's value]
▶ N = 3, NR = 2, NW = 2
▶ Two clients attempt to write the replicas at more or less the same time
▶ The two write quorums are not equal, even though they overlap
▶ Again, the replicas end up in an inconsistent state
▶ Soon after, each client does a read
  ▶ Each client gets a value different from the one it wrote

Ensuring Consistency with Transactions (1/2)
▶ Gifford assumes the use of transactions, which use two-phase commit, or some variant
▶ The write (or read) of each replica is an operation of a distributed transaction
  ▶ We can view the sequence of operations in a replica on behalf of a distributed transaction as a sub-transaction on that replica
▶ If the write is not accepted by at least a write quorum, the transaction aborts
[Message diagram: initially A, B and C all hold (1, 2.3); transaction τ1 tries to write (2, 5.4), transaction τ2 writes (2, 1.7); after the two-phase commits the replicas hold (1, 2.3), (2, 1.7), (2, 1.7)]
▶ The left client will not get the vote from replica B and therefore it will abort transaction τ1
  ▶ The state of replica A will not be changed
▶ On the other hand, transaction τ2 commits, and its write will be effective

Ensuring Consistency with Transactions (2/2)
▶ Transactions also prevent inconsistencies in the case of concurrent writes
▶ Transactions ensure isolation, by using concurrency control
  ▶ Let's assume the use of locks
  ▶ Most likely, version-based (optimistic) CC is a better match
[Message diagram: both transactions poll the version and attempt to write; after the two-phase commits, the replicas written by τ1 hold (2, 5.4)]
▶ Server B processes the LHS client's write request first, and tries to acquire a write lock on behalf of τ1, but τ2 is holding a read lock
  ▶ Likewise for the write request from the RHS client
▶ Upon commit of τ1, server B detects the deadlock, and (locally) aborts τ2
▶ The outcome of the two-phase commit of τ2 will be abort, because server B has aborted τ2 in order to commit τ1

XA-based Quorum Consensus Implementation
IMP: Each object's access is performed in the context of a transaction
Read. The client:
1. Polls a read quorum, to find out the current version
  ▶ There is no need to read the object's state
  ▶ Only the first time the transaction reads the object
2. Reads the object state from an up-to-date replica
  ▶ Only the first time the transaction reads the object
Write (supporting partial writes). The client:
1. Polls a write quorum, to find out the current version and which replicas are up-to-date
  ▶ On the first time the transaction writes the object
  ▶ The object state may have to be read from an up-to-date replica
  ▶ Replicas may have to be updated
2. Writes the new value with the new version
  ▶ All writes by a transaction are applied to the same replicas, because these will be the only ones with an up-to-date version
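To make the transactional variant concrete, the sketch below models, in Python and under strong simplifying assumptions, a write that commits only if at least a write quorum of replicas votes yes in the first phase of a two-phase commit. The TxReplica class, its prepare/commit/abort methods and the single crude per-object lock are assumptions of this sketch, not Gifford's or XA's actual interfaces.

```python
# A replica acting as a participant in a two-phase commit; prepare/commit/abort
# and the single crude lock per object are assumptions of this sketch.
class TxReplica:
    def __init__(self, version=1, value=2.3):
        self.version, self.value = version, value
        self.locked_by = None        # which transaction holds the object's lock
        self.pending = None          # (version, value) staged during phase 1

    def prepare(self, tx, new_version, new_value):
        if self.locked_by not in (None, tx):
            return False             # lock held by another transaction: vote no
        self.locked_by = tx
        self.pending = (new_version, new_value)
        return True                  # vote yes

    def commit(self, tx):
        self.version, self.value = self.pending
        self.locked_by = self.pending = None

    def abort(self, tx):
        self.locked_by = self.pending = None


def transactional_write(tx, reachable, n_w, value):
    """Commit the write only if at least a write quorum of replicas votes yes."""
    new_version = max(r.version for r in reachable) + 1
    voters = [r for r in reachable if r.prepare(tx, new_version, value)]
    if len(voters) >= n_w:
        for r in voters:
            r.commit(tx)
        return True
    for r in voters:                 # not enough votes, e.g. tau_1 during the partition
        r.abort(tx)                  # staged state is discarded, replicas stay unchanged
    return False


# tau_1 can only reach replica A (partition): the write aborts, A stays at (1, 2.3)
a, b, c = TxReplica(), TxReplica(), TxReplica()
print(transactional_write("tau_1", [a], n_w=2, value=5.4), (a.version, a.value))
```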
Transaction-based Quorum Consensus Replication
▶ Transactions solve both the problem of failures and that of concurrency
▶ Transactions can also support more complex computations:
  ▶ E.g. with multiple operations and/or multiple replicated objects
▶ But transactions also have problems of their own:
  Deadlocks are possible, if transactions use locks
  ▶ Can a deadlock also occur when a transaction comprises a single operation on one object?
  ▶ Other concurrency control approaches, e.g. optimistic CC based on timestamps (or versions), may be used
  ▶ These also have trade-offs
  Blocking is possible, if transactions use two-phase commit
  ▶ If the coordinator fails at the wrong time, the participants, i.e. the servers, may have to wait for the coordinator to recover
  ▶ Meanwhile, the objects accessed by such a transaction may become inaccessible, causing aborts of other transactions
  ▶ It may be a good idea to use proxy servers as coordinators instead of clients, because the latter are failure-prone
  ▶ But this may reduce availability

Playing with Quorums (1/2)
[Figure: the same three quorum layouts over N = 12 replicas: (a) NR = 3, NW = 10; (b) NR = 7, NW = 6; (c) NR = 1, NW = 12]
▶ By choosing NR and NW appropriately we can get different trade-offs between the performance and availability of the different operations, e.g.:
  ▶ The quorum in (c) corresponds to a protocol known as read-one/write-all

Playing with Quorums (2/2)
▶ By assigning each replica its own number of votes, which may be different from one, weighted voting provides extra flexibility, e.g. assuming the crash probability of each replica to be 0.01 (source: Gifford79)
  ▶ In Example 1, the quorums are designed for performance rather than availability
Question: What is the advantage of a replica with 0 votes?

Quorum Consensus Fault Tolerance
▶ Quorum consensus tolerates unavailability of replicas
  ▶ This includes unavailability caused by both process (replica) failures and communication failures, including partitions
  ▶ Actually, quorum consensus replication does not require distinguishing between the two types of failure
▶ The availability analysis by Gifford relies on the probability of a replica/server crashing
▶ But we can follow the standard approach to evaluate the resiliency of a fault-tolerant protocol in a distributed system
Question: Let f be the maximum number of replicas that may crash simultaneously.
▶ What is the minimum number of replicas that we need?
▶ Do we need to change the quorum constraints? (Assume 1 replica, 1 vote.)
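The availability side of these trade-offs can be estimated with a short back-of-the-envelope computation. The sketch below assumes one vote per replica and independent crashes with probability p = 0.01, as in the discussion above; it does not reproduce Gifford's weighted examples.

```python
from math import comb

def quorum_availability(n, k, p=0.01):
    """Probability that at least k of n replicas are up, each crashing independently with probability p."""
    return sum(comb(n, up) * (1 - p) ** up * p ** (n - up) for up in range(k, n + 1))

# Layouts (a) and (c) from the figure, N = 12 replicas, one vote each:
print(quorum_availability(12, 3))    # read availability with N_R = 3: very close to 1
print(quorum_availability(12, 12))   # write availability of read-one/write-all: 0.99**12, about 0.886
```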
Dynamo
▶ Dynamo is a replicated key-value storage system developed at Amazon
▶ It uses quorums to provide high availability
▶ Whereas Gifford's quorums support a simple read/write memory abstraction, Dynamo supports an associative memory abstraction, essentially a put(key,value)/get(key) API
▶ Rather than a simple version number, each replica of a (key, value) pair has a version vector
▶ Dynamo further enhances high availability by using multi-version objects
  ▶ Thus sacrificing strong consistency under certain failure scenarios

Dynamo's Quorums
▶ Each key is associated with a set of servers, the preference list
  ▶ The first N servers in this list are the main replicas
  ▶ The remaining servers are backup replicas and are used only in the case of failures
▶ Each operation (get()/put()) has a coordinator, which is one of the first N servers in the preference list
  ▶ The coordinator is the process that executes the actions typically executed by the client in Gifford's quorums, as well as the actions required from a replica
▶ As in Gifford's quorums:
  put(.) requires a quorum of W replicas
  get(.) requires a quorum of R replicas
  such that R + W > N

Dynamo's Quorums
put(key, value, context). The coordinator:
1. Generates the version vector for the new version and writes the new value locally
  ▶ The new version vector is determined by the coordinator from the context, a set of version vectors
2. Sends the (key, value) pair and its version vector to the first N servers in the key's preference list
  ▶ The put() is deemed successful if at least W-1 replicas respond
get(key). The coordinator:
▶ Requests all versions of the (key, value) pair, including the respective version vectors, from the remaining first N servers in the preference list
▶ On receiving responses from at least R-1 replicas, it returns all the (key, value) pairs whose version vectors are maximal
  ▶ If there are multiple pairs, the application that executed the get() is supposed to reconcile the different versions and write back the reconciled pair using put()
Without failures, Dynamo provides strong consistency

Dynamo's "Sloppy" Quorums and Hinted Handoff
▶ In the case of failures the coordinator may not be able to get a quorum from the first N replicas in the preference list
▶ To ensure availability, the coordinator will try to get a sloppy quorum by enlisting the backup replicas in the preference list
  ▶ The copy of the (key, value) pair sent to a backup server has a hint in its metadata identifying the server that was supposed to keep that copy
  ▶ The backup server periodically scans the servers it is substituting for
  ▶ Upon detecting the recovery of such a server, it will attempt to transfer the copy of the (key, value) pair; if it succeeds, the backup server will delete its local copy
▶ At the cost of consistency: sloppy quorums do not ensure that every quorum of a get() overlaps every quorum of a put()
▶ Sloppy quorums are intended as a solution to temporary failures
  ▶ To handle failures of longer duration, Dynamo uses an anti-entropy approach for replica synchronization
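The one non-trivial step in the get() path described above is deciding which of the returned versions are maximal. The sketch below shows one way to do that comparison in Python; representing version vectors as dicts from server id to counter, and the sample responses, are assumptions of this sketch rather than Dynamo's actual data structures.

```python
# Version vectors modelled as dicts mapping server id -> counter (an assumption
# of this sketch); a response is a (version_vector, value) pair.

def dominates(vv_a, vv_b):
    """True if vv_a is at least as recent as vv_b in every entry."""
    return all(vv_a.get(s, 0) >= c for s, c in vv_b.items())

def maximal_versions(responses):
    """Keep only the (version_vector, value) pairs not dominated by another response."""
    return [(vv, val) for vv, val in responses
            if not any(dominates(other, vv) and other != vv
                       for other, _ in responses)]

# Two concurrent updates (neither dominates the other) plus an obsolete version:
responses = [({"A": 2, "B": 1}, "x"),
             ({"A": 1, "B": 2}, "y"),
             ({"A": 1, "B": 1}, "old")]
print(maximal_versions(responses))   # both concurrent versions survive; "old" is dropped
```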
Further Reading
▶ David K. Gifford. Weighted Voting for Replicated Data. In Proceedings of the 7th ACM Symposium on Operating Systems Principles (SOSP '79), 1979, pages 150-162
  ▶ Section 4 describes several refinements of the basic idea (weighted voting) that allow improving reliability or performance
▶ van Steen and Tanenbaum. Distributed Systems, 3rd Ed.
  ▶ Section 7.5.3: Replicated-Write Protocols
▶ Michael Whittaker, Aleksey Charapko, Joseph M. Hellerstein, Heidi Howard, Ion Stoica. Read-Write Quorum Systems Made Practical. In Proceedings of the 8th Workshop on Principles and Practice of Consistency for Distributed Data (PaPoC '21), pages 1-8
▶ Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. In Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles (SOSP '07), 2007, pages 205-220