Fault Tolerance and Consistency Lecture Notes (PDF)
Document Details
UCLouvain, Université catholique de Louvain
Etienne Rivière
Summary
These lecture notes cover fault tolerance and consistency in cloud computing, outlining the need for geo-replication, different consistency models, and methods for solving coordination problems. The document includes information about quizzes and announcements.
Full Transcript
LINGI2145 — Cloud Computing
Lesson 7: Fault Tolerance and Consistency
Pr. Etienne Rivière ([email protected])

Announcements
- Quiz #2 (on lectures 3-4): answers available after this lecture
- Quiz #3 (on lectures 5-6): available today, deadline November 6, 10h45 (+1 week smart)
- December 11 (lecture time): guest industry presentation by Mike Martin, Senior Cloud Solution Architect at Microsoft

Lecture objectives
- Discuss the need for geo-replication in Cloud data stores
- Present consistency models, coherence protocols and associated tradeoffs
- See how coordination problems can be solved elegantly using a specific storage service

Part 1: Replication and consistency for Cloud data storage

Need for fault tolerance
- All parts used in data centers can fail
- Servers fail: hardware problems, software problems
- Disks fail: study by Backblaze; SSDs are slightly better, but not perfect
- Humans fail: switching off a server without notice
- Electricity fails
- (figure © Backblaze Inc.)

Need for fault tolerance
- A data center as a whole can fail
- Problems with the power supply (an outage lasts longer than what UPS units can deliver)
- Problems with the network connection
- How can we preserve the availability of stored data in the face of failures?
- (photo: OVH data center in Strasbourg, 2021)

Replication
- Store multiple copies of the same data
- In the same data center: survive the fault/unavailability of one or more server(s)

Georeplication
- Replicate across data centers: survive faults/unavailability of an entire data center
- Bonus: serve local users with a local copy of the data
- (example: data centers in Ireland, East US and West US)

Maintaining coherent copies
- Updates to one copy must be propagated to the other copies
- This is the role of a coherence protocol
- But latencies between data centers can be important!
- (diagram: data centers in Ireland, East US and West US connected by a coherence layer; a write must be propagated to the other sites/replicas while concurrent updates and reads keep arriving)

Consistency
- Multiple copies of the same data means we have to deal with concurrent reads and writes
- How do we maintain coherent copies? How do multiple clients see modifications to a data item?
- Formalized as consistency properties; a coherence protocol enforces a consistency property
- Tradeoff: stronger consistency properties generally require more costly coherence protocols (costly here == more traffic, higher latencies)
- Let's start by defining the notion of a consistency model

Setup
- We will abstract stored objects as registers (historic name in the distributed computing community)
- We also use the term "process" for a distributed client interacting with stored object(s)
- Abstraction of a stored object: binary large object — BLOB (key/value store); JSON document (document store); data structure accessible through access methods (queue, set, list, …)

Single-process history
- A valid single-client history: W(v1) R←v1 R←v1 W(v2) R←v2 W(v3) W(v4); register value over time: v1, v2, v3, v4
- Implicit rules expected by the "sequential programmer": a read returns the last value written by the same process; a single process means a unique order of operations
- These rules form a consistency model: it defines the set of valid histories of operations and responses
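To make the register abstraction and the single-process rule above concrete, here is a minimal Python sketch (all names are illustrative, not from the lecture): a register that records its history of operations, and a check of the rule that every read returns the last value written by the same, single process.

```python
# Minimal sketch of the "register" abstraction used in these notes.
# A register is a stored object with read() and write(v); here it also records
# its history so that the single-process rule can be checked. Illustrative only.

class Register:
    def __init__(self, initial=None):
        self.value = initial
        self.history = []          # list of ("W", v) and ("R", v) entries

    def write(self, v):
        self.history.append(("W", v))
        self.value = v

    def read(self):
        self.history.append(("R", self.value))
        return self.value

def valid_single_process(history):
    """The 'sequential programmer' rule from the slide above:
    every read returns the last value written by the same (single) process."""
    last_written = None
    for op, v in history:
        if op == "W":
            last_written = v
        elif v != last_written:
            return False
    return True

# The valid single-client history from the slide: W(v1) R R W(v2) R W(v3) W(v4)
r = Register()
r.write("v1")
assert r.read() == "v1"
assert r.read() == "v1"
r.write("v2")
assert r.read() == "v2"
r.write("v3")
r.write("v4")
print(valid_single_process(r.history))   # True
```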
Concurrent history (with instantaneous operations)
- First client: W(v1) R←v1 R←v2 W(v4); second client: R←v1 W(v2) R←v2 W(v3) R←v4; register value over time: v1, v2, v3, v4
- Change: a read does not necessarily return the last value written by the same client (the first client's second read returns v2, not v1)
- Relaxation: a read operation returns the last value written by any process at the moment it is called
- The history still respects the ordering of operations by each client

Concurrent history (with instantaneous operations)
- First client: W(v1) R←v2 R←v2 W(v4); second client: R←v1 W(v2) R←v2 W(v3) R←v3; register value over time: v1, v2, v4, v3
- This is another valid global history for the same local operations from the two clients
- Accesses can simply be scheduled differently (speed of processes, scheduler actions, etc.)

But operations are not instantaneous in practice!
- Processes are distant from each other: replicas in different clouds with geo-replication, WAN links between clouds
- Operations take time to be processed: there is a delay between sending the read/write operation and seeing the result
- As a result, operations are interleaved in time

A non-coherent history (according to our previous consistency criterion)
- First client: W(v2); second client: a read invoked while the register still holds v1 returns v2; register value over time: v1, v2
- The second client's read does not return the value present when its read was initiated
- This violates our (poorly defined) consistency criterion
- Considering the return time of the operations instead leads to the same problem

The symmetry
- First client: W(v2); second client: a read returns v1, although the value is now v2; register value over time: v1, v2
- We need a consistency criterion that is relaxed and allows reasoning about concurrent operations that take time to complete

Linearizability
- Assume each operation happens atomically at some time point between its invocation and its return
- This allows deciding on a total, linear order
- Every read starting later than the return of a write returns the value of that write, or that of a later write
- Prohibits stale reads and non-monotonic reads
- An object implementation is linearizable if all possible histories respect the sequential specification for this object
- A strong consistency model, and easy to reason about; but there is a price to pay!

Linearizable history
- First client: W(v2) R→v3 W(v4); second client: W(v3) R→v2 R→v4; register value over time: v1, v2, v3, v4

Sequential consistency
- Allow operations to appear as if they took place atomically, but possibly before their invocation (reads) or after their completion (writes)
- Preserves the order of operations for each client ("read your writes" semantics)
- All clients see the writes as appearing in the same order, but this order is not required to respect the "real-time" order of invocations
- Still forms a global, unique, and total ordering of operations
- Weaker than linearizability: allows more valid histories

Sequentially consistent history
- First client: W(v2) W(v3); second client: R→v1 R→v2 (reads "in the past"); register value over time: v1, v2, v3
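The stale-read rule quoted above ("every read starting later than the return of a write returns the value of that write, or that of a later write") can be checked mechanically on timed histories. The sketch below checks only that necessary condition, not full linearizability, and it assumes writes do not overlap each other, as in the slide examples; the history encoding and field names are illustrative.

```python
# A minimal check of the stale-read rule for linearizability, under the
# assumption that writes do not overlap each other (true in the slide examples).
# This is a necessary condition only, not a complete linearizability checker.

def no_stale_reads(ops):
    """ops: list of dicts with keys kind ('W'|'R'), value, start, end."""
    writes = sorted((o for o in ops if o["kind"] == "W"), key=lambda o: o["start"])
    order = {w["value"]: i for i, w in enumerate(writes)}   # real-time write order
    for r in (o for o in ops if o["kind"] == "R"):
        completed = [w for w in writes if w["end"] < r["start"]]
        if completed:
            latest = max(order[w["value"]] for w in completed)
            if order.get(r["value"], -1) < latest:
                return False   # read returned a value older than the latest
                               # write that completed before the read started
    return True

# The "symmetry" history above: W(v2) has completed, yet a later read still
# returns v1. (The initial value v1 is modeled here as a first write.)
history = [
    {"kind": "W", "value": "v1", "start": 0, "end": 1},
    {"kind": "W", "value": "v2", "start": 2, "end": 3},   # first client
    {"kind": "R", "value": "v1", "start": 4, "end": 5},   # second client, stale
]
print(no_stale_reads(history))   # False: not linearizable (but it is
                                 # sequentially consistent: the read can be
                                 # ordered "in the past", before the write)
```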
Composability
- Ability to combine individual operations and keep safety properties
- Combining = accessing different objects; safety = keeping a unique visible order of updates
- Sequential consistency is not composable: the combination of sequentially consistent histories is not necessarily sequentially consistent
- Two processes reading multiple registers may not see the same order of writes to these registers; the same write-visibility order is guaranteed only when reading from a single register
- Linearizability, however, is composable

Implementing coherence
- Imposing order requires coordination
- Order = constraints on the visibility of writes
- The more constraints on the order (the more histories we exclude), the more strict and costly the coordination protocol
- Let's discuss some implementation options; the examples that follow give an idea of some classic protocols (details are omitted)

Options for implementing coherence
- When are replicas updated?
- Option 1, synchronous replication: immediately; reply to the client only after the operation completes
- Option 2, asynchronous replication: in the background, later; reply to the client immediately
- Algorithmic and performance tradeoffs: favor read performance vs. write performance (or the reverse); performance under faults

Linearizability: synchronous, write-all
- ☹ The write goes to all replicas (slow) and blocks if a single replica is faulty
- 😀 Reads are served from the local replica (fast)
- Concurrent writes must still be ordered, or replicas will diverge, which breaks consistency
- Example: concurrent W(v1) and W(v2); replica 1 receives W(v1) then W(v2) and ends with v2; replica 2 receives W(v2) then W(v1) and ends with v1
- Solution 1, primary-copy: designate a primary replica and send it all writes; the primary broadcasts all writes in a unique order it decides
- Solution 2: broadcast and use timestamps to order writes; W(v1) gets timestamp 1 and W(v2) timestamp 2; replica 1 receives W(v1,1) then W(v2,2) and ends with v2; replica 2 receives W(v2,2), ignores W(v1,1), and also ends with v2
- (based on slides by Marcos Aguilera)

Linearizability: synchronous, write-quorum
- ☹ Write to a majority of replicas (slow)
- ☹ Read from a majority of replicas (slow); guaranteed to read one copy of the "latest" written value
- 😀 Non-blocking under failures of a minority of replicas
- Concurrent writes must still be ordered
- Solution 1: designate a primary replica. Not a good idea here: we lose the non-blocking advantage if the primary fails
- Solution 2: use timestamps to order writes
- (based on slides by Marcos Aguilera)

How can we get timestamps?
- In a distributed system, time is not synchronized across nodes; if not synchronized, a later write might be overwritten by an earlier one
- Option 1: obtain the timestamp from a majority of replicas (requires another round-trip)
- Option 2: make assumptions about clock drift
- Google's TrueTime API exposes clock-drift bounds and is used in the Spanner strongly-consistent geo-replicated DB; updates are delayed to ensure no reordering can happen due to clock drift
- Clock-drift bounds can be minimized by using GPS signals in the data centers (the GPS system was built to be well synchronized!)
- (based on slides by Marcos Aguilera)
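A minimal, in-process sketch of the write-quorum option with timestamp ordering described above. It is illustrative only: a single counter stands in for the timestamp source discussed just above, and replicas are plain local objects rather than remote servers.

```python
# Sketch of synchronous write-quorum replication with timestamp ordering:
# writes go to a majority of replicas with a timestamp, reads query a majority
# and keep the value with the highest timestamp, so any read intersects the
# last completed write in at least one replica. Illustrative, not a real protocol.

import random

class Replica:
    def __init__(self):
        self.value, self.ts = None, -1
    def store(self, value, ts):
        if ts > self.ts:                 # ignore writes older than what we hold
            self.value, self.ts = value, ts
    def load(self):
        return self.value, self.ts

class QuorumRegister:
    def __init__(self, n=5):
        self.replicas = [Replica() for _ in range(n)]
        self.quorum = n // 2 + 1
        self.clock = 0                   # stand-in for a real timestamp source

    def _majority(self):
        return random.sample(self.replicas, self.quorum)

    def write(self, value):
        self.clock += 1
        for r in self._majority():       # "block" until a majority has stored it
            r.store(value, self.clock)

    def read(self):
        replies = [r.load() for r in self._majority()]
        return max(replies, key=lambda vt: vt[1])[0]   # highest timestamp wins

reg = QuorumRegister(n=5)
reg.write("v1")
reg.write("v2")
print(reg.read())   # "v2": any two majorities intersect in at least one replica
```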
Asynchronous replication
- 😀 Operations first execute on the local replica and the reply is sent immediately to the client
- Updates are then sent to the remaining replicas in the background
- ☹ Updates may be lost if the local replica fails before doing this
- How to deal with conflicting updates that are sent asynchronously to different replicas? There are many clever options, but we shall only describe the simplest: asynchronous primary-copy
- (based on slides by Marcos Aguilera)

Sequential consistency: asynchronous primary-copy
- Again, one replica is the primary; updates at the other replicas are prohibited
- Writes are forwarded to the primary, which broadcasts them in some order to all other replicas, but this happens after replying to the client
- Provides sequential consistency: reads "in the past"
- Read-your-writes holds only when clients always contact the same data center (otherwise there is a risk of reading past values)
- Still OK for many applications!
- (based on slides by Marcos Aguilera)

Availability and partitions
- Strong consistency models require updating replicas of the data to ensure order guarantees
- Two other criteria are important for a replicated cloud storage system
- Availability: is the system responsive within a bounded time to operations sent by clients?
- Partition-tolerance: can the system continue its operation despite arbitrary partitions due to network failures or disconnections?

Partition
- (diagram: the connection to the data center in Ireland is stopped; the East US and West US data centers keep their coherence layer, and writes must still be propagated to the other sites/replicas under concurrent updates and reads)
- Partitions between data centers happen in practice
- Inside a data center, a proper network fabric and link redundancy make them less stringent, but this is less the case between data centers
- Any situation where a part of the replicas is unreachable is a partition (example with Dropbox later)

Types of partition
- A study by Alquraan et al. [OSDI 2018] looked at 136 system failures due to network partitions, across 25 different cloud systems
- Failures requiring client access during the network partition: no client access necessary 28%; client access to one side only 36%; client access to both sides 36%
- Failures caused by each type of network-partitioning fault: complete partition 69.1%; partial partition 28.7%; simplex partition 2.2%
- Finding: while the majority (69%) of the failures require a complete partition, a significant percentage of them (29%) are caused by partial partitions
- Partition types: complete partition (the system is split into two disconnected groups); partial partition (the partition affects some, but not all, nodes in the system); simplex partition (traffic flows only in one direction)
- Partial network-partitioning failures are poorly understood and tested, even by expert developers; for instance, most of the network-partitioning failures in Hadoop MapReduce and HDFS are caused by partial partitioning faults
- (figures reproduced from the paper, with permission)
The CAP impossibility result
- The theorem states that between Consistency (any strong form: linearizable, sequential, causal), Availability and Partition-tolerance, it is not possible to guarantee all 3 properties
- Only two options are actually possible, denoted by their letters
- CP: sacrifice Availability; the system remains strongly consistent by refusing to serve some requests when there is a partition
- AP: sacrifice Consistency; the system continues to serve requests when there is a partition, accepting that strong consistency can be violated
- The implementations we discussed previously all ensure CP: e.g., with quorum writes, by definition only one of the partitions can contain a majority, and requests arriving at the other will not be served

AP system: Amazon Dynamo
- Used to implement the Amazon S3 service
- Warning: Amazon DynamoDB is actually a CP system
- Implemented as a Distributed Hash Table (DHT): each stored item is identified by a key (hash of its URI); each server is in charge of a range of keys; all servers know the assignment of servers to ranges
- Addition and removal of servers is handled with consistent hashing
- Example: 1. hash(URL) → page ID (e.g., "apple.com" → 0.4 on the [0,1] ring); 2. request the page from the closest server

The Dynamo DHT
- Servers are organized in a virtual ring
- Each server is in charge of all keys between the previous node (excluded) and itself (included)
- Items are replicated on the three subsequent nodes: nodes B, C and D store keys in range (A,B], including key K
- (Figure 2 from the original paper: partitioning and replication of keys in Dynamo)
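A minimal sketch of the ring placement described above for the Dynamo DHT: servers and keys are hashed onto the same circular space, each key belongs to its clockwise successor and is replicated on the next servers on the ring. The hash function (MD5) and server names are illustrative choices, not Dynamo's actual implementation.

```python
# Sketch of consistent hashing with replication on the next nodes of the ring,
# as described on the Dynamo slides above. Illustrative only.

import bisect
import hashlib

def ring_hash(name):
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers, replicas=3):
        self.replicas = replicas
        self.points = sorted((ring_hash(s), s) for s in servers)

    def preference_list(self, key):
        """Servers in charge of `key`: its clockwise successor on the ring,
        plus the following replicas-1 servers."""
        hashes = [h for h, _ in self.points]
        idx = bisect.bisect_right(hashes, ring_hash(key)) % len(self.points)
        return [self.points[(idx + i) % len(self.points)][1]
                for i in range(min(self.replicas, len(self.points)))]

ring = Ring(["server-A", "server-B", "server-C", "server-D", "server-E"])
print(ring.preference_list("apple.com"))   # three consecutive servers on the ring
# Adding or removing one server only moves the keys adjacent to it on the ring,
# which is the point of consistent hashing.
```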
Dynamo coherence
- Writes are handled by any of the replicas
- A version vector encodes the number of writes seen by each replica prior to the current write
- This allows tracking (but not enforcing) causality between versions of an object
- Tolerates concurrent conflicting writes; this breaks strict consistency (no agreed-upon order between replicas) but works under partitions!
- Reads can return multiple conflicting versions
- (Figure 3 from the original paper: version evolution of an object over time)

Eventual consistency
- All data centers will eventually (i.e., sometime in the future) have seen the same set of writes
- But this does not guarantee the order in which these writes will have been seen!
- Consistency conflicts must be resolved
- Arbitrary choice: last-writer-wins
- Application-driven choice: reads return multiple values and the application implements conflict resolution
- Specific data structures that provide strict convergence guarantees (CRDTs), but they are not always applicable

Advantages of eventual consistency
- 😀 Performance: reads and writes can be served very fast by addressing a single replica
- 😀 Supports continuity of service under partition
- Original motivating example: the Amazon shopping cart
- If an item gets added to the cart on one replica, and removed from the cart on another replica but not from the former, the business choice is an inclusive merge procedure: keep potentially deleted items but do not remove items
- Apparently less costly for Amazon to offer an item shipped by mistake to the customer!
- Conflicting versions happen rarely in practice in Amazon's data centers, outside of partitions

Disadvantages of eventual consistency
- ☹ More complex for the programmer: last-writer-wins resolution might lose important updates; it is complex to predict and handle all possible conflicts
- ☹ No guarantee on the ordering of writes seen in each data center, which can lead to unwanted behaviors at the application level (examples follow)
- ☹ Might need to expose conflicts to the user (example of Dropbox)

Anomaly 1: comment reordering
- (image by Wyatt Lloyd, Facebook)

Anomaly 2: ooops

Anomaly 2: photo privacy violation
- (image by Wyatt Lloyd, Facebook)

Eventual consistency and Dropbox
- Dropbox sacrifices consistency for availability
- Clear reason for doing this: support offline modifications to the local replica of the file system
- Local copies are like geographically distant replicas
- Two experiments: creation of a conflict when the system is partitioned; consistency of updates in a non-partitioned system subject to concurrent updates
- (screenshots: create and edit a new document; the network connection is lost: partition; editing continues offline; on a connected laptop, the file is viewed and edited with no awareness of the partition; when the network connection works again, a conflict is detected and left to the user to fix)

Conflicts arise even in non-partitioned mode
- Two connected users (no partition) concurrently editing
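Before the takeaway, a small sketch tying together the mechanisms above: version vectors to detect concurrent, conflicting versions (as on the Dynamo slides), and an application-driven "inclusive" merge in the style of the shopping-cart example. Class, field and replica names are illustrative.

```python
# Sketch of version vectors (one counter per replica) and of two resolution
# choices discussed above: detecting conflicting versions, then resolving them
# with an application-level "inclusive" union merge. Illustrative only.

class Version:
    def __init__(self, items, vector):
        self.items = set(items)      # e.g. a shopping cart
        self.vector = dict(vector)   # replica name -> number of writes seen

    def descends_from(self, other):
        """True if this version has seen every write the other version has."""
        return all(self.vector.get(r, 0) >= c for r, c in other.vector.items())

def concurrent(a, b):
    return not a.descends_from(b) and not b.descends_from(a)

def merge_inclusive(a, b):
    """Application-driven resolution: keep the union of both carts and a
    vector that dominates both (the 'keep possibly deleted items' choice)."""
    vector = {r: max(a.vector.get(r, 0), b.vector.get(r, 0))
              for r in set(a.vector) | set(b.vector)}
    return Version(a.items | b.items, vector)

# Two replicas diverge under a partition: one adds a book, the other removes
# it and adds a mug. Neither vector dominates the other, so the writes conflict.
left  = Version({"book", "lamp"}, {"ireland": 2, "east-us": 1})
right = Version({"lamp", "mug"},  {"ireland": 1, "east-us": 2})
print(concurrent(left, right))             # True: conflicting versions
print(merge_inclusive(left, right).items)  # {'book', 'lamp', 'mug'}
```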
Dropbox experiment takeaway
- While a CP system sacrifices availability when there is a partition (but is available otherwise), an AP system sacrifices consistency even when there are no partitions!
- The choice is driven by the ratio of cost to probability of a conflict

Part 2: Coordination kernels and the example of Apache ZooKeeper

Introduction
- Many distributed computer systems require some form of coordination: agree on a parameter value; decide which component is in charge of some processing; and many more
- Example of coordinating computation in Apache Hadoop: split the task of processing a large amount of data into many tasks, handled in parallel by a fleet of worker processes
- When should a worker work, and which data should it process?
- When is the computation finished or ready to move to the next phase?
- How do we decide on a new worker to replace a worker that fails?

Example of difficulties with a simple model
- Work assignment in Hadoop: the master assigns work, and a worker executes the task assigned by the master
- 🤯 The master crashes: single point of failure; no work is assigned; a new master must be selected; the state of the previous master is lost!
- 💥 A worker crashes: some tasks won't be executed; we need the ability to reassign tasks
- 💥 A worker does not receive its assignment: the task is not executed; we need to guarantee that the worker receives the assignment (or the ability to re-assign it)

Duplicate the master?
- ☹ Difficulty = synchronizing the states of the primary master and the backup master (complex state synchronization)

Save the master state in some data store?
- ☹ All worker accesses still go through the primary master (bottleneck)
- ☹ Detection of worker faults by the primary master is not scalable
- Connections to workers need to be recreated by the backup master

Use of a coordination service
- 😀 Master-worker interactions go through a dedicated service
- All shared state is stored reliably by the service
- The service handles node liveness monitoring
- Alternative 1: no master! Workers implement the coordination logic through a common state shared and updated by all; this can be difficult to get right and can be inefficient
- Alternative 2: elected master. All workers can take the role of the master; an election mechanism elects a new master (leader)
Coordination is difficult
- Many fallacies to stumble upon: the network is not reliable; the latency is not known in advance; the bandwidth is finite; there are multiple administrators
- It is difficult to know which node is alive, or not
- Many impossibility results (FLP, CAP, ...)
- Several requirements across applications
- Duplicating is bad; duplicating poorly is even worse

Apache ZooKeeper
- A generic coordination service, also called a coordination kernel
- Provides low-level synchronization primitives on the server side
- Can be used on the client side to build complex coordination recipes

ZooKeeper: the big picture
- A ZK ensemble of 2f + 1 servers: one leader and several followers
- Clients (10,000s of them) connect to the servers through the ZK client library, using sessions

ZooKeeper data model
- Session-based semantics: each client is permanently connected to one of the ZK servers; each client is monitored for faults (with keep-alive messages)
- Clients can create/delete/update shared znodes; znodes can carry data if necessary; znodes are organized hierarchically, like a file system
- Two types of znodes: regular (persistent, can have children znodes); ephemeral (disappears if the creator's session ends, no children znodes)
- Recipes are implemented by reading and writing znodes

ZooKeeper data tree
- (example tree: / (regular); /app1 and /app2 (regular); /app1/p_1, /app1/p_2, /app1/p_3 (ephemeral, no children))

ZooKeeper API: create
- create(path, data, flags): creates a znode with name path and value data
- flags: ephemeral | regular
- sequence: append a monotonically increasing counter to the name; the counter is greater than the parent node's counter and greater than that of all existing children of the parent
- makepath: recursively create the path if it does not exist
- In Python: create(path, value='', acl=None, ephemeral=False, sequence=False, makepath=False)

ZooKeeper API: setData
- setData(path, data, version): writes data to the znode of name path
- The version number is optional: the set succeeds only if the provided version number matches the current version (conditional write)
- In Python: set(path, value, version=-1)

ZooKeeper API: delete
- delete(path, version): deletes the znode with name path
- version is optional: conditional delete; if the version does not match, a BadVersionError is returned
- recursive allows deleting the whole sub-tree
- In Python: delete(path, version=-1, recursive=False)

ZooKeeper API: reading
- exists(path, watch): returns true if the znode of name path exists; the watch flag allows the client to be notified upon the first future modification of this znode (but only if the znode exists: it cannot watch for znode creation)
- getData(path, watch): returns the data and meta-data of the znode; sets the watch flag, but only if the znode exists
- getChildren(path, watch): returns the set of names (paths) of the children of the znode of name path
- In Python: exists(path, watch=None); get(path, watch=None); get_children(path, watch=None, include_data=False)
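The Python signatures listed above match the kazoo client library; assuming kazoo and a ZooKeeper server reachable on 127.0.0.1:2181, the short sketch below exercises these calls end to end. Paths and values are illustrative.

```python
# Short walkthrough of the API calls above, assuming the kazoo Python client
# and a ZooKeeper server on 127.0.0.1:2181. Paths and values are illustrative.

from kazoo.client import KazooClient
from kazoo.exceptions import BadVersionError

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()                                    # opens the session

# create: a regular parent znode, then an ephemeral + sequential child whose
# name gets a monotonically increasing counter appended
zk.ensure_path("/app1")
member = zk.create("/app1/p_", b"10.0.0.7:8080", ephemeral=True, sequence=True)
print(member)                                 # e.g. /app1/p_0000000003

# getData / getChildren
data, stat = zk.get("/app1")
print(zk.get_children("/app1"))

# setData: conditional write using the version read above
try:
    zk.set("/app1", b"new-config", version=stat.version)
except BadVersionError:
    print("someone else updated /app1 concurrently")

# exists with a watch: one-shot notification on the first change of the znode
def on_change(event):
    print("znode changed:", event.path, event.type)
zk.exists("/app1", watch=on_change)

# delete the ephemeral child and close the session
zk.delete(member)
zk.stop()
```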
Watches and notifications
- Setting the watch flag allows the client to be notified upon the first modification to the znode: modification of its value; deletion (in particular, when the session ends for an ephemeral znode); addition or deletion of a child znode, if the znode is not ephemeral
- A watch is a one-time trigger: if the client wishes to continue monitoring changes to the znode, it must re-read it and set the watch again at the same time

ZooKeeper consistency model
- FIFO per-client evaluation order (provided by TCP)
- Linearizable writes
- Reads are not linearizable but only sequentially consistent: reading from different servers can show different write-visibility orders; in practice, each client is connected to a single server; a special sync operation is needed when using external channels between clients, as sequentially consistent objects are not composable!
- This allows serving reads fast (from one server) while providing strong guarantees on writes
- Ordering guarantee on notifications (when the watch flag is set): a client always receives the notification of a change before it sees the new state after the change

Coherence protocol
- (diagram: clients send read calls to any server and write calls to the leader; a write is acknowledged after a majority of replicas are updated)

ZooKeeper coordination recipes
- Coordination solutions implemented using the API are often called "recipes"
- Examples in the following slides: shared configuration; barrier/rendezvous; group membership; locks; read/write locks

Implementing shared configuration (1)
- A new leader must update a common configuration; other processes should not use a partially-updated configuration; if the new leader crashes before finishing, another leader must be able to clean up
- Regular znode "ready": deleted by the leader before changing the parameters; recreated by the leader after all changes have been sent
- Ephemeral znode created by the leader, containing its identity: automatically deleted if the leader process fails; allows other nodes to know whether the leader is still alive
- Example scenario: a new leader is elected; it deletes /configuration/ready while the old configuration is being updated, keeping /configuration/p_leader alive. If the leader finishes, /configuration/ready is recreated next to the updated configuration. If the leader dies before finishing, there is no ready znode (no risk of using the partially updated configuration) and no ephemeral znode (the leader is dead, and a new one must be elected)

Implementing shared configuration (2)
- What happens if a client sees the ready znode before its deletion? Can it start using partially updated data?
- No, thanks to the ordering guarantee on notifications
- The client needs to set the watch flag when reading ready
- The client will then get the notification for the change of the ready znode before it can read the configuration znodes the leader updates after creating ready
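A minimal sketch of the shared-configuration recipe above, assuming the kazoo client: the leader deletes /configuration/ready, rewrites the configuration znodes, then recreates ready; clients read the configuration only when ready exists and watch it, so the notification of its deletion arrives before any partially updated znode could be read. Paths, znode names and values are illustrative.

```python
# Sketch of the shared-configuration recipe with a "ready" znode, assuming the
# kazoo client. Illustrative only; error handling and retries are omitted.

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def leader_update_config(params):
    zk.ensure_path("/configuration")
    # announce liveness: ephemeral znode removed automatically if we crash
    if not zk.exists("/configuration/p_leader"):
        zk.create("/configuration/p_leader", b"leader-identity", ephemeral=True)
    if zk.exists("/configuration/ready"):
        zk.delete("/configuration/ready")        # block readers first
    for name, value in params.items():           # rewrite configuration znodes
        path = "/configuration/" + name
        if zk.exists(path):
            zk.set(path, value)
        else:
            zk.create(path, value)
    zk.create("/configuration/ready", b"")       # publish "update complete"

def client_read_config():
    def on_ready_change(event):
        print("configuration is being updated, re-read later")
    # watch ready: its deletion is notified before any updated znode is visible
    if zk.exists("/configuration/ready", watch=on_ready_change):
        return {name: zk.get("/configuration/" + name)[0]
                for name in zk.get_children("/configuration")
                if name not in ("ready", "p_leader")}
    return None                                  # update in progress: wait, retry

leader_update_config({"timeout": b"30", "replicas": b"3"})
print(client_read_config())
```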
Implementing rendezvous
- Scenario: a client creates a master and some worker processes; it is not possible to know the IP:port of the master in advance to give to the workers
- Solution: use a znode z as a rendezvous point, given as a parameter to the workers and the master
- The master starts by filling z with its IP:port, but workers can be started before the master
- Workers read z with watch = true; if the IP:port is not filled yet, they wait for the notification and re-read; otherwise, they use the information read

Implementing group membership
- Objective: determine which nodes are alive in the system
- Solution: leverage the semantics of ephemeral znodes
- Regular znode /app_1/members; each node creates an ephemeral znode /app_1/members/node_i
- Set the sequential flag to obtain a unique number, and add contact information (IP:port) in the znode
- Get the members of the group by listing the children of /app_1/members

Implementing locks
- Locks can be implemented on top of wait-free synchronization
- Simple version: create an ephemeral znode named /app_1/lock, with the watch attribute set
- Success: hold the lock (the znode did not exist); failure: wait for the notification and retry taking the lock
- Problem: subject to the herd effect. When the lock is released, all clients are notified and all try to create the lock; a peak of requests, and only one succeeds
- (diagram: six clients race to create /lock (ephemeral, watch); one succeeds and the others fail since /lock already exists; when the holder deletes /lock, all waiting clients are notified and all try to create /lock again; only one succeeds: the HERD EFFECT)

Solution to the herd effect
- Each node creates its own lock znode, with the ephemeral and sequential flags set
- A client effectively holds the lock only when no znode with a lower identifier exists; it sets a watch on the znode with the highest identifier below its own
- Only one client is notified when a znode is deleted, and the lock is granted in order of request arrival
- Clients are also notified if the owner of the lock dies (its ephemeral znode disappears)
- Lock recipe (from the ZooKeeper paper):
  Lock
  1 n = create(l + "/lock-", EPHEMERAL|SEQUENTIAL)
  2 C = getChildren(l, false)
  3 if n is lowest znode in C, exit
  4 p = znode in C ordered just before n
  5 if exists(p, true) wait for watch event
  6 goto 2
  Unlock
  1 delete(n)
- The use of the SEQUENTIAL flag in line 1 of Lock orders the client's attempt to acquire the lock with respect to all other attempts
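A sketch of the herd-effect-free lock recipe above, assuming the kazoo client: each client creates an ephemeral sequential znode and watches only the znode immediately before its own, so a release (or the death of the holder) wakes up exactly one waiter. kazoo also ships a ready-made Lock recipe that implements essentially this scheme; the code below is illustrative.

```python
# Sketch of the sequential-znode lock recipe above, assuming the kazoo client.
# Each waiter watches only its predecessor znode, avoiding the herd effect.

import threading
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/app_1/fifo_lock")

def acquire(lock_path="/app_1/fifo_lock"):
    me = zk.create(lock_path + "/lock-", b"", ephemeral=True, sequence=True)
    my_name = me.split("/")[-1]
    while True:
        children = sorted(zk.get_children(lock_path))  # names end in a counter
        if my_name == children[0]:
            return me                                  # lowest znode: lock held
        predecessor = children[children.index(my_name) - 1]
        released = threading.Event()
        # watch only the znode just before ours; its deletion wakes us up
        if zk.exists(lock_path + "/" + predecessor, watch=lambda e: released.set()):
            released.wait()
        # if the predecessor disappeared in the meantime, just loop and re-check

def release(me):
    zk.delete(me)

handle = acquire()
print("lock held as", handle)
release(handle)
```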
Solution to the herd effect (illustration)
- (diagram: under /app_1/fifo_lock, znodes lock_354 to lock_358 are owned by clients 1, 6, 3, 2 and 5; each client watches the znode just before its own; the order in time of the lock requests sent matches the order of the znodes, since writes are totally ordered)

Read/Write locks
- A write lock may only be owned by one process; multiple processes can own read locks at the same time, when no write lock exists on the same object
- To implement read/write locks, the lock procedure is changed slightly into separate read-lock and write-lock procedures; the unlock procedure is the same as in the global-lock case
- Write Lock (from the ZooKeeper paper)
  1 n = create(l + "/write-", EPHEMERAL|SEQUENTIAL)
  2 C = getChildren(l, false)
  3 if n is lowest znode in C, exit
  4 p = znode in C ordered just before n
  5 if exists(p, true) wait for event
  6 goto 2
- Read Lock (from the ZooKeeper paper)
  1 n = create(l + "/read-", EPHEMERAL|SEQUENTIAL)
  2 C = getChildren(l, false)
  3 if no write znodes lower than n in C, exit
  4 p = write znode in C ordered just before n
  5 if exists(p, true) wait for event
  6 goto 3
- A nice side effect: by browsing the ZooKeeper data tree we can see the amount of lock contention, break locks, and debug locking problems

Performance
- For these experiments, 250 clients connect to all ZK nodes, with a maximum of 2,000 total requests in process
- Throughput of a saturated system as the ratio of reads to writes varies, for ensembles of 3, 5, 7, 9 and 13 servers (figure from the ZooKeeper paper): read throughput grows with the number of servers, while write throughput decreases
- Throughput at the extreme values (not appearing on the plot):
  Servers | 100% reads | 0% reads (all writes)
  13 | 460k ops/s | 8k ops/s
  9 | 296k ops/s | 12k ops/s
  7 | 257k ops/s | 14k ops/s
  5 | 165k ops/s | 18k ops/s
  3 | 87k ops/s | 21k ops/s
Throughput with failures
- Time series of the throughput of a saturated system while failures are injected (figure from the ZooKeeper paper), with the following events: 1. failure and recovery of a follower; 2. failure and recovery of a different follower; 3. failure of the leader; 4. failure of two followers at the first two marks, then recovery at the third mark; 5. failure of the leader; 6. recovery of the leader

Conclusion
- Replication is necessary: data remains available despite faults; local requests are served from the local replica; the load is balanced between replicas
- Coherence implies tradeoffs between consistency (ease of programming) and performance (cost of strong consistency)
- Coordination can leverage a specific replicated data store with coordination-kernel semantics: Apache ZooKeeper, but also CoreOS etcd, etc.

References
- Maurice Herlihy, Jeannette M. Wing: Linearizability: A Correctness Condition for Concurrent Objects. ACM Trans. Program. Lang. Syst. 12(3): 463-492 (1990)
- Mustaque Ahamad, Gil Neiger, James E. Burns, Prince Kohli, Phillip W. Hutto: Causal Memory: Definitions, Implementation, and Programming. Distributed Computing 9(1): 37-49 (1995)
- Seth Gilbert, Nancy Lynch: Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2): 51-59 (2002)
- Daniel Abadi (Yale) on the CAP theorem: http://dbmsmusings.blogspot.ch/2010/04/problems-with-cap-and-yahoos-little.html
- Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, David G. Andersen: Don't settle for eventual consistency. Commun. ACM 57(5): 61-68 (2014)
- Werner Vogels: Eventually consistent. Commun. ACM 52(1): 40-44 (2009)
- Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels: Dynamo: Amazon's highly available key-value store. SOSP 2007: 205-220
- James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson C. Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford: Spanner: Google's Globally Distributed Database. ACM Trans. Comput. Syst. 31(3): 8 (2013). http://research.google.com/archive/spanner.html
- Patrick Hunt, Mahadev Konar, Flavio Paiva Junqueira, Benjamin Reed: ZooKeeper: Wait-free Coordination for Internet-scale Systems. USENIX Annual Technical Conference 2010