Lecture: 05 Size of Vector Clocks, Matrix Clocks, Virtual Time and Physical Clock Synchronization
Rajiv Misra
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
[email protected]

Preface

Recap of the previous lecture:
In the previous lecture we discussed the models of distributed computation and presented the idea of causality and logical time, proposed by Lamport in 1978 as a way to order events in distributed systems. We discussed two systems of logical clocks, namely scalar and vector clocks, to capture causality between the events of a distributed computation.

Content of this lecture:
In this lecture we will discuss:
- Size of vector clocks
- Matrix clocks
- Virtual time
- Physical clock synchronization

Size of vector clocks

An important question to ask is whether vector clocks of size n are necessary in a computation consisting of n processes. To answer this, we examine how vector clocks are used.

A vector clock provides the latest known local time at each other process. If this information in the clock is to be used to explicitly track the progress of every other process, then a vector clock of size n is necessary.

A popular use of vector clocks is to determine the causality between a pair of events. Given any events e and f, the test e ≺ f if and only if T(e) < T(f) requires a comparison of the vector clocks of e and f. Although it appears that a clock of size n is necessary for this, that is not quite accurate. It can be shown that a size equal to the dimension of the partial order (E, ≺) is necessary, where the upper bound on this dimension is n.

Size equals the dimension of the partial order

Definitions: To understand this result on the size of clocks needed to determine causality between a pair of events, we first introduce some definitions:

A linear extension of a partial order (E, ≺) is a linear ordering of E that is consistent with the partial order, i.e.,
if two events are ordered in the partial order, they are also ordered in the linear order. A linear extension can be viewed as projecting all the events from the different processes onto a single time axis. The dimension of a partial order is the minimum number of linear extensions whose intersection gives exactly the partial order.

However, a linear order necessarily introduces an ordering between every pair of events, and some of these orderings are not in the partial order. Also observe that different linear extensions are possible in general.

Let P denote the set of tuples in the partial order defined by the causality relation: there is a tuple (e, f) in P for each pair of events e and f such that e ≺ f. Let L1, L2, ... denote the sets of tuples in different linear extensions of this partial order. The set P is contained in the set obtained by taking the intersection of any such collection of linear extensions L1, L2, ..., because each Li must contain all the tuples, i.e., causality dependencies, that are in P.

1. Client–server interaction

Consider a client–server interaction between a pair of processes. Queries to the server and responses to the client occur in strictly alternating sequence. Although n = 2, all the events are strictly ordered, and there is only one linear order of all the events that is consistent with the "partial" order. Hence the dimension of this "partial order" is 1. A scalar clock, such as one implemented by Lamport's scalar clock rules, is adequate to determine e ≺ f for any events e and f in this execution.

2. Concurrent send–receive

Now consider an execution on processes P1 and P2 such that each sends a message to the other before receiving the other's message. The two send events are concurrent, as are the two receive events. To determine the causality between the send events or between the receive events, it is not sufficient to use a single integer; a vector clock of size n = 2 is necessary.
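The causality test e ≺ f ⇔ T(e) < T(f) can be sketched directly on vector timestamps. The event placement below mirrors the concurrent send–receive example; the function names and timestamp values are illustrative:

```python
# Sketch of the vector-clock causality test: T(e) < T(f) iff every
# component of T(e) is <= the corresponding component of T(f) and the
# two vectors are not equal.

def vc_less(te, tf):
    """Return True iff vector timestamp te < tf (e causally precedes f)."""
    return all(a <= b for a, b in zip(te, tf)) and te != tf

def concurrent(te, tf):
    """Events are concurrent iff neither timestamp dominates the other."""
    return not vc_less(te, tf) and not vc_less(tf, te)

# Concurrent send-receive on two processes: each sends before receiving.
send1 = [1, 0]   # send event on P1
send2 = [0, 1]   # send event on P2
recv1 = [2, 1]   # P1 receives P2's message
recv2 = [1, 2]   # P2 receives P1's message

assert concurrent(send1, send2)   # the two sends are concurrent
assert concurrent(recv1, recv2)   # the two receives are concurrent
assert vc_less(send2, recv1)      # send(m2) causally precedes its receipt
```

A single integer cannot make these distinctions: Lamport scalar clocks would assign both send events the same timestamp without certifying concurrency.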
This execution exhibits a graphical property called a crown: there are messages m0, ..., mn−1 such that Send(mi) ≺ Receive(m(i+1) mod n) for all i from 0 to n − 1. A crown of n messages has dimension n.

3. Complex execution

For a complex execution, it is not straightforward to determine the dimension of the partial order. Figure 5.1 shows an execution involving four processes; the dimension of this partial order is two.

To see this informally, consider the longest chain. There are events outside this chain that can yield multiple linear extensions; hence the dimension is more than 1. The right side of Figure 5.1 shows the earliest possible and the latest possible occurrences of the events not in this chain, with respect to the events in this chain.

Figure 5.1: Example illustrating the dimension of an execution (E, ≺). For n = 4 processes, the dimension is 2.

Let L1 be a linear extension which contains the following tuples that are not in P:
(c, a), (c, b), (c, d), (c, g), (c, h), (c, i), (c, j)
(e, a), (e, b), (e, d), (e, g), (e, h), (e, i), (e, j)
(f, a), (f, b), (f, d), (f, g), (f, h), (f, i), (f, j)

Let L2 be a linear extension which contains the following tuples not in P:
(a, c), (b, c), (c, d), (c, g), (c, h), (c, i), (c, j)
(a, e), (b, e), (d, e), (g, e), (h, e), (i, e), (e, j)
(a, f), (b, f), (d, f), (g, f), (h, f), (i, f), (j, f)

Further, observe that (L1 \ P) ∩ L2 = ∅ and (L2 \ P) ∩ L1 = ∅. Hence L1 ∩ L2 = P, and the dimension of the execution is 2, as these two linear extensions are enough to generate P.

Matrix time

In a system of matrix clocks, time is represented by a set of n × n matrices of non-negative integers. A process pi maintains a matrix mti[1..n, 1..n] where:

- mti[i, i] denotes the local logical clock of pi and tracks the progress of the computation at process pi.
- mti[i, j] denotes the latest knowledge that process pi has about the local logical clock, mtj[j, j], of process pj.
- mti[j, k] represents the knowledge that process pi has about the latest knowledge that pj has about the local logical clock, mtk[k, k], of pk.

The entire matrix mti denotes pi's local view of the global logical time.

Process pi uses the following rules R1 and R2 to update its clock:

R1: Before executing an event, process pi updates its local logical time as follows:
mti[i, i] := mti[i, i] + d   (d > 0)

R2: Each message m is piggybacked with matrix time mt. When pi receives such a message (m, mt) from a process pj, pi executes the following sequence of actions:

1. Update its global logical time as follows:
(a) 1 ≤ k ≤ n : mti[i, k] := max(mti[i, k], mt[j, k])
(that is, update its row mti[i, ∗] with pj's row in the received timestamp mt);
(b) 1 ≤ k, l ≤ n : mti[k, l] := max(mti[k, l], mt[k, l]).
2. Execute R1.
3. Deliver message m.

Matrix time: Example 1
[Figure: the initial form of the matrix clock, with every entry of each process's 3 × 3 matrix set to 0, and the working of the principal and non-principal vectors after events e11, e21, and e31. The principal vector (row i of mti) works as a vector clock; the non-principal vectors fetch the views carried from previous events.]

Matrix time: Example 2
[Figure: the working of the principal and non-principal vectors in a matrix clock over events e11, e12 on P1; e21, e22, e23 on P2; and e31, e32 on P3.]

Basic properties

The vector mti[i, ·] has all the properties of a vector clock. In addition, matrix clocks have the following property:

mink(mti[k, l]) ≥ t ⇒ process pi knows that every other process pk knows that pl's local time has progressed up to t.
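Rules R1 and R2 can be sketched in a few lines. This is a minimal illustration assuming d = 1; the class and method names are chosen here and are not part of the original description:

```python
# Sketch of matrix-clock rules R1 and R2 (with increment d = 1).

class MatrixClock:
    def __init__(self, i, n):
        self.i = i                                 # this process's index
        self.n = n
        self.mt = [[0] * n for _ in range(n)]      # n x n matrix of zeros

    def tick(self):
        # R1: before executing an event, advance the local logical clock.
        self.mt[self.i][self.i] += 1

    def send(self):
        # Piggyback a copy of the matrix on the outgoing message.
        self.tick()
        return [row[:] for row in self.mt]

    def receive(self, j, mt):
        # R2(a): merge the sender pj's row into our own row.
        for k in range(self.n):
            self.mt[self.i][k] = max(self.mt[self.i][k], mt[j][k])
        # R2(b): take the component-wise max over the whole matrix.
        for k in range(self.n):
            for l in range(self.n):
                self.mt[k][l] = max(self.mt[k][l], mt[k][l])
        # Then execute R1 and deliver the message.
        self.tick()

# p0 sends to p1; afterwards p1's principal row reflects p0's progress.
p0, p1 = MatrixClock(0, 2), MatrixClock(1, 2)
m = p0.send()                 # p0's send event bumps mt[0][0] to 1
p1.receive(0, m)
assert p1.mt[1] == [1, 1]     # p1's principal row (its vector clock)
assert p1.mt[0] == [1, 0]     # p1's view of p0's row
```

The principal row mt[i] behaves exactly like a vector clock, matching the property stated above.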
If this holds, process pi knows that all other processes know that pl will never send information with a local time ≤ t. In many applications this implies that processes will no longer require certain information from pl and can use this fact to discard obsolete information.

If d is always 1 in rule R1, then mti[k, l] denotes the number of events that have occurred at pl and are known to pk, as far as pi's knowledge is concerned.

Virtual time

A virtual time system is a paradigm for organizing and synchronizing distributed systems. This section describes virtual time and its implementation using the Time Warp mechanism.

The implementation of virtual time using the Time Warp mechanism works on the basis of an optimistic assumption. Time Warp relies on a general lookahead-rollback mechanism in which each process executes without regard to other processes having synchronization conflicts. If a conflict is discovered, the offending processes are rolled back to the time just before the conflict and executed forward along the revised path. Detection of conflicts and rollbacks are transparent to users. The optimistic assumption made by the Time Warp implementation is that synchronization conflicts, and thus rollbacks, occur rarely.

"Virtual time is a global, one-dimensional, temporal coordinate system on a distributed computation to measure the computational progress and to define synchronization."

A virtual time system is a distributed system executing in coordination with an imaginary virtual clock that uses virtual time. Virtual times are real values that are totally ordered by the less-than relation "<": events at virtual times later than t will start only after events at time t are complete.

Characteristics of virtual time

- Virtual time systems are not all isomorphic; virtual time may be either discrete or continuous.
- Virtual time may be only partially ordered.
- Virtual time may be related to real time or may be independent of it.
- Virtual time systems may be visible to programmers and manipulated explicitly as values, or hidden and manipulated implicitly according to some system-defined discipline.
- Virtual times associated with events may be explicitly calculated by user programs, or they may be assigned by fixed rules.

Comparison with Lamport's logical clocks

In Lamport's logical clocks, an artificial clock is created for each process, with unique labels from a totally ordered set, in a manner consistent with the partial order. In virtual time, the reverse is done: every event is assumed to be labeled with a clock value from a totally ordered virtual time scale satisfying Lamport's clock conditions. Thus the Time Warp mechanism is an inverse of Lamport's scheme.

In Lamport's scheme, all clocks are conservatively maintained so that they never violate causality: a process advances its clock as soon as it learns of a new causal dependency. In virtual time, clocks are optimistically advanced, and corrective actions are taken whenever a violation is detected. Lamport's initial idea brought about the concept of virtual time, but the model failed to preserve causal independence.

Time Warp mechanism

In the implementation of virtual time using the Time Warp mechanism, the virtual receive time of a message is considered its timestamp. The necessary and sufficient condition for a correct implementation of virtual time is that each process handle incoming messages in timestamp order.

Enforcing this condition directly would be highly undesirable and restrictive, because process speeds and message delays are likely to be highly variable; it is natural for some processes to get ahead of others in virtual time. It is impossible for a process, on the basis of local information alone, to block and wait for the message with the next timestamp: it is always possible that a message with an earlier timestamp arrives later.
So, when a process executes a message, it is very difficult for it to determine whether a message with an earlier timestamp will arrive later. This is the central problem in virtual time, and it is solved by the Time Warp mechanism. The Time Warp mechanism assumes that message communication is reliable, but messages may not be delivered in FIFO order.

The Time Warp mechanism consists of two major parts: (i) the local control mechanism and (ii) the global control mechanism. The local control mechanism ensures that events are executed and messages are processed in the correct order. The global control mechanism takes care of global issues such as global progress, termination detection, I/O error handling, flow control, etc.

Physical clock synchronization: NTP

In centralized systems there is only a single clock; a process gets the time simply by issuing a system call to the kernel. In distributed systems there is no global clock or common memory. Each processor has its own internal clock and its own notion of time. These clocks can easily drift by seconds per day, accumulating significant errors over time. Also, because different clocks tick at different rates, they may not remain synchronized even if they were synchronized when they started. This clearly poses serious problems for applications that depend on a synchronized notion of time.

For most applications and algorithms that run in a distributed system, we need to know time in one or more of the following contexts:
- the time of day at which an event happened on a specific machine in the network;
- the time interval between two events that happened on different machines in the network;
- the relative ordering of events that happened on different machines in the network.

Unless the clocks of the machines have a common notion of time, time-based queries cannot be answered.
Clock synchronization has a significant effect on many problems, such as secure systems, fault diagnosis and recovery, scheduled operations, database systems, and real-world clock values.

Clock synchronization is the process of ensuring that physically distributed processors have a common notion of time. Because clock rates differ, the clocks at various sites may diverge with time, and clock synchronization must be performed periodically to correct this clock skew in distributed systems. Clocks are synchronized to an accurate real-time standard such as UTC (Coordinated Universal Time). Clocks that must not only be synchronized with each other but also adhere to physical time are termed physical clocks.

Clock inaccuracies

Physical clocks are synchronized to an accurate real-time standard like UTC. However, due to the clock inaccuracies discussed above, a timer (clock) is said to be working within its specification if

1 − ρ ≤ dC/dt ≤ 1 + ρ,   (1)

where the constant ρ is the maximum skew rate specified by the manufacturer. Figure 5.3 illustrates the behavior of fast, slow, and perfect clocks with respect to UTC.

Figure 5.3: The behavior of fast (dC/dt > 1), slow (dC/dt < 1), and perfect (dC/dt = 1) clocks with respect to UTC (clock time C versus UTC t).

Offset delay estimation method

The Network Time Protocol (NTP), which is widely used for clock synchronization on the Internet, uses the offset delay estimation method. The design of NTP involves a hierarchical tree of time servers:
(i) the primary server at the root synchronizes with UTC;
(ii) the next level contains secondary servers, which act as a backup to the primary server;
(iii) at the lowest level is the synchronization subnet, which has the clients.
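The skew bound (1) implies that two clocks, each within specification, can drift apart by at most 2ρ per unit of real time, so resynchronizing at least every δ/(2ρ) seconds keeps their mutual skew below a target δ. A quick numeric sketch (the values of ρ and δ are illustrative, not from the lecture):

```python
# Two clocks within spec (skew rate at most rho) drift apart by at most
# 2*rho per second of real time, so keeping mutual skew below delta
# requires resynchronization at least every delta / (2*rho) seconds.

rho = 1e-6      # maximum skew rate (1 microsecond per second), illustrative
delta = 0.01    # target maximum mutual skew: 10 ms, illustrative

resync_interval = delta / (2 * rho)
print(resync_interval)   # about 5000 seconds, i.e. roughly 83 minutes

# Worst case: one clock runs fast at rate (1 + rho), the other slow
# at rate (1 - rho); after resync_interval seconds they differ by delta.
t = resync_interval
skew = (1 + rho) * t - (1 - rho) * t
assert abs(skew - delta) < 1e-9
```

This is why NTP clients poll their servers periodically rather than synchronizing once.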
Clock offset and delay estimation

In practice, a source node cannot accurately estimate the local time on the target node, due to varying message or network delays between the nodes. The protocol therefore employs the common practice of performing several trials and choosing the trial with the minimum delay.

Figure 5.4 shows how NTP timestamps are numbered and exchanged between peers A and B. Let T1, T2, T3, T4 be the values of the four most recent timestamps, as shown. Assume that clocks A and B are stable and running at the same speed.

Figure 5.4: Offset and delay estimation (timestamps T1 and T2 are taken at B; T3 and T4 at A).

Let a = T1 − T3 and b = T2 − T4. If the difference between the network delay from A to B and that from B to A (called the differential delay) is small, the clock offset θ and round-trip delay δ of B relative to A at time T4 are approximately given by

θ = (a + b)/2,   δ = a − b.   (2)

Each NTP message includes the latest three timestamps T1, T2, and T3, while T4 is determined upon arrival. Thus both peers A and B can independently calculate the delay and offset using a single bidirectional message stream, as shown in Figure 5.5.

Figure 5.5: Timing diagram for the two servers: server B sends at Ti−3, server A receives at Ti−2 and replies at Ti−1, and server B receives at Ti.

The Network Time Protocol: synchronization protocol

A pair of servers in symmetric mode exchange pairs of timing messages. A store of data is then built up about the relationship between the two servers (pairs of offset and delay). Specifically, assume that each peer maintains pairs (Oi, Di), where Oi is a measure of the offset (θ) and Di is the transmission delay of the two messages (δ). The offset corresponding to the minimum delay is chosen.

Specifically, the delay and offset are calculated as follows. Assume that message m takes time t to transfer and m′ takes time t′ to transfer. The offset between A's clock and B's clock is O.
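The estimates in equation (2) can be checked numerically. In the sketch below, all variable names and the message direction (A sending first at T3, B replying at T2) are assumptions made for illustration, chosen so that the algebra is consistent with a = T1 − T3 and b = T2 − T4; B's clock runs ahead of A's by a true offset, and symmetric delays let θ and δ recover the offset and round-trip time exactly:

```python
# Numeric check of theta = (a + b)/2 and delta = a - b, where
# a = T1 - T3 and b = T2 - T4. Assumed scenario: B's clock is ahead
# of A's by true_offset; A sends at T3 (A's clock), B receives at T1
# and replies at T2 (B's clock), A receives at T4 (A's clock).

true_offset = 0.250   # B's clock minus A's clock, seconds (illustrative)
d_ab = 0.040          # one-way delay A -> B (illustrative)
d_ba = 0.040          # one-way delay B -> A (symmetric here)

T3 = 100.000                      # A sends
T1 = T3 + d_ab + true_offset      # B receives
T2 = T1 + 0.005                   # B replies after 5 ms of processing
T4 = T2 + d_ba - true_offset      # A receives

a = T1 - T3
b = T2 - T4
theta = (a + b) / 2               # estimated offset of B relative to A
delta = a - b                     # estimated round-trip delay

assert abs(theta - true_offset) < 1e-9   # exact when delays are symmetric
assert abs(delta - (d_ab + d_ba)) < 1e-9
```

If the two one-way delays differ, θ is off by half the differential delay, which is why NTP keeps the sample with the minimum delay.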
If A's local clock time is A(t) and B's local clock time is B(t), we have

A(t) = B(t) + O.   (3)

Then

Ti−2 = Ti−3 + t + O,   (4)
Ti = Ti−1 − O + t′.   (5)

Assuming t = t′, the offset Oi can be estimated as

Oi = (Ti−2 − Ti−3 + Ti−1 − Ti)/2,   (6)

and the round-trip delay Di is estimated as

Di = (Ti − Ti−3) − (Ti−1 − Ti−2).   (7)

The eight most recent pairs (Oi, Di) are retained. The value of Oi that corresponds to the minimum Di is chosen to estimate O.

Q: Is a physical clock or a logical clock used to capture causality in a distributed system?

In day-to-day life, the global (physical) time used to deduce causality is obtained from loosely synchronized clocks (wrist watches, wall clocks). In distributed computing systems, however, the rate of occurrence of events is several orders of magnitude higher, and event execution times are several orders of magnitude smaller. Consequently, if the physical clocks are not precisely synchronized, the causality relation between events may not be accurately captured. Therefore, it turns out that in a distributed computation, the causality relation between events produced by a program execution, and its fundamental monotonicity property, can be accurately captured by logical clocks.

Conclusion

In this lecture we discussed the size of vector clocks, matrix clocks, and virtual time for capturing causality between the events of a distributed computation, and how physical clock synchronization can be used as a paradigm for organizing and synchronizing distributed systems. In the upcoming lecture, we will discuss global states and snapshot recording algorithms.

Lecture: 06 Global State and Snapshot Recording Algorithms
Rajiv Misra
Dept.
of Computer Science & Engineering
Indian Institute of Technology Patna
[email protected]

Preface

Recap of the previous lecture:
In the previous lecture we discussed the models of distributed computation, causality, and a general framework of logical clocks (scalar, vector, and matrix clocks) in distributed systems, and how virtual time and physical clock synchronization can be used as paradigms for organizing and synchronizing distributed systems.

Content of this lecture:
In this lecture we will discuss global states (consistent and inconsistent), cuts in the space–time diagram, models of communication, and a snapshot algorithm, the Chandy–Lamport algorithm, to record a global snapshot.

Global state: introduction

Recording the global state of a distributed system on-the-fly is an important paradigm. The lack of globally shared memory, the lack of a global clock, and unpredictable message delays in a distributed system make this problem non-trivial. This lecture first defines consistent global states and discusses the issues to be addressed in computing consistent distributed snapshots; then an algorithm to determine such snapshots on-the-fly is presented.

System model

The system consists of a collection of n processes p1, p2, ..., pn that are connected by channels. There is no globally shared memory and no physical global clock; processes communicate by passing messages through communication channels. Cij denotes the channel from process pi to process pj, and its state is denoted by SCij.

The actions performed by a process are modeled as three types of events: internal events, message send events, and message receive events. For a message mij that is sent by process pi to process pj, let send(mij) and rec(mij) denote its send and receive events.

At any instant, the state of process pi, denoted by LSi, is the result of the sequence of all the events executed by pi up to that instant.
For an event e and a process state LSi, e ∈ LSi iff e belongs to the sequence of events that have taken process pi to state LSi; correspondingly, e ∉ LSi iff e does not belong to that sequence.

For a channel Cij, the following set of messages can be defined based on the local states of the processes pi and pj:

Transit: transit(LSi, LSj) = { mij | send(mij) ∈ LSi ∧ rec(mij) ∉ LSj }.

Consistent global state

The global state of a distributed system is a collection of the local states of the processes and the channels. Notationally, the global state GS is defined as GS = { ∪i LSi, ∪i,j SCij }.

A global state GS is a consistent global state iff it satisfies the following two conditions:

C1: send(mij) ∈ LSi ⇒ mij ∈ SCij ⊕ rec(mij) ∈ LSj   (⊕ is the exclusive-or operator);
C2: send(mij) ∉ LSi ⇒ mij ∉ SCij ∧ rec(mij) ∉ LSj.

Global state of a distributed system

In the distributed execution of Figure 6.2:

A global state GS1 consisting of the local states {LS1^1, LS2^3, LS3^3, LS4^2} is inconsistent, because the state of p2 has recorded the receipt of message m12 but the state of p1 has not recorded its send. On the contrary, a global state GS2 consisting of the local states {LS1^2, LS2^4, LS3^4, LS4^2} is consistent; all the channels are empty except C21, which contains message m21.

A global state GS = { ∪i LSi^xi, ∪j,k SCjk^yj,zk } is transitless iff ∀ i, ∀ j : 1 ≤ i, j ≤ n :: SCij^yi,zj = ∅. Thus, all channels are recorded as empty in a transitless global state. A global state is strongly consistent iff it is transitless as well as consistent. Note that in Figure 6.2 the global state consisting of the local states {LS1^2, LS2^3, LS3^4, LS4^2} is strongly consistent.

Recording the global state of a distributed system is an important paradigm when one is interested in analyzing, monitoring, testing, or verifying properties of distributed applications, systems, and algorithms.
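Conditions C1 and C2 can be checked mechanically for a recorded global state. In this small sketch the data structures are illustrative: for each message we record whether its send and receive events are in the recorded local states and whether it appears in a recorded channel state:

```python
# Sketch: checking C1 and C2 for a recorded global state.
# sends[m] / recs[m]: is send(m) / rec(m) in the recorded local states?
# channel[m]: is m recorded in the corresponding channel state SCij?

def consistent(messages, sends, recs, channel):
    for m in messages:
        if sends[m]:
            # C1: a sent message is in the channel XOR already received.
            if not (channel[m] ^ recs[m]):
                return False
        else:
            # C2: an unsent message is neither in the channel nor received.
            if channel[m] or recs[m]:
                return False
    return True

msgs = ["m12", "m21"]

# A consistent state in the spirit of GS2: both messages sent,
# m12 received, m21 still in transit on channel C21.
assert consistent(msgs,
                  sends={"m12": True, "m21": True},
                  recs={"m12": True, "m21": False},
                  channel={"m12": False, "m21": True})

# An inconsistent state in the spirit of GS1: m12 is recorded as
# received although its send is not recorded (violates C2 for m12).
assert not consistent(msgs,
                      sends={"m12": False, "m21": False},
                      recs={"m12": True, "m21": False},
                      channel={"m12": False, "m21": False})
```

The exclusive-or in C1 captures that a sent message is recorded in exactly one place: either in the channel or at the receiver, never both and never neither.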
The design of efficient methods for recording the global state of a distributed system is an important problem.

Figure 6.2: The space–time diagram of a distributed execution over processes P1–P4 with messages m12 and m21. In this execution, GS1 = {LS1^1, LS2^3, LS3^3, LS4^2} is inconsistent; GS2 = {LS1^2, LS2^4, LS3^4, LS4^2} is consistent; GS3 = {LS1^2, LS2^3, LS3^4, LS4^2} is strongly consistent.

Figure 6.1: Cuts of a distributed computation over processes P1–P3, showing an inconsistent cut, a consistent cut, and a strongly consistent cut.

Cuts of a distributed computation

In the space–time diagram of a distributed computation, a zigzag line joining one arbitrary point on each process line is termed a cut in the computation. Such a line slices the space–time diagram, and thus the set of events in the distributed computation, into a PAST and a FUTURE. The PAST contains all the events to the left of the cut, and the FUTURE contains all the events to the right of the cut. For a cut C, let PAST(C) and FUTURE(C) denote the sets of events in the PAST and FUTURE of C, respectively.

Every cut corresponds to a global state, and every global state can be graphically represented as a cut in the computation's space–time diagram. Cuts in a space–time diagram provide a powerful graphical aid for representing and reasoning about the global states of a computation.

Figure 6.2: Illustration of cuts in a distributed execution over processes P1–P4 with messages m1–m5: cut C1 is inconsistent and cut C2 is consistent.

Explanation of cuts in a distributed execution

In a consistent cut, every message received in the PAST of the cut was sent in the PAST of that cut (in Figure 6.2, cut C2 is a consistent cut). All messages that cross the cut from the PAST to the FUTURE are in transit in the corresponding consistent global state.
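A cut's consistency can be tested directly from this definition: no message may be received in the PAST of the cut while its send lies in the FUTURE. In the sketch below, a cut is represented as the number of events included from each process, and the event coordinates are illustrative:

```python
# Sketch: a cut is given as cut[p] = number of events of process p in
# the PAST. A message ((sp, se), (rp, re)) is sent as the se-th event
# of process sp and received as the re-th event of process rp
# (1-based). The cut is inconsistent if a receive is in the PAST while
# the corresponding send is in the FUTURE.

def consistent_cut(cut, messages):
    for (sp, se), (rp, re) in messages:
        if re <= cut[rp] and se > cut[sp]:
            return False     # message received from the FUTURE
    return True

# Two processes, one message: P0's 2nd event sends, P1's 3rd receives.
msgs = [((0, 2), (1, 3))]

assert consistent_cut({0: 2, 1: 3}, msgs)       # send and receive in PAST
assert consistent_cut({0: 2, 1: 2}, msgs)       # message in transit: fine
assert not consistent_cut({0: 1, 1: 3}, msgs)   # receive without its send
```

The second assertion shows the in-transit case: a message crossing from PAST to FUTURE does not break consistency; it simply belongs to the channel state of the corresponding global state.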
A cut is inconsistent if a message crosses the cut from the FUTURE to the PAST (in Figure 6.2, cut C1 is an inconsistent cut).

Issues in recording a global state

The following two issues need to be addressed:

I1: How to distinguish the messages to be recorded in the snapshot from those not to be recorded.
- Any message that is sent by a process before recording its snapshot must be recorded in the global snapshot (from C1).
- Any message that is sent by a process after recording its snapshot must not be recorded in the global snapshot (from C2).

I2: How to determine the instant when a process takes its snapshot.
- A process pj must record its snapshot before processing a message mij that was sent by process pi after pi recorded its own snapshot.

Example: money transfer

Let S1 and S2 be two distinct sites of a distributed system which maintain bank accounts A and B, respectively. A site refers to a process in this example. Let the communication channels from site S1 to site S2 and from site S2 to site S1 be denoted by C12 and C21, respectively. Consider the following sequence of actions, which are also illustrated in the timing diagram of Figure 6.3:

Time t0: Initially, Account A = $600, Account B = $200, C12 = $0, C21 = $0.

Time t1: Site S1 initiates a transfer of $50 from Account A to Account B. Account A is decremented by $50 to $550, and a request for a $50 credit to Account B is sent on channel C12 to site S2. Account A = $550, Account B = $200, C12 = $50, C21 = $0.

Time t2: Site S2 initiates a transfer of $80 from Account B to Account A. Account B is decremented by $80 to $120, and a request for an $80 credit to Account A is sent on channel C21 to site S1. Account A = $550, Account B = $120, C12 = $50, C21 = $80.

Time t3: Site S1 receives the message for the $80 credit to Account A and updates Account A. Account A = $630, Account B = $120, C12 = $50, C21 = $0.

Time t4: Site S2 receives the message for the $50 credit to Account B and updates Account B.
Account A = $630, Account B = $170, C12 = $0, C21 = $0.

Figure 6.3: Money transfer example, showing the values of accounts A and B and channels C12 and C21 at times t0 through t4.

Suppose the local state of Account A is recorded at time t0 to show $600, and the local states of Account B and channels C12 and C21 are recorded at time t2 to show $120, $50, and $80, respectively. Then the recorded global state shows $850 in the system: an extra $50 appears.

The reason for the inconsistency is that Account A's state was recorded before the $50 transfer to Account B on channel C12 was initiated, whereas channel C12's state was recorded after the $50 transfer was initiated. This simple example shows that recording a consistent global state of a distributed system is not a trivial task: the recording activities of the individual components must be coordinated appropriately.

Models of communication

Recall that there are three models of communication: FIFO, non-FIFO, and causal ordering (CO).

In the FIFO model, each channel acts as a first-in, first-out message queue, and thus message ordering is preserved by a channel. In the non-FIFO model, a channel acts like a set to which the sender process adds messages and from which the receiver process removes messages in a random order. A system that supports causal delivery of messages satisfies the following property: for any two messages mij and mkj, if send(mij) → send(mkj), then rec(mij) → rec(mkj).

Snapshot algorithm for FIFO channels

Chandy–Lamport algorithm:

The Chandy–Lamport algorithm uses a control message, called a marker, whose role in a FIFO system is to separate the messages in the channels. After a site has recorded its snapshot, it sends a marker along all of its outgoing channels before sending out any more messages.
A marker separates the messages in a channel into those to be included in the snapshot and those not to be recorded in the snapshot. A process must record its snapshot no later than when it receives a marker on any of its incoming channels.

The algorithm can be initiated by any process by executing the "Marker Sending Rule", by which it records its local state and sends a marker on each outgoing channel. A process executes the "Marker Receiving Rule" on receiving a marker. If the process has not yet recorded its local state, it records the state of the channel on which the marker was received as empty and executes the "Marker Sending Rule" to record its local state.

The algorithm terminates after each process has received a marker on all of its incoming channels. All the local snapshots are then disseminated to all the processes, and all the processes can determine the global state.

Chandy–Lamport algorithm

Marker Sending Rule for process i:
1) Process i records its state.
2) For each outgoing channel C on which a marker has not been sent, i sends a marker along C before i sends any further messages along C.

Marker Receiving Rule for process j, on receiving a marker along channel C:
if j has not recorded its state then
    record the state of C as the empty set
    follow the "Marker Sending Rule"
else
    record the state of C as the set of messages received along C after j's state was recorded and before j received the marker along C

Correctness and complexity

Correctness: Due to the FIFO property of channels, no message sent after the marker on a channel is recorded in that channel's state; thus condition C2 is satisfied. When a process pj receives a message mij that precedes the marker on channel Cij, it acts as follows: if process pj has not yet taken its snapshot, it includes mij in its recorded snapshot; otherwise, it records mij in the state of channel Cij. Thus condition C1 is satisfied.
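The two marker rules can be sketched as a small single-threaded simulation over FIFO queues. The scenario is borrowed from the earlier money-transfer example; the class structure, delivery loop, and initiation point are illustrative assumptions:

```python
# Sketch of the Chandy-Lamport marker rules over FIFO channels.
from collections import deque

MARKER = "MARKER"

class Process:
    def __init__(self, pid, peers, state):
        self.pid, self.peers, self.state = pid, peers, state
        self.recorded_state = None
        self.chan_state = {}     # incoming channel -> recorded messages
        self.recording = {}      # incoming channel -> still recording?

    def record(self, channels):
        # Marker Sending Rule: record the local state, then send a
        # marker on every outgoing channel before any further messages.
        self.recorded_state = self.state
        self.recording = {q: True for q in self.peers}
        self.chan_state = {q: [] for q in self.peers}
        for q in self.peers:
            channels[(self.pid, q)].append(MARKER)

    def on_message(self, frm, msg, channels):
        if msg == MARKER:
            # Marker Receiving Rule.
            if self.recorded_state is None:
                self.record(channels)      # channel state := empty set
            self.recording[frm] = False    # channel recording finished
        else:
            self.state += msg              # apply the application message
            if self.recorded_state is not None and self.recording.get(frm):
                self.chan_state[frm].append(msg)   # in-transit message

# Two sites; FIFO channels keyed by (sender, receiver).
channels = {(0, 1): deque(), (1, 0): deque()}
p = [Process(0, [1], 600), Process(1, [0], 200)]

p[0].state -= 50; channels[(0, 1)].append(50)   # S1 sends $50 to S2
p[0].record(channels)                            # S1 initiates the snapshot
p[1].state -= 80; channels[(1, 0)].append(80)   # S2 sends $80 to S1

# Deliver all queued messages in FIFO order until quiescence.
while any(channels.values()):
    for (s, r), q in channels.items():
        if q:
            p[r].on_message(s, q.popleft(), channels)

total = (p[0].recorded_state + p[1].recorded_state
         + sum(sum(v) for v in p[0].chan_state.values())
         + sum(sum(v) for v in p[1].chan_state.values()))
assert total == 800   # the recorded global state conserves the $800
```

In this run the snapshot records A = $550, B = $170, and $80 in transit on C21, matching the first execution discussed in the banking example and conserving the total.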
Complexity: The recording part of a single instance of the algorithm requires O(e) messages and O(d) time, where e is the number of edges in the network and d is the diameter of the network.

Properties of the recorded global state

The recorded global state may not correspond to any of the global states that occurred during the computation. Consider two possible executions of the snapshot algorithm (shown in Figure 6.4) for the previous money transfer example.

Figure 6.4: Timing diagram of two possible executions of the banking example; the markers of the first execution are shown as red dashed-and-dotted arrows, those of the second as green dotted arrows.

1. (Markers shown using red dashed-and-dotted arrows.) Let site S1 initiate the algorithm just after t1. Site S1 records its local state (Account A = $550) and sends a marker to site S2. The marker is received by site S2 after t4. When site S2 receives the marker, it records its local state (Account B = $170), records the state of channel C12 as $0, and sends a marker along channel C21. When site S1 receives this marker, it records the state of channel C21 as $80. The $800 in the system is conserved in the recorded global state:

A = $550, B = $170, C12 = $0, C21 = $80.

2. (Markers shown using green dotted arrows.) Let site S1 initiate the algorithm just after t0 and before sending the $50 for S2. Site S1 records its local state (Account A = $600) and sends a marker to site S2. The marker is received by site S2 between t2 and t3.
When site S2 receives the marker, it records its local state (account PT B = $120), the state of channel C12 as $0, and sends a marker along channel C21. When site S1 receives this marker, it records the state of N channel C21 as $80. The $800 amount in the system is conserved in the recorded global state, A = $600, B = $120, C12 = $0, C21 = $80 $600 $550 $550 $630 $630 S1: A $50 EL $80 S2: B PT $200 $200 $120 $120 $170 t0 t1 t2 t3 t4 A = $600 N B = $120 C12 = $0 C21 = $80 The $800 amount in the system is conserved in the recorded global state Figure 6.4: Timing diagram of two possible executions of the banking example Properties of the recorded global state In both these possible runs of the algorithm, the recorded global states never occurred in the execution. This happens because a process can change its state asynchronously before EL the markers it sent are received by other sites and the other sites record their states. PT But the system could have passed through the recorded global states in some equivalent executions. The recorded global state is a valid state in an equivalent execution and if N a stable property (i.e., a property that persists) holds in the system before the snapshot algorithm begins, it holds in the recorded global snapshot. Therefore, a recorded global state is useful in detecting stable properties. Variants of Global Snapshot Algorithms Algorithms Features Baseline algorithm. Requires FIFO channels. Chandy- Lamport O(e) messages to record snapshot and O(d ) time. supports concurrent initiators, efficient assembly and distribution of a snapshot. Assumes bidirectional channels. Spezialetti- Kearns O(e) messages to record, O(rn2 ) messages to assemble and distribute snapshot. EL Works for non-FIFO channels. Markers piggybacked on computation Lai-Yang messages. Message history required to compute channel states. PT Small message history Li et al. needed as channel states are computed incrementally. No message history required. 
Mattern Acharya- Badrinath N Termination detection required to compute channel states. Requires causal delivery support, Centralized computation of channel states, Channel message contents need not be known. Requires 2n messages, 2 time units. Requires causal delivery support, Distributed computation of channel states. Requires Alagar-Venkatesan 3n messages, 3 time units, small messages. n = # processes, u = # edges on which messages were sent after previous snapshot, e = # channels, d is the diameter of the network, r = # concurrent initiators. Conclusion Recording global state of a distributed system is an important paradigm in the design of the distributed systems and the design of efficient methods of recording the global state is an important EL issue. This lecture first discussed a formal definition of the global state PT of a distributed system and issues related to its capture; then we have discussed the Chandy-Lamport Algorithm to record a N snapshot of a distributed system. In upcoming lecture, we will discuss about distributed mutual exclusion algorithms. Lecture: 07 Distributed Mutual Exclusion Algorithms & EL Non-Token Based Approaches PT N Rajiv Misra Dept. of Computer Science & Engineering Indian Institute of Technology Patna [email protected] Preface Recap of Previous Lecture: In the previous lecture, we have discussed about the EL Global states (i.e. consistent, strongly consistent & inconsistent), Cuts in the space time diagram, Models PT of communication and N Snapshot algorithm i.e. Chandy-Lamport algorithm to record the global snapshot. Preface Content of this Lecture: EL In this lecture, we will discuss about Mutual exclusion algorithms for distributed computing systems such as: PT (i) Non-token-based approach, N (ii) Quorum based approach and (iii) Token-based approach Introduction Mutual exclusion: Concurrent access of processes to a shared resource or data is executed in mutually exclusive manner. 
Only one process is allowed to execute the critical section (CS) at any given time.

In a distributed system, shared variables (semaphores) or a local kernel cannot be used to implement mutual exclusion. Message passing is the sole means for implementing distributed mutual exclusion.

Approaches of Distributed Mutual Exclusion

Distributed mutual exclusion algorithms must deal with unpredictable message delays and incomplete knowledge of the system state. Three basic approaches for distributed mutual exclusion are:
i. Non-token-based approach
ii. Quorum-based approach
iii. Token-based approach

(i) Non-token-based approach: Two or more successive rounds of messages are exchanged among the sites to determine which site will enter the CS next. Examples: Lamport's algorithm, the Ricart-Agrawala algorithm.

(ii) Quorum-based approach: Each site requests permission to execute the CS from a subset of sites (called a quorum). Any two quorums contain a common site. This common site is responsible for making sure that only one request executes the CS at any time. Examples: Maekawa's algorithm, the Agarwal-El Abbadi quorum-based algorithm.

(iii) Token-based approach: A unique token (also known as the PRIVILEGE message) is shared among the sites. A site is allowed to enter its CS if it possesses the token. Mutual exclusion is ensured because the token is unique. Examples: the Suzuki-Kasami broadcast algorithm, Raymond's tree-based algorithm.

Preliminaries: System Model

The system consists of N sites, S1, S2, ..., SN. We assume that a single process is running on each site. The process at site Si is denoted by pi.

A site can be in one of the following three states: requesting the CS, executing the CS, or neither requesting nor executing the CS (i.e., idle).

In the 'requesting the CS' state, the site is blocked and cannot make further requests for the CS. In the 'idle' state, the site is executing outside the CS.

In token-based algorithms, a site can also be in a state where a site holding the token is executing outside the CS (called the idle token state).

At any instant, a site may have several pending requests for the CS. A site queues up these requests and serves them one at a time.

Requirements of Mutual Exclusion Algorithms

1. Safety property: At any instant, only one process can execute the critical section.

2. Liveness property: This property states the absence of deadlock and starvation. Two or more sites should not endlessly wait for messages that will never arrive.

3. Fairness: Each process gets a fair chance to execute the CS. Fairness generally means that CS execution requests are executed in the order of their arrival in the system (where arrival time is determined by a logical clock).

Performance Metrics

The performance is generally measured by the following four metrics:

(i) Message complexity: The number of messages required per CS execution by a site.

(ii) Synchronization delay: The time between one site leaving the CS and the next site entering the CS (see Figure 7.1).

[Figure 7.1: Synchronization delay — the interval between the last site exiting the CS and the next site entering the CS.]

(iii) Response time: The time interval a request waits for its CS execution to be over after its request messages have been sent out (see Figure 7.2).

[Figure 7.2: Response time — from the moment a CS request arrives and its request messages are sent out until the site exits the CS; it includes the CS execution time.]

(iv) System throughput: The rate at which the system executes requests for the CS:
    system throughput = 1/(SD + E)
where SD is the synchronization delay and E is the average critical section execution time.

Low and High Load Performance:

We often study the performance of mutual exclusion algorithms under two special loading conditions, viz., "low load" and "high load".
The load is determined by the arrival rate of CS execution requests. Under low load conditions, there is seldom more than one request for the critical section present in the system simultaneously. Under heavy load conditions, there is always a pending request for the critical section at a site.

(i) Non-Token Based Approaches

(i) Lamport's Algorithm

Requests for the CS are executed in the increasing order of timestamps, and time is determined by logical clocks.

Every site Si keeps a queue, request_queuei, which contains mutual exclusion requests ordered by their timestamps.

This algorithm requires communication channels to deliver messages in FIFO order.

Three types of messages are used: REQUEST, REPLY and RELEASE. These timestamped messages also update the logical clocks.

The Algorithm

Requesting the critical section:
When a site Si wants to enter the CS, it broadcasts a REQUEST(tsi, i) message to all other sites and places the request on request_queuei. ((tsi, i) denotes the timestamp of the request.)
When Sj receives the REQUEST(tsi, i) message from site Si, it places site Si's request on request_queuej and returns a timestamped REPLY message to Si.

Executing the critical section:
Site Si enters the CS when the following two conditions hold:
L1: Si has received a message with timestamp larger than (tsi, i) from all other sites.
L2: Si's request is at the top of request_queuei.

Releasing the critical section:
Site Si, upon exiting the CS, removes its request from the top of its request queue and broadcasts a timestamped RELEASE message to all other sites.
When a site Sj receives a RELEASE message from site Si, it removes Si's request from its request queue.
When a site removes a request from its request queue, its own request may come to the top of the queue, enabling it to enter the CS.

Correctness

Theorem: Lamport's algorithm achieves mutual exclusion.
Proof: The proof is by contradiction. Suppose two sites Si and Sj are executing the CS concurrently. For this to happen, conditions L1 and L2 must hold at both sites concurrently. This implies that at some instant in time, say t, both Si and Sj have their own requests at the top of their request_queues and condition L1 holds at them. Without loss of generality, assume that Si's request has a smaller timestamp than the request of Sj. From condition L1 and the FIFO property of the communication channels, it is clear that at instant t the request of Si must be present in request_queuej while Sj is executing its CS. This implies that Sj's own request is at the top of its own request_queue while a smaller timestamp request, Si's request, is present in request_queuej, a contradiction!

Theorem: Lamport's algorithm is fair.
Proof: The proof is by contradiction. Suppose a site Si's request has a smaller timestamp than the request of another site Sj, and yet Sj is able to execute the CS before Si. For Sj to execute the CS, it has to satisfy conditions L1 and L2. This implies that at some instant in time, say t, Sj has its own request at the top of its queue and it has also received a message with timestamp larger than the timestamp of its request from all other sites. But a request_queue at a site is ordered by timestamp, and according to our assumption Si has the lower timestamp, so Si's request must be placed ahead of Sj's request in request_queuej. This is a contradiction!
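The per-site data structures and the L1/L2 entry test used in both proofs can be sketched in a few lines of Python (message transport is abstracted away; the `LamportSite` class and its method names are hypothetical, not from the lecture):

```python
import heapq

class LamportSite:
    """Toy model of one site in Lamport's mutual exclusion algorithm."""
    def __init__(self, site_id, n_sites):
        self.id = site_id
        self.clock = 0
        self.queue = []                 # request_queue: a heap of (ts, site_id)
        self.last_seen = {j: 0 for j in range(n_sites) if j != site_id}

    def request(self):
        """Requesting the CS: timestamp the request and queue it locally."""
        self.clock += 1
        req = (self.clock, self.id)
        heapq.heappush(self.queue, req)
        return req                      # to be broadcast as REQUEST(ts, i)

    def on_message(self, sender, ts, req=None):
        """Any timestamped message (REQUEST/REPLY/RELEASE) updates the clock
        and our knowledge of the sender's progress."""
        self.clock = max(self.clock, ts) + 1
        self.last_seen[sender] = ts
        if req is not None:             # a REQUEST carries the request to enqueue
            heapq.heappush(self.queue, req)

    def on_release(self, req):
        """RELEASE: remove the releasing site's request from the queue."""
        self.queue.remove(req)
        heapq.heapify(self.queue)

    def can_enter(self, my_req):
        """L1: a later-stamped message has arrived from every other site.
        L2: our request is at the top of the queue ((ts, id) lexicographic order)."""
        l1 = all(ts > my_req[0] for ts in self.last_seen.values())
        l2 = self.queue and self.queue[0] == my_req
        return l1 and bool(l2)
```

The lexicographic order on (timestamp, site id) pairs resolves ties such as (1,0) vs (1,1) deterministically, which is what makes the request ordering in the fairness argument total.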
Lamport's Algorithm: Example

[Example figures: Sites S1 and S2 broadcast requests (1,1) and (1,2) for the CS; each of S1, S2 and S3 orders the two requests as (1,1), (1,2) in its request queue. Site S1, whose request has the smaller timestamp, enters the CS first; on exiting, it sends RELEASE messages, after which site S2 enters the CS.]

Performance

For each CS execution, Lamport's algorithm requires (N − 1) REQUEST messages, (N − 1) REPLY messages, and (N − 1) RELEASE messages. Thus, Lamport's algorithm requires 3(N − 1) messages per CS invocation.

Synchronization delay in the algorithm is T.

An Optimization

In Lamport's algorithm, REPLY messages can be omitted in certain situations. For example, if site Sj receives a REQUEST message from site Si after it has sent its own REQUEST message with timestamp higher than the timestamp of site Si's request, then site Sj need not send a REPLY message to site Si. This is because when site Si receives site Sj's request with timestamp higher than its own, it can conclude that site Sj does not have any smaller timestamp request which is still pending.

With this optimization, Lamport's algorithm requires between 2(N − 1) and 3(N − 1) messages per CS execution.

(ii) Ricart-Agrawala Algorithm

The Ricart-Agrawala algorithm assumes that the communication channels are FIFO. The algorithm uses two types of messages: REQUEST and REPLY.

A process sends a REQUEST message to all other processes to request their permission to enter the critical section. A process sends a REPLY message to a process to give its permission to that process.

Processes use Lamport-style logical clocks to assign a timestamp to critical section requests, and timestamps are used to decide the priority of requests.
Each process pi maintains the Request-Deferred array, RDi, the size of which is the same as the number of processes in the system. Initially, ∀i ∀j: RDi[j] = 0. Whenever pi defers the request sent by pj, it sets RDi[j] = 1; after it has sent a REPLY message to pj, it sets RDi[j] = 0.

Description of the Algorithm

Requesting the critical section:
(a) When a site Si wants to enter the CS, it broadcasts a timestamped REQUEST message to all other sites.
(b) When site Sj receives a REQUEST message from site Si, it sends a REPLY message to site Si if site Sj is neither requesting nor executing the CS, or if site Sj is requesting and Si's request's timestamp is smaller than site Sj's own request's timestamp. Otherwise, the reply is deferred and Sj sets RDj[i] = 1.

Executing the critical section:
(c) Site Si enters the CS after it has received a REPLY message from every site it sent a REQUEST message to.

Releasing the critical section:
(d) When site Si exits the CS, it sends all the deferred REPLY messages: ∀j, if RDi[j] = 1, then send a REPLY message to Sj and set RDi[j] = 0.

Notes:
When a site receives a message, it updates its clock using the timestamp in the message.
When a site takes up a request for the CS for processing, it updates its local clock and assigns a timestamp to the request.

Correctness

Theorem: The Ricart-Agrawala algorithm achieves mutual exclusion.
Proof: The proof is by contradiction. Suppose two sites Si and Sj are executing the CS concurrently and Si's request has higher priority than the request of Sj. Clearly, Si received Sj's request after it made its own request. Thus, Sj can concurrently execute the CS with Si only if Si returns a REPLY to Sj (in response to Sj's request) before Si exits the CS. However, this is impossible because Sj's request has lower priority. Therefore, the Ricart-Agrawala algorithm achieves mutual exclusion.
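Steps (a)-(d) and the request-deferred array can be sketched as a toy model (message transport is assumed to be handled elsewhere; the `RASite` class and its method names are my own, not from the lecture):

```python
class RASite:
    """Toy model of one site in the Ricart-Agrawala algorithm:
    only the REPLY/defer decision and the RD array are modelled."""
    def __init__(self, site_id, n):
        self.id = site_id
        self.n = n
        self.clock = 0
        self.my_req = None            # (ts, id) while requesting/executing, else None
        self.rd = [0] * n             # RD[j] = 1 if the reply to j is deferred
        self.replies = 0

    def request(self):
        """(a) Timestamp a new request to be broadcast as REQUEST(ts, i)."""
        self.clock += 1
        self.my_req = (self.clock, self.id)
        self.replies = 0
        return self.my_req

    def on_request(self, ts, j):
        """(b) Reply unless we have an outstanding request of higher priority."""
        self.clock = max(self.clock, ts) + 1
        if self.my_req is None or (ts, j) < self.my_req:
            return True               # send REPLY now
        self.rd[j] = 1                # defer the reply
        return False

    def on_reply(self):
        """(c) Enter the CS once all N-1 replies have arrived."""
        self.replies += 1
        return self.replies == self.n - 1

    def release(self):
        """(d) On exiting the CS, send all deferred replies and clear RD."""
        self.my_req = None
        deferred = [j for j in range(self.n) if self.rd[j] == 1]
        self.rd = [0] * self.n
        return deferred
```

With three sites where S0 and S1 both request at local time 1, the tie-break (1,0) < (1,1) means S1 replies to S0 immediately while S0 defers its reply to S1 until it exits the CS, giving 2(N − 1) messages per CS execution.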
Ricart-Agrawala Algorithm: Example

[Example figures: Sites S1 and S2 broadcast requests (1,1) and (1,2) for the CS. S1's request has the smaller timestamp, so S2's request is deferred at S1 while the idle site S3 replies to both. Site S1 enters the CS; on exiting, it sends a REPLY to S2's deferred request, after which S2 can enter the CS.]

Performance

For each CS execution, the Ricart-Agrawala algorithm requires (N − 1) REQUEST messages and (N − 1) REPLY messages. Thus, it requires 2(N − 1) messages per CS execution.

Synchronization delay in the algorithm is T.

Conclusion

Mutual exclusion is a fundamental problem in distributed computing systems, where concurrent access to a shared resource or data must be serialized. Mutual exclusion in a distributed system requires that only one process be allowed to execute the critical section at any given time.

In this lecture, we discussed the concepts of distributed mutual exclusion and its non-token-based approaches, i.e., Lamport's algorithm and the Ricart-Agrawala algorithm.

In upcoming lectures, we will discuss quorum-based approaches and token-based approaches.

Lecture: 08
Quorum Based Distributed Mutual Exclusion Algorithms
Rajiv Misra
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
[email protected]

Preface

Recap of Previous Lecture:
In the previous lecture, we presented the concept of distributed mutual exclusion algorithms for distributed computing systems. We also discussed the non-token-based approaches, i.e., Lamport's algorithm and the Ricart-Agrawala algorithm.

Preface

Content of this Lecture:
In this lecture, we will discuss the quorum-based approach, and in particular two of its algorithms, namely Maekawa's algorithm and the Agarwal-El Abbadi quorum-based algorithm.

(ii) Quorum Based Approaches

Introduction

In the quorum-based approach, each site requests permission to execute the CS from a subset of sites (called a quorum). The quorums are formed in such a way that when two sites concurrently request access to the CS, at least one site receives both requests, and this site is responsible for making sure that only one request executes the CS at any time.

Quorum-Based Mutual Exclusion Algorithms

Quorum-based mutual exclusion algorithms differ from the previous algorithms in the following two ways:

1. A site does not request permission from all other sites, but only from a subset of the sites. The request sets of sites are chosen such that ∀i ∀j : 1 ≤ i, j ≤ N :: Ri ∩ Rj ≠ Φ. Consequently, every pair of sites has a site which mediates conflicts between that pair.

2. A site can send out only one REPLY message at any time. A site can send a REPLY message only after it has received a RELEASE message for the previous REPLY message.

Contd…

Since these algorithms are based on the notions of 'coteries' and 'quorums', we next describe the idea of coteries and quorums.

A coterie C is defined as a set of sets, where each set g ∈ C is called a quorum. The following properties hold for quorums in a coterie:

Intersection property: For every pair of quorums g, h ∈ C, g ∩ h ≠ ∅. For example, the sets {1,2,3}, {2,5,7} and {5,7,9} cannot all be quorums in a coterie because the first and third sets do not have a common element.

Minimality property: There should be no quorums g, h in coterie C such that g ⊇ h. For example, the sets {1,2,3} and {1,3} cannot both be quorums in a coterie because the first set is a superset of the second.
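The two properties can be checked mechanically. The helper below is a hypothetical illustration of mine, exercised on the example sets above and on the coterie {1,2,3}, {2,4,5}, {4,1,6} that appears later in the correctness proof:

```python
from itertools import combinations

def is_coterie(quorums):
    """Check the two coterie properties:
    Intersection: every pair of quorums shares at least one site.
    Minimality:   no quorum contains (or equals) another."""
    sets = [set(q) for q in quorums]
    for g, h in combinations(sets, 2):
        if not (g & h):
            return False          # intersection property violated
        if g <= h or h <= g:
            return False          # minimality property violated
    return True
```

Note that dropping minimality never breaks mutual exclusion, only efficiency, which is why the text calls it a desirable rather than a correctness property.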
Contd…

Coteries and quorums can be used to develop algorithms to ensure mutual exclusion in a distributed environment. A simple protocol works as follows:

Let 'a' be a site in quorum 'A'. If 'a' wants to invoke mutual exclusion, it requests permission from all sites in its quorum 'A'. Every site does the same to invoke mutual exclusion. Due to the intersection property, quorum 'A' contains at least one site that is common to the quorum of every other site. These common sites send permission to only one site at any time. Thus, mutual exclusion is guaranteed.

Note that the minimality property ensures efficiency rather than correctness.

(i) Maekawa's Algorithm

Maekawa's algorithm was the first quorum-based mutual exclusion algorithm. The request sets for sites (i.e., quorums) in Maekawa's algorithm are constructed to satisfy the following conditions:

M1: (∀i ∀j : i ≠ j, 1 ≤ i, j ≤ N :: Ri ∩ Rj ≠ φ)
M2: (∀i : 1 ≤ i ≤ N :: Si ∈ Ri)
M3: (∀i : 1 ≤ i ≤ N :: |Ri| = K)
M4: Any site Sj is contained in exactly K of the Ri's, 1 ≤ i, j ≤ N.

Maekawa used the theory of projective planes and showed that N = K(K − 1) + 1. This relation gives |Ri| = K ≈ √N.

Conditions M1 and M2 are necessary for correctness, whereas conditions M3 and M4 provide other desirable features to the algorithm.

Condition M3 states that the request sets of all sites must be of equal size, implying that all sites have to do an equal amount of work to invoke mutual exclusion.

Condition M4 enforces that exactly the same number of sites request permission from any site, implying that all sites have "equal responsibility" in granting permission to other sites.

The Algorithm

A site Si executes the following steps to execute the CS.

Requesting the critical section
(a) A site Si requests access to the CS by sending REQUEST(i) messages to all sites in its request set Ri.
(b) When a site Sj receives the REQUEST(i) message, it sends a REPLY(j) message to Si provided it hasn't sent a REPLY message to a site since its receipt of the last RELEASE message. Otherwise, it queues up the REQUEST(i) for later consideration.

Executing the critical section
(c) Site Si executes the CS only after it has received a REPLY message from every site in Ri.

Releasing the critical section
(d) After the execution of the CS is over, site Si sends a RELEASE(i) message to every site in Ri.
(e) When a site Sj receives a RELEASE(i) message from site Si, it sends a REPLY message to the next site waiting in the queue and deletes that entry from the queue. If the queue is empty, then the site updates its state to reflect that it has not sent out any REPLY message since the receipt of the last RELEASE message.

Correctness

Theorem: Maekawa's algorithm achieves mutual exclusion.
Proof: The proof is by contradiction. Suppose two sites Si and Sj are concurrently executing the CS. This means site Si received a REPLY message from all sites in Ri and concurrently site Sj was able to receive a REPLY message from all sites in Rj. If Ri ∩ Rj = {Sk}, then site Sk must have sent REPLY messages to both Si and Sj concurrently, which is a contradiction.

Performance

Since the size of a request set is √N, an execution of the CS requires √N REQUEST, √N REPLY, and √N RELEASE messages, resulting in 3√N messages per CS execution.

Synchronization delay in this algorithm is 2T. This is because after a site Si exits the CS, it first releases all the sites in Ri, and then one of those sites sends a REPLY message to the next site that executes the CS.

Problem of Deadlocks

Maekawa's algorithm can deadlock because a site is exclusively locked by other sites and requests are not prioritized by their timestamps.

Assume three sites Si, Sj, and Sk simultaneously invoke mutual exclusion. Suppose Ri ∩ Rj = {Sij}, Rj ∩ Rk = {Sjk}, and Rk ∩ Ri = {Ski}.

Consider the following scenario:
1. Sij has been locked by Si (forcing Sj to wait at Sij).
2. Sjk has been locked by Sj (forcing Sk to wait at Sjk).
3. Ski has been locked by Sk (forcing Si to wait at Ski).
This state represents a deadlock involving sites Si, Sj, and Sk.

Handling Deadlocks

Maekawa's algorithm handles deadlocks by requiring a site to yield a lock if the timestamp of its request is larger than the timestamp of some other request waiting for the same lock.

A site suspects a deadlock (and initiates message exchanges to resolve it) whenever a higher-priority request arrives and waits at a site because the site has sent a REPLY message to a lower-priority request.

Handling Deadlocks

Deadlock handling requires three types of messages:

FAILED: A FAILED message from site Si to site Sj indicates that Si cannot grant Sj's request because it has currently granted permission to a site with a higher-priority request.

INQUIRE: An INQUIRE message from Si to Sj indicates that Si would like to find out from Sj whether it has succeeded in locking all the sites in its request set.

YIELD: A YIELD message from site Si to Sj indicates that Si is returning the permission to Sj (to yield to a higher-priority request at Sj).

Handling Deadlocks

Maekawa's algorithm handles deadlocks as follows:

When a REQUEST(ts, i) from site Si blocks at site Sj because Sj has currently granted permission to site Sk, then Sj sends a FAILED(j) message to Si if Si's request has lower priority. Otherwise, Sj sends an INQUIRE(j) message to site Sk.

In response to an INQUIRE(j) message from site Sj, site Sk sends a YIELD(k) message to Sj provided Sk has received a FAILED message from a site in its request set, or if it sent a YIELD to any of these sites but has not received a new GRANT from it.
In response to a YIELD(k) message from site Sk, site Sj assumes that it has been released by Sk, places the request of Sk at the appropriate location in the request queue, and sends a GRANT(j) to the site whose request is at the top of the queue.

Maekawa's algorithm thus requires extra messages to handle deadlocks; the maximum number of messages required per CS execution in this case is 5√N.

Handling Deadlocks: Case-I
When a REQUEST(ts, i) from site Si blocks at site Sj because Sj has currently granted permission to site Sk, then Sj sends a FAILED(j) message to Si if Si's request has lower priority. Otherwise, Sj sends an INQUIRE(j) message to site Sk.
[Case-I diagram: Si's REQUEST(ts, i) blocks at Sj, which has already sent REPLY(j) to Sk's REQUEST(ts, k); Sj sends FAILED(j) to Si in the lower-priority case, and INQUIRE(j) to Sk otherwise.]

Handling Deadlocks: Case-II
In response to an INQUIRE(j) message from site Sj, site Sk sends a YIELD(k) message to Sj provided Sk has received a FAILED message from a site in its request set, or if it sent a YIELD to any of these sites but has not received a new GRANT from it.
[Case-II diagram: Sj sends INQUIRE(j) to Sk; Sk responds with YIELD(k).]

Handling Deadlocks: Case-III
In response to a YIELD(k) message from site Sk, site Sj assumes that it has been released by Sk, places the request of Sk at the appropriate location in the request queue, and sends a GRANT(j) to the site of the top request in the queue.
[Case-III diagram: Sk sends YIELD(k) to Sj; Sj then sends GRANT(j) to the top request's site, i.e., Si.]

(ii) Agarwal-El Abbadi Quorum-Based Algorithm

Agarwal and El Abbadi developed a simple and efficient mutual exclusion algorithm by introducing tree quorums. They gave a novel algorithm for constructing tree-structured quorums that exploits the hierarchical structure of a network.

The mutual exclusion algorithm is independent of the underlying topology of the network, and there is no need for a multicast facility in the network; however, such a facility would improve the performance of the algorithm.

The mutual exclusion algorithm assumes that the sites in the distributed system can be organized into a structure such as a tree, grid, binary tree, etc., and that there exists a routing mechanism to exchange messages between different sites in the system.

(ii) Agarwal-El Abbadi Quorum-Based Algorithm

The Agarwal-El Abbadi quorum-based algorithm uses 'tree-structured quorums': all the sites in the system are logically organized into a complete binary tree.

For a complete binary tree of level k, we have 2^(k+1) − 1 sites, with the root at level k and the leaves at level 0. The number of sites in a path from the root to a leaf is k + 1, which is O(log n).

A path in a binary tree is a sequence a1, a2, ..., ai, ai+1, ..., ak such that ai is the parent of ai+1.

Algorithm for constructing a tree-structured quorum

FUNCTION GetQuorum (Tree: NetworkHierarchy): QuorumSet;
VAR left, right : QuorumSet;
BEGIN
  IF Empty (Tree) THEN
    RETURN ({});
  ELSE IF GrantsPermission(Tree↑.Node) THEN
    RETURN ((Tree↑.Node) ∪ GetQuorum (Tree↑.LeftChild));
    OR RETURN ((Tree↑.Node) ∪ GetQuorum (Tree↑.RightChild)); (* line 9 *)
  ELSE
    left ← GetQuorum(Tree↑.LeftChild);
    right ← GetQuorum(Tree↑.RightChild);
    IF (left = ∅ ∨ right = ∅) THEN
      EXIT(-1); (* Unsuccessful in establishing a quorum *)
    ELSE
      RETURN (left ∪ right);
    END; (* IF *)
  END; (* IF *)
END GetQuorum

Algorithm for constructing a tree-structured quorum: Explanation

The algorithm for constructing tree-structured quorums uses two functions, GetQuorum(Tree) and GrantsPermission(site), and assumes that there is a well-defined root for the tree.

GetQuorum is a recursive function that takes a tree node "x" as parameter and calls GetQuorum for a child of "x" provided that GrantsPermission(x) is true.

GrantsPermission(x) is true only when the node "x" agrees to be in the quorum. If the node "x" is down due to a failure, it may not agree to be in the quorum, and the value of GrantsPermission(x) will be false.

The algorithm tries to construct quorums such that each quorum represents a path from the root to a leaf; i.e., in the failure-free case a quorum is any set a1, a2, ..., ai, ai+1, ..., ak, where a1 is the root, ak is a leaf, and for all i < k, ai is the parent of ai+1.

If it fails to find such a path (say, because a node 'x' has failed), control goes to the ELSE block, which specifies that the failed node 'x' is substituted by two paths, both of which start with the left and right children of 'x' and end at leaf nodes.

If a leaf site is down or inaccessible for any reason, then the quorum cannot be formed and the algorithm terminates with an error condition. The sets constructed using this algorithm are termed tree quorums.

Analysis of the algorithm for constructing tree-structured quorums

In the best case, the algorithm takes O(log n) sites to form a tree quorum. There are certain cases where, even in the event of a failure, O(log n) sites are sufficient to form a tree quorum: for example, if the parent of a leaf node fails, the number of sites necessary for a quorum is still O(log n). Thus, the algorithm requires very few messages in a relatively fault-free environment.

The algorithm can tolerate the failure of up to n − O(log n) sites and still form a tree quorum.

In the worst case, the algorithm requires a majority of the sites to construct a tree quorum, and the number of sites is the same in all cases (faults or no faults); the worst-case tree quorum size is O((n+1)/2), which can be shown by induction.

Examples of Tree-Structured Quorums

When there is no node failure, the number of quorums formed is equal to the number of leaf sites.
Consider the tree of height 3 show in Figure 8.1, constructed from 15 (=23+1-1) sites. EL In this case 8 quorums are formed from 8 possible root-leaf paths: 1-2-4-8, 1-2-4-9, 1-2-5-10, 1-2-5-11, 1-3-6-12, 1-3-6-13, 1-3-7-14 and 1-3-7-15. PT If any site fails, the algorithm substitutes for that site two possible paths starting from the site’s two children and ending in leaf nodes. For example, when node 3 fails, we consider possible paths starting from children 6 N and 7 and ending at leaf nodes. The possible paths starting from child 6 are 6-12 and 6-13, and from child 7 are 7-14 and 7-15. So, when node 3 fails, the following eight quorums can be formed: {1,6,12,7,14}, {1,6,12,7,15}, {1,6,13,7,14}, {1,6,13,7,15}, {1,2,4,8}, {1,2,4,9},{1,2,5,10}, {1,2,5,11}. 8 quorums are formed : 1 1-2-4-8, 1-2-4-9, 1-2-5-10, 2 3 EL 1-2-5-11, 1-3-6-12, PT 1-3-6-13, 4 5 6 7 1-3-7-14 and 1-3-7-15 8 9 N 10 11 12 13 14 15 Figure 8.1: A tree of 15 sites. So, when node 3 fails, quorums can 1 be formed: {1,6,12,7,14}, {1,6,12,7,15}, {1,6,13,7,14}, 2 3 EL {1,6,13,7,15}, {1,2,4,8}, PT {1,2,4,9}, 4 5 6 7 {1,2,5,10}, {1,2,5,11}. 8 9 N 10 11 12 13 14 15 Figure 8.1: A tree of 15 sites. Property of ‘Graceful degradation’ Since the number of nodes from root to leaf in an ‘n’ node complete tree is log n, the best case for quorum formation, i.e, the least number of nodes needed for a quorum is log n. EL When the number of node failures is greater than or equal to log n, the algorithm may not be able to form tree-structured PT quorum. So, as long as the number of site failures is less than log n, the N tree quorum algorithm gurantees the formation of a quorum and it exhibits the property of ‘graceful degradation’. Mutual Exclusion Algorithm A site s enters the critical section (CS) as follows: Site s sends a ‘Request’ message to all other sites in the structured quorum it belongs to. Each site in the quorum stores incoming requests in a request queue, EL ordered by their timestamps. 
A site sends a ‘Reply’ message, indicating its consent to enter the CS, only to the request at the head of its request queue, i.e., the one with the lowest timestamp.
If site s gets a ‘Reply’ message from all sites in the structured quorum it belongs to, it enters the CS.
After exiting the CS, s sends a ‘Relinquish’ message to all sites in the structured quorum.
On receipt of the ‘Relinquish’ message, each site removes s’s request from the head of its request queue.

Mutual Exclusion Algorithm (contd.)
If a new request arrives with a timestamp smaller than the request at the head of the queue, an ‘Inquire’ message is sent to the process whose request is at the head of the queue, and the sender waits for a ‘Yield’ or ‘Relinquish’ message.
When a site s receives an ‘Inquire’ message, it acts as follows: If s has acquired all of the replies necessary to access the CS, it simply ignores the ‘Inquire’ message, proceeds normally, and sends a ‘Relinquish’ message after exiting the CS.
If s has not yet collected enough replies from its quorum, it sends a ‘Yield’ message to the inquiring site.
When a site gets the ‘Yield’ message, it puts the pending request (on behalf of which the ‘Inquire’ message was sent) at the head of the queue and sends a ‘Reply’ message to the requestor.

Example: Site S1 sends a ‘Request’ message (with timestamp ts) to all other sites in the structured quorum it belongs to, i.e., S2 and S3. Each of S2 and S3 places S1’s request at the head of its request queue and answers with a ‘Reply’. Having received Replies from all sites of its quorum, S1 enters the CS; on exit, it sends a ‘Relinquish’ message to S2 and S3, which then remove S1’s request from the heads of their queues.
Now suppose S1’s request arrives at S2 while a request from S3, with a larger timestamp, is already at the head of S2’s queue (RQ2: 3, 1). Since the new request has the smaller timestamp, S2 sends an ‘Inquire’ message to S3 and waits for a ‘Yield’ or ‘Relinquish’ message.
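The basic Request/Reply/Relinquish handling at a single quorum member can be sketched as follows. This is an illustrative sketch only (class and method names are ours, and the Inquire/Yield path is omitted): requests are ordered by (timestamp, site id), a Reply goes only to the head of the queue, and a Relinquish removes that head so the next request can be granted.

```python
import heapq

class QuorumSite:
    """One member of a structured quorum, managing its request queue."""
    def __init__(self):
        self.queue = []          # min-heap of (timestamp, requesting site)
        self.granted = None      # request currently holding this site's Reply

    def on_request(self, ts, site):
        """A 'Request' arrives: enqueue it, ordered by timestamp."""
        heapq.heappush(self.queue, (ts, site))
        return self._try_grant()

    def on_relinquish(self):
        """A 'Relinquish' arrives: drop the head request and grant the next."""
        self.granted = None
        heapq.heappop(self.queue)
        return self._try_grant()

    def _try_grant(self):
        """Send a 'Reply' to the head request if no Reply is outstanding."""
        if self.granted is None and self.queue:
            self.granted = self.queue[0]
            return ("Reply", self.granted[1])
        return None

s2 = QuorumSite()
print(s2.on_request(1, "S1"))    # → ('Reply', 'S1'): head of an empty queue
print(s2.on_request(3, "S3"))    # → None: S1 still holds S2's Reply
print(s2.on_relinquish())        # → ('Reply', 'S3'): next request granted
```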
When site S3 receives the ‘Inquire’ message, it acts as follows:
Case 1: S3 has already acquired all the replies necessary to access the CS. It simply ignores the ‘Inquire’ message, proceeds normally, and sends a ‘Relinquish’ message to its quorum after exiting the CS.
Case 2: S3 has not yet collected enough replies from its quorum. It sends a ‘Yield’ message to the inquiring site S2. When S2 gets the ‘Yield’ message, it puts the pending request from S1 (on behalf of which the ‘Inquire’ message was sent) at the head of its queue and sends a ‘Reply’ message to S1.

Correctness proof
Mutual exclusion is guaranteed because the set of quorums satisfies the Intersection property.
Consider a coterie C consisting of the quorums {1,2,3}, {2,4,5} and {4,1,6}. Suppose nodes 3, 5 and 6 want to enter the CS, and they send requests to sites (1,2), (2,4) and (1,4), respectively.
Suppose site 3’s request arrives at site 2 before site 5’s request. In this case, site 2 will grant permission to site 3’s request and reject site 5’s request. Similarly, suppose site 3’s request arrives at site 1 before site 6’s request, so site 1 will grant permission to site 3’s request and reject site 6’s request.
Since sites 5 and 6 did not get consent from all sites in their quorums, they do not enter the CS. Since site 3 alone gets consent from all sites in its quorum, it enters the CS and mutual exclusion is achieved.
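The Intersection property invoked in this proof can be checked mechanically: every pair of quorums in a coterie must share at least one site, so no two sites can both collect full consent at the same time. A small sketch (the helper name is ours):

```python
from itertools import combinations

def has_intersection_property(coterie):
    """True iff every pair of quorums in the coterie shares a site."""
    return all(q1 & q2 for q1, q2 in combinations(coterie, 2))

# The coterie from the correctness argument above.
coterie = [{1, 2, 3}, {2, 4, 5}, {4, 1, 6}]
print(has_intersection_property(coterie))   # → True; e.g. {1,2,3} ∩ {2,4,5} = {2}
```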
Conclusion
In this lecture, we have discussed quorum-based approaches, i.e., Maekawa’s algorithm and the Agarwal-El Abbadi quorum-based algorithm.
There exists a variety of quorums and a variety of ways to construct them. For example, Maekawa used the theory of projective planes to develop quorums of size √N, while the Agarwal-El Abbadi quorum-based algorithm uses tree-structured quorums.
In the upcoming lecture, we will discuss token-based approaches, i.e., Suzuki-Kasami’s broadcast algorithm and Raymond’s tree-based algorithm.