W1_Lecture Notes.pdf
Document Details
Indian Institute of Technology, Patna
Full Transcript
Lecture: 01 Introduction to Distributed Systems
Rajiv Misra, Dept. of Computer Science & Engineering, Indian Institute of Technology Patna, [email protected]

Preface
The explosive growth of distributed computing systems makes understanding them imperative, yet difficult, because of the uncertainties introduced by asynchrony, limited local knowledge, and partial failures. Nature solves this problem elegantly, for example in a flock of birds: mobile intelligent agents that communicate to achieve a common goal. The field of distributed computing provides the theoretical underpinning for the design and analysis of many distributed systems, covering communication, coordination, synchronization under uncertainty, and lower-bound techniques.

Course Contents
Introduction to Distributed Systems; Basic Algorithms in Message Passing Systems; Leader Election in Rings; Models of Distributed Computation; Causality & Logical Time; Global State and Snapshot Recording Algorithms; Distributed Mutual Exclusion Algorithms; Distributed Shared Memory; Consensus and Agreement Algorithms; Checkpointing & Rollback Recovery. Case studies: Distributed Hash Table; Peer-to-Peer Computing and Overlay Graphs; Distributed Minimum Spanning Tree; Google File System; HDFS and MapReduce; Introduction to Spark; Introduction to Sensor Networks.

Books
Text books: Distributed Computing: Principles, Algorithms, and Systems by Ajay D. Kshemkalyani and Mukesh Singhal; Distributed Computing: Fundamentals, Simulations and Advanced Topics by Hagit Attiya and Jennifer Welch. Reference book: Distributed Algorithms by Nancy Lynch.

Distributed System: Definition
A distributed system is a collection of independent entities that cooperate to solve a problem that cannot be solved individually. Equivalently, it is a collection of computers that do not share a common memory or a common physical clock, that communicate by message passing over a communication network, and where each computer has its own memory and runs its own operating system. Typically the computers are semi-autonomous and loosely coupled while they cooperate to address a problem collectively.

Properties of Distributed Systems
Heterogeneity: systems consist of heterogeneous hardware and software components. Concurrency: multiple programs run together. Shared data: data is accessed simultaneously by multiple entities. No global clock: each component has a local notion of time. Interdependencies: independent components depend on each other.

Relation to Computer System Components
[Figure 1.1: A typical distributed system - processors P with memory banks M, each running its own software and operating system, connected by a communication network (WAN/LAN).]
A distributed system connects autonomous processors by a communication network, and the processors cooperate to run applications over that network. The software components that run on each computer use the local operating system and network protocol stack for their functioning. The distributed software itself is also termed middleware. A distributed execution is the execution of processes across the distributed system to collaboratively achieve a common goal; an execution is also sometimes termed a computation or a run.

Layered Architecture
The middleware is the distributed software that drives the distributed system, while providing transparency of heterogeneity at the platform level.
[Figure 1.2: Layered architecture of a distributed system - the distributed application sits on the distributed software (middleware libraries), which in turn uses the network protocol stack (application, transport, network, and data link layers) and the operating system.]
Several middleware standards exist: OMG CORBA (Common Object Request Broker Architecture from the Object Management Group), RPC (Remote Procedure Call), DCOM (Distributed Component Object Model), RMI (Remote Method Invocation), and MPI (Message-Passing Interface).

Motivation for Distributed Systems
1. Inherently distributed computations: in many applications, such as money transfer in banking or reaching consensus among geographically distant parties, the computation is inherently distributed.
2. Resource sharing: resources such as peripherals, complete data sets in databases, and special libraries cannot be fully replicated at all sites because that is often neither practical nor cost-effective.
3. Access to geographically remote data and resources: for example bank databases, supercomputers, and resource-constrained mobile devices.
4. Enhanced reliability: replicating resources and executions enhances reliability, since geographically distributed resources are unlikely to crash or malfunction at the same time. Reliability entails several aspects: availability, i.e., the resource should be accessible at all times; integrity, i.e., the value/state of the resource should be correct in the face of concurrent access from multiple processors, as per the semantics expected by the application; and fault tolerance, i.e., the ability to recover from system failures (defined via failure models).
Increased performance/cost ratio: obtained by accessing geographically remote data and by resource sharing.
Other advantages of distributed systems:
5. Scalability: adding more processors to the communication network does not pose a direct bottleneck to the communication system.
6. Modularity and incremental expandability: heterogeneous processors may easily be added or replaced without affecting performance.

Design Issues and Challenges
These can be grouped into three categories: (1) the systems perspective of distributed system design, (2) the algorithmic perspective of distributed system design, and (3) challenges driven by recent technology advances and new applications.

1. Distributed system design challenges from a systems perspective
Communication: designing appropriate mechanisms for communication among the processes in the network.
Processes: management of processes and threads at clients and servers, code migration, and the design of software and mobile agents.
Synchronization: synchronization or coordination among the processes is essential. Mutual exclusion is the classical example, but many other forms of synchronization, such as leader election, physical clocks, logical clocks, and global state recording algorithms, require different forms of coordination.
Fault tolerance: maintaining correctness in spite of failures of links, nodes, and processes. Process resilience, reliable communication, distributed commit, checkpointing and recovery, agreement and consensus, failure detection, and self-stabilization are some of the mechanisms that provide fault tolerance.
Transparency: hiding the implementation policies from the user. Transparency can be classified as follows:
1. Access transparency: hides differences in data representation on different systems and provides uniform operations to access system resources.
2. Location transparency: makes the locations of resources transparent to the users.
3. Migration transparency: allows relocating resources without changing their names.
4. Relocation transparency: the ability to relocate resources while they are being accessed.
5. Replication transparency: the user does not become aware of any replication.
6. Concurrency transparency: masks the concurrent use of shared resources from the user.
7. Failure transparency: the system remains reliable and fault-tolerant in the presence of failures.

Distributed Algorithms
In distributed systems, different complexity measures are of interest: not only time and space, but also communication cost (number of messages, message size, number of shared variables) and the number of faulty versus non-faulty components. The complications faced by distributed systems widen the scope of "negative results": lower bounds and impossibility results.
Fundamental issues in the design of distributed algorithms are asynchrony, limited knowledge, and failures.
Asynchrony: absolute and relative timings of events cannot always be known precisely.
Local view: a computing entity can only be aware of the information it acquires, so it has a local view of the global situation.
Failures: computing entities can fail independently, leaving some components operational while others are not.

Distributed Computing Systems
Distributed computing has been studied since 1967, starting with Dijkstra and Lamport. Edsger Dijkstra was the 1972 Turing Award winner; Leslie Lamport won the 2013 Turing Award. Lamport devised important algorithms and developed formal modeling and verification protocols that improve the quality of real distributed systems. He made fundamental contributions to the theory and practice of distributed systems, notably the invention of concepts such as causality and logical clocks, safety and liveness, replicated state machines, and sequential consistency. He received the 2013 Turing Award for his work on distributed computing systems, in which several autonomous computers communicate with each other by passing messages.

2. Algorithmic challenges in distributed computing
1) Time and global state in a distributed system: the processes in the system are spread across three-dimensional physical space, and another dimension, time, has to be superimposed uniformly across space. The challenges pertain to providing accurate physical time, and to providing a variant of time called logical time. Logical time is relative time; it eliminates the overhead of providing physical time for applications where physical time is not required. More importantly, logical time can (i) capture the logic and inter-process dependencies within the distributed program, and (ii) track the relative progress at each process.
2) Synchronization/coordination mechanisms: the processes must be allowed to execute concurrently, except when they need to synchronize to exchange information, i.e., to communicate about shared data. Synchronization is essential for the distributed processes to overcome their limited observation of the system state.
The following mechanisms are used for synchronization/coordination:
Leader election: deals with asymmetry.
Mutual exclusion: access to critical resources has to be coordinated.
Termination detection: cooperation among processes to detect the global state.
Garbage collection: detecting garbage needs coordination.
3) Reliable and fault-tolerant distributed systems: a reliable and fault-tolerant environment has multiple requirements and aspects, which can be addressed using various strategies: 1. consensus algorithms; 2. replication and replica management; 3. voting and quorum systems; 4. distributed databases and distributed commit; 5. self-stabilizing systems; 6. checkpointing and recovery algorithms; 7. failure detectors.
4) Group communication, multicast, and ordered message delivery: a group is a collection of processes that share a common context and collaborate on a common task within an application domain. Specific algorithms need to be designed to enable efficient group communication and group management, wherein processes can join and leave groups dynamically, or even fail. When multiple processes send messages concurrently, different recipients may receive the messages in different orders, possibly violating the semantics of the distributed program.
5) Distributed shared memory abstraction: under the covers, in the middleware layer, the abstraction of a shared address space has to be implemented by using message passing.

3. Applications of distributed computing and newer challenges
1) Mobile systems: mobile systems typically use wireless communication, which is based on electromagnetic waves and utilizes a shared broadcast medium.
2) Sensor networks: a sensor is a processor with an electro-mechanical interface that is capable of sensing physical parameters such as temperature, velocity, pressure, humidity, and chemicals.
3) Ubiquitous or pervasive computing: ubiquitous systems represent a class of computing in which processors embedded in, and seamlessly pervading, the environment perform application functions in the background.
4) Peer-to-peer computing: peer-to-peer (P2P) computing represents computing over an application-layer network in which all interactions among the processors are at a "peer" level, without any hierarchy among the processors. Thus all processors are equal and play a symmetric role in the computation.
5) Distributed data mining: data mining algorithms examine large amounts of data to detect patterns and trends, in order to mine or extract useful information. A traditional example is examining the purchasing patterns of customers in order to profile the customers and enhance the effectiveness of directed marketing schemes.
6) Grid computing: analogous to the electrical power distribution grid, it is envisaged that the information and computing grid will become a reality some day. Very simply stated, idle CPU cycles of machines connected to the network will be made available to others.
7) Security in distributed systems: the traditional challenges of security in a distributed setting include confidentiality (ensuring that only authorized processes can access certain information), authentication (ensuring the source of received information and the identity of the sending process), and availability (maintaining allowed access to services despite malicious actions).

Conclusion
Distributed systems have a wide variety of applications in real-world scenarios. To understand their contribution, one needs to be familiar with their fundamental principles. This lecture first characterized distributed systems and distributed algorithms by looking at various informal definitions, design issues, and challenges from both theoretical and systems aspects. Upcoming lectures will give insight into the detailed concepts.

Lecture: 02 Basic Algorithms in Message-Passing Systems
Rajiv Misra, Dept. of Computer Science & Engineering, Indian Institute of Technology Patna, [email protected]

Preface
Recap of previous lecture: distributed algorithms assume asynchrony, local knowledge, and failures.
Content of this lecture: this lecture introduces the formal model of a distributed message-passing system. Two main timing models, synchronous and asynchronous, are considered. A few simple algorithms for message-passing systems with arbitrary topology, both synchronous and asynchronous, will be discussed. These algorithms broadcast information, collect information, and construct spanning trees of the network.

Message-Passing Model
In a message-passing system, processors communicate by sending messages over communication channels, where each channel provides a bidirectional connection between two specific processors. The pattern of connections provided by the channels describes the topology of the system. The collection of channels is often referred to as the network.
More formally, a system or algorithm consists of n processors p0, p1, ..., pn-1, where i is the index of processor pi (the nodes of a graph), and bidirectional point-to-point channels (the undirected edges of the graph). Each processor labels its incident channels 1, 2, 3, ...; it might not know who is at the other end.
[Figure: a four-processor example with p0, p1, p2, p3 connected by locally labeled channels.]

Modeling Processors and Channels
A processor is a state machine that includes the local state of the processor and mechanisms for modeling channels. A channel directed from processor pi to processor pj is modeled in two pieces: an outbuf variable of pi and an inbuf variable of pj. The outbuf corresponds to the physical channel, the inbuf to the incoming message queue. The accessible state of a processor consists of its local variables plus its inbufs.

Configuration
A vector of processor states (including the outbufs, i.e., the channels), one per processor, is a configuration of the system. It captures a snapshot of the entire system: the accessible processor states (local variables plus incoming message queues) as well as the communication channels.
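The following is a minimal, illustrative sketch (not taken from the lecture) of how these pieces of the formal model fit together in code: a processor as a state machine with local variables, inbufs and outbufs; a configuration as the vector of processor states; and a deliver event that moves one message from an outbuf to an inbuf. The class and function names are hypothetical.

```python
# Illustrative sketch of the message-passing model; names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Processor:
    pid: int                                      # index i of p_i (used only for analysis)
    state: dict = field(default_factory=dict)     # local variables
    inbuf: dict = field(default_factory=dict)     # channel label -> queue of received messages
    outbuf: dict = field(default_factory=dict)    # channel label -> messages still in transit

@dataclass
class Configuration:
    """Snapshot of the whole system: one processor state per processor,
    including the outbufs (the messages still on the channels)."""
    processors: list

def deliver(sender: Processor, receiver: Processor, s_label: int, r_label: int):
    """Deliver event: move one message from the sender's outbuf to the receiver's inbuf."""
    msg = sender.outbuf[s_label].pop(0)
    receiver.inbuf.setdefault(r_label, []).append(msg)
    return msg

p0 = Processor(0, outbuf={1: ["hello"]})
p1 = Processor(1)
deliver(p0, p1, s_label=1, r_label=2)
print(p1.inbuf)   # {2: ['hello']}
```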
Events
Occurrences that can take place in a system are modeled as events. For message-passing systems, we consider two kinds of events: (i) deliver events and (ii) computation events.
(i) Deliver event: moves a message from the sender's outbuf to the receiver's inbuf; the message will be available the next time the receiver takes a step.
(ii) Computation event: occurs at one processor. Three things happen in a computation event: (1) start with the old accessible state (local variables plus incoming messages); (2) apply the transition function of the processor's state machine, which handles all incoming messages; (3) end with the new accessible state, with empty inbufs and new outgoing messages.

Execution
An execution has the format config, event, config, event, config, ... In the first configuration, each processor is in its initial state and all inbufs are empty. For each consecutive (config, event, config) triple, the new configuration is the same as the old configuration except that: if the event is a deliver event, the specified message is transferred from the sender's outbuf to the receiver's inbuf; if it is a computation event, the specified processor's state (including its outbufs) changes according to its transition function.

Admissibility
The definition of execution gives some basic "syntactic" conditions, usually safety conditions (true in every finite prefix). Informally, a safety condition states that nothing bad has happened yet; for instance, one could require that a step by p1 never immediately follows a step by any processor other than p0. Sometimes we want to impose additional constraints, usually liveness conditions (eventually something happens). A liveness condition must hold a certain number of times, possibly an infinite number of times; informally, it states that eventually something good happens. Any sequence that satisfies all required safety conditions for a particular system type is called an execution; if an execution also satisfies all required liveness conditions, it is called admissible. The admissible executions are the ones that must solve the problem of interest.

Types of Message-Passing Systems
There are two types of message-passing systems, asynchronous and synchronous.
1) Asynchronous systems: a system is asynchronous if there is no fixed upper bound on how long it takes for a message to be delivered or how much time elapses between consecutive steps of a processor. An example of an asynchronous system is the Internet, where a message (for instance, an e-mail) can take days to arrive, although often it takes only seconds.
2) Synchronous systems: in the synchronous model, processors execute in lockstep. The execution is partitioned into rounds, and in each round every processor can send a message to each neighbour, the messages are delivered, and every processor computes based on the messages just received.
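Before turning to admissibility in each model, the lockstep round structure of the synchronous model can be sketched as a simple simulation loop. This is an illustrative sketch, not the lecture's formal definition; the transition function `compute` is an assumed user-supplied parameter.

```python
# Illustrative sketch of the synchronous (round-based) model.
def run_synchronous(states, neighbours, compute, rounds):
    """states: pid -> local state; neighbours: pid -> list of neighbour pids."""
    for _ in range(rounds):
        # every processor sends one message to each neighbour
        outgoing = {pid: {nbr: (pid, states[pid]) for nbr in neighbours[pid]}
                    for pid in states}
        # all messages of the round are delivered before anyone computes
        incoming = {pid: [] for pid in states}
        for pid, msgs in outgoing.items():
            for nbr, msg in msgs.items():
                incoming[nbr].append(msg)
        # every processor applies its transition function to what it just received
        for pid in states:
            states[pid] = compute(pid, states[pid], incoming[pid])
    return states

# Example: after enough rounds of forwarding the largest id seen so far,
# every processor knows the maximum id in the network.
neighbours = {0: [1], 1: [0, 2], 2: [1]}
states = {0: 5, 1: 9, 2: 2}
result = run_synchronous(states, neighbours,
                         lambda pid, s, msgs: max([s] + [m[1] for m in msgs]),
                         rounds=2)
print(result)   # {0: 9, 1: 9, 2: 9}
```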
1. Asynchronous message-passing systems: an execution is admissible for the asynchronous model if every message in an outbuf is eventually delivered and every processor takes an infinite number of steps. There are no constraints on when these events take place: arbitrary message delays and relative processor speeds are not ruled out. This models a reliable system (no message is lost and no processor stops working).
2. Synchronous message-passing systems: here the definition of admissible captures the lockstep unison feature of the synchronous model. It also implies that every message sent is delivered and that every processor takes an infinite number of steps. Time is measured as the number of rounds until termination.

Broadcast and Convergecast on a Spanning Tree

Broadcast Over a Rooted Spanning Tree
Broadcast is used to send information to all processors. Suppose the processors already have information about a rooted spanning tree of the communication topology: a tree is a connected graph with no cycles; a spanning tree contains all processors; rooted means there is a unique root processor. The tree is represented via parent and children local variables at each processor, which indicate which incident channels lead to the parent and the children in the rooted spanning tree.
Concept: the root initially sends M to its children; when a processor receives M from its parent, it sends M to its children and terminates (sets a local boolean to true).
[Figure: two steps in an execution of the broadcast algorithm.]
Complexity analysis: in the synchronous model, the time is the depth d of the spanning tree (at most n - 1 when the tree is a chain), and the number of messages is n - 1, since one message is sent over each spanning tree edge. The asynchronous model has the same bounds: time O(d) and n - 1 messages.

Convergecast
Convergecast is used to collect information. Again, suppose a rooted spanning tree has already been computed by the processors, with parent and children variables at each processor. It is the opposite of broadcast: leaves send messages to their parents, and a non-leaf waits to get a message from each child, then sends the combined (aggregated) information to its parent.
[Figure: a convergecast example; dotted lines are non-tree edges and solid arrows are parent-child relationships. Each node forwards the aggregate of its subtree (for example, node b forwards b,d and node c forwards c,f,h). Two steps in an execution of the convergecast algorithm are shown.]
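A compact way to see broadcast and convergecast on an already-known rooted spanning tree is the following sketch, which replaces explicit message passing by recursion over the tree. The `children` map plays the role of the local children variables from the lecture; `combine` and `local_value` are assumed helper functions introduced only for this illustration.

```python
# Illustrative sketch of broadcast and convergecast on a rooted spanning tree.
def broadcast(root, children, message, deliver):
    """The root sends M down the tree; every node forwards M to its children."""
    deliver(root, message)
    for child in children.get(root, []):
        broadcast(child, children, message, deliver)

def convergecast(node, children, local_value, combine):
    """Leaves report their value; an internal node waits for all children,
    combines their reports with its own value, and passes the result up."""
    reports = [convergecast(c, children, local_value, combine)
               for c in children.get(node, [])]
    return combine(local_value(node), reports)

# Example: compute the maximum value held by any processor in the tree.
children = {0: [1, 2], 1: [3], 2: [], 3: []}
values = {0: 7, 1: 3, 2: 9, 3: 5}
broadcast(0, children, "M", lambda pid, m: None)
maximum = convergecast(0, children, values.get,
                       lambda v, reports: max([v] + reports))
print(maximum)  # 9
```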
Finding a Spanning Tree Given a Root
Having a spanning tree is very convenient; how do you get one? Suppose a distinguished processor is known, to serve as the root. The root sends M to all its neighbors. When a non-root first gets M, it sets the sender as its parent, sends a "parent" message to the sender, and sends M to all its other neighbors (if it has no other neighbors, it terminates). When a processor gets M otherwise, it sends a "reject" message to the sender. The "parent" and "reject" messages are used to set the children variables and to know when to terminate (after hearing from all neighbors).
In both models the algorithm uses O(m) messages and O(diam) time. In the synchronous model it always gives a breadth-first search (BFS) tree; in the asynchronous model the tree is not necessarily a BFS tree. Is the DFS property guaranteed instead? No: one asynchronous execution may give a depth-first search (DFS) tree, while another asynchronous execution results in a tree that is neither BFS nor DFS.
[Figure: executions of the spanning tree algorithm on the same graph, yielding a BFS tree, a DFS tree, and a tree that is neither.]

Finding a DFS Spanning Tree Given a Root
When the root first takes a step, or when a non-root first receives M: mark the sender as parent (if not the root); then, for each neighbor in series, send M to it and wait to get a "parent" or "reject" message in reply; finally send a "parent" message to the parent neighbor. When a processor receives M otherwise, it sends "reject" to the sender. The "parent" and "reject" messages are used to set the children variables and to know when to terminate. This algorithm ensures that the spanning tree is always a DFS tree; it is analogous to the sequential DFS algorithm. Message complexity: O(m), since a constant number of messages are sent over each edge. Time complexity: O(m), since each edge is explored in series.

Finding a Spanning Tree Without a Root
Assume the processors have unique identifiers (otherwise it is impossible!). Idea: each processor starts running a copy of the DFS spanning tree algorithm with itself as root, tagging each message with the initiator's id to differentiate copies; when copies "collide", the copy with the larger id wins. Message complexity: O(nm). Time complexity: O(m).
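The flooding construction with a known root can be sketched as follows. Simulating it level by level with a FIFO queue corresponds to the synchronous case and therefore yields a BFS tree; as just noted, asynchronous executions need not have this property. This is an illustrative sketch, not the lecture's pseudocode.

```python
# Illustrative sketch of the flooding spanning-tree construction with a known root.
from collections import deque

def spanning_tree(root, neighbours):
    """neighbours: pid -> list of neighbour pids. Returns pid -> parent pid."""
    parent = {root: None}
    frontier = deque([root])          # processors that have just received M
    while frontier:
        u = frontier.popleft()
        for v in neighbours[u]:
            if v not in parent:       # first time v receives M: adopt the sender
                parent[v] = u         # v would reply "parent" to u
                frontier.append(v)
            # otherwise v would reply "reject" to u
    return parent

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
print(spanning_tree(0, graph))   # e.g. {0: None, 1: 0, 2: 0, 3: 1}
```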
Conclusion
This lecture introduced the formal model of a distributed message-passing system, in both its synchronous and asynchronous forms. A few message-passing algorithms were demonstrated to illustrate the concepts and complexity measures. The algorithms solve the problems of broadcast, convergecast, and DFS, and they are used as basic building blocks of distributed algorithms. In upcoming lectures, we will give a more detailed discussion of leader election and minimum cost spanning trees.

Lecture: 03 Leader Election in Rings
Rajiv Misra, Dept. of Computer Science & Engineering, Indian Institute of Technology Patna, [email protected]

Preface
Recap of previous lecture: we discussed the formal model of a distributed message-passing system, i.e., the synchronous and asynchronous timing models with no failures. A few simple algorithms for message-passing systems were demonstrated to illustrate concepts and complexity measures; they solve broadcast, convergecast, and DFS, and are used as basic building blocks of distributed algorithms.
Content of this lecture: we discuss the leader election problem in message-passing systems for a ring topology, in which a group of processors must choose one among them to be the leader. We present different algorithms for the leader election problem, covering cases such as anonymous/non-anonymous rings, uniform/non-uniform rings, and synchronous/asynchronous rings.

Leader Election (LE) Problem: Introduction
The leader election problem has several variants. The LE problem is for each processor to decide that it is either the leader or a non-leader, subject to the constraint that exactly one processor decides to be the leader. LE represents a general class of symmetry-breaking problems. For example, when a deadlock is created because processors are waiting in a cycle for each other, the deadlock can be broken by electing one of the processors as a leader and removing it from the cycle.

Leader Election: Definition
Each processor has a set of elected (won) and not-elected (lost) states. Once an elected state is entered, the processor is always in an elected state (and similarly for not-elected): the decision is irreversible. In every admissible execution, every processor eventually enters either an elected or a not-elected state, and exactly one processor (the leader) enters an elected state.

Uses of Leader Election
A leader can be used to coordinate activities of the system: for example, to find a spanning tree using the leader as the root, or to reconstruct a lost token in a token-ring network. In this lecture we study leader election in rings.

Ring Networks
In an oriented ring, processors have a consistent notion of left and right. For example, if messages are always forwarded on channel 1, they will cycle clockwise around the ring.

Why Study Rings?
Rings are a simple starting point that is easy to analyze, they are an abstraction of a token ring, and lower bounds and impossibility results for the ring topology also apply to arbitrary topologies.

Anonymous Rings
How do we model the situation in which processors do not have unique identifiers? First attempt: require each processor to have the same state machine. A subtle point: does the algorithm rely on knowing the ring size (the number of processors)?

Uniform (Anonymous) Algorithms
A uniform algorithm does not use the ring size (the same algorithm works for every ring size); formally, every processor in every size ring is modeled with the same state machine. A non-uniform algorithm uses the ring size (a different algorithm for each ring size); formally, for each value of n, every processor in a ring of size n is modeled with the same state machine An. Note the lack of unique ids.

Leader Election in Anonymous Rings
Theorem: there is no leader election algorithm for anonymous rings, even if the algorithm knows the ring size (non-uniform) and the model is synchronous.
Proof sketch: every processor begins in the same state with the same outgoing messages (since the ring is anonymous). Every processor receives the same messages, makes the same state transition, and sends the same messages in round 1, and likewise in rounds 2, 3, ... Eventually some processor is supposed to enter an elected state; but then they all would. The proof sketch shows that either safety (never elect more than one leader) or liveness (eventually elect at least one leader) is violated. Since the theorem was proved for non-uniform and synchronous rings, the same result holds for weaker (less well-behaved) models: uniform and asynchronous.

Rings with Identifiers
Assume each processor has a unique id. Do not confuse indices and ids: indices are 0 to n - 1 and are used only for analysis, not available to the processors; ids are arbitrary non-negative integers and are available to the processors through the local variable id.
Specifying a ring: start with the smallest id and list the ids in clockwise order. Example: 3, 37, 19, 4, 25.
Uniform (non-anonymous) algorithms: a uniform algorithm has one state machine for every id, no matter what the ring size is; a non-uniform algorithm has one state machine for every id and every different ring size. These definitions are tailored for leader election in a ring.

O(n^2)-Message LE Algorithm: the LeLann-Chang-Roberts (LCR) algorithm
Each processor sends the value of its own id to the left. When a processor receives an id j (from the right): if j > id, it forwards j to the left (this processor has lost); if j = id, it elects itself (this processor has won); if j < id, it does nothing.
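A sequential simulation of the LCR algorithm might look like the following sketch. The ring is given as a list of ids in clockwise order, each processor sends to its clockwise neighbour, and the message count is tracked only to illustrate the complexity discussion that follows; termination announcements are not modeled.

```python
# Illustrative simulation of the LeLann-Chang-Roberts (LCR) algorithm
# on a unidirectional ring.
def lcr_leader(ids):
    """ids[i] is the identifier of processor p_i; p_i sends to p_{(i+1) % n}."""
    n = len(ids)
    # in_transit[i]: ids currently travelling on the channel into processor i
    in_transit = {(i + 1) % n: [ids[i]] for i in range(n)}
    messages = n                         # every processor sends its own id once
    leader = None
    while leader is None:
        new_transit = {i: [] for i in range(n)}
        for i in range(n):
            for j in in_transit[i]:
                if j > ids[i]:           # forward: this processor has lost
                    new_transit[(i + 1) % n].append(j)
                    messages += 1
                elif j == ids[i]:        # own id came back around: elected
                    leader = ids[i]
                # j < ids[i]: swallow the message, do nothing
        in_transit = new_transit
    return leader, messages

print(lcr_leader([3, 37, 19, 4, 25]))    # (37, <message count>)
```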
Analysis of the O(n^2) Algorithm
Correctness: the algorithm elects the processor with the largest id, since the message containing the largest id passes through every processor.
Time: O(n).
Message complexity: this depends on how the ids are arranged. The largest id travels all the way around the ring (n messages), the 2nd largest id travels until reaching the largest, the 3rd largest id travels until reaching the largest or the second largest, and so on. The worst way to arrange the ids is in decreasing order: the 2nd largest then causes n - 1 messages, the 3rd largest causes n - 2 messages, etc., so the total number of messages is n + (n-1) + (n-2) + ... + 1 = Θ(n^2).
Clearly, the algorithm never sends more than O(n^2) messages in any admissible execution. Moreover, there is an admissible execution in which the algorithm sends Ω(n^2) messages: consider the ring where the identifiers of the processors are 0, ..., n-1 and they are ordered as in Figure 3.2 (a clockwise unidirectional ring). In this configuration the message of the processor with identifier i is sent exactly i + 1 times, so the total number of messages, including the n termination messages, is n + Σ_{i=0}^{n-1} (i + 1) = n + n(n+1)/2 = Θ(n^2).

Can We Use Fewer Messages?
The O(n^2) algorithm is simple and works in both the synchronous and the asynchronous model. But can we solve the problem with fewer messages? Idea: try to have messages containing smaller ids travel smaller distances in the ring.

O(n log n)-Message LE Algorithm: the Hirschberg and Sinclair (HS) algorithm
To describe the algorithm, we first define the k-neighbourhood of a processor pi in the ring to be the set of processors that are at distance at most k from pi in the ring (either to the left or to the right). Note that the k-neighbourhood of a processor includes exactly 2k + 1 processors. The algorithm operates in phases; it is convenient to start numbering the phases with 0. In the kth phase a processor tries to become a winner for that phase; to be a winner, it must have the largest id in its 2^k-neighborhood. Only processors that are winners in the kth phase continue to compete in the (k+1)st phase. Thus fewer and fewer processors proceed to higher phases, until at the end only one processor is a winner, and it is elected as the leader of the whole ring.
Phase 0: in more detail, in phase 0 each processor attempts to become a phase 0 winner and sends a probe message containing its identifier to its 1-neighborhood, that is, to each of its two neighbors. If the identifier of the neighbor receiving the probe is greater than the identifier in the probe, it swallows the probe; otherwise it sends back a reply message. If a processor receives a reply from both its neighbors, it becomes a phase 0 winner and continues to phase 1.
Phase k: in general, in phase k a processor pi that is a phase k-1 winner sends probe messages with its identifier to its 2^k-neighborhood (one in each direction). Each such message traverses 2^k processors one by one. A probe is swallowed by a processor if it contains an identifier that is smaller than the processor's own identifier. If the probe arrives at the last processor in the neighbourhood without being swallowed, then that last processor sends back a reply message to pi. If pi receives replies from both directions, it becomes a phase k winner and continues to phase k + 1. A processor that receives its own probe message terminates the algorithm as the leader and sends a termination message around the ring.
The pseudocode appears in Algorithm 5.
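The following sketch captures only the phase/winner structure of HS, not its probe/reply message mechanics: in phase k a surviving candidate directly checks whether it has the largest id in its 2^k-neighbourhood, which is exactly the condition that the probes and replies establish. It is an illustrative sketch, not the pseudocode of Algorithm 5.

```python
# Illustrative sketch of the Hirschberg-Sinclair phase structure.
def hs_leader(ids):
    n = len(ids)
    candidates = set(range(n))       # initially everyone competes in phase 0
    k = 0
    while len(candidates) > 1:
        winners = set()
        for i in candidates:
            radius = 2 ** k          # 2^k processors on each side
            neighbourhood = [ids[(i + d) % n] for d in range(-radius, radius + 1)]
            if ids[i] == max(neighbourhood):   # would receive replies from both sides
                winners.add(i)
        candidates = winners         # only phase-k winners continue to phase k+1
        k += 1
    (leader_index,) = candidates
    return ids[leader_index]

print(hs_leader([3, 37, 19, 4, 25]))   # 37
```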
In the pseudocode of Algorithm 5, phase k for a processor corresponds to the period between its sending of a probe message with third parameter k (in line 4 or 15) and its sending of a probe message with third parameter k + 1. The details of sending the termination message around the ring are left out of the code, and only the leader terminates.
The correctness of the algorithm follows in the same manner as for the simple algorithm, because both have the same swallowing rules. The probes of the processor with the maximal identifier are never swallowed; therefore this processor terminates the algorithm as the leader. On the other hand, no other probe can traverse the whole ring without being swallowed. Therefore the processor with the maximal identifier is the only leader elected by the algorithm.

Summary of the O(n log n) leader election algorithm: each processor tries to probe successively larger neighborhoods in both directions, with the size of the neighborhood doubling in each phase. If a probe reaches a node with a larger id, the probe stops. If a probe reaches the end of its neighborhood, a reply is sent back to the initiator. If the initiator gets back replies from both directions, it goes on to the next phase. If a processor receives a probe with its own id, it elects itself.
[Figure: probes and replies travelling outwards from pi over successively doubling neighborhoods.]

Analysis of the O(n log n) Leader Election Algorithm
Correctness: similar to the O(n^2) algorithm.
Message complexity: each message belongs to a particular phase and is initiated by a particular processor. The probe distance in phase k is 2^k, so the number of messages initiated by a processor in phase k is at most 4 * 2^k (probes and replies in both directions).
How many processors initiate probes in phase k? For k = 0, every processor does. For k > 0, every processor that is a "winner" in phase k - 1 does, where "winner" means having the largest id in its 2^{k-1}-neighborhood. The maximum number of phase k - 1 winners occurs when they are packed as densely as possible, with 2^{k-1} processors between consecutive winners, so the total number of phase k - 1 winners is at most n / (2^{k-1} + 1).
How many phases are there? In each phase the number of winners is cut approximately in half, from n / (2^{k-1} + 1) to n / (2^k + 1), so after approximately log2(n) phases only one winner is left; more precisely, the maximum phase is log(n-1) + 1.
The total number of messages is the sum, over all phases, of the number of winners at that phase times the number of messages originated by each winner:
total ≤ 4n (phase 0 messages) + n (termination messages) + Σ_{k=1}^{log(n-1)+1} 4 * 2^k * n/(2^{k-1} + 1) < 8n(log n + 2) + 5n = O(n log n).

Can We Do Better?
The O(n log n) algorithm is more complicated than the O(n^2) algorithm but uses fewer messages in the worst case. It works in both the synchronous and the asynchronous case. Can we reduce the number of messages even more? Not in the asynchronous model.

Lower Bound for LE Algorithms
Can we do better than O(n log n)? Theorem: any leader election algorithm for asynchronous rings whose size is not known a priori has Ω(n log n) message complexity (this holds also for unidirectional rings). Both LCR and HS are comparison-based algorithms, i.e., they use the identifiers only for comparisons (<, =, >).
In synchronous networks, O(n) message complexity can be achieved if general arithmetic operations are permitted (non-comparison-based algorithms) and if time complexity is unbounded.

Overview of LE in Rings with Ids
There exist algorithms when nodes have unique ids; we have evaluated them according to their message complexity. Asynchronous ring: Θ(n log n) messages. Synchronous ring: Θ(n) messages under certain conditions, otherwise Θ(n log n) messages. All bounds are asymptotically tight.

Conclusion
This lecture provided an in-depth study of the leader election problem in message-passing systems for a ring topology. We presented different algorithms for the leader election problem, covering the cases of anonymous/non-anonymous rings, uniform/non-uniform rings, and synchronous/asynchronous rings. In the upcoming lecture, we will discuss causality and logical time.

Lecture: 04 Models of Distributed Computation, Causality & Logical Time
Rajiv Misra, Dept. of Computer Science & Engineering, Indian Institute of Technology Patna, [email protected]

Preface
Recap of previous lecture: we discussed the leader election problem in message-passing systems for a ring topology, and presented different algorithms for the cases of anonymous/non-anonymous rings, uniform/non-uniform rings, and synchronous/asynchronous rings.
Content of this lecture: we discuss the models of distributed computation, causality, and a general framework of logical clocks in distributed systems. Since there is no global physical time in a distributed system, we present three systems of logical time, namely scalar, vector, and matrix time, to capture causality between events of a distributed computation.

Models of Distributed Computation

Introduction
A distributed system consists of a set of processors connected by a communication network, which provides the facility of information exchange among processors. The processors do not share a common global memory and communicate solely by passing messages over the communication network. There is no physical global clock in the system to which processes have instantaneous access. The communication medium may deliver messages out of order; messages may be lost, garbled, or duplicated due to timeouts and retransmissions; processors may fail; and communication links may go down.

Distributed Program: Definition
A distributed program is composed of a set of n asynchronous processes p1, p2, ..., pn. Process execution and message transfer are asynchronous. Without loss of generality, we assume that each process runs on a different processor. Let Cij denote the channel from process pi to process pj, and let mij denote a message sent by pi to pj. The message transmission delay is finite and unpredictable.

1. Model of Distributed Executions
The execution of a process consists of a sequential execution of its actions. The actions are atomic, and the actions of a process are modeled as three types of events: internal events, message send events, and message receive events. Let eix denote the xth event at process pi. For a message m, let send(m) and rec(m) denote its send and receive events, respectively. The occurrence of events changes the states of the respective processes and channels, thus causing transitions in the global system state.
An internal event changes the state of the process at which it occurs. A send event (or a receive event) changes the state of the process that sends (or receives) the message and the state of the channel on which the message is sent (or received).
The events at a process are linearly ordered by their order of occurrence. The execution of process pi produces a sequence of events ei1, ei2, ..., eix, ... and is denoted by Hi, where Hi = (hi, →i); hi is the set of events produced by pi and the binary relation →i defines a linear order on these events. The relation →i expresses causal dependencies among the events of pi.
The send and receive events signify the flow of information between processes and establish a causal dependency from the sender process to the receiver process. A relation →msg that captures the causal dependency due to message exchange is defined as follows: for every message m that is exchanged between two processes, we have send(m) →msg rec(m). The relation →msg defines causal dependencies between the pairs of corresponding send and receive events.
The evolution of a distributed execution is depicted by a space-time diagram: a horizontal line represents the progress of a process, a dot indicates an event, and a slanted arrow indicates a message transfer. Since we assume that an event execution is atomic (hence indivisible and instantaneous), it is justified to denote it by a dot on a process line. In Figure 4.1, for process p1, the second event is a message send event, the third event is an internal event, and the fourth event is a message receive event.
[Figure 4.1: The space-time diagram of a distributed execution.]

Preliminaries: Partial Order Relation
Definition: a binary relation R on a set A is a partial order if and only if it is (i) reflexive, (ii) antisymmetric, and (iii) transitive. The ordered pair <A, R> is called a poset (partially ordered set) when R is a partial order. Example 1: the less-than-or-equal-to relation on the set of integers I is a partial order, and the set I with this relation is a poset.

Preliminaries: Total Order Relation
Definition: a binary relation R on a set A is a total order if and only if it is (i) a partial order, and (ii) for any pair of elements a and b of A, <a, b> is in R or <b, a> is in R; that is, every element is related to every other element one way or the other. A total order is also called a linear order. Example 2: the less-than-or-equal-to relation on the set of integers I is a total order.

Causal Precedence Relation
The execution of a distributed application results in a set of distributed events produced by the processes. Let H = ∪i hi denote the set of events executed in a distributed computation. A binary relation → is defined on the set H that expresses causal dependencies between events in the distributed execution. The causal precedence relation induces an irreflexive partial order on the events of a distributed computation, denoted H = (H, →).
Note that the relation → is nothing but Lamport's "happens before" relation. For any two events ei and ej, if ei → ej, then event ej is directly or transitively dependent on event ei. Graphically, this means that there exists a path consisting of message arrows and process-line segments (along increasing time) in the space-time diagram that starts at ei and ends at ej. For example, in Figure 4.1, e11 → e33 and e33 → e26. The relation → denotes the flow of information in a distributed computation, and ei → ej dictates that all the information available at ei is potentially accessible at ej. For example, in Figure 4.1, event e26 has knowledge of all the other events shown in the figure.
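As a small illustration, the happens-before relation of a toy execution can be computed as the transitive closure of the process-local orders →i together with the send-to-receive edges →msg. The event names and the example execution below are made up for the illustration; they are not the execution of Figure 4.1.

```python
# Illustrative sketch: happens-before as a transitive closure.
def happens_before(process_order, messages):
    edges = set()
    for events in process_order:                    # ->_i : local order at each process
        edges |= {(a, b) for a, b in zip(events, events[1:])}
    edges |= set(messages)                          # ->_msg : send -> receive
    nodes = {e for edge in edges for e in edge}
    closure = set(edges)
    for k in nodes:                                 # transitive closure (Floyd-Warshall style)
        for i in nodes:
            for j in nodes:
                if (i, k) in closure and (k, j) in closure:
                    closure.add((i, j))
    return closure

hb = happens_before(
    process_order=[["e11", "e12", "e13"], ["e21", "e22"]],
    messages=[("e12", "e22")])                      # p1's 2nd event is received as p2's 2nd
print(("e11", "e22") in hb)                         # True: e11 -> e22 transitively
print(("e13", "e22") in hb or ("e22", "e13") in hb) # False: e13 and e22 are concurrent
```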
Concurrent Events
For any two events ei and ej, if neither ei → ej nor ej → ei holds, then events ei and ej are said to be concurrent, denoted ei || ej. In the execution of Figure 4.1, e13 || e33 and e24 || e31. The relation || is not transitive; that is, (ei || ej) ∧ (ej || ek) does not imply ei || ek. For example, in Figure 4.1, e33 || e24 and e24 || e15; however, e33 and e15 are not concurrent. For any two events ei and ej in a distributed execution, either ei → ej, or ej → ei, or ei || ej.

Logical vs. Physical Concurrency
In a distributed computation, two events are logically concurrent if and only if they do not causally affect each other. Physical concurrency, on the other hand, has the connotation that the events occur at the same instant in physical time. Two or more events may be logically concurrent even though they do not occur at the same instant in physical time. For example, in Figure 4.1, the events in the set {e13, e24, e33} are logically concurrent, but they occurred at different instants in physical time. However, if processor speeds and message delays had been different, the execution of these events could very well have coincided in physical time. Whether a set of logically concurrent events coincides in physical time, and in what order in physical time they occur, does not change the outcome of the computation.

2. Models of Communication Networks
There are several models of the service provided by communication networks, namely FIFO, non-FIFO, and causal ordering. In the FIFO model, each channel acts as a first-in, first-out message queue, and thus message ordering is preserved by a channel. In the non-FIFO model, a channel acts like a set into which the sender process adds messages and from which the receiver process removes messages in a random order. The "causal ordering" model is based on Lamport's "happens before" relation. A system that supports the causal ordering model satisfies the following property: CO: for any two messages mij and mkj, if send(mij) → send(mkj), then rec(mij) → rec(mkj). This property ensures that causally related messages destined for the same destination are delivered in an order consistent with their causality relation. Causally ordered delivery of messages implies FIFO message delivery (note that CO ⊂ FIFO ⊂ non-FIFO). The causal ordering model considerably simplifies the design of distributed algorithms because it provides built-in synchronization.
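As an aside, the FIFO model can be realized on top of a non-FIFO channel by tagging messages with per-channel sequence numbers and buffering early arrivals at the receiver. The lecture only names the models, so the following is one possible illustration, not a prescribed mechanism.

```python
# Illustrative sketch: FIFO delivery over a non-FIFO channel via sequence numbers.
class FifoReceiver:
    def __init__(self):
        self.expected = 0       # next sequence number to deliver
        self.buffer = {}        # seq -> message, held back until its turn

    def on_arrival(self, seq, msg):
        """Returns the messages that can now be delivered, in send order."""
        self.buffer[seq] = msg
        delivered = []
        while self.expected in self.buffer:
            delivered.append(self.buffer.pop(self.expected))
            self.expected += 1
        return delivered

r = FifoReceiver()
print(r.on_arrival(1, "m2"))    # []            (m2 arrived first, held back)
print(r.on_arrival(0, "m1"))    # ['m1', 'm2']  (delivered in send order)
```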
Causality & Logical Time

Concept of Causality: Introduction
The concept of causality between events is fundamental to the design and analysis of parallel and distributed computing and operating systems. Usually causality is tracked using physical time. In distributed systems, it is not possible to have global physical time; only an approximation of it can be realized. Since asynchronous distributed computations make progress in spurts, logical time is sufficient to capture the fundamental monotonicity property associated with causality in distributed systems.
This lecture discusses three ways to implement logical time: scalar time, vector time, and matrix time. Causality among events in a distributed system is a powerful concept for reasoning, analyzing, and drawing inferences about a computation. Knowledge of the causal precedence relation among the events of processes helps solve a variety of problems in distributed systems, such as distributed algorithm design (mutual exclusion, replicated databases, deadlock detection), tracking of dependent events, knowledge about the progress of a computation, and concurrency measures.

Framework for a System of Logical Clocks
A system of logical clocks consists of a time domain T and a logical clock C. The elements of T form a partially ordered set over a relation <. The logical clock C maps an event e to an element of the time domain T, denoted C(e) and called its timestamp, such that for two events ei and ej, ei → ej ⇒ C(ei) < C(ej); this monotonicity property is the clock consistency condition.

Scalar Time
Scalar time was proposed by Lamport in 1978 as an attempt to order events in a distributed system. The time domain is the set of non-negative integers, and the logical local clock of a process pi is an integer Ci. Process pi uses the following two rules, R1 and R2, to update its clock.
R1: Before executing an event (send, receive, or internal), process pi updates its local clock as Ci := Ci + d (d > 0). In general, every time R1 is executed, d can have a different value; however, typically d is kept at 1.
R2: Each message piggybacks the clock value of its sender at sending time. When a process pi receives a message with timestamp Cmsg, it executes the following actions: Ci := max(Ci, Cmsg); execute R1; deliver the message.
Figure 4.2 shows the evolution of scalar time.
[Figure 4.2: Evolution of scalar time on the space-time diagram of a distributed execution, with d = 1.]
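Rules R1 and R2 with d = 1 translate directly into a small clock object; the following is an illustrative sketch, not code from the lecture.

```python
# Illustrative sketch of scalar (Lamport) clocks: rules R1 and R2 with d = 1.
class ScalarClock:
    def __init__(self):
        self.c = 0

    def internal(self):              # R1 before an internal event
        self.c += 1
        return self.c

    def send(self):                  # R1, then piggyback C on the outgoing message
        self.c += 1
        return self.c

    def receive(self, c_msg):        # R2: C := max(C, Cmsg), then R1, then deliver
        self.c = max(self.c, c_msg)
        self.c += 1
        return self.c

p1, p2 = ScalarClock(), ScalarClock()
t_send = p1.send()                   # p1's send event gets timestamp 1
t_recv = p2.receive(t_send)          # p2's receive event gets timestamp 2
print(t_send, t_recv)                # 1 2  (send -> receive implies 1 < 2)
```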
Basic Properties of Scalar Time
Consistency property: scalar clocks satisfy monotonicity and hence the consistency property: for two events ei and ej, ei → ej ⇒ C(ei) < C(ej).
Total ordering: scalar clocks can be used to totally order events in a distributed system. The main problem in totally ordering events is that two or more events at different processes may have identical timestamps; for example, in Figure 4.2, the third event of process P1 and the second event of process P2 have identical scalar timestamps. A tie-breaking mechanism is needed to order such events. A tie is broken as follows: process identifiers are linearly ordered, and a tie among events with identical scalar timestamps is broken on the basis of their process identifiers; the lower the process identifier in the ranking, the higher the priority. The timestamp of an event is then denoted by a tuple (t, i), where t is its time of occurrence and i is the identity of the process where it occurred. The total order relation ≺ on two events x and y with timestamps (h, i) and (k, j), respectively, is defined as: x ≺ y ⇔ (h < k) or (h = k and i < j).
Event counting: if the increment value d is always 1, scalar time has the following interesting property: if an event e has timestamp h, then h - 1 represents the minimum logical duration, counted in units of events, required before producing the event e; we call it the height of the event e. In other words, h - 1 events have been produced sequentially before the event e, regardless of the processes that produced those events. For example, in Figure 4.2, five events precede event b on the longest causal path ending at b.
No strong consistency: the system of scalar clocks is not strongly consistent; that is, for two events ei and ej, C(ei) < C(ej) does not imply ei → ej. For example, in Figure 4.2, the third event of process P1 has a smaller scalar timestamp than the third event of process P2, yet the former did not happen before the latter. The reason scalar clocks are not strongly consistent is that the logical local clock and the logical global clock of a process are squashed into one, resulting in the loss of causal dependency information among events at different processes. For example, in Figure 4.2, when process P2 receives the first message from process P1, it updates its clock to 3, forgetting that the timestamp of the latest event at P1 on which it depends is 2.

Vector Time
The system of vector clocks was developed independently by Fidge, Mattern, and Schmuck. In the system of vector clocks, the time domain is represented by a set of n-dimensional non-negative integer vectors. Each process pi maintains a vector vti[1..n], where vti[i] is the local logical clock of pi and describes the logical time progress at process pi, and vti[j] represents process pi's latest knowledge of process pj's local time. If vti[j] = x, then process pi knows that the local time at process pj has progressed up to x. The entire vector vti constitutes pi's view of the global logical time and is used to timestamp events.
Process pi uses the following two rules, R1 and R2, to update its clock.
R1: Before executing an event, process pi updates its local logical time as vti[i] := vti[i] + d (d > 0).
R2: Each message m is piggybacked with the vector clock vt of the sender process at sending time. On receipt of such a message (m, vt), process pi executes the following sequence of actions: update its global logical time as vti[k] := max(vti[k], vt[k]) for 1 ≤ k ≤ n; execute R1; deliver the message m.
The timestamp of an event is the value of the vector clock of its process when the event is executed. Figure 4.3 shows an example of vector clock progress with the increment value d = 1; initially, a vector clock is [0, 0, 0, ..., 0].
[Figure 4.3: Evolution of vector time.]

Comparing Vector Timestamps
The following relations are defined to compare two vector timestamps, vh and vk:
vh = vk ⇔ ∀x : vh[x] = vk[x]
vh ≤ vk ⇔ ∀x : vh[x] ≤ vk[x]
vh < vk ⇔ vh ≤ vk and ∃x : vh[x] < vk[x]
vh || vk ⇔ ¬(vh < vk) ∧ ¬(vk < vh)
If the process at which an event occurred is known, the test to compare two timestamps can be simplified as follows: if events x and y occurred at processes pi and pj respectively and are assigned timestamps vh and vk respectively, then
x → y ⇔ vh[i] ≤ vk[i]
x || y ⇔ vh[i] > vk[i] ∧ vh[j] < vk[j]
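Rules R1 and R2 for vector clocks, together with the comparison relations just defined, can be sketched as follows (illustrative, with d = 1; names are not from the lecture).

```python
# Illustrative sketch of vector clocks and vector timestamp comparison.
class VectorClock:
    def __init__(self, i, n):
        self.i, self.vt = i, [0] * n

    def tick(self):                          # R1: vt[i] := vt[i] + 1
        self.vt[self.i] += 1
        return list(self.vt)

    def send(self):                          # R1, then piggyback vt on the message
        return self.tick()

    def receive(self, vt_msg):               # R2: componentwise max, then R1
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_msg)]
        return self.tick()

def leq(vh, vk):                             # vh <= vk
    return all(a <= b for a, b in zip(vh, vk))

def causally_precedes(vh, vk):               # vh < vk, i.e. x -> y
    return leq(vh, vk) and vh != vk

def concurrent(vh, vk):                      # vh || vk
    return not causally_precedes(vh, vk) and not causally_precedes(vk, vh)

p1, p2 = VectorClock(0, 2), VectorClock(1, 2)
a = p1.send()                 # [1, 0]
b = p2.receive(a)             # [1, 1]
c = p1.tick()                 # [2, 0]
print(causally_precedes(a, b), concurrent(b, c))   # True True
```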
Properties of Vector Time
Isomorphism: if events in a distributed system are timestamped using a system of vector clocks, then for two events x and y with timestamps vh and vk respectively, x → y ⇔ vh < vk and x || y ⇔ vh || vk. Thus there is an isomorphism between the set of partially ordered events produced by a distributed computation and their vector timestamps.
Strong consistency: the system of vector clocks is strongly consistent; by examining the vector timestamps of two events, we can determine whether the events are causally related. However, Charron-Bost showed that the dimension of vector clocks cannot be less than n, the total number of processes in the distributed computation, for this property to hold.
Event counting: if d = 1 (in rule R1), then the ith component of the vector clock at process pi, vti[i], denotes the number of events that have occurred at pi up to that instant. So if an event e has timestamp vh, then vh[j] denotes the number of events executed by process pj that causally precede e, and Σj vh[j] − 1 represents the total number of events that causally precede e in the distributed computation.

Conclusion
In a distributed system, a set of processes communicates by exchanging messages over a communication network. A distributed computation is spread over geographically distributed processes, which do not share a common global memory or a physical global clock to which they have instantaneous access. In this lecture we presented the idea of logical time, proposed by Lamport in 1978 in an attempt to order events in distributed systems, and discussed two systems of logical clocks, namely scalar and vector clocks, to capture causality between events of a distributed computation. In the upcoming lecture, we will discuss the size of vector clocks, matrix clocks, virtual time, and physical clock synchronization.