Full Transcript

CS 425 / ECE 428 Distributed Systems Fall 2024 Indranil Gupta (Indy) w/ Aishwarya Ganesan Lecture 1: Welcome, and Introduction Aug 27, 2024 Web: courses.engr.illinois.edu/cs425/ 1 Course Staff Instructors: – (Indy Gupta): 3112 Siebel Center – (Aishwarya Ganesan): 3120 Siebel Center – Office Hours Every Tue and Thu right after class until 4 pm – see website TAs: (see course website for office hours) – Anna Karanika (Lead TA) TAs for On-campus Students: Aryan Bhardwaj, Pete Stenger, Taimoor Tariq, Ting Liang, Kai Wang – Maleeha Masood (lead TA for MCS Coursera, and Deputy Lead TA) TAs for MCS Online Students: Xinying Zheng, Zhikun Wang, Zikun Liu – All students (oncampus or MCS Coursera) can access all above TAs 2 More about course logistics later in this lecture… What This Course is About Olympics Movies Travel to Saturn Interviews Company Acquisitions (Not Kidding) 3 What This Course is Really About Distributed Systems How to Design Algorithms for them How to Design The Systems How they work in real life How to build real distributed systems 4 Many Students are Intimidated by Computer Science… … It’s a good sign that you will do well! (Ice skating story: Narration) Moral: One must unlearn first, before learning. 5 Our Main Goal Today To Define the Term Distributed System 6 Can you name some examples of Operating Systems? 7 Can you name some examples of Operating Systems? … Linux Windows Unix FreeBSD macOS OSX 2K Aegis Scout Hydra Mach SPIN OS/2 Express Flux Hope Spring AntaresOS EOS LOS SQOS LittleOS TINOS PalmOS WinCE TinyOS iOS … 8 What is an Operating System? 9 What is an Operating System? User interface to hardware (device driver) Provides abstractions (processes, file system) Resource manager (scheduler) Means of communication (networking) … 10 FOLDOC definition (FOLDOC = Free On-Line Dictionary of Computing) Operating System - The low-level software which handles the interface to peripheral hardware, schedules tasks, allocates storage, and presents a default interface to the user when no application program is running. 11 Can you name some examples of Distributed Systems? 12 Can you name some examples of Distributed Systems? Client-Server (NFS) The Web The internet A wireless network DNS Gnutella or BitTorrent (peer to peer overlays) A “cloud”, e.g., Amazon EC2/S3, Microsoft Azure A datacenter, e.g., NCSA, a Google datacenter, AWS 13 What is a Distributed System? 14 I asked Alexa… “What is a Distributed System?” And Alexa said… “I don’t know what a stupid system is.” 15 How does Leslie Lamport define Distributed System? (Leslie Lamport is considered by many to be the most prominent Distributed System Researcher) “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” 16 FOLDOC definition A collection of (probably heterogeneous) automata whose distribution is transparent to the user so that the system appears as one local machine. This is in contrast to a network, where the user is aware that there are several machines, and their location, storage replication, load balancing and functionality is not transparent. Distributed systems usually use some kind of client-server organization. 17 Textbook definitions A distributed system is a collection of independent computers that appear to the users of the system as a single computer. [Andrew Tanenbaum] A distributed system is several computers doing something together. 
Thus, a distributed system has three primary characteristics: multiple computers, interconnections, and shared state. [Michael Schroeder] 18 Unsatisfactory Why are these definitions short? Why do these definitions look inadequate to us? Because we are interested in the insides of a distributed system – design and implementation – Maintenance – Algorithmics (“protocols” or “distributed algorithms”) 19 “I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description; and perhaps I could never succeed in intelligibly doing so. But I know it when I see it, and the motion picture involved in this case is not that.” [Potter Stewart, Associate Justice, US Supreme Court (talking about his interpretation of a technical term laid down in the law, case Jacobellis versus Ohio 1964) ] 20 Which is a Distributed System – (A) or (B)? (A) (A) Facebook Social Network Graph among humans Source: https://www.facebook.com/note.php?note_id=469716398919 21 (B) (B) Peer to peer file-sharing system (Gnutella) 22 A working definition for us A distributed system is a collection of entities, each of which is autonomous, programmable, asynchronous and failure- prone, and which communicate through an unreliable communication medium. Entity=a process on a device (PC, PDA/mobile device) Communication Medium=Wired or wireless network Our interest in distributed systems involves – design and implementation, maintenance, algorithmics 23 Gnutella Peer to Peer System What are the “entities” (nodes)? Source: GnuMap Project What is the communication medium 24 (links)? Web Domains What are the “entities” (nodes)? What is the communication medium (links)? Source: http://www.vlib.us/web/worldwideweb3d.html 25 Datacenter What are the “entities” (nodes)? What is the communication medium (links)? 26 The internet – Quick Refresher Underlies many distributed systems. A vast interconnected collection of computer networks of many types. Intranets – subnetworks operated by companies and organizations. Intranets contain LANs (local area networks). WAN – wide area networks, consists of subnets (intranets, LANs, etc.) ISPs – Internet Service Providers. Companies that provide modem links and other types of connections to users. Intranets (actually the ISPs’ core routers) are linked by backbones – network links of large bandwidth, such as satellite connections, fiber optic cables, and other high-bandwidth circuits. 27 UC2B? Google Fiber? (MAN = Metropolitan Area Networks) An Intranet & a distributed system email server Desk to p co mp uters print and other servers Local area Running over this Intranet Web server network is a distributed file system email server print File server other servers the rest of the Internet router/firewall prevents unauthorized messages from leaving/entering; implemented by filtering incoming and outgoing messages 28 via firewall “rules” (configurable) Networking Stacks Application Underlying Application layer protocol transport protocol Distributed System Protocols! Networking Protocols e-mail smtp [RFC 821] TCP remote terminal access telnet [RFC 854] TCP Web http [RFC 2068] TCP TCP=Transmission Control Protocol UDP=User Datagram Protocol file transfer ftp [RFC 959] TCP streaming multimedia proprietary (Implemented via sockets) TCP or UDP (e.g. 
RealNetworks) remote file server NFS TCP or UDP internet telephony proprietary typically UDP 29 (e.g., Skype) The Heart of the World Wide Web: the HTTP Standard HTTP: hypertext transfer protocol WWW’s application layer protocol PC running client/server model Explorer – client: browser that requests, receives, and “displays” WWW objects – server: WWW server, which is storing the website, sends objects in response to Server requests Running Apache http1.0: RFC 1945 Web http1.1: RFC 2068 server – Leverages same connection to download Mac running images, scripts, etc. Safari 31 The HTTP Protocol: More http: TCP transport service: http is “stateless” client initiates a TCP connection server maintains no (creates socket) to server, port 80 information about past client requests server accepts the TCP Why? connection from client Protocols that maintain session “state” are http messages (application-layer complex! protocol messages) exchanged past history (state) must be maintained between browser (http client) and and updated. WWW server (http server) if server/client crashes, their views of “state” may be inconsistent, and hence TCP connection closed must be reconciled. RESTful protocols are stateless. 32 HTTP Example (contains text, Suppose user enters URL www.cs.illinois.edu/ references to 10 jpeg images) 1a. http client initiates a TCP connection to 1b. http server at host http server (process) at www.cs.illinois.edu waiting for a TCP www.cs.uiuc.edu. Port 80 is default for connection at port 80. “accepts” http server. connection, notifying client 2. http client sends a http request message (containing URL) into TCP connection socket 3. http server receives request messages, forms a response message containing requested object (index.html), sends message into socket time 33 HTTP Example (cont.) 4. http server closes the TCP connection 5. http client receives a response (if necessary). message containing html file, displays html, Parses html file, finds 10 referenced jpeg objects 6. Steps 1-5 are then repeated for each of 10 jpeg objects time For fetching referenced objects, have 2 options: non-persistent connection: only one object fetched per TCP connection – some browsers create multiple TCP connections simultaneously - one per object persistent connection: multiple objects transferred within one TCP connection 34 Your Shell as a Web browser 1. Telnet to your favorite WWW server: telnet www.google.com 80 Opens TCP connection to port 80 (default http server port) at www.google.com Anything typed in sent to port 80 at www.google.com 2. Type in a GET http request: GET /index.html By typing this in (may need to hit Or return twice), you send GET /index.html HTTP/1.0 this minimal (but complete) GET request to http server 3. Look at response message sent by http server! What do you think the response is? 35 Does our Working Definition work for the http Web? A distributed system is a collection of entities, each of which is autonomous, programmable, asynchronous and failure- prone, and that communicate through an unreliable communication medium. 
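(A quick aside on the shell-as-a-browser exercise a few slides up: the same minimal GET request can be sent from a program. Below is a small Java sketch, purely illustrative, that opens a TCP connection to port 80 of a host given on the command line, writes GET /index.html HTTP/1.0, and prints whatever the server sends back. It assumes the server still answers plain HTTP on port 80; many servers today will simply reply with a redirect to HTTPS, which is itself instructive to look at.)

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Minimal "shell as a web browser": open a TCP connection to port 80,
// send a bare HTTP/1.0 GET request, then print the server's response.
public class MiniHttpGet {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "www.google.com";
        try (Socket sock = new Socket(host, 80);   // client initiates TCP connection (creates socket)
             PrintWriter out = new PrintWriter(sock.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(sock.getInputStream()))) {
            // The request message; an HTTP/1.0 request ends with a blank line.
            out.print("GET /index.html HTTP/1.0\r\n");
            out.print("Host: " + host + "\r\n");
            out.print("\r\n");
            out.flush();
            // Print the response message: status line, headers, then the body.
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }   // TCP connection closed when the socket is closed
    }
}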
Entity=a process on a device (PC, PDA) Communication Medium=Wired or wireless network Our interest in distributed systems involves – design and implementation, maintenance, study, algorithmics 36 “Important” Distributed Systems Issues No global clock; no single global notion of the correct time (asynchrony) Unpredictable failures of components: lack of response may be due to either failure of a network component, network path being down, or a computer crash (failure-prone, unreliable) Highly variable bandwidth: from 16Kbps (slow modems or Google Balloon) to Gbps (Internet2) to Tbps (in between DCs of same big company) Possibly large and variable latency: few ms to several seconds Large numbers of hosts: 2 to several million 37 Many Interesting Design Problems Real distributed systems – Cloud Computing, Peer to peer systems, Hadoop, key-value stores/NoSQL, distributed file systems, sensor networks, measurements, graph processing, stream processing, … Classical Problems – Failure detection, Asynchrony, Snapshots, Multicast, Consensus, Mutual Exclusion, Election, … Concurrency – RPCs, Concurrency Control, Replication Control, Paxos, … Security – ACLs, Capabilities, … Others… 38 Typical Distributed Systems Design Goals Common Goals: – Heterogeneity – can the system handle a large variety of types of machines and devices? – Robustness – is the system resilient to host crashes and failures, and to the network dropping messages? – Availability – are data+services always there for clients? – Transparency – can the system hide its internal workings from the users? (warning: term means the opposite of what the name implies!) – Concurrency – can the server handle multiple clients simultaneously? – Efficiency – is the service fast enough? Does it utilize 100% of all resources? – Scalability – can it handle 100 million nodes without degrading service? (nodes=clients and/or servers) How about 6 B? More? – Security – can the system withstand hacker attacks? – Openness – is the system extensible? 39 “Important” Issues If you’re already complaining that the list of topics we’ve discussed so far has been perplexing… – You’re right! – It was meant to be (perplexing) The Goal for the Rest of the Course: see enough examples and learn enough concepts so these topics and issues will make sense – We will revisit many of these slides in the very last lecture of the 40 course! “Concepts”? Which of the following inventions do you think is the most important? 1. Car 2. Wheel 3. Bicycle “What lies beneath?” Concepts! 41 How will We Learn? All this information is contained in handout on course website: “Course Information and Schedule” Web: courses.engr.illinois.edu/cs425/ Textbook, Recommended but not Required – Colouris, Dollimore, Kindberg and Blair (5th edition) – CDK Textbook – If you use a different or older version, be sure to check problem numbers, etc. – Textbook is a great source of practice problems for the exams (midterm and final) Lectures 42 How will We Learn? (2) This information is in “Syllabus” handout on course website (read carefully!) Web: courses.engr.illinois.edu/cs425/ Homeworks (HWs) – Four in total, two before midterm (early-Oct) and two after – About 2-3 weeks per homework – Solutions need to be typed, figures can be hand-drawn Programming assignments (MPs, i.e., Machine Programming Assignments) – Only for 4 credit on-campus students (3 credit students welcome to do it, but we won’t be able to give you VMs or grade it or use it as extra credit) – You will be building a distributed (surprise!) 
processing system in stages – Four in total, two before midterm and two after – Demo dates set on website: please be in town on those days (no excuses for interviews, travel..) Exams/quizzes – Midterm in class (Oct 10th) + Final in person (date decided by campus) 43 What assistance is available to you? Lectures – Lecture slides will be placed online at course website “Tentative” version before lecture “Final” version after lecture Asking Questions (most preferable first) 1. Piazza: Your first stop for all questions (see if your question has already been asked) We will try to ensure that every question is answered within 24 hours of posting (during weekdays). Do not post solutions or code on Piazza – that’ll immediately earn you a zero on the HW/MP. 2. Office Hours (posted on course website): Every weekday has multiple office hours scheduled, including evenings. 3. Email staff mailing list (don’t email individual instructor/TA except for private issues) Course Prerequisite: Credit or concurrent enrollment in one of CS 340/341, or ECE 391, or equivalent OS/networking course (latter need 44 instructor permission) Course Staff Instructors: – (Indy Gupta): 3112 Siebel Center – (Aishwarya Ganesan): 3120 Siebel Center – Office Hours Every Tue and Thu right after class until 4 pm – see website TAs: (see course website for office hours) – Anna Karanika (Lead TA) TAs for On-campus Students: Aryan Bhardwaj, Pete Stenger, Taimoor Tariq, Ting Liang, Kai Wang – Maleeha Masood (lead TA for MCS Coursera, and Deputy Lead TA) TAs for MCS Online Students: Xinying Zheng, Zhikun Wang, Zikun Liu – All students (oncampus or MCS Coursera) can access all above TAs 45 More about course logistics later in this lecture… Individual vs. Group Work Homeworks – Individual work only. All work must be your own. MPs – In groups of 2 – Within group: Can discuss everything. – Outside group (e.g., on Piazza): can only discuss concepts and question (but not solution) For both HWs and MPs – You can discuss with others (and course staff) lecture concepts and the HW/MP question/spec itself, but you cannot discuss solutions or even ideas. – You cannot copy solutions/code or look at someone else’s solutions We will check (we use Moss, we also compare HWs) First violation: zero on HW/MP. Second violation: F in course. (Both have 46 happened in the past!) Speaking of HWs and MPs All MP demos are in-person (not online!), and the demo dates are posted to website. So please make sure you plan ahead! HW1 and MP1 have both been released! Don’t worry – you have time But start early! Start now. (4 cr only) You must let us know the composition of your group for MP by Monday 9/2 @ 5 pm. – If you don’t meet this deadline you won’t get VMs to do your MP! – Instructions on how to inform us are on the course website You can start on MP right away (don’t need lectures for it) Due in about 2 weeks – MP1 due 9/15 (demos on 9/16) – HW1 due 9/19 (Please read submission instructions carefully) 47 3 cr vs 4 cr vs MCS Coursera 48 Wrap-Up (Reading for today’s lecture: Relevant parts of Chapter 1) All students: – Go to course website https://courses.engr.illinois.edu/cs425/ – Todos: (All) Sign up for Piazza by today (4 cr) Form MP group by this week and respond to MP group survey. You can use Piazza “Search for Teammates”, or just hang around after class today. (All) Fill out Student Survey sheet by today (latest by this week Friday) – Not yet registered? No waitlist this year  Continue to do all the HWs, (MPs if 4 cr). 
Usually we get everyone in by end 49 of September (no guarantees though!). CS 425 / ECE 428 Distributed Systems Fall 2024 Indranil Gupta (Indy) W/ Aishwarya Ganesan Lecture 2-3: Introduction to Cloud Computing All slides © IG 1 What was the world’s first bug in a program? 2 World’s First Bug https://education.nationalgeographic.org/resource/w orlds-first-computer-bug/ 3 Quiz: Where is the World’s Largest Datacenter? 4 Quiz: Where is the World’s Largest Datacenter? (2018-2024) China Telecom. 10.7 Million sq. ft. (2017) “The Citadel” Nevada. 7.2 Million sq. ft. (2015) In Chicago! 350 East Cermak, Chicago, 1.1 MILLION sq. ft. Shared by many different “carriers” Critical to Chicago Mercantile Exchange See: – http://ict-price.com/top-10-biggest-data-centres-from-around-the-world/ – https://www.gigabitmagazine.com/top10/top-10-biggest-data-centres-world 5 – https://www.racksolutions.com/news/data-center-news/top-10-largest-data-centers-world/ 6 (There was) The Hype! Forrester in 2010 – Cloud computing will go from $40.7 billion in 2010 to $241 billion in 2020. Today: Cloud Market is $602B (expected to reach $2.3T by 2030) Companies and even Federal/state governments using cloud computing now: fbo.gov 7 Many Cloud Providers AWS: Amazon Web Services – EC2: Elastic Compute Cloud – S3: Simple Storage Service – EBS: Elastic Block Storage Microsoft Azure Google Cloud/Compute Engine/AppEngine Rightscale, Salesforce, EMC, Gigaspaces, 10gen, Datastax, Oracle, VMWare, Yahoo, Cloudera And many many more! 8 Two Categories of Clouds Can be either a (i) public cloud, or (ii) private cloud Private clouds are accessible only to company employees Public clouds provide service to any paying customer: – Amazon S3 (Simple Storage Service): store arbitrary datasets, pay per GB-month stored Recently: ~fractions of cent to a few cents, per GB month – Amazon EC2 (Elastic Compute Cloud): upload and run arbitrary OS images, pay per CPU hour used Recently: few cents per CPU hr (depending on strength), only CPUs not GPUs – Google cloud: similar pricing ranges as above – Google AppEngine/Compute Engine: develop applications within their appengine framework, upload data that will be imported into their format, and run 9 Customers Save Time and $$$ (Stories from mid-2010s) Dave Power, Associate Information Consultant at Eli Lilly and Company: “With AWS, Powers said, a new server can be up and running in three minutes (it used to take Eli Lilly seven and a half weeks to deploy a server internally) and a 64-node Linux cluster can be online in five minutes (compared with three months internally). … It's just shy of instantaneous.” Ingo Elfering, Vice President of Information Technology Strategy, GlaxoSmithKline: “With Online Services, we are able to reduce our IT operational costs by roughly 30% of what we’re spending” Jim Swartz, CIO, Sybase: “At Sybase, a private cloud of virtual servers inside its datacenter has saved nearly $US2 million annually since 2006, Swartz says, because the company can share computing power and storage resources across servers.” 100s of startups in Silicon Valley can harness large computing resources without buying their own machines. 10 But what exactly IS a cloud? 11 What is a Cloud? It’s a cluster! It’s a supercomputer! It’s a datastore! It’s superman! None of the above All of the above Cloud = Lots of storage + compute cycles nearby 12 What is a Cloud? 
A single-site cloud (aka “Datacenter”) consists of – Compute nodes (grouped into racks) (2) – Switches, connecting the racks – A network topology, e.g., hierarchical – Storage (backend) nodes connected to the network (3) – Front-end for submitting jobs and receiving client requests (1) – (1,2,3: Often called “three-tier architecture”) – Software Services A geographically distributed cloud consists of – Multiple such sites – Each site perhaps with a different structure and services 13 A Sample Cloud Topology So then, what is a cluster? 14 “A Cloudy History of Time” The first datacenters! Timesharing Companies Clouds and datacenters & Data Processing Industry 1940 1950 Clusters 1960 Grids 1970 1980 PCs 1990 (not distributed!) 2000 Peer to peer systems 2012 15 “A Cloudy History of Time” First large datacenters: ENIAC, ORDVAC, ILLIAC Many used vacuum tubes and mechanical relays Berkeley NOW Project Supercomputers 1940 Server Farms (e.g., Oceano) 1950 1960 P2P Systems (90s-00s) Many Millions of users 1970 Many GB per day 1980 Data Processing Industry - 1968: $70 M. 1978: $3.15 Billion 1990 Timesharing Industry (1975): 2000 Market Share: Honeywell 34%, IBM 15%, Xerox 10%, CDC 10%, DEC 10%, UNIVAC 10% Grids (1980s-2000s): 2012 Clouds GriPhyN (1970s-80s) Honeywell 6000 & 635, IBM 370/168, Open Science Grid and Lambda Rail (2000s) Xerox 940 & Sigma 9, DEC PDP-10, UNIVAC 1108 (jump16 to calc) Globus & other standards (1990s-2000s) Trends: Technology Doubling Periods – storage: 12 mos, bandwidth: 9 mos, and (what law is this?) cpu compute capacity: 18 mos Then and Now – Bandwidth 1985: mostly 56Kbps links nationwide Today: Tbps links widespread – Disk capacity Today’s PCs have TBs, far more than a 1990 supercomputer 17 Trends: Users Then and Now Biologists: – 1990: were running small single-molecule simulations – Today: CERN’s Large Hadron Collider producing many PB/year 18 Prophecies In 1965, MIT's Fernando Corbató and the other designers of the Multics operating system envisioned a computer facility operating “like a power company or water company”. Plug your thin client into the computing Utility and Play your favorite Intensive Compute & Communicate Application – Have today’s clouds brought us closer to this reality? Think about it. 19 Four Features New in Today’s Clouds I. Massive scale. II. On-demand access: Pay-as-you-go, no upfront commitment. – And anyone can access it III. Data-intensive Nature: What was MBs has now become TBs, PBs and XBs. – Daily logs, forensics, Web data, etc. – Humans have data numbness: Wikipedia (large) compressed is only about 10 GB! IV. New Cloud Programming Paradigms: MapReduce/Hadoop, NoSQL/Cassandra/MongoDB and many others. – High in accessibility and ease of programmability – Lots of open-source Combination of one or more of these gives rise to novel and unsolved distributed computing problems in cloud computing. 20 I. Massive Scale Facebook [GigaOm, 2012] – 30K in 2009 -> 60K in 2010 -> 180K in 2012 Microsoft [NYTimes, 2008] – 150K machines – Growth rate of 10K per month – 80K total running Bing – In 2013, Microsoft Cosmos had 110K machines (4 sites) Yahoo! : – 100K – Split into clusters of 4000 AWS EC2 [Randy Bias, 2009] – 40K machines – 8 cores/machine eBay : 50K machines HP : 380K in 180 DCs 21 Google [2011, Data Center Knowledge] : 900K (reputed in 2024: $2.5M) What does a datacenter look like from inside? 
A virtual walk through a datacenter (Facebook Prineville Datacenter, Mid 2010s) References: – https://www.youtube.com/watch?v=4A_A-CmrqpQ – http://gigaom.com/cleantech/a-rare-look-inside-facebooks- oregon-data-center-photos-video/ 22 Servers Front Back In Some highly secure (e.g., financial info) 23 Cooling Air sucked in from top (also, Bugzappers) Water purified Water sprayed into air 15 motors per server bank 24 Power Off-site On-site WUE = Annual Water Usage / IT Equipment Energy (L/kWh) – low is good PUE = Total facility Power / IT Equipment Power – low is good (e.g., Google~1.1) 25 Extra - Fun Videos to Watch Microsoft GFS Datacenter Tour (Youtube) – http://www.youtube.com/watch?v=hOxA1l1pQIw Timelapse of a Datacenter Construction on the Inside (Fortune 500 company) – http://www.youtube.com/watch?v=ujO-xNvXj3g 26 II. On-demand access: *aaS Classification On-demand: renting a cab vs. (previously) renting a car, or buying one. E.g.: – AWS Elastic Compute Cloud (EC2): a few cents to a few $ per CPU hour – AWS Simple Storage Service (S3): a few cents per GB-month HaaS: Hardware as a Service – You get access to barebones hardware machines, do whatever you want with them, Ex: Your own cluster – Not always a good idea because of security risks IaaS: Infrastructure as a Service – You get access to flexible computing and storage infrastructure. Virtualization or containerization is one way of achieving this (cgroups, Kubernetes, Dockers, VMs,…). Often said to subsume HaaS. – Ex: Amazon Web Services (AWS: EC2 and S3), OpenStack, Eucalyptus, Rightscale, Microsoft Azure, Google Cloud. 27 II. On-demand access: *aaS Classification PaaS: Platform as a Service – You get access to flexible computing and storage infrastructure, coupled with a software platform (often tightly coupled) – Ex: Google’s AppEngine (Python, Java, Go) SaaS: Software as a Service – You get access to software services, when you need them. Often said to subsume SOA (Service Oriented Architectures). – Ex: Google docs, MS Office 365 Online And new recently: FaaS = Function as a Service – e.g., AWS Lambda, Azure Functions, etc. 28 III. Data-intensive Computing Computation-Intensive Computing – Example areas: MPI-based, High-performance computing, Grids – Typically run on supercomputers (e.g., NCSA Blue Waters) Data-Intensive – Typically store data at datacenters – Use compute nodes nearby – Compute nodes run computation services In data-intensive computing, the focus shifts from computation to the data: CPU utilization no longer the most important resource metric, instead I/O is (disk and/or network) 29 IV. New Cloud Programming Paradigms Easy to write and run highly parallel programs in new cloud programming paradigms: – Google: MapReduce and Sawzall – Amazon: Elastic MapReduce service (pay-as-you-go) – Google (MapReduce) Indexing: a chain of 24 MapReduce jobs ~200K jobs processing 50PB/month (in 2006) – Yahoo! (Hadoop + Pig) WebMap: a chain of several MapReduce jobs 300 TB of data, 10K cores, many tens of hours (~2008) – Facebook (Hadoop + Hive) ~300TB total, adding 2TB/day (in 2008) 3K jobs processing 55TB/day – Similar numbers from other companies, e.g., Yieldex, eharmony.com, etc. – NoSQL: MySQL is an industry standard, but Cassandra is 2400 times faster! 
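The core idea behind these paradigms, map a function over every record independently and then reduce (combine) the intermediate results, can be tried out on a single machine. The sketch below uses plain Java streams as a local stand-in; it illustrates only the programming idea, not how Hadoop shards data, shuffles intermediate pairs, or tolerates failures, and the sample lines and counts are toy values.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Single-machine illustration of the map-then-reduce idea
// (not Hadoop: no sharding, shuffling, or fault tolerance here).
public class MapReduceIdea {
    public static void main(String[] args) {
        // map: square each record independently; reduce: sum the batch.
        int sumOfSquares = List.of(1, 2, 3, 4).stream()
                .map(x -> x * x)          // (1 4 9 16)
                .reduce(0, Integer::sum); // 30
        System.out.println("sum of squares = " + sumOfSquares);

        // Word count: split each line into words ("map" phase),
        // then group by word and count ("reduce" phase).
        List<String> lines = List.of("Welcome Everyone", "Hello Everyone");
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        System.out.println(counts); // Welcome=1, Hello=1, Everyone=2 (map order unspecified)
    }
}

On a cluster the user writes essentially these same two functions, and the framework supplies the parallelism, the shuffle, and the storage.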
30 Two Categories of Clouds Can be either a (i) public cloud, or (ii) private cloud Private clouds are accessible only to company employees Public clouds provide service to any paying customer You’re starting a new service/company: should you use a public cloud or purchase your own private cloud? 31 Single site Cloud: to Outsource or Own? Medium-sized organization: wishes to run a service for M months – Service requires 128 servers (1024 cores) and 524 TB – Same as UIUC CCT (Cloud Computing Testbed) cloud site (bought in 2009, now decommissioned) Outsource (e.g., via AWS): monthly cost – S3 costs: $0.12 per GB month. EC2 costs: $0.10 per CPU hour (costs from 2009) – Storage = $ 0.12 X 524 X 1000 ~ $62 K – Total = Storage + CPUs = $62 K + $0.10 X 1024 X 24 X 30 ~ $136 K Own: monthly cost – Storage ~ $349 K / M – Total ~ $ 1555 K / M + 7.5 K (includes 1 sysadmin / 100 nodes) using 0.45:0.4:0.15 split for hardware:power:network and 3 year lifetime of hardware 32 Single site Cloud: to Outsource or Own? Breakeven analysis: more preferable to own if: - $349 K / M < $62 K (storage) - $ 1555 K / M + 7.5 K < $136 K (overall) Breakeven points - M > 5.55 months (storage) - M > 12 months (overall) - As a result - Startups use clouds a lot - Cloud providers benefit monetarily most from storage 33 Academic Clouds: Emulab/CloudLab A community resource open to researchers in academia and industry. Very widely used by researchers everywhere today. https://www.emulab.net/ A cluster, with currently ~500 servers Founded and owned by University of Utah (led by Late Prof. Jay Lepreau) As a user, you can: – Grab a set of machines for your experiment – You get root-level (sudo) access to these machines – You can specify a network topology for your cluster – You can emulate any topology All images © Emulab 34 Public Research Clouds Accessible to researchers with a qualifying grant Chameleon Cloud: https://www.chameleoncloud.org/ HaaS OpenStack (~AWS) CloudLab: https://www.cloudlab.us/ Build your own cloud on their hardware 36 Summary Clouds build on many previous generations of distributed systems Especially the timesharing and data processing industry of the 1960-70s. Need to identify unique aspects of a problem to classify it as a new cloud computing problem – Scale, On-demand access, data-intensive, new programming Otherwise, the solutions to your problem may already exist! Next: Mapreduce! 37 Announcements 4 cr only: MP Groups Form DUE this week Mon Sep 2 nd @ 5 pm (see course webpage). – Hard deadline, as Engr-IT will create and assign VMs tomorrow! DO NOT – Change MP groups unless your partner has dropped – Leave your MP partner hanging: Both MP partners should contribute equally (we will ask!) All: Please fill out Student Survey by Fri 8/30 (see course webpage). MP1 due Sep 15th – VMs being distributed soon (watch Piazza) – Demos will be Monday Sep 16th (schedule and details will be posted soon on Piazza) HW1 due Sep 19th: Solve problems right after lecture covers topic! Check Piazza often! It’s where all the announcements are at! 38 CS 425 / ECE 428 Distributed Systems Fall 2024 Indranil Gupta (Indy) W/ Aishwarya Ganesan Lecture 4: Mapreduce and Hadoop All slides © IG 1 What is MapReduce? 
Terms are borrowed from Functional Language (e.g., Lisp) Sum of squares: (map square ‘(1 2 3 4)) – Output: (1 4 9 16) [processes each record sequentially and independently] (reduce + ‘(1 4 9 16)) – (+ 16 (+ 9 (+ 4 1) ) ) – Output: 30 [processes set of all records in batches] Let’s consider a sample application: Wordcount – You are given a huge dataset (e.g., Wikipedia dump or all of Shakespeare’s works) and asked to list the count for each 3 of the words in each of the documents therein Map Process individual records to generate intermediate key/value pairs. Key Value Welcome 1 Welcome Everyone Everyone 1 Hello Everyone Hello 1 Everyone 1 Input 4 Map Parallelly Process individual records to generate intermediate key/value pairs. MAP TASK 1 Welcome 1 Welcome Everyone Everyone 1 Hello Everyone Hello 1 Input Everyone 1 MAP TASK 2 5 Map Parallelly Process a large number of individual records to generate intermediate key/value pairs. Welcome 1 Welcome Everyone Everyone 1 Hello Everyone Hello 1 Why are you here I am also here Everyone 1 They are also here Why 1 Yes, it’s THEM! Are 1 The same people we were thinking of You 1 ……. Here 1 Input ……. MAP TASKS 6 Reduce Reduce processes and merges all intermediate values associated per key Key Value Welcome 1 Everyone 2 Everyone 1 Hello 1 Hello 1 Welcome 1 Everyone 1 7 Reduce Each key assigned to one Reduce Parallelly Processes and merges all intermediate values by partitioning keys Welcome 1 Everyone 2 REDUCE Everyone 1 TASK 1 Hello 1 Hello 1 Welcome 1 REDUCE Everyone 1 TASK 2 Popular: Hash partitioning, i.e., key is assigned to – reduce # = hash(key)%number of reduce tasks 8 Hadoop Code - Map public static class MapClass extends MapReduceBase implements Mapper { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map( LongWritable key, Text value, OutputCollector output, Reporter reporter) // key is empty, value is the line throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } 9 } // Source: http://developer.yahoo.com/hadoop/tutorial/module4.html#wordcount Hadoop Code - Reduce public static class ReduceClass extends MapReduceBase implements Reducer { public void reduce( Text key, Iterator values, OutputCollector output, Reporter reporter) throws IOException { // key is word, values is a list of 1’s // called exactly once for each key (e.g., “hello”) int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } 10 Hadoop Code - Driver // Tells Hadoop how to run your Map-Reduce job public void run (String inputPath, String outputPath) throws Exception { // The job. WordCount contains MapClass and Reduce. 
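// The driver below wires the two classes above into a job: it creates a
// JobConf for WordCount, names the job, declares the output key type
// (Text = the word) and value type (IntWritable = the count), registers
// MapClass and ReduceClass as the mapper and reducer, sets the input and
// output paths, and finally submits everything with JobClient.runJob(conf).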
JobConf conf = new JobConf(WordCount.class); conf.setJobName(”mywordcount"); // The keys are words (strings) conf.setOutputKeyClass(Text.class); // The values are counts (ints) conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(MapClass.class); conf.setReducerClass(ReduceClass.class); FileInputFormat.addInputPath( conf, newPath(inputPath)); FileOutputFormat.setOutputPath( conf, new Path(outputPath)); JobClient.runJob(conf); } // Source: http://developer.yahoo.com/hadoop/tutorial/module4.html#wordcount 11 Some Applications of MapReduce Distributed Grep: – Input: large set of files – Output: lines that match pattern – Map – Emits a line if it matches the supplied pattern – Reduce – Copies the intermediate data to output 12 Some Applications of MapReduce (2) Reverse Web-Link Graph – Input: Web graph: tuples (a, b) where (page a  page b) – Output: For each page, list of pages that link to it – Map – process web log and for each input , it outputs – Reduce - emits 13 Some Applications of MapReduce Count of URL access frequency (3) – Input: Log of accessed URLs, e.g., from proxy server – Output: For each URL, % of total accesses for that URL – Map – Process web log and outputs – Multiple Reducers - Emits (So far, like Wordcount. But still need %) – Chain another MapReduce job after above one – Map – Processes and outputs – 1 Reducer – Does two passes. In first pass, sums up all URL_count’s to calculate overall_count. In second pass calculates %’s Emits multiple 14 Some Applications of MapReduce (4) Map task’s output is sorted (e.g., quicksort) Reduce task’s input is sorted (e.g., mergesort) Sort – Input: Series of (key, value) pairs – Output: Sorted s – Map –  (identity) – Reducer –  (identity) – Partitioning function – partition keys across reducers based on ranges (can’t use hashing!) Take data distribution into account to balance reducer tasks 15 Programming MapReduce Externally: For user 1. Write a Map program (short), write a Reduce program (short) 2. Specify number of Maps and Reduces (parallelism level) 3. Submit job; wait for result 4. Need to know very little about parallel/distributed programming! Internally: For the Paradigm and Scheduler 1. Parallelize Map 2. Transfer data from Map to Reduce (shuffle data) 3. Parallelize Reduce 4. Implement Storage for Map input, Map output, Reduce input, and Reduce output (Ensure that no Reduce starts before all Maps are finished. That is, ensure the barrier between the Map phase and Reduce phase) 16 For the cloud: Inside MapReduce 1. Parallelize Map: easy! each map task is independent of the other! All Map output records with same key assigned to same Reduce 2. Transfer data from Map to Reduce: Called Shuffle data All Map output records with same key assigned to same Reduce task use partitioning function, e.g., hash(key)%number of reducers 3. Parallelize Reduce: easy! each reduce task is independent of the other! 4. Implement Storage for Map input, Map output, Reduce input, and Reduce output Map input: from distributed file system Map output: to local disk (at Map node); uses local file system Reduce input: from (multiple) remote disks; uses local file systems Reduce output: to distributed file system local file system = Linux FS, etc. 
distributed file system = GFS (Google File System), HDFS (Hadoop 17 Distributed File System) Map tasks Reduce tasks Output files into DFS 1 A A 2 I 3 4 B B II 5 6 III 7 C C Blocks Servers Servers from DFS (Local write, remote read) Resource Manager (assigns maps and reduces to servers) 18 The YARN Scheduler Used underneath Hadoop 2.x + YARN = Yet Another Resource Negotiator Treats each server as a collection of containers – Container = fixed CPU + fixed memory (think of Linux cgroups, but even more lightweight) Has 3 main components – Global Resource Manager (RM) Scheduling – Per-server Node Manager (NM) Daemon and server-specific functions – Per-application (job) Application Master (AM) Container negotiation with RM and NMs Detecting task failures of that job 19 YARN: How a job gets a container Resource Manager Capacity Scheduler In this figure 2 servers (A, B) 2 jobs (1, 2) 1. Need 2. Container Completed container 3. Container on Node B Node A Node Manager A Node B Node Manager B Application Application Task (App2) Master 1 4. Start task, please! Master 2 20 Server Failure Fault Tolerance – NM heartbeats to RM If server fails: RM times out waiting for next heartbeat, RM lets all affected AMs know, and AMs take appropriate action – NM keeps track of each task running at its server If task fails while in-progress, mark the task as idle and restart it – AM heartbeats to RM On failure, RM restarts AM, which then syncs it up with its running tasks RM Failure – Use old checkpoints and bring up secondary RM Heartbeats also used to piggyback container requests 21 – Avoids extra messages Slow Servers Slow tasks are called Stragglers The slowest task slows the entire job down (why?) Barrier at the end Due to Bad Disk, Network Bandwidth, CPU, or Memory of Map phase! Keep track of “progress” of each task (% done) Perform proactive backup (replicated) execution of some straggler tasks – A task considered done when its first replica complete (other replicas can then be killed) – Approach called Speculative Execution. 22 Locality Locality – Since cloud has hierarchical topology (e.g., racks) – For server-fault-tolerance, GFS/HDFS stores 3 replicas of each of the chunks (e.g., 64 MB in size) For rack-fault-tolerance, on different racks, e.g., 2 on a rack, 1 on a different rack – Mapreduce attempts to schedule a map task on 1. a machine that contains a replica of corresponding input data, or failing that, 2. on the same rack as a machine containing the input, or failing that, 3. Anywhere – Note: The 2-1 split of replicas is intended to reduce bandwidth when writing file. Using more racks does not affect overall Mapreduce scheduling performance 23 That was Hadoop 2.x… Hadoop 3.x (new!) over Hadoop 2.x – Dockers instead of container – Erasure coding instead of 3-way replication – Multiple Namenodes instead of one (name resolution) – GPU support (for machine learning) – Intra-node disk balancing (for repurposed disks) – Intra-queue preemption in addition to inter-queue – (From https://hortonworks.com/blog/hadoop-3-adds-value-hadoop-2/ (broken) and https://hadoop.apache.org/docs/r3.0.0/ ) 24 Mapreduce: Summary Mapreduce uses parallelization + aggregation to schedule applications across clusters Need to deal with failure Plenty of ongoing research work in scheduling and fault-tolerance for Mapreduce and Hadoop 25 Further MapReduce Exercises 26 Exercise 1 1. (MapReduce) You are given a symmetric social network (like Facebook) where a is a friend of b implies that b is also a friend of a. 
The input is a dataset D (sharded) containing such pairs (a, b) – note that either a or b may be a lexicographically lower name. Pairs appear exactly once and are not repeated. Find the last names of those users whose first name is “Kanye” and who have at least 300 friends. You can chain Mapreduces if you want (but only if you must, and even then, only the least number). You don’t need to write code – pseudocode is fine as long as it is understandable. Your pseudocode may assume the presence of appropriate primitives (e.g., “firstname(user_id)”, etc.). The Map function takes as input a tuple (key=a,value=b). 27 28 29 Exercise 1: Solution Goal: Last names of those users whose first name M1 (a,b): is “Kanye” and who have at least 300 friends. – if (firstname(a)==Kanye) then output (a,b) – if (firstname(b)==Kanye) then output (b,a) // note that second if is NOT an else if, so a single M1 function may be output up to 2 KV pairs! R1 (x, V): – if |V| >= 300 then output (lastname(x), -) 30 Exercise 2 2. For an asymmetrical social network, you are given a dataset D where lines consist of (a,b) which means user a follows user b. Write a MapReduce program (Map and Reduce separately) that outputs the list of all users U who satisfy the following three conditions simultaneously: i) user U has at least 2 million followers, and ii) U follows fewer than 20 other users, and iii) all the users that U follows, also follow U back. 31 32 33 Exercise 2: Solution Goal: Find users U M1(a,b): i) U has >= 2 million followers ii) U follows < 20 other users, – Output (key=a, value=(OUT,b)) iii) all U’s followers follow U back – Output (key=b, value=(IN,a)) // Note that a single M1 function outputs TWO KV pairs R1(key=x, V): – Collect Sout = set of all (OUT,*) value items from V – Collect Sin = set of all (IN,*) value items from V – if (|Sout| < 20 AND |Sin| >= 2M AND all items in Sout are also present in Sin) // third term via nested for loops – then output (x,_) 34 Exercise 3 3. For an asymmetrical social network, you are given a dataset D where lines consist of (a,b) which means user a follows user b. Write a MapReduce program (Map and Reduce separately) that outputs the list of all user pairs (x,y) who satisfy the following three conditions simultaneously: i) x has fewer than 100 M followers, ii) y has fewer than 100M followers, iii) x and y follow each other, and iv) the sum of x’s followers and y’s followers (double-counting common followers that follow both x and y is ok) is 100 M or more. Your output should not contain duplicates (i.e., no (x,y) and (y,x)). 35 36 37 Exercise 3: Solution Goal: find pairs (x,y): M1(a,b): output (b,a) i) x has < 100 M followers, ii) y has < 100M followers, R1(x,V): iii) x and y follow each other, iv) sum of x’s & y’s followers >= 100 M – if |V| < 100M, then for all a in V, output (double count ok). (lexicographic_sorted_pair(x,a), |V|) M2(a,b): Identity R2(key=(a,b), value={|V1|, |V2|,…}) – if |value|==1 output nothing – else if |value|==2 then add up the counts in value if sum of these counts >= 100M then output (a,b) 38 Announcements Please fill out Student Survey (see course webpage). DO NOT – Change MP groups unless your partner has dropped – Leave your MP partner hanging: Both MP partners should contribute equally (we will ask!) MP1 due Sep 15th – VMs distributed soon (watch Piazza) – Demos will be Monday Sep 16th (schedule and details will be posted next week on Piazza) HW1 due Sep 19th: Solve problems right after lecture covers topic! Check Piazza often! 
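Looking back at Exercise 1, the M1/R1 pseudocode can also be run end to end on a single machine to see the shuffle step concretely. The sketch below is an in-memory toy with several assumptions not in the original: user ids are First_Last strings so that firstname()/lastname() are simple splits, the dataset is three made-up pairs, and the 300-friend threshold is lowered to 2 so the toy data produces output.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-machine rendering of the Exercise 1 M1/R1 pseudocode above.
public class KanyeFriends {
    static String firstname(String user) { return user.split("_")[0]; }
    static String lastname(String user)  { return user.split("_")[1]; }

    // M1(a,b): emit (a,b) if a's first name is Kanye; emit (b,a) if b's is.
    static void map(String a, String b, Map<String, List<String>> shuffle) {
        if (firstname(a).equals("Kanye")) shuffle.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
        if (firstname(b).equals("Kanye")) shuffle.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
    }

    // R1(x, V): if x has at least `threshold` friends, output lastname(x).
    static void reduce(String x, List<String> friends, int threshold) {
        if (friends.size() >= threshold) System.out.println(lastname(x));
    }

    public static void main(String[] args) {
        String[][] dataset = {            // pairs (a, b): a is a friend of b
            {"Kanye_West", "Ann_Lee"},
            {"Bob_Roy", "Kanye_West"},
            {"Kanye_Ray", "Ann_Lee"}
        };
        Map<String, List<String>> shuffle = new HashMap<>();   // key -> list of values
        for (String[] pair : dataset) map(pair[0], pair[1], shuffle);
        shuffle.forEach((key, values) -> reduce(key, values, 2)); // 300 in the real exercise
    }
}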
It’s where all the announcements are at! (deadline passed) MP Groups Form DUE this week Mon Sep 2nd @ 5 pm (see course webpage). 39 – Hard deadline, as Engr-IT will create and assign VMs tomorrow! CS 425 / ECE 428 Distributed Systems Fall 2024 Indranil Gupta (Indy) w/ Aishwarya Ganesan Lecture 5: Gossiping All slides © IG Today’s Agenda Epidemics, or how to use them to your advantage (to do good things) Multicast 3 Fault-tolerance and Scalability Needs: 1. Reliability (Atomicity) 100% receipt 2. Speed 4 Centralized 5 Tree-Based 6 Tree-based Multicast Protocols Build a spanning tree among the processes of the multicast group Use spanning tree to disseminate multicasts Use either acknowledgments (ACKs) or negative acknowledgements (NAKs) to repair multicasts not received SRM (Scalable Reliable Multicast) Uses NAKs But adds random delays, and uses exponential backoff to avoid NAK storms (Do you know why SRM is called a “talented” protocol?) RMTP (Reliable Multicast Transport Protocol) Uses ACKs But ACKs only sent to designated receivers, which then re-transmit missing multicasts These protocols still cause an O(N) ACK/NAK overhead [Birman99] 7 A Third Approach 8 A Third Approach 9 A Third Approach 10 A Third Approach 11 “Epidemic” Multicast (or “Gossip”) 12 Push vs. Pull So that was “Push” gossip Once you have a multicast message, you start gossiping about it Multiple messages? Gossip a random subset of them, or recently-received ones, or higher priority ones There’s also “Pull” gossip Periodically poll a few randomly selected processes for new multicast messages that you haven’t received Get those messages Hybrid variant: Push-Pull As the name suggests 13 Properties Claim that the simple Push protocol Is lightweight in large groups Spreads a multicast quickly Is highly fault-tolerant 14 Analysis From old mathematical branch of Epidemiology [Bailey 75] Population of (n+1) individuals mixing homogeneously Contact rate between any individual pair is  At any time, each individual is either uninfected (numbering x) or infected (numbering y) Then, x0  n, y0  1 and at all times x  y  n 1 Infected–uninfected contact turns latter infected, and it stays infected 15 Analysis (contd.) Continuous time process Then dx    xy (why?) dt with solution: n(n  1) (n  1) x  ( n 1) t ,y ne 1  ne   ( n 1)t (can you derive it?) 16 Epidemic Multicast 17 Epidemic Multicast Analysis b  (why?) n Substituting, at time t=clog(n), the number of infected is 1 y  (n  1)  cb  2 n (correct? can you derive it?) 18 Analysis (contd.) Set c,b to be small numbers independent of n Within clog(n) rounds, [low latency] 1 all but number of nodes receive the multicast cb  2 n [reliability] each node has transmitted no more than cblog(n)gossip messages [lightweight] 19 Why is log(N) low? log(N) is not constant in theory But pragmatically, it is a very slowly growing number Base 2 log(1000) ~ 10 log(1M) ~ 20 log (1B) ~ 30 log(all IPv4 addresses) = 32 log(all IPv6 addresses) = 128 20 Fault-tolerance Packet loss 50% packet loss: analyze with b replaced with b/2 To achieve same reliability as 0% packet loss, takes twice as many rounds Node failure 50% of nodes fail: analyze with n replaced with n/2 and b replaced with b/2 Same as above 21 Fault-tolerance With failures, is it possible that the epidemic might die out quickly? 
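Both questions (how fast does push gossip spread, and can it die out under loss) are easy to probe with a toy simulation. Recall the analysis above: with contact rate beta, infection follows dx/dt = -beta*x*y, whose solution with x0 = n, y0 = 1 is y(t) = (n+1) / (1 + n*e^(-beta*(n+1)*t)), so essentially the whole group is infected after t = c*log(n) rounds. The Java sketch below simulates synchronous rounds in which every infected node pushes to b uniformly random targets and each message is dropped independently with some probability; it is a toy model, not a faithful network simulation, and the parameter values are arbitrary.

import java.util.Random;

// Toy simulation of push gossip: in each synchronous round, every infected
// node picks b random targets; each message is dropped with probability `loss`.
// Prints how many rounds it takes until everyone is infected.
public class PushGossipSim {
    public static void main(String[] args) {
        int n = 10_000;       // number of nodes
        int b = 2;            // gossip targets per infected node per round
        double loss = 0.5;    // independent message-loss probability
        Random rng = new Random(1);

        boolean[] infected = new boolean[n];
        infected[0] = true;   // the original multicast sender
        int count = 1, round = 0;

        while (count < n) {
            boolean[] next = infected.clone();
            for (int i = 0; i < n; i++) {
                if (!infected[i]) continue;
                for (int k = 0; k < b; k++) {
                    if (rng.nextDouble() < loss) continue;  // message lost
                    next[rng.nextInt(n)] = true;            // uniform random target
                }
            }
            infected = next;
            count = 0;
            for (boolean x : infected) if (x) count++;
            round++;
            System.out.printf("round %d: %d / %d infected%n", round, count, n);
        }
        // Typically finishes in O(log n) rounds; with 50% loss it takes roughly
        // twice as many rounds as with no loss, matching the analysis above.
    }
}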
Possible, but improbable: Once a few nodes are infected, with high probability, the epidemic will not die out So the analysis we saw in the previous slides is actually behavior with high probability [Galey and Dani 98] Think: why do rumors spread so fast? why do infectious diseases cascade quickly into epidemics? why does a virus or worm spread rapidly? 22 Pull Gossip: Analysis In all forms of gossip, it takes O(log(N)) rounds before about N/2 processes get the gossip Why? Because that’s the fastest you can spread a message – a spanning tree with fanout (degree) of constant degree has O(log(N)) total nodes (height of tree) Thereafter, pull gossip is faster than push gossip After the ith, round let p i be the fraction of non- infected processes. Let each round have k pulls. Then p i 1  p  i k 1 This is super-exponential Second half of pull gossip finishes in time O(log(log(N)) 23 Topology-Aware Gossip Network topology is hierarchical N/2 nodes in a subnet Random gossip target selection => core routers Router face O(N) load (Why?) Fix: In subnet i, which contains ni nodes, pick gossip target in your subnet with probability (1-1/ni) Router load=O(1) Dissemination N/2 nodes in a subnet time=O(log(N)) 24 Answer – Push Analysis (contd.) b Using:  n Substituting, at time t=clog(n) n 1 n 1 y  b  ( n 1) c log(n ) 1 1  ne n 1  cb 1 n 1  (n  1)(1  cb 1 ) n 1  ( n  1)  cb  2 n 25 SO,... Is this all theory and a bunch of equations? Or are there implementations yet? 26 Some implementations Clearinghouse and Bayou projects: email and database transactions [PODC ‘87] refDBMS system [Usenix ‘94] Bimodal Multicast [ACM TOCS ‘99] Sensor networks [Li Li et al, Infocom ‘02, and PBBF, ICDCS ‘05] AWS EC2 and S3 Cloud (rumored). [‘00s] Cassandra key-value store (and others) use gossip for maintaining membership lists Usenet NNTP (Network News Transport Protocol) [‘79] 27 NNTP Inter-server Protocol 1. Each client uploads and downloads news posts from a news server 2. CHECK Upstream 238 {Give me!} Downstream Server Server TAKETHIS 239 OK Server retains news posts for a while, transmits them lazily, deletes them after a while. 28 Summary Multicast is an important problem Tree-based multicast protocols When concerned about scale and fault- tolerance, gossip is an attractive solution Also known as epidemics Fast, reliable, fault-tolerant, scalable, topology- aware 29 Announcements MP1: Due coming Sunday 9/15, demos Monday 9/16 VMs distributed: see Piazza Demo signup sheet: now on Piazza (sign up by Friday!) Demo details: will be posted tomorrow on Piazza Make sure you print individual and total linecounts HW1 due soon, Thu 9/19! Check Piazza often! It’s where all the announcements are at! CS 425 / ECE 428 Distributed Systems Fall 2024 Indranil Gupta (Indy) w/ Aishwarya Ganesan Lecture 6: Failure Detection and All slides © IG Membership, Grids Announcements MP1: Due Sunday 9/16, demos Monday 9/17 – VMs distributed: see Piazza – Demo signup sheet: now on Piazza (see signup deadline – this Friday!) – Demo details: see Piazza Make sure you print individual and total linecounts HW1: due 9/19! (You should have started on it already!) Check Piazza often! It’s where all the announcements are at! Please view Grid Computing Lecture Video from website! – Included in midterm syllabus! (We won’t lecture in class) 2 A Challenge You’ve been put in charge of a datacenter, and your manager has told you, “Oh no! We don’t have any failures in our datacenter!” Do you believe him/her? 
What would be your first responsibility? Build a failure detector What are some things that could go wrong if you didn’t do this? 3 Failures are the Norm … not the exception, in datacenters. Say, the rate of failure of one machine (OS/disk/motherboard/network, etc.) is once every 10 years (120 months) on average. When you have 120 servers in the DC, the mean time to failure (MTTF) of the next machine is 1 month. When you have 12,000 servers in the DC, the MTTF is about once every 7.2 hours! Soft crashes and failures are even more frequent! 4 To build a failure detector You have a few options 1. Hire 1000 people, each to monitor one machine in the datacenter and report to you when it fails. 2. Write a failure detector program (distributed) that automatically detects failures and reports to your workstation. Which is more preferable, and why? 5 Target Settings Process ‘group’-based systems – Clouds/Datacenters – Replicated servers – Distributed databases Fail-stop (crash) process failures 6 Group Membership Service Application Queries Application Process pi e.g., gossip, overlays, DHT’s, etc. joins, leaves, failures of members Membership Protocol Membership Group List Membership List Unreliable Communication 7 Two sub-protocols Application Process pi Group Membership List pj Complete list all the time (Strongly consistent) Dissemination Virtual synchrony Failure Detector Almost-Complete list (Weakly consistent) Gossip-style, SWIM, … Or Partial-random list (other systems) SCAMP, T-MAN, Cyclon,… Unreliable Focus of this series of lecture Communication 8 Large Group: Scalability A Goal this is us (pi) Process Group “Members” 1000’s of processes Unreliable Communication Network 9 Group Membership Protocol II Failure Detector Some process pi finds out quickly pj III Dissemination Unreliable Communication Network Fail-stop Failures only 10 Next How do you design a group membership protocol? 11 I. pj crashes Nothing we can do about it! A frequent occurrence Common case rather than exception Frequency goes up linearly with size of datacenter 12 II. Distributed Failure Detectors: Desirable Properties Completeness = each failure is detected Accuracy = there is no mistaken detection Speed – Time to first detection of a failure Scale – Equal Load on each member – Network Message Load 13 Distributed Failure Detectors: Properties Impossible together in Completeness lossy networks [Chandra Accuracy and Toueg] Speed If possible, then can – Time to first detection of a failure solve consensus! 
(but Scale consensus is known to be – Equal Load on each member unsolvable in – Network Message Load asynchronous systems) 14 What Real Failure Detectors Prefer Completeness Guaranteed Partial/Probabilistic Accuracy guarantee Speed – Time to first detection of a failure Scale – Equal Load on each member – Network Message Load 15 What Real Failure Detectors Prefer Completeness Guaranteed Partial/Probabilistic Accuracy guarantee Speed – Time to first detection of a failure Scale Time until some non-faulty process detects the failure – Equal Load on each member – Network Message Load 16 What Real Failure Detectors Prefer Completeness Guaranteed Partial/Probabilistic Accuracy guarantee Speed – Time to first detection of a failure Scale Time until some non-faulty process detects the failure – Equal Load on each member No bottlenecks/single – Network Message Load failure point 17 Failure Detector Properties Completeness In spite of arbitrary simultaneous Accuracy process failures Speed – Time to first detection of a failure Scale – Equal Load on each member – Network Message Load 18 Centralized Heartbeating  Hotspot pi pi, Heartbeat Seq. l++ pj Heartbeats sent periodically If heartbeat not received from pi within 19 timeout, mark pi as failed Ring Heartbeating  Unpredictable on pi simultaneous multiple pi, Heartbeat Seq. l++ failures pj 20 All-to-All Heartbeating  Equal load per member pi, Heartbeat Seq. l++ pi  Single hb loss  false detection … pj 21 Next How do we increase the robustness of all-to-all heartbeating? 22 Gossip-style Heartbeating  Good accuracy Array of pi properties Heartbeat Seq. l for member subset 23 Gossip-Style Failure Detection 1 10118 64 2 10110 64 1 10120 66 3 10090 58 2 10103 62 4 10111 65 3 10098 63 2 4 10111 65 1 1 10120 70 Address Time (local) 2 10110 64 Heartbeat Counter 3 10098 70 Protocol: Nodes periodically gossip their membership 4 4 10111 65 list: pick random nodes, send it list 3 Current time : 70 at node 2 On receipt, it is merged with local membership list (asynchronous clocks) When an entry times out, member is marked 24 as failed Gossip-Style Failure Detection If the heartbeat has not increased for more than Tfail seconds, the member is considered failed And after a further Tcleanup seconds, it will delete the member from the list Why an additional timeout? Why not delete right away? 25 Gossip-Style Failure Detection What if an entry pointing to a failed node is deleted right after Tfail (=24) seconds? 1 10120 66 2 10110 64 1 10120 66 34 10098 10111 75 65 50 2 10103 62 4 10111 65 3 10098 55 2 4 10111 65 1 Current time : 75 at node 2 4 3 26 Analysis/Discussion Well-known result: a gossip takes O(log(N)) time to propagate. So: Given sufficient bandwidth, a single heartbeat takes O(log(N)) time to propagate. So: N heartbeats take: – O(log(N)) time to propagate, if bandwidth allowed per node is allowed to be O(N) – O(N.log(N)) time to propagate, if bandwidth allowed per node is only O(1) – What about O(k) bandwidth? What happens if gossip period Tgossip is decreased? What happens to Pmistake (false positive rate) as Tfail ,Tcleanup is increased? Tradeoff: False positive rate vs. detection time vs. bandwidth 27 Next So, is this the best we can do? What is the best we can do? 
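Before asking what the best possible detector is, here is a compact sketch of the gossip-style failure detector just described: each member keeps a table of (member, heartbeat counter, local time of last increase), gossips the table periodically, takes the higher heartbeat on merge, marks a member failed after Tfail with no increase, and deletes it only after a further Tcleanup so that stale gossip cannot resurrect it. The class and method names below are illustrative, not taken from any particular implementation, and the Tcleanup value is arbitrary.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch of gossip-style failure detection: membership-table merge plus
// Tfail / Tcleanup timeouts, as described in the slides above.
public class GossipMembership {
    static final long T_FAIL = 24, T_CLEANUP = 24;   // same units as local time (Tcleanup illustrative)

    static class Entry {
        long heartbeat;      // monotonically increasing counter set by the owner
        long lastUpdated;    // local time when the heartbeat last increased
        boolean failed;      // marked after T_FAIL without an increase
        Entry(long hb, long now) { heartbeat = hb; lastUpdated = now; }
    }

    final Map<Integer, Entry> table = new HashMap<>();   // member id -> entry

    // Merge a gossiped list (member id -> heartbeat) into the local table.
    void merge(Map<Integer, Long> gossiped, long now) {
        for (Map.Entry<Integer, Long> e : gossiped.entrySet()) {
            Entry local = table.get(e.getKey());
            if (local == null) {
                table.put(e.getKey(), new Entry(e.getValue(), now));
            } else if (e.getValue() > local.heartbeat && !local.failed) {
                local.heartbeat = e.getValue();          // higher counter wins
                local.lastUpdated = now;
            }
            // Entries already marked failed are NOT resurrected by stale gossip;
            // they linger for T_CLEANUP so old re-gossiped heartbeats are ignored.
        }
    }

    // Periodically: mark stale entries failed, and delete them T_CLEANUP later.
    void sweep(long now) {
        for (Iterator<Map.Entry<Integer, Entry>> it = table.entrySet().iterator(); it.hasNext();) {
            Entry e = it.next().getValue();
            if (!e.failed && now - e.lastUpdated > T_FAIL) e.failed = true;
            if (e.failed && now - e.lastUpdated > T_FAIL + T_CLEANUP) it.remove();
        }
    }
}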
28 Failure Detector Properties … Completeness Accuracy Speed – Time to first detection of a failure Scale – Equal Load on each member – Network Message Load 29 …Are application-defined Requirements Completeness Guarantee always Probability PM(T) Accuracy T time units Speed – Time to first detection of a failure Scale – Equal Load on each member – Network Message Load 30 …Are application-defined Requirements Completeness Guarantee always Probability PM(T) Accuracy T time units Speed – Time to first detection of a failure N*L: Compare this across protocols Scale – Equal Load on each member – Network Message Load 31 All-to-All Heartbeating pi, Heartbeat Seq. l++ pi Every T units L=N/T 32 Gossip-style Heartbeating pi T=logN * tg Array of Heartbeat Seq. l L=N/tg=N*logN/T for member subset Every tg units =gossip period, send O(N) gossip message 33 What’s the Best/Optimal we can do? Worst case load L* per member in the group (messages per second) – as a function of T, PM(T), N – Independent Message Loss probability pml log( PM (T )) 1 L*  log( p ) T. ml 34 Heartbeating Optimal L is independent of N (!) All-to-all and gossip-based: sub-optimal L=O(N/T) try to achieve simultaneous detection at all processes fail to distinguish Failure Detection and Dissemination components Can we reach this bound? Key: Separate the two components Use a non heartbeat-based Failure Detection Component 35 Next Is there a better failure detector? 36 SWIM Failure Detector Protocol pi pj random pj ping K random ack processes random K ping-req X X Protocol period ping = T’ time units ack ack 37 Detection Time 1 N 1 Prob. of being pinged in T’= 1  (1  )  1  e 1 N e E[T ] = T'. e 1 Completeness: Any alive member detects failure – Eventually – By using a trick: within worst case O(N) protocol periods 38 Accuracy, Load PM(T) is exponential in -K. Also depends on pml (and pf ) – See paper L E[ L]  28 8 L* L* for up to 15 % loss rates 39 SWIM Failure Detector Parameter SWIM First Detection Time  e  Expected  e 1periods Constant (independent of group size) Process Load Constant per period < 8 L* for 15% loss False Positive Rate Tunable (via K) Falls exponentially as load is scaled Completeness Deterministic time-bounded Within O(log(N)) periods w.h.p. 40 Time-bounded Completeness Key: select each membership element once as a ping target in a traversal – Round-robin pinging – Random permutation of list after each traversal Each failure is detected in worst case 2N-1 (local) protocol periods Preserves FD properties 41 SWIM versus Heartbeating Heartbeating O(N) First Detection Time SWIM Heartbeating Constant For Fixed : Constant Process Load O(N) False Positive Rate Message Loss Rate 42 Next How do failure detectors fit into the big picture of a group membership protocol? What are the missing blocks? 43 Group Membership Protocol II Failure Detector Some process pi finds out quickly pj III Dissemination Unreliable Communication Network Fail-stop Failures only 44 Dissemination Options Multicast (Hardware / IP) – unreliable – multiple simultaneous multicasts Point-to-point (TCP / UDP) – expensive Zero extra messages: Piggyback on Failure Detector messages – Infection-style Dissemination 45 Infection-style Dissemination pi pj random pj ping K random ack processes random K ping-req X X Protocol period ping = T time units ack ack Piggybacked membership information 46 Infection-style Dissemination Epidemic/Gossip style dissemination – After . 
Suspicion Mechanism
False detections occur due to
– Perturbed processes
– Packet losses, e.g., from congestion
Indirect pinging may not solve the problem
Key: suspect a process before declaring it as failed in the group 48
Suspicion Mechanism
(Figure: state machine at pi for a member pj, with states Alive, Suspected, and Failed, driven by the FD and Dissemination (Dissmn) components: Dissmn (Suspect pj) moves pj to Suspected, Dissmn (Alive pj) moves it back to Alive, and Dissmn (Failed pj) moves it to Failed.) 49
Suspicion Mechanism
Distinguish multiple suspicions of a process
– Per-process incarnation number
– Inc # for pi can be incremented only by pi, e.g., when it receives a (Suspect, pi) message
– Somewhat similar to DSDV (a routing protocol for ad-hoc networks)
Higher inc# notifications over-ride lower inc#’s
Within an inc#: (Suspect, inc #) > (Alive, inc #)
(Failed, inc #) overrides everything else
(A code sketch of these override rules appears at the end of this passage.) 50
SWIM In Industry
First used in Oasis/CoralCDN
Implemented open-source by Hashicorp Inc.
– Called “Serf”
– Later “Consul”
Today: Uber implemented it and uses it for failure detection in their infrastructure
– See the “ringpop” system 51
Wrap Up
Failures are the norm, not the exception, in datacenters
Every distributed system uses a failure detector
Many distributed systems use a membership service
Ring failure detection underlies
– IBM SP2 and many other similar clusters/machines
Gossip-style failure detection underlies
– Amazon EC2/S3 (rumored!) 52
Grid Computing 53
“A Cloudy History of Time”
(Timeline figure, 1940–2012: the first datacenters; timesharing companies & the data processing industry; PCs (not distributed!); clusters; peer-to-peer systems; grids; clouds and datacenters.) 54
“A Cloudy History of Time”
First large datacenters: ENIAC, ORDVAC, ILLIAC; many used vacuum tubes and mechanical relays
Data Processing Industry - 1968: $70 M. 1978: $3.15 Billion
Timesharing Industry (1975): Market Share: Honeywell 34%, IBM 15%, Xerox 10%, CDC 10%, DEC 10%, UNIVAC 10%; machines: Honeywell 6000 & 635, IBM 370/168, Xerox 940 & Sigma 9, DEC PDP-10, UNIVAC 1108
Berkeley NOW Project; Supercomputers; Server Farms (e.g., Oceano)
P2P Systems (90s-00s): many millions of users, many GB per day
Grids (1980s-2000s): GriPhyN (1970s-80s), Open Science Grid and Lambda Rail (2000s), Globus & other standards (1990s-2000s)
Clouds 55
Example: Rapid Atmospheric Modeling System, ColoState U
Hurricane Georges, 17 days in Sept 1998
– “RAMS modeled the mesoscale convective complex that dropped so much rain, in good agreement with recorded data”
– Used 5 km spacing instead of the usual 10 km
– Ran on 256+ processors
Computation-intensive computing (or HPC = high performance computing)
Can one run such a program without access to a supercomputer? 56
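The override rules of the suspicion mechanism earlier in this passage ((Failed, inc#) beats everything, a higher incarnation number beats a lower one, and Suspect beats Alive within the same incarnation) fit in a few lines. This Python sketch uses an assumed (kind, incarnation) encoding that is not taken from the lecture or the SWIM paper.

```python
# Assumed encoding: an update about a member is (kind, incarnation),
# where kind is "alive", "suspect", or "failed".
PRECEDENCE = {"alive": 0, "suspect": 1, "failed": 2}

def overrides(new, old):
    """Return True if update `new` should replace the currently held update `old`."""
    new_kind, new_inc = new
    old_kind, old_inc = old
    if new_kind == "failed":          # (Failed, inc#) overrides everything else
        return True
    if old_kind == "failed":
        return False
    if new_inc != old_inc:            # higher incarnation numbers override lower ones
        return new_inc > old_inc
    # Within the same incarnation: Suspect > Alive.
    return PRECEDENCE[new_kind] > PRECEDENCE[old_kind]

# Example: an (Alive, 5) refutation from pi (which incremented its own incarnation
# number after being suspected at 4) overrides the old (Suspect, 4), but a fresh
# (Suspect, 5) would in turn override (Alive, 5).
assert overrides(("alive", 5), ("suspect", 4))
assert overrides(("suspect", 5), ("alive", 5))
assert not overrides(("alive", 5), ("suspect", 5))
```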
Distributed Computing Resources
(Figure: compute resources at Wisconsin, MIT, and NCSA.) 57
An Application Coded by a Physicist
Job 0
– Output files of Job 0 are input to Job 2
Job 1, Job 2
– Jobs 1 and 2 can be concurrent
– Output files of Job 2 are input to Job 3
Job 3 58
An Application Coded by a Physicist
Output files of Job 0 (input to Job 2): several GBs
May take several hours/days
4 stages of a job: Init, Stage in, Execute, Stage out, Publish
Computation intensive, so massively parallel
Output files of Job 2 are input to Job 3 59
Scheduling Problem
(Figure: Jobs 0–3 sit at Wisconsin; MIT and NCSA are the other available sites.) 60
2-level Scheduling Infrastructure
(Figure: within Wisconsin, the HTCondor Protocol schedules Jobs 0–3; the Globus Protocol connects Wisconsin, MIT, and NCSA; some other intra-site protocol may run inside another site.) 61
Intra-site Protocol: HTCondor Protocol
(Figure: within Wisconsin, e.g., for Jobs 0 and 3)
– Internal Allocation & Scheduling
– Monitoring
– Distribution and Publishing of Files 62
Condor (now HTCondor)
High-throughput computing system from U. Wisconsin Madison
Belongs to a class of “cycle-scavenging” systems
– SETI@Home and Folding@Home are other systems in this category
Such systems
– Run on a lot of workstations
– When a workstation is free, ask the site’s central server (or Globus) for tasks
– If the user hits a keystroke or mouse click, stop the task: either kill the task or ask the server to reschedule it
– Can also run on dedicated machines
(A sketch of this cycle-scavenging loop appears at the end of this passage.) 63
Inter-site Protocol: Globus Protocol
(Figure: the Globus Protocol connects Wisconsin, holding Jobs 0, 1, and 3, with MIT and NCSA, where Job 2 runs; the internal structure of the different sites is invisible to Globus.)
– External Allocation & Scheduling
– Stage in & Stage out of Files 64
Globus
The Globus Alliance involves universities, national US research labs, and some companies
Standardized several things, especially software tools
Separately, but related: Open Grid Forum
The Globus Alliance has developed the Globus Toolkit: http://toolkit.globus.org/toolkit/
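The cycle-scavenging behaviour described on the Condor slide above can be sketched as a simple worker loop. The hook functions (seconds_since_user_input, fetch_task, run_step, reschedule) and the idle threshold are hypothetical stand-ins, not HTCondor's actual interfaces.

```python
import time

IDLE_THRESHOLD = 300.0  # seconds without keyboard/mouse activity before scavenging cycles (assumed value)

def cycle_scavenging_loop(seconds_since_user_input, fetch_task, run_step, reschedule):
    """Sketch of a cycle-scavenging worker on one workstation.

    seconds_since_user_input() -> float, fetch_task() -> task or None,
    run_step(task) -> bool (True when the task is finished), and reschedule(task)
    are assumed hooks into the local machine and the site's central server.
    """
    task = None
    while True:
        if seconds_since_user_input() < IDLE_THRESHOLD:
            # The workstation's owner is active again: stop the guest task and hand it
            # back to the scheduler (the alternative on the slide is to kill it outright).
            if task is not None:
                reschedule(task)
                task = None
            time.sleep(1.0)
            continue
        if task is None:
            task = fetch_task()       # ask the site's central server (or Globus) for work
            if task is None:
                time.sleep(5.0)       # nothing to do right now
                continue
        if run_step(task):            # run in small steps so we can yield quickly to the owner
            task = None               # finished; fetch a new task on the next iteration
```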
