CS4337 Applied System Design Lecture 2 – Big Data PDF

Document Details


Uploaded by NiceForeshadowing

University of Limerick

Dr. Andrew Ju

Tags

Big Data, data management, data processing, computer science

Summary

These are lecture notes on Big Data for CS4337 Applied System Design. They cover the motivation behind Big Data, the 4 Vs, and the technology that addresses it (Hadoop, HDFS, and MapReduce). Examples, such as the New York Times archive conversion, are included to illustrate the topic.

Full Transcript


CS4337 Applied System Design
Lecture 2 – Big Data
Week 1, SEM 1 2024/25
Dr. Andrew Ju

What is big data?

What is your own interpretation of Big Data?
- A bunch of data?
- A technical term?
- An industry?
- A trend?
- A set of skills?
- ...

Two working definitions:
- Gartner: "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."
- Wikipedia: "An all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications."

Big Data: How big is Big?
- 500 million tweets are sent every day
- 294 billion emails are sent daily
- 4 petabytes of data are created on Facebook in one day
- 4 petabytes of data are created daily, cumulatively, by IoT-connected cars
- 65 billion messages are sent on WhatsApp in a day
- 5 billion searches are made on Google daily
(1 PB = 1024 TB = 1024 x 1024 GB = ? MB)

3 Vs of Big Data
- Volume (petabyte scale): a 44x increase from 2009 to 2020, from 0.8 zettabytes to 35 ZB; data volume is increasing exponentially.
- Velocity (speed): data is being generated fast and needs to be processed fast (digital streams, social media, online data analytics); late decisions mean missed opportunities.
- Variety: structured relational data (tables/transactions/legacy data), text data (the web), semi-structured data (XML), and graph data (social networks, the Semantic Web / RDF).

4 Vs of Big Data
(The fourth V, Veracity, joins Volume, Velocity, and Variety; see the Summary below.)

Who is generating Big Data?

Another viewpoint on Big Data: divide and conquer.

Philosophy to Scale: Divide and Conquer
- Divide the work
- Combine the results
(A minimal sketch of this idea follows.)
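To make the divide-and-conquer philosophy concrete, here is a minimal sketch in Python. It is not from the lecture: the function names, the chunk count, and the use of a summation as the "work" are illustrative assumptions. The point is only the shape of the computation: divide the input, process the pieces independently, then combine the partial results.

```python
# Minimal divide-and-conquer sketch: divide the work, process the
# pieces independently (here, in parallel), then combine the results.
from concurrent.futures import ProcessPoolExecutor

def divide(data, n_chunks):
    """Split the input into roughly equal chunks (divide the work)."""
    size = max(1, len(data) // n_chunks)
    return [data[i:i + size] for i in range(0, len(data), size)]

def work(chunk):
    """Process one chunk independently, e.g. a partial sum."""
    return sum(chunk)

def combine(partials):
    """Merge the partial results into the final answer."""
    return sum(partials)

if __name__ == "__main__":
    data = list(range(1_000_000))
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(work, divide(data, 8)))
    # Same answer as sum(data), but computed in independent pieces.
    print(combine(partials))
```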
How to scale?
- Scale up: fewer, more powerful machines.
- Scale out: more commodity machines.

Example – Google Case Study
- Process lots of data: Google processed about 24 petabytes of data per day in 2009.
- A single machine cannot serve all the data; you need a distributed system to store and process it in parallel.
- Parallel programming? Threading is hard!
  - How do you facilitate communication between nodes?
  - How do you scale to more machines?
  - How do you handle machine failures?
- Solution? MapReduce.

Google's Solution: MapReduce
MapReduce [OSDI '04] provides:
- Automatic parallelization and distribution
- I/O scheduling
- Load balancing
- Network and data transfer optimisation
- Fault tolerance and handling of machine failures
Need more power? Scale out, not up: a large number of commodity servers, as opposed to a few high-end specialized servers.

Why is Big Data a Big Deal now?
- More data are being collected and stored
- Open-source code
- Commodity hardware / cloud
- Artificial intelligence

Some questions to answer: distributed processing is non-trivial.
- How do we assign tasks to different workers in an efficient way?
- What happens if a task fails?
- How do workers exchange results?
- How do we synchronize distributed tasks allocated to different workers?

Identified problems: Big Data storage is challenging.
- Data volumes are massive.
- Reliably storing petabytes of data is challenging.
- All kinds of failures occur: disk, hardware, and network failures.
- The probability of failure simply increases with the number of machines.

Solutions? One popular solution: Hadoop (next topic).

Summary
4 Vs of Big Data: Volume, Velocity, Variety, Veracity.

Hadoop

Introduction: Key Questions to Answer
- Why Hadoop? (To understand the need, recap the previous section.)
- What is Hadoop?
- How to use Hadoop?
- An example of Hadoop in practice.

Hadoop Advantages
- Redundant, fault-tolerant data storage
- Parallel computing framework
- Job coordination
Because of Hadoop, programmers need not worry about:
- Q: Where is the file located?
- Q: How to handle failures and data loss?
- Q: How to divide the computation?
- Q: How to program for scaling?

Example: a real-world case from the New York Times
- Goal: make the entire archive of articles available online: 11 million articles, from 1851.
- Task: translate 4 TB of TIFF images into PDF files.
- Solution: used Amazon Elastic Compute Cloud (EC2) and Simple Storage Service (S3).
- Time: ?
- Costs: ?

Hadoop – A Little History
- Hadoop is an open-source implementation based on the Google File System (GFS) and MapReduce papers from Google.
- Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
- Hadoop was donated to Apache in 2006.

Who is using Hadoop? (Mostly whoever is generating Big Data.)

Hadoop – Stack
Two main layers:
- Distributed file system (HDFS)
- Execution engine (MapReduce)

Hadoop – Architecture
Designed as a master-slave, shared-nothing architecture.

Hadoop – Resources
- Apache Hadoop Documentation: https://hadoop.apache.org/docs/current/
- Data-Intensive Text Processing with MapReduce: http://lintool.github.io/MapReduceAlgorithms/
- Hadoop: The Definitive Guide: https://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520

HDFS

Outline
- Motivation
- Architecture and Concepts
- Inside
- User Interface

Hadoop Distributed File System (HDFS): Motivation Questions
- Problem 1: Data is too big to store on one machine. HDFS: store the data on multiple machines!
- Problem 2: Very high-end machines are too expensive. HDFS: run on commodity hardware!
- Problem 3: Commodity hardware will fail! HDFS: the software is intelligent enough to handle hardware failure!
- Problem 4: What happens to the data if the machine storing it fails? HDFS: replicate the data!
- Problem 5: How can distributed machines organize the data in a coordinated way? HDFS: a master-slave architecture!

HDFS Architecture (Master/Slave)
- Name Node (the controller): file system namespace management, block mappings.
- Data Nodes (the workhorses): block operations, replication.
- Secondary Name Node: a checkpoint node.

HDFS Inside: Name Node
Master-slave.

HDFS Inside: Blocks
Q: Why do we need the abstraction "blocks" in addition to "files"?
Reasons:
- A file can be larger than any single disk.
- A block is of fixed size, so it is easy to manage and manipulate.
- Blocks are easy to replicate and allow more fine-grained load balancing.
Q: The HDFS block size is 64 MB by default; why is it much larger than in a regular file system?
Reason:
- To minimize overhead: disk seek time is almost constant, so larger blocks amortize it.

HDFS Inside: Read
1. The client connects to the Name Node (NN) to read data.
2. The NN tells the client where to find the data blocks.
3. The client reads the blocks directly from the data nodes (without going through the NN).
4. In case of node failures, the client connects to another node that serves the missing block.
(A conceptual sketch of this read path follows.)
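The four read steps can be mimicked with a toy walk-through. This is a conceptual illustration only: the dictionary standing in for the Name Node, the node names, and the failure set are all invented for the example; a real HDFS client goes through the Hadoop client API, not anything like this.

```python
# Conceptual sketch of the HDFS read path (steps 1-4 above).
# The "namenode" dict is a stand-in for the real Name Node service.

namenode = {  # block id -> data nodes holding a replica
    "blk_1": ["datanode-A", "datanode-B", "datanode-C"],
    "blk_2": ["datanode-B", "datanode-C", "datanode-D"],
}

failed_nodes = {"datanode-A"}  # pretend one node is down

def read_block(block_id):
    # Steps 1-2: ask the Name Node *where* the block lives.
    replicas = namenode[block_id]
    # Steps 3-4: read directly from a data node, falling back to
    # another replica if the preferred node has failed.
    for node in replicas:
        if node not in failed_nodes:
            return f"contents of {block_id} served by {node}"
    raise IOError(f"all replicas of {block_id} unavailable")

for blk in ("blk_1", "blk_2"):
    print(read_block(blk))
```

Note how the Name Node hands out locations but never touches the data itself; the next slide explains why.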
HDFS Inside: Read (design rationale)
Q: Why does HDFS choose such a design for reads? Why not have the client read blocks through the NN?
Reasons:
- It prevents the NN from becoming the bottleneck of the cluster.
- It allows HDFS to scale to a large number of concurrent clients.
- It spreads the data traffic across the cluster.
Q: Given multiple replicas of the same block, how does the NN decide which replica the client should read?
HDFS solution: rack awareness based on the network topology.

HDFS Inside: Write
1. The client connects to the NN to write data.
2. The NN tells the client which data nodes to write to.
3. The client writes the blocks directly to the data nodes, with the desired replication factor.
4. In case of node failures, the NN will detect them and re-replicate the missing blocks.

HDFS Inside: Write (homework)
- Q: Where should HDFS put the three replicas of a block? What tradeoffs do we need to consider?
- Tradeoffs: reliability, write bandwidth, read bandwidth.
- Q: What are some possible strategies?

HDFS Summary
- Big Data and Hadoop background: the what and why of Hadoop; the 4 V challenge of Big Data.
- Hadoop Distributed File System (HDFS):
  - Motivation: it guides the Hadoop design.
  - Architecture: single-rack vs. multi-rack clusters; reliable storage, rack awareness, throughput.
  - Inside: the Name Node file system, reads, and writes.

MapReduce

MapReduce (Distributed Programming)
Hadoop architecture:
- Distributed file system (HDFS)
- Execution engine (MapReduce)

Map-Reduce Execution Engine (example: colour count)
Users only provide the "Map" and "Reduce" functions.

Properties of the MapReduce Engine
The Job Tracker is the master node. It:
- receives the user's job;
- decides how many tasks will run (the number of mappers);
- decides where to run each mapper (the concept of locality).
Example: this file has 5 blocks, so run 5 map tasks. Where should the task reading block 1 run? Try to run it on Node 1 or Node 3.

The Task Tracker is the slave node (it runs on each data node). It:
- receives tasks from the Job Tracker;
- runs each task until completion (either a map or a reduce task);
- stays in communication with the Job Tracker, reporting progress.
In this example, one MapReduce job consists of 4 map tasks and 3 reduce tasks.

Key/Value Pairs
Mappers and reducers are the user's code (provided functions); they just need to obey the key-value pair interface:
- Mappers: consume <key, value> pairs; produce <key, value> pairs.
- Reducers: consume <key, list(values)> pairs; produce <key, value> pairs.

Shuffling and sorting is a hidden phase between the mappers and the reducers: it groups all identical keys from all mappers, sorts them, and passes them to a certain reducer in the form of <key, list(values)>.

MapReduce Phases
Deciding what will be the key and what will be the value is the developer's responsibility.

Example 1: Word Count
Job: count the occurrences of each word in a data set. (A sketch of this example closes these notes.)

Example 2: Colour Count

Quiz I
Which of the following acts as a back-up of the master Hadoop node in the event of failure?
a. NameNode
b. Secondary NameNode
c. FsNode (Failsafe)
d. None of the above

Quiz II
The common worker node of a Hadoop system is called, in Hadoop parlance, a ...
a. NameNode
b. Secondary NameNode
c. SlaveNode
d. DataNode
e. WorkerNode
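To close these notes, here is a minimal single-process sketch of Example 1 (word count) that mirrors the map, shuffle/sort, and reduce phases described above. Production Hadoop jobs implement Mapper and Reducer classes against the Hadoop Java API; this Python version, with invented helper names, only illustrates the dataflow.

```python
# Single-process sketch of MapReduce word count:
# map -> shuffle/sort -> reduce, as in the phases above.
from collections import defaultdict

def mapper(line):
    """Consume a line of text; produce <word, 1> pairs."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group all values for the same key into <key, list(values)>, sorted."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """Consume <word, list(counts)>; produce <word, total>."""
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for line in lines for pair in mapper(line)]
for key, values in shuffle(mapped):
    print(reducer(key, values))  # e.g. ('the', 3)
```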
