Apache Hadoop: HDFS, YARN, and MapReduce
Document Details
Hashemite University
Majdi Maabreh
Summary
This document provides an overview of Apache Hadoop, including its key components like HDFS, YARN, and MapReduce. It also covers some of Hadoop's related projects and concepts like ZooKeeper and Flume, highlighting practical applications and use cases.
Full Transcript
Ch02: Apache Hadoop: HDFS, YARN, and MapReduce
Majdi Maabreh, Ph.D. Program of Data Science and AI @ HU. [email protected]
Some of the figures, pictures, and text used in this material may have been sourced from the web. We extend our special thanks to the original owners and creators who have generously made these resources available online for educational purposes, enhancing students' understanding and learning experience. - Dr. Maabreh

Hadoop
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Hadoop … WHY?
- Need to process huge datasets on large clusters of computers.
- It is very expensive to build reliability into each application.
- Nodes fail every day: failure is expected rather than exceptional, and the number of nodes in a cluster is not constant.
- Need a common infrastructure: efficient, reliable, easy to use, and open source (Apache License).

Hadoop … Who uses Hadoop?
- Amazon/A9 (A9 is Amazon's product-ranking algorithm, which decides which products appear at the top of search results based on customer search queries.)
- Facebook
- Google
- New York Times
- Veoh (an American video-sharing website)
- Yahoo!
- … and many more (please check the "Powered by Hadoop" list):
https://cwiki.apache.org/confluence/display/HADOOP2/poweredby#PoweredBy-PoweredbyApacheHadoop

Hadoop and Related Projects
The project includes these modules:
- Hadoop Common: the Java libraries and tools on which the other Hadoop modules depend. These libraries contain the Java files and scripts required to start Hadoop, and they also offer abstractions at the OS and file-system levels.
- Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data.
- Hadoop YARN: a framework for job scheduling and cluster resource management.
- Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.

Some Hadoop-related Projects
Ambari: a web-based tool for provisioning, managing, and monitoring Hadoop clusters. It enables system administrators to:
- Provision a Hadoop cluster: step-by-step installation across any number of hosts and configuration of Hadoop services for the cluster.
- Manage a Hadoop cluster: central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
- Monitor a Hadoop cluster: a dashboard for monitoring the health and status of the Hadoop cluster that notifies you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc.).
(Note: Ambari offers more than the features listed above.)

Some Hadoop-related Projects
ZooKeeper: a high-performance coordination service for distributed applications.
It exposes common services, such as naming, configuration management, synchronization, and group services, in a simple interface.
Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.

Some Hadoop-related Projects
Apache Sqoop: designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. (The Apache Sqoop project was retired in June 2021.)
Oozie: a workflow scheduler system to manage Apache Hadoop jobs.
Pig: started at Yahoo! Research. It expresses sequences of MapReduce jobs, uses a data model of nested "bags" of items, provides relational (SQL-like) operators (JOIN, GROUP BY, etc.), and makes it easy to plug in Java functions.
An example of using Pig: suppose you have user data in one file and website data in another, and you need to find the top 5 most visited pages by users aged 18-25. Use Pig if you have a more complex data-processing pipeline, especially if you are dealing with semi-structured or unstructured data; Pig does not enforce a rigid schema for the data.

Some Hadoop-related Projects
Hive: a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. Hive is built on top of Apache Hadoop and supports storage on S3, HDFS, etc. Hive allows users to read, write, and manage petabytes of data using SQL-like queries (a "relational database" built on Hadoop). It is more suitable for structured data and for ad hoc queries.
HBase: a scalable, distributed database that supports structured data storage for large tables. This project's goal is the hosting of very large tables (billions of rows by millions of columns) atop clusters of commodity hardware. (Figure: HBase data model.)

Hadoop 1 (MRv1) vs. Hadoop 2 (MRv2)
Hadoop 1 (MRv1): fewer services and tools.
- Services/daemons: NameNode, DataNode, JobTracker, TaskTracker.
- Working: HDFS for data storage; MapReduce for data processing and resource management.
Hadoop 2 (MRv2): more services and tools.
- Services/daemons: NameNode / Secondary NameNode, DataNode, ResourceManager, NodeManager.
- Working: HDFS for data storage; MapReduce for data processing; YARN for resource management.

YARN
YARN is the acronym for Yet Another Resource Negotiator and is an open-source framework for distributed processing. It is the key feature of Hadoop version 2.0 from the Apache Software Foundation. The main purpose of the YARN architecture is to support data-processing models beyond MapReduce, such as Apache Storm and Apache Spark. YARN splits the responsibilities of the JobTracker into two daemons: a global ResourceManager and a per-application ApplicationMaster.

YARN
Core components of YARN: ResourceManager, ApplicationMaster, and NodeManager.
https://kaizen.itversity.com/yarn-application-life-cycle/

Hadoop Cluster Specification
Hadoop is designed to run on commodity hardware. That means that you are not tied to expensive, proprietary offerings from a single vendor; rather, you can choose standardized, commonly available hardware from any of a large range of vendors to build your cluster. "Commodity" does not mean "low-end." Low-end machines often have cheap components, which have higher failure rates than more expensive (but still commodity-class) machines. When you are operating tens, hundreds, or thousands of machines, cheap components turn out to be a false economy, as the higher failure rate incurs a greater maintenance cost.

Hadoop Cluster Specification
Why not use RAID? HDFS clusters do not benefit from using RAID (redundant array of independent disks) for data-node storage (although RAID is recommended for the namenode's disks, to protect against corruption of its metadata). The redundancy that RAID provides is not needed, since HDFS handles it by replication between nodes.
How large should your cluster be? There isn't an exact answer to this question, but the beauty of Hadoop is that you can start with a small cluster (say, 10 nodes) and grow it as your storage and computational needs grow. In many ways, a better question is this: how fast does your cluster need to grow? You can get a good feel for this by considering storage capacity. For example, if your data grows by 1 TB a day and you have three-way HDFS replication, you need an additional 3 TB of raw storage per day. Allow some room for intermediate files and logfiles (around 30%, say), and this is in the range of one machine per week.
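To make the growth estimate concrete, here is a small back-of-the-envelope sketch in Python. The 1 TB/day growth, three-way replication, and roughly 30% overhead figures come from the example above; the per-node disk capacity (28 TB) is an assumed, hypothetical value used only to show how the "one machine per week" figure falls out.

```python
# Back-of-the-envelope cluster growth estimate.
# Growth, replication, and overhead figures are taken from the example above;
# the per-node disk capacity is an ASSUMED value, not from the slides.
daily_growth_tb = 1.0        # new data ingested per day (TB)
replication = 3              # HDFS replication factor
overhead = 0.30              # room for intermediate files and logfiles

raw_per_day = daily_growth_tb * replication * (1 + overhead)   # ~3.9 TB/day
node_capacity_tb = 28        # assumed usable disk per worker node (hypothetical)

print(f"Raw storage needed per day: {raw_per_day:.1f} TB")
print(f"New nodes needed per week: {raw_per_day * 7 / node_capacity_tb:.2f}")
# Under these assumptions this works out to roughly one machine per week.
```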
Hadoop Cluster Specification
A Hadoop cluster is a collection of machines. The nodes are classified into two types:
- Master nodes: run the services that coordinate the cluster's work. Clients contact the master nodes to perform computations. (Each cluster may have 3-6 master nodes, depending on the cluster size.)
- Worker nodes: in the Hadoop universe, the slave (worker) nodes are where Hadoop data is stored and where data processing takes place.

Network Topology
Typically there are 30 to 40 servers per rack (only 3 are shown in the diagram), with a 10 Gb switch for the rack and an uplink to a core switch or router (at least 10 Gb or better). Nodes are commodity PCs, with 30-40 nodes per rack. For each rack in your cluster, you need two top-of-rack (ToR) switches, for both redundancy and performance. We recommend using 10 GbE for the ToR switches.

Cluster Size
Many of the decisions you need to make in terms of the composition of racks and networking depend on the scale of your Hadoop cluster. It has three main permutations.
Small: a single-rack deployment is an ideal starting point for a Hadoop cluster. The cluster is fairly self-contained, although it still has relatively few slave nodes.
Medium: a medium-size cluster has multiple racks, where the three master nodes are distributed across the racks. Hadoop's resiliency starts to become apparent: even if an entire rack were to fail (for example, both ToR switches in a single rack), the cluster would still function, albeit at a lower level of performance. A slave-node failure would barely be noticeable. (Figure: a three-rack Hadoop deployment.)
Large: in larger clusters with many racks, the networking architecture required is pretty sophisticated. Regardless of how many racks a Hadoop cluster expands to, the slave nodes from any rack need to be able to efficiently "talk" to any master node.

Rack Awareness
For multi-rack clusters, you need to map nodes to racks. This allows Hadoop to prefer within-rack transfers (where there is more bandwidth available) to off-rack transfers when placing MapReduce tasks on nodes. Choosing the closest DataNode to serve a request is rack awareness: the NameNode can find the closest DataNode for faster performance because it holds the IDs of all the racks present in the Hadoop cluster.
Example: a file is split into blocks B1, B2, and B3; each block is stored twice in the local rack and once elsewhere.
Hadoop has some rack-awareness policies: no more than one replica of a block on the same DataNode, and no more than two replicas of a single block on the same rack.

Hadoop Installation
Hadoop can be run in one of three modes:
- Standalone (or local) mode: all Hadoop services run in a single JVM; the local file system is used, not HDFS; MapReduce jobs run with a single mapper and a single reducer.
- Pseudo-distributed mode: a simulation of a real multi-node cluster; all services are configured as fully distributed, but all daemons run on a single server. You cannot configure data replication or high availability (single server). Other projects such as Spark, Hive, and Pig need to be installed separately.
- Fully distributed mode: the Hadoop daemons run on a cluster of machines.

Hadoop Installation: Your Task
To get started, you'll need to create a virtual machine (VM) running Ubuntu and follow the provided steps to install and configure your Hadoop cluster: Hadoop_installation_Pseudo-distributed.txt. You may also check the video on MS Teams; the steps to install Hadoop are explained clearly there.

Start up / Shut down your cluster
(Shown as screenshots in the slides.) The NameNode web UI is available at http://localhost:9870.

HDFS
Hadoop Distributed File System (HDFS)
- A very large distributed file system: 10K nodes, 100 million files, 10 PB.
- Assumes commodity hardware: files are replicated to handle hardware failure; failures are detected and recovered from.
- Optimized for batch processing: data locations are exposed so that computations can move to where the data resides; provides very high aggregate bandwidth.
(Figure: HDFS architecture.)

NameNode
- Manages the file-system namespace: maps a file name to a set of blocks, and maps a block to the DataNodes where it resides.
- Keeps a transaction log: records file creations, file deletions, etc.
- Cluster configuration management.
- Replication engine for blocks.
- Metadata in memory: the entire metadata is held in main memory; there is no demand paging of metadata.
- Types of metadata: list of files, list of blocks for each file, list of DataNodes for each block, and file attributes (e.g., creation time, replication factor).

DataNode
- A block server: stores data in the local file system, stores the metadata of a block, and serves data and metadata to clients.
- Block report: periodically sends a report of all existing blocks to the NameNode.
- Facilitates pipelining of data: forwards data to other specified DataNodes.

Heartbeats
DataNodes send a heartbeat to the NameNode once every 3 seconds. The NameNode uses heartbeats to detect DataNode failure.

Basic commands to manage HDFS
Listing the contents of a directory, printing the contents of a file, creating a directory, uploading a file, downloading a file, copying a file, moving a file, removing a file, and removing a directory. (The slides show these as terminal screenshots; the standard commands are hdfs dfs -ls, -cat, -mkdir, -put, -get, -cp, -mv, -rm, and -rm -r.)

Anatomy of a File Write
Figure 3-4. A client writing data to HDFS (Hadoop: The Definitive Guide).

Anatomy of a File Read
Figure 3-2. A client reading data from HDFS (Hadoop: The Definitive Guide).

MapReduce
MapReduce is a programming model for efficient distributed computing. It works like a Unix pipeline:
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
Its efficiency comes from streaming through data (reducing seeks) and from pipelining. It is a good fit for a lot of applications, such as log processing and web index building.
(Figure 2-1. MapReduce logical data flow, Hadoop: The Definitive Guide.)

MapReduce Example: Word Count problem
Problem statement: given a large collection of text documents, the goal is to count the frequency of each unique word across all the documents.
Input data: the input consists of a set of text documents. These documents can be of varying lengths and contain words separated by spaces, punctuation, and other delimiters.
Output: the output should be a list of unique words along with their corresponding counts.
Example: for the input lines "Salam Alikom", "and Salam", and "Alikom Thank you", MapReduce produces: Salam 2, Alikom 2, and 1, Thank 1, you 1.
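The mapper and reducer for this example appear only as screenshots in the original slides, so they are not reproduced in this transcript. Below is a minimal sketch of what a Hadoop Streaming word-count pair in Python typically looks like; the file names mapper.py and reducer.py match the streaming command shown later, but the code itself is an illustrative reconstruction, not the slide's exact code.

```python
# mapper.py -- emit (word, 1) for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sum counts per word; Hadoop Streaming delivers input sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")   # flush the previous word
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")           # flush the last word
```

Locally, such a pair can be tested with a Unix pipeline (for example, piping the input file through the mapper, sort, and the reducer), which mirrors the "Test Locally -- Unix pipeline" slide that follows.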
MapReduce Example: Word Count problem
(The mapper and reducer code is shown in the slides.)

MapReduce Example: Test Locally -- Unix pipeline
(Shown as a terminal screenshot in the slides.)

MapReduce Example: Run on Hadoop Cluster
$ hadoop jar /$path$/hadoop-streaming-X.X.X.jar \
    -file mapper.py -file reducer.py \
    -mapper mapper.py -reducer reducer.py \
    -input input.txt \
    -output <the DFS output directory for the reduce step>
The -file options give the locations of the mapper and reducer (Python code) so they are shipped to the cluster; -input is the input file and -output is the DFS output directory for the reduce step. (The slides also show the job run and its output as screenshots.)

MapReduce Example 2: Retail Sales Dataset
Problem statement: suppose you have a dataset of retail sales transactions, and each transaction record contains information about the product barcode and its corresponding sales revenue. Calculate the total sales revenue for each product. (A sample of the input file is shown in the slides.)
The required output would look like {barcode}\t{total_revenue}, for example:
0202001 1000.00
0202002 500.44
…
(The slides then show the mapper and reducer code, a local test with a Unix pipeline, uploading the input to HDFS, and running the job on Hadoop.)
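As with the word-count example, the retail-sales mapper and reducer exist only as screenshots, so the following is a hedged sketch rather than the slide's code. It assumes each input line holds a barcode and a revenue amount separated by a comma; the actual field separator and any additional columns in the real dataset may differ, and the file names mapper_sales.py and reducer_sales.py are hypothetical.

```python
# mapper_sales.py -- hypothetical name; ASSUMES "barcode,revenue" per input line
import sys

for line in sys.stdin:
    fields = line.strip().split(",")       # assumed comma-separated record
    if len(fields) < 2:
        continue                           # skip malformed lines
    barcode, revenue = fields[0], fields[1]
    try:
        print(f"{barcode}\t{float(revenue)}")   # emit (barcode, revenue)
    except ValueError:
        continue                           # skip non-numeric revenue values
```

```python
# reducer_sales.py -- sums revenue per barcode; streaming input arrives sorted by key
import sys

current_barcode, total = None, 0.0
for line in sys.stdin:
    barcode, revenue = line.rstrip("\n").split("\t", 1)
    if barcode != current_barcode:
        if current_barcode is not None:
            print(f"{current_barcode}\t{total:.2f}")   # {barcode}\t{total_revenue}
        current_barcode, total = barcode, 0.0
    total += float(revenue)
if current_barcode is not None:
    print(f"{current_barcode}\t{total:.2f}")
```

Under these assumptions the reducer emits the {barcode}\t{total_revenue} format required above, with totals rounded to two decimal places.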
Python MRJob
MRJob is a MapReduce library in Python. It allows MapReduce applications to be written in a single class, so there is no need for separate mapper and reducer programs. You may run MRJob applications locally, on a Hadoop cluster, or on Amazon Elastic MapReduce (EMR). (Installation is shown in the slides.)
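The MRJob word count in the following slides is shown as a screenshot, so here is a minimal sketch of what such a single-class job looks like, assuming mrjob has been installed (for example with pip install mrjob); the file name wordcount_mr.py is illustrative.

```python
# wordcount_mr.py -- single-class MRJob word count
from mrjob.job import MRJob


class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every word on the input line.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # Sum the counts produced for each word.
        yield word, sum(counts)


if __name__ == "__main__":
    MRWordCount.run()
```

It can be run locally with, for example, python wordcount_mr.py input.txt, or on a Hadoop cluster by adding -r hadoop, which matches the option shown on the following slide.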
Python MRJob Example (Word Count): run locally
(Shown as a screenshot in the slides.)

Python MRJob Example (Word Count): run on a Hadoop cluster (-r hadoop)
(Shown as a screenshot in the slides.)

Assignment
Check the MS Teams channel for the required assignment and the submission details.

References
1. Balusamy, B., et al. Big Data: Concepts, Technology, and Architecture. Wiley, 2021.
2. Apache Flume architecture figure: https://www.researchgate.net/figure/Apache-Flume-architecture_fig1_354094046
3. https://hadoop.apache.org/
4. Hadoop: The Definitive Guide, 4th edition.