Big Data Analytics - Chapter 3 - Software Layers - Hadoop, HDFS, YARN
Document Details
Université Badji Mokhtar-Annaba
2024
Dr. Kamel Maaloul
Summary
These lecture notes introduce big data analytics with a focus on the software layers of Hadoop, in particular HDFS, YARN, and MapReduce. They cover the fundamental concepts and detail the advantages of Hadoop over traditional RDBMS systems.
Full Transcript
Big Data Analytics - Chapter III: Software Layers
Foundations of Hadoop, the HDFS file system, understanding and applying the Map/Reduce paradigm, servers, racks and networks, and the stack layers.
Dr. Kamel Maaloul, 2024-2025

RDBMS vs Hadoop properties

Advantages of Hadoop
– Handles vast amounts of data
– Economical
– Efficient
– Scalable
– Reliable

Applications not suited to Hadoop
– Low-latency data access: HBase is currently a better choice
– Lots of small files: all filesystem metadata is kept in memory, so the number of files is constrained by the memory size of the NameNode
– Multiple writers or arbitrary file modifications

Hadoop Cluster: servers, racks and networks
Typically a two-level architecture:
– Nodes are commodity PCs
– 30-40 nodes per rack
– The uplink from each rack switch to the aggregation switch is 3-4 gigabit
– Rack-internal links are 1 gigabit

The Core Apache Hadoop Project
– Hadoop Common: Java libraries and utilities required by the other Hadoop modules
– HDFS: the Hadoop Distributed File System
– Hadoop YARN (Yet Another Resource Negotiator): a framework for job scheduling and cluster resource management
– Hadoop MapReduce: a programming model that simplifies large-scale data processing

Hadoop Layers

Hadoop Related Subprojects
– Pig: high-level language for data analysis
– HBase: table storage for semi-structured data
– ZooKeeper: coordination of distributed applications
– Hive: SQL-like query language and metastore
– Mahout: machine learning

HDFS: Hadoop Distributed File System

Goals of HDFS
– A very large distributed file system: 10K nodes, 100 million files, 10 PB
– Assumes commodity hardware: files are replicated to handle hardware failure; failures are detected and recovered from
– Optimized for batch processing: data locations are exposed so that computations can move to where the data resides; provides very high aggregate bandwidth
– Runs in user space on heterogeneous operating systems

The Design of HDFS
– A single namespace for the entire cluster
– Data coherency: write-once-read-many access model; a client can only append to existing files
– Files are broken up into blocks: typically 64 MB-128 MB block size, each block replicated on multiple DataNodes
– Intelligent client: the client can find the locations of blocks from the NameNode and accesses the data directly from the DataNodes (see the client API sketch below)

HDFS Architecture: Functions of a NameNode
– Manages the file system namespace: maps a file name to a set of blocks, and maps a block to the DataNodes where it resides
– Cluster configuration management
– Replication engine for blocks
– To ensure high availability, you need both an active NameNode and a standby NameNode, each running on its own dedicated master node

NameNode Metadata
– Metadata in memory: the entire metadata is held in main memory, with no demand paging of metadata
– Types of metadata: list of files, list of blocks for each file, list of DataNodes for each block, and file attributes (e.g. creation time, replication factor)
– A transaction log records file creations, file deletions, etc.

Secondary NameNode
– Copies the FsImage and transaction log from the NameNode to a temporary directory
– Merges the FsImage and transaction log into a new FsImage in that temporary directory
– The Secondary NameNode's whole purpose is to provide a checkpoint in HDFS
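As a concrete illustration of the intelligent-client behaviour described under "The Design of HDFS", here is a minimal sketch (not part of the lecture; the file path is illustrative) that asks the NameNode for the block locations of a file using the standard Hadoop FileSystem API:

    // Minimal sketch: query the NameNode for the blocks of a file and the DataNodes holding them.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);         // client handle to HDFS
            Path file = new Path("/foodir/myfile.txt");   // illustrative path

            // The NameNode maps the file name to its blocks, and each block to DataNodes.
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }

Once the client knows which DataNodes hold a block, it reads the data directly from them; only the metadata lookup goes through the NameNode.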
Standby NameNode - QJM
In Hadoop 2, a high-availability feature was introduced. It uses two NameNodes: one NameNode is in the active state while the other is in the standby state. The active NameNode serves client requests, while the standby NameNode keeps its state synchronized so it can take over as the active NameNode if the current active NameNode fails. A Quorum Journal Manager (QJM) runs in each NameNode; the QJM is responsible for communicating with the JournalNodes using RPC, for example to send namespace modifications.

ZooKeeper: coordinating distributed applications

DataNode
– A block server: stores data in the local file system (e.g. ext3), stores metadata of a block (e.g. CRC), and serves data and metadata to clients
– Block report: periodically sends a report of all existing blocks to the NameNode
– Facilitates pipelining of data: forwards data to other specified DataNodes

Block Placement
Current strategy (default: 3 replicas):
– One replica on the local node
– Second replica on a remote rack
– Third replica on the same remote rack
– Additional replicas are placed randomly
Clients read from the nearest replica.

Heartbeats
– DataNodes send a heartbeat to the NameNode periodically, once every 3 seconds
– The NameNode uses heartbeats to detect DataNode failure

NameNode as a Replication Engine
The NameNode detects DataNode failures and then:
– Chooses new DataNodes for new replicas
– Balances disk usage
– Balances communication traffic to the DataNodes

Data Pipelining
– The client retrieves a list of DataNodes on which to place replicas of a block
– The client writes the block to the first DataNode
– The first DataNode forwards the data to the next node in the pipeline
– When all replicas are written, the client moves on to write the next block of the file

Rebalancer
Goal: the percentage of disk used on each DataNode should be similar.
– Usually run when new DataNodes are added
– The cluster stays online while the Rebalancer is active

User Interface
Commands for HDFS users:
– hadoop dfs -mkdir /foodir
– hadoop dfs -cat /foodir/myfile.txt
– hadoop dfs -rm /foodir/myfile.txt
Commands for the HDFS administrator:
– hadoop dfsadmin -report
– hadoop dfsadmin -decommission datanodename
Web interface:
– http://host:port/dfshealth.jsp

Introduction to Hadoop YARN (Yet Another Resource Negotiator)
YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.

Applications that run on YARN

YARN Cluster Basics
In a YARN cluster there are two types of hosts:
– The ResourceManager is the master daemon that communicates with the client, tracks resources on the cluster, and orchestrates work by assigning tasks to NodeManagers.
– A NodeManager is a worker daemon that launches and tracks processes spawned on worker hosts.

YARN Resource Monitoring
YARN currently defines two resources: vcores and memory.
– Each NodeManager tracks its own local resources and communicates its resource configuration to the ResourceManager.
– The ResourceManager keeps a running total of the cluster's available resources (e.g. 100 workers with the same resources). A client-side sketch of this monitoring follows this overview.

YARN Container
– A container is a request to hold resources on the YARN cluster.
– A container request consists of vcores and memory.
– A task runs as a process inside a container.

YARN Application and ApplicationMaster
– A YARN application is a client program made up of one or more tasks; a MapReduce application is one example.
– The ApplicationMaster coordinates the tasks on the YARN cluster for each running application, and is the first process run after the application starts.
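Before walking through the component interactions, here is a minimal sketch (not from the lecture) that uses the YARN client API (org.apache.hadoop.yarn.client.api.YarnClient) to ask the ResourceManager for the per-node resources described under "YARN Resource Monitoring":

    // Minimal sketch: list the vcores/memory each NodeManager has reported to the ResourceManager.
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ClusterResourcesDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new YarnConfiguration();
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();

            // Each NodeManager reports its vcores and memory to the ResourceManager;
            // the ResourceManager exposes these totals through node reports.
            List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                System.out.println(node.getNodeId()
                        + " capacity=" + node.getCapability()   // total vcores + memory on the node
                        + " used=" + node.getUsed());           // resources currently held by containers
            }
            yarnClient.stop();
        }
    }

The sum of the reported capacities is exactly the "running total of the cluster's available resources" that the ResourceManager schedules containers against.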
Interactions among YARN Components
1. The application starts and talks to the ResourceManager for the cluster.
2. The ResourceManager makes a single container request on behalf of the application.
3. The ApplicationMaster starts running within that container.
4. The ApplicationMaster requests subsequent containers from the ResourceManager, which are allocated to run tasks for the application. Those tasks do most of their status communication with the ApplicationMaster allocated in step 3.
5. Once all tasks are finished, the ApplicationMaster exits and the last container is de-allocated from the cluster.
6. The application client exits. (An ApplicationMaster launched in a container is more specifically called a managed AM; unmanaged ApplicationMasters run outside of YARN's control.)

How Applications Run on YARN
Step 1: A client submits a job (for example with "spark-submit") to the YARN ResourceManager.
Step 2: The job enters a scheduler queue in the ResourceManager, waiting to be executed.
Step 3: When it is time for the job to be executed, the ResourceManager finds a NodeManager capable of launching a container to run the ApplicationMaster.
Step 4: The ApplicationMaster launches the Driver Program (the entry point of the program that creates the SparkSession/SparkContext).
Step 5: The ApplicationMaster/Spark calculates the resources required for the job (CPU, RAM, number of executors) and sends a request to the ResourceManager to launch the executors. The ApplicationMaster also communicates with the NameNode, using the HDFS protocol, to determine the locations of the file blocks within the cluster.
Step 6: The Driver Program assigns tasks to the executor containers and keeps track of the task status.
Step 7: The executor containers execute the tasks and return the results to the Driver Program, which aggregates the results and produces the final output.

Schedulers
The scheduler in YARN is responsible for scheduling and mediating the available resources in the cluster among submitted applications, in accordance with a defined policy. It allows different policies for managing constraints such as capacity, fairness, and service level agreements (SLAs). There are three scheduling policies available in YARN:
– FIFO Scheduler: a simple scheduler that follows a first-come, first-served approach; suitable for smaller clusters and simple workloads.
– Capacity Scheduler: divides cluster resources into multiple queues, each with its own reserved resources, while dynamically utilizing unused resources from other queues; suitable for large-scale, multi-user environments.
– Fair Scheduler: designed to balance resources fairly and equally among accepted jobs, without requiring a set amount of reserved capacity; suitable for clusters that run jobs of varying sizes and resource requirements.

Introduction to MapReduce
MapReduce is a programming model that simplifies large-scale data processing.
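The word-count example walked through in the next section can be expressed with the Hadoop Java MapReduce API. Here is a minimal sketch (the lecture describes the example in prose and diagrams only; class names are illustrative):

    // Minimal word-count sketch: mapper emits (word, 1), reducer sums the counts per word.
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Mapper: input value = one line of text, output = (word, 1) pairs
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);   // emit (word, 1)
                    }
                }
            }
        }

        // Reducer: input = a word and its set of counts, output = (word, sum)
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);         // emit (word, total count)
            }
        }
    }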
Word Count Example
– Mapper: input value = lines of the input text; output = (word, 1) pairs
– Reducer: input = a word and its set of counts; output = (word, sum)
– Launching program: defines the job and submits it to the cluster

Example
Input: "I am a human, you are also a human". Each map task emits one (word, 1) pair per word, for example (I, 1), (am, 1), (a, 1), (human, 1), (you, 1), (are, 1), (also, 1), (a, 1), (human, 1). Hadoop sorts the intermediate data and partitions it across the reduce tasks, which sum the counts per word and write the output partitions (part0, part1), e.g. (a, 2), (also, 1), (am, 1), (are, 1), (human, 2), (I, 1), (you, 1). In this example the JobTracker generates three TaskTrackers for the map tasks and two TaskTrackers for the reduce tasks.

MapReduce Steps
Input: the sample file is provided as input to MapReduce.
Split: Hadoop splits the sample input file into four parts, each part made up of one line of the input file. Note that, for the purpose of this example, we treat each line as one split; this is not necessarily true in a real scenario.
Map: each split is fed to a mapper, the map() function containing the logic for processing the input data, which in our case is the line of text in the split. Here the map() function counts the occurrences of each word, and each occurrence is captured as a (key, value) pair such as (SQL, 1), (DW, 1), (SQL, 1), and so on.
Combine: an optional step, often used to improve performance by reducing the amount of data transferred across the network. It is essentially the same as the reducer (the reduce() function) and acts on the output of each mapper. In our example, the key-value pairs from the first mapper, (SQL, 1), (DW, 1), (SQL, 1), are combined and the output of the corresponding combiner becomes (SQL, 2), (DW, 1).
Shuffle and Sort: the output of all the mappers is collected, shuffled, sorted, and arranged to be sent to the reducers.
Reduce: the collective data from the various mappers, after being shuffled and sorted, is aggregated and the word counts are produced as (key, value) pairs such as (BI, 1), (DW, 2), (SQL, 5), and so on.
Output: the output of the reducer is written to a file on HDFS.

Input and Output Formats
– A MapReduce job may specify how its input is to be read by specifying an InputFormat, and how its output is to be written by specifying an OutputFormat.
– These default to TextInputFormat and TextOutputFormat, which process line-based text data.
– Another common choice is SequenceFileInputFormat and SequenceFileOutputFormat for binary data.
– These are file-based, but they are not required to be.

Introduction to YARN and MapReduce Interaction
MapReduce on YARN: in the MapReduce paradigm, an application consists of map tasks and reduce tasks, which align very cleanly with YARN tasks.

Putting It Together: MapReduce and YARN
– In a MapReduce application there are multiple map/reduce tasks, and each task runs in a container on a worker host in the cluster.
– On the YARN side, the ResourceManager, NodeManager, and ApplicationMaster work together to manage the cluster's resources.

Scheduling in YARN
The ResourceManager (RM) tracks the resources on a cluster and assigns them to the applications that need them. The scheduler is the part of the RM that does this matching while honoring organizational policies on resource sharing.
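To tie the pieces together, here is a minimal driver sketch (not shown in the lecture; it assumes the WordCount mapper and reducer classes from the earlier sketch). It shows where the InputFormat/OutputFormat choices are configured and how the job is submitted; when the cluster uses the YARN framework, each resulting map and reduce task runs in a YARN container.

    // Minimal driver sketch: configure formats, wire in mapper/combiner/reducer, submit the job.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setCombinerClass(WordCount.IntSumReducer.class);   // optional combine step
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // TextInputFormat / TextOutputFormat are the defaults for line-based text data;
            // SequenceFileInputFormat / SequenceFileOutputFormat would be used for binary data.
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

            System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait for the job
        }
    }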
References
– "Hadoop: The Definitive Guide", Tom White, O'Reilly Media, Inc.
– https://blog.cloudera.com/blog/2015/09/untangling-apache-hadoop-yarn-part-1/
– https://hadoop.apache.org/docs/r2.7.2/