Big Data Analytics
UNIT 4: HADOOP FRAMEWORK
https://www.databricks.com/glossary/hadoop-distributed-file-system-hdfs

4.1 HDFS: Block Size, Architecture - Namenode, Datanode, Secondary namenode, Federation, Anatomy of File Read and Write

What is HDFS?

HDFS stands for Hadoop Distributed File System. HDFS is the primary storage system used by Hadoop applications. It works by rapidly transferring data between nodes, and it is used by companies that need to handle and store big data. HDFS is a key component of many Hadoop systems, as it provides a means of managing big data as well as supporting big data analytics.

HDFS is a distributed file system designed to run on low-cost, commodity hardware and to be fault-tolerant. It provides high-throughput access to application data, is suitable for applications that have large data sets, and enables streaming access to file system data in Apache Hadoop.

What is Hadoop? How does it vary from HDFS?

A key difference between Hadoop and HDFS is that Hadoop is the open-source framework that can store, process, and analyse data, while HDFS is the file system of Hadoop that provides access to data. This essentially means that HDFS is a module of Hadoop.

Source: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

The NameNode acts as the master server: it manages the files, controls clients' access to files, and oversees file operations such as renaming, opening, and closing files. HDFS includes NameNodes and DataNodes. A NameNode is commodity hardware running the GNU/Linux operating system and the NameNode software. A DataNode is commodity hardware running the GNU/Linux operating system and the DataNode software. For every node of commodity hardware in an HDFS cluster, there is a DataNode. These nodes manage the data storage of the system: they perform operations on the file system when the client requests them, and they create, delete, and replicate blocks when the NameNode instructs them to.

The goals of HDFS are:

1. Managing large datasets - Organizing and storing large datasets is very challenging. HDFS is used to manage applications that have to deal with huge datasets, which may require hundreds of nodes per cluster.

2. Detecting faults - HDFS should have an appropriate technical solution in place to scan for and detect faults quickly and effectively. Because HDFS is built from a large number of commodity hardware components, failure of components is a highly probable and common event.

3. Ensuring hardware efficiency - When large datasets are involved, there is a need to reduce network traffic and increase processing speed, which is required to ensure good performance.

HDFS works with a main NameNode and multiple DataNodes, all on a commodity hardware cluster. These nodes can be organized in the same place within a data centre. The storage is broken down into blocks, which are distributed among the multiple DataNodes. To reduce the chances of data loss, blocks are often replicated across nodes.

NameNodes - The NameNode is the node within the cluster that knows about the type of data (what the data contains), the address of data (what block it belongs to), and the block size. NameNodes are also used to control access (create, read, write, remove, and replicate) to files.
The NameNode has the metadata necessary to check whether someone can write, read, create, remove, or replicate data across the various DataNodes. In summary, the NameNode holds metadata such as file names, file permissions, block ids, block sizes, block locations, and the number of replicas (the replication factor). When in use, all this information is kept in main memory, but it is also written to disk for persistent storage.

The diagram above shows how the NameNode stores information on disk using two files:

1. fsimage - the snapshot of the file system when the NameNode started (fsimage stands for file system image).
2. Edit logs - the sequence of changes (log entries) made to the file system after the NameNode started.

Whenever the NameNode restarts, the edit logs are applied to the fsimage to get the latest snapshot of the file system. However, a NameNode restart is a very rare event in production clusters, which means the edit logs can grow very large on clusters where the NameNode runs for a long period of time. When this happens:

1. The edit log becomes very large and difficult to manage.
2. The next restart of the NameNode takes a long time, because many changes have to be merged.
3. In the case of a crash, we can lose a considerable amount of metadata, since the fsimage on disk is very old.

To overcome these three issues, we need a mechanism that keeps the edit log size manageable and keeps the fsimage up to date, so that the restart time of the NameNode is not significant. The Secondary NameNode overcomes these issues by taking over from the NameNode the responsibility of merging the edit logs with the fsimage. For example, the Secondary NameNode can take an hourly backup (or checkpoint) of the metadata and store it in a file named fsimage.

The terms 'Master namenode' or 'Primary namenode' refer to the 'Name node' as shown below.

Read - https://hrouhani.org/hdfs-file-system/

As shown in the diagram above, the Secondary NameNode gets the edit logs at regular intervals (say, every hour) and updates the fsimage so that it can be used for the next restart. The purpose of the Secondary NameNode is to create a checkpoint in HDFS at regular intervals (for example, every hour). It is just a helper node for the NameNode, not a 'backup' node; that is why it is also known as the checkpoint node. If the Hadoop cluster's NameNode fails or crashes, the checkpointed metadata from the Secondary NameNode can be copied to a new system, a new master is created with this metadata, and the cluster is made to run correctly again. This is the benefit of the Secondary NameNode.

Note: In Hadoop 2, the High-Availability and Federation features minimize the importance of the Secondary NameNode.

Major function of the Secondary NameNode: it merges the edit logs and the FSImage (File System Image) from the NameNode. It periodically reads the metadata from the RAM of the NameNode and writes it to the hard disk.

Remember: as the Secondary NameNode keeps track of checkpoints in the Hadoop Distributed File System, it is also known as the checkpoint node.

DataNodes - DataNodes are in constant communication with the NameNode to identify whether they need to commence or complete a task. This stream of consistent collaboration means that the NameNode is acutely aware of the status of each DataNode.
When a DataNode is found not to be operating the way it should, the NameNode is able to automatically re-assign its work to another functioning node holding the same data block. Similarly, DataNodes are able to communicate with each other, which means they can collaborate during standard file operations. Because the NameNode is aware of every DataNode and its performance, the NameNode is crucial in the system. Data blocks are replicated across multiple DataNodes, and their locations are tracked by the NameNode.

Data Blocks and Block Size

Hadoop is known for its reliable storage. Hadoop HDFS can store data of any size and format. HDFS divides each file into small fixed-size blocks called data blocks. These data blocks offer several advantages to Hadoop HDFS.

Also Read - https://data-flair.training/blogs/apache-hadoop-hdfs-introduction-tutorial/

What is a data block in HDFS?

Files in HDFS are broken into block-sized chunks called data blocks. These blocks are stored as independent units. The size of these HDFS data blocks is 128 MB by default. We can configure the block size as per our requirement by changing the dfs.block.size property in hdfs-site.xml. Hadoop distributes these blocks on different slave machines, and the master machine stores the metadata about block locations. If the file size is not a multiple of 128 MB, all the blocks of a file are of the same size except the last one, as shown in the diagram.

In this example, we have a file of size 612 MB, and we are using the default block configuration (128 MB). Therefore, five blocks are created: the first four blocks are 128 MB in size, and the fifth block is 100 MB (128*4 + 100 = 612). A short code sketch of this arithmetic appears at the end of this section.

In HDFS:

1. A file smaller than a single block does not occupy a full block's worth of space in the underlying storage.
2. A file stored in HDFS does not need to be an exact multiple of the configured block size.

Why are blocks in HDFS huge (128 MB)?

The default size of the HDFS data block is 128 MB. The reasons for the large block size are:

1. To minimize the cost of seeks - When the block size is large, data can be stored in the minimum number of blocks (compared to a smaller block size), and the time taken to find (or seek) the data is small compared to the time taken to transfer the data from disk. This results in the transfer of multiple blocks at the disk transfer rate, which is typically very high.

2. To reduce network traffic - If blocks are small, there will be too many blocks in Hadoop HDFS and thus too many blocks to be transferred over the network. Managing such a huge number of blocks and their metadata creates overhead and leads to network traffic. When the block size is large, there is a smaller number of blocks to be transferred over the network, which reduces network traffic.

Advantages of Hadoop Data Blocks

1) No limitation on the file size - A file can be larger than any single disk in the network.
2) Simplicity of the storage subsystem - Since blocks are of a fixed size, we can easily calculate the number of blocks that can be stored on a given disk.
3) Replication for fault tolerance and high availability - Blocks are easy to replicate between DataNodes and thus provide fault tolerance and high availability.
4) Eliminating metadata concerns - Since blocks are just chunks of data to be stored, we do not need to store file metadata (such as permission information) with the data blocks; metadata can be stored separately.
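To make the block-splitting arithmetic from the 612 MB example concrete, here is a minimal sketch in plain Java. It uses no Hadoop APIs; the file and block sizes are simply the numbers from the example above.

```java
// Sketch of the HDFS block-splitting arithmetic for the 612 MB example.
// Plain Java, no Hadoop dependencies; numbers match the example above.
public class BlockSplit {
    public static void main(String[] args) {
        long fileSizeMb = 612;   // file size from the example
        long blockSizeMb = 128;  // default HDFS block size

        long fullBlocks = fileSizeMb / blockSizeMb;  // 4 full blocks
        long lastBlockMb = fileSizeMb % blockSizeMb; // 100 MB remainder

        for (long i = 1; i <= fullBlocks; i++) {
            System.out.println("Block " + i + ": " + blockSizeMb + " MB");
        }
        if (lastBlockMb > 0) {
            // The last block occupies only as much space as it needs.
            System.out.println("Block " + (fullBlocks + 1) + ": " + lastBlockMb + " MB");
        }
    }
}
```

Running this prints four 128 MB blocks and one 100 MB block, illustrating why a file need not be an exact multiple of the block size.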
How does HDFS store data?

The HDFS file system consists of master services (the NameNode and secondary NameNode) and worker services (the DataNodes). The NameNode and secondary NameNode manage the HDFS metadata, while the DataNodes host the underlying HDFS data. The NameNode tracks which DataNodes contain the contents of a given file in HDFS. HDFS divides files into blocks and stores each block on a DataNode; multiple DataNodes are linked to the cluster. The NameNode then distributes replicas of these data blocks across the cluster. It also tells the user or application where to locate wanted information.

What is the Hadoop distributed file system (HDFS) designed to handle?

The Hadoop distributed file system is designed to handle big data. HDFS is invaluable to large corporations that would otherwise struggle to manage and store data from their business and customers. With Hadoop, you can store and manage data, whether it is transactional, scientific, social media, advertising, or machine data. It also means you can come back to this data and gain valuable insights into business performance and analytics.

HDFS can also handle raw data, which is commonly used by scientists or those in the medical field who want to analyse such data; repositories of such raw data are called data lakes. They allow analysts to tackle more difficult questions without constraints.

Hadoop can also be used to run algorithms for analytical purposes. It helps businesses process and analyse data more efficiently, allowing them to discover new trends and anomalies. When it comes to transactional data, Hadoop is equipped to handle millions of transactions. Users can also dive deep into the data to discover emerging trends and patterns that help with business goals. When Hadoop is constantly being updated with fresh data, one can compare new and old data to see what has changed, and why.

Considerations with HDFS

By default, HDFS is configured with 3x replication, which means data blocks will have two additional copies. While this improves the likelihood of localized data during processing, it does introduce overhead storage costs. HDFS works best when configured with locally attached storage, as this ensures the best performance for the file system. Increasing the capacity of HDFS requires the addition of new servers (compute, memory, disk), not just storage media.
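As a hedged illustration of the 3x replication default mentioned above, the sketch below uses the Hadoop FileSystem API to request a per-file replication factor. It assumes a reachable cluster configured via the usual core-site.xml/hdfs-site.xml; the path is a made-up example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: adjusting the replication factor of one HDFS file.
// Assumes a reachable cluster; the path below is hypothetical.
public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/example.txt"); // hypothetical file
        // 3 is the HDFS default; a higher value trades storage for availability.
        boolean accepted = fs.setReplication(file, (short) 3);
        System.out.println("Replication request accepted: " + accepted);

        fs.close();
    }
}
```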
HDFS vs. Cloud Object Storage

Scaling a Hadoop cluster in an on-premises environment can be challenging from a cost and space perspective. HDFS uses locally attached storage, which can provide I/O performance benefits. One of Apache Hadoop's core components, YARN (Yet Another Resource Negotiator), is responsible for allocating system resources to the various applications running in a Hadoop cluster and for scheduling tasks to be executed on different cluster nodes. In heavily utilized environments, it is possible that most data read/write operations will go over the network rather than stay local.

Cloud object storage includes technologies such as Azure Data Lake Storage, AWS S3, and Google Cloud Storage. It is independent of the computing resources that access it, so customers can store far larger amounts of data in the cloud; customers looking to store petabytes worth of data can easily do so using cloud object storage. However, all read and write operations against cloud storage go over the network, so performance suffers from I/O operations happening across the network. It is therefore important that applications leverage caching where possible or include the logic necessary to minimize I/O operations.

Hadoop Federation

In HDFS, the NameNode stores the metadata of every file and block in the file system. On very large clusters with many files, the NameNode requires a large amount of memory to store the metadata for each file and block. The initial HDFS architecture supported a single NameNode that manages the file system namespace. In this case, memory becomes the limiting factor for scaling, and the single NameNode becomes a bottleneck.

HDFS Architecture (Hadoop 1.0)

The HDFS architecture has two layers:

1. Namespace - The namespace layer consists of files, blocks, and directories. It supports namespace-related file system operations such as create, delete, modify, and list files and directories.

2. Block Storage layer - The block storage layer has two parts:
Block Management: The NameNode performs block management. Block management provides DataNode cluster membership by handling registrations and periodic heartbeats. It processes block reports, supports block-related operations such as create, delete, modify, and get block location, and handles block replication (adding new blocks in the case of under-replicated blocks and deleting over-replicated blocks when there is a configuration change).
Storage: The DataNode manages storage space by storing blocks on the local file system and providing read/write access.

This architecture allows only a single NameNode to maintain the file system namespace. It is therefore simple to implement and works well for small clusters, but big organizations like Yahoo and Facebook faced limitations with this approach because their clusters grew exponentially.

Limitations of HDFS Architecture (Prior to Hadoop 2.0)

1. Due to the tight coupling of the namespace and the storage layer, an alternate implementation of the NameNode is difficult. This limits the direct use of the block storage by other services.
2. Because there is a single NameNode, we can have only as many DataNodes as a single NameNode can handle.
3. File system operations are limited by the number of tasks the NameNode handles at a time; the performance of the cluster therefore depends on the NameNode's throughput.
4. Also, because of the single namespace, there is no isolation among the occupant organizations (different lines of business or departments) that are using the cluster.

The HDFS Federation feature, introduced in Hadoop 2.0, enhances this architecture. It overcomes the limitations above by adding support for multiple NameNodes/namespaces to HDFS. This scales the namespace horizontally: the cluster can grow by adding NameNodes.

HDFS Federation Architecture

In the HDFS Federation architecture, there are multiple NameNodes and DataNodes. Each NameNode has its own namespace and block pool. All the NameNodes use the DataNodes as common storage. Every NameNode is independent of the others and requires no coordination with them. Each DataNode registers with all the NameNodes in the cluster and stores blocks for all the block pools in the cluster.
Also, DataNodes periodically send heartbeats and block reports to all the NameNodes in the cluster and handle instructions from the NameNodes. The figure below shows the architecture design of HDFS Federation.

HDFS Federation Architecture (Hadoop 2.0)

In the figure above, which represents the HDFS Federation architecture, there are multiple NameNodes, represented as NN1, NN2, ..., NNn. NS1, NS2, and so on are the multiple namespaces managed by their respective NameNodes (NS1 by NN1, NS2 by NN2, and so on). Each namespace has its own block pool (NS1 has Pool 1, NS2 has Pool 2, and so on). Each DataNode stores blocks for all the block pools in the cluster; for example, DataNode1 stores blocks from Pool 1, Pool 2, Pool 3, etc.

Let us now understand the block pool and the namespace volume in detail.

Block pool - A block pool in the HDFS Federation architecture is the collection of blocks belonging to a single namespace. HDFS Federation has a collection of block pools, and each block pool is managed independently of the others. This allows a namespace to generate block IDs for new blocks without any coordination with the other namespaces.

Namespace volume - A namespace together with its block pool is termed a namespace volume. The HDFS Federation architecture has a collection of namespace volumes, each of which is a self-contained management unit. When a NameNode or namespace is deleted, the corresponding block pool present on the DataNodes is also deleted. When the cluster is upgraded, each namespace volume is upgraded as a unit.

Benefits of HDFS Federation

1. Namespace scalability - With federation, we can horizontally scale the namespace. This benefits large clusters, or clusters with too many small files, because more NameNodes can be added to the cluster.
2. Performance - It improves the performance of the file system, as the file system operations are no longer limited by the throughput of a single NameNode.
3. Isolation - Because of the multiple namespaces, it can provide isolation to the occupant organizations that are using the cluster.

Anatomy of File Read in HDFS

Source: https://www.npntraining.com/blog/anatomy-of-file-read-and-write/

Let us understand, with the help of a diagram, how data flows between the client interacting with HDFS, the NameNode, and the DataNodes.

Step 1: The client opens the file it wishes to read by calling open() on the FileSystem object (this object is an instance of the DistributedFileSystem class in HDFS).

Step 2: DistributedFileSystem (DFS) calls the NameNode, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file. For each block, the NameNode returns the addresses of the DataNodes that have a copy of that block. The client interacts with the respective DataNodes to read the file data; the NameNode also provides a token to the client, which the client shows to the DataNodes for authentication. The DistributedFileSystem returns an FSDataInputStream to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the DataNode and NameNode I/O.

Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the DataNode addresses for the first few blocks in the file, connects to the first (closest) DataNode holding the first block in the file.

Step 4: Data is streamed from the DataNode back to the client, which calls read() repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream closes the connection to that DataNode and finds the best DataNode for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream. Blocks are read in order, with DFSInputStream opening new connections to DataNodes as the client reads through the stream. It also calls the NameNode to retrieve the DataNode locations for the next batch of blocks as needed.

Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.

Anatomy of File Write in HDFS

HDFS follows the Write-Once-Read-Many-Times model. In HDFS we cannot edit files that are already stored, but we can append data by reopening the files.

Step 1: The client creates the file by calling create() on DistributedFileSystem (DFS).

Step 2: DFS makes an RPC call to the NameNode to create a new file in the file system's namespace, with no blocks associated with it. The NameNode performs various checks to make sure the file does not already exist and that the client has the right permissions to create the file. If these checks pass, the NameNode creates a record of the new file; otherwise, the file cannot be created and the client is thrown an error, i.e. an IOException. The DFS returns an FSDataOutputStream for the client to start writing data to.

Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the NameNode to allocate new blocks by picking a list of suitable DataNodes to store the replicas. The list of DataNodes forms a pipeline; here we will assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline.

Step 4: Similarly, the second DataNode stores the packet and forwards it to the third (and last) DataNode in the pipeline.

Step 5: The DFSOutputStream maintains an internal queue of packets that are waiting to be acknowledged by DataNodes, called the "ack queue".

Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets to the DataNode pipeline and waits for acknowledgments before contacting the NameNode to signal that the file is complete.

Because HDFS follows the Write-Once-Read-Many model, we cannot edit files that are already stored in HDFS, but we can extend them by reopening the file for append. This design allows HDFS to scale to a large number of concurrent clients, because the data traffic is spread across all the DataNodes in the cluster. This increases the availability, scalability, and throughput of the system.
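The read and write anatomies above are driven through the same client API the text names (FileSystem, create(), open(), FSDataOutputStream, FSDataInputStream). Below is a minimal, hedged sketch that writes a small file to HDFS and reads it back; the path /tmp/demo.txt is hypothetical, and the cluster address is assumed to come from the usual configuration files.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Client-side view of the read/write anatomy above: create() and open()
// hide the NameNode RPCs and the DataNode pipelines described in the steps.
public class ReadWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);     // DistributedFileSystem on a real cluster
        Path path = new Path("/tmp/demo.txt");    // hypothetical path

        // Write: create() asks the NameNode to add the file to the namespace;
        // the returned FSDataOutputStream feeds the DataNode pipeline.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        } // close() flushes remaining packets and signals file completion

        // Read: open() fetches block locations from the NameNode;
        // the FSDataInputStream then streams from the closest DataNodes.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}
```

Also see,
https://data-flair.training/blogs/hadoop-hdfs-data-read-and-write-operations/
https://www.geeksforgeeks.org/anatomy-of-file-read-and-write-in-hdfs/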
4.2 MapReduce: MapReduce programming model, Mapper and Reducer, Example of a map reduce job, Matrix multiplication using MapReduce

MapReduce is a programming model for processing Big Data. It provides analytical capabilities for analysing huge volumes of complex data, and Hadoop MapReduce is the heart of Apache Hadoop. It is a programming model used for processing large datasets in parallel across hundreds or thousands of nodes of a Hadoop cluster built on commodity hardware.

Traditional enterprise systems normally have a centralized server to store and process data. They are not suitable for processing huge volumes of scalable data, because the centralized server creates too much of a bottleneck while processing multiple files simultaneously. Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a processing task into small parts and assigns them to many computers. Later, the results are collected in one place and integrated to form the result dataset.

The MapReduce algorithm contains two important phases, namely Map and Reduce.

Map - Here the set of data to be processed is converted into subsets of data, where individual elements are broken down into tuples (key-value pairs).

Reduce - Here the output of the Map stage is taken as input, and those key-value pairs are combined into a smaller set of tuples.

The big data processing job is the top-level unit of MapReduce. Each job is a collection of one or more Map or Reduce tasks. The execution of a job starts when it is submitted to the Job Tracker of MapReduce, specifying the map, combine, and reduce functions along with the location of the input and output data. When the job is received, the Job Tracker determines the number of splits based on the input path and selects Task Trackers based on their network locality to the data sources. The Task Trackers extract information from the splits as processing begins in the Map phase. Records are parsed by the InputFormat, and key-value pairs are generated in the memory buffer when the Map function is invoked. The combine function is used to sort all the splits from the memory buffer. After the completion of a map task, the Task Tracker notifies the Job Tracker. The Job Tracker then tells the selected Task Trackers to start the Reduce phase. The Task Tracker sorts the key-value pairs for each key after reading them. At last, the reduce function is invoked and all the values are collected into one output file.

MapReduce - Example 1: Counting the number of occurrences of words in a given file

When we have a massive file or a collection of files, we can use MapReduce to count the number of occurrences of different words at speed. The mapper emits a (word, 1) pair for every word it encounters, and the reducer sums these counts per word, as sketched below.
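The sketch below is essentially the classic WordCount example from the Hadoop documentation, written with the Hadoop Java MapReduce API; the input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each word in the input line, emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A typical invocation is `hadoop jar wordcount.jar WordCount /input /output`, after which each line of the output files contains a word and its total count.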
Matrix Multiplication Using MapReduce

https://www.talend.com/resources/what-is-mapreduce/ (has a good example and pictorial representation)
https://www.geeksforgeeks.org/mapreduce-architecture/
https://www.talend.com/resources/big-data-agriculture/
https://www.qlik.com/us/data-analytics/supply-chain-analytics
https://www.qlik.com/us/solutions/industries/public-sector
https://www.qlik.com/us/data-analytics/big-data-analytics
https://qlik.com/us/data-analytics/people-analytics
https://www.tutorialspoint.com/map_reduce/map_reduce_introduction.htm (has a good example)

MapReduce - Example 2: Matrix Multiplication

Let us consider the following 2x2 matrices A and B:

A = | 1  2 |        B = | 5  6 |
    | 3  4 |            | 7  8 |

When we multiply A and B we get the 2x2 matrix C:

C = | 1x5 + 2x7   1x6 + 2x8 |   =   | 19  22 |
    | 3x5 + 4x7   3x6 + 4x8 |       | 43  50 |

Here matrix A is a 2x2 matrix, which means the number of rows (i) = 2 and the number of columns (j) = 2. Matrix B is also a 2x2 matrix, where the number of rows (j) = 2 and the number of columns (k) = 2. Each cell of a matrix is labelled as Aij or Bjk; for example, the element 3 in matrix A is called A21, i.e. 2nd row, 1st column. This one-step matrix multiplication uses 1 mapper and 1 reducer.

The mapper formulas are:

Mapper for matrix A: (k, v) = ((i, k), (A, j, Aij)) for all k
Mapper for matrix B: (k, v) = ((i, k), (B, j, Bjk)) for all i

Computing the mapper for matrix A (here i, j, and k each range over 2 values, so for k=1, i can take the 2 values 1 and 2, and each case can further have j=1 or j=2; substituting all values into the formula):

k=1
  i=1
    j=1 ((1, 1), (A, 1, 1))
    j=2 ((1, 1), (A, 2, 2))
  i=2
    j=1 ((2, 1), (A, 1, 3))
    j=2 ((2, 1), (A, 2, 4))
k=2
  i=1
    j=1 ((1, 2), (A, 1, 1))
    j=2 ((1, 2), (A, 2, 2))
  i=2
    j=1 ((2, 2), (A, 1, 3))
    j=2 ((2, 2), (A, 2, 4))

Computing the mapper for matrix B:

i=1
  j=1
    k=1 ((1, 1), (B, 1, 5))
    k=2 ((1, 2), (B, 1, 6))
  j=2
    k=1 ((1, 1), (B, 2, 7))
    k=2 ((1, 2), (B, 2, 8))
i=2
  j=1
    k=1 ((2, 1), (B, 1, 5))
    k=2 ((2, 2), (B, 1, 6))
  j=2
    k=1 ((2, 1), (B, 2, 7))
    k=2 ((2, 2), (B, 2, 8))

The formula for the reducer is:

Reducer(k, v): for each key (i, k), make sorted Alist and Blist, then
  (i, k) => sum over j of (Aij * Bjk)
  Output => ((i, k), sum)

Computing the reducer (from the mapper computation we can observe that 4 keys are common: (1, 1), (1, 2), (2, 1) and (2, 2); make separate lists for matrices A and B with the adjoining values taken from the mapper step above):

(1, 1) => Alist = {(A, 1, 1), (A, 2, 2)}, Blist = {(B, 1, 5), (B, 2, 7)}
          Now Aij x Bjk: [(1*5) + (2*7)] = 19 -------(i)
(1, 2) => Alist = {(A, 1, 1), (A, 2, 2)}, Blist = {(B, 1, 6), (B, 2, 8)}
          Now Aij x Bjk: [(1*6) + (2*8)] = 22 -------(ii)
(2, 1) => Alist = {(A, 1, 3), (A, 2, 4)}, Blist = {(B, 1, 5), (B, 2, 7)}
          Now Aij x Bjk: [(3*5) + (4*7)] = 43 -------(iii)
(2, 2) => Alist = {(A, 1, 3), (A, 2, 4)}, Blist = {(B, 1, 6), (B, 2, 8)}
          Now Aij x Bjk: [(3*6) + (4*8)] = 50 -------(iv)

From (i), (ii), (iii) and (iv) we conclude that the output pairs are ((1, 1), 19), ((1, 2), 22), ((2, 1), 43), ((2, 2), 50). Therefore, the final matrix is

C = | 19  22 |
    | 43  50 |
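A sketch of how the one-step mapper and reducer above could be written with the Hadoop Java API follows. The input record format ("A,i,j,value" / "B,j,k,value") and the hard-coded 2x2 dimensions are assumptions made to keep the sketch short; a real job would read the dimensions from the job configuration and would be wired up with a driver like the WordCount one above.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One-step matrix multiplication, following the mapper/reducer formulas above.
// Input lines are assumed to look like "A,i,j,value" or "B,j,k,value" (CSV);
// the dimensions (2x2 here) are hard-coded to keep the sketch short.
public class MatrixMultiply {

    static final int I = 2, J = 2, K = 2; // A is IxJ, B is JxK

    public static class MatrixMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] t = value.toString().split(",");
            if (t[0].equals("A")) {
                int i = Integer.parseInt(t[1]), j = Integer.parseInt(t[2]);
                // Emit ((i, k), (A, j, Aij)) for all k.
                for (int k = 1; k <= K; k++) {
                    ctx.write(new Text(i + "," + k), new Text("A," + j + "," + t[3]));
                }
            } else {
                int j = Integer.parseInt(t[1]), k = Integer.parseInt(t[2]);
                // Emit ((i, k), (B, j, Bjk)) for all i.
                for (int i = 1; i <= I; i++) {
                    ctx.write(new Text(i + "," + k), new Text("B," + j + "," + t[3]));
                }
            }
        }
    }

    public static class MatrixReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            double[] a = new double[J + 1]; // Alist, indexed by j
            double[] b = new double[J + 1]; // Blist, indexed by j
            for (Text v : values) {
                String[] t = v.toString().split(",");
                int j = Integer.parseInt(t[1]);
                double val = Double.parseDouble(t[2]);
                if (t[0].equals("A")) a[j] = val; else b[j] = val;
            }
            double sum = 0;
            for (int j = 1; j <= J; j++) sum += a[j] * b[j]; // sum over j of Aij * Bjk
            ctx.write(key, new Text(Double.toString(sum)));  // ((i, k), sum)
        }
    }
}
```

Running this over the 2x2 example above would produce the four output pairs ((1,1), 19), ((1,2), 22), ((2,1), 43), ((2,2), 50).

https://www.cs.emory.edu/~cheung/Courses/554/Syllabus/9-parallel/matrix-mult.html
https://www.geeksforgeeks.org/matrix-multiplication-with-1-mapreduce-step/
https://luxoft-training.com/news/how-to-implement-matrix-multiplication-using-map-reduce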
4.3 YARN Architecture: Resource Manager, Node Manager, Application Master, Container, Anatomy of MapReduce Job run in YARN

YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the bottleneck of the Job Tracker that was present in Hadoop 1.0. YARN was described as a "redesigned resource manager" at the time of its launch, but it has since evolved into a large-scale distributed operating system for Big Data processing. In Hadoop 2.0, the responsibility of the Hadoop 1.0 Job Tracker is split between the Resource Manager and the Application Master.

YARN separates the resource management layer from the processing layer. It allows different data processing engines, such as graph processing, interactive processing, stream processing, and batch processing, to run and process data stored in HDFS, making the system much more efficient. Through its various components, YARN can dynamically allocate resources and schedule application processing. For large-volume data processing, it is necessary to manage the available resources properly so that every application can leverage them.

YARN Features:

Scalability: The scheduler in the Resource Manager of the YARN architecture allows Hadoop to extend to and manage thousands of nodes and clusters.
Compatibility: YARN supports existing MapReduce applications without disruption, making it compatible with Hadoop 1.0 as well.
Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
Multi-tenancy: It allows multiple engines to access the cluster, giving organizations the benefit of multi-tenancy.

The main components of YARN architecture include:

Client: It submits map-reduce jobs.

Resource Manager: It is the master daemon of YARN and is responsible for resource assignment and management among all the applications. Whenever it receives a processing request, it forwards it to the corresponding Node Manager and allocates resources for the completion of the request accordingly. It has two major components:
o Scheduler: It performs scheduling based on the application's allocation and the available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking, and it does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
o Application Manager: It is responsible for accepting the application and negotiating the first container from the Resource Manager. It also restarts the Application Master container if a task fails.

Node Manager: It takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that particular node. Its primary job is to keep up with the Resource Manager: it registers with the Resource Manager and sends heartbeats with the health status of the node. It monitors resource usage, performs log management, and kills containers as directed by the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.

Application Master: An application is a single job submitted to the framework. The Application Master is responsible for negotiating resources with the Resource Manager and for tracking the status and monitoring the progress of a single application. The Application Master requests a container from the Node Manager by sending a Container Launch Context (CLC), which includes everything the application needs to run. Once the application is started, it sends health reports to the Resource Manager from time to time.

Container: It is a collection of physical resources such as RAM, CPU cores, and disk on a single node. Containers are invoked via the Container Launch Context (CLC), a record that contains information such as environment variables, security tokens, dependencies, etc.

Application workflow in Hadoop YARN:

1. The client submits an application.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers itself with the Resource Manager.
4. The Application Master negotiates containers from the Resource Manager.
5. The Application Master notifies the Node Managers to launch the containers.
6. Application code is executed in the containers.
7. The client contacts the Resource Manager/Application Master to monitor the application's status.
8. Once the processing is complete, the Application Master un-registers with the Resource Manager.

A client-side sketch of steps 1 and 2 follows.
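The sketch below uses the YARN client API to request a new application from the Resource Manager, describe the Application Master container in a CLC, and submit it. The application master class name, command line, and resource sizes are hypothetical; they stand in for whatever a real application would launch.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

// Sketch of workflow steps 1-2 from the client's side: ask the Resource
// Manager for an application id, describe the Application Master container
// in a Container Launch Context (CLC), and submit the application.
public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Step 1: the client asks the Resource Manager for a new application.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");

        // The CLC describes how to launch the Application Master container
        // (command, environment, dependencies). The command is hypothetical.
        ContainerLaunchContext clc = Records.newRecord(ContainerLaunchContext.class);
        clc.setCommands(Collections.singletonList(
                "$JAVA_HOME/bin/java com.example.MyAppMaster")); // hypothetical AM class
        ctx.setAMContainerSpec(clc);

        // Resources requested for the Application Master container.
        ctx.setResource(Resource.newInstance(1024 /* MB */, 1 /* vcores */));

        // Step 2: the Resource Manager allocates a container and starts the AM.
        ApplicationId id = yarnClient.submitApplication(ctx);
        System.out.println("Submitted application " + id);
    }
}
```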
Advantages:

Flexibility: YARN offers the flexibility to run various types of distributed processing systems, such as Apache Spark, Apache Flink, Apache Storm, and others. It allows multiple processing engines to run simultaneously on a single Hadoop cluster.
Resource Management: YARN provides an efficient way of managing resources in the Hadoop cluster. It allows administrators to allocate and monitor the resources required by each application in the cluster, such as CPU, memory, and disk space.
Scalability: YARN is designed to be highly scalable and can handle thousands of nodes in a cluster. It can scale up or down based on the requirements of the applications running on the cluster.
Improved Performance: YARN offers better performance by providing a centralized resource management system. It ensures that resources are optimally utilized and that applications are efficiently scheduled on the available resources.
Security: YARN provides robust security features such as Kerberos authentication, Secure Shell (SSH) access, and secure data transmission. It helps ensure that the data stored and processed on the Hadoop cluster is secure.

Disadvantages:

Complexity: YARN adds complexity to the Hadoop ecosystem. It requires additional configuration and settings, which can be difficult for users who are not familiar with YARN.
Overhead: YARN introduces additional overhead, which can slow down the performance of the Hadoop cluster. This overhead is required for managing resources and scheduling applications.
Latency: YARN introduces additional latency in the Hadoop ecosystem. This latency can be caused by resource allocation, application scheduling, and communication between components.
Single Point of Failure: YARN can be a single point of failure in the Hadoop cluster. If YARN fails, it can cause the entire cluster to go down. To avoid this, administrators need to set up a backup YARN instance for high availability.
Limited Support: YARN has limited support for non-Java programming languages. Although it supports multiple processing engines, some engines have limited language support, which can limit the usability of YARN in certain environments.

Anatomy of MapReduce Job run in YARN

The client submits the MapReduce job. The YARN Resource Manager coordinates the allocation of compute resources on the cluster. The YARN Node Managers launch and monitor the compute containers on machines in the cluster. The MapReduce Application Master coordinates the tasks running the MapReduce job. The Application Master and the MapReduce tasks run in containers that are scheduled by the Resource Manager and managed by the Node Managers. The distributed file system (normally HDFS) is used for sharing job files between the other entities. (YARN can be configured to work with non-HDFS file systems as well.)

https://www.geeksforgeeks.org/hadoop-yarn-architecture/
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html
https://intellipaat.com/blog/tutorial/hadoop-tutorial/what-is-yarn/
https://www.oreilly.com/library/view/hadoop-the-definitive/9781491901687/ch07.html