Hadoop Installation Guide
Summary
This document provides a step-by-step guide on how to install and configure Hadoop, a framework for processing large datasets. It covers various aspects, including creating users, setting up SSH, installing Java, downloading Hadoop, and verifying the installation through commands and configurations. The document also emphasizes the significance of big data in modern business and engineering applications.
How does Hadoop work?

Step 1: A user/application submits a job to Hadoop (through a Hadoop job client) for the required processing by specifying the following items:
1) The location of the input and output files in the distributed file system.
2) The Java classes, in the form of a jar file, containing the implementation of the map and reduce functions.
3) The job configuration, set through parameters specific to the job.

Step 2: The Hadoop job client then submits the job (jar/executable etc.) and the configuration to the JobTracker, which assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks, monitoring them, and providing status and diagnostic information to the job client.

Step 3: The TaskTrackers on different nodes execute the tasks as per the MapReduce implementation, and the output of the reduce function is stored in the output files on the file system.
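To make Step 1 concrete, the sketch below shows a minimal MapReduce driver, written against the newer org.apache.hadoop.mapreduce API, that packages the job configuration, the mapper/reducer classes, and the input/output paths before submitting the job. The class names WordCountMapper and WordCountReducer are placeholders for map and reduce implementations sketched later in this document; the paths and types are illustrative, not prescribed by this guide.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // Job configuration (Step 1, item 3): parameters specific to this job.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        // The jar containing the map and reduce classes (Step 1, item 2).
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // placeholder mapper, sketched later
        job.setReducerClass(WordCountReducer.class);    // placeholder reducer, sketched later
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in the distributed file system (Step 1, item 1).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to the cluster and wait for completion (Step 2 takes over from here).
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, such a driver is submitted with a command of the general form “hadoop jar wordcount.jar WordCountDriver /input /output”, where the jar name and paths are only examples.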
Setting up the Hadoop framework

Hadoop is supported on the GNU/Linux platform. Before installing Hadoop in a Linux environment, we need to set up Linux with ssh (Secure Shell). The steps below set up the Linux environment:

a) Creating a user
At the beginning, create a separate user for Hadoop to isolate the Hadoop file system from the Unix file system. The steps are given below:
Open the root account using the command “su”.
Create a user from the root account using the command “useradd username”.
You can then switch to that user account using the command “su username”.

b) SSH setup and key generation
SSH setup is required to perform cluster operations such as starting and stopping the distributed daemons. To authenticate the different users of Hadoop, it is required to generate a public/private key pair for the Hadoop user and share the public key with the other users and nodes.

c) Installing Java
Java is the main prerequisite for Hadoop. First of all, verify the presence of Java on your system using the command “java -version”.

d) Downloading Hadoop
Download and extract Hadoop 2.4.1 from the Apache Software Foundation.

Hadoop operation modes
Once you have downloaded Hadoop, you can operate a Hadoop cluster in one of the three supported modes:
1. Local/Standalone mode: By default, Hadoop is configured in standalone mode and runs as a single Java process.
2. Pseudo-distributed mode: A distributed simulation on a single machine. Each Hadoop daemon (HDFS, YARN, MapReduce, etc.) runs as a separate Java process. This mode is useful for development.
3. Fully distributed mode: A fully distributed cluster of at least two machines.

Verifying the Hadoop installation
The following steps are used to verify the Hadoop installation.

Step 1: Name node setup
Set up the namenode using the command “hdfs namenode -format”. The expected result is as follows:
INFO namenode.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
…

Step 2: Verifying Hadoop dfs
The following command is used to start dfs. This command starts your Hadoop file system.
$ start-dfs.sh
The expected output is as follows:
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/Hadoop …

Step 3: Verifying the YARN script
The following command is used to start the YARN script. This command starts your YARN daemons.
$ start-yarn.sh
The expected output is as follows:
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out

Step 4: Accessing Hadoop in a browser
The default port number to access Hadoop is 50070. Use the following URL to access Hadoop services in a browser:
http://localhost:50070/

Step 5: Verifying all applications of the cluster
The default port number to access all applications of the cluster is 8088. Use the following URL to visit this service:
http://localhost:8088
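Besides the web interfaces, the installation can also be checked programmatically through Hadoop's Java client library. The minimal sketch below is not part of the original guide; it assumes a pseudo-distributed setup with fs.defaultFS set to hdfs://localhost:9000 in core-site.xml (adjust to your configuration). It connects to HDFS and lists the root directory, which only succeeds if the NameNode and DataNode have started correctly.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed address of the pseudo-distributed NameNode; must match core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        // FileSystem.get() contacts the NameNode; an exception here means HDFS is not running.
        try (FileSystem fs = FileSystem.get(conf)) {
            System.out.println("Connected to: " + fs.getUri());
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath() + (status.isDirectory() ? " (dir)" : ""));
            }
        }
    }
}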
Big Data is used in decision-making processes to gain useful insights hidden in the data for business and engineering. At the same time, it presents challenges in processing; cloud computing has helped in the advancement of big data by providing computational, networking and storage capacity. The volume of information captured by organizations from mobile devices and multimedia is increasing every moment and has almost doubled every year. This sheer volume of generated data can be categorized as structured or unstructured data that cannot be easily loaded into regular relational databases. Big data requires pre-processing to convert the raw data into a clean data set that is feasible for analysis. Healthcare, finance, engineering, e-commerce and various scientific fields use these data for analysis and decision making. Advances in data science, data storage and cloud computing have allowed for the storage and mining of big data. The effects of big data are large on a practical level, as the technology is applied to find solutions for vexing everyday problems. Big data is poised to reshape the way we live, work, and think. The possession of knowledge, which once meant an understanding of the past, is coming to mean an ability to predict the future. Ultimately, big data marks the moment when the “information society” finally fulfills the promise implied by its name. The data takes center stage. All those digital bits that have been gathered can now be harnessed to serve new purposes and unlock new forms of value.

Introduction to Hadoop

History of Hadoop
As the World Wide Web grew in the late 1990s and early 2000s, search engines and indexes were created to help locate relevant information amid the text-based content. In the early years, search results were returned by humans. But as the web grew from dozens to millions of pages, automation was needed. Web crawlers were created, many as university-led research projects, and search engine start-ups took off (Yahoo, AltaVista, etc.). One such project was an open-source web search engine called Nutch – the brainchild of Doug Cutting and Mike Cafarella. They wanted to return web search results faster by distributing data and calculations across different computers so multiple tasks could be accomplished simultaneously. During this time, another search engine project called Google was in progress. It was based on the same concept – storing and processing data in a distributed, automated way so that relevant web search results could be returned faster. In 2006, Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google’s early work with automating distributed data storage and processing. The Nutch project was divided – the web crawler portion remained as Nutch, and the distributed computing and processing portion became Hadoop (named after Cutting’s son’s toy elephant). In 2008, Yahoo released Hadoop as an open-source project.

Today, Hadoop’s framework and ecosystem of technologies are managed and maintained by the non-profit Apache Software Foundation (ASF), a global community of software developers and contributors. Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Definition of Hadoop
Apache Hadoop is an open-source programming framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly. Its framework is based on Java programming, with some native code in C and shell scripts. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage. The Apache Software Foundation develops Hadoop, and its co-founders are Doug Cutting and Mike Cafarella. Hadoop is sometimes expanded as “High Availability Distributed Object Oriented Platform”, although the name itself comes from the toy elephant mentioned above. Hadoop can be used as a flexible and easy way for the distributed processing of large enterprise data sets. Apart from its enormous volume, this data includes structured, semi-structured, and unstructured data such as Internet clickstream records, web server logs, social media posts, customer emails, mobile application logs, and sensor data from the Internet of Things (IoT). Big data applications use various tools and techniques for the processing and analysis of these data.

Importance of Hadoop
Hadoop is a valuable technology for big data analytics for the reasons mentioned below:
Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that’s a key consideration.
Computing power. Hadoop’s distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.

Traditional approach
Traditionally, an enterprise has a computer to store and process big data. Here the data is stored in an RDBMS such as Oracle Database, MS SQL Server or DB2, and sophisticated software can be written to interact with the database, process the required data and present it to the users for analysis. The relational data model is the primary data model used widely around the world for data storage and processing. Here relations are saved in the format of tables.
This format stores the relations among entities. A table has rows representing the records and columns representing the attributes. A single row of a table, called a tuple, contains a single record for that relation. This approach works well where we have a smaller volume of data that can be accommodated by standard database servers, or up to the limit of the processor that is processing the data. But when it comes to dealing with huge amounts of data, it is a tedious task to process such data through a traditional database server.

Google’s solution
Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts, assigns those parts to many computers connected over the network, and collects the results to form the final result dataset. Hadoop is the most widely used implementation of the MapReduce paradigm. Hadoop MapReduce is a software framework for easily writing applications which process big amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

Why Hadoop when we have relational databases?
1. One copy of data: exchanging data requires synchronization (consistency levels); deadlocks can become a problem; there is a need for backups and for recovery (based on logs or high availability).
2. RDBMS systems are very strong in transactional processing – but what if you need to look at all the records in the database? How long would it take to do a relational scan of 100 TB?
3. RDBMS systems work extremely well with structured data.

Comparison of the two data-handling methods, RDBMS and Hadoop:

Parameter            | RDBMS                              | Hadoop
Type of data         | Structured data with known schemas | Structured and unstructured data
Data groups          | Records, long fields, objects, XML | Files
Data modification    | Updates allowed                    | Only inserts and deletes (HDFS)
Programs             | SQL and XQuery                     | Hive, Pig, Jaql
Hardware requirement | Enterprise hardware                | Commodity hardware
Data processing      | Batch processing                   | Streaming access to full files

Hadoop consists of four main modules:
Hadoop Distributed File System (HDFS) – A distributed file system that runs on standard commodity hardware and handles large data sets. It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. HDFS provides better data throughput than traditional file systems, in addition to high fault tolerance and native support of large datasets.
Yet Another Resource Negotiator (YARN) – Manages and monitors cluster nodes and resource usage. It schedules jobs and tasks.
MapReduce – A framework that helps programs do the parallel computation on data. The map task takes input data and converts it into a dataset that can be computed in key-value pairs. The output of the map task is consumed by reduce tasks to aggregate the output and provide the desired result.
Hadoop Common – Provides common Java libraries that can be used across all modules.

Hadoop Architecture
The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part called MapReduce. Hadoop splits files and distributes them across nodes in the cluster. To process data, Hadoop transfers packaged code to the nodes so that they can work in parallel on the data each of them holds. This approach takes advantage of data locality to allow the dataset to be processed faster and more efficiently. Thus, the base Apache Hadoop architecture mainly consists of four modules:
(1) MapReduce (2) YARN (3) HDFS (4) Hadoop Common

Hadoop HDFS stores the data, MapReduce processes the data stored in HDFS, and YARN divides the tasks and assigns resources. There are two daemons running in Hadoop HDFS: the NameNode (master) and the DataNode (slave). Daemons are lightweight processes that run in the background.

NameNode
The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode is mainly used for storing the metadata, i.e. the data about the data. In particular, the NameNode contains the details of the number of blocks, the locations of the DataNodes where the data is stored, where the replications are stored, and other details. The metadata can also include the transaction logs that keep track of the user’s activity in the Hadoop cluster. The NameNode has direct contact with the client. The NameNode instructs the DataNodes with operations such as delete, create, replicate, etc. When client applications want to add/copy/move/delete a file, they interact with the NameNode. The NameNode responds to the request from the client by returning a list of relevant DataNode servers where the data lives.

DataNode
DataNodes work as slaves. DataNodes are mainly utilized for storing the data in a Hadoop cluster; the number of DataNodes can range from one to five hundred or even more. The greater the number of DataNodes, the more data the Hadoop cluster will be able to store, so it is advised that DataNodes have a high storage capacity to hold a large number of file blocks. They are responsible for serving the client’s read/write requests based on the instructions from the NameNode.
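The interaction described above – the client asking the NameNode for metadata and then reading or writing blocks on DataNodes – is hidden behind the FileSystem API of Hadoop's Java client library. The following sketch is illustrative only; it assumes fs.defaultFS is already configured (for example in core-site.xml), and the file path used is made up for the example.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);           // talks to the NameNode for metadata

        Path file = new Path("/user/hadoop/demo.txt");  // example path, not prescribed by the guide

        // Write: the NameNode allocates blocks, the bytes themselves go to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations, the bytes stream back from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}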
MapReduce
MapReduce is essentially an algorithmic framework that runs on top of YARN. Its major feature is performing distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop so fast; when you are dealing with big data, serial processing is no longer of any use. MapReduce has two main tasks, divided phase-wise: in the first phase Map is used, and in the next phase Reduce is used. The input is provided to the Map() function, its output is used as input to the Reduce() function, and after that we receive the final output. Let’s understand what Map() and Reduce() do. The input provided to Map() is a set of data. The Map() function breaks these data blocks into tuples, which are nothing but key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function combines the tuples, or key-value pairs, based on their key, forms a new set of tuples, and performs operations such as sorting, summation, etc., which are then sent to the final output node. Finally, the output is obtained.

Hadoop MapReduce works as follows:
1. Hadoop divides the job into tasks of two types, that is, map tasks and reduce tasks. YARN schedules these tasks (as we will see later). These tasks run on different DataNodes.
2. The input to the MapReduce job is divided into fixed-size pieces called input splits.
3. One map task, which runs a user-defined map function for each record in the input split, is created for each input split. These map tasks run on the DataNodes where the input data resides.
4. The output of the map task is intermediate output and is written to the local disk.
5. The intermediate outputs of the map tasks are shuffled and sorted and are then passed to the reducer.
6. For a single reduce task, the sorted intermediate output of the mappers is passed to the node where the reducer task is running. These outputs are then merged and passed to the user-defined reduce function.
7. The reduce function summarizes the output of the mappers and generates the output. The output of the reducer is stored on HDFS.
8. For multiple reduce functions, the user specifies the number of reducers. When there are multiple reduce tasks, the map tasks partition their output, creating one partition for each reduce task.
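To tie the Map() and Reduce() description above to actual code, here is a minimal word-count mapper and reducer using the org.apache.hadoop.mapreduce API. It is a sketch for illustration only, matching the placeholder class names used in the driver shown earlier; it is not taken from the original guide.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: break each input record (a line of text) into key-value pairs (word, 1).
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);                 // emit (word, 1)
        }
    }
}

// Reduce phase: the framework groups pairs by key, so each call sees one word and all its counts.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));     // emit (word, total count)
    }
}

When more than one reducer is configured (step 8 above), the framework's default HashPartitioner decides which reducer receives each key; a custom Partitioner can be supplied through job.setPartitionerClass() if a different grouping is needed.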
YARN
YARN stands for “Yet Another Resource Negotiator”. It was introduced in Hadoop 2.0 to remove the bottleneck of the JobTracker, which was present in Hadoop 1.0. YARN was described as a “redesigned resource manager” at the time of its launch, but it has since evolved into a large-scale distributed operating system for big data processing. YARN is the resource-management layer in Hadoop. In the cluster architecture, Apache Hadoop YARN sits between HDFS and the processing engines used to run applications. It schedules the tasks in the Hadoop cluster and assigns resources to the applications running in the cluster. It is responsible for providing the computational resources needed for executing the applications. YARN also allows different data-processing engines – graph processing, interactive processing, stream processing as well as batch processing – to run and process data stored in HDFS (Hadoop Distributed File System), thus making the system much more efficient. Through its various components, it can dynamically allocate resources and schedule the application processing. For large-volume data processing, it is quite necessary to manage the available resources properly so that every application can leverage them.

The main components of the YARN architecture include:

Client: It submits MapReduce jobs.

Resource Manager: It is the master daemon of YARN and is responsible for resource assignment and management among all the applications. It runs on the master node of the cluster to manage the resources across the cluster. Whenever it receives a processing request, it forwards it to the corresponding node manager and allocates resources for the completion of the request accordingly. It has two major components:
(a) Scheduler: The scheduler allocates resources to the various applications running in the cluster. It performs scheduling based on the allocated applications and the available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails due to hardware or application failure. It does not track the status of running applications; it only allocates resources to the various competing applications. The scheduler allocates resources based on the abstract notion of a container. A container is nothing but a fraction of resources such as CPU, memory, disk and network.
(b) Application Manager: It is responsible for accepting the application and negotiating the first container from the Resource Manager. It also restarts the ApplicationMaster container if a task fails. The tasks of the Application Manager are: accepting job submissions from the client; negotiating the first container for a specific ApplicationMaster; and restarting the container after an application failure. The responsibilities of the ApplicationMaster are: negotiating containers from the scheduler; and tracking container status and monitoring progress.

Node Manager: It takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that particular node. Its primary job is to keep up with the Resource Manager. It registers with the Resource Manager and sends heartbeats with the health status of the node. It monitors resource usage, performs log management, and also kills a container based on directions from the Resource Manager. It is also responsible for creating the container process and starting it at the request of the ApplicationMaster.

Application Master: An application is a single job submitted to the framework. The ApplicationMaster is responsible for negotiating resources with the Resource Manager and for tracking the status and monitoring the progress of a single application. The ApplicationMaster requests the container from the Node Manager by sending a Container Launch Context (CLC), which includes everything the application needs to run. Once the application has started, it sends health reports to the Resource Manager from time to time.

YARN features
YARN gained popularity because of the following features:
Scalability: The scheduler in the Resource Manager of the YARN architecture allows Hadoop to extend to and manage thousands of nodes and clusters.
Compatibility: YARN supports existing MapReduce applications without disruption, making it compatible with Hadoop 1.0 as well.
Cluster utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
Multi-tenancy: It allows multiple engines to access the cluster, giving organizations the benefit of multi-tenancy.
Container: A container is a collection of physical resources such as RAM, CPU cores and disk on a single node. Containers are invoked through a Container Launch Context (CLC), which is a record that contains information such as environment variables, security tokens, dependencies, etc.

Hadoop Common or Common Utilities
Hadoop Common, or the common utilities, is nothing but the set of Java libraries and Java files needed by all the other components present in a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so failures need to be handled automatically, in software, by the Hadoop framework.
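As a small illustration of the shared Java libraries that Hadoop Common provides, the sketch below loads the cluster configuration and resolves the configured default file system using the same classes (Configuration, FileSystem, Path) that HDFS, YARN and MapReduce code builds on. The property names are standard Hadoop keys, but the fallback values shown are only examples chosen for this sketch.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommonUtilitiesExample {
    public static void main(String[] args) throws Exception {
        // Configuration (from hadoop-common) reads core-default.xml and core-site.xml.
        Configuration conf = new Configuration();

        // Read standard properties; the fallback values here are examples only.
        String defaultFs = conf.get("fs.defaultFS", "file:///");
        int replication = conf.getInt("dfs.replication", 3);

        System.out.println("fs.defaultFS    = " + defaultFs);
        System.out.println("dfs.replication = " + replication);

        // The same FileSystem/Path abstractions are reused by HDFS, MapReduce and YARN code.
        FileSystem fs = FileSystem.get(conf);
        Path home = fs.getHomeDirectory();
        System.out.println("Home directory on " + fs.getUri() + " is " + home);
        fs.close();
    }
}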