Hadoop Ecosystem
Summary
This document provides an introduction to the Hadoop ecosystem, a framework for handling big data. It discusses the components of the Hadoop ecosystem and highlights the problems with traditional data systems. It covers topics such as the Google File System (GFS), Apache Hadoop, and more.
Full Transcript
Introduction

The massive amount of data generated at a ferocious pace and in all kinds of formats is what we call Big Data today. But it is not feasible to store this data on the traditional systems that we have been using for over 40 years. To handle this massive data we need a much more complex framework consisting of not just one, but multiple components handling different operations.

Source: https://www.analyticsvidhya.com/blog/2020/10/introduction-hadoop-ecosystem/

Introduction Cont.

We refer to this framework as Hadoop, and together with all its components, we call it the Hadoop Ecosystem. Since there are so many components within this Hadoop ecosystem, it can become really challenging at times to understand and remember what each component does and where it fits into this big picture.

Problems with Traditional Systems

Most of the data generated today is semi-structured or unstructured, but traditional systems have been designed to handle only structured data with well-defined rows and columns.
Relational Databases are vertically scalable, which means you need to add more processing, memory, and storage to the same system. This can turn out to be very expensive.
Data stored today sits in different silos. Bringing it together and analyzing it for patterns can be a very difficult task.

Solution

People at Google also faced the above-mentioned challenges when they wanted to rank pages on the Internet. They found Relational Databases to be very expensive and inflexible, so they came up with their own novel solution: they created the Google File System (GFS).

What is Hadoop?

So, how do we handle Big Data? This is where Hadoop comes in!

Google File System (GFS)

GFS is a distributed file system that overcomes the drawbacks of the traditional systems. It runs on inexpensive hardware and provides parallelization, scalability, and reliability. This laid the stepping stone for the evolution of Apache Hadoop.

Apache Hadoop

Apache Hadoop is an open-source framework based on Google's file system that can deal with big data in a distributed environment. This distributed environment is built up of a cluster of machines that work closely together to give the impression of a single working machine.

Some important properties of Hadoop

1. Hadoop is highly scalable because it handles data in a distributed manner.
2. Compared to vertical scaling in RDBMS, Hadoop offers horizontal scaling.
3. It creates and saves replicas of data, making it fault-tolerant.
4. It is economical, as all the nodes in the cluster are commodity hardware, which is nothing but inexpensive machines.
5. Hadoop utilizes the data locality concept to process the data on the nodes on which it is stored rather than moving the data over the network, thereby reducing traffic.
6. It can handle any type of data: structured, semi-structured, and unstructured. This is extremely important today because most of our data has no defined format.

Components of the Hadoop Ecosystem

HDFS (Hadoop Distributed File System)

HDFS is the storage component of Hadoop that stores data in the form of files. Each file is divided into blocks of 128 MB (configurable), which are stored on different machines in the cluster. It has a master-slave architecture with two main components: Name Node and Data Node.

HDFS Components

1. The Name node is the master node, and there is only one per cluster. Its task is to know where each block belonging to a file is lying in the cluster.
2. The Data node is the slave node that stores the blocks of data, and there is more than one per cluster. Its task is to retrieve the data as and when required. It keeps in constant touch with the Name node through heartbeats.
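To make the block idea concrete, here is a minimal, illustrative Python sketch. It is plain arithmetic, not the HDFS API: it only shows how a file of a given size would be cut into 128 MB blocks as described above. The replication factor of 3 and the property names in the comments are common HDFS defaults and are assumptions here, not something stated in the slides.

    # Illustration only: how HDFS would split a file into fixed-size blocks.
    BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, configurable (dfs.blocksize)
    REPLICATION = 3                  # assumed typical default (dfs.replication)

    def split_into_blocks(file_size_bytes):
        """Return the sizes (in bytes) of the blocks a file would occupy."""
        blocks = []
        remaining = file_size_bytes
        while remaining > 0:
            blocks.append(min(BLOCK_SIZE, remaining))
            remaining -= BLOCK_SIZE
        return blocks

    # A 500 MB file becomes three full 128 MB blocks plus one 116 MB block;
    # each block would then be stored on REPLICATION different Data nodes.
    sizes = split_into_blocks(500 * 1024 * 1024)
    print(len(sizes), [s // (1024 * 1024) for s in sizes])  # -> 4 [128, 128, 128, 116]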
MapReduce

To handle Big Data, Hadoop relies on the MapReduce algorithm introduced by Google, which makes it easy to distribute a job and run it in parallel in a cluster. It essentially divides a single task into multiple tasks and processes them on different machines. It works in a divide-and-conquer manner and runs the processes on the machines to reduce traffic on the network. It has two important phases: Map and Reduce.

The Map phase filters, groups, and sorts the data. Input data is divided into multiple splits, and each map task works on a split of data in parallel on different machines and outputs a key-value pair. The output of this phase is acted upon by the reduce task in what is known as the Reduce phase, which aggregates the data, summarises the result, and stores it on HDFS.
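As a rough illustration of the Map and Reduce phases, here is a word-count sketch in the style of Hadoop Streaming, where the mapper and reducer are ordinary scripts that read from stdin and write to stdout. The file names (mapper.py, reducer.py) are placeholders chosen for this example; on a real cluster the two scripts would be submitted through the Hadoop Streaming jar rather than run directly.

    # mapper.py -- Map phase: emit "word<TAB>1" for every word in the input split.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py -- Reduce phase: input arrives grouped and sorted by key,
    # so counts for the same word can be summed in a single pass.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

Locally, the pair can be tested with a pipeline such as cat input.txt | python mapper.py | sort | python reducer.py, where the sort step mimics the shuffle-and-sort that Hadoop performs between the two phases.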
YARN

YARN, or Yet Another Resource Negotiator, manages resources in the cluster and manages the applications over Hadoop. It allows data stored in HDFS to be processed by various data processing engines such as batch processing, stream processing, interactive processing, graph processing, and many more, which increases efficiency.

HBase

HBase is a column-based NoSQL database. It runs on top of HDFS and can handle any type of data. It allows for real-time processing and random read/write operations to be performed on the data.

Pig

Pig was developed for analyzing large datasets and overcomes the difficulty of writing map and reduce functions. It consists of two components: Pig Latin and Pig Engine. Pig Latin is the scripting language, which is similar to SQL, and Pig Engine is the execution engine on which Pig Latin runs. Internally, the code written in Pig is converted to MapReduce functions, which makes it very easy for programmers who aren't proficient in Java.

Hive

Hive is a distributed data warehouse system developed by Facebook. It allows for easy reading, writing, and managing of files on HDFS. It has its own querying language for the purpose, known as Hive Querying Language (HQL), which is very similar to SQL. This makes it very easy for programmers to write MapReduce functions using simple HQL queries.

Sqoop

A lot of applications still store data in relational databases, thus making them a very important source of data. Therefore, Sqoop plays an important part in bringing data from Relational Databases into HDFS. The commands written in Sqoop are internally converted into MapReduce tasks that are executed over HDFS. It works with almost all relational databases like MySQL, Postgres, SQLite, etc. It can also be used to export data from HDFS to an RDBMS.

Flume

Flume is an open-source, reliable, and available service used to efficiently collect, aggregate, and move large amounts of data from multiple data sources into HDFS. It can collect data in real time as well as in batch mode. It has a flexible architecture and is fault-tolerant with multiple recovery mechanisms.

Kafka

There are a lot of applications generating data and a commensurate number of applications consuming that data, but connecting them individually is a tough task. That's where Kafka comes in. It sits between the applications generating data (Producers) and the applications consuming data (Consumers).

ZooKeeper

In a Hadoop cluster, coordinating and synchronizing nodes can be a challenging task. Therefore, Zookeeper is the perfect tool for the problem. It is an open-source, distributed, and centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services across the cluster.

Oozie

Oozie is a workflow scheduler system that allows users to link jobs written on various platforms like MapReduce, Hive, Pig, etc. Using Oozie you can schedule a job in advance and can create a pipeline of individual jobs to be executed sequentially or in parallel to achieve a bigger task. For example, you can use Oozie to perform ETL operations on data and then save the output in HDFS.

Spark

Spark is an alternative framework to Hadoop built on Scala, but it supports varied applications written in Java, Python, etc. Compared to MapReduce it provides in-memory processing, which accounts for faster processing. In addition to the batch processing offered by Hadoop, it can also handle real-time processing. Spark has its own ecosystem; a minimal PySpark sketch appears at the end of this transcript.

Components of the Spark Ecosystem

Spark Core is the main execution engine for Spark, and the other APIs are built on top of it.
The Spark SQL API allows for querying structured data stored in DataFrames or Hive tables.
The Streaming API enables Spark to handle real-time data. It can easily integrate with a variety of data sources like Flume, Kafka, and Twitter.
MLlib is a scalable machine learning library that enables you to perform data science tasks while leveraging the properties of Spark at the same time.
GraphX is a graph computation engine that enables users to interactively build, transform, and reason about graph-structured data at scale, and it comes with a library of common algorithms.

Stages of Big Data Processing

With so many components within the Hadoop ecosystem, it can become pretty intimidating and difficult to understand what each component is doing. Therefore, it is easier to group some of the components together based on where they lie in the stages of Big Data processing.
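To make the Spark section above concrete, here is a minimal PySpark sketch showing the same word count expressed with Spark Core (the RDD API) and with the Spark SQL API. The application name and input path are placeholders; a local Spark installation with the pyspark package is assumed.

    # Minimal PySpark sketch: word count with the RDD API and with Spark SQL.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("EcosystemSketch").getOrCreate()

    # Placeholder path; this could also be an HDFS URI such as hdfs:///data/sample.txt.
    lines = spark.read.text("sample.txt")

    # Spark Core: the classic RDD word count, computed in memory across the cluster.
    counts_rdd = (lines.rdd
                  .flatMap(lambda row: row.value.split())
                  .map(lambda w: (w, 1))
                  .reduceByKey(lambda a, b: a + b))
    print(counts_rdd.take(10))

    # Spark SQL: the same aggregation expressed over a DataFrame.
    words = lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    words.groupBy("word").count().show(10)

    spark.stop()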