ESSENTIALS OF HADOOP (UNIT-1)

TABLE OF CONTENTS
What is Big Data and where is it produced? Rise of Big Data, Hadoop vs traditional systems, Limitations and solutions of the existing data analytics architecture, Attributes of Big Data, Types of data, Other technologies vs Big Data. Hadoop Architecture and HDFS - What is Hadoop? Hadoop history, Distributed processing system, Core components of Hadoop, HDFS architecture, Hadoop master-slave architecture, Daemon types - Name Node, Data Node, Secondary Name Node.

WHAT IS BIG DATA?
Big data refers to large and complex datasets that are too vast to be effectively processed and analyzed using traditional data processing tools. These datasets typically consist of structured, semi-structured, and unstructured data from various sources, such as social media, sensors, devices, and transaction records. Big data is characterized by the volume, velocity, and variety of data, which pose challenges in terms of storage, processing, and analysis.
To effectively harness the potential of big data, organizations leverage advanced technologies and analytics tools, such as machine learning, artificial intelligence, and data mining. By analyzing big data, businesses can gain valuable insights, identify patterns and trends, make informed decisions, and optimize processes. Big data analytics enables organizations to improve customer experiences, enhance operational efficiency, drive innovation, and gain a competitive edge in the market.
Overall, big data plays a crucial role in enabling organizations to extract meaningful information from vast amounts of data, leading to data-driven decision-making and strategic insights that drive business growth and success.

WHERE IS BIG DATA PRODUCED?
Big data is produced by many sources and industries, including but not limited to:
1. Social media platforms: Platforms like Facebook, Twitter, Instagram, and LinkedIn generate vast amounts of data in the form of user interactions, posts, comments, likes, shares, and more.
2. E-commerce websites: Online retailers such as Amazon, Alibaba, and eBay collect data on customer behavior, purchase history, preferences, and browsing patterns.
3. IoT devices: Internet of Things (IoT) devices such as smart sensors, wearables, connected appliances, and industrial equipment generate massive amounts of data through continuous monitoring and data collection.
4. Financial institutions: Banks, insurance companies, and other financial institutions produce large volumes of data related to transactions, customer accounts, market trends, and risk analysis.
5. Healthcare sector: Hospitals, clinics, and healthcare providers generate significant amounts of data through electronic health records, medical imaging, patient monitoring systems, and clinical research.
6. Transportation and logistics: Companies in the transportation and logistics industry collect data on vehicle tracking, route optimization, supply chain management, and delivery operations.
These are just a few examples of where big data is produced, highlighting the diverse sources and industries that contribute to the generation of large and complex datasets.

RISE OF BIG DATA
The rise of big data refers to the emergence and growth of the big data industry, which involves collecting, storing, processing, analyzing, and using large and complex sets of information for various purposes.
Some of the factors that contributed to the rise of big data are:
▪ The rapid increase in the amount of digital data generated by various sources, such as social media, sensors, cameras, the web, transactions, etc.
▪ The advancement of hardware and software technologies that enable storing, processing, and analyzing big data, such as Moore's law, cloud computing, Hadoop, Spark, etc.
▪ The need for data-driven decision-making, innovation, and personalization in various domains, such as business, science, engineering, medicine, ecology, etc.
▪ The availability of new methods and tools for data analysis, such as data mining, data science, machine learning, artificial intelligence, etc.
The rise of big data also refers to the exponential growth in the volume, velocity, and variety of data being generated and collected in today's digital world. This phenomenon is driven by the increasing use of digital technologies, such as social media, mobile devices, Internet of Things (IoT) devices, and online transactions, which generate vast amounts of data on a daily basis. Big data encompasses structured and unstructured data sets that are too large and complex to be processed using traditional data processing applications.
Organizations across various industries are leveraging big data to gain valuable insights, make data-driven decisions, improve operational efficiency, and enhance customer experiences. By analyzing and interpreting large data sets, businesses can identify trends, patterns, and correlations that were previously hidden, leading to more informed decision-making and strategic planning. The rise of big data has also led to the development of advanced technologies and tools, such as data analytics, machine learning, artificial intelligence, and cloud computing, which enable organizations to store, process, and analyze massive amounts of data in real time. As big data continues to grow in importance, businesses are increasingly investing in data management and analytics capabilities to harness the power of data and drive innovation and competitive advantage.
Big data has many benefits, such as providing insights, patterns, trends, predictions, and solutions for various problems and opportunities. However, big data also poses many challenges, such as security, privacy, quality, ethics, and governance. Therefore, big data requires careful and responsible management and use.

COMPARE HADOOP vs TRADITIONAL SYSTEMS
RDBMS (Relational Database Management System): An RDBMS is an information management system based on a data model. In an RDBMS, tables are used for information storage: each row of a table represents a record and each column represents an attribute of the data. The organization of data and its manipulation processes distinguish an RDBMS from other databases. An RDBMS ensures the ACID (atomicity, consistency, isolation, durability) properties required for designing a database. The purpose of an RDBMS is to store, manage, and retrieve data as quickly and reliably as possible.
Hadoop: Hadoop is an open-source software framework used for storing data and running applications on a cluster of commodity hardware. It provides large storage capacity and high processing power, and it can manage multiple concurrent processes at the same time. It is used in predictive analytics, data mining, and machine learning. It can handle both structured and unstructured data, and it is more flexible in storing, processing, and managing data than a traditional RDBMS.
Unlike traditional systems, Hadoop enables multiple analytical processes to run on the same data at the same time, and it supports scalability very flexibly. In summary, Hadoop is better suited for big data environments where there is a need to process large volumes of diverse data, while traditional systems are more appropriate for environments that require high data integrity and deal primarily with structured data.
Hadoop and traditional database systems (like an RDBMS) are designed for different types of data processing. Here is a comparison between the two:
Hadoop:
▪ Purpose: Designed to handle large volumes of structured and unstructured data.
▪ Scalability: Highly scalable; can expand easily to handle more data.
▪ Data Processing: Can process both structured and unstructured data efficiently.
▪ Cost: Generally cost-effective, as it is open-source software.
▪ Flexibility: Supports multiple analytical processes on the same data simultaneously.
▪ Data Schema: Dynamic; does not require data normalization.
▪ Integrity: Lower data integrity compared to an RDBMS.
Traditional Systems (RDBMS):
▪ Purpose: Primarily used for structured data storage, manipulation, and retrieval.
▪ Scalability: Less scalable than Hadoop.
▪ Data Processing: Best suited for structured data and OLTP (Online Transaction Processing) environments.
▪ Cost: Can be expensive due to software licensing fees.
▪ Flexibility: Less flexible; typically does not allow multiple analytical processes on the same data at the same time.
▪ Data Schema: Static; requires data normalization.
▪ Integrity: High data integrity.
Below is a table of differences between RDBMS and Hadoop:
Aspect | RDBMS | Hadoop
Data handled | Structured data only | Structured, semi-structured, and unstructured data
Schema | Static; requires normalization | Dynamic; no normalization required
Scalability | Limited | Highly scalable on commodity hardware
Processing | OLTP / transactional workloads | Batch and analytical processing of large data volumes
Cost | Licensed software; can be expensive | Open source; cost-effective hardware
Data integrity | High | Lower

LIMITATIONS AND SOLUTIONS OF EXISTING DATA ANALYTICS ARCHITECTURE
No organisation can function without huge amounts of data these days, but there are some limitations and challenges of big data that companies encounter. These include data quality, storage, a lack of data science professionals, validating data, and accumulating data from different sources.
❖ Lack of knowledgeable professionals: These professionals include data scientists, data analysts, and data engineers who are experienced in working with the tools and making sense of huge data sets. Solution: Invest more in the recruitment of skilled professionals.
❖ Lack of proper understanding of massive data: Companies fail in their big data initiatives due to insufficient understanding. For example, if employees do not understand the importance of data storage, they might not keep backups of sensitive data. Solution: Basic training programs must be arranged for all employees who handle data regularly.
❖ Data growth issues: As data sets grow exponentially over time, they become challenging to handle. Data and analytics fuel digital business and play a major role in the future survival of organizations worldwide.
❖ Fault tolerance: Fault tolerance is another technical challenge, and fault-tolerant computing is extremely hard, often involving iterative algorithms. Newer technologies such as cloud computing and big data platforms are designed so that whenever a failure occurs, the damage stays within an acceptable threshold, i.e., the whole task does not have to begin from scratch.
❖ Confusion during big data tool selection: These confusions bother companies, and sometimes they are unable to find the answers. They end up making poor decisions and selecting an inappropriate technology. Solution: The best way to go about it is to seek professional help.
❖ Data security: Securing these huge data sets is one of the daunting challenges of big data. Companies can lose up to $3.7 million for a stolen record or a data breach. Solution: Companies should recruit more cybersecurity professionals to guard their data.
❖ Integrating data from a variety of sources: Data in a corporation comes from various sources such as social media pages, customer logs, financial reports, e-mails, and reports created by employees. Combining all of this data to prepare reports can be a challenging task. Solution: Integration problems can be solved by purchasing the proper integration tools.
❖ Quality of data: Collecting and storing a large amount of data comes at a cost, and big companies, business leaders, and IT leaders always want large data storage. Solution: Check and fix data quality issues constantly. Duplicate entries and typos are typical, especially when data originates from multiple sources. An intelligent data identifier that recognizes duplicates with minor deviations and reports probable mistakes helps ensure the quality of the collected data and improves the accuracy of the business insights derived from data analysis.

ATTRIBUTES OF BIG DATA
There are five V's of big data that explain its characteristics:
▪ Volume
▪ Variety
▪ Velocity
▪ Veracity
▪ Value
▪ Volume: The name big data itself relates to enormous size. Big data is a vast 'volume' of data generated daily from many sources such as business processes, machines, social media platforms, networks, human interactions, and many more. For example, Facebook generates approximately a billion messages, records around 4.5 billion 'like' clicks, and receives more than 350 million new posts each day. Big data technologies can handle such large amounts of data.
▪ Variety: Big data can be structured, unstructured, or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data arrives in many forms: PDFs, emails, audio, posts, photos, videos, etc. For example, web server logs, i.e., log files created and maintained by a server that contain a list of activities.
▪ Velocity: Velocity plays an important role compared to the other attributes. Velocity refers to the speed at which data is created, often in real time. It covers the speed of incoming data streams, their rate of change, and bursts of activity. A primary aspect of big data is providing the demanded data rapidly. Big data velocity deals with the speed at which data flows in from sources such as application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
▪ Veracity: Veracity means how reliable the data is. There are many ways to filter or translate the data, and veracity is about being able to handle and manage data effectively. Big data is also essential in business development. Example: Facebook posts with hashtags.
▪ Value: Value is an essential characteristic of big data. It is not just the data that we process or store that matters; it is the valuable and reliable data that we store, process, and also analyse.

TYPES OF DATA
Big data can be structured, unstructured, or semi-structured, collected from different sources.
In the past, data was collected only from databases and spreadsheets, but these days data arrives in many forms: PDFs, emails, audio, social media posts, photos, videos, etc. The data is categorized as follows (a small sketch contrasting these categories follows the list):
1. Structured data: Structured data has a well-defined schema, with all the required columns, and is in tabular form. Structured data is stored in relational database management systems, and OLTP (Online Transaction Processing) systems are built to work with it; it is stored in relations, i.e., tables. Examples include names, dates, and addresses.
2. Semi-structured data: In semi-structured data, the schema is not strictly defined, e.g., JSON, XML, CSV, TSV, and email. Examples include XML data files and JSON documents.
3. Unstructured data: All unstructured files, such as log files, audio files, and image files, are included in unstructured data. Some organizations have a lot of data available, but they do not know how to derive value from it because the data is raw. Examples include mobile data, social media posts, and satellite imagery. A stark example of unstructured data is the output returned by 'Google Search' or 'Yahoo Search.'
4. Quasi-structured data: This format contains textual data with inconsistent formats that can be structured with some effort, time, and tools. Example: web server logs, i.e., a log file created and maintained by a server that contains a list of activities.
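To make these categories concrete, here is a minimal, self-contained Java sketch (added for illustration, not part of the original notes); the record values and field layout are purely hypothetical.

public class DataTypeExamples {
    public static void main(String[] args) {
        // Structured: fixed schema, every record has the same columns (an RDBMS-style row).
        String structuredRow = "101,Asha,2024-01-15,Hyderabad";
        String[] columns = structuredRow.split(",");   // id, name, date, city at known positions

        // Semi-structured: self-describing keys, but fields may vary between records (JSON/XML/email).
        // No schema is enforced up front; a parser such as Jackson would be needed to read it.
        String semiStructured = "{\"id\": 101, \"name\": \"Asha\", \"tags\": [\"prime\", \"mobile\"]}";

        // Unstructured: no predefined schema at all; free text, images, audio, video, log content.
        String unstructured = "Loving the new phone!! camera is amazing #upgrade";

        System.out.println("Structured columns     : " + columns.length);
        System.out.println("Semi-structured record : " + semiStructured);
        System.out.println("Unstructured sample    : " + unstructured);
    }
}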
OTHER TECHNOLOGIES VERSUS BIG DATA
Big data is a powerful technology that interacts with various other technologies, each with its own focus and capabilities. Here is a comparison of big data with some other key technologies:
1. Data analytics: Data analytics is the process of examining data sets to draw conclusions and make informed decisions. It encompasses techniques such as descriptive, diagnostic, predictive, and prescriptive analytics to extract insights from data.
2. Cloud computing: While big data deals with the challenges of processing and analyzing large data sets, cloud computing provides the infrastructure and services necessary for storage and accessibility. Cloud computing offers scalability and on-demand resources, which are essential for handling big data workloads.
3. Artificial Intelligence (AI) and Machine Learning (ML): AI and ML technologies consume big data to learn and make intelligent decisions. Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It is often used in conjunction with big data to develop predictive models and algorithms for data analysis. The more data these technologies have, the more accurate their predictions and actions can be; big data provides the volume of data needed for AI and ML algorithms to be effective.
4. Internet of Things (IoT): IoT refers to the network of interconnected devices that collect and exchange data. The massive amounts of data generated by IoT devices contribute to the volume and velocity of big data, requiring advanced analytics to derive meaningful insights. IoT devices generate a significant portion of the data that constitutes big data, and the data collected from them can be analyzed to gain insights and drive decision-making processes.
5. Business intelligence (BI): BI refers to the technologies, applications, and practices for collecting, integrating, analyzing, and presenting business information. It helps organizations make data-driven decisions by providing historical, current, and predictive views of business operations.
6. Blockchain: Big data requires trust and verifiability when dealing with large volumes of data from various sources. Blockchain technology offers decentralization, immutability, and transparency, which can enhance the security and reliability of big data solutions.
Each of these technologies has a symbiotic relationship with big data, often enhancing and being enhanced by the capabilities that big data offers. The interconnectedness of these technologies is reshaping the future of computing and business strategies.

HADOOP ARCHITECTURE AND HDFS
What is Hadoop?
❖ Hadoop - Introduction
Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to store and process very large data sets. It is an open-source software framework with a huge data storage facility.
Components present in Hadoop:
▪ Name Node (NN)
▪ Job Tracker (JT)
▪ Secondary Name Node (SNN)
▪ Data Node (DN)
▪ Task Tracker (TT)
PHYSICAL ARCHITECTURE OF HADOOP
[Diagram: clients submit requests to the master (Name Node with Job Tracker, plus a Secondary Name Node); slave machines each run a Data Node with a Task Tracker.]
ARCHITECTURE OF HADOOP PROCESS:
▪ A client submits its job request to Hadoop.
▪ This request is accepted by the Name Node.
▪ The Name Node is the master in Hadoop.
▪ It also contains a Job Tracker, which is again part of the master.
▪ The job is divided into tasks, and the Job Tracker assigns them to the Data Nodes.
▪ The Data Node is a slave, and it hosts a Task Tracker which actually performs the task.
▪ The Job Tracker continuously communicates with each Task Tracker, and if a Task Tracker ever fails to reply, the Job Tracker assumes it may have crashed.
❖ Name Node:
▪ It is the master of HDFS.
▪ It hosts the Job Tracker, which keeps track of the files distributed to the Data Nodes.
▪ It is a single point of failure.
❖ Data Node:
▪ It is a slave of HDFS.
▪ It takes the client's block addresses from the Name Node.
▪ For replication purposes it can communicate with other Data Nodes.
▪ The Data Node informs the Name Node of local changes/updates (see the client read sketch after this section).
❖ Job Tracker:
▪ It determines the files to process.
▪ Only one Job Tracker per Hadoop cluster is allowed.
▪ It runs on a server as the master node of the cluster.
❖ Task Tracker:
▪ There is a single Task Tracker per slave node.
▪ It may handle multiple tasks in parallel.
▪ Individual tasks are assigned by the Job Tracker to the Task Tracker.
▪ The Job Tracker continuously communicates with the Task Tracker, and if it ever fails to reply, the Job Tracker assumes the Task Tracker has crashed.
❖ Secondary Name Node:
▪ State monitoring is done by the SNN.
▪ Every cluster has one SNN.
▪ The SNN resides on its own machine.
▪ No other daemon (DN or TT) can run on that machine or server.
▪ The SNN takes snapshots of the HDFS metadata at regular intervals.
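The client/Name Node/Data Node interaction described above can be sketched with the standard HDFS Java API: the client asks the Name Node only for metadata (block locations), and the file contents are then streamed from the Data Nodes. This is a minimal illustration added to these notes, assuming a reachable cluster; the Name Node address and file path below are placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS points at the Name Node; host and port here are placeholders.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        // FileSystem.get() contacts the Name Node for metadata only;
        // the block data itself is read from the Data Nodes that hold the replicas.
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");   // hypothetical file in HDFS

        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);           // print each line of the file
            }
        }
        fs.close();
    }
}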
COMPONENTS OF HADOOP (ARCHITECTURE)
As we know, Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to maintain and store big data. Hadoop works on the MapReduce programming algorithm that was introduced by Google. Today, many big-brand companies use Hadoop in their organizations to deal with big data, for example Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists of four components:
▪ MapReduce
▪ HDFS (Hadoop Distributed File System)
▪ YARN (Yet Another Resource Negotiator)
▪ Common Utilities or Hadoop Common
1. HDFS (Hadoop Distributed File System): HDFS is a primary, major component of the Hadoop ecosystem. It is responsible for storing large data sets of structured and unstructured data across various nodes, and it maintains the metadata in the form of log files. HDFS consists of two further components, whose roles were described above:
1. Name Node (master)
2. Data Node (slave)
2. YARN (Yet Another Resource Negotiator): As the name implies, YARN helps to manage the resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system. YARN consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
▪ The Resource Manager has the privilege of allocating resources for the applications in the system.
▪ The Node Manager works on the allocation of resources such as CPU, memory, and bandwidth per machine, and later acknowledges the Resource Manager.
▪ The Application Manager works as an interface between the Resource Manager and the Node Manager and performs negotiations as required by the two.
3. MapReduce: By making use of distributed and parallel algorithms, MapReduce makes it possible to carry the processing logic over to the data and helps to write applications that transform big data sets into manageable ones. MapReduce uses two functions, Map() and Reduce(), whose tasks are (a minimal word-count sketch appears after this list):
▪ Map() performs sorting and filtering of the data, thereby organising it into groups. Map generates a key-value-pair result, which is later processed by the Reduce() method.
▪ Reduce(), as the name suggests, performs summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
4. Hadoop Common: Hadoop Common is the set of Java libraries and files needed by all the other components present in a Hadoop cluster. Hadoop assumes that hardware failure in a cluster is common, so failures need to be handled automatically, in software, by the Hadoop framework.
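As a concrete illustration of Map() and Reduce(), here is a minimal word-count job written against the standard Hadoop MapReduce Java API. It is a sketch added to these notes rather than part of the original material; the input and output HDFS paths are supplied as command-line arguments, and the output directory must not already exist.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(): emits a (word, 1) pair for every word in an input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce(): sums the counts emitted for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // combiner pre-aggregates map output locally
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be packaged into a jar and launched with the hadoop jar command, with the framework distributing the map and reduce tasks across the cluster.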
FEATURES AND ADVANTAGES OF HADOOP
❑ Features of Hadoop:
1. Open source: Hadoop is open source, which means it is free to use. Since it is an open-source project, the source code is available online for anyone to understand or modify as per their industry requirements.
2. Highly scalable cluster: Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in a cluster and processed in parallel, and the number of these machines or nodes can be increased or decreased as per the enterprise's requirements.
3. Fault tolerance: Hadoop uses commodity hardware (inexpensive systems) which can crash at any moment. In Hadoop, data is replicated on various Data Nodes in the cluster, which ensures the availability of data if any of the systems crashes.
4. Easy to use: Hadoop is very easy to use, since developers need not worry about any of the distributed processing work; it is managed by Hadoop itself. The Hadoop ecosystem is also very large and comes with many tools such as Hive, Pig, etc.
5. Data locality: The concept of data locality is used to make Hadoop processing fast. With data locality, the computation logic is moved near the data rather than moving the data to the computation logic.
6. Cost-effective: Hadoop is open source and uses cost-effective commodity hardware, which provides a cost-efficient model, unlike traditional relational databases that require expensive hardware and high-end processors to deal with big data.
❑ Advantages:
▪ Storage and processing: Hadoop can store and process petabytes of data at high speed, making it ideal for big data applications.
▪ Data locality: It moves computation to the data rather than the other way around, reducing network congestion and increasing the overall throughput of the system.
▪ High throughput: Hadoop provides high-throughput access to application data, which is essential for big data applications.
▪ Resilience to failure: Its design allows the system to continue operating uninterrupted even if some of the nodes fail.
▪ Open source: Being open source, Hadoop is continuously improved by a community of developers, which also reduces the total cost of ownership.
These features and advantages make Hadoop a popular choice for organizations dealing with large-scale data processing and analytics.

HADOOP DISTRIBUTED FILE SYSTEM
The Hadoop Distributed File System (HDFS) is a distributed file system that handles large data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. With growing data velocity, the data size easily outgrows the storage limit of a single machine; a solution is to store the data across a network of machines. Such file systems are called distributed file systems.
❖ Some important features of HDFS (Hadoop Distributed File System):
▪ It is easy to access the files stored in HDFS.
▪ HDFS provides high availability and fault tolerance.
▪ It provides scalability to scale nodes up or down as per our requirement.
▪ Data is stored in a distributed manner, i.e., various Data Nodes are responsible for storing the data.
▪ HDFS provides replication, so there is no fear of data loss (see the sketch after this list).
▪ HDFS provides high reliability, as it can store data in the range of petabytes.
▪ HDFS has built-in servers in the Name Node and Data Node that help to easily retrieve cluster information.
▪ It provides high throughput.
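The replication and block-based storage listed above are recorded per file and can be inspected through the HDFS Java API. The following is a small illustrative sketch, not from the original notes; it assumes the cluster configuration files (core-site.xml/hdfs-site.xml) are on the classpath, and the file path is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and other settings from the cluster configuration files.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/sample.txt");     // hypothetical file already stored in HDFS
        FileStatus status = fs.getFileStatus(file);

        // Block size and replication factor are per-file attributes in HDFS.
        System.out.println("Length (bytes)     : " + status.getLen());
        System.out.println("Block size (bytes) : " + status.getBlockSize());
        System.out.println("Replication factor : " + status.getReplication());
        fs.close();
    }
}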
DAEMON TYPES OF HADOOP / MASTER-SLAVE ARCHITECTURE
❖ HDFS storage daemons
As we know, Hadoop works on the MapReduce algorithm, which follows a master-slave architecture; HDFS has a Name Node and Data Nodes that work in a similar pattern.
1. Name Node (master)
2. Data Node (slave)
1. Name Node: The Name Node works as the master in a Hadoop cluster and guides the Data Nodes (slaves). The Name Node is mainly used for storing the metadata, i.e., the data about the data. Metadata can be the transaction logs that keep track of user activity in a Hadoop cluster; it can also be the name of a file, its size, and information about the location (block number, block IDs) of the Data Nodes, which the Name Node stores to find the closest Data Node for faster communication. The Name Node instructs the Data Nodes to perform operations such as delete, create, replicate, etc. As the Name Node works as the master, it should have high RAM and processing power in order to maintain and guide all the slaves in the Hadoop cluster. The Name Node receives heartbeat signals and block reports from all the slaves, i.e., the Data Nodes.
2. Data Node: Data Nodes work as slaves. Data Nodes are mainly utilized for storing the data in a Hadoop cluster; the number of Data Nodes can range from 1 to 500 or even more, and the more Data Nodes a Hadoop cluster has, the more data can be stored. It is therefore advised that a Data Node should have a high storage capacity to store a large number of file blocks. A Data Node performs operations like creation, deletion, etc. according to the instructions provided by the Name Node.

FEATURES AND GOALS OF HDFS
❑ Features of HDFS:
▪ Highly scalable: HDFS can scale to hundreds of nodes in a single cluster, allowing for significant expansion in storage capacity.
▪ Replication: To ensure data availability and durability, HDFS maintains copies of data on different machines.
▪ Fault tolerance: HDFS is robust against system failures. If any machine fails, another machine containing a copy of the data automatically becomes active.
▪ Distributed data storage: Data is divided into blocks and stored across multiple nodes, enabling efficient data processing.
▪ Portable: The system is designed to be easily portable from one platform to another.
❑ Goals of HDFS:
▪ Handling hardware failure: HDFS aims to recover quickly from server machine failures, ensuring minimal disruption to operations.
▪ Streaming data access: Applications running on HDFS require streaming access to their datasets, which HDFS provides.
▪ Coherence model: HDFS follows a write-once-read-many approach, meaning that once a file is created, it need not be changed except for appends and truncates.
These features and goals make HDFS a foundational component for distributed applications, particularly those within the Hadoop ecosystem.

WHERE TO USE & NOT USE HDFS
❑ Where to use HDFS:
▪ Large data sets: HDFS excels at storing and managing very large files, making it ideal for big data applications.
▪ Batch processing: It is designed for batch processing rather than real-time processing, so it is great for jobs that process large volumes of data at once.
▪ Data resilience: If you need a system that can handle hardware failure without data loss, HDFS's replication model provides robust data protection.
▪ Commodity hardware: HDFS can be deployed on low-cost hardware, making it cost-effective for scaling out storage needs.
❑ Where not to use HDFS:
▪ Small files: HDFS is not efficient for storing a large number of small files, because it is designed for large blocks of data.
▪ Low-latency access: Applications requiring fast data access times may not perform well with HDFS due to its high-throughput design.
▪ Random read/write operations: It is not optimized for random read/write operations, which are common in transactional systems.
▪ Real-time processing: HDFS is not suitable for real-time data processing needs.
In summary, HDFS is a good choice for applications that require distributed storage and processing of large data sets with high fault tolerance and data replication. However, it is not the best fit for use cases that demand quick, random access or deal with many small files. For such scenarios, other storage solutions may be more appropriate.

HADOOP HISTORY
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper published by Google. The history of Hadoop can be summarized in the following steps:
In 2002, Doug Cutting and Mike Cafarella started to work on Apache Nutch, an open-source web crawler software project. While working on Apache Nutch, they were dealing with big data.
Storing that data was very costly, and this became a major problem for the project. This problem became one of the important reasons for the emergence of Hadoop.
In 2003, Google introduced a file system known as GFS (Google File System). It is a proprietary distributed file system developed to provide efficient access to data.
In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large clusters.
In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This file system also included MapReduce.
In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year. Doug Cutting named his project Hadoop after his son's toy elephant.
In 2007, Yahoo ran two clusters of 1,000 machines.
In 2008, Hadoop became the fastest system to sort 1 terabyte of data, doing so on a 900-node cluster in 209 seconds.
In 2013, Hadoop 2.2 was released.
In 2017, Hadoop 3.0 was released.