Hadoop Unit 1 Essentials
15 Questions

Created by
@SupportedArtNouveau3510

Questions and Answers

What is one of the advantages of using Hadoop's Distributed File System (HDFS)?

  • High Fault Tolerance (correct)
  • Random Read/Write Operations
  • Real-Time Processing
  • Low Cost Hardware

Hadoop's Name Node primarily stores metadata about the data in a Hadoop cluster.

  True

What does HDFS stand for?

  Hadoop Distributed File System

Hadoop is primarily known for its ability to store and process __________ of data at high speed.

  petabytes

Match the type of HDFS daemon with its role:

  Name Node = Master in Hadoop cluster, stores metadata
  Data Node = Slave in Hadoop cluster, stores actual data

What are the primary characteristics of big data?

  Variety, Veracity, Value, Velocity

Which data type includes files like log files, audio files, and image files?

  Unstructured Data

Name one of the components present in Hadoop alongside Name Node, Job Tracker, Secondary Name Node, and Data Node.

  Task Tracker

Hadoop is a framework written in Python.

  False

Match the components with their descriptions:

  Resource Manager = Allocates resources for the application
  Node Manager = Works on allocation of resources like CPU and memory
  Application Manager = Interface between resource and node managers, performs negotiations

What does big data refer to?

  Big data refers to large and complex datasets that are too vast to be effectively processed and analyzed using traditional data processing tools.

Where is big data produced?

  All of the above

Hadoop is an open-source software framework used for storing data and running applications.

  True

Big data encompasses structured and unstructured data sets that are too large and complex to be processed using traditional data ________ applications.

  processing

Match the following attributes of big data with their explanations:

  Volume = Relates to the enormous size of data
  Variety = Refers to the diversity of data types and sources
  Velocity = Describes the speed at which data is generated and processed
  Veracity = Focuses on the accuracy and reliability of data
  Value = Concerns the insights and benefits derived from analyzing big data

    Study Notes

    Big Data

    • Big data refers to large and complex datasets that are too vast to be effectively processed and analyzed using traditional data processing tools.
    • Characteristics of big data include volume, velocity, and variety.
    • Big data is produced by a wide range of sources and industries, such as:
      • Social media platforms (e.g., Facebook, Twitter, Instagram)
      • E-commerce websites (e.g., Amazon, Alibaba, eBay)
      • IoT devices (e.g., smart sensors, wearables, connected appliances)
      • Financial institutions (e.g., banks, insurance companies)
      • Healthcare sector (e.g., hospitals, clinics, healthcare providers)
      • Transportation and logistics (e.g., airlines, shipping and delivery companies)

    Rise of Big Data

    • The rise of big data refers to the emergence and growth of the big data industry, which involves collecting, storing, processing, analyzing, and using large and complex sets of information.
    • Factors that contributed to the rise of big data include:
      • Rapid increase in the amount of digital data generated by various sources
      • Advancements in hardware and software technologies (e.g., cloud computing, Hadoop, Spark)
      • Need for data-driven decision-making, innovation, and personalization in various domains

    Hadoop vs Traditional Systems

    • Hadoop is an open-source software framework used for storing data and running applications on clusters of commodity hardware.
    • Hadoop is better suited for big data environments where there is a need to process large volumes of diverse data.
    • Traditional systems (e.g., RDBMS) are more appropriate for environments that require high data integrity and are dealing primarily with structured data.
    • Comparison of Hadoop and traditional systems:
      • Hadoop:
        • Purpose: Designed to handle large volumes of structured and unstructured data
        • Scalability: Highly scalable
        • Data Processing: Can process both structured and unstructured data efficiently
        • Cost: Generally cost-effective
        • Flexibility: Supports multiple analytical processes on the same data simultaneously
        • Data Schema: Dynamic
        • Integrity: Lower data integrity
      • Traditional Systems (RDBMS):
        • Purpose: Primarily used for structured data storage, manipulation, and retrieval
        • Scalability: Less scalable
        • Data Processing: Best suited for structured data and OLTP environments
        • Cost: Can be expensive
        • Flexibility: Less flexible
        • Data Schema: Static
        • Integrity: High data integrity

    Limitations and Solutions of Existing Data Analytics Architecture

    • Limitations of big data analytics include:
      • Lack of knowledgeable professionals
      • Lack of proper understanding of massive data
      • Data growth issues
      • Fault tolerance
      • Confusion in big data tool selection
      • Data security
      • Integrity of data from various sources
    • Solutions to these limitations include:
      • Investing in recruitment of skilled professionals
      • Basic training programs for employees
      • Seeking professional help for big data tool selection
      • Recruiting cybersecurity professionals
      • Integrating data from various sources
      • Checking and fixing data quality issues

    Attributes of Big Data

    • The five V's of Big Data:
      • Volume: Large amounts of data generated from various sources
      • Variety: Structured, semi-structured, and unstructured data
      • Velocity: Speed at which data is created and processed
      • Veracity: Reliability of data
      • Value: Value of data in terms of insights and decision-making

    Types of Data

    • Big data can be categorized into:
      • Structured data (e.g., names, dates, addresses)
      • Semi-structured data (e.g., JSON, XML, CSV, TSV)
      • Unstructured data (e.g., images, audio files, videos)
      • Quasi-structured data (e.g., web server logs)

    Other Technologies vs Big Data

    • Big data interacts with various other technologies, each with its own unique focus and capabilities:
      • Data analytics
      • Cloud computing
      • Artificial Intelligence (AI) and Machine Learning (ML)
      • Internet of Things (IoT)
      • Business intelligence (BI)
      • Blockchain

    Hadoop Architecture and HDFS

    • Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to maintain and store big data.
    • Components of Hadoop:
      • Name Node (NN)
      • Job Tracker (JT)
      • Secondary Name Node (SNN)
      • Data Node (DN)
      • Task Tracker (TT)
    • Process of Hadoop:
      • Client provides job request to Hadoop
      • Job request is accepted by Name Node
      • Job is divided into tasks and assigned to Task Trackers on the Data Nodes
      • Task Trackers perform the tasks and report back to the Job Tracker
    • Architecture of Hadoop:
      • Name Node: Master of HDFS, contains job tracker
      • Data Node: Slave of HDFS, takes client block address from Name Node
      • Job Tracker: Determines which files to process; runs on a server as the master node

    Hadoop Architecture

    • Hadoop is a framework that utilizes a large cluster of commodity hardware to maintain and store big data.
    • Hadoop works on MapReduce Programming Algorithm that was introduced by Google.
    • Hadoop Architecture mainly consists of 4 components:
      • MapReduce
      • HDFS (Hadoop Distributed File System)
      • YARN (Yet Another Resource Negotiator)
      • Common Utilities or Hadoop Common

    HDFS (Hadoop Distributed File System)

    • HDFS is a primary component of the Hadoop ecosystem and is responsible for storing large data sets of structured and unstructured data across various nodes.
    • HDFS consists of two components:
      • Name Node (Master)
      • Data Node (Slave)

    Name Node (Master)

    • It is the master of HDFS.
    • It has a job tracker that keeps track of files distributed to data nodes.
    • It is a single point of failure for HDFS.

    Data Node (Slave)

    • It is a slave of HDFS.
    • It takes client block address from Name Node.
    • For replication purposes, it can communicate with other data nodes.
    • Data node informs local changes/updates to Name node.
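
    To make the Name Node / Data Node split concrete, the sketch below reads a file through the Hadoop Java FileSystem API: the client asks the Name Node for metadata and block locations, and the bytes themselves are streamed from the Data Nodes. This is a minimal, hedged example; the file path is illustrative and a configured cluster (core-site.xml / hdfs-site.xml on the classpath) is assumed.

    ```java
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // metadata requests go to the Name Node
        Path file = new Path("/user/demo/sample.txt"); // illustrative HDFS path

        // The file's blocks are streamed from the Data Nodes that store them.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
          String line;
          while ((line = reader.readLine()) != null) {
            System.out.println(line);
          }
        }
      }
    }
    ```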

    Task Tracker

    • There is a single task tracker per slave node.
    • It may handle multiple tasks in parallel.
    • Individual tasks are assigned to the task tracker by the job tracker.
    • The job tracker continuously communicates with the task tracker; if the task tracker fails to reply, the job tracker assumes it has crashed.

    Secondary Name Node (SNN)

    • State monitoring is done by SNN.
    • Every cluster has one SNN.
    • SNN resides on its own machine.
    • No other daemon (DN or TT) runs on that machine or server.
    • SNN takes a snapshot of HDFS metadata at constant intervals.

    YARN (Yet Another Resource Negotiator)

    • YARN helps to manage resources across clusters.
    • YARN consists of three major components:
      • Resource Manager
      • Node Manager
      • Application Manager

    Resource Manager

    • It is responsible for allocating resources to the applications in the system.

    Node Manager

    • It works on the allocation of resources such as CPU, memory, and bandwidth per machine, and then reports back to the Resource Manager.

    Application Manager

    • It works as an interface between the Resource Manager and Node Manager and performs the negotiations required between the two.
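
    As an illustration of how client code talks to these YARN daemons, here is a minimal sketch that asks the Resource Manager for the list of applications via the YarnClient API. It assumes the hadoop-yarn-client library is on the classpath and that yarn-site.xml points at a reachable Resource Manager; the printed fields are just an example.

    ```java
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;

    public class ListYarnApps {
      public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());   // reads yarn-site.xml for the Resource Manager address
        yarnClient.start();

        // The Resource Manager tracks every application submitted to the cluster.
        for (ApplicationReport app : yarnClient.getApplications()) {
          System.out.println(app.getApplicationId() + "  " + app.getName()
              + "  " + app.getYarnApplicationState());
        }
        yarnClient.stop();
      }
    }
    ```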

    Map Reduce

    • Map Reduce makes it possible to carry the processing logic over to the data and helps to write applications that transform big data sets into manageable ones.
    • Map Reduce uses two functions, Map() and Reduce(), whose tasks are:
      • Map() performs sorting and filtering of the data, organizing it into groups. Map() generates key-value pair based results, which are later processed by the Reduce() method.
      • Reduce(), as the name suggests, performs summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
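
    The classic word-count job illustrates these two roles. The sketch below follows the standard Hadoop MapReduce API; the class names are illustrative and the input/output HDFS paths are taken from the command line.

    ```java
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map(): filters/organizes the input, emitting a (word, 1) pair for every word.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce(): summarizes the mapped pairs by summing the counts for each word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
    ```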

    Hadoop Common and Common Architecture

    • Hadoop Common is the set of shared Java libraries and utilities required by all of the other components present in Hadoop clusters.
    • Hadoop Common assumes that hardware failures in a Hadoop cluster are common, so they need to be handled automatically, in software, by the Hadoop framework.
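
    As a small illustration of that shared-library role, the snippet below uses the Configuration class from Hadoop Common, which HDFS, YARN, and MapReduce all build on to read cluster settings. It is only a sketch; the extra resource path and the property defaults are illustrative.

    ```java
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class ShowConfig {
      public static void main(String[] args) {
        Configuration conf = new Configuration();                     // loads core-default.xml / core-site.xml
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml")); // illustrative extra resource

        // fs.defaultFS tells every Hadoop client which file system (e.g. HDFS) to use by default.
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "file:///"));
        System.out.println("dfs.replication = " + conf.get("dfs.replication", "3"));
      }
    }
    ```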

    Features and Advantages of Hadoop

    • Features of Hadoop:
      • Open source
      • Highly Scalable cluster
      • Fault tolerance
      • Easy to use
      • Data locality
      • Cost-Effective
    • Advantages of Hadoop:
      • Storage and Processing
      • Data Locality
      • High Throughput
      • Resilience to Failure
      • Open Source

    HDFS (Hadoop Distributed File System)

    • HDFS is a distributed file system that handles large data sets running on commodity hardware.
    • It is used to scale a single Apache Hadoop Cluster to hundreds (and even thousands) of nodes.

    Features of HDFS

    • Easy access to files stored in HDFS
    • High availability and fault tolerance
    • Scalability to scale up or scale down nodes as per our requirement
    • Data is stored in a distributed manner, i.e., various Data nodes are responsible for storing the data
    • HDFS provides replication, so there is no fear of data loss
    • HDFS provides high reliability and can store data in the range of petabytes
    • HDFS has built-in servers in the Name Node and Data Nodes that make it easy to retrieve cluster information
    • Provides high throughput
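
    The replication and distribution features can also be observed from client code. The hedged sketch below prints a file's replication factor and the Data Nodes that hold each of its blocks, using the FileSystem API; the path is illustrative and a configured cluster is assumed.

    ```java
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlockInfo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt")); // illustrative path

        // Each block is replicated this many times across Data Nodes.
        System.out.println("Replication factor: " + status.getReplication());

        // Every block of the file is stored on several Data Nodes for fault tolerance.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.println("Block at offset " + block.getOffset()
              + " stored on: " + String.join(", ", block.getHosts()));
        }
      }
    }
    ```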


    Description

    This quiz covers the basics of Hadoop, including Big Data, its types, and Hadoop architecture. It also compares Hadoop with traditional systems and discusses its limitations and solutions.
