Hadoop and Big Data Concepts


24 Questions

What is the primary function of the Namenode in HDFS?

To manage the file system namespace and regulate client access

What is the default block size in HDFS?

64 MB (in Hadoop 1.x; 128 MB in Hadoop 2.x and later)

What is the purpose of the Blockreport sent by the DataNode to the Namenode?

To provide a list of all blocks on the DataNode

What is the main goal of the replica placement policy in HDFS?

To ensure data reliability and availability

What is the function of the DataNode in HDFS?

To manage the data storage of the system and perform read-write operations

What is the purpose of the Heartbeat sent by the DataNode to the Namenode?

To report the availability of the DataNode

What is the minimum unit of data that can be read or written in HDFS?

Block

What is the replication factor in HDFS?

The number of replicas of a block

What is the primary purpose of Hadoop?

To manage and process huge volumes of structured and unstructured data

What is the core component of Hadoop's processing layer?

MapReduce

What is the primary feature of HDFS that enables it to handle large datasets?

Support for petabyte-scale data

What is the purpose of the Namenode in a Hadoop cluster?

To manage the cluster

What is the coherency model of HDFS?

Write-once-read-many

What is the main advantage of Hadoop's distributed computing platform?

It allows for moving computation to the data

What is the primary benefit of HDFS's fault-tolerance feature?

It ensures data availability in case of node failures

What is the key characteristic of Hadoop's data processing approach?

It divides the task into small parts and assigns them to many computers

What is the primary benefit of HDFS's replication policy?

Preventing data loss in case of rack failure

What is the purpose of data locality in Hadoop?

To minimize overall network congestion

What is the role of the JobTracker in the MapReduce framework?

To schedule the jobs' component tasks on the slaves

How does HDFS's replication policy affect the cost of writes?

It increases the cost of writes

What is the typical output of a MapReduce job?

A file stored in a file-system

What is the role of the TaskTracker in the MapReduce framework?

To execute the tasks as directed by the master

What is the primary function of the map tasks in a MapReduce job?

To process the input data-set in a parallel manner

What is the purpose of the job configuration in a MapReduce job?

To specify the input/output locations and supply map and reduce functions

Study Notes

Big Data and Hadoop

  • Big Data is a collection of large datasets that cannot be processed using traditional computing techniques.
  • It spans many areas of business and technology and requires infrastructure that can manage and process huge volumes of structured and unstructured data.
  • Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using the MapReduce algorithm.

Hadoop Distributed File System (HDFS)

  • HDFS is a Distributed File System (DFS) that stores files across multiple hosts and makes them accessible over a computer network.
  • It supports concurrency and includes facilities for transparent replication and fault tolerance.
  • HDFS is based on the Google File System (GFS).
  • Key features of HDFS (a usage sketch follows this list):
    • Support for petabyte-scale data
    • Heterogeneous - can be deployed on different hardware
    • Streaming data access via batch processing
    • Coherency model - Write-once-read-many
    • Data locality - "Moving Computation is Cheaper than Moving Data"
    • Fault-tolerance
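
A minimal sketch of these properties through the standard Java FileSystem API. The Namenode address hdfs://namenode:9000 and the path /tmp/example.txt are placeholder assumptions, not values from these notes; the hadoop-client library is assumed to be on the classpath.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder Namenode URI
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/tmp/example.txt");

            // Write once: HDFS follows a write-once-read-many coherency model.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read many: any client can stream the file back.
            try (FSDataInputStream in = fs.open(path)) {
                byte[] buf = new byte[(int) fs.getFileStatus(path).getLen()];
                in.readFully(buf);
                System.out.println(new String(buf, StandardCharsets.UTF_8));
            }
            fs.close();
        }
    }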

Hadoop Architecture

  • Namenode:
    • Each cluster has one Namenode
    • Runs on a commodity machine with the GNU/Linux operating system and the Namenode software
  • Rack-aware replica placement:
    • Prevents data loss when an entire rack fails and lets reads draw on bandwidth from multiple racks
  • Data locality:
    • Moves computation close to where the actual data resides
    • Minimizes overall network congestion
    • Increases the overall throughput of the system

MapReduce

  • MapReduce workflow:
    • Splits input data-set into independent chunks
    • Processed by map tasks in a completely parallel manner
    • Outputs are sorted and input to reduce tasks
    • Framework takes care of scheduling tasks, monitoring, and re-executing failed tasks
  • MapReduce framework (a WordCount sketch follows this list):
    • Consists of a single master JobTracker and one slave TaskTracker per cluster-node
    • Master is responsible for scheduling jobs, monitoring, and re-executing failed tasks
    • Slaves execute tasks as directed by the master
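
The classic WordCount program from the Hadoop MapReduce tutorial illustrates this workflow end to end: map tasks emit (word, 1) pairs in parallel, the framework sorts and groups them, and reduce tasks sum the counts. It uses the org.apache.hadoop.mapreduce API, which runs on both the classic JobTracker/TaskTracker runtime described above and its YARN successor.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce: sum the counts for each word; the framework sorts
        // and groups keys between the two phases.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            // Job configuration: input/output locations plus the map and reduce classes.
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }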

Namenode and Datanode

  • Namenode:
    • Manages the file system namespace
    • Regulates client's access to files
    • Executes file system namespace operations such as opening, closing, and renaming files and directories
  • Datanode:
    • Commodity hardware running the GNU/Linux operating system and the Datanode software
    • Manages the storage attached to the node it runs on
    • Performs read-write operations as well as block creation, deletion, and replication (see the sketch after this list)
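
A short sketch of this division of labour, assuming a client configured to reach the cluster: the Namenode answers the metadata query below, while the hosts it returns are the DataNodes that physically store each block's replicas.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path(args[0]); // e.g. /tmp/example.txt
            FileStatus status = fs.getFileStatus(path);
            // Metadata comes from the Namenode; the hosts listed are the
            // DataNodes holding each block's replicas.
            for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
            }
        }
    }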

Blocks and Replication

  • Blocks:
    • Files in HDFS are divided into one or more blocks
    • Default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x and later) and can be changed via configuration
  • Replication:
    • Blocks of a file are replicated for fault tolerance
    • Block size and replication factor are configurable per file
    • Namenode makes all decisions regarding replication of blocks
    • Replica placement is rack-aware, balancing data reliability, availability, and network bandwidth utilization (a configuration sketch follows)
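
A sketch of both configuration paths. The keys dfs.blocksize and dfs.replication are the standard hdfs-site.xml names in Hadoop 2.x, and the path is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cluster-wide defaults (keys from hdfs-site.xml):
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB block size
            conf.setInt("dfs.replication", 3);                 // 3 replicas per block
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/tmp/example.txt"); // placeholder path
            // Per-file override: replication factor 2, 64 MB blocks.
            fs.create(path, true, 4096, (short) 2, 64L * 1024 * 1024).close();

            // Replication can also be changed after the fact; the Namenode
            // schedules the extra copies (or deletions) on DataNodes.
            fs.setReplication(path, (short) 3);
        }
    }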

