Hadoop Setup and Configuration Quiz

Study Notes

Daemons like HDFS, YARN, and MapReduce run as separate Java processes; development mode is useful for testing.
Fully Distributed Mode requires at least two machines to form a Hadoop cluster.
Hadoop installation can be verified through a series of steps, starting with setting up the NameNode using hdfs namenode -format, producing startup messages indicating successful initialization.

To verify HDFS, execute $ start-dfs.sh, which initiates the NameNode and DataNode, providing logs for monitoring.
To verify YARN, run $ start-yarn.sh, which starts YARN daemons and ResourceManager logs.
Hadoop services can be accessed via browser; the default port for Hadoop is 50070 (http://localhost:50070/), while cluster applications can be accessed on port 8088 (http://localhost:8088).

Big Data plays a vital role in decision-making, uncovering hidden insights across various domains like healthcare and finance.
The growth in data captured from mobile devices and multimedia doubles yearly, leading to the necessity for pre-processing due to the structured and unstructured data categories.
The advancements in data science and cloud computing foster better storage and mining of Big Data, enabling predictive capabilities and driving innovation.

The evolution of web search necessitated automated responses, leading to the development of web crawlers and projects like Nutch.
Doug Cutting and Mike Cafarella created Nutch to enhance search efficiency; this project later transformed into Hadoop, inspired by Google’s methods in automated data handling.
Hadoop was publicly released as an open-source project by Yahoo in 2008 and is currently managed by the Apache Software Foundation.

RDBMS is designed for structured data with known schemas, while Hadoop efficiently handles both structured and unstructured data.
RDBMS centers on records, long fields, and XML objects, while Hadoop focuses on files.
RDBMS allows for updates, whereas Hadoop typically permits only inserts and deletes.

Hadoop comprises four main modules:
- HDFS: A distributed file system managing large datasets with high fault tolerance.
- YARN: Responsible for resource management and job scheduling across cluster nodes.
- MapReduce: A framework facilitating parallel data processing, converting input data into key-value pairs for further aggregation.
- Hadoop Common: Offers shared Java libraries for other Hadoop modules.

The architecture splits into storage (HDFS) and processing (MapReduce), where files are distributed across cluster nodes.
Processing employs packaged code transferred to nodes, promoting data locality and efficiency.

NameNode: Acts as the master in a Hadoop cluster, storing metadata crucial for data location and replication management. Directly interacts with client applications for file operations.
DataNode: Functions as a slave, storing actual data, with multiple DataNodes enhancing data storage capacity and performance.

MapReduce executes distributed processing, critical for handling Big Data through a two-phase approach: the Map phase processes input data into key-value pairs, and the Reduce phase aggregates outputs.
YARN enhances efficiency by facilitating various processing engines to operate concurrently on data stored in HDFS.

Client: Submits MapReduce jobs.
Resource Manager: The master daemon overseeing resource allocation and management, composed of:
- Scheduler: Allocates resources without monitoring application status, based on available resources and task requirements.
- Application Manager: Manages job submissions and container negotiations, restarting failed tasks as necessary.