HDFS Hadoop Distributed File System Lecture 3 PDF
Summary
This document provides an overview of HDFS (Hadoop Distributed File System), covering topics such as motivation, architecture, and concepts. It discusses the challenges of storing large datasets and the advantages of distributed storage systems. The content appears to be lecture notes.
Full Transcript
HDFS: Hadoop Distributed File System

HDFS Outline
– Motivation
– Architecture and Concepts
– Inside
– User Interface

Motivation Questions
Problem 1: Data is too big to store on one machine.
HDFS: Store the data on multiple machines!

Problem 2: Very high-end machines are too expensive.
HDFS: Run on commodity hardware!

What is commodity hardware?
Commodity hardware in computing is computers or components that are readily available, inexpensive and easily interchangeable with other commodity hardware. Commodity hardware is synonymous with off-the-shelf hardware.

Problem 3: Commodity hardware will fail!
HDFS: Software is intelligent enough to handle hardware failure!

Problem 4: What happens to the data if the machine that stores it fails?
HDFS: Replicate the data!

Problem 5: How can distributed machines organize the data in a coordinated way?
HDFS: Master-Slave Architecture!

HDFS Architecture: Master-Slave (Single Rack Cluster)
Master: Name Node (NN). Slaves: Data Nodes (DN).
Name Node: Controller
– File System Name Space Management
– Block Mappings
Data Node: Work Horses
– Block Operations
– Replication
Secondary Name Node (SNN):
– Checkpoint node

HDFS Architecture: Master-Slave (Multiple-Rack Cluster)
[Diagram: Name Node (NN) and Secondary Name Node (SNN) connected through switches to Data Nodes (DN) in Rack 1, Rack 2, ... Rack N]

What is a rack?
A rack is a collection of around 40-50 DataNodes connected to the same network switch. If that network connection goes down, the whole rack becomes unavailable. A large Hadoop cluster is deployed across multiple racks.

Rack awareness policies
– No more than one replica is placed on any one node.
– No more than two replicas are placed on the same rack.
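The two rack awareness policies can be checked mechanically. The following is a minimal sketch in Python, not HDFS code: the function name check_rack_policy and the (rack, node) tuple representation of a replica placement are assumptions made for illustration.

```python
from collections import Counter


def check_rack_policy(placement):
    """Check one block's replica placement against the two policies.

    placement is a list of (rack, node) tuples, one per replica
    (a hypothetical representation chosen for this sketch).
    Returns a list of violation messages; an empty list means OK.
    """
    problems = []

    # Policy 1: no more than one replica on any one node.
    for (rack, node), count in Counter(placement).items():
        if count > 1:
            problems.append(f"{count} replicas on node {node} ({rack})")

    # Policy 2: no more than two replicas on the same rack.
    for rack, count in Counter(rack for rack, _ in placement).items():
        if count > 2:
            problems.append(f"{count} replicas on rack {rack}")

    return problems


if __name__ == "__main__":
    ok = [("rack1", "dn1"), ("rack1", "dn2"), ("rack2", "dn5")]
    bad = [("rack1", "dn1"), ("rack1", "dn2"), ("rack1", "dn3")]
    print(check_rack_policy(ok))
    print(check_rack_policy(bad))
```

A placement that spreads three replicas over two racks passes both checks, while three replicas on a single rack fails the second one.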
For the common case where the replication factor is three, the block replication policy is to put
– the first replica on the local rack,
– the second replica on a different DataNode on the same rack,
– and the third replica on a different rack.

HDFS Architecture: Master-Slave (Multiple-Rack Cluster): Reliable Storage
[Diagram: multiple-rack cluster with NN, SNN, and DNs in Rack 1, Rack 2, ... Rack N]
Name Node: "I know all blocks and replicas!"
NN will replicate lost blocks in another node ☺

HDFS Architecture: Master-Slave (Multiple-Rack Cluster): Rack Awareness
Name Node: "I know the topology of the cluster!"
NN will replicate lost blocks across racks ☺

HDFS Architecture: Master-Slave (Multiple-Rack Cluster): Single Point of Failure
Name Node: "Do not ask me, I am down"

HDFS Architecture: Master-Slave (Multiple-Rack Cluster): Network Performance
How about network performance?
Keep bulky communication within a rack!

HDFS Inside: Name Node
The Name Node keeps
– FS image: snapshot of the FS
– Edit log: record of changes to the FS

Filename   Replication factor   Block ID
File 1     3                    [1, 2, 3]
File 2     2                    [4, 5, 6]
File 3     1                    [7, 8]

[Diagram: the Data Nodes together hold blocks 1, 2, 5, 7, 1, 5, 3, 1, 4, 3, 4, 3, 2, 8, 6, 2, 6, so each block appears as many times as its file's replication factor]

HDFS Inside: Name Node
– Periodically, the Name Node's FS image and edit log are copied to the Secondary Name Node.
– Secondary Name Node: house keeping, backup of NN metadata.
– Data Nodes send a heartbeat to the NN every 3 seconds; the NN's reply carries embedded control information.
– If the NN does not hear from a DN in 10 minutes, it starts to replicate that DN's blocks.

HDFS Inside: Read
[Diagram: Client, Name Node, and DN1, DN2, DN3, ... DNn, with arrows numbered 1-4 matching the steps]
1. Client connects to NN to read data.
2. NN tells the client where to find the data blocks.
3. Client reads blocks directly from data nodes (without going through NN).
4. In case of node failures, the client connects to another node that serves the missing block.

HDFS Inside: Read
Q: Why does HDFS choose such a design for read? Why not ask the client to read blocks through NN?
Reasons:
– Prevent NN from being the bottleneck of the cluster
– Allow HDFS to scale to a large number of concurrent clients
– Spread the data traffic across the cluster

HDFS Inside: Read
Q: Given multiple replicas of the same block, how does NN decide which replica the client should read?
HDFS Solution: Rack awareness based on network topology

HDFS Inside: Write
[Diagram: Client, Name Node, and DN1, DN2, DN3, ... DNn, with arrows numbered 1-4 matching the steps]
1. Client connects to NN to write data.
2. NN tells the client which data nodes to write to.
3. Client writes blocks directly to data nodes with the desired replication factor.
4. In case of node failures, NN will figure it out and replicate the missing blocks.

HDFS Inside: Write
Replication Strategy vs Tradeoffs (Reliability, Write Bandwidth, Read Bandwidth)
– Put all replicas on one node
– Put all replicas on different racks
– HDFS:
  1 -> same node as client
  2 -> a node on a different rack
  3 -> a different node on the same rack as 2

HDFS Interface
– Web Based Interface: http://ccl.cse.nd.edu/operations/hadoop/
– Command Line (Hadoop FS Shell): https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html

HDFS Web UI
[Screenshots]

HDFS Command Line: Hadoop Shell
[Screenshot]

Hadoop Lecture 1 Summary
Big Data and Hadoop background
– What and Why about Hadoop
– 4 V challenge of Big Data
Hadoop Distributed File System (HDFS)
– Motivation: guide Hadoop design
– Architecture: Single rack vs Multi-rack clusters
– Reliable storage, Rack-awareness, Throughput
– Inside: Name Node file system, Read, Write
– Interface: Web and Command line
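
The three-replica strategy from the Write slides (1 -> same node as client, 2 -> a node on a different rack, 3 -> a different node on the same rack as 2) can be sketched in a few lines of Python. This is an illustration only, not how the Name Node is implemented: the function name choose_replica_nodes, the dict-based topology, and the assumption that the client runs on a Data Node are all hypothetical.

```python
import random


def choose_replica_nodes(topology, client_node, rng=random):
    """Pick three Data Nodes for one block.

    topology maps a rack name to the list of node names in that rack
    (a hypothetical representation). client_node is assumed to be a
    Data Node that appears in topology.
    """
    # Find the rack that hosts the client.
    local_rack = None
    for rack, nodes in topology.items():
        if client_node in nodes:
            local_rack = rack
            break
    if local_rack is None:
        raise ValueError(f"{client_node} is not in the topology")

    # Replica 1: the same node as the client.
    first = client_node

    # Replicas 2 and 3: two distinct nodes on one other rack.
    candidates = [rack for rack, nodes in topology.items()
                  if rack != local_rack and len(nodes) >= 2]
    if not candidates:
        raise ValueError("need another rack with at least two nodes")
    remote_rack = rng.choice(candidates)
    second, third = rng.sample(topology[remote_rack], 2)

    return [first, second, third]


if __name__ == "__main__":
    cluster = {
        "rack1": ["dn1", "dn2"],
        "rack2": ["dn3", "dn4"],
        "rack3": ["dn5", "dn6"],
    }
    print(choose_replica_nodes(cluster, "dn1"))
```

The result always uses exactly two racks: one write crosses racks (replica 1 to replica 2), the other stays inside the remote rack, and losing either rack still leaves at least one replica.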