Questions and Answers
What is the primary function of a file system abstraction?
- To manage network connections between computers.
- To physically store data on storage devices.
- To encrypt data for security purposes.
- To provide a uniform way for users to interact with data, regardless of storage device types. (correct)
In a distributed file system (DFS) architecture, what is the role of the client application?
- To ensure data replication and fault tolerance across the system.
- To access and process data on a remote server as if it were local. (correct)
- To store actual file data blocks across multiple servers.
- To manage metadata and namespace for all files.
Which component in a Distributed File System is responsible for maintaining metadata about file locations?
- Backup Node
- Data Node
- Client Node
- Name Node (correct)
What is a key challenge in distributed file systems concerning data access?
What is a primary design goal of the Google File System (GFS)?
Why is 'component failure' a key design consideration in GFS?
For what type of workloads is GFS best suited?
What is the function of the GFS Master?
How does GFS handle file storage?
What is the default chunk size in GFS?
What is the purpose of replicating chunks across multiple chunk servers in GFS?
What type of metadata is primarily managed by the GFS Master?
What is the 'operation log' in GFS Master, and why is it important?
Which of the following is NOT a characteristic of 'Big Data'?
What was the primary motivation for creating Hadoop?
HDFS is best suited for applications that require:
Why is HDFS designed to run on commodity hardware?
What is the default block size in HDFS?
What is the primary reason for using large block sizes in HDFS?
In HDFS, which component manages the file system namespace and metadata?
What is the role of DataNodes in HDFS?
What happens if the NameNode in HDFS fails?
What is the purpose of a Secondary NameNode in HDFS?
What is HDFS Federation designed to address?
In HDFS Federation, how is the namespace divided?
What is Apache YARN's primary role in Hadoop?
Which of the following is NOT a stage in a MapReduce program execution?
What is the main function of the 'Map' stage in MapReduce?
What is the purpose of the 'Shuffle' stage in MapReduce?
What is the role of the 'Reduce' stage in MapReduce?
Which component in a MapReduce architecture is responsible for coordinating and managing jobs?
What is the role of Task Trackers (or NodeManagers in YARN) in MapReduce?
What does 'fault tolerance' in MapReduce primarily refer to?
In MapReduce, how are task failures typically handled?
What is a 'straggler' task in MapReduce, and how are they handled?
Which of the following is a common application of MapReduce?
What is a significant concern regarding Hadoop data centers in terms of resource consumption?
Why might Hadoop not scale linearly in practice, despite theoretical linear scalability?
What is a privacy concern associated with Hadoop's 'write-once, read-many-times' paradigm?
In a Distributed File System (DFS), if a client frequently accesses a file, where is a copy of this file most likely to be found to minimize access latency?
Google File System (GFS) prioritizes high throughput over low latency. Which architectural choice primarily supports this design goal?
The GFS Master maintains the 'operation log'. What is the most critical role of this log in ensuring the reliability of GFS?
Hadoop Distributed File System (HDFS) is designed to run on commodity hardware. How does this influence HDFS's approach to fault tolerance?
Why is the block size in HDFS set to a significantly larger size (e.g., 128MB) compared to traditional file systems?
In HDFS, what is the primary role of the Secondary NameNode?
HDFS Federation is introduced to address a specific scalability challenge in HDFS. Which limitation does Federation primarily aim to overcome?
During the 'Shuffle' stage of a MapReduce job, what is the primary operation performed on the intermediate data?
In MapReduce, 'speculative execution' is a mechanism to handle straggler tasks. What is the core principle behind speculative execution?
Large Hadoop data centers, while powerful, are known to have a significant environmental impact primarily due to:
Flashcards
What is a File System?
An abstraction that enables users to read, manipulate, and organize data.
What is a Distributed File System (DFS)?
A client/server-based application that allows clients to access and process data stored on a server as if it were on their own computer.
What is a Service in DFS?
A software entity running on one or more machines that provides a particular type of function.
What is a DFS Server?
Service software running on a single machine.
What is a DFS Client?
A process that invokes a service through a set of operations that form its interface.
What is Google File System (GFS)?
A scalable distributed file system built by Google for large, data-intensive applications.
Four Key Observations Driving GFS Design
Component failures, huge files, mutation of files, and the benefits of co-designing applications and the file system API.
What is Hadoop Distributed File System (HDFS)?
Hadoop's distributed file system, designed for very large files and streaming data access patterns on clusters of commodity hardware.
HDFS Blocks
Files are broken into block-sized chunks; the default block size is 128 MB.
NameNode
The master node that manages the file system namespace and metadata and tracks which datanodes hold each block.
DataNode
A worker node that stores and retrieves data blocks and reports back to the namenode.
HDFS Redundancy
The namenode writes its metadata to multiple filesystems, and a secondary namenode can periodically merge the namespace image with the edit log.
Operation Log
A persistent record of metadata changes kept by the GFS Master, replicated on remote machines and used for recovery.
YARN (Yet Another Resource Negotiator)
Hadoop's cluster resource management system, introduced in Hadoop 2.
MapReduce
A framework for running applications that process large amounts of data in parallel on clusters of commodity hardware.
Map Stage
The mapper processes the input data and produces small chunks of intermediate data.
Reduce Stage
The reducer processes the data coming from the mappers and produces a new set of output.
Hadoop Applications
Common uses include log processing, web indexing, data mining, and large-scale machine learning.
Study Notes
- This session includes an overview of Distributed File Systems (DFS), Google File System (GFS), MapReduce (MR), and Hadoop.
Distributed File System (DFS)
- Abstraction that enables users to read, manipulate, and organize data.
- Data is stored in files, organized in a hierarchical tree structure with directories as nodes.
- Provides a uniform view of data, independent of the underlying storage devices (floppy drives, hard drives, flash memory).
- On a stand-alone computer there is typically a one-to-one mapping between the file system and a storage device; software RAID is an exception, distributing data across multiple storage devices at a layer below the file system.
- A client/server-based application that allows clients to access and process data stored on a server.
- When a user accesses a file, a copy is cached on the user's computer during processing and then returned to the server.
- Organizes file and directory services of individual servers into a global directory, making remote data access location-independent and identical for any client.
- All files are accessible to all users of the global file system.
- Organization is hierarchical and directory-based.
- Employs mechanisms to manage simultaneous access from multiple clients, ensuring updates are organized, data is current, and conflicts are prevented.
- Commonly uses file or database replication to protect against data access failures.
DFS Components
- Service: A software entity running on one or more machines, providing a specific function to clients.
- Server: Service software running on a single machine.
- Client: A process that invokes a service through a set of operations forming its interface.
- Client Interface: Formed by primitive file operations such as create, delete, read, and write (see the sketch after this list).
- Goal: Provide a common view of a centralized file system with a distributed implementation.
- Goal: Allow opening and updating any file from any machine on the network, addressing synchronization issues and supporting shared local file capabilities.
- Goal: Ensure transparency for the client interface, not distinguishing between local and remote files.
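To make the client interface above concrete, here is a purely hypothetical Java sketch of those primitive operations; the interface name, method signatures, and exception choices are illustrative and not taken from any particular DFS.

```java
import java.io.IOException;

// Hypothetical illustration only: the same primitive operations (create, delete,
// read, write) are exposed to the client regardless of where the file actually lives.
public interface DfsClient {
    void create(String path) throws IOException;                            // create an empty file
    void delete(String path) throws IOException;                            // remove a file
    byte[] read(String path, long offset, int length) throws IOException;   // read a byte range
    void write(String path, long offset, byte[] data) throws IOException;   // write a byte range
}
```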
DFS Challenges
- Naming and Transparency
- Remote file access
- Caching
- Stateful vs Stateless service
- Replication
- Fault Tolerance
- Security
Google File System (GFS)
- A scalable distributed file system designed for data-intensive applications.
- Built by Google in 2003.
- Shares goals of performance, scalability, reliability, and availability with previous distributed file systems.
- Design driven by observations of component failures, huge files, mutation of files, and benefits of co-designing applications and file system APIs.
- Assumes high component failure rates.
- Built from inexpensive commodity components
- Works with a modest number of large files; typically a few million files, each 100MB or larger.
- Workloads consist of large streaming reads (1MB+) and small random reads
- Focuses on sustained throughput over low latency for individual read/write operations.
- Not suitable for low-latency data access (milliseconds), many small files, or constantly changing data.
- Some details of GFS remain private.
GFS Design
- Single Master: Centralized management.
- Files stored as chunks with a fixed size of 64MB each.
- Reliability achieved through chunk replication across at least 3 servers.
- No data caching: caching offers little benefit because of the large size of the data sets and the streaming access patterns.
- Interface is suitable for Google apps and includes create, delete, open, close, read, write, snapshot, and record append operations.
- Single files can contain many objects, such as Web documents.
- Files are divided into fixed-size chunks (64MB) with unique 64-bit identifiers.
- IDs are assigned by the GFS master at chunk creation time.
- Chunk servers store chunks on local disk as "normal" Linux files.
- Reads and writes are specified by (chunk_handle, byte_range) tuples.
- Chunks are replicated (3 times by default) across multiple chunk servers.
- The Master maintains all file system metadata: the namespace, access control, file-to-chunk mapping, chunk locations, and garbage collection of orphaned chunks.
- Heartbeat messages are exchanged between the Master and Chunkservers, to determine whether the chunk server is still active and which chunks are stored.
- Clients communicate with the master for metadata operations and with chunk servers for data.
- The "operation log" holds the filesystem namespaces and mapping from files to chunks.
- Operation Logs are replicated on remote machines.
- Each 64MB chunk has less than 64 bytes of metadata in the Master, so all metadata fits in memory (see the worked example after this list).
- Chunk servers track their chunks and communicate data to the Master via HeartBeat messages.
- The operation log is a persistent record of metadata changes and is central to system recovery.
- The operation log is replicated on multiple remote machines; log records must be flushed to disk both locally and remotely before a client's operation is acknowledged.
- The chunk replica placement strategy applies when (initially empty) chunks are created: the Master favours underutilized chunk servers, spreads replicas across racks, and limits the number of recent creations on each chunk server.
- Re-replication is initiated when the number of available replicas falls below a set threshold.
- The Master instructs chunk servers to copy data directly from existing valid replicas.
- During re-replication, the number of active clone operations and the bandwidth they consume are limited.
- Rebalancing gradually changes the replica distribution and fills up newly added chunk servers.
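A rough, illustrative calculation (using the 64MB chunk size and the under-64-bytes-of-chunk-metadata figure above, and ignoring file namespace metadata and replication) shows why a single Master can hold all metadata in memory:

```latex
\frac{1\,\mathrm{PB}}{64\,\mathrm{MB/chunk}} \approx 1.6\times 10^{7}\ \text{chunks},
\qquad
1.6\times 10^{7}\ \text{chunks} \times 64\,\mathrm{B/chunk} \approx 1\,\mathrm{GB}\ \text{of metadata per petabyte of file data.}
```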
Big Data
- Big data refers to extremely large datasets that cannot be processed using traditional computing techniques.
- Big data includes numerous tools, techniques, and frameworks.
- It has become a subject in its own right.
- Big Data is about the discovery of knowledge and the annotation of data.
- It requires complex computational models.
- It needs elastic, on-demand capacity.
- Involves programming models and supporting algorithms and data structures.
- From the beginning of time until 2003, about 5 billion gigabytes of data were produced.
- The same amount was created every two days in 2011, and every ten minutes in 2013.
- Today we create about 2.5 quintillion bytes of data per day.
- Roughly 90% of all existing data was created in just the past 2–5 years.
Hadoop
- Doug Cutting, Mike Cafarella, and team created Hadoop as an open-source project in 2005, basing the solution on one provided by Google.
- Named after Doug Cutting's son's toy elephant.
- Trademarked by the Apache Software Foundation.
- Executes applications using the MapReduce algorithm, which processes data in parallel across different CPU nodes.
- A framework to develop applications for running on clusters of computers, enabling statistical analysis of large datasets.
Hadoop Distributed File System (HDFS)
- Designed for storing very large files and streaming data access patterns.
- Runs on clusters of commodity hardware.
- "Very large" means hundreds of megabytes, gigabytes, or terabytes.
- Hadoop clusters store petabytes of data.
- Most efficient data processing pattern: write-once, read-many-times.
- Datasets are generated or copied once.
- Multiple analyses are performed on that dataset over time.
- Each analysis involves a large proportion of the dataset
Hadoop and Hardware
- Hadoop doesn't require expensive, highly reliable hardware; instead, commodity hardware can be used.
- Designed to run on clusters on commodity hardware.
- Designed to function through node failure, without interruption.
- Not ideal for applications with low-latency data access requirements.
- Not ideal for large numbers of small files, as the namenode holds the filesystem metadata in memory.
HDFS Blocks
- Disks have a block size, which is the minimum amount of data that they can read or write.
- File systems for single disks build on this by dealing with data in filesystem blocks, which are an integral multiple of the disk block size.
- HDFS also has the concept of a block, but it is much larger: 128 MB by default (see the worked example after this list).
- Files are broken into block-sized chunks.
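An illustrative back-of-the-envelope calculation (the 10 ms seek time and 100 MB/s transfer rate are assumed round numbers, not figures from these notes) shows the reasoning behind such large blocks: seeks become a negligible fraction of the time spent transferring each block.

```latex
\text{transfer time per 128 MB block} \approx \frac{128\,\mathrm{MB}}{100\,\mathrm{MB/s}} \approx 1.3\,\mathrm{s},
\qquad
\frac{\text{seek time}}{\text{transfer time}} \approx \frac{10\,\mathrm{ms}}{1.3\,\mathrm{s}} \approx 0.8\%.
```

With small blocks (a few kilobytes), the same file would require far more seeks and far more metadata entries in the namenode's memory.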
HDFS Name Nodes and Data Nodes
- HDFS utilizes a master-worker pattern with a namenode and multiple datanodes.
- The namenode manages the file system namespace by maintaining the file system tree and metadata.
- This is persisted on local disk as two files: the namespace image and the edit log.
- The namenode knows the datanodes where all the blocks are located.
- The locations are not stored persistently but are reconstructed from datanodes when the system starts.
- Clients access the file system through the namenode and datanodes using a POSIX-like interface.
- User code does not interact directly with the namenode or datanodes.
- Datanodes are workers, storing and retrieving data blocks and reporting to the namenode.
- Without the namenode, the filesystem cannot be used and data cannot be recovered.
- Redundancy: the namenode writes its metadata to multiple filesystems, and it is possible to run a secondary namenode that periodically merges the namespace image with the edit log.
HDFS Block Caching
- Normally datanodes read blocks from disk, but frequently accessed files can be cached in memory.
- By default, a block is cached in only one datanode's memory.
- Job schedulers can improve performance by running tasks on datanodes where blocks are cached.
- For example, this can be used for small lookup tables.
- A cache directive can be added to a cache pool.
- Cache pools are used for managing cache permissions and resource usage.
HDFS Federation
- The namenode keeps a reference to every file and block.
- Memory becomes a limiting factor for scaling.
- HDFS Federation (introduced in the 2.x release series) scales a cluster by adding namenodes, each of which manages a portion of the filesystem namespace.
- For example, separate namenodes might manage the /user and /share paths.
- The managed namespaces are independent of each other, so the failure of one namenode does not affect the namespaces managed by the others.
HDFS File Reads
- The client opens a file for reading with the FileSystem object open() call.
- It interacts with the NameNode using Remote Procedure Calls (RPCs).
- The NameNode returns the addresses of the DataNodes that hold copies of the file's first blocks.
- DistributedFileSystem returns an FSDataInputStream to the client; this stream supports file seeks.
- Internally, a DFSInputStream manages the DataNode and NameNode I/O and streams the data from the DataNodes (see the sketch below).
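A minimal sketch of this read path using the standard Hadoop FileSystem API (the path is a placeholder and error handling is kept to a minimum; this assumes fs.defaultFS points at an HDFS cluster):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();         // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);              // a DistributedFileSystem when fs.defaultFS is hdfs://...

        Path path = new Path("/user/example/input.txt");   // placeholder path
        try (FSDataInputStream in = fs.open(path)) {       // open() contacts the NameNode over RPC for block locations
            byte[] buffer = new byte[4096];
            int n = in.read(buffer);                       // the bytes themselves are streamed from DataNodes
            if (n > 0) {
                System.out.println(new String(buffer, 0, n, StandardCharsets.UTF_8));
            }
            in.seek(0);                                    // FSDataInputStream supports seeking within the file
        }
    }
}
```

The NameNode is involved only in the metadata lookups; the file contents flow directly from the DataNodes to the client.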
YARN
- Apache YARN (Yet Another Resource Negotiator) is Hadoop's cluster resource management system.
- YARN was introduced in Hadoop 2 to improve the MapReduce implementation and support other distributed computing paradigms.
- YARN provides APIs that are used by higher level frameworks, which are themselves built on YARN.
MapReduce
- Framework for easily running applications that process large amounts of data in parallel.
- It utilizes clusters of thousands of commodity-hardware nodes and provides a software framework for distributed processing.
- Execution proceeds in three stages: the map stage, the shuffle stage, and the reduce stage.
- Map stage: the mapper's job is to process the input data.
- The input data is stored in the file system (HDFS).
- The mapper works through the input, typically line by line, and creates several small chunks of intermediate data.
- Reduce stage: the reducer's job is to process the data that comes from the mapper, producing a new set of output.
- Processing takes place on nodes that hold the data on local disks, which reduces network traffic (see the word-count sketch below).
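As a concrete illustration of the map and reduce stages, here is a compact word-count job written against the standard org.apache.hadoop.mapreduce API; the class names and the use of args[0]/args[1] as HDFS input and output paths are illustrative choices.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: called once per input line; emits a (word, 1) pair for every token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: receives all counts for one word (after the shuffle) and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

During the shuffle, all (word, count) pairs with the same key are routed to the same reducer, which is why the reducer can simply sum the values it receives.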
MapReduce Fault Tolerance
- If a task crashes, it is restarted on another node; if it keeps failing after repeated attempts, the offending input block may be skipped or the job fails (see the configuration sketch after this list).
- If a node crashes, its in-progress tasks are relaunched elsewhere, and map tasks it had already completed are re-run because their output files were lost.
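The retry behaviour above, and the speculative execution described in the next section, are controlled through job configuration. A small sketch, assuming a Hadoop 2/3 MapReduce deployment; the property keys are the standard mapreduce.* names, and the values shown are arbitrary examples rather than recommendations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FaultToleranceConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // How many times a failed map or reduce task is retried before the job gives up.
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        // Speculative execution: launch a duplicate of a slow (straggler) task on another
        // node and keep whichever copy finishes first.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        Job job = Job.getInstance(conf, "fault-tolerance demo");
        // ... set mapper, reducer, and input/output paths as in the word-count sketch above ...
    }
}
```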
More on MapReduce
- A slow ("straggler") task is handled by speculative execution: a second copy of the task is launched on another node, whichever copy finishes first wins, and the other is stopped.
- Running Hadoop at scale requires numerous large data centers, costing around $1 billion each.
- Such clusters consume about 2.2% of America's electrical power.
- The top five U.S. Internet firms operate around 300 data centers in the continental United States.
- A single Hadoop data center consumes about 36 megawatts of power.
- Raises privacy concerns: under the write-once, read-many-times paradigm, data, once written, is effectively stored forever.