Distributed File Systems Quiz

Questions and Answers

What is a primary design consideration of the Google File System (GFS)?

  • Emphasis on end-user applications
  • Support for small files primarily
  • Focus on perfect fault tolerance
  • Handling huge data workloads efficiently (correct)

The Hadoop Distributed File System (HDFS) was originally developed for Apache Hadoop's web search engine.

False (B)

What does the acronym RPC stand for, as used in the context of NFS?

Remote Procedure Call

In GFS, the primary server stores __________ and monitors chunk servers.

metadata

Match the following systems with their characteristics:

NFS = Stateless and idempotent operations for fault tolerance
GFS = Designed for big data workloads with atomic append operations
HDFS = Similar to GFS and designed for web search engine infrastructure

Which of the following is NOT a characteristic of GFS?

Support for numerous small files (D)

HDFS is utilized by companies like Facebook, LinkedIn, and Twitter.

True (A)

What is the main reason for adopting fault tolerance in GFS?

Component failures are the norm in clusters with commodity machines.

What is the primary task of the Namenode in HDFS?

Coordinate and manage metadata (D)

Data in HDFS is typically modified frequently.

False (B)

How often does a Datanode send a heartbeat signal to the Namenode?

Every 3 seconds

The design models of HDFS are similar to __________.

GFS

Match the following HDFS components with their descriptions:

Namenode = Coordinates and manages metadata
Datanode = Stores data blocks
Checkpoint Node = Combines checkpoints and journal
Backup Node = Maintains in-memory filesystem namespace

Which of the following statements about Datanodes is incorrect?

Datanodes can store entire file content in one block. (C)

Erasure coding is used in HDFS for data replication.

False (B)

What happens if a Datanode fails to send a heartbeat signal?

It is considered dead.

HDFS places the second and third replicas of a data block in __________ racks to optimize data availability.

different

What is the typical size of a block in HDFS?

64MB (D)

What is the primary benefit of using specialized file formats in data storage?

Faster read and write times (A)

Hadoop Distributed File System (HDFS) is an open-source clone of the Google File System (GFS).

True (A)

What challenge does HDFS face when dealing with large datasets?

Finding the relevant data in a particular location

___ is a coding technique that helps recover lost data by using parity checks.

Erasure coding

Match the file formats with their descriptions:

HDFS = Distributed file system designed for big data
RCFile = Record Columnar File optimized for relational data
Parquet = Columnar storage format for efficient data processing
ORC = Optimized Row Columnar format for Hive

Which file format is specifically designed for use with Apache Hive?

ORC (C)

Replication allows for N-1 failures to be tolerated in storage systems.

True (A)

What feature distinguishes the Parquet file format from RCFile?

Column-wise compression

The ___ allows for encoding of both keys and values, stored in binary format.

Sequence file

Which of the following statements about Galois Fields is true?

They are utilized primarily in error correction techniques. (B)

The performance of traditional text files is superior to that of specialized file formats.

False (B)

What does HDFS stand for?

Hadoop Distributed File System

RS(n, k) stands for Reed-Solomon encoding with n data symbols and ___ parity checks.

k

Flashcards

NFS

A remote file access system that uses a virtual file system for transparency.

NFS Client-side Caching

Caching is used on the client side to improve performance by storing copies of data locally.

Stateless and Idempotent RPCs

Stateless and idempotent RPCs help ensure fault tolerance. It means a request can be processed without relying on previous state, and repeated requests have the same effect.

NFS Synchronization and Flush-on-Close

NFS synchronizes data periodically with the server, and data is flushed to the server when a file is closed.

GFS Big Data Workloads

GFS is designed for handling very large datasets, with a focus on appending data and concurrent access.

GFS Architecture

GFS relies on a distributed architecture with a single primary server responsible for metadata, and many chunk servers storing data.

GFS Fault Tolerance on Commodity Hardware

GFS prioritizes fault tolerance by being designed to function even when components fail, often due to commodity hardware limitations.

HDFS (Hadoop Distributed File System)

HDFS, directly inspired by GFS, is a similar distributed file system commonly used for large datasets, like those found at Facebook, LinkedIn, and Twitter.

Hardware Failure Assumption in HDFS

HDFS assumes that hardware failures are common, making data replication crucial for reliability.

Large File Handling in HDFS

HDFS is designed to handle large files efficiently, often exceeding the size of typical disk blocks.

Streaming Read/Write Optimization in HDFS

HDFS optimizes for large sequential reads and writes, making it suitable for streaming data processing but less ideal for random access.

Write Once Read Many (WORM) in HDFS

Once data is written to HDFS, it is rarely modified. This design favors immutable data storage.

Concurrent Writes in HDFS

HDFS is designed to handle concurrent writes from multiple sources while ensuring data consistency.

High Bandwidth vs. Latency in HDFS

HDFS prioritizes high bandwidth over low latency, meaning it focuses on moving large amounts of data quickly, even if it takes slightly longer.

Moving Computation to the Data in HDFS

HDFS prioritizes storing data and allowing computations to be carried out on those data nodes, making it more suited for data analytics.

Namenode's Role in HDFS

The Namenode in HDFS plays a crucial role in managing the entire file system's metadata, including file locations, permissions, and access times.

Datanode's Role in HDFS

Datanodes in HDFS are responsible for storing actual data blocks, replicating them for redundancy, and communicating with the Namenode.

Heartbeat Mechanism in HDFS

In HDFS, a heartbeat mechanism ensures that the Namenode is aware of all active Datanodes, monitoring their status and ensuring they are online.

Reed-Solomon (n, k) Encoding

A type of error correction code used in QR codes, where data is divided into blocks and parity checks are added to allow for data recovery even if some blocks are lost or corrupted.

Erasure Coding

A coding technique that adds parity information so that lost data blocks can be recovered. It tolerates multiple failures with less storage overhead than full replication.

RCFile (Record Columnar File)

A storage format designed for relational tables, offering efficient storage utilization and fast data loading. Features include columnar storage, block compression, and adaptive access patterns.

SequenceFile

A file format designed for Hadoop, offering efficient storage and performance for analysis. Features include block-level compression, splitable files, and schema evolution support.

ORC (Optimized Row Columnar) File

A file format designed for Hive, building on the strengths of RCFile. Features include efficient storage, block-mode compression, and support for complex data types. It utilizes stripes for indexing and metadata management.

Parquet

A column-oriented file format developed by Twitter and Cloudera, focusing on efficiency and compression. It utilizes different encoding techniques for different columns, offering flexibility and optimized retrieval.

Google File System (GFS)

A distributed file system designed for Google's data management needs. It offers features like data replication, high availability, and scalability for massive data storage.

Hadoop Distributed File System (HDFS)

An open-source implementation of Google File System, widely used for data storage in Hadoop-based systems. It provides similar features to GFS.

Data Block Encoding

A technique for efficiently storing and retrieving data by dividing the data into smaller blocks and adding parity checks to allow for recovery from lost blocks.

Replication

Storing multiple full copies of data to ensure availability and redundancy. Replication is simple but inefficient in terms of storage space.

Text File (CSV, TSV)

A file format widely used for data storage, often due to its simplicity and versatility. Each line represents a record, and the format is inherently splittable.

Study Notes

Big Data Systems

  • Big data systems are concerned with distributed file systems
  • A seminar on database systems was held with TU Darmstadt
  • Antonis Katsarakis (Huawei) presented on Wednesday, 16:15-17:00, via Zoom
  • The Dandelion Hashtable achieves in-memory request capacity beyond a billion requests per second on commodity servers

Distributed File Systems

  • File systems are fundamental to data storage
  • Basics of file systems, network file systems, Google file systems, and Hadoop distributed file systems
  • Erasure coding and file formats are key parts of distributed file systems

File Systems

  • Motivation is providing abstractions for file interaction
  • POSIX aligns with the ISO C (1999) standard
  • POSIX includes file locking and directories
  • Files are byte vectors containing data on external storage

HDD

  • Magnetic disks have caches ranging from 8MB to 128MB
  • Cost of hard disk drives (HDD) has decreased significantly
  • HDD prices vary over time; the quoted figures reflect the date the slides were created
  • HDDs use SATA, SCSI/SAS connections

SSD

  • Solid-state drives (SSDs) use NAND-based flash memory
  • SSDs have no moving mechanical components
  • SSD cache size ranges from 16MB to 512MB
  • SSD costs also vary over time and reflect the date of the slides
  • SSDs use PCI Express connections

Hard Drive Failures

  • Hard disk drives (HDDs) fail at measurable rates
  • Failure rates vary by size and model
  • An average of 5% failed annually in the first 3 years.

Bitrot

  • Bitrot is silent data corruption in hard drives.
  • HDD specifications estimate uncorrectable bit error rates to be exceptionally low (about 10^-15 per bit read)
  • Testing with 8x100GB HDDs after 2PB of reads revealed 4 read errors
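
A rough back-of-the-envelope check of that experiment, assuming the quoted rate applies per bit read (a sketch, not the vendors' exact methodology):

```python
# Expected uncorrectable read errors, assuming an uncorrectable
# bit error rate (UBER) of 1e-15 per bit read.
bytes_read = 2e15        # 2 PB of reads
bits_read = bytes_read * 8
uber = 1e-15             # errors per bit read
print(bits_read * uber)  # 16.0 -- same order of magnitude as the 4 observed
```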

File System Operations

  • File operations include opening, reading, writing, and closing files.
  • Directory operations include creating, renaming, and deleting files and directories, as well as managing metadata and locks (see the example below)
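
A minimal example of these operations using Python's POSIX-style os wrappers (illustrative; file names are arbitrary):

```python
import os

# Open (creating if necessary), write, reposition, read, and close a file.
fd = os.open("example.txt", os.O_RDWR | os.O_CREAT, 0o644)
os.write(fd, b"hello, file system\n")
os.lseek(fd, 0, os.SEEK_SET)           # rewind to the start of the byte vector
data = os.read(fd, 1024)               # read back up to 1024 bytes
os.close(fd)

# Directory-level operations: rename, inspect metadata, delete.
os.rename("example.txt", "renamed.txt")
print(os.stat("renamed.txt").st_size)  # metadata: file length in bytes
os.remove("renamed.txt")
```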

Linux Ext2

  • Drives are partitioned into block groups
  • Superblock carries system data regarding blocks, free space, etc.
  • Bitmaps track free data blocks and inodes
  • Inodes are used to identify files and directories
  • Data blocks contain the actual data

Ext2 Inode (1)

  • Inode structures include owner/group, file length, type/access rights, number of data blocks, pointers to data blocks (direct and indirect) and timestamps
  • Different file types are denoted including files, directories, symbolic links, and sockets

Ext2 Inode (2)

  • Inode structures have 12 direct block pointers plus single, double, and triple indirect blocks that point to data blocks

Ext2 Example

  • Ext2 file and block size limits depend on the chosen block size (see the calculation below)
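
A sketch of where those limits come from, counting the blocks reachable through the inode's direct and indirect pointers (assuming 4-byte block pointers; real ext2 limits are lower for large block sizes because of other 32-bit size fields):

```python
# Maximum file size addressable through an ext2 inode with 12 direct
# pointers plus single/double/triple indirect blocks.
def ext2_max_file_bytes(block_size: int, ptr_size: int = 4) -> int:
    ptrs = block_size // ptr_size          # pointers per indirect block
    blocks = 12 + ptrs + ptrs**2 + ptrs**3
    return blocks * block_size

for bs in (1024, 2048, 4096):
    print(f"{bs:5d}-byte blocks -> ~{ext2_max_file_bytes(bs) / 2**30:.1f} GiB")
```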

Network File System (NFS)

  • NFS enables consistent namespace and access to filesystems across computers
  • It is typically used for file access within a local LAN.
  • Each file resides on a server which is accessed by clients.
  • Clients treat remote files as if they were local.
  • NFS has a user-centric design with few concurrent writes.

Basic NFS

  • NFS utilizes a system call layer, virtual file system, local file system interface, NFS clients, NFS servers, RPC client stub, RPC server stub, and a network for file access

Sending commands

  • NFS leverages Remote Procedure Calls (RPCs) to propagate file system operations.
  • The naive approach forwards every file system operation as an RPC to the server

Solution: Caching

  • NFS clients utilize caching by maintaining copies of remote files, allowing periodic synchronization with the server.
  • The original Sun NFS from 1984 employed in-memory caching, allowing files to be accessed without network activity.

Caching & Failures

  • Server crashes can lead to loss of unsaved data and offset errors in concurrent accesses.
  • Communication failures can lead to inconsistent data handling and unintended file deletions during concurrent modification or creation.
  • Client crashes also result in loss of unsaved data

Solution: Stateless RPC

  • Stateless RPCs prevent state maintenance across commands and sessions.
  • Commands, such as read(), are stateless, meaning the server doesn't track the context.
  • This allows servers to crash and recover without restoring previously stored state (see the sketch below).
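
A minimal sketch of the idea: the client passes the full context (file handle, offset, length) on every call, so the server keeps no per-client state and retried requests are harmless. The StubServer class is illustrative, not the actual NFS protocol:

```python
# Stateless, idempotent reads: no session, no cursor on the server.
class StubServer:
    def __init__(self, files: dict[str, bytes]):
        self.files = files  # file handle -> contents

    def read(self, handle: str, offset: int, length: int) -> bytes:
        # The same request always yields the same bytes, so a client
        # may safely resend it after a timeout.
        return self.files[handle][offset:offset + length]

server = StubServer({"fh42": b"hello stateless world"})
print(server.read("fh42", 6, 9))  # b'stateless'
print(server.read("fh42", 6, 9))  # identical result on retry
```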

Concurrent Writes in NFS

  • NFS lacks guarantees for concurrent writes.
  • Concurrent writes can lead to inconsistent data updates in the event multiple clients try to modify the same file simultaneously.

NFS Summary

  • Functionality includes transparent remote file access via a virtual file system, with client-side caching for improved performance.
  • Stateless RPC for fault tolerance with flush-on-close semantics.
  • Concurrent writes are not guaranteed.

Google File System (GFS)

  • GFS is a distributed file system specifically designed for big data workloads.
  • GFS handles large files, frequent appends, high concurrency, and high bandwidth.
  • GFS is robust and tolerant of failures, supporting thousands of machines.
  • GFS provides a non-POSIX API designed for a scalable implementation

GFS - Design Assumptions

  • Component failures are assumed to be the norm, so fault tolerance is designed in rather than relying on hardware reliability.
  • GFS handles large files, streaming reads, and infrequent random writes.
  • The coherence model is simplified, prioritizing write-once-read-many patterns.

Hadoop Distributed File System (HDFS)

  • HDFS is a distributed file system similar to GFS, designed for big data workloads, handling large files, frequent appends, high concurrency, and high bandwidth.
  • It was originally built for the Apache Nutch web search engine.
  • HDFS is a robust system designed for hundreds or thousands of commodity machines where failures are common.

HDFS Architecture

  • HDFS employs a client-namenode-datanode architecture.
  • Only one namenode, but many datanodes storing data blocks
  • Namenode manages metadata and coordinates tasks between datanodes

Namenode

  • The namenode holds the entire file system's namespace in RAM.
  • Metadata such as hierarchy of files and directories, attributes, quotas, and access information are stored.
  • Checkpoints and journaling facilitate fault tolerance

Checkpoint/Backup Node

  • Checkpoint and backup nodes allow roles to be specified during node startup
  • Backup nodes maintain a synchronized in-memory copy of the file system namespace

Datanode

  • The datanode coordinates with the namenode and performs handshakes for verification.
  • Block reports are sent periodically, containing metadata for data blocks.
  • Heartbeats, sent to the namenode, confirm datanode health and availability. This aids in rebalancing.

Node failure detection

  • The namenode was previously a single point of failure; later versions support multiple namenode instances, each managing a region of the namespace.
  • Datanodes send heartbeat signals every 3 seconds; if heartbeats stop arriving for long enough, the namenode considers the datanode dead (see the sketch below).
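
A minimal sketch of heartbeat-based failure detection; the 3-second interval matches the lesson, while the dead-node timeout here is illustrative (real HDFS uses a much longer, configurable threshold):

```python
import time

HEARTBEAT_INTERVAL = 3.0                 # seconds between heartbeats
DEAD_TIMEOUT = 10 * HEARTBEAT_INTERVAL   # illustrative threshold

class HeartbeatMonitor:
    def __init__(self) -> None:
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, datanode_id: str) -> None:
        # Called whenever a datanode checks in.
        self.last_seen[datanode_id] = time.monotonic()

    def dead_nodes(self) -> list[str]:
        # Any node silent for longer than DEAD_TIMEOUT is considered dead.
        now = time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > DEAD_TIMEOUT]

monitor = HeartbeatMonitor()
monitor.heartbeat("datanode-1")
print(monitor.dead_nodes())  # [] while heartbeats are fresh
```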

Corrupted Data

  • Data blocks in HDFS include checksums.
  • Datanodes periodically send FileChecksums to the namenode.
  • Corrupted data does not automatically get deleted.
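
A minimal sketch of block checksumming with CRC32, in the spirit of HDFS's per-block checksums (illustrative, not the actual HDFS code path):

```python
import zlib

# The checksum is stored alongside each block and re-verified on read;
# a mismatch marks the replica as corrupt (it is not silently repaired).
def store_block(data: bytes) -> tuple[bytes, int]:
    return data, zlib.crc32(data)

def verify_block(data: bytes, checksum: int) -> bool:
    return zlib.crc32(data) == checksum

block, crc = store_block(b"some block contents")
print(verify_block(block, crc))                 # True: intact replica
flipped = bytes([block[0] ^ 0x01]) + block[1:]  # simulate bitrot
print(verify_block(flipped, crc))               # False: corrupt replica
```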

Block Placement

  • The first replica is placed on the writer's rack; further replicas are spread across racks for availability (see the sketch below)
  • Network bandwidth within a rack (or cabinet) is typically higher than cross-rack bandwidth, so placement takes the topology into account
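
A minimal sketch of the default 3-replica placement policy described above; the rack topology is illustrative:

```python
import random

racks = {
    "rack-A": ["dn1", "dn2", "dn3"],
    "rack-B": ["dn4", "dn5", "dn6"],
}

def place_replicas(local_rack: str) -> list[str]:
    # First replica on the writer's rack, second and third on one remote rack.
    remote_rack = random.choice([r for r in racks if r != local_rack])
    first = random.choice(racks[local_rack])
    second, third = random.sample(racks[remote_rack], 2)
    return [first, second, third]

print(place_replicas("rack-A"))  # e.g. ['dn2', 'dn6', 'dn4']
```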

Erasure Coding

  • Erasure coding is used as an alternative to replication, to improve storage utilization when space and cost are a concern
  • It encodes data into multiple blocks and allows for data recovery if some blocks are lost.
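
A minimal sketch of the erasure-coding idea using a single XOR parity block; production systems use Reed-Solomon codes (e.g., RS(6,3)) over Galois fields rather than plain XOR:

```python
# One parity block protects two data blocks against a single loss.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d1, d2 = b"block-one!", b"block-two!"
parity = xor_blocks(d1, d2)    # stored alongside the data blocks

# If d1 is lost, XOR-ing the survivors recovers it exactly.
recovered = xor_blocks(parity, d2)
print(recovered == d1)         # True
```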

(Big Data) File Format

  • Specialized file formats are created to improve efficiency in Big Data systems, handling large datasets, dealing with schema changes over time, and optimizing file sizes on disk.

File Format

  • Challenges in HDFS include finding data, managing space due to large data sets and continuous evolution of schemas.
  • Specialized file formats offer faster read and write times, splittable files, data compression, and improved schema management.

Common HDFS Storage Formats

  • Text files, such as CSV and TSV, are commonly used and are simple, with each line being a record, separated by a newline.
  • Sequence files support key/value encoding, useful for MapReduce tasks, and support block-level compression.
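
A minimal sketch of sequence-file-style binary key/value records (illustrative only; the real Hadoop SequenceFile format adds headers, sync markers, and compression):

```python
import struct

# Each record stores the key and value lengths followed by their bytes.
def write_record(buf: bytearray, key: bytes, value: bytes) -> None:
    buf += struct.pack(">II", len(key), len(value)) + key + value

def read_records(buf: bytes):
    pos = 0
    while pos < len(buf):
        klen, vlen = struct.unpack_from(">II", buf, pos)
        pos += 8
        yield buf[pos:pos + klen], buf[pos + klen:pos + klen + vlen]
        pos += klen + vlen

buf = bytearray()
write_record(buf, b"user42", b'{"clicks": 17}')
write_record(buf, b"user43", b'{"clicks": 3}')
print(list(read_records(bytes(buf))))
```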

RCFile

  • Developed at Facebook, RCFile is a record-columnar file format designed for relational tables, offering fast data loading and effective space management.
  • It is also adaptable to varying query patterns.

Row-oriented storage vs columnar storage

  • Row-oriented storage lays out all fields of a record together, so the values of a single column are scattered across storage blocks
  • Columnar storage groups each column's values together, which benefits queries that touch only a few columns (see the sketch below)
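
A minimal sketch contrasting the two layouts for the same table; a column-scan query touches far less data in the columnar layout:

```python
rows = [("alice", 30, "berlin"), ("bob", 25, "paris"), ("carol", 35, "rome")]

# Row-oriented: records stored contiguously, one after another.
row_layout = [field for row in rows for field in row]

# Columnar: each column's values stored contiguously.
names, ages, cities = (list(col) for col in zip(*rows))

# A query like AVG(age) reads only the 'ages' run in columnar form,
# but must skip through every record in the row layout.
print(sum(ages) / len(ages))  # 30.0
```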

ORC File

  • The ORC file format was developed for better data handling in Hive, targeting large datasets and evolving schemas.
  • It uses column-wise compression and supports various types, including date, decimal, and complex structures, with built-in indexes

Parquet

  • Apache Parquet is an efficient column-oriented storage format.
  • It supports various encodings for efficient data compression, and its architecture supports column-wise operations.
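
A short example of writing and reading Parquet with the pyarrow library (assuming pyarrow is installed; file and column names are arbitrary):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user": ["alice", "bob"], "clicks": [17, 3]})
pq.write_table(table, "clicks.parquet")  # columnar, compressed on disk

# Column pruning: read back only the 'clicks' column.
clicks = pq.read_table("clicks.parquet", columns=["clicks"])
print(clicks.to_pydict())  # {'clicks': [17, 3]}
```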

Summary

  • Several core file systems are considered: basic file systems, the Network File System (NFS), the Google File System (GFS), and the Hadoop Distributed File System (HDFS).
  • Formats include text, sequence, RCFile, ORC, and Parquet.

Next Part

  • The next portion covers key-value stores
