Distributed File Systems Quiz
31 Questions
Questions and Answers

What is a primary design consideration of the Google File System (GFS)?

  • Emphasis on end-user applications
  • Support for small files primarily
  • Focus on perfect fault tolerance
  • Handling huge data workloads efficiently (correct)

The Hadoop Distributed File System (HDFS) was originally developed for Apache Hadoop's web search engine.

False (B) - it was originally built for the Apache Nutch web search engine.

What does the acronym RPC stand for, as used in the context of NFS?

Remote Procedure Call

In GFS, the primary server stores __________ and monitors chunk servers.

metadata

Match the following systems with their characteristics:

NFS = Stateless and idempotent operations for fault tolerance
GFS = Designed for big data workloads with atomic append operations
HDFS = Similar to GFS and designed for web search engine infrastructure

Which of the following is NOT a characteristic of GFS?

Support for numerous small files (D)

HDFS is utilized by companies like Facebook, LinkedIn, and Twitter.

True (A)

What is the main reason for adopting fault tolerance in GFS?

Component failures are the norm in clusters with commodity machines.

What is the primary task of the Namenode in HDFS?

Coordinate and manage metadata (D)

Data in HDFS is typically modified frequently.

False (B)

How often does a Datanode send a heartbeat signal to the Namenode?

Every 3 seconds

The design models of HDFS are similar to __________.

GFS

Match the following HDFS components with their descriptions:

Namenode = Coordinates and manages metadata
Datanode = Stores data blocks
Checkpoint Node = Combines checkpoints and journal
Backup Node = Maintains in-memory filesystem namespace

Which of the following statements about Datanodes is incorrect?

Datanodes can store entire file content in one block. (C)

Erasure coding is used in HDFS for data replication.

False (B)

What happens if a Datanode fails to send a heartbeat signal?

It is considered dead.

HDFS places the second and third replicas of a data block in __________ racks to optimize data availability.

different

What is the typical maximum size of a block in HDFS?

64 MB (D)

What is the primary benefit of using specialized file formats in data storage?

Faster read and write times (A)

Hadoop Distributed File System (HDFS) is an open-source clone of the Google File System (GFS).

True (A)

What challenge does HDFS face when dealing with large datasets?

Finding the relevant data in a particular location

___ is a coding technique that helps recover lost data by using parity checks.

Erasure coding

Match the file formats with their descriptions:

HDFS = Distributed file system designed for big data
RCFile = Record Columnar File optimized for relational data
Parquet = Columnar storage format for efficient data processing
ORC = Optimized Row Columnar format for Hive

Which file format is specifically designed for use with Apache Hive?

ORC (C)

Replication allows for N-1 failures to be tolerated in storage systems.

True (A)

What feature distinguishes the Parquet file format from RCFile?

Column-wise compression

The ___ allows for encoding of both keys and values, stored in binary format.

Sequence file

Which of the following statements about Galois Fields is true?

They are utilized primarily in error correction techniques. (B)

The performance of traditional text files is superior to that of specialized file formats.

False (B)

What does HDFS stand for?

Hadoop Distributed File System

ER(n, k) stands for Reed-Solomon encoding with ___ data symbols and ___ parity checks.

n (data symbols) and k (parity checks)

Study Notes

Big Data Systems

• Big data systems are concerned with distributed file systems
• A joint seminar on database systems was held with TU Darmstadt
• Antonis Katsarakis (Huawei) presented on Wednesday, 16:15-17:00, via Zoom
• His topic, the Dandelion Hashtable, is an in-memory hash table that handles more than a billion requests per second on commodity servers

Distributed File Systems

• File systems are fundamental to data storage
• This part covers the basics of file systems, network file systems, the Google File System, and the Hadoop Distributed File System
• Erasure coding and file formats are key parts of distributed file systems

File Systems

• The motivation is to provide abstractions for interacting with files
• POSIX aligns with the ISO C 1999 standard
• POSIX includes file locking and directories
• Files are byte vectors containing data on external storage

HDD

• Magnetic disks have caches ranging from 8 MB to 128 MB
• The cost of hard disk drives (HDD) has decreased significantly
• Quoted HDD prices vary and reflect the date the slides were created
• HDDs use SATA and SCSI/SAS connections

SSD

• Solid-state drives (SSDs) use NAND-based flash memory
• SSDs have no moving mechanical components
• SSD cache sizes range from 16 MB to 512 MB
• Quoted SSD costs likewise reflect the date of the slides
• SSDs use PCI Express connections

Hard Drive Failures

• Hard drives (HDDs) fail at measurable annual rates
• Failure rates vary by size and model
• On average, 5% failed annually in the first 3 years

Bitrot

• Bitrot is silent data corruption on hard drives
• HDD specifications estimate uncorrectable bit error rates to be exceptionally low (about 1 in 10^15 bits read)
• Testing with 8x100 GB HDDs after 2 PB of reads revealed 4 read errors

File System Operations

• File operations include opening, reading, writing, and closing files
• Directory operations include creating files, renaming files and directories, deleting files, and managing metadata and locks

Linux Ext2

• Drives are partitioned into block groups
• The superblock carries system data regarding blocks, free space, etc.
• Bitmaps track free data blocks and inodes
• Inodes identify files and directories
• Data blocks contain the actual data

Ext2 Inode (1)

• Inode structures include owner/group, file length, type/access rights, number of data blocks, pointers to data blocks (direct and indirect), and timestamps
• Different file types are denoted, including files, directories, symbolic links, and sockets

Ext2 Inode (2)

• Inode structures have 12 direct block pointers and indirect blocks that point to data blocks (single, double, and triple indirect)

Ext2 Example

• Ext2 file systems have file and block size limits, which follow from the inode pointer layout (see the sketch below)
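
A quick back-of-the-envelope calculation shows where those limits come from. This Python sketch assumes the classic layout of 12 direct pointers, one single-, one double-, and one triple-indirect block, and 4-byte block pointers:

```python
# Maximum ext2 file size implied by the inode pointer layout.
# Note: real ext2 imposes additional 32-bit size limits, so the actual
# caps for larger block sizes are lower than this upper bound.

def ext2_max_file_size(block_size: int, ptr_size: int = 4) -> int:
    ptrs_per_block = block_size // ptr_size   # pointers held by one indirect block
    addressable_blocks = (
        12                                    # direct pointers
        + ptrs_per_block                      # single indirect
        + ptrs_per_block ** 2                 # double indirect
        + ptrs_per_block ** 3                 # triple indirect
    )
    return addressable_blocks * block_size

for bs in (1024, 2048, 4096):
    print(f"block size {bs:>4} B -> max file ~{ext2_max_file_size(bs) / 2**30:.0f} GiB")
```

With 1 KiB blocks this yields roughly 16 GiB, which matches the commonly cited ext2 limit for that block size.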

Network File System (NFS)

• NFS enables a consistent namespace and access to file systems across computers
• It is designed for file access within a local LAN
• Each file resides on a server and is accessed by clients
• Clients treat remote files as if they were local
• NFS has a user-centric design with few concurrent writes

Basic NFS

• NFS file access passes through a system call layer, a virtual file system, the local file system interface, NFS clients and servers, RPC client and server stubs, and the network

Sending commands

• NFS uses Remote Procedure Calls (RPCs) to propagate file system operations
• The naive method forwards each operation as an RPC to the server, so every access pays a network round trip

Solution: Caching

• NFS clients use caching, maintaining copies of remote files and synchronizing with the server periodically
• The original Sun NFS from 1984 employed in-memory caching, allowing files to be accessed without network activity

Caching & Failures

• Server crashes can lose unsaved data and cause offset errors in concurrent accesses
• Communication failures can lead to inconsistent data handling and unintended file deletions during concurrent modification or creation
• Client crashes likewise lose unsaved data

Solution: Stateless RPC

• Stateless RPCs avoid maintaining state across commands and sessions
• Commands such as read() are stateless: the client passes all context (e.g., the file offset) with each request, so the server tracks nothing
• This lets a server continue after a crash without recovering previously stored state (see the sketch below)
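
A toy sketch (plain Python, not the real NFS wire protocol) of why carrying the offset in every request makes reads idempotent and server restarts harmless; the path used at the bottom is purely illustrative:

```python
# Stateless-style read: the client supplies path, offset, and length on
# every call, so a freshly restarted server can answer immediately.
# Contrast with a stateful design, where the server tracks a per-client
# file offset that is lost on a crash.

class StatelessFileServer:
    def read(self, path: str, offset: int, count: int) -> bytes:
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(count)

# The client keeps its own offset and can simply retry an idempotent
# request after a timeout; re-executing the same read returns the same bytes.
server = StatelessFileServer()
offset = 0
chunk = server.read("/etc/hostname", offset, 4096)   # illustrative path
offset += len(chunk)
```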

Concurrent Writes in NFS

• NFS offers no guarantees for concurrent writes
• Concurrent writes can leave data inconsistent when multiple clients modify the same file simultaneously

NFS Summary

• Functionality includes transparent remote file access via the virtual file system, with caching for improved performance
• Stateless RPC provides fault tolerance, with flush-on-close semantics
• Concurrent writes are not guaranteed

Google File System (GFS)

• GFS is a distributed file system designed specifically for big data workloads
• GFS handles large files, frequent appends, high concurrency, and high bandwidth
• GFS is robust and tolerant of failures, supporting thousands of machines
• GFS provides a non-POSIX API designed for a scalable implementation

GFS - Design Assumptions

• Component failures are treated as the norm, so the design prioritizes fault tolerance over individually reliable hardware
• GFS targets large files, streaming reads, and infrequent random writes
• The coherence model is simplified, prioritizing write-once-read-many patterns

Hadoop Distributed File System (HDFS)

• HDFS is a distributed file system similar to GFS, designed for big data workloads: large files, frequent appends, high concurrency, and high bandwidth
• It was originally built for the Apache Nutch web search engine
• HDFS is a robust system designed for hundreds or thousands of commodity machines, where failures are common

HDFS Architecture

• HDFS employs a client-namenode-datanode architecture
• There is only one namenode, but many datanodes storing data blocks
• The namenode manages metadata and coordinates tasks between datanodes

Namenode

• The namenode holds the entire file system namespace in RAM
• Metadata such as the hierarchy of files and directories, attributes, quotas, and access information is stored
• Checkpoints and journaling provide fault tolerance

Checkpoint/Backup Node

• Checkpoint and backup roles are specified during node startup
• Backup nodes maintain a synchronized in-memory copy of the file system namespace

Datanode

• Datanodes coordinate with the namenode and perform handshakes for verification
• Block reports, containing metadata for the stored data blocks, are sent periodically
• Heartbeats sent to the namenode confirm datanode health and availability; this also aids rebalancing

Node failure detection

• The namenode was originally a single point of failure; later versions support multiple namenode instances, each serving a region of the namespace
• Datanodes send heartbeat signals every 3 seconds; when heartbeats stop arriving, the namenode declares the datanode dead (a minimal sketch follows)
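
A minimal sketch of the namenode-side bookkeeping under these assumptions. Only the 3-second interval comes from the lecture; the grace period before declaring a node dead is illustrative:

```python
import time

HEARTBEAT_INTERVAL = 3.0                  # seconds, per the lecture
DEAD_AFTER = 10 * HEARTBEAT_INTERVAL      # illustrative grace period

# Namenode-side table: datanode id -> timestamp of last heartbeat.
last_heartbeat: dict[str, float] = {}

def on_heartbeat(datanode_id: str) -> None:
    last_heartbeat[datanode_id] = time.monotonic()

def dead_datanodes() -> list[str]:
    # Any node silent for longer than the grace period is considered dead;
    # its blocks would then be re-replicated elsewhere.
    now = time.monotonic()
    return [node for node, ts in last_heartbeat.items() if now - ts > DEAD_AFTER]

on_heartbeat("datanode-1")
print(dead_datanodes())   # [] while heartbeats keep arriving
```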

Corrupted Data

• Data blocks in HDFS include checksums (a sketch follows)
• Datanodes periodically send FileChecksums to the namenode
• Corrupted data does not automatically get deleted
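
A sketch of per-chunk checksumming in Python. CRC32 over 512-byte chunks is an assumption in the spirit of HDFS's defaults, not a claim about the exact on-disk layout:

```python
import zlib

CHUNK = 512  # illustrative chunk size

def checksums(block: bytes) -> list[int]:
    # One CRC32 per fixed-size chunk, stored alongside the block data.
    return [zlib.crc32(block[i:i + CHUNK]) for i in range(0, len(block), CHUNK)]

def verify(block: bytes, stored: list[int]) -> bool:
    # On mismatch a reader can fetch the block from another replica;
    # the corrupted replica is reported rather than silently deleted.
    return checksums(block) == stored

data = b"some block payload" * 100
stored = checksums(data)
assert verify(data, stored)
assert not verify(data[:-1] + b"X", stored)   # bitrot detected
```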

Block Placement

• Placement favors nodes within a rack over cross-rack placement
• Network bandwidth within a rack (or cabinet) is typically higher than cross-rack bandwidth, so blocks are placed accordingly; a rack-aware policy is sketched below
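
A toy rack-aware policy in the spirit of HDFS's default placement, where the first replica lands on the writer's node and the other two share one remote rack. The topology data and helper names here are illustrative:

```python
import random

def place_replicas(topology: dict[str, list[str]], writer_node: str,
                   writer_rack: str) -> list[str]:
    # First replica: the writer's own node (cheap, no network hop).
    replicas = [writer_node]
    # Second and third replicas: two distinct nodes in one remote rack,
    # trading a little cross-rack traffic for rack-failure tolerance.
    remote_rack = random.choice([r for r in topology if r != writer_rack])
    replicas += random.sample(topology[remote_rack], 2)
    return replicas

topology = {"rack-A": ["a1", "a2", "a3"], "rack-B": ["b1", "b2", "b3"]}
print(place_replicas(topology, writer_node="a1", writer_rack="rack-A"))
```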

Erasure Coding

• Erasure coding is an alternative to replication that improves storage utilization when space and cost are a concern
• It encodes data into multiple blocks and allows recovery if some blocks are lost (a single-parity sketch follows)
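
A minimal single-parity sketch conveys the idea: one parity block is the XOR of the data blocks, so any one lost block can be rebuilt from the survivors. Production systems such as HDFS use Reed-Solomon codes over Galois fields instead, which tolerate several simultaneous losses:

```python
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    # XOR equal-length blocks byte by byte.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)          # parity = A ^ B ^ C

# Lose data[1]; rebuild it from the surviving blocks plus parity:
# A ^ C ^ (A ^ B ^ C) == B
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == b"BBBB"
```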

(Big Data) File Format

• Specialized file formats improve efficiency in big data systems: handling large datasets, dealing with schema changes over time, and optimizing file sizes on disk

File Format

• Challenges in HDFS include finding the relevant data, managing space for very large datasets, and coping with continuously evolving schemas
• Specialized file formats offer faster reads and writes and better schema management, along with features such as splittable files and data compression

Common HDFS Storage Formats

• Text files, such as CSV and TSV, are common and simple: each line is a record, terminated by a newline
• Sequence files encode key/value pairs in binary form, are useful for MapReduce tasks, and support block-level compression (a simplified sketch follows)
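
A simplified length-prefixed key/value encoding in Python. The real Hadoop SequenceFile format additionally has headers, sync markers, and optional compression, so this captures only the core idea:

```python
import struct

def write_records(path: str, records: list[tuple[bytes, bytes]]) -> None:
    with open(path, "wb") as f:
        for key, value in records:
            # Big-endian 4-byte lengths, then the raw key and value bytes.
            f.write(struct.pack(">II", len(key), len(value)))
            f.write(key)
            f.write(value)

def read_records(path: str) -> list[tuple[bytes, bytes]]:
    out = []
    with open(path, "rb") as f:
        while header := f.read(8):
            klen, vlen = struct.unpack(">II", header)
            out.append((f.read(klen), f.read(vlen)))
    return out

write_records("pairs.bin", [(b"key1", b"value1"), (b"key2", b"value2")])
print(read_records("pairs.bin"))
```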

RCFile

• Developed at Facebook, RCFile is a record-columnar file format designed for relational tables, offering fast data loading and effective space management
• It is also adaptable to varying query patterns

Row-oriented storage vs columnar storage

• Row-oriented storage lays out complete records consecutively, so the values of a single column end up scattered across storage blocks
• Columnar storage lays out each column contiguously, which benefits queries that read only a few columns (see the illustration below)
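
A toy illustration of the two layouts for the same three-row table (plain Python, purely illustrative):

```python
rows = [(1, "alice", 34), (2, "bob", 27), (3, "carol", 41)]

# Row-oriented: complete records stored consecutively; a single column's
# values are interleaved with everything else.
row_layout = [value for record in rows for value in record]

# Column-oriented: each column stored contiguously.
col_layout = {
    "id":   [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "age":  [r[2] for r in rows],
}

# A query like "average age" reads only one contiguous column here:
print(sum(col_layout["age"]) / len(col_layout["age"]))
```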

ORC File

• The ORC file format was developed for better data handling in Hive and is designed for large datasets and evolving schemas
• It uses column-wise compression and offers rich types, including date, decimal, and complex structures, along with indexes

Parquet

• Apache Parquet is an efficient column-oriented storage format
• It supports various encodings for efficient data compression, and its architecture supports column-wise operations (see the sketch below)
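
A minimal usage sketch with the pyarrow library (assumed to be installed; the file name is illustrative), writing a Parquet file and reading back a single column, which is where the columnar layout pays off:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table and write it as Parquet
# (column-wise encoding and compression happen inside write_table).
table = pa.table({"id": [1, 2, 3], "name": ["alice", "bob", "carol"]})
pq.write_table(table, "example.parquet")

# Reading only the "name" column avoids touching the "id" column's bytes.
names = pq.read_table("example.parquet", columns=["name"])
print(names.to_pydict())
```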

Summary

• The lecture covered several core file systems: basic file systems, the Network File System (NFS), the Google File System (GFS), and the Hadoop Distributed File System (HDFS)
• File formats covered include text, sequence files, RCFile, ORC, and Parquet

Next Part

• The next part covers key-value stores


Description

Test your knowledge of distributed file systems with this quiz, focusing on the Google File System (GFS) and the Hadoop Distributed File System (HDFS). It covers key concepts such as fault tolerance, system characteristics, and relevant acronyms like RPC.
