Questions and Answers
What is a primary design consideration of the Google File System (GFS)?
The Hadoop Distributed File System (HDFS) was originally developed for Apache Hadoop's web search engine.
False
What does the acronym RPC stand for, as used in the context of NFS?
Remote Procedure Call
In GFS, the primary server stores __________ and monitors chunk servers.
Match the following systems with their characteristics:
Which of the following is NOT a characteristic of GFS?
HDFS is utilized by companies like Facebook, LinkedIn, and Twitter.
What is the main reason for adopting fault tolerance in GFS?
What is the primary task of the Namenode in HDFS?
Data in HDFS is typically modified frequently.
How often does a Datanode send a heartbeat signal to the Namenode?
The design models of HDFS are similar to __________.
Match the following HDFS components with their descriptions:
Which of the following statements about Datanodes is incorrect?
Erasure coding is used in HDFS for data replication.
What happens if a Datanode fails to send a heartbeat signal?
HDFS places the second and third replicas of a data block in __________ racks to optimize data availability.
What is the maximum typical size of a block in HDFS?
What is the primary benefit of using specialized file formats in data storage?
Hadoop Distributed File System (HDFS) is an open-source clone of the Google File System (GFS).
What challenge does HDFS face when dealing with large datasets?
___ is a coding technique that helps recover lost data by using parity checks.
Match the file formats with their descriptions:
Which file format is specifically designed for use with Apache Hive?
Replication allows for N-1 failures to be tolerated in storage systems.
What feature distinguishes the Parquet file format from RCFile?
The ___ allows for encoding of both keys and values, stored in binary format.
Which of the following statements about Galois Fields is true?
The performance of traditional text files is superior to that of specialized file formats.
What does HDFS stand for?
ER(n, k) stands for Reed-Solomon encoding with ___ data symbols and ___ parity checks.
Study Notes
Big Data Systems
- Big data systems are concerned with distributed file systems
- A seminar on database systems was held jointly with TU Darmstadt
- Speaker Antonis Katsarakis (Huawei) presented on Wednesday, 16:15-17:00, via Zoom
- The Dandelion Hashtable sustains in-memory request throughput beyond a billion requests per second on commodity servers
Distributed File Systems
- File systems are fundamental to data storage
- Topics include file system basics, the Network File System (NFS), the Google File System (GFS), and the Hadoop Distributed File System (HDFS)
- Erasure coding and file formats are key parts of distributed file systems
File Systems
- The motivation is to provide abstractions for interacting with files
- POSIX aligns with the ISO C standard (C99)
- POSIX includes file locking and directories
- Files are byte vectors containing data on external storage
HDD
- Magnetic disks have caches ranging from 8MB to 128MB
- Cost of hard disk drives (HDD) has decreased significantly
- HDD prices vary; quoted figures reflect the date the slides were created
- HDDs use SATA, SCSI/SAS connections
SSD
- Solid-state drives (SSDs) use NAND-based flash memory
- SSDs have no moving mechanical components
- SSD cache size ranges from 16MB to 512MB
- SSD costs also vary and reflect the date the slides were created
- SSDs use PCI Express connections
Hard Drive Failures
- Hard disk drives (HDD) exhibit measurable failure rates
- Failure rates vary by drive size and model
- On average, about 5% of drives failed annually within their first 3 years
Bitrot
- Bitrot is silent data corruption in hard drives.
- HDD specifications estimate uncorrectable bit error rates to be exceptionally low (on the order of 10⁻¹⁵ per bit read)
- Testing with 8 × 100 GB HDDs after 2 PB of reads revealed 4 read errors
File System Operations
- File operations include opening, reading, writing, and closing files.
- Directory operations include creating, renaming, and deleting files and directories, plus metadata access and locking (see the sketch below).
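A minimal sketch of these operations using Python's os module, which exposes the POSIX-style calls directly; the file and directory names are illustrative:

```python
import os

# Open (creating if necessary), write, reposition, read, and close a file
# using the POSIX-style calls exposed by Python's os module.
fd = os.open("example.txt", os.O_RDWR | os.O_CREAT, 0o644)
os.write(fd, b"hello, file system\n")    # write bytes at the current offset
os.lseek(fd, 0, os.SEEK_SET)             # rewind to the start of the file
data = os.read(fd, 1024)                 # read up to 1024 bytes
os.close(fd)

# Directory operations: create, rename into, inspect metadata, delete.
os.mkdir("demo_dir")
os.rename("example.txt", "demo_dir/example.txt")
print(os.stat("demo_dir/example.txt").st_size)   # metadata via stat()
os.remove("demo_dir/example.txt")
os.rmdir("demo_dir")
```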
Linux Ext2
- Drives are partitioned into block groups
- Superblock carries system data regarding blocks, free space, etc.
- Bitmaps track free data blocks and inodes
- Inodes are used to identify files and directories
- Data blocks contain the actual data
Ext2 Inode (1)
- Inode structures include owner/group, file length, type/access rights, number of data blocks, pointers to data blocks (direct and indirect), and timestamps
- Different file types are denoted including files, directories, symbolic links, and sockets
Ext2 Inode (2)
- An inode has 12 direct block pointers plus indirect pointers (single, double, and triple) that point to blocks of further pointers
Ext2 Example
- Ext2 file systems have file and block size limits; the worked example below shows where they come from
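A short calculation of the file size reachable through one inode, assuming 4 KiB blocks and 4-byte block pointers (a typical ext2 configuration):

```python
# Addressable file size through one ext2 inode, assuming 4 KiB blocks and
# 4-byte block pointers (so one indirect block holds 4096 / 4 = 1024 pointers).
BLOCK_SIZE = 4096
PTRS_PER_BLOCK = BLOCK_SIZE // 4          # 1024 pointers per indirect block

direct = 12                               # 12 direct pointers in the inode
single = PTRS_PER_BLOCK                   # single indirect
double = PTRS_PER_BLOCK ** 2              # double indirect
triple = PTRS_PER_BLOCK ** 3              # triple indirect

total_blocks = direct + single + double + triple
max_file_size = total_blocks * BLOCK_SIZE
print(f"{max_file_size / 2**40:.2f} TiB addressable")   # ~4 TiB
```

In practice ext2 caps file sizes lower (around 2 TiB with 4 KiB blocks) because of 32-bit size fields, but the pointer arithmetic above is where the addressable limit comes from.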
Network File System (NFS)
- NFS enables consistent namespace and access to filesystems across computers
- It is typically used for file access within a local area network (LAN).
- Each file resides on a server which is accessed by clients.
- Clients treat remote files as if they were local.
- NFS has a user-centric design that assumes few concurrent writes.
Basic NFS
- The NFS stack comprises the system call layer, the virtual file system, the local file system interface, NFS client and server, RPC client and server stubs, and the network over which files are accessed
Sending commands
- NFS leverages Remote Procedure Calls (RPCs) to propagate file system operations.
- The naive method forwards every file system operation as an RPC to the server
Solution: Caching
- NFS clients cache copies of remote files locally and synchronize with the server periodically (a minimal sketch follows below).
- The original Sun NFS from 1984 employed in-memory caching, allowing files to be accessed without network activity.
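A minimal sketch of TTL-based client caching with attribute revalidation. The two server calls are hypothetical stand-ins for the real RPCs, and the 3-second TTL is illustrative:

```python
import time

def fetch_from_server(path):
    # Hypothetical full-fetch RPC; returns (data, server mtime).
    return b"file contents", 1700000000.0

def get_server_mtime(path):
    # Hypothetical cheap attribute-check RPC.
    return 1700000000.0

CACHE_TTL = 3.0   # seconds between revalidations (illustrative)
_cache = {}       # path -> (data, server_mtime, last_checked)

def read_cached(path):
    now = time.time()
    if path in _cache:
        data, mtime, last_checked = _cache[path]
        if now - last_checked < CACHE_TTL:
            return data                          # fresh: no network traffic
        if get_server_mtime(path) == mtime:      # cheap revalidation RPC
            _cache[path] = (data, mtime, now)
            return data
    data, mtime = fetch_from_server(path)        # full fetch on miss/staleness
    _cache[path] = (data, mtime, now)
    return data
```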
Caching & Failures
- Server crashes can lead to loss of unsaved data and offset errors in concurrent accesses.
- Communication failures can lead to inconsistent data handling and unintended file deletions during concurrent modification or creation.
- Client crashes likewise result in the loss of unsaved data
Solution: Stateless RPC
- Stateless RPCs avoid maintaining state across commands and sessions.
- Commands such as read() are stateless: the server does not track any per-client context.
- This resilience lets servers continue after a crash without recovering previously stored state (see the sketch below).
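A sketch of what statelessness buys: the handler below receives everything it needs (handle, offset, count) in the request itself, so a freshly restarted server needs no session recovery. resolve_handle() is a hypothetical placeholder, not a real NFS call:

```python
def resolve_handle(file_handle: str) -> str:
    # Placeholder: in this sketch a handle simply maps to a local path.
    return file_handle

def handle_read(file_handle: str, offset: int, count: int) -> bytes:
    # No session lookup: the request itself says exactly what to read.
    with open(resolve_handle(file_handle), "rb") as f:
        f.seek(offset)
        return f.read(count)

# A stateful design would instead keep an implicit per-client cursor,
# e.g. handle_read(session_id, count), which breaks when the server
# crashes and forgets its sessions.
```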
Concurrent Writes in NFS
- NFS lacks guarantees for concurrent writes.
- Concurrent writes can lead to inconsistent updates when multiple clients modify the same file simultaneously.
NFS Summary
- Functionality includes transparent remote file access through the virtual file system layer, with client-side caching for improved performance.
- Stateless RPC for fault tolerance with flush-on-close semantics.
- Concurrent writes are not guaranteed.
Google File System (GFS)
- GFS is a distributed file system specifically designed for big data workloads.
- GFS targets large files, frequent appends, high concurrency, and high bandwidth.
- GFS is robust and tolerant of failures, supporting thousands of machines.
- GFS provides a non-POSIX API designed for a scalable implementation
GFS - Design Assumptions
- Component failures are treated as the norm rather than the exception, so fault tolerance takes priority over individual component reliability.
- GFS handles large files, streaming reads, and infrequent random writes.
- The coherence model is simplified, prioritizing write-once-read-many patterns.
Hadoop Distributed File System (HDFS)
- HDFS is a distributed file system similar to GFS, designed for big data workloads, handling large files, frequent appends, high concurrency, and high bandwidth.
- It was originally built for the Apache Nutch web search engine.
- HDFS is a robust system designed for hundreds or thousands of commodity machines where failures are common.
HDFS Architecture
- HDFS employs a client-namenode-datanode architecture.
- Only one namenode, but many datanodes storing data blocks
- Namenode manages metadata and coordinates tasks between datanodes
Namenode
- The namenode holds the entire file system's namespace in RAM.
- Metadata such as hierarchy of files and directories, attributes, quotas, and access information are stored.
- Checkpoints and journaling facilitate fault tolerance (a minimal sketch follows below)
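A minimal sketch of the checkpoint-plus-journal idea: mutations are appended to an edit log before being applied in memory, and recovery replays the log over the last checkpoint. The file names and JSON record format are illustrative, not HDFS's actual on-disk formats:

```python
import json

namespace = {}   # path -> metadata: the in-memory file system namespace

def apply(op):
    # Apply a single namespace mutation in memory.
    if op["type"] == "create":
        namespace[op["path"]] = {"replication": op.get("replication", 3)}
    elif op["type"] == "delete":
        namespace.pop(op["path"], None)

def log_and_apply(op, journal="edits.log"):
    # Journal first (durability), then mutate the in-memory state.
    with open(journal, "a") as j:
        j.write(json.dumps(op) + "\n")
    apply(op)

def recover(checkpoint="fsimage.json", journal="edits.log"):
    # Load the last checkpoint, then replay journaled edits on top of it.
    global namespace
    try:
        with open(checkpoint) as c:
            namespace = json.load(c)
    except FileNotFoundError:
        namespace = {}
    try:
        with open(journal) as j:
            for line in j:
                apply(json.loads(line))
    except FileNotFoundError:
        pass

log_and_apply({"type": "create", "path": "/data/part-0000"})
recover()   # after a crash, this rebuilds the same namespace
```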
Checkpoint/Backup Node
- Checkpoint and backup nodes allow roles to be specified during node startup
- Backup nodes maintain a synchronized in-memory copy of the file system namespace
Datanode
- The datanode coordinates with the namenode and performs handshakes for verification.
- Block reports are sent periodically, containing metadata for data blocks.
- Heartbeats, sent to the namenode, confirm datanode health and availability. This aids in rebalancing.
Node failure detection
- The namenode was previously a single point of failure; later HDFS versions allow multiple namenodes, each managing a region of the namespace.
- Datanodes send heartbeat signals every 3 seconds; if no heartbeat arrives for an extended period (10 minutes by default), the namenode considers the datanode dead (see the sketch below).
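A minimal sketch of heartbeat-based failure detection; the data structure is illustrative, while the 3-second interval and 10-minute timeout follow HDFS defaults:

```python
import time

HEARTBEAT_INTERVAL = 3      # seconds between heartbeats (HDFS default)
DEAD_TIMEOUT = 10 * 60      # seconds of silence before declaring a node dead

last_heartbeat = {}         # datanode_id -> timestamp of its last heartbeat

def on_heartbeat(datanode_id):
    # Called whenever a datanode's heartbeat arrives at the namenode.
    last_heartbeat[datanode_id] = time.time()

def dead_datanodes():
    # Nodes whose last heartbeat is older than the timeout are presumed dead;
    # their blocks become candidates for re-replication.
    now = time.time()
    return [node for node, ts in last_heartbeat.items()
            if now - ts > DEAD_TIMEOUT]

on_heartbeat("datanode-7")
print(dead_datanodes())     # [] while heartbeats are recent
```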
Corrupted Data
- Data blocks in HDFS include checksums.
- Datanodes periodically send FileChecksums to the namenode.
- Corrupted data does not automatically get deleted (a checksum sketch follows below).
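A minimal sketch of per-block checksum verification. HDFS actually checksums in small chunks (512 bytes by default, CRC32C); a single CRC32 over the whole block is used here for brevity:

```python
import zlib

def write_block(data: bytes):
    # Store the block together with its checksum.
    return data, zlib.crc32(data)

def read_block(data: bytes, stored_crc: int) -> bytes:
    # Verify on read: a mismatch means silent corruption (bitrot) occurred.
    if zlib.crc32(data) != stored_crc:
        raise IOError("checksum mismatch: block is corrupted")
    return data

block, crc = write_block(b"some block contents")
assert read_block(block, crc) == block    # intact block round-trips cleanly
```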
Block Placement
- Block placement prefers nodes within a rack over cross-rack placement (see the sketch below)
- Network bandwidth within a rack (or cabinet) is typically higher than cross-rack bandwidth, so blocks are placed accordingly.
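A minimal sketch of a 3-replica, rack-aware policy: first replica on the writer's node, second and third on two nodes in one other rack, so a whole-rack failure still leaves a live copy. The cluster map and names are illustrative:

```python
import random

def place_replicas(cluster, writer_node, writer_rack):
    # Replica 1: the writer's own node (cheapest write path).
    replicas = [writer_node]
    # Replicas 2 and 3: two different nodes in one other rack.
    remote_rack = random.choice([r for r in cluster if r != writer_rack])
    replicas += random.sample(cluster[remote_rack], 2)
    return replicas

cluster = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
print(place_replicas(cluster, "n1", "rack1"))   # e.g. ['n1', 'n5', 'n4']
```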
Erasure Coding
- Erasure coding is used as an alternative to replication, to improve storage utilization when space and cost are a concern
- It encodes data into multiple blocks plus parity and allows recovery if some blocks are lost (see the sketch below).
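A sketch of the idea using the simplest possible erasure code: a single XOR parity block over k data blocks, which tolerates the loss of any one block. HDFS actually uses Reed-Solomon codes (e.g. RS(6,3)) that tolerate multiple losses, but the recover-from-parity principle is the same:

```python
def xor_blocks(blocks):
    # Byte-wise XOR of equally sized blocks.
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # k = 3 data blocks
parity = xor_blocks(data)            # 1 parity block (25% overhead here)

# Lose any single data block: XOR of the survivors plus the parity block
# reconstructs it exactly.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]
```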
(Big Data) File Format
- Specialized file formats are created to improve efficiency in big data systems: handling large datasets, dealing with schema changes over time, and optimizing file sizes on disk.
File Format
- Challenges in HDFS include locating data, managing space for very large datasets, and coping with continuously evolving schemas.
- Specialized file formats offer faster read and write times and improved schema management tools, such as splittable files or support for data compression.
Common HDFS Storage Formats
- Text files, such as CSV and TSV, are commonly used and are simple, with each line being a record, separated by a newline.
- Sequence files support key/value encoding, useful for MapReduce tasks, and support block-level compression (a minimal sketch follows below).
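A sketch of length-prefixed binary key/value records, the core idea behind sequence files. This is not Hadoop's actual SequenceFile on-disk format, which adds headers, sync markers, and optional compression:

```python
import struct

def write_record(f, key: bytes, value: bytes):
    # Each record: 4-byte key length, 4-byte value length, then the bytes.
    f.write(struct.pack(">II", len(key), len(value)))
    f.write(key)
    f.write(value)

def read_records(f):
    while header := f.read(8):
        key_len, val_len = struct.unpack(">II", header)
        yield f.read(key_len), f.read(val_len)

with open("records.bin", "wb") as f:
    write_record(f, b"user:1", b"alice")
    write_record(f, b"user:2", b"bob")

with open("records.bin", "rb") as f:
    print(list(read_records(f)))   # [(b'user:1', b'alice'), (b'user:2', b'bob')]
```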
RCFile
- Developed at Facebook, RCFile is a record-columnar file format designed for relational tables, offering fast data loading and effective space management.
- It is also adaptable to varying query patterns.
Row-oriented storage vs columnar storage
- Traditional row-oriented storage lays out all fields of a record contiguously.
- As a result, the values of a single column end up scattered across storage blocks.
- Columnar storage lays data out column by column, which benefits queries that touch only a few columns (see the illustration below).
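A small illustration of the two layouts over the same three records; the data is invented for the example:

```python
records = [("alice", 34, "berlin"), ("bob", 29, "paris"), ("carol", 41, "rome")]

# Row-oriented: records are contiguous, so one column's values are scattered.
row_layout = [value for record in records for value in record]
# ['alice', 34, 'berlin', 'bob', 29, 'paris', 'carol', 41, 'rome']

# Column-oriented: each column is contiguous.
column_layout = list(zip(*records))
# [('alice', 'bob', 'carol'), (34, 29, 41), ('berlin', 'paris', 'rome')]

ages = column_layout[1]   # a single contiguous slice answers "SELECT age"
```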
ORC File
- The ORC file format was developed for better data handling in Hive and is designed for large datasets and rich schemas.
- It uses column-wise compression and supports rich types, including date, decimal, and complex nested structures, along with built-in indexes.
Parquet
- Apache Parquet is an efficient column-oriented storage format.
- It supports various encodings for efficient data compression, and its architecture supports column-wise operations (see the example below).
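A short example using the pyarrow library (an assumption; any Parquet-capable library would do) to write a table and read back a single column, exercising Parquet's column projection:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table and persist it as Parquet; the file name
# and data are illustrative.
table = pa.table({
    "user": ["alice", "bob", "carol"],
    "age": [34, 29, 41],
})
pq.write_table(table, "users.parquet")   # columnar, compressed on disk

# Column projection: only the "age" column is read from the file.
ages = pq.read_table("users.parquet", columns=["age"])
print(ages.to_pydict())   # {'age': [34, 29, 41]}
```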
Summary
- Several core file systems are covered, including basic file systems, the Network File System (NFS), the Google File System (GFS), and the Hadoop Distributed File System (HDFS).
- Formats include text, sequence, RCFile, ORC, and Parquet.
Next Part
- The next portion covers key-value stores