Questions and Answers
What is a primary design consideration of the Google File System (GFS)?
- Emphasis on end-user applications
- Support for small files primarily
- Focus on perfect fault tolerance
- Handling huge data workloads efficiently (correct)
The Hadoop Distributed File System (HDFS) was originally developed for Apache Hadoop's web search engine.
False
What does the acronym RPC stand for, as used in the context of NFS?
Remote Procedure Call
In GFS, the primary server stores __________ and monitors chunk servers.
Match the following systems with their characteristics:
Which of the following is NOT a characteristic of GFS?
HDFS is utilized by companies like Facebook, LinkedIn, and Twitter.
What is the main reason for adopting fault tolerance in GFS?
What is the primary task of the Namenode in HDFS?
Data in HDFS is typically modified frequently.
How often does a Datanode send a heartbeat signal to the Namenode?
The design models of HDFS are similar to __________.
Match the following HDFS components with their descriptions:
Which of the following statements about Datanodes is incorrect?
Erasure coding is used in HDFS for data replication.
What happens if a Datanode fails to send a heartbeat signal?
HDFS places the second and third replicas of a data block in __________ racks to optimize data availability.
What is the maximum typical size of a block in HDFS?
What is the primary benefit of using specialized file formats in data storage?
Hadoop Distributed File System (HDFS) is an open-source clone of the Google File System (GFS).
What challenge does HDFS face when dealing with large datasets?
___ is a coding technique that helps recover lost data by using parity checks.
Match the file formats with their descriptions:
Which file format is specifically designed for use with Apache Hive?
Replication allows for N-1 failures to be tolerated in storage systems.
What feature distinguishes the Parquet file format from RCFile?
The ___ allows for encoding of both keys and values, stored in binary format.
Which of the following statements about Galois Fields is true?
The performance of traditional text files is superior to that of specialized file formats.
What does HDFS stand for?
ER(n, k) stands for Reed-Solomon encoding with ___ data symbols and ___ parity checks.
Flashcards
NFS
A remote file access system that uses a virtual file system for transparency.
NFS Client-side Caching
Caching is used on the client side to improve performance, by storing copies of data locally.
Stateless and Idempotent RPCs
Stateless and idempotent RPCs help ensure fault tolerance. It means a request can be processed without relying on previous state, and repeated requests have the same effect.
NFS Synchronization and Flush-on-Close
GFS Big Data Workloads
GFS Architecture
GFS Fault Tolerance on Commodity Hardware
HDFS (Hadoop Distributed File System)
Hardware Failure Assumption in HDFS
Large File Handling in HDFS
Streaming Read/Write Optimization in HDFS
Write Once Read Many (WORM) in HDFS
Concurrent Writes in HDFS
High Bandwidth vs. Latency in HDFS
Moving Computation to the Data in HDFS
Namenode's Role in HDFS
Datanode's Role in HDFS
Heartbeat Mechanism in HDFS
Reed-Solomon (n, k) Encoding
Erasure Coding
RCFile (Record Columnar File)
SequenceFile
ORC (Optimized Row Columnar) File
Parquet
Google File System (GFS)
Hadoop Distributed File System (HDFS)
Data Block Encoding
Replication
Text File (CSV, TSV)
Study Notes
Big Data Systems
- Big data systems are concerned with distributed file systems
- A database systems seminar was held jointly with TU Darmstadt
- Antonis Katsarakis (Huawei) presented on Wednesday, 16:15-17:00, via Zoom
- The talk covered the Dandelion Hashtable, an in-memory hash table that handles over a billion requests per second on commodity servers
Distributed File Systems
- File systems are fundamental to data storage
- Topics: file system basics, the Network File System (NFS), the Google File System (GFS), and the Hadoop Distributed File System (HDFS)
- Erasure coding and file formats are key parts of distributed file systems
File Systems
- The motivation is to provide abstractions for interacting with files
- The POSIX file API aligns with the ISO C (C99) standard
- POSIX includes file locking and directories
- Files are byte vectors containing data on external storage
HDD
- Magnetic disks have caches ranging from 8MB to 128MB
- Cost of hard disk drives (HDD) has decreased significantly
- HDD prices vary over time; the quoted figures reflect the date the slides were created
- HDDs use SATA, SCSI/SAS connections
SSD
- Solid-state drives (SSDs) use NAND-based flash memory
- SSDs have no moving mechanical components
- SSD cache size ranges from 16MB to 512MB
- SSD prices also vary over time, reflecting the date the slides were created
- SSDs use PCI Express connections
Hard Drive Failures
- Hard disk drives (HDDs) fail at measurable rates
- Failure rates vary by size and model
- An average of 5% failed annually in the first 3 years.
Bitrot
- Bitrot is silent data corruption in hard drives.
- HDD specifications estimate uncorrectable bit error rates to be exceptionally low (on the order of 10⁻¹⁵ per bit read)
- Testing with 8 × 100 GB HDDs over 2 PB of reads revealed 4 read errors
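As a quick sanity check on those numbers (a back-of-the-envelope computation, not from the slides): at 10⁻¹⁵ uncorrectable errors per bit, 2 PB of reads should produce on the order of 16 errors, so observing 4 is within the spec.

```python
# Expected uncorrectable read errors for 2 PB of reads at the quoted
# bit error rate of 1e-15 per bit (back-of-the-envelope estimate).
bits_read = 2e15 * 8    # 2 PB (decimal petabytes) expressed in bits
ber = 1e-15             # uncorrectable bit error rate from the HDD spec
print(bits_read * ber)  # -> 16.0 expected errors
```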
File System Operations
- File operations include opening, reading, writing, and closing files.
- Directory operations include creating, renaming, and deleting files and directories, plus handling metadata and locks
Linux Ext2
- Drives are partitioned into block groups
- Superblock carries system data regarding blocks, free space, etc.
- Bitmaps track free data blocks and inodes
- Inodes are used to identify files and directories
- Data blocks contain the actual data
Ext2 Inode (1)
- Inode structures include owner/group, file length, type/access rights, number of data blocks, pointers to data blocks (direct and indirect) and timestamps
- Different file types are distinguished, including regular files, directories, symbolic links, and sockets
Ext2 Inode (2)
- An inode holds 12 direct block pointers, plus single, double, and triple indirect pointers that reference blocks of further pointers
Ext2 Example
- Ext2 file and file-system size limits follow from the block size and pointer layout, as the sketch below works out
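As a worked example of those limits, this sketch computes the maximum file size an Ext2 inode can address for a given block size, assuming the classic layout of 12 direct pointers plus single/double/triple indirect blocks with 4-byte pointers (and ignoring other 32-bit limits in the on-disk format):

```python
# Maximum Ext2 file size addressable through one inode, assuming 12
# direct pointers plus single, double, and triple indirect blocks,
# with 4-byte block pointers.

def ext2_max_file_size(block_size: int, ptr_size: int = 4) -> int:
    ptrs = block_size // ptr_size            # pointers per indirect block
    blocks = 12 + ptrs + ptrs**2 + ptrs**3   # direct + 1x/2x/3x indirect
    return blocks * block_size               # total addressable bytes

for bs in (1024, 2048, 4096):
    size_gib = ext2_max_file_size(bs) / 2**30
    print(f"block size {bs} B -> max file size ~{size_gib:,.0f} GiB")
    # 1 KiB blocks -> ~16 GiB, 2 KiB -> ~256 GiB, 4 KiB -> ~4,100 GiB
```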
Network File System (NFS)
- NFS enables consistent namespace and access to filesystems across computers
- It is typically used for files within a local LAN.
- Each file resides on a server which is accessed by clients.
- Clients treat remote files as if they were local.
- NFS has a user-centric design that assumes few concurrent writes.
Basic NFS
- NFS utilizes a system call layer, virtual file system, local file system interface, NFS clients, NFS servers, RPC client stub, RPC server stub, and a network for file access
Sending commands
- NFS leverages Remote Procedure Calls (RPCs) to propagate file system operations.
- The naive method forwards each operation as an RPC to the server
Solution: Caching
- NFS clients utilize caching by maintaining copies of remote files, allowing periodic synchronization with the server.
- The original Sun NFS from 1984 employed client-side in-memory caching, allowing cached files to be accessed without network activity.
Caching & Failures
- Server crashes can lead to loss of unsaved data and offset errors in concurrent accesses.
- Communication failures can lead to inconsistent data handling and unintended file deletions during concurrent modification or creation.
- Client crashes likewise lose unsaved (locally cached) data
Solution: Stateless RPC
- Stateless RPCs avoid maintaining server-side state across commands and sessions.
- Commands such as read() are stateless: the server does not track any per-client context.
- This lets a server resume after a crash without recovering previously stored state (a minimal sketch follows).
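A minimal sketch of the idea (the function name and file handles are hypothetical, not the real NFS wire protocol): the client sends the full context with every call, so retrying after a lost reply is safe.

```python
# Sketch: a stateless, idempotent read RPC. The server keeps no per-client
# cursor; every request carries the file handle and offset it needs.

FILES = {"fh42": b"hello distributed world"}  # toy server-side store

def nfs_read(file_handle: str, offset: int, count: int) -> bytes:
    """Server handler: a pure function of its arguments, no session state."""
    return FILES[file_handle][offset:offset + count]

# A lost reply simply triggers a retry; repeating the identical request
# returns the identical bytes (idempotence).
assert nfs_read("fh42", 6, 11) == nfs_read("fh42", 6, 11) == b"distributed"
```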
Concurrent Writes in NFS
- NFS lacks guarantees for concurrent writes.
- Concurrent writes can produce inconsistent data when multiple clients modify the same file simultaneously.
NFS Summary
- Functionality includes transparent remote file access via a virtual file system layer, with client-side caching for performance.
- Stateless RPC for fault tolerance with flush-on-close semantics.
- Concurrent writes are not guaranteed.
Google File System (GFS)
- GFS is a distributed file system specifically designed for big data workloads.
- GFS targets large files, frequent appends, high concurrency, and high bandwidth.
- GFS is robust and tolerant of failures, supporting thousands of machines.
- GFS provides a non-POSIX API, designed for a scalable implementation
GFS - Design Assumptions
- Component failures are treated as the norm, so fault tolerance is prioritized over hardware reliability.
- GFS handles large files, streaming reads, and infrequent random writes.
- The coherence model is simplified, prioritizing write-once-read-many patterns.
Hadoop Distributed File System (HDFS)
- HDFS is a distributed file system similar to GFS, designed for big data workloads, handling large files, frequent appends, high concurrency, and high bandwidth.
- It was originally built for the Apache Nutch web search engine.
- HDFS is a robust system designed for hundreds or thousands of commodity machines where failures are common.
HDFS Architecture
- HDFS employs a client-namenode-datanode architecture.
- Only one namenode, but many datanodes storing data blocks
- The namenode manages metadata and coordinates tasks among datanodes
Namenode
- The namenode holds the entire file system's namespace in RAM.
- Metadata such as hierarchy of files and directories, attributes, quotas, and access information are stored.
- Checkpoints and journaling facilitate fault tolerance
Checkpoint/Backup Node
- A node's role as checkpoint or backup node is specified at startup
- Backup nodes maintain a synchronized in-memory copy of the file system namespace
Datanode
- The datanode coordinates with the namenode and performs handshakes for verification.
- Block reports are sent periodically, containing metadata for data blocks.
- Heartbeats, sent to the namenode, confirm datanode health and availability. This aids in rebalancing.
Node failure detection
- The namenode was originally a single point of failure; later HDFS versions support multiple namenode instances, each responsible for a region of the namespace.
- Datanodes send heartbeat signals every 3 seconds; if no heartbeats arrive for an extended period (about 10 minutes by default), the namenode considers the datanode dead (sketch below).
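A minimal sketch of the namenode-side bookkeeping this implies, assuming the 3-second interval and a 10-minute timeout (the class and method names are illustrative):

```python
import time

HEARTBEAT_INTERVAL = 3.0   # datanodes report every 3 seconds
DEAD_TIMEOUT = 10 * 60.0   # declare a datanode dead after ~10 minutes

class HeartbeatMonitor:
    """Toy namenode-side view of datanode liveness."""

    def __init__(self) -> None:
        self.last_seen: dict[str, float] = {}

    def on_heartbeat(self, datanode_id: str) -> None:
        self.last_seen[datanode_id] = time.monotonic()

    def dead_nodes(self) -> list[str]:
        now = time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > DEAD_TIMEOUT]

# Blocks hosted on nodes reported by dead_nodes() would then be
# re-replicated onto healthy datanodes.
```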
Corrupted Data
- Data blocks in HDFS include checksums.
- Datanodes periodically send FileChecksums to the namenode.
- Corrupted data does not automatically get deleted.
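A simplified sketch of per-block checksum verification (HDFS uses CRC-based checksums per chunk; the helpers below are illustrative stand-ins):

```python
import zlib

def store_block(data: bytes) -> tuple[bytes, int]:
    """Persist a block together with its CRC32 checksum."""
    return data, zlib.crc32(data)

def verify_block(data: bytes, checksum: int) -> bool:
    """Recompute the checksum on read; a mismatch signals silent corruption."""
    return zlib.crc32(data) == checksum

block, crc = store_block(b"...128 MB of block data, abbreviated...")
assert verify_block(block, crc)                  # intact block passes
assert not verify_block(block[:-1] + b"X", crc)  # flipped byte is caught
```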
Block Placement
- In-rack placement is preferred over cross-rack placement
- Network bandwidth within a rack (or cabinet) is typically higher than cross-rack bandwidth, so blocks are placed accordingly, while remote racks are still used for replica availability (see the sketch below)
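A sketch of the default placement for a replication factor of 3 (first replica on the writer's node, second on a different rack, third on another node in that same remote rack); the rack and node names are hypothetical:

```python
import random

# Hypothetical cluster topology: racks -> datanodes.
TOPOLOGY = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
}

def place_replicas(writer_node: str) -> list[str]:
    """Default 3-way placement: local node, remote rack, same remote rack."""
    local_rack = next(r for r, ns in TOPOLOGY.items() if writer_node in ns)
    remote_rack = random.choice([r for r in TOPOLOGY if r != local_rack])
    second, third = random.sample(TOPOLOGY[remote_rack], 2)
    return [writer_node, second, third]

print(place_replicas("dn2"))  # e.g. ['dn2', 'dn5', 'dn4']
```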
Erasure Coding
- Erasure coding is used as an alternative to replication to improve storage utilization when space and cost are a concern
- It encodes data into multiple blocks and allows for data recovery if some blocks are lost.
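The simplest erasure code is a single XOR parity block, which tolerates the loss of any one block; Reed-Solomon coding generalizes this to multiple parity symbols computed over a Galois field. A minimal XOR sketch:

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC"  # three data blocks
parity = xor_blocks(d1, d2, d3)         # one parity block

# Any single lost block can be rebuilt by XOR-ing the survivors with
# the parity block; here we recover d2.
assert xor_blocks(d1, d3, parity) == d2
```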
(Big Data) File Format
- Specialized file formats are created to improve efficiency in Big Data systems, handling large datasets, dealing with schema changes over time, and optimizing file sizes on disk.
File Format
- Challenges in HDFS include finding data and managing space, driven by large datasets and continuously evolving schemas.
- Specialized file formats offer faster read and write times and improved schema management tools, such as splittable files or support for data compression.
Common HDFS Storage Formats
- Text files, such as CSV and TSV, are commonly used and are simple, with each line being a record, separated by a newline.
- Sequence files support key/value encoding, useful for MapReduce tasks, and support block-level compression.
RCFile
- Developed at Facebook, RCFile is a record-columnar file format designed for relational tables, offering fast data loading and effective space management.
- It is also adaptable to varying query patterns.
Row-oriented storage vs columnar storage
- Traditional row-oriented storage lays complete records out one after another
- As a result, the values of a single column are scattered across storage blocks
- Columnar storage lays data out column by column, which benefits queries that touch only a few columns (toy illustration below)
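A toy illustration of the two layouts, showing why a query over a single column touches far less data in columnar form:

```python
records = [("alice", 30, "DE"), ("bob", 25, "FR"), ("carol", 41, "US")]

# Row-oriented: whole records stored contiguously.
row_layout = [value for record in records for value in record]

# Column-oriented: each column stored contiguously.
col_layout = {
    "name":    [r[0] for r in records],
    "age":     [r[1] for r in records],
    "country": [r[2] for r in records],
}

# AVG(age) reads one contiguous column instead of scanning every record.
print(sum(col_layout["age"]) / len(col_layout["age"]))  # 32.0
```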
ORC File
- The ORC file format was developed for better data handling in Hive and is designed for large datasets and evolving schemas.
- It uses column-wise compression, supports rich types including date, decimal, and complex nested structures, and embeds indexes
Parquet
- Apache Parquet is an efficient column-oriented storage format.
- It supports various encodings for efficient data compression, and its architecture supports column-wise operations.
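One way to produce and selectively read a Parquet file from Python is the pyarrow library (the file name and columns here are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table with column-level (Snappy) compression.
table = pa.table({"user_id": [1, 2, 3], "clicks": [10, 4, 7]})
pq.write_table(table, "events.parquet", compression="snappy")

# Column projection: only the 'clicks' column is read from disk.
clicks = pq.read_table("events.parquet", columns=["clicks"])
print(clicks.to_pydict())  # {'clicks': [10, 4, 7]}
```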
Summary
- Several core file systems were covered: basic file systems, the Network File System (NFS), the Google File System (GFS), and the Hadoop Distributed File System (HDFS).
- Formats include text, sequence, RCFile, ORC, and Parquet.
Next Part
- The next portion covers key-value stores