Big Data Systems: Distributed File Systems (PDF)
Document Details
Hasso-Plattner-Institut
Martin Boissier
Summary
This document provides an overview of big data systems with a focus on distributed file systems. It covers the motivation for distributed file systems, their challenges, concrete systems such as the Network File System, the Google File System, and the Hadoop Distributed File System, and common HDFS storage formats. It also discusses file formats in more detail and introduces erasure coding as an alternative to data replication.
Full Transcript
Big Data Systems: Distributed File Systems
Martin Boissier, Data Engineering Systems, Hasso-Plattner-Institut

Announcements
§ Joint Database Systems Seminar with TU Darmstadt
§ Antonis Katsarakis (Huawei)
§ Wednesday, 16:15 – 17:00 in Zoom (details follow in Moodle)
§ "Dandelion Hashtable: Beyond billion in-memory requests per second on a commodity server"

Timeline I
§ 15.10. / 16.10.: Intro / Organizational | Use Case: Search Engines
§ 22.10. / 23.10.: Performance Management | Intro to GitHub Classroom
§ 29.10. / 30.10.: MapReduce I | MapReduce II
§ 5.11. / 6.11.: MapReduce III | Exercise
§ 12.11. / 13.11.: Data Centers | Cloud
§ 19.11. / 20.11.: File Systems | Exercise
§ 26.11. / 27.11.: Key Value Stores I | Key Value Stores II
§ 3.12. / 4.12.: Key Value Stores III | Exercise
§ 10.12. / 11.12.: Stream Processing I | Stream Processing II
§ 17.12. / 18.12.: ML Systems I | Exercise
§ Christmas Break

This Lecture
1. Basics of File Systems
2. Network File System
3. Google File System
4. Hadoop Distributed File System
5. Erasure Coding
6. File Formats
Based on
§ Hans-Arno Jacobsen: Distributed Systems

Where are we?
§ File system layer
§ Basis for most data storage
[Figure: the big data systems stack, from application / query language / analytics / visualization at the top, over data processing, data management, file system, virtualization / container, OS / scheduling, and infrastructure, down to hardware]

Result of an Internet Crawl
§ Many data items (crawled pages are mostly appended or archived rather than updated in place)
§ 20+ billion web pages x 20 KB = 400 TB
§ Mostly append (e.g., onto a JBOD disk array)
§ Storage
§ Concurrent access
§ Huge bandwidth required, high latency OK
§ 1000s of commodity machines, where failure is the norm
§ Solution
§ Distributed file system
§ Huge files
§ Replication

Basics of File Systems

File Systems
§ Motivation: provide abstractions a programmer can use to interact with files
§ POSIX
§ POSIX, "The Single UNIX Specification"
§ Aligns with the ISO C 1999 standard (stdio.h)
§ File locking
§ Directories
§ Files are byte vectors or arrays containing data stored on external storage

HDD
§ Magnetic discs
§ Cache (8 MB – 128 MB)
§ Cost
§ ~38 €/TB (1.1.2014)
§ ~30 €/TB (4.12.2014)
§ ~29 €/TB (7.12.2015)
§ ~22 €/TB (29.11.2017)
§ ~15 €/TB (25.11.2021)
§ ~11 €/TB (28.11.2023)
§ 5,400 – 15,000 rpm (rotations per minute)
§ Seek time 4 – 9 ms (limited by the mechanical movement of arm and spindle)
§ Connected via SATA, SCSI/SAS, …

SSD
§ NAND-based flash memory
§ No moving mechanical components
§ Cache (16 MB – 512 MB)
§ Cost
§ ~600 €/TB (1.1.2014)
§ ~350 €/TB (4.12.2014)
§ ~260 €/TB (7.12.2015)
§ ~250 €/TB (29.11.2017)
§ ~65 €/TB (20.11.2020)
§ ~43 €/TB (28.11.2023)
§ Can also be connected via PCI Express
§ Low-level operations differ a lot from HDDs
§ On SSDs, overwriting costs more => TRIM command
§ Deleting is delegated to the internal firmware, which runs a garbage collector
(Image sources: OCZ, Micron)

How common are HDD failures?
§ ~5% fail per year
§ ~10% failed during the first 3 years

Bitrot on HDD
§ Bitrot means silent corruption of data
§ HDD specifications state an uncorrectable bit error rate (UER) on the order of 10^-14 (roughly one unrecoverable error per 10^14 bits read)
§ Evaluation
§ 8 x 100 GB HDD
§ After 2 PB of reads
§ 4 read errors were observed
§ How to protect against bitrot?
§ Error and erasure codes
http://research.microsoft.com/pubs/64599/tr-2005-166.pdf

File System Operations
§ File operations: open, read, write, close, create file, rename file, delete file
§ Directory operations: mkdir, rename dir, delete dir
§ Basic concepts: files, directories, links, metadata, locks
(A short sketch of these calls follows below.)
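The operations above map directly onto the POSIX-style calls exposed by the operating system. Below is a minimal sketch using Python's os module (file and directory names are made up for illustration); it treats a file as a plain byte vector and touches a bit of directory and metadata handling:

import os

# Create and write a file using the low-level POSIX-style API.
fd = os.open("example.txt", os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
os.write(fd, b"hello, file system\n")   # files are just byte vectors
os.close(fd)

# Re-open, seek, and read part of the byte vector.
fd = os.open("example.txt", os.O_RDONLY)
os.lseek(fd, 7, os.SEEK_SET)            # position the offset at byte 7
data = os.read(fd, 4)                   # reads 4 bytes -> b"file"
os.close(fd)

# Directory operations and metadata.
os.mkdir("project")
os.rename("example.txt", "project/notes.txt")
print(os.stat("project/notes.txt").st_size)  # metadata: size in bytes
os.remove("project/notes.txt")
os.rmdir("project")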
Disk File Systems
§ Linux
§ EXT4, EXT3, EXT2
§ JFS, XFS, …
§ BTRFS, ZFS
§ Pooling, snapshots, checksums, …
§ Windows
§ NTFS
§ FAT, FAT32, exFAT, ReFS
§ For simplicity, only EXT2 is covered here

Linux Ext2
§ The drive is partitioned into block groups
§ Superblock (information on the file system: number of blocks, name, free space, …)
§ Stored at the start of every block group
§ Bitmaps for free data blocks and inodes (index nodes)
§ Inodes (one per file or directory) record where the data actually is
§ Data blocks hold the actual data
[Figure: layout of a block group: superblock, block bitmap, inode bitmap, inode table, data blocks; example files such as starwarsmeme.gif, chipmunk.avi, and unicorn.gif each have an inode pointing to their data blocks]

Ext2 Inode (1)
§ Owner and group identifiers
§ File type and access rights
§ Number of data blocks
§ Array of 15 pointers to data blocks
§ 12 direct
§ 1 each: single, double, triple indirect
§ Timestamps
§ Types…
§ File
§ Directory
§ Symbolic link
§ Socket
§ …

From Linux fs/ext2/ext2.h:

struct ext2_inode {
    __le16 i_mode;
    __le16 i_uid;
    __le32 i_size;
    __le32 i_atime;
    __le32 i_ctime;
    __le32 i_mtime;
    __le32 i_dtime;
    __le16 i_gid;
    __le16 i_links_count;
    __le32 i_blocks;
    __le32 i_flags;
    __le32 i_block[EXT2_N_BLOCKS]; /* 15 block pointers: 12 direct + 3 indirect */
    /* ... remaining fields omitted ... */
};

struct ext2_dir_entry {
    __le32 inode;
    __le16 rec_len;
    __le16 name_len;
    char   name[];
};

Ext2 Inode (2)
[Figure: an inode's 12 direct pointers reference data blocks directly; the single, double, and triple indirect pointers reference blocks of pointers, adding one level of indirection each]

Ext2 Example Limits
§ Block size 1 KB: max file size 16 GB, max FS size 4 TB
§ Block size 2 KB: max file size 256 GB, max FS size 8 TB
§ Block size 4 KB: max file size 2 TB, max FS size 16 TB
§ Block size 8 KB: max file size 2 TB, max FS size 32 TB
(See the sketch below for how these limits arise from the pointer scheme.)
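As a rough sanity check of the table above, the maximum file size can be derived from the 15-pointer scheme (12 direct pointers plus single, double, and triple indirect blocks, assuming 4-byte block numbers), capped by ext2's 32-bit counter of 512-byte sectors. This is a back-of-the-envelope sketch, not an exact reproduction of any particular kernel version:

def ext2_max_file_size(block_size: int) -> int:
    ptrs_per_block = block_size // 4           # 4-byte block pointers
    blocks = 12 + ptrs_per_block + ptrs_per_block**2 + ptrs_per_block**3
    pointer_limit = blocks * block_size
    sector_limit = (2**32) * 512               # 32-bit counter of 512-byte sectors
    return min(pointer_limit, sector_limit)

for bs in (1024, 2048, 4096, 8192):
    size = ext2_max_file_size(bs)
    print(f"{bs // 1024} KB blocks -> max file ~ {size / 2**30:.0f} GiB")
# Prints roughly 16, 256, 2048, 2048 GiB, matching the table above.

With 4 KB blocks the pointer tree alone would already allow about 4 TiB, which is why the 2 TiB sector-counter cap becomes the binding limit for the larger block sizes.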
Network File System

Motivation for Distributed FS
§ Collaborating
§ Shared file directory for projects, companies
§ Sharing resources
§ Pooling resources across multiple devices
§ Incremental scalability by adding hardware over time
§ Challenges
§ Performance
§ Scalability
§ Consistency
§ Access rights (e.g., when multiple people write to the same file)

Network File System (NFS)
§ Goals:
§ Consistent namespace for files across computers
§ Authorized users can access their files from any computer
§ Protocol designed for local LANs
§ NFS creates a remote access layer for file systems
§ Each file is hosted at a server and accessed by clients
§ The namespace is distributed across servers
§ Each client treats remote files as local ones ("virtual files": the client only appears to have the file locally)
§ NFS has a user-centric design:
§ Most files are privately owned by a single user
§ Few concurrent accesses (not meant for Google-Docs-style collaboration)
§ Reads are more common than writes

Basic NFS
[Figure: client and server stacks: on each side a system call layer on top of a virtual file system (VFS) layer; on the client, the VFS dispatches to the local file system interface or to the NFS client (RPC client stub); on the server, the NFS server (RPC server stub) calls into the local file system interface; client and server communicate over the network]

Sending Commands
§ Essentially, NFS works as a replicated system, using Remote Procedure Calls to propagate FS operations from the clients to the servers
§ Naïve solution: forward every RPC to the server
§ The server orders all incoming operations, performs them, and returns results
§ Too costly!
§ Good: concurrent clients will be consistent
§ Bad: high access latency due to the number of RPCs
§ Ugly: the server is overloaded by RPCs

Solution: Caching
§ Clients use a cache to store a copy of the remote files
§ Clients then synchronize with the server periodically rather than per byte
§ This is essentially multi-primary replication:
§ How should synchronization be done? (eager/lazy)
§ What is the right consistency level?

Original Sun NFS
§ Developed in 1984
§ Uses in-memory caching:
§ File blocks, directory metadata
§ Stored at both clients and servers
§ Advantage: no network traffic for open/read/write/close if done locally
§ Problems: failures and cache consistency

Caching & Failures
§ Server crash
§ Any data not persisted to disk is lost
§ What if a client does seek(); [server crash] read();?
§ Seek sets a position offset in the opened file
§ After the crash, the server forgets the offset and reads return incorrect data
§ Communication omission failures
§ Client A sends delete(foo), the server processes it
§ The ACK for the delete is lost; meanwhile another client B sends create(foo)
§ A times out and sends delete(foo) again, deleting the file created by B!
§ Client crash
§ Since caching is in memory, all updates by the client not yet synced to the server are lost

Solution: Stateless RPC
§ RPC commands are stateless
§ The server does not maintain state across commands in a "session"
§ E.g., read() is stateful (the server needs to remember seek())
§ read(position) is stateless
§ The server has all the information needed to perform the correct read
§ Stateless RPC
§ The server can fail and continue to serve commands without recovering the old state
§ NFS' RPCs are also idempotent
§ Repeating a command has no side effect
§ delete("foo") becomes delete(someid)
§ Cannot wrongly delete a new file named "foo"
(A minimal sketch of the stateful vs. stateless distinction follows below.)
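To make the distinction concrete, here is a minimal sketch in plain Python (not the actual NFS protocol): the stateful variant keeps a per-client offset that vanishes when the server restarts, while the stateless variant carries the offset in every request and is harmless to repeat:

class StatefulServer:
    def __init__(self):
        self.offsets = {}                        # per-client session state, lost on a crash

    def seek(self, client, path, pos):
        self.offsets[(client, path)] = pos       # server must remember this

    def read(self, client, path, length):
        pos = self.offsets[(client, path)]       # KeyError after a server restart
        with open(path, "rb") as f:
            f.seek(pos)
            return f.read(length)

class StatelessServer:
    def read(self, path, offset, length):
        # The request carries everything needed; repeating it has no side effect.
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)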
Concurrent Writes in NFS
§ NFS does not provide any guarantees for concurrent writes: updates may be lost or overwritten!
§ The server may apply one client's writes, the other's writes, or a mix of both!
§ Not usually a concern due to the user-centric design: it is assumed there are no concurrent writes

NFS Summary
§ Transparent remote file access using the virtual file system
§ Client-side caching for improved performance
§ Stateless and idempotent RPCs for fault tolerance
§ Periodic synchronization with the server, with flush-on-close semantics
§ No guarantees for concurrent writes

Google File System

GFS: The Google File System
§ Designed for big data workloads:
§ Huge files, mostly appends, concurrency, huge bandwidth
§ Fault tolerance while running on inexpensive commodity hardware
§ 1000s of machines where failure is the norm
§ Introduces an API that is designed to be implemented scalably (non-POSIX)
§ Architecture: one primary, many chunk (data) servers
§ The primary stores metadata and monitors the chunk servers
§ Chunk servers store and serve chunks

Drivers for the GFS Design
§ Component failures are the norm in clusters with commodity machines
§ It is almost guaranteed that at any one point in time at least one component fails
§ Operating system failures
§ Human errors
§ Disk, memory, networking, power supplies
§ => Monitoring, error detection, fault tolerance, and automatic recovery must be part of the system software
§ Huge files
§ Multi-GB
§ => Existing assumptions about block sizes have to be revisited
§ Data tends to be appended and read often; random writes within a file are practically non-existent
§ GFS' consistency model was simplified
§ Introduction of atomic append operations

Hadoop Distributed File System

HDFS
§ The Hadoop Distributed File System
§ Originally built as infrastructure for the Apache Nutch web search engine
§ Quite similar to GFS (started as a clone)
§ HDFS is currently being used by:
§ Facebook
§ One cluster with 8800 cores and 12 PB raw storage
§ LinkedIn
§ Multiple clusters, also with PBs of data
§ Twitter
§ … and many more!

Design Assumptions
§ Similar to GFS
§ Hardware failure is common
§ Large files
§ Large streaming reads, small random reads
§ Once written, seldom modified
§ Simple coherency model
§ Write once, read many
§ Concurrent writes
§ Relaxes POSIX requirements
§ Moving computation to the data
§ High bandwidth more important than low latency

HDFS Architecture
§ HDFS client
§ Namenode
§ Only one
§ Main task is to coordinate
§ Metadata: repository for all HDFS metadata (comparable to the GFS primary)
§ Datanode
§ Many datanodes
§ Store data
Source: http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf

Namenode
§ The Namenode keeps the entire namespace in RAM
§ Hierarchy of files and directories
§ Inodes have attributes, permissions, modification and access times, and disk space quotas
§ Checkpoints (namespace snapshots) are written to the local file system; changes are recorded in a journal (log file)
§ For improved durability, redundant copies are stored on multiple local volumes or remote servers
§ Batches operations for improved performance
§ Transactions are committed together

Checkpoint/Backup Node
§ Namenodes can alternatively execute either of two other roles
§ Role specified at node startup
§ Checkpoint Node
§ Combines the existing checkpoint and journal
§ Returns the resulting checkpoint to the Namenode
§ Backup Node
§ Maintains an in-memory filesystem namespace which is synchronized with the Namenode

Datanode
§ Each block replica on a Datanode consists of two files
§ Block metadata (checksums for the data and timestamps)
§ Data (actual length)
§ File content is split into blocks, typically 64 MB
§ If possible, each block (chunk) resides on a different Datanode
§ Startup
§ The Datanode connects to the Namenode and performs a handshake
§ Verify namespace and software version
§ The Datanode is registered at the Namenode
§ The Datanode sends a block report (block metadata), renewed every hour
§ Begins sending heartbeat signals to the Namenode, so the Namenode knows it is still alive
§ Heartbeats contain storage capacity and other statistics relevant for rebalancing

Node Failure Detection
§ Namenode
§ Prior to Hadoop 2.0, the Namenode was a single point of failure
§ Now: two redundant Namenodes in active/passive configuration
§ And: independent Namenodes for namespace regions
§ Datanode
§ Heartbeat every 3 seconds to the Namenode
§ If no heartbeat message is received, the Datanode is considered dead

Corrupted Data
§ Data includes checksums
§ Datanodes send a FileChecksum to the Namenode periodically
§ HDFS tries to resolve corrupted data
§ If no uncorrupted copy can be found, the data is still not deleted automatically; the user can still get the corrupted data and maybe resolve the issue manually

Block Placement
§ Network bandwidth within a rack is better than between racks
§ HDFS estimates the network bandwidth between two nodes by their distance
§ The second and third replicas are typically placed in a different rack

Erasure Coding

Reed-Solomon
§ Error-correcting code: we can not only detect that something is wrong, we can even correct it
§ Used in QR codes
§ Block encoding
§ Data blocks
§ Error-correction blocks
§ Read data blocks first
§ Else, decode with error-correction blocks
§ Take your mobile phone and scan this QR code (some of its bits were flipped): does it still work?

Reed-Solomon (n, k)
§ k data symbols, n − k parity checks
§ Any k symbols suffice for full data recovery
§ RS(6, 4): any 4 chunks can be used to recover the data; 2 parity shards
§ Backblaze uses (20, 17): 3 parity shards, i.e., recovery from up to 3 failures

Example (4, 2)
§ Store 2 data blocks in 4 blocks to survive up to 2 erasures
§ X1 = A
§ X2 = B
§ X3 = A + B
§ X4 = A + 2B
§ To recover, solve the equations, e.g., A = X4 − 2B
§ In general, the arithmetic is done over Galois (finite) fields, e.g., GF(2^8)
(A toy version of this example follows below.)
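The following is a toy version of the (4, 2) example, using ordinary integer arithmetic instead of a Galois field (real Reed-Solomon implementations work over fields such as GF(2^8) so that coded symbols stay the same size as data symbols):

# Two data blocks (here: single numbers) encoded into four blocks.
A, B = 42, 7
X1, X2, X3, X4 = A, B, A + B, A + 2 * B

# Suppose X1 and X2 are lost; any two surviving blocks suffice.
# From X3 = A + B and X4 = A + 2B we solve the small linear system:
B_rec = X4 - X3        # (A + 2B) - (A + B) = B
A_rec = X3 - B_rec     # (A + B) - B = A
assert (A_rec, B_rec) == (A, B)

Losing any two of the four blocks still leaves a solvable 2x2 linear system, which is exactly the "any k symbols suffice" property.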
As an Alternative to Replication
§ Storage: replication makes N full copies (efficiency 1/N); erasure coding stores K data pieces in N total pieces (efficiency K/N)
§ Computation: replication needs no encoding or decoding; erasure coding requires encoding, and decoding when parity pieces are used
§ Fault tolerance: replication tolerates N − 1 failures; erasure coding tolerates N − K failures
§ Communication: replication reads the full data from 1 node; erasure coding reads pieces from K nodes
§ Use replication if latency is prioritized or computational resources are limited (e.g., mobile devices)
§ Use erasure coding if storage resources are limited (e.g., in-memory storage) or computational resources and communication are plentiful (e.g., HPC)

(Big Data) File Formats

File Format
§ HDFS challenges
§ Finding the relevant data in a particular location
§ Raw data requires much space
§ Large datasets, evolving schemas (columns are added over time), and storage constraints
§ Benefits of specialized file formats
§ Faster read times
§ Faster write times
§ Splittable files
§ Schema evolution support
§ Compression support

Common HDFS Storage Formats
§ Text files (e.g., CSV, TSV)
§ Encode anything as a record
§ Each line is a record, terminated with the newline character '\n'
§ Inherently splittable, but can only be compressed at file level
§ Could also store JSON, XML
§ Sequence files
§ Originally designed for MapReduce, so the integration is smooth
§ Encode a key and a value for each record and store them in binary format
§ Support block-level compression

RCFile
§ Developed at Facebook
§ Record Columnar File, designed for relational tables
§ Features
§ Fast data loading
§ Efficient storage space utilization
§ Adaptive to dynamic data access patterns
§ Compression
Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang, and Zhiwei Xu: "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems", ICDE 2011

RCFile
§ Row-oriented storage vs. columnar storage vs. the RCFile layout
[Figures: row-store, column-store, and RCFile placement; images from He et al., ICDE 2011]

ORC File
§ Optimized Row Columnar
§ Hortonworks and Facebook
§ Developed as a Hive format
§ Features compared to RCFile
§ A single file as the output of each task
§ More Hive type support, including datetime, decimal, and the complex types (struct, list, map, and union)
§ Light-weight indexes stored within the file
§ Block-mode compression based on data type

ORC File
§ Divides the data into stripes (250 MB)
§ Index data includes min and max values for each column and the row positions within each column
§ The stripe footer contains stripe metadata
§ The file footer stores file metadata
§ The postscript holds compression parameters and the size of the compressed footer
Image source: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

Parquet
§ Twitter and Cloudera
§ Features
§ Column-wise compression
§ Different encoding techniques can be applied to different columns
§ Similar to ORC and RCFile
§ Different file structures
(A small usage sketch follows below.)

Parquet
[Figure: Parquet file structure]
Image source: https://parquet.apache.org/documentation/latest/
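As a small usage sketch of a columnar format: assuming the pyarrow library is installed, the snippet below writes a tiny table to a Parquet file with column-wise compression and then reads back only one column. The file name and columns are made up for illustration:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3],
    "clicks":  [10, 0, 7],
})

# Column-wise compression; different codecs/encodings per column are possible.
pq.write_table(table, "events.parquet", compression="snappy")

# Column pruning: read only the columns a query needs.
clicks = pq.read_table("events.parquet", columns=["clicks"])
print(clicks.to_pydict())

Reading only the needed columns is where columnar formats such as Parquet, ORC, and RCFile gain most of their speed for analytical queries.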
Summary
§ Basics of File Systems
§ Ext2
§ Network File System
§ Distributed FS
§ Google File System
§ Basis for data storage at Google
§ Hadoop Distributed File System
§ Open-source GFS clone
§ Erasure Coding
§ Alternative to replication
§ File Formats
§ Performance improvement for analysis

Next Part
§ Key Value Stores
[Figure: the big data systems stack, from application / query language / analytics / visualization at the top, over data processing, data management, file system, virtualization / container, OS / scheduling, and infrastructure, down to hardware]

Thank you for your attention!
§ Questions?
§ In Moodle
§ Per email: [email protected]
§ In Q&A sessions