Cloud Computing: Google File System (GFS)
د. منذر الطراونة (Dr. Munther Al-Tarawneh)
Summary
This document provides an overview of the Google File System (GFS), a distributed file system designed for cloud computing. It covers the architecture of GFS, including chunk servers and the master node, and describes the read, write, and record append algorithms used by the system. The presentation also discusses fault tolerance mechanisms implemented in GFS.
Full Transcript
Google File System (GFS)
Google had issues with existing file systems on its huge distributed systems, so it created a new file system that matched well with MapReduce (also by Google). They wanted:
– The ability to detect, tolerate, and recover from failures automatically
– Large files, each >= 100 MB in size
– Large, streaming reads (each read >= 1 MB in size)
– Large sequential writes that append
– Concurrent appends by multiple clients
– Atomicity for appends without synchronization overhead among clients

Motivation
Google needed a good distributed file system:
– Redundant storage of massive amounts of data on cheap and unreliable computers
– Fault tolerance and auto-recovery (monitoring, error detection, automatic recovery) need to be built into the system, because problems are very often caused by application bugs, OS bugs, human errors, and failures of disks, memory, connectors, networking, and power supplies
– Standard I/O assumptions (e.g. block size) have to be re-examined
– Record appends are the frequent form of writing
Why not use an existing file system?
– Google's problems are different from anyone else's: different workload and design priorities
– GFS is designed for Google apps and workloads
– Google apps are designed for GFS

Architecture of GFS
In GFS:
– A master process maintains the metadata
– A lower layer (a set of chunkservers) stores the data in units called chunks
GFS is a cluster-based file system: a distributed file system consisting of several servers that share the responsibilities of the system, as opposed to a single server (possibly replicated).

Architecture of GFS (cont.)
– A single master coordinates access and keeps metadata: simple, centralized management
– Fixed-size chunks (64 MB)
– Many chunk servers (100 to 1000s): files are stored as chunks, each chunk identified by a 64-bit unique id
– Reliability through replication: each chunk replicated across 3+ chunk servers
– Many clients access the same and different files stored on the same cluster
– No data caching: little benefit due to large data sets and streaming reads
– Familiar interface, but a customized API: simplify the problem, focus on Google apps; add snapshot and record append operations
Figure from "The Google File System," Ghemawat et al., SOSP 2003

Master and Chunk Server Responsibilities
Master node:
– Holds all metadata (namespace, current locations of chunks), all in RAM for fast access
– Manages chunk leases to chunk servers
– Garbage collects orphaned chunks
– Migrates chunks between chunk servers
– Polls chunk servers at startup and uses heartbeat messages to monitor them
Chunk servers:
– Simple: store chunks as files
– Chunks are 64 MB in size, kept on local disk using a standard filesystem
– Read/write requests specify a chunk handle and a byte range
– Chunks are replicated on a configurable number of chunk servers

Chunk
Similar to a block, but much larger than a typical file system block. Size: 64 MB!
Why so big, compared to the few-KB block sizes of OS file systems in general?
– Reduces the client's need to contact the master
– Within a large chunk, a client can perform many operations
– Reduces the size of the metadata stored in the master (fewer chunks, less metadata in the master!)
– No internal fragmentation, thanks to lazy space allocation
Disadvantages:
– A small file consisting of only a few chunks can be accessed many times (a hot spot)
– In practice this is not a major issue, as Google applications mostly read large multi-chunk files sequentially
– Solution: store such files with a higher replication factor
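To make the chunk-size arithmetic concrete, the following is a minimal Python sketch (illustrative only, not GFS source code). It maps a byte range onto 64 MB chunk indexes and estimates the master's in-memory metadata footprint, assuming the roughly 64 bytes of metadata per chunk quoted later in this transcript.

    # Illustrative sketch (not GFS code): mapping a byte range to 64 MB chunk
    # indexes, and estimating the master's in-memory metadata footprint using
    # the ~64 bytes-per-chunk figure mentioned later in this transcript.

    CHUNK_SIZE = 64 * 1024 * 1024        # fixed 64 MB chunks
    METADATA_PER_CHUNK = 64              # approx. bytes of master metadata per chunk

    def chunk_indexes(offset, length):
        """Chunk indexes touched by the byte range [offset, offset + length)."""
        first = offset // CHUNK_SIZE
        last = (offset + length - 1) // CHUNK_SIZE
        return list(range(first, last + 1))

    def master_metadata_bytes(file_size):
        """Rough master memory needed to describe one file of the given size."""
        num_chunks = -(-file_size // CHUNK_SIZE)     # ceiling division
        return num_chunks * METADATA_PER_CHUNK

    # A 1 TB file is only 16,384 chunks, i.e. about 1 MB of master metadata.
    print(chunk_indexes(200 * 1024 * 1024, 10 * 1024 * 1024))   # [3]
    print(master_metadata_bytes(1024 ** 4))                     # 1048576

This is why a small number of large files keeps the master's metadata small, while a huge number of tiny files would not.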
Master-Chunkserver Communication
The master and chunkservers communicate regularly to obtain state:
– Is the chunkserver down?
– Are there disk failures on the chunkserver?
– Are any replicas corrupted?
– Which chunk replicas does the chunkserver store?
The master also sends instructions to chunkservers:
– Delete an existing chunk
– Create a new chunk

Serving Requests
– The client retrieves metadata for an operation from the master
– Read/write data flows between the client and the chunkserver
– The single master is not a bottleneck, because its involvement in read/write operations is minimized
Metadata is stored in the master's memory. The master maintains less than 64 bytes of metadata for each 64 MB chunk.

The Design Tradeoff
A small number of large files:
– Less metadata; the GFS master node can handle it
– Fewer chunk requests to the master node
– Best for streaming reads
A large number of small files:
– One chunk per file
– Waste of space
– Pressure on the master node to index all the files

How about Clients Who Need Data?
GFS clients:
– Consult the master for metadata
– Access data from the chunk servers
– No caching at clients or chunk servers, due to the frequent case of streaming
A client typically asks for multiple chunk locations in a single request. The master also predictively provides chunk locations immediately following those requested.

Replication in GFS
The data chunks on the chunk servers are replicated (typically 3+ times). If a chunk server fails:
– The master notices missing heartbeats
– The master decrements the replica count for all chunks on the dead chunk server
– The master re-replicates chunks missing replicas in the background
– Highest priority goes to chunks missing the greatest number of replicas

Consistency in GFS
Changes to the namespace are atomic:
– Done by the single master server
– The master uses logs to define a global total order of namespace-changing operations
Data changes are more complicated:
– Consistent: all clients see the same file region, regardless of which replicas they read from
– Defined: after a data mutation, the file region is consistent and all clients see the entire mutation

Mutation in GFS
A mutation is a write or an append, and it must be performed at all replicas. Goal: minimize master involvement.
Lease mechanism:
– The master picks one replica as the primary and gives it a "lease" for mutations
– The primary defines a serial order of mutations
– All replicas follow this order
Data flow is decoupled from control flow.

Read Algorithm
1. The application originates the read request
2. The GFS client translates the request from (filename, byte range) -> (filename, chunk index) and sends it to the master
3. The master responds with the chunk handle and replica locations (i.e. the chunkservers where the replicas are stored)
4. The client picks a location and sends the (chunk handle, byte range) request to that chunkserver
5. The chunkserver sends the requested data to the client
6. The client forwards the data to the application

Read Algorithm (example)
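As a concrete example of the read path above, here is a small Python sketch of the client-side steps. The master and chunkserver objects and their find_chunk/read methods are hypothetical stand-ins for the real GFS RPC interfaces, and the sketch assumes the requested byte range falls inside a single chunk.

    # Minimal sketch of the client-side read path described above.
    # find_chunk and read are hypothetical stand-ins for the real RPCs.

    CHUNK_SIZE = 64 * 1024 * 1024

    def gfs_read(master, filename, offset, length):
        # Step 2: translate (filename, byte range) -> (filename, chunk index).
        chunk_index = offset // CHUNK_SIZE
        # Step 3: the master returns the chunk handle and replica locations.
        handle, replicas = master.find_chunk(filename, chunk_index)
        # Step 4: pick one replica and send it (chunk handle, byte range).
        chunkserver = replicas[0]
        start_in_chunk = offset % CHUNK_SIZE
        # Steps 5-6: the chunkserver returns the data; hand it to the application.
        return chunkserver.read(handle, start_in_chunk, length)

Note that the master only hands out metadata; the bulk data never flows through it.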
Write Algorithm
1. The application originates the write request
2. The GFS client translates the request from (filename, data) -> (filename, chunk index) and sends it to the master
3. The master responds with the chunk handle and the (primary + secondary) replica locations
4. The client pushes the write data to all locations; the data is stored in the chunkservers' internal buffers
5. The client sends the write command to the primary
6. The primary determines a serial order for the data instances stored in its buffer and writes the instances in that order to the chunk
7. The primary sends the serial order to the secondaries and tells them to perform the write
8. The secondaries respond to the primary
9. The primary responds back to the client
Note: if the write fails at one of the chunkservers, the client is informed and retries the write.

Record Append Algorithm
An important operation for Google:
– Merging results from multiple machines into one file
– Using a file as a producer-consumer queue
The steps (see the sketch at the end of this transcript):
1. The application originates the record append request
2. The GFS client translates the request and sends it to the master
3. The master responds with the chunk handle and the (primary + secondary) replica locations
4. The client pushes the write data to all locations
5. The primary checks whether the record fits in the specified chunk
6. If the record does not fit, the primary:
  – Pads the chunk
  – Tells the secondaries to do the same
  – Informs the client, which then retries the append with the next chunk
7. If the record fits, the primary:
  – Appends the record
  – Tells the secondaries to do the same
  – Receives responses from the secondaries
  – Sends the final response to the client

Fault Tolerance
– Fast recovery: the master and chunkservers are designed to restart and restore state in a few seconds
– Chunk replication: across multiple machines and across multiple racks
– Master mechanisms:
  – Keep a log of all changes made to metadata
  – Take periodic checkpoints of the log
  – Replicate the log and checkpoints on multiple machines
  – Replicate the master state on multiple machines
  – Use a shadow master for reading data if the real master is down

Summary
– Google File System architecture: chunks and the master
– How the master and chunk servers communicate
– Read / Write / Record Append algorithms
– Fault tolerance
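The sketch below, referenced from the Record Append Algorithm section, outlines the client-side append flow in Python. The master, primary, and secondary objects and their last_chunk, push_data, and append methods are hypothetical stand-ins for the real GFS RPCs; the point is the retry loop the client runs when the primary pads a full chunk.

    # Sketch of the record append flow from the Record Append Algorithm section.
    # last_chunk, push_data, and append are hypothetical stand-ins for real RPCs.

    def record_append(master, filename, record):
        while True:
            # Steps 2-3: the master returns the handle of the file's last chunk
            # plus the primary and secondary replica locations.
            handle, primary, secondaries = master.last_chunk(filename)
            # Step 4: push the record data to every replica's internal buffer.
            for server in [primary] + secondaries:
                server.push_data(handle, record)
            # Steps 5-7: the primary either appends the record (and tells the
            # secondaries to append at the same offset) or pads the chunk and
            # asks the client to retry against the next chunk.
            result = primary.append(handle, record, secondaries)
            if result.fits:
                return result.offset    # offset at which the record was appended
            # The chunk was padded; loop and retry with the file's new last chunk.

Because the primary chooses the offset and all replicas apply the append in the same order, concurrent clients get atomic appends without coordinating with each other.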