Bigtable: A Distributed Storage System for Structured Data
Summary
This document provides an introduction to Google Bigtable, a distributed storage system. It discusses its motivation, architecture, operations, compactions, and locality groups. The document also details the Bigtable API, data model, and underlying components. It covers concepts such as how Bigtable is designed for large datasets, high performance and availability, and how it's used across various Google products.
Full Transcript
Google Bigtable
Abstracts from the article "Bigtable: A Distributed Storage System for Structured Data", published by Google folks in OSDI 2006 and ACM Transactions on Computer Systems, 2008.
pm jat @ daiict

A timeline
– GFS and MapReduce: 2003 and 2004 respectively
– BigTable by Google: 2006
– Hadoop: 2007 at Yahoo; it ran on 1000 computers at Yahoo, and in 2008 it was behind Yahoo's search engine using 10000 cores
– Amazon's Dynamo: 2007
– HBase: 2008, as an Apache project and part of Hadoop
– MongoDB: started as 10gen in 2007
– Cassandra: took birth at Facebook in 2008
– MonetDB, C-Store, VectorWise, Druid, …

Bigtable – Motivation
The biggest question Google folks might have had is: "how do we store structured data at scale?"
Lots of structured data:
– URLs: contents, crawl metadata, links, anchors, PageRank, …
– Per-user data: user preference settings, recent queries/search results, …
– Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …
Scale is large:
– Billions of URLs, many versions per page (each version approx. 20 KB)
– 100+ TB of satellite image data
Tight latency requirements: hundreds of millions of users, thousands of queries per second.

Why not just use a commercial DB?
– The scale is too large for most commercial databases!
– Even if it weren't, the cost would be very high; building internally means the system can be applied across many projects for low incremental cost.
– Low-level storage optimizations help performance significantly, and are much harder to do when running on top of a database layer.

Bigtable
– Distributed multi-level map with a "column family" data model
– Fault-tolerant, persistent
– Scalable: 100s of terabytes of in-memory data, petabytes of disk-based data, spread over thousands of servers; millions of reads/writes per second, efficient scans
– Self-managing: servers can be added/removed dynamically, and servers adjust to load imbalance

Bigtable – Introduction
Google called it a "distributed database system built on top of GFS". Some quotes from the article:
– "Over the last two and a half years we have designed, implemented, and deployed a distributed storage system for managing structured data at Google called Bigtable"
– "Bigtable is designed to reliably scale to petabytes of data and thousands of machines"
– "Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability"
– Bigtable is used by "more than sixty Google products and projects", including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth.

Bigtable – Data Model
The table is the main structure of Bigtable, emulating rows and columns. The article defines a table as: "A table is a sparse, distributed, persistent multidimensional sorted map."
– "sparse": The total number of columns in a table could be very large, but an individual row has values for only a few of the columns. This is also the reason these systems are called "wide column" databases.
– "distributed": partitioned across machines.
– "persistent": stored durably.
– "multidimensional sorted map": the dimensions are [row-key][column-key][timestamp], i.e.
  (row-key: string, column-key: string, timestamp: int64) -> value: string

Bigtable – Data Model
– Logically, a table is a collection of rows.
– Each row is uniquely identified by its row key. A row key is an arbitrary string; the currently supported size of a row key is 64 KB, although 10-100 bytes is a typical size.
– Rows are sorted in lexicographic order by row key.
– A row is structured as a set of column families; a column family is a set of columns.

Bigtable – Data Model
– Consider an example from the original article: a table named "webtable". It is said to store a large collection of web pages and related information that Google uses in different projects.
– Each row has a row key that distinctly identifies it; the row key for webtable is the URL.
– In this table there are three column families: "contents", "language", and "anchor".
– (The article shows a slice of this example table that stores web pages.)

Bigtable – Data Model
– A column family can have any number of columns, and column names are not pre-defined.
– The "language" column family stores the language in which a web page was written; it uses only one column, and that column's name is empty. Yes, a column name can be empty!
– A column name can be any string and is specified at run time (as part of a name-value pair written to the row).
– The "anchor" column family has an unbounded number of columns: for every anchor (i.e. an href linking to the page), there is a column named after the referring site. Surprising, isn't it? For example, the column named "my.look.ca" (in the "anchor" column family) holds the value "CNN.com".

Bigtable – Data Model
– A column identifier has two parts: the column-family name and the column name. For example, in webtable a column identifier is "anchor:my.look.ca".
– The following are required to be specified for a table at creation time: the row key and the names of the column families.

Timestamps for Data Versions
– A cell in a table can contain multiple versions of its data.
– In Bigtable, every write operation creates a new version of the data, "even for updates" (note the LSM tree behind the scenes).
– Versions are indexed by timestamp. Bigtable timestamps are 64-bit integers; a timestamp can be real time or assigned explicitly by client applications, but must be unique. For example, in webtable, contents can have the page-crawl time as the timestamp value.
– Older versions are automatically garbage-collected according to a specified policy.

Bigtable Operations
The Bigtable API provides functions for:
– creating and deleting tables
– creating and deleting column families
It primarily supports the following fundamental operations based on a key:
– put(key, value)
– get(key)
– delete(key)
– scan: return many values from many row/column entries, in sorted order (for example, all columns from a column family, or all column families for a row key, etc.)
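To make the data model concrete, here is a minimal Python sketch of a table as a multidimensional sorted map with put/get/scan, using the webtable example above (reversed-domain row key, "family:qualifier" column keys). This is an illustration only: the class name TableSketch and its methods are made up for this note, and the real map is of course sharded into tablets and stored in SSTables on GFS rather than held in one dict.

    # Toy model of one Bigtable table: a sorted map
    # (row_key, column_key, timestamp) -> value.
    class TableSketch:
        def __init__(self):
            self.cells = {}                              # {(row, column, ts): value}

        def put(self, row, column, value, ts):
            self.cells[(row, column, ts)] = value        # every write adds a new version

        def get(self, row, column):
            # Return the most recent version stored for (row, column), or None.
            versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                        if (r, c) == (row, column)]
            return max(versions)[1] if versions else None

        def scan(self, row_prefix=""):
            # Rows are kept in lexicographic order, so a prefix scan is a contiguous range.
            for (r, c, ts), v in sorted(self.cells.items()):
                if r.startswith(row_prefix):
                    yield (r, c, ts), v

    # Usage mirroring the webtable example.
    t = TableSketch()
    t.put("com.cnn.www", "contents:", "<html>...</html>", ts=3)
    t.put("com.cnn.www", "anchor:my.look.ca", "CNN.com", ts=8)
    print(t.get("com.cnn.www", "anchor:my.look.ca"))     # -> CNN.com
    print(list(t.scan("com.cnn")))                        # all cells for rows starting with "com.cnn"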
Bigtable Operations – "scan"

Bigtable Operations
What does having only GET, PUT, and DELETE based on a key mean?
– You can GET/PUT/DELETE a value for a cell in a row; this requires specifying the row key and the column key.
– You can GET/PUT/DELETE an entire column family for a row; this requires specifying the row key and the column-family name.
Also note: the data of an entire column family can be deleted by altering the table.

Bigtable vs Key-Value
– It is still key-value, but the granularity is finer: the key has components – row key, column family, and column name.
– A "data record" is accessed not just by a record identifier but also by a column identifier.
– This captures the semantics of the row/column structure seen in relational systems!

Supported data types
– Bigtable treats all data as raw byte strings for most purposes.
– The only time Bigtable tries to determine the type is for increment operations, where the target must be a 64-bit integer encoded as an 8-byte big-endian value.

Bigtable Building Blocks
– GFS: Bigtable uses GFS to store log and data files (as of the writing of the article).
– A Bigtable cluster typically operates in a shared pool of machines that run a wide variety of other distributed applications.
– Bigtable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.
– Bigtable data are stored in the form of "SSTables"; SSTables are stored on GFS, and Bigtable relies on GFS for fault tolerance.

Typical Bigtable Cluster
(Figure: a cluster scheduling master, a lock service, and a GFS master, plus machines 1..N; each machine runs user applications, a Bigtable tablet server, a scheduler slave, and a GFS chunkserver on Linux, and one machine also runs the Bigtable master.)

Bigtable – Building Blocks: SSTable
– The Google SSTable file format is used internally to store Bigtable data.
– An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings.
– Operations are provided to look up the value associated with a specified key and to iterate over all key/value pairs in a specified key range.
– Internally, each SSTable contains a sequence of blocks (typically each block is 64 KB in size, but this is configurable).

Bigtable – Building Blocks: SSTable
– A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened.
– A lookup can be performed with a single disk seek: first find the appropriate block by performing a binary search in the in-memory index, then read that block from disk. (A small sketch of this lookup appears after the Chubby introduction below.)
– Optionally, an SSTable can be completely mapped into memory, which allows lookups and scans without touching disk.

Bigtable – Building Blocks: Google Chubby
– Bigtable relies on a highly available and persistent distributed "lock service" called Chubby. (By the way, what is a "lock service"?)
– A Chubby service consists of five active replicas, one of which is elected to be the master and actively serves requests.
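Before the Chubby discussion continues, here is a minimal Python sketch of the single-disk-seek SSTable lookup described above: a binary search over the in-memory block index picks the one block that can contain the key, and only that block needs to be read. This is not Google's actual SSTable file format; the class SSTableSketch and its in-memory lists of blocks are assumptions made purely for illustration.

    import bisect

    class SSTableSketch:
        """Toy reader for an immutable, sorted key/value file: an in-memory block
        index maps each block's first key to that block, so a lookup reads one block."""
        def __init__(self, blocks):
            # blocks: list of sorted (key, value) lists, standing in for 64 KB blocks on disk.
            self._blocks = blocks
            self._index = [blk[0][0] for blk in blocks]   # first key of each block

        def get(self, key):
            # Binary-search the in-memory index to find the only candidate block ...
            i = bisect.bisect_right(self._index, key) - 1
            if i < 0:
                return None
            block = self._blocks[i]                        # ... then one "disk read" of that block
            j = bisect.bisect_left([k for k, _ in block], key)
            if j < len(block) and block[j][0] == key:
                return block[j][1]
            return None

    sst = SSTableSketch([[("a", "1"), ("c", "2")], [("f", "3"), ("k", "4")]])
    print(sst.get("f"))   # -> "3"
    print(sst.get("b"))   # -> None (key falls in the first block but is absent)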
Bigtable – Building Blocks: Google Chubby (continued)
– The Chubby service is live when a majority of the replicas are running and can communicate with each other.
– Chubby uses directories and small files as locks.
– The Chubby client library provides consistent caching of Chubby files.
– Each Chubby client maintains a session with a Chubby service. A client's session expires if it is unable to renew its session lease within the lease expiration time; when a client's session expires, it loses any locks and open handles.
– Chubby clients can also register callbacks on Chubby files and directories for notification of changes or session expiration.

Bigtable – Building Blocks: Google Chubby
Bigtable uses Chubby for a variety of tasks:
– to ensure that there is at most one active master at any time;
– to store the bootstrap location of Bigtable data (the location of the root tablet);
– to discover tablet servers and finalize tablet-server deaths;
– to store Bigtable schema information (the column-family information for each table); and
– to store access-control information.
If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable.

Bigtable – Building Blocks: Google Chubby
– The authors measured this effect (in 2006) in 14 Bigtable clusters spanning 11 Chubby instances: the average percentage of Bigtable server hours during which some data stored in Bigtable was not available due to Chubby unavailability (caused by either Chubby outages or network issues) was 0.0047%.

Bigtable Implementation
The Bigtable implementation has three major components:
– one master server,
– many "tablet servers", and
– a library that is linked into every client.
The master server is responsible for:
(1) assigning tablets to tablet servers,
(2) detecting the addition and expiration of tablet servers,
(3) balancing tablet-server load,
(4) garbage collection of files in GFS, and
(5) handling schema changes such as table and column-family creations.

Bigtable Implementation
A tablet server:
(1) manages a set of tablets (typically 10 to 1000 tablets per server),
(2) handles read and write requests directed to those tablets, and
(3) splits tablets that have grown too large.
Initially, each table consists of just one tablet; a tablet is approximately 100-200 MB in size by default.
Tablet servers can be dynamically added to (or removed from) a cluster to accommodate changes in workload.

Bigtable Implementation – "tablets"
– A table in Bigtable is "sharded" into blocks of contiguous rows, called tablets.
– Tablets are the distribution units of Bigtable tables.
– Each tablet stores the rows for a range of row keys; since rows are sorted, a tablet has a start key and an end key.
– Tablets do not overlap in their row-key ranges.
– A tablet is built from multiple SSTables, and SSTables can be shared between tablets.

"Index" for Locating Tablets for a Row Key
– Bigtable uses a three-level hierarchy, analogous to a B+ tree, to store tablet location information.
– The first level is a file stored in Chubby that contains the location of the root tablet.
– The root tablet contains the locations of all the tablets of a special METADATA table.
– Each METADATA tablet contains the locations of a set of user tablets.
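A minimal Python sketch of how a client might walk this three-level hierarchy for a row key. All of the entries, server names, and the locate/resolve helpers below are made-up illustrations; in the real system the METADATA table encodes a tablet's table identifier and end row in its row keys, and the lookup is a Bigtable read rather than a local function call.

    import bisect

    # Each level is modeled as a sorted list of (end_key, location) entries.
    def locate(entries, row_key):
        """Return the location of the tablet whose key range covers row_key."""
        ends = [end for end, _ in entries]
        return entries[bisect.bisect_left(ends, row_key)][1]

    chubby_file = "root-tablet@tabletserver-17"              # level 0: file stored in Chubby
    root_tablet = [("m", "meta-tablet-A@ts-3"),              # level 1: rows point to METADATA tablets
                   ("\xff", "meta-tablet-B@ts-9")]
    meta_tablets = {                                         # level 2: rows point to user tablets
        "meta-tablet-A@ts-3": [("d", "user-tablet-1@ts-5"), ("m", "user-tablet-2@ts-6")],
        "meta-tablet-B@ts-9": [("t", "user-tablet-3@ts-2"), ("\xff", "user-tablet-4@ts-8")],
    }

    def resolve(row_key):
        root_server = chubby_file.split("@")[1]              # step 0: Chubby names the root tablet's server
        meta_loc = locate(root_tablet, row_key)              # step 1: root tablet -> METADATA tablet
        user_loc = locate(meta_tablets[meta_loc], row_key)   # step 2: METADATA tablet -> user tablet
        return root_server, meta_loc, user_loc

    print(resolve("com.cnn.www"))   # -> ('tabletserver-17', 'meta-tablet-A@ts-3', 'user-tablet-1@ts-5')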
– User tablets typically store "references to SSTables" (the SSTables themselves are on GFS).
– The root tablet is never split. The "unsplittable" root tablet ensures that the location hierarchy never goes beyond three levels; it also makes it possible to keep the root tablet in memory.

"Index" for Locating Tablets for a Row Key
– The METADATA table stores the location of a tablet under a row key.
– This three-level location scheme is sufficient to address 2^34 tablets (storing 2^61 bytes of data, assuming one index entry uses 1 KB of memory and a modest tablet size of 128 MB).
(Figure: a file stored in the lock service (Chubby) points to the root tablet; the root tablet has one row per METADATA tablet; the METADATA tablets have one row for each non-META tablet of all tables, pointing to the tablets of user tables.)

Big Table size – some calculations
– Let us say one tablet is of size 128 MB, i.e. 2^7 MB = 2^7 * 2^20 B, and assume one index entry uses 1 KB, i.e. 2^10 B.
– The root tablet can then store (2^7 * 2^20) / 2^10 = 2^17 entries, so there can be at most 2^17 METADATA tablets.
– Those METADATA tablets can in turn refer to 2^17 * 2^17 = 2^34 tablets; that is, the METADATA table can refer to 2^34 "user tablets".
– 2^34 user tablets can store 2^34 * 128 MB = 2^34 * 2^7 * 2^20 B = 2^61 B = 2^11 PB = 2048 PB.

"Index" for Locating Tablets for a Row Key
– The Bigtable client library is designed to cache this metadata (tablet locations). The client caches the metadata aggressively and detects when a cached entry is stale.

Bigtable and LSM / Bigtable Structure
(Figures: a tablet server keeps a commit-log file and an in-memory memtable; the memtable is the in-memory SSTable, and the rest of a tablet's data lives in on-disk SSTables.)

Write Operations

Read Operations

"DELETE" operations on an SSTable
– A delete inserts a "tombstone" into the memtable; a tombstone is a logical delete for a key.
– While reading a key, if a tombstone record is found, the return status of the query is "not found"!
– The real delete happens at merge (compaction) time.

Searching within an SSTable
– The block index is loaded into memory.
– Find the appropriate block by performing a binary search in the in-memory index, then read that block from disk. A lookup can therefore be performed with a single disk seek.
– Optionally, the entire SSTable can be loaded into memory, which allows lookups and scans without touching the disk.
– Both kinds of searches can be done through this approach: (1) lookup of a specified key, or (2) a scan over a specified range of keys.

Compactions
Minor compaction: when the memtable reaches a threshold size, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS.
– A minor compaction primarily flushes the memtable to storage (similar to Level-0 of LevelDB).
– The minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies.
– Incoming read and write operations can continue while compactions occur.
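A minimal Python sketch of the tablet write/read path just described: writes (including tombstone deletes) go to a memtable, reads check the memtable and then the SSTables from newest to oldest, and a minor compaction freezes the memtable and writes it out as an immutable SSTable. The class name TabletSketch, the flush threshold, and the use of plain dicts are illustrative assumptions, not Bigtable's actual structures (which also log every write to GFS before applying it).

    TOMBSTONE = object()   # logical-delete marker, as described above

    class TabletSketch:
        """Toy LSM write/read path: a mutable memtable plus immutable flushed SSTables."""
        def __init__(self, flush_threshold=4):
            self.memtable = {}          # key -> value (or TOMBSTONE)
            self.sstables = []          # newest last; each one is an immutable dict
            self.flush_threshold = flush_threshold

        def put(self, key, value):
            self.memtable[key] = value
            if len(self.memtable) >= self.flush_threshold:
                self.minor_compaction()

        def delete(self, key):
            self.put(key, TOMBSTONE)    # a delete is just a write of a tombstone

        def get(self, key):
            # Check the memtable first, then SSTables from newest to oldest.
            for store in [self.memtable] + self.sstables[::-1]:
                if key in store:
                    return None if store[key] is TOMBSTONE else store[key]
            return None

        def minor_compaction(self):
            # Freeze the memtable, write it out as a sorted immutable SSTable, start a new memtable.
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    t = TabletSketch(flush_threshold=2)
    t.put("row1/contents:", "v1")
    t.put("row2/contents:", "v2")       # memtable reaches the threshold and is flushed
    t.delete("row1/contents:")          # a tombstone goes into the new memtable
    print(t.get("row1/contents:"))      # -> None: the tombstone hides the older SSTable value
    print(t.get("row2/contents:"))      # -> v2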
Compactions
– Merging compaction: when the number of SSTables produced from memtable flushes reaches some threshold, a merging compaction reads the contents of a few SSTables and writes out a new SSTable; the input SSTables are discarded as soon as the compaction has finished.
– A merging compaction that rewrites all of a tablet's SSTables into exactly one SSTable is called a "major compaction" in Bigtable. A major compaction also drops deleted entries.
– Bigtable cycles through all of its tablets and regularly applies major compactions to them.
– This allows Bigtable to reclaim resources used by deleted data, and also ensures that deleted data disappears from the system in a timely fashion, which is important for services that store sensitive data.

Refinements
The Bigtable article discusses various refinements that Bigtable provides:
– Locality groups
– Compression
– Caching for read performance
– Bloom filters
– …
These refinements help in achieving high performance, availability, and reliability.

"Locality Groups"
– The concept of locality is used to make read operations efficient.
– Column families that are frequently accessed together are grouped into what are referred to as "locality groups".
– A separate SSTable is generated for each locality group in each tablet.
– Segregating column families that are not typically accessed together into separate locality groups enables more efficient reads.
– In addition, some useful tuning parameters can be specified on a per-locality-group basis. For example, a locality group can be declared to be in-memory; for such SSTables, disk access may not be required.

Compression
– SSTables are compressed using speed-optimized compression algorithms.
– Compression is done block by block, so a tablet server can read a given SSTable block without uncompressing all prior blocks.
– Why is compression done? So that more data can be placed in a block (and hence in an SSTable), and so that fewer disk accesses are needed.
– The compression algorithms said to be used are BMDiff and Zippy.

Bloom filters
– Bigtable allows a Bloom filter to be associated with the SSTables of a locality group. The Bloom filter is used to confirm that a particular key is definitely not in an SSTable, so fewer SSTables need to be searched.
– This can reduce the number of SSTable reads to below O(t) (where t is the number of SSTables making up a tablet), although the number of Bloom-filter checks is of course still O(t).

Shared Logs
– Having a separate log for each tablet would require processing a very large number of physical log files: Bigtable is designed for on the order of a million tablets per cluster, which would mean that many log files being written simultaneously, and hence a performance penalty.
– In addition, separate logs are said to "reduce the effectiveness of the group commit optimization".
– The solution is "shared logs": one log file per tablet server instead of one per tablet, with writes for many tablets co-mingled in the same file.
– Using one log provides significant performance benefits during normal operation, BUT it complicates recovery: scanning one huge co-mingled log file for the mutations of a few tablets would require a large number of disk I/Os. A modified recovery technique (sorting the log entries by tablet before replay) is discussed in the article and is said to largely overcome the problem.
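A minimal Python sketch of the shared-log idea: one append-only commit log per tablet server, with each record tagged by the tablet it belongs to, and a recovery step that picks out and groups only the mutations for the tablets being recovered. The record format, class name SharedLogSketch, and mutation tuples are invented for illustration; the article's actual technique sorts the co-mingled log by (table, row, log sequence number) so each recovering server can read its tablets' mutations contiguously.

    from collections import defaultdict

    class SharedLogSketch:
        """Toy version of one commit log per tablet server: writes for all of the
        server's tablets are appended, in arrival order, to a single sequence."""
        def __init__(self):
            self.records = []                      # (seq, tablet_id, mutation)
            self.seq = 0

        def append(self, tablet_id, mutation):
            self.records.append((self.seq, tablet_id, mutation))
            self.seq += 1

        def recover(self, tablet_ids):
            # Recovery complication: a new server needs only *its* tablets' mutations,
            # so the co-mingled log is filtered and grouped by tablet before replay.
            by_tablet = defaultdict(list)
            for seq, tid, mutation in self.records:
                if tid in tablet_ids:
                    by_tablet[tid].append((seq, mutation))
            return by_tablet

    log = SharedLogSketch()
    log.append("tablet-7", ("put", "com.cnn.www", "anchor:my.look.ca", "CNN.com"))
    log.append("tablet-2", ("put", "org.apache.www", "contents:", "<html>..."))
    log.append("tablet-7", ("delete", "com.cnn.www", "contents:"))
    print(log.recover({"tablet-7"}))   # only tablet-7's mutations, in write order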
Replication and Consistency
– Replication: relies on "GFS"?
– Consistency: "eventual"!

Notes from the Bigtable product home page
https://cloud.google.com/bigtable/docs/overview

Bigtable "Current" Architecture
(Figure from https://cloud.google.com/bigtable/docs/overview)

Colossus – new-generation GFS
– Google currently uses Colossus, the so-called "new-generation GFS".
– Google published a whitepaper on Colossus in April 2021.

References / Further Readings
– Chang, Fay, et al. "Bigtable: A Distributed Storage System for Structured Data." ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 1-26.
– Presentation slides from the authors: https://www-users.cselabs.umn.edu/classes/Spring-2020/csci5980/slides/bigtable-slides.pdf
– Notes on "Bigtable: A Distributed Storage System for Structured Data" by Dr. Eddie Kohler at Harvard: https://www.read.seas.harvard.edu/~kohler/class/cs261-f11/bigtable.html
– Khurana, Amandeep. "Introduction to HBase Schema Design." White Paper, Cloudera (2012).
– O'Neil, Patrick, et al. "The Log-Structured Merge-Tree (LSM-Tree)." Acta Informatica 33 (1996): 351-385.
– Burrows, Mike. "The Chubby Lock Service for Loosely-Coupled Distributed Systems." Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI), 2006.
– Hildebrand, Dean, and Denis Serenyi. "Colossus Under the Hood: A Peek into Google's Scalable Storage System." (2021). https://cloud.google.com/blog/products/storage-data-transfer/a-peek-behind-colossus-googles-file-system
– Chandra, Tushar D., Robert Griesemer, and Joshua Redstone. "Paxos Made Live: An Engineering Perspective." Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing (PODC), 2007.
– Lecture notes: https://www2.cs.uic.edu/~brents/cs494-cdcs/slides/thegooglebigtable.pdf