Big Data Systems: Key Value Stores I-III (Lecture Notes)


Hasso-Plattner-Institut

Martin Boissier


Summary

This document is a set of lecture notes for the Big Data Systems course at Hasso-Plattner-Institut, covering the three Key Value Stores lectures. Part I introduces key-value stores and B-trees, Part II covers LSM-trees and distributed architecture (partitioning, consistent hashing, replication), and Part III covers distributed consistency (2PC, 3PC, Paxos, CAP, BASE) and the wide-column data model.

Full Transcript


Big Data Systems Martin Boissier Key Value Stores I Data Engineering Systems Hasso-Plattner-Institut Timeline I Date Tuesday Wednesday 15.10. /16.10 Intro / Organizational Use Case - Search Engines 22.10. / 23.10. Performance Management Intro to GitHub Classroom 29.10. / 30.10. Map Reduce I Map Reduce II 5.11. / 6.11. Map Reduce III Exercise 12.11. / 13.11. Data Centers Cloud 19.11 / 20.11. File Systems Exercise 26.11. / 27.11. Key Value Stores I Key Value Stores II 3.12 / 4.12. Key Value Stores III Exercise 10.12. / 11.12. Stream Processing I Stream Processing II 17.12. / 18.12. ML Systems I Exercise Christmas Break 3 This Lecture 1. KV Overview 2. B-Trees Sources Hans-Arno Jacobsen (TUM) – Distributed Systems Johannes Zschache (University of Leipzig) – NoSQL-Datenbanken Volker Markl (TUB) – Database Technology 4 Where are we? § Data Management Application / Query Language / Analytics / Visualization Application § Operational storage Data Processing Data Management Big Data Systems § Small writes and reads File System Virtualization / Container OS / Scheduling Infrastructure Hardware 5 Serving Requests § User access to inverted index § Input: search term User § Output: relevant URLs Interaction § Requirements § Many search terms Index § Many URLs § Frequent updates § Interactive speed ( 1 trillion (10^12) pages indexed ~ 10.000 queries per second Search must complete in ~200ms 7 Enter Key-Value Stores § Scalable container for key-value pairs § NoSQL semantics (non-relational) § KV-Stores offer simpler semantics (and syntax) in exchange for increased scalability, speed, availability, and flexibility § Small scale: Map / hash table -> index § Main operations § Write/update put(key, value) § Read get(key) § Delete delete(key) 8 This Lecture § Remember, two main types of queries: § OLAP queries: § OLTP queries: § Mostly read access to data § Read & Write access to data § Many rows per operation § Few rows per operation § Low qps querys per second § High qps § This distinction also holds at „web scale“ § Peta-/Exabytes (1015 – 1018) of data § 10.000+ nodes § This lecture: „How to process OLTP-like queries at web scale?“ 9 Web Scale § More information of web scale data stores vs. relational systems § “MonetDB is Web Scale” § https://www.youtube.com/watch?v=b2F-DItXtZs 10 DBMS vs KV-Store DBMS (SQL) Key-Value Store Key Value John Smith {Activity:Name= Swimming} Jane Bloggs {Activity:Cost=57} Mark Anthony {ID=219} Relational data schema No (static) data schema Data types Raw byte access Foreign keys No relations Full SQL support Single-row operations 11 KV-Stores – Requirements § Horizontal scalability § User growth, traffic patterns § Adapt to number of requests & data size § Performance § High speed for single-record read and write operations § Flexibility § Adapt to changing data definitions § Reliability § Uses commodity hardware: failure is the norm § Provide failure recovery § Availability and geo-distribution § Users are worldwide § Guarantee fast access 12 KV-Store Client Interface – Overview § Main operations § Write/update put(key, value) § Read get(key) § Delete delete(key) § Usually no aggregation, no table joins, no transactions! 
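To make the client interface above concrete, here is a minimal sketch of the put/get/delete operations backed by a plain in-memory map, i.e. the small-scale analogue mentioned on the slide. It is purely illustrative single-node code; the class and method names are ours, and none of the scalability, availability, or durability properties discussed in this lecture are addressed:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative single-node analogue of the put/get/delete interface;
// real KV-stores add partitioning, replication, and persistence.
public class InMemoryKeyValueStore {
    private final Map<String, byte[]> data = new ConcurrentHashMap<>();

    // Write/update: put(key, value)
    public void put(String key, byte[] value) {
        data.put(key, value);
    }

    // Read: get(key)
    public Optional<byte[]> get(String key) {
        return Optional.ofNullable(data.get(key));
    }

    // Delete: delete(key)
    public void delete(String key) {
        data.remove(key);
    }
}
```

As stated on the slide, there is deliberately no aggregation, no join, and no multi-row transaction; every call touches exactly one key.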
13 KV-Store Client Interface – Example HBase Configuration conf = HBaseConfiguration.create(); Initialization conf.set("hbase.zookeeper.quorum", "192.168.127.129"); Using ZooKeeper HTable table = new HTable(conf, „MyBaseTable"); Column Family: “Schema” Put put = new Put(Bytes.toBytes("key1")); put.add(Bytes.toBytes("colfam1"), Bytes.toBytes(„value"), Bytes.toBytes(200)); table.put(put); Column: Defined at run-time Get get = new Get(Bytes.toBytes(“key1")); Result result = table.get(get); byte[] val = result.getValue(Bytes.toBytes("colfam1"), Bytes.toBytes(„value")); System.out.println("Value: " + Bytes.toInt(val)); 14 KV-Stores in Practice § Bigtable § Amazon § Apache HBase § Key: customerID § Value: customer profile (history, …) § Apache Cassandra § Redis § Facebook, Twitter § Amazon Dynamo § Key: UserID § Yahoo! PNUTS § Value: user profile (friends, photos) § LevelDB § RocksDB 15 Common Properties of KV-Stores Common Elements Common Features § Memory store & write ahead log § Flexible data model with column (WAL) families § Keep data in memory for fast access § Horizontal partitioning (key ranges) § Keep a commit log as ground truth § Store big chunks of data as raw § Versioning bytes § Store different versions of data § Fast column-oriented single-key § Timestamping access (read/write/update) § Replication § Store multiple copies of data § Failure detection & failure recovery 16 Common Non-Features of KV-Stores § Integrity constraints we do not chekc if data model is valid § Transactional guarantees, i.e., ACID § Powerful query languages § Materialized views 17 Index Simple Index § Prerequisite: Sorted data file Dense index sorted by Primary key § Called sequential file § In DB typically sorted on primary key of a relation § Index file stores (key, pointer) pairs § Key value K is associated with pointer § Pointer points to record in sequential file that contains key value K Sparse index § Index types § Dense Index § One entry in index file for every record in sequential file § Sparse Index § One entry in index file for every block in sequential file B-Trees § So-far: Single level index for access acceleration § In general: B-Trees (here B+-Trees) § As many levels as necessary § Blocks are at least 50% full § No overflow blocks needed 20 Structure § Index blocks organized in a tree § Balanced § Each path from root to leaf has the same length. § Parameter n § Every block holds at least n and up to 2n search keys § Every block holds at least n+1 up to 2n+1 pointers § Just like index block before, but one additional pointer § Choice of n § n as large as possible with respect to block size § 4096 Bytes per block; 4 Bytes per key; 8 Bytes per pointer § 4(2n) + 8(2n+1) ≤ 4096 => n = 170 § Alternative definition § Every block holds at least (𝑛 + 1)/2 − 1 search keys, (𝑛 + 1)/2 pointers § Every block holds at most n search keys, n+1 pointers 21 Definition of B+-Trees § Keys in the leaves are keys from the data § Sorted across all leaves (from left to right) § Root: at least one used pointer. § All pointers point to B-tree block one level below § Leaves: Last pointer points to next leaf (on the right), § From the other 2n pointers at least n are used. K1 RID1 K2 RID2 … Next § Point to data blocks § Inner Nodes: Pointers point to B-tree blocks of level below § At least n+1 are used § If j pointers are used, there are j-1 keys in the block § K1, … , Kj-1 P0 K1 P1 K2 P2 … Free § First pointer points to subtree with key values < K1. § Second pointer points to subtree with key values between K1 and K2. 
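The choice of the parameter n and the block layout just defined can be illustrated with a small sketch. This is a simplified in-memory model with invented names, not the actual on-disk block format:

```java
// Simplified in-memory model of the B+-tree blocks just defined.
// For a 4096-byte block with 4-byte keys and 8-byte pointers:
// 4*(2n) + 8*(2n+1) <= 4096  =>  24n <= 4088  =>  n = 170.
public class BPlusTreeNodes {

    static class Node {
        int[] keys;       // up to 2n search keys (at least n, except in the root)
        int numKeys;
    }

    // Inner node: j used pointers go with j-1 keys; the (i+1)-st pointer
    // leads to the subtree with key values in [K_i, K_{i+1}).
    static class InnerNode extends Node {
        Node[] children;  // up to 2n+1 pointers, at least n+1 used
    }

    // Leaf: each key is paired with a record pointer; the extra pointer
    // links to the next leaf on the right for range scans.
    static class LeafNode extends Node {
        long[] recordIds; // pointers into the data file (RIDs)
        LeafNode next;    // last pointer: next leaf
    }

    public static void main(String[] args) {
        int blockSize = 4096, keySize = 4, pointerSize = 8;
        // largest n with keySize*(2n) + pointerSize*(2n+1) <= blockSize
        int n = (blockSize - pointerSize) / (2 * keySize + 2 * pointerSize);
        System.out.println("n = " + n); // prints 170, as on the slide
    }
}
```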
22 Example Configuration § n=2 § All nodes: At most 4 keys and 5 pointers § Root: At least 1 key and 2 pointers § Inner Nodes: At least 2 key and 3 pointers § Leaves: At least 2 keys and 3 pointers 41 12 28 46 67 first pointer goes to block that is snaller than 46 (not equal) second ponter lerger than 46 and smaller tha 67 1 5 9 12 15 19 28 33 37 41 45 46 53 59 67 71 83 99 23 Alternative Definition § So far: Parameter n § Block contains at least n keys § Block contains at most 2n keys § Block contains at least n + 1 pointers (as before) § Alternative: § Node contains at most n keys, n+1 pointers § Inner node contains at least é(n+1)/2ù pointers § Leaf contains at least ë(n+1)/2û+1 pointers § Always: § Inner node contains always one pointer more than number of search keys § Leaf contains as many pointers as search keys § plus linked list 24 Examples of Leaves § n=2 § 4 keys and 5 pointers 67 71 83 99 § Full leaf To next leaf To block with To block with To block with To block with key 67 key 71 key 83 key 99 67 71 To next leaf § Partially filled leaf To block with To block with 67 Not key 67 key 71 allowed 25 Examples of Inner Nodes § Full inner node 67 71 83 99 To keys To keys K < 67 To keys To keys K ≥ 99 67 ≤ K < 71 To keys 83 ≤ K < 99 71 ≤ K < 83 67 71 § Partially filled inner node To keys Not K < 67 To keys To keys allowed 67 ≤ K < 71 K ≥ 71 26 Example B+-Tree §n=2 § ≥ 2 keys § ≥ 3 pointers 41 12 28 46 67 1 5 9 12 15 19 28 33 37 41 45 46 53 59 67 71 83 99 In the leaves, every key exists only once 27 Applications of B+-Trees § B-trees can assume different index roles: Dense index § Search key is primary key § Dense index § Order of data file is not important § Sparse index Sparse index § Data file must be sorted by primary key § Search key is NOT primary key § Search key is not unique § Duplicate keys in leaves have to be handled 29 B-Tree Search Searching in B-Trees § For now: Search key = primary key § Dense index § Operations for sparse indexes similar logarithmic search § Find K. § If we are at a leaf: § Search K in the leaf. § If we are at an inner node with keys K1, K2, … Kn: § If K < K1 go to first child § If K1 ≤ K < K2 go to second child § … § If Kn ≤ K go to last child 31 Example: Searching in B-Trees K = 60 41 41 ≤ 60 12 28 46 67 46 ≤ 60< 67 1 5 9 12 15 19 28 33 37 41 45 46 53 59 67 71 83 99 59 < 60 L now we have to search that block linearily 32 Example: Searching in B-Trees K = 53 41 41 ≤ 53 12 28 46 67 46 ≤ 53< 67 1 5 9 12 15 19 28 33 37 41 45 46 53 59 67 71 83 99 53 = 53 J 33 Range Queries § Queries with inequality in WHERE-clause § SELECT * FROM R WHERE R.k > 40 § SELECT * FROM R WHERE R.k >= 10 AND R.k b: § Follow pointer to next node. § Similar for open intervals [-∞ , b] respectively [a, ∞]. 34 Example: Searching in B-Trees 10 ≤ K ≤ 28 41 10 < 41 12 28 46 67 10 < 12 1 5 9 12 15 19 28 33 37 41 45 46 53 59 67 71 83 99 12, 15, 19, 28 35 B-Tree Insertion Inserting into a B-Tree Recursive Algorithm: § Search corresponding leaf. § If room, insert key and pointer. § If no room: Overflow § Split leaf in two parts and distribute keys equally § Split requires inserting a new key/pointer pair in parent node § Recursively ascend the tree § Exception: If no space in root § Split root § Create new root (with only one key) 37 Example of B-Tree Insertion K = 60 41 41 ≤ 60 12 28 46 67 46 ≤ 60< 67 1 5 9 12 15 19 28 33 37 41 45 46 53 59 60 67 71 83 99 Easy J 38 Example of B-Tree Insertion K = 61 41 41 ≤ 61 12 28 46 67 46 ≤ 61< 67 1 5 9 12 15 19 28 33 37 41 45 46 53 59 60 67 71 83 99 ? 
39 Example of B-Tree Insertion K = 61 41 12 28 46 67 1 5 9 12 15 19 28 33 37 41 45 67 71 83 99 We need a key/pointer 46 53 59 60 61 pair Key: 59 40 Example of B-Tree Insertion K = 61 41 12 28 46 59 67 1 5 9 12 15 19 28 33 37 41 45 67 71 83 99 46 53 59 60 61 41 Insertion Cost § Let h be the height of the B-Tree § In practice often: h = 3 § Search for leaf node: h § If no split needed: Total cost h + 1 § h block reads, 1 block write § If split needed § Worst-case: ascends to root § Even caching useless: nodes need to be written to disk § Total cost: 3 h + 1 § On every level search and overflow handling § + writing new root 42 Deleting from a B-Tree Find corresponding node Delete key If still at least minimal number of keys in node § Nothing else to do If too few keys in node: Merge § If a sibling node (left or right) contains more than minimal number of keys, „steal“ a key. § Maybe need to adjust keys of parents § If not: There are two siblings in tree with minimal and sub-minimal number of keys § => These nodes can be merged. § Keys in parent have to be adjusted § Possibly delete propagates upward in the tree 43 Example of B-Tree Deletion K = 12 41 12 28 46 67 1 5 9 12 15 19 28 33 37 41 45 46 53 59 67 71 83 99 Just remove 12 44 Example of B-Tree Deletion K = 12 41 12 doesnt Need to be changed as a parent, split ist still valid 12 28 46 67 1 5 9 15 19 28 33 37 41 45 46 53 59 67 71 83 99 45 Example of B-Tree Deletion K = 15 41 12 28 46 67 1 5 9 15 19 28 33 37 41 45 46 53 59 67 71 83 99 46 Example of B-Tree Deletion K = 15 41 12 28 46 67 1 5 9 19 28 33 37 41 45 46 53 59 67 71 83 99 Steal from sibling 47 Example of B-Tree Deletion K = 15 41 9 28 46 67 1 5 9 19 28 33 37 41 45 46 53 59 67 71 83 99 Adjust parent 48 Example of B-Tree Deletion K=5 41 9 28 46 67 1 5 9 19 28 33 37 41 45 46 53 59 67 71 83 99 Cannot steal Delete propagates 49 Example of B-Tree Deletion K=5 41 28 46 67 1 9 19 28 33 37 41 45 46 53 59 67 71 83 99 And propagates 50 Example of B-Tree Deletion K=5 28 41 46 67 1 9 19 28 33 37 41 45 46 53 59 67 71 83 99 51 Deletion Cost § Search and local delete: h + 2 § Write leaf node § Write data file § When merging with siblings: h + 6 § Check left and right § Write block and move neighbors § Write parent nodes § Write data block § When merging up to root: 3h - 2 52 Deleting from B-Trees - Variation § Assumption: Normally data sets grow § Consequence: Never delete nodes in B-tree § Underflow nodes in B-Tree will eventually be filled again § We need to keep information of deletes and maintain structure § Improvement: Tombstone in B-tree instead of data file block 53 B-Tree Variations B-Trees for Non-Primary Keys § Meaning of pointers in the inner nodes changes § Why? § Non-primary keys are not unique § Keys with same value can span blocks § Given: Keys K1, … , Kj => Ki is smallest new key value reachable from (i+1)-st pointer. § I.e., there is no key value Ki in left subtree, but at least one occurrence of the key value in the subtree after the (i+1)-st pointer. § Problem: There is not always such a key 55 B-Trees for Non-Primary Keys No new values in second child 28 is not 41 new 12 28 ^ 67 1 5 9 12 15 19 28 28 28 28 41 46 41 53 41 59 67 71 There is no reason to start a search here. 
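The search procedure from the slides above (descend through the inner nodes, then scan the leaf; follow the next-leaf pointers for range queries) can be sketched as follows. This is an illustrative in-memory model with invented names; a real implementation works on disk blocks and typically uses binary search within a node:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of B+-tree lookup and range scan on a simplified in-memory node
// model; leaves are linked left to right as described above.
public class BPlusTreeSearch {

    static class Node {
        boolean leaf;
        List<Integer> keys = new ArrayList<>();
        List<Node> children = new ArrayList<>();   // inner nodes only
        List<Long> recordIds = new ArrayList<>();  // leaves only
        Node next;                                 // leaves only: right neighbor
    }

    // Descend: at an inner node with keys K1..Kj-1, follow the first child
    // for K < K1, the (i+1)-st child for Ki <= K < Ki+1, the last child for
    // K >= Kj-1; at the leaf, scan for K.
    static Long get(Node node, int key) {
        while (!node.leaf) {
            int i = 0;
            while (i < node.keys.size() && key >= node.keys.get(i)) i++;
            node = node.children.get(i);
        }
        int pos = node.keys.indexOf(key);
        return pos >= 0 ? node.recordIds.get(pos) : null;
    }

    // Range query a <= K <= b: find the leaf for a, then follow next-pointers.
    static List<Long> range(Node root, int a, int b) {
        Node node = root;
        while (!node.leaf) {
            int i = 0;
            while (i < node.keys.size() && a >= node.keys.get(i)) i++;
            node = node.children.get(i);
        }
        List<Long> result = new ArrayList<>();
        while (node != null) {
            for (int i = 0; i < node.keys.size(); i++) {
                int k = node.keys.get(i);
                if (k > b) return result;
                if (k >= a) result.add(node.recordIds.get(i));
            }
            node = node.next;
        }
        return result;
    }
}
```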
□Ki is smallest key value reachable from pointer (i+1) 56 B-Tree Variations: B*-Tree § B*-tree and B#-tree § Overflow during insertion: distribution across all leaves § If not possible: create 3 new leaves from 2 old leaves § In general, create m+1 nodes from m nodes § Better memory utilization m = 1 (B+-Tree) § At least 66% m = 2 (B*-Tree) m=3 57 Next Part § Key Value Stores II § LSM Tree Application / Query Language / Analytics / Application § Distributed Architecture Visualization Data Processing Data Management Big Data Systems File System Virtualization / Container OS / Scheduling Infrastructure Hardware 58 Thank you for your attention! § Questions? § In Moodle § Per email: [email protected] § In Q&A sessions 59 Big Data Systems Martin Boissier Key Value Stores II Data Engineering Systems Hasso-Plattner-Institut Timeline I Date Tuesday Wednesday 15.10. /16.10 Intro / Organizational Use Case - Search Engines 22.10. / 23.10. Performance Management Intro to GitHub Classroom 29.10. / 30.10. Map Reduce I Map Reduce II 5.11. / 6.11. Map Reduce III Exercise 12.11. / 13.11. Data Centers Cloud 19.11 / 20.11. File Systems Exercise 26.11. / 27.11. Key Value Stores I Key Value Stores II 3.12 / 4.12. Key Value Stores III Exercise 10.12. / 11.12. Stream Processing I Stream Processing II 17.12. / 18.12. ML Systems I Exercise Christmas Break 3 This Lecture 1. LSM-Tree 2. Distributed Architecture Sources Hans-Arno Jacobsen (TUM) – Distributed Systems Johannes Zschache (University of Leipzig) – NoSQL-Datenbanken Volker Markl (TUB) – Database Technology 4 Where are we? § Data Management Application / Query Language / Analytics / Visualization Application § Operational storage Data Processing Data Management Big Data Systems § Small writes and reads File System Virtualization / Container OS / Scheduling Infrastructure Hardware 5 LSM-Tree High Insert & Update in B-Tree § Insertion cost: logn(N) ~ 3-4 we have many insertioans in kv stores and insertions are expensive § 2-3 disk accesses per write it is expensive because we have to go to the leaf § 2-3 Seek time + Rotational latency + Transfer time § Many applications write a lot of data in small inserts. L § Can we do better? 7 Log Structured Merge - Tree § Split index in two parts C0 and C1 § C0 resides in memory – recent data § C1 resides on disk – older data § Both cover full data domain § O’Neil: Memory optimized tree § C0 = AVL-Tree § C1 = B-Tree D O’Neil, P. et al. (1996). The log-structured merge-tree (LSM-Tree). 8 LSM Tree – Compaction / Rolling Merge § Updates in memory in place – fast § Updates on disk in bulk (merge operation) – fast (for large blocks) § Multiple levels for further amortization of cost 9 LSM-Tree – Other Operations § Search § Start with C0 then go to C1 § If in C0 no need to got to C1 § Delete § Insert tombstone § Actual delete during merge § Nice side effects § Multiple versions of the data in the tree § Can be used for recovery, etc. 
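A minimal sketch of the C0/C1 design just described: writes go to an in-memory component (here a TreeMap standing in for the AVL-tree), which is flushed into immutable sorted runs; deletes insert tombstones; reads check C0 first and then the runs from newest to oldest. All names are illustrative, and the write-ahead log, compaction, and Bloom filters discussed on the following slides are omitted:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.TreeMap;

// Sketch of the LSM write/read path: writes always go to the in-memory
// component, reads check memory first and then the sorted runs.
public class LsmSketch {
    private static final byte[] TOMBSTONE = new byte[0];

    private TreeMap<String, byte[]> memtable = new TreeMap<>();            // C0
    private final List<TreeMap<String, byte[]>> runs = new ArrayList<>();  // C1: newest first
    private final int memtableLimit;

    public LsmSketch(int memtableLimit) { this.memtableLimit = memtableLimit; }

    public void put(String key, byte[] value) {
        memtable.put(key, value);                 // in-place update in memory: fast
        if (memtable.size() >= memtableLimit) flush();
    }

    public void delete(String key) {
        put(key, TOMBSTONE);                      // tombstone; real delete happens at merge
    }

    public Optional<byte[]> get(String key) {
        byte[] v = memtable.get(key);             // C0 first; if found, no need to go to C1
        if (v == null) {
            for (TreeMap<String, byte[]> run : runs) {   // newest run first
                v = run.get(key);
                if (v != null) break;
            }
        }
        // Reference comparison is fine: tombstones always use the same array instance.
        return (v == null || v == TOMBSTONE) ? Optional.empty() : Optional.of(v);
    }

    private void flush() {
        runs.add(0, memtable);                    // becomes an immutable sorted run
        memtable = new TreeMap<>();
    }
}
```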
10 LSM Tree -> Log Structured File § Faster inserts: O(1) if we do not find in the memtable we do not know where to look next, so we have to go though everything standart hash table § Write goes always to Memtable (+log-file) § Memtable is flushed to immutable Level 1 SSTable § Sequential I/O § SSTables are periodically compacted L3 SSTable § Merge into next level L3 SSTable L3 SSTable § Read merges SSTables of all levels L2 SSTable L3 SSTable § Expensive for non-existent keys → Bloom filter L2 SSTable L3 SSTable § O(log(n)) Memtable L1 SSTable L2 SSTable L3 SSTable 11 Disk Layout – Immutable Data § Optimization of writes by inserting instead of overwriting § Memtable: main memory-buffer; protected by write-ahead log on HDD § Sorted Data Files (SSTable): sorted (by key) and immutable on HDD § Deletion using Tombstones § When level is full compact Memory Write MT Flush L1 Disk L2 L2 L2 L3 L3 L3 L3 L3 L3 L3 L3 L3 12 Compaction - Tiering vs Leveling § Leveling L1 Merge § 1 run per level L2 § Sort-merge with next L3 level when full § Tiering L1 L1 L1 § T runs per level L2 L2 L2 § Sort with one run on L3 L3 L3 next level § Hybrid / partial L1 § Fixed size runs L2 L2 L2 § increasing number per L3 L3 L3 L3 L3 L3 L3 L3 L3 level here factor 3 13 Read § Reading requires combination of files § Read until first version is found or tombstone § Expensive if key does not exist Memory MT Read Buffer L1 Disk L2 L2 L2 L3 L3 L3 L3 L3 L3 L3 L3 L3 14 Bloom Filter § Determine fast if key exists § No I/O if not key1 key2 h1(key1) h2(key1) h1(key2) § Bloom filter per SSTable h2(key2) § Bit array for inserted keys 0 1 0 0 1 0 1 0 0 0 0 1 0 § No false negatives h2(key3) h1(key3) § Multiple hashes per key exists(key3) No! 0 it definitly does not exist, 1 it might exist 15 File Format § Variable block size § Blocks are indexed § Trailer: metadata (e.g., start of the index) § Type = value/tombstone 16 Block Index - Hash Table In-memory hash map with byte-offsets as values 17 Source: https://medium.com/@shashankbaravani/database-storage-engines-under-the-hood-705418dc0e35 Sorted String Table (SSTable) Hash table with sorted keys pointer is block und number is the Offset in this block Advantages: faster compacting (merge), range queries, reduction of HDD IO 18 Source: https://medium.com/@shashankbaravani/database-storage-engines-under-the-hood-705418dc0e35 Distributed Architecture Distributed Architecture - Motivation § Scalability (Elasticity) how fast System scales up or down § Data volume, processing, or access exceeds one machine § Spread load on more machines § Availability (Fault Tolerance) § Guarantee data availability in presence of failures and crashes § Keep multiple distributed copies § Latency § For multiple various locations keep latency low § Local data access 20 Replication and Partitioning § Partitioning § Store subsets of data on different nodes § Improves scalability, availability, latency § Challenge: load balancing § Replication § Store multiple copies of data items § Introduces redundancy § Improves scalability, availability, latency § Challenge: consistency 21 Partitioning § Partitioning of key-value data § Partitioning by key range § Partitioning by hash of key § Partitioning algorithm § Each data item (record, row, document,...) belongs to exactly one partition (considering replicated partitions as same partitions). § Algorithm tasks: 1. Given any data item, assign it to a partition 2. 
Keep partitions (possibly) balanced 22 Partition Design: Prevention of Hotspots Book:38323 Book:38324 Book:38325 Book:38323 Hash function Book:38324 or Randomizer Book:38325 23 Partitioning – Hash vs Range § Range partitioning h(key) § Split data by key ranges § Good for scans § Harder to balance § Hash partitioning § Split data by range of key hash § Random data distribution § No locality 24 Distributed Hash Tables (DHT) § A distributed key-value store § Hash table providing two operations: § put(key, value): stores the pair (key, value) in the hash table § get(key) → value: retrieves the value associated with the given key we do not Need Special Routing. we see the key and know where to go § Partitioning of the logical key space among all peers § Data-centered routing without global knowledge (routing by value) § Specific systems differ in their topology 25 DHT: Design Goals § Load balancing: § Uniform distribution of keys among all peers § Decentralization: § Only peers with equal rights, no specialized nodes § Availability and robustness: § Adaptation of structure to joining, leaving or failing nodes § Flexible naming scheme: § No restrictions regarding the structure of keys 26 Consistent Hashing § DHT have the problem of hash fragmentation: § Traditional hash functions remap all tuples when the number of nodes changes (e.g., failure, node goes offline) I I G H C H G E D F F Rebuild A E D B C A B § Consistent hash function: § Minimizes the degree of remapping in case of the addition or removal of locations (nodes, slots) § Only K/n keys remapped on average (K... total number of keys, n... number of slots) I I F B C H G F Rebuild C H G A E D B A E D 27 DHT – High Level Architecture § Hash function assigns an m-bit identifier ID to each node and each key § Data structure (topology): identifier circle = hash ring § Hash values are mapped to unique positions on the ring § Key k is assigned to the first node those ID is equal to ID(k) or is a successor of ID(k) = successor(k) Node 1 Node 8 Key-Range Assigned to Node 8 (clockwise assignment) Node 14 Node 42 § Virtual nodes § Multiple nodes per server for load balancing Node 32 Node 21 § Gossiping for node and failure detection 28 DHT with Consistent Hashing § Node chooses random seed § Position in the ring § Assigned key range is [seed, next seed[ § Not necessarily balanced → multiple nodes per server § Values are hashed to key range § MD5 hash of values New Node New Key Range § New node requires only local changes Old Key Range § Only one key range is split 29 DHT Replication § Replication for high availability § Typically replication factor 3 § Replica on next N-1 neighbors in ring § Original node is “coordinator” § List of Replicas is “preference list” for data item § Coordinator is first item in list Key Range First Replica Second Replica 30 Summary § LSM-Tree key1 h1(key1) key2 h2(key1) § Distributed architecture h2(key2) h1(key2) § Partitioning 0 1 0 0 1 0 1 0 0 0 0 1 0 § Consistent hashing h2(key3) h1(key3) exists(key3) No! 31 Next Part § Key Value Stores III § Distributed Consistency Application / Query Language / Analytics / Visualization Application Data Processing Data Management Big Data Systems File System Virtualization / Container OS / Scheduling Infrastructure Hardware 32 Thank you for your attention! § Questions? 
§ In Moodle § Per email: [email protected] § In Q&A sessions 33 Big Data Systems Martin Boissier Key Value Stores III Data Engineering Systems Hasso-Plattner-Institut Announcements § Submission deadline of Quiz 2 postponed to 2024-12-09 § Tomorrow‘s exercise and next week‘s lecture on 2024- 12-04 and 2024-12-11 switched (see next slide) 2 Timeline I Date Tuesday Wednesday 15.10. /16.10 Intro / Organizational Use Case - Search Engines 22.10. / 23.10. Performance Management Intro to GitHub Classroom 29.10. / 30.10. Map Reduce I Map Reduce II 5.11. / 6.11. Map Reduce III Exercise 12.11. / 13.11. Data Centers Cloud 19.11 / 20.11. File Systems Exercise 26.11. / 27.11. Key Value Stores I Key Value Stores II 3.12 / 4.12. Key Value Stores III Stream Processing I 10.12. / 11.12. Stream Processing II Exercise 17.12. / 18.12. ML Systems I Exercise Christmas Break 3 This Lecture 1. Recaps 2. Distributed Consistency § 2PC § 3PC 3. Data Model Sources Hans-Arno Jacobsen (TUM) – Distributed Systems Johannes Zschache (University of Leipzig) – NoSQL-Datenbanken Volker Markl (TUB) – Database Technology 4 First-In First-Out Scheduling § Single processor/queue § A.k.a., First-come first-serve 10 § Maintain jobs in queue in order of arrival Job 1 § When processor free, dequeue head 5 Job 2 3 Job 1 Job 2 Job 3 Job 3 0 2 5 10 15 18 0 2 5 Selection of metric depends on use § Average completion time is high case (e.g., scheduling interactive Job Length Arrival !"# $ % !"# & % !"# ' $(%$)%$* +' jobs as part of the OS or long- 1 10 0 § = = = 14.33 ' ' ' running MapReduce jobs). 2 5 2 § Average turn around time (completion time – arrival time): !"# $ % !"# & % !"# ' $(%$'%$' ', 3 3 5 § ' = ' = ' = 12 5 Recaps – B+-trees and Duplicate Keys § In most cases, B+-trees are used for unique keys § Meaning of pointers in the inner nodes changes § Keys with same value can span blocks § Given: Keys K1, … , Kj => Ki is smallest new key value reachable from (i+1)-st pointer. § I.e., there is no key value Ki in left subtree, but at least one occurrence of the key value in the subtree after the (i+1)-st pointer. § Problem: There is not always such a key https://15445.courses.cs.cmu.edu/fall2023/notes/08-trees.pdf 6 Recaps – B+-trees and Duplicate Keys No new values in second child 28 is not 41 new 12 28 ^ 42 67 1 5 9 12 15 19 28 28 28 28 41 46 41 53 41 42 67 71 There is no reason to start a search here. □Ki is smallest new key value reachable from pointer (i+1) 7 Recaps – B+-trees and Duplicate Keys 41 12 28 ^ 43 67 1 5 9 12 15 19 28 29 41 42 46 43 53 44 59 45 67 71 … … Alternatively, we can link lists per key. 8 Recaps – LSM Storage Designs § There is an array of storage designs § Leveling § Tiering § 1-Leveling (tiering for 1st level, others use leveling) § L-Leveling (leveling for last level, others use tiering) § Various hybrid forms § Depending on the use case, different options perform better § MemTable compactions can lead to lower performance when frequently accessed items are no longer memory-resident: § Fenggang et al. propose additional caches [AVK] § Often, tiering and block caching are sufficient (only first search causes I/O) [AVK] Fenggang et al.. AC-Key: Adaptive Caching for LSM-based Key-Value Stores. USENIX ATC 2020: 603-615 9 [Designs] Sarkar et al.: Constructing and Analyzing the LSM Compaction Design Space (Updated Version). 
CoRR abs/2202.04522 (2022) Recaps – Consistent Hashing § Assumption: § 'hash() MOD n' as node/partition function § System with five nodes Input data: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 hash() MOD 5 Node 1: Node 2: Node 3: Node 4: Node 5: 5 10 15 20 1 6 11 16 2 7 12 17 3 8 13 18 4 9 14 19 Example adapted from https://medium.com/@anil.goyal0057/understanding-consistent-hashing-a-robust-approach-to-data-distribution-in-distributed-systems-0e4a0e770897 10 Recaps – Consistent Hashing § Assumption: § 'hash() MOD n' as node/partition function § System with five nodes Input data: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 hash() MOD 5 Node 4 fails Node 1: 5 10 15 20 Node 2: 1 6 11 16 Node 3: 2 7 12 17 X Node 4: 3 8 13 18 Node 5: 4 9 14 19 Example adapted from https://medium.com/@anil.goyal0057/understanding-consistent-hashing-a-robust-approach-to-data-distribution-in-distributed-systems-0e4a0e770897 11 Recaps – Consistent Hashing § Assumption: § 'hash() MOD n' as node/partition function § System with five nodes Input data: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Adapt 15 16 17 18 19 20 node/partition function accordingly. hash() MOD 4 Node 1: Node 2: Node 3: Node 4: Node 5: 4 8 12 16 20 1 5 9 13 17 2 6 10 14 18 3 8 13 18 3 7 11 15 19 12 Recaps – Consistent Hashing N5 Node 5 handles slots 17 to 20 1 20 2 19 3 N1 18 4 17 5 Node 1 handles slots 1 to 4 N4 16 6 15 7 14 8 13 9 N2 12 10 11 Node 2 handles slots 5 to 8 N3 13 Recaps – Consistent Hashing Node 5 now N5 handles slots 17 to 20 and 1 20 2 13 to 16 19 3 Most nodes do not need to adapt when N1 18 4 Node 4 fails. 5 X 17 When Node 4 becomes available again, we can split Node 5’s range of 13 to 20 N4 16 6 again and assign 13-16 to Node 4. 15 7 Similar approach for skewed workloads: If Node 3 is overloaded, we can split its 14 8 range and add a new Node for slots 8-9. 13 9 N2 12 10 11 N3 N6 14 Where are we? § Data Management Application / Query Language / Analytics / Visualization Application § Operational storage Data Processing Data Management Big Data Systems § Small writes and reads File System Virtualization / Container OS / Scheduling Infrastructure Hardware 15 Consistency Refresher: ACID § Transactions require ACID semantics: § Atomic § Consistent § Isolated § Durable § On a single machine: § Two-phase locking § Undo-/redo logging § How to guarantee ACID semantics for distributed transactions? § Distributed commit protocols § Distributed lock protocols 17 Distributed Commit Protocols § Commit protocols ensure ACID properties for transactions. § In the distributed case: § All participating nodes must come to the same final decision (commit or abort). § Commit only if all participants vote „Yes“ § Protocols: § Two-Phase Commit § Widely used, but not resilient to all possible failures § Three-Phase Commit § Non-blocking protocol; does not work with network partitions § Paxos Commit § Asynchronous protocol, fault tolerant & safe 18 Two-Phase Commit (2PC) - Overview § Widely used in distributed databases. § Roles in the protocol: § One coordinator § n workers (cohorts) § Assumptions: § Write-ahead log at each node § No node crashes forever § Any two nodes can communicate § Transactions are processed in two phases: § All nodes try to apply the transaction (writing it to the log) § The transaction is only committed if all nodes succeeded! 19 Two-Phase Commit (2PC) – Commit Request Phase Worker Worker 1. System receives user request. Query User Coordinator Worker Worker 20 Two-Phase Commit (2PC) – Commit Request Phase Worker 1. 
System receives user Worker pre request. pr pa ep re 2. Coordinator sends „Prepare ar e Commit“ message to all workers. User pare Coordinator pre e ar ep pr Worker Worker 21 Two-Phase Commit (2PC) – Commit Request Phase Log Log 1. System receives user re Worker Worker ad rea request. y/f dy ail /fa ilur ur e 2. Coordinator sends „Prepare e Commit“ message to all workers. 3. Workers process request and lure write pending changes to log. i User y/fa Coordinator If successful, worker d rea answers „Ready“. re ilu In case of errors, worker fa y/ answers „Failure“. ad Log re Log Worker Worker 22 Two-Phase Commit (2PC) – Commit Phase (Success) All workers answered Log Log “ready”: Worker Worker rea re dy ad y User ady Coordinator re ady re Log Log Worker Worker 23 Two-Phase Commit (2PC) – Commit Phase (Success) All workers answered Log Log “ready”: 1. Coordinator sends “commit” Worker Worker to all workers. co co mm m it m it m it User m Coordinator co it m m co Log Log Worker Worker 24 Two-Phase Commit (2PC) – Commit Phase (Success) All workers answered Log Log “ready”: 1. Coordinator sends “commit” Worker Worker to all workers. 2. Workers commit pending ac ac k changes from log; k Send “ack” to coordinator. 3. Coordinator completes transaction, once all Response workers finished. User k Coordinator ac k ac Log Log Worker Worker 25 Two-Phase Commit (2PC) – Commit Phase (Abort) All workers answered Log Log “ready”: 1. Coordinator sends “commit” Worker Worker to all workers. rea 2. Workers commit pending re dy changes from log; ad y Send “ack” to coordinator. 3. Coordinator completes transaction, once all workers finished. At least one answered User ilure Coordinator “failure”: f a ady re Log Log Worker Worker 26 Two-Phase Commit (2PC) – Commit Phase (Abort) All workers answered Log Log “ready”: 1. Coordinator sends “commit” Worker Worker to all workers. rol 2. Workers commit pending ro lba changes from log; llb ck ac Send “ack” to coordinator. k 3. Coordinator completes transaction, once all workers finished. At least one answered ack User lb Coordinator “failure”: rol 1. Coordinator sends k “rollback” to all workers. ac llb ro Log Log Worker Worker 27 Two-Phase Commit (2PC) – Commit Phase (Abort) All workers answered “ready”: Log Log 1. Coordinator sends “commit” to all workers. Worker Worker 2. Workers commit pending changes from log; ac Send “ack” to coordinator. k ack 3. Coordinator completes transaction, once all workers finished. Failure At least one answered “failure”: User k Coordinator 1. Coordinator sends “rollback” ac to all workers. 2. Workers undo pending ack changes from log; Log Send “ack” to coordinator. Log 3. Coordinator undoes transaction, once all workers Worker Worker finished. 28 Two-Phase Commit (2PC) - Problems § 2PC is a blocking protocol: § Coordinator must wait for responses from all workers § Recovery only possible if node failures are non-terminal § Possible dead-locks: § Coordinator failure between the two phases: § Workers will wait indefinitely for commit / rollback decision § Worker failure after “Prepare Commit”: § Coordinator will wait indefinitely for status from dead worker § Worker failure after commit / rollback decision: § Coordinator will wait indefinitely for “ack” from dead worker è 2PC will fail in case of (terminal) node failures or network outages! 
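The decision rule behind the two phases can be sketched as follows. This is a simplified sequential simulation under the assumption that every message arrives; the blocking problems listed above occur precisely when that assumption fails. The interface and names are illustrative, not taken from any particular system:

```java
import java.util.List;

// Sketch of the coordinator side of 2PC: commit only if every worker
// voted "ready" in phase 1, otherwise roll back everywhere.
public class TwoPhaseCommitCoordinator {

    interface Worker {
        boolean prepare(String txId);   // write pending changes to log, vote ready/failure
        void commit(String txId);       // apply pending changes from log
        void rollback(String txId);     // undo pending changes from log
    }

    public boolean execute(String txId, List<Worker> workers) {
        // Phase 1: commit request ("prepare commit")
        boolean allReady = true;
        for (Worker w : workers) {
            if (!w.prepare(txId)) { allReady = false; break; }
        }
        // Phase 2: commit if all voted ready, otherwise rollback
        // (rolling back a worker that was never asked to prepare is a no-op here)
        for (Worker w : workers) {
            if (allReady) w.commit(txId); else w.rollback(txId);
        }
        return allReady;
    }
}
```

In the real protocol both phases are asynchronous message exchanges, and the coordinator additionally waits for the workers' acknowledgements before completing or undoing the transaction.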
29 Three-Phase Commit (3PC) - Overview § Non-blocking distributed commit protocol § Does not block in case of node failures / network outages § Allows up to k node failures § Assumptions are similar to 2PC: § One coordinator, n workers (cohorts) § Workers use write-ahead log § All nodes can exchange messages § Algorithm similar to 2PC, but: § Introduces special rules to handle timeouts § Introduces pre-commit phase after initial vote: Source: https://en.wikipedia.org/wiki/File:Three-phase_commit_diagram.png § Coordinator sends “pre-commit” to all nodes after initial vote § “commit” is sent, if at least n-k workers acknowledge pre-commit 30 Three-Phase Commit (3PC) – Advantages / Disadvantages § Advantages: § Global state is – usually – recoverable § As long as no more than k nodes fail at once § Does not block on single node failures § Disadvantages: § Network partitioning can lead to inconsistencies § Higher network overhead § Due to the higher cost, 3PC is generally not used in practice § Typical choice: 2PC + heuristics to handle failure cases § Both 2PC and 3PC fail if the network is partitioned! § è Leads to inconsistencies § è Paxos 31 Paxos – Overview § Distributed, asynchronous consensus protocol: § Three main roles: § Ensures consistency, even in case of network failures § Proposer: § State-of-the-art, used in Google’s Chubby / Apache § Offer proposals of the form [value, number]. Zookeper § Acceptor: § Consensus Protocol: § Accept or reject offered proposals so as to reach consensus on the chosen proposal/value. § „Generalized“ version of a distributed commit protocol § Learner: § Assume a collection of nodes can propose values § Become aware of the chosen proposal/value. § Consensus protocols ensure that a single value among the proposed ones is chosen § Not a fixed assignment, one node can have multiple roles! § Two important properties: § Quorum: § Safety § A majority of the acceptors. § Only a value that has been proposed may be chosen § Acceptors A, B, C → Majorities {A,B}, {B,C}, {A,C} § Only a single value is chosen § Proposals have to be accepted by a quorum. § A process never learns an unchosen value § Ensures at least some surviving node retains § Liveness knowledge of the results. § Some proposed value is eventually chosen § If a value has been chosen, then a process can eventually learn the value 32 Paxos vs. 2PC / 3PC § Both 2PC/3PC and Paxos are good choices under normal conditions § Paxos typically works better in very large clusters § 2PC is typically chosen for smaller, fixed environments § Difference: what happens when the network partitions: § 2PC/3PC: § Becomes inconsistent, but remains available § Paxos: § Remains consistent, but becomes unavailable § Is there a transaction protocol that remains available and consistent in case of Network Partitions? § No! § è CAP Theorem 33 CAP Theorem – Overview § Proposed in 2000 by Eric Brewer at ACM PODC § “Proven” in 2002 by scientists from MIT „System properties hold even in case of network Partition (not node!) 
failures.“ § For any distributed system … Tolerance CP P A „All requests to „All nodes must non-failing have the same view nodes must Consistency CA Availability on the data.“ / „All result in a request are atomic.“ response.“ 34 CAP Theorem – Implications I § CAP theorem forces to drop one of the three properties § Drop Partition Tolerance (CA system): § Available, and consistent, unless there is a partition § Examples: § Single node RDBMS § Parallel RDBMS using 2PC § Drop Availability (CP System): § Always consistent, even in a partition, but a reachable replica may deny service without agreement of the others § Examples: § Distributed system using Paxos § Large key-value stores like BigTable 35 CAP Theorem – Implications II § Drop Consistency (AP System): § A reachable replica provides service even in a partition, but may be inconsistent. § Examples: § DNS § „Web-Scale“ NoSQL systems. § Warning: Categorization is mostly theoretical! § Assigning a category to a system is often not obvious. § Categories often degenerate into each other. § At „Web-Scale“, Partition Tolerance is mandatory § Drop Availability or Consistency. § Dropping Availability: Replication. § Dropping Consistency: Weaker consistency models. 36 A Weaker Consistency Model § At „web-scale“, availability is often more important than consistency § Example: Amazon Webstore § Website down è ~1000$ loss per second downtime § Inventory inconsistent è A few customers might be unhappy § Other examples: Facebook Messaging, Google Web Search, … § However, some form of consistency is obviously required § Inventory should be roughly correct § Messages should eventually be displayed è Relax the ACID requirements for „web-scale“ applications! è Modern „web-scale“ applications focus on BASE, instead of ACID: § Basically Available § Soft-state § Eventually consistent 37 Consistency Models - Overview update: A B C read(D) D0→D1 Distributed D0 storage system § Strong consistency: § After the update completes, any subsequent access from A, B, C will return D1 § Weak consistency: § Does not guarantee that subsequent accesses will return D1 à A number of conditions need to be met before D1 is returned § Eventual consistency: § Special form of weak consistency § Guarantees that if no new updates are made, eventually all accesses will return D1 38 Variations of Eventual Consistency § Causal consistency: § If A notifies B about the update, B will read D1 (but not C!) § Read-your-writes: update: § A will always read D1 after its own update D0→D1 A B C read(D) § Session consistency: § Read-your-writes inside a session Distributed D0 storage system § Monotonic reads: § If a process has seen Dk, subsequent access will never return any Di with i < k § Monotonic writes: § Guarantees to serialize the writes of the same process § Variations are not mutually exclusive 39 ACID vs. 
BASE ACID BASE § Strong consistency for § Availability and scaling highest transactions of highest priority priorities § Availability less important § Weak consistency § Pessimistic § Optimistic § Rigorous analysis § Best effort § Complex mechanisms § Simple and fast 40 Data Model Design Principles KV-Store++ Wide Column Store : § Special key-value store: sorted and indexed § Aka: tabular data stores, columnar data stores, extensible record stores § Table-like structure with a flexible relational schema § Different sets of attributes per row § Scalable and fault-tolerant due to data distribution § Low latencies using direct read and write access § Limits: no joins, referential integrity, data types, transactions over multiple rows 42 Data Model (1) Customer Name column familiy Addresses First Name Last Name Street_1 City_1 Street_2 City_2 Mark Smith Long Str. 2 Berlin Short Str.3 Potsdam § Namespace defines table (region), e.g., “Customer” § Namespace consists of multiple column families § Example: Name, Address § Include related columns (similar content or equal types) § Column key = column family:column name, e.g., Name:Last Name § Column: Set of cells with same column key 43 Data Model (2) Customer Name Customer/123:Name:Lastname First Name Last name Mark Smith § Cell: Key-value pair § Key: Row-Key:Column Family:Column Name § Fast direct data access § Sorted: data is saved, e.g., lexicographically ordered (defined by implementation) § Indexed by row key and column key § Row: Set of cells with same key § Timestamps can be used § Multiple versions per cell 44 Example: Web-Table (1) § Data: Webpages and their links § Row-Key: Website-URL § Scenarios: § Direct access by webcrawler to add webpages § Batch-evaluation to build an index and/or ranking § Real-time access for search engine users to receive a cached version of the website 45 Example: Web-Table (2) Column Family Row Key Cell Version 46 Design Principles § Prevention of complex data structures in cells (JSON, XML, …) § Reasonable amount of versions § Denormalization § One long row with multiple columns per entity § Duplicates for efficient queries § Column names with values § Prevention of hot spotting in row keys § Usage of indices 47 Design: Denormalization (1) Customer Customer_Product Product Name Name … m:n Relationship … Customer Products Name … Product1 Product2 Product3 … Customer: 23 Mark Jones … 38383 48284 48284 … Product Customers Name … Customer1 Customer2 Customer3 … Prod:38383 Dell Laptop … 23 43 41 … 48 Design: Denormalization (2) Customer Products Name … 38383 48284 48284 … Customer: 23 Mark Jones … Dell Laptop Apple … Galaxy … … Product Customers Name … 23 43 41 … Prod:38383 Dell Laptop … Mark Jones John Marc Mark Marc … 49 Summary § Distributed Consistency Log Log § Data Model Worker Worker rol lba ck rollback ck User llba Coordinator ro rollback Log Log Worker Worker 50 Next Part § Stream Processing Application / Query Language / Analytics / Visualization Application Data Processing Data Management Big Data Systems File System Virtualization / Container OS / Scheduling Infrastructure Hardware 51 Thank you for your attention! § Questions? § In Moodle § Per email: [email protected] § In Q&A sessions 52
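As a closing illustration of the wide-column data model from the last part, here is a minimal sketch of cell addressing by row key, column family, column name, and timestamp. Names and layout are illustrative; systems such as HBase keep this data sorted by key, store multiple timestamped versions per cell, and partition it by key range:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of wide-column cell addressing: a cell is identified by
// row key : column family : column name, with one value per timestamp.
public class WideColumnSketch {
    // Sorted by composite key, as in an SSTable-backed store.
    private final TreeMap<String, TreeMap<Long, byte[]>> cells = new TreeMap<>();

    private static String cellKey(String row, String family, String column) {
        return row + ":" + family + ":" + column;
    }

    public void put(String row, String family, String column, long ts, byte[] value) {
        cells.computeIfAbsent(cellKey(row, family, column), k -> new TreeMap<>())
             .put(ts, value);
    }

    // Read the newest version of a cell.
    public byte[] get(String row, String family, String column) {
        Map<Long, byte[]> versions = cells.get(cellKey(row, family, column));
        return versions == null ? null : ((TreeMap<Long, byte[]>) versions).lastEntry().getValue();
    }

    public static void main(String[] args) {
        WideColumnSketch t = new WideColumnSketch();
        t.put("Customer/123", "Name", "Lastname", 1L, "Smith".getBytes());
        System.out.println(new String(t.get("Customer/123", "Name", "Lastname")));
    }
}
```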
