Big Data Systems: Key Value Stores I-III (Lecture Notes)


Hasso-Plattner-Institut

Martin Boissier


Summary

This document is a set of lecture notes for the Big Data Systems course at Hasso-Plattner-Institut, covering the three Key Value Stores lectures. Part I introduces key-value stores and B-trees, Part II covers LSM-trees and distributed architecture (partitioning, consistent hashing, replication), and Part III covers distributed consistency (2PC, 3PC, Paxos, CAP, BASE) and the wide-column data model.

Full Transcript


Big Data Systems Martin Boissier Key Value Stores I Data Engineering Systems Hasso-Plattner-Institut Timeline I Date Tuesday Wednesday 15.10. /16.10 Intro / Organizational Use Case - Search Engines 22.10. / 23.10. Performance Management Intro to GitHub Classroom 29.10. / 30.10. Map Reduce I Map Reduce II 5.11. / 6.11. Map Reduce III Exercise 12.11. / 13.11. Data Centers Cloud 19.11 / 20.11. File Systems Exercise 26.11. / 27.11. Key Value Stores I Key Value Stores II 3.12 / 4.12. Key Value Stores III Exercise 10.12. / 11.12. Stream Processing I Stream Processing II 17.12. / 18.12. ML Systems I Exercise Christmas Break 3 This Lecture 1. KV Overview 2. B-Trees Sources Hans-Arno Jacobsen (TUM) – Distributed Systems Johannes Zschache (University of Leipzig) – NoSQL-Datenbanken Volker Markl (TUB) – Database Technology 4 Where are we? § Data Management Application / Query Language / Analytics / Visualization Application § Operational storage Data Processing Data Management Big Data Systems § Small writes and reads File System Virtualization / Container OS / Scheduling Infrastructure Hardware 5 Serving Requests § User access to inverted index § Input: search term User § Output: relevant URLs Interaction § Requirements § Many search terms Index § Many URLs § Frequent updates § Interactive speed ( 1 trillion (10^12) pages indexed ~ 10.000 queries per second Search must complete in ~200ms 7 Enter Key-Value Stores § Scalable container for key-value pairs § NoSQL semantics (non-relational) § KV-Stores offer simpler semantics (and syntax) in exchange for increased scalability, speed, availability, and flexibility § Small scale: Map / hash table -> index § Main operations § Write/update put(key, value) § Read get(key) § Delete delete(key) 8 This Lecture § Remember, two main types of queries: § OLAP queries: § OLTP queries: § Mostly read access to data § Read & Write access to data § Many rows per operation § Few rows per operation § Low qps querys per second § High qps § This distinction also holds at „web scale“ § Peta-/Exabytes (1015 – 1018) of data § 10.000+ nodes § This lecture: „How to process OLTP-like queries at web scale?“ 9 Web Scale § More information of web scale data stores vs. relational systems § “MonetDB is Web Scale” § https://www.youtube.com/watch?v=b2F-DItXtZs 10 DBMS vs KV-Store DBMS (SQL) Key-Value Store Key Value John Smith {Activity:Name= Swimming} Jane Bloggs {Activity:Cost=57} Mark Anthony {ID=219} Relational data schema No (static) data schema Data types Raw byte access Foreign keys No relations Full SQL support Single-row operations 11 KV-Stores – Requirements § Horizontal scalability § User growth, traffic patterns § Adapt to number of requests & data size § Performance § High speed for single-record read and write operations § Flexibility § Adapt to changing data definitions § Reliability § Uses commodity hardware: failure is the norm § Provide failure recovery § Availability and geo-distribution § Users are worldwide § Guarantee fast access 12 KV-Store Client Interface – Overview § Main operations § Write/update put(key, value) § Read get(key) § Delete delete(key) § Usually no aggregation, no table joins, no transactions! 
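To make the client interface above concrete, here is a minimal sketch of the put/get/delete operations backed by a plain in-memory map, i.e. the small-scale analogue mentioned on the slide. It is purely illustrative single-node code; the class and method names are ours, and none of the scalability, availability, or durability properties discussed in this lecture are addressed:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative single-node analogue of the put/get/delete interface;
// real KV-stores add partitioning, replication, and persistence.
public class InMemoryKeyValueStore {
    private final Map<String, byte[]> data = new ConcurrentHashMap<>();

    // Write/update: put(key, value)
    public void put(String key, byte[] value) {
        data.put(key, value);
    }

    // Read: get(key)
    public Optional<byte[]> get(String key) {
        return Optional.ofNullable(data.get(key));
    }

    // Delete: delete(key)
    public void delete(String key) {
        data.remove(key);
    }
}
```

As stated on the slide, there is deliberately no aggregation, no join, and no multi-row transaction; every call touches exactly one key.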
13 KV-Store Client Interface – Example HBase Configuration conf = HBaseConfiguration.create(); Initialization conf.set("hbase.zookeeper.quorum", "192.168.127.129"); Using ZooKeeper HTable table = new HTable(conf, „MyBaseTable"); Column Family: “Schema” Put put = new Put(Bytes.toBytes("key1")); put.add(Bytes.toBytes("colfam1"), Bytes.toBytes(„value"), Bytes.toBytes(200)); table.put(put); Column: Defined at run-time Get get = new Get(Bytes.toBytes(“key1")); Result result = table.get(get); byte[] val = result.getValue(Bytes.toBytes("colfam1"), Bytes.toBytes(„value")); System.out.println("Value: " + Bytes.toInt(val)); 14 KV-Stores in Practice § Bigtable § Amazon § Apache HBase § Key: customerID § Value: customer profile (history, …) § Apache Cassandra § Redis § Facebook, Twitter § Amazon Dynamo § Key: UserID § Yahoo! PNUTS § Value: user profile (friends, photos) § LevelDB § RocksDB 15 Common Properties of KV-Stores Common Elements Common Features § Memory store & write ahead log § Flexible data model with column (WAL) families § Keep data in memory for fast access § Horizontal partitioning (key ranges) § Keep a commit log as ground truth § Store big chunks of data as raw § Versioning bytes § Store different versions of data § Fast column-oriented single-key § Timestamping access (read/write/update) § Replication § Store multiple copies of data § Failure detection & failure recovery 16 Common Non-Features of KV-Stores § Integrity constraints we do not chekc if data model is valid § Transactional guarantees, i.e., ACID § Powerful query languages § Materialized views 17 Index Simple Index § Prerequisite: Sorted data file Dense index sorted by Primary key § Called sequential file § In DB typically sorted on primary key of a relation § Index file stores (key, pointer) pairs § Key value K is associated with pointer § Pointer points to record in sequential file that contains key value K Sparse index § Index types § Dense Index § One entry in index file for every record in sequential file § Sparse Index § One entry in index file for every block in sequential file B-Trees § So-far: Single level index for access acceleration § In general: B-Trees (here B+-Trees) § As many levels as necessary § Blocks are at least 50% full § No overflow blocks needed 20 Structure § Index blocks organized in a tree § Balanced § Each path from root to leaf has the same length. § Parameter n § Every block holds at least n and up to 2n search keys § Every block holds at least n+1 up to 2n+1 pointers § Just like index block before, but one additional pointer § Choice of n § n as large as possible with respect to block size § 4096 Bytes per block; 4 Bytes per key; 8 Bytes per pointer § 4(2n) + 8(2n+1) ≤ 4096 => n = 170 § Alternative definition § Every block holds at least (𝑛 + 1)/2 − 1 search keys, (𝑛 + 1)/2 pointers § Every block holds at most n search keys, n+1 pointers 21 Definition of B+-Trees § Keys in the leaves are keys from the data § Sorted across all leaves (from left to right) § Root: at least one used pointer. § All pointers point to B-tree block one level below § Leaves: Last pointer points to next leaf (on the right), § From the other 2n pointers at least n are used. K1 RID1 K2 RID2 … Next § Point to data blocks § Inner Nodes: Pointers point to B-tree blocks of level below § At least n+1 are used § If j pointers are used, there are j-1 keys in the block § K1, … , Kj-1 P0 K1 P1 K2 P2 … Free § First pointer points to subtree with key values < K1. § Second pointer points to subtree with key values between K1 and K2. 
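The choice of the parameter n and the block layout just defined can be illustrated with a small sketch. This is a simplified in-memory model with invented names, not the actual on-disk block format:

```java
// Simplified in-memory model of the B+-tree blocks just defined.
// For a 4096-byte block with 4-byte keys and 8-byte pointers:
// 4*(2n) + 8*(2n+1) <= 4096  =>  24n <= 4088  =>  n = 170.
public class BPlusTreeNodes {

    static class Node {
        int[] keys;       // up to 2n search keys (at least n, except in the root)
        int numKeys;
    }

    // Inner node: j used pointers go with j-1 keys; the (i+1)-st pointer
    // leads to the subtree with key values in [K_i, K_{i+1}).
    static class InnerNode extends Node {
        Node[] children;  // up to 2n+1 pointers, at least n+1 used
    }

    // Leaf: each key is paired with a record pointer; the extra pointer
    // links to the next leaf on the right for range scans.
    static class LeafNode extends Node {
        long[] recordIds; // pointers into the data file (RIDs)
        LeafNode next;    // last pointer: next leaf
    }

    public static void main(String[] args) {
        int blockSize = 4096, keySize = 4, pointerSize = 8;
        // largest n with keySize*(2n) + pointerSize*(2n+1) <= blockSize
        int n = (blockSize - pointerSize) / (2 * keySize + 2 * pointerSize);
        System.out.println("n = " + n); // prints 170, as on the slide
    }
}
```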
22 Example Configuration § n=2 § All nodes: At most 4 keys and 5 pointers § Root: At least 1 key and 2 pointers § Inner Nodes: At least 2 key and 3 pointers § Leaves: At least 2 keys and 3 pointers 41 12 28 46 67 first pointer goes to block that is snaller than 46 (not equal) second ponter lerger than 46 and smaller tha 67 1 5 9 12 15 19 28 33 37 41 45 46 53 59 67 71 83 99 23 Alternative Definition § So far: Parameter n § Block contains at least n keys § Block contains at most 2n keys § Block contains at least n + 1 pointers (as before) § Alternative: § Node contains at most n keys, n+1 pointers § Inner node contains at least é(n+1)/2ù pointers § Leaf contains at least ë(n+1)/2û+1 pointers § Always: § Inner node contains always one pointer more than number of search keys § Leaf contains as many pointers as search keys § plus linked list 24 Examples of Leaves § n=2 § 4 keys and 5 pointers 67 71 83 99 § Full leaf To next leaf To block with To block with To block with To block with key 67 key 71 key 83 key 99 67 71 To next leaf § Partially filled leaf To block with To block with 67 Not key 67 key 71 allowed 25 Examples of Inner Nodes § Full inner node 67 71 83 99 To keys To keys K < 67 To keys To keys K ≥ 99 67 ≤ K < 71 To keys 83 ≤ K < 99 71 ≤ K < 83 67 71 § Partially filled inner node To keys Not K < 67 To keys To keys allowed 67 ≤ K < 71 K ≥ 71 26 Example B+-Tree §n=2 § ≥ 2 keys § ≥ 3 pointers 41 12 28 46 67 1 5 9 12 15 19 28 33 37 41 45 46 53 59 67 71 83 99 In the leaves, every key exists only once 27 Applications of B+-Trees § B-trees can assume different index roles: Dense index § Search key is primary key § Dense index § Order of data file is not important § Sparse index Sparse index § Data file must be sorted by primary key § Search key is NOT primary key § Search key is not unique § Duplicate keys in leaves have to be handled 29 B-Tree Search Searching in B-Trees § For now: Search key = primary key § Dense index § Operations for sparse indexes similar logarithmic search § Find K. § If we are at a leaf: § Search K in the leaf. § If we are at an inner node with keys K1, K2, … Kn: § If K < K1 go to first child § If K1 ≤ K < K2 go to second child § … § If Kn ≤ K go to last child 31 Example: Searching in B-Trees K = 60 41 41 ≤ 60 12 28 46 67 46 ≤ 60< 67 1 5 9 12 15 19 28 33 37 41 45 46 53 59 67 71 83 99 59 < 60 L now we have to search that block linearily 32 Example: Searching in B-Trees K = 53 41 41 ≤ 53 12 28 46 67 46 ≤ 53< 67 1 5 9 12 15 19 28 33 37 41 45 46 53 59 67 71 83 99 53 = 53 J 33 Range Queries § Queries with inequality in WHERE-clause § SELECT * FROM R WHERE R.k > 40 § SELECT * FROM R WHERE R.k >= 10 AND R.k b: § Follow pointer to next node. § Similar for open intervals [-∞ , b] respectively [a, ∞]. 34 Example: Searching in B-Trees 10 ≤ K ≤ 28 41 10 < 41 12 28 46 67 10 < 12 1 5 9 12 15 19 28 33 37 41 45 46 53 59 67 71 83 99 12, 15, 19, 28 35 B-Tree Insertion Inserting into a B-Tree Recursive Algorithm: § Search corresponding leaf. § If room, insert key and pointer. § If no room: Overflow § Split leaf in two parts and distribute keys equally § Split requires inserting a new key/pointer pair in parent node § Recursively ascend the tree § Exception: If no space in root § Split root § Create new root (with only one key) 37 Example of B-Tree Insertion K = 60 41 41 ≤ 60 12 28 46 67 46 ≤ 60< 67 1 5 9 12 15 19 28 33 37 41 45 46 53 59 60 67 71 83 99 Easy J 38 Example of B-Tree Insertion K = 61 41 41 ≤ 61 12 28 46 67 46 ≤ 61< 67 1 5 9 12 15 19 28 33 37 41 45 46 53 59 60 67 71 83 99 ? 
39 Example of B-Tree Insertion K = 61 41 12 28 46 67 1 5 9 12 15 19 28 33 37 41 45 67 71 83 99 We need a key/pointer 46 53 59 60 61 pair Key: 59 40 Example of B-Tree Insertion K = 61 41 12 28 46 59 67 1 5 9 12 15 19 28 33 37 41 45 67 71 83 99 46 53 59 60 61 41 Insertion Cost § Let h be the height of the B-Tree § In practice often: h = 3 § Search for leaf node: h § If no split needed: Total cost h + 1 § h block reads, 1 block write § If split needed § Worst-case: ascends to root § Even caching useless: nodes need to be written to disk § Total cost: 3 h + 1 § On every level search and overflow handling § + writing new root 42 Deleting from a B-Tree Find corresponding node Delete key If still at least minimal number of keys in node § Nothing else to do If too few keys in node: Merge § If a sibling node (left or right) contains more than minimal number of keys, „steal“ a key. § Maybe need to adjust keys of parents § If not: There are two siblings in tree with minimal and sub-minimal number of keys § => These nodes can be merged. § Keys in parent have to be adjusted § Possibly delete propagates upward in the tree 43 Example of B-Tree Deletion K = 12 41 12 28 46 67 1 5 9 12 15 19 28 33 37 41 45 46 53 59 67 71 83 99 Just remove 12 44 Example of B-Tree Deletion K = 12 41 12 doesnt Need to be changed as a parent, split ist still valid 12 28 46 67 1 5 9 15 19 28 33 37 41 45 46 53 59 67 71 83 99 45 Example of B-Tree Deletion K = 15 41 12 28 46 67 1 5 9 15 19 28 33 37 41 45 46 53 59 67 71 83 99 46 Example of B-Tree Deletion K = 15 41 12 28 46 67 1 5 9 19 28 33 37 41 45 46 53 59 67 71 83 99 Steal from sibling 47 Example of B-Tree Deletion K = 15 41 9 28 46 67 1 5 9 19 28 33 37 41 45 46 53 59 67 71 83 99 Adjust parent 48 Example of B-Tree Deletion K=5 41 9 28 46 67 1 5 9 19 28 33 37 41 45 46 53 59 67 71 83 99 Cannot steal Delete propagates 49 Example of B-Tree Deletion K=5 41 28 46 67 1 9 19 28 33 37 41 45 46 53 59 67 71 83 99 And propagates 50 Example of B-Tree Deletion K=5 28 41 46 67 1 9 19 28 33 37 41 45 46 53 59 67 71 83 99 51 Deletion Cost § Search and local delete: h + 2 § Write leaf node § Write data file § When merging with siblings: h + 6 § Check left and right § Write block and move neighbors § Write parent nodes § Write data block § When merging up to root: 3h - 2 52 Deleting from B-Trees - Variation § Assumption: Normally data sets grow § Consequence: Never delete nodes in B-tree § Underflow nodes in B-Tree will eventually be filled again § We need to keep information of deletes and maintain structure § Improvement: Tombstone in B-tree instead of data file block 53 B-Tree Variations B-Trees for Non-Primary Keys § Meaning of pointers in the inner nodes changes § Why? § Non-primary keys are not unique § Keys with same value can span blocks § Given: Keys K1, … , Kj => Ki is smallest new key value reachable from (i+1)-st pointer. § I.e., there is no key value Ki in left subtree, but at least one occurrence of the key value in the subtree after the (i+1)-st pointer. § Problem: There is not always such a key 55 B-Trees for Non-Primary Keys No new values in second child 28 is not 41 new 12 28 ^ 67 1 5 9 12 15 19 28 28 28 28 41 46 41 53 41 59 67 71 There is no reason to start a search here. 
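The search procedure from the slides above (descend through the inner nodes, then scan the leaf; follow the next-leaf pointers for range queries) can be sketched as follows. This is an illustrative in-memory model with invented names; a real implementation works on disk blocks and typically uses binary search within a node:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of B+-tree lookup and range scan on a simplified in-memory node
// model; leaves are linked left to right as described above.
public class BPlusTreeSearch {

    static class Node {
        boolean leaf;
        List<Integer> keys = new ArrayList<>();
        List<Node> children = new ArrayList<>();   // inner nodes only
        List<Long> recordIds = new ArrayList<>();  // leaves only
        Node next;                                 // leaves only: right neighbor
    }

    // Descend: at an inner node with keys K1..Kj-1, follow the first child
    // for K < K1, the (i+1)-st child for Ki <= K < Ki+1, the last child for
    // K >= Kj-1; at the leaf, scan for K.
    static Long get(Node node, int key) {
        while (!node.leaf) {
            int i = 0;
            while (i < node.keys.size() && key >= node.keys.get(i)) i++;
            node = node.children.get(i);
        }
        int pos = node.keys.indexOf(key);
        return pos >= 0 ? node.recordIds.get(pos) : null;
    }

    // Range query a <= K <= b: find the leaf for a, then follow next-pointers.
    static List<Long> range(Node root, int a, int b) {
        Node node = root;
        while (!node.leaf) {
            int i = 0;
            while (i < node.keys.size() && a >= node.keys.get(i)) i++;
            node = node.children.get(i);
        }
        List<Long> result = new ArrayList<>();
        while (node != null) {
            for (int i = 0; i < node.keys.size(); i++) {
                int k = node.keys.get(i);
                if (k > b) return result;
                if (k >= a) result.add(node.recordIds.get(i));
            }
            node = node.next;
        }
        return result;
    }
}
```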
□Ki is smallest key value reachable from pointer (i+1) 56 B-Tree Variations: B*-Tree § B*-tree and B#-tree § Overflow during insertion: distribution across all leaves § If not possible: create 3 new leaves from 2 old leaves § In general, create m+1 nodes from m nodes § Better memory utilization m = 1 (B+-Tree) § At least 66% m = 2 (B*-Tree) m=3 57 Next Part § Key Value Stores II § LSM Tree Application / Query Language / Analytics / Application § Distributed Architecture Visualization Data Processing Data Management Big Data Systems File System Virtualization / Container OS / Scheduling Infrastructure Hardware 58 Thank you for your attention! § Questions? § In Moodle § Per email: [email protected] § In Q&A sessions 59 Big Data Systems Martin Boissier Key Value Stores II Data Engineering Systems Hasso-Plattner-Institut Timeline I Date Tuesday Wednesday 15.10. /16.10 Intro / Organizational Use Case - Search Engines 22.10. / 23.10. Performance Management Intro to GitHub Classroom 29.10. / 30.10. Map Reduce I Map Reduce II 5.11. / 6.11. Map Reduce III Exercise 12.11. / 13.11. Data Centers Cloud 19.11 / 20.11. File Systems Exercise 26.11. / 27.11. Key Value Stores I Key Value Stores II 3.12 / 4.12. Key Value Stores III Exercise 10.12. / 11.12. Stream Processing I Stream Processing II 17.12. / 18.12. ML Systems I Exercise Christmas Break 3 This Lecture 1. LSM-Tree 2. Distributed Architecture Sources Hans-Arno Jacobsen (TUM) – Distributed Systems Johannes Zschache (University of Leipzig) – NoSQL-Datenbanken Volker Markl (TUB) – Database Technology 4 Where are we? § Data Management Application / Query Language / Analytics / Visualization Application § Operational storage Data Processing Data Management Big Data Systems § Small writes and reads File System Virtualization / Container OS / Scheduling Infrastructure Hardware 5 LSM-Tree High Insert & Update in B-Tree § Insertion cost: logn(N) ~ 3-4 we have many insertioans in kv stores and insertions are expensive § 2-3 disk accesses per write it is expensive because we have to go to the leaf § 2-3 Seek time + Rotational latency + Transfer time § Many applications write a lot of data in small inserts. L § Can we do better? 7 Log Structured Merge - Tree § Split index in two parts C0 and C1 § C0 resides in memory – recent data § C1 resides on disk – older data § Both cover full data domain § O’Neil: Memory optimized tree § C0 = AVL-Tree § C1 = B-Tree D O’Neil, P. et al. (1996). The log-structured merge-tree (LSM-Tree). 8 LSM Tree – Compaction / Rolling Merge § Updates in memory in place – fast § Updates on disk in bulk (merge operation) – fast (for large blocks) § Multiple levels for further amortization of cost 9 LSM-Tree – Other Operations § Search § Start with C0 then go to C1 § If in C0 no need to got to C1 § Delete § Insert tombstone § Actual delete during merge § Nice side effects § Multiple versions of the data in the tree § Can be used for recovery, etc. 
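A minimal sketch of the C0/C1 design just described: writes go to an in-memory component (here a TreeMap standing in for the AVL-tree), which is flushed into immutable sorted runs; deletes insert tombstones; reads check C0 first and then the runs from newest to oldest. All names are illustrative, and the write-ahead log, compaction, and Bloom filters discussed on the following slides are omitted:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.TreeMap;

// Sketch of the LSM write/read path: writes always go to the in-memory
// component, reads check memory first and then the sorted runs.
public class LsmSketch {
    private static final byte[] TOMBSTONE = new byte[0];

    private TreeMap<String, byte[]> memtable = new TreeMap<>();            // C0
    private final List<TreeMap<String, byte[]>> runs = new ArrayList<>();  // C1: newest first
    private final int memtableLimit;

    public LsmSketch(int memtableLimit) { this.memtableLimit = memtableLimit; }

    public void put(String key, byte[] value) {
        memtable.put(key, value);                 // in-place update in memory: fast
        if (memtable.size() >= memtableLimit) flush();
    }

    public void delete(String key) {
        put(key, TOMBSTONE);                      // tombstone; real delete happens at merge
    }

    public Optional<byte[]> get(String key) {
        byte[] v = memtable.get(key);             // C0 first; if found, no need to go to C1
        if (v == null) {
            for (TreeMap<String, byte[]> run : runs) {   // newest run first
                v = run.get(key);
                if (v != null) break;
            }
        }
        // Reference comparison is fine: tombstones always use the same array instance.
        return (v == null || v == TOMBSTONE) ? Optional.empty() : Optional.of(v);
    }

    private void flush() {
        runs.add(0, memtable);                    // becomes an immutable sorted run
        memtable = new TreeMap<>();
    }
}
```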
10 LSM Tree -> Log Structured File § Faster inserts: O(1) if we do not find in the memtable we do not know where to look next, so we have to go though everything standart hash table § Write goes always to Memtable (+log-file) § Memtable is flushed to immutable Level 1 SSTable § Sequential I/O § SSTables are periodically compacted L3 SSTable § Merge into next level L3 SSTable L3 SSTable § Read merges SSTables of all levels L2 SSTable L3 SSTable § Expensive for non-existent keys → Bloom filter L2 SSTable L3 SSTable § O(log(n)) Memtable L1 SSTable L2 SSTable L3 SSTable 11 Disk Layout – Immutable Data § Optimization of writes by inserting instead of overwriting § Memtable: main memory-buffer; protected by write-ahead log on HDD § Sorted Data Files (SSTable): sorted (by key) and immutable on HDD § Deletion using Tombstones § When level is full compact Memory Write MT Flush L1 Disk L2 L2 L2 L3 L3 L3 L3 L3 L3 L3 L3 L3 12 Compaction - Tiering vs Leveling § Leveling L1 Merge § 1 run per level L2 § Sort-merge with next L3 level when full § Tiering L1 L1 L1 § T runs per level L2 L2 L2 § Sort with one run on L3 L3 L3 next level § Hybrid / partial L1 § Fixed size runs L2 L2 L2 § increasing number per L3 L3 L3 L3 L3 L3 L3 L3 L3 level here factor 3 13 Read § Reading requires combination of files § Read until first version is found or tombstone § Expensive if key does not exist Memory MT Read Buffer L1 Disk L2 L2 L2 L3 L3 L3 L3 L3 L3 L3 L3 L3 14 Bloom Filter § Determine fast if key exists § No I/O if not key1 key2 h1(key1) h2(key1) h1(key2) § Bloom filter per SSTable h2(key2) § Bit array for inserted keys 0 1 0 0 1 0 1 0 0 0 0 1 0 § No false negatives h2(key3) h1(key3) § Multiple hashes per key exists(key3) No! 0 it definitly does not exist, 1 it might exist 15 File Format § Variable block size § Blocks are indexed § Trailer: metadata (e.g., start of the index) § Type = value/tombstone 16 Block Index - Hash Table In-memory hash map with byte-offsets as values 17 Source: https://medium.com/@shashankbaravani/database-storage-engines-under-the-hood-705418dc0e35 Sorted String Table (SSTable) Hash table with sorted keys pointer is block und number is the Offset in this block Advantages: faster compacting (merge), range queries, reduction of HDD IO 18 Source: https://medium.com/@shashankbaravani/database-storage-engines-under-the-hood-705418dc0e35 Distributed Architecture Distributed Architecture - Motivation § Scalability (Elasticity) how fast System scales up or down § Data volume, processing, or access exceeds one machine § Spread load on more machines § Availability (Fault Tolerance) § Guarantee data availability in presence of failures and crashes § Keep multiple distributed copies § Latency § For multiple various locations keep latency low § Local data access 20 Replication and Partitioning § Partitioning § Store subsets of data on different nodes § Improves scalability, availability, latency § Challenge: load balancing § Replication § Store multiple copies of data items § Introduces redundancy § Improves scalability, availability, latency § Challenge: consistency 21 Partitioning § Partitioning of key-value data § Partitioning by key range § Partitioning by hash of key § Partitioning algorithm § Each data item (record, row, document,...) belongs to exactly one partition (considering replicated partitions as same partitions). § Algorithm tasks: 1. Given any data item, assign it to a partition 2. 
Keep partitions (possibly) balanced 22 Partition Design: Prevention of Hotspots Book:38323 Book:38324 Book:38325 Book:38323 Hash function Book:38324 or Randomizer Book:38325 23 Partitioning – Hash vs Range § Range partitioning h(key) § Split data by key ranges § Good for scans § Harder to balance § Hash partitioning § Split data by range of key hash § Random data distribution § No locality 24 Distributed Hash Tables (DHT) § A distributed key-value store § Hash table providing two operations: § put(key, value): stores the pair (key, value) in the hash table § get(key) → value: retrieves the value associated with the given key we do not Need Special Routing. we see the key and know where to go § Partitioning of the logical key space among all peers § Data-centered routing without global knowledge (routing by value) § Specific systems differ in their topology 25 DHT: Design Goals § Load balancing: § Uniform distribution of keys among all peers § Decentralization: § Only peers with equal rights, no specialized nodes § Availability and robustness: § Adaptation of structure to joining, leaving or failing nodes § Flexible naming scheme: § No restrictions regarding the structure of keys 26 Consistent Hashing § DHT have the problem of hash fragmentation: § Traditional hash functions remap all tuples when the number of nodes changes (e.g., failure, node goes offline) I I G H C H G E D F F Rebuild A E D B C A B § Consistent hash function: § Minimizes the degree of remapping in case of the addition or removal of locations (nodes, slots) § Only K/n keys remapped on average (K... total number of keys, n... number of slots) I I F B C H G F Rebuild C H G A E D B A E D 27 DHT – High Level Architecture § Hash function assigns an m-bit identifier ID to each node and each key § Data structure (topology): identifier circle = hash ring § Hash values are mapped to unique positions on the ring § Key k is assigned to the first node those ID is equal to ID(k) or is a successor of ID(k) = successor(k) Node 1 Node 8 Key-Range Assigned to Node 8 (clockwise assignment) Node 14 Node 42 § Virtual nodes § Multiple nodes per server for load balancing Node 32 Node 21 § Gossiping for node and failure detection 28 DHT with Consistent Hashing § Node chooses random seed § Position in the ring § Assigned key range is [seed, next seed[ § Not necessarily balanced → multiple nodes per server § Values are hashed to key range § MD5 hash of values New Node New Key Range § New node requires only local changes Old Key Range § Only one key range is split 29 DHT Replication § Replication for high availability § Typically replication factor 3 § Replica on next N-1 neighbors in ring § Original node is “coordinator” § List of Replicas is “preference list” for data item § Coordinator is first item in list Key Range First Replica Second Replica 30 Summary § LSM-Tree key1 h1(key1) key2 h2(key1) § Distributed architecture h2(key2) h1(key2) § Partitioning 0 1 0 0 1 0 1 0 0 0 0 1 0 § Consistent hashing h2(key3) h1(key3) exists(key3) No! 31 Next Part § Key Value Stores III § Distributed Consistency Application / Query Language / Analytics / Visualization Application Data Processing Data Management Big Data Systems File System Virtualization / Container OS / Scheduling Infrastructure Hardware 32 Thank you for your attention! § Questions? 
§ In Moodle § Per email: [email protected] § In Q&A sessions 33 Big Data Systems Martin Boissier Key Value Stores III Data Engineering Systems Hasso-Plattner-Institut Announcements § Submission deadline of Quiz 2 postponed to 2024-12-09 § Tomorrow‘s exercise and next week‘s lecture on 2024- 12-04 and 2024-12-11 switched (see next slide) 2 Timeline I Date Tuesday Wednesday 15.10. /16.10 Intro / Organizational Use Case - Search Engines 22.10. / 23.10. Performance Management Intro to GitHub Classroom 29.10. / 30.10. Map Reduce I Map Reduce II 5.11. / 6.11. Map Reduce III Exercise 12.11. / 13.11. Data Centers Cloud 19.11 / 20.11. File Systems Exercise 26.11. / 27.11. Key Value Stores I Key Value Stores II 3.12 / 4.12. Key Value Stores III Stream Processing I 10.12. / 11.12. Stream Processing II Exercise 17.12. / 18.12. ML Systems I Exercise Christmas Break 3 This Lecture 1. Recaps 2. Distributed Consistency § 2PC § 3PC 3. Data Model Sources Hans-Arno Jacobsen (TUM) – Distributed Systems Johannes Zschache (University of Leipzig) – NoSQL-Datenbanken Volker Markl (TUB) – Database Technology 4 First-In First-Out Scheduling § Single processor/queue § A.k.a., First-come first-serve 10 § Maintain jobs in queue in order of arrival Job 1 § When processor free, dequeue head 5 Job 2 3 Job 1 Job 2 Job 3 Job 3 0 2 5 10 15 18 0 2 5 Selection of metric depends on use § Average completion time is high case (e.g., scheduling interactive Job Length Arrival !"# $ % !"# & % !"# ' $(%$)%$* +' jobs as part of the OS or long- 1 10 0 § = = = 14.33 ' ' ' running MapReduce jobs). 2 5 2 § Average turn around time (completion time – arrival time): !"# $ % !"# & % !"# ' $(%$'%$' ', 3 3 5 § ' = ' = ' = 12 5 Recaps – B+-trees and Duplicate Keys § In most cases, B+-trees are used for unique keys § Meaning of pointers in the inner nodes changes § Keys with same value can span blocks § Given: Keys K1, … , Kj => Ki is smallest new key value reachable from (i+1)-st pointer. § I.e., there is no key value Ki in left subtree, but at least one occurrence of the key value in the subtree after the (i+1)-st pointer. § Problem: There is not always such a key https://15445.courses.cs.cmu.edu/fall2023/notes/08-trees.pdf 6 Recaps – B+-trees and Duplicate Keys No new values in second child 28 is not 41 new 12 28 ^ 42 67 1 5 9 12 15 19 28 28 28 28 41 46 41 53 41 42 67 71 There is no reason to start a search here. □Ki is smallest new key value reachable from pointer (i+1) 7 Recaps – B+-trees and Duplicate Keys 41 12 28 ^ 43 67 1 5 9 12 15 19 28 29 41 42 46 43 53 44 59 45 67 71 … … Alternatively, we can link lists per key. 8 Recaps – LSM Storage Designs § There is an array of storage designs § Leveling § Tiering § 1-Leveling (tiering for 1st level, others use leveling) § L-Leveling (leveling for last level, others use tiering) § Various hybrid forms § Depending on the use case, different options perform better § MemTable compactions can lead to lower performance when frequently accessed items are no longer memory-resident: § Fenggang et al. propose additional caches [AVK] § Often, tiering and block caching are sufficient (only first search causes I/O) [AVK] Fenggang et al.. AC-Key: Adaptive Caching for LSM-based Key-Value Stores. USENIX ATC 2020: 603-615 9 [Designs] Sarkar et al.: Constructing and Analyzing the LSM Compaction Design Space (Updated Version). 
CoRR abs/2202.04522 (2022) Recaps – Consistent Hashing § Assumption: § 'hash() MOD n' as node/partition function § System with five nodes Input data: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 hash() MOD 5 Node 1: Node 2: Node 3: Node 4: Node 5: 5 10 15 20 1 6 11 16 2 7 12 17 3 8 13 18 4 9 14 19 Example adapted from https://medium.com/@anil.goyal0057/understanding-consistent-hashing-a-robust-approach-to-data-distribution-in-distributed-systems-0e4a0e770897 10 Recaps – Consistent Hashing § Assumption: § 'hash() MOD n' as node/partition function § System with five nodes Input data: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 hash() MOD 5 Node 4 fails Node 1: 5 10 15 20 Node 2: 1 6 11 16 Node 3: 2 7 12 17 X Node 4: 3 8 13 18 Node 5: 4 9 14 19 Example adapted from https://medium.com/@anil.goyal0057/understanding-consistent-hashing-a-robust-approach-to-data-distribution-in-distributed-systems-0e4a0e770897 11 Recaps – Consistent Hashing § Assumption: § 'hash() MOD n' as node/partition function § System with five nodes Input data: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Adapt 15 16 17 18 19 20 node/partition function accordingly. hash() MOD 4 Node 1: Node 2: Node 3: Node 4: Node 5: 4 8 12 16 20 1 5 9 13 17 2 6 10 14 18 3 8 13 18 3 7 11 15 19 12 Recaps – Consistent Hashing N5 Node 5 handles slots 17 to 20 1 20 2 19 3 N1 18 4 17 5 Node 1 handles slots 1 to 4 N4 16 6 15 7 14 8 13 9 N2 12 10 11 Node 2 handles slots 5 to 8 N3 13 Recaps – Consistent Hashing Node 5 now N5 handles slots 17 to 20 and 1 20 2 13 to 16 19 3 Most nodes do not need to adapt when N1 18 4 Node 4 fails. 5 X 17 When Node 4 becomes available again, we can split Node 5’s range of 13 to 20 N4 16 6 again and assign 13-16 to Node 4. 15 7 Similar approach for skewed workloads: If Node 3 is overloaded, we can split its 14 8 range and add a new Node for slots 8-9. 13 9 N2 12 10 11 N3 N6 14 Where are we? § Data Management Application / Query Language / Analytics / Visualization Application § Operational storage Data Processing Data Management Big Data Systems § Small writes and reads File System Virtualization / Container OS / Scheduling Infrastructure Hardware 15 Consistency Refresher: ACID § Transactions require ACID semantics: § Atomic § Consistent § Isolated § Durable § On a single machine: § Two-phase locking § Undo-/redo logging § How to guarantee ACID semantics for distributed transactions? § Distributed commit protocols § Distributed lock protocols 17 Distributed Commit Protocols § Commit protocols ensure ACID properties for transactions. § In the distributed case: § All participating nodes must come to the same final decision (commit or abort). § Commit only if all participants vote „Yes“ § Protocols: § Two-Phase Commit § Widely used, but not resilient to all possible failures § Three-Phase Commit § Non-blocking protocol; does not work with network partitions § Paxos Commit § Asynchronous protocol, fault tolerant & safe 18 Two-Phase Commit (2PC) - Overview § Widely used in distributed databases. § Roles in the protocol: § One coordinator § n workers (cohorts) § Assumptions: § Write-ahead log at each node § No node crashes forever § Any two nodes can communicate § Transactions are processed in two phases: § All nodes try to apply the transaction (writing it to the log) § The transaction is only committed if all nodes succeeded! 19 Two-Phase Commit (2PC) – Commit Request Phase Worker Worker 1. System receives user request. Query User Coordinator Worker Worker 20 Two-Phase Commit (2PC) – Commit Request Phase Worker 1. 
System receives user Worker pre request. pr pa ep re 2. Coordinator sends „Prepare ar e Commit“ message to all workers. User pare Coordinator pre e ar ep pr Worker Worker 21 Two-Phase Commit (2PC) – Commit Request Phase Log Log 1. System receives user re Worker Worker ad rea request. y/f dy ail /fa ilur ur e 2. Coordinator sends „Prepare e Commit“ message to all workers. 3. Workers process request and lure write pending changes to log. i User y/fa Coordinator If successful, worker d rea answers „Ready“. re ilu In case of errors, worker fa y/ answers „Failure“. ad Log re Log Worker Worker 22 Two-Phase Commit (2PC) – Commit Phase (Success) All workers answered Log Log “ready”: Worker Worker rea re dy ad y User ady Coordinator re ady re Log Log Worker Worker 23 Two-Phase Commit (2PC) – Commit Phase (Success) All workers answered Log Log “ready”: 1. Coordinator sends “commit” Worker Worker to all workers. co co mm m it m it m it User m Coordinator co it m m co Log Log Worker Worker 24 Two-Phase Commit (2PC) – Commit Phase (Success) All workers answered Log Log “ready”: 1. Coordinator sends “commit” Worker Worker to all workers. 2. Workers commit pending ac ac k changes from log; k Send “ack” to coordinator. 3. Coordinator completes transaction, once all Response workers finished. User k Coordinator ac k ac Log Log Worker Worker 25 Two-Phase Commit (2PC) – Commit Phase (Abort) All workers answered Log Log “ready”: 1. Coordinator sends “commit” Worker Worker to all workers. rea 2. Workers commit pending re dy changes from log; ad y Send “ack” to coordinator. 3. Coordinator completes transaction, once all workers finished. At least one answered User ilure Coordinator “failure”: f a ady re Log Log Worker Worker 26 Two-Phase Commit (2PC) – Commit Phase (Abort) All workers answered Log Log “ready”: 1. Coordinator sends “commit” Worker Worker to all workers. rol 2. Workers commit pending ro lba changes from log; llb ck ac Send “ack” to coordinator. k 3. Coordinator completes transaction, once all workers finished. At least one answered ack User lb Coordinator “failure”: rol 1. Coordinator sends k “rollback” to all workers. ac llb ro Log Log Worker Worker 27 Two-Phase Commit (2PC) – Commit Phase (Abort) All workers answered “ready”: Log Log 1. Coordinator sends “commit” to all workers. Worker Worker 2. Workers commit pending changes from log; ac Send “ack” to coordinator. k ack 3. Coordinator completes transaction, once all workers finished. Failure At least one answered “failure”: User k Coordinator 1. Coordinator sends “rollback” ac to all workers. 2. Workers undo pending ack changes from log; Log Send “ack” to coordinator. Log 3. Coordinator undoes transaction, once all workers Worker Worker finished. 28 Two-Phase Commit (2PC) - Problems § 2PC is a blocking protocol: § Coordinator must wait for responses from all workers § Recovery only possible if node failures are non-terminal § Possible dead-locks: § Coordinator failure between the two phases: § Workers will wait indefinitely for commit / rollback decision § Worker failure after “Prepare Commit”: § Coordinator will wait indefinitely for status from dead worker § Worker failure after commit / rollback decision: § Coordinator will wait indefinitely for “ack” from dead worker è 2PC will fail in case of (terminal) node failures or network outages! 
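The decision rule behind the two phases can be sketched as follows. This is a simplified sequential simulation under the assumption that every message arrives; the blocking problems listed above occur precisely when that assumption fails. The interface and names are illustrative, not taken from any particular system:

```java
import java.util.List;

// Sketch of the coordinator side of 2PC: commit only if every worker
// voted "ready" in phase 1, otherwise roll back everywhere.
public class TwoPhaseCommitCoordinator {

    interface Worker {
        boolean prepare(String txId);   // write pending changes to log, vote ready/failure
        void commit(String txId);       // apply pending changes from log
        void rollback(String txId);     // undo pending changes from log
    }

    public boolean execute(String txId, List<Worker> workers) {
        // Phase 1: commit request ("prepare commit")
        boolean allReady = true;
        for (Worker w : workers) {
            if (!w.prepare(txId)) { allReady = false; break; }
        }
        // Phase 2: commit if all voted ready, otherwise rollback
        // (rolling back a worker that was never asked to prepare is a no-op here)
        for (Worker w : workers) {
            if (allReady) w.commit(txId); else w.rollback(txId);
        }
        return allReady;
    }
}
```

In the real protocol both phases are asynchronous message exchanges, and the coordinator additionally waits for the workers' acknowledgements before completing or undoing the transaction.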
29 Three-Phase Commit (3PC) - Overview § Non-blocking distributed commit protocol § Does not block in case of node failures / network outages § Allows up to k node failures § Assumptions are similar to 2PC: § One coordinator, n workers (cohorts) § Workers use write-ahead log § All nodes can exchange messages § Algorithm similar to 2PC, but: § Introduces special rules to handle timeouts § Introduces pre-commit phase after initial vote: Source: https://en.wikipedia.org/wiki/File:Three-phase_commit_diagram.png § Coordinator sends “pre-commit” to all nodes after initial vote § “commit” is sent, if at least n-k workers acknowledge pre-commit 30 Three-Phase Commit (3PC) – Advantages / Disadvantages § Advantages: § Global state is – usually – recoverable § As long as no more than k nodes fail at once § Does not block on single node failures § Disadvantages: § Network partitioning can lead to inconsistencies § Higher network overhead § Due to the higher cost, 3PC is generally not used in practice § Typical choice: 2PC + heuristics to handle failure cases § Both 2PC and 3PC fail if the network is partitioned! § è Leads to inconsistencies § è Paxos 31 Paxos – Overview § Distributed, asynchronous consensus protocol: § Three main roles: § Ensures consistency, even in case of network failures § Proposer: § State-of-the-art, used in Google’s Chubby / Apache § Offer proposals of the form [value, number]. Zookeper § Acceptor: § Consensus Protocol: § Accept or reject offered proposals so as to reach consensus on the chosen proposal/value. § „Generalized“ version of a distributed commit protocol § Learner: § Assume a collection of nodes can propose values § Become aware of the chosen proposal/value. § Consensus protocols ensure that a single value among the proposed ones is chosen § Not a fixed assignment, one node can have multiple roles! § Two important properties: § Quorum: § Safety § A majority of the acceptors. § Only a value that has been proposed may be chosen § Acceptors A, B, C → Majorities {A,B}, {B,C}, {A,C} § Only a single value is chosen § Proposals have to be accepted by a quorum. § A process never learns an unchosen value § Ensures at least some surviving node retains § Liveness knowledge of the results. § Some proposed value is eventually chosen § If a value has been chosen, then a process can eventually learn the value 32 Paxos vs. 2PC / 3PC § Both 2PC/3PC and Paxos are good choices under normal conditions § Paxos typically works better in very large clusters § 2PC is typically chosen for smaller, fixed environments § Difference: what happens when the network partitions: § 2PC/3PC: § Becomes inconsistent, but remains available § Paxos: § Remains consistent, but becomes unavailable § Is there a transaction protocol that remains available and consistent in case of Network Partitions? § No! § è CAP Theorem 33 CAP Theorem – Overview § Proposed in 2000 by Eric Brewer at ACM PODC § “Proven” in 2002 by scientists from MIT „System properties hold even in case of network Partition (not node!) 
failures.“ § For any distributed system … Tolerance CP P A „All requests to „All nodes must non-failing have the same view nodes must Consistency CA Availability on the data.“ / „All result in a request are atomic.“ response.“ 34 CAP Theorem – Implications I § CAP theorem forces to drop one of the three properties § Drop Partition Tolerance (CA system): § Available, and consistent, unless there is a partition § Examples: § Single node RDBMS § Parallel RDBMS using 2PC § Drop Availability (CP System): § Always consistent, even in a partition, but a reachable replica may deny service without agreement of the others § Examples: § Distributed system using Paxos § Large key-value stores like BigTable 35 CAP Theorem – Implications II § Drop Consistency (AP System): § A reachable replica provides service even in a partition, but may be inconsistent. § Examples: § DNS § „Web-Scale“ NoSQL systems. § Warning: Categorization is mostly theoretical! § Assigning a category to a system is often not obvious. § Categories often degenerate into each other. § At „Web-Scale“, Partition Tolerance is mandatory § Drop Availability or Consistency. § Dropping Availability: Replication. § Dropping Consistency: Weaker consistency models. 36 A Weaker Consistency Model § At „web-scale“, availability is often more important than consistency § Example: Amazon Webstore § Website down è ~1000$ loss per second downtime § Inventory inconsistent è A few customers might be unhappy § Other examples: Facebook Messaging, Google Web Search, … § However, some form of consistency is obviously required § Inventory should be roughly correct § Messages should eventually be displayed è Relax the ACID requirements for „web-scale“ applications! è Modern „web-scale“ applications focus on BASE, instead of ACID: § Basically Available § Soft-state § Eventually consistent 37 Consistency Models - Overview update: A B C read(D) D0→D1 Distributed D0 storage system § Strong consistency: § After the update completes, any subsequent access from A, B, C will return D1 § Weak consistency: § Does not guarantee that subsequent accesses will return D1 à A number of conditions need to be met before D1 is returned § Eventual consistency: § Special form of weak consistency § Guarantees that if no new updates are made, eventually all accesses will return D1 38 Variations of Eventual Consistency § Causal consistency: § If A notifies B about the update, B will read D1 (but not C!) § Read-your-writes: update: § A will always read D1 after its own update D0→D1 A B C read(D) § Session consistency: § Read-your-writes inside a session Distributed D0 storage system § Monotonic reads: § If a process has seen Dk, subsequent access will never return any Di with i < k § Monotonic writes: § Guarantees to serialize the writes of the same process § Variations are not mutually exclusive 39 ACID vs. 
BASE ACID BASE § Strong consistency for § Availability and scaling highest transactions of highest priority priorities § Availability less important § Weak consistency § Pessimistic § Optimistic § Rigorous analysis § Best effort § Complex mechanisms § Simple and fast 40 Data Model Design Principles KV-Store++ Wide Column Store : § Special key-value store: sorted and indexed § Aka: tabular data stores, columnar data stores, extensible record stores § Table-like structure with a flexible relational schema § Different sets of attributes per row § Scalable and fault-tolerant due to data distribution § Low latencies using direct read and write access § Limits: no joins, referential integrity, data types, transactions over multiple rows 42 Data Model (1) Customer Name column familiy Addresses First Name Last Name Street_1 City_1 Street_2 City_2 Mark Smith Long Str. 2 Berlin Short Str.3 Potsdam § Namespace defines table (region), e.g., “Customer” § Namespace consists of multiple column families § Example: Name, Address § Include related columns (similar content or equal types) § Column key = column family:column name, e.g., Name:Last Name § Column: Set of cells with same column key 43 Data Model (2) Customer Name Customer/123:Name:Lastname First Name Last name Mark Smith § Cell: Key-value pair § Key: Row-Key:Column Family:Column Name § Fast direct data access § Sorted: data is saved, e.g., lexicographically ordered (defined by implementation) § Indexed by row key and column key § Row: Set of cells with same key § Timestamps can be used § Multiple versions per cell 44 Example: Web-Table (1) § Data: Webpages and their links § Row-Key: Website-URL § Scenarios: § Direct access by webcrawler to add webpages § Batch-evaluation to build an index and/or ranking § Real-time access for search engine users to receive a cached version of the website 45 Example: Web-Table (2) Column Family Row Key Cell Version 46 Design Principles § Prevention of complex data structures in cells (JSON, XML, …) § Reasonable amount of versions § Denormalization § One long row with multiple columns per entity § Duplicates for efficient queries § Column names with values § Prevention of hot spotting in row keys § Usage of indices 47 Design: Denormalization (1) Customer Customer_Product Product Name Name … m:n Relationship … Customer Products Name … Product1 Product2 Product3 … Customer: 23 Mark Jones … 38383 48284 48284 … Product Customers Name … Customer1 Customer2 Customer3 … Prod:38383 Dell Laptop … 23 43 41 … 48 Design: Denormalization (2) Customer Products Name … 38383 48284 48284 … Customer: 23 Mark Jones … Dell Laptop Apple … Galaxy … … Product Customers Name … 23 43 41 … Prod:38383 Dell Laptop … Mark Jones John Marc Mark Marc … 49 Summary § Distributed Consistency Log Log § Data Model Worker Worker rol lba ck rollback ck User llba Coordinator ro rollback Log Log Worker Worker 50 Next Part § Stream Processing Application / Query Language / Analytics / Visualization Application Data Processing Data Management Big Data Systems File System Virtualization / Container OS / Scheduling Infrastructure Hardware 51 Thank you for your attention! § Questions? § In Moodle § Per email: [email protected] § In Q&A sessions 52
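As a closing illustration of the wide-column data model from the last part, here is a minimal sketch of cell addressing by row key, column family, column name, and timestamp. Names and layout are illustrative; systems such as HBase keep this data sorted by key, store multiple timestamped versions per cell, and partition it by key range:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of wide-column cell addressing: a cell is identified by
// row key : column family : column name, with one value per timestamp.
public class WideColumnSketch {
    // Sorted by composite key, as in an SSTable-backed store.
    private final TreeMap<String, TreeMap<Long, byte[]>> cells = new TreeMap<>();

    private static String cellKey(String row, String family, String column) {
        return row + ":" + family + ":" + column;
    }

    public void put(String row, String family, String column, long ts, byte[] value) {
        cells.computeIfAbsent(cellKey(row, family, column), k -> new TreeMap<>())
             .put(ts, value);
    }

    // Read the newest version of a cell.
    public byte[] get(String row, String family, String column) {
        Map<Long, byte[]> versions = cells.get(cellKey(row, family, column));
        return versions == null ? null : ((TreeMap<Long, byte[]>) versions).lastEntry().getValue();
    }

    public static void main(String[] args) {
        WideColumnSketch t = new WideColumnSketch();
        t.put("Customer/123", "Name", "Lastname", 1L, "Smith".getBytes());
        System.out.println(new String(t.get("Customer/123", "Name", "Lastname")));
    }
}
```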
