Key-Value Stores Quiz
Questions and Answers

Which of the following is NOT a characteristic of Key-Value Stores?

  • Horizontal scalability
  • Full SQL support (correct)
  • Flexibility in data definitions
  • High speed for single-record operations
Key-Value Stores lack a static data schema.

True

What is the main operation used to retrieve a value from a Key-Value Store?

get(key)

In a Key-Value Store, the __________ operation is used to remove an entry from the store.

delete

Match the following Key-Value Store characteristics with their descriptions:

• Horizontal Scalability = Adapts to user growth and traffic patterns
• High Speed = Fast single-record read and write operations
• Flexibility = Allows changes to data definitions
• Reliability = Provides failure recovery using commodity hardware

Which part of the HBase client configuration is critical for connecting to the database?

Zookeeper quorum

In HBase, the columns defined in a table must be established at compile-time.

False

What does the 'Put' operation do in HBase?

It adds or updates a record in a table.

In a key-value store, the key is unique and is used to retrieve the corresponding __________.

value

Match the key-value stores with their key-value structure:

• Amazon Dynamo = Key: UserID
• Apache HBase = Key: MyBaseTable
• Redis = Key: sessionID
• Bigtable = Key: rowID

Which of the following is not a common feature of KV-Stores?

ACID transactional guarantees

All KV-Stores check for integrity constraints to ensure data validity.

False

What is the primary purpose of a write-ahead log (WAL) in KV-Stores?

To keep a commit log as ground truth.

A ________ index contains one entry for every record in the sequential file.

dense

Which of the following data model storage methods is not common in KV-Stores?

Document-based storage

Match the following indexing types with their descriptions:

• Dense Index = One entry for every record
• Sparse Index = One entry for every block
• B-Trees = Organized in a balanced tree structure
• Write-Ahead Log = Records changes before data updates

Name one key property of B-Trees.

Balanced structure where each path from root to leaf has the same length.

Versioning in KV-Stores allows the storage of multiple copies of the same data.

True

    Study Notes

    Big Data Systems - Key Value Stores I

    • Big data systems use key-value stores for storing and retrieving data
    • Key-value stores are containers for storing key-value pairs
    • NoSQL semantics (non-relational) are used in key-value stores
    • Key-value stores provide increased scalability, speed, availability, and flexibility

    Timeline I

    • A schedule of topics taught in the course
    • Includes dates, topics covered on Tuesdays and Wednesdays

    This Lecture

    • KV Overview
    • B-Trees
    • Sources of information for the lecture: Hans-Arno Jacobsen (TUM) - Distributed Systems, Johannes Zschache (University of Leipzig) - NoSQL-Datenbanken, Volker Markl (TUB) - Database Technology

    Where are we?

    • Data Management
    • Operational storage (small writes and reads)
    • Visualisation, Analytics and Query Language

    Serving Requests

    • User access to inverted index.
    • Input: search term.
    • Output: relevant URLs.
    • Requirements: Many search terms, many URLs, frequent updates, and interactive speed (<1s).

    More on Interaction

    • Today's internet applications have huge amounts of stored data (1PB=10^15 bytes).
    • There are a vast number of Internet users (e.g., 3.4 billion).
    • Applications require frequent updates, fast data retrieval, and rapidly changing data definitions.

    Enter Key-Value Stores

    • Key-value stores are scalable containers for key-value pairs
    • They use NoSQL semantics (non-relational)
    • Simplified semantics (and syntax) in exchange for increased scalability, speed, availability, and flexibility
    • Key-value store operations: Write/update, Read, Delete.

    This Lecture

    • Reminders of two main types of queries:
    • OLTP: Read and Write access to data, few rows/operation, high qps (queries per second).
    • OLAP: Mostly read access to data, many rows/operation, low qps.
    • Distinctions are also relevant at web scale.
    • Discusses processing OLTP-like queries at web scale.

    Web Scale

    • More information on web scale data store vs. relational systems.
    • "MonetDB is Web Scale" - Youtube Link included.

    DBMS vs KV-Store

    • Comparison between DBMS (SQL) and KV-Store
    • Relational data schema vs. No (static) data schema.
    • Data types, Foreign keys and Full SQL support in DBMS.
    • Raw byte access, no relations and single-row operations in KV-Store.

    KV-Stores - Requirements

    • Horizontal scalability, considering user growth, traffic patterns, and request/data size.
    • High speed for single record read and write operations.
    • Data flexibility and adapting to changing data definitions
    • Reliability through commodity hardware, failure recovery, availability, geo-distribution, global user base, and fast access.

    KV-Store Client Interface - Overview

    • Main operations: Write/update (put(K,V)), Read (get(K)), Delete (delete(K))
    • No aggregation, table joins or transactions in KV-store.
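
As a minimal sketch of this interface (an in-memory map stands in for the remote store; nothing here is a specific product's API):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the KV-store client interface described above:
// put/get/delete only -- no joins, aggregation, or transactions.
public class SimpleKVStore {
    private final Map<String, byte[]> store = new HashMap<>();

    public void put(String key, byte[] value) { store.put(key, value); } // write/update
    public byte[] get(String key)             { return store.get(key); } // read (null if absent)
    public void delete(String key)            { store.remove(key); }     // delete
}
```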

    KV-Store Client Interface - Example HBase

    • Initialization using ZooKeeper
    • Column Family and Column are defined at runtime in HBase
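
A sketch of such an HBase client in Java using the standard `HBaseConfiguration`/`ConnectionFactory` API; the quorum hosts, table name, row key, and column names are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // The ZooKeeper quorum is the critical connection setting.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("MyBaseTable"))) {

            // Put: adds or updates a record. Only the column family must exist
            // in advance; the column qualifier is chosen at runtime.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
            table.put(put);

            // Get: read the record back by key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```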

    KV-Stores in Practice

    • Bigtable, Apache HBase, Apache Cassandra, Redis, Amazon Dynamo, Yahoo! PNUTS, LevelDB, RocksDB.
• Detailed examples of key-value pairs used at Amazon, Facebook, and Google.

    Common Properties of KV-Stores

    • Memory store and write-ahead log (WAL).
    • Keeping data in memory for fast access and commit log.
    • Versioning (different versions of data).
    • Timestamping (tracking data changes).
    • Replication (multiple copies of data).
    • Failure detection and failure recovery.

    Common Non-Features of KV-Stores

    • Integrity constraints.
    • Transactional guarantees (ACID).
    • Powerful query languages.
    • Materialized views.

    Simple Index

• Prerequisite: a data file sorted on the primary key of a relation.
• The index file stores (key, pointer) pairs.
• A key value K is associated with a pointer to the record holding that key in the sorted file.
• Index types: Dense (one entry per record in the sequential file) and Sparse (one entry per block in the sequential file).
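
As a small illustration of the sparse variant (a sketch only, with blocks modelled as sorted integer arrays): a lookup finds the greatest index key less than or equal to K, then scans that block.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of a sparse index: one entry per block, mapping the block's
// first key to the block itself (here a sorted array of keys).
public class SparseIndex {
    private final TreeMap<Integer, int[]> index = new TreeMap<>();

    public void addBlock(int[] sortedBlock) {
        index.put(sortedBlock[0], sortedBlock); // index the block by its first key
    }

    public boolean contains(int key) {
        // Find the block whose first key is the greatest one <= key ...
        Map.Entry<Integer, int[]> e = index.floorEntry(key);
        if (e == null) return false;
        // ... then scan that block sequentially.
        for (int k : e.getValue()) if (k == key) return true;
        return false;
    }
}
```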

    B-Trees

• A single-level index gives simple access acceleration.
• In practice, multi-level B-trees (typically B+-trees) are used; the number of levels grows with the data.
• Blocks are at least 50% full to ensure efficient space usage.
• No overflow blocks are needed, due to the structure of the tree.

    Structure (B-tree)

• Index blocks are organized in a tree structure.
• Balanced: all paths from root to leaf have the same length.
• Parameter n: every block holds between n and 2n search keys and between n+1 and 2n+1 pointers. This differs from a normal tree structure.
• Choice of n: n should be as large as possible with respect to the block size.
• Alternative definition: specifies conditions for the minimum and maximum number of keys and pointers in each block (including leaves).

    Definition of B+-Trees

    • Keys in leaves are keys from the data and sorted from left to right.
    • Root has at least one pointer.
    • All pointers point to subsequent B-tree blocks on the next level below.
    • Leaves have a pointer to the next leaf node on the right.
    • Pointers in inner nodes point to B-tree blocks of the level below.

    Example Configuration (B-tree)

    • Example configuration of a B+-tree demonstrating storage of data in blocks. Shows inner nodes and leaf nodes structure.

    Alternative Definition (B-tree)

• Parameter n defines conditions for the minimum and maximum numbers of keys and pointers in each block (including leaves).

    Examples of Leaves (B-tree)

    • Examples of full and partially filled leaf blocks in the B-tree structure

    Examples of Inner Nodes (B-tree)

    • Examples of full inner nodes and partially filled inner node blocks in the B-tree structure.

    Example B+-Tree

    • Example of a B+ tree with parameter n=2 showing the structure and organization of data in the tree.

    Applications of B+-Trees

    • B-trees can assume different index roles.
    • Search key can be a primary key, order of data file is not important.
    • Dense index: when order of data file is not important
    • Sparse Index: when data file must be sorted by primary key.
    • Search key is not necessary primary key
    • Duplicate keys in leaves have to be handled
    • Basic search operations in B-Trees.

    Searching in B-Trees

• Search operations in a B-tree.
• Basic procedure: compare the search key against the keys in a node and recursively descend to the leaf nodes.
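
A sketch of this recursive search for a B+-tree, under the simplifying assumption of fixed arrays for keys and children:

```java
// Sketch of recursive search in a B+-tree. Inner nodes hold keys[0..m-1]
// and children[0..m]; keys in children[i] are < keys[i], keys in
// children[i+1] are >= keys[i]. Leaves hold (key, record-pointer) pairs.
class BPlusNode {
    boolean isLeaf;
    int[] keys;            // sorted search keys in this block
    BPlusNode[] children;  // children.length == keys.length + 1 (inner nodes only)
    Object[] records;      // record pointers (leaves only)
}

class BPlusTree {
    BPlusNode root;

    Object search(int key) { return search(root, key); }

    private Object search(BPlusNode node, int key) {
        if (node.isLeaf) {
            for (int i = 0; i < node.keys.length; i++)
                if (node.keys[i] == key) return node.records[i];
            return null; // key not in the tree
        }
        // Branch: follow the first child whose separator key exceeds the search key.
        int i = 0;
        while (i < node.keys.length && key >= node.keys[i]) i++;
        return search(node.children[i], key);
    }
}
```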

    Example Searching in B-Trees

    • Example of searching for a value K=60 in a sample B-tree.
    • Demonstrates logical branching based on key match and comparison.

    Example Searching in B-Trees (K=53)

    • Example of searching for a value K=53 in a sample B-tree, shows how values are found in the tree.

    Range Queries

    • Queries with inequalities in a WHERE clause.
    • Searches within a range [a,b]

    Example Searching in B-Trees (10 ≤ K ≤ 28)

• Demonstrates a range query in an example B-tree.

    B-Tree Insertion

• Recursive algorithm for insertion.
• Search for the corresponding leaf and insert the key-pointer pair.
• Overflow occurs if there is no room in the leaf node.
• Split the leaf node, distribute the keys, and insert a new key-pointer pair into the parent node.
• Splits ascend recursively; if the root itself overflows, a new root is created.
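
A minimal sketch of the leaf-split step under these rules (illustrative only; the recursive ascent is just noted in a comment):

```java
// Sketch of the leaf-split step during B+-tree insertion: when a leaf
// overflows, split it into two leaves and push the first key of the
// right half up into the parent as a new separator.
class LeafSplit {
    static int[][] splitLeaf(int[] overfullKeys) {
        int mid = overfullKeys.length / 2; // distribute keys evenly
        int[] left  = java.util.Arrays.copyOfRange(overfullKeys, 0, mid);
        int[] right = java.util.Arrays.copyOfRange(overfullKeys, mid, overfullKeys.length);
        // right[0] becomes the separator inserted into the parent; if the
        // parent overflows too, the split ascends recursively, possibly
        // creating a new root.
        return new int[][] { left, right };
    }
}
```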

    Example of B-Tree Insertion (K=60 and K=61)

• Examples of inserting K=60 and K=61 into a B-tree, with step-by-step demonstrations of both operations.

    Insertion Cost (B-tree)

• Cost of an insertion: searches and block writes during the insertion process, including the cases where a split is needed, the split ascends to the root, or a new root is created.

    Deleting from a B-Tree

• Find the corresponding node.
• Delete the key.
• If at least the minimal number of keys remains in the node, nothing else needs to be done.
• If too few keys remain, merge with a sibling node or steal a key from a sibling.
• Adjust the parent node's keys; deletions can propagate upwards.

    Example of B-Tree Deletion (K=12, K=15, K=5)

• Examples of deleting K=12, K=15, and K=5 from a B-tree, demonstrating how deletion can lead to merges that propagate upwards.

    Deletion Cost (B-tree)

    • Cost of deletion operations

    Deleting from B-Trees – Variation

• Assumption: a B-tree data set usually grows, so underflow is tolerated; values are deleted by placing a tombstone in the tree.

    B-Tree Variations

    • B*-tree

    B-Trees for Non-Primary Keys

    • Explanation of how the meaning of pointers in inner nodes changes for non-primary keys

    B-Tree Variations: B*-Tree

• Overflow during insertion: distribute keys across all leaves.
• If distribution is not possible, create 3 new leaves from 2 leaves.
• In general, generate m+1 nodes from m nodes to improve memory utilization.
• An alternative to the plain B-tree that guarantees at least 66% utilization.

    Next Part (Key Value Stores and Distributed Architecture)

    High Insert & Update in B-Tree

    • Insertion cost in B-trees is proportional to log(N).
    • Several disk accesses are made per write due to data retrieval.

    Log Structured Merge - Tree

• Split index (in memory and on disk).
• C0 resides in memory (recent data); C1 resides on disk (older data).
• Both C0 and C1 cover the full data domain.
• O'Neil: C0 = AVL-tree, C1 = B-tree (to allow for compaction and merging operations).

    LSM Tree – Compaction / Rolling Merge

• In-place updates in memory are fast.
• Updates on disk are fast when done in bulk for large blocks.
• Multiple merging levels amortize costs better for large datasets.

    LSM-Tree – Other Operations

• Search: start with C0, then go to C1 if needed.
• Delete: insert a tombstone.
• The actual delete happens during a merge.
• Nice side effect: multiple versions of the data, usable for recovery.

    LSM Tree -> Log Structured File

• Faster inserts (O(1) per insert).
• Writes go into the Memtable (plus a log file).
• The Memtable is flushed into an immutable Level-1 SSTable.
• Sequential I/O.
• SSTables are periodically compacted and merged into the next level.
• Lookups are expensive for non-existent keys -> Bloom filter.
• O(log(n)) read performance when using a Bloom filter.
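
A toy sketch of this write/read path (Memtable as a sorted map, flushed runs standing in for on-disk SSTables; the WAL, tombstones, and compaction are omitted for brevity):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch of the LSM write/read path: writes go into a sorted in-memory
// Memtable; when full, it is flushed to an immutable sorted run. Reads
// check the Memtable first, then the runs from newest to oldest.
public class TinyLSM {
    private static final int MEMTABLE_LIMIT = 4; // tiny, for illustration
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final List<TreeMap<String, String>> sstables = new ArrayList<>();

    public void put(String key, String value) {
        memtable.put(key, value);           // fast in-memory insert
        if (memtable.size() >= MEMTABLE_LIMIT) {
            sstables.add(0, memtable);      // flush: sequential I/O on disk
            memtable = new TreeMap<>();
        }
    }

    public String get(String key) {
        String v = memtable.get(key);       // newest data first
        if (v != null) return v;
        for (TreeMap<String, String> run : sstables) { // then runs, newest to oldest
            v = run.get(key);
            if (v != null) return v;
        }
        return null; // the expensive path a Bloom filter avoids
    }
}
```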

    Disk Layout – Immutable Data

    • Optimization of writes by inserting instead of overwriting.
    • Memtable (main memory buffer).
    • Storing sorted data files (immutable data).
    • Deletion using Tombstones (marking deleted data to allow compaction).

    Compaction - Tiering vs Leveling

• Comparison of compaction methods.
• Leveling: one run per level; incoming runs are sort-merged into it.
• Tiering: multiple runs per level; they are sort-merged into one run on the next level.
• Hybrid/Partial: fixed-size runs, with an increasing number per level.

    Read (LSM-Tree)

• Reading requires combining several files.
• Read until the first version of the key, or a tombstone, is found.
• Expensive if the key does not exist.

    Bloom Filter

• Fast determination of whether a key might exist in storage.
• No I/O is required if the key is definitely not present.
• A bit array over the inserted keys (set bits indicate possible presence of a key).
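
A minimal Bloom filter sketch; the size, k, and double-hashing scheme are illustrative choices, not a specific library's implementation:

```java
import java.util.BitSet;

// Sketch of a Bloom filter: k hash functions set/test bits in a bit array.
// A negative answer is definite (no I/O needed); a positive answer may be
// a false positive. Double hashing derives the k functions from two hashes.
public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int k;

    public BloomFilter(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    private int bit(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9; // second, independent-ish hash
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String key) {
        for (int i = 0; i < k; i++) bits.set(bit(key, i));
    }

    public boolean mightContain(String key) {
        for (int i = 0; i < k; i++)
            if (!bits.get(bit(key, i))) return false; // definitely absent
        return true;                                  // possibly present
    }
}
```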

    File Format (LSM-Tree Format)

• Variable block size.
• Blocks are indexed.
• The trailer contains metadata (e.g., the start of the index).
• Each entry's type marks it as a value or a tombstone.

    Block Index - Hash Table

• An in-memory hash map (dictionary) from key to block.
• Byte offsets into the file are used as values.

    Sorted String Table (SSTable)

• Like a hash table, but with sorted keys.
• Advantages: faster compaction, range queries, and reduced disk I/O.
• A sparse index in memory points into the sorted segment file (SSTable) on disk.

    Distributed Architecture – Motivation

    • Scalability (Elasticity) to accommodate growing data volumes, processing power, and access frequency.
    • Availability (Fault Tolerance): preserving data accessibility and availability even in case of failures/crashes.
    • Latency reduction in distributed copies using local data access.

    Replication and Partitioning

• Partitioning: dividing the data across different nodes (load balancing & latency reduction).
• Replication: storing multiple copies of data items (redundancy improves scalability, availability, and latency, but creates a consistency challenge).

    Partitioning (Key-Value Data)

    • Partitioning of key-value data, partitioning by key range or hash of key.
    • Partitioning Algorithm: assign each data item to a partition to balance load among partitions.

    Partition Design: Prevention of Hotspots

    • Hash functions or randomizers are used to assign keys to specific partitions to avoid hotspots.

    Partitioning – Hash vs Range

• Range partitioning: splitting data by key ranges; good for scans, but generally harder to balance.
• Hash partitioning: splitting data by ranges of the key's hash; gives a random data distribution but no locality.
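
A sketch contrasting the two schemes; the split points and modulo hashing are illustrative stand-ins:

```java
// Sketch of hash vs. range partitioning for string keys.
public class PartitionSketch {
    // Hash partitioning: random-looking spread, no locality --
    // neighbouring keys can land on any partition.
    static int hashPartition(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Range partitioning: splitPoints are the sorted upper bounds of each
    // partition's key range; adjacent keys stay together, which helps
    // scans but can create hotspots.
    static int rangePartition(String key, String[] splitPoints) {
        for (int i = 0; i < splitPoints.length; i++)
            if (key.compareTo(splitPoints[i]) < 0) return i;
        return splitPoints.length; // last partition holds everything above the top split
    }

    public static void main(String[] args) {
        String[] splits = { "g", "p" };                       // 3 ranges: [..g), [g..p), [p..)
        System.out.println(hashPartition("alice", 3));        // some partition in 0..2
        System.out.println(rangePartition("hbase", splits));  // -> 1
    }
}
```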

    Distributed Hash Tables (DHT)

    • Distributed key-value store.
    • Hash table provides put() and get() operations.
    • Partitioning is data-centered and uses routing by value to avoid global knowledge.

    DHT: Design Goals

    • Load balancing
    • Decentralization
    • Availability and robustness
    • Flexible naming scheme

    Consistent Hashing

• Traditional hash functions may require remapping all tuples when the number of nodes changes.
• A consistent hash function reduces remapping when nodes are added or removed: on average, only K/n keys are remapped (K keys, n nodes).

    DHT – High-Level Architecture

• A hash function maps both node and key identifiers.
• An identifier circle (hash ring) maps hash values to positions.
• Keys are assigned to the first node with an equal or successor identifier.
• Virtual nodes (several per server) are a mechanism to improve load balancing.
• Gossiping is used for node and failure detection.
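
A sketch of such a hash ring with virtual nodes, using a sorted map for the circle; `String.hashCode` stands in for a proper hash function:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of a consistent-hashing ring with virtual nodes. Each physical
// node is hashed onto the ring several times; a key is assigned to the
// first node at or after its position (wrapping around). Adding or
// removing a node remaps only ~K/n keys on average.
public class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public HashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

    public void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++)
            ring.put(hash(node + "#" + i), node);
    }

    public void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++)
            ring.remove(hash(node + "#" + i));
    }

    public String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() // wrap around the circle
                              : tail.get(tail.firstKey());
    }

    private int hash(String s) { return s.hashCode() & 0x7fffffff; } // stand-in hash
}
```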

    DHT with Consistent Hashing

• Random seed and key-range assignment.
• Values are hashed to key ranges; changes stay local.
• One key range is split when a new node is added.

    DHT Replication

• Replication for high availability.
• The replication factor is usually 3.
• The original node becomes the coordinator.
• The replica list acts as a preference list.

    Summary (LSM-Tree, Distributed Architecture, Partitioning, Consistent Hashing)

    • Summarizes the contents based on what was covered in the lectures.

    Next Part (Key Value Stores III, Distributed Consistency)

    • Indicates the next part will be focused on Key-Value Stores, distributed consistency.

    Announcements

    • Quiz deadline and lecture schedule change notices.

    Timeline I (Revised)

    • Includes the schedule for topics on Tuesdays and Wednesdays of the course.

    This Lecture (Recaps, Distributed Consistency, 2PC, 3PC, Data Model)

    • Recaps previous topics (distributed consistency, 2PC, 3PC, Data Model, examples)

    First-in-First-Out Scheduling

    • Description of FIFO Scheduling
• Shows the average completion and turn-around times for an example workload.

    Recaps – B+-trees and Duplicate Keys

• A B+-tree is typically used for unique keys.
• The meaning of the pointers changes when duplicate keys appear in the inner nodes.
• Keys with the same value can span multiple blocks in a B+-tree.

    Recaps – LSM Storage Design

    • Explains different LSM design considerations in terms of storage design including leveling, tiering, and various hybrid models.

    Recaps - Consistent Hashing

• Consistent-hashing functions, their mapping of keys to nodes, and the behavior in case of node failures; demonstrated with the example hash function hash() MOD n.

    Where are we?

    • Overview of data management tools.
    • Focus is on operational storage
    • Small writes and reads are emphasized

    Consistency

    • Refresher of ACID semantics and transaction processing protocols.
    • How distributed commits are handled.

    Distributed Commit Protocols

    • Two-phase commit (2PC).
    • Three-phase commit (3PC).
    • Paxos.

    Two-Phase Commit (2PC) - Overview

    • Two-phase commit mechanism in distributed database environments.
    • Roles of coordinator and worker nodes.

    Two-Phase Commit (2PC) - Commit Request Phase

• The commit-request phase of 2PC: how messages are exchanged between the coordinator and the workers to reach a decision on the transaction.

    Two-Phase Commit (2PC) - Commit Phase (Success)

• Sequence of operations when all workers successfully execute the prepare phase, followed by the commit phase.

    Two-Phase Commit (2PC) - Commit Phase (Abort)

• If at least one worker fails to execute the prepare phase properly, the coordinator initiates a rollback.
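
A simplified, single-process sketch of the coordinator's decision logic (real implementations exchange network messages and log each decision to stable storage before sending it):

```java
import java.util.List;

// Sketch of the 2PC coordinator. Workers are modelled by an interface.
interface Worker {
    boolean prepare(); // commit-request phase: vote yes/no
    void commit();
    void abort();
}

class TwoPhaseCommitCoordinator {
    boolean execute(List<Worker> workers) {
        // Phase 1: ask every worker to prepare; any "no" vote forces abort.
        boolean allPrepared = true;
        for (Worker w : workers)
            if (!w.prepare()) { allPrepared = false; break; }

        // Phase 2: broadcast the decision.
        if (allPrepared) {
            for (Worker w : workers) w.commit();
        } else {
            for (Worker w : workers) w.abort(); // rollback on any failure
        }
        return allPrepared;
        // Note: if the coordinator crashes between the phases, workers that
        // voted yes are blocked -- the weakness that 3PC addresses.
    }
}
```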

    Two-Phase Commit (2PC) - Problems

• 2PC is a blocking protocol.
• Blocking or deadlock is possible due to coordinator or worker failures, or network issues.

    Three-Phase Commit (3PC) - Overview

    • Non-blocking distributed commit protocol.
    • Allows for node failures, does not block in network partition scenarios.
    • Assumes one coordinator, n workers and messages exchanged between them.

    Three-Phase Commit (3PC) – Advantages & Disadvantages

    • Addresses the weaknesses of 2PC
    • Network partition handling.
    • Higher network overhead due to more messaging.

    Paxos – Overview

    • Distributed consensus protocol, suitable for large clusters.
    • Ensures consistency despite network failures.
    • Three main roles: proposer, acceptor, and learner.

    Paxos vs. 2PC / 3PC

    • Comparison of Paxos with both 2PC and 3PC in terms of handling network partitions.

    CAP Theorem - Overview

• Summarizes the CAP theorem: a distributed database system can guarantee at most two of the three properties (consistency, availability, partition tolerance) at a time.

    CAP Theorem – Implications I

• Implications of the trade-offs between consistency models, availability, and partition tolerance under different system constraints.

    CAP Theorem – Implications II

    • Implications of dropping consistency and availability for a given system, in relation to partition tolerance.

    A Weaker Consistency Model

• At web scale, consistency is often less important than availability, but some form of consistency is still needed in data systems.

    Consistency Models - Overview

• Various consistency models in distributed storage systems, and how they handle reads and writes to achieve different consistency guarantees.

    Variations of Eventual Consistency

• Different kinds of consistency models for distributed database systems.
• Explains various forms of eventual consistency, including causal, session, and monotonic consistency.

    ACID vs. BASE

    • Comparison of ACID (Atomicity, Consistency, Isolation, Durability) and BASE (Basically Available, Soft State, Eventually Consistent) consistency models
    • Describes how those models approach consistency handling in a distributed system in relation to prioritizing consistency and availability in a given use-case.

    Data Model Design Principles

    • Principles to follow in the design of the data models

    KV-Store++

    • Wide column store with special key-value support, features, and requirements.
• Also known as tabular data stores.

    Data Model (1)

    • Key components (namespaces, column families, example) within a data model designed to describe the overall structure and design features.

    Data Model (2)

    • Describes the key-value pair format, and structure format.

    Example: Web-Table (1 & 2)

• Examples of using key-value pairs to store web pages, links, and content; demonstrates how data can be structured in a KV storage system with multiple columns and versions (see the sketch below).
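
A toy sketch of such a web-table, loosely modelled on the classic example (row key = reversed URL so pages of one domain sort together; all table, family, and qualifier names are illustrative):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the web-table data model: each cell is addressed by
// (row key, column family:qualifier, timestamp), and values are versioned.
public class WebTableSketch {
    // row key -> column ("family:qualifier") -> timestamp -> value
    static Map<String, Map<String, TreeMap<Long, String>>> table = new TreeMap<>();

    public static void main(String[] args) {
        put("com.example.www", "contents:", 1L, "<html>v1</html>");
        put("com.example.www", "contents:", 2L, "<html>v2</html>"); // newer version
        put("com.example.www", "anchor:news.site", 2L, "Example Home");

        // Read the latest version of a cell.
        TreeMap<Long, String> versions = table.get("com.example.www").get("contents:");
        System.out.println(versions.lastEntry().getValue()); // -> <html>v2</html>
    }

    static void put(String row, String column, long ts, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>())
             .put(ts, value);
    }
}
```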

    Design Principles

    • Key elements for modeling data elements

    Design: Denormalization (1 & 2)

    • Denormalization design strategies for database design and how to structure related data in systems.

    Summary

    • Summary of previously discussed topics on distributed consistency and data models.

    Next Part

    • Indication of the next part in the lecture.

    Thank you for your attention!

    • Thank-you message for attention.
    • Questions/In-Moodle/Email/Q&A session details.


    Description

    Test your knowledge on key-value stores with this quiz! Explore their characteristics, operations, and specific implementations like HBase. Understand the common features and unique aspects of KV-Stores through various questions designed to challenge your understanding.
