Summary

This document provides a summary of NoSQL databases, including their different types (key-value, document, column-oriented, and graph), concepts, and applications. It also describes the ACID properties and challenges in distributed systems.

Full Transcript

Motivation for NoSQL Databases

Big Data Challenges: Traditional relational (SQL) databases struggle with the volume, velocity, and variety of modern data. Big data refers to extremely large and complex data sets that traditional data-processing software cannot handle effectively.
– Volume: The sheer amount of data, ranging from terabytes to petabytes.
– Velocity: The speed at which data is generated and processed, including real-time data streams.
– Variety: The different types of data, including structured, unstructured, and semi-structured data. Unstructured data doesn’t fit neatly into the traditional row and column structure of relational databases.
Scalability Needs: Traditional SQL databases often require vertical scaling (adding more resources to a single server), which has limitations. NoSQL databases are designed for horizontal scaling (adding more servers), which allows for greater scalability.
Flexibility: NoSQL databases offer more flexible schemas compared to the rigid structures of relational databases. This allows developers to adapt to new requirements and handle unstructured data more easily.
Performance: NoSQL databases are optimized for specific use cases, offering better performance for certain types of operations than traditional SQL databases.

Main Types of NoSQL Databases

Key-Value Stores:
– Concept: These are the simplest NoSQL data stores, which use primary key access. They function like a hash table where each key has a corresponding value. The value is stored as a blob.
– Use Cases: Caching, session management, and storing simple data.
– Examples: Redis, Memcached, Amazon DynamoDB.
– Features: High performance and scalability, and an easy-to-use API with basic get, put, and delete operations.
Document Databases:
– Concept: Store data as documents, often in JSON format. Each document can have a different schema.
– Use Cases: Content management systems, real-time analytics, and applications that need flexible data structures.
– Examples: MongoDB, Azure Cosmos DB.
– Features: Flexible schemas, ability to handle both structured and unstructured data.
Column-Oriented Databases:
– Concept: Store data in columns rather than rows, which makes them efficient for read and write operations on large datasets.
– Use Cases: Data warehousing and big data applications.
– Examples: Apache Cassandra, HBase.
– Features: Efficient data compression and high performance for analytical queries.
Graph Databases:
– Concept: Use graph structures with nodes, edges, and properties to represent data.
– Use Cases: Social networks, recommendation engines, fraud detection, and knowledge graphs.
– Examples: Neo4j, Amazon Neptune.
– Features: Optimized for traversing relationships and discovering complex connections between data points.

CAP Theorem

The CAP theorem states that a distributed system can only guarantee two out of three properties: Consistency, Availability, and Partition Tolerance.
– Consistency (C): All nodes see the same data at the same time.
– Availability (A): The system remains operational and responsive, even if some nodes fail.
– Partition Tolerance (P): The system continues to function even if there are network partitions (communication failures) between nodes.
Trade-offs:
– CP Systems: Prioritize consistency and partition tolerance, sacrificing availability. Examples include distributed-lock systems (Chubby), Paxos protocol, BigTable, and MongoDB.
– AP Systems: Prioritize availability and partition tolerance, allowing for eventual consistency. Examples include Cassandra, Amazon Dynamo, CouchDB, and Riak.
– CA Systems: Prioritize availability and consistency, but are not partition tolerant. Examples include Oracle RAC, IBM DB2 Parallel, GFS and HDFS.
NoSQL databases often choose AP, prioritizing availability and partition tolerance over strict consistency, adopting eventual consistency.
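
The CP versus AP trade-off above can be made concrete with a toy model. The sketch below is only an illustration, not how any of the listed systems is implemented: the `Replica` and `Cluster` classes, the two-node setup, the single shared logical clock, and the last-write-wins merge in `heal` are all assumptions made for the example. In CP mode a write during a partition is rejected (availability is given up); in AP mode the write is accepted, the other replica serves stale data, and the replicas converge once the partition heals (eventual consistency).

```python
class Replica:
    """One node holding a single versioned value (hypothetical toy type)."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.version = 0  # logical timestamp of the last write applied


class Cluster:
    """Two replicas with a switchable network partition between them."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False
        self._clock = 0  # assumption: one shared logical clock, for simplicity

    def _peer(self, replica):
        return self.b if replica is self.a else self.a

    def write(self, replica, value):
        if self.partitioned and self.mode == "CP":
            # CP: refuse the write rather than let the replicas diverge.
            raise RuntimeError("write rejected: replicas cannot coordinate")
        self._clock += 1
        replica.value, replica.version = value, self._clock
        if not self.partitioned:
            # Healthy network: the write is copied to the peer immediately.
            peer = self._peer(replica)
            peer.value, peer.version = value, self._clock

    def read(self, replica):
        return replica.value

    def heal(self):
        """End the partition and converge on the newest write (last write wins)."""
        self.partitioned = False
        newest = max((self.a, self.b), key=lambda r: r.version)
        for r in (self.a, self.b):
            r.value, r.version = newest.value, newest.version


# AP mode: the write succeeds during the partition, replica b is stale
# until the partition heals.
ap = Cluster("AP")
ap.write(ap.a, "v1")
ap.partitioned = True
ap.write(ap.a, "v2")
stale = ap.read(ap.b)      # still the old value
ap.heal()
converged = ap.read(ap.b)  # now the new value
```

Real systems have more than two nodes and no shared clock; they typically rely on majorities, quorums, or vector clocks instead of the single counter used here.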

ACID Transactions

ACID properties ensure data integrity in databases:
– Atomicity (A): All operations in a transaction are treated as a single unit; either all succeed or all fail.
– Consistency (C): A transaction brings the database from one valid state to another valid state.
– Isolation (I): Concurrent transactions do not interfere with each other.
– Durability (D): Once a transaction is committed, the changes are permanent.
Challenges in Distributed Systems:
– Achieving ACID transactions across multiple nodes in a distributed system is complex and can lead to performance bottlenecks.
– NoSQL databases often relax ACID properties to improve availability and performance.
– Some NoSQL databases offer eventual consistency, where data may temporarily be inconsistent but will eventually converge to a consistent state.

Indexing

Purpose: Indexing improves query performance by creating efficient access paths to data.
Key-Value Stores: Typically use indexes on the primary key.
Document Databases: Support indexes on fields within documents.
– MongoDB uses B-tree data structures for indexing.
Column-Oriented Databases: Use indexes on partition keys and secondary indexes on non-primary key columns.
– Cassandra stores secondary indexes locally on each node.
Graph Databases: Support indexes on nodes, relationships, and properties.
– Neo4j uses B-trees for range indexes, inverted indexes for full-text search, and R-trees for spatial queries.
Types of Indexes:
– Single Field Index, Compound Index, Multikey Index, Text Index, Geospatial Index, and Hashed Index in MongoDB.
– Range indexes, text indexes, point indexes, and token lookup indexes in Neo4j.

Replication

Purpose: Replication ensures data redundancy and high availability by maintaining multiple copies of data across different servers.
Leader-Based Replication: One replica is designated as the leader (master), and all writes go through it. Followers (slaves) replicate data from the leader.
– Used in relational databases like PostgreSQL and MySQL, and in non-relational databases like MongoDB.
– Can be synchronous or asynchronous.
– Redis uses asynchronous replication in a master/replica model.
Multi-Leader Replication: Multiple nodes can accept writes, and changes are replicated to all other nodes. This avoids having a single point of failure.
Leaderless Replication: Any replica can accept writes. Clients send writes to multiple replicas, or a coordinator node manages writes. Data consistency is ensured through techniques like quorums.
– Used in Cassandra and Dynamo-inspired systems.
Chain Replication: Data is replicated across a chain of nodes: a write goes to the first node in the chain, and each node replicates it to the next node in the chain.
Replication Logs: Database writes are appended to a log and then copied to replicas. There are different types of logs, including statement-based replication, write-ahead logs, and logical (row-based) logs.
– MongoDB: uses an oplog for replication.
Consistency Levels: Different consistency levels offer trade-offs between consistency and performance.
– Cosmos DB: offers strong, bounded staleness, session, consistent prefix, and eventual consistency levels.
– Neo4j: offers causal consistency, eventual consistency and strong consistency.
– Dynamo: lets the application choose its consistency level using NWR notation, where N is the number of copies of each item, W is the number of copies that must acknowledge a write before the write completes, and R is the number of copies the application accesses when reading.

Partitioning

Purpose: Partitioning distributes data across multiple machines to handle large datasets and high-throughput operations.
Key-Range Partitioning: Data is divided into ranges based on the values of the keys.
Hash Partitioning: Data is distributed using a hash function applied to the keys. This scatters data more evenly, but range queries are not efficient.
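
The difference between key-range and hash partitioning can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the boundary list `["h", "p"]`, the sample keys, and the choice of SHA-256 as the hash function are all made up for the example and are not taken from any particular database.

```python
import bisect
import hashlib


def range_partition(key, boundaries):
    """Key-range partitioning: `boundaries` is a sorted list of split points.

    Partition 0 holds keys below boundaries[0], partition 1 holds keys from
    boundaries[0] up to (not including) boundaries[1], and so on.
    """
    return bisect.bisect_right(boundaries, key)


def hash_partition(key, num_partitions):
    """Hash partitioning: hash the key, then take it modulo the partition count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


keys = ["alice", "bob", "carol", "mallory", "zoe"]
boundaries = ["h", "p"]  # three partitions: [..h), [h..p), [p..]

# Neighbouring keys stay together, so a range scan touches few partitions.
by_range = {k: range_partition(k, boundaries) for k in keys}

# Neighbouring keys are scattered, so a range scan must ask every partition.
by_hash = {k: hash_partition(k, 3) for k in keys}
```

With range partitioning, a query such as "all keys from a to d" is answered by partition 0 alone, but a popular key range can overload a single partition. With hash partitioning the load spreads more evenly, at the cost of sending the same range query to all partitions.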
Consistent Hashing: Used to distribute data and minimize data movement when nodes are added or removed. It assigns a range of hash values to each node on a ring.
Sharding: A method of distributing data across multiple machines to handle large datasets and high-throughput operations; it is a form of horizontal scaling.
– MongoDB: shards at the collection level, using a shard key to distribute data.
– Neo4j: uses sharding to divide the graph into smaller parts and enables distributed graph processing using a Fabric architecture.
Dynamic Partitioning: Partitions are created or merged dynamically based on data size.
Partitioning in Cassandra: Uses key-based partitioning. Data is distributed using a consistent hashing technique based on the partition key.
Rebalancing: Partitions may be moved between nodes to ensure an even distribution of data and load.
Routing: Requests are routed to the correct partition.
– MongoDB uses mongos as a query router.
– Cassandra nodes share cluster state through a gossip protocol, so any node can act as coordinator and forward a request to the nodes that own the data.
– Redis Cluster uses a different form of sharding where every key is conceptually part of a hash slot.

Application of Concepts to NoSQL Database Types

Redis (Distributed Caching/Key-Value Store):
– Indexing: Uses primary key lookups only.
– Replication: Uses asynchronous master-slave replication.
– Partitioning: Uses Redis Cluster for horizontal scaling, which shards data across multiple nodes.
MongoDB (Document Database):
– Indexing: Uses B-tree indexes to improve query performance.
– Replication: Uses replica sets for data redundancy and high availability.
– Partitioning: Uses sharding to distribute data across multiple nodes. Supports ranged and hashed sharding.
Cassandra (Column-Oriented Database):
– Indexing: Uses primary and secondary indexes.
– Replication: Uses configurable replication strategies across multiple data centers.
– Partitioning: Uses key-based partitioning and consistent hashing to distribute data.
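
Consistent hashing, which the Cassandra entry above relies on, can be illustrated with a small hash ring. This is a sketch under assumptions, not Cassandra's actual implementation: the `HashRing` class, the use of SHA-256, the node names, and the choice of 64 virtual positions per node are all hypothetical. Each node is placed at several positions on the ring, and a key belongs to the first node position at or after the key's own hash, wrapping around at the end.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a position on the ring (a 64-bit integer)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Toy consistent-hash ring with several virtual positions per node."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def node_for(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring position at or after the key's hash, wrapping to the start.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}

# Only keys taken over by the new node change owner; the rest stay put.
moved = [k for k in keys if before[k] != after[k]]
```

Compare this with `hash(key) % number_of_nodes`, where changing the node count remaps most keys. On the ring, adding a node only moves the keys that fall into the ranges the new node takes over.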
Neo4j (Graph Database):
– Indexing: Uses B-trees, inverted indexes, and R-trees.
– Replication: Uses the Raft consensus algorithm for replication and offers causal, eventual and strong consistency levels.
– Partitioning: Uses sharding with a Fabric architecture, which makes it possible to create a virtual database that spans multiple physical databases (shards).
Cosmos DB (Document Database):
– Indexing: Automatically indexes all properties in documents.
– Replication: Automatic and synchronous multi-region replication with manual and automatic failover.
– Partitioning: Uses logical and physical partitions, offering elastic scale-out.
