CS22512 NoSQL Databases

CS22512 NOSQL DATABASES

Unit 1: NoSQL DATABASES

Syllabus: NoSQL Databases - Evolution of NoSQL Databases - Different types of NoSQL databases - Advantages of NoSQL databases, Scalability and performance. Document data stores, Key-Value data stores. Case studies of MongoDB, HBase, Neo4J. NoSQL database design for applications.

Introduction to NoSQL

NoSQL, short for "Not only SQL", refers to databases that provide a mechanism for storing and retrieving data and that are often described as the next generation of databases. NoSQL systems such as MongoDB have a distributed architecture, and most NoSQL databases are open source. They scale horizontally, which means that commodity machines can be added to a cluster to increase its capacity. They are schema-free, so there is no need to design tables before loading data into them. NoSQL provides easy replication with very little manual intervention; once replication is set up, the system takes care of failover automatically. The crucial point about NoSQL is that it can handle huge amounts of data, achieves performance by adding more machines to the cluster, and can be implemented on commodity hardware. There are close to 150 NoSQL databases on the market, which makes it difficult to choose the right one for your system.

What are NoSQL Databases?

The term NoSQL initially meant "non-SQL" or "non-relational" and now stands for "not only SQL". NoSQL databases encompass various database architectures and data models designed for handling large volumes of unstructured and semi-structured data. These databases offer flexibility, scalability, and high performance, with types including document stores, key-value stores, column-family stores, and graph databases.

Brief History of NoSQL Databases

The term "NoSQL" was first coined in 1998 by Carlo Strozzi, who created the Strozzi NoSQL database, which was a relational database without an SQL interface.
The modern NoSQL movement gained momentum in the early 2000s as the need for more scalable and flexible databases grew, driven by the rise of big data and web-scale applications. In 2009 the term was reintroduced to describe the open-source, distributed, non-relational databases that were emerging at the time, marking a true departure from the relational database model.

Evolution of NoSQL Databases

The first database revolution occurred in the late 1960s and early 1970s, when the relational model was introduced. A later revolution followed in the late 1990s and early 2000s, when NoSQL databases began to gain popularity. There has been a lot of talk lately about the "NoSQL" database revolution. The term describes the new wave of open-source, distributed, scalable databases built to handle the big data needs of today's web-scale applications.

NoSQL databases depart from the traditional relational database model in several ways. They are designed to work with large data sets and to provide high availability and fault tolerance, so they can continue operating even if one or more nodes fail or other hardware or software failures occur. They are horizontally scalable, meaning they can scale out by adding more nodes to the system. They generally use a simpler data model than relational databases, often key-value pairs, document-oriented storage, or a graph model, which makes them easier to design and implement and more flexible in how they store data. They are often used in cloud computing environments, where resources are dynamic and can be scaled up or down as needed. They are typically designed to be simple and easy to maintain. They are often designed for easy integration with other systems.
This makes them a good choice in polyglot architectures, where different system components are written in different languages or use different database technologies. They are a powerful tool for handling big data and are becoming increasingly popular as the need for web-scale applications continues to grow. They are designed to run on commodity hardware, using simple replicas to spread the load and data across multiple servers. This horizontal scaling allows them to handle large data sets and high traffic loads.

A database revolution, in the sense used here, is a process by which a database is created or updated to take advantage of new technologies or to improve performance. Adopting a NoSQL database involves four main steps:
1. Identify the needs of the application.
2. Choose the right database technology.
3. Implement the database.
4. Monitor the performance of the database.

Database Revolution: The steps involved in moving to a new database are as follows:
1. Assess the current state of the database.
2. Determine the goals of the update or revolution.
3. Design the new database.
4. Implement the new database.
5. Test the new database.
6. Go live with the new database.

Generations:

1. Relational Database: The first database revolution was the relational database in the 1970s, which organized data into tables with rows and columns. Relational databases are based on the mathematical concept of set theory and use Structured Query Language (SQL) for accessing and manipulating data. Tables are similar to folders in a file system, in that each table contains a collection of information. A NoSQL database is a type of database that does not use the traditional table structure.
Many NoSQL databases are made up of documents, with each document representing a record. A relational database, by contrast, stores data in the form of tables and relations between those tables, and can be queried using SQL.
Advantages: Data is easy to organize. Querying is straightforward. Constraints can be used to enforce data integrity.

2. Object-Oriented Database: The second database revolution was the object-oriented database in the 1980s, which organized data into objects with attributes and methods. Object-oriented databases are based on the concepts of object-oriented programming and use an object-oriented query language (OQL) for accessing and manipulating data. Objects are similar to files in a file system, in that each object contains a collection of information.
Advantages: More flexible than relational databases. It can represent more complicated relationships between data. It can be easier to work with from object-oriented programming languages.

3. XML Database: The third database revolution was the XML database in the 1990s, which organized data into XML documents. XML documents are similar to files in a file system, in that each XML document contains a collection of information.
Advantages: It can be used to store semi-structured data. It can be queried using XPath. It can be easily integrated with web applications.

4. NoSQL Database: The fourth database revolution is the NoSQL database in the 2000s, which organizes data into key-value pairs, documents, columns, and graphs. A NoSQL database does not store data in relational tables, objects, or XML documents, and it can store data in a variety of formats.
Advantages: It can be more scalable than relational databases. It can be more suitable for working with large amounts of data. It can be more flexible in terms of schema.
5. Key-Value Pair Database: The key-value pair database is the simplest NoSQL database and is often used for storing simple data such as configuration settings. It stores data as key-value pairs, where the key identifies the value stored in the database.
Advantages: It can be more scalable than relational databases. It can be easier to work with than other NoSQL databases.

6. Document Database: The document database is more complex and is used for storing semi-structured or unstructured data. It stores data in documents. Documents are similar to files in a file system, in that each document contains a collection of information.
Advantages: It can be more flexible than relational databases. It can be easier to work with than XML databases.

7. Column Database: The column database stores data organized into columns and suits data such as financial records. Columns are similar to fields in a database table, in that each column contains a collection of values.
Advantages: It can be more scalable than relational databases. It can be more efficient for working with large amounts of data.

8. Graph Database: The graph database is the most complex NoSQL database and is used for storing data organized into relationships. It stores data in a graph, which is a collection of nodes and edges, where each node represents an entity and each edge represents a relationship between two entities.
Advantages: It can be more flexible than relational databases. It can be better for representing certain types of relationships between data.
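The four NoSQL data models listed above can be contrasted with a small sketch that holds toy data in plain Python structures. This is only an illustration: no real database is involved, and every name and value below is a hypothetical choice made for the example.

```python
# Toy illustration only: plain Python structures standing in for the four
# NoSQL data models. All names and values here are hypothetical.

# Key-value: an opaque value looked up by a unique key.
key_value = {"user:1": "Asha", "user:2": "Ravi"}

# Document: a self-describing, possibly nested record per key.
# The two documents deliberately have different fields (flexible schema).
documents = {
    "user:1": {"name": "Asha", "courses": ["NoSQL", "Networks"]},
    "user:2": {"name": "Ravi", "email": "ravi@example.com"},
}

# Column-oriented: one sequence per column; row i is entry i of each column.
columns = {
    "name": ["Asha", "Ravi"],
    "year": [3, 2],
}

# Graph: nodes plus edges that record relationships between nodes.
nodes = {"asha", "ravi", "nosql"}
edges = [("asha", "ENROLLED_IN", "nosql"), ("ravi", "ENROLLED_IN", "nosql")]


def enrolled_in(course):
    """Return the sorted node names that have an ENROLLED_IN edge to course."""
    return sorted(
        src for src, rel, dst in edges if rel == "ENROLLED_IN" and dst == course
    )
```

Reading one attribute for every record touches a single list in the column layout (`sum(columns["year"])`), while the graph layout answers a relationship question directly from its edges (`enrolled_in("nosql")`).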
Different Types of NoSQL Databases

A database is a collection of structured data or information that is stored in a computer system and can be accessed easily. A database is usually managed by a Database Management System (DBMS). A NoSQL database is a non-relational database that stores data in non-tabular form. NoSQL stands for Not only SQL. The main types are document, key-value, wide-column, and graph.

Types of NoSQL Database:
Key-Value Model
Document Model
Columnar Data Model
Graph-Based Model

1. Document-Based Database:

The document data model differs from other data models because it stores data in JSON, BSON, or XML documents. In this data model, documents can be nested inside other documents, and particular elements can be indexed so that queries run faster. Documents are often stored and retrieved in a form close to the data objects used in applications, which means very little translation is required to use the data in an application. JSON is a format native to many document databases and is often used both to store and to query data. In the document data model, each document consists of key-value pairs. An example follows.

{
  "Name" : "Yashodhra",
  "Address" : "Near Patel Nagar",
  "Email" : "[email protected]",
  "Contact" : "12345"
}

Working of Document Data Model:
This is a semi-structured data model: a record and the data associated with it are stored together in a single document, so the data is not completely unstructured.

Features:
Document Type Model: Data is stored in documents rather than tables or graphs, so it maps easily onto the data structures of many programming languages.
Flexible Schema: The schema is very flexible; not all documents in a collection need to have the same fields.
Distributed and Resilient: Document data stores are distributed, which is what enables horizontal scaling and distribution of data.
Manageable Query Language: The query language lets developers perform CRUD (Create, Read, Update, Delete) operations on the data.

Examples of Document Data Models: Amazon DocumentDB, MongoDB, Cosmos DB, ArangoDB, Couchbase Server, CouchDB.

Advantages:
Schema-less: Document stores are very good at retaining existing data at massive volumes because there are no restrictions on the format and structure of the stored data.
Faster creation of document and maintenance: A document is very simple to create, and it requires almost no maintenance.
Open formats: Documents are built with a simple process that uses open formats such as XML and JSON.
Built-in versioning: As documents grow in size they can also grow in complexity; built-in versioning decreases conflicts.

Disadvantages:
Weak Atomicity: Many document stores lack support for multi-document ACID transactions, although some, such as MongoDB from version 4.0, have added it. Without it, a change involving two collections requires two separate queries, one for each collection, which breaks atomicity requirements.
Consistency Check Limitations: It is possible to search for collections and documents that are not connected to, for example, an author collection, but doing so can hurt database performance.
Security: Many web applications today lack adequate security, which results in the leakage of sensitive data, so attention must be paid to web application vulnerabilities.

Applications of Document Data Model:
Content Management: Document data models are widely used to build video streaming platforms, blogs, and similar services, because each item is stored as a single document and the database is easier to maintain as the service evolves over time.
Book Database: Document data models are useful for building book databases because they allow data to be nested.
Catalog: They are widely used for storing and reading catalog files because reads remain fast even when catalogs have thousands of attributes stored.
Analytics Platform: They are widely used in analytics platforms.

2. Key-Value Stores:

A key-value data model or database is also referred to as a key-value store. It is a non-relational type of database that uses an associative array as its basic data model, in which each individual key is linked with just one value in a collection. Keys are unique identifiers for the values, and a value can be any kind of entity. A collection of key-value pairs stored as separate records is called a key-value database, and it has no predefined structure.

How do key-value databases work?
A key-value database associates a value, which can be anything from a simple string to a complex entity, with a key that is used to track that entity. As in many programming paradigms, a key-value database resembles a map, array, or dictionary object, except that it is stored persistently and managed by a DBMS. The key-value store uses an efficient, compact index structure so that it can find a value by its key quickly and reliably. For example, Redis is a key-value store used to track lists, maps, sets, and primitive types (which are simple data structures) in a persistent database. By supporting only a predetermined number of value types, Redis can expose a very simple interface for querying and manipulating them, and when properly configured it is capable of high throughput.
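The description above can be sketched as a dictionary-like store with three operations. This is a minimal in-memory illustration, not the API of Redis or any other product: the class name, method names, and the session key are hypothetical, and persistence is left out.

```python
# Minimal sketch of the key-value model: a dictionary-like store with
# put/get/delete. Class and method names are hypothetical, not a real API.
from typing import Any, Dict, Optional


class KeyValueStore:
    def __init__(self) -> None:
        # A hash table gives direct lookup by key, standing in for the
        # compact index a real key-value store maintains.
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        """Store value under key, replacing any previous value."""
        self._data[key] = value

    def get(self, key: str, default: Optional[Any] = None) -> Any:
        """Return the value for key, or default if the key is absent."""
        return self._data.get(key, default)

    def delete(self, key: str) -> bool:
        """Remove key and report whether it was present."""
        if key in self._data:
            del self._data[key]
            return True
        return False


store = KeyValueStore()
# The value is opaque to the store: here it happens to be a nested dict.
store.put("session:42", {"user": "asha", "cart": ["book"]})
```

The store never inspects the value, which is why a lookup needs the exact key: there is no way to ask for "all sessions whose cart contains a book" without scanning everything.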
When to use a key-value database:
Here are a few situations in which a key-value database is a good fit:
User session attributes in an online application such as finance or gaming, which calls for real-time random data access.
A caching mechanism for repeatedly accessed data, or a key-based design.
An application built around queries that are based on keys.

Features:
One of the simplest kinds of NoSQL data models.
Key-value databases use simple functions for storing, getting, and removing data.
Most key-value databases have no query language.
Built-in redundancy makes this kind of database more reliable.

Advantages:
It is very easy to use. Because the database is so simple, a value can be of any type, or even of different types when required.
Its response time is fast because of its simplicity, provided the surrounding environment is well built and tuned.
Key-value stores scale vertically as well as horizontally.
Built-in redundancy makes this kind of database more reliable.

Disadvantages:
Because there is no common query language, queries cannot be ported from one key-value database to another.
The key-value store is not a refined model: you cannot query the database without a key.

Some examples of key-value databases:
Here are some popular key-value databases that are widely used:
Couchbase: It permits SQL-style querying and text search.
Amazon DynamoDB: One of the most widely used key-value databases, trusted by a large number of users. It can easily handle a large number of requests every day and provides various security options.
Riak: A key-value database used for developing applications.
Aerospike: An open-source, real-time database that handles billions of transactions.
Berkeley DB: A high-performance, open-source database that provides scalability.
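The caching use case listed in this section is commonly implemented with the cache-aside pattern, which the sketch below illustrates under simplifying assumptions: the cache is a plain dictionary standing in for an in-memory key-value store, and `slow_profile_lookup` is a hypothetical placeholder for a slower primary database.

```python
# Cache-aside sketch: check a fast in-memory key-value cache first and fall
# back to a slower source on a miss. All names here are hypothetical.
from typing import Callable, Dict, List


def make_cached_lookup(load: Callable[[str], str]) -> Callable[[str], str]:
    cache: Dict[str, str] = {}  # stands in for an in-memory key-value store

    def lookup(key: str) -> str:
        if key not in cache:      # miss: load once from the slow source
            cache[key] = load(key)
        return cache[key]         # hit: served from memory

    return lookup


calls: List[str] = []  # records every trip to the slow source


def slow_profile_lookup(user_id: str) -> str:
    calls.append(user_id)
    return f"profile-of-{user_id}"


get_profile = make_cached_lookup(slow_profile_lookup)
get_profile("u1")
get_profile("u1")  # the second call is answered from the cache
```

After both calls, `calls` holds a single entry, showing that only the first request reached the slow source. A production cache would also need expiry and invalidation, which this sketch omits.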
3. Columnar Data Model of NoSQL:

A relational database stores data in rows and reads it row by row, whereas a column store is organized as a set of columns. To run analytics on a small number of columns, one can read just those columns directly without filling memory with unwanted data. The values in a column are usually of the same type and therefore compress more efficiently, which makes reads faster. Examples of the columnar data model: Apache Cassandra and Apache HBase.

Working of Columnar Data Model:
The columnar data model organizes information into columns instead of rows. Apart from that, column families function much as tables do in relational databases, but the model is more flexible, as is typical of NoSQL databases. The example below helps in understanding the columnar data model.

Row-Oriented Table:

S.No. | Name | Course | Branch | ID
01. | Tanmay | B-Tech | Computer | 2
02. | Abhishek | B-Tech | Electronics | 5
03. | Samriddha | B-Tech | IT | 7
04. | Aditi | B-Tech | E & TC | 8

Column-Oriented Table:

S.No. | Name | ID
01. | Tanmay | 2
02. | Abhishek | 5
03. | Samriddha | 7
04. | Aditi | 8

S.No. | Course | ID
01. | B-Tech | 2
02. | B-Tech | 5
03. | B-Tech | 7
04. | B-Tech | 8

S.No. | Branch | ID
01. | Computer | 2
02. | Electronics | 5
03. | IT | 7
04. | E & TC | 8

The columnar data model uses the concept of a keyspace, which is like a schema in relational models.

Advantages of Columnar Data Model:
Well structured: Because these data models compress well, they are well organized in terms of storage.
Flexibility: Columns do not have to look like each other, so new and different columns can be added without disrupting the whole database.
Aggregation queries are fast: Aggregation queries are quite fast because most of the information they need is stored in a single column. An example would be adding up the total number of students enrolled in one year.
Scalability: It can be spread across large clusters of machines, even numbering in the thousands.
Load Times: A table of rows can be loaded in a few seconds, so load times are very good.

Disadvantages of Columnar Data Model:
Designing indexing schema: Designing an effective, working schema is difficult and very time-consuming.
Suboptimal data loading: Incremental data loading is suboptimal and should be avoided, although this may not be an issue for some users.
Security vulnerabilities: The columnar data model lacks built-in security features, so if security is a priority, relational databases should be considered instead.
Online Transaction Processing (OLTP): OLTP applications are not compatible with columnar data models because of the way data is stored.

Applications of Columnar Data Model:
It is widely used in blogging platforms.
It is used in content management systems like WordPress, Joomla, etc.
It is used in systems that maintain counters.
It is used in systems that handle heavy write loads.
It is used in services that have expiring usage.

4. Graph-Based Databases:

The graph-based data model in NoSQL focuses on the relationships between data elements. Each element is stored as a node, and the associations between elements are known as links, or edges. Associations are stored directly, because they are first-class elements of the data model. These data models give a conceptual view of the data and are based on a topological network structure. Graph theory uses the terms nodes, edges, and properties; in the graph-based data model they mean the following.
Nodes: These are the instances of data that represent the objects to be tracked.
Edges: These represent the relationships between nodes.
Properties: These represent information associated with nodes.
The image below represents nodes with properties, connected by relationships that are represented by edges.

Working of Graph Data Model:
In these data models, connected nodes are linked physically, and the physical connection between them is itself treated as a piece of data. Connecting data in this way makes relationships easy to query: the data model reads a relationship directly from storage instead of computing the connection at query time. Like many other NoSQL databases, these data models have no fixed schema, which makes the model easy to change.

Examples of Graph Data Models:

JanusGraph: It is very helpful in big data analytics. It is a scalable, open-source graph database system. JanusGraph has features such as:
  Storage: Many options are available for storing graph data, such as Cassandra.
  Support for transactions: Transaction support such as ACID (Atomicity, Consistency, Isolation, and Durability) is available, and it can serve thousands of concurrent users.
  Searching options: Complex searching options are available, with optional support for more.

Neo4j: The name is sometimes expanded as Network Exploration and Optimization 4 Java. As the name suggests, this graph database is written in Java, with native graph storage and processing. Neo4j has features such as:
  Scalable: It scales by partitioning data into pieces known as shards.
  Higher availability: Availability is very high because of continuous backups and rolling upgrades.
  Query language: It uses Cypher, a programmer-friendly graph query language.

DGraph: It is an open-source distributed graph database system designed for scalability. DGraph's main features are:
  Query language: It uses GraphQL, a query language designed for APIs.
  Open-source system: It supports many open standards.
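The idea that relationships are stored directly and read back without computing joins can be illustrated with a small property-graph sketch. This is plain Python rather than Cypher or any graph database API, and the node names, the FRIEND relationship type, and the property values are all hypothetical.

```python
# Property-graph sketch using plain dictionaries. Node names, the FRIEND
# relationship, and all property values are hypothetical.
from typing import Dict, List, Tuple

# Nodes with their properties.
nodes: Dict[str, Dict[str, int]] = {
    "alice": {"age": 30},
    "bob": {"age": 25},
    "carol": {"age": 35},
}

# Edges are stored with their source node, so following a relationship is a
# direct lookup rather than a join computed at query time.
edges: Dict[str, List[Tuple[str, str]]] = {}


def add_edge(src: str, rel: str, dst: str) -> None:
    edges.setdefault(src, []).append((rel, dst))


def neighbours(node: str, rel: str) -> List[str]:
    """Nodes reachable from node over one outgoing edge of type rel."""
    return [dst for r, dst in edges.get(node, []) if r == rel]


def friends_of_friends(node: str) -> List[str]:
    """Nodes exactly two FRIEND hops away, excluding direct friends and node."""
    direct = set(neighbours(node, "FRIEND"))
    two_hops = {fof for f in direct for fof in neighbours(f, "FRIEND")}
    return sorted(two_hops - direct - {node})


add_edge("alice", "FRIEND", "bob")
add_edge("bob", "FRIEND", "carol")
```

With these two edges, `friends_of_friends("alice")` walks alice to bob to carol by following stored edges only. A recommendation engine of the kind mentioned in the applications list is, at its core, a traversal like this one.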
Advantages of Graph Data Model:
Structure: The structures are agile and easy to work with.
Explicit Representation: Relationships between entities are represented explicitly.
Real-time Output Results: Queries return results in real time.

Disadvantages of Graph Data Model:
No standard query language: The query language depends on the platform being used, so there has been no single standard query language.
Poor fit for transactional systems: Graph databases are not well suited to transaction-based systems.
Small User Base: The user base is small, which makes it difficult to get support when a problem arises.

Applications of Graph Data Model:
Graph data models are widely used in fraud detection.
They are used in digital asset management, where they provide a scalable database model for keeping track of digital assets.
They are used in network management, alerting a network administrator to problems in a network.
They are used in context-aware services, for example to give traffic updates.
They are used in real-time recommendation engines, which provide a better user experience.

Advantages of NoSQL Databases

There are many advantages to working with NoSQL databases such as MongoDB and Cassandra. The main advantages are high scalability and high availability.

High scalability: NoSQL databases use sharding for horizontal scaling. Sharding means partitioning the data and placing it on multiple machines in such a way that the order of the data is preserved. Vertical scaling means adding more resources to an existing machine, whereas horizontal scaling means adding more machines to handle the data. Vertical scaling is not easy to implement, but horizontal scaling is. Examples of horizontally scaling databases are MongoDB, Cassandra, etc. NoSQL can handle huge amounts of data because of this scalability: as the data grows, a NoSQL database scales itself automatically to handle that data efficiently.
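Sharding that preserves key order, as described above, is often called range sharding. The sketch below illustrates the idea with in-memory dictionaries standing in for machines; the boundary values, the number of shards, and the function names are hypothetical choices, not the scheme used by MongoDB or Cassandra.

```python
# Range-sharding sketch: keys are assigned to shards by sorted boundaries, so
# key order is preserved across shards. Boundaries and shard count are
# hypothetical choices for illustration.
from bisect import bisect_right
from typing import Dict, List, Optional

# shard 0: key < "h"; shard 1: "h" <= key < "p"; shard 2: key >= "p"
BOUNDARIES: List[str] = ["h", "p"]
shards: List[Dict[str, int]] = [dict() for _ in range(len(BOUNDARIES) + 1)]


def shard_for(key: str) -> int:
    """Index of the shard whose key range contains key."""
    return bisect_right(BOUNDARIES, key)


def put(key: str, value: int) -> None:
    shards[shard_for(key)][key] = value


def get(key: str) -> Optional[int]:
    return shards[shard_for(key)].get(key)


for name in ["aditi", "kiran", "tanmay", "meera"]:
    put(name, len(name))
```

Each read or write touches only one shard, which is what lets capacity grow by adding machines. Scaling out further corresponds to adding a boundary and moving the affected keys, a step that real systems automate and that this sketch leaves out.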
Flexibility: NoSQL databases are designed to handle unstructured or semi-structured data, which means that they can accommodate dynamic changes to the data model. This makes NoSQL databases a good fit for applications that need to handle changing data requirements.

High availability: Automatic replication makes NoSQL databases highly available, because in case of a failure the data can be restored from a replica to the last consistent state.

Scalability: NoSQL databases are highly scalable, which means that they can handle large amounts of data and traffic with ease. This makes them a good fit for applications that need to handle large amounts of data or traffic.

Performance: NoSQL databases are designed to handle large amounts of data and traffic, which means that they can offer improved performance compared to traditional relational databases.

Cost-effectiveness: NoSQL databases are often more cost-effective than traditional relational databases, as they are typically less complex and do not require expensive hardware or software.

Agility: Ideal for agile development.

Scalability and Performance of NoSQL Databases

Scalability and performance are critical aspects that distinguish NoSQL databases from traditional relational databases. Here is an in-depth look at these attributes.

Scalability

1. Horizontal Scalability:
Definition: The ability to increase capacity by connecting multiple hardware or software entities so that they work as a single logical unit.
Mechanism: Data is distributed across multiple servers (nodes) rather than relying on a single server.
Benefits: Allows for handling increasing data volumes and user load by simply adding more nodes to the cluster.

2. Automatic Sharding:
Definition: A method for distributing data across multiple machines.
Mechanism: Data is divided into smaller, manageable pieces called shards, which are distributed across multiple nodes.
Benefits: Enhances performance by parallelizing read and write operations and ensures that no single node becomes a bottleneck.

3. Elastic Scaling:
Definition: The capability to dynamically adjust resource allocation to meet varying workloads.
Mechanism: Resources can be scaled up or down based on current demand.
Benefits: Cost-effective, as resources are utilized based on demand, avoiding over-provisioning.

4. Replication:
Definition: The process of copying data from one node to another.
Mechanism: Data is replicated across multiple nodes to ensure high availability and fault tolerance.
Benefits: Provides redundancy, ensuring data availability even if one node fails.

Performance

1. High Throughput:
Definition: The ability to process a high volume of transactions in a given time period.
Mechanism: Optimized for fast read and write operations.
Benefits: Suitable for applications requiring high-speed data processing, such as real-time analytics.

2. Low Latency:
Definition: Latency is the delay before a transfer of data begins following an instruction for its transfer; low latency keeps that delay short.
Mechanism: Efficient data retrieval methods and in-memory storage (e.g., Redis).
Benefits: Ensures quick response times, essential for real-time applications.

3. Data Locality:
Definition: Storing related data close together to minimize access times.
Mechanism: Techniques like sharding and partitioning ensure related data is stored on the same node.
Benefits: Reduces the time taken to access related data, improving overall performance.

4. Optimized for Specific Workloads:
Mechanism: Different types of NoSQL databases (e.g., key-value, document, column-family, graph) are optimized for specific use cases.
Benefits: Ensures efficient data processing tailored to the specific needs of the application.

5. Caching:
Definition: Storing copies of data in a high-speed storage layer.
Mechanism: Frequently accessed data is stored in memory (e.g., Memcached, Redis).
Benefits: Reduces the load on the primary database and speeds up data access.

Differentiating Scalability and Performance of NoSQL Databases

Aspect | Scalability | Performance
Definition | Ability to handle increased loads by adding resources | Speed and efficiency of processing operations
Primary Goal | Handling larger data volumes and higher user loads | Minimizing response time and maximizing throughput
Types | Horizontal (scale-out) and Vertical (scale-up) | Measured in terms of throughput, latency, and response time
Mechanisms | Sharding, Replication, Elastic Scaling | In-Memory Storage, Data Locality, Optimized Indexing, Asynchronous Operations
Benefits | Cost-effective, High availability, Seamless growth | Fast read/write operations, Real-time processing, High efficiency
Challenges | Data consistency and management across distributed nodes | Ensuring low latency and high throughput under heavy load
Focus | Expanding capacity by adding more nodes or resources | Enhancing the speed and efficiency of data operations
Example Technologies | MongoDB, Cassandra | Redis, HBase
Resource Utilization | Efficiently uses multiple servers to distribute load | Efficient use of memory and CPU to speed up operations
Use Cases | Applications requiring large-scale data storage and high availability | Applications needing quick data retrieval and processing
Elasticity | Adjusts number of nodes/resources based on demand | Adjusts performance characteristics based on workload

Document Data Stores:

Document databases are a type of NoSQL database that store data in flexible, semi-structured documents, typically in JSON or a similar format. In a document database:
Data is stored in collections, which are similar to tables in a relational database.
Each collection contains multiple documents, which are similar to rows.
Documents are the basic unit of data storage and are identified by a unique key.
Documents can have a variable number of fields, and the fields can contain different data types, including nested documents and arrays.
The schema is flexible, meaning documents within the same collection do not need to have the same structure.

Some key features and benefits of document databases include:
Flexible Schema: Documents can have varying numbers and types of fields, allowing for easy adaptation to changing data requirements.
Intuitive Data Model: Documents map naturally to data structures used in applications, reducing the need for complex object-relational mapping.
High Performance: Document databases are optimized for retrieving and querying documents, providing fast access to data.
Scalability: Document databases can scale horizontally by adding more servers to a cluster.

Examples of popular document databases include MongoDB, Couchbase, and Amazon DocumentDB. Document databases are well suited to use cases such as content management systems, mobile apps, real-time analytics, and IoT applications.

Advantages of Document Data Stores
1. Schema-less: No restrictions on data format and structure, suitable for large volumes of data.
2. Faster Creation and Maintenance: Simple to create and maintain documents with minimal effort.
3. Open Formats: Uses simple build processes with XML, JSON, and other formats.
4. Built-in Versioning: Reduces conflicts as documents grow in size and complexity.

Disadvantages of Document Data Stores
1. Weak Atomicity: Many document stores lack support for multi-document ACID transactions, which breaks atomicity requirements.
2. Consistency Check Limitations: Performance issues when searching unconnected collections and documents.
3. Security Concerns: Potential vulnerabilities in web applications can lead to data leaks.

Applications of Document Data Stores
1. Content Management: Ideal for video streaming platforms, blogs, and similar services due to ease of maintenance.
2.
Book Database: Useful for nesting data, making it suitable for managing book databases. 3. Catalog: Efficient for storing and reading catalog files with thousands of attributes. 4. Analytics Platform: Commonly used for analytics due to fast reading capabilities. Key-Value Data Stores in NoSQL: Key-Value data stores are one of the simplest types of NoSQL databases. They store data as a collection of key-value pairs, where each key is unique and maps directly to a value. In a key-value database: Data is stored as a collection of key-value pairs. Each key is a unique identifier that points to a corresponding value. The values can be simple data types like strings and numbers, or more complex objects. There is no predefined schema - the database does not enforce any structure on the data. Keys are used to quickly retrieve the associated values. Features: One of the most un-complex kinds of NoSQL data models. For storing, getting, and removing data, key-value databases utilize simple functions. Querying language is not present in key-value databases. Built-in redundancy makes this database more reliable. 21 Advantages: It is very easy to use. Due to the simplicity of the database, data can accept any kind, or even different kinds when required. Its response time is fast due to its simplicity, given that the remaining environment near it is very much constructed and improved. Key-value store databases are scalable vertically as well as horizontally. Built-in redundancy makes this database more reliable. Disadvantages: As querying language is not present in key-value databases, transportation of queries from one database to a different database cannot be done. The key-value store database is not refined. You cannot query the database without a key. Some examples of key-value databases: Here are some popular key-value databases which are widely used: Couchbase: It permits SQL-style querying and searching for text. 
- Amazon DynamoDB: One of the most widely used key-value databases. It handles a very large number of requests every day and provides a range of security options.
- Riak: A distributed key-value database used for application development.
- Aerospike: An open-source, real-time database that handles billions of transactions.
- Berkeley DB: A high-performance, open-source embedded database that provides scalability.

Case studies of MongoDB, HBase, Neo4J:

Case Studies of MongoDB:

1. eBay
Use Case: Metadata Storage for Billions of Listings
Problem: eBay needed a scalable and flexible solution to manage and store the vast amounts of metadata associated with its listings. Traditional RDBMS solutions were inadequate in terms of performance and scalability for eBay's dynamic data needs.
Solution: eBay implemented MongoDB to handle the high volume of transactions and efficiently manage the metadata. MongoDB’s schema-less nature and ability to horizontally scale across multiple servers made it an ideal choice for eBay’s dynamic data environment.
Outcome:
- Scalability: MongoDB allowed eBay to scale horizontally across multiple servers, handling billions of documents efficiently.
- Performance: Improved read and write performance due to MongoDB's document-based storage and indexing capabilities.
- Flexibility: The dynamic schema design enabled eBay to adapt quickly to changing data requirements without significant overhead.
Key Features:
- Document-oriented storage
- Dynamic schemas
- High availability and built-in replication
- Sharding for horizontal scalability

2. Forbes
Use Case: Content Management and Delivery
Problem: Forbes needed a robust content management system capable of handling high traffic and dynamic content. The traditional relational database was struggling to keep up with the growing demand for high-speed content delivery and personalization.
Solution: Forbes adopted MongoDB to manage its content management system. MongoDB’s flexible schema allowed Forbes to store articles, images, and user data efficiently, while its horizontal scaling ensured high availability and performance during traffic spikes.
Outcome:
- Performance: Enhanced content delivery speeds and reduced latency, providing a better user experience.
- Scalability: MongoDB’s sharding feature allowed Forbes to scale horizontally and handle increased traffic without performance degradation.
- Flexibility: The ability to store diverse types of content in a single database simplified data management and retrieval processes.
Key Features:
- Flexible schema for diverse content types
- High performance and low latency
- Horizontal scalability with sharding
- Robust indexing capabilities

3. MetLife
Use Case: Customer 360 View
Problem: MetLife aimed to create a comprehensive 360-degree view of its customers to improve customer service and personalization. Their existing relational database systems were fragmented and unable to provide a unified view of customer data across various channels and touchpoints.
Solution: MetLife implemented MongoDB to consolidate customer data from multiple sources into a single, unified view. MongoDB’s flexible schema allowed MetLife to integrate diverse data types and sources seamlessly, providing a holistic view of customer interactions and history.
Outcome:
- Unified Data: Achieved a comprehensive 360-degree view of customer data, enhancing customer service and personalization.
- Improved Customer Experience: Enabled real-time access to customer information, resulting in faster and more accurate customer service responses.
- Data Integration: Simplified the integration of diverse data sources and types, improving data consistency and reliability.
Key Features:
- Flexible schema for integrating diverse data sources
- Real-time data access
- High availability and scalability
- Robust querying and indexing capabilities

4. Telefonica
Use Case: IoT and Big Data Management
Problem: Telefonica needed a database solution capable of handling the massive influx of data generated by its IoT devices and services. The traditional relational databases were unable to efficiently store and process the high-velocity, high-volume data streams.
Solution: Telefonica adopted MongoDB for its IoT and big data management. MongoDB’s document-oriented storage and horizontal scaling capabilities allowed Telefonica to efficiently store and analyze large volumes of IoT data, providing valuable insights and improving service delivery.
Outcome:
- Data Management: Efficiently managed and stored large volumes of IoT data, enabling real-time analytics and insights.
- Scalability: MongoDB’s horizontal scaling capabilities ensured high availability and performance, even with increasing data volumes.
- Insights and Analytics: Improved data analysis and insight generation, leading to better decision-making and service optimization.
Key Features:
- Document-oriented storage for flexible data management
- Horizontal scalability with sharding
- Real-time data processing and analytics
- High availability and performance

Case Studies of HBase

1. Pinterest
Use Case: Real-time Analytics and Data Storage
Problem: Pinterest needed a solution to manage and store the massive amounts of data generated by user interactions and content. Their existing systems were inadequate for handling real-time analytics and the scalability required for their growing user base.
Solution: Pinterest implemented HBase to store and process time-series data and real-time analytics data. HBase’s integration with Hadoop allowed them to leverage MapReduce for processing large data sets efficiently.
Outcome:
- Scalability: HBase's ability to scale horizontally helped Pinterest manage billions of rows of data.
- Performance: Improved data retrieval times due to HBase's efficient read/write capabilities.
- Integration: Seamless integration with Hadoop for batch processing and analytics.
Key Features:
- Column-family-oriented storage
- Strong consistency
- Linear and modular scalability
- Integration with Hadoop for processing

2. Yahoo!
Use Case: Logging and Data Analysis
Problem: Yahoo! required a robust system to manage and analyze vast amounts of log data generated by its services. Traditional relational databases were not able to handle the high write throughput and the large volume of data efficiently.
Solution: Yahoo! adopted HBase to store log data. HBase's architecture allowed Yahoo! to handle high write throughput and provided efficient random access to large data sets. This setup enabled near real-time analysis of logs, improving service monitoring and issue resolution.
Outcome:
- High Throughput: HBase handled millions of writes per second, accommodating the massive influx of log data.
- Real-time Analysis: Enabled near real-time monitoring and analysis, enhancing service reliability and performance.
- Cost Efficiency: Reduced storage costs due to HBase's efficient use of disk space and scalability.
Key Features:
- High write throughput
- Efficient random access to large data sets
- Cost-effective storage
- Integration with existing Hadoop infrastructure

3. Salesforce
Use Case: Customer Data Management
Problem: Salesforce needed a scalable and reliable database to manage customer data, including transactional and analytical workloads. The data volumes and the need for real-time access posed significant challenges.
Solution: Salesforce implemented HBase to handle its customer data management requirements. HBase's distributed nature and strong consistency model allowed Salesforce to manage large volumes of data efficiently while ensuring data integrity.
Outcome:
- Scalability: Easily scaled to handle growing data volumes without compromising performance.
- Reliability: Ensured data integrity and availability through HBase's strong consistency model.
- Performance: Improved read and write performance, enabling real-time access to customer data.
Key Features:
- Distributed and scalable architecture
- Strong consistency
- High performance for read and write operations
- Integration with Hadoop ecosystem

4. Facebook
Use Case: Messaging Platform Storage
Problem: Facebook's messaging platform required a storage solution capable of handling high write and read loads, as well as storing vast amounts of data generated by user messages.
Solution: Facebook utilized HBase to store messages and handle the high volume of data transactions. HBase's scalability and performance characteristics made it suitable for Facebook's needs, allowing for efficient storage and retrieval of messages.
Outcome:
- High Volume Handling: Managed billions of messages with high write and read loads.
- Scalability: Scaled horizontally to accommodate the growing user base and message volume.
- Efficiency: Provided efficient storage and retrieval, ensuring a smooth user experience.
Key Features:
- High write and read throughput
- Horizontal scalability
- Efficient storage management
- Strong consistency

Case Studies of Neo4j

1. LinkedIn
Use Case: Real-time Social Network Analysis
Problem: LinkedIn needed a database solution to manage and analyze the complex social graph of connections between professionals in real-time. Traditional relational databases were not efficient for traversing highly interconnected data.
Solution: LinkedIn adopted Neo4j to handle their social graph data. Neo4j’s native graph storage and processing capabilities provided the necessary performance and flexibility for analyzing complex relationships between users.
Outcome:
- Performance: Significant improvement in query performance for traversing relationships and connections.
- Flexibility: Easy to model and query complex relationships in the social graph.
- Real-time Analysis: Enabled real-time recommendations and insights for users.
Key Features:
- Native graph storage and processing
- ACID compliance
- Cypher query language for graph queries
- High-performance traversal of nodes and relationships

2. Walmart
Use Case: Supply Chain Management
Problem: Walmart needed a solution to optimize and manage its complex supply chain network, including suppliers, distribution centers, and retail stores. They required a system that could efficiently analyze and optimize supply chain operations.
Solution: Walmart implemented Neo4j to model and analyze its supply chain network. By leveraging graph database capabilities, Walmart could identify bottlenecks, optimize routes, and ensure timely delivery of products.
Outcome:
- Optimization: Improved efficiency in supply chain operations through better route optimization and bottleneck identification.
- Transparency: Enhanced visibility into the entire supply chain network.
- Efficiency: Reduced operational costs and improved delivery times.
Key Features:
- Graph-based modeling of supply chain networks
- Real-time analysis and optimization
- Enhanced visibility and transparency
- Efficient route optimization

3. eBay
Use Case: eCommerce Recommendations Engine
Problem: eBay required a recommendation engine to enhance the shopping experience by suggesting relevant products to users based on their browsing and purchase history. The relational databases used previously were inefficient for real-time recommendations.
Solution: eBay implemented Neo4j to power its recommendation engine. By utilizing the graph database, eBay could efficiently model relationships between users, products, and their interactions, enabling real-time and personalized recommendations.
Outcome:
- Personalization: Improved the accuracy and relevance of product recommendations.
- Real-time Processing: Enabled real-time recommendations, enhancing the user shopping experience.
- Scalability: Managed large-scale data efficiently, ensuring high performance during peak traffic.
Key Features:
- Real-time recommendation engine
- Personalized product suggestions
- Efficient handling of large-scale data
- Enhanced user experience

4. NASA
Use Case: Knowledge Graph for Research and Collaboration
Problem: NASA needed a solution to manage and explore the vast amount of data generated by various research projects and collaborations. They required a system to link and analyze different datasets and facilitate knowledge sharing.
Solution: NASA adopted Neo4j to create a knowledge graph that connects different research projects, datasets, and collaborators. This allowed researchers to discover connections between different areas of research and collaborate more effectively.
Outcome:
- Knowledge Sharing: Facilitated better collaboration and knowledge sharing among researchers.
- Data Discovery: Enabled the discovery of new insights by connecting disparate datasets.
- Research Efficiency: Improved the efficiency of research projects through better data management and connectivity.
Key Features:
- Knowledge graph for connecting research data
- Enhanced collaboration and knowledge sharing
- Efficient data discovery and analysis
- Improved research efficiency

NoSQL Database Design for Applications

NoSQL databases have become popular for applications that require high scalability, flexible data models, and fast access to large volumes of data. The steps below outline how to design a NoSQL database for an application.

1. Understanding the Requirements
- Data Types: Identify the types of data the application will manage, such as user profiles, transactions, logs, or product catalogs.
- Data Volume: Estimate the volume of data to understand scalability needs.
- Access Patterns: Determine how data will be accessed, e.g., frequent reads, writes, or complex queries.
- Performance: Define performance requirements in terms of read/write latencies.
- Consistency and Availability: Balance the need for data consistency and system availability based on the CAP theorem.

2. Choosing the Right NoSQL Database
- Document Stores (e.g., MongoDB, CouchDB): Ideal for semi-structured data and flexible schemas.
- Key-Value Stores (e.g., Redis, DynamoDB): Best for high-performance read and write operations with simple key-value pairs.
- Column Stores (e.g., Cassandra, HBase): Suitable for handling large-scale distributed data with high write throughput.
- Graph Databases (e.g., Neo4j, Amazon Neptune): Excellent for applications with highly interconnected data.

3. Data Modeling

Document Stores (e.g., MongoDB)
- Design for Queries: Model data to match the most frequent queries.
- Schema Design: Use embedded documents and arrays to avoid joins. Ensure document sizes are within limits.
- Indexing: Create indexes on fields that are frequently queried.

Example (JSON):

    {
      "user_id": "12345",
      "name": "John Doe",
      "email": "[email protected]",
      "addresses": [
        {"type": "home", "address": "123 Main St, Springfield, IL"},
        {"type": "work", "address": "456 Market St, Springfield, IL"}
      ],
      "orders": [
        {
          "order_id": "98765",
          "date": "2023-01-01",
          "items": [
            {"product_id": "A1", "quantity": 2},
            {"product_id": "B2", "quantity": 1}
          ]
        }
      ]
    }

Key-Value Stores (e.g., Redis, DynamoDB)
- Key Design: Design unique, descriptive keys for efficient lookups.
- Value Storage: Store values as simple data types, serialized objects, or JSON strings.
- Data Partitioning: Use sharding or consistent hashing to distribute data across nodes.

Example (key followed by its stored value):

    "user:12345": { "name": "John Doe", "email": "[email protected]" }

    "order:98765": {
      "user_id": "12345",
      "date": "2023-01-01",
      "items": [
        {"product_id": "A1", "quantity": 2},
        {"product_id": "B2", "quantity": 1}
      ]
    }

Column Stores (e.g., Cassandra, HBase)
- Column Family Design: Organize data into column families based on access patterns.
- Partition Keys: Choose partition keys to distribute data evenly and avoid hotspots.
- Clustering Columns: Use clustering columns to define the sort order within partitions.
Example (Cassandra CQL):

    -- The MAP element types below are assumed from the JSON example
    -- (address type -> address, product_id -> quantity).
    CREATE TABLE users (
      user_id UUID PRIMARY KEY,
      name TEXT,
      email TEXT,
      addresses MAP<TEXT, TEXT>
    );

    CREATE TABLE orders (
      user_id UUID,
      order_id UUID,
      date TIMESTAMP,
      items MAP<TEXT, INT>,
      PRIMARY KEY (user_id, order_id)
    ) WITH CLUSTERING ORDER BY (order_id DESC);

Graph Databases (e.g., Neo4j)
- Nodes and Relationships: Model entities as nodes and their connections as relationships.
- Properties: Store attributes on nodes and relationships.
- Traversal Patterns: Design the graph for common traversal patterns and queries.

Example (Cypher):

    CREATE (u:User {user_id: '12345', name: 'John Doe', email: '[email protected]'})
    CREATE (o:Order {order_id: '98765', date: '2023-01-01'})
    CREATE (u)-[:PLACED]->(o)
    CREATE (p1:Product {product_id: 'A1'})
    CREATE (p2:Product {product_id: 'B2'})
    CREATE (o)-[:CONTAINS {quantity: 2}]->(p1)
    CREATE (o)-[:CONTAINS {quantity: 1}]->(p2)

4. Data Partitioning and Distribution
- Sharding: Distribute data across multiple servers to enhance scalability and fault tolerance.
- Replication: Replicate data across nodes to improve availability and fault tolerance.
- Consistency Models: Choose between eventual consistency and strong consistency based on application needs.

5. Indexing and Query Optimization
- Indexes: Create indexes on frequently queried fields to enhance read performance.
- Secondary Indexes: Use secondary indexes to support complex queries.
- Query Planning: Design queries to minimize data scans and optimize performance.

6. Handling Schema Changes
- Schema Evolution: Design the model to accommodate changes without major disruptions.
- Versioning: Implement versioning strategies for documents or entities to manage schema changes.

7. Security and Access Control
- Authentication and Authorization: Implement robust mechanisms to control access.
- Data Encryption: Encrypt data at rest and in transit to ensure security.
- Access Control: Define policies to restrict access to sensitive data.

8. Monitoring and Maintenance
- Performance Monitoring: Continuously monitor database performance.
- Backup and Recovery: Implement regular backup and recovery processes.
- Capacity Planning: Plan for capacity requirements based on projected data growth.

9. Testing and Deployment
- Load Testing: Perform load testing to ensure the database can handle expected traffic.
- Staging Environment: Use a staging environment to test changes before production deployment.
- Continuous Deployment: Implement continuous deployment practices for smooth updates.
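
The data-partitioning advice for key-value stores in step 3 mentions consistent hashing. As a minimal sketch of that idea, the Python below builds a toy hash ring. The node names (node-a to node-d), the count of 100 virtual nodes per server, and the use of SHA-256 as the ring hash are all illustrative assumptions; they do not come from these notes or from any particular database.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Map a string to a position on the ring (SHA-256 is an arbitrary choice here)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; node names and vnode count are hypothetical."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each server is placed on the ring many times to even out the load.
        for i in range(self._vnodes):
            point = ring_hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key):
        # The owner is the first ring position clockwise from the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


# Place 1000 example keys on a three-node ring, then add a fourth node.
ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}

# Keys whose owner changed after the new node joined.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all to node-d")
```

Only keys that fall just before one of the new node's ring positions change owner, so adding a server relocates a fraction of the keys instead of reshuffling all of them. That property is what makes this scheme a common choice for scale-out key-value stores.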
