Semester 03 DMS IT 3306 Data Management System PDF


SEMESTER 03
IT 3306 Data Management System
Data Management Evolution
Follow us on @AcademyofGigaNerds @GigaNerds
Contact us – 071 380 52 07

Major Concepts of Object-Oriented, XML, and NoSQL Databases
Modern database technologies such as Object-Oriented Databases (OODBs), XML databases, and NoSQL systems evolved to handle complex and unstructured data, addressing limitations of traditional relational databases.

Object Databases

Overview of Object Database Concepts
Relational vs. Object Databases: Relational databases use the relational data model, whereas object databases (also known as Object-Oriented Databases or OODBs) are built on the object data model. The main advantage of object databases is that they let the designer define both the structure of objects and the operations that apply to them.
Traditional Models: Early business requirements were met by the traditional data models: network, hierarchical, and relational. However, real-time applications requiring high performance, such as telecommunications, architectural design, biological sciences, and GIS (Geographical Information Systems), often struggled with the rigid structure of these models.
Emergence of Object Databases: Object databases were developed to meet these new requirements and are now widely used in real-time application domains. Their growing popularity is largely due to the following:
o They integrate seamlessly with applications developed in Object-Oriented Programming Languages (OOPLs) such as C++ and Java.
o Relational databases can be an awkward fit for object-oriented applications, because objects must be converted to and from tables.
o They are particularly useful in domains that need complex data handling, such as GIS, scientific computation, and simulation.
o Object-relational databases (relational DBMSs extended with object features) provide a middle ground and are also widely adopted.
Examples of Object Databases: Object database prototypes include:
o Orion System (Microelectronics and Computer Technology Corporation)
o OpenOODB (Texas Instruments)
o Iris (Hewlett-Packard)
o Ode (AT&T Bell Labs)
o ENCORE/ObServer (Brown University)
Commercial object databases include:
o GemStone Object Server (GemStone Systems)
o ONTOS DB (Ontos)
o Objectivity/DB (Objectivity Inc.)
o Versant Object Database and FastObjects (Versant Corporation)
o ObjectStore (Object Design)
o Ardent Database (Ardent)

Introduction to Object-Oriented Concepts
Object-Oriented (O-O) Basics: Object-oriented (O-O) concepts originated in Object-Oriented Programming Languages (OOPLs). Over time, they were applied to other areas, including databases, software engineering, computer systems, and knowledge bases. Object databases have adopted most of these ideas, allowing for flexible data modeling and operations.
Core Object Components:
o State (Value): The state of an object can have a complex data structure.
o Behavior (Operations): Objects exhibit behavior through defined operations.
Object Categories:
o Transient Objects: These objects exist only while the program is running and disappear once the program terminates.
o Persistent Objects: These objects continue to exist after the program has ended and can be stored in an object-oriented database for future retrieval.

Object-Oriented Databases (OODBs)
Persistent Object Storage: A key feature of object-oriented databases is their ability to store objects permanently, so that persistent objects can be retrieved later from secondary storage. Persistent storage extends the lifetime of objects beyond the running program.
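The difference between transient and persistent objects can be illustrated with a minimal Python sketch. It is only an analogy: a JSON file in a temporary directory stands in for the secondary storage that a real OODB would manage itself, and the `Employee` class and the `save`, `load`, and `demo` helpers are hypothetical names introduced for this illustration.

```python
import json
import os
import tempfile


class Employee:
    """Hypothetical object type used only for this illustration."""

    def __init__(self, name, salary):
        self.name = name
        self.salary = salary


def save(employee, path):
    """Write the object's state to secondary storage (a JSON file here)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"name": employee.name, "salary": employee.salary}, f)


def load(path):
    """Rebuild an object from previously stored state."""
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return Employee(state["name"], state["salary"])


def demo():
    path = os.path.join(tempfile.mkdtemp(), "employee.json")

    transient = Employee("Temp", 1000.0)    # never stored: gone when the program ends
    persistent = Employee("Smith", 5000.0)
    save(persistent, path)                  # stored: its state outlives the program

    del transient, persistent               # simulate the program terminating
    return load(path)                       # a later run can still retrieve it
```

Calling `demo()` returns a new in-memory `Employee` rebuilt from the stored state, while the object that was never saved cannot be recovered.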
Data Sharing: Data in an object-oriented database can be shared across different applications and programs, promoting better data management and integration across systems.

Object Identity and Literals
Object identity means that every object in an Object-Oriented Database (OODB) is assigned a unique identity. Each object is given a system-generated Object Identifier (OID), which is immutable and remains constant throughout the object's lifetime. The OID allows an object to be referenced consistently, so its identity is preserved even if its data is altered.
OID Properties:
o Immutable: Once assigned, the OID does not change.
o Unique: Every object has a distinct OID.
o Internally used: OIDs may not be visible to external users but are essential for inter-object referencing.
o Independent of attribute values: OIDs do not depend on an object's attributes, so modifying attributes never changes identity.
Literals in an OODB are attribute values that do not have OIDs. Unlike objects, literals cannot be referenced independently and are typically embedded within objects.
Literal Types Supported by the Object Model:
o Single-valued/Atomic Types: Indivisible values such as integers, strings, Booleans, and floating-point numbers.
o Struct (or Tuple) Constructor: Structured types that group multiple components, similar to tuples in relational databases (e.g., a composite attribute like struct EmpName).
o Collection Type Constructors: Enable the definition of collections of elements, including:
  ▪ Set: Unordered, no duplicates.
  ▪ Bag: Unordered, allows duplicates.
  ▪ List: Ordered collection.
  ▪ Array: Ordered with a fixed size.
  ▪ Dictionary: Key-value pairs, unordered.
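The OID properties and literal types listed above can be sketched in Python. This is an illustration under stated assumptions, not how any particular OODB is implemented: the `DbObject` class, the `_next_oid` counter, and the `demo` function are hypothetical, and Python built-ins are used as rough stand-ins for the object model's constructors.

```python
import itertools
from collections import Counter

_next_oid = itertools.count(1)  # hypothetical system-wide OID generator


class DbObject:
    """An object with a system-generated identifier that never changes."""

    def __init__(self, **state):
        self._oid = next(_next_oid)  # assigned once, independent of the state
        self.state = dict(state)

    @property
    def oid(self):
        # Read-only property: no setter exists, so the OID cannot be reassigned.
        return self._oid


def demo():
    a = DbObject(name="Smith", dept=5)
    b = DbObject(name="Smith", dept=5)    # same state, but a different object

    oid_before = a.oid
    a.state["dept"] = 4                   # modifying an attribute...
    oid_is_stable = a.oid == oid_before   # ...leaves the OID untouched

    # Literals have no OID; they are compared purely by value.
    emp_name_1 = ("John", "B", "Smith")   # struct/tuple literal
    emp_name_2 = ("John", "B", "Smith")

    collections = {
        "set": {1, 2, 2, 3},              # unordered, duplicates dropped
        "bag": Counter([1, 2, 2, 3]),     # unordered, duplicates counted
        "list": [3, 1, 2],                # ordered
        "array": (0, 0, 0),               # fixed size in this sketch
        "dictionary": {"k1": "v1"},       # key-value pairs
    }
    return a.oid != b.oid, oid_is_stable, emp_name_1 == emp_name_2, collections
```

Two objects with identical state still receive different OIDs, whereas two literals with identical values are indistinguishable.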
Key Differences between Objects and Literals:
Objects have OIDs, making them uniquely identifiable and capable of being referenced by other objects.
Literals do not have OIDs and are generally used as values inside objects, representing atomic or composite values such as strings or tuples.
Example: A DEPARTMENT type could include both object references (e.g., Mgr: tuple (Manager: EMPLOYEE)) and literals (e.g., Dname: string; for the department name).
This distinction between objects and literals provides a balance between uniquely identifiable entities (objects) and simple data representations (literals) within an OODB.

Class
In an OODB, a class is a type definition that includes both the structure and the operations associated with that type of object.
Type Definition: Defines the data structure (attributes) and operations for a specific object type.
Operations Declaration: Each class declares its relevant operations, and their signatures (interfaces) are part of the class definition.
Method Implementation: The method (implementation) of each operation is written separately in a programming language.
Common Operations in a Class:
Object Constructor (new): Creates a new instance of an object.
Destructor: Deletes or destroys an object.
Modifier Operations: Modify the state (attributes) of an object.
Retrieval Operations: Fetch specific information about the object.

Encapsulation of Operations
Encapsulation is a key feature of object-oriented databases (OODBs), ensuring that data and the operations that manipulate it are bound together. This concept, borrowed from object-oriented programming (OOP), lets objects keep control over their internal data by exposing only the necessary operations to external programs.
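The class operations and encapsulation described above can be sketched as follows. This is a hedged Python illustration rather than an OODB class definition: the `Department` class, its attributes, and its operation names are hypothetical, and the leading underscore only marks an attribute as hidden by convention. A destructor is omitted because Python reclaims unreferenced objects automatically.

```python
class Department:
    """Hypothetical DEPARTMENT type: structure plus the operations on it."""

    def __init__(self, dname, budget):
        # Object constructor: creates a new instance with its initial state.
        self.dname = dname       # visible attribute
        self._budget = budget    # hidden attribute (by naming convention)

    def raise_budget(self, amount):
        # Modifier operation: the only sanctioned way to change the hidden state.
        if amount <= 0:
            raise ValueError("amount must be positive")
        self._budget += amount

    def budget(self):
        # Retrieval operation: exposes the value without exposing the storage.
        return self._budget
```

External code is expected to call `raise_budget` and `budget` (the signatures) instead of touching `_budget` directly, so the object can reject invalid updates such as a negative amount.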
Key Concepts:
Encapsulation Mechanism: Binds the code (operations) to the data it manipulates, limiting access to object data from outside the defined interface.
Signature (Interface): Specifies the operation name and its arguments, defining how external programs can interact with the object.
Method (Body): The implementation of the operation, which is often hidden from external users and defined elsewhere in the system.
Message Passing: External programs invoke operations on objects by passing messages that contain the operation name and the required parameters.
In OODB systems, encapsulation is relaxed to some extent. Objects may have visible and hidden attributes:
Visible Attributes: Directly accessible to users via query languages.
Hidden Attributes: Accessible only through predefined operations, ensuring data protection and integrity.

Persistence of Objects
In an OODB, persistent objects are objects that exist beyond the lifespan of the program and are stored in the database for future retrieval. Persistence is a crucial aspect of OODBs that differentiates them from traditional programming environments, where most objects are transient.
Persistence Mechanisms:
Naming: A persistent object is assigned a unique name, through which it can be accessed in the future.
Reachability: An object becomes persistent if it is reachable from an existing persistent object. For example, if a department object is added to a persistent set of departments, it becomes persistent by association.
This system allows for flexibility in object storage, supporting both transient and persistent objects, unlike traditional relational databases, where all stored data is inherently persistent.

Type Hierarchies and Inheritance
Concept: Type hierarchies and inheritance are essential features of Object-Oriented (OO) databases.
In OO databases, type hierarchies allow new data types to be created from existing types, promoting reusability. Inheritance is the mechanism through which a new type or class (subtype) acquires attributes and operations (functions) from a predefined type (supertype).
Key Points:
1. Inheritance in Object Databases:
o Inheritance allows new types (subtypes) to inherit attributes and operations from existing types (supertypes).
o This enables reuse of type definitions and promotes incremental development of data types.
o Example: A Rectangle or Triangle class can inherit attributes like Color from a Shape superclass, reusing structure and behavior.
2. Functions:
o In the basic inheritance model, attributes and operations are both treated as "functions."
o Example: A Student type might include attributes like Name, NIC, and Birth_date, which can be implemented as stored attributes, while an attribute like Age could be an operation that calculates the age from Birth_date.
3. Subtypes and Supertypes:
o Subtypes inherit functions from their supertypes and may add their own.
o Example: EMPLOYEE and STUDENT could be subtypes of PERSON, inheriting attributes such as Name and Birthdate, while adding attributes like Empid (Employee ID) for employees and RegNo (Registration Number) for students.
4. Geometric Object Example:
o A Geometric_Obj could be a supertype with attributes like Name and Color, and operations like Cal_area (calculate area).
o Subtypes like Circle_Obj and Rectangle_Obj inherit attributes from Geometric_Obj but have additional attributes like radius (for circles) or width and height (for rectangles).

Polymorphism and Multiple Inheritance
Concept: Polymorphism refers to the ability to represent a single operation in multiple forms, also known as operator overloading.
Multiple inheritance occurs when a class or subtype inherits from two or more supertypes, allowing it to acquire attributes and operations from all of them.
Polymorphism (Operator Overloading):
An operation can be implemented differently depending on the type of object it is applied to.
Example: A Cal_area function could calculate the area of different shapes such as circles, rectangles, or triangles, with a different implementation for each shape.
Polymorphism can be handled through early (static) binding, where the method is determined at compile time, or late (dynamic) binding, where the appropriate method is determined at runtime based on the object type.
Multiple Inheritance:
Occurs when a subtype inherits from more than one supertype.
Example: ENGINEERING_MANAGER could be a subtype of both MANAGER and ENGINEER, inheriting functions from both supertypes. The result is a type lattice rather than a simple hierarchy.
Problems with Multiple Inheritance:
Ambiguity: If both supertypes define a function with the same name (e.g., Salary), it is unclear which function the subtype should inherit.
Solutions: Ambiguity can be resolved by letting developers specify the function to inherit, by using system defaults, or by disallowing multiple inheritance when ambiguity is detected.
Selective Inheritance:
Allows a subtype to inherit only some functions from a supertype, excluding others. This concept is more common in artificial intelligence applications and is not typically supported in ODBs.

XML Databases

Reason for the Origination of XML
XML stands for Extensible Markup Language. Just as desktop applications connect to local databases, web applications need interfaces to connect to data sources. These data sources provide access to the information needed by web applications. Web pages are formatted as hypertext documents, using languages such as HTML (HyperText Markup Language).
Why XML?
HTML is widely used for formatting web pages but is not ideal for representing structured data.
XML and other formats such as JSON (JavaScript Object Notation) were created to structure and exchange data efficiently in web applications.
XML is a self-describing language, meaning it not only holds data but also describes what the data represents.
o For example, the tag and attribute names in an XML document convey the meaning of the values they enclose.

Static vs. Dynamic Web Pages
Static Web Pages: Created using basic HTML, they display fixed content.
Dynamic Web Pages: Interactive and responsive to user inputs. For instance, dynamic web pages can transfer self-describing XML documents based on user interaction.

Key Points:
1. Web Application Structure
Web applications typically follow a three-tier client/server architecture, where data sources are accessed via web interfaces.
Web interfaces display web pages to users on various devices (desktops, laptops, mobile).
2. HTML Limitations
HTML is suitable for formatting web pages but cannot manage structured data effectively, particularly when the data is stored in databases and must be extracted dynamically.
3. Introduction of XML
o XML was introduced to bridge this gap, allowing for the transfer and description of structured data.
o XML documents are self-describing because they include names alongside the data, making them well suited to dynamic web applications that require rich data interaction.
4. Comparison with JSON
JSON is another option for structuring and exchanging data on the web.
Both XML and JSON are popular for transferring data in a readable format but differ in syntax and use cases.
5. Formatting and Style
Formatting in XML-based systems is typically handled separately using technologies like XSL (Extensible Stylesheet Language) and XSLT (XSL Transformations).
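To make "self-describing" concrete, the following Python sketch encodes the same small record as XML and as JSON using only the standard library (`xml.etree.ElementTree` and `json`). The record itself (an account with a number and a balance) and all function names are hypothetical choices for this illustration.

```python
import json
import xml.etree.ElementTree as ET


def account_as_xml(number, balance):
    """Build an XML document whose tag names describe each value."""
    root = ET.Element("account")
    ET.SubElement(root, "number").text = number
    ET.SubElement(root, "balance").text = str(balance)
    return ET.tostring(root, encoding="unicode")


def account_as_json(number, balance):
    """Build the equivalent JSON document with named fields."""
    return json.dumps({"account": {"number": number, "balance": balance}})


def balance_from_xml(text):
    """Look the value up by tag name; XML text content is always a string."""
    return float(ET.fromstring(text).findtext("balance"))


def balance_from_json(text):
    """Look the value up by field name; JSON keeps the numeric type."""
    return json.loads(text)["account"]["balance"]
```

In both formats the receiver finds the balance by name instead of by position, which is what distinguishes these formats from presentation-oriented HTML. One visible difference in syntax and typing is that XML delivers the balance as text that must be converted, while JSON carries it as a number.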
XML in Dynamic Applications
In dynamic web applications, XML can play a vital role in transferring information as text. For example, when a user interacts with a web page, their input (such as a bank account number) can be used to fetch data (such as the account balance) from a database and display it on the page. XML supports this interaction by structuring and describing the data being transferred.

Structured, Semi-Structured, and Unstructured Data

Structured Data
Definition: Structured data is information that is organized and stored according to a clearly defined schema, typically in relational databases.
Storage: Stored in relational databases with a fixed schema that defines the structure of each record (e.g., tables, columns, and rows).
Characteristics:
o Each data record follows the same structure.
o A strict schema enforces consistency in data storage.
o Examples: Employee databases, financial records, product inventories.

Semi-Structured Data
Definition: Semi-structured data does not adhere to a fixed schema but contains tags or markers that separate elements, making it partially structured.
Storage: Typically stored in formats like XML or JSON, which do not require a rigid schema.
Characteristics:
o The data has some structure, but not all records share the same format.
o Attributes and relationships may vary across entities.
o Data can be represented using tree or graph data structures.
o Example: A bibliographic database where some references contain full details (author, title, year), while others have incomplete information.
o Tree and Graph Data Structures:
  ▪ In tree structures, elements are organized hierarchically.
  ▪ Directed graph models are also used, where:
    ▪ Tags/Labels represent schema names (attributes or relationships).
    ▪ Internal Nodes represent objects or composite attributes.
    ▪ Leaf Nodes represent actual data values.
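The tree view of semi-structured data can be sketched with nested Python dictionaries and lists. In this illustration, dictionary keys play the role of labels, nested containers are internal nodes, and plain values are leaf nodes. The `leaf_paths` helper and the sample bibliographic records are hypothetical and exist only to show that records need not share a format.

```python
def leaf_paths(node, prefix=""):
    """Yield (path, value) for every leaf in a nested dict/list structure."""
    if isinstance(node, dict):
        # Internal node: each key is a label on the edge to a child.
        for label, child in node.items():
            yield from leaf_paths(child, f"{prefix}/{label}")
    elif isinstance(node, list):
        # Internal node holding repeated children, addressed by position.
        for index, child in enumerate(node):
            yield from leaf_paths(child, f"{prefix}[{index}]")
    else:
        # Leaf node: an actual data value.
        yield prefix, node


# Made-up bibliographic data: the second reference is incomplete.
references = [
    {"author": "A. Author", "title": "Example Title", "year": 2020},
    {"title": "Another Title"},
]
```

Running `dict(leaf_paths({"reference": references}))` produces one path per stored value, for instance `/reference[0]/year`, and simply has no entry for the author or year that the second reference lacks. No schema had to declare those fields optional.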
Unstructured Data
Definition: Unstructured data lacks any predefined format or organization and does not fit neatly into a relational database or a semi-structured model.
Storage: Unstructured data is typically stored in its native format, such as documents, images, videos, or web pages (e.g., HTML).
Characteristics:
o It has no well-defined structure.
o The meaning and type of the data are not indicated by the formatting.
o Examples: Text documents, email messages, multimedia files, HTML web pages.
o HTML as Unstructured Data:
  ▪ HTML documents mix content with formatting instructions.
  ▪ Formatting tags (for example, those for headings, paragraphs, or bold text) control how content is displayed but do not describe the meaning of the data.
  ▪ Analyzing and interpreting such documents programmatically is difficult because they lack schema information.

Representation in XML Databases
Structured Data: In XML, structured data can be stored with a defined schema (DTD or XML Schema).
Semi-Structured Data: XML naturally handles semi-structured data, allowing for flexible attributes and hierarchical structures without a predefined schema.
Unstructured Data: XML can encapsulate unstructured content, adding tags and metadata to give some structure to otherwise unformatted data.

XML Hierarchical (Tree) Data Model
The XML Hierarchical (Tree) Data Model represents the structure of an XML document as a tree, where each element is a node and elements nest within one another to form parent-child relationships. This hierarchical structure reflects the nature of the data in XML, making it flexible for both data and document representation.

Basic Objects
Elements: The primary building blocks in XML. Elements are delimited by tags and can contain text, other elements, or a combination of both. They represent data in a hierarchical fashion.
Attributes: Attributes provide additional information about elements. Unlike elements, they are simple name-value pairs defined within the start tag of an element. Attributes are used sparingly and are often discouraged for holding critical data.

Document Types: XML documents can be classified into three main types:
1. Data-centric XML documents: These documents usually follow a predefined schema and contain structured data (e.g., transactional data or database exports).
2. Document-centric XML documents: These contain large amounts of text with minimal structured data (e.g., articles, books).
3. Hybrid XML documents: A combination of structured data and unstructured textual data, with or without a schema.

Key Characteristics:
1. Data-centric XML:
o Used for highly structured data with many small items.
o Typically follows a schema that defines tag names and structure.
o Ideal for exchanging data between systems, such as web services.
2. Document-centric XML:
o Contains loosely defined tags and significant text content.
o The tag structure is flexible, and the document may not have a strict schema.
o Example: news articles or books.
3. Hybrid XML:
o Combines both structured data and unstructured text.
o May or may not have a predefined schema.

Example XML Hierarchical Tree
ProductX  1  Bellaire  5
    123456789  Smith  32.5
    453453453  Joyce  20.0
ProductY  2  Sugarland  5
    123456789  7.5
    453453453  20.0
    333445555  10.0
In this example, the root element contains multiple top-level elements (one for ProductX and one for ProductY). Each of these holds simple child elements for its own values, such as a name, a number, and a location, together with repeated nested elements for the entries listed beneath it. This hierarchical tree structure represents the relationships between the different pieces of data.
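A tree like the one in the example can be traversed with Python's standard `xml.etree.ElementTree` module. In the sketch below, the data values come from the example, but every tag name (`Projects`, `Project`, `Name`, `Number`, `Location`, `Dept_no`, `Worker`, `Ssn`, `Last_name`, `First_name`, `Hours`) is an assumption made for illustration, as is the `hours_per_project` function.

```python
import xml.etree.ElementTree as ET

# Tag names are assumed for illustration; only the values come from the example.
PROJECTS_XML = """
<Projects>
  <Project>
    <Name>ProductX</Name>
    <Number>1</Number>
    <Location>Bellaire</Location>
    <Dept_no>5</Dept_no>
    <Worker><Ssn>123456789</Ssn><Last_name>Smith</Last_name><Hours>32.5</Hours></Worker>
    <Worker><Ssn>453453453</Ssn><First_name>Joyce</First_name><Hours>20.0</Hours></Worker>
  </Project>
  <Project>
    <Name>ProductY</Name>
    <Number>2</Number>
    <Location>Sugarland</Location>
    <Dept_no>5</Dept_no>
    <Worker><Ssn>123456789</Ssn><Hours>7.5</Hours></Worker>
    <Worker><Ssn>453453453</Ssn><Hours>20.0</Hours></Worker>
    <Worker><Ssn>333445555</Ssn><Hours>10.0</Hours></Worker>
  </Project>
</Projects>
"""


def hours_per_project(xml_text):
    """Walk root -> Project -> Worker and total the Hours leaf values."""
    root = ET.fromstring(xml_text.strip())
    totals = {}
    for project in root.findall("Project"):        # children of the root
        name = project.findtext("Name")            # simple child element
        totals[name] = sum(
            float(worker.findtext("Hours"))        # leaf value inside each Worker
            for worker in project.findall("Worker")
        )
    return totals
```

The traversal follows the parent-child relationships directly, and it tolerates workers that carry different child elements, since it only reads the `Hours` leaf that every worker has.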
Key Differences Between XML and HTML:
In HTML, tags define how text is displayed, whereas in XML, tag names describe the meaning of the data.
XML is extensible: new tag names can be defined in a schema document, whereas HTML has a fixed set of predefined tags.

Schemaless XML
XML documents without a predefined schema are known as schemaless XML documents. They do not have a fixed tree structure and are more flexible in their use of tags and element names.

NoSQL Databases

1. Introduction to Impedance Mismatch
Definition: Impedance mismatch is the discrepancy between the structured data model of relational databases and the more flexible in-memory data structures used by applications.
Relational Model: In relational databases, data is organized in tables (relations) composed of rows (tuples). A tuple is defined as a set of name-value pairs.
Limitations: Unlike in-memory structures, relational tuples are limited to simple values and cannot contain nested records or lists. Complex in-memory data must therefore be translated into a relational format for storage, which frustrates developers.

2. Historical Context
The 1990s: The rise of object-oriented programming led to attempts to create object-oriented databases that would eliminate the impedance mismatch. However, relational databases prevailed because of their integration capabilities and the standardization of SQL.
Object-Relational Mapping (ORM): Tools like Hibernate emerged to address the impedance mismatch, simplifying the translation but not fully resolving performance issues.

3. Rise of Clusters
Scaling Challenges: The 2000s saw a dramatic increase in the data generated by large web properties, leading to scalability problems for traditional relational databases. The options for scaling were:
o Scaling Up: Using larger machines, which is expensive and has hard limits.
o Scaling Out: Using clusters of smaller, cost-effective machines, which are more resilient to failures.
Relational Database Limitations: Traditional relational databases struggle in clustered environments because of shared-disk dependencies and the challenges of sharding (distributing data across multiple servers).

4. The Emergence of NoSQL
Definition of NoSQL: Originally, the term referred to an open-source relational database that did not use SQL. It has since come to describe a variety of distributed, non-relational databases developed primarily in the 21st century.
Characteristics:
o Non-SQL Based: Most NoSQL databases do not use SQL as their primary query language.
o Open Source: Many are open-source projects, encouraging community involvement and development.
o Cluster-Oriented: Designed to work efficiently in clustered environments, accommodating large datasets and high user loads.
o Schema-Free: Fields can be added without a predefined structure, which is beneficial for handling varied data types.

5. The Notion of Polyglot Persistence
Concept: Polyglot persistence is the idea of using different database technologies for different data storage needs. This approach recognizes the limitations of traditional relational databases and embraces the advantages of NoSQL solutions.
Application vs. Integration: NoSQL databases are typically better suited to serve as application databases than as integration databases, promoting service encapsulation for improved data management.

Common Characteristics of NoSQL Databases

1. Non-SQL Query Languages
Alternative Query Methods: While NoSQL databases do not use traditional SQL, many offer their own query languages, which often draw inspiration from SQL to ease the learning curve.
o Cassandra Query Language (CQL): Resembles SQL in syntax, making it more approachable for those familiar with relational databases.
o MongoDB Query Language: Uses a JSON-like syntax for queries, in line with its document-oriented data model.

2. Open-Source Nature
Community-Driven Development: Most NoSQL databases are open source, fostering a collaborative environment where developers contribute to the codebase, share innovations, and rapidly iterate on features.
Cost Benefits: Open-source databases eliminate licensing fees, making them accessible to startups and enterprises alike.

3. Cluster-Friendly Architecture
Designed for Distributed Systems: NoSQL databases are built to operate across clusters of machines, so that data is distributed, replicated, and accessible even when individual nodes fail.
Scalability: Nodes can be added to or removed from a cluster, allowing NoSQL databases to scale horizontally to meet growing data and traffic demands.

4. Flexible Consistency and Distribution Models
Eventual Consistency: Many NoSQL databases prioritize availability and partition tolerance over strict consistency, allowing for eventual consistency, where data changes propagate across the system over time.
Configurable Consistency Levels: Some NoSQL databases offer tunable consistency, enabling developers to balance consistency against performance according to application needs.
Data Distribution Strategies: Techniques like sharding (partitioning data across multiple nodes) and replication (duplicating data across nodes) are fundamental to NoSQL databases, ensuring data availability and fault tolerance.

5. Schemaless Design
Dynamic Schemas: NoSQL databases do not require predefined schemas, so new fields and structures can be added on the fly without altering the entire database structure.
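The idea of a dynamic schema can be sketched in a few lines of Python, using a plain list of dictionaries as a stand-in for a schemaless collection. This is not the API of any NoSQL product; `add_document` and `find` are hypothetical helpers, and the sample records are made up.

```python
def add_document(collection, **fields):
    """Store a record as-is; nothing checks it against a predefined schema."""
    collection.append(fields)


def find(collection, **criteria):
    """Return records matching every criterion.

    A record that lacks a queried field simply does not match.
    """
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


users = []
add_document(users, name="Ann", email="ann@example.com")
# A later record introduces new fields without any migration step.
add_document(users, name="Raj", phone="555-0100", tags=["admin"])
```

Here `find(users, phone="555-0100")` returns only the second record, and the first record is unaffected by the fields it does not have. In a relational table, adding `phone` and `tags` would have required altering the table definition first.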
Adaptability: This flexibility is particularly beneficial for applications dealing with non-uniform data or evolving data models, reducing the overhead associated with schema migrations.

6. Handling Complex Relationships
Graph Databases: Specifically designed to manage intricate relationships between data entities, graph databases use nodes and edges to represent and traverse connections efficiently.
Relation Mapping: Unlike relational databases, which rely on join operations, graph databases excel in scenarios where relationships are first-class citizens, such as social networks, recommendation systems, and network topologies.

7. Performance Optimization
Indexing and Query Optimization: NoSQL databases employ various indexing strategies tailored to their data models to enhance query performance.
In-Memory Processing: Some NoSQL databases leverage in-memory processing.

NoSQL Data Models
NoSQL databases offer a variety of data models tailored to different application needs, providing flexibility and scalability beyond traditional relational databases. This guide covers key NoSQL data models, their structures, and suitable use cases, focusing on aggregate data models, complex relationship structures, key-value models, document data models, column-family stores, and graph data stores.
Example: Suppose we are implementing an e-commerce portal that stores data about orders, users, payments, a product catalog, and a set of addresses. In a relational environment, each of these would be modelled as a separate table, with relationships managed through foreign keys.

1. Introduction to Aggregate Data Models

1.1. What is an Aggregate?
Definition: An aggregate is a collection of related objects treated as a single unit.
Purpose: Aggregates simplify data management by grouping related data, making it easier to handle replication and sharding in distributed systems.

1.2. Benefits of Aggregates
Replication and Sharding: Aggregates naturally form units that can be replicated across multiple nodes or sharded (divided) to distribute the load.
Cluster Operations: Aggregates facilitate efficient operations within clusters, enhancing scalability and reliability.
Ease of Use for Developers: Programmers interact with aggregates as single entities, which simplifies data manipulation and reduces complexity in application code.
The same e-commerce model can also be represented in an aggregate-oriented environment, as the practical example in Section 2.3 shows.

2. Data Models for Complex Relationship Structures

2.1. Handling Related Data
Access Patterns: Aggregates help by grouping data that is frequently accessed together, reducing the need for complex joins.
Example Scenario:
o Client and Orders:
  ▪ Accessing History: Applications may need to view a client's order history alongside the client's details.
  ▪ Different Needs: Some applications might treat a client and its orders as a single aggregate, while others might handle orders as separate aggregates.

2.2. Mechanism of Handling Updates
Atomicity in Aggregates:
o All modifications within an aggregate are atomic: they either all succeed or all fail together.
o This mirrors the ACID (Atomicity, Consistency, Isolation, Durability) properties of relational databases but applies only within the scope of a single aggregate.
Cluster-Friendly Design:
o NoSQL databases are designed to operate on clusters, enabling them to handle large datasets with simple relationships efficiently.
o Aggregate-oriented models are optimized for environments where data is distributed across multiple nodes.

2.3. Practical Example: E-Commerce Portal
Relational vs. Aggregate-Oriented Models:
o Relational Model:
  ▪ Entities: Orders, Users, Payments, Product Catalog, Addresses.
  ▪ Relationships: Managed through foreign keys and joins.
o Aggregate-Oriented Model:
  ▪ Main Aggregates: Customer and Order.
  ▪ Structure:
    ▪ Customer: Includes billing addresses.
    ▪ Order: Includes order items, shipping address, and payments.
    ▪ Payment: Links back to a billing address.
  ▪ Duplication: Addresses are duplicated within aggregates to avoid complex joins, which is suitable when addresses rarely change.

3. Reasons for Using Aggregate Data Models

3.1. Facilitates Clustering
Data Retrieval Efficiency: Aggregates minimize the number of nodes accessed during data retrieval, enhancing performance.
Simplified Sharding: Because aggregates are treated as single units, sharding becomes more straightforward and data can be distributed evenly across the cluster.

3.2. Transaction Management
Atomic Operations: Aggregates allow for atomic transactions within their scope, ensuring data consistency without the overhead of managing complex transactions across multiple tables.
Comparison with Relational Databases:
o Relational Databases: Can handle complex transactions across multiple tables using ACID properties.
o NoSQL Aggregates: Simplify transactions by limiting them to single aggregates, reducing complexity in distributed environments.

4. Key-Value Model and Suitable Use Cases

4.1. Key-Value Model Overview
Structure: Data is stored as simple key-value pairs.
Access: Retrieval is based solely on the key, making lookups by key highly efficient.
Flexibility: The value can be any data structure and is not limited to domain objects.

4.2. Features of Key-Value Stores
Indexing: Some databases can create indexes based on the aggregate content, enabling partial retrieval of data.
Data Structures: Support for various data types such as lists, sets, and hashes, and for operations like range queries, differences, unions, and intersections.

4.3. Examples of Key-Value Databases
Redis: Supports multiple data structures and operations, making it versatile beyond a simple key-value store.
Riak: Organizes keys into separate buckets for better data segmentation.
Oracle NoSQL: Provides robust key-value storage with enterprise features.
4.4. Use Cases for Key-Value Stores
Session Management:
o Scenario: Web applications require storing unique session data.
o Advantages:
  ▪ Efficiency: Single PUT and GET operations manage entire session data.
  ▪ Performance: Fast access and manipulation due to simple key-based retrieval.
Shopping Carts:
o Scenario: E-commerce platforms store user shopping cart data.
o Advantages:
  ▪ Simplicity: Each cart is a single key-value pair, easy to update and retrieve.
Caching:
o Scenario: Frequently accessed data is stored in key-value stores to reduce latency.
o Advantages:
  ▪ Speed: Rapid data retrieval enhances application performance.

5. Document Data Model and Suitable Use Cases
5.1. Document Data Model Overview
Structure: Stores data as documents (e.g., XML, JSON, BSON) with an embedded ID field.
Flexibility: Supports varying structures within the same collection, allowing for heterogeneous data.
5.2. Features of Document Databases
Self-Describing: Each document contains all necessary information, reducing the need for additional schema definitions.
Hierarchical Structures: Capable of representing nested and complex data relationships within a single document.
Schema Flexibility: Different documents within the same collection can have different schemas, accommodating evolving data requirements.
5.3. Comparison with Key-Value Stores
Similarity: Both use keys for document identification.
Difference: Document databases allow querying based on the content within the documents, not just the keys.
5.4. Use Cases for Document Databases
Event Logging:
o Scenario: Applications need to log diverse types of events with varying attributes.
o Advantages:
  ▪ Central Store: Acts as a unified repository for all event data.
  ▪ Dynamic Data Handling: Easily accommodates different event structures without schema changes.
  ▪ Efficient Sharding: Events can be sharded by fields such as application name or event type, keeping storage and retrieval organized.
Content Management Systems (CMS):
o Scenario: Managing diverse content types like articles, images, and user comments.
o Advantages:
  ▪ Flexible Storage: Different content types can be stored with their unique structures within the same database.
  ▪ Ease of Integration: Simplifies the retrieval and manipulation of complex content hierarchies.

6. Column-Family Stores and Suitable Use Cases
6.1. Column-Family Model Overview
Structure: Organizes data into column families, which are collections of rows and dynamic columns.
Storage Style: Similar to a large table where each row can have a different set of columns.
6.2. Features of Column-Family Stores
Storage Efficiency: Groups columns together to optimize read and write operations, especially when writes are infrequent and reads are frequent.
Flexibility: Each row can have a different number of columns, allowing for varied data storage without predefined schemas.
6.3. Examples of Column-Family Databases
Cassandra: Known for its scalability and high availability.
HBase: Built on top of Hadoop for big data applications.
ScyllaDB: Designed for high performance with low latency.
6.4. Use Cases for Column-Family Stores
Content Management Systems and Blogging Platforms:
o Scenario: Recording blog entries with features like tags, classifications, and comments.
o Advantages:
  ▪ Efficient Data Organization: Separate column families for users, blogs, and comments allow for organized and scalable data management.
  ▪ Flexible Relationships: Easily manage relationships within the same row or across different keyspaces without complex joins.
Time-Series Data:
o Scenario: Storing and querying time-stamped data such as logs, metrics, or sensor data.
o Advantages:
  ▪ Optimized Reads: Efficient retrieval of data across specific time ranges due to column grouping.
  ▪ Scalability: Handles large volumes of time-series data with ease.

7. Graph Data Stores and Suitable Use Cases
7.1. Graph Data Model Overview
Structure: Comprises nodes (entities) connected by edges (relationships), forming a graph structure.
Specialization: Designed to efficiently handle complex interconnections and relationships between data points.
7.2. Features of Graph Databases
Efficient Traversal: Quickly navigates through relationships without the need for expensive join operations.
Persisted Relationships: Relationships are stored as first-class entities, enabling faster query performance for connected data.
Multiple Relationship Types: Supports diverse types of relationships, enhancing the ability to model real-world scenarios.
7.3. Examples of Graph Databases
Neo4j: A leading graph database known for its powerful query language, Cypher.
Amazon Neptune: Managed graph database service supporting both property graph and RDF models.
OrientDB: Multi-model database supporting graph, document, key-value, and object models.
7.4. Use Cases for Graph Databases
Social Networks:
o Scenario: Modeling and querying relationships between users, such as friendships, followers, and interactions.
o Advantages:
  ▪ Complex Queries: Easily execute queries like finding mutual friends or recommending connections.
  ▪ Real-Time Updates: Efficiently handle dynamic and evolving relationships.
Recommendation Systems:
o Scenario: Suggesting products, content, or connections based on user preferences and behaviors.
o Advantages:
  ▪ Relationship-Based Logic: Utilize interconnected data to generate accurate and personalized recommendations.
  ▪ Scalability: Manage large graphs representing vast user interactions and preferences.
Fraud Detection:
o Scenario: Identifying suspicious patterns and relationships in financial transactions.
o Advantages:
  ▪ Pattern Recognition: Detect complex fraud patterns by analyzing connections between entities.
  ▪ Real-Time Analysis: Quickly identify and respond to fraudulent activities as they occur.

8. Summary of NoSQL Data Models and Their Use Cases
8.1. Aggregate Data Models
Strengths: Simplifies replication and sharding, enhances developer productivity.
Use Cases: E-commerce platforms, applications requiring atomic transactions within data units.
8.2. Key-Value Models
Strengths: High efficiency for specific lookups, versatile data structures.
Use Cases: Session management, shopping carts, caching mechanisms.
8.3. Document Data Models
Strengths: Flexible schemas, self-describing documents, hierarchical data representation.
Use Cases: Event logging, content management systems, dynamic data storage.
8.4. Column-Family Stores
Strengths: Optimized for large-scale, infrequent writes with multiple reads, flexible column storage.
Use Cases: Blogging platforms, content management, time-series data.
8.5. Graph Data Stores
Strengths: Efficient handling of complex relationships, fast traversal of interconnected data.
Use Cases: Social networks, recommendation systems, fraud detection.

9. Choosing the Right NoSQL Data Model
9.1. Assess Data Characteristics
Data Structure: Determine if your data is best represented as key-value pairs, documents, columns, or graphs.
Access Patterns: Understand how your application will query and manipulate the data.
9.2. Consider Scalability and Performance Needs
Read vs. Write Operations: Choose a model optimized for your primary workload (e.g., key-value for fast lookups, column-family for read-heavy operations).
Data Volume: Ensure the chosen model can handle the expected data size and growth.
9.3. Evaluate Relationship Complexity
Simple vs. Complex Relationships: Use key-value or document models for simpler associations, and graph models for intricate interconnections.
9.4. Flexibility and Schema Requirements
Dynamic Schemas: Opt for document or key-value models if your data schema changes frequently.
Schema Consistency: Consider column-family or graph models if keeping a more consistent structure across the data is crucial.
9.5. Integration with Existing Systems
Ecosystem Compatibility: Ensure the NoSQL database integrates well with your existing technology stack and workflows.

10. Key Takeaways
NoSQL Diversity: NoSQL encompasses various data models, each suited to specific types of data and application requirements.
Flexibility and Scalability: NoSQL databases provide the flexibility to handle diverse data structures and the scalability to manage large volumes of data across distributed systems.
Optimized Use Cases: Selecting the appropriate data model based on your application's needs can lead to significant performance and efficiency gains.
Evolving Ecosystem: As data requirements continue to grow and diversify, NoSQL databases offer essential alternatives and complements to traditional relational databases, enabling more robust and scalable application architectures.

Object Databases (ODB) vs. Relational Databases (RDB)
Handling Relationships:
Object Databases (ODB):
o In ODBs, relationships between objects are managed using reference attributes.
  These references allow one object to hold a reference to another, akin to how real-world objects relate to each other.
o These references can be single-valued (representing a one-to-one relationship, or the "one" end of a many-to-one relationship) or a collection of references (representing one-to-many or many-to-many relationships). ODBs allow flexibility in handling these relationships naturally through object structures.
o Mapping relationships, particularly binary relationships (between two entities), requires the database designer to specify which object or entity will own the relationship attributes. This design decision ensures that the reference points in the intended direction and stays consistent.
Relational Databases (RDB):
o In RDBs, relationships are represented through attributes with matching values across different tables. The foreign key is the central mechanism for establishing relationships between records (tuples) across tables.
o A foreign key holds a single value per row, so RDBs support single-valued references directly. However, many-to-many (M-N) relationships cannot be modeled with simple foreign key references. Instead, a separate junction table (or linking table) is created to represent these relationships. This table holds foreign keys from both related tables to bridge the relationship.
Handling Inheritance:
Object Databases (ODB):
o In ODBs, inheritance is a fundamental part of the database structure. Just as objects in programming languages can inherit properties and behaviors from their parent classes, ODBs natively support this feature, allowing data models to reflect the inheritance structure.
o This makes ODBs highly suitable for applications where object-oriented programming principles are important and where maintaining an inheritance hierarchy is key.
Relational Databases (RDB):
o In contrast, RDBs do not natively support inheritance.
  The relational model treats all entities as flat tables, so inheritance must be simulated using strategies such as table per class hierarchy, table per subclass, or table per concrete class. This adds complexity to the schema design and may impact performance and maintainability.
Specifying Operations:
Object Databases (ODB):
o In ODBs, operations (such as methods or functions) are typically designed as part of the class specification during the design phase. This ensures that behaviors associated with the objects are encapsulated within the class definition.
o Encapsulation is a key feature of object-oriented databases, where objects not only store data but also have methods to manipulate this data. This allows operations to be defined and tightly coupled with the data they operate on.
Relational Databases (RDB):
o RDBs do not require operations to be specified during the design phase. They are designed around data storage and retrieval, with no need for methods or operations to be defined beforehand.
o Instead, relational databases support ad-hoc queries: users can construct and execute SQL queries on the fly to retrieve, update, or manipulate data without predefined operations. In ODBs, by contrast, this kind of ad-hoc querying is problematic because it breaks the principle of encapsulation, under which direct access to object data is restricted.

XML Databases vs. Relational Databases
Relationships:
XML Databases:
o XML databases follow a hierarchical tree structure. Data is organized in parent-child relationships using simple and composite elements (nodes) and attributes to represent relationships between these elements. The structure allows XML to naturally represent nested and complex data, where elements can contain other elements.
o Relationships are implicit in the tree structure, where each element is part of a broader hierarchy, and relationships are derived from the nesting.
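The idea that relationships are implied by nesting can be illustrated with a small sketch. It uses Python's standard `xml.etree.ElementTree` module; the customer document and all of its element and attribute names are hypothetical and chosen only for illustration, and a real XML database would offer its own query interface.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are illustrative only.
DOC = """
<customer id="c1">
  <name>John</name>
  <orders>
    <order id="o1"><item>Keyboard</item><item>Mouse</item></order>
    <order id="o2"><item>Monitor</item></order>
  </orders>
</customer>
"""

def describe(element, depth=0):
    """Return one line per element, indented by its depth in the tree."""
    lines = ["  " * depth + element.tag]
    for child in element:  # children are visited in document order
        lines.extend(describe(child, depth + 1))
    return lines

root = ET.fromstring(DOC)
outline = describe(root)

# The "order o1 contains these items" relationship is never declared with
# keys; it follows purely from which <item> elements are nested in which <order>.
items_o1 = [item.text for item in root.findall("./orders/order[@id='o1']/item")]

print("\n".join(outline))
print(items_o1)
```

No key or join is involved in finding the items of an order: the path through the tree is the relationship.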
Relational Databases:
o Relational databases manage relationships using tables. One table often acts as the parent (containing primary key identifiers), and another table acts as the dependent or child (referencing the parent table's keys through foreign keys).
o Relationships like one-to-many or many-to-many are explicitly defined using primary and foreign keys across multiple tables, providing a formal structure for relationships between data.
Self-Describing Data:
XML Databases:
o XML is considered self-describing because the tags not only define the structure but also provide semantic meaning to the data. For instance, <name>John</name> tells us that "John" is a name without needing a schema. This flexibility allows XML to store heterogeneous data types within a single document. A document can contain text, numbers, dates, and other data types mixed within the same structure.
Relational Databases:
o In contrast, relational databases follow a more rigid structure. Data in a single column must be of the same data type (e.g., all values in a "salary" column must be numbers). The column definition in a table schema describes and constrains what kind of data can be stored, enforcing uniformity within each column.
Inherent Ordering:
XML Databases:
o XML documents maintain the order of data as it is represented by the sequence of tags in the document. This means the order of elements in the XML tree is significant and preserved during processing. However, XML itself does not offer additional sorting mechanisms beyond the inherent order of the tags.
Relational Databases:
o In relational databases, the order of data rows is not fixed unless explicitly stated. Row order inside a table is determined by the internal storage mechanism, and data is typically retrieved in no specific order unless an ORDER BY clause is used in a SQL query.
  This offers more flexibility in querying data with custom sorting options based on the user's requirements.
Data Modeling Difference:
NoSQL Databases:
o In NoSQL databases, the data model is centered around aggregate orientation, where data is stored in complex structures, often containing nested lists and objects. NoSQL models allow greater flexibility, letting users store a variety of data types in a single record or document.
o The concept of aggregate comes from Domain-Driven Design (DDD) and refers to a collection of related objects treated as a unit. This allows for the easy management of complex, hierarchical data structures.
o Storage models focus on how data is stored and manipulated within the database. In NoSQL databases, data is stored in a way that optimizes access patterns and is often denormalized for better performance.
Relational Databases:
o Relational databases follow a set-based model, where data is structured into tables (relations) with rows and columns. Each table holds records, and the relationships between data points are maintained through foreign keys.
o The data model is schema-based, meaning the structure of the data must be defined before storing any data, enforcing strict rules about data types and relationships.
Modeling for Data Access:
NoSQL Databases:
o NoSQL models are designed around data access patterns, with a focus on optimizing queries. Data is often denormalized at the time of writing to enable faster retrieval during querying.
o NoSQL databases allow flexibility in retrieving specific pieces of data quickly, especially in column-family and key-value stores where frequently accessed data is easily retrievable based on keys.
o Graph databases are particularly useful for traversing complex relationships.
  For example, if we need to find all customers who purchased a product, a query can quickly traverse relationships between nodes (e.g., "customers" and "products") without complex joins. The directionality and types of relationships play a significant role in query efficiency.
Relational Databases:
o Data in relational databases is accessed using SQL queries. Relational models typically involve normalized data, which reduces redundancy but can make complex queries slower due to the need for joining multiple tables.
o The ORDER BY clause is used to retrieve ordered data, while indexes can help speed up data access.
Aggregate-Oriented vs. Aggregate-Ignorant:
Aggregate-Oriented (NoSQL):
o NoSQL databases often follow an aggregate-oriented approach, where related data is grouped into aggregates for faster access. Relationships between aggregates (inter-aggregate relationships) are harder to handle than relationships within an aggregate (intra-aggregate relationships).
o Schemaless databases like document stores allow users to add fields dynamically without rigid schemas. However, an implicit schema emerges, which users need to understand when working with the data.
o Materialized views in NoSQL are typically precomputed query results that help optimize read performance. Map-reduce computations are used to aggregate data in certain cases.
Aggregate-Ignorant (Relational):
o Relational databases do not rely on the concept of aggregates. Instead, they offer ad-hoc querying and the flexibility to access data in various ways using views and joins. Each piece of data is accessed individually through foreign keys and joins.
Schemalessness in NoSQL:
NoSQL Databases:
o Schemalessness allows for flexible data models, ideal for situations where the structure of the data might change or be non-uniform.
  Data can be stored without a predefined schema, allowing for key-value pairs or documents with different structures.
o Column-family databases offer further flexibility by allowing columns to be added dynamically, making it easier to work with evolving data requirements.
Materialized Views:
Relational Databases:
o Views are virtual tables created by queries that select data from one or more base tables. These views are not physically stored but are generated dynamically. Users can access data from these views without knowing the source tables.
o Materialized views in relational databases are precomputed and stored results that provide faster query results for read-heavy operations.
NoSQL Databases:
o NoSQL does not use traditional views. However, materialized views in NoSQL databases refer to precomputed query results that are cached for performance. These views improve read efficiency by avoiding repeated calculations for frequently accessed data.
o For example, in a document store, a materialized view could allow querying an order summary without retrieving the entire order document. This enables efficient access to relevant data.
o In column-family databases, multiple column families can be used to maintain materialized views, ensuring that updates are atomic and consistent across all views.
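The order-summary example can be sketched in a few lines. This is a minimal in-memory illustration, not how any particular NoSQL product implements materialized views: the `orders` and `order_summary` dictionaries stand in for a document collection and its precomputed view, and the function and field names are hypothetical.

```python
from collections import defaultdict

# In-memory stand-ins (assumptions for illustration only).
orders = {}  # order_id -> full order document
order_summary = defaultdict(lambda: {"order_count": 0, "total_spent": 0.0})  # customer_id -> view row

def save_order(order):
    """Store the full order and update the precomputed summary in the same step.

    Simplification: each order is assumed to be saved once; re-saving an
    existing order id would count it twice in the summary.
    """
    orders[order["id"]] = order
    total = sum(line["price"] * line["qty"] for line in order["items"])
    summary = order_summary[order["customer_id"]]
    summary["order_count"] += 1
    summary["total_spent"] += total

save_order({"id": "o1", "customer_id": "c1",
            "items": [{"sku": "kb", "price": 40.0, "qty": 1},
                      {"sku": "ms", "price": 15.0, "qty": 2}]})
save_order({"id": "o2", "customer_id": "c1",
            "items": [{"sku": "mon", "price": 120.0, "qty": 1}]})

# Reading the summary does not touch the full order documents.
print(order_summary["c1"])
```

The view here is updated eagerly at write time. A real system might instead refresh it in batches (for example with a map-reduce job), trading freshness for cheaper writes.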
