Key Vocabulary Extraction PDF - Data Management

Summary

This document excerpts key vocabulary and key points related to data management. It covers four topic areas: data management and relational database management systems (including ACID transactions and SQL), data warehouses and OLAP, NoSQL systems and distributed consistency (BASE, CAP theorem), and vector databases for similarity search. For each topic it lists key vocabulary, key points, and important data highlights.

Full Transcript


Key Vocabulary Extraction

ACID Transactions: A set of properties (Atomicity, Consistency, Isolation, Durability) that guarantee reliable processing of database transactions.
Batch Data: Data that is processed in groups, rather than individually, often collected over time.
Data Analysis: The process of extracting insights from data, including recognizing patterns and interpreting various types of information.
Data Independence: The ability to modify data storage structures without impacting the application programs that rely on the data.
Data Literacy: The competence to read, interpret, create, and communicate data effectively.
Implicit Information: Knowledge that can be inferred from data analysis, rather than explicitly stated.
Relational Database Management Systems (RelDBMS): Systems for managing data organized in relational schemas (tables) that support efficient data handling and querying.
Schema: An organization or structure of data that defines how data is stored in a database.
Static Block Data: Fixed data stored in a constant form, typically seen in databases.
Streaming Data: Data that is continuously generated and updated, often processed in real-time.
Structural Information: Information regarding the organization and arrangement of data within a dataset or database.
Type Safety: Ensures that values match their data types to avoid manipulation errors.

Data Management and Analysis
Goals of Data Management: Focuses on the efficient handling, transformation, and storage of large datasets.
Difference with Data Analysis: Data Management is about data storage and handling, while Data Analysis seeks to extract insights and patterns from data.

Data Types and Operations
Static vs. Streaming Data: Understanding the distinction between fixed data and continuously updated data.
Batch Processing: The methodology of processing data in bulk, rather than one record at a time.

Relational Database Systems
ACID Properties: Fundamental principles ensuring transaction reliability in databases.
Tables and Relationships: Core components of the relational model, structured as rows (tuples) and columns (attributes).
Entity-Relationship Diagrams (ERD): Visualization tools that depict entities (tables), their attributes, and relationships.

Historical Perspectives
Evolution of Database Models: Historical advancements from the Hierarchical Database Model to the Relational Database Model, highlighting key innovations and structures.

Data Privacy Considerations
Importance of Data Privacy: The necessity of understanding ethical and legal repercussions in data handling, especially given the prevalence of sensitive data.

Important Data Highlighting

ACID Properties:
  Atomicity: Transactions are all-or-nothing.
  Consistency: Transactions must transition data from one valid state to another.
  Isolation: Transactions operate independently from one another.
  Durability: Once a transaction is committed, it remains so, even in case of system failure.
Typical Scale for Read/Write Operations: Operations occur on data scales typically ranging from MBytes to GBytes.
SQL Purpose: SQL is the primary language for managing and querying RelDBMS, enabling operations like data retrieval and manipulation.
Data Types in Relational Tables: Common types include int, char, varchar, date, and decimal.

Theoretical Foundations
First Normal Form (1NF): Ensures that all columns contain atomic values, promoting data integrity by avoiding redundancy.
Foreign Keys: Link tables within the relational model, enabling complex data relationships.

Methodologies
Data Management Practices: Involves policies and techniques for ensuring data consistency, availability, and fault tolerance.

Challenges in Pre-Big-Data Era
Limited Data Volume Handling: Struggles with integrating diverse data sources and slow processing capabilities due to older technologies.
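To make the relational terms above concrete (typed columns, a foreign key, a transaction commit, and SQL as the query language), here is a minimal sketch using Python's built-in sqlite3 module. The customer/orders tables and their columns are illustrative assumptions made for this summary, not taken from the source.

    # Minimal relational-model sketch: typed columns, a foreign key, an SQL join.
    import sqlite3

    conn = sqlite3.connect(":memory:")          # throwaway in-memory database
    conn.execute("PRAGMA foreign_keys = ON")    # enforce referential integrity

    conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        VARCHAR(100) NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),  -- foreign key
        order_date  DATE,
        total       DECIMAL(10, 2)
    );
    """)

    conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
    conn.execute("INSERT INTO orders VALUES (10, 1, '2024-01-15', 99.90)")
    conn.commit()   # transaction boundary: the changes become durable here

    # SQL as the query language: join the two relations via the foreign key.
    rows = conn.execute("""
        SELECT c.name, o.order_date, o.total
        FROM orders o JOIN customer c ON c.customer_id = o.customer_id
    """).fetchall()
    print(rows)     # [('Ada', '2024-01-15', 99.9)]

The REFERENCES clause is what the vocabulary above calls a foreign key: it ties each order row to exactly one customer row and lets the join recombine the two tables.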
Key Vocabulary Extraction

Data Warehouse (DWH): A centralized repository for storing and managing historical data from various sources to support business intelligence and decision-making processes.
Decision Support System (DSS): A type of information system that supports business or organizational decision-making activities, often by providing analysis tools for data.
Executive Information System (EIS): A type of computer-based information system that supports executive decision-making processes with easy access to internal and external information relevant to organizational goals.
Integrations: The process of combining data from different sources into a single, unified view.
Online Analytical Processing (OLAP): A category of software technology that enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access in various ways.
Online Transaction Processing (OLTP): A class of software applications that supports the management of transaction-oriented applications, typically for data entry and retrieval transactions.
Subject-Oriented: Describes a quality of data organized around specific areas of interest in decision-making.
Time Variant: Characteristic of data that changes over time, allowing for historical analysis and trends.
Non-Volatile: Refers to a state where once data is loaded into a data warehouse, it does not change or get deleted, ensuring data integrity for analysis.

Transitioning to Data Warehouses
Historical Context: The evolution of data management systems from the 1960s' operational databases to the integration of data warehouses in the 1990s, driven by cost-effective storage solutions.
Role of DWH: Centralized storage for integrated data from various sources, resolving conflicts for periodic updates.

OLTP vs. OLAP
Operational Systems (OLTP): Focus on transaction processing with short, fast queries. Optimized for high throughput and real-time operations.
Analytical Systems (OLAP): Focus on complex queries over historical data, supporting strategic decision-making, with slower response times being acceptable.

Characteristics of Data Warehouses
Integrated and Consistent Data: The DWH integrates data from multiple operational sources into a consistent format for analysis.
Read-Only Access: Data in the DWH is provided in a static format that is not updated frequently, supporting historical data analysis.
Support for Decision Making: The DWH serves as the basis for DSS and BI applications, enabling deeper analysis via OLAP.

Multidimensional Data Modeling
Dimensions and Hierarchies: Data cubes are utilized to allow analysis from multiple perspectives, such as time and product categories, with hierarchies facilitating detailed examination.
Facts and Measures: Corporate KPIs, like sales and profits, are the facts that are monitored across various dimensions.

Significant Definitions
Data Warehouse Definition: "A Data Warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management decisions." - W. H. Inmon

OLAP Operations
Typical operations in OLAP include complex analytical queries with aggregate functions over large datasets, contrasting with OLTP, which focuses on transaction processing.

Data Cube Representation
A 3D data cube allows for multi-dimensional analysis, where each edge represents a dimension of the data and the cells reflect facts or measures.

Data Warehouse Characteristics
Subject-Oriented: Organized by subject matter or event, aligning data with relevant business metrics.
Integrated Data: Consistency across data sources enables accurate reporting and analytics.
Non-Volatility: Once input, data is immutable, which preserves historical accuracy for analyses.
Time Variability: The ability to track changes over time, which is critical for examining trends.

Performance Comparisons

                         DBMS for OLTP                   DWH for OLAP
    Typical Operations   insert, update, delete, select  select (bulk-insert)
    Transactions         many short transactions         only read transactions
    Data per Operation   few rows                        large volumes (MB/GB scale)
    Database Size        GB                              TB

Theorems and Principles
OLAP systems require architectures separate from OLTP systems so that complex analytical queries can be handled without impacting real-time transaction processing.

Methodologies
The approach to data integration involves extracting data from a variety of sources, transforming it into a uniform format, and loading it into the DWH.
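As a rough illustration of the OLTP vs. OLAP contrast and of facts and dimensions, the following sketch runs both access patterns against a tiny fact table in SQLite. The sales_fact table, its dimensions, and the sample rows are invented for this example and are not from the source.

    # Contrast of OLTP-style and OLAP-style access on a tiny fact table.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE sales_fact (
        sale_id  INTEGER PRIMARY KEY,
        year     INTEGER,      -- time dimension
        product  TEXT,         -- product dimension
        revenue  REAL          -- measure (fact)
    )""")
    db.executemany(
        "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
        [(1, 2023, 'laptop', 1200.0), (2, 2023, 'phone', 800.0),
         (3, 2024, 'laptop', 1300.0), (4, 2024, 'phone', 900.0)],
    )

    # OLTP: a short transaction touching a few rows (point lookup / update).
    db.execute("UPDATE sales_fact SET revenue = 850.0 WHERE sale_id = 2")

    # OLAP: a read-only aggregate over the whole table, sliced by two dimensions.
    cube = db.execute("""
        SELECT year, product, SUM(revenue)
        FROM sales_fact
        GROUP BY year, product   -- one cell of the data cube per (year, product)
    """).fetchall()
    print(cube)

The GROUP BY result corresponds to the cells of a small data cube: each (year, product) pair is one coordinate across the two dimensions, and the summed revenue is the measure stored in that cell.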
Key Vocabulary Extraction

ACID: A set of properties (Atomicity, Consistency, Isolation, Durability) that ensures reliable processing of database transactions. These properties help in maintaining data integrity.
BASE: A model used in NoSQL databases, standing for Basically Available, Soft State, and Eventual Consistency, which emphasizes availability over strict consistency.
CAP Theorem: A principle that states it is impossible for a distributed data store to simultaneously provide all three guarantees: Consistency, Availability, and Partition Tolerance.
Consistency Levels: Various degrees of consistency in distributed systems, indicating how updates and stale data are managed across replicas (e.g., Eventual Consistency, Strong Consistency).
Horizontal Scaling: The process of adding more machines to a system to handle increased load, as opposed to improving a single machine's capacity (Vertical Scaling).
Key-Value Store: A type of NoSQL database that stores data in pairs of keys and values, allowing for quick retrieval based on keys.
NoSQL: Refers to a range of database systems that do not adhere strictly to the relational database model, allowing for the storage of unstructured data.

Key Points Identification

Introduction to NoSQL
The NoSQL movement began in the mid-2000s due to the limitations of traditional Relational Database Management Systems (RDBMS) in horizontally scaling to manage large volumes of unstructured data.
Modern applications and web platforms generate massive datasets that require flexible database solutions beyond conventional RDBMS.

Consistency Models
ACID vs. BASE: While traditional databases emphasize ACID properties, NoSQL databases use BASE, which offers higher availability but relaxes strict consistency.
The application of BASE helps in scenarios where temporary inconsistency is acceptable but availability is critical.

The CAP Theorem
The theorem outlines the trade-offs in distributed database design:
  CP Systems: Prioritize Consistency and Partition Tolerance but may sacrifice Availability.
  AP Systems: Prioritize Availability and Partition Tolerance but may sacrifice Consistency.

NoSQL Data Models
Different NoSQL data models include:
  Key-Value Stores: Simple, efficient storage of data in pairs.
  Document Stores: Store data as documents, often in JSON format, allowing for complex queries.
  Wide Column Stores: Organize data into rows and columns but allow for flexible column families.
  Graph Databases: Focus on relationships and connections between data points.

Benefits and Challenges of NoSQL
Benefits include high throughput, horizontal scalability, and simplified data models.
Challenges may include complex query writing, potential loss of consistency, and complex maintenance of distributed databases.
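A toy sketch of how a key-value store can scale horizontally by sharding: each key is hashed to one of several nodes, which is the core idea behind distributing data and load across machines. The ShardedKVStore class below is a simplification written for this summary (in-memory dicts stand in for real nodes); it is not an actual NoSQL product.

    # Toy sketch of horizontal scaling (sharding) for a key-value store.
    import hashlib

    class ShardedKVStore:
        def __init__(self, n_shards: int = 3):
            # each "node" is just an in-memory dict in this sketch
            self.shards = [{} for _ in range(n_shards)]

        def _shard_for(self, key: str) -> dict:
            # stable hash so the same key always maps to the same node
            h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
            return self.shards[h % len(self.shards)]

        def put(self, key: str, value) -> None:
            self._shard_for(key)[key] = value

        def get(self, key: str, default=None):
            return self._shard_for(key).get(key, default)

    store = ShardedKVStore(n_shards=3)
    store.put("user:1", {"name": "Ada"})
    store.put("user:2", {"name": "Grace"})
    print(store.get("user:1"), [len(s) for s in store.shards])

Real systems typically refine this with consistent hashing so that adding or removing a node moves only a small fraction of the keys.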
Important Data Highlighting

Statistics and Formulas
Horizontal Scaling Techniques:
  Vertical Scaling: Increasing resources for a single node, limited by hardware limitations.
  Horizontal Scaling (Sharding): Distributing data across multiple nodes to improve performance and storage capacity.
CAP Theorem Visualization: Any networked shared-data system can have at most two of the three properties: Consistency, Availability, and Partition Tolerance.

Consistency Levels
Eventual Consistency: Updates are propagated at a later point, ensuring overall system consistency over time.
Strong Consistency: Guarantees that all operations appear to occur instantaneously and atomically across the entire system.

Additional Critical Aspects

Theorems and Principles
CAP Theorem: A cornerstone principle in distributed systems that outlines trade-offs among consistency, availability, and partition tolerance.

Methodologies
Data Horizontal Partitioning: Breakdown of large datasets into smaller, manageable partitions without altering the overall structure.
Data Replication: Maintaining multiple copies of data across nodes to ensure availability and fault tolerance.

Data Structures in Key-Value Stores
Examples of key-value structures include Strings, Hashes, Lists, Sets, and Ordered Sets, each supporting different operations for data retrieval and manipulation.
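The difference between eventual and strong consistency can be sketched with two in-memory replicas and a last-write-wins merge, loosely mirroring the Data Replication and Eventual Consistency entries above. The Replica class, the anti_entropy function, and the last-write-wins rule are illustrative assumptions for this summary, not a description of any specific system.

    # Toy illustration of eventual consistency: a write lands on one replica
    # first; until replication runs, a read from the other replica is stale.
    import time

    class Replica:
        def __init__(self):
            self.data = {}      # key -> (value, timestamp)

        def write(self, key, value):
            self.data[key] = (value, time.time())

        def read(self, key):
            entry = self.data.get(key)
            return entry[0] if entry else None

    def anti_entropy(a: "Replica", b: "Replica"):
        """Merge two replicas with a last-write-wins rule (newest timestamp)."""
        for key in set(a.data) | set(b.data):
            candidates = [r.data[key] for r in (a, b) if key in r.data]
            newest = max(candidates, key=lambda e: e[1])
            a.data[key] = newest
            b.data[key] = newest

    r1, r2 = Replica(), Replica()
    r1.write("cart:42", ["book"])        # write accepted by one node only
    print(r2.read("cart:42"))            # None -> stale read (eventual consistency)
    anti_entropy(r1, r2)                 # background synchronization
    print(r2.read("cart:42"))            # ['book'] -> replicas have converged

A strongly consistent system would instead block or coordinate the write until both replicas agree, so the stale read in the middle could never be observed.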
Key Vocabulary Extraction

ANN (Approximate Nearest Neighbor): A query type that finds approximate closest points rather than exact nearest neighbors, often used to improve search efficiency in vector databases.
Embedding: A mapping of discrete objects into a continuous vector space where semantically similar objects are represented by similar vectors.
HNSW (Hierarchical Navigable Small World): A type of graph-based data structure that efficiently supports approximate nearest neighbor search using a combination of proximity graphs and hierarchical structures.
k-NN (k-Nearest Neighbor): A query type that retrieves the 'k' closest points to a specified query point in the embedding space.
MBR (Minimum Bounding Rectangle): The smallest rectangle that can contain a set of points in multi-dimensional space, used to define regions in R-Trees.
R-Tree: A balanced tree data structure used for spatial access methods, allowing for efficient search operations over multi-dimensional data.
Similarity Query: A type of query that retrieves data points similar to a specified query point based on a distance metric.
Vector Database: A database optimized for storing and querying high-dimensional vector embeddings, often used in applications based on artificial intelligence and machine learning.

Key Points Identification

1. Vector Databases
Motivation: Traditional databases struggle with the high-dimensional vector embeddings common in AI applications. Vector databases offer optimized storage and querying for such data.
Functionality: They facilitate similarity queries in embedding spaces where intuitive meanings of individual features may not apply.

2. Indexing Embedding Spaces
Hierarchical Trees: Multi-dimensional search trees partition data into "pages" based on hierarchy and proximity.
Locality Sensitive Hashing: Used to index high-dimensional spaces efficiently, with trade-offs between accuracy and computation speed.
Learned Indexing: A technique to improve traditional indexing by using machine learning models for optimizing queries.

3. Query Types
Range Query: Retrieves all database objects within a certain distance from a specified query object.
k-Nearest Neighbor Query: Finds the 'k' closest items to a query object in the embedding space.
Ranking Query: Sorts database objects based on their proximity to a query object, allowing for interactive retrieval of results.

4. R-tree Structure
Basic Structure: Consists of inner nodes (directory entries) and leaf nodes (actual data points) for managing multi-dimensional data.
Efficiency: Uses Minimum Bounding Rectangles (MBRs) to reduce unnecessary search paths during queries.

Important Data Highlighting

Distance Functions in Vector Databases:
Euclidean distance as a common metric for similarity: $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
Range Query Definition: $\mathrm{range}(q, \varepsilon) = \{\, o \in DB \mid d(o, q) \le \varepsilon \,\}$
k-NN Query: For a query object $q$ and $k$: the result is a set $NN(q, k) \subseteq DB$ with $|NN(q, k)| = k$ such that $d(o, q) \le d(o', q)$ for all $o \in NN(q, k)$ and $o' \in DB \setminus NN(q, k)$.

Additional Critical Aspects

Theorems and Principles
Filter-Refinement Principle: A methodology to improve database searches by eliminating unnecessary candidates early using inexpensive filters, followed by exact searches on a smaller subset.

Data Structures
R-Trees: Used for managing spatial data and can be adapted for high-dimensional vector data. They optimize space by grouping nearby objects together.
HNSW Graph: A data structure that supports efficient ANN queries by utilizing both short-range and long-range connections.
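The distance and query definitions above can be checked with a brute-force sketch over random embeddings, including a simple filter-refinement step in the spirit of the principle listed under Additional Critical Aspects. NumPy, the toy data, and the bounding-box filter are assumptions made for this example rather than part of the source.

    # Brute-force similarity search over vector embeddings, with a simple
    # filter-refinement step: a cheap bounding-box test discards candidates
    # before the exact Euclidean distance is computed.
    import numpy as np

    rng = np.random.default_rng(0)
    vectors = rng.normal(size=(1000, 8))      # toy "database" of 8-dim embeddings
    q = rng.normal(size=8)                    # query embedding
    eps = 3.0                                 # range-query radius

    # Filter: keep only vectors whose every coordinate lies within eps of q
    # (the bounding box of the query ball). This step can only let false
    # positives through; it never drops a true result.
    candidates = vectors[np.all(np.abs(vectors - q) <= eps, axis=1)]

    # Refinement: exact Euclidean distance on the surviving candidates.
    dists = np.linalg.norm(candidates - q, axis=1)
    range_result = candidates[dists <= eps]           # range(q, eps)

    # k-NN: the k closest vectors by exact distance over the whole database.
    k = 5
    all_dists = np.linalg.norm(vectors - q, axis=1)
    knn_idx = np.argsort(all_dists)[:k]               # indices of the k nearest
    print(len(range_result), knn_idx, all_dists[knn_idx])

Index structures such as R-Trees or HNSW exist to avoid scanning every vector as this sketch does; the filter step here only hints at how an MBR-based index prunes candidates before exact refinement.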
