Data Warehousing Tutorial PDF
Document Details
![WellEducatedSuccess](https://quizgecko.com/images/avatars/avatar-1.webp)
Uploaded by WellEducatedSuccess
Tags
Summary
This document provides an overview of data warehousing, covering topics such as data lakes, data marts, and data management in large enterprises. It also delves into different types of data, data models, and data integration techniques. The document aims to offer readers a comprehensive understanding of the subject.
Full Transcript
TOPIC 1: DATA WAREHOUSING DATA LAKES − Location to dump all sort of raw, structured and unstructured data. DATA WAREHOUSE (DW) − Large collection of organized and clean bus. Data to assist organizations to make decisions. DATA MARTS − Sub – set of DW. Specific to business domain. E...
TOPIC 1: DATA WAREHOUSING DATA LAKES − Location to dump all sort of raw, structured and unstructured data. DATA WAREHOUSE (DW) − Large collection of organized and clean bus. Data to assist organizations to make decisions. DATA MARTS − Sub – set of DW. Specific to business domain. Example: HR, Operations, Finance Problem: Heterogeneous Information Sources Different interfaces Different data representations Duplicate and inconsistent information Problem: Data Management in Large Enterprises Vertical fragmentation of informational systems (vertical stove pipes) - column Result of application (user)-driven development of operational systems -data marts Systems Used in organization Feature Operational Systems Informational Systems Purpose Day-to-day operations Decision-making and analysis Data Type Current, real-time Historical, summarized Processing High-speed transactional Analytical, complex queries Users Operational staff Analysts and decision- makers Examples POS, ERP, inventory systems BI tools, data warehouses Response Time Immediate Less time-sensitive Structure Highly structured and Aggregated and detailed multidimensional Goal: Unified Access to Data Collects and combines information Provides integrated view, uniform user interface Supports sharing The Warehousing Approach Information integrated in advance Stored in wh for direct querying and analysis Advantages of Warehousing Approach High query performance But not necessarily most current information Doesn't interfere with local processing at sources Complex queries at warehouse OLTP at information sources Information copied at warehouse Can modify, annotate, summarize, restructure, etc. Can store historical information Security, no auditing Has caught on in industry Not Either-Or Decision Query-driven approach still better for Rapidly changing information Rapidly changing information sources Truly vast amounts of data from large numbers of sources Clients with unpredictable needs A data ware house is simply complete and consistent store of data obtained from variety of source and made available to end users in a way they can understand and use it is a business context − BARRY DEVLIN IBM CONSULTANT Data Warehouse was coined by Bill Inmon in 1990 "In essence, the query-driven approach is like asking a question and collecting answers directly from the sources each time, instead of relying on pre-written notes or summaries" "A data warehouse is a subject-oriented, Integrated, time-variant, and non-volatile collection of data in support of management's decision making process". "A data warehouse is a copy of transaction data specifically structured for query and analysis." Ralph Kimball (The Data Warehouse Toolkit) "A data warehouse is a repository (data & metadata) that contains integrated, cleansed, and reconciled data from disparate sources for decision support applications, with an emphasis on online analytical processing. Typically the data is multidimensional, historical, non volatile. Stored collection of diverse data A solution to data integration problem Single repository of information Subject-oriented Organized by subject, not by application Used for analysis, data mining, etc. User interface aimed at executive Large volume of data (Gb, Tb) Non-volatile Historical Time attributes are important Updates infrequent May be append-only Examples All transactions at SM or Ayala Malls Complete client histories at insurance firm Stockbroker financial information and portfolios Warehouse is a Specialized DB Standard DB Mostly updates Many small transactions Mb - Gb of data Current snapshot Index/hash on p.k. Raw data Thousands of users (e.g., clerical users) Warehouse Mostly reads Queries are long and complex Gb - Tb of data History Lots of scans Summarized, reconciled data Hundreds of users (e.g., decision-makers, analysts) Data Warehousing Market Insights - 2028 Global Data Warehousing Market Size $2 billion in 1995 $21.18 billion in 2019 Projected to reach $51.18 billion by 2028 (alliedmarketresearch) Snowflake holds the biggest data warehousing share with 3,174 domains. Around 2.5 quintillion bytes of data are created daily Types of Data Business Data - represents meaning Real-time data (ultimate source of all business data) Reconciled data Derived data- new information created from the existing data sets. Metadata - describes meaning Build-time metadata Control metadata Usage metadata Data as a product - intrinsic meaning Produced and stored for its own intrinsic value e.g., the contents of a text-book Data Warehouse Architectures Conceptual View Single-layer Every data element is stored once only Virtual warehouse Two-layer Real-time + derived data, Most commonly used approach in industry today Three-layer Architecture: Conceptual View Transformation of real-time data to derived data really requires two steps. Comparison Architecture Business Use Case Example Single-Tier Small retail shop Basic reporting tracking sales in for small Excel businesses. Two-Tier Marketing Moderate agency analyzing analytics with campaign data. direct data connections. Three-Tier Multinational Advanced retailer analyzing analytics, global sales data. scalability, and BI insights. Data Warehousing: Two Distinct Issues (1) How to get information into warehouse "Data warehousing" (2) What to do with data once it's in warehouse "Warehouse DBMS" Both rich research areas Industry has focused on (2) Issues in Data Warehousing Warehouse Design Extraction Wrappers, monitors (change detectors) Integration Cleansing & merging Warehousing specification & Maintenance Optimizations Miscellaneous (e.g., evolution) Data Extraction Source types Relational, flat file, WWW, etc. How to get data out? Replication tool Dump file Create report Open Database Connectivity or ODBC or third-party "wrappers" Issues (1) Warehouse uses relational data model or multi-dimensional data model (e.g., data cube) On the other hand, source types Relational, OO, hierarchical, legacy Semistructured: flat file, WWW How do we get the data out? Warehouse must be kept current in light of changes to underlying sources How do we detect updates in sources? Wrapper Converts data and queries from one data model to another Extends query capabilities for sources with limited capabilities Wrapper Generation Solution 1: Hard code for each source Solution 2: Automatic wrapper generation Source-specific adapter (a.k.a. wrapper, translator) "Thickness" of adapter depends on source Data model used (e.g. rel. schema vs. unstructured) Interface (i.e., query language, API) Active capabilities (i.e., triggers) Degree of autonomy (e.g., same owner & modifiable vs. controlled by external entity & no changes possible) Cooperation (e.g., friendly vs. uncooperative) Routine When... Many tools for dealing with "standard situations" Standard sources with full/many capabilities e.g., most commercial DBMSs, all ODBC-compliant sources Standard interactions e.g., pass-through queries, extraction from rel. tables, replication Cooperative sources or sources under our control Tools Replication tools, ODBC, report writers, third-party "wrappers" Not So Routine When... "Non-standard situations" (gewal needs ng data) Unstructured or semi-structured sources with little or no explicit schema Uncooperative sources Sources with limited capabilities (e.g., legacy sources, WWW) Few commercial tools Mostly research Data Transformations Convert data to uniform format Byte ordering, string termination Internal layout Remove, add & reorder attributes Add key Add data to get history Sort tuples Monitors Goal: Detect changes of interest and propagate to integrator How? Triggers Replication server Log sniffer Compare query results Compare snapshots/dumps Data Integration Receive data (changes) from multiple wrappers/monitors and integrate into warehouse Rule-based Actions Resolve inconsistencies Eliminate duplicates Integrate into warehouse (may not be empty) Summarize data Fetch more data from sources (wh updates) etc. Data Cleansing Find (& remove) duplicate tuples e.g., Jane Doe vs. Jane Q. Doe Detect inconsistent, wrong data Attribute values that don't match Patch missing, unreadable data Notify sources of errors found Topic 2 (DATA WAREHOUSING) Digital Digital describes electronic technology that generates, stores, and processes data in terms of two states: positive and non-positive. Creating value at the new frontiers of the business world. Creating value in the processes that execute a vision of customer experience. Building foundational capabilities that support the entire structure (mckinsey.com). relating to, or utilizing devices constructed or working by the methods or principles of electronics. (merriam-webster.com). Data In computing, data is information that has been translated into a form that is efficient for movement or processing. Relative to today’s computers and transmission media, data is information converted into binary digital form. It is acceptable for data to be used as a singular subject or a plural subject. Raw data is a term used to describe data in its most basic digital format. (techtarget.com) Digital Data Digital data is the electronic representation of information in a format or language that machines can read and understand. In more technical terms, digital data is a binary format of information that's converted into a machine-readable digital format. ANALOGUE SIGNAL DIGITAL DATA Digital Data Digital data are numeric codes (0,1). It is stored in computer systems and software. These codes are represented as data through machine language systems that automatically convert them into information that can be interpreted by humans. Digital data is represented as discrete rather than continuous data. Uncompressed data is very large and is hard to transfer. What is Digital Data Management? Digital data management is the practice of collecting, storing, and providing access to information so that it can be analyzed and used. The goal of data management is to provide access to digital data in a way that protects it, while giving users the access that they need. A digital data management system is used to define, implement, and enforce policies, practices, and procedures to provide secure, efficient access. Key components of digital data management include: Optimized storage to expedite retrieval of digital data in computer memory, in and across multiple clouds, and in on-premises systems Data security to protect digital assets from unauthorized use and maximize privacy for sensitive data Backup systems for digital data Support for managing digital data retention policies, including archiving and destruction Tools to transfer data between systems, users, applications, analytics, and algorithms. Features of Data Classification The main goal of the organization of data is to arrange the data in such a form that it becomes fairly available to the users. So it's basic features as following. Homogeneity - The data items in a particular group should be similar to each other. Clarity - There must be no confusion in the positioning of any data item in a particular group. Stability - The data item set must be stable i.e. any investigation should not affect the same Discrete Data Continuous Data Definition Data with distinct and separate Data that can take any value values within a range Nature Countable Measurable Values Fixed, specific values Any values, including fractions and decimals Precision Exact and finite Infinite Statistics & Presentation Frequencies, counts /Bar Averages, variances, trends / charts, pie charts Histogram, line graphs Examples Number of cars sold, score in a Weight, height test What is Digital Data Management? Digital data management is the practice of collecting, storing, and providing access to information so that it can be analyzed and used. The goal of data management is to provide access to digital data in a way that protects it, while giving users the access that they need. A digital data management system is used to define, implement, and enforce policies, practices, and procedures to provide secure, efficient access. Key components of digital data management include: Optimized storage to expedite retrieval of digital data in computer memory, in and across multiple clouds, and in on-premises systems Data security to protect digital assets from unauthorized use and maximize privacy for sensitive data Backup systems for digital data Support for managing digital data retention policies, including archiving and destruction Tools to transfer data between systems, users, applications, analytics, and algorithms. Features of Data Classification The main goal of the organization of data is to arrange the data in such a form that it becomes fairly available to the users. So it's basic features as following. Homogeneity - The data items in a particular group should be similar to each other. Clarity - There must be no confusion in the positioning of any data item in a particular group. Stability - The data item set must be stable i.e. any investigation should not affect the same set of classification. Elastic - One should be able to change the basis of classification as the purpose of classification changes. Classifications of Digital Data Management Below is the percentage distribution of the three forms of data: Forms of Data Percentage Unstructured Data 80% Semi-structured Data 10% Structured Data 10% How to Manage It? Unstructured Data Definition: Data that doesn’t follow a specific format or model (e.g., images, videos, audio files, social media posts). Management: Use advanced tools like AI and machine learning for classification and retrieval. Store in data lakes or NoSQL databases (e.g., Hadoop, MongoDB). Apply metadata for easier search and indexing. Semi-structured Data Definition: Data with elements of both structured and unstructured formats (e.g., XML, JSON, NoSQL databases). Management: Use metadata to organize and retrieve data efficiently. Store in hybrid systems like NoSQL databases (e.g., MongoDB, CouchDB). Convert to structured data if necessary for deeper analysis. Leverage tools like Elasticsearch for indexing and search. Structured Data Definition: Data organized in a predefined schema, such as rows and columns in databases. Management: Store in relational databases (e.g., MySQL, PostgreSQL). Use SQL for querying and managing the data. Regularly back up and optimize the database for performance. Unstructured Data 1. Unstructured Data This is the data which does not conform to a data model or is not in a form which can be used easily by a computer program. About 80-90% data of an organi images, videos, letters, researchers, white papers, body of an email, etc. Sources: Word, PDF, Web pages, Memos, Videos (MP4, MPEG), Images (peg, GIF), Text , PPT, reports, chats, surveys) Characteristics: Does not conform to any data model Not in any particular format or sequence Not easily usable by a program Does not follow any rule or semantics How to Manage It? Semi-structured Data Management Techniques Schemas Describe the structure and content of data to some extent. Assign meaning to data, allowing automatic searching and indexing. Graph-based Data Models Contain data on the leaves of the graph (schema-less). Used for data exchange among heterogeneous sources. XML (Extensible Markup Language) Models the data using tags and elements. Schemas are not tightly coupled to the data. Data Representation Examplzation is in this format; for example, memos, chat rooms, PowerPoint presentations Digital Data Platforms A digital data management platform is used to collect, analyze, and retrieve large volumes of data. Usually, it also provides a number of core functions that allow users to securely, efficiently, and cost-effectively work with digital data. Digital data platforms typically include management fools to support common tasks, such as: Identifying, alerting, diagnosing, and resolving faults in the digital data management platform or other related systems Allocating memory and storage resources Making changes in the digital data storage design Enhancing system performance by optimizing responses to queries Importance of DMPs Essentially, platforms which house important digital data, such an customer data cookie IDs, mobile identifiers) and campaign data. These types of tools also help digital reles (eg. marketers and advertisers) build customer segments and their performance. These segments are but up based on demographic data, past browsing behavior, location, device and more. Benefits: 1. Unifies data and breaks down silos. Brings all your data together on one single platform, giving you and your team a more cohesive view of customers. Helps identify new audiences and customers. Ever heard of audience extension or lookalike modeling techniques? These are great features offered in some DMPs that let you discover and target new customer groups. 3. Provides continuous results. These are not one-off reporting solutions, but rather platforms that enable long-term strategies by way of constant and continuous reporting. Uses for Digital Data Management The possible uses for digital data management are nearly limitless. Digital data management is used in: Astrophysics Economic research Financial modeling Genetic science Particle physics Population studies DMP Examples Google Marketing Platform - flexible scaling, improved data analytics, automation and connectivity, and enhanced capabilities for video campaigns. Nielsen DMP - is a seamless media-buying platform that forms a complete view of customer behavior by leveraging different types of data. Some features include audience segmentation, message sequencing, media planning, profiling, orchestration, frequency capping and more. Offers Al capabilities. Oracle BlueKai DMP - highlights five critical purposes for its users: -Unify advertising data -Build effective audiences -Connect across devices -Activate digital campaigns -Analyze audiences Managing Digital Data Managing digital data in an organization involves a broad range of tasks, policies, procedures, and practices. The work of data management has a wide scope, covering factors such as how to: Create, access, and update data across a diverse data tier Store data across multiple clouds and on premises Provide high availability and disaster recovery Use data in a growing variety of apps, analytics, and algorithms Ensure data privacy and security Archive and destroy data in accordance with retention schedules and compliance requirements Data Models A database model shows the logical structure of a database, including the relationships and constraints that determine how data can be stored and accessed Individual database models are designed based on the rules and concepts of whichever broader data model the designers adopt. It represents the framework of what the relationships are within a database. It is an essential framework in DW as it serves as a vital component. Data models defines the structure which will support the analytical needs of the decision makers. Types of Data Models Relational Hierarchical Network Object-oriented database 5. Object-relational Data Models 1. Relational Model The most common model, the relational model sorts data into tables, also known as relations, each of which consists of columns and rows. The attributes in a relation are called a domain. A particular attribute or combination of attributes is chosen as a primary key that can be referred to in other tables, when it's called a foreign key. Each row, also called a tuple, includes data about a specific instance of the entity in question, such as a particular employee. Data Models 2. Hierarchical Model Organizes data into a tree-like structure, where each record has a single parent or root. Sibling records are sorted in a particular order. That order is used as the physical order for storing the database. This model is good for describing many real-world relationships. Data Models 3. Network Model - Builds on the hierarchical model by allowing many-to-many relationships between linked records, implying multiple parent records. Based on mathematical set theory, the model is constructed with sets of related records. 4. Object-oriented database Model defines a database as a collection of objects, or reusable software elements, with associated features and methods. There are several kinds of object-oriented databases: multimedia database hypertext database 5. Object-relational Model This hybrid database model combines the simplicity of the relational model with some of the advanced functionality of the object-oriented database model. In essence, it allows designers to incorporate objects into the familiar table structure. Languages and call interfaces include SQL3, vendor languages, ODBC, JDBC, and proprietary call interfaces that are extensions of the languages and interfaces used by the relational model.