CH09_PPT.pptx
Document Details
Uploaded by BelovedSatire
Macquarie University
Tags
Full Transcript
Modern Database Management Thirteenth Edition Chapter 9 Data Warehousing and Data Integration Copyright © 2022, 201...
Modern Database Management Thirteenth Edition Chapter 9 Data Warehousing and Data Integration Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Learning Objectives (1 of 2) 9.1 Define terms 9.2 Give reasons for the information gap between information needs and availability 9.3 List two reasons most organizations today need data warehousing 9.4 Name and describe the three levels in a data warehouse architecture 9.5 Describe the two major components of a star schema 9.6 Estimate fact table size Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Learning Objectives (2 of 2) 9.7 Design a data mart 9.8 Develop requirements for a data mart 9.9 Understand future data warehousing trends 9.10 Describe three types of data integration approaches 9.11 Describe four steps and activities of ETL for data integration for a data warehouse 9.12 Explain various forms of data transformation for data warehouses Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Definitions Data Warehouse – A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision- making processes ▪ Subject-oriented: e.g., customers, patients, students, products ▪ Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources ▪ Time-variant: can study trends and changes ▪ Non-updatable: read-only, periodically refreshed Data Mart – A data warehouse that is limited in scope Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Need for Data Warehousing Integrated, company-wide view of high-quality information (from disparate databases) Separation of operational and informational systems and data (for improved performance) Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Issues With Company-Wide View Inconsistent key structures Synonyms Free-form v s structured fields ersu Inconsistent data values Missing data – See figure 9-1 for examples Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-1 Examples of Heterogeneous Data Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Organizational Trends Motivating Data Warehouses No single system of records Multiple systems not synchronized Organizational need to analyze activities in a balanced way Customer relationship management Supplier relationship management Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Separating Operational and Informational Systems Operational system – a system that is used to run a business in real time, based on current data; also called a system of record Informational system – a system designed to support decision making based on historical point-in-time and prediction data for complex queries or data-mining applications Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Table 9-1 Comparison of Operational and Informational Systems Characteristic Operational Systems Informational Systems Primary purpose Run the business on a current basis Support managerial decision making Type of data Current representation of state of Historical point in time (snapshots) and the business predictions Primary users Clerks, salespersons, administrators Managers, business analysts, customers Scope of usage Narrow, planned, and simple Broad, ad hoc, complex queries and updates and queries analysis Design goal Performance: throughput, availability, Ease and low cost of flexible access reliability; alignment with business rules and use Volume Many constant updates and queries Periodic batch updates and queries on one or a few table rows requiring many or all rows The goals, purposes, and usage of information systems and data warehouses are very different from those of operational systems and OLTP databases Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Data Warehouse Architectures Independent Data Mart Dependent Data Mart and Operational Data Store Logical Data Mart and Real-Time Data Warehouse Three-Layer architecture Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-2 Independent Data Mart Data Warehousing Architecture Data marts are mini-warehouses, limited in scope. Separate ETL for each independent data mart. Data access complexity due to multiple data marts Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Limitations of Independent Data Marts Separate ETL process for each data mart leads to redundant data and processing efforts Inconsistency between data marts Difficult to drill down for related facts between data marts Excessive scaling costs the more applications are built High cost for obtaining consistency between marts Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-3 Dependent Data Mart With Operational Data Store: 3-Level Architecture Operational data store (ODS) provides option for transforming current data. Single ETL for enterprise data warehouse (EDW). Dependent data marts loaded from EDW. Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-4 Logical Data Mart and Real Time Warehouse Architecture ODS and data warehouse are one and the same. Near real-time ETL for data warehouse. Data marts are Not separate databases, but logical views of the data warehouse. Easier to create new data marts Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Data Warehouse V s Data Mart: Scope ersu Data Warehouse – Application independent – Centralized, possibly enterprise-wide – Planned Data Mart – Specific DSS application – Decentralized by user area – Organic, possibly not planned Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Data Warehouse V s Data Mart: Data, ersu Subjects, and Sources Data Warehouse – Data is historical, detailed, and summarized – Data it lightly denormalized – Multiple subjects – Many internal and external sources Data Mart – Data has some history, is detailed and summarized – Data is highly denormalized – One central subject or concern to users – Few internal and external sources Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Data Warehouse V s Data Mart: Otherersu Characteristics Data Warehouse – Flexible – Data oriented – Long life – Large – Single complex structure Data Mart – Restrictive – Project oriented – Short life – Starts small, becomes large – Multi-, semi-complex structures, together complex Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-5 Three-Layer Data Architecture for a Data Warehouse Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved DW Data Characteristics Status v s Event Data ersu – Status – before and after images – Event – something that causes changes to the status. Typical example: a transaction. Transient v s Periodic Data ersu – Transient – changes to existing records are written over previous records, destroying previous data content – Periodic – data is never physically altered or deleted after being added to the store Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-6 Example of DBMS Log Entry Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-7 Transient Operational Data Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-8 Periodic Warehouse Data Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Other Data Warehouse Changes New descriptive attributes New business activity attributes New classes of descriptive attributes Descriptive attributes become more refined Descriptive data are related to one another New source of data Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Derived Data Objectives – Ease of use for decision support applications – Fast response to predefined user queries – Customized data for particular target audiences – Ad-hoc query support – Data mining capabilities Characteristics – Detailed (mostly periodic) data – Aggregate (for summary) – Distributed (to departmental servers) Most common data model = dimensional model (usually implemented as a star schema) Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-9 Components of a Star Schema Excellent for ad-hoc queries, but bad for online transaction processing Dimension tables contain descriptions about the subjects of the business Fact tables contain factual or quantitative data Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-10 Star Schema Example Fact table provides statistics for sales broken down by product, period, and store dimensions Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-11 Star Schema Sample Data Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Surrogate Keys Dimension table keys should be surrogate (non- intelligent and non-business related), because: – Business keys may change over time – Helps keep track of non-key attribute values for a given production key – Surrogate keys are simpler and shorter – Surrogate keys can be same length and format for all keys Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Grain of the Fact Table Granularity of Fact Table: What level of detail do you want? – Transactional grain – finest level – Aggregated grain – more summarized – Finer grains brings better market basket analysis capability – Finer grain implies more dimension tables, more rows in fact table – In Web-based commerce, finest granularity is a click Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Duration of the Database Natural duration – 13 months or 5 quarters Financial institutions may need longer duration Older data is more difficult to source and cleanse Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Size of Fact Table Depends on the number of dimensions and the grain of the fact table Number of rows = product of number of possible values for each dimension associated with the fact table Example: Assume the following for Figure 9-11: – Total number of stores = 1,000 – Total number of products = 10,000 – Total number of periods = 24 (two years’ worth of monthly data) Total rows calculated as follows (assuming only half the products record sales for a given month): Total rows 1,000 stores 5,000 active products 24 months – 120,000,000 rows Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-12 Modeling Dates Fact tables contain time-period data, so date dimensions are important. Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Variations of the Star Schema Multiple Facts Tables – Can improve performance – Often used to store facts for different combinations of dimensions – Conformed dimensions Factless Facts Tables – No non-key data, but foreign keys for associated dimensions – Used for: ▪ Tracking events ▪ Inventory coverage Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-13 Conformed Dimensions Two fact tables connect two star schemas. Conformed dimension is associated with multiple fact tables. Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-14 Factless Fact Table No data in fact table, just keys associating dimension records. Fact table forms an n-ary relationship between dimensions Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Normalizing Dimension Tables Multivalued Dimensions – Facts qualified by a set of values for the same business subject – Normalization involves creating a table for an associative entity between dimensions Hierarchies – Sometimes a dimension forms a natural, fixed-depth hierarchy – Design options ▪ Include all information for each level in a single denormalized table ▪ Normalize the dimension into a nested set of 1: N table relationships Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-15 Multivalued Dimension A helper table is an associative entity that implements a M : N relationship between dimension and fact. Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-16 Fixed Product Hierarchy Dimension hierarchies help to provide levels of aggregation for users wanting summary information in a data warehouse. Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Slowly Changing Dimensions (SCD) How to maintain knowledge of the past Kimball’s approaches: – Type 1: just replace old data with new (lose historical data) – Type 2: for each changing attribute, create a current value field and several old-valued fields (multivalued) – Type 3: create a new dimension table row each time the dimension object changes, with all dimension characteristics at the time of change — most common approach Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved 10 Essential Rules for Dimensional Modeling (1 of 2) 1. Use atomic facts 2. Create single-process fact tables 3. Include a date dimension for each fact table 4. Enforce consistent grain 5. Disallow null keys in fact tables Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved 10 Essential Rules for Dimensional Modeling (2 of 2) 6. Honor hierarchies 7. Decode dimension tables 8. Use surrogate keys 9. Conform dimensions 10.Balance requirements with actual data Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Data Integration Data integration creates a unified view of business data Other possibilities: – Application integration – Business process integration – User interaction integration Any approach requires changed data capture (CDC) – Indicates which data have changed since previous data integration activity Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Techniques for Data Integration Consolidation (ETL) – Consolidating all data into a centralized database (like a data warehouse) Data federation (EII) – Provides a virtual view of data without actually creating one centralized database Data propagation (EAI and EDR) – Duplicate data across databases, with near real- time delay Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved The Reconciled Data Layer (1 of 2) Typical operational data is: – Transient, not historical – Not normalized (perhaps due to denormalization for performance) – Restricted in scope, not comprehensive – Sometimes poor quality, containing inconsistencies and errors Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved The Reconciled Data Layer (2 of 2) After ETL, data should be: – Detailed, not summarized yet – Historical, periodic – Normalized, 3rd normal form or higher – Comprehensive, enterprise-wide perspective – Timely, data should be current enough to assist decision making – Quality controlled, accurate with full integrity Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved The ETL Process ETL = Extract, transform, and load – Capture/Extract – Scrub or data cleansing – Transform – Load and Index When is ETL done? – During initial load of Enterprise Data Warehouse (EDW) – During subsequent periodic updates to EDW Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-21 Steps in Data Reconciliation Data reconciliation involves capture/extract, cleanse, transform, and load/index. Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Record Level Transformation Functions Selection – the process of partitioning data according to predefined criteria Joining – the process of combining data from various sources into a single table or view Normalization – the process of decomposing relations with anomalies to produce smaller, well-structured relations Aggregation – the process of transforming data from detailed to summary level Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-22 Single-Field Transformations (1 of 3) (a) Basic representation Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-22 Single-Field Transformations (2 of 3) (b) Algorithmic Uses a formula or logical expression Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-22 Single-Field Transformations (3 of 3) (c) Table lookup Uses a separate table keyed by source record code Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-23 Multifield Transformations (1 of 2) (a) Many sources to one target Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Figure 9-23 Multifield Transformations (2 of 2) (b) One source to many targets Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved Copyright This work is protected by United States copyright laws and is provided solely for the use of instructors in teaching their courses and assessing student learning. Dissemination or sale of any part of this work (including on the World Wide Web) will destroy the integrity of the work and is not permitted. The work and materials from it should never be made available to students except by instructors using the accompanying text in their classes. All recipients of this work are expected to abide by these restrictions and to honor the intended pedagogical purposes and the needs of other instructors who rely on these materials. Copyright © 2022, 2019, 2016 Pearson Education, Inc. All Rights Reserved