INF2003 Database Systems Data Warehouse AY24/25 PDF

Summary

These lecture notes cover Database Systems and Data Warehouses, specifically focusing on OLAP and OLTP. The document details various aspects of data warehousing, including ETL processes, schemas, and operations.

Full Transcript

INF2003: Database Systems Data Warehouse AY24/25 Trimester 1 Zhang Wei SIT Nov. 12, 2024 Module and Teaching Feedback AY24/25, Tri 1 From University: Complete both module and teaching feedback. Please share constructive feedback in the free text sections Academic Staff will not be able to...

INF2003: Database Systems Data Warehouse AY24/25 Trimester 1 Zhang Wei SIT Nov. 12, 2024 Module and Teaching Feedback AY24/25, Tri 1 From University: Complete both module and teaching feedback. Please share constructive feedback in the free text sections Academic Staff will not be able to identify the student. All feedback received are anonymous. Their grades will not be affected as the feedback results will only be shared with the academic staff after the release of exam results. From Wei: We have been improving this module continuously based on students’ feedback. Thank you for your contribution to our module and for helping your junior peers. Data Warehouse Definition: a central repository of integrated data. Orthogonal data dimensions: Application dimension: the data normally come from multiple data sources. Sales, salaries, operational data, human resources data. Time dimension: both historical data and current data are included. Objective: reporting and data analysis. One of the key components of business intelligence. Focus on OLAP (w/ historical data) instead of OLTP. (more details later) Also known as enterprise data warehouse. Architecture (like computer system): Operational system. e.g., marketing or sales. Integration layer. ETL, ELT, data cleansing. Data warehouse. Data Loading - ETL ETL: extract, transform, load. Extract: extracting data from different sources, e.g., operation, external Transform: transforming/enriching the data, e.g., data type conversion, combining columns like first and last names, data cleansing, masking. Load: loading it into the data warehouse. Staging layer: store raw data extracted from different sources in operational data store. (some data may be deleted later) Integration layer: integrate the data from the stage layer and move the data to another database, e.g., data warehouse. Data is arranged into hierarchical groups, or dimensions. Dimensions are arranged into facts and aggregated facts. Schema = fact + dimension. Access layer: help users retrieve data. Challenges: data volume (time, spike), dirty data, synchronization … Tools: IBM InfoSphere DataStage, Informatica PowerCenter, and so on. ELT: no transformation before loading; far less popular. Conceptual Modeling - Schemas Two popular schemas: Star schema Snowflake schema Star schema. Simplest style of data warehouse schema; most popular. 1 fact table + n referencing dimension tables. Name from its physical model in a star shape. Fact table as center. Dimension tables as star’s points. Business process data → facts. Facts generally consist of numeric values and foreign keys. Transaction fact (sales events); snapshot fact (month end); accumulating snapshot (monthly total). Dimension data: measurable, quantitative, as the fact descriptive attri.. Dimension should be wide or narrow? Time dimension; geography dimension; product dimension; employee dimension; range dimension. Star Schema Benefits: Denormalized (relatively). For dimensions. Simpler queries: simple join logic. Query performance: read intensive. Disadvantages: Data integrity not enforced. Often, data loading is highly controlled for protection. Not flexible. Not suitable for complex analytics. More purpose-built. Example: How many TV sets have been sold, for each brand and country, in 1997. Simple join? Note the keys. Snowflake Schema ER diagram resembles a snowflake shape. Centralized fact tables. Connected to multiple dimensions. Like star schema, but different. Dimensions normalized into multiple related tables. May have multiple levels of relationships. Star schema: dimension represented by a single table. Benefits: storage saving. Does storage saving makes sense? Disadvantage: complex query. Alleviated year by year, but still slow. There is no end to normalization, 1NF, 2NF, 3NF, and so on. Technically, there is a more normalized solution. More normalized, more joins, more complex … Check the example in the next slide. Star vs Snowflake OLAP: Online Analytical Processing Relational DB: focus on OLTP (online transaction processing) Transaction: atomic change of state. e.g., banking, sales, trading … Process small, quick, interactive, large volume workloads. Optimized for INSERT, UPDATE operation. Focus on read operations. Highly normalized. Goals: availability, speed, concurrency, recoverability. OLAP. Derived from OLTP but allows complex analytics queries. Deliberately de-normalized with high SELECT throughput. Applications: e.g., sales report, marketing report, budgeting and forecasting. API support not as good as SQL. (true for most less popular functions) Options include Microsoft MDX and XML for Analysis. Market (years ago): Microsoft was ~30%, Oracle was ~20%, IBM was ~15%, SAP was ~15%, others ~20%. Core: OLAP cube. OLAP Cube Cube, a multi-dimensional array of data. Before time began, Also called hyper-cube if the dimension is higher than 3. there was the Cube Similar concept: cross-tabbed report in 1980s. Pre-computation can support an efficient query processing. An easy way for visualization. Terminology. Slice: keep some dimensions as constant and show the change of a few dimensions. e.g., apple sales in 2018. Measure: derived from the fact table. Dimension: derived from the dimension table. Cell: holds a number to represent some measure. e.g., sales, profits, budget. Hierarchy: organization of the elements of a dimension. AY20/21 ∋ AY20/21 Trimester 2. Trimester 2 ∋ 2nd half of Trimester 2. Operations: slice, dice, drill down/up, roll-up, pivot. OLAP Cube: Sample Queries You might think the concept is different. The queries are familiar. Data cube: ProductSales(product, region, sday, sales) 3 dimensions: product, region, and day. 1 measure: sales. Q1. Find the total sales for each product. A1. SELECT SUM(P.sales) FROM ProductSales AS P GROUP BY product; Q2. Find the total sales for each region. A2. SELECT SUM(P.sales) FROM ProductSales AS P GROUP BY region; Q3. Find the total sales for each product on each day. A3. SELECT SUM(P.sales) FROM ProductSales AS P GROUP BY product, sday; Q4. Find the total sales so far. A4. SELECT SUM(P.sales) FROM ProductSales AS P; Cube Operations Slice: picking a rectangular subset of a cube. Example: Fix location as Asia. Variables: product and time. Dice: sub-cube with specific values of multiple dimensions. Each dimension has a range. Is slice a dice? Example. Asia & Africa. 2001 – 2003. Cube Operations Drill Down: navigate among levels of data, from summarized to detailed. e.g., from the summary sales data to the sales for the individual products. Roll Up: navigate among levels of data, from detailed to summarized. The summarization rule could be an aggregate function. e.g., sum, profit = sales – expenses. Other aggregation functions: COUNT, MAX, MIN … Pivot: rotate the cube, to see different faces. Different perspectives. From one year to one location. Cube Operations: Exercise For the following questions, which OLAP operator shall we use? Q1. Find the sales of each city. A1. ? Q2. Given the sales of each city, get the sales for each state. Assume each state includes several cities. A2. ? Q3. Find the top-5 products ranked by total sales. A3. ? Q4. Given the sales of each state, get the sales of each city. A4. ? OLTP vs. OLAP - Summary OLTP OLAP Online transactional system and Online data retrieving and data Basic manages database modification. analysis system. Extract data and analyze for decision Focus Insert, update, delete information. making. Data Source OLTP and its transactions. Different OLTPs database. Transaction OLTP has short transactions. OLAP has long transactions. Time Comparatively less in OLTP. Comparatively more in OLAP. Queries Simpler. Complex. Normalization Normalized (3NF). Not normalized. Must maintain data integrity Does not get frequently modified. Integrity constraint. Hence, data integrity is not affected. Method Uses traditional DBMS. Uses the data warehouse. Users Clerk and IT professionals Knowledge workers No. of users Thousands of users. Only hundreds of users. Src: https://www.guru99.com/oltp-vs-olap.html, https://techdifferences.com/difference-between-oltp-and-olap.html DBMS vs. Data Warehouse More or less the same with the comparison between OLTP and OLAP. DBMS: tuned for OLTP. Access methods, indexing, concurrency control, recovery. Data Warehouse: tuned for OLAP. Complex queries, multidimensional view, consolidation. To summarize (may be not rigorous). DBMS is used to run a business. Data warehouse helps optimize the business.

Use Quizgecko on...
Browser
Browser