Data Warehousing And Some SQL OLAP Queries PDF
Document Details
Uploaded by Deleted User
University of Malta
2013
Prof Joseph G Vella
Tags
Summary
This document is a presentation on data warehousing and dimensional modeling, including a discussion on the context and requirements, as well as a methodology for building a data warehouse.
Full Transcript
Prof Joseph G Vella Prof. Joseph G Vella Dept. of Computer Information Systems, University of Malta A...
Prof Joseph G Vella Prof. Joseph G Vella Dept. of Computer Information Systems, University of Malta And Some SQL OLAP Queries 1 If only... If like the Creator, I had the whole under control, Would dismantle firmament, component parts pole to pole One universe anew, would create where with ease, The free spirited could have access, to desire noble goal True Translation of Hakim Omar Khayyam’s Robaiyat Joseph Vella © 2013- Data Warehousing - Slide 2 2 Data Warehousing and Dimensional Modelling 1 Prof Joseph G Vella Definition & Characteristics "a subject oriented, integrated, non-volatile, and time variant collection of data in support of management's decisions” - Bill Inmon Integrates historical data from operational systems and external sources. Updated spasmodically. Intensive on storage space and query elapsed time. Used by decision makers in the decision making process. For example in Banking, Insurance, Manufacturing, & Healthcare. Joseph Vella © 2013- Data Warehousing - Slide 3 3 Context & Requirements Contextual use of a DWH is an ever changing environment user needs, business conditions, changing nature of data, technological progress. Decision makers assertions on data requirements issues (List by Kimball, 1996): We have heaps of data, but we cannot access it! How can people playing the same role achieve substantially different results? We want to select, group, and manipulate data in every possible way! Show me just what matters! Everyone knows that some data is wrong! Joseph Vella © 2013- Data Warehousing - Slide 4 4 Data Warehousing and Dimensional Modelling 2 Prof Joseph G Vella Data volume vs value: from operational to decision support systems Joseph Vella © 2013- Data Warehousing - Slide 5 5 Decision makers need to wrestle with… Joseph Vella © 2013- Data Warehousing - Slide 6 6 Data Warehousing and Dimensional Modelling 3 Prof Joseph G Vella From database perspective … The data warehouse process includes activities to: Provide accessibility of data to decision makers Integrate data (both internal and external) Effective query definition Have data content conciseness (data around key business processes) Provide Multi-dim representation Control Correctness Increase Completeness Joseph Vella © 2013- Data Warehousing - Slide 7 7 Building DWH requirements The data warehouse system includes: Separation of loading, tallying, and processing Scalability (h/w & s/w) But also reacts gracefully to business development Open to new data sources, applications, and front-ends Data provenance Data life history Performance issues Loading & querying Security issues Administration issues Joseph Vella © 2013- Data Warehousing - Slide 8 8 Data Warehousing and Dimensional Modelling 4 Prof Joseph G Vella Building DWH methodology Joseph Vella © 2013- Data Warehousing - Slide 9 9 Data Staging Area Data Staging Area A storage area and set of processes that clean, transform, combine, duplicate, household, archive, and prepare source data for use in the data warehouse. Spread on a number of platforms Dominated by simple activities (e.g. sorting and sequential processing nature) Checking referential constraints and consistency across the pieces and the whole! Joseph Vella © 2013- Data Warehousing - Slide 10 10 Data Warehousing and Dimensional Modelling 5 Prof Joseph G Vella Data Staging Area Joseph Vella © 2013- Data Warehousing - Slide 11 11 Presentation Layer/Server Presentation Server The target physical machine on which the data warehouse data is organised and stored for direct querying by the end users, report writers and other applications. There are two main data models relational (with star schemas) - most popular dimensional (with data cubes) - promises better “interface” for access to data by the end users Joseph Vella © 2013- Data Warehousing - Slide 12 12 Data Warehousing and Dimensional Modelling 6 Prof Joseph G Vella Presentation Server Joseph Vella © 2013- Data Warehousing - Slide 13 13 Dashboards Joseph Vella © 2013- Data Warehousing - Slide 14 14 Data Warehousing and Dimensional Modelling 7 Prof Joseph G Vella Data Marts At a logical design level one can easily see that building and maintaining one design for all business processes is impractical. Some authors, e.g. Kimball, insist that for larger orginasations a number of data marts are deployed rather than a single DWH. But requires: Conformal dimensions; Meta data repository. One definition of a data mart (i.e. note no agreement exists!?): Is a DWH with additional restriction or an interest in: business process activity, department or location, and category of users. Additional advantages include: Incremental development possible, query basis is restricted, performance issues are easier to address. Joseph Vella © 2013- Data Warehousing - Slide 15 15 The Basic Elements and Practices (from Ralph Kimball et al) Joseph Vella © 2013- Data Warehousing - Slide 16 16 Data Warehousing and Dimensional Modelling 8 Prof Joseph G Vella Simple DWH architecture Joseph Vella © 2013- Data Warehousing - Slide 17 17 Two phase DWH architecture Joseph Vella © 2013- Data Warehousing - Slide 18 18 Data Warehousing and Dimensional Modelling 9 Prof Joseph G Vella Independent Data Marts DWH architecture Architecture fits Kimbal’s earlier framework Joseph Vella © 2013- Data Warehousing - Slide 19 19 Hub & Spoke DWH architecture Joseph Vella © 2013- Data Warehousing - Slide 20 20 Data Warehousing and Dimensional Modelling 10 Prof Joseph G Vella Basic Processes of the Data Warehouse (i) Over the Data Staging Area a major process is: extracting is the first step of getting data into the data warehouse environment. Extracting means reading and understanding the source data, and copying the parts that are needed to the data staging area for further work. Joseph Vella © 2013- Data Warehousing - Slide 21 21 Basic Processes of the Data Warehouse (ii) Over the Data Staging Area a major process is (continued): transforming a number of processes are involved cleaning the data by correcting misspellings, resolving domain conflicts dealing with missing data elements, and parsing into standard formats purging selected fields from the legacy data that are not useful for the DWH combining data sources by matching exactly on key values or by performing fuzzy matches on non-key attributes, including looking up textual equivalents of legacy codes creating surrogate keys for each dimension record in order to avoid a dependence on legacy defined keys, where the surrogate key generation process enforces ref. integrity between the dimension tables and the fact tables building aggregates for boosting the performance of common queries Joseph Vella © 2013- Data Warehousing - Slide 22 22 Data Warehousing and Dimensional Modelling 11 Prof Joseph G Vella e.g. Transformation (in ETL) Joseph Vella © 2013- Data Warehousing - Slide 23 23 Basic Processes of the Data Warehouse (iii) Data Staging is a major process (continued): Loading and Indexing usually takes the form of replicating the dimensions tables and fact tables and presenting these tables to the bulk loading facilities of each recipient data mart. Quality Assurance Checking by running a comprehensive exception report over the entire set of newly loaded data. totals, aggregations and time series data Release / Publishing end user community must be notified that new data has arrived Updating changes in the labels, changes in hierarchies, changes in status, and changes in corporate Joseph Vella © 2013- Data Warehousing - Slide 24 24 Data Warehousing and Dimensional Modelling 12 Prof Joseph G Vella Basic Elements and Practices (i) Source Systems An operational system of record whose function it is to capture the transactions of the business. (Typically Legacy systems). uptime availability narrow queries / OLTP, little ad hoc queries little historical data little by way of management reports (if any it’s a burden) mostly missing important attributes necessary for reporting (e.g. product, customer, geography or calendar) keys (of source) are copied to the DWH but not maintained at DWH level the productions key’s values become attributes in the DWH Many source systems working in a I.T. infrastructure Joseph Vella © 2013- Data Warehousing - Slide 25 25 Basic Processes of the Data Warehouse (iv) Data Staging is a major process (continued): Querying all activities of requesting data from a data mart, including ad hoc querying by end users, report writing, complex decision support applications it is the whole point of data warehousing … very computationally expensive … Data Feedback data sometimes need to flow down pass a clean dimension into the participant source systems pass the result of a complex query into another application or any source system Auditing and Security who saw what and did what openness vs sensitivity Joseph Vella © 2013- Data Warehousing - Slide 26 26 Data Warehousing and Dimensional Modelling 13 Prof Joseph G Vella Data Cube for Dimensional Modelling An example of 3D data cube. FD: Product, Store, Date -> quantity, receipts Joseph Vella © 2013- Data Warehousing - Slide 27 27 Examples of Aggregating the Data Cube Assuming: FD: Product -> type FD: Date -> month Joseph Vella © 2013- Data Warehousing - Slide 28 28 Data Warehousing and Dimensional Modelling 14 Prof Joseph G Vella Basic Processes of the Data Warehouse (v) But which data to aggregate!? Where size really matters … Refer to slide 9 for a simple star schema. Lets look at the dimensions possible size; time * 1000, store * 20, clerk * 150, product 10000, customer * 1000, and promotion * 200. Lets assume a complete line of description data fits in 1024 byte (I.e. 1K). Then the max cube size is an astronomical number!? Luckily not all values are in fact business facts!? Why aggregate? For different granularities. For example: Time dimension Daily - Current Year Weekly - Current & Previous four Years Monthly - Current & Previous ten Years Yearly - Current & Any Previous Years The main reason for granularity is storage requirements and relevance of data keeping for long term trails. Joseph Vella © 2013- Data Warehousing - Slide 29 29 A Relational Model Example Joseph Vella © 2013- Data Warehousing - Slide 30 30 Data Warehousing and Dimensional Modelling 15 Prof Joseph G Vella A Dimensional Model Example Joseph Vella © 2013- Data Warehousing - Slide 31 31 Example Data Cube (again) An example of 3D data cube. FD: Product, Store, Date -> quantity, receipts Joseph Vella © 2013- Data Warehousing - Slide 32 32 Data Warehousing and Dimensional Modelling 16 Prof Joseph G Vella Composition of Dimensions FD: Product-> type FD: type -> category FD: Store> city FD: city -> state Joseph Vella © 2013- Data Warehousing - Slide 33 33 4D data cube Joseph Vella © 2013- Data Warehousing - Slide 34 34 Data Warehousing and Dimensional Modelling 17 Prof Joseph G Vella DWH Modeling - Star Schemas Packages data in a symmetric format whose design goals are user understandability, query specification ease, query performance, and resilience to change. The main components are: Fact tables is the central & primary table in which each dimensional model that is meant to contain measurements of the business. For example these include key business focal points! Measurements are additive or numeric. Contains (at least two) foreign keys that join their respective dimension tables. Dimension tables are a set of companion tables to a fact table. For example these include people, places, things and time(s). Each dimension has a simple relationship with a fact table. Each dimension has a large number of attributes (textual and numeric). – Basis for querying (or latching) onto the DWH queries. Joseph Vella © 2013- Data Warehousing - Slide 35 35 Star Schema Joseph Vella © 2013- Data Warehousing - Slide 36 36 Data Warehousing and Dimensional Modelling 18 Prof Joseph G Vella Snow Flake Schema Joseph Vella © 2013- Data Warehousing - Slide 37 37 Fact constellation schema (sales & shipping) Joseph Vella © 2013- Data Warehousing - Slide 38 38 Data Warehousing and Dimensional Modelling 19 Prof Joseph G Vella Comparing DWHs and Data Marts A data mart is a logical sub-set of the complete data warehouse. It is much easier to manage and more likely to succeed. The DWH is then the union of all its data marts. We require a “common” backbone... Design restrictions: All data marts must be built from “conformed” dimensions and “conformed” facts. Consequently easily integrated together … and This would easily form part of our Data Warehouse Bus Architecture. Top-down or Bottom-up designs are not exactly the point here and in practice we seem to need a mix of both. NOTE: Some sources imply that data marts are external data, or summary / aggregated data... Joseph Vella © 2013- Data Warehousing - Slide 39 39 The DWH Business Process Is a coherent set of business activities that make sense to the business users of our data warehouse. e.g. order processing It is interesting to note that we are now building data- marts around a business process and then build the enterprise warehouse from the data-marts. Joseph Vella © 2013- Data Warehousing - Slide 40 40 Data Warehousing and Dimensional Modelling 20 Prof Joseph G Vella Slice – by a dim and one value Joseph Vella © 2013- Data Warehousing - Slide 41 41 Slice by two dimensions – each with a single value Joseph Vella © 2013- Data Warehousing - Slide 42 42 Data Warehousing and Dimensional Modelling 21 Prof Joseph G Vella Dice by three dimensions – each with a range of values Joseph Vella © 2013- Data Warehousing - Slide 43 43 DWH - OLAP The dynamic synthesis, analysis and consolidation of large volumes of dimensional data for historical and projectional understanding. Common analytical applications Consolidation aggregate by dimensions Drill down (and Rollup) given a figure partition this by a set of dimension’s values Partitioning - sub-setting Slicing (on rows) and dicing (on attribute) split data measures by a sequence of dimensions selections Common data model driven technology include ROLAP, MOLAP, HOLAP, etc Click hypertext button for Common Queries Examples. Joseph Vella © 2013- Data Warehousing - Slide 44 44 Data Warehousing and Dimensional Modelling 22 Prof Joseph G Vella Rolling-Up Joseph Vella © 2013- Data Warehousing - Slide 45 45 Another example of Drilling up Joseph Vella © 2013- Data Warehousing - Slide 46 46 Data Warehousing and Dimensional Modelling 23 Prof Joseph G Vella Drilling Down Joseph Vella © 2013- Data Warehousing - Slide 47 47 Another example of Drilling Down Joseph Vella © 2013- Data Warehousing - Slide 48 48 Data Warehousing and Dimensional Modelling 24 Prof Joseph G Vella Slicing & Dicing Joseph Vella © 2013- Data Warehousing - Slide 49 49 Pivoting Joseph Vella © 2013- Data Warehousing - Slide 50 50 Data Warehousing and Dimensional Modelling 25 Prof Joseph G Vella Drilling across two cubes Joseph Vella © 2013- Data Warehousing - Slide 51 51 StarNet Query Model Each radial line is a hierarchy of dim Region, Country, and City The radial line represents at which level of each dimension hierarchy are we to report View here is of ‘contracts’, ‘produc group’, ‘district’, ‘country’, ‘quarter’ statistics. Joseph Vella © 2013- Data Warehousing - Slide 52 52 Data Warehousing and Dimensional Modelling 26 Prof Joseph G Vella Indexes We have seem that indexes can reduce the amount of disk accesses. But with the volume of data and the explicit number of relationships (e.g. between the fact tables and their dimensions) new technology (or old technology re-cladded for us) is needed. An important genera of indexes are the bit map indexes (called secondary indexes in old text-books). These assume that data does not have too many updates! We use these indexes for attributes that share values from a domain. We say that the attribute has a high cardinality if at any point in time there are a good number of distinct values in that dimension (e.g. order dates). On the other hand a dimension has low cardinality is there are a restricted amount of distinct values (e.g. sex). Joseph Vella © 2013- Data Warehousing - Slide 59 59 Bit Map Indexes How to create a bit map JOB DEPTNO MGR index? Choose an attribute, compute PRESIDENT SALESMAN MANAGER ANALYST its distinct values CLERK and introduce these 7566 7698 7782 7788 7839 7902 in columns. Look at 10 20 30 40 each tuple and set ADAMS 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 the bit for the ALLEN 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 correct occurrence! BLAKE 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 CLARK 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 FORD 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 Comparison, join and JAMES 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 some aggregate JONES 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 operations reduced KING 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 to bit arithmetic. MARTIN 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 MILLER 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 Good perfromance for SCOTT 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 high cardinality. SMITH 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 Lately even low TURNER 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 cardinality attribute WARD 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 have benefitted (e.g. compression). Joseph Vella © 2013- Data Warehousing - Slide 60 60 Data Warehousing and Dimensional Modelling 27 Prof Joseph G Vella Indexes and Star Queries For a Join Index maintain a bit map index between each fact table and its associated dimensions. To compute a query over multiple dimensions: Take the intersection of join indexes until the set of candidate fact tuples is small; Do foreign key joins wit the rest of the dimension tables; Look up the fact table. To create a bit map index on Oracle - Enterprise edition required :-( use the following syntax: CREATE BITMAP INDEX emp.emp_dwh_second ON (sal, deptno, mgr); Joseph Vella © 2013- Data Warehousing - Slide 61 61 View Materialisation (Ora8i +) Up to now we have considered views as logical objects. We even remarked that it is acceptable for us to have the view query executed on demand each time end users need it! Can we reconsider this since the DWH query’s elapsed time are now significant! How about storing the view’s result for later us? (i.e. view materialisation) Sounds familiar in legacy system circles and our index definition. But what are the problems here! Storage; Maintaining the views’ up-to dateness. – Each change on the view’s base tables has to be reflected in the view materialisation. We could easily implement materialised viewed (through tables, triggers and PL/SQL code) but the DBMS does not “know” about these! Prior to Ora8i we had to design, implement and manage these! Joseph Vella © 2013- Data Warehousing - Slide 62 62 Data Warehousing and Dimensional Modelling 28 Prof Joseph G Vella View Materialisation (ii) What do materialised view (mv) provide for a DWH environment? Improve query elapsed time (e.g. pre-compute joins) and an underlying efficient query re-write mechanism Provide for summaries and cross tabs Provide for the replication of data in and across databases (I.e. synchronises changes across sources) Provide basis for other mv’s definitions - query re-wrires (complex dependencies will make management difficult!?) The mode of using mv in a DWH environment does not interfere with the view update problem! Interaction in a DWH environment is heavily skewed toward read only queries! Nonetheless Oracle’s implementation does allow updates (under strict conditions!). Joseph Vella © 2013- Data Warehousing - Slide 63 63 View Materialisation (iii) How, when and what to materialise is our design problem… at logical and physical level … Aims: create mv that have a high re-use value create mv that have similar data if one needs SELECT job, sum(SAL) FROM emp GROUP BY job; then introduce SELECT job, sum(SAL), count(SAL), min(SAL), max(SAL) FROM emp GROUP BY job; Components: Mechanisms to define metarialised views Refresh (update) mechanism Query Re-write mechanism Interaction Set of procedures in a PL/SQL package (DBMS_OLAP) SQL*Loader for incremental and direct loading. Joseph Vella © 2013- Data Warehousing - Slide 64 64 Data Warehousing and Dimensional Modelling 29 Prof Joseph G Vella View Materialisation (iv) CREATE MATERIALIZED VIEW vname [BUILD {IMMEDIATE | DEFERRED }] [ REFRESH { {FAST | COMPLETE | FORCE } | { ON COMMIT | ON DEMAND }|{ START WITH | NEXT} date } | NEVER REFRESH ] [ QUERY REWRITE ] AS sub_query; Important note: Snapshots have been superseded by mv! Explanations: BUILD IMMEDIATE - view materialisation is executed on submission (default). View in deferred mode will be materialised on a refresh op. REFRESH - This command relates the base data state to when this state is reflected in a view that depends on that data. Fast specifies the incremental method which is coupled to the dimension tables (e.g. in the mv log). Complete implies that any change in the data would trigger a complete re-computation of the mv. On commit implies changes are echoed onto the affected mvs. One can use the Start With variant to associate refreshes a timetable. The QUERY REWRITE is used in query optimisation and processing. Joseph Vella © 2013- Data Warehousing - Slide 65 65 View Materialisation (v) Types of Materilised Views with Joins & Aggregates fast refresh possible for inserts to facts table but not through a mv log Single Table with Aggregates incremental refresh through a mv log possible with only Joins Joseph Vella © 2013- Data Warehousing - Slide 66 66 Data Warehousing and Dimensional Modelling 30 Prof Joseph G Vella View Materialisation (vi) Some examples: CREATE MATERIALIZED VIEW store_sales_mv BUILD DEFERRED REFRESH COMPLETE ON DEMAND ENABLE QUERY REWRITE AS SELECT s.s_n, sum(sales) as sum FROM store s, fact f WHERE f.sk = s.sk GROUP BY s.s_n; Notes: postpone the populating; each time a refresh is invoke the QP needs to re-compute in totality! Joseph Vella © 2013- Data Warehousing - Slide 67 67 View Materialisation (vii) Some examples continued: CREATE MATERIALIZED VIEW store_stdcnt_mv BUILD IMMEDIATE REFRESH FAST ENABLE QUERY REWRITE AS SELECT s.s_n, td.tk, STDEV(sales) AS sd_sales, AVG(sales) AS av_sales, COUNT(sales) AS cn_sales, SUM(SALES) AS sm_sales FROM store s, fact f, time_dim td WHERE s.sk = f.sk AND f.tdk = td.tk GROUP BY s_n, td.tk; Note: It’s possible to refresh FAST as because the sum() and count() functions support the average and standard deviation computation. Joseph Vella © 2013- Data Warehousing - Slide 68 68 Data Warehousing and Dimensional Modelling 31 Prof Joseph G Vella View Materialisation (viii) Some examples … continued. CREATE MATERIALIZED VIEW LOG ON fact WITH ROWID (sk, tk, msales, usales) INCLUDING NEW VALUES; CREATE MATERIALIZED VIEW sum_sales_mv BUILD IMMEDIATE REFRESH FAST ON COMMIT AS SELECT f.sk, f.tk, COUNT(*) cn_grp, SUM(f.msales) AS sum_m_sales, COUNT(f.msales) AS cn_m_sales, SUM(f.u_sales) AS sum_u_sales, COUNT(f.u_sales) AS cn_u_sales FROM fact f GROUP BY f.sk, f.tk; Notes: This is a single table based view materialisation. Also since a log has been defined then fast refreshes on commit are possible. Joseph Vella © 2013- Data Warehousing - Slide 69 69 Support Material Joseph Vella © 2013- Data Warehousing - Slide 70 70 Data Warehousing and Dimensional Modelling 32 Prof Joseph G Vella An Example of Drill Down & Rollup on Time Financial Year Arrows are functional dependencies Financial Periods Financial Year Business Drilldown Rollup Months Financial Seasonal Drilldown Periods Periods Rollup Business Weeks Calendar Months Business Days Joseph Vella © 2013- Data Warehousing - Slide 71 71 An Example of Drill Down & Rollup on related Dimensions Arrows are functional dependencies Joseph Vella © 2013- Data Warehousing - Slide 72 72 Data Warehousing and Dimensional Modelling 33 Prof Joseph G Vella Codd’s OLAP rules (again!? ;-> ) Note: Originally twelve (1993 & 1995). Some serious flaming and ‘accusations’ ensued. The twelve were updated to eighteen. Basic Features: 1. Multi - dimensional conceptual view 2. Intuitive data management (e.g. drilldown & rollup) 3. Accessibility (e.g. single logical view to the user) 4. Batch extraction versus interpretative 5. OLAP analysis models (four types are explicitly mentioned) 6. Client-Server architecture 7. Transparency of front end’s OLAP functionality from end user (a tall order!?) 8. Multi-user support Joseph Vella © 2013- Data Warehousing - Slide 73 73 Codd’s OLAP rules Special Features: 9. Treatment of non-normalised data (calculations should not affect the external data) 10. Storing OLAP results 11. Extraction and handling of missing data (remember nulls!) 12. Handling of missing values. Reporting Features: 13. Good reporting (e.g. on cube’s dims) 14. Consistent reporting performance 15. Automatic adjustment of physical level on changes to data. Dimensional Control: 16. Generic dimensionality 17. Unlimited dimensions and aggregation levels 18. Unrestricted cross dimensional operations Joseph Vella © 2013- Data Warehousing - Slide 74 74 Data Warehousing and Dimensional Modelling 34 Prof Joseph G Vella Kimball’s List of Killer Queries (i) 1. Simple Constraints Constraints against literal constants, such as `Show the sales of candy products in September 1997.” 2. Simple Sub-queries Constraints against a global value found in the data, such as “Show the sales of candy products in September 1997 in those stores that had above average sales of candy products.” 3. Correlated Sub-queries Constraints against a value defined by each output row, for example, "Show the sales of candy products for each month of 1997 in those stores that had above average sales of candy in that month." Joseph Vella © 2013- Data Warehousing - Slide 75 75 Kimball’s List of Killer Queries (ii) 4. Simple Behavioral Queries Constraints against values resulting from an exception report or a complex series of queries that isolate desired behavior, such as "Show the sales of those candy products in September 1997 whose household penetration for our grocery chain in the 12 months prior to September were more than two standard deviations less than the household penetration of the same products across our 10 biggest retail competitors." This query is a variation of the classic opportunity gap analysis. 5. Derived Behavioral Queries Constraints against values found in set operations (union, intersection, and set difference) on more than one complex exception report or series of queries, such as "Show the sales of those candy products identified in example number 4, and which also experienced a merchandise return rate more than two standard deviations greater than our 10 biggest retail competitors." This request is a set intersection of two behavioral queries. Joseph Vella © 2013- Data Warehousing - Slide 76 76 Data Warehousing and Dimensional Modelling 35 Prof Joseph G Vella Kimball’s List of Killer Queries (iii) 6. Progressive Sub-setting Queries Constraints against values, as in number 4, but temporally ordered so that membership in an exception report is dependent on membership in a previous exception report: "Show the sales of those candy products in example number 4 that were similarly selected in August 1997 but were not similarly selected in either June or July 1997.” A health care example of a progressive sub- setting query would be "Show the oldest 100 patients who initially complained of chest pain, then had either treatment A or treatment B, then did not have surgery, and are still alive today.” 7. Classification Queries Constraints on values that are the results of classifying records against a set of defined clusters using nearest neighbor and fuzzy matching logic: "Show the percentage of low-calorie candy sales contained in the 1,000 market baskets whose content most closely matches a young, health-conscious family profile." Joseph Vella © 2013- Data Warehousing - Slide 77 77 Dimensional Queries - Jarke et al. 2000 What is the total amount of receipts recorded last year per state and per product category? What is the relationship between the trend of PC manufacturers' shares and quarter gains over the last five years? Which orders maximize receipts? Which one of two new treatments will result in a decrease in the average period of admission? What is the relationship between profit gained by the shipments consisting of less than 10 items and the profit gained by the shipments of more than 10 items? Joseph Vella © 2013- Data Warehousing - Slide 78 78 Data Warehousing and Dimensional Modelling 36 Prof Joseph G Vella OLAP Example Schema Joseph Vella © 2013- Data Warehousing - Slide 79 79 OLAP - Operations Slice Time Dice 3/28/03 3/27/03 3/26/03 2D 3/25/03 3/24/03 Roll-up Bread Drill-down Soup Filter Product Milk Datacube: convenient way to think about multidimensional data for 3 dimensions Pop Faces of the cube represent the dimensions of the 2 10% 20% 30% 40% grocery store sales analysis for 1 A cell inside the cube lies at the intersection of the dimensions. The cell represents a row in the fact table. Promotion Common operations are: slice, dice,roll-up, drill- down, and filter. Joseph Vella © 2013- Data Warehousing - Slide 80 80 Data Warehousing and Dimensional Modelling 37 Prof Joseph G Vella OLAP - Slice Slice: The effect is to view a subset of the original data. Time 3/28/03 3/27/03 3/26/03 3/25/03 3/24/03 Bread Soup Product Milk Pop 2 10% 20% 30% 40% for 1 Promotion Joseph Vella © 2013- Data Warehousing - Slide 81 81 OLAP - Dice Time 3/28/03 3/27/03 3/26/03 A dice is an even 3/25/03 3/24/03 smaller subset of data. Bread Notice Soup promotion along x Product product along y Milk time along z Pop 2 10% 20% 30% 40% for 1 Promotion Joseph Vella © 2013- Data Warehousing - Slide 82 82 Data Warehousing and Dimensional Modelling 38 Prof Joseph G Vella OLAP - 2D Promotion Insert the Product spreadsheet corresponding to the All cube pivot. Date All promotion along x 3/24/2003 $ 150.95 date along y 3/24/2003 $ 575.00 product along z 3/25/2003 $ 319.13 3/26/2003 $ 95.70 3/27/2003 $ 223.82 Joseph Vella © 2013- Data Warehousing - Slide 83 83 OLAP - Roll-Up Promotion Product Although the interface looks like a All spreadsheet, it will Date All allow analyst to perform ALL $ 127,384,020.54 datacube operations through click actions. Joseph Vella © 2013- Data Warehousing - Slide 84 84 Data Warehousing and Dimensional Modelling 39 Prof Joseph G Vella OLAP - Drill-down Promotion Product 2 for 1 on date, promotion Date Soup and product 3/24/2003 $ 5,476.80 3/25/2003 $ 7,124.00 3/26/2003 $ 8,467.71 3/27/2003 $ 2,800.35 3/28/2003 $ 3,515.14 Joseph Vella © 2013- Data Warehousing - Slide 85 85 OLAP - Filter Promotion Product 2 for 1 Date Soup Exclude amounts < 3/24/2003 $ 5,476.80 $4,000.00 3/25/2003 $ 7,124.00 3/26/2003 $ 8,467.71 3/27/2003 $ 2,800.35 3/28/2003 $ 3,515.14 Joseph Vella © 2013- Data Warehousing - Slide 86 86 Data Warehousing and Dimensional Modelling 40 Prof Joseph G Vella Joseph Vella © 2013- Data Warehousing - Slide 87 87 Joseph Vella © 2013- Data Warehousing - Slide 88 88 Data Warehousing and Dimensional Modelling 41 Prof Joseph G Vella Star Schema Patient Physician #Patient ID #Physician ID Service Patient Name Physician Name Address Specialty ID #Service Code Age Credential ID Service Description Sex #Category Code Insurance ID Claim # Physician ID Payer # Patient ID # Service Code Time Periods #Payer ID Name # Payer ID #Claim Date Address # Claim Number Year Phone Number # Line Item Number Month EDI Number # Claim Date Quarter Date of Services Week Amount of Charge Unit of Services Joseph Vella © 2013- Data Warehousing - Slide 89 89 Joseph Vella © 2013- Data Warehousing - Slide 90 90 Data Warehousing and Dimensional Modelling 42 Prof Joseph G Vella Comparative models! – data cubes Joseph Vella © 2013- Data Warehousing - Slide 91 91 Comparative models! – star schema Joseph Vella © 2013- Data Warehousing - Slide 92 92 Data Warehousing and Dimensional Modelling 43 Prof Joseph G Vella Comparative models! – snowflake schema Joseph Vella © 2013- Data Warehousing - Slide 93 93 Data Warehousing and Dimensional Modelling 44