Data Warehouses Design PDF
Document Details
Uploaded by LegendaryTantalum
Tags
Summary
This document provides an overview of data warehousing design, covering topics such as data modeling, dimensional modeling, fact and dimension tables, and various data warehousing architectures.
Full Transcript
DATA MODELING CHAP 3 Data Warehouse Design (Conception des Entrepôts de données) ARCHITECTURE TYPE DSS INTRODUCTION La perception de la modélisation dimensionnelle a été développée par Ralph Kimball et est composée de tableaux de « faits » et de « dimension »....
DATA MODELING CHAP 3 Data Warehouse Design (Conception des Entrepôts de données) ARCHITECTURE TYPE DSS INTRODUCTION La perception de la modélisation dimensionnelle a été développée par Ralph Kimball et est composée de tableaux de « faits » et de « dimension ». Kimball vs Inmon architecture MODELISATION DIMENSIONELLE Dimensional Modeling (DM) is a data structure technique optimized for data storage in a Data warehouse. The purpose of dimensional modeling is to optimize the database for faster retrieval of data. The concept of Dimensional Modelling was developed by Ralph Kimball and consists of “fact” and “dimension” tables. FACTS AND DIMENSIONS Facts and dimensions form the core of any business intelligence effort. These tables contain the basic data used to conduct detailed analyses and derive business value. FACTS AND FACT TABLES What is a Fact? Facts are the key metrics used to measure the business performance Sales information; Production information; Inventory Information Fact tables contain the data corresponding to a particular business process. Each row represents a single event associated with that process and contains the measurement data associated with that event. Example of retail organization: might have fact tables related to customer purchases, customer service telephone calls, and product returns. The customer purchases table would likely contain information about the amount of the purchase, any discounts applied and the sales tax paid. Types of Facts Additive Semi additive Non additive An additive Fact is a measure that can be summarized across all dimensions. Ex: Sales Because we can summarized sales about products, customers, time An additive Fact can be aggregated along all dimensions. Aggregated data a data combined from several measurements. A semi Additive Fact is a measure that can be summarized across some dimensions (not all). Ex: Inventory inventory can be aggregated along all stores of the geography to get country inventory. (Products, geography, etc) But inventory cannot be aggregated along customer dimension A non additive fact is a measure that is not aggregatable. A non additive measure is a calculated measure. Ex: Percent profit We can calculate percent profit for a product We can calculate percent profit for a category product (3% p1 +4%p2 +7%p3 =? 14%(group category p1,p2,p3) NO! because we cannot add up product profit this way to get product profit percent for that category. GRANULARITY AND DESIGN ISSUES Granularity refers to the level of details that the fact has. The developer designing the purchase fact table described above would need to decide, for example, whether the grain of the table is a customer transaction or an individual item purchase. In the case of an individual item purchase grain, each customer transaction would generate multiple fact table entries, corresponding to each item purchased. The level of details in the design is imposed by the organization. So the granularity is determined based on business needs. Don’t load in the DW more details than your business requires. Primary key should be composite of all dimension keys DIMENSIONS AND DIMENSION TABLES Definitions: Dimensions describe the objects involved in a business intelligence effort. While facts correspond to events, dimensions correspond to people, items, or other objects. For example, in the retail example, we discussed that purchases, returns, are facts. On the other hand, customers, employees, items and stores are dimensions and should be contained in dimension tables. Types of Dimensions: - Conformed and non conformed dimensions - Generated dimensions CONFORMED AND NON CONFORMED DIMENSIONS A conformed dimension is a dimension shared by multiple facts tables. There is only one version of conformed table for a particular object It is used when all business users have the same definitions for the dimension All Product dim attributes should be of the same definition for all categorization of the business. A non Conformed dimension is defined where parts of the business use only some of the attributes of the dimension, and a different part of the business use a different set of attributes for that dimension. There maybe cross over, but there’s really 2 definitions for the dimension depending on who uses it. Target single fact table Used when dimensions have different definitions for different business units. S C H E M A D E S I G N : STAR V S S N O W F L A K E Star and Snowflake schemas explain how Dimension and Fact tables are related. Performance consideration Star schema requires de-normalization during the load process In the Snowflake schema, the E T L process use normalized information to the DW. This sometimes increase dimension complexity. (This can affect the performance of the cube during complex queries ) MODÈLE EN ÉTOILE (2) Sources : Lydie Soler, AgroTechParis C H A R A C T E R I S T I C S O F T H E FA C T TA B L E The fact table includes numerical values of what we measure. For example, a fact value of 20 might means that 20 widgets have been sold. Each fact table includes the keys to associated dimension tables. These are known as foreign keys in the fact table. Fact tables typically include a small number of columns. When it is compared to dimension tables, fact tables have a large number of rows. CHARACTERISTICS OF THE DIMENSION TABLE Dimension tables contain the details about the facts. That, as an example, enables the business analysts to understand the data and their reports better. The dimension tables include descriptive data about the numerical values in the fact table. That is, they contain the attributes of the facts. For example, the dimension tables for a marketing analysis function might include attributes such as time, marketing region, and product type. Since the record in a dimension table is denormalized, it usually has a large number of columns. The dimension tables include significantly fewer rows of information than the fact table. The attributes in a dimension table are used as row and column headings in a document or query results display. DEGENERATED DIMENSIONS Degenerated Dimensions A degenerate dimension is a dimension which is derived from the fact table and doesn’t have its own dimension table Used by a single fact table Dimension value is stored directly in the fact table Slowly Changing Dimensions Slowly changing dimensions are dimensions where historical attribute values are retained. It is used when the business doesn’t want to lose track of what actually happened. Ex: Customer moving from one city to another. T OP -D OWN A P P R O A C H Often known as Inmon’s Top Down Approach , data warehouse is built first. Inmon defines a data warehouse as a centralized repository for the entire enterprise.Dimensional data marts are created only after the complete data warehouse has been created. Thus, the data warehouse is at the center of the corporate information factory (CIF), which provides a logical framework for delivering business intelligence. BOTTOM UP- APPROACH Often called as Kimball’s bottom up approach, the most important business aspects or departments, data marts are created first. A data mart provide a thin view into the organisational data and addresses a single business area. These data marts are then integrated into larger data warehouse to build a complete data warehouse. TOP DOWN VS B OT TO M U P ADVANTAGES AND DISADVANTAGES –T OP D O W N Advantages Disadvantages It is easier to maintain Top Down It represents a very large Design project and the cost of Provides consistent dimensional views of data across data marts, implementing the project is as all data marts are loaded from significant. the data warehouse. It is time consuming and more This approach is robust against business changes. Creating a new time required for initial set up data mart from the data Highly skilled people required warehouse is very easy. Initial cost is high but subsequent for set up project development cost is lower ADVANTAGES AND DISADVANTAGES -B OTTOM – U P advantages disadvantages This model contains consistent Initial cost is low but each data marts and these data marts subsequent phase will cost can be delivered quickly. same The data marts are created first to provide reporting capability The positions of the data It is easier to extend the data warehouse and the data marts warehouse as it can easily are reversed in the bottom-up accommodate new business units. approach design. It is just creating new data marts and then integrating with other It is difficult to maintain and data marts. often redundant and subject to This Approach take less time. revisions Initial set up is very quickly ARCHITECTURE BUS A bus architecture is composed of a set of tightly integrated data marts that get their power from conformed dimensions and fact tables B U S MATRIX The Bus Matrix defines part of the Data Warehouse Bus Architecture and is an output of the Business Requirements phase in The Kimball Lifecycle. It is applied in the following phases of dimensional modeling and development of the Data Warehouse. CONCEPTS AVANCÉS Constellation (1) Série d’étoiles Fusion de plusieurs modèles en étoile qui utilisent des dimensions communes Plusieurs tables de fait et tables de dimensions, éventuellement communes 2 Constellation (2) 38 Sources : http://gankahhwee.com Modèle en flocon (2) 48 Sources : Lydie Soler, AgroTechParis OLAP On-Line Analytical Processing (OLAP), désigne l'ensemble des technologies permettant la prise de décision stratégique rapide et fiable sur des données modélisées en multi dimensionnel. les technologies OLAP se basent sur les bases de données multi dimensionnelles (conçues en dimensions et en faits) pour proposer des techniques d'analyse stratégique. OLAP OLAP est le fait de faire des analyses sur des bases de données multi dimensionnelles. X-OLAP définit la façon dont seront stockées physiquement les données pour permettre des analyses multi dimensionnelles. Differentes Technologies: R-OLAP stocke les données multi dimensionnelles dans un format relationnel (tables, relations), M-OLAP les stockent dans un format multi dimensionnel réel, H-OLAP utilise les méthodes(R et M) pour le stockage donc Hybride, D-OLAP stocke les données en local pour l'analyse. Desktop OLAP/ R-OLAP Relational OLAP. Comme son nom l'indique, il utilise le concept relationnel pour stocker des données modélisées dans le format multi dimensionnel. Les analyses (drill-down, pivot, ajout de dimensions, etc.) sont transformées en requêtes SQL classiques qui sont exécutées sur les tables. R-OLAP utilise aussi la notion de tables d'agrégats, c'està- dire créer des tables contenant des données sommaires et les stocker en mémoire en cas d'utilisation. R-OLAP reste la solution de choix dans le cas de gros volumes de données avec un accès restreint ROLAP (2) 41 Sources : EPFL, Lausanne ROLAP (1) Les données sont stockées dans une BD relationnelle Un moteur OLAP permet de simuler le comportement d’un SGBD multidimensionnel Avantages : Facile à mettre en place Peu couteux Evolution facile Stockage de gros volumes Inconvénients : Moins performant lors des phases de calculs Exemple de moteur ROLAP : Mondrian 3 OLAP/ M-OLAP Multi dimensional OLAP permet de stocker les données directement en un format permettant des opérations matricielles, donnant la capacité à effectuer des calculs très poussés en un temps record car tous les calculs sont précompilés ! Ce mode de stockage permet de pré calculer les résultats afin d'avoir accès directement à toute donnée, quel que soit le niveau de détail. M-OLAP reste la meilleure solution en terme de performances et d'efficacité. MOLAP (2) 43 Sources : EPFL, Lausanne MOLAP (1) Les données sont stockées comme des matrices à plusieurs dimensions : Cube[1:m,1:n,1:p](mesure) Accès direct aux données dans le cube Avantages : Rapidité Inconvénients : Difficile à mettre en place Formats souvent propriétaires Ne supporte pas de rtès gros volumes de données Exemple de moteurs MOLAP : Microsoft Analysis Services Hyperion 42 HOLAP (2) 45 Sources : EPFL, Lausanne HOLAP (1) Solution hybride entre ROLAP et MOLAP Données de base stockées dans un SGBD relationnel (tables de faits et de dimensions) + données agrégées stockées dans un cube Avantages / inconvénients : Bon compromis au niveau des coûts et des performances (les requêtes vont chercher les données dans les tables et le cube) 44 O P E R AT I O N S O L A P Afin de rendre l'analyse la moins contraignante et la plus souple possible, l'OLAP propose des opérateurs. Il s'agit de mécanismes servant à naviguer dans les hiérarchies et les dimensions. SLICING ( Tailler)- extraire un tranche d’un cube d’information DICING ( projection)-obtenir un sous cube avec toutes les dimensions ROTATATE (Pivoter )- interchanger deux dimensions DRILL DOWN (Forer )- derouler les informations d’une dimension ROLL UP ( Remonter)- synthetise les informations d’une dimension DRILL THROUGH (Percer) accede aux details en disposant que des agregats DRILL ACROSS (Forer lateralement) en restant au même niveau de dimension, permet de changer l'une des valeurs. Par exemple, passer de l'année 1998 à l'année 1999 OLAP CUBE OLAP- Data CUBE OLAP CUBE VS DATA MART STAR STAR CUBE Has Facts and Dim Has Facts and Dim, but the Focused on specific position of value dictate the analysis needs meaning Has date dim, so can More memory efficient can load analyze old and new data much measures and quickly process but difficult to add new Easy to store value uses dim , new data or change rdb environment granularity , need to rebuild Possible to add new facts, new cube. dimensions or attributes Can have sparse cube as they No wasted space in storage, store numeric value in all the new rows add if event to dimensions and not pointers place which can be absent if no record registered. (waste of space) Disadvantages of CUBES Not suitable for Ad-hoc queries. No real-time ability. If you have a real-time or NRT (near-real-time) DW, the cube unfortunately is not and has to be rebuilt in its entirety before you can analyze the latest data. (to be built daily or weekly) Size: Every attribute you add to a cube increases its size and not just a little Size of source data. If your organization creates millions of rows per week / day, to build and maintain a cube on this information can be a nightmare. (Fastidious) Maintenance: If you need several hours to load the daily data, you will need considerably more as your implementation matures, to keep refreshing the cubes with this growing data monster. Complexity. You cannot use SQL to query cubes, you will need to use MDX (Multidimensional Expressions) EXAMPLE OF SIMPLE MDX QUERY WITH MEMBER [Measures].[Total Ventes] AS [Measures].[Montant des Ventes] +[Measures].[Montant de la taxe] SELECT { [Measures].[Total Ventes], [Measures].[Montant des Ventes] } ON COLUMNS, { [Product].[Categrie de Produits.[Members } ON ROWS FROM [CubeDesVentes] WHERE ( [Date]. [Annee du Calendrier]. ) QUELQUES SOLUTIONS COMMERCIALES 53 EXERCICE On considère un entrepôt de données permettant d’observer les ventes de produits d’une entreprise. Le schéma des tables est le suivant : CLIENT (id-client, région, ville, pays, département) PRODUIT (id-prod, catégorie, coût-unitaire, fournisseur, prix- unitaire, nom-prod) TEMPS (id-tps, mois, nom-mois, trimestre, année) VENTE (id-prod, id-tps, id-client, date-expédition, prix-de-vente, frais-de-livraison) Questions 1. Indiquer quelles sont la (les) table(s) de fait et les tables de dimension de cet entrepôt. 2. Donner pour chaque dimension, sa (multi-) hiérarchie. 3. Donner la représentation du schéma en étoile de l’entrepôt selon la notation de Golfarelli. 4. On veut transformer ce schéma en schéma en flocon. Donner la nouvelle représentation de la table TEMPS (ajouter des paramètres / attributs, si nécessaire) 54