Data Warehousing Concepts PDF
Summary
This document provides an overview of data warehousing concepts, including definitions, characteristics, architectures such as the centralized and federated approaches, and design methodologies such as the Kimball and Inmon approaches. It also covers ETL processes and data design elements.
Data warehousing

Outline

Part 1:
I. Introduction to Data Warehousing
II. Architecture of Data Warehousing
III. Design and Modeling in Data Warehousing

Part 2:
IV. ETL Processes in Data Warehousing
   A. Extracting Data
      1. Data Extraction Techniques
      2. Data Profiling
   B. Transforming Data
      1. Data Cleaning and Quality
      2. Data Integration
   C. Loading Data

Data warehouse: Definition

A data warehouse is a centralized repository that integrates and stores large volumes of structured, historical data from various sources within an organization. It is designed to support business intelligence (BI) activities, including reporting, analysis, and decision-making processes.

Data warehouses provide a consolidated view of an organization's data, allowing users to analyze trends, identify patterns, and gain valuable insights that can inform strategic and operational decisions. They play a crucial role in business intelligence by providing decision-makers with a unified and consistent view of historical data.

Key characteristics

Key characteristics of a data warehouse include:

Subject-Oriented: Data warehouses are organized around specific business subjects or areas, such as sales, finance, or customer relations, to support analytical queries and reporting within those domains.

Integrated Data: Data from disparate sources, such as transactional databases, spreadsheets, and external systems, is integrated and transformed to ensure consistency and coherence in the warehouse. This integration process is often facilitated through ETL (Extract, Transform, Load) procedures.

Time-Variant: Data in a data warehouse is time-stamped, allowing users to analyze trends and changes over time. This time-variant aspect enables historical analysis and reporting.

Non-Volatile: Unlike operational databases that are frequently updated with transactional data, a data warehouse is non-volatile.
Once data is loaded into the warehouse, it is typically not updated or deleted, ensuring a stable environment for analytical processing.

Optimized for Query and Reporting: Data warehouses are structured and indexed for efficient querying and reporting. They often use dimensional schemas, such as the denormalized star schema or its partly normalized snowflake variant, to simplify and accelerate analytical queries.

Data warehouse VS Database

Purpose
  Data warehouse: Primarily designed for analytical processing and business intelligence. It is optimized for complex queries and reporting.
  Database: Designed for transactional processing and day-to-day operations. Focus is on efficient data retrieval, insertion, and updating.

Data Types
  Data warehouse: Stores large volumes of historical, structured data. Often includes data from multiple sources within the organization.
  Database: Stores operational data, often in real time. Primarily contains current and frequently updated information.

Schema Design
  Data warehouse: Uses specialized schemas like star schema or snowflake schema for efficient querying and reporting.
  Database: Typically uses normalized schemas to reduce redundancy and maintain data integrity. Normalization helps in transactional processing.

Data Integration
  Data warehouse: Involves the integration of data from various sources using ETL (Extract, Transform, Load) processes to ensure consistency and coherence.
  Database: May store data from a specific application or domain. Integration is focused on maintaining consistency within the operational context.

Data Volatility
  Data warehouse: Non-volatile; historical data is stored and rarely updated. Changes typically involve adding new data rather than modifying existing records.
  Database: Volatile; data is frequently updated and modified as part of ongoing transactions.

Query Optimization
  Data warehouse: Optimized for complex queries.
  Database: Optimized for fast retrieval and updating of individual records.

User Base
  Data warehouse: Primarily used by analysts, data scientists, and decision-makers for in-depth analysis, reporting, and business intelligence activities.
  Database: Used by application developers, system administrators, and operational staff for day-to-day application support and transactional processing.

Data Processing
  Data warehouse: Online Analytical Processing (OLAP)
  Database: Online Transaction Processing (OLTP)

OLTP VS OLAP

Data warehouses are tailored for analytical processing, historical analysis, and business intelligence, whereas databases are focused on supporting transactional processing and day-to-day operations.

Main Components of a Data Warehouse

A data warehouse comprises several components that work together to facilitate the storage, integration, and retrieval of large volumes of data for analytical processing. The main components of a data warehouse include:

1. Data Sources: These are systems or applications that generate and store data. Data sources can include operational databases, external data feeds, spreadsheets, and other repositories.

2. ETL (Extract, Transform, Load) Processes: ETL processes are responsible for extracting data from various sources, transforming it to conform to the data warehouse's structure and quality standards, and loading it into the data warehouse.

3. Data Warehouse Database: The central repository that stores the integrated and transformed data. It is optimized for analytical querying and reporting. Data warehouses often use specialized database management systems (DBMS) designed for analytical workloads.

4. Data Marts: Data marts are subsets of the data warehouse that focus on specific business functions or departments. They are often designed for the needs of a particular group of users.

5. OLAP (Online Analytical Processing) Servers: OLAP servers enable users to interactively analyze and explore data in a multidimensional way.
OLAP provides capabilities for slicing and dicing data, drilling down into details, and performing complex analyses.

Design and Modeling in Data Warehousing

Data warehouse modeling involves designing the structure and organization of data within a data warehouse to facilitate efficient querying, reporting, and analysis. The goal is to provide a clear and optimized representation of data that supports business intelligence and decision-making.

Dimensional modeling is better suited to business intelligence (BI) applications and data warehousing (DW) than the normalized modeling used for transactional databases.

The key concepts in dimensional modeling are facts, dimensions, and attributes. All these concepts can be organized in several ways, called schemas.

Dimensional modeling overview

The fact table Tbl_Fact_Store_Sales is at the core of the dimensional model. Four surrounding dimensions define the store sales and put them into context:
- Tbl_Dim_Item: what products were sold
- Tbl_Dim_Date: when those products were sold
- Tbl_Dim_Customer: who bought the products
- Tbl_Dim_Buyer: who bought the product for the store

Key concepts: Fact Tables

A fact is a measurement of a business activity, such as a business event or transaction, and is generally numeric. Examples of facts are sales, expenses, and inventory levels.

Fact tables are composed of two types of columns: keys and measures.

The first type, the key columns, consists of a group of foreign keys (FK) that point to the primary keys of the dimension tables associated with this fact table to enable business analysis. Each dimension row can relate to many fact rows, so the relationship from a dimension to the fact table is one-to-many.

The second type of column holds the actual measures of the business activity, such as the sales revenue and order quantity. Every measurement has a grain, which is the level of detail in the measurement of an event, such as a unit of measure or currency used.

Fact Tables: Example

[Figure: example fact table. Its primary key is a surrogate key, and it holds several measures.]
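The fact table structure described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names Tbl_Fact_Store_Sales, Tbl_Dim_Item, and Tbl_Dim_Date come from the overview, but every column name and all sample values are hypothetical, and only two of the four dimensions are shown to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Column names below are assumptions made for illustration.
conn.executescript("""
CREATE TABLE Tbl_Dim_Item (
    item_key  INTEGER PRIMARY KEY,
    item_name TEXT NOT NULL
);
CREATE TABLE Tbl_Dim_Date (
    date_key  INTEGER PRIMARY KEY,
    full_date TEXT NOT NULL
);
CREATE TABLE Tbl_Fact_Store_Sales (
    sales_key      INTEGER PRIMARY KEY,  -- surrogate key
    item_key       INTEGER NOT NULL REFERENCES Tbl_Dim_Item(item_key),  -- FK
    date_key       INTEGER NOT NULL REFERENCES Tbl_Dim_Date(date_key),  -- FK
    order_quantity INTEGER NOT NULL,     -- measure
    sales_revenue  REAL NOT NULL         -- measure
);
""")

conn.execute("INSERT INTO Tbl_Dim_Item VALUES (1, 'milk')")
conn.execute("INSERT INTO Tbl_Dim_Date VALUES (20010101, '2001-01-01')")

# One dimension row, many fact rows: the one-to-many relationship.
insert_fact = (
    "INSERT INTO Tbl_Fact_Store_Sales "
    "(item_key, date_key, order_quantity, sales_revenue) VALUES (?, ?, ?, ?)"
)
conn.executemany(insert_fact, [(1, 20010101, 2, 3.0), (1, 20010101, 1, 1.5)])

# A fact row pointing at a non-existent dimension row is rejected.
try:
    conn.execute(insert_fact, (99, 20010101, 1, 1.0))
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

total_revenue = conn.execute(
    "SELECT SUM(sales_revenue) FROM Tbl_Fact_Store_Sales"
).fetchone()[0]
print(fk_rejected, total_revenue)
```

The surrogate key here is an integer generated by the database rather than taken from the source system, which is one common way to implement it.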
Key concepts: Dimension

A dimension is an entity that establishes the business context for the measures (facts) used by an enterprise. Dimensions define the who, what, where, and why of the dimensional model, and group similar attributes into a category or subject area. Examples of dimensions are product, geography, customers, employees, and time.

Whereas facts are numeric, dimensions are descriptive in nature (although some of those descriptions, such as a product list price, may be numeric). Creating a dimension lets the attributes that describe the facts be stored in a single place.

Dimensions keep the database from being overrun with redundant data. With all the attributes in a dimension table, they don't have to be repeated in the fact tables.

Example: Take Amazon. The data for an individual sale will contain the product identification number, but will not repeat all the attributes of the product (color, description, reviews, etc.). Those attributes are in a dimension, and each individual sale of that product just points to them.
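The idea in the example above, where each sale stores only a product key and the attributes live once in the dimension, can be sketched as follows. The table names dim_product and fact_sale, their columns, and the sample rows are all hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables: product attributes are stored once in the dimension,
# and each sale only carries the product key.
conn.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        TEXT,
    color       TEXT,
    category    TEXT
);
CREATE TABLE fact_sale (
    sale_key    INTEGER PRIMARY KEY,
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER
);
INSERT INTO dim_product VALUES
    (1, 'kettle', 'red',   'kitchen'),
    (2, 'lamp',   'white', 'lighting');
INSERT INTO fact_sale (product_key, quantity) VALUES (1, 2), (1, 1), (2, 5);
""")

# Dimension attributes are used to group the measure,
# without ever being copied into the fact rows.
rows = conn.execute("""
    SELECT p.category, SUM(f.quantity)
    FROM fact_sale AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(rows)
```

If the color or category of a product changes, only the single dimension row needs to be updated; the sales rows are untouched.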
From a business perspective, the key purpose of dimensions is to use their attributes to filter and analyze data based on performance measures.

Dimensions are used for:
- Selection of data
- Grouping of data at the right level of detail

Dimensions consist of dimension values:
- Product dimension has values "milk", "cream", ...
- Time dimension has values "1/1/2001", "2/1/2001", ...

Dimension values may have an ordering:
- Used for comparing cube data across values
- Especially used for the Time dimension

Dimensions have hierarchies with levels:
- Typically 3-5 levels (of detail)
- Dimension values are organized in a tree structure
- Product: Product -> Type -> Category
- Store: Store -> Area -> City -> County
- Time: Day -> Month -> Quarter -> Year
- Dimensions have a bottom level and a top level

Levels may have attributes:
- Simple, non-hierarchical information
- Day has Workday as attribute

Dimensions should contain much information:
- Time dimension may contain holiday, season, events, ...
- Good dimensions have 50-100 or more attributes/levels

Dimensional model: Example

Example: sales of supermarkets
- Facts and measures: each sales record is a fact, and its sales value is a measure
- Dimensions: group correlated attributes into the same dimension; each sales record is associated with its values of Product, Store, and Time

Granularity: Dimensionality Hierarchy

- Granularity of facts is important: it is the level of detail, given by the combination of the bottom levels of the dimensions
- A dimensional hierarchy defines mappings from a set of lower-level concepts to higher-level concepts

Data Warehouse Design

A schema is a logical description of the entire database. An operational database typically uses a normalized relational model, while a data warehouse uses a star, snowflake, or fact constellation schema.

Star Schema

In a star schema, there is a central fact table surrounded by dimension tables.
Each dimension in a star schema is represented by only one dimension table. The fact table contains numerical measures (such as sales or revenue), and the dimension tables provide descriptive information about the measures. Each dimension table contains the set of attributes for its dimension.

[Figure: Star Schema: Example]

Snowflake schema

Snowflake schema is an expanded version of a star schema in which dimension tables are normalized into several related tables.

Advantages:
- Small saving in storage space
- Normalized structures are easier to update and maintain

Disadvantages:
- A schema that is less intuitive
- Browsing through the content is more difficult
- Degraded query performance because of additional joins

[Figure: Snowflake schema: Example]

Fact constellation schema

A fact constellation has multiple fact tables. It is also known as a galaxy schema.

[Figure: fact constellation schema with two fact tables, Sales and Inventory]

From the Data Warehouse to Data Marts

- A data mart contains only those data that are specific to a particular group. For example, the marketing data mart may contain only data related to items, customers, and sales.
- Data marts are confined to subjects.
- Data marts are small in size.
- Data marts are customized by department.

[Figure: The complete Decision Support System]

[Figure: DWH Architecture]

Types of Data Warehousing Architectures

1. Centralized Data Warehouse: a single, unified repository that stores and manages data from various sources within an organization. It serves as a centralized and integrated platform for business intelligence and decision-making.

2. Data Marts: smaller, specialized subsets of a data warehouse that focus on specific business areas, departments, or user groups. They are designed to meet the needs of a particular set of users with common interests.

3. Federated Data Warehouse: an architecture that integrates data from multiple independent data sources without physically consolidating the data into a central repository. It enables distributed data access and processing.

4. Hybrid Data Warehouse: combines elements of both centralized and distributed architectures. It may involve a mix of on-premises and cloud-based solutions, as well as a combination of centralized and federated approaches.

[Figure: Extraction Transformation Loading (ETL) tools]

Data architecture VS Data modeling

Data architecture applies to the higher-level view of how the enterprise handles its data, such as how it is categorized, integrated, and stored.

Data modeling applies to very specific and detailed rules about how pieces of data are arranged in the database.

Where data architecture is the blueprint for your house, data modeling is the instructions for installing a faucet.

Kimball Approach

Kimball emphasizes the use of dimensional modeling, creating star or snowflake schemas. This approach focuses on designing the data warehouse based on business processes and user requirements.

It follows a bottom-up development approach, starting with the creation of data marts that address immediate business requirements. These data marts are then integrated to form the complete data warehouse.

Kimball's approach involves the use of Extract, Transform, Load (ETL) processes that are specifically designed for dimensional models. This ensures the transformation of source data into a format optimized for reporting and analysis.

Inmon's Approach

Inmon advocates the creation of a centralized Enterprise Data Warehouse (EDW) as the foundation. This EDW serves as a single, integrated repository for the entire organization.

Inmon's approach follows a top-down development methodology. It begins with the creation of an enterprise-wide data warehouse and then focuses on building data marts to meet specific business needs.
Kimball VS Inmon's Approach

Philosophy:
  Kimball: Business-driven, iterative, and agile.
  Inmon: Enterprise-centric, normalized, and long-term.

Data Model:
  Kimball: Dimensional modeling, star or snowflake schemas.
  Inmon: Normalized data model, 3NF.

Development Approach:
  Kimball: Bottom-up development, starting with data marts.
  Inmon: Top-down development, starting with the enterprise data warehouse.

Data Marts:
  Kimball: Considers data marts as primary deliverables.
  Inmon: Views data marts as subsets of the enterprise data warehouse.

Flexibility:
  Kimball: Agile and adaptable to changing business needs.
  Inmon: Emphasizes a stable and scalable architecture for long-term use.

Kimball approach: Main steps

1. Choose the subject: Clearly define the business objectives and scope of the data warehouse project.

2. Requirements Gathering: Collaborate closely with business users to gather their reporting and analysis requirements.

3. Dimensional Modeling: Develop dimensional models using star or snowflake schemas. Identify dimensions and facts.

4. ETL Design and Development: Create Extract, Transform, Load (ETL) processes based on dimensional models.

5. Data Mart Development: Develop data marts as subsets of the data warehouse, addressing specific business needs.

6. Business Intelligence Tools Integration: Choose and integrate business intelligence tools compatible with dimensional models.
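Step 4 above, an ETL process built around a dimensional model, can be sketched in a few lines of Python. This is a minimal illustration and not a prescribed Kimball implementation: the source rows, the table names dim_product and fact_sales, the column names, and the cleaning rules are all assumptions made for the example.

```python
import sqlite3

# Hypothetical source rows, standing in for an export from an operational system.
SOURCE_ROWS = [
    {"product": " Milk ", "sold_on": "2001-01-01", "amount": "3.00"},
    {"product": "milk", "sold_on": "2001-01-02", "amount": "1.50"},
    {"product": "Cream", "sold_on": "2001-01-02", "amount": ""},  # missing measure
]


def extract():
    """Extract: read the raw rows from the (hypothetical) source."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transform: drop rows without a measure and normalize names and types."""
    clean = []
    for row in rows:
        if not row["amount"].strip():
            continue  # assumed rule: skip records with no sales amount
        clean.append({
            "product": row["product"].strip().lower(),
            "sold_on": row["sold_on"],
            "amount": float(row["amount"]),
        })
    return clean


def load(conn, rows):
    """Load: fill the dimension first, then the fact rows that point to it."""
    for row in rows:
        conn.execute(
            "INSERT OR IGNORE INTO dim_product (name) VALUES (?)",
            (row["product"],),
        )
        product_key = conn.execute(
            "SELECT product_key FROM dim_product WHERE name = ?",
            (row["product"],),
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO fact_sales (product_key, sold_on, amount) VALUES (?, ?, ?)",
            (product_key, row["sold_on"], row["amount"]),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        TEXT UNIQUE
);
CREATE TABLE fact_sales (
    sales_key   INTEGER PRIMARY KEY,
    product_key INTEGER REFERENCES dim_product(product_key),
    sold_on     TEXT,
    amount      REAL
);
""")

load(conn, transform(extract()))

dim_count = conn.execute("SELECT COUNT(*) FROM dim_product").fetchone()[0]
fact_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
total_amount = conn.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
print(dim_count, fact_count, total_amount)
```

The two spellings of "milk" collapse into a single dimension row during the transform step, which is the kind of integration work the ETL stage is meant to handle before data reaches the fact table.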