Data Warehouse Lecture Notes PDF
Document Details
Uploaded by Deleted User
Yarmouk University
Dr. Rafat Hammad
Tags
Summary
These lecture notes cover topics related to data warehousing from Yarmouk University. The topics discussed include data warehouse architectures and design methodologies.
Full Transcript
Data Engineering and Analytics YARMOUK UNIVERSITY FACULTY OF INFORMATION TECHNOLOGY AND COMPUTER SCIENCES...
Data Engineering and Analytics YARMOUK UNIVERSITY FACULTY OF INFORMATION TECHNOLOGY AND COMPUTER SCIENCES DA 330: Data Engineering and Analysis Topic 2: Data Warehouse Dr. Rafat Hammad 1 Acknowledgements: Most of these slides have been prepared based on various online tutorials and presentations, with respect to their authors and adopted for our course. Additional slides have been added from the mentioned references in the syllabus 1 TOPIC 2 : OUTLINE ❑ Operation vs. Strategic Systems ❑ Data Warehouse Architectures ❑ Data Warehouse Design Methodologies Dr. Rafat Hammad 2 2 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics OPERATION VS. STRATEGIC SYSTEMS Operation systems gather, store, and process all the data needed to successfully perform the daily routine operations. They provide online information and produce a variety of reports to monitor and run the business. Strategic decision-support systems are not for running the day-to-day operations of the business. It is to formulate the business strategies, establish goals, set objectives, and monitor results. Here are some examples of business objectives: Gain market share by 10% in the next 3 years Enhance customer service level in shipments Increase sales by 15% in the Northeast Division Dr. Rafat Hammad 3 3 OPERATION VS. STRATEGIC SYSTEMS (CONT.) Dr. Rafat Hammad 4 4 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics OPERATION VS. STRATEGIC SYSTEMS (CONT.) Dr. Rafat Hammad 5 5 CHARACTERISTICS OF STRATEGIC SYSTEMS Dr. Rafat Hammad 6 6 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics BUSINESS INTELLIGENCE (BI) BI is “A set of processes, architectures, and technologies that convert raw data into meaningful information that drives profitable business actions. It is a suite of software and services to transform data into actionable intelligence and knowledge”. BI allows a business to transform: Data into information Information into knowledge Knowledge into intelligence BI tools perform data analysis and create reports, summaries, dashboards, maps, graphs, and charts to provide users with detailed intelligence about the nature of the business. Dr. Rafat Hammad 7 7 WHAT IS DATA WAREHOUSE (DW)? Data Warehouse (DW) is an information system which stores historical and commutative data from single or multiple sources. It is designed to analyze, report, integrate transaction data from different sources. Data Warehousing: is defined as a technique for collecting, transforming, and managing data from varied sources to provide meaningful business insights. The data warehousing concept was intended to provide an architectural model for the flow of data from operational (transactional) systems to decision support environments. Dr. Rafat Hammad 8 8 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics TOP 10 BENEFITS OF A DATA WAREHOUSE Dr. Rafat Hammad 9 9 MULTIPLE DATA TYPES IN A DATA WAREHOUSE Dr. Rafat Hammad 10 10 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics OPERATIONAL DATABASE Operational database management systems are used to update data in real-time. These types of databases allow users to do more than simply view archived data. Operational databases allow you to modify that data (add, change or delete data), doing it in real-time. Operational databases provide transactions as main abstraction to guarantee data consistency that guarantee the so-called ACID properties (Atomicity, Consistency, Isolation, and Durability). Dr. Rafat Hammad 11 11 OPERATIONAL DATABASE VS. DATA WAREHOUSE (CONT.) Operational database management systems is also known by the following names: Operational databases Transactional databases Online Transaction Processing (OLTP) databases. Data warehouse systems is also known by the following names: Data warehouse databases Data analytics databases Online Analytical Processing (OLAP) databases Dr. Rafat Hammad 12 12 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics OPERATIONAL DATABASE VS. DATA WAREHOUSE Parameter Operational Database Data Warehouse Archived, derived, Data Content Current values summarized Optimized for Optimized for complex Data Structure transactions queries Access Frequency High Medium to low Access Type Read, update, delete Read Usage Predictable, repetitive Ad hoc, random, heuristic Response Time Sub-seconds Several seconds to minutes Users Dr. Rafat Hammad Large number Relatively small number 13 13 ACID PROPERTIES IN DBMS Inthe context of databases and data storage systems, a transaction is any operation that is treated as a single unit of work, which either completes fully or does not complete at all and leaves the storage system in a consistent state. The classic example of a transaction is what occurs when you withdraw money from your bank account. Either the money has left your bank account, or it has not — there cannot be an in-between state. To maintain consistency in a database, before and after the transaction, certain properties are followed. These are called ACID properties: Atomicity, Consistency, Isolation, and Durability. Dr. Rafat Hammad 14 14 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics ACID PROPERTIES - ATOMICITY Transactions are often composed of multiple statements. Atomicity guarantees that each transaction is treated as a single "unit", which either succeeds completely, or fails completely: if any of the statements constituting a transaction fails to complete, the entire transaction fails, and the database is left unchanged. An atomic system must guarantee atomicity in each situation, including power failures, errors and crashes. Dr. Rafat Hammad 15 15 ACID PROPERTIES - CONSISTENCY Consistency ensures that a transaction can only bring the database from one valid state to another, maintaining database invariants: any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. This prevents database corruption by an illegal transaction but does not guarantee that a transaction is correct. Referential integrity guarantees the primary key - foreign key relationship. Dr. Rafat Hammad 16 16 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics ACID PROPERTIES - ISOLATION Transactions are often executed concurrently (e.g., multiple transactions reading and writing to a table at the same time). Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially. Isolation is the main goal of concurrency control; depending on the method used, the effects of an incomplete transaction might not even be visible to other transactions. Dr. Rafat Hammad 17 17 ACID PROPERTIES - DURABILITY Durability guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure (e.g., power outage or crash). This usually means that completed transactions (or their effects) are recorded in non-volatile memory. Dr. Rafat Hammad 18 18 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics ACID PROPERTIES - COMPARISON Dr. Rafat Hammad 19 19 TOPIC 2 : OUTLINE ❑ Operation vs. Strategic Systems ❑ Data Warehouse Architectures ❑ Data Warehouse Design Methodologies Dr. Rafat Hammad 20 20 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA WAREHOUSE COMPONENTS We identified the following major components of the data warehouse: Source data Data staging Data storage Metadata Management and control Information delivery Dr. Rafat Hammad 21 21 DATA WAREHOUSE COMPONENTS (CONT.) Dr. Rafat Hammad 22 22 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA WAREHOUSE COMPONENTS (CONT.) Metadata is like the data dictionary in a database system. Metadata repository is an integral part of a data warehouse system. It has the following information: 1)Definition of data warehouse − It includes the description of structure of data warehouse. The description is defined by schema, view, hierarchies, derived data definitions, and data mart locations and contents. 2)Business metadata − It contains has the data ownership information, business definition, and changing policies. Dr. Rafat Hammad 23 23 DATA WAREHOUSE COMPONENTS (CONT.) 3)Operational Metadata − It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of data migrated and transformation applied on it. 4)Data for mapping from operational environment to data warehouse − It includes the source databases and their contents, data extraction, data partition cleaning, transformation rules, data refresh and purging rules. 5)Algorithms for summarization − It includes dimension algorithms, data on granularity, aggregation, summarizing, etc. Dr. Rafat Hammad 24 24 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA WAREHOUSE COMPONENTS (CONT.) The management & control component has the following tasks: Manages and controls data acquisition functions, ensuring that extracts and transformations are carried out correctly and in a timely fashion. Manages backing up significant parts of the data warehouse and recovering from failures. Monitoring the growth and periodically archiving data from the data warehouse. Governs data security and provides authorized access to the data warehouse. Interfaces with the end-user information delivery component to ensure that information delivery is carried out properly. Dr. Rafat Hammad 25 25 DATA WAREHOUSE COMPONENTS (CONT.) The information delivery component makes it easy for the users to access the information either directly from the enterprise-wide data warehouse, from the dependent data marts, or from the set of conformed data marts. Most of the information access in a data warehouse is through online analytical processing (OLAP) queries and interactive analysis sessions. With OLAP, the primary data warehouse feeds data to proprietary multidimensional databases (MDDBs) where summarized data is kept as multidimensional cubes of information. The users perform complex multidimensional analysis using the information cubes in the MDDBs. Dr. Rafat Hammad 26 26 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA WAREHOUSE ARCHITECTURES TYPES Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are: Data Warehouse Architecture (Basic) Data Warehouse Architecture (with a Staging Area) Data Warehouse Architecture (with a Staging Area and Data Marts) Dr. Rafat Hammad 27 27 DATA WAREHOUSE BASIC End users directly access data derived from several source systems through the data warehouse. Dr. Rafat Hammad 28 28 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA WAREHOUSE WITH A STAGING AREA Dr. Rafat Hammad 29 29 DATA WAREHOUSE WITH A STAGING AREA This is the place where all the extracted data is put together and prepared for loading into the data warehouse. A staging area simplifies building summaries and general warehouse management. The data staging area is both a storage area and a set of processes commonly referred to as Extract-Transformation-Load or ETL. It is somewhat analogous ( )مشابهto a kitchen where raw materials are combined to form a fine meal. Dr. Rafat Hammad 30 30 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA WAREHOUSE WITH A STAGING AREA AND DATA MARTS Dr. Rafat Hammad 31 31 DATA WAREHOUSE WITH A STAGING AREA AND DATA MARTS You may want to customize your warehouse's architecture for different groups within your organization. You can do this by adding data marts, which are systems designed for a particular line of business. The previous figure illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales. Dr. Rafat Hammad 32 32 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics TYPES OF DATA WAREHOUSE There are three common types of data warehouses: Operational Data Store (ODS) Enterprise Data Warehouse (DW) Data Mart (DM) Dr. Rafat Hammad 33 33 TYPES OF DATA WAREHOUSE (CONT.) Dr. Rafat Hammad 34 34 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics OPERATIONAL DATA STORE An operational data store (ODS) is used for immediate reporting with current operational data. An ODS contains lightly transformed and lightly integrated operational data with a short time window. It is used for real time and near real time reporting. ODS is directly loaded from operational data, staging area, or incoming files. It can optionally serve as a data source for the data warehouse. An ODS must be frequently refreshed so that it contains very current data. An ODS can be updated daily, hourly, or even immediately after transactions on operational data. Dr. Rafat Hammad 35 35 ENTERPRISE DATA WAREHOUSE Enterprise Data Warehouse is a centralized place where all business information from different sources and applications are made available. It offers a unified approach for organizing and representing data. It also provide the ability to classify and analyze data according to the subject and give access according to those divisions. Dr. Rafat Hammad 36 36 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA MART Data marts contain a subset of organization-wide data. This data is valuable to a specific group of people in an organization. It is cost-effective alternatives to a data warehouse, which can take high costs to build. Data Mart allows faster access of Data. Data Mart is easy to use as it is specifically designed for the needs of its users. Thus, a data mart can accelerate business processes. Data Marts needs less implementation time compared to Data Warehouse systems. It contains historical data which enables the analyst to determine data trends. Dr. Rafat Hammad 37 37 TYPES OF DATA MARTS Three basic types of data marts are: Dependent Independent Hybrid The categorization is based primarily on the data source that feeds the data mart. Dr. Rafat Hammad 38 38 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DEPENDENT DATA MARTS Dependent data marts draw data from a central data warehouse that has already been created. This gives you the usual advantages of centralization Dr. Rafat Hammad 39 39 INDEPENDENT DATA MARTS Independent data marts, in contrast, are standalone systems built by drawing data directly from operational or external sources of data or both. This could be desirable for smaller groups within an organization. Dr. Rafat Hammad 40 40 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics HYBRID DATA MARTS A hybrid data mart allows you to combine input from sources other than a data warehouse. This could be useful for many situations, especially when you need ad hoc integration, such as after a new group or product is added to the organization. Dr. Rafat Hammad 41 41 TOPIC 6 : OUTLINE ❑ Operation vs. Strategic Systems ❑ Data Warehouse Architectures ❑ Data Warehouse Design Methodologies Dr. Rafat Hammad 42 42 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics DATA WAREHOUSE DESIGN METHODOLOGIES Bill Inmon considered to be the father of data warehousing. He publishes his book Building the Data Warehouse (1991) Ralph Kimball publishes his book The Data Warehousing Toolkit (1996) There are two common approaches for designing data warehouses: Inmon’s Top-Down Approach Kimball’s Bottom-Up Approach Dr. Rafat Hammad 43 43 INMON’S TOP-DOWN APPROACH Inmon defines a data warehouse as a centralized repository for the entire enterprise. Inmon defines the data warehouse in the following terms: Subject- oriented, Integrated, Time-variant, and Non-Volatile A data warehouse stores the “atomic” data at the lowest level of detail. A normalized data model is designed first. Then the dimensional data marts, which contain data required for specific business processes or specific departments are created from the data warehouse. Central data warehouse which follow the E-R model / normalized model. Dr. Rafat Hammad 44 44 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics INMON’S TOP-DOWN APPROACH (CONT.) Dr. Rafat Hammad 45 45 SUBJECT-ORIENTED DATA Subject-oriented: The data in the data warehouse is organized so that all the data elements relating to the same real-world event or subject are linked together. These subjects can be sales, marketing, distributions, etc. In operational systems, we store data by individual applications. In the data sets for an order processing application, we keep the data for that application. Dr. Rafat Hammad 46 46 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics SUBJECT-ORIENTED DATA (CONT.) Dr. Rafat Hammad 47 47 INTEGRATED DATA Integrated: The database contains data from most or all of an organization’s operational applications, and that this data is made consistent. Dr. Rafat Hammad 48 48 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics NON-VOLATILE DATA Non-Volatile: Data in the data warehouse is never overwritten or deleted. Once committed, the data is static, read-only and retained for future reporting. Dr. Rafat Hammad 49 49 TIME-VARIANT Time-variant: The changes to the data in the data warehouse are tracked and recorded so that reports can be produced showing changes over time. The data in the data warehouse is meant for analysis and decision making. If a user is looking at the buying pattern of a specific customer, the user needs data not only about the current purchase, but on the past purchases as well. A data warehouse, because of the very nature of its purpose, has to contain historical data and current values. Dr. Rafat Hammad 50 50 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics KIMBALL’S BOTTOM-UP APPROACH Kimball defines data warehouse as “a copy of transaction data specifically structured for query and analysis”. The data marts should be created first; these are then combined to create a broad data warehouse. The data warehouse is essentially a union of all the data marts. Data marts are built using dimensional modelling approach Dimensional modelling focuses on ease of end-user accessibility and provides a high level of performance to the data warehouse. Dr. Rafat Hammad 51 51 KIMBALL’S BOTTOM-UP APPROACH (CONT.) Dr. Rafat Hammad 52 52 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics INMON VS. KIMBALL Inmon (Data Kimball (Data Mart) Warehouse) Building Data Time Consuming Takes lesser time warehouse Maintenance Easy Difficult, often redundant and subject to revisions Cost High initial cost. Low Initial cost, Each Subsequent project subsequent phase will development costs will cost almost the same be much lower Dr. Rafat Hammad 53 53 INMON VS. KIMBALL (CONT.) Inmon (Data Kimball (Data Mart) Warehouse) Time Longer start-up time Shorter time to initial set- up Skill Specialist team Generalist team Requirement Data Integration Enterprise-wide Individual business areas requirements Data Model ER data model Dimensional data model Dr. Rafat Hammad 54 54 Dr. Rafat Hammad - Yarmouk University Data Engineering and Analytics THE END Dr. Rafat Hammad 55 55 Dr. Rafat Hammad - Yarmouk University