CHAPTER-3-DATA-WAREHOUSING-AND-MANAGEMENT.PDF
Document Details
Uploaded by GreatestJasper3191
Dalubhasaan ng Lungsod ng Lucena
Tags
Full Transcript
DATA WAREHOUSING AND MANAGEMENT MODULE 3 CHAPTER III: DATA WAREHOUSE CONCEPTS I. OBJECTIVES At the end of this chapter, the students should be able to: Understand the basic definitions and concepts of data ware...
DATA WAREHOUSING AND MANAGEMENT MODULE 3 CHAPTER III: DATA WAREHOUSE CONCEPTS I. OBJECTIVES At the end of this chapter, the students should be able to: Understand the basic definitions and concepts of data warehouses. Identify multidimensional model and present its main characteristics and components. Define in detail the description of the most common operations for manipulating data cubes. Understand data warehousing design and architectures. Describe the processes used in developing and managing data warehouses. Explain data warehousing operations. Describe business intelligence suite of tools. II. SUBJECT MATTER Topic: Data Warehouse Concepts Subtopic: - Multidimensional Model - OLAP Operations - Data Warehouses - Data Warehouse Architecture - Data Warehouse Design - Business Intelligence Tools III. PROCEDURE A. Preliminaries Pre- Assessment 1. Explain and analyze the concepts underlying database warehouses which are based on a multidimensional model, where data are represented as hypercube, with dimensions corresponding to the various business perspectives and cube cells containing the measures to be analyzed. 2. Present and define the multidimensional model and identify its main characteristics and components. 3. Examine and give a detailed description of the most common operations for manipulating data cubes. 4. Present the main characteristics of data warehouse systems and compare them against operational databases. 1 DATA WAREHOUSING AND MANAGEMENT MODULE 3 5. Introduce two representative business intelligence suite of tools, SQL Server and Pentaho. B. Lesson Proper This chapter introduces the basic concepts of data warehouses. A data warehouse is a particular database targeted toward decision support. It takes data from various operational databases and other data sources and transforms it into new structures that fit better for the task of performing business analysis. Data warehouses are based on a multidimensional model, where data are represented as hypercubes, with dimensions corresponding to the various business perspectives and cube cells containing the measures to be analyzed. We study the multidimensional model and present its main characteristics and components. Give a detailed description of the most common operations for manipulating data cubes. Present the main characteristics of data warehouse systems and compare them against operational databases. The architecture of data warehouse systems is also described in detail. As we shall see, in addition to the data warehouse itself, data warehouse systems are composed of back-end tools, which extract data from the various sources to populate the warehouse, and front-end tools, which are used to extract the information from the warehouse and present it to users. We introduce the design methodology we will use throughout the book. We finish by describing two representative business intelligence suite of tools, SQL Server and Pentaho. 1. Multidimensional Model The importance of data analysis has been steadily increasing from the early 1990s, as organizations in all sectors are being required to improve their decision-making processes in order to maintain their competitive advantage. Traditional database systems do not satisfy the requirements of data analysis. They are designed and tuned to support the daily operations of an organization, and their primary concern is to ensure fast, concurrent access to data. This requires transaction processing and concurrency control capabilities, as well as recovery techniques that guarantee data consistency. These systems are known as operational databases or online transaction processing (OLTP) systems. The OLTP paradigm is focused on transactions. Since OLTP systems must support heavy transaction loads, their design should prevent update anomalies, and thus, OLTP databases are highly normalized using the techniques studied in Chapter. 2. Thus, they perform poorly when executing complex queries that need to join many relational tables together or to aggregate large volumes of data. Besides, typical operational databases contain detailed data and do not include historical data. 2 DATA WAREHOUSING AND MANAGEMENT MODULE 3 The above needs called for a new paradigm specifically oriented to analyze the data in organizational databases to support decision making. This paradigm is called online analytical processing (OLAP). This paradigm is focused on queries, in particular, analytical queries. OLAP oriented databases should support a heavy query load. Therefore, the need for a different database model to support OLAP was clear and led to the notion of data warehouses, which are (usually) large repositories that consolidate data from different sources (internal and external to the organization), are updated off-line (although as we will see, this is not always the case in modern data warehouse systems), and follow the multidimensional data model. Being dedicated analysis databases, data warehouses can be designed and optimized to efficiently support OLAP queries. Data warehouses and OLAP systems are based on the multidimensional model, which views data in an n-dimensional space, usually called a data cube or a hypercube. A data cube is defined by dimensions and facts. Dimensions are perspectives used to analyze the data. Figure 1. A three-dimensional cube for sales data with dimensions Product, Time, and Customer, and a measure Quantity A dimension level represents the granularity, or level of detail, at which measures are represented for each dimension of the cube. Instances of a dimension are called members. Dimensions also have associated attributes describing them. For example, the Product dimension could contain attributes such as ProductNumber and UnitPrice, which are not shown in the figure. 3 DATA WAREHOUSING AND MANAGEMENT MODULE 3 On the other hand, the cells of a data cube, or facts, have associated numeric values (we will see later that this is not always the case), called measures. These measures are used to evaluate quantitatively various aspects of the analysis at hand. Hierarchies In order to extract strategic knowledge from a cube, it is necessary to view its data at several levels of detail. Hierarchies allow this possibility by defining a sequence of mappings relating lower-level, detailed concepts to higher-level, more general concepts. Given two related levels in a hierarchy, the lower level is called the child and the higher level is called the parent. The hierarchical structure of a dimension is called the dimension schema, while a dimension instance comprises the members at all levels in a dimension. Figure 2. Hierarchies of the Product, Time, and Customer dimensions Each product at the lowest level of the hierarchy can be mapped to a corresponding category. All categories are grouped under a member called all, which is the only member of the distinguished level All. This member is used for obtaining the aggregation of measures for the whole hierarchy, that is, for obtaining the total sales for all products. Measures Each measure in a cube is associated with an aggregation function that combines several measure values into a single one. Aggregation of measures takes place when one changes the level of detail at which data in a cube are visualized. This is performed by traversing the hierarchies of the dimensions. 4 DATA WAREHOUSING AND MANAGEMENT MODULE 3 Summarizability refers to the correct aggregation of cube measures along dimension hierarchies, in order to obtain consistent aggregation results. To ensure summarizability, a set of conditions may hold. Below, we list some of these conditions: Disjointness of instances: The grouping of instances in a level with respect to their parent in the next level must result in disjoint subsets. Completeness: All instances must be included in the hierarchy and each instance must be related to one parent in the next level. Correctness: It refers to the correct use of the aggregation functions. As explained next, measures can be of various types, and this determines the kind of aggregation function that can be applied to them. Additive measures can be meaningfully summarized along all the dimensions, using addition. These are the most common type of measures. Semiadditive measures can be meaningfully summarized using addition along some, but not all, dimensions. Nonadditive measures cannot be meaningfully summarized using addition across any dimension. Other Classification of Measures: Distributive measures are defined by an aggregation function that can be computed in a distributed way. If the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning. E.g., count(), sum(), min(), max() Algebraic measures are defined by an aggregation function that can be expressed as a scalar function of distributive ones. If it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function. E.g., avg(), min_N(), standard_deviation() Holistic measures are measures that cannot be computed from other subaggregates. If there is no constant bound on the storage size needed to describe a subaggregate. E.g., median(), mode(), rank() 2. OLAP Operations The OLAP operations allow these perspectives and levels of detail to be materialized by exploiting the dimensions and their hierarchies, thus providing an interactive data analysis environment. 5 DATA WAREHOUSING AND MANAGEMENT MODULE 3 Roll up (drill-up): Summarize data by climbing up hierarchy or by dimension reduction. Drill down (roll down): Reverse of roll-up from higher level summary to lower level summary or detailed data, or introducing new dimensions. Slice and dice: Project and select Pivot (rotate): Reorient the cube, visualization, 3D to series of 2D planes Other operations Drill across: Involving (across) more than one fact table Drill through: Through the bottom level of the cube to its back-end relational tables (using SQL). In addition to the basic operations described above, OLAP tools provide a great variety of mathematical, statistical, and financial operations for computing ratios, variances, interest, depreciation, currency conversions, etc Figure 3. Summary of OLAP Operations 6 DATA WAREHOUSING AND MANAGEMENT MODULE 3 Figure 4. Typical OLAP operations 7 DATA WAREHOUSING AND MANAGEMENT MODULE 3 3. Data Warehouses A data warehouse is a repository of integrated data obtained from several sources for the specific purpose of multidimensional data analysis. More technically, a data warehouse is defined as a collection of subject-oriented, integrated, nonvolatile, and time-varying data to support management decisions. We explain next these characteristics: Subject oriented means that data warehouses focus on the analytical needs of different areas of an organization. These areas vary depending on the kind of activities performed by the organization. For example, in the case of a retail company, the analysis may focus on product sales or inventory management. Integrated means that data obtained from several operational and external systems must be joined together, which implies solving problems due to differences in data definition and content, such as differences in data format and data codification, synonyms (fields with different names but the same data), homonyms (fields with the same name but different meanings), multiplicity of occurrences of data, and many others. In operational databases these problems are typically solved in the design phase. Nonvolatile means that durability of data is ensured by disallowing data modification and removal, thus expanding the scope of the data to a longer period of time than operational systems usually offer. Time varying indicates the possibility of retaining different values for the same information, as well as the time when changes to these values occurred. A data warehouse is aimed at analyzing the data of an entire organization. It is often the case that particular departments or divisions of an organization only require a portion of the organizational data warehouse specialized for their needs. These departmental data warehouses are called data marts. However, these data marts are not necessarily private to a department; they may be shared with other interested parts of the organization. A data warehouse can be seen as a collection of data marts. This view represents a bottom-up approach in which a data warehouse is built by first building the smaller data marts and then merging these to obtain the data warehouse. On the other hand, in the classic data warehouse view, data marts are obtained from the data warehouse in a top-down fashion. In this approach, a data mart is sometimes just a logical view of a data warehouse. 8 DATA WAREHOUSING AND MANAGEMENT MODULE 3 Figure 5. Comparison between operational databases and data warehouses Figure 6. OLTP vs OLAP 9 DATA WAREHOUSING AND MANAGEMENT MODULE 3 4. Data Warehouse Architecture Figure 7. Typical data warehouse architecture The architecture depicted in Figure 7 consists of several tiers: The back-end tier is composed of extraction, transformation, and loading (ETL) tools, used to feed data into the data warehouse from operational databases and other data sources, which can be internal or external from the organization, and a data staging area, which is an intermediate database where all the data integration and transformation processes are run prior to the loading of the data into the data warehouse. The data warehouse tier is composed of an enterprise data warehouse and/or several data marts and a metadata repository storing information about the data warehouse and its contents. The OLAP tier is composed of an OLAP server, which provides a multidimensional view of the data, regardless of the actual way in which data are stored in the underlying system. The front-end tier is used for data analysis and visualization. It contains client tools such as OLAP tools, reporting tools, statistical tools, and data mining tools. 10 DATA WAREHOUSING AND MANAGEMENT MODULE 3 Back-End Tier As the name indicates, it is a three-step process as follows: Extraction gathers data from multiple, heterogeneous data sources. These sources may be operational databases but may also be files in various formats; they may be internal to the organization or external to it. In order to solve interoperability problems, data are extracted whenever possible using application programming interfaces (APIs) such as ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity). Transformation modifies the data from the format of the data sources to the warehouse format. This includes several aspects: cleaning, which removes errors and inconsistencies in the data and converts it into a standardized format; integration, which reconciles data from different data sources, both at the schema and at the data level; and aggregation, which summarizes the data obtained from data sources according to the level of detail, or granularity, of the data warehouse. Loading feeds the data warehouse with the transformed data. This also includes refreshing the data warehouse, that is, propagating updates from the data sources to the data warehouse at a specified frequency in order to provide up-to-date data for the decision-making process. Data Warehouse Tier The data warehouse tier depicts an enterprise data warehouse and several data marts. Another component of the data warehouse tier is the metadata repository. Metadata can be defined as “data about data.” Metadata has been traditionally classified into technical and business metadata. Business metadata describes the meaning (or semantics) of the data and organizational rules, policies, and constraints related to the data. On the other hand, technical metadata describes how data are structured and stored in a computer system and the applications and processes that manipulate such data. OLAP Tier The OLAP tier in the architecture is composed of an OLAP server, which presents business users with multidimensional data from data warehouses or data marts. Most database products provide OLAP extensions and related tools allowing the construction and querying of cubes, as well as navigation, analysis, and reporting. In this respect, several languages are worth mentioning. XMLA (XML for Analysis) aims at providing a common language for exchanging multidimensional data between client applications and OLAP servers. Further, MDX (MultiDimensional eXpressions) is a query language for OLAP databases. As 11 DATA WAREHOUSING AND MANAGEMENT MODULE 3 it is supported by a number of OLAP vendors, MDX became a de facto standard for querying OLAP systems. The SQL standard has also been extended for providing analytical capabilities; this extension is referred to as SQL/OLAP. Front-End Tier The front-end tier contains client tools that allow users to exploit the contents of the data warehouse. Typical client tools include the following: OLAP tools allow interactive exploration and manipulation of the warehouse data. They facilitate the formulation of complex queries that may involve large amounts of data. These queries are called ad hoc queries, since the system has no prior knowledge about them. Reporting tools enable the production, delivery, and management of reports, which can be paper-based reports or interactive, web-based reports. Reports use predefined queries, that is, queries asking for specific information in a specific format that are performed on a regular basis. Statistical tools are used to analyze and visualize the cube data using statistical methods. Data mining tools allow users to analyze data in order to discover valuable knowledge such as patterns and trends; they also allow predictions to be made on the basis of current data. 5. Data Warehouse Design Two major methods for the design of data warehouses and data marts: Top-down approach: the requirements of users at different organizational levels are merged before the design process starts, and one schema for the entire data warehouse is built, from which data marts can be obtained. Bottom-up approach: a schema is built for each data mart, according to the requirements of the users of each business area. The data mart schemas produced are then merged in a global warehouse schema. Analysis-driven approach: key users from different organizational levels provide useful input about the analysis needs. Source-driven approach: the data warehouse schema is obtained by analyzing the data source systems. In this approach, normally, the participation of users is only required to confirm the correctness of the data structures that are obtained from the source systems or to identify some facts and measures as a starting point for the design of multidimensional schemas. 12 DATA WAREHOUSING AND MANAGEMENT MODULE 3 Analysis/source-driven approach is a combination of the analysis- and source-driven approaches, aimed at matching the users’ analysis needs with the information that the source systems can provide. Figure 8. Phases in data warehouse design 6. Business Intelligence Tools Overview of Microsoft SQL Server Tools Microsoft SQL Server provides an integrated platform for building analytical applications. It is composed of three main components, described below: Analysis Services is an OLAP tool that provides analytical and data mining capabilities. It is used to define, query, update, and manage OLAP databases. Integration Services supports ETL processes, which are used for loading and refreshing data warehouses on a periodic basis. Integration Services is used to extract data from a variety of data sources; to combine, clean, and summarize this data; and, finally, to populate a data warehouse with the resulting data. Reporting Services is used to define, generate, store, and manage reports. Reports can be built from various types of data sources, including data warehouses and OLAP cubes. SQL Server Data Tools (SSDT) is a development platform integrated with Microsoft Visual Studio. SQL Server Data Tools supports Analysis Services, Reporting Services, and Integration Services projects. On the other hand, SQL Server Management Studio (SSMS) provides integrated management of all SQL Server components. The underlying model across these tools is called the Business Intelligence Semantic Model (BISM). This model comes in two modes, the multidimensional and tabular modes, where, as their name suggest, the difference among them stems from their underlying paradigm (multidimensional or relational). 13 DATA WAREHOUSING AND MANAGEMENT MODULE 3 Overview of Pentaho Business Analytics Pentaho Business Analytics is a suite of business intelligence products. It comes in two versions: an enterprise edition that is commercial and a community edition that is open source. The main components are the following: Pentaho Business Analytics Platform serves as the connection point for all other components. It enables a unified, end-to-end solution from data integration to visualization and consumption of data. It also includes a set of tools for development, deployment, and management of applications. Pentaho Analysis Services, also known as Mondrian, is a relational OLAP server. It supports the MDX (multidimensional expressions) query language and the XML for Analysis and olap4j interface specifications. Pentaho Data Integration, also known as Kettle, consists of a data integration (ETL) engine and GUI (graphical user interface) applications that allow users to define data integration jobs and transformations. Pentaho Report Designer is a visual report writer that can query and use data from many sources. It consists of a core reporting engine, capable of generating reports in several formats based on an XML definition file. 14 DATA WAREHOUSING AND MANAGEMENT MODULE 3 ACTIVITY 1: REVIEW QUESTIONS Based on the discussion, answer each questions and / or statements briefly. Write on the spaces provided below. 1. Why are hierarchies important in data warehouses? Give examples of various hierarchies. ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ 2. Discuss the role of measure aggregation in a data warehouse. How can measures be characterized? ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ 3. Discuss the role of measure aggregation in a data warehouse. How can measures be characterized? ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ 4. Give some essential characteristics of a data warehouse. How do a data warehouse and a data mart differ? ________________________________________________________________________ ________________________________________________________________________ 15 DATA WAREHOUSING AND MANAGEMENT MODULE 3 ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ 5. Describe the various components of a typical data warehouse architecture. Identify variants of this architecture and specify in what situations they are used. ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ 16 DATA WAREHOUSING AND MANAGEMENT MODULE 3 ACTIVITY 2: EXERCISE: OLAP OPERATIONS A data warehouse of a telephone provider consists of five dimensions: caller customer, callee customer, time, call type, and call program and three measures: number of calls, duration, and amount. Define the OLAP operations to be performed in order to answer the following queries. Propose the dimension hierarchies when needed. (a) Total amount collected by each call program in 2012. (b) Total duration of calls made by customers from Brussels in 2012. (c) Total number of weekend calls made by customers from Brussels to customers in Antwerp in 2012. (d) Total duration of international calls started by customers in Belgium in 2012. (e) Total amount collected from customers in Brussels who are enrolled in the corporate program in 2012. 17