CHAPTER-1-DATA-WAREHOUSING-AND-MANAGEMENT.pdf
Document Details
Uploaded by GreatestJasper3191
Dalubhasaan ng Lungsod ng Lucena
Tags
Full Transcript
DATA WAREHOUSING AND MANAGEMENT MODULE 1 CHAPTER I: FUNDAMENTAL CONCEPTS: INTRODUCTION TO DATA WAREHOUSE SYSTEMS I. OBJECTIVES At the end of this chapter, the students should be able to: Exp...
DATA WAREHOUSING AND MANAGEMENT MODULE 1 CHAPTER I: FUNDAMENTAL CONCEPTS: INTRODUCTION TO DATA WAREHOUSE SYSTEMS I. OBJECTIVES At the end of this chapter, the students should be able to: Explain the fundamental concepts to data warehouse systems. Discuss the historical overview of data warehousing, starting from the early achievements. Understand the various ways of organizing and managing information for decision making use. Review the history of decision support systems. Learn briefly what is data warehouse and see why data warehousing is the viable solution Describe the field of spatial and spatiotemporal data warehouses Describe new domains and challenges that are being explored in order to answer the requirements of today’s analytical applications. II. SUBJECT MATTER Topic: Introduction to Data Warehousing Systems Subtopic: - A Historical Overview of Data Warehousing - Spatial and Spatiotemporal Data Warehouses - New Domains and Challenges III. PROCEDURE A. Preliminaries Pre- Assessment 1. Explain and discuss the historical overview of data warehousing including traditional operational or transactional databases, the design of operational databases, and multidimensional modelling. 2. Identify the field of spatial and spatiotemporal data warehouses, which has been increasingly used in many application domains. 3. Examine and discuss new domains and challenges that are being explored in order to answer the requirements of today’s analytical applications. B. Lesson Proper Organizations today are facing increasingly complex challenges in terms of management and problem solving in order to achieve their operational goals. This situation compels people in those organizations to utilize analysis tools that can better support their decisions. 1 DATA WAREHOUSING AND MANAGEMENT MODULE 1 In this chapter, we present an overview of the data warehousing field, covering both established topics and new developments, and indicate the chapters in the book where these subjects are covered. We give a historical overview of data warehousing and On-Line Analytical Processing (OLAP), starting from the early achievements. Then, we describe the field of spatial and spatiotemporal data warehouses, which has been increasingly used in many application domains. Finally, we describe new domains and challenges that are being explored in order to answer the requirements of today’s analytical applications. 1. Introduction Business intelligence comprises a collection of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information for decision making. Business intelligence and decision-support systems provide assistance to managers at various organizational levels for analyzing strategic information. These systems collect vast amounts of data and reduce them to a form that can be used to analyze organizational behavior. This data transformation comprises a set of tasks that take the data from the sources and, through extraction, transformation, integration, and cleansing processes, store the data in a common repository called a data warehouse. Data warehouses have been developed and deployed as an integral part of decision-support systems to provide an infrastructure that enables users to obtain efficient and accurate responses to complex queries. The Data Warehouse is an integrated, subject-oriented, time-variant, non-volatile database that provides support for decision making. Integrated: The Data Warehouse is a centralized, consolidated database that integrates data retrieved from the entire organization. Subject-Oriented: The Data Warehouse data is arranged and optimized to provide answers to questions coming from diverse functional areas within a company. Time Variant: The Warehouse data represent the flow of data through time. It can even contain projected data. Non-Volatile: Once data enter the Data Warehouse, they are never removed. The Data Warehouse is always growing. 2 DATA WAREHOUSING AND MANAGEMENT MODULE 1 Figure 1. Comparison of Data Warehouse and Operational Database Characteristics A data warehouse system has the following characteristics: It provides a centralized utility of corporate data or information assets. It is contained in a well-managed environment. It has consistent and repeatable processes defined for loading operational data. It is built on an open and scalable architecture that will handle future expansion of data. It provides tools that allow its users to effectively process the data into information without a high degree of technical support. A wide variety of systems and tools can be used for accessing, analyzing, and exploiting the data contained in data warehouses. From the early days of data warehousing, the typical 3 DATA WAREHOUSING AND MANAGEMENT MODULE 1 mechanism for those tasks has been online analytical processing (OLAP). OLAP systems allow users to interactively query and automatically aggregate the data contained in a data warehouse. In this way, decision makers can easily access the required information and analyze it at various levels of detail. On-Line Analytical Processing (OLAP) is an advanced data analysis environment that supports decision making, business modeling, and operations research activities. OLAP systems are designed to use both operational and Data Warehouse data. Four Main Characteristics of OLAP: Use multidimensional data analysis techniques Provide advanced database support Provide easy-to-use end user interfaces Support client/server architecture Figure 2. OLAP Server Arrangement 4 DATA WAREHOUSING AND MANAGEMENT MODULE 1 The design of operational databases is typically performed in four phases: requirements specification, conceptual design, logical design, and physical design. During the requirements specification process, the needs of users at various levels of the organization are collected. The specification obtained serves as a basis for creating a database schema capable of responding to user queries. Databases are designed using a conceptual model, such as the entity-relationship (ER) model, which aims at describing an application without taking into account implementation considerations. The resulting design is then translated into a logical model, which is an implementation paradigm for database applications. Finally, physical design particularizes the logical model for a specific implementation platform in order to produce a physical model. Multidimensional modeling views data as consisting of facts linked to several dimensions. Multidimensional Data Analysis Techniques The processing of data in which data are viewed as part of a multidimensional structure. Multidimensional view allows end users to consolidate or aggregate data at different levels. Multidimensional view allows a business analyst to easily switch business perspectives. A fact represents the focus of analysis (e.g., analysis of sales in stores) and typically includes attributes called measures. Measures are usually numeric values that allow a quantitative evaluation of various aspects of an organization. For example, measures such as the amount or number of sales might help to analyze sales activities in various stores. Dimensions are used to see the measures from several perspectives. For example, a time dimension can be used to analyze changes in sales over various periods of time, whereas a location dimension can be used to analyze sales according to the geographical distribution of stores. Hierarchies allow users to explore measures at various levels of detail. Examples of hierarchies are month–quarter–year in the time dimension and city–state–country in the location dimension. 5 DATA WAREHOUSING AND MANAGEMENT MODULE 1 Figure 3. OLAP Server with Multidimensional Data Store Arrangement Conceptual Modelling of Data Warehouses Star schema: A fact table in the middle connected to a set of dimension tables Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation 6 DATA WAREHOUSING AND MANAGEMENT MODULE 1 Although data warehouses are, in the end, a particular kind of databases, there are significant differences between the development of operational databases and data warehouses. A key one is the fact that data in a warehouse are extracted from several source systems. Thus, data must be taken from these sources, transformed to fit the data warehouse model, and loaded into the data warehouse. This process is called extraction, transformation, and loading (ETL), and it has been proven crucial for the success of a data warehousing project. Data analytics is the process of exploiting the contents of a data warehouse in order to provide essential information to the decision-making process. Three main tools can be used for this: Data mining consists in a series of statistical techniques that analyze the data in a warehouse in order to discover useful knowledge that is not easy to obtain from the original data. Key performance indicators (KPIs) are measurable organizational objectives that are used for characterizing how an organization is performing. Dashboards are interactive reports that present the data in a warehouse, including the KPIs, in a visual way, providing an overview of the performance of an organization for decision-support purposes. 7 DATA WAREHOUSING AND MANAGEMENT MODULE 1 2. Spatial and Spatiotemporal Data Warehouse Over the years, spatial data has been increasingly used in various areas, like public administration, transportation networks, environmental systems, and public health, among others. Spatial data can represent either objects located on the Earth’s surface, such as mountains, cities, and rivers, or geographic phenomena, such as temperature, precipitation, and altitude. Spatial data can also represent nongeographic data, that is, data located in other spatial frames such as a human body, a house, or an engine. Management of spatial data is carried out by spatial databases or geographic information systems (GISs). Since the latter are used for storing and manipulating geographic objects and phenomena, we shall use the more general term spatial databases in the following. Spatial databases are used to store spatial data located in a two- or three-dimensional space. These systems provide a set of functions and operators for querying and manipulating spatial data. Queries may refer to spatial characteristics of individual objects, such as their area or perimeter, or may require complex operations on two or more spatial objects. Topological relationships between spatial objects, such as intersection, touches, and crosses, are essential in spatial applications. For example, two roads may intersect, two countries may touch because they have a common border, or a road may cross a dessert. An important characteristic of topological relationships is that they do not change when the underlying space is distorted through rotation, scaling, and similar operations. Spatial data warehouses emerged as a combination of the spatial database and data warehouse technologies. Spatial data warehouses provide improved data analysis, visualization, and manipulation. This kind of analysis is called spatial OLAP (SOLAP), conveying a reference to the ability of exploring spatial data through map navigation and aggregation, as it is performed in OLAP with tables and charts. 3. New Domains and Challenges Nowadays, the availability of enormous amounts of data is calling for a shift in the way data warehouse and business intelligence practices have been carried out since the 1990s. It is becoming clear that for certain kinds of business intelligence applications, the traditional approach, where day-to-day business data produced in an organization are collected in a huge common repository for data analysis, needs to be revised, to account for efficiently handling large-scale data. In many emerging domains where business intelligence practices are gaining acceptance, such as 8 DATA WAREHOUSING AND MANAGEMENT MODULE 1 social networks or geospatial data analytics, massive-scale data sources are becoming common, posing new challenges to the data warehouse research community. We already commented that the typical method of loading data into a data warehouse is through an ETL process. This process pulls data from source systems periodically (e.g., daily, weekly, or monthly), obtaining snapshot of the business data at a given moment in time. These data are then used for refreshing the contents of the data warehouse. For some data analysis tasks (like worldwide price evolution of some product), the data contained in a conventional data warehouse may not suffice. External data sources, like the web, can provide useful multidimensional information, although usually too volatile to be permanently stored. The semantic web aims at representing web content in a machine-processable way. The basic layer of the data representation for the semantic web recommended by the World Wide Web Consortium (W3C) is the Resource Description Framework (RDF), on top of which the Web Ontology Language (OWL) is based. 9 DATA WAREHOUSING AND MANAGEMENT MODULE 1 ACTIVITY 1: REVIEW QUESTIONS Based on the discussion, answer each question and / or statements briefly. Write on the spaces provided below. 1. Why are traditional databases called operational or transactional? Why are these databases inappropriate for data analysis? ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ 2. Discuss four main characteristics of data warehouses. ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ 3. What is the purpose of online analytical processing (OLAP) systems and how are they related to data warehouses? ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ 4. What is the extraction, transformation, and loading (ETL) process? ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ 5. Explain the differences between spatial databases and spatial data warehouses. ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ 10 DATA WAREHOUSING AND MANAGEMENT MODULE 1 ACTIVITY 2: EXCERSICE An information package of a promotional analysis is shown below. To evaluate the effectiveness of various promotions, brand managers are interested in analyzing data for the products represented, the promotional offers, and the locations where the promotions ran. Construct a star schema based on the information package diagram, and discuss how the brand manager or other analysts can use the model to evaluate the promotions. All time periods All products All Locations All promotions Years Category Region Type Quarters Sub-category District Sub-type Months Brand Store Name Package Size Measures/Facts: Units, revenue, cost, margin (Calculated) 11