Business Intelligence Course Book

IU International University of Applied Sciences

2023

Gerasimos Marketos


Summary

This course book provides a comprehensive overview of business intelligence (BI) concepts, processes, and technologies. It details data provisioning, warehousing, and various architectural variants.



BUSINESS INTELLIGENCE
DLBCSEBI01

MASTHEAD

Publisher: IU Internationale Hochschule GmbH
IU International University of Applied Sciences
Juri-Gagarin-Ring 152, D-99084 Erfurt
Mailing address: Albert-Proeller-Straße 15-19, D-86675 Buchdorf
[email protected], www.iu.de
Version No.: 001-2023-0818
Gerasimos Marketos

© 2022 IU Internationale Hochschule GmbH. This course book is protected by copyright. All rights reserved. This course book may not be reproduced and/or electronically edited, duplicated, or distributed in any form without the written permission of IU Internationale Hochschule GmbH (hereinafter referred to as IU). The authors/publishers have identified the authors and sources of all graphics to the best of their abilities. However, if any erroneous information has been provided, please notify us accordingly.

TABLE OF CONTENTS

Introduction
  Signposts Throughout the Course Book
  Basic Reading
  Required Reading
  Further Reading
  Learning Objectives

Unit 1 Motivation and Introduction
  1.1 Motivation and Historical Development of the Field
  1.2 Business Intelligence as a Framework

Unit 2 Data Provisioning
  2.1 Operational and Dispositive Systems
  2.2 The Data Warehouse Concept
  2.3 Architecture Variants

Unit 3 Data Warehouse
  3.1 ETL Process
  3.2 DWH and Data-Mart Concepts
  3.3 ODS and Meta-Data

Unit 4 Modeling Multidimensional Dataspaces
  4.1 Data Modeling
  4.2 OLAP Cubes
  4.3 Physical Storage Concepts
  4.4 Star Schema and Snowflake Schema
  4.5 Historicization

Unit 5 Analytical Systems
  5.1 Free Data Research and OLAP
  5.2 Reporting Systems
  5.3 Model-Based Analysis Systems
  5.4 Concept-Oriented Systems

Unit 6 Distribution and Access
  6.1 Distribution of Information
  6.2 Access to Information

Appendix
  List of References
  List of Tables and Figures

INTRODUCTION

SIGNPOSTS THROUGHOUT THE COURSE BOOK

This course book contains the core content for this course. Additional learning materials can be found on the learning platform, but this course book should form the basis for your learning.

The content of this course book is divided into units, which are divided further into sections. Each section contains only one new key concept to allow you to quickly and efficiently add new learning material to your existing knowledge. At the end of each section of the digital course book, you will find self-check questions. These questions are designed to help you check whether you have understood the concepts in each section.

For all modules with a final exam, you must complete the knowledge tests on the learning platform. You will pass the knowledge test for each unit when you answer at least 80% of the questions correctly. When you have passed the knowledge tests for all the units, the course is considered finished and you will be able to register for the final assessment. Please ensure that you complete the evaluation prior to registering for the assessment. Good luck!

BASIC READING

Grossmann, W., & Rinderle-Ma, S. (2015). Fundamentals of business intelligence. Springer.
Kolb, J. (2013). Business intelligence in plain language: A practical guide to data mining and business analytics. CreateSpace.

Sharda, R., Delen, D., & Turban, E. (2014). Business intelligence and analytics: Systems for decision support. Pearson.

Sherman, R. (2014). Business intelligence guidebook: From data integration to analytics. Morgan Kaufmann.

Vaisman, A., & Zimányi, E. (2016). Data warehouse systems: Design and implementation. Springer.

REQUIRED READING

UNIT 1
Simon, A. (2014). Modern enterprise business intelligence and data management. Elsevier MK. Chapter 5.

UNIT 2
Clegg, D. (2015). Evolving data warehouse and BI architectures: The big data challenge. TDWI Business Intelligence Journal, 20(1), 19–24.

UNIT 3
Ankorion, I. (2005). Change data capture: Efficient ETL for real-time BI. DM Review, 15(1), 36–43.

UNIT 4
Kimball, R. (2008). Slowly changing dimensions, types 2 and 3. DM Review, 18(10), 19–38.

UNIT 5
Abkay, S. (2015). How big data applications are revolutionizing decision making. Business Intelligence Journal, 20(1), 25–29.

UNIT 6
Gangadharan, G. R., & Swami, S. N. (2004). Business intelligence systems: Design and implementation strategies. 26th international conference, information technology interfaces (ITI 2004) (pp. 139–144). The University of Zagreb Computing Centre (SRCE).

FURTHER READING

UNIT 1
Chaudhuri, S., Dayal, U., & Narasayya, V. R. (2011). An overview of business intelligence technology. Communications of the ACM, 54(8), 88–98. Available online.

Krawatzeck, R., & Dinter, B. (2015). Agile business intelligence: Collection and classification of agile business intelligence actions by means of a catalog and a selection guide. Information Systems Management, 32(3), 177–191.

UNIT 2
Ariyachandra, T., & Watson, H. J. (2008). Which data warehouse architecture is best? Communications of the ACM, 51(10), 146–147.

Clegg, D. (2015). Evolving data warehouse and BI architectures: The big data challenge. Business Intelligence Journal, 20(1), 19–24.
Ivan, M.-L. (2014). Characteristics of in-memory business intelligence. Informatica Economica, 18(3), 17–25.

Zafary, F. (2020). Implementation of business intelligence considering the role of information systems integration and enterprise resource planning. Journal of Intelligence Studies in Business, 10(1), 59–74.

UNIT 3
Liu, J., Li, J., Li, W., & Wu, J. (2016). Rethinking big data: A review on the data quality and usage issues. ISPRS Journal of Photogrammetry and Remote Sensing, 116, 134–142. Available online.

Philip Chen, C. L., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on big data. Information Sciences, 275, 314–347. Available online.

UNIT 4
Franconi, E., & Kamble, A. (2004). A data warehouse conceptual data model. Proceedings of the 16th international conference on scientific and statistical database management (SSDBM 2004) (pp. 435–436). IEEE.

Stiglich, P. (2014). Data modeling in the age of big data. Business Intelligence Journal, 19(4), 17–

UNIT 5
Allio, M. (2012). Strategic dashboards: Designing and deploying them to improve implementation. Strategy & Leadership, 40(5), 24–31.

Vincentdo, V., Pratama, A. R., Girsang, A. S., Suwandi, R., & Andrean, Y. P. (2019). Reporting and decision support using data warehouse for e-commerce top-up cell-phone credit transaction. 7th international conference on cyber and IT service management (CITSM) (pp. 1–4). IEEE.

UNIT 6
Alpar, P., & Schulz, M. (2016). Self-service business intelligence. Business & Information Systems Engineering, 58, 151–155.

Lennerholt, C., van Laere, J., & Söderström, E. (2018). Implementation challenges of self service business intelligence: A literature review. Proceedings of the 51st Hawaii international conference on system sciences (pp. 5055–5063).

Lennerholt, C., van Laere, J., & Söderström, E. (2020). User related challenges of self-service business intelligence.
Proceedings of the 53rd Hawaii international conference on system sciences (pp. 188–197).

LEARNING OBJECTIVES

Business intelligence (BI) is a process used to extract information from company data that supports informed corporate management and the optimization of business activities. In the course Business Intelligence, the techniques, procedures, and models used in BI for data provision, information generation, and analysis, as well as the distribution of the information gained through BI processes, are presented and discussed. At the end of the course, you will be able to explain the various aspects of data warehousing and independently select methods or techniques to meet specific BI requirements. Ultimately, you will be able to independently design and prototype business intelligence applications based on concrete requirements.

UNIT 1
MOTIVATION AND INTRODUCTION

STUDY GOALS

On completion of this unit, you will have learned …
– what the term business intelligence (BI) means.
– how the term business intelligence was developed.
– the characteristics of a data warehouse.
– how the term business intelligence is defined in practice.

1. MOTIVATION AND INTRODUCTION

Introduction

For several years now, there has been a growing trend towards the globalization and dynamization of markets. As a result of greater competition, many companies now seek to create information advantages in order to establish overall competitive advantages. Information has thus become a managerial resource that is of strategic and tactical importance. An effective supply of relevant information is a prerequisite for improving the quality of corporate decision-making. Business intelligence (BI) involves integrating strategies, processes, and technologies to generate critical knowledge about the current status and potential of the often fragmented divisions of a company.
These perspectives on the company are then combined with market and competitor data in decision support systems, which present this newly acquired knowledge in such a way that it can be used directly for analysis, planning, and control purposes.

Located within the overarching concept of business intelligence is the data warehouse. The term data warehouse (DWH) is often understood differently due to the various definitions and interpretations that exist in the literature and in practice. In the following sections, we will explore the historical development of business intelligence. The term data warehouse is then described and positioned within the context of BI.

1.1 Motivation and Historical Development of the Field

The historical development of business intelligence goes back to the 1960s. As seen in the following figure, a number of different systems for supporting managerial decision-making have existed from that time. The term data warehouse was initially abbreviated to DW, but today, DWH is the more common abbreviation. You will find different abbreviations for data warehouse in the literature, particularly in some of the classic texts.

Figure 1: Historical Development (Source: Humm & Wietek, 2005, p. 4)

Management Information System (MIS)

At the end of the 1960s, the first information systems were introduced along with the term management information system (MIS). According to Grothe, the goal of an MIS was to “provide managers of companies with the information they need to make decisions. Time, content, and the way information was presented were to be optimized as secondary conditions” (Grothe, 2000, p. 65; translated by the author). However, according to Gluchowski et al. (2008), these goals could only be met to a limited extent because of limitations in the technology available at that time.

Decision Support System (DSS)

In the mid-1970s, the management information system was largely replaced by the decision support system (DSS).
With the advent of interactive electronic data processing (EDP) systems, additional models, methods, and scenarios were made available to companies, which enabled individual analyses of information (Gluchowski et al., 2008). Advances in hardware also made it possible to process information more efficiently. It was here that the basis for data-based decision support was laid (Grothe, 2000).

Unfortunately, decision support systems for the most part did not meet the high expectations associated with them. Thanks to technical progress, a DSS meant that structured data could be analyzed. However, the analysis of data was only possible for parts of the company. Moreover, this could only be done with operational data (Gluchowski et al., 2008; Grothe, 2000). According to Hannig (2002), a further problem was that managers for the most part did not accept decision support systems, as they did not trust computers to support creative decision-making processes.

Executive Information System (EIS)

In the mid-1980s, executive information systems (EIS) emerged at the same time as the arrival of powerful personal computers (PCs) in companies (Gluchowski et al., 2008; Hannig, 2002). The target users of executive information systems were primarily upper management and staff working in controlling functions. The EIS comprised individual systems that presented decision-relevant, multidimensional data to management in a more up-to-date and improved way than previous information systems (Gluchowski et al., 2008). In contrast to its predecessors (i.e., MIS and DSS), the EIS was easier to implement due to the spread of PCs in companies; whereas the MIS and DSS required central computers, the EIS could be operated on a PC. However, the disadvantage of using individual systems in the EIS was that they could only be used within a single department or company site, as they were developed individually for this purpose.
As with decision support systems, the potential of executive information systems was not realized, as they were not accepted by end users (Hannig, 2002), and making any changes to the EIS was expensive due to the individualized development of each system (Grothe, 2000).

Data Warehouse (DWH)

The main breakthrough in the acceptance and use of information systems came as a result of globalization, which accelerated at the beginning of the 1990s. Prior to this, managers had been largely skeptical about adopting such systems. However, as operations and supply chains spread across the globe, managers became more dependent on available information. Decision-making had also fundamentally changed as a result of decentralization. Decisions were no longer made in the head office (which could be located on the other side of the world), but rather were made promptly, locally, and using up-to-date information.

Another reason for the increased demand for effective information systems was the flood of data facing companies, resulting from internationalization and the associated spread of company locations around the world. Previous systems (MIS, DSS, and EIS) were simply not able to meet these requirements. A significant problem for many companies was that they now had to manage several inconsistent or non-compatible data sources. A new type of information management system was required: a complete, uniform, and consistent database (Hannig, 2002). A central database was developed that brought together data from the different systems used throughout a company; thus emerged the term data warehouse (DWH) (Grothe, 2000; Hannig, 2002). In the 1990s, the creation of DWH analysis tools, often referred to as BI, was a major influence on the development of data warehouses. Today, the term BI is mostly used as a generic term (Grothe, 2000).
1.2 Business Intelligence as a Framework

Many companies today face the same scenario: a constantly increasing flood of data paired with insufficient useful information. Ensuring the effective supply of information to management is an important competitive factor. Often, however, the right information is not delivered in the right quantity, at the right place, at the right time. The aim of business intelligence is to ensure that it is. With the help of the data warehouse, operational information logistics can be improved, and valuable, specific information can be delivered to management in a timely manner.

Features of a DWH

A DWH addresses the data problems experienced by management described above. The father of data warehousing, W. H. Inmon, coined the following definition:

DATA WAREHOUSE
A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management’s decisions (Inmon, 2005, p. 31).

The four basic characteristics of a data warehouse are described below:

1. Subject-oriented (theme-focused) means that the data stock of a DWH is selected and organized according to professional or business criteria.
2. Integrated (unified) refers to the integration of data from heterogeneous source systems. The data must be standardized with regard to structure and format.
3. Nonvolatile (persistent) refers to the permanent storage of data in the DWH. Stored data are not changed or deleted.
4. Time-variant (historicization) means that time series analyses (comparison of data over time) are possible in the DWH. Data are stored as they existed at specific points in time. As a result, changes and developments over time can be analyzed.

Definitions

There are a number of DWH and BI definitions found in the literature. Different interpretations of DWH and BI are therefore summarized in the following sections.
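The nonvolatile and time-variant characteristics above can be made concrete with a small sketch. This example is not from the course book; the key names and records are invented for illustration. Instead of overwriting a customer's address, each state is appended with a validity date, so the state at any past point in time remains queryable:

```python
from bisect import bisect_right

# Append-only store: (valid_from, value) pairs per key, never updated in place.
# Keys and records are hypothetical; ISO date strings sort chronologically.
history = {
    "customer_42_city": [
        ("2021-05-01", "Erfurt"),
        ("2023-02-15", "Munich"),  # a move appends a new state; the old one is kept
    ]
}

def value_as_of(key, day):
    """Return the state that was valid on the given day (time-variant query)."""
    states = history[key]
    i = bisect_right([valid_from for valid_from, _ in states], day)
    return states[i - 1][1] if i else None

print(value_as_of("customer_42_city", "2022-12-31"))  # Erfurt
print(value_as_of("customer_42_city", "2023-06-01"))  # Munich
```

Because nothing is deleted or overwritten, a comparison of data over time (e.g., "where did this customer live last year?") stays possible, which is exactly what an operational system that overwrites addresses cannot offer.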
DWH

The aforementioned definition of a data warehouse supplied by Inmon has been extended by several authors, who describe additional tasks such as the connection, extraction, and transformation of external data, as well as data collection and administration. According to Schinzer et al. (1999), the concept of the data warehouse also relates to the analysis and presentation of data with the help of appropriate tools.

Figure 2: Delimitation of the Term “DWH” (Source: Glasker, 2017)

Data warehouse (DWH) in the narrower sense: This involves purely data collection.

This figure indicates that a data warehouse (DWH), in the narrower sense of the word, covers purely data collection. A DWH, in the broader sense, includes report generation, graph generation, spreadsheet analysis, and analysis methods.

BI

Archiving data alone does not bring about any competitive advantages. These are only realized through the creative and intelligent application of data (Muksch & Behme, 1996). The use of knowledge available across the company is known as BI. BI thus represents a further extension of the DWH concept in the broader sense and can be thought of as the front end of the DWH. The term was originally coined by the Gartner Group and is defined as follows:

BUSINESS INTELLIGENCE
Business intelligence is the process of transforming data into information and, through discovery, into knowledge (Muksch & Behme, 1996, p. 37).

According to Gluchowski et al. (2008), BI involves techniques and applications that are designed to support decision-making and lead to a better understanding of the mechanisms driving outcomes. The following figure shows how BI can be classified using a narrow understanding, an analysis-oriented understanding, and a broader understanding of the concept of BI.

Figure 3: Classification of BI (Source: Gluchowski et al., 2008, p. 92)
BI, in the narrower sense, refers to core applications that support decision-making without the additional input of more advanced methods or modeling. These include online analytical processing (OLAP), MIS, and EIS. Analysis-oriented BI refers to the applications that allow decision makers to analyze existing data directly on the system using a user interface and various methods and models. These include OLAP, MIS and EIS, text mining, data mining, and ad hoc reporting.

Business intelligence (BI) in the broader sense: This includes all BI applications that are used directly or indirectly for decision-making.

Business intelligence (BI), in the broader sense, covers all applications that are used directly or indirectly for decision-making. This includes evaluation and presentation functions as well as data preparation and storage (Gluchowski et al., 2008; Kemper et al., 2010).

SUMMARY

With the help of business intelligence, companies try to gain information advantages and thus establish competitive advantages. Information is now a managerial resource that has immense strategic value. The historical development of business intelligence goes back to the 1960s, and information systems have gone through various iterations (BI, DWH, EIS, DSS, and MIS). The primary characteristics of a DWH are the collection of subject-oriented, integrated, nonvolatile, and time-variant data.

In the literature, there is a multitude of DWH and BI definitions. These definitions characterize DWH or BI in both a narrower and broader sense. BI, in a broader sense, includes all applications that support decision-making either directly (e.g., OLAP) or indirectly (e.g., data extraction).

UNIT 2
DATA PROVISIONING

STUDY GOALS

On completion of this unit, you will have learned …
– how operational and dispositive systems differ from one another.
– what typical BI reference architecture looks like.
– which basic BI components exist.
– which architecture variants are possible.

2. DATA PROVISIONING

Introduction

The term business intelligence (BI) refers to procedures and processes that facilitate the systematic analysis of electronic data. Insights derived from the data then enable companies to make better operational or strategic decisions. The basic prerequisite for the use of powerful BI tools is the preparation and storage of consistent data that meet the business needs of managers. From a technical perspective, a data warehouse (DWH) is realized via a database system. These often specialized database systems are configured for the needs of complex queries, since loading processes with high data volumes impose additional requirements. The storage of data in tables based on relational database systems is widespread.

2.1 Operational and Dispositive Systems

Dispositive: Something is said to be dispositive if it relates to the management and control of operational processes.

Erich Gutenberg, a German economist and key figure in post-war, modern business studies, classified activities of the firm as operational and dispositive (Schmidt, 1998). According to Gutenberg, work is operational if it directly relates to the provision of goods and services, the utilization of goods and services, and the performance of financial tasks that are not of a planning nature. Activities are deemed to be dispositive if they relate to the management and control of operational processes. In line with this distinction offered by Erich Gutenberg (1983), application systems and the data that exist within these systems can be distinguished as either operational or dispositive systems. Operational systems are about capturing and recording data, whereas dispositive systems are about analyzing data.

OLTP and OLAP

Operational systems: These are systems (e.g., ERP, CRM) that manage up-to-date information.

Operational systems are used to store and manage information necessary for the everyday operations of a company, e.g., a customer database or an employee directory.
Information in these systems is regularly changed and frequently queried. Only current data records are of interest; past address data of a customer, for example, are of little value and can be deleted or overwritten. The data models used in such operational systems must be optimized for a high number of transactions. The processing method used for operational systems is known as online transactional processing (OLTP).

The DWH falls under the branch of dispositive systems. Dispositive systems are used to extract information from operational data. For example, it might be determined that a significant number of customers have relocated in the last six months, and this information can be used to adapt and optimize sales structures. By using a DWH, operational systems are relieved of analytical queries, which might otherwise reduce processing capacities because of their complexity.

There are different types of queries conducted in operational and dispositive systems that aim to meet different objectives. Operational systems have a relatively large number of users. During business hours, numerous read requests are made for individual data records. Dispositive systems are generally queried by fewer people, but these people are individual experts seeking to address complex issues. They undertake sophisticated queries that evaluate a large number of data sets. The systems are optimized according to their respective application purpose in order to achieve an improved processing speed. The processing method used for dispositive systems is known as online analytical processing (OLAP). Due to the heterogeneity of transaction-oriented operational systems and the analytics-oriented data warehouse, the two are physically separate from one another.

Operational and Dispositive Data

When considering a DWH, the question of its technical necessity arises.
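The contrast between the two query styles can be sketched with Python's built-in sqlite3 module. The schema and records below are invented for illustration: an OLTP-style access touches a single contract record in a short transaction, while an OLAP-style query scans and aggregates many rows at once.

```python
import sqlite3

# Toy insurance schema (hypothetical; not taken from the course book).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE contracts (
    contract_id INTEGER PRIMARY KEY,
    customer_group TEXT,
    product_line TEXT,   -- 'motor' or 'life'
    premium REAL)""")
con.executemany("INSERT INTO contracts VALUES (?, ?, ?, ?)", [
    (1, "private",  "motor",  420.0),
    (2, "private",  "life",   900.0),
    (3, "business", "motor", 1500.0),
    (4, "business", "motor", 1100.0),
])

# OLTP-style access: short transactions on individual records.
con.execute("UPDATE contracts SET premium = 450.0 WHERE contract_id = 1")
single = con.execute(
    "SELECT premium FROM contracts WHERE contract_id = 1").fetchone()

# OLAP-style access: one analytical query aggregating many rows.
summary = con.execute("""
    SELECT customer_group, product_line,
           COUNT(*) AS contracts, SUM(premium) AS total_premium
    FROM contracts
    GROUP BY customer_group, product_line
    ORDER BY customer_group, product_line""").fetchall()

print(single)   # (450.0,)
print(summary)
```

In a real company, the second kind of query would scan millions of rows; running it on the operational system during business hours is exactly the load that the physical separation described above is meant to avoid.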
After all, a DWH is only a replication of data that a company generates and stores in its data processing systems. However, the need for a DWH becomes clear when two different views of the data, operational and dispositive, are considered. Operational data are directly related to the company’s service provision activities. Dispositive data, on the other hand, are of an analytical nature and are used to manage and control the company (Kemper et al., 2010). In the following table, the most important differences between the two views are described.

Table 1: Characteristics of Operational and Dispositive Data

Characteristic | Operational data | Dispositive data
Objective | Handling of business processes | Information for management; decision support
Alignment | Detailed, granular business transaction data | Mostly condensed, transformed data; comprehensive metadata
Time frame | Up-to-date; time-related; transaction-oriented | Different, task-dependent; history review
Modeling | Old stocks often not modeled (function-oriented) | Subject- or topic-related, standardized and suitable for end users
Status | Often redundant; inconsistent | Consistently modeled; controlled redundancy
Update | Running and competing | Complementary; updating of derived, aggregated data
Queries | Structured; mostly static in the program code | Ad hoc for complex, constantly changing questions, and ready-made standard evaluations

Source: Kemper et al., 2010, p. 16.

The differences between operational and dispositive data can be illustrated using the example of an insurance company. Examples of operational data in this context are

• detailed information on individual insurance contracts,
• continuous data changes recorded via the online portal, and
• storage of contracts in two different systems: one for motor vehicles and one for life insurance.
Examples of dispositive data are

• summaries of sales and profits for each customer group,
• presentation of temporal changes compared to the previous year, and
• comparison of motor vehicle and life insurance product lines.

We can see that an evaluation based directly on operational data does not meet the requirements of the planning process. The heterogeneous system landscape in particular makes it difficult to compare information. Furthermore, the concurrency of queries and transactions means that conducting direct analyses on operational data would be problematic: resource-intensive queries with long runtimes have the potential to block the entire operational system and impair day-to-day business (Bauer & Günzel, 2008; Kemper et al., 2010).

2.2 The Data Warehouse Concept

In practice, the data warehouse can include different process phases, architectures, and BI components, depending on the requirements of the organization utilizing it. The following section explains basic concepts of reference architectures that must be customized for the actual project at hand. Please note that the terms business intelligence and data warehouse (in its broader sense) are used synonymously here.

Process Phases and Reference Architecture

A large number of process phases and reference architectures exist in the literature. Process phases refer to the stages through which data pass. Data warehouse reference architecture refers to the template for designing the collection and storage of data using the data warehouse. According to Kemper et al. (2010), the process phases are as follows:

• data provision,
• information generation, storage, and distribution, and
• information access.

The first process phase of a DWH is to merge data and information from heterogeneous sources. The data can come from supply chain management (SCM), enterprise resource planning (ERP), customer relationship management (CRM), or external systems.
All of these heterogeneous data are merged in the DWH. In the second step, data are analyzed using OLAP and data mining. In addition to extensive query options, these systems can also generate event-driven warning messages. In the third step, the findings from the second step are communicated to the company in the form of recommendations or actions. The following figure from Gansor et al. (2010) illustrates the various components of BI reference architecture and provides context for their subsequent description.

Figure 4: BI Reference Architecture (Source: Gansor et al., 2010, p. 56)

BI Components

Source systems

In classic BI reference architecture, data from a number of heterogeneous sources are imported, all with different structures, content, and access interfaces. OLTP systems are usually used; however, in principle, any type of source system can conceivably be included, e.g., semi-structured websites or unstructured text files. Source systems can include both internal company data (e.g., ERP or PPS systems) and externally procured data (e.g., stock market prices, current raw material prices).

Staging area

The staging area is a work area in which data are temporarily stored. The staging area is necessary to relieve downstream systems when processing large amounts of data (Inmon, 2005).

Operational data store (ODS)

In contrast to the classic DWH approach, the ODS does not hold aggregated data or longer histories. It is frequently used as a preliminary stage for supplying data to conventional DWH approaches (Kemper et al., 2010).

Basic database (core data warehouse)

The basic database is the central database within the DWH. After the initial transformation process, data are made available for various evaluation purposes or downstream systems.

Evaluation database (data mart)

The evaluation database forms the basis for downstream analysis tools. The data are stored with the help of a multidimensional model.
From a technical point of view, evaluation databases are usually based on relational databases. Often, several evaluation databases are used and data are divided according to analysis requirements or organizational units (Bauer & Günzel, 2008).

Extracting, transforming, and loading (ETL) process

The extracting, transforming, and loading (ETL) process integrates the data from the source systems into the DWH. The processing steps of extracting, transforming, and loading are carried out using ETL tools. From the data source systems, the data are transferred to the staging area via the data extraction step. After extracting the data from the data sources and loading them into the work area, the data must be converted according to the requirements of the company. Transformation affects both the structure and the content of the data. Data that come from different sources must be converted into a uniform format. Plausibility checks can be used to improve data quality. The data are then transferred to the basic database (loading) as soon as they are available in a cleansed state following transformation. Since the basic database already contains integrated and cleansed data, data only need to be transformed into the target schema and possibly enriched or aggregated before they are loaded into the evaluation database (Bauer & Günzel, 2008).

Aggregation

Data are aggregated if they are required at a lower granularity than in the source systems. The process of aggregation considerably reduces the amount of data. For performance reasons, data are usually aggregated to the minimum required granularity. An example of aggregating data is combining daily sales into monthly sales.

Granularity: the level of detail data have within a data structure ("Granularity", 2020).

Front end

Analysis tools form the front end of the BI architecture. Front-end tools can be more or less complex, depending on their application.
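The ETL flow and the aggregation example described above can be sketched in a few lines of Python. This is a minimal illustration, not a production ETL implementation; the source records, field names, and the exchange rate are invented for the example.

```python
from collections import defaultdict

# Hypothetical extracts from two source systems (records are invented).
source_a = [{"day": "2022-03-01", "product": "P1", "amount_eur": 120.0},
            {"day": "2022-03-15", "product": "P1", "amount_eur": 80.0}]
source_b = [{"day": "2022-03-07", "product": "P1", "amount_usd": 90.0}]

def transform(rows, usd_to_eur=0.5):
    """Convert both extracts into a uniform format (all amounts in EUR)."""
    out = []
    for r in rows:
        amount = r.get("amount_eur", r.get("amount_usd", 0.0) * usd_to_eur)
        out.append({"month": r["day"][:7], "product": r["product"], "amount": amount})
    return out

def aggregate_monthly(rows):
    """Combine daily sales into monthly sales per product."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["month"], r["product"])] += r["amount"]
    return dict(totals)

# Extract -> transform -> (the load step would write these rows to the DWH).
monthly = aggregate_monthly(transform(source_a) + transform(source_b))
print(monthly)  # {('2022-03', 'P1'): 245.0}
```

The same pattern scales to real ETL tools: a uniform target format is fixed first, and aggregation then reduces the data volume to the required granularity.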
Tools for data mining and OLAP are used to analyze the dataset and extract information from the mass of available data. Previously unknown relationships can then be uncovered, particularly through the use of data mining, where techniques such as classification and clustering are used. OLAP tools make the dataset accessible in an interactive way. The choice to aggregate data and the degree of aggregation for displayed data can be determined by the user. Portal systems are usually used to access information (Kemper et al., 2010).

2.3 Architecture Variants

In practice, there are a large number of architecture variants used for constructing data warehouses, some of which have been borrowed from other areas of data management, while others have emerged from the BI field itself. In this section, several known architecture variants are listed and then described in further detail. The basic architectural variants that can be used to create a DWH include

• independent data marts,
• data marts with coordinated data models,
• a central core data warehouse (C-DWH) with no data marts,
• several C-DWHs,
• a C-DWH with dependent data marts, and
• a DWH architecture mix.

Independent Data Marts

Independent data marts: independent DWHs created in individual departments.

In practice, the architecture form of independent data marts is often created by individual departments building their own DWHs independently of each other, as seen in the following figure.

Figure 5: Independent Data Marts
Source: Kemper et al., 2010, p. 22.

When using independent data marts, a central database (core data warehouse) is not required. This approach reduces the complexity of the entire DWH, making it easier and more manageable. As a result, usable results for the departments can be achieved in a relatively short period of time.
However, the development of a company-wide data warehouse becomes much more difficult due to the subsequent isolation of applications (Kemper et al., 2010).

Data Marts with Coordinated Data Models

As in the previous variant, source data are prepared several times for different data management systems. However, the individual data marts coordinate with each other with regard to a common data model, as seen in the following figure.

Figure 6: Data Marts with Coordinated Data Models
Source: Kemper et al., 2010, p. 22.

Data marts with coordinated data models: several data marts using a common data model.

The use of data marts with conceptually coordinated data models ensures the consistency and integrity of the dispositive data model. Compared to the first variant, the establishment of a company-wide data warehouse involves less effort (Kemper et al., 2010).

Central C-DWH (No Data Marts)

For smaller BI solutions, it may make sense to dispense with data marts, e.g., if the number of end users and the data volumes are small. In this case, we recommend the central C-DWH variant.

Central C-DWH: a variant that places the evaluation function of the C-DWH in the foreground.

Figure 7: Central C-DWH
Source: Kemper et al., 2010, p. 22.

With this monolithic approach, the evaluation function of the core data warehouse is placed in the foreground. However, this approach can have considerable disadvantages (e.g., performance, administration effort) in complex solutions (Kemper et al., 2010).

Multiple C-DWHs

Multiple C-DWHs: a variant that is useful for large, division-oriented companies.

Under certain business conditions, such as in the case of different product or market structures, it is possible to set up several core data warehouses, creating a variant known as multiple C-DWHs.

Figure 8: Multiple C-DWHs
Source: Kemper et al., 2010, p. 22.
This framework is particularly prevalent in sector-oriented companies and large corporations that produce a diverse range of products and services (Kemper et al., 2010).

C-DWH and Dependent Data Marts

Extending the core data warehouse with data marts is the architecture variant most frequently presented in textbooks. The data marts are supplied with the help of transformation processes and data from the core data warehouse. The following figure shows the C-DWH and dependent data marts architectural variant.

Figure 9: C-DWH and Dependent Data Marts
Source: Kemper et al., 2010, p. 22.

C-DWH and dependent data marts: the architectural variant most frequently presented in the literature. Its major advantage is short response times.

With dependent data marts, data are periodically extracted from the C-DWH and stored in data marts. The extracted data are small, department-specific data extracts from the core data warehouse. By creating these extracts, the data volume of the data marts is kept considerably smaller. As a result, faster response times can be achieved for queries to this data stock (Kemper et al., 2010). The structure used for dependent data marts is often referred to as hub-and-spoke architecture (Bauer & Günzel, 2008).

DWH Architecture Mix

In practice, a common architectural variant is the DWH architecture mix, which consists of C-DWHs, dependent and independent data marts, and direct data access (i.e., a virtual DWH with its own data transformation).

DWH architecture mix: consists of C-DWHs, dependent and independent data marts, and direct data access.

Figure 10: Mix of DWH Architecture
Source: Kemper et al., 2010, p. 22.

The approach taken when developing a BI architecture can be an iterative, organic process where the architecture evolves over time in keeping with the BI needs of the organization.
However, the approach can also be the result of a conscious design process to ensure optimal support for value-adding primary processes and adjunct cross-sectional processes (Kemper et al., 2010). In practice, there are many data warehouse systems that modify the standard architecture variants and adapt them to the respective requirements of the specific organization. For example, several data marts and an operational data store (ODS) can be created to cooperate with the company-wide core data warehouse.

SUMMARY

Using Erich Gutenberg's criteria, application systems can be divided into operational and dispositive systems according to the type of work they support. Operational systems serve to store and manage everyday information for a company. The data warehouse is considered a dispositive system. DWHs are used to extract information from operational data. The corresponding data from these systems can also be classified as operational or dispositive.

From extracting operational data to managerial decision-making, BI (or use of a DWH) can be distinguished according to the process phases: (1) data provision, (2) information generation, storage, and distribution, and (3) information access.

BI architecture consists of various components, e.g., source systems, staging area, ODS, C-DWH, data mart, ETL, aggregation, and front end, combined in a number of different architectural variants. In practice, there are a large number of DWH or data mart architecture variants, some of which have emerged directly from the BI field.

UNIT 3
DATA WAREHOUSE

STUDY GOALS

On completion of this unit, you will have learned …

– how data from different operational systems are integrated company-wide.
– what transformation steps are necessary to achieve this.
– what distinguishes a C-DWH from data mart architecture.
– which functions are offered by an operational data store.
– the extent to which metadata can support business intelligence.
3. DATA WAREHOUSE

Introduction

Before business intelligence (BI)-relevant data are made available in the data warehouse, a number of activities need to take place. BI applications require integrated data that are organized in a subject-specific manner, e.g., according to customer, product, or organizational unit. Using the extracting, transforming, and loading (ETL) process, data from operational systems are transformed into data that can be interpreted from a business management perspective. This requires that data stored over long periods of time are made available to management in aggregated form. It also requires that large amounts of data from several operational databases are consolidated and stored in the data warehouse.

The ETL process cleanses and transforms operational data, which are then stored in the data warehouse for further analysis. After the extraction of operational data from the source systems, the transformation process prepares the data for use. Preparation takes place via four sub-processes: filtering, harmonization, aggregation, and enrichment. Data are then loaded into the evaluation level of the data warehouse (DWH).

3.1 ETL Process

ETL process: the process used to convert operational data into management-relevant information.

In order to merge and prepare data from several operational data sources, the data are converted into management-relevant information via a process of targeted conversion. This is carried out using three steps (extract, transform, and load), which are collectively known as the ETL process. This process is illustrated in the following figure.

Figure 11: ETL Process
Source: Glasker, 2017.

Large amounts of data are extracted from source systems, processed according to the requirements of the DWH, and then inserted into, or written to, the DWH. The process of transferring data from operational sources to the DWH typically takes place at periodic intervals and consists of the following three steps:
1. Extraction of relevant data from various sources
2. Transformation of the data into a uniform multidimensional format
3. Loading of the data into the data warehouse, where they are available for analysis

Establishing the ETL process is the most complex step in data warehouse development. The ETL process is of central importance, as the creation of a solid DWH is only possible if it contains high-quality data. In principle, ETL processes can be individually programmed or developed with the help of various tools. Due to the complexity of ETL processes, the use of a tool is recommended in most cases (Kimball & Caserta, 2004). The following sections describe the transformation process in detail.

Components of the Transformation Process

Transformation: consists of filtering, harmonization, aggregation, and enrichment.

The transformation step is the most elaborate and complex part of the integration process. According to Kemper et al. (2010), transformation consists of four sub-processes—filtering, harmonization, aggregation, and enrichment—which are outlined below.

Table 2: Sub-Processes of Transformation

Filtering: extraction and correction of technical and content defects in the data
Harmonization: business reconciliation of the filtered data
Aggregation: aggregation of the filtered and harmonized data
Enrichment: calculation and storage of key business figures

Source: Kemper et al., 2010, p. 28.

The individual components are described in detail below. The first two transformation steps—filtering and harmonization—are responsible for cleansing and preparing data, e.g., aligning different codes and currencies. Afterwards, the data are, in principle, ready for BI analyses. The next two steps—aggregation and enrichment—summarize data according to topic. Business key figures are also added to the data.
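The four sub-processes can be pictured as a pipeline in which each stage consumes the output of the previous one. The sketch below is illustrative only; the record layout and the rules in each stage are invented placeholders, not Kemper et al.'s specification.

```python
# Illustrative pipeline for the four transformation sub-processes.
# The record layout and rules are invented for the example.

def filter_stage(rows):
    """Filtering: drop records with a technical defect (here: missing amount)."""
    return [r for r in rows if r.get("amount") is not None]

def harmonize_stage(rows):
    """Harmonization: align codes, e.g., map gender codes M/W and 0/1 to one scheme."""
    gender_map = {"M": "male", "W": "female", "0": "male", "1": "female"}
    return [{**r, "gender": gender_map.get(str(r["gender"]), r["gender"])} for r in rows]

def aggregate_stage(rows):
    """Aggregation: condense individual records to totals per gender."""
    totals = {}
    for r in rows:
        totals[r["gender"]] = totals.get(r["gender"], 0) + r["amount"]
    return totals

def enrich_stage(totals):
    """Enrichment: add a pre-calculated key figure (share of the overall total)."""
    overall = sum(totals.values())
    return {g: {"total": t, "share": t / overall} for g, t in totals.items()}

rows = [{"gender": "M", "amount": 100}, {"gender": 1, "amount": 50},
        {"gender": "W", "amount": 50}, {"gender": "M", "amount": None}]
result = enrich_stage(aggregate_stage(harmonize_stage(filter_stage(rows))))
print(result)  # {'male': {'total': 100, 'share': 0.5}, 'female': {'total': 100, 'share': 0.5}}
```

Note how the defective record is removed in the first stage and the differing gender codes are unified before aggregation, mirroring the order of the sub-processes in Table 2.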
The data generated in this way and loaded into the data warehouse are thus already oriented to the needs of individual user groups and their analysis purposes.

Transformation 1: Filtering

Filtering: the filtering sub-process includes the intermediate storage of extracts and data cleansing.

With the help of filtering, the data required for the DWH are selected, temporarily stored, and freed from defects. The filtering process is divided into extraction and cleansing. During extraction, the data are placed in the extraction areas (staging areas) specially provided for this purpose. The purpose of cleansing is to remove syntactic and semantic defects. In the following figure, filtering is represented as a sub-process of the transformation process.

Figure 12: Transformation 1: Filtering
Source: Kemper et al., 2010, p. 28.

The purpose of data cleansing is to correct defects and achieve a specified level of data quality. Cleansing is necessary because operational systems do not always contain correct data. There are many reasons for incorrect data, e.g., incorrect entries by users, system errors, or system updates. The types of defects to be remedied can be divided into syntactic (technical) and semantic (content) defects. Syntactic defects are formal errors such as incorrect control characters, alphanumeric values in numeric fields, NULL values in a NOT NULL field, or values outside the value range. Semantic defects are errors of a business nature, such as obviously incorrect sales figures.

Defect classes

In the literature, a distinction is made between first, second, and third class defects. Defects of the first class can be detected and corrected automatically during the extraction process. For second class defects, defect recognition is automatic, but the correction must be made manually after the extraction process. Defects of the third class can only be detected and corrected manually.
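The syntactic defect types listed above can be checked mechanically. The following sketch is illustrative; the field names, the NOT NULL column, and the value range are invented for the example.

```python
import re

# Illustrative checks for the syntactic defect types named in the text:
# NULL in a NOT NULL field, alphanumeric values in numeric fields, and
# values outside the value range. Field names and ranges are invented.

def syntactic_defects(record):
    """Return a list of defect descriptions for one source record."""
    defects = []
    if record.get("customer_id") is None:            # NULL in a NOT NULL field
        defects.append("customer_id is NULL")
    amount = record.get("amount")
    if amount is not None and not re.fullmatch(r"-?\d+(\.\d+)?", str(amount)):
        defects.append("amount is not numeric")      # alphanumeric in numeric field
    month = record.get("month")
    if month is not None and not 1 <= month <= 12:   # value outside the value range
        defects.append("month outside range 1..12")
    return defects

print(syntactic_defects({"customer_id": None, "amount": "12x", "month": 13}))
# ['customer_id is NULL', 'amount is not numeric', 'month outside range 1..12']
```

Semantic defects, in contrast, require business knowledge (e.g., plausibility checks against planned values) and often cannot be expressed as simple format rules.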
Table 3: Classification of Defects in the Framework of the Correction

Class 1 (automatic detection and correction)
• Syntactic defects: known format adjustment
• Semantic defects: missing data values

Class 2 (automatic detection and manual correction)
• Syntactic defects: recognizable format incompatibilities
• Semantic defects: outlier values/inconsistent value constellations

Class 3 (manual detection and manual correction)
• Syntactic defects: –
• Semantic defects: undetected semantic errors in source data

Source: Chamoni & Gluchowski, 2015, p. 135.

The basic defects of the first class that are automatically recognized can be corrected using certain algorithms. For example, internal format, control, and special characters can be identified at the syntactic level during extraction and processed in the extracted data using assignment tables (mapping tables). The same applies to semantic errors. If, for example, data from individual stores were omitted when transferring sales data, these can be supplemented using equivalent values, such as monthly planned values or actual values from the previous month.

Defects of the second class can also be detected automatically but must be corrected manually by technicians or business economists. In the case of syntactic defects, an example of these would be syntax variants in the operational data sources that have not yet been taken into account. Once detected and corrected, these can be handled automatically in the future. On a semantic level, automated plausibility checks and value range checks can detect invalid data fields, e.g., by comparing balance sheet and control totals. Depending on the severity of the error, the operational sources may also need to be corrected. While syntactic defects can always be detected automatically, this does not apply to semantic defects.

Defects of the third class only concern semantic errors. These are all defects that cannot be detected by the test procedures for second class errors, i.e., neither plausibility nor value range checks.
Rather, these defects can only be identified by business experts. Here too, the operational sources may need to be corrected as well.

Transformation 2: Harmonization

Harmonization: the harmonization sub-process includes the business reconciliation of filtered data.

The second transformation step after filtering deals with the harmonization of the data. Harmonization, also known as normalization, refers to the process of reconciling filtered data. Harmonization is necessary if data from different source systems are integrated. In source systems that have grown heterogeneously, different keys or characteristics are often used for the same facts or properties. The classic example is the key for gender, which could be represented differently in three different systems, e.g., male/female, M/W, 0/1. The goal of harmonization is therefore to combine the same facts and characteristics under a common key. Harmonization can also involve transferring different measures into a common measure, e.g., different currencies are converted into a single currency so that they are comparable (Kimball & Caserta, 2004).

Figure 13: Transformation 2: Harmonization
Source: Kemper et al., 2010, p. 32.

In the harmonization sub-process, the filtered and cleansed data are merged. A distinction is made between the following types of syntactic and business harmonization.

Syntactic harmonization

Syntactic harmonization includes the resolution of key disharmonies as well as of codes, synonyms, and homonyms, which we will now discuss in more detail.

Key disharmonies: Harmonization must include the resolution of key disharmonies (conflicting keys). In principle, a common key is necessary when data from several databases are merged. The problem is usually solved with the help of a mapping table, which generates a new, artificial primary key, e.g., for each customer. The primary keys of the operational systems are then carried along as foreign keys, so that evaluations can be carried out on them.
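The mapping-table approach just described can be sketched as follows. This is a minimal illustration; the source systems, record layouts, and key values are invented, and a real ETL tool would persist the mapping table rather than keep it in memory.

```python
from itertools import count

# Illustrative mapping table for resolving key disharmonies: a new artificial
# primary key is generated per customer, and the operational keys are carried
# along as foreign keys. All records and key values are invented.

mapping = {}            # (source system, local key) -> artificial global key
_next_key = count(1)

def global_key(source, local_key):
    """Look up or create the artificial primary key; the mapping documents it."""
    if (source, local_key) not in mapping:
        mapping[(source, local_key)] = next(_next_key)
    return mapping[(source, local_key)]

crm = [{"crm_id": "C-17", "name": "Smith"}]
erp = [{"erp_id": 4711, "name": "Jones"}]

customers = (
    [{"cust_key": global_key("crm", r["crm_id"]), "source_key": r["crm_id"],
      "name": r["name"]} for r in crm]
    + [{"cust_key": global_key("erp", r["erp_id"]), "source_key": r["erp_id"],
       "name": r["name"]} for r in erp]
)
print(customers[0])  # {'cust_key': 1, 'source_key': 'C-17', 'name': 'Smith'}
```

Keeping the source keys alongside the new artificial key is what allows later evaluations (and corrections) to be traced back to the operational systems.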
The keys of data records must be unique within the basic database, DWH, and data mart. The keys available in source systems do not usually fulfill this requirement due to their heterogeneity and the distribution of the data. During the transformation phase, global, unique keys must therefore be assigned. These global keys are called "surrogate keys." Modern ETL tools have standardized transformations that generate unique surrogate keys. The mapping of local keys to global surrogate keys must be documented in order to be able to react flexibly to changes. Besides the standardization of data, surrogate keys play an important role in historicization (Kimball & Caserta, 2004).

For example, take two data sources (e.g., CRM, ERP) that contain customer data that must be integrated into a table. Both source data records have a customer key. A global key for customer data must therefore be introduced for the target database. To do this, the keys of the source tables are removed during transformation and replaced by surrogates.

In addition to key disharmonies, codes, synonyms, and homonyms must also be resolved. Here are some examples:

• Codes. Individual data sets can be coded differently. For example, attributes such as gender can be coded as M/W in data source one and as 0/1 in data source two.
• Synonyms. Different attribute names can have the same meaning. For example, in data source one, the attribute "personnel" may be provided for the name of company employees, but in data source two, it may be "employees."
• Homonyms. The same attribute names can have different meanings. For example, in data source one, "partner" can mean the name of customers, while in data source two, "partner" can refer to the name of suppliers.

In all three cases, the data must be harmonized.
In the first case, the attribute value must be uniformly set, e.g., to 0/1 values; in the second case, a common attribute name must be chosen; in the third case, different attribute names must be chosen for the two categories. Mapping tables are usually utilized for the matching process; they merge the filtered data into subject-oriented data collections via name and code matching.

Semantic harmonization

In addition to syntactic alignment, the standardization of business terms is also carried out. This is also known as semantic harmonization. The normalization of business terms is not so much a technical problem as a business and organizational one. The operational data (e.g., currency) must be converted into uniform values, i.e., monetary values of different currencies must exist in a uniform currency system. For the corresponding activities, transformation rules can be implemented. After completion of the harmonization sub-process, cleansed and consistent data are available in the data warehouse for analysis purposes.

Transformation 3: Aggregation

Aggregation: the aggregation sub-process involves the compression of filtered and harmonized data.

With the help of the aggregation sub-process, filtered and harmonized data are condensed and converted to the desired granularity.

Figure 14: Transformation 3: Aggregation
Source: Kemper et al., 2010, p. 37.

Let's consider the aggregation sub-process in practice. To create daily updated data for product and customer groups, all individual data must be summarized via aggregation algorithms to produce daily product- and customer-group-specific values. In addition, running totals are calculated for business key figures. The aim of aggregation is to generate total values that are stored in the data warehouse in pre-calculated form for later use.

Transformation 4: Enrichment

The creation and storage of key business figures from filtered and harmonized data is called enrichment.
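A minimal sketch of such a pre-calculated key figure is shown below. The cost layout (weekly revenue and variable costs per product) is invented for the example; the point is only that the key figure is computed once during transformation and then stored with the data.

```python
# Illustrative enrichment step: pre-calculating a contribution margin as a
# stored business key figure. The record layout and values are invented.

weekly_product_rows = [
    {"week": "2022-W10", "product": "P1", "revenue": 500.0, "variable_costs": 320.0},
    {"week": "2022-W10", "product": "P2", "revenue": 200.0, "variable_costs": 150.0},
]

def enrich(rows):
    """Add the pre-calculated contribution margin to each aggregated row."""
    for r in rows:
        r["contribution_margin"] = r["revenue"] - r["variable_costs"]
    return rows

enriched = enrich(weekly_product_rows)
print(enriched[0]["contribution_margin"])  # 180.0
```

Because the figure is calculated once and stored, later queries do not repeat the calculation, and every user sees the same, consistent value.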
The enrichment sub-process is the final step in the transformation process.

Enrichment: the enrichment sub-process involves the creation of business key figures after harmonization or aggregation.

Figure 15: Transformation 4: Enrichment
Source: Kemper et al., 2010, p. 38.

As previously described, the existing data are enriched with business indicators. Calculations are performed and the results are added to the remaining data. In particular, key figures that are relevant for several users are stored. For example, weekly contribution margins at product level or annual contribution margins at store level can be calculated and integrated. The former may be of interest to the product manager, while the latter is of interest to store managers and executive management. There are several advantages to including these key figures in the dataset. Due to the pre-calculation of these figures, queries can be performed more efficiently. In addition, the pre-calculated values are consistent due to the one-time calculation. Furthermore, these figures facilitate coordinated business management.

The main activities of the ETL process are completed through the extraction process, but particularly the transformation process. The transformation sub-processes are carried out in the staging area. After transformation, the data are written to the target system during the loading phase.

3.2 DWH and Data-Mart Concepts

The DWH in a narrower sense includes data storage. The individual components of the DWH that facilitate data storage, including the staging area, basic database, data mart, ODS, and metadata, are now described in detail.

Staging Area

According to Inmon (2005), the staging area or landing zone is a workspace in which data are temporarily stored. Extraction into the staging area usually takes place periodically. Transformations are performed within the staging area. In principle, the data are deleted from the staging area after they have been loaded into downstream systems.
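The staging-area lifecycle just described (temporary storage, transformation, loading, deletion) can be sketched as follows. All names are invented; the point is only the order of operations.

```python
# Illustrative staging-area lifecycle: extracts are held temporarily,
# loaded into the downstream system, and then deleted from the staging area.
# The record layout and system names are invented.

staging_area = []

def extract_to_staging(source_rows):
    """Periodic extraction places raw rows in the staging area."""
    staging_area.extend(source_rows)

def load_downstream(basic_database):
    """Move the (transformed) rows downstream, then clear the staging area."""
    basic_database.extend(staging_area)
    staging_area.clear()

basic_db = []
extract_to_staging([{"id": 1}, {"id": 2}])
load_downstream(basic_db)
print(len(basic_db), len(staging_area))  # 2 0
```

Clearing the staging area after each load is what keeps it a lightweight work area rather than a second data store.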
The purpose of the staging area is to relieve downstream systems (e.g., the basic database). This separate work area is of particular importance when processing large amounts of data and performing complex transformations (Inmon, 2005).

Staging area: where extracts are stored temporarily in order to relieve downstream systems.

C-DWH Functionality

The basic database (known as the C-DWH) is a data store or repository located between the staging area and the evaluation database. The basic database differs from the staging area in particular in the way the data are stored. The database provides detailed, historicized, consistent, and normalized data for downstream systems. The data are transferred from the staging area once they are in a specific state, whereupon they are integrated into the basic database and stored in an adjusted form. With regard to the evaluation database, the basic database is primarily defined by the data model. In the basic database, data are stored in a normalized and query-neutral state. In addition, the data are stored at the smallest required granularity. Depending on requirements, data are also historicized, that is, they are kept track of over time (Bauer & Günzel, 2008).

Basic database: contains integrated data for downstream systems.

The C-DWH performs the following functions:

• Collection and integration. This refers to the collection of company-wide data required for later analysis.
• Distribution. This refers to the data supply of downstream systems (e.g., data marts).
• Quality assurance. This refers to how transformed data ensure the syntactic and semantic coherence of the dispositive database.

Update strategy

The co
