Datawarehouse.ppt
Document Details
Uploaded by AdoredCharoite
Full Transcript
BUSINESS INTELLIGENCE DATA WAREHOUSE DATA WAREHOUSE “The data warehouse is a collection of integrated, subject-oriented databases designed to support DSS functions, where each unit of data is nonvolatile and relevant to some moment in time” (Inmon) A copy of transaction data specifically str...
BUSINESS INTELLIGENCE DATA WAREHOUSE DATA WAREHOUSE “The data warehouse is a collection of integrated, subject-oriented databases designed to support DSS functions, where each unit of data is nonvolatile and relevant to some moment in time” (Inmon) A copy of transaction data specifically structured for query and analysis (Kimball) A data warehouse is a repository of an organization's electronically stored data, designed to facilitate reporting and analysis . (Wikipedia) DATA WAREHOUSE • A decision support database that is maintained separately from the organization’s operational database. • Support information processing by providing a solid platform of consolidated, historical data for analysis. 4 MAIN CHARACTERISTICS OF DATA WAREHOUSING 1. Subject oriented • Organized around major subjects, such as sales progress • Containing only information relevant for decision support • Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing 4 MAIN CHARACTERISTICS OF DATA WAREHOUSING 1. Subject oriented • For example, to learn more about your company's sales, you can build concentrates on sales. a warehouse that Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented (http://docs.oracle.com/) 4 MAIN CHARACTERISTICS OF DATA WAREHOUSING 2. Integrated • Constructed by integrating various data sources • Must place data from different sources into a consistent format, to do so they must deal with naming conflict and discrepancies • Data cleaning and data techniques are applied • Ensure consistency conventions among sources • When data is moved to the warehouse, it is converted multiple, integration in naming different data 4 MAIN CHARACTERISTICS OF DATA WAREHOUSING 3. Time variant (time series) • maintains historical data, data for analysis from multiple sources contain multiple time points A data warehouse's focus on change over time • The time horizon for the data warehouse is significantly longer than that of operational systems Operational database: current value data Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) 4 MAIN CHARACTERISTICS OF DATA WAREHOUSING 4. Non-volatile • after data are entered into a data warehouse, users cannot change or update the data. Operational update of data does not occur in the data warehouse environment • Does not require transaction processing, recovery, and concurrency control mechanisms • Requires only two operations in data accessing: • Initial loading of data and access of data SUMMARY OF DATA WAREHOUSE • Runs on a DBMS such as Oracle, SQL, DB2 … • Keeps a large amount of data from different time for a long period of time • Data in data warehouse cannot be overwritten by users • Data comes from various sources, internally and externally • Carefully designed to allow for analysis/ pattern discovery on identified subject matter OLTP OLTP (on-line transaction processing) Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. Database type : Operational OLAP • Online Analytical Processing (OLAP) is a reporting application that provides high-performance analysis and easy reporting on large volumes of data • The goal of OLAP: – multidimensional data analysis, – provide fast and flexible data summarization, analysis, and reporting capabilities – ability to view trends over time Type of database : Data warehouse OLTP VS OLAP OLTP OLAP Users Clerk, IT professional Knowledge worker Function Day to day operations Decision support DB Design To suit typical database function of update, edit, delete, relational Designed for reporting on Subjects, data warehouse Data Current, up-to-date detailed, Historical, summarized, integrated, multidimensional,, consolidated Usage Repetitive, structured Ad-hoc, un-structured Access Read/write Read. Lots of scans Type of Work Short, simple transaction Complex query # Records Accessed Tens Millions # Users Thousands Hundreds, Tens DB Size 100MB-GB 100GB-TB HOW THE DATABASE LOOKS LIKE FOR THE TWO TYPES The operational database (relational): HOW THE DATABASE LOOKS LIKE FOR THE TWO TYPES The datawarehouse (star schema): WHY … Can we not operate on operational database to obtain the answers to our business questions? Answer : require complex query formulation, preparation of data to address the query and if use the operational database, the process will be very slow due to complex joins and multiple scans A typical data warehouse query scans thousands or millions of rows. For example, "Find the total sales for all customers last month." A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order for this customer." DATA WAREHOUSING - CONCEPT Data mart Smaller and focuses on a particular subject or department. It is a subset of data warehouse/departmental data warehouse A data mart is a smaller DW designed around one problem, organizational function, topic, or other focus area. Can be Dependent data mart A subset that is created directly from a data warehouse Ensures that the end user is viewing the same version of the data that are accessed by all other data warehouse users Or Independent data mart A small data warehouse designed for a strategic business unit or a department DATA WAREHOUSING - CONCEPT Enterprise data warehouse (EDW) A large scale data warehouse enterprise for decision support used across the Used to provide data for many types of DSS, including CRM, supply chain management, BPM, KMS etc Metadata Data about data. In a data warehouse, metadata describe the contents of a data warehouse and the manner of its use. Metadata in layman term : Metadata describes other data. It provides information about a certain item's content. For example, an image may include metadata that describes how large the picture is, the color depth, the image resolution, when the image was created, and other data http://www.techterms.com/definition/metadata DATA WAREHOUSING PROCESS OVERVIEW The data warehousing process consists of the following steps: 1. Data are imported from various internal and external sources 2. Data are cleansed and organized consistently with the organization’s needs 3a. Data are loaded into the enterprise data warehouse 4a.If desired, data marts are created as subsets of the EDW —or— 3b.Data are loaded into data marts 4b.The data marts are consolidated into the EDW 5. Analyses are performed as needed DATA WAREHOUSING - PROCESS OVERVIEW The major components of a data warehousing process • Data sources. Data are sourced from operational systems • • • • and possibly from external data sources. Data extraction. Data are extracted using custom-written or commercial software called ETL. Data loading. Data are loaded into a staging area, where they are transformed and cleansed. The data are then ready to load into the data warehouse. Data warehouse/Comprehensive database. This is the EDW that supports decision analysis by providing relevant summarized and detailed information. Middleware tools. Middleware tools enable access to the data warehouse from a variety of front-end applications. Data Warehousing - Process Overview DATA WAREHOUSING ARCHITECTURES There are several basic architectures for data warehousing To distinguished the architectures data warehouse is divided into three parts (Three-tier architecture): 1. The data warehouse that contains the data & software 2. Data acquisition (back-end) software, which extracts data from legacy systems and external sources, consolidates and loads into the data warehouse 3. Client (front-end) software, which allows users access and analyze data from the warehouse Alternative Data Warehousing Architectures Alternative Data Warehousing Architectures DATA WAREHOUSING ARCHITECTURES Factors that potentially affect the architecture selection decision: 3. Information interdependence between organizational units Upper management’s information needs Urgency of need for a data warehouse 4. Constraints on resources, funding 5. Strategic view of the data warehouse prior to implementation 6. Compatibility with existing systems 1. 2. DATA INTEGRATION: THE EXTRACTION, TRANSFORMATION, AND LOAD (ETL) PROCESS Data integration is a term that covers three processes which combine to move data from multiple sources into a data warehouse: data access, data federation, and change capture DATA INTEGRATION : THE EXTRACTION, TRANSFORMATION, AND LOAD (ETL) PROCESS Extraction, transformation, and load (ETL) technologies Fundamentally, a DW could not exist without ETL The ETL process consists of Extraction (reading data from one or more databases), Transformation (converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a data warehouse) Load (putting the data into the data warehouse) DATA INTEGRATION : THE EXTRACTION, TRANSFORMATION, AND LOAD (ETL) PROCESS The ETL process also contributes to the quality of the data in a DW, its purpose is to load the data warehouse with integrated and cleansed data DATA TRANSFORMATION TOOLS – TO PURCHASE OR TO BUILD? The process can either be done through – purchasing data transformation tools and setup ETL, or – purchase the tool together with the expertise to setup ETL, or – developing the tools using programming languages. •Programmers can set up ETL processes using almost any programming language, but building such processes from scratch is very complex. DATA TRANSFORMATION TOOLS – TO PURCHASE OR TO BUILD? • Increasingly, companies are buying ETL tools to help in the creation of ETL process (Wikipedia) • Examples of ETL tools – IBM InfoSphere DataStage – Microsoft SQL Server Integration Services (SSIS) – Oracle Data Integrator (ODI) DATA WAREHOUSE DEVELOPMENT A data warehousing project is a major undertaking MORE COMPLICATED AS IT COMPRISES AND INFLUENCES MANY DEPARTMENTS, INPUT OUTPUT INTERFACES AND CAN BE PART OF BUSINESS STRATEGY Data warehouse development approaches Inmon Model: EDW approach Kimball Model: Data mart approach Which model is best? There is no one-size-fits-all strategy to data warehousing It depends on the need and the capacity of the organization For many organizations, data mart approach is a convenient first step in implementing DW DATA WAREHOUSE DEVELOPMENT Kimball Model Pros: Easy to build organizationally Easy to build technologically Cons: Enterprise wide view unavailable Redundant data costs High ETL costs High DBA costs Inmon Model Pros: Business Enterprise View Design consistency Data reusability Cons: Require corporate leadership and vision EVENTUALLY IT CAN BE THIS … DATA WAREHOUSE DEVELOPMENT Describe the major similarities and differences between the Inmon and Kimball data warehouse development approaches. •Similarities: Both methods can produce an enterprise data warehouse and subset data marts. •Differences: – Inmon’s approach starts with an enterprise data warehouse, creating data marts as subsets of that EDW if appropriate. – KimbThe focus is on proven, traditional methods and technologies. all’s starts with data marts, consolidating them into an EDW later if appropriate. It focuses in creating a useful end-user capability quickly. DATA WAREHOUSE DEVELOPMENT Effort Data Mart (Kimball Model) EDW (Inmon Model) Scope One subject area Several subject areas Development Time Months Years Development Cost $10K - $100 K ++ $1000000 ++ Development Difficulty Low to medium High Sources Only some operational and Many operational and external systems external systems Size Megabytes to several gigabytes Gigabytes to petabytes Hardware Workstations and departmental servers Enterprise servers and mainframe OS Windows and Linux Unix, S/390 Number of simultaneous 10s 100s to 1000s REAL-TIME DATA WAREHOUSING Traditionally, a data warehouse are not business critical, data are commonly updated on a weekly basis – not allowing for responding to transactions in near real time Today, organizations are facing the need for realtime data warehousing, as decision support has become operational. The emergence of real-time data warehousing (RDW) or active data warehousing (ADW) The process of loading and providing data via a data warehouse as they become available REAL-TIME DATA WAREHOUSING The need for real-time data A business often cannot afford to wait a whole day for its operational data to load into the data warehouse for analysis Real-time data collection can reduce eliminate the nightly batch processes or DATA WAREHOUSE ADMINISTRATION AND SECURITY ISSUES Due to its huge size and complicated nature, DW requires strong monitoring and administrating Needs more than a DBA Needs data warehouse administrator (DWA) A person responsible for the administration and management of a data warehouse DATA WAREHOUSE ADMINISTRATION AND SECURITY ISSUES What skills should a DWA possess? Why? Familiarity with high-performance hardware, software and networking technologies, since a data warehouse is based on those. Solid business insight, to understand the purpose of the DW and its business justification, familiarity with business decision, making processes to understand how the DW for business strategic purpose will be used easy. Excellent communication skills, to communicate with the rest of the organization DATA WAREHOUSE ADMINISTRATION AND SECURITY ISSUES Effective security in a data warehouse should focus on four main areas: 1. Establishing effective corporate and security policies and procedures. An effective security policy should start at the top and be communicated to everyone in the organization. 2. Implementing logical security procedures and techniques to restrict access. This includes user authentication, access controls, and encryption. 3. Limiting physical access to the data center environment. 4. Establishing an effective internal control review process for security and privacy