Data Warehousing & Data Mining Unit-1 Notes.pdf
Document Details
Uploaded by Deleted User
Full Transcript
DATA WAREHOUSING AND DATA MINING UNIT-1 DATA WAREHOUSING Overview: Data: Data is information that is collected as a source of reference or analysis. It can be in the form of facts, numbers, words, measurements, o...
DATA WAREHOUSING AND DATA MINING UNIT-1 DATA WAREHOUSING Overview: Data: Data is information that is collected as a source of reference or analysis. It can be in the form of facts, numbers, words, measurements, observations, or descriptions of things. Data can be organized in the form of graphs, charts, or tables. Information: Information is the meaning that's derived from processed data. It can be defined as any communication or representation of knowledge, such as facts, data, or opinions, in any medium or form. Data Warehousing Definition: The term "Data Warehouse" was first coined by Bill Inmon in 1990. Bill Inmon's definition of a data warehouse is that it is a “subject-oriented, nonvolatile, integrated, time-variant collection of data in support of management's decisions.” The model then creates a thorough, logical model for every primary entity. A data warehouse is a system that stores and organizes data from multiple sources to support business intelligence and analysis. It can store both current and historical data, and is designed to provide a long-term view of data over time. Data Warehouse Components 1. Overall Architecture The data warehousing architecture is based on a relational database management system server that functions as the central repository for informational data. Typically the source data for the warehouse is coming from the operational applications. As the data enters the data warehouse, it is transformed into an integrated structure and format. 2. Data Warehouse Database The central data warehouse database is a cornerstone of the data warehousing environment. This data base is almost implemented on relational data base management system. Some of the approaches needed for the data base are Parallel relational database designs that require a parallel computing platform. An innovative approach to speed up a traditional RDBMS, MDDBs that are based on proprietary data base technology or implemented using RDBMS. 3. Sourcing, Acquisition, Cleanup and transformation Tools These tools perform all of the conversion, summarization, key changes, structural changes and condensation needed to transform disparate data into information that can be used by decision support tool. The functionality includes: Removing unwanted data from operational databases Converting to common data names and definitions Calculating summaries and derived data Establishing defaults for missing data Accommodating sources data definitions changes These tools have to deal with some significant issues as follows: Data heterogeneity: This is the difference in the way how the data is defined and used in different models. Database heterogeneity: DBMS are very different in data models data access language data languages, data navigation and so on 4. Meta data Meta data is data about data that describes the data warehouse. Meta data can be classified into 1. Technical metadata: This contains information about data warehouse data for use by warehouse designers and administrators when carrying out warehouse development and management tasks, Technical metadata documents include Information about data sources Transformation descriptions Warehouse object and data structure definitions for data targets. The rules used to perform data clean up and data enhancement. 2. Business Meta data: Contains information that gives users an easy-to-understand perspective of the information stored in the data warehouse.Business metadata documents information about: Subject areas and information object type, including queries, reports, images, video, and/or audio clips. Internet home pages. 5. Access Tools Users interact with the data warehousing using front-end tools. Tools are divided into five groups Data query and reporting tools Application development tools Executive information system tools Online analytical processing tools Data mining tools Query and Reporting Tools This category is further divided into two groups: Reporting and Managed query tools Reporting tools can be divided into production reporting tools and desktop report writers Production reporting tools will let companies generate regular operational reports or support high-volume batch jobs, such as calculating and printing paychecks. Report writers, on the other hand, are inexpensive desktop tools designed for end users. Managed query tools shield end users from the complexities of SQL and database structures by inserting a metalayer between users and the database. Metalayer It is the software that provides subject –oriented views of a database and supports point-and click creation of SQL. Application development tools: The tools require such a complex set of queries and sophisticated data models that the business users may find themselves overwhelmed by the need to become SQL and/or data modeling reports. Executive information systems tools: Executive information systems (EIS) are tools that help top managers access summarized information to spot trends, problems, and opportunities. They are often used in conjunction with data warehouses to separate operational data from informational data. Online analytical processing tools (OLAP): These tools are based on the concepts of multi-dimensional data bases and allow a sophisticated user to analyze the data using elaborate, multi dimensional complex views. Data Mining A critical success factor for any business today is its ability to use information effectively. Data mining as the process of discovering meaningful new correlations, patterns, and trends by digging into large amounts of data stored in warehouses, using AI and statistical and mathematical techniques. In these areas, data mining can reach beyond the capabilities of the OLAP, especially since the major attraction of the data mining is its ability to build predictive rather than retrospective models. Most organizations engage in data mining to Discover Knowledge Visualize data Correct data Data Visualization Data visualization is not a separate class of tools; rather than it is a method of presenting the output of the previously mentioned tools in such a way that the entire problem or/and the solution. Data visualization goes far beyond simple bar and pie charts. It is a collection of complex techniques that currently represents an area of intense research. 6. Data Marts Data marts are presented as an inexpensive alternative to a data warehouse that takes significantly less time and money to build. It is a subsidiary to a data warehouse of integrated data. It is created for a dedicated set of users. A data mart is a set of de normalized, summarized or aggregated data. Data mart is separate data base server, often on local area network serving a dedicated group of users. Two types of data marts. 1. Dependent data mart: The data content is sourced from a data ware house, have a high value because no matter how many are deployed and no matter how many different technology are use. 2. Independent data mart: Unfortunately the misleading statements about the simplicity and low cost of data marts sometimes result in organizations or vendors incorrectly positioning them as an alternative to the data warehouse.This view point defines independent data marts. The concept of independent data mart is dangerous one. Each enterprise will start to design their own data marts without integration. The complex many-to-one problem will be diverted to many-to many sourcing and management nightmare. Scalability of data mart is complex. The business drivers underlying such developments include: Extremely urgent user requirement The absence of a budget for a full data warehouse strategy The absence of a sponsor for an enterprise wide decision support strategy The decentralization of business units The attraction of easy-to-use tools and a mind-sized project. The recommended approach by Ralph Kimball is as follows. For any two data marts in an enterprise, the common dimension must conform to the equality and roll-up rule. In summary data mart presents two problems those are scalability and integration. 7. Data warehouse Administration and Management Managing data warehouse includes: Security and priority management Monitoring updates from multiple sources Data quality checks Managing and updating metadata Auditing and reporting data warehouse usage and status Purging data Replicating, sub setting and distributing data Backup and recovery Data warehouse storage management 8. Information Delivery System The information delivery component is used to enable the process of subscribing for data warehouse information and having it delivered to one or more destinations of choice according to some user-specified scheduling algorithm. Information delivery system distributes ware house stored data and other information objects to other data warehouses and end user products. Building a Data warehouse Business Considerations: Return on Investment 1. Approach The subject oriented nature of the data warehouse determines the scope of the information in the data warehouse. Organizations embarking on data warehousing development can chose on of the two approaches Top-down approach: Meaning that the organization has developed an enterprise data model, collected enterprise wide business requirement, and decided to build an enterprise data warehouse with subset data marts. Bottom-up approach: Implying that the business priorities resulted in developing individual data marts, which are then integrated into the enterprise data warehouse. 2. Organizational Issues The requirements and environments associated with the informational applications of a data warehouse are different. Therefore an organization will need to employ different development practices than the ones it uses for operational applications. 3. Design Consideration In general, a data warehouse’s design point is to consolidate data from multiple, often heterogeneous, sources into a query data base. The main factors include: Heterogeneity of data sources, which affects data conversion, quality, time-liness Use of historical data, which implies that data may be” old” Tendency of database to grow very large Data Content: Typically a data warehouse may contain detailed data, but the data is cleaned up and transformed to fit the warehouse model, and certain transactional attributes of the data are filtered out. The content and the structure of the data warehouses are reflected in its data model. The data model is a template for how information will be organized with in the integrated data warehouse framework. Meta data: Defines the contents and location of the data in the warehouse, relationship between the operational databases and the data warehouse, and the business view of the warehouse data that are accessible by end-user tools. the warehouse design should prevent any direct access to the warehouse data if it does not use meta data definitions to gain the access. Data distribution: As the data volumes continue to grow, the data base size may rapidly outgrow a single server. Therefore, it becomes necessary to know how the data should be divided across multiple servers. The data placement and distribution design should consider several options including data distribution by subject area, location, or time. Tools: Data warehouse designers have to be careful not to sacrifice the overall design to fit to a specific tool. Selected tools must be compatible with the given data warehousing environment each other. Performance consideration: Rapid query processing is a highly desired feature that should be designed into the data warehouse. Nine decisions in the design of a data warehouse: 1. Choosing the subject matter 2. Deciding what a fact table represents 3. Identifying and conforming the decisions 4. choosing the facts 5. Storing pre calculations in the fact table 6. Rounding out the dimension table 7. Choosing the duration of the data base 8. The need to track slowly changing dimensions 9. Deciding the query priorities and the query modes 4. Technical Considerations A number of technical issues are to be considered when designing and implementing a data warehouse environment. These issues include: The hardware platform that would house the data warehouse. The data base management system that supports the warehouse data base. The communication infrastructure that connects the warehouse, data marts, operational systems, and end users. The hardware platform and software to support the meta data repository. The systems management framework that enables the centralized management and administration of the entire environment. 5. Implementation Considerations A data warehouse cannot be simply bought and installed-its implementation requires the integration of many products within a data ware house. Access tools Data Extraction, clean up, Transformation, and migration Data placement strategies Meta data User sophistication levels: Casual users, Power users, Experts Warehouse Database: A data warehouse is a digital storage system that stores and analyzes large amounts of data from multiple sources. It's a key component of business intelligence and is used to support decision making. Here are some characteristics of a data warehouse: Stores structured and unstructured data A data warehouse can store both structured data, like information about products, people, and transactions, and unstructured data, like documents, images, and social media posts. Supports analytics A data warehouse is designed to support business intelligence activities, especially analytics. Provides a long-range view A data warehouse can store both current and historical data, giving a long-range view of data over time. Creates consistency A data warehouse integrates data from different sources to create consistency among the data types. Is nonvolatile Once data is in a data warehouse, it's stable and doesn't change. Organizes data in a tabular format A data warehouse organizes data in a tabular format so that SQL can be used to query the data. Mapping the Data Warehouse to a Multiprocessor Architecture: Relational Data base Technology for data warehouse The size of a data warehouse rapidly approaches the point where the search of a data warehouse rapidly approaches the point where the search for better performance and scalability becomes a real necessity. The search is pursuing two goals: Speed Up: The ability to execute the same request on the same amount of data in less time. Scale-Up: The ability to obtain the same performance on the same request as the data base size increases. 1. Types of Parallelism Parallel execution of tasks within the SQL statements can be done in either of two ways: Horizontal parallelism: Which means that the data base is partitioned across multiple disksand the parallel processing occurs in the specific tasks, that is performed concurrently on different processors against different sets of data Vertical Parallelism: which occurs among different tasks-all components query operations are executed in parallel in a pipelined fashion. In other words an output from one task becomes an input into another task as soon as records become available. 2. Data Partitioning Data partitioning is a key requirement for effective parallel execution of data base operations. It spreads data from data base tables across multiple disks so that I/O operations such as read and write can be performed in parallel. Random partitioning includes random data striping across multiple disks on single servers. In round robin partitioning, each new record id placed on the new disk assigned to the data base. Intelligent partitioning assumes that DBMS knows where a specific record id located and does not waste time searching for it across all disks. This partitioning allows a DBMS to fully exploit parallel architectures and also enables higher availability. Intelligent partitioning includes: Hash Partitioning: Hash algorithm is used to calculate the partition number Key range partitioning: Partitions are based on the partition key Schema partitioning: Each table is placed in each disk, Useful for small references User-defined partitioning: Tables are partitioned based on user defined expressions. 3. Database Architecture for parallel Processing i. Shared-Memory Architecture Also called as shared-everything style.Traditional approach to implement an RDBMS on SMP hardware. Simple to implement. The key point of this approach is that a single RDBMS server can potentially utilize all processors, access all memory, and access the entire database, thus providing the user with a consistent single system image. ii. Shared-disk Architecture It implements the concept of shared ownership of the entire data base between RDBMS servers, each of which is running on a node of distributed memory system. Each RDBMS server can read, write, update and delete records from the same shared data base, which would require the system to implement a form of distributed lock manager (DLM). Pining: In worst case scenario, if all nodes are reading and updating same data, the RDBMS and its DLM will have to spend a lot of resources synchronizing multiple buffer pool. This problem is called as pining. Data skew: Uneven distribution of data. Shared-disk architectures can reduce performance bottle-necks resulting from data skew. iii. Shared-Nothing Architecture The data is partitioned across many disks, and DBMS is “partitioned” across multiple conservers, each of which resides on individual nodes of the parallel system and has an ownership of its own disk and thus, its own data base partition. It offers non-linear scalability. These requirements include: Support for function shipping Parallel join strategies Support for data repartitioning Query compilation Support for data base transactions Support for the single system image of the data base environment iv. Combined Architecture Interserver parallelism of the distributed memory architecture means that each query is parallelized across multiple servers. While intraserver parallelism of the shared memory architecture means that a query is parallelized within the server. 4. Parallel RDBMS Feature Some of the demands from the DBMS vendors are: Scope and techniques of parallel DBMS operations Optimized implementation Application transparency The parallel environment DBMS Management tools Price/Performance 5. Alternative Technologies In addition to parallel data base technology, a number of vendors are working on other solutions improving performance in data warehousing environments. These includes: Advanced database indexing products Specialized RDBMSs designed specifically for the data warehousing Multidimensional data bases 6. Parallel DBMS Vendors 1. Oracle 2. Informix 3. IBM 4. Sybase 5. Microsoft Difference between Database System and Data Warehouse: Database System Data Warehouse It supports operational processes. It supports analysis and performance reporting. Capture and maintain the data. Explore the data. Current data. Multiple years of history. Data is balanced within the scope of this one Data must be integrated and balanced from multiple system. system. Data is updated when transaction occurs. Data is updated on scheduled processes. Data verification occurs when entry is done. Data verification occurs after the fact. 100 MB to GB. 100 GB to TB. ER based. Star/Snowflake. Application oriented. Subject oriented. Database System Data Warehouse Primitive and highly detailed. Summarized and consolidated. Flat relational. Multidimensional. Multi-Dimensional Data Model: A multidimensional model views data in the form of a data-cube. A data cube enables data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. The dimensions are the perspectives or entities concerning which an organization keeps records. For example, a shop may create a sales data warehouse to keep records of the store's sales for the dimension time, item, and location. These dimensions allow the store to keep track of things, for example, monthly sales of items and the locations at which the items were sold. Each dimension has a table related to it, called a dimensional table, which describes the dimension further. For example, a dimensional table for an item may contain the attributes item_name, brand, and type. A multidimensional data model is organized around a central theme, for example, sales. This theme is represented by a fact table. Facts are numerical measures. The fact table contains the names of the facts or measures of the related dimensional tables. Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In this 2D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and the item dimension (classified according to the types of an item sold). The fact or measure displayed in rupee_sold (in thousands). Now, if we want to view the sales data with a third dimension, For example, suppose the data according to time and item, as well as the location is considered for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in the table. The 3D data of the table are represented as a series of 2D tables. Conceptually, it may also be represented by the same data in the form of a 3D data cube, as shown in fig: Data Cubes: Data cubes are a type of OLAP (online analytical processing) cubes that store data in a multi-dimensional structure. The data is stored in the form of dimensions and measures, organized into cells. Dimensional modelling is an important practice for business intelligence and data warehousing professionals to use when creating a data warehouse. Data cubes can be created using a programming language like SQL (Structured Query Language). However, this is only sometimes practical or possible due to the complexity of the data warehouse and its size. For example, the amount of time required to programmatically create a dimensional model with 500 dimensions and 30 measures would be prohibitively long. Most business intelligence professionals use tools like Microsoft Excel or PowerPivot for Microsoft Office 365 to build their data cubes manually or semi-automatically. This allows them to quickly create unique views of their data. They do not have to write complex code or execute complex queries that could take hours or days to run on large datasets. Operations on Data Cube: Data cubes are a valuable data structure that can store and compute aggregated data. They are used for large- scale analytics and for answering ad hoc queries. Data cubes consist of a set of measures defined on one or more dimensions. Five basic operations can be performed on data cubes: Roll-up Roll-up is a form of aggregation that combines data from multiple dimensions into a single row. For example, you can roll-up sales data by region, country, and city to show total sales for each region. You can also roll-up product category data by year, month, and day to show total sales for each product category. When you roll up data in a cube, the source dimension values must be preserved for the new fact table value. For example, if you roll up sales data by country and city, both country and city values should be preserved in the new fact table value. Drill-down Drill-down is the process of moving from a high-level view of the cube to a lower-level view. This is done by using dimensions as filters in the slicer pane. Drill-down can be used to show more detailed information about a particular record, or it can be used to navigate across multiple records. Slice A slice is a subset of rows and columns. In an OLAP cube, each measure is a column, and each dimension is a row. A slice returns only those measures that intersect with the selected dimensions. For example, if you select Year, Quarter, and State as dimensions, you would get three slices: one for all years, one for Q1-Q3 only, and one for all states. Dice A dice operation is a special type of discrete cube operation that creates a new cube by combining two cells from two existing cubes using either an inner join or an outer join. In other words, it performs a set-based calculation on multiple cuboids and treats each row from one cuboid as the key and each row from another cuboid as its value. Pivot Pivot is a data summarization operation that takes a multi-dimensional dataset and reduces it into a table with rows and columns. The result is called a pivot table or just a pivot. A pivot table is useful for analyzing large amounts of data, especially when you want to see the same information in different ways. Star Schema: A star schema is a type of data modeling technique used in data warehousing to represent data in a structured and intuitive way. In a star schema, data is organized into a central fact table that contains the measures of interest, surrounded by dimension tables that describe the attributes of the measures. In a star schema, each dimension table is joined to the fact table through a foreign key relationship. This allows users to query the data in the fact table using attributes from the dimension tables. For example, a user might want to see sales revenue by product category, or by region and time period. The star schema is a popular data modeling technique in data warehousing because it is easy to understand and query. The simple structure of the star schema allows for fast query response times and efficient use of database resources. Additionally, the star schema can be easily extended by adding new dimension tables or measures to the fact table, making it a scalable and flexible solution for data warehousing. Star schema is the fundamental schema among the data mart schema and it is simplest. This schema is widely used to develop or build a data warehouse and dimensional data marts. It includes one or more fact tables indexing any number of dimensional tables. The star schema is a necessary cause of the snowflake schema. It is also efficient for handling basic queries. It is said to be star as its physical model resembles to the star shape having a fact table at its center and the dimension tables at its peripheral representing the star’s points. Below is an example to demonstrate the Star Schema: In the above demonstration, SALES is a fact table having attributes i.e. (Product ID, Order ID, Customer ID, Employer ID, Total, Quantity, Discount) which references to the dimension tables. Employee dimension table contains the attributes: Emp ID, Emp Name, Title, Department and Region. Product dimension table contains the attributes: Product ID, Product Name, Product Category, Unit Price. Customer dimension table contains the attributes: Customer ID, Customer Name, Address, City, Zip. Time dimension table contains the attributes: Order ID, Order Date, Year, Quarter, Month. Model of Star Schema : In Star Schema, Business process data, that holds the quantitative data about a business is distributed in fact tables, and dimensions which are descriptive characteristics related to fact data. Sales price, sale quantity, distant, speed, weight, and weight measurements are few examples of fact data in star schema. Often, A Star Schema having multiple dimensions is termed as Centipede Schema. It is easy to ha ndle a star schema which have dimensions of few attributes. Advantages of Star Schema: 1. Simpler Queries: Join logic of star schema is quite cinch in comparison to other join logic which are needed to fetch data from a transactional schema that is highly normalized. 2. Simplified Business Reporting Logic: In comparison to a transactional schema that is highly normalized, the star schema makes simpler common business reporting logic, such as of reporting and period-over-period. 3. Feeding Cubes: Star schema is widely used by all OLAP systems to design OLAP cubes efficiently. In fact, major OLAP systems deliver a ROLAP mode of operation which can use a star schema as a source without designing a cube structure. Disadvantages of Star Schema: 1. Data integrity is not enforced well since in a highly de-normalized schema state. 2. Not flexible in terms if analytical needs as a normalized data model. 3. Star schemas don’t reinforce many-to-many relationships within business entities – at least not frequently. Snowflake Schema: The snowflake schema is a variant of the star schema. Here, the centralized fact table is connected to multiple dimensions. In the snowflake schema, dimensions are present in a normalized form in multiple related tables. The snowflake structure materialized when the dimensions of a star schema are detailed and highly structured, having several levels of relationship, and the child tables have multiple parent tables. The snowflake effect affects only the dimension tables and does not affect the fact tables. A snowflake schema is a type of data modeling technique used in data warehousing to represent data in a structured way that is optimized for querying large amounts of data efficiently. In a snowflake schema, the dimension tables are normalized into multiple related tables, creating a hierarchical or “snowflake” structure. In a snowflake schema, the fact table is still located at the center of the schema, surrounded by the dimension tables. However, each dimension table is further broken down into multiple related tables, creating a hierarchical structure that resembles a snowflake. For Example, in a sales data warehouse, the product dimension table might be normalized into multiple related tables, such as product category, product subcategory, and product details. Each of these tables would be related to the product dimension table through a foreign key relationship. Example: The Employee dimension table now contains the attributes: EmployeeID, EmployeeName, DepartmentID, Region, and Territory. The DepartmentID attribute links with the Employee table with the Department dimension table. The Department dimension is used to provide detail about each department, such as the Name and Location of the department. The Customer dimension table now contains the attributes: CustomerID, CustomerName, Address, and CityID. The CityID attributes link the Customer dimension table with the City dimension table. The City dimension table has details about each city such as city name, Zipcode, State, and Country. Advantages of Snowflake Schema It provides structured data which reduces the problem of data integrity. It uses small disk space because data are highly structured. Disadvantages of Snowflake Schema Snowflaking reduces space consumed by dimension tables but compared with the entire data warehouse the saving is usually insignificant. Avoid snowflaking or normalization of a dimension table, unless required and appropriate. Do not snowflake hierarchies of dimension table into separate tables. Hierarchies should belong to the dimension table only and should never be snowflakes. Multiple hierarchies that can belong to the same dimension have been designed at the lowest possible detail. Fact Constellations: Fact Constellation is a schema for representing multidimensional model. It is a collection of multiple fact tables having some common dimension tables. It can be viewed as a collection of several star schemas and hence, also known as Galaxy schema. It is one of the widely used schema for Data warehouse designing and it is much more complex than star and snowflake schema. For complex systems, we require fact constellations. Figure – General structure of Fact Constellation Here, the pink coloured Dimension tables are the common ones among both the star schemas. Green coloured fact tables are the fact tables of their respective star schemas. Example: In above demonstration: Placement is a fact table having attributes: (Stud_roll, Company_id, TPO_id) with facts: (Number of students eligible, Number of students placed). Workshop is a fact table having attributes: (Stud_roll, Institute_id, TPO_id) with facts: (Number of students selected, Number of students attended the workshop). Company is a dimension table having attributes: (Company_id, Name, Offer_package). Student is a dimension table having attributes: (Student_roll, Name, CGPA). TPO is a dimension table having attributes: (TPO_id, Name, Age). Training Institute is a dimension table having attributes: (Institute_id, Name, Full_course_fee). So, there are two fact tables namely, Placement and Workshop which are part of two different star schemas having dimension tables – Company, Student and TPO in Star schema with fact table Placement and dimension tables – Training Institute, Student and TPO in Star schema with fact table Workshop. Both the star schema have two dimension tables common and hence, forming a fact constellation or galaxy schema. Advantage: Provides a flexible schema. Disadvantage: It is much more complex and hence, hard to implement and maintain.