Data Mining and Data Warehousing (ITE P111) PDF
Paul William V. Quiliope
Summary
This document covers concepts of data mining and data warehousing, including the fundamentals and architecture of data mining systems and the evolution of database technology.
DATA MINING AND DATA WAREHOUSING (ITE P111)
Wed 8-11, Friday 9-11
Paul William V. Quiliope

UNIT-I

INTRODUCTION
Fundamentals of Data Mining
Functionalities
Classification of Data Mining Systems
Issues in Data Mining

DATA WAREHOUSE
Data Warehouse
Multidimensional Model
Architecture
Implementation
From Data Warehouse to Data Mining

Fundamentals of Data Mining
CO: Understand data mining as part of the evolution of database technology

Motivation: [slide graphic: the word "DATA" repeated many times]

Knowledge is required in different applications:

Financial Data Analysis - Loan payment prediction and customer credit policy analysis; classification and clustering of customers for targeted marketing; detection of money laundering and other financial crimes.
Retail Industry - Collects large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services.
Telecommunication Industry - Data mining helps identify telecommunication patterns, catch fraudulent activities, make better use of resources, and improve quality of service.
Biological Data Analysis - Distributed genomic and proteomic databases; alignment, indexing, similarity search, and comparative analysis of multiple nucleotide sequences; discovery of structural patterns and analysis of genetic networks and protein pathways; association and path analysis.
Other Scientific Applications - Huge amounts of data have been collected from scientific domains such as geosciences and astronomy.
Large data sets are also being generated by fast numerical simulations in fields such as climate and ecosystem modeling, chemical engineering, and fluid dynamics.
Intrusion Detection - Association and correlation analysis; aggregation to help select and build discriminating attributes; analysis of stream data.

Evolution:
1960s: Data collection, database creation, IMS and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s: Data mining, data warehousing, multimedia databases, and Web databases
2000s: Stream data management and mining; data mining and its applications; Web technology (XML, data integration) and global information systems

What is Data Mining?
Data mining refers to extracting or mining knowledge from large amounts of data. The term is actually a misnomer; it would more appropriately be called knowledge mining, which emphasizes mining knowledge from large amounts of data. It is the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
Other names are:
Knowledge mining from data
Knowledge extraction
Data analysis
Pattern analysis
Data archaeology
Data dredging

Knowledge Discovery from Data (KDD) is used as a synonym for data mining. Data mining is also viewed as one of the steps in Knowledge Discovery from Data (KDD).

KDD: This is the view from typical database systems and data warehousing communities. Data mining plays an essential role in the knowledge discovery process.

Data Cleaning: Data cleaning is the removal of noisy and irrelevant data from the collection.
  Cleaning in the case of missing values.
  Cleaning noisy data, where noise is a random error or variance.
  Cleaning with data discrepancy detection and data transformation tools.

Data Integration: Data integration combines heterogeneous data from multiple sources into a common source (the data warehouse).
  Data integration using data migration tools.
  Data integration using data synchronization tools.
  Data integration using the ETL (Extract-Transform-Load) process.

Data Selection: Data selection is the process in which data relevant to the analysis is identified and retrieved from the data collection.
  Data selection using neural networks.
  Data selection using decision trees.
  Data selection using naive Bayes.
  Data selection using clustering, regression, etc.

Data Transformation: Data transformation is the process of transforming data into the form required by the mining procedure. It is a two-step process:
  Data mapping: Assigning elements from the source base to the destination to capture transformations.
  Code generation: Creation of the actual transformation program.
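The four preparation steps above (cleaning, integration, selection, transformation) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the records, the field names (cust_id, amount, age, city), the choice to drop records with missing values, and the use of min-max scaling as the transformation are all assumptions made for the example, not something prescribed by the slides.

```python
# Hypothetical records from two sources; all names and values are illustrative.
sales = [
    {"cust_id": 1, "amount": 120.0},
    {"cust_id": 2, "amount": None},   # record with a missing value
    {"cust_id": 3, "amount": 80.0},
]
customers = [
    {"cust_id": 1, "age": 34, "city": "Cebu"},
    {"cust_id": 2, "age": 51, "city": "Davao"},
    {"cust_id": 3, "age": 27, "city": "Manila"},
]

def clean(records, field):
    """Data cleaning: drop records whose `field` is missing (one possible policy)."""
    return [r for r in records if r.get(field) is not None]

def integrate(left, right, key):
    """Data integration: combine two sources on a shared key."""
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

def select(records, fields):
    """Data selection: keep only the attributes relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records, field):
    """Data transformation: min-max scale `field` to [0, 1] (assumed form)."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in records]

def preprocess(sales, customers):
    cleaned = clean(sales, "amount")
    merged = integrate(cleaned, customers, "cust_id")
    selected = select(merged, ["cust_id", "age", "amount"])
    return transform(selected, "amount")

prepared = preprocess(sales, customers)
```

Here the record with the missing amount is dropped during cleaning, the city attribute is discarded during selection, and the remaining amounts are rescaled so that they are comparable. A real pipeline might impute missing values instead of dropping them.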
Data Mining: Data mining is the application of clever techniques to extract potentially useful patterns.
  Transforms task-relevant data into patterns.
  Decides the purpose of the model using classification or characterization.

Pattern Evaluation: Pattern evaluation is the identification of the truly interesting patterns representing knowledge, based on given measures.
  Finds the interestingness score of each pattern.
  Uses summarization and visualization to make the data understandable to the user.

Knowledge Representation: Knowledge representation is a technique that uses visualization tools to represent data mining results.
  Generate reports.
  Generate tables.
  Generate discriminant rules, classification rules, characterization rules, etc.

Architecture of a Data Mining System:

Database, Data Warehouse, WWW, and Other Repositories - These are the various sources of data. Cleaning is applied to these sources, their data is integrated, and the required data is selected from the result.
Database or Data Warehouse Server - Stores the data that is relevant to the mining task.
Knowledge Base - This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may also be included.
Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
Data Mining Engine - This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
Pattern Evaluation Module - This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns.
User Interface - This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

Data Mining - on What Kind of Data?
Data mining is not specific to one type of media or data; it should be applicable to any kind of information repository. Algorithms and approaches may differ when applied to different types of data.
Flat Files - Flat files are actually the most common data source for data mining algorithms, especially at the research level. They are simple data files in text or binary format with a structure known by the data mining algorithm to be applied. The data in these files can be transactions, time-series data, scientific measurements, etc.
Relational Databases - A relational database consists of a set of tables containing either values of entity attributes or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples. A tuple in a relational table corresponds to either an object or a relationship between objects and is identified by a set of attribute values representing a unique key.
Example: the relations Customer, Items, Employee, and Branch represent business activity in the All Electronics store; the relationships between tables are shown using Purchase, items-sold, and Works-at.
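The relational description above can be made concrete with a small sketch using Python's built-in sqlite3 module. The table layouts, column names (cust_id, item_id, name, price), and sample rows are hypothetical and chosen only for illustration; the slides name the relations but do not give their schemas.

```python
import sqlite3

# In-memory database; the schema and rows below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE item (item_id INTEGER PRIMARY KEY, name TEXT, price REAL);
-- Relationship table: each tuple links one customer to one item.
CREATE TABLE purchase (
    cust_id INTEGER REFERENCES customer(cust_id),
    item_id INTEGER REFERENCES item(item_id),
    PRIMARY KEY (cust_id, item_id)
);
INSERT INTO customer VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO item VALUES (10, 'TV', 500.0), (11, 'Radio', 40.0);
INSERT INTO purchase VALUES (1, 10), (1, 11), (2, 11);
""")

# Follow the relationship tuples to list which customer bought which item.
rows = conn.execute("""
    SELECT c.name, i.name
    FROM purchase AS p
    JOIN customer AS c ON c.cust_id = p.cust_id
    JOIN item AS i ON i.item_id = p.item_id
    ORDER BY c.name, i.name
""").fetchall()
```

Each row of customer and item is a tuple describing an object and is identified by its key, while each row of purchase describes a relationship between two objects and is identified by the pair of keys it references.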