Full Transcript


Data Warehouse and Data Mining
UNIT-2 Data Mining: Introduction
Data Mining: Concepts and Techniques

Chapter 2. Introduction
- Why Data Mining?
- What Is Data Mining?
- A Multi-Dimensional View of Data Mining
- What Kinds of Data Can Be Mined?
- What Kinds of Patterns Can Be Mined?
- What Technologies Are Used?
- What Kinds of Applications Are Targeted?
- Major Issues in Data Mining

Why Data Mining?
- The explosive growth of data: from terabytes to petabytes
- Data collection and data availability: automated data collection tools, database systems, the Web, a computerized society
- Major sources of abundant data (the amount of data generated every day keeps increasing):
  - Business: Web, e-commerce, transactions, stocks, …
  - Science: remote sensing, bioinformatics, scientific simulation, …
  - Society and everyone: news, digital cameras, YouTube
- We are drowning in data but starving for knowledge!
- “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets

What Is Data Mining?
- Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data. In short, data mining refers to extracting or “mining” knowledge from large amounts of data.
- Data mining: a misnomer (wrong name)? The term implies simply extracting data, whereas the actual process discovers insights and patterns in data; “knowledge mining from data” would be more accurate. By analogy, calling a product a “smartphone” when it lacks typical smartphone functionality would also be a misnomer.
- Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: is everything “data mining”?
- Not everything is data mining: simple search and query processing and (deductive) expert systems are not.

Knowledge Discovery in Databases (KDD) Process
- This is a view from the typical database systems and data warehousing communities.
- Data mining plays an essential role in the knowledge discovery process.
[Figure: KDD flow, from bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection & Transformation → Task-relevant Data → Data Mining → Pattern Evaluation & Presentation]

Knowledge Discovery in Databases (KDD) Process: the steps
- Data Cleaning: noise and inconsistent data are removed.
- Data Integration: multiple data sources are combined.
- Data Selection: data relevant to the analysis task are retrieved from the database.
- Data Transformation: data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
- Data Mining: intelligent methods are applied in order to extract data patterns.
- Pattern Evaluation: the extracted data patterns are evaluated.
- Knowledge Presentation: the mined knowledge is presented to the user.
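
The seven KDD steps listed above can be read as a chain of small functions. The following is only a minimal sketch under stated assumptions: the two record lists, the field names (`cust`, `item`, `price`) and the "minimum count" evaluation rule are invented for illustration and do not come from the slides.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [
    {"cust": 1, "item": "milk", "price": 2.0},
    {"cust": 1, "item": "bread", "price": 1.5},
    {"cust": 2, "item": "milk", "price": None},  # noisy record
]
store_b = [
    {"cust": 3, "item": "milk", "price": 2.1},
    {"cust": 3, "item": "bread", "price": 1.4},
    {"cust": 4, "item": "pen", "price": 0.5},
]

def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for src in sources for r in src]

def select(records, items):
    """Data selection: keep only records relevant to the task."""
    return [r for r in records if r["item"] in items]

def transform(records):
    """Data transformation: aggregate records into one basket per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["cust"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Data mining: count in how many baskets each item appears."""
    return Counter(item for basket in baskets.values() for item in basket)

def evaluate(counts, min_count):
    """Pattern evaluation: keep items that reach a minimum count."""
    return {item: n for item, n in counts.items() if n >= min_count}

def present(patterns):
    """Knowledge presentation: render the patterns as readable lines."""
    return [f"{item}: appears in {n} baskets" for item, n in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
data = select(data, {"milk", "bread"})
baskets = transform(data)
patterns = evaluate(mine(baskets), min_count=2)
report = present(patterns)
```

Real systems replace each function with something far heavier (for example, a warehouse load for integration), but the order of the stages is the point here.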

Architecture of a typical data mining process
[Figure: architecture diagram; the callouts that survive are listed below]
- Communicates between users and the data mining system: users specify data mining queries, and patterns are selected and ranked according to their potential interest to the user.
- Consists of a set of functional modules for tasks such as outlier and correlation analysis.
- Fetches the relevant data based on the user's mining request.
- Domain knowledge is used to guide the search and to evaluate interesting patterns (e.g., concept hierarchies).

Data Mining in Business Intelligence
[Figure: pyramid; the potential to support business decisions increases from bottom to top]
- Decision Making (End User)
- Data Presentation: visualization techniques (Business Analyst)
- Data Mining: information discovery (Data Analyst)
- Data Exploration: statistical summary, querying, and reporting (Data Analyst)
- Data Preprocessing/Integration, Data Warehouses (DBA)
- Data Sources: paper, files, Web documents, scientific experiments, database systems

KDD Process: A Typical View from ML and Statistics
[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]
- Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
- Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
- Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
- This is a view from the typical machine learning and statistics communities.

Example: Medical Data Mining
- Health care and medical data mining often adopt this view from statistics and machine learning
- Preprocessing of the data (including feature extraction and dimension reduction)
- Classification and/or clustering processes
- Post-processing for presentation

Data Mining: On What Kinds of Data?
- Database-oriented data sets and applications: relational databases, data warehouses, transactional databases
- Advanced data sets and advanced applications:
  - Data streams (social media feeds) and sensor data (predictive maintenance in manufacturing)
  - Time-series data (weather forecasting), temporal data (e-commerce purchase behavior), sequence data (incl. bio-sequences)
  - Structured data (customer relationship management (CRM) systems), graphs, social networks and multi-linked data
  - Object-relational databases (product design and manufacturing, medical imaging and patient records)
  - The World-Wide Web (monitoring brand reputation in social media)

Data Mining: On What Kinds of Data? (continued)
- Heterogeneous databases and legacy databases (healthcare data integration). A healthcare organization may have legacy databases for patient records, while newer systems manage electronic health records (EHR), medical imaging, and real-time patient monitoring data. The organization also uses data from external sources such as insurance claims and public health databases.
- Spatial data (urban planning and development) and spatiotemporal data (environmental monitoring: environmental agencies use spatiotemporal data to monitor pollution levels and track changes in air quality). Spatial databases allow the representation of simple geometric objects such as points, lines and polygons; some handle more complex structures such as 3D objects, topological coverages and linear networks.
- Multimedia databases (video surveillance and security systems)
- Text databases (customer feedback analysis for product improvement)

Data Mining Function: (1) Class/Concept Description: Characterization and Discrimination
- Multidimensional concept description: data characterization (summarizing the data of the class under study) and data discrimination (comparing the target class with one or a set of comparative classes)
- E.g., data characterization: produce a description summarizing the characteristics of customers who spend $1000 a year. The output of data characterization can be presented in various forms, including pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables such as crosstabs.
- E.g., data discrimination: compare the customers who shop for computer products regularly with those who rarely shop for such products.

A data mining system should be able to compare two groups of AllElectronics customers, such as those who shop for computer products regularly (more than 4 times a month) and those who rarely shop for such products (i.e., less than three times a year). The resulting description could be a general, comparative profile of the customers: for example, 80% of the customers who frequently purchase computer products are between 20 and 40 years old and have a university education, whereas 60% of the customers who infrequently buy such products are either old or young and have no university degree. Drilling down on a dimension, such as occupation, or adding new dimensions, such as income level, may help in finding even more discriminative features between the two classes.

Data Mining Function: (2) Mining Frequent Patterns, Association and Correlation Analysis
- Frequent patterns: patterns that occur frequently in data
- There are many kinds of frequent patterns, such as itemsets, subsequences and substructures
- A frequent itemset typically refers to a set of items that frequently appear together, e.g., milk and bread
- A frequent subsequence is a pattern such as customers tending to purchase first a PC, followed by a digital camera
- A substructure can refer to different structural forms such as graphs, trees, etc.

Data Mining Function: (2) Association and Correlation Analysis
- Frequent patterns (or frequent itemsets): what items are frequently purchased together in your Walmart?
- A typical association rule: milk → bread [1%, 50%] (support, confidence)
  - Support: 1% of all of the transactions under analysis showed that milk and bread were purchased together.
  - Confidence: 50% means that if a customer buys milk, there is a 50% chance that she will buy bread as well.
- Single-dimensional association rules involve a single predicate.
- Multidimensional association rules involve more than one predicate (e.g., age, income, buys).
- Association rules are discarded if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
- Additional analysis can uncover interesting statistical correlations between associated attribute-value pairs.

Data Mining Function: (3) Classification
- Classification and label prediction
  - Construct models (functions) based on some training examples
  - Describe and distinguish classes or concepts for future prediction, e.g., classify countries based on climate, or classify cars based on gas mileage
  - Predict some unknown class labels
- Typical methods: decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
- Typical applications: credit card fraud detection, direct marketing, classifying stars, diseases, web pages, …

Data Mining Function: (4) Cluster Analysis
- Unsupervised learning (i.e., the class label is unknown)
- Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
- Principle: maximize intra-class similarity and minimize inter-class similarity
- Clustering can facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.

Data Mining Function: (5) Outlier Analysis
- Outlier: a data object that does not comply with the general behavior of the data
- Noise or exception?
  One person’s garbage could be another person’s treasure.
- Methods: a by-product of clustering or regression analysis, …
- Useful in fraud detection and rare-event analysis

Evolution Analysis
- Describes and models regularities or trends for objects whose behavior changes over time, e.g., stock market analysis

Are All the Patterns Interesting?
- Typically not: only a small fraction of the patterns potentially generated would actually be of interest to any given user.
- A pattern is interesting if it is:
  - easily understood by humans
  - valid on new or test data with some degree of certainty
  - potentially useful
  - novel
- An interesting pattern represents knowledge.
- Can a data mining system generate all of the interesting patterns?
- Can a data mining system generate only interesting patterns?

Evaluation of Knowledge
- Is all mined knowledge interesting?
  - One can mine a tremendous amount of “patterns” and knowledge
  - Some may fit only a certain dimension space (time, location, …)
  - Some may not be representative, may be transient, …
- Evaluation of mined knowledge → directly mine only interesting knowledge?
  - Descriptive vs. predictive
  - Coverage
  - Typicality vs. novelty
  - Accuracy
  - Timeliness
  - …

Classification of Data Mining Systems
- Data mining is a confluence of multiple disciplines, so data mining systems can be categorized according to various criteria.
- According to the kinds of databases mined: relational, transactional, object-relational, data warehouse, …
- According to the kinds of knowledge mined: association and correlation analysis, classification, prediction, clustering, other analysis
- According to the kinds of techniques utilized: autonomous systems, query-driven systems
- According to the kinds of applications adapted: finance, telecommunications, DNA, stock markets, e-mail and so on

Data Mining Task Primitives
- The set of task-relevant data to be mined specifies the set of data in which the user is interested, for example:
  - Database or data warehouse name
  - Database tables or data warehouse cubes
  - Conditions for data selection
  - Relevant attributes or dimensions
  - Data grouping criteria
- Kind of knowledge to be mined:
  - Characterization
  - Discrimination
  - Association/correlation
  - Classification/prediction
  - Clustering

Data Mining Task Primitives (continued)
- Background knowledge:
  - Concept hierarchies
  - User beliefs about relationships in the data
- Pattern interestingness measures:
  - Simplicity
  - Certainty (confidence)
  - Utility (support)
  - Novelty
- Visualization of discovered patterns:
  - Rules, tables, reports, charts, graphs, decision trees and cubes
  - Drill down and roll up

Example
You are especially interested in those customers whose salary is no less than 40,000 and who have bought more than 1,000 worth of items, each of which is priced at no less than 100; age is also to be considered.
1. Use database AllElectronics_db
2. Use hierarchy location_hierarchy for T.branch, age_hierarchy for C.age
3. Mine classification as promising customers
4. In relevance to C.age, C.income, I.type, I.place_made, T.branch
5. From customer C, item I, transaction T
6. Where I.item_ID = T.item_ID and C.cust_ID = T.cust_ID and C.income >= 40000 and I.price >= 100
7. Group by T.cust_ID
8. Having sum(I.price) >= 1,000
9. Display as rules

In the previous example, the task-relevant data are specified by lines 1 to 8:
- Line 1 specifies the AllElectronics database
- Line 2 specifies the concept hierarchies to use, such as location_hierarchy
- Line 3 specifies the kind of knowledge to be mined
- Line 4 lists the relevant attributes
- Lines 6 to 8 give the selection conditions on the relations named in line 5
- Line 9 specifies that the mining results are to be displayed as a set of rules

Integration of a Data Mining System with a Database or Data Warehouse System
The critical question is how to integrate a data mining (DM) system with a database (DB) or data warehouse (DW) system. The integration schemes are as follows.

No coupling: the DM system does not utilize any function of a DB or DW system.
- Drawbacks:
  - The DM system spends a substantial amount of time finding, collecting, cleaning and transforming data.
  - DB and DW systems offer many tested, scalable algorithms and data structures, which go unused.
  - It is difficult to integrate such a system into an information processing environment.
- Thus no coupling represents a poor design.

Loose coupling: the DM system uses some facilities of a DB or DW system.
- It fetches data from a repository managed by these systems, performs data mining, and then stores the mining results either in a file or in a designated place in a database or data warehouse.
- Better than no coupling.
- Such systems are mainly memory-based. Because mining does not exploit the data structures and query optimization methods provided by DB or DW systems, it is difficult for loose coupling to achieve high scalability and good performance with large data sets.

Semitight coupling: besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives, such as sum, count, max, min, standard deviation, aggregation, histogram analysis and multiway join, can be provided in the DB/DW system.
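
To make the semitight idea concrete, the sketch below lets the database compute the count, sum and max primitives, so the mining code receives one aggregated row per customer instead of every raw transaction. It reuses the selection conditions from the AllElectronics example above. Everything else is an assumption made for illustration: SQLite through Python's standard `sqlite3` module stands in for the DB/DW system, the table `purchase` is a hypothetical stand-in for the Transaction relation (`transaction` is a reserved word in SQL), and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_ID INTEGER PRIMARY KEY, age INTEGER, income REAL);
    CREATE TABLE item     (item_ID INTEGER PRIMARY KEY, type TEXT, price REAL);
    CREATE TABLE purchase (cust_ID INTEGER, item_ID INTEGER);  -- hypothetical Transaction table
""")

# Invented toy rows.
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(1, 35, 55000), (2, 28, 30000), (3, 47, 80000)])
conn.executemany("INSERT INTO item VALUES (?, ?, ?)",
                 [(10, "laptop", 900), (11, "camera", 400), (12, "cable", 15)])
conn.executemany("INSERT INTO purchase VALUES (?, ?)",
                 [(1, 10), (1, 11), (1, 12), (2, 10), (2, 11), (3, 11), (3, 12)])

# The database evaluates the join, the filters and the aggregates
# (count, sum, max); only the summarized rows reach the mining code.
query = """
    SELECT T.cust_ID,
           COUNT(*)     AS n_items,
           SUM(I.price) AS total,
           MAX(I.price) AS top_price
    FROM customer C
    JOIN purchase T ON C.cust_ID = T.cust_ID
    JOIN item I     ON I.item_ID = T.item_ID
    WHERE C.income >= 40000 AND I.price >= 100
    GROUP BY T.cust_ID
    HAVING SUM(I.price) >= 1000
    ORDER BY T.cust_ID
"""
rows = conn.execute(query).fetchall()
conn.close()
```

Under loose coupling the same program would instead fetch every row of the three tables and perform the join and the sums itself in memory, which is the scalability problem described above.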

Tight coupling: the DM system is smoothly integrated into the DB/DW system. This is a highly desirable approach, as it gives high system performance.

Data Mining Issues

Major Issues in Data Mining (1): Mining Methodology and User Interaction Issues
- Mining different kinds of knowledge in databases: different users may be interested in different kinds of knowledge. Data mining therefore needs to cover a broad range of knowledge discovery tasks.
- Interactive mining of knowledge at multiple levels of abstraction: the data mining process needs to be interactive so that users can focus the search for patterns, providing and refining data mining requests based on the returned results.
- Incorporation of background knowledge: background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but also at multiple levels of abstraction.

Major Issues in Data Mining (1): Mining Methodology and User Interaction Issues (continued)
- Data mining query languages and ad hoc data mining: a data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
- Presentation and visualization of data mining results: once patterns are discovered, they need to be expressed in high-level languages and visual representations that are easily understandable.
- Handling noisy or incomplete data: data cleaning methods are required to handle noise and incomplete objects while mining data regularities. Without them, the accuracy of the discovered patterns will be poor.
- Pattern evaluation: discovered patterns must be evaluated for interestingness, because many of them merely represent common knowledge or lack novelty.
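
Pattern evaluation is often made concrete with the support and confidence measures introduced earlier for association rules. The following is a minimal sketch, not a full rule miner: the five transactions and the threshold values are invented, and the function names are hypothetical.

```python
# Invented toy transactions; each one is a set of purchased items.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never applies, so report zero confidence
    return support(antecedent | consequent, transactions) / base

def is_interesting(antecedent, consequent, transactions,
                   min_support, min_confidence):
    """Keep a rule only if it meets both minimum thresholds."""
    s = support(antecedent | consequent, transactions)
    c = confidence(antecedent, consequent, transactions)
    return s >= min_support and c >= min_confidence

# Rule milk -> bread on the toy data.
s = support({"milk", "bread"}, transactions)       # 2 of 5 transactions
c = confidence({"milk"}, {"bread"}, transactions)  # 2 of the 4 milk transactions
kept = is_interesting({"milk"}, {"bread"}, transactions,
                      min_support=0.3, min_confidence=0.5)
dropped = is_interesting({"milk"}, {"bread"}, transactions,
                         min_support=0.3, min_confidence=0.6)
```

Support and confidence alone do not capture novelty or usefulness, so in practice they act as a first filter rather than a complete interestingness test.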

Major Issues in Data Mining (2): Performance Issues
- Efficiency and scalability of data mining algorithms: to extract information effectively from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
- Parallel, distributed, and incremental mining algorithms: factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. Incremental algorithms incorporate database updates without mining the entire data again from scratch.

Major Issues in Data Mining (3): Diverse Data Types Issues
- Handling of relational and complex types of data: a database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not realistic to expect one system to mine all these kinds of data.
- Mining information from heterogeneous databases and global information systems: data are available from different sources on a LAN or WAN. These sources may be structured, semi-structured or unstructured, so mining knowledge from them adds challenges to data mining.

Data Mining and Society
- Social impacts of data mining
- Privacy-preserving data mining: data mining helps business management, economic recovery and security, but it also creates the risk of disclosing an individual's personal information.
- Invisible data mining: we cannot expect everyone in society to learn and master data mining techniques, so mining increasingly happens behind the scenes, for example when purchasing an item online.
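
The partition, process and merge pattern described under the performance issues above, together with incremental updating, can be sketched as follows. This is only an illustration under stated assumptions: the baskets are invented, the task is reduced to item counting, and a thread pool is used merely to show the structure (a real system would distribute the partitions across processes or machines).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(baskets):
    """Mine one partition: count item occurrences in its baskets."""
    counts = Counter()
    for basket in baskets:
        counts.update(basket)
    return counts

def split(data, n_parts):
    """Divide the data into n_parts roughly equal partitions."""
    return [data[i::n_parts] for i in range(n_parts)]

def parallel_count(data, n_parts=2):
    """Process each partition separately, then merge the partial results."""
    parts = split(data, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_items, parts))
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged

# Invented toy data.
baskets = [["milk", "bread"], ["milk"], ["bread", "eggs"], ["milk", "eggs"]]
counts = parallel_count(baskets)

# Incremental step: new baskets arrive later; only they are mined,
# and their counts are added to the existing result.
new_baskets = [["milk", "bread"]]
updated = counts + count_items(new_baskets)
```

Counting merges trivially because partial counts simply add up; many mining tasks need a more careful merge step, which is part of what makes parallel and incremental mining an open issue.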

Applications of Data Mining
- Web page analysis: from web page classification and clustering to the PageRank and HITS algorithms
- Collaborative analysis and recommender systems
- Basket data analysis to targeted marketing
- Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
- Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
- From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
