Week 1 Introduction (chapter 1).pdf

Full Transcript

Data Mining: Concepts and Techniques (3rd ed.)
Chapter 1 (Lecture 1)
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.

Chapter 1. Introduction
◼ Why Data Mining?
◼ What Is Data Mining?
◼ Knowledge discovery process
◼ What Kind of Data Can Be Mined?
◼ What Kinds of Patterns Can Be Mined?
◼ What Technologies Are Used?
◼ What Kinds of Applications Are Targeted?
◼ Summary

Why Data Mining?
◼ The explosive growth of data: from terabytes to petabytes
  ◼ Data collection and data availability
    ◼ Automated data collection tools, database systems, Web, computerized society
  ◼ Major sources of abundant data
    ◼ Business: Web, e-commerce, transactions, stocks, …
    ◼ Science: remote sensing, bioinformatics, scientific simulation, …
    ◼ Society and everyone: news, digital cameras, YouTube

What Is Data Mining?
Simply stated, data mining refers to extracting or “mining” knowledge from large amounts of data, usually gathered automatically.
◼ Data mining (knowledge discovery from data)
  ◼ Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
◼ Alternative names
  ◼ Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
◼ Watch out: Is everything “data mining”? The following are not:
  ◼ Simple search and query processing
  ◼ (Deductive) expert systems

Why Data Mining?
We are drowning in data but starving for knowledge!
December 7, 2022, Data Mining: Concepts and Techniques

Knowledge Discovery (KDD) Process
◼ This is a view from typical database systems and data warehousing communities.
◼ Data mining plays an essential role in the knowledge discovery process.
◼ 1. Data cleaning: to remove noise and inconsistent data.
◼ 2. Data integration: where multiple data sources may be combined.
◼ 3. Data selection: where data relevant to the analysis task are retrieved from the database.
◼ 4. Data transformation: where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations.
◼ 5. Data mining: an essential process where intelligent methods are applied to extract data patterns.
◼ 6. Pattern evaluation: to identify the truly interesting patterns representing knowledge, based on interestingness measures.
◼ 7. Knowledge presentation: where visualization and knowledge representation techniques are used to present mined knowledge to users.

Data Mining in Business Intelligence
Pyramid figure, listed from top to bottom; the potential to support business decisions increases toward the top.
◼ Decision Making (End User)
◼ Data Presentation: Visualization Techniques (Business Analyst)
◼ Data Mining: Information Discovery (Data Analyst)
◼ Data Exploration: Statistical Summary, Querying, and Reporting
◼ Data Preprocessing/Integration, Data Warehouses (DBA)
◼ Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

KDD Process: A Typical View from ML and Statistics
Stages of the figure, in order: Input Data, Data Pre-Processing, Data Mining, Post-Processing.
◼ Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
◼ Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
◼ Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
◼ This is a view from typical machine learning and statistics communities.

Data Mining: On What Kinds of Data?
Database data:
◼ A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.
◼ A relational database is a collection of tables, each of which is assigned a unique name.
◼ Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows).
◼ Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.
Transactional database:
◼ Each record in a transactional database captures a transaction, such as a customer’s purchase, a flight booking, or a user’s clicks on a web page.
◼ A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction, such as the items purchased in the transaction.
◼ A transactional database may have additional tables, which contain other information related to the transactions, such as item descriptions or information about the salesperson or the branch.
Data warehouse (centralized storage system):
◼ A data warehouse is a system that aggregates data from different sources into a single, central, consistent data store to support data analysis, data mining, artificial intelligence (AI), and machine learning.
◼ A data warehouse system enables an organization to run powerful analytics on huge volumes (petabytes and petabytes) of historical data in ways that a standard database cannot.
◼ Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.

Data Mining: On What Kinds of Data?
◼ Advanced data sets and advanced applications
  ◼ Data streams and sensor data
  ◼ Time-series data, temporal data, sequence data (incl. bio-sequences)
  ◼ Structured data, graphs, social networks, and multi-linked data
  ◼ Object-relational databases
  ◼ Heterogeneous databases and legacy databases
  ◼ Spatial data and spatiotemporal data
  ◼ Multimedia databases
  ◼ Text databases
  ◼ The World-Wide Web

Data Mining Function (Tasks)
◼ Each of the following data mining techniques addresses a different problem and provides a different insight. Knowing the type of problem you are trying to solve will determine the type of data mining technique that will produce the best results.
◼ Data mining tasks are the kinds of data patterns or knowledge that can be mined.
◼ Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.
◼ In general, such tasks can be classified into two categories:
  ◼ Descriptive: mining tasks characterize the general properties of the data in the database.
  ◼ Predictive: mining tasks perform induction on the current data in order to make predictions.

Data mining functionalities
◼ There are several data mining functionalities.

Data Mining Function: Classification
◼ Classification is used for predictive mining tasks.
◼ The input data for predictive modeling consists of two types of variables:
  ◼ Explanatory variables, which define the essential properties of the data.
  ◼ A target variable, whose value is to be predicted.
◼ Classification is used to predict the value of a discrete target variable.

Data Mining Function: Regression
◼ Regression is similar to classification, except that we are trying to predict the value of a variable (e.g., amount of purchase) rather than a class (e.g., purchaser or non-purchaser).
◼ Classification models predict categorical labels (discrete, unordered).
◼ Regression models predict continuous values (numerical data values).

Data Mining Function: Association and Correlation Analysis
◼ Frequent patterns are patterns that occur frequently in data.
◼ A frequent itemset typically refers to a set of items that often appear together in a transactional data set.
  ◼ For example, milk and bread, which are frequently bought together in grocery stores by many customers.
◼ A frequently occurring subsequence, such as the pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
◼ Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
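
To make the frequent itemset idea concrete, here is a minimal Python sketch, assuming a toy list of transactions invented for illustration. It counts how often each pair of items appears together and keeps the pairs whose support (the fraction of transactions containing the pair) meets a threshold. The names `transactions`, `frequent_itemsets`, and `min_support` are assumptions of this sketch, not part of the lecture, and the brute-force counting stands in for the optimized algorithms used in practice.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions (invented for illustration, not from the lecture).
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"eggs", "butter"},
]


def frequent_itemsets(transactions, size, min_support):
    """Return {itemset: support} for itemsets of the given size.

    Support is the fraction of transactions that contain every item of the
    itemset. This is a brute-force count, not an optimized algorithm.
    """
    counts = Counter()
    for basket in transactions:
        # Sort so that the same itemset always produces the same tuple key.
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    n = len(transactions)
    return {
        itemset: count / n
        for itemset, count in counts.items()
        if count / n >= min_support
    }


# Pairs bought together in at least half of the transactions.
frequent_pairs = frequent_itemsets(transactions, size=2, min_support=0.5)
print(frequent_pairs)
```

With these five baskets, only bread and milk appear together in at least half of the transactions (3 of 5), which mirrors the milk-and-bread example above.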

Data Mining Function: Cluster Analysis
◼ Unsupervised learning (i.e., the class label is unknown)
◼ Finds groups of data points (clusters) such that data points in one cluster are more similar to each other than to data points in other clusters.

Data Mining Function: Outlier Analysis
◼ Discovers data points that are significantly different from the rest of the data. Such points are known as anomalies or outliers.
◼ Applications: credit card fraud detection

Data Mining: Confluence of Multiple Disciplines
◼ Some data mining methods come from statistics, such as naïve Bayes and maximum entropy; statistical methods are also used for estimating the probabilities of predictions.
◼ Data mining differs from statistics in the kinds of data (not only numerical), the kinds of methods (mostly machine learning methods), the number of hypotheses (more than one), and the amount of data (statistics uses samples).
◼ Data mining uses methods from machine learning, such as decision trees and neural nets.
◼ Machine learning uses samples, whereas data mining uses the whole data set.
◼ Machine learning is sometimes used to replace humans, whereas data mining is used to help humans.

Applications of Data Mining
◼ Web page analysis: from web page classification and clustering to the PageRank & HITS algorithms
◼ Collaborative analysis & recommender systems
◼ Basket data analysis to targeted marketing
◼ Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis

Summary
◼ Data mining: discovering interesting patterns and knowledge from massive amounts of data
◼ A natural evolution of database technology, in great demand, with wide applications
◼ A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
◼ Mining can be performed on a variety of data
◼ Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
◼ Data mining technologies and applications
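
The KDD steps listed in the summary can be sketched as a chain of small functions. The following Python example is only an illustration under stated assumptions: the sales records, the function names, and the 1.5 threshold are all hypothetical, and the "mining" step is a deliberately trivial stand-in (spend relative to the average) for a real data mining method.

```python
# Hypothetical records from two sources (invented for illustration).
store_sales = [
    {"customer": "a", "amount": 20.0},
    {"customer": "b", "amount": None},  # noisy record: missing amount
    {"customer": "a", "amount": 30.0},
]
web_sales = [
    {"customer": "b", "amount": 200.0},
    {"customer": "c", "amount": 25.0},
    {"customer": "c", "amount": 15.0},
    {"customer": "staff", "amount": 999.0},  # not relevant to the analysis
]


def clean(records):
    """Step 1, data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]


def integrate(*sources):
    """Step 2, data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, customers):
    """Step 3, data selection: keep only records relevant to the task."""
    return [r for r in records if r["customer"] in customers]


def transform(records):
    """Step 4, data transformation: aggregate to total spend per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals


def mine(totals):
    """Step 5, data mining (toy stand-in): spend relative to the average."""
    average = sum(totals.values()) / len(totals)
    return {customer: total / average for customer, total in totals.items()}


def evaluate(ratios, threshold=1.5):
    """Step 6, pattern evaluation: keep only ratios at or above the threshold."""
    return {c: ratio for c, ratio in ratios.items() if ratio >= threshold}


def present(patterns):
    """Step 7, knowledge presentation: render the patterns as readable text."""
    return [
        f"customer {c} spends {ratio:.1f}x the average"
        for c, ratio in sorted(patterns.items())
    ]


combined = integrate(clean(store_sales), clean(web_sales))
relevant = select(combined, customers={"a", "b", "c"})
report = present(evaluate(mine(transform(relevant))))
print(report)
```

Each function corresponds to one step, so a real project could replace any single stage, for example swapping `mine` for a clustering or classification method, without changing the others.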
