Data Mining: Introduction
2020
Tan, Steinbach, Karpatne, Kumar
Summary
This document contains lecture notes on Data Mining: Introduction. It discusses why and how data mining is used in different areas of life, introduces the topic through examples, and covers concepts such as predictive modelling and classification.
Full Transcript
Data Mining: Introduction
Lecture Notes for Chapter 1, Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar (09/09/2020)

Large-scale Data is Everywhere!
- There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies: e-commerce, cyber security, traffic patterns, social networking (e.g., Twitter), sensor networks, computational simulations.
- The new mantra: gather whatever data you can, whenever and wherever possible.
- The expectation: gathered data will have value, either for the purpose it was collected or for a purpose not envisioned.

Why Data Mining? Commercial Viewpoint
- Lots of data is being collected and warehoused: web data (Google has petabytes of web data, Facebook has billions of active users), purchases at department/grocery stores and e-commerce (Amazon handles millions of visits per day), and bank/credit card transactions.
- Computers have become cheaper and more powerful.
- Competitive pressure is strong: provide better, customized services for an edge (e.g., in Customer Relationship Management).

Why Data Mining? Scientific Viewpoint
- Data is collected and stored at enormous speeds: remote sensors on satellites (NASA EOSDIS archives petabytes of earth science data per year), telescopes scanning the skies (sky survey data), high-throughput biological data (gene expression data, fMRI data from the brain), and scientific simulations (terabytes of data generated in a few hours).
- Data mining helps scientists in automated analysis of massive datasets and in hypothesis formation.

Great opportunities to improve productivity exist in all walks of life.

Great Opportunities to Solve Society's Major Problems
- Improving health care and reducing costs.
- Predicting the impact of climate change.
- Reducing hunger and poverty by increasing agricultural production.
- Finding alternative/green energy sources.

What is Data Mining?
- Many definitions:
  - Non-trivial extraction of implicit, previously unknown, and potentially useful information from data.
  - Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.

Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems.
- Traditional techniques may be unsuitable because the data is large-scale, high dimensional, heterogeneous, complex, and distributed.
- Data mining is a key component of the emerging field of data science and data-driven discovery.

Data Mining Tasks
- Prediction methods: use some variables to predict unknown or future values of other variables.
- Description methods: find human-interpretable patterns that describe the data.
- From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996.
Data Mining Tasks …
- The main tasks, illustrated on a ten-record credit data set (Tid, Refund, Marital Status, Taxable Income, Cheat): clustering, predictive modeling, association rules, and anomaly detection.

Predictive Modeling: Classification
- Find a model for the class attribute as a function of the values of the other attributes.
- Example: a model for predicting credit worthiness, learned from training records such as

  Tid  Employed  Level of Education  # years at present address  Credit Worthy
  1    Yes       Graduate            5                           Yes
  2    Yes       High School         2                           No
  3    No        Undergrad           1                           No
  4    Yes       High School         10                          Yes
  ...

- The learned model is a decision tree that tests Employed (Yes/No), then Level of Education (Graduate vs. {High school, Undergrad}), then the number of years at the present address (with thresholds of 3 and 7 years), with Yes/No credit-worthiness predictions at the leaves.

Classification Example
- Attributes may be categorical (Employed, Level of Education) or quantitative (# years at present address); Credit Worthy is the class.
- A training set with known labels is used to learn a model (classifier); the model is then applied to a test set whose labels are unknown (Credit Worthy = ?).

Examples of Classification Task
- Classifying credit card transactions as legitimate or fraudulent.
- Classifying land cover (water bodies, urban areas, forests, etc.) using satellite data.
- Categorizing news stories as finance, weather, entertainment, sports, etc.
- Identifying intruders in cyberspace.
- Predicting tumor cells as benign or malignant.
- Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil.

Classification: Application 1 (Fraud Detection)
- Goal: predict fraudulent cases in credit card transactions.
- Approach: use credit card transactions and information on the account holder as attributes (when does a customer buy, what does he buy, how often does he pay on time, etc.). Label past transactions as fraud or fair; this forms the class attribute. Learn a model for the class of the transactions and use it to detect fraud by observing credit card transactions on an account.

Classification: Application 2 (Churn Prediction for Telephone Customers)
- Goal: predict whether a customer is likely to be lost to a competitor.
- Approach: use detailed records of transactions with each of the past and present customers to find attributes (how often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.). Label the customers as loyal or disloyal and find a model for loyalty.
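The classification workflow just described (learn a model from labeled training records, then apply it to unlabeled test records) can be sketched in a few lines of Python. This is only an illustrative sketch: the numeric encoding of the toy credit-worthiness table and the use of scikit-learn's DecisionTreeClassifier are assumptions, not part of the lecture notes.

```python
# Minimal sketch: learn a classifier from the toy credit-worthiness records,
# then apply it to unlabeled test records. Encoding and library are assumed.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical numeric encoding: Employed {No: 0, Yes: 1},
# Education {High School: 0, Undergrad: 1, Graduate: 2}, years as an integer.
X_train = [
    [1, 2, 5],   # Employed, Graduate, 5 years   -> credit worthy
    [1, 0, 2],   # Employed, High School, 2 yrs  -> not credit worthy
    [0, 1, 1],   # Not employed, Undergrad, 1 yr -> not credit worthy
    [1, 0, 10],  # Employed, High School, 10 yrs -> credit worthy
]
y_train = ["Yes", "No", "No", "Yes"]

model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)

# Apply the learned model to test records with unknown labels (Credit Worthy = ?).
X_test = [[1, 1, 7], [0, 2, 3]]
print(model.predict(X_test))
```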
From [Berry & Linoff], Data Mining Techniques, 1997.

Classification: Application 3 (Sky Survey Cataloging)
- Goal: predict the class (star or galaxy) of sky objects, especially visually faint ones, based on telescopic survey images from the Palomar Observatory: 3000 images with 23,040 x 23,040 pixels per image.
- Approach: segment the image, measure 40 image attributes (features) per object, and model the class based on these features.
- Success story: the approach found 16 new high red-shift quasars, some of the farthest objects that are difficult to find. From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996.

Classifying Galaxies (courtesy: http://aps.umn.edu)
- Class: stage of formation (early, intermediate, late).
- Attributes: image features, characteristics of the light waves received, etc.
- Data size: 72 million stars, 20 million galaxies; object catalog 9 GB; image database 150 GB.

Regression
- Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
- Extensively studied in statistics and in the neural network field.
- Examples: predicting sales of a new product based on advertising expenditure; predicting wind velocities as a function of temperature, humidity, air pressure, etc.; time-series prediction of stock market indices.

Clustering
- Find groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups: intra-cluster distances are minimized while inter-cluster distances are maximized.

Applications of Cluster Analysis
- Understanding: custom profiling for targeted marketing; grouping related documents for browsing; grouping genes and proteins that have similar functionality; grouping stocks with similar price fluctuations (gene-expression clustering image courtesy of Michael Eisen).
- Summarization: reduce the size of large data sets. Example: K-means is used to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) grid cells into clusters (land clusters 1 and 2, sea clusters 1 and 2, ice or no NPP) that reflect the Northern and Southern Hemispheres.

Clustering: Application 1 (Market Segmentation)
- Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
- Approach: collect different attributes of customers based on geographical and lifestyle-related information; find clusters of similar customers; measure the clustering quality by observing the buying patterns of customers in the same cluster versus those in different clusters.
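As a concrete illustration of the clustering idea (groups with small intra-cluster and large inter-cluster distances), the sketch below runs K-means on synthetic two-attribute "customer" data. The attribute names, the three segment centers, and the use of scikit-learn are all hypothetical choices for illustration, not part of the lecture notes.

```python
# Minimal sketch: group synthetic "customers" described by two hypothetical
# attributes (age, annual spend) so that intra-cluster distances are small.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic customer segments, generated around assumed centers.
segments = [(25, 500), (40, 2000), (65, 800)]
X = np.vstack([rng.normal(center, scale=(3, 150), size=(50, 2))
               for center in segments])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(kmeans.labels_))
print("cluster centers:\n", kmeans.cluster_centers_)
```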
Clustering: Application 2 (Document Clustering)
- Goal: find groups of documents that are similar to each other based on the important terms appearing in them.
- Approach: identify frequently occurring terms in each document, form a similarity measure based on the frequencies of the different terms, and use it to cluster. Example: the Enron email dataset.

Association Rule Discovery: Definition
- Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
- Example:

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk

  Rules discovered: {Milk} --> {Coke}, {Diaper, Milk} --> {Beer}

Association Analysis: Applications
- Market-basket analysis: rules are used for sales promotion, shelf management, and inventory management.
- Telecommunication alarm diagnosis: rules are used to find combinations of alarms that occur together frequently in the same time period.
- Medical informatics: rules are used to find combinations of patient symptoms and test results associated with certain diseases.
- Example: a subspace differential coexpression pattern found in three lung cancer datasets [Bhattacharjee et al. 2001], [Stearman et al. 2005], [Su et al. 2007], enriched with the TNF/NF-kB signaling pathway, which is well known to be related to lung cancer; p-value 1.4 * 10^-5 (6/10 overlap with the pathway) [Fang et al., PSB 2010].
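To make the rule notation concrete, the sketch below computes support and confidence for the two discovered rules directly from the five example transactions. The support and confidence definitions are the standard ones; the brute-force computation is just for illustration.

```python
# Minimal sketch: check the two discovered rules against the five example
# transactions by computing their support and confidence by brute force.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Fraction of transactions containing lhs that also contain rhs."""
    return support(lhs | rhs) / support(lhs)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(lhs, "->", rhs,
          "support=%.2f" % support(lhs | rhs),
          "confidence=%.2f" % confidence(lhs, rhs))
```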
Deviation/Anomaly/Change Detection
- Detect significant deviations from normal behavior.
- Applications: credit card fraud detection; network intrusion detection; identifying anomalous behavior from sensor networks for monitoring and surveillance; detecting changes in the global forest cover.

Motivating Challenges
- Scalability
- High dimensionality
- Heterogeneous and complex data
- Data ownership and distribution
- Non-traditional analysis

Data Mining: Data
Lecture Notes for Chapter 2, Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Kumar (01/27/2021)

Outline
- Attributes and objects
- Types of data
- Data quality
- Similarity and distance
- Data preprocessing

What is Data?
- Data is a collection of data objects and their attributes.
- An attribute is a property or characteristic of an object (examples: eye color of a person, temperature). An attribute is also known as a variable, field, characteristic, dimension, or feature.
- A collection of attributes describes an object. An object is also known as a record, point, case, sample, entity, or instance.
- Example (objects are rows, attributes are columns):

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Attribute Values
- Attribute values are numbers or symbols assigned to an attribute for a particular object.
- Distinction between attributes and attribute values: the same attribute can be mapped to different attribute values (example: height can be measured in feet or meters), and different attributes can be mapped to the same set of values (example: attribute values for ID and age are both integers). The properties of an attribute can be different from the properties of the values used to represent it.

Measurement of Length
- The way you measure an attribute may not match the attribute's properties. In the figure, five objects A to E are mapped to numbers in two ways: the mapping to 1, 2, 3, 4, 5 preserves only the ordering property of length, while the mapping to 5, 7, 8, 10, 15 preserves both the ordering and additivity properties of length.

Types of Attributes
- Nominal: examples are ID numbers, eye color, zip codes.
- Ordinal: examples are rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}.
- Interval: examples are calendar dates, temperatures in Celsius or Fahrenheit.
- Ratio: examples are temperature in Kelvin, length, counts, elapsed time (e.g., time to run a race).

Properties of Attribute Values
- The type of an attribute depends on which of the following properties/operations it possesses: distinctness (=, !=), order (<, >), meaningful differences (+, -), and meaningful ratios (*, /).
- Nominal attribute: distinctness. Ordinal attribute: distinctness and order. Interval attribute: distinctness, order, and meaningful differences. Ratio attribute: all four properties/operations.

Difference Between Ratio and Interval
- Is it physically meaningful to say that a temperature of 10° is twice that of 5° on the Celsius scale? On the Fahrenheit scale? On the Kelvin scale?
- Consider measuring height above average: if Bill's height is three inches above average and Bob's height is six inches above average, would we say that Bob is twice as tall as Bill? Is this situation analogous to that of temperature?

Attribute Types: Description, Examples, and Operations (this categorization of attributes is due to S. S. Stevens)
- Nominal (categorical, qualitative): attribute values only distinguish objects (=, !=). Examples: zip codes, employee ID numbers, eye color, sex {male, female}. Operations: mode, entropy, contingency correlation, chi-squared test.
- Ordinal (categorical, qualitative): attribute values also order objects (<, >). Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles, rank correlation, run tests, sign tests.
- Interval (numeric, quantitative): differences between values are meaningful (+, -). Examples: calendar dates, temperature in Celsius or Fahrenheit. Operations: mean, standard deviation, Pearson's correlation, t and F tests.
- Ratio (numeric, quantitative): both differences and ratios are meaningful (*, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Operations: geometric mean, harmonic mean, percent variation.
Attribute Types: Permissible Transformations (this categorization of attributes is due to S. S. Stevens)
- Nominal: any permutation of values. If all employee ID numbers were reassigned, it would not make any difference.
- Ordinal: an order-preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function. An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
- Interval: new_value = a * old_value + b, where a and b are constants. The Fahrenheit and Celsius temperature scales differ in where their zero value is and in the size of a unit (degree).
- Ratio: new_value = a * old_value. Length can be measured in meters or feet.

Discrete and Continuous Attributes
- Discrete attribute: has only a finite or countably infinite set of values. Examples: zip codes, counts, or the set of words in a collection of documents. Often represented as integer variables. Note: binary attributes are a special case of discrete attributes.
- Continuous attribute: has real numbers as attribute values. Examples: temperature, height, or weight. Practically, real values can only be measured and represented using a finite number of digits, so continuous attributes are typically represented as floating-point variables.

Asymmetric Attributes
- Only presence (a non-zero attribute value) is regarded as important. Examples: words present in documents, items present in customer transactions.
- If we met a friend in the grocery store, would we ever say the following? "I see our purchases are very similar, since we didn't buy most of the same things."
Critiques of the Attribute Categorization
- It is incomplete: it omits asymmetric binary, cyclical, multivariate, partially ordered, and partial-membership attributes, as well as relationships between the data.
- Real data is approximate and noisy. This can complicate recognition of the proper attribute type, and treating one attribute type as another may be approximately correct.

Key Messages for Attribute Types
- The types of operations you choose should be "meaningful" for the type of data you have.
- Distinctness, order, meaningful intervals, and meaningful ratios are only four (among many possible) properties of data.
- The data type you see, often numbers or strings, may not capture all the properties or may suggest properties that are not present.
- Analysis may depend on these other properties of the data; many statistical analyses depend only on the distribution.
- In the end, what is meaningful can be specific to the domain.

Important Characteristics of Data
- Dimensionality (number of attributes): high-dimensional data brings a number of challenges.
- Sparsity: when most attribute values are absent, only presence counts.
- Resolution: patterns depend on the scale.
- Size: the type of analysis may depend on the size of the data.

Types of Data Sets
- Record: data matrix, document data, transaction data.
- Graph: World Wide Web, molecular structures.
- Ordered: spatial data, temporal data, sequential data, genetic sequence data.

Record Data
- Data that consists of a collection of records, each of which consists of a fixed set of attributes; for example, the ten-record table (Tid, Refund, Marital Status, Taxable Income, Cheat) shown earlier.

Data Matrix
- If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute.
- Such a data set can be represented by an m-by-n matrix, with m rows (one for each object) and n columns (one for each attribute). Example:

  Projection of x load  Projection of y load  Distance  Load  Thickness
  10.23                 5.27                  15.22     2.7   1.2
  12.65                 6.25                  16.22     2.2   1.1

Document Data
- Each document becomes a 'term' vector: each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document. Example, with terms such as team, coach, play, ball, score, game, win, lost, timeout, and season as components:

  Document 1: 3 0 5 0 2 6 0 2 0 2
  Document 2: 0 7 0 2 1 0 0 3 0 0
  Document 3: 0 1 0 0 1 2 2 0 3 0
Transaction Data
- A special type of record data, where each transaction involves a set of items.
- For example, consider a grocery store: the set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
- Transaction data can be represented as record data, e.g., the five-transaction market-basket table shown earlier (TID 1: Bread, Coke, Milk; TID 2: Beer, Bread; TID 3: Beer, Coke, Diaper, Milk; TID 4: Beer, Bread, Diaper, Milk; TID 5: Coke, Diaper, Milk).

Graph Data
- Examples: a generic graph, a molecule (e.g., the benzene molecule, C6H6), and webpages linked by hyperlinks.

Ordered Data: Sequences of Transactions
- Example: one customer's transaction sequence ordered by date.

  Customer ID  Transaction sequence  Items purchased           Transaction date
  001          1                     Bread, Milk               01-01-2024
  001          2                     Eggs, Butter              10-03-2024
  001          3                     Bread, Cheese             03-05-2024
  001          4                     Milk, Eggs, Orange Juice  01-06-2024

Ordered Data: Genomic Sequence Data
- Example fragment:
  GGTTCCGCCTTCAGCCCCGCGCC CGCAGGGCCCGCCCCGCGCCGTC GAGAAGGGCCCGCCTGGCGGGCG
  GGGGGAGGCGGGGCCGCCCGAGC CCAACCGAGTCCGACCAGGTGCC CCCTCTGCTCGGCCTAGACCTGA
  GCTCATTAGGCGGCAGCGGACAG GCCAAGTAGAACACGCGAAGCGC TGGGCTGCCTGCTGCGACCAGGG

Ordered Data: Spatio-Temporal Data
- Example: average monthly temperature of land and ocean.

Data Quality
- Poor data quality negatively affects many data processing efforts.
- Data mining example: a classification model for detecting people who are loan risks is built using poor data. As a result, some credit-worthy candidates are denied loans, and more loans are given to individuals who default.

Data Quality …
- What kinds of data quality problems are there? How can we detect problems with the data? What can we do about these problems?
- Examples of data quality problems: noise and outliers, wrong data, fake data, missing values, and duplicate data.

Noise
- For objects, noise is an extraneous object. For attributes, noise refers to the modification of original values.
- Examples: distortion of a person's voice when talking on a poor phone connection, and "snow" on a television screen.
- The figures (not reproduced here) show two sine waves of the same magnitude and different frequencies, the waves combined, and the two sine waves with random noise: the magnitude and shape of the original signal are distorted.

Outliers
- Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.
- Case 1: outliers are noise that interferes with data analysis.
- Case 2: outliers are the goal of our analysis (credit card fraud, intrusion detection).
- Causes?
Missing Values
- Reasons for missing values: information is not collected (e.g., people decline to give their age and weight); attributes may not be applicable to all cases (e.g., annual income is not applicable to children).
- Handling missing values: eliminate data objects or variables; estimate missing values (examples: a time series of temperature, census results); or ignore the missing value during analysis.

Duplicate Data
- A data set may include data objects that are duplicates, or almost duplicates, of one another. This is a major issue when merging data from heterogeneous sources.
- Example: the same person with multiple email addresses.
- Data cleaning is the process of dealing with duplicate data issues. When should duplicate data not be removed?

Similarity and Dissimilarity Measures
- Similarity measure: a numerical measure of how alike two data objects are; it is higher when objects are more alike and often falls in the range [0, 1].
- Dissimilarity measure: a numerical measure of how different two data objects are; it is lower when objects are more alike. The minimum dissimilarity is often 0, and the upper limit varies.
- Proximity refers to either a similarity or a dissimilarity.

Similarity/Dissimilarity for Simple Attributes
- A table (not reproduced here) defines the similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute, for each attribute type.

Euclidean Distance
- d(x, y) = sqrt( sum over k of (x_k - y_k)^2 ), where the sum runs over the n dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.
- Standardization is necessary if scales differ.
- Example points: p1 = (0, 2), p2 = (2, 0), p3 = (3, 1), p4 = (5, 1). Euclidean distance matrix:

        p1     p2     p3     p4
  p1    0      2.828  3.162  5.099
  p2    2.828  0      1.414  3.162
  p3    3.162  1.414  0      2
  p4    5.099  3.162  2      0

Minkowski Distance
- Minkowski distance is a generalization of Euclidean distance: d(x, y) = ( sum over k of |x_k - y_k|^r )^(1/r), where r is a parameter, n is the number of dimensions (attributes), and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

Minkowski Distance: Examples
- r = 1: city block (Manhattan, taxicab, L1 norm) distance. A common example for binary vectors is the Hamming distance, which is just the number of bits that differ between two binary vectors.
- r = 2: Euclidean distance.
- r -> infinity: "supremum" (Lmax norm, L-infinity norm) distance. This is the maximum difference between any component of the vectors.
- Do not confuse r with n: all these distances are defined for all numbers of dimensions.
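A minimal sketch of the Euclidean and Minkowski distances defined above, evaluated on the four example points so the resulting matrices can be compared with those on the slides; only NumPy is assumed.

```python
# Minimal sketch: Minkowski distance for r = 1, 2, and "infinity" (supremum),
# evaluated on the four example points p1..p4.
import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])  # p1, p2, p3, p4

def minkowski(x, y, r):
    diff = np.abs(x - y)
    if np.isinf(r):
        return diff.max()              # supremum (L-max) distance
    return (diff ** r).sum() ** (1.0 / r)

for r in (1, 2, np.inf):
    D = np.array([[minkowski(x, y, r) for y in points] for x in points])
    print("r =", r)
    print(np.round(D, 3))              # matches the L1, L2, and L-inf matrices
```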
Minkowski Distance: Distance Matrices
- For the same four points p1 = (0, 2), p2 = (2, 0), p3 = (3, 1), p4 = (5, 1):

  L1      p1   p2   p3   p4
  p1      0    4    4    6
  p2      4    0    2    4
  p3      4    2    0    2
  p4      6    4    2    0

  L2      p1     p2     p3     p4
  p1      0      2.828  3.162  5.099
  p2      2.828  0      1.414  3.162
  p3      3.162  1.414  0      2
  p4      5.099  3.162  2      0

  L-inf   p1   p2   p3   p4
  p1      0    2    3    5
  p2      2    0    1    3
  p3      3    1    0    2
  p4      5    3    2    0

Mahalanobis Distance
- mahalanobis(x, y) = ( (x - y)^T S^(-1) (x - y) )^0.5, where S is the covariance matrix of the data.
- Example: for the two highlighted points in a stretched, correlated cloud, the Euclidean distance is 14.7 while the Mahalanobis distance is 6.
- Example with covariance matrix S = [[0.3, 0.2], [0.2, 0.3]] and points A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5): Mahal(A, B) = 5 and Mahal(A, C) = 4.

Common Properties of a Distance
- Distances, such as the Euclidean distance, have some well-known properties, where d(x, y) is the distance (dissimilarity) between points (data objects) x and y:
  1. Positivity: d(x, y) >= 0 for all x and y, and d(x, y) = 0 if and only if x = y.
  2. Symmetry: d(x, y) = d(y, x) for all x and y.
  3. Triangle inequality: d(x, z) <= d(x, y) + d(y, z) for all points x, y, and z.
- A distance that satisfies these properties is a metric.

Common Properties of a Similarity
- Similarities also have some well-known properties, where s(x, y) is the similarity between points (data objects) x and y:
  1. s(x, y) = 1 (or the maximum similarity) only if x = y (this does not always hold, e.g., for cosine).
  2. Symmetry: s(x, y) = s(y, x) for all x and y.

Similarity Between Binary Vectors
- A common situation is that objects x and y have only binary attributes.
- Compute similarities using the following quantities:
  f01 = the number of attributes where x was 0 and y was 1
  f10 = the number of attributes where x was 1 and y was 0
  f00 = the number of attributes where x was 0 and y was 0
  f11 = the number of attributes where x was 1 and y was 1
- Simple Matching Coefficient: SMC = number of matches / number of attributes = (f11 + f00) / (f01 + f10 + f11 + f00)
- Jaccard Coefficient: J = number of 11 matches / number of non-zero attributes = f11 / (f01 + f10 + f11)

SMC versus Jaccard: Example
  x = 1 0 0 0 0 0 0 0 0 0
  y = 0 0 0 0 0 0 1 0 0 1
  f01 = 2, f10 = 1, f00 = 7, f11 = 0
  SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
  J = f11 / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
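The SMC and Jaccard computation in the example above can be reproduced with a few lines of plain Python; this sketch just re-counts the f11, f10, f01, and f00 quantities.

```python
# Minimal sketch: simple matching coefficient (SMC) and Jaccard coefficient
# for the binary vectors in the example above.
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))

smc = (f11 + f00) / (f01 + f10 + f11 + f00)   # 0.7: counts the 0-0 matches
jaccard = f11 / (f01 + f10 + f11)             # 0.0: ignores the 0-0 matches
print("SMC =", smc, "Jaccard =", jaccard)
```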
Cosine Similarity
- If d1 and d2 are two document vectors, then cos(d1, d2) = <d1, d2> / (||d1|| ||d2||), where <d1, d2> is the inner (dot) product of the vectors d1 and d2, and ||d|| is the length of vector d.
- Example:
  d1 = 3 2 0 5 0 0 0 2 0 0
  d2 = 1 0 0 0 0 0 0 1 0 2
  <d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
  ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
  ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 = 2.449
  cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150

Correlation
- Correlation measures the linear relationship between objects.

Visually Evaluating Correlation
- Scatter plots (not reproduced here) show how the data looks for correlations ranging from -1 to 1.

Drawback of Correlation
- x = (-3, -2, -1, 0, 1, 2, 3), y = (9, 4, 1, 0, 1, 4, 9), i.e., y_i = x_i^2.
- mean(x) = 0, mean(y) = 4, std(x) = 2.16, std(y) = 3.74.
- corr = [(-3)(5) + (-2)(0) + (-1)(-3) + (0)(-4) + (1)(-3) + (2)(0) + (3)(5)] / (6 * 2.16 * 3.74) = 0
- So the correlation is 0 even though y is perfectly (but nonlinearly) determined by x.

Correlation vs Cosine vs Euclidean Distance
- Compare the three proximity measures according to their behavior under variable transformation: scaling (multiplication by a value) and translation (adding a constant).

  Property                               Cosine  Correlation  Euclidean Distance
  Invariant to scaling (multiplication)  Yes     Yes          No
  Invariant to translation (addition)    No      Yes          No

- Example: x = (1, 2, 4, 3, 0, 0, 0), y = (1, 2, 3, 4, 0, 0, 0), ys = y * 2 (scaled version of y), yt = y + 5 (translated version).

  Measure             (x, y)   (x, ys)  (x, yt)
  Cosine              0.9667   0.9667   0.7940
  Correlation         0.9429   0.9429   0.9429
  Euclidean Distance  1.4142   5.8310   14.2127

Correlation vs Cosine vs Euclidean Distance (continued)
- The choice of the right proximity measure depends on the domain. What is the correct choice of proximity measure for the following situations?
  - Comparing documents using the frequencies of words, where documents are considered similar if the word frequencies are similar.
  - Comparing the temperature in Celsius of two locations, where two locations are considered similar if the temperatures are similar in magnitude.
  - Comparing two time series of temperature measured in Celsius, where two time series are considered similar if their "shape" is similar, i.e., they vary in the same way over time, achieving minimums and maximums at similar times.
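A small sketch that reproduces the invariance table above: it evaluates cosine, correlation, and Euclidean distance between x and y, a scaled y, and a translated y; only NumPy is assumed.

```python
# Minimal sketch: behavior of cosine, correlation, and Euclidean distance
# under scaling and translation, using the example vectors from the table.
import numpy as np

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

def correlation(a, b):
    return np.corrcoef(a, b)[0, 1]

def euclidean(a, b):
    return np.linalg.norm(a - b)

x = np.array([1, 2, 4, 3, 0, 0, 0], dtype=float)
y = np.array([1, 2, 3, 4, 0, 0, 0], dtype=float)
ys, yt = y * 2, y + 5    # scaled and translated versions of y

for name, other in [("y", y), ("ys = 2*y", ys), ("yt = y + 5", yt)]:
    print(name,
          "cosine=%.4f" % cosine(x, other),
          "correlation=%.4f" % correlation(x, other),
          "euclidean=%.4f" % euclidean(x, other))
```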
Comparison of Proximity Measures
- Domain of application: similarity measures tend to be specific to the type of attribute and data; record data, images, graphs, sequences, 3D protein structure, etc. tend to have different measures.
- However, one can talk about various properties that you would like a proximity measure to have: symmetry is a common one, tolerance to noise and outliers is another, and the ability to find more types of patterns is another. Many other properties are possible.
- The measure must be applicable to the data and produce results that agree with domain knowledge.

Information Based Measures
- Information theory is a well-developed and fundamental discipline with broad applications.
- Some similarity measures are based on information theory: mutual information in various versions, and the Maximal Information Coefficient (MIC) and related measures. These are general and can handle non-linear relationships, but they can be complicated and time-intensive to compute.

Information and Probability
- Information relates to the possible outcomes of an event: transmission of a message, flip of a coin, or measurement of a piece of data.
- The more certain an outcome, the less information it contains, and vice versa. For example, if a coin has two heads, then an outcome of heads provides no information.
- More quantitatively, information is related to the probability of an outcome: the smaller the probability of an outcome, the more information it provides, and vice versa. Entropy is the commonly used measure.

Entropy
- For a variable (event) X with n possible values (outcomes) x1, x2, ..., xn, each outcome having probability p1, p2, ..., pn, the entropy of X is
  H(X) = - sum over i from 1 to n of p_i * log2(p_i)
- Entropy is between 0 and log2(n) and is measured in bits. Thus, entropy is a measure of how many bits it takes to represent an observation of X on average.

Entropy Examples
- For a coin with probability p of heads and probability q = 1 - p of tails: H = -p log2(p) - q log2(q).
  - For p = 0.5, q = 0.5 (a fair coin), H = 1.
  - For p = 1 or q = 1, H = 0.
- What is the entropy of a fair four-sided die?
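A minimal sketch of the entropy formula above; it reproduces the fair-coin and two-headed-coin values and answers the four-sided-die question (2 bits).

```python
# Minimal sketch: entropy H(X) = -sum_i p_i log2(p_i), applied to the coin
# examples and to the fair four-sided die asked about above.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                # fair coin: 1.0 bit
print(entropy([1.0]))                     # two-headed coin: 0 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))  # fair four-sided die: 2.0 bits
```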
Entropy for Sample Data: Example

  Hair Color  Count  p     -p log2 p
  Black       75     0.75  0.3113
  Brown       15     0.15  0.4105
  Blond       5      0.05  0.2161
  Red         0      0.00  0
  Other       5      0.05  0.2161
  Total       100    1.0   1.1540

- The maximum possible entropy for five categories is log2(5) = 2.3219.

Entropy for Sample Data
- Suppose we have m observations of some attribute X (e.g., the hair color of students in the class), with n different possible values, and the number of observations in the ith category is m_i. Then, for this sample,
  H(X) = - sum over i from 1 to n of (m_i / m) * log2(m_i / m)
- For continuous data, the calculation is harder.

Mutual Information
- Mutual information measures the information one variable provides about another. Formally,
  I(X, Y) = H(X) + H(Y) - H(X, Y),
  where H(X, Y) is the joint entropy of X and Y,
  H(X, Y) = - sum over i and j of p_ij * log2(p_ij),
  and p_ij is the probability that the ith value of X and the jth value of Y occur together.
- For discrete variables, this is easy to compute. The maximum mutual information for discrete variables is log2(min(n_X, n_Y)), where n_X (n_Y) is the number of values of X (Y).

Mutual Information Example

  Student Status  Count  p     -p log2 p
  Undergrad       45     0.45  0.5184
  Grad            55     0.55  0.4744
  Total           100    1.00  0.9928

  Grade  Count  p     -p log2 p
  A      35     0.35  0.5301
  B      50     0.50  0.5000
  C      15     0.15  0.4105
  Total  100    1.00  1.4406

  Student Status  Grade  Count  p     -p log2 p
  Undergrad       A      5      0.05  0.2161
  Undergrad       B      30     0.30  0.5211
  Undergrad       C      10     0.10  0.3322
  Grad            A      30     0.30  0.5211
  Grad            B      20     0.20  0.4644
  Grad            C      5      0.05  0.2161
  Total                  100    1.00  2.2710

- Mutual information of Student Status and Grade = 0.9928 + 1.4406 - 2.2710 = 0.1624.

Maximal Information Coefficient
- Reshef, David N., Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti. "Detecting novel associations in large data sets." Science 334, no. 6062 (2011): 1518-1524.
- Applies mutual information to two continuous variables.
- Consider the possible binnings of the variables into discrete categories, subject to n_X * n_Y <= N^0.6, where n_X is the number of values of X, n_Y is the number of values of Y, and N is the number of samples (observations, data objects).
- Compute the mutual information, normalized by log2(min(n_X, n_Y)), and take the highest value.

General Approach for Combining Similarities
- Sometimes attributes are of many different types, but an overall similarity is needed.
  1. For the kth attribute, compute a similarity s_k(x, y) in the range [0, 1].
  2. Define an indicator variable d_k for the kth attribute as follows: d_k = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute; d_k = 1 otherwise.
  3. Compute the overall similarity as similarity(x, y) = ( sum over k of d_k * s_k(x, y) ) / ( sum over k of d_k ).
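The mutual information example above can be checked with a short script that computes H(Student Status), H(Grade), and the joint entropy from the counts; only the Python standard library is assumed.

```python
# Minimal sketch: mutual information I(X, Y) = H(X) + H(Y) - H(X, Y) for the
# student-status / grade example, computed from the joint counts above.
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Joint counts: rows = (Undergrad, Grad), columns = grades (A, B, C).
joint = [[5, 30, 10],
         [30, 20, 5]]

status_counts = [sum(row) for row in joint]        # 45, 55
grade_counts = [sum(col) for col in zip(*joint)]   # 35, 50, 15
joint_counts = [c for row in joint for c in row]

mi = entropy(status_counts) + entropy(grade_counts) - entropy(joint_counts)
print(round(mi, 4))   # about 0.1624
```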
Using Weights to Combine Similarities
- We may not want to treat all attributes the same. Use non-negative weights w_k:
  similarity(x, y) = ( sum over k from 1 to n of w_k * d_k * s_k(x, y) ) / ( sum over k from 1 to n of w_k * d_k )
- A weighted form of distance can be defined in the same way.

Data Preprocessing
- Aggregation
- Sampling
- Discretization and binarization
- Attribute transformation
- Dimensionality reduction
- Feature subset selection
- Feature creation

Aggregation
- Combining two or more attributes (or objects) into a single attribute (or object).
- Purpose:
  - Data reduction: reduce the number of attributes or objects.
  - Change of scale: cities aggregated into regions, states, countries, etc.; days aggregated into weeks, months, or years.
  - More "stable" data: aggregated data tends to have less variability.

Example: Precipitation in Australia
- This example is based on precipitation in Australia from the period 1982 to 1993. Two histograms (not reproduced here) show the standard deviation of the average monthly precipitation for 3,030 0.5° by 0.5° grid cells in Australia, and the standard deviation of the average yearly precipitation for the same locations.
- The average yearly precipitation has less variability than the average monthly precipitation. All precipitation measurements (and their standard deviations) are in centimeters.

Sampling
- Sampling is the main technique employed for data reduction. It is often used both for the preliminary investigation of the data and for the final data analysis.
- Statisticians often sample because obtaining the entire set of data of interest is too expensive or time-consuming. Sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time-consuming.

Sampling …
- The key principle for effective sampling: using a sample will work almost as well as using the entire data set if the sample is representative.
- A sample is representative if it has approximately the same properties (of interest) as the original set of data.

Sample Size
- Figures (not reproduced here) show the same data set sampled at 8000 points, 2000 points, and 500 points.

Types of Sampling
- Simple random sampling: there is an equal probability of selecting any particular item.
  - Sampling without replacement: as each item is selected, it is removed from the population.
  - Sampling with replacement: objects are not removed from the population as they are selected, so the same object can be picked more than once.
- Stratified sampling: split the data into several partitions, then draw random samples from each partition.

Sample Size
- What sample size is necessary to get at least one object from each of 10 equal-sized groups?
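The final question above can be explored empirically. The sketch below simulates drawing objects uniformly at random (sampling with replacement over equally likely groups) until every one of the 10 groups is represented; the simulation setup is only an illustration, not part of the notes.

```python
# Minimal sketch: simulate how many objects must be drawn before the sample
# contains at least one member of each of 10 equal-sized groups.
import random

def draws_until_all_groups(n_groups=10):
    seen, draws = set(), 0
    while len(seen) < n_groups:
        seen.add(random.randrange(n_groups))   # group label of one drawn object
        draws += 1
    return draws

random.seed(0)
trials = [draws_until_all_groups() for _ in range(10000)]
print("average sample size:", sum(trials) / len(trials))  # roughly 29 on average
```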
Discretization
- Discretization is the process of converting a continuous attribute into an ordinal attribute: a potentially infinite number of values is mapped into a small number of categories.
- Discretization is used in both unsupervised and supervised settings.

Unsupervised Discretization
- Example data: four groups of points and two outliers. The data is one-dimensional, but a random y component is added to reduce overlap in the plots (not reproduced here).
- The plots show the categories obtained with the equal interval width approach, the equal frequency approach, and a K-means approach, each used to obtain 4 values.

Discretization in Supervised Settings
- Many classification algorithms work best if both the independent and dependent variables have only a few values.
- The usefulness of discretization is illustrated with an example (figures not reproduced here).

Binarization
- Binarization maps a continuous or categorical attribute into one or more binary variables.

Attribute Transformation
- An attribute transform is a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values.
- Simple functions: x^k, log(x), e^x, |x|.
- Normalization refers to various techniques for adjusting to differences among attributes in terms of frequency of occurrence, mean, variance, and range, and for taking out an unwanted, common signal, e.g., seasonality.
- In statistics, standardization refers to subtracting off the mean and dividing by the standard deviation.
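A minimal sketch of the two simplest unsupervised discretization schemes described above, equal interval width and equal frequency, applied to synthetic one-dimensional data; the synthetic group centers and the NumPy-based binning are assumptions made for illustration.

```python
# Minimal sketch: unsupervised discretization of a 1-D attribute into 4
# categories by equal interval width and by equal frequency.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: four groups of points plus two outliers, as in the figures.
values = np.concatenate([rng.normal(c, 0.5, 50) for c in (2, 6, 10, 14)] +
                        [np.array([25.0, 30.0])])

# Equal width: 4 bins of identical length spanning the value range.
width_edges = np.linspace(values.min(), values.max(), 5)
width_bins = np.digitize(values, width_edges[1:-1])

# Equal frequency: bin edges at the 25th, 50th, and 75th percentiles.
freq_edges = np.quantile(values, [0.25, 0.5, 0.75])
freq_bins = np.digitize(values, freq_edges)

print("equal width bin sizes:    ", np.bincount(width_bins, minlength=4))
print("equal frequency bin sizes:", np.bincount(freq_bins, minlength=4))
```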
Example: Sample Time Series of Plant Growth
- Net Primary Production (NPP) is a measure of plant growth used by ecosystem scientists. Consider monthly NPP time series for Minneapolis, Atlanta, and Sao Paolo.
- Correlations between the raw time series:

  Correlation   Minneapolis  Atlanta  Sao Paolo
  Minneapolis   1.0000       0.7591   -0.7581
  Atlanta       0.7591       1.0000   -0.5739
  Sao Paolo     -0.7581      -0.5739  1.0000

Seasonality Accounts for Much Correlation
- Normalize using a monthly z-score: subtract off the monthly mean and divide by the monthly standard deviation.
- Correlations between the normalized time series:

  Correlation   Minneapolis  Atlanta  Sao Paolo
  Minneapolis   1.0000       0.0492   0.0906
  Atlanta       0.0492       1.0000   -0.0154
  Sao Paolo     0.0906       -0.0154  1.0000

Curse of Dimensionality
- When dimensionality increases, data becomes increasingly sparse in the space that it occupies.
- Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful.
- Illustration: randomly generate 500 points and compute the difference between the maximum and minimum distance between any pair of points; the relative difference shrinks as the number of dimensions grows.

Dimensionality Reduction
- Purpose: avoid the curse of dimensionality; reduce the amount of time and memory required by data mining algorithms; allow data to be more easily visualized; may help to eliminate irrelevant features or reduce noise.
- Techniques: Principal Components Analysis (PCA), Singular Value Decomposition, and other supervised and non-linear techniques.

Dimensionality Reduction: PCA
- The goal is to find a projection that captures the largest amount of variation in the data; in the x1-x2 figure, this is the direction e along which the point cloud is most spread out.

Feature Subset Selection
- Another way to reduce the dimensionality of data.
- Redundant features duplicate much or all of the information contained in one or more other attributes. Example: the purchase price of a product and the amount of sales tax paid.
- Irrelevant features contain no information that is useful for the data mining task at hand. Example: a student's ID is often irrelevant to the task of predicting the student's GPA.
- Many techniques have been developed, especially for classification.

Feature Creation
- Create new attributes that can capture the important information in a data set much more efficiently than the original attributes.
- Three general methodologies:
  - Feature extraction (example: extracting edges from images).
  - Feature construction (example: dividing mass by volume to get density).
  - Mapping data to a new space (example: Fourier and wavelet analysis).

Mapping Data to a New Space
- Fourier and wavelet transforms. Example figures (not reproduced here): two sine waves plus noise in the time domain, and the corresponding frequency-domain representation.
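Before moving on to classification, here is a minimal sketch of the PCA idea described above: center the data, compute the covariance matrix, and project onto the eigenvector with the largest eigenvalue. The synthetic data and the use of NumPy's eigendecomposition are illustrative assumptions.

```python
# Minimal sketch of PCA: find the projection direction that captures the
# largest amount of variation in the data, via the covariance eigenvectors.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated 2-D data (an elongated cloud, as in the x1-x2 figure).
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.5], [1.5, 1.0]], size=500)

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order

e = eigvecs[:, -1]                          # first principal component
projected = X_centered @ e                  # 1-D representation of the data
print("variance captured: %.1f%%" % (100 * eigvals[-1] / eigvals.sum()))
```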
Data Mining Classification: Basic Concepts and Techniques
Lecture Notes for Chapter 3, Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar (2/1/2021)

Classification: Definition
- Given a collection of records (the training set), each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label.
  - x: attribute, predictor, independent variable, input.
  - y: class, response, dependent variable, output.
- Task: learn a model that maps each attribute set x into one of the predefined class labels y.

Examples of Classification Task

  Task                          Attribute set, x                               Class label, y
  Categorizing email messages   Features extracted from email message          spam or non-spam
                                header and content
  Identifying tumor cells       Features extracted from x-rays or MRI scans    malignant or benign cells
  Cataloging galaxies           Features extracted from telescope images       elliptical, spiral, or irregular-shaped galaxies

General Approach for Building Classification Model
- A learning algorithm induces a model from a training set with known class labels; the model is then applied to a test set to assign labels to new records.

Classification Techniques
- Base classifiers: decision tree based methods, rule-based methods, nearest-neighbor, naive Bayes and Bayesian belief networks, support vector machines, neural networks and deep neural nets.
- Ensemble classifiers: boosting, bagging, random forests.

Example of a Decision Tree
- Training data:

  ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
  1   Yes         Single          125K           No
  2   No          Married         100K           No
  3   No          Single          70K            No
  4   Yes         Married         120K           No
  5   No          Divorced        95K            Yes
  6   No          Married         60K            No
  7   Yes         Divorced        220K           No
  8   No          Single          85K            Yes
  9   No          Married         75K            No
  10  No          Single          90K            Yes

- Model (decision tree): the splitting attributes are Home Owner, MarSt, and Income. Home Owner = Yes leads to the leaf NO; for non-owners, Marital Status = Married leads to NO, while Single or Divorced leads to an Income test (< 80K gives NO, > 80K gives YES).

Apply Model to Test Data
- Start from the root of the tree and follow the branches that match the test record's attribute values.
- Test record: Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?
- Home Owner = No leads to the MarSt node; Marital Status = Married leads to the leaf NO, so Defaulted is assigned "No".
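The test-data walk-through above can be written directly as nested condition tests. The sketch below mirrors the example tree (Home Owner, then MarSt, then Income) and reproduces the prediction for the test record; the function name and parameter encoding are just illustrative.

```python
# Minimal sketch: the example decision tree written out as nested "if" tests,
# applied to the test record (Home Owner = No, Married, Income = 80K).
def classify_defaulted(home_owner, marital_status, annual_income_k):
    """Return the Defaulted Borrower prediction (YES/NO) for one record."""
    if home_owner == "Yes":
        return "NO"
    # Non-owners: split on marital status.
    if marital_status == "Married":
        return "NO"
    # Single or Divorced: split on annual income (in thousands).
    return "YES" if annual_income_k > 80 else "NO"

print(classify_defaulted("No", "Married", 80))   # -> NO, as on the slides
print(classify_defaulted("No", "Single", 90))    # training record 10 -> YES
```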
Another Example of Decision Tree
- The same training data can also be fit by a tree that splits first on MarSt (Married leads to NO), then on Home Owner (Yes leads to NO), then on Income (< 80K gives NO, > 80K gives YES).
- There could be more than one tree that fits the same data!

Decision Tree Classification Task
- A training set of labeled records (Tid, Attrib1, Attrib2, Attrib3, Class) is fed to a tree induction algorithm, which learns a model (induction).
- The model is then applied to a test set of unlabeled records (Tid 11-15, Class = ?) to deduce their class labels (deduction).

Data Mining: Model Overfitting
Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar

Classification Errors
- Training errors: errors committed on the training set.
- Test errors: errors committed on the test set.
- Generalization errors: the expected error of a model over a random selection of records from the same distribution.

Example Data Set
- A two-class problem:
  - + class: 5400 instances; 5000 instances generated from a Gaussian centered at (10, 10), plus 400 noisy instances.
  - o class: 5400 instances generated from a uniform distribution.
- 10% of the data is used for training and 90% for testing.

Increasing the Number of Nodes in Decision Trees
- Decision boundaries on the training data (figures not reproduced here) are shown for a decision tree with 4 nodes and a decision tree with 50 nodes.
- Which tree is better?
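A sketch of an experiment in the spirit of the slides: generate a two-class data set like the one described (a Gaussian "+" class plus noise versus a uniform "o" class), train decision trees of increasing size on 10% of the data, and compare training and test errors. The exact value ranges, the tree-size control via max_leaf_nodes, and the use of scikit-learn are assumptions made for illustration.

```python
# Minimal sketch: train decision trees of increasing size on a synthetic
# two-class data set and compare training and test error.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
gauss = rng.normal(loc=(10, 10), scale=2.0, size=(5000, 2))
noise = rng.uniform(0, 20, size=(400, 2))        # noisy "+" instances
uniform = rng.uniform(0, 20, size=(5400, 2))     # "o" class
X = np.vstack([gauss, noise, uniform])
y = np.array([1] * 5400 + [0] * 5400)

# 10% of the data for training, 90% for testing, as on the slides.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.1, random_state=0)

for nodes in (4, 50, 500):
    tree = DecisionTreeClassifier(max_leaf_nodes=nodes, random_state=0)
    tree.fit(X_tr, y_tr)
    print(f"{nodes:4d} leaves: train error = {1 - tree.score(X_tr, y_tr):.3f}, "
          f"test error = {1 - tree.score(X_te, y_te):.3f}")
```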
Model Underfitting and Overfitting
- As the model becomes more and more complex, test errors can start increasing even though the training error may be decreasing.
- Underfitting: when the model is too simple, both training and test errors are large.
- Overfitting: when the model is too complex, the training error is small but the test error is large.

Model Overfitting – Impact of Training Data Size
- Using twice the number of data instances: increasing the size of the training data reduces the difference between training and testing errors at a given model size (illustrated for decision trees with 50 nodes).

Data Mining Classification: Alternative Techniques (Rule-Based)
Lecture Notes for Chapter 4, Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar (9/30/2020)

Rule-Based Classifier
- Classify records by using a collection of "if…then…" rules.
- Rule: (Condition) -> y, where the Condition is a conjunction of tests on attributes and y is the class label.
- Examples of classification rules:
  - (Blood Type = Warm) AND (Lay Eggs = Yes) -> Birds
  - (Taxable Income < 50K) AND (Refund = Yes) -> Evade = No

Rule-based Classifier (Example)
- Training data:

  Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
  human          warm        yes         no       no             mammals
  python         cold        no          no       no             reptiles
  salmon         cold        no          no       yes            fishes
  whale          warm        yes         no       yes            mammals
  frog           cold        no          no       sometimes      amphibians
  komodo         cold        no          no       no             reptiles
  bat            warm        yes         yes      no             mammals
  pigeon         warm        no          yes      no             birds
  cat            warm        yes         no       no             mammals
  leopard shark  cold        yes         no       yes            fishes
  turtle         cold        no          no       sometimes      reptiles
  penguin        warm        no          no       sometimes      birds
  porcupine      warm        yes         no       no             mammals
  eel            cold        no          no       yes            fishes
  salamander     cold        no          no       sometimes      amphibians
  gila monster   cold        no          no       no             reptiles
  platypus       warm        no          no       no             mammals
  owl            warm        no          yes      no             birds
  dolphin        warm        yes         no       yes            mammals
  eagle          warm        no          yes      no             birds

- Rules:
  R1: (Give Birth = no) AND (Can Fly = yes) -> Birds
  R2: (Give Birth = no) AND (Live in Water = yes) -> Fishes
  R3: (Give Birth = yes) AND (Blood Type = warm) -> Mammals
  R4: (Give Birth = no) AND (Can Fly = no) -> Reptiles
  R5: (Live in Water = sometimes) -> Amphibians

Application of Rule-Based Classifier
- A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule.
- Test instances to be classified with rules R1-R5:

  Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
  hawk          warm        no          yes      no             ?
  grizzly bear  warm
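The rules R1 to R5 can be applied programmatically. In the sketch below they are checked in a fixed order and the first rule that covers the instance assigns the class; that ordering strategy is an assumption made here for illustration.

```python
# Minimal sketch: rules R1-R5 applied in order; the first rule whose condition
# covers an instance determines the predicted class.
def classify(blood_type, give_birth, can_fly, live_in_water):
    rules = [
        (lambda: give_birth == "no" and can_fly == "yes",       "Birds"),       # R1
        (lambda: give_birth == "no" and live_in_water == "yes", "Fishes"),      # R2
        (lambda: give_birth == "yes" and blood_type == "warm",  "Mammals"),     # R3
        (lambda: give_birth == "no" and can_fly == "no",        "Reptiles"),    # R4
        (lambda: live_in_water == "sometimes",                  "Amphibians"),  # R5
    ]
    for condition, label in rules:
        if condition():
            return label
    return "unclassified"

# Test instance from the slide: a hawk (warm-blooded, lays eggs, can fly).
print(classify("warm", "no", "yes", "no"))   # covered by R1 -> Birds
```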