DATA MINING_merged.pdf
Data Mining: Introduction
Lecture Notes for Chapter 1, Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar.

Large-scale Data is Everywhere!
– There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies.
– New mantra (slogan): gather whatever data you can, whenever and wherever possible.
– Expectation: gathered data will have value either for the purpose collected or for a purpose not envisioned.
– Examples: cyber security, e-commerce, traffic patterns, sensor networks, social networking (Twitter), computational simulations.

Why Data Mining? Commercial Viewpoint
– Lots of data is being collected and warehoused (stored):
  ◆ Web data: Yahoo has petabytes of web data; Facebook has billions of active users.
  ◆ Purchases at department/grocery stores and e-commerce: Amazon handles millions of visits per day.
  ◆ Bank/credit card transactions.
– Computers have become cheaper and more powerful.
– Competitive pressure is strong: provide better, customized services for an edge (e.g., in Customer Relationship Management).

Why Data Mining? Scientific Viewpoint
– Data is collected and stored at enormous speeds:
  ◆ Remote sensors on satellites: NASA EOSDIS archives petabytes of earth science data per year.
  ◆ Telescopes scanning the skies: sky survey data.
  ◆ High-throughput biological data: gene expression data, fMRI data from the brain.
  ◆ Scientific simulations: terabytes of data generated in a few hours.
– Data mining helps scientists in the automated analysis of massive datasets and in hypothesis formation. [Figure: surface temperature of Earth.]

There are great opportunities to improve productivity in all walks of life.

Great Opportunities to Solve Society's Major Problems
– Improving health care and reducing costs
– Predicting the impact of climate change
– Finding alternative/green energy sources
– Reducing hunger and poverty by increasing agricultural production

What is Data Mining? Many definitions:
– Non-trivial extraction of implicit, previously unknown, and potentially useful information from data.
– Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.

What is (not) Data Mining?
– Not data mining: looking up a phone number in a phone directory; querying a web search engine for information about "Amazon".
– Data mining: noticing that certain names are more common in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area); grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest vs. Amazon.com).

Origins of Data Mining
– Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems.
– Traditional techniques may be unsuitable due to data that is large-scale, high dimensional, heterogeneous, complex, and distributed.
– A key component of the emerging field of data science and data-driven discovery.

Data Mining Tasks
– Prediction methods: use some variables to predict unknown or future values of other variables.
– Description methods: find human-interpretable patterns that describe the data. (From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996.)

Data Mining Tasks: Example Data

Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes
11   No      Married          60K            No
12   Yes     Divorced        220K            No
13   No      Single           85K            Yes
14   No      Married          75K            No
15   No      Single           90K            Yes

Predictive Modeling: Classification
– Find a model for the class attribute as a function of the values of the other attributes.
– Example: a model for predicting credit worthiness, with attributes Employed, Level of Education, and # years at present address, and class attribute Credit Worthy:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
 1   Yes       Graduate             5                          Yes
 2   Yes       High School          2                          No
 3   No        Undergrad            1                          No
 4   Yes       High School         10                          Yes
 …   …         …                    …                          …

– [Decision tree learned from this data: split first on Employed (No → not credit worthy); for Employed = Yes, split on Education: Graduate → credit worthy if number of years > 3, otherwise not; {High school, Undergrad} → credit worthy if number of years > 7, otherwise not.]

Classification Example
– A classifier is learned from a training set of labeled records (such as the table above) and the resulting model is then applied to a test set whose class labels are unknown:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
 1   Yes       Undergrad            7                          ?
 2   No        Graduate             3                          ?
 3   Yes       High School          2                          ?

(A sketch of this learn-then-apply workflow follows the application examples below.)

Examples of Classification Tasks
– Classifying credit card transactions as legitimate or fraudulent.
– Classifying land covers (water bodies, urban areas, forests, etc.) using satellite data.
– Categorizing news stories as finance, weather, entertainment, sports, etc.
– Identifying intruders in cyberspace.
– Predicting tumor cells as benign or malignant.
– Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil.

Classification: Application 1 — Fraud Detection
– Goal: predict fraudulent cases in credit card transactions.
– Approach:
  ◆ Use credit card transactions and the information on the account holder as attributes (when does the customer buy, what does he buy, how often does he pay on time, etc.).
  ◆ Label past transactions as fraudulent or fair; this forms the class attribute.
  ◆ Learn a model for the class of the transactions.
  ◆ Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 2 — Churn Prediction for Telephone Customers
– Goal: predict whether a customer is likely to be lost to a competitor.
– Approach:
  ◆ Use the detailed record of transactions with each past and present customer to find attributes (how often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.).
  ◆ Label the customers as loyal or disloyal.
  ◆ Find a model for loyalty. (From [Berry & Linoff], Data Mining Techniques, 1997.)

Classification: Application 3 — Sky Survey Cataloging
– Goal: predict the class (star or galaxy) of sky objects, especially visually faint ones, based on telescopic survey images (from the Palomar Observatory): 3000 images with 23,040 × 23,040 pixels per image.
– Approach:
  ◆ Segment the image.
  ◆ Measure image attributes (features): 40 of them per object.
  ◆ Model the class based on these features.
  ◆ Success story: found 16 new high red-shift quasars, some of the farthest objects, which are difficult to find! (From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996.)
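The learn-then-apply workflow above can be made concrete in a few lines. A minimal sketch, assuming scikit-learn is available; the tiny training table and the integer encoding of the categorical columns are illustrative, not the authors' setup.

```python
# A minimal sketch of the train-then-classify workflow, assuming
# scikit-learn. The tiny table mirrors the credit-worthiness example;
# the integer encoding of the categorical columns is illustrative.
from sklearn.tree import DecisionTreeClassifier

# Attributes: Employed (1 = Yes), Level of Education (0 = High School,
# 1 = Undergrad, 2 = Graduate), # years at present address.
X_train = [[1, 2, 5], [1, 0, 2], [0, 1, 1], [1, 0, 10]]
y_train = ["Yes", "No", "No", "Yes"]      # Credit Worthy (class attribute)

model = DecisionTreeClassifier().fit(X_train, y_train)

# Apply the learned model to test records whose class label is unknown.
X_test = [[1, 1, 7], [0, 2, 3], [1, 0, 2]]
print(model.predict(X_test))              # e.g., ['Yes' 'No' 'No']
```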
Classifying Galaxies (courtesy: http://aps.umn.edu)
– Classes: stages of formation (early, intermediate, late).
– Attributes: image features, characteristics of the light waves received, etc.
– Data size: 72 million stars, 20 million galaxies; object catalog: 9 GB; image database: 150 GB.

Regression
– Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
– Extensively studied in statistics and in the neural network field.
– Examples: predicting the sales amounts of a new product based on advertising expenditure; predicting wind velocities as a function of temperature, humidity, air pressure, etc.; time-series prediction of stock market indices.

Clustering
– Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
– Intra-cluster distances are minimized; inter-cluster distances are maximized.

Applications of Cluster Analysis
– Understanding: customer profiling for targeted marketing; grouping related documents for browsing; grouping genes and proteins that have similar functionality; grouping stocks with similar price fluctuations. (Courtesy: Michael Eisen)
– Summarization: reduce the size of large data sets. [Figure: use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into land and sea clusters that reflect the Northern and Southern Hemispheres, plotted by latitude and longitude.]

Clustering: Application 1 — Market Segmentation
– Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
– Approach:
  ◆ Collect different attributes of customers based on geographical and lifestyle-related information.
  ◆ Find clusters of similar customers.
  ◆ Measure the clustering quality by observing the buying patterns of customers in the same cluster versus those from different clusters.

Clustering: Application 2 — Document Clustering
– Goal: find groups of documents that are similar to each other based on the important terms appearing in them.
– Approach: identify frequently occurring terms in each document; form a similarity measure based on the frequencies of different terms; use it to cluster. (Example: the Enron email dataset.)

Association Rule Discovery: Definition
– Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
{Milk} → {Coke}
{Diaper, Milk} → {Beer}

Association Analysis: Applications
– Market-basket analysis: rules are used for sales promotion, shelf management, and inventory management.
– Telecommunication alarm diagnosis: rules are used to find combinations of alarms that occur together frequently in the same time period.
– Medical informatics: rules are used to find combinations of patient symptoms and test results associated with certain diseases.

Association Analysis: An Example
– A subspace differential coexpression pattern from a lung cancer dataset, found across three lung cancer datasets [Bhattacharjee et al. 2001], [Stearman et al. 2005], [Su et al. 2007].
– The pattern is enriched with the TNF/NF-kB signaling pathway, which is well known to be related to lung cancer; p-value: 1.4 × 10⁻⁵ (6/10 overlap with the pathway). [Fang et al., PSB 2010]
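The rules discovered in the transaction table above can be checked by computing their support and confidence directly. A minimal plain-Python sketch over the five transactions; the helper names are mine, not from the slides.

```python
# A small sketch that checks the two discovered rules against the
# five-transaction table above. Helper names are illustrative.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Estimate of P(rhs in transaction | lhs in transaction)."""
    return support(lhs | rhs) / support(lhs)

print(confidence({"Milk"}, {"Coke"}))            # 0.75 (3 of 4 Milk baskets)
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # ~0.67 (2 of 3)
```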
Deviation/Anomaly/Change Detection
– Detect significant deviations from normal behavior.
– Applications: credit card fraud detection; network intrusion detection; identifying abnormal behavior in sensor networks for monitoring and surveillance; detecting changes in the global forest cover.

Motivating Challenges
– Scalability
– High dimensionality
– Heterogeneous and complex data
– Data ownership and distribution
– Non-traditional analysis

Data Mining: Data
Lecture Notes for Chapter 2, Introduction to Data Mining, by Tan, Steinbach, Kumar.

What is Data?
– A collection of data objects and their attributes.
– An attribute is a property or characteristic of an object (examples: eye color of a person, temperature). An attribute is also known as a variable, field, characteristic, or feature.
– A collection of attributes describes an object; an object is also known as a record, point, case, sample, entity, or instance.
– [Example: the Tid/Refund/Marital Status/Taxable Income/Cheat table shown earlier.]

Attribute Values
– Attribute values are numbers or symbols assigned to an attribute.
– The same attribute can be mapped to different attribute values: height can be measured in feet or meters.
– Different attributes can be mapped to the same set of values: the attribute values for ID and age are both integers, but their properties differ — ID has no limit, while age has a minimum and a maximum value.

Measurement of Length
– The way you measure an attribute may not match the attribute's properties.
– Example: five objects A–E of increasing length. One mapping of lengths to numbers (A→1, B→2, C→3, D→4, E→5) preserves only the ordering property of length; another (A→5, B→7, C→8, D→10, E→15) preserves both the ordering and the additivity properties.
– Thus, an attribute can be measured in a way that does not capture all the properties of the attribute.

Types of Attributes
– Nominal. Examples: ID numbers, eye color, zip codes.
– Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}.
– Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio. Examples: temperature in Kelvin, length, time, counts.

Properties of Attribute Values
– The type of an attribute depends on which of the following properties it has: distinctness (=, ≠), order (<, >), addition (+, −), multiplication (×, /).
– Nominal attribute: distinctness. Ordinal attribute: distinctness and order. Interval attribute: distinctness, order, and addition. Ratio attribute: all four properties.

Attribute types in detail (this categorization of attributes is due to S. S. Stevens):

– Nominal (categorical/qualitative). The values are just different names: they provide only enough information to distinguish one object from another (=, ≠). Examples: zip codes, employee ID numbers, eye color, sex {male, female}. Meaningful operations: mode, entropy, contingency correlation, χ² test. Meaning-preserving transformation: any permutation of values (if all employee ID numbers were reassigned, it would make no difference).
– Ordinal (categorical/qualitative). The values provide enough information to order objects (<, >). Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles, rank correlation, run tests, sign tests. Transformation: any order-preserving change of values, new_value = f(old_value) where f is a monotonic function (good/better/best can be represented equally well by {1, 2, 3} or by {0.5, 1, 10}).
– Interval (numeric/quantitative). Differences between values are meaningful, i.e., a unit of measurement exists (+, −). Examples: calendar dates, temperature in Celsius or Fahrenheit. Operations: mean, standard deviation, Pearson's correlation, t and F tests. Transformation: new_value = a * old_value + b, where a and b are constants (the Fahrenheit and Celsius scales differ in where their zero value is and in the size of a unit, i.e., a degree).
– Ratio (numeric/quantitative). Both differences and ratios are meaningful (×, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Operations: geometric mean, harmonic mean, percent variation. Transformation: new_value = a * old_value (length can be measured in meters or feet).

The types of attributes can thus also be described in terms of the transformations that do not change the meaning of an attribute.

Discrete and Continuous Attributes
– Discrete attribute: has only a finite or countably infinite set of values. Examples: zip codes, counts, or the set of words in a collection of documents. Often represented as integer variables. Note: binary attributes are a special case of discrete attributes and assume only two values, e.g., true/false, yes/no, male/female, or 0/1.
– Continuous attribute: has real numbers as attribute values. Examples: temperature, height, weight. Practically, real values can only be measured and represented using a finite number of digits; continuous attributes are typically represented as floating-point variables.

Asymmetric Attributes
– Only presence (a non-zero attribute value) is regarded as important. Examples: words present in documents, items present in customer transactions.
– If we met a friend in the grocery store, would we ever say: "I see our purchases are very similar, since we didn't buy most of the same things"?
– It is more meaningful and more efficient to focus on the non-zero values.
– Binary attributes where only non-zero values are important are called asymmetric binary attributes; association analysis uses asymmetric attributes.

Types of Data Sets
– Record: data matrix, document data, transaction data.
– Graph-based: the World Wide Web, molecular structures.
– Ordered: spatial data, temporal data, sequential data, genetic sequence data.

Important Characteristics of Data
– Dimensionality (number of attributes): the curse of dimensionality.
– Sparsity: only presence counts.
– Resolution: patterns depend on the scale.
– Size: the type of analysis may depend on the size of the data.

Record Data
– Data that consists of a collection of records, each of which consists of a fixed set of attributes. [Example: the Tid/Refund/Marital Status/Taxable Income/Cheat table shown earlier.]

Data Matrix
– If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute.
– Such a data set can be represented by an m-by-n matrix, with m rows, one for each object, and n columns, one for each attribute. Example:

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1

Document Data
– Each document becomes a 'term' vector: each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document.

Transaction Data
– A special type of record data in which each record (transaction) involves a set of items. For example, in a grocery store the set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products purchased are the items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Graph Data
– Examples: a generic graph, a molecule (benzene, C6H6), and linked webpages.

Ordered Data
– Sequences of transactions: each element of the sequence is a set of items/events.
– Genomic sequence data, e.g.:
  GGTTCCGCCTTCAGCCCCGCGCC
  CGCAGGGCCCGCCCCGCGCCGTC
  GAGAAGGGCCCGCCTGGCGGGCG
  GGGGGAGGCGGGGCCGCCCGAGC
  CCAACCGAGTCCGACCAGGTGCC
  CCCTCTGCTCGGCCTAGACCTGA
  GCTCATTAGGCGGCAGCGGACAG
  GCCAAGTAGAACACGCGAAGCGC
  TGGGCTGCCTGCTGCGACCAGGG
– Spatio-temporal data, e.g., the average monthly temperature of land and ocean.

Data Quality
– Poor data quality negatively affects many data processing efforts.
– "The most important point is that poor data quality is an unfolding disaster. Poor data quality costs the typical company at least ten percent (10%) of revenue; twenty percent (20%) is probably a better estimate." — Thomas C. Redman, DM Review, August 2004.
– Data mining example: a classification model for detecting people who are loan risks is built using poor data. As a result, some credit-worthy candidates are denied loans, and more loans are given to individuals who default.

Data Quality (continued)
– What kinds of data quality problems are there? How can we detect problems with the data? What can we do about these problems?
– Examples of data quality problems: noise and outliers, missing values, duplicate data.

Noise
– Noise refers to the modification of original values. Examples: distortion of a person's voice when talking on a poor phone connection, and "snow" on a television screen.

Outliers
– Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.
– Case 1: outliers are noise that interferes with data analysis.
– Case 2: outliers are the goal of our analysis (credit card fraud, intrusion detection).

Missing Values
– Reasons for missing values: information is not collected (e.g., people decline to give their age and weight); attributes may not be applicable to all cases (e.g., annual income is not applicable to children).
– Handling missing values: eliminate data objects; estimate missing values; ignore the missing value during analysis; replace with all possible values (weighted by their probabilities).

Duplicate Data
– A data set may include data objects that are duplicates, or almost duplicates, of one another; this is a major issue when merging data from heterogeneous sources.
– Example: the same person with multiple email addresses.
– Data cleaning is the process of dealing with duplicate data issues.

Data Preprocessing
– Aggregation; sampling; dimensionality reduction; feature subset selection; feature creation; discretization and binarization; attribute transformation.

Aggregation
– Combining two or more attributes (or objects) into a single attribute (or object).
– Purpose:
  ◆ Data reduction: reduce the number of attributes or objects.
  ◆ Change of scale: cities aggregated into regions, states, countries, etc.
  ◆ More "stable" data: aggregated data tends to have less variability.
– Example: variation of precipitation in Australia — the standard deviation of average yearly precipitation is visibly smaller than the standard deviation of average monthly precipitation; the apparent decrease in standard deviation is owing to aggregation.

Sampling
– Sampling is the main technique employed for data selection; it is often used for both the preliminary investigation of the data and the final data analysis.
– Statisticians sample because obtaining the entire set of data of interest is too expensive or time-consuming.
– Sampling is used in data mining because processing the entire set of data of interest is too expensive or time-consuming.

Sampling (continued)
– The key principle for effective sampling: using a sample will work almost as well as using the entire data set if the sample is representative.
– A sample is representative if it has approximately the same property (of interest) as the original set of data.

Types of Sampling
– Simple random sampling: there is an equal probability of selecting any particular item.
– Sampling without replacement: as each item is selected, it is removed from the population.
– Sampling with replacement: objects are not removed from the population as they are selected, so the same object can be picked more than once.
– Stratified sampling: split the data into several partitions, then draw random samples from each partition.

Sample Size
– [Figure: the same data set sampled at 8000, 2000, and 500 points; the structure of the data degrades as the sample shrinks.]
– What sample size is necessary to get at least one object from each of 10 equal-sized groups? [Figure: an idealized set of clusters (groups) from which these points might be drawn.]
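The sample-size question above can be explored empirically. A small simulation sketch, assuming numpy and equally likely groups; the function name and trial count are illustrative.

```python
# A quick simulation: with 10 equal-sized groups, how large must a random
# sample be to include at least one object from every group? (numpy assumed)
import numpy as np

rng = np.random.default_rng(0)

def coverage_probability(sample_size, groups=10, trials=10_000):
    """Estimate P(sample hits all groups) for a uniform random sample."""
    draws = rng.integers(0, groups, size=(trials, sample_size))
    hits = [len(np.unique(row)) == groups for row in draws]
    return np.mean(hits)

for n in (10, 20, 40, 60):
    print(n, coverage_probability(n))   # probability rises steeply with n
```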
Curse of Dimensionality
– When dimensionality increases, data becomes increasingly sparse in the space that it occupies.
– Definitions of density and of distance between points, which are critical for clustering and outlier detection, become less meaningful.
– [Experiment: randomly generate 500 points and compute the difference between the maximum and minimum distance between any pair of points; the contrast shrinks as dimensionality grows.]

Dimensionality Reduction
– Purpose: avoid the curse of dimensionality; reduce the amount of time and memory required by data mining algorithms; allow data to be more easily visualized; may help eliminate irrelevant features or reduce noise.
– Techniques: Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and other supervised and non-linear techniques.

Dimensionality Reduction: PCA
– The goal is to find a projection that captures the largest amount of variation in the data.
– Find the eigenvectors of the covariance matrix; the eigenvectors define the new space.
– Principal Component Analysis is a linear algebra technique for continuous attributes that finds new attributes (principal components) that (1) are linear combinations of the original attributes, (2) are orthogonal (perpendicular) to each other, and (3) capture the maximum amount of variation in the data.
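A minimal numerical sketch of the PCA recipe above — the eigenvectors of the covariance matrix define the new space — assuming numpy; the synthetic 2-D data is illustrative.

```python
# PCA via eigendecomposition of the covariance matrix (numpy assumed).
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])  # correlated data

Xc = X - X.mean(axis=0)                 # center the data
C = np.cov(Xc, rowvar=False)            # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # eigh: for symmetric matrices

order = np.argsort(eigvals)[::-1]       # sort components by explained variance
components = eigvecs[:, order]          # columns = principal components
Z = Xc @ components                     # coordinates in the new orthogonal space

print("explained variance per component:", eigvals[order])
```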
Feature Subset Selection
– Another way to reduce the dimensionality of data.
– Redundant features duplicate much or all of the information contained in one or more other attributes. Example: the purchase price of a product and the amount of sales tax paid.
– Irrelevant features contain no information that is useful for the data mining task at hand. Example: students' IDs are often irrelevant to the task of predicting students' GPAs.

Feature Creation
– Create new attributes that can capture the important information in a data set much more efficiently than the original attributes.
– Three general methodologies: feature extraction (example: extracting edges from images); feature construction (example: dividing mass by volume to get density); mapping data to a new space (example: Fourier and wavelet analysis).

Mapping Data to a New Space
– [Figure: two sine waves, the two sine waves plus noise, and the corresponding frequency-domain representation obtained by the Fourier transform.]

Discretization
– Discretization is the process of converting a continuous attribute into an ordinal attribute: a potentially infinite number of values is mapped into a small number of categories.
– Discretization is commonly used in classification; many classification algorithms work best if both the independent and dependent variables have only a few values.
– The usefulness of discretization is illustrated below using the Iris data set.

Iris Sample Data Set
– The Iris Plant data set can be obtained from the UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
– Originally due to the statistician R. A. Fisher.
– Three flower types (classes): Setosa, Versicolour, Virginica.
– Four (non-class) attributes: sepal width and length, petal width and length.
– (Virginica photo: Robert H. Mohlenbrock, USDA NRCS, 1995, Northeast wetland flora: Field office guide to plant species, Northeast National Technical Center, Chester, PA; courtesy of the USDA NRCS Wetland Science Institute.)

Discretization: Iris Example
– Petal width low or petal length low implies Setosa.
– Petal width medium or petal length medium implies Versicolour.
– Petal width high or petal length high implies Virginica.

Discretization: Iris Example (continued)
– How can we tell what the best discretization is?
– Unsupervised discretization: find breaks in the data values. [Figure: histogram of petal length, counts vs. petal length from 0 to 8.]
– Supervised discretization: use class labels to find breaks.

Discretization Without Using Class Labels
– The data consist of four groups of points and two outliers; the data are one-dimensional, but a random y component is added to reduce overlap.
– [Figures: the original data; the 4 intervals produced by the equal interval width approach; the 4 intervals produced by the equal frequency approach; the 4 intervals produced by the K-means approach.]

Discretization Using Class Labels
– Entropy-based approach. [Figures: discretizing the x and y attributes for four groups (classes) of points, with 3 categories for both x and y and with 5 categories for both x and y.]
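The two unsupervised schemes above (equal interval width and equal frequency) can be sketched directly. numpy assumed; the sample values are illustrative.

```python
# Equal interval width vs. equal frequency discretization, 4 categories each.
import numpy as np

x = np.array([1, 2, 2, 3, 8, 9, 10, 11, 20, 21, 22, 40])

# Equal interval width: 4 bins of identical width over [min, max].
width_edges = np.linspace(x.min(), x.max(), 5)
width_bins = np.digitize(x, width_edges[1:-1])

# Equal frequency: bin edges at the 25th/50th/75th percentiles.
freq_edges = np.percentile(x, [25, 50, 75])
freq_bins = np.digitize(x, freq_edges)

print(width_bins)   # skewed data crowds into the first width-based bin
print(freq_bins)    # each frequency-based bin gets roughly equal counts
```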
Binarization
– Binarization maps a continuous or categorical attribute into one or more binary variables; it is typically used for association analysis.
– Often, a continuous attribute is first converted to a categorical attribute, and the categorical attribute is then converted to a set of binary attributes.
– Association analysis needs asymmetric binary attributes.
– Examples: eye color; height measured as {low, medium, high}.

Attribute Transformation
– An attribute transform is a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values.
– Simple functions: x^k, log(x), e^x, |x|.
– Normalization refers to various techniques for adjusting for differences among attributes in terms of frequency of occurrence, mean, variance, and range; it can take out an unwanted common signal, e.g., seasonality.
– In statistics, standardization refers to subtracting off the mean and dividing by the standard deviation.

Example: Sample Time Series of Plant Growth
– Net Primary Production (NPP) is a measure of plant growth used by ecosystem scientists.
– Correlations between the raw time series:

              Minneapolis  Atlanta   Sao Paolo
Minneapolis    1.0000       0.7591   -0.7581
Atlanta        0.7591       1.0000   -0.5739
Sao Paolo     -0.7581      -0.5739    1.0000

Seasonality Accounts for Much of the Correlation
– Normalized using a monthly Z score: subtract off the monthly mean and divide by the monthly standard deviation.
– Correlations between the normalized time series:

              Minneapolis  Atlanta   Sao Paolo
Minneapolis    1.0000       0.0492    0.0906
Atlanta        0.0492       1.0000   -0.0154
Sao Paolo      0.0906      -0.0154    1.0000

Similarity and Dissimilarity
– Similarity: a numerical measure of how alike two data objects are; higher when objects are more alike; often falls in the range [0, 1].
– Dissimilarity: a numerical measure of how different two data objects are; lower when objects are more alike; the minimum dissimilarity is often 0, while the upper limit varies.
– Proximity refers to either a similarity or a dissimilarity.

Similarity/Dissimilarity for Simple Attributes
– [Table: definitions of similarity and dissimilarity for a single attribute of each type, where p and q are the attribute values for two data objects.]

Similarity/Dissimilarity Transformation Examples
– For dissimilarity values d = 0, 1, 10, 100 (with min = 0 and max = 100), the standard transformations give:
  ◆ s = 1/(1 + d): similarity values 1, 0.5, 0.09, 0.01, respectively.
  ◆ s = 1 − (d − min)/(max − min): similarity values 1.00, 0.99, 0.90, 0.00, respectively.
  ◆ s = e^(−d): similarity values 1.00, 0.37, 0.00, 0.00, respectively.
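A quick sketch, assuming numpy, that reproduces the three transformation examples for d = 0, 1, 10, 100.

```python
# Dissimilarity-to-similarity transformations from the examples above.
import numpy as np

d = np.array([0.0, 1.0, 10.0, 100.0])

print(1 / (1 + d))                              # [1.   0.5  0.09 0.01]
print(1 - (d - d.min()) / (d.max() - d.min()))  # [1.   0.99 0.9  0.  ]
print(np.exp(-d))                               # [1.   0.37 0.   0.  ] (rounded)
```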
Euclidean Distance

$$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q. Standardization is necessary if scales differ.

Euclidean Distance: Example
– Points: p1 = (0, 2), p2 = (2, 0), p3 = (3, 1), p4 = (5, 1).
– Distance matrix:

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Minkowski Distance
– A generalization of Euclidean distance:

$$\mathrm{dist}(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$$

where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Minkowski Distance: Examples
– r = 1: city block (Manhattan, taxicab, L1 norm) distance. A common example is the Hamming distance, which is just the number of bits that differ between two binary vectors.
– r = 2: Euclidean distance (L2 norm).
– r → ∞: "supremum" (Lmax norm, L∞ norm) distance; this is the maximum difference between any components of the vectors.
– Do not confuse r with n: all of these distances are defined for all numbers of dimensions.

Minkowski distance matrices for the four points above:

L1:
      p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2: the Euclidean distance matrix shown above.

L∞:
      p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0

Mahalanobis Distance

$$\mathrm{mahalanobis}(p, q) = (p - q)\, \Sigma^{-1} (p - q)^{T}$$

where Σ is the covariance matrix of the input data. [Figure: for the red points shown, the Euclidean distance is 14.7 while the Mahalanobis distance is 6.]

Mahalanobis Distance: Example
– Covariance matrix: Σ = [[0.3, 0.2], [0.2, 0.3]].
– A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5).
– Mahal(A, B) = 5; Mahal(A, C) = 4.
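A sketch, assuming numpy, that reproduces the L1/L2/L∞ distance matrices for the four example points and checks the Mahalanobis example.

```python
# Minkowski distance matrices and a Mahalanobis check (numpy assumed).
import numpy as np

P = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])   # p1..p4

def minkowski_matrix(points, r):
    diff = np.abs(points[:, None, :] - points[None, :, :])
    if np.isinf(r):
        return diff.max(axis=-1)                  # supremum (L-max) distance
    return (diff ** r).sum(axis=-1) ** (1.0 / r)

print(minkowski_matrix(P, 1))        # city block / Manhattan
print(minkowski_matrix(P, 2))        # Euclidean
print(minkowski_matrix(P, np.inf))   # supremum

Sigma_inv = np.linalg.inv(np.array([[0.3, 0.2], [0.2, 0.3]]))
A, B = np.array([0.5, 0.5]), np.array([0.0, 1.0])
print((A - B) @ Sigma_inv @ (A - B))  # ~5.0, matching Mahal(A, B) above
```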
Common Properties of a Distance (Dissimilarity)
– Distances, such as the Euclidean distance, have some well-known properties:
  1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q (positive definiteness).
  2. d(p, q) = d(q, p) for all p and q (symmetry).
  3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r (triangle inequality).
– Here d(p, q) is the distance (dissimilarity) between points (data objects) p and q. A distance that satisfies these properties is a metric.

Common Properties of a Similarity
– Similarities also have some well-known properties:
  1. s(p, q) = 1 (or the maximum similarity) only if p = q.
  2. s(p, q) = s(q, p) for all p and q (symmetry).
– Here s(p, q) is the similarity between points (data objects) p and q.

Similarity Between Binary Vectors
– A common situation is that objects p and q have only binary attributes.
– Compute similarities using the following quantities:
  M01 = the number of attributes where p was 0 and q was 1
  M10 = the number of attributes where p was 1 and q was 0
  M00 = the number of attributes where p was 0 and q was 0
  M11 = the number of attributes where p was 1 and q was 1
– Simple Matching Coefficient: SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00).
– Jaccard Coefficient: J = number of 11 matches / number of not-both-zero attribute values = M11 / (M01 + M10 + M11).

SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2, M10 = 1, M00 = 7, M11 = 0
SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0

Cosine Similarity
– If d1 and d2 are two document vectors, then cos(d1, d2) = ⟨d1, d2⟩ / (‖d1‖ ‖d2‖), where ⟨d1, d2⟩ is the inner (dot) product of d1 and d2 and ‖d‖ is the length of vector d.
– Example:
  d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
  d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
  ⟨d1, d2⟩ = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
  ‖d1‖ = (9 + 4 + 0 + 25 + 0 + 0 + 0 + 4 + 0 + 0)^0.5 = √42 = 6.481
  ‖d2‖ = (1 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 0 + 4)^0.5 = √6 = 2.449
  cos(d1, d2) = 5 / (6.481 · 2.449) = 0.3150

Extended Jaccard Coefficient (Tanimoto)
– A variation of Jaccard for continuous or count attributes; it reduces to Jaccard for binary attributes.

Correlation
– Correlation measures the linear relationship between objects.
– To compute correlation, we standardize the data objects p and q and then take their dot product:

$$p'_k = (p_k - \mathrm{mean}(p)) / \mathrm{std}(p), \qquad q'_k = (q_k - \mathrm{mean}(q)) / \mathrm{std}(q), \qquad \mathrm{correlation}(p, q) = p' \cdot q'$$

Visually Evaluating Correlation
– [Figure: scatter plots showing correlations from −1 to 1.] If the correlation is 0, there is no linear relationship between the attributes.

Drawback of Correlation
– x = (−3, −2, −1, 0, 1, 2, 3), y = (9, 4, 1, 0, 1, 4, 9), so y_i = x_i².
– mean(x) = 0, mean(y) = 4; std(x) = 2.16, std(y) = 3.74.
– corr(x, y) = ((−3)(5) + (−2)(0) + (−1)(−3) + (0)(−4) + (1)(−3) + (2)(0) + (3)(5)) / (6 · 2.16 · 3.74) = 0
– If the correlation is 0, there is no linear relationship between the attributes of the two data objects; however, non-linear relationships may still exist, as in this example.
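A sketch, assuming numpy, that reproduces the SMC/Jaccard, cosine, and correlation-drawback examples above.

```python
# Binary similarity, cosine similarity, and the correlation drawback.
import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])
m11 = np.sum((p == 1) & (q == 1))
m00 = np.sum((p == 0) & (q == 0))
print((m11 + m00) / len(p))          # SMC = 0.7
print(m11 / (len(p) - m00))          # Jaccard = 0.0

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])
print(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))  # cos ~ 0.3150

x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2
print(np.corrcoef(x, y)[0, 1])       # ~0.0: no linear relationship, yet y = x^2
```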
General Approach for Combining Similarities
– Sometimes attributes are of many different types, but an overall similarity is needed.

Using Weights to Combine Similarities
– We may not want to treat all attributes the same; use weights w_k that are between 0 and 1 and sum to 1.

Density
– Density measures the degree to which data objects are close to each other in a specified area; the notion of density is closely related to that of proximity.
– The concept of density is typically used for clustering and anomaly detection.
– Examples: Euclidean density (the number of points per unit volume); probability density (estimate what the distribution of the data looks like); graph-based density (connectivity).

Euclidean Density: Grid-based Approach
– The simplest approach is to divide the region into a number of rectangular cells of equal volume and define density as the number of points each cell contains. [Figures: grid-based density; counts for each cell.]

Euclidean Density: Center-based Approach
– Euclidean density is the number of points within a specified radius of the point. [Figure: illustration of center-based density.]
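A minimal sketch of center-based Euclidean density, assuming numpy; the uniform random data and the radius are illustrative.

```python
# Center-based density: the density of a point is the number of points
# within a specified radius of it (numpy assumed).
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 2))     # 300 random 2-D points

def center_based_density(points, radius):
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return (d <= radius).sum(axis=1) - 1  # exclude the point itself

dens = center_based_density(X, radius=1.0)
print(dens.min(), dens.mean(), dens.max())
```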
Data Mining: Exploring Data
Lecture Notes for Chapter 3, Introduction to Data Mining, by Tan, Steinbach, Kumar.

What is Data Exploration?
– A preliminary exploration of the data to better understand its characteristics.
– Key motivations: helping to select the right tool for preprocessing or analysis; making use of humans' ability to recognize patterns — people can recognize patterns not captured by data analysis tools.
– Related to the area of Exploratory Data Analysis (EDA), created by the statistician John Tukey; the seminal book is Exploratory Data Analysis by Tukey. A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook: http://www.itl.nist.gov/div898/handbook/index.htm

Techniques Used in Data Exploration
– In EDA, as originally defined by Tukey, the focus was on visualization, and clustering and anomaly detection were viewed as exploratory techniques. In data mining, clustering and anomaly detection are major areas of interest in their own right, not thought of as merely exploratory.
– In our discussion of data exploration, we focus on summary statistics, visualization, and Online Analytical Processing (OLAP).

Iris Sample Data Set
– Many of the exploratory techniques are illustrated with the Iris Plant data set described earlier (UCI Machine Learning Repository; due to the statistician R. A. Fisher; three classes — Setosa, Versicolour, Virginica; four non-class attributes — sepal width and length, petal width and length).

Summary Statistics
– Summary statistics are numbers that summarize properties of the data; the summarized properties include frequency, location, and spread. Examples: location — the mean; spread — the standard deviation.
– Most summary statistics can be calculated in a single pass through the data.

Frequency and Mode
– The frequency of an attribute value is the percentage of times the value occurs in the data set. For example, given the attribute 'gender' and a representative population of people, the value 'female' occurs about 50% of the time.
– The mode of an attribute is the most frequent attribute value.
– The notions of frequency and mode are typically used with categorical data.

Percentiles
– For continuous data, the notion of a percentile is more useful: given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value x_p of x such that p% of the observed values of x are less than x_p.
– For instance, the 50th percentile is the value x_50% such that 50% of all values of x are less than x_50%.

Measures of Location: Mean and Median
– The mean is the most common measure of the location of a set of points; however, it is very sensitive to outliers, so the median or a trimmed mean is also commonly used.
– Trimmed mean: a percentage p between 0 and 100 is specified; the top and bottom (p/2)% of the data are thrown out, and the mean is then calculated in the normal way.
– Example: consider the set of values {1, 2, 3, 4, 5, 90}. Mean = 17.5, median = 3.5, trimmed mean (p = 40%) = 3.5.

Measures of Spread: Range and Variance
– The range is the difference between the max and the min.
– The variance (or standard deviation) is the most common measure of the spread of a set of points; however, it too is sensitive to outliers, so other measures are often used: the absolute average deviation (AAD), the median absolute deviation (MAD), and the interquartile range (IQR).
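A small sketch, assuming numpy, that checks the {1, 2, 3, 4, 5, 90} example; the trimmed_mean helper is mine, not a library function.

```python
# Location and spread measures for the example above (numpy assumed).
import numpy as np

x = np.array([1, 2, 3, 4, 5, 90])

def trimmed_mean(values, p):
    """Drop the top and bottom (p/2)% of the sorted values, then average."""
    v = np.sort(values)
    k = int(len(v) * (p / 100) / 2)   # number of values cut at each end
    return v[k:len(v) - k].mean()

print(x.mean())                       # 17.5 (pulled up by the outlier 90)
print(np.median(x))                   # 3.5
print(trimmed_mean(x, 40))            # 3.5  (outlier removed)
print(np.percentile(x, 75) - np.percentile(x, 25))   # interquartile range
```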
Visualization
– Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
– Visualization of data is one of the most powerful and appealing techniques for data exploration: humans have a well-developed ability to analyze large amounts of information presented visually, and can detect general patterns and trends as well as outliers and unusual patterns.

Example: Sea Surface Temperature
– [Figure: Sea Surface Temperature (SST) for July 1982; tens of thousands of data points are summarized in a single figure.]

Representation
– Representation is the mapping of information to a visual format: data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.
– Example: objects are often represented as points; their attribute values can be represented by the position of the points or by characteristics of the points such as color, size, and shape.
– If position is used, then the relationships among points — whether they form groups or whether a point is an outlier — are easily perceived.

Arrangement
– Arrangement is the placement of visual elements within a display; it can make a large difference in how easy it is to understand the data.
– Example: a table of nine objects (rows) with six binary attributes (columns), permuted so that the relationships among the rows and columns become clear.

Selection
– Selection is the elimination or de-emphasis of certain objects and attributes.
– Selection may involve choosing a subset of attributes: dimensionality reduction is often used to reduce the number of dimensions to two or three; alternatively, pairs of attributes can be considered.
– Selection may also involve choosing a subset of objects: a region of the screen can only show so many points, so we can sample — but we want to preserve points in sparse areas.

Visualization Techniques: Histograms
– A histogram usually shows the distribution of values of a single variable: divide the values into bins and show a bar plot of the number of objects in each bin; the height of each bar indicates the number of objects.
– The shape of the histogram depends on the number of bins. Example: petal width with 10 and 20 bins, respectively. (A plotting sketch follows below.)

Two-Dimensional Histograms
– Show the joint distribution of the values of two attributes. Example: petal width and petal length.
– While two-dimensional histograms can be used to discover interesting facts about how the values of two attributes co-occur, they are visually more complicated. Here, most of the flowers fall into only three of the bins — those along the diagonal; it is not possible to see this by looking at the one-dimensional distributions.

Pie Chart
– A pie chart is similar to a histogram, but it is typically used with categorical attributes that have a relatively small number of values.
– It uses the relative area of a circle to indicate relative frequency.
– Pie charts are used less frequently in technical publications because the sizes of relative areas can be hard to judge. [Example: all three Iris flower types have the same frequency.]
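A minimal plotting sketch for the histogram slides above, assuming matplotlib and scikit-learn (the latter only for convenient access to the Iris data); any petal-width column would do.

```python
# Petal-width histograms with 10 and 20 bins, as in the example above.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

petal_width = load_iris().data[:, 3]    # 4th column of the Iris data

fig, axes = plt.subplots(1, 2)
axes[0].hist(petal_width, bins=10)      # coarse view of the distribution
axes[1].hist(petal_width, bins=20)      # finer bins change the apparent shape
axes[0].set_xlabel("petal width (10 bins)")
axes[1].set_xlabel("petal width (20 bins)")
plt.show()
```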
Visualization Techniques: Box Plots
– Box plots were invented by J. Tukey and are another way of displaying the distribution of data.
– [Figure: the basic parts of a box plot — outliers plotted individually; the 90th and 10th percentiles at the whiskers; the 75th and 25th percentiles at the top and bottom of the box; the line inside the box indicating the value of the 50th percentile.]

Examples of Box Plots
– Box plots can be used to compare attributes (box plot of the Iris attributes).
– Box plots can also be used to compare how attributes vary between different classes of objects (box plots of attributes by Iris species).

Visualization Techniques: Scatter Plots
– Attribute values determine the position. Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible.
– Additional attributes can often be displayed by using the size, shape, and color of the markers that represent the objects.
– Arrays of scatter plots can compactly summarize the relationships of several pairs of attributes (see the scatter plot array of Iris attributes).

Visualization Techniques: Contour Plots
– For some three-dimensional data, two attributes specify a position in a plane while the third has a continuous value, such as temperature or elevation.
– Contour plots are useful when a continuous attribute is measured on a spatial grid: they partition the plane into separate regions where the values of the third attribute (temperature, elevation) are roughly the same.
– The most common example is contour maps of the elevation of land locations; contour plots can also display temperature, rainfall, air pressure, etc. [Example: contour plot of SST for December 1998, in Celsius.]

Visualization Techniques: Matrix Plots
– A data matrix can be visualized as an image by associating each entry of the data matrix with a pixel in the image.
– If class labels are known, objects are sorted according to class so that all objects of a class are together.
– Typically, the attributes are normalized to prevent one attribute from dominating the plot; if different attributes have different ranges, the attributes are often standardized to have a mean of zero and a standard deviation of 1.
– Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects.

Visualization of the Iris Data Matrix
– The first 50 rows represent Iris flowers of the species Setosa, the next 50 Versicolour, and the last 50 Virginica.
– The Setosa flowers have petal width and length well below the average, the Versicolour flowers have petal width and length around the average, and the Virginica flowers have petal width and length above the average.
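A minimal matrix-plot sketch along the lines described above, assuming matplotlib and scikit-learn; the colormap choice is mine.

```python
# Standardize the columns of the Iris data matrix and show it as an image;
# the rows are already sorted by class (50 per species).
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

X = load_iris().data                        # 150 x 4, rows sorted by species
Z = (X - X.mean(axis=0)) / X.std(axis=0)    # columns: mean 0, std dev 1

plt.imshow(Z, aspect="auto", cmap="coolwarm")
plt.colorbar(label="standard deviations")
plt.xlabel("attribute")
plt.ylabel("flower (50 Setosa, 50 Versicolour, 50 Virginica)")
plt.show()
```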
– [Figure: plot of the Iris data matrix, with the columns standardized to have a mean of 0 and a standard deviation of 1; the color scale is in standard deviations.]

Visualization of the Iris Correlation Matrix
– It can also be useful to look for structure in the plot of a proximity matrix for a set of data objects.
– Again, it is useful to sort the rows and columns of the similarity matrix (when class labels are known) so that all the objects of a class are together.
– [Figure: plot of the Iris correlation matrix.] The flowers in each group are most similar to each other, but Versicolour and Virginica are more similar to one another than to Setosa. This allows a visual evaluation of the cohesiveness of each class and of its separation from the other classes.

Visualization Techniques: Parallel Coordinates
– Parallel coordinates are used to plot the attribute values of high-dimensional data.
– Instead of using perpendicular axes, a set of parallel axes is used: the attribute values of each object are plotted as a point on each corresponding coordinate axis, and the points are connected by a line. Thus, each object is represented as a line.
– Often, the lines representing a distinct class of objects group together, at least for some attributes; the ordering of the attributes is important in seeing such groupings. [Figure: parallel coordinates plots for the Iris data.]

Other Visualization Techniques: Star Plots
– A similar approach to parallel coordinates, but the axes radiate from a central point; this technique uses one axis for each attribute.
– Typically, all the attribute values are mapped to the range [0, 1].
– The line connecting the values of an object forms a polygon. [Figure: star coordinates graph of the 150th flower of the Iris data set.]