2_Ch2_mmk.pdf
KAU
Data Mining and Warehousing, CPIT 440
Chapter 2: Getting to Know Your Data

Outline
- Data Exploration
- Data Objects and Attribute Types
- Basic Statistical Descriptions of Data
- Data Visualization
- Measuring Data Similarity and Dissimilarity
- Summary

CPIT440: Data Mining and Warehousing

Data Objects
- Data sets are made up of data objects.
- A data object represents an entity.
- Examples:
  - Sales database: customers, store items, sales
  - Medical database: patients, treatments
  - University database: students, professors, courses
- Database rows -> data objects; columns -> attributes.
- Data objects are also called objects, data points, examples, instances, records, samples, or tuples.
- Data objects are described by attributes.

Example
Columns are also called features, attributes, dimensions, variables, or covariates. Rows are also called objects, examples, instances, records, samples, or tuples.

Row   Age   Gender   Annual income (SAR)   Credit card offer
1     65    M        150,000               No
2     35    F        180,000               Yes
3     55    M        120,000               Yes
4     37    M        90,000                No
5     25    F        125,000               No

Attributes
- Attribute (also dimension, feature, variable): a data field representing a characteristic or feature of a data object.
- Examples: age, name, gender, address
- The term dimension is commonly used in data warehousing. Machine learning literature tends to use the term feature, while statisticians prefer the term variable.
Data mining and database professionals commonly use the term attribute.
- The observed values for a given attribute are known as observations.
- A set of attributes used to describe a given object is called an attribute vector (or feature vector).

Attribute Types
Attributes
  Quantitative (numeric)
    Interval
    Ratio
  Qualitative (categorical)
    Nominal
    Ordinal
    Binary
      Symmetric
      Asymmetric

Categorical Attributes
- Nominal: categories, states, names of things
  - color = {black, blond, brown, grey, red, white}
  - marital status = {single, married, divorced, widowed}
- Binary
  - Nominal attribute with only 2 states (0 and 1)
  - Symmetric binary: both outcomes equally important
    - Gender = {male, female}
  - Asymmetric binary: outcomes not equally important
    - medical test = {positive, negative}
- Ordinal
  - Values provide enough information to order (rank) objects.
  - Size = {small, medium, large}
  - Grades = {A+, A, B+, B, C+, C, D, F}

Numeric Attributes
- Quantity (integer or real-valued)
- Interval-scaled
  - Measured on a scale of equal-sized units
  - Values have order
  - Examples: temperature, calendar dates, intelligence score, blood pressure
  - No true zero point
  - Cannot say that one value is a multiple or ratio of another value
- Ratio-scaled
  - Numeric attribute with an inherent zero point
  - We can speak of a value as being a multiple of another value (10 K is twice as high as 5 K).
  - Examples: length, counts, monetary quantities, temperature in kelvin
  - Values run from 0 up to a maximum value

Interval-scaled vs Ratio-scaled
- The difference between interval and ratio scales is whether the scale has a true zero, which shows up in whether values can dip below zero.
- Interval scales have no true zero and can represent values below zero.
- For example, you can measure temperature below 0 degrees Celsius, such as -10 degrees.
- Ratio scales, on the other hand, never fall below zero.
Height and weight are measured from 0 upward and never fall below it.
- For example, imagine we have two apples: one has a mass of 100 grams and the other has a mass of 200 grams. Unlike on an interval scale, it makes perfect sense to say that a 100-gram apple has half the mass of a 200-gram apple. This is because zero grams on this scale represents a natural minimum quantity (i.e., no mass at all). So 200 grams is twice as much mass as 100 grams.

Discrete vs. Continuous Attributes
- Discrete attribute
  - Has only a finite or countably infinite set of values
  - Examples: age (0 to 110), zip codes, counts, or the set of words in a collection of documents. Customer ID is countably infinite.
  - May or may not be represented as integer variables. Color is discrete but not numeric.
  - Note: binary attributes are a special case of discrete attributes.
- Continuous attribute
  - Has real numbers as attribute values
  - Examples: temperature, height, or weight
  - In practice, real values can only be measured and represented using a finite number of digits.
  - Continuous attributes are typically represented as floating-point variables.

The data analysis pipeline
- Mining is not the only step in the analysis process:
  Data -> Preprocessing -> Data Mining -> Post-processing -> Result
- Preprocessing: real data is noisy, incomplete, and inconsistent. Data cleaning is required to make sense of the data.
  - Techniques: sampling, dimensionality reduction, feature selection.
  - It is dirty work, but it is often the most important step of the analysis.
- Post-processing: make the data actionable and useful to the user.
  - Statistical analysis of importance
  - Visualization
- Pre- and post-processing are often data mining tasks as well.

Basic Statistical Descriptions of Data
To better understand data:
1. Measures of central tendency
  - Measure the location of the middle or center of a data distribution.
  - Mean, median, mode, and midrange.
2. Measures of data dispersion
  - Describe how the data are spread out.
  - Range, quartiles, and interquartile range; the five-number summary and boxplots; and the variance and standard deviation of the data.
3. Graphic displays

1.1 Central Tendency: Mean
- A sample is a representative group drawn from the population (sample mean: $\bar{x}$, population mean: $\mu$).
- Average (arithmetic mean) of a sample, where n is the sample size:
  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Example: {1, 3, 5, 7, 10}
  $\bar{x} = (1 + 3 + 5 + 7 + 10) / 5 = 5.2$
- Population mean, where N is the population size:
  $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$

1.1 Central Tendency: Mean (weighted)
- Weighted arithmetic mean:
  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
- Example 1: {1, 3, 5, 7, 10} with equal weights
  $\bar{x} = 1(1/5) + 3(1/5) + 5(1/5) + 7(1/5) + 10(1/5) = 5.2$
- Example 2: three exams {80, 85, 95}; the last is easier, so the weights are 40%, 40%, and 20%
  $\bar{x} = 80(40/100) + 85(40/100) + 95(20/100) = 32 + 34 + 19 = 85$

1.1 Central Tendency: Mean (trimmed)
- Trimmed mean:
  - A major problem with the mean is its sensitivity to extreme (e.g., outlier) values.
  - Sort the values and remove a percentage at the high and low extremes (e.g., 2%).
  - Avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable information.

1.2 Central Tendency: Median
- Median: the middle observation of a set of data arranged in ascending or descending order.
- The rank of the median is $(n + 1) / 2$, where n is the total number of observations.
- The rank is the position of the median in the ordered list of n numbers.
- When the data is skewed (asymmetric), the median is a better measure of the center than the mean.
- The median generally applies to numeric data.

1.2 Central Tendency: Median (example)
- Example: {13, 13, 13, 13, 14, 14, 16, 18, 21}
  rank = (9 + 1) / 2 = 5, median = 14
- When n is odd, the median is the middle observation.
- When n is even, the median is the average (midpoint) of the two middle observations.

1.3 Central Tendency: Mode
- Mode: the value that occurs most frequently in the data.
- The greatest frequency may correspond to several different values, which results in more than one mode:
  - One mode: unimodal
  - Two modes: bimodal
  - Three modes: trimodal
- A data set with two or more modes is multimodal.
- If each data value occurs only once, there is no mode.

1.3 Central Tendency: Mode (continued)
- Empirical formula: mean - mode ≈ 3 x (mean - median)
- This implies that the mode of a unimodal, moderately skewed frequency curve can easily be approximated if the mean and median are known.
- Example: {13, 13, 13, 13, 14, 14, 16, 18, 21}, mode = 13
- The mode can be calculated for both quantitative and qualitative data.

1.4 Central Tendency: Midrange
- The midrange is the average of the largest and smallest values in the set.
- Example: {13, 13, 13, 13, 14, 14, 16, 18, 21}, midrange = (21 + 13) / 2 = 17

Symmetric vs. Skewed Data
- Median, mean, and mode of symmetric, positively skewed, and negatively skewed data.
[Figure: mean, median, and mode for symmetric, positively skewed, and negatively skewed data]
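The central-tendency measures above can be sketched in a few lines of Python. This is a minimal illustration rather than the course's reference code: the helper names `weighted_mean`, `trimmed_mean`, and `midrange` are made up for this example, and `trimmed_mean` assumes the trimming fraction applies to each end separately.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the sorted values at each end (assumed convention)."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values cut from each end
    return mean(ordered[k:len(ordered) - k])


def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


data = [13, 13, 13, 13, 14, 14, 16, 18, 21]
print(median(data))     # 14
print(multimode(data))  # [13]
print(midrange(data))   # 17.0
print(weighted_mean([80, 85, 95], [40, 40, 20]))  # 85.0
```

`multimode` returns a list, so a bimodal or trimodal data set shows up as a list with two or three entries.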
Measuring the Dispersion of Data
- A sample is a representative group drawn from the population (sample standard deviation: s, population standard deviation: σ).
- Variance of a sample:
  $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
- Variance of a population:
  $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
- Standard deviation: the square root of the variance.

Measuring the Dispersion of Data: Range
- Range: the difference between the largest and smallest values.
- Example: {4, 6, 9, 3, 7}
  - The lowest value is 3 and the highest is 9.
  - The range is 9 - 3 = 6.

Measuring the Dispersion of Data: Percentiles
- Percentiles split an ordered set of data into 100 equal-sized consecutive sets.
- Percentile: the value below which a given percentage of the data falls.

Measuring the Dispersion of Data: Nearest-rank method
- Nearest-rank method to find percentiles:
  - Rank = n * (P / 100)
  - P = (Rank / n) * 100
  where n is the number of values, P is the percentile, and Rank is the ordinal rank of the intended value in the ordered list.

- Example: find the 50th percentile of {45, 50, 61, 75, 90, 95, 99}.
  1. Rank = n * (P / 100) = 7 * 50 / 100 = 3.5
  2. Round to the nearest rank: 3.5 ≈ 4
  3. The 4th value is 75.
  4. About 50% of the students have a score below 75.
- The 50th percentile is the median.

- Example: assume the scores {45, 50, 61, 75, 90, 95, 99}. Find the percentile for a score of 90.
  1. Rank of 90 = 5, n = 7
  2. P = (Rank / n) * 100 = (5 / 7) * 100 = 71.4%
  3. 71.4% of the scores are at or below 90.

Measuring the Dispersion of Data: Quantiles
- Quantiles: data points that split the data distribution into equal-sized consecutive sets.
- A quantile is exactly like a percentile but expressed as a decimal instead.
The 50th percentile is the 0.5 quantile.
- The 2-quantile: the median.
- The 4-quantiles (quartiles): the three data points that split the data distribution into four equal parts; each part represents one-fourth of the data distribution.

How to find a quantile: steps
Example: data {45, 50, 61, 75, 90, 95, 99}
- Step 1: Order the data from smallest to largest. -> already ordered
- Step 2: Count how many observations are in the data set. -> 7
- Step 3: Convert the percentage to a decimal q. We are looking for the number below which 50 percent of the values fall, so q = 0.5.
- Step 4: Insert the values into the formula:
  i = q (n + 1) = 0.5 * (7 + 1) = 4
- Answer: i = 4 (round if needed).
- The 4th number in the set is 75, which is the number below which 50 percent of the values fall.

Measuring the Dispersion of Data: Quartiles
- Quartiles give an indication of a distribution's center, spread, and shape.
- The first quartile, Q1 (25th percentile)
- The second quartile, Q2 (50th percentile), is the median
- The third quartile, Q3 (75th percentile)

Measuring the Dispersion of Data: Interquartile range
- Interquartile range: the distance between the first and third quartiles.
  IQR = Q3 - Q1
- Example: {45, 50, 61, 75, 90, 95, 99}
  1. Rank of Q1 = n * 25 / 100 = 7 * 25 / 100 = 1.75 ≈ 2 -> Q1 = 50
  2. Rank of Q3 = n * 75 / 100 = 7 * 75 / 100 = 5.25 ≈ 5 -> Q3 = 90
IQR = Q3 - Q1 = 90 - 50 = 40

Measuring the Dispersion of Data: Five-number summary
- Five-number summary: min, Q1, median, Q3, max
- Outlier: usually a value more than 1.5 x IQR beyond the quartiles, that is,
  - higher than Q3 + 1.5 * IQR, or
  - lower than Q1 - 1.5 * IQR
- Better for skewed distributions

Outlier
- Use the following data set to find the outliers: {5, 45, 50, 61, 75, 90, 95, 160}
- Rank of Q1 = n * (P / 100) = 8 * 25 / 100 = 2 -> Q1 = 45
- Rank of Q3 = n * (P / 100) = 8 * 75 / 100 = 6 -> Q3 = 90
- IQR = Q3 - Q1 = 90 - 45 = 45, so 1.5 * IQR = 67.5
- Outliers are values:
  - higher than Q3 + 1.5 * IQR = 90 + 67.5 = 157.5
  - lower than Q1 - 1.5 * IQR = 45 - 67.5 = -22.5
- 160 is an outlier. 5 is not, because it lies above the lower bound of -22.5.

Statistical processing allowed based on attribute type

Offers                                           Nominal   Ordinal   Interval   Ratio
The sequence of variables is established         -         Yes       Yes        Yes
Mode                                             Yes       Yes       Yes        Yes
Median                                           -         Yes       Yes        Yes
Mean                                             -         -         Yes        Yes
Difference between variables can be evaluated    -         -         Yes        Yes
Addition and subtraction of variables            -         -         Yes        Yes
Multiplication and division of variables         -         -         -          Yes
Absolute zero                                    -         -         -          Yes

Discussion
- Which measure is sensitive to outliers? What can we use instead?
  - The mean
  - The median or the trimmed mean
- Can we have no mode? When?
  - Yes, when each data point appears only once
- Which measure is better for describing skewed distributions? Why?
  - The five-number summary
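The nearest-rank percentile and the 1.5 x IQR outlier rule can be sketched as follows. This is an illustrative sketch, not the course's code: the function names are invented here, and the rank is rounded half up because that is how the worked examples round (3.5 -> 4, 1.75 -> 2, 5.25 -> 5). Many textbooks define the nearest rank with a ceiling instead, which would give different quartiles for some data sets.

```python
import math


def rank_for_percentile(n, p):
    """Rank = n * p / 100, rounded half up (assumed rounding, matching the examples)."""
    return max(1, math.floor(n * p / 100 + 0.5))


def percentile(sorted_values, p):
    """Value at the nearest rank for percentile p in an already sorted list."""
    return sorted_values[rank_for_percentile(len(sorted_values), p) - 1]


def iqr_outliers(values):
    """Return Q1, Q3, both fences, and the values lying outside the fences."""
    ordered = sorted(values)
    q1, q3 = percentile(ordered, 25), percentile(ordered, 75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return q1, q3, low, high, [v for v in ordered if v < low or v > high]


scores = [45, 50, 61, 75, 90, 95, 99]
print(percentile(scores, 50))  # 75
print(iqr_outliers([5, 45, 50, 61, 75, 90, 95, 160]))
# (45, 90, -22.5, 157.5, [160])
```

With Q1 = 45 and IQR = 45, the lower fence is 45 - 67.5 = -22.5, so the value 5 stays inside the fences and only 160 is flagged.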
Graphic displays

Graphic Displays of Basic Statistical Descriptions
- This is truer in data.
- Graphical displays can show information and patterns much better than tables.
- These graphs are helpful for the visual inspection of data, which is useful for data preprocessing.

Graphic Displays of Basic Statistical Descriptions (overview)
1. Boxplot: graphic display of the five-number summary.
2. Histogram: the x-axis shows values, the y-axis shows frequencies.
3. Scatter plot: each pair of values is a pair of coordinates, plotted as a point in the plane.
4. Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i$% of the data are ≤ $x_i$.
5. Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another.

1. Boxplot Analysis
- A box-and-whisker plot, also called a boxplot, displays the five-number summary of a set of data:
  - the minimum, 1st quartile, median, 3rd quartile, and maximum.
- The ends of the box are the quartiles; the median is marked; whiskers are added, and outliers are plotted individually.

Boxplot Analysis
- A boxplot is a good way to show the spread of the data.
- The data is represented with a box.
- The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR.
- The median is marked by a line within the box.
- Whiskers: two lines outside the box, extended to the minimum and maximum only if these values are less than 1.5 x IQR beyond the quartiles.
- Otherwise, the whiskers terminate at the most extreme observations occurring within 1.5 x IQR of the quartiles.
The remaining cases are plotted individually as outliers.
- The box contains 50% of the data.

Boxplot Analysis
[Figure: boxplot annotated with outliers, largest value, Q3, median, Q1, smallest value, and the IQR]

Example: Finding the five-number summary
A sample of 10 boxes of raisins has these weights (in grams):
25, 28, 29, 29, 30, 34, 35, 35, 37, 38
Make a boxplot of the data.

Example: solution
1. Prepare the numeric data
- Step 1: Order the data from smallest to largest.
  - Our data is already in order.
- Step 2: Find the median.
  - The median is the mean of the middle two numbers, 30 and 34:
    25, 28, 29, 29, 30, 34, 35, 35, 37, 38
  - The median is 32.
- Step 3: Find the quartiles.
  - The first quartile is the median of the data points to the left of the median.
    Rank of Q1 = n * (P / 100) = 10 * (25 / 100) = 2.5 ≈ 3
    25, 28, 29, 29, 30, 34, 35, 35, 37, 38 -> Q1 = 29
  - The third quartile is the median of the data points to the right of the median.
    Rank of Q3 = n * (P / 100) = 10 * (75 / 100) = 7.5 ≈ 8
    25, 28, 29, 29, 30, 34, 35, 35, 37, 38 -> Q3 = 35
- Step 4: Complete the five-number summary by finding the min and the max.
  - The min is the smallest data point, which is 25.
  - The max is the largest data point, which is 38.
  - The five-number summary (min, Q1, median, Q3, max) is 25, 29, 32, 35, 38.

2. Make a boxplot for the same data set
- Step 1: Scale and label an axis that fits the five-number summary.
- Step 2: Draw a box from Q1 to Q3, with a vertical line through the median. Recall that Q1 = 29, median = 32, and Q3 = 35.
- Step 3: Draw a whisker from Q1 to the min and from Q3 to the max. Recall that the min is 25 and the max is 38.
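The raisin example can be reproduced with a short sketch that follows the "median of each half" description of the quartiles. The function names are hypothetical, the code assumes at least two data points, and for odd-sized data it leaves the median out of both halves, which is one common convention among several.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), with quartiles as medians of the two halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]   # points to the left of the median
    upper = ordered[-half:]  # points to the right of the median
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


def whisker_ends(values):
    """Most extreme observations within 1.5 * IQR of the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    inside = [v for v in values if q1 - 1.5 * iqr <= v <= q3 + 1.5 * iqr]
    return min(inside), max(inside)


weights = [25, 28, 29, 29, 30, 34, 35, 35, 37, 38]
print(five_number_summary(weights))  # (25, 29, 32.0, 35, 38)
print(whisker_ends(weights))         # (25, 38)
```

Because no weight lies more than 1.5 x IQR = 9 grams beyond the quartiles, the whiskers reach the minimum and maximum and no outliers are drawn.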

2. Histogram Analysis
- The basic display for a single numerical variable is the histogram.
- It shows the frequency of attribute values.
- If the attribute X is nominal, a bar is drawn for each known value of X.
- If X is numeric, the range of values for X is partitioned into disjoint consecutive subranges (known as buckets or bins).
- The bins have equal width.
- For each subrange, a bar is drawn with a height that represents the total count of items observed within the subrange.

Histogram Analysis
[Figure: histogram; x-axis: cases, y-axis: frequencies]

Histogram Analysis
- The two histograms shown may have the same boxplot representation,
  - with the same values for min, Q1, median, Q3, and max,
- but they have rather different data distributions.
[Figure: two histograms with the same five-number summary]

3. Scatter plot
- Uses dots to represent values for two different numeric variables.
- The position of each dot on the horizontal and vertical axes indicates the values for an individual data point.
- Provides a first look at bivariate data to see clusters of points, outliers, and relationships between two variables.
- Each pair of values is treated as a pair of coordinates and plotted as a point in the plane.
- Two attributes, X and Y, are correlated if one attribute implies the other.

Positively and Negatively Correlated Data
- Positive correlation
  - If the pattern of plotted points slopes from lower left to upper right, the values of X increase as the values of Y increase, suggesting a positive correlation.
- Negative correlation
  - If the pattern of plotted points slopes from upper left to lower right, the values of X increase as the values of Y decrease, suggesting a negative correlation.

Uncorrelated Data
[Figure: scatter plots of uncorrelated data]

Scatter plot
- Example: draw the scatter plot of the values of systolic blood pressure and body mass index.

SN#   X (BMI)   Y (Blood Pressure)      SN#   X (BMI)   Y (Blood Pressure)
1     14.5      91.8                    13    22.5      116.7
2     15        103.8                   14    23.4      104.2
3     15.9      109.2                   15    23.9      115.2
4     16.6      103                     16    24.6      122.2
5     17        111.3                   17    25.3      126.4
6     18.6      102.2                   18    26.6      140.9
7     19.4      109.8                   19    27        131.6
8     19.7      101.5                   20    27.9      140.5
9     20.4      100.7                   21    28.5      119.8
10    21        113.3                   22    29.7      130.9
11    21.4      129.3                   23    32.4      128.5
12    21.9      115.2                   24    34.6      146.9

[Figure: scatter plot of Y (Blood Pressure) against X (BMI)]

4. Quantile Plot
- Used to analyze the data distribution of an attribute.
- The word "quantile" comes from the word quantity. In simple terms, a quantile is where a sample is divided into equal-sized, adjacent subgroups (that is why it is sometimes called a "fractile").
- Let $x_i$, for i = 1 to N, be the data sorted in increasing order. $f_i$ indicates that approximately $100 f_i$% of the data are below or equal to the value $x_i$.
- Calculate the sample quantile as $f_i = (i - 0.5) / N$.
- Plot the points $(f_i, x_i)$.
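The quantile-plot formula can be sketched as a small helper that returns the points to plot. The function name `quantile_points` and the four-value data set are illustrative assumptions, not part of the slides.

```python
def quantile_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N, for i = 1..N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical data, chosen so the f_i values are easy to verify by hand.
for f, x in quantile_points([30, 10, 40, 20]):
    print(f, x)
# 0.125 10
# 0.375 20
# 0.625 30
# 0.875 40
```

Plotting x against f (for example with a plotting library of your choice) gives the quantile plot. The 0.5 offset keeps every $f_i$ strictly between 0 and 1.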

Quantile Plot
- Example: data {45, 50, 61, 75, 90, 95, 99}

i   Sorted data   Quantile f_i
1   45            0.071428571
2   50            0.214285714
3   61            0.357142857
4   75            0.5
5   90            0.642857143
6   95            0.785714286
7   99            0.928571429

[Figure: quantile plot of the scores; x-axis: f_i value, y-axis: scores]

5. Quantile-Quantile (Q-Q) Plot
- Graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
- Example: it shows the unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.

Quantile-Quantile (Q-Q) Plot
- Example:
  - Scores of exam 1: {45, 50, 61, 75, 90, 95, 99} (sorted)
  - Scores of exam 2: {65, 75, 76, 85, 93, 97, 99} (sorted)
[Figure: Q-Q plot of the two sets of exam scores, showing the Q-Q line and a guide line]

Basic Statistical Descriptions of Data: Conclusion
Basic data descriptions (e.g., measures of central tendency and measures of dispersion) and graphic statistical displays (e.g., quantile plots, histograms, and scatter plots) provide valuable insight into the overall behavior of your data. By helping to identify noise and outliers, they are especially useful for data cleaning.

Similarity and Distance
- For many different problems we need to quantify how close two objects are.
- Examples:
  - For an item bought by a customer, find other similar items.
  - Group together the customers of a site so that similar customers are shown the same ad.
  - Group together web documents so that you can separate the ones that talk about politics from the ones that talk about sports.
  - Find all the near-duplicate mirrored web documents.
  - Find credit card transactions that are very different from previous transactions.
- To solve these problems, we need a definition of similarity, or distance.
- The definition depends on the type of data that we have.

Similarity and Dissimilarity
- Similarity
  - A numerical measure of how alike two data objects are.
  - A similarity measure for two objects, i and j, will typically return the value 0 if the objects are unalike. The higher the similarity value, the greater the similarity between the objects.
  - Typically, a value of 1 indicates complete similarity, that is, the objects are identical.
- Dissimilarity (e.g., distance)
  - A numerical measure of how different two data objects are.
  - It returns a value of 0 if the objects are the same (and therefore far from being dissimilar).
  - The higher the dissimilarity value, the more dissimilar the two objects are.
- Proximity refers to either a similarity or a dissimilarity.

Jaccard Similarity
- The Jaccard similarity (Jaccard coefficient) of two sets $S_1$, $S_2$ is the size of their intersection divided by the size of their union.
- $JSim(S_1, S_2) = |S_1 \cap S_2| / |S_1 \cup S_2|$
- Example: with 3 elements in the intersection and 8 in the union, the Jaccard similarity is 3/8.
- Extreme behavior:
  - JSim(X, Y) = 1 iff X = Y
  - JSim(X, Y) = 0 iff X and Y have no elements in common

Data Matrix and Dissimilarity Matrix
- Data matrix (n x p matrix)
  - n data points with p dimensions
  - Objects in rows and attributes in columns

  $\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$

- Dissimilarity matrix (n x n matrix)
  - n data points, but only the distances are registered
  - A triangular matrix

  $\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$

Dissimilarity Matrix
- Nominal attributes
- Binary attributes
- Numeric attributes
- Ordinal attributes
- Combination of different types of attributes

Proximity Measures for Nominal Attributes
- A nominal attribute can take 2 or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute).
- Method 1: Simple matching
  $d(i, j) = \frac{p - m}{p}$
  - m: number of matches, p: total number of variables
- Method 2: Use a large number of binary attributes
  - Nominal attributes can be encoded using asymmetric binary attributes by creating a new binary attribute for each of the states.
  - For an object with a given state value, the binary attribute representing that state is set to 1, while the remaining binary attributes are set to 0.

Simple matching: Example
distance(object1, object2) = (P - M) / P
P is the total number of attributes, M is the total number of matches.

Proximity Measure for Nominal Attributes: method 2
- Assume that we have three objects with the following attributes:
  - The Color attribute has three states: Red (R), Green (G), Blue (B)
  - The Shape attribute has two states: Rectangle (Rec), Triangle (Tri)

        Color (R, G, B)   Shape
obj1    R                 Tri
obj2    G                 Tri
obj3    R                 Rec
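Both methods for nominal attributes can be sketched briefly. The helper names `simple_matching_distance` and `one_hot`, and the dictionary layout of the three objects, are assumptions made for this illustration.

```python
def simple_matching_distance(a, b):
    """Method 1: d(i, j) = (p - m) / p, with m matches out of p attributes."""
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary attribute per state, set to 1 for the object's state."""
    return [1 if value == s else 0 for s in states]


COLOR_STATES = ["R", "G", "B"]
SHAPE_STATES = ["Rec", "Tri"]
objects = {"obj1": ("R", "Tri"), "obj2": ("G", "Tri"), "obj3": ("R", "Rec")}

print(simple_matching_distance(objects["obj1"], objects["obj2"]))  # 0.5
print(simple_matching_distance(objects["obj2"], objects["obj3"]))  # 1.0

# Binary encoding in the order Red, Green, Blue, Rectangle, Triangle.
encoded = {
    name: one_hot(color, COLOR_STATES) + one_hot(shape, SHAPE_STATES)
    for name, (color, shape) in objects.items()
}
print(encoded["obj1"])  # [1, 0, 0, 0, 1]
```

The encoded vectors can then be compared with any binary proximity measure, which is the point of the second method.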
Encoding each state as its own binary attribute gives:

        Red   Green   Blue   Rectangle   Triangle
obj1    1     0       0      0           1
obj2    0     1       0      0           1
obj3    1     0       0      1           0

After converting to asymmetric binary attributes, use the contingency table.

Proximity Measures for Binary Attributes
- A contingency table for binary data counts, for two objects i and j, the attributes by their joint values: q (1 for both), r (1 for i and 0 for j), s (0 for i and 1 for j), and t (0 for both).
- Distance measure for symmetric binary variables:
  $d(i, j) = \frac{r + s}{q + r + s + t}$
- Distance measure for asymmetric binary variables:
  $d(i, j) = \frac{r + s}{q + r + s}$

Proximity Measure for Binary Attributes
- Similarity measure for symmetric binary variables:
  $sim(i, j) = \frac{q + t}{q + r + s + t} = 1 - d(i, j)$
- Similarity measure for asymmetric binary variables (Jaccard coefficient):
  $sim(i, j) = \frac{q}{q + r + s} = 1 - d(i, j)$

Proximity Measure for Nominal Attributes after converting to asymmetric binary: example
- Contingency table for obj1 and obj2:

                obj1 = 1   obj1 = 0   Sum
obj2 = 1        1          1          2
obj2 = 0        1          2          3
Sum             2          3          5

- d(obj1, obj2) = 2/3
- Jaccard = 1/3

Method 2 ("Use a large number of binary attributes"): calculation steps
1. Convert the data objects to asymmetric binary attributes.
2. Generate the contingency table by counting over the objects under examination.
3. Calculate d(i, j).
4. Calculate sim(i, j).

Dissimilarity between Binary Variables
- Example
- Gender is symmetric, while the other attributes are asymmetric binary.
- Yes and positive = 1; No and negative = 0.
- Suppose that the distance between the objects depends only on the asymmetric attributes.

Dissimilarity between Binary Variables
- Example: contingency table for Jim and Mary

             Jim = 1   Jim = 0
Mary = 1     1         2
Mary = 0     1         2

- With the asymmetric measure, d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75.

Dissimilarity of Numeric Data
- These measures include the Euclidean, Manhattan, and Minkowski distances.
- They are techniques used to find the distance (dissimilarity) between objects.
- In some cases, the data are normalized before applying distance calculations. This involves transforming the data to fall within a smaller or common range, such as [-1, 1].
- Z-score normalization:
  $z = \frac{x - \mu}{\sigma}$
  - μ: mean of the population, σ: standard deviation

Dissimilarity of Numeric Data: Euclidean Distance
- Euclidean distance is a technique used to find the distance (dissimilarity) between objects.
- It is also known as the distance "as the crow flies."
- Euclidean distance is the length of the straight line between the starting point and the destination.
- For two objects i and j described by p numeric attributes, the Euclidean distance is defined as:
  $d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}$

Example
[Figure: data points for Ahmad and Mohammed]
- Euclidean distance (Mohammed, Ahmad) = $\sqrt{(10 - 6)^2 + (90 - 95)^2}$ = 6.40312
- Euclidean distance (Ahmad, Mohammed) = $\sqrt{(10 - 6)^2 + (90 - 95)^2}$ = 6.40312

Dissimilarity of Numeric Data: Manhattan Distance
- Another dissimilarity measure, also known as the taxi driver or city block distance.
- In contrast to the Euclidean distance, for the Manhattan distance we count the city blocks that we need to pass in moving from the starting point to the destination.
- Let $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ be two objects described by p numeric attributes.
The Manhattan distance between objects i and j is defined as:
  $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$

Dissimilarity of Numeric Data: Manhattan Distance
- For instance, based on the map, to reach the destination the taxi driver first has to move four blocks to the left and then three blocks north.
[Figure: city map showing the taxi route]

Dissimilarity of Numeric Data: Euclidean and Manhattan
- Both the Euclidean and the Manhattan distance satisfy the following mathematical properties:
  - Non-negativity: d(i, j) ≥ 0. Distance is a non-negative number.
  - Identity of indiscernibles: d(i, i) = 0. The distance of an object to itself is 0.
  - Symmetry: d(i, j) = d(j, i). Distance is a symmetric function.
  - Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j). Going directly from object i to object j is never longer than making a detour over any other object k.

Dissimilarity of Numeric Data: Minkowski Distance
- Minkowski distance is a generalization of the Euclidean and Manhattan distances. It is defined as:
  $d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
- The order of the distance, which some texts call p, is written h here because p already denotes the number of attributes.
- In the Minkowski distance formula, h = 1 gives the Manhattan distance and h = 2 gives the Euclidean distance.
- Supremum distance is the generalization of the Minkowski distance as h approaches infinity: $d(i, j) = \max_f |x_{if} - x_{jf}|$. Supremum distance can be helpful when we want the maximum difference between two objects over all attributes.
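The relationship between the Minkowski, Manhattan, Euclidean, and supremum distances can be sketched as follows. The function names and the two sample points are illustrative assumptions.

```python
def minkowski(a, b, h):
    """Minkowski distance of order h (h = 1: Manhattan, h = 2: Euclidean)."""
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)


def supremum(a, b):
    """Limit of the Minkowski distance as h grows: the largest attribute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical points with two numeric attributes each.
a, b = (1, 2), (3, 5)
print(minkowski(a, b, 1))            # 5.0
print(round(minkowski(a, b, 2), 2))  # 3.61
print(supremum(a, b))                # 3
```

Raising h puts more weight on the attribute with the largest difference, which is why the value falls from 5.0 toward the supremum distance of 3.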

Example

Point   Attribute 1   Attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Euclidean (L2)                        Manhattan (L1)
L2    x1     x2     x3     x4         L1    x1   x2   x3   x4
x1    0                               x1    0
x2    3.61   0                        x2    5    0
x3    2.24   5.1    0                 x3    3    6    0
x4    4.24   1      5.39   0          x4    6    1    7    0

- Euclidean: $D(x_1, x_2) = \sqrt{(3 - 1)^2 + (5 - 2)^2} = \sqrt{4 + 9} = \sqrt{13} = 3.61$
- Manhattan: $D(x_1, x_2) = |3 - 1| + |5 - 2| = 2 + 3 = 5$
- Supremum: $D(x_1, x_2) = 5 - 2 = 3$, because attribute 2 has the greatest difference.

Proximity Measures for Ordinal Attributes

Ordinal Variables
- Order is important, e.g., rank.
- Ordinal variables can be treated like interval-scaled (numeric) attributes:
  - Replace each ordinal value by its rank $r_{if} \in \{1, \ldots, M_f\}$.
  - Map the range of each variable onto [0.0, 1.0] by replacing the value of the i-th object in the f-th variable by
    $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  - Example: freshman: 0; sophomore: 1/3; junior: 2/3; senior: 1
  - Then the distances are d(freshman, senior) = 1 and d(junior, senior) = 1/3.
  - Compute the dissimilarity using methods for numeric variables.

Example
- A) Find the distance between objects 2 and 1 (based only on Test-1).
- States of Test-1 = {fair, good, excellent}, r = {1, 2, 3}, $M_f$ = 3

Object ID   Test-1 (ordinal)   Test-2 (numerical)
1           Excellent (3)      0.55
2           Fair (1)           0
3           Good (2)           1
4           Excellent (3)      0.14

- Map to [0, 1] using $z_{if} = \frac{r_{if} - 1}{M_f - 1}$:
  z(fair) = (1 - 1) / (3 - 1) = 0, z(good) = (2 - 1) / (3 - 1) = 0.5, z(excellent) = (3 - 1) / (3 - 1) = 1

Object ID   Test-1   Test-2 (numerical)
1           1        0.55
2           0        0
3           0.5      1
4           1        0.14

- Euclidean distance:
  $d(2, 1) = \sqrt{(0 - 1)^2} = 1$
  $d(3, 1) = \sqrt{(0.5 - 1)^2} = 0.5$
- The distance matrix of all four objects based on Test-1:

      1     2     3     4
1     0
2     1     0
3     0.5   0.5   0
4     0     1     0.5   0
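The rank-to-[0, 1] mapping for ordinal attributes can be sketched as below. The function name `ordinal_to_unit_interval` and the list-based representation of the Test-1 column are assumptions for illustration. With a single attribute, the Euclidean distance reduces to an absolute difference.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal value to z = (r - 1) / (M - 1), where r is its 1-based rank."""
    rank = ordered_states.index(value) + 1
    m = len(ordered_states)
    return (rank - 1) / (m - 1)


states = ["fair", "good", "excellent"]               # ordered from lowest to highest
test_1 = ["excellent", "fair", "good", "excellent"]  # objects 1 to 4

z = [ordinal_to_unit_interval(v, states) for v in test_1]
print(z)  # [1.0, 0.0, 0.5, 1.0]

# Lower-triangular distance matrix on the single mapped attribute.
for i in range(len(z)):
    print([abs(z[i] - z[j]) for j in range(i + 1)])
# [0.0]
# [1.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.0, 1.0, 0.5, 0.0]
```

Objects 1 and 4 share the state "excellent", so their distance on this attribute is 0.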
- B) Find the distance between objects 2 and 1 (based on both Test-1 and Test-2).
- Both attributes are now numerical, so we can use the Euclidean distance.

Object ID   Test-1 (numerical)   Test-2 (numerical)
1           1                    0.55
2           0                    0
3           0.5                  1
4           1                    0.14

- Euclidean distance:
  $d(2, 1) = \sqrt{(0 - 1)^2 + (0 - 0.55)^2} = \sqrt{1 + 0.303} = 1.14$

Cosine Similarity
- A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
- Cosine measure: if x and y are two vectors (e.g., term-frequency vectors), then
  $sim(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$
- where $\|x\|$ is the Euclidean norm of vector x.

Example
- Find the similarity between documents 1 and 2.
  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
- d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
- ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
- ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
- cos(d1, d2) = 25 / (6.481 * 4.12) = 0.94

Summary
- Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled.
- Gain insight into the data by:
  - Basic statistical data description: central tendency, dispersion, graphical displays.
  - Measuring data similarity.
- The steps above are the beginning of data preprocessing.
- Many methods have been developed, but this is still an active area of research.
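As a check on the cosine similarity example from this chapter, the calculation can be sketched in plain Python. The function name `cosine_similarity` is an assumption for this illustration, and the code assumes neither vector is all zeros.

```python
import math


def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||); assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

Because the measure depends only on the angle between the vectors, scaling a term-frequency vector (for example, doubling a document's length) leaves the similarity unchanged.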