Data Mining Lecture 2: Getting to Know Your Data
Document Details
Dr. Doaa Elzanfaly
Summary
This document provides a lecture outline for a data mining course. It covers attribute types, including nominal, binary, ordinal, and numeric (qualitative and quantitative) attributes. It introduces basic statistical descriptions of data and methods for measuring data similarity and dissimilarity, and explains how to handle attributes of mixed types.
Full Transcript
DATA MINING Lecture 2: Getting to Know Your Data (Dr. Doaa Elzanfaly)

Lecture Outline
◼ Data Objects and Attribute Types
◼ Basic Statistical Descriptions of Data
◼ Measuring Data Similarity and Dissimilarity
◼ Summary

Data Objects
◼ Datasets are made up of data objects.
◼ A data object represents an entity. Examples:
 ▪ Sales database: customers, store items, sales
 ▪ Medical database: patients, treatments
 ▪ University database: students, professors, courses
◼ Also called samples, examples, instances, data points, objects, or tuples.
◼ Data objects are described by attributes.
◼ Database rows -> data objects; columns -> attributes.

Attributes
◼ An attribute (also called a dimension, feature, or variable) is a data field representing a characteristic or feature of a data object.
 ▪ E.g., customer_ID, name, address
◼ Types:
 ▪ Qualitative: nominal (categorical), ordinal, binary (symmetric or asymmetric)
 ▪ Quantitative: numeric (interval-scaled or ratio-scaled)

Attribute Types
◼ Nominal: categories, states, or "names of things"
 ▪ Hair_color = {black, blond, brown, grey, red, white}
 ▪ Marital status, occupation, ID numbers, zip codes
◼ Binary: a nominal attribute with only 2 states (0 and 1)
 ▪ Symmetric binary: both outcomes equally important, e.g., gender
 ▪ Asymmetric binary: outcomes not equally important, e.g., a medical test (positive vs. negative)
 ▪ Convention: assign 1 to the most important outcome (e.g., HIV positive)
◼ Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known
 ▪ Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types
◼ Quantity (integer or real-valued)
◼ Interval-scaled
 ▪ Measured on a scale of equal-sized units; values have order
 ▪ E.g., temperature in C° or F°, calendar dates
 ▪ No true zero-point
◼ Ratio-scaled
 ▪ Inherent zero-point
 ▪ We can speak of values as being an order of magnitude larger than the unit of measurement (10 K° is twice as high as 5 K°).
 ▪ E.g., length, counts, monetary quantities

Discrete vs. Continuous Attributes
◼ Discrete attribute
 ▪ Has only a finite or countably infinite set of values
 ▪ E.g., zip codes, professions, or the set of words in a collection of documents
 ▪ Sometimes represented as integer variables
 ▪ Note: binary attributes are a special case of discrete attributes
◼ Continuous attribute
 ▪ Has real numbers as attribute values
 ▪ E.g., temperature, height, or weight
 ▪ In practice, real values can only be measured and represented using a finite number of digits
 ▪ Continuous attributes are typically represented as floating-point variables

Basic Statistical Descriptions of Data
◼ Basic statistical descriptions are used to identify properties of the data and to highlight which data values should be treated as noise or outliers.
◼ Central tendency: measures the location of the middle or centre of a data distribution
 ▪ Mean, median, mode, and midrange
◼ Dispersion of the data: how are the data spread out?
 ▪ Range, variance, quartiles, and interquartile range
 ▪ Outliers
 ▪ Five-number summary and boxplots; variance and standard deviation

Measuring the Central Tendency
◼ Mean (algebraic measure) / average: x̄ = (1/n) Σ_{i=1..n} x_i (n is the sample size, N the population size)
 ▪ Weighted arithmetic mean: x̄ = Σ w_i x_i / Σ w_i (the plain mean is the case w_i = 1)
 ▪ Trimmed mean: chop off the extreme values before averaging
◼ Median: the middle value if there is an odd number of values, or the average of the two middle values otherwise
 ▪ Estimated by interpolation for grouped data: median = L1 + ((n/2 − (Σ freq)_l) / freq_median) × width
◼ Mode: the value that occurs most frequently in the data
 ▪ Distributions can be unimodal, bimodal, trimodal, or multimodal
 ▪ Empirical formula: mean − mode ≈ 3 × (mean − median)

Symmetric vs. Skewed Data
◼ Median, mean, and mode of symmetric, positively skewed, and negatively skewed data. (The slide's figure shows that in a symmetric distribution the three coincide, in a positively skewed one the mode lies below the median and mean, and in a negatively skewed one the order is reversed.)

Measuring the Dispersion of Data
◼ Quartiles, outliers, and boxplots
 ▪ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 ▪ Interquartile range: IQR = Q3 − Q1
 ▪ Five-number summary: min, Q1, median, Q3, max
 ▪ Boxplot: the ends of the box are the quartiles; the median is marked; add whiskers, and plot outliers individually
 ▪ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
◼ Variance and standard deviation (sample: s; population: σ)
 ▪ Sample variance (algebraic, scalable computation): s² = (1/(n−1)) Σ (x_i − x̄)² = (1/(n−1)) [Σ x_i² − (1/n)(Σ x_i)²]
 ▪ Population variance: σ² = (1/N) Σ (x_i − μ)² = (1/N) Σ x_i² − μ²
 ▪ The standard deviation s (or σ) is the square root of the variance s² (or σ²)

Boxplot Analysis
◼ Five-number summary of a distribution: minimum, Q1, median, Q3, maximum
◼ Boxplot
 ▪ Data is represented with a box
 ▪ The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
 ▪ The median is marked by a line within the box
 ▪ Whiskers: two lines outside the box, extended to the minimum and maximum
 ▪ Outliers: points beyond a specified outlier threshold, plotted individually

Graphic Displays of Basic Statistical Descriptions
◼ Boxplot: graphic display of the five-number summary
◼ Histogram: the x-axis shows values, the y-axis represents frequencies
◼ Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i
◼ Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
◼ Scatter plot: each pair of values gives a pair of coordinates, plotted as a point in the plane

Similarity and Dissimilarity: Measures of Proximity
◼ Similarity
 ▪ A numerical measure of how alike two data objects are
 ▪ The value is higher when objects are more alike
 ▪ Often falls in the range [0, 1]
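The central-tendency and dispersion measures above can be sketched with Python's standard library. The sample values here are illustrative (not from the lecture), chosen to be bimodal and to contain one outlier:

```python
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(data)             # (1/n) * sum of values -> 58
median = statistics.median(data)         # average of the two middle values -> 54
modes = statistics.multimode(data)       # most frequent values -> [52, 70] (bimodal)
midrange = (min(data) + max(data)) / 2   # (30 + 110) / 2 = 70

# Quartiles via linear interpolation over the sorted sample
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                            # interquartile range, Q3 - Q1
five_number_summary = (min(data), q1, median, q3, max(data))

# The usual boxplot rule: flag values more than 1.5 * IQR beyond the quartiles
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

sample_std = statistics.stdev(data)      # square root of the sample variance s^2
```

Note that `statistics.stdev` uses the n − 1 denominator (sample variance), matching the formula for s² above; `statistics.pstdev` would give the population version σ.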
◼ Dissimilarity (e.g., distance)
 ▪ A numerical measure of how different two data objects are
 ▪ Lower when objects are more alike
 ▪ The minimum dissimilarity is often 0 (the objects are the same)
 ▪ The upper limit varies; sim(i, j) = 1 − d(i, j)
◼ Two data structures are commonly used: the data matrix and the dissimilarity matrix.

Data Matrix and Dissimilarity Matrix
◼ Data matrix
 ▪ n data points with p dimensions; used to store the data objects; two modes
 ▪ Row i holds object i:

   x11 ... x1f ... x1p
   ...
   xi1 ... xif ... xip
   ...
   xn1 ... xnf ... xnp

◼ Dissimilarity matrix
 ▪ n data points, but registers only the distances; a triangular matrix; single mode
 ▪ Used to store dissimilarity values for pairs of objects:

   0
   d(2,1)  0
   d(3,1)  d(3,2)  0
   :       :       :
   d(n,1)  d(n,2)  ...  0

Proximity Measure for Nominal Attributes
◼ A nominal attribute can take 2 or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute)
◼ Method 1: simple matching
 ▪ d(i, j) = (p − m) / p, where m is the number of matches and p is the total number of variables
◼ Method 2: use a large number of binary attributes
 ▪ Create a new binary attribute for each of the M nominal states

Dissimilarity between Nominal Attributes: Example
◼ Using Method 1, d(i, j) = (p − m) / p. As there is only one attribute, p = 1, so d(i, j) evaluates to 0 if objects i and j match and to 1 if they differ.
◼ In the resulting dissimilarity matrix, all objects are dissimilar except objects 1 and 4.

Proximity Measure for Binary Attributes
◼ A contingency table for binary data:
 ▪ q is the number of attributes that equal 1 for both objects i and j
 ▪ r is the number of attributes that equal 1 for object i but 0 for object j
 ▪ s is the number of attributes that equal 0 for object i but 1 for object j
 ▪ t is the number of attributes that equal 0 for both objects i and j
 ▪ The total number of attributes is p, where p = q + r + s + t.
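A minimal sketch of the two pieces just defined: simple matching for nominal attributes, d(i, j) = (p − m)/p, and the q, r, s, t counts of the binary contingency table. The object values are illustrative (the single nominal attribute is chosen so that only objects 1 and 4 match, as in the example above):

```python
# Method 1 for nominal attributes: d(i, j) = (p - m) / p,
# where p = number of attributes, m = number of matches.
def nominal_dissim(obj_i, obj_j):
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# One nominal attribute per object; only objects 1 and 4 carry the same value.
objects = [("A",), ("B",), ("C",), ("A",)]
d_14 = nominal_dissim(objects[0], objects[3])  # 0.0: objects 1 and 4 match
d_12 = nominal_dissim(objects[0], objects[1])  # 1.0: objects 1 and 2 differ

# Contingency counts q, r, s, t for two binary vectors i and j.
def contingency(i, j):
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))  # 1 for both
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))  # 1 for i, 0 for j
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))  # 0 for i, 1 for j
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))  # 0 for both
    return q, r, s, t

q, r, s, t = contingency((1, 0, 1, 0, 0, 0), (1, 0, 1, 0, 1, 0))
# q=2, r=0, s=1, t=3, and p = q + r + s + t = 6
```

The binary dissimilarity measures defined next are simple ratios of these four counts.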
Proximity Measure for Binary Attributes (cont.)
◼ Symmetric binary dissimilarity (distance measure for symmetric binary variables):
 d(i, j) = (r + s) / (q + r + s + t)
◼ Asymmetric binary dissimilarity (distance measure for asymmetric binary variables; negative matches t are not considered):
 d(i, j) = (r + s) / (q + r + s)
◼ Jaccard coefficient (a similarity measure for asymmetric binary variables):
 sim(i, j) = q / (q + r + s)
 ▪ Note: the Jaccard coefficient is the same as "coherence"

Dissimilarity between Binary Variables: Example

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

◼ Gender is a symmetric attribute; the remaining attributes are asymmetric binary.
◼ Let the values Y and P be 1, and the value N be 0. Using the asymmetric binary dissimilarity:
 ▪ d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
 ▪ d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
 ▪ d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75
◼ Jack and Mary are the most likely to have a similar disease.

Dissimilarity of Numeric Data
◼ Generally, a numeric attribute expressed in smaller units will have a larger range, and thus tends to be given a greater effect or "weight."
◼ In some cases, the data are normalized before applying distance calculations.
 ▪ This involves transforming the data to fall within a smaller or common range, such as [−1, 1] or [0.0, 1.0].
 ▪ Normalizing the data attempts to give all attributes an equal weight.

Dissimilarity of Numeric Data: Euclidean Distance
◼ The most popular distance measure: the straight-line or "as the crow flies" distance.
◼ The Euclidean distance represents the shortest distance between two points.
◼ Let i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) be two objects described by p numeric attributes.
◼ The Euclidean distance between objects i and j is defined (by the Pythagorean theorem) as:
 d(i, j) = sqrt((x_i1 − x_j1)² + (x_i2 − x_j2)² + ... + (x_ip − x_jp)²)

Dissimilarity of Numeric Data: Euclidean Distance Example

Data matrix:
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Dissimilarity matrix (Euclidean distance):
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.10  0
x4    4.24  1.00  5.39  0

Dissimilarity of Numeric Data: Manhattan Distance
◼ Another well-known measure is the Manhattan (or city block) distance, defined as:
 d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|

Distance on Numeric Data: Minkowski Distance
◼ A generalization of the Euclidean and Manhattan distances, defined as:
 d(i, j) = (|x_i1 − x_j1|^h + |x_i2 − x_j2|^h + ... + |x_ip − x_jp|^h)^(1/h)
 ▪ i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects
 ▪ h is the order (the distance is also called the L-p norm)
 ▪ It gives the Manhattan distance when h = 1 (L1 norm) and the Euclidean distance when h = 2 (L2 norm)
◼ See: https://vitalflux.com/different-types-of-distance-measures-in-machine-learning/

Special Cases of Minkowski Distance
◼ h = 1: Manhattan (city block, L1 norm) distance
 ▪ d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
 ▪ E.g., the Hamming distance: the number of bits that differ between two binary vectors
◼ h = 2: Euclidean (L2 norm) distance
 ▪ d(i, j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|²)
◼ h → ∞: "supremum" (Lmax norm, L∞ norm) distance
 ▪ The maximum difference between any component (attribute) of the vectors

Example: Minkowski Distance

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Manhattan (L1):
      x1  x2  x3  x4
x1    0
x2    5   0
x3    3   6   0
x4    6   1   7   0

Euclidean (L2):
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.10  0
x4    4.24  1.00  5.39  0

Supremum (L∞):
      x1  x2  x3  x4
x1    0
x2    3   0
x3    2   5   0
x4    3   1   5   0

Dissimilarity Measures of Numeric Data
◼ A distance is a metric (for similarity or dissimilarity) if and only if it satisfies the following four conditions:
 1. Non-negativity: d(p, q) ≥ 0 for all p and q.
 ▪ Distance is a non-negative number.
 2. Symmetry: d(p, q) = d(q, p) for all p and q.
 ▪ Distance is a symmetric function.
 3. Triangle inequality: d(p, q) ≤ d(p, r) + d(r, q) for all p, q, and r.
 ▪ Going directly from object p to object q in space is never longer than making a detour over any other object r.
 4. Identity of indiscernibles: d(p, q) = 0 only if p = q.
 ▪ The distance of an object to itself is 0.

Distance on Ordinal Variables
◼ An ordinal variable can be discrete or continuous
◼ Order is important, e.g., rank
◼ Can be treated like interval-scaled variables:
 ▪ Replace each value x_if by its rank r_if ∈ {1, ..., M_f}
 ▪ Map the range of each variable onto [0, 1] by replacing the value of the i-th object in the f-th variable by z_if = (r_if − 1) / (M_f − 1)
 ▪ Compute the dissimilarity using methods for interval-scaled variables
◼ Worked example, in three steps: (1) replace each value by its rank, (2) normalize the ranks, (3) compute, e.g., the Euclidean distance.

Distance on Attributes of Mixed Type
◼ A dataset may contain attributes of all types: nominal, symmetric binary, asymmetric binary, numeric, and ordinal.
◼ One may use a weighted formula to combine their effects:
 d(i, j) = Σ_{f=1..p} δ_ij^(f) d_ij^(f) / Σ_{f=1..p} δ_ij^(f)
 ▪ where d_ij^(f) is the dissimilarity contributed by attribute f, and the indicator δ_ij^(f) is 0 if the value of attribute f is missing for object i or j (and 1 otherwise), so missing attributes drop out of the average.
◼ Example (the slide's table of values for attributes Test-1, Test-2, and Test-3 is not reproduced here): the resulting dissimilarity matrix obtained for the data described by the three attributes of mixed types shows that objects 1 and 4 are the most similar. Why? Which objects are the least similar?

Cosine Similarity
◼ A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
◼ Each document is an object represented by what is called a term-frequency vector.
◼ Term-frequency vectors are typically very long and sparse.
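The numeric and ordinal machinery above can be sketched in a few lines of Python: the Minkowski family (h = 1 gives Manhattan, h = 2 Euclidean, and h → ∞ the supremum) on two points from the example data matrix, plus ordinal ranks normalized to [0, 1] via z = (r − 1)/(M − 1). The size ordering used for the ordinal part is illustrative:

```python
# Minkowski distance of order h; h=1 is Manhattan, h=2 is Euclidean.
def minkowski(x, y, h):
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

# h -> infinity: the supremum (L-max) distance.
def supremum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

x1, x2 = (1, 2), (3, 5)           # two points from the example data matrix
manhattan = minkowski(x1, x2, 1)  # |1-3| + |2-5| = 5
euclidean = minkowski(x1, x2, 2)  # sqrt(4 + 9) ~ 3.61
sup = supremum(x1, x2)            # max(2, 3) = 3

# Ordinal attribute: replace values by ranks 1..M, normalize to [0, 1].
def normalize_ordinal(values, order):
    rank = {v: i + 1 for i, v in enumerate(order)}  # ranks 1..M
    m = len(order)
    return [(rank[v] - 1) / (m - 1) for v in values]

z = normalize_ordinal(["small", "large", "medium", "large"],
                      order=["small", "medium", "large"])
# z == [0.0, 1.0, 0.5, 1.0]; any numeric distance above now applies, e.g.
d_ord = minkowski([z[0]], [z[1]], 2)  # distance between objects 1 and 2
```

This matches the three-step recipe for ordinal variables: rank, normalize, then treat the result as interval-scaled data.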
◼ The traditional distance measures studied in this chapter do not work well for such sparse numeric data.

Cosine Similarity (cont.)
◼ Cosine similarity is a measure of similarity that can be used to compare documents; it is a nonmetric measure.
◼ Let x and y be two vectors for comparison. Using the cosine measure as a similarity function:
 sim(x, y) = (x · y) / (||x|| ||y||)
◼ where ||x|| is the Euclidean norm of the vector x = (x_1, x_2, ..., x_p), defined as sqrt(x_1² + x_2² + ... + x_p²). Conceptually, it is the length of the vector. ||y|| is the Euclidean norm of y, defined similarly.
◼ The measure computes the cosine of the angle between vectors x and y.
 ▪ A cosine value of 0 means the two vectors are at 90 degrees to each other (orthogonal) and have no match.
 ▪ The closer the cosine value is to 1, the smaller the angle and the greater the match between the vectors.

Example: Cosine Similarity
◼ Find the similarity between documents x and y:
 x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
 y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
 x · y = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
 ||x|| = sqrt(25 + 9 + 4 + 4) = sqrt(42) ≈ 6.48
 ||y|| = sqrt(9 + 4 + 1 + 1 + 1 + 1) = sqrt(17) ≈ 4.12
 sim(x, y) = 25 / (6.48 × 4.12) ≈ 0.94
◼ Using the cosine similarity measure to compare the two documents, they would be considered quite similar.
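The cosine computation above can be sketched directly, reproducing the document example:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y: (x . y) / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Term-frequency vectors from the example above
x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
sim = cosine_similarity(x, y)  # 25 / (sqrt(42) * sqrt(17)) ~ 0.94
```

Note that only the positions where both vectors are non-zero contribute to the dot product, which is why cosine similarity copes well with long, sparse term-frequency vectors.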
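Finally, the weighted mixed-type formula, d(i, j) = Σ δ_ij^(f) d_ij^(f) / Σ δ_ij^(f), can be sketched as an average over per-attribute contributions. The three attributes and their precomputed dissimilarities here are illustrative, not the slide's table:

```python
# Mixed-type dissimilarity: average the per-attribute dissimilarities d_f
# over the attributes whose indicator delta_f is 1; delta_f = 0 (e.g., a
# missing value) drops that attribute from both sums.
def mixed_dissim(contribs):
    """contribs: list of (delta, d_f) pairs, one per attribute."""
    num = sum(delta * d for delta, d in contribs)
    den = sum(delta for delta, _ in contribs)
    return num / den

# One object pair compared on a nominal, an ordinal, and a numeric attribute.
contribs = [(1, 1.0),   # nominal: values differ -> dissimilarity 1
            (1, 0.5),   # ordinal: normalized ranks 0.5 apart
            (0, 0.0)]   # numeric: value missing -> excluded (delta = 0)
d = mixed_dissim(contribs)  # (1.0 + 0.5) / 2 = 0.75
```

Each d_ij^(f) is computed with the measure appropriate to attribute f's type (simple matching, binary ratio, normalized rank, or numeric distance), exactly as covered in the preceding sections.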