Data Exploration


What is Data Exploration?
❑ A preliminary exploration of the data to better understand its characteristics.
❑ Key motivations for data exploration include
1. Helping to select the right tool for preprocessing or analysis
2. Making use of humans' abilities to recognize patterns. People can recognize patterns not captured by data analysis tools
❑ Related to the area of Exploratory Data Analysis (EDA), created by the statistician John Tukey
❑ A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook https://www.itl.nist.gov/div898/handbook/

Iris Sample Data Set
❑ Many of the exploratory data techniques are illustrated with the Iris Plant data set (150 instances).
❑ Can be obtained from the UCI Machine Learning Repository https://archive.ics.uci.edu/
❑ Introduced by the statistician Ronald A. Fisher
❑ Three flower types (classes): 1. Setosa 2. Virginica 3. Versicolour
❑ Four attributes: sepal width and length, petal width and length
[Image: Iris virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]

Getting to Know Your Data
❑ Data Objects and Attribute Types
❑ Basic Statistical Descriptions of Data
❑ Data Visualization
❑ Measuring Data Similarity and Dissimilarity

Getting to Know Your Data
❑ Knowledge about your data is useful for data preprocessing
❑ What are the types of attributes that make up your data?
❑ What kind of values does each attribute have?
❑ Which attributes are discrete or continuous?
❑ What do the data look like? How are the values distributed?
❑ Are there ways we can visualize the data to get a better sense of it all?
❑ Can we spot any outliers?
❑ Can we measure the similarity of some data objects with respect to others?
❑ Gaining such insight into the data will help with the subsequent analysis.

What is a Data Set?
❑ A collection of data objects and their attributes
❑ An attribute is a property or characteristic of an object. Examples: color, temperature, etc.
❑ An attribute is also known as a variable, field, characteristic, feature, dimension, predictor, regressor, or independent variable.
❑ A collection of attributes describes an object
❑ An object is also known as a record, sample, example, entity, instance, tuple, point, data point, vector, pattern, event, case, or observation.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

❑ Terminology varies by field: data warehousing says "dimension", machine learning says "feature", statisticians say "variable", and data mining says "attribute".

Attribute Values
❑ Attribute values are numbers or symbols assigned to an attribute
❑ Distinction between attributes and attribute values
1. The same attribute can be mapped to different attribute values. Example: height can be measured in feet or meters
2. Different attributes can be mapped to the same set of values. Example: attribute values for ID and age are integers
❑ But the properties of attribute values can be different: ID has no limit, while age has a maximum and minimum value
❑ Attribute vector (or feature vector) = the set of attributes used to describe a given object.
❑ The distribution of data involving one attribute is called univariate. A bivariate distribution involves two attributes, and so on.
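Since the Iris data set introduced above recurs throughout this deck, here is a minimal sketch of loading it and taking a first look. It assumes scikit-learn (0.23 or later) is installed; the raw file could equally be downloaded from the UCI repository URL above.

```python
# A first look at the Iris data: shape, class balance, and per-attribute
# summary statistics. load_iris(as_frame=True) returns a pandas DataFrame.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame                      # 150 rows: 4 numeric attributes + target
print(df.shape)                      # (150, 5)
print(df["target"].value_counts())   # 50 objects in each of the 3 classes
print(df.describe())                 # count, mean, std, min, quartiles, max
```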
Measurement of Length
❑ The way you measure an attribute may not match the attribute's properties.
[Figure: five objects A–E measured on two scales; one maps their lengths to the ordinal values 1, 2, 3, 4, 5, the other to the ratio values 1, 7, 8, 10, 15.]

Types of Attributes
❑ The type of an attribute is determined by the set of possible values.
❑ There are different types of attributes:
1. Nominal or categorical (qualitative)
❑ Nominal means "relating to names".
❑ Nominal values can be represented by symbols or numbers.
❑ Examples: ID numbers, zip codes, eye color, occupation
❑ Mathematical operations on values of nominal attributes are not meaningful (e.g., subtracting one customer ID from another).
❑ Binary attribute
❑ A nominal attribute with only two categories or states: 0 or 1.
❑ Binary attributes are referred to as Boolean if the two states correspond to true and false.
❑ Symmetric binary: both outcomes are equally important (e.g., choosing between two equally attractive job offers)
❑ Asymmetric binary: outcomes not equally important (e.g., a medical test; positive vs. negative)
❑ Convention: assign 1 to the most important outcome (e.g., HIV positive)
2. Ordinal (qualitative)
❑ Values have a meaningful order (ranking), but the magnitude between successive values is not known.
❑ Examples: rankings (e.g., the taste of potato chips on a scale from 1-10), grades = {A+, A, A-, B+, ...}, size = {small, medium, large}, college students = {freshman, sophomore, junior, senior}.
❑ The central tendency of an ordinal attribute can be represented by its mode and its median.
3. Interval (quantitative)
❑ Measured on a scale of equal-sized units
❑ Values have order
❑ Examples: temperature in °C or °F, calendar dates
❑ No true zero-point (e.g., there is no year 0, and 0°C does not mean "no temperature").
❑ Allows us to compare the difference between values.
4. Ratio (quantitative)
❑ A value is a multiple (or ratio) of another value.
❑ Example: if a bowl of fruit contains eight oranges and six lemons, then the ratio of oranges to lemons is eight to six (that is, 8:6, which is equivalent to the ratio 4:3)
❑ Has a true zero-point
❑ Examples: temperature in Kelvin, length, time, counts
❑ A weight of 4 grams is twice a weight of 2 grams, because weight is a ratio variable. A temperature of 100°C is not twice as hot as 50°C, because Celsius temperature is not a ratio variable. Unlike temperatures in Celsius and Fahrenheit, the Kelvin (K) temperature scale has what is considered a true zero-point (0 K = −273.15°C): it is the point at which the particles that comprise matter have zero kinetic energy.

Properties of Attribute Values
❑ The type of an attribute depends on which of the following properties it possesses:
a) Distinctness: =, ≠
b) Order: <, >
c) Addition: +, −
d) Multiplication: ×, /
❑ Nominal attribute: distinctness
❑ Ordinal attribute: distinctness & order
❑ Interval attribute: distinctness, order & addition
❑ Ratio attribute: all 4 properties

Attribute Type | Description | Examples | Operations
Nominal (=, ≠) | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. | zip codes, employee ID numbers, eye color, gender | mode, entropy, contingency correlation, χ² test
Ordinal (<, >) | The values of an ordinal attribute provide enough information to order objects. | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval (+, −) | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio (×, /) | For ratio variables, both differences and ratios are meaningful. | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation

Discrete vs. Continuous Attributes
❑ Discrete attribute
❑ Has only a finite or countably infinite set of values
❑ Examples: zip codes, counts, or the set of words in a collection of documents
❑ Often represented as integer variables.
❑ Binary attributes are a special case of discrete attributes
❑ Continuous attribute
❑ Has real numbers as attribute values
❑ Examples: temperature, height, or weight.
❑ Real values can only be measured and represented using a finite number of digits.
❑ Continuous attributes are typically represented as floating-point variables.

Types of Data Sets
1. Record: a) Relational records b) Data Matrix c) Document Data d) Transaction Data
2. Graph and network: a) Transportation network b) World Wide Web c) Social or information networks d) Molecular structures
3. Ordered: a) Sequential Data b) Sequence Data c) Temporal Data d) Spatial Data e) Video Data

(1) Record Data
❑ Data that consists of a collection of records, each of which consists of a fixed set of attributes. Usually stored in flat files or in relational databases. (See the Tid/Refund/Marital Status/Taxable Income/Cheat table shown earlier.)

(1.a) Relational records: relational tables, highly structured

(1.b) Data Matrix
❑ Data matrix, e.g., numerical matrix, crosstabs
❑ If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute.
❑ Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.

(1.c) Document Data
❑ Each document becomes a term vector
❑ Each term is a component (attribute) of the vector.
❑ The value of each component is the number of times the corresponding term occurs in the document (see the term-vector sketch at the end of this subsection).

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0    2      0        2
Document 2    0     7      0     2     1      0     0    3      0        0
Document 3    0     1      0     0     1      2     2    0      3        0

(1.d) Transaction Data
❑ Each transaction record involves a set of items.
❑ For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Salt, Bread
3    Salt, Coke, Diaper, Milk
4    Salt, Bread, Diaper, Milk
5    Coke, Diaper, Milk

(2) Graphs and Networks: (a) Transportation network (b) World Wide Web (c) Molecular structures (d) Social or information networks

(3) Ordered Data
(a) Sequential data: an extension of record data, where each record has a time associated with it.
(b) Sequence data: a sequence of individual entities such as words or letters. Similar to sequential data with no time stamps, but with positions in an ordered sequence. Example: genomic sequence data.
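The document-data representation above (each document as a term-frequency vector) can be built with nothing beyond the standard library. A minimal sketch, using a hypothetical two-document corpus:

```python
# Build term-frequency vectors like the document-data table above.
# The two toy documents are hypothetical; only the standard library is used.
from collections import Counter

docs = ["team play score game game lost",
        "coach coach ball lost lost season"]
vocab = sorted({term for doc in docs for term in doc.split()})
vectors = [[Counter(doc.split())[t] for t in vocab] for doc in docs]

print(vocab)                 # one component (attribute) per term
for v in vectors:
    print(v)                 # term counts per document
```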
(c) Time series data: a special type of sequential data in which each record is a time series, i.e., a series of measurements taken over time.
❑ Temporal autocorrelation: if two measurements are close in time, then the values of those measurements are often very close. [Figure: monthly airline bookings]

Important Characteristics to Consider
❑ Some important questions to consider when first looking at a time series are:
❑ Is there a trend, meaning that, on average, the measurements tend to increase (or decrease) over time?
❑ Is there seasonality, meaning that there is a regularly repeating pattern of highs and lows related to calendar time such as seasons, quarters, months, days of the week, and so on?
❑ Are there outliers? In regression, outliers are far away from your line. With time series data, your outliers are far away from your other data.
❑ Is there a long-run cycle or period unrelated to seasonality factors?
❑ Is there constant variance over time, or is the variance non-constant?
❑ Are there any abrupt changes to either the level of the series or the variance?

(3) Ordered Data (d) Spatial data
❑ Spatial autocorrelation: objects that are physically close tend to be similar in other ways (e.g., temperature and rainfall). Example: average monthly temperature of land and ocean (spatio-temporal data).

(3) Ordered Data (e) Video data: a sequence of images

Getting to Know Your Data
❑ Data Objects and Attribute Types
❑ Basic Statistical Descriptions of Data
❑ Data Visualization
❑ Measuring Data Similarity and Dissimilarity

Example: CPU and GPU Dataset
❑ Component Type: whether the component is a CPU or GPU (e.g., "CPU", "GPU").
❑ Model Name: the specific model name of the component (e.g., "Intel Core i7-11700K", "NVIDIA GeForce RTX 3080").
❑ Manufacturer: the company that manufactures the component (e.g., "Intel", "AMD", "NVIDIA").
❑ Base Clock Speed (GHz): the base operating frequency (e.g., 3.6 for CPUs, 1.44 for GPUs).
❑ Number of Cores: the number of physical cores for CPUs or CUDA cores/stream processors for GPUs (e.g., 8 for a CPU, 8704 for a GPU).
❑ Memory Size (GB): total memory size or VRAM (e.g., 32 GB for system memory in a CPU context, 10 GB for GPU VRAM).
❑ Memory Bandwidth (GB/s): the rate at which data can be transferred to and from memory (e.g., 50 GB/s for CPUs, 760 GB/s for GPUs).
❑ TDP (Thermal Design Power in Watts): power consumption under maximum load (e.g., 95 W for CPUs, 320 W for GPUs).
❑ Release Date: the release date of the component (e.g., "Q1 2021" for CPUs, "Q3 2020" for GPUs).
❑ Price (USD): the typical market price for the component (e.g., $399 for a CPU, $699 for a GPU).
Component Type | Model Name | Manufacturer | Base Clock Speed (GHz) | Number of Cores | Memory Size (GB) | Memory Bandwidth (GB/s) | TDP (W) | Release Date | Price (USD)
CPU | Intel Core i9-13900K | Intel | 3 | 8 | 32 | 50 | 125 | Q1 2023 | 599
GPU | NVIDIA GeForce RTX 4090 | NVIDIA | 2.2 | 16384 | 24 | 1008 | 450 | Q4 2024 | 1999
CPU | AMD Ryzen 7 7800X | AMD | 4 | 8 | 32 | 60 | 105 | Q2 2024 | 499
GPU | AMD Radeon RX 7900 XTX | AMD | 2 | 6144 | 20 | 900 | 350 | Q1 2023 | 1199
CPU | Intel Core i5-13600K | Intel | 3.8 | 6 | 32 | 50 | 125 | Q3 2023 | 329
GPU | NVIDIA GeForce RTX 4060 | NVIDIA | 1.8 | 3072 | 12 | 432 | 200 | Q1 2023 | 499
CPU | AMD Ryzen 9 7950X | AMD | 3.5 | 16 | 64 | 80 | 170 | Q4 2022 | 899
GPU | NVIDIA GeForce RTX 4070 | NVIDIA | 1.9 | 7168 | 16 | 600 | 285 | Q2 2023 | 699

Summary Statistics
❑ Summary statistics are numbers that summarize properties of the data
❑ Summarized properties include frequency, location and spread
❑ Examples: location - mean; spread - standard deviation
❑ Most summary statistics can be calculated in a single pass through the data

Basic Statistical Descriptions of Data
❑ Motivation: to better understand the data: central tendency, variation and spread
❑ Data dispersion characteristics: median, max, min, quantiles, outliers, variance, ...
❑ Numerical dimensions correspond to sorted intervals
❑ Data dispersion: analyzed with multiple granularities of precision
❑ Boxplot or quantile analysis on sorted intervals
❑ Dispersion analysis on computed measures
❑ Folding measures into numerical dimensions
❑ Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency: (1) Mean
❑ Mean (algebraic measure) (sample vs. population): the mean is the most common measure of the location of a set of points. Note: n is sample size and N is population size.

Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$        Population mean: $\mu = \frac{\sum x}{N}$

❑ The mean is very sensitive to outliers.
❑ Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
❑ Trimmed (truncated) mean
❑ Chopping extreme values: 5 to 25 percent of the ends are discarded (avoid trimming too large a portion)
❑ The scoring method used in many sports that are evaluated by a panel of judges (e.g., Olympic gymnastics) is a truncated mean: discard the lowest and the highest scores, then calculate the mean value of the remaining scores

Measuring the Central Tendency: (2) Median
❑ Median: the middle value if there is an odd number of values, or the average of the middle two values otherwise, in a set of ordered data values.
❑ Median: estimated by interpolation (for grouped data)
❑ The median is expensive to compute when we have a large number of observations. For numeric attributes, however, we can easily approximate the value:

Approximate median $= L + \left(\frac{n/2 - B}{F}\right) w$

❑ L = lower class boundary of the group containing the median
❑ n = total number of values
❑ B = cumulative frequency of the groups before the median group
❑ F = frequency of the median interval
❑ w = width of the median interval

Example
❑ The median is the middle value, which in our case is the 11th one, which is in the 61 - 65 group.
❑ We call it "61 - 65", but it really includes values from 60.5 up to (but not including) 65.5.
❑ Why? Well, the values are in whole seconds, so a real time of 60.5 is measured as 61. Likewise, 65.4 is measured as 65.

Measuring the Central Tendency: (3) Mode
❑ Mode: the value that occurs most frequently in the data (highest peak)
❑ The notions of frequency and mode are typically used with categorical data
❑ Unimodal (one mode); multi-modal (bimodal, trimodal, ...)
❑ Empirical formula: mean − mode ≈ 3 × (mean − median)
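The location measures above are one-liners in Python. A minimal sketch, assuming numpy and scipy are installed; the data values are hypothetical:

```python
# Mean, weighted mean, trimmed mean, and median on data with one outlier.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 100])          # one extreme value
w = np.array([1, 1, 1, 1, 1, 0.1])          # down-weight the outlier

print(x.mean())                    # 19.17 -- the mean is pulled up
print(np.average(x, weights=w))    # weighted arithmetic mean
print(stats.trim_mean(x, 0.2))     # trim 20% from each end -> 3.5
print(np.median(x))                # 3.5 -- robust to the outlier
```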
Symmetric vs. Skewed Data
❑ Median, mean and mode of symmetric, positively skewed and negatively skewed data
❑ Positively skewed (right): mean > median > mode; the long tail is on the positive (right) side of the peak
❑ Negatively skewed (left): long left tail

Properties of Normal Distribution Curve
[Figure: a normal curve; the spread represents data dispersion, the center represents central tendency.]

Measures of Data Distribution: Variance and Standard Deviation
❑ Variance and standard deviation (sample: s, population: σ)
❑ Variance: (algebraic, scalable computation)
❑ Q: Can you compute it incrementally and efficiently?

Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$

Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$

❑ Standard deviation s (or σ) is the square root of variance s² (or σ²)
❑ A low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data are spread out over a large range of values.

Example: Population Variance vs. Sample Variance
❑ The sample variance is divided by n−1 (instead of n) to correct for bias in the estimation of the population variance. This is known as Bessel's correction.
❑ Population: 2, 4, 6, 8, 10
❑ Population mean (μ) = (2 + 4 + 6 + 8 + 10) / 5 = 6
❑ Squared differences from the mean: (2−6)² = 16, (4−6)² = 4, (6−6)² = 0, (8−6)² = 4, (10−6)² = 16
❑ Population variance (σ²) = (16 + 4 + 0 + 4 + 16) / 5 = 8

Sample Variance (Without Bessel's Correction) – cont.
❑ Sample: 2, 4, 6
❑ Sample mean (x̄) = (2 + 4 + 6) / 3 = 4
❑ Squared differences from the mean: (2−4)² = 4, (4−4)² = 0, (6−4)² = 4
❑ Sample variance (without correction) = (4 + 0 + 4) / 3 = 2.67

Sample Variance (With Bessel's Correction) – cont.
❑ Sample: 2, 4, 6
❑ Sample mean (x̄) = (2 + 4 + 6) / 3 = 4
❑ Squared differences from the mean: (2−4)² = 4, (4−4)² = 0, (6−4)² = 4
❑ Sample variance (with Bessel's correction) = (4 + 0 + 4) / 2 = 4
❑ Without Bessel's correction: the sample variance is 2.67, which underestimates the population variance.
❑ With Bessel's correction: the sample variance is 4, a better estimate of the population variance.

Alternatives to Standard Deviation for Outliers
❑ Variance and standard deviation are also sensitive to outliers, so other measures are often used.
1. Interquartile Range (IQR)
❑ Common in exploratory data analysis
❑ Focuses on the middle 50% of the data (Q3 − Q1)
❑ Robust to outliers; simple to interpret
❑ Equation: IQR = Q3 − Q1
2. Median Absolute Deviation (MAD)
❑ A robust alternative to standard deviation; less sensitive to outliers
❑ Equation: MAD = median(|xᵢ − median(x)|)

Example: Median Absolute Deviation (MAD)
❑ Example data: 3, 7, 8, 5, 122
❑ (Population) standard deviation = 46.53
❑ Median of the dataset = 7
❑ For each data point, calculate the absolute deviation from the median (7): |3 − 7| = 4, |5 − 7| = 2, |7 − 7| = 0, |8 − 7| = 1, |122 − 7| = 115
❑ Absolute deviations = 4, 2, 0, 1, 115
❑ Arrange the absolute deviations in ascending order: 0, 1, 2, 4, 115
❑ MAD = 2

Alternatives to Standard Deviation for Outliers – cont.
3. Winsorizing
❑ Common in finance and econometrics
❑ Replaces extreme values with percentile values
❑ Keeps all data but limits outliers' influence
❑ Example: replace values above the 95th percentile and below the 5th percentile with the 95th and 5th percentile values.
4. Trimming
❑ Completely removes extreme values
❑ Reduces skewness but risks data loss
❑ Example: remove the top and bottom 5% of the data.
5. Log Transformation
❑ Common for positively skewed data
❑ Compresses the range of large values
❑ Reduces the impact of outliers
❑ Equation: x' = log(x)
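A minimal sketch of the dispersion measures above, reusing the MAD example data (3, 7, 8, 5, 122); it assumes numpy and scipy are installed:

```python
# Population vs. sample variance, MAD, and winsorizing on outlier-laden data.
import numpy as np
from scipy.stats import mstats

x = np.array([3, 7, 8, 5, 122])
print(np.var(x) ** 0.5)            # population std dev = 46.53 (divide by n)
print(np.var(x, ddof=1))           # sample variance with Bessel's correction
print(np.median(np.abs(x - np.median(x))))     # MAD = 2, robust to the 122
print(mstats.winsorize(x, limits=[0.2, 0.2]))  # clamp extreme 20% per end
```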
Graphic Displays of Basic Statistical Descriptions
❑ Quantile plot: each value xᵢ is paired with fᵢ, indicating that approximately 100·fᵢ% of the data are ≤ xᵢ
❑ Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
❑ Boxplot: graphic display of the five-number summary
❑ Histogram: the x-axis shows values, the y-axis shows frequencies
❑ Scatter plot: each pair of values is a pair of coordinates, plotted as points in the plane

Quantile Plot
❑ Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets
❑ Displays all of the data for a given attribute, sorted in increasing order, which allows the user to assess both the overall behavior and unusual occurrences.
❑ 4-quantiles = quartiles; 100-quantiles = percentiles
❑ Plots quantile information: for data xᵢ sorted in increasing order, fᵢ indicates that approximately 100·fᵢ% of the data are below or equal to the value xᵢ

Quantile-Quantile (Q-Q) Plot
❑ Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
❑ View: is there a shift in going from one distribution to another?
❑ To aid in comparison, the straight line represents the case where, for each given quantile, the unit price at each branch is the same.
❑ Example shows the unit price of items sold at Branch 1 vs. Branch 2 for each quantile.

Percentiles
❑ For continuous data, the notion of a percentile is more useful.
❑ Given an ordinal or continuous attribute x and a number p between 0 and 100, the p-th percentile x_p is a value of x such that p% of the observed values of x are less than x_p.
❑ Example: the 50th percentile is the value x₅₀% such that 50% of all values of x are less than x₅₀%; this is the median.

The Nearest-Rank Method
❑ The P-th percentile of a list of N ordered values (sorted from least to greatest) is the smallest value in the list such that no more than P percent of the data is strictly less than the value and at least P percent of the data is less than or equal to that value. This is obtained by first calculating the ordinal rank and then taking the value from the ordered list that corresponds to that rank.
❑ Ordinal rank: n = ⌈(P × N) / 100⌉ (rounded up to the next integer)
❑ Example: consider the ordered list {15, 20, 35, 40, 50}, which contains 5 data values. What are the 5th, 30th, 40th, 50th and 100th percentiles of this list using the nearest-rank method?

Percentile | N | Ordinal rank n = ⌈(P×N)/100⌉ | Number from the ordered list that has that rank | Percentile value
5th   | 5 | ⌈0.25⌉ = 1 | the first number in the ordered list, which is 15 | 15
30th  | 5 | ⌈1.5⌉ = 2  | the 2nd number in the ordered list, which is 20  | 20
40th  | 5 | ⌈2.0⌉ = 2  | the 2nd number in the ordered list, which is 20  | 20
50th  | 5 | ⌈2.5⌉ = 3  | the 3rd number in the ordered list, which is 35  | 35
100th | 5 | ⌈5.0⌉ = 5  | the last number in the ordered list, which is 50 | 50
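The nearest-rank method above is a few lines of standard-library Python; this sketch reproduces the worked table:

```python
# Nearest-rank percentile: rank n = ceil(P*N/100), 1-based indexing into
# the sorted list.
import math

def nearest_rank_percentile(sorted_values, p):
    n = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(n, 1) - 1]

data = [15, 20, 35, 40, 50]
for p in (5, 30, 40, 50, 100):
    print(p, nearest_rank_percentile(data, p))   # 15, 20, 20, 35, 50
```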
Measuring the Dispersion of Data: Quartiles & Boxplots
❑ A way of displaying the distribution of data
❑ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
❑ Inter-quartile range: IQR = Q3 − Q1
❑ Five-number summary: min, Q1, median, Q3, max
❑ Boxplot: the data are represented with a box
❑ Q1, Q3, IQR: the ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
❑ The median (Q2) is marked by a line within the box
❑ Whiskers: two lines outside the box, extended to the minimum and maximum values that lie within 1.5 × IQR of the quartiles
❑ Outliers: points beyond a specified outlier threshold, plotted individually; usually, a value more than 1.5 × IQR beyond the quartiles

Example of Box Plot
❑ The figure shows boxplots of unit price data for items sold at four branches of AllElectronics during a given time period. For branch 1, we see that the median price of items sold is $80, Q1 is $60, and Q3 is $100. Notice that two outlying observations for this branch were plotted individually, as their values of 175 and 202 are more than 1.5 times the IQR here of 40.

Example of Box Plot
❑ Box plots can be used to compare attributes
[Figure: boxplots annotated with an outlier and the 90th, 75th, 50th, 25th and 10th percentiles; the attributes have different min and max values.]

Visualization of Data Dispersion: 3-D Boxplots

Pie Chart
❑ Typically used with categorical attributes
❑ Uses the relative area of a circle to indicate relative frequency
❑ Common in popular articles but less frequent in technical publications, because the sizes of relative areas can be hard to judge.

Histogram Analysis
❑ Histogram: a graph display of tabulated frequencies, shown as bars
❑ Usually shows the distribution of values of a single variable
❑ Divide the values into bins and show a bar plot of the number of objects in each bin
❑ The height of each bar indicates the number of objects
❑ The shape of the histogram depends on the number of bins

Bar Chart
❑ A visual representation of data that uses rectangular bars to show the frequency, count, or other measures for different categories.

Aspect | Histogram | Bar Chart
Purpose | Used to show the distribution of a dataset, especially for understanding the shape, spread, and central tendency of the data. | Used to compare different categories or groups of data by representing them with individual bars.
Data Type | Plots continuous (quantitative) data divided into intervals or bins. | Plots categorical or discrete data.
Bar Order | Bars are ordered sequentially, based on the data intervals, and cannot be reordered. | Bars can be reordered to emphasize certain categories or for aesthetic purposes.
Bar Representation | The area of the bar denotes the frequency or density of the data within that interval. | The height (or length, for horizontal bar charts) of the bar represents the value or count of the category.
Bar Spacing | Bars are adjacent to each other, with no gaps, to show the continuous nature of the data. | Bars have spaces between them to emphasize that the data categories are distinct and separate.
Skewness | Can indicate skewness in the data distribution (left-skewed, right-skewed, or symmetric). | Not applicable, as bar charts do not convey distribution properties.
Axis | The x-axis represents intervals or bins of data; the y-axis shows frequency or density. | The x-axis (or y-axis in horizontal bar charts) represents different categories; the y-axis (or x-axis) shows the values.
Example Use Cases | Showing the age distribution of a population, an exam score distribution, or the time taken for a process. | Comparing sales of different products, survey responses across different groups, or counts of different event types.
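A minimal sketch of the five-number summary and the 1.5 × IQR outlier rule behind the boxplots above; the price list is hypothetical, and the fences reuse the AllElectronics branch-1 quartiles (Q1 = 60, Q3 = 100, IQR = 40):

```python
# Five-number summary plus IQR-based outlier fences. Assumes numpy.
import numpy as np

prices = np.array([40, 55, 60, 70, 80, 90, 100, 110, 175, 202])
print(np.percentile(prices, [0, 25, 50, 75, 100]))   # min, Q1, median, Q3, max

q1, q3 = 60, 100                                     # slide example quartiles
lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)  # fences at 0 and 160
print([v for v in (175, 202) if not lo <= v <= hi])  # both plotted as outliers
```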
Histograms Often Tell More than Boxplots
❑ The two histograms shown on the left may have the same boxplot representation
❑ The same values for: min, Q1, median, Q3, max
❑ But they have rather different data distributions

Two-Dimensional Histograms
❑ Show the joint distribution of the values of two attributes
❑ Example: petal width and petal length

Scatter Plots
❑ A scatter plot is a type of data visualization that displays values for, typically, two variables for a set of data. The data are represented as a collection of points, where each point represents the values of two variables.
❑ Two-dimensional scatter plots are most common, but scatter plots can also be three-dimensional.
❑ Often, additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects.
❑ Arrays of scatter plots can compactly summarize the relationships of several pairs of attributes.
❑ Scatter plots are useful for identifying relationships, trends, correlations, and outliers in data sets. They are also valuable for assessing how well two attributes differentiate between distinct classes.

Scatter Plots
❑ Provide a first look at bivariate data, to see clusters of points, outliers, etc.
❑ Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Positively and Negatively Correlated Data
❑ The left half fragment is positively correlated
❑ The right half is negatively correlated

Uncorrelated Data

Getting to Know Your Data
❑ Data Objects and Attribute Types
❑ Basic Statistical Descriptions of Data
❑ Data Visualization
❑ Measuring Data Similarity and Dissimilarity

Visualization
❑ Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed.
❑ The goal of visualization is the interpretation of the visualized information by a person and the formation of a mental model of the information.
❑ Visualization of data is one of the most powerful and appealing techniques for data exploration.
❑ Humans have a well-developed ability to analyze large amounts of information that is presented visually
❑ Domain experts can detect general patterns and eliminate uninteresting ones.
❑ Can detect outliers and unusual patterns

Example: Sea Surface Temperature
❑ The following shows the Sea Surface Temperature (SST) for July 1982
❑ Around 250,000 numbers are summarized in a single figure

Representation
❑ Is the mapping of information to a visual format
❑ Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.
❑ Objects are often represented as points
❑ Their attribute values can be represented as the position of the points or as the characteristics of the points, e.g., color, size, and shape
❑ If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, are easily perceived.
❑ In any given set of data there are many implicit relationships, and hence a key challenge of visualization is to choose a technique that makes the relationships of interest easily observable.

Arrangement
❑ Is the placement of visual elements within a display
❑ Can make a large difference in how easy it is to understand the data
❑ Example: no clear relationship at first; after sorting rows and columns based on relationships, structure becomes visible

Selection
❑ The elimination or de-emphasis of certain objects and attributes
❑ Selection may involve choosing a subset of attributes
❑ Dimensionality reduction is often used to reduce the number of dimensions to two or three
❑ Alternatively, pairs of attributes can be considered
❑ Selection may also involve choosing a subset of objects
❑ A region of the screen can only show so many points
❑ Can sample, but want to preserve points in sparse areas

Data Visualization
❑ Why data visualization?
❑ Gain insight into an information space by mapping data onto graphical primitives
❑ Provide a qualitative overview of large data sets (e.g., correlation)
❑ Search for patterns, trends, structure, irregularities, and relationships among data
❑ Help find interesting regions and suitable parameters for further quantitative analysis
❑ Provide a visual proof of derived computer representations (e.g., debugging)
❑ Categorization of visualization methods and techniques:
1. Pixel-oriented
2. Geometric projection
3. Icon-based
4. Hierarchical
5. Visualizing complex data and relations

(1) Pixel-Oriented Visualization Techniques
❑ For a data set of m dimensions, create m windows on the screen, one for each dimension
❑ The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows
❑ The colors of the pixels reflect the corresponding values
❑ Example: pixel-oriented visualization of four attributes — (a) income, (b) credit limit, (c) transaction volume, (d) age (no trends) — with all customers sorted in ascending order of income
❑ Observations: credit limit increases as income increases; customers whose income is in the middle range are more likely to purchase more; there is no clear correlation between income and age.

Laying Out Pixels in Circle Segments
❑ To save space and show the connections among multiple dimensions, space filling is often done in circle segments
❑ (a) Representing a data record in a circle segment; (b) laying out pixels in circle segments
[Figure: about 265,000 50-dimensional data items visualized with the 'Circle Segments' technique]

Matrix Plots
❑ A matrix plot is a type of visualization used to display data in a matrix format, where each cell represents the intersection of data from two variables.
❑ This can be useful when objects are sorted according to class.
❑ The attributes are normalized to prevent one attribute from dominating the plot.
❑ Plots of similarity matrices can be useful for visualizing the relationships between objects.
❑ Matrix plots come in several forms:
❑ Heatmap: can show how one variable changes in relation to another, with color intensity representing data values (see the sketch below).
❑ Scatterplot matrix: a grid of scatter plots that compares multiple variables with each other.
❑ Confusion matrix: used in machine learning classification; shows actual vs. predicted classifications.
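A small example of a matrix plot: a heatmap of the Iris correlation matrix, assuming matplotlib and scikit-learn are installed:

```python
# Draw a heatmap where color intensity encodes pairwise Pearson correlation.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame.iloc[:, :4]   # the 4 numeric attributes
corr = df.corr()                                   # 4 x 4 correlation matrix

plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(4), corr.columns, rotation=45, ha="right")
plt.yticks(range(4), corr.columns)
plt.colorbar(label="Pearson correlation")
plt.tight_layout()
plt.show()
```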
Visualization of the Iris Data Matrix
[Figure: matrix plot of the Iris data, standardized (standard deviation scale)]

Visualization of the Iris Correlation Matrix

Heat Map
❑ Annual U.S. temperature compared to the 20th-century average for each U.S. Climate Normals period from 1901-1930 (upper left) to 1991-2020 (lower right). https://www.noaa.gov/news/new-us-climate-normals-are-here-what-do-they-tell-us-about-climate-change

Scatterplot Matrices
❑ The scatterplot matrix technique is a useful extension of the scatter plot.
❑ The scatterplot matrix becomes less effective as the dimensionality increases.
❑ Matrix of scatterplots (x-y diagrams) of the k-dimensional data [a total of (k² − k)/2 unique scatterplots]
❑ Example: scatter plot array of Iris attributes

(2) Geometric Projection Visualization Techniques
❑ Visualization of geometric transformations and projections of the data
❑ Methods:
❑ Direct visualization
❑ Landscapes
❑ Projection pursuit technique: helps users find meaningful projections of multidimensional data
❑ Prosection views
❑ Hyperslice
❑ Parallel coordinates

Direct Data Visualization
[Figure: ribbons with twists based on vorticity; visualization of a 3-D data set using a scatter plot. From Data Mining: Concepts and Techniques.]

Landscapes
❑ Visualization of the data as a perspective landscape
❑ The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data
❑ Example: news articles visualized as a landscape

Contour Plots
❑ Useful when a continuous attribute is measured on a spatial grid
❑ They partition the plane into regions of similar values
❑ The contour lines that form the boundaries of these regions connect points with equal values
❑ Can display elevation, temperature, rainfall, air pressure, etc.
❑ Example: contour plot of SST for December 1998 (in degrees Celsius)

Parallel Coordinates
❑ Used to plot the attribute values of data
❑ Instead of using perpendicular axes, use a set of parallel axes
❑ The attribute values of each object are plotted as a point on each corresponding coordinate axis, and the points are connected by a line
❑ Thus, each object is represented as a line
❑ Often, the lines representing a distinct class of objects group together, at least for some attributes
❑ The ordering of attributes is important in seeing such groupings
❑ A major limitation of the parallel coordinates technique is that it cannot effectively show a data set of many records. Even for a data set of several thousand records, visual clutter and overlap often reduce the readability of the visualization and make the patterns hard to find.

Parallel Coordinates
❑ n equidistant axes which are parallel to one of the screen axes and correspond to the attributes
❑ The axes are scaled to the [minimum, maximum] range of the corresponding attribute
❑ Every data item corresponds to a polygonal line which intersects each of the axes at the point corresponding to its value for that attribute
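Both of the techniques just described ship with pandas. A minimal sketch on the Iris data, assuming pandas, matplotlib, and scikit-learn are installed:

```python
# Scatterplot matrix and parallel coordinates for the Iris data.
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates, scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
df["class"] = iris.target_names[iris.target]      # readable class labels

scatter_matrix(df.iloc[:, :4], figsize=(8, 8))    # (k^2 - k)/2 unique panels
plt.show()

parallel_coordinates(df.drop(columns="target"), "class")  # one line per object
plt.show()
```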
(3) Icon-Based Visualization Techniques
❑ Visualization of the data values as features of icons
❑ Typical visualization methods:
a) Star plots
b) Chernoff faces
c) Stick figures
❑ General techniques:
❑ Shape coding: use shape to represent certain information encodings
❑ Color icons: use color icons to encode more information
❑ Tile bars: use small icons to represent the relevant feature vectors in document retrieval

Star Plots
❑ Similar approach to parallel coordinates, but the axes radiate from a central point
❑ The line connecting the values of an object is a polygon
❑ Example: star plots for Setosa, Versicolour, and Virginica

Chernoff Faces
❑ Approach created by Herman Chernoff
❑ This approach associates each attribute with a characteristic of a face
❑ A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.
❑ The values of each attribute determine the appearance of the corresponding facial characteristic
❑ Each object becomes a separate face
❑ Relies on the human ability to distinguish faces
❑ Doesn't scale well
❑ Example: faces for Setosa, Versicolour, and Virginica

Chernoff Faces
❑ The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening), each assigned one of 10 possible values. (S. Dickson)

Stick Figures
❑ A 5-piece stick figure (1 body and 4 limbs with different angles/lengths)
❑ Census data where age and income are mapped to the display axes, and the remaining dimensions (gender, education, and so on) are mapped to stick figures.

(4) Hierarchical Visualization Techniques
❑ The visualization techniques discussed so far focus on visualizing multiple dimensions simultaneously. However, for a large data set of high dimensionality, it would be difficult to visualize all dimensions at the same time.
❑ Hierarchical visualization techniques partition all dimensions into subsets (i.e., subspaces). The subspaces are visualized in a hierarchical manner.
❑ Methods:
a) Dimensional stacking
b) Worlds-within-Worlds
c) Tree-map
d) Cone trees
e) InfoCube

Dimensional Stacking
❑ Partitioning of the n-dimensional attribute space into 2-D subspaces, which are 'stacked' into each other
❑ Partitioning of the attribute value ranges into classes. The important attributes should be used on the outer levels.
❑ Adequate for data with ordinal attributes of low cardinality
❑ But difficult to display more than 9 dimensions
❑ Important to map dimensions appropriately

Dimensional Stacking
❑ Example: visualization of oil mining data with longitude and latitude mapped to the outer x- and y-axes, and ore grade and depth mapped to the inner x- and y-axes

Worlds-within-Worlds
❑ Assign the function and the two most important parameters to the innermost world
❑ Fix all other parameters at constant values; draw other (1-, 2- or 3-dimensional) worlds choosing these as the axes
❑ Software that uses this paradigm:
❑ N-Vision: dynamic interaction through a data glove and stereo displays, including rotation, scaling (inner) and translation (inner/outer)
❑ Auto Visual: static interaction by means of queries

Tree-Map
❑ A screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values
❑ The x- and y-dimensions of the screen are partitioned alternately according to the attribute values (classes)
❑ Examples: Shneiderman@UMD: tree-maps to support large data sets of a million items; a treemap of votes by county, state, and locally predominant recipient in the US presidential election of 2012

InfoCube
❑ A 3-D visualization technique where hierarchical information is displayed as nested semi-transparent cubes
❑ The outermost cubes correspond to the top-level data, while the subnodes or lower-level data are represented as smaller cubes inside the outermost cubes, etc.

Three-D Cone Trees
❑ The 3D cone tree visualization technique works well for up to a thousand nodes or so
❑ First build a 2D circle tree that arranges its nodes in concentric circles centered on the root node
❑ Cannot avoid overlaps when projected to 2D
❑ Graph from the Nadeau Software Consulting website: visualizes a social network data set that models the way an infection spreads from one person to the next

Visualizing Complex Data and Relations: Tag Cloud
❑ Tag cloud: visualizing user-generated tags
❑ The importance of a tag is represented by font size/color
❑ Popularly used to visualize word/phrase distributions
❑ Examples: KDD 2013 research paper title tag cloud; Newsmap: Google News stories in 2005

Visualizing Complex Data and Relations: Social Networks
❑ Visualizing non-numerical data: social and information networks; organizing information networks
❑ Examples: a typical network structure; a social network

Visualizing COVID-19
❑ More visualizations: https://blog.mapbox.com/notable-maps-visualizing-covid-19-and-surrounding-impacts-951724cc4bd8

Getting to Know Your Data
❑ Data Objects and Attribute Types
❑ Basic Statistical Descriptions of Data
❑ Data Visualization
❑ Measuring Data Similarity and Dissimilarity

Similarity, Dissimilarity, and Proximity
❑ Similarity measure or similarity function
❑ A real-valued function that quantifies the similarity between two objects
❑ Measures how alike two data objects are: the higher the value, the more alike
❑ Often falls in the range [0,1]: 0 = no similarity; 1 = completely similar
❑ Dissimilarity (or distance) measure
❑ A numerical measure of how different two data objects are
❑ In some sense, the inverse of similarity: the lower, the more alike
❑ Minimum dissimilarity is often 0 (i.e., completely similar)
❑ The upper limit varies: range [0, 1] or [0, ∞), depending on the definition
❑ Proximity refers to either similarity or dissimilarity

Similarity, Dissimilarity, and Proximity
❑ These measures allow us to:
1. Find clusters of similar customers (e.g., similar area of residence, age).
2. Detect outliers
3. Perform nearest-neighbor classification (e.g., finding patients similar to a given patient)
❑ Correlation and Euclidean distance are useful for dense data such as time series or two-dimensional points.
❑ Jaccard and cosine similarity measures are useful for sparse data such as documents.

Data Matrix
❑ We looked at ways of studying the central tendency, dispersion, and spread of observed values for some attribute X. Our objects there were one-dimensional, that is, described by a single attribute. Now we talk about objects described by multiple attributes. Therefore, we need a change in notation.
❑ Data matrix
❑ Object-by-attribute structure (two-mode matrix)
❑ A data matrix of n data points with l dimensions:

$$D = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1l} \\ x_{21} & x_{22} & \cdots & x_{2l} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nl} \end{pmatrix}$$

❑ The objects may be tuples in a relational database, and are also referred to as data samples or feature vectors.
❑ Dissimilarity (distance) matrix
❑ Object-by-object structure (one-mode matrix)
❑ Covers the n data points (objects), but registers only the distance d(i, j), the measured dissimilarity or "difference" between objects i and j:

$$\begin{pmatrix} 0 & & & \\ d(2,1) & 0 & & \\ \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & 0 \end{pmatrix}$$

❑ Usually symmetric, thus a triangular matrix
❑ Distance functions are different for real, boolean, categorical, and vector variables
❑ Weights can be associated with different variables
❑ Many clustering and nearest-neighbor algorithms operate on a dissimilarity matrix.
❑ Data in the form of a data matrix can be transformed into a dissimilarity matrix.
❑ Measures of similarity can often be expressed as a function of measures of dissimilarity. For example, for nominal data, sim(i, j) = 1 − d(i, j).

Example: Data Matrix and Dissimilarity Matrix
Data matrix:

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Dissimilarity matrix (by Euclidean distance; reproduced programmatically in the sketch below):

      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0

Distance on Numeric Data: Minkowski Distance
❑ The Minkowski distance is a generalization of the Euclidean distance:

$$d(i,j) = \sqrt[p]{|x_{i1}-x_{j1}|^p + |x_{i2}-x_{j2}|^p + \cdots + |x_{il}-x_{jl}|^p}$$

where i = (xᵢ₁, xᵢ₂, ..., xᵢₗ) and j = (xⱼ₁, xⱼ₂, ..., xⱼₗ) are two l-dimensional data objects, and p is the order (the distance so defined is also called the L-p norm)
❑ The data can be normalized before applying distance calculations.

Common Properties of a Distance
❑ Distances have some well-known properties:
❑ d(i, i) = 0: the distance of an object to itself is 0
❑ d(i, j) > 0 if i ≠ j (positivity; a distance is a non-negative number)
❑ d(i, j) = d(j, i) (symmetry)
❑ d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)

Common Properties of a Similarity
❑ Similarities also have some well-known properties:
1. s(p, q) = 1 (or maximum similarity) only if p = q
2. s(p, q) = s(q, p) for all p and q (symmetry)
where s(p, q) is the similarity between data objects p and q.
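A minimal sketch reproducing the Euclidean dissimilarity matrix of x1..x4 from the example above; it assumes numpy and scipy are installed:

```python
# Pairwise Euclidean distances, expanded into a square dissimilarity matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]])    # x1, x2, x3, x4
print(np.round(squareform(pdist(X, "euclidean")), 2))
# [[0.   3.61 2.24 4.24]
#  [3.61 0.   5.1  1.  ]
#  [2.24 5.1  0.   5.39]
#  [4.24 1.   5.39 0.  ]]
# pdist(X, "cityblock") and pdist(X, "chebyshev") give the L1 and supremum
# matrices shown on the next slide.
```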
Special Cases of Minkowski Distance
❑ p = 1: (L1 norm) Manhattan (or taxicab, city block) distance

$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{il}-x_{jl}|$$

❑ E.g., the Hamming distance: the number of bits that are different between two binary vectors
❑ p = 2: (L2 norm) Euclidean distance

$$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{il}-x_{jl}|^2}$$

❑ p → ∞: (Lmax norm, L∞ norm) "supremum" distance
❑ The maximum difference between any component (attribute) of the vectors

$$d(i,j) = \lim_{p\to\infty}\left(\sum_{f=1}^{l}|x_{if}-x_{jf}|^p\right)^{1/p} = \max_{f}|x_{if}-x_{jf}|$$

Example: Minkowski Distance at Special Cases

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Manhattan (L1):
      x1  x2  x3  x4
x1    0
x2    5   0
x3    3   6   0
x4    6   1   7   0

Euclidean (L2):
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0

Supremum (L∞):
      x1  x2  x3  x4
x1    0
x2    3   0
x3    2   5   0
x4    3   1   5   0

Mahalanobis Distance
❑ A measure of the distance between a point P and a distribution D
❑ It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D. This distance is zero if P is at the mean of D, and grows as P moves away from the mean along each principal component axis.
❑ Σ is the covariance matrix of the input data X:

$$\text{mahalanobis}(p,q) = \sqrt{(p - q)\,\Sigma^{-1}\,(p - q)^T}$$

$$\Sigma_{j,k} = \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij}-\bar{X}_j)(X_{ik}-\bar{X}_k)$$

❑ For the red points in the figure, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.
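A minimal sketch of the Mahalanobis distance just defined, on a hypothetical correlated 2-D sample; it assumes numpy and scipy are installed:

```python
# Mahalanobis vs. Euclidean distance between two points, scaled by the
# covariance of a synthetic data set.
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[4, 2], [2, 3]], size=500)
VI = np.linalg.inv(np.cov(X, rowvar=False))       # inverse covariance matrix

p, q = np.array([2.0, 1.0]), np.array([0.0, 0.0])
print(np.linalg.norm(p - q))                      # plain Euclidean distance
print(mahalanobis(p, q, VI))                      # accounts for correlation
```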
Proximity Measure for Binary Attributes
❑ A contingency table for binary data:

              object j
               1    0
object i   1   q    r
           0   s    t

❑ q = the number of attributes that equal 1 for both objects i and j
❑ r = the number of attributes that equal 1 for object i but 0 for object j
❑ s = the number of attributes that equal 0 for object i but 1 for object j
❑ t = the number of attributes that equal 0 for both objects i and j
❑ The total number of attributes is p, where p = q + r + s + t
❑ Distance measure for symmetric binary variables: d(i, j) = (r + s) / (q + r + s + t)
❑ Distance measure for asymmetric binary variables: d(i, j) = (r + s) / (q + r + s)
❑ Jaccard coefficient (similarity measure for asymmetric binary variables): sim(i, j) = q / (q + r + s)
❑ Note: the Jaccard coefficient is the same as "coherence" (a concept discussed in pattern discovery)

Example: Dissimilarity between Asymmetric Binary Variables
❑ Which two patients have similar problems?

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

❑ Gender is a symmetric attribute; the remaining attributes are asymmetric binary
❑ Let the values Y (yes) and P (positive) be 1, and the value N (no or negative) be 0
❑ Contingency counts: Jack vs. Mary: q = 2, r = 0, s = 1, t = 3; Jack vs. Jim: q = 1, r = 1, s = 1, t = 3; Jim vs. Mary: q = 1, r = 1, s = 2, t = 2
❑ Distances:
d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75
❑ Jack and Mary have similar problems, since their distance is the smallest

Similarity Between Binary Vectors
❑ A common situation is that two objects, p and q, have only binary attributes
❑ Compute similarities using the following quantities:
M00 = the number of attributes where p = 0 and q = 0
M01 = the number of attributes where p = 0 and q = 1
M10 = the number of attributes where p = 1 and q = 0
M11 = the number of attributes where p = 1 and q = 1
❑ Simple Matching Coefficient (SMC) = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00)
❑ Jaccard coefficient = number of 1-1 matches / number of not-both-zero attribute values = M11 / (M01 + M10 + M11)

Example: SMC vs. Jaccard
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M00 = 7, M01 = 2, M10 = 1, M11 = 0
SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
Jaccard = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0

Proximity Measure for Categorical Attributes
❑ Method 1: simple matching
❑ With p = the total number of variables and m = the number of matches: $d(i,j) = \frac{p - m}{p}$
❑ Method 2: use a large number of binary attributes
❑ Create a new binary attribute for each of the M nominal states
❑ For example, to encode the nominal attribute color, a binary attribute can be created for each color. For an object having the color yellow, the yellow attribute is set to 1, while the remaining attributes are set to 0.
❑ Similarity can be computed as $sim(i,j) = 1 - d(i,j) = \frac{m}{p}$

Example: Dissimilarity between Nominal Attributes
❑ Given the object table (test-1), compute the dissimilarity matrix.
❑ Since we have one nominal attribute, p = 1
❑ d(i, j) evaluates to 0 if objects i and j match and to 1 if the objects differ.

Proximity Measures for Ordinal Attributes
❑ Can be treated like interval-scaled attributes:
1. Replace each ordinal variable value by its rank: $r_{if} \in \{1, \dots, M_f\}$
2. Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
3. Compute the dissimilarity using methods for interval-scaled variables
❑ Example: {freshman, sophomore, junior, senior}
❑ freshman = 1; sophomore = 2; junior = 3; senior = 4
❑ Compute z: freshman = 0; sophomore = 1/3; junior = 2/3; senior = 1
❑ Then distance: d(freshman, senior) = 1, d(junior, senior) = 1/3

Example: Dissimilarity between Ordinal Attributes
❑ There are three states for test-2: fair, good, and excellent, so Mf = 3
❑ Step 1: replace each value for test-2 by its rank; the four objects are assigned the ranks 3, 1, 2, and 3, respectively.
❑ Step 2: normalize the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
❑ Step 3: use the Euclidean distance, which results in a dissimilarity matrix in which objects 1 and 2 are the most dissimilar, as are objects 2 and 4.
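A minimal sketch of SMC vs. Jaccard, reproducing the p/q example above with the standard library only:

```python
# Count the four match types, then compute SMC and the Jaccard coefficient.
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
m00 = sum(a == 0 and b == 0 for a, b in zip(p, q))
m01 = sum(a == 0 and b == 1 for a, b in zip(p, q))
m10 = sum(a == 1 and b == 0 for a, b in zip(p, q))

print((m11 + m00) / (m00 + m01 + m10 + m11))   # SMC = 0.7
print(m11 / (m01 + m10 + m11))                 # Jaccard = 0.0
```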
Dissimilarity for Attributes of Mixed Types
❑ A dataset may contain all attribute types (e.g., nominal, binary, numeric).
❑ One approach is to group each type of attribute together, performing a separate data mining (e.g., clustering) analysis for each type. This is feasible if these analyses derive compatible results. However, in real applications, it is unlikely that a separate analysis per attribute type will generate compatible results.
❑ A preferable approach is to process all attribute types together, performing a single analysis. One such technique combines the different attributes into a single dissimilarity matrix, bringing all of the meaningful attributes onto a common scale of the interval [0.0, 1.0].
❑ Indicator variable:

$$w_{ij}^{(f)} = \begin{cases} 0 & \text{if (1) } x_{if} \text{ or } x_{jf} \text{ is missing (there is no value of attribute } f \text{ for object } i \text{ or } j\text{), or (2) attribute } f \text{ is asymmetric binary and both objects have a value of } 0 \\ 1 & \text{otherwise} \end{cases}$$

❑ Suppose that the data set contains p attributes of mixed type. The dissimilarity d(i, j) between objects i and j is defined as

$$d(i,j) = \frac{\sum_{f=1}^{p} w_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} w_{ij}^{(f)}} \qquad \text{(equation 1)}$$

❑ If f is numeric: use the normalized distance $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$, where h runs over all nonmissing objects for attribute f
❑ If f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$; otherwise $d_{ij}^{(f)} = 1$
❑ If f is ordinal: compute the ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as numeric (interval-scaled)
❑ These steps are identical to what we have already seen for each of the individual attribute types. The only difference is for numeric attributes, where we normalize so that the values map to the interval [0.0, 1.0].

Example: Dissimilarity for Attributes of Mixed Type
❑ Compute the dissimilarity matrix considering all attributes: test-1 (nominal), test-2 (ordinal), and test-3 (numeric).
❑ The procedures we followed for test-1 and test-2 are the same as outlined earlier for processing attributes of mixed types, so we can reuse the dissimilarity matrices already obtained for them.
❑ We need to compute the dissimilarity matrix for the third attribute, test-3 (numeric); that is, we must compute $d_{ij}^{(3)}$.
❑ Let $\max_h x_h = 64$ and $\min_h x_h = 22$. We normalize the data and obtain the dissimilarity matrix for test-3.

Cont.
❑ We can now use the dissimilarity matrices for the three attributes in our computation of equation 1, with $w_{ij}^{(f)} = 1$ for each of the three attributes.
❑ For example, $d(3,1) = \frac{1(1) + 1(0.5) + 1(0.45)}{3} = 0.65$
❑ From the data table, we can intuitively guess that objects 1 and 4 are the most similar, based on their values for test-1 and test-2. This is confirmed by the resulting mixed-type dissimilarity matrix, where d(4, 1) is the lowest value for any pair of different objects. Similarly, the matrix indicates that objects 1 and 2 are the least similar.
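A plain-Python sketch of equation 1. The three attribute values below are chosen so that the per-attribute dissimilarities match the slide's d(3,1) computation — nominal mismatch (1), ordinal |0.5 − 1.0| (0.5), numeric |64 − 45| / (64 − 22) (≈ 0.45); the specific codes "code A"/"code C" are illustrative:

```python
# Gower-style combined dissimilarity over mixed attribute types.
def mixed_dissimilarity(obj_i, obj_j, types, ranges):
    num = den = 0.0
    for f, (a, b) in enumerate(zip(obj_i, obj_j)):
        if a is None or b is None:          # missing value -> weight w = 0
            continue
        if types[f] == "nominal":
            d = 0.0 if a == b else 1.0
        else:                               # ordinal z-values or numeric values
            d = abs(a - b) / ranges[f]
        num += d                            # w = 1 for every usable attribute
        den += 1
    return num / den

obj1 = ("code A", 1.0, 45)                  # test-1, test-2 (as z), test-3
obj3 = ("code C", 0.5, 64)
types = ("nominal", "ordinal", "numeric")
ranges = (None, 1.0, 64 - 22)
print(round(mixed_dissimilarity(obj3, obj1, types, ranges), 2))   # 0.65
```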
Cosine Similarity
❑ A document can be represented by a bag of terms or a long vector, with each attribute recording the frequency of a particular term (such as a word, keyword, or phrase) in the document
❑ Term-frequency vectors are typically very long and sparse (i.e., they have many 0 values)
❑ The traditional distance measures we have studied do not work well for such sparse numeric data. For example, two term-frequency vectors may have many 0 values in common, meaning that the corresponding documents do not share many words, but this does not make them similar. We need a measure for numeric data that ignores zero-matches.
❑ The cosine similarity of two documents ranges from 0 to 1
❑ Applications: information retrieval, biologic taxonomy, gene feature mapping, etc.

Cosine Similarity
❑ If A and B are two document vectors (e.g., term-frequency vectors), then

$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

where · indicates the vector dot product and ||A|| is the length of vector A.
❑ A cosine value of 0 means that the two vectors are at 90 degrees to each other (orthogonal) and have no match. The closer the cosine value is to 1, the smaller the angle and the greater the match between the vectors.
❑ Note that because the cosine similarity measure does not obey all of the properties that define metric measures (e.g., it does not have the triangle inequality property), it is referred to as a nonmetric measure.

Example 1: Calculating Cosine Similarity
❑ Find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
❑ First, calculate the vector dot product:
d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
❑ Then, calculate ||d1|| and ||d2||:
||d1|| = √(5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²) = √42 = 6.481
||d2|| = √(3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²) = √17 = 4.12
❑ Calculate the cosine similarity: cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94
❑ Based on the cosine similarity measure, these documents are quite similar.

Example 2: Calculating Cosine Similarity
d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
d1 · d2 = 3×1 + 2×0 + 0×0 + 5×0 + 0×0 + 0×0 + 0×0 + 2×1 + 0×0 + 0×2 = 5
||d1|| = (3² + 2² + 0² + 5² + 0² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 = 6.48
||d2|| = (1² + 0² + 0² + 0² + 0² + 0² + 0² + 1² + 0² + 2²)^0.5 = 6^0.5 = 2.45
cos(d1, d2) = 5 / (6.48 × 2.45) = 0.315

KL Divergence: Comparing Two Probability Distributions
❑ The Kullback-Leibler (KL) divergence measures the difference between two probability distributions over the same variable x
❑ From information theory; closely related to relative entropy, information divergence, and information for discrimination
❑ D_KL(p(x) || q(x)): the divergence of q(x) from p(x), measuring the information lost when q(x) is used to approximate p(x)
❑ Discrete form: $D_{KL}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$
❑ Continuous form: $D_{KL}(p \,\|\, q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$
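A minimal sketch reproducing cosine-similarity Example 1 above, plus a discrete KL divergence (scipy.stats.entropy computes D_KL when given two distributions); it assumes numpy and scipy, and the p/q distributions are hypothetical:

```python
# Cosine similarity between two term-frequency vectors, and discrete KL.
import numpy as np
from scipy.stats import entropy

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))                       # 0.94

p = np.array([0.4, 0.4, 0.2])
q = np.array([0.3, 0.5, 0.2])
print(entropy(p, q))                       # D_KL(p || q), in nats
```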
