Week02-2023.docx
La Trobe University
Outline
- Attributes and Objects
- Types of Data
- Data Quality
- Similarity and Distance
- Data Preprocessing

What is a Dataset?
A dataset is a collection of data objects and their attributes.

What is an attribute?
- An attribute is a property or characteristic of an object.
- Examples: eye color of a person, temperature, etc.
- An attribute is also known as a variable, field, characteristic, dimension, or feature.

What is an object?
- A collection of attributes describes an object.
- An object is also known as a record, point, case, sample, entity, or instance.

Attribute Values
- Attribute values are numbers or symbols assigned to an attribute for a particular object.
- Distinction between attributes and attribute values:
  - The same attribute can be mapped to different attribute values. Example: height can be measured in feet or meters.
  - Different attributes can be mapped to the same set of values. Example: attribute values for ID and age are integers.
  - But the properties of an attribute can be different from the properties of the values used to represent the attribute.

Types of Attributes
- Nominal: ID numbers, eye color, zip codes
- Ordinal: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}
- Interval: calendar dates, temperatures in Celsius or Fahrenheit
- Ratio: temperature in Kelvin, length, counts, elapsed time (e.g., time to run a race)

The type of an attribute depends on which of the following properties/operations it possesses:
- Distinctness: = ≠
- Order: < >
- Differences are meaningful: + -
- Ratios are meaningful: * /

- Nominal attribute: distinctness
- Ordinal attribute: distinctness and order
- Interval attribute: distinctness, order, and meaningful differences
- Ratio attribute: all 4 properties/operations

Attribute types: description, examples, and operations

Nominal
  Description: Nominal attribute values only distinguish. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test

Ordinal
  Description: Ordinal attribute values also order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Interval
  Description: For interval attributes, differences between values are meaningful. (+, -)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio
  Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, current
  Operations: geometric mean, harmonic mean, percent variation

This categorization of attributes is due to S. S. Stevens.

Attribute types: permissible transformations and comments

Nominal
  Transformation: Any permutation of values
  Comments: If all employee ID numbers were reassigned, would it make any difference?

Ordinal
  Transformation: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function
  Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval
  Transformation: new_value = a * old_value + b, where a and b are constants
  Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Ratio
  Transformation: new_value = a * old_value
  Comments: Length can be measured in meters or feet.

Discrete Attribute
- Has only a finite or countably infinite set of values.
- Examples: zip codes, counts, or the set of words in a collection of documents.
- Often represented as integer variables.
- Note: binary attributes are a special case of discrete attributes.

Continuous Attribute
- Has real numbers as attribute values.
- Examples: temperature, height, or weight.
- Practically, real values can only be measured and represented using a finite number of digits.
- Continuous attributes are typically represented as floating-point variables.
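The permissible transformations listed for interval and ratio attributes can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names and the sample values are assumptions chosen for illustration. It suggests that a Celsius-to-Fahrenheit conversion (new_value = a * old_value + b) keeps ratios of differences intact but changes ratios of values, whereas a meters-to-feet conversion (new_value = a * old_value) keeps ratios of values intact.

```python
import math


def celsius_to_fahrenheit(c):
    """Interval-scale change of units: new_value = a * old_value + b."""
    return c * 9 / 5 + 32


def meters_to_feet(m):
    """Ratio-scale change of units: new_value = a * old_value."""
    return m / 0.3048


def ratio_of_values(low, high):
    """Ratio between two attribute values."""
    return high / low


def ratio_of_differences(v1, v2, v3):
    """Ratio between two consecutive differences of attribute values."""
    return (v3 - v2) / (v2 - v1)


# Hypothetical sample temperatures in Celsius (assumed for illustration).
celsius = [10.0, 20.0, 40.0]
fahrenheit = [celsius_to_fahrenheit(c) for c in celsius]

# Interval attribute: the ratio of values changes with the unit ...
celsius_value_ratio = ratio_of_values(celsius[0], celsius[1])
fahrenheit_value_ratio = ratio_of_values(fahrenheit[0], fahrenheit[1])

# ... but the ratio of differences does not.
celsius_diff_ratio = ratio_of_differences(*celsius)
fahrenheit_diff_ratio = ratio_of_differences(*fahrenheit)

# Ratio attribute: the ratio of values survives the change of unit.
meters = [1.0, 2.0]
feet = [meters_to_feet(m) for m in meters]
meters_value_ratio = ratio_of_values(meters[0], meters[1])
feet_value_ratio = ratio_of_values(feet[0], feet[1])

print("value ratio, Celsius vs Fahrenheit:", celsius_value_ratio, fahrenheit_value_ratio)
print("difference ratio, Celsius vs Fahrenheit:", celsius_diff_ratio, fahrenheit_diff_ratio)
print("value ratio, meters vs feet:", meters_value_ratio, feet_value_ratio)
```

This is why "20 degrees is twice as warm as 10 degrees" is not a meaningful statement in Celsius, while "2 meters is twice as long as 1 meter" holds in any unit of length.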
Types of Data
- Record: data matrix, document data, transaction data
- Graph: World Wide Web, molecular structures
- Ordered: spatial data, temporal data, sequential data, genetic sequence data

Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Transaction Data
- A special type of record data.
- Each transaction involves a set of items.
- For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
- Transaction data can be represented as record data.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Graph Data
[Figures: a generic graph, a molecule (benzene, C6H6), and linked web pages]

Ordered Data
[Figure: sequences of transactions, showing the items/events and an element of the sequence]

Genomic Sequence Data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Data Quality
- Poor data quality negatively affects many data processing efforts.
- Data mining example: a classification model for detecting people who are loan risks is built using poor data.
  - Some credit-worthy candidates are denied loans.
  - More loans are given to individuals that default.
- What kinds of data quality problems can occur?
- How can we detect problems with the data?
- What can we do about these problems?
Examples of data quality problems:
- Noise and outliers
- Wrong data
- Fake data
- Missing values
- Duplicate data

Noise
- For objects, noise is an extraneous object.
- For attributes, noise refers to modification of original values.
- Examples: distortion of a person's voice when talking on a poor phone, and "snow" on a television screen.
[Figure: two sine waves of the same magnitude and different frequencies, the waves combined, and the two sine waves with random noise. The magnitude and shape of the original signal are distorted.]

Outliers
- Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.
- Case 1: outliers are noise that interferes with data analysis.
- Case 2: outliers are the goal of our analysis.
  - Credit card fraud
  - Intrusion detection

Missing Values
- Reasons for missing values:
  - Information is not collected (e.g., people decline to give their age and weight).
  - Attributes may not be applicable to all cases (e.g., annual income is not applicable to children).
- Handling missing values:
  - Eliminate data objects or variables.
  - Estimate missing values. Examples: time series of temperature, census results.
  - Ignore the missing value during analysis.

Duplicate Data
- A data set may include data objects that are duplicates, or almost duplicates, of one another.
- This is a major issue when merging data from heterogeneous sources.
- Example: the same person with multiple email addresses.
- Deduplication: the process of dealing with duplicate data issues.

Data Preprocessing
- Aggregation
- Sampling
- Discretization and binarization
- Attribute transformation
- Dimensionality reduction
- Feature subset selection
- Feature creation

Aggregation
Combining two or more attributes (or objects) into a single attribute (or object).
Purpose
- Data reduction: reduce the number of attributes or objects
- Change of scale
  - Cities aggregated into regions, states, countries, etc.
  - Days aggregated into weeks, months, or years
- More "stable" data: aggregated data tends to have less variability

Example: Precipitation in Australia
This example is based on precipitation in Australia from the period 1982 to 1993. The figures show
- a histogram for the standard deviation of average monthly precipitation for 3,030 0.5° by 0.5° grid cells in Australia, and
- a histogram for the standard deviation of the average yearly precipitation for the same locations.
The average yearly precipitation has less variability than the average monthly precipitation. All precipitation measurements (and their standard deviations) are in centimeters.
[Figure: Variation of Precipitation in Australia. Panels: Standard Deviation of Average Monthly Precipitation; Standard Deviation of Average Yearly Precipitation]

Sampling
- Sampling is the main technique employed for data reduction. It is often used for both the preliminary investigation of the data and the final data analysis.
- Statisticians often sample because obtaining the entire set of data of interest is too expensive or time consuming.
- Sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time consuming.
- The key principle for effective sampling is the following:
  - Using a sample will work almost as well as using the entire data set, if the sample is representative.
  - A sample is representative if it has approximately the same properties (of interest) as the original set of data.

Types of Sampling
- Simple random sampling: there is an equal probability of selecting any particular item.
  - Sampling without replacement: as each item is selected, it is removed from the population.
  - Sampling with replacement: objects are not removed from the population as they are selected for the sample.
    In sampling with replacement, the same object can be picked more than once.
- Stratified sampling: split the data into several partitions, then draw random samples from each partition.

Curse of Dimensionality
- When dimensionality increases, data becomes increasingly sparse in the space that it occupies.
- Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful.

Dimensionality Reduction
Purpose
- Avoid the curse of dimensionality
- Reduce the amount of time and memory required by data mining algorithms
- Allow data to be more easily visualized
- May help to eliminate irrelevant features or reduce noise
Techniques
- Principal Components Analysis (PCA)
- Singular Value Decomposition
- Others: supervised and non-linear techniques

Dimensionality Reduction: PCA
The goal is to find a projection that captures the largest amount of variation in the data.
[Figure: data plotted on axes x1 and x2, with the projection direction e]

Feature Subset Selection
Another way to reduce the dimensionality of data.
- Redundant features
  - Duplicate much or all of the information contained in one or more other attributes.
  - Example: purchase price of a product and the amount of sales tax paid.
- Irrelevant features
  - Contain no information that is useful for the data mining task at hand.
  - Example: a student's ID is often irrelevant to the task of predicting the student's GPA.
- Many techniques have been developed, especially for classification.

Feature Creation
Create new attributes that can capture the important information in a data set much more efficiently than the original attributes.
Three general methodologies:
- Feature extraction. Example: extracting edges from images.
- Feature construction. Example: dividing mass by volume to get density.
- Mapping data to a new space. Example: Fourier and wavelet analysis.

Mapping Data to a New Space
Fourier and wavelet transform.
[Figure: two sine waves, the two sine waves plus noise, and the resulting frequency plot]

Discretization
- Discretization is the process of converting a continuous attribute into an ordinal attribute.
- A potentially infinite number of values are mapped into a small number of categories.
- Discretization is used in both unsupervised and supervised settings.
- Example: height (in cm) mapped to {"short", "average", "tall"}.
[Figures: data consisting of four groups of points and two outliers, discretized into 4 values using the equal interval width approach, the equal frequency approach, and the K-means approach]
- Many classification algorithms work best if both the independent and dependent variables have only a few values.
- We give an illustration of the usefulness of discretization using the following example.

Binarization
Binarization maps a continuous or categorical attribute into one or more binary variables.

Attribute Transformation
- An attribute transform is a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values.
- Simple functions: x^k, log(x), e^x, |x|
- Normalization
  - Refers to various techniques to adjust to differences among attributes in terms of frequency of occurrence, mean, variance, and range.
  - Takes out an unwanted, common signal, e.g., seasonality.
- In statistics, standardization refers to subtracting off the mean and dividing by the standard deviation.

Similarity and Distance
Similarity measure
- Numerical measure of how alike two data objects are.
- Is higher when objects are more alike.
- Often falls in the range [0, 1].
Dissimilarity measure
- Numerical measure of how different two data objects are.
- Is lower when objects are more alike.
- Minimum dissimilarity is often 0.
- Upper limit varies.
Proximity refers to either a similarity or a dissimilarity.

The following table shows the similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute.

Euclidean Distance

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$$

where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.
Standardization is necessary if scales differ.
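The Euclidean distance definition can be sketched in a few lines of Python. This is an illustrative sketch rather than material from the slides: the helper names `euclidean` and `distance_matrix` are hypothetical, and the four two-dimensional points are assumed coordinates, chosen because they are consistent with the pairwise distances listed in this section's distance matrix.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two objects with the same n attributes."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))


def distance_matrix(points):
    """Symmetric matrix of pairwise Euclidean distances."""
    return [[euclidean(p, q) for q in points] for p in points]


# Assumed example coordinates (not given in the text).
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
matrix = distance_matrix(list(points.values()))

# Print one row per point, rounded to three decimals.
for name, row in zip(points, matrix):
    print(name, " ".join(f"{d:.3f}" for d in row))
```

The matrix is symmetric with zeros on the diagonal, since d(x, y) = d(y, x) and d(x, x) = 0. If the attributes were measured on very different scales, each attribute would typically be standardized first so that no single attribute dominates the sum.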
Euclidean Distance
[Figure: points p1, p2, p3, and p4 plotted in two dimensions]

Distance Matrix
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Thank You
latrobe.edu.au
La Trobe University CRICOS Provider Code Number 00115M
TEQSA PRV12132 - Australian University
© Copyright La Trobe University 2023