Chap2-Data_Final Exam syllabus PDF

Summary

This document is a set of lecture notes on data preprocessing. It covers various data types and data sets, including numerical matrices, crosstabs, and sequential data. Attribute types and methods for handling them are discussed, along with concepts concerning the cleaning and handling of noise, missing data, and redundancy. It also presents various normalization and discretization techniques.

Full Transcript

Chapter 2. Data, Measurements, and Data Preprocessing
❑ Data Types
❑ Statistics of Data
❑ Similarity and Distance Measures
❑ Data Quality, Data Cleaning and Data Integration
❑ Data Transformation
❑ Dimensionality Reduction
❑ Summary

Types of Data Sets: (1) Record Data
❑ Relational records: relational tables, highly structured
❑ Data matrix, e.g., numerical matrix, crosstabs
❑ Transaction data, e.g.:

  TID | Items
  ----+---------------------------
   1  | Bread, Coke, Milk
   2  | Beer, Bread
   3  | Beer, Coke, Diaper, Milk
   4  | Beer, Bread, Diaper, Milk
   5  | Coke, Diaper, Milk

❑ Document data: term-frequency vector (matrix) of text documents, e.g.:

              team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1    3     0      5     0     2      6    0     2      0        2
  Document 2    0     7      0     2     1      0    0     3      0        0
  Document 3    0     1      0     0     1      2    2     0      3        0

Types of Data Sets: (2) Graphs and Networks
❑ Transportation networks
❑ World Wide Web
❑ Molecular structures
❑ Social or information networks

Types of Data Sets: (3) Ordered Data
❑ Video data: sequence of image frames
❑ Temporal data: time-series
❑ Sequential data: transaction sequences
❑ Genetic sequence data

Types of Data Sets: (4) Spatial, Image and Multimedia Data
❑ Spatial data: maps
❑ Image data
❑ Video data

Data Objects
❑ Data sets are made up of data objects
❑ A data object represents an entity
❑ Examples:
  ❑ sales database: customers, store items, sales
  ❑ medical database: patients, treatments
  ❑ university database: students, professors, courses
❑ Also called samples, examples, instances, data points, objects, tuples
❑ Data objects are described by attributes
❑ Database rows → data objects; columns → attributes

Attributes
❑ Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object
  ❑ E.g., customer_ID, name, address
❑ Types:
  ❑ Nominal (e.g., red, blue)
  ❑ Binary (e.g., {true, false})
  ❑ Ordinal (e.g., {freshman, sophomore, junior, senior})
  ❑ Numeric (quantitative):
    ❑ Interval-scaled: 100°C is interval-scaled (no true zero)
    ❑ Ratio-scaled: 100 K is ratio-scaled, since it is twice as high as 50 K (has a true zero)
  ❑ Discrete vs. continuous attributes

Attribute Types
❑ Nominal: categories, states, or "names of things"
  ❑ Hair_color = {auburn, black, blond, brown, grey, red, white}
  ❑ marital status, occupation, ID numbers, zip codes
❑ Binary: a nominal attribute with only 2 states (0 and 1)
  ❑ Symmetric binary: both outcomes equally important, e.g., gender
  ❑ Asymmetric binary: outcomes not equally important, e.g., a medical test (positive vs. negative)
    ❑ Convention: assign 1 to the most important outcome (e.g., HIV positive)
❑ Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known
  ❑ Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types
❑ Quantity (integer or real-valued)
❑ Interval
  ❑ Measured on a scale of equal-sized units; values have order
  ❑ E.g., temperature in C° or F°, calendar dates
  ❑ No true zero-point
❑ Ratio
  ❑ Inherent zero-point
  ❑ We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K)
  ❑ E.g., temperature in Kelvin, length, counts, monetary quantities

Discrete vs. Continuous Attributes
❑ Discrete attribute
  ❑ Has only a finite or countably infinite set of values
  ❑ E.g., zip codes, professions, or the set of words in a collection of documents
  ❑ Sometimes represented as integer variables
  ❑ Note: binary attributes are a special case of discrete attributes
❑ Continuous attribute
  ❑ Has real numbers as attribute values
  ❑ E.g., temperature, height, or weight
  ❑ Practically, real values can only be measured and represented using a finite number of digits
  ❑ Continuous attributes are typically represented as floating-point variables

Statistics of Data
❑ Measuring the central tendency
❑ Measuring the dispersion of data
❑ Covariance and correlation analysis
❑ Graphic displays of basic statistics of data

Measuring the Central Tendency: (1) Mean
❑ Mean (algebraic measure), sample vs. population, where n is the sample size and N is the population size:
  Sample: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$    Population: $\mu = \frac{\sum x}{N}$
❑ Weighted arithmetic mean (multiply each value by its weight):
  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
❑ Trimmed mean: chop the extreme values (outliers) off both ends of the range (e.g., Olympic gymnastics score computation)
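The notes stop at the formulas, so here is a minimal Python sketch of the three means (numpy and scipy assumed available; the scores and weights are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical judges' scores; the last one is an outlier
scores = np.array([9.2, 9.4, 9.5, 9.3, 9.6, 4.0])
weights = np.array([1, 1, 1, 1, 1, 2])  # e.g., one judge's vote counts double

print(np.mean(scores))                      # plain mean, dragged down by the outlier
print(np.average(scores, weights=weights))  # weighted arithmetic mean
print(stats.trim_mean(scores, 0.2))         # trimmed mean: chop 20% from each end -> 9.35
```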
Measuring the Central Tendency: (2) Median
❑ Median: the middle value if there is an odd number of values, or the average of the middle two values otherwise
❑ Estimated by interpolation for grouped data:
  $\text{median} \approx L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$
  where $L_1$ is the lower limit of the median interval, $(\sum \text{freq})_l$ is the sum of the frequencies of all intervals below the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the interval width ($L_2 - L_1$)

Measuring the Central Tendency: (3) Mode
❑ Mode: the value that occurs most frequently in the data
❑ Unimodal, bimodal, trimodal, multi-modal
❑ Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$

Symmetric vs. Skewed Data
❑ Median, mean and mode of symmetric, positively skewed, and negatively skewed data
❑ [Figure: three distribution curves — symmetric (mode = median = mean); positively skewed (mode < median < mean); negatively skewed (mean < median < mode)]

Properties of Normal Distribution Curve
❑ [Figure: normal curve — the spread of the curve represents data dispersion; the center represents central tendency]

Measuring the Dispersion of Data: Variance and Standard Deviation
❑ Variance and standard deviation (sample: s, population: σ)
❑ Variance (algebraic, scalable computation); note the subtle difference between the formulae for the sample (size n) and the population (size N):
  Sample: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
  Population: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
❑ Standard deviation s (or σ) is the square root of the variance s² (or σ²)
❑ Q: Can you compute it incrementally and efficiently? (A sketch follows below.)
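One standard answer to the question above is Welford's online algorithm, which maintains the mean and the sum of squared deviations in a single pass. The slides only pose the question; this sketch is an illustrative assumption, not part of the lecture:

```python
class RunningVariance:
    """Welford's online algorithm: one pass over the data, numerically stable."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def sample_variance(self):
        return self.m2 / (self.n - 1)  # s^2; use self.m2 / self.n for sigma^2

rv = RunningVariance()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    rv.update(x)
print(rv.mean, rv.sample_variance())  # 5.0 and 32/7 ≈ 4.571
```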
Correlation between Two Numerical Variables
❑ The correlation between two variables X₁ and X₂ is the standardized covariance, obtained by normalizing the covariance by the standard deviation of each variable:
  $\rho_{12} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}$
❑ Sample correlation for two attributes X₁ and X₂:
  $\hat{\rho}_{12} = \frac{\hat{\sigma}_{12}}{\hat{\sigma}_1 \hat{\sigma}_2} = \frac{\sum_{i=1}^{n} (x_{i1} - \hat{\mu}_1)(x_{i2} - \hat{\mu}_2)}{\sqrt{\sum_{i=1}^{n} (x_{i1} - \hat{\mu}_1)^2} \sqrt{\sum_{i=1}^{n} (x_{i2} - \hat{\mu}_2)^2}}$
  where n is the number of tuples, µ₁ and µ₂ are the respective means of X₁ and X₂, and σ₁ and σ₂ are the respective standard deviations of X₁ and X₂
❑ If ρ₁₂ > 0: X₁ and X₂ are positively correlated (X₁'s values increase as X₂'s do); the higher the value, the stronger the correlation
❑ If ρ₁₂ = 0: independent (under the same assumption as discussed for covariance)
❑ If ρ₁₂ < 0: negatively correlated

Visualizing Changes of Correlation Coefficient
❑ Correlation coefficient value range: [–1, 1]
❑ [Figure: a set of scatter plots showing sets of points with correlation coefficients changing from –1 to 1]

Graphic Displays of Basic Statistical Descriptions
❑ Boxplot: graphic display of the five-number summary (including max and min); see below
❑ Histogram: the x-axis shows values, the y-axis represents frequencies; shows the distribution and displays more information than a boxplot
❑ Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i\%$ of the data are $\leq x_i$; shows all the values
❑ Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another; always compares two things
❑ Scatter plot: each pair of values is a pair of coordinates plotted as a point in the plane; shows the relationship (correlation) between two variables

Measuring the Dispersion of Data: Quartiles & Boxplots
❑ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
❑ Inter-quartile range: IQR = Q3 – Q1
❑ Five-number summary: min, Q1, median, Q3, max
❑ Boxplot: the data is represented with a box
  ❑ Q1, Q3, IQR: the ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  ❑ The median (Q2) is marked by a line within the box
  ❑ Whiskers: two lines outside the box, extended to the minimum and maximum
  ❑ Outliers: points beyond a specified outlier threshold, plotted individually
    ❑ Outlier: usually, a value more than 1.5 × IQR below Q1 or above Q3

Positively and Negatively Correlated Data
❑ [Figure: scatter plots — the left half fragment is positively correlated; the right half is negatively correlated]

Uncorrelated Data
❑ [Figure: scatter plots of uncorrelated data]

Ordinal Variables
❑ An ordinal variable can be discrete or continuous
❑ Order is important, e.g., rank (e.g., freshman, sophomore, junior, senior)
❑ Can be treated like interval-scaled:
  ❑ Replace an ordinal variable value by its rank: $r_{if} \in \{1, \ldots, M_f\}$
  ❑ Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  ❑ Example: freshman: 0; sophomore: 1/3; junior: 2/3; senior: 1
    ❑ Then distance: d(freshman, senior) = 1, d(junior, senior) = 1/3
  ❑ Compute the dissimilarity using methods for interval-scaled variables

Cosine Similarity of Two Vectors
❑ A document can be represented by a bag of terms or a long vector, with each attribute recording the frequency of a particular term (such as a word, keyword, or phrase) in the document
❑ Other vector objects: gene features in micro-arrays
❑ Applications: information retrieval, biologic taxonomy, gene feature mapping, etc.
❑ Cosine measure: if d₁ and d₂ are two vectors (e.g., term-frequency vectors), then
  $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$
  where · indicates the vector dot product and ‖d‖ is the length (Euclidean norm) of vector d

Example: Calculating Cosine Similarity
❑ Ex.: find the similarity between documents 1 and 2:
  d₁ = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d₂ = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
❑ First, calculate the vector dot product:
  d₁ · d₂ = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
❑ Then, calculate ‖d₁‖ and ‖d₂‖:
  ‖d₁‖ = √(5·5 + 0·0 + 3·3 + 0·0 + 2·2 + 0·0 + 0·0 + 2·2 + 0·0 + 0·0) = √42 ≈ 6.481
  ‖d₂‖ = √(3·3 + 0·0 + 2·2 + 0·0 + 1·1 + 1·1 + 0·0 + 1·1 + 0·0 + 1·1) = √17 ≈ 4.123
❑ Finally, calculate the cosine similarity:
  cos(d₁, d₂) = 25 / (6.481 × 4.123) ≈ 0.94
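The same computation in a few lines of Python (numpy assumed), reproducing the slide's result:

```python
import numpy as np

def cosine_similarity(d1, d2):
    """Cosine of the angle between two term-frequency vectors."""
    return np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])
print(round(cosine_similarity(d1, d2), 2))  # 0.94, matching the slide
```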
Capturing Hidden Semantics in Similarity Measures
❑ The above similarity measures cannot capture hidden semantics
  ❑ Which pairs are more similar: geometry, algebra, music, politics?
❑ The same bag of words may express rather different meanings
  ❑ "The cat bites a mouse" vs. "The mouse bites a cat"
  ❑ This is beyond what a vector space model can handle
❑ Moreover, objects can be composed of rather complex structures and connections (e.g., graphs and networks)
❑ New similarity measures are needed to handle complex semantics
  ❑ Ex.: distributed representations and representation learning

Data Quality, Data Cleaning and Data Integration
❑ Data quality measures
❑ Data cleaning
❑ Data integration

What is Data Preprocessing? — Major Tasks
❑ Data cleaning: handle missing data, smooth noisy data, identify or remove outliers, and resolve inconsistencies
❑ Data integration: integration of multiple databases, data cubes, or files
❑ Data reduction: dimensionality reduction, numerosity reduction, data compression
❑ Data transformation and data discretization: normalization, concept hierarchy generation

Why Preprocess the Data? — Data Quality Issues
❑ Measures of data quality: a multidimensional view
  ❑ Accuracy: correct or wrong, accurate or not
  ❑ Completeness: not recorded, unavailable, …
  ❑ Consistency: some modified but some not, dangling, …
  ❑ Timeliness: timely update?
  ❑ Believability: how trustworthy is the data?
  ❑ Interpretability: how easily can the data be understood?

Data Cleaning
❑ Data in the real world is dirty: lots of potentially incorrect data, e.g., from faulty instruments, human or computer error, and transmission error
  ❑ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    ❑ e.g., Occupation = " " (missing data)
  ❑ Noisy: containing noise, errors, or outliers
    ❑ e.g., Salary = "−10" (an error)
  ❑ Inconsistent: containing discrepancies in codes or names, e.g.:
    ❑ Age = "42", Birthday = "03/07/2010"
    ❑ Rating was "1, 2, 3", now rating is "A, B, C"
    ❑ Discrepancies between duplicate records
  ❑ Intentional (e.g., disguised missing data)
    ❑ Jan. 1 as everyone's birthday?

Incomplete (Missing) Data
❑ Data is not always available
  ❑ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
❑ Missing data may be due to:
  ❑ Equipment malfunction (technical issues)
  ❑ Inconsistency with other recorded data, and thus deletion
  ❑ Data not entered due to misunderstanding
  ❑ Certain data not being considered important at the time of entry
  ❑ Failure to register the history or changes of the data
❑ Missing data may need to be inferred

How to Handle Missing Data?
❑ Ignore the tuple: usually done when the class label is missing (when doing classification) — not effective when the percentage of missing values per attribute varies considerably
❑ Fill in the missing value manually: tedious + infeasible?
❑ Fill it in automatically with:
  ❑ a global constant, e.g., "unknown" (a new class?!)
  ❑ the attribute mean
  ❑ the attribute mean for all samples belonging to the same class: smarter
  ❑ the most probable value: inference-based, such as a Bayesian formula or a decision tree
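A minimal pandas sketch of the automatic fill-in strategies above; the table and its column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 30.0, None, 40.0],
})

# Global constant
df["income_const"] = df["income"].fillna(-1)

# Attribute mean over all samples
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Attribute mean per class (the "smarter" variant): fill within each class group
df["income_class_mean"] = (df.groupby("class")["income"]
                             .transform(lambda s: s.fillna(s.mean())))
print(df)  # class A's gap gets 50.0; class B's gap gets (30 + 40) / 2 = 35.0
```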
Noisy Data
❑ Noise: random error or variance in a measured variable
❑ Incorrect attribute values may be due to:
  ❑ Faulty data collection instruments
  ❑ Data entry problems
  ❑ Data transmission problems
  ❑ Technology limitations
  ❑ Inconsistency in naming conventions
❑ Other data problems:
  ❑ Duplicate records
  ❑ Incomplete data
  ❑ Inconsistent data

How to Handle Noisy Data?
❑ Binning (reduces cardinality)
  ❑ First sort the data and partition it into (equal-frequency) bins
  ❑ Then smooth by bin means, bin medians, bin boundaries, etc.
❑ Regression
  ❑ Smooth by fitting the data into regression functions
❑ Clustering
  ❑ Detect and remove outliers: group the data, then remove the outliers
❑ Semi-supervised: combined computer and human inspection
  ❑ Detect suspicious values and check them by human inspection (e.g., deal with possible outliers)

Data Cleaning as a Process
❑ Data discrepancy detection
  ❑ Use metadata (e.g., domain, range, dependency, distribution)
  ❑ Check field overloading
  ❑ Check the uniqueness rule, consecutive rule and null rule
  ❑ Use commercial tools
    ❑ Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-check) to detect errors and make corrections
    ❑ Data auditing: analyze data to discover rules and relationships, and to detect violators (e.g., correlation and clustering to find outliers)
❑ Data migration and integration
  ❑ Data migration tools: allow transformations to be specified
  ❑ ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
❑ Integration of the two processes
  ❑ Iterative and interactive (e.g., Potter's Wheel)

Data Integration
❑ Data integration: combining data from multiple sources into a coherent store
❑ Why data integration?
  ❑ Helps reduce/avoid noise
  ❑ Gives a more complete picture
  ❑ Improves mining speed and quality
❑ Schema integration
  ❑ e.g., A.cust-id ≡ B.cust-#
  ❑ Integrate metadata from different sources
❑ Entity identification
  ❑ Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton

Handling Noise in Data Integration
❑ Detecting data value conflicts
  ❑ For the same real-world entity, attribute values from different sources differ
  ❑ Possible reasons: no reason, different representations, different scales (e.g., metric vs. British units)
❑ Resolving conflicting information
  ❑ Take the mean/median/mode/max/min
  ❑ Take the most recent value
  ❑ Truth finding: consider the source quality
❑ Data cleaning + data integration

Handling Redundancy in Data Integration
❑ Redundant data occur often when integrating multiple databases
  ❑ Object identification: the same attribute or object may have different names in different databases
  ❑ Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue (the birth date gives you the age)
❑ What's the problem? With redundant attributes, a relationship no longer has a unique expression: if X₁ and X₂ are redundant copies of X, then
  $Y = 2X \;\rightarrow\; Y = X_1 + X_2,\quad Y = 3X_1 - X_2,\quad Y = -1291X_1 + 1293X_2$
❑ Redundant attributes may be detected by correlation analysis and covariance analysis

Data Transformation
❑ Normalization
❑ Discretization
❑ Data compression
❑ Sampling

Data Transformation
❑ A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values
❑ Methods:
  ❑ Smoothing: remove noise from the data
  ❑ Attribute/feature construction: new attributes constructed from the given ones
  ❑ Aggregation: summarization, data cube construction
  ❑ Normalization: scaled to fall within a smaller, specified range
    ❑ min-max normalization
    ❑ z-score normalization
    ❑ normalization by decimal scaling
  ❑ Discretization: concept hierarchy climbing

Normalization
❑ Min-max normalization: to [new_min_A, new_max_A]
  $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
  ❑ Ex.: let income, ranging from $12,000 to $98,000, be normalized to [0.0, 1.0]; then $73,600 is mapped to
    $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
❑ Z-score normalization (μ: mean, σ: standard deviation):
  $v' = \frac{v - \mu_A}{\sigma_A}$
  ❑ Z-score: the distance between the raw score and the population mean, in units of the standard deviation
  ❑ Ex.: let μ = 54,000 and σ = 16,000; then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
❑ Normalization by decimal scaling:
  $v' = \frac{v}{10^j}$, where j is the smallest integer such that max(|v′|) < 1
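A small Python sketch of the three normalization methods, reproducing the slide's worked numbers (the function names are my own):

```python
import numpy as np

def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scaling(values):
    # j is the smallest integer such that max(|v'|) < 1
    j = int(np.floor(np.log10(np.max(np.abs(values))))) + 1
    return values / 10 ** j

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716, as on the slide
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225, as on the slide
print(decimal_scaling(np.array([-986.0, 917.0])))  # [-0.986  0.917]
```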
Simple Discretization: Binning
❑ Equal-width (distance) partitioning
  ❑ Divides the range into N intervals of equal size: a uniform grid
  ❑ If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B – A)/N
  ❑ The most straightforward, but outliers may dominate the presentation; skewed data is not handled well
❑ Equal-depth (frequency) partitioning
  ❑ Divides the range into N intervals, each containing approximately the same number of samples
  ❑ Good data scaling
  ❑ Managing categorical attributes can be tricky

Example: Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
❑ Partition into equal-frequency (equal-depth) bins:
  Bin 1: 4, 8, 9, 15
  Bin 2: 21, 21, 24, 25
  Bin 3: 26, 28, 29, 34
❑ Smoothing by bin means — each value is replaced by its bin's mean, e.g., mean(4, 8, 9, 15) = 9:
  Bin 1: 9, 9, 9, 9
  Bin 2: 23, 23, 23, 23
  Bin 3: 29, 29, 29, 29
❑ Smoothing by bin boundaries — each value is replaced by the closest bin boundary (the bin's min or max):
  Bin 1: 4, 4, 4, 15
  Bin 2: 21, 21, 25, 25
  Bin 3: 26, 26, 26, 34
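The same smoothing, sketched with numpy (assuming, as on the slide, already-sorted data split into three equal-depth bins):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)  # already sorted: 3 equal-depth bins of 4 values

# Smoothing by bin means: replace every value by its bin's (rounded) mean
by_means = np.repeat(np.round(bins.mean(axis=1)).astype(int), 4).reshape(3, 4)

# Smoothing by bin boundaries: snap each value to the nearer of bin min/max
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)   # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)  # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```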
Data Compression
❑ String compression
  ❑ There are extensive theories and well-tuned algorithms
  ❑ Typically lossless, but only limited manipulation is possible without expansion
❑ Audio/video compression
  ❑ Typically lossy compression, with progressive refinement
  ❑ Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
❑ Lossy vs. lossless compression: [Figure: original data → compressed data → original data (lossless); original data → compressed data → approximated data (lossy)]
❑ Time sequences are not audio: typically short, and they vary slowly with time
❑ Data reduction and dimensionality reduction may also be considered forms of data compression

Automatic Concept Hierarchy Generation
❑ Some hierarchies can be automatically generated based on an analysis of the number of distinct values per attribute in the data set
  ❑ The attribute with the most distinct values is placed at the lowest level of the hierarchy
  ❑ Exceptions: e.g., weekday, month, quarter, year
❑ Example:
  country: 15 distinct values
  province_or_state: 365 distinct values
  city: 3,567 distinct values
  street: 674,339 distinct values

Sampling
❑ Sampling: obtaining a small sample s to represent the whole data set N
❑ Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
❑ Key principle: choose a representative subset of the data
  ❑ Simple random sampling may have very poor performance in the presence of skew
  ❑ Develop adaptive sampling methods, e.g., stratified sampling
❑ Note: sampling may not reduce database I/Os (a page is read at a time)

Types of Sampling
❑ Simple random sampling: equal probability of selecting any particular item
  ❑ Sampling without replacement: once an object is selected, it is removed from the population
  ❑ Sampling with replacement: a selected object is not removed from the population
❑ Stratified sampling
  ❑ Partition (or cluster) the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data); see the stratified sampling sketch at the end of this section

What Is Dimensionality Reduction?
❑ Curse of dimensionality
  ❑ When dimensionality increases, data becomes increasingly sparse
  ❑ Density and the distance between points, which are critical to clustering and outlier analysis, become less meaningful
  ❑ The number of possible combinations of subspaces grows exponentially
❑ Dimensionality reduction
  ❑ Reducing the number of random variables under consideration by obtaining a set of principal variables (keep the important structure)
❑ Advantages of dimensionality reduction
  ❑ Avoid the curse of dimensionality
  ❑ Help eliminate irrelevant features and reduce noise
  ❑ Reduce the time and space required in data mining
  ❑ Allow easier visualization
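The slides describe dimensionality reduction as obtaining a set of principal variables but name no algorithm; as one hedged illustration, here is principal component analysis via scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic: 200 points in 10-D whose variance lives almost entirely in 2 directions
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(200, 10))

pca = PCA(n_components=2)  # keep 2 "principal variables"
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (200, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1.0: little structure lost
```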
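And, as promised under Types of Sampling above, a minimal stratified-sampling sketch in pandas (the strata and proportions are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "stratum": ["young"] * 70 + ["middle"] * 20 + ["senior"] * 10,
    "value": range(100),
})

# Draw ~10% from every stratum, so small groups stay proportionally represented
sample = (df.groupby("stratum", group_keys=False)
            .apply(lambda g: g.sample(frac=0.10, random_state=42)))
print(sample["stratum"].value_counts())  # 7 young, 2 middle, 1 senior
```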
