Chap2-Data_Final Exam syllabus PDF

Summary

This document is a set of lecture notes on data preprocessing. It covers various data types and data sets, including numerical matrices, crosstabs, and sequential data. Attribute types and methods for handling them are discussed, along with concepts concerning the cleaning and handling of noise, missing data, and redundancy. It also presents various normalization and discretization techniques.

Full Transcript

Chapter 2. Data, Measurements, and Data Preprocessing
❑ Data Types
❑ Statistics of Data
❑ Similarity and Distance Measures
❑ Data Quality, Data Cleaning and Data Integration
❑ Data Transformation
❑ Dimensionality Reduction
❑ Summary

Types of Data Sets: (1) Record Data
❑ Relational records: relational tables, highly structured
❑ Data matrix, e.g., numerical matrix, crosstabs
❑ Transaction data, e.g.:

  TID | Items
  ----+---------------------------
   1  | Bread, Coke, Milk
   2  | Beer, Bread
   3  | Beer, Coke, Diaper, Milk
   4  | Beer, Bread, Diaper, Milk
   5  | Coke, Diaper, Milk

❑ Document data: term-frequency vector (matrix) of text documents, e.g.:

              team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1    3     0      5     0     2      6    0     2      0        2
  Document 2    0     7      0     2     1      0    0     3      0        0
  Document 3    0     1      0     0     1      2    2     0      3        0

Types of Data Sets: (2) Graphs and Networks
❑ Transportation networks
❑ World Wide Web
❑ Molecular structures
❑ Social or information networks

Types of Data Sets: (3) Ordered Data
❑ Video data: sequence of image frames
❑ Temporal data: time-series
❑ Sequential data: transaction sequences
❑ Genetic sequence data

Types of Data Sets: (4) Spatial, Image and Multimedia Data
❑ Spatial data: maps
❑ Image data
❑ Video data

Data Objects
❑ Data sets are made up of data objects
❑ A data object represents an entity
❑ Examples:
  ❑ sales database: customers, store items, sales
  ❑ medical database: patients, treatments
  ❑ university database: students, professors, courses
❑ Also called samples, examples, instances, data points, objects, tuples
❑ Data objects are described by attributes
❑ Database rows → data objects; columns → attributes

Attributes
❑ Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object
  ❑ E.g., customer_ID, name, address
❑ Types:
  ❑ Nominal (e.g., red, blue)
  ❑ Binary (e.g., {true, false})
  ❑ Ordinal (e.g., {freshman, sophomore, junior, senior})
  ❑ Numeric (quantitative):
    ❑ Interval-scaled: 100°C is interval-scaled (no true zero)
    ❑ Ratio-scaled: 100 K is ratio-scaled, since it is twice as high as 50 K (has a true zero)
  ❑ Discrete vs. continuous attributes

Attribute Types
❑ Nominal: categories, states, or "names of things"
  ❑ Hair_color = {auburn, black, blond, brown, grey, red, white}
  ❑ marital status, occupation, ID numbers, zip codes
❑ Binary: a nominal attribute with only 2 states (0 and 1)
  ❑ Symmetric binary: both outcomes equally important, e.g., gender
  ❑ Asymmetric binary: outcomes not equally important, e.g., a medical test (positive vs. negative)
    ❑ Convention: assign 1 to the most important outcome (e.g., HIV positive)
❑ Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known
  ❑ Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types
❑ Quantity (integer or real-valued)
❑ Interval
  ❑ Measured on a scale of equal-sized units; values have order
  ❑ E.g., temperature in C° or F°, calendar dates
  ❑ No true zero-point
❑ Ratio
  ❑ Inherent zero-point
  ❑ We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K)
  ❑ E.g., temperature in Kelvin, length, counts, monetary quantities

Discrete vs. Continuous Attributes
❑ Discrete attribute
  ❑ Has only a finite or countably infinite set of values
  ❑ E.g., zip codes, professions, or the set of words in a collection of documents
  ❑ Sometimes represented as integer variables
  ❑ Note: binary attributes are a special case of discrete attributes
❑ Continuous attribute
  ❑ Has real numbers as attribute values
  ❑ E.g., temperature, height, or weight
  ❑ Practically, real values can only be measured and represented using a finite number of digits
  ❑ Continuous attributes are typically represented as floating-point variables

Statistics of Data
❑ Measuring the central tendency
❑ Measuring the dispersion of data
❑ Covariance and correlation analysis
❑ Graphic displays of basic statistics of data

Measuring the Central Tendency: (1) Mean
❑ Mean (algebraic measure), sample vs. population, where n is the sample size and N is the population size:
  Sample: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$    Population: $\mu = \frac{\sum x}{N}$
❑ Weighted arithmetic mean (multiply each value by its weight):
  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
❑ Trimmed mean: chop the extreme values (outliers) off both ends of the range (e.g., Olympic gymnastics score computation)
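The notes stop at the formulas, so here is a minimal Python sketch of the three means (numpy and scipy assumed available; the scores and weights are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical judges' scores; the last one is an outlier
scores = np.array([9.2, 9.4, 9.5, 9.3, 9.6, 4.0])
weights = np.array([1, 1, 1, 1, 1, 2])  # e.g., one judge's vote counts double

print(np.mean(scores))                      # plain mean, dragged down by the outlier
print(np.average(scores, weights=weights))  # weighted arithmetic mean
print(stats.trim_mean(scores, 0.2))         # trimmed mean: chop 20% from each end -> 9.35
```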
Measuring the Central Tendency: (2) Median
❑ Median: the middle value if there is an odd number of values, or the average of the middle two values otherwise
❑ Estimated by interpolation for grouped data:
  $\text{median} \approx L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$
  where $L_1$ is the lower limit of the median interval, $(\sum \text{freq})_l$ is the sum of the frequencies of all intervals below the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the interval width ($L_2 - L_1$)

Measuring the Central Tendency: (3) Mode
❑ Mode: the value that occurs most frequently in the data
❑ Unimodal, bimodal, trimodal, multi-modal
❑ Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$

Symmetric vs. Skewed Data
❑ Median, mean and mode of symmetric, positively skewed, and negatively skewed data
❑ [Figure: three distribution curves — symmetric (mode = median = mean); positively skewed (mode < median < mean); negatively skewed (mean < median < mode)]

Properties of Normal Distribution Curve
❑ [Figure: normal curve — the spread of the curve represents data dispersion; the center represents central tendency]

Measuring the Dispersion of Data: Variance and Standard Deviation
❑ Variance and standard deviation (sample: s, population: σ)
❑ Variance (algebraic, scalable computation); note the subtle difference between the formulae for the sample (size n) and the population (size N):
  Sample: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
  Population: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
❑ Standard deviation s (or σ) is the square root of the variance s² (or σ²)
❑ Q: Can you compute it incrementally and efficiently? (A sketch follows below.)
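One standard answer to the question above is Welford's online algorithm, which maintains the mean and the sum of squared deviations in a single pass. The slides only pose the question; this sketch is an illustrative assumption, not part of the lecture:

```python
class RunningVariance:
    """Welford's online algorithm: one pass over the data, numerically stable."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def sample_variance(self):
        return self.m2 / (self.n - 1)  # s^2; use self.m2 / self.n for sigma^2

rv = RunningVariance()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    rv.update(x)
print(rv.mean, rv.sample_variance())  # 5.0 and 32/7 ≈ 4.571
```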
Correlation between Two Numerical Variables
❑ The correlation between two variables X₁ and X₂ is the standardized covariance, obtained by normalizing the covariance by the standard deviation of each variable:
  $\rho_{12} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}$
❑ Sample correlation for two attributes X₁ and X₂:
  $\hat{\rho}_{12} = \frac{\hat{\sigma}_{12}}{\hat{\sigma}_1 \hat{\sigma}_2} = \frac{\sum_{i=1}^{n} (x_{i1} - \hat{\mu}_1)(x_{i2} - \hat{\mu}_2)}{\sqrt{\sum_{i=1}^{n} (x_{i1} - \hat{\mu}_1)^2} \sqrt{\sum_{i=1}^{n} (x_{i2} - \hat{\mu}_2)^2}}$
  where n is the number of tuples, µ₁ and µ₂ are the respective means of X₁ and X₂, and σ₁ and σ₂ are the respective standard deviations of X₁ and X₂
❑ If ρ₁₂ > 0: X₁ and X₂ are positively correlated (X₁'s values increase as X₂'s do); the higher the value, the stronger the correlation
❑ If ρ₁₂ = 0: independent (under the same assumption as discussed for covariance)
❑ If ρ₁₂ < 0: negatively correlated

Visualizing Changes of Correlation Coefficient
❑ Correlation coefficient value range: [–1, 1]
❑ [Figure: a set of scatter plots showing sets of points with correlation coefficients changing from –1 to 1]

Graphic Displays of Basic Statistical Descriptions
❑ Boxplot: graphic display of the five-number summary (including max and min); see below
❑ Histogram: the x-axis shows values, the y-axis represents frequencies; shows the distribution and displays more information than a boxplot
❑ Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i\%$ of the data are $\leq x_i$; shows all the values
❑ Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another; always compares two things
❑ Scatter plot: each pair of values is a pair of coordinates plotted as a point in the plane; shows the relationship (correlation) between two variables

Measuring the Dispersion of Data: Quartiles & Boxplots
❑ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
❑ Inter-quartile range: IQR = Q3 – Q1
❑ Five-number summary: min, Q1, median, Q3, max
❑ Boxplot: the data is represented with a box
  ❑ Q1, Q3, IQR: the ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  ❑ The median (Q2) is marked by a line within the box
  ❑ Whiskers: two lines outside the box, extended to the minimum and maximum
  ❑ Outliers: points beyond a specified outlier threshold, plotted individually
    ❑ Outlier: usually, a value more than 1.5 × IQR below Q1 or above Q3

Positively and Negatively Correlated Data
❑ [Figure: scatter plots — the left half fragment is positively correlated; the right half is negatively correlated]

Uncorrelated Data
❑ [Figure: scatter plots of uncorrelated data]

Ordinal Variables
❑ An ordinal variable can be discrete or continuous
❑ Order is important, e.g., rank (e.g., freshman, sophomore, junior, senior)
❑ Can be treated like interval-scaled:
  ❑ Replace an ordinal variable value by its rank: $r_{if} \in \{1, \ldots, M_f\}$
  ❑ Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  ❑ Example: freshman: 0; sophomore: 1/3; junior: 2/3; senior: 1
    ❑ Then distance: d(freshman, senior) = 1, d(junior, senior) = 1/3
  ❑ Compute the dissimilarity using methods for interval-scaled variables

Cosine Similarity of Two Vectors
❑ A document can be represented by a bag of terms or a long vector, with each attribute recording the frequency of a particular term (such as a word, keyword, or phrase) in the document
❑ Other vector objects: gene features in micro-arrays
❑ Applications: information retrieval, biologic taxonomy, gene feature mapping, etc.
❑ Cosine measure: if d₁ and d₂ are two vectors (e.g., term-frequency vectors), then
  $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$
  where · indicates the vector dot product and ‖d‖ is the length (Euclidean norm) of vector d

Example: Calculating Cosine Similarity
❑ Ex.: find the similarity between documents 1 and 2:
  d₁ = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d₂ = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
❑ First, calculate the vector dot product:
  d₁ · d₂ = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
❑ Then, calculate ‖d₁‖ and ‖d₂‖:
  ‖d₁‖ = √(5·5 + 0·0 + 3·3 + 0·0 + 2·2 + 0·0 + 0·0 + 2·2 + 0·0 + 0·0) = √42 ≈ 6.481
  ‖d₂‖ = √(3·3 + 0·0 + 2·2 + 0·0 + 1·1 + 1·1 + 0·0 + 1·1 + 0·0 + 1·1) = √17 ≈ 4.123
❑ Finally, calculate the cosine similarity:
  cos(d₁, d₂) = 25 / (6.481 × 4.123) ≈ 0.94
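The same computation in a few lines of Python (numpy assumed), reproducing the slide's result:

```python
import numpy as np

def cosine_similarity(d1, d2):
    """Cosine of the angle between two term-frequency vectors."""
    return np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])
print(round(cosine_similarity(d1, d2), 2))  # 0.94, matching the slide
```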
Capturing Hidden Semantics in Similarity Measures
❑ The above similarity measures cannot capture hidden semantics
  ❑ Which pairs are more similar: geometry, algebra, music, politics?
❑ The same bag of words may express rather different meanings
  ❑ "The cat bites a mouse" vs. "The mouse bites a cat"
  ❑ This is beyond what a vector space model can handle
❑ Moreover, objects can be composed of rather complex structures and connections (e.g., graphs and networks)
❑ New similarity measures are needed to handle complex semantics
  ❑ Ex.: distributed representations and representation learning

Data Quality, Data Cleaning and Data Integration
❑ Data quality measures
❑ Data cleaning
❑ Data integration

What is Data Preprocessing? — Major Tasks
❑ Data cleaning: handle missing data, smooth noisy data, identify or remove outliers, and resolve inconsistencies
❑ Data integration: integration of multiple databases, data cubes, or files
❑ Data reduction: dimensionality reduction, numerosity reduction, data compression
❑ Data transformation and data discretization: normalization, concept hierarchy generation

Why Preprocess the Data? — Data Quality Issues
❑ Measures of data quality: a multidimensional view
  ❑ Accuracy: correct or wrong, accurate or not
  ❑ Completeness: not recorded, unavailable, …
  ❑ Consistency: some modified but some not, dangling, …
  ❑ Timeliness: timely update?
  ❑ Believability: how trustworthy is the data?
  ❑ Interpretability: how easily can the data be understood?

Data Cleaning
❑ Data in the real world is dirty: lots of potentially incorrect data, e.g., from faulty instruments, human or computer error, and transmission error
  ❑ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    ❑ e.g., Occupation = " " (missing data)
  ❑ Noisy: containing noise, errors, or outliers
    ❑ e.g., Salary = "−10" (an error)
  ❑ Inconsistent: containing discrepancies in codes or names, e.g.:
    ❑ Age = "42", Birthday = "03/07/2010"
    ❑ Rating was "1, 2, 3", now rating is "A, B, C"
    ❑ Discrepancies between duplicate records
  ❑ Intentional (e.g., disguised missing data)
    ❑ Jan. 1 as everyone's birthday?

Incomplete (Missing) Data
❑ Data is not always available
  ❑ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
❑ Missing data may be due to:
  ❑ Equipment malfunction (technical issues)
  ❑ Inconsistency with other recorded data, and thus deletion
  ❑ Data not entered due to misunderstanding
  ❑ Certain data not being considered important at the time of entry
  ❑ Failure to register the history or changes of the data
❑ Missing data may need to be inferred

How to Handle Missing Data?
❑ Ignore the tuple: usually done when the class label is missing (when doing classification) — not effective when the percentage of missing values per attribute varies considerably
❑ Fill in the missing value manually: tedious + infeasible?
❑ Fill it in automatically with:
  ❑ a global constant, e.g., "unknown" (a new class?!)
  ❑ the attribute mean
  ❑ the attribute mean for all samples belonging to the same class: smarter
  ❑ the most probable value: inference-based, such as a Bayesian formula or a decision tree
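A minimal pandas sketch of the automatic fill-in strategies above; the table and its column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 30.0, None, 40.0],
})

# Global constant
df["income_const"] = df["income"].fillna(-1)

# Attribute mean over all samples
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Attribute mean per class (the "smarter" variant): fill within each class group
df["income_class_mean"] = (df.groupby("class")["income"]
                             .transform(lambda s: s.fillna(s.mean())))
print(df)  # class A's gap gets 50.0; class B's gap gets (30 + 40) / 2 = 35.0
```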
Noisy Data
❑ Noise: random error or variance in a measured variable
❑ Incorrect attribute values may be due to:
  ❑ Faulty data collection instruments
  ❑ Data entry problems
  ❑ Data transmission problems
  ❑ Technology limitations
  ❑ Inconsistency in naming conventions
❑ Other data problems:
  ❑ Duplicate records
  ❑ Incomplete data
  ❑ Inconsistent data

How to Handle Noisy Data?
❑ Binning (reduces cardinality)
  ❑ First sort the data and partition it into (equal-frequency) bins
  ❑ Then smooth by bin means, bin medians, bin boundaries, etc.
❑ Regression
  ❑ Smooth by fitting the data into regression functions
❑ Clustering
  ❑ Detect and remove outliers: group the data, then remove the outliers
❑ Semi-supervised: combined computer and human inspection
  ❑ Detect suspicious values and check them by human inspection (e.g., deal with possible outliers)

Data Cleaning as a Process
❑ Data discrepancy detection
  ❑ Use metadata (e.g., domain, range, dependency, distribution)
  ❑ Check field overloading
  ❑ Check the uniqueness rule, consecutive rule and null rule
  ❑ Use commercial tools
    ❑ Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-check) to detect errors and make corrections
    ❑ Data auditing: analyze data to discover rules and relationships, and to detect violators (e.g., correlation and clustering to find outliers)
❑ Data migration and integration
  ❑ Data migration tools: allow transformations to be specified
  ❑ ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
❑ Integration of the two processes
  ❑ Iterative and interactive (e.g., Potter's Wheel)

Data Integration
❑ Data integration: combining data from multiple sources into a coherent store
❑ Why data integration?
  ❑ Helps reduce/avoid noise
  ❑ Gives a more complete picture
  ❑ Improves mining speed and quality
❑ Schema integration
  ❑ e.g., A.cust-id ≡ B.cust-#
  ❑ Integrate metadata from different sources
❑ Entity identification
  ❑ Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton

Handling Noise in Data Integration
❑ Detecting data value conflicts
  ❑ For the same real-world entity, attribute values from different sources differ
  ❑ Possible reasons: no reason, different representations, different scales (e.g., metric vs. British units)
❑ Resolving conflicting information
  ❑ Take the mean/median/mode/max/min
  ❑ Take the most recent value
  ❑ Truth finding: consider the source quality
❑ Data cleaning + data integration

Handling Redundancy in Data Integration
❑ Redundant data occur often when integrating multiple databases
  ❑ Object identification: the same attribute or object may have different names in different databases
  ❑ Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue (the birth date gives you the age)
❑ What's the problem? With redundant attributes, a relationship no longer has a unique expression: if X₁ and X₂ are redundant copies of X, then
  $Y = 2X \;\rightarrow\; Y = X_1 + X_2,\quad Y = 3X_1 - X_2,\quad Y = -1291X_1 + 1293X_2$
❑ Redundant attributes may be detected by correlation analysis and covariance analysis

Data Transformation
❑ Normalization
❑ Discretization
❑ Data compression
❑ Sampling

Data Transformation
❑ A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values
❑ Methods:
  ❑ Smoothing: remove noise from the data
  ❑ Attribute/feature construction: new attributes constructed from the given ones
  ❑ Aggregation: summarization, data cube construction
  ❑ Normalization: scaled to fall within a smaller, specified range
    ❑ min-max normalization
    ❑ z-score normalization
    ❑ normalization by decimal scaling
  ❑ Discretization: concept hierarchy climbing

Normalization
❑ Min-max normalization: to [new_min_A, new_max_A]
  $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
  ❑ Ex.: let income, ranging from $12,000 to $98,000, be normalized to [0.0, 1.0]; then $73,600 is mapped to
    $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
❑ Z-score normalization (μ: mean, σ: standard deviation):
  $v' = \frac{v - \mu_A}{\sigma_A}$
  ❑ Z-score: the distance between the raw score and the population mean, in units of the standard deviation
  ❑ Ex.: let μ = 54,000 and σ = 16,000; then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
❑ Normalization by decimal scaling:
  $v' = \frac{v}{10^j}$, where j is the smallest integer such that max(|v′|) < 1
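A small Python sketch of the three normalization methods, reproducing the slide's worked numbers (the function names are my own):

```python
import numpy as np

def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scaling(values):
    # j is the smallest integer such that max(|v'|) < 1
    j = int(np.floor(np.log10(np.max(np.abs(values))))) + 1
    return values / 10 ** j

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716, as on the slide
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225, as on the slide
print(decimal_scaling(np.array([-986.0, 917.0])))  # [-0.986  0.917]
```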
Simple Discretization: Binning
❑ Equal-width (distance) partitioning
  ❑ Divides the range into N intervals of equal size: a uniform grid
  ❑ If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B – A)/N
  ❑ The most straightforward, but outliers may dominate the presentation; skewed data is not handled well
❑ Equal-depth (frequency) partitioning
  ❑ Divides the range into N intervals, each containing approximately the same number of samples
  ❑ Good data scaling
  ❑ Managing categorical attributes can be tricky

Example: Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
❑ Partition into equal-frequency (equal-depth) bins:
  Bin 1: 4, 8, 9, 15
  Bin 2: 21, 21, 24, 25
  Bin 3: 26, 28, 29, 34
❑ Smoothing by bin means — each value is replaced by its bin's mean, e.g., mean(4, 8, 9, 15) = 9:
  Bin 1: 9, 9, 9, 9
  Bin 2: 23, 23, 23, 23
  Bin 3: 29, 29, 29, 29
❑ Smoothing by bin boundaries — each value is replaced by the closest bin boundary (the bin's min or max):
  Bin 1: 4, 4, 4, 15
  Bin 2: 21, 21, 25, 25
  Bin 3: 26, 26, 26, 34
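The same smoothing, sketched with numpy (assuming, as on the slide, already-sorted data split into three equal-depth bins):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)  # already sorted: 3 equal-depth bins of 4 values

# Smoothing by bin means: replace every value by its bin's (rounded) mean
by_means = np.repeat(np.round(bins.mean(axis=1)).astype(int), 4).reshape(3, 4)

# Smoothing by bin boundaries: snap each value to the nearer of bin min/max
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)   # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)  # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```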
Data Compression
❑ String compression
  ❑ There are extensive theories and well-tuned algorithms
  ❑ Typically lossless, but only limited manipulation is possible without expansion
❑ Audio/video compression
  ❑ Typically lossy compression, with progressive refinement
  ❑ Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
❑ Lossy vs. lossless compression: [Figure: original data → compressed data → original data (lossless); original data → compressed data → approximated data (lossy)]
❑ Time sequences are not audio: typically short, and they vary slowly with time
❑ Data reduction and dimensionality reduction may also be considered forms of data compression

Automatic Concept Hierarchy Generation
❑ Some hierarchies can be automatically generated based on an analysis of the number of distinct values per attribute in the data set
  ❑ The attribute with the most distinct values is placed at the lowest level of the hierarchy
  ❑ Exceptions: e.g., weekday, month, quarter, year
❑ Example:
  country: 15 distinct values
  province_or_state: 365 distinct values
  city: 3,567 distinct values
  street: 674,339 distinct values

Sampling
❑ Sampling: obtaining a small sample s to represent the whole data set N
❑ Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
❑ Key principle: choose a representative subset of the data
  ❑ Simple random sampling may have very poor performance in the presence of skew
  ❑ Develop adaptive sampling methods, e.g., stratified sampling
❑ Note: sampling may not reduce database I/Os (a page is read at a time)

Types of Sampling
❑ Simple random sampling: equal probability of selecting any particular item
  ❑ Sampling without replacement: once an object is selected, it is removed from the population
  ❑ Sampling with replacement: a selected object is not removed from the population
❑ Stratified sampling
  ❑ Partition (or cluster) the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data); see the stratified sampling sketch at the end of this section

What Is Dimensionality Reduction?
❑ Curse of dimensionality
  ❑ When dimensionality increases, data becomes increasingly sparse
  ❑ Density and the distance between points, which are critical to clustering and outlier analysis, become less meaningful
  ❑ The number of possible combinations of subspaces grows exponentially
❑ Dimensionality reduction
  ❑ Reducing the number of random variables under consideration by obtaining a set of principal variables (keep the important structure)
❑ Advantages of dimensionality reduction
  ❑ Avoid the curse of dimensionality
  ❑ Help eliminate irrelevant features and reduce noise
  ❑ Reduce the time and space required in data mining
  ❑ Allow easier visualization
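The slides describe dimensionality reduction as obtaining a set of principal variables but name no algorithm; as one hedged illustration, here is principal component analysis via scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic: 200 points in 10-D whose variance lives almost entirely in 2 directions
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(200, 10))

pca = PCA(n_components=2)  # keep 2 "principal variables"
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (200, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1.0: little structure lost
```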
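And, as promised under Types of Sampling above, a minimal stratified-sampling sketch in pandas (the strata and proportions are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "stratum": ["young"] * 70 + ["middle"] * 20 + ["senior"] * 10,
    "value": range(100),
})

# Draw ~10% from every stratum, so small groups stay proportionally represented
sample = (df.groupby("stratum", group_keys=False)
            .apply(lambda g: g.sample(frac=0.10, random_state=42)))
print(sample["stratum"].value_counts())  # 7 young, 2 middle, 1 senior
```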
