Data Preprocessing
Li, Jia
DSAA 5002, The Hong Kong University of Science and Technology (Guangzhou)
2024 Spring, Jan 22

Detect Fraud Accounts
- The number of accounts is huge.
- Each account is described by a feature vector with many attributes: gender, number of friends, location, ID, age, chat timestamp, report count.
- The data are noisy, incomplete, even misleading.

What is Data?
- Data is a collection of data objects and their attributes.
- An attribute is a property or characteristic of an object. Examples: eye color of a person, temperature, etc. An attribute is also known as a variable, field, characteristic, or feature.
- A collection of attributes describes an object. An object is also known as a record, point, case, sample, entity, or instance.

Tid | Gender | Location  | No. of Friends | Fraud
1   | F      | Hongkong  | 125            | No
2   | M      | Shenzhen  | 100            | No
3   | M      | Cambodia  | 70             | No
4   | F      | Vietnam   | 120            | No
5   | M      | Beijing   | 95             | Yes
6   | M      | Hangzhou  | 60             | No
7   | F      | Guangzhou | 220            | No
8   | M      | Shanghai  | 85             | Yes
9   | F      | U.S.A.    | 75             | No
10  | F      | U.K.      | 90             | Yes

Types of Attributes
There are different types of attributes:
- Nominal. Examples: ID numbers, eye color, zip codes.
- Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}.
- Interval. Examples: calendar dates.
- Ratio. Examples: number of friends, time, report counts.

Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
- Distinctness: =, ≠
- Order: <, >
- Addition: +, −
- Multiplication: *, /
Nominal attribute: distinctness. Ordinal attribute: distinctness & order. Interval attribute: distinctness, order & addition. Ratio attribute: all four properties.

Discrete and Continuous Attributes
- Discrete attribute: has only a finite or countably infinite set of values. Examples: zip codes, counts, or the set of words in a collection of documents. Often represented as integer variables. Note: binary attributes are a special case of discrete attributes.
- Continuous attribute: has real numbers as attribute values. Examples: temperature, height, or weight. Practically, real values can only be measured and represented using a finite number of digits. Continuous attributes are typically represented as floating-point variables.

Measuring the Central Tendency of Data
- Mean (an algebraic measure):
  sample mean: x̄ = (1/n) Σ_{i=1..n} x_i;  population mean: μ = (Σ x_i) / N
  weighted arithmetic mean: x̄ = (Σ_{i=1..n} w_i x_i) / (Σ_{i=1..n} w_i)
  trimmed mean: chop off extreme values before averaging.
- Median (a holistic measure): the middle value if there is an odd number of values, or the average of the middle two values otherwise.
- Mode: the value that occurs most frequently in the data.
- Empirical formula: mean − mode ≈ 3 × (mean − median).

Symmetric vs. Skewed Data
Median, mean, and mode of symmetric, positively skewed, and negatively skewed data.
[Figure: positively skewed, symmetric, and negatively skewed distributions]

Measuring the Dispersion of Data
- Quartiles, outliers, and boxplots:
  Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  Inter-quartile range: IQR = Q3 − Q1
  Five-number summary: min, Q1, median, Q3, max
  Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
- Variance and standard deviation (sample: s, population: σ):
  s² = (1/(n−1)) Σ_{i=1..n} (x_i − x̄)² = (1/(n−1)) [Σ x_i² − (1/n)(Σ x_i)²]
  σ² = (1/N) Σ_{i=1..N} (x_i − μ)² = (1/N) Σ x_i² − μ²
  The standard deviation s (or σ) is the square root of the variance s² (or σ²).

Box Plot
Data: 3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4, 4.4, 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1
Q1 = (4.3 + 4.3) / 2 = 4.3;  Median = 4.4;  Q3 = (4.7 + 4.8) / 2 = 4.75
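A minimal Python sketch (not from the slides) that reproduces the box-plot numbers above; it uses the median-of-halves quartile convention the slide appears to follow, plus the 1.5 × IQR outlier rule.

def median(sorted_vals):
    n = len(sorted_vals)
    mid = n // 2
    if n % 2:                        # odd count: middle value
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2  # even: average the two middle values

data = sorted([3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4, 4.4,
               4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1])

m = median(data)
lower = data[:len(data) // 2]        # values below the median
upper = data[(len(data) + 1) // 2:]  # values above the median
q1, q3 = median(lower), median(upper)
iqr = q3 - q1

print("five-number summary:", data[0], q1, m, q3, data[-1])
# outlier rule from the slide: more than 1.5 * IQR beyond Q1 or Q3
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print("IQR =", round(iqr, 2), "outliers:", outliers)

For this data set no value falls more than 1.5 × IQR beyond the quartiles, so the outlier list comes back empty.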
Data Quality
- What kinds of data quality problems are there?
- How can we detect problems with the data?
- What can we do about these problems?
- Examples of data quality problems: noise and outliers, missing values, duplicate data.

Noise
- Noise refers to modification of original values.
- Examples: distortion of a person's voice when talking on a poor phone connection, and "snow" on a television screen.
[Figure: two sine waves, and the same two sine waves with noise added]

Outliers
Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.

Missing Values
- Reasons for missing values:
  Information is not collected (e.g., people decline to give their age and weight).
  Attributes may not be applicable to all cases (e.g., annual income is not applicable to children).
- Handling missing values:
  Eliminate data objects.
  Estimate missing values.
  Ignore the missing value during analysis.
  Replace with all possible values (weighted by their probabilities).

Duplicate Data
- A data set may include data objects that are duplicates, or almost duplicates, of one another.
- This is a major issue when merging data from heterogeneous sources. Example: the same person with multiple email addresses.
- Data cleaning: the process of dealing with duplicate data issues.

Data Preprocessing
- Aggregation
- Sampling
- Discretization and Binarization
- Attribute Transformation

Aggregation
- Combining two or more attributes (or objects) into a single attribute (or object).
- Purpose:
  Data reduction: reduce the number of attributes or objects.
  Change of scale: cities aggregated into regions, states, countries, etc.
  More "stable" data: aggregated data tends to have less variability.
- Example: daily login activities, aggregated from raw timestamps to the 24 hours of the day (or simply day vs. night).

Sampling
- Sampling is the main technique employed for data selection. It is often used for both the preliminary investigation of the data and the final data analysis.
- Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
- Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
- The key principle for effective sampling: using a sample will work almost as well as using the entire data set, if the sample is representative. A sample is representative if it has approximately the same property (of interest) as the original set of data.

Types of Sampling
- Simple random sampling: there is an equal probability of selecting any particular item.
  Sampling without replacement: as each item is selected, it is removed from the population.
  Sampling with replacement: objects are not removed from the population as they are selected, so the same object can be picked more than once.
- Stratified sampling: split the data into several partitions, then draw random samples from each partition.

Sample Size
[Figure: the same data set sampled at 8000, 2000, and 500 points]
What sample size is necessary to get at least one object from each of 10 groups?
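To make the "at least one object from each of 10 groups" question concrete, here is a small simulation sketch. It assumes the 10 groups are equally likely and that objects are drawn with replacement; both are illustrative assumptions, not from the slides.

import random

def prob_all_groups_covered(sample_size, n_groups=10, trials=10_000):
    hits = 0
    for _ in range(trials):
        # draw each sampled object's group uniformly at random
        seen = {random.randrange(n_groups) for _ in range(sample_size)}
        if len(seen) == n_groups:      # every group appears at least once
            hits += 1
    return hits / trials

for size in (10, 20, 30, 40, 60):
    print(size, round(prob_all_groups_covered(size), 3))

The estimated coverage probability climbs toward 1 as the sample grows; stratified sampling, by contrast, guarantees one object per group with only 10 objects.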
Mapping Data to a New Space
- Fourier transform
- Wavelet transform
[Figure: two sine waves, the same waves with noise, and their frequency-domain representation]

Discretization Using Class Labels

Discretization Without Using Class Labels
- Equal frequency
- Equal interval width
- K-means
[Figure: the original data and the three resulting discretizations]

Attribute Transformation
- A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values.
- Simple functions: x^k, log(x), e^x, |x|.
- Standardization / normalization.

Attribute Transformation: Normalization
- Min-max normalization: from [min_A, max_A] to [new_min_A, new_max_A]:
  v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
  Example: let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716.
- Z-score normalization (μ_A: mean, σ_A: standard deviation):
  v' = (v − μ_A) / σ_A
  Example: let μ = 54,000 and σ = 16,000. Then 73,600 is mapped to (73,600 − 54,000) / 16,000 = 1.225.

Similarity and Dissimilarity
- Similarity: a numerical measure of how alike two data objects are. It is higher when the objects are more alike, and often falls in the range [0, 1] or [-1, 1].
- Dissimilarity (e.g., distance): a numerical measure of how different two data objects are. It is lower when the objects are more alike. The minimum dissimilarity is often 0; the upper limit varies.
- Proximity refers to either a similarity or a dissimilarity.

Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.

Euclidean Distance
dist = sqrt( Σ_{k=1..n} (p_k − q_k)² )
where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q. Standardization is necessary if scales differ.

Minkowski Distance
The Minkowski distance is a generalization of the Euclidean distance:
dist = ( Σ_{k=1..n} |p_k − q_k|^r )^(1/r)
where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Minkowski Distance: Examples
- r = 1: city block (Manhattan, taxicab, L1 norm) distance. A common example is the Hamming distance, which is just the number of bits that differ between two binary vectors.
- r = 2: Euclidean distance.
- r → ∞: "supremum" (L_max norm, L_∞ norm) distance. This is the maximum difference between any component of the vectors.
Do not confuse r with n; all of these distances are defined for any number of dimensions.

Common Properties of a Distance
Distances, such as the Euclidean distance, have some well-known properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q (positive definiteness).
2. d(p, q) = d(q, p) for all p and q (symmetry).
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r (triangle inequality).
Here d(p, q) is the distance (dissimilarity) between points (data objects) p and q. A distance that satisfies these properties is a metric.

Common Properties of a Similarity
Similarities also have some well-known properties:
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q (symmetry).
Here s(p, q) is the similarity between points (data objects) p and q.
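As a quick illustration of the distance definitions above, a short sketch (with illustrative points, not from the slides) computing the Minkowski distance for r = 1, r = 2, and r → ∞:

import math

def minkowski(p, q, r):
    if math.isinf(r):
        # supremum (L_max) distance: maximum difference over any component
        return max(abs(a - b) for a, b in zip(p, q))
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

p, q = (0, 2), (3, 6)                # hypothetical 2-D points
print(minkowski(p, q, 1))            # 7.0  (city block)
print(minkowski(p, q, 2))            # 5.0  (Euclidean)
print(minkowski(p, q, math.inf))     # 4    (supremum)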
Cosine Similarity
If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d.

Example:
d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 ≈ 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 ≈ 2.449
cos(d1, d2) = 5 / (6.481 × 2.449) ≈ 0.3150

Correlation
- Correlation measures the linear relationship between objects.
- To compute the correlation, we standardize the data objects p and q and then take their dot product:
  p'_k = (p_k − mean(p)) / std(p)
  q'_k = (q_k − mean(q)) / std(q)
  correlation(p, q) = p' · q'

Correlation Analysis (Categorical Data)
- Χ² (chi-square) test:
  Χ² = Σ (Observed − Expected)² / Expected
- The larger the Χ² value, the more likely the variables are related.
- The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count.
- Correlation does not imply causality: the number of hospitals and the number of car thefts in a city are correlated, but both are causally linked to a third variable, the population.

Chi-Square Calculation: An Example
                         | Play chess | Not play chess | Sum (row)
Like science fiction     | 250 (90)   | 200 (360)      | 450
Not like science fiction | 50 (210)   | 1000 (840)     | 1050
Sum (col.)               | 300        | 1200           | 1500
(Numbers in parentheses are the expected counts, calculated from the data distribution in the two categories.)
Χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
This shows that like_science_fiction and play_chess are correlated in this group.

Other Similarity Metrics
- Mutual information:
  I(X; Y) = KL( P(X, Y) || P(X) P(Y) )
  where KL denotes the Kullback-Leibler divergence. It should hold the properties of nonnegativity and symmetry.
- Is mutual information a valid similarity metric?
- What is the relation between mutual information and correlation?

MI as a Loss Function
- MI involves the joint distribution, which is only tractable for discrete variables.
- Lower bound:
  I(X; Y) = KL( P(X, Y) || P(X) P(Y) ) ≥ sup_T  E_{P(X,Y)}[T] − log E_{P(X)P(Y)}[e^T]
- T is an arbitrary function that projects the joint distribution or the marginals into scalars.
- Parameterize T by a deep neural network, and estimate the joint and marginal distributions with samples.
(Belghazi et al., "Mutual Information Neural Estimation," ICML 2018.)

Deep Graph Infomax
- Maximizes the MI between the input graph and the node representations to train node embeddings in an unsupervised way.

Slides Credit
Many slides are adapted from the Lecture Notes for Chapter 2 of Introduction to Data Mining by Tan, Steinbach, and Kumar.
Other references:
- Belghazi et al. Mutual Information Neural Estimation. ICML 2018.
- Veličković et al. Deep Graph Infomax. ICLR 2019.
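As a closing illustration (not part of the original slides), a short sketch that reproduces the cosine-similarity and chi-square worked examples above in plain Python:

import math

# Cosine similarity between the two document vectors from the example.
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
dot = sum(a * b for a, b in zip(d1, d2))                 # 5
norm1 = math.sqrt(sum(a * a for a in d1))                # sqrt(42) ~ 6.481
norm2 = math.sqrt(sum(b * b for b in d2))                # sqrt(6)  ~ 2.449
print("cos(d1, d2) =", round(dot / (norm1 * norm2), 4))  # ~0.315

# Chi-square statistic for the 2x2 science-fiction / chess table.
observed = [[250, 200], [50, 1000]]
row_sums = [sum(row) for row in observed]                # 450, 1050
col_sums = [sum(col) for col in zip(*observed)]          # 300, 1200
total = sum(row_sums)                                    # 1500
chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_sums[i] * col_sums[j] / total     # 90, 360, 210, 840
        chi2 += (observed[i][j] - expected) ** 2 / expected
print("chi-square =", round(chi2, 2))                    # ~507.94 (the slide rounds to 507.93)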