Data Preprocessing Lecture Notes PDF
Document Details
German University in Cairo
Mervat Abuelkheir
Summary
This document contains lecture notes on data preprocessing. It covers topics including data types, qualitative and quantitative attributes, and statistical descriptions, such as measures of central tendency and dispersion. It also discusses handling missing values using various methods and offers examples.
Full Transcript
The German University in Cairo
DTSC 103 Data Engineering
Lecture 2 – Data Preprocessing I
Mervat Abuelkheir – [email protected]
Data Engineering - Introduction & EDA © M.Abuelkheir, GUC

Before We Process: Profiling & Stats

Data Types
○ Tabular – relational data, records, document stores and JSON, XML, CSV
○ Spatial – maps
○ Text
○ Image
○ Video
○ Graph/Network – WWW, social networks, biological networks
○ Ordered – gene sequences, time-series, video frames

Recap – Attribute Types

Qualitative Attributes
○ Categorical/Nominal – each value represents a category, code, or state, e.g. hair color, marital status, customer ID. Values can be represented as numbers (coding).
○ Binary – nominal with only two values, i.e. two states or categories: 0 or 1 (absent or present, true or false).
  Symmetric: both states are equally valuable and carry the same weight, e.g. gender.
  Asymmetric: the states are not equally important, e.g. medical test outcomes, +ve or -ve (which outcome should take 1?).
○ Ordinal – values have a meaningful order or ranking, but the magnitude between successive values is not known, e.g. professional rank, grade, size, customer satisfaction.
Note: most algorithms are designed to work with numbers, so qualitative attributes may need to be encoded into numbers.

Quantitative Attributes
○ Interval-scaled – measured on a scale of equal-size units, e.g. temperature, year. They do not have a true zero point, so values cannot be expressed as multiples of one another.
○ Ratio-scaled – have a true zero point, so a value can be expressed as a multiple of another, e.g. years of experience, weight, salary.
Note: sometimes we need to normalize quantitative data, and sometimes we need to discretize it – back to categorical!

Basic Statistical Descriptions of Data
○ Measuring central tendency
○ Measuring dispersion of data

Measuring Central Tendency

Population versus sample:
○ A population is the entire set of objects or events under study. It can be hypothetical ("all students") or concrete (all students in this class).
○ A sample is a "representative" subset of the objects or events under study, needed because it is sometimes impossible or intractable to obtain or compute with population data.
○ Why do we care? Refer to your statistics course: the sample has to be representative, and the sample mean should approximate the population mean.

For N observations of a numerical variable X: $x_1, x_2, \dots, x_N$
○ Mean – the average of the values:
  $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \dots + x_N}{N}$
○ Weighted average – a weight $w_i$ is associated with each value:
  $\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \dots + w_N x_N}{w_1 + w_2 + \dots + w_N}$
○ Problem: sensitivity to outlier values, e.g. mean salary, mean student score. A trimmed mean chops off the extreme values at both ends.
○ There is always uncertainty involved when a sample mean is used to estimate a population mean.
○ Median – the middle value in the set of ordered values. If N is odd, the median is the middle value of the ordered set; if N is even, the median is not unique, so we take the average of the two middlemost values. Expensive to compute for a large number of observations.
○ Mode – the value that occurs most frequently in the attribute. Works for both qualitative and quantitative attributes. Data can be unimodal, bimodal, or trimodal – or have no mode at all.
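These measures are simple to compute; a minimal Python sketch using the standard library (the salary figures are invented for illustration, and `statistics.multimode` returns every tied mode):

```python
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Weighted average: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, frac=0.1):
    """Mean after chopping off a fraction of the extreme values at both ends."""
    xs = sorted(values)
    k = int(len(xs) * frac)              # values to drop from each end
    return mean(xs[k:len(xs) - k] if k else xs)

salaries = [30, 32, 35, 36, 38, 40, 41, 300]     # one extreme outlier
print(mean(salaries))                  # 69.0  - dragged up by the outlier
print(median(salaries))                # 37.0  - robust to the outlier
print(trimmed_mean(salaries, 0.125))   # 37.0  - outlier chopped off
print(multimode([1, 2, 2, 3, 3]))      # [2, 3] - bimodal data
```

Note how the mean jumps to 69 because of the single value 300, while the median and the trimmed mean both stay at the "typical" 37.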
Measuring Dispersion of Data

The spread of a sample of observations measures how well the mean or median describes the sample.
For N observations of a numerical variable X: $x_1, x_2, \dots, x_N$ – first, we order the observations! Then we can compute:
○ Range – the difference between the largest and smallest values.
○ Quantiles – points taken at regular intervals of a data distribution, dividing it into (almost) equal-size consecutive sets. The most famous: percentiles (100 equal-size sets) and quartiles (4 sets).
○ Interquartile range: IQR = Q3 − Q1.
○ Five-number summary: Min, Q1, Median (Q2), Q3, Max.
○ Boxplots – a visualization of the five-number summary. The whiskers terminate at the min and max, OR at the most extreme observations within 1.5 × IQR of the quartiles:
  Lower whisker: Min, OR Q1 − 1.5 × IQR
  Upper whisker: Max, OR Q3 + 1.5 × IQR
  The remaining points are plotted individually (outliers!).
  Working example: https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/box-plot-review

Pop Quiz
Item prices at a store are:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
The total number of observations is 52.
1. Identify Q1, Q2, Q3
2. Draw the boxplot
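One way to check the quiz answer in code. Quartile conventions differ between textbooks and libraries; this sketch uses the median-of-halves rule (Q1 is the median of the lower half, Q3 the median of the upper half):

```python
from statistics import median

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]          # 52 observations

def five_number_summary(data):
    """Min, Q1, median, Q3, Max using the median-of-halves quartile rule."""
    xs = sorted(data)
    half = len(xs) // 2
    return xs[0], median(xs[:half]), median(xs), median(xs[-half:]), xs[-1]

mn, q1, q2, q3, mx = five_number_summary(prices)
iqr = q3 - q1
# Whiskers end at the most extreme observations within 1.5 * IQR of the quartiles
lower_whisker = min(x for x in prices if x >= q1 - 1.5 * iqr)
upper_whisker = max(x for x in prices if x <= q3 + 1.5 * iqr)
outliers = [x for x in prices if x < lower_whisker or x > upper_whisker]
print((mn, q1, q2, q3, mx), iqr, (lower_whisker, upper_whisker), outliers)
```

With this convention Q1 = 11, Q2 = 18, Q3 = 21, so IQR = 10. Both fences (−4 and 36) lie outside the observed range, so the whiskers simply end at 1 and 30 and no price is flagged as an outlier. An added price of 40, however, would exceed the upper fence of Q3 + 1.5 × IQR = 36 and be plotted individually as an outlier.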
○ Variance and standard deviation – indicate how spread out a data distribution is: a low SD means the observations tend to be very close to the mean, while a high SD means the data is spread out over a large range of values.
  $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \bar{x}^2, \qquad SD = \sigma$

Building a Data Profile – Task 1
○ Use WhiteRabbit (https://github.com/OHDSI/WhiteRabbit) to profile the following datasets:
  https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
  https://www.kaggle.com/hugomathien/soccer
  or a dataset of your choosing.
○ Generate a report and submit it to my email (subject: DE – Task 1 – Data Profile).
○ Think of more complex data quality features and metrics, and how you would design a tool that includes them.

A motivating case for preprocessing – Fitbit data in murder investigations:
https://www.nytimes.com/2018/10/03/us/fitbit-murder-arrest.html
https://www.wired.com/story/telltale-heart-fitbit-murder/
Data Engineering - Data Preprocessing I © M.Abuelkheir, GUC

Low-quality data will lead to low-quality and misleading analysis results – no matter how sophisticated the model is!

Data Challenges
○ Massive data
○ Curse of dimensionality (high-dimensional problems in terms of features)
○ Missing data values (sometimes not missing at random)
○ Wrong data values (which need detection and correction)
○ Data that is not factual (yet not technically wrong!), where a complicated set of factors affects user-provided values
(Source: CS109, Stanford's data science course)
Why Preprocess Data?
To improve the quality of data (its usability and reliability) and make it suitable for the requirements of the intended use.
○ Factors of data quality, and the typical causes when each is lacking:
  Accuracy – the values contained in each field of a data record should be correct and accurately represent "real world" values. Lack of accuracy is due to faulty instruments; human, computer, or transmission errors; deliberate errors; …
  Completeness – the data should contain all necessary and expected information, and the scope of each data element should be understood by the user. Lack of completeness is due to data acquired over different design phases, or to optional attributes.
  Consistency – recorded data should be the same throughout the organization and across all systems, with no conflicts. Lack of consistency is due to semantics, data types, field formats, …
  Timeliness – data should be available when it is expected and needed by the user. Lack of timeliness is due to design issues of the system, or to data source issues.
  Integrity – data should be valid across relationships; relationships connect all the data together and reduce duplicates. Lack of integrity is due to poor definitions of data relationships.
  Interpretability – how easily the data is understood. Lack of interpretability is due to a missing data dictionary.

Example 1
Two records from a pipe-delimited file:
T.Das|97336o8327|24.95|Y|–|0.0|1000
TedJ.|973-360-8779|2000|N|M|NY|1000
○ Interpretability? The fields are: name, phone number, revenue, indicator, gender, state, usage.
○ Accuracy?
○ Integrity?
○ Completeness?
○ Consistency?

Data Preparation Is Itself a Challenge!
○ Data quality problems are highly complex and context dependent: extensive domain knowledge is needed, and solutions have to be chosen case by case.
○ No single tool can solve a majority of data quality problems!
○ Do not apply data preprocessing methods manually – AUTOMATE, both for large datasets and to reuse code on similar issues.

Major Preparation Tasks That Improve Quality of Data
○ Data cleaning – filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
○ Data transformation – normalization, discretization.
○ Data reduction – obtain a reduced representation of the data set that is much smaller in volume, while producing almost the same analytical results.
○ Data integration – include data from multiple sources in the analysis, map semantic concepts, infer attributes, …

Detailed Preparation Tasks
○ Data cleaning
  Impute/remove missing values
  Detect and remove/handle noise/outliers
  Detect and remove/handle inconsistencies
○ Data transformation
  Scale attributes with varying ranges
  Normalize distributions
  Encode categorical attributes
  Discretize numerical attributes
○ Data reduction
  Partition/sample data
  Aggregate records
  Extract/select/fuse features
○ Feature engineering
  Scale, normalize, discretize and encode features
  Extract/select/fuse features
  Add new features
○ Data integration
  Handle data from multiple sources
  Detect and remove/handle inconsistencies

Data Cleaning
○ Data in the real world is dirty!
  Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data, e.g. Occupation=" " (missing data).
  Noisy: containing noise, errors, or outliers, e.g. Salary="−10" (an error).
  Intentional: Jan. 1 as everyone's birthday?
  Inconsistent: containing discrepancies in codes or names, e.g. Age="42" with Birthday="03/07/2010"; a rating that was "1, 2, 3" and is now "A, B, C"; switched fields!
  Censored/defaulted/maxed values.

Data Cleaning Involves …
… checking for and imposing semantic and syntactic validity of raw data, through:
○ filling in missing values – via missing value imputation
○ smoothing out noise and identifying outliers – via binning, regression, outlier detection, clustering, discretization
○ correcting inconsistencies in the data – via fuzzy joins, regular expressions, database profiling, metadata

But CAREFUL!
○ The methods that fix data problems and transform raw data into clean data must be reproducible.
○ A non-reproducible transformation cannot be distinguished from invention!
○ You must preserve your raw data in its original form!
○ You need a recipe – working code that tracks the fully defined operations necessary to produce your clean data from your raw data. Such a recipe can then be put into production in operational scenarios.

Missing Data Values
○ Why do we care? Why not throw away the data points with missing values?
  Loss of information (obviously).
  Potential bias in the analysis results for the remaining data.
  Why is the data missing in the first place? Patterns of missing data can actually mean something (the missing data mechanism) and will affect the choice of imputation method!
○ A missing value may not imply incomplete data! e.g. a driver's license number.
○ A zero may be a true value, or may be a default value due to attribute constraint rules, e.g. a zero-quantity bill versus a default-zero bill that has not yet been processed.

Detecting Missing Values
○ Scan for gaps in rows and columns.
○ Cross-check the data schema against the actual data for missing attributes.
○ Check data during transit for losses due to the transfer/communication process.
○ Check the history of the data source.
○ Keep track of estimated values and error bounds (counts, means, SDs), e.g. per-unit-time transactions received from a store, packets from a router, or cars per area. A deviation when receiving new records may indicate missing records.
○ How do we detect confounding defaults?

Types of Missing Values
○ Missing Completely At Random (MCAR) – there is no relationship between the missing data mechanism and any values, observed or missing. The probability of being missing is the same for all values, e.g. if a weighing scale ran out of battery, the weight attribute has values MCAR.
○ Missing At Random (MAR) – there is a systematic relationship between the propensity of missing values and the observed data, but not the missing data. The missingness can be explained by variables on which you have full information, e.g. if men are more likely to tell their weight than women, weight is MAR (what is the observed variable here?).
○ MCAR and MAR are ignorable – enough information is available in the data to allow imputing the missing values, so the missing data mechanism can be ignored.
○ Missing Not At Random (MNAR) – there is a relationship between the propensity of a value to be missing and its values. The probability of being missing varies for reasons that are unknown to us, e.g. people with the lowest education have missing values in the education attribute; the sickest people are the most likely to drop out of a medical study; a device can only measure responses above 0.5, so any value below that is missing.
○ MNAR is non-ignorable – the missingness mechanism has to be modeled explicitly as you deal with the missing data.
○ How can you assess which type of missing data you have?
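A small simulation makes the three mechanisms concrete (the gender/weight scenario, probabilities, and the scale limit are invented for illustration):

```python
import random

random.seed(0)
# Fully observed toy data: (gender, weight) for 1000 people
people = [(random.choice("MF"), random.gauss(75, 12)) for _ in range(1000)]

def mcar(data, p=0.3):
    """MCAR: every value is equally likely to go missing (e.g. dead battery)."""
    return [(g, None if random.random() < p else w) for g, w in data]

def mar(data, p=0.5):
    """MAR: missingness depends only on an observed variable (gender)."""
    return [(g, None if g == "F" and random.random() < p else w)
            for g, w in data]

def mnar(data, limit=90):
    """MNAR: missingness depends on the unobserved value itself
    (a scale that cannot report weights above `limit`)."""
    return [(g, None if w > limit else w) for g, w in data]

def observed_mean(data):
    obs = [w for _, w in data if w is not None]
    return sum(obs) / len(obs)

true_mean = observed_mean(people)
# MCAR leaves the observed mean unbiased (up to sampling noise); MNAR
# systematically drags it down, because exactly the heaviest people are missing.
print(round(true_mean, 1),
      round(observed_mean(mcar(people)) - true_mean, 2),
      round(observed_mean(mnar(people)) - true_mean, 2))
```

In this toy data MAR deletion is also harmless, because weight was generated independently of gender; with real data, where the observed variable and the missing one are correlated, MAR deletion biases the mean too, which is why the mechanism matters for choosing an imputation method.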
  Is there a pattern to the missingness? | Does the observed data explain it? | Type
  Yes | Yes | MAR
  Yes | No  | MNAR
  No  | –   | MCAR

Statistically …
Let M be the missingness indicator (M = 1 when a value is missing), $Y_{obs}$ the observed data, $Y_{mis}$ the missing data, and $\epsilon$ the parameters of the missingness mechanism:
○ MCAR: $P(M=1 \mid Y_{obs}, Y_{mis}, \epsilon) = P(M=1 \mid \epsilon)$
○ MAR: $P(M=1 \mid Y_{obs}, Y_{mis}, \epsilon) = P(M=1 \mid Y_{obs}, \epsilon)$
○ MNAR: $P(M=1 \mid Y_{obs}, Y_{mis}, \epsilon)$ cannot be simplified – the missingness depends on $Y_{mis}$ itself.

Handling Missing Values – Simplistic Methods
○ Fill in the missing values manually – time consuming, and not feasible for large datasets.
○ Use a global constant – replace all missing attribute values with the same value (e.g. "unknown", null). The analysis task may mistakenly treat "null" as an interesting concept. Be careful not to use a global constant that is a valid value of the attribute, e.g. zero for a numerical attribute whose values can in fact include zero.
○ Listwise deletion – an object is deleted if it is missing data on any attribute in the analysis. This is the default, but it is not very effective unless objects with missing values have several missing attributes, only a few objects have missing values (so you keep statistical power), and the data is MCAR.

Handling Missing Values – Imputation
○ The goal is to replace missing values with plausible values!
○ This implies a level of uncertainty in the imputed values.
○ Do not assume in the analysis that imputed values are the real values.
○ Imputation should be used with caution!
  Imputed values are good for aggregate analysis, but no individual imputed value should be trusted.

Handling Missing Values – Univariate Imputation
○ Mean/average imputation – replace each missing observation with the mean of the non-missing observations. For normal (symmetric) distributions the mean is used; a skewed distribution should employ the median. The standard error of the attribute will be underestimated (imputing the mean preserves the mean!).
○ Draw a value from the distribution of the non-missing values of the attribute, assuming the missing values follow the same distribution as the available values.
○ If only 2-3% of the data is missing, it is OK to use the above methods. Univariate imputation is a poor choice for some ML algorithms (e.g. linear regression).

Handling Missing Values – Multivariate Imputation
○ Use the mean or median of all samples belonging to the same class as the given object, e.g. the mean or median of customers in a certain age group, or of customers who default, …
○ Use the most probable (estimated) value, e.g. via regression or Bayesian inference. This may still yield too small a standard error if applied once, as the parameters do not change.

Handling Missing Values – Linear Regression
○ Regression analysis attempts to determine the strength of the relationship between two (or more) variables: a dependent variable (usually denoted by Y) and one or more changing variables (known as independent variables or predictors). In imputation, the dependent variable is the one with missing values.
○ Linear Regression (LR) – the dependent variable is continuous.
○ Multiple Regression – there are multiple independent variables.
○ Regression types other than linear: Logistic Regression (the dependent variable is binary) and Multinomial Logistic Regression (the dependent variable is categorical).

Brief Overview of the Linear Regression Model
$Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$
○ $\beta_0$ is the intercept.
○ $\beta_j$ is the slope of the j-th variable: the average increase in Y when $X_j$ increases by one and all other Xs are held constant.
○ The $\beta_j$s are the model parameters we need to find.
○ The regression error is the distance between the original data points and the regression line; these distances are called the error terms, or residuals.
○ How do we find the best regression model? Gradient descent.

Linear Regression for Imputation – Example
For attributes (weight, age, height), height is missing, so we use the regression formula
$height = \beta_0 + \beta_1 \cdot age + \beta_2 \cdot weight = 8.33 + 0.167 \cdot age + 0.1 \cdot weight$

Weight | Age | Height | Health_Index
20 | 2 | 10 | 5
15 | 5 |  9 | 3
25 | 5 | 10 | ?
20 | 4 |  ? | ?
10 | 1 |  ? | ?
(monotone missing values)
Applying the formula fills in the missing heights: 8.33 + 0.167 × 4 + 0.1 × 20 = 10.998 for the fourth row, and 8.33 + 0.167 × 1 + 0.1 × 10 = 9.497 for the fifth.
○ Since more than one attribute has missing values, use a regression formula for the attributes with missing values monotonically:
  $health\_index = \beta_0 + \beta_1 \cdot age + \beta_2 \cdot weight + \beta_3 \cdot height$
○ Regression analysis rule: for k parameters, you must have N ≥ k distinct data points.

Multivariate Imputation by Chained Equations – MICE
1. Perform a simple (e.g. mean) imputation for every missing value in the dataset; these are called "placeholders".
2. Choose one variable/attribute with missing values, and set its placeholders back to missing.
3. Regress this attribute on the other attributes, using their observed and placeholder values.
4. Impute the missing values of the attribute with the regressed values.
5. Repeat steps 2-4 for each other variable/attribute with missing values.

Weight | Age | Height | Health_Index
20 | 2 | 10  | 5
15 | 5 |  9  | 3
25 | 5 | 10  | vr3
20 | 4 | vr1 | vr4
10 | 1 | vr2 | vr5

Linear Regression for Multiple Imputation
○ An error term is generated at random from a Gaussian distribution, so you can generate multiple estimates for the missing values with randomized error components, and construct multiple datasets for multiple imputation.
○ Better yet, repeat MICE (steps 2-4) for a number of cycles (the default is 10) with different random error terms, and update the imputations after each cycle.

Thank You
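The MICE loop above can be sketched as a single chain with ordinary least squares as the regressor. This is a simplified sketch on the lecture's toy table: with so few complete rows the regressions are underdetermined, so `np.linalg.lstsq` returns a minimum-norm fit and the imputed numbers are only illustrative (a real run would also add the Gaussian error terms and repeat over multiple chains for multiple imputation):

```python
import numpy as np

# The lecture's toy table; np.nan marks missing values
# Columns: Weight, Age, Height, Health_Index
data = np.array([
    [20.0, 2.0, 10.0,    5.0],
    [15.0, 5.0,  9.0,    3.0],
    [25.0, 5.0, 10.0, np.nan],
    [20.0, 4.0, np.nan, np.nan],
    [10.0, 1.0, np.nan, np.nan],
])

def mice(data, cycles=10):
    X = data.copy()
    miss = np.isnan(X)
    # Step 1: mean-impute placeholders for every missing value
    col_means = np.nanmean(data, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]
    for _ in range(cycles):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            # Steps 2-3: fit column j on its observed rows only (placeholders
            # "set back to missing"), regressing on all other columns
            rows = ~miss[:, j]
            others = np.delete(X, j, axis=1)
            A = np.c_[np.ones(rows.sum()), others[rows]]
            beta, *_ = np.linalg.lstsq(A, data[rows, j], rcond=None)
            # Step 4: impute the missing entries of column j from the fit
            A_miss = np.c_[np.ones((~rows).sum()), others[~rows]]
            X[~rows, j] = A_miss @ beta
    return X

print(np.round(mice(data), 2))
```

Each cycle re-imputes Height and then Health_Index, so the Health_Index regression already sees the freshly imputed heights; observed cells are never overwritten.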