Discovering Knowledge in Data - Ch. 2: Data Preprocessing PDF
Document Details
BCIT
2014
Daniel T. Larose, Ph.D.
Summary
This chapter from the book, Discovering Knowledge in Data, discusses the importance of data preprocessing in data mining. It covers data cleaning, handling missing values, identifying outliers, and different methods of data transformation, with examples.
Full Transcript
Discovering Knowledge in Data, Daniel T. Larose, Ph.D.
Chapter 2: Data Preprocessing
Prepared by James Steck and Eric Flores
From Discovering Knowledge in Data: An Introduction to Data Mining, Second Edition, by Daniel Larose and Chantal Larose, John Wiley and Sons, Inc., 2014.

CRISP-DM Review
The CRISP-DM phases: Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment
◦ This chapter examines phases 2 and 3 of the CRISP-DM process
◦ Chapter 3 expands on the Data Understanding phase
◦ Chapters 4 and above focus on Modeling

Why Do We Preprocess Data?
Raw data is often unprocessed, incomplete, and noisy. It may contain:
◦ Obsolete/redundant fields
◦ Missing values
◦ Outliers
◦ Data in a form not suitable for data mining
◦ Values not consistent with policy or common sense

Why Do We Preprocess Data? (cont'd)
For data mining purposes, database values must undergo data cleaning and data transformation. Data often comes from legacy databases where values have:
◦ Not been looked at in years
◦ Expired
◦ Lost relevance
◦ Gone missing
Minimize GIGO (Garbage In → Garbage Out):
◦ IF garbage into the model is minimized, THEN garbage results out of the model are minimized
The effort for data preparation ranges around 10%-60% of the data mining process, depending on the dataset.

Data Cleaning – Example
The CustomerID field is assumed to be fine; but what about Zip Code and Gender?
Zip Code
◦ Do not assume the local format: 90210 (U.S.) vs. J2S7K7 (Canada); in a free-trade era, expect some unusual values
◦ Be aware of data type/conversion issues: zip code 06269 (Storrs, CT) stored in a numeric field loses its leading zero and is represented as 6269
Gender
◦ The value is missing for customer 1003

Data Cleaning – Example (cont'd)
Income field contains $10,000,000?
◦ Possibly valid for zip code 90210 (Beverly Hills, CA)
◦ Still considered an outlier (extreme data value); some statistical and data mining methods are negatively affected by outliers
◦ Handling of outliers is examined later in this chapter
Income field contains $99,999?
◦ The value may be valid, but other values appear rounded to the nearest $5,000
◦ Legacy systems: the value may be a database code used to denote a missing value
Income field contains -$40,000?
◦ Income less than $0? The value is beyond the bounds of expected income, therefore an error
◦ Caused by a data entry error?
◦ Discuss the anomaly with the database administrator
Other considerations for income
◦ Confirm that values are in the expected unit of measure, such as U.S. dollars
◦ Which unit of measure applies? Is the customer with zip code J2S7K7 reporting in Canadian dollars?
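The leading-zero issue above is easy to reproduce and fix. A minimal pandas sketch, assuming U.S. five-digit ZIP codes stored in a hypothetical zip_code column (Canadian codes like J2S7K7 would need separate handling):

```python
import pandas as pd

# 06269 (Storrs, CT) was truncated to 6269 when stored in a numeric field.
df = pd.DataFrame({"zip_code": [6269, 90210]})

# Store postal codes as strings and restore the leading zeros;
# zfill pads on the left with "0" up to 5 characters.
df["zip_code"] = df["zip_code"].astype(str).str.zfill(5)
print(df["zip_code"].tolist())  # ['06269', '90210']
```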
Data Cleaning – Example (cont'd)
Age field contains C
◦ Possibly a leftover of an earlier categorization of age into a bin labeled C
◦ Data mining software will likely reject a text value in an otherwise numeric field; this needs resolution
Age field contains 0 (zero)
◦ Unlikely: a newborn baby made a $1,000 transaction
◦ Most probably: a missing value or other anomalous condition coded as 0 (zero)
◦ Important: an age value quickly becomes obsolete; it is recommended to store date-type fields (like birthdate) instead, and calculate age as needed
Marital Status field
◦ What is the meaning of the symbols?
◦ Don't make assumptions: is S for Single or Separated?
◦ Consider the possibility of codes using words from another language: C is for Cold in English, but for Chaud (Hot) in French
Transaction Amount field
◦ Values in this field seem OK, assuming a common unit of measure

Handling Missing Data
◦ Missing values pose problems for data analysis methods
◦ They are more common in databases containing a large number of fields
◦ The absence of information is rarely beneficial to the analysis; in contrast, all things being equal, having more data is almost always better
◦ Careful analysis is required to handle the issue

Handling Missing Data (cont'd)
Examine the cars dataset, containing records for 261 automobiles manufactured in the 1970s and 1980s
◦ Available for download at www.dataminingconsultant.com
Suppose that some fields are missing for certain records.
Delete the records containing missing values?
◦ Dangerous, as the pattern of missing values may be systematic
◦ Valuable information in the other fields is lost
◦ As much as 80% of the records may be lost if 5% of data values are missing, according to Shmueli, Patel, and Bruce
Three alternative methods are available, none entirely satisfactory; data imputation methods are a better approach.

Handling Missing Data (cont'd)
Alternative Method #1: Replace missing values with a constant, specified by the analyst
Example:
◦ Missing numeric values replaced with 0.0
◦ Missing categorical values replaced with "Missing"

Handling Missing Data (cont'd)
Alternative Method #2: Replace missing values with the mode or mean
Example:
◦ The mode of the categorical field brand is US; missing values are replaced with this value
◦ The mean of the non-missing values in the numeric field cubicinches is 200.65; missing values are replaced with 200.65
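A minimal pandas sketch of Alternative Methods #1 and #2, assuming a DataFrame shaped like the cars data with a categorical brand field and a numeric cubicinches field (the toy values below are illustrative, not the actual dataset):

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame({
    "brand": ["US", "US", "Japan", np.nan],
    "cubicinches": [350.0, 250.0, np.nan, 113.0],
})

# Method #1: replace missing values with analyst-specified constants.
m1 = cars.fillna({"brand": "Missing", "cubicinches": 0.0})

# Method #2: replace with the mode (categorical) or the mean (numeric).
m2 = cars.fillna({
    "brand": cars["brand"].mode()[0],           # most frequent brand
    "cubicinches": cars["cubicinches"].mean(),  # mean of non-missing values
})
```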
Handling Missing Data (cont'd)
Notes on Alternative Method #2 (mode or mean):
◦ Substituting the mode or mean for missing values sometimes works well; however, the end user needs to be informed
◦ The mean is not always the best choice for a "typical" value
◦ The resulting confidence levels for statistical inference become overoptimistic (Larose), since measures of spread are artificially reduced
◦ The benefits and drawbacks of replacing missing values must be carefully weighed against the possible invalidity of the results

Handling Missing Data (cont'd)
Alternative Method #3: Replace missing values with random values
◦ Example: values for cylinders, cubicinches, and hp are randomly drawn proportionately from each field's own distribution
◦ Benefit: measures of location and spread remain closer to the original
◦ Drawback: no guarantee that the resulting records will make sense; since each field is drawn independently, the method can create a car that does not exist, such as a Japanese car with a 400-cubic-inch engine

Handling Missing Data (cont'd)
Data imputation methods
◦ Imputation of missing data asks: what is the likely value, given the record's other attribute values?
◦ Example: of the two samples below, the American car would be expected to have more cylinders
  American car with 300 cubic inches and 150 horsepower
  Japanese car with 100 cubic inches and 90 horsepower
◦ Requires tools like multiple regression, or classification and regression trees
◦ To be discussed in Chapter 13, Imputation of Missing Data
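Before moving on, here is a minimal sketch of Alternative Method #3: each missing value is filled with a random draw from that field's observed (non-missing) values, so each value is drawn in proportion to its frequency. The column names follow the cars example; the data values are illustrative. Note the drawback discussed above: because fields are imputed independently, the combinations produced may not correspond to any real car.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def impute_random(series, rng):
    """Replace NaNs with values sampled from the observed distribution."""
    out = series.copy()
    missing = out.isna()
    observed = out.dropna().to_numpy()
    # Sampling with replacement draws each observed value
    # in proportion to its frequency in the field.
    out[missing] = rng.choice(observed, size=missing.sum(), replace=True)
    return out

cars = pd.DataFrame({"cylinders": [4, 6, 8, np.nan, np.nan],
                     "hp": [90, 110, 150, 85, np.nan]})
for col in ["cylinders", "hp"]:
    cars[col] = impute_random(cars[col], rng)
```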
Identifying Misclassifications
Check the classification labels to verify that values are valid and consistent.
Example: the frequency distribution for origin of manufacture of the automobiles shows five classes: USA, France, US, Europe, and Japan.

Brand     Frequency
USA       1
France    1
US        156
Europe    46

◦ Counts of USA = 1 and France = 1?
◦ Two records are classified inconsistently with respect to origin of manufacture
◦ Maintain consistency by relabeling USA → US and France → Europe

Graphical Methods for Identifying Outliers
◦ Outliers are extreme values that go against the trend of the remaining data
◦ Outliers may represent errors in data entry
◦ Even for valid data points, certain statistical methods are very sensitive to outliers and may produce unstable results
Two graphical methods are presented.

Graphical Methods for Identifying Outliers (cont'd)
Method #1: Histogram
◦ A histogram examines the values of numeric fields
◦ Example: a histogram of vehicle weights for the cars dataset* shows one outlier in the extreme left tail, weighing several hundred pounds (192.5)
◦ Should we doubt the validity of this value? It is too light for a car
◦ Possibility: the original value was 1925 pounds; requires further investigation
* This is a slightly amended version of the cars dataset, available as cars2 on the series website.

Graphical Methods for Identifying Outliers (cont'd)
Method #2: Two-dimensional scatter plot
◦ Two-dimensional scatter plots help identify outliers in more than one variable
◦ Example: a scatter plot of mpg against weightlbs shows two possible outliers; most data points cluster together, but one car weighs only 192.5 pounds and another gets over 500 miles per gallon
◦ Important: a record may be an outlier in one particular dimension but not in another

Measures of Center and Spread
Measures of center (1/5): Introduction
◦ Estimate where the center of a particular variable lies
◦ The most common measures of center are the mean, median, and mode
◦ They are special cases of measures of location, which indicate where a numeric variable lies (examples: percentiles and quantiles)

Measures of Center and Spread (cont'd)
Measures of center (2/5): Mean
◦ The average of the valid values of a variable: add all field values and divide by the sample size
◦ Denoted x̄ (x-bar) and computed as x̄ = (Σx) / n, where Σx represents the sum of all the values and n represents the sample size

Measures of Center and Spread (cont'd)
Measures of center (3/5): Example
◦ From a table of field values, use the sum and the count (sample size) to calculate the mean

Measures of Center and Spread (cont'd)
Measures of center (4/5): Alternatives
The mean is not always ideal
◦ On extremely skewed datasets, it is less representative of the variable's center; it is also sensitive to outliers
Alternative measures of center
◦ Median: the field value in the middle when field values are sorted into ascending order
◦ Mode: the field value occurring with the greatest frequency
  Pro: can be used with either numerical or categorical data
  Con: not always associated with the variable's center

Measures of Center and Spread (cont'd)
Measures of center (5/5): Further notes
Measures of center do not always concur. Example (customer service calls):
◦ The median is 1: half the customers made one customer service call
◦ The mode is 1: the most frequent number of calls is one
◦ But the mean is 1.563 (56.3% higher than the median/mode)
◦ Caused by the mean's sensitivity to the right skewness of the data
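A small sketch contrasting the three measures of center on right-skewed data, in the spirit of the customer service calls example above (the values below are an illustrative stand-in, not the book's data):

```python
import pandas as pd

# Right-skewed toy data: a long tail of customers with many calls.
calls = pd.Series([1] * 10 + [2] * 4 + [3] * 2 + [4, 5])

print(calls.mean())     # ~1.83: pulled upward by the right tail
print(calls.median())   # 1.0
print(calls.mode()[0])  # 1
```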
Measures of Center and Spread (cont'd)
Measures of spread (1/5): Introduction
Measures of location are not enough to summarize a variable.
Example: the table below shows P/E ratios for two portfolios
◦ Portfolio A: spread out, with one very low and one very high value
◦ Portfolio B: tightly clustered around the center
◦ The P/E ratios of the two portfolios are distinctly different, yet both have mean 10, median 11, and mode 11
Clearly, measures of center do not provide a complete picture. Measures of spread, or measures of variability, complete the picture by describing how spread out the data values of each portfolio are.

Stock Portfolio A: 1, 11, 11, 11, 16
Stock Portfolio B: 7, 8, 11, 11, 13

Measures of Center and Spread (cont'd)
Measures of spread (2/5): Typical measures of variability
◦ Range (maximum − minimum)
◦ Standard deviation: sensitive to the presence of outliers (because of the squaring involved; see below)
◦ Mean absolute deviation: preferred in situations involving extreme values
◦ Interquartile range
The sample standard deviation is defined by

s = sqrt( Σ(x − x̄)² / (n − 1) )

◦ Interpreted as the "typical" distance between a field value and the mean
◦ Most field values lie within two standard deviations of the mean
◦ Example: most customers made their calls within 2(1.315) = 2.63 of the mean of 1.563 calls; in other words, between −1.067 and 4.193 calls, which rounded to integers is 0 to 4 calls
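A sketch computing the spread measures above for the two P/E portfolios, using the values from the table. Both portfolios share mean 10, median 11, and mode 11, but their variability differs sharply:

```python
import pandas as pd

a = pd.Series([1, 11, 11, 11, 16])  # Portfolio A: spread out
b = pd.Series([7, 8, 11, 11, 13])   # Portfolio B: tightly clustered

for name, s in [("A", a), ("B", b)]:
    rng_ = s.max() - s.min()                   # range
    sd = s.std(ddof=1)                         # sample standard deviation
    mad = (s - s.mean()).abs().mean()          # mean absolute deviation
    iqr = s.quantile(0.75) - s.quantile(0.25)  # interquartile range
    print(f"Portfolio {name}: range={rng_}, sd={sd:.2f}, mad={mad}, iqr={iqr}")

# Portfolio A: range=15, sd=5.48, mad=3.6 -- far more spread than
# Portfolio B: range=6,  sd=2.45, mad=2.0 -- despite identical centers.
```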
Data Transformation
◦ Variables tend to have ranges that differ from each other
◦ In baseball, two fields may have ranges: batting average [0.0, 0.400] and number of home runs [0, 70]
◦ Some data mining algorithms are adversely affected by differences in variable ranges: variables with greater ranges tend to have a larger influence on the model's results
◦ Therefore, numeric field values should be normalized; standardizing scales the effect each variable has on the results
◦ Neural networks and other algorithms that make use of distance measures benefit from normalization
◦ Two of the prevalent methods will be reviewed; in the following pages, X* refers to the normalized form of variable X

Min-Max Normalization
◦ Determines how much greater a field value is than the minimum value for the field, and scales this difference by the field's range:

X* = (X − min(X)) / range(X) = (X − min(X)) / (max(X) − min(X))

◦ Figure 2.8 shows summary statistics for the weightlbs field: min = 1613, max = 4997

Min-Max Normalization (cont'd)
Find the min-max normalization for cars weighing 1613, 3384, and 4997 pounds, respectively, where min(X) = 1613 and max(X) = 4997:
◦ Ultra-light vehicle (X = 1613): X* = (1613 − 1613) / (4997 − 1613) = 0. The minimum value of the variable has a min-max normalization of zero.
◦ Mid-range vehicle (X = 3384): X* = (3384 − 1613) / (4997 − 1613) ≈ 0.52. A weight roughly halfway between the lightest and the heaviest vehicle has a min-max normalization near 0.5.
◦ Heaviest vehicle (X = 4997): X* = (4997 − 1613) / (4997 − 1613) = 1. The heaviest vehicle in the dataset has a min-max normalization of one.
Min-max normalized values always lie between 0 and 1.

Z-score Standardization
◦ Widely used in statistical analysis
◦ Takes the difference between the field value and the field mean, and scales this difference by the field's standard deviation:

X* = (X − mean(X)) / SD(X)

◦ Figure 2.8 shows that the mean and standard deviation of weight equal 3005.49 and 852.646, respectively

Z-score Standardization (cont'd)
Find the z-score standardization for cars weighing 1613, 3384, and 4997 pounds, respectively, where mean(X) = 3005.49 and SD(X) = 852.646:
◦ Ultra-light vehicle (X = 1613): X* = (1613 − 3005.49) / 852.646 ≈ −1.63. Data values below the mean have negative z-scores.
◦ Mid-range vehicle (X = 3384): X* = (3384 − 3005.49) / 852.646 ≈ 0.44. Values near the mean have z-scores near zero; a value falling exactly on the mean has a z-score of zero.
◦ Heaviest vehicle (X = 4997): X* = (4997 − 3005.49) / 852.646 ≈ 2.34. Data values above the mean have positive z-scores.
It is also possible to find the data value associated with a given z-score (see appendix).

Decimal Scaling
◦ Ensures that normalized values lie between −1 and 1
◦ Defined as X* = X / 10^d, where d represents the number of digits in the data value with the largest absolute value
◦ For the weight data, the largest absolute value is |4997| = 4997, with d = 4 digits
◦ Decimal scaling for the minimum and maximum weights:
  Min: X* = 1613 / 10^4 = 0.1613
  Max: X* = 4997 / 10^4 = 0.4997

Transformations to Achieve Normality
◦ Some data mining algorithms and statistical methods require normally distributed variables
◦ The normal distribution is a continuous probability distribution known as the "bell curve" (symmetric), centered at the mean µ (mu) with spread given by σ (sigma)
◦ The standard normal Z-distribution has µ = 0 and σ = 1

Transformations to Achieve Normality (cont'd)
◦ Common misconception: z-score standardized variables follow the standard normal Z-distribution. THIS IS NOT CORRECT
◦ Z-score standardized variables do have µ = 0 and σ = 1, but the distribution may still be skewed (not symmetric)
◦ Compare the histograms of the original and z-standardized weight data: both are right-skewed (not symmetric)
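A sketch of the three rescalings above, plugging in the weightlbs summary statistics quoted from Figure 2.8 (min 1613, max 4997, mean 3005.49, SD 852.646):

```python
import numpy as np

weights = np.array([1613.0, 3384.0, 4997.0])
w_min, w_max = 1613.0, 4997.0
w_mean, w_sd = 3005.49, 852.646

minmax = (weights - w_min) / (w_max - w_min)  # [0.0, ~0.52, 1.0]
zscore = (weights - w_mean) / w_sd            # [~-1.63, ~0.44, ~2.34]

d = len(str(int(np.abs(weights).max())))      # d = 4 digits in |4997|
decimal = weights / 10**d                     # [0.1613, 0.3384, 0.4997]
```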
Transformations to Achieve Normality (cont'd)
A statistic for measuring the skewness of a distribution:

skewness = 3(mean − median) / standard deviation

◦ Right-skewed data: skewness is positive, as the mean is greater than the median
◦ Left-skewed data: the mean is smaller than the median, generating negative values
◦ Perfectly symmetric data: the mean, median, and mode are equal, so skewness is zero
◦ Most real-world data, including most financial data, is right-skewed (positive)
◦ Left-skewed (negative) data is not as common; it occurs when data is right-censored, such as test scores on an easy test (which cannot exceed 100)

Transformations to Achieve Normality (cont'd)
Example: calculate the skewness statistic for the weight and weight_z (z-standardized weight) variables.
For weight:
skewness = 3(3005.490 − 2835) / 852.646 ≈ 0.60
For weight_z:
skewness = 3(0 − (−0.20)) / 1 ≈ 0.60
Again, z-score standardization has no effect on skewness.

Transformations to Achieve Normality (cont'd)
To eliminate skewness, we must apply a transformation to the data. This makes the data symmetric and "more normally distributed". Common transformations:
◦ Natural log: ln(weight)
◦ Square root: sqrt(weight)
◦ Inverse square root: 1 / sqrt(weight)
The square root transformation somewhat reduces skewness, while ln reduces it even further.

Transformations to Achieve Normality (cont'd)
Example #1: apply the square root and natural log transformations to the weight data.
For sqrt(weight):
skewness(sqrt(weight)) = 3(54.280 − 53.245) / 7.709 ≈ 0.40
For ln(weight):
skewness(ln(weight)) = 3(7.968 − 7.950) / 0.284 ≈ 0.19

Transformations to Achieve Normality (cont'd)
Example #2: apply the inverse square root transformation to the weight data.
For inverse_sqrt(weight):
skewness(inverse_sqrt(weight)) = 3(0.019 − 0.019) / 0.003 = 0
◦ Important: there is nothing special about the inverse square root transformation; it just happened to work for the skewness in the weight data
◦ Notice that while we have achieved symmetry, we have not reached normality: the histogram of inverse_sqrt(weight) with a normal distribution curve overlaid shows that the distribution does not match the normal curve
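A sketch of the skewness statistic and the transformations above. Since the actual cars weights are not reproduced here, a synthetic right-skewed sample stands in for the weight variable:

```python
import numpy as np

def larose_skewness(x):
    """Skewness statistic: 3 * (mean - median) / standard deviation."""
    return 3 * (x.mean() - np.median(x)) / x.std(ddof=1)

rng = np.random.default_rng(0)
# Right-skewed stand-in for the 261 vehicle weights (mean ~3000).
weight = rng.gamma(shape=9.0, scale=330.0, size=261)

for name, t in [("original", weight),
                ("sqrt", np.sqrt(weight)),
                ("ln", np.log(weight)),
                ("inverse sqrt", 1 / np.sqrt(weight))]:
    print(f"{name:>12}: skewness = {larose_skewness(t):+.2f}")
# Each successive transformation pulls the skewness closer to zero,
# mirroring the pattern in the worked examples above.
```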
Transformations to Achieve Normality (cont'd)
After achieving symmetry, we must also check for normality.
The normal probability plot
◦ Plots the quantiles of a particular distribution against the quantiles of the standard normal distribution
◦ Similar to a percentile, the pth quantile of a distribution is the value x_p such that p% of the distribution's values are less than or equal to x_p
◦ If the bulk of the points fall on a straight line, the distribution is normal; systematic deviations indicate nonnormality
◦ As expected, the normal probability plot for inverse_sqrt(weight) shows systematic deviations that indicate nonnormality, whereas the plot for normally distributed data falls along a straight line
◦ While normality was not achieved, algorithms requiring normality usually do fine when supplied with data that is symmetric and unimodal

Transformations to Achieve Normality (cont'd)
Detransformation: after completing the analysis, the results must be "de-transformed" back to the original scale.
Example for the inverse square root:
◦ Transformation: y = 1 / sqrt(x)
◦ Detransformation: x = 1 / y²
Results provided by an algorithm on the transformed scale must be converted back using the detransformation formula.

Numerical Methods for Identifying Outliers
Using z-score standardization to identify outliers
◦ Outliers are values with a z-score either less than −3 or greater than 3
◦ Values much beyond the range [−3, 3] require further investigation to determine their validity
◦ Outliers should not automatically be omitted from the analysis
◦ For example, in the vehicle weight data: the vehicle with the minimum weight, 1613 pounds, has z-score −1.63; the vehicle with the maximum weight, 4997 pounds, has z-score 2.34. Neither z-score lies outside the [−3, 3] range, so we conclude there are no outliers among the vehicle weights
◦ However, the mean and standard deviation are both sensitive to the presence of outliers: both appear in the formula for z-score standardization, so adding or deleting an outlier changes them
◦ When selecting a method for evaluating outliers, we should not use measures that are themselves sensitive to outliers
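A sketch of the |z| > 3 guideline above, with the caveat already noted: the mean and SD used here are themselves outlier-sensitive. The data is a synthetic stand-in for the vehicle weights, with the suspicious 192.5-pound car appended:

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Boolean mask of values whose z-score magnitude exceeds threshold."""
    z = (x - x.mean()) / x.std(ddof=1)
    return np.abs(z) > threshold

rng = np.random.default_rng(1)
# 260 plausible weights plus the suspicious 192.5-pound record.
x = np.append(rng.normal(3000, 850, size=260), 192.5)
print(x[zscore_outliers(x)])  # flags the 192.5 car (z around -3.2)
```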
Numerical Methods for Identifying Outliers (cont'd)
Using the interquartile range (IQR) to identify outliers
◦ A robust statistical method, less sensitive to the presence of outliers
◦ The data is divided into four quartiles, each containing 25% of the data:
  First quartile (Q1): 25th percentile
  Second quartile (Q2): 50th percentile (the median)
  Third quartile (Q3): 75th percentile
  Fourth quartile (Q4): 100th percentile
◦ The IQR, defined as IQR = Q3 − Q1, is a measure of variability that represents the spread of the middle 50% of the data
◦ A data value is defined as an outlier if it lies 1.5 × IQR or more below Q1, or 1.5 × IQR or more above Q3
◦ Example: a set of test scores has 25th percentile Q1 = 70 and 75th percentile Q3 = 80, so 50% of the test scores fall between 70 and 80, and IQR = 80 − 70 = 10
◦ Test scores are identified as outliers if they are lower than Q1 − 1.5(IQR) = 70 − 1.5(10) = 55, or higher than Q3 + 1.5(IQR) = 80 + 1.5(10) = 95
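A sketch of the IQR fences, applied to illustrative test scores (the exact quartile estimates depend on the percentile method used, so they will approximate, not exactly match, the Q1 = 70, Q3 = 80 example above):

```python
import numpy as np

def iqr_fences(x):
    """Return the (lower, upper) outlier fences Q1 - 1.5*IQR, Q3 + 1.5*IQR."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

scores = np.array([52, 70, 72, 75, 75, 78, 80, 80, 85, 98])
low, high = iqr_fences(scores)
outliers = scores[(scores < low) | (scores > high)]
print(outliers)  # with these illustrative scores, 52 and 98 fall outside
```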
Flag Variables
◦ Some analytical methods require predictors to be numeric; regression, for instance, requires recoding a categorical variable into one or more flag variables
◦ A flag variable (also known as a dummy or indicator variable) is a variable taking only two values: 0 or 1
◦ Example: the categorical variable sex can be converted as follows: if sex = female, then sex_flag = 0; if sex = male, then sex_flag = 1
◦ If a categorical variable has k ≥ 3 possible values, define k − 1 dummy variables; the unassigned category (the one for which no flag is created) is taken as the reference category

Flag Variables (cont'd)
For example, for a variable region having k = 4 possible values {north, east, south, west}, define the following k − 1 = 3 flag variables (see the sketch at the end of this section):

Flag name    IF region =    then             otherwise
north_flag   north          north_flag = 1   north_flag = 0
east_flag    east           east_flag = 1    east_flag = 0
south_flag   south          south_flag = 1   south_flag = 0

◦ A flag variable for west is not needed, since region = west is identified when all three flag variables are zero (0)
◦ Including a fourth flag variable will cause some algorithms to fail, because of the singularity of the regression matrix, for instance
◦ The unassigned category becomes the reference category
◦ For example, if in a regression the coefficient for north_flag equals $1000, then the estimated income for region = north is $1000 greater than for region = west, when all other predictors are held constant

Transforming Categorical Variables into Numerical Variables
Why not transform the categorical variable region into a single numerical variable? For example:

Region    Region_num
North     1
East      2
South     3
West      4

This is a common and hazardous error. The algorithm would then assume that:
◦ The four regions are ordered: west > south > east > north
◦ The coding imposes artificial distances; for instance, west appears three times farther from north than east does
This practice should be avoided, except with categorical variables that are clearly ordered, such as a variable survey_response with values always, usually, sometimes, never.
Even then, careful consideration should be given to the actual values. Should never, sometimes, usually, always be numbered as 1, 2, 3, 4, or as 0, 1, 2, 3 (since 0 naturally expresses that never means none)? And what if the relative distance between categorical values is not constant?

Binning Numerical Variables
◦ Some algorithms require categorical predictors, so continuous predictors are partitioned into bins or bands
◦ Example: the numerical variable house value partitioned into low, medium, or high
Four common methods (a code sketch of methods 1-3 appears at the end of this section):

Method                    Description                                                     Notes
1. Equal width binning    Divides the predictor into k categories of equal width,         Not recommended, since the width of the
                          where k is chosen by the client/analyst                          bins can be affected by the presence of outliers
2. Equal frequency        Divides the predictor into k categories, each having n/k         Assumes that each category is equally
   binning                records, where n is the total number of records                  likely, which is not warranted
3. Binning by clustering  Uses a clustering algorithm, like k-means clustering             Methods 3 and 4 are preferred
                          (Chapter 10), to automatically calculate an "optimal" partitioning
4. Binning based on       Methods 1 to 3 ignore the target variable; this method
   predictive value       partitions the numerical predictor based on the effect each
                          partition has on the value of the target variable (see Chapter 3)

Binning Numerical Variables (cont'd)
Example: discretize X = {1,1,1,1,1,2,2,11,11,12,12,44} into k = 3 categories.

Method               Low                          Medium                      High
a. Equal width       0 ≤ X < 15: contains all     15 ≤ X < 30: contains       30 ≤ X < 45: contains the
                     values except one            no data                     single outlier
b. Equal frequency   First four data values       Next four data values       Last four data values
                     {1,1,1,1}                    {1,2,2,11}                  {11,12,12,44}
c. k-means           {1,1,1,1,1,2,2}              {11,11,12,12}               {44}
   clustering

◦ How is it that in equal frequency binning the values {1,1,1,1,1} are split into two categories? Equal values should belong to the same category
◦ As the partition above illustrates, k-means clustering identifies apparently intuitive partitions

Reclassifying Categorical Variables
◦ The categorical equivalent of binning numerical variables
◦ Algorithms like logistic regression and the C4.5 decision tree are suboptimal with too many categorical values; reclassification reduces the number of values in a categorical field
◦ Example: the variable state (50 values) becomes the variable region {Northeast, Southeast, North Central, Southwest, West}; instead of 50 values, the analyst/algorithm handles only 5
◦ Alternatively, state could be converted into economic_level, with values {richer states, midrange states, poorer states}
◦ The data analyst should select the reclassification that fits the business/research problem

Adding an Index Field
◦ Adding an index field is recommended; it tracks the sort order of the records in the database
◦ Data mining data is partitioned at least once, and an index helps rebuild the dataset in its original order
◦ In IBM SPSS Modeler, use the @Index function in a Derive node to create an index field

Removing Variables That Are Not Useful
Some variables will not help the analysis:
◦ Unary variables take only a single value (a constant). Example: in an all-girls private school, the variable sex will always be female, and thus has no effect on the data mining algorithm
◦ Variables that are very nearly unary: some algorithms will treat these as unary, and the analyst should consider whether to remove them. Example: in a team with 99.95% females and 0.05% males, the variable sex is nearly unary
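Here is the sketch promised above, covering two of the recodings from this section: k − 1 flag variables for the region example, and binning methods 1-3 applied to X = {1,1,1,1,1,2,2,11,11,12,12,44}. It uses pandas and scikit-learn; the data is taken from the examples, and k-means may label its clusters in any order:

```python
import pandas as pd
from sklearn.cluster import KMeans

# --- Flag (dummy) variables: k = 4 regions -> k - 1 = 3 flags ----------
region = pd.Series(["north", "east", "south", "west"], name="region")
flags = pd.get_dummies(region, prefix="region")
flags = flags.drop(columns="region_west")  # west = the reference category
# (pd.get_dummies(..., drop_first=True) drops the alphabetically first
#  category instead, which may not be the reference you intend.)

# --- Binning X into k = 3 categories -----------------------------------
x = pd.Series([1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44])

equal_width = pd.cut(x, bins=3)  # method 1: three bins of equal width

# Method 2: ranking first breaks ties so qcut can form three
# equal-count bins of four records each -- which is exactly how the
# five 1s end up split across two categories, as noted above.
equal_freq = pd.qcut(x.rank(method="first"), q=3,
                     labels=["low", "medium", "high"])

# Method 3: k-means clustering on the single numeric column.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = km.fit_predict(x.to_numpy().reshape(-1, 1))
# Expected grouping: {1,1,1,1,1,2,2}, {11,11,12,12}, {44}
```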
Variables That Should Probably Not Be Removed
Variables with 90% or more missing values
◦ Consider that there may be a pattern in the missingness; imputation becomes challenging
◦ Example: the variable donation_dollars in a self-reported survey. The top 10% of donors might report their donations while the others do not, so the reported 10% is not representative
◦ It is preferable to construct a flag variable, donation_flag, since the missingness itself might have predictive power
◦ If there is reason to believe that the 10% is representative, then proceed to imputation using regression or a decision tree (Chapter 13)

Variables That Should Probably Not Be Removed (cont'd)
Strongly correlated variables
◦ Important information might be discarded when removing correlated variables
◦ Example: the variables precipitation and attendance at the beach are negatively correlated
◦ Correlated variables might double-count an aspect of the analysis or cause instability in the model results, prompting the analyst to remove one variable
◦ Instead, perform principal components analysis, which converts the correlated variables into a set of uncorrelated principal components

Removal of Duplicate Records
◦ Records might have been inadvertently copied, creating duplicates; duplicate records lead to overweighting of their data values, and therefore should be removed
◦ Example: if a record's ID field is duplicated, remove the duplicate record
◦ But consider genuine duplicates: when the number of records exceeds the number of all possible combinations of field values, genuine duplicates will occur

A Word About ID Fields
◦ ID fields have a different value for each record
◦ They might be harmful, with the algorithm finding spurious relationships between the ID field and the target
◦ Recommendation: filter ID fields out of the data mining algorithm, but do not remove them from the data, so that the analyst can still differentiate the records (a closing sketch follows)
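A closing sketch for the last two points: drop exact duplicate records, and keep the ID field for traceability while excluding it from the modeling inputs. The DataFrame and column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"customer_id": [1001, 1002, 1002, 1003],
                   "income": [75000, 42000, 42000, 38000]})

# Remove records that are exact copies (here the repeated customer 1002),
# so duplicated values are not overweighted by the algorithm.
df = df.drop_duplicates()

# Keep customer_id in the data so records stay distinguishable, but
# exclude it from the predictors handed to the algorithm, preventing
# spurious ID/target relationships.
predictors = df.drop(columns=["customer_id"])
```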