Full Transcript

MACHINE LEARNING
DATA PREPROCESSING
Lukas De Kerpel ([email protected])
Prof. dr. Dries Benoit ([email protected])

OVERVIEW
1 Introduction
2 Data loading
3 Data exploration
4 Data splitting
5 Data cleaning
6 Feature engineering
7 Class imbalance
8 Summary
9 References
10 To do

INTRODUCTION

DATA TYPES I
Tabular data for most business classification, regression and clustering tasks

profile_id  name                  age  sex  occupation  email
1           Stereotypical Barbie  19   f    doll        [email protected]
2           Ken                   21   m    just beach  [email protected]

Time series data for forecasting, anomaly detection and trend analysis

DATA TYPES II
Textual data for sentiment analysis, text classification or natural language processing
Graph data for social network analysis, node classification or link prediction

DATA TYPES III
Image/video data for computer vision tasks (e.g. object detection or segmentation)
Geospatial data for location-based or spatial analysis

WHY DATA PREPROCESSING?
- ML: fitting models to data using various algorithms
- In practice: raw data is rarely ideal for learners
- Data preprocessing: we transform the raw data into useful features before feeding it to the learners
- In short: better data representations, better models!

Raw data → Data loading → Data exploration → Data splitting → Data cleaning → Feature engineering → Class imbalance → Base table

DATA PREPROCESSING IN PRACTICE
Data scientists typically seek the best combination of data preprocessing techniques, such as
- Scaling or other numerical transformations
- Encoding (converting categorical features into numerical ones)
- Feature selection and engineering
- Handling missing and imbalanced data
- Dimensionality reduction (e.g. PCA)
Often done empirically via cross-validation. However, make sure there is no data leakage!
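The leakage warning above can be pictured with a minimal R sketch. It assumes a hypothetical numeric feature `height` with made-up values and a simple random train/test split; none of these names or numbers come from the slides. The only point it makes is that the preprocessing statistics are estimated on the training part and then reused, unchanged, on the held-out part.

```r
# Hypothetical toy data (assumption): one numeric feature, 10 observations
set.seed(42)
height <- c(60, 62, 65, 67, 68, 70, 71, 72, 74, 76)

# Split first: 7 observations for training, 3 held out for testing
train_idx <- sample(seq_along(height), size = 7)
train <- height[train_idx]
test  <- height[-train_idx]

# Estimate the preprocessing statistics on the training set only
mu    <- mean(train)
sigma <- sd(train)

# Apply the same training-based transformation to both sets;
# the test set never contributes to mu or sigma
train_scaled <- (train - mu) / sigma
test_scaled  <- (test - mu) / sigma
```

Computing `mu` and `sigma` on the full `height` vector before splitting would let information from the test observations leak into the training features, which is the situation the slide warns against.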

ILLUSTRATIVE EXAMPLE: OKCUPID PROFILE DATA
Data set
- user profile data for 59,946 users from the San Francisco (US) area
- rich array of categorical, numerical and text variables:
  - typical user information (such as age, height, sex, etc.)
  - lifestyle variables (such as diet, drinking habits, smoking habits, etc.)
  - text responses to 9 essay questions related to an individual's interests and personal descriptions
- messy data: many suspicious and missing values
Goal
- predict whether a person's profession is in the STEM fields (science, technology, engineering, and math)
- class imbalance: only 18.5% of profiles work in this area

DATA LOADING

DATA IMPORT
How to read datasets:
- Code labs: load the dataset from a library with the data() function
- Real-life project: read the dataset from an external file, e.g. a csv file (comma-separated values)

  profile_id,age,height,sex,orientation,status,religion,pets
  1,27,74,"m","gay","single","agnosticism","likes_dogs"
  2,34,59,"f","straight","available","christianity","likes_dogs_and_cats"

R function:
- read.csv(): reads a file in csv format and returns a data frame

  profiles [...]

[...] 3
Graphically:
- Histograms
- Boxplots: outlier if observation outside whiskers of box
Detecting multivariate outliers: advanced
Figure: Z-score
Figure: Boxplot
Note
Only apply this to the training set

OUTLIERS: HANDLING
Handling:
- Invalid observations: treat outlier as missing value
- Valid observations: truncation → impose lower and upper limits on values (+ indicator), based on...
  - Z-score: |z_i| = 3
  - Expert opinion
Note
- Nonparametric models (e.g. DT, NN, SVM) often insensitive to outliers
- Parametric models (e.g.
LinR, LogR) often sensitive to outliers

OKCUPID: HEIGHT MALES
- Invalid observation: height of 1 inch → missing value
- Valid observation: height of 90 inches → truncation

STRING MANIPULATION: REGULAR EXPRESSIONS
Complex strings can often be broken into useful pieces
Examples:
- Social security numbers: '70.03.10-192.64'
- Phone numbers: '0032 9 264 79 27'
- Street addresses: 'Tweekerkenstraat 2, 9000 Ghent, Belgium'
- Email address: '[email protected]'
- Dates and times: '2012-06-28-20-30'
Regular expressions (regex):
- Pattern describing a certain amount of text
- Powerful way to find and replace strings that take a defined format
- Regex tester: https://regex101.com

REGEX: LITERAL CHARACTERS
Literal characters:
- Most basic regular expression
- It matches the first occurrence of the character in the string

Regex  String         Match
a      Jack is a boy  a

Metacharacters:
- Twelve characters: \, ^, $, ., |, ?, *, +, (, ), [, {
- Special meaning in regular expressions
- Use as literal: escape with backslash \

Regex   String  Match
1\+1=2  1+1=2   1+1=2

REGEX: CHARACTER CLASSES
- A character class matches only one out of several characters
- Use a hyphen inside a character class to specify a range of characters (e.g. [0-9])
- You can use more than one range (e.g. [0-9a-fA-F]) or combine ranges and single characters (e.g. [0-9a-fxA-FX])

Regex    String  Match
gr[ae]y  gray    gray
gr[ae]y  grey    grey
gr[ae]y  graay   NO MATCH
gr[ae]y  graey   NO MATCH

REGEX: SHORTHAND CHARACTER CLASSES
- \d matches a single character that is a digit
- \w matches a 'word character' (alphanumeric characters plus underscore)
- \s matches a whitespace character (includes tabs and line breaks)

Regex   String  Match
\w\s\d  A 6     A 6

REGEX: DOT CHARACTER & ANCHORS
The dot .
matches a single character, except the line break character

Regex  String  Match
gr.y   gray    gray
gr.y   grey    grey
gr.y   gr%y    gr%y

Anchors match a position, not characters
- ^ matches at the start of the string
- $ matches at the end of the string

Regex  String  Match
^a     abc     a
^b     abc     NO MATCH
c$     abc     c
a$     abc     NO MATCH

REGEX: REPETITION
- Question mark ?: makes the preceding token optional
- Asterisk or star *: matches the preceding token zero or more times
- Plus +: matches the preceding token once or more
- Limiting repetition {min,max}: specifies how many times a token can be repeated
  - {0,1} = ?
  - {0,} = *
  - {1,} = +

Regex          String  Match
[...]          [...]   NO MATCH
[1-9][0-9]{3}  9032    9032

REGEX: GROUPING AND CAPTURING
- Place parentheses around multiple tokens to create a capturing group
- E.g. the example has one group
- Group 0 always contains the entire regex match

Regex          String             Match
\w+@(\w+)\.be  [email protected]  ugent
\w+@(\w+)\.be  [email protected]  kuleuven

REGEX: R FUNCTIONS
Search for matches:
- grep(value = FALSE): returns the indices of the matches
- grep(value = TRUE): returns the elements of the matches
- grepl(): returns a logical vector of the matches

  > grep("^San", location, value = TRUE)
  [1] "San Francisco" "San Mateo" "San Rafael" "San Pablo"

REGEX: R FUNCTIONS
Use capture groups:
- regexec(): returns the indices of the matched groups
- regmatches(): extracts matched substrings

  > record [...]
  > match [...]
  > match.extract [...]
  > match.extract
  [[1]]
  [1] "[email protected]" "elon" "bezos" "ugent"

STRING MANIPULATION: PARSING DATES
POSIX classes: store date, time, and time zone
- POSIXct class: stored as the number of seconds since 1 January 1970
- POSIXlt class: stored in a human-readable format (e.g. year, month, etc.)
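
The difference between the two POSIX classes described above can be seen in a small R sketch. The timestamp and the choice of the "UTC" time zone are assumptions made for reproducibility, not values taken from the slides.

```r
# Hypothetical timestamp; tz = "UTC" is an assumption so the result
# does not depend on the local time zone of the machine
x_ct <- as.POSIXct("2012-06-28 20:30:00", tz = "UTC")
x_lt <- as.POSIXlt(x_ct)

# POSIXct: a single number of seconds since 1970-01-01 00:00:00 UTC
seconds_since_epoch <- as.numeric(x_ct)

# POSIXlt: a list-like object with one component per date/time part
year  <- x_lt$year + 1900  # years are stored as an offset from 1900
month <- x_lt$mon + 1      # months are stored from 0 (January) to 11
day   <- x_lt$mday         # day of the month
```

Because a POSIXct value is just one number per timestamp, it is compact in a data frame column and cheap to subtract or compare.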
→ Recommend POSIXct: optimized for storage and computation

  > last_online [...]
  > last_online
  [1] "2012-06-28 20:30:00 CEST"
  > format(last_online, format = "%d")
  [1] "28"

FEATURE ENGINEERING
Goal: constructing useful features that have a relationship to the target that the model can learn
- improve predictive performance
- reduce computational needs
- improve interpretability of the results
Feature engineering techniques:
- Categorical predictors
  - dummy encoding
  - integer encoding
- Numeric predictors
  - feature scaling
  - feature transformations
  - feature interactions
Note
Nothing beats domain knowledge to get a good representation of the data!

CATEGORICAL DATA: ENCODING
- Ordinal features: integer encoding assigns each unique value to a different integer
- Nominal features: dummy encoding creates new columns indicating the presence of each possible value in the original data
- Binary features: integer encoding → indicator variable
Note
- Nominal features with high cardinality (> 10): select the one-hot encoded (OHE) categories with top frequency
- Need the same categories in train & test set
- Be aware of the dummy variable trap!

OKCUPID: INTEGER ENCODING
Table: Integer encoding drinks

Original value  Integer variable
"not at all"    0
"rarely"        1
"socially"      2
"often"         3
"very often"    4
"desperately"   5

Table: Integer encoding has_link

Original value  Integer variable
FALSE           0
TRUE            1

OKCUPID: DUMMY ENCODING
Table: Dummy encoding status

Original value    is_single  is_available  is_seeing_someone  is_married
"single"          1          0             0                  0
"available"       0          1             0                  0
"seeing someone"  0          0             1                  0
"married"         0          0             0                  1
"unknown"         0          0             0                  0

NUMERICAL DATA: TRANSFORMATIONS
- Log transform: make distribution less skewed and reduce range of values, e.g. essay length
- Power transform: make data more right (left) skewed, for power > 1 (< 1)
- Interaction terms: effect of one feature is dependent on another feature
- Ratios: e.g.
ratio of stem words to essay_length

OKCUPID: ESSAY LENGTH
Figure: log transform essay_length

NUMERICAL DATA: FEATURE SCALING
Scaling: making sure all numerical features have the same scale
Two methods:
1 Normalization (min-max scaling): values are shifted and rescaled such that they end up ranging from 0 to 1

  $x_{new} = \frac{x_{old} - \min(x_{old})}{\max(x_{old}) - \min(x_{old})}$

2 Standardization (standard scaling): values are centered and rescaled to have mean 0 and standard deviation 1, as in a standard normal distribution $N(\mu = 0, \sigma = 1)$

  $x_{new} = \frac{x_{old} - \mu}{\sigma}$

Note
The statistics required for the transformation (e.g. the mean and standard deviation) are estimated from the training set and are applied to all data sets (e.g. the validation and test sets)

WHY FEATURE SCALING?
With few exceptions, ML algorithms don't perform well when the input numerical attributes have very different scales!
- Not needed for linear/logistic regression, but scaling makes coefficients more interpretable
- Necessary for models with a penalization term (ridge/lasso)
- Important (but not necessary) for models based on Euclidean distances (SVM, KNN)
- Critical when performing PCA
- Tree-based models are scale-invariant

OKCUPID: NORMALIZING
Figure: normalization last_online

OKCUPID: STANDARDIZING
Figure: standardization height

OKCUPID: OTHER USEFUL FEATURES
Split location into city and state

  > unique(profiles$location)
  [1] "south san francisco, california"
  ...

Categorical predictors like religion: values and modifiers

  > unique(profiles$religion)
  [1] "agnosticism but not too serious about it"
  ...

Extracting programming languages from speaks variable

  > unique(profiles$speaks)
  [1] "english (fluently), spanish (poorly), french (poorly), c++"
  ...

CLASS IMBALANCE
Class distribution is often highly skewed (e.g.
99/1 distribution in fraud detection)
Two main resampling methods:
- Undersampling: randomly select examples from the majority class and remove them
- Oversampling: randomly select examples from the minority class, with replacement, and duplicate them

Figure: Original data

profile_id  features  stem
1           ...       0
2           ...       0
3           ...       0
4           ...       0
5           ...       1
6           ...       0
7           ...       0
8           ...       0
9           ...       0
10          ...       0

Figure: Undersampling

profile_id  features  stem
2           ...       0
5           ...       1
8           ...       0
10          ...       0

Figure: Oversampling

profile_id  features  stem
1           ...       0
2           ...       0
3           ...       0
4           ...       0
5           ...       1
5*          ...       1
5**         ...       1
6           ...       0
7           ...       0
8           ...       0
9           ...       0
10          ...       0

CLASS IMBALANCE
- Should be performed on the training set only (leave the test set untouched)
- Optimal distribution often requires trial and error and depends on application & algorithm (although 80/20 is often used)
What to use when?
- Small number of minority obs. → oversampling
- Large number of majority obs. → undersampling
Advanced method: Synthetic Minority Oversampling Technique (SMOTE)
Performance metrics: avoid accuracy; use AUC, precision-recall curve, etc.

SUMMARY

MAIN LESSONS LEARNT
1 Raw data is rarely ideal for learning; most machine learning models require data scientists to build a good representation of the data.
2 Many ML algorithms can only handle numeric features, so we need to encode the categorical variables.
3 Scaling is crucial for distance-based ML models (e.g. kNN, SVM, neural networks).
4 In order to optimize the preprocessing steps, set aside part of the training set as a validation set.
5 Mind the data leakage trap: never choose preprocessing techniques based on the test data!
6 Imbalanced datasets require extra care to build useful models.

REFERENCES
Baesens, B. (2014). Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications (1st ed.). Wiley.
Géron, A. (2017). Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly.
Kuhn, M. (2019). Feature Engineering and Selection: A Practical Approach for Predictive Models.
Chapman & Hall/CRC Data Science Series.
Regular Expressions Quick Start. (n.d.). Retrieved September 24, 2021, from https://www.regular-expressions.info/quickstart.html
Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good enough practices in scientific computing. PLoS Computational Biology, 13(6), e1005510. https://doi.org/10.1371/journal.pcbi.1005510

TO DO
Lab session on Data Preprocessing in Dodona
