Data Tidying and Preprocessing Lecture Notes PDF
The University of Manchester, Alliance Manchester Business School
Manuel López-Ibáñez
Summary
These notes cover data tidying and preprocessing techniques in Python, focusing on the pandas and scikit-learn libraries. They explain the concept of tidy data, the main data types (numerical, categorical, ordinal), and common data transformations, and show where these steps fit in a data analysis workflow.
Full Transcript
The University of Manchester
Alliance Manchester Business School
BMAN73701 Programming in Python for Business Analytics
Week 4: Lecture 2
Data Tidying and Data Preprocessing
Prof. Manuel López-Ibáñez
[email protected]
Office hours: Mon 4pm-5pm, Fri 9am-10am
https://calendly.com/manuel-lopez-ibanez

From Raw Data to Data Analysis
[Diagram: Raw data sources (databases, web, Excel, text, APIs) -> Acquisition -> Raw data -> Tidying -> Tabular data -> Preprocessing -> Tabular data -> Data analysis (summary stats, analysis, visualisation). Tabular data may be numerical, categorical or ordinal.]

Intro to Pandas
pandas: Python library for data manipulation and analysis
Website: http://pandas.pydata.org/
Documentation: http://pandas.pydata.org/pandas-docs/stable/
    import pandas as pd
[Book cover: Wes McKinney, "Data Wrangling with Pandas, NumPy, and IPython", O'Reilly]

Data Tidying vs. Data Preprocessing
1. Data Wrangling / Tidying: arrange data for further analysis (split-apply-combine, groupby, melt, pivot, …)
2. Data Munging / Preprocessing: transform raw data into a form appropriate for data analysis or machine learning.
3. Data Analysis: using statistics and visualisation to understand data, machine learning, optimisation, etc.

Part 1: Data Tidying
Part 2: Data Preprocessing

What is Data?
A dataset is a collection of numerical and/or categorical values.
A variable groups values that measure the same attribute, criterion, feature, dimension, … (all values have the same type and units of measurement).
An observation groups the values of several variables for the same object, person, item, experimental unit, alternative, sample, point, …

    Type  Model  Profit
    A     New    1
    B     New    2
    C     New    3
    A     Old    4
    B     Old    5
    C     Old    6

How many variables and observations?
QUIZ

    Type  New  Old
    A     1    4
    B     2    5
    C     3    6

Table 5.1: Sales profit (M$) of new and old models of various types of widgets.

Answer: the table has three columns (Type, New, Old), but New and Old are two values of a single variable, Model. In tidy terms there are 3 variables (Type, Model, Profit) and 6 observations (one Profit value per combination of Type and Model).

Long vs Wide data
Long format: one column per variable (good for machine learning).
Wide format: more than one column contains the same variable.

    Long format              Wide format
    Type  Model  Profit      Type  New  Old
    A     New    1           A     1    4
    B     New    2           B     2    5
    C     New    3           C     3    6
    A     Old    4
    B     Old    5
    C     Old    6

Tidy Data
Hadley Wickham. "Tidy Data", Journal of Statistical Software, 59:10, 2014. http://dx.doi.org/10.18637/jss.v059.i10
Tidy data corresponds to the long format:
1. Each variable appears in a single column
2. Each row describes a complete observation
3. Each table corresponds to a type of observational unit
✓ Standard format to work with data
✓ Operations on one variable operate on a single column
✓ Easier to use split-apply-combine
✓ Standard for machine learning (features in columns, samples in rows)

Wide vs Long data: Which one is tidier?
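One way to approach the question is to compute the same statistic in both layouts. The sketch below is illustrative only: it rebuilds the widget table from these notes as two DataFrames (the names `wide_df` and `long_df` are assumptions, not from the slides) and computes the mean profit per model.

```python
import pandas as pd

# Wide layout: the Model variable is spread over the column headers New and Old.
wide_df = pd.DataFrame({"Type": ["A", "B", "C"], "New": [1, 2, 3], "Old": [4, 5, 6]})

# Long (tidy) layout: one column per variable, one row per observation.
long_df = pd.DataFrame({
    "Type": ["A", "B", "C", "A", "B", "C"],
    "Model": ["New", "New", "New", "Old", "Old", "Old"],
    "Profit": [1, 2, 3, 4, 5, 6],
})

# Tidy data: Profit lives in one column, so a single expression covers every model.
mean_by_model_long = long_df.groupby("Model")["Profit"].mean()

# Wide data: the profit columns have to be listed by name.
mean_by_model_wide = wide_df[["New", "Old"]].mean()

# The overall mean is a one-column operation only in the long layout.
overall_mean_long = long_df["Profit"].mean()
overall_mean_wide = wide_df[["New", "Old"]].to_numpy().mean()
```

Both layouts hold the same information; the difference is that in the long layout any operation on Profit touches exactly one column, whatever the number of models.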
The long and wide tables from the "Long vs Wide data" slide are shown again for comparison. By the tidy-data rules, the long format is the tidy one.

Wide <-> Long data
(Try it in Spyder.)

Wide -> Long (Melt / Unpivot):

    pd.melt(df,
            id_vars = 'Type',             # column(s) that remain as identifiers
            value_vars = ['New', 'Old'],  # columns to unpivot
            value_name = 'Profit',        # name of the new column holding the values
            var_name = 'Model')           # name of the new column holding the old column names

Long -> Wide (Pivot):

    df.pivot(index = 'Type',      # column that becomes the index (each unique value becomes a row)
             values = 'Profit',   # column that supplies the values of the pivoted table
             columns = 'Model')   # each unique value becomes a column name

Wide to Long (Melt)

    In [2]: df = pd.DataFrame(dict(Type=['A', 'B', 'C'],   # make a DataFrame from a dictionary
                                   New=[1, 2, 3],
                                   Old=[4, 5, 6]))
            df
    Out[2]:
      Type  New  Old
    0    A    1    4
    1    B    2    5
    2    C    3    6

id_vars: the column that remains as an identifier; its values are repeated.
value_vars: the columns that are unpivoted.
value_name: name of the column holding the unpivoted values.
var_name: name of the new column holding the names of the unpivoted columns.

    In [3]: pd.melt(df, id_vars='Type', value_vars=['New', 'Old'],
                    value_name='Profit', var_name='Model')
    Out[3]:
      Type Model  Profit
    0    A   New       1
    1    B   New       2
    2    C   New       3
    3    A   Old       4
    4    B   Old       5
    5    C   Old       6

If value_vars and value_name are omitted, all remaining columns are unpivoted and the value column is named "value":

    In [4]: pd.melt(df, id_vars='Type', var_name='Model')
    Out[4]:
      Type Model  value
    0    A   New      1
    1    B   New      2
    2    C   New      3
    3    A   Old      4
    4    B   Old      5
    5    C   Old      6

Long to Wide (Pivot)

    In [5]: df = pd.DataFrame(dict(Type=2*['A', 'B', 'C'],
                                   Model=3*['New'] + 3*['Old'],
                                   Profit=np.arange(1, 7)))
            df
    Out[5]:
      Type Model  Profit
    0    A   New       1
    1    B   New       2
    2    C   New       3
    3    A   Old       4
    4    B   Old       5
    5    C   Old       6

index: unique values are kept as the row index. values: the column that contains the values. columns: the column to pivot; each unique value becomes a new column.

    In [6]: df.pivot(index='Type', values='Profit', columns='Model')
    Out[6]:
    Model  New  Old
    Type
    A        1    4
    B        2    5
    C        3    6

Use reset_index() so that Type is an ordinary column again rather than the index:

    In [7]: df.pivot(index='Type', values='Profit', columns='Model').reset_index()
    Out[7]:
    Model Type  New  Old
    0        A    1    4
    1        B    2    5
    2        C    3    6

pivot_table
[Screenshot not captured.] In the slide's example, types A and B have more than one Profit value for the same Model. pivot cannot reshape such duplicates, so use pivot_table and pass aggfunc to aggregate them.

Is "longer" data always "tidier"?
df3:

      first last  height  weight
    0  John  Doe     5.5     130
    1  Mary   Bo     6.0     150

df3.melt(id_vars=['first', 'last']):

      first last variable  value
    0  John  Doe   height    5.5
    1  Mary   Bo   height    6.0
    2  John  Doe   weight  130.0
    3  Mary   Bo   weight  150.0

The melted table is not good: the "variable" column mixes two different variables, and the "value" column mixes two different units of measurement. Here the wider table is the tidy one.
https://pandas.pydata.org/pandas-docs/stable/reshaping.html

GroupBy
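The pivot_table function mentioned above is itself a grouped aggregation: it groups rows by the index and column keys and aggregates each group. The sketch below uses made-up data with duplicated (Type, Model) pairs, which is an assumption because the slide's own data was not captured. It shows that pivot raises an error on duplicates, that pivot_table aggregates them, and that the same numbers come out of a groupby followed by unstack.

```python
import pandas as pd

# Hypothetical long table in which (A, New) and (B, Old) each appear twice.
df = pd.DataFrame({
    "Type":   ["A", "A", "A", "B", "B", "B", "C", "C"],
    "Model":  ["New", "New", "Old", "New", "Old", "Old", "New", "Old"],
    "Profit": [1, 3, 4, 2, 5, 7, 3, 6],
})

# pivot needs exactly one value per (index, column) cell, so duplicates raise ValueError.
try:
    df.pivot(index="Type", columns="Model", values="Profit")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates the duplicates, here with the mean.
wide = df.pivot_table(index="Type", columns="Model", values="Profit", aggfunc="mean")

# The same result expressed as split-apply-combine: group, aggregate, reshape.
same = df.groupby(["Type", "Model"])["Profit"].mean().unstack("Model")
```

Other aggregation names such as "sum" or "max" can be passed as aggfunc in the same way.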
Problem: Compute the mean Profit of each Type

    Type  IsNew  Profit
    A     True   1
    B     False  2
    C     True   3
    A     False  4
    B     True   5
    C     False  6

Solution #1: Boolean indexing

    for t in pd.unique(df['Type']):      # 't' avoids shadowing the built-in 'type'
        subdf = df[df['Type'] == t]      # rows belonging to this Type
        print(t, subdf['Profit'].mean())

Solution #2: .groupby()

    df.groupby('Type')['Profit'].mean()

This takes the 'Profit' column and computes its mean separately for each Type.

GroupBy
groupby separates the rows into groups according to the column given in the parentheses:

    groupedby_type = df.groupby('Type')

[Figure: df is split into three groups.]

    Name: 'A'              Name: 'B'              Name: 'C'
    Type  IsNew  Profit    Type  IsNew  Profit    Type  IsNew  Profit
    A     True   1         B     False  2         C     True   3
    A     False  4         B     True   5         C     False  6

GroupBy
[Screenshot not captured.] The example computes the mean of the 'Profit' column for each 'Type'. The result of groupby is not a normal DataFrame but a specific GroupBy object, so not everything that works on a normal DataFrame is available on it.

Split-Apply-Combine Philosophy
Hadley Wickham. "The Split-Apply-Combine Strategy for Data Analysis", Journal of Statistical Software, 40:1, April 2011. http://www.jstatsoft.org/v40/i01/paper
SPLIT: based on a common criterion (usually based on the index?)
APPLY: a change to each group (e.g. calculate the mean)
COMBINE: the newly processed groups
(Gregory Kanevsk, https://community.teradata.com/t5/Learn-Data-Science/Building-Data-Science-Pipelines-with-R-and-Aster-Part-3-Data/ba-p/79855)

Split-Apply-Combine Philosophy
For each group (A, B, C), sum the values:

    df.groupby('group').sum()

For the column metric only, compute the sum and count the number of rows in each group:

    df.groupby('group')['metric'].agg([np.sum, 'count'])
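The two calls above can be tried on the small id/group/metric table from the accompanying figure. The sketch below rebuilds that table by hand, with the values as read from the figure, and uses the string alias 'sum' in place of np.sum, which pandas also accepts in agg.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "group": ["A", "B", "B", "B", "B", "C", "C"],
    "metric": [1.5, 3, 6, 2, -1, 4, 0],
})

# Split by 'group', apply sum to every remaining column, combine into one table.
# Note that the id column is summed too, which is rarely meaningful.
sums = df.groupby("group").sum()

# Select one column first, then apply several aggregations at once.
summary = df.groupby("group")["metric"].agg(["sum", "count"])
```

Selecting the column before aggregating keeps identifier columns such as id out of the result.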
[Figure: a table with columns id, group, metric (ids 1 to 7; groups A, B, B, B, B, C, C; metric 1.5, 3, 6, 2, -1, 4, 0) is split into one sub-table per group.]
(Andee Kaplan, http://andeekaplan.com/2014/01/20/SplitApplyCombine)

GroupBy
.groupby() generates a DataFrameGroupBy, NOT a DataFrame.
Each group in a DataFrameGroupBy is a DataFrame: .groupby() splits a data frame into several data frames, one per group.
Many functions work on both types. For grouped data frames, the function is applied to each group. Example: .mean()
One can do even more powerful operations:
❑ Group by several columns
❑ Aggregate using many functions at once to each group
❑ Transform values within groups using data from the whole group
❑ Filter groups according to group-wise conditions

Part 2: Data Preprocessing

Data Preprocessing
Textual transformations (re + Pandas), usually needed for human inputs:

    import re              # regular expressions
    import pandas as pd

    re.search()     df.replace()
    re.sub()        df['Col'].str.replace()

Dropping, replacing missing data:
NumPy: np.isnan()   only floating-point arrays!
Pandas: df.isna()    Find missing values
        df.fillna()  Replace missing values
        df.dropna()  Remove missing values

Missing Values
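Missing values typed in by hand often arrive as text such as 'NA' rather than as a true missing value, so the text-replacement and missing-value tools are frequently used together. A minimal sketch, assuming a hypothetical column `Cat` that contains such placeholder strings:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Cat": [np.nan, "NA", "NaN", None],
    "Num": [np.nan, 0.5, 0.5, 0.5],
})

# Only the real NaN and None are detected; the strings 'NA' and 'NaN' are not.
missing_before = int(df["Cat"].isna().sum())

# Map the placeholder strings to a true missing value, then check again.
df["Cat"] = df["Cat"].replace(["NA", "NaN"], np.nan)
missing_after = int(df["Cat"].isna().sum())
```

Which strings count as placeholders is a property of the data source, so the list passed to replace is an assumption to adapt case by case.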
Unknown, lost, wrong data
- Empty cell in Excel
- Nothing between commas in CSV
- None or NaN when printing arrays / DataFrames
- Wrong indexing often creates NaN values
Use None or np.nan to create missing values:

    df.iloc[0, 0] = np.nan

None == None is True, but np.nan != np.nan, because NaN is not equal to any number, not even itself. Never use == to test for NaN. Use np.isnan() or df.isna() to check whether there are any missing values.

Missing Values in NumPy

    In [14]: df = pd.read_table(path + 'smalldata.txt')   # make a data frame from a file
             df.iloc[0, 0] = np.nan                        # set element (0, 0) to NaN
             X = df.values                                 # get the values of df as an array
             X
    Out[14]:
    array([[ nan, -0.1,  0.5],
           [ 0.2, -0.2,  0.4],
           [ 0.3, -0.3,  0.3],
           [ 0.4, -0.4,  0.2],
           [ 0.5, -0.5,  0.1]])

    In [16]: np.isnan(X)             # element-wise check for NaN
    Out[16]:
    array([[ True, False, False],
           [False, False, False],
           [False, False, False],
           [False, False, False],
           [False, False, False]])

    In [17]: X[np.isnan(X)]          # select only the NaN entries of X
    Out[17]: array([nan])

    In [18]: X[np.isnan(X)] = 0      # replace every NaN by 0

    In [19]: np.any(np.isnan(X))     # is there any NaN left in the whole array?
    Out[19]: False

    In [20]: np.any(np.isnan(df))    # the values of df changed too, so it has no NaN either
    Out[20]: False

The last result relies on df.values sharing memory with df, which is not guaranteed across pandas versions.

Handling Missing Values in NumPy
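A short sketch of how NaN behaves in NumPy, on a small made-up array (the values are illustrative, not the contents of smalldata.txt):

```python
import numpy as np

X = np.array([[np.nan, -0.1, 0.5],
              [0.2, -0.2, 0.4]])

# NaN never compares equal, not even to itself, so == finds nothing.
found_with_eq = bool((X == np.nan).any())
found_with_isnan = bool(np.isnan(X).any())

# Ordinary reductions propagate NaN; the nan-prefixed variants skip it.
total = np.sum(X)
total_ignoring_nan = np.nansum(X)

# Replace NaN by a chosen value on a copy, leaving X itself untouched.
X_filled = X.copy()
X_filled[np.isnan(X_filled)] = 0
```

Working on a copy is a design choice of this sketch; assigning into X directly, as in the session above, modifies the array in place.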
NumPy mathematical functions propagate missing values*:

    np.sum(X)       -> NaN   (if X contains NaN, the sum is NaN)

There are special functions that ignore missing values:

    np.nansum(X)    -> 16.0  (NaN entries are ignored)

We could replace them by a particular value:

    X[np.isnan(X)] = 0       # replace every NaN in X by 0

There are other ways of handling missing values.
* Pandas DataFrames do the opposite: their mathematical functions ignore NaN automatically.

Handling Missing Values in Pandas
Pandas mathematical functions ignore missing values:

    df['Num'].sum()            -> 9.0 (same as np.nansum)

But mathematical operators propagate NaN!

    df['Num'] + df['Num']      -> NaN wherever an operand is NaN

We could replace missing values by a particular value:

    df.fillna(np.mean(df['Num']))

Or we can drop rows/columns with missing values (typically rows, but it depends on the data):

    df.dropna(axis = 0)

Missing Values (Pandas)
[Screenshot not captured.] Notes on the example:
- np.isnan() cannot handle non-numerical data (e.g. strings or None).
- df.isna() checks for missing values in both numerical and non-numerical columns.
- Open question noted on the slide: why does the output show an unexpected pattern of True/False values?

Math with Missing Values (Pandas)

    In [18]: df = pd.DataFrame(dict(Cat=[np.nan, 'NA', 'NaN', None],
                                    Num=[np.nan, 0.5, 0.5, 0.5],
                                    Clean=[1, 2, 3, 4]))
             df
    Out[18]:
        Cat  Num  Clean
    0   NaN  NaN      1
    1    NA  0.5      2
    2   NaN  0.5      3
    3  None  0.5      4

'NA' and 'NaN' in rows 1 and 2 are strings, so pandas does not know that they stand for missing values.

    In [ ]: np.sum(df['Num'].values)       # NumPy sum: one NaN makes the result NaN
    Out[ ]: nan

    In [41]: df['Num'].sum()               # pandas sum ignores NaN
    Out[41]: 1.5

    In [42]: df['Num'][0] + df['Num'][1]   # arithmetic between values (addition, subtraction) propagates NaN
    Out[42]: nan

Handling Missing Values (Pandas)
Using the same df as above:

    In [43]: df.fillna(df['Num'].mean())   # 1. compute the mean of 'Num'; 2. use it to fill every NaN
    Out[43]:
       Cat  Num  Clean
    0  0.5  0.5      1
    1   NA  0.5      2
    2  NaN  0.5      3
    3  0.5  0.5      4

    In [ ]: df.dropna(axis=0)   # Rows: drop the rows that contain a missing value
    Out[ ]:
       Cat  Num  Clean
    1   NA  0.5      2
    2  NaN  0.5      3

    In [ ]: df.dropna(axis=1)   # Columns: drop the columns that contain a missing value
    Out[ ]:
       Clean
    0      1
    1      2
    2      3
    3      4

Data Preprocessing
- Imputing missing data (filling in / replacing missing data)
- Scaling, Standardization, L1/L2 normalization
- Binarization
- Encoding of categorical features:
  - As integers
  - As binary vectors (one-of-K)

    import sklearn.preprocessing

Data Preprocessing with scikit-learn
- Built on top of NumPy and Matplotlib
- Input may be NumPy arrays or Pandas DataFrames
- Output is NumPy (by default)
- Open-source, free to use and contribute (example)
- Keeps being updated!
- Object-oriented: create objects, call their methods to update them or transform other data
Docs: https://scikit-learn.org/stable/index.html
Examples: http://scikit-learn.org/stable/auto_examples/index.html
API reference: https://scikit-learn.org/stable/api/index.html

Imputing Missing Data
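Filling with a single number, as in df.fillna(df['Num'].mean()) above, also writes that number into non-numeric columns: in the output shown, the Cat column receives 0.5. One possible refinement, sketched here as an assumption rather than as the lecture's method, is to pass a dict to fillna so that each column gets its own fill value. The label "unknown" is a made-up placeholder.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Cat": [np.nan, "NA", "NaN", None],
    "Num": [np.nan, 0.5, 0.5, 0.5],
    "Clean": [1, 2, 3, 4],
})

# Per-column fill values: the column mean for Num, a text label for Cat.
filled = df.fillna({"Num": df["Num"].mean(), "Cat": "unknown"})

# Dropping instead of filling: first rows, then columns, that contain a missing value.
rows_kept = df.dropna(axis=0)
cols_kept = df.dropna(axis=1)
```

The strings 'NA' and 'NaN' are left alone in all three results, because pandas does not treat them as missing.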
Replace with mean or median or most_frequent… along columns:

    import numpy as np
    from sklearn.impute import SimpleImputer

    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    imputer.fit(df[['X', 'Y']])                        # learn the column means
    df[['X', 'Y']] = imputer.transform(df[['X', 'Y']])  # fill NaN with the learned means

Replace with predicted values: kNN, regression, RandomForests

Preprocessing: fit and transform
something.fit(X)
    Learn / calculate some parameters from X.
    Modifies 'something' and returns the fitted object itself.
something.transform(X)
    Transforms input X in some manner.
    It does NOT modify 'something'; returns transformed 'X'.
something.fit_transform(X)
    Do the above two steps in one go.
    Modifies 'something' and returns transformed 'X'.

Scaling (MinMaxScaler)
Scale data to a new range, e.g., [0, 1] (also called "normalisation" to [0, 1]):

$x \in [x_{\min}, x_{\max}] \Rightarrow x' \in [x'_{\min}, x'_{\max}]$

$x' = x'_{\min} + \frac{x - x_{\min}}{x_{\max} - x_{\min}} \, (x'_{\max} - x'_{\min})$

Columns/features have very different ranges and we want to have similar ranges.
Some ML methods/packages may expect or perform better with normalised inputs.

Scaling (MinMaxScaler)
[Screenshot not captured.] The result of the scaler is a NumPy array. Assign it back through a slice (df[:] = ...) so that the column names and index of the DataFrame are kept.

Standardisation
Also called z-score normalisation or standard scaling.
Ideally, transform the data to be normally distributed with mean $\mu = 0$ and variance $\sigma^2 = 1$.
In practice: $x' = \frac{x - \mu}{\sigma}$, which shifts and rescales each feature but does not change the shape of its distribution.
Models that assume data is centred around zero (RBF kernel in SVM).
Features with very different variances are problematic for ML.

Standardisation

    # assumes: from sklearn import preprocessing
    In [24]: df.apply([np.mean, np.std])   # mean and std of each column
    Out[24]:
                   X           Y
    mean   84.500000  250.833333
    std   166.334001  499.444469

    In [25]: z_scaler = preprocessing.StandardScaler()
             df[:] = z_scaler.fit_transform(df)
             df
    Out[25]:
              X         Y
    0 -0.579662 -0.577607
    1 -0.579662 -0.577607
    2 -0.572720  1.732051
    3  1.732044 -0.576836

    In [26]: df.apply([np.mean, np.std])
    Out[26]:
                 X             Y
    mean  0.000000 -2.775558e-17
    std   1.154701  1.154701e+00

The values changed so that each column has mean 0 and unit standard deviation. The reported std is 1.1547 rather than 1 because pandas computes the sample standard deviation (dividing by n - 1), while StandardScaler divides by n; with n = 4 rows the ratio is $\sqrt{4/3} \approx 1.1547$.

L1/L2 normalisation
Scale individual samples/observations (rows) $\vec{x}$ to a "length" of 1 (unit norm):

$\vec{x}' = \frac{\vec{x}}{\lVert \vec{x} \rVert}$

Each sample is a vector in a multi-dimensional space.
[Figure: two samples $s_1$ and $s_2$ plotted against features $f_1$ and $f_2$, before and after normalisation.]

$L1 = \lVert \vec{x} \rVert_1 = \sum_{i=1}^{n} |x_i|$

$L2 = \lVert \vec{x} \rVert_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$

Usually used for data with lots of zeros and when comparing similarity or distance.
Models that compare similarity / distance between pairs of samples (text classification / clustering).

L1/L2 normalisation
[Plot not captured.] After normalisation, every sample has length 1.

Binarization
Convert numerical values to binary {0, 1}:

$x' = \begin{cases} 0 & \text{if } x \le 0.6 \\ 1 & \text{if } x > 0.6 \end{cases}$

    preprocessing.Binarizer(threshold=0.6).fit_transform(data)

✓ Some ML models only work with boolean input values
✓ Some ML models may produce non-boolean predictions, but we are asking a boolean question
1. Predict probability of a given client buying a product ⇒ [0, 1]
2. Decide whether to call the client ⇒ {no, yes}

Encoding of Categorical Features

    size = ["S", "M", "L"]
    country = ["US", "ES", "UK", "CN"]

Many operations only work with numerical values.
Encode them as integers:

    size_enc = {'S':1, 'M':2, 'L':3}
    country_enc = {'US':1, 'ES':2, 'UK':3, 'CN':4}

    Original          Encoded
    Country  Size     Country  Size
    US       S        1        1
    ES       M        2        2
    UK       L        3        3
    CN       M        4        2
    UK       L        3        3

Encoding Categorical Features with scikit-learn

    In [102]: lencoder = preprocessing.LabelEncoder()
              lencoder.fit(df["country"])   # build an encoding (0, 1, 2, 3) for the unique values in country
              lencoder.classes_             # check all the unique values
    Out[102]: array(['CN', 'ES', 'UK', 'US'], dtype=object)

    In [103]: df['country'] = lencoder.transform(df["country"])   # transform the data into numbers
              df
    Out[103]:
       country size
    0        3    S
    1        1    M
    2        2    L
    3        0    M
    4        2    L

    In [104]: df['country'] = lencoder.inverse_transform(df["country"])   # get back the original values
              df
    Out[104]:
      country size
    0      US    S
    1      ES    M
    2      UK    L
    3      CN    M
    4      UK    L

Encoding of Categorical Features: one-of-K

    size = ["S", "M", "L"]
    country = ["US", "ES", "UK", "CN"]

Most ML methods will interpret integers as ordered values (ordered features)! This can happen if we simply replace the categories by 0, 1, 2, 3, 4.
Encode each categorical feature with K values into K binary features, with only one active (One-of-K, OneHot). Each country column is 1 for the rows of that country and 0 otherwise.

    "US" = [1,0,0,0]    "ES" = [0,1,0,0]    "UK" = [0,0,1,0]    "CN" = [0,0,0,1]

    Country  Size     C_US  C_ES  C_UK  C_CN  Size
    US       S        1     0     0     0     1
    ES       M        0     1     0     0     2
    UK       L        0     0     1     0     3
    CN       M        0     0     0     1     2
    UK       L        0     0     1     0     3

OneHot Encoding with scikit-learn

    # sparse_output=False gives a dense array; set_output(transform="pandas") returns a DataFrame
    encoder = preprocessing.OneHotEncoder(sparse_output=False).set_output(transform="pandas")
    encoder.fit(df[['country']])
    encoder.transform(df[['country']])

    # Concat with original data (axis = 1 combines the data horizontally)
    pd.concat([df, encoder.transform(df[['country']])], axis = 1)

[Outputs not captured.]

Going further
More about Split-apply-combine in Pandas:
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
Split-apply-combine on Netflix data with Pandas:
https://www.datacamp.com/tutorial/pandas-split-apply-combine-groupby
Scikit-learn data preprocessing tutorial:
http://scikit-learn.org/stable/modules/preprocessing.html
About Feature Scaling and Normalization and the effect of standardization for ML algorithms:
http://sebastianraschka.com/Articles/2014_about_feature_scaling.html