
Week_2_MLP_Data_preprocessing - Colaboratory.pdf




Data Pre-processing Techniques

Data preprocessing involves several transformations that are applied to the raw data to make it more amenable for learning. It is carried out before the data is used for model training or prediction. The pre-processing techniques covered here are:

- Data cleaning
- Data imputation
- Feature scaling
- Feature transformation
- Polynomial features
- Discretization
- Handling categorical features
- Custom transformers
- Composite transformers (applying different transformations to different subsets of features)
- TransformedTargetRegressor
- Feature selection (filter based and wrapper based)
- Feature extraction (PCA)

The transformations are applied in a specific order, and that order can be specified via Pipeline (a short sketch appears at the end of Section 9, Composite Transformers). We often need to apply different transformations to different feature types; FeatureUnion helps us perform that task and combine the outputs of multiple transformations into a single transformed feature matrix. We will also see how to visualize such a pipeline.

Importing basic libraries

In this colab we import libraries as needed. However, it is good practice to have all imports in one cell, arranged in alphabetical order; this helps weed out duplicate imports and similar issues.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_theme(style="whitegrid")

1. Feature Extraction

DictVectorizer

Often the data is present as a list of dictionary objects. ML algorithms expect the data in matrix form with shape (n, m), where n is the number of samples and m is the number of features. DictVectorizer converts a list of dictionary objects into a feature matrix.

Let's create sample data for demo purposes containing the age and height of children. Each record/sample is a dictionary with two keys, age and height, and the corresponding values.

data = [{'age': 4, 'height': 96.0},
        {'age': 1, 'height': 73.9},
        {'age': 3, 'height': 88.9},
        {'age': 2, 'height': 81.6}]

There are 4 data samples with 2 features each. Let's use DictVectorizer to convert the list of dictionary objects into the feature matrix.

from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
data_transformed = dv.fit_transform(data)
data_transformed

array([[ 4. , 96. ],
       [ 1. , 73.9],
       [ 3. , 88.9],
       [ 2. , 81.6]])

data_transformed.shape

(4, 2)

The transformed data is in feature-matrix form: 4 samples with 2 features each, i.e. shape (4, 2).

2. Data Imputation

Many machine learning algorithms need a full feature matrix and may not work in the presence of missing data. Data imputation identifies missing values in each feature of the dataset and replaces them with an appropriate value based on a fixed strategy, such as the mean, median or mode of that feature, or a specified constant value.

The sklearn library provides the sklearn.impute.SimpleImputer class for this purpose.

from sklearn.impute import SimpleImputer

Some of its important parameters:

- missing_values: can be int, float, str, np.nan or None. Default is np.nan.
- strategy: string, default 'mean'. One of the following strategies can be used:
  - mean: missing values are replaced using the mean along each column
  - median: missing values are replaced using the median along each column
  - most_frequent: missing values are replaced using the most frequent value along each column
  - constant: missing values are replaced with the value specified in the fill_value argument
- add_indicator: a boolean parameter that, when set to True, adds missing-value indicators (also exposed via the indicator_ member variable).

Note: the mean and median strategies can only be used with numeric data; the most_frequent and constant strategies can be used with strings or numeric data.
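To make these parameters concrete, here is a minimal sketch (added here, not part of the original notebook) that imputes a small toy matrix with the mean strategy and add_indicator=True:

import numpy as np
from sklearn.impute import SimpleImputer

X_toy = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [7.0, np.nan]])
imp = SimpleImputer(strategy='mean', add_indicator=True)
print(imp.fit_transform(X_toy))
# [[1.  2.  0.  0. ]
#  [4.  3.  1.  0. ]
#  [7.  2.5 0.  1. ]]
# First two columns: imputed features (nan replaced by the column mean);
# last two columns: indicator flags for the columns that had missing values.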
Data imputation on a real-world dataset

Let's perform data imputation on a real-world dataset. We will use the heart-disease dataset from the UCI machine learning repository, loaded from a CSV file.

cols = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
heart_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data', names=cols)

The dataset has the following features:
1. Age (in years)
2. Sex (1 = male; 0 = female)
3. cp - chest pain type
4. trestbps - resting blood pressure (anything above 130-140 is typically cause for concern)
5. chol - serum cholesterol in mg/dl (above 200 is cause for concern)
6. fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg - resting electrocardiographic results (0 = normal; 1 = having ST-T wave abnormality; 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
8. thalach - maximum heart rate achieved
9. exang - exercise induced angina (1 = yes; 0 = no)
10. oldpeak - ST depression induced by exercise relative to rest
11. slope - slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
12. ca - number of major vessels (0-3) colored by fluoroscopy
13. thal - 3 = normal; 6 = fixed defect; 7 = reversible defect
14. num - diagnosis of heart disease (angiographic disease status): 0 = less than 50% diameter narrowing; 1 = more than 50% diameter narrowing

STEP 1: Check whether the dataset contains missing values. This can be checked via the dataset description or by counting NaN/null values in the dataframe. However, such a check works only for numerical features. For non-numerical features, we can list their unique values and check for placeholder values such as '?'.

heart_data.info()

RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       303 non-null    float64
 1   sex       303 non-null    float64
 2   cp        303 non-null    float64
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    float64
 5   fbs       303 non-null    float64
 6   restecg   303 non-null    float64
 7   thalach   303 non-null    float64
 8   exang     303 non-null    float64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    float64
 11  ca        303 non-null    object
 12  thal      303 non-null    object
 13  num       303 non-null    int64
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB

Let's check whether there are any missing values in the numerical columns; here we check all columns in the dataframe.

heart_data.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
num         0
dtype: int64

There are two non-numerical features: ca and thal. Let's list their unique values.

print("Unique values in ca:", heart_data.ca.unique())
print("Unique values in thal:", heart_data.thal.unique())

Unique values in ca: ['0.0' '3.0' '2.0' '1.0' '?']
Unique values in thal: ['6.0' '3.0' '7.0' '?']

Both of them contain '?', which marks a missing value. Let's count the number of missing values.

print("# missing values in ca:", heart_data.loc[heart_data.ca == '?', 'ca'].count())
print("# missing values in thal:", heart_data.loc[heart_data.thal == '?', 'thal'].count())

# missing values in ca: 4
# missing values in thal: 2
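As an aside, a sketch that is not part of the original notebook: the same markers can also be handled at load time by passing na_values to read_csv, so that '?' is parsed as NaN directly and the replace step below becomes unnecessary.

# Alternative load (assumed): parse '?' as NaN while reading the CSV.
heart_data_alt = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data',
    names=cols, na_values='?')
print(heart_data_alt.isnull().sum()[['ca', 'thal']])  # ca: 4, thal: 2, matching the counts above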
STEP 2: Replace '?' with np.nan.

heart_data.replace('?', np.nan, inplace=True)

STEP 3: Fill the missing values with sklearn's missing-value imputation utilities. Here we use SimpleImputer with the mean strategy. We will try two variations.

add_indicator=False: the default choice, which only imputes missing values.

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(heart_data)
heart_data_imputed = imputer.transform(heart_data)
print(heart_data_imputed.shape)

(303, 14)

add_indicator=True: adds an additional column for each column containing missing values. In our case this adds two columns, one for ca and one for thal, indicating whether the sample had a missing value.

imputer = SimpleImputer(missing_values=np.nan, strategy='mean', add_indicator=True)
imputer = imputer.fit(heart_data)
heart_data_imputed_with_indicator = imputer.transform(heart_data)
print(heart_data_imputed_with_indicator.shape)

(303, 16)

3. Feature scaling

Feature scaling transforms feature values so that all features are on the same scale. Using a feature matrix with all features on the same scale provides certain advantages:

- It enables faster convergence in iterative optimization algorithms such as gradient descent and its variants.
- The performance of ML algorithms such as SVM, k-NN and k-means, which compute Euclidean distances between input samples, is affected if the features are not scaled.
- Tree-based ML algorithms are not affected by feature scaling; in other words, feature scaling is not required for tree-based ML algorithms.

Feature scaling can be performed with the following methods:
- Standardization
- Normalization
- MaxAbsScaler

Let's demonstrate feature scaling on a real-world dataset. For this purpose we will use the abalone dataset and the different scaling utilities in the sklearn library.

cols = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']
abalone_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', names=cols)

The abalone dataset has the following columns:
1. Sex - nominal (M, F and I (infant))
2. Length (mm) - longest shell measurement
3. Diameter (mm) - perpendicular to length
4. Height (mm) - with meat in shell
5. Whole weight (grams) - whole abalone
6. Shucked weight (grams) - weight of meat
7. Viscera weight (grams) - gut weight (after bleeding)
8. Shell weight (grams) - after being dried
9. Rings (target - age in years)

STEP 1: Examine the dataset.

Feature scaling is performed only on numerical attributes. Let's check which attributes in this dataset are numerical. We can get that via the info() method.

abalone_data.info()

RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #  Column          Non-Null Count  Dtype
--  ------          --------------  -----
 0  Sex             4177 non-null   object
 1  Length          4177 non-null   float64
 2  Diameter        4177 non-null   float64
 3  Height          4177 non-null   float64
 4  Whole weight    4177 non-null   float64
 5  Shucked weight  4177 non-null   float64
 6  Viscera weight  4177 non-null   float64
 7  Shell weight    4177 non-null   float64
 8  Rings           4177 non-null   int64
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB

STEP 1a [OPTIONAL]: Convert non-numerical attributes to numerical ones.

Sex is the only non-numeric column in this dataset. Let's examine it and see if we can convert it to a numeric representation:

abalone_data.Sex.unique()

array(['M', 'F', 'I'], dtype=object)

# Assign a numerical value to Sex.
abalone_data = abalone_data.replace({"Sex": {"M": 1, "F": 2, "I": 3}})
abalone_data.info()

RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #  Column          Non-Null Count  Dtype
--  ------          --------------  -----
 0  Sex             4177 non-null   int64
 1  Length          4177 non-null   float64
 2  Diameter        4177 non-null   float64
 3  Height          4177 non-null   float64
 4  Whole weight    4177 non-null   float64
 5  Shucked weight  4177 non-null   float64
 6  Viscera weight  4177 non-null   float64
 7  Shell weight    4177 non-null   float64
 8  Rings           4177 non-null   int64
dtypes: float64(7), int64(2)
memory usage: 293.8 KB

STEP 2: Separate labels from features.

y = abalone_data.pop("Rings")
print("The DataFrame object after deleting the column")
abalone_data.info()

The DataFrame object after deleting the column
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 8 columns):
 #  Column          Non-Null Count  Dtype
--  ------          --------------  -----
 0  Sex             4177 non-null   int64
 1  Length          4177 non-null   float64
 2  Diameter        4177 non-null   float64
 3  Height          4177 non-null   float64
 4  Whole weight    4177 non-null   float64
 5  Shucked weight  4177 non-null   float64
 6  Viscera weight  4177 non-null   float64
 7  Shell weight    4177 non-null   float64
dtypes: float64(7), int64(1)
memory usage: 261.2 KB

STEP 3: Examine feature scales.

Statistical method: check the scales of the different features with the describe() method of the dataframe.

abalone_data.describe().T

                 count   mean      std       min     25%     50%     75%    max
Sex             4177.0  1.955470  0.827815  1.0000  1.0000  2.0000  3.000  3.0000
Length          4177.0  0.523992  0.120093  0.0750  0.4500  0.5450  0.615  0.8150
Diameter        4177.0  0.407881  0.099240  0.0550  0.3500  0.4250  0.480  0.6500
Height          4177.0  0.139516  0.041827  0.0000  0.1150  0.1400  0.165  1.1300
Whole weight    4177.0  0.828742  0.490389  0.0020  0.4415  0.7995  1.153  2.8255
Shucked weight  4177.0  0.359367  0.221963  0.0010  0.1860  0.3360  0.502  1.4880
Viscera weight  4177.0  0.180594  0.109614  0.0005  0.0935  0.1710  0.253  0.7600
Shell weight    4177.0  0.238831  0.139203  0.0015  0.1300  0.2340  0.329  1.0050

Note that:
- There are 4177 examples (rows) in this dataset.
- The means and standard deviations of the features are quite different from one another.

We can confirm that with a variety of visualization techniques and plots.

Visualization of feature distributions

We visualize the feature distributions with:
- Histogram
- Kernel density estimation (KDE) plot
- Box plot
- Violin plot

Feature histograms

We plot separate and combined histograms to check whether the features are indeed on different scales.

#@title [Separate histograms]
for name in cols[0:len(cols)-1]:
    plt.hist(abalone_data[name].values)  # histogram plot
    plt.title(name, fontsize=16)
    plt.xlabel('Range', fontsize=16)
    plt.ylabel('Frequency', fontsize=16)
    plt.show()

The feature variability can be better visualized in a combined histogram.

#@title [Histograms - combined]
# create a new plot
plt.figure(figsize=(15, 8))
for colname in abalone_data:
    plt.hist(abalone_data[colname].values, alpha=0.5)
# name the curves of features
in_cols = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight']
plt.legend(in_cols, fontsize=18, loc="upper right", frameon=True)
plt.title('Distribution of features across samples', fontsize=20)
plt.xlabel('Range', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.show()

KDE plot

Alternatively, we can generate a kernel density estimate plot using Gaussian kernels. In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable.
This function uses Gaussian kernels and includes automatic bandwidth determination.

#@title [KDE plots - combined]
ax = abalone_data.plot.kde()

Observe that the features have different distributions and scales.

Box plot

A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset, while the whiskers extend to show the rest of the distribution, except for points that are determined to be "outliers" using a method that is a function of the inter-quartile range.

#@title [Box plot]
ax = sns.boxplot(data=abalone_data, orient="h", palette="Set2")

Violin plot

A violin plot plays a similar role to a box-and-whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual data points, the violin plot features a kernel density estimate of the underlying distribution.

#@title [Violin plot]
ax = sns.violinplot(data=abalone_data, orient="h", palette="Set2", scale="width")

Looking at these plots, we conclude that the features are on different scales.

STEP 4: Scaling

Normalization

The features are normalized so that their range lies in [0, 1] or [-1, 1]. There are two ways to achieve this:
- MaxAbsScaler transforms features to the range [-1, 1].
- MinMaxScaler transforms features to the range [0, 1].

MaxAbsScaler

It transforms the original feature vector x into a new feature vector x' so that all values fall within the range [-1, 1]:

x' = x / MaxAbsoluteValue,  where MaxAbsoluteValue = max(|x_max|, |x_min|)

x = np.array([4, 2, 5, -2, -100]).reshape(-1, 1)
print(x)

[[   4]
 [   2]
 [   5]
 [  -2]
 [-100]]

from sklearn.preprocessing import MaxAbsScaler
mas = MaxAbsScaler()
x_new = mas.fit_transform(x)
print(x_new)

[[ 0.04]
 [ 0.02]
 [ 0.05]
 [-0.02]
 [-1.  ]]

MinMaxScaler

Normalization is a procedure in which the feature values are scaled so that they range between 0 and 1. This technique is also called min-max scaling. It is performed with the following formula:

X_new = (X_old - X_min) / (X_max - X_min)

where X_old is the old value of a data point (rescaled to X_new), X_min is the minimum value of feature X, and X_max is the maximum value of feature X.

Normalization can be achieved with MinMaxScaler from the sklearn library.

from sklearn.preprocessing import MinMaxScaler
X = abalone_data
mm = MinMaxScaler()
X_normalized = mm.fit_transform(X)
X_normalized[:5]

array([[0.        , 0.51351351, 0.5210084 , 0.0840708 , 0.18133522, 0.15030262, 0.1323239 , 0.14798206],
       [0.        , 0.37162162, 0.35294118, 0.07964602, 0.07915707, 0.06624075, 0.06319947, 0.06826109],
       [0.5       , 0.61486486, 0.61344538, 0.11946903, 0.23906499, 0.17182246, 0.18564845, 0.2077728 ],
       [0.        , 0.49324324, 0.5210084 , 0.11061947, 0.18204356, 0.14425017, 0.14944042, 0.15296462],
       [1.        , 0.34459459, 0.33613445, 0.07079646, 0.07189658, 0.0595158 , 0.05134957, 0.0533134 ]])

Let's look at the mean and standard deviation (SD) of each feature:

X_normalized.mean(axis=0)

array([0.47773522, 0.60674608, 0.59307774, 0.12346584, 0.29280756, 0.24100033, 0.23712127, 0.2365031 ])

X_normalized.std(axis=0)

array([0.4138578 , 0.16226829, 0.16676972, 0.03701066, 0.17366046, 0.14925109, 0.14430695, 0.13870055])

The means and SDs of the different features are now comparable.
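As a quick check (a sketch added here, not part of the original notebook), the min-max formula above can be verified against the data_min_ and data_max_ attributes that MinMaxScaler learns during fit:

# Manually apply X_new = (X_old - X_min) / (X_max - X_min) and compare
# with the scaler's output; X, mm and X_normalized come from the cells above.
X_manual = (X - mm.data_min_) / (mm.data_max_ - mm.data_min_)
print(np.allclose(X_manual, X_normalized))  # True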
We can confirm the effect of scaling again through visualization, as before:

#@title [Histogram of transformed features]
# create a new plot
plt.figure(figsize=(15, 8))
# convert ndarray into dataframe for plotting the histogram
data = pd.DataFrame(X_normalized, columns=in_cols)
for colname in abalone_data:
    plt.hist(data[colname].values, alpha=0.2)
plt.legend(in_cols, fontsize=18, loc="upper right", frameon=True)
plt.title('Distribution of features across samples after normalization', fontsize=20)
plt.xlabel('Range', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.show()

#@title [Box plot]
ax = sns.boxplot(data=data, orient="h", palette="Set2")

#@title [Violin plot]
ax = sns.violinplot(data=data, orient="h", palette="Set2", scale="width")

#@title [KDE plot]
ax = data.plot.kde()

Standardization

Standardization is another feature scaling technique that results in features with (close to) zero mean and unit standard deviation. Formula for standardization:

X_new = (X_old - μ) / σ

Here, μ and σ are respectively the mean and standard deviation of the feature values.

Standardization can be achieved with StandardScaler from the sklearn library.

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_standardized = ss.fit_transform(X)
X_standardized[:5]

array([[-1.15434629, -0.57455813, -0.43214879, -1.06442415, -0.64189823, -0.60768536, -0.72621157, -0.63821689],
       [-1.15434629, -1.44898585, -1.439929  , -1.18397831, -1.23027711, -1.17090984, -1.20522124, -1.21298732],
       [ 0.05379815,  0.05003309,  0.12213032, -0.10799087, -0.30946926, -0.4634999 , -0.35668983, -0.20713907],
       [-1.15434629, -0.69947638, -0.43214879, -0.34709919, -0.63781934, -0.64823753, -0.60759966, -0.60229374],
       [ 1.26194258, -1.61554351, -1.54070702, -1.42308663, -1.27208566, -1.2159678 , -1.28733718, -1.32075677]])

X_standardized.mean(axis=0)

array([-1.19075871e-17, -5.83471770e-16, -3.02792930e-16,  3.91249292e-16,
        9.18585294e-17, -1.02065033e-17,  2.70472337e-16,  2.97689679e-16])

X_standardized.std(axis=0)

array([1., 1., 1., 1., 1., 1., 1., 1.])

The means of the different features are now comparable (essentially zero), with SD = 1.

#@title [Histogram - combined]
# create a new plot
plt.figure(figsize=(15, 8))
data = pd.DataFrame(X_standardized, columns=in_cols)
for colname in abalone_data:
    plt.hist(data[colname].values, alpha=0.4)
plt.legend(in_cols, fontsize=18, loc="upper right", frameon=True)
plt.title('Distribution of features across samples after standardisation', fontsize=20)
plt.xlabel('Range', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.show()

#@title [KDE plot - combined]
ax = data.plot.kde()

#@title [Box plot]
ax = sns.boxplot(data=data, orient="h", palette="Set2")

#@title [Violin plot]
ax = sns.violinplot(data=data, orient="h", palette="Set2", scale="width")

4. add_dummy_feature

add_dummy_feature augments the dataset with a column vector in which every value is 1. This is useful for adding a parameter for the bias term in the model.

x = np.array([[7, 1],
              [1, 8],
              [2, 0],
              [9, 6]])
from sklearn.preprocessing import add_dummy_feature
x_new = add_dummy_feature(x)
print(x_new)

[[1. 7. 1.]
 [1. 1. 8.]
 [1. 2. 0.]
 [1. 9. 6.]]

5. Custom transformers

Custom transformers enable conversion of an existing Python function into a transformer to assist in data cleaning or processing. They are useful when:

1. The dataset consists of heterogeneous data types (e.g. raster images and text captions).
2. The dataset is stored in a pandas.DataFrame and different columns require different processing pipelines.
3. We need stateless transformations such as taking the log of frequencies, custom scaling, etc.

from sklearn.preprocessing import FunctionTransformer

You can implement a transformer from an arbitrary function with FunctionTransformer. For example, let's build a transformer that applies a log transformation to features.

For this demonstration we will use the wine quality dataset from the UCI machine learning repository. It has the following attributes:
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol
12. quality (output: score between 0 and 10)

wine_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';')
wine_data.describe().T

                      count   mean       std        min      25%      50%       75%        max
fixed acidity         1599.0  8.319637   1.741096   4.60000  7.1000   7.90000   9.200000   15.9000
volatile acidity      1599.0  0.527821   0.179060   0.12000  0.3900   0.52000   0.640000   1.5800
citric acid           1599.0  0.270976   0.194801   0.00000  0.0900   0.26000   0.420000   1.0000
residual sugar        1599.0  2.538806   1.409928   0.90000  1.9000   2.20000   2.600000   15.5000
chlorides             1599.0  0.087467   0.047065   0.01200  0.0700   0.07900   0.090000   0.6110
free sulfur dioxide   1599.0  15.874922  10.460157  1.00000  7.0000   14.00000  21.000000  72.0000
total sulfur dioxide  1599.0  46.467792  32.895324  6.00000  22.0000  38.00000  62.000000  289.0000
density               1599.0  0.996747   0.001887   0.99007  0.9956   0.99675   0.997835   1.0036
...

Let's use np.log1p, which returns the natural logarithm of (1 + the feature value).

transformer = FunctionTransformer(np.log1p, validate=True)
wine_data_transformed = transformer.transform(np.array(wine_data))
pd.DataFrame(wine_data_transformed, columns=wine_data.columns).describe().T

                      count   mean      std       min       25%       50%       75%       max
fixed acidity         1599.0  2.215842  0.178100  1.722767  2.091864  2.186051  2.322388  2.827314
volatile acidity      1599.0  0.417173  0.114926  0.113329  0.329304  0.418710  0.494696  0.947789
citric acid           1599.0  0.228147  0.152423  0.000000  0.086178  0.231112  0.350657  0.693147
residual sugar        1599.0  1.218131  0.269969  0.641854  1.064711  1.163151  1.280934  2.803360
chlorides             1599.0  0.083038  0.038991  0.011929  0.067659  0.076035  0.086178  0.476855
free sulfur dioxide   1599.0  2.639013  0.623790  0.693147  2.079442  2.708050  3.091042  4.290459
total sulfur dioxide  1599.0  3.634750  0.682575  1.945910  3.135494  3.663562  4.143135  5.669881
density               1599.0  0.691519  0.000945  0.688170  0.690945  0.691521  0.692064  0.694990
pH                    1599.0  1.460557  0.035760  1.319086  1.437463  1.460938  1.481605  1.611436
...

Notice the change in the statistics of all the features. For example, the mean of total sulfur dioxide went from 46.47 before the transformation to 3.63 after it.
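A small additional sketch (assumed, not part of the original notebook): FunctionTransformer also accepts an inverse_func, which is handy when the log transform has to be undone later.

# Pair the log transform with its inverse so original values can be recovered.
log_tf = FunctionTransformer(func=np.log1p, inverse_func=np.expm1, validate=True)
X_small = np.array([[1.0, 10.0], [100.0, 1000.0]])
X_log = log_tf.fit_transform(X_small)      # log(1 + x)
X_back = log_tf.inverse_transform(X_log)   # exp(x) - 1 undoes the transform
print(np.allclose(X_small, X_back))        # True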
6. Polynomial Features

PolynomialFeatures generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two-dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

sklearn.preprocessing.PolynomialFeatures enables us to perform a polynomial transformation of the desired degree. Let's demonstrate it with the wine quality dataset.

from sklearn.preprocessing import PolynomialFeatures

wine_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';')
wine_data_copy = wine_data.copy()
wine_data = wine_data.drop(['quality'], axis=1)
print('Number of features before transformation = ', wine_data.shape)

# Let us fit a polynomial of degree 2 to wine_data
poly = PolynomialFeatures(degree=2)
poly_wine_data = poly.fit_transform(wine_data)
print('Number of features after transformation = ', poly_wine_data.shape)

Number of features before transformation =  (1599, 11)
Number of features after transformation =  (1599, 78)

Note that after the transformation we have 78 features (1 bias column + 11 original features + 11 squared features + 55 pairwise products). Let's list these features:

poly.get_feature_names_out()

array(['1', 'fixed acidity', 'volatile acidity', 'citric acid',
       'residual sugar', 'chlorides', 'free sulfur dioxide',
       'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol',
       'fixed acidity^2', 'fixed acidity volatile acidity',
       'fixed acidity citric acid', 'fixed acidity residual sugar',
       'fixed acidity chlorides', 'fixed acidity free sulfur dioxide',
       'fixed acidity total sulfur dioxide', 'fixed acidity density',
       'fixed acidity pH', 'fixed acidity sulphates', 'fixed acidity alcohol',
       'volatile acidity^2', 'volatile acidity citric acid',
       'volatile acidity residual sugar', 'volatile acidity chlorides',
       'volatile acidity free sulfur dioxide',
       'volatile acidity total sulfur dioxide', 'volatile acidity density',
       'volatile acidity pH', 'volatile acidity sulphates',
       'volatile acidity alcohol', 'citric acid^2',
       'citric acid residual sugar', 'citric acid chlorides',
       'citric acid free sulfur dioxide', 'citric acid total sulfur dioxide',
       'citric acid density', 'citric acid pH', 'citric acid sulphates',
       'citric acid alcohol', 'residual sugar^2', 'residual sugar chlorides',
       'residual sugar free sulfur dioxide',
       'residual sugar total sulfur dioxide', 'residual sugar density',
       'residual sugar pH', 'residual sugar sulphates',
       'residual sugar alcohol', 'chlorides^2',
       'chlorides free sulfur dioxide', 'chlorides total sulfur dioxide',
       'chlorides density', 'chlorides pH', 'chlorides sulphates',
       'chlorides alcohol', 'free sulfur dioxide^2',
       'free sulfur dioxide total sulfur dioxide',
       'free sulfur dioxide density', 'free sulfur dioxide pH',
       'free sulfur dioxide sulphates', 'free sulfur dioxide alcohol',
       'total sulfur dioxide^2', 'total sulfur dioxide density',
       'total sulfur dioxide pH', 'total sulfur dioxide sulphates',
       'total sulfur dioxide alcohol', 'density^2', 'density pH',
       'density sulphates', 'density alcohol', 'pH^2', 'pH sulphates',
       'pH alcohol', 'sulphates^2', 'sulphates alcohol', 'alcohol^2'],
      dtype=object)

Observe that:
- Some features have a ^2 suffix; these are degree-2 features of the input features. For example, sulphates^2 is the square of the sulphates feature.
- Some features are combinations of the names of the original features. For example, total sulfur dioxide pH is the product of the two features total sulfur dioxide and pH.
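To tie this back to the [a, b] example above, a quick check (added sketch, not from the original notebook):

# For a single sample [2, 3], the degree-2 features are [1, a, b, a^2, ab, b^2].
print(PolynomialFeatures(degree=2).fit_transform([[2, 3]]))
# [[1. 2. 3. 4. 6. 9.]]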
7. Discretization

Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because it can transform a dataset of continuous attributes into one with only nominal attributes. One-hot encoded discretized features can make a model more expressive while maintaining interpretability. For instance, pre-processing with a discretizer can introduce nonlinearity to linear models.

# KBinsDiscretizer discretizes features into k bins
from sklearn.preprocessing import KBinsDiscretizer

Let's demonstrate KBinsDiscretizer using the wine quality dataset.

wine_data = wine_data_copy.copy()

# transform the dataset with KBinsDiscretizer
enc = KBinsDiscretizer(n_bins=10, encode="onehot")
X = np.array(wine_data['chlorides']).reshape(-1, 1)
X_binned = enc.fit_transform(X)
X_binned

# since the output is sparse, use toarray() to expand it
X_binned.toarray()[:5]

array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]])

8. Handling Categorical Features

We need to convert categorical features into numeric features. The options covered here are:
1. Ordinal encoding
2. One-hot encoding
3. Label encoder
4. Using dummy variables

Ordinal Encoding

Categorical features are those that contain categories or groups, such as education level or state, as their data. These are non-numerical features and need to be converted into an appropriate form before feeding them to an ML model for training. One intuitive way of handling them is to assign each category a numerical value. As an example, take state as a feature with 'Punjab', 'Rajasthan' and 'Haryana' as the possible values. We might consider assigning numbers to these values as follows:

Old feature    New feature
Punjab         1
Rajasthan      2
Haryana        3

However, this approach imposes an ordering on the labels (the states), implying, for instance, that Haryana is three times Punjab and Rajasthan is twice Punjab. These relationships do not exist in the data, so this encoding provides wrong information to the ML model.

One-hot Encoding

One of the most common approaches to handle this is one-hot encoding. It creates an additional feature for each label present in the categorical feature (i.e., one per distinct state here) and puts a 1 or 0 for these new features depending on the categorical feature's value. That is:

Old feature    New feature 1 (Punjab)    New feature 2 (Rajasthan)    New feature 3 (Haryana)
Punjab         1                         0                            0
Rajasthan      0                         1                            0
Haryana        0                         0                            1

It may be implemented using the OneHotEncoder class from the sklearn.preprocessing module. Let's demonstrate this concept with the Iris dataset.

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

The Iris dataset has the following features:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class: Iris Setosa, Iris Versicolour, Iris Virginica

cols = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
iris_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', names=cols)
iris_data.head()

   sepal length  sepal width  petal length  petal width        label
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

The label is a categorical attribute.

iris_data.label.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

There are three class labels. Let's convert them to one-hot vectors.
onehotencoder = OneHotEncoder(categories='auto')
print('Shape of y before encoding', iris_data.label.shape)

# OneHotEncoder does not accept a 1-D array, hence reshape to (-1, 1)
# to get two dimensions.
iris_labels = onehotencoder.fit_transform(iris_data.label.values.reshape(-1, 1))

# iris_labels is a 150x3 sparse matrix with 150 stored elements
# in compressed sparse row format.
print('Shape of y after encoding', iris_labels.shape)

# since the output is sparse, use toarray() to expand it
print("First 5 labels:")
print(iris_labels.toarray()[:5])

Shape of y before encoding (150,)
Shape of y after encoding (150, 3)
First 5 labels:
[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]

Let us observe the difference between one-hot encoding and ordinal encoding.

enc = OrdinalEncoder()
iris_labels = np.array(iris_data['label'])
iris_labels_transformed = enc.fit_transform(iris_labels.reshape(-1, 1))
print("Unique labels:", np.unique(iris_labels_transformed))
print("\nFirst 5 labels:")
print(iris_labels_transformed[:5])

Unique labels: [0. 1. 2.]

First 5 labels:
[[0.]
 [0.]
 [0.]
 [0.]
 [0.]]

LabelEncoder

Another option is to use LabelEncoder to transform categorical labels into integer codes between 0 and k-1, where k is the number of classes.

from sklearn.preprocessing import LabelEncoder

# get the class column in a new variable
iris_labels = np.array(iris_data['label'])

# encode the class names to integers
enc = LabelEncoder()
label_integer = enc.fit_transform(iris_labels)
label_integer

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

MultiLabelBinarizer

MultiLabelBinarizer is used when each sample can belong to several categories at once (for example, a movie with multiple genres). It encodes such collections of label sets into a binary indicator matrix with one column per class.

movie_genres = [{'action', 'comedy'},
                {'comedy'},
                {'action', 'thriller'},
                {'science-fiction', 'action', 'thriller'}]

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit_transform(movie_genres)

array([[1, 1, 0, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 1],
       [1, 0, 1, 1]])

The columns correspond to the classes in sorted order: action, comedy, science-fiction, thriller.

Using dummy variables

# use get_dummies to create a one-hot encoding for each unique categorical value in the class column
# Convert the categorical class variable to a one-hot encoding:
iris_data_onehot = pd.get_dummies(iris_data, columns=['label'], prefix=['one_hot'])
iris_data_onehot

     sepal length  sepal width  petal length  petal width  one_hot_Iris-setosa  one_hot_Iris-versicolor  one_hot_Iris-virginica
0             5.1          3.5           1.4          0.2                    1                        0                       0
1             4.9          3.0           1.4          0.2                    1                        0                       0
2             4.7          3.2           1.3          0.2                    1                        0                       0
3             4.6          3.1           1.5          0.2                    1                        0                       0
4             5.0          3.6           1.4          0.2                    1                        0                       0
..            ...          ...           ...          ...                  ...                      ...                     ...
145           6.7          3.0           5.2          2.3                    0                        0                       1
146           6.3          2.5           5.0          1.9                    0                        0                       1
147           6.5          3.0           5.2          2.0                    0                        0                       1
148           6.2          3.4           5.4          2.3                    0                        0                       1
149           5.9          3.0           5.1          1.8                    0                        0                       1

150 rows × 7 columns

9. Composite Transformers

ColumnTransformer

ColumnTransformer applies a set of transformers to the columns of an array or pandas.DataFrame and concatenates the transformed outputs from the different transformers into a single matrix. It is useful for transforming heterogeneous data by applying different transformers to separate subsets of features. It combines different feature selection mechanisms and transformations into a single transformer object.
x = [[20.0, 'male'],
     [11.2, 'female'],
     [15.6, 'female'],
     [13.0, 'male'],
     [18.6, 'male'],
     [16.4, 'female']]
x = np.array(x)

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder

# Column 0 is scaled, column 0 is also passed through unchanged,
# and column 1 is one-hot encoded.
ct = ColumnTransformer([('scaler', MaxAbsScaler(), [0]),
                        ('pass', 'passthrough', [0]),
                        ('encoder', OneHotEncoder(), [1])])
ct.fit_transform(x)

array([['1.0', '20.0', '0.0', '1.0'],
       ['0.5599999999999999', '11.2', '1.0', '0.0'],
       ['0.78', '15.6', '1.0', '0.0'],
       ['0.65', '13.0', '0.0', '1.0'],
       ['0.93', '18.6', '0.0', '1.0'],
       ['0.82', '16.4', '1.0', '0.0']], dtype='<U32')
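The introduction mentioned that the order of transformations can be fixed with a Pipeline and that the resulting pipeline can be visualized. A minimal sketch of both ideas (assumed, not part of the original notebook; the step names and the toy matrix X_demo are made up here), chaining imputation followed by standardization:

import numpy as np
from sklearn import set_config
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Steps run in the order listed: impute first, then standardize.
pipe = Pipeline([('impute', SimpleImputer(strategy='mean')),
                 ('scale', StandardScaler())])

X_demo = np.array([[1.0, 2.0],
                   [np.nan, 4.0],
                   [3.0, 6.0]])
print(pipe.fit_transform(X_demo))

# In a notebook, the pipeline object can be rendered as a diagram:
set_config(display='diagram')
pipe  # displaying the object now shows an interactive diagram of the steps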
