AWS Data Science Course Part 2 PDF
Summary
This document provides a tutorial on data preprocessing techniques for data science and machine learning, focusing on encoding categorical variables, handling missing values, and feature scaling. Methods covered include one-hot encoding, mean imputation, and several scalers.
Full Transcript
AWS Course - The Elements of Data Science - part 2

Encoding categorical variables
- Ordinal values, where order does matter: use a map function that assigns each category a number reflecting its "size" (see the mapping sketch below).
- To convert a categorical variable to a number when there is no ordering or relative sizing: mapping categories to a sequence of integers implies an ordering that does not exist and leads to wrong usage => use one-hot encoding instead for NOMINAL values.
- If using a label encoder, the numeric codes carry unintended meaning: the model sees 2 > 1 > 0 and 2 = 2 x 1, even though the categories have no such relationships.
- On the other hand, we can use ONE HOT ENCODING: easy to do with pandas => use get_dummies(df) (sketch below).
- For features with many distinct categories, one-hot encoding may increase the dataset drastically (too many features for the number of observations).

Missing values
EX: dataset with 2 missing values => we will use imputation with the mean (sketch below).

Scaling
- Standard scaler -> values are centered around 0 (mean of 0 and standard deviation of 1). To note: this also allows outliers to still show.
- MinMax scaler -> values between 0 and 1; very robust for cases with small standard deviations.
- MaxAbs scaling -> divide every element by the maximum absolute value; does not destroy sparsity and does not change the center.
- Robust scaling -> based on the 25th and 75th quantiles => outliers will have minimal impact.
- Normalizer -> scaler functions (MinMax scaler, robust scaler, mean/variance, ...) are applied to a single column, while the normalizer is applied to a single row => widely used in text (per-row sketch below).

Polynomial features
=> with a high degree of polynomials, we are very likely to overfit on the training data (sketch below).

Confidence intervals
-> related to confidence intervals: random sampling for each country. Both estimates are 57%, but the confidence interval is wider with 100 people and narrower with 1,000 people (sketch below).
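A minimal sketch of the ordinal map-function idea with pandas; the size column and its numeric mapping are hypothetical, not from the notes:

```python
import pandas as pd

# Hypothetical ordinal feature: T-shirt sizes have a natural order,
# so a dict passed to Series.map encodes each category as a number
# that reflects its "size".
df = pd.DataFrame({"size": ["S", "M", "L", "XL", "M"]})
size_map = {"S": 1, "M": 2, "L": 3, "XL": 4}
df["size_encoded"] = df["size"].map(size_map)
print(df)
```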
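The notes name pandas get_dummies directly; a short sketch with a hypothetical nominal color column:

```python
import pandas as pd

# Hypothetical nominal feature: colors have no ordering, so integer
# codes would impose a false 2 > 1 > 0 relationship. get_dummies
# creates one binary column per category instead.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
print(pd.get_dummies(df, columns=["color"]))
```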
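The notes do not say which library performs the imputation; a sketch assuming scikit-learn's SimpleImputer, with a hypothetical column containing 2 missing values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with 2 missing values; strategy="mean" replaces
# each NaN with the mean of the observed values.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [np.nan]])
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
print(X_imputed)  # NaNs become (1 + 2 + 4) / 3 ≈ 2.33
```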
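The scaler names in the notes match scikit-learn's preprocessing classes; a side-by-side sketch on a hypothetical column with one outlier, to show how each scaler reacts:

```python
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   RobustScaler, StandardScaler)

# Hypothetical column with one outlier (100.0).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

for scaler in (StandardScaler(), MinMaxScaler(), MaxAbsScaler(), RobustScaler()):
    # Column-wise scaling: each transformer is fit on the single column.
    print(type(scaler).__name__, scaler.fit_transform(X).ravel().round(2))
```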
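A sketch of the column-vs-row distinction, assuming scikit-learn's Normalizer and hypothetical bag-of-words-style rows:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical term-count rows: Normalizer rescales each ROW to unit
# length, unlike the column-wise scalers above.
X = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 9.0]])
X_norm = Normalizer(norm="l2").fit_transform(X)
print(X_norm)
print((X_norm ** 2).sum(axis=1))  # each row now has Euclidean norm 1
```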
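A sketch of the high-degree overfitting claim, assuming scikit-learn; the data, the degrees compared, and the random seed are all hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noisy linear data: a degree-15 polynomial tracks the
# training points closely but generalizes worse than a straight line.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = 2 * X.ravel() + rng.normal(0, 0.2, 30)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree, "train R2:", round(model.score(X_tr, y_tr), 3),
          "test R2:", round(model.score(X_te, y_te), 3))
```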
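A worked version of the 57% example, assuming the usual normal-approximation 95% interval for a proportion (the z value 1.96 is an assumption, not from the notes):

```python
import math

# Both samples estimate p = 0.57; the interval shrinks as n grows.
p = 0.57
for n in (100, 1000):
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    print(f"n={n}: 57% ± {margin:.1%}")
# n=100:  57% ± 9.7%  (wider)
# n=1000: 57% ± 3.1%  (narrower)
```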