Lecture 9 - Scikit-Learn PDF

DS 5110 – Lecture 9
Scikit-Learn
Roi Yehoshua, 2024

Agenda
- Introduction to machine learning
- The machine learning workflow
- Classification vs. regression
- Bias-variance tradeoff
- Scikit-learn
- Data preprocessing
- Model evaluation
- Pipelines
- Hyperparameter tuning
- Feature engineering

What is Machine Learning (ML)?
- Subfield of AI that deals with automatic learning from data
- Solving problems with ML vs. the traditional programming tools:
  [Figure: the traditional approach vs. the ML approach]

Types of Machine Learning
- Supervised learning – learn from labeled data
  - Goal is to learn a function that maps an input to an output
- Unsupervised learning – learn from unlabeled data
  - Goal is to extract useful patterns from the data
- Reinforcement learning – learn to take the best actions in an environment in order to maximize a cumulative reward
[Figure: types of machine learning]

Supervised Learning
- Given: a training set of n labeled examples D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}
  - Each x_i is a d-dimensional vector of feature values, x_i = (x_i1, x_i2, …, x_id)^T
  - y_i is the target or label we are trying to predict
- Goal: learn a function that maps x to y
- The trained model is evaluated on an independent test set
[Figure: input feature vector (x) → model → output label (y); e.g., incoming email → spam filter → spam / ham]

Features Matrix
- The dataset is typically stored in a single matrix called the features matrix (or design matrix)
- Each row of the matrix is one data point (one feature vector)
- Each column represents the values of a given feature across all the data points
[Figure: features matrix; row i is training example i, columns run from feature 1 to feature d]

Supervised Learning Workflow
Training set (induction: the learning algorithm learns a model from it):

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set (deduction: the learned model is applied to it):

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Classification vs. Regression
- When the label y is a discrete variable, the learning problem is called classification
  - y is the class that the sample belongs to
- When the label y is continuous, the learning problem is called regression
[Figure: classification vs. regression]

Example 1: Iris Flower Classification
- From the statistician Ronald Fisher (1936)
- Data set: 150 Iris flowers, 50 from each of the species: Setosa, Virginica, Versicolour
- Four attributes
  - Sepal width and length (in centimeters)
  - Petal width and length (in centimeters)

Example 2: Handwritten Digit Recognition
- MNIST data set
- Input: images of 28 × 28 pixels
  - Pixel values range from 0 to 255 (grey level)
- Output: a digit 0-9
- Data set contains 60,000 training examples and 10,000 test examples
- Setup:
  - Represent each input image as a vector x ∈ R^784
  - Learn a classifier h(x) such that h: x → {0,1,2,3,4,5,6,7,8,9}
- One of the first commercial and widely used ML systems (for zip codes and checks)

Example 3: Credit Approval
- Input: applicant information
- Output:
  - approve credit?
    → classification
  - credit line (dollar amount) → regression

Learning Algorithms
- There are many supervised learning algorithms
  - From simple decision trees to deep neural networks
- No free lunch (NFL) theorem:
  - No single learning algorithm performs best across all possible problems or datasets
- Occam’s Razor:
  - Given multiple explanations (models) for a phenomenon, prefer the simplest one

Bias-Variance Tradeoff
- The generalization error of a model can be decomposed into three parts
  - Bias measures the error caused by overly simple assumptions in the model, i.e., how far its average prediction is from the true value
  - Variance captures the sensitivity of the model to fluctuations in the training data
  - Irreducible error represents the noise inherent in the data

Overfitting and Underfitting
- Underfitting: the model is too simple to capture the pattern in the data
  - High bias, low variance
- Overfitting: the model learns the training data too well and fails to generalize
  - Low bias, high variance
- It’s challenging to find the right balance between bias and variance

Scikit-Learn
- scikit-learn is a free machine learning library for Python
- Provides efficient versions of a large number of common ML algorithms, including various classification, regression and clustering algorithms
- Characterized by a clean, uniform, and streamlined API, as well as by very useful and complete online documentation
- Designed to interoperate with the Python scientific libraries NumPy and SciPy
- Largely written in Python, with some core algorithms written in Cython to achieve fast performance
- http://scikit-learn.org/

Scikit-Learn Estimator API
- An estimator is any object that learns from data
- All estimators have a fit() method that learns the model’s parameters from the data
- Two main types of estimators:
  - Predictors can also make predictions on new data
    - Also have predict() and score() methods
  - Transformers make changes to the data (e.g., normalization)
    - Also have a transform() method

Iris Classification Example
- First step is to load the dataset
- Typically the dataset is loaded from an external file (e.g., CSV) via Pandas
- Scikit-Learn also provides some standard datasets in sklearn.datasets

Loading the Dataset
- Let’s load the Iris dataset from Scikit-Learn:

Data Exploration
- Understand the data types of the features
  - e.g., numerical, categorical
- Understand statistical properties of the features
- Examine correlations among the features
- Examine correlations between the features and the target
- Identify potential issues, such as missing values, outliers, imbalanced classes
- The info() method lets you inspect the data types and check for missing values
- The describe() method provides summary statistics for each feature
- Seaborn’s pairplot() visualizes the feature distributions and their relationships
- The corr() method computes correlation coefficients between the features:
- Check class imbalance:

Train-Test Split
- The test set is used to evaluate the model’s performance on unseen data
  - It should remain untouched until the final model evaluation
- You can use Scikit-Learn’s train_test_split() to split the data into training and test sets:
  - The default split is 75% train, 25% test
  - You can change this using the train_size or test_size parameters
  - Set stratify=y to maintain the same class proportions in the training and test sets

Data Preprocessing
- Data cleaning
- Handling missing data
- Encoding categorical data
- Discretization
- Scaling / normalization
- Outlier detection
- Dimensionality reduction
- sklearn.preprocessing includes many methods for data preprocessing

Scikit-Learn Transformers
- Transformers are objects used to preprocess and transform the data
- Transformers provide the following methods:
  - fit() - learns the transformation parameters from the data set
    - e.g., mean and std for feature scaling
  - transform() - applies the transformation to the data
  - fit_transform() - calls both fit() and transform()

Example: Feature Scaling
- Standardization transforms each feature to have zero mean and unit variance

Training the Model
- Create an instance of the estimator you want to use
  - e.g., LogisticRegression, RandomForestClassifier
- Set its hyperparameters in the constructor
  - Hyperparameters are configuration settings set before training a model
  - In contrast to parameters that are learned from the data during training
- Call its fit() method on the training set
- You can find a list of all the model’s hyperparameters on its Scikit-Learn page
  - e.g., https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Model Inspection
- After training, you can inspect the model’s learned parameters
- Each estimator provides attributes that end in an underscore to access these parameters
- For example, logistic regression learns a set of weights (coefficients) for each feature:

Model Evaluation
- We can first evaluate the model’s performance on the training set
- The score() method of the estimator returns the default scoring metric
  - In classification it is the accuracy (proportion of correctly classified examples)
  - In regression, it is the R² score (discussed later)
- To evaluate the performance on the test set we first need to preprocess it
  - Using the same transformers that were applied to the training set
  - Calling their transform() method instead of fit_transform()

Confusion Matrix
- A useful tool to inspect the classification errors of
  the model

Using the Model for Predictions
- We can use the estimator’s predict() method to make predictions on new samples

Model Selection
- It is important to evaluate multiple models on your dataset
  - Each model has unique strengths and weaknesses
  - It is hard to know in advance which model will perform the best
- Typically, you start from a baseline model that is fast to train
  - e.g., logistic regression, decision tree
- Then you can experiment with more complex models to see if you get better results
- Example for comparing different classifiers on Iris:
- Train all the estimators and save their results in a dictionary:
- Display the results in a DataFrame:

Pipelines
- Pipelines allow you to chain multiple transformers with a final estimator
- The pipeline automates the appropriate method calls during training/test
- Pipelines integrate seamlessly with other Scikit-Learn tools for model evaluation
- Example for defining and fitting a pipeline:
- To access the estimators inside a pipeline use its named_steps attribute:
- To modify the hyperparameters of an estimator use the set_params() method:
  - The syntax for writing the parameters is step_name__parameter_name (the step name, a double underscore, then the parameter name)

Model Selection and Tuning
- Goal: choose the model that performs best on unseen data (low generalization error)
- Model selection takes into account both:
  - Model type (e.g., logistic regression, SVM, KNN, etc.)
  - Different model configurations (hyperparameters)
- Model tuning is an iterative process of tweaking the model hyperparameters and data preprocessing steps to extract the best performance out of any given model
- Using the test set for model selection or tuning can cause overfitting to the test data
  - Leading to overly optimistic estimates of the model performance on unseen data

Validation Set
- Split the dataset into three disjoint subsets: training, validation, and test
- Use the validation set to evaluate and compare different model configurations
- Reserve the test set for a final evaluation of the model with the best validation score
- Problem: we lose a decent amount of the training data for validation

k-Fold Cross Validation
- Randomly split the training set into k equal-sized partitions S_1, …, S_k
- For i = 1, …, k:
  - Train the model on all the training data except for S_i
  - Test the model on S_i
- The resulting k errors are averaged to obtain an estimate of the generalization error

Cross Validation in Scikit-Learn
- The function cross_val_score() can be used to perform cross-validation
- The cv argument specifies the splitting strategy (defaults to 5 folds)

Hyperparameter Tuning
- Search for the optimal combination of hyperparameter values
- Main methods
  - Manual tuning: Choose hyperparameters based on intuition, experience, or educated guess
  - Grid search: Perform an exhaustive search over all the combinations in a predefined grid of hyperparameter values
  - Random search: Randomly sample combinations of hyperparameter values
  - Bayesian optimization: Use probabilistic models to guide the search for better hyperparameters by balancing exploration and exploitation

Grid Search In Scikit-Learn
- Can be performed using the GridSearchCV class
- Requires defining a grid of parameters
- Uses cross-validation to evaluate each combination of parameters
- For example, we can use grid search to find optimal parameters
  of the decision tree:
- You can also inspect the cross-validation scores for each combination of parameters:

Feature Engineering
- In the real world, data rarely comes in a tidy, [samples, features] format
- The generation of feature vectors from the raw data is called feature engineering
- This process typically involves the following steps:
  - Data preprocessing: data preparation, cleaning, and transformation
  - Feature selection: selecting the most useful features to train on
  - Feature extraction: combining existing features to produce a more useful one

Missing Data
- One of the most common data quality issues in real-world datasets
- Most ML algorithms cannot work with missing features
- Common solutions:
  - Remove samples or features with missing values
  - Imputation: Replace missing values with appropriate fill values such as the mean/median
  - Use model-based techniques to predict the missing value based on other features

Imputation of Missing Data
- The class SimpleImputer is an imputation transformer for completing missing values
- The strategy argument defines the imputation strategy
  - Can be ‘mean’, ‘median’, ‘most_frequent’, or ‘constant’

Handling Categorical Data
- A categorical feature represents discrete values that belong to a finite set of categories or classes
- There are two types of categorical variables
  - Ordinal features have a natural ordering defined on their values
    - Examples include income level, education level, clothing size, etc.
  - Nominal features have no concept of ordering
    - Examples include movie genres, weather names, country names, etc.
- For most ML models categorical features must first be transformed into numbers

Transforming Ordinal Features
- OrdinalEncoder transforms categorical features into integers (ordinal codes)
- The encoding results in a single column of integers [0 to #categories - 1] per feature

One-Hot Encoding
- Convert a categorical feature with m labels into a binary vector of size m where only one of its elements has 1 and all the other elements have 0
- The OneHotEncoder class converts categorical values into one-hot vectors
  - By default it returns a sparse matrix in which only nonzero values are stored
  - Specify sparse_output=False (sparse=False in scikit-learn versions before 1.2) to get a dense array

Discretization
- Converting a continuous feature into a discrete one by dividing its range into bins
- Two basic approaches:
  - Equal-width – all the bins have the same width
  - Equal frequency (equal depth) – all the bins have the same number of objects
- KBinsDiscretizer can be used to discretize continuous data into intervals
  - n_bins defines the number of bins to produce (default is 5)
  - strategy defines the strategy used to define the widths of the bins
  - Choose ‘uniform’ for equal-width binning and ‘quantile’ for equal-frequency binning

Feature Scaling
- Many ML algorithms struggle when features have different scales
  - Models may become biased toward features having higher magnitude values
- Feature scaling adjusts the range of features to ensure they have comparable scales
- Main methods for feature scaling:
  - Min-max scaling (or normalization)
    - Rescales features to a specified range, typically [0, 1]
  - Standard scaling (or standardization)
    - Transforms features to have a mean of 0 and a standard deviation of 1
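
As a rough illustration of the two scaling methods above, the sketch below applies scikit-learn's MinMaxScaler and StandardScaler to a small array. The values in X are hypothetical, chosen only so that the two columns sit on very different scales.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Min-max scaling: each column is rescaled to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: each column ends up with mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std.mean(axis=0), X_std.std(axis=0))
```

In a real workflow the scaler would be fitted on the training set only and then applied to the test set with transform().
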

Feature Scaling in Scikit-Learn
- Scikit-Learn provides the classes MinMaxScaler and StandardScaler for scaling
- Example for using the MinMaxScaler:

Feature Selection
- Select a subset of the features by removing irrelevant or redundant features
  - Irrelevant features provide no useful information for the task
    - e.g., the student’s ID is irrelevant for predicting the student’s GPA
  - Redundant features duplicate existing information
    - e.g., product purchase price and sales tax amount
- Main methods for feature selection
  - Filter methods: Use statistical techniques to rank features by relevance
  - Embedded methods: Perform feature selection as part of the model training process
  - Wrapper methods: Evaluate feature subsets by training and testing models iteratively
    - e.g., recursive feature elimination (RFE)

Feature Selection in Scikit-Learn
- The classes in sklearn.feature_selection can be used for feature selection
  - e.g., VarianceThreshold can be used to remove all features with small variance
- To demonstrate, we’ll use the Diagnostic Breast Cancer dataset
- We can remove features with less than 0.05 variance:
- To view the variances as well as which features were selected by this algorithm, we can use the variances_ property and the get_support(...) method, respectively:

Feature Extraction
- Create new features from the existing ones to better capture the patterns in the data
- Often requires innovation and domain knowledge from the data analyst
- For example, assume we have the following data of air pollution by country:
- Let’s add a population density feature:
- And now examine the correlation coefficients again:
- There is a relatively high correlation between density and air_pollution
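
The variance-based feature selection described in the "Feature Selection in Scikit-Learn" slides can be sketched roughly as follows. This is a minimal sketch, not the code from the slides: it assumes the dataset is loaded with load_breast_cancer(as_frame=True), which requires pandas, and the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import VarianceThreshold

# Load the diagnostic breast cancer dataset as a DataFrame (569 samples, 30 features)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Drop every feature whose variance is not above the 0.05 threshold
selector = VarianceThreshold(threshold=0.05)
X_selected = selector.fit_transform(X)

# variances_ holds the per-feature variances;
# get_support() returns a boolean mask marking the features that were kept
kept = X.columns[selector.get_support()]
print(f"Kept {X_selected.shape[1]} of {X.shape[1]} features")
print(selector.variances_.round(3))
print(list(kept))
```

Because the features are not scaled here, the threshold compares variances measured in different units, so this filter is usually only a first, coarse pass.
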

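Pulling the steps of this lecture together, a minimal end-to-end sketch on the Iris dataset could look like the following. The choice of LogisticRegression, the step names "scaler" and "clf", the grid of C values, and random_state=42 are assumptions made for illustration, not the configuration used on the slides.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load Iris and hold out a stratified test set (default split is 75% / 25%)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Chain a transformer with a final estimator; the step names are arbitrary
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation on the training set only
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
print("CV accuracy:", cv_scores.mean())

# Grid search over the classifier's C, addressed as step__parameter
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)

# Final evaluation: the fitted pipeline only calls transform() on the test set
test_accuracy = grid.score(X_test, y_test)
cm = confusion_matrix(y_test, grid.predict(X_test))
print("Test accuracy:", test_accuracy)
print(cm)
```

Putting the scaler inside the pipeline means it is re-fitted on the training folds in every cross-validation split, so no information from the held-out fold or the test set leaks into preprocessing.
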