Lecture 9 - Scikit-Learn PDF
Document Details
Northeastern University
2024
Roi Yehoshua
Summary
This is a lecture on Scikit-Learn, a Python machine learning library. The lecture covers topics such as an introduction to machine learning, the machine learning workflow, scikit-learn, data preprocessing, and model evaluation.
Full Transcript
DS 5110 – Lecture 9: Scikit-Learn
Roi Yehoshua

Agenda
- Introduction to machine learning
- The machine learning workflow
- Classification vs. regression
- Bias-variance tradeoff
- Scikit-learn
- Data preprocessing
- Model evaluation
- Pipelines
- Hyperparameter tuning
- Feature engineering

What is Machine Learning (ML)?
- A subfield of AI that deals with automatic learning from data.
- Solving problems with ML vs. traditional programming tools (figure comparing the traditional approach with the ML approach).

Types of Machine Learning
- Supervised learning – learn from labeled data; the goal is to learn a function that maps an input to an output.
- Unsupervised learning – learn from unlabeled data; the goal is to extract useful patterns from the data.
- Reinforcement learning – learn to take the best actions in an environment in order to maximize a cumulative reward.

Supervised Learning
- Given: a training set of n labeled examples D = {(x1, y1), (x2, y2), ..., (xn, yn)}.
- Each xi is a d-dimensional vector of feature values, xi = (xi1, xi2, ..., xid)^T.
- yi is the target or label we are trying to predict.
- Goal: learn a function that maps x to y.
- The trained model is evaluated on an independent test set.
- Example: a spam filter maps an incoming email (feature vector x) to an output label y (spam or ham).

Features Matrix
- The dataset is typically stored in a single matrix called the features (or design) matrix.
- Each row of the matrix is one data point (one feature vector).
- Each column represents the values of a given feature across all the data points.

Supervised Learning Workflow
- A learning algorithm performs induction on a labeled training set to learn a model; the model is then applied (deduction) to a test set of unlabeled records to predict their labels.
(Figure: a small loan-approval table with attribute columns and a class column, split into a labeled training set and a test set with unknown class labels.)

Classification vs. Regression
- When the label y is a discrete variable, the learning problem is called classification; y is the class that the sample belongs to.
- When the label y is continuous, the learning problem is called regression.

Example 1: Iris Flower Classification
- Dataset introduced by the statistician Ronald Fisher (1936).
- 150 Iris flowers, 50 from each of the species Setosa, Virginica, and Versicolour.
- Four attributes: sepal width and length, and petal width and length (in centimeters).

Example 2: Handwritten Digit Recognition
- MNIST dataset: input images of 28×28 pixels; pixel values range from 0 to 255 (grey level).
- Output: a digit 0–9.
- The dataset contains 60,000 training examples and 10,000 test examples.
- Setup: represent each input image as a vector x ∈ R^784 and learn a classifier h(x) such that h: x → {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.
- One of the first commercial and widely used ML systems (for zip codes and checks).

Example 3: Credit Approval
- Input: applicant information.
- Output: approve credit? → classification; credit line (dollar amount) → regression.
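To make the MNIST digit-recognition example above concrete, here is a minimal sketch (not from the slides) that loads the dataset with scikit-learn's fetch_openml and checks that each image is a flattened 784-dimensional vector; the dataset name "mnist_784" and the conventional 60,000/10,000 split are standard, but the exact code is illustrative.

```python
from sklearn.datasets import fetch_openml

# Download MNIST from OpenML (requires internet access).
# as_frame=False returns NumPy arrays instead of a DataFrame.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data, mnist.target

print(X.shape)            # (70000, 784): each row is a 28x28 image flattened to 784 pixels
print(X.min(), X.max())   # pixel values range from 0 to 255
print(y[:10])             # labels are the digits '0'-'9' (stored as strings)

# Conventional MNIST split: first 60,000 images for training, last 10,000 for testing
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
```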
Learning Algorithms
- There are many supervised learning algorithms, from simple decision trees to deep neural networks.
- No free lunch (NFL) theorem: no single learning algorithm performs best across all possible problems or datasets.
- Occam's razor: given multiple explanations (models) for a phenomenon, prefer the simplest one.

Bias-Variance Tradeoff
- The generalization error of a model can be decomposed into three parts:
  - Bias measures the error of the model on the training data.
  - Variance captures the sensitivity of the model to fluctuations in the training data.
  - Irreducible error represents the noise inherent in the data.

Overfitting and Underfitting
- Underfitting: the model is too simple to capture the pattern in the data (high bias, low variance).
- Overfitting: the model learns the training data too well and fails to generalize (low bias, high variance).
- It is challenging to find the right balance between bias and variance.

Scikit-Learn
- scikit-learn is a free machine learning library for Python.
- Provides efficient versions of a large number of common ML algorithms, including various classification, regression, and clustering algorithms.
- Characterized by a clean, uniform, and streamlined API, as well as very useful and complete online documentation.
- Designed to interoperate with the Python scientific libraries NumPy and SciPy.
- Largely written in Python, with some core algorithms written in Cython to achieve fast performance.
- http://scikit-learn.org/

Scikit-Learn Estimator API
- An estimator is any object that learns from data.
- All estimators have a fit() method that learns the model's parameters from the data.
- Two main types of estimators:
  - Predictors can also make predictions on new data; they also have predict() and score() methods.
  - Transformers make changes to the data (e.g., normalization); they also have a transform() method.

Iris Classification Example
- The first step is to load the dataset.
- Typically the dataset is loaded from an external file (e.g., CSV) via Pandas.
- Scikit-Learn also provides some standard datasets in sklearn.datasets.

Loading the Dataset
- Let's load the Iris dataset from Scikit-Learn.

Data Exploration
- Understand the data types of the features (e.g., numerical, categorical).
- Understand statistical properties of the features.
- Examine correlations among the features, and between the features and the target.
- Identify potential issues, such as missing values, outliers, and imbalanced classes.
- The info() method lets you inspect the data types and check for missing values.
- The describe() method provides summary statistics for each feature.
- Seaborn's pairplot() visualizes the feature distributions and their relationships.
- The corr() method computes correlation coefficients between the features.
- Check for class imbalance.

Train-Test Split
- The test set is used to evaluate the model's performance on unseen data; it should remain untouched until the final model evaluation.
- You can use Scikit-Learn's train_test_split() to split the data into train/test sets.
- The default split is 75% train, 25% test; you can change this using the train_size or test_size parameters.
- Set stratify=y to maintain the same class proportions in the test set.
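The loading, exploration, and splitting code shown on the slides is not reproduced in the transcript, so here is a minimal sketch of those steps with illustrative variable names, using load_iris and train_test_split as described above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset as a pandas DataFrame (the frame includes a 'target' column)
iris = load_iris(as_frame=True)
df = iris.frame

# Data exploration
df.info()                             # data types and missing-value check
print(df.describe())                  # summary statistics for each feature
print(df.corr())                      # correlation coefficients between the features
print(df["target"].value_counts())    # check class imbalance (50 samples per class)

# Split into a features matrix X and a label vector y, then into train and test sets
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)    # (112, 4) (38, 4)
```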
Data Preprocessing
- Data cleaning, handling missing data, encoding categorical data, discretization, scaling/normalization, outlier detection, and dimensionality reduction.
- sklearn.preprocessing includes many methods for data preprocessing.

Scikit-Learn Transformers
- Transformers are objects used to preprocess and transform the data. They provide the following methods:
  - fit() – learns model parameters from the dataset (e.g., mean and std for feature scaling).
  - transform() – applies the transformation to the data.
  - fit_transform() – calls both fit() and transform().
- Example: feature scaling by standardization transforms each feature to have zero mean and unit variance.

Training the Model
- Create an instance of the estimator you want to use (e.g., LogisticRegression, RandomForestClassifier).
- Set its hyperparameters in the constructor. Hyperparameters are configuration settings chosen before training a model, in contrast to parameters, which are learned from the data during training.
- Call its fit() method on the training set.
- You can find a list of all the model's hyperparameters on its Scikit-Learn page, e.g., https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Model Inspection
- After training, you can inspect the model's learned parameters.
- Each estimator provides attributes that end in an underscore to access these parameters.
- For example, logistic regression learns a set of weights (coefficients) for each feature.

Model Evaluation
- We can first evaluate the model's performance on the training set.
- The score() method of the estimator returns the default scoring metric: in classification it is the accuracy (proportion of correctly classified examples); in regression it is the R² score (discussed later).
- To evaluate the performance on the test set, we first need to preprocess it using the same transformers that were applied to the training set, calling their transform() method instead of fit_transform().
- Confusion matrix: a useful tool to inspect the classification errors of the model.
- We can use the estimator's predict() method to make predictions on new samples.

Model Selection
- It is important to evaluate multiple models on your dataset: each model has unique strengths and weaknesses, and it is hard to know in advance which model will perform best.
- Typically, you start from a baseline model that is fast to train (e.g., logistic regression, a decision tree), then experiment with more complex models to see if you get better results.
- Example: comparing different classifiers on Iris – train all the estimators, save their results in a dictionary, and display the results in a DataFrame.

Pipelines
- Pipelines allow you to chain multiple transformers with a final estimator.
- The pipeline automates the appropriate method calls during training/testing and integrates seamlessly with other Scikit-Learn tools for model evaluation.
- To access the estimators inside a pipeline, use its named_steps attribute.
- To modify the hyperparameters of an estimator, use the set_params() method; the syntax for writing the parameters is <step name>__<parameter name>.
- An example of defining and fitting a pipeline is sketched below.
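The pipeline code on the slides is not reproduced in the transcript, so this is a minimal sketch tying together training, evaluation, the confusion matrix, and pipelines; it reuses the X_train/X_test split from the Iris sketch above, and the step names "scaler" and "clf" are illustrative.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Chain a transformer with a final estimator. During fit() the pipeline calls
# fit_transform() on each transformer; during predict()/score() it calls transform().
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)
print("train accuracy:", pipe.score(X_train, y_train))   # default metric: accuracy
print("test accuracy:", pipe.score(X_test, y_test))

# Inspect the classification errors on the test set
print(confusion_matrix(y_test, pipe.predict(X_test)))

# Learned parameters end with an underscore, e.g. the logistic regression coefficients
print(pipe.named_steps["clf"].coef_)

# Hyperparameters of a step can be changed with set_params() using <step>__<parameter>
pipe.set_params(clf__C=0.5)
```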
Model Selection and Tuning
- Goal: choose the model that performs best on unseen data (low generalization error).
- Model selection takes into account both the model type (e.g., logistic regression, SVM, KNN) and different model configurations (hyperparameters).
- Model tuning is an iterative process of tweaking the model hyperparameters and data preprocessing steps to extract the best performance out of any given model.
- Using the test set for model selection or tuning can cause overfitting to the test data, leading to overly optimistic estimates of the model's performance on unseen data.

Validation Set
- Split the dataset into three disjoint subsets: training, validation, and test.
- Use the validation set to evaluate and compare different model configurations.
- Reserve the test set for a final evaluation of the model with the best validation score.
- Problem: we lose a decent amount of the training data for validation.

k-Fold Cross-Validation
- Randomly split the training set into k equal-sized partitions S1, ..., Sk.
- For i = 1, ..., k: train the model on all the training data except Si, and test the model on Si.
- The resulting k errors are averaged to obtain an estimate of the generalization error.

Cross-Validation in Scikit-Learn
- The function cross_val_score() can be used to perform cross-validation.
- The cv argument specifies the splitting strategy (defaults to 5 folds).

Hyperparameter Tuning
- Search for the optimal combination of hyperparameter values. Main methods:
  - Manual tuning: choose hyperparameters based on intuition, experience, or an educated guess.
  - Grid search: perform an exhaustive search over all possible hyperparameter combinations.
  - Random search: randomly sample combinations of hyperparameter values.
  - Bayesian optimization: use probabilistic models to guide the search for better hyperparameters by balancing exploration and exploitation.

Grid Search in Scikit-Learn
- Grid search can be performed using the GridSearchCV class.
- It requires defining a grid of parameters and uses cross-validation to evaluate each combination of parameters.
- For example, we can use grid search to find optimal parameters of a decision tree.
- You can also inspect the cross-validation scores for each combination of parameters (see the sketch below).
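The decision-tree parameter grid used on the slides is not in the transcript, so the grid below is illustrative; the sketch shows cross_val_score(), GridSearchCV, and how cv_results_ can be inspected for the score of each parameter combination, continuing with the Iris training set from above.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# 5-fold cross-validation of a single model configuration
tree = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(tree, X_train, y_train, cv=5)
print(scores.mean(), scores.std())

# Exhaustive search over an (illustrative) hyperparameter grid,
# with each combination evaluated by cross-validation on the training set
param_grid = {
    "max_depth": [2, 3, 4, None],
    "min_samples_split": [2, 5, 10],
}
grid = GridSearchCV(tree, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)

# Inspect the cross-validation score of every parameter combination
results = pd.DataFrame(grid.cv_results_)
print(results[["params", "mean_test_score", "std_test_score"]])
```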
Feature Engineering
- In the real world, data rarely comes in a tidy [samples, features] format.
- The generation of feature vectors from the raw data is called feature engineering. This process typically involves the following steps:
  - Data preprocessing: data preparation, cleaning, and transformation.
  - Feature selection: selecting the most useful features to train on.
  - Feature extraction: combining existing features to produce a more useful one.

Missing Data
- One of the most common data quality issues in real-world datasets; most ML algorithms cannot work with missing features.
- Common solutions:
  - Remove samples or features with missing values.
  - Imputation: replace missing values with appropriate fill values such as the mean/median.
  - Use model-based techniques to predict the missing value based on other features.

Imputation of Missing Data
- The class SimpleImputer is an imputation transformer for completing missing values.
- The strategy argument defines the imputation strategy: 'mean', 'median', 'most_frequent', or 'constant'.

Handling Categorical Data
- A categorical feature represents discrete values that belong to a finite set of categories or classes.
- There are two types of categorical variables:
  - Ordinal features have a natural ordering defined on their values (e.g., income level, education level, clothing size).
  - Nominal features have no concept of ordering (e.g., movie genres, weather names, country names).
- For most ML models, categorical features must first be transformed into numbers.

Transforming Ordinal Features
- OrdinalEncoder transforms categorical features into integers (ordinal codes).
- The encoding results in a single column of integers [0 to #categories - 1] per feature.

One-Hot Encoding
- Converts a categorical feature with m labels into a binary vector of size m in which only one element is 1 and all the other elements are 0.
- The OneHotEncoder class converts categorical values into one-hot vectors.
- By default it returns a sparse matrix in which only nonzero values are stored; specify sparse=False to get a dense array.

Discretization
- Converting a continuous feature into a discrete one by dividing its range into bins. Two basic approaches:
  - Equal width – all the bins have the same width.
  - Equal frequency (equal depth) – all the bins have the same number of objects.
- KBinsDiscretizer can be used to discretize continuous data into intervals.
- n_bins defines the number of bins to produce (the default is 5).
- strategy defines how the widths of the bins are determined: choose 'uniform' for equal-width binning and 'quantile' for equal-frequency binning.

Feature Scaling
- Many ML algorithms struggle when features have different scales; models may become biased toward features with higher-magnitude values.
- Feature scaling adjusts the range of the features to ensure they have comparable scales. Main methods:
  - Min-max scaling (or normalization): rescales features to a specified range, typically [0, 1].
  - Standard scaling (or standardization): transforms features to have a mean of 0 and a standard deviation of 1.

Feature Scaling in Scikit-Learn
- Scikit-Learn provides the classes MinMaxScaler and StandardScaler for scaling.
- An example of using the MinMaxScaler is sketched below.
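The MinMaxScaler example shown on the slide is not reproduced in the transcript; here is a minimal sketch, again using the Iris train/test split from above, that fits the scaler on the training set and only calls transform() on the test set, as discussed under Model Evaluation.

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Fit the scaler on the training set only, then apply the same transformation
# to the test set with transform() (not fit_transform())
scaler = MinMaxScaler()                        # rescales each feature to [0, 1]
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))

# StandardScaler works the same way, giving zero mean and unit variance per feature
std = StandardScaler()
X_train_std = std.fit_transform(X_train)
print(X_train_std.mean(axis=0).round(2), X_train_std.std(axis=0).round(2))
```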
Feature Selection
- Select a subset of the features by removing irrelevant or redundant features.
  - Irrelevant features provide no useful information for the task (e.g., a student's ID is irrelevant for predicting the student's GPA).
  - Redundant features duplicate existing information (e.g., product purchase price and sales tax amount).
- Main methods for feature selection:
  - Filter methods: use statistical techniques to rank features by relevance.
  - Embedded methods: perform feature selection as part of the model training process.
  - Wrapper methods: evaluate feature subsets by training and testing models iteratively, e.g., recursive feature elimination (RFE).

Feature Selection in Scikit-Learn
- The classes in sklearn.feature_selection can be used for feature selection; e.g., VarianceThreshold can be used to remove all features with small variance.
- To demonstrate, we'll use the Diagnostic Breast Cancer dataset.
- We can remove features with less than 0.05 variance.
- To view the variances as well as which features were selected by this algorithm, use the variances_ property and the get_support() method, respectively.

Feature Extraction
- Create new features from the existing ones to better capture the patterns in the data; this often requires innovation and domain knowledge from the data analyst.
- For example, given data on air pollution by country, we can add a population density feature and then examine the correlation coefficients again: there is a relatively high correlation between density and air_pollution.
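Finally, returning to the feature selection slides above, here is a minimal sketch (not the slide code) of removing low-variance features from the Diagnostic Breast Cancer dataset with VarianceThreshold; the 0.05 threshold is the value quoted on the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import VarianceThreshold

# Load the Diagnostic Breast Cancer dataset (569 samples, 30 numeric features)
X, y = load_breast_cancer(return_X_y=True)

# Remove all features whose variance is below 0.05
selector = VarianceThreshold(threshold=0.05)
X_reduced = selector.fit_transform(X)
print(X.shape, "->", X_reduced.shape)

# Inspect the variance of each feature and which features were kept
print(selector.variances_)
print(selector.get_support())          # boolean mask of the selected features
```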