Machine Learning Tools PG Lecture 03 PDF
Document Details
The University of Adelaide
2024
Summary
This document is a lecture on machine learning, focusing on topics such as data workflows and preprocessing techniques like handling missing data and scaling. It covers a supervised learning workflow, outlining steps like data inspection, model selection, and training/testing. It also discusses different techniques for dealing with missing data, including removal and replacement strategies, as well as advanced methods and using sklearn. The lecture also notes the importance of consistent scaling and pipelines.
Full Transcript
Using Machine Learning Tools PG
Week 3 – Machine Learning Workflow continued
COMP SCI 7317, Trimester 2, 2024

From last week… A typical workflow for supervised learning
1. Data: Look at your data for common problems
   - Missing, invalid, outliers, noise, scale etc. (remember GIGO)
2. Model selection: Choose some candidate models based on data and task
   - Use your knowledge/experience and that of others (i.e. docs, papers, blogs, …)
3. Testing and training: Split data into training and test sets
   - Need to do this carefully (more on this next week)
4. Validation: Split training data into (reduced) training and validation sets
   - Holdout or cross-validation
5. Train candidate models on training sets
6. Select best model type/settings based on validation set errors
7. Retrain model on the full training set (merging train/validation)
8. Apply best model to test data
   - This gives your estimate of the generalisation error

From last week…
1. Generalisation
2. Training, validation and testing
3. Cross-validation

What we will cover this week
Enhancing the machine learning pipeline through additional preprocessing steps:
- Missing data
- Scaling
- Sklearn pipelines
The internals of linear regression (a key base model)
Tuning hyperparameters

Enhancing the machine learning pipeline: additional data preprocessing

Missing data
[Figure: example data matrices with columns X1, X2, X3; missing entries shown as ???]
In a machine learning context, missing data refers to the absence of certain values in the dataset that is used.
Can occur due to a number of reasons:
- Errors in data collection or entry
- Certain values being irrelevant/inapplicable for specific observations
It is important to consider missing data as it can:
- Introduce bias if missingness is related to target/key variables
- Decrease statistical power of a model
- Add complexity to the data processing pipeline

Checking for missing data, errors, and outliers
Useful to look carefully at data by eye and visualise it to determine the presence of missing data (workshop 3). Visualisations can also tell us if there are any outliers, errors or other issues.

Missing data
There are several ways to deal with missing data, such as missing features:
1. Removing data
   Logic: Removing rows or columns with missing values simplifies the dataset.
   Pro: Simple to execute and avoids potential inaccuracies introduced by estimated/imputed values.
   Con: Can lead to loss of valuable information, especially if a large portion of the data is missing.

Missing data
There are several ways to deal with missing data, such as missing features:
2. Replacing missing values
   Logic: Fill in missing values with a placeholder or an estimated value (e.g., mean, median, mode).
   Pro: Maintains dataset size and statistical power.
   Con: May introduce bias or reduce variance if the imputed values do not accurately represent the missing data.
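To make the removal and replacement options concrete, here is a minimal sketch using pandas (a tool the slides do not prescribe); the DataFrame df, its column names and its toy values are purely illustrative, not taken from the lecture.

import numpy as np
import pandas as pd

# Hypothetical toy dataset with missing entries marked as NaN
df = pd.DataFrame({
    "X1": [1.0, 2.0, np.nan, 4.0],
    "X2": [10.0, np.nan, 30.0, 40.0],
    "X3": [100.0, 200.0, 300.0, 400.0],
})

# Check for missing data: count missing values per column
print(df.isna().sum())

# Option 1: removing data
rows_removed = df.dropna()        # drop rows that contain any missing value
cols_removed = df.dropna(axis=1)  # drop columns that contain any missing value

# Option 2: replacing missing values with a simple estimate (column medians)
filled = df.fillna(df.median(numeric_only=True))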
Missing data
There are several ways to deal with missing data, such as missing features:
3. Using advanced methods
   Logic: Utilise more sophisticated techniques, such as K-Nearest Neighbours (KNN) or Multivariate Imputation by Chained Equations (MICE), to predict missing values based on the relationships between features.
   Pro: Can provide more accurate imputations by considering the relationships between features.
   Con: Can be computationally intensive and may require more expertise to implement correctly.

Missing data: sklearn
sklearn supports various imputation strategies. For example, to replace missing features with the median feature value:

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
# Calculate the median for each feature
imputer.fit(training_features)
# Fill missing data (NaN) with median value
filled_features = imputer.transform(training_features)

Scaling data
Consistency in units: Features often have different units and ranges, which can lead to some features dominating others.
Example: Model to predict house prices using features:
- Size (in square metres, e.g., 50 to 500)
- Number of bedrooms (count, e.g., 1 to 5)
- Price (target variable in dollars, e.g., $100,000 to $1,000,000)
- Year built (year, e.g., 1900 to 2020)
What could happen if we use these features without scaling?
'Year Built' and 'Price' can overshadow the influence of 'Size' and 'Number of Bedrooms' if not scaled properly. Scaling features to a similar range ensures all features contribute equally to the model, leading to more balanced and accurate predictions.

Scaling data
Distance metrics: Many ML methods (e.g., k-NN, SVM) rely on distance metrics, which are sensitive to the scale of the features. Without scaling, features with larger ranges can disproportionately influence the model.
Numerical stability: Scaling improves numerical stability and the performance of gradient descent algorithms by ensuring that features are on a similar scale.
Model performance: Some ML algorithms (e.g., linear regression, neural networks) perform better and converge faster when features are scaled. Even where it is not strictly needed, it is safer to scale in general as it does no harm.

Scaling data: Min-max scaling
sklearn supports several scaling methods. Example: scale all features to a specified range, typically [0, 1].

from sklearn.preprocessing import MinMaxScaler
# Default range is [0, 1]
scaler = MinMaxScaler()
# Find min and max value for each feature
scaler.fit(training_features)
# Apply scaling to each feature
scaled_features = scaler.transform(training_features)

Scaling data: Standardised scaling
Alternatively, scale each feature to have mean 0 and variance 1. This can be done using sklearn's StandardScaler:

from sklearn.preprocessing import StandardScaler
# Default is mean 0, variance 1
scaler = StandardScaler()
# Find mean and variance for each feature
scaler.fit(training_features)
# Apply scaling to each feature
scaled_features = scaler.transform(training_features)
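As a quick illustration of the difference between the two scalers, the following sketch applies both to a hypothetical toy feature matrix (the values and variable names are invented for illustration and do not come from the slides).

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix: a 'size'-like column and a 'bedrooms'-like column
X = np.array([[50.0, 1.0],
              [120.0, 3.0],
              [500.0, 5.0]])

# Min-max scaling maps each column into [0, 1]
print(MinMaxScaler().fit_transform(X))

# Standardised scaling gives each column mean 0 and unit variance
print(StandardScaler().fit_transform(X))

Here fit_transform is simply shorthand for fit followed by transform on the same data.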
Scaling data: Other considerations
Comparison of scaling methods:
- Standardised scaling (StandardScaler):
  Pro: Centres data by subtracting the mean and scaling to unit variance.
  Con: Sensitive to outliers, which can distort the scaled values.
- Min-max scaling (MinMaxScaler):
  Pro: Scales data to a fixed range [0, 1], preserving relationships between features.
  Con: Can be affected by outliers, leading to a smaller range for most data points.
- Other options also exist. Using percentiles (RobustScaler):
  Pro: Uses percentiles, making it robust to outliers. Centres data by subtracting the median and scaling according to the interquartile range (IQR).
  Con: May not scale data as effectively if the data is not skewed, but generally a safer choice when outliers and skews are present.
Tip: Consider whether the data has outliers, skewed distributions, or multi-modal distributions.

Scaling data: Other considerations
[Figure: the same distribution shown before scaling and after each scaling method]
- Original data: the plot shows the distribution of the data before any scaling is applied. Outliers are present on the far right, which will affect the scaling methods differently.
- StandardScaler: the main body of the data is centred around 0, but the outliers are still evident at the extreme right. This can distort the scaling, as the presence of outliers affects the mean and standard deviation.
- MinMaxScaler: the main body of the data is scaled between 0 and 1, but the outliers are also scaled to the extreme ends of the range. This leads to a smaller range for most data points, which are compressed between 0 and 0.4.
- RobustScaler: the data is scaled with a more uniform distribution. The outliers do not have as significant an impact, as shown by the more balanced distribution around zero.

Pipelines in sklearn
What is a pipeline in sklearn?
A Pipeline is a tool in sklearn that allows you to chain together multiple data preprocessing steps and model training steps into a single, cohesive workflow.
Why are pipelines useful?
- Consistency: Ensures that the same preprocessing steps (e.g., imputation, scaling) are applied to all datasets.
- Code organisation: Simplifies your code by combining multiple steps into a single object, making it more readable and maintainable.
- Avoids data leakage: Best practice = chain steps together as a Pipeline and apply to multiple datasets. This prevents information from the validation/test set influencing the training process, which is necessary to avoid bias and obtain a more accurate model evaluation.

Important considerations
Unseen data: Although preprocessing steps must be applied consistently, it is important to treat validation/test data as unseen.
- Estimating parameters from the whole dataset causes information leakage. Example: using the median for imputation, or the min, max, mean, variance or percentiles for scaling.
- Safest to avoid fitting preprocessing on combined training + validation/test data. But some instances are OK: preprocessing based on single instances, such as for images.

Important considerations
Cross-validation: During k-fold cross-validation, refit the pipeline for each fold to ensure proper scaling and imputation for the (varying) training and validation sets.
Preprocessing steps in the pipeline: Include all necessary preprocessing steps (e.g., imputation, scaling) in the pipeline to ensure they are applied consistently.
Example pipeline for k-fold training: each time, the preprocessing is likely to use different values (but the same principle applies). A sketch of this per-fold refitting follows.
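The following is a minimal sketch of that per-fold refitting, using only a StandardScaler for brevity; the toy array X is illustrative, and in practice the full imputer-plus-scaler pipeline from the next slides would be refit inside each fold in the same way.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Illustrative toy feature matrix (10 samples, 2 features)
X = np.arange(20, dtype=float).reshape(10, 2)

kf = KFold(n_splits=5)
for train_idx, val_idx in kf.split(X):
    scaler = StandardScaler()
    # Fit the scaler on this fold's training portion only (no leakage from
    # the validation portion), then apply the same transformation to both
    X_train_scaled = scaler.fit_transform(X[train_idx])
    X_val_scaled = scaler.transform(X[val_idx])
    # The estimated statistics differ from fold to fold (same principle each time)
    print(scaler.mean_)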
Pipelines: an example

from sklearn.pipeline import Pipeline
# Replace missing features with median, and scale to std distribution
std_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler())
])
transformed_features = std_pipeline.fit_transform(training_features)
# Apply the same transformations to test data
trans_test_features = std_pipeline.transform(test_features)

Pipelines: an example
Can build in the ML method directly:

from sklearn.pipeline import Pipeline
# Do imputation, scaling and then feed into ML method
std_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
    ('linreg', LinearRegression())
])
std_pipeline.fit(X_train, y_train)
# Apply the same transformations (and method) to test data
y_pred = std_pipeline.predict(X_test)

Linear regression

A closer look at linear regression
We have used linear regression as a 'black box': a model provided by sklearn that can be fitted to our training features, then used to predict labels for features. i.e. we have some code → run data through it → make predictions.
Video reference: Machine Learning & Simulation. (2021, April 21). Linear Regression in High Dimensions - Deriving the matrix-valued Least-Squares Loss [Video]. YouTube. https://www.youtube.com/watch?v=jXVvsa58aoQ

Linear regression
The goal of linear regression is to find the best-fitting line (or hyperplane) that predicts the dependent variable y based on the independent variables (features) x_1, x_2, …, x_n.
Compact form: y = h_θ(x) = θ · x
Expanded form: y = θ_0 + θ_1 x_1 + θ_2 x_2 + ⋯ + θ_n x_n
In vector form: y = [θ_0 θ_1 … θ_n] [1, x_1, x_2, …, x_n]ᵀ
where:
- θ: parameter vector
- x: feature vector
- y: predicted value
- h_θ(x): hypothesis function, represents the predicted value given input x and parameters θ
- θ · x: dot product of parameter vector θ and feature vector x

Linear regression
Parameter vector, θ
These are the values that the model learns during training. It contains all the parameters (also called feature weights) θ_1, θ_2, …, θ_n and the bias term θ_0 (intercept). (The bias term here does not relate to biased/unbiased estimation!)
- Bias term: shifts the regression line vertically.
- Feature weights: represent the strength and direction of the relationship between each independent variable (feature) and the dependent variable.

Linear regression
Feature vector, x
Contains the features x_1, x_2, …, x_n.
x_0 is always 1 to account for the bias term, θ_0.

Linear regression
Example: Predict the price of a house based on its size, number of bedrooms, and age using linear regression.
- x_1, x_2, …, x_n (features) could be the size of the house (x_1), number of bedrooms (x_2), and age of the house (x_3).
- θ_1, θ_2, …, θ_n (parameters) are numbers that the model learns, showing how much each feature contributes to the house price.
- θ_0 (bias term) is a base price that adjusts the prediction.
The linear regression model combines these inputs with their respective weights to make a prediction:
y (predicted house price) = θ_0 (base price) + θ_1 × size + θ_2 × no. bedrooms + θ_3 × age

Linear regression: optimisation
When we fit a linear regression model, we optimise the parameter vector θ to minimise an error/loss/cost function over the data points, e.g. the mean squared error (MSE), where y = target values.
If we have N features, x and θ are (N+1)-dimensional vectors, so the linear regression model has N+1 parameters to fit.
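The MSE formula itself appears only as an image on the slide and is not preserved in this transcript; for reference, the standard definition over m training examples, with predictions h_θ(x) = θ · x, is:

% Mean squared error cost for linear regression (standard definition)
% x^{(i)} is the i-th feature vector (with x_0^{(i)} = 1), y^{(i)} its target value
\[
  \mathrm{MSE}(\theta) \;=\; \frac{1}{m} \sum_{i=1}^{m} \left( \theta \cdot x^{(i)} - y^{(i)} \right)^{2}
\]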
Linear regression: optimisation
There are two ways of finding the optimal solution in linear regression: an analytical solution or iterative methods.

Analytical solution (closed form, using the normal equation)
- Direct calculation using a mathematical formula.
- Efficient for smaller datasets.
- Example: the closed-form solution is given by θ = (Xᵀ X)⁻¹ Xᵀ y

Linear regression: optimisation
Iterative methods (i.e. gradient descent)
- Used when the analytic solution is impractical (e.g., very large datasets).
- Process: start with an initial guess for θ and iteratively update it to minimise the cost function.
- Update equation: θ = θ − α ∇J(θ), where α = learning rate and ∇J(θ) = gradient of the cost function, J = MSE.
Why use iterative methods instead?
- Scalability: suitable for very large datasets where analytic solutions are computationally expensive.
- Flexibility: can handle more complex models and different types of cost functions.

Linear regression: optimisation
Gradient descent
- Calculate the gradient: find the gradient (slope) of the cost function at the current point.
- Update the parameters: adjust them in the direction that minimises the cost function.
In higher dimensions:
- Calculate the partial derivatives (slopes) with respect to each parameter.
- Adjust each parameter θ_i to reduce the error, moving 'downhill' along the gradient.

Linear regression: optimisation
Challenges with gradient descent: the learning rate
The learning rate controls the size of the steps taken towards the minimum of the cost function. It determines how quickly the algorithm converges to the optimal solution.
- Too fast (high learning rate) - overshoot: the steps taken towards the minimum are too large. This causes the algorithm to overshoot the minimum, missing it entirely. The algorithm may oscillate around the minimum or even diverge, failing to find the optimal solution.
- Too slow (low learning rate) - slow convergence: the steps taken towards the minimum are very small. This causes the algorithm to converge very slowly, taking a long time to reach the minimum. While more stable, it is inefficient and can significantly increase the computational time.

Linear regression ≠ straight line
Linear regression can do more than fit straight lines.
Polynomial regression (a 1D fit is shown in the slide figure). Example: y = θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_2²
- Linear term for feature x_1
- Linear and quadratic term for feature x_2
- Linear in each θ parameter but not in the data (x) values

Polynomial (linear) regression
- Still linear in the model parameters, so the same maths applies.
- Can include non-linear functions of the data/features.
- sklearn includes easy ways to create polynomial features.
- How to decide what degree to use (quadratic, cubic, etc.)? (see the sketch after the code below)

from sklearn.preprocessing import PolynomialFeatures
# To generate the quadratic features
poly = PolynomialFeatures(degree=2)
# Include extra quadratic features
poly_training = poly.fit_transform(training_features)
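One practical way to choose the degree, not spelled out on this slide, is to treat it as a hyperparameter and compare cross-validated errors. A minimal sketch, assuming the training_features and training_labels arrays used elsewhere in the lecture:

from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Compare candidate degrees by 5-fold cross-validated MSE; the degree acts as
# a hyperparameter (see the next section on tuning)
for degree in [1, 2, 3]:
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('scaler', StandardScaler()),
        ('linreg', LinearRegression()),
    ])
    scores = cross_val_score(model, training_features, training_labels,
                             cv=5, scoring='neg_mean_squared_error')
    print(degree, -scores.mean())

The same search could equally be expressed with GridSearchCV, as shown in the next section.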
Hyperparameters and tuning

Parameters and hyperparameters
Parameters:
- Learnable: values are learned by the model during training.
- Dynamic: as learning proceeds, the values of these parameters change.
- Focus on learning: we do not need to specify these values initially as they are optimised by the learning algorithm.
- Example: weights in linear regression.
Hyperparameters:
- User defined: values are set before the learning process begins.
- As learning proceeds, the values of these follow pre-determined rules for changing.
- Need attention: we need to set these values correctly to guide the learning process effectively.
- Example: learning rate.

Parameters and hyperparameters: tuning
How to determine appropriate values?
- Small number of possible options - try them all (grid search): exhaustively search through a manually specified subset of the hyperparameter space.
- Large number of possible options - random search: randomly sample from the hyperparameter space and evaluate the performance. Typically covers fewer options than grid search but is more feasible when the hyperparameter space is large.
- Best hyperparameter selection - validation error: the "best" set of hyperparameters is usually determined by the lowest validation error, not the test error, as using the test error would introduce bias.

Parameters and hyperparameters: tuning
Example: grid search over values of k and weights (evaluated by 5-fold cross-validation).
Note: in the KNeighborsRegressor example, 'weights' is a hyperparameter, not to be confused with the 'feature weights' or parameters θ_1, θ_2, …, θ_n.

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

knn_reg = KNeighborsRegressor()
param_grid = [
    # try 6 (3×2) combinations of hyperparameters
    {'n_neighbors': [3, 5, 10], 'weights': ['uniform', 'distance']},
]
grid_search = GridSearchCV(knn_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
# evaluate KNN model with 6 settings
grid_search.fit(training_features, training_labels)
# what were the best settings?
grid_search.best_params_

Summary
- Enhancing the machine learning pipeline using additional preprocessing steps
  - Removing or replacing (imputing) missing data
  - Scaling data (not always required, but better to be safe)
  - sklearn pipelines for workflows (best practice)
- A closer look at the linear regression model
  - How to use it for polynomial fitting
- Tuning hyperparameters
  - GridSearchCV
Next week: Classification models

Questions? [email protected]