Supervised Learning PDF
Document Details
NUS
Dr Yeo Wee Kiang
Summary
These lecture notes cover supervised learning, focusing on regression and classification techniques. The document details different algorithms in supervised learning, including linear regression and random forests, as well as key concepts and evaluation metrics. The notes are accompanied by visualizations and examples.
Full Transcript
Supervised Learning
Dr Yeo Wee Kiang
IS5126 Hands-on with Applied Analytics

Overview
Regression:
- The various regression algorithms
- Linear Regression analysis: Key Concepts in Linear Regression, Assumptions of linear regression
- Residuals and Gradient Descent in Linear Regression
- Commonly used metrics in regression: R-squared, Adjusted R-squared, RMSE
- Homoscedasticity and heteroscedasticity
- Linear Regression Process: the different stages
Classification:
- Logistic Regression
- Decision Trees, Random Forests
- Receiver Operating Characteristic curve, ROC AUC, and how to plot the ROC curve

Regression Algorithms (Scikit-Learn documentation)
- Ordinary least squares Linear Regression: linear_model.LinearRegression(*[, ...])
- Linear Model trained with L1 prior as regularizer (aka the Lasso): linear_model.Lasso([alpha, fit_intercept, ...])
- Linear least squares with L2 regularization: linear_model.Ridge([alpha, fit_intercept, ...])
- Linear regression with combined L1 and L2 priors as regularizer: linear_model.ElasticNet([alpha, l1_ratio, ...])
- A decision tree regressor: tree.DecisionTreeRegressor(*[, criterion, ...])
- A random forest regressor: ensemble.RandomForestRegressor([...])
- Linear Support Vector Regression: svm.LinearSVR(*[, epsilon, tol, C, loss, ...])

Linear Regression
Linear regression is a fundamental statistical and machine learning technique used for modeling the relationship between a dependent variable (target) and one or more independent variables (predictors or features). It assumes that this relationship is linear, which means it can be represented as a straight line.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

Key Concepts in Linear Regression
- Dependent Variable (y): The dependent variable, often denoted as "y", is the outcome or response variable we want to predict or explain. It is often known as the "target".
- Independent Variables (X): Independent variables, denoted as "X", are the features or predictors used to explain variations in the dependent variable.
- Linear Relationship: Linear regression assumes that the relationship between the dependent variable and the independent variables can be expressed as a linear combination, where the impact of each independent variable is proportional to its weight or coefficient.
- Simple Linear Regression: In simple linear regression, there is only one independent variable used to predict the dependent variable. The relationship is represented by a straight-line equation: y = a + bx
- Multiple Linear Regression: Multiple linear regression extends the concept to multiple independent variables: y = a + b₁X₁ + b₂X₂ + ... + bₙXₙ. It models more complex relationships by considering multiple factors simultaneously.
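To make the simple and multiple regression equations concrete, here is a minimal sketch of fitting an ordinary least squares model with scikit-learn's LinearRegression. The toy data and variable names are illustrative assumptions, not taken from the lecture.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (illustrative): two independent variables X1, X2 and a target y
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 8.1, 10.8])

# Fit y = a + b1*X1 + b2*X2 by ordinary least squares
model = LinearRegression()
model.fit(X, y)

print("intercept (a):", model.intercept_)
print("coefficients (b1, b2):", model.coef_)
print("prediction for X1=6, X2=6:", model.predict([[6.0, 6.0]]))
```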
Linear Regression: Assumptions of linear regression
If any of these assumptions are not met, the results of the linear regression model may be unreliable. There are several assumptions that must be met for linear regression to be valid:
- Linearity: The relationship between the dependent and independent variables must be linear.
- Independence: The observations must be independent of each other. This means that the value of one observation should not be influenced by the value of any other observation.
- Homoscedasticity: The variance of the residuals (the difference between the predicted values and the actual values of the dependent variable) should be constant across all levels of the independent variables.
- Normality: The residuals should be normally distributed.
- Absence of multicollinearity: The independent variables should not be highly correlated with each other.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

Homoscedasticity and heteroscedasticity
[Figure] Plot with random data showing homoscedasticity: at each value of x, the y-values of the dots have about the same variance.
[Figure] Plot with random data showing heteroscedasticity: the variance of the y-values of the dots increases with increasing values of x.
https://en.wikipedia.org/wiki/Homoscedasticity_and_heteroscedasticity

Linear Regression Process
- Data Collection: Gather data on the dependent and independent variables. Ensure data quality and cleanliness.
- Data Exploration: Analyze data through visualization and summary statistics to identify relationships and potential outliers.
- Model Building: Select the appropriate type of linear regression (simple or multiple) and choose independent variables. Estimate the coefficients (a and b values) through techniques like Ordinary Least Squares (OLS).
- Model Evaluation: Evaluate the model's performance using metrics like Mean Squared Error (MSE) and R-squared. Check for the validity of the assumptions.
- Model Interpretation: Interpret the coefficients to understand the impact of each independent variable on the dependent variable.
- Prediction and Inference: Use the model for predictions and make inferences about the relationships between variables.
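A common way to eyeball the homoscedasticity and linearity assumptions is a residuals-versus-fitted plot. The sketch below is one minimal way to do this, assuming a scikit-learn model and matplotlib; the synthetic data is an illustrative assumption, not part of the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Illustrative data: y depends linearly on x with constant-variance noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0, 1.0, size=200)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# If the homoscedasticity assumption holds, the points form a horizontal
# band of roughly constant spread around zero.
plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted values")
plt.show()
```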
Residuals
The errors that our model makes are called residuals. Goal: to choose regression coefficients for the independent variable(s) that minimize these residuals. To compute a residual, we simply subtract the predicted value from the actual value. We use the Sum of Squared Errors (SSE), the sum of all squared residuals, as a metric:
SSE = ε₁² + ε₂² + ... + εₙ²

Gradient Descent in Linear Regression (minimizing the residuals)
- Initial Model: Begin with initial coefficients defining the linear model's predictions.
- Iterative Optimization: Calculate residuals (errors) by comparing actual and predicted values. Update coefficients incrementally based on gradients (directions of steepest decrease) of the loss function. Repeat the process until convergence criteria are met.
- Loss Function: Use a loss function like Mean Squared Error (MSE) to measure the average of squared residuals.
- Convergence: Gradient descent converges to minimize the loss function, resulting in the best-fitting linear regression model with minimized residuals.
Image source: https://easyai.tech/en/ai-definition/gradient-descent/

Commonly used metrics in regression analysis
- R-squared (R²): Commonly used to assess the goodness-of-fit of a regression model. Higher R-squared values (closer to 1) indicate that the model fits the data well and explains a large portion of the variance.
- Adjusted R-squared: Particularly useful when comparing models with different numbers of predictors. It strikes a balance between model complexity and goodness-of-fit. Higher adjusted R-squared values indicate a better model, considering both fit and model complexity.
- Root Mean Square Error (RMSE): RMSE is in the same units as the dependent variable. Lower RMSE values indicate smaller prediction errors and a better-fitting model.

R-squared, Adjusted R-squared
The R² metric represents the proportion of variance in the dependent variable explained by the independent variable(s):
R² = 1 − SSE / (Total sum of squares)
The value of R² can range between 0 and 1, and the higher its value, the more accurate the regression model is. The aim is to get as close as possible to 1.
R-squared tends to increase as you add more independent variables to the model, even if those variables do not improve the model's predictive power. This is a limitation because it may lead to overfitting.
Adjusted R-squared addresses the issue of overfitting by penalizing the inclusion of unnecessary predictors in the model. As you add more predictors to the model, the adjusted R-squared will increase only if those predictors significantly improve the model's performance. If they don't, the adjustment will penalize the model for their inclusion.

Root Mean Square Error (RMSE)
RMSE is the square root of the SSE divided by the total number of data points N:
RMSE = √(SSE / N)
where SSE = ε₁² + ε₂² + ... + εₙ² is the Sum of Squared Errors.
RMSE tends to be used more often to check the quality of a linear regression model since its units are the same as those of the dependent variable and it is normalized by the value of N.
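As a companion to the gradient-descent slide above, here is a minimal sketch of batch gradient descent for simple linear regression with an MSE loss. The toy data, learning rate, and iteration count are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Illustrative data: y is roughly 4 + 3x plus noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 2, 100)
y = 4.0 + 3.0 * x + rng.normal(0, 0.5, 100)

a, b = 0.0, 0.0   # initial intercept and slope (the "initial model")
lr = 0.1          # learning rate (step size), assumed for illustration

for _ in range(2000):
    y_pred = a + b * x
    error = y_pred - y                 # prediction errors
    grad_a = 2 * error.mean()          # gradient of MSE w.r.t. the intercept
    grad_b = 2 * (error * x).mean()    # gradient of MSE w.r.t. the slope
    a -= lr * grad_a                   # step in the direction of steepest decrease
    b -= lr * grad_b

print("learned intercept a:", round(a, 3))
print("learned slope b:", round(b, 3))
```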
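Putting the formulas above together, this sketch computes SSE, RMSE, R², and adjusted R² directly from actual and predicted values. The small arrays are illustrative, and the adjusted R² expression used here (penalizing the number of predictors p) follows the standard textbook definition rather than anything spelled out in the slides.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])   # actual values (illustrative)
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6, 13.2])   # model predictions (illustrative)
n = len(y_true)   # number of data points
p = 1             # number of predictors used by the model (assumed)

residuals = y_true - y_pred
sse = np.sum(residuals ** 2)                  # SSE = sum of squared residuals
rmse = np.sqrt(sse / n)                       # RMSE = sqrt(SSE / N)
sst = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - sse / sst                            # R-squared
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1) # adjusted R-squared (standard formula)

print(f"SSE={sse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}  adjusted R2={adj_r2:.3f}")
```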
Linear Regression Challenges
- Overfitting vs. Underfitting: Finding the right balance between model complexity and performance to avoid overfitting (high variance) or underfitting (high bias).
- Multicollinearity: Addressing issues when independent variables are highly correlated, which can affect the interpretation of coefficients.
- Assumption Violations: Handling cases where the assumptions of linearity, normality, and homoscedasticity are not met.
- Outliers: Dealing with outliers that can distort the model's performance and interpretation.
- Feature Engineering: Carefully selecting and transforming independent variables to improve model accuracy.

Classification Algorithms (Scikit-Learn documentation)
- Logistic Regression classifier: linear_model.LogisticRegression([penalty, ...])
- A decision tree classifier: tree.DecisionTreeClassifier(*[, criterion, ...])
- A random forest classifier: ensemble.RandomForestClassifier([...])
- Linear Support Vector Classification: svm.LinearSVC([penalty, loss, dual, tol, C, ...])
- Classifier implementing the k-nearest neighbours vote: neighbors.KNeighborsClassifier([...])
- One-vs-the-rest (OvR) multiclass strategy: multiclass.OneVsRestClassifier(estimator, *)
- One-vs-one multiclass strategy: multiclass.OneVsOneClassifier(estimator, *)
- Label Propagation classifier: semi_supervised.LabelPropagation([kernel, ...])

Logistic Regression
While the name "logistic regression" suggests a connection to regression, its primary use is in classification problems. In logistic regression, the mathematical framework is adjusted to model the probability of a binary outcome (e.g., 0 or 1, Yes or No) rather than predicting a continuous value. Logistic regression achieves this by applying the logistic or sigmoid function to the linear combination of input features. This transformation maps the output to a probability between 0 and 1, making it suitable for classification.
In logistic regression models, we also have a dependent variable y and a set of independent variables x₁, x₂, ..., xₖ. In logistic regression, however, we want to learn a function that provides the probability that y = 1 given a set of independent variables:
P(y = 1) = 1 / (1 + e^(−(β₀ + β₁x₁ + ... + βₖxₖ)))
The above function is called the logistic function; it provides a number between 0 and 1, representing the probability that the outcome (dependent variable) is true. Our goal, when developing logistic regression models, is to choose coefficients that predict a high probability when y = 1 and a low probability when y = 0.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Decision Tree
Also known as recursive partitioning. Decision trees can be used for binary and multi-class classification and for regression, using both continuous and categorical features. The decision tree is a greedy algorithm that divides the feature space into binary partitions in a recursive manner. In terms of classification, decision trees use various impurity measures to make decisions about how to split the data at each node. One common impurity measure is the Gini index (or Gini impurity).
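Relating back to the Logistic Regression slide, the sketch below fits scikit-learn's LogisticRegression on toy data and shows that predict_proba matches the logistic function applied to the learned linear combination. The data values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary-classification data (illustrative): one feature, labels 0/1
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns [P(y=0), P(y=1)] for each row
proba_y1 = clf.predict_proba([[2.2]])[0, 1]
print("P(y=1 | x=2.2) =", round(proba_y1, 3))

# The same probability computed directly from the logistic function
b0, b1 = clf.intercept_[0], clf.coef_[0, 0]
manual = 1.0 / (1.0 + np.exp(-(b0 + b1 * 2.2)))
print("logistic function value:", round(manual, 3))
```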
Decision Tree: Gini impurity
The goal is to create a tree structure that minimizes the Gini impurity at each internal node. Gini impurity is a measure of impurity or disorder in a dataset. In the context of decision trees for classification, Gini impurity quantifies how often a randomly chosen element in the dataset would be incorrectly classified. It ranges from 0 to 1:
- 0 indicates perfect purity (all elements belong to the same class),
- 1 indicates maximum impurity (elements are evenly distributed across classes).
At each internal node of the tree, a decision is made to split the data into two or more branches, corresponding to different values of a selected feature. Gini impurity is often used as a criterion to measure the impurity of the subsets created by potential splits. The split that results in the lowest Gini impurity is chosen as the best split at that node. The terminal nodes of the decision tree, known as leaf nodes, represent the final predictions.

Random Forest
Random forests train an ensemble of decision trees independently, allowing for concurrent training. Because the method introduces randomization into the training process, each decision tree is unique. Random Forests can be less prone to overfitting: training more trees in a Random Forest reduces the likelihood of overfitting. The variance of the predictions is reduced when the predictions from each tree are combined, which improves the performance on test data.

What's so "random" about Random Forest?
- During training, subsampling is conducted on the original dataset on each iteration to get a different training set (also known as bootstrapping).
- Different random subsets of features are considered when splitting at each tree node.
How does Random Forest make predictions? A random forest must aggregate the predictions from its collection of decision trees to make a prediction on a new instance. This aggregation is done differently for classification and regression:
- Classification: Majority vote. The prediction of each tree is counted as a vote for one class. The label will be the class with the highest votes.
- Regression: Averaging. Each tree predicts a real value. The average of the tree predictions is output as the label.

Metrics: Model Validation or Evaluation for Supervised Learning
Metrics for evaluating Regression models:
- Mean squared error (MSE)
- Root mean squared error (RMSE)
- Mean absolute error (MAE)
- R-squared
Metrics for evaluating Classification models:
- Accuracy
- Precision
- Recall
- F1 score
- Area under the ROC curve (ROC AUC)

Receiver Operating Characteristic curve
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. ROC curves are particularly helpful for understanding the trade-off between the True Positive Rate (sensitivity) and the False Positive Rate (1 − specificity) across different classification thresholds. It is important to note that the TPR and FPR move together as the threshold changes: if we decrease the threshold value, both the TPR and the FPR will increase; if we increase the threshold value, both will decrease. A higher TPR therefore comes at the cost of a higher FPR.
The ROC curve is a staircase curve, but it is often plotted as a smooth line to make it easier to interpret.
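Looking back at the Gini impurity slide, this small sketch computes the Gini impurity of a set of class labels and of a candidate split. The labels and the weighted-average formulation of the split impurity are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; a 50/50 binary node has impurity 0.5
print(gini([1, 1, 1, 1]))        # 0.0
print(gini([0, 0, 1, 1]))        # 0.5

# Impurity of a candidate split = weighted average of the child impurities
left, right = [0, 0, 0, 1], [1, 1, 1, 0]
n = len(left) + len(right)
split_impurity = len(left) / n * gini(left) + len(right) / n * gini(right)
print(round(split_impurity, 3))  # 0.375
```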
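Tying the Random Forest and metrics slides together, the sketch below trains a RandomForestClassifier on synthetic data and reports the classification metrics listed above. The use of make_classification, the 70/30 split, and n_estimators=100 are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Synthetic binary-classification data (illustrative)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Each tree is trained on a bootstrap sample and considers a random subset
# of features at each split; class predictions are aggregated by majority vote.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)               # majority-vote class labels
y_proba = clf.predict_proba(X_test)[:, 1]  # P(y=1), used for ROC AUC

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_proba))
```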
Receiver Operating Characteristic curve: plotting TPR against FPR
The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity or recall. The false-positive rate is also known as the probability of false alarm and can be calculated as (1 − specificity).
The ROC curve is a staircase curve, but it is often plotted as a smooth line to make it easier to interpret.
https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning

Step-by-step: Plotting the ROC curve
The ROC curve is typically created by following these steps, in sequence:
1. Calculate the TPR and FPR for different threshold values. This can be done by sorting the predicted probabilities from lowest to highest and then calculating the TPR and FPR for each threshold value.
2. Plot the TPR and FPR for each threshold value. The ROC curve is a plot of the TPR on the y-axis and the FPR on the x-axis.
3. Connect the points with a line. The ROC curve is a staircase curve, but it is often plotted as a smooth line to make it easier to interpret.

Step-by-step: Plotting the ROC curve (what is the "threshold"?)
The threshold in an ROC curve is a value between 0 and 1 that is used to classify observations as positive or negative. For example, if we set the threshold to 0.6, then any observation with a predicted probability of 0.6 or higher will be classified as positive. If the predicted probability is below the threshold, the observation will be classified as negative. The predicted probability is the probability that a machine learning model predicts for a given observation; it is a value between 0 and 1.

Predicted Probability | Threshold | Classification
0.50 | 0.60 | Negative
0.65 | 0.60 | Positive
0.30 | 0.60 | Negative
0.80 | 0.60 | Positive
0.60 | 0.60 | Positive

The threshold value can be used to control the trade-off between false positives and false negatives:
- A lower threshold value will result in more false positives, but it will also result in fewer false negatives.
- A higher threshold value will result in fewer false positives, but it will also result in more false negatives.
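To make the threshold idea concrete, here is a minimal sketch that applies the 0.6 threshold to predicted probabilities, mirroring the small table above. The probabilities come from that table; the use of NumPy is an assumption.

```python
import numpy as np

# Predicted probabilities from the table above, classified at threshold 0.6
proba = np.array([0.50, 0.65, 0.30, 0.80, 0.60])
threshold = 0.60

# A probability of 0.6 or higher is classified as Positive, otherwise Negative
labels = np.where(proba >= threshold, "Positive", "Negative")
for p, lab in zip(proba, labels):
    print(f"{p:.2f} -> {lab}")

# Lowering the threshold produces more Positive predictions
# (more false positives, fewer false negatives), and vice versa.
print(np.where(proba >= 0.4, "Positive", "Negative"))
```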
Step-by-step: Plotting the ROC curve
Example: Suppose we have a binary classification model that predicts whether a patient has a disease (positive) or not (negative). If we set the threshold to 0.6, we have the following counts, together with the consequence of changing the threshold:
- True Positives (TP), count 80: Increasing the threshold will decrease the number of true positives, but it will also decrease the number of false positives. Decreasing the threshold will increase the number of true positives, but it will also increase the number of false positives.
- False Positives (FP), count 20: Increasing the threshold will decrease the number of false positives, but it will also decrease the number of true positives. Decreasing the threshold will increase the number of false positives, but it will also increase the number of true positives.
- True Negatives (TN), count 50: Increasing the threshold will increase the number of true negatives, but it will also increase the number of false negatives. Decreasing the threshold will decrease the number of true negatives, but it will also decrease the number of false negatives.
- False Negatives (FN), count 10: Increasing the threshold will increase the number of false negatives, but it will also increase the number of true negatives. Decreasing the threshold will decrease the number of false negatives, but it will also decrease the number of true negatives.

Step-by-step: Plotting the ROC curve
To plot the ROC curve for this model, we would first calculate the TPR and FPR at different threshold values. Based on the threshold of 0.6, the TPR would be 80/90 ≈ 0.89 and the FPR would be 20/70 ≈ 0.29. We repeat this process for different threshold values to create a table of TPR and FPR values. After creating the table of TPR and FPR values, we can follow these steps:
- Plot the TPR and FPR values on a scatter plot. The TPR should be plotted on the y-axis and the FPR should be plotted on the x-axis.
- Connect the points with a line. The ROC curve is a staircase curve, but it is often plotted as a smooth line to make it easier to interpret.
https://upload.wikimedia.org/wikipedia/commons/thumb/1/13/Roc_curve.svg/768px-Roc_curve.svg.png

Receiver Operating Characteristic curve: Area Under the Curve (ROC AUC)
The area under the ROC curve (ROC AUC) is a single metric that summarizes the overall performance of a classification model. A higher ROC AUC indicates a better model, with values ranging from 0.5 (random guessing) to 1 (perfect classification).
https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning
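Following the step-by-step procedure above, this sketch computes the TPR/FPR pairs across thresholds with scikit-learn's roc_curve, plots the resulting (staircase) curve, and reports its AUC. The toy labels and scores are illustrative assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Illustrative true labels and predicted probabilities for the positive class
y_true  = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7, 0.65, 0.3]

# roc_curve sweeps the threshold and returns the FPR and TPR at each setting
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, drawstyle="steps-post", label=f"ROC curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guessing (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```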