Machine Learning System Design Performance Measurement Lecture
Summary
This lecture provides an overview of machine learning system design, focusing on performance measurement metrics. It covers key concepts in regression and classification, along with the challenges of imbalanced datasets. Specific examples are also included.
Full Transcript
Machine learning system design: Performance measurement

Regression
🡪 When evaluating the performance of regression models, there are several key metrics that help determine how well the model is performing.
🡪 These metrics can be divided into two main categories: error metrics and goodness-of-fit metrics.
🡪 Next, an overview of the most commonly used performance measures in regression.

(Error Metrics)
1- Mean Absolute Error (MAE):
Definition: The average of the absolute differences between predicted and actual values.
Formula: MAE = (1/n) Σ |y_i − ŷ_i|
A lower MAE indicates a better fit. It is easy to understand and less sensitive to outliers than squared-error metrics.
2- Mean Squared Error (MSE):
Definition: The average of the squared differences between predicted and actual values.
Formula: MSE = (1/n) Σ (y_i − ŷ_i)²
A lower MSE indicates a better fit. MSE is more sensitive to outliers than MAE due to the squaring of errors.
3- Mean Absolute Percentage Error (MAPE):
Definition: The average of the absolute percentage differences between predicted and actual values.
Formula: MAPE = (100%/n) Σ |(y_i − ŷ_i) / y_i|
A lower MAPE indicates a better fit. MAPE expresses the error as a percentage, which can be more interpretable, but it can be problematic if the actual values are close to zero.

Goodness-of-Fit Metrics
1- R-squared (R²):
Definition: The proportion of the variance in the dependent variable that is predictable from the independent variables.
Formula: R² = 1 − SS_res/SS_tot, where SS_res = Σ (y_i − ŷ_i)² and SS_tot = Σ (y_i − ȳ)²
R² ranges from 0 to 1. An R² of 1 indicates that the model perfectly explains the variance, while an R² of 0 indicates that the model explains none of the variance. However, R² can sometimes be misleading for nonlinear models.
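To make the formulas above concrete, here is a minimal sketch, assuming NumPy is available and using a small invented pair of arrays y_true and y_pred (not from the lecture), that computes each metric directly from its definition:

```python
import numpy as np

# Hypothetical actual and predicted values, purely for illustration.
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.4, 2.9, 6.1, 4.3])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                  # Mean Absolute Error
mse = np.mean(errors ** 2)                     # Mean Squared Error
mape = 100 * np.mean(np.abs(errors / y_true))  # MAPE (breaks down if any y_true is near zero)

ss_res = np.sum(errors ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                          # R-squared

print(f"MAE={mae:.3f}  MSE={mse:.3f}  MAPE={mape:.1f}%  R^2={r2:.3f}")
```

Where scikit-learn is preferred, its mean_absolute_error, mean_squared_error and r2_score functions return the same quantities.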
Classification
Classification accuracy is the number of correct predictions made divided by the total number of predictions made, multiplied by 100 to turn it into a percentage.
Classification accuracy alone is typically not enough information to decide whether your classifier is a good enough model to solve your problem. It is sometimes difficult to tell whether a reduction in error is actually an improvement of the algorithm.

Example (a cancer diagnosis)
Consider a binary classification problem where 95% of the instances belong to class 0 (negative) and 5% belong to class 1 (positive).
Class 0 (negative): 950 instances
Class 1 (positive): 50 instances
Model A: Predicts every instance as class 0.
Accuracy: (950 + 0)/(950 + 50) = 95%
Model B: Correctly identifies 45 of the 50 positive instances and 900 of the 950 negative instances.
Accuracy: (900 + 45)/(950 + 50) = 94.5%
🡪 Although Model A has higher accuracy, it fails to identify any positive instances, making it practically useless for identifying the minority class.

Accuracy Paradox
🡪 The accuracy paradox occurs when a machine learning model with high accuracy does not necessarily provide valuable or meaningful predictions.
🡪 This often happens in the context of imbalanced datasets, where one class is much more frequent than others.
🡪 In an imbalanced dataset, one class (often the negative class) dominates the dataset, while the positive class is rare.
🡪 For example, in a medical diagnosis dataset, healthy patients might be much more common than patients with a rare disease.
🡪 High accuracy but poor performance: If a model predicts the majority class for all instances, it can achieve high accuracy because it correctly predicts the frequent class most of the time. However, this model would perform poorly on the minority class, which might be the class of interest (e.g., detecting a rare disease).

🡪 To better evaluate models on imbalanced datasets, other metrics besides accuracy are used (error metrics for skewed classes):
🡪 Confusion matrix
🡪 Precision
🡪 Recall
🡪 F1 score
🡪 ROC-AUC (Receiver Operating Characteristic - Area Under Curve)

Confusion Matrix
A table showing the performance of the classification model, including true positives, true negatives, false positives, and false negatives. It helps visualize the model's performance on each class. A false positive is a false alarm; a false negative is a misdetection.

Precision
Measures the proportion of true positive predictions among all positive predictions. (Of all patients where we predicted y = 1, what fraction actually has cancer?)
Precision = true positives / no. of predicted positives = TP / (TP + FP)
It is also called the Positive Predictive Value (PPV).

Recall (Sensitivity or True Positive Rate)
Measures the proportion of true positive predictions among all actual positives. (Of all patients that actually have cancer, what fraction did we correctly detect as having cancer?)
Recall = true positives / no. of actual positives = TP / (TP + FN)

Example: Multi-class classification (slide figure).

Trading off precision and recall
Logistic regression: predict 1 if hθ(x) ≥ threshold, predict 0 if hθ(x) < threshold (commonly 0.5 by default).
🡪 Suppose we want to predict y = 1 (cancer) only if very confident: increase the threshold value, e.g., predict 1 only if hθ(x) ≥ 0.7. Result: higher precision, lower recall.
🡪 Suppose we want to avoid missing too many cases of cancer (avoid false negatives): decrease the threshold value, e.g., predict 1 if hθ(x) ≥ 0.3. Result: higher recall, but lower precision.
The greater the threshold, the greater the precision and the lower the recall (be confident).
The lower the threshold, the greater the recall and the lower the precision (safe prediction).
In the cancer example, if we classify all patients as y = 0, then recall = 0, so despite having a lower error percentage, we can quickly see that such a classifier has worse recall.
A small sketch of this trade-off appears after the comparison below.

Compare precision/recall
How do we compare precision/recall numbers? Apply the learning algorithm with different thresholds:
Algorithm 1: Precision (P) = 0.5, Recall (R) = 0.4
Algorithm 2: Precision (P) = 0.7, Recall (R) = 0.1
Algorithm 3: Precision (P) = 0.02, Recall (R) = 1.0

Averaging precision and recall, Average = (P + R)/2:
Algorithm 1: P = 0.5, R = 0.4, Average = 0.45
Algorithm 2: P = 0.7, R = 0.1, Average = 0.4
Algorithm 3: P = 0.02, R = 1.0, Average = 0.51
This does not work well: an algorithm that predicts nearly all examples as y = 1 gets a very high recall that brings the average up despite having near-zero precision (like Algorithm 3), and an algorithm heavily biased toward y = 0 can have a high precision that props up the average despite having very low recall (like Algorithm 2).
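As a rough illustration of the precision/recall trade-off described above, the sketch below (labels and predicted probabilities are invented toy data, not from the lecture) counts true and false positives at several thresholds; on this data, raising the threshold increases precision and decreases recall:

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, with 1 as the positive class."""
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives (false alarms)
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives (misdetections)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Hypothetical true labels and predicted probabilities from some classifier.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 0])
probs  = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.55, 0.9, 0.3, 0.45])

# Predict 1 if the probability is at least the threshold, 0 otherwise.
for threshold in (0.3, 0.5, 0.7):
    y_pred = (probs >= threshold).astype(int)
    p, r = precision_recall(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```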
F1 Score (F score)
The F1 score conveys the balance between precision and recall.
F1 Score = 2PR / (P + R)
Algorithm 1: P = 0.5, R = 0.4, F1 = 0.444
Algorithm 2: P = 0.7, R = 0.1, F1 = 0.175
Algorithm 3: P = 0.02, R = 1.0, F1 = 0.0392
For the F score to be large, both precision and recall must be large.
We evaluate precision and recall (and choose the threshold) on the validation set, so as not to bias our test set.

Example
Confusion matrix counts: TP = 10, FP = 13, FN = 75, TN = 188.
Precision = 10/(10 + 13) = 0.43
Recall = 10/(10 + 75) = 0.12
Accuracy = (10 + 188)/(10 + 13 + 75 + 188) = 0.7
F1 score = 0.19

ROC-AUC (Receiver Operating Characteristic - Area Under Curve)
ROC-AUC is a performance measurement for classification problems at various threshold settings. It is especially useful for evaluating binary classification models and is less sensitive to class imbalance than accuracy.
ROC Curve: The ROC curve is a graphical representation of a classifier's performance. It plots the True Positive Rate (TPR, also known as Recall or Sensitivity) against the False Positive Rate (FPR) at different threshold values.
AUC (Area Under the Curve): The AUC is a single scalar value that summarizes the performance of the classifier across all threshold values. AUC ranges from 0 to 1:
○ AUC = 0.5: The model performs no better than random guessing.
○ AUC = 1.0: The model perfectly separates the positive and negative classes.
○ 0.5 < AUC < 1.0: The model performs better than random but is not perfect.
○ AUC < 0.5: The model performs worse than random guessing, which often indicates an issue with the model or the data.
Higher AUC indicates a better-performing model. For instance, an AUC of 0.90 means the model has a 90% chance of ranking a randomly chosen positive instance above a randomly chosen negative one. Lower AUC indicates poorer performance, approaching random guessing at 0.5.
Use numerical integration to find the area under the ROC curve.
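The sketch below reproduces the F1 computation for the three algorithms in the comparison table and approximates ROC-AUC by sweeping thresholds and applying the trapezoidal rule, in the spirit of the numerical-integration remark above; the labels and scores used for the AUC part are invented toy data:

```python
import numpy as np

def f1_score(p, r):
    """F1 score: harmonic mean of precision and recall."""
    return 0.0 if (p + r) == 0 else 2 * p * r / (p + r)

# Precision/recall pairs from the comparison table above.
for name, p, r in [("Algorithm 1", 0.5, 0.4),
                   ("Algorithm 2", 0.7, 0.1),
                   ("Algorithm 3", 0.02, 1.0)]:
    print(f"{name}: F1 = {f1_score(p, r):.3f}")

def roc_auc(y_true, scores):
    """Sweep thresholds from high to low, collect (FPR, TPR) points,
    and integrate the ROC curve with the trapezoidal rule."""
    pos = np.sum(y_true == 1)
    neg = np.sum(y_true == 0)
    fpr, tpr = [0.0], [0.0]
    for t in np.sort(np.unique(scores))[::-1]:
        y_pred = scores >= t
        tpr.append(np.sum(y_pred & (y_true == 1)) / pos)
        fpr.append(np.sum(y_pred & (y_true == 0)) / neg)
    auc = 0.0
    for i in range(1, len(fpr)):
        auc += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return auc

# Hypothetical labels and scores, just to exercise the function.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.55, 0.9, 0.3, 0.45])
print(f"ROC-AUC = {roc_auc(y_true, scores):.3f}")
```

Where scikit-learn is available, its roc_auc_score function computes the same area directly.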
Overfitting vs Underfitting
Overfitting and underfitting are common problems in machine learning and statistics related to the performance of a predictive model.

Overfitting
If we have too many features, the learned hypothesis may fit the training set very well. Overfitting occurs when a model learns not only the underlying pattern in the training data but also the noise and outliers. This results in a model that performs very well on training data but poorly on unseen data (test data); the model does not generalize.
Overfitting characteristics:
High accuracy on training data.
Low accuracy on validation/test data.
The model is too complex (e.g., too many parameters relative to the number of observations).
Low bias and high variance. When a model has low bias and high variance, it means that the model is very flexible and captures the training data's complexity well (low bias), but it does not generalize well to unseen data due to overfitting (high variance).

Underfitting
Underfitting occurs when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features. For example, if we take hθ(x) = θ0 + θ1x1 + θ2x2, we are making an initial assumption that a linear model will fit the training data well and will be able to generalize, but that may not be the case. This results in poor performance on both training and test data.
Underfitting characteristics:
Low accuracy on both training and validation/test data.
The model is too simple (e.g., too few parameters relative to the complexity of the data).
High bias and low variance. When a model has high bias and low variance, it means that the model is too simple to capture the underlying structure of the data (high bias), but its predictions are stable across different training datasets (low variance).
Underfitting example: A linear regression model trying to fit a highly nonlinear relationship in the data will have high bias, as the linear model is too simple to capture the complex patterns.

Visual Representation
Underfitting: The model (e.g., a straight line) does not capture the underlying pattern of the data points.
Good fit: The model captures the underlying pattern without fitting the noise, resulting in good performance on unseen data.
Overfitting: The model (e.g., a very wiggly line) fits the training data perfectly, including noise, but fails to generalize to new data.

Example: Linear regression (housing prices)
[Slide figures: three plots of price versus house size, showing an underfit model, a "just right" fit, and an overfit model.]
Example: Logistic regression
[Slide figures: three decision boundaries in the x1-x2 plane, from underfit to just right to overfit; the hypothesis uses the sigmoid function.]

Addressing overfitting (housing prices)
With many features, such as size of house, no. of bedrooms, no. of floors, age of house, average income in neighborhood, and kitchen size, the model can overfit. To avoid overfitting, simplify the model (e.g., reduce the number of features).

Overfitting Solution
Simplify the model (e.g., reduce the number of features or parameters):
🡪 Manually select which features to keep.
🡪 Use a model selection algorithm, for example cross-validation, to ensure the model generalizes.
Use regularization techniques:
🡪 Keep all the features, but reduce the magnitude/values of the parameters.
🡪 This works well when we have a lot of features, each of which contributes a bit to predicting y.

Underfitting Solution
Increase the complexity of the model (e.g., add more features or parameters).
Use more sophisticated models (e.g., switching from linear regression to polynomial regression).
Ensure the data is adequately preprocessed and relevant features are selected.
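Finally, a small sketch of the underfitting/overfitting behaviour described above, using invented synthetic data: a degree-1 polynomial underfits the nonlinear trend, a moderate degree captures it, and a degree high enough to pass through every training point typically drives the training error toward zero while the test error gets worse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a nonlinear trend (sine) plus noise, split into train and test halves.
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
x_train, y_train = x[::2], y[::2]     # 10 training points
x_test,  y_test  = x[1::2], y[1::2]   # 10 held-out points in between

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# degree 1: too simple (underfit, high bias)
# degree 5: roughly matches the sine trend
# degree 9: enough parameters to pass through all 10 training points (overfit, high variance)
for degree in (1, 5, 9):
    coeffs = np.polyfit(x_train, y_train, degree)          # least-squares polynomial fit
    train_err = mse(y_train, np.polyval(coeffs, x_train))
    test_err  = mse(y_test,  np.polyval(coeffs, x_test))
    print(f"degree={degree}  train MSE={train_err:.4f}  test MSE={test_err:.4f}")
```

The exact numbers depend on the random noise, but the qualitative pattern, training error falling as model complexity grows while test error eventually rises, is the point of the comparison.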