Lecture Note: Chapter 4
Summary
This document provides lecture notes on machine learning model evaluation, improvement and deployment. It explains how to evaluate models effectively and how to identify crucial performance metrics.
Full Transcript
CHAPTER 4: MODEL EVALUATION, IMPROVEMENT & DEPLOYMENT
Dr Nor Azuana Ramli

CONTENTS
✓ Evaluation Metrics and Scoring
✓ Hyperparameter Tuning
✓ Model Deployment

Course Outcomes
By the end of this chapter, you should be able to:
✓ understand the need for model evaluation in machine learning problems.
✓ apply performance scores and metrics in order to evaluate a machine learning model.
✓ improve the model by using GridSearch.
✓ deploy your model after obtaining the best model for your case study.

Model Evaluation
The idea of building machine learning models works on a constructive feedback principle: you build a model, get feedback from metrics, make improvements and continue until you achieve a desirable accuracy. Evaluation metrics explain the performance of a model, and an important aspect of evaluation metrics is their capability to discriminate among model results. Two families of evaluation metrics are used here: regression metrics and classification metrics.

Regression Metrics
✓ Evaluation metrics for regression models are quite different from those for classification models, because we are now predicting in a continuous range instead of a discrete number of classes.
✓ If your regression model predicts the price of a house to be RM400K and it sells for RM405K, that's a pretty good prediction. In the classification examples, however, we were only concerned with whether a prediction was correct or incorrect; there was no ability to say a prediction was "pretty good".
✓ Thus, here we cover R-squared and some error-term evaluation metrics for regression models.

R-Squared
✓ To evaluate the overall fit of a linear regression model, we use the R-squared value.
✓ R-squared is the proportion of variance explained.
✓ It is the proportion of variance in the observed data that is explained by the model, or the reduction in error over the null model.
✓ The null model just predicts the mean of the observed response, and thus has an intercept and no slope.
✓ R-squared is between 0 and 1.
✓ Higher values are better because they mean that more variance is explained by the model.
✓ R-squared formula:
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}

Confusion Matrix
A confusion matrix tabulates the model's predictions against the actual classes, giving the four counts TP (true positive), TN (true negative), FP (false positive) and FN (false negative).

Example of a confusion matrix:
                     Predicted Positive    Predicted Negative
  Actual Positive    TP                    FN
  Actual Negative    FP                    TN

Performance Measure / Score
✓ Accuracy: the number of correct predictions made by the model divided by the total number of predictions.
✓ Recall: the ability of a model to find all the relevant cases within a dataset.
✓ Precision: the ability of a classification model to identify only the relevant data points.
✓ F1-score: in cases where we want to find an optimal blend of precision and recall, we can combine the two metrics using the F1-score.

List of Formulae:
  Accuracy  = (TP + TN) / (TP + TN + FP + FN)
  Recall    = TP / (TP + FN)
  Precision = TP / (TP + FP)
  F1-score  = 2 × (Precision × Recall) / (Precision + Recall)

Limitations of Accuracy as a Standalone Metric
When it comes to measuring a model's performance, people tend to focus on accuracy. However, relying heavily on the accuracy metric can lead to incorrect decisions, since it has some limitations:
✓ Working with imbalanced data: no data ever comes perfect, and the accuracy metric should be judged on its predictive power. For example, working with a dataset where one class outweighs another will cause the model to achieve a high accuracy rate simply by predicting the majority class.
✓ Error types: understanding your model's performance in a specific context will help you fine-tune and improve it. For example, differentiating between the types of errors through a confusion matrix, such as FP and FN, allows you to explore the model's limitations.
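As a minimal sketch (not taken from the original slides), the metrics above can be computed with scikit-learn; the label vectors below are made up purely for illustration.

```python
# Minimal sketch: confusion matrix and classification scores with scikit-learn.
# The y_true / y_pred vectors are made-up values used only for illustration.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions

print(confusion_matrix(y_true, y_pred))    # rows = actual, columns = predicted
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```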
Step-by-step Manual Calculation (Confusion Matrix)
1. Define the outcomes: identify the two possible outcomes of your task, Positive or Negative.
2. Collect all the model's predictions, including how many times the model predicted each class and its occurrence.
3. Classify the outcomes into the four categories: TP, TN, FP & FN.
4. Present them in a matrix table.

Exercise: Final Exam S1 2021/2022
Exercise: Final Exam S1 2022/2023

Precision vs Recall
✓ Precision measures the accuracy of positive predictions. It answers the question 'when the model predicted TRUE, how often was it right?'. Precision is particularly important when the cost of a false positive is high.
✓ Recall (or sensitivity) measures the number of actual positives correctly identified by the model. It answers the question 'when the class was actually TRUE, how often did the classifier get it right?'. Recall is important when missing a positive instance (FN) is significantly worse than incorrectly labelling negative instances as positive.
✓ Precision use: false positives can have serious consequences. For example, a classification model used in the finance sector may wrongfully identify a transaction as fraudulent. In scenarios such as this, the precision metric is important.
✓ Recall use: identifying all positive cases can be imperative. For example, a classification model used in the medical field that fails to diagnose correctly can be detrimental. In scenarios in which correctly identifying all positive cases is essential, the recall metric is important.

Receiver Operating Characteristic (ROC)
A Receiver Operating Characteristic (ROC) curve is a graphical plot used to show the diagnostic ability of binary classifiers. It was first used in signal detection theory but is now used in many other areas such as medicine, radiology, natural hazards and machine learning.

Interpreting the ROC Curve
The ROC curve shows the trade-off between sensitivity (TPR) and specificity (1 − FPR). Classifiers that give curves closer to the top-left corner indicate better performance. As a baseline, a random classifier is expected to give points lying along the diagonal (FPR = TPR); the closer the curve comes to this 45-degree diagonal of the ROC space, the less accurate the test.
Note that the ROC does not depend on the class distribution. This makes it useful for evaluating classifiers predicting rare events such as diseases or disasters. In contrast, evaluating performance using accuracy would favour classifiers that always predict a negative outcome for rare events.

Area Under the ROC Curve (AUC)
The area under a receiver operating characteristic (ROC) curve, abbreviated as AUC, is a single scalar value that measures the overall performance of a binary classifier (Hanley and McNeil 1982). The AUC value lies within the range [0.5, 1.0], where the minimum value represents the performance of a random classifier and the maximum value corresponds to a perfect classifier (e.g., with a classification error rate of zero). The AUC is a robust overall measure for evaluating the performance of score classifiers because its calculation relies on the complete ROC curve and thus involves all possible classification thresholds. The AUC is typically calculated by adding successive trapezoid areas below the ROC curve.

Classification of the Accuracy of the AUC
  0.90 – 1.00   Excellent
  0.80 – 0.90   Good
  0.70 – 0.80   Fair
  0.60 – 0.70   Poor
  0.50 – 0.60   Fail
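As a minimal sketch of these ideas (not from the original slides), the ROC curve and AUC can be obtained with scikit-learn; the synthetic dataset and logistic regression model below are assumptions chosen only for illustration.

```python
# Minimal sketch: ROC curve and AUC for a binary classifier with scikit-learn.
# The synthetic data and the choice of model are illustrative assumptions.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, y_score)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_test, y_score)

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], "--", label="random classifier")  # diagonal baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```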
Error Metrics
MEAN ABSOLUTE ERROR (MAE)
✓ This is the mean of the absolute value of the errors.
✓ Easy to understand.
MAE = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert

MEAN SQUARED ERROR (MSE)
✓ This is the mean of the squared errors.
✓ Larger errors are penalized more than with MAE, making MSE more popular.
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

ROOT MEAN SQUARE ERROR (RMSE)
✓ This is the square root of the mean of the squared errors.
✓ Most popular (has the same units as y).
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}

Model Improvement
There are several ways to improve your model, such as combining classifiers (ensemble learning), cross-validation, hyperparameter tuning and data pre-processing. In classification, if the target is not balanced, we can perform an imbalanced-dataset analysis. This section also covers how to tune your hyperparameters using grid search instead of doing it manually.

Grid-Search
Grid search is used to find the optimal hyperparameters of a model, which results in the most 'accurate' predictions. The performance of a model depends significantly on the values of its hyperparameters. Ideally, we would try all possible values to find the optimal ones, but doing this manually could take a considerable amount of time and resources; thus we use GridSearchCV to automate the tuning of hyperparameters. GridSearchCV comes in Scikit-learn's model_selection package. It loops through predefined hyperparameters and fits your model on your training set, so in the end we can select the best parameters from the listed hyperparameters. Another method is Randomized Search: in contrast to Grid Search, not all given parameter values are tried out in Randomized Search.
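A minimal sketch of hyperparameter tuning with GridSearchCV follows; the dataset, estimator and parameter grid are assumptions chosen only to illustrate the workflow, not a prescribed setup.

```python
# Minimal sketch: automating hyperparameter tuning with GridSearchCV.
# Dataset, estimator (SVC) and the parameter grid are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Predefined hyperparameter values to loop through.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1, 0.1, 0.01, 0.001]}

# Fit every combination with 5-fold cross-validation on the training set.
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best CV score  :", grid.best_score_)
print("Test-set score :", grid.score(X_test, y_test))
```

When the grid is too large to try every combination, RandomizedSearchCV from the same model_selection package samples a fixed number of parameter settings instead.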
Model Deployment
Working with data is one thing, but deploying a machine learning model to production can be another. Data engineers are always looking for new ways to deploy their machine learning models to production; they want the best performance, and they care about how much it costs. Well, now you can have both! Let's take a look at the deployment process and see how we can do it successfully.

How to Deploy a Machine Learning Model in Production?
Most data science projects deploy machine learning models as an on-demand prediction service or in batch prediction mode. Some modern applications deploy embedded models on edge and mobile devices. Each approach has its own merits:
✓ In the batch scenario, optimizations are done to minimize model compute cost. There are fewer dependencies on external data sources and cloud services, and the local processing power is sometimes sufficient for computing algorithmically complex models. It is also easy to debug an offline model when failures occur, or to tune hyperparameters, since it runs on powerful servers.
✓ Web services, on the other hand, can provide cheaper and near real-time predictions. Availability of CPU power is less of an issue if the model runs on a cluster or cloud service, and the model can easily be made available to other applications through API calls.
✓ One of the main benefits of embedded machine learning is that we can customize it to the requirements of a specific device. We can easily deploy the model to a device, and its runtime environment cannot be tampered with by an external party. A clear drawback is that the device needs to have enough computing power and storage space.

Deploying Machine Learning Models as Web Services
✓ The simplest way to deploy a machine learning model is to create a web service for prediction.
✓ To create a machine learning web service, you need at least three steps (see the sketch at the end of this chapter):
  1. Create a machine learning model, train it and validate its performance.
  2. Persist the model.
  3. Serve the persisted model using a web framework.

Deploying Machine Learning Models for Batch Prediction
While online models can serve predictions on demand, batch predictions are sometimes preferable. Offline models can be optimized to handle a high volume of job instances and run more complex models, and in batch production mode you don't need to worry about scaling or managing servers either. Batch prediction can be as simple as calling the predict function with a dataset of input variables. Sometimes you will have to schedule the training or prediction in the batch-processing method. There are several ways to do this; one way is to use Airflow or Prefect to automate the task. However, building the model may require multiple stages in the batch-processing framework, and you need to decide what features are required and how you should construct the model for each stage.
✓ Train the model on a high-performance computing system with an appropriate batch-processing framework.
✓ Usually, you partition the training data into segments that are processed sequentially, one after the other. You can do this by splitting the dataset using a sampling scheme (e.g., balanced sampling, stratified sampling) or via some online algorithm (e.g., map-reduce).
✓ The partitions can be distributed to multiple machines, but they must all load the same set of features. Feature scaling is recommended. If you used unsupervised pre-training (e.g., autoencoders) for transfer learning, you must undo each partition.
✓ After all the stages have been executed, you can predict unseen data with the resulting model by iterating sequentially over the partitions.

Deploying Machine Learning Models on Edge Devices as Embedded Models
✓ Computing on edge devices such as mobile and IoT devices has become very popular in recent years. The benefits of deploying a machine learning model on edge devices include, but are not limited to:
  ✓ Reduced data bandwidth consumption, as we ship processed results back to the cloud instead of raw data, which is larger and eventually requires more bandwidth.
  ✓ Reduced latency, as the device is likely to be closer to the user than a faraway server.
✓ Edge devices such as mobile and IoT devices have limited computation power and storage capacity due to the nature of their hardware. We cannot simply deploy machine learning models to these devices directly, especially if our model is big or requires extensive computation to run inference on it.
✓ Instead, we should simplify the model using techniques such as quantization and aggregation while maintaining accuracy. These simplified models can be deployed efficiently on edge devices with limited computation, memory and storage.
✓ We can use the TensorFlow Lite library on Android to simplify our TensorFlow model. TensorFlow Lite is an open-source software library for mobile and embedded devices that tries to do what the name says: run TensorFlow models on mobile and embedded platforms.

"Let's practice all the theories that you learnt today in the lab through Python."
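As a starting point for the lab, here is a minimal sketch of the three web-service steps referenced above (create and train, persist, serve). The dataset, model choice, file name and endpoint are assumptions made only for illustration, not the course's prescribed setup.

```python
# Minimal sketch: deploying a model as a prediction web service.
# Dataset, model, "model.joblib" and the /predict endpoint are illustrative.
import joblib
from flask import Flask, jsonify, request
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Step 1: create, train and validate a model (validation omitted for brevity).
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Step 2: persist the trained model to disk.
joblib.dump(model, "model.joblib")

# Step 3: serve the persisted model with a web framework (Flask here).
app = Flask(__name__)
loaded_model = joblib.load("model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]        # e.g. [[5.1, 3.5, 1.4, 0.2]]
    prediction = loaded_model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```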