2. Basics of Modeling and Evaluation.pdf
Machine Learning Applications
Winter semester 2023/2024
Prof. Dr.-Ing. Uwe Klingauf
Lecture II: Basics of Modeling and Evaluation
25.10.2023 | Machine Learning Applications | Modeling and Evaluation | Prof. Dr.-Ing. Uwe Klingauf

Agenda for Today
1. Modeling techniques
2. Linear Regression
3. Choosing the right model
4. K-Nearest-Neighbor Classification
5. Model evaluation
6. Evaluation metrics

Modeling
Modeling is an essential part in the data mining process and consists of multiple steps:
▪ Select modeling technique: Determination of the algorithms to use. If multiple techniques are applied, the task must be run separately for each technique.
▪ Generate test design: Test the model’s quality and validity, separate the dataset into train and test sets.
▪ Build model(s): Fit the model(s) on the data set, and estimate its (their) quality on unseen data. Adjust hyperparameters.
▪ Assess model(s): Judge the modeling success based on existing domain knowledge and defined success criteria. Compare the different models.
▪ Revising and tuning of model parameters: Iterate model building and assessment until finding best model(s).
[Figure: data mining process cycle with the phases Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment arranged around the data]

Introduction to Modeling
MODELING TECHNIQUES

Categories of Machine Learning
Reminder: Machine Learning is the science “concerned with the question of how to construct computer programs that automatically improve with experience” [Tom Mitchell (1997)]
▪ Supervised Learning: Training data includes desired outputs
▪ Unsupervised Learning: Training data does not include desired outputs
▪ Reinforcement Learning: Rewards from sequence of actions
(Slide references: Lecture V, Lecture XII)

Supervised Learning: Typical data structure
Definition: A feature, also called attribute, is a data item that represents a characteristic or a property of a data entity.
Definition: The label is the desired output of the machine learning algorithm, e.g. the attribute that we want to predict.

Feature A   Feature B   Feature C   Label
2           9           5           0
1           6           5           2
8           3.5         7           1
10          7           4           1

A row in the data set represents an instance of the data, e.g. one out of many produced parts or one specific point in time.
Features can be:
▪ Categories like size or volume
▪ Time-dependent sensor data

Supervised Learning: Regression and Classification
▪ Given examples of input data (features) $X$ and output (label) $Y$
▪ Goal: Predict function $Y = F(X)$ for new, unknown examples $X$
$F(X)$ continuous: Regression
Examples:
▪ Demand prediction based on sales data
▪ Temperature forecasting
▪ Predicting the likelihood of a loan default
▪ Estimation of the remaining lifetime of technical components
$F(X)$ discrete: Classification
Examples:
▪ Image classification
▪ Spam filtering
▪ Credit card fraud detection
▪ Fault diagnosis for technical components

How do machine learning algorithms work?
Every machine learning algorithm has three components:
▪ Representation: Choosing the modeling type and thus defining the space of allowed models (hypothesis space)
▪ Evaluation: Scoring function or cost function to judge the models and distinguish good models from bad models
▪ Optimization: Process of finding the best model in the hypothesis space based on the given scoring function
Source: medium.com blog entry

There are many algorithms for supervised learning
▪ Decision Trees
▪ Support Vector Machines
▪ Neural Networks
▪ Random Forests
▪ K-Nearest-Neighbor
▪ Long short-term memory (LSTM)
▪ Convolutional neural networks (CNN)
Image sources: [opengenus.org] [Random forest - Wikipedia] [KnnClassification - Wikipedia] [LSTM Cell - Wikipedia] [CNN architecture (researchgate.net)]

Supervised Learning Example
LINEAR REGRESSION

Supervised Learning Example: Linear Regression
▪ Predict people's weight based on their height
▪ Data from 25 persons available to train the model
▪ Linear regression model
▪ Assumption: linear relationship between height and weight

Linear Regression: Representation
▪ Linear regression is one of the simplest machine learning models
▪ The representation of the model is a linear function: $y = f(x) = mx + b$
In general, if we have $p$ different input features:
$y = f(\vec{x}) = \sum_{i=1}^{p} \beta_i x_i + \beta_0$
With $x_0 = 1$:
$y = \sum_{i=0}^{p} \beta_i x_i = \vec{x}^T \vec{\beta}$
To learn the model, we have to find the best parameters $\vec{\beta}$ for the training data.
Example models shown in the plot: $y_1 = x - 95$, $y_2 = 2x - 270$, $y_3 = 0.2x + 45$

Linear Regression: Evaluation
▪ The linear models need to be evaluated using a scoring function
▪ Often, the sum of squared errors is used:
$SSE(\vec{\beta}) = \sum_{i=1}^{N} (y_i - \vec{x}_i^T \vec{\beta})^2$
$SSE(\vec{\beta}) = (\vec{y} - \mathbf{X}\vec{\beta})^T (\vec{y} - \mathbf{X}\vec{\beta})$
For the three example models: $SSE_1 = 929.8$, $SSE_2 = 3670.2$, $SSE_3 = 1058.7$

Linear Regression: Optimization
▪ To find the optimal parameters $\vec{\beta}$, the $SSE$ needs to be minimized:
$\frac{\partial SSE(\vec{\beta})}{\partial \vec{\beta}} = -2\mathbf{X}^T (\vec{y} - \mathbf{X}\vec{\beta})$
▪ To find the minimum of the $SSE$, the derivative is set to 0:
$-2\mathbf{X}^T (\vec{y} - \mathbf{X}\vec{\beta}) = 0$
$\hat{\vec{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \vec{y}$
Note: Convex optimization problem → every local minimum is also a global minimum
The prediction of our model is then: $\hat{y} = \vec{x}^T \hat{\vec{\beta}}$
Result for the example: $\hat{y} = 0.6481x - 33.26$, $SSE = 731.7$

What if there is no analytical solution?
In the linear regression example, an analytical solution for the optimization problem exists. For more complex machine learning problems, an analytical solution is often not known.
Definition: Optimization in machine learning means finding the minimum of a cost function. In most cases, iterative approaches have to be implemented to find the minimum.
[Figure: cost function of a machine learning problem with its global minimum. Image source: Coursera]

Gradient descent: A simple optimization algorithm
1. Choose a starting point $x$
2. Calculate the gradient $\nabla f(x)$ of the cost function at the starting point
3. Step in the direction of the negative gradient (steepest descent): $x' = x - \gamma \nabla f(x)$
4. New iteration
The learning rate $\gamma$ defines the size of the iteration steps and must be set to an appropriate value.
[Figure: gradient descent path from the starting point to the minimum. Image source: Niklas Donges]

Gradient descent: Linear regression example, implementation in Python
Cost function:
$f(m, b) = \frac{1}{N} \sum_{i=1}^{N} (y_i - (m x_i + b))^2$
Gradient:
$\nabla f(m, b) = \begin{pmatrix} \frac{df}{dm} \\ \frac{df}{db} \end{pmatrix} = \begin{pmatrix} \frac{1}{N} \sum_{i=1}^{N} -2 x_i (y_i - (m x_i + b)) \\ \frac{1}{N} \sum_{i=1}^{N} -2 (y_i - (m x_i + b)) \end{pmatrix}$
With normalized input data, $\gamma = 0.1$ and 100 iteration steps, the gradient descent algorithm finds the same result for the linear regression example: $\hat{y} = 0.6481x - 33.26$, $SSE = 731.7$

Gradient descent: Hyperparameter
Definition: In machine learning, hyperparameters are parameters that control the learning process. They are not part of the resulting model.
▪ The learning rate $\gamma$ is a hyperparameter
▪ It influences how well and how fast the optimal solution is found
▪ The optimum value of hyperparameters is often not known beforehand. Ways to choose it:
   ▪ Rules of thumb
   ▪ Experience
   ▪ Trial and error
[Figure: loss function of the previous linear regression example for different learning rates $\gamma$]

Challenges in Optimization
▪ Finding the global minimum
   ▪ Linear regression: local optimum = global optimum
   ▪ Complex optimization problem: local minima, saddle points
▪ Computational effort
   ▪ Calculating gradients is costly
   ▪ Very slow for complex problems with many variables
   ▪ Calculation costs increase with the number of data points
→ Stochastic gradient descent:
   ▪ Updating of parameters is based on a random sample drawn from the total data set
   ▪ Reduction of computation time
Alternative methods for optimization:
▪ Adaptive learning rate methods
▪ Newton method
▪ …
Link to paper: A Survey of Optimization Methods from a Machine Learning Perspective, Sun et al.
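
The two optimization routes from the linear regression slides above, the analytical solution and gradient descent on normalized input with $\gamma = 0.1$ and 100 steps, can be compared in a few lines of Python. This is a minimal sketch and not the lecture's own implementation: the 25 height/weight pairs are synthetic (the original data set is not part of the transcript), and the function names `fit_normal_equation`, `fit_gradient_descent` and `sse` are made up for illustration.

```python
import numpy as np


def fit_normal_equation(x, y):
    """Closed-form least squares: solve (X^T X) beta = X^T y with x_0 = 1."""
    X = np.column_stack([np.ones_like(x), x])
    # Solving the linear system avoids forming the inverse explicitly.
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    return beta[1], beta[0]  # slope m, intercept b


def fit_gradient_descent(x, y, gamma=0.1, steps=100):
    """Gradient descent on the mean squared error, using normalized input."""
    mean, std = x.mean(), x.std()
    z = (x - mean) / std  # normalized input: zero mean, unit variance
    m, b = 0.0, 0.0       # starting point
    for _ in range(steps):
        residual = y - (m * z + b)
        grad_m = np.mean(-2.0 * z * residual)  # df/dm
        grad_b = np.mean(-2.0 * residual)      # df/db
        m -= gamma * grad_m                    # step along the negative gradient
        b -= gamma * grad_b
    # Map the parameters back to the original (non-normalized) input scale.
    return m / std, b - m * mean / std


def sse(x, y, m, b):
    """Sum of squared errors of the line y = m*x + b."""
    return float(np.sum((y - (m * x + b)) ** 2))


# Hypothetical data: 25 persons, height in cm and weight in kg (assumption).
rng = np.random.default_rng(0)
height = rng.uniform(150.0, 200.0, size=25)
weight = 0.65 * height - 33.0 + rng.normal(0.0, 5.0, size=25)

m_ne, b_ne = fit_normal_equation(height, weight)
m_gd, b_gd = fit_gradient_descent(height, weight)

print(f"normal equation : y = {m_ne:.4f} x + {b_ne:.2f}, SSE = {sse(height, weight, m_ne, b_ne):.1f}")
print(f"gradient descent: y = {m_gd:.4f} x + {b_gd:.2f}, SSE = {sse(height, weight, m_gd, b_gd):.1f}")
```

Normalizing the input is what makes the fixed learning rate work here: without it, the gradient with respect to $m$ is several orders of magnitude larger than the one with respect to $b$, and $\gamma = 0.1$ would diverge on heights measured in centimetres.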
Investigation of the error in machine learning
CHOOSING THE RIGHT MODEL

What is a good model?
A good model:
1. Has a low prediction error
   ▪ Predictions should be close to the actual values
2. Generalizes well to unknown data
   ▪ Model predictions should work just as well for new, unknown data points

Expected Prediction Error
How big is the error of our regression model?
▪ The observed values are $y = f(x) + \epsilon$
   ▪ with measurement error (noise) $\epsilon$, $E[\epsilon] = 0$, $Var[\epsilon] = \sigma^2$
▪ The predicted values of the regression model are $\hat{y} = \hat{f}(x)$
▪ The expected mean squared error is defined as: $MSE = E[(y - \hat{f}(x))^2]$
One can show that the error is made up of three terms:
$MSE = \sigma^2 + BIAS^2[\hat{f}] + Var[\hat{f}]$
▪ $\sigma^2$: irreducible error due to noise in the data
▪ $BIAS^2[\hat{f}] + Var[\hat{f}]$: reducible error
Example in the plot: $f(x) = 0.0026x^2$, $\sigma_\epsilon = 10$, $\hat{f}(x) = 0.7906x - 57.73$

Bias
Definition: The bias of an estimator for a random variable $y$ is the difference between the estimator's expected value and the true value of the parameter being estimated.
$BIAS[\hat{f}] = E[\hat{f}(x) - f(x)]$
▪ The bias is independent of the training set considered and 0 for a perfect learner
▪ It can be thought of as a systematic error due to incorrect assumptions in the model
Linear regression example: The true function is quadratic with some added noise, so assuming a linear model causes a systematic error.
$f(x) = 0.0026x^2$, $\sigma_\epsilon = 10$, $\hat{f}(x) = 0.7906x - 57.73$

Variance
Definition: The variance of an estimator measures how much the estimator spreads out from its average value.
$Var[\hat{f}] = E[(E[\hat{f}(x)] - \hat{f}(x))^2]$
▪ The variance is independent of the true value $y$ and 0 for a learner that always predicts the same for all training sets
▪ It denotes changes in the model when using different training data
[Figure: changes in the model when using 25 different data points from the same distribution]

Bias-Variance-Trade-Off
Definition: Accuracy assesses how close the results are to the actual value (bias of the results).
Definition: Precision assesses how close the results are to each other (variance) and therefore how well the output is reproducible.
[Figure: target diagrams for the four combinations of low / high bias and low / high variance]
Bias-Variance-Trade-Off:
▪ Goal: minimize both bias and variance
▪ Very often, reducing variance leads to a higher bias and vice versa
   ▪ Simple models: too general, high bias
   ▪ Complex models: high variance
Source: according to https://wp.stolaf.edu/it/gis-precision-accuracy/

Achievement of generalizability

Model degree   MSE on training data   MSE on test data
1              1.4                    1.7
2              0.7                    1.02
10             0.5                    2.7

[Figure: total error, squared bias and variance over model complexity; training error and validation / generalization error over model complexity, with increasing bias towards low complexity, increasing variance towards high complexity and the model balance in between]
Source: according to https://neeravbasant.wordpress.com/tag/bias-variance-trade-off/

Overfitting and Underfitting
[Figure: model degree = 1 (underfitting), model degree = 2 (good fit), model degree = 10 (overfitting)]
Definition: Underfitting: A model that suffers from underfitting is too general for the problem, so that it is not even able to reproduce the data it was trained with. The model has high bias and low variance.
Definition: Overfitting: A model that suffers from overfitting is adjusted too closely to its training data, so that it is not able to generalize but only repeats exactly what it has learned. The model has low bias and high variance.

Mentimeter
Use case:
▪ Polynomial regression (degree 4) to predict the weight of persons
▪ MSE on unknown test data is much larger than on our training data → overfitting
What measures can we take to reduce the overfitting of our model?
Go to www.menti.com and use the code 5977 8330 → Link to result

How to improve a model?
Underfitting model:
▪ Increase model complexity / modify model architecture: helps to reduce bias
▪ Modify input features: additional features might help for a better prediction
→ Adding more training data is usually not helpful
Overfitting model:
▪ Add more training data: helps to reduce variance
▪ Feature subset selection: reduce the number of input features; also reduces computational time
▪ Decrease model complexity: helps to reduce variance, but also increases bias

How to improve a model? Linear Regression Example
Underfitting
▪ Problem: assumption of linear model not correct
▪ Choose a non-linear model
   ▪ Polynomial regression
   ▪ Regression splines
Overfitting
▪ Problem: Model adjusted too closely to training data
   ▪ High number of (irrelevant) input features
   ▪ Number of data points not large enough
▪ Regularization
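
The underfitting / good fit / overfitting comparison for model degrees 1, 2 and 10 from the slides above can be reproduced as a small experiment. The sketch below is only an illustration under stated assumptions: it takes the true function $f(x) = 0.0026x^2$ with noise $\sigma_\epsilon = 10$ from the bias slide, but the input range of 150 to 200, the random seed and the helper names `sample` and `mse` are invented here, so the resulting numbers will differ from the MSE values on the slides.

```python
import numpy as np
from numpy.polynomial import Polynomial


def true_function(x):
    """True relationship taken from the slides: f(x) = 0.0026 x^2."""
    return 0.0026 * x ** 2


def sample(rng, n):
    """Draw n noisy observations y = f(x) + eps with sigma = 10 (range of x is assumed)."""
    x = rng.uniform(150.0, 200.0, size=n)
    y = true_function(x) + rng.normal(0.0, 10.0, size=n)
    return x, y


def mse(model, x, y):
    """Mean squared error of a fitted model on the given data."""
    return float(np.mean((y - model(x)) ** 2))


rng = np.random.default_rng(1)
x_train, y_train = sample(rng, 25)    # small training set, as in the lecture example
x_test, y_test = sample(rng, 1000)    # unseen data from the same distribution

results = {}
for degree in (1, 2, 10):
    # Polynomial.fit rescales x internally, which keeps the degree-10 fit well conditioned.
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree:2d}: training MSE = {results[degree][0]:8.2f}, "
          f"test MSE = {results[degree][1]:8.2f}")
```

The training MSE can only go down as the degree increases, because each higher-degree model contains the lower-degree ones. The test MSE carries no such guarantee, and a gap that widens between the two is the overfitting signature described above. With a noise level this high relative to the curvature of $f$, the gap between degree 1 and degree 2 may be small on any single draw.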
Note: Regularization can be useful for high-dimensional feature spaces.

Regularization: Ridge Regression
Cost function:
$\sum_{i=1}^{N} (y_i - \vec{x}_i^T \vec{\beta})^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = SSE(\vec{\beta}) + \lambda \sum_{j=1}^{p} \beta_j^2$
$\lambda \geq 0$: Tuning parameter
▪ Penalty term proportional to the square of the coefficients
▪ Find best parameters to minimize the cost function
   ▪ Minimize sum of squared errors
   ▪ Penalty term shrinks the estimated coefficients towards zero
   ▪ Tuning parameter controls the relative impact of the penalty term
By shrinking the ridge coefficients, the variance of the predictions is reduced. When the tuning parameter is chosen too high, the bias can increase significantly.
[Figure: bias-variance trade-off for increasing $\lambda$, showing MSE on unseen data, irreducible error, variance and squared bias]
Source: James et al., An Introduction to Statistical Learning

Regularization: Lasso Regression
Cost function:
$\sum_{i=1}^{N} (y_i - \vec{x}_i^T \vec{\beta})^2 + \lambda \sum_{j=1}^{p} |\beta_j| = SSE(\vec{\beta}) + \lambda \sum_{j=1}^{p} |\beta_j|$
$\lambda \geq 0$: Tuning parameter
▪ Penalty term proportional to the absolute value of the coefficients
▪ Find best parameters to minimize the cost function
   ▪ Minimize sum of squared errors
   ▪ Penalty term shrinks the estimated coefficients towards zero
   ▪ Tuning parameter controls the relative impact of the penalty term
In Lasso regression, some of the parameters are set to 0 with increasing tuning parameter.
→ Selection of most important features
→ Increased model interpretability
[Figure: contours of the SSE together with the shrinkage constraint regions for Lasso and Ridge]
Source: James et al., An Introduction to Statistical Learning

Supervised Learning Example
K-NEAREST NEIGHBOR CLASSIFICATION

Classification Example
We want to learn a model that predicts the gender of persons given their height and their weight.
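
As a brief aside on the ridge cost function from the regularization slides: setting its gradient to zero gives a closed-form minimizer, $\hat{\vec{\beta}} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \vec{y}$, which makes the shrinkage effect easy to observe. The sketch below is one possible illustration, not the lecture's code. It assumes the intercept $\beta_0$ is left unpenalized (the penalty sum starts at $j = 1$), which is handled here by centering the data; the synthetic data set and the name `ridge_fit` are hypothetical. Lasso has no comparable closed form and is not covered by this sketch.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge regression via the closed form (Xc^T Xc + lam*I) beta = Xc^T yc.

    Centering X and y first means the intercept is not penalized.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta  # recover the intercept afterwards
    return beta0, beta


# Hypothetical data: 40 samples, 5 features, two of them irrelevant (assumption).
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 0.5])
y = 1.0 + X @ true_beta + rng.normal(0.0, 1.0, size=40)

lambdas = (0.0, 1.0, 10.0, 100.0)
norms = []
for lam in lambdas:
    beta0, beta = ridge_fit(X, y, lam)
    norms.append(float(np.linalg.norm(beta)))
    print(f"lambda = {lam:6.1f}: beta = {np.round(beta, 3)}, ||beta|| = {norms[-1]:.3f}")
```

With $\lambda = 0$ the result is the ordinary least squares fit, and the coefficient norm shrinks as $\lambda$ grows. Ridge moves the coefficients towards zero without setting any of them exactly to zero, in contrast to the Lasso behaviour described above.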
Classification: Can we use the approach from linear regression?
Regression models can also be used for classification. For instance, we may encode the label $y \in \{male, female\}$ as $y \in \{-1, +1\}$ and try to learn the function $f(x) = \vec{x}^T \vec{\beta}$ in a way that:
$y = \begin{cases} +1, & \text{if } f(x) \geq 0 \\ -1, & \text{if } f(x) < 0 \end{cases}$
Using the SSE, the solution shown on the right is found.
In general, linear regression is not ideal for classification:
▪ Difficult if there are more than two classes
▪ “True” decision boundary could be non-linear

Classification algorithm: k-Nearest-Neighbor
The k-Nearest-Neighbor algorithm is a very intuitive approach for classification.
▪ No learning of a model is necessary
▪ Assigning of class labels for unknown data is solely based on the training data
How to assign a class label to a new data point given some training data?
1. Choose the number k of neighbors
2. Calculate the distance (e.g. Euclidean distance) from the data point to the training data points
3. Take the k nearest neighbors
4. Majority voting to determine the class assigned to the data point
Image source: Wikipedia

k-Nearest-Neighbor
k=1
▪ No misclassification on the training data itself
▪ Low bias
▪ High variance
k=5
▪ Some misclassifications on the training data
▪ Decision boundary seems more reasonable than for k=1
k=9
▪ Even more misclassifications on the training data
▪ One clear decision boundary (no more exclaves)
▪ Higher bias, lower variance

How to assess a model
MODEL EVALUATION

Evaluation of Learned Models
▪ Validation through experts
   ▪ a domain expert evaluates the plausibility of a learned model
   + often the only option (e.g., clustering)
   - subjective, time-intensive, costly
▪ Validation on data
   ▪ evaluate the performance of the model on a separate dataset drawn from the same distribution as the training data
   + fast and simple, no domain knowledge needed, methods for re-using training data exist (e.g., cross-validation)
   - labeled data are scarce, could be better used for training
▪ On-line Validation
   ▪ test the learned model in a fielded application
   + gives the best estimate for the overall utility
   - bad models may be costly

Validation on Data: Out-of-sample Testing
▪ To evaluate if a model fits the data, it has to be tested on previously unseen data
▪ Performance cannot be measured on training data (Overfitting!)
→ The dataset is split into three parts:
1. Training data: data that is used to train the algorithm
2. Validation data: data that is used to optimize the hyperparameters of the model
3. Test data: data that is used to test the final model, never seen by the algorithm before

Typical Learning Curves
Typical split of the available data: Training 60%, Validation 20%, Test 20%
Image source: Hastie et al., The Elements of Statistical Learning

Validation on Data: k-fold-Cross-Validation
1. Partition your dataset into $k$ equal subsets (e.g. with $k = 10$)
2. For every partition:
   a. Keep the partition as test set and use the other $k-1$ partitions as training data
   b. Train the model and evaluate its performance on the test set
3. Average the results
+ Makes best use of the data
+ No influence of random sampling
- Computationally expensive

Which score should be used?
EVALUATION METRICS

Binary Classification: Confusion Matrix
▪ Accuracy: $acc = \frac{tp + tn}{tp + fn + fp + tn}$
▪ Precision: $precision = \frac{tp}{tp + fp}$ (Which proportion of positive classifications was correct?)
▪ Recall: $recall = \frac{tp}{tp + fn}$ (Which proportion of actual positives was identified correctly?)

Why is accuracy not always the best metric?
Prediction of oil pump failures, classification with a random forest. Faulty = positive, Healthy = negative.

                      Predicted: Faulty   Predicted: Healthy
True class: Faulty    5                   13
True class: Healthy   33                  1422

Metrics:
$Accuracy = \frac{5 + 1422}{5 + 1422 + 13 + 33} = 0.97$
$Precision = \frac{TP}{TP + FP} = \frac{5}{5 + 33} = 0.13$
$Recall = \frac{TP}{TP + FN} = \frac{5}{5 + 13} = 0.28$
High class imbalance: Accuracy is dominated by the majority class.
An alternative one-score metric is the F1-Score: $F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

Evaluation: Context within the data mining process
The evaluation of machine learning models comprises all the following steps:
▪ Select the best model: Careful comparison with the task at hand.
▪ Assess the level of achievement of the business objectives: Do the models meet the business success criteria?
▪ Test models on test applications: A generated model that meets the selected criteria best becomes an approved model.
▪ Review results for quality assurance questions: Have all steps been executed successfully?
▪ Listing of future actions and decisions
[Figure: data mining process cycle with the Evaluation phase highlighted]
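
The arithmetic of the oil pump example above can be checked with a few lines of Python. This is a minimal sketch with a made-up helper name, `classification_metrics`. It takes the four confusion matrix counts from the slide and also evaluates the F1-Score, which the slide defines but does not compute for this example.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 from binary confusion matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0; no guard against division by zero.
    """
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts from the oil pump example: faulty = positive, healthy = negative.
metrics = classification_metrics(tp=5, fn=13, fp=33, tn=1422)
for name, value in metrics.items():
    print(f"{name:9s}: {value:.2f}")
# prints accuracy 0.97, precision 0.13, recall 0.28, f1 0.18
```

An F1-Score of about 0.18 reflects the weak precision and recall far better than the accuracy of 0.97, which is driven almost entirely by the 1422 correctly classified healthy pumps.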
Note: The metric used for the evaluation depends heavily on the application context. A good understanding of the business / the task is essential to choose an appropriate metric.

Which metric should be chosen?
Precision vs. Recall
High precision:
▪ Low number of false positives
▪ In failure prediction: reduced number of false alarms, reduced workload for operators
High recall:
▪ Low number of false negatives
▪ In failure prediction: no faults remain undetected, very important for safety
There is often a trade-off between precision and recall, and both cannot be optimized at the same time.

Topics of Today - Summary
▪ Modeling techniques
▪ Linear Regression
▪ Choosing the right model
▪ Evaluation metrics
▪ Model evaluation
▪ Next Week: Data Understanding and Exploratory Data Analysis

THANK YOU!

Sources
▪ Lecture slides from Prof. Kristian Kersting, „Machine Learning Applications“, winter semester 2021/2022
▪ Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome H.: The Elements of Statistical Learning. Data mining, inference, and prediction. Second edition. Springer. (2002)
▪ James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert: An Introduction to Statistical Learning. Second edition. Springer. (2021)
▪ Ng, Andrew: Machine Learning Yearning, available at https://www.dbooks.org/machine-learning-yearning-1501/. (2018)
▪ Coursera, Supervised Machine Learning: Regression and Classification, Stanford / DeepLearning.AI, available at https://www.coursera.org/learn/machine-learning?specialization=machine-learning-introduction
▪ Han, Jiawei; Kamber, Micheline; Pei, Jian: Data Mining. Concepts and techniques. 3rd ed. (Online). Elsevier professional. (2011)
▪ Ertel, Wolfgang: Grundkurs Künstliche Intelligenz. Fünfte Auflage. Springer Fachmedien Wiesbaden. (2021)
▪ Cleve, Jürgen; Lämmel, Uwe: Data Mining. 3. Auflage. De Gruyter. (2020)
▪ Palacio-Nino, Julio-Omar; Berzal, Fernando: Evaluation Metrics for Unsupervised Learning Algorithms. (2019)