Machine Learning Workflow Lecture PDF
Document Details
University of Science and Technology
2023
Prof. Noureldien Abdelrahman Noureldien
Summary
This document is a lecture on machine learning. It covers the workflow for addressing machine learning problems, including defining the problem, collecting data and evaluating models.
Full Transcript
University of Science and Technology
Faculty of Computer Science and Information Technology
Department of Computer Science ….
Semester 7
Subject: Introduction to Machine Learning
Lecture (5): Workflow for Addressing Machine Learning Problems
___________________________________________________________________
Instructor: Prof. Noureldien Abdelrahman Noureldien
Date: 11-11-2023

To deal with a machine learning problem, the following methodology is needed.

5.1 Define Appropriately the Problem

The first, and one of the most critical, things to do is to identify the inputs and the expected outputs. The following questions must be answered:

- What is the main objective?
- What is the input data? Is it available?
- What kind of problem are we facing? Binary classification? Clustering?
- What is the expected output?

It is crucial to keep in mind that machine learning can only be used to memorize patterns that are present in the training data, so we can only recognize what we have seen before. When using machine learning we are assuming that the future will behave like the past, and this isn’t always true.

5.2. Collect Data

Collecting data is the first real step towards developing a machine learning model. The more and better data we get, the better our model will perform. Typically our data will have the following shape:

[Table: sample rows of the Boston housing dataset; not reproduced in this transcript]

Note: The previous table corresponds to the famous Boston housing dataset, a classical dataset frequently used to develop simple machine learning models. Each row represents a different Boston neighborhood and each column indicates some characteristic of the neighborhood (crime rate, average age, etc.). The last column represents the median house price of the neighborhood; it is the target, the value to be predicted from the other columns.

5.3. Choose a Measure of Success

“If you can’t measure it you can’t improve it”.
If you want to control something, it should be observable, and in order to achieve success it is essential to define what is considered success: Maybe precision? Accuracy? Customer-retention rate? This measure should be directly aligned with the higher-level goals of the business at hand. It is also directly related to the kind of problem we are facing:

- Regression problems use evaluation metrics such as mean squared error (MSE).
- Classification problems use evaluation metrics such as precision, accuracy and recall.

5.4. Setting an Evaluation Protocol

Once the goal is clear, we have to decide how we are going to measure progress towards achieving it. The most common evaluation methods are:

5.4.1 Maintaining a Hold out Validation Set

This method consists of setting apart some portion of the data as the test set and a further portion as the validation set. The model is trained on the remaining fraction of the data, its parameters are tuned with the validation set, and its performance is finally evaluated on the test set. The reason to split the data into three parts is to avoid information leaks: if the model were tuned against the test set, information about that set would leak into the model and the final score would be too optimistic. The main drawback of this method is that if little data is available, the validation and test sets will contain so few samples that the tuning and evaluation of the model will not be effective.

5.4.2 K-Fold Validation

K-Fold consists of splitting the data into K partitions of equal size. For each partition i, the model is trained on the remaining K-1 partitions and evaluated on partition i. The final score is the average of the K scores obtained.

5.4.3 Iterated K-Fold Validation with Shuffling

This technique is especially relevant when little data is available and the model must be evaluated as precisely as possible. It consists of applying K-Fold validation several times, shuffling the data every time before splitting it into K partitions.
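K-Fold validation and its iterated, shuffled variant can be sketched in plain Python as follows. This is a minimal illustration rather than the lecture's own code: the helper names (`k_fold_indices`, `k_fold_score`, `iterated_k_fold_score`), the toy model that always predicts the training mean, and the `seed` parameter are all assumptions introduced here.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def k_fold_score(xs, ys, k, fit, score):
    """Train on K-1 folds, evaluate on the held-out fold, average the K scores."""
    folds = k_fold_indices(len(xs), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        scores.append(
            score(model, [xs[j] for j in test_idx], [ys[j] for j in test_idx])
        )
    return mean(scores)


def iterated_k_fold_score(xs, ys, k, repeats, fit, score, seed=0):
    """Run K-fold `repeats` times, reshuffling the data before each run."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps runs repeatable
    order = list(range(len(xs)))
    run_scores = []
    for _ in range(repeats):
        rng.shuffle(order)
        run_scores.append(
            k_fold_score(
                [xs[j] for j in order], [ys[j] for j in order], k, fit, score
            )
        )
    return mean(run_scores)


# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return mean(train_y)


# Mean squared error of the constant prediction on the held-out fold.
def mse(model, test_x, test_y):
    return mean((y - model) ** 2 for y in test_y)


xs = list(range(10))
ys = [2.0 * x for x in xs]
print(k_fold_score(xs, ys, k=5, fit=fit_mean, score=mse))
print(iterated_k_fold_score(xs, ys, k=5, repeats=3, fit=fit_mean, score=mse))
```

Passing `fit` and `score` as functions keeps the splitting logic independent of any particular model. In practice a library implementation would normally be used instead of hand-written splitting.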
The final score is the average of the scores obtained at the end of each run of K-Fold validation.

Note: It is crucial to keep in mind the following points when choosing an evaluation method:

- In classification problems, both the training and the testing data should be representative of the whole dataset, so we should shuffle our data before splitting it to make sure each part covers the full spectrum of the dataset.
- When trying to predict the future given the past (weather prediction, stock price prediction…), data should not be shuffled, as the sequence of the data is a crucial feature and shuffling would create a temporal leak.
- We should always check whether there are duplicates in our data and remove them. Otherwise the redundant data may appear in both the training and testing sets, so the model is evaluated on samples it has already seen and the results become unreliable.

5.5. Preparing the Data

Before beginning to train models, we should transform our data into a form that can be fed into a machine learning model. The most common techniques are:

5.5.1 Dealing with missing data

It is quite common in real-world problems for some values to be missing from our data samples. This may be due to errors in data collection, blank spaces on surveys, measurements that are not applicable, etc. Missing values are typically represented with the “NaN” or “Null” indicators. The problem is that most algorithms can’t handle missing values, so we need to take care of them before feeding data to our models. Once they are identified, there are several ways to deal with them:

1. Eliminating the samples or features with missing values.
2. Imputing (calculating) the missing values with some pre-built estimator. One common approach is to set each missing value to the mean of the observed values of that feature.

5.5.2 Handling Categorical Data

Categorical data are either ordinal or nominal features. Ordinal features are categorical features that can be sorted (cloth’s size: L