Machine Learning Workflow (Lectures 5 & 6)
Summary
This document outlines a workflow for addressing machine learning problems, emphasizing initial problem definition, data collection, and evaluation protocols. It uses the Boston housing dataset as an example and touches upon techniques for preparing data.
Full Transcript
University of Science and Technology
Faculty of Computer Science and Information Technology
Department of Computer Science
Semester 8
Subject: Introduction to Machine Learning
Lecture (5): Workflow for Addressing Machine Learning Problems
Instructor: Prof. Noureldien Abdelrahman Noureldien
Date: 11-11-2023

To deal with a machine learning problem, the following methodology is needed.

5.1 Define the Problem Appropriately
The first, and one of the most critical, things to do is to find out what the inputs and the expected outputs are. The following questions must be answered:
What is the main objective?
What is the input data? Is it available?
What kind of problem are we facing? Binary classification? Clustering?
What is the expected output?
It is crucial to keep in mind that machine learning can only be used to memorize patterns that are present in the training data, so we can only recognize what we have seen before. When using machine learning we assume that the future will behave like the past, and this is not always true.

5.2 Collect Data
This is the first real step towards the development of a machine learning model: collecting data. The more and better data we get, the better our model will perform. Typically our data will have a tabular shape.
Note: The table used in the lecture corresponds to the famous Boston housing dataset, a classical dataset frequently used to develop simple machine learning models. Each row represents a different Boston neighborhood and each column indicates some characteristic of that neighborhood (crime rate, average age, etc.). The last column is the median house price of the neighborhood; it is the target, the value that will be predicted taking the other columns into account.

5.3 Choose a Measure of Success
"If you can't measure it you can't improve it." If you want to control something it should be observable, and in order to achieve success it is essential to define what is considered success: maybe precision? Accuracy? Customer-retention rate? This measure should be directly aligned with the higher-level goals of the business at hand, and it is also directly related to the kind of problem we are facing:
Regression problems use evaluation metrics such as mean squared error (MSE).
Classification problems use evaluation metrics such as precision, accuracy and recall.

5.4 Setting an Evaluation Protocol
Once the goal is clear, we have to decide how we are going to measure our progress towards achieving it. The most common evaluation methods are:

5.4.1 Maintaining a Hold-out Validation Set
This method consists of setting apart some portion of the data as the test set. The process is to train the model with the remaining fraction of the data, tune its parameters with the validation set and finally evaluate its performance on the test set. The reason to split the data into three parts is to avoid information leaks. The main drawback of this method is that if there is little data available, the validation and test sets will contain so few samples that the tuning and evaluation of the model will not be effective.

5.4.2 K-Fold Validation
K-Fold consists of splitting the data into K partitions of equal size. For each partition i, the model is trained on the remaining K-1 partitions and evaluated on partition i. The final score is the average of the K scores obtained.
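The following is a minimal sketch of K-Fold validation in Python, assuming the scikit-learn and NumPy libraries are available; the synthetic dataset and the LinearRegression model are illustrative choices, not taken from the lecture.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Illustrative synthetic regression data: 100 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# K = 5 partitions: train on the other 4, evaluate on the held-out one.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    scores.append(mean_squared_error(y[test_idx], preds))

# The final score is the average of the K scores obtained.
print("Average MSE over 5 folds:", np.mean(scores))

For iterated K-Fold with shuffling (Section 5.4.3 below), the same loop would simply be repeated several times with a different shuffle each time, and the resulting averages averaged again.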
5.4.3 Iterated K-Fold Validation with Shuffling
This technique is especially relevant when little data is available and the model needs to be evaluated as precisely as possible. It consists of applying K-Fold validation several times, shuffling the data each time before splitting it into K partitions. The final score is the average of the scores obtained at the end of each run of K-Fold validation.
Note: It is crucial to keep in mind the following points when choosing an evaluation method:
In classification problems, both the training and testing data should be representative of the data, so we should shuffle our data before splitting it, to make sure the whole spectrum of the dataset is covered.
When trying to predict the future given the past (weather prediction, stock price prediction, etc.), the data should not be shuffled, as the sequence of the data is a crucial feature and shuffling would create a temporal leak.
We should always check whether there are duplicates in our data and remove them. Otherwise the redundant data may appear in both the training and testing sets and cause inaccurate learning in our model.

5.5 Preparing the Data
Before beginning to train models we should transform our data into a form that can be fed into a machine learning model. The most common techniques are:

5.5.1 Dealing with Missing Data
It is quite common in real-world problems to miss some values in our data samples. This may be due to errors in data collection, blank spaces in surveys, measurements that are not applicable, etc. Missing values are typically represented with the "NaN" or "Null" indicators. The problem is that most algorithms cannot handle those missing values, so we need to take care of them before feeding the data to our models. Once they are identified, there are several ways to deal with them:
Eliminating the samples or features with missing values.
Imputing (calculating) the missing values with some pre-built estimators. One common approach is to set the missing values to the mean value of the rest of the samples (see the sketch after Section 5.5.2).

5.5.2 Handling Categorical Data
Categorical data are either ordinal or nominal features. Ordinal features are categorical features that can be sorted (clothing size: S < M < L), while nominal features do not imply any order (clothing color: yellow, green, red). The methods to deal with ordinal and nominal features are:
Mapping ordinal features: to make sure that the algorithm interprets the ordinal features correctly, we need to convert the categorical string values into integers. Frequently we will do this mapping manually. Example: L: 2, M: 1, S: 0.
Encoding nominal class labels: the most common approach is to perform one-hot encoding, which consists of creating a new dummy feature for each unique value in the nominal feature column. Example: if the color column has three classes (yellow, red, green) and we perform one-hot encoding, we get three new columns, one for each unique class. A yellow shirt is then encoded as yellow = 1, green = 0, red = 0.
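The following is a minimal sketch of the preparation steps in Sections 5.5.1 and 5.5.2, assuming the pandas and scikit-learn libraries are available; the tiny clothing DataFrame is an illustrative example, not data from the lecture.

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "size":  ["S", "M", None, "L"],                 # ordinal feature with a missing value
    "color": ["yellow", "green", "red", "green"],   # nominal feature
    "price": [10.0, None, 15.0, 20.0],              # numeric feature with a missing value
})

# 5.5.1: impute the missing numeric value with the mean of the rest of the column.
df[["price"]] = SimpleImputer(strategy="mean").fit_transform(df[["price"]])

# 5.5.2: map the ordinal feature manually (S < M < L) and one-hot encode the nominal one.
df["size"] = df["size"].map({"S": 0, "M": 1, "L": 2})   # the missing size stays NaN
df = pd.get_dummies(df, columns=["color"])

print(df)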
5.5.3 Feature Scaling
Machine learning is like making a mixed fruit juice. If we want to get the best mixed juice, we need to mix the fruits not by their size but in the right proportion. We just need to remember that an apple and a strawberry are not the same unless we make them similar in some context in order to compare their attributes. Similarly, in many machine learning algorithms, to bring all features onto the same footing we need to do scaling, so that one feature does not dominate the model just because of its large magnitude.
The most common techniques of feature scaling are Normalization and Standardization (a short sketch of both, together with PCA, follows Section 5.5.4). Normalization is used when we want to bound our values between two numbers, typically [0, 1] or [-1, 1], while Standardization transforms the data to have zero mean and a variance of 1, which makes the data unitless. (The lecture figure shows how the data looks in the X-Y plane after Normalization and after Standardization.)
Normalization: rescaling the features to a range of [0, 1], which is a special case of min-max scaling. To normalize our data we simply apply the min-max scaling method to each feature column.
Standardization: centering the feature columns at mean 0 with standard deviation 1, so that the feature columns have the same parameters as a standard normal distribution (zero mean and unit variance). This makes it much easier for the learning algorithm to learn the weights of the parameters. In addition, it keeps useful information about outliers and makes the algorithm less sensitive to them.
Why do we need scaling? A machine learning algorithm just sees numbers: if there is a vast difference in range, say a few features ranging in the thousands and a few ranging in the tens, it makes the underlying assumption that the higher-ranging numbers have some sort of superiority, and these larger numbers start playing a more decisive role while training the model. The algorithm works on numbers and does not know what those numbers represent. A weight of 10 grams and a price of 10 dollars represent two completely different things, which is a no-brainer for humans, but a model treats both features as the same. Suppose we have two features, weight and price, as in the table shown in the lecture. "Weight" cannot be meaningfully compared with "Price", yet because the weight values are larger, the algorithm assumes that "Weight" is more important than "Price" and lets it play a more decisive role while training the model. Feature scaling is therefore needed to bring every feature onto the same footing, without any upfront importance. Interestingly, if we convert the weight to kilograms, then "Price" becomes dominant.

5.5.4 Selecting Meaningful Features
One of the main reasons machine learning models overfit is redundancy in the data, which makes the model too complex for the given training data and unable to generalize well on unseen data. One of the most common ways to avoid overfitting is to reduce the data's dimensionality. This is frequently done by reducing the number of features of our dataset via Principal Component Analysis (PCA), which is a type of unsupervised machine learning algorithm. PCA identifies patterns in our data based on the correlations between the features. Correlation implies that there is redundancy in our data, in other words, that some part of the data can be explained by other parts of it. This correlated data is not essential for the model to learn its weights appropriately, so it can be removed, either by directly eliminating certain columns (features) or by combining a number of them into new features that hold most of the information.
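The following is a minimal sketch of Normalization versus Standardization (Section 5.5.3) with an optional PCA step (Section 5.5.4), assuming the scikit-learn and NumPy libraries are available; the random weight/price data is illustrative only.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA

# Illustrative data: a "weight" feature in grams and a "price" feature in dollars.
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.normal(loc=150, scale=30, size=200),   # weight (grams)
    rng.normal(loc=10, scale=2, size=200),     # price (dollars)
])

X_norm = MinMaxScaler().fit_transform(X)    # rescales each feature to [0, 1]
X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature

print("Normalized min/max:", X_norm.min(axis=0), X_norm.max(axis=0))
print("Standardized mean/std:", X_std.mean(axis=0).round(3), X_std.std(axis=0).round(3))

# Optionally reduce the (possibly redundant) features to one principal component.
X_reduced = PCA(n_components=1).fit_transform(X_std)
print("Shape after PCA:", X_reduced.shape)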
5.5.5 Splitting Data into Subsets
In general, we will split our data into three parts: training, validation and test sets. We train our model with the training data, evaluate it on the validation data and finally, once it is ready to use, test it one last time on the test data. Now, it is reasonable to ask the following question: why not have only two sets, training and testing? That way the process would be much simpler: just train the model on the training data and test it on the testing data. The answer is that developing a model involves tuning its configuration. This tuning is done with the feedback received from the validation set and is, in essence, a form of learning. The ultimate goal is that the model can generalize well on unseen data, in other words, predict accurate results from new data, based on the internal parameters it adjusted while it was trained and validated.
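The following is a minimal sketch of this three-way split, assuming the scikit-learn and NumPy libraries are available; the random data and the 60/20/20 proportions are illustrative choices, not prescribed by the lecture.

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# First carve out the test set, then split the rest into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))   # 600, 200, 200 samples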
Definition
Regression analysis is a statistical method for modelling the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us to understand how the value of the dependent variable changes in response to one independent variable while the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the following example. Suppose there is a marketing company A which runs various advertisements every year and gets sales from them. The list shown in the lecture gives the advertisement spending made by the company in the last 5 years and the corresponding sales. Now the company wants to spend $200 on advertisement in the year 2019 and wants a prediction of the sales for this year. To solve such prediction problems in machine learning, we need regression analysis.
Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time-series modelling, and determining cause-and-effect relationships between variables. In regression, we plot a graph between the variables which best fits the given datapoints; using this plot, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through all the datapoints on the target-predictor graph in such a way that the vertical distance between the datapoints and the regression line is minimum." The distance between the datapoints and the line tells whether the model has captured a strong relationship or not.
Some examples of regression are:
Prediction of rain using temperature and other factors
Determining market trends
Prediction of road accidents due to rash driving
Terminologies Related to Regression Analysis:
Dependent Variable: the main factor in regression analysis that we want to predict or understand is called the dependent variable. It is also called the target variable.
Independent Variable: the factors which affect the dependent variable, or which are used to predict its values, are called independent variables, also called predictors.
Outliers: an outlier is an observation with either a very low or a very high value in comparison to the other observed values. An outlier may hamper the results, so it should be avoided.
Multicollinearity: if the independent variables are highly correlated with each other, the condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.
Underfitting and Overfitting: if our algorithm works well on the training dataset but not on the test dataset, the problem is called overfitting. And if our algorithm does not perform well even on the training dataset, the problem is called underfitting.
Why do we use Regression Analysis?
As mentioned above, regression analysis helps in the prediction of a continuous variable. There are various scenarios in the real world where we need future predictions, such as weather conditions, sales, marketing trends, etc., and for such cases we need a technique that can make predictions as accurately as possible. For this we use regression analysis, a statistical method used in machine learning and data science. Below are some other reasons for using regression analysis:
Regression estimates the relationship between the target and the independent variables.
It is used to find trends in data.
It helps to predict real/continuous values.
By performing regression, we can determine the most important factor, the least important factor, and how each factor affects the other factors.
Types of Regression
There are various types of regression which are used in data science and machine learning. Each type has its own importance in different scenarios, but at the core, all regression methods analyze the effect of the independent variables on the dependent variable. Here we discuss some important types of regression, which are given below:
Linear Regression
Logistic Regression
Polynomial Regression
Support Vector Regression
Decision Tree Regression
Random Forest Regression
Ridge Regression
Lasso Regression
Linear Regression:
Linear regression is a statistical regression method which is used for predictive analysis. It is one of the simplest algorithms; it works on regression and shows the relationship between continuous variables. It is used for solving regression problems in machine learning. Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis), hence the name linear regression. If there is only one input variable (x), it is called simple linear regression; if there is more than one input variable, it is called multiple linear regression. The relationship between the variables in the linear regression model can be explained with the example shown in the lecture, where we predict the salary of an employee on the basis of years of experience. The mathematical equation for linear regression is:
Y = aX + b
Here Y is the dependent (target) variable, X is the independent (predictor) variable, and a and b are the linear coefficients.
Some popular applications of linear regression are:
Analyzing trends and sales estimates
Salary forecasting
Real estate prediction
Arriving at ETAs in traffic
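The following is a minimal sketch of simple linear regression, Y = aX + b, assuming the scikit-learn and NumPy libraries are available; the years-of-experience and salary figures are made up for illustration and are not the lecture's data.

import numpy as np
from sklearn.linear_model import LinearRegression

years = np.array([[1], [2], [3], [4], [5], [6]])           # X: years of experience
salary = np.array([35, 42, 50, 58, 65, 73], dtype=float)   # Y: salary in thousands of dollars

model = LinearRegression().fit(years, salary)
print("a (slope):", model.coef_[0], " b (intercept):", model.intercept_)
print("Predicted salary for 8 years of experience:", model.predict([[8]])[0])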
Logistic Regression:
Logistic regression is another supervised learning algorithm, used to solve classification problems. In classification problems, the dependent variable is in a binary or discrete format, such as 0 or 1. The logistic regression algorithm works with categorical variables such as 0 or 1, Yes or No, True or False, Spam or Not Spam, etc. It is a predictive analysis algorithm which works on the concept of probability. Logistic regression is a type of regression, but it differs from linear regression in how it is used. Logistic regression uses the sigmoid function (also called the logistic function) to model the data, together with a more complex cost function than linear regression. The function can be represented as:
f(x) = 1 / (1 + e^(-x))
where f(x) is the output, between 0 and 1; x is the input to the function; and e is the base of the natural logarithm. When we provide the input values (data) to the function, it produces an S-shaped curve. Logistic regression uses the concept of a threshold level: values above the threshold are mapped to 1, and values below the threshold are mapped to 0. There are three types of logistic regression:
Binary (0/1, pass/fail)
Multinomial (cats, dogs, lions)
Ordinal (low, medium, high)
Polynomial Regression:
Polynomial regression is a type of regression which models a non-linear dataset using a linear model. It is similar to multiple linear regression, but it fits a non-linear curve between the values of x and the corresponding conditional values of y. Suppose there is a dataset whose datapoints are distributed in a non-linear fashion; in such a case, linear regression will not fit those datapoints well. To cover such datapoints, we need polynomial regression. In polynomial regression, the original features are transformed into polynomial features of a given degree and then modelled using a linear model, which means the datapoints are best fitted by a polynomial curve. The equation for polynomial regression is derived from the linear regression equation: the linear regression equation Y = b0 + b1x is transformed into the polynomial regression equation
Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n
Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients and x is our independent/input variable. The model is still linear because the coefficients are linear; only the features are raised to quadratic and higher powers.
Note: This is different from multiple linear regression in that in polynomial regression a single variable appears with different degrees, instead of multiple variables each with the same degree.
Support Vector Regression:
Support Vector Machine is a supervised learning algorithm which can be used for regression as well as classification problems. When we use it for regression problems, it is termed Support Vector Regression (SVR). Support Vector Regression is a regression algorithm which works for continuous variables. Below are some keywords used in Support Vector Regression:
Kernel: a function used to map lower-dimensional data into higher-dimensional data.
Hyperplane: in general SVM it is a separation line between two classes, but in SVR it is the line which helps predict the continuous variable and covers most of the datapoints.
Boundary lines: the two lines on either side of the hyperplane which create a margin for the datapoints.
Support vectors: the datapoints which are nearest to the hyperplane and to the opposite class.
In SVR, we always try to determine a hyperplane with a maximum margin, so that the maximum number of datapoints is covered by that margin. The main goal of SVR is to include as many datapoints as possible within the boundary lines, and the hyperplane (best-fit line) must contain a maximum number of datapoints. In the figure shown in the lecture, the blue line is the hyperplane and the other two lines are the boundary lines.
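The following is a minimal Support Vector Regression sketch, assuming the scikit-learn and NumPy libraries are available; the sine-shaped toy data, the RBF kernel and the chosen C and epsilon values are illustrative assumptions.

import numpy as np
from sklearn.svm import SVR

# Illustrative non-linear data: y follows a sine curve with a little noise.
rng = np.random.default_rng(5)
X = np.linspace(0, 6, 80).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# epsilon sets the width of the margin (the "boundary lines") around the hyperplane.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X, y)

print("Number of support vectors:", len(model.support_vectors_))
print("Prediction at x = 1.5:", model.predict([[1.5]])[0])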
Decision Tree Regression:
Decision Tree is a supervised learning algorithm which can be used for solving both classification and regression problems, and it can handle both categorical and numerical data. Decision tree regression builds a tree-like structure in which each internal node represents a "test" on an attribute, each branch represents an outcome of the test, and each leaf node represents the final decision or result. A decision tree is constructed starting from the root node (the whole dataset), which splits into left and right child nodes (subsets of the dataset). These child nodes are further divided into their own children and thus become the parent nodes of those nodes. The example in the lecture figure shows decision tree regression, where the model tries to predict a person's choice between a sports car and a luxury car.
Random Forest Regression:
Random forest is one of the most powerful supervised learning algorithms and is capable of performing regression as well as classification tasks. Random forest regression is an ensemble learning method which combines multiple decision trees and predicts the final output as the average of the individual tree outputs. The combined decision trees are called base models, and the ensemble can be represented more formally as:
g(x) = f0(x) + f1(x) + f2(x) + ...
Random forest uses the Bagging (Bootstrap Aggregation) technique of ensemble learning, in which the aggregated decision trees run in parallel and do not interact with each other. With the help of random forest regression, we can prevent overfitting in the model by creating random subsets of the dataset for each tree.
Ridge Regression:
Ridge regression is one of the most robust versions of linear regression, in which a small amount of bias is introduced so that we can get better long-term predictions. The amount of bias added to the model is known as the ridge regression penalty. This penalty term is computed by multiplying lambda by the squared weight of each individual feature. A general linear or polynomial regression will fail if there is high collinearity between the independent variables; to solve such problems, ridge regression can be used. Ridge regression is a regularization technique used to reduce the complexity of the model; it is also called L2 regularization. It helps to solve problems where we have more parameters than samples.
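The following is a minimal Ridge regression sketch, assuming the scikit-learn and NumPy libraries are available; the nearly collinear toy features and alpha = 1.0 (scikit-learn's name for the lambda penalty strength) are illustrative choices.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two nearly collinear features, the situation where plain linear regression struggles.
rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=100)

print("Ordinary least-squares coefficients:", LinearRegression().fit(X, y).coef_)
print("Ridge (L2) coefficients:            ", Ridge(alpha=1.0).fit(X, y).coef_)

With collinear inputs the ordinary least-squares coefficients can become large and unstable, while the ridge penalty shrinks them towards smaller, more stable values at the cost of a small amount of bias.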