Linear Models – Linear Regression
Summary
This document presents a lecture on linear models and linear regression, part of a supervised learning topic, in the context of machine learning. It explains the concept of linearity vs. non-linearity in machine learning systems and outlines the use of linear models for predictions, particularly in regression problems. The document elaborates on building and optimizing linear models using gradient descent and explores how gradient descent finds optimal values via minimization of a cost or objective function.
Full Transcript
Linear Models – Linear Regression

1. Introduction
One of the fundamental concepts in Machine Learning (ML) is the idea of linearity versus nonlinearity. Linearity refers to the property of a system or model where the output is directly proportional to the input, while nonlinearity implies that the relationship between input and output is more complex and cannot be expressed as a simple linear function.

2. Linearity
Linear models are often the simplest and most effective approach. A linear model essentially fits a straight line to the data, allowing it to make predictions based on a linear relationship between the input features and the output variable. In a regression problem, a linear model is used to predict a continuous target (label/output) variable from one or more input features, such as the size and age of a tree.

3. Simple linear regression
Simple linear regression uses the linear equation

    Y = b0 + b1*x1

where Y is the dependent variable (what we want to predict), x1 is the independent variable or feature, b0 is the y-intercept (a constant), and b1 is the slope coefficient. For example, consider predicting the yield of potatoes from the amount of fertilizer a farmer chooses to use. The amount of fertilizer is the independent variable or feature (x1), while the dependent variable is the yield of potatoes (Y). The equation of our linear regression model then reads:

    Potatoes = b0 + b1*Fertilizer

4. Problem Statement
We have a collection of labeled examples {(xi, yi)}, i = 1, ..., N, where N is the size of the collection, xi is the D-dimensional feature vector of example i, yi is a real-valued target, and every feature is also a real number. We want to build a model as a linear combination of the features of example x:

    f_{w,b}(x) = w · x + b

where w is a D-dimensional vector of parameters and b is a real number. The notation f_{w,b} means that the model f is parametrized by two values: w and b.

We want to find the optimal values (w*, b*). Obviously, the optimal values of the parameters define the model that makes the most accurate predictions.

The figure displays the regression line (in light blue) for one-dimensional examples (dark-blue dots). We can use this line to predict the value of the target y_new for a new unlabeled input example x_new. If our examples are D-dimensional feature vectors (for D > 1), the only difference from the one-dimensional case is that the regression model is not a line but a plane (for two dimensions) or a hyperplane (for D > 2). From the figure we can conclude that the regression hyperplane should lie as close to the training examples as possible: if the blue line in the figure were far from the blue dots, the prediction y_new would have fewer chances of being correct.

5. Solution to find the best line
An optimization procedure is used to find the optimal values w* and b* by minimizing the following expression:

    (1/N) * Σ_{i=1..N} (f_{w,b}(xi) - yi)^2

In mathematics, the expression we minimize or maximize is called an objective function. The expression (f_{w,b}(xi) - yi)^2 in the above objective is called the loss function. It is a measure of the penalty for a wrong prediction on example i. This particular choice of loss function is called squared error loss. All model-based learning algorithms have a loss function, and what we do to find the best model is minimize the objective, known as the cost function.
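To make Sections 3 to 5 concrete, below is a minimal Python/NumPy sketch of the linear model f_{w,b}(x) = w · x + b and the squared-error objective used to score it. The function names predict and mse_objective and the toy fertilizer/potato numbers are illustrative assumptions, not taken from the slides.

import numpy as np

# Linear model f_{w,b}(x) = w . x + b applied to a batch of examples.
# X is an (N, D) feature matrix, w a (D,) parameter vector, b a scalar intercept.
def predict(X, w, b):
    return X @ w + b

# Average squared-error objective from Section 5 (the cost function).
def mse_objective(X, y, w, b):
    errors = predict(X, w, b) - y
    return np.mean(errors ** 2)

# Hypothetical toy data: potato yield versus amount of fertilizer (one feature).
fertilizer = np.array([[1.0], [2.0], [3.0], [4.0]])   # x1, shape (4, 1)
potatoes = np.array([2.1, 3.9, 6.2, 8.1])             # y,  shape (4,)

# Score a guessed slope of 2.0 and intercept of 0.0 against the data.
print(mse_objective(fertilizer, potatoes, w=np.array([2.0]), b=0.0))

With a guessed slope of 2 and intercept of 0, the objective is small but not zero; the optimization discussed next is what drives it toward its minimum.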
Closed-form solutions for finding an optimum of a function (the best line) are simple algebraic expressions and are often preferable to complex numerical optimization methods such as gradient descent.

6. What is Gradient Descent
Gradient descent is an iterative optimization algorithm that tries to find the optimum value (minimum/maximum) of an objective function. It is one of the most widely used optimization techniques in machine learning projects for updating the parameters of a model in order to minimize a cost function. The main aim of gradient descent is to find the parameters of the model that minimize the cost function, giving the best accuracy on the training dataset and, ideally, on the testing dataset as well.

7. How Does Gradient Descent Work
Gradient descent works by moving downward toward the pits or valleys in the graph to find the minimum value. This is achieved by taking the derivative of the cost function, as illustrated in the figure. During each iteration, gradient descent steps down the cost function in the direction of steepest descent. By adjusting the parameters in this direction, it seeks to reach the minimum of the cost function and find the best-fit values for the parameters. The size of each step is determined by a parameter α known as the learning rate.

In the gradient descent algorithm, one can infer two points (shown in the figure on the original slides).

8. How To Choose Learning Rate
The choice of a correct learning rate is very important, as it ensures that gradient descent converges in a reasonable time (illustrated by a figure on the original slides).

9. Linear Regression Using Gradient Descent
Our cost function is the mean squared error (MSE):

    MSE(w, b) = (1/N) * Σ_{i=1..N} (yi - (w*xi + b))^2

The linear regression formula is

    y_hat = w*x + b

To run gradient descent on this error function, we first need to compute its gradient. Since the function is defined by two parameters (w and b), we compute a partial derivative for each. These derivatives work out to be:

    dMSE/dw = (1/N) * Σ_{i=1..N} -2*xi*(yi - (w*xi + b))
    dMSE/db = (1/N) * Σ_{i=1..N} -2*(yi - (w*xi + b))

Now we apply the gradient descent algorithm to minimize the cost function of the linear regression model, updating at each iteration:

    w := w - α * dMSE/dw
    b := b - α * dMSE/db

Linear Regression: Example – Prediction Model (Predictor)
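The closing example slide is not reproduced in the transcript. As an illustrative stand-in, here is a minimal sketch of the gradient-descent fit described in Section 9 for one-dimensional examples; the function names gradient_step and fit, the learning rate of 0.05, the iteration count, and the toy data are assumptions made for this sketch.

import numpy as np

# One gradient-descent update for one-dimensional linear regression,
# using the partial derivatives of the MSE from Section 9:
#   dMSE/dw = (1/N) * Σ -2*xi*(yi - (w*xi + b))
#   dMSE/db = (1/N) * Σ -2*(yi - (w*xi + b))
def gradient_step(x, y, w, b, lr):
    residuals = y - (w * x + b)
    dw = -2.0 * np.mean(x * residuals)
    db = -2.0 * np.mean(residuals)
    return w - lr * dw, b - lr * db

# Repeatedly step down the cost surface, starting from w = b = 0.
def fit(x, y, lr=0.05, n_iters=1000):
    w, b = 0.0, 0.0
    for _ in range(n_iters):
        w, b = gradient_step(x, y, w, b, lr)
    return w, b

# Same hypothetical toy data as before: potato yield versus fertilizer amount.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
w, b = fit(x, y)
print(f"w = {w:.3f}, b = {b:.3f}")   # slope and intercept of the fitted line

Because the MSE cost of linear regression is convex, a sufficiently small learning rate makes these updates settle on essentially the same best-fit line that a closed-form solution would give.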