Fundamentals of Data Science: Linear Regression


Summary

This chapter from a textbook describes linear regression, a statistical method for modeling the relationship between a dependent variable and one or more independent variables. It covers simple and multiple linear regression, including interpretations of parameters and assumptions. It also includes a supervised learning example related to housing price prediction in Portland, OR.


Chapter 3: Linear Regression

Linear regression is a statistical method used to model the relationship between a dependent variable (also called the target, response, or output) and one or more independent variables (also called predictors, explanatory variables, or features) by fitting a linear equation to observed data.

3.1 Simple Linear Regression

In simple linear regression there is one dependent variable Y and one independent variable X. The goal is to model the relationship between X and Y using a linear function of X. The equation of the model is:

Y = β0 + β1 X + ε

where:
- Y: the dependent variable (response variable).
- X: the independent variable (explanatory variable).
- β0: the intercept of the regression line, the value of Y when X = 0.
- β1: the slope of the regression line, the change in Y for a one-unit change in X.
- ε: the error term (also called the residual), the difference between the observed value of Y and the value predicted by the model.

3.1.1 Interpretation of Parameters

Intercept β0: the predicted value of Y when X = 0. In many cases this value is not meaningful in itself (e.g., predicting salary when experience is zero), but it provides a reference point.

Slope β1: describes the relationship between X and Y. Specifically, it quantifies the expected change in Y for a unit increase in X. For example, if β1 = 2, then for every unit increase in X, Y is expected to increase by 2 units.

Assumptions of simple linear regression:
1. Linearity: the relationship between the dependent variable Y and the independent variable X is linear.
2. Independence: the residuals (errors) ε are independent.
3. Homoscedasticity: the residuals have constant variance (the variance of the errors is the same across all values of X).
4. Normality: the residuals are normally distributed.

3.2 Multiple Linear Regression

In multiple linear regression there are multiple independent variables X1, X2, ..., Xp. The goal is to model the relationship between the dependent variable Y and the set of independent variables X1, X2, ..., Xp. The equation of the model is:

Y = β0 + β1 X1 + β2 X2 + · · · + βp Xp + ε

where:
- Y: the dependent variable (response variable).
- X1, X2, ..., Xp: the independent variables (predictors).
- β0: the intercept of the regression hyperplane.
- β1, β2, ..., βp: the coefficients (slopes) associated with each independent variable.
- ε: the error term (residual).

In matrix notation this can be written as:

Y = Xβ + ε

where:
- Y is an n × 1 vector of dependent-variable values.
- X is an n × (p + 1) matrix of the independent variables (with a column of 1s for the intercept).
- β is a (p + 1) × 1 vector of model coefficients.
- ε is an n × 1 vector of residuals.

3.2.1 Interpretation of Parameters in Multiple Linear Regression

Intercept β0: the predicted value of Y when all the independent variables X1, X2, ..., Xp are equal to 0.

Slope βi: the expected change in Y for a one-unit increase in Xi, holding all other independent variables constant.

3.2.2 Assumptions of Multiple Linear Regression

1. Linearity: the relationship between the dependent variable Y and each independent variable Xj is linear.
2. Independence: the residuals are independent.
3. Homoscedasticity: the residuals have constant variance.
4. Normality: the residuals are normally distributed.
5. No multicollinearity: the independent variables X1, X2, ..., Xp are not too highly correlated with each other.
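As an illustration of the matrix form Y = Xβ + ε, the sketch below fits a multiple linear regression with NumPy's least-squares solver. The data values and variable names are invented for the example; this is a minimal sketch, not the textbook's own code.

```python
import numpy as np

# Toy dataset: n = 5 observations, p = 2 features (values invented for illustration).
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0],
                  [5.0, 2.5]])
y = np.array([6.1, 6.8, 9.9, 13.2, 14.7])

# Build the n x (p + 1) design matrix: a column of 1s for the intercept beta_0.
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Solve min_beta ||X beta - y||^2; beta = [beta_0, beta_1, beta_2].
beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("coefficients:", beta)

# Predict for a new observation using the fitted hyperplane.
x_new = np.array([1.0, 2.5, 1.0])  # [1, x1, x2]
print("prediction:", x_new @ beta)
```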
3.3 Example: Supervised Learning in the Context of Linear Regression

This example illustrates supervised learning in the context of linear regression, focusing on predicting housing prices based on house size (Figure 3.1).

Supervised learning: this example falls under supervised learning, where the model is trained on data for which the correct "answers" (labels) are already known. The "right answer" refers to the known house prices (the target variable) in the dataset. Training teaches the algorithm the relationship between the input data (features) and the correct answers (outputs or labels). Once trained, the model can make predictions on unseen data.

Regression problem: the model's task is to predict real-valued outputs; here, house prices in thousands of dollars. The alternative to regression is classification, where the output is discrete (predicting categories or labels, e.g., dog vs. cat).

Housing prices in Portland, OR: the graph illustrates the relationship between house size (in square feet) and the price of the house. The x-axis represents house size and the y-axis represents price (in thousands of dollars). Each red "X" in the graph is a data point, an individual house in the dataset with its size and corresponding price. The goal is to find a regression line (the straight green line) that best fits the data, capturing the relationship between size and price.

3.3.1 Understanding the Graph

On the left side is a plot of the input feature x (the size of a house) against the output y (the price of a house). Each red "X" is a training example for which both the input x and the corresponding output y are known.

Green line (linear regression line):
- The green line is the best-fit line (hypothesis) that the linear regression model learns. It minimizes the difference between the predicted price (the point on the line) and the actual price (the red "X"). The goal of linear regression is to fit a line (hypothesis) that best describes the relationship between x and y. The line is represented by the hypothesis function hθ(x) = θ0 + θ1 x.
- The goal is to choose values for θ0 and θ1 such that the hypothesis hθ(x) gives a prediction close to the actual target value y for every training example.
- The equation of the line is typically of the form Price = β0 + β1 × Size, where β0 is the intercept (the predicted price of a house when size is 0) and β1 is the slope (the rate at which price increases as size increases).

Interpretation of the line:
- Houses below the green line are underpriced relative to the predicted price, while houses above the line are overpriced.
- The line is used to predict the price of a house given its size. For instance, for a house of 1700 square feet (marked on the graph), the model's predicted price is found by following the line vertically at that size.

Key terms:
- Real-valued output means that the output (here, house price) is a continuous value, not a category or label.
- In this example, the size of the house is the primary feature (independent variable) used to predict the price.
- In general, features are the input variables that the model uses to make predictions.
- The idea is to minimize the error between the predicted output (from the hypothesis) and the actual target values (from the training set). This error is quantified by the cost function.
3.3.2 Learning Algorithm

The workflow of the machine learning process (Figure 3.2):
1. Training set: the training data, including the sizes of houses and their respective prices.
2. Learning algorithm: the model or algorithm (e.g., linear regression) that takes the training data as input and "learns" the relationship between the feature (house size) and the target (house price).
3. Hypothesis (h): after training, the learning algorithm outputs a hypothesis function h(x) that represents the relationship between the input and the output. In linear regression, h(x) = θ0 + θ1 x, where θ0 is the intercept and θ1 is the slope of the line.
4. Predicted price: given a new house size, the hypothesis function h(x) predicts the estimated price of the house.

Figure 3.2: Learning Algorithm

3.3.3 Training Set of Housing Prices (Portland, OR)

We now consider the training set for a supervised learning problem, specifically a regression problem: predicting housing prices (in Portland, OR) from features of the houses. The table shows a training set consisting of pairs of inputs (features) and outputs (targets):

- Size in feet² (x): the independent variable or feature, the size of each house.
- Price ($) in 1000's (y): the dependent variable or target, the price of each house.
- Number of training examples (m): the total number of training examples in the dataset.
- x's: input features of the training examples, here the size of the house.
- y's: output or target values, the corresponding house prices.

This training set contains multiple examples of housing data. For example, the first house is 2104 square feet and costs $460,000 (recorded as 460 in the table). The goal is to use this training data to predict the price of a house given its size.

Figure 3.3: Variables

In summary, the key concepts of training sets are:
- The training set provides the data to learn from.
- The hypothesis function hθ(x) = θ0 + θ1 x models the relationship between the input feature x (house size) and the target output y (house price).
- Different values of θ0 and θ1 result in different lines on the graph. The goal is to find the values of θ0 and θ1 that minimize the error between the predicted values and the actual values. This process involves minimizing a cost function that measures the fit of the model.
- θ0 is the intercept or bias term, representing the predicted price when x = 0. In real-life problems this might not always make sense (e.g., a house with zero square feet), but it is included in the model to shift the prediction line vertically.
- θ1 is the slope or weight for the feature x, representing how much the price changes as the house size increases by one square foot.
- The hypothesis function is linear, so it predicts a straight line when plotted; the parameters θ0 and θ1 define the position and orientation of the line.
3.3.4 Linear Regression Hypothesis

Hypothesis with one variable: in linear regression, the hypothesis function predicts the output y for a given input x as a linear combination of the input features. For a single feature (variable), the hypothesis hθ(x) is defined as:

hθ(x) = θ0 + θ1 x

where:
- hθ(x) is the predicted output (here, price) for a given input x (size of the house);
- θ0 is the intercept (the value of hθ(x) when x = 0);
- θ1 is the slope (the coefficient of the variable x, representing how much y changes with a one-unit increase in x);
- x is the input feature or variable.

The equation represents a straight line in 2D space, where θ0 shifts the line vertically and θ1 dictates the tilt (slope) of the line. The goal of the learning algorithm is to learn the values of θ0 and θ1 that best fit the training data, minimizing the difference between the predicted values and the actual values of the target variable.

Extending to multiple variables: the hypothesis extends to multiple variables x1, x2, x3, ..., xn. This extension to multiple features (also called multivariate linear regression) is expressed as:

hθ(x) = θ0 + θ1 x1 + θ2 x2 + θ3 x3 + · · · + θn xn

This equation represents a hyperplane in higher-dimensional space. Each θi is the weight (or coefficient) corresponding to the feature xi, and θ0 is still the intercept. For example, the hypothesis might look like:

hθ(x) = 100 + 0.1 x1 + 0.6 x2 + 0.07 x3 + 10 x4

where x1 could be the size of the house (in square feet), x2 the number of bedrooms, x3 the number of floors, and x4 the age of the house. This equation models the relationship between multiple factors and the house price.

Multivariate linear regression in vector notation: to simplify the expression of multivariate linear regression, matrix notation is used. The hypothesis can be compactly represented as a dot product between the feature vector x and the parameter vector θ. For convenience, we define x0 = 1 (a constant feature of 1 added to the feature vector to represent the intercept term). Then:

x = [x0, x1, ..., xn]ᵀ is the (n + 1)-dimensional vector of input features, and
θ = [θ0, θ1, ..., θn]ᵀ is the parameter vector.

Thus, the hypothesis function can be written as:

hθ(x) = θ0 + θ1 x1 + θ2 x2 + · · · + θn xn = θᵀ x

where θᵀ x is the dot product of the two vectors. This notation simplifies calculations, especially when dealing with many features, and enables efficient implementation in matrix-based programming languages such as Python or MATLAB.

In summary, the key components of the hypothesis in linear regression are:
- In single-variable linear regression, the hypothesis is a simple straight-line equation, hθ(x) = θ0 + θ1 x.
- In multivariate linear regression, the hypothesis generalizes to multiple variables: hθ(x) = θ0 + θ1 x1 + · · · + θn xn.
- Matrix notation represents this hypothesis compactly as the dot product of the parameter vector θ and the feature vector x: hθ(x) = θᵀ x.

This approach allows us to predict output values for a given set of input features using the parameters (weights) learned from the training data.

Figure 3.4: Hypothesis
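The vectorized hypothesis hθ(x) = θᵀx translates directly into code. The following sketch uses invented parameter and feature values (matching the four-feature example above) to show the dot-product form; all names and numbers are illustrative.

```python
import numpy as np

# Illustrative parameters: theta = [theta_0, theta_1, theta_2, theta_3, theta_4].
theta = np.array([100.0, 0.1, 0.6, 0.07, 10.0])

# One house: size, bedrooms, floors, age (made-up values).
features = np.array([2104.0, 3.0, 2.0, 40.0])

# Prepend x_0 = 1 so the dot product includes the intercept term.
x = np.concatenate([[1.0], features])

h = theta @ x  # h_theta(x) = theta^T x
print(f"predicted price (in $1000s): {h:.1f}")
```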
Choosing θ0 and θ1

How do we choose the parameters θ0 and θ1 to best fit the data? In linear regression, the goal is to find the line that best represents the relationship between the input x and the output y. To do this, we need to minimize the error between the predicted values and the actual values in the training set. Different parameter choices for θ0 and θ1 result in different lines (Figure 3.5):

- Left graph, θ0 = 1.5 and θ1 = 0: this produces a horizontal line at 1.5, meaning the model predicts the same price (1.5 in normalized units) for all house sizes. This line clearly does not fit the data well because it ignores the effect of x (house size).
- Middle graph, θ0 = 0 and θ1 = 0.5: this produces a line with a positive slope, meaning that as the house size increases, the predicted price increases proportionally. This is closer to a realistic fit but still needs adjustment.
- Right graph, θ0 = 1 and θ1 = 0.5: another variation, where both the intercept and slope are non-zero. The line is shifted upward and might represent a better fit depending on the data.

Figure 3.5: Choosing Parameters

Minimizing the cost function: the next step is to determine the best values for θ0 and θ1. The cost function is used to measure how well a given line fits the data. Specifically, it calculates the error between the predicted values (from the hypothesis) and the actual target values (from the training set). The goal of linear regression is to minimize this error, which is often done using gradient descent or other optimization techniques. Gradient descent iteratively updates the parameters θ0 and θ1 to reduce the cost function J(θ0, θ1), converging to a minimum point where the error between predicted and actual values is as small as possible.

3.3.5 Cost Function J(θ0, θ1)

To evaluate how well the chosen θ0 and θ1 parameters fit the data, we use a cost function. This cost function quantifies the error between the predicted values hθ(x) and the actual target values y from the training set. The cost function for linear regression is based on the Mean Squared Error (MSE), which averages the squared error between the predicted value hθ(x) and the actual value y over all training examples:

J(θ0, θ1) = (1 / 2m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²

where:
- m is the number of training examples,
- hθ(x^(i)) is the predicted value for the i-th training example,
- y^(i) is the actual target value for the i-th training example.

Why do we square the error? The squared error ensures that larger errors have a proportionally larger effect on the cost. Squaring also makes all errors positive, preventing positive and negative errors from canceling each other out.

Why do we divide by 2m? The division by m averages the error over all the training examples, ensuring that the cost does not scale with the number of training examples. The factor of 2 is included for mathematical convenience when differentiating the cost function (in gradient descent), as it cancels the 2 that comes from the derivative of the squared term.

The goal is to minimize this cost function J(θ0, θ1), i.e., to find the values of θ0 and θ1 that result in the smallest possible error.
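The cost function above can be written in a few lines of vectorized NumPy. A minimal sketch, using the three-point toy dataset (1, 1), (2, 2), (3, 3) that the following subsections analyze:

```python
import numpy as np

def cost(theta0: float, theta1: float, x: np.ndarray, y: np.ndarray) -> float:
    """J(theta0, theta1) = (1 / 2m) * sum((h(x_i) - y_i)^2)."""
    m = len(x)
    h = theta0 + theta1 * x          # predictions for all examples at once
    return np.sum((h - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(cost(0.0, 1.0, x, y))   # 0.0  (perfect fit, see Section 3.3.6)
```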
Simplified Cost Function (When θ0 = 0)

In some cases, for simplicity, we assume θ0 = 0, so the hypothesis function becomes:

hθ(x) = θ1 x

In this case the cost function simplifies to:

J(θ1) = (1 / 2m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))² = (1 / 2m) Σ_{i=1}^{m} (θ1 x^(i) − y^(i))²

This simplified cost function depends only on θ1, and the goal is to minimize it to find the best value of θ1.

In summary, the key concepts are:
- Hypothesis function: models the relationship between x and y using the parameters θ0 and θ1.
- Cost function: measures the error between the predicted values and the actual target values. The goal is to minimize this error by finding the best values for θ0 and θ1.
- Simplified case: when θ0 = 0, the problem reduces to finding the best θ1 by minimizing the cost function J(θ1).

3.3.6 Analysis of hθ(x) when θ1 = 1

Figure 3.6: θ1 = 1

Perfect fit: when θ1 = 1, the hypothesis function hθ(x) = x perfectly fits the training data. This is visually confirmed on the left plot, where the black line (model predictions) goes through all the red crosses (training data points).

Cost function at θ1 = 1: since the model predictions perfectly match the training data, the error for each training example is 0, which leads to a cost of 0. Mathematically, for each training example (x^(i), y^(i)), the squared error is:

(hθ(x^(i)) − y^(i))² = 0² = 0

Summing over all training examples and averaging gives J(θ1) = 0. On the right plot, the minimum of the cost function is at θ1 = 1, where J(θ1) = 0.

Interpretation of the cost function curve:
- Global minimum: the goal of linear regression is to find the value of θ1 that minimizes the cost function. In this case, the cost function reaches its global minimum at θ1 = 1, where the error is 0.
- Shape of the cost function: J(θ1) is typically shaped like a bowl, because it is a quadratic function of θ1 due to the squared error term. Plotted as a function of θ1, it generally has a single minimum point (the global minimum) where the cost is lowest. In this specific case, because the data and model are perfectly aligned at θ1 = 1, the cost function reaches 0 at that point.

Mathematical calculation of the cost function at θ1 = 1: assume we have three training examples: (1, 1), (2, 2), (3, 3). The hypothesis function for θ1 = 1 is hθ(x) = x, so the predicted outputs are hθ(1) = 1, hθ(2) = 2, hθ(3) = 3, and the actual outputs are y = 1, 2, 3. The squared errors for each training example are:

(hθ(1) − 1)² = 0² = 0
(hθ(2) − 2)² = 0² = 0
(hθ(3) − 3)² = 0² = 0

Summing and averaging these errors (with m = 3):

J(θ1) = (1 / 2m) Σ_{i=1}^{3} 0 = 0

Thus J(1) = 0, as shown on the right graph. This is the perfect case in linear regression, where the model with θ1 = 1 exactly fits the training data, resulting in zero cost. This is the ideal solution, but in real-world scenarios we rarely encounter perfect fits; instead, we aim to minimize the cost function as much as possible using techniques like gradient descent.

3.3.7 Analysis of hθ(x) when θ1 = 0.5

Figure 3.7: θ1 = 0.5

The hypothesis function hθ(x) in linear regression represents the predicted output for a given input:

hθ(x) = θ1 x

In this plot:
- θ1 is set to 0.5;
- the black line represents the hypothesis function hθ(x) = 0.5x;
- the red crosses represent the actual data points (1, 1), (2, 2), and (3, 3).
Interpretation of the line: since θ1 = 0.5, the slope of the line is shallower than the actual data trend; for each unit increase in x, the predicted value of y increases by only 0.5. The line does not fit the data: the predictions are lower than the actual values at x = 1, x = 2, and x = 3, which introduces error.

Predicted values: for each input x, the corresponding predicted value hθ(x) is:

hθ(1) = 0.5 × 1 = 0.5
hθ(2) = 0.5 × 2 = 1.0
hθ(3) = 0.5 × 3 = 1.5

Clearly, the predicted values are lower than the actual values (1, 2, 3), meaning the model is underfitting the data.

Cost function J(θ1): the cost function represents how well the hypothesis hθ(x) fits the actual data, measuring the error between the predicted and actual values:

J(θ1) = (1 / 2m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²

where m is the number of training examples, hθ(x^(i)) is the predicted value for the i-th training example, y^(i) is the actual output, and (hθ(x^(i)) − y^(i))² is the squared error between them.

Calculation of J(θ1 = 0.5): we are given three training examples (1, 1), (2, 2), (3, 3) and the hypothesis hθ(x) = 0.5x, so the predicted values are hθ(1) = 0.5, hθ(2) = 1.0, hθ(3) = 1.5, while the actual values are y = 1, 2, 3. The squared errors for each training example are:

- for (x^(1), y^(1)) = (1, 1): (0.5 − 1)² = (−0.5)² = 0.25
- for (x^(2), y^(2)) = (2, 2): (1 − 2)² = (−1)² = 1
- for (x^(3), y^(3)) = (3, 3): (1.5 − 3)² = (−1.5)² = 2.25

The sum of squared errors is 0.25 + 1 + 2.25 = 3.5, so the cost function is:

J(0.5) = (1 / (2 × 3)) × 3.5 = 3.5 / 6 ≈ 0.58

This is the value shown in the plot on the right side. The cost J(0.5) ≈ 0.58 indicates a significant error when θ1 = 0.5. On the cost function plot, the x-axis represents the parameter θ1 and the y-axis the cost J(θ1); for θ1 = 0.5, the cost is around 0.58, a moderate error, since the hypothesis does not fit the data perfectly.

Key takeaways:
- Underfitting: the left plot with θ1 = 0.5 shows an underfitting model. The line is not capturing the trend in the data correctly, resulting in a high cost.
- Cost minimization: the goal of linear regression is to find the value of θ1 that minimizes the cost function J(θ1). As the right plot shows, the cost is not at its minimum for θ1 = 0.5, meaning there are better values of θ1 (such as θ1 = 1, as seen above).
- Gradient descent: to find the optimal value of θ1, methods like gradient descent can be applied to minimize the cost function, adjusting θ1 iteratively to reduce the cost and find the best fit.
3.3.8 Analysis of hθ(x) when θ1 = 0

Figure 3.8: θ1 = 0

This plot shows the hypothesis function hθ(x) when θ1 = 0. The hypothesis is linear, hθ(x) = θ1 x; given that θ1 = 0, it simplifies to:

hθ(x) = 0

The red crosses represent the actual data points (1, 1), (2, 2), and (3, 3). The green line represents the hypothesis function, a flat line at y = 0: regardless of the value of x, the predicted output is always 0.

Underfitting: since the hypothesis predicts 0 for all inputs, it is severely underfitting the data. The model completely fails to capture the positive correlation between x and y visible in the data points.

Cost function J(θ1): as before, the cost function measures the error between predicted and actual values:

J(θ1) = (1 / 2m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²

where m is the number of training examples, hθ(x^(i)) is the predicted value for the i-th training example, and y^(i) is the actual output; (hθ(x^(i)) − y^(i))² is the squared error between them.

The cost-function plot shows how J(θ1) varies as a function of the parameter θ1: the x-axis represents different values of θ1, and the y-axis the corresponding cost J(θ1).

Convex shape: the cost function is convex, meaning it has a single global minimum. This is a common feature of linear regression problems, where the cost function is a quadratic bowl. The plot shows the global minimum at θ1 = 1, indicated by the minimum point (star) on the curve.

Calculation of J(θ1 = 0): for θ1 = 0 the hypothesis is hθ(x) = 0. The squared errors for the three data points are:

- for (x^(1), y^(1)) = (1, 1): (0 − 1)² = 1² = 1
- for (x^(2), y^(2)) = (2, 2): (0 − 2)² = 2² = 4
- for (x^(3), y^(3)) = (3, 3): (0 − 3)² = 3² = 9

The total squared error is 1 + 4 + 9 = 14. Since there are 3 data points, the cost function is:

J(0) = (1 / (2 × 3)) × 14 = 14 / 6 ≈ 2.33

This cost is displayed on the plot as J(0) ≈ 2.3.
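The three cases just analyzed (θ1 = 1, 0.5, and 0) can be checked numerically with the simplified cost function. A minimal sketch on the same toy dataset:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

for theta1 in (0.0, 0.5, 1.0):
    # Simplified cost J(theta1) with theta0 = 0.
    J = np.sum((theta1 * x - y) ** 2) / (2 * m)
    print(f"J({theta1}) = {J:.4f}")
# J(0.0) = 2.3333, J(0.5) = 0.5833, J(1.0) = 0.0000
```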
3.3.9 Hypothesis hθ(x) = 50 + 0.06x

Figure 3.9: Hypothesis hθ(x) = 50 + 0.06x
Figure 3.10: Cost Function J(θ0, θ1)

The left plot shows a hypothesis function that currently underfits the data, given the chosen values of θ0 and θ1. The right plot depicts the cost function, with the aim of finding the parameter values that minimize J(θ0, θ1) for the best fit. We examine a hypothesis of the form hθ(x) = θ0 + θ1 x, with θ0 = 50 and θ1 = 0.06.

Underfitting: the line hθ(x) = 50 + 0.06x does not pass through the majority of data points. This suggests that the model may not be capturing all the complexity in the data, as the predicted line does not fully align with the observed distribution.

Deviation from data points: some points are far from the line, indicating that the residuals (differences between predicted and actual prices) are substantial in these cases. This increases the cost function, as larger residuals contribute more to the overall error.

Interpretation of the line:
- Hypothesis representation: the line hθ(x) = 50 + 0.06x represents our linear model for predicting housing prices based on size (square feet).
- Intercept (θ0 = 50): the price prediction (in thousands of dollars) when the size of the house is zero. According to this model, a house of zero size would be valued at $50,000; the intercept sets the baseline price in this context.
- Slope (θ1 = 0.06): for each additional square foot, the model predicts an increase of $60 in price (since 0.06 thousand dollars is 60 dollars). The slope controls how steeply the price increases with size.

The red crosses represent the actual data points from the training set, showing the relationship between house size (in square feet) and price (in thousands of dollars). There is a general upward trend in the data, suggesting a positive correlation between house size and price; however, the points are scattered around the line, indicating some variance in the relationship.

Cost function J(θ0, θ1): the right plot is a graphical representation of the cost function J(θ0, θ1), which depends on both parameters θ0 and θ1.
- Convex shape: J(θ0, θ1) has a convex shape, meaning it has a single minimum. This is typical in linear regression with the mean squared error cost, which ensures a unique global minimum.
- Optimal values of θ0 and θ1: the point where J(θ0, θ1) reaches its minimum gives the best-fit parameters, those that minimize the prediction error for this dataset.
- Gradient descent: using gradient descent, we iteratively adjust θ0 and θ1 in the direction that reduces J(θ0, θ1) until we reach the minimum.
- Impact on model fit: minimizing the cost function improves the alignment of the hypothesis line with the data points. When J(θ0, θ1) is minimized, we achieve the most accurate linear model given the data.

3.3.10 Behavior of the Cost Function

- For θ1 = 0: the cost is high, approximately 2.3, as computed above; the cross on the right plot at θ1 = 0 marks this high cost.
- For θ1 = 0.5: as θ1 increases toward the optimum, the cost decreases (J(0.5) ≈ 0.58, as computed earlier).
- For θ1 = 1: the cost reaches its minimum value, J(1) = 0; with θ1 = 1, the hypothesis perfectly fits the data, giving zero error.
- Minimizing the cost: the goal in linear regression is to find the value of θ1 that minimizes the cost function; here, the optimal value is θ1 = 1.
- Gradient descent: a common optimization method for minimizing the cost function. Since the cost function is convex, gradient descent will converge to the global minimum (in this case, θ1 = 1).

3.3.11 Contour Plot

A contour plot of the cost function is a graphical representation, used in machine learning and especially in linear regression, to visualize the behavior of the cost function J(θ0, θ1) across different values of the parameters θ0 and θ1. Let's break down the concept and its significance.

In linear regression, the hypothesis function is hθ(x) = θ0 + θ1 x. The cost function J(θ0, θ1) measures how well the hypothesis matches the data, calculating the average squared difference between the predicted and actual values:

J(θ0, θ1) = (1 / 2m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²

where m is the number of training examples, hθ(x^(i)) is the predicted output for the i-th training example, and y^(i) is the actual output for the i-th training example.
The goal of linear regression is to find the values of θ0 and θ1 that minimize this cost function, since that is where the hypothesis line best fits the training data.

Purpose of the contour plot:
- It provides a two-dimensional visualization of how the cost function J(θ0, θ1) changes with respect to the parameters θ0 and θ1, letting us see the "error landscape" of the cost function: instead of relying on numerical values alone, we can see the regions where the cost is higher or lower and how it varies as we adjust θ0 and θ1.
- It helps us understand the optimization process, specifically how iterative methods like gradient descent navigate through the parameter space to reach the point of minimum cost.

How to read a contour plot of the cost function:
- Each contour line represents a set of points where the cost function J(θ0, θ1) has the same value; think of these lines as elevation lines on a map.
- Moving from the outermost contour to the innermost, the cost function J(θ0, θ1) decreases; the innermost contour surrounds the lowest point, the minimum of J(θ0, θ1).
- The point at the center of the innermost contour, typically marked with a symbol like a red dot or cross, is the global minimum: this is where J(θ0, θ1) is minimized and where θ0 and θ1 are optimal.

Contour plot in linear regression:
- Each axis represents one of the parameters: the horizontal axis is θ0 (the y-intercept of the line) and the vertical axis is θ1 (the slope).
- For each pair (θ0, θ1), the height (cost) J(θ0, θ1) is calculated and plotted as a contour.
- If the cost function is convex (bowl-shaped), the contour plot has an elliptical shape, with the global minimum at the center. This convex shape ensures that gradient descent can find the global minimum by iteratively moving "downhill" in the cost function landscape.

Gradient descent and the contour plot: gradient descent is an optimization algorithm used to minimize the cost function; in each step it adjusts θ0 and θ1 to reduce J(θ0, θ1). In the context of a contour plot:
- Starting from an initial point, gradient descent follows the negative gradient (the direction that reduces the cost the most).
- With each iteration, the point moves closer to the center of the contours, representing lower cost values.
- The path taken by gradient descent is often shown on the contour plot as a sequence of points connected by lines, moving from higher to lower contours until reaching the global minimum.

Advantages of using contour plots in linear regression:
- Visual intuition: contour plots show the relationship between the parameters θ0 and θ1 and the cost function without needing a 3D plot.
- Optimization tracking: they allow us to visualize how the parameters are updated during optimization algorithms like gradient descent.
- Convexity check: the shape of the contours can tell us whether the cost function is convex. If it is (elliptical contours with a single minimum), gradient descent will reliably find the global minimum; non-convex shapes (with multiple minima) imply a more complex optimization problem.

Interpreting the contour plot for parameter adjustments: during training, if the current values of θ0 and θ1 lie on the outer contours (higher cost), these values do not fit the data well.
As we move closer to the center of the contours, the parameters fit the data better, which translates into a lower cost. The spacing of the contours reflects how sensitive the cost function is to changes in each parameter: narrowly spaced contours mean a steeper gradient and faster changes in cost, while widely spaced contours mean a gentler slope.

In summary, a contour plot of the cost function provides a powerful visualization of how J(θ0, θ1) changes with respect to different parameter values in linear regression. By analyzing the contour plot, we can:
- identify the region where the cost is minimized;
- visualize the gradient descent path as it converges toward the global minimum;
- gain insight into the sensitivity and behavior of the cost function in response to parameter adjustments.

In essence, the contour plot serves as a map of the error landscape for linear regression, guiding the optimization process toward the best-fit parameters.

Example

Let's walk through a practical example of a linear regression problem, calculate the cost function, and visualize the contour plot.

Problem setup: suppose we have a simple dataset with just a few points:

Size (feet²) x | Price ($) in 1000's y
       1       |  2
       2       |  2.5
       3       |  3

We want to find a linear function hθ(x) = θ0 + θ1 x that best fits this data by minimizing the cost function J(θ0, θ1).

Cost function: as before,

J(θ0, θ1) = (1 / 2m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²

where m is the number of data points (in this case, m = 3).

Step 1, hypothesis with initial values: we start with initial guesses for θ0 and θ1 and compute the cost for different values. Plotting the contour plot of the cost function together with the gradient descent path shows the optimization steps; see the code sketch after this example.

1. Contour plot of the cost function:
- The left side shows a contour plot of the cost function J(θ0, θ1).
- Each contour line represents a level set where J(θ0, θ1) has a constant value.
- The center of the contours (ellipses) represents the minimum of the cost function; this is where the best values for θ0 and θ1 lie, giving the lowest error.
- Moving away from the center, the cost increases, meaning those values of θ0 and θ1 are less optimal for fitting the data.

2. Initial hypothesis plot:
- On the right is a plot showing the training data points in red and an initial hypothesis line in blue.
- The hypothesis hθ(x) = 0 + 0 · x is currently a flat line at y = 0, which clearly does not fit the data.
- The goal of linear regression is to adjust θ0 and θ1 to shift and tilt this line so that it best fits the red data points.

To minimize the cost function, we can use gradient descent to iteratively adjust θ0 and θ1 and move toward the center of the contours, improving the fit of the hypothesis line to the data points.
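A sketch of one way to produce such a contour plot for the three-point dataset above, using NumPy and Matplotlib. The grid ranges and contour count are chosen arbitrarily for illustration; this is not the textbook's own plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.5, 3.0])
m = len(x)

def cost(t0, t1):
    return np.sum((t0 + t1 * x - y) ** 2) / (2 * m)

# Evaluate J on a grid of (theta0, theta1) values.
t0_vals = np.linspace(-1, 4, 200)
t1_vals = np.linspace(-1, 2, 200)
T0, T1 = np.meshgrid(t0_vals, t1_vals)
J = np.vectorize(cost)(T0, T1)

plt.contour(T0, T1, J, levels=30)
plt.xlabel("theta0")
plt.ylabel("theta1")
plt.title("Contour plot of J(theta0, theta1)")
plt.show()
```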
3.3.12 How the Learning Works

The model learns by finding the best parameters (slope β1 and intercept β0) for the line such that the sum of the squared differences between the actual prices and predicted prices is minimized. This method is called Ordinary Least Squares (OLS).

Model prediction: for any new house size, say 2000 square feet, you can read off the green line to get the predicted price of that house based on the learned model. For the earlier example, the point where the vertical red line intersects the green line gives the model's prediction for a house size of 1700 square feet.

The example effectively demonstrates how supervised learning and linear regression can be used to predict real-valued outputs. By learning from a dataset of house sizes and prices, the model can predict future house prices based on size. The goal of linear regression is to find a linear relationship that minimizes the difference between the predicted values and the actual values.

3.4 Gradient Descent

The goal of gradient descent is to iteratively adjust the parameters θ0 and θ1 to minimize the cost function J(θ0, θ1), which measures how well our linear model fits the training data. Gradient descent follows the direction of the negative gradient (downhill) to reach the minimum of the cost function.

1. Update rule for gradient descent. The main formula for updating θj is:

θj := θj − α (∂/∂θj) J(θ0, θ1)

where:
- θj is the parameter being updated (e.g., θ0 or θ1);
- α is the learning rate, which controls the step size: a larger α means faster updates but risks overshooting the minimum, while a smaller α makes convergence slower but more stable;
- (∂/∂θj) J(θ0, θ1) is the partial derivative of the cost function with respect to θj, which indicates the direction and magnitude of the change for θj.

2. Simultaneous update. It is essential to update both θ0 and θ1 simultaneously, so that the adjustment is based on the same cost function evaluation. This is achieved by calculating the updates into temporary variables (temp0 and temp1) before applying them. The simultaneous update avoids using a partially updated value of θ0 or θ1 when calculating the new value of the other, which would lead to incorrect updates.

3. Correct vs. incorrect update.
- Correct method: both temp0 and temp1 are calculated using the current values of θ0 and θ1 before either is applied.
- Incorrect method: updating θ0 before updating θ1 causes θ1's update to use an already modified θ0, leading to a mismatch in the calculated cost function gradient.

Figure 3.11: Variables

4. Loop until convergence. Gradient descent repeats this update rule until convergence, that is, when changes in θ0 and θ1 are minimal or the cost function J(θ0, θ1) stops decreasing by a significant amount. Convergence depends on the learning rate α and the nature of the cost function.

In summary, the key components of linear regression are:
- It is a supervised learning algorithm used to predict continuous, real-valued outputs (house prices).
- We begin with a training set of examples where both the inputs and outputs are known.
- The learning algorithm uses this training data to estimate the relationship (the hypothesis function) between the input features and the target values.
- With multiple features, linear regression can learn from several variables (e.g., size, number of bedrooms) to predict the target value (house price).
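A minimal sketch of this update rule with the simultaneous temp0/temp1 pattern, for the two-parameter case. The dataset, learning rate, and iteration count are illustrative choices, not values from the text:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.5, 3.0])
m = len(x)

theta0, theta1 = 0.0, 0.0   # initial parameters
alpha = 0.1                 # learning rate (illustrative)

for _ in range(1000):
    h = theta0 + theta1 * x
    # Partial derivatives of J(theta0, theta1).
    grad0 = np.sum(h - y) / m
    grad1 = np.sum((h - y) * x) / m
    # Simultaneous update: compute both temps before assigning either.
    temp0 = theta0 - alpha * grad0
    temp1 = theta1 - alpha * grad1
    theta0, theta1 = temp0, temp1

print(theta0, theta1)  # should approach the least-squares fit (1.5, 0.5)
```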
3.4.1 Effect of the Learning Rate

Effect of a small learning rate (α): if α is too small, gradient descent takes very small steps toward the minimum. This results in slow convergence, meaning many iterations are needed to reach the optimal parameter values. In the upper part of the plot, with a small α, the path taken by gradient descent approaches the minimum only slowly.

Effect of a large learning rate (α): if α is too large, gradient descent may overshoot the minimum and fail to converge. When α is excessively large, it can even lead to divergence, where the algorithm oscillates or moves further away from the minimum with each step. In the lower plot, with a large α, the updates overshoot, bouncing back and forth across the minimum and failing to converge to a stable point.

Figure 3.12: Effect of Large Learning Rate

Left plot, small learning rate:
- Behavior: the gradient descent path shows a slow, gradual progression toward the minimum.
- Learning rate: here the learning rate is small (α = 0.1), resulting in tiny updates to θ with each iteration.
- Convergence: while gradient descent converges toward the minimum, it requires more iterations to reach the optimal value. This demonstrates the impact of a small learning rate, which makes the algorithm more stable but much slower in reaching the minimum.

Right plot, large learning rate:
- Behavior: the gradient descent path oscillates and diverges from the minimum.
- Learning rate: here the learning rate is large (α = 1.1), which causes each update to θ to overshoot the minimum; as a result, the algorithm fails to settle and diverges.
- Convergence: instead of converging, gradient descent overshoots back and forth across the minimum due to the large step sizes, leading to divergence.

In summary: a small learning rate leads to slow convergence, requiring more iterations to reach the minimum; a large learning rate can cause the algorithm to diverge, failing to find the minimum.

3.4.2 Global Minimum and Local Optima

Figure 3.13: Global Minimum and Local Optima

The curve represents the cost function J(θ1) as a function of the parameter θ1. In machine learning, the objective of gradient descent is to minimize this cost function, adjusting θ1 to reach the lowest point on the curve.

Global minimum: the lowest point on the curve. At the global minimum, J(θ1) has the smallest possible value; ideally, this is where gradient descent should end up.

Local optima (local minima): the curve has multiple dips, which are local minima: points where the gradient (slope) of J(θ1) with respect to θ1 is zero and any small movement to either side would increase J(θ1). However, these are not the lowest points on the curve, so they are "local" rather than "global" minima. If the starting point for θ1 is near a local minimum, gradient descent may converge to that local minimum instead of the global minimum: the algorithm stops updating θ1 because the gradient is zero there. This can prevent the model from reaching the best possible solution (the global minimum), potentially resulting in suboptimal performance. The initial value of θ1 therefore significantly influences the final outcome in cases with multiple minima; starting closer to the global minimum increases the likelihood of reaching it.

Strategies to overcome local minima:
- Multiple initializations: run gradient descent from different starting points and select the solution with the lowest cost.
- Momentum-based optimization: modify gradient descent to include "momentum" that helps push through small local minima.
- Advanced optimization techniques: use methods like stochastic gradient descent, or optimization algorithms like Adam, which adapt the learning rate dynamically.
Fixed learning rate: gradient descent can converge to a local minimum even if the learning rate α is fixed and not decreased over time. Unlike some optimization methods that require a gradually decreasing learning rate, gradient descent can still work with a constant α, especially when dealing with smooth, convex functions.

Graph of J(θ1): the graph shows J(θ1) as a function of θ1, with the gradient at each point illustrated by the slope of the curve.

Descent path: starting from an initial point, the gradient descent path is shown with arrows pointing down the slope of the curve, illustrating how θ1 is updated iteratively. As the points approach the minimum, the gradient (represented by the slope) decreases, and thus the steps in each iteration become smaller, allowing θ1 to settle at the minimum. This automatic reduction in step size near the minimum is an inherent property of the gradient descent algorithm, so there is no need to adjust α dynamically. This simplifies the algorithm: it only requires setting an appropriate learning rate α at the beginning rather than adjusting it over time.

Automatic adjustment of step size: as gradient descent approaches a local minimum, the derivative (d/dθ1) J(θ1) becomes smaller, because the slope of the cost function near a minimum is flatter. Since the step size in each iteration is proportional to the gradient (d/dθ1) J(θ1), the steps naturally get smaller as the algorithm nears a minimum. This behavior results in smaller updates to θ1 near the minimum, allowing gradient descent to converge without overshooting.

Local minimum vs. global minimum: the diagram shows a scenario where J(θ1) has a local minimum. In this case, if gradient descent starts close to this local minimum, it may converge to it rather than the global minimum. Convergence to a local minimum is a common challenge, especially when the cost function J(θ1) is not convex and has multiple minima.

3.4.3 Batch Gradient Descent

Batch gradient descent is an optimization algorithm where, in each iteration, the gradient of the cost function with respect to the model's parameters is computed using the entire training dataset. This approach involves calculating the gradient as a single, aggregated update by summing over all training examples, which yields a comprehensive direction for adjusting the parameters. Each iteration thus relies on processing the entire dataset in a "batch" to generate a global update, ensuring a consistent and stable descent along the cost surface toward an optimal parameter set.

For each parameter θj, the update rule in batch gradient descent is:

θj := θj − α (1/m) Σ_{i=1}^{m} (∂/∂θj) [ (1/2) (hθ(x^(i)) − y^(i))² ]

Expanding the partial derivative, we get:

θj := θj − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) x_j^(i)

where α is the learning rate, controlling the step size of each update, and the summation term computes the average gradient over the entire dataset. This ensures that each parameter θj moves in the direction that reduces the overall cost across all examples, rather than making adjustments based on individual samples.
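A minimal, vectorized sketch of batch gradient descent for linear regression. The dataset, learning rate, and iteration count are invented for illustration; x0 = 1 is prepended as in Section 3.3.4:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, iters=2000):
    """X: (m, n) feature matrix without the bias column; y: (m,) targets."""
    m, n = X.shape
    Xb = np.column_stack([np.ones(m), X])   # prepend x_0 = 1
    theta = np.zeros(n + 1)
    for _ in range(iters):
        errors = Xb @ theta - y             # h_theta(x^(i)) - y^(i), all i at once
        gradient = Xb.T @ errors / m        # average gradient over the whole batch
        theta -= alpha * gradient           # simultaneous update of all theta_j
    return theta

# Toy data (illustrative): roughly y = 1.5 + 0.5 x with noise.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 2.5, 3.1, 3.4])
print(batch_gradient_descent(X, y))
```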
Advantages of batch gradient descent:
- Stable convergence: since each update step considers the entire dataset, batch gradient descent tends to converge smoothly toward the minimum of the cost function.
- Global perspective: by leveraging all the data, it avoids the randomness and noise that arise in methods like stochastic gradient descent, which only use one sample at a time.
- Deterministic updates: given the same dataset and starting conditions, batch gradient descent will always follow the same path to the minimum, which is predictable and reliable.

Disadvantages of batch gradient descent:
- Computationally expensive: calculating the gradient using the entire dataset in each iteration can be very slow and computationally intensive, especially with large datasets.
- Memory intensive: batch gradient descent requires loading all training examples into memory to compute the gradient, which can be challenging for very large datasets that may not fit in memory.

3.4.4 Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD), also known as online gradient descent, is an optimization method that approximates the gradient of the cost function by using a single training example at each step, rather than the entire dataset. This approach is particularly useful for large datasets, where calculating the gradient on the full dataset would be computationally prohibitive.

Motivation for stochastic gradient descent: in traditional batch gradient descent, each update of the model parameters requires calculating the gradient over the entire training dataset, which is computationally expensive and memory-intensive for large datasets. Stochastic gradient descent addresses this issue by updating the parameters based on a single training example at a time, which significantly reduces computational costs and enables faster updates.

Stochastic gradient descent mechanics: SGD randomly selects a single data point from the training dataset at each iteration to compute the gradient, and updates the model parameters based on this individual example. This approximates the true gradient of the cost function. Although each step is less accurate than a batch gradient descent step, it allows for much faster updates and is often sufficient for finding a good approximation of the minimum.

Steps involved in SGD (pseudocode):
- Initialize parameters: the algorithm begins with an initial set of parameters θ^(0) (e.g., weights in a neural network) and a learning rate α.
- Repeat until convergence:
  - Randomly shuffle the training set: to reduce bias and dependency on the order of examples, the training set is shuffled randomly at the beginning of each epoch.
  - Iterate over each training example: for each example x^(k), compute the gradient of the cost function with respect to the current parameters θ^(i), then update the parameters using the gradient computed on this single example:

    θ^(i+1) = θ^(i) − α ∇J^(k)(θ^(i))

    where θ^(i) is the current parameter vector, α is the learning rate, and ∇J^(k)(θ^(i)) is the gradient of the cost function evaluated at the k-th example.

Explanation of the update step: for each training example, SGD computes the gradient and immediately updates the parameters. This differs from batch gradient descent, which accumulates the gradients over all examples and updates the parameters only once per iteration. In SGD, updates happen frequently, making the path toward the minimum more erratic but faster in terms of computation.
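A minimal sketch of SGD for linear regression, assuming the squared-error cost on each single example. The shuffling seed, learning rate, and epoch count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(X, y, alpha=0.01, epochs=200):
    """X: (m, n) features without the bias column; y: (m,) targets."""
    m, n = X.shape
    Xb = np.column_stack([np.ones(m), X])   # prepend x_0 = 1
    theta = np.zeros(n + 1)
    for _ in range(epochs):
        for k in rng.permutation(m):        # shuffle the order each epoch
            error = Xb[k] @ theta - y[k]    # error on a single example
            theta -= alpha * error * Xb[k]  # immediate parameter update
    return theta

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 2.5, 3.1, 3.4])
print(sgd(X, y))
```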
Advantages of stochastic gradient descent:
- Faster updates: since each update only requires a single training example, SGD can make progress toward the minimum quickly, which is helpful for large datasets.
- Online learning: since SGD processes one example at a time, it can be used in an "online" fashion, updating the model continuously as new data becomes available. This is ideal for applications where data arrives in real time, such as streaming services or recommendation systems.
- Avoiding local minima: due to the noisy nature of SGD, it can sometimes escape local minima, which makes it beneficial for non-convex cost functions (e.g., neural networks).

Disadvantages of stochastic gradient descent:
- High variance in updates: each update is based on a single example, so the path toward the minimum is noisy and can oscillate around the minimum, which may slow down convergence.
- Less stable convergence: due to the noise, the parameter updates can overshoot the minimum and might not converge as smoothly as batch gradient descent. This is often mitigated by decreasing the learning rate over time or using techniques like momentum.
- Sensitive to the learning rate: if the learning rate α is too large, SGD can diverge due to the large variance in updates; a smaller learning rate can help but may slow down training.

Comparison with batch gradient descent: batch gradient descent computes the gradient over the entire dataset, resulting in more stable but slower updates; stochastic gradient descent updates parameters on each training example, leading to faster but noisier convergence.

In short, stochastic gradient descent is an efficient optimization algorithm that approximates the gradient of the cost function by computing it on a single example at each step. This method is particularly useful for large datasets, as it allows for quick updates, but it introduces noise into the convergence path. By updating frequently, SGD enables faster progress toward the minimum, although it may not converge as smoothly as batch gradient descent. Adjustments, such as reducing the learning rate over time, can help achieve better convergence stability.

3.4.5 Mini-Batch Gradient Descent

Mini-batch gradient descent is a variation of gradient descent that sits between batch gradient descent and stochastic gradient descent (SGD). It divides the training dataset into smaller groups called mini-batches, using each mini-batch to perform a parameter update. This approach leverages the benefits of both batch and stochastic gradient descent.

In mini-batch gradient descent, the dataset is split into several small batches, often with a size ranging from 32 to 512 examples, though this can vary depending on the specific task and computational resources. At each step, a mini-batch is used to compute the gradient and update the model parameters, rather than the entire dataset (as in batch gradient descent) or a single example (as in SGD). This allows for faster updates than batch gradient descent, as well as smoother and more stable convergence compared to pure stochastic gradient descent.

- Small mini-batches (e.g., size 1, similar to SGD) introduce noise in the convergence path, which might help in escaping local minima but may also slow down convergence due to the high variance in updates.
- Large mini-batches (e.g., close to the full dataset, similar to batch gradient descent) yield more stable updates but are computationally expensive and miss out on the benefits of faster updates.
- A mini-batch size that balances the benefits of SGD and batch gradient descent is often chosen, usually between 32 and 256 examples, depending on the problem and computational resources.
Process of Mini-Batch Gradient Descent

1. For each mini-batch, perform the following steps:

(a) Forward Pass:
– Compute the model's predictions ŷ for each example in the mini-batch.
– For example, in a linear regression model, calculate ŷ = θᵀx, where θ represents the parameters and x represents the features.

(b) Compute the Cost Function:
– Calculate the cost (loss) for the mini-batch. For example, in linear regression, use the Mean Squared Error (MSE) cost function:

J(\theta) = \frac{1}{2b} \sum_{i=1}^{b} \left( \hat{y}^{(i)} - y^{(i)} \right)^2

where b is the number of examples in the mini-batch, ŷ^{(i)} is the prediction for the i-th example, and y^{(i)} is the true label.

(c) Backward Pass (Compute Gradients):
– Calculate the gradient of the cost function with respect to each parameter θj. The gradient indicates the direction in which each parameter needs to be adjusted to minimize the cost.
– For each parameter θj:

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{b} \sum_{i=1}^{b} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}

where x_j^{(i)} is the value of feature j in example i of the mini-batch.

(d) Parameter Update:
– Update each parameter θj using the gradient and the learning rate α:

\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}

– The simultaneous update of all parameters ensures that the gradient descent step is based on the average gradient for the mini-batch, leading to more stable convergence than stochastic gradient descent.

2. Repeat for All Mini-Batches in an Epoch: Repeat steps 1(a) to 1(d) for each mini-batch in the training set until all examples have been used to update the parameters once (completing one epoch).

3. Repeat for Multiple Epochs: Repeat steps 1 and 2 for a set number of epochs or until the cost function converges (i.e., until there is little to no change in the cost function between epochs).

4. Stopping Criterion: Stop the training when the cost function converges or after a predetermined number of epochs. Convergence is typically defined as a small change in the cost function, indicating that further training may not significantly improve the model.

Advantages of Mini-Batch Gradient Descent

Smoother Convergence:
– Since the gradient is calculated over multiple examples in each mini-batch, the updates are more stable compared to SGD, which often has noisy updates due to using only one example.
– This smoother convergence helps the model avoid oscillations that can occur in pure stochastic gradient descent.

Faster Convergence Compared to Batch Gradient Descent:
– By using a subset of the data rather than the entire dataset for each update, mini-batch gradient descent allows for faster parameter updates than batch gradient descent.

Optimization for Hardware:
– Mini-batch gradient descent is computationally efficient and compatible with vectorized operations and GPU acceleration. Processing mini-batches in parallel allows the model to leverage modern deep learning libraries that support vectorization, making the training process faster.

Comparison to Other Methods

Batch Gradient Descent: Processes the entire dataset in each step, resulting in stable but slow convergence, especially for large datasets.

Stochastic Gradient Descent: Processes a single example per update, resulting in fast but noisy convergence, which can sometimes hinder convergence stability.

Mini-Batch Gradient Descent: Processes a small subset (mini-batch) per update, achieving a balance between the slow, stable convergence of batch gradient descent and the fast, noisy updates of SGD.
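Putting steps 1 to 4 together, here is a hedged NumPy sketch of the full training loop for linear regression. It reuses the make_minibatches helper from the previous sketch, and the default hyperparameter values are illustrative, not prescribed by the text.

import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, batch_size=64,
                               epochs=100, tol=1e-6):
    # X is (m, n+1) with a leading column of 1s; y is (m,).
    theta = np.zeros(X.shape[1])
    prev_cost = np.inf
    for _ in range(epochs):
        for X_b, y_b in make_minibatches(X, y, batch_size):
            residuals = X_b @ theta - y_b        # forward pass: y_hat - y
            grad = X_b.T @ residuals / len(y_b)  # backward pass: average gradient
            theta -= alpha * grad                # simultaneous parameter update
        cost = np.mean((X @ theta - y) ** 2) / 2 # J(theta) on the full data, once per epoch
        if abs(prev_cost - cost) < tol:          # stopping criterion: cost has converged
            break
        prev_cost = cost
    return theta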
Mini-batch gradient descent is a preferred optimization method for training large models on large datasets. By using small batches, it enables faster updates than batch gradient descent and smoother convergence than stochastic gradient descent. It also takes advantage of vectorized operations and GPU acceleration, making it computationally efficient. Mini-batch gradient descent achieves a balance between speed and convergence stability, making it suitable for deep learning applications.

3.4.6 Gradient Descent for Multiple Variables

In the context of linear regression, the multiple variable case, also known as multiple linear regression, is a model that predicts an output variable y based on multiple input features (or independent variables) x1, x2, ..., xn. Each input variable contributes linearly to the predicted output. This model is an extension of simple linear regression, which involves only one input variable.

The hypothesis for multiple linear regression can be expressed as:

h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n

where:
h_θ(x) is the predicted value for the output variable y,
θ0 is the intercept term (also known as the bias),
θ1, θ2, ..., θn are the parameters (weights) associated with each input feature,
x = [x1, x2, ..., xn] represents the vector of input features.

In vectorized form, the hypothesis function can be simplified as:

h_\theta(x) = \theta^T x

where:
θ = [θ0, θ1, ..., θn]ᵀ is the parameter vector,
x = [1, x1, x2, ..., xn]ᵀ is the input feature vector with a 1 prepended to account for the intercept term.

Cost Function for Multiple Linear Regression

To train a multiple linear regression model, we use a cost function to measure how well the model fits the training data. The cost function for multiple linear regression, often called the Mean Squared Error (MSE), is defined as:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

where:
m is the number of training examples,
h_θ(x^{(i)}) is the predicted output for the i-th training example,
y^{(i)} is the actual output for the i-th training example.

The goal of training is to find the parameters θ that minimize this cost function, resulting in the best fit of the linear model to the data.

Gradient Descent for Multiple Linear Regression

To minimize the cost function J(θ), we commonly use gradient descent. In gradient descent, we update the parameters iteratively according to the rule:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

for j = 0, 1, ..., n, where α is the learning rate, which controls the step size of each update.

The partial derivative of J(θ) with respect to θj is given by:

\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

This update rule allows us to iteratively move the parameters θ in the direction that minimizes the cost function. The general update rule for each parameter θj is therefore:

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

This formula calculates the gradient (slope) of the cost function with respect to each parameter θj, averaged over all training examples. The algorithm then subtracts a fraction of this gradient from the current value of θj to update it.
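The cost function and the full vector of partial derivatives can be computed in a few vectorized lines. This is a minimal sketch, assuming X already carries the leading column of 1s so that θ0 is handled like any other parameter:

import numpy as np

def cost(theta, X, y):
    # J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2
    residuals = X @ theta - y
    return residuals @ residuals / (2 * X.shape[0])

def gradient(theta, X, y):
    # All partial derivatives dJ/dtheta_j at once: (1/m) * X^T (X theta - y)
    return X.T @ (X @ theta - y) / X.shape[0]

def batch_gradient_descent(X, y, alpha=0.01, iterations=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        theta = theta - alpha * gradient(theta, X, y)  # updates every theta_j together
    return theta

Because the whole θ vector is overwritten in one assignment, the simultaneous-update requirement discussed next is satisfied automatically.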
Simultaneous Update Explanation

For multiple linear regression, we have multiple parameters θ0, θ1, ..., θn, each corresponding to an input feature (plus the bias term for θ0). The gradient descent algorithm updates each of these parameters by moving them in the direction that decreases the cost function. To ensure that all parameters are updated based on the current values (before any of them change), we perform a simultaneous update.

Steps of Simultaneous Update

1. Compute the Gradient for Each Parameter: First, compute \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} for each θj based on the current values of θ0, θ1, ..., θn.

2. Store Temporary Values: Since we are doing a simultaneous update, we calculate the new values for all θj (using temporary variables) without changing any of the original θj values immediately.

3. Update All Parameters Together: After computing all the temporary values, assign them back to the parameters θj simultaneously. This ensures that each update step is based on the same set of parameter values from the previous iteration, maintaining consistency.

The simultaneous update rule can be written for each θj as follows:

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}

\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_1^{(i)}

\theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_2^{(i)}

...

\theta_n := \theta_n - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_n^{(i)}

Each parameter θj is updated using the gradient with respect to itself, ensuring that all parameters are adjusted based on the cost function's gradient at the current values.

Advantages of Simultaneous Update

Consistency: Simultaneous updates ensure that all parameters are updated consistently based on the same cost function gradient. If we were to update the parameters one by one without temporary storage, the later parameters would be updated based on already-changed values, leading to potential inconsistencies.

Convergence: The simultaneous update method improves convergence behavior, especially for larger datasets. It helps maintain a stable learning trajectory, since all updates are based on a single, unified gradient calculation at each iteration.

In summary, the simultaneous update method ensures that all parameter adjustments in gradient descent are based on the same state of the model, leading to more predictable and stable training behavior.
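In vectorized code the simultaneous update happens for free, but a plain-Python sketch makes the temporary storage of step 2 explicit. Everything here (names, loop structure) is illustrative:

def simultaneous_update(theta, X, y, alpha):
    # X is a list of m feature rows (each of length n, with x_0 = 1); y has m labels.
    m, n = len(X), len(theta)
    temp = [0.0] * n
    for j in range(n):
        # Gradient for theta_j, using the CURRENT theta for every prediction.
        grad_j = sum(
            (sum(theta[k] * X[i][k] for k in range(n)) - y[i]) * X[i][j]
            for i in range(m)
        ) / m
        temp[j] = theta[j] - alpha * grad_j   # stored, not yet applied
    return temp  # all theta_j replaced together, only after every gradient is computed

Updating theta[j] in place inside the loop would instead let later gradients see partially updated parameters, which is exactly the inconsistency described above.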
Example of Multiple Linear Regression

The following problem introduces the idea of multiple features in a regression setting:

Figure 3.14: Variables

1. Size: The size of the house in square feet.
2. Number of Bedrooms: Another feature describing the number of bedrooms.
3. Number of Floors: Another feature indicating how many floors the house has.
4. Age of Home: The age of the house in years.
5. Price: The target variable, which is the price of the house.

The formula now generalizes to handle multiple features:

h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n

where x1, x2, ..., xn are the input features (size, number of bedrooms, etc.) and θ1, θ2, ..., θn are the coefficients or weights learned by the model. Here, the model is trying to learn the values of θ that best fit the training examples across all the features.

Feature Scaling in Gradient Descent for Linear Regression

When working with multiple features that are on vastly different scales or units, it becomes challenging for gradient descent to converge quickly and efficiently. The cost function J(θ) may exhibit elongated contours, causing the gradient descent steps to "zigzag" inefficiently toward the minimum. By scaling the features, we standardize their ranges, which leads to a more circular contour plot of the cost function and faster convergence.

Suppose feature x1 represents the size of a house in square feet, with values ranging from 0 to 2000, and feature x2 represents the number of bedrooms, which has a much smaller range (1 to 5). Since x1 and x2 are on very different scales, the gradient descent algorithm will struggle to adjust θ1 and θ2 efficiently without scaling. To bring these features onto a similar scale, we can transform them by dividing each feature by its typical range:

1. Scaling x1 (size in feet²):

x_1 = \frac{\text{size (feet}^2)}{2000}

This normalization divides the size by 2000, bringing the values of x1 into the range [0, 1].

2. Scaling x2 (number of bedrooms):

x_2 = \frac{\text{number of bedrooms}}{5}

Here, dividing by 5 scales the number of bedrooms to the range [0.2, 1], making it comparable to x1.

After these transformations, the features x1 and x2 are on a similar scale, which helps gradient descent update the parameters more uniformly.

Figure 3.15: Feature Scaling

Without Feature Scaling (Left Plot): The left contour plot of J(θ) represents the cost function for unscaled features. The contours are elongated along the θ1 and θ2 axes due to the large scale difference between x1 and x2. As a result, the gradient descent path "zigzags" down the cost function, taking inefficient steps. This zigzag pattern is a symptom of the imbalance in feature scales, as the algorithm struggles to make efficient progress toward the minimum.

With Feature Scaling (Right Plot): The right contour plot of J(θ) shows the cost function after feature scaling. The contours are now more circular, or isotropic, which is ideal for gradient descent. With scaled features, gradient descent can follow a more direct path toward the minimum, resulting in faster convergence. This is because the updates to θ1 and θ2 are now balanced, allowing gradient descent to progress smoothly.

Feature scaling standardizes the gradients computed for each parameter θj, reducing the dominance of any single parameter that might be associated with a larger-scale feature. When all features are on a similar scale, the gradient descent step size (controlled by the learning rate α) applies uniformly across all parameters, avoiding excessively small or large updates for specific parameters.

Common Techniques for Feature Scaling

1. Min-Max Scaling (Normalization): Scale features to a specific range, often [0, 1] or [−1, 1]. In the example above, each feature was scaled based on a typical range, effectively performing min-max scaling. A code sketch of min-max scaling follows the criteria below.

Criteria for an Acceptable Range:

(a) Centered Around Zero (Symmetry):
– Ideally, feature values should be centered around zero, such as in the range [−1, 1] or slightly larger but still close to zero (e.g., [−3, 3] or [−0.5, 0.5]).
– Centering around zero ensures that gradients are balanced, which helps gradient descent algorithms converge more smoothly, without significant oscillations.

(b) Small Magnitude:
– Each feature should ideally span only a small interval, typically [−1, 1], [−2, 2], or a similar range. Large magnitudes (e.g., [−100, 100]) can cause the optimization algorithm to take very small steps for some parameters and large steps for others, leading to inefficient or unstable convergence.
– For features with very small magnitudes (e.g., [−0.0001, 0.0001]), the gradients might become too small, leading to slow convergence.

(c) Consistent Scale Across Features:
– All features should have similar ranges to prevent one feature from dominating the cost function due to its larger scale.
– If some features are on a much larger scale (e.g., in the thousands) compared to others (e.g., in decimals), it can skew the optimization process, as the gradient will prioritize the feature with the larger scale, which can lead to poor convergence or suboptimal solutions.
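Here is a sketch of min-max scaling to the [−1, 1] target range recommended by these criteria. By default the minimum and maximum are taken from the data itself, while the optional arguments allow the "typical range" style used in the worked example that follows; the function name is our own.

import numpy as np

def minmax_scale(x, lo=None, hi=None):
    # 2 * (x - min) / (max - min) - 1 maps [min, max] onto [-1, 1].
    lo = x.min() if lo is None else lo
    hi = x.max() if hi is None else hi
    return 2 * (x - lo) / (hi - lo) - 1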
Benefits for Gradient Descent: With the scaled data, gradient descent will update the parameters more smoothly:
– Each feature contributes equally to the cost function, so gradient descent won't be biased toward one feature.
– Convergence will likely be faster, as the algorithm can take balanced steps in all parameter directions without getting "stuck" on large or small magnitudes.

2. Standardization (Z-score Normalization):

Formula for Mean Normalization: To normalize a feature xj, we replace it with:

x_j := \frac{x_j - \mu_j}{s_j}

where:
– μj is the mean (average) value of feature xj in the training set.
– sj is either the range of xj (i.e., max − min) or the standard deviation of xj.
– Standard Deviation Alternative: If the data has outliers or non-uniform distributions, it may be better to use the standard deviation sj rather than the range. Using the standard deviation can make the normalization more robust to outliers, centering each feature around its mean but scaling it relative to the spread of the data.

This formula rescales each feature to be centered around zero, which has advantages for convergence.

Advantages of Mean Normalization
– Reducing the scale of features: Large differences in feature magnitudes can distort the learning process. By normalizing, we prevent features with larger scales from disproportionately influencing the model.
– Centering the data: By centering each feature around zero, the optimization algorithm can take more balanced steps in the positive and negative directions, aiding faster and more efficient convergence.

Example of Min-Max Scaling

Consider the following dataset:

Feature        Example 1   Example 2   Example 3   Example 4
Size (sq ft)   1500        2500        800         3800
Bedrooms       3           4           2           5

Let's use min-max scaling to bring these features into a similar range, specifically [−1, 1].

1. For Size (assuming min = 500, max = 4000):

x_{\text{scaled}} = \frac{x - \min}{\max - \min} \times 2 - 1

Here, (x − min)/(max − min) scales x to the [0, 1] range; multiplying by 2 extends the range from [0, 1] to [0, 2], and subtracting 1 shifts it from [0, 2] to [−1, 1].

Applying this to each value of Size:

1500: (1500 − 500)/(4000 − 500) × 2 − 1 ≈ −0.43
2500: (2500 − 500)/(4000 − 500) × 2 − 1 ≈ 0.14
800: (800 − 500)/(4000 − 500) × 2 − 1 ≈ −0.83
3800: (3800 − 500)/(4000 − 500) × 2 − 1 ≈ 0.89

2. For Bedrooms (assuming min = 1, max = 5):

x_{\text{scaled}} = \frac{x - \min}{\max - \min} \times 2 - 1

Applying this to each value of Bedrooms:

3: (3 − 1)/(5 − 1) × 2 − 1 = 0
4: (4 − 1)/(5 − 1) × 2 − 1 = 0.5
2: (2 − 1)/(5 − 1) × 2 − 1 = −0.5
5: (5 − 1)/(5 − 1) × 2 − 1 = 1

The scaled dataset is:

Feature           Example 1   Example 2   Example 3   Example 4
Scaled Size       −0.43       0.14        −0.83       0.89
Scaled Bedrooms   0           0.5         −0.5        1

Now all feature values lie in the range [−1, 1], and both features have similar scales.

In summary: Feature scaling is a preprocessing step in machine learning that ensures gradient descent optimizes all parameters efficiently. By making sure that features x1, x2, etc., are on a similar scale, we achieve:

More circular contours in the cost function plot.
Smoother and faster convergence of gradient descent.
Balanced updates across all parameters.

This leads to a more robust and efficient training process, especially when working with features that vary widely in their ranges or units.
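Mean normalization as defined above also fits in a few lines. This sketch uses the range as the default scale, with the standard deviation as the alternative the text mentions; the usage line reproduces the worked example in the next subsection.

import numpy as np

def mean_normalize(x, use_std=False):
    # (x - mean) / s, where s is the range (max - min) or the standard deviation.
    mu = x.mean()
    s = x.std() if use_std else x.max() - x.min()
    return (x - mu) / s

size = np.array([2100, 1600, 2400, 3000, 1100], dtype=float)
print(mean_normalize(size))  # approx. [ 0.0316 -0.2316  0.1895  0.5053 -0.4947]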
Example of Mean Normalization on House Data

Suppose we have a dataset of houses with these two attributes:

House   Size (sq ft)   Bedrooms
1       2100           3
2       1600           2
3       2400           4
4       3000           5
5       1100           2

We want to apply mean normalization to each feature.

1. Mean of Size (sq ft):

\mu_{\text{size}} = \frac{2100 + 1600 + 2400 + 3000 + 1100}{5} = \frac{10200}{5} = 2040

2. Range of Size (sq ft):

\text{Range}_{\text{size}} = \max - \min = 3000 - 1100 = 1900

3. Mean of Bedrooms:

\mu_{\text{bedrooms}} = \frac{3 + 2 + 4 + 5 + 2}{5} = \frac{16}{5} = 3.2

4. Range of Bedrooms:

\text{Range}_{\text{bedrooms}} = \max - \min = 5 - 2 = 3

The formula for mean normalization is:

x_{\text{normalized}} = \frac{x - \mu}{\text{range}}

Using this formula, we can normalize each feature in the dataset.

1. Normalized Size for each house:

House 1: (2100 − 2040)/1900 = 60/1900 ≈ 0.0316
House 2: (1600 − 2040)/1900 = −440/1900 ≈ −0.2316
House 3: (2400 − 2040)/1900 = 360/1900 ≈ 0.1895
House 4: (3000 − 2040)/1900 = 960/1900 ≈ 0.5053
House 5: (1100 − 2040)/1900 = −940/1900 ≈ −0.4947

2. Normalized Bedrooms for each house:

House 1: (3 − 3.2)/3 = −0.2/3 ≈ −0.0667
House 2: (2 − 3.2)/3 = −1.2/3 = −0.4
House 3: (4 − 3.2)/3 = 0.8/3 ≈ 0.2667
House 4: (5 − 3.2)/3 = 1.8/3 = 0.6
House 5: (2 − 3.2)/3 = −1.2/3 = −0.4

After mean normalization, the dataset looks like this:

House   Normalized Size   Normalized Bedrooms
1       0.0316            −0.0667
2       −0.2316           −0.4
3       0.1895            0.2667
4       0.5053            0.6
5       −0.4947           −0.4

This normalization brings both the Size and Bedrooms features into a range centered around zero. The transformation improves the convergence speed and stability of algorithms like gradient descent by ensuring that each feature contributes similarly to the optimization process, regardless of its original scale.

3.5 Polynomial Regression

Polynomial regression is an extension of linear regression that models the relationship between the independent variable x and the dependent variable y as a degree-d polynomial. In other words, instead of being limited to a linear function, polynomial regression allows for curves that can model more complex relationships. The polynomial regression model can be expressed as:

y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \cdots + \theta_d x^d + \epsilon

where:
d is the degree of the polynomial,
θ0, θ1, ..., θd are the coefficients for each term,
x, x², x³, ..., x^d are the polynomial features of x, and
ϵ is the error term.

3.5.1 Importance of Feature Scaling in Polynomial Regression

Feature scaling is crucial in polynomial regression, especially when the input feature x (in this example, "Size") has values that vary significantly across different powers. Unlike simple linear regression, polynomial regression introduces higher-order terms (such as x² and x³), which can lead to vast differences in the magnitudes of the feature values.
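A quick numerical illustration of this magnitude problem, reusing the house sizes from the earlier example; the three-column design matrix and the printed values are illustrative:

import numpy as np

size = np.array([2100.0, 1600.0, 2400.0, 3000.0, 1100.0])

# Raw polynomial features span wildly different magnitudes:
X_poly = np.column_stack([size, size**2, size**3])
print(X_poly.max(axis=0))    # approx. [3.0e+03, 9.0e+06, 2.7e+10]

# Mean-normalizing each column brings every power back to a comparable scale:
X_scaled = (X_poly - X_poly.mean(axis=0)) / (X_poly.max(axis=0) - X_poly.min(axis=0))
print(np.abs(X_scaled).max(axis=0))  # every column now lies within roughly [-1, 1]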
