Document Details


Uploaded by InvulnerableSugilite5555

Dhanalakshmi Srinivasan University

Tags

regression modelling, machine learning, statistical methods, data analysis

Summary

This document provides an overview of regression modeling, specifically focusing on linear regression. It discusses different types of linear regression, including simple and multiple linear regression, and explores the concepts of bias and variance. The document also touches on techniques such as subset selection and dimensionality reduction for improving the accuracy of a regression model. Additionally, it describes different types of regression lines, such as linear positive slope, linear negative slope, and curvilinear slopes.

Full Transcript


UNIT-IV REGRESSION MODELLING
Introduction to regression modelling, Mathematical model for linear regression, Simple linear regression, Multiple linear regression, Improving accuracy of the linear regression model, Polynomial regression, Logistic regression, Maximum likelihood estimation, Stepwise regression, Ridge regression, Lasso regression, Elastic Net regression modelling.

Regression
❖ Regression in machine learning consists of mathematical methods that allow data scientists to predict a continuous outcome (y) based on the value of one or more predictor variables (x).
❖ Linear regression is probably the most popular form of regression analysis because of its ease of use in predicting and forecasting.

Linear Regression in Machine Learning
Linear regression is one of the easiest and most popular machine learning algorithms. It is a statistical method that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc. The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression. The linear regression model provides a sloped straight line representing the relationship between the variables. Mathematically, we can represent a linear regression as:

y = a0 + a1x + ε ...(1)

Here,
y = Dependent variable (target variable)
x = Independent variable (predictor variable)
a0 = Intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor applied to each input value)
ε = Random error
The values of the x and y variables are the training dataset used for the linear regression model representation.

Types of Linear Regression
Linear regression can be further divided into two types of algorithm:
○ Simple Linear Regression: If a single independent variable is used to predict the value of a numerical dependent variable, the algorithm is called simple linear regression.
○ Multiple Linear Regression: If more than one independent variable is used to predict the value of a numerical dependent variable, the algorithm is called multiple linear regression.

COMMON REGRESSION ALGORITHMS:
The most common regression algorithms are:
Simple linear regression
Multiple linear regression
Polynomial regression
Multivariate adaptive regression splines
Logistic regression
Maximum likelihood estimation (least squares)

SIMPLE LINEAR REGRESSION:
➔ As the name indicates, simple linear regression is the simplest regression model, involving only one predictor.
➔ This model assumes a linear relationship between the dependent variable and the predictor variable.
In the context of Karen’s problem, if we take the Price of a Property as the dependent variable and the Area of the Property (in sq. m.) as the predictor variable, we can build a model using simple linear regression:
Price of Property = a + b × Area of Property
where ‘a’ and ‘b’ are the intercept and slope of the straight line, respectively. Just to recall, straight lines can be defined in slope-intercept form as Y = a + bX, where a = intercept and b = slope of the straight line. The value of the intercept indicates the value of Y when X = 0.
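To make equation (1) concrete, here is a minimal sketch that fits a simple linear regression with scikit-learn and reads off the intercept (a0) and coefficient (a1). The area/price numbers are made-up illustrative values, not data from the text.

```python
# Minimal sketch: fitting y = a0 + a1*x with scikit-learn.
# The area/price values below are made-up, illustrative numbers only.
import numpy as np
from sklearn.linear_model import LinearRegression

area = np.array([[50.0], [75.0], [100.0], [120.0], [150.0]])  # predictor x (sq. m.)
price = np.array([55.0, 80.0, 108.0, 125.0, 158.0])           # target y

model = LinearRegression().fit(area, price)
print("Intercept a0:", model.intercept_)  # predicted y when x = 0
print("Slope a1:", model.coef_[0])        # change in y per unit change in x
```

The fitted intercept and slope play exactly the roles of ‘a’ and ‘b’ in the Price = a + b × Area model discussed above.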
The intercept is known as ‘the intercept’ or ‘Y-intercept’ because it specifies where the straight line crosses the vertical or Y-axis.

Slope of the simple linear regression model
The slope of a straight line represents how much the line changes in the vertical direction (Y-axis) over a change in the horizontal direction (X-axis), as shown in the figure:
Slope = Change in Y / Change in X
Rise is the change along the Y-axis (Y2 − Y1) and Run is the change along the X-axis (X2 − X1). So, the slope is represented as Rise/Run.

Let us find the slope of a graph where the lower point on the line is (−3, −2) and the higher point on the line is (2, 2).
(X1, Y1) = (−3, −2) and (X2, Y2) = (2, 2)
Rise = (Y2 − Y1) = (2 − (−2)) = 2 + 2 = 4
Run = (X2 − X1) = (2 − (−3)) = 2 + 3 = 5
Slope = Rise/Run = 4/5 = 0.8

There can be two types of slopes in a linear regression model: positive slope and negative slope. Different types of regression lines based on the type of slope include linear positive slope, curve linear positive slope, linear negative slope, and curve linear negative slope.

Linear positive slope
A positive slope always moves upward on a graph from left to right.
Slope = Rise/Run = (Y2 − Y1) / (X2 − X1) = Delta(Y) / Delta(X)
Scenario 1 for positive slope: Delta(Y) is positive and Delta(X) is positive.
Scenario 2 for positive slope: Delta(Y) is negative and Delta(X) is negative.

Curve linear positive slope
Curves in these graphs (refer to Fig. 8.4) slope upward from left to right.
Slope = (Y2 − Y1) / (X2 − X1) = Delta(Y) / Delta(X)
The slope for a variable (X) may vary between two graphs, but it will always be positive; hence, such graphs are called graphs with curve linear positive slope.

Linear negative slope
A negative slope always moves downward on a graph from left to right. As the X value (on the X-axis) increases, the Y value decreases.
Slope = Rise/Run = (Y2 − Y1) / (X2 − X1) = Delta(Y) / Delta(X)
Scenario 1 for negative slope: Delta(Y) is positive and Delta(X) is negative.
Scenario 2 for negative slope: Delta(Y) is negative and Delta(X) is positive.

Curve linear negative slope
Curves in these graphs (refer to Fig. 8.6) slope downward from left to right.
Slope = (Y2 − Y1) / (X2 − X1) = Delta(Y) / Delta(X)
The slope for a variable (X) may vary between two graphs, but it will always be negative; hence, such graphs are called graphs with curve linear negative slope.

No relationship graph
A scatter graph indicates a ‘no relationship’ pattern when it is very difficult to conclude whether the relationship between X and Y is positive or negative.

Error in simple regression
The regression equation model in machine learning uses the above slope-intercept format in its algorithms. X and Y values are provided to the machine, and it identifies the values of a (intercept) and b (slope) by relating the values of X and Y. However, identifying the exact values of a and b is not always possible; there will be some error value (ε) associated with the fit. This error is called the marginal or residual error.

Example of simple regression
A college professor believes that if the grade for the internal examination is high in a class, the grade for the external examination will also be high.
A random sample of 15 students in that class was selected, and the data is given below. A scatter plot was drawn to explore the relationship between the independent variable (internal marks), mapped to the X-axis, and the dependent variable (external marks), mapped to the Y-axis, as depicted in the scatter-plot figure below.

Residual
A residual is the distance between the predicted point (on the regression line) and the actual point.

OLS: Ordinary Least Squares
Ordinary Least Squares (OLS) is the technique used to estimate a line that will minimize the error (ε), which is the difference between the predicted and the actual values of Y. This means summing the errors of each prediction or, more appropriately, the Sum of the Squares of the Errors (SSE).

EXAMPLE: Let us calculate the values of a and b for the given example. In the context of the given problem, we can say
Marks in external exam = 19.04 + 1.89 × (Marks in internal exam)
or, M_Ext = 19.04 + 1.89 × M_Int
The value of the intercept from the above equation is 19.04. However, none of the internal marks is 0. So, intercept = 19.04 indicates that 19.04 is the portion of the external examination marks not explained by the internal examination marks.

OLS algorithm (a minimal Python sketch of these steps is given at the end of this part)
Step 1: Calculate the mean of X and Y
Step 2: Calculate the errors of X and Y
Step 3: Get the product
Step 4: Get the summation of the products
Step 5: Square the difference of X
Step 6: Get the sum of the squared differences
Step 7: Divide the output of step 4 by the output of step 6 to calculate ‘b’
Step 8: Calculate ‘a’ using the value of ‘b’

Regression graph: maximum and minimum points of curves
The maximum point is the point on the curve of the graph with the highest y-coordinate and a slope of zero. The minimum point is the point on the curve of the graph with the lowest y-coordinate and a slope of zero.

MULTIPLE LINEAR REGRESSION:
In simple linear regression, a single independent/predictor variable (X) is used to model the response variable (Y). But there may be various cases in which the response variable is affected by more than one predictor variable; for such cases, the Multiple Linear Regression (MLR) algorithm is used.
“Multiple Linear Regression is one of the important regression algorithms which models the linear relationship between a single dependent continuous variable and more than one independent variable.”

Key points about MLR:
○ For MLR, the dependent or target variable (Y) must be continuous/real, but the predictor or independent variables may be of continuous or categorical form.
○ Each feature variable must model a linear relationship with the dependent variable.
○ MLR tries to fit a regression line through a multidimensional space of data points.

MLR equation:
In multiple linear regression, the target variable (Y) is a linear combination of multiple predictor variables x1, x2, x3, ..., xn. Since it is an enhancement of simple linear regression, the same form is applied, and the multiple linear regression equation becomes:
Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
Where,
Y = Output/response variable
b0, b1, b2, b3, ..., bn = Coefficients of the model
x1, x2, x3, x4, ... = Various independent/feature variables

Assumptions for Multiple Linear Regression:
○ A linear relationship should exist between the target and predictor variables.
○ The regression residuals must be normally distributed.
○ MLR assumes little or no multicollinearity (correlation among the independent variables) in the data.
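Before moving on to the Python implementation of MLR, here is a minimal sketch of the eight-step OLS algorithm listed above for simple linear regression. The internal/external marks used here are hypothetical illustrative values, not the actual 15-student sample from the example.

```python
# Minimal sketch of the eight-step OLS algorithm for simple linear regression.
# The marks below are hypothetical illustrative values, not the actual sample data.
x = [45, 52, 61, 70, 74, 80]        # internal exam marks (predictor X)
y = [98, 108, 135, 152, 160, 170]   # external exam marks (response Y)

# Step 1: calculate the mean of X and Y
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Steps 2-4: errors of X and Y, their products, and the summation of the products
sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Steps 5-6: squared differences of X and their sum
sum_xx = sum((xi - mean_x) ** 2 for xi in x)

# Step 7: slope b = summation of products / sum of squared differences
b = sum_xy / sum_xx

# Step 8: intercept a, obtained from b and the means
a = mean_y - b * mean_x

print(f"Fitted line: Y = {a:.2f} + {b:.2f} * X")
```

For comparison, numpy.polyfit(x, y, 1) or scipy.stats.linregress(x, y) should return the same slope and intercept on the same data.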
Implementation of a Multiple Linear Regression model using Python:
To implement MLR using Python, we consider the following problem.
Problem description: We have a dataset of 50 start-up companies. This dataset contains five main pieces of information: R&D Spend, Administration Spend, Marketing Spend, State, and Profit for a financial year. Our goal is to create a model that can easily determine which company has the maximum profit, and which factor most affects the profit of a company. Since we need to find the Profit, it is the dependent variable, and the other four variables are independent variables. The main steps of deploying the MLR model are sketched in the code example at the end of this part.

The simple linear regression model and the multiple regression model assume that the dependent variable is continuous. The following expression describes the relationship with two predictor variables, namely X1 and X2:
Ŷ = a + b1X1 + b2X2
The model describes a plane in the three-dimensional space of Ŷ, X1, and X2. Parameter ‘a’ is the intercept of this plane. Parameters ‘b1’ and ‘b2’ are referred to as partial regression coefficients. Parameter b1 represents the change in the mean response corresponding to a unit change in X1 when X2 is held constant. Parameter b2 represents the change in the mean response corresponding to a unit change in X2 when X1 is held constant. While finding the best-fit line, we can also fit a polynomial or curvilinear relationship; such models are known as polynomial or curvilinear regression, respectively.

Assumptions in Regression Analysis:
1. The dependent variable (Y) can be calculated/predicted as a linear function of a specific set of independent variables (X’s) plus an error term (ε).
2. The number of observations (n) is greater than the number of parameters (k) to be estimated, i.e. n > k.
3. Relationships determined by regression are only relationships of association based on the data set, and not necessarily of cause and effect.
4. The regression line is valid only over a limited range of data. If the line is extended outside that range (extrapolation), it may lead to wrong predictions.
5. If the business conditions change and the business assumptions underlying the regression model are no longer valid, then the past data set will no longer be able to predict future trends.
6. The variance is the same for all values of X (homoskedasticity).
7. The error term (ε) is normally distributed. This also means that the mean of the error (ε) has an expected value of 0.
8. The values of the error (ε) are independent and are not related to any values of X. This means that there are no relationships between a particular X, Y pair that are related to another specific value of X, Y.
Given the above assumptions, the OLS estimator is the Best Linear Unbiased Estimator (BLUE); this result is known as the Gauss-Markov theorem.

Main Problems in Regression Analysis
In multiple regression, there are two primary problems: multicollinearity and heteroskedasticity.
Multicollinearity: Two variables are perfectly collinear if there is an exact linear relationship between them. Multicollinearity is the situation in which there is not only correlation between the dependent variable and the independent variables, but also strong correlation among the independent variables themselves. A multiple regression equation can make good predictions when there is multicollinearity, but it is difficult for us to determine how the dependent variable will change if each independent variable is changed one at a time.
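As referenced in the problem description above, the sketch below walks through the main MLR deployment steps (load the data, encode the categorical State column, split into training and test sets, fit, and evaluate). The file name '50_Startups.csv' and the exact column names are assumptions made for illustration; adjust them to the actual dataset.

```python
# Sketch of deploying an MLR model on the 50 start-ups dataset.
# Assumes a hypothetical file '50_Startups.csv' with columns:
# 'R&D Spend', 'Administration', 'Marketing Spend', 'State', 'Profit'.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv("50_Startups.csv")
X = data.drop(columns=["Profit"])   # four independent variables
y = data["Profit"]                  # dependent (target) variable

# One-hot encode the categorical 'State' column; keep the numeric columns as-is.
preprocess = ColumnTransformer(
    [("state", OneHotEncoder(drop="first"), ["State"])],
    remainder="passthrough",
)
model = Pipeline([("prep", preprocess), ("mlr", LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model.fit(X_train, y_train)
print("R^2 on test data:", model.score(X_test, y_test))
print("Coefficients:", model.named_steps["mlr"].coef_)
```

The fitted coefficients indicate which spend category contributes most to the predicted profit, which is the question posed in the problem description.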
When multicollinearity is present, it increases the standard errors of the coefficients. By over-inflating the standard errors, multicollinearity can make some variables statistically insignificant when they actually should be significant (i.e., they would have lower standard errors without it). One way to gauge multicollinearity is to calculate the Variance Inflation Factor (VIF), which assesses how much the variance of an estimated regression coefficient increases when the predictors are correlated. If no factors are correlated, the VIFs will all be equal to 1 (a short sketch using statsmodels is given at the end of this part).

The assumption of no perfect collinearity states that there is no exact linear relationship among the independent variables. This assumption implies two aspects of the data on the independent variables. First, none of the independent variables, other than the variable associated with the intercept term, can be a constant. Second, variation in the X’s is necessary. In general, the more variation there is in the independent variables, the better the OLS estimates will be at identifying the impacts of the different independent variables on the dependent variable.

Heteroskedasticity: Heteroskedasticity refers to the changing variance of the error term. If the variance of the error term is not constant across data sets, there will be erroneous predictions. In general, for a regression equation to make accurate predictions, the error terms should be independent and identically (normally) distributed (iid). In statistics, heteroskedasticity (or heteroscedasticity) happens when the standard deviations of a predicted variable, monitored over different values of an independent variable or as related to prior time periods, are non-constant.

Improving Accuracy of the Linear Regression Model
The concepts of bias and variance are analogous to accuracy and prediction. Accuracy refers to how close the estimate is to the actual value, whereas prediction refers to continuous estimation of the value.
High bias = low accuracy (not close to the real value)
High variance = low prediction (values are scattered)
Low bias = high accuracy (close to the real value)
Low variance = high prediction (values are close to each other)
If the model has low bias (high accuracy) and low variance (high prediction), the overall error of the model will be low.

In the linear regression model, it is assumed that the number of observations (n) is greater than the number of parameters (k) to be estimated, i.e. n > k, and in that case the least squares estimates tend to have low variance and hence perform well on test observations. However, if n is not much larger than k, there can be high variability in the least squares fit, resulting in overfitting and leading to poor predictions. If k > n, then linear regression is not usable; this also implies infinite variance, and so the method cannot be applied at all.

The accuracy of linear regression can be improved using the following three methods:
1. Shrinkage approach
2. Subset selection
3. Dimensionality (variable) reduction

Shrinkage (Regularization) Approach
Regularization is one of the most important concepts of machine learning. It is a technique to prevent the model from overfitting by adding extra information to it. Sometimes a machine learning model performs well with the training data but does not perform well with the test data: the model cannot predict the output for unseen data because it has also learned the noise in the training data, and hence the model is said to be overfitted.
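Referring back to the VIF discussion above, here is a short sketch that computes variance inflation factors with statsmodels. The commented example column names are placeholders for whatever numeric predictors the dataset actually contains; they are not taken from the original text.

```python
# Sketch: computing the Variance Inflation Factor (VIF) for each predictor.
# X is assumed to be a pandas DataFrame of numeric independent variables.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """Return the VIF value for every predictor column in X."""
    Xc = sm.add_constant(X)  # include an intercept term, as in the regression itself
    vifs = [variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])]
    return pd.DataFrame({"variable": Xc.columns, "VIF": vifs})

# Example usage with placeholder column names (VIF close to 1 means little
# correlation with the other predictors; large VIF values signal multicollinearity):
# X = data[["R&D Spend", "Administration", "Marketing Spend"]]
# print(vif_table(X))
```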
This overfitting problem can be dealt with using a regularization technique. Regularization can be applied in such a way that it allows us to keep all the variables or features in the model while reducing the magnitude of their coefficients; hence, it maintains accuracy as well as the generalization ability of the model. It mainly regularizes or shrinks the coefficients of the features towards zero. In simple words, "in the regularization technique, we reduce the magnitude of the features while keeping the same number of features."

How does Regularization Work?
Regularization works by adding a penalty or complexity term to the complex model. Let us consider the linear regression equation:
y = β0 + β1x1 + β2x2 + β3x3 + ... + βnxn + b
Here β0, β1, ..., βn are the weights or magnitudes attached to the respective features, and b represents the intercept (bias) of the model. Linear regression models try to optimize these coefficients and b to minimize the cost function. The loss function for linear regression is called the RSS or Residual Sum of Squares, RSS = Σ (yi − ŷi)². Regularization adds a penalty term to this loss function and optimizes the parameters so that the model can predict accurate values of Y.

Techniques of Regularization
There are mainly two types of regularization techniques: Ridge Regression and Lasso Regression.

Ridge Regression
Ridge regression is a type of linear regression in which a small amount of bias is introduced so that we can get better long-term predictions. Ridge regression is a regularization technique used to reduce the complexity of the model; it is also called L2 regularization. In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the ridge regression penalty. It is calculated by multiplying lambda (λ) by the squared weight of each individual feature, so the cost function for ridge regression is:
Cost = RSS + λ Σ (βj)²
In the above equation, the penalty term regularizes the coefficients of the model; hence, ridge regression reduces the magnitudes of the coefficients, which decreases the complexity of the model. As we can see from the above equation, if λ tends to zero, the equation becomes the cost function of the ordinary linear regression model; hence, for a very small value of λ, the model resembles the linear regression model. A general linear or polynomial regression will fail if there is high collinearity between the independent variables, so ridge regression can be used to solve such problems. It also helps when we have more parameters than samples.

Lasso Regression
Lasso regression is another regularization technique used to reduce the complexity of the model. LASSO stands for Least Absolute Shrinkage and Selection Operator. It is similar to ridge regression, except that the penalty term contains only the absolute values of the weights instead of their squares. Since it takes absolute values, it can shrink a coefficient all the way to 0, whereas ridge regression can only shrink it close to 0. It is also called L1 regularization. The cost function for lasso regression is:
Cost = RSS + λ Σ |βj|
Because some coefficients are shrunk exactly to zero, some of the features are completely ignored during model evaluation. Hence, lasso regression can help us to reduce overfitting in the model and also perform feature selection.
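The following sketch compares ridge (L2) and lasso (L1) regression using scikit-learn on made-up data, to show the behaviour described above: lasso drives some coefficients exactly to zero, while ridge only shrinks them towards zero. The data and the alpha values (which play the role of λ) are illustrative assumptions.

```python
# Sketch: comparing Ridge (L2) and Lasso (L1) regularization on made-up data.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features actually influence y; the other three are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # alpha plays the role of the lambda penalty
lasso = Lasso(alpha=0.1).fit(X, y)

print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))  # shrunk towards zero
print("Lasso coefficients:", np.round(lasso.coef_, 3))  # noise features driven to zero
```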
Key Differences between Ridge Regression and Lasso Regression
○ Ridge regression is mostly used to reduce overfitting in the model, and it keeps all the features present in the model. It reduces the complexity of the model by shrinking the coefficients.
○ Lasso regression helps to reduce overfitting in the model and also performs feature selection.

Subset Selection
Identify a subset of the predictors that is assumed to be related to the response, and then fit a model using OLS on the selected, reduced subset of variables. There are two methods by which a subset of the predictors can be selected:
1. Best subset selection (considers all the possible 2^k subsets)
2. Stepwise subset selection
   1. Forward stepwise selection (0 to k)
   2. Backward stepwise selection (k to 0)

In best subset selection, we fit a separate least squares regression for each possible subset of the k predictors. For computational reasons, best subset selection cannot be applied with a very large number of predictors (k), since the procedure considers all 2^k possible models containing subsets of the k predictors. The stepwise subset selection methods can be applied to choose a good subset more cheaply. There are two stepwise subset selection approaches:
1. Forward stepwise selection (0 to k)
2. Backward stepwise selection (k to 0)
Forward stepwise selection is a computationally efficient alternative to best subset selection; it considers a much smaller set of models, built step by step, compared with best subset selection. Forward stepwise selection begins with a model containing no predictors, and then predictors are added to the model one by one until all k predictors are included. In particular, at each step, the variable (X) that gives the greatest additional improvement to the fit is added. Backward stepwise selection begins with the least squares model containing all k predictors and then iteratively removes the least useful predictor, one at a time.

Dimensionality Reduction (Variable Reduction)
The earlier methods, namely subset selection and shrinkage, control variance either by using a subset of the original variables or by shrinking their coefficients towards zero. In dimensionality reduction, the predictors (X) are transformed, and the model is built using the transformed variables after dimensionality reduction, so the number of variables is reduced. Principal component analysis is one of the most important dimensionality (variable) reduction techniques.

Elastic Net Regression Modelling
Elastic net linear regression uses the penalties from both the lasso and ridge techniques to regularize regression models. The technique combines the lasso and ridge regression methods by learning from their shortcomings to improve the regularization of statistical models. The elastic net method improves on lasso’s limitations: for high-dimensional data, where lasso can select only a limited number of variables before it saturates, the elastic net procedure allows the inclusion of more variables up to saturation. Moreover, if there are groups of highly correlated variables, lasso tends to choose one variable from such groups and ignore the rest entirely.
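As a closing illustration, the sketch below fits an elastic net model with scikit-learn. The l1_ratio parameter controls the mix between the lasso (L1) and ridge (L2) penalties, and alpha controls the overall penalty strength; the data and parameter values here are illustrative assumptions.

```python
# Sketch: Elastic Net combines the L1 (lasso) and L2 (ridge) penalties.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
# Make two predictors highly correlated; the remaining six are noise.
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=100)
y = 4.0 * X[:, 0] + 4.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # 50/50 mix of L1 and L2

print("Lasso coefficients:      ", np.round(lasso.coef_, 3))
print("Elastic net coefficients:", np.round(enet.coef_, 3))
# Lasso tends to keep only one of the two correlated predictors,
# while elastic net tends to spread the weight across both.
```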
