Regression: Linear, Logistic, and Beyond
Econ 471: Data Science for Economists
Manu Navjeevan
August 7, 2024

1 Regression Modeling

We will now shift gears and turn to the problem of modeling relationships between variables. A large subset of these problems falls into the category of regression, where we have a response variable Y that we would like to predict using a vector of input variables X = (X1, ..., Xd)′ ∈ ℝ^d. For example, we may want to predict income, Y, using input variables X1, say education, and X2, say experience. The vector X = (X1, X2)′ simply collects education and experience to simplify notation.

Formally, this problem reduces to learning the conditional mean function E[Y|X] from the data. Recall that we are interested in the conditional mean function as it is the "best" predictor of Y using the information in X; that is, for any other function of X, φ(X), we have that

    E[(Y − E[Y|X])²] ≤ E[(Y − φ(X))²]    (1)

Left completely unrestricted, this problem can be quite complex, as the unknown conditional mean function could be anything at all. Therefore, it is common to make some assumptions about the form of the conditional mean function in order to try to learn it from the data. We will see some examples of this below.

2 Linear Regression

As a first model, it is useful to consider the workhorse linear regression model. This is one in which we assume that the conditional expectation is linear in the explanatory X variables. That is,

    E[Y|X] = β0 + β1 X1 + · · · + βd Xd

Letting ϵ = Y − E[Y|X] and recalling that E[ϵ|X] = 0, you may often see this model presented as

    Y = β0 + β1 X1 + · · · + βd Xd + ϵ,    E[ϵ|X] = 0    (2)

One of the benefits of the linear regression model is that it is very easy to interpret the parameters of the model. As an example, suppose that we are a company evaluating the efficacy of our marketing strategy. In this example, let Y be sales in dollars for a given quarter, X1 be the amount of money we spent on television ads in that quarter, and X2 be the amount of money we spent on online advertising in that same quarter. Some questions we may be able to answer with linear regression models (see the sketch below):

1. Does television advertising have an effect on sales? We can look at whether the slope parameter attached to X1, β1, is nonzero.
2. How strong is the relationship between advertising spending and sales? We can look at goodness-of-fit criteria for the considered linear regression models.
3. Is the relationship between sales and advertising spending linear? We can consider transformations of sales or of our explanatory variables and see if we can better fit the data.
4. Are there synergies between spending on television and online advertising? We can include interaction terms between television and online spending and see if they are nonzero.

For this reason, linear models remain popular in the data science world and are still used even when more complex techniques are available.
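To make these questions concrete, here is a minimal R sketch of how each one might be examined with lm(). The data frame ads and its columns sales, tv, and online are hypothetical names standing in for the quarterly data described above.

    # 'ads' is an assumed data frame with columns sales, tv, and online
    fit_main <- lm(sales ~ tv + online, data = ads)
    summary(fit_main)   # t-tests on the slopes address question 1;
                        # R-squared is one goodness-of-fit criterion (question 2)

    # Question 3: transform the response (or the regressors) and compare fits
    fit_log <- lm(log(sales) ~ tv + online, data = ads)
    summary(fit_log)

    # Question 4: include an interaction between television and online spending
    fit_int <- lm(sales ~ tv * online, data = ads)
    summary(fit_int)    # the tv:online row tests whether the synergy term is nonzero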
2.1 Estimating the Linear Regression Model

As a consequence of (1), the parameters β0, β1, ..., βd solve

    (β0, β1, ..., βd) = argmin_{b0, b1, ..., bd} E[(Y − b0 − b1 X1 − · · · − bd Xd)²]    (3)

That is, for any other possible values b0, b1, ..., bd,

    E[(Y − β0 − β1 X1 − · · · − βd Xd)²] < E[(Y − b0 − b1 X1 − · · · − bd Xd)²]

If the joint distribution of (Y, X1, ..., Xd) were known to us, we could use this to calculate β0, β1, ..., βd directly. Unfortunately, we typically do not know this joint distribution. However, we do have access to a sample {(Yi, X1i, ..., Xdi)}, i = 1, ..., n, of independent observations of the random variables (Y, X1, ..., Xd). To use the data to estimate the underlying parameters β0, ..., βd, we solve the sample analog of (3). That is, we let

    (β̂0, β̂1, ..., β̂d) = argmin_{b0, b1, ..., bd} (1/n) Σ_{i=1}^n (Yi − b0 − b1 X1i − · · · − bd Xdi)²    (4)

These are called the least squares estimates of the underlying parameters (β0, β1, ..., βd). The minimization problems in (3) and (4) have closed form solutions, but deriving them in general requires some matrix algebra, so we will skip it for now. For the specific case where d = 1, that is, where there is only one explanatory variable, we can write

    β0 = E[Y] − β1 E[X],          β̂0 = Ȳ − β̂1 X̄
    β1 = Cov(X, Y)/Var(X),        β̂1 = Ĉov(X, Y)/V̂ar(X)

where Ĉov(X, Y) = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) is the sample covariance between X and Y and V̂ar(X) = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)² is the sample variance of X.

As with the sample mean, the estimates β̂0 and β̂1 are functions of the random sample {(Yi, Xi)}, i = 1, ..., n, and as such can be treated as random variables: if we draw a different random sample, we will get different values for β̂0 and β̂1. However, using the fact that sample means converge to their population analogs (X̄ → E[X], Ȳ → E[Y], etc.), we can verify that, just like the sample mean, these estimates are consistent for the true values. That is, as n → ∞,

    β̂0 → β0    and    β̂1 → β1

This consistency result generalizes to the general regression model, so long as the number of parameters we are estimating, d, is "small" relative to the sample size n. Later on in this course we will discuss the case where there are many right-hand-side variables, which will require some machine learning techniques. In the context of our example of investigating the effect of various types of advertising on sales volume, this means that, given a large enough sample of time periods, we will be able to figure out the effect of television and online advertising spending on sales in dollars.
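To see the d = 1 formulas in action, the following R sketch simulates data from a simple linear model and checks that the hand-computed least squares estimates agree with the coefficients reported by lm() and land near the true parameter values; the true parameters, the seed, and the sample size are arbitrary choices for illustration.

    set.seed(471)                      # arbitrary seed, for reproducibility
    n <- 10000                         # a large sample, so the estimates should be near the truth
    x <- rnorm(n)
    y <- 2 + 3 * x + rnorm(n)          # true beta0 = 2, beta1 = 3

    # Closed-form least squares estimates for the case d = 1
    beta1_hat <- cov(x, y) / var(x)    # sample covariance over sample variance
    beta0_hat <- mean(y) - beta1_hat * mean(x)

    c(beta0_hat, beta1_hat)            # should be close to (2, 3)...
    coef(lm(y ~ x))                    # ...and agree with lm() up to floating point error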
Remark 1 (Closed Form Solutions). While the minimization problem in (4) does have a closed form solution using matrix algebra, in general we need not worry about the existence of such formulas. The problem in (4) is a special instance of what is called a convex optimization problem, and such problems are easy to solve using numerical optimization techniques. Later on in the course we will deal with estimators defined through minimization problems that do not have explicit closed form solutions. Regardless, since they are also instances of convex optimization, we can compute these estimators quickly using modern computers.

Remark 2 (Misspecification in the Linear Regression Model). Of course, it is reasonable to ask what our estimation procedure achieves if the linear regression model is misspecified, that is, if

    E[Y|X] ≠ β0 + β1 X1 + · · · + βd Xd

It turns out that, even if the true conditional mean is not linear, our ordinary least squares estimates converge to the best linear approximation of the true conditional mean function. That is, as n → ∞, (β̂0, β̂1, ..., β̂d) → (β̃0, β̃1, ..., β̃d), where

    (β̃0, β̃1, ..., β̃d) = argmin_{b0, b1, ..., bd} E[(E[Y|X] − b0 − b1 X1 − · · · − bd Xd)²]

This is useful, as it ensures that our estimated linear regression model remains interpretable even if the true conditional mean is not linear.

2.2 Inference in the Linear Regression Model

Of course, knowing that the parameter estimates β̂0, β̂1, ..., β̂d are close to their true values may not be sufficient for many problems. We may want to quantify exactly how close these estimates should be to their true values. Just like with the sample mean, we are interested in the distribution of the estimator β̂ and how it relates to the "true" underlying parameter β.

To fix ideas, let us return to our original example, where Y = sales in dollars, X1 = spending on television advertising, and X2 = spending on online advertising. Suppose that we are in an argument with our boss. She does not want to spend more money on online advertising unless an additional dollar spent on online advertising returns more than an additional dollar of sales. To test this, we gather a sample of n time periods, suppose that the relationship between sales and advertising spending is linear, and estimate the model

    Y = β0 + β1 X1 + β2 X2 + ϵ

In terms of the parameters of the model, our boss's skepticism can be interpreted as the null hypothesis

    H0: β2 ≤ 1

whereas we as analysts want to prove to her that

    H1: β2 > 1

In our sample, we find that β̂2 ≈ 1.381. Is this sufficient evidence to reject our boss's claim and rule in favor of our alternative hypothesis? On one hand, we know that β̂2 should be close to the true value β2. On the other hand, we know that β̂2 is a random variable, as it depends on our random sample; if we picked another sample, we would get a different value of β̂2. To answer this question, we would like to know something like

    Pr(β̂2 ≥ 1.381 given β2 = 1)

or, in other words,

    Pr(β̂2 − β2 ≥ 0.381)

In your previous econometrics classes, you may have been taught that we can approximate the distribution of β̂2 in large samples via

    β̂2 − β2 ∼ N(0, σϵ² / [(1 − ρ12²) Σ_{i=1}^n (X2i − X̄2)²])

where ρ12 is the sample correlation coefficient between X1 and X2. This approximation, however, is only valid under specific conditions. The main assumption that we may want to relax is homoskedasticity, which posits that the variance of the outcome Y does not depend on X1 or X2. In the context of our example above, this may be problematic, as we would be assuming that the variance of sales volume in time periods with low advertising spending is the same as the variance of sales volume in time periods with high advertising spending. However, in periods with high advertising spending we are (presumably) reaching more customers, each of whom then has to make a decision about whether or not to spend on our product. This will naturally increase the variance of sales volume. As such, homoskedasticity may not be a realistic assumption in this setting.

Given that we do not believe homoskedasticity holds, the question is how to perform inference. There are essentially two options.

1. It turns out that, even if homoskedasticity is violated, the estimates β̂ still follow an approximately normal distribution, albeit with a different form for the standard errors (the variance). Using R, we can calculate these heteroskedasticity-robust standard errors in the following manner:

    library(sandwich)
    reg <- lm(Y ~ X1 + X2)   # fit the regression of sales on tv and online ad spending
    vcovHC(reg)              # heteroskedasticity-robust variance estimate for the coefficients
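A fuller sketch of how this robust calculation might be carried through, assuming simulated data in place of the real sales records: vcovHC() from sandwich supplies the robust variance matrix, coeftest() from the lmtest package reports robust t-tests, and the "HC1" variance type and the one-sided test of H0: β2 ≤ 1 below are illustrative choices.

    library(sandwich)
    library(lmtest)

    # Simulated stand-in for the sales example; the error variance grows
    # with ad spending, so the data are deliberately heteroskedastic
    set.seed(471)
    n   <- 500
    X1  <- runif(n, 0, 10)                 # television ad spending
    X2  <- runif(n, 0, 10)                 # online ad spending
    eps <- rnorm(n, sd = 1 + X1 + X2)
    Y   <- 5 + 0.8 * X1 + 1.4 * X2 + eps   # true beta2 = 1.4 > 1

    reg <- lm(Y ~ X1 + X2)
    V   <- vcovHC(reg, type = "HC1")       # robust variance matrix
    coeftest(reg, vcov. = V)               # robust t-tests of each coefficient against zero

    # One-sided test of H0: beta2 <= 1 against H1: beta2 > 1
    se2   <- sqrt(V["X2", "X2"])           # robust standard error of beta2-hat
    tstat <- (coef(reg)["X2"] - 1) / se2
    pval  <- 1 - pnorm(tstat)              # approximate one-sided p-value
    c(tstat = unname(tstat), pval = unname(pval))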