FBA1018 – Introductory Econometrics, Autumn 2024
Dublin City University
Summary
These lecture notes provide an introduction to econometrics, focusing on OLS, regression techniques, and related concepts.
FBA1018 – Introductory Econometrics, Autumn 2024
Regression (1/3)

Agenda
- The basics of regression (Week 4)
- Getting fancier with regression (Week 5)
- Issues with standard errors (Week 6)
- Additional regression concerns (Week 6)

The Basics of Regression
- Error terms
- Regression assumptions and sampling variation
- Hypothesis testing in OLS
- Regression tables and model-fit statistics
- Subscripts in regression equations
- Turning a causal diagram into a regression
- Coding examples

Recall from Week 2 – Describing Relationships
We can use the values of one variable (X) to predict the values of another (Y). We call this explaining Y using X. While there are many ways to do this, one is to fit a line or shape that describes the relationship, for example Y = β₀ + β₁X. Estimating this line using ordinary least squares (standard, linear regression) will select the line that minimizes the sum of squared residuals, which is what you get if you take the prediction errors from the line, square them, and add them up. Linear regression, for example, gives us the best linear approximation of the relationship between X and Y. The quality of that approximation depends in part on how linear the true model is.

Recall from Week 2 – Describing Relationships (Cont.)
- Pro: Uses variation efficiently
- Pro: A shape is easy to explain
- Con: We lose some interesting variation
- Con: If we pick a shape that's wrong for the relationship, our results will be bad

Recall from Week 2 – Describing Relationships (Cont.)
We can interpret the coefficient β₁ that multiplies a variable (X) as a slope: a one-unit increase in X is associated with a β₁ increase in Y. With only one predictor, the estimate of the slope is the covariance of X and Y divided by the variance of X. With more than one, the result is similar in spirit, but it also accounts for the way the different predictors are correlated. If we plug an observation's predictor variables into an estimated regression, we get a prediction Ŷ. This is the part of Y that is explained by the regression. The difference between Y and Ŷ is the unexplained part, which is also called the "residual."

Recall from Week 2 – Describing Relationships (Cont.)
If we add another variable to the regression equation, say Z in Y = β₀ + β₁X + β₂Z, then the coefficient on each variable will be estimated using the variation that remains after removing what is explained by the other variable. So our estimate of β₁ would not give the best-fit line between Y and X, but rather between the part of Y not explained by Z and the part of X not explained by Z. This "controls for Z." If we think the relationship between X and Y isn't well explained by a straight line, we can use a curvy one instead. OLS can handle specifications like Y = β₀ + β₁X + β₂X² that are "linear in parameters" (notice that the parameters β₀, β₁, and β₂ are just plain multiplied by a variable, then added up), or we can use nonlinear regression like probit, logit, or a zillion other options.

What we're going to add
- The error term
- Sampling variation
- The statistical properties of OLS
- Interpreting regression results
- Interpreting coefficients on binary and transformed variables

Error Terms
Fitting a straight line is not sufficient. It will rarely predict any observation perfectly, much less all of them. There's going to be a difference between the line that we fit and the observation we get. We can add this difference to our actual equation as an "error term," giving Y = β₀ + β₁X + ε.
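To make the Week 2 recap concrete, here is a minimal sketch in Python (my own simulated data and variable names, not the lecture's code): it estimates the slope with the covariance-over-variance formula and computes the differences between the fitted line and the actual observations.

```python
# Minimal sketch on simulated data: fit Y = b0 + b1*X by OLS using
# slope = Cov(X, Y) / Var(X), then look at the differences between the
# fitted line and the observations. All names here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)   # assumed "true model": Y = 1 + 0.5X + error

beta1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope = Cov(X,Y)/Var(X)
beta0_hat = y.mean() - beta1_hat * x.mean()                 # intercept

y_hat = beta0_hat + beta1_hat * x   # predictions: the explained part of Y
residuals = y - y_hat               # the unexplained part of Y

print(f"estimated slope: {beta1_hat:.3f} (the value used to simulate was 0.5)")
print(f"sum of squared residuals: {np.sum(residuals**2):.1f}")
```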
That difference goes by two names: the residual, which we've talked about, is the difference between the prediction we make with our fitted line and the actual value, and the error is the difference between the true best-fit line and the actual value. Why the distinction? Well… sampling variation.

Error Terms
[Figure: The Difference Between the Residual and the Error]

Error Terms
The error includes both stuff we can see and stuff we can't. So, for example, if the true model is the one given by the causal diagram in the Figure, and we are using the OLS model Y = β₀ + β₁X + ε, then ε is made up of some combination of the other variables in that diagram. We know that because we can determine from the graph that those variables all cause Y but aren't in the model.

Regression Assumptions & Sampling Variation
Exogeneity assumption: if we want to say that our OLS estimate of β₁ will, on average, give us the population β₁, then it must be the case that X is uncorrelated with ε. If we do have that endogeneity problem, then on average our estimate of β₁ won't give us the population value. When that happens (when our estimate on average gives us the wrong answer), we call that a bias in our estimate. In particular, this form of bias, where we are biased because a variable correlated with X is in the error term, is known as omitted variable bias, since it happens because we omitted an important variable from the regression equation.

Regression Assumptions & Sampling Variation
Regression coefficients also follow a normal distribution, and we know what the mean and standard deviation of that normal distribution are. Or, at least we do if we make a few more assumptions about the error term. In a regression model with one predictor, like Y = β₀ + β₁X + ε, the OLS estimate of the true-model population β₁ follows a normal distribution with a mean of β₁ and a standard deviation of σ_ε / √(n · Var(X)), where n is the number of observations in the data, σ_ε is the standard deviation of the error term ε, and Var(X) is the variance of X. The standard deviation of a sampling distribution is often referred to as a standard error.

Hypothesis Testing in OLS
Step 1: Pick a theoretical distribution, specifically a normal distribution centred around a particular value of β₁.
Step 2: Estimate β₁ using OLS in our observed data, getting β̂₁.
Step 3: Use that theoretical distribution to see how unlikely it would be to get β̂₁ if the true population value really were the one we picked in the first step.
Step 4: If it's super unlikely, that initial value we picked is probably wrong!

Hypothesis Testing in OLS – Example
Assume a true model in which β₁ = 0.2 and the error term ε is normally distributed with mean 0 and variance 1. We generated 200 random observations from that true model. Pretending that we don't know the truth is β₁ = 0.2, we estimated the OLS model Y = β₀ + β₁X + ε. The first estimate we get is β̂₁ = 0.142. We can use the formula from the last section to calculate that the standard error of β̂₁ is 0.077. So, under a null hypothesis of β₁ = 0, the theoretical distribution we're looking at is a normal distribution with mean 0 and standard deviation 0.077.

Hypothesis Testing in OLS – Example
Under that distribution, the 0.142 estimate we got is at the 96.7th percentile of the theoretical distribution, as shown in the Figure below. That means that something as far from 0 as 0.142 (or farther) happens (100 − 96.7) × 2 = 3.3 × 2 = 6.6% of the time. If we started with a significance level of α = 0.05, then we would not reject the null hypothesis, since 6.6% is higher than 5%, even though we happen to know for a fact that the null is quite wrong.

Hypothesis Testing in OLS – Key points
An estimate not being statistically significant doesn't mean it's wrong. It just means it's not statistically significant.
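In the example above, the true β₁ is 0.2 and yet the estimate is not significant at the 5% level. As a quick check on those numbers, here is a small sketch (my own code, reusing only the estimate and standard error quoted on the slide) that reproduces the percentile and the roughly 6.6% figure.

```python
# Reproduce the slide's hypothesis-test arithmetic: under a null of beta1 = 0
# with standard error 0.077, how unusual is an estimate of 0.142?
from scipy.stats import norm

beta1_hat = 0.142   # OLS estimate quoted on the slide
se = 0.077          # standard error quoted on the slide
null_value = 0.0

z = (beta1_hat - null_value) / se
percentile = norm.cdf(z)              # about 0.967, i.e. the 96.7th percentile
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value, about 0.065 (the slide's 6.6% after rounding)

print(f"percentile under the null: {percentile:.3f}")
print(f"two-sided p-value: {p_value:.3f}")   # above 0.05, so we do not reject
```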
Never, ever, ever, ever, ever, ever, ever give in to the thought "oh no, my results aren't significant, I'd better change the analysis to get something significant." Bad. I'm pretty sure professors always say never to do this, but students somehow remember the opposite.
A lot of stuff goes into significance besides just "is the true relationship nonzero or not": sampling variation, of course, and also the sample size, the way you're doing your analysis, your choice of α, and so on. A significance test isn't the last word on a result.
Statistical significance only provides information about whether a particular null value is unlikely. It doesn't say anything about whether the effect you've found matters. In other words, "statistical significance" isn't the same thing as "significant." A result showing that your treatment improves IQ by 0.000000001 points is not a meaningful effect, whether it's statistically significant or not.

Regression Tables and Model-Fit Statistics
We're going to run some regressions using data on restaurant and food inspections. We might be curious whether chain restaurants get better health inspection scores than restaurants with fewer locations (or only one).
[Table: Summary Statistics for Restaurant Inspection Data]

Regression Tables and Model-Fit Statistics
We run two regressions. The first just regresses inspection score on the number of locations the chain has: InspectionScore = β₀ + β₁NumberOfLocations + ε. The second adds year of inspection as a control: InspectionScore = β₀ + β₁NumberOfLocations + β₂YearOfInspection + ε.
[Table: Two Regressions of Restaurant Health Inspection Scores on Number of Locations]

Regression Tables and Model-Fit Statistics
Elements of the regression table:
- The intercept/constant and the coefficients.
- Standard errors of the coefficients.
- "Significance stars": whether each coefficient is statistically significantly different from a null-hypothesis value of 0. To be really precise, these aren't exactly significance tests; instead, they're a representation of the p-value.
- The number of observations, R², and adjusted R². R² and adjusted R² are measures of the share of the dependent variable's variance that is predicted by the model. In the first model, R² is 0.065, telling us that 6.5% of the variation in Inspection Score is predicted by the Number of Locations. Adjusted R² is the same idea, except that it makes an adjustment for the number of variables you're using in the model, so it only counts the variance explained above and beyond what you'd get by just adding a random variable to the model.
- The F-statistic. This is a statistic used to do a hypothesis test. Specifically, it uses a null that all the coefficients in the model (except the intercept/constant) are zero at once, and tests how unlikely your results are given that null.

Regression Tables and Model-Fit Statistics
The interpretation of an OLS coefficient β on a variable X is "controlling for the other variables in the model, a one-unit change in X is linearly associated with a β-unit change in Y." If we want to get even more precise, we can say "if two observations have the same values of the other variables in the model, but one has a value of X that is one unit higher, the observation with the one-unit-higher X will on average have a Y that is β units higher." Here, with a coefficient of −0.019, we can say "comparing two inspections in the same year, the one for a restaurant that's part of a chain with one more location than the other will on average have an inspection score 0.019 lower." After all, that's the idea of controlling for variables.
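For reference, here is a hedged sketch of how two regressions like these could be run with statsmodels. The data below is randomly generated stand-in data and the column names are my own; the lecture's actual restaurant-inspection dataset is not reproduced here.

```python
# Two regressions in the spirit of the slides, on made-up stand-in data:
# model 1 regresses the score on chain size; model 2 adds inspection year.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "num_locations": rng.integers(1, 200, size=n),   # stand-in chain size
    "year": rng.integers(2005, 2020, size=n),        # stand-in inspection year
})
df["inspection_score"] = (
    95
    - 0.02 * df["num_locations"]
    + 0.1 * (df["year"] - 2005)
    + rng.normal(0, 5, size=n)
)

m1 = smf.ols("inspection_score ~ num_locations", data=df).fit()          # model 1
m2 = smf.ols("inspection_score ~ num_locations + year", data=df).fit()   # model 2

# summary() reports the table elements discussed above: coefficients, standard
# errors, significance stars, N, R-squared, adjusted R-squared, and the F-statistic.
print(m1.summary())
print(m2.summary())
```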
We want to compare like-with-like and close back doors, and so we're trying to remove the part of the Number of Locations/Inspection Score relationship that is driven by Year of Inspection.

Subscripts in Regression Equations
Yᵢ = β₀ + β₁Xᵢ + εᵢ: the i here tells us what index the data varies across. In this regression, Y and X differ across individuals.
Yₜ = β₀ + β₁Xₜ₋₁ + εₜ: the t here would be shorthand for time period. This is describing a regression where each observation is a different time period. The t − 1 tells us that we are relating Y from a given period to the X from the period before.
Yᵢₜ = β₀ + β₁Xᵢₜ + εᵢₜ: this is describing a regression in which Y and X vary across a set of individuals and across time (a panel data set).

Turning a Causal Diagram into a Regression

Coding examples & Regressions in papers
Bolton, P. and Kacperczyk, M. (2023), Global Pricing of Carbon-Transition Risk. Journal of Finance, 78: 3677–3754. https://doi.org/10.1111/jofi.13272
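To connect the subscript notation above to code, here is a minimal sketch (simulated data, my own variable names) of the time-series case Yₜ = β₀ + β₁Xₜ₋₁ + εₜ, where this period's Y is regressed on the previous period's X.

```python
# Simulated time series where Y depends on last period's X; the lag is built
# with pandas' shift(), and the first period (which has no lag) is dropped.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
T = 200
x = rng.normal(size=T)
y = 2.0 + 0.7 * np.roll(x, 1) + rng.normal(size=T)   # Y_t = 2 + 0.7 * X_{t-1} + noise

ts = pd.DataFrame({"y": y, "x": x})
ts["x_lag"] = ts["x"].shift(1)                       # X_{t-1}

m = smf.ols("y ~ x_lag", data=ts.dropna()).fit()     # drop the first period (no lag available)
print(m.params)                                      # slope estimate should be near 0.7
```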