Lecture 4: Ordinary Least Squares (OLS)
Applied Statistics and Econometrics
Alessandro Casini
Outline of Lecture 4
1. Properties of OLS
   1.1 The Least Squares assumptions (SW 4.4)
   1.2 Sampling distribution of OLS (SW 4.5)
2. Hypothesis tests about β1 (SW 5.1)
3. Confidence intervals about β1 (SW 5.2)
4. Regression when X is binary (0/1) (SW 5.3)
5. Heteroskedasticity and homoskedasticity (SW 5.4)
6. The Gauss-Markov Theorem (SW 5.5)

Properties of OLS

OLS estimators are random variables
▶ The OLS estimates are computed from a sample of data. A different sample would give different values of β̂0 and β̂1.
▶ This means that β̂0 and β̂1 are random variables.
▶ Recall that β̂0 and β̂1 are functions of the Xi and Yi, which are random variables.
▶ As such, they have a distribution of values.
▶ We would like to know, at the very least, the expected value and variance of the OLS estimator.
▶ In fact, we would like to know whether we can fully determine the sampling distribution of the OLS estimator.

Probability framework for linear regression
▶ Population: the group of observations of interest (e.g., all possible school districts).
▶ Random variables: the variables Y and X relevant to the analysis of interest (e.g., test scores, STR).
▶ These random variables are characterized by a joint distribution, which is unknown.
▶ An object of interest in this joint distribution is the conditional expectation of Y given X, E(Y|X), because it tells us how the mean of Y is related to X.
▶ Does E(Y|X) increase or decrease with X? By how much?
▶ Data collection and simple random sampling: we have a sample of n units (entities) chosen at random from the population of interest, and we observe (record) Xi and Yi for each unit i = 1, ..., n.
▶ Simple random sampling implies that {(Xi, Yi)}, i = 1, ..., n, are independently and identically distributed (i.i.d.).
▶ That is, (Xi, Yi) is distributed independently of (Xj, Yj) for different units i and j.

Properties of OLS
▶ Is E(β̂1) = β1? That is, is β̂1 unbiased for β1?
▶ Is the variance of β̂1 small (compared to that of alternative estimators)?
▶ To answer these questions we need some assumptions about how Y and X are related to each other and about how the data were sampled.
▶ We first discuss these assumptions, known as the Least Squares Assumptions.
▶ We then derive the expected value and variance of β̂1, as well as its sampling distribution (in large samples).

Least Squares Assumptions (SW Section 4.4)
▶ The linear regression model with a single regressor is
\[ Y = \beta_0 + \beta_1 X + u \]
▶ The three least squares assumptions are:
Assumption #1: The conditional mean of u given X is zero. That is, for each x, E(u|X = x) = 0.
Assumption #2: (Xi, Yi) are i.i.d.
Assumption #3: Large outliers in Y and X are unlikely.
(A small simulated example of a data-generating process satisfying all three assumptions follows.)
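As a concrete illustration, here is a minimal Stata sketch (not from the slides; all variable names and parameter values are hypothetical) of a data-generating process that satisfies the three assumptions by construction:

    * Hypothetical DGP satisfying LSA #1-#3
    clear
    set seed 12345
    set obs 100
    generate x = rnormal(10, 2)   // i.i.d. draws (LSA #2); normal => finite fourth moments (LSA #3)
    generate u = rnormal(0, 1)    // drawn independently of x, so E(u|X) = 0 (LSA #1)
    generate y = 2 + 0.5*x + u    // true beta0 = 2, beta1 = 0.5
    regress y x                   // estimates should be close to (2, 0.5)

Rerunning with a different seed changes β̂0 and β̂1: that is exactly the sampling variability discussed above.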
Least Squares Assumption #1: mean-independence (conditional zero mean)
E(u|X = x) = 0
▶ That E(u|X = x) equals zero is just a normalization.
▶ The substantive part of LSA #1 is that E(u|X = x) is constant: it does not vary with x.
▶ This means that, on average, all the other factors that make up u do not vary with X, i.e., u is mean-independent of X.
▶ We often write E(u|X) as shorthand for E(u|X = x).
▶ LSA #1 is the most important assumption of the regression model, but it cannot be verified in the data.
▶ The other two assumptions are verifiable and are usually satisfied.

Regression and conditional expectation
▶ Under LSA #1, the population regression line is the conditional expectation E(Y|X):
\[ E(Y|X) = \beta_0 + \beta_1 X + E(u|X) = \beta_0 + \beta_1 X \]
because E(u|X) = 0 under LSA #1.
▶ The regression model therefore estimates the conditional expectation function E(Y|X), which traces how the mean of Y varies with X.
▶ β1 tells us how changes in X affect the (conditional) mean of Y.
▶ Recall that β1 is the causal effect of X on Y because it reflects the change in Y when X (and only X) changes, while u is held fixed (a thought experiment).

Least Squares Assumption #1: mean-independence (conditional zero mean)
[Figure: the distribution of u = Y − β0 − β1X appears to be the same for different values of X. This is a stronger condition than needed; LSA #1 requires only that E(u|X = x) not vary with x.]

Implication of Least Squares Assumption #1
▶ LSA #1 is crucial for giving a causal interpretation to the parameters of the linear model.
▶ Suppose it is violated, e.g., E(u|X) changes with X.
▶ Then, when X changes, u will also change.
▶ Recall the definition of a causal effect: the change in Y triggered by X and X alone.
▶ Without LSA #1, the observed change in Y reflects changes in both X and u, so we will not be able to estimate a causal effect.

OLS estimates a causal effect when LSA #1 holds
▶ Thought experiment: compute the change in E(Y|X) when X increases by one unit from x to x + 1:
\[ E(Y|X = x) = \beta_0 + \beta_1 x + E(u|X = x) \]
\[ E(Y|X = x+1) = \beta_0 + \beta_1 (x+1) + E(u|X = x+1) \]
▶ The difference in mean Y when X changes by one unit is
\[ E(Y|X = x+1) - E(Y|X = x) = \beta_1 + E(u|X = x+1) - E(u|X = x) \]
▶ LSA #1 implies E(u|X = x) = E(u|X = x+1) = 0, so
\[ E(Y|X = x+1) - E(Y|X = x) = \beta_1 \]
▶ Under LSA #1, observed changes in Y reflect only the changes in X.
▶ Because OLS uses observed changes in Y and X, we will later show that OLS is an unbiased estimator of the causal effect β1.

OLS does not estimate a causal effect when LSA #1 does not hold
▶ If LSA #1 is not true, the observed changes in Y will not reflect the causal effect of X. When LSA #1 fails,
\[ E(Y|X = x+1) - E(Y|X = x) = \beta_1 + E(u|X = x+1) - E(u|X = x) \]
▶ Observed changes in Y therefore reflect:
1. the causal effect of a change in X (via β1);
2. indirect changes via the changes in the factors that make up u.
▶ Thus, OLS will not estimate the causal effect of X on Y when LSA #1 fails (see the simulated sketch below).
▶ Importantly, we can always compute the OLS estimator; the computation does not depend on the assumptions.
▶ But we need LSA #1 to interpret it as an estimator of a causal effect.
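To see this failure in action, here is a hedged Stata sketch (variable names and all coefficients are invented for illustration) in the spirit of the test score example discussed next: family income sits in u and is correlated with class size, so OLS does not recover the causal effect.

    * Hypothetical sketch: OLS is biased when u is correlated with X
    clear
    set seed 1
    set obs 10000
    generate income = rnormal(50, 10)
    generate str = 25 - 0.2*income + rnormal(0, 2)   // richer districts have smaller classes
    generate u = 0.8*income + rnormal(0, 5)          // family income is part of the error
    generate testscr = 650 - 1*str + u               // true causal effect of STR is -1
    regress testscr str                              // slope estimate lands far from -1

The estimated slope mixes the causal effect (−1) with the indirect income channel, just as the decomposition above predicts.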
LSA #1: mean-independence and correlation
▶ From the analysis it is clear that what really matters is that E(u|X) not vary with X (E(u|X = x+1) = E(u|X = x)); it does not have to be zero!
▶ Mean-independence implies zero covariance (and correlation), but the converse is not true:
\[ E(u|X) = 0 \;\Rightarrow\; \mathrm{Cov}(X,u) = 0 \;\Rightarrow\; \mathrm{Corr}(X,u) = 0 \]
\[ \mathrm{Cov}(X,u) = \mathrm{Corr}(X,u) = 0 \;\not\Rightarrow\; E(u|X) = 0 \]
▶ It is easier to think about a possible correlation between X and u (instead of mean-independence). If we think they are correlated, we automatically know that LSA #1 fails.
▶ Unfortunately, we cannot verify this with data, since u is not observed.
▶ We will see throughout the course that mean-independence between X and u is a very questionable assumption.

Validity of LSA #1 in the test score-STR model
▶ Take the model Testscore = β0 + β1 × STR + u.
▶ Family income is part of u, since it surely affects test scores (positively).
▶ Children from rich families have more learning opportunities (private tutoring, computers, etc.) and end up doing better in school.
▶ It also makes sense that class size and family income are (negatively) correlated.
▶ More expensive schools, where rich families send their children, usually have smaller class sizes (more teachers per student).
▶ This would make STR and u (negatively) correlated.
▶ Then LSA #1 would fail in this model.

Randomized experiments and mean-independence
▶ A benchmark for thinking about the validity of mean-independence is an ideal randomized controlled experiment.
▶ If X were assigned randomly to the units in the sample, it could not be correlated with other characteristics of the units, i.e., the things that make up u.
▶ Can you view X in your model as if it were randomly assigned (even though you have observational, not experimental, data)?
▶ If the answer is yes, the assumption E(u|X = x) = 0 holds.
▶ Do you believe students and teachers are as if randomly assigned to classes of different sizes?

Least Squares Assumption #2: i.i.d.
▶ (Xi, Yi) are i.i.d. (independently and identically distributed).
▶ This arises automatically if the units (individuals, districts) are sampled by simple random sampling.
▶ Units are selected randomly from the same population, so their (Xi, Yi)'s are drawn independently and from the same population.
▶ Independence is across observations, not between X and Y!
▶ The main case in which non-i.i.d. sampling occurs is when data are recorded over time ("time series data").

Least Squares Assumption #3: no large outliers
▶ Large outliers (extreme values) in Y and X are unlikely.
▶ On a technical level, we assume they have finite fourth moments:
\[ E(Y^4) < \infty, \qquad E(X^4) < \infty \]
▶ Finite fourth moments hold, in particular, whenever a variable is bounded.
▶ Most economic data are bounded or drawn from distributions with finite fourth moments.
▶ Standardized test scores automatically satisfy this; STR, family income, etc. satisfy this too.
▶ The substance of this assumption is to rule out large outliers that can strongly influence the results.

Least Squares Assumption #3: no large outliers (figure)
[Figure: is the lone point an outlier in X or in Y?] In practice, outliers are often data glitches (coding or recording problems), so check your data for outliers, e.g., plot the data!
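A quick practical screen, sketched here under the assumption that the California test score data with variables testscr and str are loaded (the cutoffs are hypothetical):

    * Quick outlier screen for the test score data
    summarize testscr str, detail             // inspect min, max, and extreme percentiles
    scatter testscr str                       // plot the data; look for isolated points
    list testscr str if str > 25 | str < 14  // hypothetical cutoffs for a closer look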
Sampling distribution of the OLS estimator (SW 4.5)
▶ The OLS estimator is computed from a sample of data: a different sample gives a different value of β̂1.
▶ This is the source of the variance ("sampling uncertainty") of β̂1.
▶ We want to find out:
▶ E(β̂1): where is the distribution of β̂1 centered?
▶ Var(β̂1): to quantify the sampling uncertainty of β̂1;
▶ the distribution of β̂1 in finite ("small") samples;
▶ the distribution of β̂1 as the sample size n → ∞ ("large samples").

Preliminary algebra 1
Start with Yi = β0 + β1 Xi + ui. Taking the average on both sides,
\[ \bar Y = \beta_0 + \beta_1 \bar X + \bar u \]
and express the model in deviations from the average:
\[ Y_i - \bar Y = \beta_1 (X_i - \bar X) + (u_i - \bar u) \tag{1} \]
Substituting (1) into the expression for β̂1, we get
\[ \hat\beta_1 = \frac{\sum_{i=1}^n (Y_i-\bar Y)(X_i-\bar X)}{\sum_{i=1}^n (X_i-\bar X)^2} = \frac{\sum_{i=1}^n [\beta_1(X_i-\bar X)+(u_i-\bar u)](X_i-\bar X)}{\sum_{i=1}^n (X_i-\bar X)^2} = \beta_1 + \frac{\sum_{i=1}^n (X_i-\bar X)(u_i-\bar u)}{\sum_{i=1}^n (X_i-\bar X)^2} \]

Preliminary algebra 2
We have that
\[ \sum_{i=1}^n (X_i-\bar X)(u_i-\bar u) = \sum_{i=1}^n (X_i-\bar X)u_i - \Big[\sum_{i=1}^n X_i - n\bar X\Big]\bar u = \sum_{i=1}^n (X_i-\bar X)u_i - [n\bar X - n\bar X]\,\bar u = \sum_{i=1}^n (X_i-\bar X)u_i \]

Preliminary algebra 3
Putting both results together, the OLS estimator can be written as
\[ \hat\beta_1 = \beta_1 + \frac{\sum_{i=1}^n (X_i-\bar X)u_i}{\sum_{i=1}^n (X_i-\bar X)^2} = \beta_1 + \sum_{i=1}^n \frac{X_i-\bar X}{\sum_{j=1}^n (X_j-\bar X)^2}\, u_i \]
Let
\[ \omega_i = \frac{X_i-\bar X}{\sum_{j=1}^n (X_j-\bar X)^2} \]
Then the OLS estimator can also be expressed as
\[ \hat\beta_1 = \beta_1 + \sum_{i=1}^n \omega_i u_i \tag{2} \]

Expected value of the OLS estimator
Taking expectations conditional on all the observed Xi's on both sides of (2), and noting that each ωi is a function of X1, ..., Xn only, gives
\[ E[\hat\beta_1 | X_1,\dots,X_n] = \beta_1 + \sum_{i=1}^n \omega_i\, E[u_i | X_1,\dots,X_n] = \beta_1 + \sum_{i=1}^n \omega_i\, E[u_i | X_i] = \beta_1 \]
where the second equality uses LSA #2 (independence across observations) and the last uses LSA #1 (E(ui|Xi) = 0).

OLS is unbiased
▶ Under LSA #1 and #2, E(β̂1 | X1, ..., Xn) = β1.
▶ And since this conditional expectation is constant for any values of the X's, we also have E(β̂1) = β1.
▶ The OLS estimator is unbiased under Least Squares Assumptions #1 and #2.
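Unbiasedness is easy to check by simulation. A minimal sketch using Stata's simulate command (the program name olssim and all parameter values are hypothetical):

    * Hypothetical simulation check that E(beta1-hat) = beta1
    capture program drop olssim
    program define olssim, rclass
        clear
        set obs 50
        generate x = rnormal(0, 1)
        generate y = 1 + 2*x + rnormal(0, 1)   // true beta1 = 2
        regress y x
        return scalar b1 = _b[x]
    end
    simulate b1 = r(b1), reps(1000) seed(42): olssim
    summarize b1    // mean of the 1,000 estimates should be close to 2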
Variance of the OLS estimator
▶ Recall that we were able to write
\[ \hat\beta_1 = \beta_1 + \frac{\sum_{i=1}^n (X_i-\bar X)u_i}{\sum_{i=1}^n (X_i-\bar X)^2} \quad\Rightarrow\quad \hat\beta_1 - \beta_1 = \frac{\sum_{i=1}^n (X_i-\bar X)u_i}{\sum_{i=1}^n (X_i-\bar X)^2} \]
▶ By the Law of Large Numbers (LLN),
\[ \frac{1}{n}\sum_{i=1}^n (X_i-\bar X)^2 \xrightarrow{p} \sigma_X^2 = \mathrm{Var}(X) \]
and, in large samples, replacing X̄ by µX in the numerator has a negligible effect, so the numerator behaves like the average of the terms (Xi − µX)ui.
▶ This implies that, in large samples,
\[ \hat\beta_1 - \beta_1 \approx \frac{\sum_{i=1}^n (X_i-\mu_X)u_i}{n\sigma_X^2} \]

Variance of the OLS estimator (cont.)
▶ Since Var(β̂1) = Var(β̂1 − β1), we have, for large n,
\[ \mathrm{Var}(\hat\beta_1) = \mathrm{Var}\!\left(\frac{\sum_{i=1}^n (X_i-\mu_X)u_i}{n\sigma_X^2}\right) = \frac{\mathrm{Var}\!\left(\sum_{i=1}^n (X_i-\mu_X)u_i\right)}{n^2(\sigma_X^2)^2} = \frac{n\,\mathrm{Var}\big((X_i-\mu_X)u_i\big)}{n^2(\sigma_X^2)^2} = \frac{1}{n}\cdot\frac{\mathrm{Var}\big((X_i-\mu_X)u_i\big)}{(\sigma_X^2)^2} \]
where the third equality uses LSA #2.
▶ Key points:
1. The variance of β̂1 is inversely proportional to the sample size n (just like Var(Ȳ)).
2. The larger the variance of X, the smaller the variance of β̂1.

Larger variance of X, smaller variance of the OLS estimator
[Figure] If there is more variation in X, then there is more information in the data that you can use to pin down the slope of the regression line.

Estimating the variance of OLS
▶ The expression for the variance of β̂1 (for large n) is
\[ \sigma^2_{\hat\beta_1} = \mathrm{Var}(\hat\beta_1) = \frac{1}{n}\cdot\frac{\mathrm{Var}\big((X_i-\mu_X)u_i\big)}{(\sigma_X^2)^2} \]
▶ What is Var((Xi − µX)ui)? What is σX²? If we want to know the variance of β̂1, we need to know these quantities.
▶ The estimator of the variance of β̂1 replaces the unknown population values with estimators constructed from the data:
\[ \hat\sigma^2_{\hat\beta_1} = \widehat{\mathrm{Var}}(\hat\beta_1) = \frac{1}{n}\cdot\frac{\text{estimator of } \mathrm{Var}((X_i-\mu_X)u_i)}{\big(\text{estimator of } \sigma_X^2\big)^2} \]

Estimating the variance of OLS (cont.)
▶ We estimate Var((Xi − µX)ui) with
\[ \frac{1}{n-2}\sum_{i=1}^n (X_i-\bar X)^2\,\hat u_i^2 \]
where ûi is the OLS residual, ûi = Yi − (β̂0 + β̂1 Xi).
▶ We estimate σX² with \( \frac{1}{n}\sum_{i=1}^n (X_i-\bar X)^2 \).
▶ Thus the robust standard error of β̂1 is
\[ SE(\hat\beta_1) = \sqrt{\widehat{\mathrm{Var}}(\hat\beta_1)} = \sqrt{\frac{1}{n}\times\frac{\frac{1}{n-2}\sum_{i=1}^n (X_i-\bar X)^2\,\hat u_i^2}{\Big[\frac{1}{n}\sum_{i=1}^n (X_i-\bar X)^2\Big]^2}} \]
▶ It looks complicated, but the software does it for us. In Stata's regression output it appears in the "Std. Err." column.
▶ These are called heteroskedasticity-robust standard errors, or simply White standard errors (from White (1980)).

Stata's estimate of the SE of OLS

. reg testscr str, robust

Linear regression                               Number of obs =        420
                                                F(1, 418)     =      19.26
                                                Prob > F      =     0.0000
                                                R-squared     =     0.0512
                                                Root MSE      =     18.581

------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         str |  -2.279808   .5194892    -4.39   0.000    -3.300945   -1.258671
       _cons |    698.933   10.36436    67.44   0.000     678.5602    719.3057
------------------------------------------------------------------------------

▶ SE(β̂1) is the "Robust Std. Err." entry in the str row.
▶ Notice the use of the robust option in the regress command. This option did not appear in a previous Stata output example. This detail is important... but we'll need to cover a bit more material before you understand why.
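As a sanity check, the robust SE formula above can be computed by hand in Stata. A sketch, assuming the California data with testscr and str are in memory; it should reproduce the "Robust Std. Err." reported above up to rounding:

    * Hand computation of the heteroskedasticity-robust SE (sketch)
    quietly regress testscr str
    predict uhat, residuals
    quietly summarize str
    scalar n    = r(N)
    scalar xbar = r(mean)
    generate double num_i = (str - xbar)^2 * uhat^2
    generate double den_i = (str - xbar)^2
    quietly summarize num_i
    scalar num = r(sum) / (n - 2)          // estimator of Var((Xi - muX)ui)
    quietly summarize den_i
    scalar den = (r(sum) / n)^2            // squared estimator of Var(X)
    display "robust SE of beta1-hat: " sqrt((1/n) * num / den)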
Consistency of OLS
▶ When the Least Squares Assumptions hold, we can show that the OLS estimator (probabilistically) approaches the true parameter β1 as the sample size increases:
\[ \hat\beta_1 \xrightarrow{p} \beta_1 \]
▶ That is, OLS is a consistent estimator of β1.
▶ Intuition: as n → ∞ we have Var(β̂1) → 0 (verify this!), so the estimator gets closer and closer to its mean, which is β1 since OLS is unbiased when the least squares assumptions hold.
▶ We will derive the probability limit of β̂1 in a future class.

Large-sample distribution of the OLS estimator
▶ The exact sampling distribution is complicated because it depends on the distribution of (Y, X), and we did not (and do not want to) assume anything about this.
▶ However, when the sample size n is large we get a simple (and good) approximation:
\[ \hat\beta_1 \sim N\big(\beta_1, \sigma^2_{\hat\beta_1}\big), \qquad \sigma^2_{\hat\beta_1} = \mathrm{Var}(\hat\beta_1) = \frac{1}{n}\cdot\frac{\mathrm{Var}\big((X_i-\mu_X)u_i\big)}{(\sigma_X^2)^2} \]

Asymptotic distribution of OLS
▶ The previous result is difficult to use because N(β1, σ²β̂1) is difficult to compute.
▶ Instead, we standardize β̂1 and use the standard normal distribution:
\[ \frac{\hat\beta_1 - \beta_1}{\sqrt{\mathrm{Var}(\hat\beta_1)}} \sim N(0,1) \]
▶ N(0,1) is a good approximation to the true (unknown) distribution, and it still works when we replace the (usually unknown) variance of β̂1 by a consistent estimator:
\[ \frac{\hat\beta_1 - \beta_1}{\sqrt{\widehat{\mathrm{Var}}(\hat\beta_1)}} = \frac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} \sim N(0,1) \]
▶ This result is called the asymptotic distribution of β̂1.

Parallels between properties of OLS and of the sample mean
▶ Unbiasedness: E[β̂1] = β1, just as E[Ȳ] = µY.
▶ Consistency: β̂1 →p β1, just as Ȳ →p µY.
▶ Approximate normality: β̂1 ~ N(β1, σ²β̂1) approximately, just as Ȳ ~ N(µY, σ²Ȳ) approximately.
▶ Variance: σ²β̂1 = (1/n) · Var((Xi − µX)ui) / (σX²)², just as σ²Ȳ = σY²/n.

Summary of the sampling distribution of OLS
If the three LS assumptions hold, then:
1. E(β̂1) = β1 (that is, β̂1 is unbiased), and
\[ \mathrm{Var}(\hat\beta_1) = \frac{1}{n}\cdot\frac{\mathrm{Var}\big((X_i-\mu_X)u_i\big)}{(\sigma_X^2)^2} \;\propto\; \frac{1}{n} \]
2. Other than its mean and variance, the exact small-n distribution of β̂1 is complicated and depends on the distribution of (Y, X).
3. β̂1 is a consistent estimator of β1: β̂1 →p β1.
4. When n is large, approximately (or asymptotically)
\[ \frac{\hat\beta_1 - \beta_1}{\sqrt{\widehat{\mathrm{Var}}(\hat\beta_1)}} \sim N(0,1) \]
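The large-sample normality in point 4 can also be eyeballed by simulation. A sketch that assumes the hypothetical olssim program defined earlier is still in memory:

    * Illustration of the normal approximation (reuses olssim from above)
    simulate b1 = r(b1), reps(2000) seed(7): olssim
    quietly summarize b1
    generate z = (b1 - 2) / r(sd)   // standardize around the true beta1 = 2
    histogram z, normal             // histogram with a standard normal overlay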
Testing Hypotheses (SW 5.1)
▶ Now that we know the sampling distribution of the OLS estimator, we are ready to test hypotheses about β1.
▶ We start with an example: a government official believes the number of students in a class has no effect on learning or, specifically, on test scores.
▶ Recall the model Testscore = β0 + β1 STR + u. This person is therefore asserting that β1 = 0.
▶ We, as applied economists, want to treat this assertion as a null hypothesis, H0: β1 = 0, and use the data to test it.

Null and alternative hypotheses
▶ The null hypothesis and two-sided alternative are
\[ H_0: \beta_1 = \beta_{1,0} \quad\text{vs.}\quad H_1: \beta_1 \neq \beta_{1,0} \]
where β1,0 is the hypothesized value of β1 under the null (e.g., β1,0 = 0).
▶ The null hypothesis and one-sided alternative are
\[ H_0: \beta_1 = \beta_{1,0} \quad\text{vs.}\quad H_1: \beta_1 < \beta_{1,0} \ (\text{or } \beta_1 > \beta_{1,0}) \]
▶ In economics, it is almost always possible to come up with stories in which an effect could go either way, so it is standard to focus on two-sided alternatives.

General approach to testing
▶ In general, the t-statistic has the form
\[ t = \frac{\text{estimator} - \text{hypothesized value under } H_0}{\text{standard error of the estimator}} \]
▶ For testing the mean of Y this becomes
\[ t = \frac{\bar Y - \mu_{Y,0}}{s_Y/\sqrt{n}} \]
▶ And for testing β1 this becomes
\[ t = \frac{\hat\beta_1 - \beta_{1,0}}{\sqrt{\hat\sigma^2_{\hat\beta_1}}} = \frac{\hat\beta_1 - \beta_{1,0}}{SE(\hat\beta_1)} \]
where SE(β̂1) is the square root of the estimated variance of the sampling distribution of β̂1.

Procedure for testing the null against a two-sided alternative
▶ Construct the t-statistic
\[ t = \frac{\hat\beta_1 - \beta_{1,0}}{SE(\hat\beta_1)} \]
▶ Reject at the 5% significance level if |t| > 1.96.
▶ The p-value is Pr[|t| > |t^act|], the probability in the tails of the standard normal distribution beyond |t^act|.
▶ Reject H0 at the 5% significance level if the p-value is < 0.05.
▶ In general, reject H0 at the α% significance level if the p-value is < α/100.
▶ This procedure relies on the large-n approximation, under which the Student t and normal distributions are very similar. Typically n = 100 is large enough for the approximation to be excellent.

Testing whether class size affects performance
▶ We test whether STR has any effect on test scores:
\[ H_0: \beta_1 = 0 \quad\text{vs.}\quad H_1: \beta_1 \neq 0 \]
\[ \widehat{testscr} = \underset{(10.36)}{698.93} - \underset{(0.52)}{2.28}\,STR \]
▶ The t-statistic is
\[ t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)} = \frac{-2.28}{0.52} = -4.39 \]

t-statistic in Stata: test scores and STR
▶ The "t" column in Stata's regression output gives the t-statistic for the hypothesis that the coefficient is zero, e.g., H0: β1 = 0.
▶ Since |t| > 1.96, we reject the null hypothesis at 5% (and at any higher level).
▶ Can we reject at 1%? What is the probability that the t-statistic takes a value below −4.39 or above 4.39? (See the computation below.)
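One way to answer this, a one-line sketch using Stata's normal() function (the standard normal CDF):

    display 2*(1 - normal(4.39))   // two-sided p-value, about 1.1e-05

Since this p-value is far below 0.01, we also reject at the 1% level.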
Test for the significance of a regressor
▶ The t-statistic reported by Stata is only for the null hypothesis H0: β1 = 0.
▶ Thus, the p-value reported by Stata is the p-value for H0: β1 = 0 vs. H1: β1 ≠ 0.
▶ The test H0: β1 = 0 vs. H1: β1 ≠ 0 is perhaps the most common one, and we call it a test for the significance of a regressor.
▶ If we reject H0: β1 = 0, we say that X has a "statistically significant" effect on Y.
▶ If we do not reject H0: β1 = 0, we say that the regressor X has no statistically significant effect on Y.

Testing other null hypotheses
▶ If you test H0: β1 = −2 vs. H1: β1 ≠ −2, you cannot use the t-statistic and p-value reported in Stata's output!
▶ You need to do the test manually using the information in Stata's output:
\[ t = \frac{\hat\beta_1 - (-2)}{SE(\hat\beta_1)} = \frac{-0.28}{0.52} = -0.54 \]
▶ In this case you cannot reject the null at the 5% significance level, since |t| < 1.96.
▶ We do not know the p-value of this statistic offhand, although it can be computed rather easily.

Confidence intervals for β1 (SW 5.2)
▶ Recall that a 95% confidence interval for β1 can be described, equivalently, as:
▶ an interval (that is a function of the data) that contains the true parameter value in 95% of repeated samples; or
▶ the set of values that cannot be rejected at the 5% significance level.
▶ In general, because for large n the sampling distribution of an estimator is normal, a 95% confidence interval can be constructed as the estimator ± 1.96 × its standard error.
▶ Thus, a 95% confidence interval for β1 is
\[ \hat\beta_1 \pm 1.96 \times SE(\hat\beta_1) = \Big(\hat\beta_1 - 1.96\,SE(\hat\beta_1),\ \hat\beta_1 + 1.96\,SE(\hat\beta_1)\Big) \]
▶ And a 90% confidence interval is
\[ \hat\beta_1 \pm 1.64 \times SE(\hat\beta_1) = \Big(\hat\beta_1 - 1.64\,SE(\hat\beta_1),\ \hat\beta_1 + 1.64\,SE(\hat\beta_1)\Big) \]

Confidence interval: test scores and STR
\[ \widehat{testscr} = \underset{(10.36)}{698.93} - \underset{(0.52)}{2.28}\,STR, \qquad \hat\beta_1 = -2.28,\quad SE(\hat\beta_1) = 0.52 \]
▶ 95% confidence interval: −2.28 ± 1.96 × 0.52 = (−3.30, −1.26)
▶ 90% confidence interval: −2.28 ± 1.64 × 0.52 = (−3.13, −1.43)

Stata computes the 95% confidence interval
▶ Stata reports the 95% confidence interval by default; other confidence intervals (e.g., 90%) have to be computed manually from the information in Stata's output and the appropriate critical values.
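Both "manual" computations above can be scripted with Stata's stored estimation results, a sketch using _b[] and _se[] (Stata's stored coefficients and standard errors):

    quietly regress testscr str, robust
    * t-statistic and p-value for H0: beta1 = -2
    scalar t = (_b[str] - (-2)) / _se[str]
    display "t = " t "   p-value = " 2*(1 - normal(abs(t)))   // about 0.59
    * 90% confidence interval
    display "90% CI: (" _b[str] - 1.64*_se[str] ", " _b[str] + 1.64*_se[str] ")"

(Stata's level(90) option on regress also reports a 90% interval directly, using the slightly more precise 1.645 critical value.)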
Regression when X is binary (SW 5.3)
▶ Often a regressor is binary, taking only the values 0 and 1:
▶ X = 1 if female, 0 if male;
▶ X = 1 if treated (experimental drug), 0 if not treated;
▶ X = 1 if small class size, 0 otherwise.
▶ Binary regressors are often called "dummy" variables or regressors.
▶ A binary X affects neither the computation of the OLS estimators nor their properties.
▶ Everything we did so far applies to any type of X: continuous, discrete, etc.
▶ But it does not make much sense to call β1 the "slope" of the regression line when X is binary.
▶ How do we interpret β1 when the regressor is binary?

Interpretation of the slope parameter when X is binary
▶ Consider Y = β0 + β1 X + u, with X binary.
▶ Then, assuming LSA #1,
\[ E[Y|X=0] = \beta_0 \quad \text{(mean of } Y \text{ given } X=0\text{, e.g., males)} \]
\[ E[Y|X=1] = \beta_0 + \beta_1 \quad \text{(mean of } Y \text{ given } X=1\text{, e.g., females)} \]
▶ Thus,
\[ \beta_1 = E[Y|X=1] - E[Y|X=0] \]
▶ When the regressor is a dummy, β1 is the population difference in group means.

Example of a binary X
▶ When X is binary we usually denote it by D (for dummy). For example,
\[ Testscore = \beta_0 + \beta_1 D + u, \qquad D_i = \begin{cases} 1 & \text{if } STR_i < 20 \text{ (small class)} \\ 0 & \text{if } STR_i \geq 20 \text{ (large class)} \end{cases} \]

Generating a dummy variable in Stata
▶ There are several options. For example:
    gen D = (str < 20)

Comparison with the difference in group means
▶ In Lecture 1 we computed group means and standard deviations, and the difference in test score means is
\[ \bar Y_{small} - \bar Y_{large} = 657.35 - 649.98 = 7.37 \]
which is exactly the OLS estimate of β1... as expected.
▶ We also showed in Lecture 1 that
\[ SE(\bar Y_{small} - \bar Y_{large}) = \sqrt{\frac{s^2_{small}}{n_{small}} + \frac{s^2_{large}}{n_{large}}} = \sqrt{\frac{19.36^2}{238} + \frac{17.85^2}{182}} = 1.824 \]
as in the regression output.
▶ Conversely, we can use the OLS estimates to recover the mean test scores in the two groups of districts:
\[ \widehat{E[Y_i|D_i=0]} = \hat\beta_0 = 649.9788 \]
\[ \widehat{E[Y_i|D_i=1]} = \hat\beta_0 + \hat\beta_1 = 649.9788 + 7.3724 = 657.35 \]
exactly as in the table shown earlier.
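A quick sketch verifying this equivalence on the test score data (assuming testscr and str are in memory):

    gen D = (str < 20)
    regress testscr D, robust      // _b[D] = difference in means; _b[_cons] = large-class mean
    tabstat testscr, by(D) statistics(mean n)   // compare with the group means directly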
Another example: the gender gap in monthly wages
▶ Model: wage_i = β0 + β1 Female_i + u_i, estimated on Italian LFS data, where Female is a dummy variable equal to 1 if the individual is female.
▶ How large is the pay gap between men and women?
▶ Is the gender pay gap statistically significant? Economically significant?
▶ What is the 95% confidence interval for the gender pay gap?
▶ What is the mean monthly wage of a man? Of a woman?

. reg wage female, robust

Linear regression                               Number of obs =     26,127
                                                F(1, 26125)   =    2212.72
                                                Prob > F      =     0.0000
                                                R-squared     =     0.0775
                                                Root MSE      =     501.86

------------------------------------------------------------------------------
             |               Robust
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -291.3592   6.193919   -47.04   0.000    -303.4997   -279.2188
       _cons |   1444.535   4.423072   326.59   0.000     1435.866    1453.205
------------------------------------------------------------------------------

Heteroskedasticity and Homoskedasticity (SW 5.4)

The (conditional) variance of u
▶ The regression model is Y = β0 + β1 X + u.
▶ If Var(u|X = x) is constant, then u is said to be homoskedastic.
▶ Homoskedasticity means the variance of the conditional distribution of u given X does not depend on X.
▶ Otherwise, u is heteroskedastic.
▶ Note that Var(Y|X = x) = Var(u|X = x), because Var(β0 + β1X | X = x) = 0; but Var(Y) ≠ Var(u) in general.

Heteroskedasticity and homoskedasticity
▶ Consider the example wage_i = β0 + β1 educ_i + u_i.
▶ Homoskedasticity means that the variance of u_i does not change with the education level.
▶ We do not know anything about Var(u_i | educ_i) a priori, but we can use data to check whether the variance of wages changes with education.
▶ One option is to compute sample variances at different education levels. If u_i is homoskedastic, these variances should be approximately the same (a sketch follows below).
▶ Homoskedasticity is often not a realistic assumption.

Checking for homoskedasticity
[Figure]

Graphical representation of heteroskedasticity in the test score-class size example
[Figure]
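A rough way to run this check in Stata, a sketch assuming variables named wage and educ (hypothetical here) are available:

    * Rough check of whether the residual spread varies with education
    quietly regress wage educ
    predict uhat, residuals
    generate uhat2 = uhat^2
    tabstat uhat2, by(educ) statistics(mean)   // approximates Var(u|educ); look for big differences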
Impact of heteroskedasticity/homoskedasticity
▶ What, if any, is the impact on the OLS estimator of having homoskedastic or heteroskedastic errors?
▶ It turns out that whether the error is heteroskedastic or homoskedastic does not affect most properties of OLS:
▶ the OLS estimator is still unbiased;
▶ the OLS estimator is still consistent;
▶ the OLS estimator is still asymptotically normally distributed.
▶ All these properties follow from the three Least Squares Assumptions... and those make no mention of the conditional variance of u.

Impact of heteroskedasticity/homoskedasticity on the variance
▶ The only place where the distinction matters is in the formula for computing the variance of the OLS estimator.
▶ The formula we presented in these slides is always valid, irrespective of whether there is heteroskedasticity or homoskedasticity.
▶ This formula gives heteroskedasticity-robust standard errors.
▶ There is another formula (not presented here) that is valid only under homoskedasticity. It is referred to as the homoskedasticity-only formula.
▶ It is often the default setting in statistical software.
▶ Remember: if you use the wrong standard errors, your tests and confidence intervals will also be wrong.
▶ To insure against this, you should always use heteroskedasticity-robust standard errors, because they are also valid if the error is homoskedastic.

Heteroskedasticity-robust SEs in Stata
▶ The default in Stata is to compute homoskedasticity-only standard errors.
▶ To get the heteroskedasticity-robust version, you must override the default (in Stata, with the robust option... as you may have noticed in previous examples).
▶ If you don't use the robust option and there is in fact heteroskedasticity, your standard errors (and t-statistics and confidence intervals) will be wrong.
[Figure: Stata output with heteroskedasticity-robust standard errors.]
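A side-by-side sketch on the test score data (assuming it is loaded) makes the point: only the standard errors change, never the coefficients.

    regress testscr str            // homoskedasticity-only SEs (Stata's default)
    regress testscr str, robust    // heteroskedasticity-robust SEs; same Coef. column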
The Gauss-Markov Theorem (SW 5.5)

Why is OLS so popular? The Gauss-Markov Theorem
▶ We have already learned a great deal about OLS:
1. OLS is unbiased and consistent (under the three LS assumptions);
2. we have a formula for heteroskedasticity-robust standard errors;
3. we can use them to construct confidence intervals and test statistics.
▶ A natural question to ask is whether there are other estimators that might have a smaller variance, i.e., are more precise, than OLS.
▶ We can always find an estimator with a very small variance (e.g., use the number 1.1 as your estimator irrespective of the data).
▶ But low variance is only good if the estimator's distribution is centered on the true parameter.
▶ So the real question is whether there are estimators that are unbiased, as OLS is, and have a smaller variance than OLS.
▶ We focus on all estimators that are linear functions of Y1, ..., Yn and unbiased.
▶ The OLS estimator is indeed such a linear function under the LS assumptions (verify that we can write β̂1 = Σ_{i=1}^n ωi Yi).
▶ We then have the following important result:

Theorem (Gauss-Markov Theorem). If the three Least Squares Assumptions hold and if the errors are homoskedastic, then the OLS estimator is the Best Linear Unbiased Estimator (BLUE).

▶ This is an important result: it says that if, in addition to LSA #1-#3, the errors are homoskedastic, then OLS has the smallest variance among all linear unbiased estimators.
▶ This makes OLS the best estimator in this class of estimators (linear and unbiased), and explains why it is so popular.
▶ An estimator with the smallest variance is called an efficient estimator.
▶ The GM theorem says that OLS is the efficient estimator among the linear unbiased estimators of β1.
▶ The set of four assumptions needed for the GM theorem to hold (the three LS assumptions plus homoskedasticity) is sometimes called the "Gauss-Markov Assumptions".

Limitations of the Gauss-Markov Theorem
▶ For the GM theorem to hold we need homoskedasticity, which is often not a realistic assumption.
▶ If there is heteroskedasticity, the GM theorem does not hold. This means that there may be more efficient estimators than OLS (and there are!).
▶ But OLS is still unbiased and consistent if the three LS assumptions hold... it just may not have the smallest possible variance.
▶ Moreover, the GM theorem applies only to linear estimators. There may be non-linear estimators with lower variance.

Summary of Lecture 4 (bivariate OLS)
▶ We analyzed a simple linear model describing the relationship between an outcome and a single variable (regressor).
▶ We learnt how to use a sample of data to estimate the intercept and slope of this line.
▶ The OLS estimator not only has an intuitive motivation (minimizing the sum of squared errors) but also, given the appropriate assumptions, has good statistical properties.
▶ We learnt how to use the OLS estimator to test hypotheses about the parameters of the model and to build confidence intervals.