Quantitative Methods Lecture 4 (2024-2025)
Document Details
Universiteit Leiden
2024
Dr. Brendan Carroll
Summary
This lecture, the fourth meeting of the course, covers quantitative methods, focusing on regression analysis: the properties of estimators and the assumptions of OLS (Ordinary Least Squares), Part I. The slides are from Leiden University.
Full Transcript
Quantitative Methods - Fourth meeting: Properties of estimates; Assumptions of OLS Part I
Dr. Brendan Carroll, Leiden University

Recap of Last Week
- Hypothesis testing in regression
- Interpretation of regression results
- Variables in regression, including dummy (binary/dichotomous) variables
- Multivariate regression: why? what? how?

This Week
- Models for predicting, models for inference
- Properties of estimators and their relationship to the sampling distribution of b
- The assumptions of OLS and the Gauss-Markov theorem
- Assumption 1: Linear model
- Assumption 2: (Conditional) mean independence, Part I

Why Regression?
- To predict the dependent variable (easier)
- To make causal inferences (more difficult)
- These are not mutually exclusive; we can do both, but then we need to meet the criteria for each

Why Regression? - I
Regression to predict the dependent variable:
- We don't care that correlation ≠ causation
- We just want to choose Xs to minimize our errors of prediction
- In other words, we want to maximize R-squared
- No assumptions needed
- Examples: predicting stocks, financial markets, economic indicators, election outcomes

$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_k X_{ki} + u_i$

Why Regression? - II
Regression to make causal inferences:
- We want to know if X causes Y
- We want each b to be a good estimate of β, the true effect of X on Y
- We must satisfy certain assumptions to make good estimates
- We don't care about minimizing prediction errors
- Examples: social science, public policy making, epidemiology

$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_k X_{ki} + u_i$

Properties of Estimators
- b is our estimate of β
- How good an estimate is b?
- We want our estimate to be unbiased: on average, b = β
- We want our estimate to be efficient ("best"): minimum S_b, where S_b is the standard error of the estimate
- Which property is more important?

Sampling Distribution
- Recall that b1 = some value (obtained from one sample) is our estimate of β1
- b1* = some other value for some other sample
- Because we could collect different samples of size n, b1 (our estimator) is itself a variable with a distribution
- The distribution of our estimator is the sampling distribution

Properties of the Sampling Distribution
- Any value of b1 is possible, but values close to β1 are more likely
- On average, b1 = β1 (provided the assumptions are met); that is, the estimator is unbiased
- But there is error (the standard error of the estimate), which can be estimated; in the bivariate case:

$S_b = \sqrt{\dfrac{\sum (Y_i - \hat{Y}_i)^2}{\sum (X_i - \bar{X})^2 \,(n-2)}}$

Standard Error of the Estimate
- In the trivariate case:

$S_{b_1} = \sqrt{\dfrac{\sum (Y_i - \hat{Y}_i)^2}{\sum (X_{1i} - \bar{X}_1)^2 \,(1 - r^2_{X_1 X_2})(n-3)}} \qquad S_{b_2} = \sqrt{\dfrac{\sum (Y_i - \hat{Y}_i)^2}{\sum (X_{2i} - \bar{X}_2)^2 \,(1 - r^2_{X_1 X_2})(n-3)}}$

- General multivariate case:

$S_{b_1} = \sqrt{\dfrac{\sum (Y_i - \hat{Y}_i)^2}{\sum (X_{1i} - \bar{X}_1)^2 \,(1 - R^2_1)(n-k-1)}}$

where $R^2_1$ is the R-squared from regressing $X_1$ on the other independent variables.
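The unbiasedness claim and the bivariate S_b formula can be checked by simulation. Below is a minimal sketch (Python with NumPy; the true parameter values and the design are invented for illustration, not from the lecture): draw many samples from a known model, estimate b each time, and compare the mean and spread of the estimates with β and with the analytic formula.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 0.5          # true parameters (assumed for illustration)
n, n_samples = 50, 10_000

b_estimates = []
for _ in range(n_samples):
    x = rng.uniform(0, 10, n)
    u = rng.normal(0, 1, n)     # disturbances with mean zero
    y = alpha + beta * x + u
    b, a = np.polyfit(x, y, 1)  # OLS slope and intercept for this sample
    b_estimates.append(b)

b_estimates = np.array(b_estimates)
print("mean of b:", b_estimates.mean())  # ~0.5: on average b equals beta
print("sd of b:  ", b_estimates.std())   # spread of the sampling distribution

# Bivariate S_b formula applied to the last sample: it estimates that spread
y_hat = a + b * x
s_b = np.sqrt(np.sum((y - y_hat) ** 2) / ((n - 2) * np.sum((x - x.mean()) ** 2)))
print("formula S_b:", s_b)
```

With the model assumptions of the next slides satisfied, the simulated mean of b sits on β (unbiasedness) and the simulated spread matches what the S_b formula reports from a single sample.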
Gauss-Markov Theorem (pp. 122-123)
OLS gives the best, unbiased estimates (OLS is BLUE: the Best Linear Unbiased Estimator) if:
- 1: Y is a linear function of the Xs plus disturbances
- 2: (Conditional) mean independence (disturbances have mean zero)
- 3: Homoskedasticity
- 4: Uncorrelated disturbances
- 5: Disturbances are normally distributed

The Value of These Assumptions
- Assumptions 1 and 2 guarantee unbiasedness
- Assumptions 1, 2, 3 and 4 guarantee efficiency
- Assumption 5 enables accurate hypothesis tests at (relatively) small sample sizes
- Note 1: a violation of assumption 1 or 2 may also affect efficiency, but if our estimates are biased, who cares about efficiency?
- Note 2: if we reject the null hypothesis, inefficiency does not matter

Linearity Assumption - I
The dependent variable Y is a linear function of the Xs plus a random error or disturbance:

$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_k X_{ki} + u_i$

- What does linearity imply about the relationship between X and Y?
- That the effect of X on Y is constant over the full range of X

Linearity Assumption - II
[Figure: illustration of the linearity assumption]

Linearity Assumption - III
- In reality, the assumption is never perfectly met
- In many cases, we can use OLS to estimate non-linear relationships (and remain BLUE) by transforming the variables (next week); see the sketch below:
- Logarithmic transformation
- Quadratic transformation
- Interaction of two or more Xs

Linearity Assumption - IV
- If nonlinearity cannot be accommodated in OLS through transformation, we have many nonlinear estimation techniques:
- Logit and probit for dichotomous Y
- Poisson and negative binomial for count Y
- Multilevel models for Y at one level and Xs at multiple levels
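A minimal sketch of the quadratic-transformation idea mentioned above (Python with NumPy; the data-generating process and coefficient values are invented for illustration): the relationship between X and Y is nonlinear, but the model remains linear in the parameters, so ordinary OLS can estimate it once X² is added as a regressor.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-3, 3, n)
# Invented quadratic relationship: nonlinear in x, linear in the parameters
y = 1.0 + 2.0 * x - 0.8 * x**2 + rng.normal(0, 1, n)

# Design matrix with a constant, x, and the transformed term x^2
X = np.column_stack([np.ones(n), x, x**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS via least squares
print(coef)  # approximately [1.0, 2.0, -0.8]
```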
Mean Independence - Part I

(Conditional) Mean Independence
The mean of u, the error/residual, conditional on X, equals zero (and thus does not depend on the Xs). This is the most important assumption for ensuring that OLS is not biased!

$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_k X_{ki} + u_i$

(Random) Error/Residual u
[Figure: scatter plot of Y (0-12) against X (0-12) with a fitted regression line; the residual u is the vertical distance between each observation and the line]

Violations of Mean Independence
- Measurement error in the independent variables
- Reverse causation
- Specification error: omission of relevant variables
- Specification error: wrong functional form (next week)

Measurement Error
- Systematic measurement error
- Random measurement error
- In the dependent variable
- In the independent variable(s)

Systematic Measurement Error (I)
- Measuring something in addition to, instead of, or as an incomplete part of the true concept of interest
- Because the error is defined relative to the true concept of interest, identifying it depends on first having a good definition
- Always results in bias, inefficiency, and nonsense
- Can only be dealt with during research design

Systematic Measurement Error (II)
- Example: GDP as a measure of national wealth
- GDP measures only the monetary value of goods and services produced in a country
- It values destruction of ecosystems that generates short-term revenues, and undervalues unpaid "household" and other work
- Example: survey design, measuring feminism, and surveyor-induced measurement error
- "Should men and women get equal pay for equal work?"
- Measures the extent to which social pressure induces individuals to answer questions in a certain way

Random Measurement Error
- For a particular observation, the observed value differs from the true value
- Call this difference "error"
- The errors are random (that is, for some observations they are bigger, for others smaller, and for still others there is no error at all, and these differences are unpredictable)
- Causes: faulty measuring tool, carelessness, rounding

In the Dependent Variable (I)

$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_k X_{ki} + u_i$

- where u is the difference between predicted Y and actual Y for each observation
- Now let e be the error in measurement of Y for a particular observation; Y is the true value but only Y* is observed:

$Y_i^* = Y_i + e_i$

- By using Y* we are really regressing:

$Y_i^* = (a + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_k X_{ki} + u_i) + e_i$

In the Dependent Variable (II)
- Provided that the measurement error is truly random, $u_i + e_i$ is indistinguishable from $u_i$
- In other words, the partial slope coefficients remain unbiased
- Because standard errors are based on residuals, the standard errors are elevated in the presence of random measurement error in the DV; efficiency is affected:

$S_{b_1} = \sqrt{\dfrac{\sum (Y_i - \hat{Y}_i)^2}{\sum (X_{1i} - \bar{X}_1)^2 \,(1 - r^2_{X_1 X_2})(n-3)}}$

In an Independent Variable
- In the bivariate case, measurement error in an IV leads to underestimation (closer to zero) of the slope coefficient and a loss of efficiency
- In the multivariate case, measurement error in an IV becomes much less predictable
- Always biased and inefficient
- Whether the slope is over- or underestimated depends on the degree of measurement error and on the correlations among the independent variables in complex ways

Random Measurement Error in Short
- In the DV: unbiased but inefficient
- In an IV:
- For bivariate regression, the slope coefficient will be too small (bias); inefficiency
- For multivariate regression, the slope coefficients are biased (unpredictably); inefficiency
- To fix it, you need better data! (see the first simulation sketch below)

Reverse Causation
- You think X causes Y, but what if Y causes X?
- What if there is feedback between the two?
- Reverse causation leads to bias
- Example: the effect of public opinion on political agendas
- Solution: very difficult. Theory, advanced methods, potentially unsolvable

Specification Error - Wrong Variables
- Including an irrelevant variable: inefficiency
- Excluding a relevant variable: bias

Including an Irrelevant Variable
- Suppose X1 affects Y but X2 does not, yet we estimate Y = a + b1·X1 + b2·X2
- Then, on average, b1 = β1 and b2 = β2 = 0
- But what about S_b1? Compare (see the second sketch below):

$S_{b_1} = \sqrt{\dfrac{\sum (Y_i - \hat{Y}_i)^2}{\sum (X_{1i} - \bar{X}_1)^2 \,(1 - r^2_{X_1 X_2})(n-3)}} \qquad S_b = \sqrt{\dfrac{\sum (Y_i - \hat{Y}_i)^2}{\sum (X_i - \bar{X})^2 \,(n-2)}}$
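First, a simulation sketch of the random measurement-error results summarized above (Python with NumPy; all parameter values are invented assumptions, not from the lecture): adding random error to Y leaves the slope unbiased but widens its sampling spread, while adding random error to X attenuates the slope toward zero.

```python
import numpy as np

rng = np.random.default_rng(2)
beta, n, reps = 1.0, 100, 5_000
slopes_y_err, slopes_x_err = [], []

for _ in range(reps):
    x = rng.normal(0, 1, n)
    y = beta * x + rng.normal(0, 1, n)   # true bivariate model
    y_star = y + rng.normal(0, 1, n)     # random error in the DV
    x_star = x + rng.normal(0, 1, n)     # random error in the IV
    slopes_y_err.append(np.polyfit(x, y_star, 1)[0])
    slopes_x_err.append(np.polyfit(x_star, y, 1)[0])

print("error in Y: mean slope", np.mean(slopes_y_err))  # ~1.0: unbiased
print("            sd of slope", np.std(slopes_y_err))  # inflated: inefficient
print("error in X: mean slope", np.mean(slopes_x_err))  # ~0.5: attenuated toward zero
```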
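Second, a companion sketch for the two specification errors (again Python with NumPy, with invented parameters): including an irrelevant but correlated X2 leaves b1 unbiased while inflating its sampling spread, and omitting a relevant X2 shifts b1 by roughly β2 · r_X1X2, matching the consequences described on the next slides.

```python
import numpy as np

rng = np.random.default_rng(3)
beta1, beta2, n, reps = 1.0, 1.0, 100, 5_000
rho = 0.6                                # assumed correlation between X1 and X2
b_irrelevant, b_omitted = [], []

for _ in range(reps):
    x1 = rng.normal(0, 1, n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(0, 1, n)
    u = rng.normal(0, 1, n)

    # Including an irrelevant variable: true model is Y = beta1*X1 + u
    y_a = beta1 * x1 + u
    X = np.column_stack([np.ones(n), x1, x2])
    b_irrelevant.append(np.linalg.lstsq(X, y_a, rcond=None)[0][1])

    # Excluding a relevant variable: true model is Y = beta1*X1 + beta2*X2 + u
    y_b = beta1 * x1 + beta2 * x2 + u
    b_omitted.append(np.polyfit(x1, y_b, 1)[0])

print("irrelevant X2: mean b1", np.mean(b_irrelevant))  # ~1.0: still unbiased
print("               sd b1  ", np.std(b_irrelevant))   # inflated by the correlation
print("omitted X2:    mean b1", np.mean(b_omitted))     # ~1.6 = beta1 + beta2*rho: biased
```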
Consequences of Including an Irrelevant Variable
- Partial slope coefficients remain unbiased
- Estimates are inefficient
- The greater the correlation between the included variables, the more inefficient
- If they are uncorrelated, the estimates are efficient

Excluding a Relevant Variable
- Suppose that X1 and X2 both affect Y but we estimate only Y = a + b1·X1
- Now $b_1 = \beta_1 + \beta_2 r_{X_1 X_2}$
- In other words, b1 is biased, but only to the extent that X1 and X2 are correlated

Consequences of Excluding Relevant Variables
- The direction of the bias depends:
- If X1 and X2 are positively correlated (and β2 > 0), then b1 > β1
- If X1 and X2 are negatively correlated (and β2 > 0), then b1 < β1
- Efficiency is affected, but who cares?

Excluding a Relevant Variable
- Does the variable have a causal effect on the dependent variable?
- Is the variable correlated with those variables whose effects are the focus of the study?
- If the answer to both is "yes", then excluding the variable leads to bias
- Solution: add the variable as a "control"

How to Detect and Deal with Wrong-Variable Errors
- Exclusion of relevant variables is a problem of theory
- Inclusion of irrelevant variables can be diagnosed with t-statistics (i.e., hypothesis testing)

End of lecture