Quantitative Methods in Empirical Economic Geography PDF

Document Details

Leibniz Universität Hannover

Christian Hundt

Tags

linear regression econometrics economic geography quantitative methods

Summary

These lecture notes cover linear regression models in empirical economic geography. The document discusses various aspects of linear regression, including assumptions, violations, solutions to violations, and common problems.

Full Transcript

Quantitative Methods in Empirical Economic Geography
Linear regression models (Part III)
Lecturer: Christian Hundt
Slides: Christian Hundt and Kerstin Nolte
Institute of Economic and Cultural Geography, Leibniz University Hannover
M2 – Methods in Empirical Economic Geography

OLS assumptions: The Gauss-Markov theorem

OLS assumptions: Gauss-Markov Theorem
▪ Why use OLS?
▪ When not to use OLS?
OLS = estimation method that yields consistent* estimators for $\beta_0, \beta_1, \dots, \beta_n$ => but only under certain assumptions
*An estimator is consistent if the estimates "converge" towards the true value of the estimated parameter as the sample size increases.

OLS assumptions: Gauss-Markov Theorem
Gauss-Markov Theorem: If the following assumptions apply in a linear regression model…
1. Linear in parameters
2. Random sample of size n
3. No perfect collinearity among the independent variables
4. Exogeneity of the predictors
5. Homoscedasticity: error terms have the same variance: $\mathrm{Var}(u \mid x) = \sigma^2$
6. Error terms are not correlated (no autocorrelation)
… then OLS provides the Best Linear Unbiased Estimator (BLUE).
Wooldridge, Chapter 3

OLS assumptions: Gauss-Markov Theorem
▪ In short: OLS is the method of choice if all Gauss-Markov assumptions apply → no other linear unbiased estimator would be better.
▪ Put differently: the assumptions are violated in the case of…
1. Non-linearity in parameters
2. “Biased” (non-random) sample
3. (Perfect) multicollinearity
4. Endogeneity
5. Heteroscedasticity
6. Autocorrelation
▪ If assumptions are violated:
o We need a different estimation method or need to “work” on these problems.

Gauss-Markov Theorem: 1. Linear in parameters
1.
Linear in parameters: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + u_i$
→ Linear relationship between the dependent and (several) independent variables

Gauss-Markov Theorem: 1. Linear in parameters (slide 7)
▪ Use a “residuals vs. fits plot” to check (y-axis => residuals; x-axis => fitted values = estimated values = $\hat{y}$):
[Figure: two residuals vs. fits plots (Plot 1, Plot 2). Source: http://methods-berlin.com/wp-content/uploads/VP.html]
Note: The plots show the error terms (residuals). A pattern like the one in Plot 2 hints at non-linearity. Plot 1 shows no conspicuous pattern, so you can proceed. If the residuals form a straight line, that is a pattern too.
▪ The dashed line at y=0 shows the expected value of the residuals. This is always zero, and the residuals should be distributed around this line without any recognizable pattern.

Gauss-Markov Theorem: 1. Linear in parameters
▪ And what if the model is not linear in parameters?
Note: A conspicuous structure can often be resolved by taking logarithms.
→ Put it into a linear framework: logarithmic transformation (=> lecture 3)
o Example: convert the whole equation into linear form, e.g. the Cobb-Douglas production function $Y = K^a L^b$ → $\log Y = a \log K + b \log L$
→ If the residual plot shows a u-shaped (or inverse u-shaped) relationship, add an $x^2$ term. In this case, a parabola rather than a straight line is fitted to the values (=> last lecture).
→ Use a different estimation method.

Gauss-Markov Theorem: 2. Random sample of size n
2. Random sample of size n
▪ If we start with a non-random sample, we can do little about it
Unless we collect data ourselves: remember lecture 1 on sampling

Gauss-Markov Theorem: 2.
Random sample of size n
▪ Sampling methods (lecture 1)
Probability sampling
– Random sampling: each unit in the population has a known, nonzero chance of being included in the sample
– Inferences can be made about the population
Non-probability sampling
– Opposite of random sampling: deliberate, non-random selection
– Sample is not representative of the population
– Practiced for reasons of cost, timeliness, convenience
→ We focus on probability sampling (as we are interested in inferential statistics)
→ Reflections before collecting a sample
Note: Official statistics can generally be assumed to be usable as they are.

Gauss-Markov Theorem: 2. Random sample of size n
▪ What does this mean for our regressions?
o Typical case: data is given, so you cannot change the sampling method or sample size – you do your regressions with what you have
o Generally speaking: inference is only possible for random samples, and…
o … the larger the sample size, the less you have to worry…
o … and the more independent variables you include, the larger your sample size should be.

Gauss-Markov Theorem: 2. Random sample of size n
▪ Rule of thumb
▪ No universal rule on the number of observations – the more the better
▪ A common rule of thumb that you might find:
o At least 10 observations per variable that you include
o E.g. in the billing-amount-tip example with 20 observations, including three variables (tip, smiling, gender) would need at least 30 observations

Side note: Degrees of freedom
▪ Typical regression output: degrees of freedom
▪ $df = n - 1 - k$, with n as the sample size, k as the number of estimated slope parameters (one per independent variable) and 1 for the intercept
▪ The more degrees of freedom ($n - 1 - k$), the more precise your estimates.

Side note: Degrees of freedom
▪ More degrees of freedom correspond to lower critical values of t.
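The degrees-of-freedom side note can be illustrated with a minimal sketch. It assumes the formula df = n − 1 − k and the 10-observations-per-variable rule of thumb from the slides. The function names are hypothetical, and the critical values are a small hand-copied excerpt from a standard two-tailed Student's t table at the 5% level, not computed values.

```python
# Illustrative sketch only: function and variable names are hypothetical,
# they do not come from the lecture slides.

def residual_df(n_obs: int, n_predictors: int) -> int:
    """Degrees of freedom df = n - 1 - k (the 1 accounts for the intercept)."""
    return n_obs - 1 - n_predictors


def min_sample_size(n_variables: int, per_variable: int = 10) -> int:
    """Rule of thumb from the slides: at least 10 observations per variable."""
    return n_variables * per_variable


# Excerpt of a standard two-tailed t table (alpha = 0.05), keyed by df.
T_CRIT_5PCT_TWO_TAILED = {
    1: 12.706,
    5: 2.571,
    10: 2.228,
    16: 2.120,
    30: 2.042,
    120: 1.980,
}

if __name__ == "__main__":
    # Tip example: 20 observations, three variables.
    df = residual_df(n_obs=20, n_predictors=3)
    print("degrees of freedom:", df)
    print("critical t at df =", df, ":", T_CRIT_5PCT_TWO_TAILED[df])
    print("observations suggested by the rule of thumb:", min_sample_size(3))
    # The critical value falls as the degrees of freedom grow.
    for dof in sorted(T_CRIT_5PCT_TWO_TAILED):
        print(dof, T_CRIT_5PCT_TWO_TAILED[dof])
```

With 20 observations and three variables the model has 16 degrees of freedom, while the rule of thumb would ask for 30 observations. For degrees of freedom not in the excerpt, a full t table or a statistics library would be needed.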
Note: The larger n, the better the quality of the inference. Source: https://www.scribbr.com/statistics/students-t-table/

Gauss-Markov Theorem: 2. Random sample of size n
▪ Sample selection based on the dependent variable is always problematic => selecting cases based on the fulfillment of a criterion and then using these cases as evidence for the criterion
Note: E.g. asking only high-school graduates (Abiturienten) about the number of future students. The sample is then already very selective. The Abitur would have to be the independent variable.
▪ But there are non-random samples that cause no problems…
o Sample selection based on the independent variable (exogenous sample selection)
o E.g., stratified samples if stratification is by an independent variable
Wooldridge, Chapter 9.5

Gauss-Markov Theorem: 3. No perfect collinearity
3. No perfect collinearity among the independent variables
o Your independent variables should (ideally) be independent of each other
o If one independent variable varies, this should have no effect on the other independent variables
▪ If the assumption is violated:
o Multicollinearity: correlation among the independent variables in a multiple regression model

Gauss-Markov Theorem: 3. No perfect collinearity
▪ Weak constraint → independent variables can be correlated, just no perfect collinearity (one variable perfectly predicts another variable)
o e.g. the same variable in different units
o the dummy trap: including a dummy for each category
▪ Exclude perfectly collinear variables (software will let you know)
▪ Multicollinearity can be tested for with the variance inflation factor (VIF)
http://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/

Gauss-Markov Theorem: 3.
No perfect collinearity
▪ Variance inflation factor
▪ Serves as a tool to detect multicollinearity between the independent variables of a model
o Apply an auxiliary regression to regress an individual variable on all other independent variables
o Use the $R^2$ of this auxiliary regression to estimate the VIF
o $\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$

Gauss-Markov Theorem: 3. No perfect collinearity
▪ Example for estimating the VIF:
▪ First step: regress the variable on the other variables, here: smiling on billing_amount and gender
▪ Use the $R^2$ of this auxiliary regression, here: 0.6005

Gauss-Markov Theorem: 3. No perfect collinearity
▪ Estimate the VIF: $\mathrm{VIF}_{smiling} = \frac{1}{1 - R_{smiling}^2} = \frac{1}{1 - 0.6005} = 2.50$
▪ Software does all the work for you, check the R documentation: https://www.rdocumentation.org/packages/car/versions/3.0-2/topics/vif

Gauss-Markov Theorem: 3. No perfect collinearity
▪ VIF > 5 → sign of multicollinearity
▪ VIF > 10 → extreme multicollinearity (would make you worry)
Note: 5 is critical, above 10 is a disaster -> drop the variable.
▪ In the case of multicollinearity,
o it may no longer be possible to determine exactly which influence comes from which variable.
o regression coefficients can change drastically if the data changes very slightly or if new variables are added.
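The VIF calculation above can be sketched in plain Python. This is only an illustration: the function names are hypothetical, the auxiliary regression is restricted to the special case of a single other predictor (where $R^2$ equals the squared correlation), and the toy data are made up. The lecture itself points to R's car package for real analyses.

```python
# Illustrative sketch only: names and toy data are hypothetical.

def r_squared_simple(x, y):
    """R^2 of a simple regression of y on one predictor x.

    With a single predictor this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)


def vif_from_r2(r2):
    """Variance inflation factor VIF = 1 / (1 - R^2) of the auxiliary regression."""
    return 1.0 / (1.0 - r2)


def classify_vif(vif):
    """Apply the rule-of-thumb thresholds from the slides (5 and 10)."""
    if vif > 10:
        return "extreme multicollinearity"
    if vif > 5:
        return "sign of multicollinearity"
    return "below the rule-of-thumb thresholds"


if __name__ == "__main__":
    # Number from the slide: auxiliary R^2 of 0.6005 for 'smiling'.
    vif_smiling = vif_from_r2(0.6005)
    print(round(vif_smiling, 2), classify_vif(vif_smiling))

    # Made-up two-predictor case: regress x1 on x2, then compute the VIF.
    x1 = [1, 2, 3, 4, 5]
    x2 = [2, 1, 4, 3, 5]
    r2 = r_squared_simple(x2, x1)
    print(round(r2, 2), round(vif_from_r2(r2), 2))
```

With more than one other predictor, the auxiliary $R^2$ has to come from a multiple regression, which is what statistical software computes internally.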
▪ If you detect strong multicollinearity
o Drop highly multicollinear variables (although you might want to keep them…)
o Perform a PCA of the highly multicollinear variables

Countermeasure: Principal Component Analysis
▪ PCA is a tool for dimensionality reduction => consolidate data for easier interpretation
▪ Especially for variables that are correlated with each other (contain duplicate information)
▪ Two main applications relevant to us
– Used to address multicollinearity in regression models
– Exploratory data analysis
▪ Basic idea
▪ Transform the data set into linearly uncorrelated “principal components” (PC or C) – the first component captures the largest part of the variance in the data, the second component captures the second largest part, and so on.

Dimensionality reduction through PCA
[Figure: PCA example, cor = correlation coefficient. Video on how to do it: https://www.youtube.com/watch?v=SWfucxnOF8c]
=> Chol and Age have been replaced by PC1.
Note: The component that captures the highest variance is selected. The interpretation is not relevant for the exam.

Principal component regression
▪ If you want to include several variables in a regression that are highly correlated
o Perform PCA
o Include the dimensionality-reduced new variable in your regression
▪ Advantages
o Reduces complexity: you include fewer variables in your regression (more power)
o Can solve multicollinearity
▪ Disadvantages
o No intuitive interpretation
o Driven by mathematics/statistics, not by theory

Principal Component Analysis
Note: PCA is not relevant for the exam.
▪ Mathematical procedure…
▪ … for those interested: more insights can be found here
Lever, J., Krzywinski, M., & Altman, N. (2017). Points of Significance: Principal component analysis. Nature Methods, 14, 641–642.
Wang, F. (2009).
Factor Analysis and Principal-Components Analysis. In R. Kitchin & N. Thrift (Eds.), International Encyclopedia of Human Geography (pp. 1–7). Oxford: Elsevier.

Principal Component Analysis
▪ What you need to know (cont'd)
▪ Common application: PCA-generated variables are included in regressions
– Frequent examples: institutional indicators (World Governance Indicators), wealth indicators…
– Nice overview on the wealth index: Vyas, S., & Kumaranayake, L. (2006). Constructing socio-economic status indices: how to use principal components analysis. Health Policy and Planning, 21(6), 459–468.
– DHS guide on constructing the wealth index: https://dhsprogram.com/programming/wealth%20index/Steps_to_constructing_the_new_DHS_Wealth_Index.pdf

Principal Component Analysis
▪ What you need to know (cont'd)
▪ R does the fancy mathematics for you, check out: https://www.datacamp.com/community/tutorials/pca-analysis-r
▪ Drawback
– Interpretation of variables constructed with PCA in regressions: typically not interpretable beyond positive/negative/no relationship

Gauss-Markov Theorem: 4. Exogeneity of the predictors
4. Exogeneity => The predictor variables X are independent of the error term u: $E(u \mid x) = 0$
▪ Violated if an independent variable is correlated with the error term.
Note: Endogeneity is present, for example, when x and y influence each other -> circular reasoning -> the influences are very hard to tell apart.
▪ This so-called endogeneity has three possible sources:
a) Misspecification of the functional form
b) Omitted variable
c) Simultaneity

Gauss-Markov Theorem: 4. Exogeneity of the predictors
Endogeneity a) Misspecification of the functional form
Note: Simplest case: the model is not specified correctly, e.g. variables should have been logged.
▪ E.g.
variable should be logged or in quadratic form
▪ Functional form can be tested with the RESET test (Ramsey Regression Equation Specification Error Test)
o Adds polynomials of the fitted values to the regression equation and performs an F-test with
» $H_0$: no functional form misspecification
» $H_1$: there is functional form misspecification
o Only tells you whether there is a functional form misspecification or not – it does not provide an alternative specification

Gauss-Markov Theorem: 4. Exogeneity of the predictors
Endogeneity b) Omitted variable
Note: Error check: did I forget a variable?
▪ Underspecification: omitted (missing) variable bias
o Ideally: try to include the variable or a proxy variable (e.g., use an IQ test as a proxy for an individual's ability)
o If previous data is available: use a lagged variable (from a prior year)
https://economictheoryblog.com/2018/05/04/omitted-variable-bias/

Gauss-Markov Theorem: 4. Exogeneity of the predictors
Endogeneity c) Simultaneity
▪ The dependent variable is not only determined by the independent variables. One or more of the independent variables are simultaneously influenced by the dependent variable.
▪ One possible solution: instrumental variables that are (strongly) correlated with the endogenous predictor but affect the dependent variable only through that predictor (i.e., they are uncorrelated with the error term)
Note: Instrumental variables: you have to become active yourself.
▪ Watch this video (as of ~ minute 11): https://www.youtube.com/watch?v=dLuTjoYmfXs
Note: The last point of the previous slide (lagged variables) belongs here as well: current variables cannot affect older ones.

Gauss-Markov Theorem: 5. Homoscedasticity
5. Homoscedasticity: error terms have the same variance, $\mathrm{Var}(u \mid x) = \sigma^2$.
▪ Detecting heteroscedastic errors:
a) Visual inspection via “residuals vs.
fits plot” (see slide 7)
[Figure: two residuals vs. fits plots. Panel captions: “Errors unsystematic, close to zero → no heteroscedasticity ☺” and “Suggests errors vary with independent variable”. Source: https://python.plainenglish.io/heteroscedasticity-analysis-in-time-series-data-fee51503cc0e]

Gauss-Markov Theorem: 5. Homoscedasticity
▪ Detecting heteroscedastic errors
b) Statistical tests for heteroscedasticity
o e.g. Breusch-Pagan test: regresses the squared OLS residuals on the independent variables
o White test: also involves regressing the squared OLS residuals on the independent variables, but is somewhat more complicated
▪ If you detect heteroscedasticity
o Try to get at the source of it
− Maybe you have a variable with a large spread…?
o One way to solve it: data transformations (e.g. log transformation)
o Another way: calculate robust standard errors

Gauss-Markov Theorem: 6. No autocorrelation
6. No autocorrelation: error terms are not correlated with each other.
▪ Violated if residuals are not independent of each other => autocorrelation
▪ Autocorrelation can occur in the case of
o repeated measurements (time series) => panel analysis
o hierarchical group structures (e.g., students in classes) => multi-level modeling
Note: We restrict ourselves to panel analysis. (The lecture session ended here.)

Gauss-Markov Theorem: 6. No autocorrelation
▪ Detection of autocorrelation via
o visual inspection (e.g. residuals vs. observation order) https://online.stat.psu.edu/stat462/node/121/
o Durbin–Watson statistic

OLS assumptions: Gauss-Markov Theorem
▪ If all six assumptions are met
o OLS is BLUE (best linear unbiased estimator) → you would not find a better estimation method
o If not: either try to solve the problems
− E.g.
transformations, including omitted variables…
o … or use a different estimator (beyond OLS)
− To be continued

Normality assumption
▪ And yet one more assumption…
o Formally not part of the Gauss-Markov theorem
o But to calculate p-values for significance testing (t statistics and F statistics) we add another assumption:
− Normality assumption: the unobserved error is normally distributed in the population (=> the assumption requiring a normal distribution applies only to the residuals, not to the independent variables!)
− “Countermeasure”: collect a sufficiently large sample (>200), so that the test statistics are approximately valid even if the errors are not exactly normal
− Tool: histogram
Wooldridge, Chapter 4

What damage is caused in the event of violations?
1. Non-linearity in parameters => biased coefficients and biased standard errors
2. “Biased” (non-random) sample => biased selection, not representative
3. Multicollinearity => biased standard errors
4. Endogeneity => biased coefficients and biased standard errors
5. Heteroscedasticity => biased standard errors
6. Autocorrelation => biased standard errors
7. Non-normally distributed errors => biased standard errors

Questions?

Recipe for a linear regression analysis

Recipe for a regression analysis
1. Start with assumptions/theory about relationships between variables
2. Obtain data and operationalize variables
3. Create regression equation and decide on estimation method
4. Perform regression and interpretation
5. Model diagnostics
6. Adapt model and start over with 4

Recipe for a regression analysis
1.
Start with assumptions/theory about relationships between variables
▪ Tips are influenced by billing amount, experience, gender, friendliness of the waiter…
▪ Growth depends on… (check the theory…)

Recipe for a regression analysis
2. Obtain data and operationalize variables
▪ Is your data a random sample of the population?
▪ Data preparation
o Identify gross errors/outliers – start with graphical investigation
▪ Preliminary model investigation
o Identify functional form/important interactions – residual and other scatterplots
▪ Operationalize variables
o Friendliness can be measured by the frequency of smiles
o Innovativeness can be measured by the number of patents

Side note: influential observations and outliers
[Figure: Wooldridge, Figure 9.1]

Side note: influential observations and outliers
▪ “Unusual” observations may strongly affect OLS estimates
▪ OLS is very sensitive: it minimizes the sum of squared residuals → large residuals receive a lot of weight
▪ Sources of unusual observations
o Data entry error (too many zeros, wrong unit…)
− Carefully check data for such problems
o “True” extreme value
− Decision lies with the researcher

Side note: influential observations and outliers
▪ Detecting outliers
o Visual inspection (scatterplot, boxplot...)
o Different systematic approaches based on standard descriptive statistics (all available in R)
− Difference in fit(s) (DFFITS)
− Cook's distance
− …
https://onlinecourses.science.psu.edu/stat501/node/340/
R package: https://cran.r-project.org/web/packages/olsrr/vignettes/influence_measures.html

Recipe for a regression analysis
3.
Create regression equation and decide on estimation method
▪ Common problems
o Including irrelevant variables: overspecification
o Omitted variable bias: underspecification
o Misspecification

Side note: Over- and underspecifying regression models
▪ Two traps in setting up your regression equation
o Include too few variables: underspecification
o Include too many variables: overspecification

Side note: Over- and underspecifying regression models
▪ Overspecification
o Inclusion of irrelevant variables: variables are included even though they do not have an effect
o Theory-free “kitchen sink” regression
o No serious problem → does not bias the coefficients, as there is no effect
o However: reduces the degrees of freedom and thus can be problematic
▪ Measure: use the adjusted $R^2$ statistic

Side note: Over- and underspecifying regression models
▪ Underspecification
o If you leave out important variables
− Because you forget them
− Because they are not available/not measurable
→ You underfit your model and encounter an “omitted variable bias”

Side note: Over- and underspecifying regression models
▪ Problems of omitted variable bias
o You may
− Overestimate the strength of an effect
− Underestimate the strength of an effect
− Change the sign of the effect
− Mask an effect that actually exists
o This occurs if both
− the omitted variable correlates with the dependent variable, and
− the omitted variable correlates with at least one independent variable in the model
Wooldridge (2009): Chapter 3.3

Creating a regression equation
▪ Gold standard: theory-driven variable selection
▪ Naive but often practiced approaches
▪ Stepwise regressions
o Forward selection: start with no variables, add variables and only keep significant ones…
o Backward selection: estimate
the most complex model with all available covariates, then remove the insignificant ones
▪ Other approaches: adjusted R-squared statistics, residuals vs. predictor plot
▪ Whatever you do: please stay theory-driven

Creating a regression equation
▪ Trade-off between
o Including as many predictor variables as possible
o And keeping it simple (having many degrees of freedom)

Recipe for a regression analysis
4. Perform regression and interpretation

Recipe for a regression analysis
5. Model diagnostics
▪ Examination of model assumptions
o Perform a series of tests and take countermeasures if the model assumptions are not met
6. Adapt model and start over with 4
▪ If necessary, use a different estimation method, not OLS.

Questions?

Further reading
▪ Wooldridge (2013): Introductory Econometrics. A Modern Approach (5th international ed.). Australia: South-Western, Cengage Learning. Chapters 3, 4, 8, 9.
▪ And a short summary: http://uweconsoc.com/ols-blue-and-the-gauss-markov-theorem/
▪ And links in the slides ☺
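Steps 4 and 5 of the recipe can be sketched for the simplest case, one predictor, in plain Python. This is a rough illustration under stated assumptions rather than the lecture's own code: the function names are hypothetical, the data are made up, and the only diagnostic shown is the Durbin–Watson statistic listed under assumption 6. Values near 2 suggest no first-order autocorrelation, values towards 0 positive and values towards 4 negative autocorrelation.

```python
# Illustrative sketch only: names and data are hypothetical.

def fit_simple_ols(x, y):
    """Closed-form OLS for y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed minus fitted values, in observation order."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def durbin_watson(resid):
    """Sum of squared successive residual differences over the sum of squared residuals."""
    numerator = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    denominator = sum(e ** 2 for e in resid)
    return numerator / denominator


if __name__ == "__main__":
    # Made-up billing amounts and tips, ordered by observation.
    billing_amount = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
    tip = [1.2, 1.9, 3.4, 3.8, 5.3, 5.9]

    intercept, slope = fit_simple_ols(billing_amount, tip)   # step 4
    resid = residuals(billing_amount, tip, intercept, slope)
    print("intercept:", round(intercept, 3), "slope:", round(slope, 3))
    print("Durbin-Watson:", round(durbin_watson(resid), 3))  # step 5
```

For real analyses, statistical software reports this statistic alongside the other diagnostics (residual plots, heteroscedasticity tests, VIF), and the model is then adapted as in step 6.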
