Advanced Statistical Analysis Lecture Notes - University of Groningen

Summary

These lecture notes from the University of Groningen provide an overview of advanced statistical analysis, focusing on linear regression and on data problems such as measurement error, omitted variables, and missing values. The notes cover transformations, data types (time series, cross-sectional, and panel), and different methods for handling missing data, including imputation.

Full Transcript

Advanced Statistical Analysis
Week 2 - Lecture 3: Power of OLS and its limitations
Dr. Mark van Duijn (RUG)
Department of Economic Geography, Department of Demography, University of Groningen
13 Feb, 2023

Agenda

- Part I: Transformations and interpretation
- Part II: Data and the power of OLS

Readings:
- Schafer & Graham (2002) - Missing Data: Our View of the State of the Art
- Mehmetoglu & Jakobsen (2022) - Chapters 4, 5, 7

Transformations and interpretation

Linear: $y = b_0 + b_1 x_1 + \dots + e$

When do we want to transform the variables?
- The variable is not normally distributed. Create a histogram to check.
- A transformation can help to obtain a normal distribution
- Check the model fit before and after to say something about the transformation

Log-linear: $\ln(y) = b_0 + b_1 x_1 + \dots + e$
Log-log: $\ln(y) = b_0 + b_1 \ln(x_1) + \dots + e$
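The advice to look at the distribution and the model fit before and after a transformation can be sketched in a few lines. This is a minimal illustration on simulated data, not the lecturer's own example: the data-generating process, the helper names `skewness` and `r_squared`, and all parameter values are assumptions chosen for the sketch.

```python
import numpy as np

def skewness(values):
    """Sample skewness (third standardized moment); about 0 for a symmetric distribution."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float(np.mean(centered**3) / np.mean(centered**2) ** 1.5)

def r_squared(y, x):
    """R^2 of an OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return float(1.0 - residuals.var() / y.var())

# Hypothetical data: y is generated by a log-linear process, so y itself is right-skewed.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=2000)
y = np.exp(1.0 + 0.3 * x + rng.normal(0, 0.5, size=2000))

skew_level, skew_log = skewness(y), skewness(np.log(y))
fit_level, fit_log = r_squared(y, x), r_squared(np.log(y), x)

print(f"skewness: y = {skew_level:.2f}, ln(y) = {skew_log:.2f}")
print(f"R^2:      linear = {fit_level:.2f}, log-linear = {fit_log:.2f}")
```

One caveat on the design: the two R^2 values refer to different dependent variables (y and ln(y)), so they are a rough guide rather than a formal test of which specification is better.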

Transformations and interpretation

Linear: $y = b_0 + b_1 x + \dots + e$
- $b_1$: level effect; a one-unit increase in $x$ increases $y$ by $b_1$

Log-linear: $\ln(y) = b_0 + b_1 x + \dots + e$
- $b_1$: growth rate; a one-unit increase in $x$ multiplies $y$ by $\exp(b_1)$, that is, increases $y$ by $(\exp(b_1) - 1) \times 100\%$

Log-log: $\ln(y) = b_0 + b_1 \ln(x) + \dots + e$
- $b_1$: elasticity; a 1% increase in $x$ increases $y$ by $b_1$%

Assuming that x is a continuous variable!

Data

- Cross-sectional: observe multiple economic agents or events once
- Time-series: observe one economic agent or event over multiple time periods
- Panel: observe multiple economic agents or events over multiple time periods

Data issues

- Show you the power of linear regression: perfect data, perfect predictability, perfect linear relationships
- Introduce errors in data: measurement error
- Introduce errors in data: exclude important variables
- Introduce errors in data: missing values

Example

Start with an everyday example to explain the price of groceries. Imagine you do grocery shopping. You have the following information on:
- Total price of all items
- Some of the bought items (and quantities)
- Multiple observations

Linear regression

We can then reveal:
- The implicit prices for each item
- More specifically, the willingness to pay for each item

How to do this? Regress the (total) price of the heterogeneous goods on the quantities of the attributes.

Example on how this works:
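A minimal sketch of the idea, assuming simulated shopping trips: when the total price is exactly the sum of quantity times price (perfect information, no error), regressing the total on the quantities returns the implicit item prices. The item prices are the "real" prices listed later in these notes; the number of trips, the quantity range, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

items = ["apples", "pears", "milk", "bread", "chocolate"]
real_prices = np.array([0.25, 0.35, 1.20, 0.99, 2.50])  # "real" prices from the notes

# Hypothetical data: 50 shopping trips with 0 to 5 units of each item.
n_trips = 50
quantities = rng.integers(0, 6, size=(n_trips, len(items)))

# Perfect information: the receipt total is exactly quantities times prices.
total_price = quantities @ real_prices

# OLS of total price on a constant and the quantities.
X = np.column_stack([np.ones(n_trips), quantities])
coef, *_ = np.linalg.lstsq(X, total_price, rcond=None)

print(f"intercept: {coef[0]:.4f}")
for name, estimate in zip(items, coef[1:]):
    print(f"{name:<10} implicit price: {estimate:.4f}")
```

Because nothing is mismeasured or left out here, the estimated coefficients coincide with the item prices up to floating-point precision, and the intercept is zero.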

Example data

(The data table on this slide is not included in the transcript.)

Linear regression

$\text{price} = \beta_0 + \beta_1 \cdot \text{apples} + \beta_2 \cdot \text{pears} + \beta_3 \cdot \text{milk} + \beta_4 \cdot \text{bread} + \beta_5 \cdot \text{chocolate} + \epsilon$

Linear regression with perfect information

$\text{price} = \beta_0 + \beta_1 \cdot \text{apples} + \beta_2 \cdot \text{pears} + \beta_3 \cdot \text{milk} + \beta_4 \cdot \text{bread} + \beta_5 \cdot \text{chocolate} + \epsilon$

Measurement error

In surveys there are always some errors. What are the sources of these errors?
- Respondents do not honestly answer the question (age, income, ...)

Why do we care about measurement error?
- It can distort observed relationships and makes multivariate techniques less powerful
- It can result in biased coefficients, but always results in efficiency losses (read: larger standard errors → lower t-values → higher p-values)

3 possible scenarios:
- Measurement error in the dependent variable
- Measurement error in the independent variable
- Measurement error in both the dependent and the independent variable
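The first two scenarios can be sketched with a small simulation. This is an illustration under stated assumptions, not material from the lecture: the true model y = 2 + 1.5x + e, the noise variances, the sample size, and the helper `ols` are all hypothetical choices.

```python
import numpy as np

def ols(y, x):
    """OLS of y on a constant and x; returns (coefficients, standard errors)."""
    X = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, se

rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(0, 1, n)
y = 2.0 + 1.5 * x + rng.normal(0, 1, n)   # hypothetical true relationship

y_noisy = y + rng.normal(0, 2, n)         # random measurement error in Y
x_noisy = x + rng.normal(0, 1, n)         # random measurement error in X

coef_clean, se_clean = ols(y, x)
coef_err_y, se_err_y = ols(y_noisy, x)
coef_err_x, se_err_x = ols(y, x_noisy)

for label, coef, se in [("no measurement error", coef_clean, se_clean),
                        ("error in Y", coef_err_y, se_err_y),
                        ("error in X", coef_err_x, se_err_x)]:
    print(f"{label:<22} slope = {coef[1]:.3f}  (se = {se[1]:.4f})")
```

With random error in Y the slope stays close to 1.5 but its standard error grows. With random error in X the slope shrinks towards zero; in this setup the error variance equals the variance of x, so theory predicts a slope of about 1.5 × 1/(1 + 1) = 0.75.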

Measurement error

Measurement error in the dependent variable
- If the error in Y is random (and thus uncorrelated with X), it does not bias the estimates of the coefficients
- With error in Y the relationship is estimated less precisely: larger estimated standard errors

Measurement error in the independent variable
- Random error in X biases the coefficient towards zero
- Positive coefficients are underestimated and negative coefficients are overestimated
- Known as attenuation bias

Measurement error: solutions?

- It may not be a problem: if your results are still acceptable, it only makes your results more 'conservative'
- Use instrumental variables (2SLS): find a 'strong' instrument which is correlated with your X and uncorrelated with the error term
- Use an errors-in-variables model: if we know, or can estimate, the amount of error in X, one can 'fix up' the estimates (eivreg in Stata)

Linear regression with measurement error

$\text{price} = \beta_0 + \beta_1 \cdot \text{apples} + \beta_2 \cdot \text{pears} + \beta_3 \cdot \text{milk} + \beta_4 \cdot \text{bread} + \beta_5 \cdot \text{chocolate} + \epsilon$

Item        Coefficient   Estimated price   Real price
Apples      $\beta_1$     € 0.254           € 0.25
Pears       $\beta_2$     € 0.362           € 0.35
Milk        $\beta_3$     € 1.196           € 1.20
Bread       $\beta_4$     € 0.942           € 0.99
Chocolate   $\beta_5$     € 2.493           € 2.50

n = 500

What happens if n decreases?
What happens if we do not observe quantities of chocolate?
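Both questions can be explored with a simulated version of the grocery example. The sketch below is not the lecturer's data: the Poisson quantities, the noise on the receipt total, the assumption that people who buy more bread also buy more chocolate, and the helper names `ols` and `simulate` are all hypothetical.

```python
import numpy as np

def ols(y, X):
    """OLS of y on a constant and the columns of X; returns (coefficients, standard errors)."""
    X = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, se

rng = np.random.default_rng(2)
real_prices = np.array([0.25, 0.35, 1.20, 0.99, 2.50])  # apples, pears, milk, bread, chocolate

def simulate(n):
    apples = rng.poisson(3, n)
    pears = rng.poisson(2, n)
    milk = rng.poisson(1, n)
    bread = rng.poisson(2, n)
    # Assumption: chocolate purchases rise with bread purchases (correlated regressors).
    chocolate = rng.poisson(0.5 + 0.5 * bread)
    Q = np.column_stack([apples, pears, milk, bread, chocolate]).astype(float)
    price = Q @ real_prices + rng.normal(0, 0.5, n)  # receipt total with some noise
    return Q, price

Q, price = simulate(5000)

coef_full, se_full = ols(price, Q)                # all items observed, n = 5000
coef_small, se_small = ols(price[:50], Q[:50])    # all items observed, n = 50
coef_omit, se_omit = ols(price, Q[:, :4])         # chocolate not observed, n = 5000

print(f"bread, full model, n = 5000: {coef_full[4]:.3f} (se {se_full[4]:.4f})")
print(f"bread, full model, n = 50:   {coef_small[4]:.3f} (se {se_small[4]:.4f})")
print(f"bread, chocolate omitted:    {coef_omit[4]:.3f} (se {se_omit[4]:.4f})")
```

In this setup a smaller n leaves the estimates centred on the real prices but widens the standard errors, while dropping chocolate pushes the bread coefficient well above 0.99 because bread picks up part of the chocolate price. The size of that bias depends entirely on the assumed correlation between bread and chocolate.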

Missing observations and omitted variable bias

- Fewer observations lead to less precise estimates!
- Omitted variables can lead to inconsistent (biased) results!

Missing values in variables

Some of the information is not available for a respondent in dependent or independent variables?
- Important question: is it systematic or random?

What is its impact?
- Reduces sample size available for analysis (see previous example)
- Can distort results (bias and inefficiency)

Types of missing data:
- Missing completely at random (MCAR)
- Missing not at random (MNAR)
- Missing at random (MAR)

Types of missing data

Missing completely at random (MCAR)
- The probability that an observation is missing is unrelated to its own value and to any other variable
- Does not lead to bias in regression analysis. Missing values are ignorable

Missing not at random (MNAR)
- The probability that an observation is missing is related to the (unobserved) value itself
- For example, high-income people do not fill in the question about income
- Does lead to bias in regression analysis. Missing values are non-ignorable

Missing at random (MAR)
- The probability that an observation is missing is only related to another variable
- For example, depressed people have a lower income in general and may be less inclined to report their income
- Does lead to bias in regression analysis if the control variables that explain missingness are left out; missing values are ignorable only once those variables are controlled for

Reporting on missing data

Survey
- How many respondents participated in the survey? Response rate
- Where did you survey the respondents? and/or Where do the respondents live?

Which variables are affected?
- Frequency of not responding to each of the variables involved
- Show this in a table: "Descriptive statistics"

Missing data

How to deal with missing data:
1. Determine the type of missing data
2. Determine the extent of missing data
3. Diagnose the randomness of the missing data
4. Select the appropriate method for solution
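Steps 2 and 3 above can be sketched on simulated data. Everything in this example is an assumption made for illustration: the variables `age` and `income`, the missingness probabilities, and the `logistic` helper. The true income values are kept only so that the bias can be shown; in real data they would be unknown.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
age = rng.normal(45, 12, n)                      # fully observed covariate
income = 20 + 0.5 * age + rng.normal(0, 5, n)    # variable that will get missing values

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Three hypothetical missingness mechanisms for income (True = missing).
masks = {
    "MCAR": rng.random(n) < 0.3,                                   # unrelated to anything
    "MAR": rng.random(n) < 0.6 * logistic((age - 45) / 6),         # depends on observed age
    "MNAR": rng.random(n) < 0.6 * logistic((income - 42.5) / 4),   # depends on income itself
}

report = {}
for name, missing in masks.items():
    report[name] = {
        # Step 2: extent of missing data.
        "share_missing": missing.mean(),
        # Step 3: do respondents with missing income differ on an observed variable?
        "age_gap": age[missing].mean() - age[~missing].mean(),
        # Only computable in a simulation: bias of the complete-case mean of income.
        "income_bias": income[~missing].mean() - income.mean(),
    }
    r = report[name]
    print(f"{name:<5} missing = {r['share_missing']:.2f}  "
          f"age gap = {r['age_gap']:+.2f}  income bias = {r['income_bias']:+.2f}")
```

A clear age gap between respondents with and without missing income is evidence against MCAR. It cannot separate MAR from MNAR, because that distinction depends on the values that were not observed.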

Solutions

- Data reduction
- Missing category
- Carry over value
- Imputation
  - Mean imputation
  - Regression mean imputation
  - Multiple imputation
- Other: weighting

Data reduction

Listwise deletion (default)
- Each case that has a missing value for any variable in the analysis is dropped from the analysis
- Reduced sample size; reduces statistical power
- Unbiased results if data are MCAR
- Unbiased results if data are MAR when controlled for variables that affect missingness

Extra category

Add an additional category
- Only works for missing values in categorical variables, or after recoding a continuous variable into categories
- Does not reduce sample size
- Can bias results as very different cases can be grouped

Carry over value

Carry last entry forward
- A missing item at t=4, t=5, t=6 takes on the value from t=3
- Only works with longitudinal data

Imputation

Simple mean
- Replace each missing value with the mean of the observed values
- Only possible for continuous variables

Regression mean
- Replace each missing value with the value predicted from a regression on the other variables

Multiple imputation
1. Impute the missing values multiple times
2. Analyze each imputed dataset
3. Pool the results
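The three multiple-imputation steps can be sketched as follows. This is a simplified illustration, not the procedure used in the course or in any particular package: it imputes with a regression fitted on a bootstrap sample of the complete cases plus random noise, and pools with Rubin's rules. The simulated `age` and `income` data, the missingness mechanism, the number of imputations `m`, and the helper `impute_once` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
age = rng.normal(45, 12, n)
income = 20 + 0.5 * age + rng.normal(0, 5, n)   # true values, kept only to check the result

# Hypothetical MAR mechanism: older respondents report income less often.
missing = rng.random(n) < 0.6 / (1 + np.exp(-(age - 45) / 6))
observed = ~missing

def impute_once(rng):
    """Step 1: one stochastic regression imputation of income from age."""
    # Bootstrap the complete cases so that each imputation uses slightly different coefficients.
    idx = rng.choice(np.flatnonzero(observed), size=observed.sum(), replace=True)
    X = np.column_stack([np.ones(idx.size), age[idx]])
    coef, *_ = np.linalg.lstsq(X, income[idx], rcond=None)
    resid = income[idx] - X @ coef
    sigma = np.sqrt(resid @ resid / (idx.size - 2))
    completed = income.copy()
    # Overwrite the "missing" entries with prediction plus random residual noise.
    completed[missing] = coef[0] + coef[1] * age[missing] + rng.normal(0, sigma, missing.sum())
    return completed

m = 20
estimates, variances = [], []
for _ in range(m):
    completed = impute_once(rng)
    # Step 2: analyze each completed dataset (here: the mean of income and its variance).
    estimates.append(completed.mean())
    variances.append(completed.var(ddof=1) / n)

# Step 3: pool with Rubin's rules.
pooled = np.mean(estimates)
within = np.mean(variances)
between = np.var(estimates, ddof=1)
total_var = within + (1 + 1 / m) * between

complete_case_mean = income[observed].mean()
print(f"true mean (simulation only): {income.mean():.2f}")
print(f"complete-case mean:          {complete_case_mean:.2f}")
print(f"pooled MI estimate:          {pooled:.2f} (se {np.sqrt(total_var):.3f})")
```

The between-imputation variance is what separates multiple imputation from a single regression imputation: it adds the uncertainty about the imputed values to the pooled standard error.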

Weighting

- Weighting to let the sample distribution of key variables match the (assumed) known population distribution
- Simple example: the survey response rate is 100% for women, but only 50% for men
- Inverse weighting solution: every male response is given a weight of 2 (1 / 0.5)
- However, in practice this is much more difficult! The more key variables, the more difficult it is to create weights (gender, income, ethnicity, ...)

Solutions

- Data reduction
- Missing category
- Carry over value
- Imputation
- Other: weighting

None of the proposed methods is ideal; most rely on strong assumptions.
More complete data → better results, less bias.
Important for any scientific paper: report how you deal with missing data and discuss possible bias due to missing data.

Data issues

There are quite some data issues that you might encounter. Some examples:
- Sampling errors
- Measurement errors
- Omitted variables
- Non-response
- ... and many more (see, e.g., https://www.statisticshowto.datasciencecentral.com/what-is-bias/)

Bias often leads to consistency issues and always leads to efficiency issues!

Diagnostics

(The content of this slide is not included in the transcript.)

Obligatory article to study

Schafer & Graham (2002). Missing Data: Our View of the State of the Art. Psychological Methods, 7(2), 147-177.

What did we learn?

Solutions to data issues are all about:
- Does my data still represent the population?
- Consistency and efficiency!

What did we learn?

- Make judgements on when to transform variables (and when not)
- Describe the causes and consequences of measurement error
- Describe and identify the problems and impact of missing variables and missing values
- Describe and apply solutions to missing values

Next

- Lecture 4: Feb 16 at 11h00-13h00
- Computer lab sessions: Thursday Feb 16 at 15h00-17h00
- Prepare: Mehmetoglu & Jakobsen (2022) - Chapter 6
