Advanced Statistical Analysis Lecture Notes
University of Groningen
2023
Mark van Duijn
Summary
These lecture notes cover multinomial logistic regression, including examples of STATA analysis.
Full Transcript
Advanced Statistical Analysis
Week 5 - Lecture 9
Dr. Mark van Duijn
Department of Economic Geography, Department of Demography
University of Groningen
[email protected]
6 March, 2023

Introduction | Part I | Part II | Part III | Part IV | Conclusions

Agenda

- Part I: Multinomial logistic regression
- Part II: STATA example (UCLA)
- Part III: DeMaris (1995)
- Part IV: Reczek et al. (2014)

Dr. Mark van Duijn (RUG) | Advanced Statistical Analysis | 6 Mar 2023

Literature

- Mehmetoglu & Jakobsen (2017): Chapter 8
- DeMaris (1995): A tutorial in logistic regression, Journal of Marriage and the Family, 57, 956-968 (compulsory reading)
- Reczek, Liu and Spiker (2014): A Population-Based Study on Alcohol Use in Same-Sex and Different-Sex Unions, Journal of Marriage and the Family, 76, 557-572 (compulsory reading)
- https://stats.idre.ucla.edu/stata/dae/multinomiallogistic-regression/ (background reading)

Different dependent variables

- Binary: Individual chooses between two choices. 0/1 ('no/yes') outcome.
- Multinomial: Individual chooses among more than two choices.
- Ordered: Individual reveals the strength of his/her preferences. Numerical values are only a ranking, e.g. 0/1/2/3/4 to indicate the strength of preferences.
- Count: Count of the number of occurrences.

Binary vs. multinomial

Binary examples                 Multinomial examples
Use-nonuse of contraception     No, temporary, permanent contraception
Success-failure of treatment    Underweight, normal weight, overweight
Agree-disagree                  Agree, indifferent, disagree
Yes-no response                 Moving to region A, B, C, D

Binary vs. multinomial logistic regression

- Ln(odds), odds and probabilities are modelled for each outcome category
- One outcome category is the reference category
- Binary logistic regression is a special case of multinomial regression
- However, do not mistake an MNL model for several separate logistic regression models combined

$$\ln\left(\frac{p_j}{p_r}\right) = b_{0j} + b_{1j} x_1 + \dots + b_{ij} x_i + e_j$$
$$\ln\left(\frac{p_k}{p_r}\right) = b_{0k} + b_{1k} x_1 + \dots + b_{ik} x_i + e_k$$
$$p_r + p_j + p_k = 1$$

Binary vs. multinomial logistic regression

Additional KEY assumption:

- Observations (and their errors) are independent of irrelevant alternatives (IIA property)
- In other words, the unobserved portion of utility for one alternative is unrelated to the unobserved portion of utility for another alternative
- Test the assumption: leave one alternative out and check what happens to the coefficients
- Fairly restrictive assumption: if violated, use other models (examples are probit / mixed logit)

Multinomial logistic regression

Dependent variables

- Polytomous outcome variable → multiple outcomes
- Often coded as 1, 2, 3, ...
- Outcome variable Y → the probabilities of the outcomes are modelled as P(Y = 1), P(Y = 2), P(Y = 3), ...

Multinomial logistic regression

- The dependent variable is handled much as in binary logistic regression
- One outcome is defined as the reference category
- Suppose the dependent variable Y has three outcome categories (r, j, k)
- Choose one reference category. Here, outcome category r is the reference category

$$\ln\left(\frac{p_j}{p_r}\right) = b_{0j} + b_{1j} x_1 + \dots + b_{ij} x_i + e_j \quad \text{for category } j$$
$$\ln\left(\frac{p_k}{p_r}\right) = b_{0k} + b_{1k} x_1 + \dots + b_{ik} x_i + e_k \quad \text{for category } k$$
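A minimal sketch of the equations above, written in Python rather than STATA and using made-up log-odds values that are not taken from the lecture. It converts log-odds relative to the reference category r into probabilities, illustrates the IIA property (dropping alternative k leaves the odds p_j / p_r unchanged), and shows that with a single non-reference category the formula reduces to the binary logistic function. The function name `mnl_probabilities` is hypothetical.

```python
import math

def mnl_probabilities(log_odds):
    """Convert log-odds relative to reference category 'r' into probabilities.

    log_odds maps each non-reference category to ln(p_cat / p_r).
    The reference category has log-odds 0, so its odds are exp(0) = 1.
    """
    odds = {"r": 1.0}
    odds.update({cat: math.exp(value) for cat, value in log_odds.items()})
    total = sum(odds.values())
    return {cat: value / total for cat, value in odds.items()}

# Hypothetical log-odds for categories j and k (illustrative values only).
full = mnl_probabilities({"j": 0.4, "k": -0.7})
without_k = mnl_probabilities({"j": 0.4})

# The three probabilities sum to one.
print(full, sum(full.values()))

# IIA: the odds of j versus r are the same with or without alternative k.
print(full["j"] / full["r"], without_k["j"] / without_k["r"])

# Binary special case: one non-reference category gives the logistic function.
print(without_k["j"], 1 / (1 + math.exp(-0.4)))
```

Note that the probabilities themselves change when k is dropped; only the ratio between the remaining alternatives is preserved, which is what makes the assumption restrictive.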
Multinomial logistic regression

$$\ln\left(\frac{p_j}{p_r}\right) = b_{0j} + b_{1j} x_1 + \dots + b_{ij} x_i + e_j \quad \text{for category } j$$
$$\ln\left(\frac{p_k}{p_r}\right) = b_{0k} + b_{1k} x_1 + \dots + b_{ik} x_i + e_k \quad \text{for category } k$$
$$p_r + p_j + p_k = 1$$

- One equation for each outcome category, except for the reference outcome category (remember: binary logistic regression has no separate equation for P(Y = 0))
- Number of equations = number of outcome categories of the dependent variable - 1
- Each independent variable has a separate coefficient for each outcome equation

Multinomial logistic regression

Example: modelling weight, Y = 0, 1, 2

Outcome categories:

- Underweight (Y = 0)
- Normal weight (Y = 1): reference category
- Overweight (Y = 2)

Equation 1: ln(odds) of being underweight compared to normal weight

$$\ln\left(\frac{P(Y=0)}{P(Y=1)}\right) = b_{0j} + b_{1j} x_1 + \dots + b_{ij} x_i + e_j \quad \text{for category } j$$

Equation 2: ln(odds) of being overweight compared to normal weight

$$\ln\left(\frac{P(Y=2)}{P(Y=1)}\right) = b_{0k} + b_{1k} x_1 + \dots + b_{ik} x_i + e_k \quad \text{for category } k$$

Multinomial logistic regression

- Again, if you know the ln(odds), you can calculate the odds and probabilities
- Excel is an excellent tool to do these simple calculations (predictions)
- Interpretation is almost the same as in binary logistic regression; however, it now becomes more important to emphasise the reference category

Remember the lessons learnt in the first weeks

Steps for model building:

1. Description of the data: descriptive statistics, crosstabs
2. Is the sample representative of the whole population?
3. Check data for outliers, measurement error, missing values, ...
4. Choose the correct regression method
5. Check the assumptions of the particular regression method
6. Use goodness of fit tests to find the 'best' specification

Remember the lessons learnt in the first weeks

What does this mean for multinomial regression models?

- Crosstabs are particularly insightful if one has only categorical (in)dependent variables
- Crosstabs for checking empty cells: if a cell has very few or no cases, the model may become unstable or it might not run at all
- The independence of irrelevant alternatives (IIA) assumption: roughly, the IIA assumption means that adding or deleting alternative outcome categories does not affect the odds among the remaining outcomes. There are alternative modelling methods that relax the IIA assumption
- Sometimes observations are clustered into groups (e.g., people within families, students within classrooms). In such cases, this is not an appropriate analysis because of IIA
- LR tests, pseudo R squared, ...

STATA Example

- Students entering high school choose among a general program, a vocational program and an academic program. Their choice might be modelled using their writing score (continuous variable) and their socioeconomic status (low, medium, high)
- Sample: n = 200
- Tip: start with a binary logistic regression model to gain first insights into the studied process and results. Then extend to multinomial.
- Possible exam question: formulate the statistical/mathematical model of the above-mentioned research problem. Define the variables and the operationalization of these variables.
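The crosstab check for empty cells mentioned in the model-building notes above can be sketched as follows. This is a minimal Python illustration with a handful of invented (program, ses) records, not the lecture's data; the helper name `empty_cells` and the `min_count` threshold are assumptions made for the example.

```python
from collections import Counter
from itertools import product

def empty_cells(rows, outcome_levels, predictor_levels, min_count=1):
    """Return the (outcome, predictor) cells with fewer than min_count cases."""
    counts = Counter(rows)
    return [
        (outcome, level)
        for outcome, level in product(outcome_levels, predictor_levels)
        if counts[(outcome, level)] < min_count
    ]

# Hypothetical (program, ses) records, purely for illustration.
records = [
    ("general", "low"), ("general", "middle"), ("academic", "high"),
    ("academic", "middle"), ("vocation", "low"), ("vocation", "middle"),
    ("academic", "low"),
]
programs = ["general", "academic", "vocation"]
ses_levels = ["low", "middle", "high"]

# Cells listed here would make the corresponding coefficients unstable.
sparse = empty_cells(records, programs, ses_levels)
print(sparse)
```

Raising `min_count` flags cells that are merely thin rather than empty, which is the "very few cases" situation the notes warn about.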
STATA Example

- The dependent variable (y) has three outcomes. The choice set is general, vocational or academic program choice.
- The independent variables are the writing score (wscore) and socioeconomic status (ses)
- → Hence we can estimate a multinomial logistic regression

$$y^* = b_0 + b_1 \cdot \text{wscore} + b_2 \cdot \text{ses}_{\text{low}} + b_3 \cdot \text{ses}_{\text{high}} + e$$
$$y^* = \{\text{general}, \text{vocational}, \text{academic}\}$$
$$\ln\left(\frac{P(Y=\text{general})}{P(Y=\text{vocation})}\right) = b_{0g} + b_{1g} \cdot \text{wscore} + b_{2g} \cdot \text{ses}_{\text{low}} + b_{3g} \cdot \text{ses}_{\text{high}} + e_g$$
$$\ln\left(\frac{P(Y=\text{academic})}{P(Y=\text{vocation})}\right) = b_{0a} + b_{1a} \cdot \text{wscore} + b_{2a} \cdot \text{ses}_{\text{low}} + b_{3a} \cdot \text{ses}_{\text{high}} + e_a$$

- The b's are the parameters to be estimated
- The error term has specific properties (logistic distribution, IIA property)

STATA Example

Descriptive statistics [Stata output not included in the transcript]

STATA Example

Crosstabulation [Stata output not included in the transcript]

STATA Example

- Crosstabulation: is there a significant association between the type of program chosen and socioeconomic status?
- In STATA, the next step is to do a multinomial regression
- But first, let me show you why the crosstabs can be very insightful in this case

STATA Example

Regression output: [Stata output not included in the transcript]
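The association question posed for the crosstab above can be checked with a Pearson chi-squared test. The sketch below is in Python rather than STATA. The crosstab itself is not reproduced in the transcript, so the counts used here are an assumption: they were chosen to match n = 200 and to be consistent with the coefficients reported later in the lecture, and should be replaced by the actual table.

```python
# Program choice (rows) by ses (columns: low, middle, high).
# ASSUMPTION: these counts are not given in the transcript; they are chosen to
# be consistent with n = 200 and the coefficients on the regression slides.
observed = {
    "general":  [16, 20, 9],
    "academic": [19, 44, 42],
    "vocation": [12, 31, 7],
}

def pearson_chi2(table):
    """Return the Pearson chi-squared statistic and its degrees of freedom."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(row_totals)
    chi2 = 0.0
    for r, row in enumerate(rows):
        for c, obs in enumerate(row):
            # Expected count under independence of program and ses.
            expected = row_totals[r] * col_totals[c] / n
            chi2 += (obs - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return chi2, dof

chi2, dof = pearson_chi2(observed)
CRITICAL_5PCT_DF4 = 9.488  # chi-squared critical value for 4 df at the 5% level
print(round(chi2, 2), dof, chi2 > CRITICAL_5PCT_DF4)
```

With these assumed counts the statistic is about 16.6 on 4 degrees of freedom, which exceeds the 5% critical value, so independence of program choice and ses would be rejected.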
STATA Example

Regression output:

$$g_1 = \ln\left(\frac{P(Y=\text{general})}{P(Y=\text{vocation})}\right) = 0.251 + 0.036 \cdot \text{ses}_{\text{low}} - 0.690 \cdot \text{ses}_{\text{middle}}$$
$$g_2 = \ln\left(\frac{P(Y=\text{academic})}{P(Y=\text{vocation})}\right) = 1.792 - 1.332 \cdot \text{ses}_{\text{low}} - 1.442 \cdot \text{ses}_{\text{middle}}$$
$$\exp(g_1) = \frac{P(Y=\text{general})}{P(Y=\text{vocation})} = \exp(0.251 + 0.036 \cdot \text{ses}_{\text{low}} - 0.690 \cdot \text{ses}_{\text{middle}})$$
$$\exp(g_2) = \frac{P(Y=\text{academic})}{P(Y=\text{vocation})} = \exp(1.792 - 1.332 \cdot \text{ses}_{\text{low}} - 1.442 \cdot \text{ses}_{\text{middle}})$$

- What are the ln(odds) and odds for $g_3$?
- Now we can easily predict ln(odds), odds, and probabilities
- However, the formulation of the probabilities changes a little because the dependent variable has three outcomes

STATA Example

Regression output:

$$P(Y=\text{general}) = \frac{\exp(g_1)}{\exp(g_1) + \exp(g_2) + \exp(g_3)}$$
$$P(Y=\text{general}) = \frac{\exp(g_1)}{\exp(g_1) + \exp(g_2) + 1}$$
$$P(Y=\text{academic}) = \frac{\exp(g_2)}{\exp(g_1) + \exp(g_2) + 1}$$
$$P(Y=\text{vocation}) = \frac{1}{\exp(g_1) + \exp(g_2) + 1}$$

A more formal notation is:

$$P(Y=\text{group } j) = \frac{\exp(g_j)}{\sum_{k=1}^{J} \exp(g_k)}$$

STATA Example

Prediction:

$$\exp(g_1) = \frac{P(Y=\text{general})}{P(Y=\text{vocation})} = \exp(0.251 + 0.036 \cdot \text{ses}_{\text{low}} - 0.690 \cdot \text{ses}_{\text{middle}})$$
$$\exp(g_2) = \frac{P(Y=\text{academic})}{P(Y=\text{vocation})} = \exp(1.792 - 1.332 \cdot \text{ses}_{\text{low}} - 1.442 \cdot \text{ses}_{\text{middle}})$$

Odds if $\text{ses}_{\text{low}} = 1$:

$$\exp(g_1) = \frac{P(Y=\text{general})}{P(Y=\text{vocation})} = \exp(0.251 + 0.036 \cdot 1 - 0.690 \cdot 0) = 1.33$$
$$\exp(g_2) = \frac{P(Y=\text{academic})}{P(Y=\text{vocation})} = \exp(1.792 - 1.332 \cdot 1 - 1.442 \cdot 0) = 1.58$$

Predicted conditional probability $\hat{P}(Y=\text{general})$ if $\text{ses}_{\text{low}} = 1$:

$$\hat{P}(Y=\text{general}) = \frac{1.33}{1.33 + 1.58 + 1} = 0.340$$
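The prediction above can be reproduced for all three ses groups with a short Python sketch (the lecture itself suggests Excel for this step). The coefficients are the ones quoted on the regression output slide; the function name `predict` and the dictionary layout are choices made for this illustration. For the reference program the linear predictor is taken to be zero, so its odds equal exp(0) = 1, which is where the "+1" in the denominator comes from.

```python
import math

# Coefficients as quoted on the regression output slide
# (reference program: vocation, reference ses: high).
COEFS = {
    "general":  {"const": 0.251, "ses_low": 0.036,  "ses_middle": -0.690},
    "academic": {"const": 1.792, "ses_low": -1.332, "ses_middle": -1.442},
}

def predict(ses_low, ses_middle):
    """Return predicted probabilities for all three programs."""
    odds = {"vocation": 1.0}  # g3 = 0 for the reference category
    for program, b in COEFS.items():
        g = b["const"] + b["ses_low"] * ses_low + b["ses_middle"] * ses_middle
        odds[program] = math.exp(g)
    total = sum(odds.values())
    return {program: value / total for program, value in odds.items()}

# Dummy coding: low = (1, 0), middle = (0, 1), high = (0, 0).
for label, (low, middle) in {"low": (1, 0), "middle": (0, 1), "high": (0, 0)}.items():
    probs = predict(low, middle)
    print(label, {program: round(p, 3) for program, p in probs.items()})
```

For ses low this gives a predicted probability of the general program of about 0.340, matching the hand calculation on the slide.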
STATA Example

Back to the Excel sheet:

- Note the similarity between the predicted conditional probabilities and the probabilities of the crosstabulation
- Note that in this simple case, crosstabs can be very insightful

STATA Example

Goodness of fit:

- Evaluates the overall model!
- Does the final model fit the data significantly better than the constant-only model?

STATA Example

Add variables: [Stata output not included in the transcript]

STATA Example

Interpretation:

- For students in a public school, the odds of choosing the academic program rather than the vocation program (reference program) are 0.152 times the corresponding odds for students in a private school (reference school)
- Or, equivalently, for students in a private school the odds of choosing the academic program rather than the vocation program are 6.58 (= 1/0.152) times the corresponding odds for students in a public school
- For a student with low ses, the odds of choosing the general program rather than the vocational program (reference program) are not significantly different from those of a student with high ses (reference ses), i.e. the odds ratio is not significantly different from 1

STATA Example

Interpretation:

- $\beta < 0$: $e^{\beta} < 1$: odds and probability P(Y = j) become smaller
- $\beta > 0$: $e^{\beta} > 1$: odds and probability P(Y = j) increase
- $\beta = 0$: $e^{\beta} = 1$: odds and probability P(Y = j) do not change
- If $\beta$ is significantly different from 0, then $e^{\beta}$ is significantly different from 1!
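The link between the sign of a coefficient and its exponent, and the 1/0.152 calculation in the interpretation above, can be sketched in Python as follows. Only the value 0.152 comes from the lecture; the variable names and the helper `direction` are hypothetical. Switching the reference school flips the sign of the coefficient, which is why the odds ratio inverts.

```python
import math

odds_ratio_public = 0.152                  # value quoted on the interpretation slide
beta_public = math.log(odds_ratio_public)  # implied coefficient on the log scale

# Switching the reference school flips the sign of the coefficient,
# which inverts the odds ratio: exp(-beta) = 1 / exp(beta).
odds_ratio_private = math.exp(-beta_public)

def direction(beta):
    """Describe how the odds relative to the reference category change."""
    if beta < 0:
        return "odds decrease (exp(beta) < 1)"
    if beta > 0:
        return "odds increase (exp(beta) > 1)"
    return "odds unchanged (exp(beta) = 1)"

print(round(beta_public, 3), round(odds_ratio_private, 2))
print(direction(beta_public))
print(direction(-beta_public))
```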
Discuss paper of DeMaris (1995)

- Continue the discussion of DeMaris (1995)
- DeMaris (1995): A tutorial in logistic regression, Journal of Marriage and the Family, 57, 956-968

Discuss methods and results of Reczek et al. (2014)

- Reczek, Liu and Spiker (2014): A Population-Based Study on Alcohol Use in Same-Sex and Different-Sex Unions, Journal of Marriage and the Family, 76, 557-572
- Researchers investigate differentials in alcohol use among different-sex and same-sex married and cohabiting individuals.
- They use representative population-based survey data from 1997-2011 including around 84 000 male observations.
- Table 2 (next slide) shows the results from a multinomial logistic regression for men only where the choice set has 5 categories (reference category is someone who is a lifetime abstainer from drinking alcohol).

MNL regression results

[Table 2 of Reczek et al. (2014): not included in the transcript]

Discuss methods and results of Reczek et al. (2014)

- Interpret the coefficient of different-sex cohabiting in Column A (current heavy drinker)

MNL regression results

[Table 2 of Reczek et al. (2014): not included in the transcript]
Discuss methods and results of Reczek et al. (2014)

- Look at the results under Column B (current moderate drinker). The 'a' behind the coefficient of 'same-sex cohabiting' implies that the coefficient is significantly different (at the 95% confidence level) from the coefficient of 'same-sex married'.
- Calculate whether this is a correct finding (critical t-value at 95% = 1.96) and explain how you performed your calculation.

Anything that catches your attention?

Practice

- Another example: https://stats.idre.ucla.edu/stata/dae/multinomiallogistic-regression/

Remember the important steps

Steps in analysis:

- Frequencies, crosstabs, chi-squared statistics, correlations
- Regression
  - Effects of explanatory variables on outcome variable (separately)
  - Full model: all relevant explanatory variables included
  - Interactions? Only one at a time
- Goodness of fit (at each step)
- Document data, data transformations, steps taken

What did we learn?

- Distinguish between binary and multinomial methods
- Apply multinomial logistic regression models
- Interpret results from MNL regression models
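For the coefficient comparison asked for in the Reczek et al. (2014) exercise above, one common approach (not necessarily the calculation intended in the lecture) is a z-type statistic for the difference between two coefficients, compared against 1.96. The Python sketch below uses invented coefficients and standard errors, since Table 2 is not reproduced in the transcript, and it assumes a zero covariance between the two estimates unless one is supplied. The function name `coefficient_difference_z` is hypothetical.

```python
import math

def coefficient_difference_z(b1, se1, b2, se2, cov=0.0):
    """z statistic for H0: b1 = b2.

    ASSUMPTION: cov defaults to 0 because published tables rarely report the
    covariance between estimates; a non-zero covariance changes the result.
    """
    return (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2 - 2 * cov)

# Hypothetical coefficients and standard errors, NOT taken from Table 2.
z = coefficient_difference_z(b1=0.90, se1=0.20, b2=0.30, se2=0.25)

# The difference is significant at the 95% level if |z| exceeds 1.96.
print(round(z, 2), abs(z) > 1.96)
```

With these made-up inputs the statistic falls just short of 1.96, so the two coefficients would not be judged significantly different; the actual answer depends on the values in the published table.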
Next lecture

- Next lecture: March 9 at 11h00-13h00 on multilevel modelling (Chapter 10) // No exam material
- Computer lab sessions: Thursday 15h00-17h00 on multinomial logistic regression