Advanced Statistical Analysis Lecture Notes (University of Groningen)
University of Groningen
2023
Mark van Duijn
Summary
These are lecture notes from the University of Groningen on Advanced Statistical Analysis, specifically focusing on discrete choice models and logistic regression. The notes cover topics such as model assumptions, different types of discrete choice models, and a discussion of DeMaris (1995).
Full Transcript
Advanced Statistical Analysis
Week 3 - Lecture 5
Dr. Mark van Duijn (RUG)
Department of Economic Geography, Department of Demography
University of Groningen
20 Feb, 2023

Announcement
National Student Survey! Important to promote your Master!

Agenda
Part I: Discrete choice models: Logistic regression / Logit
Part II: DeMaris (1995)

Logistic regression
So... What do you know / remember?

Discrete choice models
- These types of models describe (decision makers') choices among two or more discrete alternatives (they can also be applied to events).
- In other words, the dependent variable is not a continuous variable but a limited dependent variable (or discrete variable): binary choice (0/1), multinomial choice, ...
- Decision makers: people, households, firms, etc.
- Alternatives: competing products or any other options or items over which choices must be made.
- Background literature: Train (2009), Ch. 1-3.

When to use discrete choice models
“In general, the researcher needs to consider the goals of the research and the capabilities of alternative methods when deciding whether to apply a discrete choice model” (Train, 2009, p.14).
Train's Chapters 1, 2 and 3 are not exam material, but the following slides are.

The choice set
Characteristics that must be met to use discrete choice models:
- Alternatives must be mutually exclusive.
- The choice set must be exhaustive.
- The number of alternatives must be finite.
If the first or second criterion is violated, the choice set can nearly always be redefined so that it is met. For example, suppose two alternatives are labeled A and B.
- Both A and B can be chosen... Solution?
- The individual can choose neither... Solution?
Important assumption: choice is based on utility-maximizing behavior!
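
A standard way to repair a choice set that is not mutually exclusive or not exhaustive is to redefine the alternatives: every combination of the raw options becomes its own alternative, and a "none" alternative is added. The sketch below illustrates that idea for the A/B example; the function names `make_choice_set` and `label` are hypothetical and not part of the lecture.

```python
from itertools import combinations


def make_choice_set(options, allow_none=True):
    """Build mutually exclusive, exhaustive alternatives from raw options.

    Each non-empty combination of the raw options becomes one alternative,
    so exactly one alternative describes any decision maker's choice.
    Adding the empty combination ("none") makes the set exhaustive.
    """
    alternatives = []
    for size in range(1, len(options) + 1):
        for combo in combinations(options, size):
            alternatives.append(frozenset(combo))
    if allow_none:
        alternatives.append(frozenset())  # the "chooses neither" alternative
    return alternatives


def label(alternative):
    """Readable label for one redefined alternative."""
    if not alternative:
        return "none"
    names = sorted(alternative)
    if len(names) == 1:
        return names[0] + " only"
    return " and ".join(names)


choice_set = make_choice_set(["A", "B"])
labels = [label(a) for a in choice_set]
print(labels)  # ['A only', 'B only', 'A and B', 'none']
```

With two raw options this gives four alternatives, which is still a finite set, so the third criterion is also met.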

Different types of discrete choice models
- Binary: the individual chooses between two alternatives. 0/1 ('no/yes') outcome.
- Multinomial: the individual chooses among more than two alternatives.
- Ordered: the individual reveals the strength of his/her preferences. Numerical values are only a ranking, e.g. 0/1/2/3/4 to indicate the strength of preferences.
- Count: count of the number of occurrences.

Binary/dichotomous variables
Outcome is binary:
- Success-failure of treatment
- Use-nonuse of contraception
- Survival-death of an infant
- Agree-disagree question on survey
- Yes-no response
Why no linear regression model?

Example
Outcome is binary:
- Dependent variable: Happiness (Y=1 if unhappy, Y=0 if happy)
- Independent variable: Health score (0-100)
What now? What is the first step before modelling?
Scatterplot? What would that scatterplot look like? Correlation?

Linear regression assumptions
Model assumptions:
- The error term has a conditional mean of zero, is independent, is normally distributed, and has a constant variance
- The model is correctly specified: linearity between Y (dependent variable) and the X's (independent variables)
- Absence of multicollinearity
- Absence of influential observations
Consistency? Efficiency?

Logistic regression assumptions
Model assumptions:
- The model is correctly specified: linearity between the logit of Y (dependent variable) and the X's (independent variables)
- Absence of multicollinearity
- Absence of influential observations
- No correlation between the X's and the error term ε; uncorrelated errors
- The error term has a logistic distribution
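
To make the earlier question "Why no linear regression model?" concrete, the sketch below fits an ordinary least squares line to a binary outcome and inspects the fitted values. The data are made up for illustration (they are not from the lecture): eleven hypothetical respondents with health scores 0, 10, ..., 100, where everyone below 50 is unhappy. The helper `ols_fit` is likewise a hypothetical name.

```python
def ols_fit(x, y):
    """Simple one-regressor OLS: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: health score (0-100) and Y = 1 if unhappy, 0 if happy.
health = list(range(0, 101, 10))
unhappy = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

b0, b1 = ols_fit(health, unhappy)
predicted = [b0 + b1 * h for h in health]

# The fitted "probabilities" leave the [0, 1] range at both ends:
# about 1.14 at health = 0 and about -0.23 at health = 100.
print(round(predicted[0], 2), round(predicted[-1], 2))
```

A fitted value above 1 or below 0 cannot be read as a probability, which is one reason to model a transformation of P(Y = 1) instead.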

Switch to logistic regression
Logistic regression is a specialized form of regression that is designed to predict and explain a binary (two-group) categorical variable rather than a metric dependent measure. Its variate is similar to regular regression and made up of metric independent variables.
- Switch to modelling the probability P(Y = 1)
- Probabilities range from 0 to 1
- Linear relationship between the variables and ln(odds)
- Estimation technique: Maximum Likelihood (iterative procedure)
- Logit is popular because the formula for the choice probabilities takes a closed form and is readily interpretable

Assumptions
Logistic regression: the error term has a logistic distribution.
- Fairly restrictive assumption: if violated, use other models (examples are probit / mixed logit)
- However, the goal is to represent utility so well that the only unobserved portion is simply white noise
- Then the logit model is ideal rather than restrictive

Example: modelling probabilities
Instead of modelling the outcome as 0 or 1, model P(Y = 1).
- p: probability of success
- 1-p: probability of failure
- Mean of Y is equal to p → proportion
Example: p(unhappy) = 0.5
Health score    Proportion unhappy (= probability)
< 20            > 0.8
> 40            < 0.2
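
The slides describe maximum likelihood as an iterative procedure. One common iterative scheme is Newton-Raphson, sketched below for a logit model with a single regressor. This is an illustration under stated assumptions, not the routine used in the course software: the data are invented, the health score is expressed in units of 10 points, and the names `sigmoid` and `fit_logit` are hypothetical.

```python
import math


def sigmoid(z):
    """Inverse of the logit: maps ln(odds) back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit(x, y, iterations=25):
    """Newton-Raphson maximum likelihood for ln(p / (1 - p)) = b0 + b1 * x."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Score (gradient of the log-likelihood).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian), a symmetric 2x2 matrix.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system with the explicit inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data: health score in tens (0-10), Y = 1 if unhappy.
# The outcomes overlap, so the maximum likelihood estimate exists.
health10 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
unhappy = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

b0, b1 = fit_logit(health10, unhappy)
print(b0, b1)                                # b1 is negative
print(sigmoid(b0), sigmoid(b0 + b1 * 10))    # P(unhappy) at scores 0 and 100
```

Each pass updates the coefficients until the score is (numerically) zero, which is what "iterative" refers to. If the outcomes were perfectly separated by the health score, the estimates would diverge and this simple loop would not be appropriate.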

Comparison linear and logistic regression
Fit the model using OLS: $P(y = 1) = \beta_0 + \beta_1 x + \epsilon$
- Prediction can be problematic: the model can predict outside the [0,1] range (no linear relationship; specification error)
- The assumptions of homoscedasticity and normality of errors are not met
Solution: transformation of $P(y = 1)$
- The logit transformation is the most common one
- $P(y = 1) \rightarrow \text{odds} = \frac{p}{1 - p}$
- Restricts predicted values for $P(y = 1)$ to the range [0,1]
- $\text{logit}(p) = \ln(\text{odds}) = \ln\left(\frac{p}{1 - p}\right)$
- $\ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x + \epsilon$

Odds and probabilities
Probability: $P(y = 1) = \frac{\text{successes}}{\text{successes} + \text{failures}}$
Odds: $\text{odds} = \frac{\text{successes}}{\text{failures}} = \frac{p}{1 - p}$
Example: the odds of winning this bet are 3:5; the probability of winning this bet is 3/8.
Suppose the probability that a student actually votes is $P(y = 1) = 0.8 \rightarrow \text{odds} = \frac{0.8}{1 - 0.8} = 4$
- For every 4 students actually voting, the next one is expected not to vote
- Odds can take values in $[0, \infty)$
- Odds go to infinity as $P(y = 1)$ goes to 1 (note the problem when the denominator goes towards 0)

Interpretation
Logistic regression equation: $\ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 x_1 + \ldots + \epsilon$
- If $x_1$ increases by 1 unit, ln(odds) increases by $b_1$
- If $x_1$ increases by 1 unit, the odds are multiplied by $\exp(b_1)$
- A 1-unit increase in $x_1$ multiplies the odds of Y=1 (relative to Y=0) by $\exp(b_1)$, keeping all other variables constant
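
The conversions between odds and probabilities, and the $\exp(b_1)$ interpretation, can be checked with a few lines of code. The sketch below reproduces the two slide examples exactly using fractions, then uses hypothetical coefficients ($b_0 = 2$, $b_1 = -0.05$, chosen only for illustration) to show that the odds ratio for a one-unit increase is the same at every starting value of x, while the change in the probability is not.

```python
import math
from fractions import Fraction


def odds_from_prob(p):
    """odds = p / (1 - p)"""
    return p / (1 - p)


def prob_from_odds(odds):
    """p = odds / (1 + odds)"""
    return odds / (1 + odds)


# Slide examples, computed exactly.
print(prob_from_odds(Fraction(3, 5)))  # 3/8  (odds of 3:5)
print(odds_from_prob(Fraction(4, 5)))  # 4    (probability 0.8)

# Hypothetical coefficients for ln(p / (1 - p)) = b0 + b1 * x.
b0, b1 = 2.0, -0.05


def prob(x):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


results = {}
for x in (20, 40, 80):
    odds_ratio = odds_from_prob(prob(x + 1)) / odds_from_prob(prob(x))
    change_in_p = prob(x + 1) - prob(x)
    results[x] = (odds_ratio, change_in_p)
    print(x, odds_ratio, change_in_p)

# The odds ratio equals exp(b1) at every x; the change in p is largest
# where p is near 0.5 (x = 40 here) and smaller towards 0 or 1.
print(math.exp(b1))
```

This is why coefficients are usually reported as odds ratios: that quantity does not depend on where along x the comparison is made.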

Short recap
Logistic regression equation: $\ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 x_1 + \ldots + \epsilon$
Differences from linear regression:
- The outcome Y is not modelled directly, but P(Y = 1)
- Estimation technique: Maximum Likelihood (iterative process)
- Changes in X do not have a constant effect on Y: if P(Y = 1) is close to 0 or 1, a change in X has a smaller effect on P(Y = 1)

Discuss paper of DeMaris (1995)
DeMaris (1995): A tutorial in logistic regression, Journal of Marriage and the Family, 57, 956-968

What did we learn?
- Describe logistic regression models and compare them to linear regression models
- Apply logistic regression to predict odds and probabilities of choices or events
- Recognize logistic regression models in published articles
- Practice interpreting logistic regression models

Advised literature to read
Compulsory literature:
- Hand-out Logistic regression: Problems with Linear Probability Models (using OLS)
- Mehmetoglu & Jakobsen (2022): Chapter 8 on logistic regression
- DeMaris (1995): A tutorial in logistic regression, Journal of Marriage and the Family, 57, 956-968
Background literature:
- Background document I (Train, 2009)

Next lecture
Next lecture: Thursday from 11h00-13h00
Computer lab sessions: Thursday from 15h00-17h00 on logistic regression models