ADDA15.lec7_Logistic Regression question generator.docx
PSYC40005 - 2023 ADVANCED DESIGN AND DATA ANALYSIS
Lecture 7: Logistic regression and loglinear models
Adam Osth
Melbourne School of Psychological Sciences, University of Melbourne
Melbourne Connect – Room 8216
[email protected]

Topics: logistic regression; loglinear models

Linear models
Changes in x produce the same change in y regardless of the value of x. In other words, suppose we have a linear model that predicts weight in kg from height in cm of the form weight = .66 × height − 35. If someone's height increases from 100 to 110, we predict an increase in weight from 31 to 37.6 kg (an increase of 6.6 kg). What if it increases from 200 to 210? An increase from 97 to 103.6 kg, again an increase of 6.6 kg.

Non-linear models
Changes in x produce changes in y that depend on the value of x. Suppose we have a non-linear model that predicts weight in kg from height in cm of the form weight = 10 log(.66 × height − 20) + 50. If someone's height increases from 40 to 50, weight increases from 68.56 to 75.64 (an increase of 7.07). If height increases from 60 to 70, weight increases from 79.75 to 82.65 (an increase of 2.90). The changes in y get smaller as x increases (because it is a negatively accelerated function).

There are many cases where linear models are inappropriate. Not everything increases or decreases without bounds! Sometimes we have a lower bound of zero (e.g., you can't have negative amounts of food!). Sometimes we have an upper bound of some kind (accuracy has an upper bound of 1.0, or 100%). And not everything changes by the same amount each time. Negatively accelerated functions: learning over time, forgetting over time, increase in muscle mass with training, etc. Positively accelerated functions (e.g., exponential growth): spread of infections, population growth, etc.

Section 1: Lecture 7

What has two outcomes? Predicting whether someone is alive or dead. Predicting whether or not a participant is a member of a group (a student, a parent, etc.). Predicting a participant's two-choice data. Accuracy! There are many cases where responses are scored as either correct or incorrect: yes vs. no responses, Category A vs. Category B (categorization), recognize vs. not recognize (recognition memory).

Consider a dataset to examine aspects of maths achievement in secondary school. In that dataset, there are (predictor) variables relating to gender, parent's education, a mosaic test, and a visualisation test. There is also a binary variable (0, 1), alg2, indicating whether a student enrolled in an advanced subject, algebra 2.

Y = α + β1X1 + β2X2 + ... + βpXp
What does β mean? For every unit of X, we get a corresponding increase in Y. We can't use this interpretation with binary outcomes, but we can with the logarithm of the odds of obtaining Y.

The logarithm is the inverse of exponentiation, exp(x) or e^x. Increases on a log scale correspond to multiplicative increases, e.g., the magnitude of earthquakes. What's the difference between magnitude 5 and 6? Magnitude 6 is 10x the intensity of magnitude 5, and magnitude 7 is 100x the intensity of magnitude 5. This is an example of log10. The most common example is the natural log: ln, or log_e, where e ≈ 2.71.

[Figure: exponential, linear, and logarithmic functions plotted together.] Note that log(exp(y)) = y and exp(log(y)) = y: the two operations undo each other.

Instead of predicting Y = 0 or 1, we model the probability of Y = 1 occurring. This is a continuous function, ranging between 0 and 1. Specifically, we model the log odds of obtaining Y = 1:
ODDS: P(Y=1)/P(Y=0)
LOG ODDS: log[P(Y=1)/P(Y=0)]
We predict the logarithm of the odds as a regression.

Estimating the log odds for Y, alg2
If visual = 0, then the log odds = −1.262, i.e. odds = e^−1.262 = 0.28. Since P(Y=1)/P(Y=0) = P(Y=1)/[1 − P(Y=1)] = 0.28, the estimated P(Y=1) = 0.28/(1 + 0.28) = 0.22. If visual = 14.75, the odds = 6.75, and the estimated P(Y=1) = 0.87. In general, the estimated P(Y=1) = odds/(1 + odds). A small numeric check of these values follows below.
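To make the log odds to probability conversion concrete, here is a minimal sketch in Python (my own illustration, not from the lecture; the log odds of −1.262 and the odds of 6.75 are taken from the worked example above, and the function name is hypothetical):

import math

def prob_from_log_odds(log_odds):
    """Convert log odds to an estimated P(Y = 1) via odds / (1 + odds)."""
    odds = math.exp(log_odds)
    return odds / (1 + odds)

print(prob_from_log_odds(-1.262))          # ~0.22, the visual = 0 case
print(prob_from_log_odds(math.log(6.75)))  # ~0.87, the visual = 14.75 case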
But what do we mean by odds? This has exactly the same meaning as at the race track! If the odds > 1, then Y = 1 is a more probable outcome than Y = 0. If the odds = 1, then it's 50-50. If the odds < 1, then Y = 0 is more probable.

What does it look like if you plot predictors against probability instead of log odds? It is a non-linear function!

We have not completely abandoned the linear model; we have generalized it to something called the generalized linear model. The fundamental model for data analysis is DATA = MODEL + RESIDUAL.
Linear model: Y = α + β1X1 + β2X2 + ... + βpXp + ε
Generalized linear model (involving a function of Y): f(Y) = α + β1X1 + β2X2 + ... + βpXp + ε
The function is called the link, sometimes labelled μ. Some links:
The identity link, μ = Y (i.e. f(Y) = Y), gives the linear model.
The logistic link, μ = log[P(Y=1)/P(Y=0)], equivalently P(Y=1) = 1/(1 + e^−Y), is used for binary variables and gives the logistic regression model.
The logarithm link, μ = log Y, is used for counts or frequencies and gives the loglinear model (the 2nd half of this lecture).
There are many others! E.g., exponential, inverse Gaussian, quadratic, etc.

Standard linear regression conforms to certain assumptions: the data are unbounded (negative infinity to infinity)! Datasets do not always meet these assumptions. In the logistic function 1/(1 + e^−t), t can range from negative infinity to infinity, yet the Y values are always between 0 and 1. Substituting the regression equation into the logistic function gives 1/(1 + e^−(α + β1x1 + ... + βpxp)). The predictors can range from negative infinity to infinity; the logistic function makes it such that the predictors only predict values between 0 and 1.

To summarize: the generalized linear model is a very useful way to explore datasets that do not conform to the assumptions of linear regression. The idea is that the appropriate function/link allows linear techniques to be employed even if the data are not linear. So logistic regression is the best way to predict a binary variable from other variables. Make sure your dependent variable is coded (0, 1). The 0, 1 coding is arbitrary; you can reverse the coding.

The beauty of it all: there is no assumption of normality (binary data come from a binomial or Bernoulli distribution), no assumption of linearity, and no assumption of homoscedasticity (equal variances). With binomially distributed data, the variance depends on the probability or frequency: as the probability/expected frequency approaches 0 or 1, the variance approaches zero. For a binomial distribution of 100 draws of either 0 or 1 with different probabilities, variability is highest for intermediate probabilities and much lower for extreme probabilities (p = .02 and p = .99). When p reaches 0 or 1, there is no variability at all.

So what are the assumptions? Binary outcomes which are mutually exclusive; independence of observations (as per usual); independent variables can be continuous or categorical.

How the model is fit to data: maximum likelihood estimation. We maximize the log likelihood of the data under the model parameters. For each observation (which is a 1 or 0), we have a predicted probability from the model, and we calculate the likelihood of each observation according to a Bernoulli distribution. The closer the probabilities are to the data, the higher the likelihood. This contrasts with linear regression, which (commonly) uses ordinary least squares methods that minimize the squared deviations of the model from the data (i.e., the residuals). The parameters cannot be "solved" the way they can in linear regression; stats programs estimate them using numerical methods. A sketch of the likelihood being maximized follows below.
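As an illustration of what a stats program is maximizing, here is a minimal sketch of the Bernoulli log likelihood under a logistic model (my own code, not from the lecture; the function name and the toy data are hypothetical):

import numpy as np

def logistic_log_likelihood(beta, X, y):
    """Log likelihood of binary data y under a logistic regression model.

    X is an n-by-p design matrix whose first column is all ones (intercept),
    beta is a length-p coefficient vector, and y is a vector of 0s and 1s.
    """
    p = 1.0 / (1.0 + np.exp(-X @ beta))  # predicted P(Y = 1) for each observation
    # Bernoulli likelihood: p where y = 1, (1 - p) where y = 0, summed on the log scale
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy data: an intercept column plus one predictor
X = np.array([[1.0, 0.0], [1.0, 5.0], [1.0, 14.75]])
y = np.array([0, 0, 1])
print(logistic_log_likelihood(np.array([-1.262, 0.190]), X, y))

A fitting routine then searches numerically for the beta that makes this sum as large as possible, for example by passing the negative of this function to a general-purpose optimizer such as scipy.optimize.minimize.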
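And circling back to the binomial variability point above: a quick simulation (again my own illustration; the probabilities match the slide's p = .02 and p = .99 examples, with p = .5 added for contrast) shows that variance is highest at intermediate probabilities and collapses near 0 and 1:

import numpy as np

rng = np.random.default_rng(0)
for p in (0.02, 0.5, 0.99):
    draws = rng.binomial(n=1, p=p, size=100)  # 100 draws of either 0 or 1
    print(p, draws.var())  # near p * (1 - p): about 0.0196, 0.25, 0.0099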
From the dataset, we use the predictor variables (gender, parent's education, mosaic test, visualisation test) to predict the probability of whether a student enrolled in algebra 2 (alg2).

In logistic regression, the variance depends on the proportion (mean), and hence R² cannot be compared either with linear regression R² or with R² for other binary dependent variables with a different mean. It is easier to account for more variance for extreme values, since there is little variability there. Means around .5 have very large variance: half the values are 1 and half are 0. Means around .99 do not have much variance: 99% of the values are 1 and 1% are 0 (low variance compared to the above).

In fact, in logistic regression, R² is not calculated based on the correlation or variance accounted for at all! It is calculated based on likelihood ratios. The Cox and Snell (1989) approximation is a ratio of the log likelihood of the data under the model to the log likelihood of the data under the "null model" (no predictors). In other words, the better your model does over the null model, the higher the R² value. This does not have a maximum value of 1.0. Nagelkerke (1991) adjusts this by taking the ratio of the Cox and Snell R² to its maximum possible value. (The first sketch at the end of this section works through both formulas.)

What does it mean to correctly predict values? A value counts as correctly predicted if the value is 1 and the predicted probability is > .5, or if the value is 0 and the predicted probability is < .5. This is a convention that SPSS uses for interpretability. Models should be compared on the Cox & Snell/Nagelkerke R².

The visual test and parental education are significant predictors. Higher scores on the test and higher levels of parental education increase the chances of the student taking algebra 2. B is the LOG ODDS RATIO. The log odds of taking algebra 2 increase by .190 for every unit increase in the visual test. The log odds increase by .380 for every unit increase in parental education.

"The log odds of taking algebra 2 increase by .190 for every unit increase in the visual test." We exponentiate log odds to get the odds. If we exponentiate .190, we get 1.21. Does this mean the odds increase by 1.21 for every unit increase in x? No! Each unit increase multiplies the odds by 1.21, so the odds are exponentially related to the predictors; this is a nonlinear function. Likewise, the probability is linked to the predictors via a logistic function, also nonlinear! Linear function: every increase in the predictor produces the same increase in LOG ODDS. But if we take exp(log odds), we have a non-linear function: every increase in the predictor produces a different increase in the odds, depending on the level of the predictor! The same is true if we take odds/(1 + odds) to get the probability; the second sketch at the end of this section demonstrates this numerically.

Section 2: Lecture 7
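Returning to Section 1's pseudo-R² discussion: a minimal sketch of the standard Cox & Snell and Nagelkerke formulas, which implement the likelihood-ratio idea described above (my own code, not from the lecture; the log likelihoods and sample size are hypothetical):

import math

def cox_snell_r2(ll_model, ll_null, n):
    """Cox & Snell (1989) pseudo-R2 from log likelihoods and sample size n."""
    return 1 - math.exp(2 * (ll_null - ll_model) / n)

def nagelkerke_r2(ll_model, ll_null, n):
    """Nagelkerke (1991): Cox & Snell R2 divided by its maximum possible value."""
    max_r2 = 1 - math.exp(2 * ll_null / n)
    return cox_snell_r2(ll_model, ll_null, n) / max_r2

# Hypothetical log likelihoods for a sample of n = 75 students
print(cox_snell_r2(ll_model=-35.0, ll_null=-48.0, n=75))   # ~0.29
print(nagelkerke_r2(ll_model=-35.0, ll_null=-48.0, n=75))  # ~0.41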
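And on Section 1's nonlinearity point: each unit increase in the visual test multiplies the odds by exp(.190) ≈ 1.21, yet the resulting change in probability depends on where you start. A minimal numeric demonstration (my own sketch; B = .190 is from the slides, but the intercept here is illustrative rather than taken from the same fitted model):

import math

def prob(visual, intercept=-1.262, b=0.190):
    """Estimated P(alg2 = 1) for a given visual score under an illustrative model."""
    odds = math.exp(intercept + b * visual)
    return odds / (1 + odds)

for v in (0, 1, 10, 11):
    # the odds grow by a constant factor of ~1.21 per unit of visual,
    # but the probability changes by different amounts at different levels
    print(v, round(prob(v), 3))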