PSYC40005 - 2023 ADVANCED DESIGN AND DATA ANALYSIS
Lecture 7: Logistic regression and loglinear models
Adam Osth, Melbourne School of Psychological Sciences, University of Melbourne
Melbourne Connect – Room 8216
[email protected]

Topics: logistic regression; loglinear models.

Linear models
Changes in x produce the same change in y regardless of the value of x. In other words: suppose we have a linear model that predicts weight in kg from height in cm of the form
weight = .66 * height - 35
If someone's height increases from 100 to 110, we predict an increase in weight from 31 to 37.6 kg (an increase of 6.6 kg). What if it increases from 200 to 210? An increase from 97 to 103.6 kg – again an increase of 6.6 kg.

Non-linear models
Changes in x produce changes in y that depend on the value of x. Suppose we have a non-linear model that predicts weight in kg from height in cm of the form
weight = 10 * log(.66 * height - 20) + 50
If someone's height increases from 40 to 50, weight increases from 68.56 to 75.64 (an increase of 7.08). If height increases from 60 to 70, weight increases from 79.75 to 82.65 (an increase of 2.90). The changes in y get smaller as x increases (because it is a negatively accelerated function).

There are many cases where linear models are inappropriate. Not everything increases or decreases without bounds! Sometimes we have a lower bound of zero – e.g., you can't have negative amounts of food! Sometimes we have an upper bound of some kind – accuracy has an upper bound of 1.0, or 100%. And not everything changes by the same amount each time. Negatively accelerated functions: learning over time, forgetting over time, increase in muscle mass with training, etc. Positively accelerated functions (e.g., exponential growth): spread of infections, population growth, etc.

Section 1: Logistic regression

What has two outcomes? Predicting whether someone is alive or dead. Predicting whether or not a participant is a member of a group (a student, a parent, etc.). Predicting a participant's two-choice data. Accuracy! There are many cases where responses are scored as either correct or incorrect: yes vs. no responses, Category A vs. Category B (categorization), recognize vs. not recognize (recognition memory).

An example: a dataset on aspects of maths achievement in secondary school. In that dataset, there are predictor variables relating to gender, parents' education, a mosaic test, and a visualisation test. There is also a binary variable (0,1), alg2, indicating whether a student enrolled in an advanced subject, algebra 2.

In linear regression, Y = α + β1X1 + β2X2 + ... + βpXp. What does β mean? For every unit of X, we get a corresponding increase in Y. We can't use this interpretation with binary outcomes... but we can with the logarithm of the odds of obtaining Y.

The logarithm is the inverse of exponentiation, exp(x) or e^x. Increases on a log scale correspond to multiplicative increases. E.g., magnitudes of earthquakes: what's the difference between magnitude 5 and magnitude 6? Magnitude 6 is 10x the intensity of magnitude 5, and magnitude 7 is 100x the intensity of magnitude 5. This is an example of log10 (the base-10 logarithm). The most common example is the natural log: ln, or log_e, where e ≈ 2.718.

[Figure: exponential, linear, and logarithmic functions. The log and the exponential undo one another: log(exp(y)) = exp(log(y)) = y.]

Instead of predicting Y = 0 or 1, we model the probability of Y = 1 occurring. This is a continuous function, ranging between 0 and 1. Specifically, we model the log odds of obtaining Y = 1:
ODDS: P(Y=1)/P(Y=0)
LOG ODDS: log[P(Y=1)/P(Y=0)]
We predict the logarithm of the odds as a regression.

Estimating the log odds for Y, alg2, from the visualisation test (visual):
If visual = 0, then log odds = -1.262, i.e. odds = e^(-1.262) = 0.28.
Since P(Y=1)/P(Y=0) = P(Y=1)/[1 - P(Y=1)] = 0.28, the estimated P(Y=1) = 0.28/(1 + 0.28) = 0.22.
If visual = 14.75, odds = 6.75, and the estimated P(Y=1) = 0.87.
In general: estimated P(Y=1) = odds/(1 + odds).
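To make these conversions concrete, here is a minimal Python sketch (not part of the original lecture; the numbers are taken from the worked example above):

    import math

    def log_odds_to_probability(log_odds):
        # odds = exp(log odds); then P(Y=1) = odds / (1 + odds)
        odds = math.exp(log_odds)
        return odds / (1 + odds)

    # visual = 0 gives log odds of -1.262 in the lecture's example
    print(log_odds_to_probability(-1.262))          # ~0.22

    # visual = 14.75 gives odds of 6.75, i.e. log odds of log(6.75)
    print(log_odds_to_probability(math.log(6.75)))  # ~0.87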
But what do we mean by odds? This has exactly the same meaning as at the race track! If the odds > 1, then Y = 1 is a more probable outcome than Y = 0. If the odds = 1, then it's 50-50. If the odds < 1, then Y = 0 is more probable.

What does it look like if you plot predictors against probability instead of log odds? This is a non-linear function!!!

We have not completely abandoned the linear model – we have generalized it to something called the generalized linear model. The fundamental model for data analysis is DATA = MODEL + RESIDUAL.
Linear model: Y = α + β1X1 + β2X2 + ... + βpXp + ε
Generalized linear model (involving a function of Y): f(Y) = α + β1X1 + β2X2 + ... + βpXp + ε
The function is called the link, sometimes labelled μ. Some links:
The identity link: μ = Y (i.e. f(Y) = Y). Gives the linear model.
The logistic link: μ = log[P(Y=1)/P(Y=0)], equivalently P(Y=1) = 1/(1 + e^(-μ)) – used for binary variables. Gives the logistic regression model.
The logarithm link: μ = log Y – used for counts or frequencies. Gives the loglinear model (the 2nd half of this lecture).
There are many others! E.g., exponential, inverse Gaussian, quadratic, etc.

Standard linear regression conforms to certain assumptions – for example, that the data are unbounded (-infinity to infinity)! Datasets do not always meet these assumptions.

The logistic function is
P = 1/(1 + e^(-t))
where t can range from -infinity to infinity, but the resulting values are always between 0 and 1. The regression equation is substituted into the logistic function:
P(Y=1) = 1/(1 + e^(-(α + β1X1 + ... + βpXp)))
The predictors can range from negative infinity to infinity – the logistic function makes it such that the model only predicts values between 0 and 1.

To summarize: the generalized linear model is a very useful way to explore datasets that do not conform to the assumptions of linear regression. The idea is that the appropriate function/link allows linear techniques to be employed, even if the data are not linear. So logistic regression is the best way to predict a binary variable from other variables. Make sure your dependent variable is coded (0,1). The 0,1 coding is arbitrary – you can reverse the coding.

The beauty of it all:
No assumption of normality – binary data come from a binomial or Bernoulli distribution.
No assumption of linearity.
No assumption of homoscedasticity (equal variances) – with binomially distributed data, the variance depends on the probability or frequency. As the probability/expected frequency approaches 0 or 1, the variance approaches zero.
[Figure: binomial distribution – 100 draws of either 0 or 1, with different probabilities. Highest variability for intermediate probabilities, much less for extreme probabilities (p = .02 and .99). When p reaches 0 or 1, there is no variability.]

So what are the assumptions?
Binary outcomes which are mutually exclusive.
Independence of observations (as per usual).
Independent variables can be continuous or categorical.

How the model is fit to data: maximum likelihood estimation. We maximize the log likelihood of the data under the model parameters. For each observation (which is a 1 or 0), we have a predicted probability from the model, and we calculate the likelihood of each observation according to a Bernoulli distribution. The closer the probabilities are to the data, the higher the likelihood. This contrasts with linear regression, which (commonly) uses ordinary least squares methods, minimizing the squared deviations of the model from the data (i.e., the residuals). Parameters cannot be "solved" the way they can in linear regression – stats programs estimate the parameters using numerical methods.
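To see what maximum likelihood estimation is doing, here is a minimal Python sketch (not from the lecture – the data are simulated and the variable names are invented for illustration). It fits a one-predictor logistic regression by numerically maximizing the Bernoulli log likelihood, the same job a stats package performs:

    import numpy as np
    from scipy.optimize import minimize

    # Simulate a predictor and binary outcomes from a known model
    rng = np.random.default_rng(seed=1)
    x = rng.normal(10, 3, size=500)                  # hypothetical test scores
    p_true = 1 / (1 + np.exp(-(-1.0 + 0.2 * x)))     # true P(Y=1)
    y = (rng.random(500) < p_true).astype(float)     # observed 0/1 outcomes

    def negative_log_likelihood(params):
        alpha, beta = params
        p = 1 / (1 + np.exp(-(alpha + beta * x)))    # predicted P(Y=1)
        p = np.clip(p, 1e-12, 1 - 1e-12)             # guard against log(0)
        # Bernoulli log likelihood: log(p) where y = 1, log(1 - p) where y = 0
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    fit = minimize(negative_log_likelihood, x0=[0.0, 0.0])
    print(fit.x)   # numerical estimates of (alpha, beta), near (-1.0, 0.2)

The closer the predicted probabilities are to the observed 0s and 1s, the higher the likelihood, so the optimizer is pushed toward parameters that reproduce the data.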
From the dataset, we use the predictor variables – gender, parents' education, the mosaic test, and the visualisation test – to predict the probability of whether a student enrolled in algebra 2 (alg2).

R-squared in logistic regression
The variance depends on the proportion (the mean), and hence cannot be compared either with linear regression R² or with R² for other binary dependent variables with a different mean. It's easier to account for more variance for extreme values, since there is little variability there.
Means around .5 have very large variance: half the values are 1 and half are 0.
Means around .99 do not have much variance: 99% of the values are 1 and 1% are 0 (low variance compared to the above).
In fact, in logistic regression, R² is not calculated based on the correlation or variation accounted for at all! It is calculated based on likelihood ratios. The Cox and Snell (1989) approximation is a ratio of the log likelihood of the data under the model to the log likelihood of the data under the "null model" (no predictors). In other words, the better your model does over the null model, the higher the R² value. This does not have a maximum value of 1.0. Nagelkerke (1991) adjusts this by taking the ratio of the Cox and Snell R² to its maximum possible value.

What does it mean to correctly predict values? A value counts as correctly predicted if the value is 1 and the predicted probability is > .5, OR if the value is 0 and the predicted probability is < .5. This is a convention that SPSS uses for interpretability. Models should be compared on the Cox & Snell/Nagelkerke R².

Results: the visualisation test and parental education are significant predictors. Higher scores on the test and higher levels of parental education increase the chances of the student taking algebra 2. B is the LOG ODDS RATIO. The log odds of taking algebra 2 increase by .190 for every unit increase in the visualisation test; the log odds increase by .380 for every unit increase in parental education.

"The log odds of taking algebra 2 increase by .190 for every unit increase in the visualisation test." We exponentiate log odds to get odds; if we exponentiate .190, we get 1.21. Does this mean the odds increase by 1.21 for every unit increase in x? No! Each unit increase multiplies the odds by 1.21, so the additive change in the odds depends on where you start – the odds are exponentially related to the predictors, a nonlinear function. Likewise, the probability is linked to the predictors via the logistic function, which is also nonlinear!
A linear function: every increase in the predictor produces the same increase in the LOG ODDS.
If we take exp(log odds), we have a non-linear function: every increase in the predictor produces a different increase in the odds, depending on the level of the predictor! The same holds if we then take odds/(1 + odds) to get the probability.
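A small Python sketch of this point (the .190 slope comes from the slide above; the -1.262 intercept is borrowed from the earlier one-predictor example purely for illustration): each unit step in the predictor adds a constant .190 to the log odds and multiplies the odds by exp(.190) ≈ 1.21, but the additive changes in the odds and in the probability depend on the level of the predictor.

    import math

    alpha, beta = -1.262, 0.190   # intercept assumed for illustration

    for x in (0, 5, 15):
        lo1, lo2 = alpha + beta * x, alpha + beta * (x + 1)
        o1, o2 = math.exp(lo1), math.exp(lo2)
        p1, p2 = o1 / (1 + o1), o2 / (1 + o2)
        print(f"x={x}->{x+1}: log odds +{lo2 - lo1:.3f}, odds x{o2 / o1:.2f} "
              f"(+{o2 - o1:.3f}), probability +{p2 - p1:.3f}")

    # The log odds change (+0.190) and the odds ratio (x1.21) are constant,
    # but the odds change by +0.059, +0.153, +1.024 and the probability by
    # +0.034, +0.047, +0.025 at different levels of the predictor.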
Section 2: Loglinear models

Evaluating an association between two variables, each with several levels – e.g., position and performance level in a company, or university year and grade average. We can test for associations here using the chi-square method, e.g., performance appraisal and position in a company. In that example, there is evidence for an association between appraisal and position.

Another example: data showing infants' survival rates in two clinics (clinic A and clinic B) depending on two levels of prenatal care. Conclusion: prenatal care is associated with infant survival.
[Figure: there is an increasing trend within the red and blue groups separately, but when you collapse across them it appears as a negative trend – an instance of Simpson's paradox.]

Loglinear models are statistical models for data based on counts, especially for 3 or more categorical variables – i.e., 3-way (or higher) contingency tables. We need to investigate the frequencies and proportions in each cell of the table.

Calculating proportions requires multiplication. Let N be the number of car accidents, pM the proportion of car accidents involving males, pD the proportion of drunk-driving accidents, and FMD the frequency of car accidents involving both males and drunk driving. Then, assuming no association between gender and method of car accident, we expect
FMD = N * pM * pD

But it is difficult to deal with multiplications – we simple statistical souls prefer just to add up! Taking logarithms, FMD = N pM pD becomes
log(FMD) = log(N pM pD) = log N + log pM + log pD
Or even more simply:
log FMD = θ + λM + λD
And this looks like a linear regression predicting the log of the frequencies: θ is a parameter relating to N (just like a constant in regression), λM is a parameter dealing with gender (i.e., males compared to females), and λD is a parameter dealing with the method of car accident (i.e., drunk driving compared to other methods). Hence, loglinear!

But maybe we should not make the assumption that there is no association between gender and method? Allowing for an association adds an interaction term:
log FMD = θ + λM + λD + λMD
This is a model for a 2 X 2 table (i.e., 4 cells), and it has 4 parameters: the saturated model. You can't use any more parameters – e.g., λND and λFD are redundant. For the other cells:
log FM,ND = θ + λM
log FF,D = θ + λD
log FF,ND = θ
The question is: is the non-saturated model log FMD = θ + λM + λD an acceptable fit to the data?

Why would we want a simpler model? We want the most bang for our buck from a model's parameters. As we add more parameters to a model (predictors, interaction terms, etc.), we often get worse prediction or generalization to new data. This doesn't mean we should always accept a simple model – a simple model may fit badly! We want the simplest possible model that can also fit the data well.

The procedure: start with the saturated model – it fits the data perfectly! Remove the highest-order interaction. Is the change in fit NON-SIGNIFICANT? If so, then the simpler non-saturated model is a plausible description of the data: it fits just as well as the previous model, but with fewer parameters! If the change is significant, then the interaction is necessary for the model: without it, the model fits more poorly than the previous model. Fit is compared with the likelihood ratio statistic (G²), which has an approximate chi-squared distribution. The saturated model has zero degrees of freedom, so it does not have a probability level! Can we now remove some 2-way interactions? In the clinic example there are three 2-way interactions: clinic X survival, clinic X prenatal, survival X prenatal.

Assumptions for loglinear models:
Each case falls in one and only one cell.
Ratio of cases to variables: five times as many cases as there are cells (Tabachnick & Fidell, 2007).
Expected cell frequencies: all should be greater than one, and no more than 20% should be less than 5 (T&F).
Standardized residuals should be normally distributed, with no obvious pattern when plotted against observed values.
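The independence calculation and the G² statistic are easy to sketch in Python (the 2 X 2 counts below are made up for illustration, not taken from the lecture):

    import numpy as np
    from scipy.stats import chi2

    # Hypothetical counts: gender (rows: M, F) x drunk driving (cols: yes, no)
    observed = np.array([[30.0, 170.0],
                         [10.0, 190.0]])

    N = observed.sum()
    p_row = observed.sum(axis=1) / N         # marginal proportions (gender)
    p_col = observed.sum(axis=0) / N         # marginal proportions (method)
    expected = N * np.outer(p_row, p_col)    # F = N * pM * pD under independence

    # Likelihood ratio statistic: G^2 = 2 * sum(O * ln(O / E))
    g2 = 2.0 * np.sum(observed * np.log(observed / expected))
    df = 1                                   # (rows - 1) * (cols - 1)
    print(g2, chi2.sf(g2, df))               # here G^2 ~ 11.6, p < .001:
                                             # the no-association model fits poorly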
Summary
Logistic regression is useful when you have BINARY outcome variables (outcomes coded 0 or 1, with predicted probabilities bounded between 0 and 1). Loglinear models are useful when you have COUNTS or FREQUENCIES (a lower bound of zero and no upper bound). We can still apply linear regression here via the generalized linear model: the link function (logistic, or log) allows us to use linear models and the assumptions involved. But we need to interpret the slopes properly – the models are linear with respect to the log odds (logistic regression) or the log counts (loglinear models), but that's it.

IN THIS LECTURE, you:
Learnt methods to undertake a regression with a binary outcome variable.
Learnt how to analyse categorical variables and multiway tables.
Learnt how to undertake logistic regression and loglinear analyses.
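As a consolidated reference, the three models from this lecture written out in LaTeX (restating formulas already given above, nothing new):

    \begin{aligned}
    \text{Linear model (identity link):} \quad
      & Y = \alpha + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon \\
    \text{Logistic regression (logit link):} \quad
      & \log\frac{P(Y=1)}{P(Y=0)} = \alpha + \beta_1 X_1 + \dots + \beta_p X_p \\
    \text{Loglinear model (log link):} \quad
      & \log F = \theta + \lambda_M + \lambda_D \;(+\,\lambda_{MD}\ \text{in the saturated model})
    \end{aligned}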