Logistic and Poisson Models

Questions and Answers

Why might a researcher choose a logistic regression over a linear regression model? Explain your answer.

Logistic regression is used when the outcome variable is binary or categorical, while linear regression is appropriate for continuous outcomes. In human subjects research, we often work with categorical outcomes, meaning a linear regression is not appropriate.

How does the interpretation of a coefficient β in a logistic regression with a continuous predictor X differ from its interpretation in a linear regression model?

In logistic regression, e to the power of β represents the multiplicative change in the odds of the outcome for every one-unit increase in X, while in linear regression, β itself indicates the change in the outcome Y for every one-unit increase in X.

What key assumption regarding the relationship between the mean and variance must be checked when considering a Poisson regression, and what alternative model is suggested if this assumption is violated?

Poisson regression assumes that the mean and variance of the outcome are equal. If the outcome is overdispersed, meaning the variance is much greater than the mean, a negative binomial model may be appropriate.

In the context of a study using Zou's modified Poisson regression, why might a researcher choose this method over traditional logistic regression when modeling a binary outcome?

Zou's modified Poisson regression estimates relative risks (RRs) rather than odds ratios (ORs), and RRs are easier to interpret. It estimates RRs well even when the outcome is not rare, whereas the OR from a logistic regression approximates the RR only when the outcome is rare.

Briefly explain the function and importance of a conceptual model. Provide an example in your explanation of one way they are often used by researchers.

A conceptual model is a visual tool that represents research questions or hypotheses as relationships between measured variables. For example, researchers often draw one as a path diagram to specify the relationships they expect between variables before choosing a method.

Flashcards

What is a Link Function?

A transformation made to the outcome variable to ensure a linear relationship with the predictors.

What is Logistic Regression?

A regression model where the outcome variable is binary.

What is the logit function?

A function that transforms probabilities to a log-odds scale, used in logistic regression.

What is the Odds Ratio?

A measure of the relative odds of an event occurring given exposure, compared to non-exposure.

What is Poisson Regression?

A regression model used to model count data (i.e. number of events).

Study Notes

CHS 729 Week 8: Logistic and Poisson Models

The session covers generalized linear models, specifically logistic and Poisson regression.
The material includes aligning research questions with justified methods, using conceptual models/path diagrams, and method planning.
Final projects and midcourse evaluation debriefs fill out the agenda.

Statistical Models

These models relate an outcome (Y) to variables (X1, X2,...).
The outcome (Y) can also be termed the response or dependent variable
X-variables are referred to as predictors, explanatory variables, or independent variables.
Models take the form Y = f(X) + ϵ; f() is a function applied to predictors, and ϵ is the model's error.

Simple Linear Regression

Regresses outcome Y on one predictor X
Linear regression attempts to fit a linear function f() where Y = f(X).
The linear function is Y = mX + b, where m = slope, and b = intercept
Slope (m) represents how much Y changes for each unit change in X.

Limitations of Linear Regression

Linear regression can be unsuitable if its assumptions are not met.
Categorical outcomes render linear regression inappropriate.
Continuous outcomes, such as days of heroin use, can still violate model assumptions because of how they are distributed.
Regression tools are needed that adapt to situations where linear regression is insufficient.

Linear Model Components

Linear regression is expressed by Y = 𝛽0 + 𝛽1X1 + 𝛽2X2 + … + 𝛽nXn
Additional components often remain hidden:
Random Component: Assumed error distribution of outcome variable Y.
Link Function: A transformation of outcome Y to ensure a linear relationship with predictors (X1, X2,...).

Linear Regression Specifics

In linear regression:
The random component is a continuous, normally distributed outcome with constant variance of the error terms.
The link function is known as the identity link, with no transformation, and Y can be used as is.

Generalized Linear Model (GLM)

This model takes the form link(Y) = 𝛽0 + 𝛽1X1 + 𝛽2X2 + … + 𝛽nXn.
The link function is chosen to transform Y such that:
link(Y) follows a specific distribution.
A linear relationship exists between each predictor X and the outcome link(Y).
Understanding various regression types is vital.
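As a compact summary, the template link(Y) = 𝛽0 + 𝛽1X1 + … can be written out for the three models these notes cover. This is only a sketch in LaTeX; the symbols p (the probability that Y = 1) and \mu (the expected count) are shorthand assumed here.

```latex
\begin{aligned}
\text{Linear regression (identity link):}\quad & Y = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n \\
\text{Logistic regression (logit link):}\quad & \ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n \\
\text{Poisson regression (log link):}\quad & \ln(\mu) = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n
\end{aligned}
```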

Logistic Regression Scenario

Consider a binary (Yes/No) outcome as an example.
Setting Yes = 1 and No = 0 allows plotting outcome Y against predictor X.
A fitted straight line may not match the scale of the data, and no linear relationship may be found.
In the plotted example, the fitted line would imply that Y is negative for X values over ~650.
That is impossible, because Y is binary.

Standard Logistic Function

It is defined as f(x) = 1 / (1 + e^-x)
Values range between 0 and 1.
This is the kind of function we want: one whose values stay between 0 and 1 when our outcomes can only be 0 or 1.
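To see the 0-to-1 range concretely, the standard logistic function can be evaluated at a few inputs. This is a minimal Python sketch using only the standard library; the function name `logistic` and the chosen inputs are illustrative assumptions, not part of the course material.

```python
import math

def logistic(x: float) -> float:
    """Standard logistic function f(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

# Large negative inputs give values near 0, x = 0 gives exactly 0.5,
# and large positive inputs give values near 1.
values = {x: logistic(x) for x in (-6, -2, 0, 2, 6)}
for x, fx in values.items():
    print(f"f({x:>2}) = {fx:.4f}")
```

However extreme the input, the output never leaves the interval between 0 and 1, which is what lets it be read as a probability.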

Altering Outcomes

The function still has values between 0 and 1.
If 1 is taken to mean success, the function can be framed as the probability of success, p(Y = 1).
In the plotted example, the logistic function gives the probability of passing an exam as a function of hours studied.
The data points are either 0 (fail) or 1 (pass).
According to the model, studying for 3 hours means there is about a 62.5% chance of passing.

Logistic Function Definition

Defining a logistic function of the predictors yields p(Y = 1) = 1 / (1 + e^-(𝛽0 + 𝛽1X1 + 𝛽2X2 + …))
The important point is that the relationship between the outcome p(Y = 1) and the predictors X1, X2,... is not linear.
To solve this, the outcome is transformed with the logit function.

Logit Link Function

The logit function is the inverse of: p(Y = 1) = 1 / (1 + e^-(𝛽0 + 𝛽1X1 + 𝛽2X2 + …))
Rearranging gives: logit(p) = ln(p / (1-p)) = 𝛽0 + 𝛽1X1 + 𝛽2X2 + ...
The logit function is a link: it transforms our outcome so that there is a linear relationship between the outcome (log odds) and the predictors.
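The rearrangement can be written out step by step. The sketch below starts from the standard logistic function f(x) = 1 / (1 + e^-x) with x replaced by the linear predictor; the symbol \eta is shorthand assumed here for \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots.

```latex
\begin{aligned}
p &= \frac{1}{1 + e^{-\eta}} \\
\frac{1}{p} &= 1 + e^{-\eta} \\
\frac{1 - p}{p} &= e^{-\eta} \\
\frac{p}{1 - p} &= e^{\eta} \\
\ln\!\left(\frac{p}{1 - p}\right) &= \eta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots
\end{aligned}
```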

Log Odds Ratios

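The outcome is either 0 or 1, so p(Y=0) = 1 - p(Y=1).
p(Y=1) / (1-p(Y=1)) is therefore the same as p(Y=1) / p(Y=0).
The value p(Y=1) / p(Y=0) is the odds of the outcome (an odds ratio, by contrast, compares the odds between two groups).
p(Y=1) / p(Y=0) is thus the ratio of the probability of success to the probability of failure, where Success = 1 and Failure = 0.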

Odds

Odds represent the ratio of something happening to something not happening.
An example: if you have won 10 chess games and Bob has won 8, your odds of winning the next game are 10:8.
Odds and probability are related.
Example: 10 of 18 games were won, a probability of ≈ 0.5556.
8 of 18 games were lost, a probability of ≈ 0.4444.
Taking the ratio of the two probabilities yields the odds: (10/18) / (8/18) = 10/8 = 1.25.
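The chess arithmetic can be checked with exact fractions. This is a small Python sketch; the helper name `odds_to_probability` is an assumption introduced for illustration.

```python
from fractions import Fraction

wins, losses = 10, 8
games = wins + losses

p_win = Fraction(wins, games)     # probability of winning: 10/18
p_loss = Fraction(losses, games)  # probability of losing: 8/18
odds = p_win / p_loss             # odds of winning: 10/8

def odds_to_probability(o: Fraction) -> Fraction:
    """Convert odds back to a probability: p = odds / (1 + odds)."""
    return o / (1 + o)

print(f"p(win)  = {float(p_win):.4f}")
print(f"p(loss) = {float(p_loss):.4f}")
print(f"odds    = {odds} = {float(odds)}")
print(f"back to probability: {float(odds_to_probability(odds)):.4f}")
```

Odds and probability carry the same information; each can be recovered from the other.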

Log-Odds

Log-odds are the natural logarithm of the odds of success (where success is Y = 1).
ln(p(Y=1) / p(Y=0)) is the log of the odds.
Fitting a logistic regression produces a function of the form ln(p(Y = 1) / (1 - p(Y = 1))) = 𝛽0 + 𝛽1X1 + 𝛽2X2 + …

Logistic Regression Models

Logistic regression is the generalized linear model to fit when the outcome is dichotomous.
A logit link function is used to fit the model: logit(p) = ln(p(Y = 1) / (1 - p(Y = 1))) = 𝛽0 + 𝛽1X1 + 𝛽2X2 + ...
This establishes a relationship between each X variable and the log-odds of outcome Y.
Once the model is fit, each 𝛽 needs to be interpreted.

Interpreting Beta

Consider a simple logistic regression that looks at the link between smoking cigarettes and having heart disease.
Suppose the resulting function has the form ln(p / (1-p)) = 0.3 + 0.5X.
When X is categorical, a referent group must be defined; the other groups are compared to it.
Here, not smoking is selected as the referent, as it is the non-exposure group.

Interpreting Beta for Categorical Variable X

ln(p / (1-p)) = 0.3 + 0.5Xsmoking; in this example, smoking increases the log-odds of heart disease by 0.5.
However, log-odds may not be easily interpreted by everyone.
The effect on the odds can be found by exponentiating the 𝛽 coefficient: e^𝛽 = e^0.5 ≈ 1.65.
People who smoke have 1.65 times the odds of heart disease of those who do not smoke.

Interpreting Beta for Continuous Variable X

ln(p / (1-p)) = 0.3 + 0.5Xage; X is continuous, so let age in years be the example.
Exponentiating our 𝛽 coefficient gives e^𝛽 = e^0.5 ≈ 1.65.
Every one-unit (one-year) increase in age multiplies the odds of the outcome by about 1.65.
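The example coefficients 0.3 and 0.5 can be plugged in to check the interpretation. This is a minimal Python sketch; the function names `odds` and `probability` are assumptions for illustration. It shows that the odds are multiplied by the same factor for every one-unit step in X, while the change in probability is not constant.

```python
import math

beta0, beta1 = 0.3, 0.5  # from the example ln(p / (1-p)) = 0.3 + 0.5X

def odds(x: float) -> float:
    """Odds of the outcome at a given X: e^(beta0 + beta1 * X)."""
    return math.exp(beta0 + beta1 * x)

def probability(x: float) -> float:
    """Probability of the outcome at a given X, recovered from the odds."""
    return odds(x) / (1.0 + odds(x))

odds_ratio = math.exp(beta1)  # e^0.5, about 1.65

for x in (0, 1, 2):
    ratio = odds(x + 1) / odds(x)
    change = probability(x + 1) - probability(x)
    print(f"X {x} -> {x + 1}: odds ratio = {ratio:.4f}, change in probability = {change:.4f}")
```

For a categorical X such as smoking, X = 0 is the referent group and X = 1 the exposed group, so the same ratio compares smokers with non-smokers.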

Interpreting Beta for Multiple Predictors

ln(p / (1-p)) = 0.3 + 0.5Xage + 1.2Xgender; as in a linear regression model, the coefficients are additive on the log-odds scale.
Once exponentiated, the effects become multiplicative.
Woman is the category the gender coefficient refers to, and man is the referent.
For a woman who is one year older, compared to a man one year younger, the odds ratios multiply: e^(𝛽age+𝛽gender) = e^(0.5+1.2) = e^0.5 * e^1.2 ≈ 1.65 * 3.32 ≈ 5.47.
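The additive-to-multiplicative step can be verified numerically. This is a short Python sketch using the coefficients from the example; the variable names are illustrative assumptions.

```python
import math

beta_age, beta_gender = 0.5, 1.2  # from ln(p / (1-p)) = 0.3 + 0.5Xage + 1.2Xgender

or_age = math.exp(beta_age)        # odds ratio for one extra year of age
or_gender = math.exp(beta_gender)  # odds ratio for women versus men (referent)

# Adding coefficients on the log-odds scale is the same as
# multiplying odds ratios on the odds scale.
combined = math.exp(beta_age + beta_gender)

print(f"OR age      = {or_age:.2f}")
print(f"OR gender   = {or_gender:.2f}")
print(f"combined OR = {combined:.2f} (product = {or_age * or_gender:.2f})")
```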

Odds Ratio vs Risk Ratio

We commonly use the odds ratio, as it can be calculated from the data.
However, the odds ratio is notoriously hard to understand.
People often want to know how likely an outcome is for certain groups, such as groups defined by a variable like age.
What people are generally interested in is the risk ratio (or relative risk).

Risk vs Odds Definitions

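A 2x2 table is valuable for conceptualizing the difference between the relative risk and the odds ratio. Let a and b be the numbers with and without the event in the treatment group, and c and d the numbers with and without the event in the control group.
Relative risk (RR) is (risk of event in treatment) / (risk of event in control): [a/(a + b)] / [c/(c + d)] = (ac + ad) / (ac + bc).
The odds ratio (OR) is (odds of event in treatment) / (odds of event in control): (a/b) / (c/d) = ad/bc.
RR is commonly the quantity of interest because we want to know how likely the event is in a specific group.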

The Odds Ratio

The OR is a reasonable approximation of the RR if the outcome is rare.
If the outcome is rare, then in our table a (events in the treatment group) and c (events in the control group) are both small relative to b and d.
With a and c both small, the product ac is very small compared with ad and bc.
Since RR = (ac+ad)/(ac+bc) and OR = ad/bc, dropping the small ac term from the RR leaves approximately ad/bc, so the two values are close to each other.
If our outcome is rare (fewer than about 5% of observations), the OR is quite similar to the quantity we care about, the RR.
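The rare-outcome approximation can be illustrated with two made-up 2x2 tables. This is a Python sketch under stated assumptions: a and b are the counts with and without the event in the treatment group, c and d the same counts in the control group, and all counts below are hypothetical.

```python
def relative_risk(a: int, b: int, c: int, d: int) -> float:
    """RR = [a / (a + b)] / [c / (c + d)]."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """OR = (a / b) / (c / d) = ad / bc."""
    return (a / b) / (c / d)

# Hypothetical counts (a, b, c, d), chosen only for illustration.
rare = (20, 980, 10, 990)       # event in 2% of treated, 1% of controls
common = (400, 600, 200, 800)   # event in 40% of treated, 20% of controls

for label, table in (("rare", rare), ("common", common)):
    print(f"{label:>6}: RR = {relative_risk(*table):.3f}, OR = {odds_ratio(*table):.3f}")
```

Both tables have the same relative risk, but only in the rare-outcome table does the odds ratio stay close to it.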

Logistic Regression

It is a generalized linear model for a binary outcome.
Using a logit link function transforms the outcome from the probability scale to the log-odds scale.
The result is a function of the form ln(p / (1-p)) = 𝛽0 + 𝛽1X1 + 𝛽2X2 + ...
Exponentiating a predictor’s beta coefficient yields its odds ratio: ORx = e^𝛽

Ordinary Least Squares

Ordinary least squares (OLS) is used to determine the best-fitting line in linear regression.
This method does not work for logistic regression.
Maximum likelihood is used instead.
Both approaches define a measure of fit and choose the coefficients that optimize it: OLS minimizes the sum of squared residuals, while maximum likelihood maximizes the likelihood of the observed data.
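To make the idea of maximum likelihood concrete, the sketch below fits a one-predictor logistic regression by repeatedly nudging the coefficients in the direction that raises the log-likelihood. Everything here is an assumption for illustration: the data are made up, the plain gradient-ascent loop is chosen for readability, and it is not meant to represent how R's glm() fits the model internally.

```python
import math

# Hypothetical data: hours studied (x) and whether the exam was passed (y).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

def predict(b0: float, b1: float, x: float) -> float:
    """Modelled probability p(Y = 1) at x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def log_likelihood(b0: float, b1: float) -> float:
    """Log of the probability of the observed 0/1 outcomes under the model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(b0, b1, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit(steps: int = 20000, rate: float = 0.02) -> tuple[float, float]:
    """Gradient ascent on the log-likelihood, starting from b0 = b1 = 0."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        residuals = [y - predict(b0, b1, x) for x, y in zip(xs, ys)]
        b0 += rate * sum(residuals)                              # d(logL)/d(b0)
        b1 += rate * sum(r * x for r, x in zip(residuals, xs))   # d(logL)/d(b1)
    return b0, b1

b0_hat, b1_hat = fit()
print(f"log-likelihood at start:  {log_likelihood(0.0, 0.0):.3f}")
print(f"log-likelihood after fit: {log_likelihood(b0_hat, b1_hat):.3f}")
```

The fitted coefficients are the ones under which the observed passes and fails are most probable, which is the sense in which the likelihood is maximized.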

The Good News

What matters in practice is understanding when to use a logistic regression and how to run it in R.
The assumptions that come with logistic regression are more relaxed than those of linear regression:
A binary outcome
Independent observations
No multicollinearity among predictors
No extreme outliers
A linear relationship between the predictors and logit(Y)
Rule of 10 (at least 10 observations of the least frequent category for each predictor)

Logistic Regression in R Example

The example uses R's built-in state.x77 data set.
The question is how state-level illiteracy rate, state-level murder rate, and state-level graduation rate relate to a state's life expectancy.
The outcome was dichotomized: 0 = life expectancy is < 71 years, 1 = life expectancy is equal to or above 71 years.
In the example, the dichotomized outcome is named "seventyone".

Running Logistic Regression in R Mechanics

The glm() function is used, and the call looks much like one for linear regression.
The formula and the data are specified.
An additional argument, family, specifies the random component and the link function.
binomial describes the structure of the outcome data, while logit indicates that the outcome is transformed with the logit function.

Regression Output

The output looks similar to that of a linear regression.
The beta coefficients are in the "Estimate" column.
Standard errors, z-values, and p-values are also reported.
Each coefficient describes the relationship between an X and the log-odds of Y.

Running Logistic Regression in R: Outputting

The results can be stored as:
- ORs <- exp(coef(model))[-1]
- Confint <- exp(confint(model))[-1,]
- P_Val <- coef(summary(model))[-1,4]
The [-1] index drops the intercept, so only the predictors are kept.
The variable names are given as c(“Illiteracy”, “Murder Rate”, “Highschool.
The resulting data.frame consists of: variable_names, ORs, Confint, P_Val.

Writing Results to CSV

The results are written to a file named "logRegression.csv".
For every one-unit increase in the high school graduation rate, the odds of a life expectancy of at least 71 years are multiplied by 1.29, a 29% increase.

Adjusted Odds Ratio

"Adjusted" indicates that multiple predictors were included in the model.
Here, the odds ratio for the high school graduation rate is adjusted for illiteracy and murder rate.

Modelling Count Data with a Poisson Regression

The focus now moves to modelling count data.

Count Data Types

Numeric outcome.
"Over a given period of time, how often does Y happen?"
Number of days an individual binge drinks
Number of deaths that occurred over a year
Number of cigarettes someone smokes

Key Points Around Count Data

When counts are typically greater than 10, it is usually acceptable to assume normality and use linear regression.
That assumption cannot be made when counts are typically below 10.

Defining Poisson

One value defines a Poisson distribution: λ.
λ is both the expected value and the variance.
The distribution becomes similar to a normal distribution when λ is greater than or equal to 10.
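The claim that a single value λ sets both the mean and the variance can be checked directly from the Poisson probability formula, P(Y = k) = e^-λ λ^k / k!. This is a minimal Python sketch; the function names and the cut-off of 100 terms are assumptions for illustration.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(Y = k) for a Poisson distribution with parameter lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def mean_and_variance(lam: float, max_k: int = 100) -> tuple[float, float]:
    """Mean and variance computed by summing over k = 0..max_k."""
    probs = [poisson_pmf(k, lam) for k in range(max_k + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, variance

for lam in (2, 10):
    mean, variance = mean_and_variance(lam)
    print(f"lambda = {lam}: mean = {mean:.4f}, variance = {variance:.4f}")
```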

Identifying A Poisson Distribution

A Poisson-distributed variable can be identified by plotting it with the hist() function.
There will be a high concentration of values close to 0 and a right skew.

Generalized Linear Model for Poisson

ln(μ): We can use the log link if we let μ be the expected number of events for a given observation.
Taking the logarithm of a right-skewed variable makes its distribution more normal.
The resulting function has the form: ln(μ) = 𝛽0 + 𝛽1X1 + 𝛽2X2 + ...
For the random component, a Poisson error distribution is assumed.

Overdose Death Example

The histogram shows the total number of overdose (OD) deaths in 2018, measured at the county level.
A large share of the values falls close to 0.
A natural question is how poverty rates and adult educational attainment relate to the number of overdose deaths.

Running a Poisson Regression

Running a Poisson regression is quite similar to running a logistic regression.
The glm() function is used again.

Changes in Families

What changes is the family argument, which is set to "poisson", the name of the family.
When the number of people in each county or region matters, as it does in the "OD deaths" example, the model needs to account for it.

Adjusting

An offset is required so that the model accounts for the total population of each county.
The model is fit to the data from every county, with family set to poisson.
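One common way to write a population offset is shown below. This is a sketch under stated assumptions: N stands for a county's population (a symbol introduced here) and \mu for its expected number of deaths. The offset enters the model as ln(N) with its coefficient fixed at 1, which turns a model for counts into a model for rates.

```latex
\begin{aligned}
\ln(\mu) &= \ln(N) + \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots \\
\ln\!\left(\frac{\mu}{N}\right) &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots
\end{aligned}
```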

Results

The model is on the log scale, so the coefficients cannot be read directly as changes in the count.
The fitted function has the form ln(μ) = 𝛽0 + 𝛽1X1 + 𝛽2X2 + ..., where μ is the expected number of events.
To describe the effect of a predictor, its 𝛽 coefficient needs to be exponentiated.
This is the same step used with logistic regression.

Incidence Rate Ratio

The exponentiated coefficient, e^𝛽, is the incidence rate ratio (IRR): the multiplicative change in the expected rate of events for a one-unit increase in the predictor.
In the example, all of the predictors have a significant relationship with the outcome once their coefficients are interpreted this way.
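The incidence rate ratio can be illustrated with a toy model. This is a Python sketch under loud assumptions: the coefficients, the predictor name `poverty_rate`, and the populations are all hypothetical, and the model includes a log-population offset so that it describes a rate.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
beta0, beta_poverty = -9.0, 0.04

def expected_deaths(poverty_rate: float, population: int) -> float:
    """Expected count from ln(mu) = ln(population) + beta0 + beta_poverty * poverty_rate."""
    return math.exp(math.log(population) + beta0 + beta_poverty * poverty_rate)

irr = math.exp(beta_poverty)  # incidence rate ratio for a one-unit increase

small = expected_deaths(15, 50_000)
small_plus_one = expected_deaths(16, 50_000)
large = expected_deaths(15, 100_000)

print(f"IRR = {irr:.4f}")
print(f"ratio of expected deaths, poverty 16 vs 15: {small_plus_one / small:.4f}")
print(f"doubling the population doubles the expected count: {large / small:.4f}")
```

The ratio for a one-unit increase equals e^𝛽 regardless of the population, while the offset makes the expected count scale with the number of people at risk.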

Poisson Assumption

The main assumption is that the mean and the variance of the outcome are equal.
Often the outcome is overdispersed, meaning the variance is greater than the mean.
In that case, a negative binomial model is more appropriate.
Comparing the fitted model against the null model gives a sense of how closely the model fits the points.
When the spread of the counts is much wider than a Poisson model allows, the negative binomial model is needed.
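A quick descriptive check of the mean-variance assumption is to compare the sample variance of the counts with their sample mean. This is a rough Python sketch with made-up counts; the helper name `dispersion_ratio` is an assumption, and the ratio is only a first look at the raw outcome, not a formal test of the fitted model.

```python
from statistics import mean, variance

def dispersion_ratio(counts: list[int]) -> float:
    """Sample variance divided by sample mean; about 1 is consistent with Poisson."""
    return variance(counts) / mean(counts)

# Hypothetical count outcomes, chosen only for illustration.
roughly_poisson = [2, 3, 1, 4, 2, 0, 2, 5, 1, 4]
overdispersed = [0, 0, 1, 0, 12, 0, 2, 25, 0, 1]

for label, counts in (("roughly Poisson", roughly_poisson), ("overdispersed", overdispersed)):
    print(f"{label:>15}: mean = {mean(counts):.2f}, "
          f"variance = {variance(counts):.2f}, ratio = {dispersion_ratio(counts):.2f}")
```

A ratio far above 1, as in the second list, is the pattern that points toward a negative binomial model.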

Running a Negative Binomial Model

The glm.nb() function (from the MASS package) fits a negative binomial model and is used much like glm().
The formula and the data are specified in the same way, and the log link is used again.
The deviance indicates how closely the model fits the data.
Results from the negative binomial (NB) model are interpreted the same way as those from a Poisson model.

Putting in the Results

The output is commonly written to a CSV file.
When ratios are presented, keep in mind that the model also uses data that is not shown in the results table.
The main points to take from the results are the relationships between the predictors, such as the high school graduation rate, and the number of deaths.

Modified Poisson

It is used as an alternative to logistic regression for binary outcomes.

Zou's Modified Poisson

Odds ratios are hard to interpret.
The results are reported as relative risks (RRs) instead.
RRs are estimated well even when the outcome is not rare.

Steps

Fit a Poisson regression to the binary outcome.
Use a "sandwich" (robust) estimator for the standard errors.

Implementation

The code is nearly the same as for a Poisson regression and is easy to use.
A simple version of the code is better.
The only addition is the code for the sandwich estimator.

Zou's Modified Poisson

The model uses a log link, so the exponentiated coefficients can be read as relative risks.
Unlike the odds ratio from a logistic regression, this does not depend on the outcome being rare.
