Introduction to Statistics-General Linear Model PDF
Document Details
King's College London
2024
Dr Moritz Herle
Summary
This presentation from King's College London details the general linear model, explaining its wide applicability in statistical testing. It covers statistical tests and methods such as t-tests, ANOVA, correlation, and regression. The presenter also introduces the concept of multiple regression and explores several of its applications in statistics.
Full Transcript
IoPPN, 5th November 2024. Dr Moritz Herle, Introduction to Statistics, SGDP Centre.

The general linear model

The big secret: almost all common statistical tests are the same underlying model, the general linear model.

The general linear model: a statistical model with wide applicability. All of the following are just versions of the general linear model:
- t-tests (and their non-parametric equivalents, like Wilcoxon and Mann-Whitney U)
- ANOVA, ANCOVA
- MANOVA, MANCOVA
- Correlation (Pearson and Spearman)
- Linear regression (ordinary least squares regression), multiple regression
- Goodness-of-fit tests (e.g. the chi-square test)
- Various kinds of "machine learning" and prediction modelling
- And more

All of these express relations between variables, e.g.: What is the relation between the test score and the grouping variable? What is the relation between the score on the pre-test and that on the post-test? t-test, ANOVA, correlation, regression, etc.: all the general linear model.

The general linear model equation
Estimate the dependent variable from other variables, using a straight line:
Ŷᵢ = b₀ + b₁Xᵢ + εᵢ
- Ŷᵢ: an estimate of the ith observation of the outcome, Y (that thing over Y is called a "hat"; Ŷ is read "Y-hat")
- b₀: the intercept of the regression line
- b₁: the slope of the regression line
- Xᵢ: the ith observation of the predictor, X
- εᵢ: a residual error term

Correlation
A standardized measure of the linear relation between two variables. It can take on values (called r) from -1.00 to +1.00. Just by looking at the graph, what do you think is the correlation between X & Y? In R, you can use the cor.test() function.
Magnusson, K. (2023). Interpreting Correlations: An interactive visualization (Version 0.7.1) [Web App]. R Psychologist. https://rpsychologist.com/correlation/

Correlations
r = 0.2, r = -0.7, r = 0.99. What would r = 1 look like?

Beware Anscombe's Quartet!
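The slides point to R's cor.test() for correlations. As a language-neutral sketch of what r measures, here is a minimal Pearson-r computation in Python using only the standard library (the function name and data are invented for illustration):

```python
import math

def pearson_r(x, y):
    """Pearson correlation: a standardized linear relation, always in [-1, +1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Answering the slide's question: perfectly linear data gives r = 1
x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2 * a + 1 for a in x]))  # ≈ 1.0
```

Because r only captures the linear relation, very different scatterplots can share the same r, which is exactly the warning of Anscombe's Quartet.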
The general linear model
Two continuous variables (X & Y) in a sample of 500 participants. X: mean = 20, SD = 5. Y: mean = 20, SD = 3. Predict Ŷ from X. The general linear model tries to fit a line through the datapoints that is as close to them as possible.

The general linear model
It finds this line of best fit by minimizing the (squared) distance between the line and each of the points. Hence this is sometimes called "ordinary least squares regression". By default, its estimates come out unstandardized, i.e. in the units of the original variables. In R, you can use the lm() function, which stands for "linear model".

The general linear model
Ŷᵢ = b₀ + b₁Xᵢ + εᵢ
b₀ = intercept, b₁ = slope
Here: Ŷᵢ = 17.6 + 0.12*Xᵢ

The general linear model equation
Now we can calculate the predicted value Ŷᵢ for participant i, if we know the value of X for participant i:
Ŷᵢ = b₀ + b₁Xᵢ
Ŷᵢ = 17.6 + 0.12*Xᵢ
If Xᵢ = 5, then Ŷᵢ = 17.6 + 0.12*5 = 18.2
However, the predicted Ŷᵢ will generally differ from the observed Yᵢ.
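Plugging a participant's X value into the fitted line can be scripted directly. A tiny Python sketch using the slide's estimates (b0 = 17.6, b1 = 0.12; the function name is invented for illustration):

```python
def predict(x, b0=17.6, b1=0.12):
    """Predicted outcome Ŷ for a predictor value x, using the slide's fitted line."""
    return b0 + b1 * x

print(round(predict(5), 1))  # 18.2, matching the worked example on the slide
```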
The general linear model
A residual error term: the difference between the observed Yᵢ and the predicted Ŷᵢ, i.e. Yᵢ - Ŷᵢ. The estimated regression coefficients (b₀ and b₁) are those that minimise the sum of the squared residuals, Σ(Yᵢ - Ŷᵢ)².

The general linear model: ordinary least squares regression
The estimated regression coefficients (b₀ and b₁) are those that minimise the sum of the squared residuals:
1. Take the distance between a datapoint and the fitted line.
2. Square that distance.
3. Repeat for all datapoints, and sum up all these squared surfaces.
4. Find the line for which this combined surface is the smallest.
Navarro (2018), p.46

The general linear model equation (recap)
Ŷᵢ = b₀ + b₁Xᵢ + εᵢ: an estimate Ŷᵢ of the ith observation of the outcome Y, from the intercept of the regression line (b₀), the slope of the regression line (b₁), the ith observation of the predictor X, and a residual error term (εᵢ).

The correlation test calculates a t-value as its test statistic. It's almost as if the tests are related... In fact, the significance test for a correlation is a t-test.

Brief detour on the word "predict"
Different things scientists mean by the word "predict":
1. A correlation (you can predict one value, to some extent above chance, from the value of another);
2. A longitudinal relation (you can predict a value at time x from a value at time x-1); e.g. childhood trauma predicts later psychiatric problems; e.g. for genetic scores, a prediction of traits could (in theory) be made before birth;
3. A hypothesis suggested by a theory (a theory makes predictions that can then be tested using statistics);
4. The outcome of the hypothesis itself ("...we predict that X will be related to Y...");
5. A prediction model, using a lot of variables (100s) to explain the greatest amount of variance in an outcome (e.g. a polygenic risk score);
6. Saying something about one dataset from what you know about another dataset (you can train a predictor in one dataset from data in another); e.g.
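The four least-squares steps above have a closed-form solution for a single predictor, and the slide's point that a correlation's significance test is a t-test corresponds to the standard conversion t = r·sqrt(n-2)/sqrt(1-r²). A Python sketch (function names and data invented for illustration):

```python
import math

def ols_fit(x, y):
    """Closed-form simple OLS: the (b0, b1) that minimise the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b1 * mx, b1

def r_to_t(r, n):
    """t statistic (df = n - 2) for testing a Pearson correlation r in a sample of size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Noise-free data on the line y = 2 + 3x is recovered exactly
b0, b1 = ols_fit([0, 1, 2, 3], [2, 5, 8, 11])
print(round(b0, 6), round(b1, 6))  # 2.0 3.0
print(round(r_to_t(0.5, 27), 2))   # ≈ 2.89
```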
Netflix predicts what film you want to watch next from what it has learned from all the other Netflix customers in your demographic.

Reading the R output
Fitted model from the output: Ŷ = 18.81 + (-0.018)*… + 2.46…
P-values: R shows very small p-values in scientific notation. An exponent of "-16" means 16 decimal places to the right of zero, i.e. 0.00000000000000002; "-7" means 7, i.e. 0.0000000138.
R-squared: the square of the correlation between predictors and outcome (here the standardized estimate is -0.176, so R² ≈ 0.03). How much variance does the predictor explain?

Standardized vs. unstandardized
Which is better to report?
- "For every 1-minute difference in average exercise per day, there's a 0.017 difference in BMI" (unstandardized), OR
- "For every 1-SD difference in average exercise per day, there's a 0.176-SD difference in BMI" (standardized), OR
- "Average minutes of exercise per day explains 3% of the variance in BMI" (standardized, R²).
It depends on the context: unstandardized estimates are more intuitive, but can't easily be compared across different kinds of measurements; standardized estimates are less concrete, but can be compared across different measurements.

What if you have several predictors?
Physical activity is an "okay" predictor of BMI. How about we add more information to the model, to try to improve our prediction?
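The unstandardized-to-standardized conversion is just a rescaling by the two standard deviations: beta = b·SD(X)/SD(Y). A Python sketch; the slope matches the output's -0.018, but the SDs here are invented for illustration (the slide does not report them):

```python
def standardize_slope(b_unstd, sd_x, sd_y):
    """Convert an unstandardized slope (original units) to a standardized one (SD units)."""
    return b_unstd * sd_x / sd_y

# Hypothetical SDs: 30 min/day for exercise, 3 BMI units
beta = standardize_slope(-0.018, sd_x=30.0, sd_y=3.0)
print(round(beta, 3))       # -0.18
print(round(beta ** 2, 3))  # 0.032: variance explained (R², single-predictor case)
```

With one predictor, squaring the standardized slope gives R², which is why a standardized estimate of about 0.176 corresponds to roughly 3% of variance explained.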
We need a multiple regression model:
Ŷᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ + εᵢ
We can go on adding as many predictors as we like. Or maybe: as many as it is sensible to add.
Ŷᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ + b₃X₃ᵢ + … + εᵢ

Multiple regression model
With two predictors, instead of a single line we are now trying to fit a plane in 3D space which minimises the distances to the datapoints. Example: X = hours the baby is sleeping, Z = hours Dan is sleeping, Y = how grumpy Dan is. Just another example: X = weight, Z = horsepower, Y = mileage.

Multiple regression: confounders (aka covariates)
Predicting Y using X. However, sometimes the presence of a confounder (Z) can bias the estimate of interest. A confounder is a third variable that is associated with both X and Y.

Multiple regression: mediators
A mediator (M) is any variable that is caused by X and then in turn causes Y. It is usually not included in your regression model (in contrast to confounders).

Directed Acyclic Graphs (DAGs)
Use DAGs to visualise your theoretical causal model between exposure and outcome of interest, including all confounders (C) and mediators (M). X = smoking, Y = lung cancer.

Directed Acyclic Graphs (DAGs): add confounders
X = smoking, Y = lung cancer; C1 = gender, C2 = education, C3 = alcohol. A confounder is any other variable that is causing both X and Y.
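With two predictors the coefficients still come from least squares, now via the normal equations (XᵀX)b = Xᵀy. A self-contained Python sketch fitting the plane Ŷ = b0 + b1·X1 + b2·X2, with invented, noise-free data and a tiny Gaussian-elimination solver (all names are illustrative):

```python
def solve3(A, v):
    """Solve a 3x3 linear system A·b = v by Gauss-Jordan elimination with partial pivoting."""
    A = [row[:] + [rhs] for row, rhs in zip(A, v)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        for r in range(3):
            if r != i:
                f = A[r][i] / A[i][i]
                A[r] = [a - f * b for a, b in zip(A[r], A[i])]
    return [A[i][3] / A[i][i] for i in range(3)]

def multiple_ols(x1, x2, y):
    """Least-squares (b0, b1, b2) for Ŷ = b0 + b1*x1 + b2*x2 via the normal equations."""
    X = [[1.0, a, b] for a, b in zip(x1, x2)]
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(3)]
    return solve3(XtX, Xty)

# The plane y = 1 + 2*x1 - 0.5*x2 is recovered exactly from noise-free data
x1 = [0, 1, 2, 3, 4, 5]
x2 = [1, 0, 2, 1, 3, 2]
y = [1 + 2 * a - 0.5 * b for a, b in zip(x1, x2)]
print([round(c, 6) for c in multiple_ols(x1, x2, y)])  # ≈ [1.0, 2.0, -0.5]
```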
Directed Acyclic Graphs (DAGs): add mediators
X = smoking, Y = lung cancer; C1 = gender, C2 = education, C3 = alcohol; M1 = physical activity, M2 = BMI. A mediator (M) is any variable that is caused by X and then in turn causes Y.

Draw a DAG!
Tennant PWG, Murray EJ, Arnold KF, et al. Use of directed acyclic graphs (DAGs) to identify confounders in applied health research: review and recommendations. Int J Epidemiol. 2021;50(2):620-632. doi:10.1093/ije/dyaa213. https://academic.oup.com/ije/article/50/2/620/6012812

Multiple regression example
Can we use physical activity to predict BMI values at 12 years? X = physical activity, Y = BMI; C1 = gender, C2 = maternal age at birth, C3 = weight of mother (genetics?).
BMI ~ PhysicalAct + Gender + Maternal Age at birth + Weight of mother

Multiple regression example
Think – Pair – Share: how would you interpret these results, and why?

(Multiple) regression assumptions
- Normality (of residuals): if you were to plot the residuals, you would see a normal distribution
- Linearity: associations between X and Y are linear, i.e. constant
- Homogeneity of variance (of residuals)
- Uncorrelated predictors: no collinearity
- Uncorrelated residuals: no effect of another unmeasured variable
- No highly influential outliers
See Navarro (2018), pp.474-4

Multiple regression isn't magic!
"Adjusting for", "controlling for", "partialling out", etc.
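The DAG's key warning, that a confounder can manufacture an X-Y association, can be shown numerically. A Python sketch with invented, deterministic data: Z causes both X and Y, X has no effect on Y at all, yet X and Y correlate strongly (all names and numbers are illustrative):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Confounder Z drives both X and Y; X itself has no effect on Y.
z = [0, 1, 2, 3, 4, 5, 6, 7]
noise_x = [0.3, -0.3, 0.3, -0.3, 0.3, -0.3, 0.3, -0.3]
noise_y = [0.2, 0.2, -0.2, -0.2, 0.2, 0.2, -0.2, -0.2]
x = [a + e for a, e in zip(z, noise_x)]       # X caused by Z (plus noise)
y = [2 * a + e for a, e in zip(z, noise_y)]   # Y caused by Z only

print(round(pearson_r(x, y), 2))              # ≈ 0.99: strong "spurious" X-Y correlation
# Holding Z fixed, only the noise terms remain, and they are unrelated:
print(round(pearson_r(noise_x, noise_y), 2))  # 0.0
```

This is why the example model adjusts BMI ~ physical activity for variables like maternal age and weight: without them, part of the estimated slope could be confounding rather than effect.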
all refer to the same process: having multiple variables in your regression model. If you control outcome Y for predictor variable X, and then check the association between variable Z and outcome Y, you're asking: "what would the Y~Z relation be in a sample where everyone had the average level of X?"

But it's not magic. Predictions from regression models, even "controlled" ones, don't suddenly make associations causal. It all depends on where your data came from: if they're from a randomised experiment, causal conclusions might be justified; if they're from an observational study, probably not.

Summary
Most of the commonly used statistical tests are really the same underlying model, the general linear model, which is all about drawing a straight line through datapoints, attempting to minimise the residual errors. Across the different outputs from R, recurrent pieces of information appear across all the different tests we've discussed. Both unstandardized and standardized estimates are useful, depending on the context.

Read on: Chapter 15 in Navarro (2018), https://learningstatisticswithr.com/lsr-0.6.pdf
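"Controlling for X" can be made concrete with residuals (the Frisch-Waugh idea): remove the straight-line effect of X from both Y and Z, then relate what is left; in effect, everyone is set to the average level of X. A minimal Python sketch with invented, noise-free data (function names are illustrative):

```python
def ols_fit(x, y):
    """Simple OLS intercept and slope."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b1 * mx, b1

def residuals(x, y):
    """What is left of y after the straight-line effect of x is removed."""
    b0, b1 = ols_fit(x, y)
    return [b - (b0 + b1 * a) for a, b in zip(x, y)]

# Hypothetical data where Y depends on both X (slope 2) and Z (slope 3)
x = [0, 1, 2, 3, 4, 5, 6, 7]
z = [1, 0, 2, 1, 3, 2, 4, 3]
y = [2 * a + 3 * c for a, c in zip(x, z)]

# Regressing Y-residuals on Z-residuals (both controlled for X) recovers Z's own effect
rz = residuals(x, z)
ry = residuals(x, y)
_, slope = ols_fit(rz, ry)
print(round(slope, 6))  # 3.0
```

The slope of the residual-on-residual regression equals the partial coefficient of Z in the multiple regression Y ~ X + Z, which is exactly what "adjusting for X" delivers; as the slide stresses, this adjustment still cannot turn observational associations into causal ones.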