Summary

This document explains simple linear regression, including concepts like dependent and independent variables, intercept, slope, and residuals. It also covers interactions, inference about population parameters, the Coefficient of Determination (R²), F tests and model comparisons, clustered data and the intra-class correlation coefficient (ICC), and multilevel models with random intercepts and slopes.

Full Transcript

Simple Linear Regression

Simple linear regression examines the relationship between two continuous variables by fitting a straight line to the data points:

y = b0 + b1·x + ε

y: Dependent variable (outcome)
x: Independent variable (predictor)
b0: Intercept (value of y when x = 0)
b1: Slope (change in y for a one-unit increase in x)
ε: Residuals, the error term (differences between observed values and the values predicted by the model; they help assess the model’s accuracy)

Interactions

The relationship of interest depends on the level of some other variable. In a model such as y = b0 + b1·x1 + b2·x2 + b3·(x1·x2) + ε, “the effect of x1 on y” is now (b1 + b3·x2), i.e. “some number b1, plus some number b3 that changes depending on x2”.

Inference

Judgements about the parameters (measurable factors) of a population.

Coefficient of Determination - R²

The quality of our overall model: it considers the total variance our model accounts for.
★ Quantifies the amount of variability in the outcome that is accounted for by the predictors.
★ The more variance accounted for, the better.
Represents the extent to which the prediction of y is improved when predictions are based on the linear relationship between x and y. It is a goodness-of-fit measure (how well the line approximates our data) and gives the percentage of variance explained by the predictors together.

Total Sum of Squares: the squared distance of each data point from the mean of y (ȳ). Without any other information, our best guess at the value of y for any person is the mean.
Residual Sum of Squares: the squared distance of each data point from its predicted value.
Model Sum of Squares: the deviation of the predicted scores from the mean of the outcome variable, ȳ.
R² = SS_Model / SS_Total = 1 − SS_Residual / SS_Total.

F Tests

Test the significance of the overall model - tests of individual predictors do not tell us this. They do so by testing the statistical significance of the F-ratio (a test statistic).

Model Comparisons

For linear models, model comparisons typically test the reduction in residual sums of squares (via an F test). For models with other error distributions, we often test differences in log-likelihood (via a χ² test). We can test everything in the model all at once by comparing it to a ‘null model’ with no predictors. Recall that a categorical predictor with k levels involves fitting k−1 coefficients; we can test “are there differences in group means?” by testing the reduction in residual sums of squares resulting from the inclusion of all k−1 coefficients at once (an R sketch follows below).

ANOVA/ANCOVA

Type 1 (sequential) = tests the addition of each variable entered into the model, in order.
Type 3 = tests the addition of each variable as if it were the last one entered into the model.
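As a rough illustration of the model comparisons above (not from the original notes), the sketch below uses a hypothetical data frame dat with an outcome y and predictors x1 and x2, all invented names:

# hypothetical data, for illustration only
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 2 + 0.5 * dat$x1 + 0.3 * dat$x2 + rnorm(100)

null_model <- lm(y ~ 1, data = dat)        # intercept-only 'null model'
full_model <- lm(y ~ x1 * x2, data = dat)  # includes the x1:x2 interaction

summary(full_model)            # coefficients, R-squared, and the overall F test
anova(null_model, full_model)  # F test of the reduction in residual sums of squares
anova(full_model)              # Type 1 (sequential) tests of each term in order
# car::Anova(full_model, type = 3)  # Type 3 tests (requires the car package)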
Group Structured Data

The people within my study will vary; there is variation, and I want to know what the average person looks like.

Clustered Data
Children within schools. Patients within clinics. Observations within individuals.

Clusters of Clusters
Children within classrooms within schools within districts. Patients within doctors within wards within clinics. Time-periods within trials within individuals.

Measurements on observational units within a given cluster are often more similar to each other than to those in other clusters. Our measure of academic performance for children in a given class will tend to be more similar to one another (because of class-specific things such as the teacher) than to children in other classes. Clustering is something systematic that our model should take into account. If it is ignored, standard errors will often be smaller than they should be, meaning that:
💺 Confidence intervals = too narrow.
🚢 T statistics = too large.
🤏 P values = misleadingly small.

Quantifying Clustering

Clustering can be expressed in terms of the expected correlation among the measurements within the same cluster.

Intra-Class Correlation Coefficient (ICC)
There are various formulations of the ICC, but the basic principle is the ratio of between-group variance to total variance. It is the expected correlation between two randomly drawn observations from the same group.

library(ICC)
ICCbare(x = dwelling, y = lifesat, data = d3)

The ICC ranges from no correlation (ICC = 0) to complete correlation (ICC = 1).
ICC = 1: all participants in a cluster are likely to have exactly the same outcome; sampling one participant from that cluster is as informative as sampling the whole cluster.
ICC = 0: participants in a cluster behave essentially independently of each other, and their outcomes are no more related than if they were from different clusters.
(In a reliability context we want the ICC to be as close to 1 as possible; around 0.9 indicates excellent reliability.)
A high ICC indicates that clusters differ substantially from each other, so the grouping (e.g., schools, participants, tasks) matters. A low ICC means the clustering contributes little to the variation in the outcome; if the ICC is near zero, a simpler model without random effects might suffice. If our ICC is lower, there is less tight clustering in our data (because there is a reasonable amount of variance within clusters), so a multilevel approach might not be as crucial.

Wide vs Long Data

Wide - observations are spread across columns.
Long - each observation of the outcome is a separate row.

Long data computations by group:

library(dplyr)
longd %>%
  group_by(ID) %>%
  summarise(ntrials = n_distinct(trial),
            meanscore = mean(score),
            sdscore = sd(score))

Modelling Clustered Data

Complete Pooling: pools all information together to give a universal overall estimate, ignoring groups.
No Pooling: completely partitions out cluster differences in average y. Treats every cluster as an independent entity and estimates separate models for each group. Data from cluster i contributes to the slope for cluster i, but nothing else. Information from Glasgow, Edinburgh, Perth, Dundee etc. doesn’t influence what we think about Stirling. Prevents us from studying cluster-level effects.
Partial Pooling: though each group is unique, having been sampled from the same population, all groups are connected and thus might contain valuable information about one another. This approach “borrows strength” from the entire dataset, leading to more reliable estimates, especially for groups with limited data (see the sketch below).

Benefits
For a person in a given dwelling, their life satisfaction is modelled as the intercept for the dwelling they are in + the average age in their dwelling + their own age + random error. We can now ask questions which span multiple levels, e.g. whether our group level influences our observation level.
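A minimal sketch contrasting the three pooling approaches, assuming (as in the ICC example above) a data frame d3 with columns lifesat, dwelling, and age; the age predictor is an assumption based on the Benefits note:

library(lme4)

# Complete pooling: one overall estimate, ignoring dwellings entirely
mod_pool    <- lm(lifesat ~ age, data = d3)

# No pooling: each dwelling gets its own separately estimated intercept
# (alternatively, fit a separate lm() per dwelling)
mod_nopool  <- lm(lifesat ~ age + dwelling, data = d3)

# Partial pooling: dwelling intercepts are treated as draws from a common
# distribution, so estimates "borrow strength" across dwellings
mod_partial <- lmer(lifesat ~ age + (1 | dwelling), data = d3)

summary(mod_partial)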
Multilevel Models

Modelling group-level variability, rather than estimating group differences. These models are essential when dealing with data that have hierarchical or nested structures, such as students within schools or repeated measures within individuals. In many research scenarios, data are organised at multiple levels; for example, students (Level 1) are nested within schools (Level 2). Traditional regression models may not adequately account for the dependencies introduced by such structures, potentially leading to incorrect inferences. MLMs address this by allowing certain parameters to vary across higher-level units, effectively capturing the hierarchical nature of the data.

Fixed Effects: the average effects assumed to be constant across all groups or clusters. For instance, the overall effect of study hours on exam scores across all students.
Random Effects: account for variation at different levels of the hierarchy. For example, the effect of study hours might differ between schools due to varying teaching quality or resources.

Random Intercepts Model

The simplest form of an MLM includes random intercepts, allowing the baseline level of the outcome variable to vary across groups. This model can be expressed as:

y_ij = β0 + β1·x_ij + u_0j + ε_ij

y_ij: outcome for individual i in group j
β0: overall intercept (fixed effect)
β1: slope of predictor x (fixed effect)
x_ij: predictor variable for individual i in group j
u_0j: random intercept for group j, capturing group-specific deviations
ε_ij: residual error for individual i in group j

In this model, u_0j allows each group to have its own intercept, reflecting differences between groups.

Random Slopes Model

Extending the model, we can allow slopes to vary across groups, so the relationship between the predictor and outcome may differ between groups:

y_ij = β_0j + β_1j·x_ij + ε_ij
β_0j = γ_00 + ζ_0j
β_1j = γ_10 + ζ_1j

ζ_1j: random slope deviation for group j, indicating how the effect of x varies by group.
β_0j = group j’s intercept = the average intercept for the population of groups (γ_00) + the deviation of group j (ζ_0j) from γ_00.
β_1j = group j’s slope = the average slope for the population of groups (γ_10) + the deviation of group j (ζ_1j) from γ_10.
γ_00 = fixed effect: the average over the population of groups.
ζ_0j = random effect: the deviation of the specific group from that average.

Group deviations for intercepts and slopes are assumed to be normally distributed with:
👺 Mean = 0
🎞️ SDs = σ0 and σ1 respectively
🤞 Correlation = ρ01
Variance = the SD squared; it is in squared units.

Fitting Multilevel Models in R

The lme4 package in R facilitates fitting MLMs using the lmer() function for linear outcomes. The syntax incorporates both fixed and random effects, as in the sketch below.
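A sketch of the lmer() syntax, again assuming the d3 data with lifesat, dwelling, and age used in the earlier examples (variable names carried over from those notes, not a definitive analysis):

library(lme4)

# Random intercepts: each dwelling gets its own baseline life satisfaction
model_ri <- lmer(lifesat ~ age + (1 | dwelling), data = d3)

# Random intercepts and slopes: the effect of age is also allowed to differ
# between dwellings, with correlated intercept and slope deviations
model_rs <- lmer(lifesat ~ age + (1 + age | dwelling), data = d3)

summary(model_rs)   # fixed effects, random-effect SDs, and their correlation
fixef(model_rs)     # gamma_00 and gamma_10: population-average intercept and slope
ranef(model_rs)     # zeta_0j and zeta_1j: each dwelling's deviations
VarCorr(model_rs)   # sigma_0, sigma_1, and the correlation rho_01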
