Questions and Answers
What is the primary purpose of the by argument in the data.table subsetting structure?
What is the main difference between Integer and Numeric data types in R?
What is the purpose of subsetting data in analyses?
What is the rule for data merges in R?
What is the purpose of Factor data type in R?
What is the return value of logical operators in R?
What type of join results in a dataset with all rows in both x and y?
What is the primary advantage of using long data format?
What is the purpose of logical operators in data management?
What is the characteristic of wide data format?
What is the convention for treating Boolean values in arithmetic operations?
What is the purpose of the Characters data type in R?
What is the purpose of reshaping data?
What is the consequence of using wide data format when there are missing values?
Which data type is used to store data that can be either TRUE or FALSE?
What operator is used for set notation in R?
What is the primary advantage of using rowMeans() to average a variable?
What is the consequence of adding items together to get a total score if a participant misses any single item?
What is the purpose of multiplying the result of rowMeans() by the number of items that should have been completed?
What is the recommended approach to scoring questionnaire scales if the scale is typically added up?
What is the advantage of using rowMeans() to calculate the total score when there are small amounts of missing data?
What is the purpose of the rowMeans() function in R?
What is the main purpose of adding 'psych::' when calling the alpha function in R?
What is the primary advantage of using geom_point() over geom_bar()?
What is the main goal of the 'data to ink ratio' concept in data visualization?
What is the primary purpose of using shapes on scatterplots?
What type of plot is used to compare the distribution of data between groups?
What is the main difference between histograms and density plots?
What is the purpose of a QQ plot?
What is the purpose of using z-scores to identify extreme values?
What is the main advantage of using dot plots for small datasets?
What is the main difference between a QQ plot and a deviates plot?
What is the primary advantage of using the rowMeans() function to calculate the total score?
What is the primary advantage of using rowMeans() to average a variable when there are small amounts of missing data?
What is the alternative to using the sum of all items to get a total score?
What is the purpose of using rowMeans() to calculate the total score?
What is the primary advantage of using the rowMeans() function over adding items together?
What is the consequence of using rowMeans() to calculate the total score when there are small amounts of missing data?
What is the purpose of scoring questionnaire scales?
What is the primary purpose of using the 'psych::' prefix when calling the alpha function in R?
What is the key difference between a bivariate plot and a univariate plot?
What is the primary goal of the 'data to ink ratio' concept in data visualization?
What type of plot is used to show the distribution of data between groups?
What is the main difference between a histogram and a density plot?
What is the primary purpose of a QQ plot?
What is the primary advantage of using dot plots for small datasets?
What is the primary purpose of using z-scores to identify extreme values?
What is the purpose of squaring the residuals in linear regression?
What is the difference between simple and multiple linear regression?
What does yi represent in the equation for a straight line?
What is the purpose of estimating b0 and b1 in linear regression?
What is the subscript i indicating in the equation for a straight line?
What is the purpose of leaving off the residual/error term in the equation for a straight line?
How does multiple linear regression differ from simple linear regression?
What is the purpose of the regression coefficients (b0 and b1) in linear regression?
What is the goal of linear regression?
What is the purpose of the intercept (b0) in linear regression?
What is the purpose of the link function in GLMs?
What is the assumption of equal variance/homoscedasticity in linear regression?
What is the purpose of a QQ plot in linear regression diagnostics?
What is the consequence of violating the assumption of independence in linear regression?
What is the purpose of a scatterplot of predicted values against residuals in linear regression diagnostics?
What is the purpose of the inverse link function in GLMs?
What is the assumption of normally-distributed errors in linear regression?
What is the purpose of the L.I.N.E. acronym in linear regression?
What is the purpose of a density plot of residuals in linear regression diagnostics?
What is the assumption of independence in linear regression?
What is the interpretation of b0 in the multiple linear regression equation yi=b0+b1∗x1i+...+bk∗xki+εi?
What is the primary purpose of the generalized linear model (GLM)?
What is the purpose of the residual standard error (σ) in the linear regression model?
What is the interpretation of the coefficient b1 in the multiple linear regression equation yi=b0+b1∗x1i+...+bk∗xki+εi?
What is the purpose of the F-test in the linear regression model?
What is the primary advantage of using the linear regression model over other types of regression models?
What is the purpose of the t-value in the linear regression output?
What is the interpretation of the R-squared value in the linear regression model?
What is the purpose of the confint() function in R when working with linear regression models?
What is the primary advantage of using the adjusted R-squared value over the regular R-squared value?
What is the range of probabilities in logistic regression?
What is the assumption in logistic regression about the relationship between the outcome and the predictor variables?
What is the purpose of checking for outliers/extreme values in logistic regression?
What type of outcome variable is suitable for Poisson regression?
What is a characteristic of the Poisson distribution?
What is the function used to perform logistic regression in R?
What does an odds ratio greater than 1 indicate in logistic regression?
Why is linear regression rarely used for count outcomes?
What is a marginal effect in logistic regression?
What is an example of a research use case where Poisson regression may be appropriate?
What is a disadvantage of using linear regression for count outcomes?
What is the purpose of checking for separation in logistic regression?
What is the main difference between Poisson regression and linear regression?
Why is a large sample size required for logistic regression?
What is a characteristic of count variables?
What is the purpose of the link function in Poisson regression?
What is the assumption about the distribution of the outcome variable in Poisson regression?
What is the interpretation of the Incident Rate Ratio (IRR) in Poisson regression?
What is the purpose of exponentiating the coefficients in Poisson regression?
What is the main difference between linear regression and binary logistic regression?
What is the purpose of the link function in logistic regression?
What is the distribution assumed for the outcome variable in binary logistic regression?
What is the link function used in logistic regression?
What is the purpose of using the glm() function in R for Poisson regression?
What is the advantage of using Poisson regression over linear regression for count data?
What type of plot is used to show the distribution of stress when negative affect is missing or not?
What is the purpose of identifying patterns of missing data?
What do the blue dots in the margin plot represent?
What is shown in the margin plot along with the scatter plot?
What does the red boxplot in the margin plot represent?
What can be seen from the distribution of stress when negative affect is missing or not?
What happens when data are missing not at random (MNAR)?
What is multiple imputation?
What is the formula to determine total uncertainty in average estimate in multiple imputation?
What is the purpose of examining missing data before imputation?
What is the VIM package in R used for?
What is the consequence of data being MNAR?
What is the purpose of generating multiple imputed datasets?
What is the benefit of multiple imputation?
What is the result of performing the analysis of interest on each imputed dataset?
What is the main consequence of missing data?
What happens when data are missing completely at random (MCAR)?
What is the classification of missing data when the missingness mechanism is conditionally independent of the estimate of our parameter(s) of interest?
What can be done to recover unbiased estimates when data are Missing at Random (MAR)?
What is the consequence of using complete cases only when data are Missing at Random (MAR)?
Why is list-wise deletion often inefficient?
What is the main difference between Missing at Random (MAR) and Missing Completely at Random (MCAR)?
What is the main issue with list-wise deletion?
What is the condition for list-wise deletion to yield unbiased estimates?
What is the main difference between Missing Completely at Random (MCAR) and Missing at Random (MAR)?
What is the main advantage of using a conditional approach when data are Missing at Random (MAR)?
What is the main issue with complete cases only?
What is the main difference between Missing Not at Random (MNAR) and the other types of missingness?
What is the purpose of a margin plot in data analysis?
What do the blue dots in a margin plot represent?
What can be inferred from the boxplots on the x-axis of a margin plot?
Why is there only one boxplot on the y-axis of a margin plot?
What is the purpose of examining the distribution of stress when negative affect is missing or not?
What can be inferred from the presence of red dots in a margin plot?
What is the benefit of using margin plots in data analysis?
What is the primary reason why we cannot recover unbiased estimates when data are missing not at random?
What is the main difference between mean positive affect being MNAR and MAR?
What is the purpose of multiple imputation in addressing missing data?
What is the formula to determine the total uncertainty in the average estimate in multiple imputation?
What is the issue with using multiple imputation with small sample sizes?
What is the purpose of examining the missing data before doing any imputation?
What is the benefit of using the VIM package in R for examining missing data?
What is the consequence of assuming MAR when the data are actually MNAR?
What is the purpose of generating multiple imputed datasets in multiple imputation?
What is the advantage of using multiple imputation over single imputation?
What is a key characteristic of linear mixed models that allows them to handle repeated measures data?
What is a key difference between fixed effects and random effects in linear mixed models?
When would you use repeated measures ANOVA instead of linear mixed models?
What is an advantage of using linear mixed models over repeated measures ANOVA?
How do fixed effects approximate the distribution of the data?
What is a key assumption of linear regression that linear mixed models can relax?
What is a benefit of using linear mixed models for clustered data?
How do random effects approximate the distribution of the data?
What is the main difference between fixed effects and random effects in regression analysis?
What does an intraclass correlation coefficient (ICC) of 0.5 indicate?
What is the purpose of the meandeviations() function in linear mixed models?
What is a key assumption of linear mixed models (LMMs)?
What is the primary advantage of using linear mixed models over traditional linear regression?
What is the purpose of the ICC in linear mixed models?
What is a key feature of linear mixed models that allows them to relax the assumption of independence in traditional linear regression?
What is the interpretation of an ICC of 0.25?
Study Notes
Data Table Subsetting
- Data table subsetting structure: DT[i, j, by], where DT is the data table, i is the rows, j is the columns, and by is the grouping variable
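A minimal sketch of the DT[i, j, by] structure, assuming the data.table package is installed. The table and its column names (id, group, score) are made up for illustration.

```r
library(data.table)

# Hypothetical example data: six participants in three groups
DT <- data.table(
  id    = 1:6,
  group = c("a", "a", "b", "b", "c", "c"),
  score = c(10, 12, 7, 9, 15, 11)
)

# i: keep rows with score > 8
# j: compute the mean score
# by: do the computation separately for each group
res <- DT[score > 8, .(mean_score = mean(score)), by = group]
print(res)
```

Here i filters rows first, then j is evaluated within each level of by, so group "b" is averaged over its single remaining row.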
Data Types in R
- Logical: used for logical data (TRUE or FALSE)
- Integer: used for integer type data (whole numbers like 0, 1, 2)
- Numeric: used for real numbers (1.1, 4.8) and can be used for integer data (less efficient)
- Factor: special representation of numeric data when data are fundamentally discrete (e.g., study condition coded as 0 = control, 1 = medication, 2 = psychotherapy)
- Characters: used for text type data (names, qualitative data, etc.) and can store numbers as strings
Operators
- Logical operators: used to manage data (e.g., find outliers, values greater/less than a score)
- Boolean values: TRUE or FALSE, where TRUE is treated as 1 and FALSE is treated as 0 in arithmetic
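- Operators (%in% and the standard comparison operators such as >= are base R; the other %...% operators listed here are not base R and come from an add-on package such as JWileymisc):
- >= OR %ge%: Greater than or equal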
- %gl%: Greater than AND less than
- %gel%: Greater than or equal AND less than
- %gle%: Greater than AND less than or equal
- %gele%: Greater than or equal AND less than or equal
- %in%: In
- %!in% or %nin%: Not in
- %c%: Chain operations on the RHS together
- %e%: Set operator, to use set notation
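As a rough illustration of logical operators and Boolean arithmetic, the sketch below uses only base R (so it writes "greater than AND less than" with > and < joined by & rather than an add-on operator). The score and condition values are hypothetical.

```r
# Hypothetical questionnaire scores
scores <- c(2, 9, 4, 15, 7)

# A logical operator returns TRUE/FALSE for each value
is_high <- scores >= 7

# TRUE counts as 1 and FALSE as 0 in arithmetic
n_high    <- sum(is_high)   # how many scores are >= 7
prop_high <- mean(is_high)  # proportion of scores >= 7

# Greater than 3 AND less than 10, written in base R
in_range <- scores > 3 & scores < 10
kept <- scores[in_range]

# %in% checks membership in a set of values
condition <- c("control", "medication", "other")
is_known <- condition %in% c("control", "medication")
```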
Subsetting Data
- Subsetting: excluding outliers, selecting participants who meet certain criteria
- Order of subsetting matters
Merging Data
- Rules: one join at a time, x dataset is always on the left, y dataset is always on the right
- Types of joins:
- Natural join: resulting data has only rows present in both x and y (all = FALSE)
- Full outer join: resulting data has all rows in x and all rows in y (all = TRUE)
- Left outer join: resulting data has all rows in x (all.x = TRUE)
- Right outer join: resulting data has all rows in y (all.y = TRUE)
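The four join types can be sketched with base R's merge(), which uses the all, all.x, and all.y arguments named above. The two small data frames and their columns (id, age, stress) are hypothetical.

```r
# x is the "left" dataset, y is the "right" dataset
x <- data.frame(id = c(1, 2, 3), age = c(30, 41, 25))
y <- data.frame(id = c(2, 3, 4), stress = c(5, 2, 7))

# Natural join: only ids present in both x and y
natural <- merge(x, y, by = "id", all = FALSE)

# Full outer join: every id from x and from y
full <- merge(x, y, by = "id", all = TRUE)

# Left outer join: every id from x, NA where y has no match
left <- merge(x, y, by = "id", all.x = TRUE)

# Right outer join: every id from y, NA where x has no match
right <- merge(x, y, by = "id", all.y = TRUE)
```

Rows without a match in the other dataset are kept in the outer joins with NA in the unmatched columns.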
Reshaping Data
- Necessary for repeated measures/longitudinal/panel data
- Types of data structures:
- Wide: each measure has a separately-named variable for each time point it was measured
- Each entity occupies their own row, and each variable occupies a single column
- Easy to read and interpret, used in descriptive statistics and reporting
- Long: time point (or wave) is a variable, IDs will have multiple rows
- Machine-friendly data structure, easier to perform functions like filtering and aggregating
- Easier to add new data and avoids the problem of null values
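One way to move from wide to long format is base R's reshape(); this is only a sketch, and other tools exist for the same job. The variable names (stress_t1, stress_t2, wave) are invented for the example.

```r
# Wide: one row per person, one column per time point
wide <- data.frame(
  id        = 1:2,
  stress_t1 = c(3, 5),
  stress_t2 = c(4, NA)
)

# Long: one row per person per wave, with wave as a variable
long <- reshape(
  wide,
  direction = "long",
  varying   = c("stress_t1", "stress_t2"),  # columns to stack
  v.names   = "stress",                     # name of the stacked column
  timevar   = "wave",                       # new time-point variable
  times     = 1:2,
  idvar     = "id"
)
print(long)
```

Each id now has two rows, and the missing second measurement for id 2 is a single NA cell in the stress column.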
Scoring Questionnaire Scales
- There are two ways to score questionnaire scales: adding together to get a sum total score, or taking an average of all items
- Using rowMeans() with na.rm = TRUE averages only the items each participant answered, so the calculation excludes missing data; this is commonly done and is sensible when there are only small amounts of missing data
- If you want to deal with missing data but need a total score, you can use rowMeans() and multiply the result by the number of items that should have been completed
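The two scoring approaches can be compared on a small made-up example. The item names (q1 to q3) and responses are hypothetical; participant 2 skipped one item.

```r
# Three participants, three items; participant 2 missed q2
items <- data.frame(
  q1 = c(1, 2, 3),
  q2 = c(2, NA, 3),
  q3 = c(3, 4, 3)
)

# Adding items: any missing item makes the total NA
sum_score <- rowSums(items)

# Averaging the items that were answered
mean_score <- rowMeans(items, na.rm = TRUE)

# Rescale the average to a total by multiplying by the number of items
total_score <- mean_score * ncol(items)
```

Participant 2 has no usable sum score, but still gets a total of 9 from the rescaled average of the two items they answered.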
Cronbach's Alpha
- psych::alpha() is the function to find Cronbach's alpha, a common measure of scale reliability
- The psych:: prefix specifies the alpha function from the psych package, as alpha is a popular function name that other packages also use
Bivariate Plots
- A bivariate plot shows the relationship between two variables, mapped onto the x- and y-axes
- geom_point() is used to make scatterplots, and additional layers (e.g., geom_line()) can be added using +
- geom_bar() is used for barplots
Best Practices in Data Visualization
- Data to ink ratio: aim for more data and less ink
- Themes are helpful in achieving this goal
- Axes can be useful for providing more data, such as labeling with quantiles
- Shapes on scatterplots help identify categorical variables
Types of Plots
Violin Plots
- Used to compare the distribution of data between groups
- Thicker regions have more points, narrow regions have fewer data points
- Show the range/spread of each variable and mean and confidence interval summaries
Histograms
- Define equal width bins on the x-axis and count how many observations fall within each bin
- Bars display these, where the width of the bar is the width of the bin and the height is the count (frequency) of observations
- Show a univariate distribution
Density Plots
- Show the distribution using a smooth density function rather than binning data
- Height indicates the relative frequency of observations at a particular value
- Scaled so that the total area under the curve equals one
- Show a univariate distribution
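The difference between the two summaries can be checked numerically in base R without drawing anything. This is a sketch on simulated data (200 random normal values, an arbitrary choice), and the area calculation is a simple approximation.

```r
set.seed(1)
x <- rnorm(200)  # simulated data for illustration

# Histogram: equal-width bins and a count per bin
h <- hist(x, plot = FALSE)
bin_widths <- diff(h$breaks)
total_count <- sum(h$counts)  # every observation falls in one bin

# Density: a smooth curve evaluated on an evenly spaced grid
d <- density(x)
grid_step <- d$x[2] - d$x[1]

# Approximate the area under the density curve; it should be close to 1
area <- sum(d$y) * grid_step
```

Histogram heights are counts that add up to the sample size, while density heights are scaled so the area under the curve is about one.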
Dot Plots
- Effective at showing raw data for small datasets
- Each dot represents one person
- Dots are stacked on top of each other if they would overlap
- Provide greater precision than histograms
- Show a univariate distribution
QQ Plots
- A scatterplot created by plotting two sets of quantiles against one another
- If both sets of quantiles came from the same distribution, the points should form a line that's roughly straight
Simple vs. Multiple Linear Regression
- Simple linear regression: the equation is yi = b0 + b1 * xi + εi, where yi is the outcome variable, xi is the predictor/explanatory variable, εi is the residual/error term, b0 is the intercept, and b1 is the slope of the line.
- In simple linear regression, the model parameters (b0 and b1) are the same for all participants, but each person has their own values of yi and xi, and there will be some unexplained residual (εi).
- If we want to talk about only what is predicted based on the regression coefficients, we can write ŷi = b0 + b1 * xi, where ŷi is the predicted value, leaving off the residual/error term (εi).
Multiple Linear Regression
- Multiple linear regression works in principle basically the same way as simple linear regression, but allows for more than one predictor (explanatory) variable in a single model.
- The equation for multiple linear regression is yi = b0 + b1 * x1i + ... + bk * xki + εi, where yi is the outcome variable, x1i, x2i, ..., xki are the predictor variables, and εi is the residual/error term.
- The regression coefficients (b0, b1, ..., bk) are interpreted fairly similarly to those in simple linear regression, but with some extra requirements.
- b0 is the intercept, the expected (model predicted) value of yi when all predictors are 0.
- b1, b2, ..., bk are the slopes, capturing how much yi is expected to change for a one unit change in x1, x2, ..., xk, respectively, holding all other predictors constant.
Line of Best Fit and Residuals
- The line of best fit is the regression line through the data points that minimizes the sum of the squared residuals.
- The residuals are the differences between the observed values and the model predicted values.
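To make the idea of minimizing squared residuals concrete, the sketch below computes the least-squares intercept and slope by hand and compares them with lm(). The five data points are invented for illustration.

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares slope and intercept for one predictor
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)

# Predicted values and residuals (observed minus predicted)
predicted <- b0 + b1 * x
resid_by_hand <- y - predicted
sse <- sum(resid_by_hand^2)  # sum of squared residuals

# Any other line, e.g. a slightly steeper one, has a larger sum
sse_other <- sum((y - (b0 + (b1 + 0.1) * x))^2)

# lm() finds the same coefficients
fit <- lm(y ~ x)
coef(fit)
```

Squaring stops positive and negative residuals from cancelling out; the raw residuals from the best-fit line sum to zero.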
Generalized Linear Models (GLMs)
- GLMs extend the linear model to different outcomes, such as continuous, normally distributed variables (linear regression), binary 0/1 variables (logistic regression, probit regression), and count variables (poisson regression, negative binomial regression).
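- GLMs keep the model linear on the scale of the linear predictor, eta (η): a link function transforms the expected outcome onto that scale, and the inverse link function transforms predictions back to the scale of the outcome.
- In linear regression, the link is the identity function (η = μ), so no transformation is needed because the outcome is already in linear space.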
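As a small worked illustration, the sketch below applies the log link (commonly used for counts) and the logit link (commonly used for binary outcomes) and then undoes them with their inverses. The starting values 3 and 0.8 are arbitrary.

```r
# Log link for a count mean (lambda)
lambda  <- 3
eta_log <- log(lambda)    # link: outcome scale -> linear scale
back_lambda <- exp(eta_log)  # inverse link: linear scale -> outcome scale

# Logit link for a probability (mu)
mu        <- 0.8
eta_logit <- log(mu / (1 - mu))           # log odds
back_mu   <- 1 / (1 + exp(-eta_logit))    # inverse logit

# Base R also provides the logit and its inverse
same_logit <- qlogis(mu)
same_prob  <- plogis(eta_logit)

# Any value on the linear scale maps back into the valid range
count_from_negative_eta <- exp(-10)     # still above 0
prob_from_large_eta     <- plogis(10)   # still below 1
```

The inverse link is what keeps predictions valid: counts never go below 0 and probabilities stay between 0 and 1, whatever the value of η.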
Probability Distribution
- A probability distribution is a function where, as we move along the x-axis, the y-axis tells us how likely that value is to occur.
Normal Distribution (Gaussian Distribution)
- Parameters: mean and standard deviation.
R Output from a Linear Model
- The lm() function in R is used to fit a linear model; it uses a formula interface to specify the desired model, in the format outcome ~ predictor.
- The summary() function provides a quick summary of the model, including the regression coefficients, standard errors, t-values, and p-values.
- The confint() function provides confidence intervals for the regression coefficients.
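A short sketch of these three functions, using R's built-in mtcars dataset purely as a convenient example (the choice of mpg as outcome and wt and hp as predictors is arbitrary).

```r
# Fit a multiple linear regression: outcome ~ predictor1 + predictor2
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient table: estimates, standard errors, t-values, p-values
coef_table <- summary(fit)$coefficients
print(coef_table)

# R-squared and adjusted R-squared are also stored in the summary
r2     <- summary(fit)$r.squared
adj_r2 <- summary(fit)$adj.r.squared

# 95% confidence intervals for b0, b1, and b2
ci <- confint(fit)
print(ci)
```

Printing summary(fit) directly shows the same table together with the residual standard error and the F-test.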
Linear Regression Assumptions
- L.I.N.E. = Linear relationship, Independent observations and errors, Normally distributed errors, and Equal variance.
- Independent observations: each value of the outcome should come from a different person.
- Independent errors: for any pair of observations, the error terms should be uncorrelated.
- Normally-distributed errors: the errors (i.e., the residuals) should be random and normally distributed with a mean of 0.
- Equal variance/homoscedasticity: for each value of the predictors, the variance of the error term should be constant.
Model Diagnostics
- To assess normally-distributed errors, look at the density plot of residuals (black line) vs a normal distribution (dotted blue line).
- To identify outliers, look at the QQ plot of residuals (black points are outliers).
- To check for equal variance/homoscedasticity, look at a scatterplot of the model predicted values against the residuals. We want the spread of the residuals to be about the same across all predicted values (i.e., the blue dotted lines should be horizontal and parallel).
Poisson Regression
- Used for count variables, which are discrete and must be whole numbers that are 0 or positive
- Poisson distribution is used when counts are relatively rare
- Examples of research use cases:
- Examining risk factors for accidents over a 12-month period
- Analyzing the number of children people have
- Predicting how many friends people have
- Evaluating whether an intervention reduced medication non-adherence
- Testing whether treating mental health can lower healthcare appointments
- Poisson regression:
- Does not assume normal distribution
- Has one parameter, lambda (mean and variance)
- Linear regression rarely works well for count outcomes due to:
- Straight line being a bad fit at extremes
- Non-normal distribution of residuals
- Generalized linear models (GLMs) handle count outcomes in Poisson regression by:
- Link functions to transform linear predicted values to never go below 0
- Assuming a Poisson distribution
- Defining the link function as: η = g(λ) = ln(λ)
- Assumptions of Poisson regression:
- Poisson distribution, outcome is counts (non-negative integers)
- Mean and variance must be the same
- Linear relationship on the link scale (ln)
- No need to worry about normally distributed errors or equal variance/homoscedasticity
- Watch for right-side outliers (extremely high counts)
- Importance of large sample size
- How to do Poisson regression in R:
- Use the glm() function with the argument 'family = poisson'
- Incident Rate Ratios (IRRs):
- Interpret as: "for each one unit higher predictor score, there are IRR times as many events of the outcome"
- Example: with IRR = 2 and a base rate of 1 event, a one unit higher predictor score gives an expected 1 * 2 = 2 events
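The glm() call and the IRR interpretation can be sketched on a tiny made-up dataset. The counts below were chosen so that the group means are 1, 2, and 4, which makes the fitted IRR come out at about 2; exponentiating the coefficients to obtain IRRs is the standard approach for a log link.

```r
# Hypothetical data: predictor x and a count outcome y
dat <- data.frame(
  x = c(0, 0, 0, 1, 1, 1, 2, 2, 2),
  y = c(1, 0, 2, 2, 3, 1, 4, 5, 3)
)

# Poisson regression with the default log link
fit <- glm(y ~ x, data = dat, family = poisson)

# Coefficients are on the log scale; exponentiate to get IRRs
irr <- exp(coef(fit))
print(irr)

# Predicted counts are on the outcome scale and never go below 0
pred_x3 <- predict(fit, newdata = data.frame(x = 3), type = "response")
```

With an IRR of about 2 for x, each one unit higher x doubles the expected count: roughly 1 event at x = 0, 2 at x = 1, 4 at x = 2, and 8 at x = 3.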
Binary Logistic Regression
- Used for binary outcomes, where the outcome only takes on two values: 0 or 1
- Examples of research use cases:
- Predicting whether someone will have major depression or not
- Determining the probability of patients remitting from major depression
- Predicting the probability of readmission to the hospital within 30 days
- Predicting the probability of death before age 60
- Binary logistic regression:
- Linear regression will not work for binary outcomes due to:
- Straight line being a bad fit
- Non-normal distribution of residuals
- Generalized linear models (GLMs) handle binary outcomes in logistic regression by:
- Link functions to transform linear predicted values to never go below 0 and never go above 1
- Assuming a Bernoulli distribution
- Defining the link function as: η = g(μ) = ln(μ / (1 − μ))
- Assumptions of logistic regression:
- Bernoulli distribution, outcome is probability of event occurring
- Linear relationship on the link scale (logit, i.e., log odds)
- Independent observations, independent errors
- Identify outliers/extreme values on the predictors
- Check for separation (predictor variable perfectly predicts the outcome)
- Importance of large sample size
- How to do logistic regression in R:
- Use the glm() function with the argument 'family = binomial'
- Odds ratio:
- Indicates how many times higher the odds of the outcome are for a one unit higher score on the predictor
- Higher than 1 means a positive relationship, less than 1 means a negative relationship
- Marginal effect:
- Instantaneous effect of change at a particular point
- Equivalent to the slope of a straight line at that value
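The logistic regression points above can be sketched as follows, using simulated data. The variable names and true coefficients are illustrative assumptions. The marginal effect is computed by hand at x = 0 using the derivative of the logistic curve, b × p × (1 − p), rather than with any dedicated package.

```r
# Simulated data: all names and values here are illustrative assumptions
set.seed(2024)
n <- 1000
x <- rnorm(n)
# True model on the logit scale: ln(mu / (1 - mu)) = -0.5 + 1.0 * x
mu <- plogis(-0.5 + 1.0 * x)
y <- rbinom(n, size = 1, prob = mu)
d <- data.frame(y = y, x = x)

# family = binomial uses the logit link by default
m_logit <- glm(y ~ x, family = binomial, data = d)

# Odds ratios: exponentiated coefficients
or <- exp(coef(m_logit))

# Predicted probabilities are kept between 0 and 1 by the link function
p_hat <- predict(m_logit, type = "response")

# Marginal effect of x at x = 0: slope of the probability curve at that point
p0 <- predict(m_logit, newdata = data.frame(x = 0), type = "response")
me_at_0 <- unname(coef(m_logit)["x"] * p0 * (1 - p0))
```

Because the probability curve is S-shaped, the marginal effect changes depending on the value of x at which it is evaluated, unlike the odds ratio.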
Missing Data
- Missing data are common and problematic, leading to biased results and efficiency loss.
Types of Missing Data
- Missing completely at random (MCAR): when the missingness mechanism is completely independent of the estimate of our parameter(s) of interest.
- Missing at random (MAR): when the missingness mechanism is conditionally independent of the estimate of our parameter(s) of interest, that is, independent once other observed variables are taken into account.
- Missing not at random (MNAR): when the missingness mechanism is associated with the estimate of our parameter(s) of interest.
Consequences of Missing Data
- Listwise deletion may lead to biased results unless the data are missing completely at random (MCAR).
- When data are missing completely at random (MCAR), listwise deletion yields unbiased estimates of the parameter(s) that would have been obtained had no data been missing.
- When data are missing at random (MAR), it is possible to recover unbiased estimates if the right other variables are present.
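A small simulation can illustrate these consequences. This is a sketch under assumed values: the missingness rates, the variable names, and the use of a simple regression on the observed variable x to recover the mean are all illustrative choices, not a method prescribed by these notes.

```r
# Simulated data: all names and values here are illustrative assumptions
set.seed(99)
n <- 10000
x <- rnorm(n)
y <- 2 + 1 * x + rnorm(n)  # the true mean of y is 2

# MCAR: every y value has the same 30% chance of being missing
y_mcar <- ifelse(runif(n) < 0.30, NA, y)

# MAR: missingness in y depends only on the fully observed variable x
y_mar <- ifelse(runif(n) < plogis(-1 + 1.5 * x), NA, y)

mean_full <- mean(y)
mean_mcar <- mean(y_mcar, na.rm = TRUE)  # deletion under MCAR
mean_mar  <- mean(y_mar, na.rm = TRUE)   # deletion under MAR

# Using the "right other variable": model y from x on the observed cases,
# then average the predictions over everyone
fit <- lm(y_mar ~ x)
mean_mar_adj <- mean(predict(fit, newdata = data.frame(x = x)))

round(c(full = mean_full, mcar = mean_mcar,
        mar = mean_mar, mar_adjusted = mean_mar_adj), 2)
```

In this setup, deletion under MCAR stays close to the full-data mean, deletion under MAR is pulled downward because high-x cases are more often missing, and conditioning on x brings the estimate back.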
Multiple Imputation (MI)
- Multiple imputation is a robust way to address missing data, involving generating multiple, different datasets with plausible values imputed for the missing data.
- Steps in MI:
- Start with the incomplete data.
- Generate 𝑚 datasets with no missingness, by filling in different plausible values for any missing data.
- Perform the analysis of interest on each imputed dataset.
- Pool the results from the analyses run on each imputed dataset to generate an overall estimate, Q̄.
- Formula for the total uncertainty in the pooled estimate: T = V̄ + B + B/m, where V̄ is the average uncertainty (sampling variance) estimate of Q̂ across the multiply imputed datasets, B is the variance of the estimates Q̂ between the imputed datasets, and m is the number of imputed datasets.
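The pooling step and the total-uncertainty formula can be worked through by hand. The five estimates and variances below are made-up numbers for illustration only; in practice they would come from fitting the same model to each imputed dataset.

```r
# Hypothetical results from m = 5 imputed datasets (made-up numbers)
q_hat <- c(0.52, 0.48, 0.55, 0.50, 0.45)       # estimate from each dataset
v     <- c(0.010, 0.012, 0.011, 0.009, 0.013)  # squared standard error from each
m <- length(q_hat)

q_bar <- mean(q_hat)        # pooled estimate
v_bar <- mean(v)            # average within-imputation variance
b     <- var(q_hat)         # between-imputation variance of the estimates
t_var <- v_bar + b + b / m  # total variance: T = V-bar + B + B/m
pooled_se <- sqrt(t_var)

c(estimate = q_bar, total_variance = t_var, se = pooled_se)
```

The B/m term shrinks as more imputed datasets are generated, so it reflects the extra uncertainty from using a finite number of imputations.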
Issues with Using Imputed Datasets
- Issues with using imputed datasets with general linear models:
- Small sample sizes (i.e., 100 or less).
- Collinear variables.
- Lots of interactions between variables.
- Non-normal residuals.
Examining Missing Data
- Before doing any imputation, it is a good idea to examine the data using the VIM package in R.
- The aggr() function shows the proportion of missing data on each individual variable and the patterns of missing data.
- Margin plots can help identify if imputations fall outside the range of observed data or fit with the rest of the trend from the observed data.
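The two summaries attributed to aggr() above (proportion missing per variable, and patterns of missingness) can also be approximated in base R. This sketch does not call the VIM package; the data frame and its column names are simulated assumptions.

```r
# Simulated data with missing values: names and values are illustrative assumptions
set.seed(42)
d <- data.frame(
  age    = rnorm(100, 40, 10),
  stress = rnorm(100, 5, 2),
  sleep  = rnorm(100, 7, 1)
)
d$stress[sample(100, 15)] <- NA
d$sleep[sample(100, 30)]  <- NA

# Proportion of missing data on each individual variable
prop_missing <- colMeans(is.na(d))

# Missingness patterns: each row becomes a string such as "0 1 0"
# (1 = missing, in column order), and table() counts each pattern
pattern <- apply(is.na(d) * 1L, 1, paste, collapse = " ")
pattern_counts <- sort(table(pattern), decreasing = TRUE)

prop_missing
pattern_counts
```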
Independence of Observations
- Observations are not always independent, e.g., in longitudinal studies, repeated measures experiments, and clustered data
- This type of data poses challenges to statistical analysis, but can be addressed using linear mixed models
Linear Mixed Models
- Relax the assumption of independence of observations in linear regression
- Allow for variation in observations, including continuous time, missing data, and continuous predictors
Fixed Effects vs Random Effects
- Fixed effects: assume same slope and intercept for all participants, only applicable for one observation per participant
- Random effects: allow for different coefficients (slopes and intercepts) per participant, applicable for repeated measures
Fixed Effects Approximation
- Approximate distribution by: 𝑀 = estimated mean; 𝑆𝐷 = 0 (standard deviation is fixed at 0)
Random Effects Approximation
- Approximate distribution by: 𝑀 = estimated mean; 𝑆𝐷 = estimated standard deviation (SD is free to vary)
Main Difference Between Fixed and Random Effects
- Fixed effects: regression coefficients are the same for everyone
- Random effects: regression coefficients vary randomly for each participant
Intraclass Correlation Coefficient (ICC)
- Measures the ratio of between variance to total variance (ranges between 0 and 1)
- ICC > 0 indicates individual means differ, and individual differences need to be accounted for in analysis
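As a rough illustration of the ICC as a ratio of between variance to total variance, the sketch below uses a one-way ANOVA variance decomposition in base R on simulated, balanced data. This is a swapped-in estimator chosen to avoid extra packages, not necessarily the method the notes have in mind, and all names and values are assumptions.

```r
# Simulated repeated measures: names and values are illustrative assumptions
set.seed(7)
n_id <- 200  # participants
k <- 10      # observations per participant
id <- rep(seq_len(n_id), each = k)
person_mean <- rnorm(n_id, mean = 0, sd = 1)    # between-person SD = 1
y <- person_mean[id] + rnorm(n_id * k, sd = 1)  # within-person SD = 1
d <- data.frame(id = factor(id), y = y)
# True ICC = 1 / (1 + 1) = 0.5

# One-way ANOVA mean squares
a <- anova(lm(y ~ id, data = d))
msb <- a["id", "Mean Sq"]         # between-person mean square
msw <- a["Residuals", "Mean Sq"]  # within-person mean square

# Variance components for a balanced design
var_between <- (msb - msw) / k
icc <- var_between / (var_between + msw)
icc
```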
meanDeviations() Function
- Used to calculate between and within versions of a repeated measures variable
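The idea of between and within versions of a repeated measures variable can be shown with base R's ave(). This is a hand-rolled sketch rather than a call to the function named above, and the tiny dataset is made up for illustration.

```r
# Two participants, three observations each (made-up numbers)
d <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2),
  stress = c(2, 4, 6, 5, 5, 8)
)

# Between version: each participant's own mean, repeated on every row
d$stress_between <- ave(d$stress, d$id, FUN = mean)

# Within version: each observation's deviation from that participant's mean
d$stress_within <- d$stress - d$stress_between

d
```

The between version carries differences across participants, while the within version carries only fluctuations around each participant's own average and sums to zero within each participant.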
Linear Mixed Model Assumptions
- Assume individual units' deviations from the fixed effect follow a normal distribution with mean 0 and an estimated standard deviation
- Assume random effect intercept also follows a normal distribution
- Only one additional parameter is needed compared to regular linear regression
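A random intercept model can be sketched as follows. This example assumes the nlme package (a recommended package included with standard R installations); the notes do not say which package they use, and the simulated data, names, and true values are illustrative assumptions.

```r
library(nlme)  # assumption: nlme is used here; the notes do not name a package

# Simulated repeated measures: names and values are illustrative assumptions
set.seed(11)
n_id <- 100  # participants
k <- 8       # observations per participant
id <- rep(seq_len(n_id), each = k)
u <- rnorm(n_id, sd = 1.5)  # participant deviations from the fixed intercept
x <- rnorm(n_id * k)
y <- 3 + 0.5 * x + u[id] + rnorm(n_id * k, sd = 1)
d <- data.frame(id = factor(id), x = x, y = y)

# Fixed intercept and slope, plus a random intercept per participant
m_lmm <- lme(y ~ x, random = ~ 1 | id, data = d)

fixef(m_lmm)  # fixed effects: the same for everyone

# The one additional parameter: the SD of the random intercepts
vc <- VarCorr(m_lmm)
sd_int <- as.numeric(vc["(Intercept)", "StdDev"])
sd_res <- as.numeric(vc["Residual", "StdDev"])
c(sd_intercept = sd_int, sd_residual = sd_res)
```

Compared with lm(y ~ x), the only extra quantity estimated here is the standard deviation of the participant-specific intercepts.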
Description
Learn about data types in R, including logical, integer, and numeric, as well as data.table subsetting structures. Understand the different data types and how to work with them efficiently.