
Data Types and Subsetting in R

UsefulJoy

168 Questions

What is the primary purpose of the by argument in the data.table subsetting structure?

To group the data by a specific variable

What is the main difference between Integer and Numeric data types in R?

Integer is used for whole numbers, while Numeric is used for decimal numbers

What is the purpose of subsetting data in analyses?

To exclude outliers and select specific participants

What is the rule for data merges in R?

One join at a time and the x dataset is always on the left

What is the purpose of Factor data type in R?

To store discrete numeric data with a specific label

What is the return value of logical operators in R?

A logical value of TRUE or FALSE

What type of join results in a dataset with all rows in both x and y?

Full outer join

What is the primary advantage of using long data format?

It is easier to perform functions like filtering and aggregating

What is the purpose of logical operators in data management?

To identify outliers and values outside a specific range

What is the characteristic of wide data format?

Each individual entity occupies their own row, and each of their variables occupy a single column

What is the convention for treating Boolean values in arithmetic operations?

TRUE is treated as 1 and FALSE is treated as 0

What is the purpose of the Characters data type in R?

To store text data, such as names and qualitative data

What is the purpose of reshaping data?

To prepare data for repeated measures or longitudinal analysis

What is the consequence of using wide data format when there are missing values?

It results in null values in columns where no data is available

Which data type is used to store data that can be either TRUE or FALSE?

Logical

What operator is used for set notation in R?

%e%

What is the primary advantage of using rowMeans() to average a variable?

It does not return NA even if some of the data is missing

What is the consequence of adding items together to get a total score if a participant misses any single item?

The participant will be missing on the entire subscale

What is the purpose of multiplying the result of rowMeans() by the number of items that should have been completed?

To deal with missing data and get a total score

What is the recommended approach to scoring questionnaire scales if the scale is typically added up?

Use rowMeans() and multiply the result by the number of items

What is the advantage of using rowMeans() to calculate the total score when there are small amounts of missing data?

It imputes the mean for an individual for any missing items

What is the purpose of the rowMeans() function in R?

To calculate the mean of a row of data, excluding missing data if desired

What is the main purpose of adding 'psych::' when calling the alpha function in R?

To access a specific package's function

What is the primary advantage of using geom_point() over geom_bar()?

Geom_point() is used for scatterplots, while geom_bar() is used for barplots

What is the main goal of the 'data to ink ratio' concept in data visualization?

To achieve a balance between the amount of data shown and the amount of ink used

What is the primary purpose of using shapes on scatterplots?

To differentiate between categorical variables

What type of plot is used to compare the distribution of data between groups?

Violin plot

What is the main difference between histograms and density plots?

Histograms use bins to display the frequency, while density plots use a smooth density function

What is the purpose of a QQ plot?

To compare the distribution of two sets of quantiles

What is the purpose of using z-scores to identify extreme values?

To identify outliers in a dataset

What is the main advantage of using dot plots for small datasets?

They are effective at showing individual data points

What is the main difference between a QQ plot and a deviates plot?

A QQ plot is used to compare the distribution of two sets of quantiles, while a deviates plot is used to show the deviation from a normal distribution

What is the primary advantage of using the rowMeans() function to calculate the total score?

It allows for the exclusion of missing data if desired

What is the primary advantage of using rowMeans() to average a variable when there are small amounts of missing data?

It does not return NA even if some of the data is missing

What is the alternative to using the sum of all items to get a total score?

Using rowMeans() and multiplying the result by the number of items

What is the purpose of using rowMeans() to calculate the total score?

To deal with missing data and get a total score

What is the primary advantage of using the rowMeans() function over adding items together?

It allows for the exclusion of missing data if desired

What is the consequence of using rowMeans() to calculate the total score when there are small amounts of missing data?

The participant's score will be more accurate

What is the purpose of scoring questionnaire scales?

To summarize the data into a single value

What is the primary purpose of using the 'psych::' prefix when calling the alpha function in R?

To indicate the package where the alpha function is located

What is the key difference between a bivariate plot and a univariate plot?

The number of variables shown in the plot

What is the primary goal of the 'data to ink ratio' concept in data visualization?

To achieve a balance between data and ink in a plot

What type of plot is used to show the distribution of data between groups?

Violin plot

What is the primary purpose of using shapes on scatterplots?

To identify categorical variables quickly

What is the main difference between a histogram and a density plot?

The method used to show the distribution

What is the primary purpose of a QQ plot?

To check if a dataset follows a normal distribution

What is the primary advantage of using dot plots for small datasets?

They are more effective at showing raw data

What is the primary purpose of using z-scores to identify extreme values?

To identify outliers in a dataset

What is the main difference between a QQ plot and a deviates plot?

The orientation of the plot

What is the purpose of squaring the residuals in linear regression?

So that positive and negative residuals do not cancel out; the line of best fit minimizes the sum of squared residuals

What is the difference between simple and multiple linear regression?

Simple linear regression has only one predictor variable

What does yi represent in the equation for a straight line?

The outcome variable

What is the purpose of estimating b0 and b1 in linear regression?

To produce the line of best fit

What is the subscript i indicating in the equation for a straight line?

The individual observation (each person i has their own values of the outcome and predictor)

What is the purpose of leaving off the residual/error term in the equation for a straight line?

To show only the predicted values

How does multiple linear regression differ from simple linear regression?

Multiple linear regression can have any number of predictor variables

What is the purpose of the regression coefficients (b0 and b1) in linear regression?

To predict the outcome variable

What is the goal of linear regression?

To minimize the sum of squared residuals

What is the purpose of the intercept (b0) in linear regression?

To predict the outcome variable when the predictor is zero

What is the purpose of the link function in GLMs?

To transform the mean of the response variable onto the scale of the linear predictor

What is the assumption of equal variance/homoscedasticity in linear regression?

The variance of the residuals is constant across all values of the predictors

What is the purpose of a QQ plot in linear regression diagnostics?

To check for normality of the residuals

What is the consequence of violating the assumption of independence in linear regression?

The standard errors will be underestimated

What is the purpose of a scatterplot of predicted values against residuals in linear regression diagnostics?

To check for equal variance of the residuals

What is the purpose of the inverse link function in GLMs?

To transform the linear predictor back to the mean of the response variable

What is the assumption of normally-distributed errors in linear regression?

The residuals are normally distributed with a mean of 0

What is the purpose of the L.I.N.E. acronym in linear regression?

To remember the assumptions of linear regression

What is the purpose of a density plot of residuals in linear regression diagnostics?

To check for normality of the residuals

What is the assumption of independence in linear regression?

The residuals are independent and identically distributed

What is the interpretation of b0 in the multiple linear regression equation yi=b0+b1∗x1i+...+bk∗xki+εi?

The expected value of y when all predictors are zero

What is the primary purpose of the generalized linear model (GLM)?

To extend the linear model to different outcomes

What is the purpose of the residual standard error (σ) in the linear regression model?

To estimate the standard deviation of the residuals

What is the interpretation of the coefficient b1 in the multiple linear regression equation yi=b0+b1∗x1i+...+bk∗xki+εi?

The change in y when x1 is increased by one unit, holding all other predictors constant

What is the purpose of the F-test in the linear regression model?

To determine the overall fit of the model

What is the primary advantage of using the linear regression model over other types of regression models?

It provides a simple and interpretable model for continuous outcomes

What is the purpose of the t-value in the linear regression output?

To determine the significance of individual model parameters

What is the interpretation of the R-squared value in the linear regression model?

The proportion of variance in the outcome variable explained by the model

What is the purpose of the confint() function in R when working with linear regression models?

To estimate the confidence interval of the model parameters

What is the primary advantage of using the adjusted R-squared value over the regular R-squared value?

It adjusts for the number of predictors, so adding uninformative predictors does not inflate the apparent fit

What is the range of probabilities in logistic regression?

From 0 to 1

What is the assumption in logistic regression about the relationship between the outcome and the predictor variables?

Linear on the link scale

What is the purpose of checking for outliers/extreme values in logistic regression?

To identify unusual observations that may affect the model

What type of outcome variable is suitable for Poisson regression?

Count variable

What is a characteristic of the Poisson distribution?

It has one distribution parameter

What is the function used to perform logistic regression in R?

glm()

What does an odds ratio greater than 1 indicate in logistic regression?

A positive relationship between variables

Why is linear regression rarely used for count outcomes?

Because the data is often not normal and the residuals don't follow a normal distribution

What is a marginal effect in logistic regression?

The instantaneous effect of a predictor on the outcome at a particular point

What is an example of a research use case where Poisson regression may be appropriate?

Examining risk factors for the number of accidents someone gets into over a 12-month period

What is a disadvantage of using linear regression for count outcomes?

The straight line is often a bad fit, especially at extremes

What is the purpose of checking for separation in logistic regression?

To identify when a predictor perfectly predicts the outcome

What is the main difference between Poisson regression and linear regression?

Poisson regression is used for count variables, while linear regression is used for continuous variables

Why is a large sample size required for logistic regression?

Because the maximum likelihood estimates of the parameters are only approximately normally distributed in large samples

What is a characteristic of count variables?

They are discrete and must be whole numbers

What is the purpose of the link function in Poisson regression?

To transform the linear predicted value to ensure it never goes below 0

What is the assumption about the distribution of the outcome variable in Poisson regression?

Poisson distribution

What is the interpretation of the Incident Rate Ratio (IRR) in Poisson regression?

The rate at which the outcome variable changes for a one-unit change in the predictor

What is the purpose of exponentiating the coefficients in Poisson regression?

To take the coefficient out of the log scale

What is the main difference between linear regression and binary logistic regression?

The type of outcome variable

What is the purpose of the link function in logistic regression?

To transform the linear predicted value to ensure it never goes below 0 and never goes above 1

What is the distribution assumed for the outcome variable in binary logistic regression?

Bernoulli distribution

What is the link function used in logistic regression?

Logit function

What is the purpose of using the glm() function in R for Poisson regression?

To perform Poisson regression

What is the advantage of using Poisson regression over linear regression for count data?

It assumes a Poisson distribution

What type of plot is used to show the distribution of stress when negative affect is missing or not?

Margin plot

What is the purpose of identifying patterns of missing data?

All of the above

What do the blue dots in the margin plot represent?

Observed data

What is shown in the margin plot along with the scatter plot?

Two boxplots on each axis

What does the red boxplot in the margin plot represent?

The distribution of stress when negative affect is missing

What can be seen from the distribution of stress when negative affect is missing or not?

The distribution of stress is very different when negative affect is missing or not

What happens when data are missing not at random (MNAR)?

You cannot recover unbiased estimates.

What is multiple imputation?

A robust way to address missing data by generating multiple datasets.

What is the formula to determine total uncertainty in average estimate in multiple imputation?

T = V̄ + B + B/m, where V̄ is the average within-imputation variance, B is the between-imputation variance, and m is the number of imputations

What is the purpose of examining missing data before imputation?

To examine the patterns of missing data.

What is the VIM package in R used for?

To explore and visualize missing data.

What is the consequence of data being MNAR?

Unbiased estimates cannot be recovered, because the information needed to correct for the missingness is itself missing

What is the purpose of generating multiple imputed datasets?

To get multiple, different datasets with plausible values for missing data.

What is the benefit of multiple imputation?

It provides a robust way to address missing data.

What is the result of performing the analysis of interest on each imputed dataset?

Multiple, different Q̂ estimates.

What is the main consequence of missing data?

Loss of efficiency and biased results

What happens when data are missing completely at random (MCAR)?

Unbiased estimates of the true parameter(s) with list-wise deletion

What is the classification of missing data when the missingness mechanism is conditionally independent of the estimate of our parameter(s) of interest?

Missing at random (MAR)

What can be done to recover unbiased estimates when data are Missing at Random (MAR)?

Condition on the correct variables

What is the consequence of using complete cases only when data are Missing at Random (MAR)?

Biased estimates of the true parameter(s)

Why is list-wise deletion often inefficient?

It discards data on other variables

What is the main difference between Missing at Random (MAR) and Missing Completely at Random (MCAR)?

MAR is conditionally independent of the estimate of our parameter(s) of interest

What is the main issue with list-wise deletion?

It is inefficient

What is the condition for list-wise deletion to yield unbiased estimates?

Data are missing completely at random (MCAR)

What is the main difference between Missing Completely at Random (MCAR) and Missing at Random (MAR)?

MCAR is independent of the estimate of interest, while MAR is conditionally independent

What is the main advantage of using a conditional approach when data are Missing at Random (MAR)?

It can recover unbiased estimates if the right variables are present

What is the main issue with complete cases only?

They are biased

What is the main difference between Missing Not at Random (MNAR) and the other types of missingness?

MNAR is associated with the estimate of interest, while MAR and MCAR are not

What is the purpose of a margin plot in data analysis?

To identify patterns of missing data

What do the blue dots in a margin plot represent?

Observed data

What can be inferred from the boxplots on the x-axis of a margin plot?

The distribution of stress is very different when negative affect is missing or not

Why is there only one boxplot on the y-axis of a margin plot?

Because there are no missing values for stress

What is the purpose of examining the distribution of stress when negative affect is missing or not?

To understand the relationship between stress and negative affect

What can be inferred from the presence of red dots in a margin plot?

There are missing values in the data

What is the benefit of using margin plots in data analysis?

To identify patterns of missing data and understand the distribution of data

What is the primary reason why we cannot recover unbiased estimates when data are missing not at random?

Because the data we need to recover unbiased estimates are themselves missing

What is the main difference between mean positive affect being MNAR and MAR?

MNAR is when the data are missing due to the parameter itself, while MAR is when the data are missing due to another parameter

What is the purpose of multiple imputation in addressing missing data?

To generate multiple, different datasets where in each one, different plausible values are imputed for the missing data

What is the formula to determine the total uncertainty in the average estimate in multiple imputation?

T = V̄ + B + B/m, where V̄ is the average within-imputation variance, B is the between-imputation variance, and m is the number of imputations

What is the issue with using multiple imputation with small sample sizes?

It may not be feasible to generate multiple imputed datasets

What is the purpose of examining the missing data before doing any imputation?

To examine the patterns and amount of missing data on each variable

What is the benefit of using the VIM package in R for examining missing data?

It provides a visual representation of the missing data

What is the consequence of assuming MAR when the data are actually MNAR?

It may result in biased estimates

What is the purpose of generating multiple imputed datasets in multiple imputation?

To generate multiple, different datasets where in each one, different plausible values are imputed for the missing data

What is the advantage of using multiple imputation over single imputation?

It provides a range of possible estimates of the missing data

What is a key characteristic of linear mixed models that allows them to handle repeated measures data?

They can handle continuous time and missing data

What is a key difference between fixed effects and random effects in linear mixed models?

Fixed effects have the same slope and intercept for everyone, while random effects have different ones

When would you use repeated measures ANOVA instead of linear mixed models?

When everyone has the same number of time points and the outcome is continuous

What is an advantage of using linear mixed models over repeated measures ANOVA?

Linear mixed models require fewer assumptions about the data

How do fixed effects approximate the distribution of the data?

𝑀 = estimated mean; 𝑆𝐷 = 0

What is a key assumption of linear regression that linear mixed models can relax?

Independence of observations

What is a benefit of using linear mixed models for clustered data?

They can handle missing data at certain time points

How do random effects approximate the distribution of the data?

𝑀 = estimated mean; 𝑆𝐷 = estimated standard deviation

What is the main difference between fixed effects and random effects in regression analysis?

Fixed effects assume a single coefficient for all individuals, while random effects assume individual differences in regression coefficients

What does an intraclass correlation coefficient (ICC) of 0.5 indicate?

50% of the total variance occurs between people, and 50% of the total variance is within person

What is the purpose of the meandeviations() function in linear mixed models?

To calculate the between and within versions of a repeated measures variable

What is a key assumption of linear mixed models (LMMs)?

That individual units' deviations from the fixed effect follow a normal distribution with mean 0 and standard deviation equal to the standard deviation of the deviations

What is the primary advantage of using linear mixed models over traditional linear regression?

LMMs can model individual differences in regression coefficients, despite needing to estimate only one additional parameter

What is the purpose of the ICC in linear mixed models?

To determine the proportion of variance explained by individual differences

What is a key feature of linear mixed models that allows them to relax the assumption of independence in traditional linear regression?

The ability to model individual differences in regression coefficients, despite needing to estimate only one additional parameter

What is the interpretation of an ICC of 0.25?

25% of the total variance occurs between people, and 75% of the total variance is within person

Study Notes

Data Table Subsetting

  • Data table subsetting structure: DT[i, j, by], where DT is the data table, i is the rows, j is the columns, and by is the grouping variable
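As a sketch of this structure (assuming the data.table package is installed; the variable names are illustrative):

```r
# A minimal illustration of DT[i, j, by] (requires the data.table package)
library(data.table)

DT <- data.table(
  id    = c(1, 1, 2, 2, 3),
  group = c("a", "a", "b", "b", "a"),
  score = c(10, 12, 20, 22, 14)
)

# i selects rows, j computes on columns, by groups the result
DT[score > 10, .(mean_score = mean(score)), by = group]
# one row per group, averaging only the scores that passed the i filter
```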

Data Types in R

  • Logical: used for logical data (TRUE or FALSE)
  • Integer: used for integer type data (whole numbers like 0, 1, 2)
  • Numeric: used for real numbers (1.1, 4.8) and can be used for integer data (less efficient)
  • Factor: special representation of numeric data when data are fundamentally discrete (e.g., study condition coded as 0 = control, 1 = medication, 2 = psychotherapy)
  • Characters: used for text type data (names, qualitative data, etc.) and can store numbers as strings
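Each of these types can be created and checked directly in base R (values here are illustrative):

```r
# One example of each R data type; class() confirms how R stores it
x_logical   <- c(TRUE, FALSE, TRUE)
x_integer   <- c(0L, 1L, 2L)          # the L suffix forces integer storage
x_numeric   <- c(1.1, 4.8)
x_factor    <- factor(c(0, 1, 2),
                      labels = c("control", "medication", "psychotherapy"))
x_character <- c("Alice", "Bob", "7") # a number stored as a string stays text

class(x_logical)    # "logical"
class(x_integer)    # "integer"
class(x_numeric)    # "numeric"
class(x_factor)     # "factor"
levels(x_factor)    # the labels behind the discrete codes
```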

Operators

  • Logical operators: used to manage data (e.g., find outliers, values greater/less than a score)
  • Boolean values: TRUE or FALSE, where TRUE is treated as 1 and FALSE is treated as 0 in arithmetic
  • Operators:
    • >= or %ge%: Greater than or equal
    • %gl%: Greater than AND less than
    • %gel%: Greater than or equal AND less than
    • %gle%: Greater than AND less than or equal
    • %gele%: Greater than or equal AND less than or equal
    • %in%: In
    • %!in% or %nin%: Not in
    • %c%: Chain operations on the RHS together
    • %e%: Set operator, to use set notation
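The %gl%-style range operators above come from an add-on package (the extraoperators package is assumed here); the base-R pieces can be illustrated directly:

```r
# Base-R logical operators and Boolean arithmetic; the %gl%/%gele%-style
# range operators are assumed to come from the extraoperators package
x <- c(1, 5, 10, 15, 20)

x >= 10          # elementwise logical test: FALSE FALSE TRUE TRUE TRUE
x %in% c(5, 15)  # set membership: FALSE TRUE FALSE TRUE FALSE
sum(x >= 10)     # TRUE counts as 1, FALSE as 0, so this counts matches
mean(x >= 10)    # and this gives the proportion meeting the condition
```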

Subsetting Data

  • Subsetting: excluding outliers, selecting participants who meet certain criteria
  • Order of subsetting matters

Merging Data

  • Rules: one join at a time, x dataset is always on the left, y dataset is always on the right
  • Types of joins:
    • Natural join: resulting data has only rows present in both x and y (all = FALSE)
    • Full outer join: resulting data has all rows in x and all rows in y (all = TRUE)
    • Left outer join: resulting data has all rows in x (all.x = TRUE)
    • Right outer join: resulting data has all rows in y (all.y = TRUE)
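Base R's merge() follows exactly these rules, with the x dataset on the left (the example data are illustrative):

```r
# The four join types with base-R merge(); x is always the left dataset
x <- data.frame(id = c(1, 2, 3), stress = c(4, 6, 5))
y <- data.frame(id = c(2, 3, 4), mood   = c(7, 8, 9))

merge(x, y, by = "id")               # natural (inner) join: ids 2 and 3 only
merge(x, y, by = "id", all = TRUE)   # full outer join: ids 1-4, NAs filled in
merge(x, y, by = "id", all.x = TRUE) # left outer join: all rows of x
merge(x, y, by = "id", all.y = TRUE) # right outer join: all rows of y
```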

Reshaping Data

  • Necessary for repeated measures/longitudinal/panel data
  • Types of data structures:
    • Wide: each measure has a separately-named variable for each time point it was measured
      • Each entity occupies their own row, and each variable occupies a single column
      • Easy to read and interpret, used in descriptive statistics and reporting
    • Long: time point (or wave) is a variable, IDs will have multiple rows
      • Machine-friendly data structure, easier to perform functions like filtering and aggregating
      • Easier to add new data and avoids the problem of null values
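A wide-to-long conversion can be sketched with base R's reshape() (melt() and dcast() from data.table do the same job; the data here are illustrative):

```r
# Wide data: one row per id, one column per time point
wide <- data.frame(
  id      = 1:3,
  stress1 = c(4, 5, 6),   # stress measured at time 1
  stress2 = c(3, 7, 2)    # stress measured at time 2
)

# Long data: time becomes a variable, so each id gets multiple rows
long <- reshape(wide,
                varying   = c("stress1", "stress2"),
                v.names   = "stress",
                timevar   = "time",
                idvar     = "id",
                direction = "long")
long  # 6 rows: one per id per time point
```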

Scoring Questionnaire Scales

  • There are two ways to score questionnaire scales: adding together to get a sum total score, or taking an average of all items
  • Using rowMeans() allows you to perform calculations excluding missing data, which is commonly done and sensible when dealing with small amounts of missing data
  • If you want to deal with missing data but need a total score, you can use rowMeans() and multiply the results by the number of items that should have been completed
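The difference between the two approaches shows up as soon as one item is skipped (a small illustrative example):

```r
# Scoring a 4-item scale when one participant skipped an item
items <- data.frame(
  item1 = c(3, 4, 2),
  item2 = c(4, NA, 3),  # participant 2 skipped item2
  item3 = c(2, 5, 4),
  item4 = c(5, 4, 3)
)

rowSums(items)                     # participant 2's total becomes NA
rowMeans(items, na.rm = TRUE)      # mean of the items actually answered
rowMeans(items, na.rm = TRUE) * 4  # total score, with the person's own mean
                                   # effectively imputed for the missing item
```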

Cronbach's Alpha

  • psych::alpha() is the function to find Cronbach's alpha, a common measure of scale reliability
  • psych:: is used to specify the alpha function from the psych package, as alpha is a popular function name
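psych::alpha(items) reports this directly; for intuition, the underlying formula can be sketched in base R (the item data are illustrative):

```r
# Cronbach's alpha from its formula:
# alpha = k/(k - 1) * (1 - sum of item variances / variance of the total score)
items <- data.frame(
  item1 = c(3, 4, 2, 5, 4),
  item2 = c(4, 4, 3, 5, 3),
  item3 = c(2, 5, 4, 4, 4)
)

k     <- ncol(items)
alpha <- (k / (k - 1)) * (1 - sum(apply(items, 2, var)) / var(rowSums(items)))
alpha  # psych::alpha(items) returns this as "raw_alpha"
```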

Bivariate Plots

  • A bivariate plot shows the relationship between two variables, mapped onto the x- and y-axes
  • geom_point() is used to make scatterplots, and additional arguments can be added using + (e.g., geom_line())
  • geom_bar() is used for barplots
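A minimal bivariate plot, assuming the ggplot2 package and using the built-in mtcars data for illustration:

```r
# Scatterplot with a fitted line layered on via + (requires ggplot2)
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                          # scatterplot of the raw data
  geom_smooth(method = "lm", se = FALSE)  # add a fitted line as a new layer
p  # printing the object draws the plot
```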

Best Practices in Data Visualization

  • Data to ink ratio: aim for more data and less ink
  • Themes are helpful in achieving this goal
  • Axes can be useful for providing more data, such as labeling with quantiles
  • Shapes on scatterplots help identify categorical variables

Types of Plots

Violin Plots

  • Used to compare the distribution of data between groups
  • Thicker regions have more points, narrow regions have fewer data points
  • Show the range/spread of each variable and mean and confidence interval summaries

Histograms

  • Define equal width bins on the x-axis and count how many observations fall within each bin
  • Bars display these, where the width of the bar is the width of the bin and the height is the count (frequency) of observations
  • Show a univariate distribution

Density Plots

  • Show the distribution using a smooth density function rather than binning data
  • Height indicates the relative frequency of observations at a particular value
  • Designed so that the total area under the curve integrates to one
  • Show a univariate distribution

Dot Plots

  • Effective at showing raw data for small datasets
  • Each dot represents one person
  • Dots are stacked on top of each other if they would overlap
  • Provide greater precision than histograms
  • Show a univariate distribution

QQ Plots

  • A scatterplot created by plotting two sets of quantiles against one another
  • If both sets of quantiles came from the same distribution, the points should form a line that's roughly straight
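Base R draws this comparison against a normal distribution directly (simulated data for illustration):

```r
# QQ plot of sample quantiles against theoretical normal quantiles
set.seed(1)
x <- rnorm(100, mean = 50, sd = 10)

qqnorm(x)  # points close to a straight line suggest normality
qqline(x)  # reference line through the first and third quartiles

# the quantile pairs can also be computed without drawing
q <- qqnorm(x, plot.it = FALSE)
```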

Simple vs. Multiple Linear Regression

  • Simple linear regression: equation yi = b0 + b1 * xi + εi, where yi is the outcome variable, xi is the predictor/explanatory variable, εi is the residual/error term, b0 is the intercept, and b1 is the slope of the line.
  • In simple linear regression, the model parameters (b0 and b1) are the same for all participants, but each person has their own values of yi and xi, and there will be some unexplained residual (εi).
  • If we want to talk about only what is predicted based on the regression coefficients, we can write yi = b0 + b1 * xi, leaving off the residual error term (εi).

Multiple Linear Regression

  • Multiple linear regression works in principle basically the same way as simple linear regression, but allows for more than one predictor (explanatory) variable in a single model.
  • The equation for multiple linear regression is yi = b0 + b1 * x1i + ... + bk * xki + εi, where yi is the outcome variable, x1i, x2i, ..., xki are the predictor variables, and εi is the residual/error term.
  • The regression coefficients (b0, b1, ..., bk) are interpreted fairly similarly to those in simple linear regression, but with some extra requirements.
  • b0 is the intercept, the expected (model predicted) value of yi when all predictors are 0.
  • b1, b2, ..., bk are the slopes of the line, capturing how much yi is expected to change for a one unit change in x1, x2, ..., xk, respectively, holding all other predictors constant.

Line of Best Fit and Residuals

  • The line of best fit is the regression line that goes through the data points, minimizing all the residuals.
  • The residuals are the differences between the model predicted values and the observed values.

Generalized Linear Models (GLMs)

  • GLMs extend the linear model to different outcomes, such as continuous, normally distributed variables (linear regression), binary 0/1 variables (logistic regression, probit regression), and count variables (Poisson regression, negative binomial regression).
  • GLMs keep the model linear on a transformed scale, using a link function to connect the linear predictor, eta (η), to the mean of the outcome.
  • In linear regression, the link is the identity function, because the outcome is already on the linear scale.
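As a sketch of a GLM in practice, a logistic regression fit with glm(), where the logit link keeps predicted probabilities between 0 and 1 (the data are simulated for illustration):

```r
# Logistic regression via glm(): family = binomial with a logit link
set.seed(42)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * x))  # simulated outcome

fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)                             # coefficients on the log-odds scale
exp(coef(fit))                        # odds ratios; > 1 means a positive relationship
predict(fit, type = "response")[1:3]  # predicted probabilities, all in (0, 1)
```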

Probability Distribution

  • A probability distribution is a function where, as we move along the x-axis, the y-axis tells us the probability (for discrete variables) or probability density (for continuous variables) of that value occurring.

Normal Distribution (Gaussian Distribution)

  • Parameters: mean and standard deviation.

R Output from a Linear Model

  • The lm() function in R is used to fit a linear model, and it uses a formula interface to specify the desired model, with the format outcome ~ predictor.
  • The summary() function provides a quick summary of the model, including the regression coefficients, standard errors, t-values, and p-values.
  • The confint() function provides confidence intervals for the regression coefficients.
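
  A minimal sketch of this workflow (the data frame and variable names here are hypothetical, purely for illustration):

```r
# simulate a small illustrative dataset
set.seed(1)
d <- data.frame(stress = rnorm(100))
d$mood <- 2 + 0.5 * d$stress + rnorm(100)

# fit the model with the formula interface: outcome ~ predictor
m <- lm(mood ~ stress, data = d)

summary(m)  # coefficients, standard errors, t-values, p-values
confint(m)  # 95% confidence intervals for the coefficients
```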

Linear Regression Assumptions

  • L.I.N.E. = there is a linear relationship, variables and errors are independent, errors are normally distributed, and there needs to be equal variance.
  • Independent observations: each value of the outcome should come from a different person.
  • Errors: for any pair of observations, the error terms should be uncorrelated.
  • Normally-distributed errors: the errors (i.e., the residuals) should be random and normally distributed with a mean of 0.
  • Equal variance/homoscedasticity: for each value of the predictors, the variance of the error term should be constant.

Model Diagnostics

  • To assess normally-distributed errors, look at the density plot of residuals (black line) vs a normal distribution (dotted blue line).
  • To identify outliers, look at the QQ plot of residuals (points falling far from the reference line are potential outliers).
  • To check for equal variance/homoscedasticity, look at a scatterplot of the model predicted values against the residuals - basically, we want the spread of the residuals to be roughly constant across the predicted values (i.e., blue dotted lines to be horizontal and parallel).
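
  These checks can be sketched with base R graphics, assuming a model `m` already fitted with lm() (simple stand-ins for whatever diagnostic plots your course materials use):

```r
res <- residuals(m)

# normality: density of residuals vs. a matching normal curve
plot(density(res), main = "Residual density")
curve(dnorm(x, mean = 0, sd = sd(res)), add = TRUE, lty = 2, col = "blue")

# outliers: QQ plot; points far from the line are suspect
qqnorm(res)
qqline(res)

# homoscedasticity: residual spread should be constant across fitted values
plot(fitted(m), res, xlab = "Predicted values", ylab = "Residuals")
abline(h = 0, lty = 2, col = "blue")
```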

Poisson Regression

  • Used for count outcomes, which are discrete whole numbers (0 or positive integers)
  • Poisson distribution is used when counts are relatively rare
  • Examples of research use cases:
    • Examining risk factors for accidents over a 12-month period
    • Analyzing the number of children people have
    • Predicting how many friends people have
    • Evaluating whether an intervention reduced medication non-adherence
    • Testing whether treating mental health can lower healthcare appointments
  • Poisson regression:
    • Does not assume normal distribution
    • Has one parameter, lambda (mean and variance)
    • Linear regression rarely works well for count outcomes due to:
      • Straight line being a bad fit at extremes
      • Non-normal distribution of residuals
  • Generalized linear models handle count outcomes (Poisson regression) using:
    • Link functions to transform linear predicted values to never go below 0
    • Assuming a Poisson distribution
    • Defining the link function as: η = g(λ) = ln(λ)
  • Assumptions of Poisson regression:
    • Poisson distribution, outcome is counts, positive integers
    • The mean and variance are assumed to be equal (both λ)
    • Linear relationship on the link scale (ln)
    • No need to worry about normally distributed errors or equal variance/homoscedasticity
    • Watch for right-side outliers (extremely high counts)
    • Importance of large sample size
  • How to do Poisson regression in R:
    • Use the glm() function with the argument 'family = poisson'
  • Incident Rate Ratios (IRRs):
    • Interpret as: "for each one unit higher predictor score, there are IRR times as many events of the outcome"
    • Example: IRR = 2, base rate = 1, one unit higher would be 1*2 = 2
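
  A minimal sketch in R (the data and variable names are hypothetical, for illustration only):

```r
# simulate an illustrative count outcome
set.seed(1)
d <- data.frame(risk = rnorm(200))
d$accidents <- rpois(200, lambda = exp(0.2 + 0.5 * d$risk))

# Poisson regression via glm() with family = poisson
m <- glm(accidents ~ risk, data = d, family = poisson)

summary(m)       # coefficients on the log (link) scale
exp(coef(m))     # exponentiate to obtain incident rate ratios (IRRs)
exp(confint(m))  # confidence intervals on the IRR scale
```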

Binary Logistic Regression

  • Used for binary outcomes, where the outcome only takes on two values: 0 or 1
  • Examples of research use cases:
    • Predicting whether someone will have major depression or not
    • Determining the probability of patients remitting from major depression
    • Predicting the probability of readmission to the hospital within 30 days
    • Predicting the probability of death before age 60
  • Binary logistic regression:
    • Linear regression will not work for binary outcomes due to:
      • Straight line being a bad fit
      • Non-normal distribution of residuals
  • Generalized linear models handle binary outcomes (logistic regression) using:
    • Link functions to transform linear predicted values to never go below 0 and never go above 1
    • Assuming a Bernoulli distribution
    • Defining the link function as: η = g(μ) = ln(μ / (1 − μ))
  • Assumptions of logistic regression:
    • Bernoulli distribution, outcome is probability of event occurring
    • Linear relationship on the link scale (ln)
    • Independent variables, independent errors
    • Identify outliers/extreme values on the predictors
    • Check for separation (predictor variable perfectly predicts the outcome)
    • Importance of large sample size
  • How to do logistic regression in R:
    • Use the glm() function with the argument 'family = binomial'
  • Odds ratio:
    • Indicates how many times larger the odds of the outcome are for a one unit change in the predictor
    • Higher than 1 means a positive relationship, less than 1 means a negative relationship
  • Marginal effect:
    • Instantaneous effect of change at a particular point
    • Equivalent to the slope of a straight line at that value
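
  A minimal sketch in R (the data and variable names are hypothetical, for illustration only):

```r
# simulate an illustrative binary outcome
set.seed(1)
d <- data.frame(severity = rnorm(200))
d$readmit <- rbinom(200, size = 1, prob = plogis(-1 + 0.8 * d$severity))

# logistic regression via glm() with family = binomial
m <- glm(readmit ~ severity, data = d, family = binomial)

summary(m)    # coefficients on the log-odds (logit) scale
exp(coef(m))  # exponentiate to obtain odds ratios
```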

Missing Data

  • Missing data are common and problematic, leading to biased results and efficiency loss.

Types of Missing Data

  • Missing completely at random (MCAR): when the missingness mechanism is completely independent of the estimate of our parameter(s) of interest.
  • Missing at random (MAR): when the missingness mechanism is conditionally independent of the estimate of our parameter(s) of interest.
  • Missing not at random (MNAR): when the missingness mechanism is associated with the estimate of our parameter(s) of interest.

Consequences of Missing Data

  • Listwise deletion may lead to biased results unless the data are missing completely at random (MCAR).
  • When data are missing completely at random (MCAR), listwise deletion will yield unbiased estimates of the true parameter(s) if the data had not been missing.
  • When data are missing at random (MAR), it is possible to recover unbiased estimates if the right other variables are present.

Multiple Imputation (MI)

  • Multiple imputation is a robust way to address missing data, involving generating multiple, different datasets with plausible values imputed for the missing data.
  • Steps in MI:
  • Start with the incomplete data.
  • Generate 𝑚 datasets with no missingness, by filling in different plausible values for any missing data.
  • Perform the analysis of interest on each imputed dataset.
  • Pool the results from the analyses run on each imputed dataset to generate an overall estimate, 𝑄¯.
  • Formula for the total uncertainty in the average estimate: T = V̄ + B + B/m, where V̄ is the average uncertainty (within-imputation variance) of Q̂ across the imputed datasets, B is the between-imputation variance of the estimates Q̂, and m is the number of imputed datasets.
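
  The MI steps above can be sketched with the mice package (assuming it is installed; the data frame `d` and the model formula are hypothetical):

```r
library(mice)

# steps 1-2: generate m = 5 complete datasets with plausible imputed values
imp <- mice(d, m = 5, seed = 1, printFlag = FALSE)

# step 3: run the analysis of interest on each imputed dataset
fits <- with(imp, lm(mood ~ stress))

# step 4: pool the m sets of results into one overall estimate
summary(pool(fits))
```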

Issues with Using Imputed Datasets

  • Issues with using imputed datasets with general linear models:
  • Small sample sizes (e.g., 100 or fewer).
  • Collinear variables.
  • Lots of interactions between variables.
  • Non-normal residuals.

Examining Missing Data

  • Before doing any imputation, it is a good idea to examine the data using the VIM package in R.
  • The aggr() function shows the proportion of missing data on each individual variable and the patterns of missing data.
  • Margin plots can help identify if imputations fall outside the range of observed data or fit with the rest of the trend from the observed data.
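
  Assuming the VIM package is installed and a data frame `d` with some missingness (variable names hypothetical), these checks look like:

```r
library(VIM)

# proportion of missing data per variable, plus missingness patterns
aggr(d, numbers = TRUE, prop = TRUE)

# margin plot for a pair of variables: do values missing on one variable
# look consistent with the observed trend on the other?
marginplot(d[, c("stress", "mood")])
```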


Independence of Observations

  • Observations are not always independent, e.g., in longitudinal studies, repeated measures experiments, and clustered data
  • This type of data poses challenges to statistical analysis, but can be addressed using linear mixed models

Linear Mixed Models

  • Relax the assumption of independence of observations in linear regression
  • Allow for variation in observations, including continuous time, missing data, and continuous predictors

Fixed Effects vs Random Effects

  • Fixed effects: assume same slope and intercept for all participants, only applicable for one observation per participant
  • Random effects: allow for different coefficients (slopes and intercepts) per participant, applicable for repeated measures

Fixed Effects Approximation

  • Approximate distribution by: 𝑀 = estimated mean; 𝑆𝐷 = 0 (standard deviation is fixed at 0)

Random Effects Approximation

  • Approximate distribution by: 𝑀 = estimated mean; 𝑆𝐷 = estimated standard deviation (SD is free to vary)

Main Difference Between Fixed and Random Effects

  • Fixed effects: regression coefficients are the same for everyone
  • Random effects: regression coefficients vary randomly for each participant

Intraclass Correlation Coefficient (ICC)

  • Measures the ratio of between variance to total variance (ranges between 0 and 1)
  • ICC > 0 indicates individual means differ, and individual differences need to be accounted for in analysis
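
  One way to sketch this computation, assuming a simple random-intercept model `m` already fitted with lme4::lmer():

```r
# extract variance components from the fitted model; in a simple
# random-intercept model the first row is the between-person
# (intercept) variance and the last row is the residual (within) variance
vc <- as.data.frame(VarCorr(m))

# ICC = between variance / (between + within) variance
icc <- vc$vcov[1] / sum(vc$vcov)
icc
```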

meanDeviations() Function

  • Used to calculate between and within versions of a repeated measures variable

Linear Mixed Model Assumptions

  • Assume individual units' deviations from the fixed effect follow a normal distribution with mean 0 and an estimated standard deviation
  • Assume random effect intercept also follows a normal distribution
  • Only one additional parameter is needed compared to regular linear regression
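
  A minimal random-intercept model can be sketched with the lme4 package (assuming long-format data `d` with repeated measures of `mood` nested within `id`; the names are hypothetical):

```r
library(lme4)

# (1 | id): a random intercept for each participant, assumed to be
# normally distributed around the fixed intercept
m <- lmer(mood ~ stress + (1 | id), data = d)

summary(m)  # fixed effects plus the estimated random-intercept SD
```

  Relative to lm(), the only additional parameter estimated here is the standard deviation of the random intercepts.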
