
Data Types and Subsetting in R

UsefulJoy

168 Questions

What is the primary purpose of the by argument in the data.table subsetting structure?

To group the data by a specific variable

What is the main difference between Integer and Numeric data types in R?

Integer is used for whole numbers, while Numeric is used for decimal numbers

What is the purpose of subsetting data in analyses?

To exclude outliers and select specific participants

What is the rule for data merges in R?

One join at a time and the x dataset is always on the left

What is the purpose of Factor data type in R?

To store discrete numeric data with a specific label

What is the return value of logical operators in R?

A logical value of TRUE or FALSE

What type of join results in a dataset with all rows in both x and y?

Full outer join

What is the primary advantage of using long data format?

It is easier to perform functions like filtering and aggregating

What is the purpose of logical operators in data management?

To identify outliers and values outside a specific range

What is the characteristic of wide data format?

Each individual entity occupies their own row, and each of their variables occupy a single column

What is the convention for treating Boolean values in arithmetic operations?

TRUE is treated as 1 and FALSE is treated as 0

What is the purpose of the Characters data type in R?

To store text data, such as names and qualitative data

What is the purpose of reshaping data?

To prepare data for repeated measures or longitudinal analysis

What is the consequence of using wide data format when there are missing values?

It results in null values in columns where no data is available

Which data type is used to store data that can be either TRUE or FALSE?

Logical

What operator is used for set notation in R?

%e%

What is the primary advantage of using rowMeans() to average a variable?

It does not return NA even if some of the data is missing

What is the consequence of adding items together to get a total score if a participant misses any single item?

The participant will be missing on the entire subscale

What is the purpose of multiplying the result of rowMeans() by the number of items that should have been completed?

To deal with missing data and get a total score

What is the recommended approach to scoring questionnaire scales if the scale is typically added up?

Use rowMeans() and multiply the result by the number of items

What is the advantage of using rowMeans() to calculate the total score when there are small amounts of missing data?

It imputes the mean for an individual for any missing items

What is the purpose of the rowMeans() function in R?

To calculate the mean of a row of data, excluding missing data if desired

What is the main purpose of adding 'psych::' when calling the alpha function in R?

To access a specific package's function

What is the primary advantage of using geom_point() over geom_bar()?

Geom_point() is used for scatterplots, while geom_bar() is used for barplots

What is the main goal of the 'data to ink ratio' concept in data visualization?

To achieve a balance between the amount of data shown and the amount of ink used

What is the primary purpose of using shapes on scatterplots?

To differentiate between categorical variables

What type of plot is used to compare the distribution of data between groups?

Violin plot

What is the main difference between histograms and density plots?

Histograms use bins to display the frequency, while density plots use a smooth density function

What is the purpose of a QQ plot?

To compare the distribution of two sets of quantiles

What is the purpose of using z-scores to identify extreme values?

To identify outliers in a dataset

What is the main advantage of using dot plots for small datasets?

They are effective at showing individual data points

What is the main difference between a QQ plot and a deviates plot?

A QQ plot is used to compare the distribution of two sets of quantiles, while a deviates plot is used to show the deviation from a normal distribution

What is the primary advantage of using the rowMeans() function to calculate the total score?

It allows for the exclusion of missing data if desired

What is the primary advantage of using rowMeans() to average a variable when there are small amounts of missing data?

It does not return NA even if some of the data is missing

What is the alternative to using the sum of all items to get a total score?

Using rowMeans() and multiplying the result by the number of items

What is the purpose of using rowMeans() to calculate the total score?

To deal with missing data and get a total score

What is the primary advantage of using the rowMeans() function over adding items together?

It allows for the exclusion of missing data if desired

What is the consequence of using rowMeans() to calculate the total score when there are small amounts of missing data?

The participant's score will be more accurate

What is the purpose of scoring questionnaire scales?

To summarize the data into a single value

What is the primary purpose of using the 'psych::' prefix when calling the alpha function in R?

To indicate the package where the alpha function is located

What is the key difference between a bivariate plot and a univariate plot?

The number of variables shown in the plot

What is the primary goal of the 'data to ink ratio' concept in data visualization?

To achieve a balance between data and ink in a plot

What type of plot is used to show the distribution of data between groups?

Violin plot

What is the primary purpose of using shapes on scatterplots?

To identify categorical variables quickly

What is the main difference between a histogram and a density plot?

The method used to show the distribution

What is the primary purpose of a QQ plot?

To check if a dataset follows a normal distribution

What is the primary advantage of using dot plots for small datasets?

They are more effective at showing raw data

What is the primary purpose of using z-scores to identify extreme values?

To identify outliers in a dataset

What is the main difference between a QQ plot and a deviates plot?

The orientation of the plot

What is the purpose of squaring the residuals in linear regression?

So that positive and negative residuals do not cancel out; the line of best fit minimizes the sum of squared residuals

What is the difference between simple and multiple linear regression?

Simple linear regression has only one predictor variable

What does yi represent in the equation for a straight line?

The outcome variable

What is the purpose of estimating b0 and b1 in linear regression?

To produce the line of best fit

What is the subscript i indicating in the equation for a straight line?

The individual observation (each person i has their own values of the outcome and predictor)

What is the purpose of leaving off the residual/error term in the equation for a straight line?

To show only the predicted values

How does multiple linear regression differ from simple linear regression?

Multiple linear regression can have any number of predictor variables

What is the purpose of the regression coefficients (b0 and b1) in linear regression?

To predict the outcome variable

What is the goal of linear regression?

To minimize the sum of squared residuals

What is the purpose of the intercept (b0) in linear regression?

To predict the outcome variable when the predictor is zero

What is the purpose of the link function in GLMs?

To transform the mean of the response variable onto the scale of the linear predictor

What is the assumption of equal variance/homoscedasticity in linear regression?

The variance of the residuals is constant across all values of the predictors

What is the purpose of a QQ plot in linear regression diagnostics?

To check for normality of the residuals

What is the consequence of violating the assumption of independence in linear regression?

The standard errors will be underestimated

What is the purpose of a scatterplot of predicted values against residuals in linear regression diagnostics?

To check for equal variance of the residuals

What is the purpose of the inverse link function in GLMs?

To transform the linear predictor back to the mean of the response variable

What is the assumption of normally-distributed errors in linear regression?

The residuals are normally distributed with a mean of 0

What is the purpose of the L.I.N.E. acronym in linear regression?

To remember the assumptions of linear regression

What is the purpose of a density plot of residuals in linear regression diagnostics?

To check for normality of the residuals

What is the assumption of independence in linear regression?

The residuals are independent and identically distributed

What is the interpretation of b0 in the multiple linear regression equation yi=b0+b1∗x1i+...+bk∗xki+εi?

The expected value of y when all predictors are zero

What is the primary purpose of the generalized linear model (GLM)?

To extend the linear model to different outcomes

What is the purpose of the residual standard error (σ) in the linear regression model?

To estimate the standard deviation of the residuals

What is the interpretation of the coefficient b1 in the multiple linear regression equation yi=b0+b1∗x1i+...+bk∗xki+εi?

The change in y when x1 is increased by one unit, holding all other predictors constant

What is the purpose of the F-test in the linear regression model?

To determine the overall fit of the model

What is the primary advantage of using the linear regression model over other types of regression models?

It provides a simple and interpretable model for continuous outcomes

What is the purpose of the t-value in the linear regression output?

To determine the significance of individual model parameters

What is the interpretation of the R-squared value in the linear regression model?

The proportion of variance in the outcome variable explained by the model

What is the purpose of the confint() function in R when working with linear regression models?

To estimate the confidence interval of the model parameters

What is the primary advantage of using the adjusted R-squared value over the regular R-squared value?

It adjusts for the number of predictors, so adding uninformative predictors does not inflate the apparent fit

What is the range of probabilities in logistic regression?

From 0 to 1

What is the assumption in logistic regression about the relationship between the outcome and the predictor variables?

Linear on the link scale

What is the purpose of checking for outliers/extreme values in logistic regression?

To identify unusual observations that may affect the model

What type of outcome variable is suitable for Poisson regression?

Count variable

What is a characteristic of the Poisson distribution?

It has one distribution parameter

What is the function used to perform logistic regression in R?

glm()

What does an odds ratio greater than 1 indicate in logistic regression?

A positive relationship between variables

Why is linear regression rarely used for count outcomes?

Because the data is often not normal and the residuals don't follow a normal distribution

What is a marginal effect in logistic regression?

The instantaneous effect of a predictor on the outcome at a particular point

What is an example of a research use case where Poisson regression may be appropriate?

Examining risk factors for the number of accidents someone gets into over a 12-month period

What is a disadvantage of using linear regression for count outcomes?

The straight line is often a bad fit, especially at extremes

What is the purpose of checking for separation in logistic regression?

To identify when a predictor perfectly predicts the outcome

What is the main difference between Poisson regression and linear regression?

Poisson regression is used for count variables, while linear regression is used for continuous variables

Why is a large sample size required for logistic regression?

Because the maximum likelihood estimates of the parameters are only approximately normally distributed in large samples

What is a characteristic of count variables?

They are discrete and must be whole numbers

What is the purpose of the link function in Poisson regression?

To transform the linear predicted value to ensure it never goes below 0

What is the assumption about the distribution of the outcome variable in Poisson regression?

Poisson distribution

What is the interpretation of the Incident Rate Ratio (IRR) in Poisson regression?

The rate at which the outcome variable changes for a one-unit change in the predictor

What is the purpose of exponentiating the coefficients in Poisson regression?

To take the coefficient out of the log scale

What is the main difference between linear regression and binary logistic regression?

The type of outcome variable

What is the purpose of the link function in logistic regression?

To transform the linear predicted value to ensure it never goes below 0 and never goes above 1

What is the distribution assumed for the outcome variable in binary logistic regression?

Bernoulli distribution

What is the link function used in logistic regression?

Logit function

What is the purpose of using the glm() function in R for Poisson regression?

To perform Poisson regression

What is the advantage of using Poisson regression over linear regression for count data?

It assumes a Poisson distribution

What type of plot is used to show the distribution of stress when negative affect is missing or not?

Margin plot

What is the purpose of identifying patterns of missing data?

All of the above

What do the blue dots in the margin plot represent?

Observed data

What is shown in the margin plot along with the scatter plot?

Two boxplots on each axis

What does the red boxplot in the margin plot represent?

The distribution of stress when negative affect is missing

What can be seen from the distribution of stress when negative affect is missing or not?

The distribution of stress is very different when negative affect is missing or not

What happens when data are missing not at random (MNAR)?

You cannot recover unbiased estimates.

What is multiple imputation?

A robust way to address missing data by generating multiple datasets.

What is the formula to determine total uncertainty in average estimate in multiple imputation?

T = V̄ + B + B/m, where V̄ is the average within-imputation variance, B is the between-imputation variance, and m is the number of imputations

What is the purpose of examining missing data before imputation?

To examine the patterns of missing data.

What is the VIM package in R used for?

To explore and visualize missing data.

What is the consequence of data being MNAR?

Unbiased estimates cannot be recovered, because the information needed to correct for the missingness is itself missing

What is the purpose of generating multiple imputed datasets?

To get multiple, different datasets with plausible values for missing data.

What is the benefit of multiple imputation?

It provides a robust way to address missing data.

What is the result of performing the analysis of interest on each imputed dataset?

Multiple, different Q̂ estimates.

What is the main consequence of missing data?

Loss of efficiency and biased results

What happens when data are missing completely at random (MCAR)?

Unbiased estimates of the true parameter(s) with list-wise deletion

What is the classification of missing data when the missingness mechanism is conditionally independent of the estimate of our parameter(s) of interest?

Missing at random (MAR)

What can be done to recover unbiased estimates when data are Missing at Random (MAR)?

Condition on the correct variables

What is the consequence of using complete cases only when data are Missing at Random (MAR)?

Biased estimates of the true parameter(s)

Why is list-wise deletion often inefficient?

It discards data on other variables

What is the main difference between Missing at Random (MAR) and Missing Completely at Random (MCAR)?

MAR is conditionally independent of the estimate of our parameter(s) of interest

What is the main issue with list-wise deletion?

It is inefficient

What is the condition for list-wise deletion to yield unbiased estimates?

Data are missing completely at random (MCAR)

What is the main difference between Missing Completely at Random (MCAR) and Missing at Random (MAR)?

MCAR is independent of the estimate of interest, while MAR is conditionally independent

What is the main advantage of using a conditional approach when data are Missing at Random (MAR)?

It can recover unbiased estimates if the right variables are present

What is the main issue with complete cases only?

They are biased

What is the main difference between Missing Not at Random (MNAR) and the other types of missingness?

MNAR is associated with the estimate of interest, while MAR and MCAR are not

What is the purpose of a margin plot in data analysis?

To identify patterns of missing data

What do the blue dots in a margin plot represent?

Observed data

What can be inferred from the boxplots on the x-axis of a margin plot?

The distribution of stress is very different when negative affect is missing or not

Why is there only one boxplot on the y-axis of a margin plot?

Because there are no missing values for stress

What is the purpose of examining the distribution of stress when negative affect is missing or not?

To understand the relationship between stress and negative affect

What can be inferred from the presence of red dots in a margin plot?

There are missing values in the data

What is the benefit of using margin plots in data analysis?

To identify patterns of missing data and understand the distribution of data

What is the primary reason why we cannot recover unbiased estimates when data are missing not at random?

Because the data we need to recover unbiased estimates are themselves missing

What is the main difference between mean positive affect being MNAR and MAR?

MNAR is when the data are missing due to the parameter itself, while MAR is when the data are missing due to another parameter

What is the purpose of multiple imputation in addressing missing data?

To generate multiple, different datasets where in each one, different plausible values are imputed for the missing data

What is the formula to determine the total uncertainty in the average estimate in multiple imputation?

T = V̄ + B + B/m, where V̄ is the average within-imputation variance, B is the between-imputation variance, and m is the number of imputations

What is the issue with using multiple imputation with small sample sizes?

It may not be feasible to generate multiple imputed datasets

What is the purpose of examining the missing data before doing any imputation?

To examine the patterns and amount of missing data on each variable

What is the benefit of using the VIM package in R for examining missing data?

It provides a visual representation of the missing data

What is the consequence of assuming MAR when the data are actually MNAR?

It may result in biased estimates

What is the purpose of generating multiple imputed datasets in multiple imputation?

To generate multiple, different datasets where in each one, different plausible values are imputed for the missing data

What is the advantage of using multiple imputation over single imputation?

It provides a range of possible estimates of the missing data

What is a key characteristic of linear mixed models that allows them to handle repeated measures data?

They can handle continuous time and missing data

What is a key difference between fixed effects and random effects in linear mixed models?

Fixed effects have the same slope and intercept for everyone, while random effects have different ones

When would you use repeated measures ANOVA instead of linear mixed models?

When everyone has the same number of time points and the outcome is continuous

What is an advantage of using linear mixed models over repeated measures ANOVA?

Linear mixed models require fewer assumptions about the data

How do fixed effects approximate the distribution of the data?

𝑀 = estimated mean; 𝑆𝐷 = 0

What is a key assumption of linear regression that linear mixed models can relax?

Independence of observations

What is a benefit of using linear mixed models for clustered data?

They can handle missing data at certain time points

How do random effects approximate the distribution of the data?

𝑀 = estimated mean; 𝑆𝐷 = estimated standard deviation

What is the main difference between fixed effects and random effects in regression analysis?

Fixed effects assume a single coefficient for all individuals, while random effects assume individual differences in regression coefficients

What does an intraclass correlation coefficient (ICC) of 0.5 indicate?

50% of the total variance occurs between people, and 50% of the total variance is within person

What is the purpose of the meandeviations() function in linear mixed models?

To calculate the between and within versions of a repeated measures variable

What is a key assumption of linear mixed models (LMMs)?

That individual units' deviations from the fixed effect follow a normal distribution with mean 0 and standard deviation equal to the standard deviation of the deviations

What is the primary advantage of using linear mixed models over traditional linear regression?

LMMs can model individual differences in regression coefficients, despite needing to estimate only one additional parameter

What is the purpose of the ICC in linear mixed models?

To determine the proportion of variance explained by individual differences

What is a key feature of linear mixed models that allows them to relax the assumption of independence in traditional linear regression?

The ability to model individual differences in regression coefficients, despite needing to estimate only one additional parameter

What is the interpretation of an ICC of 0.25?

25% of the total variance occurs between people, and 75% of the total variance is within person

Study Notes

Data Table Subsetting

  • Data table subsetting structure: DT[i, j, by], where DT is the data table, i is the rows, j is the columns, and by is the grouping variable
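As a sketch of this structure (assuming the data.table package is installed; the variable names are illustrative):

```r
# A minimal illustration of DT[i, j, by] (requires the data.table package)
library(data.table)

DT <- data.table(
  id    = c(1, 1, 2, 2, 3),
  group = c("a", "a", "b", "b", "a"),
  score = c(10, 12, 20, 22, 14)
)

# i selects rows, j computes on columns, by groups the result
DT[score > 10, .(mean_score = mean(score)), by = group]
# one row per group, averaging only the scores that passed the i filter
```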

Data Types in R

  • Logical: used for logical data (TRUE or FALSE)
  • Integer: used for integer type data (whole numbers like 0, 1, 2)
  • Numeric: used for real numbers (1.1, 4.8) and can be used for integer data (less efficient)
  • Factor: special representation of numeric data when data are fundamentally discrete (e.g., study condition coded as 0 = control, 1 = medication, 2 = psychotherapy)
  • Characters: used for text type data (names, qualitative data, etc.) and can store numbers as strings
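Each of these types can be created and checked directly in base R (values here are illustrative):

```r
# One example of each R data type; class() confirms how R stores it
x_logical   <- c(TRUE, FALSE, TRUE)
x_integer   <- c(0L, 1L, 2L)          # the L suffix forces integer storage
x_numeric   <- c(1.1, 4.8)
x_factor    <- factor(c(0, 1, 2),
                      labels = c("control", "medication", "psychotherapy"))
x_character <- c("Alice", "Bob", "7") # a number stored as a string stays text

class(x_logical)    # "logical"
class(x_integer)    # "integer"
class(x_numeric)    # "numeric"
class(x_factor)     # "factor"
levels(x_factor)    # the labels behind the discrete codes
```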

Operators

  • Logical operators: used to manage data (e.g., find outliers, values greater/less than a score)
  • Boolean values: TRUE or FALSE, where TRUE is treated as 1 and FALSE is treated as 0 in arithmetic
  • Operators:
    • >= or %ge%: Greater than or equal
    • %gl%: Greater than AND less than
    • %gel%: Greater than or equal AND less than
    • %gle%: Greater than AND less than or equal
    • %gele%: Greater than or equal AND less than or equal
    • %in%: In
    • %!in% or %nin%: Not in
    • %c%: Chain operations on the RHS together
    • %e%: Set operator, to use set notation
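The %gl%-style range operators above come from an add-on package (the extraoperators package is assumed here); the base-R pieces can be illustrated directly:

```r
# Base-R logical operators and Boolean arithmetic; the %gl%/%gele%-style
# range operators are assumed to come from the extraoperators package
x <- c(1, 5, 10, 15, 20)

x >= 10          # elementwise logical test: FALSE FALSE TRUE TRUE TRUE
x %in% c(5, 15)  # set membership: FALSE TRUE FALSE TRUE FALSE
sum(x >= 10)     # TRUE counts as 1, FALSE as 0, so this counts matches
mean(x >= 10)    # and this gives the proportion meeting the condition
```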

Subsetting Data

  • Subsetting: excluding outliers, selecting participants who meet certain criteria
  • Order of subsetting matters

Merging Data

  • Rules: one join at a time, x dataset is always on the left, y dataset is always on the right
  • Types of joins:
    • Natural join: resulting data has only rows present in both x and y (all = FALSE)
    • Full outer join: resulting data has all rows in x and all rows in y (all = TRUE)
    • Left outer join: resulting data has all rows in x (all.x = TRUE)
    • Right outer join: resulting data has all rows in y (all.y = TRUE)
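Base R's merge() follows exactly these rules, with the x dataset on the left (the example data are illustrative):

```r
# The four join types with base-R merge(); x is always the left dataset
x <- data.frame(id = c(1, 2, 3), stress = c(4, 6, 5))
y <- data.frame(id = c(2, 3, 4), mood   = c(7, 8, 9))

merge(x, y, by = "id")               # natural (inner) join: ids 2 and 3 only
merge(x, y, by = "id", all = TRUE)   # full outer join: ids 1-4, NAs filled in
merge(x, y, by = "id", all.x = TRUE) # left outer join: all rows of x
merge(x, y, by = "id", all.y = TRUE) # right outer join: all rows of y
```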

Reshaping Data

  • Necessary for repeated measures/longitudinal/panel data
  • Types of data structures:
    • Wide: each measure has a separately-named variable for each time point it was measured
      • Each entity occupies their own row, and each variable occupies a single column
      • Easy to read and interpret, used in descriptive statistics and reporting
    • Long: time point (or wave) is a variable, IDs will have multiple rows
      • Machine-friendly data structure, easier to perform functions like filtering and aggregating
      • Easier to add new data and avoids the problem of null values
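A wide-to-long conversion can be sketched with base R's reshape() (melt() and dcast() from data.table do the same job; the data here are illustrative):

```r
# Wide data: one row per id, one column per time point
wide <- data.frame(
  id      = 1:3,
  stress1 = c(4, 5, 6),   # stress measured at time 1
  stress2 = c(3, 7, 2)    # stress measured at time 2
)

# Long data: time becomes a variable, so each id gets multiple rows
long <- reshape(wide,
                varying   = c("stress1", "stress2"),
                v.names   = "stress",
                timevar   = "time",
                idvar     = "id",
                direction = "long")
long  # 6 rows: one per id per time point
```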

Scoring Questionnaire Scales

  • There are two ways to score questionnaire scales: adding together to get a sum total score, or taking an average of all items
  • Using rowMeans() allows you to perform calculations excluding missing data, which is commonly done and sensible when dealing with small amounts of missing data
  • If you want to deal with missing data but need a total score, you can use rowMeans() and multiply the results by the number of items that should have been completed
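The difference between the two approaches shows up as soon as one item is skipped (a small illustrative example):

```r
# Scoring a 4-item scale when one participant skipped an item
items <- data.frame(
  item1 = c(3, 4, 2),
  item2 = c(4, NA, 3),  # participant 2 skipped item2
  item3 = c(2, 5, 4),
  item4 = c(5, 4, 3)
)

rowSums(items)                     # participant 2's total becomes NA
rowMeans(items, na.rm = TRUE)      # mean of the items actually answered
rowMeans(items, na.rm = TRUE) * 4  # total score, with the person's own mean
                                   # effectively imputed for the missing item
```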

Cronbach's Alpha

  • psych::alpha() is the function to find Cronbach's alpha, a common measure of scale reliability
  • psych:: is used to specify the alpha function from the psych package, as alpha is a popular function name
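psych::alpha(items) reports this directly; for intuition, the underlying formula can be sketched in base R (the item data are illustrative):

```r
# Cronbach's alpha from its formula:
# alpha = k/(k - 1) * (1 - sum of item variances / variance of the total score)
items <- data.frame(
  item1 = c(3, 4, 2, 5, 4),
  item2 = c(4, 4, 3, 5, 3),
  item3 = c(2, 5, 4, 4, 4)
)

k     <- ncol(items)
alpha <- (k / (k - 1)) * (1 - sum(apply(items, 2, var)) / var(rowSums(items)))
alpha  # psych::alpha(items) returns this as "raw_alpha"
```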

Bivariate Plots

  • A bivariate plot shows the relationship between two variables, mapped onto the x- and y-axes
  • geom_point() is used to make scatterplots, and additional arguments can be added using + (e.g., geom_line())
  • geom_bar() is used for barplots
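A minimal bivariate plot, assuming the ggplot2 package and using the built-in mtcars data for illustration:

```r
# Scatterplot with a fitted line layered on via + (requires ggplot2)
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                          # scatterplot of the raw data
  geom_smooth(method = "lm", se = FALSE)  # add a fitted line as a new layer
p  # printing the object draws the plot
```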

Best Practices in Data Visualization

  • Data to ink ratio: aim for more data and less ink
  • Themes are helpful in achieving this goal
  • Axes can be useful for providing more data, such as labeling with quantiles
  • Shapes on scatterplots help identify categorical variables

Types of Plots

Violin Plots

  • Used to compare the distribution of data between groups
  • Thicker regions have more points, narrow regions have fewer data points
  • Show the range/spread of each variable and mean and confidence interval summaries

Histograms

  • Define equal width bins on the x-axis and count how many observations fall within each bin
  • Bars display these, where the width of the bar is the width of the bin and the height is the count (frequency) of observations
  • Show a univariate distribution

Density Plots

  • Show the distribution using a smooth density function rather than binning data
  • Height indicates the relative frequency of observations at a particular value
  • Designed so that the total area under the curve integrates to one
  • Show a univariate distribution

Dot Plots

  • Effective at showing raw data for small datasets
  • Each dot represents one person
  • Dots are stacked on top of each other if they would overlap
  • Provide greater precision than histograms
  • Show a univariate distribution

QQ Plots

  • A scatterplot created by plotting two sets of quantiles against one another
  • If both sets of quantiles came from the same distribution, the points should form a line that's roughly straight
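Base R draws this comparison against a normal distribution directly (simulated data for illustration):

```r
# QQ plot of sample quantiles against theoretical normal quantiles
set.seed(1)
x <- rnorm(100, mean = 50, sd = 10)

qqnorm(x)  # points close to a straight line suggest normality
qqline(x)  # reference line through the first and third quartiles

# the quantile pairs can also be computed without drawing
q <- qqnorm(x, plot.it = FALSE)
```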

Simple vs. Multiple Linear Regression

  • Simple linear regression: equation yi = b0 + b1 * xi + εi, where yi is the outcome variable, xi is the predictor/explanatory variable, εi is the residual/error term, b0 is the intercept, and b1 is the slope of the line.
  • In simple linear regression, the model parameters (b0 and b1) are the same for all participants, but each person has their own values of yi and xi, and there will be some unexplained residual (εi).
  • If we want to talk about only what is predicted based on the regression coefficients, we can write yi = b0 + b1 * xi, leaving off the residual error term (εi).

Multiple Linear Regression

  • Multiple linear regression works in principle basically the same way as simple linear regression, but allows for more than one predictor (explanatory) variable in a single model.
  • The equation for multiple linear regression is yi = b0 + b1 * x1i + ... + bk * xki + εi, where yi is the outcome variable, x1i, x2i, ..., xki are the predictor variables, and εi is the residual/error term.
  • The regression coefficients (b0, b1, ..., bk) are interpreted fairly similarly to those in simple linear regression, but with some extra requirements.
  • b0 is the intercept, the expected (model predicted) value of yi when all predictors are 0.
  • b1, b2, ..., bk are the slopes of the line, capturing how much yi is expected to change for a one unit change in x1, x2, ..., xk, respectively, holding all other predictors constant.

Line of Best Fit and Residuals

  • The line of best fit is the regression line that goes through the data points, minimizing all the residuals.
  • The residuals are the differences between the model predicted values and the observed values.

Generalized Linear Models (GLMs)

  • GLMs extend the linear model to different outcomes, such as continuous, normally distributed variables (linear regression), binary 0/1 variables (logistic regression, probit regression), and count variables (Poisson regression, negative binomial regression).
  • GLMs keep the model linear on a transformed scale, using a link function to connect the linear predictor, eta (η), to the mean of the outcome.
  • In linear regression, the link is the identity function, because the outcome is already on the linear scale.
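As a sketch of a GLM in practice, a logistic regression fit with glm(), where the logit link keeps predicted probabilities between 0 and 1 (the data are simulated for illustration):

```r
# Logistic regression via glm(): family = binomial with a logit link
set.seed(42)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * x))  # simulated outcome

fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)                             # coefficients on the log-odds scale
exp(coef(fit))                        # odds ratios; > 1 means a positive relationship
predict(fit, type = "response")[1:3]  # predicted probabilities, all in (0, 1)
```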

Probability Distribution

  • A probability distribution is a function where, as we move along the x-axis, the y-axis tells us the probability (for discrete variables) or probability density (for continuous variables) of that value occurring.

Normal Distribution (Gaussian Distribution)

  • Parameters: mean and standard deviation.

R Output from a Linear Model

  • The lm() function in R is used to fit a linear model, and it uses a formula interface to specify the desired model, with the format outcome ~ predictor.
  • The summary() function provides a quick summary of the model, including the regression coefficients, standard errors, t-values, and p-values.
  • The confint() function provides confidence intervals for the regression coefficients.
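
  A minimal sketch of this workflow (the data frame and variable names here are hypothetical, purely for illustration):

```r
# simulate a small illustrative dataset
set.seed(1)
d <- data.frame(stress = rnorm(100))
d$mood <- 2 + 0.5 * d$stress + rnorm(100)

# fit the model with the formula interface: outcome ~ predictor
m <- lm(mood ~ stress, data = d)

summary(m)  # coefficients, standard errors, t-values, p-values
confint(m)  # 95% confidence intervals for the coefficients
```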

Linear Regression Assumptions

  • L.I.N.E. = there is a linear relationship, variables and errors are independent, errors are normally distributed, and there needs to be equal variance.
  • Independent observations: each value of the outcome should come from a different person.
  • Errors: for any pair of observations, the error terms should be uncorrelated.
  • Normally-distributed errors: the errors (i.e., the residuals) should be random and normally distributed with a mean of 0.
  • Equal variance/homoscedasticity: for each value of the predictors, the variance of the error term should be constant.

Model Diagnostics

  • To assess normally-distributed errors, look at the density plot of residuals (black line) vs a normal distribution (dotted blue line).
  • To identify outliers, look at the QQ plot of residuals (points falling far from the reference line are potential outliers).
  • To check for equal variance/homoscedasticity, look at a scatterplot of the model predicted values against the residuals - basically, we want the spread of the residuals to be roughly constant across the predicted values (i.e., blue dotted lines to be horizontal and parallel).
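
  These checks can be sketched with base R graphics, assuming a model `m` already fitted with lm() (simple stand-ins for whatever diagnostic plots your course materials use):

```r
res <- residuals(m)

# normality: density of residuals vs. a matching normal curve
plot(density(res), main = "Residual density")
curve(dnorm(x, mean = 0, sd = sd(res)), add = TRUE, lty = 2, col = "blue")

# outliers: QQ plot; points far from the line are suspect
qqnorm(res)
qqline(res)

# homoscedasticity: residual spread should be constant across fitted values
plot(fitted(m), res, xlab = "Predicted values", ylab = "Residuals")
abline(h = 0, lty = 2, col = "blue")
```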

Poisson Regression

  • Used for count outcomes, which are discrete whole numbers (0 or positive integers)
  • Poisson distribution is used when counts are relatively rare
  • Examples of research use cases:
    • Examining risk factors for accidents over a 12-month period
    • Analyzing the number of children people have
    • Predicting how many friends people have
    • Evaluating whether an intervention reduced medication non-adherence
    • Testing whether treating mental health can lower healthcare appointments
  • Poisson regression:
    • Does not assume normal distribution
    • Has one parameter, lambda (mean and variance)
    • Linear regression rarely works well for count outcomes due to:
      • Straight line being a bad fit at extremes
      • Non-normal distribution of residuals
  • Generalized linear models handle count outcomes (Poisson regression) using:
    • Link functions to transform linear predicted values to never go below 0
    • Assuming a Poisson distribution
    • Defining the link function as: η = g(λ) = ln(λ)
  • Assumptions of Poisson regression:
    • Poisson distribution, outcome is counts, positive integers
    • The mean and variance are assumed to be equal (both λ)
    • Linear relationship on the link scale (ln)
    • No need to worry about normally distributed errors or equal variance/homoscedasticity
    • Watch for right-side outliers (extremely high counts)
    • Importance of large sample size
  • How to do Poisson regression in R:
    • Use the glm() function with the argument 'family = poisson'
  • Incident Rate Ratios (IRRs):
    • Interpret as: "for each one unit higher predictor score, there are IRR times as many events of the outcome"
    • Example: IRR = 2, base rate = 1, one unit higher would be 1*2 = 2
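
  A minimal sketch in R (the data and variable names are hypothetical, for illustration only):

```r
# simulate an illustrative count outcome
set.seed(1)
d <- data.frame(risk = rnorm(200))
d$accidents <- rpois(200, lambda = exp(0.2 + 0.5 * d$risk))

# Poisson regression via glm() with family = poisson
m <- glm(accidents ~ risk, data = d, family = poisson)

summary(m)       # coefficients on the log (link) scale
exp(coef(m))     # exponentiate to obtain incident rate ratios (IRRs)
exp(confint(m))  # confidence intervals on the IRR scale
```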

Binary Logistic Regression

  • Used for binary outcomes, where the outcome only takes on two values: 0 or 1
  • Examples of research use cases:
    • Predicting whether someone will have major depression or not
    • Determining the probability of patients remitting from major depression
    • Predicting the probability of readmission to the hospital within 30 days
    • Predicting the probability of death before age 60
  • Binary logistic regression:
    • Linear regression will not work for binary outcomes due to:
      • Straight line being a bad fit
      • Non-normal distribution of residuals
  • Generalized linear models handle binary outcomes (logistic regression) using:
    • Link functions to transform linear predicted values to never go below 0 and never go above 1
    • Assuming a Bernoulli distribution
    • Defining the link function as: η = g(μ) = ln(μ / (1 − μ))
  • Assumptions of logistic regression:
    • Bernoulli distribution, outcome is probability of event occurring
    • Linear relationship on the link scale (ln)
    • Independent variables, independent errors
    • Identify outliers/extreme values on the predictors
    • Check for separation (predictor variable perfectly predicts the outcome)
    • Importance of large sample size
  • How to do logistic regression in R:
    • Use the glm() function with the argument 'family = binomial'
  • Odds ratio:
    • Indicates how many times larger the odds of the outcome are for a one unit change in the predictor
    • Higher than 1 means a positive relationship, less than 1 means a negative relationship
  • Marginal effect:
    • Instantaneous effect of change at a particular point
    • Equivalent to the slope of a straight line at that value
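
  A minimal sketch in R (the data and variable names are hypothetical, for illustration only):

```r
# simulate an illustrative binary outcome
set.seed(1)
d <- data.frame(severity = rnorm(200))
d$readmit <- rbinom(200, size = 1, prob = plogis(-1 + 0.8 * d$severity))

# logistic regression via glm() with family = binomial
m <- glm(readmit ~ severity, data = d, family = binomial)

summary(m)    # coefficients on the log-odds (logit) scale
exp(coef(m))  # exponentiate to obtain odds ratios
```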

Missing Data

  • Missing data are common and problematic, leading to biased results and efficiency loss.

Types of Missing Data

  • Missing completely at random (MCAR): when the missingness mechanism is completely independent of the estimate of our parameter(s) of interest.
  • Missing at random (MAR): when the missingness mechanism is conditionally independent of the estimate of our parameter(s) of interest.
  • Missing not at random (MNAR): when the missingness mechanism is associated with the estimate of our parameter(s) of interest.

Consequences of Missing Data

  • Listwise deletion may lead to biased results unless the data are missing completely at random (MCAR).
  • When data are missing completely at random (MCAR), listwise deletion will yield unbiased estimates of the true parameter(s) if the data had not been missing.
  • When data are missing at random (MAR), it is possible to recover unbiased estimates if the right other variables are present.

Multiple Imputation (MI)

  • Multiple imputation is a robust way to address missing data, involving generating multiple, different datasets with plausible values imputed for the missing data.
  • Steps in MI:
  • Start with the incomplete data.
  • Generate 𝑚 datasets with no missingness, by filling in different plausible values for any missing data.
  • Perform the analysis of interest on each imputed dataset.
  • Pool the results from the analyses run on each imputed dataset to generate an overall estimate, 𝑄¯.
  • Formula for the total uncertainty in the average estimate: T = V̄ + B + B/m, where V̄ is the average uncertainty (within-imputation variance) of Q̂ across the imputed datasets, B is the between-imputation variance of the estimates Q̂, and m is the number of imputed datasets.
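
  The MI steps above can be sketched with the mice package (assuming it is installed; the data frame `d` and the model formula are hypothetical):

```r
library(mice)

# steps 1-2: generate m = 5 complete datasets with plausible imputed values
imp <- mice(d, m = 5, seed = 1, printFlag = FALSE)

# step 3: run the analysis of interest on each imputed dataset
fits <- with(imp, lm(mood ~ stress))

# step 4: pool the m sets of results into one overall estimate
summary(pool(fits))
```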

Issues with Using Imputed Datasets

  • Issues with using imputed datasets with general linear models:
  • Small sample sizes (e.g., 100 or fewer).
  • Collinear variables.
  • Lots of interactions between variables.
  • Non-normal residuals.

Examining Missing Data

  • Before doing any imputation, it is a good idea to examine the data using the VIM package in R.
  • The aggr() function shows the proportion of missing data on each individual variable and the patterns of missing data.
  • Margin plots can help identify if imputations fall outside the range of observed data or fit with the rest of the trend from the observed data.
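
  Assuming the VIM package is installed and a data frame `d` with some missingness (variable names hypothetical), these checks look like:

```r
library(VIM)

# proportion of missing data per variable, plus missingness patterns
aggr(d, numbers = TRUE, prop = TRUE)

# margin plot for a pair of variables: do values missing on one variable
# look consistent with the observed trend on the other?
marginplot(d[, c("stress", "mood")])
```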


Independence of Observations

  • Observations are not always independent, e.g., in longitudinal studies, repeated measures experiments, and clustered data
  • This type of data poses challenges to statistical analysis, but can be addressed using linear mixed models

Linear Mixed Models

  • Relax the assumption of independence of observations in linear regression
  • Allow for variation in observations, including continuous time, missing data, and continuous predictors

Fixed Effects vs Random Effects

  • Fixed effects: assume same slope and intercept for all participants, only applicable for one observation per participant
  • Random effects: allow for different coefficients (slopes and intercepts) per participant, applicable for repeated measures

Fixed Effects Approximation

  • Approximate distribution by: 𝑀 = estimated mean; 𝑆𝐷 = 0 (standard deviation is fixed at 0)

Random Effects Approximation

  • Approximate distribution by: 𝑀 = estimated mean; 𝑆𝐷 = estimated standard deviation (SD is free to vary)

Main Difference Between Fixed and Random Effects

  • Fixed effects: regression coefficients are the same for everyone
  • Random effects: regression coefficients vary randomly for each participant

Intraclass Correlation Coefficient (ICC)

  • Measures the ratio of between variance to total variance (ranges between 0 and 1)
  • ICC > 0 indicates individual means differ, and individual differences need to be accounted for in analysis
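
  One way to sketch this computation, assuming a simple random-intercept model `m` already fitted with lme4::lmer():

```r
# extract variance components from the fitted model; in a simple
# random-intercept model the first row is the between-person
# (intercept) variance and the last row is the residual (within) variance
vc <- as.data.frame(VarCorr(m))

# ICC = between variance / (between + within) variance
icc <- vc$vcov[1] / sum(vc$vcov)
icc
```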

meanDeviations() Function

  • Used to calculate between and within versions of a repeated measures variable

Linear Mixed Model Assumptions

  • Assume individual units' deviations from the fixed effect follow a normal distribution with mean 0 and an estimated standard deviation
  • Assume random effect intercept also follows a normal distribution
  • Only one additional parameter is needed compared to regular linear regression
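
  A minimal random-intercept model can be sketched with the lme4 package (assuming long-format data `d` with repeated measures of `mood` nested within `id`; the names are hypothetical):

```r
library(lme4)

# (1 | id): a random intercept for each participant, assumed to be
# normally distributed around the fixed intercept
m <- lmer(mood ~ stress + (1 | id), data = d)

summary(m)  # fixed effects plus the estimated random-intercept SD
```

  Relative to lm(), the only additional parameter estimated here is the standard deviation of the random intercepts.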
