Questions and Answers
In the context of correlation analysis, what crucial condition must be satisfied to ensure the accurate application and interpretation of the cor(x, y) function?
- The relationship between x and y must be perfectly non-linear.
- The data must be categorical.
- The data should contain outliers.
- The relationship between x and y must be linear. (correct)
What is the primary purpose of using jitter plots when visualizing data?
- To calculate the correlation coefficient between two quantitative variables.
- To reduce the number of observations in a dataset.
- To eliminate outliers from a dataset.
- To smear categorical data points for better visualization of data density. (correct)
What is the primary purpose of the table(x, y) function in statistical analysis using R?
- To compute the frequency of each pair of responses between two variables. (correct)
- To fit a linear regression model.
- To calculate the correlation between two variables x and y.
- To create a jitter plot for visualizing categorical data.
In the context of bootstrapping, what is the critical characteristic of the sampling process?
What is the main purpose of performing a permutation test for correlation?
What is the role of complete.cases() in data preparation for statistical analysis?
What does the corrplot function with method = "ellipse" primarily visualize?
In the context of regression analysis, what is the primary purpose of examining residual plots?
When performing bootstrap for a quadratic model, what is the key advantage of using a bootstrap confidence interval compared to a theoretical confidence interval?
In regression analysis, what is the key distinction between a prediction band and a confidence band?
What is the primary reason that influential points and outliers can significantly affect a regression model?
Why is it important to address heteroskedasticity in a regression model?
What is the purpose of using the gsub function in the expression gsub("\\?\\*", "", guns_death$Country)?
In the context of multiple regression, what does multicollinearity refer to, and why is it a concern?
In the context of model selection in multiple regression, what is the primary purpose of using Adjusted R-squared over regular R-squared?
In multiple regression, what is one of the key assumptions regarding the nature of the predictor variables?
When using categorical predictors in a regression model, how does R handle the inclusion of these variables by default?
What is the primary purpose of the Box-Cox transformation?
In the context of ANOVA, what does the F-statistic primarily test?
In a two-way ANOVA, what does a significant interaction effect indicate?
In the context of linear regression, what is the fundamental assumption regarding the error terms ($\epsilon_i$)?
Consider a scenario where you aim to estimate the parameters $\beta_0$, $\beta_1$, and $\sigma$ in a simple linear regression model. What does $\sigma$ specifically represent?
In the context of interpreting coefficients in a linear regression model, what does the intercept represent?
When interpreting a coefficient for a continuous predictor (e.g., V241324) in a linear regression model, what does the coefficient signify?
In the context of interpreting coefficients for categorical predictors (e.g., V241227x, Party ID) in a regression model, what do these coefficients represent?
What is the primary difference between a prediction interval and a confidence interval in the context of regression analysis?
In the context of interaction effects in a two-way ANOVA, what does a significant interaction indicate?
What is the purpose of the Box-Cox transformation in the context of regression analysis?
In the context of best subset selection in regression, what is a key consideration when evaluating different models?
What is the purpose of Tukey's Honestly Significant Difference (HSD) test?
In forward stepwise regression, what criterion is typically used to determine which variable to add to the model at each step?
In backward stepwise regression, what criterion is typically used to determine which variable to remove from the model at each step?
What is the primary purpose of the Akaike Information Criterion (AIC) in model selection?
What is heteroskedasticity, and why is it a concern in regression analysis?
How does Pearson's product-moment correlation coefficient measure the relationship between two variables?
In the context of ANOVA, what is the key difference between Type II and Type III sums of squares?
What is the purpose of Bartlett's test in the context of ANOVA?
Under what conditions might Welch's ANOVA be preferred over a standard ANOVA?
What type of data is the Kruskal-Wallis test most appropriate for?
In a two-way ANOVA, how do you interpret an interaction plot?
What does the 'Total SS' (Sum of Squares) represent in the context of ANOVA?
How does increasing the confidence level affect the width of a confidence interval?
When interpreting the standard error, what does it primarily quantify?
How does Levene's test assess the equality of variances across groups?
What is the relationship between variance and standard deviation?
Which statistical test should be used to assess the overall significance of a regression model?
In linear regression, what is the effect of adding an irrelevant variable to the model?
What term describes the condition when predictor variables in a regression model are highly correlated with each other?
Which of these is not a method to address heteroscedasticity in a regression model?
When performing a two-way ANOVA, what is the difference between a main effect and an interaction effect?
What assumptions are required for ANOVA?
What is the meaning of degrees of freedom?
What does a confidence interval estimate?
In what condition should you use Welch's ANOVA?
What is the primary reason one might choose to use the Kruskal-Wallis test?
Between R-squared and Adjusted R-squared, which measure is better to use when increasing the number of predictors?
How is Total SS calculated in the context of ANOVA?
Which test is a post-hoc test used to determine which population means differ after we have already run an ANOVA and found a significant F-statistic?
In the context of linear regression, what does the assumption $\epsilon_i \sim N(0, \sigma)$ imply about the error terms?
In a linear regression model, the coefficient for a continuous predictor (e.g., V241324) is 0.5. How should this coefficient be interpreted?
In the output of a linear regression model, what does a statistically significant coefficient for a categorical predictor (e.g., V241227x) indicate?
In the context of multiple R-squared and Adjusted R-squared, what is the key benefit of using Adjusted R-squared when evaluating regression models with different numbers of predictors?
What is the primary purpose of performing a Box-Cox transformation on the response variable in a linear regression model?
In the context of analyzing the output of an ANOVA, what does a significant F-statistic suggest about the means of the groups being compared?
In the context of a two-way ANOVA, how should a significant interaction effect between two factors be interpreted?
What is the utility of Tukey's Honestly Significant Difference (HSD) test following an ANOVA?
In forward stepwise regression, what criterion is typically used to determine which predictor variable to add to the model at each step?
In backward stepwise regression, what is the variable selection criterion?
What is the primary goal of the Akaike Information Criterion (AIC) in model selection?
What does Bartlett's test assess in the context of ANOVA?
Under what specific conditions is Welch's ANOVA preferred over a standard ANOVA?
What is the most crucial condition for accurately applying and interpreting Pearson's product-moment correlation coefficient?
In the context of ANOVA, what differentiates Type II Sums of Squares from Type III Sums of Squares?
How might increasing the confidence level (e.g., from 95% to 99%) affect the width of a confidence interval, assuming all other factors remain constant?
How should a statistically significant interaction effect in a two-way ANOVA be visually assessed and confirmed?
What can be inferred about individual group differences when the Kruskal-Wallis test yields a statistically significant result?
In ANOVA, what does the 'Total SS' (Sum of Squares) represent, and how is it calculated?
When should the Kruskal-Wallis test be employed instead of a one-way ANOVA, and what type of data is it most suitable for?
In linear regression, how does adding an irrelevant variable to the model typically affect the residual sum of squares (RSS) and the model's overall performance?
When interpreting the standard error, what does it primarily quantify, and how does it relate to the precision of parameter estimates?
Which of the following transformations is LEAST likely to address heteroscedasticity in a regression model?
How does Levene's test assess the equality of variances across groups, and what is its primary advantage over other tests like Bartlett's test?
What is the relationship between variance and standard deviation, and how does understanding this relationship aid in interpreting statistical results?
Why is careful consideration of predictor variables crucial in multiple regression?
In linear regression, what condition would make the intercept $\beta_0$ uninterpretable?
In an ANOVA, what does a large mean square between groups (MSG) relative to the mean square within groups (MSW) suggest?
What is the impact of failing to meet the homogeneity of variances assumption in ANOVA, and what are some potential remedies?
Why is visualizing interaction effects essential in a two-way ANOVA, and what can interaction plots reveal that numerical outputs alone cannot?
In regression analysis, what are prediction bands, and how do they differ fundamentally from confidence bands?
What is multicollinearity, and what is its primary impact on the interpretation of regression coefficients?
In linear regression, what is the role of residual plots in assessing the validity of model assumptions?
What is the first step one should do when preparing to analyze the data?
In the expression $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$, what does $\epsilon_i$ represent, and what assumptions do we typically make about its distribution in linear regression?
Flashcards
cor(x,y)
Calculates the correlation between two variables, ranging from -1 to 1. Only use when the relationship is linear.
Correlation
Strength of the linear relationship between two quantitative variables.
Jitter Plots
Smears categorical data in a plot so we can see the density of observations.
table(x,y)
Calculates the frequency of each pair of responses in a dataset.
Bootstrapping
Resampling data with replacement to estimate variability.
Permutation Test for Correlation
A statistical test to determine if the correlation between two variables is significant.
WB2 <- WB2[complete.cases(WB2), ]
Changes the dataset to only include complete entries for variables, removing rows with NA values.
cor(WB2[,-1])
Gives the correlation between all columns of a data frame.
corrplot
Visualizes the direction and strength of correlations using ellipses.
Matrix Plot
A visual tool to check for linear relationships between multiple variables.
Theoretical regression model
A line that best fits the average relationship between two variables.
na.omit()
Removes rows with missing values from a dataset.
Confidence intervals
Gives an interval estimate for the true slope of the regression line.
Quadratic Model
A regression model that includes a squared term, adding a curve to the fitted relationship.
Prediction Band
Estimates the 95% interval for a single new data point.
Confidence Band
Estimates the 95% interval for the regression line (the mean response).
Influential Points
Points, including outliers, that can disproportionately affect the fitted regression line.
Cook's distance
A measure of a point's influence on the fitted regression; larger values indicate more influential points.
Heteroskedasticity
The condition in which the standard deviation of the residuals is not constant.
Normal quantile plot
A plot used to check whether residuals are approximately normally distributed; points should fall close to a straight line.
Multiple R-squared
A measure of how well a linear regression model fits the observed data, representing the proportion of variance in the dependent variable explained by the independent variables.
Forward Stepwise Regression
A stepwise regression technique that starts with no predictors and adds them one at a time based on which improves the model fit the most.
Backward Stepwise Regression
A stepwise regression technique that starts with all potential predictors and removes them one at a time based on which has the least impact on the model fit.
Akaike Information Criterion (AIC)
A criterion for model selection that balances model fit and complexity, penalizing models with more parameters.
Pearson's Product-Moment Correlation
A measure of the strength and direction of the linear relationship between two variables.
ANOVA Type 3
Tests each effect after accounting for all other effects, including interactions.
ANOVA Type 2
Tests each main effect after accounting for the other main effects, assuming no interaction.
ANOVA
A statistical test used to compare the means of two or more groups.
Bartlett's Test
Tests the equality of variances across groups; sensitive to departures from normality.
Levene's Test
Tests the equality of variances across groups; more robust to non-normality than Bartlett's test.
Welch's ANOVA
An alternative to ANOVA when the assumption of equal variances is violated.
Kruskal-Wallis Test
A non-parametric test used to compare the distributions of two or more groups when the data are not normally distributed.
Two-Way ANOVA
An ANOVA used to study the effects of two independent variables (and their interaction) on a dependent variable.
Interaction Plot
A plot used to visualize the interaction effects between two or more independent variables in an ANOVA.
Total SS (Sum of Squares)
The sum of squares representing the overall variability in the data.
Interpreting ANOVA
Assessing the statistical significance and practical importance of the results obtained from an ANOVA test.
Interpreting Standard Error
A measure of the variability of the sample mean, indicating the precision of the estimate.
Box-Cox Transformation
A transformation applied to non-normal data to make it more closely approximate a normal distribution, which is required for many statistical tests.
Best Model Subset
Process of selecting the best subset of predictors for a regression model based on various fit statistics.
Tukey's Honestly Significant Difference (HSD) Test
A post-hoc test used in ANOVA to determine which group means are significantly different from each other.
Intercept in Regression
The predicted value of outcome when all continuous predictors are zero, and all categorical predictors are at their reference category
Regression Coefficient
Indicates the change in the outcome variable associated with a one-unit increase in the predictor, holding all other variables constant
Study Notes
Module 10: Correlation and Bootstrap
- `cor(x,y)` calculates the correlation, which ranges from -1 to 1; only use it when the relationship is linear.
- Correlation is the strength of the linear relationship between two quantitative variables.
- Jitter plots smear categorical data on a quantitative scale to visualize the density of observations at each intersection of values, using `plot(jitter(x, factor=1), jitter(y))`.
- `table(x,y)` calculates the frequency of each pair of responses.
- The frequencies in a table can be turned into a size-proportional scatterplot by assigning `freq <- c(table(x,y))`, then plotting with `plot(x, y, cex = sqrt(freq))` and superimposing the frequencies with `text(x, y, freq)`.
- Bootstrapping correlation and regression slopes involves calculating the correlation `cor1 <- cor(x,y)`, testing whether the correlation is nonzero with `cor.test(x,y)`, fitting a linear regression model `lm1 <- lm(y ~ x)`, and extracting the coefficients of the line with `lm1$coef`, which returns the intercept and slope.
- Bootstrapping is done by sampling rows with replacement from the original dataset (keeping the same number of observations), then calculating the correlation and slope for each "fake" dataset to construct confidence intervals.
Module 11: Correlation, Permutations, Multiple
- The permutation test assesses correlation by checking the p-value from `cor.test(cry$Crying, cry$IQ)`.
- Outliers can artificially inflate correlations, so a permutation test is performed in which one variable is permuted and the correlation is recalculated.
- To run the permutation test, first permute one variable with `fakeIQ <- sample(IQ)`, then compare the natural order `cbind(crying, IQ)` with the random order `cbind(crying, fakeIQ)`.
- The p-value, representing the likelihood of observing a correlation at least as extreme as the true one if there is no association, is then calculated.
- To store the permutation results, choose a number of samples, for example 10000, and set `corResults <- rep(NA, nsamp)`.
- To test for correlation, run a loop: `for (i in 1:nsamp) { corResults[i] <- cor(crying, sample(IQ)) }`.
- Calculate a two-sided p-value for the fake correlations with `mean(abs(corResults) >= abs(cor(crying, IQ)))`.
- `WB2 <- WB2[complete.cases(WB2), ]` restricts the dataset to complete entries, removing rows with NA values.
- `cor(WB2[,-1])` outputs correlations between all columns, which are easier to read in a plot.
- `corrplot(cor(WB2[,-1]), method = "ellipse")` generates a plot displaying the direction and strength of the correlations.
- Matrix plots can be used to check for linearity with multiple scatterplots: `plot(WB2[,-1])`.
- Taking the log may help linearize data.
- Other methods include `chart.Correlation(WB2[,-1])` and `corrplot.mixed(cor(WB2[,-1]), ...other modifiers...)`.
- Pearson's product-moment correlation is calculated using `cor.test(ctTEMP$Min, ctTEMP$Year)`.
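The permutation-test steps described above can be assembled into one script; the `crying` and `IQ` vectors here are simulated stand-ins for the course's `cry` dataset:

```r
# Permutation test for a correlation.
# crying and IQ are simulated; in the notes they come from the cry dataset.
set.seed(1)
crying <- rnorm(38)
IQ     <- rnorm(38)

trueCor <- cor(crying, IQ)

nsamp <- 10000
corResults <- rep(NA, nsamp)
for (i in 1:nsamp) {
  corResults[i] <- cor(crying, sample(IQ))  # permuting IQ breaks any association
}

# Two-sided p-value: how often a permuted correlation is at least as extreme
pval <- mean(abs(corResults) >= abs(trueCor))
pval
```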
Module 12: Regression
- The theoretical regression model is μ_Y = β₀ + β₁X.
- The actual values are represented as Yᵢ = β₀ + β₁Xᵢ + εᵢ, where the errors are independent observations from a normal distribution with mean 0 and some common standard deviation.
- Before performing regressions, verify the assumptions of linearity and normally distributed errors.
- To account for missing values, use code like `crosby <- na.omit(crosby)`.
- To fit the model, use `lm1 <- lm(Mintemp ~ Year)`.
- To show the coefficients of the line, use `lm1$coefficients`.
- You can find the confidence interval for the true slope using `confint(lm1, 'Year')`.
- You can also use bootstrapping here: sample rows with replacement, calculate a new correlation or slope, save the values into a vector, and take quantiles to get a CI of those values.
- The bootstrap CI will typically be wider than the theoretical CI.
- Quadratic models can be created using `lm2 <- lm(y ~ x + I(x^2))` (note that `x^2` must be wrapped in `I()` inside a formula), and a curve is added to the plot using `points(x, lm2$fitted.values, type = "l")`.
- For bootstrapping quadratics, begin by finding the number of rows using `N <- nrow(data)`.
- Use `bResults <- matrix(rep(NA, 3*nsamp), ncol = 3)` to create a results matrix.
- Use `s <- sample(1:N, N, replace = T)` to get a sample; assign the fake year `fakeYear <- Year[s]`, the fake year squared `fakeYearSq <- fakeYear^2`, and the fake temperature `fakeTemp <- Mintemp[s]`.
- The code `lmtemp <- lm(fakeTemp ~ fakeYear + fakeYearSq)` fits the model, and `bResults[i,] <- lmtemp$coef` stores the coefficients.
- Finally, get the CIs using `ciLine <- quantile(bResults[,2], c(.025,.975))` and `ciQuad <- quantile(bResults[,3], c(.025,.975))`.
- Linear regression aims to find the best-fitting line (or hyperplane) that models the relationship between a dependent variable Y and one or more independent variables X.
- The observed values of Y are equal to the mean function plus a random error: Yᵢ = β₀ + β₁Xᵢ + εᵢ.
- The errors εᵢ are assumed to be independent and identically distributed random variables from a normal distribution with mean zero and constant standard deviation: εᵢ ~ N(0, σ).
- The aim is to estimate the unknown parameters β₀, β₁, and σ, where β₀ is the true intercept, β₁ is the true slope, and σ is the true standard deviation of the errors.
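The quadratic-bootstrap steps above can be put together as follows; `Year` and `Mintemp` are simulated stand-ins for the temperature data used in the notes:

```r
# Bootstrap confidence intervals for the coefficients of a quadratic model.
# Year and Mintemp are simulated illustrations.
set.seed(1)
Year    <- 1950:2019
Mintemp <- 0.02 * (Year - 1950) + rnorm(length(Year))

N     <- length(Year)
nsamp <- 1000
bResults <- matrix(rep(NA, 3 * nsamp), ncol = 3)

for (i in 1:nsamp) {
  s <- sample(1:N, N, replace = TRUE)   # resample rows with replacement
  fakeYear   <- Year[s]
  fakeYearSq <- fakeYear^2
  fakeTemp   <- Mintemp[s]
  bResults[i, ] <- lm(fakeTemp ~ fakeYear + fakeYearSq)$coef
}

ciLine <- quantile(bResults[, 2], c(.025, .975))  # CI for the linear term
ciQuad <- quantile(bResults[, 3], c(.025, .975))  # CI for the quadratic term
```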
Module 13: Regression 2
- For a 95% confidence interval for average minimum temp in 2100, a prediction band is needed, which is wider than confidence intervals since it accounts for the range of potential individual data points.
- For a 95% confidence interval for the regression line in year 2100, use a confidence band.
- 95% confidence means the true regression line lies between the blue lines (the confidence band).
- 95% means a new individual observation is expected to fall between the red lines (the prediction band); it does not guarantee that all future data will.
- R-squared measures the proportion of variance in y explained by the regression model.
- Influential points and outliers (large residuals) can affect the regression line.
- Cook's distance quantifies the influence of a data point; a larger value indicates greater influence.
- Standardized residuals are calculated as the residual divided by an estimate of its standard deviation.
- Heteroskedasticity exists if the standard deviation of residuals is non-constant.
- Deal with heteroskedasticity by transforming the response with log() or sqrt(); a fanning pattern in the residual spread is the sign of the problem.
- Add a regression line to a plot using
abline(lm1$coef, lwd = 3)
. - New temperatures can be predicted until 2100 using the code
newYear <- seq(min(Year), 2100, by = 1)
and assigning the data frame
nd <- data.frame(Year = newYear)
. - Use the code
confbands2 <- predict(lm1, interval = "confidence", newdata=nd)
andpredbands1 <- predict(lm1, interval = "prediction", newdata=nd)
to get results. - The code
plot(mod1, which=4)
generates a plot of Cook's distance. - The code
text(x, y, labels, pos = 1)
superimposes text labels on points. - Create a normal quantile plot using the function
qqPlot(rstudent(lm1))
(from the car package), which plots the studentized residuals against normal quantiles. - A good fit shows no pattern in these residual plots.
- Using the library
olsrr
can give useful residual plots.
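The confidence- and prediction-band workflow above can be sketched end to end; the Year/Mintemp data are simulated here in place of the course dataset.

```r
# Confidence vs. prediction bands for extrapolating to 2100 (simulated data).
set.seed(1)
Year <- 1950:2020
Mintemp <- 5 + 0.02 * (Year - 1950) + rnorm(length(Year))
lm1 <- lm(Mintemp ~ Year)

newYear <- seq(min(Year), 2100, by = 1)
nd <- data.frame(Year = newYear)
confbands <- predict(lm1, interval = "confidence", newdata = nd)
predbands <- predict(lm1, interval = "prediction", newdata = nd)

plot(Year, Mintemp, xlim = c(min(Year), 2100), ylim = range(predbands))
abline(lm1$coef, lwd = 3)
lines(newYear, confbands[, "lwr"], col = "blue")  # confidence band (narrower)
lines(newYear, confbands[, "upr"], col = "blue")
lines(newYear, predbands[, "lwr"], col = "red")   # prediction band (wider)
lines(newYear, predbands[, "upr"], col = "red")
```

Both bands flare outward as Year moves away from the observed data, which is why extrapolation to 2100 carries wide uncertainty.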
Module 16: Multiple Regression and Continuous Predictors
- In multiple regression, the theoretical model equation is Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε.
- βₙ represents the change in Y for every unit increase in Xₙ, holding the other predictors constant; there are three assumptions: linearity, constant variance, and predictors that are not strongly correlated.
- Predicted values are calculated as
ŷᵢ = b₀ + b₁X₁ + b₂X₂ + ... + bₙXₙ
, and residuals are calculated as
eᵢ = Yᵢ - ŷᵢ
. - Multiple R² indicates the amount of variability in Y explained by the multiple regression.
- Forward stepwise regression aims to create a simpler model by starting with the best predictor and iteratively adding significant predictors.
- R² always increases when more predictors are added, which is a problem when comparing models.
- Adjusted R² adds a penalty for the number of terms and is always less than R²; use adjusted R² when comparing models.
- The Akaike Information Criterion (AIC) favors smaller models (lower AIC is better) and is only meaningful when comparing different models.
- Mallows' Cp statistic helps identify the simplest adequate model.
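Forward stepwise selection with AIC can be sketched with R's built-in step() function; the built-in mtcars data stand in for the course dataset here.

```r
# Forward stepwise regression by AIC, using mtcars for illustration.
null_mod <- lm(mpg ~ 1, data = mtcars)  # start with intercept only
full_mod <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)

fwd <- step(null_mod,
            scope = formula(full_mod),
            direction = "forward",
            trace = 0)                  # trace = 0 suppresses the step log

summary(fwd)$adj.r.squared  # adjusted R^2 of the chosen model
AIC(fwd)                    # AIC of the chosen model
```

At each step, step() adds the predictor that lowers AIC the most and stops when no addition improves it.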
Module 18: Multiple Regression and Categorical Predictors
- Assesses whether things are getting warmer over time by checking whether the regression slope on Year is > 0.
- Variables may be categorical and non-orderable.
- R automatically creates reference variables, omitting one level of each categorical variable from the model.
- Assumptions of normality and constant variance are checked; you can apply a Box-Cox transformation if assumptions are not met.
- The code
levels(var)
displays the levels of a variable. - Multiple boxplots can be displayed with
boxplot(TMIN ~ Month, data = ctTEMP)
. - Code to change the order of levels includes steps such as levels(ctTEMP$Month) <- c("01-Jan", "02-Feb", ..., "12-Dec") and sort(levels(ctTEMP$Month)).
- To create a linear regression model w/ multiple categorical predictors use the code
lm1 <- lm(TMIN ~ Year + Month + Name, data = ctTEMP)
. - Code to try a different reference level is
ctTEMP <- within(ctTEMP, Name <- relevel(Name, ref = "STORRS"))
. - Test normality of residuals and heteroskedasticity:
hist(lm1$residuals)
shows the distribution of the residuals,
qqPlot(lm1$residuals)
plots them against normal quantiles, and
plot(fitted(lm1), rstudent(lm1))
plots studentized residuals against fitted values.
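How R dummy-codes a categorical predictor, and how relevel() changes the baseline, can be sketched on a small made-up data frame (the ctTEMP data from the notes is not available here):

```r
# Dummy coding and reference levels for a categorical predictor.
df <- data.frame(
  TMIN  = c(20, 25, 22, 30, 28, 26),
  Month = factor(c("Jan", "Jan", "Feb", "Feb", "Mar", "Mar"))
)
levels(df$Month)              # alphabetical by default, so "Feb" comes first
lm1 <- lm(TMIN ~ Month, data = df)
coef(lm1)                     # Feb is the omitted reference level

# Change the reference level and refit:
df <- within(df, Month <- relevel(Month, ref = "Jan"))
coef(lm(TMIN ~ Month, data = df))  # now Jan is the baseline
```

Each non-reference level gets its own coefficient, interpreted as the difference in mean TMIN between that level and the baseline.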
Module 19: ANOVA
- ANOVA serves as a regression model with at least three levels of a categorical variable. A t-test can be used if there are only two.
- The ANOVA assumptions are that the populations are from normal distributions, constant standard deviation/variance between groups, and all observations are independent.
- ANOVA assumes an overall mean modified by a "treatment effect".
- It quantifies the amount of variation observed in the dataset that is due to group membership.
. - To fit an ANOVA model, use code such as
aov1 <- aov(cooperation ~ emotion)
. - Pairwise comparisons can be conducted with
pairwise.t.test(cooperation, emotion, p.adjust = "none")
. - Tukey's Honestly Significant Difference test is run using the code TukeyHSD(aov1).
- ANOVA partitions the total sum of squares: SStotal = SSgroups + SSerror.
- The null hypothesis is that the mean is the same in every group.
- The alternative hypothesis is that the means DO vary across groups (at least one differs).
- The test statistic is
F = mean square for groups / mean square for error
.
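A complete one-way ANOVA run can be sketched with the built-in PlantGrowth data (weight by treatment group) in place of the cooperation/emotion example:

```r
# One-way ANOVA, Tukey HSD, and unadjusted pairwise t-tests.
aov1 <- aov(weight ~ group, data = PlantGrowth)
summary(aov1)   # F statistic and p-value for the group effect
TukeyHSD(aov1)  # all pairwise group comparisons, with adjustment
with(PlantGrowth,
     pairwise.t.test(weight, group, p.adjust = "none"))
```

TukeyHSD() controls the family-wise error rate across the comparisons, whereas p.adjust = "none" leaves each pairwise p-value uncorrected.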
Module 20: Other Tests
- Tests for equality of variances use the hypotheses
H₀: σ₁ = σ₂ = ... = σₙ
and
Hₐ: at least one σᵢ is different
. - Bartlett's test assumes normality of the group distributions; a small p-value suggests at least one SD differs. It uses
bartlett.test
. - Levene's test does NOT assume normality and is more robust; it uses
leveneTest
(from the car package). - Welch's ANOVA is for normal populations with unequal variances and uses
oneway.test(y ~ x)
. - The Kruskal-Wallis test makes NO DISTRIBUTIONAL ASSUMPTIONS; H₀ states the group medians are the same, while Hₐ states at least one group's median differs. It uses
kruskal.test
. - Two-way ANOVA has 2 categorical predictors plus their interaction term, with the same assumption of equal variances; it uses
aov(y ~ x + z + x*z)
. - Interaction plots use
interaction.plot(x, groups, response)
. - Model: observation = overall mean + row factor effect + column factor effect + interaction effect + residual/error.
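The alternative tests above can be sketched on the built-in InsectSprays data (insect count by spray type), which has visibly unequal group variances:

```r
# Equal-variance and distribution-free alternatives to standard ANOVA.
bartlett.test(count ~ spray, data = InsectSprays)  # assumes normality
oneway.test(count ~ spray, data = InsectSprays)    # Welch's ANOVA
kruskal.test(count ~ spray, data = InsectSprays)   # no distributional assumptions

# Levene's test requires the car package (not loaded here):
# car::leveneTest(count ~ spray, data = InsectSprays)
```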
Module 21: ANCOVA, GLMs
- A general linear model is any kind of linear model, e.g., regression, one-way ANOVA, two-way ANOVA, or one- and two-sample t-tests.
- The model can be expressed in the form
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε
. - Type I (sequential) sums of squares measure the amount of variation explained by each factor, but each term's value is affected by the terms entered BEFORE it.
- You can use other types of sums of squares with code such as
Anova(..., type = "III")
(from the car package).
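The contrast between sequential and adjusted sums of squares can be sketched with the built-in ToothGrowth data; the car::Anova() line is commented out since car may not be installed.

```r
# Type I (sequential) sums of squares for a two-factor model.
ToothGrowth$dose <- factor(ToothGrowth$dose)
fit <- lm(len ~ supp + dose, data = ToothGrowth)
anova(fit)  # Type I: supp's SS reflects its position as the first term entered

# Type III: each term adjusted for all others (requires the car package):
# car::Anova(fit, type = "III")
```

Reordering the terms in the formula changes the Type I table but not the Type III one, which is why the type matters for unbalanced data.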
Additional Concepts
- To prepare for exams, review class material for modules 16 to 22, review labs from 5 onward, review quizzes, and complete a sample exam.
- Exams will test: writing basic R code, knowing and understanding R functions, reading R code, interpreting R output, and applying substantial knowledge of key concepts.
- Interpret continuous predictors (e.g., V241324, V241325, V241330x) such that a 1-point increase in the independent variable corresponds to a 𝛽 point change in the dependent variable (e.g., V241157, Trump feeling thermometer), holding all other variables constant.
- For categorical predictors (e.g., V241227x, Party ID), coefficients represent the difference in outcome between each category and the baseline (reference) category, usually Independent.
- The intercept represents the predicted value of the dependent variable when all continuous predictors are equal to 0, and all categorical predictors are at their reference (baseline) category.
- Consider whether the intercept is meaningful in the context of the variables and if a value of 0 on a 1–7 scale has a real-world interpretation.