Estimating σ^2 and r^2


Questions and Answers

In linear regression, what does $S_e$ represent?

  • The coefficient of determination.
  • The sum of squared errors.
  • The predicted value of Y.
  • The estimated standard deviation of the error term. (correct)

What does the coefficient of determination ($r^2$) represent?

  • The correlation coefficient.
  • The overall error if the sample mean is used to predict every Y.
  • The typical value of an error when estimating Y.
  • The percentage of reduction in error when using the linear regression of Y on X instead of using the sample mean to predict Y. (correct)

Which of the following is true regarding the sum of squared errors (SSE) in a linear regression model?

  • SSE has the same units as Y.
  • SSE is used to directly calculate the coefficient of determination ($r^2$). (correct)
  • A lower SSE always indicates a better model fit, regardless of the number of predictors.
  • SSE is the square root of the estimate of variance.

If the 95% confidence interval for the slope ($\beta_1$) in a simple linear regression is (0.2, 0.8), which of the following is the most reasonable interpretation?

  • For every one-unit increase in X, Y is expected to increase between 0.2 and 0.8 units, with 95% confidence. (correct)

What assumption about the errors ($e_i$) is assessed using a QQ plot?

  • Normality. (correct)

Which of the following statements is correct regarding the distribution of the estimator $b_1$ in simple linear regression?

  • $b_1$ follows a normal distribution with mean $\beta_1$ and variance $\sigma^2 / \sum(x_i - \bar{x})^2$. (correct)

What is the consequence of predicting Y outside the range of X values used to build the regression model?

  • The errors in prediction may be larger because the relationship may not hold outside the observed range. (correct)

What does a non-constant variance of errors (heteroscedasticity) violate?

  • The assumption of constant variance of errors. (correct)

Which of the following is NOT a typical issue caused by outliers in regression analysis?

  • Ensuring a more robust regression model. (correct)

In the context of assessing the assumptions of a linear regression model, what does examining a plot of residuals versus fitted values help to determine?

  • Whether the errors have constant variance. (correct)

Flashcards

What is Se?

A typical value of error when estimating Y using the linear relationship with X.

What is Σ(yi - ȳ)²?

Overall error if we used the sample mean (ȳ) to predict every yi.

What is Σ(yi - ȳ)² - Σei²?

Reduction in error when using the regression to predict yi, instead of the average (ȳ).

What is r²?

Percentage of reduction in error when using the linear regression to predict Y, instead of using the sample mean.

The idea of the sample mean

A baseline; if the model doesn't do better than this, it's not a good fit.

What are Outliers?

Values far from the bulk of the data that can disproportionately influence regression results: they can skew the regression line and violate the normality and constant-variance assumptions.

How to assess constant variance of errors?

Visually assess if all errors have constant variance by plotting errors vs. predicted values.

Study Notes

  • εi ~ N(0, σ^2)
  • Y|X ~ N(β0 + β1x, σ^2)

Estimating σ^2

  • It can be estimated and has practical meaning
  • SSE = SS(resid) = Σei^2, which is the sum of squared errors (or residuals)
  • Se estimates σ (so Se^2 estimates σ^2) and is calculated as Se = √(SSE/(n-2))
  • Se represents a typical error value when estimating Y using the linear relationship with X
  • SSE has units of Y squared, while Se has units of Y
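
As a rough illustration (not from the lesson), a minimal Python sketch of SSE and Se, assuming `x` and `y` are paired data arrays (the values below are placeholders):

```python
import numpy as np

# Placeholder data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(y)
b1 = np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)  # least-squares slope, r * sy/sx
b0 = y.mean() - b1 * x.mean()                                  # intercept
e = y - (b0 + b1 * x)                                          # residuals ei

SSE = np.sum(e**2)            # sum of squared errors, units of Y^2
Se = np.sqrt(SSE / (n - 2))   # estimates sigma; a typical error, in the units of Y
```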

Coefficient of Determination (r^2)

  • Assesses how well the model fits
  • r^2 is calculated as [Σ(yi - ȳ)^2 - Σei^2] / [Σ(yi - ȳ)^2]
  • Σ(yi - ȳ)^2 represents the overall error if ȳ were used to predict every yi
  • Σei^2 is the overall error if ŷi = b0 + b1xi were used to predict yi
  • Σ(yi - ȳ)^2 - Σei^2 is the error reduction when using the regression to predict yi, instead of using ȳ
  • r^2 is the percentage of error reduction when using the linear regression of Y on X to predict yi, instead of using the sample mean ȳ

Baseline Comparison

  • The sample mean is a baseline; not improving upon it means failing to create a good model
  • r^2 = (correlation coefficient)^2 = (r)^2
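
A hedged sketch of both routes to r^2 (placeholder `x`, `y` arrays again; both expressions should agree):

```python
import numpy as np

# Placeholder data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)
b0 = y.mean() - b1 * x.mean()

SSE = np.sum((y - (b0 + b1 * x))**2)   # error using the regression line
SST = np.sum((y - y.mean())**2)        # error using the sample mean as the baseline

r_squared = (SST - SSE) / SST              # proportional reduction in error
r_squared_alt = np.corrcoef(x, y)[0, 1]**2 # same value: squared correlation coefficient
```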

Cricket Example

  • SSE = 1.014, n = 15, r = 0.823
  • Se = √(SSE/(n-2)) = √(1.014/(15-2)) = 0.279
  • This is the typical error when predicting # of chirps/sec from the linear relationship with temperature (°F): about 0.279 chirps/sec

Interpreting r^2

  • r^2 = (0.823)^2 = 0.6774
  • Using linear regression with # of chirps/sec and temperature reduces overall error by 67.74% compared to using only the sample mean
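
A quick arithmetic check of the cricket numbers (values copied from the notes; differences in the last decimal place reflect rounding of r):

```python
import math

SSE, n, r = 1.014, 15, 0.823        # values given in the cricket example
Se = math.sqrt(SSE / (n - 2))       # ≈ 0.279 chirps/sec
r_squared = r**2                    # ≈ 0.677, roughly a 67.7% reduction in error
print(round(Se, 3), round(r_squared, 4))
```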

Confidence Interval for β1

  • If the errors εi ~ N(0, σ^2), then Y|X is normally distributed, and so the estimator b1 is also normally distributed
  • b1 ~ N(β1, σ^2/Σ(xi - x̄)^2)
  • Σ(xi - x̄)^2 = Sx^2(n-1)
  • Since σ^2 is estimated with Se^2, the standardized slope (b1 - β1) / [Se/√(Sx^2(n-1))] follows a t distribution with d.f. = n-2
  • If the distribution of b1 is known (if all assumptions hold), a confidence interval can be created
  • The confidence interval (CI) for β1 is b1 ± tα/2 * Se / √(Sx^2(n-1)) with d.f. = n-2
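
A small helper that applies this CI formula using scipy's t quantiles (the function name and signature are illustrative, not from the lesson):

```python
import math
from scipy import stats

def slope_ci(b1, Se, Sx, n, conf=0.95):
    """CI for beta1: b1 ± t_{alpha/2, n-2} * Se / sqrt(Sx^2 * (n-1))."""
    se_b1 = Se / math.sqrt(Sx**2 * (n - 1))             # standard error of b1
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 2)  # t critical value
    return b1 - t_crit * se_b1, b1 + t_crit * se_b1
```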

General Rules for Confidence Intervals

  • If both bounds are > 0: significant positive linear relationship
  • If both bounds are < 0: significant negative linear relationship
  • If the bounds contain 0: β1 = 0 is plausible, suggesting no significant linear relationship

Systolic Blood Pressure Example

  • Y = Systolic BP, X = Age
  • Mean Sys BP = 150.09, Mean Age = 62.45, Standard deviation Sys BP = 13.63, Standard deviation Age = 9.11, r = 0.782, n = 11, SSE = 78.3
  • b1 = (0.782) * (13.63/9.11) = 1.170
  • Se = √(78.3/(11-2)) = 2.9495

95% Confidence Interval

  • tα/2 with d.f. = 9 is 2.262
  • 1.170 ± 2.262 × (2.9495) / √(9.11^2 × (11-1)) gives (0.9384, 1.4016)
  • With 95% confidence, when age increases by one year, systolic BP increases by between 0.9384 and 1.4016 on average, indicating a significant positive linear relationship
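
Reproducing this interval in Python (summary values copied from the notes; the output matches the interval above up to rounding):

```python
import math
from scipy import stats

r, sy, sx, n, SSE = 0.782, 13.63, 9.11, 11, 78.3   # summary values from the example
b1 = r * sy / sx                                    # ≈ 1.170
Se = math.sqrt(SSE / (n - 2))                       # ≈ 2.95
se_b1 = Se / math.sqrt(sx**2 * (n - 1))             # standard error of b1
t_crit = stats.t.ppf(0.975, df=n - 2)               # ≈ 2.262
print(b1 - t_crit * se_b1, b1 + t_crit * se_b1)     # ≈ (0.938, 1.402)
```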

Issues for Regression

  • Outliers can cause multiple problems
  • Skewing the regression line
  • Causing non-normality of errors, violating assumptions
  • Causing non-constant variance of errors, violating assumptions
  • Outliers are typically removed from the dataset

Predicting Y

  • The regression is built from data covering a limited range of X values
  • Predicting Y beyond that range, where there is no data (extrapolation), may lead to large errors because the linear relationship may not hold there

Assessing Assumptions

  • Normality of errors is assessed using a QQ plot of the residuals ei or a Shapiro-Wilk test
  • Constant variance is assessed visually by plotting ei vs. ŷi; a constant vertical spread indicates constant variance of errors
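
A sketch of both diagnostic checks, assuming `fitted` and `residuals` come from an already-fitted model (random placeholders stand in for them here) and that matplotlib and scipy are available:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Placeholders standing in for the fitted values and residuals of a real model
rng = np.random.default_rng(0)
fitted = np.linspace(0, 10, 50)
residuals = rng.normal(0, 1, size=50)

# Normality of errors: QQ plot and Shapiro-Wilk test
stats.probplot(residuals, dist="norm", plot=plt)
w_stat, p_value = stats.shapiro(residuals)   # a small p-value casts doubt on normality

# Constant variance: residuals vs. fitted values should show an even vertical spread
plt.figure()
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("fitted values (y-hat)")
plt.ylabel("residuals (e)")
plt.show()
```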
