Questions and Answers
In linear regression, what does $S_e$ represent?
- The coefficient of determination.
- The sum of squared errors.
- The predicted value of Y.
- The estimated standard deviation of the error term. (correct)
What does the coefficient of determination ($r^2$) represent?
- The correlation coefficient.
- The overall error if the sample mean is used to predict every Y.
- The typical value of an error when estimating Y.
- The percentage of reduction in error when using the linear regression of Y on X instead of using the sample mean to predict Y. (correct)
Which of the following is true regarding the sum of squared errors (SSE) in a linear regression model?
- SSE has the same units as Y.
- SSE is used to directly calculate the coefficient of determination ($r^2$). (correct)
- A lower SSE always indicates a better model fit, regardless of the number of predictors.
- SSE is the square root of the estimate of variance.
If the 95% confidence interval for the slope ($\beta_1$) in a simple linear regression is (0.2, 0.8), which of the following is the most reasonable interpretation?
What assumption about the errors ($e_i$) is assessed using a QQ plot?
Which of the following statements is correct regarding the distribution of the estimator $b_1$ in simple linear regression?
What is the consequence of predicting Y outside the range of X values used to build the regression model?
What does a non-constant variance of errors (heteroscedasticity) violate?
Which of the following is NOT a typical issue caused by outliers in regression analysis?
In the context of assessing the assumptions of a linear regression model, what does examining a plot of residuals versus fitted values help to determine?
Flashcards
What is Se?
A typical value of error when estimating Y using the linear relationship with X.
What is Σ(yi - ȳ)²?
Overall error if we used the sample mean (ȳ) to predict every yi.
What is Σ(yi - ȳ)² - Σei²?
Reduction in error when using regression to predict yi, instead of the average (ȳ).
What is r²?
The idea of the sample mean
What are Outliers?
How to assess constant variance of errors?
Study Notes
- Ei ~ N(0, σ^2)
- Y|X ~ N(β0 + β1x, σ^2)
Estimating σ^2
- It can be estimated and has practical meaning
- SSE = SS(resid) = Σei^2, which is the sum of squared errors (or residuals)
- Se estimates σ (so Se^2 estimates σ^2) and is calculated as Se = √(SSE/(n-2))
- Se represents a typical error value when estimating Y using the linear relationship with X
- SSE has units of Y squared, while Se has units of Y
Coefficient of Determination (r^2)
- Assesses how well the model fits
- r^2 is calculated as [Σ(yi - ȳ)^2 - Σei^2] / [Σ(yi - ȳ)^2]
- Σ(yi - ȳ)^2 represents the overall error if ȳ were used to predict every yi.
- Σei^2 is the overall error if ŷi = b0 + b1Xi were used to predict yi.
- Σ(yi - ȳ)^2 - Σei^2 is the error reduction when using regression to predict yi, instead of using ȳ.
- r^2 is the percentage of error reduction when using the linear regression of Y on X to predict yi, instead of using the sample mean
Baseline Comparison
- The sample mean is a baseline; not improving upon it means failing to create a good model
- r^2 = (correlation coefficient)^2 = (r)^2
Cricket Example
- SSE = 1.014, n = 15, r = 0.823
- Se = √(SSE/(n-2)) = √(1.014/(15-2)) = 0.279
- It's a typical error when predicting # of chirps/sec with the linear relationship with temperature (°F), amounting to about 0.279 chirps/sec
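This arithmetic can be checked directly; a minimal Python sketch using only the values quoted above:

```python
import math

# Values quoted in the cricket example: SSE = 1.014, n = 15.
SSE, n = 1.014, 15

# Se = sqrt(SSE / (n - 2)) estimates the typical prediction error,
# in the same units as Y (chirps/sec).
Se = math.sqrt(SSE / (n - 2))
print(round(Se, 3))  # prints 0.279
```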
Interpreting r^2
- r^2 = (0.823)^2 = 0.6774
- Using linear regression with # of chirps/sec and temperature reduces overall error by 67.74% compared to using only the sample mean
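The same check for r^2; squaring the rounded r = 0.823 gives 0.6773, which agrees with the quoted 67.74% to three decimals (the small gap likely comes from an unrounded r):

```python
r = 0.823          # correlation from the cricket example
r2 = r ** 2        # coefficient of determination
print(round(r2, 4))  # prints 0.6773
```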
Confidence Interval for β1
- If Ei ~ N(0, σ^2), then Y is normally distributed, and so b1 is also normally distributed
- b1 ~ N(β1, σ^2/Σ(xi - x̄)^2)
- Σ(xi - x̄)^2 = Sx^2(n-1)
- Since σ^2 is estimated with Se^2, the standardized slope (b1 - β1)/SE(b1) follows a t distribution with d.f. = n-2
- If the distribution of b1 is known (if all assumptions hold), a confidence interval can be created
- The confidence interval (CI) for β1 is b1 ± tα/2 * Se / √(Sx^2(n-1)) with d.f. = n-2
General Rules for Confidence Intervals
- If both bounds are > 0: significant positive linear relationship
- If both bounds are < 0: significant negative linear relationship
- If the bounds contain 0: β1 = 0 is plausible, suggesting no significant linear relationship
Systolic Blood Pressure Example
- Y = Systolic BP, X = Age
- Mean Sys BP = 150.09, Mean Age = 62.45, Standard deviation Sys BP = 13.63, Standard deviation Age = 9.11, r = 0.782, n = 11, SSE = 78.3
- b1 = (0.782) * (13.63/9.11) = 1.170
- Se = √(78.3/(11-2)) = 2.9495
95% Confidence Interval
- tα/2 with d.f. = 9 is 2.262
- 1.170 ± 2.262 * (2.9495) / √(9.11^2(11-1)) => (0.9384, 1.4016)
- It can be 95% confidently stated that when age increases by one year, systolic BP increases by between 0.9384 and 1.4016 on average, indicating a significant positive linear relationship
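The interval can be reproduced end-to-end from the summary statistics; a minimal Python sketch (only the t value 2.262 is taken from a table):

```python
import math

# Summary statistics from the systolic BP example.
r, s_y, s_x = 0.782, 13.63, 9.11
n, SSE = 11, 78.3
t = 2.262  # t_{alpha/2} with d.f. = n - 2 = 9

b1 = r * (s_y / s_x)                         # slope estimate, about 1.170
Se = math.sqrt(SSE / (n - 2))                # estimated error SD
margin = t * Se / math.sqrt(s_x**2 * (n - 1))

print(round(b1 - margin, 4), round(b1 + margin, 4))  # prints 0.9384 1.4016
```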
Issues for Regression
- Outliers can cause multiple problems
- Skewing the regression line
- Causing non-normality of errors, violating assumptions
- Causing non-constant variance of errors, violating assumptions
- Outliers are typically removed from the dataset
Predicting Y
- Data on X values is needed
- Predicting Y beyond the range where there is data (extrapolation) may lead to large errors
Assessing Assumptions
- Normality of errors is assessed using QQ plots of the residuals ei or a Shapiro-Wilk test
- Constant variance visually assessed by plotting ei vs. ŷi; constant vertical spread indicates constant variance of errors
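The quantities behind both diagnostics can be computed directly; a minimal pure-Python sketch with made-up (hypothetical) data, showing the fitted values ŷi and residuals ei one would plot:

```python
# Hypothetical data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept.
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * xi for xi in x]             # yhat_i
resid = [yi - fi for yi, fi in zip(y, fitted)]  # e_i

# Plot resid vs. fitted and look for constant vertical spread;
# QQ-plot resid to check normality. Residuals sum to (numerically) zero.
print([round(e, 2) for e in resid])
```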