Model Flexibility and MSE

Questions and Answers

Inflexible models are often more interpretable than flexible models.

True (A)

What is the most commonly used measure for the quality of fit?

mean squared error

The ideal learning method achieves low _____ and low bias.

variance

What is the primary purpose of cross-validation?

To estimate test error rate (A)

What does the Bayes error rate represent?

the lowest possible test error rate

Match the following regression types with their characteristics:

  • Simple Linear Regression = Assumes the form Y = β0 + β1X + ϵ
  • Multiple Linear Regression = Involves multiple predictors X1, X2, ..., Xp
  • Polynomial Regression = Extends linear regression to allow for polynomial relationships

The expected test error rate is minimized by the _____ classifier.

Bayes

The R-squared statistic lies between 0 and 1.

True (A)

Define collinearity.

When two or more predictor variables are closely related to one another.

What does the Akaike Information Criterion (AIC) assess?

Model fit based on maximum likelihood.

What is the main advantage of decision trees?

Easily explained and interpreted (B)

The _____ method combines predictions from multiple weak learners to create a strong prediction.

ensemble

What does PCA stand for?

Principal Component Analysis

Which of the following is a component of the K-means clustering algorithm?

Use a fixed number of clusters determined beforehand (B), Randomly assign observations to clusters (C)

Flashcards

Prediction accuracy vs. model interpretability

Balance between a model's ability to accurately predict and how easily its logic can be understood.

Mean Squared Error (MSE)

Commonly used metric to assess how well a statistical model describes the actual data. Calculated by averaging the squared differences between predicted and actual values.

Training MSE

The MSE calculated from the data used to train a model. Minimizing this can lead to overfitting.

Test MSE

The MSE calculated from data not used during training. It is a better indicator of how well the model generalizes to new data.

Cross-validation

Technique to repeatedly split the data to assess the test MSE and determine how well a model generalizes to an independent data set.

U-shape of test MSE

As model complexity increases, test MSE initially decreases, then increases, forming a U shape.

Expected Test MSE Decomposition

Variance of the model's predictions + squared bias of the model + variance of the error term

Irreducible Error

Error that exists because of the noise in the data and cannot be reduced by any model.

Bias

Error introduced by approximating a real-life problem, which may be complicated, by a simplified model.

Variance

The extent to which a model's prediction would change if fit on a different dataset.

Test Error Rate

Represents the average error rate obtained when using the model to classify new data.

Bayes Classifier

Classifier that assigns each observation to the most likely class based on predictor values.

Bayes Decision Boundary

Boundary where conditional probability of classification is exactly 0.5.

Bayes Error Rate

Lowest possible test error rate achievable by any classifier.

K-Nearest Neighbors (KNN) Classifier

Estimates the conditional distribution of Y given X from the K training observations closest to a test point, then assigns the point to the class with the highest estimated probability.

Choice of K in KNN

In KNN, K is the number of neighbors that vote on each classification. A higher number of neighbors results in higher bias and lower variance.

Simple Linear Regression

Assumes a linear relationship between the input and output variables. A simple technique that's used as a baseline.

Residual Sum of Squares (RSS)

Sum of squared differences between the actual and predicted values; least squares chooses the coefficients that minimize it.

Residual Standard Error (RSE)

Estimate of the standard deviation of the error term in a regression model; roughly the typical difference between observed and predicted values.

Wald Confidence Intervals

Interval where the true value of a coefficient is likely to fall.

Hypothesis Testing in Regression

Tests the hypothesis that there is truly NO relationship (0 slope) between the input and output variables.

R² Statistic

Quantifies the proportion of variance in the response variable explained by the predictors.

Multiple Linear Regression

Regression model with multiple predictors affecting the response variable.

Ordinary Least Squares (OLS) Estimator

OLS is a technique that finds coefficients that minimize the sum of squared differences.

Hypothesis Test for Multiple Predictors

Tests whether there is any relationship between the predictors and the response.

Variable Selection

Process of reducing the number of variables used to fit the model.

Forward Selection

Start with no variables and sequentially add the variable that most improves the model fit.

Backward Selection

Start with all variables and sequentially remove the variable that least impacts the model fit.

Mixed Selection

Combination of forward and backward selection. Adds variables to provide the best fit and removes variables with low impact.

Interpreting R² in Multiple Regression

For multiple linear regression, an R² value close to 1 indicates that the model explains a large portion of the variance in the response variable.

Multiple Regression RSE

Uses RSE = sqrt(RSS / (n - p - 1)), which penalizes adding more variables: the RSE can rise when an added variable reduces RSS only slightly.

Qualitative Predictors (Factors)

Predictors known as factors. A two-level factor is coded with a single indicator (dummy) variable; a factor with more than two levels needs additional dummy variables.

Interaction Term

A product term such as X1X2 added to the model to relax the additive assumption of the standard linear model, in which the effects of the predictors are both additive and linear.

Residual Plots

Plots of the residuals used to identify non-linearity. If a pattern appears, one approach is to use non-linear transformations of the predictors.

Heteroscedasticity

Non-constant variance of the error terms. Transforming the response with a concave function such as log(Y) or sqrt(Y) may reduce heteroscedasticity. If there is a good idea of the variance of each response, weighted least squares can be used instead.

Outlier

A point whose yi is far from the value predicted by the model. Removing an outlier may have little effect on the fitted model but a large effect on the RSE.

Studentized Residuals

Each residual divided by its estimated standard error. Used to detect possible outliers: observations whose studentized residuals are greater than 3 in absolute value.

Study Notes

  • Prediction accuracy and model interpretability often involve a trade-off
  • Inflexible models are usually more interpretable than flexible ones
  • Mean squared error (MSE) measures the quality of fit

MSE Formula

  • MSE = (1/n) * Σ(yi - f(xi))^2, where the sum is from i=1 to n
  • f represents the fitted model

Training vs. Test MSE

  • Training MSE is minimized to find f
  • Test MSE is minimized by the ideal f
  • Cross-validation estimates test MSE
  • Test MSE typically follows a U-shape, decreasing with flexibility before increasing again
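
The contrast between training and test MSE can be seen in a small simulation. This is only a sketch under stated assumptions: the flexible method is K-nearest-neighbour averaging (smaller K means more flexibility), which is a stand-in chosen for brevity, and the data-generating function sin(x), the noise level 0.3 and the helper names are all hypothetical.

```python
import math
import random

def mse(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def knn_predict(x_train, y_train, x0, k):
    """Predict at x0 by averaging the responses of the k nearest training points."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))[:k]
    return sum(y_train[i] for i in nearest) / k

def simulate(n, rng):
    """Hypothetical data: y = sin(x) + Gaussian noise with standard deviation 0.3."""
    xs = [rng.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

rng = random.Random(0)
x_train, y_train = simulate(100, rng)
x_test, y_test = simulate(100, rng)

# Smaller K = more flexible fit. results[k] = (training MSE, test MSE).
results = {}
for k in (1, 5, 20, 100):
    train_pred = [knn_predict(x_train, y_train, x, k) for x in x_train]
    test_pred = [knn_predict(x_train, y_train, x, k) for x in x_test]
    results[k] = (mse(y_train, train_pred), mse(y_test, test_pred))
    print(f"K={k:3d}  training MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With K = 1 each training point is its own nearest neighbour, so the training MSE is exactly zero even though the test MSE is not. That is the overfitting pattern the bullets describe.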

Expected Test MSE Decomposition

  • E(y0 - f(x0))^2 = var(f(x0)) + [Bias(f(x0))]^2 + var(ϵ)
  • Ideal learning methods achieve both low variance and low bias
  • Irreducible error (var(ϵ)) serves as a lower bound for expected test MSE
  • Bias arises from simplifying complex relationships with simple models
  • Greater model flexibility generally reduces bias, but increases variance
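
The decomposition can be approximated by simulation: refit the same method on many training sets and look at how its prediction at one point x0 varies. The sketch below again assumes K-nearest-neighbour averaging as the method and a hypothetical true function sin(x) with noise standard deviation 0.3; none of these choices come from the notes.

```python
import math
import random

def knn_predict(x_train, y_train, x0, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))[:k]
    return sum(y_train[i] for i in nearest) / k

def bias_variance_at(x0, k, n=50, sims=300, noise_sd=0.3, seed=0):
    """Monte Carlo estimate of (variance, squared bias, irreducible error) at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(sims):
        # Draw a fresh training set each time (assumed truth: y = sin(x) + noise).
        xs = [rng.uniform(0, 6) for _ in range(n)]
        ys = [math.sin(x) + rng.gauss(0, noise_sd) for x in xs]
        preds.append(knn_predict(xs, ys, x0, k))
    mean_pred = sum(preds) / sims
    variance = sum((p - mean_pred) ** 2 for p in preds) / sims
    bias_sq = (mean_pred - math.sin(x0)) ** 2
    return variance, bias_sq, noise_sd ** 2

x0 = math.pi / 2
flexible = bias_variance_at(x0, k=1)    # very flexible fit
rigid = bias_variance_at(x0, k=25)      # much smoother fit
print("K=1  variance, bias^2, var(eps):", flexible)
print("K=25 variance, bias^2, var(eps):", rigid)
```

In this setup the flexible fit should show the larger variance and the smoother fit the larger squared bias, while var(ϵ) is the same for both.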

Variance

  • Variance measures the change in f with different training datasets
  • More flexible statistical methods tend to have higher variance

Classification Setting

  • Training MSE is replaced by training error rate: (1/n) * Σ1{yi ≠ ŷi}, where the sum is from i=1 to n

Ideal Classifier

  • Minimizes test error rate
  • Achieved by the Bayes classifier, which assigns observations to the class j with the largest P(Y = j | X = x0)
  • Example: In a two-class scenario, assigns to class 1 if P(Y = 1 | X = x0) > 0.5
  • Bayes decision boundary is where the conditional probability equals 0.5
  • Bayes error rate is the lowest possible test error rate

Bayes Error Rate

  • 1 - E(max P(Y = j | X)), where the max is over j
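
For a discrete predictor the expectation is just a weighted sum, so the Bayes error rate can be computed by hand. The distribution below is invented purely for illustration.

```python
# Hypothetical two-class problem: X takes three values, Y is 0 or 1.
p_x = {"a": 0.5, "b": 0.3, "c": 0.2}            # P(X = x)
p_y1_given_x = {"a": 0.9, "b": 0.5, "c": 0.2}   # P(Y = 1 | X = x)

def bayes_classifier(x):
    """Assign class 1 when P(Y = 1 | X = x) exceeds 0.5, otherwise class 0."""
    return 1 if p_y1_given_x[x] > 0.5 else 0

def bayes_error_rate():
    """1 - E[max_j P(Y = j | X)], with the expectation taken over X."""
    expected_max = sum(
        p_x[x] * max(p_y1_given_x[x], 1 - p_y1_given_x[x]) for x in p_x
    )
    return 1 - expected_max

error = bayes_error_rate()
print(error)
```

Here the expected maximum is 0.5 * 0.9 + 0.3 * 0.5 + 0.2 * 0.8 = 0.76, so no classifier can do better than a test error rate of about 0.24 on this distribution.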

KNN Classifier

  • K-nearest neighbors classifier (KNN) estimates conditional distribution for classification
  • P(Y = j | X = x0) = (1/K) * Σ1{yi = j}, where the sum is over i ∈ N0
  • N0 represents K points closest to x0 in the training data
  • KNN classifies x0 based on the largest estimated conditional probability
  • Higher K values lead to smoother classifiers with higher bias and lower variance; lower K values do the opposite
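
The KNN rule above can be sketched in a few lines of plain Python. Euclidean distance, the toy training points and the function names are assumptions made for the example.

```python
import math
from collections import Counter

def knn_class_probs(train, x0, k):
    """Estimate P(Y = j | X = x0) as the share of the k nearest points in class j.

    train is a list of (point, label) pairs; Euclidean distance is assumed.
    """
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], x0))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: count / k for label, count in counts.items()}

def knn_classify(train, x0, k):
    """Assign x0 to the class with the largest estimated conditional probability."""
    probs = knn_class_probs(train, x0, k)
    return max(probs, key=probs.get)

# Hypothetical training data: two well-separated groups.
train = [
    ((0, 0), "blue"), ((0, 1), "blue"), ((1, 0), "blue"),
    ((3, 3), "orange"), ((3, 4), "orange"), ((4, 3), "orange"),
]

print(knn_classify(train, (0.5, 0.5), k=3))     # the three nearest points are all blue
print(knn_class_probs(train, (0.5, 0.5), k=5))  # three blue and two orange neighbours
```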

Simple Linear Regression

  • Simple linear regression model: Y = β0 + β1X + ϵ
  • βi estimates are found by minimizing residual sum of squares (RSS)

RSS Formula

  • RSS = Σe_i^2 = Σ(yi - ŷi)^2, where the sum is from i=1 to n

Minimizers

  • β̂1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)^2, where the sum is from i=1 to n
  • β̂0 = ȳ - β̂1x̄
  • ȳ = (1/n) * Σyi where the sum is from i=1 to n
  • x̄ = (1/n) * Σxi where the sum is from i=1 to n
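
The minimizers translate directly into code. The sketch below uses a small made-up data set; the function name is hypothetical.

```python
def fit_simple_linear(xs, ys):
    """Least squares estimates (beta0_hat, beta1_hat) for Y = beta0 + beta1 * X + eps."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Hypothetical data, roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
beta0, beta1 = fit_simple_linear(xs, ys)
print(beta0, beta1)
```

For these five points the sums are Σ(xi - x̄)(yi - ȳ) = 19.9 and Σ(xi - x̄)^2 = 10, giving a slope of about 1.99 and an intercept of about 0.05.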

se(β̂0)^2 Formula

  • se(β̂0)^2 = σ^2 * [(1/n) + (x̄^2 / Σ(xi - x̄)^2)], where the sum is from i=1 to n and σ^2 = var(ϵ)

se(β̂1)^2 Formula

  • se(β̂1)^2 = σ^2 / Σ(xi - x̄)^2, where the sum is from i=1 to n and σ^2 = var(ϵ)
  • Formulas assume uncorrelated errors with equal variance σ^2
  • se(β̂1) smaller when xi values are more spread out

σ^2 Estimate

  • σ^2 is typically unknown, but σ can be estimated by the residual standard error (RSE), so RSE^2 estimates σ^2

RSE Formula

  • RSE = sqrt(RSS / (n-2))

Wald Confidence Intervals

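  • Built from the standard errors; for example, an approximate 95% interval for β1 is β̂1 ± 2 * se(β̂1)

Hypothesis Tests

  • Standard errors are also used to perform hypothesis tests on the coefficients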

Null Hypothesis Test

  • H0: β1 = 0, H1: β1 ≠ 0
  • Assesses relationship between X and Y

t-statistic Formula

  • t = β̂1 / SE(β̂1)
  • Under H0, t follows a t distribution with n - 2 degrees of freedom
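
The standard errors, RSE and t-statistic can be computed together once the line is fitted. This is a sketch on made-up data; the interval uses the common rule of thumb of two standard errors on either side of the estimate, which is an approximation rather than an exact t quantile.

```python
import math

def simple_regression_inference(xs, ys):
    """Fit Y = beta0 + beta1 * X by least squares and return basic inference summaries."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    beta0 = y_bar - beta1 * x_bar

    rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    rse = math.sqrt(rss / (n - 2))                 # estimate of sigma

    se_beta0 = rse * math.sqrt(1 / n + x_bar ** 2 / s_xx)
    se_beta1 = rse / math.sqrt(s_xx)
    t_stat = beta1 / se_beta1                      # tests H0: beta1 = 0

    # Approximate 95% interval: estimate plus or minus two standard errors.
    ci_beta1 = (beta1 - 2 * se_beta1, beta1 + 2 * se_beta1)
    return {"beta0": beta0, "beta1": beta1, "rss": rss, "rse": rse,
            "se_beta0": se_beta0, "se_beta1": se_beta1,
            "t": t_stat, "ci_beta1": ci_beta1}

# Hypothetical data, roughly following y = 2x.
out = simple_regression_inference([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
for name, value in out.items():
    print(name, value)
```

Turning the t-statistic into a p-value needs the t distribution with n - 2 degrees of freedom, which is left out here to keep the example dependency-free.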

Model Fit Quantification

  • RSE and R^2 statistic

R^2 Formula

  • R^2 = (TSS - RSS) / TSS = 1 - (RSS / TSS)
  • States the proportion of variance explained
  • TSS = Σ(yi - ȳ)^2, the total sum of squares, from i=1 to n
  • R^2 always lies between 0 and 1
  • R^2 = r^2 in simple linear regression, where r is the sample correlation between X and Y
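
A quick numerical check of the last point: for a least squares line, the proportion of variance explained equals the squared sample correlation. The data and helper names are hypothetical.

```python
import math

def r_squared_simple(xs, ys):
    """R^2 = 1 - RSS / TSS for the least squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    tss = sum((y - y_bar) ** 2 for y in ys)
    return 1 - rss / tss

def sample_correlation(xs, ys):
    """Pearson sample correlation r between xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r2 = r_squared_simple(xs, ys)
r = sample_correlation(xs, ys)
print(r2, r ** 2)
```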

Multiple Linear Regression Overview

  • Model form: Y = β0 + Σ(βk * Xk) + ϵ, where sum is from k=1 to p
  • Xj represents the jth predictor, and βj quantifies the association
  • Each βj is the average effect on Y of a unit increase in Xj, holding other predictors constant

Prediction Formula

  • Given estimators β̂0, β̂1,..., β̂p, the predictions take the form: ŷ = β̂0 + Σ(β̂k * xk), where sum is from k=1 to p

Multiple Linear Regression Minimization

  • Minimizes the sum of squared residuals
  • RSS = Σ(yi - ŷi)^2 = Σ(yi - β̂0 - Σ(β̂k * xik))^2

OLS Estimator Formula

  • Given the full rank design matrix X, the OLS estimator is given by the equation: β̂_OLS = (X^T X)^(-1) X^T Y
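
The estimator can be sketched without any libraries by forming X^T X and X^T Y and solving the resulting linear system. Solving the system rather than inverting the matrix explicitly is a design choice of this sketch, and the Gaussian elimination helper and the data are illustrative assumptions.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting.

    Assumes a is non-singular (here: the design matrix has full rank).
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def ols(rows, ys):
    """Return [beta0_hat, beta1_hat, ..., betap_hat] from the normal equations."""
    design = [[1.0] + list(row) for row in rows]    # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical noise-free data generated from y = 1 + 2*x1 - 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
beta_hat = ols(rows, ys)
print(beta_hat)
```

Because the responses contain no noise, the estimates recover the generating coefficients up to rounding error.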

Hypothesis Test Overview

  • Answers if there is a relationship between response and predictors
  • Null hypothesis: H0: β1 = β2 = · · · = βp = 0

Alternative Hypothesis

  • At least one βk for k > 0 is not zero

F-statistic Formula

  • F = [(TSS - RSS)/p] / [RSS/(n - p - 1)]
  • TSS = Σ(yi - ȳ)^2, from i=1 to n

Model Assumptions

  • If model assumptions are correct, then E[RSS/(n - p - 1)] = σ^2
  • If H0 holds, then E[(TSS - RSS)/p] = σ^2

F-Statistic Relationship With Response

  • If there is no relationship between the response and predictors, we expect the F-statistic to be close to 1
  • If the alternative hypothesis holds, then E[(TSS - RSS)/p] > σ^2 so we expect F to be greater than 1

Hypothesis Test on Subset of Coefficients

  • Sometimes we would like to instead test that a particular subset of q of the coefficients are zero. That is, H0: β_(p−q+1) = β_(p−q+2) = · · · = βp = 0
  • Let RSS0 denote the residual sum of squares for a second model that uses all variables except the last q. The F-statistic is then computed from RSS0 and RSS

F-Statistic Formula (Subset)

  • F = [(RSS0 − RSS)/q] / [RSS/(n − p − 1)]
  • This F-statistic reports the partial effect of adding these q variables. In all of these cases, for nested models with p1 < p2 predictors, we compare with the F distribution with (p2 − p1, n − p2 − 1) degrees of freedom
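
Both F-statistics share one formula: the overall test is the subset test with RSS0 replaced by TSS and q = p. A minimal sketch, with invented sums of squares standing in for real fits:

```python
def partial_f_statistic(rss0, rss, q, n, p):
    """F = [(RSS0 - RSS) / q] / [RSS / (n - p - 1)].

    rss0: RSS of the smaller model that omits q predictors
    rss:  RSS of the full model with p predictors, fitted on n observations
    """
    return ((rss0 - rss) / q) / (rss / (n - p - 1))

def overall_f_statistic(tss, rss, n, p):
    """Test of H0: all p coefficients are zero; the smaller model is intercept-only."""
    return partial_f_statistic(tss, rss, p, n, p)

# Hypothetical values: n = 200 observations, p = 3 predictors.
n, p = 200, 3
tss, rss = 5000.0, 1200.0
rss0 = 1500.0   # RSS after dropping the last q = 2 predictors

f_overall = overall_f_statistic(tss, rss, n, p)
f_partial = partial_f_statistic(rss0, rss, 2, n, p)
print(f_overall, f_partial)
```

Here the partial statistic is (300 / 2) / (1200 / 196) = 24.5. Converting either value to a p-value requires the F distribution with the degrees of freedom given above, which this sketch does not compute.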

Variable Selection

  • Variable selection determines which predictors are associated with the response, in order to fit a single model involving only those predictors
  • Various statistics can be used to judge the quality of a model, including Mallows' Cp, Akaike information criterion (AIC), Bayesian information criterion (BIC), and adjusted R^2

Efficiency Approach

  • Since there are a total of 2^p models that contain subsets of p variables, it is infeasible to try every subset when p is large. More efficient approaches must be taken

Three Classical Approaches

  • FORWARD SELECTION: Begin with the null model then fit p simple linear regressions and add to the null model the variable that results in the lowest RSS. We then add to that model the variable that results in the lowest RSS for the new two-variable model and continue until some stopping rule is satisfied
  • BACKWARD SELECTION: We start with all variables in the model and remove the variable with the largest p-value. The new (p − 1)-variable model is fit and the variable with the largest p-value is removed and we continue this procedure until a stopping rule is reached
  • MIXED SELECTION: This is a combination of forward and backward selection. Start with no variables, then add the variables that provide the best fit one-by-one. If at any point the p-value for one of the variables in the model rises above a threshold, we remove that variable from the model and we continue to perform forward and backward steps until all variables in the model have a sufficiently low p-value.
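
Forward selection as described above can be sketched in plain Python. The stopping rule used here (stop after a fixed number of variables), the helper names and the toy data are assumptions; the notes leave the stopping rule open.

```python
from itertools import product

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (a non-singular)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def rss_of_fit(X, y, cols):
    """RSS of the least squares fit of y on an intercept plus the columns in cols."""
    design = [[1.0] + [row[c] for c in cols] for row in X]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(design, y))

def forward_selection(X, y, max_vars):
    """Greedily add the variable giving the lowest RSS until max_vars are chosen."""
    selected, remaining = [], list(range(len(X[0])))
    while remaining and len(selected) < max_vars:
        best = min(remaining, key=lambda c: rss_of_fit(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical data: a 2^3 factorial design where y depends strongly on column 2,
# weakly on column 1, and not at all on column 0.
X = [list(row) for row in product([-1.0, 1.0], repeat=3)]
y = [5.0 * row[2] + 1.0 * row[1] for row in X]
selected = forward_selection(X, y, max_vars=2)
print(selected)
```

The procedure picks column 2 first, since it alone leaves the smallest RSS, and then column 1.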

Forward Selection

  • Forward selection is a greedy approach and can include variables that later become redundant

Multiple Linear Regression

  • Backward selection cannot be used if p > n, while forward selection can always be used
  • In multiple linear regression, R^2 = cor(Y, Ŷ)^2. An R^2 value close to 1 indicates that the model explains a large portion of the variance in the response variable

R^2 in Machine Learning

  • The R^2 statistic will always increase when more variables are added to the model, even if those variables are only weakly associated with the response, since adding another variable always results in a decrease in the residual sum of squares on the training data

Residual Standard Error Formula

  • For multiple linear regression the RSE is defined as RSE = sqrt(RSS / (n - p - 1))
  • Models with more variables can have higher RSE if the decrease in RSS is small relative to the increase in p

Trade Off in Machine Learning

  • The inaccuracy in the coefficient estimates is related to the reducible error, and one can compute a confidence interval to quantify how close ŷ is likely to be to f(X)

Error in Machine Learning

  • Even if the true values for β0, β1, · · · , βp are known, the response cannot be predicted perfectly because of the random error ϵ
  • The error due to ϵ is called the irreducible error. Prediction intervals are always wider than confidence intervals since they incorporate both the error in the estimate of the population regression plane and the randomness of the individual point

Predictors in Machine Learning

  • Sometimes qualitative predictors known as factors are used
  • If a factor has two levels, we can create a dummy (indicator) variable that takes the values 0 and 1
  • When a qualitative predictor has more than two levels, we can create additional dummy variables
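
A factor can be turned into dummy variables with a few lines of code. The sketch below assumes the usual convention of one dummy per level except a baseline level, so a two-level factor yields one indicator column; the function name and example values are hypothetical.

```python
def dummy_encode(values, baseline=None):
    """Encode a factor as 0/1 dummy variables, one per non-baseline level.

    Returns (dummy_levels, rows) where rows[i][j] is 1 if values[i] equals
    dummy_levels[j]. The baseline level is encoded as all zeros.
    """
    levels = sorted(set(values))
    if baseline is None:
        baseline = levels[0]
    dummy_levels = [level for level in levels if level != baseline]
    rows = [[1 if value == level else 0 for level in dummy_levels]
            for value in values]
    return dummy_levels, rows

# Two-level factor: a single indicator column.
print(dummy_encode(["yes", "no", "yes"], baseline="no"))

# Three-level factor: two dummy columns, with "east" as the baseline.
levels, rows = dummy_encode(["east", "west", "south", "east"], baseline="east")
print(levels, rows)
```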

Assumptions of Linear Regression

  • The standard linear regression model has several highly restrictive assumptions that are often violated
  • It assumes that the relationship between the predictors and response is additive and linear
  • One way of relaxing the additive assumption is by introducing an interaction term
  • The predictor X1 X2 is said to be an interaction term constructed by computing the product of X1 and X2
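
With an interaction term the effect of a unit increase in X1 is no longer constant: it depends on the value of X2. A small sketch with invented coefficients:

```python
def add_interaction(rows):
    """Append the product X1 * X2 to each (X1, X2) row of predictors."""
    return [[x1, x2, x1 * x2] for x1, x2 in rows]

def predict(beta, x1, x2):
    """Prediction from Y = b0 + b1*X1 + b2*X2 + b3*X1*X2 (error term omitted)."""
    b0, b1, b2, b3 = beta
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

# Hypothetical coefficients (b0, b1, b2, b3).
beta = (1.0, 2.0, 0.5, 0.25)

# Effect of increasing X1 by one unit, at two different values of X2.
effect_at_x2_0 = predict(beta, 1, 0) - predict(beta, 0, 0)   # b1 + b3 * 0
effect_at_x2_4 = predict(beta, 1, 4) - predict(beta, 0, 4)   # b1 + b3 * 4
print(effect_at_x2_0, effect_at_x2_4)
print(add_interaction([(1, 2), (3, 4)]))
```

In a purely additive model the two effects would be identical; here they are b1 and b1 + 4 * b3.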

List Of Common Problems With Machine Learning

  • Non-linearity of the response-predictor relationships
  • Correlation of error terms
  • Non-constant variance of error terms
  • Outliers
  • High-leverage points
  • Collinearity
  • If the true relationship is far from linear, then all conclusions drawn from the fit are suspect
