Questions and Answers
What is the primary purpose of adding polynomial terms in polynomial regression?
Which method begins with no variables in a model and tests each variable as it is added?
How does regularization assist in regression analysis?
What is a characteristic of stepwise regression?
What is a common strategy to prevent overfitting in regression models?
What is a key characteristic of Lasso Regression?
Which method combines penalties from both Lasso and Ridge techniques?
How does Ridge Regression differ from Lasso Regression?
What is the primary goal of applying regularization techniques in regression models?
Which statement accurately describes a benefit of using Elastic Net over Lasso?
Study Notes
Heteroscedasticity
- Heteroscedasticity (also spelled heteroskedasticity) occurs when the variance of the residuals is not constant across observations, for example when the spread of the errors grows with the magnitude of the predicted value.
Polynomial Regression
- Polynomial regression is a form of linear regression tailored for non-linear relationships between dependent and independent variables.
- The model can be represented as: y = a₀ + a₁x₁ + a₂x₁² + … + aₙx₁ⁿ.
- The polynomial degree is a hyperparameter that must be chosen carefully to avoid overfitting.
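As a hedged illustration (not from the source), the sketch below fits the quadratic case of the equation above with scikit-learn; the synthetic data and the choice degree=2 are assumptions.

```python
# A minimal polynomial-regression sketch: expand x into [1, x, x^2]
# and fit the coefficients a0, a1, a2 with ordinary least squares.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))                      # assumed synthetic data
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(0, 0.3, 100)

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict([[1.5]]))  # prediction from the fitted quadratic
```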
Overcoming Overfitting
- Model Complexity Reduction: Simplifying the model can help mitigate overfitting.
- Stepwise Regression: An iterative method of model building that adds or removes explanatory variables based on statistical significance (a forward-selection sketch follows this list).
  - Forward Selection: Begins with no variables, then tests each variable as it is added.
  - Backward Elimination: Starts with all variables, removing one at a time based on statistical significance.
  - Bidirectional Elimination: Combines forward and backward methods to determine which variables to include or exclude.
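The sketch below is a minimal, assumed implementation of forward selection: it greedily adds whichever remaining feature most improves cross-validated R², stopping when no candidate helps. The dataset and the scoring rule are illustrative choices, not from the source.

```python
# Forward selection: start with no features, add the best one each round.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf

while remaining:
    # score each candidate feature when added to the current set
    scores = {
        j: cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=5).mean()
        for j in remaining
    }
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best_score:  # no candidate improves the model: stop
        break
    best_score = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print("selected feature indices:", selected, "CV R^2:", round(best_score, 3))
```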
Regularization Techniques
- Regularization is used to limit or shrink estimated coefficients to avoid overfitting.
- It reduces validation loss and improves generalization by penalizing large coefficients, thereby curbing high-variance models.
Types of Regularization
- Lasso Regularization (L1): Stands for Least Absolute Shrinkage and Selection Operator. Adds an L1 penalty, the sum of the absolute values of the beta coefficients; this penalty can shrink some coefficients exactly to zero, effectively performing feature selection.
- Ridge Regularization (L2): Applies an L2 penalty, the sum of the squares of the beta coefficients' magnitudes; coefficients shrink toward zero but are never set exactly to zero.
- Elastic Net Regression: Combines penalties from both Lasso and Ridge. Rectifies Lasso's limitations in high-dimensional data by allowing the inclusion of multiple variables until saturation. Handles groups of highly correlated variables effectively.
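To make the three penalties concrete, here is a minimal comparison sketch, assuming scikit-learn and untuned alpha/l1_ratio values; note how only the L1-based models drive coefficients exactly to zero.

```python
# Fit L1, L2, and Elastic Net penalties on the same synthetic data
# and count how many coefficients each one zeroes out.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for model in (Lasso(alpha=1.0), Ridge(alpha=1.0),
              ElasticNet(alpha=1.0, l1_ratio=0.5)):
    model.fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"{type(model).__name__}: {n_zero} coefficients shrunk exactly to zero")
```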
Clustering
- Clustering is an unsupervised learning method focused on identifying patterns in unlabeled input data.
- It categorizes data points into groups based on similarities.
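A minimal clustering sketch, assuming k-means with an assumed n_clusters=3; the labels returned by make_blobs are deliberately ignored, since clustering works only from the unlabeled inputs.

```python
# Group unlabeled points into clusters by similarity with k-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)  # labels discarded
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])  # cluster assignment for the first ten points
```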
Classification
- Classification assigns data points to discrete categories based on their characteristics and features; it is part of supervised learning.
- The model is trained using a dataset with features and corresponding labels, then tested on a separate dataset.
- Regression applies to continuous variables, while classification deals with discrete variables.
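A minimal sketch of the train/test workflow described above, assuming scikit-learn, the iris dataset, and logistic regression as the classifier.

```python
# Train on a labeled dataset, then evaluate on a held-out test split.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))  # discrete class labels
```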
Bias vs Variance
- Bias: The difference between the average model prediction and the actual value. High bias indicates oversimplification, leading to underfitting.
- Variance: The variability of model predictions for a given data point. High variance indicates a tendency to fit the training data closely and potentially overfit.
Overfitting vs Underfitting
- Underfitting: Occurs when a model fails to capture underlying data patterns, characterized by high bias and low variance.
- Overfitting: Happens when a model learns noise along with the pattern, marked by low bias and high variance.
Bias-Variance Trade-off
- Achieving a balance between bias and variance is crucial to avoid both overfitting and underfitting, resulting in a well-generalized model.
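A minimal sketch of the trade-off, assuming synthetic data and an arbitrary set of degrees: a degree-1 fit underfits (high bias), a moderate degree generalizes, and a high degree scores well on training data but drops on validation (high variance).

```python
# Sweep polynomial degree and compare training vs. validation R^2.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(x[:, 0]) + rng.normal(0, 0.2, 60)
x_tr, x_va, y_tr, y_va = train_test_split(x, y, test_size=0.5, random_state=0)

for degree in (1, 3, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    print(f"degree {degree:2d}: train R^2 = {model.score(x_tr, y_tr):.2f}, "
          f"validation R^2 = {model.score(x_va, y_va):.2f}")
```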
Description
This quiz covers polynomial regression, heteroscedasticity, and strategies for avoiding overfitting, including stepwise regression and the Lasso, Ridge, and Elastic Net regularization techniques. Understand non-linear relationships between dependent and independent variables, and explore how polynomial terms extend linear regression models.