Introduction to Data Partitioning and Cross-Validation
15 Questions


Created by
@RapturousParallelism

Questions and Answers

What purpose does cross-validation serve?

  • To define data partitions
  • To combat issues of limited data (correct)
  • To improve data accuracy
  • To build models with sample data (correct)
    What is data partitioning?

    Data partitioning refers to procedures where some observations from the sample are removed as part of the analysis.

    Cross-validation requires that folds are independent.

    True

    What are the steps in the cross-validation process?

    The data is split into partitions, with a portion serving as the test set and the rest as the training set. The model is then fitted multiple times, and prediction errors are averaged.

    What is the typical number of folds used in cross-validation?

    Five and ten folds are commonly used in cross-validation.

    The process of estimating the standard deviation of an estimate in bootstrap is done by repeating the process of simulating _____ paired observations.

    1000

    Match the following concepts with their descriptions:

    Cross-validation = Technique for estimating prediction error by splitting data into training and test sets
    Bootstrap = Method of estimating the variability of a statistic by sampling with replacement from the original dataset
    K-fold = Refers to the number of splits in cross-validation
    Maximum likelihood estimation = Estimation method for fitting models to data

    What is the primary aim of Principal Component Analysis (PCA)?

    To reduce the dimension of the dataset

    PCA can only be applied to qualitative variables.

    False

    What is the first step in computing principal components?

    Data normalization

    PCA transforms the variables into a new set of variables called _______.

    principal components

    What does an eigenvalue represent in the context of PCA?

    The amount of variance present in the data along a given direction.

    Which symmetric matrix contains the covariances between pairs of variables?

    Covariance matrix

    Which of the following benefits does dimension reduction provide?

    Eliminates redundant features

    How does PCA ensure that original variable scales do not dominate others?

    Through data normalization.

    Study Notes

    Introduction to Data Partitioning

    • Data partitioning is a method where some sample observations are excluded during analysis.
    • Key purposes include model evaluation, model selection, finding smoothing parameters, and estimating bias/error.

    Cross-Validation

    • Used to assess model accuracy by iteratively splitting data into training and test sets.
    • Reduces the risk of evaluating a model on data it has already seen, which can produce misleadingly low error rates.
    • K-fold cross-validation involves dividing data into K equally sized partitions, commonly five or ten.
    • Leave-one-out cross-validation treats each data point as a separate test set.
    • Fewer folds lead to lower variance but higher bias, while more folds reduce bias and increase variance.

    Cross-Validation Process

    • Data is split into K partitions; each partition serves once as the test set while the remaining partitions form the training set.
    • Average prediction error calculated across iterations for model evaluation.
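
    The steps above can be sketched in plain NumPy; the dataset and regression model below are illustrative stand-ins, not the original example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 5
X = rng.normal(size=(n, 3))                     # illustrative predictors
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)

# Shuffle the indices and split them into K roughly equal folds
indices = rng.permutation(n)
folds = np.array_split(indices, k)

fold_errors = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Fit ordinary least squares (with intercept) on the training partition
    Xtr = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
    beta, *_ = np.linalg.lstsq(Xtr, y[train_idx], rcond=None)
    # Evaluate mean squared prediction error on the held-out partition
    Xte = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
    fold_errors.append(np.mean((y[test_idx] - Xte @ beta) ** 2))

cv_error = np.mean(fold_errors)   # average prediction error across folds
```

    Each observation appears in exactly one test set, so `cv_error` estimates how the model would perform on unseen data.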

    Model Evaluation Example

    • Regression model predicting ACT scores uses predictor variables such as gender, age, and SAT scores.
    • 5-fold cross-validation yielded an R-squared value of 0.418, indicating the model explains about 42% of the variance in ACT scores.

    Conclusion on Cross-Validation

    • The choice of K affects bias/error based on sample size.
    • Pre-screening predictor variables may lead to inaccurate error predictions.
    • Results from both cross-validation and the full sample should be reported, and the data should be IID (independent and identically distributed).

    Bootstrap Method

    • Parametric bootstrapping assumes known data distribution with unknown parameters.
    • Involves drawing random samples from a fitted model based on the original dataset.
    • The process iteratively generates bootstrap samples of equal size to the original data.

    Bootstrap Procedures

    • Uses maximum likelihood estimators (MLE) to fit a distribution to data and draw bootstrap samples.
    • Estimates uncertainty about statistical parameters by calculating variance from multiple bootstrap estimates.
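
    As a minimal sketch of this procedure, assuming the data are normally distributed so the MLE is just the sample mean and standard deviation, a parametric bootstrap might look like:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=200)   # illustrative sample

# MLE for a normal distribution: sample mean and (biased) standard deviation
mu_hat, sigma_hat = data.mean(), data.std()

B = 1000
boot_means = np.empty(B)
for b in range(B):
    # Draw a bootstrap sample of the same size from the fitted model
    sample = rng.normal(loc=mu_hat, scale=sigma_hat, size=data.size)
    boot_means[b] = sample.mean()

# Variability of the bootstrap estimates approximates the SE of the statistic
se_mean = boot_means.std(ddof=1)
```

    The statistic here is the sample mean for simplicity; any statistic computed on the bootstrap samples could take its place.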

    Investment Example Using Bootstrap

    • Investigating the optimal investment fraction in two assets (X and Y) to minimize risk/variance.
    • The bootstrap method allows direct estimation of the standard deviation of the optimal investment fraction without drawing new samples from the population.

    Bootstrap Application in R

    • Implementing bootstrap involves creating a function for the statistic of interest and using the “boot()” function for resampling.
    • An example demonstrates that original data yield α̂ = 0.5758, with a bootstrap estimate for SE(α̂) of 0.0897.
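
    The same resampling idea can be sketched in Python without the boot package. The alpha() function below implements the standard minimum-variance investment fraction; the data are simulated here, so the results will not match the α̂ = 0.5758 and SE(α̂) = 0.0897 quoted above:

```python
import numpy as np

def alpha(x, y):
    """Fraction invested in X that minimizes the portfolio variance."""
    cov = np.cov(x, y)
    return (cov[1, 1] - cov[0, 1]) / (cov[0, 0] + cov[1, 1] - 2 * cov[0, 1])

rng = np.random.default_rng(2)
# Simulated asset returns (stand-in for the original dataset)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.8, size=100)

B = 1000
n = x.size
estimates = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)      # sample row indices with replacement
    estimates[b] = alpha(x[idx], y[idx])  # statistic on the bootstrap sample

alpha_hat = alpha(x, y)                   # estimate from the original data
se_alpha = estimates.std(ddof=1)          # bootstrap estimate of SE(alpha_hat)
```

    This mirrors what boot() does in R: repeatedly resample rows, recompute the statistic, and summarize the spread of the estimates.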

    Principal Component Analysis (PCA)

    • PCA addresses the curse of dimensionality in multivariate analysis by exploiting the high correlations that often exist between the original variables.
    • PCA reduces dataset dimensions by creating new variables, called principal components, which are linear combinations of the original variables.
    • Only quantitative variables are suitable for PCA, which helps ensure that the reduced dataset retains similar information in fewer dimensions.

    Benefits of Dimension Reduction

    • Compresses data, reducing storage space requirements.
    • Decreases computation time by minimizing dimensionality.
    • Eliminates redundant features to streamline analysis.
    • Enhances model performance by focusing on the most significant variables.

    Characteristics of Principal Components

    • Principal components are orthogonal to each other.
    • The first principal component captures the largest share of variance in the original dataset, the second captures the next largest, and so on.
    • For a two-dimensional dataset, only two principal components can exist.

    Steps for Computing Principal Components

    • Data Normalization:

      • Standardizes variables to ensure equal contribution, preventing a single variable from overshadowing others.
      • Involves subtracting the mean and possibly dividing by the standard deviation.
    • Covariance Matrix Calculation:

      • Computes a symmetric matrix where each element represents the covariance between pairs of variables.
    • Eigenvectors and Eigenvalues:

      • Eigenvectors indicate the directions of variance in the data.
      • Eigenvalues quantify variance present in the data along those directions.
    • Selection of Principal Components:

      • Eigenvectors are ranked according to their eigenvalues. The highest eigenvalue corresponds to the first principal component, and so forth.
    • Data Transformation in New Dimensional Space:

      • Original data is re-oriented onto a new subspace created by principal components through multiplication with eigenvectors.
      • This process provides a fresh perspective on the data while preserving the information in the original dataset.

    Application Example

    • A dataset with variables represented by dimensions X (1, 3, 6, 9, 12, 15) and Y (2, 4, 6, 8, 10, 12) can be analyzed using PCA to find principal components that summarize the underlying variance effectively.
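
    Applying the steps listed earlier to this small dataset, a NumPy sketch might look like (variable names are illustrative):

```python
import numpy as np

# Example data from the text: two highly correlated variables
X = np.array([1, 3, 6, 9, 12, 15], dtype=float)
Y = np.array([2, 4, 6, 8, 10, 12], dtype=float)
data = np.column_stack([X, Y])

# Step 1: normalization (center, and scale by std so neither variable dominates)
z = (data - data.mean(axis=0)) / data.std(axis=0)

# Step 2: covariance matrix (symmetric, 2x2)
cov = np.cov(z, rowvar=False)

# Step 3: eigenvalues and eigenvectors (eigh is for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: rank components by descending eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: project the data onto the new subspace of principal components
scores = z @ eigvecs

explained = eigvals / eigvals.sum()   # share of variance per component
```

    Because X and Y are almost perfectly correlated, the first principal component captures nearly all of the variance, illustrating why a single dimension can summarize this two-variable dataset.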


    Description

    Explore the concepts of data partitioning and cross-validation in model evaluation. Learn how these techniques help in accurately assessing model performance by excluding certain observations and using iterative data splitting methods. This quiz covers key purposes and methodologies for effective data analysis.
