Applications of Statistical Techniques (SMA 3023)
Chapter 8: Data Partition

Textbooks:
1) Computational Statistics Handbook with MATLAB
2) https://quantdev.ssri.psu.edu/tutorials/cross-validation-tutorial
3) An Introduction to Statistical Learning with Applications in R

Table of Contents
1 Introduction
2 Cross-Validation
  Demonstration
  Remark
  Conclusion
3 Bootstrap
  Example: Bootstrap using R

Introduction

Data partitioning refers to procedures where some observations from the sample are removed as part of the analysis. These techniques are used for the following purposes:
- To evaluate the accuracy of a model or classification scheme
- To decide what is a reasonable model for the data
- To find a smoothing parameter in density estimation
- To estimate the bias and error in parameter estimation
- And many others.

Example: We have a sample where we measured the average atmospheric temperature and the corresponding amount of steam used per month. Our goal in the analysis is to model the relationship between these variables. Once we have a model, we can use it to predict how much steam is needed for a given average monthly temperature.

Problem: What model should we use? We should always look at the scatterplot of the data first.

Figure 1: Average temperature vs. amount of steam used per month

It appears that using a line (i.e., a first-degree polynomial) to model the relationship between the variables is not unreasonable. However, other models (such as a cubic or some higher-degree polynomial) might provide a better fit.

How do we decide which model is better? We need to assess the accuracy of the various models and choose the one with the best accuracy or lowest error. The prediction error can be used as the measure of accuracy. Ideally, it would be computed on new observations that were not used to fit the model. The problem with this approach is that it is sometimes impossible to obtain new data, so all we have available to evaluate our models (or our statistics) is the original data set.
Thus, we consider two kinds of methods that allow us to use the data already in hand to evaluate the models: cross-validation and resampling (the jackknife or the bootstrap). Both can assess prediction accuracy.

Cross-Validation

One of the jobs of a statistician or engineer is to create models using sample data, usually for the purpose of making predictions. We must then decide which model best describes the relationship between the variables and estimate its accuracy.

The naive researcher will build a model based on the data set and then use that same data to assess the performance of the model. The problem with this is that the model is being evaluated or tested with data it has already seen. Therefore, that procedure will yield an overly optimistic (too low) prediction error.

In a perfect world, data sets would be large enough that we could set aside a sizable portion of the data to validate the model (i.e., examine the resulting prediction error) after fitting it to the majority of the data. Unfortunately, that much data is not always available, especially in social science research.

To combat the issue of limited data while still being able to assess the fit of the model, we use cross-validation. Essentially, cross-validation iteratively splits the data set into two portions: a test set and a training set. The model is fit to each training set and used to predict the corresponding test set. The prediction errors from the test sets are then averaged to determine the expected prediction error for the whole model.

Figure 2: Cross-validation process

Cross-validation process
1. The data are split (or folded) into five equally sized partitions.
2. During the first fitting of the model, the first 20% of the data (i.e., the first fold) are considered the test set and the remaining 80% of the data (i.e., the remaining four folds) are considered the training set.
3. In the following iterations (columns from left to right in Figure 2), a different 20% of the data are considered the test set, while the remaining 80% of the data are considered the training set.
4. The model is fit to the training data and evaluated on the test data a chosen number of times (K, the number of folds), and the prediction errors from the K model fittings are then averaged to determine the prediction statistics for the model.

The choice of the number of splits (or "folds") of the data is up to the researcher, which is why the method is often called K-fold cross-validation. Five and ten splits are used frequently. Leave-one-out cross-validation is the special case in which the number of folds equals the number of cases in the data set (K = N).

The choice of the number of splits affects both bias (the difference between the expected value of an estimate and the correct value, i.e., systematic error) and variance. Generally, fewer splits give lower variance but higher bias, and more splits give the reverse.

Example: Cross-validation to examine the average prediction of a regression model

Use the caret package to predict a participant's ACT score from gender, age, SAT verbal score, and SAT math score using the "sat.act" data from the psych package, and assess the model fit using 5-fold cross-validation.

Set up the number of folds for cross-validation by defining the training control. (In this case we chose 5 folds, but the choice is ultimately up to the researcher.)

Next, run the regression model: ACT ~ gender + age + SATV + SATQ.

Examine model predictions: with 5-fold cross-validation, the model accounts for about 42% of the variance (R-squared = 0.418) in ACT scores for these participants.
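The K-fold procedure described above can be sketched in a few lines of code. The following is a minimal illustration in plain Python, not the caret workflow used in the slides: it assumes a simple straight-line model fit by least squares, uses mean squared prediction error as the accuracy measure, and runs on synthetic data. The function names (k_fold_indices, fit_line, cross_validate) and the data are hypothetical choices made for this sketch.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def cross_validate(xs, ys, k=5, seed=0):
    """Return the per-fold mean squared prediction errors and their average."""
    fold_errors = []
    for test in k_fold_indices(len(xs), k, seed):
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        # Fit on the training folds only, then predict the held-out fold
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(mean((ys[i] - (a + b * xs[i])) ** 2 for i in test))
    return fold_errors, mean(fold_errors)


# Synthetic data (assumed for illustration): y = 2 + 3x + standard normal noise
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(50)]
ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]

fold_errors, cv_error = cross_validate(xs, ys, k=5)
print("Per-fold errors:", fold_errors)
print("Cross-validated error:", cv_error)
```

Each observation is used for testing exactly once and for training K - 1 times, so the averaged error is never computed on data the corresponding fit has already seen.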
We can also examine the model predictions for each fold.

Demonstration

Let's examine what our R² values would have been if we had used:
1. the whole sample
2. half of the sample

[R output: whole sample]
[R output: first half]
[R output: second half]

Remark:
1. The R² for the whole sample was approximately equal to the CV result.
2. The R² for the randomly selected first half of the data was a bit smaller (the model accounted for approximately 38% of the variance of ACT scores).
3. The R² for the remaining second half of the data was larger (the model accounted for approximately 48% of the variance).
4. This demonstrates the value of using CV, as opposed to solely splitting the data into one training set and one test set, as a better method of obtaining model estimates.

Conclusion (Cross-validation):
1. The choice of K (i.e., the number of folds): sample size may influence the choice of K (a larger K results in fewer observations within each fold).
2. The selection of predictor variables: do not pre-screen predictors, as doing so will result in inaccurate prediction errors.
3. The reporting of results: both the accuracy from cross-validation and the parameters from the whole sample should be reported.
4. The requirement of IID (independent and identically distributed) data: cross-validation requires that the folds (i.e., the test and training sets) are independent.

Bootstrap

Parametric bootstrapping assumes that the data come from a known distribution with unknown parameters. The original data set is assumed to be a realization of a random sample from a distribution of a specific parametric type. A parametric model with parameter θ is fitted to the data, often by maximum likelihood, and samples of random numbers are drawn from this fitted model. Usually each sample drawn has the same size as the original data. The estimate of the original function f(xi; θ) can then be written as f(xi; θ̂).
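The parametric resampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not an example from the slides: the data are assumed to follow an exponential distribution with unknown rate θ, whose MLE is one over the sample mean, and the data themselves are simulated. The function names and the choice B = 1000 are hypothetical. For comparison, the sketch also prints the large-sample variance θ̂²/n, which is the standard asymptotic variance of the MLE for this particular model.

```python
import random
from statistics import mean, variance


def mle_rate(sample):
    """MLE of the rate of an exponential distribution: 1 / sample mean."""
    return 1.0 / mean(sample)


def parametric_bootstrap(sample, B=1000, seed=0):
    """Return the fitted rate and the B bootstrap MLEs simulated from the fitted model."""
    rng = random.Random(seed)
    n = len(sample)
    theta_hat = mle_rate(sample)
    estimates = []
    for _ in range(B):
        # Draw n i.i.d. values from f(x; theta_hat), treating theta_hat as the truth
        boot = [rng.expovariate(theta_hat) for _ in range(n)]
        estimates.append(mle_rate(boot))
    return theta_hat, estimates


# Simulated data (assumed for illustration): 200 draws from an exponential with rate 2
data_rng = random.Random(42)
data = [data_rng.expovariate(2.0) for _ in range(200)]

theta_hat, boot_estimates = parametric_bootstrap(data, B=1000)
boot_var = variance(boot_estimates)

# For the exponential rate, I(theta) = 1 / theta^2, so 1 / (n I(theta)) = theta^2 / n
asymptotic_var = theta_hat ** 2 / len(data)

print("MLE of the rate:", theta_hat)
print("Bootstrap variance of the MLE:", boot_var)
print("Asymptotic variance theta_hat^2 / n:", asymptotic_var)
```

The two variance figures should be of similar size for a sample this large; the bootstrap version has the advantage of not requiring a closed-form expression for the variance.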
This sampling process is repeated many times, as for other bootstrap methods. For the centered sample mean, for example, the original distribution f(xi; θ) is replaced by the fitted distribution f(xi; θ̂), and the sampling distribution of the statistic is approximated using samples drawn from f(xi; θ̂). The use of a parametric model at the sampling stage of the bootstrap methodology leads to procedures which are different from those obtained by applying basic statistical theory to inference for the same model.

Inference using MLE

Assume θ̂ is the MLE we obtained from our sample of size n. For b = 1, 2, ..., B:
- Simulate a sample X*_b = (X*_1b, ..., X*_nb) as i.i.d. draws from f(x; θ̂) (that is, assuming that θ̂ is the true parameter value).
- Find the MLE using the sample X*_b and denote it θ*_b.
Then calculate the sample variance of (θ*_1, ..., θ*_B). This gives the bootstrap approximation to 1/(n I(θ)), the large-sample variance of the MLE, where I(θ) is the Fisher information of a single observation.

In summary, the parametric bootstrap proceeds as follows:
1. Collect the data set of n samples {x1, ..., xn}.
2. Determine the parameter(s) of the distribution from the known distribution family that best fits the data, using maximum likelihood estimators (MLEs).
3. Generate B bootstrap samples {x*_1, ..., x*_n} by randomly sampling from this fitted distribution.
4. For each bootstrap sample {x*_1, ..., x*_n}, calculate the required statistic θ̂*.
5. The distribution of these B estimates of θ represents the bootstrap estimate of uncertainty about the true value of θ.

Example

Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of X and Y, respectively, where X and Y are random quantities. We will invest a fraction α of our money in X and the remaining 1 − α in Y. Since there is variability associated with the returns on these two assets, we wish to choose α to minimize the total risk, or variance, of our investment.
In other words, we want to minimize Var(αX + (1 − α)Y). One can show that the value that minimizes the risk is given by

α = (σ_Y² − σ_XY) / (σ_X² + σ_Y² − 2σ_XY),

where σ_X² = Var(X), σ_Y² = Var(Y), and σ_XY = Cov(X, Y).

In reality, the quantities σ_X², σ_Y², and σ_XY are unknown. We can compute estimates for these quantities, σ̂_X², σ̂_Y², and σ̂_XY, using a data set that contains past measurements for X and Y. We can then estimate the value of α that minimizes the variance of our investment using

α̂ = (σ̂_Y² − σ̂_XY) / (σ̂_X² + σ̂_Y² − 2σ̂_XY).

To estimate α, we simulated 100 pairs of returns for the investments X and Y. Across repeated simulations, α̂ ranged from 0.532 to 0.657.

To quantify the accuracy of the estimate of α, we can estimate the standard deviation of α̂ by repeating the process of simulating 100 paired observations of X and Y and estimating α 1000 times, which gives α̂_1, α̂_2, ..., α̂_1000. The result, SE(α̂) ≈ 0.083, means that for a random sample from the population we would expect α̂ to differ from α by approximately 0.08 on average.

In practice, however, the procedure for estimating SE(α̂) outlined above cannot be applied, because for real data we cannot generate new samples from the original population. However, the bootstrap approach allows us to use a computer to emulate the process of obtaining new sample sets, so that we can estimate the variability of α̂ without generating additional samples. Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations, with replacement, from the original data set (see Figure 5.11 for an illustration).

Bootstrap using R

Estimating the accuracy of a statistic of interest takes two steps:
1. Create a function that computes the statistic of interest.
2. Use the "boot()" function to perform the bootstrap by repeatedly sampling observations from the data set with replacement.

Use the "ISLR2" package to get the simulated data of 100 pairs of returns. The final output shows that, using the original data, α̂ = 0.5758 and the bootstrap estimate of SE(α̂) is 0.0897.
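The two steps above (a function for the statistic, then repeated resampling with replacement) can be sketched outside R as well. The following is a minimal plain-Python illustration, not the boot() call from the slides and not the ISLR2 data: it simulates 100 hypothetical pairs of returns with Var(X) = 1, Var(Y) = 1.25 and Cov(X, Y) = 0.5 (values assumed here so that the true α is 0.6), so its numbers will not match the 0.5758 and 0.0897 reported above. The function names and B = 1000 are likewise choices made for this sketch.

```python
import random


def alpha_hat(xs, ys):
    """Estimate alpha = (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)


def bootstrap_se(xs, ys, statistic, B=1000, seed=0):
    """Bootstrap standard error: resample (x, y) pairs with replacement B times."""
    rng = random.Random(seed)
    n = len(xs)
    reps = []
    for _ in range(B):
        # Sample n row indices with replacement, keeping each (x, y) pair together
        idx = [rng.randrange(n) for _ in range(n)]
        reps.append(statistic([xs[i] for i in idx], [ys[i] for i in idx]))
    m = sum(reps) / B
    return (sum((r - m) ** 2 for r in reps) / (B - 1)) ** 0.5


# Simulated returns (assumed for illustration): X = Z1, Y = 0.5 * Z1 + Z2,
# which gives Var(X) = 1, Var(Y) = 1.25, Cov(X, Y) = 0.5, so the true alpha is 0.6
rng = random.Random(7)
xs = [rng.gauss(0, 1) for _ in range(100)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]

est = alpha_hat(xs, ys)
se = bootstrap_se(xs, ys, alpha_hat, B=1000)
print("alpha_hat:", est)
print("Bootstrap SE(alpha_hat):", se)
```

Resampling whole (x, y) pairs rather than each variable separately preserves the covariance between the two returns, which the estimate of α depends on.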