Study Guide Part II: Cross Validation & Bootstrapping
Summary
This document covers cross-validation and bootstrap methods for estimating prediction error in statistical learning, outlining the rationale behind these methods and the benefits and drawbacks of their application. It also covers linear model selection and regularization, methods for moving beyond linearity, and tree-based models.
Full Transcript
STUDY GUIDE PART II

CHAPTER 5: CROSS-VALIDATION & BOOTSTRAPPING

RESAMPLING METHODS: get more info about a fitted model by repeatedly refitting the model to samples of the training data. They provide estimates of test-set prediction error and of the standard deviation and bias of parameter estimates (bootstrapping is best for this).

The best solution for prediction-error estimates is a large designated test set, but one is often not available. When a large dataset is not available, use the class of methods that hold out a subset of the training observations from the fitting process, then apply the statistical learning method to the held-out observations.

THE VALIDATION SET APPROACH
1. Randomly divide the data into a training set and a validation (hold-out) set.
2. Fit the model to the training data; use the fitted model to predict the responses in the validation set.
3. Use the validation-set error as an estimate of the test error (usually MSE for a quantitative response, misclassification error rate for a qualitative one).
We want to use cross-validation both to pick the degree of the polynomial (model selection) AND to give an idea of the test error at the end of the fitting process.

DRAWBACKS OF VALIDATION
- The validation error can be highly variable because the splits of the data are made randomly; it is highly dependent on which observations land in the validation set.
- Only a subset of the observations (those in the training set) are used to fit the model, so a lot of data is not considered. The validation-set error can therefore overestimate the test error: the training set is half as large as it would be with no validation split, and when you have more data the error is generally lower.

K-FOLD CROSS-VALIDATION: estimates the test error; used to select the best model. Randomly divide the data into K roughly equal-sized parts (folds) C_1, ..., C_K. Leave out one part, fit the model to the other K - 1 folds, obtain predictions for the left-out fold, and record the error. Do this for each fold k = 1, ..., K, then combine the results:

CV_(K) = Σ_{k=1}^{K} (n_k / n) MSE_k,  where MSE_k = Σ_{i ∈ C_k} (y_i - ŷ_i)² / n_k,

n_k is the number of observations in fold k (n_k = n/K when n is a multiple of K) and ŷ_i is the fit for observation i computed from the data with fold k removed.

Setting K = n yields n-fold or leave-one-out cross-validation (LOOCV). LOOCV can be useful (e.g., for least squares or polynomial regression) but doesn't change up the data enough: the training sets differ by only one observation, so the estimates from each fold are highly correlated, giving high variance and low bias. A better option is K-fold CV with K = 5 or 10.

ISSUES WITH CROSS-VALIDATION
- With K = 5, each training set is only (K - 1)/K = 4/5 as big as the original training set, so the estimates of the error will be biased upward.
- With LOOCV the bias is reduced because the training set is larger, but the training sets are almost the same every time, so the estimate has high variance. Bias vs. variance tradeoff.

CROSS-VALIDATION FOR CLASSIFICATION: works the same way, with the misclassification rate Err_k on fold k playing the role of MSE_k.

THE RIGHT AND WRONG WAY TO DO CV: suppose step 1 screens the predictors, keeping the 100 with the greatest predictive ability, and step 2 fits a classifier using them. Applying CV only in step 2 is not very useful: we are assuming the folds are independent, but they aren't, because the training data shares information with the validation data. You have already cherry-picked the 100 predictors using ALL of the data, so you would essentially be tricking cross-validation and your estimated test error will be too low.
WRONG: apply CV only in step 2. RIGHT: apply CV to steps 1 AND 2 together.

THE BOOTSTRAP: a flexible tool that can be used to quantify uncertainty (e.g., choose the allocation α that minimizes a portfolio's variance). Repeatedly drawing new datasets from the population is not realistic, because we don't have the whole population. The bootstrap mimics obtaining new datasets: repeatedly sample from our original dataset WITH REPLACEMENT. Each bootstrapped dataset is the same size as the original dataset; some observations may appear multiple times, others might not be included at all.

In more complex scenarios (e.g., time series), bootstrapping can be more complicated: in a time series you can't just sample with replacement, because, say, the stock price on one day is related to the stock price the previous day. For time series, people use a BLOCK BOOTSTRAP: group the samples into blocks B_1, B_2, B_3, B_4, ..., sample the blocks, and assume the blocks are uncorrelated while allowing correlation within each block.
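These notes contain no code; below is a minimal numpy sketch of the plain (non-block) bootstrap described above, estimating the standard error of a statistic. The toy data and the choice of statistic (a sample correlation) are my own illustrations, not from the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset standing in for the original sample (hypothetical values).
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.8, size=100)


def statistic(x, y):
    """The quantity whose uncertainty we want: here, a sample correlation."""
    return np.corrcoef(x, y)[0, 1]


B = 1000  # number of bootstrap datasets
n = len(x)
estimates = np.empty(B)
for b in range(B):
    # Resample row indices WITH replacement; each bootstrap dataset has the
    # same size n as the original, so some rows repeat and others drop out.
    idx = rng.integers(0, n, size=n)
    estimates[b] = statistic(x[idx], y[idx])

print("bootstrap estimate of SE:", estimates.std(ddof=1))
```

The histogram of `estimates` is also what the percentile confidence interval in the next section is read off of.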
PRIMARY USES OF THE BOOTSTRAP: estimating standard errors, but also determining confidence intervals. Look at the histogram of the bootstrap estimates α̂*; taking, say, the 5th and 95th percentiles gives an approximate 90% confidence interval, the BOOTSTRAP PERCENTILE interval.

BOOTSTRAP FOR PREDICTION ERROR: in CV, each of the K validation folds is distinct from the other K - 1 folds used for training; there is no overlap, which is needed for an honest estimate. For the bootstrap, think about each bootstrap dataset as our training data and the original sample as the validation data. But there is significant overlap between each bootstrap sample and the original data: about 2/3 of the original data appears in each bootstrap set, so 2/3 of the validation data has already been seen by the model and the prediction error will be biased downward. (Even worse is using the original data as training and the bootstrap sample as validation.) FIX: estimate the bootstrap prediction error using only predictions for observations that did not occur in the current bootstrap sample. This gets complicated, and in the end CV is simpler and better for estimating prediction error.

CHAPTER 6: LINEAR MODEL SELECTION & REGULARIZATION

Linear model: Y = β0 + β1 X1 + ... + βp Xp + ε. Before exploring nonlinear alternatives, consider improving the linear model by replacing least squares with alternative fitting procedures. Why consider alternatives to least squares? When the number of features p is large relative to the number of observations n, we want to control variance (prediction accuracy); and for model interpretability, removing irrelevant features by setting their coefficients to 0 makes the model easier to interpret (feature selection).

3 CLASSES OF METHODS: subset selection; shrinkage (ridge, lasso); dimension reduction (PCR, PLS).

SUBSET SELECTION: best subset and stepwise.

BEST SUBSET SELECTION
1. Let M_0 be the null model with no predictors: it simply predicts the sample mean for each observation.
2. For k = 1, 2, ..., p: (a) fit all C(p, k) models that contain exactly k predictors; (b) pick the best among them, call it M_k. Best = smallest RSS or largest R².
3. Select a single model from M_0, ..., M_p using cross-validated prediction error, Cp, AIC, BIC, or adjusted R².
We find the best models with 1, 2, 3, ..., p predictors; then among those we pick the model with the lowest estimated test error (e.g., through cross-validation). Usable when p is around 10-20.

EXTENSIONS TO OTHER MODELS: we have looked at models fit by least squares, but the same idea applies to other models such as logistic regression; the deviance (negative two times the maximized log-likelihood) plays the role of RSS for those models.

STEPWISE SELECTION: for large p we cannot use best subset selection because the number of models to compare, 2^p, is too large. Best subset can suffer computationally and statistically: the larger the search space, the higher the chance of finding models that look good on the training data but have little predictive power on new data (overfitting, high variance of the coefficient estimates). So stepwise methods may be preferable, especially when p >= 40.

FORWARD STEPWISE SELECTION: start with a model containing no predictors (just the intercept) and add predictors to the model one by one until all are in the model. AT EACH STEP, the variable giving the greatest ADDITIONAL improvement to the fit is added. Difference from best subset: at each step we do not look at all models with k predictors, only the models that add one predictor to the model from the previous step.
1. M_0 = the null model with no predictors.
2. For k = 0, ..., p - 1: (a) consider all p - k models that augment the predictors in M_k with one additional predictor; (b) choose the best among these p - k models and call it M_{k+1} (smallest RSS / largest R²).
3. Select a model from M_0, ..., M_p using CV, Cp, BIC, or adjusted R².
PROS: a computational advantage over best subset selection: it fits only 1 + p(p+1)/2 models instead of 2^p.
CONS: not guaranteed to find the best possible model out of all 2^p models, since it does not look at every single potential model with k predictors. The best model with k predictors need not be a superset of the best smaller model: X1 might be the strongest single predictor, but {X2, X3} together might be stronger. We can only build off our pre-existing model; only if the variables had no correlation would the feature selection be the same as best subset. (A sketch of this loop appears below.)
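As a hedged illustration of the forward stepwise loop above (not part of the original notes), here is a small numpy sketch that adds, at each step, the predictor giving the largest drop in training RSS; the toy data are hypothetical.

```python
import numpy as np


def rss(X, y):
    """RSS of the least-squares fit of y on X (intercept included)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return resid @ resid


def forward_stepwise(X, y):
    """Return the order in which predictors enter the model."""
    p = X.shape[1]
    selected, remaining = [], list(range(p))
    while remaining:
        # Steps 2(a)-(b): among the p - k one-variable augmentations,
        # keep the one with the smallest training RSS.
        best_j = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected.append(best_j)
        remaining.remove(best_j)
    return selected  # M_1 ⊂ M_2 ⊂ ... ⊂ M_p by construction


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 2] - 2 * X[:, 4] + rng.normal(size=200)
print(forward_stepwise(X, y))  # indices 2 and 4 should enter first
```

Step 3 (choosing among M_0, ..., M_p with CV or a criterion) is deliberately left out; it reuses the criteria defined in the next section.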
BACKWARD STEPWISE SELECTION: good when p is large (but p must not be greater than n). Begin with the full least-squares model containing all p predictors and remove the least useful predictor one at a time.
1. Start with M_p, the full model.
2. For k = p, p - 1, ..., 1: (a) consider all k models that contain all but one of the predictors in M_k, for a total of k - 1 predictors each; (b) choose the best among these k models and call it M_{k-1} (smallest RSS / highest R²).
3. Select a model from M_0, ..., M_p using cross-validated prediction error, Cp, BIC, or adjusted R².
Backward stepwise searches through only 1 + p(p+1)/2 models, so like forward stepwise it is not guaranteed to yield the BEST model containing a subset of the p predictors. The number of samples n must be larger than the number of variables p so that the full model can be fit; forward stepwise works even when n < p and is the ONLY viable subset method when p is very large.

CHOOSING THE OPTIMAL MODEL: the model with all the variables will always have the lowest RSS, i.e., the lowest training error, but that means its variance will be high, so we cannot just choose the model with the lowest RSS or highest R².

ESTIMATING TEST ERROR, two approaches:
1. Indirectly estimate the test error by making an adjustment to the training error to account for overfitting: Cp, AIC, BIC, adjusted R².
2. Directly estimate it using CV or a validation set.

Cp, BIC, AIC, and adjusted R² adjust the training error for model size and can be used to select the best model among models with varying numbers of variables. We want Cp, AIC, and BIC to be small; for adjusted R² we want a large value.

Mallow's Cp:
Cp = (1/n) (RSS + 2 d σ̂²),
where d is the total number of parameters used (the variables plus 1 for the intercept) and σ̂² is an estimate of the variance of the error ε.

AIC: defined for a large class of models fit by maximum likelihood,
AIC = -2 log L + 2 d,
where L is the maximized value of the likelihood function for the estimated model. For linear models with Gaussian errors, -2 log L equals RSS/σ̂² up to a constant, so AIC and Cp are proportional; AIC also applies to nonlinear models.

BIC:
BIC = (1/n) (RSS + log(n) d σ̂²).
Like Cp, BIC tends to be small for a model with low test error, so we want models with a small BIC. AIC vs. BIC: since log n > 2 for any n > 7, BIC places a heavier penalty on models with more variables and so results in the selection of smaller models than Cp.

Adjusted R²:
adjusted R² = 1 - [RSS/(n - d - 1)] / [TSS/(n - 1)].
To maximize adjusted R² we want to minimize RSS/(n - d - 1). RSS always decreases as variables are added, but RSS/(n - d - 1) can go up or down due to the presence of d in the denominator. Unlike R², adjusted R² pays a price for including unnecessary variables in the model: it is penalized if you add a predictor that doesn't add much predictive ability. Can be used when p < n. (See the sketch after this section for computing these criteria.)

VALIDATION & CROSS-VALIDATION: each selection procedure returns a sequence of models M_0, ..., M_p; we estimate the test error of each and select the k with the lowest estimated test error. Advantages: we don't need an estimate of the error variance σ², and CV can be used in a wider range of model-selection tasks, even when it is hard to determine the number of parameters in the model or to estimate the error variance.

1-STANDARD-ERROR RULE: the minimum of the estimated-error curve might not be meaningfully better than its neighbors, so rather than picking the minimum, choose the simplest model whose estimated error is within 1 standard error of the minimum. Do this because it is easier to interpret: virtually the same error, but a simpler model.
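A minimal sketch (my own, following the definitions above) of computing Cp, BIC, and adjusted R² for a candidate model; all the numbers in the example call are hypothetical.

```python
import numpy as np


def selection_criteria(rss, tss, n, d, sigma2):
    """Compute Mallow's Cp, BIC, and adjusted R^2 as defined above.

    rss, tss : residual and total sum of squares of the candidate model
    n        : number of observations
    d        : number of parameters used by the model
    sigma2   : estimate of Var(eps), e.g. from the full least-squares fit
    """
    cp = (rss + 2 * d * sigma2) / n
    bic = (rss + np.log(n) * d * sigma2) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, bic, adj_r2


# Hypothetical numbers just to show the call; smaller Cp/BIC and larger
# adjusted R^2 indicate the preferred model.
print(selection_criteria(rss=120.0, tss=400.0, n=100, d=5, sigma2=1.3))
```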
SHRINKAGE METHODS: ridge regression & the lasso. Subset selection uses least squares to fit a linear model that contains a subset of the predictors. Shrinkage is a different way to fit a linear model, not least squares: a penalty is used to shrink the coefficients toward 0 based on importance.

RIDGE REGRESSION: choose the coefficients to minimize
RSS + λ Σ_{j=1}^{p} βj²,
where λ >= 0 is a tuning parameter. We want coefficients that fit the data well (reduce RSS), but the larger λ is, the more the shrinkage penalty λ Σ βj² matters. The penalty is small when β1, ..., βp are close to 0, so it has the effect of shrinking the βj toward 0. λ controls the impact of the two terms on the coefficient estimates; it is really a balancing problem, and we use CV to determine λ. Notation: the ℓ2 norm is ||β||_2 = sqrt(Σ βj²).

THE SCALING OF PREDICTORS: standard least-squares coefficient estimates are scale-equivariant: multiplying X_j by a constant c simply scales the least-squares coefficient by 1/c, so X_j β̂_j is the same regardless of how the jth predictor is scaled. IN RIDGE, the coefficients can change significantly when a predictor is multiplied by a constant, because of the sum-of-squared-coefficients term in the penalty. So we must standardize, applying ridge AFTER standardizing the predictors using
x̃_ij = x_ij / sqrt( (1/n) Σ_{i=1}^{n} (x_ij - x̄_j)² ),
i.e., divide each variable by its standard deviation.

λ controls the bias-variance tradeoff: a larger λ gives a less flexible fit, with lower variance and higher bias. Ridge coefficients never go exactly to 0; the penalty just makes them approach 0. CON: ridge does not select predictors; it uses all of them.

THE LASSO: minimizes RSS + λ Σ |βj|, using an ℓ1 penalty instead of ℓ2. The ℓ1 norm is ||β||_1 = Σ |βj| instead of ||β||_2. The lasso shrinks coefficients toward 0 but can also set coefficients exactly to 0 if a feature is not important, which gives variable selection and creates sparse models: a good option when p is large. As λ grows, more coefficients are forced to 0. Why does the lasso allow βj = 0? The ℓ1 constraint region has corners on the coordinate axes, so the solution often lands at a point where some coefficients are exactly zero.

CONCLUSIONS: the lasso might perform better when the response is a function of only a small number of predictors, but the set of predictors truly related to the response is never known a priori for a real dataset; CV can be used to determine which approach is better for a given dataset. When λ = 0 the loss function is the same as ordinary least squares.

SELECTING THE TUNING PARAMETER FOR RIDGE & LASSO: CV is a good scheme to determine λ. Compute the CV error over a grid of λ values, select the λ with the smallest CV error, then refit the model using ALL observations and the selected tuning parameter. (A sketch follows below.)
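A hedged scikit-learn sketch of the ridge/lasso workflow above: standardize, let CV pick λ (called `alpha` in scikit-learn), then inspect sparsity. The toy data and the alpha grid are my own choices, not from the notes.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=100)  # sparse truth

# Standardize first (ridge/lasso are not scale-equivariant), then let
# cross-validation pick the tuning parameter over a grid.
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 50)))
lasso = make_pipeline(StandardScaler(), LassoCV(cv=10))

ridge.fit(X, y)
lasso.fit(X, y)

# Ridge shrinks coefficients toward 0 but keeps all 20 predictors;
# the lasso's l1 penalty sets most of them exactly to 0 (sparse model).
print("nonzero ridge coefs:", np.sum(ridge[-1].coef_ != 0))
print("nonzero lasso coefs:", np.sum(lasso[-1].coef_ != 0))
```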
DIMENSION REDUCTION: the preceding approaches fit a least-squares model to the original predictors. Dimension-reduction methods instead use the p original predictors to fit a model with M new predictors, where M < p (hence "dimension reduction"). The new predictors Z_1, ..., Z_M are linear combinations of the p predictors,
Z_m = Σ_{j=1}^{p} φ_mj X_j, for constants φ_m1, ..., φ_mp,
and the model is fit with least squares using the new, transformed predictors (with new β's). Dimension reduction constrains the estimated coefficients, which can win on the bias-variance tradeoff.

PRINCIPAL COMPONENTS REGRESSION (PCR): PCA is used to define the linear combinations of the predictors used in the regression. The 1st principal component is the linear combination of the variables with the largest variance: when many correlated variables are present, we replace them with a small set of principal components that capture their shared variation, the direction along which the group of points collectively varies the most (and onto which the projection RSS is smallest). If there is a strong correlation between features, the 1st PC can stand in for them; you don't need the whole set of predictors, e.g., to predict sales. Subsequent PCs are the combinations with the next-highest variance. Using more PCs lowers bias but raises variance; choose the number of components M by cross-validation to minimize test MSE. When M is really high, you are essentially just doing least squares. PCA is an unsupervised analysis: it identifies the linear combinations (directions) that best represent the predictors, and these directions are identified WITHOUT supervision, because the response Y is not used to determine the principal component directions. So PCR has a large CON: there is no guarantee that the directions that best explain the predictors will also be best for predicting the response.

PARTIAL LEAST SQUARES (PLS): a dimension-reduction method that identifies new predictors Z_1, ..., Z_M that are linear combinations of the original features, then fits a linear model with ordinary least squares using the M features. PLS is supervised: it uses Y to identify new features that both approximate the old features and are related to the response. PLS attempts to find directions that help explain both the predictors and the response.
1. Standardize the p predictors.
2. PLS computes the first direction Z_1 by setting each φ_1j equal to the coefficient from the simple linear regression of Y onto X_j; this coefficient is proportional to the correlation between X_j and Y. In computing Z_1 = Σ_j φ_1j X_j, PLS places the most weight on the predictors that are most strongly related to the response.
3. The other directions are found by taking residuals and repeating the above.

CHAPTER 7: MOVING BEYOND LINEARITY

Assuming linearity is often not enough; nonlinearity offers flexibility.

POLYNOMIAL REGRESSION
y_i = β0 + β1 x_i + β2 x_i² + β3 x_i³ + ... + ε_i.
At the extremes the data thin out and the standard error gets wider (the tails of the confidence interval flare). Details: create new variables X_1 = X, X_2 = X², ... and treat the fit as multiple linear regression. We are not really interested in the coefficients, more in the fitted function values: because f̂(x_0) is a linear function of the coefficients, we can get a very simple expression for the pointwise variance at any x_0, i.e., at any given point you can attach an error band. We either fix the degree d at some reasonably low value or use CV. Logistic regression follows naturally: to get confidence intervals, find the upper and lower bounds on the logit scale, then invert to get them on the probability scale. Can be done separately on many variables: just stack the variables in a matrix and separate them after. CAVEAT: polynomials have extreme tail behavior, very bad for extrapolation.

STEP FUNCTIONS: cut the variable into distinct sections; helpful if there are natural cut points; local, unlike polynomials. Create dummy variables and fit the model. A useful way of creating interactions that are easily interpretable, e.g., the interaction I(year < 2005) * age allows a different linear function of age in each era. The choice of cutpoints (knots) can be hard.

PIECEWISE POLYNOMIALS: use different polynomials in regions delimited by knots. It is better to add constraints to the polynomials, e.g., continuity. SPLINES have the MAXIMUM amount of continuity: we want 0 visible breaks, so enforce continuity of the function and of its derivatives; for a cubic spline, continuity up to the 2nd derivative at each knot.

LINEAR SPLINE with knots ξ_k, k = 1, ..., K: a piecewise linear polynomial, continuous at each knot,
y_i = β0 + β1 b_1(x_i) + β2 b_2(x_i) + ... + β_{K+1} b_{K+1}(x_i) + ε_i,
where b_1(x_i) = x_i and b_{k+1}(x_i) = (x_i - ξ_k)_+ for k = 1, ..., K, with (x_i - ξ_k)_+ the positive part: x_i - ξ_k if x_i > ξ_k, else 0.

CUBIC SPLINE with knots ξ_k, k = 1, ..., K: a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot,
y_i = β0 + β1 x_i + β2 x_i² + β3 x_i³ + Σ_{k=1}^{K} β_{k+3} (x_i - ξ_k)³_+ + ε_i.

NATURAL CUBIC SPLINES extrapolate linearly beyond the boundary knots. This adds 4 = 2 * 2 extra constraints and allows us to put more internal knots for the same number of degrees of freedom; the extra constraints control wagging at the extremes.

KNOT PLACEMENT: one strategy is to decide on the number of knots K and place them at appropriate quantiles of the observed X. A cubic spline with K knots has K + 4 parameters (df); a natural spline with K knots has K df. Splines are more local than polynomials. (A basis-expansion sketch appears after the GAM notes below.)

SMOOTHING SPLINES: avoid the knot-selection issue, leaving a single λ to choose. Minimize
Σ_{i=1}^{n} (y_i - g(x_i))² + λ ∫ g''(t)² dt.
The first term tries to make g(x) match the data at each x_i; the second term, the roughness penalty, modulates how wiggly g is: small λ, little penalty, more wiggly. λ is the tuning parameter. The solution to this criterion is a natural cubic spline with a knot at every unique x_i, with the roughness controlled by λ. Details: the vector of n fitted values can be written as ĝ_λ = S_λ y, where S_λ is an n x n matrix determined by the x_i and λ. The effective degrees of freedom of the smoothing matrix are df_λ = Σ_{i=1}^{n} {S_λ}_ii, the sum of its diagonal. One can specify df rather than λ, and can estimate λ using LOOCV or CV.

LOCAL REGRESSION: with a sliding weight function, fit separate linear fits over the range of X by weighted least squares. With a comparable effective df, the result looks rather similar to a smoothing spline.

GAMs (GENERALIZED ADDITIVE MODELS)
y_i = β0 + f_1(x_i1) + f_2(x_i2) + ... + f_p(x_ip) + ε_i.
Plot the various contributions of the fitted nonlinear functions. A GAM can be fit using many building blocks, e.g., natural splines. The coefficients are not that interesting; the fitted functions are. Terms can be mixed, some linear, some nonlinear. GAMs are additive: no interactions in the model, though low-order interactions can be included using, e.g., bivariate functions. GAMs can also be used for classification.
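As an illustration of the "create basis variables, then run multiple linear regression" recipe used for splines above, here is a hedged scikit-learn sketch. Note that scikit-learn's SplineTransformer uses a B-spline basis rather than the truncated-power basis written above (the spanned function space is the same), and the data and knot count are my own choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)
y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=200)

# Cubic spline basis with knots at quantiles of x (the knot-placement
# strategy from the notes), then ordinary least squares on the basis columns.
model = make_pipeline(
    SplineTransformer(degree=3, n_knots=6, knots="quantile"),
    LinearRegression(),
)
model.fit(x, y)
print("first fitted values:", model.predict(x)[:5])
```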
CHAPTER 8: DECISION TREE MODELS

Stratifying / segmenting the predictor space. PROS: simple, easily interpretable. CONS: not competitive with the best supervised learning approaches. Decision trees can be used for classification and regression; internal nodes split on a feature.

REGRESSION TREES
1. Divide the predictor space into J non-overlapping regions: choose how to divide the predictor space into high-dimensional rectangles (boxes). The goal is to find boxes with minimal RSS, but each box shouldn't represent each individual sample. Take a top-down, greedy approach known as recursive binary splitting: greedy because at each step the split made is the best at that particular step, rather than looking ahead and picking a split that will lead to a better future tree.
2. Select the predictor and split point with the smallest RSS: look at all potential predictors and split points and pick the split that minimizes RSS; repeat the process. Hyperparameters can be specified to make the tree smaller. Use the mean of the observations in each terminal node to make predictions.
This could produce an overfit tree if the tree is too large (e.g., each observation has its own terminal node). You could stop early when splits don't improve predictive ability, or continue as long as RSS decreases; you can tune hyperparameters, or you can prune.

PRUNING: build a large tree, then cut it down, removing branches that don't significantly contribute to predictive ability; penalize complexity with a parameter α selected using CV.
1. Use recursive binary splitting to grow a large tree, stopping when each terminal node has fewer than some minimum number of observations.
2. Apply cost-complexity pruning to get a sequence of best subtrees as a function of α.
3. Use CV to determine α.
4. Return the subtree from step 2 that corresponds to the chosen α.

CLASSIFICATION TREES: assume each observation belongs to the dominant class in the region it falls into. Use the classification error rate instead of RSS, but it is not very sensitive for growing a tree; use the Gini index instead:
G = Σ_{k=1}^{K} p̂_mk (1 - p̂_mk),
which measures the purity of the classes: if the p̂_mk are all close to 0 or 1, the Gini index is small. CON: trees do not have the same level of accuracy as other regression/classification models.

BAGGING: think about the bootstrap. Recall that for a set of n independent observations Z_1, ..., Z_n, each with variance σ², the variance of the mean Z̄ is σ²/n: averaging a set of observations reduces variance. So take bootstrap samples, grow a tree on each, and average all of the predictions:
f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂*ᵇ(x),
where f̂*ᵇ(x) is the prediction of the tree grown on bootstrap sample b. For classification, majority rules: if 51% of trees say 1 and 49% say 0, predict 1. OOB ERROR: each tree contains about 2/3 of the data (think back to the bootstrap); use the remaining 1/3 (the out-of-bag observations) to validate.

RANDOM FORESTS: reduce variance further by decorrelating the trees. Build decision trees on bootstrapped samples, but each time a split in the tree is considered, a random selection of m predictors is chosen as candidates from the full set of p parameters, and the split uses only one of those m predictors. A fresh selection of m predictors is taken at each split; usually choose m ≈ sqrt(p). I.e., don't consider all p potential predictors at each split, only a random selection of m; this filters the predictors so that a few strong ones can't dominate every tree. You can't overfit by adding more trees; you know you're done when adding more trees doesn't reduce the error any further and it levels out.

BOOSTING: a sequential algorithm that improves on the previous model. Fit a tree f̂ᵇ with d splits (d + 1 terminal leaves) to the training data (X, r); update f̂ by adding in a shrunken version of the new tree, f̂(x) <- f̂(x) + λ f̂ᵇ(x); update the residuals, r_i <- r_i - λ f̂ᵇ(x_i); output the updated model. Each tree can be small, with few terminal nodes: slowly improve f̂ in the places where the model doesn't perform well. (See the sketch of this loop below.) Tuning parameters: the number of splits d in each tree (tree depth; d = 1 is just a stump with one single split); the number of trees B (boosted trees can overfit, though it would take a lot of trees; use CV to determine B); and the shrinkage parameter λ, a small positive number, usually 0.01 or 0.001.

VARIABLE IMPORTANCE: record the total drop in RSS for every split on a given predictor, averaged over all trees; a larger drop in RSS means the predictor is more important. For classification trees, add up the total decrease in the Gini index due to splits on a given predictor, averaged over all trees.
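A minimal sketch (mine, not from the notes) of the boosting loop just described, using small scikit-learn trees as the base learners; B, λ, d, and the toy data are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

B, lam, d = 500, 0.01, 1           # trees, shrinkage, splits per tree (stumps)
r = y.copy()                        # residuals start as the raw response
f_hat = np.zeros(len(y))            # f(x) starts at 0

for b in range(B):
    # Fit a small tree with d splits (d + 1 leaves) to the CURRENT residuals...
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1).fit(X, r)
    # ...add a shrunken version of it to the model, and update the residuals.
    pred = tree.predict(X)
    f_hat += lam * pred
    r -= lam * pred

print("training MSE:", np.mean((y - f_hat) ** 2))
```

In practice B would be chosen by CV, as the notes say, since boosted trees can eventually overfit.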
BAYESIAN ADDITIVE REGRESSION TREES (BART): an ensemble method related to random forests and boosting. Each tree is constructed in a random manner, and each tree tries to capture signal missed by the previous trees; used for regression, classification, etc. Start with stumps and apply perturbations based on partial residuals; average the trees to get final predictions.

Notation: K trees, B iterations of BART; f̂_kᵇ(x) is the prediction of the kth regression tree at the bth iteration. After each iteration, the K trees from that iteration are summed: f̂ᵇ(x) = Σ_{k=1}^{K} f̂_kᵇ(x).
In iteration 1, all the trees have a single node predicting the mean of the response values divided by the total number of trees. Further iterations update all K trees one at a time: in a given iteration, to update the kth tree, subtract from each response the predictions from all but the kth tree to get the partial residual. Rather than fitting a fresh tree to the partial residual, BART randomly chooses a perturbation of the tree from the previous iteration, favoring perturbations that improve the fit to the partial residual. Perturbations: (1) change the structure by adding or pruning branches; (2) change the prediction in each terminal node. Output: a collection of prediction models. To obtain a single prediction we take the average after some L burn-in iterations:
f̂(x) = (1/(B - L)) Σ_{b=L+1}^{B} f̂ᵇ(x).
The single-move perturbations guard against overfitting because they limit how hard we fit the data in each iteration; this slow learning reduces the likelihood of overfitting. We can also compute percentiles (e.g., quartiles), among other metrics, of the post-burn-in predictions to provide a measure of uncertainty in the final prediction. What makes it Bayesian: each time we randomly perturb a tree to fit the residuals, we are drawing a new tree from a posterior distribution (prior plus sampling distribution). Usually choose B and K values that are large and a moderate burn-in L. BART performs well with little tuning. (A small sketch of the partial-residual step follows below.)
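Finally, a tiny numpy sketch of the BART partial-residual step described above; the array shapes and values are hypothetical, and the tree fits/perturbations themselves are omitted.

```python
import numpy as np

# Hypothetical state: predictions of each of K trees for n observations,
# tree_preds[k, i] = f_k(x_i) from the previous BART iteration.
K, n = 20, 100
rng = np.random.default_rng(0)
tree_preds = rng.normal(scale=0.1, size=(K, n))
y = rng.normal(size=n)

k = 7  # the tree currently being updated
# Partial residual: subtract from each response the predictions of all
# trees EXCEPT the kth; the kth tree is then perturbed to better fit r.
r = y - (tree_preds.sum(axis=0) - tree_preds[k])

# The BART prediction at this iteration is the sum over all K trees;
# the final prediction averages these sums over post-burn-in iterations.
f_hat = tree_preds.sum(axis=0)
```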