Study Guide Part II: Cross Validation & Bootstrapping
Summary
This document covers cross-validation and bootstrap methods for estimating prediction error in statistical learning, outlining the rationale behind these methods and the benefits and drawbacks of their application. It also covers linear model selection and regularization, methods for moving beyond linearity, and tree-based models.
Full Transcript
STUDY GUIDE PART II

CHAPTER 5: CROSS-VALIDATION & BOOTSTRAPPING

RESAMPLING METHODS: get more info about a fitted model by repeatedly refitting the model to samples of the training data. They provide estimates of test-set prediction error and of the standard deviation and bias of parameter estimates (bootstrapping is best for this).

The best solution for prediction-error estimates is a large designated test set, but one is often not available. When a large dataset is not available, use the class of methods that hold out a subset of the training observations from the fitting process, then apply the statistical learning method to the held-out observations.

THE VALIDATION SET APPROACH
1. Randomly divide the data into a training set and a validation (hold-out) set.
2. Fit the model to the training data; use the fitted model to predict the responses in the validation set.
3. Use the validation-set error as an estimate of the test error (usually MSE for a quantitative response, misclassification error rate for a qualitative one).
We want to use cross-validation both to pick the degree of the polynomial (model selection) AND to give an idea of the test error at the end of the fitting process.

DRAWBACKS OF VALIDATION
- The validation error can be highly variable because the splits of the data are made randomly; it is highly dependent on which observations land in the validation set.
- Only a subset of the observations (those in the training set) are used to fit the model, so a lot of data is not considered. The validation-set error can therefore overestimate the test error: the training set is half as large as it would be with no validation split, and when you have more data the error is generally lower.

K-FOLD CROSS-VALIDATION: estimates the test error; used to select the best model. Randomly divide the data into K roughly equal-sized parts (folds) C_1, ..., C_K. Leave out one part, fit the model to the other K - 1 folds, obtain predictions for the left-out fold, and record the error. Do this for each fold k = 1, ..., K, then combine the results:

CV_(K) = Σ_{k=1}^{K} (n_k / n) MSE_k,  where MSE_k = Σ_{i ∈ C_k} (y_i - ŷ_i)² / n_k,

n_k is the number of observations in fold k (n_k = n/K when n is a multiple of K) and ŷ_i is the fit for observation i computed from the data with fold k removed.

Setting K = n yields n-fold or leave-one-out cross-validation (LOOCV). LOOCV can be useful (e.g., for least squares or polynomial regression) but doesn't change up the data enough: the training sets differ by only one observation, so the estimates from each fold are highly correlated, giving high variance and low bias. A better option is K-fold CV with K = 5 or 10.

ISSUES WITH CROSS-VALIDATION
- With K = 5, each training set is only (K - 1)/K = 4/5 as big as the original training set, so the estimates of the error will be biased upward.
- With LOOCV the bias is reduced because the training set is larger, but the training sets are almost the same every time, so the estimate has high variance. Bias vs. variance tradeoff.

CROSS-VALIDATION FOR CLASSIFICATION: works the same way, with the misclassification rate Err_k on fold k playing the role of MSE_k.

THE RIGHT AND WRONG WAY TO DO CV: suppose step 1 screens the predictors, keeping the 100 with the greatest predictive ability, and step 2 fits a classifier using them. Applying CV only in step 2 is not very useful: we are assuming the folds are independent, but they aren't, because the training data shares information with the validation data. You have already cherry-picked the 100 predictors using ALL of the data, so you would essentially be tricking cross-validation and your estimated test error will be too low.
WRONG: apply CV only in step 2. RIGHT: apply CV to steps 1 AND 2 together.

THE BOOTSTRAP: a flexible tool that can be used to quantify uncertainty (e.g., choose the allocation α that minimizes a portfolio's variance). Repeatedly drawing new datasets from the population is not realistic, because we don't have the whole population. The bootstrap mimics obtaining new datasets: repeatedly sample from our original dataset WITH REPLACEMENT. Each bootstrapped dataset is the same size as the original dataset; some observations may appear multiple times, others might not be included at all.

In more complex scenarios (e.g., time series), bootstrapping can be more complicated: in a time series you can't just sample with replacement, because, say, the stock price on one day is related to the stock price the previous day. For time series, people use a BLOCK BOOTSTRAP: group the samples into blocks B_1, B_2, B_3, B_4, ..., sample the blocks, and assume the blocks are uncorrelated while allowing correlation within each block.
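These notes contain no code; below is a minimal numpy sketch of the plain (non-block) bootstrap described above, estimating the standard error of a statistic. The toy data and the choice of statistic (a sample correlation) are my own illustrations, not from the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset standing in for the original sample (hypothetical values).
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.8, size=100)


def statistic(x, y):
    """The quantity whose uncertainty we want: here, a sample correlation."""
    return np.corrcoef(x, y)[0, 1]


B = 1000  # number of bootstrap datasets
n = len(x)
estimates = np.empty(B)
for b in range(B):
    # Resample row indices WITH replacement; each bootstrap dataset has the
    # same size n as the original, so some rows repeat and others drop out.
    idx = rng.integers(0, n, size=n)
    estimates[b] = statistic(x[idx], y[idx])

print("bootstrap estimate of SE:", estimates.std(ddof=1))
```

The histogram of `estimates` is also what the percentile confidence interval in the next section is read off of.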
PRIMARY USES OF THE BOOTSTRAP: estimating standard errors, but also determining confidence intervals. Look at the histogram of the bootstrap estimates α̂*; taking, say, the 5th and 95th percentiles gives an approximate 90% confidence interval, the BOOTSTRAP PERCENTILE interval.

BOOTSTRAP FOR PREDICTION ERROR: in CV, each of the K validation folds is distinct from the other K - 1 folds used for training; there is no overlap, which is needed for an honest estimate. For the bootstrap, think about each bootstrap dataset as our training data and the original sample as the validation data. But there is significant overlap between each bootstrap sample and the original data: about 2/3 of the original data appears in each bootstrap set, so 2/3 of the validation data has already been seen by the model and the prediction error will be biased downward. (Even worse is using the original data as training and the bootstrap sample as validation.) FIX: estimate the bootstrap prediction error using only predictions for observations that did not occur in the current bootstrap sample. This gets complicated, and in the end CV is simpler and better for estimating prediction error.

CHAPTER 6: LINEAR MODEL SELECTION & REGULARIZATION

Linear model: Y = β0 + β1 X1 + ... + βp Xp + ε. Before exploring nonlinear alternatives, consider improving the linear model by replacing least squares with alternative fitting procedures. Why consider alternatives to least squares? When the number of features p is large relative to the number of observations n, we want to control variance (prediction accuracy); and for model interpretability, removing irrelevant features by setting their coefficients to 0 makes the model easier to interpret (feature selection).

3 CLASSES OF METHODS: subset selection; shrinkage (ridge, lasso); dimension reduction (PCR, PLS).

SUBSET SELECTION: best subset and stepwise.

BEST SUBSET SELECTION
1. Let M_0 be the null model with no predictors: it simply predicts the sample mean for each observation.
2. For k = 1, 2, ..., p: (a) fit all C(p, k) models that contain exactly k predictors; (b) pick the best among them, call it M_k. Best = smallest RSS or largest R².
3. Select a single model from M_0, ..., M_p using cross-validated prediction error, Cp, AIC, BIC, or adjusted R².
We find the best models with 1, 2, 3, ..., p predictors; then among those we pick the model with the lowest estimated test error (e.g., through cross-validation). Usable when p is around 10-20.

EXTENSIONS TO OTHER MODELS: we have looked at models fit by least squares, but the same idea applies to other models such as logistic regression; the deviance (negative two times the maximized log-likelihood) plays the role of RSS for those models.

STEPWISE SELECTION: for large p we cannot use best subset selection because the number of models to compare, 2^p, is too large. Best subset can suffer computationally and statistically: the larger the search space, the higher the chance of finding models that look good on the training data but have little predictive power on new data (overfitting, high variance of the coefficient estimates). So stepwise methods may be preferable, especially when p >= 40.

FORWARD STEPWISE SELECTION: start with a model containing no predictors (just the intercept) and add predictors to the model one by one until all are in the model. AT EACH STEP, the variable giving the greatest ADDITIONAL improvement to the fit is added. Difference from best subset: at each step we do not look at all models with k predictors, only the models that add one predictor to the model from the previous step.
1. M_0 = the null model with no predictors.
2. For k = 0, ..., p - 1: (a) consider all p - k models that augment the predictors in M_k with one additional predictor; (b) choose the best among these p - k models and call it M_{k+1} (smallest RSS / largest R²).
3. Select a model from M_0, ..., M_p using CV, Cp, BIC, or adjusted R².
PROS: a computational advantage over best subset selection: it fits only 1 + p(p+1)/2 models instead of 2^p.
CONS: not guaranteed to find the best possible model out of all 2^p models, since it does not look at every single potential model with k predictors. The best model with k predictors need not be a superset of the best smaller model: X1 might be the strongest single predictor, but {X2, X3} together might be stronger. We can only build off our pre-existing model; only if the variables had no correlation would the feature selection be the same as best subset. (A sketch of this loop appears below.)
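As a hedged illustration of the forward stepwise loop above (not part of the original notes), here is a small numpy sketch that adds, at each step, the predictor giving the largest drop in training RSS; the toy data are hypothetical.

```python
import numpy as np


def rss(X, y):
    """RSS of the least-squares fit of y on X (intercept included)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return resid @ resid


def forward_stepwise(X, y):
    """Return the order in which predictors enter the model."""
    p = X.shape[1]
    selected, remaining = [], list(range(p))
    while remaining:
        # Steps 2(a)-(b): among the p - k one-variable augmentations,
        # keep the one with the smallest training RSS.
        best_j = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected.append(best_j)
        remaining.remove(best_j)
    return selected  # M_1 ⊂ M_2 ⊂ ... ⊂ M_p by construction


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 2] - 2 * X[:, 4] + rng.normal(size=200)
print(forward_stepwise(X, y))  # indices 2 and 4 should enter first
```

Step 3 (choosing among M_0, ..., M_p with CV or a criterion) is deliberately left out; it reuses the criteria defined in the next section.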
BACKWARD STEPWISE SELECTION: good when p is large (but p must not be greater than n). Begin with the full least-squares model containing all p predictors and remove the least useful predictor one at a time.
1. Start with M_p, the full model.
2. For k = p, p - 1, ..., 1: (a) consider all k models that contain all but one of the predictors in M_k, for a total of k - 1 predictors each; (b) choose the best among these k models and call it M_{k-1} (smallest RSS / highest R²).
3. Select a model from M_0, ..., M_p using cross-validated prediction error, Cp, BIC, or adjusted R².
Backward stepwise searches through only 1 + p(p+1)/2 models, so like forward stepwise it is not guaranteed to yield the BEST model containing a subset of the p predictors. The number of samples n must be larger than the number of variables p so that the full model can be fit; forward stepwise works even when n < p and is the ONLY viable subset method when p is very large.

CHOOSING THE OPTIMAL MODEL: the model with all the variables will always have the lowest RSS, i.e., the lowest training error, but that means its variance will be high, so we cannot just choose the model with the lowest RSS or highest R².

ESTIMATING TEST ERROR, two approaches:
1. Indirectly estimate the test error by making an adjustment to the training error to account for overfitting: Cp, AIC, BIC, adjusted R².
2. Directly estimate it using CV or a validation set.

Cp, BIC, AIC, and adjusted R² adjust the training error for model size and can be used to select the best model among models with varying numbers of variables. We want Cp, AIC, and BIC to be small; for adjusted R² we want a large value.

Mallow's Cp:
Cp = (1/n) (RSS + 2 d σ̂²),
where d is the total number of parameters used (the variables plus 1 for the intercept) and σ̂² is an estimate of the variance of the error ε.

AIC: defined for a large class of models fit by maximum likelihood,
AIC = -2 log L + 2 d,
where L is the maximized value of the likelihood function for the estimated model. For linear models with Gaussian errors, -2 log L equals RSS/σ̂² up to a constant, so AIC and Cp are proportional; AIC also applies to nonlinear models.

BIC:
BIC = (1/n) (RSS + log(n) d σ̂²).
Like Cp, BIC tends to be small for a model with low test error, so we want models with a small BIC. AIC vs. BIC: since log n > 2 for any n > 7, BIC places a heavier penalty on models with more variables and so results in the selection of smaller models than Cp.

Adjusted R²:
adjusted R² = 1 - [RSS/(n - d - 1)] / [TSS/(n - 1)].
To maximize adjusted R² we want to minimize RSS/(n - d - 1). RSS always decreases as variables are added, but RSS/(n - d - 1) can go up or down due to the presence of d in the denominator. Unlike R², adjusted R² pays a price for including unnecessary variables in the model: it is penalized if you add a predictor that doesn't add much predictive ability. Can be used when p < n. (See the sketch after this section for computing these criteria.)

VALIDATION & CROSS-VALIDATION: each selection procedure returns a sequence of models M_0, ..., M_p; we estimate the test error of each and select the k with the lowest estimated test error. Advantages: we don't need an estimate of the error variance σ², and CV can be used in a wider range of model-selection tasks, even when it is hard to determine the number of parameters in the model or to estimate the error variance.

1-STANDARD-ERROR RULE: the minimum of the estimated-error curve might not be meaningfully better than its neighbors, so rather than picking the minimum, choose the simplest model whose estimated error is within 1 standard error of the minimum. Do this because it is easier to interpret: virtually the same error, but a simpler model.
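A minimal sketch (my own, following the definitions above) of computing Cp, BIC, and adjusted R² for a candidate model; all the numbers in the example call are hypothetical.

```python
import numpy as np


def selection_criteria(rss, tss, n, d, sigma2):
    """Compute Mallow's Cp, BIC, and adjusted R^2 as defined above.

    rss, tss : residual and total sum of squares of the candidate model
    n        : number of observations
    d        : number of parameters used by the model
    sigma2   : estimate of Var(eps), e.g. from the full least-squares fit
    """
    cp = (rss + 2 * d * sigma2) / n
    bic = (rss + np.log(n) * d * sigma2) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, bic, adj_r2


# Hypothetical numbers just to show the call; smaller Cp/BIC and larger
# adjusted R^2 indicate the preferred model.
print(selection_criteria(rss=120.0, tss=400.0, n=100, d=5, sigma2=1.3))
```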
SHRINKAGE METHODS: ridge regression & the lasso. Subset selection uses least squares to fit a linear model that contains a subset of the predictors. Shrinkage is a different way to fit a linear model, not least squares: a penalty is used to shrink the coefficients toward 0 based on importance.

RIDGE REGRESSION: choose the coefficients to minimize
RSS + λ Σ_{j=1}^{p} βj²,
where λ >= 0 is a tuning parameter. We want coefficients that fit the data well (reduce RSS), but the larger λ is, the more the shrinkage penalty λ Σ βj² matters. The penalty is small when β1, ..., βp are close to 0, so it has the effect of shrinking the βj toward 0. λ controls the impact of the two terms on the coefficient estimates; it is really a balancing problem, and we use CV to determine λ. Notation: the ℓ2 norm is ||β||_2 = sqrt(Σ βj²).

THE SCALING OF PREDICTORS: standard least-squares coefficient estimates are scale-equivariant: multiplying X_j by a constant c simply scales the least-squares coefficient by 1/c, so X_j β̂_j is the same regardless of how the jth predictor is scaled. IN RIDGE, the coefficients can change significantly when a predictor is multiplied by a constant, because of the sum-of-squared-coefficients term in the penalty. So we must standardize, applying ridge AFTER standardizing the predictors using
x̃_ij = x_ij / sqrt( (1/n) Σ_{i=1}^{n} (x_ij - x̄_j)² ),
i.e., divide each variable by its standard deviation.

λ controls the bias-variance tradeoff: a larger λ gives a less flexible fit, with lower variance and higher bias. Ridge coefficients never go exactly to 0; the penalty just makes them approach 0. CON: ridge does not select predictors; it uses all of them.

THE LASSO: minimizes RSS + λ Σ |βj|, using an ℓ1 penalty instead of ℓ2. The ℓ1 norm is ||β||_1 = Σ |βj| instead of ||β||_2. The lasso shrinks coefficients toward 0 but can also set coefficients exactly to 0 if a feature is not important, which gives variable selection and creates sparse models: a good option when p is large. As λ grows, more coefficients are forced to 0. Why does the lasso allow βj = 0? The ℓ1 constraint region has corners on the coordinate axes, so the solution often lands at a point where some coefficients are exactly zero.

CONCLUSIONS: the lasso might perform better when the response is a function of only a small number of predictors, but the set of predictors truly related to the response is never known a priori for a real dataset; CV can be used to determine which approach is better for a given dataset. When λ = 0 the loss function is the same as ordinary least squares.

SELECTING THE TUNING PARAMETER FOR RIDGE & LASSO: CV is a good scheme to determine λ. Compute the CV error over a grid of λ values, select the λ with the smallest CV error, then refit the model using ALL observations and the selected tuning parameter. (A sketch follows below.)
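A hedged scikit-learn sketch of the ridge/lasso workflow above: standardize, let CV pick λ (called `alpha` in scikit-learn), then inspect sparsity. The toy data and the alpha grid are my own choices, not from the notes.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=100)  # sparse truth

# Standardize first (ridge/lasso are not scale-equivariant), then let
# cross-validation pick the tuning parameter over a grid.
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 50)))
lasso = make_pipeline(StandardScaler(), LassoCV(cv=10))

ridge.fit(X, y)
lasso.fit(X, y)

# Ridge shrinks coefficients toward 0 but keeps all 20 predictors;
# the lasso's l1 penalty sets most of them exactly to 0 (sparse model).
print("nonzero ridge coefs:", np.sum(ridge[-1].coef_ != 0))
print("nonzero lasso coefs:", np.sum(lasso[-1].coef_ != 0))
```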
DIMENSION REDUCTION: the preceding approaches fit a least-squares model to the original predictors. Dimension-reduction methods instead use the p original predictors to fit a model with M new predictors, where M < p (hence "dimension reduction"). The new predictors Z_1, ..., Z_M are linear combinations of the p predictors,
Z_m = Σ_{j=1}^{p} φ_mj X_j, for constants φ_m1, ..., φ_mp,
and the model is fit with least squares using the new, transformed predictors (with new β's). Dimension reduction constrains the estimated coefficients, which can win on the bias-variance tradeoff.

PRINCIPAL COMPONENTS REGRESSION (PCR): PCA is used to define the linear combinations of the predictors used in the regression. The 1st principal component is the linear combination of the variables with the largest variance: when many correlated variables are present, we replace them with a small set of principal components that capture their shared variation, the direction along which the group of points collectively varies the most (and onto which the projection RSS is smallest). If there is a strong correlation between features, the 1st PC can stand in for them; you don't need the whole set of predictors, e.g., to predict sales. Subsequent PCs are the combinations with the next-highest variance. Using more PCs lowers bias but raises variance; choose the number of components M by cross-validation to minimize test MSE. When M is really high, you are essentially just doing least squares. PCA is an unsupervised analysis: it identifies the linear combinations (directions) that best represent the predictors, and these directions are identified WITHOUT supervision, because the response Y is not used to determine the principal component directions. So PCR has a large CON: there is no guarantee that the directions that best explain the predictors will also be best for predicting the response.

PARTIAL LEAST SQUARES (PLS): a dimension-reduction method that identifies new predictors Z_1, ..., Z_M that are linear combinations of the original features, then fits a linear model with ordinary least squares using the M features. PLS is supervised: it uses Y to identify new features that both approximate the old features and are related to the response. PLS attempts to find directions that help explain both the predictors and the response.
1. Standardize the p predictors.
2. PLS computes the first direction Z_1 by setting each φ_1j equal to the coefficient from the simple linear regression of Y onto X_j; this coefficient is proportional to the correlation between X_j and Y. In computing Z_1 = Σ_j φ_1j X_j, PLS places the most weight on the predictors that are most strongly related to the response.
3. The other directions are found by taking residuals and repeating the above.

CHAPTER 7: MOVING BEYOND LINEARITY

Assuming linearity is often not enough; nonlinearity offers flexibility.

POLYNOMIAL REGRESSION
y_i = β0 + β1 x_i + β2 x_i² + β3 x_i³ + ... + ε_i.
At the extremes the data thin out and the standard error gets wider (the tails of the confidence interval flare). Details: create new variables X_1 = X, X_2 = X², ... and treat the fit as multiple linear regression. We are not really interested in the coefficients, more in the fitted function values: because f̂(x_0) is a linear function of the coefficients, we can get a very simple expression for the pointwise variance at any x_0, i.e., at any given point you can attach an error band. We either fix the degree d at some reasonably low value or use CV. Logistic regression follows naturally: to get confidence intervals, find the upper and lower bounds on the logit scale, then invert to get them on the probability scale. Can be done separately on many variables: just stack the variables in a matrix and separate them after. CAVEAT: polynomials have extreme tail behavior, very bad for extrapolation.

STEP FUNCTIONS: cut the variable into distinct sections; helpful if there are natural cut points; local, unlike polynomials. Create dummy variables and fit the model. A useful way of creating interactions that are easily interpretable, e.g., the interaction I(year < 2005) * age allows a different linear function of age in each era. The choice of cutpoints (knots) can be hard.

PIECEWISE POLYNOMIALS: use different polynomials in regions delimited by knots. It is better to add constraints to the polynomials, e.g., continuity. SPLINES have the MAXIMUM amount of continuity: we want 0 visible breaks, so enforce continuity of the function and of its derivatives; for a cubic spline, continuity up to the 2nd derivative at each knot.

LINEAR SPLINE with knots ξ_k, k = 1, ..., K: a piecewise linear polynomial, continuous at each knot,
y_i = β0 + β1 b_1(x_i) + β2 b_2(x_i) + ... + β_{K+1} b_{K+1}(x_i) + ε_i,
where b_1(x_i) = x_i and b_{k+1}(x_i) = (x_i - ξ_k)_+ for k = 1, ..., K, with (x_i - ξ_k)_+ the positive part: x_i - ξ_k if x_i > ξ_k, else 0.

CUBIC SPLINE with knots ξ_k, k = 1, ..., K: a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot,
y_i = β0 + β1 x_i + β2 x_i² + β3 x_i³ + Σ_{k=1}^{K} β_{k+3} (x_i - ξ_k)³_+ + ε_i.

NATURAL CUBIC SPLINES extrapolate linearly beyond the boundary knots. This adds 4 = 2 * 2 extra constraints and allows us to put more internal knots for the same number of degrees of freedom; the extra constraints control wagging at the extremes.

KNOT PLACEMENT: one strategy is to decide on the number of knots K and place them at appropriate quantiles of the observed X. A cubic spline with K knots has K + 4 parameters (df); a natural spline with K knots has K df. Splines are more local than polynomials. (A basis-expansion sketch appears after the GAM notes below.)

SMOOTHING SPLINES: avoid the knot-selection issue, leaving a single λ to choose. Minimize
Σ_{i=1}^{n} (y_i - g(x_i))² + λ ∫ g''(t)² dt.
The first term tries to make g(x) match the data at each x_i; the second term, the roughness penalty, modulates how wiggly g is: small λ, little penalty, more wiggly. λ is the tuning parameter. The solution to this criterion is a natural cubic spline with a knot at every unique x_i, with the roughness controlled by λ. Details: the vector of n fitted values can be written as ĝ_λ = S_λ y, where S_λ is an n x n matrix determined by the x_i and λ. The effective degrees of freedom of the smoothing matrix are df_λ = Σ_{i=1}^{n} {S_λ}_ii, the sum of its diagonal. One can specify df rather than λ, and can estimate λ using LOOCV or CV.

LOCAL REGRESSION: with a sliding weight function, fit separate linear fits over the range of X by weighted least squares. With a comparable effective df, the result looks rather similar to a smoothing spline.

GAMs (GENERALIZED ADDITIVE MODELS)
y_i = β0 + f_1(x_i1) + f_2(x_i2) + ... + f_p(x_ip) + ε_i.
Plot the various contributions of the fitted nonlinear functions. A GAM can be fit using many building blocks, e.g., natural splines. The coefficients are not that interesting; the fitted functions are. Terms can be mixed, some linear, some nonlinear. GAMs are additive: no interactions in the model, though low-order interactions can be included using, e.g., bivariate functions. GAMs can also be used for classification.
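As an illustration of the "create basis variables, then run multiple linear regression" recipe used for splines above, here is a hedged scikit-learn sketch. Note that scikit-learn's SplineTransformer uses a B-spline basis rather than the truncated-power basis written above (the spanned function space is the same), and the data and knot count are my own choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)
y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=200)

# Cubic spline basis with knots at quantiles of x (the knot-placement
# strategy from the notes), then ordinary least squares on the basis columns.
model = make_pipeline(
    SplineTransformer(degree=3, n_knots=6, knots="quantile"),
    LinearRegression(),
)
model.fit(x, y)
print("first fitted values:", model.predict(x)[:5])
```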
CHAPTER 8: DECISION TREE MODELS

Stratifying / segmenting the predictor space. PROS: simple, easily interpretable. CONS: not competitive with the best supervised learning approaches. Decision trees can be used for classification and regression; internal nodes split on a feature.

REGRESSION TREES
1. Divide the predictor space into J non-overlapping regions: choose how to divide the predictor space into high-dimensional rectangles (boxes). The goal is to find boxes with minimal RSS, but each box shouldn't represent each individual sample. Take a top-down, greedy approach known as recursive binary splitting: greedy because at each step the split made is the best at that particular step, rather than looking ahead and picking a split that will lead to a better future tree.
2. Select the predictor and split point with the smallest RSS: look at all potential predictors and split points and pick the split that minimizes RSS; repeat the process. Hyperparameters can be specified to make the tree smaller. Use the mean of the observations in each terminal node to make predictions.
This could produce an overfit tree if the tree is too large (e.g., each observation has its own terminal node). You could stop early when splits don't improve predictive ability, or continue as long as RSS decreases; you can tune hyperparameters, or you can prune.

PRUNING: build a large tree, then cut it down, removing branches that don't significantly contribute to predictive ability; penalize complexity with a parameter α selected using CV.
1. Use recursive binary splitting to grow a large tree, stopping when each terminal node has fewer than some minimum number of observations.
2. Apply cost-complexity pruning to get a sequence of best subtrees as a function of α.
3. Use CV to determine α.
4. Return the subtree from step 2 that corresponds to the chosen α.

CLASSIFICATION TREES: assume each observation belongs to the dominant class in the region it falls into. Use the classification error rate instead of RSS, but it is not very sensitive for growing a tree; use the Gini index instead:
G = Σ_{k=1}^{K} p̂_mk (1 - p̂_mk),
which measures the purity of the classes: if the p̂_mk are all close to 0 or 1, the Gini index is small. CON: trees do not have the same level of accuracy as other regression/classification models.

BAGGING: think about the bootstrap. Recall that for a set of n independent observations Z_1, ..., Z_n, each with variance σ², the variance of the mean Z̄ is σ²/n: averaging a set of observations reduces variance. So take bootstrap samples, grow a tree on each, and average all of the predictions:
f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂*ᵇ(x),
where f̂*ᵇ(x) is the prediction of the tree grown on bootstrap sample b. For classification, majority rules: if 51% of trees say 1 and 49% say 0, predict 1. OOB ERROR: each tree contains about 2/3 of the data (think back to the bootstrap); use the remaining 1/3 (the out-of-bag observations) to validate.

RANDOM FORESTS: reduce variance further by decorrelating the trees. Build decision trees on bootstrapped samples, but each time a split in the tree is considered, a random selection of m predictors is chosen as candidates from the full set of p parameters, and the split uses only one of those m predictors. A fresh selection of m predictors is taken at each split; usually choose m ≈ sqrt(p). I.e., don't consider all p potential predictors at each split, only a random selection of m; this filters the predictors so that a few strong ones can't dominate every tree. You can't overfit by adding more trees; you know you're done when adding more trees doesn't reduce the error any further and it levels out.

BOOSTING: a sequential algorithm that improves on the previous model. Fit a tree f̂ᵇ with d splits (d + 1 terminal leaves) to the training data (X, r); update f̂ by adding in a shrunken version of the new tree, f̂(x) <- f̂(x) + λ f̂ᵇ(x); update the residuals, r_i <- r_i - λ f̂ᵇ(x_i); output the updated model. Each tree can be small, with few terminal nodes: slowly improve f̂ in the places where the model doesn't perform well. (See the sketch of this loop below.) Tuning parameters: the number of splits d in each tree (tree depth; d = 1 is just a stump with one single split); the number of trees B (boosted trees can overfit, though it would take a lot of trees; use CV to determine B); and the shrinkage parameter λ, a small positive number, usually 0.01 or 0.001.

VARIABLE IMPORTANCE: record the total drop in RSS for every split on a given predictor, averaged over all trees; a larger drop in RSS means the predictor is more important. For classification trees, add up the total decrease in the Gini index due to splits on a given predictor, averaged over all trees.
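A minimal sketch (mine, not from the notes) of the boosting loop just described, using small scikit-learn trees as the base learners; B, λ, d, and the toy data are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

B, lam, d = 500, 0.01, 1           # trees, shrinkage, splits per tree (stumps)
r = y.copy()                        # residuals start as the raw response
f_hat = np.zeros(len(y))            # f(x) starts at 0

for b in range(B):
    # Fit a small tree with d splits (d + 1 leaves) to the CURRENT residuals...
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1).fit(X, r)
    # ...add a shrunken version of it to the model, and update the residuals.
    pred = tree.predict(X)
    f_hat += lam * pred
    r -= lam * pred

print("training MSE:", np.mean((y - f_hat) ** 2))
```

In practice B would be chosen by CV, as the notes say, since boosted trees can eventually overfit.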
BAYESIAN ADDITIVE REGRESSION TREES (BART): an ensemble method related to random forests and boosting. Each tree is constructed in a random manner, and each tree tries to capture signal missed by the previous trees; used for regression, classification, etc. Start with stumps and apply perturbations based on partial residuals; average the trees to get final predictions.

Notation: K trees, B iterations of BART; f̂_kᵇ(x) is the prediction of the kth regression tree at the bth iteration. After each iteration, the K trees from that iteration are summed: f̂ᵇ(x) = Σ_{k=1}^{K} f̂_kᵇ(x).
In iteration 1, all the trees have a single node predicting the mean of the response values divided by the total number of trees. Further iterations update all K trees one at a time: in a given iteration, to update the kth tree, subtract from each response the predictions from all but the kth tree to get the partial residual. Rather than fitting a fresh tree to the partial residual, BART randomly chooses a perturbation of the tree from the previous iteration, favoring perturbations that improve the fit to the partial residual. Perturbations: (1) change the structure by adding or pruning branches; (2) change the prediction in each terminal node. Output: a collection of prediction models. To obtain a single prediction we take the average after some L burn-in iterations:
f̂(x) = (1/(B - L)) Σ_{b=L+1}^{B} f̂ᵇ(x).
The single-move perturbations guard against overfitting because they limit how hard we fit the data in each iteration; this slow learning reduces the likelihood of overfitting. We can also compute percentiles (e.g., quartiles), among other metrics, of the post-burn-in predictions to provide a measure of uncertainty in the final prediction. What makes it Bayesian: each time we randomly perturb a tree to fit the residuals, we are drawing a new tree from a posterior distribution (prior plus sampling distribution). Usually choose B and K values that are large and a moderate burn-in L. BART performs well with little tuning. (A small sketch of the partial-residual step follows below.)
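Finally, a tiny numpy sketch of the BART partial-residual step described above; the array shapes and values are hypothetical, and the tree fits/perturbations themselves are omitted.

```python
import numpy as np

# Hypothetical state: predictions of each of K trees for n observations,
# tree_preds[k, i] = f_k(x_i) from the previous BART iteration.
K, n = 20, 100
rng = np.random.default_rng(0)
tree_preds = rng.normal(scale=0.1, size=(K, n))
y = rng.normal(size=n)

k = 7  # the tree currently being updated
# Partial residual: subtract from each response the predictions of all
# trees EXCEPT the kth; the kth tree is then perturbed to better fit r.
r = y - (tree_preds.sum(axis=0) - tree_preds[k])

# The BART prediction at this iteration is the sum over all K trees;
# the final prediction averages these sums over post-burn-in iterations.
f_hat = tree_preds.sum(axis=0)
```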