Model Assessment (Evaluation/Validation) and Model Selection

Summary

This document is a lecture on model assessment and model selection in machine learning. It covers how to evaluate the performance of models, how to choose the best one through model selection, and approaches for estimating a model's generalization error. Key concepts include the bias-variance trade-off, validation, and cross-validation.

Full Transcript


Model Assessment (Evaluation/Validation) and Model Selection (part I)
Alessio Micheli, [email protected], 2024
Dipartimento di Informatica, Università di Pisa - Italy

ML Course structure: Where we go
[Course map diagram: INTRO (Data, Task, Model, Learning Alg., Validation); discrete, continuous and probabilistic hypothesis spaces; concept learning, linear models (LTU-LMS), K-nn, inductive bias, SOM, RNN, Neural Networks, Deep Learning, Bayesian Networks; theory (Validation & SLT, Bias/Variance); SVM; Applications/Project; advanced topics.]

A premise: Bias-Variance
The bias-variance decomposition provides a useful framework for understanding the validation issue, showing why estimating a model's performance is difficult:
- it also considers the role of different training set realizations;
- it shows again, from a different angle, the need for a trade-off between fitting capability (bias) and model flexibility (variance).
Anyway, we will discuss it later in a dedicated lecture. Let me introduce just this plot, to recall the need to optimize a model in the right way.

Under & Overfitting in a Bias-Variance plot*
[Plot: error versus model complexity for "lucky" and "unlucky" training set realizations.]
Complex model: lower bias, higher variance → possible overfitting.

Motivations
Recall that we are looking for the best solution (minimal prediction/test error), searching for a balance between fitting (accuracy on the training data) and model complexity.
The training error is not a good estimate of the test error:
- initially the model has too much bias (underfitting, high TR error), then too much variance (low TR error but overfitting).
Assuming a tuning hyper-parameter (implicit or explicit) that varies the model complexity, we wish to find the value that minimizes the test error.
We therefore look for methods to estimate the expected error of a model (or of each model in a class of models, or even of a set of models*).

Approximating the estimation step (intro: a guide)
We can approximate the estimation analytically:
- AIC, BIC (Akaike/Bayesian Information Criterion), limited (in particular) to linear models
- MDL (Minimum Description Length)
- SRM: Structural Risk Minimization & VC-dimension → in the INTRO lecture + a later lecture on the VC-dimension
In practice, we can often approximate the estimation on the data set by resampling, i.e. direct estimation of the extra-sample error via:
- cross-validation (hold-out, K-fold CV, …)
- bootstrap

Validation: Two aims
Recall the slides in the intro (on validation)! THE MOST IMPORTANT SLIDE of the course.
After training the models on the training set:
Model selection: estimating the performance (generalization error) of different learning models in order to choose the best one (to generalize)
- this includes searching for the best hyper-parameters of your model (e.g. polynomial order, #units in a NN, lambda, eta, …). It returns a model.
Model assessment: having chosen a final model (or a class of models), estimating/evaluating its prediction error/risk (generalization error) on new test data (a measure of the quality of the ultimately chosen model). It returns an estimation value.
Gold rule: keep separation between goals and use separate data sets in each phase.

Hold out
If the data set size is sufficient: e.g. 50% TR, 25% VL, 25% TS (disjoint sets). TR+VL form the development/design set D_TR+VL, and the test set is D \ D_TR+VL.
TR: the Training set is used to fit the model (to minimize Remp in SLT) [training].
VL: the Validation set (or selection set) is used to select the best model (among different models and/or hyper-parameter configurations) [model selection].
TR+VL are sometimes jointly called the development/design set, i.e. they are used to build the final model (training different models and selecting the best final model).
TS: the Test set is used to estimate the generalization error of the final model (to estimate R in SLT) [model assessment].
Notes:
1) the estimation made for model selection (on the VL set) serves the model selection purpose; it is not a good estimation for the assessment phase/test risk.
2) test set results cannot be used for model selection (otherwise call it a validation set).
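To make the hold-out scheme above concrete, here is a minimal sketch in Python, assuming NumPy and scikit-learn are available; the toy data, the array names (X, y) and the exact 50/25/25 proportions are illustrative, not prescribed by the slides.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Toy data, only to make the example self-contained.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = rng.normal(size=200)

    # First split: 75% development set (TR+VL) vs 25% test set (TS).
    X_dev, X_ts, y_dev, y_ts = train_test_split(X, y, test_size=0.25, random_state=0)

    # Second split: divide the development set into TR (50% of the original data)
    # and VL (25% of the original data), i.e. 2/3 vs 1/3 of the development set.
    X_tr, X_vl, y_tr, y_vl = train_test_split(X_dev, y_dev, test_size=1/3, random_state=0)

    # Gold rule: fit on TR, select hyper-parameters on VL, estimate the risk once on TS.
    print(len(X_tr), len(X_vl), len(X_ts))  # 100, 50, 50

The three sets are kept disjoint from the start, so each phase (training, model selection, model assessment) uses its own data.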
Test or model selection?
What if the test set is used in a (repeated) design cycle?
- We are doing model selection, not a reliable assessment (estimation of the expected generalization error), and we would not be able to do that on future examples.
Blind test set concept (e.g. for ML competitions). Imagine an exam exercise: if you see the solutions, it is not a test!
- In that case, the test set error used in this way provides an over-optimistic evaluation of the true test error (→ we will see how easy it is to obtain very high classification accuracy on a random task even when the test set is used only implicitly).
Gold rule: keep separation between goals and use separate sets in each phase (TR for training, VL for model selection, TS for risk estimation).

TR/VL/TS by a simple schema
[Schema: on the developer side, the data set feeds model training (TR), model selection (VL) within model development/design, and model assessment (TS); the deployed model then performs inference/predictions on new data, e.g. on the client side.]

Counterexample (relevance of a correct assessment)
- 20-30 examples, 1000 random-valued input variables,
- random 0/1 target.
- We select the one model with the one input variable/feature that guesses (by chance) 99% (or 100%) on any subsequent TR, VL and TS splitting.
Perfect result (a model with 99% accuracy)? What is wrong?
99%/100% is not a good estimation of the test error (the right one is 50%):
1) the estimation of the error on TR and VL is NOT a good risk estimation;
2) using the entire dataset for feature/model selection prejudices the estimation (a biased estimation, called subset selection bias).
- The test set was implicitly used at the beginning (*).
- The test set must be separated in advance, before making any model selection (even feature selection)!
On an external test set the result is 50% (a random coin result)!!!!!
We are discussing the correctness of the estimation, not the possibility of solving the task!
K-fold CV and other techniques do not solve the issue [but can confuse you! ;-)]

The table for the Counterexample
(1000 random binary input variables in total; only three are shown.)

                     ...  var 26   var 27   var 28  ...  Target
    Pattern 1              1        1        1            1
    Pattern 2              0        0        0            0
    Pattern 3              1        1        1            1
    Pattern 4              0        0        0            0
    ...
    Pattern 20             1        1        1            1
    TS1                    1        0        0            1
    TS2                    0        1        0            0
    TS3                    1        1        1            1
    Accuracy on TS       100%      33%      66%

All three shown variables match the random target on the 20 training patterns, but only variable 26 keeps doing so (by chance) on the external test patterns: the accuracy obtained during selection tells us nothing about the true test error.
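A minimal sketch of the counterexample above, assuming NumPy is available; the sample sizes, the seed and the "select the single best-matching feature" rule are illustrative choices, not the lecture's exact setup.

    import numpy as np

    rng = np.random.default_rng(42)
    n, p = 20, 1000
    X = rng.integers(0, 2, size=(n, p))   # random binary features
    y = rng.integers(0, 2, size=n)        # random 0/1 target: there is no real signal

    # WRONG protocol: pick the feature that best matches the target on ALL the data
    # (feature selection implicitly uses the test data -> subset selection bias).
    acc_all = (X == y[:, None]).mean(axis=0)
    best = int(np.argmax(acc_all))
    print("apparent accuracy of the selected feature:", acc_all[best])
    # typically around 0.85-0.95, far above the true 0.5

    # CORRECT check: the same selected feature on fresh data from the same (random) source.
    X_new = rng.integers(0, 2, size=(10000, p))
    y_new = rng.integers(0, 2, size=10000)
    print("accuracy on an external test set:", (X_new[:, best] == y_new).mean())
    # ~0.5: the apparent result was an artifact of selecting on the whole data set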
Grid search (model selection phase)
Hyper-parameters, i.e. parameters that are not directly learnt within the estimators, can be set by searching a hyper-parameter space (of course by model selection on a validation set). Search for the best hyper-parameter values.
Exhaustive Grid Search exhaustively generates candidates from a grid of parameter values.
TRIVIAL EXAMPLE:

                 lambda 0.1   lambda 0.01   lambda 0.001
    1 unit         Res1          Res4           Res7
    10 units       Res2          Res5           Res8
    100 units      Res3          Res6           Res9

Example: if the best result is Res3, then (units=100, lambda=0.1) is the winner.
AUTOMATIZE IT!!! Parallelization is easy (the trials are independent).
Alternatives: Randomized Parameter Optimization (*) → see later. And many others (also in current ML libraries) → see later.

Grid search - 2
The cost of the search can be high: the Cartesian product of the sets of values for each hyper-parameter, i.e. #(values in the range)^#(hyper-parameters):
- e.g. above, 3×3 = 3^2 = 9;
- what about 5-6 hyper-parameters, each with 10 possible different values?
It can be useful to fix some hyper-parameter values in a preliminary experimental phase (since they turn out not to be relevant) [but take care, see the FAQ later on the sequential selection error].
Two (or more) levels of nested grid search are useful:
- apply a first coarse grid search to the hyper-parameters, with a table on the combinations of all the possible values (e.g. growing exponentially), to find good intervals (regions);
- then finer grid searches can be performed (over smaller and smaller intervals and with selected hyper-parameters), still using all the significant hyper-parameters.

Alternatives to grid search
Grid search is very useful to observe the behavior of the model as the hyper-parameter values change (to explore the hyper-parameter space → very important for your understanding)
- but it is computationally demanding for large spaces and does not scale well as the number of hyper-parameters grows.
Alternatives exist, trying:
- to reduce the cost: e.g. random search, which avoids repeatedly sampling the same values of an influential hyper-parameter (when some of the other hyper-parameters are not influential), and allows fixing the budget of trials independently of the number of hyper-parameters. See J. Bergstra, Y. Bengio (2012). "Random Search for Hyper-Parameter Optimization". J. Machine Learning Research 13: 281-305.
- to automatize the search: e.g. Bayesian approaches, evolutionary approaches, configuration-evaluation methods (e.g. Hyperband*), et al.
Current libraries, such as the Keras tuner for NN and Scikit-Learn, include various alternatives (toward AutoML) → anyway, don't try them uncritically; compare with the grid search, which teaches you more in a first approach to ML.

Grid versus Random search
[Figure: grid search for 2 hyper-parameters evaluates and compares a total of 100 (10×10) combinations, while random search evaluates 100 different random choices. Blue contours indicate regions with strong results, red ones regions with poor results; the green bars show that random search considers more individual values for each hyper-parameter than grid search.]

Hold out - rethinking
If the data set size is sufficient: e.g. 50% TR, 25% VL, 25% TS (Training, Validation, Test; D_TR+VL and D \ D_TR+VL).
How much data is enough?
- It depends on the model complexity and on the signal-to-noise ratio of the underlying function (e.g. few data suffice for a linear model, a linearly separable task, no noise).
How do we deal with data that are insufficient to make 3 significant/large splits? Can we avoid being sensitive to the particular partitioning of the examples?
K-fold CV can help!
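A minimal sketch of the trivial grid-search example above, assuming scikit-learn is available; MLPRegressor is used so that the two hyper-parameters are the number of hidden units and the L2 penalty (its alpha parameter plays the role of lambda), and the toy data, grid values and hold-out validation split are illustrative.

    import itertools
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    # Toy regression data, only to make the sketch runnable.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 4))
    y = X[:, 0] - 2 * X[:, 1] ** 2 + 0.1 * rng.normal(size=300)

    X_tr, X_vl, y_tr, y_vl = train_test_split(X, y, test_size=0.25, random_state=0)

    # Grid: Cartesian product of the candidate values (3 x 3 = 9 trials).
    units_grid = [1, 10, 100]
    lambda_grid = [0.1, 0.01, 0.001]

    results = {}
    for units, lam in itertools.product(units_grid, lambda_grid):
        model = MLPRegressor(hidden_layer_sizes=(units,), alpha=lam,
                             max_iter=2000, random_state=0)
        model.fit(X_tr, y_tr)                                              # train on TR
        results[(units, lam)] = mean_squared_error(y_vl, model.predict(X_vl))  # score on VL

    best = min(results, key=results.get)   # model selection on the validation set
    print("winner (units, lambda):", best)

Since the trials are independent, they can be run in parallel (scikit-learn's GridSearchCV, for instance, accepts an n_jobs argument for this purpose).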
Notes and Terminologies
Generically, we have used the term "cross-validation" for both:
- an approach to model selection based on direct estimation of the prediction error by resampling (instead of analytically): choose the hyper-parameters by estimating errors;
- more specifically, implementations of estimation methods with resampling, specifying how to divide the data for both model selection and assessment: how to partition the data set, e.g. by K-fold CV.

K-fold CV recall
K-fold Cross-Validation:
- split the data set D into K mutually exclusive subsets D1, D2, …, DK;
- train the learning algorithm on D \ Di and test it on Di (for i = 1, …, K);
- it uses all the data for training and testing;
- you don't get a unique model (see the next lecture), but you also get a variance (standard deviation) over the different folds for your estimate;
- it can be used both for the validation set and for the test set.
[Schema: D1 D2 D3 D4; Fold-1 → result on D1, Fold-2 → result on D2, Fold-3 → result on D3, Fold-4 → result on D4; total mean +/- SD.]
Issues:
- how many folds? 3-fold, 5-fold, 10-fold, …, leave-one-out;
- often computationally very expensive;
- combinable with a validation set, double K-fold CV, ….

An example of model selection and assessment with K-fold CV
- Split the data into a TR set and a Test set (here by simple hold-out, or by a K-fold CV).
- [Model selection] Use an (internal) K-fold CV over the TR set, obtaining new TR and VL sets in each fold, to find the best hyper-parameters of your model (e.g. polynomial order, λ of ridge regression, #units in a NN, …): a grid search over many possible values of the hyper-parameters
  - i.e., for example, a K-fold CV for (λ=0.1, #units=20, …), a K-fold CV for (λ=0.01, #units=20, …), i.e. a K-fold CV for each cell of the grid matrix; then take the best configuration (λ, #units, …), comparing the mean error computed over the validation sets obtained from all the folds of each K-fold CV (i.e. the result on the "diagonal").
- Train the final model on the whole (original) TR set.
- [Model assessment] Evaluate it on the external Test set.
  - Question: can we retrain the model again with all the data?
We will see more combinations in the next lecture! (A code sketch of this scheme is given after the next slides.)

Before…
Before entering into general methods for model selection/assessment, we briefly consider some particular cases.

Lucky/Unlucky sampling
Can we avoid being sensitive to the particular partitioning of the examples (the bias of the particular sample)?
Stratification procedure:
- stratification is the process of grouping members of the population into relatively homogeneous subgroups before sampling;
- classification: for each (random) partition (TR and TS in hold-out/CV), each class is represented in approximately the same proportions as in the full data set;
- fold composition: check whether the order of the data is related to some specific meaning.
Repeated hold-out method or CV: repeat the splitting with different random sampling
- e.g. repeat the CV 10 times!
- average the results to yield an overall estimation;
- useful especially for comparing different models!
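A minimal sketch of the scheme from the slide "An example of model selection and assessment with K-fold CV" above, assuming scikit-learn is available; ridge regression is used so that λ is the only hyper-parameter, and the toy data, the candidate values and K=5 are illustrative.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold, train_test_split
    from sklearn.metrics import mean_squared_error

    # Toy data, only to make the sketch self-contained.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)

    # External split: TR (development) vs TS (kept aside for assessment only).
    X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.25, random_state=0)

    # [Model selection] internal 5-fold CV over TR for each candidate lambda.
    lambdas = [10.0, 1.0, 0.1, 0.01]
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    mean_vl_error = {}
    for lam in lambdas:
        fold_errors = []
        for tr_idx, vl_idx in kf.split(X_tr):
            model = Ridge(alpha=lam).fit(X_tr[tr_idx], y_tr[tr_idx])
            fold_errors.append(mean_squared_error(y_tr[vl_idx], model.predict(X_tr[vl_idx])))
        mean_vl_error[lam] = np.mean(fold_errors)

    best_lambda = min(mean_vl_error, key=mean_vl_error.get)

    # Retrain the final model on the whole (original) TR set with the chosen lambda.
    final_model = Ridge(alpha=best_lambda).fit(X_tr, y_tr)

    # [Model assessment] evaluate once on the external test set.
    print("best lambda:", best_lambda,
          "- test MSE:", mean_squared_error(y_ts, final_model.predict(X_ts)))

Note that the test error is computed only once, after the selection is finished; reporting the best internal validation error as the risk estimate would repeat the mistake discussed in the counterexample.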
Very few data (informal)
With very few data it is difficult to say whether a sample is representative or not.
Stratification. Avoid (or take into account in the evaluation):
- missing classes or features in the training data;
- special classes of data not sampled in TR or TS (again, stratification);
- prior known outliers, which can affect the mean test results;
- on the opposite side: selection of only the "easy" cases.
Also, a blind test set can be misleading if it is:
- from a different distribution;
- measured with a different scale, different tolerance, etc.;
- uncleaned, unpreprocessed, …;
- extrapolation (out-of-range data).

Error Functions for Evaluation
Measuring the error (addenda to the slides in the intro).
Classification (see the first lessons):
- accuracy (correct classification rate / error rate, often in %), confusion matrix (specificity, sensitivity, ROC curve).
Regression: residual r_i = (y_i - o_i), where y is the target and o the output:
- MS error (mean_i[r_i^2]) → RMS (root mean square), or S;
- mean absolute error (mean of the absolute residuals |r_i|);
- maximum absolute error (max_i |r_i|);
- R (correlation coefficient/index) (R^2: coefficient of determination), e.g. R = sqrt(1 - S^2/S_y^2), where S_y^2 = mean_i[(y_i - mean_i[y_i])^2] (the variance of y); it measures the degree of linear dependence between the variables y and o, in [0, 1], where 1 is the best;
- output versus target plots;
- various statistical tests and statistical significance analyses.

Other approaches
Bootstrap: random resampling with replacement, repeated to form different subsets (VL or TS).

REPETITA: 2 gold rules
The TR result is not a good estimation of the VL and TS results. The VL result is not a good estimation of the TS result. Hence:
- do not use validation set results for the test estimation [risk estimation/model assessment];
- do not use test set results for model selection (in any form*!!!) (otherwise call it a validation set).
Or even 1 rule: keep separation between goals and use separate sets in each phase.

In the following
Correct use of cross-validation methods for model selection and/or estimation of the risk (assessment).
Simple concepts with a rich formal notation can help to avoid ambiguities: a very useful guide. It is not intended as a recipe book, but it helps you to rationalize and to understand, in a systematic form, the issues of a rigorous approach. Then you have to be reasonable (when you are aware). In any case, use these schemas if you have uncertainties: it is better than using fantasy!
Exercise (in the slides of the next lecture): relate the wrong cases 1 and 2 of the slide "Counterexample" to misuses of such schemas!

Bibliography
Hastie, Tibshirani, Friedman: Chapter 7
Duda, Hart, Stork: 9.4, 9.6
Cherkassky, Mulier: 3.4.2
Haykin: e.g. 4.14 (2nd edition), 4.13 (3rd edition)
Mitchell: e.g. 4.6.5, 5.1, 5.6
Note that the space devoted to this topic in such books is often small with respect to its relevance!!!

For fun
Can I have just a look at the test set? See https://youtu.be/XvOsh15hLIs (*)
(*) It can hold also for the validation set used as a test set ;-)
By Sepe-Dukic, past students.
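As a small appendix to the "Error Functions for Evaluation" slide above, a minimal sketch of the regression measures, assuming NumPy; the target/output values are illustrative, and R is computed with the sqrt(1 - S^2/S_y^2) definition given in that slide.

    import numpy as np

    # Illustrative targets y and model outputs o.
    y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    o = np.array([1.1, 1.9, 3.3, 3.7, 5.2])

    r = y - o                               # residuals
    mse = np.mean(r ** 2)                   # MS error
    rmse = np.sqrt(mse)                     # RMS (S)
    mae = np.mean(np.abs(r))                # mean absolute error
    max_ae = np.max(np.abs(r))              # maximum absolute error
    var_y = np.mean((y - y.mean()) ** 2)    # S_y^2, variance of the targets
    R = np.sqrt(1.0 - mse / var_y)          # R as defined in the slide above

    print(mse, rmse, mae, max_ae, R)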
