Bias and Variance in Machine Learning
Document Details

Uploaded by MarvellousSerpent3406
University of Washington
2018
Summary
This document provides an explanation of bias and variance in machine learning models. It discusses underfitting and overfitting, and the effect of model complexity on bias and variance. It also provides strategies for dealing with large bias and variance.
Full Transcript
Where does the error come from? Slides from http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML20.html and CSE/STAT 416 - Intro to Machine Learning - Spring 2018 (washington.edu).

Review. The average error on testing data decomposes into error due to "bias" and error due to "variance". A more complex model does not always lead to better performance on testing data.

Three sources of error. In forming predictions, there are three sources of error: noise, bias, and variance. (Running example: predicting a house's price in $ from its size in sq. ft.)

Noise. Data are inherently noisy: yi = fw(true)(xi) + εi, where εi has some variance. This irreducible error is an aspect of the data itself; no model can ever beat it.

Bias contribution. Assume we fit a constant function. Two different training sets of N house sales give two different fits, fŵ(train1) and fŵ(train2). Over all possible size-N training sets, what do we expect the fit to be? Averaging the fits fŵ(train1), fŵ(train2), fŵ(train3), ... gives the average fit fw̄. Bias asks: is our approach flexible enough to capture fw(true)? If not, there is error in the predictions: Bias(x) = fw(true)(x) − fw̄(x). Low complexity leads to high bias, and bias is inherent to the model, not to any particular training set.

Bias of a function estimator. Define the average estimated function fw̄(x) = Etrain[fŵ(train)(x)], the expectation over all training sets of size N. With true function fw(x), the bias at a point xt is bias(fŵ(xt)) = fw(xt) − fw̄(xt): the difference between the average prediction (over different realizations of the training data) and the true underlying function.

Variance contribution. How much do the specific fits fŵ(train1), fŵ(train2), fŵ(train3) vary from the expected fit fw̄?
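The average-fit construction above can be simulated directly. A minimal sketch in NumPy, assuming an illustrative true function f(x) = 1 + 2x² and noise level (neither comes from the slides): fit a constant model on many independent size-N training sets, average the fits, and measure the bias at one query point.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    # Assumed true function, standing in for the price-vs-sqft curve
    return 1.0 + 2.0 * x ** 2

N, n_train_sets, sigma = 20, 2000, 0.1
x_t = 0.9  # query point x_t

# Each training set yields a constant fit: the mean of its y values
preds = np.empty(n_train_sets)
for t in range(n_train_sets):
    x = rng.uniform(0.0, 1.0, N)
    y = f_true(x) + rng.normal(0.0, sigma, N)
    preds[t] = y.mean()  # least-squares constant model

f_bar = preds.mean()        # average fit fw̄(x_t)
bias = f_true(x_t) - f_bar  # bias(fŵ(x_t)) = fw(x_t) − fw̄(x_t)
print(f_bar, bias)
```

Because the constant model ignores x entirely, f_bar converges to the average of f_true over the inputs (here 1 + 2/3), so the bias at x_t = 0.9 stays large no matter how many training sets are averaged: the model class simply cannot capture fw(true).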
Variance of high-complexity models. Assume we fit a high-order polynomial. Two training sets now give wildly different fits fŵ(train1) and fŵ(train2); across fits fŵ(train1), fŵ(train2), fŵ(train3), the average fw̄ is comparatively smooth, but the individual fits scatter widely around it. High complexity leads to high variance.

Variance of a function estimator. Variance is the expected squared deviation of fŵ from its expected value fw̄, over different realizations of the training data.

Bias of high-complexity models. High complexity leads to low bias: the average fit fw̄ lies close to fw(true).

Case study: estimator. E[f*] = f̄. The gap between f̄ and the true function f is the bias; the scatter of the individual f* around f̄ is the variance. In the Pokémon example, y = f(x), and only Niantic knows the true f. From training data, we find f*, which is an estimator of f.

Parallel universes. In every universe, we collect (catch) 10 Pokémon as training data to find f*. In different universes we use the same model but obtain a different f*: Universe 123 and Universe 345 both fit y = b + w ∙ xcp yet end up with different parameters. Plotting f* from 100 universes for three models — y = b + w ∙ xcp; y = b + w1 ∙ xcp + w2 ∙ (xcp)^2 + w3 ∙ (xcp)^3; and y = b + w1 ∙ xcp + w2 ∙ (xcp)^2 + w3 ∙ (xcp)^3 + w4 ∙ (xcp)^4 + w5 ∙ (xcp)^5 — shows the scatter growing with model complexity.

Bias. If we average all the f*, is the average close to f?
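The 100-universes experiment can be reproduced numerically. A sketch under assumed settings (a sine true function, 10 samples per universe, Gaussian noise — the real CP curve and data are Niantic's): fit a simple and a complex polynomial in many universes and compare how much the resulting f* scatter.

```python
import numpy as np

rng = np.random.default_rng(1)

def f_true(x):
    # Assumed stand-in for the true function only Niantic knows
    return np.sin(2 * np.pi * x)

def universe_fits(degree, n_universes=500, n=10, sigma=0.2):
    """Fit a degree-`degree` polynomial in each universe; return the
    predictions of every f* on a shared grid of query points."""
    grid = np.linspace(0.0, 1.0, 21)
    preds = np.empty((n_universes, grid.size))
    for u in range(n_universes):
        x = rng.uniform(0.0, 1.0, n)
        y = f_true(x) + rng.normal(0.0, sigma, n)
        coeffs = np.polyfit(x, y, degree)
        preds[u] = np.polyval(coeffs, grid)
    return preds

simple = universe_fits(degree=1)    # y = b + w·x
complex_ = universe_fits(degree=5)  # quintic model

# Variance(f*) = E[(f* − E[f*])²], here averaged over the query grid
var_simple = simple.var(axis=0).mean()
var_complex = complex_.var(axis=0).mean()
print(var_simple, var_complex)
```

The quintic's fits disagree far more from universe to universe than the linear model's, matching the slide's point that the simpler model is less influenced by the sampled data.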
E[f*] = f̄ and Bias(f*) = E[f*] − f.

Large bias vs. small bias. In the figures, the black curve is the true function f, the red curves are 5000 different f*, and the blue curve is their average f̄. For the simple model y = b + w ∙ xcp, f̄ sits far from f: large bias. For the quintic model y = b + w1 ∙ xcp + w2 ∙ (xcp)^2 + w3 ∙ (xcp)^3 + w4 ∙ (xcp)^4 + w5 ∙ (xcp)^5, f̄ tracks f closely: small bias. Moving from the simple model to the complex one shrinks the bias.

Variance(f*) = E[(f* − E[f*])^2]. The simple model y = b + w ∙ xcp has small variance, while the quintic has large variance: a simpler model is less influenced by the sampled data.

Bias and variance of an estimator.
Bias: the difference between the estimator's expected value and the true value of the parameter being estimated: Bias(f*, f) = Bias(f*) = E[f*] − f = E[f* − f].
Variance: how far, on average, the collection of estimates is from the expected value of the estimates: Variance(f*, f) = Variance(f*) = E[(f* − E[f*])^2].

Bias vs. variance. As model complexity grows, error from bias falls while error from variance rises; the error observed on test data is their sum, so it is U-shaped in complexity. Underfitting means large bias and small variance; overfitting means small bias and large variance.

Test/train error vs. amount of data. For a fixed model complexity, training error rises and test error falls as the number of training points grows, and the two curves approach each other.

What to do with large bias? Diagnosis: if your model cannot even fit the training examples, you have large bias (underfitting). If you can fit the training data but see large error on testing data, you probably have large variance (overfitting). For large bias, redesign your model: add more features as input, or use a more complex model.

What to do with large variance? More data is very effective, but not always practical (compare the scatter of fits from 10 examples with the scatter from 100 examples). Regularization also helps, though it may increase bias.

Summary. Bias: the class of models can't fit the data. Variance: the class of models could fit the data, but the fit obtained from any one training set varies too much with the sampled data.
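The claim that regularization cuts variance (while possibly adding bias) can be checked with a small experiment. A sketch, not any library's canonical recipe: ridge (L2) regularization on degree-9 polynomial features, solved in closed form; the true function, degree, and penalty values are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def f_true(x):
    return np.sin(2 * np.pi * x)  # assumed true function

def features(x, degree=9):
    # Polynomial feature matrix [1, x, x^2, ...]
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(x, y, lam):
    # Closed-form ridge: w = (X^T X + lam*I)^(-1) X^T y
    X = features(x)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def prediction_variance(lam, n_sets=300, n=15, sigma=0.3):
    """Variance of the prediction at x = 0.5 across training sets."""
    phi = features(np.array([0.5]))
    preds = np.empty(n_sets)
    for t in range(n_sets):
        x = rng.uniform(0.0, 1.0, n)
        y = f_true(x) + rng.normal(0.0, sigma, n)
        preds[t] = (phi @ ridge_fit(x, y, lam))[0]
    return preds.var()

var_unreg = prediction_variance(lam=1e-6)  # essentially unregularized
var_reg = prediction_variance(lam=1.0)     # strongly regularized
print(var_unreg, var_reg)
```

Increasing the penalty shrinks the weights, so fits from different training sets disagree less (lower variance), at the cost of pulling the average fit away from the true function (possibly higher bias).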
Summary: cat classification. Four scenarios, diagnosed with the rules above:

Training error:    1%    15%    15%    0.5%
Validation error:  11%   16%    30%    1%

Reading left to right: large variance (fits training but fails validation); large bias (cannot even fit training); large bias and large variance; small bias and small variance.

Bias and variance of an estimator: expected prediction error. Over datasets D,

E_D[(y − f*(x))^2] = E_D[(f(x) + ε − f*(x))^2],

where f is the true function, f* is learned from the data, and the noise ε ~ N(0, σ^2) arises, e.g., from measurement errors. The experiment behind this expectation:
1. Draw a size-n sample D = {(x1, y1), ..., (xn, yn)}.
2. Train a linear regressor f* using D.
3. Draw a test example (x, f(x) + ε).
4. Measure the squared error of f* on that example x.

This decomposes as EPE(x) = noise^2 + bias^2 + variance: the unavoidable error, the error due to incorrect model assumptions, and the error due to variance of the training samples.

Bias–variance decomposition: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
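The decomposition EPE(x) = noise^2 + bias^2 + variance can be verified numerically at a fixed test input by running the four-step experiment above many times. A sketch with assumed settings (a quadratic true function, so the linear regressor has nonzero bias; σ = 0.5):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.5  # noise standard deviation (assumed)

def f(x):
    return x ** 2  # assumed true function

n, n_datasets = 30, 4000
x_test = 0.9  # fixed test input x

preds = np.empty(n_datasets)   # f*(x_test) from each dataset
sq_err = np.empty(n_datasets)  # (y − f*(x))^2 from each trial
for d in range(n_datasets):
    # 1. draw a size-n sample D
    x = rng.uniform(0.0, 1.0, n)
    y = f(x) + rng.normal(0.0, sigma, n)
    # 2. train a linear regressor f* on D
    a, b = np.polyfit(x, y, 1)
    preds[d] = a * x_test + b
    # 3.-4. draw a test example at x_test and measure squared error
    y_test = f(x_test) + rng.normal(0.0, sigma)
    sq_err[d] = (y_test - preds[d]) ** 2

epe = sq_err.mean()                      # empirical E_D[(y − f*(x))^2]
noise2 = sigma ** 2                      # unavoidable error
bias2 = (f(x_test) - preds.mean()) ** 2  # error from incorrect assumptions
variance = preds.var()                   # error from training-sample variance
print(epe, noise2 + bias2 + variance)    # the two should roughly agree
```

With enough simulated datasets, the empirical expected prediction error matches the sum of the three terms, with the noise term dominating here because the linear fit is only mildly biased at x = 0.9.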