Lecture 4: Optimizing Predictors & Neural Networks

Lecture 4: Optimizing predictors & neural networks
Joaquim Gromicho

Learning objectives for today
Become able to compute the accuracy measures for classification, clustering and regression.
Understand that, given a data set, finding the classification tree, clustering or linear regression that maximizes accuracy (i.e. minimizes errors) on that set becomes an optimization problem!
Become able to compute classification trees, clusters and linear regression minimizing errors on the training set, understanding which are approximations and which are optimal.
Understand that overdoing the importance of accuracy may lead to overfitting, which may affect the predictive value of the model.
Get an idea of how these learning methods differ from neural networks.

Classification trees
Now into the details: how does a tree select a split variable and value?

This case is simple
One boundary perfectly splits the observations, giving pure nodes.

But what about this example? What is the best split line?

Entropy measures impurity
For a node in which a fraction $p$ of the observations belongs to one class:
$\text{Entropy}(S) = -p \log_2 p - (1 - p) \log_2 (1 - p)$
The node to split holds 20/20 observations of the two classes: Entropy = 1.
Split into 4/0 and 16/20: Entropy $= \frac{4}{40} \cdot 0 + \frac{36}{40} \cdot 0.99 = 0.89$
Split into 20/11 and 0/9: Entropy $= \frac{31}{40} \cdot 0.94 + \frac{9}{40} \cdot 0 = 0.73$
Split into 12/2 and 8/18: Entropy $= \frac{14}{40} \cdot 0.59 + \frac{26}{40} \cdot 0.89 = 0.79$
Written out for the first split:
$\frac{4}{40}\left(-\frac{4}{4}\log_2\frac{4}{4} - \frac{0}{4}\log_2\frac{0}{4}\right) + \frac{36}{40}\left(-\frac{16}{36}\log_2\frac{16}{36} - \frac{20}{36}\log_2\frac{20}{36}\right) = \frac{4}{40} \cdot 0 + \frac{36}{40} \cdot 0.9910760598382222 = 0.8919684538544$
The split with the lowest weighted entropy gives the largest information gain.

From last lecture’s notebook
We see entropies larger than 1!
The previous slide focuses on dichotomies: classifications into two options. That relates to the binary entropy function.
For more than two classes we have a general formula. The maximum is $\log_2 n$, with $n$ the number of classes. Note that $\log_2 2 = 1$!
Some additional computations were added to that notebook; those may help to understand.

Two popular criteria
Two popular criteria are Gini impurity and entropy.
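As a rough check of the entropy computations above, the following sketch recomputes the weighted entropy of the three candidate splits of the 20/20 node. The function names and the split labels A, B and C are hypothetical, chosen only for this illustration. The Gini impurity is included in its standard form, $1 - \sum_i p_i^2$, which the slides name but do not write out.

```python
import math


def entropy(counts):
    """Entropy (base 2) of a node, given the number of observations per class."""
    total = sum(counts)
    # Classes with a zero count are skipped: 0 * log2(0) is taken as 0.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini impurity of a node (standard definition, assumed here)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_entropy(children):
    """Average entropy of the child nodes, weighted by their sizes."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * entropy(child) for child in children)


root = [20, 20]
# Hypothetical labels for the three splits; counts are taken from the slide.
splits = {
    "A": [[4, 0], [16, 20]],
    "B": [[20, 11], [0, 9]],
    "C": [[12, 2], [8, 18]],
}

for name, children in splits.items():
    after = split_entropy(children)
    gain = entropy(root) - after  # information gain of this split
    print(f"split {name}: entropy {after:.2f}, information gain {gain:.2f}")

# With more than two classes the entropy can exceed 1; its maximum is log2(n).
print(entropy([10, 10, 10]), math.log2(3))
print("Gini impurity of the root:", gini(root))
```

A greedy method such as CART would pick, at this node, the split with the largest information gain and then repeat the same computation in each child node.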
See https://quantdare.com/decision-trees-gini-vs-entropy/ for an easy explanation of Gini impurity versus entropy.
Regardless of which criterion is used, CART is a greedy method: each split is taken as the best at the moment.

The Iris dataset from Fisher, 1936
Two measurements: sepal length and width.

CART: Breiman et al., 1984, 33000+ citations!
Our classification tree: CART
OCT (optimal classification tree) as a solution of a MIO (mixed-integer optimization) problem, same depth! Dimitris Bertsimas & Jack Dunn, 2017

MIO: OCT and OCT-H
OCT-H (OCT with hyperplane splits) as a solution to a MIO, depth 1! Cool, isn’t it? (from Bertsimas & Dunn)
A master thesis on computing OCT got a nice award.

Clustering
Back to clustering: what is SSE?
The sum of squared errors
$SSE = \sum_{k=1}^{K} \sum_{x \in C_k} \sum_{i=1}^{n} (x_i - m_i^k)^2$
is the sum of the squares of the distance from each point in a cluster to the center of that cluster, over all clusters. Here $m^k$ is the center of cluster $C_k$ and $n$ is the number of coordinates of each point.

[Figure: results from the k-means algorithm for our example, for k = 2, 3, 4, 6 and 91.]

What is the “best” number of clusters?
Select an acceptable number that is close to the best → at the elbow.
SSE = squared distance to the cluster center.
Take a closer look at the curve.

Question: algorithm results
What do you think explains the hiccups in this graph?

Regression
Linear regression
Residual sum of squares:
$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
where $\hat{y}_i = \beta_0 + \beta_1 x_i$ is the point on the line at $x_i$.

Linear regression
$\hat{y} = \beta_0 + \beta_1 x$
The values of $\beta_0$ and $\beta_1$ should be such that $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is minimized. Setting the gradient to zero yields:
$\beta_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}$
$\beta_0 = \bar{y} - \beta_1 \bar{x}$
with $\bar{x} = \sum x / n$ and $\bar{y} = \sum y / n$.

BUT what does all this looking back tell us about forecast accuracy?

Break

Overfitting (regression)
More data will reduce the risk of overfitting.

Overfitting (classification)

Older wisdom
William of Ockham (1285-1349): “One should not increase, beyond what is necessary, the number of entities required to explain anything.”
Create simple models, e.g. through dimensionality reduction.
But be aware: your model can also be too simple.

New notebook
This lecture is complemented by this notebook, which includes many examples of the effect of increasing the depth, hence the complexity, of a CART tree.

How can we avoid being too dependent on the specific historical data?
Randomization to the rescue.

Analytics for a Better World, Lecture 4

Bias is the difference between your model's expected predictions and the true values.
Variance refers to your algorithm's sensitivity to specific sets of training data.

Bias and Variance → Underfitting and Overfitting
[Figure: the historical data, a table of observations with columns $X_1, X_2, \ldots, X_n$ and $Y$, is split into a training set (Obs 1 to Obs n) and a test set (Obs n+1 and further). The model is built from the training set and produces predictions for Y on the test set.]
Estimate the quality of the model on the test set.
[Figure: error against model complexity, with curves for bias, variance and total error, regions labelled underfitting and overfitting, and the optimal balance between them.]

What have we learned about model accuracy?
Quality of fit does not measure forecast accuracy.
Complex models can be overfitting, which reduces forecast accuracy.
Simple models can be underfitting, also reducing forecast accuracy.
The best model finds a balance between errors through bias and variance.
An estimation of the accuracy of a model is obtained by running the model on a test set that was not used to create (train) the model.

Recall the perceptron
Rosenblatt (1958) proposed a machine for binary classifications.

The Iris dataset again
[Figure: Iris-setosa and Iris-versicolor plotted by petal length and petal width.]
[Figure: Iris-setosa and Iris-versicolor plotted by sepal length and sepal width.]

Extra!
You may find here the code that I created to illustrate the learning algorithm of a perceptron. Note that we do not expect you to create code like this during this course!
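The lecturer's own code is not reproduced in this transcript. As an independent, minimal sketch of the classic perceptron learning rule, the example below assumes labels coded as -1 and +1, a learning rate of 1, and a small made-up data set with two features per observation. The function names and the data are hypothetical and serve only as illustration.

```python
def train_perceptron(X, y, epochs=100, lr=1.0):
    """Perceptron learning rule for labels -1 / +1 (a sketch, not the course code)."""
    w = [0.0] * len(X[0])  # one weight per feature
    b = 0.0                # bias term
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            # A point is misclassified when the sign of the score disagrees with its label.
            if yi * score <= 0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
                mistakes += 1
        if mistakes == 0:  # every training point is on the correct side
            break
    return w, b


def predict(x, w, b):
    """Classify one observation as +1 or -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1


# Hypothetical, linearly separable toy data (two measurements per observation).
X = [[1.4, 0.2], [1.3, 0.2], [1.5, 0.3], [4.7, 1.4], [4.5, 1.5], [4.9, 1.5]]
y = [-1, -1, -1, 1, 1, 1]

w, b = train_perceptron(X, y)
print("weights:", w, "bias:", b)
print("predictions:", [predict(x, w, b) for x in X])
```

The rule only stops making updates when the two classes can be separated by a line; on data that is not linearly separable it would simply run for the full number of epochs.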
The multi-layer perceptron (MLP)
Not one, but many outputs.
Outputs of one ‘layer’ are inputs to the next.
Can represent (m)any function!
What to use as targets, and how to train?

What does a neural network look like?
[Figure: a network whose connections carry the weights, or parameters, with outputs $P(\boldsymbol{x} = a)$ and $P(\boldsymbol{x} \neq a)$.]
A neural network is a function with parameters, mapping input to output!

What is a neuron?
[Figure: inputs $x_1$, $x_2$, $x_3$ with weights $\theta_1$, $\theta_2$, $\theta_3$ feeding $f\left(\sum_i \theta_i x_i\right)$.]
Sum of inputs, multiplied by weights, passed through activation function $f$.
The weighted sum is a linear transformation of the inputs; $f$ is non-linear! E.g. the Rectified Linear Unit (ReLU).

To train our neural network…
We need to specify what we want to achieve: we specify the objective or loss function
$L(\theta) = \frac{1}{n} \sum_{n} L(\theta, x_n, \hat{y}_n)$
The loss is a function of the parameters, typically an average over a (fixed) dataset.

Did we learn?
How to compute the accuracy measures for classification, clustering and regression?
That, given a data set, finding the classification tree, clustering or linear regression that maximizes accuracy (i.e. minimizes errors) on that set becomes an optimization problem?
How to compute classification trees, clusters and linear regression minimizing errors on the training set, understanding which are approximations and which are optimal?
That overdoing the importance of accuracy may lead to overfitting, which may affect the predictive value of the model?
How these learning methods differ from neural networks?

What’s next?
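The neuron and the loss function described above can be sketched in a few lines. This is an illustration under stated assumptions, not the course's implementation: the per-example loss is taken to be the squared error (the slides leave it unspecified), and the weights and data are hypothetical.

```python
def relu(z):
    """Rectified Linear Unit: keeps positive values, maps negative values to 0."""
    return max(0.0, z)


def neuron(theta, x, f=relu):
    """One neuron: the weighted sum of the inputs passed through activation f."""
    return f(sum(t * xi for t, xi in zip(theta, x)))


def mean_loss(theta, data):
    """Average loss over a fixed data set of (inputs, target) pairs.

    The squared error per example is an assumption made for this sketch.
    """
    return sum((neuron(theta, x) - target) ** 2 for x, target in data) / len(data)


theta = [0.5, -1.0, 2.0]           # hypothetical weights theta_1, theta_2, theta_3
data = [([1.0, 2.0, 0.5], 1.0),    # hypothetical (inputs, target) pairs
        ([2.0, 0.0, 1.0], 2.0)]

for x, target in data:
    print("inputs", x, "output", neuron(theta, x), "target", target)
print("average loss:", mean_loss(theta, data))
```

Training a network then amounts to searching for the parameters that make this average loss as small as possible, which is again an optimization problem.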
