Full Transcript


# Machine Learning: Deep Neural Networks

## Overfitting

### Generalization

* Ultimate goal: good performance on new, previously unseen data
* **Generalization**: the ability to perform well on unseen data
* The problem
  * We want to minimize the expected loss on new data
  * We only have the training data
  * We minimize the empirical loss on the training data
  * This is **empirical risk minimization**
* How well does this work?

### The i.i.d. assumption

* **i.i.d.**: independent and identically distributed
* Data is sampled i.i.d. from some probability distribution $P(X,Y)$
* Training data: $\{(x_1, y_1), \ldots, (x_N, y_N)\}$
* Test data: $\{(x'_1, y'_1), \ldots, (x'_M, y'_M)\}$
* Both training and test data are sampled from $P(X,Y)$, independently

### Sources of error

* Assume we are trying to approximate some function $f^*$
* **Approximation error**: our model class cannot represent $f^*$
  * e.g., linear regression used to model quadratic data
* **Estimation error**: our algorithm picks the wrong function from the model class
  * e.g., overfitting the training data
* Deep learning excels at reducing approximation error!
* But overfitting (estimation error) is still a problem

### Overfitting

* **Overfitting**: achieving low training error but poor generalization
  * The model learns the training data "too well"
  * The model learns the noise in the training data
* Overfitting is more likely with
  * Complex models
  * Small datasets
* Model capacity
  * The ability of a model to fit a wide range of functions
  * Models with high capacity can easily overfit

### Underfitting

* **Underfitting**: failing to achieve low error on the training set
  * The model is too simple to fit the data well
* Underfitting is more likely with
  * Simple models
  * Large datasets

### Capacity

* The ideal model has enough capacity to
  * fit the true function well,
  * but not so much that it fits the noise

### Example: Polynomial Regression

* $y = f(x) + \epsilon$
  * $f(x) = \cos(x)$
  * $\epsilon \sim \mathcal{N}(0, 0.1)$
* Fit a polynomial of degree $M$ to the data
  * $\hat{y} = \sum_{i=0}^{M} w_i x^i$
* As $M$ increases, the model capacity increases

### Polynomial Regression: M = 0

The figure shows a scatter plot of the data with a horizontal line fitted to it (a polynomial of degree 0). The model underfits: it is too simple to capture the underlying pattern.

### Polynomial Regression: M = 1

The figure shows the same data with a straight line fitted to it (a polynomial of degree 1). The model still underfits, as it cannot capture the curvature in the data.

### Polynomial Regression: M = 3

The figure shows the data with a cubic curve fitted to it (a polynomial of degree 3). The model fits the data reasonably well, capturing its general shape without chasing the noise.

### Polynomial Regression: M = 9

The figure shows the data with a wiggly curve fitted to it (a polynomial of degree 9). The model overfits, trying to pass through every data point, including the noise.
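The degree-vs-overfitting behavior above is easy to reproduce. A minimal NumPy sketch (the seed, sample size, and rescaled input range are illustrative choices, not from the slides):

```python
import numpy as np

# Synthetic data in the spirit of the example: a cosine plus Gaussian
# noise. The input is rescaled to [0, 1] (with f(x) = cos(2*pi*x)) so
# that high-degree polynomial fits stay well conditioned.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=15)
y = np.cos(2 * np.pi * x) + rng.normal(0.0, 0.1, size=15)

def train_mse(M):
    """Fit a degree-M polynomial by least squares; return training error."""
    coeffs = np.polyfit(x, y, deg=M)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

errors = {M: train_mse(M) for M in (0, 1, 3, 9)}
# Training error can only shrink as capacity grows: each lower-degree
# model is a special case of the higher-degree one. The degree-9 fit
# drives the training error nearly to zero by fitting the noise too --
# low training error here says nothing about error on fresh samples.
```

Note that only the training error is measured here; diagnosing the overfitting at $M = 9$ would require evaluating the same fits on a held-out set drawn from the same distribution.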
### How to Avoid Overfitting

* Get more data!
  * The best way to improve generalization
* Regularization
  * Add a penalty to the loss function to discourage complex models
* Early stopping
  * Stop training when the validation error starts to increase
* Dropout
  * Randomly drop out neurons during training
  * Forces the network to learn more robust features
* Data augmentation
  * Create new training data by transforming the existing data
  * e.g., rotating, scaling, cropping images

### Regularization

* Add a penalty to the loss function to discourage complex models
* L2 regularization
  * Also known as weight decay
  * Adds a penalty proportional to the square of the weights
  * $\mathcal{L} = \mathcal{L}_0 + \lambda \sum_i w_i^2$
* L1 regularization
  * Adds a penalty proportional to the absolute value of the weights
  * $\mathcal{L} = \mathcal{L}_0 + \lambda \sum_i |w_i|$
* $\lambda$ is a hyperparameter that controls the strength of the regularization

### Early Stopping

* Monitor the performance on a validation set during training
* Stop training when the validation error starts to increase
* Why does this work?
  * As training progresses, the model may start to overfit the training data
  * The validation error starts to increase before the training error does
  * Stopping early prevents the model from overfitting

### Dropout

* Randomly drop out neurons during training
  * Each neuron is dropped with probability $p$
* Forces the network to learn more robust features
  * Prevents neurons from co-adapting to each other
* Dropout can be interpreted as training an ensemble of models

### Data Augmentation

* Create new training data by transforming the existing data
  * e.g., rotating, scaling, cropping images
* Helps the model generalize to new data
* Can be used to increase the size of the training set

### Data Augmentation Examples

The figure shows five variants of a cat photo:

1. Original image
2. Flipped horizontally
3. Cropped and zoomed
4. Rotated slightly
5. Adjusted brightness

### Data Augmentation Best Practices

* Do not use transformations that would change the label
  * e.g., flipping digits
* Use transformations that are relevant to the task
  * e.g., rotating images of objects, but not text

## Hyperparameter Optimization

### Hyperparameters

* Parameters that are not learned during training
* Examples:
  * Learning rate
  * Regularization strength
  * Network architecture
  * Batch size
* Hyperparameters have a big impact on performance
* How to choose them?

### Methods for Choosing Hyperparameters

* Manual tuning
* Grid search
* Random search
* Bayesian optimization

### Manual Tuning

* Train the model with different hyperparameter values
* Evaluate the performance on a validation set
* Adjust the hyperparameters based on the results
* Repeat until satisfied
* Tedious and time-consuming

### Grid Search

* Define a grid of hyperparameter values
* Train the model with every combination of hyperparameter values
* Evaluate the performance on a validation set
* Choose the hyperparameter values that give the best performance
* Computationally expensive

### Random Search

* Define a distribution over hyperparameter values
* Sample hyperparameter values from the distribution
* Train the model with the sampled values and evaluate on a validation set
* Choose the hyperparameter values that give the best performance
* More efficient than grid search

### Bayesian Optimization

* Use a probabilistic model to estimate the performance of different hyperparameter values
* Use the model to choose the next hyperparameter values to try
* More efficient than grid search and random search
* Requires more sophisticated tools

### Bayesian Optimization: Gaussian Process

The figure plots performance as a function of a single hyperparameter. It shows:

* Data points with error bars representing the uncertainty in the performance estimates
* A Gaussian process fitted to those points, giving a predicted performance and uncertainty for untried hyperparameter values
* A vertical line marking the next hyperparameter value to try, chosen from the Gaussian process model

### Rules of Thumb

* Start with a random search
* Use a validation set to evaluate performance
* Use a log scale for hyperparameters that vary over several orders of magnitude
* Visualize the results
* Be patient
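The first and third rules of thumb can be combined in a few lines. The sketch below does a random search with log-uniform sampling; the `evaluate` function is a made-up stand-in for training a model and reading off its validation error, and the search ranges are illustrative:

```python
import math
import random

random.seed(0)

def evaluate(lr, reg):
    # Hypothetical smooth validation-error surface with its optimum
    # near lr = 1e-2 and reg = 1e-3. A real run would train the model
    # with these hyperparameters and measure validation error instead.
    return (math.log10(lr) + 2) ** 2 + (math.log10(reg) + 3) ** 2

best = None
for _ in range(100):
    # Sample on a log scale: both hyperparameters span several orders
    # of magnitude, so log-uniform sampling covers them evenly.
    lr = 10 ** random.uniform(-5, 0)
    reg = 10 ** random.uniform(-6, 0)
    err = evaluate(lr, reg)
    if best is None or err < best[0]:
        best = (err, lr, reg)

best_err, best_lr, best_reg = best
```

Compared with a grid over the same budget, random search rarely repeats a value along any single axis, so each hyperparameter is probed at many more distinct settings for the same number of training runs.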