AOML_Slides_Unit-2.pdf
Algorithms and Optimizations In Machine Learning Unit 2: Generalization Bhaskarjyoti Das Department of Computer Science in AI & ML and Engineering Teaching Assistant: Sriraksha M S (7th Sem)
ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING Introduction to Machine Learning Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML
Algorithms and Optimizations in Machine Learning Intro to Machine Learning Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. An Example: Self-Driving Cars A self-driving car system uses dozens of components that include detection of cars, pedestrians, and other objects.
Algorithms and Optimizations in Machine Learning Intro to Machine Learning Three Approaches to Machine Learning: 1. Supervised Learning ⮚ First, we collect a dataset of labeled training examples. ⮚ We train a model to output accurate predictions on this dataset. ⮚ When the model sees new, similar data, it will also be accurate. Applications: Classifying medical images, Translating between pairs of languages, Detecting objects in a self-driving car.
Algorithms and Optimizations in Machine Learning Intro to Machine Learning 2. Unsupervised Learning Here, we have a dataset without labels. Our goal is to learn something interesting about the structure of the data: Clusters hidden in the dataset. Outliers: particularly unusual and/or interesting data points. Useful signal hidden in noise, e.g. human speech over a noisy phone line. Applications: Recommendation systems, Anomaly detection, Signal denoising
Algorithms and Optimizations in Machine Learning Intro to Machine Learning 3. Reinforcement Learning In reinforcement learning, an agent interacts with the world over time. We teach it good behavior by providing it with rewards. Applications: Creating agents that play games such as Chess.
ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING Key Components of ML Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML
Algorithms and Optimizations in Machine Learning Key Components of ML These are some of the core components that will follow us around, no matter what kind of machine learning problem we tackle: 1. The data that we can learn from. 2. A model of how to transform the data. 3. An objective function that quantifies how well (or badly) the model is doing. 4. An algorithm to adjust the model's parameters to optimize the objective function.
Algorithms and Optimizations in Machine Learning Data Key Properties of Datasets in Data Science 1. Data Basics: o Examples/Samples: Each data point. o Features (sometimes called covariates): Attributes used for prediction. o Label/Target: Special attribute to be predicted. 2. Numerical Representation: o Image Data: Pixels as numerical values (e.g., RGB values). o Health Data: Attributes like age, vitals, medications. 3. Fixed-Length Vectors (when every example has the same number of numerical features): o Consistency: Same number of features for each example. o Dimensionality: Constant length of vectors.
Algorithms and Optimizations in Machine Learning Data 4. Challenges with Non-Fixed-Length Data: o Variable Image Sizes: Different resolutions/shapes. o Text Data: Varying lengths of reviews. 5. Importance of Data Quantity and Quality: o More data: Enables training powerful models. o Quality Data: Avoid "garbage in, garbage out". 6. Ethical Considerations: o Representation: Ensure diverse and inclusive datasets.
o Bias and Fairness: Avoid reinforcing societal prejudices.
Algorithms and Optimizations in Machine Learning Models Models in ML: - Definition: Computational systems that convert input data into predictions. - Focus on statistical models that can be trained from data. Models in Deep Learning: - Use powerful models with multiple layers of transformations. - Each layer processes the data successively, forming a "deep" structure. - Advantages: Can tackle complex problems that classical methods struggle with.
Algorithms and Optimizations in Machine Learning Objective Functions Objective Functions (also called loss functions): Measure how good (or bad) our models are. The lower, the better. Loss functions include squared error (for numerical prediction) and error rate (for classification). Surrogate objectives are used when direct optimization is difficult. Overfitting: - Model performs well on the training set but poorly on the test set. - Analogy: A student memorizes practice exams but fails on new questions in the final exam. Key Concept: Minimize loss on training data, but ensure generalization to new, unseen data to avoid overfitting.
Algorithms and Optimizations in Machine Learning Algorithms to Optimize Optimization: We need an algorithm capable of searching for the best possible parameters for minimizing the loss function. Approach: Use Gradient Descent to find optimal parameters, or its variants like Stochastic Gradient Descent, Adam, etc. Gradient Descent Steps: 1. Initialize Parameters 2. Compute Gradient: For each parameter, calculate how the loss changes with a small perturbation. 3. Update Parameters
Algorithms and Optimizations in Machine Learning Summary Components of a Supervised Machine Learning Problem A dataset consists of training examples, which are pairs of inputs and targets. Each input is a vector of features or attributes. A learning algorithm can be fully defined by a model class, objective and optimizer. The output of supervised learning is a predictive model that maps inputs to targets. For instance, it can predict targets on new inputs.
Algorithms and Optimizations in Machine Learning Summary Components of an Unsupervised Machine Learning Problem In unsupervised learning, we start with a dataset that does not contain labels. We seek to discover interesting and useful patterns in this data, such as: Clusters of related data points Outliers Denoised signals
Algorithms and Optimizations in Machine Learning References https://d2l.ai/chapter_introduction/index.html#key-components https://kuleshov-group.github.io/aml-book/contents/lecture2-supervised-learning.html?highlight=components https://kuleshov-group.github.io/aml-book/contents/lecture8-unsupervised-learning.html
ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING Generalization Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML
Algorithms and Optimizations in Machine Learning Generalization Generalization: The ability of a model trained on a specific dataset to perform well on unseen data. Importance: The ultimate goal of machine learning is to build models that generalize effectively to real-world scenarios beyond the training data. Challenges: Balancing overfitting and underfitting is a major challenge.
Algorithms and Optimizations in Machine Learning Overfitting When our training error is significantly lower than our validation error, we are overfitting. How can we deal with overfitting? If we're overfitting, we may reduce the complexity of our model by reducing the size of the model.
We may also modify our objective to penalize complex models that may overfit the data. Techniques for combating overfitting are often called regularization methods.
Algorithms and Optimizations in Machine Learning Overfitting Solution Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. The idea is to penalize complex models that may overfit the data. 1. L1 (or lasso) regularization 2. L2 (or ridge) regularization 3. Dropout 4. Early stopping
Algorithms and Optimizations in Machine Learning Underfitting When our model performs poorly on both the training and validation data, capturing neither the underlying patterns nor the noise, we are underfitting. How can we deal with underfitting? Increase model complexity: Use a more complex model or algorithm. Add features: Incorporate additional relevant features into the dataset. Feature engineering: Create new features from existing ones. Train longer: Increase the number of training epochs. Parameter tuning: Adjust the model's hyperparameters for better performance.
ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING Bias-Variance Tradeoff Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML
Algorithms and Optimizations in Machine Learning Bias Bias is the inability of the model to capture the true relationship, measured by the difference or error between the model's predicted value and the actual value. These differences between actual or expected values and the predicted values are known as error, bias error, or error due to bias. Bias is a systematic error that occurs due to wrong assumptions in the machine learning process.
Algorithms and Optimizations in Machine Learning Bias Let Y be the true value of a parameter, and let Y' be an estimator of Y, based on a sample of data. Then the bias of estimator Y' is given by Bias(Y') = E[Y'] − Y. Low Bias: A low bias value means fewer assumptions are taken to build the target function. In this case, the model will closely match the training dataset. High Bias: A high bias value means more assumptions are taken to build the target function. In this case, the model will not match the training dataset closely.
Algorithms and Optimizations in Machine Learning Variance Variance is the measure of spread in data from its mean position. In machine learning, variance is the amount by which the performance of a predictive model changes when it is trained on different subsets of the training data. Let Y be the actual values of the target variable and Y' be the predicted values of the target variable. Then, the variance of the model can be calculated as Variance(Y') = E[(Y' − E[Y'])²].
Algorithms and Optimizations in Machine Learning Variance Low Variance: The model is less sensitive to changes in the training data and can produce consistent estimates of the target function with different subsets of data from the same distribution. Combined with high bias, this is the case of underfitting, when the model fails to generalize on both training and test data. High Variance: The model is very sensitive to changes in the training data and can result in significant changes in the estimate of the target function when trained on different subsets of data from the same distribution. This is the case of overfitting, when the model performs well on the training data but poorly on new, unseen test data. It fits the training data so closely that it fails to generalize to new data.
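To make these definitions concrete, bias and variance can be estimated empirically by repeatedly refitting the same model class on different training subsets drawn from the same distribution and examining the predictions at a fixed test point. The following minimal sketch is illustrative only and is not from the slides; the data-generating function, noise level, and polynomial degrees are assumptions:

# Hedged sketch: empirically estimating bias and variance of polynomial models.
# The true function, noise level, and degrees below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)                      # assumed "true" target function

x_test = 0.3                                          # a fixed test point (a0 in the slides' notation)
n_repeats, n_train = 200, 30

for degree in (1, 9):                                 # a rigid model vs a flexible model
    preds = []
    for _ in range(n_repeats):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, 0.3, n_train)   # noisy training subset
        coeffs = np.polyfit(x, y, degree)             # fit the model on this subset
        preds.append(np.polyval(coeffs, x_test))      # prediction at the fixed test point
    preds = np.array(preds)
    bias = preds.mean() - true_f(x_test)              # Bias(a0) = E[â0] - a0
    variance = preds.var()                            # Variance(a0) = E[(â0 - E[â0])^2]
    print(f"degree={degree}: bias={bias:+.3f}, variance={variance:.3f}")

Typically the low-degree model shows larger bias and smaller variance, while the high-degree model shows the opposite, which is exactly the trade-off described in the combinations that follow.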
Algorithms and Optimizations in Machine Learning Bias – Variance Combinations High Bias, Low Variance: A model with high bias and low variance is said to be underfitting. High Variance, Low Bias: A model with high variance and low bias is said to be overfitting.
Algorithms and Optimizations in Machine Learning Bias – Variance Combinations High Bias, High Variance: The model is not able to capture the underlying patterns in the data (high bias) and is also too sensitive to changes in the training data (high variance). As a result, the model will produce inconsistent and inaccurate predictions on average.
Algorithms and Optimizations in Machine Learning Bias – Variance Combinations Low Bias, Low Variance: The model is able to capture the underlying patterns in the data (low bias) and is not too sensitive to changes in the training data (low variance). This is the ideal scenario for a machine learning model, as it is able to generalize well to new, unseen data and produce consistent and accurate predictions.
Algorithms and Optimizations in Machine Learning Identifying High Variance, High Bias High variance can be identified if the model has: Low training error and high test error. High bias can be identified if the model has: High training error, with test error almost similar to the training error.
Algorithms and Optimizations in Machine Learning Bias – Variance Tradeoff While building a machine learning model, it is important to take care of bias and variance in order to avoid overfitting and underfitting. If the model is very simple, with few parameters, it may have low variance and high bias. Whereas, if the model has a large number of parameters, it will tend to have high variance and low bias. So, it is required to strike a balance between bias and variance errors, and this balance between the bias error and variance error is known as the Bias-Variance Trade-off.
Algorithms and Optimizations in Machine Learning Bias – Variance Tradeoff For an accurate prediction of the model, algorithms need low variance and low bias. But this is rarely possible because bias and variance are related to each other: if we decrease the variance, it will increase the bias, and vice versa.
Algorithms and Optimizations in Machine Learning Bias – Variance Tradeoff High variance leads to overfitting, while high bias results in oversimplified models that fail to capture important patterns, i.e., underfitting. Hence, there is a need to strike a balance between bias and variance.
Algorithms and Optimizations in Machine Learning Bias – Variance Decomposition Hence, the bias-variance trade-off involves finding a balance between bias and variance errors to create an optimal model that accurately captures regularities in training data and generalizes well to unseen data. The technique by which we analyze the performance of the machine learning model in these terms is known as bias-variance decomposition.
Algorithms and Optimizations in Machine Learning Bias – Variance Decomposition Error due to Bias: The difference between the expected (or average) prediction of our model and the correct value which we are trying to predict. Bias(a0) = E[â0] − a0 Error due to Variance: The error due to variance is taken as the variability of a model prediction for a given data point. Variance(a0) = E[(â0 − E[â0])²]
Algorithms and Optimizations in Machine Learning Bias – Variance Decomposition The error of a model can be broken down into 3 parts: 1. Error due to bias 2. Error due to variance 3. Unexplainable/random (irreducible) error
Algorithms and Optimizations in Machine Learning Bias – Variance Decomposition
Algorithms and Optimizations in Machine Learning References https://www.geeksforgeeks.org/bias-vs-variance-in-machine-learning/ https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff https://www.youtube.com/watch?v=EuBBz3bI-aA https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote12.html
ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING Optimization Algorithms Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML
Algorithms and Optimizations in Machine Learning Optimization Algorithms Optimization algorithms can be broadly categorized based on whether they require the target function to be differentiable. This categorization leads to two primary types: differentiable and non-differentiable.
Algorithms and Optimizations in Machine Learning Optimization Algorithms 1. Differentiable Optimization Algorithms They require the target function to be continuous and differentiable. These algorithms leverage gradient information to find the minima or maxima of the target function efficiently. Common Algorithms: a) Gradient Descent (GD): Iteratively moves in the direction of the negative gradient to find the local minimum. Variants: 1. Batch Gradient Descent: Uses the entire dataset to compute the gradient. 2. Stochastic Gradient Descent (SGD): Uses a single data point to compute the gradient.
Algorithms and Optimizations in Machine Learning Optimization Algorithms b) Conjugate Gradient Method: An optimization algorithm that builds up conjugate directions, which leads to faster convergence compared to basic gradient descent. c) Newton's Method: Uses the second-order Taylor expansion of the function and involves the computation of the Hessian matrix. d) Adam (Adaptive Moment Estimation): Combines the advantages of two other extensions of stochastic gradient descent, the Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp).
Algorithms and Optimizations in Machine Learning Optimization Algorithms 2. Non-Differentiable Optimization Algorithms Non-differentiable optimization algorithms are used when the target function is not continuous or differentiable. These algorithms do not rely on gradient information and often use heuristic or combinatorial methods to find the optimum. Common Algorithms: a) Genetic Algorithms (GA): Inspired by natural selection, these algorithms use operations like selection, crossover, and mutation to evolve solutions.
Algorithms and Optimizations in Machine Learning Optimization Algorithms b) Particle Swarm Optimization (PSO): Inspired by the social behavior of birds, particles move through the solution space influenced by their own and their neighbors' experiences. c) Ant Colony Optimization (ACO): Simulates the behavior of ants finding paths to food. Pheromone trails guide the search for optimal solutions. d) Simulated Annealing (SA): Mimics the annealing process in metallurgy. It explores the solution space by accepting worse solutions with a certain probability that decreases over time. e) Hill Climbing: A local search algorithm that iteratively makes small changes to the solution, each time selecting the neighbor that improves the objective function the most.
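As a small illustration of how such gradient-free methods operate, below is a minimal sketch of hill climbing on a non-differentiable one-dimensional objective. The objective, step size, and iteration budget are assumptions chosen only for demonstration, and this variant samples one random neighbor per step rather than evaluating all neighbors:

# Hedged sketch of a simple hill-climbing search on a non-differentiable objective.
# The objective, step size, and budget are illustrative assumptions.
import random

def objective(x):
    return abs(x - 3) + abs(x + 1)        # non-differentiable; minimized anywhere on [-1, 3]

def hill_climb(start, step=0.1, iters=1000, seed=0):
    random.seed(seed)
    current, current_val = start, objective(start)
    for _ in range(iters):
        neighbor = current + random.uniform(-step, step)   # propose a random neighbor
        neighbor_val = objective(neighbor)
        if neighbor_val <= current_val:                    # move only if it does not worsen the objective
            current, current_val = neighbor, neighbor_val
    return current, current_val

best_x, best_val = hill_climb(start=10.0)
print("best x:", best_x, "objective:", best_val)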
ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING Convex and non-convex functions Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML
Algorithms and Optimizations in Machine Learning Convex and non-convex functions
Convex Optimization | Non-Convex Optimization
The objective function is convex | The objective function is non-convex
Any local minimum is a global minimum | May have multiple local minima and maxima
Examples: Linear programming, gradient descent | Examples: Evolutionary algorithms, genetic algorithms, particle swarm optimization (PSO)
Applications: Machine learning | Applications: Deep learning
ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING Local vs Global Minima Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML
Algorithms and Optimizations in Machine Learning Local vs Global Minima
Algorithms and Optimizations in Machine Learning References https://datascience.stackexchange.com/questions/117983/differentiable-vs-non-differentiable-loss-function-in-ml https://rumn.medium.com/convex-vs-non-convex-functions-why-it-matters-in-optimization-for-machine-learning-39cd9427dfcc https://medium.com/@dilip.voleti/maxima-vs-minima-and-global-vs-local-in-machine-learning-basic-concept-741e760e9f80 All images are taken from Google
ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING No Free Lunch Theorem Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML
Algorithms and Optimizations in Machine Learning No Free Lunch Theorem What is the No Free Lunch Theorem? The No Free Lunch (NFL) theorem is a concept in optimization and machine learning which states that no single algorithm is universally the best-performing algorithm for all problems. In other words, there is no single optimal optimization algorithm. The effectiveness of an algorithm is problem-specific; what works well for one problem might not work well for another.
Algorithms and Optimizations in Machine Learning No Free Lunch Theorem The theorem asserts that when the performance of all optimization methods is averaged across all conceivable problems, they all perform equally well. For any algorithm A, if we sum the performance of A over all possible problems, it is equal to the summed performance of any other algorithm B. This is expressed as ∑_P E_A[P] = ∑_P E_B[P], where P ranges over the set of all possible problems and E_A[P] is the performance of algorithm A on problem P.
Algorithms and Optimizations in Machine Learning References https://www.geeksforgeeks.org/what-is-no-free-lunch-theorem/ https://towardsdatascience.com/what-no-free-lunch-really-means-in-machine-learning-85493215625d No Such Thing
THANK YOU Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML [email protected]
Algorithms and Optimizations In Machine Learning Unit 2: Optimization of Model Hyperparameters Bhaskarjyoti Das Department of Computer Science in AI & ML and Engineering Teaching Assistant: Sriraksha M S (7th Sem)
ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING What are hyperparameters?
Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML
Algorithms and Optimizations in Machine Learning Parameter vs Hyperparameter The distinction between parameters and hyperparameters is somewhat arbitrary, and is mostly driven by the distinction between what can be numerically optimized versus what needs to use search techniques. In short, parameters are learned by the model during learning, while hyperparameters are set by the user before learning.
Table 1: Based on Optimization Perspective
Aspect | Parameters | Hyperparameters
Definition | Values that are numerically optimized | Values that control the optimization process
Optimization Method | Optimized directly via algorithms (e.g., gradient descent) | Determined through search techniques (e.g., grid search, random search)
Examples | Weights and biases in neural networks | Learning rate, batch size, number of epochs
Algorithms and Optimizations in Machine Learning Parameter vs Hyperparameter Another way to consider the distinction is to consider parameters as the explicit parameters of a probabilistic model, and to consider hyperparameters (higher-level parameters) as parameters that control the distribution of these explicit parameters.
Table 2: Based on Probabilistic Model Perspective
Aspect | Parameters | Hyperparameters
Definition | Explicit parameters of a probabilistic model | Parameters controlling the distribution of explicit parameters
Role | Directly define the model behavior | Influence how the model's parameters are set or optimized
Examples | Mean and variance in Gaussian models | Regularization strength, kernel type in SVMs
Algorithms and Optimizations in Machine Learning What is Hyperparameter Optimization? Hyperparameter optimization (HPO) is the process of automatically finding the best hyperparameters for a machine learning model to improve performance, while also enhancing model generalization and efficiency. Unlike parameters, which are learned during training, hyperparameters like learning rate, batch size, and regularization settings need to be set manually. Optimization methods like grid search, random search, and Bayesian optimization systematically explore different hyperparameter combinations. HPO algorithms treat this as a global optimization problem, aiming to maximize validation performance by adjusting hyperparameters systematically.
Algorithms and Optimizations in Machine Learning Hyperparameter Optimization API Purpose: Automate the process of tuning hyperparameters to enhance model performance in machine learning. Key Features: Systematic Search: Efficiently explore combinations of hyperparameters using methods like Grid Search, Random Search, and Bayesian Optimization. Cross-Validation: Ensures robust model evaluation by splitting the dataset into multiple training and testing sets. Scalability: Handles large datasets and complex models. Integration: Compatible with popular ML libraries (e.g., Scikit-Learn, TensorFlow, PyTorch). Benefits: Improves model accuracy and saves time.
Algorithms and Optimizations in Machine Learning Hyperparameter Optimization API Example Usage: Direct APIs (a sketch is given below)
Algorithms and Optimizations in Machine Learning Types of Hyperparameter Optimization When the algorithm has many parameters, it is very hard to try all the possible combinations to find the best set. For that reason, we would like to do hyperparameter tuning efficiently and in a manageable way.
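The "Example Usage: Direct APIs" slide above originally showed code that is not reproduced in this transcript. As a stand-in, here is a minimal hedged sketch of a direct HPO API call using scikit-learn's GridSearchCV; the estimator, parameter grid, and synthetic data are assumptions chosen only for illustration:

# Hedged sketch of a direct hyperparameter-optimization API (scikit-learn GridSearchCV).
# The estimator, parameter grid, and synthetic data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="accuracy",   # user-specified evaluation metric
    cv=5,                 # 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_, search.best_score_)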
Types of Hyperparameter Search There are four main methods to perform hyperparameters search: 1. Grid search 2. Randomized search 3. Stochastic Hill Climbing 4. Bayesian Search ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING Grid Search Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML Algorithms and Optimizations in Machine Learning Grid Search Grid Search is a method wherein we try all possible combination of the set of Hyperparameters. Each combination of the Hyperparameters represent a machine learning model. Hence, N combinations represent N machine learning models. Grid Search tests all combinations (all possible models) of such values on the given data and finds the best possible model by using user specified evaluation metrics such as accuracy, precision, recall, etc. Through Grid search, we identify the model which shows the best performance. This makes grid search pretty expensive not just in terms of time complexity but also space complexity. Algorithms and Optimizations in Machine Learning Grid Search For all the examples in the slides, we will use Kaggles’ heart-attack- analysis-prediction- dataset. Code: Algorithms and Optimizations in Machine Learning Grid Search Advantages of Grid Search: Exhaustive search: Grid search exhaustively searches through a hyperparameter space, ensuring that the optimal set of hyperparameters is found. Simple implementation: Grid search is simple to implement and does not require any advanced optimization techniques. Disadvantages of Grid Search: Computationally expensive: Grid search can be computationally expensive, particularly when there are many hyperparameters or large datasets. Limited search space: Grid search is limited to the hyperparameters and their respective ranges specified by the user, which may not include the optimal set of hyperparameters. ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING Random Search Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML Algorithms and Optimizations in Machine Learning Random Search Random Search on the other hand, is a method wherein we try randomly chosen combination of Hyperparameters. This is usually quite computationally cheap and manages to give us decent enough Hyperparameter combinations which enables desired level of performance. The only problem with Random Search is that it doesn’t tell us how it chooses the Hyperparameter combinations. The process is totally random and there is no way to know if there exists a better combination. There is no way to narrow down the search space as we are not actually aware of where we can find better values. Algorithms and Optimizations in Machine Learning Random Search For all the examples in the slides, we will use Kaggles’ heart-attack- analysis-prediction- dataset. Code: Algorithms and Optimizations in Machine Learning Random Search This is just grid search, but with randomly chosen points instead of points on a grid. This solves the curse of dimensionality ❑ Don’t need to increase the number of grid points exponentially as the number of dimensions increases. Problem: with random search, not necessarily going to get anywhere near the optimal parameters in a finite sample. 
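The "Code:" slides above referred to notebook examples on the Kaggle heart-attack dataset that are not reproduced in this transcript. As a stand-in, here is a minimal hedged sketch of random search using scikit-learn's RandomizedSearchCV; synthetic data and an assumed estimator and distributions are used so the example is self-contained:

# Hedged sketch of random search (scikit-learn RandomizedSearchCV).
# The slides use the Kaggle heart-attack dataset; synthetic data is substituted here so the
# example runs on its own, and the estimator and distributions are illustrative assumptions.
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

param_distributions = {
    "n_estimators": randint(50, 300),        # sampled from a distribution, not enumerated on a grid
    "learning_rate": uniform(0.01, 0.3),
    "max_depth": randint(2, 6),
}

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,            # number of random combinations tried
    scoring="accuracy",
    cv=5,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)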
Algorithms and Optimizations in Machine Learning Grid vs Random Search Grid Search tests all possible combinations of the hyperparameters of a given machine learning algorithm, while Random Search randomly chooses combinations of hyperparameters from the searchable space (all possible combinations).
Feature | Grid Search | Random Search
Search Method | Exhaustive search over specified hyperparameter values | Randomly samples hyperparameter values from specified distributions
Search Space Coverage | Evaluates all possible combinations in the grid | Evaluates a subset of combinations based on random sampling
Computational Cost | Can be high, especially with large hyperparameter grids | Generally lower, as it samples a subset of the space
Efficiency | May become inefficient as the number of hyperparameters or their ranges increase | Often more efficient for large or continuous hyperparameter spaces
Implementation Complexity | Simple and straightforward | Requires setup of distributions and sampling strategy
Usage Scenarios | Best for small to moderate hyperparameter spaces with well-defined ranges | Best for large hyperparameter spaces or when a broad exploration is needed
ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING Stochastic Hill Climbing Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML
Algorithms and Optimizations in Machine Learning Stochastic Hill Climbing Stochastic Hill Climbing introduces an element of randomness to the search process. Instead of always selecting the best neighbor, it probabilistically chooses neighbors, allowing it to escape local optima more effectively. It selects a neighbor at random, and decides (based on the amount of improvement in that neighbor) whether to move to that neighbor or to examine another. Advantages: Simple implementation. Effective for local search and tuning. Disadvantages: Limited to local optima. Might miss better solutions outside of the local search space.
Algorithms and Optimizations in Machine Learning Stochastic Hill Climbing Steps: 1. Initialization: Start with an initial set of hyperparameters. 2. Evaluation: Evaluate model performance using these hyperparameters. 3. Modification: Make random adjustments to the hyperparameters. 4. Acceptance: Accept the new hyperparameters if they improve the model performance. 5. Iteration: Repeat the process until convergence or a stopping criterion is met.
ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING Bayesian Search Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML
Algorithms and Optimizations in Machine Learning Bayesian Search Bayesian Search: Based on Bayes' rule; it uses previously gathered knowledge to help narrow down the search space of good hyperparameter combinations. Bayesian Search usually takes more time than Random Search but less time than Grid Search. Bayesian optimization (BO) is considered the baseline to beat for new algorithms in the field of hyperparameter optimization. In order of time complexity: Grid Search > Bayesian Search > Random Search.
Algorithms and Optimizations in Machine Learning Bayesian Search The only problem with Random Search is that it doesn't tell us how it chooses the hyperparameter combinations. The process is totally random and there is no way to know if there exists a better combination. Bayesian Search, on the other hand, solves the above problem. The technique, as the name suggests, works on Bayes' principle.
Bayes principle basically says that posterior probability distribution is directly proportional to the priors given to it (prior probability distribution) and the likelihood function. Algorithms and Optimizations in Machine Learning Bayesian Search Thus, instead of randomly choosing the next set of parameters, the algorithm optimizes the choice, and likely reaches the best parameter set faster than the previous two methods. Meaning, this method chooses only the relevant search space and discards the ranges that will most likely not deliver the best solution. Thus, it can be beneficial when you have a large amount of data, the learning is slow, and you want to minimize the tuning time. Algorithms and Optimizations in Machine Learning Bayesian Search Undirected For all the examples in the slides, we will use Kaggles’ heart-attack- analysis-prediction- dataset. Code: Algorithms and Optimizations in Machine Learning Bayesian Search Key Features: Surrogate Model: A simplified model used to predict performance. Acquisition Function: Decides where to look next for better hyperparameters. Efficiency: Faster and requires fewer tests than random or grid search. Steps: 1. Initialization: Start with a few random hyperparameter sets. 2. Surrogate Modeling: Create a model to predict performance based on initial tests. 3. Acquisition Function: Use this model to choose the next hyperparameters to test. 4. Evaluation: Test these hyperparameters on your model. 5. Iteration: Update the surrogate model with new results and repeat. Algorithms and Optimizations in Machine Learning Bayesian Search Advantages: Needs fewer tests to find good hyperparameters. Balances trying new things and refining good options. Disadvantages: More complex to set up and understand. Can be computationally intensive. Algorithms and Optimizations in Machine Learning References ∙ https://towardsdatascience.com/intuitive-hyperparameter-optimization-grid-search-random-search- and-bayesian-search-2102dbfaf5b ∙ https://towardsdatascience.com/bayesian-optimization-for-hyperparameter-tuning-how-and-why- 655b0ee0b399 ∙ https://machinelearningmastery.com/manually-optimize-hyperparameters/ ∙ https://d2l.ai/chapter_hyperparameter-optimization/index.html THANK YOU Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML [email protected] Algorithms and Optimizations In Machine Learning Local Optimization Over Differentiable Target Functions Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML Teaching Assistant: Gautham Krithiwas (7th sem) ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING Introduction To Gradient Descent Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML Algorithms and Optimizations in Machine Learning Gradient Descent For general non-linear functions, most algorithms only guarantee a local optimum This means the algorithm would find a feasible x0 such that f(x0) High learning rate If the weight is varying fast, then the st for that weight is large. This is because st is the squared sum of gradient history, which is large because the gradients are steep and weights are varying quickly. 
Hence learning rate is divided by a large value -> learning rate is small Quickly varying weights -> low learning rate Algorithms and Optimizations in Machine Learning AdaGrad Advantages and Disadvantages Advantages Eliminates manual tuning of learning rate Rapidly varying weights have a smaller learning rate, promoting stability Slowly varying weights have a larger learning rate, helping it to get out of flat gradient regions Disadvantages Accumulated gradients can keep growing, causing the learning rate to become infinitesimally small and resulting in vanishing gradients Old and stale gradients stored in the aggregated sum can continue to influence the present Algorithms and Optimizations in Machine Learning RMSProp RMSProp stands for Root Mean Square Propagation Solves AdaGrad’s radically diminishing learning rate by using a moving average of the squared gradients This involves a small modification to the AdaGrad update formula ○ Recent gradients normalize the accumulated squared gradients Similar to AdaGrad, there is a different learning rate for each parameter The remaining details are the same as AdaGrad Algorithms and Optimizations In Machine Learning Adam Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML Algorithms and Optimizations in Machine Learning Adam Adam stands for Adaptive Moment Estimation Adam is a combination of two algorithms: ○ Gradient descent with momentum ○ RMSProp It maintains a moving average of the gradients to create momentum It also maintains a moving average of the squared gradients, similar to RMSProp Algorithms and Optimizations in Machine Learning Adam Each of them have a dedicated decay factor (beta1 and beta2) in the range [0, 1) It uses the momentum in place of the regular gradient, and the sum of squared gradients for modifying the learning rate Adam also addresses the bias towards 0 in the initial iterations. This bias happens because gradients are initialized to 0. Adam incorporates bias-correction. Algorithms and Optimizations in Machine Learning Adam When updating the weights, we use both bias corrected values: There is a slight noticeable change compared to RMSProp in the denominator: We use instead of , which works better in practice. We get both the advantages of RMSProp and momentum Algorithms and Optimizations in Machine Learning Effect of Betas The values for beta1 and beta2 based on empirical evidence is 0.9 and 0.999, and would work well for most applications The bias correction used influences the estimations only in the earlier iterations. 
In later iterations, with larger values of t, both adjustment factors become almost 0 Algorithms and Optimizations in Machine Learning Adam Algorithm The algorithm is similar to minibatch SGD Assume we have a training dataset of size n Assume minibatch of size m While not converged: ○ Sample a minibatch of size m from the dataset ○ Calculate the losses of the dataset (f) and calculate the gradients gt ○ Calculate both parameters, then normalize them ○ Update the weights as: Algorithms and Optimizations In Machine Learning Nesterov Accelerated Gradient (NAG) Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML Algorithms and Optimizations in Machine Learning NAG A disadvantage of classical momentum is that the algorithm can overshoot the minima This is because we add the momentum to the gradient of current weight, which can cause it to overshoot Nesterov Accelerated Gradient (NAG) solves this by “looking ahead” and correcting any overshooting made by large momentum terms Algorithms and Optimizations in Machine Learning Normal Momentum vs NAG In ordinary momentum, we first calculate the momentum term vt and use this to update the weights In contrast: ○ NAG first “looks ahead” by adding momentum to the current weights ○ Gradients are computed at this new estimated weight instead ○ These gradients are used to compute vt+1 and then the original weights are updated Algorithms and Optimizations in Machine Learning NAG Characteristics NAG in a sense first makes a big jump in the direction of the previous accumulated gradient (momentum), and then measures the gradient where it ends up and makes a correction In contrast, normal momentum would take a big jump in the direction of the current gradient and momentum. Intuitively, this is more likely to overshoot a minima Hence, NAG augments ordinary momentum to correct any overshooting made by ordinary momentum Algorithms and Optimizations in Machine Learning Summary Algorithms and Optimizations in Machine Learning Comparison of Optimizer Performances Algorithms and Optimizations in Machine Learning References https://d2l.ai/chapter_optimization/minibatch-sgd.html https://d2l.ai/chapter_optimization/momentum.html https://www.cs.toronto.edu/~fritz/absps/momentum.pdf https://d2l.ai/chapter_optimization/adagrad.html https://d2l.ai/chapter_optimization/rmsprop.html https://d2l.ai/chapter_optimization/adam.html https://medium.com/konvergen/momentum-method-and- nesterov-accelerated-gradient-487ba776c987 THANK YOU Bhaskarjyoti Das Department of Computer Science and Engineering in AI &ML [email protected] Algorithms and Optimizations In Machine Learning Global Optimization over Non-differentiable Target Functions - 1 Bhaskarjyoti Das Department of Computer Science in AI & ML and Engineering TA: Rohit Rajesh Global Optimization over Non- differentiable Target Functions - 1 Introduction Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML Algorithms and Optimizations in Machine Learning Overview of Optimization What Is Optimization? The Optimization problem aims to find the values for a set of variables θ ∈ Θ, that minimize a scalar-valued loss function or cost function: L:Θ→R Optimization in machine learning may also be used to maximize a utility function or reward function R(θ), which is often equivalently minimized post transformation L(θ) = −R(θ). A point that satisfies the above equation is called a global optimum. Finding such a point is called global optimization. 
This is critical because it seeks the best possible solution over the entire search space, avoiding local minima that might trap simpler optimization methods (especially when models have complex landscapes with multiple local minima). Techniques used for global optimization can lead to better generalization and robustness of machine learning models.
Algorithms and Optimizations in Machine Learning Overview of Optimization Differentiable vs. non-differentiable functions Differentiable Functions These functions have well-defined gradients, allowing methods like gradient descent to effectively find optimal solutions. Many traditional optimization techniques rely on this property. Non-differentiable Functions Certain functions found in robust statistics and machine learning involve absolute values or step functions. These may have points where derivatives do not exist or are discontinuous. Global optimization methods must adapt to handle these challenges without gradient information.
Algorithms and Optimizations in Machine Learning Motivation and Applications
Algorithms and Optimizations in Machine Learning Global vs. Local Optimization Local Optimization. Goal: Find the nearest minimum or maximum, which may not be the global one. Typically uses gradient-based methods like gradient descent or Newton's method, which are efficient but may get stuck in local minima. Global Optimization. Goal: Find the global minimum or maximum of the objective function, over the entire search space, regardless of the starting point. Often requires algorithms that can explore large and complex landscapes, such as genetic algorithms, simulated annealing, and Bayesian optimization. ➔ Necessary Condition for Local Minimum ◆ At a local minimum θ*, the gradient g* = ∇L(θ*) must be zero, indicating that θ* is a stationary point. ◆ The Hessian matrix H* at θ* must be positive semi-definite, ensuring that the point is not a maximum or saddle point. ➔ Sufficient Condition: If the gradient g* = 0 and the Hessian H* is positive definite, then θ* is guaranteed to be a local minimum.
Algorithms and Optimizations in Machine Learning Non-differentiable Functions Common non-differentiable functions in ML 1. Hinge Loss: Used in support vector machines. Defined as max(0, 1 − y⋅f(x)). Non-differentiable at the hinge point. 2. Absolute Value Function: Commonly used for regularization and robust statistics. f(x) = |x|. Non-differentiable at x = 0. 3. ReLU (Rectified Linear Unit): Common activation function in neural networks. Defined as f(x) = max(0, x). It is non-differentiable at x = 0.
Algorithms and Optimizations in Machine Learning Challenges to Global Optimization → Lack of Gradient Information: Traditional gradient-based methods are ineffective as they rely on gradient information to navigate the search space. Without gradients, it becomes difficult to determine the direction of improvement. → Complexity and Computational Cost: Global optimization methods typically require more computational resources due to the need to explore a larger search space. Methods like Genetic Algorithms (GAs) and Simulated Annealing (SA) can be computationally expensive due to their iterative nature and the need to evaluate the objective function multiple times.
Algorithms and Optimizations in Machine Learning Intro to Genetic Algorithms Genetic algorithms are inspired by natural selection. They use a population of candidate solutions to find optimal solutions for a problem.
KEY TERMINOLOGY Algorithms and Optimizations in Machine Learning Intro to Genetic Algorithms Fundamental Concepts Algorithms and Optimizations in Machine Learning Genetic Algorithms - Candidate Solutions Algorithms and Optimizations in Machine Learning Prufer Code Tree encoding Given a tree (represented as graph, not as a rooted tree) with n labeled nodes with labels from 1 to n, a Prufer code uniquely identifies the tree. The sequence has n-2 values. Prufer Code Algorithm to Generate Prufer Code from a Tree 1. Initialize an empty list P to store the Prufer code. 2. While the tree has more than 2 nodes: ○ Find the leaf node with the smallest label. ○ Append the label of the leaf node's only neighbor to P. ○ Remove the leaf node from the tree. 3. Return the list P as the Prufer code. Example: For a tree with edges: (1-2, 2-3, 2-4, 4-5), the Prufer code is [2, 2, 4]. Algorithms and Optimizations in Machine Learning Prufer Code Reconstruction Algorithms and Optimizations in Machine Learning Prufer Code Tree encoding Given a tree (represented as graph, not as a rooted tree) with n labeled nodes with labels from 1 to n, a Prufer code uniquely identifies the tree. The sequence has n-2 values. Prufer Code Algorithm to Reconstruct a Tree from a Prufer Code 1. Input: Prufer code P of length n-2 for a tree with n nodes. 2. Initialize: ○ Create a list L of size n with all node labels. ○ Create an empty list Tree to store the edges. 3. While there are elements in P: ○ Find the smallest node label in L not in P. ○ Create an edge between this node and the first node in P, add it to Tree. ○ Remove the node from L and the first element from P. 4. Connect the last two remaining nodes in L and add this edge to Tree. 5. Return the list of edges in Tree. Algorithms and Optimizations in Machine Learning Intro to Genetic Algorithms Fitness Function and Selection Mechanisms Algorithms and Optimizations in Machine Learning Intro to Genetic Algorithms Genetic Operators: Crossover and Mutation Algorithms and Optimizations in Machine Learning Intro to Genetic Algorithms Population Dynamics and Convergence Algorithms and Optimizations in Machine Learning Intro to Genetic Algorithms Millions of Choices in Genetic Algorithms Algorithms and Optimizations in Machine Learning Q&A Session Q&A Unlock the Potential of Genetic Algorithms for Complex Optimization Problems THANK YOU Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML [email protected] Algorithms and Optimizations In Machine Learning Global Optimization over Non-differentiable Target Functions - 2 Bhaskarjyoti Das Department of Computer Science in AI & ML and Engineering Global Optimization over Non- differentiable Target Functions - 2 Optimization Algorithms Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML Algorithms and Optimizations in Machine Learning Classification of Algorithms General Classification Algorithms and Optimizations in Machine Learning Classification of Algorithms Deterministic vs. Stochastic methods Deterministic: 1. Methods that produce the same output given the same input. 2. Path and solution are consistent and repeatable. 3. Faster convergence for well-behaved functions. 4. May get stuck in local minima for non-convex functions. 5. Ex: Gradient Descent Stochastic: 1. Can produce different outputs even with the same input. 2. Includes randomness to explore the search space. 3. Can escape local minima to find a global optimum. 4. Generally slower but more robust for complex problems. 5. 
Ex: Genetic Algorithms (GA): Mimic the process of natural evolution. Simulated Annealing (SA): Inspired by the annealing process in metallurgy Algorithms and Optimizations in Machine Learning Classification of Algorithms Exact vs. Heuristic methods Exact: 1. Methods that guarantee finding the optimal solution. 2. Provide provable guarantees of optimality. 3. Can be computationally expensive for large-scale problems. 4. Typically used for problems where an exact solution is crucial. 5. Ex: Branch and Bound, Dynamic Programming Heuristic: 1. They find good enough solutions using practical approaches, often sacrificing optimality for efficiency 2. Faster and more scalable to large problems. 3. Do not guarantee an optimal solution but often find good solutions. 4. Useful for complex and NP-hard problems where exact methods are impractical. 5. Ex: Genetic Algorithms, Particle Swarm Optimization Algorithms and Optimizations in Machine Learning Evolutionary Algorithms - Genetic Algorithms GA workflow 1. Initialize population P(0) 2. for generation t=1 to max_generations do a. Evaluate fitness of each individual in P(t) b. Select parents from P(t) c. Apply crossover and mutation to produce offspring d. Form new population P(t+1) 3. end for Algorithms and Optimizations in Machine Learning Evolutionary Algorithms - Genetic Algorithms - Population Models 1. Steady State In steady state GA, we generate one or two off-springs in each iteration and they replace one or two individuals from the population. A steady state GA is also known as Incremental GA. 1. Generational In a generational model, we generate ‘n’ off-springs, where n is the population size, and the entire population is replaced by the new one at the end of the iteration. Algorithms and Optimizations in Machine Learning Evolutionary Algorithms - Genetic Algorithms - Parent Selection 1. Parent selection is very crucial to the convergence rate of the GA as good parents drive individuals to a better and fitter solutions. 2. The taking up of the entire population by one extremely fit solution is known as premature convergence and is an undesirable condition in a GA. 3. Maintaining good diversity in the population is extremely crucial for the success of a GA. Algorithms and Optimizations in Machine Learning Evolutionary Algorithms - Genetic Algorithms Algorithms and Optimizations in Machine Learning Genetic Algorithms - Exploration and Exploitation In Genetic Algorithms (GAs), search refers to the process of exploring the solution space to find optimal or near-optimal solutions to a given problem. It involves two key strategies: Exploration and Exploitation: Search can be performed with either blind search strategies or heuristic strategies. Blind search strategies do not make use of problem domain information. Heuristic search strategies use additional information to guide the search along the best directions. Trade-off between exploration and exploitation happens in GA as well: ○ Exploration: The algorithm explores the search space for new solutions in new regions. ○ Exploitation: The algorithm exploits the existing solution and refines that solution for better fitness. → The GA uses operations like selection, crossover, and mutation to perform this search, iterating over generations of candidate solutions (chromosomes) to evolve toward the best solution. 
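To ground the workflow and population model described above, here is a minimal hedged sketch of a generational GA with real-valued chromosomes; the objective, bounds, and operator settings are assumptions chosen only for illustration:

# Hedged sketch of a simple generational genetic algorithm (real-valued encoding).
# The objective, bounds, and operator rates are illustrative assumptions.
import random

random.seed(0)

LOW, HIGH = -1.0, 3.0
POP_SIZE, GENERATIONS = 10, 30
CROSSOVER_RATE, MUTATION_RATE = 0.5, 0.1

def fitness(x):
    return -(x - 2) ** 2                   # maximizing fitness = minimizing (x - 2)^2

def select(pop):
    # Tournament selection of size 2 (roulette-wheel selection is another common choice).
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

population = [random.uniform(LOW, HIGH) for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    offspring = []
    while len(offspring) < POP_SIZE:
        p1, p2 = select(population), select(population)
        # Crossover: blend the two parents with probability CROSSOVER_RATE.
        child = (p1 + p2) / 2 if random.random() < CROSSOVER_RATE else p1
        # Mutation: small random perturbation with probability MUTATION_RATE.
        if random.random() < MUTATION_RATE:
            child += random.gauss(0, 0.2)
        offspring.append(min(max(child, LOW), HIGH))   # keep the child within bounds
    population = offspring                             # generational replacement

best = max(population, key=fitness)
print("best x:", best, "fitness:", fitness(best))

The worked example on the next slides follows the same loop, but with roulette-wheel selection and the crossover and mutation rates stated there.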
Algorithms and Optimizations in Machine Learning Genetic Algorithms - Optimization Example Question: Algorithms and Optimizations in Machine Learning Genetic Algorithms - Solution Blueprint Problem Overview: To minimize: where Parameters Given: 1. Crossover rate = 0.5 2. Mutation rate = 0.1 3. Population size = 10 chromosomes 4. Selection Method: Roulette Wheel Step 1: Initialization Generate an initial population of 10 chromosomes (randomly selected values of ‘x’ within the range [-1, 2]) Chromosome x value f(x) value 1 x1 f(x1) 2 x2 f(x2)......... 10 x10 f(x10) Algorithms and Optimizations in Machine Learning Genetic Algorithms - Solution Blueprint Step 2: Calculate Fitness Calculate the fitness of each chromosome. For minimization, fitness can be inversely proportional to the function value f(x). A possible fitness function is: Fitness(x) = Chromosome f(x) value Fitness 1 f(x1) Fitness(x1) 2 f(x2) Fitness(x2)......... 10 f(x10) Fitness(x10) Algorithms and Optimizations in Machine Learning Genetic Algorithms - Solution Blueprint Step 3: Selection (Roulette Wheel) 1. Calculate the total fitness of the population. 2. Determine the probability of selection for each chromosome: Probability of selection = 3. Perform roulette wheel selection to choose pairs for crossover. Algorithms and Optimizations in Machine Learning Genetic Algorithms - Solution Blueprint Step 4: Crossover With a crossover rate of 0.5, randomly pair selected chromosomes and perform crossover to generate new offspring. Parent 1 Parent 2 Offspring 1 Offspring 2............ Step 5: Mutation With a mutation rate of 0.1, randomly alter genes in the offspring. Offspring Before Mutation After Mutation......... Algorithms and Optimizations in Machine Learning Genetic Algorithms - Solution Blueprint Step 6: Evaluation Calculate the fitness of the new population (offspring) and assess whether the termination criterion (like number of generations or convergence) is met. Chromosome x value f(x) value Fitness 1............ Final Notes: 1. Repeat Steps 3-6 for each generation until the termination criterion is met. 2. The chromosome with the best fitness in the final generation is considered the optimal solution to the problem. This process allows you to compute the next generation based on the fitness of the current population, eventually leading to the optimal solution. Algorithms and Optimizations in Machine Learning Evolutionary Algorithms - Differential Evolution (DE) Basic Principles Population-Based: DE optimizes by evolving a population of candidate solutions. Unlike Genetic Algorithms, DE directly uses vector operations on real-valued parameters, making it more suitable for continuous optimization. Key Steps Initialization: Randomly generate population. Mutation: Create mutant vectors by combining randomly selected solutions. Crossover: Combine mutant with target vector to form trial vector. Selection: Replace target vector if trial vector is better. Advantages Simple & Scalable: Easy to implement and handles high-dimensional problems. Global Search: Effective in avoiding local minima, robust against noise. Limitations Parameter Sensitivity: Performance depends on tuning. Slow Convergence: Can be slow in fine-tuning. Not Ideal for Discrete Problems: Best suited for continuous optimization. 
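In practice, DE is available off the shelf. The following minimal sketch uses SciPy's differential_evolution on an assumed non-convex test function; the function, bounds, and settings are illustrative and not from the slides:

# Hedged sketch: differential evolution via SciPy on an assumed 2-D test function.
import numpy as np
from scipy.optimize import differential_evolution

def objective(v):
    x, y = v
    return (x - 1) ** 2 + (y + 2) ** 2 + 0.5 * np.sin(5 * x)   # continuous, non-convex

bounds = [(-5, 5), (-5, 5)]                 # search box for (x, y)

result = differential_evolution(
    objective,
    bounds,
    mutation=(0.5, 1.0),                    # mutation factor range (a key DE parameter)
    recombination=0.7,                      # crossover probability
    seed=42,
)
print(result.x, result.fun)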
Algorithms and Optimizations in Machine Learning Particle Swarm Optimization (PSO) Concepts and Mechanics Swarm Intelligence: Inspired by the collective behavior of birds flocking or fish schooling, PSO models a group of particles (potential solutions) moving through the search space to find the optimal solution. Particle Update Rules: ○ Each particle adjusts its position based on its own best experience and the best-known position in the swarm. ○ Position and velocity are updated using: Algorithm Workflow Initialization: Randomly generate particles with positions and velocities. Iteration: Update velocities and positions; adjust based on personal and global bests. Convergence: Stop when a stopping criterion is met (e.g., minimal change in position or after a fixed number of iterations). Use Cases: Widely used for optimizing nonlinear functions, neural network training, and engineering design problems. Difference from Genetic Algorithms: PSO is simpler, requiring fewer parameters, and focuses on the social sharing of information rather than the evolutionary process of selection, crossover, and mutation Algorithms and Optimizations in Machine Learning Simulated Annealing (SA) Core Principles Metaphor of Annealing in Metallurgy: SA is inspired by the process of heating and slowly cooling metals to reach a stable crystalline structure, representing finding a global minimum in optimization. Algorithm Steps: Temperature Control: Start with a high "temperature" and gradually decrease it. Higher temperatures allow exploration of the search space by accepting worse solutions. Acceptance Probability: Use a probabilistic acceptance criterion: Strengths and Weaknesses When to Use SA: Ideal for problems with complex landscapes where global optimization is challenging. SA is effective at escaping local minima but can be slow to converge. Applications: Used in scheduling, traveling salesman problems, and circuit design. Difference from Genetic Algorithms: SA uses a single solution and probabilistic moves, focusing on controlled randomization rather than population-based evolution and crossover as in Genetic Algorithms. Algorithms and Optimizations in Machine Learning Bayesian Optimization Fundamentals Surrogate Models (Gaussian Processes): BO uses surrogate models, often Gaussian processes, to approximate the objective function and predict its behavior across the search space, allowing efficient optimization with fewer function evaluations. Acquisition Functions: Guide the search by balancing exploration (sampling areas of uncertainty) and exploitation (sampling areas likely to have good solutions). Common functions include Expected Improvement (EI) and Upper Confidence Bound (UCB). Workflow Model Fitting: Fit the surrogate model to existing data points (sampled solutions). Exploration vs. Exploitation: Use the acquisition function to choose the next sample point, either exploring unknown areas or exploiting known promising areas. Iteration: Update the model with new data and repeat until convergence. Applications and Examples Use Cases: Effective in hyperparameter tuning in machine learning, optimization of expensive black-box functions, and engineering design. Difference from Genetic Algorithms: BO focuses on sample efficiency and is designed for optimizing expensive functions with minimal evaluations, unlike Genetic Algorithms, which use a population-based, evolutionary approach to explore the search space broadly. 
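As a concrete illustration of the surrogate-model and acquisition-function loop just described, here is a minimal hedged sketch using scikit-optimize's gp_minimize with an Expected Improvement acquisition function; the objective and search space are assumptions, and scikit-optimize is only one of several libraries implementing BO:

# Hedged sketch: Bayesian optimization with a Gaussian-process surrogate (scikit-optimize).
# The objective and search space are illustrative assumptions.
import numpy as np
from skopt import gp_minimize
from skopt.space import Real

def objective(params):
    x = params[0]
    return np.sin(3 * x) + 0.3 * x ** 2      # stand-in for an expensive black-box function

result = gp_minimize(
    objective,
    dimensions=[Real(-3.0, 3.0, name="x")],  # search space
    acq_func="EI",                           # Expected Improvement acquisition function
    n_calls=25,                              # total objective evaluations (kept small)
    random_state=42,
)
print(result.x, result.fun)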
Algorithms and Optimizations in Machine Learning
Comparative Analysis - Strengths and Weaknesses

Gradient Descent
Strengths: Simple, efficient for smooth and convex functions.
Weaknesses: Stuck in local minima, requires differentiability, sensitive to learning rate.

Genetic Algorithms
Strengths: Good for global optimization, handles non-differentiable functions, adaptable.
Weaknesses: Computationally expensive, requires careful tuning of parameters (mutation rate, crossover rate).

Simulated Annealing
Strengths: Escapes local minima, simple implementation, versatile.
Weaknesses: Slow convergence, requires a good cooling schedule.

Simplex Method
Strengths: Efficient for linear programming, provides exact solutions.
Weaknesses: Not suitable for non-linear or large-scale problems.

Particle Swarm Optimization
Strengths: Simple to implement, good for continuous optimization, parallelizable.
Weaknesses: May converge prematurely, sensitive to parameter settings.

Branch and Bound
Strengths: Guarantees the optimal solution, systematic exploration.
Weaknesses: Computationally intensive for large problems, complex implementation.

Bayesian Optimization
Strengths: Efficient for expensive-to-evaluate functions, provides uncertainty quantification.
Weaknesses: Computationally intensive for high-dimensional problems, requires prior knowledge.

Differential Evolution
Strengths: Robust, simple to implement, good for global optimization.
Weaknesses: Slow convergence, requires careful tuning of parameters (mutation factor, crossover rate).

Algorithms and Optimizations in Machine Learning
Comparative Analysis - Performance on Different Problem Types
(GD = Gradient Descent, GA = Genetic Algorithms, SA = Simulated Annealing, SM = Simplex Method, PSO = Particle Swarm Optimization, B&B = Branch and Bound, BO = Bayesian Optimization, DE = Differential Evolution)

Problem Type | GD | GA | SA | SM | PSO | B&B | BO | DE
Convex Optimization | Excellent | Good | Good | Excellent | Good | Good | Good | Good
Non-convex Optimization | Poor | Excellent | Excellent | Poor | Good | Good | Excellent | Excellent
Combinatorial Optimization | Poor | Excellent | Good | Poor | Good | Excellent | Fair | Good
Linear Programming | Poor | Poor | Poor | Excellent | Poor | Good | Poor | Poor
Large-scale Problems | Good | Good | Fair | Poor | Good | Poor | Fair | Good

Algorithms and Optimizations in Machine Learning
Comparative Analysis - Computational Efficiency

Algorithm | Computational Efficiency
Gradient Descent | Efficient for small to medium-scale problems.
Genetic Algorithms | Computationally intensive, slower convergence.
Simulated Annealing | Moderate, depends on cooling schedule.
Simplex Method | Efficient for linear problems, not scalable.
Particle Swarm Optimization | Efficient, parallelizable.

Algorithms and Optimizations in Machine Learning
Hybrid Approaches
Hybrid approaches combine multiple algorithms to leverage their strengths and mitigate weaknesses, aiming for better performance, faster convergence, and improved solution quality.
Advantages of Hybrid Approaches:
1. Enhanced Exploration and Exploitation: Balances global exploration and local exploitation, reducing the risk of getting stuck in local minima.
2. Improved Convergence Rates: Combines fast initial convergence with precise fine-tuning, making the process quicker and more efficient.
3. Robustness: Solves a wide range of problems, including complex, non-differentiable, and noisy functions, and can be adapted to specific challenges.
4. Scalability: Efficiently handles large-scale problems and can be parallelized for high-dimensional tasks.

Algorithms and Optimizations in Machine Learning
Hybrid Approaches
1. Metaheuristic + Local Search, e.g. Genetic Algorithm + Gradient Descent:
Concept: Use Genetic Algorithms to explore the global search space and identify promising regions.
Then, apply Gradient Descent to those regions for precise local optimization. Application: This approach is often used in neural network training, where GA helps in selecting initial weights, and GD refines the model to minimize the loss function. 1. Differential Evolution + Tabu Search: Concept: Differential Evolution handles the global search, while Tabu Search ensures that the algorithm doesn't revisit recently explored solutions, thereby avoiding local traps. Application: Useful in combinatorial optimization problems, such as scheduling and routing, where a large solution space needs to be efficiently explored. Algorithms and Optimizations in Machine Learning Practical Considerations - Implementation Tips 1. Choice of Algorithm: Problem Nature: Choose an algorithm based on the nature of the problem (e.g., convex vs. non-convex, continuous vs. discrete). ○ Convex Problems: Use deterministic methods like Gradient Descent. ○ Non-Convex Problems: Consider stochastic methods like Genetic Algorithms or Simulated Annealing. ○ High-Dimensional Problems: Bayesian Optimization is effective but may require dimensionality reduction. Objective Function: If the objective function is non-differentiable, gradient-based methods may not work. Opt for Genetic Algorithms, Differential Evolution, or other heuristic methods. Resource Constraints: Consider the computational cost and memory requirements. Simplex and Branch and Bound methods can be expensive, while Particle Swarm Optimization and Differential Evolution are generally more scalable. Exploration vs. Exploitation: Stochastic methods like Genetic Algorithms and Particle Swarm Optimization are better for exploration, whereas deterministic methods are better for exploitation. 2. Parameter tuning: Initial Parameters: Start with default or widely recommended parameters but be prepared to tune them based on your specific problem. Learning Rate (Gradient Descent): A low learning rate may lead to slow convergence, while a high rate may cause overshooting. Mutation Rate (Genetic Algorithms): A high mutation rate increases exploration but may disrupt good solutions, whereas a low rate might cause premature convergence. Cross-Validation: Use cross-validation to tune parameters and avoid overfitting. This helps in balancing the bias-variance tradeoff. Automated Tuning: Consider using libraries like Hyperopt or Optuna for automatic parameter optimization. Algorithms and Optimizations in Machine Learning Practical Considerations - Tools and Libraries 1. DEAP (Distributed Evolutionary Algorithms in Python): Use Cases: Implementing genetic algorithms, genetic programming, and other evolutionary algorithms. Key Features: Easy to use, highly flexible, supports parallelization, customizable operators. Installation: pip install deap Documentation: DEAP Documentation 1. PyGAD: Use Cases: Building Genetic Algorithms in Python with customizable options. Key Features: Supports Keras/PyTorch integration, allows custom fitness functions, supports parallel processing. Installation: pip install pygad Documentation: PyGAD Documentation 1. Optuna: Use Cases: Automated hyperparameter optimization, particularly in deep learning and machine learning models. Key Features: Efficient sampling strategies, dynamic pruning, integrates with popular libraries like TensorFlow, PyTorch, and Scikit-Learn. 
Installation: pip install optuna
Documentation: Optuna Documentation

Algorithms and Optimizations in Machine Learning
Q&A Session

Algorithms and Optimizations in Machine Learning
References, Further Reading and Resources
1. Probabilistic Machine Learning: An Introduction - Kevin Patrick Murphy.
2. Algorithms for Optimization (sections 9.2 and 20.2) - Mykel J. Kochenderfer and Tim A. Wheeler.
3. MIT OCW - https://youtu.be/kHyNqSnzP8Y?si=-ZyZ2Mn53Clu3Z0r
4. Shapiro J. (2001) Genetic Algorithms in Machine Learning. In: Paliouras G., Karkaletsis V., Spyropoulos C.D. (eds) Machine Learning and Its Applications. ACAI 1999. Lecture Notes in Computer Science, vol 2049. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44673-7_7
5. http://www2.econ.iastate.edu/tesfatsi/holland.gaintro.htm
6. Samir Roy and Udit Chakraborty, "Introduction to Soft Computing: Neuro-Fuzzy and Genetic Algorithms", Pearson.
7. https://www.ac.tuwien.ac.at/files/pub/raidl-00.pdf

THANK YOU
Bhaskarjyoti Das
Department of Computer Science and Engineering in AI & ML
[email protected]

Algorithms and Optimizations In Machine Learning
Constrained Optimization - 1
Bhaskarjyoti Das
Department of Computer Science and Engineering in AI & ML
Teaching Assistant: Prerana Sanjay Kulkarni (7th Sem)

ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING
Constrained Optimization: Introduction
Bhaskarjyoti Das
Department of Computer Science and Engineering in AI & ML

Algorithms and Optimizations in Machine Learning
What is Constrained Optimization?
Constrained optimization is a technique used to find the optimal solution to a problem subject to a set of constraints.
Goal: to maximize or minimize an objective function while satisfying a set of constraints.

Algorithms and Optimizations in Machine Learning
Support Vector Machine (Vladimir N. Vapnik, 1963)
A Support Vector Machine (SVM) is a supervised machine learning model used for classification tasks.
To understand SVM, we first see what a hyperplane is. Just as a plane is a flat, 2-D surface, a hyperplane is a flat hypersurface in n-D (n-dimensional) space.
More information: An n-D space contains hyperplanes of (n-1) dimensions.

Algorithms and Optimizations in Machine Learning
Separating Hyperplane
A hyperplane separates data points into different classes. It is a decision boundary for linearly separable data.
There may be multiple hyperplanes that can separate the data points into different classes. Which one do we pick? We pick the hyperplane with the greatest margin.

Algorithms and Optimizations in Machine Learning
Margins
A margin is the distance between the decision boundary (hyperplane) and the closest data points from each class. The goal of SVM is to maximize this margin while minimizing classification errors.
A larger margin indicates a greater degree of confidence in the classification, as it means there is a larger gap between the decision boundary and the closest data points from each class.
The margin is a measure of how well-separated the classes are in feature space. SVMs are designed to find the hyperplane that maximizes this margin, which is why they are sometimes referred to as maximum-margin classifiers.

Algorithms and Optimizations in Machine Learning
Support Vectors
Support Vectors are the data points that lie closest to the decision boundary (hyperplane) in a Support Vector Machine (SVM).
(Support vectors are said to lie on the gutters, i.e. on the margin boundaries.)
These data points are important because they determine the position and orientation of the hyperplane, and thus have a significant impact on the classification accuracy of the SVM. In fact, SVMs are named after these support vectors because they "support", or define, the decision boundary.
The support vectors are used to calculate the margin, which is the distance between the hyperplane and the closest data points from each class.

Algorithms and Optimizations in Machine Learning
Support Vectors and Margins
You can think of the hyperplane/decision boundary as the median line of a street, where we are trying to create the widest possible street such that no data samples lie on the street.
Finding the optimal hyperplane is an optimization problem (we need to maximize the margin), and it is solved using Lagrange Multipliers, as we will see later.

ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING
Defining SVM: Primal Form
Bhaskarjyoti Das
Department of Computer Science and Engineering in AI & ML

Algorithms and Optimizations in Machine Learning
The Primal and Dual Forms
SVM is defined in two ways - the Primal Formulation and the Dual Formulation. Both reach the same optimization result, but how they get there is very different.
The Primal Form is preferred when we don't need to apply any kernel trick to the data, and the dataset is large but the dimension of each data point is small.
The Dual Form is preferred when the data has a huge dimension and we must apply some kernel trick. (We will come to kernel tricks and what they are used for later.)

Algorithms and Optimizations in Machine Learning
The Primal Form
The Primal Form aims to find the optimal hyperplane that maximizes the margin between two classes.
The distance between the gutters gives the width of the margin. Consider samples x+ and x- that lie on the gutters. The component of (x+ - x-) along the normal to the median (the hyperplane) gives the perpendicular distance between the gutters.

Algorithms and Optimizations in Machine Learning
The Primal Form
To get the widest margin, we need to maximize 2 / ||w||, which is equivalent to minimizing (1/2)||w||^2.
** Note: These simplifications are done for mathematical convenience during the optimization process.

Algorithms and Optimizations in Machine Learning
The Primal Form: Formulation
Minimize (1/2)||w||^2 over w and b, subject to y_i (w · x_i + b) >= 1 for every training sample (x_i, y_i).

Algorithms and Optimizations in Machine Learning
The Primal Form: Decision Rule
The decision rule can be written as: classify a new sample u as positive if w · u + b >= 0, and as negative otherwise.

Algorithms and Optimizations in Machine Learning
The Primal Form: Non-Separable Problems
What if the classes are non-separable (no separating hyperplane exists)? Then our optimization problem does not have a solution and we need to modify it.
Our solution is going to be to make each constraint soft, by introducing slack variables, which allow the constraint to be violated.
** θ and θ0 represent w(weights) and b(biases) respectively Algorithms and Optimizations in Machine Learning The Primal Form: Non-Separable Problems In the optimization problem, we assign a penalty C to these slack variables to obtain: This optimization problem can be converted to an unconstrained form: Where the notation (x)+ = max(x, 0) (The working to convert this optimization problem to its unconstrained form can be found at: https://kuleshov-group.github.io/aml-book/contents/lecture12-suppor-vector- machines.html?highlight=hinge#towards-an-unconstrained-objective Algorithms and Optimizations in Machine Learning Hinge Loss In the previous formula, the Hinge Loss penalizes incorrect predictions, and the L2 regularizer ensures weights are small and well-behaved. The hinge loss function penalizes misclassified points as well as correctly classified points that are within a margin (i.e., too close to the decision boundary). Specifically, if the model predicts a value f(x) that correctly classifies the data point but does not satisfy the margin condition y⋅f(x)≥1, the loss will still be positive. This encourages the model to find a decision boundary with a wider margin. The Hinge Loss for a single training example is given by: Here, y is the true label of the example (+1 or -1), and f(x) is the model’s predicted output. ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING Defining SVM: Dual Form Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML Algorithms and Optimizations in Machine Learning The Dual Form The Primal Form is not always convenient to solve, especially for high-dimensional data. By converting it to the Dual Form, we can handle these cases more effectively. To find the extremum of a function (with constraints), we use Lagrange Multipliers. Algorithms and Optimizations in Machine Learning The Dual Form We differentiate the Lagrangian with respect to w and b: Algorithms and Optimizations in Machine Learning The Dual Form Plugging in (a) and (b) back into the Lagrangian equation, we get the following equation, which is the final equation to be maximized (without worrying about the constraints). From this, we discover that the optimization only depends on the dot products of pairs of samples. Algorithms and Optimizations in Machine Learning The Dual Form: Decision Rule The decision rule can be written as Algorithms and Optimizations in Machine Learning The Dual Form: Non-Separable Problems Our dual problem assumes that a separating hyperplane exists. If it doesn’t, our optimization problem does not have a solution, and we need to modify it. Our approach is going to be to make each constraint soft, by introducing slack variables, which allow the constraint to be violated. Here, θ refers to w(weights) and θ0 refers to b(biases). In the optimization problem, we assign a penalty C to these slack variables to obtain the following (primal) problem: Algorithms and Optimizations in Machine Learning The Dual Form: Non-Separable Problems To find its dual, we find the Lagrangian The dual objective equals: We solve for optimal θ and θ0 in closed form and plug back the resulting values into the objective. We can then show that the dual takes the following form: Algorithms and Optimizations in Machine Learning Coordinate Descent Coordinate descent is a general way to optimize functions f(x) of multiple variables x∈Rd. 
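To make the coordinate-descent idea concrete (optimize one variable at a time while holding the rest fixed), the sketch below minimizes a simple quadratic f(x) = 0.5 * x^T A x - b^T x by exact minimization along each coordinate in turn; the matrix A and vector b are made-up illustrative values. This is a generic illustration of coordinate descent, not SMO itself: SMO applies the same idea to the SVM dual, but updates two multipliers at a time so the dual's equality constraint on the alphas stays satisfied.

```python
import numpy as np

# Minimize f(x) = 0.5 * x^T A x - b^T x  (A symmetric positive definite).
A = np.array([[3.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, -2.0, 3.0])

x = np.zeros(3)
for sweep in range(50):                 # repeat full passes over the coordinates
    for i in range(len(x)):
        # Exact minimization over coordinate i with the others held fixed:
        # df/dx_i = A[i] @ x - b[i] = 0  =>  x_i = (b_i - sum_{j != i} A_ij x_j) / A_ii
        x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]

print("coordinate descent solution:", x)
print("direct solve:               ", np.linalg.solve(A, b))
```

Each inner update is a one-dimensional problem solved in closed form, which is what makes coordinate descent (and SMO's two-variable version of it) cheap per step.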
Sample visualization: Algorithms and Optimizations in Machine Learning Coordinate Descent and SMO The dual can be solved by a form of Coordinate Descent: A simple and efficient optimization algorithm used with this is the SMO (Sequential Minimal Optimization), which is used as follows: Algorithms and Optimizations in Machine Learning References https://www.youtube.com/watch?v=_PwhiWxHK8o&pp=ygUScHJpbWFsIH Byb2JsZW0gc3Zt https://medium.com/low-code-for-advanced-data-science/support-vector- machines-svm-an-intuitive-explanation-b084d6238106 https://kuleshov-group.github.io/aml-book/contents/lecture12-suppor- vector-machines.html#classification-margins https://kuleshov-group.github.io/aml-book/contents/lecture13-svm- dual.html https://kuleshov-group.github.io/aml-book/contents/lecture14- kernels.html https://medium.com/aiguys/the-optimization-behind-svm-primal-and- dual-form-5cca1b052f45 THANK YOU Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML [email protected] Algorithms and Optimizations In Machine Learning Constrained Optimization - 2 Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML Teaching Assistant: Prerana Sanjay Kulkarni (7th Sem) ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING Linear and Non-Linear Solutions Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML Algorithms and Optimizations in Machine Learning What are Linear Solutions? In the context of Support Vector Machines (SVMs), the difference between linear and non-linear solutions lies in how they handle data separation in the feature space. In the following figures, figure A shows linearly separable data, and figure B shows linearly inseparable data. Algorithms and Optimizations in Machine Learning What are Linear Solutions? A linear SVM finds a hyperplane (in 2D, a line) that best separates the data into classes. It assumes that the data is linearly separable, meaning that a straight line (or hyperplane) can divide the classes with a maximum margin. Algorithms and Optimizations in Machine Learning What if the Data is Linearly Inseparable? Problem: What happens when data is linearly inseparable? Solution: We use kernels. Kernels make machine learning algorithms suitable for dealing with non-linear data. Non-linear SVMs are used when the data cannot be separated by a straight line or hyperplane in the original feature space. In such cases, SVM uses a kernel trick to map the data into a higher-dimensional space where a linear separation is possible. ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING Kernels Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML Algorithms and Optimizations in Machine Learning Kernel Tricks The kernel trick allows the SVM to compute the separation in the higher- dimensional space without explicitly transforming the data. Popular kernels include polynomial, radial basis function (RBF), and sigmoid. These kernels enable SVM to handle complex decision boundaries that are not linear in the original space. Algorithms and Optimizations in Machine Learning Using Kernels in SVM 1. Select Kernel Function Example: The Radial Basis Function (RBF) Algorithms and Optimizations in Machine Learning Using Kernels in SVM 2. Compute the Kernel Matrix 3. Implicitly transform data (map data to higher-dimensional space) Algorithms and Optimizations in Machine Learning Using Kernels in SVM 4. Solve Quadratic Optimization (maximize the margin using the dual problem) 5. 
Classify the new data Decision Rule: Algorithms and Optimizations in Machine Learning How do we pick the right Kernel Function? 1. Linear kernel: If the data can be well-separated by a linear decision boundary, a linear kernel should be used. A linear kernel is the simplest and most computationally efficient kernel function, and it works well for low-dimensional datasets with a large number of features. 2. Polynomial kernel: If the data has polynomial features or contains interaction effects between the features, a polynomial kernel should be used. A polynomial kernel maps the input data to a higher-dimensional space using polynomial functions of the original features. 3. Radial basis function (RBF) kernel: If the data cannot be well-separated by a linear or polynomial decision boundary, an RBF kernel should be used. An RBF kernel is a popular choice for SVMs because it can capture complex nonlinear relationships in the data. However, choosing the appropriate value of the gamma hyperparameter is critical to prevent overfitting. 4. Sigmoid kernel: If the data has a sigmoidal shape or exhibits strong nonlinearities, a sigmoid kernel should be used. A sigmoid kernel maps the input data to a higher- dimensional space using sigmoidal functions of the original features. 5. Other kernels: There are several other types of kernel functions that can be used for SVMs, such as Laplacian kernel, ANOVA kernel, and Bessel kernel. These kernels are Algorithms and Optimizations in Machine Learning The Radial Basis Function (RBF) Kernel: Key Points The RBF kernel maps data points into a higher-dimensional space, enabling SVMs to handle non-linear relationships in the data. This is particularly useful when classes are not linearly separable in the original feature space. The RBF kernel is based on the Gaussian function, meaning it measures similarity between data points based on the Euclidean distance. The closer two points are, the higher their similarity. Algorithms and Optimizations in Machine Learning The Radial Basis Function (RBF) Kernel: Key Points (continued) The RBF kernel has a localized influence, meaning each support vector affects only the data points near it. This allows the SVM to create complex, curved decision boundaries that adapt to the structure of the data. The RBF kernel’s performance is sensitive to the choice of parameters, namely gamma and the regularization parameter, c. Gamma (γ) defines how far the influence of a single training example reaches. It's a parameter in the RBF kernel that controls the width of the Gaussian function. A larger γ makes the similarity decay faster, affecting the model's ability to generalize. 
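A minimal scikit-learn sketch of these points: an RBF-kernel SVM fitted on a toy, non-linearly-separable dataset, with gamma and C exposed as the two parameters the slide highlights. The dataset, the C value, and the grid of gamma values are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data that a linear SVM cannot separate well.
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Compare a linear kernel with RBF kernels of increasing gamma.
models = {
    "linear":           SVC(kernel="linear", C=1.0),
    "rbf, gamma=0.1":   SVC(kernel="rbf", gamma=0.1, C=1.0),
    "rbf, gamma=1.0":   SVC(kernel="rbf", gamma=1.0, C=1.0),
    "rbf, gamma=100.0": SVC(kernel="rbf", gamma=100.0, C=1.0),  # likely to overfit
}

for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    print(f"{name:16s} train acc = {clf.score(X_tr, y_tr):.2f}  "
          f"test acc = {clf.score(X_te, y_te):.2f}")
```

Typically the largest gamma fits the training set almost perfectly but scores worse on the test set, which is exactly the overfitting behaviour described above; in practice gamma and C are tuned together, for example with cross-validation or grid search.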
Algorithms and Optimizations in Machine Learning References You can get a basic idea on working with SVMs and Kernels using python libraries here: https://github.com/Narpear/AOML_TA_PSK_Code/tree/main/Using%20 Kernels%20with%20SVM Algorithms and Optimizations in Machine Learning References https://www.youtube.com/watch?v=_PwhiWxHK8o&pp=ygUScHJpb WFsIHByb2JsZW0gc3Zt https://medium.com/@Suraj_Yadav/what-is-kernel-trick-in-svm- interview-questions-related-to-kernel-trick-97674401c48d https://kuleshov-group.github.io/aml-book/contents/lecture14- kernels.html#the-kernel-trick-in-svms THANK YOU Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML [email protected] Algorithms and Optimizations In Machine Learning Occam’s Razor, Minimum Description Length, and Dropout in Neural Networks Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML Teaching Assistant: Prerana Sanjay Kulkarni (7th Sem) ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING Occam’s Razor Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML Algorithms and Optimizations in Machine Learning What is Occam’s Razor? Occam's razor, also known as ‘Lex Parsimoniae’ or the ‘Law of Briefness’, is a philosophical problem-solving principle that suggests the simplest explanation is usually the best one. **Razor: Shaving away the unnecessary assumptions while distinguishing between two theories Algorithms and Optimizations in Machine Learning Occam’s Razor Applied to Machine Learning Model selection involves choosing the most suitable model from a set of candidate models based on criteria such as accuracy, simplicity, and generalization capability to unseen data. In the context of Machine Learning (and model selection), Occam’s Razor can be interpreted as: If two models have the same performance on the validation/testing dataset, then select the simpler model because it is more likely to generalize well. (This reduces the chances of overfitting) Algorithms and Optimizations in Machine Learning Occam’s Razor Applied to Machine Learning (continued) Note: When choosing between two models, we can only say a simpler model is better if it’s generalization error is equal to or less than that of the more complex model. Practical considerations: While Occam’s Razor traditionally supports the choice of simpler models when they achieve equivalent or better generalization errors compared to more complex models, there are instances where opting for a simpler model is advantageous despite its potentially higher generalization error. In fact simpler models may provide the following advantages: Less memory usage. Faster inference times. Better explainability. Algorithms and Optimizations in Machine Learning Occam’s Razor Applied to Machine Learning (example) Example – To illustrate the practical application of Occam's Razor in choosing a model using numbers, let's consider a scenario involving three different machine learning models designed to predict house prices based on various features such as size, location, and age. We'll compare these models based on their complexity, memory usage, inference time, and explainability, which are key factors influenced by Occam's Razor. 
Model A: Complex Model Complexity: High (uses deep learning techniques like convolutional neural networks) Memory Usage: 10 GB RAM Inference Time: 5 seconds per prediction Explainability: Low (difficult to interpret why certain features affect the price) Algorithms and Optimizations in Machine Learning Occam’s Razor Applied to Machine Learning (example) Model B: Medium Complexity Model Complexity: Medium (uses random forests) Memory Usage: 3 GB RAM Inference Time: 2 seconds per prediction Explainability: Moderate (can explain feature importance but not individual predictions) Model C: Simplest Model Complexity: Low (uses linear regression) Memory Usage: 100 MB RAM Inference Time: 0.1 seconds per prediction Explainability: High (easy to understand how each feature contributes to the prediction) Algorithms and Optimizations in Machine Learning Occam’s Razor Applied to Machine Learning (example) Applying Occam's Razor: Generalization Error: Assume that all models have similar generalization errors, meaning they perform equally well on unseen data. Choosing Based on Occam's Razor: In this example, Occam's Razor guides us towards choosing Model C over Models A and B, even though it might have a slightly higher generalization error. The simplicity of Model C—reflected in its lower memory usage, faster inference times, and better explainability—makes it the most advantageous choice according to Occam's principle. ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING Minimum Description Length Bhaskarjyoti Das Department of Computer Science and Engineering in AI & ML Algorithms and Optimizations in Machine Learning What is the ‘Minimum Description Length’ Principle? The Minimum Description Length (MDL) Principle in machine learning, states that the best description of the data is given by the model which compresses it the best. In other words: Learning a model for the data or predicting it is about capturing the regularities in the data and any regularity in the data can be used to compress it. Thus, the more we can compress a data, the more we have learnt about it and the better we can predict it. Algorithms and Optimizations in Machine Learning MDL and Occam’s Razor MDL is also connected to Occam’s Razor used in machine learning which states that “other things being equal, a simpler explanation is better than a more complex one.” In MDL, the simplicity (or rather complexity) of a model is interpreted as the length of the code obtained when that model is used to compress the data. Algorithms and Optimizations in Machine Learning MDL Applied to Machine Learning (example) Example – Imagine we have a dataset of daily weather observations for a year, where each day is marked as either "sunny" or "rainy". We want to create a model to predict the weather. Model 1 – If it's sunny today and was rainy yesterday and sunny the day before, then tomorrow will be rainy, unless it's a prime-numbered day of the month, in which case it will be sunny, except if it's August, then – (This model might fit the training data perfectly, but it's very complex and likely overfitting.) Model 2 – "If it's sunny today, there's a 70% chance it will be sunny tomorrow. If it's rainy today, there's a 60% chance it will be rainy tomorrow.“ Algorithms and Optimizations in Machine Learning MDL Applied to Machine Learning (example) Applying MDL: Data Compression: Model 1 might require a long code to describe all its rules and exceptions. Model 2 can be described much more concisely. 
Prediction Accuracy: Model 1 might perform perfectly on the training data but poorly on new data. Model 2, while not perfect, is likely to generalize better to new data.

Algorithms and Optimizations in Machine Learning
MDL Applied to Machine Learning (example)
MDL Criterion: Total description length = length of the model description + length of the data when compressed using the model.
Model 1: Long model description + very short compressed data description.
Model 2: Short model description + slightly longer compressed data description.
MDL would likely prefer Model 2 because its total description length is shorter.
Connection to Occam's Razor: Model 2 is simpler and aligns with Occam's Razor. It captures the essential pattern in the data without overfitting to noise or rare exceptions.

Algorithms and Optimizations in Machine Learning
MDL Applied to Machine Learning (example)
In this example, the MDL principle would favor Model 2 because it provides a good balance between model simplicity and data fit. It compresses the data reasonably well (captures the main pattern) while being simple to describe (low model complexity). This demonstrates how MDL embodies Occam's Razor by preferring the simplest explanation that adequately describes the data, which often leads to better generalization and prediction on new, unseen data.

ALGORITHMS AND OPTIMIZATIONS IN MACHINE LEARNING
Regularization in Neural Networks
Bhaskarjyoti Das
Department of Computer Science and Engineering in AI & ML

Algorithms and Optimizations in Machine Learning
How is Regularization Performed in Neural Networks?
Overfitting is a problem in neural networks, and there are several regularization methods to avoid it. These include batch normalization, using validation data, L1/L2 regularization, weight decay, noise injection, label smoothing, dropout, etc.
Now, we see 'dropout'. (More on overfitting and other methods to avoid overfitting in the next slide deck.)

Algorithms and Optimizations in Machine Learning
What is Dropout?
Dropout is a regularization technique in neural networks that randomly drops (sets to zero) a fraction of the neuron activations during training, helping to prevent overfitting and improve generalization by forcing the network to learn redundant representations.
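A minimal PyTorch sketch of dropout as described above; the layer sizes, dropout probability p = 0.5, and random input are illustrative. The two things to notice are that nn.Dropout zeroes a random fraction of activations only in training mode, and that model.eval() disables it for inference.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small fully connected network with dropout after the hidden layer.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zero 50% of the activations during training
    nn.Linear(64, 2),
)

x = torch.randn(4, 20)

model.train()                 # dropout active: repeated passes differ
out1, out2 = model(x), model(x)
print("training mode, outputs differ:", not torch.allclose(out1, out2))

model.eval()                  # dropout disabled: deterministic inference
with torch.no_grad():
    out3, out4 = model(x), model(x)
print("eval mode, outputs identical:", torch.allclose(out3, out4))
```

PyTorch implements "inverted" dropout: during training the surviving activations are scaled by 1/(1-p), so no rescaling is needed at test time.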