Midterm Notes 2
Summary
These notes cover pre-midterm concepts and a summary of machine learning models, such as parametric and non-parametric models, decision trees, K-means, MDS, Isomap, t-SNE, UMAP, linear regression, logistic regression, and SVM. The notes also discuss how to choose the right algorithm based on the amount of data, and the differences in training and prediction speed between different models. The concepts of regularization (L1 and L2) are also explained within this document.
Midterm Notes 2: Pre-Midterm Concepts & Overall Summary

1. We want to estimate our model's error when trained on N training samples, using a test set with M samples. True or false:
a) With different sets of M samples, we would probably get the same error measurement.
False. There would be some variance in the error measurement, since we'd be measuring on a different set of samples. As an extreme example, if M = 1, then for some test sets the average error would be 0 and for others it would be 1.
b) If we increase M, we should get a more accurate estimate (i.e. one that has lower variance across different test sets of size M).
True. Increasing the sample size decreases the error of the estimate of the mean. Specifically, var(mean(X)) = (1/M) var(X), where M is the sample size used to estimate the mean.
c) If we increase N but do not change the test set, we'd expect the test error to be unchanged.
False. Test error should go down, because we can fit parameters better with more training samples (or have more samples to represent the full distribution, in the case of KNN).
d) The expected error does not depend on M, but it does depend on N.
True.

Parametric model: dimensionality is constant with respect to the dataset (you know the form of the distribution).
Non-parametric model: the function uses the entire training dataset to make predictions, and the complexity of the model grows with the data. Non-parametric models have the advantage of not losing any information at training time, but they are computationally expensive and may overfit the training dataset.

Decision tree = axis-aligned boundaries. For predicting continuous variables, use a regression tree.

How To Decide Which Algorithm to Use
One factor is how much data you have.

K in KNN
Smaller values of K: increased risk of overfitting as K decreases.
Larger values of K: the regions assigned in feature space become less fragmented and the decision boundaries smoother; bias will increase (the model takes more neighbors into account when making a prediction, leading to a more generalized, less flexible model), but variance and generalization error will decrease.

Assume that all of the data samples within X_train and X_test come from the same distribution and are independent of each other (i.i.d. = independent and identically distributed). That means x_0 does not tell us anything about x_1 if we already know the sampling distribution D.
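To make points (a) and (b) concrete, here is a quick numpy simulation (not from the notes; the 0.3 per-sample error rate is an arbitrary illustration): the variance of the estimated error shrinks roughly as 1/M, matching var(mean(X)) = (1/M) var(X).

```python
import numpy as np

rng = np.random.default_rng(0)
true_error = 0.3  # assumed per-sample error rate, for illustration only

for M in [1, 10, 100, 1000]:
    # each trial: average 0/1 errors over a test set of M samples
    estimates = rng.binomial(1, true_error, size=(5000, M)).mean(axis=1)
    # empirical variance vs. the predicted true_error*(1-true_error)/M
    print(M, estimates.var(), true_error * (1 - true_error) / M)
```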
Unsupervised Learning

K-means
Basic idea: assign each point to the nearest of the K established centers.
Objective: minimize the sum of squared distances between each point and its nearest cluster center.
Guaranteed to always converge.
Guaranteed to reach a local optimum (the solution cannot be improved either by re-assigning points or by updating cluster centers).
If there is much more variance along one dimension than the others, that dimension will have more effect on the K-means clustering.
Can be viewed as a kind of compression, if you represent each data point by the id of its cluster.
Quality can be evaluated with RMSE (based on the distance of each point to its nearest cluster center) or purity (if labels are available). (A minimal sketch of the K-means loop appears after this section.)
Hierarchical K-means: iteratively cluster points into K groups (think vocabulary tree / hierarchical tree) for faster lookup.
Agglomerative clustering: iteratively merge the 2 most similar points/clusters; good for manifold structure.

Locality Sensitive Hashing (LSH)
Fast approximate search method; returns data points similar to a query.
Good for high-dimensional data.
LSH does not guarantee an exact match: we are not really concerned about retrieving the "closest point" to a query, but are happy to have "a sufficiently close" point.

Dimensionality Reduction
Compresses data, reduces noise, and preserves structure.
Linear projection (e.g. a street map projects everything onto the 2D ground dimensions and ignores height).
MDS: preserves Euclidean pairwise distances between any two points; preserves the global structure of the dataset.
Isomap: same as MDS, but distances are defined in a graph (compute an adjacency graph, e.g. 5-NN); the graph is not unique.
t-SNE: computes pairwise probability distributions for the data in the original dimension; preserves local structure rather than exact distances.
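A minimal numpy sketch of the K-means loop summarized above (assignment step, then center update, stopping at a local optimum); the function name and the initialization from random data points are illustrative choices, not from the notes.

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]  # init from the data
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # update step: move each center to the mean of its points
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(K)])
        if np.allclose(new, centers):  # converged to a local optimum
            break
        centers = new
    return centers, labels
```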
UMAP
Assumes data is uniformly distributed on an underlying manifold that is locally connected.
Incorporates K-NN structure to achieve good embeddings relatively quickly.
Goal is also to preserve local structure (like t-SNE), but it is faster because the K-NN graph avoids calculating all pairwise distances.

Linear Regression
Objective: minimize least squares (the sum of squared differences between predicted values and actual targets).
Use when you have few data points and strong correlations between individual features and the target value.
Use when you have data for a trend over time and need to extrapolate.
Very fast; a regression function is better for explaining a relationship.
Training can incorporate L2 or L1 regularization to prevent noisy features from having too much influence.

KNN vs Logistic Regression
KNN is faster to train.
Logistic regression is faster to predict: once trained, it makes predictions by evaluating a fixed number of learned parameters (weights), which is faster than KNN, which requires calculating distances.
Logistic regression requires less storage: it only stores the learned weights, while KNN stores the entire training dataset.
Logistic regression can assign different importance to different features (feature weights).
KNN can fit more complicated (non-linear) decision functions.
KNN has lower training error, because it memorizes the training data.

L1 (Lasso) Regularization
SPARSE: prefers to assign a lot of weight to the most useful features (the L1 penalty contributes a constant-magnitude gradient: ±1 depending on the sign of the weight, 0 at 0).
Sets the weights in w to 0 for the least influential variables (known as a sparse solution, a kind of feature selection).
For L1/Lasso, only the most independently useful features keep non-zero weight, which means it can be used to select among similar features.

L2 (Ridge) Regularization
Prefers to assign small weights to everything.
L2 is straightforward to implement, but L1 usually requires specialized optimization algorithms because of its non-differentiability at 0.
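A small scikit-learn comparison of the two penalties (assuming sklearn is available; the data and alpha values are made up for illustration): L1/Lasso drives most coefficients exactly to 0, while L2/Ridge shrinks all of them but leaves them non-zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# only the first two features actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

print(Lasso(alpha=0.1).fit(X, y).coef_)  # L1: most weights exactly 0 (sparse)
print(Ridge(alpha=0.1).fit(X, y).coef_)  # L2: small but non-zero weights everywhere
```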
Logistic Regression Model
Used to predict a binary outcome (e.g. yes/no, 1/-1); an alternative way of thinking about this is as a 2-class classification problem handled with an SVM (support vector machine).
Based on the Bernoulli random variable (coin-toss probability).
Maximize the expected likelihood of the true label given the data (probabilistic, considering the likelihood of the entire dataset).
Maximize confidence in the correct label (maximize the likelihood of correctly classifying points, considering all data points near the decision boundary).
Every example contributes to the loss.
When data are linearly separable (i.e. y = 1 if w^T x + b > 0), logistic regression will behave badly.
◦ Linearly separable: a line (or hyperplane) in feature space can split the two labels.

Linear Logistic Regression
Maximize the likelihood of the target labels given the features.
Used for classification: predicts the probability that an input belongs to one of two classes.
Binary variable instead of a continuous value.
Minimize the negative log-likelihood.
The C parameter controls regularization.
The logistic function (aka sigmoid function) maps the linear combination of features (w^T x + b) to a probability between 0 and 1 (the probability that y = 1 given x).
The logit function transforms the relationship into a linear one between the features and the log-odds, allowing logistic regression to predict probabilities for binary outcomes using linear combinations of the input features.
Don't use when features are low-dimensional (a linear function may not be expressive enough).

SVM
Classification technique; good for high-dimensional features.
Maximizes the margin between the decision boundary and the nearest data points from either class.
Decision boundaries depend on the support vectors; this means SVM does not depend on the whole training set, just the support vectors.
Linear SVM and linear logistic regression have the same classifier form, but will find different parameters due to differing objective/loss functions.
Remember: a linear SVM is like a weighted nearest neighbor with dot-product similarity.
Kernel functions are used in non-linear SVMs; they enable SVMs to find complex, non-linear decision boundaries without explicitly transforming the data into a higher-dimensional space.
The hinge loss function is used for both linear and non-linear SVM, but the non-linear case also includes a kernel.
Primal formulation of SVM: used when the data points of the classes are linearly separable.
Dual formulation of SVM: used when the data points are not linearly separable.
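Both linear logistic regression and the linear SVM score a point with the same linear form w^T x + b; here is a tiny numpy sketch (the weights and the sample are made-up values) showing how logistic regression turns the score into a probability while the SVM applies hinge loss to the margin.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([2.0, -1.0]), 0.5   # illustrative "learned" parameters
x, y = np.array([1.0, 1.0]), 1      # one sample, label y in {-1, +1}

score = w @ x + b                   # same linear form for both models
p = sigmoid(score)                  # logistic regression: P(y = 1 | x)
hinge = max(0.0, 1 - y * score)     # SVM: zero once the margin is >= 1
print(score, p, hinge)              # 1.5, ~0.82, 0.0
```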
Bayes Rule
Expresses the conditional probability of x given y in terms of the conditional probability of y given x, the probability of x, and the prior probability of y.

Naive Bayes Model
In Naive Bayes, you're trying to find P(y | x_1, x_2, ..., x_n), the probability of a class label y given features x_1, x_2, ..., x_n.
Simplifies the computation of this posterior probability by assuming the features are independent given the label.
Probabilistic classifier; "naive" because it assumes the features have little/no correlation with each other.
Class-conditional probabilities + prior probabilities.
Highly efficient learning and prediction; generalization performance may be worse than more sophisticated learning methods.
Log transformation: to simplify the computation when multiplying small probabilities (if you keep multiplying, the value may become too small or even 0; avoid numerical instability), we take the log of the expression, which converts products into sums of log probabilities instead (see the sketch after this section).
When not to use:
◦ Sufficient data => logistic or linear regression will usually work better (more flexible / fewer assumptions than Naive Bayes).
◦ Does not provide a good confidence estimate, because it "overcounts" the influence of dependent variables.

E-M (Expectation Maximization)
Widely applicable, iterative method for finding maximum likelihood estimates when the data has missing or hidden components.
Good for unlabeled clustering; can be used to fit Gaussian mixture models.
Use cases: the bad-annotators problem, foreground/background segmentation (without knowing what the foreground looks like in advance), topic models (find "latent"/hidden topics in a set of documents).
Three main steps:
◦ Initialize model parameters (mean, standard deviation, prior probability).
◦ E-step (expectation): estimate the expected values of the hidden variables (e.g. which cluster a point belongs to) and the log-likelihood, based on the current parameters.
◦ M-step (maximization): maximize the expected log-likelihood by updating the model parameters (e.g. means and variances).
▪ Another way of thinking about it: estimate new parameters for each model, weighted by likelihood.
◦ These two steps are repeated until the model parameters converge.

Hard EM (e.g. K-means)
Instead of calculating a distribution over the latent variables z, we choose the most likely z* at each step. K-means is an example: in K-means you assign each data point to the nearest cluster center.
Advantages:
◦ Simpler: can be applied when we cannot derive EM.
◦ Sometimes works better if you want to make hard predictions at the end.
BUT:
◦ Generally not as accurate as full EM, especially for estimating probability density function (pdf) parameters.
◦ This is because it discards uncertainty, whereas regular EM takes the uncertainty in the latent variables into account through a probability distribution.

Probability Density Function (PDF) Estimation
Estimating P(x) for some observation x:
◦ Classification: want P(x | y).
◦ Outlier/anomaly detection: want to know if P(x) is small (which would mean x is an anomaly/rare event).
◦ Prediction: estimate P(x) to predict the most likely outcome/action based on observation x.
◦ Model dependencies and mutual information:
▪ In ML, dependencies between variables are modeled, and PDFs help capture these relationships.
▪ Mutual information measures how much knowing one variable reduces uncertainty about another.
For discrete variables, estimating a PDF is simple because it is just counting the occurrences of each outcome. Continuous variables are trickier.
Common distributions: Gaussian, Exponential, Beta.

Methods to estimate PDFs: summary of strengths and weaknesses
Parametric models: simple and effective when you know the form of the distribution, but limited when the assumption doesn't hold.
Semi-parametric models: more flexible, especially in capturing multimodal distributions, but can be inefficient in higher dimensions.
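A minimal numpy sketch of the log transformation mentioned above, scoring one observation under a toy Naive Bayes model with binary features (all probabilities here are made-up values):

```python
import numpy as np

# Assumed toy setup: binary features, two classes, class-conditional
# probabilities P(x_i = 1 | y) already estimated (e.g. by counting).
log_prior = np.log(np.array([0.6, 0.4]))   # log P(y) for classes 0 and 1
p_feat = np.array([[0.9, 0.2, 0.5],        # P(x_i = 1 | y = 0)
                   [0.1, 0.7, 0.5]])       # P(x_i = 1 | y = 1)

x = np.array([1, 0, 1])                    # one binary observation

# Sum of log-probabilities instead of a product of small numbers
log_lik = (x * np.log(p_feat) + (1 - x) * np.log(1 - p_feat)).sum(axis=1)
log_post = log_prior + log_lik             # unnormalized log P(y | x)
print(log_post.argmax())                   # predicted class
```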
Non-parametric models: extremely flexible and can model any distribution, but computationally intensive, especially for high-dimensional data.

Lesson 12: Decision Trees

Q1. If I flip a fair coin (50/50 chance of heads or tails), what is the entropy of the outcome, measured in bits? a) 0 b) 0.5 c) 1 d) -1
Ans: C

Q2. What does H(X|Y), the conditional entropy of X given Y, measure? a) The uncertainty of X given a particular value of Y b) How much information Y provides about X c) The uncertainty of X that is expected to remain if I know the value of Y d) None of these
Ans: C

Q3. (Restaurant example: which attribute gives the better split, Patrons or Type?)
Ans: Patrons. In the split based on "Type", the data is still split into subcategories (French, Italian, Thai, Burger), but this does not effectively separate green from red blocks. In the split based on "Patrons", the data is divided into "None", "Some" and "Full", and each subcategory is purer. Patrons does a better job of reducing the impurity of the groups. *REVIEW

Q4. What is the regression tree minimizing? a) Variance of X b) Variance of Y c) Conditional variance of X given the tree model d) Conditional variance of Y given X and the tree model
Ans: D *review question

The decision function given is a linear decision function, where x_n represents the input features, w is the weight vector and b is the bias term. Naive Bayes classifiers, particularly for binary features, can be expressed as a linear combination of the feature values in certain cases. The Naive Bayes decision rule, especially when features are conditionally independent and binary, results in a linear decision function when transformed into log-space. This form does not explicitly calculate w^T x_n + b in the same way as logistic regression or linear SVM, but the decision boundary it creates can be expressed in linear form when dealing with binary features.

1-NN does not build an explicit model during training. Instead, it simply stores the training examples. When a new training example is added, it can immediately be used in future classifications without requiring the model to be retrained: the classifier just includes the new example in its stored set and uses it when calculating distances to make predictions (see the sketch after this section). Naive Bayes with binary features is also very fast to update, as the probability counts can be updated efficiently. Other classifiers (e.g., logistic regression, SVM, decision trees) require some level of retraining or updating of parameters to incorporate new training data, which is more complex than simply adding a new point to a stored dataset as in 1-NN. *review question

In 1-Nearest Neighbor (1-NN), each training point is classified based on the nearest point, which will always be itself in the training set. Therefore, 1-NN will have zero training error, as every point in the training set is guaranteed to be classified correctly by itself. Decision trees can also achieve minimal error, especially if they are grown fully (unpruned): by splitting until each leaf node contains only a single class or reaches a very specific condition, a decision tree can perfectly classify the training data. However, this may lead to overfitting, which is why pruning or setting constraints (like max depth) is often used to prevent the tree from perfectly fitting the training data. Other classifiers may not achieve minimal error on the training set, because they are designed to find generalized patterns, not to exactly memorize the training data.
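A minimal sketch of the 1-NN behavior described above (the class name and methods are illustrative): "training" is just storing examples, a new example is usable immediately, and every stored point is its own nearest neighbor, so training error is zero.

```python
import numpy as np

class OneNN:
    """Stores the training set; 'training' is just memorization."""
    def __init__(self):
        self.X, self.y = [], []

    def add(self, x, label):           # a new example is usable immediately,
        self.X.append(np.asarray(x))   # no retraining needed
        self.y.append(label)

    def predict(self, x):
        d = [np.sum((xi - x) ** 2) for xi in self.X]
        return self.y[int(np.argmin(d))]

clf = OneNN()
clf.add([0, 0], "red"); clf.add([1, 1], "green")
print(clf.predict([0.9, 0.8]))  # "green"; on a stored point, 1-NN finds itself
```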
"Globally optimal" means whether the parameters can be found that minimize/maximize the objective given the model. E.g., the solution found for linear logistic regression will be the global optimal, even though there might be other models that achieve lower logistic loss. 1. Linear logistic regression: Logistic regression has a convex objective function, allowing optimization algorithms to find a globally optimal solution. 2. Linear SVM: Linear SVMs also have a convex objective function (the hinge loss with regularization), so they can achieve a globally optimal solution. 3. SVM with RBF Kernel: While SVM with an RBF kernel does involve a non-linear kernel, the optimization problem remains convex in the dual space, allowing for a globally optimal solution to the objective function.local 4. Naive Bayes Classifier on Binary Features: Naive Bayes does not rely on iterative optimization but instead calculates probabilities directly based on assumptions of feature independence. Given these assumptions, it effectively finds a globally optimal solution based on the likelihood of the observed data. 5. Decision Tree: Decision trees are prone to finding local optima due to their greedy, recursive splitting approach. They do not guarantee a globally optimal solution to their objective function. 1. In class we observed the following two graphs regarding the decision boundaries for decision trees (left) and 1-nearest neighbor (right). Each data point consists of values for two features x1 and x2. a) Why are the decision boundaries “parallel” to one of the axes in this case? Because, in this case, any classification decision is only with respect to one of the features x1 or x2. Decision trees split the data based on thresholds applied to individual features. Each decision node in the tree evaluates a condition in the form of x1≤ threshold or x_2 ≤threshold. These splits create boundaries that are parallel to the feature axes because the tree is only considering one feature at a time when making each split. For example: ◦ A split on x1will result in a vertical boundary on the x1-x2plane. ◦ A split on x2will result in a horizontal boundary. b) The displayed decision boundaries of the 1-NN are lines that are not parallel to either of the axes. In general, we can also have these type of “non-parallel” lines as boundaries for decision trees, how could this happen? The decision boundaries depend on how we split the features or attributes: for example, we can have a split (on a node) such as “a.x1 + b.x2 >c”, where a, b and c are some real values. In this case, we are describing a region which belongs to one side of a line which is not parallel to either axis when a and b are different than zero. Oblique Decision Trees: To create non-parallel decision boundaries, we would need to modify the way decision nodes split data. Instead of simple splits on individual features, an oblique decision tree allows splits based on linear combinations of features. For instance, a split condition might look like: ◦ a⋅x_1 +b⋅x_2 > c where a, b and c are real values (constants) This type of condition describes a linear boundary that separates the data based on the weighted sum of the features, rather than a single feature. The line represented by this inequality is defined by the equation: a⋅x1+b⋅x2=c If both a and b are non-zero, this line will not be parallel to either axis. It can have any orientation depending on the values of a and b. 
c) As a follow-up to the previous question: in a practical setting, what could be a possible disadvantage of using decision boundaries that are not "parallel" to the axes in decision trees?
A possible disadvantage arises during training of the decision tree: making splits that depend on just one of the attributes at each node is simpler than making a decision that combines multiple features. E.g., for the split a·x1 + b·x2 > c, the algorithm would have to infer the best values of a, b and c to obtain the most informative split. Also, as the decision boundary becomes more complicated, it is easier to overfit the training data (i.e. get higher test error, even though training error is smaller).

2. Can any type of decision tree always be expressed as a binary tree? Why?
Yes. Whenever a node produces a split with N > 2 different values or outgoing arrows, we can always reformulate it as a series of N - 1 nodes where: 1) the first N - 2 nodes each have one outgoing arrow pointing to a possible value and the other pointing to the next node, and 2) the last node has outgoing arrows to two different values.

3. An important part of the algorithm studied in class for building a node in a decision tree is to choose an attribute and then split the given data for the node based on the values of the attribute.
a) Do all attributes in a decision tree have to be of the same type (discrete/categorical or continuous)? Or is it possible for them to be of different types? In addition, provide an example to support your answer.
Decision trees can have features of different types. For example, in classifying which groups of dogs are more likely to get a specific disease, each dog in the data has a breed (categorical) and a weight (continuous) that can both be good predictors of its likelihood of getting sick.
b) How do we choose the "best" attribute depending on whether its values are discrete (i.e., categorical) or continuous?
For a discrete feature, we simply compute the information gain of the split defined by its different values. For a continuous feature, we need to choose a specific threshold value, based on which one gives the best information gain across a group of possible thresholds. Since we might have a mixed bag of discrete and continuous features, the information gains across all attributes are compared to each other, and the attribute with the highest information gain is chosen.

4. Why is it a good decision to use early stopping during the training of decision trees?
Because if the training process is not stopped sufficiently early, we may end up with every single data point inside a different decision region, i.e., a single data point in each leaf of the decision tree. As a result, we have overfitting.

5. Assume we have the following table, each row being a data point: Compute H(y), H(y|x1) and H(y|x2). If you need to choose whether to split this data based on x1 or x2, which feature should you choose according to the information gain criterion?
Step 1: Calculating H(y) (a generic sketch of the computation follows)
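Since the data table itself did not survive in the transcript, here is a generic numpy sketch of the quantities involved in Q5 (the function names are illustrative); the toy arrays at the end stand in for a table where x1 perfectly separates y:

```python
import numpy as np

def H(labels):
    """Entropy in bits of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def H_cond(labels, feature):
    """H(y | x): expected entropy of y after splitting on the feature values."""
    vals, counts = np.unique(feature, return_counts=True)
    w = counts / counts.sum()
    return float(sum(wi * H(labels[feature == v]) for wi, v in zip(w, vals)))

y  = np.array([1, 1, 0, 0])
x1 = np.array([0, 0, 1, 1])
# H(y) = 1 bit, H(y|x1) = 0, so information gain = 1: split on x1
print(H(y), H_cond(y, x1), H(y) - H_cond(y, x1))
```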
Recap of Classification and Regression

Classification
1. Nearest neighbor: widely used.
a. Can instantly learn new classes and predict from one or many examples.
2. Naive Bayes: represents a common assumption as part of density estimation; more typical as part of an approach than as the final predictor.
a. Fast estimation from lots of data.
b. Not-terrible estimation from limited data.

Regression
1. Logistic regression: widely used.
a. Effective prediction from high-dimensional features.
b. Good confidence estimates.
2. Linear regression: widely used.
a. Can extrapolate, explain relationships.
b. Predicts continuous values from many variables.
Almost all algorithms involve nearest neighbor, logistic regression, or linear regression.
◦ Main challenge: feature learning.

How Nearest Neighbor, Linear Models, and Decision Trees use features
1. Nearest neighbor: uses all features jointly to find similar examples.
2. Linear models: make predictions out of weighted sums of the features.
3. Decision trees: iteratively choose the attribute and split value that best separate the classes for the data in the current node.

Decision Tree Components
Internal nodes: test attributes.
Branching: determined by attribute value.
Leaf nodes: outputs (class assignments).

Decision Tree Algorithm
Training: recursively, for each node in the tree:
1. If the labels in the node are mixed:
a. Choose an attribute and split value based on the data that reaches the node, by measuring information gain:
i. For each discrete attribute: compute the information gain of the split (check each feature).
ii. For each continuous attribute: select the most informative threshold and compute its information gain (i.e. check attribute and threshold); this can be done efficiently based on sorted values.
iii. Select the attribute/threshold with the highest information gain.
b. Branch and create 2 (or more) child nodes.
2. Return.
Prediction:
1. Check conditions to descend the tree.
2. Return the label of the leaf node.
To decide what/where to split: use the counts at the leaves to define probability distributions, so we can measure uncertainty.

Quantifying Uncertainty
Entropy: a measure of the uncertainty associated with a random variable X.
H(X) = -Σ_x p(x) log2 p(x)
Higher entropy -> higher uncertainty; lower entropy -> lower uncertainty.
Entropy uses log base 2 because it measures the average number of bits needed to encode X.

Specific Conditional Entropy
H(X | Y = y) measures the uncertainty of X if Y is known to have a particular value y.

Conditional Entropy
Quantifies the amount of uncertainty in a random variable given knowledge of ANOTHER random variable:
H(X | Y) = Σ_y p(y) H(X | Y = y)
Used to assess how much information about the dependent (response) variable is gained by splitting the data on a particular feature.
Example: P(raining) = 25/100 = 1/4; P(not raining) = 75/100 = 3/4.
Properties of conditional entropy:
◦ H is always non-negative.
◦ Chain rule: H(X,Y) = H(X|Y) + H(Y) = H(Y|X) + H(X).
◦ If X and Y are independent, then X doesn't tell us anything about Y: H(Y|X) = H(Y).
◦ By knowing X, we can only decrease the uncertainty about Y: H(Y|X) ≤ H(Y).

(Quiz notes)
A common ML phenomenon: increasing model complexity can lead to higher test error due to over-fitting.
ANS: C, because the trees are independently trained.
ANS: A, because all features are used to train each tree.

Ensemble Models
Two main types of ensemble construction:
1. Independent sampling (averaging): BAG = different subsets, reduces variance = random forest.
2. Incremental training to fix the previous model's mistakes: BOOST = correct previous, reduces bias = boosting.
An ensemble combines many models into one and averages or sums their predictions.
◦ Often results in higher accuracy than a single model (by reducing bias or variance).
◦ E.g. averaging multiple "opinions" is better than a single friend's response.
Averaging multiple "weak" predictions is often more accurate than any single predictor.
◦ Combining models that may not be highly accurate on their own can produce a more reliable prediction (law of large numbers).
Two main types of ensemble construction:
1. Independent sampling:
a. Models are trained independently, often on different subsets of the data (e.g. bagging in random forests).
b. Reduces variance by averaging over independent predictions.
2. Incrementally training models to fix the previous model's mistakes:
a. Models are trained to focus on correcting the errors of previous models (e.g. boosting).
b. Reduces bias by incrementally adjusting for the mistakes made by prior models.

Bias-Variance Tradeoff
Notation: h(x): model trained on data; y_mean(x): expected value (ideal prediction); h_mean(x): expected model, averaged over all training data; x: data; y: true value.
Expected test error is made up of variance, noise and bias.
Variance:
◦ How much the model's predictions would vary if it were trained on different training sets. High variance indicates that the model is sensitive to fluctuations in the training data, often leading to overfitting.
◦ Cause: limited data. Different same-sized training sets can produce different models, causing predictions to fluctuate.
Noise:
◦ Irreducible error inherent in the data/problem itself, which cannot be eliminated regardless of the model's complexity. The model cannot capture inherent randomness or variability in the data.
Bias:
◦ Error due to approximating a real-world problem, which may be complex, with a simpler model. High bias indicates the model is too simplistic, failing to capture the underlying patterns, often leading to underfitting.
◦ Cause: loosely, the expected error when the optimal model is learned from infinite data.
The above decomposition is for regression, but the expected test error components are the same for classification error and logistic regression.
Overfitting = training error keeps dropping while validation error starts to rise.
Bias = error due to overly simplistic assumptions in the learning algorithm.
Variance = the model's sensitivity to the training data; a more complex model fits the training data more closely, and thus has higher variance and is more prone to overfitting.

Target Diagrams
1. Low bias, low variance: the predictions (blue dots) are tightly clustered around the target (center of the bullseye), indicating high accuracy and consistency.
2. Low bias, high variance: the predictions vary significantly, spreading around the target in different directions. This happens when the model is too complex, capturing noise in the training data, which leads to overfitting.
3. High bias, low variance: predictions are consistently off-target, but grouped together. This is typical of underfitting, where the model is too simple to capture the underlying pattern.
4. High bias, high variance: predictions are spread out and far from the target, indicating both underfitting (high bias) and overfitting (high variance).

Bias-Variance vs Model Complexity (graph)
As model complexity increases, test-set accuracy increases until some optimal point, then goes down as the model begins to overfit.
1. Bias (red) decreases as model complexity increases: complex models can capture more data patterns, reducing bias.
2. Variance (blue) increases as model complexity increases: a more complex model is sensitive to fluctuations in the data, which increases variance. This leads to overfitting, where the model performs well on training data but not on test data, as it is too flexible and fits the training data too closely.
Margin notes: model complexity ↑ -> training error ↓ and bias ↓ (more data patterns captured), but variance ↑ (more fluctuations) and generalization error ↑ because of the higher variance; test error ↓ while underfitting, then ↑ once overfitting begins.

How Ensembles Battle Bias and Variance (4 methods)
1. Bootstrapping: repeatedly sample with replacement.
2. Bagging: aggregate bootstrapping; targets overfitting and high variance by averaging many models trained on random subsets of the data (samples).
a. Parallelizable method for combining models (each tree in a random forest can be learned in parallel).
3. Boosting: reduce underfitting by combining models (weak learners) that correct each other's errors.
a. Inherently sequential way to combine models.
4. AdaBoost: adaptive boosting; reduces underfitting; re-weights instead of re-sampling; exponential loss.

Classifier Evaluation: Bootstrap Estimation
Works well with small datasets.
Repeatedly draw n samples from data D with replacement (i.e. each time a tuple is selected, it is equally likely to be selected again and re-added to the training set).
For each set of samples, estimate a statistic.
The bootstrap estimate is the mean of the individual estimates across all resampled datasets.
Used to estimate a statistic (parameter) and its variance.

Bagging (Bootstrap Aggregation)
Main idea: the random errors of any given model trained on a specific bootstrapped sample will "cancel out" if we combine many models trained on different bootstrap samples and average their outputs (the definition of ensembling).
Analogy: a diagnosis based on multiple doctors' majority vote.
Training:
◦ For i = 1..M, draw n* < n samples from D with replacement.
◦ Train classifier C_i on those samples.
Classification: to classify an unknown sample X,
◦ each of the M models classifies X; the final classifier returns the majority vote of C_1...C_M.
Prediction:
◦ To predict continuous variables, use the average prediction instead of a vote.
Increases classifier stability / reduces variance.

Random Forests (classification or regression)
Bagging with decision trees as the base models. (One way to instantiate this recipe with scikit-learn appears after this section.)
Train a collection of large trees (e.g. 100 trees; each tree: low bias, high variance). For each tree:
1. Data bagging: randomly sample some fraction/subset of the data with replacement (e.g. 90%).
2. Feature bagging: randomly sample some number of features (more diversity).
a. For regression: suggest (# features) / 3.
b. For classification: suggest sqrt(# features) or log2(# features).
3. Train a deep tree using only those samples/features.
4. Optional: get the validation error on held-out data.
Predict: average the predictions of all trees. Individual deep trees have high variance (different trees result when you resample the data); averaging them reduces variance while keeping bias low.
Random forests are not a good choice for very high-dimensional tasks (e.g. text classification) compared to fast, accurate linear models.
One decision tree -> prone to overfitting. Many decision trees -> more stable, better generalization.
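One way to instantiate the recipe above with scikit-learn (assuming sklearn is available; the dataset and hyperparameters are illustrative): bootstrap sampling gives the data bagging, and max_features="sqrt" gives the suggested feature bagging for classification.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
# 100 deep trees; each sees a bootstrap sample of the data and
# sqrt(n_features) candidate features per split, as suggested above
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0)
rf.fit(X, y)
print(rf.score(X, y))  # prediction is a majority vote over the trees
```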
Random Forest: Intuition
1. Each tree uses a different set of features to fully explain the training set.
a. Each tree is trained on a random subset of features from the training dataset, which introduces variability among trees and prevents any single feature from dominating the prediction process.
2. Each tree's prediction is based on a complex rule.
a. Each individual tree generates a prediction based on a unique set of splits (rules) derived from the subset of features it was trained on.
b. These rules may be overfitted to the training data, which can result in exceptions when the tree is applied to test data. However, when combined in a forest, the impact is minimized.
3. Averaging the predictions of many complex rules enables prediction based on complicated combinations of features.
a. A random forest combines predictions from all individual trees, usually by averaging (for regression) or majority voting (for classification).
b. This smooths out the overfitting of individual trees and allows the model to leverage complex combinations of features without depending too heavily on any one tree's prediction.
c. Reduces variance by averaging many tree predictions.
Summary: random forests reduce overfitting by creating a "forest" of diverse trees, each trained on different feature subsets and data samples. By aggregating the predictions, random forests improve accuracy, stability, and generalization, making them a powerful tool for both classification and regression tasks.

Terminology
1. Learner = hypothesis = classifier.
a. The terms are used interchangeably.
b. A learner or classifier refers to an individual model used to make predictions, while a hypothesis represents the decision rule or function that the model learns.
2. Weak learner: a classifier that achieves accuracy only slightly better than chance.

SVM Optimization
Hinge loss: if y·(w^T x) >= 1, the sample is correctly classified and outside the margin, so the loss is zero (NO PENALTY).
▪ For samples with y·(w^T x) >= 1 we are in the "flat" part of the hinge loss, and the gradient is just 0.
◦ Gradient descent is good for large-scale SVMs, as it optimizes the convex objective function in the primal form, allowing efficient training with large datasets, although it may approximate rather than exactly solve the SVM problem.
◦ BUT there is a high computational cost to calculating all the gradients, so SGD or coordinate descent is better for large datasets.

Pegasos: Primal Estimated sub-GrAdient SOlver for SVM
Type: Pegasos = (sub)gradient descent for SVM.
Efficiency: processes one sample at a time to optimize the SVM objective of minimizing the regularization term plus hinge loss.
◦ Allows rapid convergence and reduced computational load compared to batch gradient descent on large datasets.
◦ Efficient on large datasets because it avoids calculating the gradient over all samples, processing them one at a time instead.
Selective update: only samples that are misclassified or within the margin contribute to the gradient update, reducing unnecessary computation.
Pegasos algorithm note: the weight update is not sampled randomly from a uniform distribution, but computed from a random sample of data. Also, SGD does not proceed by checking whether an update decreases the loss; it just takes a step according to the loss gradient for that mini-batch.

SGD
Update weights one sample at a time; efficient optimization on large datasets.
Random sampling: each sample is chosen uniformly at random from the training set S.
Adaptive step size: the step size / learning rate decreases over time, ensuring updates become smaller as the algorithm progresses, leading to stable convergence.
Selective update: only samples that are misclassified or within the margin contribute to the gradient update, reducing unnecessary computation.
Pegasos with Mini-Batches
1. Random mini-batch selection: choose a mini-batch A_t of k examples uniformly at random from the training set S.
2. Identify margin violators: the subset A_t+ of examples in A_t that violate the margin.
3. Weight update from the regularization term and hinge loss: scale the weights by (1 - η_t λ), reducing their magnitude to avoid overfitting, then add the gradient contribution from each sample in A_t+, averaged by the batch size k. (A minimal sketch of this update appears after this section.)
Benefits over per-sample Pegasos:
1. Calculates the gradient over a subset (batch), chosen uniformly at random, to reduce variance.
2. Reduced gradient variance: mini-batch SGD makes it faster than the original Pegasos algorithm and reduces the variance of the gradient estimate, making convergence more stable and reducing the chance of oscillation.
3. Improved efficiency: by focusing only on margin-violating examples within the mini-batch, Pegasos avoids unnecessary calculations for correctly classified examples, making it computationally efficient.
4. Controlled regularization: the term (1 - η_t λ) w_t controls the growth of the weights, maintaining a balance between margin maximization and avoiding overfitting.

SGD Key Points
SGD can be applied to various loss functions by using the appropriate subgradient.
SGD is a very simple and effective optimization algorithm: step toward a better solution based on a small sample of training data.
◦ Not very sensitive to mini-batch size (but larger batches can be faster with GPU parallel processing).
◦ A larger training set makes it faster to obtain the same test performance.
◦ SGD (Pegasos) outperforms traditional SVM optimizers in terms of speed and convergence stability.
Effective on large datasets: scales well for large-scale, high-dimensional data. Pegasos provides a balance of fast training and low test error, making it a practical choice for large-scale SVM training.
Alternative optimizers:
◦ SDCA (Stochastic Dual Coordinate Ascent): a form of stochastic optimization that dynamically adjusts the learning rate, often used in sparse-data scenarios. SDCA also performs well, leveraging adaptive learning rates for further stability in certain cases.
◦ Pegasos and SDCA are the most efficient in terms of training time, making them suitable for large datasets with many features (e.g., CCAT and cov1).
◦ SVM-Perf achieves competitive test errors but is slower, especially on larger datasets.
◦ LASVM is the least efficient, with exceptionally high training times on large datasets like CCAT, making it less practical for large-scale applications.

Effect of mini-batch size on primal suboptimality (runtime & performance) of the Pegasos algorithm
Both plots show that:
◦ Larger mini-batches tend to improve convergence stability, but can lead to suboptimal results if too large.
◦ Smaller mini-batches might converge less effectively, due to high variance in the gradient estimates.
Optimal mini-batch size: there is an optimal mini-batch size that achieves the best primal suboptimality without excessive computational overhead.
Effect of kT: increasing kT (runtime) generally reduces primal suboptimality, especially for larger mini-batch sizes.
Trade-offs: very small mini-batches may lead to noisy gradients, while very large mini-batches may slow down convergence due to less frequent updates.
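A minimal numpy sketch of the mini-batch Pegasos update described above (hyperparameters illustrative): shrink w by (1 - η_t λ), then add the averaged hinge subgradient from the margin violators A_t+.

```python
import numpy as np

def pegasos_minibatch(X, y, lam=0.01, k=16, T=1000, seed=0):
    """Mini-batch Pegasos sketch; y must be in {-1, +1}, no bias term."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, T + 1):
        eta = 1.0 / (lam * t)                          # decreasing step size
        A = rng.choice(len(X), size=k, replace=False)  # mini-batch A_t
        viol = A[y[A] * (X[A] @ w) < 1]                # margin violators A_t+
        w = (1 - eta * lam) * w                        # shrink: regularization
        if len(viol):                                  # averaged hinge subgradient
            w += (eta / k) * (y[viol][:, None] * X[viol]).sum(axis=0)
    return w
```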
Effect of different sampling procedures on the performance of Pegasos, measured by primal suboptimality over epochs
◦ Same permutation each epoch (red line) is the most effective strategy for convergence, as it minimizes randomness in the updates while still allowing each data point to be seen in each epoch. This suggests that using a consistent order for each epoch improves stability and reduces noise in the updates.
◦ New permutation every epoch (blue) is also effective, as it avoids duplicate sampling within an epoch and reduces variability, though it may not be as stable as using a fixed order.
◦ Uniformly random sampling (black) is the least effective, as the high variability from sampling with replacement introduces noise, slowing convergence.

Perceptron
What it is: a building block of neural networks and ML classification tasks.
What it does: iteratively adjusts weights during training to minimize classification errors on the training data.
◦ Adjusts the weights based on the error.
◦ The learning rate controls the adjustment size.
Type: linear classifier / linear function used for binary classification.
◦ This means the data needs to be LINEARLY SEPARABLE.
Very similar to linear logistic regression, and can also use the sigmoid function, but the perceptron does not imply a particular error or training objective.
Prediction: weighted sum and thresholding.
◦ The perceptron computes a weighted sum of the inputs (linear), w·x + b, where w is the vector of weights, x is the vector of inputs, and b is the bias term.
◦ The sign function sgn is then applied to the sum, returning +1 if the result is positive and -1 if it is negative.
Binary output: indicates whether the weighted sum is above or below zero. This makes it a thresholded linear model.
Loss function: classically a hinge-type loss, but other losses such as MSE (mean squared error) or logistic loss can be used for different types of training. Training involves adjusting the weights to minimize the prediction error.

Perceptron Update Rule with MSE Loss (see the sketch after this section)
1. Prediction: compute a weighted sum of the inputs with their corresponding weights, plus a bias term b.
2. Error (loss) function: the squared difference between the predicted output f(x) and the target output y; larger errors are penalized more.
3. Weight update rule: to minimize the prediction error, adjust each weight w_i by taking a small step in the direction that decreases E(x).
a. Use the chain rule for the gradient computation.

Perceptron Optimization by SGD (MSE Loss)
In summary: leverage mini-batch SGD with MSE loss to iteratively update the weights, gradually reducing the prediction error, with a decaying learning rate for stability as training progresses.
1. Random initialization of the weights (e.g. from a Gaussian distribution).
2. Iterative optimization.
3. Learning-rate decay (η = 0.1/t): decreases over time as t increases; helps the model take smaller steps as it gets closer to the optimal solution, providing stability and preventing overshooting.
4. Batch processing: mini-batch gradient descent allows for faster updates.
5. Weight update rule (averaged over each batch).

Perceptron Update with Logistic Loss
In summary, when using logistic loss, the perceptron's update rule incorporates the probabilistic output of the sigmoid function, updating the weights in a way that maximizes the probability of correctly classifying each sample in the batch.
1. Logistic loss: negative log-likelihood; adjusts based on probability.
2. Sigmoid function: converts outputs into probabilities.
3. Update rule: modifies the weights to maximize the probability of correct classification.
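A minimal per-sample sketch of the MSE-loss perceptron update with the decaying learning rate η = 0.1/t from the notes (the notes describe a mini-batch version; this single-sample form is for brevity):

```python
import numpy as np

def train_perceptron_mse(X, y, epochs=20, seed=0):
    """Single-sample SGD on E = (f(x) - y)^2 / 2 with a linear unit."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])  # random (Gaussian) init
    b, t = 0.0, 1
    for _ in range(epochs):
        for i in rng.permutation(len(X)):  # samples in random order
            eta = 0.1 / t                  # decaying learning rate, 0.1/t
            f = X[i] @ w + b               # linear prediction
            err = f - y[i]                 # dE/df, by the chain rule
            w -= eta * err * X[i]          # dE/dw = err * x
            b -= eta * err
            t += 1
    return w, b                            # classify with sgn(w @ x + b)
```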
Summary
Three main types of gradient descent:
1. Batch gradient descent
2. Stochastic gradient descent (SGD)
3. Mini-batch gradient descent
Perceptron: linear classifier for binary classification.
Prediction: computes a weighted sum and applies a threshold function.
Loss function: hinge loss is most common, but MSE or logistic loss can also be used.
Perceptron with MSE loss: adjust weights to minimize MSE; chain rule for gradient optimization.
Perceptron SGD with MSE loss: mini-batch SGD with a decaying learning rate for iterative updates.
Perceptron with logistic loss: the negative log-likelihood adjusts based on probability, the sigmoid function converts outputs into probabilities, and the update rule modifies weights to maximize the probability of correct classification.

Lesson 15: MLPs and Backpropagation

(Quiz notes) (b) is true, but it is not why ReLU is better.
False: sigmoid activations are very non-linear. The problem is that the gradient is always less than 1 and often very small, so with many layers the gradient becomes negligible.

1. Indicate whether the following is true or false. If something is false, indicate what is wrong with the statement, justifying your response appropriately.
a) MLPs and decision trees are related in that they sequentially apply non-linear functions to predict. T
b) Linear activations cannot be used to reduce the range of the data values. False. A linear activation, by changing its weight gain, can scale the values of the data.
c) The sigmoid function transforms a continuous score to a probability value. T
d) Sigmoid functions are differentiable and bounded, properties that made them popular in applied settings for a long time. T
e) The backpropagation principle is a result of the chain rule of differentiation. T
f) Since ReLU is non-differentiable, it has the difficulty of not allowing larger gradient values to flow through the network. False. While it is true that it is not fully differentiable, it has a constant gradient for positive values, making it less likely to have the undesirable problem of vanishing gradients in backpropagation.
g) Stacking layers in general increases the expressivity of an MLP. T, assuming we have non-linear functions at each node.
h) The vanishing gradient problem can occur when several weights in an MLP go to zero. T
i) MLPs are good predictors but are not able to learn features from the data. False. In an MLP, each layer can successively learn good features from the data that can then be used for prediction by the following layers. Even more, if the number of hidden units keeps reducing layer by layer, the learned features also reduce in dimensionality.
j) An MLP has higher bias and lower variance than a perceptron. F, the opposite is true (it has lower bias but higher variance), because an MLP requires more training data since it is more expressive – it is prone to overfitting if not enough data is provided.

2. What is a potential problem with only stacking linear layers in an MLP, without non-linearities?
The problem is that stacked linear layers can be collapsed to a single linear layer (the composition of linear operators is linear) and thus may not perform better than a simple perceptron – the point of using an MLP is to augment its expressive power.

3. The sigmoid function saturates for large-magnitude values of its domain, and training MLPs by gradient descent methods tends to push the weights to be updated into that part of its domain. As a result, when using gradient updates to train MLPs, the gradient of the sigmoid becomes close to zero and practically no update happens on the weights – an example of "vanishing gradients".
4. We saw in class that any function can be approximated to arbitrary accuracy by a network with two hidden layers with sigmoid activations. Someone might argue that it therefore wouldn't make sense to train an MLP with more than two layers. Provide an argument against that assertion.
In order to achieve a good approximation with a two-layer MLP, the hidden layers may need to become very large, i.e., have a large number of nodes. Instead, we could stack more layers to provide enough expressive power to approximate the function, and perhaps end up with a lower total number of nodes.

Multi-Layer Perceptron (MLP)
Recall: logistic regression adds the sigmoid function to a linear model (but still requires linear separation), and perceptrons are linear prediction models. An MLP also uses activation functions such as the sigmoid, but composes both linear and non-linear operations.
MLP (Multi-Layer Perceptron): a non-linear prediction model composed of multiple layers with non-linear activations.
Activation functions introduce non-linearity so that the network can model complex functions, but this makes optimization harder, as the objective is no longer a convex function.
Types of activation function:
◦ Linear: rarely used in hidden layers because it does not introduce non-linearity, but can be used in the output layer to predict continuous values.
◦ Sigmoid: S-shaped graph; best for binary classification / probability outputs; suffers from the vanishing gradient problem (gradients become too small during backpropagation).
◦ ReLU: commonly used in hidden layers; helps with vanishing gradients and gives faster convergence during training; sparse network activation (sets negative values to 0, enhancing gradient flow).

Example network (predicting the probability of being alive):
Input features (independent variables) -> processed through weights -> fed into hidden-layer nodes, which capture latent relationships.
Each hidden-layer node applies a weighted sum (Σ) of its inputs and passes this through an activation function (e.g. the sigmoid function), including at the output node, to calculate the probability of being alive.
The resulting value (0.6) represents the model's output/prediction, labeled as the dependent variable.

Benefits and costs of going from a perceptron to an MLP
Benefit: much greater expressivity; can model non-linear (i.e. more complex) functions.
Costs:
1. Optimization is no longer a convex problem, so a globally optimal solution is no longer guaranteed (or even likely).
2. Larger model = more training and inference time.
3. Larger model = more data required to obtain a good fit.
Summary: an MLP has lower bias and higher variance, plus additional error due to optimization challenges.

Parameter counting (example: 784 inputs -> 256 hidden units -> 10 outputs)
The parameter count covers weights and biases. It is calculated as:
256 × (784 + 1) for the first layer
10 × (256 + 1) for the second layer
The "+1" accounts for the bias term in each layer. (Worked out in code below.)

Linear Activation
Rarely used in hidden layers, because it does not introduce the non-linearity that is essential for the model to capture complex patterns. Linear activation can be used in the output layer, especially in regression tasks, to predict a continuous value.
Key features:
1. No-op activation (i.e. nothing happens): the input is identical to the output, f(x) = x, as no transformation is applied.
2. Use cases: information compression, data alignment.
3. Stacked linear layers: multiple stacked layers are equivalent to a single linear layer, because the composition remains linear (i.e. no additional complexity/expressiveness is gained by stacking).
4. Derivative: f'(x) = 1 is constant, which can be limiting, as it doesn't contribute to gradient scaling.
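The same parameter count, worked out in code:

```python
# Parameter count for the two-layer MLP above (784 inputs -> 256 -> 10)
layer1 = 256 * (784 + 1)  # weights + 1 bias per hidden unit -> 200,960
layer2 = 10 * (256 + 1)   # weights + 1 bias per output unit ->   2,570
print(layer1 + layer2)    # 203,530 parameters in total
```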
Sigmoid Activation (S-shaped graph)
Key features:
1. Mapping/range: maps any real-valued input to the range 0 to 1; good for probabilities.
2. Usage:
a. Traditionally a popular choice for internal layers in neural networks.
b. Still a common choice for output layers in binary classification tasks, to map to a probability.
Main advantage: probabilistic interpretation. Given a function f(x) as the log-odds (the logistic regression model), the sigmoid function transforms it into a probability.
3. Main disadvantage: the vanishing gradient problem. The sigmoid function produces weak gradients at the extremes (outputs close to 0 or 1), leading to the vanishing gradient problem in deep networks and making models with many layers difficult to train/optimize. (See the numeric sketch after this section.)
4. Derivative/gradient: the derivative f'(x) diminishes as x moves toward large positive/negative values, contributing to the vanishing gradient problem.

ReLU (Rectified Linear Unit) Activation
The graph shows how negative values output 0 and positive values increase linearly with a slope of 1.
Advantage: sparse network activation; ReLU's main benefit is improved gradient flow, which makes optimization easier (i.e. allows more complex functions).
Key features:
1. Mapping: f(x) = max(0, x).
a. If the input is negative or 0, it is mapped to 0; if x > 0, f(x) = x, unchanged.
2. Usage: the typical choice for internal layers in current deep networks, because it introduces non-linearity while being computationally efficient.
3. Sparse network activation: a proportion of the outputs are set to 0.
a. This can lead to efficient computation and mitigate the risk of overfitting, as not all neurons are activated simultaneously.
4. Gradient: all positive values have gradient f'(x) = 1 (which also prevents the vanishing gradient problem seen with the sigmoid), and the gradient is 0 for negative inputs.

Network structure notes:
◦ Larger/deeper networks bring higher memory and computation requirements.
◦ Each layer has an activation function to introduce non-linearity for complex representations.

TD-Gammon: application of neural networks to Backgammon (1992)
Network structure: one internal FC (fully connected) layer with a sigmoid activation function.
Reinforcement learning was used to improve the strategy; the reward was calculated based on an evaluation of the game position/result.
◦ Training involved updating the weights after each turn, based on the difference between the current and previous evaluations, adjusted by learning-rate and decay parameters. TD stands for temporal-difference learning: the model adjusts weights based on differences between successive evaluations to improve future play.
TD-Gammon 3.0 achieved results competitive with expert players, showing that neural networks could learn strategies and demonstrating the power of ML.
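A numeric illustration of the vanishing gradient point above (a toy calculation, not from the notes): the sigmoid derivative is at most 0.25, so the chain-rule product across many layers collapses toward zero, while ReLU contributes a gradient of 1 for positive inputs.

```python
import numpy as np

sig = lambda x: 1 / (1 + np.exp(-x))
dsig = lambda x: sig(x) * (1 - sig(x))  # peaks at 0.25, when x = 0

# Product of per-layer gradient factors across 20 layers (chain rule):
print(dsig(0.0) ** 20)  # ~9.1e-13: sigmoid gradients vanish with depth
print(1.0 ** 20)        # ReLU contributes gradient 1 for positive inputs
```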
For a weight w_35 connected to a hidden node f_3 to the output node g_5, applying the internal layers The chain rule allows the gradient to be broken into simpler parts. e.g. b. 3. Backward Pass: Calculate gradients of weights in hidden layers using recursive chain rule a. Move back to hidden layer. For each weight w_13 that connects an input x_1 to a hidden layer node f_3, calculate gradient error with respect to w_13, applying chain rule again. Back-propagation calculation MLP optimization by SGD Epoch: one full pass through dataset Mini-batches used to avoid getting stuck in local minima Steps per batch: compute output, evaluate loss, backpropagate gradients, update weights Summary Lesson 16: CNNs and Key Ingredients of Deep Learning Topic Subject a) False because CNN can also be applied to 1D or 3D data. b) True. this means that filters can be applied to every position/part of the image, detecting specific features (like edges, textures, shapes) regardless of where they are in the image. This sliding or convolutional operation means that if a pattern (such as an edge) appears in one corner of the image or the center, the filter will recognize it because it scans the entire image with the same weights. This enables translation invariance because the CNN doesn’t need to learn separate filters for each possible position of a feature; instead, a single filter can detect the feature anywhere in the image. *question was tested twice c) Loss functions haven't really changed *question was tested twice *review a) Adam uses gradient of the loss wrt to the weights to determine how the weights should adjust to reduce the loss b) Momentum is incorporated by maintaining an exponential moving average of past gradients, which help to accelerate in directions where the gradient consistently points in the same way, allowing it to navigate efficiently through valleys and reduce oscillations d) related to how it uses RMSProp (root mean squared propagation), which is the moving average of the squared gradients to normalize the step size 1. Indicate whether the following is true or false. If something is false, indicate what is wrong with the statement justifying your response appropriately. a) Convolutional layers are needed to extract features and to reduce the dimensions of the input image data. False. The convolutional layer just apply the convolution operation using weights and does not do any reduction of size – the latter could be done with pooling layers, for example. b) The weights of the convolutional layer can be trained. T c) Applying pooling to convolutional neural networks (CNN’s) help reduce the feature representation. T d) An advantage of convolution is that it implements nonlinearity easily in a CNN. False. Two reasons why such a statement is false: a) the convolution operator is actually a linear operator, and b) the nonlinearity is actually applied after a convolution layer. e) Momentum used in SGD allows to skip all local minima. False. While it may allow skipping “small” local minima (with a small basin of attraction and “depth”), it is not guaranteed to get stuck in deeper local minima. f) Adaptive gradient (AdaGrad) was introduced to allow all weights to have an influence on the direction of the optimization trajectory. T g) AdaGrad provides smoother optimization trajectories than SGD, and thus can converge faster. False. 
Summary Lesson 16: CNNs and Key Ingredients of Deep Learning
Topic Subject
a) False, because a CNN can also be applied to 1D or 3D data.
b) True. This means that filters can be applied to every position/part of the image, detecting specific features (like edges, textures, shapes) regardless of where they are in the image. This sliding or convolutional operation means that if a pattern (such as an edge) appears in one corner of the image or in the center, the filter will recognize it, because it scans the entire image with the same weights. This enables translation invariance: the CNN doesn't need to learn separate filters for each possible position of a feature; instead, a single filter can detect the feature anywhere in the image. *question was tested twice
c) Loss functions haven't really changed. *question was tested twice *review
a) Adam uses the gradient of the loss with respect to the weights to determine how the weights should be adjusted to reduce the loss
b) Momentum is incorporated by maintaining an exponential moving average of past gradients, which helps accelerate in directions where the gradient consistently points the same way, allowing it to navigate efficiently through valleys and reduce oscillations
d) Related to how it uses RMSProp (root mean squared propagation), i.e. a moving average of the squared gradients to normalize the step size
1. Indicate whether the following is true or false. If something is false, indicate what is wrong with the statement, justifying your response appropriately.
a) Convolutional layers are needed to extract features and to reduce the dimensions of the input image data. False. The convolutional layer just applies the convolution operation using weights and does not do any reduction of size; the latter could be done with pooling layers, for example.
b) The weights of the convolutional layer can be trained. True.
c) Applying pooling to convolutional neural networks (CNNs) helps reduce the feature representation. True.
d) An advantage of convolution is that it implements nonlinearity easily in a CNN. False. Two reasons why such a statement is false: a) the convolution operator is actually a linear operator, and b) the nonlinearity is actually applied after a convolution layer.
e) Momentum used in SGD allows it to skip all local minima. False. While it may allow skipping "small" local minima (with a small basin of attraction and "depth"), it is not guaranteed to escape deeper local minima.
f) Adaptive gradient (AdaGrad) was introduced to allow all weights to have an influence on the direction of the optimization trajectory. True.
g) AdaGrad provides smoother optimization trajectories than SGD, and thus can converge faster. False. While it is true that it provides smoother trajectories than SGD, the way in which it normalizes the gradient effectively makes the step size for the weight updates smaller, and thus convergence is slower.
h) Adaptive Moment Estimation (Adam) fuses momentum with gradient normalization. True.
i) The residual connections of a ResNet alleviate the problem of vanishing gradients, though the neural networks still cannot be made too deep. False. The fact that it helps with the vanishing gradient problem implies that neural networks can be made much deeper (remember that the chain rule dictates that low weights in upstream layers will affect the gradient updates of downstream layers), assuming the right residual connections are placed appropriately.
2. What is the reason why classic neural networks increase both their training and test error as their depth increases (for different training iterations)?
Because of the vanishing gradient problem: due to the chain rule, early weights in the first layers have gradients that depend on posterior weight values (and their product). If a good number of these posterior weights at some point have values very close to zero, they may not carry enough information for the gradient update, which will also become close to zero. Thus, a larger network increases the chance of more weights in upper layers going to zero, which in turn translates into less effective updating of its parameters through gradient methods, which then translates into a reduction in the effectiveness of its training, and finally results in less learning and fitting.

Two-layer neural networks vs Deeper neural networks
Two-layer neural networks: can learn non-linear functions, provided each perceptron has a differentiable nonlinearity
Deeper neural networks: could theoretically learn compositional representations of complex functions, but are hard to optimize
Deep learning: very deep neural networks; large unstructured datasets (images, text, audio); modern computational resources like GPUs

Deep Learning
Deep neural networks (DNNs) can represent complex models very compactly
◦ shallow neural networks can represent any function, but need very large hidden layers
◦ deep networks can represent very complex x -> y mappings with fewer parameters

Pure MLPs and Images
Pure MLPs are not great for images, as they cannot process the 2D structure of images effectively
◦ images have local patterns that can appear at different positions
Image processing techniques:
Linear filtering: the foundation of image processing; applies filters for smoothing/highlighting patterns
◦ at each pixel, output a weighted sum of the pixels in the surrounding patch (neighborhood)
◦ e.g. Gaussian-weighted smoothing filter, edge detection, local pattern detection
Gaussian-weighted smoothing filter: used to blur images, reduce noise, and achieve smooth transitions of intensity values
Edge detection: highlights edges by identifying areas where intensity changes sharply; commonly used in object detection and feature extraction
Local pattern detection: detects specific patterns in local regions, like textures or shapes, based on the choice of kernel values
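A small numpy sketch of linear filtering as a weighted sum over each pixel's neighborhood (illustrative, not course code; CNN "convolutions" compute this same sliding weighted sum):

```python
import numpy as np

def filter2d(img, kernel):
    # At each pixel, output a weighted sum of the surrounding patch (zero padding at borders)
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# 3x3 Gaussian-like kernel: blurs / reduces noise
gauss = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16.0
# Sobel kernel: responds where intensity changes sharply (vertical edges)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

img = np.zeros((8, 8))
img[:, 4:] = 1.0                      # toy image: dark left half, bright right half
print(filter2d(img, sobel_x)[4])      # [0. 0. 0. 4. 4. 0. 0. -4.]: peaks at the edge
                                      # (the -4 at the border is a zero-padding artifact)
```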
CNN (Convolutional Network)
1. Input Image: the CNN learns filter weights to create grids of features ("feature maps")
◦ filters enable translation invariance: the CNN doesn't need to learn separate filters for each possible position of a feature; a single filter slides over the input image and detects specific features (edges, textures, shapes, etc.) no matter their position in the image
2. Convolution: can be thought of as feature extraction
a. applies the convolution operation using weights, which can be trained
3. Multiple filters (kernels applied to the input image) are learned, producing a map of feature vectors. These filters are trained to detect specific features such as edges, textures or patterns, and are localized.
4. Following layers operate on the feature maps from the previous layer
Non-linearity: after convolution, a non-linear activation function (e.g. ReLU/sigmoid) is applied to introduce non-linear properties for complex pattern learning
Spatial pooling: reduces the spatial dimensions of the feature maps, retaining essential features while making computation more efficient
◦ this allows the filters to cover more of the map and extend the "receptive field" (the portion of the image considered), promoting translation invariance
◦ reduces the feature representation
◦ the CNN can focus on the presence of features rather than their exact locations
Key idea: learn features and classifiers that work well together ("end-to-end training")

LeNet-5: CNN for character & digit recognition
LeNet contains multiple convolutional and pooling (subsampling) layers and a fully connected MLP at the end.
Given an image of a hand-written digit, LeNet predicts it to be one of the 10 classes (0-9).
C1: Convolutional layer: applies 6 filters of size 5x5, resulting in 6 feature maps of size 28x28
S2: Subsampling layer: average pooling, reducing the 6 feature maps from 28x28 to 14x14
C5: Fully connected convolutional layer: 120 neurons, each connected to the 5x5 receptive fields of the previous layer
F6: Fully connected layer: 84 neurons with full connections to the previous layer
Output layer: contains 10 neurons, one for each digit (0-9), using Gaussian connections to produce the final classification output
Key points:
Average pooling is used instead of max pooling
Activation function: sigmoid or tanh for nonlinearity
Training dataset: trained on MNIST; error rate ~1%

AlexNet: ILSVRC 2012 winner
ImageNet classification with deep CNNs. Similar framework to LeNet, but:
◦ max pooling (for reducing the spatial dimensions of feature maps and enhancing translation invariance)
◦ ReLU nonlinearity for the activation function rather than sigmoid/tanh, which mitigates the vanishing gradient problem and allows for faster convergence
◦ more data and a bigger model (7 hidden layers, 650K units, 60M params)
◦ trained on GPUs for improved performance/parallel processing (50x speedup over CPU)
◦ dropout regularization: incorporated in the fully connected layers to address overfitting. Dropout randomly deactivates a portion of the neurons during training, making the network more robust by preventing co-adaptation of neurons.
Factors that allowed the breakthrough:
◦ ReLU activation function: faster convergence & the ability to train deeper networks efficiently
◦ ImageNet: a dataset with millions of labeled images across thousands of categories
◦ GPU processing: its advent made training large neural networks practicable
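A PyTorch sketch of the LeNet-5 stack described above (my reconstruction: the C3/S4 stages that these notes skip are filled in from the standard architecture, and the original's Gaussian-connection output is replaced by a plain linear layer):

```python
import torch.nn as nn

class LeNet5(nn.Module):
    # LeNet-5-style CNN for 32x32 grayscale digit images, 10 output classes
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # C1: 6 feature maps, 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                   # S2: average pooling to 14x14
            nn.Conv2d(6, 16, kernel_size=5),   # C3: 16 feature maps, 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                   # S4: average pooling to 5x5
            nn.Conv2d(16, 120, kernel_size=5), # C5: 120 maps of size 1x1 ("fully connected conv")
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84),                # F6: 84 neurons, fully connected
            nn.Tanh(),
            nn.Linear(84, 10),                 # output: one neuron per digit (0-9)
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```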
GoogLeNet
Although the ReLU activation function helped mitigate the vanishing gradient problem, it was still challenging to train very deep networks. GoogLeNet addressed this with unique architectural features:
◦ Bottlenecks: Inception modules, which use multiple convolutional filters and pooling operations in parallel within each module, were added
▪ 1x1 convolutions act as bottlenecks, reducing the number of channels before the larger convolutions
◦ Multiple stages of supervision: smaller networks that provide additional supervision allowed gradients to flow to earlier layers during training, improving convergence and accuracy

Vanishing Gradients and Information Propagation in Very Deep Networks
(Both ReLU and residual networks help tackle the vanishing gradient problem, but the residual network was more important for deep networks, as it adds back the input.)
Vanishing gradients:
1. Early weights have a long path to reach the output during backpropagation; the path length causes gradients to shrink
2. Any zeros along that path kill the gradient
3. Early layers cannot be optimized, due to diminishing gradients
4. Multiple stages of supervision can help, but it is overall both complicated and time-consuming
a. e.g. in GoogLeNet, multiple stages of supervision helped partially address this issue
Information propagation:
◦ networks need to continuously maintain and add information as data passes through layers, but in very deep architectures there is a risk that information may degrade/dissipate
Solution: Residual Networks (ResNets), which help maintain gradient flow & propagate information effectively, making it feasible to train much deeper networks without encountering severe degradation or optimization issues

ResNet: the residual module & bottleneck module
Residual module:
Skip connections: use skip or shortcut connections around 2-3 layer MLPs or CNNs
◦ allow the input x to skip over some layers & be added directly to the output of those layers
◦ gradients can flow quickly back through the connections
Each module only adds information to the previous layers
Residual function:
◦ a transformation F(x) (often composed of a few layers with activations) is applied to the input -> the output is F(x) + x (including the skipped input)
◦ ReLU activation for non-linearity while maintaining gradient flow
Overall, this avoids degradation problems by enabling direct information and gradient flow; despite the depth, residual connections enable error gradients to "skip" all the way back to the beginning
Bottleneck module:
Enables very deep networks (50+ layers) by using smaller 1x1 convolutions to reduce the number of operations/parameters/nodes required for each convolutional layer, while maintaining the representational power for learning complex features
◦ makes it feasible to train very deep architectures (e.g. reducing 256 feature maps to 64). Reducing the number of nodes in the inner MLP layer greatly improves the speed of training and testing and reduces the number of parameters, with minimal impact on accuracy. This innovation allows ResNet to scale efficiently and achieve high performance on challenging image recognition tasks.

ResBlock Code
Downsampling: reduces the spatial dimensions; applied to both the main path and the shortcut to ensure matching dimensions
Convolutional layers: Batch Normalization & ReLU activations improve training stability, speed and efficiency
◦ both work to reduce the number of parameters (rather than necessarily making it easier to optimize)
◦ Batch Normalization:
▪ during training, the feature distributions at intermediate layers keep changing as the network learns, which destabilizes training
▪ BatchNorm normalizes the features of each mini-batch according to its mean and variance, plus learned parameters γ, β
▪ in the lecture figure, networks with BatchNorm converge faster & to better accuracy (and are more stable)
The shortcut variable holds the value of the skipped connection, which is added to the transformed input later
input + shortcut => the skip connection, allowing gradients to flow back directly through the shortcut path
Forward: applies the prediction, going through each layer
Backward: applies backpropagation to compute the loss gradient with respect to the parameters in each layer
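The original listing is not reproduced in these notes; a minimal PyTorch sketch consistent with the description above (class and variable names are my assumptions) could look like this:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    # Basic residual block: output = ReLU(F(x) + shortcut(x))
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Main path: two 3x3 convolutions, each followed by BatchNorm
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Shortcut: identity, or a 1x1 conv when downsampling changes the dimensions,
        # so the skipped value can still be added to the main path
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        shortcut = self.shortcut(x)           # value of the skipped connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + shortcut                  # skip connection: gradients flow back directly
        return self.relu(out)

# e.g. a downsampling block: 64 -> 128 channels, spatial size halved
block = ResBlock(64, 128, stride=2)
y = block(torch.randn(1, 64, 32, 32))         # y.shape == (1, 128, 16, 16)
```

The backward pass needs no extra code here: calling .backward() on a loss computed from y backpropagates through both the main path and the shortcut.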
More recent improvements over basic SGD
Recall that in basic SGD, weights are updated based on the gradient of the loss function with respect to the weights. Basic SGD can be inefficient due to oscillations, making convergence to the minimum slow or unstable, especially in regions with steep gradients.
1. Momentum: helps to smooth the path to convergence by accumulating past gradients, allowing for faster convergence and carrying the "ball" through a local minimum
a. does not guarantee skipping all local minima or deeper local minima, but can allow skipping "small" local minima
2. AdaGrad (Adaptive Gradient): adapts the learning rate for each parameter based on the historical gradients, making it well suited for sparse data
a. avoids moving in only one direction, and can lead to smoother convergence
b. introduced to allow all weights to have an influence on the direction of the optimization trajectory
c. makes smoother trajectories than SGD, BUT the way it normalizes the gradient makes the step size for the weight updates smaller, and thus convergence SLOWER
3. RMSProp (Root Mean Squared Propagation): uses a moving average of squared gradients to normalize the step size, handling non-stationary objectives (gradient normalization)
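Hand-rolled numpy sketches of these update rules, plus Adam, which fuses momentum with RMSProp-style normalization (illustrative only; library implementations differ in details). Each function performs one update step, returning the new weights and the updated state:

```python
import numpy as np

def sgd_momentum(w, g, v, lr=0.01, beta=0.9):
    # Accumulate past gradients; the velocity can carry the update through small local minima
    v = beta * v + g
    return w - lr * v, v

def adagrad(w, g, G, lr=0.01, eps=1e-8):
    # Per-parameter step sizes; the accumulated squares only grow, so steps keep shrinking
    G = G + g * g
    return w - lr * g / (np.sqrt(G) + eps), G

def rmsprop(w, g, s, lr=0.01, beta=0.9, eps=1e-8):
    # Moving average of squared gradients normalizes the step size (no unbounded shrinking)
    s = beta * s + (1 - beta) * g * g
    return w - lr * g / (np.sqrt(s) + eps), s

def adam(w, g, m, s, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Momentum (m) + gradient normalization (s), with bias correction for early steps t = 1, 2, ...
    m = b1 * m + (1 - b1) * g
    s = b2 * s + (1 - b2) * g * g
    m_hat, s_hat = m / (1 - b1 ** t), s / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s
```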
Summary Lesson 17: Deep Learning Optimization and Computer Vision
Topic Subject
1. Indicate whether the following is true or false. If something is false, indicate what is wrong with the statement, justifying your response appropriately.
a) ResNets make connections between layers that improve the gradient flow during training. True. ResNet skip connections do not add any parameters, but by adding the input of an MLP block to its output, the error gradient can propagate more efficiently to all layers of the network.
b) Bottleneck ResNets reduce the dimensionality after performing convolution with the purpose of improving the computational complexity of convolution during training and inference. True. Reducing the number of nodes in the inner MLP layer greatly improves the speed of training and testing and reduces the number of parameters, with minimal impact on accuracy.
c) The idea with batch normalization is to normalize the features with their empirical mean and variance because feature distributions change "considerably" during training. True.
d) Batch normalization is not done over the whole dataset because this can adversely produce unexpected oscillations in the features learned in the hidden layers. False. The reason it is not done over the whole dataset is that SGD uses batches, and it is computationally more efficient to just normalize the batches.
e) Padding is done over an image in a CNN to allow the convolution to be applied over the image's borders. True.
f) The stride in a convolutional layer indicates the length of the convolution filter. False. The stride allows "jumping" pixels over which we apply the convolution filter, creating a sort of downsampling, e.g. storing the result of filtering on every other pixel.
g) Data augmentation should generate the same type of features (or responses) at every layer of the deep network. False. This is not necessarily true: it may not generate the same type of features in the lowest of the layers; however, if the network is trained correctly, two images with the same dog, one rotated and the other not, should generate the same prediction at the end.
h) Linear probing can use a decoder to train a deep neural network on new classes, based on a previously trained feature extractor or encoder. True.
i) The "pre-compute features method" loads a pre-trained model which acts as a feature extractor, removes the last layer and replaces it with a new untrained layer, forming a "new" neural network over which to train the whole network. False. Actually, what is described is what the freeze-encoder method does.
j) Fine-tuning and linear probing in the "freeze encoder method" only differ in that the former has a smaller learning rate than the latter. False. In the "freeze encoder" method, the existing encoder weights are not updated.
k) "Warm start" sets lower training rates for earlier layers during fine-tuning. False. It is actually about freezing the lower layers for some time (a few epochs) until the classification layer is able to optimize for the current encoder. Warm start can also refer to starting with a small learning rate that is gradually increased, before starting to decrease again.
2. Why is fine-tuning okay when the training converges to a local minimum which is very close to the initialization point?
The primary reason is that it works well in practice. The qualitative, conceptual reason is that we are looking to mainly train the last layer and not the whole network. Therefore, if we end up training the whole network by finding a "deeper" (i.e. better) minimum, we will most probably end up changing most of the weights of the network, or even overfitting it, since the new data is not supposed to be too large in most applications where fine-tuning is used.
*TAKE NOTE OF THE DIFFERENCE: for smaller datasets -> freeze + linear probe; for larger datasets -> fine-tune

Deep Network Training
1. Define the network model
2. Set the key training parameters: # epochs, initial learning rate and schedule, optimizer, loss function, data loaders
3. Train and track performance
* Padding done over an image in a CNN allows the convolution to be applied over the image's borders
* Stride allows "jumping" pixels over which we apply the convolution filter, creating a sort of downsampling (e.g. storing the result of filtering on every other pixel)
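These padding and stride effects can be checked with the standard convolution output-size formula, out = floor((n + 2p - k) / s) + 1 (a worked example, not from the notes):

```python
def conv_out_size(n, k, p, s):
    # n = input size, k = filter size, p = padding, s = stride
    return (n + 2 * p - k) // s + 1

print(conv_out_size(32, 3, 0, 1))  # 30: without padding the map shrinks at the borders
print(conv_out_size(32, 3, 1, 1))  # 32: padding of 1 lets the 3x3 filter cover the borders
print(conv_out_size(32, 3, 1, 2))  # 16: stride 2 keeps every other position (downsampling)
```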
Training Trick: Data Augmentation
- what is it for
- what are the 4 techniques
Data augmentation: creates virtual training samples to increase the diversity of the training dataset, preventing overfitting
Widely used in modern neural network training, especially in deep learning for image processing tasks, to improve the robustness & accuracy of models
It does NOT necessarily generate the same type of features in the lowest layers. However, if the network is trained correctly, two images with the same dog, one rotated and the other not, should generate the same prediction at the end.
Techniques:
1. Horizontal Flip
2. Random Crop
3. Color Casting: adding color filters (e.g. red, green, blue)
4. Geometric Distortion: applying transformations like rotation, stretching and warping
Purpose: simulate a larger training set, which often helps improve performance (prevents overfitting, makes the model more generalizable)
Steps:
1. Define the transformation sequence
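The notes break off after the first step; a sketch of such a transformation sequence, assuming torchvision (the specific transforms and parameters are illustrative, and the continuation in the final comment is a plausible guess, not from the notes):

```python
from torchvision import transforms

# Step 1: define the transformation sequence; a fresh random variant of each image
# is produced every time it is loaded, simulating a larger training set
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),           # 1. horizontal flip
    transforms.RandomCrop(32, padding=4),             # 2. random crop
    transforms.ColorJitter(brightness=0.2, hue=0.1),  # 3. color casting
    transforms.RandomAffine(degrees=15, shear=10),    # 4. geometric distortion
    transforms.ToTensor(),
])

# Likely next step: attach the sequence to the training dataset, e.g.
# torchvision.datasets.CIFAR10(root="data", train=True, transform=augment)
```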