Supervised Learning Lecture 6 PDF
Document Details
Dr. Mohamed AlHajri
Summary
This document is a lecture transcript on supervised learning. It covers linear classification (the perceptron algorithm and the linear support vector machine), linear and non-linear regression, kernel methods, k-nearest neighbors, decision trees, and feedforward, convolutional, and recurrent neural networks.
Full Transcript
Supervised Learning (Dr. Mohamed AlHajri)

Content: Supervised Learning; Linear Classification (Perceptron Algorithm, Linear Support Vector Machine (Linear SVM)); Linear Regression (Least Squares, Weighted Least Squares, Ridge Regression); Non-Linear Classification (Kernel): Kernel Perceptron Algorithm, Kernel Support Vector Machine (Kernel SVM); Non-Linear Regression (Non-Linear Least Squares); K-nearest neighbor; Decision Trees; Feedforward Neural Network; Convolutional Neural Network; Recurrent Neural Network.

Lecture 6 (Linear methods)

Supervised Learning. Supervised learning involves learning a mapping function f: X -> Y from input features X in R^n to output labels Y, where: Classification: Y is discrete (e.g., Y in {1, 2, ..., k}). Regression: Y is continuous (e.g., Y in R). https://panamahitek.com/en/what-is-the-difference-between-regression-and-

Linear Classification. Linear separability refers to the ability to separate two classes of data points in a feature space using a single hyperplane. Mathematically, a dataset is said to be linearly separable if there exists a hyperplane such that all data points belonging to one class are on one side of the hyperplane, while all data points belonging to the other class are on the opposite side. In R^n, the hyperplane can be defined as w . x + b = 0, where w is the weight vector (defining the orientation of the hyperplane), x is the input feature vector, and b is the bias term (defining the position of the hyperplane). For a dataset to be linearly separable, there must exist a weight vector w and a bias b such that y_i (w . x_i + b) > 0 for all i.

Linear Classification, example. Hyperplane w_1 x_1 + w_2 x_2 + b = 0, here x_1 + x_2 - 1 = 0. Datapoints: (1, 1): w . x + b = 1 > 0 (+1); (2, 2): w . x + b = 3 > 0 (+1); (1, -2): w . x + b = -2 < 0 (-1); (-1, -3): w . x + b = -5 < 0 (-1).

Linear Classification, Perceptron Algorithm: a foundational algorithm for binary classification, aiming to find a linear decision boundary that separates two classes. It updates its weights iteratively based on misclassifications: f(x) = sign(w . x + b), where w in R^n is the weight vector and b in R is the bias term.

Linear Classification, Perceptron Algorithm Update Rule: when a data point (x_i, y_i) is misclassified, update the weights and bias: w <- w + η y_i x_i and b <- b + η y_i, where η is the learning rate. How did we derive this update?

Linear Classification, Gradient Descent Interpretation: the perceptron update can be viewed as a form of gradient descent on the hinge loss L(w, b) = max(0, -y_i (w . x_i + b)). Partial derivatives: ∂L/∂w = -y_i x_i if y_i (w . x_i + b) <= 0, and 0 otherwise; ∂L/∂b = -y_i if y_i (w . x_i + b) <= 0, and 0 otherwise. Weight update as gradient step: w <- w - η ∂L/∂w = w + η y_i x_i, and b <- b - η ∂L/∂b = b + η y_i.

Linear Classification, Pseudo Code:
Inputs: X: training data matrix of shape (m, n); y: labels vector of shape (m,), with values in {-1, +1}; η: initial learning rate; MaxIter: maximum number of iterations; LearningRateSchedule.
w = zeros(n); b = 0
For t = 1 to MaxIter:
    For each (x_i, y_i) in (X, y):
        if y_i * (w . x_i + b) <= 0:
            w <- w + η * y_i * x_i
            b <- b + η * y_i

... d1, and the transformed data will be linearly separable.

Non-Linear Classification. In the previous example we have x = (x_1, x_2)^T in R^2. The data is not linearly separable, and so to overcome this issue we apply the following transformation: φ(x) = (x_1, x_2, x_1^2 + x_2^2)^T in R^3. After the new transformation, the data is linearly separable.
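To make this concrete, here is a minimal NumPy sketch (not from the slides; the synthetic ring-shaped dataset and all names are my own) that runs the perceptron update rule described above on the transformed features φ(x) = (x1, x2, x1² + x2²), so that data separated by a circle becomes linearly separable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: class +1 inside a circle of radius 1, class -1 on a ring of radius 1.5-2.5
r_in = rng.uniform(0.0, 1.0, 100)
r_out = rng.uniform(1.5, 2.5, 100)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.concatenate([r_in, r_out])
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.concatenate([np.ones(100), -np.ones(100)])

def phi(X):
    # Feature map phi(x) = (x1, x2, x1^2 + x2^2)
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2 + X[:, 1] ** 2])

def perceptron(X, y, eta=1.0, max_iter=100):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_iter):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) <= 0:   # misclassified (or on the boundary)
                w += eta * y_i * x_i        # w <- w + eta * y_i * x_i
                b += eta * y_i              # b <- b + eta * y_i
                errors += 1
        if errors == 0:                     # converged: every point classified correctly
            break
    return w, b

w, b = perceptron(phi(X), y)
pred = np.sign(phi(X) @ w + b)
print("training accuracy:", np.mean(pred == y))  # 1.0 once the mapped data is separated
```

Running the same loop on the raw two-dimensional features would never converge, since no separating line exists there.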
The classifier will have the following form: ℎ 𝜙 𝒙 ; 𝜃, 𝜃0 = sgn 𝜃 𝑇 𝜙 𝒙 + 𝜃0 = sgn 𝜃1 𝑥1 + 𝜃2 𝑥2 + 𝜃3 𝑥12 + 𝑥22 + 𝜃0 37 Non-Linear Classification The classifier that we will get will be a circle in 2D as shown in figure below 38 Kernel Functions As we have seen that these functions will map the data to higher dimensions and this results in larger vectors and so higher complexity. Therefore, we will discuss two specific functions we can compute efficiently. Polynomial function: Assume we have 𝒙 ∈ ℝ𝑛 and the polynomial of degree 𝑑: 𝑑! 𝑗 𝑗 𝜙 𝒙 = 𝑥11 ⋯ 𝑥𝑛𝑛 1𝑗𝑛+1 𝑗1 ! 𝑗2 ! ⋯ 𝑗𝑛+1 ! 𝑗1 +𝑗2 +⋯+𝑗𝑛+1 =𝑑 For example if we have 𝑛 = 2, 𝑑 = 2 1 (𝑗1 = 0, 𝑗2 = 0, 𝑗3 = 2) 2𝑥1 (𝑗1 = 1, 𝑗2 = 0, 𝑗3 = 1) 2𝑥2 (𝑗1 = 0, 𝑗2 = 1, 𝑗3 = 1) 𝜙 𝒙 = 2𝑥1 𝑥2 (𝑗1 = 1, 𝑗2 = 1, 𝑗3 = 0) 𝑥12 (𝑗1 = 2, 𝑗2 = 0, 𝑗3 = 0) 39 𝑥22 (𝑗1 = 0, 𝑗2 = 2, 𝑗3 = 0) Kernel Functions The size of the vector 𝜙(𝒙) is 𝑛+𝑑 (𝑛 + 𝑑)! = 𝑑 𝑑! 𝑛! If we have a large 𝑛 or 𝑑 or both then 𝜙(𝒙) can get very large and become very expensive to deal with 22 𝑛 = 20, 𝑑 = 2 ⇒ = 231 2 102 𝑛 = 100, 𝑑 = 2 ⇒ = 5151 2 103 𝑛 = 100, 𝑑 = 3 ⇒ = 176851 3 40 Non-Linear Classification (Experiment 3) 41 Non-Linear Classification (Experiment 3) 42 Non-Linear Classification (Experiment 3) 43 Kernel Functions Additivity: The sum of two valid kernels is also a valid kernel. Scalar Multiplication: The product of a valid kernel and a positive scalar is also a valid kernel. Product of Kernels: The product of two valid kernels is also a valid kernel. Exponentiation: Raising a valid kernel to a positive power yields another valid kernel. 44 Kernel Least Square In linear least squares, the model is limited to linear relationships between input features and outputs. To handle non-linear relationships, we can map the input data into a higher-dimensional space using a feature map 𝜙(𝑥), which transforms 𝑥 ∈ ℝ𝑑 to a higher-dimensional space. 2 min 𝑦 − 𝜙(𝑋)𝛽 𝛽 2 −𝟏 𝜷= 𝝓 𝑿 𝑻𝝓 𝑿 𝝓 𝑿 𝑻𝒚 45 Kernel Least Square (Experiment 4) 46 Limitations of Kernel Least Square Complex Patterns: When data is too complex, kernel methods might not capture the intricacies Overfitting: Flexible kernels can easily overfit, especially in noisy datasets. Curse of Dimensionality: Kernel methods may become ineffective in very high-dimensional spaces. Kernel and Parameter Tuning: Choosing the right kernel and tuning parameters is often non-trivial. 47 Lecture 8 (k-NN, Decision Tree) 48 k-nearest neighbor k-Nearest Neighbors (k-NN) is an intuitive, simple, yet powerful non-parametric and instance- based learning algorithm widely used for both classification and regression tasks. Unlike many supervised learning algorithms that require training a model, k-NN stores the entire training dataset and makes predictions for new data points by comparing them directly to the stored instances. It is a lazy learning algorithm, meaning no explicit training occurs, and the predictions are made based on the proximity of new instances to existing ones. 49 k-nearest neighbor Given a query point 𝑥𝑞 , the k-NN algorithm identifies the k nearest neighbors in the training set based on a distance metric: Euclidean Distance 𝑛 2 𝑑 𝒙 𝑞 , 𝒙𝑖 = 𝑥𝑞𝑗 − 𝑥𝑖𝑗 𝑗=1 Manhattan Distance 𝑛 𝑑 𝒙𝑞 , 𝒙𝑖 = |𝑥𝑞𝑗 − 𝑥𝑖𝑗 | 𝑗=1 50 k-nearest neighbor - Classification In classification tasks, the algorithm assigns a class label to the query point 𝑥𝑞 based on the most frequent label (the mode) among its k-nearest neighbors. 
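A minimal NumPy sketch of this procedure (the toy data and function name are my own); the majority-vote rule it implements is stated formally just below:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    # Euclidean distances from the query point to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # Majority vote (mode) over the neighbours' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0], [4.1, 3.9]])
y_train = np.array([0, 0, 1, 1, 1])
print(knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
print(knn_classify(X_train, y_train, np.array([4.0, 4.0]), k=3))  # -> 1
```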
Once the k-nearest neighbors are identified, the class labels of the neighbors are collected, and the query point is classified by majority vote: 𝑦ො𝑞 = 𝑚𝑜𝑑𝑒( 𝑦1 , 𝑦2 , ⋯ , 𝑦𝑘 ) where 𝑦1 , 𝑦2 , ⋯ , 𝑦𝑘 are the class labels of the k-nearest neighbors. Example 1: For a query point with neighbors' class labels [0, 1, 1, 0, 1], the mode is 1, so the predicted class is 1. Example 2: For a query point with neighbors' class labels [0, 1, 1, 0], the mode is ?. There is a tie 51 k-nearest neighbor - Classification In cases where multiple classes have the same frequency among the neighbors, a tie may occur. Possible strategies to handle ties include: Random selection: Randomly select one of the tied classes. Preference to closest neighbor: Choose the class label of the nearest neighbor in case of a tie. Weighted voting: Apply weights based on the distance of each neighbor, giving closer neighbors more influence in the decision. Approach Advantages Disadvantages Inconsistent, lacks Random Selection Simple, neutral bias interpretability Nearest Neighbor Consistent, clear logic, Sensitive to noise and Preference interpretable scaling issues Reliable, less noisy, Extra computational cost, 52 Weighted Voting reduces impact of ties sensitive to weight choice k-nearest neighbor - Regression In regression tasks, k-NN predicts the output based on the average of the target values of the k- nearest neighbors. For a query point 𝑥𝑞 , the predicted value is the mean of the target values 𝑦𝑖 of the nearest neighbors: 𝑘 1 𝑦ො𝑞 = 𝑦𝑖 𝑘 𝑖=1 To improve performance, especially when some neighbors are much closer than others, weighted k-NN can be used. Here, each neighbor's influence is weighted by its distance from the query point: σ𝑘𝑖=1 𝑤𝑖 𝑦𝑖 𝑦ො𝑞 = 𝑘 σ𝑖=1 𝑤𝑖 53 1 where 𝑤𝑖 = is the weight assigned to neighbor i based on its distance to the query point. 𝑑(𝑥𝑞 ,𝑥𝑖 ) k-nearest neighbor Selecting an appropriate value of k is crucial for k-NN's performance. Small values of k can lead to overfitting, while large values can cause underfitting. One common approach to choose k is the elbow method. This method involves plotting the error (classification error rate or mean squared error for regression) as a function of k and identifying the point where the error stops decreasing significantly (the "elbow" point). 54 k-nearest neighbor 55 Experiment 5 Limitations k-nearest neighbor Curse of Dimensionality: As the number of features increases, the distance between data points becomes less meaningful. In high-dimensional spaces, points tend to become equidistant from each other, reducing the effectiveness of distance-based methods like k-NN. Solution: Dimensionality reduction techniques like Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA) can help reduce the number of features, improving the meaningfulness of distance computations. Computational Complexity: For each query, k-NN requires computing the distance between the query point and all training points. This leads to a time complexity of O(N⋅n), where N is the number of training points and n is the number of features. Solution: Data structures such as KD-trees and ball trees can reduce the computational burden, especially in low- dimensional data. Imbalance in Class Distribution: When one class dominates the dataset, the nearest neighbors of any query point may frequently belong to the majority class, leading to biased predictions. 
Solution: Use distance-weighted voting, where closer neighbors are given more importance in the decision- 56 making process, reducing bias toward majority classes. Decision Tree A decision tree is a simple model for supervised classification. It is used for classifying a single discrete target feature. Each internal node performs a Boolean test on an input feature (in general, a test may have more than two options, but these can be converted to a series of Boolean tests). The edges are labeled with the values of that input feature. Each leaf node specifies a value for the target feature. 57 Content adapted from Alice Gao Decision Tree – Example 1 (All examples belong to the same class) 58 Content adapted from Alice Gao Decision Tree – Example 2 (No features left) 59 Content adapted from Alice Gao Decision Tree – Example 3 (No examples left) 60 Content adapted from Alice Gao Decision Tree Which feature we use at each step? Ideally, we would like to find the optimal order of testing features, which will minimize the size of our tree. Unfortunately, finding the optimal order is too expensive computationally. Instead, we will use a greedy approach. The greedy approach will make the best choice at each step without worrying about how our current choice could affect the potential choices in the future. More concretely, at each step, we will choose a feature that makes the biggest difference to the classification, or a feature that helps us make a decision as quickly as possible. 61 Content adapted from Alice Gao Decision Tree Feature that reduces our uncertainty at much as possible will be chosen. To measure the reduction in uncertainty, we will calculate the uncertainty in the examples before testing the feature, and subtract the uncertainty in the examples after testing the feature. The difference measures the information content of the feature. Intuitively, testing the feature allows us to reduce our uncertainty and gain some useful information. We will select the feature that has the highest information content. How do we measure uncertainty? 62 Content adapted from Alice Gao Decision Tree 𝑘 𝐼 𝑃 𝑐1 , ⋯ , 𝑃 𝑐𝑘 = − 𝑃 𝑐𝑖 log 2 (𝑃 𝑐𝑖 ) 𝑖=1 What is the entropy of the distribution (0.5,0.5)? −0.5 log 2 0.5 − 0.5 log 2 0.5 = 1 This distribution has 1 bit of uncertainty. What is the entropy of the distribution (0.01,0.99)? −0.01 log 2 0.01 − 0.99 log 2 0.99 = 0.08 This distribution has 0.08 bit of uncertainty. The entropy is maximized in the case of a uniform distribution Information gain will be the metric to be used to determine the feature to be used (ID3). 𝑘 𝑝𝑖 + 𝑛𝑖 𝑝𝑖 𝑛𝑖 63 InfoGain = Ibefore − Iafter = Ibefore − ∗ 𝐼( , ) 𝑝+𝑛 𝑝+𝑛 𝑝+𝑛 𝑖=1 Content adapted from Alice Gao Decision Tree – Example 4 There are 14 examples, 9 positive and 5 negative What is the entropy of the examples before we select a feature for the root node of the tree? 9 5 9 9 5 5 𝐼 , = −( log 2 + log 2 ) ≈ 0.94 14 14 14 14 14 14 What is the expected information gain if we select Outlook as the root node of the tree? 𝑆𝑢𝑛𝑛𝑦; 2 𝑌𝑒𝑠, 3 𝑁𝑜 ; 5 𝑇𝑜𝑡𝑎𝑙 𝑂𝑢𝑡𝑙𝑜𝑜𝑘 = ቐ𝑂𝑣𝑒𝑟𝑐𝑎𝑠𝑡; 4 𝑌𝑒𝑠, 0 𝑁𝑜 ; 4 𝑇𝑜𝑡𝑎𝑙 𝑅𝑎𝑖𝑛; 4 𝑌𝑒𝑠, 2 𝑁𝑜 ; 5 𝑇𝑜𝑡𝑎𝑙 5 2 3 4 4 0 5 3 2 𝐺𝑎𝑖𝑛 𝑂𝑢𝑡𝑙𝑜𝑜𝑘 = 0.94 −.𝐼 , +.𝐼 , + 𝐼 , 14 5 5 14 4 4 14 5 5 64 5 4 5 = 0.94 − 0.971 + 0 + 0.971 = 0.94 − 0.694 = 0.247 14 14 14 Content adapted from Alice Gao Decision Tree – Example 4 What is the expected information gain if we select Humidity as the root node of the tree? 
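The expected-information-gain computations in this example are easy to reproduce with a short helper. A minimal sketch (function names are mine), checked against the Outlook gain of 0.247 derived above and using the Rain counts (3 Yes, 2 No) implied by that calculation; the Humidity gain asked about above is worked out right after this sketch:

```python
import numpy as np

def entropy(counts):
    """Entropy I(p_1, ..., p_k) in bits of a label distribution given class counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(parent_counts, child_counts):
    """Expected information gain: I_before - sum_i (n_i / n) * I(child_i)."""
    n = sum(sum(c) for c in child_counts)
    after = sum(sum(c) / n * entropy(c) for c in child_counts)
    return entropy(parent_counts) - after

# Outlook split: Sunny (2 Yes, 3 No), Overcast (4 Yes, 0 No), Rain (3 Yes, 2 No)
print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # ~0.247
```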
𝑁𝑜𝑟𝑚𝑎𝑙 ; 6 𝑌𝑒𝑠 1 𝑁𝑜; 7 𝑇𝑜𝑡𝑎𝑙 𝐻𝑢𝑚𝑖𝑑𝑖𝑡𝑦 = ቊ 𝐻𝑖𝑔ℎ; 3 𝑌𝑒𝑠 4 𝑁𝑜; 7 𝑇𝑜𝑡𝑎𝑙 7 6 1 7 3 4 𝐺𝑎𝑖𝑛 𝐻𝑢𝑚𝑖𝑑𝑖𝑡𝑦 = 0.94 −.𝐼 , +.𝐼 , 14 7 7 14 7 7 7 7 = 0.94 − 0.592 + 0.985 = 0.94 − 0.789 = 0.151 14 14 65 Content adapted from Alice Gao Decision Tree – Example 4 What is the expected information gain if we select Wind as the root node of the tree? 𝑊𝑒𝑎𝑘 ; 6 𝑌𝑒𝑠 2 𝑁𝑜; 8 𝑇𝑜𝑡𝑎𝑙 𝑊𝑖𝑛𝑑 = ቊ 𝑆𝑡𝑟𝑜𝑛𝑔; 3 𝑌𝑒𝑠 3 𝑁𝑜; 6 𝑇𝑜𝑡𝑎𝑙 8 6 2 6 3 3 𝐺𝑎𝑖𝑛 𝑊𝑖𝑛𝑑 = 0.94 −.𝐼 , +.𝐼 , 14 8 8 14 6 6 8 6 = 0.94 − 0.81 + 1 = 0.94 − 0.891 = 0.0485 14 14 66 Content adapted from Alice Gao Decision Tree – Example 4 What is the expected information gain if we select Temperature as the root node of the tree? 𝐺𝑎𝑖𝑛 𝑇𝑒𝑚𝑝𝑒𝑟𝑎𝑡𝑢𝑟𝑒 = 0.029 67 Content adapted from Alice Gao Decision Tree Information gain is not the only splitting criterion to be used in decision tree. The next criterion is the gini index (CART) 𝐶 𝐺𝑖𝑛𝑖 𝑡 = 1 − 𝑝𝑖2 𝑖=1 where C is the number of classes 𝑝𝑖 is the proportion of instances belonging to class i at particular node t. Gini index is computationally efficient Information gain will be useful in the case of imbalanced datasets 68 Decision Tree The gini coefficient and the information gain (entropy) are different splitting criteria but are unified under what is known as the Tsallis Entropy. 𝑛 1 𝑞 𝑆𝑞 𝑋 = ( 𝑝𝑖 − 1) , 𝑞 ∈ ℝ 1−𝑞 𝑖=1 lim 𝑆𝑞 𝑋 = 𝐻(𝑋) 𝑞→1 𝑛 𝑆2 𝑋 = 1 − 𝑝𝑖2 = 𝐺𝑖𝑛𝑖 𝐼𝑛𝑑𝑒𝑥 𝑖=1 69 Decision Tree Wang, Yisen, Chaobing Song, and Shu-Tao Xia. "Unifying the Split Criteria of 70 Decision Trees Using Tsallis Entropy." arXiv preprint arXiv:1511.08136 (2016). Decision Tree Wang, Yisen, Chaobing Song, and Shu-Tao Xia. "Unifying the Split Criteria of Decision Trees Using Tsallis Entropy." arXiv preprint arXiv:1511.08136 (2016). 71 Choosing the optimal value of q is still an open question Decision Tree It would be better to grow a smaller and shallower tree. The smaller and shallower tree may not predict all of the training data points perfectly but it may generalize to test data better. We have two options to prevent over-fitting when learning a decision tree Pre-pruning: stop growing the tree early Post-pruning: grow a full tree first and then trim it afterwards. 72 Content adapted from Alice Gao Decision Tree Pre-pruning If we decide not to split the examples at a node and stop growing the tree there, we may still have examples with different labels. At this point, we can decide to use the majority label as the decision for that leaf node. Here are some criteria we can use: Maximum depth: We can decide not to split the examples if the depth of that node has reached a maximum value that we decided beforehand. Minimum number of examples at the leaf node: We can decide not to split the examples if the number of examples remaining at that node is less than a predefined threshold value. Minimum information gain: We can decide not to split the examples if the benefit of splitting at that node is not large enough. We can measure the benefit by calculating the expected information gain. In other words, do not split examples if the expected information gain is less than the threshold. Reduction in training error: We can decide not to split the examples at a node if the reduction in training error is less than a predefined threshold value. 73 Content adapted from Alice Gao Decision Tree Post-pruning is particularly useful when any individual feature is not informative, but multiple features working together is very informative. Example: Suppose we are considering post-pruning with the minimal information gain metric. 
First of all, we will restrict our attention to nodes that only have leaf nodes as its descendants. At a node like this, if the expected information gain is less than a predefined threshold value, we will delete this node’s children which are all leaf nodes and then convert this node to a leaf node. There has to be examples with different labels at this node possibly both positive and negative examples. We can make a majority decision at this node. 74 Content adapted from Alice Gao Lecture 9 (FNN) 75 Feedforward Neural Network Neural networks learns a mapping function 𝑓: 𝑋 → 𝑌 from input features 𝑋 ∈ ℝ𝑛 to output labels 𝑌, where: Classification: 𝑌 is discrete (e.g., 𝑌 ∈ {1,2, ⋯ , 𝑘}). Regression: 𝑌 is continuous (e.g., 𝑌 ∈ ℝ). Key components of a neural network: Layers (Depth): define the number of layers. This include the input, hidden, and output layer. Width (Number of neurons per layer): define the number of neurons per layers. https://www.spotfire.com/glossary/what-is-a-neural- network Activation function: This will add a layer of non-linearity to allow to model complex relationships Loss function: Measures the difference between the predicted output and the actual target. (MSE [Regression], Cross-entropy 76 [Classification]) Feedforward Neural Network 𝑎1 = 𝑓 𝑊1 𝑋 + 𝑏1 𝑎1 𝑋1 𝑋1 where 𝑊1 ∈ ℝ1×3 , 𝑋 = 𝑋2 , 𝑏1 ∈ ℝ 𝑎2 𝑋3 𝑋2 f is the activation function which 𝑎3 could be a linear or a non-linear 𝑋3 activation function. The non- 𝑎4 linearity will give us the flexibility of modelling non-linear problems. https://www.spotfire.com/glossary/what-is-a-neural-network 77 Feedforward Neural Network – Forward Propagation (Classification) (1) (1) (1) 𝑎1 = 𝑓 𝑊1 𝑋 + 𝑏1 = 𝑊1 𝑋 + 𝑏1 ; f is a linear activation function. 𝑋1 𝑎1 Class 0 𝑜1 where (1) 𝑊1 = 𝑊11 𝑊21 𝑊31 ∈ ℝ1×3 𝑘 𝑊𝑖𝑗 (i is the starting node and j is the ending node, and k is the layer) 𝑋2 𝑎2 𝑜2 Class 1 𝑋1 (1) 𝑋 = 𝑋2 ∈ ℝ3×1 ; 𝑏1 ∈ ℝ 𝑋3 𝑋3 𝑎3 2 2 𝑒 𝑜ෝ1 𝑜1 = 𝑓 𝑊1 𝑎 + 𝑏1 = 𝑓 𝑜ො1 = ; f is a softmax function 𝑒 𝑜ෝ1 +𝑒 𝑜ෝ2 2 2 𝑒 𝑜ෝ2 𝑜2 = 𝑓 𝑊2 𝑎 + 𝑏2 = 𝑓 𝑜ො2 = ; f is a softmax function 𝑎4 𝑒 𝑜ෝ1 +𝑒 𝑜ෝ2 𝑙 = −(𝑦log 𝑜2 + 1 − 𝑦 log 𝑜1 ); where y is the true label and 𝑦ො is the 78 predicated label Feedforward Neural Network – Forward Propagation (Classification) (1) (1) (1) 𝑎1 = 𝑓 𝑊1 𝑋 + 𝑏1 = 𝑊1 𝑋 + 𝑏1 ; f is a linear activation function. 𝑋1 𝑎1 𝑜1 Class 0 where (1) 𝑊1 = 𝑊11 𝑊21 𝑊31 ∈ ℝ1×3 𝑘 𝑊𝑖𝑗 (i is the starting node and j is the ending node, and k is the layer) 𝑋2 𝑎2 𝑜2 Class 1 𝑋1 (1) 𝑋 = 𝑋2 ∈ ℝ3×1 ; 𝑏1 ∈ ℝ 𝑋3 𝑜3 Class 2 2 2 ෝ1 𝑒𝑜 𝑋3 𝑎3 𝑜1 = 𝑓 𝑊1 𝑎 + 𝑏1 = 𝑓 𝑜ො1 = ෝ3 ; f is a softmax function 𝑒 𝑜ෝ1 +𝑒 𝑜 ෝ 2 +𝑒 𝑜 2 2 𝑒 𝑜ෝ2 𝑜2 = 𝑓 𝑊2 𝑎 + 𝑏2 = 𝑓 𝑜ො2 = ෝ 1 +𝑒 𝑜 ෝ3 ෝ 2 +𝑒 𝑜 ; f is a softmax function 𝑒𝑜 2 2 𝑒 𝑜ෝ3 𝑜3 = 𝑓 𝑊2 𝑎 + 𝑏3 = 𝑓 𝑜ො3 = ෝ 1 +𝑒 𝑜 ෝ3 ෝ 2 +𝑒 𝑜 ; f is a softmax function 𝑎4 𝑒𝑜 𝑙 = − σ3𝑖=1 𝑝𝑖 log(𝑜𝑖 ); where 𝑝𝑖 is the true distribution and 𝑜𝑖 is the predicated 79 distribution Feedforward Neural Network – Forward Propagation (Regression) (1) (1) (1) 𝑎1 = 𝑓 𝑊1 𝑋 + 𝑏1 = 𝑊1 𝑋 + 𝑏1 ; f is a linear activation 𝑋1 𝑎1 𝑜1 function. where (1) 𝑊1 = 𝑊11 𝑊21 𝑊31 ∈ ℝ1×3 𝑋2 𝑎2 𝑘 𝑊𝑖𝑗 (i is the starting node and j is the ending node, and k is the layer) 𝑋1 (1) 𝑋3 𝑎3 𝑋 = 𝑋2 ∈ ℝ3×1 ; 𝑏1 ∈ ℝ 𝑋3 2 2 2 2 𝑜1 = 𝑓 𝑊1 𝑎 + 𝑏1 = 𝑊1 𝑎 + 𝑏1 ; f is a linear function 𝑎4 𝑙 = 𝑦 − 𝑜1 2 (SE); where y is the true value and 𝑜1is the predicated 80 value Feedforward Neural Network – Forward Propagation (Regression) (1) (1) (1) 𝑎1 = 𝑓 𝑊1 𝑋 + 𝑏1 = 𝑊1 𝑋 + 𝑏1 ; f is a linear activation function. 
𝑋1 𝑎1 𝑜1 where (1) 𝑊1 = 𝑊11 𝑊21 𝑊31 ∈ ℝ1×3 𝑘 𝑋2 𝑎2 𝑊𝑖𝑗 (i is the starting node and j is the ending node, and k is the layer) 𝑜2 𝑋1 (1) 𝑋 = 𝑋2 ∈ ℝ3×1 ; 𝑏1 ∈ ℝ 𝑋3 𝑋3 𝑎3 2 2 2 2 𝑜1 = 𝑓 𝑊1 𝑎 + 𝑏1 = 𝑊1 𝑎 + 𝑏1 ; f is a linear function 2 2 2 2 𝑜2 = 𝑓 𝑊2 𝑎 + 𝑏2 = 𝑊2 𝑎 + 𝑏2 ; f is a linear function 𝑎4 1 𝑙 = [ y1 − o1 2 + y2 − o2 2 ] (MSE); where 𝑦1 , 𝑦2 is the true value and 2 81 𝑜1 , 𝑜2 is the predicted value Feedforward Neural Network These linear activation functions will allow us to model and capture linear relationships as shown below but it fails in the case of non-linear relationships. 82 Feedforward Neural Network Therefore, we need non-linear activation functions 83 Feedforward Neural Network – Activation Functions Activation functions introduce non-linearity into neural networks, enabling them to approximate complex functions. Each function has unique characteristics that impact the network’s performance, convergence, and generalization. Below are common activation functions used in neural networks. 84 Feedforward Neural Network – Activation Functions Sigmoid function 1 𝑓 𝑥 = 1 + 𝑒 −𝑥 Advantages: Smooth gradient, useful for binary classification. Output values between 0 and 1, which can represent probabilities. Disadvantages: Vanishing gradient problem for large or small inputs, slowing down training. 85 Feedforward Neural Network – Activation Functions Tanh function 𝑒 𝑥 − 𝑒 −𝑥 𝑓 𝑥 = 𝑥 𝑒 + 𝑒 −𝑥 Advantages: Zero-centered output, which helps faster convergence Larger gradients compared to sigmoid, reducing vanishing gradient Disadvantages: Still suffers from the vanishing gradient problem, especially in deep networks. 86 Feedforward Neural Network – Activation Functions Relu function 𝑓 𝑥 = max(0, 𝑥) Advantages: Solves vanishing gradient problem for positive values Disadvantages: Dead neurons issue, where certain neurons always output zero and stop learning. Can cause gradient explosion in certain cases. 87 Feedforward Neural Network – Activation Functions Add something about 0/1 function and the difficultity of optimization 88 Feedforward Neural Network – Forward Propagation (Classification) max 0, 𝑊1 1 𝑋 + 𝑏1 ; 𝑓 𝑖𝑠 𝑎 𝑟𝑒𝑙𝑢 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛 𝑋1 𝑎1 𝑜1 Class 0 1 (1) (1) ; 𝑓 𝑖𝑠 𝑎 𝑠𝑖𝑔𝑚𝑜𝑖𝑑 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛 𝑎1 = 𝑓 𝑊1 𝑋 + 𝑏1 = 1+𝑒 −𝑊1 𝑋+𝑏1 (1) (1) 𝑒 𝑊1 𝑋+𝑏1 −𝑒 −𝑊1 𝑋+𝑏1 (1) (1) ; 𝑓 𝑖𝑠 𝑎 tanh 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛 𝑒 𝑊1 𝑋+𝑏1 +𝑒 −𝑊1 𝑋+𝑏1 𝑋2 𝑎2 𝑜2 Class 1 where (1) 𝑊1 = 𝑊11 𝑊21 𝑊31 ∈ ℝ1×3 𝑊𝑖𝑗𝑘 (i is the starting node and j is the ending node, and k is the layer) 𝑋3 𝑎3 𝑋1 (1) 𝑋 = 2 ∈ ℝ3×1 ; 𝑏1 ∈ ℝ 𝑋 𝑋3 𝑒 𝑜ෝ1 𝑎4 𝑜1 = 𝑓 𝑊1 2 𝑎 + 𝑏12 = 𝑓 𝑜ො1 = 𝑒 𝑜ෝ1 +𝑒 𝑜ෝ2 ; f is a softmax function 𝑒 𝑜ෝ2 89 𝑜2 = 𝑓 𝑊2 2 𝑎 + 𝑏22 = 𝑓 𝑜ො2 = 𝑒 𝑜ෝ1 +𝑒 𝑜ෝ2 ; f is a softmax function 𝑙 = −(𝑦log 𝑜2 + 1 − 𝑦 log 𝑜1 ); where y is the true label and 𝑦ො is the predicated label Feedforward Neural Network – Forward Propagation (Classification) 𝑋1 𝑎1 𝑜1 Class 0 𝑋2 𝑎2 𝑜2 Class 1 𝑋3 𝑎3 𝑎4 90 Feedforward Neural Network Wait! How do we calculate all of the weights and biases of the neural network? 91 Feedforward Neural Network – Backpropagation (Regression) (1) (1) 𝑋1 𝑎1 𝑎1 = 𝑊1 𝑋 + 𝑏1 𝑜1 (1) (1) (1) (1) 𝑊1 = 𝑊11 𝑊21 𝑊31 ∈ ℝ1×3 𝑋1 (1) 𝑎2 𝑋 = 𝑋2 ∈ ℝ3×1 ; 𝑏1 ∈ ℝ 𝑋2 𝑜2 𝑋3 𝑜1 = 𝑊1 2 𝑎 + 𝑏12 ; 𝑜2 = 𝑊2 2 𝑎 + 𝑏22 1 𝑙 = 2 [ y1 − o1 2 + y2 − o2 2 ] 𝑋3 𝑎3 How to update the weights and biases? 
min (2) 𝑙 1 1 1 1 1 1 1 1 1 2 𝑊11 ,𝑊21 ,𝑊31 ,𝑊12 ,𝑊22 ,𝑊32 ,𝑊13 ,𝑊23 ,𝑊33 ,𝑊11 ,𝑊21 ,… There are several optimizier that will be used: 𝑎4 Stochastic Gradient Descent SGD with momentum 92 Adam Feedforward Neural Network – Backpropagation (Regression) 93 https://arxiv.org/pdf/2010.07468 Feedforward Neural Network – Backpropagation (Regression) (1) (1) 𝑋1 𝑎1 𝑎1 = 𝑊1 𝑋 + 𝑏1 𝑜1 (1) (1) (1) (1) 𝑊1 = 𝑊11 𝑊21 𝑊31 ∈ ℝ1×3 𝑋1 (1) 𝑋2 𝑎2 𝑜2 𝑋 = 𝑋2 ∈ ℝ3×1 ; 𝑏1 ∈ ℝ 𝑋3 2 2 2 2 𝑜1 = 𝑊1 𝑎 + 𝑏1 ; 𝑜2 = 𝑊2 𝑎 + 𝑏2 1 𝑙 = [ y1 − o1 2 + y2 − o2 2 ] 𝑋3 𝑎3 2 (2) (2) 𝜕𝑙 (2) 𝜕𝑜1 𝜕𝑙 𝑊11 = 𝑊11 − 𝜂 2 = 𝑊11 − 𝜂 (2) 𝜕𝑊11 𝜕𝑊11 𝜕𝑜1 (2) 𝜕𝑜1 𝜕 1 = 𝑊11 − 𝜂 2 ( [ y1 − o1 2 + y2 − o2 2 ]) 𝑎4 𝜕𝑊11 𝜕𝑜1 2 2 𝜕 2 2 2 2 2 = 𝑊11 − 𝜂 2 𝑊11 𝑎1 + 𝑊21 𝑎2 + 𝑊31 𝑎3 + 𝑊41 𝑎4 (− 𝑦1 − 𝑜1 ) = 𝑊11 − 𝜂𝑎1 94 𝜕𝑊11 The same procedure will be done for all of the weights and biases in the neural network. Feedforward Neural Network – Forward Propagation (Regression) max 0, 𝑊1 1 𝑋 + 𝑏1 ; 𝑓 𝑖𝑠 𝑎 𝑟𝑒𝑙𝑢 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛 𝑋1 𝑎1 𝑜1 1 (1) (1) ; 𝑓 𝑖𝑠 𝑎 𝑠𝑖𝑔𝑚𝑜𝑖𝑑 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛 𝑎1 = 𝑓 𝑊1 𝑋 + 𝑏1 = 1+𝑒 −𝑊1 𝑋+𝑏1 (1) (1) 𝑒 𝑊1 𝑋+𝑏1 −𝑒 −𝑊1 𝑋+𝑏1 (1) (1) ; 𝑓 𝑖𝑠 𝑎 tanh 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛 𝑒 𝑊1 𝑋+𝑏1 +𝑒 −𝑊1 𝑋+𝑏1 where 𝑋2 𝑎2 𝑜2 (1) 𝑊1 = 𝑊11 𝑊21 𝑊31 ∈ ℝ1×3 𝑊𝑖𝑗𝑘 (i is the starting node and j is the ending node, and k is the layer) 𝑋1 (1) 𝑋 = 𝑋2 ∈ ℝ3×1 ; 𝑏1 ∈ ℝ 𝑋3 𝑎3 𝑋3 𝑜1 = 𝑓 𝑊1 2 𝑎 + 𝑏12 = 𝑊1 2 𝑎 + 𝑏12 ; f is a linear function 𝑜2 = 𝑓 𝑊2 2 𝑎 + 𝑏22 = 𝑊2 2 𝑎 + 𝑏22 ; f is a linear function 𝑎4 1 𝑙= [ y1 − o1 2 + y2 − o2 2 ] (MSE); where 𝑦1 , 𝑦2 is the true value and 𝑜1 , 𝑜2 is the predicted value 2 95 Feedforward Neural Network 96 Feedforward Neural Network 97 98 Feedforward Neural Network 99 Feedforward Neural Network – Weight Initialization Random Initialization: weights are random initialization, typically from a uniform or normal distribution. The issue with this initialization that it can leaf to large or small gradients that can cause slow convergence or divergence. Advantages: Allows neurons to learn different features Disadvantages: Without further scaling, it can lead to large or small gradients, causing small convergence or divergence. He Initialization: specifically designed for Relu activation 2 𝑊~𝑁𝑜𝑟𝑚𝑎𝑙 0, 𝑛𝑖𝑛 Ensures that the variance of the weights does not shrink or explode. Xavier Initialization: or layer l, weights are initialized with a uniform or normal distribution scaled by 6 6 𝑊~𝑈𝑛𝑖𝑓𝑜𝑟𝑚 − , 𝑛𝑖𝑛 + 𝑛𝑜𝑢𝑡 𝑛𝑖𝑛 + 𝑛𝑜𝑢𝑡 where 𝑛𝑖𝑛 and 𝑛𝑜𝑢𝑡 are the number of input and output units in the layer. 100 Balances the variance across layers, making it suitable for sigmoid and tanh activations. Feedforward Neural Network – Wider vs Deeper Are there functions that can be expressed by wide and shallow neural networks, that cannot be approximated by any narrow neural network, unless its depth is very large? (Correct) On the other hand, if the answer is negative, then depth generally plays a more significant role than width for the expressive power of neural networks. [Proven in https://proceedings.mlr.press/v178/vardi22a/vardi22a.pdf] (Wrong) If the answer is positive, then width and depth, in principle, play an incomparable role in the expressive power of neural networks, as sometimes depth can be more significant, and sometimes width. 
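To tie the forward pass, the loss, and the gradient step together, here is a minimal NumPy sketch of one update for the small regression network used in the backpropagation slides (3 inputs, 4 hidden units, 2 linear outputs, squared-error loss); the input values and initialization are arbitrary, and a full training loop would repeat these steps over many examples:

```python
import numpy as np

rng = np.random.default_rng(0)

# 3 inputs -> 4 hidden units (linear activation) -> 2 outputs (linear)
W1, b1 = rng.normal(0, 0.1, (4, 3)), np.zeros(4)   # layer 1
W2, b2 = rng.normal(0, 0.1, (2, 4)), np.zeros(2)   # layer 2
x = np.array([0.5, -1.0, 2.0])                     # one training example (values arbitrary)
y = np.array([1.0, 0.0])
eta = 0.1

# Forward pass
a = W1 @ x + b1            # hidden activations (linear activation)
o = W2 @ a + b2            # outputs
loss = 0.5 * np.sum((y - o) ** 2)

# Backward pass (chain rule)
d_o = -(y - o)             # dl/do_k = -(y_k - o_k)
dW2 = np.outer(d_o, a)     # dl/dW2_kj = -(y_k - o_k) * a_j, as derived on the slide
db2 = d_o
d_a = W2.T @ d_o           # propagate the error to the hidden layer
dW1 = np.outer(d_a, x)
db1 = d_a

# Gradient-descent step, e.g. W11^(2) <- W11^(2) - eta * (-(y1 - o1)) * a1
W2 -= eta * dW2; b2 -= eta * db2
W1 -= eta * dW1; b1 -= eta * db1
print(f"loss before update: {loss:.4f}")
```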
101 Feedforward Neural Network – MNIST Neural Network 102 Feedforward Neural Network – MNIST 103 Feedforward Neural Network – MNIST 104 Lecture 10 (CNN,RNN) 105 CNN 106 What computers ‘see’: Images as Numbers What you see What you both see What the computer "sees" Levin Image Processing & Computer Vision Input Image Input Image + values Pixel intensity values (“pix-el”=picture-element) An image is just a matrix of numbers [0,255]. i.e., 1080x1080x3 for an RGB image. Can I just do classification on the 1,166400-long image vector directly? No. Instead: exploit image spatial structure. Learn patches. Build them up This slide is taken from Manolis Kellis Feature Extraction with Convolution - Filter of size 4x4 : 16 different weights - Apply this same filter to 4x4 patches in input - Shift by 2 pixels for next patch This “patchy” operation is convolution 1) Apply a set of weights – a filter – to extract local features 2) Use multiple filters to extract different features 3) Spatially share parameters of each filter This slide is taken from Manolis Kellis Fully Connected Neural Network Input: Fully Connected: 2D image Each neuron in Vector of pixel hidden layer values connected to all neurons in input layer No spatial information Many, many parameters Key idea: Use spatial structure in input to inform architecture of the network This slide is taken from Manolis Kellis Convolution operation is element wise multiply and add Filter / Kernel This slide is taken from Manolis Kellis Simple Kernels / Filters This slide is taken from Manolis Kellis Zero Padding Controls Output Size (Goodfellow 2016) Same convolution: zero pad input so output Valid-only convolution: output only when is same size as input dimensions entire kernel contained in input (shrinks output) This slide is taken from Manolis Kellis Key idea: learn hierarchy of features directly from the data (rather than hand-engineering them) Low level features Mid level features High level features Edges, dark spots Eyes, ears,nose Facial structure This slide is taken from Manolis Kellis Lee+ ICML 2009 LeNet-5 Gradient Based Learning Applied To Document Recognition - Y. Lecun, L. Bottou, Y. Bengio, P. Haffner; 1998 Helped establish how we use CNNs today Replaced manual feature extraction This slide is taken from Manolis Kellis [LeCun et al., 1998] LeNet-5 conv avg pool conv avg pool... 5×5 f=2 5×5 f=2 s=1 s=2 s=1 s=2 32×32×1 28×28×6 14×14×6 10×10×16 FC FC... 𝑦𝑦 ⋮ ⋮ 10 5×5×16 120 84 Reminder: Output size = (N+2P-F)/stride + 1 This slide is taken from Andrew Ng [LeCun et al., 1998] An image classification CNN This slide is taken from Manolis Kellis Representation Learning in Deep CNNs Low level features Mid level features High level features Edges, dark spots Eyes, ears,nose Facial structure Conv Layer 1 Conv Layer 2 Conv Layer 3 This slide is taken from Manolis Kellis Lee+ ICML 2009 Introducing Non-Linearity - Apply after every convolution operation (i.e., after convolutional layers) Rectified Linear Unit - ReLU: pixel-by-pixel operation that replaces (ReLU) all negative values by zero. - Non-linear operation tf.keras.layers.ReLU This slide is taken from Manolis Kellis Karn Intuitive CNNs Pooling tf.keras.layers.Max Pool2D( pool_size=(2,2), ) strides=2 1) Reduced dimensionality 2) Spatial invariance Max Pooling, average pooling This slide is taken from Manolis Kellis How can computers recognize objects? Challenge: Objects can be anywhere in the scene, in any orientation, rotation, color hue, etc. How can we overcome this challenge? 
Answer: Learn a ton of features (millions) from the bottom up Learn the convolutional filters, rather than pre-computing them This slide is taken from Manolis Kellis ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. AlexNet ImageNet Classification with Deep Convolutional Neural Networks - Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton; 2012 Facilitated by GPUs, highly optimized convolution implementation and large datasets (ImageNet) Large CNN Has 60 Million parameter compared to 60k parameter of LeNet-5 This slide is taken from Manolis Kellis [Krizhevsky et al., 2012] Architecture AlexNet CONV1 Input: 227x227x3 images (224x224 before MAX POOL1 padding) NORM1 CONV2 First layer: 96 11x11 filters applied at stride 4 MAX POOL2 NORM2 Output volume size? CONV3 (N-F)/s+1 = (227-11)/4+1 = 55 -> CONV4 [55x55x96] CONV5 Max POOL3 FC6 Number of parameters in this layer? FC7 (11*11*3)*96 = 35K FC8 Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Krizhevsky et al., 2012] AlexNet conv max pool conv max pool... 11 × 11 3×3 5×5 3×3 s=4 s=2 S=1 s=2 227×227 ×3 P = 0 55×55 × 96 27×27 ×96 P = 2 27×27 ×256 conv conv conv max pool...... 3×3 3×3 3×3 3×3 S=1 s=1 S=1 s=2 13×13 P=1 P=1 P=1 13×13 ×384 13×13 ×384 13×13 ×256 6×6 ×256 ×256 This slide is taken from Andrew Ng [Krizhevsky et al., 2012] AlexNet FC FC... ⋮ ⋮ Softmax 1000 4096 4096 This slide is taken from Andrew Ng [Krizhevsky et al., 2012] AlexNet Details/Retrospectives: first use of ReLU used Norm layers (not common anymore) heavy data augmentation dropout 0.5 batch size 128 7 CNN ensemble Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Krizhevsky et al., 2012] AlexNet AlexNet was the coming out party for CNNs in the computer vision community. This was the first time a model performed so well on a historically difficult ImageNet dataset. This paper illustrated the benefits of CNNs and backed them up with record breaking performance in the competition. This slide is taken from Manolis Kellis [Krizhevsky et al., 2012] ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. Input VGGNet 3x3 conv, 64 3x3 conv, 64 Pool 1/2 3x3 conv, 128 3x3 conv, 128 Smaller filters Pool 1/2 Only 3x3 CONV filters, stride 1, pad 1 3x3 conv, 256 3x3 conv, 256 and 2x2 MAX POOL , stride 2 Pool 1/2 3x3 conv, 512 3x3 conv, 512 Deeper network 3x3 conv, 512 Pool 1/2 AlexNet: 8 layers 3x3 conv, 512 VGGNet: 16 - 19 layers 3x3 conv, 512 3x3 conv, 512 Pool 1/2 FC 4096 ZFNet: 11.7% top 5 error in ILSVRC’13 FC 4096 VGGNet: 7.3% top 5 error in ILSVRC’14 FC 1000 Softmax Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014] VGGNet VGG Net reinforced the notion that convolutional neural networks have to have a deep network of layers in order for this hierarchical representation of visual data to work. Keep it deep. Keep it simple. This slide is taken from Manolis Kellis [Simonyan and Zisserman, 2014] ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. 
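The output-size bookkeeping used in these architecture walk-throughs, Output size = (N + 2P - F)/stride + 1, is easy to script. A minimal sketch, checked against the AlexNet CONV1 example above:

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a convolution: (N + 2P - F) / stride + 1."""
    return (n + 2 * pad - f) // stride + 1

# AlexNet CONV1: 227x227 input, 96 filters of size 11x11, stride 4, no padding -> 55x55x96
print(conv_output_size(227, 11, stride=4, pad=0))   # 55

# VGG-style layer: 3x3 filters, stride 1, pad 1 keep the spatial size ("same" convolution)
print(conv_output_size(224, 3, stride=1, pad=1))    # 224
```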
GoogleNet Going Deeper with Convolutions - Christian Szegedy et al.; 2015 ILSVRC 2014 competition winner Also significantly deeper than AlexNet x12 less parameters than AlexNet Focused on computational efficiency This slide is taken from Manolis Kellis [Szegedy et al., 2014] GoogleNet 22 layers Efficient “Inception” module - strayed from the general approach of simply stacking conv and pooling layers on top of each other in a sequential structure No FC layers Only 5 million parameters! ILSVRC’14 classification winner (6.7% top 5 error) [Szegedy et al., 2014] GoogleNet “Inception module”: design a good local network topology (network within a network) and then stack these modules on top of each other Filter concatenation 1x1 3x3 5x5 1x1 convolution convolution convolution convolution 1x1 1x1 3x3 max convolution convolution pooling Previous layer Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014] GoogleNet Details/Retrospectives : Deeper networks, with computational efficiency 22 layers Efficient “Inception” module No FC layers 12x less params than AlexNet ILSVRC’14 classification winner (6.7% top 5 error) Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014] GoogleNet Introduced the idea that CNN layers didn’t always have to be stacked up sequentially. Coming up with the Inception module, the authors showed that a creative structuring of layers can lead to improved performance and computationally efficiency. This slide is taken from Manolis Kellis [Szegedy et al., 2014] ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. ResNet Deep Residual Learning for Image Recognition - Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun; 2015 Extremely deep network – 152 layers Deeper neural networks are more difficult to train. Deep networks suffer from vanishing and exploding gradients. Present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. This slide is taken from Manolis Kellis [He et al., 2015] ResNet ILSVRC’15 classification winner (3.57% top 5 error, humans generally hover around a 5- 10% error rate) Swept all classification and detection competitions in ILSVRC’15 and COCO’15! Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015] ResNet What happens when we continue stacking deeper layers on a convolutional neural network? 56-layer model performs worse on both training and test error -> The deeper model performs worse (not caused by overfitting)! Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015] ResNet 𝑎𝑎[𝑙𝑙+1] 𝑎𝑎[𝑙𝑙] 𝑎𝑎[𝑙𝑙+2] Short cut/ skip connection a[l] 𝐋𝐋𝐋𝐋𝐋𝐋𝐋𝐋𝐋𝐋𝐋𝐋 𝐑𝐑𝐑𝐑𝐑𝐑𝐑𝐑 𝐋𝐋𝐋𝐋𝐋𝐋𝐋𝐋𝐋𝐋𝐋𝐋 𝐑𝐑𝐑𝐑𝐑𝐑𝐑𝐑 a[l+2] a[l+1] 𝐳𝐳 [𝐥𝐥+𝟏𝟏] = 𝐖𝐖 [𝐥𝐥+𝟏𝟏] 𝐚𝐚[𝐥𝐥] + 𝐛𝐛 [𝐥𝐥+𝟏𝟏] 𝐳𝐳 [𝐥𝐥+𝟐𝟐] = 𝐖𝐖 [𝐥𝐥+𝟐𝟐] 𝐚𝐚[𝐥𝐥+𝟏𝟏] + 𝐛𝐛 [𝐥𝐥+𝟐𝟐] 𝐚𝐚[𝐥𝐥+𝟏𝟏] = 𝐠𝐠(𝐳𝐳 [𝐥𝐥+𝟏𝟏] ) 𝐚𝐚[𝐥𝐥+𝟐𝟐] = 𝐠𝐠(𝐳𝐳 [𝐥𝐥+𝟐𝟐] ) 𝐚𝐚[𝐥𝐥+𝟐𝟐] = 𝐠𝐠 𝐳𝐳 𝐥𝐥+𝟐𝟐 + 𝐚𝐚 𝐥𝐥 = 𝐠𝐠(𝐖𝐖 [𝐥𝐥+𝟐𝟐] 𝐚𝐚[𝐥𝐥+𝟏𝟏] + 𝐛𝐛 [𝐥𝐥+𝟐𝟐] + 𝐚𝐚 𝐥𝐥 ) This slide is taken from Manolis Kellis [He et al., 2015] ResNet Full ResNet architecture: Stack residual blocks Every residual block has two 3x3 conv layers Periodically, double # of filters and downsample spatially using stride 2 (in each dimension) Additional conv layer at the beginning No FC layers at the end (only FC 1000 to output classes) Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. 
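A minimal tf.keras sketch of the residual block described above: two 3x3 convolutions with a skip connection implementing a[l+2] = g(z[l+2] + a[l]). This is an illustrative sketch rather than the reference ResNet block (batch normalization, downsampling, and the bottleneck variant are omitted):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                    # a[l]
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.ReLU()(y)                            # a[l+1] = g(z[l+1])
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([y, shortcut])                 # z[l+2] + a[l]  (the skip connection)
    return layers.ReLU()(y)                         # a[l+2] = g(z[l+2] + a[l])

inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs, filters=64)
model = tf.keras.Model(inputs, outputs)
model.summary()
```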
[He et al., 2015] ResNet Total depths of 34, 50, 101, or 152 layers for ImageNet For deeper networks (ResNet-50+), use “bottleneck” layer to improve efficiency (similar to GoogLeNet) Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015] ResNet Experimental Results: Able to train very deep networks without degrading Deeper networks now achieve lower training errors as expected Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015] ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. RNN 107 Sequence Applications: One-to-Many Input: fixed-size Output: sequence e.g., image captioning Captions: https://www.microsoft.com/cognitive-services/en-us/computer-vision-api This slide is taken from Danna Gurari Sequence Applications: Many-to-One Input: sequence Output: fixed-size e.g., sentiment analysis (hate? love?, etc) https://www.rottentomatoes.com/m/star_wars_the_last_jedi This slide is taken from Danna Gurari Sequence Applications: Many-to-Many Input: sequence Output: sequence e.g., language translation This slide is taken from Danna Gurari Recall: Feedforward Neural Networks Problem: many model parameters! Problem: no memory of past since weights learned independently Each layer serves as input to the next layer with no loops This slide is taken from Danna Gurari Figure Source: http://cs231n.github.io/neural-networks-1/ Recurrent Neural Networks (RNNs) Main idea: use hidden state to capture information about the past Feedforward Network Recurrent Network Each layer receives input from Each layer receives input the previous layer with no loops from the previous layer and the output from the previous time step This slide is taken from Danna Gurari http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ RNN: Time Step 1 Main idea: use hidden state to capture information about the past This slide is taken from Danna Gurari http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ RNN: Time Step 2 Main idea: use hidden state to capture information about the past This slide is taken from Danna Gurari http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ RNN: Time Step 3 Main idea: use hidden state to capture information about the past This slide is taken from Danna Gurari http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ RNN: And So On… Main idea: use hidden state to capture information about the past … This slide is taken from Danna Gurari http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ RNN: Model Parameters and Inputs All layers share the same model parameters (U, V, W) What is different between the layers? … This slide is taken from Danna Gurari http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ RNN: Model Parameters and Inputs When unfolded, a RNN is a deep feedforward network with shared weights! 
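A minimal NumPy sketch of this unrolled computation using the U, W, V notation from the figures (the dimensions, random initialization, and example inputs are my own): the same three weight matrices are reused at every time step.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 4               # e.g. one-hot characters over a 4-letter vocabulary

# The same parameters are shared across every time step
U = rng.normal(0, 0.1, (n_hidden, n_in))      # input  -> hidden
W = rng.normal(0, 0.1, (n_hidden, n_hidden))  # hidden -> hidden (recurrent)
V = rng.normal(0, 0.1, (n_out, n_hidden))     # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs):
    h = np.zeros(n_hidden)                    # initial hidden state
    outputs = []
    for x in xs:                              # unroll over time
        h = np.tanh(U @ x + W @ h)            # h_t = tanh(U x_t + W h_{t-1})
        outputs.append(softmax(V @ h))        # o_t = softmax(V h_t)
    return outputs

xs = [np.eye(n_in)[i] for i in [0, 1, 2, 2]]  # one-hot inputs, e.g. "h", "e", "l", "l"
for t, o in enumerate(rnn_forward(xs)):
    print(t, np.round(o, 3))
```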
… This slide is taken from Danna Gurari http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ RNN: Advantages Overcomes problem that weights of each layer are learned independently by using previous hidden state Overcomes problem that model has many parameters since weights are shared across layers … This slide is taken from Danna Gurari http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ RNN: Advantages Retains information about past inputs for an amount of time that depends on the model’s weights and input data rather than a fixed duration selected a priori … This slide is taken from Danna Gurari http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ RNN Example: Predict Sequence of Characters Goal: predict next character in text Training Data: sequence of characters represented as one-hot vectors This slide is taken from Danna Gurari Example: Predict Sequence of Characters Goal: predict next character in text Prediction: feed training sequence of one-hot encoded characters; e.g., “hello” For simplicity, assume the following vocabulary (i.e., character set): {h, e, l, o} What is our input at time step 1? What is our input at time step 2? What is our input at time step 3? What is our input at time step 4? And so on… This slide is taken from Danna Gurari https://www.analyticsvidhya.com/blog/2017/12/introduction-to-recurrent-neural-networks/ Example: Predict Sequence of Characters Recall activation functions: use tanh as activation function Sigmoid Tanh ReLU This slide is taken from Danna Gurari https://www.analyticsvidhya.com/blog/2017/12/introduction-to-recurrent-neural-networks/ Example: Predict Sequence of Characters Initialize to random value: 0.567001 + bias ) Initialize to 0 Input at next time step This slide is taken from Danna Gurari https://www.analyticsvidhya.com/blog/2017/12/introduction-to-recurrent-neural-networks/ Example: Predict Sequence of Characters Initialize to random value: 0.427043 Initialized to random value: 0.567001 + bias ) Output at previous time step This slide is taken from Danna Gurari https://www.analyticsvidhya.com/blog/2017/12/introduction-to-recurrent-neural-networks/ Example: Predict Sequence of Characters This slide is taken from Danna Gurari https://www.analyticsvidhya.com/blog/2017/12/introduction-to-recurrent-neural-networks/ Example: Prediction (Many-To-One) This slide is taken from Danna Gurari https://www.analyticsvidhya.com/blog/2017/12/introduction-to-recurrent-neural-networks/ Example: Prediction (Many-To-Many) This slide is taken from Danna Gurari https://www.analyticsvidhya.com/blog/2017/12/introduction-to-recurrent-neural-networks/ Example: Prediction for Time Step 2 Applying softmax, to compute letter probabilities: This slide is taken from Danna Gurari https://www.analyticsvidhya.com/blog/2017/12/introduction-to-recurrent-neural-networks/ Example: Prediction for Time Step 2 Given our vocabulary is {h, e, l, o}, what letter is predicted? Applying softmax, to compute letter probabilities: This slide is taken from Danna Gurari https://www.analyticsvidhya.com/blog/2017/12/introduction-to-recurrent-neural-networks/ RNN Variants: Different Number of Hidden Layers Experimental evidence suggests deeper models can perform better: - Graves et al.; Speech Recognition with Deep Recurrent Neural Networks; 2013. - Pascanu et al.; How to Construct Deep Recurrent Neural Networks; 2014. 
This slide is taken from Danna Gurari http://cs231n.stanford.edu/slides/2016/winter1516_lecture10.pdf RNN: Vanishing Gradient Problem. Problem: training to learn long-term dependencies, e.g., in language: "In 2004, I started college" vs "I started college in 2004". Vanishing gradient: a product of numbers less than 1 shrinks to zero. Exploding gradient: a product of numbers greater than 1 explodes to infinity. This slide is taken from Danna Gurari https://www.analyticsvidhya.com/blog/2017/12/introduction-to-recurrent-neural-networks/
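The effect is easy to see numerically; a minimal sketch (the per-step factor values are arbitrary) showing how a repeated product of factors smaller than 1 shrinks toward zero while factors larger than 1 blows up as the number of time steps grows:

```python
import numpy as np

for factor in (0.9, 1.1):
    scales = np.array([factor ** t for t in (1, 10, 50, 100)])
    print(f"|factor| = {factor}: gradient scale after 1, 10, 50, 100 steps -> {scales}")
# |factor| = 0.9: the product shrinks toward zero   (vanishing gradient)
# |factor| = 1.1: the product grows without bound   (exploding gradient)
```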