Full Transcript

Neural Networks - Lecture 1 Introduction, Logistic Regression, Backpropagation, MLP Outline ● Housekeeping :-) ● Linear Regression, Objective Function, MSE Loss ● Logistic Regression, Activation Function, Cross-Entropy Loss ● Backpropagation ● Multi-Layered Perceptron 2/43 Housekeepin...

Neural Networks - Lecture 1 Introduction, Logistic Regression, Backpropagation, MLP Outline ● Housekeeping :-) ● Linear Regression, Objective Function, MSE Loss ● Logistic Regression, Activation Function, Cross-Entropy Loss ● Backpropagation ● Multi-Layered Perceptron 2/43 Housekeeping – Semester Organization ● 14 lectures ● 4 Programming and Analysis Homework Assignments (30p) ● 1 project in 3 steps (60p) – Step 1: Topic selection and SotA presentation ● Presentation + paper (Introduction and Related Work sections) – 3 Intermediary short presentations (as slides) set 2 weeks apart – 20 p – Step 2: Intermediary project evaluation (e.g. first results) – 20 p ● – Presentation + continued paper (Method description and first results) Step 3: Final Project Poster Presentation (counts as part of exam) – 20p ● Poster + final paper (Final experiments, Final results, Conclusion) ● Project Hours to be used as Office Hours + Short Tutorial Coverage ● 1 written final exam (10p) ● Passing conditions: – 50% of semester activity required for exam entry (grades for HW + Intermediary presentations + Step 2 of project) – 50% of final project, 30% of written exam 3/43 Housekeeping – Syllabus 4/43 Project Guidelines ● To be discussed after the lecture 5/43 Recap Linear Regression, Objective Functions and MSE Loss 6/43 The case of linear regression Regression = predict the value of a continuous variable Temperature No. chirps Cricket chirps as a function of temperature 20 88.6 19.8 93.3 18.4 84.3 80 17.1 80.6 70 15.5 75.2 60 17.1 82 15.4 69.4 16.2 83.3 30 15 79.6 20 17.2 82.6 10 16 80.6 0 17 83.5 14.4 76.3 100 90 no. chirps ● No. chirps Linear (No. chirps) 50 40 14 15 16 17 18 19 20 21 degrees Celsius 7/43 The case of linear regression General Case ● Training dataset – n instances of input-output pairs {x1:n, y1:n} R1xd, y Training ● xi ∈ ∈ R (for these examples) {x1:n, y1:n} → LEARNER → model params. Θ Testing xn+1, Θ → PREDICTOR → ŷn+1 8/43 The case of linear regression Linear predictor y^ i= ^y (x i )= θ1 + xi θ2 Loss function – Mean Squared Error (MSE) / Quadratic Loss 1 2 J ( θ )= ∑ ( y i − y^ i ) n i=1. . n 9/43 The case of linear regression In our example: 1 2 J ( θ )= ∑ ( y i −θ1 −xi θ2 ) n i=1. . n In the general case d y^ i=∑ xij θ j =xi 1 θ1 + xi 2 θ2 +...+ xid θd , where xi 1=1 j=1 ^y = X θ ,with ^y ∈ℝ n×1 , X ∈ℝ n×d and θ ∈ℝ d×1 10/43 The case of linear regression - 2 In the general case d y^ i=∑ xij θ j =xi 1 θ1 + xi 2 θ2 +...+ xid θd , where xi 1=1 j=1 ^y = X θ ,with ^y ∈ℝ n×1 , X ∈ℝ n×d and θ ∈ℝ d×1 x11 ⋯ x 1 d θ y^1 1 ⋮ = x 21 ⋯ x 2 d ⋮ y^n x n 1 ⋯ xnd θd [ ][ ][ ] 11/43 The case of linear regression - 2 In the general case d y^ i=∑ xij θ j =xi 1 θ1 + xi 2 θ2 +...+ xid θd , where xi 1=1 j=1 ^y = X θ ,with ^y ∈ℝ n×1 , X ∈ℝ n×d and θ ∈ℝ d×1 x11 ⋯ x 1 d θ y^1 1 ⋮ = x 21 ⋯ x 2 d ⋮ y^n x n 1 ⋯ xnd θd [ ][ ][ ] T n T 2 J ( θ )=( y− X θ ) ( y− X θ )=∑ ( yi −x i θ ) i=1 12/43 The case of linear regression - 2 Finding the solution → minimize the loss ∂ J ( θ) ∂ T T T T T T = ( y y−2 y X θ + θ X X θ )=0−2 X y +2 X X θ=0 ∂θ ∂θ ⇒ θ = ( X T X )− 1 X T y 13/43 The case of linear regression - 3 How about classification? 14/43 The case of linear regression - 3 How about classification? What if we try to use linear regression to classify the points? 15/43 The case of linear regression - 3 Convention: y = 1 for positive class, y = 0 for negative class θ0 T T ⇒ find θ = θ1 , such that θ x i =1, if x i is positive and θ x i =0 otherwise θ2 1 where x i = x i 1 xi 2 [] [] 16/43 The case of linear regression - 3 What if we introduce an “unproblematic” set of points for one of the classes? 17/43 The case of linear regression - 3 What if we introduce an “unproblematic” set of points for one of the classes 18/43 Recap Logistic Regression, Activation Functions, Cross Entropy Loss 19/43 The case of logistic regression Classification problem ill-posed for linear regression We need a function that “squeezes in” the weighted input into a probability space σ ( η)= 1 −η 1+e Logistic Sigmoid Function 20/43 The case of logistic regression 21/43 The case of logistic regression - 2 Formulate our classification problem in terms of a probabilitybased model Bernoulli random variable: X takes values in {0, 1} if x=1 p( x∣θ ) = θ , 1−θ , if x=0 x 1− x p(x∣θ )=Ber(x∣θ )=θ (1−θ ) 22/43 The case of logistic regression - 2 Entropy (H) – measure of the uncertainty associated with a random variable H ( X )=− ∑ p(x∣θ )log p( x∣θ) x∈X For a Bernoulli variable 1 x (1−x) H ( X )=−∑ θ (1−θ ) x (1− x ) log [ θ (1−θ ) ]=−[ θ log ( θ )+(1−θ )log (1−θ )] x=0 23/43 The case of logistic regression - 2 Logistic regression model specifies prob. of binary output yi in {0,1} given input xi as: n p( y∣X , θ ) = n p( y i∣x i , θ ) = ∏ Ber ( y i∣σ (x i θ )) ∏ i=1 i=1 n Where = ∏ i=1 [ 1 −x θ 1+e i yi ][ 1− 1 −x θ 1+e i 1− y i ] n xi θ = θ0 + ∑ θ j xij j=1 24/43 The case of logistic regression – Maximizing Likelihood Maximum Likelihood Estimation (MLE) - method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making the observations given the parameters MLE property: for i.i.d. data, MLE minimizes Cross-Entropy 25/43 The case of logistic regression – Maximizing Likelihood MLE property: for i.i.d. data, MLE minimizes Cross-Entropy n p( y∣X , θ ) = n p( y i∣x i , θ ) = ∏ Ber ( y i∣σ (x i θ )) ∏ i=1 i=1 n = yi 1 1 1− ∏ −x θ −x θ i=1 1+e 1+e ⏟ [ i ][ i 1− y i ] πi Maximize likelihood p( y∣X , θ )→minimize negative log likelihood n J ( θ )=−log P ( y∣X , θ )=−∑ y i log πi +(1− yi ) log(1− πi ) ⏟ i=1 cross-entropy 26/43 The case of logistic regression – Distribution Divergence Denote πi= p( y i=1∣xi , θ ) and 1− π i= p( yi =0∣xi , θ ) D = {(xi, yi), i = 1..n} - is the Empirical Data Distribution Dm = {(xi, πi), i = 1..n } - is the modeled data distribution We are trying to minimize the “discrepancy” between the modeled data distribution and the empirical one 27/43 The case of logistic regression - 2 Kullback-Leibler Divergence = measure of difference (not a proper distance) between two probability distributions P(x) KL( P∥Q)=∑ P (x)log = Q(x) x ( ) ∑ P(x)log ( P(x))−∑ P(x) log(Q( x)) x x ⏟ ⏟ −H ( P( x)) +H (P (x), Q (x)) where H ( P (x),Q( x)) is the Cross− Entropy measure of P ( x) and Q(x) => minimizing KL divergence = minimizing the crossentropy 28/43 The case of logistic regression – Solve with Gradient Descent n J ( θ )=−log P ( y∣X , θ )=−∑ y i log πi +(1− yi ) log(1− πi ) ⏟ i=1 cross-entropy The loss function no longer has closed-form solution => apply gradient-descent based method 29/43 The case of logistic regression – Solve with Gradient Descent n J ( θ )=−log P ( y∣X , θ )=−∑ y i log πi +(1− yi ) log(1− πi ) ⏟ i=1 cross-entropy The loss function no longer has closed-form solution => apply gradient-descent based method derivative of logit function : σ ' ( x)= σ ( x)(1− σ ( x)) n ∂ J (θ ) ∂ = − y i log σ ( x i θ )+(1− y i )log (1−σ ( x i θ )) ∂θ ∂θ ∑ i=1 ( ∂ J (θ ) = ∂θ n T ) T ∑ x i ( πi− yi ) = X ( π − y) i=1 30/43 The case of logistic regression – Solve with Gradient Descent The loss function no longer has closed-form solution => apply gradient-descent based method Gradient vector points in direction of steepest ascent => for minimization take opposite direction θ t +1 ∂ J (θ ) t =θ − λ (θ ) ∂θ t where λ → learning rate Source: Wikimedia Commons (https://bit.ly/2lOcCi9) 31/43 Logistic regression as a NN Softmax – generalization of neural network from binary classification to multiclass (forced) binary case: π i1 = p( y i∣x i , θ ) = e e x i θ1 x i θ1 x i θ2 , if yi =0 +e xθ e πi 2= x θ x θ , if y i=1 e +e i i 1 2 i 2 Source: Sebastian Rashka (https://bit.ly/2lONuI5) 32/43 Logistic regression as a NN π i1 = p( y i∣x i , θ ) = e e x i θ1 x i θ1 x i θ2 , if y i =0 +e xθ e πi 2= x θ x θ , if y i=1 e +e i i 1 2 i 2 33/43 Recap Backpropagation 34/43 Logistic regression and backprop δ 4=J ( θ ) δ3 = ∂ J ∂ J ∂ z4 = ∂ z3 ⏟ ∂ z4 ∂ z3 1 δ3 = ∂−( y log (z 3 )+(1− y )log (1−z 3 )) ∂ z3 δ3 = − y (1− y ) − z 3 (1−z 3 ) ∂ J ∂ J ∂ z3 δ2 = = ∂ z2 ∂ z 3 ∂ z2 δ2 = δ 3 ∂ σ ( z2) = δ3 σ (z 2 )(1− σ (z 2 )) ∂ z2 d δ21 = ∂J =δ2 2 ∂ z1 ∂ ∑ x i θi i=1 ∂ x2 = δ2 θ2 35/43 Backprop – Layer specification In general Forward pass: z k+ 1=f k ( z k ) Backward pass: j j ∂ zk +1 ∂J ∂ J ∂ z k +1 j δ = i =∑ j = δ ∑ k +1 ∂ zi j ∂ z k j ∂ z k +1 ∂ z k j k i k j j ∂ zk +1 ∂J ∂ J ∂ zk +1 j =∑ = δ k +1 ∂ θ ∂ θk j ∂ z j ∂ θk ∑ k j k +1 36/43 Recap Multi-Layered Perceptron 37/43 Multi-Layer Perceptron Consider the rings dataset Rings dataset Logistic regression result Logistic regression is still insufficient for classification, because the dataset is not linearly separable 38/43 Multi-Layer Perceptron 1 2 1 u k =∑ xij θ jk j=1 2 3 2 3 1 2 u k =∑ a j θ jk =∑ σ (u j ) θ jk j=1 j=1 MLP example: 2 layers with 3 and 2 neurons, respectively Objective: minimize cross-entropy error 39/43 Multi-Layer Perceptron Each neuron computes a separation plane on the space of its inputs 40/43 Multi-Layer Perceptron 3 input neurons create “cuts” that can isolate the inner ring 2 output neurons decide on which “side” of each cut are the positive versus the negative examples 41/43 Multi-Layer Perceptron ● ● Neural Net Playground (https://playground.tensorflow.org/): the simple MLP Observe influence of: – Activation Functions – Relation between absolute gradient value and learning rate – Non-normalized layer inputs – Batch-size 42/43 Summary ● Take aways: – Linear regression + Cost Function for regression → MSE Loss – Logistic regression + Cost Function for Classification (binary or multi-class) → Cross-Entropy Loss – Activation Functions – role of non-linearities – Backpropagation ● Chain rule of probability ● Gradient based update rule 43/43 Neural Networks - Lecture 2 Optimization Methods Alexandru Sorici UPB Table of contents 1. Introduction 2. Gradient Descent 3. Second Order Optimization 4. Stochastic Gradient Descent 5. Gradient Descent Optimization Algorithms 6. References 1 Intro Recap - partial derivatives and gradient Consider a two variable function f(θ) = f(θ1 , θ2 ) = θ12 + θ22 ∂f(θ1 , θ2 ) f(θ1 + ∆θ1 , θ2 ) − f(θ1 , θ2 ) = lim ∆θ1 →0 ∂θ1 ∆θ1 ∂f(θ) = 2θ1 ∂θ1 [ 2θ1 ∇J(θ) = 2θ2 ∂f(θ) = 2θ2 ∂θ2 ] 2 Recap - second order derivatives - Hessian Hessian = square matrix of second-order partial derivatives ( ) ∂ ∂f(θ) ∂ 2 f(θ) = =2 ∂θ1 ∂θ1 ∂θ12 ( ) ∂ ∂f(θ) ∂ 2 f(θ) = =2 ∂θ2 ∂θ2 ∂θ22 ∂ 2 f(θ) ∂ 2 f(θ) = =0 ∂θ1 θ2 ∂θ2 θ1 [ 2 0 H= 0 2 ] 3 Recap - General Case - gradient vector θ - a d-dimensional vector f(θ) - scalar-valued function Gradient vector of f(•) with respect to θ:  ∂f(θ)  ∂θ 1  ∂f(θ)   ∂θ2   ∇J(θ) =   ..   .  ∂f(θ) ∂θd Figure 1: Gradient field - Source: Wikimedia Commons (https://bit.ly/2lOcCi9) 4 Recap - General Case - Hessian matrix θ - a d-dimensional vector f(θ) - scalar-valued function The Hessian matrix of f(•) with respect to θ (written as ∇2θ f(θ) or H) is a d × d matrix of second-order partial derivatives:  ∂ 2 f(θ) 2 1  ∂θ  ∂ 2 f(θ)  ∂θ2 θ1 ∂ 2 f(θ) ∂θ1 θ2 ∂ 2 f(θ) ∂θ22 ∂ 2 f(θ) ∂θd θ1 ∂ 2 f(θ) ∂θd θ2 H = ∇2θ f(θ) =     .. . .. .  ... ... .. . ... ∂ 2 f(θ) ∂θ1 θd  ∂ 2 f(θ)  ∂θ2 θd  .. . ∂ 2 f(θ) ∂θd2     5 Recap - Offline learning When doing offline learning, we typically use a batch of data x1:n = x1 , x2 , . . . , xn and optimize functions of the form: n f(θ) = f(θ, x1:n ) = 1∑ f(θ, xi ) ≈ n ∫ f(θ, x)p(x)dx i=1 The gradient is then: g(θ) = ∇θ f(θ) = n ∑ ∇θ f(θ, xi ) i=1 6 Recap - Optimization for Linear regression When f is the loss function for linear regression (MSE loss): T f(θ) = f(θ, X, y) = (y − Xθ) ((y − Xθ)) = n ∑ (yi − xi θ)2 i=1 ∇θ f(θ) = ∂ T (y y − 2yT Xθ + θ T XT Xθ) = −2XT y + 2XT Xθ ∂θ ∇2θ f(θ) = 0 + 2XT X 7 Gradient Descent Gradient Descent Algorithm Gradient descent or steepest descent is an iterative algorithm for optimization which has the following update form: θ (k+1) = θ (k) − λ(k) g(k) = θ (k) − λ(k) ∇f(θ (k) ) where k is the step in the algorithm, g(k) = g(θ (k) ) - gradient at step k, λ(k) is the learning rate or step size 8 Gradient Descent Derivation Function f is our loss function J(θ) Objective: find a small change ∆θ in parameters θ that minimizes the value of J [ ] ∂J arg min J(θ + ∆θ) ≈ arg min J(θ) + ∆θ ∆θ ∆θ ∂θ s.t. ∥∆θ∥ ≤ ϵ Constrained optimization problem −→ Lagrange multipliers arg min J(θ) + ∆θ ∆θ ∂J + η∆θI∆θ T ∂θ Solving for ∆θ in the optimization gives us: ∆θ = θ (k+1) − θ (k) = − 1 ∂J ∂J =⇒ θ (k+1) = θ (k) − λ 2η ∂θ ∂θ 9 How to choose the learning rate? Source: https://www.jeremyjordan.me/nn-learning-rate/ 10 Second Order Optimization Newton’s algorithm The most basic second-order optimization algorithm. Has updates of the form: (k) θ (k+1) = θ (k) − H−1 K g Algorithm derived from second-order Taylor series approx. of J(θ) around θ (k) : T 1 Jquad (θ) = J(θ (k) ) + g(k) (θ − θ (k) ) + (θ − θ (k) )T HK (θ − θ (k) ) 2 Differentiate, equate to 0 and solve for θ: ∇J(θ) = 0+g(k) +HK (θ −θ (k) ) = 0 (k) =⇒ θ = θ (k) − H−1 K g 11 Second Order Optimization - issues (k) To compute the direction of descent d(k) = −H−1 we need to K g invert the Hessian matrix HK , at each iteration! An efficient and popular way to circumvent this issue is to use the Conjugate Gradient method to solve the linear system of equations HK d(k) = −g(k) for d(k) 12 SGD Stochastic Gradient Descent The loss function that helps us compute the gradient with respect to model parameters ∇θ J(θ) also depends on the observed data instances. ∫ want ∇θ J(θ) = ∇θ J(x, θ)p(x)dx = 0 | {z } E[∇θ J(x,θ)] θ (k+1) = θ (k) − λE [∇θ J(x, θ)] ≈ θ (k) − λ n 1 ∑ ∇θ J(x(i) , θ) n i=1 ≈θ (k) − λ∇θ J(x(k) , θ (k) ) The algorithm is stochastic because it only ”looks” at a subset (or even instance) of the data to estimate the true gradient. [ ] θ (k+1) = θ (k) − λE [∇J] + λ E [∇J] − ∇J(x(k) , θ (k) ) | {z } noise If λ is a series that converges under variance of the noise, then the noise is bounded and the algorithm converges 13 SGD and Batches ∑n Batch θ (k+1) = θ (k) + λ n1 Stochastic / Online θ (k+1) = θ (k) + λ∇θ J(x(k) , θ (k) ) Mini-batch θ (k+1) = θ (k) + λ n1 i=1 ∑l i=1 ∇θ J(x(i) , θ (k) ) ∇θ J(x(i) , θ (k) ) Convergence speed of GD variants (source: stack-exchange) 14 SGD Concerns/Challenges Good practices: • Normalize the input space: • Shift inputs to have a zero mean over the whole dataset • Scale the inputs to have unit variance • Decorrelate input components • For linear layers, we get a big win by decorrelating each component of the input from the other components (e.g. use PCA to remove components with small eigenvalues, divide remaining principal components by the sqrt of their eigenvalues) • Initialization of weights should break symmetry (e.g. Xavier initialization) • Aim for class balance in your mini-batches 15 SGD Concerns/Challenges • Choosing a proper learning rate • Define rate of annealing or scheduled reducing • Having the same learning rate for all parameters (e.g. for rarer or more important features we want a bigger update in that direction) • Plateau and saddle points can leave SGD stuck in suboptimal local minima 16 Gradient Descent Optimization Algorithms Momentum SGD without momentum (left) and with momentum (right). Source: DataScience Deep Dive Blog: Optimizations of Gradient Descent Intuition: Momentum accelerates SGD in relevant direction and dampens oscillations Let ∆θt be the update step of parameters θ at step t ∆θt = γ∆θt−1 + λ∇θ J(θ) θ (t+1) = θ (t) − ∆θt γ is usually set to 0.9. 17 Nesterov Accelerated Gradient Intuition: we want to give Momentum a sense of ”when to slow down” before the hill slopes up again. Compute gradient w.r.t future estimated parameters θ − γ∆θt−1 . ∆θt = γ∆θt−1 + λ∇θ J(θ − γ∆θt−1 ) θ (t+1) = θ (t) − ∆θt Momentum update (blue vectors) compared to NAG update (brown + red vectors). Source G. Hinton Lecture 6c, CSC321 18 Adagrad Intuition: adapt learning rate of individual parameters, depending on parameter importance. • smaller updates (lower learning rate) for frequently occurring features • larger updates (higher learning rate) for infrequent ones • suited for sparse data (e.g. object classification in video, word embeddings) (t) Notation : gt,i = ∇Jθ (θi ) (t+1) θi (t) = θi − √ λ Gt,ii + ϵ gt,i Element i, i on the diagonal = sum of squares of past gradients (up to step t) for parameter θi . Gt ∈ Rd×d - diagonal matrix. 19 Adadelta Intuition: Adagrad’s weakness - accumulation of squared gradients =⇒ learning rate shrinks =⇒ no new knowledge updates Adadelta solution: restrict past gradients to a window w + compute a decaying running average at each time step. E[g2 ]t = γE[g2 ]t−1 +(1−γ)g2t , where E[g2 ]t is the running average at step t. ∆θt = √ θ (t+1) =θ λ E[g2 ]t (t) +ϵ ⊙ gt = λ ⊙ gt RMS[g]t − ∆θt 20 Adadelta - contd Note: theoretically, the update should have the same hypothetical units as the parameter Adadelta solution: define an exponential decay of squared parameter updates E[∆θ 2 ]t = γE[∆θ 2 ]t−1 + (1 − γ)∆θt2 RMS[∆θ]t = ∆θt = √ E[∆θ 2 ]t + ϵ RMS[∆θ]t−1 gt , where RMS[∆θ]t−1 replaces λ RMS[g]t θ (t+1) = θ (t) − ∆θt 21 RMSProp Developed by Geoffrey Hinton in Lecture 6e of Coursera Class. Equivalent to Adadelta, without the exponential decay of squared parameter updates E[g2 ]t = 0.9E[g2 ]t−1 + 0.1g2t θ (t+1) = θ (t) − √ λ E[g2 ]t + ϵ gt whereλ = 0.001 is considered a good default value 22 Adam Adaptive Moment Estimation (Adam) ≈ Adadelta with momentum. • store exponentially decaying average of squared gradients vt (second moment variance) • store also exponentially decaying average of past gradients mt (first moment mean), similar to momentum mt = β1 mt−1 + (1 − β1 )gt vt = β2 vt−1 + (1 − β2 )g2t mt and vt are biased towards 0 at initial time steps (especially if β1 , β2 are close to 1) =⇒ bias-correction: mt m̂t = 1 − β1t vt v̂t = 1 − β2t Adam update rule: λ m̂t v̂t + ϵ where default indicated values are β1 = 0.9, β2 = 0.999 and ϵ = 10−8 θ (t+1) = θ (t) − √ 23 Visualization of GD optimization algorithms 24 So, how to go about optimizing? • For small datasets (e.g. 10.000 cases) or big datasets without much redundancy: use full-batch method • conjugate gradient, LBFGS • adaptive learning rates • For big, redundant datasets: use mini-batches • try RMSProp, Adam • try SGD with momentum (particularly when fine-tuning) • If input data is sparse, try adaptive learning-rate methods • If fast-convergence (for initial tests especially) is required, use adaptive learning rate methods 25 So, how to go about optimizing? (Tricks) • Shuffling and Curriculum Learning • Shuffle training data after each epoch, to avoid biasing the optimization algorithm • Curriculum Learning: for some problems (e.g. involving a temporal dimension), training using progressively harder examples helps • Batch Normalization • Common practice: normalize inputs to zero mean and unit variance • For deep networks, intermediary layer inputs may vary significantly during training (covariate shift) =⇒ normalize intermediary layer inputs 26 So, how to go about optimizing? (Tricks) • Early stopping • Monitor error on validation set and stop (with some patience, i.e. number of gradient update steps) if validation error does not improve enough • Gradient Noise • Used to make training more robust to poor initialization, or when having deep and complex networks (highly irregular error function - escape local minima) • Gradient modification: gt,i = gt,i + N(0, σt2 ) η σt2 = (1 + t)γ 27 So, how to go about optimizing? (Tricks) • Stochastic Weight Averaging (SWA) • Effective, low computational overhead method to exploit SGD behavior in ”flattened” loss function space (when approaching convergence) • https://pytorch.org/blog/pytorch-1.6-now-includes-stochasticweight-averaging/ 28 References References Credits for slides to lectures by Nando de Freitas, Geoffrey Hinton and article by Sebastian Ruder. • Nando de Freitas. Machine Learning. Lecture 5 - Optimization • Sebastian Ruder (2016). An overview of gradient descent optimisation algorithms. arXiv preprint arXiv:1609.04747. • Geoffrey Hinton. Neural Networks for Machine Learning. Lecture 6 29 Convolutional Neural Networks Alexandru Sorici October 18, 2023 Table of contents 1. Introduction 2. Building blocks of CNNs 3. Fully Convolutional Networks 4. Normalization 5. Classic Networks 1 Intro Conv Nets are everywhere (contd) Figure 1: SafeUAV: Learning to estimate depth and safe landing areas for UAVs from synthetic data. Alina Marcu, Dragos Costea, Vlad Licaret, Mihai Parvu, Marius Leordeanu 2 Conv Nets are everywhere (contd) Figure 2: Spatio-Temporal Features in Action Recognition Using 3D Skeletal Joints. M. Trascau, M. Nan, AM Florea 3 Conv Nets are everywhere(contd) Figure 3: End-to-end models for self-driving cars on UPB campus roads. A. Mihalea, R. Samoilescu, A. Nica, M. Trascau, A. Sorici, AM Florea 4 CNN Building Blocks Convolution Operation For the continuous case, a convolution product between two functions f and g is defined as: Z +∞ Z ∞ (f ∗ g )(t) = f (τ )g (t − τ )dτ = f (t − τ )g (τ )dτ −∞ −∞ For the discrete case, integral is replaced by a sum: (f ∗ g )(n) = ∞ X m=−∞ f (m)g (n − m) = ∞ X f (n − m)g (m) m=−∞ 5 Convolution Operation in spatial domain An RGB image can be considered as a function (f : R2 → R3 ), where f (x, y ) = (red value, green value, blue value)T Operators can be applied to the image, by treating is a function: 6 Convolution Operation in spatial domain Convolution Operation on images The convolution operation implies use of a convolution kernel (K - of dimension w × w , where w = 2k + 1), applied to source image (I ): O =I ∗K O(i, j) = w −1 w −1 X X I (i + u − k, j + v − k) · K (u, v ) u=0 v =0 7 Convolution Operation in spatial domain - contd Properties Conv. op. is associative and comutative. 8 CNN Layers We use 3 main types of layers when constructing CNN architectures: • Convolution Layer • Pooling Layer • Fully Connected Layer (like in MLP) Figure 4: Example of CNN Layer stacking. Source: Stanford CS231n Lecture Notes 9 Fully Connected Layer Consider an image of dim. 32 × 32 × 3. We wish to ascribe it to one of 10 classes. 1. Linearize image =⇒ dimension of 3072 × 1. 2. Pass resulting vector through a fully connected layer.   w1,1  ..  . w1,2 .. . ... .. . w10,1 w10,2 ...  x1 x2 .. .      w1,3072 b1 y1     .   .  ..     . ·  +  ..  =  ..    w10,3072 b10 y10 x3072 (1) H(X ) = (W .x) + b T Disadvantage: lose info about spatial arrangement of pixel (example of inductive bias) 10 Convolution Layer - I Alternative: connect each neuron to only a local region of the input volume Receptive Field = size of the filter - spatial extent of the local connectivity of each neuron 11 Convolution Layer - I 12 Convolution Layer - I 13 Convolution Layer - II Case study For an input volume (e.g. source image) of 32 × 32 × 3 and apply a filter of size 5 × 5 × 3 the dimension of the output map is: ? 14 Convolution Layer - II Case study For an input volume (e.g. source image) of 32 × 32 × 3 and apply a filter of size 5 × 5 × 3 the dimension of the output map is: 28 × 28 × 1   0×1 1×0 1×1 1 0 0 0        0 0 1 1 1 0 0  1 4 3 4 1  ×0 ×1 ×0     0 0 0 1 1 1 0   1 2 4 3 3   1 0 1   ×1 ×0 ×1     0 0 0 1 1 0 0  ∗  0 1 0  =  1 2 3 4 1         0 0 1 1 0 0 0   1 3 3 1 1  1 0 1       3 3 1 1 0  0 1 1 0 0 0 0  1 1 0 0 0 0 0 I K I ∗K 14 Convolution Layer - II Case study For an input volume (e.g. source image) of 32 × 32 × 3 and apply a filter of size 5 × 5 × 3 the dimension of the output map is: 28 × 28 × 1   0 1×1 1×0 1×1 0 0 0        0 0 1 1 1 0 0  1 4 3 4 1 ×0 ×1 ×0      0 0 0 1 1 1 0   1 2 4 3 3   1 0 1      ×1 ×0 ×1  0 0 0 1 1 0 0  ∗  0 1 0  =  1 2 3 4 1         0 0 1 1 0 0 0   1 3 3 1 1  1 0 1       3 3 1 1 0  0 1 1 0 0 0 0  1 1 0 0 0 0 0 I K I ∗K 14 Convolution Layer - II Case study For an input volume (e.g. source image) of 32 × 32 × 3 and apply a filter of size 5 × 5 × 3 the dimension of the output map is: 28 × 28 × 1   0 1 1×1 1×0 0×1 0 0        0 0 1 1 1 0 0  1 4 3 4 1 ×0 ×1 ×0      0 0 0 1 1 1 0   1 2 4 3 3   1 0 1      ×1 ×0 ×1  0 0 0 1 1 0 0  ∗  0 1 0  =  1 2 3 4 1         0 0 1 1 0 0 0   1 3 3 1 1  1 0 1       3 3 1 1 0  0 1 1 0 0 0 0  1 1 0 0 0 0 0 I K I ∗K 14 Convolution Layer - II Case study For an input volume (e.g. source image) of 32 × 32 × 3 and apply a filter of size 5 × 5 × 3 the dimension of the output map is: 28 × 28 × 1   0 1 1 1×1 0×0 0×1 0        0 0 1 1 1 0 0  1 4 3 4 1 ×0 ×1 ×0      0 0 0 1 1 1 0   1 2 4 3 3   1 0 1      ×1 ×0 ×1  0 0 0 1 1 0 0  ∗  0 1 0  =  1 2 3 4 1         0 0 1 1 0 0 0   1 3 3 1 1  1 0 1       3 3 1 1 0  0 1 1 0 0 0 0  1 1 0 0 0 0 0 I K I ∗K 14 Convolution Layer - II Case study For an input volume (e.g. source image) of 32 × 32 × 3 and apply a filter of size 5 × 5 × 3 the dimension of the output map is: 28 × 28 × 1   0 1 1 1 0×1 0×0 0×1        0 0 1 1 1 0 0  1 4 3 4 1 ×0 ×1 ×0      0 0 0 1 1 1 0   1 2 4 3 3   1 0 1     ×1 ×0 ×1   0 0 0 1 1 0 0  ∗  0 1 0  =  1 2 3 4 1         0 0 1 1 0 0 0   1 3 3 1 1  1 0 1       3 3 1 1 0  0 1 1 0 0 0 0  1 1 0 0 0 0 0 I K I ∗K 14 Convolution Layer - II Case study For an input volume (e.g. source image) of 32 × 32 × 3 and apply a filter of size 5 × 5 × 3 the dimension of the output map is: 28 × 28 × 1   0 1 1 1 0 0 0        0 0 1 1 1 0 0  1 4 3 4 1  ×1 ×0 ×1     0 0 0 1 1 1 0   1 2 4 3 3   1 0 1   ×0 ×1 ×0     0 0 0 1 1 0 0  ∗  0 1 0  =  1 2 3 4 1     ×1 ×0 ×1     0 0 1 1 0 0 0   1 3 3 1 1  1 0 1       3 3 1 1 0  0 1 1 0 0 0 0  1 1 0 0 0 0 0 I K I ∗K 14 Convolution Layer - II Case study For an input volume (e.g. source image) of 32 × 32 × 3 and apply a filter of size 5 × 5 × 3 the dimension of the output map is: 28 × 28 × 1   0 1 1 1 0 0 0        0 0 1 1 1 0 0  1 4 3 4 1      0 0 0 1 1 1 0   1 2 4 3 3   1 0 1   ×1 ×0 ×1     0 0 0 1 1 0 0  ∗  0 1 0  =  1 2 3 4 1     ×0 ×1 ×0     0 0 1 1 0 0 0   1 3 3 1 1  1 0 1  ×1 ×0 ×1      3 3 1 1 0  0 1 1 0 0 0 0  1 1 0 0 0 0 0 I K I ∗K 14 Convolution Layer - II Case study For an input volume (e.g. source image) of 32 × 32 × 3 and apply a filter of size 5 × 5 × 3 the dimension of the output map is: 28 × 28 × 1   0 1 1 1 0 0 0        0 0 1 1 1 0 0  1 4 3 4 1      0 0 0 1 1 1 0   1 2 4 3 3   1 0 1       0 0 0 1 1 0 0  ∗  0 1 0  =  1 2 3 4 1     ×1 ×0 ×1     0 0 1 1 0 0 0   1 3 3 1 1  1 0 1  ×0 ×1 ×0      3 3 1 1 0  0×1 1×0 1×1 0 0 0 0  1 1 0 0 0 0 0 I K I ∗K 14 Convolution Layer - II Case study For an input volume (e.g. source image) of 32 × 32 × 3 and apply a filter of size 5 × 5 × 3 the dimension of the output map is: 28 × 28 × 1   0 1 1 1 0 0 0        0 0 1 1 1 0 0  1 4 3 4 1      0 0 0 1 1 1 0   1 2 4 3 3   1 0 1       0 0 0 1 1 0 0  ∗  0 1 0  =  1 2 3 4 1         0 0 1 1 0 0 0   1 3 3 1 1  1 0 1  ×1 ×0 ×1      3 3 1 1 0  0×0 1×1 1×0 0 0 0 0  1×1 1×0 0×1 0 0 0 0 I K I ∗K 14 Convolution Layer - II Case study For an input volume (e.g. source image) of 32 × 32 × 3 and apply a filter of size 5 × 5 × 3 the dimension of the output map is: 28 × 28 × 1 General case Starting from an input volume of dimension: Height × Width × Depth and applying a convolution layer with kernel (filter) of size K × K × Depth we get an activation map of size [Height − (K − 1)] × [Width − (K − 1)] × 1. 15 Convolution Layer - III - HyperParameters Four hyperparameters control the size of the output volume: • Depth: number of filters (each looks at smth different in the input). • Stride: the step taken when sliding the filter. Usual practice - stride = 1 or 2 (3 or more very uncommon) • Zero-Padding: size of the number of 0s that surround the border of the input volume. Most common use: preserve spatial size of input volume, i.e. input and output have the same width and height. • Dilation: distance in between elements of the convolution kernel. 16 Convolution Layer - III - HyperParameters Convolution arithmetic animation :-) Source: https://github.com/vdumoulin/conv arithmetic 17 Convolution Layer - III - HyperParameter arithmetic Convolution output size computation How do kernel size, stride, padding, dilation influence the size of the output? 18 Convolution Layer - III - HyperParameter arithmetic Convolution output size computation How do kernel size, stride, padding, dilation influence the size of the output? Formulae  Hout Wout  Hin + 2 ∗ padding − dilation ∗ (kernel size − 1) − 1 = +1 stride   Win + 2 ∗ padding − dilation ∗ (kernel size − 1) − 1 = +1 stride Observation There is a constraint on strides: result of the division has to be an integer. E.g. For input of size Hin = 10, padding = 0, kernel size = 3, dilation = 1, use of stride h i = 2 is not possible because: Hin +2∗padding −dilation∗(kernel size−1)−1 +1 = stride (10 + 2 ∗ 0 − 1 ∗ (3 − 1) − 1)/2 + 1 = 7/2 + 1, which is not an integer 18 Convolution Layer - III - Parameter Sharing Parameter sharing is used in CNNs to control the number of learnable parameters. E.g. if output volume has 55 × 55 × 96 and filter size is 11 × 11 × 3, having a weight for each neuron would result in 290400 × 364 = 105705600 parameters just for one layer. =⇒ Make a simplifying assumption: if one feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2). =⇒ Parameters within a depth slice are shared: 11 × 11 × 3 × 96 = 34944 (+ 96 biases). 19 Convolution Layer - III - Learned Features Figure 5: Example filters learned by Krizhevsky et al. Each of the 96 filters shown here is of size [11x11x3], and each one is shared by the 55*55 neurons in one depth slice. Notice that the parameter sharing assumption is relatively reasonable: If detecting a horizontal edge is important at some location in the image, it should intuitively be useful at some other location as well due to the translationally-invariant structure of images. There is therefore no need to relearn to detect a horizontal edge at every one of the 55*55 distinct locations in the Conv layer output volume. Source: Stanford CS321n notes. 20 Convolution Layer - IV - PyTorch classes • Class Header: class torch.nn.Conv1d(in channels, out channels, kernel size, stride=1, padding=0, dilation=1, groups=1, bias=True) • Input: (N, Cin , Lin ) • Output: h (N, Cout , Lout ) Lout = Lin +2∗padding −dilation∗(kernel stride size−1)−1 i +1 • Formula: out(Ni , Coutj ) = bias(Coutj ) + PCin −1 k=0 weight(Coutj , k) ? input(Ni , k) 21 Convolution Layer - IV - Pytorch Conv2D • Class Header: class torch.nn.Conv2d(in channels, out channels, kernel size, stride=1, padding=0, dilation=1, groups=1, bias=True) • Input: (N, Cin , Hin , Win ) • Output:h (N, Cout , Hout , Wout ) i size[0]−1)−1 Hout = Hin +2∗padding [0]−dilation[0]∗(kernel +1 stride[0] h i size[1]−1)−1 +1 Wout = Win +2∗padding [1]−dilation[1]∗(kernel stride[1] • Formula: out(Ni , Coutj ) = bias(Coutj ) + PCin −1 k=0 weight(Coutj , k) ? input(Ni , k) 22 Convolution Layer - IV - Pytorch Conv3D • Class Header: class torch.nn.Conv3d(in channels, out channels, kernel size, stride=1, padding=0, dilation=1, groups=1, bias=True) • Input: (N, Cin , Din , Hin , Win ) • Output:h (N, Cout , Dout , Hout , Wout ) i size[0]−1)−1 Dout = Din +2∗padding [0]−dilation[0]∗(kernel +1 stride[0] i h Hin +2∗padding [1]−dilation[1]∗(kernel size[1]−1)−1 Hout = +1 stride[1] i h Win +2∗padding [2]−dilation[2]∗(kernel size[2]−1)−1 Wout = +1 stride[2] • Formula: out(Ni , Coutj ) = bias(Coutj ) + PCin −1 k=0 weight(Coutj , k) ? input(Ni , k) 23 Convolution Layer - IV - Grouped Convolutions Figure 6: a) Normal convolution, b) Grouped Convolution with 2 groups. Source: Optimizing Grouped Convolutions on Edge Devices, Gibson et al. 2018 24 Convolution Layer - V - Spatially Separable Convolutions Figure 7: Spatially separable convolution example - img. source: towardsdatascience.com • Advantage: reduced computational load: fewer kernel parameters –¿ fewer multiplications • Disadvantage: loss of representational power (not all 2D kernels can be factorized into a sequence of 2 1-D kernel applications) • Some useful kernels can be factorized this way: e.g. Sobel kernel for edge detection 25 Convolution Layer - VI - Depthwise Separable Convolutions • Separate the convolution operation into two separate operations: (i) a depthwise convolution and (ii) a pointwise convolution Figure 8: Typical 2D convolution. img. src: opengenus.org • No. multiplications for typical 2D conv: 128 3x3x3 kernels moving 5x5 times =⇒ 128 x 3 x x 3 x 5 x 5 = 86400 26 Convolution Layer - VI - Depthwise Separable Convolutions (cont) Figure 9: Composition of depth- and point-wise convolutions. img. src: opengenus.org Computational advantage • depthwise: 3 3x3x1 kernels moving 5 x 5 times =⇒ 3 x 3 x 3 x 1 x 5 x 5 = 675 • pointwise: 128 1x1x3 filters moving 5 x 5 times =⇒ 128 x 1 x 1 x 3 x 5 x 5 = 9600 • In general, reduction is proportional to 1 , k2 where k is size of kernel • Depthwise separable convolution used in small, efficient conv nets (e.g. MobileNet suite) 27 Pooling Layer • Modifies the input volume into a smaller and more manageable representation • Operates independently over each activation map 28 Pooling Layer Common settings: • filter size = 2, stride = 2 • filter size = 3, stride = 2 29 Pooling Layer • Used usually at final stages of a conv net architecture to replace fully connected layers • Tries to avoid overfitting by forcing feature maps to hold ”global” information that is relevant for the subsequent task (e.g. classification, identification) 30 Fully Convolutional Networks Fully Convolutional Networks • Used most often in semantic segmentation or in generative networks • Relies on two important operations: unpooling and upsampling (through transpose convolution / deconvolution ) Figure 10: Semantic Segmentation • Intuitively: Conv network captures context information (what) about the input, DeConv network (re)captures the spatial information (where) 31 Fully Convolutional Networks - Unpooling indices 1.5 1.7 1.4 1.3 2.0 2.1 1.8 1.6 2.3 1.9 1.5 1.4 2.2 2.1 1.6 1.7 maxpooling (1,1) (2,1) (0,2) (3,3) 2.1 1.8 2.3 1.7 unpooling 0 0 0 0 0 2.1 1.8 0 2.3 0 0 0 0 0 0 1.7 activations Figure 11: Source: Towards Data Science - Review: DeconvNet — Unpooling Layer (Semantic Segmentation) 32 Fully Convolutional Networks - Transpose Convolution The convolution operation can be expressed using a matrix and a reshape of the input into a linearized vector. Figure 13: Original convolution kernel Figure 12: Input=4 × 4, stride=1, no In convolution we have a many-to-one relationship (in the example 9-to-1) between input and output padding Source: Medium - Up-sampling with Transposed Convolution 33 Fully Convolutional Networks - Transpose Convolution • The transpose convolution operation can be expressed using a transposed matrix. • We implement a one-to-many relationship (in our example 1-to-9) between input and output. • Parameters of upsampling kernel are learned. Figure 14: Input=2 × 2, stride=1, no padding 34 Normalization Normalization Figure 15: Visual comparison of different normalization methods. N - batch dimension, C - channel (feature map) dimension, H,W - input size dimension Uses of normalization • address covariate-shift problem - simplification of the learning dynamic =⇒ higher learning rates are possible =⇒ faster convergence time • makes the loss surface smoother (magnitude of gradients bounded better =⇒ ”better behaved” gradients) • some form of normalization becomes necessary in very deep networks 35 Batch Normalization γ and β are learnable parameters → mean and variance of layer activations are decided by two simple parameters (as opposed to complex interactions between layers during learning) 36 Batch Normalization - Discussion Batch Normalization caveats • Normalization statistics in BatchNorm correspond to mean and variance of the mini-batch (instead of the whole data set) • small mini-batches may produce noisy estimates ⇒ negatively affects training • not suitable for recurrent connections in an RNN (recurrent activations will (and also shoud) have different statistics ⇒ would have to fit a separate BatchNorm for each time step 37 Batch Normalization - Discussion Fine Tuning • Question: use the statistics of mean and variance from the original dataset or the new one? • Most frameworks do not freeze the BatchNorm params (γ, β) when freezing a layer ⇒ they use the mini-batch statistics from the new dataset • If new dataset is large and mini-batch size is comparable (equal) to original dataset → can retrain statistics • If new dataset is small or mini-batch size is small → may be better to use original dataset statistics Where to place BatchNorm layer? • Most often before non linearity • Worth-while sometimes to test after an activation function 38 Classic Networks Classic Networks - LeNet 5 Figure 16: LeNet-5 - Y. Lecun, L. Bottou, Y. Bengio, P. Haffner (1998) • No. params: 60000 39 Classic Networks - AlexNet Figure 17: AlexNet - Alex Krizhevsky, Geoffrey Hinton, Ilya Sutskever (2012) • No. params: 60 million 40 Classic Networks - AlexNet Details / Retrospectives: • First CNN to win ImageNet challenge • First use of ReLU • heavy data augmentation (lots of params to train) • dropout of 0.5 • batch size of 128 • optimization using SGD + Momentum 0.9 • learning rate 1e-2, reduced by 10 manually when accuracy plateaus • L2 weight decay 5e-4 • in competition used 7 CNN ensemble: 18.2% → 15.2% 41 Classic Networks - VGG-16 Figure 18: VGG-16 - Karen Simonyan and Andrew Zisserman (2014) • No. params: 138 million 42 Classic Networks - VGG-16 • Uses only 3 × 3 CONV, stride 1, pad 1 and 2 × 2 MAX POOL stride 2 Figure 19: Block diagram of VGG-16 and VGG-19. Source: Stanford CS231n Lecture 9 - slide 42 43 Classic Networks - VGG-16 • Uses only 3 × 3 CONV, stride 1, pad 1 and 2 × 2 MAX POOL stride 2 • Stack of 3 × 3 conv (stride 1) layers has same effectve receptive field as one 7 × 7 conv layer ... but deeper, more non-linearities and fewer params: 3 × (32 C 2 ) vs 72 C 2 , where C is channels per layer Figure 19: Block diagram of VGG-16 and VGG-19. Source: Stanford CS231n Lecture 9 - slide 42 43 Classic Networks - VGG-16 • Uses only 3 × 3 CONV, stride 1, pad 1 and 2 × 2 MAX POOL stride 2 • Stack of 3 × 3 conv (stride 1) layers has same effectve receptive field as one 7 × 7 conv layer ... but deeper, more non-linearities and fewer params: 3 × (32 C 2 ) vs 72 C 2 , where C is channels per layer • Similar training procedure as AlexNet • FC7 features generalize well to other tasks Figure 19: Block diagram of VGG-16 and VGG-19. Source: Stanford CS231n Lecture 9 - slide 42 43 Classic Networks - GoogleNet - Inception Cell 44 Classic Networks - GoogleNet - Architecture Details: • Introduces bottleneck layers, using 1 × 1 convolutions to reduce computational complexity (reduce number of channels over which 3 × 3 and 5 × 5 conv. layers have to operate) • Philosophy of ”inception module”: design good local network topology and stack the modules on top of another • Apply parallel filter operations on the same input • Multiple receptive field sizes (1 × 1, 3 × 3, 5 × 5, • Pooling operation (3 × 3) • Concatenate filter outputs depth-wise 45 Classic Networks - GoogleNet - Architecture Figure 20: GoogleNet - Christian Szegedy et al (2014) • No. params: 5 million 46 Classic Networks - GoogleNet - Architecture Details • After the last convolutional layer, a global average pooling layer is used that spatially averages across each feature map, before final FC layer. • No longer multiple expensive FC layers • Uses auxilliary classification outputs to inject additional gradient at lower layers 47 Classic Networks - Residual Network (ResNet) - Skip connection Figure 21: Skip Connection y = x + F(x) ⇒ ∂E ∂y ∂E ∂E = · = · (1 + F 0 (x)) ∂x ∂y ∂x ∂y 48 Classic Networks Figure 22: ResNet - Kaiming He et al (2015) • No. params: 25 million (ResNet-50) • ImageNet winner 2015: 3.6% error rate (better than human performance - 5.1%) 49 Classic Networks - Residual Network (ResNet) - Architecture Details • Issue: deep networks using stacks of ”plain” conv layers suffer from optimization problem • Solution: Use layers to fit residual F (x) = H(x) − x, instead of H(x) directly (where H(x) would be the output of the ”plain” layers) • Stack residual blocks; periodically double no. filters and downsample spatially using stride 2 • No FC layers at the end • For deeper networks (ResNet-50+) use ”bottleneck layer” to improve computational efficiency • Training ResNet in practice: • BatchNorm after every CONV Layer, Xavier 2/ initialization • SGD + Momentum (0.9), Learning rate: 0.1 (divide by 10 when validation error plateaus) • Mini-batch 256, weight decay of 1e-5, no dropout 50 Classic Networks - Wide ResNet Highlights • Performs experiments to explore the problem of diminishing feature reuse (i.e. assumed weakness in original ResNet design, that can cause the network to forego using the residual blocks, because of skip connections) • Experiments with • • • • Adding Dropout regularization within the Residual Block Different number of conv layers within a residual block (l-factor) Multiplication of feature map size within residual block (k-factor) conv-bn-relu vs bn-relu-conv (post-activation vs pre-activation) 51 Classic Networks - Wide ResNet Figure 23: Source - Wide Residual Networks, Zagoruyko and Komodakis, 2017 52 Classic Networks - Wide ResNet Figure 24: Source - Wide Residual Networks, Zagoruyko and Komodakis, 2017 53 Classic Networks - Wide ResNet • B(3,3) - original ”basic” block • B(3,1,3) - with one extra 1×1 layer • B(1,3,1) - with the same dimensionality of all convolutions, straightened bottleneck • B(1,3) - the network has alternating 1×1 - 3×3 convolutions everywhere • B(3,1) - similar idea to the previous block • B(3,1,1) - Network-in-Network style block Figure 25: Source - Wide Residual Networks, Zagoruyko and Komodakis, 2017 54 Classic Networks - Wide ResNet Figure 26: Source - Wide Residual Networks, Zagoruyko and Komodakis, 2017 55 Classic Networks - Wide ResNet Figure 27: Source - Wide Residual Networks, Zagoruyko and Komodakis, 2017 56 Classic Networks - Wide ResNet Takeaway: • widening consistently improves performance across residual networks of different depth • increasing both depth and width helps until the number of parameters becomes too high and stronger regularization is needed • main power of residual networks is in residual blocks, and not in extreme depth 57 Classic Networks - ResNeXt Highlights • Focuses on three central aspects inspired by previous conv nets (VGG, ResNet and Inception): uniform blocks, residual connections and multi-branch convolutional architectures • Uses grouped convolutions to implement multi-branch aggregated transformations • Considers cardinality (number of branches) as the important control dimension in network performance • Strives to develop an easy-to-use template for building both wide and deep conv nets 58 Classic Networks - ResNeXt Figure 28: Equivalent building blocks of ResNeXt. (a): Aggregated residual transformations, the same as Fig. 1 right. (b): A block equivalent to (a), implemented as early concatenation. (c): A block equivalent to (a,b), implemented as grouped convolutions [24]. Notations in bold text highlight the reformulation changes. A layer is denoted as (# input channels, filter size, # output channels). Source: Aggregated Residual Transformations for Deep Neural Networks. Xie et al., 2017 59 Classic Networks - ResNeXt Figure 30: Increased cardinality is Figure 29: Increased cardinality leads to better accuracy better than increased depth or increased width of the network. Source: Aggregated Residual Transformations for Deep Neural Networks. Xie et al., 2017 60 Classic Networks - DenseNet Highlights • Simplify the connectivity pattern between layers as introduced in other networks (e.g. ResNet) • Focus on obtaining representational power through feature reuse, instead of very deep or very wide network architectures • It introduces direct connections between any two layers with the same feature-map size • Information (feature maps) from previous volumes is concatenated to subsequent volumes. Concatenation amount controlled by a growth factor. Figure 31: Source: Densely Connected Convolutional Networks, Huang et al., 2018 61 Classic Networks - DenseNet Figure 32: Source: Densely Connected Convolutional Networks, Huang et al., 2018 62 Classic Networks - DenseNet Figure 33: Full schematic representation of DenseNet-121. Source: Pablo Ruiz Blog, https://towardsdatascience.com/understanding- and- visualizing- densenets- 7f688092391a 63 Classic Networks - DenseNet Figure 34: Source: Densely Connected Convolutional Networks, Huang et al., 2018 64 Classic Networks - Performance vs Num. Operations Figure 35: Source: Benchmark analysis of representative deep neural network architectures, 2018, S. Bianco et al. 65 Classic Networks - Performance vs Speed (fps) Figure 36: Source: Benchmark analysis of representative deep neural network architectures, 2018, S. Bianco et al. 66 Transfer Learning Figure 37: Source: Stanford CS231n Lecture 9, 2019 67 Transfer Learning Figure 38: Source: Stanford CS231n Lecture 9, 2019 68 Transfer Learning • Transfer Learning with CNNs is the norm, not the exception • CNNs trained on ImageNet used in object detection (e.g. Fast R-CNN), image captioning (CNN + RNN) • Takeaway: if you have some dataset of interest, but it has < 1 million images • Find a large dataset with similar data, train big ConvNet there • Transfer learn to your dataset 69 70 Neural Networks - Lecture 4 Applications of Convolutional Neural Networks Outline ● Semantic and Instance Segmentation ● Object Detection ● CNN for Timeseries Analysis ● Fine tuning techniques 2/53 Task definitions Source: AI Pool blog: How does instance segmentation work 3/53 Semantic Segmentation 4/53 Semantic Segmentation – DeepLab-v3 ● Highlights: – Use of dilated convolutions (atrous convolutions) to enable larger effective receptive field sizes. Helps in capturing spatial context information – Use of Atrous Spatial Pyramid Pooling to extract content information from several scale levels at the same time 5/53 Semantic Segmentation – DeepLab-v3 ● Conv Layer vs Dilated Conv Layer: – Use of dilated convolutions to extract larger information context – Output stride (i.e. reduction factor for image resolution) limited to 16, as larger values are harmful for semantic segmentation 6/53 Semantic Segmentation – DeepLab-v3 ● Atrous Spatial Pyramid Pooling: – Uses idea that it is effective to resample features at different scales for more accurate region classification – Implements Spatial Pyramid Pooling using a combination of atrous convolutions and global average pooling – BatchNorm applied in between all diluted-conv filters 7/53 Instance Segmentation 8/53 Instance Segmentation Approaches ● Two main streams of methods in Instance Segmentation – Proposal based – propose possible regions of interest where an object might be found → extract features from that region → use features to classify which pixel is part of the object and what class it actually is – Segmentation based – start from segmentation as first objective (previously explained models) and learn specially designed transformations or instance boundaries 9/53 Instance Segmentation – the case of MaskRCNN ● ● Region of Interest (RoI) proposal based architecture Builds upon Faster-RCNN [Ren et al., 2015] but introduces two important aspects – RoI Align Layer – binary mask prediction (for individual instances) The MaskR-CNN framework for instance segmentation. Source: Mask-RCNN, He et al., 2018 10/53 Instance Segmentation – A prior on FasterRCNN ● ● Faster R-CNN used for object detection Fundamental concepts: Region Proposal Network, Anchor Boxes, RoI Pooling The different layers of Faster R-CNN. Source: alegion.com/faster-r-cnn 11/53 Instance Segmentation – A prior on FasterRCNN ● ● Faster RCNN has a backbone network (VGG-16) which feeds the Region Proposal Network and the Class and BBox regressor network RPN proposes regions where objects might be found – Uses 9 anchor types (3 different scales, 3 different aspect ratios) – Predicts 2k object classification scores (whether foreground or background object) and 4k region regression coords; k – number of anchors Faster-RCNN Architecture. Source: https://towardsdatascience.com/faster-rcnn-object-detection-f865e5ed7fc4 12/53 Instance Segmentation – A prior on FasterRCNN ● Anchor Boxes – Take the final feature map from backbone network as input – Their sizes are relative to input image Downsampling ratios of CNN feature maps. Source: Mathworks, anchor boxes for object detection Anchor boxes: 3 scales, 3 aspect ratios. Source: Medium @smallfishbigsea 13/53 Instance Segmentation – A prior on FasterRCNN ● RoI Pooling – For collected Region Proposals, reuse existing convolutional feature map to extract features – Crop the convolutional feature map using each proposal and then resize each crop to a fixed sized (e.g. 14 x 14 x convdepth) using interpolation (usually bilinear) Region of Interest Pooling. Source: TryoLabs 14/53 Instance Segmentation – A prior on FasterRCNN ● Final Object Class Assignment and BBox fine-tuning – Use RoI Pooling features – Classify proposals into one of the classes, plus a background class – Better adjust BBox for proposal given the predicted class R-CNN architecture – final object classifications and BBox regression. Source: TryoLabs 15/53 Instance Segmentation – Back to Mask-RCNN ● Mask-RCNN changes – Uses different backbones: ResNext-101, FPN (Feature Pyramid Network) – Adds a fully convolutional Mask Prediction branch Left/Right panels show the heads for the ResNet C4 and FPN backbones, to which a mask branch is added. Numbers denote spatial resolution and channels. Arrows denote either conv, deconv, or fc layers. All convs are 3×3, except the output conv which is 1×1, deconvs are 2×2 with stride 2, and we use ReLU in hidden layers. Left: ‘res5’ denotes ResNet’s fifth stage, which for simplicity we altered so that the first conv oper- ates on a 7×7 RoI with stride 1 (instead of 14×14 / stride 2 as in [19]). Right: ‘×4’ denotes a stack of four consecutive convs. Source: Mask R-CNN, He et al., 2018 16/53 Instance Segmentation – Back to Mask-RCNN ● Mask-RCNN changes – Replaces RoI Pooling with a RoI Align Layer ● Avoids quantization of RoI coordinates or spatial bins to feature map grid RoI Align Layer in action. Source: Mask R-CNN, He et al., 2018 17/53 Instance Segmentation – Back to Mask-RCNN ● Mask-RCNN changes – Replaces RoI Pooling with a RoI Align Layer ● Avoids quantization of RoI coordinates or spatial bins to feature map grid RoI Align Layer in action. Source: Mask R-CNN, He et al., 2018 18/53 Instance Segmentation – Mask-RCNN Results 19/53 A note on Deformable Convolution Although convolutions help with learning spatially-local biases, they suffer greatly from signal-subsampling issues: ● ● We are limited in processing patterns using rectangular patterns, whereas most natural patterns are not best suited to this template. We are limited in sampling points only from the discrete grid. Think of an 8x8 grid in which the object of interest is 2x2. We would not have a direct correspondent in a latent 2x2 grid which for said object due to downsampling. – In this case, ideally we should be able to query fractional positions in the grid for a 0.5x0.5 pixel 20/53 A note on Deformable Convolution Deformable convolutions work similarly to RoI Pool / Alling operations: ● They can learn non-rectangular patterns by computing pixel offsets. ○ For a given filter, instead of applying it to its discrete neighbours, we compute a set of (k, dx, dy) offsets based on which new pixels will be selected. ○ Since the (dx, dy) values can be fractional, we can now query subpixels features: i.e. query at a pixel at a position such as (1, 0.5) ○ Ideally we would like to have such offsets for each spatial position in each filter - but this will blow-up memory, we’re required to compute (CHWk2) additional values at each stage. 21/53 A note on Deformable Convolution The deformation mechanism can be used to build more general mechanisms: ● We make a trade-off between modelling capacity and compute, by setting a constant spatial-offset (k, x, y) for each channel C ● We can retrieve an adaptive-pooling mechanism. ○ Average pooling is just a convolution with a uniform kernel with a value of 1/K, where K is the total filter size ○ Learning a scalar for each k pixel in the filter allows us to learn a sampling operation that can be a mixture between (Average,Max,Min) operations or something more general. ● This scaling mechanism can be extended to the normal deformable convolution as well ○ This can be a double-edged sword, since we might not use a kernel at it’s full capacity. Scaling a pixel k in the kernel with a value of 0, renders the pixel useless. 22/53 Object Detection 23/53 Object Detection Approaches ● Two main approaches to Object Detection – Multi stage predictors - region proposals and region classification / bounding box regression happen in separate steps): e.g. Fast/Faster R-CNN, R-FCN – Single stage predictors - use a single forward pass through the architecture to perform both object classification and object bounding box regression at the same time: e.g. YOLO, SSD, RetinaNet 24/53 Object Detection Approaches The “anatomy” of an object detector. Source YOLOv4, Bochkovskiy et al., 2020 25/53 A History of YOLOs - YOLO v1 ● ● Processes frames in real time (45 fps) Trains on full image and has a single pipeline for detection and localization (as opposed to competitors such as R-CNN) – Has high detection accuracy – Has lower (than competitors at the time) localization accuracy 26/53 A History of YOLOs - YOLOv1 Algorithm ● ● YOLO algorithm works using the following three techniques: – Residual blocks – Bounding box regression – Intersection Over Union (IOU) Conceptual design diagram 27/53 A History of YOLOs - YOLO v1 Algorithm ● Bounding box regression – 5 parameters per bounding box: ● (x, y) - relative to the bounds of the grid cell) ● width, height – relative to whole image ● c – confidence = how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts – ● Each grid cell predicts B bounding boxes + class conditional probability – ● Confidence = Pr(Object) ∗ IOUtruthpred class conditional probability p(class i | Object) – conditioning is on grid cell containing an object => YOLO makes S x S x (B x 5 + C) predictions 28/53 A History of YOLOs - YOLO v1 Algorithm ● Intersection over Union – Metric used to force predicted output box to coincide with ground truth 29/53 A History of YOLOs - YOLO v1 Algorithm ● Loss Function: – Regression objective: o

Use Quizgecko on...
Browser
Browser