Lecture 2 - Deep Feedforward Networks PDF
Document Details
UVA
Yuki M. Asano
Summary
These are lecture notes on deep feedforward networks, focusing on modularity, nonlinearities, gradient-based learning, and backpropagation. The lecture is part of the Deep Learning 1 course at the University of Amsterdam (UvA).
Full Transcript
Lecture 2: Deep Feedforward Networks Deep Learning 1 @ UvA Yuki M. Asano DEEP LEARNING ONE - 1 Lecture Overview o Modularity in deep learning o Deep learning nonlinearities o Gradient-based learning o Chain rule o Backpropagation UVA DEEP LEARNING COURSE – EFS...
Lecture 2: Deep Feedforward Networks — Deep Learning 1 @ UvA, Yuki M. Asano

Lecture Overview
o Modularity in deep learning
o Deep learning nonlinearities
o Gradient-based learning
o Chain rule
o Backpropagation

Last time
o We went from here … to here [two figures on the slides]
o So how do deep neural networks do it?

From linear functions to nonlinear = from shallow to deep
o Consider the function f = ReLU(Ax), with A ∈ ℝ^{n×m}, x ∈ ℝ^{m×1}, and ReLU(x) = max(0, x) applied elementwise
o E.g., ReLU([−1, 3]ᵀ) = [0, 3]ᵀ: the negative coordinate is clipped to 0, the positive one passes through
o ReLU(3) = 3 — on the positive part the map is still linear, but we want something non-linear!

From linear functions to nonlinear = from shallow to deep
o Consider f = ReLU(Ax) as above. What about y = ReLU(Bf) = ReLU(B ReLU(Ax))? (A NumPy sketch of this construction for XOR follows below.)
o [Figure: the original x space, the inputs that end up non-zero after A and ReLU (f = ReLU(Ax)), and then after B and ReLU (y = ReLU(B ReLU(Ax)))]

We've learned XOR
o [Figure: the original x space partitioned by the network]
o In practice (5-layer MLP): https://arxiv.org/abs/1906.00904

Deep Feedforward Networks

Deep feedforward networks
o Feedforward neural networks
  ◦ Also called multi-layer perceptrons (MLPs)
  ◦ The goal is to approximate some function f*
  ◦ A feedforward network defines a mapping y = f(x; θ)
  ◦ Learns the value of the parameters θ that result in the best function approximation
o No feedback connections
  ◦ When including feedback connections, we obtain recurrent neural networks
  ◦ Nb: brains have many feedback connections

From the Deep Learning book, chapter 6: "Deep feedforward networks, also often called feedforward neural networks, or multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a feedforward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation. These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself. When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks, presented in chapter 10."
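To make the ReLU-stacking argument concrete, here is a minimal NumPy sketch (not from the slides) of the hand-crafted XOR network from the Deep Learning book's example: one hidden layer of two ReLU units followed by a linear readout (the readout is linear here, rather than a second ReLU as in the y = ReLU(B ReLU(Ax)) slide).

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# All four XOR inputs, one per row.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-crafted parameters of the textbook XOR network (Goodfellow et al., ch. 6).
W = np.array([[1, 1], [1, 1]])   # input -> hidden weights
c = np.array([0, -1])            # hidden biases
w = np.array([1, -2])            # hidden -> output weights
b = 0                            # output bias

h = relu(X @ W + c)              # f = ReLU(Ax): hidden representation
y = h @ w + b                    # linear readout on the ReLU features

print(h)   # [[0 0] [1 0] [1 0] [2 1]] -- (0,1) and (1,0) map to the same point
print(y)   # [0 1 1 0] -- XOR, which no linear model on x alone can produce
```

The hidden ReLU layer collapses (0,1) and (1,0) onto the same point, which is exactly the folding of the input space sketched on the slides.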
Feedforward networks are of extreme importance to machine learning.

Deep feedforward networks
o As a formula, a composite of functions:
  y = f(x; θ) = a_L(x; θ_{1,…,L}) = h_L(h_{L−1}(… h_1(x; θ_1) …; θ_{L−1}); θ_L)
  where θ_l is the parameter of the l-th layer
o We can simplify the notation to
  a_L = f(x; θ) = h_L ∘ h_{L−1} ∘ ⋯ ∘ h_1(x)
  where each function h_l is parameterized by θ_l

Neural networks in blocks
o We can visualize a_L = h_L ∘ h_{L−1} ∘ ⋯ ∘ h_1(x) as a cascade of blocks with forward connections (feedforward architecture):
  Input → h_1 → h_2 → h_3 → h_4 → h_5 → Output  (the h_l are the hidden layers)

What is a module?
o Module ⇔ building block ⇔ transformation ⇔ function
o A module receives as input either data x or another module's output
o A module returns an output a based on its activation function h(…)
o A module may or may not have trainable parameters w
o Examples: f = Ax, f = exp(x)
o [Figure 6.2 from the Deep Learning book, chapter 6: the feedforward network used to solve the XOR example, with a single hidden layer containing two units, drawn in two styles — one node per unit (explicit and unambiguous, but unwieldy for networks larger than this example) and one node per layer]

Requirements
(1) The activation functions must be 1st-order differentiable (almost) everywhere
(2) Take special care when there are cycles in the architecture of blocks
o No other requirements
o We can build as complex hierarchies as we want

Feedforward model
o The vast majority of models
o Almost all CNNs/Transformers
o As simple as it gets
o Feedforward architecture: Input → h_1 → h_2 → h_3 → h_4 → h_5 → Output (a code sketch of this composition follows below)
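The cascade-of-blocks view a_L = h_L ∘ … ∘ h_1(x) maps directly onto code: a feedforward network is an ordered list of modules, each consuming the previous module's output. A minimal sketch, with illustrative module sizes not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(w, b):
    """A linear module h(x; w, b) = x @ w + b, returned as a closure."""
    return lambda x: x @ w + b

def relu(x):
    return np.maximum(0, x)

# A small feedforward hierarchy: linear -> ReLU -> linear -> ReLU -> linear.
modules = [
    linear(rng.normal(size=(4, 8)), np.zeros(8)), relu,
    linear(rng.normal(size=(8, 8)), np.zeros(8)), relu,
    linear(rng.normal(size=(8, 1)), np.zeros(1)),
]

def forward(x, modules):
    # a_L = h_L(h_{L-1}(... h_1(x) ...)): each output is the next module's input.
    for h in modules:
        x = h(x)
    return x

x = rng.normal(size=(2, 4))       # a batch of 2 inputs with 4 features
print(forward(x, modules).shape)  # (2, 1)
```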
Non-linear feature learning perspective
o Linear models
  ◦ logistic regression, linear regression
  ◦ convex optimization (linear regression even has a closed-form solution)
  ◦ can be fit efficiently and reliably
  ◦ limited capacity
o Extend to nonlinear models
  ◦ apply the linear model not to x itself but to a transformed input φ(x), where φ is a nonlinear transformation
  ◦ kernel trick, e.g., RBF kernel
  ◦ nonlinear dimension reduction

Non-linear feature learning perspective
o The strategy of deep learning is to learn φ:  y = f(x; θ, w) = φ(x; θ)ᵀ w
  ◦ φ defines a hidden layer
  ◦ find the θ that corresponds to a good* representation
  ◦ no longer a convex training problem
o We design families φ(x; θ) rather than the right function
  ◦ Encode human knowledge to help generalization
  * good = linearly separable (in the case of classification)

Non-linear feature learning perspective: learning XOR
o [Figure 6.1 from the Deep Learning book: solving XOR by learning a representation. The bold numbers on the plot indicate the value the learned function must output at each point. Left, the original x space: a linear model applied directly to the input cannot implement XOR — when x1 = 0 the output must increase as x2 increases, but when x1 = 1 it must decrease as x2 increases, while a linear model must apply a fixed coefficient to x2. Right, the learned h space (numbers in nodes = thresholds): the points become linearly separable.]
o In the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem (a numerical sketch of this φ-then-linear idea follows below)

Directed acyclic graph models
o We can mix up our hierarchies
o Makes sense when we have good knowledge of the problem domain
o Makes sense when combining multiple inputs & modalities
  ◦ E.g., RGB & LIDAR, or combining images + text
o Interweaved & skip connections feeding a loss: Input → h_1 … h_7 → Loss

Hierarchies of modules
o Data-efficient modules and hierarchies
  ◦ Trade-off between model complexity and efficiency
  ◦ Often, more training iterations with a "weaker" model beat fewer iterations with a "stronger" one*
  ◦ ReLUs are basically half-linear functions, but give SoTA results, (also) because they train faster
o Not too complex modules, better complex hierarchies
  ◦ Again, ReLUs are basically half-linear functions, but give SoTA
o Use parameters smartly
  ◦ Often, the real constraint is GPU memory
o Compute modules in the right order to feed the next modules
* Not for extremely large models
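The "apply a linear model to φ(x)" idea can be illustrated even without learning θ: below, φ is a fixed random ReLU feature map and only the linear readout w is fit, which already makes XOR solvable. This is a toy sketch of the feature-learning perspective, not the lecture's setup (there, θ is learned jointly by gradient descent).

```python
import numpy as np

rng = np.random.default_rng(1)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0., 1., 1., 0.])               # XOR targets

# phi(x; theta): a fixed random nonlinear feature map (theta is NOT trained here).
Theta = rng.normal(size=(2, 50))
bias = rng.normal(size=50)
Phi = np.maximum(0, X @ Theta + bias)        # hidden representation phi(x)

# Fit only the linear readout w on top of phi(x), by least squares.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.round(Phi @ w, 3))                  # close to [0, 1, 1, 0]
```

A linear model on the raw x cannot produce [0, 1, 1, 0]; on the nonlinear features it can, which is the whole point of learning (or here, randomly choosing) φ.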
Loopy connections
o A module's past output is the module's future input
o We must take care of cycles, i.e., unfold the graph ("Recurrent Neural Networks")
o Mostly not used (anymore)
o Loopy connections (must be unfolded): Input → h_1 → h_2 → h_3 → h_4 → h_5 → Output

How to get w? Gradient-based learning
o The nonlinearity causes the loss function to be nonconvex
  ◦ no linear-equation solution
o We need to train the network with iterative, gradient-based optimizers
  ◦ Stochastic gradient descent
o No convergence guarantee, and sensitive to the initialization of the parameters

How to get w? Gradient-based learning
o To use the gradient to adjust weights, we need some measuring stick (a "loss" or "cost" function)

Cost function
o Usually, maximum likelihood on the training set:
  w* = arg max_w ∏ p_model(y | x; w)
  where p_model(y | x) is the output of the last layer
  ◦ The idea of maximum likelihood estimation is to find the parameters of the model that best explain the data
o Taking the logarithm, maximizing the likelihood amounts to minimizing the negative log-likelihood:
  ℒ(w) = −E_{x,y∼p̂_data} log p_model(y | x; w)
  ◦ which is equivalently described as the cross-entropy between the training data and the model distribution; p̂_data is the empirical data distribution

Cost functions (from the Deep Learning book, chapter 6):
  J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).   (6.12)
The specific form of the cost function changes from model to model, depending on the specific form of log p_model. The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded. For example, as we saw in section 5.5.1, if p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost,
  J(θ) = ½ E_{x,y∼p̂_data} ||y − f(x; θ)||² + const,   (6.13)
up to a scaling factor of ½ and a term that does not depend on θ. The discarded constant is based on the variance of the Gaussian distribution, which in this case we chose not to parametrize. Previously, we saw the equivalence between maximum likelihood estimation with an output distribution and minimization of mean squared error for a linear model, but in fact the equivalence holds regardless of the f(x; θ) used to predict the mean of the Gaussian. An advantage of this approach of deriving the cost function from maximum …
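Equation (6.13) says that with a Gaussian output model p_model(y | x) = N(y; f(x; θ), I), the negative log-likelihood equals the (halved) squared error plus a θ-independent constant. A quick numerical check of that identity; the targets and predictions below are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)

y = rng.normal(size=(100, 3))          # targets
f = rng.normal(size=(100, 3))          # model predictions f(x; theta), stand-in values
d = y.shape[1]

# Negative log-likelihood of y under N(y; f, I), averaged over the data.
nll = np.mean(0.5 * np.sum((y - f) ** 2, axis=1) + 0.5 * d * np.log(2 * np.pi))

# Mean squared error term of eq. (6.13), plus the theta-independent constant.
mse_half = np.mean(0.5 * np.sum((y - f) ** 2, axis=1))
const = 0.5 * d * np.log(2 * np.pi)

print(np.isclose(nll, mse_half + const))   # True: they differ only by a constant
```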
Cost functions
o Euclidean loss: h(x, y) = 0.5 ||y − x||²
o Suitable for regression problems
o Sensitive to outliers
  ◦ Magnifies errors quadratically
o Other cost functions: cross-entropy, KL-divergence (see also ML 1)

Cost functions
o Main point: cost functions describe what the model should do
o The gradient of the cost function must be large and predictable enough to serve as a good guide for learning algorithms
o Functions that saturate (become very flat) undermine this objective
o In many cases, this is due to the activation functions saturating
o The negative log-likelihood helps to avoid this problem for many models, because it can undo the exponentiation of the output (e.g., see the softmax definition later)
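The point about the negative log-likelihood "undoing the exponentiation" can be seen with the softmax: even when the softmax has saturated, the gradient of −log p with respect to the logits is simply softmax(z) − onehot(y) and stays informative, whereas a squared error on the probabilities goes flat. A small sketch with made-up logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # standard stabilization trick
    e = np.exp(z)
    return e / e.sum()

z = np.array([10.0, -5.0, -5.0])        # badly saturated logits
y = np.array([0.0, 1.0, 0.0])           # true class is 1; the model is confidently wrong

p = softmax(z)

# Gradient of the negative log-likelihood -log p[true class] w.r.t. the logits z:
grad_nll = p - y                        # large, informative signal

# Gradient of the squared error 0.5 * ||p - y||^2 w.r.t. z, via the softmax Jacobian:
J = np.diag(p) - np.outer(p, p)         # Jacobian dp/dz
grad_mse = J @ (p - y)                  # nearly zero, because the softmax saturated

print(grad_nll)   # roughly [ 1, -1,  0]
print(grad_mse)   # roughly [ 0,  0,  0] -- the learning signal has vanished
```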
Deep learning modules

Activation functions
o Define how the weighted sum of the input is transformed into an output from a node or nodes in a layer of the network
o If the output range is limited, the function is called a "squashing function"
o The choice of activation function has a large impact on the capability and performance of the neural network
o Different activation functions may be combined, but this is rare
o All hidden layers typically use the same activation function
o Activation functions need to be differentiable at most points

Linear units
o x ∈ ℝ^{1×d}, w ∈ ℝ^{m×d}:
  h(x; w) = x·wᵀ + b,   dh/dx = w
o Identity activation function
o No activation saturation
o Hence, strong & stable gradients
  ◦ Reliable learning with linear modules

Rectified Linear Unit (ReLU)
o h(x) = max(0, x)
o ∂h/∂x = 1 when x > 0; 0 when x ≤ 0

Rectified Linear Unit (ReLU)
o Advantages
  ◦ Sparse activation: in a randomly initialized network, ~50% of units are active
  ◦ Better gradient propagation: fewer vanishing-gradient problems compared to sigmoidal activation functions, which saturate in both directions
  ◦ E.g., for sin(x), …

Leaky ReLU and Parametric ReLU (PReLU)
o h(x) = x when x > 0; a·x when x ≤ 0
o ∂h/∂x = 1 when x > 0; a when x ≤ 0
o Leaky ReLUs allow a small, positive gradient when the unit is not active
o Parametric ReLUs (PReLU) treat a as a learnable parameter

Exponential Linear Unit (ELU)
o h(x) = x when x > 0; exp(x) − 1 when x ≤ 0
o ∂h/∂x = 1 when x > 0; exp(x) when x ≤ 0
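The piecewise definitions above translate directly into code. A small reference sketch of ReLU, Leaky ReLU and ELU with the derivatives given on the slides (the Leaky-ReLU slope a = 0.01 is a common default assumed here; PReLU would learn a instead):

```python
import numpy as np

def relu(x):                 return np.maximum(0, x)
def d_relu(x):               return np.where(x > 0, 1.0, 0.0)

def leaky_relu(x, a=0.01):   return np.where(x > 0, x, a * x)
def d_leaky_relu(x, a=0.01): return np.where(x > 0, 1.0, a)

def elu(x):                  return np.where(x > 0, x, np.exp(x) - 1)
def d_elu(x):                return np.where(x > 0, 1.0, np.exp(x))

# Evaluate each activation and its derivative on a few sample points.
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f, df in [(relu, d_relu), (leaky_relu, d_leaky_relu), (elu, d_elu)]:
    print(f.__name__, f(x), df(x))
```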
Gaussian Error Linear Unit (GELU)
o Similar to ELU, but non-monotonic (change in gradient sign): it has a small "bump" when x < 0
o A smooth approximation to the rectifier
o Serves as the default activation for models such as BERT and Vision Transformers — state of the art (see Lecture 4)
o https://arxiv.org/pdf/1710.05941

Sigmoid and Tanh
o tanh(x) has the better output range [−1, +1]
  ◦ Data centered around 0 (not 0.5) → stronger gradients
  ◦ Less "positive" bias for the next layers (mean 0, not 0.5)
o Both saturate at the extremes → 0 gradients
  ◦ Easily become "overconfident" (0 or 1 decisions)
  ◦ Undesirable for middle layers
  ◦ Gradients ≪ 1, shrinking further with chain multiplication
o tanh(x) is better for middle layers
o Sigmoids for outputs, to emulate probabilities
  ◦ Still tend to be overconfident

Sigmoid and Tanh
o Sigmoid: h(x) = 1 / (1 + e^{−x})
o Tanh: h(x) = (e^{x} − e^{−x}) / (e^{x} + e^{−x})
o ∂h/∂x = σ(x)(1 − σ(x)) for the sigmoid, and 1 − tanh²(x) for tanh
…
… → Conclusion, but skip details / don't try to understand the maths.
3rd pass: try to recap what you didn't understand, reread those parts, be critical. … Dive into the code.
After every pass you can drop out. Which is good. No need to detail-read every paper.

Backpropagation

Backprop: even the former head of Tesla AI thinks it's important.

Backpropagation ⟺ Chain rule
o The neural network loss is a composite function of modules
o We want the gradient w.r.t. the parameters of the l-th layer:
  dℒ/dw_l = dℒ/dh_L · dh_L/dh_{L−1} · … · dh_l/dw_l  ⇒  dℒ/dw_l = dℒ/dh_l · dh_l/dw_l
  where dℒ/dh_l is the gradient of the loss w.r.t. the module output, and dh_l/dw_l is the gradient of the module w.r.t. its parameters
o Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient (a worked numerical check follows below)
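As a sanity check on dℒ/dw_l = (dℒ/dh_l)·(dh_l/dw_l), here is a tiny hand-derived backward pass for a one-hidden-layer network (sigmoid hidden unit, squared-error loss), compared against a finite-difference gradient. The architecture and numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(3)
x, t = rng.normal(size=3), 1.0           # one input vector, scalar target
w1, w2 = rng.normal(size=3), rng.normal()

def sigmoid(z): return 1 / (1 + np.exp(-z))

def loss(w1, w2):
    h = sigmoid(w1 @ x)                  # h_1 = sigma(w1 . x)
    y = w2 * h                           # h_2 = w2 * h_1
    return 0.5 * (y - t) ** 2            # L = 0.5 (y - t)^2

# Forward pass, keeping intermediate values for the backward pass.
h = sigmoid(w1 @ x)
y = w2 * h

# Backward pass: dL/dw_l = dL/dh_l * dh_l/dw_l, with dL/dh_l built recursively.
dL_dy  = y - t                           # dL/dh_2
dL_dw2 = dL_dy * h                       # dh_2/dw2 = h
dL_dh  = dL_dy * w2                      # dh_2/dh_1 = w2
dL_dw1 = dL_dh * h * (1 - h) * x         # dh_1/dw1 = sigma'(w1.x) * x

# Numerical check of dL/dw2 by central finite differences.
eps = 1e-6
num = (loss(w1, w2 + eps) - loss(w1, w2 - eps)) / (2 * eps)
print(np.isclose(dL_dw2, num))           # True
```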
Backpropagation ⟺ Chain rule!!!
o Backpropagating gradients means repeatedly computing 2 quantities:
  dℒ/dw_l = dℒ/dh_l · dh_l/dw_l
o For dh_l/dw_l, just compute the Jacobian of the l-th module w.r.t. its parameters w_l
o Very local rule → "every module looks for its own"
o Since computations can be very local, this means that
  ◦ graphs can be complex
  ◦ modules can be complex, if differentiable

Backpropagation ⟺ Chain rule, but as an algorithm
o Backpropagating gradients means repeatedly computing 2 quantities:
  dℒ/dw_l = dℒ/dh_l · dh_l/dw_l
o For dℒ/dh_l we apply the chain rule again, to recursively reuse computations:
  dℒ/dh_l = dℒ/dh_{l+1} · dh_{l+1}/dh_l
  ◦ Recursive rule → computation-friendly
  ◦ dh_{l+1}/dh_l is the gradient of a module w.r.t. its module input
o Remember, the output of a module is the input for the next one: a_l = x_{l+1}

But you know this already from ML 1 … right?

But why do we actually use backprop?
o Quiz: what are the advantages of backprop?
o 1) it's the most accurate way of training neural networks
o 2) it's how the brain also learns
o 3) it implicitly models recurrent structures in neural networks
o 4) otherwise you cannot even train a 3x3x3-neuron MLP

Regarding point 4:
o Remember, we were able to find the gradients for x1 without any backprop magic
o This works easily for a 3x3x3 MLP
o [Figure: a small MLP with inputs x1, x2, outputs y1, y2, and loss L]

Re: point 2: the backprop in us is different! https://www.nature.com/articles/ncomms13276

Computational feasibility
o y = f(x)
o Each x's contribution to y is given by the Jacobian df/dx
o Suppose x and y are some intermediate outputs of size 32x32x512
o Then storing the Jacobian would take ~1 TB of memory (see the arithmetic below)

Chain rule visualized
o A chain of modules: x_0 → f_1 → x_1 → f_2 → … → f_{N−1} → x_{N−1} → f_N → x_N, with local Jacobians df_1/dx_0, df_2/dx_1, …, df_N/dx_{N−1}
o How to adjust x_0 to minimize x_N? "Just multiply the Jacobians" to get dx_N/dx_0 … but this is not possible — the intermediate Jacobians are too large
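The 1 TB figure on the computational-feasibility slide is plain arithmetic: a 32×32×512 activation has about half a million entries, and a full Jacobian with respect to an equally sized tensor has that number squared. A back-of-the-envelope check, assuming 4-byte float32 entries:

```python
n = 32 * 32 * 512                 # elements in a 32x32x512 intermediate tensor
bytes_per_float = 4               # float32 (assumption)

full_jacobian = n * n * bytes_per_float   # dy/dx when y is vector-valued
grad_of_scalar = n * bytes_per_float      # dL/dx when the output is a scalar

print(n)                                  # 524288
print(full_jacobian / 1e12, "TB")         # ~1.1 TB: infeasible to store
print(grad_of_scalar / 1e6, "MB")         # ~2.1 MB: perfectly fine
```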
What if the output is a scalar?
o With x of size 32x32x512 and y a scalar,
o df/dx has only 32·32·512 = 524K elements, ~2 MB
o It is of size D_x × 1

Chain rule visualized
o Start from the output end: the last factor is a simple matrix-vector product, (D_{N−1} × D_N) times (D_N × 1), and the result is again of low size, D_{N−1} × 1
o Compute this first! Multiplying the full Jacobians from the other end would be too large

Chain rule visualized
o In other words, the vector-matrix product on the left-hand side can be computed as the derivative of the scalar-valued projected function p · f on the right
o AutoDiff toolboxes allow you to write efficient derivatives of p · f, and take care of the rest

Chain rule visualized
o Keep going: repeat this vector-Jacobian product module by module, right to left, until you reach dx_N/dx_0

But we still need the Jacobian?
o Yes, but: the operations we use generally have a very sparse Jacobian
o Sometimes the projected Jacobian is more efficient to compute
o ReLU / sigmoid etc., e.g. softmax: …

Computational graphs: Forward graph
o Compute the activation of each module in the network: h_l = h_l(w; x_l)
o Then, set x_{l+1} := h_l
o Store the intermediate variables h_l
  ◦ they will be needed for backpropagation; this saves time at the cost of memory
o Then, repeat recursively and in the right order
o [Figure: a forward computational graph of modules h_1, h_2, … fed by the input x, each producing its activation h_l]
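The right-to-left order described above is exactly reverse-mode differentiation: propagate a single gradient vector through each module's vector-Jacobian product, never materializing a full Jacobian, and reuse the activations stored during the forward pass. A minimal sketch with made-up shapes and a linear–ReLU–linear chain:

```python
import numpy as np

rng = np.random.default_rng(4)

W1 = rng.normal(size=(512, 256))
W2 = rng.normal(size=(256, 10))
x = rng.normal(size=512)

# Forward graph: store the intermediate activations needed for the backward pass.
z1 = x @ W1                 # linear module
h1 = np.maximum(0, z1)      # ReLU module
y  = h1 @ W2                # linear module
L  = 0.5 * np.sum(y ** 2)   # scalar loss

# Backward pass: repeatedly apply vector-Jacobian products, right to left.
g_y  = y                    # dL/dy, a vector of size 10 -- never a full Jacobian
g_h1 = W2 @ g_y             # VJP of the linear module: dL/dh1, size 256
g_z1 = g_h1 * (z1 > 0)      # VJP of ReLU: its Jacobian is a sparse 0/1 diagonal mask
g_x  = W1 @ g_z1            # dL/dx, size 512

# Per-parameter gradients use the same local rule dL/dW = dL/dout * dout/dW.
g_W2 = np.outer(h1, g_y)
g_W1 = np.outer(x, g_z1)

print(g_x.shape, g_W1.shape, g_W2.shape)   # (512,) (512, 256) (256, 10)
```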