Deep Learning: Introduction, Neural Networks, and Backpropagation

Summary

This document introduces the fundamentals of deep learning, covering perceptrons, feedforward neural networks, and backpropagation. It also discusses activation functions and training methods, addresses common challenges in deep learning, and closes with pointers to recurrent networks and related advanced topics.

Full Transcript


1. Introduction and History of Deep Learning

The Perceptron
Single-layer perceptron for binary classification:
- One weight $w_i$ per input $x_i$
- Multiply each input by its weight, sum, and add a bias
- If the result is larger than a threshold, return 1; otherwise return 0
- Weights are updated one sample at a time
Limitations:
- It can only solve linearly separable problems
- It fails for the XOR problem, because it cannot model non-linear boundaries
(A minimal sketch of the perceptron update rule appears at the end of this section.)

Moravec's Paradox
Tasks humans find easy, like sensory perception and motor skills (e.g., recognizing objects or walking), are incredibly hard for machines, while tasks humans find hard, like logical reasoning or solving math problems, are relatively easy for machines. Reasoning requires little computation, perception from sensors a lot: reasoning tasks are structured and rule-based, so computers can solve them efficiently, while perception requires interpreting large, unstructured and noisy sensory data, which is computationally intensive.

Two Paths of Machine Learning
Path 1: Better inputs. Fix perceptrons by engineering better features. Philosophy: encode domain knowledge to help machine learning algorithms. The classical image recognition pipeline:
1. Extract local features.
2. Aggregate local features over the image.
3. Train classical models on the aggregations.
Path 2: Neural networks beyond a single layer. Fix perceptrons by making them more complex.

Deep Learning Arrives
It is easier to train one layer at a time → layer-by-layer training. Training multi-layered neural networks became easier: the benefits of multi-layer networks, combined with the ease of training a single layer.

Challenges of NNs at the time
- Lack of processing power
- Lack of data
- Overfitting
- Vanishing gradients
Experimentally, training multi-layer perceptrons was not that useful.
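As referenced in the perceptron subsection above, here is a minimal NumPy sketch of the perceptron and its one-sample-at-a-time update rule; the learning rate, epoch count, and the AND/XOR toy data are illustrative choices, not from the notes.

```python
import numpy as np

def perceptron_train(X, y, epochs=20, lr=0.1):
    """Train a single-layer perceptron, updating one sample at a time."""
    w = np.zeros(X.shape[1])  # one weight w_i per input x_i
    b = 0.0                   # bias
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            # Forward: weighted sum plus bias, thresholded at 0
            pred = 1 if np.dot(w, x_i) + b > 0 else 0
            # Online update: move weights toward the correct label
            w += lr * (y_i - pred) * x_i
            b += lr * (y_i - pred)
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])  # linearly separable: learnable
y_xor = np.array([0, 1, 1, 0])  # not linearly separable: fails

w, b = perceptron_train(X, y_and)
print([int(np.dot(w, x) + b > 0) for x in X])  # matches AND
w, b = perceptron_train(X, y_xor)
print([int(np.dot(w, x) + b > 0) for x in X])  # cannot match XOR
```

On the linearly separable AND data the updates converge; on XOR no weight/bias setting is correct for all four points, which is exactly the limitation noted above.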
π‘Ž" $π‘₯; πœƒ#,…," ' = β„Ž" (β„Ž"&# (… β„Ž# (π‘₯, πœƒ# ), πœƒ"&# ), πœƒ" ) π‘₯: input, πœƒ" : parameters for layer 𝑙, π‘Ž' = β„Ž' (π‘₯, πœƒ' ): (non-)lineair fucntion Given training corpus {𝑋, π‘Œ} find optimal parameters: πœƒ βˆ— ← βˆ‘(*,+)βŠ†(.,/) 𝑙(𝑦, π‘Ž" $π‘₯; πœƒ#,…," ') Deep Feedforward Networks Feedforward Neural Networks Also called multi-layer perceptrons (MLPs) The goal is to approximate some function 𝑓 A feedforward network defines a mapping: 𝑦 = 𝑓(π‘₯; πœƒ) Learns the value of the parameters πœƒ with the best function approximation No feedback connections When including feedback connections, we obtain recurrent neural networks Note: brains have many feedback connections A composite of functions: 𝑦 = π‘Ž" $π‘₯; πœƒ#,…," ' = β„Ž" (β„Ž"&# (… β„Ž# (π‘₯, πœƒ# ), πœƒ"&# ), πœƒ" ) where πœƒ" denotes the parameters in the 𝑙-th layer We can simplify the notation to π‘Ž" = 𝑓(π‘₯; πœƒ) = β„Ž" ∘ β„Ž"&# ∘ … ∘ β„Ž# ∘ π‘₯ where each function β„Ž' is parameterized by parameters πœƒ" Neural Networks With the last notation, we can visualize networks as blocks: Module Γ› Building Block Γ› Transformation Γ› Function Module receives as input either data π‘₯ or another module’s output A module returns an output π‘Ž based on its activation function β„Ž(… ) A module may or may not have trainable parameters 𝑀 Examples: 𝑓 = 𝐴π‘₯, 𝑓 = exp (π‘₯) Requirements 1) Activations must be 1st-order differentiable (almost) everywhere 2) Take special care when there are cycles in the architecture of blocks Most models are feedforward networks (e.g., CNNs, Transformers) Recurrency Module’s past output is module’s future input We must take care of cycles, i.e., unfold the graph (β€˜Recurrent Networks’) MLPs: Training Goal and Overview We have a dataset of inputs and outputs Initialize all weights and biases with random values Learn weights and biases through β€˜forward-backward’ propagation: Forward step: Map input to predicted output Loss step: Compare predicted output to ground truth output Backward step: Correct predictions by propagating gradients Linear / Fully-connected layer Identity activation function No activation saturation Hence, strong & stable gradients Reliable learning with linear modules Forward propagation When using linear layers, essentially repeated application of perceptrons: 1. Start from the input, multiply with weights, sum, add bias 2. Repeat for all following layers until you reach the end There is one main new element (next to the multiple layers): Activation functions after each layer Why activation functions? Each hidden/output neuron is a linear sum A combination of linear functions is a linear function! 𝑣(π‘₯) = π‘Žπ‘₯ + 𝑏 𝑀(𝑧) = 𝑐𝑧 + 𝑑 𝑀$𝑣(π‘₯)' = 𝑐(π‘Žπ‘₯ + 𝑏) + 𝑑 = (π‘Žπ‘)π‘₯ + (𝑐𝑏 + 𝑑) Activation functions transform outputs of each neuron Γ  results in non-linear functions They define how the weighted sum of the input is transformed into an output in a layer of the network. If output range limited, then called a β€œsquashing function.” The choice of activation function has a large impact on the capability and performance of the neural network. Different activation functions may be combined, but rare. All hidden layers typically use the same activation function. Need to be differentiable at most points. 
Sigmoid Function
Range: $(0, 1)$. Differentiable: $\frac{d}{dz}\sigma(z) = \sigma(z)(1 - \sigma(z))$

Tanh Function
$\tanh(x)$ has a better output range, $[-1, +1]$: data are centered around 0 (not 0.5) → stronger gradients, and less 'positive' bias for the next layers (mean 0, not 0.5).
Both sigmoid and tanh saturate at the extremes → 0 gradients, and both easily become 'overconfident' (0 or 1 decisions), which is undesirable for middle layers. With gradients ≪ 1, chain multiplication shrinks them further → $\tanh(x)$ is better for middle layers → sigmoids for outputs, to emulate probabilities.

Rectified Linear Unit (ReLU)
Advantages:
- Sparse activation: in randomly initialized networks, ~50% of units are active
- Better gradient propagation: fewer vanishing-gradient problems compared to sigmoidal activation functions, which saturate in both directions (e.g., for sigmoid, $\sigma'(x) \ll 1$: (small number) · (small number) · … → 0)
- Efficient computation: only comparison, addition and multiplication
Limitations:
- Non-differentiable at zero; however, it is differentiable everywhere else, and the value of the derivative at zero can be arbitrarily chosen to be 0 or 1
- Not zero-centered
- Unbounded
- Dead-neurons problem: neurons can sometimes be pushed into states in which they become inactive for essentially all inputs. Lower learning rates might help.

Leaky ReLU
Leaky ReLUs allow a small, positive gradient when the unit is not active. Parametric ReLUs (PReLU) treat the slope $a$ as a learnable parameter.

Gaussian Error Linear Unit (GELU)
GELU is a smooth approximation to the rectifier. It has a non-monotonic 'bump' for $x < 0$. It serves as the default activation for models such as BERT.

How to Choose an Activation Function
Hidden layers: in modern NNs, the default recommendation is ReLU or GELU (for recurrent neural networks: tanh and/or sigmoid activations). A sketch of these activations appears at the end of this part.
Output layer:
- Regression: one node → linear activation
- Binary classification: one node → sigmoid activation
- Multiclass classification: one node per class → softmax activation
- Multilabel classification: one node per class → sigmoid activation

Cost Functions
Cost function for binary classification: let $y_i \in \{0, 1\}$ denote the binary label of example $i$ and $p_i \in [0, 1]$ the predicted output for example $i$. Our goal: minimize $p_i$ if $y_i = 0$, maximize $p_i$ if $y_i = 1$.
Maximize: $p_i^{y_i} \cdot (1 - p_i)^{(1 - y_i)}$
Equivalently, minimize: $-\big(y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i)\big)$

Multi-class Classification: Softmax
Softmax outputs a probability distribution, $\sum_{k=1}^{K} h_k(x) = 1$ for $K$ classes; or simply, it normalizes in a non-linear manner. Avoid exponentiating too large/small numbers for better numerical stability:
$h(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} = \frac{e^{x_i - \mu}}{\sum_j e^{x_j - \mu}}, \quad \mu = \max_i x_i$
The loss becomes $-\sum_{k=1}^{K} y_k \log p(x)_k$ (a stable-softmax sketch also appears at the end of this part).

Architecture Design
The overall structure of the network: how many units it should have, and how those units should be connected to each other. Neural networks are organized into groups of units, called layers, in a chain structure.
The first layer is given by: $h^{(1)} = g^{(1)}(W^{(1)\top} x + b^{(1)})$
The second layer is given by: $h^{(2)} = g^{(2)}(W^{(2)\top} h^{(1)} + b^{(2)})$

Universal Approximation Theorem
Feedforward networks with hidden layers provide a universal approximation framework: a large MLP with even a single hidden layer can represent any function, provided that the network is given enough hidden units. However, there is no guarantee that the training algorithm will be able to learn that function; it may not be able to find the parameter values that correspond to the desired function, or it might choose the wrong function due to overfitting.
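As referenced above, a minimal NumPy sketch of the discussed activations and the sigmoid derivative; the GELU formula here is the common tanh approximation, an assumption since the notes give no formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))           # range (0, 1)

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                       # d/dz sigmoid = s(1 - s)

def relu(z):
    return np.maximum(z, 0.0)                  # cheap: just a comparison

def leaky_relu(z, a=0.01):
    return np.where(z > 0, z, a * z)           # small gradient when inactive

def gelu(z):
    # Assumed tanh approximation of GELU (BERT-style models use GELU)
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))
```

Note the saturation behavior: sigmoid_grad is at most 0.25 and vanishes at the extremes, while relu passes gradients through unchanged for all positive inputs.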
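And a sketch of the max-subtraction stabilization for softmax with the cross-entropy loss, under the same NumPy assumptions; the logits are chosen so that a naive exponentiation would overflow.

```python
import numpy as np

def softmax(x):
    """Stable softmax: subtracting mu = max(x) leaves the result unchanged
    (the factor e^{-mu} cancels in the ratio) but avoids overflow in exp."""
    mu = np.max(x)
    e = np.exp(x - mu)
    return e / np.sum(e)

def cross_entropy(y_onehot, p):
    # Loss: -sum_k y_k log p_k (only the true class contributes)
    return -np.sum(y_onehot * np.log(p))

logits = np.array([1000.0, 1001.0, 1002.0])    # naive exp(1000) overflows
p = softmax(logits)
print(p)                                        # still a valid distribution
print(cross_entropy(np.array([0.0, 0.0, 1.0]), p))
```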
Width and Depth
In the worst case, an exponential number of hidden units is needed: a function computed by a deep rectifier net can require an exponential number of hidden units when expressed with a shallow (one-hidden-layer) network. We like deep models in deep learning because they:
1. Can reduce the number of units required to represent the desired function
2. Can reduce the amount of generalization error
3. Often generalize better

Training Deep Networks: Summary
1. Move the input through the network to yield a prediction
2. Compare the prediction to the ground-truth label
3. Backpropagate the errors to all weights

Chain Rule
The Jacobian is the generalization of the gradient to vector-valued functions $h(x)$: all input dimensions contribute to all output dimensions.

Geometry of the Jacobian
The Jacobian represents a local linearization of a function at a given coordinate, not unlike the derivative being the best linear approximation of a curve (the tangent). The Jacobian determinant (for square matrices) measures the ratio of areas, similar to what the 'absolute slope' (the derivative) measures in the 1-d case.

Another gradient to remember
Product rule: $\frac{d}{dx}\big(f(x) \cdot g(x)\big) = f(x) \cdot \frac{d}{dx}g(x) + g(x) \cdot \frac{d}{dx}f(x)$
Sum rule: $\frac{d}{dx}\big(f(x) + g(x)\big) = \frac{d}{dx}f(x) + \frac{d}{dx}g(x)$

Backprop: Chain Rule as an Algorithm
The neural network loss is a composite function of modules, and we want the gradient w.r.t. the parameters of the $l$ layers. Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.

Autodiff: Backprop with Computation Graphs
Forward: compute the activation of each module in the network, $h_l = h_l(w; x_l)$, then set $x_{l+1} := h_l$. Store the intermediate variables $h_l$: they will be needed for the backpropagation, which saves time at the cost of memory. Then repeat recursively, in the right order.

Autodiff: Reverse Graph
Go backward and use gradient functions instead of activations: each module must provide its gradient functions, i.e. the derivatives of its output w.r.t. its inputs and its parameters. (A minimal forward-backward sketch follows below.)
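A minimal forward-backward sketch for a two-layer MLP on a single example, assuming sigmoid activations and a squared-error loss (these choices and all sizes are illustrative, not from the notes); note how the backward step reuses the activations cached during the forward step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=3)                  # input
y = np.array([1.0, 0.0])                # ground-truth output
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

# Forward step: compute and cache each module's activation
z1 = W1 @ x + b1
h1 = sigmoid(z1)                        # cached for the backward step
z2 = W2 @ h1 + b2
h2 = sigmoid(z2)

# Loss step: compare the prediction to the ground truth
loss = 0.5 * np.sum((h2 - y) ** 2)

# Backward step: apply the chain rule module by module, in reverse
d_h2 = h2 - y                           # dL/dh2
d_z2 = d_h2 * h2 * (1 - h2)             # sigmoid'(z2) = h2(1 - h2)
dW2 = np.outer(d_z2, h1)                # dL/dW2 reuses the cached h1
db2 = d_z2
d_h1 = W2.T @ d_z2                      # propagate to the previous module
d_z1 = d_h1 * h1 * (1 - h1)
dW1 = np.outer(d_z1, x)
db1 = d_z1

# Gradient descent update on all parameters
lr = 0.1
W2 -= lr * dW2; b2 -= lr * db2
W1 -= lr * dW1; b1 -= lr * db1
print(loss)
```

Repeating the three steps over many samples is exactly the 'forward-backward' training loop summarized earlier.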
