Questions and Answers
Which activation function is generally recommended as the default for hidden layers in modern neural networks?
- Softmax
- ReLU or GELU (correct)
- Sigmoid
- Tanh
For regression problems in the output layer of a neural network, the recommended activation function is Sigmoid.
False (B)
In multi-class classification problems, what type of activation function is typically applied to the output layer?
Softmax
In binary classification, the cost function aims to minimize $p_i$ if $y_i = 0$ and to ________ if $y_i = 1$.
According to the Universal Approximation Theorem, what is a primary characteristic of feedforward networks with hidden layers?
Match the use case with the appropriate activation function:
What is the primary limitation of a single-layer perceptron?
Moravec's Paradox highlights that machines excel at tasks requiring sensory perception compared to logical reasoning.
Briefly describe the core idea behind 'Path 1: Better Inputs' in machine learning.
Deep learning models are massively optimized with ______ to encode domain knowledge.
Match the following historical challenges faced by early neural networks with their corresponding description:
Which of the following is a characteristic of deep learning, as described in the content?
Why are activation functions necessary in neural networks?
Using different activation functions in each hidden layer of a neural network is a common practice to optimize performance.
What is a key characteristic that activation functions must possess for use in neural networks?
Activation functions with a limited output range are often called ' ______ functions'.
In deep feedforward networks, what is the primary goal?
Which of the following is an advantage of using the Tanh activation function over the Sigmoid function in hidden layers?
Recurrent neural networks (RNNs) are characterized by the absence of feedback connections, distinguishing them from feedforward networks.
ReLU activation functions are zero-centered.
In the context of neural networks, what is a 'module'?
What is a potential problem associated with ReLU neurons, where they become inactive for all inputs?
During the training of Multilayer Perceptrons (MLPs), weights and biases are learned through 'forward-backward' propagation, which involves mapping input to predicted output, comparing this output to the ground truth, and then propagating __________ to correct predictions.
Which activation function is designed to address the 'dead neurons' problem by allowing a small, positive gradient when the unit is not active?
Match each component with its corresponding description in the context of neural networks:
Which of the following is NOT a requirement for activation functions in neural networks?
In Parametric ReLU (PReLU), the slope of the inactive part of the function is treated as a ______ parameter.
Which of the following activation functions is most suitable for the output layer when the task is to emulate probabilities?
What must be done when there are cycles in the architecture of blocks in a recurrent network?
In feedforward networks, layers apply a series of functions. The notation $a^L = f(x; \theta) = h^L \circ h^{L-1} \circ … \circ h^1 \circ x$ shows that each function $h^l$ is parameterized by parameters ________.
Flashcards
Perceptron
A single-layer neural network for binary classification. It multiplies inputs by weights, sums them, adds a bias, and outputs 1 if above a threshold, 0 otherwise.
Moravec's Paradox
The observation that tasks easy for humans (perception, motor skills) are hard for machines, and vice versa (logical reasoning).
Path 1: Better Inputs
One approach to improve machine learning by creating better input features, often encoding domain knowledge.
Path 2: Neural Networks
Layer-by-layer Training
Deep Learning
Deep Network Equation
Feedforward Neural Network
Multi-Layer Perceptron (MLP)
Optimal Parameters
Module in Neural Networks
Activation Function
Recurrent Neural Networks
Forward Propagation
ELU (Exponential Linear Unit)
Activation Function Choice (Hidden Layers)
Regression Output Layer
Binary Classification Output Layer
Multiclass Classification Output Layer
Universal Approximation Theorem
Sigmoid Function
Tanh Function
ReLU (Rectified Linear Unit)
Dead Neurons (ReLU)
Advantage of ReLU
Leaky ReLU
Parametric ReLU (PReLU)
Vanishing Gradients
Squashing Function
Study Notes
Introduction and History of Deep Learning
The Perceptron
- Single-layer perceptrons are used for binary classification.
- During processing, each input is multiplied by its corresponding weight.
- The products are summed, and a bias is added.
- If the result is above a threshold, the perceptron outputs 1; otherwise, it outputs 0.
- Weights are adjusted one sample at a time.
- Perceptron Limitations:
- Can only solve linearly separable problems.
- Fails to model non-linear boundaries, like the XOR problem.
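The perceptron steps and its limitation can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the threshold of 0, the learning rate, and the number of epochs are all assumptions chosen for the example.

```python
def predict(weights, bias, x):
    """Multiply inputs by weights, sum, add the bias, then threshold at 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron rule: weights are adjusted one sample at a time."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


def accuracy(samples, weights, bias):
    hits = sum(predict(weights, bias, x) == t for x, t in samples)
    return hits / len(samples)


# AND is linearly separable, XOR is not.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_w, and_b = train_perceptron(AND)
xor_w, xor_b = train_perceptron(XOR)
print("AND accuracy:", accuracy(AND, and_w, and_b))
print("XOR accuracy:", accuracy(XOR, xor_w, xor_b))
```

On AND the rule finds a separating line, while on XOR no setting of two weights and a bias can classify all four points, so the accuracy stays below 1.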
Moravec’s Paradox
- Tasks that humans find easy, such as sensory perception and motor skills, are difficult for machines.
- Tasks humans find hard, such as logical reasoning, are relatively easy for machines.
- Reasoning requires little computation, whereas perception from sensors requires a lot.
- Computers solve structured and rule-based reasoning tasks efficiently.
- Perception requires interpreting large, unstructured, and noisy sensory data, which is computationally intensive.
Two Paths of Machine Learning
- Approach 1: Improve inputs by creating better features.
- Encode domain knowledge to aid machine learning algorithms.
- Classical image recognition pipeline:
- Extract local features.
- Aggregate local features over the image.
- Train classical models on the aggregations.
- Approach 2: Build neural networks with multiple layers, overcoming the limitations of perceptrons by making the model more complex.
Deep Learning Arrives
- Layer-by-layer training trains the network one layer at a time.
- This makes multi-layered neural networks easier to train.
- It combines the benefits of multi-layer networks with the ease of training a single layer.
Challenges of NNs Then
- Processing power was limited.
- Data was scarce.
- Overfitting was a problem.
- Vanishing gradients hindered training.
- Multi-layer perceptrons were not useful experimentally.
Forward and Backward Propagation
Deep Learning
- A family of parametric, non-linear, hierarchical representational learning functions
- Massively optimized with stochastic gradient descent to encode domain invariances and stationarity
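- Non-linear function: $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1) \dots, \theta_{L-1}), \theta_L)$, where $x$ is the input, $\theta_l$ are the parameters of layer $l$, and $a_l = h_l(x, \theta_l)$ is a (non-)linear function.
- Given a training corpus $\{X, Y\}$, find the optimal parameters: $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y) \subseteq (X,Y)} \ell(y, a_L(x; \theta_{1,\dots,L}))$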
Deep Feedforward Networks
- Also known as multilayer perceptrons (MLPs).
- Approximates a function.
- Defines a mapping: y = f(x; θ).
- Learns the parameter values $\theta$ that give the best function approximation.
- Feedforward networks have no feedback connections; when feedback connections are included, the network becomes a recurrent neural network.
- Brains have many feedback connections.
- Composite function: $y = a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1) \dots, \theta_{L-1}), \theta_L)$, where $\theta_l$ denotes the parameters of the $l$-th layer.
- Simplified notation: $a_L = f(x; \theta) = h_L \circ h_{L-1} \circ \dots \circ h_1 \circ x$, where each function $h_l$ is parameterized by parameters $\theta_l$.
- Neural network notation: networks are visualized as blocks. Modules are the building blocks, and each module is a function that performs a transformation.
- Modules:
- A module receives data x or another module's output as input.
- A module returns an output a based on its activation function h(...).
- A module may or may not have trainable parameters w. Examples: f = Ax (trainable parameters A), f = exp(x) (no trainable parameters).
Requirements
- Activations must be first-order differentiable almost everywhere.
- Take special care when cycles in the architecture of blocks occur.
- Most models are feedforward networks, such as CNNs and Transformers.
MLPs: Training Goal and Overview
- Utilize a dataset of inputs and desired outputs for training.
- Random values are used to initialize all weights and biases.
- Learning process involves cycling through 'forward-backward' propagation:
- Forward step: Process the input to generate a predicted output.
- Loss step: Quantify the disparity between the predicted output and the ground truth.
- Backward step: Fine-tune predictions by propagating gradients.
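The forward, loss, and backward steps above can be sketched on the smallest possible model, a single linear neuron. This is only an illustration under stated assumptions: the toy dataset (targets following y = 2x + 1), the squared-error loss, the learning rate, and the number of epochs are all hypothetical choices made for the example.

```python
import random

random.seed(0)

# Hypothetical toy dataset: inputs in [-1, 1], targets follow y = 2x + 1.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]

# Random values initialize the weight and the bias.
w = random.uniform(-1.0, 1.0)
b = random.uniform(-1.0, 1.0)
lr = 0.1  # assumed learning rate


def mean_loss(w, b):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


initial_loss = mean_loss(w, b)

for epoch in range(200):
    grad_w = 0.0
    grad_b = 0.0
    for x, y in data:
        pred = w * x + b            # forward step: input -> predicted output
        diff = pred - y             # loss step: squared error is diff ** 2
        grad_w += 2.0 * diff * x    # backward step: d(loss)/dw
        grad_b += 2.0 * diff        # backward step: d(loss)/db
    # Update the parameters with the averaged gradients.
    w -= lr * grad_w / len(data)
    b -= lr * grad_b / len(data)

final_loss = mean_loss(w, b)
print(f"w={w:.3f}, b={b:.3f}, loss {initial_loss:.4f} -> {final_loss:.6f}")
```

In a real multilayer network the backward step applies the chain rule through every layer; here there is only one layer, so the gradients can be written out by hand.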
Linear / Fully-connected layer
- The identity activation function implies no activation saturation
- Results in strong & stable gradients.
- Enables reliable learning with linear modules.
- Definition: $x \in \mathbb{R}^{1 \times M}$, $w \in \mathbb{R}^{N \times M}$, $h(x; w) = x \cdot w^T + b$, $\frac{dh}{dx} = w$
Forward propagation
- With linear layers, the same linear operation is applied repeatedly, layer after layer.
- Start from the input: multiply by the weights, sum, and add the bias.
- Repeat for all following layers until you reach the end.
- The main new element is the activation function, which is applied after each layer.
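As a rough sketch of forward propagation, the snippet below pushes one input vector through two fully-connected layers. The layer sizes, the hand-picked weights, and the helper names (`linear`, `forward`) are assumptions made for illustration; they do not come from the notes.

```python
def linear(x, weights, bias):
    """h(x; w) = x . w^T + b for one input vector x of length M.

    `weights` has one row of length M per output unit (N rows in total).
    """
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weights, bias)]


def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def forward(x, layers):
    """Apply each (weights, bias, activation) layer in order."""
    a = x
    for weights, bias, activation in layers:
        a = [activation(z) for z in linear(a, weights, bias)]
    return a


# Hypothetical parameters: 2 inputs -> 2 hidden units (ReLU) -> 1 linear output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),
    ([[2.0, 1.0]], [0.5], identity),
]

output = forward([3.0, 1.0], layers)
print(output)  # hidden activations are [2.0, 1.0], so the output is [5.5]
```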
Why activation functions?
- Each hidden/output neuron is a linear sum.
- A combination of linear functions is a linear function.
- Function Calculations:
- v(x) = ax + b
- w(z) = cz + d
- w(v(x)) = c(ax + b) + d = (ac)x + (cb + d)
- Activation functions transform each neuron's weighted sum, which makes the overall function non-linear. The activation defines how the weighted sum of the inputs is turned into the output of a given layer.
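The collapse of stacked linear functions can be checked numerically. The coefficient values below are arbitrary assumptions; the function names `v` and `w` follow the calculation above, and `w_relu_v` is a hypothetical helper that places a ReLU between the two functions.

```python
A, B, C, D = 2.0, 1.0, -3.0, 0.5  # arbitrary example coefficients


def v(x):
    return A * x + B


def w(z):
    return C * z + D


def collapsed(x):
    """The single linear function (ac)x + (cb + d)."""
    return (A * C) * x + (C * B + D)


def w_relu_v(x):
    """Same composition, but with a ReLU applied between v and w."""
    return w(max(0.0, v(x)))


for x in [-2.0, 0.0, 1.5]:
    print(x, w(v(x)), collapsed(x), w_relu_v(x))
```

Without an activation the two layers always agree with the single collapsed line. With the ReLU in between, the result at x = -2.0 differs, because the composition is no longer a single linear function.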
Sigmoid Function
- Its output ranges between zero and one.
- It is differentiable, with derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.
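A small sketch can confirm the derivative formula against a finite-difference estimate. It assumes the standard logistic definition $\sigma(z) = 1 / (1 + e^{-z})$, which the notes do not spell out; the helper `numerical_grad` and the step size are illustrative choices.

```python
import math


def sigmoid(z):
    # Standard logistic function (assumed definition); fine for moderate z.
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def numerical_grad(f, z, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)


for z in [-4.0, 0.0, 4.0]:
    print(z, sigmoid(z), sigmoid_grad(z), numerical_grad(sigmoid, z))
```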
Tanh Function
- It has a better output range: [-1, +1].
- Outputs are centered around 0 rather than 0.5, which implies stronger gradients.
- It introduces less 'positive' bias for the next layers, unlike the sigmoid, whose mean output is 0.5 rather than 0.
- Both tanh and sigmoid saturate at the extremes, which leads to near-zero gradients. Multiplying many gradients << 1 along the chain shrinks them further.
- tanh(x) is better for middle layers
- Sigmoids are better for outputs to emulate probabilities
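The difference in gradient strength and the effect of saturation can be illustrated numerically. This sketch ignores the weights and simply multiplies identical local derivatives, which is a simplifying assumption; the depth of 10 and the input values are arbitrary.

```python
import math


def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2


def chained_gradient(grad_fn, z, depth):
    """Product of `depth` identical local derivatives (weights ignored)."""
    return grad_fn(z) ** depth


print("peak sigmoid gradient:", sigmoid_grad(0.0))  # 0.25
print("peak tanh gradient:", tanh_grad(0.0))        # 1.0
print("sigmoid through 10 layers at z=0:", chained_gradient(sigmoid_grad, 0.0, 10))
print("tanh gradient when saturated (z=5):", tanh_grad(5.0))
```

Even at its best point the sigmoid passes on at most a quarter of the gradient per layer, while both functions pass on almost nothing once they saturate.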
Rectified Linear Unit (ReLU)
Advantages:
- Sparse activation: In randomly initialized networks, ~50% active
- Better gradient propagation:
- Fewer vanishing gradient problems are found compared to sigmoidal activation functions, which saturate in both directions
- e.g. when the sigmoid derivative is below 1 at every layer: (small number) × (small number) × ... → 0
- Efficient computation: Only comparison, addition and multiplication
- Calculation: $h(x) = \max(0, x)$, with $\frac{dh}{dx} = 1$ when $x > 0$ and $\frac{dh}{dx} = 0$ when $x < 0$
Limitations:
- Non-differentiable at zero
- However, it is differentiable everywhere else, and the derivative at zero can be set arbitrarily to 0 or 1
- Not zero-centered
- Unbounded
- Dead neurons problem: Neurons sometimes pushed into inactive states, rendering them virtually useless for all inputs
- Lower learning rates can help mitigate this problem, since large updates are what push neurons into the inactive state
Leaky ReLU
- Leaky ReLUs allow a small, positive gradient when the unit is not active.
- Parametric ReLUs, or PReLUs, treat the slope a of the inactive part as a learnable parameter.
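A minimal sketch of ReLU and Leaky ReLU with their derivatives follows. The slope value 0.01 is an assumed example value, and choosing 0 for the ReLU derivative at x = 0 is just one of the two conventions. In a PReLU the same `a` would be updated during training instead of being fixed.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Undefined at x == 0; this sketch picks 0 by convention.
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, a=0.01):
    """`a` is the slope of the inactive part (0.01 is an assumed value)."""
    return x if x > 0 else a * x


def leaky_relu_grad(x, a=0.01):
    # The small positive slope keeps a gradient flowing for x <= 0.
    return 1.0 if x > 0 else a


for x in [-2.0, 0.0, 3.0]:
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```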
Exponential Linear Unit (ELU)
- ELU calculation: $h(x) = x$ when $x > 0$, and $h(x) = \exp(x) - 1$ when $x \le 0$
- GELU (Gaussian Error Linear Unit), rather than ELU, serves as the default activation for models including BERT.
How to choose an activation function
- Hidden layers should use ReLU or GELU in Modern NNs
- Recurrent neural networks typically use Tanh and/or Sigmoid activations
- The output layer's activation depends on the task:
- Regression: One node → Linear activation
- Binary Classification: One node → Sigmoid activation
- Multiclass Classification: One node per class → Softmax activation
- Multilabel Classification: One node per class → Sigmoid activation
Cost Functions
Multiclass Classification: SoftMax
Softmax outputs a probability distribution: $h(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$. For numerical stability, avoid exponentiating very large or very small numbers.
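One common way to keep softmax stable is to subtract the largest input before exponentiating, which leaves the result unchanged mathematically. The notes do not say which trick they have in mind, so treat this as one possible sketch rather than the intended method.

```python
import math


def softmax(logits):
    """Softmax with the maximum subtracted first, so exp() never overflows."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# A naive math.exp(1000.0) would raise OverflowError; the shifted version works.
probs = softmax([1000.0, 1001.0, 1002.0])
print(probs, sum(probs))
```

Because only the differences between inputs matter, `softmax([1000.0, 1001.0, 1002.0])` gives the same distribution as `softmax([0.0, 1.0, 2.0])`.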
Universal Approximation Theorem
- Deep feedforward networks can approximate virtually any continuous function.
- A sufficiently large MLP with a single hidden layer can represent any function, provided the network has enough hidden units.
- It makes no guarantees that the training algorithm can learn that function
- The training algorithm may also choose the wrong function because of overfitting.
3: Deep Learning Optimization I
Optimization Versus Learning
- Optimization seeks a model's best parameters for a set of data points, minimizing the objective function.
- Learning aims to reduce errors on training data and generalize well to unseen data.
Minimizing Risk
- Want to minimize the error on observed data.
- The goal is to minimize the cost function with an extra regularization term: $\min_w \mathbb{E}_{x,y \sim p_{data}}[L(f(x, w), y)] + \lambda\Omega(w)$
- The goal is twofold: (1) predictions should not be too wrong, while (2) the model should not be 'too geared' towards the observed data
- Problem: The true distribution $p_{data}$ is often unavailable.
- To minimize any function, we take a step $\delta$. Our best bet is the negative gradient: $\delta = -\frac{d}{dw} L(f(x, w), y)$
Stochastic Gradient Descent
- Mini-batches: Gradients are computed and parameters updated on a small batch of samples rather than on the entire dataset.
- Properties of SGD:
- The randomness it introduces reduces overfitting.
- Reshuffling the data every epoch is essential so that the batches are not presented in the same sequence.
- Epoch: One complete pass through all mini-batches. Make sure each batch has a good balance of classes/data.
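The mini-batch loop with per-epoch reshuffling can be sketched as follows. The one-parameter model f(x, w) = w * x, the squared-error loss, the noiseless toy data, and all hyperparameter values are assumptions chosen to keep the example small.

```python
import random


def sgd(data, w, lr=0.1, batch_size=4, epochs=50, seed=0):
    """Mini-batch SGD for the model f(x, w) = w * x with squared loss."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # reshuffle so batches differ between epochs
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over this mini-batch only.
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad  # step along the negative gradient
    return w


# Hypothetical noiseless data generated from y = 3x.
samples = [(k / 10.0, 3.0 * k / 10.0) for k in range(1, 21)]
learned_w = sgd(samples, w=0.0)
print(learned_w)
```

One pass of the inner loop over all batches is one epoch; each update uses only the samples in the current batch, which is what makes the gradient a noisy estimate.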
Batch Size Analysis
- Large batch sizes give more accurate gradient estimates.
- Small batch sizes underutilize the hardware but can act as a regularizer.
- Batch size and learning rate: The two are typically coupled by the rule of thumb that doubling the batch size means doubling the learning rate.
- A good guideline: Use the largest batch size that fits on the GPU, preferably a power of 2. Gradient descent is already an approximation, since the true data distribution is typically unknown.
A Nutshell Review
- First, define the NN as $y = h_L \circ h_{L-1} \circ \dots \circ h_1(x)$, where each module $h_l$ comes with parameters $w_l$.
- Find the optimal network by minimizing the loss function.
- Rely on stochastic gradient descent methods to obtain the parameters.
Challenges of Optimizing Deep Networks
- Training is a non-convex optimization involving functions with multiple optima.
- It raises multiple complex problems:
- How do we avoid getting stuck in local optima?
- What is a reasonable learning rate to use?
- What if the loss surface morphology changes?
Main Optimizing Challenges
- Ill-conditioning: Even a strong gradient might not be good enough.
- Local optimization is susceptible to local minima.
- Issues: Ravines, plateaus, cliffs, and pathological curvatures.
- Vanishing and exploding gradients often occur.
- Long-term dependencies create problems.
Ill-Conditioning
- The Hessian matrix is a square matrix of second-order partial derivatives; it describes the curvature of a function of multiple variables.
- The Hessian matrix is symmetric.
- Curvature is determined by the second derivative.
- Negative curvature: The function decreases faster than the gradient predicts.
- Positive curvature: The function decreases more slowly than the gradient predicts.
- No curvature: The gradient's prediction is correct.
- Critical points are classified with the Hessian matrix:
- Local minimum: the Hessian is positive definite.
- Local maximum: the Hessian is negative definite.
- Saddle point: the Hessian has both positive and negative eigenvalues.
- Consider the eigenvalue decomposition of the Hessian. The condition number is $\max_{i,j} \left| \lambda_i / \lambda_j \right|$, the ratio of the magnitudes of the largest and smallest eigenvalues.
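For a function of two variables, the classification of critical points and the condition number can be worked out directly from the eigenvalues of the 2×2 Hessian. The sketch below uses the closed-form eigenvalues of a symmetric 2×2 matrix; the function names and example functions are assumptions for illustration, and the condition number is undefined (division by zero) if an eigenvalue is exactly zero.

```python
import math


def eigenvalues_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return mean - radius, mean + radius


def classify_critical_point(a, b, d):
    low, high = eigenvalues_2x2(a, b, d)
    if low > 0:
        return "local minimum"   # positive definite
    if high < 0:
        return "local maximum"   # negative definite
    if low < 0 < high:
        return "saddle point"    # mixed signs
    return "inconclusive"        # at least one zero eigenvalue


def condition_number(a, b, d):
    """Ratio of the largest to the smallest eigenvalue magnitude."""
    low, high = eigenvalues_2x2(a, b, d)
    magnitudes = sorted([abs(low), abs(high)])
    return magnitudes[1] / magnitudes[0]


# f(x, y) = x^2 + 50 y^2 has Hessian [[2, 0], [0, 100]].
print(classify_critical_point(2.0, 0.0, 100.0), condition_number(2.0, 0.0, 100.0))
# f(x, y) = x^2 - y^2 has Hessian [[2, 0], [0, -2]].
print(classify_critical_point(2.0, 0.0, -2.0))
```

The first example is a minimum with condition number 50: the curvature is 50 times stronger along one axis than the other, which is the kind of surface where plain gradient descent struggles.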
Local Minima
- Model identifiability:
- A model is identifiable if a sufficiently large training set can rule out all but one setting of the model's parameters.
- Neural networks are frequently not identifiable: equivalent models can be obtained by exchanging variables with each other.
- The cost function can therefore have an extremely large number of local minima.
Ravines
- Areas where the gradient is large in one direction and small in the other directions.
Plateaus and Flat Areas
They have little or near-zero gradient, and therefore no learning takes place.
Cliffs and Exploding Gradients
- Neural networks with many layers often have extremely steep regions that resemble cliffs.
Long-Term Dependencies
- These gradient problems arise in networks with many layers and in recurrent networks.
- The problem is most pronounced in neural networks.
- Vanishing gradients → no direction to move
- Exploding gradients → learning unstable.
- Training is trajectory-dependent: it is hard to recover from a bad start!
- Most of then they will become better the can be enhanced.
Advanced Optimizers
- Can we improve the learning rate?
- Can we get better gradients?
Setting the Learning Rate
- Choosing the learning rate is largely empirical and depends on the dataset.
Improving gradient descent
Variants of gradient descent:
- Stochastic Gradient Descent with momentum
- Nesterov momentum
- SGD with adaptive learning rates (AdaGrad, RMSProp, Adam)
- Second-order approximations such as Newton's method
Momentum
- Designed to accelerate learning, especially on loss surfaces with high curvature.
- Momentum can be understood through exponentially weighted moving averages.
- Example: For a sequence $S$ with a lot of noise, $v_t = \beta v_{t-1} + (1 - \beta) s_t$ with $\beta \in [0, 1]$, where $v$ is the exponentially weighted average.
- The value of $\beta$ sets the balance between smoothing out noise and following the sequence.
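The exponentially weighted moving average can be sketched in a few lines. The starting value $v_0 = 0$, the default `beta=0.9`, and the noisy example sequence are assumptions made for illustration; the notes do not fix any of them.

```python
def ewma(sequence, beta=0.9):
    """v_t = beta * v_{t-1} + (1 - beta) * s_t, starting from v_0 = 0.

    beta=0.9 is an assumed default, not a value taken from the notes.
    """
    v = 0.0
    averages = []
    for s in sequence:
        v = beta * v + (1.0 - beta) * s
        averages.append(v)
    return averages


noisy = [1.0, 3.0, 0.5, 2.5, 1.5, 2.0]  # hypothetical noisy sequence
print(ewma(noisy, beta=0.5))
print(ewma(noisy, beta=0.9))
```

A larger `beta` gives a smoother average that reacts more slowly to new values, while `beta = 0` simply returns the sequence itself. Momentum applies the same averaging to the gradients.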
Description
Questions covering activation functions like Sigmoid, and their applications in neural networks. Also explores concepts like the Universal Approximation Theorem, limitations of single-layer perceptrons, and Moravec's Paradox.