Neural Networks & Activation Functions

Questions and Answers

Which activation function is generally recommended as the default for hidden layers in modern neural networks?

  • Softmax
  • ReLU or GELU (correct)
  • Sigmoid
  • Tanh

For regression problems in the output layer of a neural network, the recommended activation function is Sigmoid.

False (B)

In multi-class classification problems, what type of activation function is typically applied to the output layer?

Softmax

In binary classification, the cost function aims to minimize $p_i$ if $y_i = 0$ and to ________ $p_i$ if $y_i = 1$.

maximize

According to the Universal Approximation Theorem, what is a primary characteristic of feedforward networks with hidden layers?

They provide a universal approximation framework. (C)

Match the use case with the appropriate activation function:

  • Binary Classification = Sigmoid
  • Multiclass Classification = Softmax
  • Hidden Layers = ReLU/GELU
  • Regression = Linear

What is the primary limitation of a single-layer perceptron?

It can only solve linearly separable problems. (D)

Moravec's Paradox highlights that machines excel at tasks requiring sensory perception compared to logical reasoning.

False (B)

Briefly describe the core idea behind 'Path 1: Better Inputs' in machine learning.

Encode domain knowledge to help machine learning algorithms.

Deep learning models are massively optimized with ______ to encode domain knowledge.

stochastic gradient descent

Match the following historical challenges faced by early neural networks with their corresponding description:

  • Lack of processing power = Limited computational resources hindered the training of complex models.
  • Overfitting = Models learned the training data too well, leading to poor generalization on new data.
  • Vanishing gradients = Gradients became too small during training, preventing weights from updating effectively in deeper layers.
  • Lack of data = Insufficient amounts of data made it difficult to train robust and generalizable models.

Which of the following is a characteristic of deep learning, as described in the content?

Parametric, non-linear and hierarchical (D)

Why are activation functions necessary in neural networks?

To introduce non-linearity, allowing the network to learn complex patterns. (A)

Using different activation functions in each hidden layer of a neural network is a common practice to optimize performance.

False (B)

What is a key characteristic that activation functions must possess for use in neural networks?

differentiable

Activation functions with a limited output range are often called ' ______ functions'.

squashing

In deep feedforward networks, what is the primary goal?

To approximate a function $f$ and learn the best parameters $\theta$ for that approximation. (C)

Which of the following is an advantage of using the Tanh activation function over the Sigmoid function in hidden layers?

Tanh outputs are centered around 0, which can lead to stronger gradients. (B)

Recurrent neural networks (RNNs) are characterized by the absence of feedback connections, distinguishing them from feedforward networks.

False (B)

ReLU activation functions are zero-centered.

False (B)

In the context of neural networks, what is a 'module'?

A building block or transformation, such as a function, that receives input and returns an output based on its activation function.

What is a potential problem associated with ReLU neurons, where they become inactive for all inputs?

dead neurons

During the training of Multilayer Perceptrons (MLPs), weights and biases are learned through 'forward-backward' propagation, which involves mapping input to predicted output, comparing this output to the ground truth, and then propagating __________ to correct predictions.

gradients

Which activation function is designed to address the 'dead neurons' problem by allowing a small, positive gradient when the unit is not active?

Leaky ReLU (D)

Match each component with its corresponding description in the context of neural networks:

  • Input ($x$) = Data fed into the first layer of the network
  • Parameters ($\theta$) = Values that are learned during training to optimize the network's function approximation
  • Activation Function ($h$) = A (non-)linear function applied to the output of each layer
  • Output ($a$) = The result produced by a module based on its activation function

Which of the following is NOT a requirement for activation functions in neural networks?

Must be computationally expensive. (B)

In Parametric ReLU (PReLU), the slope of the inactive part of the function is treated as a ______ parameter.

learnable

Which of the following activation functions is most suitable for the output layer when the task is to emulate probabilities?

Sigmoid (C)

What must be done when there are cycles in the architecture of blocks in a recurrent network?

Unfold the graph, often referred to as 'Recurrent Networks'.

In feedforward networks, layers apply a series of functions. The notation $a^L = f(x; \theta) = h^L \circ h^{L-1} \circ … \circ h^1 \circ x$ shows that each function $h^l$ is parameterized by parameters ________.

$\theta^l$

Flashcards

Perceptron

A single-layer neural network for binary classification. It multiplies inputs by weights, sums them, adds a bias, and outputs 1 if above a threshold, 0 otherwise.

Moravec's Paradox

The observation that tasks easy for humans (perception, motor skills) are hard for machines, and vice versa (logical reasoning).

Path 1: Better Inputs

One approach to improve machine learning by creating better input features, often encoding domain knowledge.

Path 2: Neural Networks

An approach to improve machine learning that focuses on increasing the complexity of neural networks beyond a single layer.

Layer-by-layer Training

Training multi-layered neural networks one layer at a time, so that deeper networks become trainable.

Deep Learning

Parametric, non-linear, hierarchical representational learning functions, optimized with stochastic gradient descent to encode domain knowledge.

Deep Network Equation

Mathematical representation of a deep feedforward network involving input x, parameters θ for each layer l, and a non-linear function h.

Feedforward Neural Network

A neural network architecture where data flows in one direction, from input to output, without loops or cycles.

Multi-Layer Perceptron (MLP)

Synonym for feedforward neural networks, emphasizing their layered structure.

Optimal Parameters

Finding the set of parameters (θ) that minimizes the difference between the network's predictions and the actual values in the training data.

Module in Neural Networks

A generic term for a building block within a neural network that transforms input data into an output based on an activation function.

Activation Function

A function applied within a module to introduce non-linearity and produce an output.

Recurrent Neural Networks

Networks with feedback connections: the output of a layer is fed back into the network as input, creating cycles.

Forward Propagation

Process of mapping the input to a predicted output.

ELU (Exponential Linear Unit)

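A smooth approximation of a rectifier: h(x) = x for x > 0 and exp(x) - 1 for x ≤ 0. The non-monotonic 'bump' when x < 0 and the role as default activation in models like BERT belong to the related GELU (Gaussian Error Linear Unit), not to ELU.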

Activation Function Choice (Hidden Layers)

ReLU or GELU are generally recommended for hidden layers in modern neural networks. For Recurrent Neural Networks, Tanh and/or Sigmoid are common.

Regression Output Layer

Single node -> Linear Activation

Binary Classification Output Layer

Sigmoid activation.

Multiclass Classification Output Layer

Softmax activation (one node per class).

Universal Approximation Theorem

With adequate hidden units, a feedforward network can approximate almost any continuous function.

Sigmoid Function

Output range (0,1). Differentiable. Commonly used for output layers to emulate probabilities.

Tanh Function

Output range (-1, +1). Outputs centered around 0. Stronger gradients, less 'positive' bias. Better for middle layers.

ReLU (Rectified Linear Unit)

A function that outputs the input directly if it is positive, otherwise, it outputs zero.

Dead Neurons (ReLU)

ReLU neurons that have been pushed into a state where they are inactive (output zero) for all inputs, so they no longer learn.

Advantage of ReLU

Better gradient propagation and sparse activation.

Leaky ReLU

A ReLU variant that allows a small, positive gradient when the unit is not active.

Parametric ReLU (PReLU)

ReLU variant where the slope of the negative part is learned during training.

Vanishing Gradients

Saturating functions such as Sigmoid and Tanh have near-zero derivatives at the extremes, so gradients multiplied through many layers shrink towards zero.

Squashing Function

Output values are squeezed into a limited range.

Study Notes

Introduction and History of Deep Learning

The Perceptron

  • Single-layer perceptrons are used for binary classification.
  • During processing, each input is multiplied by its corresponding weight.
  • The products are summed, and a bias is added.
  • If the result is above a threshold, the perceptron outputs 1; otherwise, it outputs 0.
  • Weights are adjusted one sample at a time.
  • Perceptron Limitations:
    • Can only solve linearly separable problems.
    • Fails to model non-linear boundaries, like the XOR problem.
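
The perceptron steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names, the threshold of 0, the learning rate, and the epoch count are all assumptions chosen for the example.

```python
def predict(weights, bias, x):
    """Weighted sum of the inputs plus bias, thresholded at 0 (assumed threshold)."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Classic perceptron rule: adjust the weights one sample at a time."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


def accuracy(samples):
    """Train on the samples and report the fraction classified correctly."""
    weights, bias = train_perceptron(samples)
    return sum(predict(weights, bias, x) == t for x, t in samples) / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # linearly separable
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # not linearly separable

print("AND accuracy:", accuracy(AND))
print("XOR accuracy:", accuracy(XOR))
```

On the AND data the rule settles on a separating line, whereas on XOR no setting of two weights and a bias can classify all four points, which is the limitation listed above.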

Moravec’s Paradox

  • Tasks that humans find easy, such as sensory perception and motor skills, are difficult for machines.
  • Tasks humans find hard, such as logical reasoning, are relatively easy for machines.
  • Reasoning requires little computation, whereas perception from sensors requires a lot.
  • Computers solve structured and rule-based reasoning tasks efficiently.
  • Perception requires interpreting large, unstructured and noisy sensory data, which is computationally intensive.

Two Paths of Machine Learning

  • Approach 1: Improve inputs by creating better features.
  • Encode domain knowledge to aid machine learning algorithms.
  • Classical image recognition pipeline:
    • Extract local features.
    • Aggregate local features over the image.
    • Train classical models on the aggregations.
  • Approach 2: Build neural networks with multiple layers, overcoming the perceptron's limitations by making the model more complex.

Deep Learning Arrives

  • Multi-layer networks are more powerful, but single-layer networks are easier to train.
  • Layer-by-layer training exploits this by training one layer at a time.
  • As a result, training multi-layered neural networks becomes easier.

Challenges of NNs Then

  • Processing power was limited.
  • Data was scarce.
  • Overfitting was a problem.
  • Vanishing gradients hindered training.
  • In experiments, multi-layer perceptrons did not yet prove useful.

Forward and Backward Propagation

Deep Learning

  • A family of parametric, non-linear, hierarchical representational learning functions
  • Massively optimized with stochastic gradient descent to encode domain invariances and stationarity
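  • Non-linear function: $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1) \dots, \theta_{L-1}), \theta_L)$
    • $x$: input
    • $\theta_l$: parameters for layer $l$
    • $a_l = h_l(x, \theta_l)$: (non-)linear function
  • Given a training corpus $\{X, Y\}$, find the optimal parameters: $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y) \in (X,Y)} \ell(y, a_L(x; \theta_{1,\dots,L}))$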

Deep Feedforward Networks

  • Also known as multilayer perceptrons (MLPs).
  • Approximates a function.
  • Defines a mapping: y = f(x; θ).
  • Learns the parameter values $\theta$ that give the best function approximation.
  • There are no feedback connections; networks that do include them are called recurrent neural networks.
  • Brains have many feedback connections.
  • Composite function: $y = a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1) \dots, \theta_{L-1}), \theta_L)$, where $\theta_l$ denotes the parameters in the $l$-th layer.
  • Simplified notation: $a_L = f(x; \theta) = h_L \circ h_{L-1} \circ \dots \circ h_1 \circ x$, where each function $h_l$ is parameterized by parameters $\theta_l$.
  • Neural network notation: networks are visualized as blocks. Modules are the building blocks; each one is a function that performs a transformation.
  • Modules:
    • A module receives data $x$ or another module's output as input.
    • A module returns an output $a$ based on its activation function $h(\dots)$.
    • Modules may or may not have trainable parameters $w$. Examples: $f = Ax$ (trainable parameters $A$), $f = \exp(x)$ (no parameters).

Requirements

  • Activations must be first-order differentiable almost everywhere.
  • Cycles in the architecture of blocks need special care: the graph has to be unfolded, as is done for recurrent networks.
  • Most models are feedforward networks, such as CNNs and Transformers.

MLPs: Training Goal and Overview

  • Utilize a dataset of inputs and desired outputs for training.
  • Random values are used to initialize all weights and biases.
  • Learning process involves cycling through 'forward-backward' propagation:
    • Forward step: Process the input to generate a predicted output.
    • Loss step: Quantify the disparity between the predicted output and the ground truth.
    • Backward step: Propagate gradients backward through the network to correct the predictions.
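
The forward, loss, and backward steps can be made concrete with a single linear neuron. The sketch below assumes a squared-error loss, a toy dataset generated from y = 2x + 1, and hypothetical names (`train`, `lr`, `steps`); none of these come from the notes themselves.

```python
import random


def train(data, steps=2000, lr=0.05, seed=0):
    """Fit y_hat = w * x + b with repeated forward / loss / backward steps."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # random initialization
    for _ in range(steps):
        x, y = rng.choice(data)
        y_hat = w * x + b              # forward step: input -> predicted output
        d_loss = 2.0 * (y_hat - y)     # loss step: derivative of (y_hat - y)^2 w.r.t. y_hat
        w -= lr * d_loss * x           # backward step: chain rule down to w
        b -= lr * d_loss               # backward step: chain rule down to b
    return w, b


# Toy ground truth (assumed for illustration): y = 2x + 1
data = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w, b = train(data)
print(w, b)
```

Because the data are noise-free, the learned `w` and `b` end up very close to 2 and 1.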

Linear / Fully-connected layer

  • The identity activation function implies no activation saturation
  • Results in strong & stable gradients.
  • Enables reliable learning with linear modules:
    • $x \in \mathbb{R}^{1 \times M}$, $w \in \mathbb{R}^{N \times M}$
    • $h(x; w) = x \cdot w^T + b$
    • $dh/dx = w$

Forward propagation

  • With linear layers, the same linear operation is applied repeatedly, layer after layer.
  • Start from the input: multiply by the weights, sum, and add the bias.
  • Repeat for all following layers until you reach the end.
  • The main new element is the activation function, which is applied after each layer.
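
As a rough sketch of these steps, the following plain-Python code pushes one input vector through two fully-connected layers. The weights, biases, layer sizes, and helper names are made up for illustration.

```python
def linear(x, w, b):
    """h(x; w) = x . w^T + b for one input vector x (length M) and w with N rows of length M."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(v):
    return [max(0.0, vi) for vi in v]


def identity(v):
    return v


def forward(x, layers):
    """Apply each (weights, bias, activation) layer in turn, starting from the input."""
    a = x
    for w, b, activation in layers:
        a = activation(linear(a, w, b))
    return a


# Hypothetical 2 -> 2 -> 1 network
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),  # hidden layer
    ([[2.0, 1.0]], [0.5], identity),                 # linear output layer
]
output = forward([3.0, 1.0], layers)
print(output)
```

For the input `[3.0, 1.0]`, the hidden layer produces `[2.0, 1.0]` and the output layer then gives `2*2 + 1*1 + 0.5`.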

Why activation functions?

  • Each hidden/output neuron is a linear sum.
  • A combination of linear functions is a linear function.
  • Function Calculations:
    • v(x) = ax + b
    • w(z) = cz + d
    • w(v(x)) = c(ax + b) + d = (ac)x + (cb + d)
  • Activation functions transform each neuron's weighted sum, which makes the overall function non-linear. The weights define the weighted sum; the activation function defines how that sum is turned into the output of a given layer.
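
The collapse of stacked linear functions can be checked numerically. In this small sketch the coefficients a, b, c, d are arbitrary example values, and ReLU stands in for a generic non-linearity.

```python
A, B, C, D = 2.0, 1.0, 3.0, -4.0  # arbitrary example coefficients


def v(x):
    return A * x + B


def w(z):
    return C * z + D


def composed_linear(x):
    """Two stacked linear functions."""
    return w(v(x))


def collapsed(x):
    """The single equivalent linear function (ac)x + (cb + d)."""
    return (A * C) * x + (C * B + D)


def composed_relu(x):
    """Same stack, but with a ReLU between the two linear functions."""
    return w(max(0.0, v(x)))


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
print([composed_linear(x) == collapsed(x) for x in xs])
# A linear function satisfies f(0) == (f(-2) + f(2)) / 2; the ReLU version does not.
print(composed_relu(0.0), (composed_relu(-2.0) + composed_relu(2.0)) / 2)
```

Without the activation, two layers are no more expressive than one; with it, the composed function bends and is no longer a straight line.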

Sigmoid Function

  • It has a range between zero and one
  • Differentiable, with derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
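
The derivative identity can be verified against a finite-difference estimate. The sketch assumes the standard logistic definition $\sigma(z) = 1 / (1 + e^{-z})$, which the notes do not spell out; the helper names are illustrative.

```python
import math


def sigmoid(z):
    """Standard logistic function (assumed definition)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def numeric_derivative(f, z, eps=1e-6):
    """Central finite difference, used only as an independent check."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)


for z in (-3.0, -1.0, 0.0, 2.0):
    print(z, sigmoid_prime(z), numeric_derivative(sigmoid, z))
```

The derivative peaks at z = 0 with value 0.25 and shrinks towards zero as |z| grows.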

Tanh Function

  • It has a better output range: (-1, +1).
  • Outputs are centered around 0 (rather than around 0.5, as with the sigmoid), which implies stronger gradients.
  • This gives less 'positive' bias for the next layers, unlike sigmoids, whose mean is 0.5 and not 0.
  • Both tanh and sigmoid saturate at the extremes, which leads to near-zero gradients. Under chain multiplication, gradients << 1 shrink the product further.
  • tanh(x) is better for middle layers.
  • Sigmoids are better for outputs, to emulate probabilities.
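
A quick numerical check of the centering and saturation points, written as a sketch with assumed sample inputs and an assumed depth of 10 layers for the chain product:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2


inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]  # symmetric sample inputs (assumed)
mean_sigmoid = sum(sigmoid(z) for z in inputs) / len(inputs)
mean_tanh = sum(math.tanh(z) for z in inputs) / len(inputs)
print("mean outputs:", mean_sigmoid, mean_tanh)

# Gradient at the origin: tanh is steeper than the sigmoid.
print("gradient at 0:", sigmoid_grad(0.0), tanh_grad(0.0))

# Chain multiplication across 10 layers, taking the sigmoid's best case (z = 0).
chain_sigmoid = sigmoid_grad(0.0) ** 10
print("10-layer sigmoid chain:", chain_sigmoid)

# Saturation: both gradients are tiny far from zero.
print("gradient at 5:", sigmoid_grad(5.0), tanh_grad(5.0))
```

Even in its best case, the sigmoid contributes a factor of at most 0.25 per layer, so the product across layers collapses quickly.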

Rectified Linear Unit (ReLU)

Advantages:

  • Sparse activation: In randomly initialized networks, ~50% active
  • Better gradient propagation:
    • Fewer vanishing gradient problems are found compared to sigmoidal activation functions, which saturate in both directions
    • e.g. with sigmoidal derivatives below 1, the chain-rule product (small number) × (small number) × … → 0
  • Efficient computation: Only comparison, addition and multiplication
  • Calculation: $h(x) = \max(0, x)$, with $dh/dx = 1$ when $x > 0$ and $dh/dx = 0$ when $x < 0$

Limitations:

  • Non-differentiable at zero
    • However, it is differentiable everywhere else, and the derivative's value at zero can be arbitrarily chosen to be 0 or 1
  • Not zero-centered
  • Unbounded
  • Dead neurons problem: Neurons are sometimes pushed into states in which they are inactive for essentially all inputs, rendering them virtually useless
    • Lower learning rates can mitigate this problem; it typically arises when the learning rate is too high
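
To make the definition and the dead-neuron problem concrete, here is a small sketch. The neuron's weight and its large negative bias are made-up values that mimic a unit pushed far into the inactive region, and the derivative at zero is set to 0 by convention.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU; the value at exactly 0 is chosen as 0 here."""
    return 1.0 if x > 0 else 0.0


# Hypothetical neuron whose bias has been pushed far negative.
W, BIAS = 1.0, -100.0
inputs = [-3.0, -1.0, 0.0, 2.0, 5.0]

pre_activations = [W * x + BIAS for x in inputs]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

print(outputs)  # all zero: the neuron is inactive for every input
print(grads)    # all zero: no gradient flows back, so the weights never recover
```

Since every gradient is zero, gradient-based updates cannot move this neuron back into the active region.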

Leaky ReLU

  • Leaky ReLUs allow a small, positive gradient when the unit is not active.
  • Parametric ReLUs (PReLU) treat the slope of the inactive part, $a$, as a learnable parameter.
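
A minimal sketch of both variants, assuming the usual form h(x) = x for x > 0 and a·x otherwise. The default slope of 0.01 and the function names are illustrative choices, not values taken from the notes.

```python
def leaky_relu(x, a=0.01):
    """Identity for positive inputs, small slope a for the inactive part."""
    return x if x > 0 else a * x


def leaky_relu_grad(x, a=0.01):
    """Gradient w.r.t. the input: never exactly zero, so the unit cannot die."""
    return 1.0 if x > 0 else a


def prelu_slope_grad(x):
    """Gradient of the output w.r.t. the slope a, which PReLU learns."""
    return 0.0 if x > 0 else x


print(leaky_relu(-5.0), leaky_relu_grad(-5.0))
print(prelu_slope_grad(-3.0), prelu_slope_grad(3.0))
```

For PReLU, the slope would be updated by gradient descent alongside the weights, using `prelu_slope_grad` in the chain rule.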

Exponential Linear Unit (ELU)

  • ELU calculation: $h(x) = x$ when $x > 0$, and $h(x) = \exp(x) - 1$ when $x \le 0$
  • The default activation for models such as BERT is the related GELU (Gaussian Error Linear Unit), not ELU.
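
The ELU formula translates directly into code. For comparison, the sketch also includes GELU, which the activation-choice guidance recommends for hidden layers; its definition x·Φ(x), with Φ the standard normal CDF, is an assumption brought in from outside these notes.

```python
import math


def elu(x):
    """ELU: x for x > 0, exp(x) - 1 otherwise."""
    return x if x > 0 else math.exp(x) - 1.0


def gelu(x):
    """GELU in its exact form x * Phi(x) (assumed standard definition)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-3.0, -1.0, -0.1, 0.0, 2.0):
    print(x, elu(x), gelu(x))
```

ELU decreases monotonically towards -1 for negative inputs, while GELU dips below zero and then returns towards zero, which gives it a non-monotonic 'bump' for x < 0.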

How to choose an activation function

  • Hidden layers: use ReLU or GELU in modern NNs.
  • Recurrent neural networks: Tanh and/or Sigmoid activations are common.
  • Output layer: the activation depends on the task:
  • Regression: One node → Linear activation
  • Binary Classification: One node → Sigmoid activation
  • Multiclass Classification: One node per class → Softmax activation
  • Multilabel Classification: One node per class → Sigmoid activation

Cost Functions

Multiclass Classification: SoftMax

Outputs a probability distribution, with the formula $h(x_i) = \exp(x_i) / \sum_j \exp(x_j)$. For numerical stability, it is important to avoid exponentiating very large or very small numbers.
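
One common way to avoid exponentiating large numbers is to subtract the maximum input before applying exp, which leaves the result unchanged mathematically. The sketch below uses that trick; it is one possible approach, not necessarily the one the original lesson had in mind.

```python
import math


def softmax(logits):
    """Softmax with the max subtracted first, so exp() never sees a large argument."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


small = softmax([0.0, 1.0, 2.0])
large = softmax([1000.0, 1001.0, 1002.0])  # math.exp(1000.0) alone would overflow
print(small)
print(large)
```

Both calls give the same distribution, because softmax only depends on the differences between the inputs.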

Universal Approximation Theorem

  • Feedforward networks with hidden layers can approximate virtually any continuous function.
  • A sufficiently large MLP with a single hidden layer can approximate any continuous function, provided the network has enough hidden units.
  • The theorem makes no guarantee that the training algorithm can learn that function.
  • Training may also choose the wrong function because of overfitting.

3: Deep Learning Optimization I

Optimization Versus Learning

  • Optimization seeks a model's best parameters for a set of data points, minimizing the objective function.
  • Learning aims to reduce errors on training data and generalize well to unseen data.

Minimizing Risk

  • We want to minimize the error on observed data.
  • The goal is to minimize the cost function with an added regularization term: $\min_w \mathbb{E}_{x,y \sim p_{data}}[L(f(x, w), y)] + \lambda \Omega(w)$
  • The aim is that (1) predictions are not too wrong, while (2) the model is not 'too geared' towards the observed data.
  • Problem: The true distribution $p_{data}$ is usually unavailable, so the expectation is estimated from the observed data.
  • To minimize any function, we take a step $\delta$. Our best bet is to follow the negative gradient: $\delta = -\frac{d}{dw} L(f(x, w), y)$, scaled by a learning rate.

Stochastic Gradient Descent

  • Mini-batches: Gradients are computed and parameters updated on a small batch of samples, rather than on the entire dataset.
  • Properties of SGD:
    • The randomness acts as a regularizer and reduces overfitting.
    • Reshuffling is important: it prevents the batches from repeating in the same order in every epoch.
    • Epoch: one complete pass through all mini-batches. Make sure each batch has a good balance of classes/data.
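
These properties can be illustrated with a small mini-batch SGD loop. The example fits a line with a mean-squared-error loss; the dataset (y = 3x - 2), batch size, learning rate, and function name are assumptions for the sketch.

```python
import random


def sgd(data, batch_size=2, epochs=200, lr=0.1, seed=0):
    """Mini-batch SGD for y_hat = w * x + b with a mean-squared-error loss."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):                      # one epoch = one pass over all mini-batches
        rng.shuffle(data)                        # reshuffle so batches differ between epochs
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w                     # update after every mini-batch
            b -= lr * grad_b
    return w, b


# Toy data (assumed): nine points on the line y = 3x - 2
data = [(x / 4, 3.0 * (x / 4) - 2.0) for x in range(-4, 5)]
w, b = sgd(data)
print(w, b)
```

Each update only sees two samples, yet the parameters still end up close to 3 and -2 because the noisy steps average out over many batches.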

Batch Size Analysis

  • Large batch sizes give more accurate gradient estimates.
  • Small batch sizes underutilize the hardware, but can act as a regularizer.
  • Batch size and learning rate: the two are typically coupled by the rule "double the batch size, double the learning rate".
  • A good guideline: use the largest batch size that fits on the GPU, as a power of 2. Gradient descent is already an approximation, since the true data distribution is typically unknown.

A Nutshell Review

  • First, define the NN as $y = h_L \circ h_{L-1} \circ \dots \circ h_1(x)$, where each module $h_l$ comes with parameters $w_l$.
  • Find the optimal network by minimizing the loss function.
  • Rely on stochastic gradient descent methods to obtain the parameters.

Challenges of Optimizing Deep Networks

  • Training is a non-convex optimization involving functions with multiple optima.
  • It raises multiple complex problems:
    • How do we avoid getting stuck in local optima?
    • What is a reasonable learning rate to use?
    • What if the loss surface morphology changes?

Main Optimizing Challenges

  1. Ill-conditioning: even a strong gradient might not be good enough.
  2. Local optimization is susceptible to local minima.
  3. Ravines, plateaus, cliffs, and pathological curvature.
  4. Vanishing and exploding gradients often occur.
  5. Long-term dependencies create problems.

Ill-Conditioning

  • The Hessian matrix is a square matrix of second-order partial derivatives; it describes the curvature of a function of multiple variables.
  • The Hessian matrix is symmetric.
  • Curvature is determined by the second derivative.
    • Negative curvature: the function decreases faster than the gradient predicts.
    • Positive curvature: the function decreases more slowly than the gradient predicts.
    • No curvature: the gradient's prediction is correct.
  • Critical points and the Hessian matrix:
    • A local minimum occurs if the Hessian is positive definite.
    • A local maximum occurs if the Hessian is negative definite.
    • A saddle point occurs when the eigenvalues are both positive and negative.
  • Consider the eigenvalue decomposition of the Hessian matrix. The condition number is $\max_{i,j} |\lambda_i / \lambda_j|$, the ratio of the magnitudes of the largest and smallest eigenvalues.
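
For a function of two variables, the Hessian is a symmetric 2×2 matrix whose eigenvalues have a closed form, so the classification of critical points and the condition number can be sketched without any library. The helper names and example matrices below are illustrative assumptions.

```python
import math


def eigenvalues_2x2(a, b, d):
    """Eigenvalues (smaller, larger) of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return mean - radius, mean + radius


def classify(a, b, d):
    """Classify a critical point from the signs of the Hessian's eigenvalues."""
    lo, hi = eigenvalues_2x2(a, b, d)
    if lo > 0:
        return "local minimum"   # positive definite
    if hi < 0:
        return "local maximum"   # negative definite
    if lo < 0 < hi:
        return "saddle point"    # mixed signs
    return "inconclusive"        # at least one zero eigenvalue


def condition_number(a, b, d):
    """Ratio of the largest to the smallest eigenvalue magnitude."""
    lo, hi = eigenvalues_2x2(a, b, d)
    small, large = sorted((abs(lo), abs(hi)))
    return large / small


print(classify(2.0, 0.0, 2.0))            # Hessian of x^2 + y^2
print(classify(-2.0, 0.0, -2.0))          # Hessian of -(x^2 + y^2)
print(classify(2.0, 0.0, -2.0))           # Hessian of x^2 - y^2
print(condition_number(2.0, 0.0, 200.0))  # Hessian of x^2 + 100 y^2
```

A large condition number, as in the last example, means the curvature differs strongly between directions, which is the ill-conditioned case where a single learning rate suits no direction well.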

Local Minima

  • Model identifiability: a model is identifiable if a sufficiently large training set can rule out all but one setting of the model's parameters.
  • Neural networks are frequently not identifiable: equivalent models can be obtained by exchanging variables with each other.
  • As a result, the cost function can have an extremely large number of local minima.

Ravines

  • Areas where the gradient is large in one direction and small in the other directions.

Plateaus and Flat Areas

  • The gradient there is small or near zero, so little or no learning takes place.

Cliffs and Exploding Gradients

  • Neural networks with many layers often have extremely steep regions in the loss surface that resemble cliffs.

Long-Term Dependencies

  • These gradient problems arise in networks with many layers and in recurrent networks.
  • The problem is specific to neural networks.
  • Vanishing gradients → no direction to move.
  • Exploding gradients → learning becomes unstable.
  • Training-trajectory dependency: it is hard to recover from a bad start!
  • Most of the time, these problems can be mitigated.

Advanced Optimizers

  • Can we improve the learning rate?
  • Can we get better gradients?

Setting the Learning Rate

  • The approach is largely empirical: a suitable value depends on the dataset, among other factors.

Improving gradient descent

Variants of gradient descent:

  • Stochastic Gradient Descent with momentum
  • Nesterov momentum
  • SGD with adaptive learning rates (AdaGrad, RMSProp, Adam)
  • Second-order approximations such as Newton's method

Momentum

  • Designed to accelerate learning, especially on loss surfaces with high curvature.
  • Momentum can be understood via exponentially weighted moving averages.
  • Example: for a sequence $S$ with a lot of noise, $v_t = \beta v_{t-1} + (1 - \beta) s_t$ with $\beta \in [0, 1]$, where $v_t$ is the exponentially weighted average.
  • The value of $\beta$ sets the balance between smoothing out noise and following recent values.
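
The exponentially weighted moving average is easy to sketch. In the example below, the starting value v = 0, the choice beta = 0.9, and the noisy test sequence are all assumptions made for illustration.

```python
def ewma(sequence, beta=0.9):
    """Exponentially weighted moving average: v_t = beta * v_{t-1} + (1 - beta) * s_t."""
    v = 0.0  # assumed starting value
    averaged = []
    for s in sequence:
        v = beta * v + (1.0 - beta) * s
        averaged.append(v)
    return averaged


# A noisy sequence: the constant 5 with alternating +1 / -1 noise (made up)
noisy = [5.0 + (1.0 if i % 2 == 0 else -1.0) for i in range(200)]
smoothed = ewma(noisy)
print(noisy[-4:])
print(smoothed[-4:])
```

The raw values keep jumping between 4 and 6, while the averaged values settle close to 5. Momentum applies the same kind of averaging to successive gradients.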

Quiz Team

Description

Questions covering activation functions like Sigmoid, and their applications in neural networks. Also explores concepts like the Universal Approximation Theorem, limitations of single-layer perceptrons, and Moravec's Paradox.
