Multi-Layer Neural Networks

Questions and Answers

What is the primary role of the activation function in a perceptron?

  • To compute the dot product of the weight vector and input.
  • To compute the bias term.
  • To determine whether the perceptron gets activated based on the sum of weighted inputs. (correct)
  • To normalize the input vector.

What is a key limitation of using a unit step function as an activation function in modern neural networks?

  • It is ideal for handling nuanced decision boundaries.
  • Its derivative is not defined, which is incompatible with gradient-based optimization. (correct)
  • It captures the complexities and subtleties within datasets very effectively.
  • It is computationally expensive.

Which of the following is NOT a key property of a good activation function?

  • Linearity (correct)
  • Differentiability
  • Computational Efficiency
  • Non-linearity

What is a key characteristic of sigmoid neurons?

They map real-valued inputs to values between 0 and 1, making them well suited for probabilities.

What is a major limitation of perceptrons when used as binary classifiers?

They struggle when data is not linearly separable.

According to the theory of universal approximation, what capability do neural networks gain by arranging neurons in layers and using non-linear activation functions?

The ability to represent any non-linear function.

What is a key challenge associated with using a single-layer neural network to approximate complex functions?

It requires an excessively large number of neurons.

What is the primary purpose of an input layer in a one-hidden-layer neural network?

To serve as a placeholder for input values.

What is a major limitation of the perceptron learning algorithm when applied to multi-layer perceptron networks?

It cannot effectively propagate errors to middle (hidden) layers.

What is the primary motivation for using deeper neural networks rather than shallow networks?

To learn more complex patterns with fewer total parameters.

What is the purpose of introducing non-linearity with activation functions?

To enable the network to learn complex patterns.

If a deep neural network is 'inherently compositional', what does this imply about its ability to represent real-world data?

It is well suited for representing hierarchical structures.

In the context of neural network architecture, what is meant by a 'fully connected' network?

A network where each neuron in one layer is connected to every neuron in the next layer.

How is the number of neurons determined in the input layer of a deep neural network designed for image processing?

It is determined by the pixel dimensions of the input images.

What are some challenges in determining the number of hidden layers and neurons for a neural network?

The number of hidden layers and the number of neurons per layer are hyperparameters that must be determined by the designer based on the task.

In a multi-class classification problem using a neural network, what generally determines the number of neurons in the output layer?

The number of classes.

Why is it general practice, in multi-class classification with a DNN, for the number of output neurons to correspond to the number of classes, with a softmax activation function?

To take advantage of the probability-distribution output of softmax, and because the majority of such tasks are multi-class.

What method can be used to determine weights and biases (parameters) in a Deep Neural Network or Multi Layer Neural Network?

Gradient Descent.

In the context of training neural networks, what is meant by the term 'ERM Framework'?

A set of general steps including the decision process, loss function, empirical risk, and optimization.

Besides the number of inputs and number of classes, on which other parameters do DNN models depend?

The number of layers and the number of neurons.

In the context of the loss function for multi-class classification, what does the 'Categorical Cross-Entropy Loss' measure?

How well the predicted probability distribution matches the true class labels.

If the goal of the loss is to maximize the predicted probability of the correct class, what is a good choice of target representation?

A one-hot encoded vector.

In the formulation of the ERM optimization problem, what does it mean to find a 'weight vector' and 'bias term'?

Finding parameters to minimize the average loss.

What is the significance of 'forward and backward propagations' in the context of training multi-layer networks?

The weights are learned through combinations of forward and backward propagations.

What is the main purpose of Backward Propagation?

An algorithm to update the weights.

What three requirements are needed for backpropagation?

A feed-forward neural network; a dataset of input-output pairs (xi, yi); and an error function E that defines the error between the desired output and the calculated output.

Based on how much data is used, how many main variants of gradient descent are there?

There are three.

What distinguishes Stochastic Gradient Descent (SGD) from Batch Gradient Descent?

SGD updates parameters for each training example.

What is the best definition of "Batch Gradient Descent"?

Also known as vanilla gradient descent, it computes the gradient of the cost function for the entire training set.

How does mini-batch gradient descent compare with the other methods?

It updates the parameters for every mini-batch of n training examples.

What mini-batch sizes are commonly used, and how does mini-batch gradient descent behave for large models?

Common mini-batch sizes range between 50 and 256 and lead to stable results for large models.

How can you avoid getting trapped in suboptimal local minima or saddle points when using gradient descent?

By using annealing, i.e. learning-rate schedules.

Which action is typically taken when approaching the end of mini-batch learning?

Turn down the learning rate.

When setting up and automating a learning algorithm, what kind of "rate" must be chosen?

An initial learning rate.

What is an appropriate course of action if the error keeps getting worse or oscillates wildly?

Use a smaller, updated initial learning rate.

What does the term "pixel784" represent?

It represents the 784 pixel values of an image arranged in an array (e.g. a flattened 28 × 28 image).

Flashcards

Perceptron

A fundamental component of modern neural networks, often called a neuron.

Activation function

Determines if a neuron gets activated based on the sum of weighted inputs.

Sigmoid neurons

A type of neuron using a sigmoid function to model output.

Linear separability limitation

The limitation that perceptrons perform well only when data is linearly separable.

Theory of Universal Approximation

States that by arranging neurons in layers and using non-linear activation functions, a neural network can represent any non-linear function.

Activation functions

Introduces non-linearity into the output of neurons.

Challenge of Single Layer Neural Network

Achieving a good approximation requires a large number of neurons.

Shallow Network concept

Avoids placing an exponentially large number of neurons in a single hidden layer by distributing the computation across one or a few hidden layers.

Limitations of Shallow Neural Network

Limited expressiveness: a small number of neurons may not capture complex, non-linear decision boundaries.

Deep Neural Network

Consists of more than 2 hidden layers.

Deep networks are compositional

Makes them well suited for representing hierarchical structures in real-world data.

Multi-layer Fully Connected Neural Network

A network in which every neuron in one layer is connected to every neuron in the next layer, with a unique weight on every edge.

Number of neurons in the input layer

Determined by the size of the input (e.g. the pixel dimensions of the input images).

Selection of k and L

Impacts the network’s capacity and computational efficiency.

Number of neurons in the output layer

Depends on the classification task and the activation function used.

DNN Models vary

The models may vary in the number of layers and the number of neurons.

Categorical Cross-Entropy Loss

Measures how well the predicted probability distribution matches the true class labels.

Target Representation

We represent our target variable with its one-hot-encoded version.

Perceptron Learning Algorithm

Designed for single-layer perceptrons (a single neuron) and applicable only to problems that are linearly separable.

Error Propagation

Updates all the weights of the output layer based on the error.

Backpropagation

A technique used by deep networks to compute the error of the network and propagate it backwards to update the weights.

Batch Gradient Descent

Also known as vanilla gradient descent; it computes the gradient of the cost function w.r.t. the parameters w for the entire training set.

Stochastic Gradient Descent

Performs a parameter update for each individual training example, i.e. input xi and label yi.

Mini-Batch Gradient Descent

Updates the parameters for every mini-batch of n training examples.

Learning rate

The step size used when updating the parameters; in basic gradient descent the same rate is applied to all parameter updates.

Study Notes

  • This lecture covers Artificial Neural Networks (ANNs), focusing on multi-layer neural networks: the limitations of perceptrons, deep neural networks, and training with gradient descent.
  • The instructor is Siman Giri, the Module Leader for 6CS012.

Learning Objectives

  • Review the limitations of perceptrons discussed in the previous week.
  • Overcome the limitations of perceptrons and the Perceptron Learning Algorithm.
  • Discuss Deeper Neural Networks (also called Multi-Layer Perceptrons).
  • Understand the basics and architecture of Deep Neural Networks.
  • Train a Deep Neural Network with Gradient Descent for multi-class classification problems.

Recap of Challenges with Simple Perceptrons

  • Challenge 1: Automating feature extraction.
    • High dimensionality results when using pixel values directly (e.g., 784 columns for 28x28 images).
    • Datasets can become very large for larger images.
  • Challenge 2: Non-linear decision boundaries.
    • Logistic regression and softmax regression separate data well only when it is linearly separable.
    • Standard perceptrons fail when classes/labels are not linearly separable.

Perceptron and its Limitations

  • The perceptron is a fundamental component of modern neural network architectures, often referred to as neurons.
  • Perceptrons are often single hidden units connected to some input x.
  • It computes the dot product of a weight vector w and input x, adds a bias term, and passes the result through an activation function.
  • The activation function determines if the perceptron/neuron activates based on the sum of weighted inputs.
  • The original perceptron, which uses the unit step function as its activation function, is not ideal for most modern problems.
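The computation above can be sketched in a few lines of Python (an illustrative NumPy sketch, not from the lecture; the AND weights are chosen by hand):

```python
import numpy as np

def unit_step(z):
    """Original perceptron activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    """Dot product of the weight vector and input, plus a bias, through the activation."""
    z = np.dot(w, x) + b
    return unit_step(z)

# Example: weights chosen by hand so the perceptron computes logical AND.
w = np.array([1.0, 1.0])
b = -1.5
print(perceptron(np.array([1, 1]), w, b))  # -> 1
print(perceptron(np.array([0, 1]), w, b))  # -> 0
```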

Why Not a Unit Step Function

  • The unit step function is defined as f(z) = 1 if z ≥ 0, else f(z) = 0, where z = w₀ + wᵀx.
  • Unit step functions do not capture subtleties of the data and create a sharp boundary at 0.
    • It implies all positive sums classify as class 1, and all negative sums classify as class 0.
    • It does not accommodate cases where negative values can belong to class 1.
  • The output is binary only, either 0 or 1, limiting nuanced decision boundaries.
  • Small changes in the input do not change the output, limiting the model's ability to learn complex patterns.
  • The non-differentiability of the function is incompatible with gradient-based optimization, such as backpropagation in neural networks.

Activation Function

  • Introduces non-linearity to the output of neurons.
  • Good activation function properties include:
    • Non-linearity to learn complex patterns and decision boundaries.
    • Differentiability to enable gradient-based optimizations (i.e., gradient descent).
    • Computational efficiency for real-time applications.
    • Gradient behavior to avoid vanishing or exploding gradients.

Sigmoid Neurons

  • A type of artificial neuron using the sigmoid activation function to model the output of the neuron.
  • The sigmoid function is a smooth, S-shaped curve that maps any real-valued input to a value between 0 and 1.
    • It makes it useful for problems involving probabilities or binary classification.
  • The sigmoid activation function is defined as σ(z) = 1 / (1 + e^(−z)), where z = w₀ + Σᵢ wᵢxᵢ is the weighted sum of the inputs to the neuron.
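A minimal sketch of a sigmoid neuron (illustrative NumPy code, not from the slides); note that, unlike the unit step, the sigmoid has a derivative defined everywhere, which is what gradient-based optimization needs:

```python
import numpy as np

def sigmoid(z):
    """Smooth, S-shaped map from any real value to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """d(sigma)/dz = sigma(z) * (1 - sigma(z)); defined for every z."""
    s = sigmoid(z)
    return s * (1.0 - s)

def sigmoid_neuron(x, w, b):
    """Weighted sum of the inputs plus bias, squashed to a probability-like output."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.2])
b = 0.05
print(sigmoid_neuron(x, w, b))  # strictly between 0 and 1
```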

Limitations of the Perceptron

  • Perceptrons are binary classifiers that work well when data is linearly separable.
  • Performance can be lower when data isn't linearly separable.
  • Arranging perceptrons in multiple layers makes XOR and similar problems solvable.
  • The "Theory of Universal Approximation" backs up the idea that multiple perceptrons/neurons in layers can learn non-linearly separable functions, such as XOR.

Theory of Universal Approximation

  • States that by arranging neurons in layers and using nonlinear activation functions, neural networks can represent any non-linear function.
  • With this in mind, neural networks become powerful for function approximation, pattern recognition, and deep learning applications.
  • Any boolean function can be implemented using a one-hidden-layer perceptron network with 2^n neurons, where n is the number of inputs.
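As a concrete illustration of why layering helps, here is a hand-wired one-hidden-layer network that computes XOR, which no single perceptron can represent (the weights are chosen manually for illustration, not learned):

```python
import numpy as np

def step(z):
    return (z >= 0).astype(int)

def xor_net(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: h1 fires for OR(x1, x2), h2 fires for AND(x1, x2).
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    # Output neuron fires when OR is true but AND is not, i.e. XOR.
    w2 = np.array([1.0, -1.0])
    b2 = -0.5
    return int(w2 @ h + b2 >= 0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))  # prints the XOR truth table
```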

Challenges of Single Layer Neural Networks

  • While a single-layer perceptron network can represent any function, real-world functions require many neurons for a good approximation, creating inefficiency.
  • A single layer could require millions of neurons, making computation inefficient for tasks like image recognition (e.g. dog vs. cat classification).
  • Multi-layer (aka deeper) neural networks are preferred in practice.

Architecture of One Hidden Layer Neural Network

  • The input layer functions as a placeholder and involves no computation; the input is x = [x1, x2, x3, x4].
  • Computation starts in the first hidden layer and propagates forward to the next layer.
  • The hidden layer computation is z¹ = W¹x + b¹ and a¹ = f¹(z¹).
    • W¹ is the weight matrix of the hidden layer, which has 4 neurons.
  • Applying the activation function f¹ (here the sigmoid σ) yields the activations a¹.

Architecture of One Hidden Layer Neural Network - Output

  • Neurons in the output layer can have different activation functions, depending on the classification task.
  • Binary uses a sigmoid activation function, and multi-class uses a softmax activation function.
  • For demonstration, the setup utilizes a single neuron with a sigmoid activation function.
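A minimal NumPy sketch of this forward pass, assuming 4 inputs, a 4-neuron sigmoid hidden layer, and a single sigmoid output neuron (the weights here are random placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Input layer: a placeholder for the input values x = [x1, x2, x3, x4].
x = np.array([0.2, -0.7, 1.5, 0.3])

# Hidden layer: 4 neurons, so W1 has shape (4, 4) and b1 has shape (4,).
W1 = rng.normal(size=(4, 4))
b1 = rng.normal(size=4)
z1 = W1 @ x + b1        # z1 = W1 x + b1
a1 = sigmoid(z1)        # a1 = f1(z1), with f1 = sigmoid

# Output layer: a single neuron with a sigmoid, as in the binary-classification setup.
w2 = rng.normal(size=4)
b2 = rng.normal()
y_hat = sigmoid(w2 @ a1 + b2)
print(y_hat)            # predicted probability of the positive class
```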

Limitations of Perceptron Learning Algorithm

  • Perceptron Learning Algorithm is designed for single-layer perceptrons or one neuron.
    • It applies to problems that are linearly separable only.
  • The algorithm adjusts weights iteratively based on the prediction error, updating when the model is incorrect.
  • A key limitation of the perceptron learning algorithm is the inability to propagate errors to middle (hidden) layers: preventing it from solving nonlinear classification issues.
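For comparison, a minimal sketch of the classic perceptron learning rule (illustrative; the update only touches the weights of the single neuron, which is why errors cannot be assigned to hidden layers):

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Iteratively adjust w and b based on the prediction error for each sample."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_pred = 1 if np.dot(w, xi) + b >= 0 else 0
            error = yi - y_pred        # non-zero only when the prediction is wrong
            w += lr * error * xi
            b += lr * error
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
# Converges for linearly separable targets such as AND ...
print(train_perceptron(X, np.array([0, 0, 0, 1])))
# ... but can never converge for the XOR targets [0, 1, 1, 0].
```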

Towards Multi-Layer (Deep) Neural Networks

  • A solution requires an updated learning algorithm that can feed inputs forward and propagate errors backward to adjust the weights of the hidden layers.
  • This raises the question: can the gradient descent algorithm be used?

From Single Layer to Shallow Network

  • Single Layer Neural Networks
    • Networks with one hidden layer and up to 2^n neurons.
    • Can theoretically represent any function, but require many neurons, making them computationally inefficient.
  • Shallow Neural Network Concept
    • Aims to avoid exponentially large number of neurons by distributing the computation across multiple hidden layers with fewer neurons per layer.
  • A shallow neural network typically consists of one or a few hidden layers, reducing the need for 2^n neurons in a single layer while still capturing complex patterns.

From Shallow to Deep Neural Network

  • Limitations of Shallow Neural Networks
    • Universality: While shallow multi-layer perceptrons (MLP) can approximate any continuous function with enough neurons, neuron quantity can be impractically large.
    • Expressivity: A small number of neurons might not capture complex, non-linear decision boundaries needed for classification problems.
    • Learnability: There's no guarantee that the learning method will find the optimal solution.

Deep Neural Networks (DNN)

  • Definition: Any network with more than 2 hidden layers.
  • Motivation:
    • Deeper networks can represent more complex patterns.
    • Statistical: deep networks are inherently compositional, combining simple patterns recursively.
    • Computational and expressiveness: deeper networks are more expressive than shallow ones and can learn more complex patterns with the same total number of parameters.

Understanding the Architecture of DNN

  • A multi-layer fully connected neural network has a weight on every edge; every neuron in one layer is connected to every neuron in the next layer.
  • Individual neurons compute a weighted sum and apply an activation function.
  • The number of neurons in the input layer is determined by the size of the input image.
    • The input layer is just a placeholder; for MNIST images of shape (28 × 28 × 1), the input layer has 784 neurons (28 × 28 × 1 = 784).
  • The number of hidden layers L and the number of neurons per layer k are hyperparameters determined by the designer.
    • Increasing the number of neurons improves capacity but can cause overfitting.
    • Decreasing the number of neurons can cause underfitting.

Output Layer Number of Neurons

  • Depends on the classification task and the activation function used:
    • Binary classification: one neuron with a sigmoid activation function.
    • Multi-class classification: the number of neurons equals the number of classes, with a softmax activation function applied.
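A short illustrative sketch of the softmax applied to the output layer, turning raw output scores into a probability distribution over the classes:

```python
import numpy as np

def softmax(z):
    """Convert a vector of raw class scores into probabilities that sum to 1."""
    z = z - np.max(z)          # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])   # one raw score per class
probs = softmax(scores)
print(probs, probs.sum())            # probabilities that sum to 1.0
```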

Training the DNN

  • Key Question: How to learn the weights and biases in Deep Neural Networks (DNNs)?
  • Training is done via an Empirical Risk Minimization (ERM) Framework.
    • Requires a decision process, a loss function, the empirical risk, and an optimization method for training.

Defining the DNN as a Model

  • Unlike traditional machine learning models, a DNN is a mathematical model composed of multiple layers of neurons.
  • The models vary in the number of layers, number of neurons and general structure.
  • The architecture mainly depends on the nature of the task and the nature of the data.
  • For MNIST digit classification, the first layer has input placeholders, hidden layers can have 64, 128 and 512 neurons, and the output layer has 10 neurons for prediction.
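As a concrete sketch of such a model (Keras is used here purely for illustration; the lecture does not prescribe a framework, and the sigmoid hidden activations simply mirror the earlier discussion):

```python
import tensorflow as tf

# Hypothetical MNIST classifier: 784 input placeholders, hidden layers of
# 512, 128 and 64 neurons, and 10 output neurons (one per digit) with softmax.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),              # input layer: placeholder only
    tf.keras.layers.Dense(512, activation="sigmoid"),
    tf.keras.layers.Dense(128, activation="sigmoid"),
    tf.keras.layers.Dense(64, activation="sigmoid"),
    tf.keras.layers.Dense(10, activation="softmax"),  # output: class probabilities
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```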

Training the DNN within the ERM Framework

  • Objective of DNNs with Softmax Output
    • Minimize the average loss over the training dataset, formulated as L(W, b) = −(1/n) Σᵢ Σₖ yᵢₖ log(ŷᵢₖ), i.e. find the weights W and biases b that minimize the average log loss over the data.

Loss Function: Categorical Cross Entropy

  • Categorical Cross-Entropy Loss measures how well the predicted probability distribution matches the true class labels in classification tasks.
    • It is used in multi-class classification problems.
  • Goal: To maximize the probability of the correct class.
  • For a single sample xᵢ with true one-hot label yᵢ and predicted probabilities ŷᵢ: ℓ(yᵢ, ŷᵢ) = −Σₖ yᵢₖ log(ŷᵢₖ).
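A minimal NumPy sketch of this loss for one sample (the probabilities are illustrative values):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Per-sample loss: -sum_k y_k * log(y_hat_k), with y_true one-hot encoded."""
    y_pred = np.clip(y_pred, eps, 1.0)     # avoid log(0)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0, 0, 1, 0])            # one-hot: the true class is index 2
y_pred = np.array([0.1, 0.2, 0.6, 0.1])    # predicted probability distribution
print(categorical_cross_entropy(y_true, y_pred))   # -log(0.6) ≈ 0.51
```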

Computing Gradient

  • The weights in multi-layer networks are learned through combinations of forward and backward propagation.
    • Forward propagation computes the activations; backward propagation determines the errors used to change the weights.

Backpropagation

  • Backpropagation is a technique used by deep networks to compute the error of the network by comparing the calculated output with the desired output.
  • There are three requirements: a feed-forward neural network, a dataset of input-output pairs (xi, yi), and an error function E.
  • The error function compares the network's calculated output with the desired output from the dataset.
  • The derivatives are calculated in a forward pass followed by a backward pass, and the weights are updated accordingly.
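A compact sketch of one forward and one backward pass for a one-hidden-layer network with sigmoid hidden units, softmax output and cross-entropy loss (illustrative, not the lecture's exact notation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def backprop_step(x, y_onehot, W1, b1, W2, b2, lr=0.1):
    # Forward propagation: compute the activations layer by layer.
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    y_hat = softmax(z2)

    # Backward propagation: push the error back through the layers (chain rule).
    dz2 = y_hat - y_onehot                 # gradient of cross-entropy w.r.t. z2
    dW2 = np.outer(dz2, a1)
    db2 = dz2
    dz1 = (W2.T @ dz2) * a1 * (1 - a1)     # through the sigmoid derivative
    dW1 = np.outer(dz1, x)
    db1 = dz1

    # Gradient-descent update of all weights and biases.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2
```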

Variants of Gradient Descent

There are three main variants, depending on how much data is used to compute each update:

  • Batch gradient descent
  • Stochastic gradient descent
  • Mini-batch gradient descent
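A schematic comparison of the three variants (pseudocode-style Python; `compute_gradient` is a hypothetical helper that returns the gradient of the loss on the data it is given):

```python
import numpy as np

def batch_gd(params, X, y, compute_gradient, lr=0.01, epochs=100):
    """Vanilla gradient descent: one update per epoch over the entire training set."""
    for _ in range(epochs):
        params -= lr * compute_gradient(params, X, y)
    return params

def sgd(params, X, y, compute_gradient, lr=0.01, epochs=100):
    """Stochastic gradient descent: one update per individual training example."""
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):
            params -= lr * compute_gradient(params, X[i:i + 1], y[i:i + 1])
    return params

def mini_batch_gd(params, X, y, compute_gradient, lr=0.01, epochs=100, batch_size=64):
    """Mini-batch gradient descent: one update per batch of n examples (commonly 50-256)."""
    for _ in range(epochs):
        idx = np.random.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            params -= lr * compute_gradient(params, X[batch], y[batch])
    return params
```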

Common Issues with Gradient Descent

  • The same learning rate applies to all parameter updates.
  • Getting trapped in suboptimal local minima or saddle points.
  • Choosing a proper learning rate is difficult.
  • Learning-rate reduction schedules are defined in advance and cannot adapt to the characteristics of the dataset.
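One common remedy is annealing via a learning-rate schedule; a simple step-decay sketch (the decay factor and interval are assumed values, not from the lecture):

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs (step decay / annealing)."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

initial_lr = 0.1
for epoch in (0, 9, 10, 25, 40):
    print(epoch, step_decay(initial_lr, epoch))
# The learning rate is turned down as training proceeds, which helps settle into a
# minimum instead of oscillating or bouncing around saddle points.
```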
