Questions and Answers
What is the primary role of the activation function in a perceptron?
- To compute the dot product of the weight vector and input.
- To compute the bias term.
- To determine whether the perceptron gets activated based on the sum of weighted inputs. (correct)
- To normalize the input vector.
What is a key limitation of using a unit step function as an activation function in modern neural networks?
- It is ideal for handling nuanced decision boundaries.
- Its derivative is not defined, which makes it incompatible with gradient-based optimization. (correct)
- It captures the complexities and subtleties within datasets very effectively.
- It is computationally expensive.
Which of the following is NOT a key property of a good activation function?
- Linearity (correct)
- Differentiability
- Computational Efficiency
- Non-linearity
What is a key characteristic of sigmoid neurons?
What is a major limitation of perceptrons when used as binary classifiers?
According to the theory of universal approximation, what capability do neural networks gain by arranging neurons in layers and using non-linear activation functions?
What is a key challenge associated with using a single-layer neural network to approximate complex functions?
What is the primary purpose of an input layer in a one-hidden-layer neural network?
What is a major limitation of the perceptron learning algorithm when applied to multi-layer perceptron networks?
What is the primary motivation for using deeper neural networks rather than shallow networks?
What is the purpose of introducing non-linearity with activation functions?
If a deep neural network is 'inherently compositional', what does this imply about its ability to represent real-world data?
In the context of neural network architecture, what is meant by a 'fully connected' network?
How is the number of neurons determined in the input layer of a deep neural network designed for image processing?
What are some challenges in determining the number of hidden layers and neurons for a neural network?
In a multi-class classification problem using a neural network, what generally determines the number of neurons in the output layer?
Why is it general practice, for DNN multi-class classification, to have the number of output neurons correspond to the number of classes, with a softmax activation function?
What method can be used to determine the weights and biases (parameters) in a Deep Neural Network or Multi-Layer Neural Network?
In the context of training neural networks, what is meant by the term 'ERM Framework'?
DNN models depend on the number of inputs, the number of classes, and which other parameter?
In the context of the loss function for multi-class classification, what does the 'Categorical Cross-Entropy Loss' measure?
If the goal of the loss is to maximize the output probability of the correct class, what is a good choice for the output activation?
In the formulation of the ERM optimization problem, what does it mean to find a 'weight vector' and 'bias term'?
What is the significance of 'forward and backward propagations' in the context of training multi-layer networks?
What is the main purpose of Backward Propagation?
What three requirements are needed for backpropagation?
Based on how much data is used, how many main variants of gradient descent are there?
What distinguishes Stochastic Gradient Descent (SGD) from Batch Gradient Descent?
What is the best definition for "Batch Gradient Descent"?
How does mini-batch gradient descent compare with the other methods?
Does mini-batch gradient descent work better with small or large models, or does it give the same results?
How do you avoid getting trapped in suboptimal local minima or saddle points when using gradient descent?
Which action would you be expected to take when approaching the end of mini-batch learning?
When setting up a learning algorithm and automating the process, what kind of "rate" is desired?
What is an appropriate course of action if the error keeps getting worse or oscillates wildly?
The term "pixel784," what does it represent?
The term "pixel784," what does it represent?
Flashcards
Perceptron
A component of modern neural networks, often called neurons.
Activation function
Determines if a neuron gets activated based on the sum of weighted inputs.
Sigmoid neurons
A type of neuron using a sigmoid function to model output.
Linear separability limitation
Theory of Universal Approximation
Activation functions
Challenge of Single Layer Neural Network
Shallow Network concept
Limitations of Shallow Neural Network
Deep Neural Network
Deep networks are compositional
Multi-layer Fully Connected Neural Network
Number of neurons in the input layer
Selection of k and L
Number of neurons in the output layer
DNN Models vary
Categorical Cross-Entropy Loss
Target Representation
Perceptron Learning Algorithm
Error Propagation
Backpropagation
Batch Gradient Descent
Stochastic Gradient Descent
Mini-Batch Gradient Descent
Learning rate
Study Notes
- This lecture covers Artificial Neural Networks (ANNs), focusing on multi-layer neural networks: the limitations of perceptrons, deep neural networks, and training with gradient descent.
- The instructor is Siman Giri, the Module Leader for 6CS012.
Learning Objectives
- Review the limitations of perceptrons discussed in the previous week.
- Overcome the limitations of perceptrons and the Perceptron Learning Algorithm.
- Discuss Deeper Neural Networks (also called Multi-Layer Perceptrons).
- Understand the basics and architecture of Deep Neural Networks.
- Train a Deep Neural Network with Gradient Descent for multi-class classification problems.
Recap of Challenges with Simple Perceptrons
- Challenge 1: Automating feature extraction.
- High dimensionality results when using pixel values directly (e.g., 784 columns for 28x28 images; see the small sketch after this list).
- Datasets can become very large for larger images.
- Challenge 2: Non-linear decision boundaries.
- Logistic Regression and Softmax Regression are well suited to separating linearly separable data.
- Standard perceptrons fail when classes/labels are not linearly separable.
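As a quick illustration of Challenge 1, here is a minimal NumPy sketch (the all-zero image is a made-up placeholder) of how flattening a 28x28 grayscale image yields 784 input columns:

```python
import numpy as np

image = np.zeros((28, 28))   # a placeholder grayscale image (28 x 28 pixels)
x = image.reshape(-1)        # flatten into a single feature vector: pixel1 ... pixel784
print(x.shape)               # (784,) -> 784 columns when every pixel is used directly
```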
Perceptron and its Limitations
- The perceptron is a fundamental component of modern neural network architectures, often referred to as neurons.
- Perceptrons are often single hidden units connected to some input x.
- It computes the dot product of a weight vector w and the input x, adds a bias term, and passes the result through an activation function.
- The activation function determines if the perceptron/neuron activates based on the sum of weighted inputs.
- The original perceptron, which uses the unit step function as its activation function, is not ideal for most modern problems.
Why Not a Unit Step Function
- The unit step function is defined as f(z) = 1 if z ≥ 0, else f(z) = 0, where z = w₀ + w^T x.
- Unit step functions do not capture subtleties of the data and create a sharp boundary at 0.
- It implies all positive sums classify as class 1, and all negative sums classify as class 0.
- It does not accommodate cases where negative values can belong to class 1.
- The output is binary only, either 0 or 1, limiting nuanced decision boundaries.
- Small changes in the input do not affect the output, limiting the model's ability to learn complex patterns.
- The non-differentiability of the function is incompatible with gradient-based optimization, such as backpropagation in neural networks.
Activation Function
- Introduces non-linearity to the output of neurons.
- Good activation function properties include:
- Non-linearity to learn complex patterns and decision boundaries.
- Differentiability to enable gradient-based optimizations (i.e., gradient descent).
- Computational efficiency for real-time applications.
- Gradient behavior to avoid vanishing or exploding gradients.
Sigmoid Neurons
- A type of artificial neuron using the sigmoid activation function to model the output of the neuron.
- The sigmoid function is a smooth, S-shaped curve that maps any real-valued input to a value between 0 and 1.
- This makes it useful for problems involving probabilities or binary classification.
- The sigmoid activation function is defined as σ(z) = 1 / (1 + e^(-z)), where z = w₀ + Σᵢ wᵢxᵢ is the weighted sum of the inputs to the neuron.
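A minimal NumPy sketch (function names are illustrative) contrasting the unit step with the sigmoid: the sigmoid has the well-defined derivative σ(z)(1 − σ(z)) everywhere, which is what gradient-based optimization needs.

```python
import numpy as np

def unit_step(z):
    # Hard threshold: 1 if z >= 0, else 0. Its derivative is 0 wherever it exists
    # (and undefined at z = 0), so it gives no useful gradient signal.
    return np.where(z >= 0.0, 1.0, 0.0)

def sigmoid(z):
    # Smooth S-shaped curve mapping any real input to (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)): non-zero, so usable by backpropagation.
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-5.0, 5.0, 5)
print(unit_step(z))           # [0. 0. 1. 1. 1.] -- abrupt jump at 0
print(sigmoid(z))             # smooth values between 0 and 1
print(sigmoid_derivative(z))  # non-zero gradients everywhere
```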
Limitations of the Perceptron
- Perceptrons are binary classifiers that work well when data is linearly separable.
- Performance can be lower when data isn't linearly separable.
- Arranging perceptrons in multiple layers makes XOR and similar problems solvable.
- The "Theory of Universal Approximation" backs up the idea that multiple perceptrons/neurons in layers can learn non-linearly separable functions, such as XOR.
Theory of Universal Approximation
- States that by arranging neurons in layers and using nonlinear activation functions, neural networks can represent any non-linear function.
- With this in mind, neural networks become powerful for function approximation, pattern recognition, and deep learning applications.
- Any boolean function can be implemented using a one-hidden-layer perceptron network with 2^n neurons, where n is the number of inputs.
Challenges of Single Layer Neural Networks
- While a single-layer perceptron network can represent any function, real-world functions require many neurons for a good approximation, creating inefficiency.
- A single layer would require millions of neurons, making computation inefficient for image recognition (e.g., dog vs. cat classification).
- Multi-layer (aka deeper) neural networks are preferred in practice.
Architecture of One Hidden Layer Neural Network
- The input layer functions as a placeholder and involves no computation; the input is x = [x1, x2, x3, x4].
- The computation starts in the first hidden layer, propagating forward to the next layer.
- The hidden layer computation is z¹ = W¹x + b¹ and a¹ = f¹(z¹).
- W¹ is the weight matrix of the hidden layer, which has 4 neurons.
- Applying the activation function f¹, here the sigmoid (σ), yields the output a¹.
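A short NumPy sketch of this hidden-layer computation. The 4-dimensional input and the random weights are placeholders, not learned values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x  = np.array([0.5, -1.2, 3.0, 0.7])   # input x = [x1, x2, x3, x4]
W1 = rng.normal(size=(4, 4))           # hidden-layer weight matrix (4 neurons x 4 inputs)
b1 = np.zeros(4)                       # hidden-layer bias vector

z1 = W1 @ x + b1                       # pre-activation: z1 = W1 x + b1
a1 = sigmoid(z1)                       # activation:     a1 = f1(z1), with f1 = sigmoid
print(a1)                              # four values in (0, 1), one per hidden neuron
```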
Architecture of One Hidden Layer Neural Network - Output
- Neurons in the output layer can have different activation functions, depending on the classification task.
- Binary uses a sigmoid activation function, and multi-class uses a softmax activation function.
- For demonstration, the setup utilizes a single neuron with a sigmoid activation function.
Limitations of Perceptron Learning Algorithm
- Perceptron Learning Algorithm is designed for single-layer perceptrons or one neuron.
- It applies only to problems that are linearly separable.
- The algorithm adjusts weights iteratively based on the prediction error, updating when the model is incorrect.
- A key limitation of the perceptron learning algorithm is the inability to propagate errors to middle (hidden) layers: preventing it from solving nonlinear classification issues.
Towards Multi-Layer (Deep) Neural Networks
- A solution requires an updated learning algorithm that can feed inputs forward and propagate errors backward through the hidden layers to adjust the weights.
- This raises the question: can the Gradient Descent algorithm be used?
From Single Layer to Shallow Network
- Single Layer Neural Networks
- Networks with one hidden layer and up to 2^n neurons.
- They can theoretically represent any function, but require many neurons, making them computationally inefficient.
- Shallow Neural Network Concept
- Aims to avoid an exponentially large number of neurons by distributing the computation across multiple hidden layers with fewer neurons per layer.
- A shallow neural network typically consists of one or a few hidden layers, reducing the need for 2^n neurons in a single layer while still capturing complex patterns.
From Shallow to Deep Neural Network
- Limitations of Shallow Neural Networks
- Universality: While shallow multi-layer perceptrons (MLPs) can approximate any continuous function with enough neurons, the number of neurons required can be impractically large.
- Expressivity: A small number of neurons might not capture complex, non-linear decision boundaries needed for classification problems.
- Learnability: There's no guarantee that the learning method will find the optimal solution.
Deep Neural Networks (DNN)
- Definition: Any network with more than 2 hidden layers.
- Motivation:
- Deeper networks can represent more complex patterns.
- Statistical: inherently compositional; they combine simple patterns recursively.
- Computational and expressiveness: more expressive than shallow networks; they learn more complex patterns with the same total number of parameters.
Understanding the Architecture of DNN
- A multi-layer fully connected neural network has a weight on every edge.
- Every neuron in one layer is connected to every neuron in the next layer.
- Each neuron computes a weighted sum and applies an activation function.
Number of Neurons
- The number of neurons in the input layer is determined by the size of the input image.
- The input layer is just a placeholder; for MNIST, with images of shape (28 × 28 × 1), it has 784 neurons (28 × 28 × 1 = 784).
- The number of hidden layers L and the number of neurons per layer k are hyperparameters chosen by the designer.
- Increasing the number of neurons improves capacity but can cause overfitting.
- Decreasing the number of neurons can cause underfitting.
Output Layer Number of Neurons
- Depends on the classification task and the activation function used.
- Binary classification: 1 neuron, with a sigmoid activation function applied.
- Multi-class classification: the number of neurons equals the number of classes, with a softmax activation function applied.
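A small NumPy sketch of the two output-layer choices; the logits are made-up numbers for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()          # outputs are non-negative and sum to 1

# Binary classification: a single output neuron with a sigmoid.
print(sigmoid(0.8))                   # probability of the positive class

# Multi-class classification: one output neuron per class with a softmax.
logits = np.array([2.0, 0.5, -1.0])   # e.g. a 3-class problem
print(softmax(logits))                # a probability distribution over the classes
```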
Training the DNN
- Key Question: How to learn the weights and biases in Deep Neural Networks (DNNs)?
- Training is done via an Empirical Risk Minimization (ERM) Framework.
- Requires a Decision Process, a Loss Function, the Empirical Risk, and an Optimization method for training.
Defining the DNN as a Model
- Unlike traditional machine learning models, a DNN model is a mathematical representation of a network with multiple layers of neurons.
- The models vary in the number of layers, number of neurons and general structure.
- The architecture mainly depends on the nature of the task and the nature of the data.
- For MNIST digit classification, the first layer has input placeholders, hidden layers can have 64, 128 and 512 neurons, and the output layer has 10 neurons for prediction.
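The lecture does not tie the architecture to a specific framework, but as a hedged sketch, the MNIST model described above could be written in Keras roughly as follows. The hidden-layer sizes (512, 128, 64) and the 10-neuron softmax output follow the study notes; the sigmoid hidden activations, the SGD optimizer, and everything else are illustrative choices, not the instructor's exact setup.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                       # input layer: placeholder for flattened 28x28x1 images
    tf.keras.layers.Dense(512, activation="sigmoid"),   # hidden layers; sizes taken from the notes
    tf.keras.layers.Dense(128, activation="sigmoid"),
    tf.keras.layers.Dense(64, activation="sigmoid"),
    tf.keras.layers.Dense(10, activation="softmax"),    # output layer: one neuron per digit class
])

model.compile(optimizer="sgd",                  # a gradient-descent variant
              loss="categorical_crossentropy",  # the loss used in the ERM formulation below
              metrics=["accuracy"])
model.summary()
```

ReLU hidden activations are the more common modern choice; sigmoid is used here only to mirror the sigmoid-neuron discussion earlier in the notes.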
Training the DNN within the ERM Framework
- Objective of DNNs with Softmax Output
- Minimize the average loss over the training dataset, which can be formulated as L(W, b) = −(1/n) Σᵢ Σₖ yᵢₖ log(ŷᵢₖ).
- The goal is to find the weights W and biases b that minimize this average log loss over the data.
Loss Function: Categorical Cross Entropy
- Categorical Cross-Entropy Loss measures how well the predicted probability distribution matches the true class labels in classification tasks.
- It is used in multi-class classification problems.
- Goal: To maximize the probability of the correct class.
- For a single sample xᵢ with true label yᵢ and predicted probabilities ŷᵢ: ℓ(yᵢ, ŷᵢ) = −Σₖ yᵢₖ log(ŷᵢₖ)
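A minimal NumPy sketch of this per-sample loss, with a one-hot target and made-up predicted probabilities:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # l(y, y_hat) = -sum_k y_k * log(y_hat_k); eps guards against log(0).
    return -np.sum(y_true * np.log(y_pred + eps))

y_true = np.array([0.0, 1.0, 0.0])   # one-hot target: the true class is class 2
y_pred = np.array([0.1, 0.7, 0.2])   # predicted probabilities (e.g. softmax output)
print(categorical_cross_entropy(y_true, y_pred))   # -log(0.7) ~= 0.357

# Average over a tiny made-up batch, matching L(W, b) = -(1/n) * sum_i sum_k y_ik log(y_hat_ik)
Y_true = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
Y_pred = np.array([[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]])
print(np.mean([categorical_cross_entropy(t, p) for t, p in zip(Y_true, Y_pred)]))
```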
Computing Gradient
- The weights in multi-layer networks are learned through a combination of forward and backward propagation.
- Forward propagation computes the activations; backward propagation determines the errors used to change the weights.
Backpropagation
- Backpropagation is the technique deep networks use to determine the network's error by computing outputs and comparing them with the targets.
- It has three requirements: a dataset, a feedforward network, and an error (loss) function.
- The dataset is fed forward through the network, and the resulting outputs are compared with the targets in the error function.
- The derivatives are then calculated by propagating from the output layer back toward the input, and the weights are updated accordingly.
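A compact NumPy sketch of one forward pass and one backward pass for a one-hidden-layer network with a sigmoid hidden layer, a softmax output, and categorical cross-entropy, following the scheme above. The batch, the layer sizes, and the initial weights are illustrative placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, h, c = 8, 4, 5, 3                          # samples, inputs, hidden neurons, classes
X = rng.normal(size=(n, d))
Y = np.eye(c)[rng.integers(0, c, size=n)]        # one-hot targets

W1, b1 = 0.1 * rng.normal(size=(d, h)), np.zeros(h)
W2, b2 = 0.1 * rng.normal(size=(h, c)), np.zeros(c)

# Forward propagation: compute the activations layer by layer.
Z1 = X @ W1 + b1
A1 = sigmoid(Z1)
Z2 = A1 @ W2 + b2
Y_hat = softmax(Z2)
loss = -np.mean(np.sum(Y * np.log(Y_hat + 1e-12), axis=1))

# Backward propagation: push the error back to get a gradient for every weight and bias.
dZ2 = (Y_hat - Y) / n            # gradient of the average loss w.r.t. the output pre-activation
dW2 = A1.T @ dZ2
db2 = dZ2.sum(axis=0)
dA1 = dZ2 @ W2.T                 # error propagated to the hidden layer
dZ1 = dA1 * A1 * (1.0 - A1)      # chain rule through the sigmoid
dW1 = X.T @ dZ1
db1 = dZ1.sum(axis=0)

# One gradient-descent update with learning rate eta.
eta = 0.1
W1 -= eta * dW1; b1 -= eta * db1
W2 -= eta * dW2; b2 -= eta * db2
print(loss)
```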
Variants of Gradient Descent
There are three main types, depending on how much data is used for each update:
- Batch gradient descent
- Stochastic gradient descent
- Mini-batch (stochastic) gradient descent
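The three variants differ only in how much data feeds each parameter update. A generic sketch follows; the `gradient_descent` helper, the `compute_gradients` callback, and the least-squares demo at the end are illustrative stand-ins, not code from the lecture.

```python
import numpy as np

def gradient_descent(params, X, Y, compute_gradients, eta=0.1, batch_size=None, epochs=10, seed=0):
    """batch_size=None     -> batch gradient descent (whole dataset per update)
       batch_size=1        -> stochastic gradient descent (one sample per update)
       1 < batch_size < n  -> mini-batch gradient descent (small random subsets per update)"""
    rng = np.random.default_rng(seed)
    n = len(X)
    size = n if batch_size is None else batch_size
    for _ in range(epochs):
        order = rng.permutation(n)                    # shuffle the data each epoch
        for start in range(0, n, size):
            idx = order[start:start + size]
            grads = compute_gradients(params, X[idx], Y[idx])
            for name in params:
                params[name] -= eta * grads[name]     # gradient-descent step
    return params

# Tiny demo on linear least squares (made-up data): gradient of the mean squared error.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
Y = X @ true_w + 0.01 * data_rng.normal(size=100)

def mse_gradients(params, Xb, Yb):
    err = Xb @ params["w"] - Yb
    return {"w": 2.0 * Xb.T @ err / len(Xb)}

params = gradient_descent({"w": np.zeros(3)}, X, Y, mse_gradients, eta=0.1, batch_size=16, epochs=50)
print(params["w"])   # approaches [1.0, -2.0, 0.5]
```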
Common Issues with Gradient Descent
- Same learning rate applies to all parameter updates.
- Getting trapped into suboptimal local minima or saddle points.
- Choosing a proper learning rate is difficult.
- Learning-rate schedules that reduce the rate over time have to be set in advance and cannot adapt to the characteristics of the dataset.
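One common remedy for a fixed global learning rate is a schedule that reduces it over time, e.g. step decay. A tiny illustrative sketch (the constants are arbitrary):

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    # Halve the learning rate every `epochs_per_drop` epochs.
    return initial_lr * (drop ** (epoch // epochs_per_drop))

for epoch in (0, 9, 10, 25, 40):
    print(epoch, step_decay(0.1, epoch))
# 0 -> 0.1, 9 -> 0.1, 10 -> 0.05, 25 -> 0.025, 40 -> 0.00625
```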