Questions and Answers
According to Mitchell's definition, what is essential for a computer program to be considered as 'learning'?
- The capacity to store and retrieve large amounts of data.
- Improvement in performance at a task T, as measured by P, with experience E. (correct)
- The ability to solve complex mathematical problems.
- The capability to mimic human behavior.
What distinguishes deep learning from traditional machine learning?
- Deep learning is primarily used for data storage and retrieval.
- Deep learning uses simpler algorithms that are easier to train.
- Deep learning extracts patterns from data using neural networks with multiple layers. (correct)
- Deep learning relies solely on explicitly programmed rules.
How do deep neural networks handle complex tasks compared to shallow networks, assuming they express the same function?
- Deeper networks and shallow networks require the same amount of neurons.
- Deeper networks do not have the capacity to handle complex tasks.
- Deeper networks require exponentially more neurons than shallow networks.
- Deeper networks typically require exponentially fewer neurons than shallow networks. (correct)
Which of the following could negatively affect the ability of a training algorithm to learn a function, even if a large Multi-Layer Perceptron (MLP) is capable of representing that function?
What is a key characteristic of a feed-forward neural network architecture?
What does the Universal Approximation Theorem for neural networks mainly imply?
Which factor primarily contributes to the current effectiveness of deep learning?
What is the role of Keras in the context of TensorFlow?
What is the purpose of an activation function in a neural network?
Why are non-linear activation functions necessary in deep neural networks?
What is the main function of a loss function in machine learning?
In the context of a classification problem, what does the Cross-Entropy Loss measure?
What is the primary role of the Softmax function in neural networks?
During backpropagation, what is the main purpose of computing the gradient of the loss function with respect to the weights?
Why does the backpropagation algorithm require that the loss function be continuous and differentiable?
Which of the following is a typical step in the backpropagation algorithm after the initialization of weights?
In the context of training neural networks, what is the purpose of gradient descent?
What characterizes the 'stochastic' aspect of stochastic gradient descent (SGD)?
What is a potential problem with gradient descent, especially in deep networks, that adaptive learning rules aim to address?
What is the main issue with using Stochastic Gradient Descent (SGD) on non-convex loss functions?
Why is initializing weights to small random values important when training feedforward neural networks using SGD?
What is the key difference between batch mode and sequential mode (on-line) gradient descent?
What does 'regularization' primarily aim to prevent in the context of machine learning?
How does dropout regularization work in neural networks?
What is one of the main benefits of using dropout regularization in neural networks?
In the context of neural network training, what is 'early stopping' used for?
Which of the following network architectures is best suited for processing sequential data, like time series?
If you want to generate new data that resembles your training dataset, which unsupervised learning architecture could be useful?
Which type of neural network architecture is often used for tasks that involve making decisions or taking actions in an environment to maximize a reward?
Which of the following are valid reasons why deep learning is so popular now?
What qualities should a neural network have?
What activation function is defined as $g(z) = \max(0, z)$?
What functions does Backpropagation compose?
What is the meaning of $W \leftarrow W - \eta \frac{dJ(W)}{dW}$?
What factors could affect the rate at which learning occurs?
What is the sequential mode of training also known as?
What is batch mode?
If a model does not have enough capacity to fully learn the data, what is this called?
What is the ideal initialization of biases for SGD?
What do you call the process of modifying a learning algorithm so as to prevent overfitting?
Flashcards
ML Algorithm
An ML algorithm is able to learn from data.
Deep Learning
An ML technique that employs deep neural networks.
A Deep Neural Network
A multi-layered neural network that contains two or more hidden layers.
Expressibility
The class of functions a neural network can express.
Efficiency in NNs
The resources required to approximate a given function.
Learnability in NNs
How fast a neural network learns good parameters for approximating a function.
Loss Function
Quantifies the gap between a prediction and the ground truth.
Backpropagation Algorithm
Finds a minimum of the loss function in weight space by gradient descent.
Universal Approximation Theorem
A feedforward neural network with a single hidden layer can approximate any continuous function to arbitrary precision.
Robustness in Neural Networks
Avoiding excessive inter-dependencies between nodes, so the network still performs the task when some nodes are lost.
Regularization
Modifying a learning algorithm to prevent overfitting.
Compute Gradient
The backpropagation step that computes the gradient of the loss function with respect to the weights.
Update Weights
The gradient-descent step that adjusts the weights in the direction that reduces the loss.
Backpropagation
Applies the chain rule to compute the derivative of the loss function with respect to each weight.
Dropout Regularization
Randomly drops units from the neural network during training.
Convex optimization algorithms
Optimization algorithms with global convergence guarantees, used for logistic regression or SVMs.
Dropout
A training strategy that ignores a fraction of hidden neurons in each iteration, setting their activations to zero.
Machine Learning
Learning from experience E with respect to task T and performance measure P, such that performance at T, as measured by P, improves with E.
Algorithm Stoppage
Training is stopped once the error function value is sufficiently small.
Mean squared error
The loss function typically used for regression problems.
Study Notes
- Deep Learning 1 is taught by Dr. Shabnam N. Kadir at the University of Hertfordshire; these notes correspond to the lecture of March 9, 2025.
References
- Some useful links for further study include:
- https://d2l.ai/
- https://www.deeplearningbook.org/
- https://machinelearningmastery.com/inspirational-applications-deep-learning/
- https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
- https://playground.tensorflow.org/ (Tinker with a neural network)
Introduction to Tensorflow
- Python and Tensorflow will be used for implementation.
- Further information can be found at:
- https://www.tensorflow.org/overview
- https://blog.tensorflow.org/2019/02/introducing-tensorflow-datasets.html
- Jupyter notebooks will be used.
- Useful link: https://colab.research.google.com/notebooks/welcome.ipynb
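- As a quick sanity check of the toolchain, here is a minimal, hypothetical Keras example (random toy data and arbitrary layer sizes, purely for illustration):

```python
# Minimal Keras sketch: a tiny classifier trained on random data,
# just to confirm the toolchain works (e.g. inside a Colab notebook).
import numpy as np
import tensorflow as tf

X = np.random.rand(100, 4).astype("float32")   # 100 toy examples, 4 features
y = np.random.randint(0, 3, size=(100,))       # 3 arbitrary classes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=16, verbose=0)
```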
Practical Overview
- The practicals cover neural networks, optimization theory (loss functions, gradients), and implementation in a programming language such as Python.
Machine Learning Algorithms
- Machine Learning (ML) algorithm is able to learn from data.
- Mitchell (1997) defined a computer program as learning from experience E, with respect to task T and performance measure P, if its performance at task T improves with experience E, as measured by P.
- ML allows for solutions to complex tasks that are hard to solve with fixed, human-designed programs.
What is Deep Learning?
- Deep learning extracts patterns from data using neural networks.
Deep Learning
- Deep learning is a machine learning technique using deep neural networks.
- Deep neural networks are multi-layered, containing two or more hidden layers.
- The weights of these networks must be adjusted to minimize a loss/cost/error function.
Neurons and the Perceptron
- Diagram of the basic mathematical model of a neuron depicting inputs, weights, summation, and output.
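- In code, the same model is only a few lines of numpy; the input, weight, and bias values below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.4, 0.1, -0.6])   # weights
b = 0.2                          # bias

z = np.dot(w, x) + b             # summation: weighted sum of inputs plus bias
output = sigmoid(z)              # activation applied to the sum gives the output
```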
Feed-forward Neural Network Architecture
- Key feature: no feedback connections where outputs are fed back into the network.
- Object recognition is an important application.
- Convolutional Neural Networks (CNNs) are a specialized type of feed-forward neural network inspired by the visual system of the brain.
Deep Neural Networks
- Deep neural networks pass an input through successive layers, extracting increasingly abstract features of the input.
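- For concreteness, a sketch of such a network in Keras (layer sizes are arbitrary; the input size 784 assumes flattened 28×28 images):

```python
import tensorflow as tf

# A feed-forward network (no feedback connections) with two hidden layers,
# i.e. a "deep" network by the definition above.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                     # e.g. a flattened 28x28 image
    tf.keras.layers.Dense(64, activation="relu"),     # hidden layer 1
    tf.keras.layers.Dense(32, activation="relu"),     # hidden layer 2
    tf.keras.layers.Dense(10, activation="softmax"),  # output layer
])
```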
Why Now?
- The core algorithms to train deep neural networks have existed for decades.
Big Data Impact
- The availability of larger datasets facilitates deep learning.
- Easier collection and storage of data are enablers.
Hardware Advancements
- Graphics Processing Units (GPUs) provide the necessary computational power.
- GPUs allow for massively parallelizable computations.
Software Improvements
- There are now improved techniques, new models, and toolboxes.
Universal Approximation Theorem
- A feedforward neural network with a single hidden layer can approximate any continuous function to arbitrary precision (Hornik 1989, Cybenko 1989).
Universal Approximation Theorem: Implications
- A large Multi-Layer Perceptron (MLP) can represent a wide range of functions, according to the universal approximation theorem.
- Achieving learnability with an MLP is not guaranteed and can fail for a few reasons.
- Training optimization algorithms might fail to find the correct parameter values.
- The chosen training algorithm might result in overfitting.
The Unreasonable Effectiveness of Deep Learning
- A shallow network has few layers and many neurons per layer, which makes it computationally intensive.
- A deep network has many layers and relatively few neurons per layer achieving high levels of abstraction.
Quality of a Neural Network
- The quality of a neural network depends on expressibility, efficiency and learnability.
- Expressibility is the class of functions the neural network can express.
- Efficiency is the resources required to approximate a given function.
- Learnability refers to how fast the neural network learns good parameters for approximation.
The Unreasonable Effectiveness of Deep Learning
- Deeper neural networks often require fewer neurons than shallow networks to express the same function.
Choice of Activation Functions
- Activation function choices include Sigmoid, Hyperbolic Tangent or ReLU.
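- For reference, the three activation functions written out in numpy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)   # g(z) = max(0, z)
```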
Why Non-Linear Activation Functions?
- Linear activation functions produce linear decisions, regardless of network size.
- Non-linearities enable the approximation of arbitrarily complex functions.
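- A tiny numerical illustration of the first point: composing two purely linear layers collapses into a single linear layer, so depth alone adds no expressive power (shapes and values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # "layer 1" weights (no activation)
W2 = rng.standard_normal((2, 4))   # "layer 2" weights (no activation)
x = rng.standard_normal(3)

# Two stacked linear layers are equivalent to one linear layer:
deep_linear = W2 @ (W1 @ x)
single_linear = (W2 @ W1) @ x
assert np.allclose(deep_linear, single_linear)
```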
Loss Functions
- Loss functions quantify the gap between prediction and ground truth.
- For Regression use Mean Squared Error (MSE).
- For classification use Cross Entropy Loss.
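- A minimal numpy sketch of both losses, using toy numbers:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error, typically used for regression.
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true_onehot, y_prob):
    # Cross-entropy loss over predicted class probabilities (classification).
    eps = 1e-12                  # avoids log(0)
    return -np.mean(np.sum(y_true_onehot * np.log(y_prob + eps), axis=1))

print(mse(np.array([1.0, 2.0]), np.array([1.5, 1.0])))                    # 0.625
print(cross_entropy(np.array([[0, 1, 0]]), np.array([[0.2, 0.7, 0.1]])))  # ~0.357
```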
Softmax Function
- It converts a set of outputs into a probability distribution.
- Typically utilized in the final layer of neural network classifiers.
- Formula: $y_i = \frac{e^{h_i}}{\sum_j e^{h_j}}$
- Good reference: https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e
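- A direct implementation of the formula (subtracting the maximum is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(h):
    e = np.exp(h - np.max(h))   # subtracting max(h) avoids overflow, result unchanged
    return e / np.sum(e)        # y_i = e^{h_i} / sum_j e^{h_j}

scores = np.array([2.0, 1.0, 0.1])   # arbitrary output-layer values
probs = softmax(scores)              # non-negative and sums to 1
```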
Backpropagation
- The backpropagation algorithm finds the minimum of the loss function in weight space.
- Method of gradient descent.
- Minimizing the loss function with an appropriate combination of weights can be considered a solution.
- Computation of the gradient of the loss function at each iteration requires that the loss function be continuous and differentiable.
Backpropagation: Function Composition Algorithm
- Decompose the algorithm into the following steps after random initialization of weights:
- Feed-forward computation.
- Backpropagation to the output layer.
- Backpropagation to each hidden layer.
- Weight updates.
- The algorithm is stopped once the error function value is sufficiently small.
Backpropagation Algorithm
- Initialize weights randomly.
- Loop until convergence.
- Compute gradient.
- Update weights.
- Return weights.
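- A minimal sketch of this loop using TensorFlow's automatic differentiation; the toy data, tiny model, and fixed iteration count stand in for a real convergence test:

```python
import tensorflow as tf

# Toy data and a tiny linear model, purely to illustrate the loop structure.
X = tf.random.normal((32, 3))
y = tf.random.normal((32, 1))
model = tf.keras.Sequential([tf.keras.Input(shape=(3,)), tf.keras.layers.Dense(1)])
loss_fn = tf.keras.losses.MeanSquaredError()
eta = 0.01                                                  # learning rate

for step in range(100):                                     # "loop until convergence"
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(X))                         # feed-forward computation
    grads = tape.gradient(loss, model.trainable_variables)  # compute gradient
    for w, g in zip(model.trainable_variables, grads):      # update weights
        w.assign_sub(eta * g)
```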
Backpropagation: Chain Rule
- The input sum of neuron k in layer l is the weighted sum of the outputs of the neurons j in the previous layer.
- Apply an activation function (σ, ReLU, etc.) to the weighted sum.
- Carry this calculation through each subsequent layer.
- Computing the derivative of the loss function with respect to each weight requires use of the Chain Rule.
- Update the weights through gradient descent (a worked sketch follows below).
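- A worked numpy example of the chain rule on a tiny one-hidden-layer network (sigmoid hidden units, one linear output, squared-error loss; all sizes and values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal(3)                           # one input example
y = 1.0                                              # its target value
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)    # hidden layer parameters
W2, b2 = rng.standard_normal(4), 0.0                 # output layer parameters

# Forward pass: the network is a composition of layer functions.
z1 = W1 @ x + b1
a1 = sigmoid(z1)
y_hat = W2 @ a1 + b2
loss = 0.5 * (y_hat - y) ** 2

# Backward pass: apply the chain rule from the loss back towards the input.
d_yhat = y_hat - y                   # dL/dy_hat
dW2 = d_yhat * a1                    # dL/dW2
db2 = d_yhat                         # dL/db2
d_a1 = d_yhat * W2                   # dL/da1
d_z1 = d_a1 * a1 * (1 - a1)          # through the sigmoid: da1/dz1 = a1 * (1 - a1)
dW1 = np.outer(d_z1, x)              # dL/dW1
db1 = d_z1                           # dL/db1

eta = 0.1                            # gradient-descent update of every parameter
W2 -= eta * dW2; b2 -= eta * db2
W1 -= eta * dW1; b1 -= eta * db1
```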
Gradient Descent Intuition
- Optimization is achieved through gradient descent.
Gradient Descent, Chain Rule
- Helpful link: https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
Vanishing/Exploding Gradient Problems
- Gradients can either explode or vanish, which poses challenges.
Gradient Descent
- Loss functions can be difficult to optimize.
- The non-linearity of activation functions causes most interesting loss functions to become non-convex.
Adaptive Learning Rules
- Learning rates are not fixed but can be adjusted based on:
- Size of the gradient.
- Size of particular weights.
- How fast learning is occurring.
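- Adam is one widely used adaptive rule (named here as an example, not taken from the slides); in Keras it can be selected when compiling a model:

```python
import tensorflow as tf

# Adam adapts a per-parameter step size from the history of gradients,
# instead of using one fixed global learning rate.
model = tf.keras.Sequential([tf.keras.Input(shape=(3,)), tf.keras.layers.Dense(1)])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
```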
SGD and Non-Convexity
- Convex optimization algorithms with global convergence guarantees are used for logistic regression or SVMs.
- Stochastic gradient descent (SGD) applied to non-convex loss functions has no such convergence guarantee and is sensitive to the values of the initial parameters.
- SGD is only guaranteed to converge at a local minimum.
- Overfitting can be a problem.
SGD initialization of weights
- For feedforward neural networks, initialize all weights to small random values.
- Initialize biases to zero or to small positive values.
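- In Keras this can be stated explicitly when building a layer (the stddev below is illustrative; Keras also provides sensible defaults such as Glorot initialization):

```python
import tensorflow as tf

# Small random weights, zero biases.
layer = tf.keras.layers.Dense(
    units=32,
    kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01),
    bias_initializer="zeros",
)
```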
Gradient Descent: Sequential vs. Batch Modes
- Sequential training mode is also known as on-line, pattern, or stochastic mode, where weights are updated after each example.
- The term "stochastic" refers to the gradient (based on a single training sample) being a "stochastic approximation" of the "true" cost gradient.
- Batch mode updates weights only after the complete presentation of the training set during each sweep or epoch and is impractical for very large datasets.
- Further details: https://machinelearningmastery.com/how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent/
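- With Keras, the update granularity is controlled by the batch_size argument of model.fit; the snippet below assumes a compiled model and training arrays already exist:

```python
# Assumes a compiled `model` and arrays `X_train`, `y_train` already exist.
model.fit(X_train, y_train, batch_size=1, epochs=10)             # sequential / on-line / stochastic mode
model.fit(X_train, y_train, batch_size=len(X_train), epochs=10)  # batch mode: one update per epoch
model.fit(X_train, y_train, batch_size=32, epochs=10)            # mini-batch: the usual compromise
```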
Regularization
- Modifies a learning algorithm to prevent overfitting.
Dropout Regularization
- Randomly drops units from the neural network during training.
- Dropout is a training strategy that ignores a fraction of hidden neurons during training: their weights are not updated and their activations are set to zero.
- Each iteration drops a different set of neurons.
- Reference: Srivastava et al. 2014
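- In Keras, dropout is applied as a layer between two other layers; the rate and layer sizes below are illustrative:

```python
import tensorflow as tf

# Dropout as a layer: during training, 50% of the preceding layer's activations
# are randomly zeroed each iteration; at inference time dropout is switched off.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```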
Benefits of Dropout Regularization
- It forces networks not to rely on any one node, discouraging memorization.
- Robustness: prevents excessive inter-dependencies from emerging between nodes, which allows the network to learn more robust relationships.
- Similar to brain function where losing a few neurons still allows task completion.
- Computationally cheaper (time & storage) than averaging a committee of networks.
Regularization: Early Stopping
- Stopping training when performance on a validation set starts to degrade.
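- A sketch of early stopping with the Keras EarlyStopping callback; the model, training data, and validation data are assumed to exist already:

```python
import tensorflow as tf

# Assumes `model`, `X_train`, `y_train`, `X_val`, `y_val` already exist.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",            # watch validation performance
    patience=5,                    # stop after 5 epochs without improvement
    restore_best_weights=True)     # roll back to the best weights seen

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=200,
          callbacks=[early_stop])
```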
Architectural Paradigms
- Common neural network architectures include feedforward networks, convolutional networks, recurrent networks, autoencoders, generative adversarial networks, and networks for actions, values, policies and models.
- Useful reference: https://blog.tensorflow.org/2019/02/mit-deep-learning-basics-introduction-tensorflow.html