Deep Learning 1 with TensorFlow

Questions and Answers

According to Mitchell's definition, what is essential for a computer program to be considered as 'learning'?

  • The capacity to store and retrieve large amounts of data.
  • Improvement in performance at a task T, as measured by P, with experience E. (correct)
  • The ability to solve complex mathematical problems.
  • The capability to mimic human behavior.

What distinguishes deep learning from traditional machine learning?

  • Deep learning is primarily used for data storage and retrieval.
  • Deep learning uses simpler algorithms that are easier to train.
  • Deep learning extracts patterns from data using neural networks with multiple layers. (correct)
  • Deep learning relies solely on explicitly programmed rules.

How do deep neural networks handle complex tasks compared to shallow networks, assuming they express the same function?

  • Deeper networks and shallow networks require the same amount of neurons.
  • Deeper networks do not have the capacity to handle complex tasks.
  • Deeper networks require exponentially more neurons than shallow networks.
  • Deeper networks typically require exponentially fewer neurons than shallow networks. (correct)

Which of the following could negatively affect the ability of a training algorithm to learn a function, even if a large Multi-Layer Perceptron (MLP) is capable of representing that function?

  • The optimization algorithm struggling to find suitable parameter values. (correct)

What is a key characteristic of a feed-forward neural network architecture?

  • It has no feedback connections; information flows in one direction. (correct)

What does the Universal Approximation Theorem for neural networks mainly imply?

  • A feedforward neural network with a single layer can approximate any continuous function to arbitrary precision. (correct)

Which factor primarily contributes to the current effectiveness of deep learning?

  • The availability of larger datasets, hardware improvements, and software tools. (correct)

What is the role of Keras in the context of TensorFlow?

  • Keras serves as a streamlined API to simplify building deep learning systems on top of TensorFlow. (correct)

What is the purpose of an activation function in a neural network?

  • To introduce non-linearity, enabling the network to learn complex patterns. (correct)

Why are non-linear activation functions necessary in deep neural networks?

  • To allow the approximation of arbitrarily complex functions. (correct)

What is the main function of a loss function in machine learning?

  • To quantify the gap between the model's predictions and the actual ground truth. (correct)

In the context of a classification problem, what does the Cross-Entropy Loss measure?

  • The difference between the predicted probability distribution and the true distribution of classes. (correct)

What is the primary role of the Softmax function in neural networks?

  • To convert a set of outputs into a probability distribution. (correct)

During backpropagation, what is the main purpose of computing the gradient of the loss function with respect to the weights?

  • To determine the direction and magnitude for updating weights to reduce the loss. (correct)

Why does the backpropagation algorithm require that the loss function be continuous and differentiable?

  • To enable the computation of the derivative (gradient) for weight updates. (correct)

Which of the following is a typical step in the backpropagation algorithm after the initialization of weights?

  • Feed-forward computation to calculate the loss. (correct)

In the context of training neural networks, what is the purpose of gradient descent?

  • To find the minimum of the loss function in weight space. (correct)

What characterizes the 'stochastic' aspect of stochastic gradient descent (SGD)?

  • The gradient calculated from a single training sample is a 'stochastic approximation' of the true cost gradient. (correct)

What is a potential problem with gradient descent, especially in deep networks, that adaptive learning rules aim to address?

  • Vanishing or exploding gradients. (correct)

What is the main issue with using Stochastic Gradient Descent (SGD) on non-convex loss functions?

  • It may not converge to a global minimum and is sensitive to initial parameters. (correct)

Why is initializing weights to small random values important when training feedforward neural networks using SGD?

  • To ensure faster convergence and avoid saturation of neurons. (correct)

What is the key difference between batch mode and sequential mode (on-line) gradient descent?

  • Batch mode updates weights after processing the complete training set, while sequential mode updates after each example. (correct)

What does 'regularization' primarily aim to prevent in the context of machine learning?

  • Overfitting. (correct)

How does dropout regularization work in neural networks?

  • By randomly dropping out (ignoring) a proportion of neurons during training. (correct)

What is one of the main benefits of using dropout regularization in neural networks?

  • It reduces the risk of overfitting by preventing excessive inter-dependencies between nodes. (correct)

In the context of neural network training, what is 'early stopping' used for?

  • To stop training when the model starts to overfit, based on performance on a validation set. (correct)

Which of the following network architectures is best suited for processing sequential data, like time series?

  • Recurrent Neural Networks. (correct)

If you want to generate new data that resembles your training dataset, which unsupervised learning architecture could be useful?

  • Generative Adversarial Network (GAN). (correct)

Which type of neural network architecture is often used for tasks that involve making decisions or taking actions in an environment to maximize a reward?

  • Networks for Actions, Values, Policies, and Models (Reinforcement Learning). (correct)

Which of the following are valid reasons why deep learning is so popular now?

  • Larger datasets, hardware improvements, improved techniques, and new models. (correct)

What qualities should a neural network have?

  • Expressibility, efficiency, and learnability. (correct)

What activation function could be defined as $g(z) = \max(0, z)$?

  • Rectified Linear Unit (ReLU). (correct)

What mathematical structure does backpropagation exploit when computing gradients?

  • Function composition. (correct)

What is the meaning of $W \leftarrow W - \eta \frac{dJ(W)}{dW}$?

  • Optimization through gradient descent. (correct)

What factors could affect the rate at which learning occurs?

  • All of the above: the size of the gradient, the size of particular weights, and how fast learning is occurring. (correct)

What is the sequential mode of training also known as?

  • Online, pattern, or stochastic mode. (correct)

What is batch mode?

  • Weights are updated only after the complete presentation of the training set (one epoch). (correct)

If a model does not have capacity to fully learn the data, what is this called?

  • Underfitting. (correct)

What is the ideal initialization of biases for SGD?

  • Zero or small positive values. (correct)

What do you call the process of modifying a learning algorithm so as to prevent overfitting?

  • Regularization. (correct)

Flashcards

ML Algorithm

An ML algorithm is able to learn from data.

Deep Learning

An ML technique that employs deep neural networks.

A Deep Neural Network

A multi-layered neural network that contains two or more hidden layers.

Expressibility

What class of functions can the neural network express?

Efficiency in NNs

How many resources (neurons, parameters, etc.) the neural network requires to approximate a given function?

Learnability in NNs

How rapidly does the neural network learn good parameters for approximating a function?

Loss Function

A function that quantifies the gap between prediction and ground truth.

Backpropagation Algorithm

Looks for the minimum of the loss function in weight space using gradient descent.

Universal Approximation Theorem

A feedforward neural network with a single layer is sufficient to approximate any continuous function.

Robustness in Neural Networks

Technique to prevent excessive inter-dependencies between nodes, so that the network learns more robust relationships.

Regularization

Refers to the process of modifying a learning algorithm to prevent overfitting.

Compute Gradient

Algorithm step that calculates the gradient of the loss function with respect to the weights.

Update Weights

Update the weights of the network based on the computed gradient and learning rate.

Backpropagation

Algorithm to find the minimum of the loss function in weight space.

Dropout Regularization

Technique where units are randomly dropped from the neural network during training.

Convex optimization algorithms

Optimization algorithms with global convergence guarantees, used to train logistic regression or SVMs.

Dropout

A training strategy that randomly ignores a proportion of the hidden neurons when training the weights, setting their activations to zero.

Machine Learning

A field of algorithms able to solve tasks that are too difficult to solve with fixed programs written and designed by humans.

Algorithm Stoppage

Stops when the value of the error function has become sufficiently small.

Mean squared error

Loss function that measures the average squared difference between the estimated values and the actual values.

Study Notes

  • Deep Learning 1 is taught by Dr. Shabnam N. Kadir at the University of Hertfordshire on March 9, 2025.

References

  • Some useful links for further study include:
    • https://d2l.ai/
    • https://www.deeplearningbook.org/
    • https://machinelearningmastery.com/inspirational-applications-deep-learning/
    • https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
    • https://playground.tensorflow.org/ (Tinker with a neural network)

Introduction to Tensorflow

  • Python and Tensorflow will be used for implementation.
  • Further information can be found at:
    • https://www.tensorflow.org/overview
    • https://blog.tensorflow.org/2019/02/introducing-tensorflow-datasets.html
  • Jupyter notebooks will be used.
  • Useful link: https://colab.research.google.com/notebooks/welcome.ipynb
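
A minimal, illustrative Keras model on top of TensorFlow (the layer sizes, input shape, and optimizer choice here are placeholder assumptions, not values from the lesson):

```python
import tensorflow as tf

# A minimal sketch: a two-layer feed-forward network built with the
# Keras Sequential API. All sizes are illustrative.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="sgd",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```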

Practical Overview

  • The practical work implements neural networks and the underlying optimization theory (loss functions, gradients) in Python.

Machine Learning Algorithms

  • A Machine Learning (ML) algorithm is able to learn from data.
  • Mitchell (1997) defined a computer program as learning from experience E, with respect to task T and performance measure P, if its performance at task T improves with experience E, as measured by P.
  • ML allows for solutions to complex tasks that are hard to solve with fixed, human-designed programs.

What is Deep Learning?

  • Deep learning extracts patterns from data using neural networks with multiple layers.

Deep Learning

  • Deep learning is a machine learning technique using deep neural networks.
  • Deep neural networks are multi-layered, containing two or more hidden layers.
  • The weights of these networks must be adjusted to minimize a loss/cost/error function.

Neurons and the Perceptron

  • Diagram of the basic mathematical model of a neuron depicting inputs, weights, summation, and output.
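
As a sketch of the diagram, the neuron computes a weighted sum of its inputs plus a bias and passes it through an activation function (the numbers below are made up for illustration):

```python
import numpy as np

# Basic neuron model: weighted input sum plus bias, then an activation.
def neuron(x, w, b):
    z = np.dot(w, x) + b      # summation of weighted inputs
    return max(0.0, z)        # ReLU activation, g(z) = max(0, z)

print(neuron(np.array([1.0, 2.0]), np.array([0.5, -0.3]), 0.1))  # -> 0.0
```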

Feed-forward Neural Network Architecture

  • Key feature: no feedback connections where outputs are fed back into the network.
  • Object recognition is an important application.
  • Convolutional Neural Networks (CNNs) are a specialized type of feed-forward neural network inspired by the visual system of the brain.

Deep Neural Networks

  • Deep neural networks pass an input through successive layers, extracting increasingly abstract features of the input.

Why Now?

  • The core algorithms to train deep neural networks have existed for decades.

Big Data Impact

  • The availability of larger datasets facilitates deep learning.
  • Easier collection and storage of data are enablers.

Hardware Advancements

  • Graphics Processing Units (GPUs) provide the necessary computational power.
  • GPUs allow for massively parallelizable computations.

Software Improvements

  • There are now improved techniques, new models, and toolboxes.

Universal Approximation Theorem

  • A feedforward neural network with a single layer can approximate any continuous function to arbitrary precision (Hornik 1989, Cybenko 1989).

Universal Approximation Theorem: Implications

  • A large Multi-Layer Perceptron (MLP) can represent a wide range of functions, according to the universal approximation theorem.
  • Achieving learnability with an MLP is not guaranteed and can fail for a few reasons.
  • Training optimization algorithms might fail to find the correct parameter values.
  • The chosen training algorithm might result in overfitting.

The Unreasonable Effectiveness of Deep Learning

  • A shallow network, with few layers and many neurons per layer, is computationally intensive.
  • A deep network, with many layers and relatively few neurons per layer, achieves high levels of abstraction.

Quality of a Neural Network

  • The quality of a neural network depends on expressibility, efficiency and learnability.
  • Expressibility is the class of functions the neural network can express.
  • Efficiency is the resources required to approximate a given function.
  • Learnability refers to how fast the neural network learns good parameters for approximation.

The Unreasonable Effectiveness of Deep Learning

  • Deeper neural networks typically require exponentially fewer neurons than shallow networks to express the same function.

Choice of Activation Functions

  • Activation function choices include Sigmoid, Hyperbolic Tangent or ReLU.
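
For reference, the three named functions written out in NumPy (a sketch, not the lesson's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # g(z) = max(0, z)
```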

Why Non-Linear Activation Functions?

  • Linear activation functions produce linear decisions, regardless of network size.
  • Non-linearities enable the approximation of arbitrarily complex functions.

Loss Functions

  • Loss functions quantify the gap between prediction and ground truth.
  • For regression, use Mean Squared Error (MSE).
  • For classification, use Cross-Entropy Loss.
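
A sketch of both losses via the Keras API (the numeric values are illustrative):

```python
import tensorflow as tf

# Regression: mean squared error between targets and predictions.
mse = tf.keras.losses.MeanSquaredError()
print(mse([1.0, 2.0], [1.5, 1.5]).numpy())        # average squared gap

# Classification: cross-entropy between the true class distribution
# and the predicted probability distribution.
xent = tf.keras.losses.CategoricalCrossentropy()
print(xent([[0.0, 1.0]], [[0.3, 0.7]]).numpy())
```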

Softmax Function
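
  • The Softmax function converts a set of raw outputs into a probability distribution (see the quiz question above).

A minimal NumPy sketch (subtracting the maximum is a standard numerical-stability trick, assumed here rather than taken from the lesson):

```python
import numpy as np

# Softmax: exponentiate each output, then normalize so the results sum to 1.
def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # a probability distribution
```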

Backpropagation

  • The backpropagation algorithm finds the minimum of the loss function in weight space.
  • Method of gradient descent.
  • A combination of weights that minimizes the loss function can be considered a solution.
  • Computation of the gradient of the loss function at each iteration requires that the loss function be continuous and differentiable.

Backpropagation: Function Composition Algorithm

  • Decompose the algorithm into the following steps after random initialization of weights:
    • Feed-forward computation.
    • Backpropagation to the output layer.
    • Backpropagation to each hidden layer.
    • Weight updates.
  • The algorithm is stopped once the error function value is sufficiently small.

Backpropagation Algorithm

  • Initialize weights randomly.
  • Loop until convergence.
  • Compute gradient.
  • Update weights.
  • Return weights.
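
A compact sketch of these steps using TensorFlow's automatic differentiation (the toy data, one-weight model, and learning rate are illustrative assumptions):

```python
import tensorflow as tf

W = tf.Variable(tf.random.normal([1]))            # initialize weights randomly
x = tf.constant([1.0, 2.0, 3.0])
y = tf.constant([2.0, 4.0, 6.0])                  # toy targets: y = 2x
eta = 0.1                                         # learning rate

for step in range(100):                           # loop until convergence
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((W * x - y) ** 2)   # feed-forward, compute loss
    grad = tape.gradient(loss, W)                 # compute gradient
    W.assign_sub(eta * grad)                      # update weights

print(W.numpy())                                  # return weights (close to 2.0)
```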

Backpropagation: Chain Rule

  • The input sum of neuron k in layer l is a weighted sum of the outputs from neurons j in the previous layer.
  • An activation function (e.g., σ, ReLU) is applied to this weighted sum.
  • This calculation is carried through each subsequent layer.
  • Computing the derivative of the loss function with respect to each weight requires the Chain Rule.
  • Weights are updated through gradient descent.
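
Written out under this notation (assumed here for illustration), the forward computation is

$$z_k^{(l)} = \sum_j w_{jk}^{(l)} a_j^{(l-1)} + b_k^{(l)}, \qquad a_k^{(l)} = \sigma\left(z_k^{(l)}\right),$$

and the Chain Rule gives the gradient for a single weight:

$$\frac{\partial J}{\partial w_{jk}^{(l)}} = \frac{\partial J}{\partial a_k^{(l)}} \cdot \sigma'\left(z_k^{(l)}\right) \cdot a_j^{(l-1)}.$$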

Gradient Descent Intuition

  • Optimization is achieved through gradient descent.
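
Each step moves the weights a small distance against the gradient of the loss, restating the update rule from the quiz above:

$$W \leftarrow W - \eta \, \frac{dJ(W)}{dW}$$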

Vanishing/Exploding Gradient Problems

  • Gradients can either explode or vanish, which poses challenges.

Gradient Descent

  • Loss functions can be difficult to optimize.
  • Optimization is achieved through gradient descent.
  • The non-linearity of activation functions causes most interesting loss functions to become non-convex.

Adaptive Learning Rules

  • Learning rates are not fixed but can be adjusted based on:
    • Size of the gradient.
    • Size of particular weights.
    • How fast learning is occurring.
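
Adam is one widely used optimizer with per-parameter adaptive learning rates; it is shown here as a representative example, not necessarily the rule covered in the lesson:

```python
import tensorflow as tf

# Adam adapts each parameter's step size from running estimates of the
# gradient's first and second moments; 1e-3 is a common default.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
```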

SGD and Non-Convexity

  • Convex optimization algorithms with global convergence guarantees are used for logistic regression or SVMs.
  • Stochastic gradient descent (SGD) applied to a non-convex loss function has no such convergence guarantee and is sensitive to the values of the initial parameters.
  • SGD is only guaranteed to converge at a local minimum.
  • Overfitting can be a problem.

SGD initialization of weights

  • For feedforward neural networks, initialize all weights to small random values.
  • Initialize biases to zero or to small positive values.
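
In Keras, this guidance might look as follows (the layer width and standard deviation are illustrative assumptions):

```python
import tensorflow as tf

# Small random weights, zero biases, per the initialization advice above.
layer = tf.keras.layers.Dense(
    32,
    kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.01),
    bias_initializer="zeros",
)
```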

Gradient Descent: Sequential vs. Batch Modes

  • Sequential (on-line) mode, also called pattern or stochastic mode, updates the weights after each training example.
  • Batch mode updates the weights only after the complete presentation of the training set (one epoch).

Regularization

  • Modifies a learning algorithm to prevent overfitting.

Dropout Regularization

  • Randomly drops units from the neural network during training.
  • Dropout is a training strategy that ignores a random fraction of the hidden neurons during training: their weights are not updated and their activations are set to zero.
  • Each iteration drops a different set of neurons.
  • Reference: Srivastava et al. 2014
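
A sketch of dropout as a Keras layer (the 0.5 rate and layer sizes are illustrative choices, not values from the lesson):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.5),   # randomly zeroes 50% of activations,
                                    # only while training
    tf.keras.layers.Dense(10, activation="softmax"),
])
```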

Benefits of Dropout Regularization

  • It forces networks not to rely on any one node, discouraging memorization.
  • Robustness: prevents excessive inter-dependencies from emerging between nodes, which allows the network to learn more robust relationships.
  • Similar to brain function where losing a few neurons still allows task completion.
  • Computationally cheaper (time & storage) than averaging a committee of networks.

Regularization: Early Stopping

  • Stopping training when performance on a validation set starts to degrade.
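
In Keras, early stopping is typically attached as a callback; the patience value below is an illustrative assumption:

```python
import tensorflow as tf

# Stop when validation loss stops improving; keep the best weights seen.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
# Then pass it to training, e.g.:
# model.fit(x_train, y_train, validation_split=0.2, callbacks=[early_stop])
```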

Architectural Paradigms

  • Convolutional Neural Networks (CNNs) for spatial data such as images (object recognition).
  • Recurrent Neural Networks for sequential data, like time series.
  • Generative Adversarial Networks (GANs) for generating new data that resembles the training set.
  • Networks for actions, values, policies, and models (reinforcement learning).
