
TSIA-203: Introduction to Deep Learning
25 Questions

Created by
@GloriousMelodica


Questions and Answers

Match the training paradigm with its description:

Supervised learning = Using labeled data to train the system
Zero/few-shot learning = Training the system with minimal labeled data
Self-supervised learning = Training the system without external labels

What is the purpose of adding a small constant δ to the diagonal elements when performing ridge regression?

Avoid problems due to poor conditioning of the system

In the context of regression, what does the term 'weight decay' refer to?

Regularization to prevent overfitting by restricting the magnitude of the weights

What is the purpose of the Hessian in optimization?

All of the above (the Hessian is used to check the convexity of the function, to adapt the step size as in Newton's method, and to help find the minimum)

What does the gradient indicate in optimization?

The direction of maximum change or steepest ascent

What are the types of supervised classification mentioned in the content?

Binary classification and multi-class classification

What output values are used for regression in supervised learning?

Real numbers

Which of the following is a type of Linear Regression?

Non-Linear Regression (with non-linear basis functions the model remains linear in the parameters θ)

Maximizing the likelihood is equivalent to minimizing the Mean Square Error (MSE) in Linear Regression.

True

The ML prediction for a new input x*, given the training data D and known σ², is given by p(y | x*, D, σ²) = ____ (prediction)

𝒩(y | x*ᵀ θ_ML, σ²)

What is the goal of the Steepest/Stochastic Gradient Descent (SGD) algorithm?

To find the minimum of a function

Which algorithm is known for being very costly in practice due to storing a large matrix?

Newton/Hessian algorithm

The AdaGrad algorithm aims to put more weights on ________ features.

rare

The SGD algorithm is named 'Stochastic' because it operates on a subset of data points.

True

What type of regression is considered in the provided equation?

Softmax regression

The given equation represents the cost function of softmax regression, which includes terms related to the model's parameters and the __ of the data.

probabilities

What is the function of the layers z[l] mentioned in the context?

They form a succession of layers, each feeding its output to the next

What is the softmax regression algorithm used for?

Multi-class classification

What is the activation function in softmax regression?

Softmax

What is the cost function associated with softmax regression?

Negative Log-Likelihood (NLL) or Categorical Cross-Entropy

What is the likelihood function for softmax regression?

The product over training examples of the predicted probabilities of the true classes: L(θ) = ∏ᵢ ∏_c π_c(i)^{y_c(i)}

What is the derivative of the cost w.r.t. the parameters in softmax regression?

x(i)(π_c(i) − y_c(i)), summed over the training examples i, for the parameters of class c

What is the primary purpose of logistic regression?

Binary classification algorithm

What is the activation function used in logistic regression?

Sigmoid

Logistic regression defines a linear separating ________.

hyper-plane

Study Notes

Introduction to Deep Learning

  • The course is TSIA-203, Introduction to Deep Learning, by Geoffroy Peeters and Stéphane Lathuilière
  • The course is part of Télécom-Paris, IP-Paris, France

Deep Learning and Neural Networks: History

  • Deep learning is a subset of machine learning
  • Deep learning is about learning hierarchical representations

Feature Learning Examples

  • From Lee, Grosse, Ranganath, and Ng, "Convolutional Deep Belief Networks", ICML, 2009
  • Deep learning can be used for feature learning

Applications of Deep Learning

  • Automatic Speech Recognition
  • Self-Driving Cars
  • Automatic Picture Captioning
  • Style Transfer
  • Neural Machine Translation
  • Generative AI (GenAI) - NLP, Video, Music, Computer Vision, Speech (deep fake), Voice dubbing, DALL-E

Roadmap

  • Architectures: distinguishing the way neurons are interconnected
    • Fully Connected Neural Network
    • Recurrent Neural Network
    • Convolutional Neural Network
  • Meta-architectures: the way architectures are plugged to form a system
    • Encoder-Decoder with Attention
    • Auto-Encoder
    • Generative Adversarial Network
  • Training-paradigms: the paradigm we use to train a system
    • Supervised learning
    • Zero/few Shot learning
    • Self-Supervised learning

Supervised Classification

  • Given a set of m input/output examples (x(i), y(i))i∈{1,…,m}
  • Estimate the parameters θ of a model fθ
  • Predict the output y(m+1) for a new input x(m+1)

Linear Regression

  • Model: ŷ(i) = θ0 + x(i)θ1
  • Objective function: Mean Square Error (MSE)
  • Parameter estimation: find θ that minimises J(θ)
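
To make the bullet points above concrete, here is a minimal NumPy sketch (not part of the course slides) that fits the model ŷ(i) = θ0 + x(i)θ1 by minimising the MSE with the closed-form least-squares solution; the toy data and variable names are made up for illustration.

```python
import numpy as np

# Toy 1-D data (made up for illustration): y = 2 + 3x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(50)

# Design matrix with a column of ones so that theta_0 acts as the bias
X = np.column_stack([np.ones_like(x), x])          # shape (m, 2)

# Closed-form least-squares estimate: theta = (X^T X)^{-1} X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Mean Square Error of the fitted model
y_hat = X @ theta
mse = np.mean((y - y_hat) ** 2)
print("theta:", theta, "MSE:", mse)
```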

Maximum Likelihood Estimation (MLE)

  • Example: suppose we have n=3 data points which are independent samples from a Gaussian with unknown mean θ and variance 1

  • Likelihood: p(y(1), y(2), y(3)|θ) = p(y(1)|θ) ⋅ p(y(2)|θ) ⋅ p(y(3)|θ)

  • We take the θ that maximises the likelihood

  • In linear regression, maximising the likelihood is equivalent to minimising the MSE loss

  • The likelihood for linear regression is given by: p(y | X, θ, σ) = (2πσ²)^(−m/2) exp(−(1/(2σ²)) (y − Xθ)ᵀ(y − Xθ))

  • Maximizing the likelihood is equivalent to minimizing the Negative Log-Likelihood (NLL)

  • The MLE of θ is the same as the solution for MSE: θ_ML = (XᵀX)⁻¹Xᵀy

  • The MLE of σ is given by: σ_ML = √((1/m) (y − Xθ_ML)ᵀ(y − Xθ_ML))

Making Predictions

  • The ML prediction for a new input x*, given the training data D and known σ², is a Gaussian distribution with mean x*ᵀθ_ML and variance σ²
  • This provides a confidence in prediction

Regularization

  • Ridge regression is a modification of least-square regression that adds a small constant δ to the diagonal elements to prevent singularities
  • The regularized quadratic cost function is: J(θ) = (y − Xθ)ᵀ(y − Xθ) + δ θᵀθ
  • The solution for θ is: θ_ridge = (XᵀX + δI)⁻¹Xᵀy
  • Ridge regression is equivalent to minimization of the quadratic cost function with a constraint on the L2 norm of θ
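
A minimal sketch of the ridge solution above, assuming NumPy; the function name and the default value of δ are illustrative choices, not from the course.

```python
import numpy as np

def ridge_fit(X, y, delta=1e-3):
    """Ridge regression: theta = (X^T X + delta*I)^{-1} X^T y.

    Adding delta on the diagonal keeps X^T X well conditioned;
    delta is a hyper-parameter to be tuned."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + delta * np.eye(d), X.T @ y)
```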

Non-Linear Regression

  • Basis functions can be used to transform the input data into a higher dimensional space
  • Examples of basis functions include polynomials and radial basis functions (RBFs)
  • The RBF features are given by: ϕ(x) = [k(x, μ₁, λ), …, k(x, μ_d, λ)], with RBF kernel k(x, μ_d, λ) = exp(−λ ‖x − μ_d‖²)
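
As an illustration of regression with RBF basis functions, here is a small NumPy sketch; the toy data, the number of centres, and the value of λ are arbitrary choices for the example.

```python
import numpy as np

def rbf_features(x, centers, lam):
    """phi(x) = [exp(-lam * ||x - mu_d||^2) for each centre mu_d]."""
    # x: shape (m,), centers: shape (d,)  ->  feature matrix of shape (m, d)
    return np.exp(-lam * (x[:, None] - centers[None, :]) ** 2)

# Toy non-linear data (made up): y = sin(3x) + noise
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100)
y = np.sin(3.0 * x) + 0.05 * rng.standard_normal(100)

# Linear regression in the RBF feature space
Phi = rbf_features(x, centers=np.linspace(-1.0, 1.0, 10), lam=10.0)
theta = np.linalg.lstsq(Phi, y, rcond=None)[0]
```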

Having the Right Model

  • The model complexity should be chosen based on the number of training data points
  • If the model is too weak, the results will be bad regardless of the number of data points
  • If the model is too complex, the results will depend on the number of data points

Optimisation

  • Gradient descent is a method for finding the minimum of a function
  • The gradient is a vector of partial derivatives of the function with respect to each parameter
  • The Hessian is a matrix of second partial derivatives of the function with respect to each parameter
  • The Hessian is used to check the convexity of the function and to find the minimum

Gradient and Hessian for Linear Regression

  • The gradient of the cost function for linear regression is: ∇_θ f(θ) = −2Xᵀy + 2XᵀXθ
  • The Hessian of the cost function for linear regression is: ∇²_θ f(θ) = 2XᵀX

Steepest/Stochastic Gradient Descent (SGD)

  • Updates the parameters θ using the learning rate η_k and the gradient g_k of the objective function:
  • θ_{k+1} = θ_k − η_k g_k
  • SGD for linear regression:
  • θ_{k+1} = θ_k − η_k [2Xᵀ(Xθ_k − y)]
  • Choosing the learning rate η_k matters: in the course example, η = 0.1 is too slow and η = 0.6 is too large (a sketch of the update loop follows below)
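
A minimal sketch of the gradient-descent loop for linear regression, assuming NumPy; the step count and learning rate are illustrative defaults, not values from the course.

```python
import numpy as np

def gradient_descent_linreg(X, y, eta=0.01, n_steps=500):
    """Steepest descent on the MSE cost of linear regression.

    Update rule: theta <- theta - eta * 2 X^T (X theta - y)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = 2.0 * X.T @ (X @ theta - y)
        theta = theta - eta * grad
    return theta
```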

Newton/Hessian Algorithm

  • Uses the curvature of the objective function to adapt the learning rate η_k
  • Taylor series expansion:
  • f(θ) ≈ f(θ_k) + g_kᵀ(θ − θ_k) + (θ − θ_k)ᵀ H_k (θ − θ_k) / 2
  • Newton's method:
  • θ_{k+1} = θ_k − H_k⁻¹ g_k
  • Newton's method for linear regression:
  • θ_{k+1} = θ_k − (XᵀX)⁻¹XᵀXθ_k + (XᵀX)⁻¹Xᵀy = (XᵀX)⁻¹Xᵀy
  • This reaches the theoretical solution in a single step, but it is costly in practice because the Hessian must be stored and inverted
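
A sketch of a single Newton update for linear regression, assuming NumPy; since the Hessian 2XᵀX is constant here, one step lands on the closed-form solution, which also makes the cost of forming and solving with the Hessian visible.

```python
import numpy as np

def newton_step_linreg(X, y, theta):
    """One Newton update theta <- theta - H^{-1} g for the MSE cost.

    g = 2 X^T (X theta - y) and H = 2 X^T X, so a single step gives
    theta = (X^T X)^{-1} X^T y; storing/solving H is the costly part."""
    g = 2.0 * X.T @ (X @ theta - y)
    H = 2.0 * X.T @ X
    return theta - np.linalg.solve(H, g)
```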

Variations of SGD

  • Batch:
  • θk+1 = θk + ηk XT (y - Xθk)
  • Online:
  • θk+1 = θk + ηk x(k)T (y(k) - x(k)θk)
  • Mini-Batch:
  • θk+1 = θk + ηk XT (y - Xθk)
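
A mini-batch SGD sketch in NumPy following the updates above; the batch size, epoch count, and learning rate are illustrative.

```python
import numpy as np

def minibatch_sgd_linreg(X, y, eta=0.01, batch_size=16, n_epochs=20, seed=0):
    """Mini-batch SGD for linear regression: each update uses a random subset."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_epochs):
        perm = rng.permutation(m)              # reshuffle the examples every epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]            # the mini-batch X_B, y_B
            theta = theta + eta * Xb.T @ (yb - Xb @ theta)
    return theta
```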

Momentum

  • Goal: accelerate descent in cumulative direction
  • Updates the parameters θ with the momentum term α
  • θ_{k+1} = θ_k + α(θ_k − θ_{k−1}) + (1 − α)[−η ∇J(θ_k)]
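
A one-line sketch of the momentum update above; `grad` stands for ∇J(θ_k) and is assumed to be computed elsewhere.

```python
def momentum_step(theta, theta_prev, grad, eta=0.01, alpha=0.9):
    """theta_{k+1} = theta_k + alpha*(theta_k - theta_{k-1}) + (1 - alpha)*(-eta * grad)."""
    return theta + alpha * (theta - theta_prev) + (1.0 - alpha) * (-eta * grad)
```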

AdaGrad

  • Goal: put more weight on rare features
  • For feature d at step k:
  • θ_{d,k+1} = θ_{d,k} − η_k g_{d,k} / √(Σ_{τ=1}^{k} g_{d,τ}²)
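
A per-feature AdaGrad update sketch in NumPy; the small constant eps is a common numerical safeguard added here, not part of the formula above.

```python
import numpy as np

def adagrad_step(theta, grad, sum_sq_grad, eta=0.1, eps=1e-8):
    """AdaGrad: per-feature step size eta / sqrt(accumulated squared gradients).

    Features with rare (small accumulated) gradients get larger effective steps."""
    sum_sq_grad = sum_sq_grad + grad ** 2          # accumulate g_d^2 per feature
    theta = theta - eta * grad / (np.sqrt(sum_sq_grad) + eps)
    return theta, sum_sq_grad
```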

ADAM

  • Combination of momentum and AdaGrad (schematically):
  • θ_{d,k+1} = θ_{d,k} − η_k g_{d,k} / √(Σ_{τ=1}^{k} g_{d,τ}²) + α(θ_{d,k} − θ_{d,k−1})

Logistic Regression

  • Binary classification algorithm
  • Outputs the probability of the class 1
  • Activation function: logistic (inverse of the logit function)
  • Derivative of the logistic function: π(i)(1 - π(i))

Maximum Likelihood Estimation (MLE)

  • Bernoulli distribution:
  • P(X = x | p) = p^x (1 − p)^(1−x)
  • Entropy:
  • H(X) = -p log(p) - (1-p) log(1-p)

Cost Function

  • Binary Cross-Entropy (BCE)
  • C(θ) = -∑[y(i) log π(i) + (1 - y(i)) log(1 - π(i))]

Optimisation using Gradient

  • Gradient descent:
  • θk+1 = θk - ηk gk
  • Gradient:
  • g = Σᵢ (π(i) − y(i)) x(i)
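
Putting the sigmoid activation, the BCE gradient g = Σᵢ (π(i) − y(i)) x(i), and the descent update together, here is a minimal NumPy sketch; the learning rate and iteration count are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, eta=0.1, n_steps=1000):
    """Gradient descent on the Binary Cross-Entropy cost.

    pi = sigmoid(X theta); gradient g = X^T (pi - y)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        pi = sigmoid(X @ theta)
        g = X.T @ (pi - y)          # sum over examples of (pi_i - y_i) x_i
        theta = theta - eta * g
    return theta
```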

Optimisation using Hessian (Iteratively Reweighted Least Square)

  • Newton update:
  • θ_{k+1} = θ_k − H_k⁻¹ g_k
  • Hessian:
  • H_k = Xᵀ diag(π_k(1)(1 − π_k(1)), …, π_k(m)(1 − π_k(m))) X
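
A sketch of one IRLS (Newton) update for logistic regression, assuming NumPy; building the full diagonal weight matrix explicitly is done here only for clarity.

```python
import numpy as np

def irls_step(X, y, theta):
    """One Newton / IRLS update: theta <- theta - H^{-1} g with
    H = X^T diag(pi*(1-pi)) X and g = X^T (pi - y)."""
    pi = 1.0 / (1.0 + np.exp(-(X @ theta)))
    W = np.diag(pi * (1.0 - pi))      # explicit diagonal matrix, fine for small m
    H = X.T @ W @ X
    g = X.T @ (pi - y)
    return theta - np.linalg.solve(H, g)
```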

Softmax Regression

  • Multi-class classification algorithm
  • Outputs the vector of probability for all classes
  • Activation function: softmax
  • Softmax function:
  • π_c(i) = softmax(z_c(i)) = exp(z_c(i)) / Σ_{c′=1}^{C} exp(z_{c′}(i))
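
A small NumPy sketch of the softmax activation and the NLL / categorical cross-entropy cost with its gradient; the shapes and the numerical-stability shift are assumptions of this example.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax: pi_c = exp(z_c) / sum_{c'} exp(z_{c'}).

    Subtracting the row maximum is a standard trick against overflow."""
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def nll_and_grad(Theta, X, Y):
    """Negative Log-Likelihood (categorical cross-entropy) and its gradient.

    X: (m, d) inputs, Theta: (d, C) parameters, Y: (m, C) one-hot labels.
    The gradient X^T (pi - Y) matches sum_i x(i)(pi_c(i) - y_c(i))."""
    Pi = softmax(X @ Theta)
    nll = -np.sum(Y * np.log(Pi + 1e-12))
    grad = X.T @ (Pi - Y)
    return nll, grad
```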


Description

Learn about deep learning and its applications in this course from Télécom-Paris, IP-Paris, France. Covers the history of deep learning and neural networks, and feature learning examples.
