TSIA-203: Introduction to Deep Learning

Questions and Answers

Match the training paradigm with its description:

  • Supervised learning = Using labeled data to train the system
  • Zero/few Shot learning = Training the system with minimal labeled data
  • Self-Supervised learning = Training the system without external labels

What is the purpose of adding a small constant δ to the diagonal elements when performing ridge regression?

Avoid problems due to poor conditioning of the system

In the context of regression, what does the term 'weight decay' refer to?

Regularization to prevent overfitting by restricting the magnitude of the weights

What is the purpose of the Hessian in optimization?

All of the above (D)

What does the gradient indicate in optimization?

The direction of maximum change or steepest ascent

What are the types of supervised classification mentioned in the content?

Binary classification, multi-class classification

What output values are used for regression in supervised classification?

Real numbers

Which of the following is a type of Linear Regression?

Non-Linear Regression (C)

Maximizing the likelihood is equivalent to minimizing the Mean Square Error (MSE) in Linear Regression.

True (A)

The ML prediction for a new input x*, given the training data D, and known σ^2 is given by p(y | x*, D, σ^2) = ____(prediction)

$\mathcal{N}(y \mid x_*^T \theta_{ML}, \sigma^2)$

What is the goal of the Steepest/Stochastic Gradient Descent (SGD) algorithm?

To find the minimum of a function

Which algorithm is known for being very costly in practice due to storing a large matrix?

Newton/Hessian algorithm (B)

The AdaGrad algorithm aims to put more weights on ________ features.

rare

The SGD algorithm is named 'Stochastic' because it operates on a subset of data points.

True (A)

What type of regression is considered in the provided equation?

Softmax regression

The given equation represents the cost function of softmax regression, which includes terms related to the model's parameters and the __ of the data.

probabilities

What is the function of the layers z[l] mentioned in the context?

Succession of layers

What is the softmax regression algorithm used for?

Multi-class classification

What is the activation function in softmax regression?

Softmax

What is the cost function associated with softmax regression?

Negative Log-Likelihood (NLL) or Categorical Cross Entropy

What is the likelihood function for softmax regression?

Likelihood function for Softmax

What is the derivative of the cost w.r.t. the parameters in softmax regression?

x(i)(π2(i) - y(i))

What is the primary purpose of logistic regression?

Binary classification algorithm (D)

What is the activation function used in logistic regression?

Sigmoid

Logistic regression defines a linear separating ________.

hyper-plane

Study Notes

Introduction to Deep Learning

  • The course is TSIA-203, Introduction to Deep Learning, by Geoffroy Peeters and Stéphane Lathuilière
  • The course is part of Télécom-Paris, IP-Paris, France

Deep Learning and Neural Networks: History

  • Deep learning is a subset of machine learning
  • Deep learning is about learning hierarchical representations

Feature Learning Examples

  • From Lee, Grosse, Ranganath, and Ng, "Convolutional Deep Belief Networks", ICML, 2009
  • Deep learning can be used for feature learning

Applications of Deep Learning

  • Automatic Speech Recognition
  • Self-Driving Cars
  • Automatic Picture Captioning
  • Style Transfer
  • Neural Machine Translation
  • Generative AI (GenAI) - NLP, Video, Music, Computer Vision, Speech (deep fake), Voice dubbing, DALL-E

Roadmap

  • Architectures: distinguishing the way neurons are interconnected
    • Fully Connected Neural Network
    • Recurrent Neural Network
    • Convolutional Neural Network
  • Meta-architectures: the way architectures are plugged to form a system
    • Encoder-Decoder with Attention
    • Auto-Encoder
    • Generative Adversarial Network
  • Training-paradigms: the paradigm we use to train a system
    • Supervised learning
    • Zero/few Shot learning
    • Self-Supervised learning

Supervised Classification

  • Given a set of m input/output examples $(x^{(i)}, y^{(i)})_{i \in \{1,\dots,m\}}$
  • Estimate the parameters $\theta$ of a model $f_\theta$
  • Predict the output $y^{(m+1)}$ for a new input $x^{(m+1)}$

Linear Regression

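  • Model: $\hat{y}^{(i)} = \theta_0 + x^{(i)} \theta_1$
  • Objective function: Mean Square Error (MSE), $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (y^{(i)} - \hat{y}^{(i)})^2$
  • Parameter estimation: find the $\theta$ that minimises $J(\theta)$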
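
The model and objective above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the course's own code: the function names `fit_linear_regression` and `mse` and the toy data are made up for the example, and the fit relies on NumPy's least-squares solver.

```python
import numpy as np

def fit_linear_regression(x, y):
    """Least-squares fit of y ≈ theta0 + x * theta1 (this minimises the MSE)."""
    # Prepend a column of ones so that theta[0] plays the role of theta0.
    X = np.column_stack([np.ones_like(x), x])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def mse(theta, x, y):
    """Mean square error of the model theta0 + x * theta1 on (x, y)."""
    y_hat = theta[0] + x * theta[1]
    return np.mean((y - y_hat) ** 2)

# Hypothetical noiseless toy data generated with theta0 = 1 and theta1 = 2.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
theta = fit_linear_regression(x, y)
```

On noiseless data such as this, the fitted `theta` should recover the generating parameters and the MSE should be numerically zero.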

Maximum Likelihood Estimation (MLE)

  • Example: suppose we have n=3 data points which are independent samples from a Gaussian with unknown mean θ and variance 1

  • Likelihood: $p(y^{(1)}, y^{(2)}, y^{(3)} \mid \theta) = p(y^{(1)} \mid \theta) \cdot p(y^{(2)} \mid \theta) \cdot p(y^{(3)} \mid \theta)$

  • We take the θ that maximises the likelihood

  • In linear regression, maximising the likelihood is equivalent to minimising the MSE loss

  • The likelihood for linear regression is: $p(y \mid X, \theta, \sigma) = (2\pi\sigma^2)^{-m/2} \exp\left(-\frac{1}{2\sigma^2} (y - X\theta)^T (y - X\theta)\right)$

  • Maximizing the likelihood is equivalent to minimizing the Negative Log-Likelihood (NLL)

  • The MLE of θ is the same as the least-squares (MSE) solution: $\theta_{ML} = (X^T X)^{-1} X^T y$

  • The MLE of σ is: $\sigma_{ML} = \sqrt{\frac{1}{m} (y - X\theta_{ML})^T (y - X\theta_{ML})}$

Making Predictions

  • The ML prediction for a new input $x_*$, given the training data $D$ and known $\sigma^2$, is a Gaussian distribution with mean $x_*^T \theta_{ML}$ and variance $\sigma^2$
  • This provides a confidence in prediction

Regularization

  • Ridge regression is a modification of least-squares regression that adds a small constant δ to the diagonal elements of $X^T X$, which prevents singularities and poor conditioning
  • The regularized quadratic cost function is: $J(\theta) = (y - X\theta)^T (y - X\theta) + \delta\, \theta^T \theta$
  • The solution for θ is: $\theta_{ridge} = (X^T X + \delta I)^{-1} X^T y$
  • Ridge regression is equivalent to minimising the quadratic cost function under a constraint on the L2 norm of θ
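
As a rough illustration of why the added diagonal term helps, the sketch below solves the ridge system on a design matrix with two identical columns, for which the unregularized normal equations are singular. The name `fit_ridge`, the parameter name `delta` (the positive constant added to the diagonal) and the data are assumptions made for this example.

```python
import numpy as np

def fit_ridge(X, y, delta):
    """Solve (X^T X + delta * I) theta = X^T y for theta."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + delta * np.eye(d), X.T @ y)

# Two identical columns: X^T X has rank 1, so plain least squares
# has no unique solution, while the ridge system is still solvable.
X_ridge = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y_ridge = np.array([2.0, 4.0, 6.0])
theta_ridge = fit_ridge(X_ridge, y_ridge, delta=1e-3)
```

Because the penalty favours a small L2 norm, the solution splits the weight evenly between the two duplicated columns.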

Non-Linear Regression

  • Basis functions can be used to transform the input data into a higher dimensional space
  • Examples of basis functions include polynomials and radial basis functions (RBFs)
  • The RBF features are given by: $\phi(x) = [k(x, \mu_1, \lambda), \dots, k(x, \mu_d, \lambda)]$, where $k(x, \mu_j, \lambda) = e^{-\lambda \|x - \mu_j\|^2}$
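
A possible sketch of the RBF feature map for one-dimensional inputs follows. The function name `rbf_features`, the choice of centres and the value of `lam` are illustrative assumptions; the resulting matrix can be used in place of X in the linear regression formulas.

```python
import numpy as np

def rbf_features(x, centres, lam):
    """Map inputs x of shape (m,) to features exp(-lam * (x - mu_j)^2), shape (m, d)."""
    diff = x[:, None] - centres[None, :]   # pairwise differences, shape (m, d)
    return np.exp(-lam * diff ** 2)

x_in = np.linspace(0.0, 1.0, 5)            # hypothetical inputs
centres = np.array([0.0, 0.5, 1.0])        # hypothetical centres mu_1..mu_3
Phi = rbf_features(x_in, centres, lam=10.0)
```

Each row of `Phi` is the feature vector of one input; an input that coincides with a centre gets the value 1 for that feature.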

Having the Right Model

  • The model complexity should be chosen based on the number of training data points
  • If the model is too weak, the results will be bad regardless of the number of data points
  • If the model is too complex, the results will depend on the number of data points

Optimisation

  • Gradient descent is a method for finding the minimum of a function
  • The gradient is a vector of partial derivatives of the function with respect to each parameter
  • The Hessian is a matrix of second partial derivatives of the function with respect to each parameter
  • The Hessian is used to check the convexity of the function and to find the minimum

Gradient and Hessian for Linear Regression

  • The gradient of the cost function for linear regression is: $\nabla_\theta f(\theta) = -2 X^T y + 2 X^T X \theta$
  • The Hessian of the cost function for linear regression is: $\nabla^2_\theta f(\theta) = 2 X^T X$

Optimisation

  • Steepest/Stochastic Gradient Descent (SGD) algorithm:
    • Updates the parameters θ using the learning rate $\eta_k$ and the gradient $g_k$ of the objective function
    • $\theta_{k+1} = \theta_k - \eta_k g_k$
  • SGD for linear regression:
    • $\theta_{k+1} = \theta_k - \eta_k [2 X^T (X\theta_k - y)]$
  • How to choose the learning rate $\eta_k$:
    • In the example considered, η = 0.1 converges too slowly and η = 0.6 is too large
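
The update rule for linear regression can be sketched as follows. This is a minimal example with a constant learning rate; the function name `gradient_descent`, the toy data and the value `eta=0.02` are assumptions chosen so that the iteration is stable on this particular problem.

```python
import numpy as np

def gradient_descent(X, y, eta, n_steps):
    """Minimise (y - X theta)^T (y - X theta) with a constant step size eta."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        g = 2.0 * X.T @ (X @ theta - y)    # gradient of the quadratic cost
        theta = theta - eta * g
    return theta

# Hypothetical noiseless data: y = 1 + 2 x, with a column of ones for the intercept.
X_gd = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y_gd = np.array([1.0, 3.0, 5.0, 7.0])
theta_gd = gradient_descent(X_gd, y_gd, eta=0.02, n_steps=2000)
```

The range of stable step sizes depends on the largest eigenvalue of the Hessian $2 X^T X$, so a value that works on one dataset can be too slow or divergent on another.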

Newton/Hessian Algorithm

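  • Uses the curvature of the objective function to adapt the learning rate $\eta_k$
  • Taylor series expansion:
    • $f(\theta) \approx f(\theta_k) + g_k^T (\theta - \theta_k) + \frac{1}{2} (\theta - \theta_k)^T H_k (\theta - \theta_k)$
  • Newton's method:
    • $\theta_{k+1} = \theta_k - H_k^{-1} g_k$
  • Newton's method for linear regression:
    • $\theta_{k+1} = \theta_k - (X^T X)^{-1} X^T X \theta_k + (X^T X)^{-1} X^T y = (X^T X)^{-1} X^T y$
  • A single step therefore reaches the closed-form least-squares solution, but storing and inverting the Hessian is costly in practice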
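
To illustrate the one-step property for linear regression, here is a small sketch. The name `newton_step`, the starting point and the data are assumptions; the code solves the linear system H d = g rather than forming the inverse Hessian explicitly.

```python
import numpy as np

def newton_step(X, y, theta):
    """One Newton update for the cost (y - X theta)^T (y - X theta)."""
    g = 2.0 * X.T @ (X @ theta - y)        # gradient at theta
    H = 2.0 * X.T @ X                      # Hessian (constant for this cost)
    return theta - np.linalg.solve(H, g)   # same as theta - H^{-1} g

X_nt = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y_nt = np.array([1.0, 3.0, 5.0, 7.0])
theta_start = np.array([10.0, -5.0])       # arbitrary starting point
theta_newton = newton_step(X_nt, y_nt, theta_start)
theta_closed = np.linalg.solve(X_nt.T @ X_nt, X_nt.T @ y_nt)
```

Whatever the starting point, `theta_newton` should match the closed-form solution `theta_closed` up to rounding, since the cost is exactly quadratic.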

Variations of SGD

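  • Batch (all m examples at each step):
    • $\theta_{k+1} = \theta_k + \eta_k X^T (y - X\theta_k)$
  • Online (one example at each step):
    • $\theta_{k+1} = \theta_k + \eta_k x^{(k)T} (y^{(k)} - x^{(k)} \theta_k)$
  • Mini-Batch (a subset $B_k$ of the examples at each step):
    • $\theta_{k+1} = \theta_k + \eta_k X_{B_k}^T (y_{B_k} - X_{B_k} \theta_k)$, where $X_{B_k}$ and $y_{B_k}$ keep only the rows in $B_k$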
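
The three variants differ only in how many rows enter each update, which the sketch below makes explicit through a `batch_size` argument. The function name `minibatch_sgd`, the shuffling scheme (one random permutation per epoch) and all numeric values are assumptions for the example.

```python
import numpy as np

def minibatch_sgd(X, y, eta, batch_size, n_epochs, seed=0):
    """SGD for linear regression; batch_size = m gives batch, batch_size = 1 gives online."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(m)                   # visit the examples in random order
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]                  # rows of the current mini-batch
            theta = theta + eta * Xb.T @ (yb - Xb @ theta)
    return theta

# Hypothetical noiseless data: y = 1 + 2 x on 20 points in [-1, 1].
x_mb = np.linspace(-1.0, 1.0, 20)
X_mb = np.column_stack([np.ones_like(x_mb), x_mb])
y_mb = 1.0 + 2.0 * x_mb
theta_mb = minibatch_sgd(X_mb, y_mb, eta=0.1, batch_size=4, n_epochs=500)
```

With noisy data the iterates would keep fluctuating around the solution unless the learning rate is decreased over time; the noiseless data here avoids that complication.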

Momentum

  • Goal: accelerate descent in cumulative direction
  • Updates the parameters θ with the momentum term α
  • $\theta_{k+1} = \theta_k + \alpha (\theta_k - \theta_{k-1}) + (1 - \alpha) [-\eta \nabla J(\theta_k)]$
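
The momentum update can be sketched as a loop that keeps the previous iterate. The function name `momentum_descent`, the quadratic test objective and the values of `eta` and `alpha` are illustrative assumptions.

```python
import numpy as np

def momentum_descent(grad, theta0, eta, alpha, n_steps):
    """Iterate theta_{k+1} = theta_k + alpha (theta_k - theta_{k-1}) - (1 - alpha) eta grad(theta_k)."""
    theta_prev = np.array(theta0, dtype=float)
    theta = theta_prev.copy()              # first step has zero momentum
    for _ in range(n_steps):
        theta_next = theta + alpha * (theta - theta_prev) - (1.0 - alpha) * eta * grad(theta)
        theta_prev, theta = theta, theta_next
    return theta

# Hypothetical objective J(theta) = 0.5 * theta^T A theta, minimised at theta = 0.
A = np.diag([1.0, 10.0])
theta_mom = momentum_descent(lambda th: A @ th, [1.0, 1.0], eta=0.1, alpha=0.5, n_steps=500)
```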

AdaGrad

  • Goal: put more weight on rare features
  • For feature d at step k:
    • $\theta_{d,k+1} = \theta_{d,k} - \eta_k \, g_{d,k} / \sqrt{\sum_{\tau=1}^{k} g_{d,\tau}^2}$
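
A minimal sketch of the per-feature scaling follows, assuming the usual square root of the accumulated squared gradients and a small constant `eps` to avoid division by zero (`eps`, the function name `adagrad` and the test objective are assumptions of this example).

```python
import numpy as np

def adagrad(grad, theta0, eta, n_steps, eps=1e-8):
    """AdaGrad: divide each coordinate's step by the root of its accumulated squared gradients."""
    theta = np.array(theta0, dtype=float)
    sum_sq = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad(theta)
        sum_sq += g ** 2                               # per-feature running sum of g^2
        theta -= eta * g / (np.sqrt(sum_sq) + eps)
    return theta

# Hypothetical objective with very different curvature per coordinate.
A_ada = np.diag([1.0, 10.0])
grad_ada = lambda th: A_ada @ th
theta_ada = adagrad(grad_ada, [1.0, 1.0], eta=0.5, n_steps=200)
theta_ada_5 = adagrad(grad_ada, [1.0, 1.0], eta=0.5, n_steps=5)
```

On this diagonal quadratic the gradient scale of each coordinate cancels in the ratio, so both coordinates progress at almost the same pace even though their gradients differ by a factor of 10. This is the sense in which features with small or infrequent gradients receive relatively larger steps.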

ADAM

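  • Combination of momentum and AdaGrad
  • Schematically, for feature d: $\theta_{d,k+1} = \theta_{d,k} - \eta_k \, g_{d,k} / \sqrt{\sum_{\tau=1}^{k} g_{d,\tau}^2} + \alpha (\theta_{d,k} - \theta_{d,k-1})$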

Logistic Regression

  • Binary classification algorithm
  • Outputs the probability of class 1: $\pi^{(i)} = \sigma(x^{(i)} \theta)$
  • Activation function: logistic (sigmoid), $\sigma(z) = 1 / (1 + e^{-z})$, the inverse of the logit function
  • Derivative of the logistic function: $\pi^{(i)} (1 - \pi^{(i)})$

Maximum Likelihood Estimation (MLE)

  • Bernoulli distribution:
    • $P(X = x \mid p) = p^x (1 - p)^{1-x}$, with $x \in \{0, 1\}$
  • Entropy:
    • $H(X) = -p \log p - (1 - p) \log(1 - p)$

Cost Function

  • Binary Cross-Entropy (BCE)
  • $C(\theta) = -\sum_{i=1}^{m} [y^{(i)} \log \pi^{(i)} + (1 - y^{(i)}) \log(1 - \pi^{(i)})]$

Optimisation using Gradient

  • Gradient descent:
    • $\theta_{k+1} = \theta_k - \eta_k g_k$
  • Gradient:
    • $g = \sum_{i=1}^{m} (\pi^{(i)} - y^{(i)}) \, x^{(i)T} = X^T (\pi - y)$
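
The cost and gradient above can be put together as in the sketch below. The helper names (`sigmoid`, `bce_cost`, `bce_gradient`), the toy data and the step size are assumptions; the data are deliberately not linearly separable so that the minimiser is finite.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_cost(theta, X, y):
    """Binary cross-entropy summed over the examples."""
    pi = sigmoid(X @ theta)
    return -np.sum(y * np.log(pi) + (1.0 - y) * np.log(1.0 - pi))

def bce_gradient(theta, X, y):
    """Gradient of the BCE cost: X^T (pi - y)."""
    return X.T @ (sigmoid(X @ theta) - y)

# Hypothetical data: first column is the intercept, labels are not separable.
X_lr = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y_lr = np.array([0.0, 1.0, 0.0, 1.0])

theta_lr = np.zeros(2)
for _ in range(200):
    theta_lr = theta_lr - 0.1 * bce_gradient(theta_lr, X_lr, y_lr)
```

A finite-difference comparison against `bce_cost` is a simple way to check that the analytic gradient is implemented correctly.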

Optimisation using Hessian (Iteratively Reweighted Least Square)

  • Newton update:
    • $\theta_{k+1} = \theta_k - H_k^{-1} g_k$
  • Hessian:
    • $H_k = X^T \, \mathrm{diag}(\pi_k^{(1)} (1 - \pi_k^{(1)}), \dots, \pi_k^{(m)} (1 - \pi_k^{(m)})) \, X$
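
A compact sketch of the Newton (IRLS) iteration for logistic regression is given below. The function name `newton_logistic`, the fixed number of steps and the toy data are assumptions; there is no line search or regularization, so this plain version can fail on linearly separable data, where the Hessian becomes nearly singular.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, n_steps=10):
    """Newton iterations theta <- theta - H^{-1} g for the BCE cost."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        pi = sigmoid(X @ theta)
        g = X.T @ (pi - y)                        # gradient
        w = pi * (1.0 - pi)                       # diagonal of the weight matrix
        H = X.T @ (X * w[:, None])                # X^T diag(w) X
        theta = theta - np.linalg.solve(H, g)
    return theta

# Hypothetical, non-separable data with an intercept column.
X_irls = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y_irls = np.array([0.0, 1.0, 0.0, 1.0])
theta_irls = newton_logistic(X_irls, y_irls)
```

At convergence the gradient $X^T(\pi - y)$ should vanish, which gives a direct check on the result.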

Softmax Regression

  • Multi-class classification algorithm
  • Outputs the vector of probability for all classes
  • Activation function: softmax
  • Softmax function:
    • $\pi_c^{(i)} = \mathrm{softmax}(z_c^{(i)}) = \exp(z_c^{(i)}) / \sum_{c'=1}^{C} \exp(z_{c'}^{(i)})$, where C is the number of classes
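
The softmax function can be sketched as follows. Subtracting the row maximum before exponentiating is a common numerical-stability trick that is not part of the notes above; it leaves the result unchanged because the shift cancels in the ratio. The input values are made up for the example.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis; each output row is a probability vector."""
    z = z - np.max(z, axis=-1, keepdims=True)   # shift to avoid overflow in exp
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

# Hypothetical scores for 2 examples and 3 classes.
scores = np.array([[0.0, 0.0, 0.0],
                   [1.0, 2.0, 3.0]])
probs = softmax(scores)
```

Each row of `probs` sums to 1, and equal scores give equal probabilities.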

Description

Learn about deep learning and its applications in this course from Télécom-Paris, IP-Paris, France. Covers the history of deep learning and neural networks, and feature learning examples.
