Questions and Answers
What is an example of classification in machine learning?
Which of the following defines regression in machine learning?
In the context of machine learning, what does the output of classification represent?
How does the measurement of error differ between classification and regression?
What aspect do recommender systems focus on?
What represents the scalar output in the linear regression model?
In the model $y(i) = \beta^T x(i) + \beta_0$, what does the term $\beta_0$ represent?
What must be defined to measure how wrong a linear regression model is?
What geometric shape is represented in 2D in linear regression?
What happens when one dimension is added to the input feature vector in linear regression?
Which statement is true regarding the parameters $\beta$ in the linear regression model?
What is the key characteristic of the function fitted in linear regression?
In a higher-dimensional context, what term is used for the shape that models the relationship in linear regression?
What does the parameterization of the model represent?
What distribution do the ground truth labels y follow?
Which symbol represents the output of the model as a function of the input?
In the context of maximum likelihood estimation, which expression is minimized?
What is the role of ε in the equation for y?
What is the standard deviation used in the Gaussian distribution described?
Linear regression can be understood as maximum likelihood estimation under which condition?
What does 𝜇* represent in the context of maximum likelihood estimation?
What mathematical operation is performed to derive the expression for maximum likelihood estimation?
Which of the following is part of the expression for the Gaussian probability P?
What does the gradient direction indicate in relation to a function's value?
What happens to the function value along a level set?
What is the relationship between the gradient of a differentiable function and the level set at a point?
What role does the learning rate (𝜂) play in gradient descent optimization?
What is a critical consideration when selecting a learning rate in gradient descent?
Which statement accurately describes the loss surface in 2D?
In the context of gradient descent on convex functions, what happens if the learning rate is too high?
What characteristic of a level set is most emphasized in its definition?
What does the formula for $L_{XE}$ represent in terms of probabilities?
How is the Monte Carlo estimation of an expectation formulated?
What does the term 'cross-entropy' refer to in the provided context?
What does the KL divergence measure in information theory?
In the equation $H(P, Q) = -E_{P(x)}[\log Q(x)]$, what does $H(P, Q)$ represent?
What is the role of the indicator function $δ$ in the context provided?
What implication does minimizing the loss have with respect to the distributions $P$ and $Q$?
What is required to approximate the integral in the expectation formula using Monte Carlo methods?
What does the approximation $E_P(f(y)) \approx -\log Q_Y(x_i, \beta)$ suggest about the relationship between expectations and probabilities?
Study Notes
Machine Learning Basics
- A function $y = f(x)$ is approximated to produce an output $y$ from an input $x$.
- Example: image classification, where pixel values ($x$) are used to identify categories such as dog, cat, truck, or airplane ($y$).
- Example: tweet emotion recognition, where the text of a tweet ($x$) determines the associated emotion ($y$), such as fear, anger, joy, or sadness.
Classification vs. Regression
- Classification: the output ($y$) is discrete and represents distinct categories, for example image or emotion categories.
MNIST Classification
- Handwritten digit classification.
Classification for Skin Problems
- Image classification for identifying skin problems.
Classification vs. Regression
- Regression: the output ($y$) is a continuously varying quantity, typically a real number, for example a stock price or the severity of heart disease.
- Key difference between classification and regression: Measurement of error.
Ranking
- Recommender systems provide a list of recommended items where the order of items matters.
Linear Regression
- A $p$-dimensional feature vector $\mathbf{x} = (x_1, x_2, \ldots, x_p)$ and a scalar output $y$ are used to create a linear model: $y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \beta_0$.
- In vector form: $y = \boldsymbol{\beta}^\top \mathbf{x} + \beta_0$.
Linear Regression
- With $n$ data points $(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(n)}, y^{(n)})$, the same model is assumed to hold at every point: $y^{(i)} = \boldsymbol{\beta}^\top \mathbf{x}^{(i)} + \beta_0$.
- The goal of machine learning is to determine the parameters $\boldsymbol{\beta}$ and $\beta_0$.
One Small Tweak …
- Appending a constant 1 to $\mathbf{x}$ and absorbing $\beta_0$ into $\boldsymbol{\beta}$ allows the model to be written as $y^{(i)} = \boldsymbol{\beta}^\top \mathbf{x}^{(i)}$ (see the sketch below).
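A minimal NumPy sketch of this bias-absorption trick (the toy data are made up for illustration, not taken from the notes): append a constant 1 to each feature vector so that $\beta_0$ becomes just another entry of $\boldsymbol{\beta}$.

```python
import numpy as np

# Hypothetical toy data: n = 4 points with p = 2 features each.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])

# Append a constant 1 to every feature vector; beta_0 becomes just another weight.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

beta = np.array([0.5, -1.0, 2.0])   # (beta_1, beta_2, beta_0) stacked into one vector
y_pred = X_aug @ beta               # y(i) = beta^T x(i), intercept included
print(y_pred)
```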
Geometric Intuition
- The goal is to find the hyperplane that is closest to the observed data points.
- Key point: the fitted function is linear. A unit change in $x_i$ always changes $y$ by $\beta_i$, regardless of the values of $\mathbf{x}$ or $y$.
The Loss Function
- A loss function must be defined to measure how wrong the model is; for linear regression the standard choice is the sum of squared errors (derived from the probabilistic view below).
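As a sketch of that loss (the toy data and the choice of np.linalg.lstsq as the solver are illustrative assumptions, not something the notes prescribe):

```python
import numpy as np

def sse_loss(beta, X_aug, y):
    """Sum of squared errors between predictions X_aug @ beta and targets y."""
    residuals = y - X_aug @ beta
    return float(residuals @ residuals)

# Hypothetical toy data, with the constant-1 column already appended.
X_aug = np.array([[1.0, 2.0, 1.0],
                  [2.0, 0.5, 1.0],
                  [3.0, 1.5, 1.0],
                  [4.0, 3.0, 1.0]])
y = np.array([1.0, 3.0, 2.0, 5.0])

# np.linalg.lstsq finds the beta that minimizes the SSE loss.
beta_hat, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(beta_hat, sse_loss(beta_hat, X_aug, y))
```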
Probabilistic Perspective
- The model is parameterized by $\boldsymbol{\beta}$ and takes input $\mathbf{x}$, producing an output $f_{\boldsymbol{\beta}}(\mathbf{x})$.
- $f_{\boldsymbol{\beta}}(\mathbf{x})$ becomes the (input-dependent) mean $\mu$ of a Gaussian distribution with standard deviation 1 ($\sigma = 1$).
- The ground-truth label $y^{(i)}$ is drawn from this distribution: $y^{(i)} \sim \mathcal{N}\left(f_{\boldsymbol{\beta}}(\mathbf{x}^{(i)}), 1\right)$.
- Equivalently: $y^{(i)} = f_{\boldsymbol{\beta}}(\mathbf{x}^{(i)}) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, 1)$.
Maximum Likelihood for Gaussian
- Given data $y^{(1)}, \ldots, y^{(N)}$.
- The Gaussian likelihood is: $P(y^{(1)}, \ldots, y^{(N)} \mid \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \mu)^2}{2\sigma^2}\right)$
- Taking the log and dropping terms that do not depend on $\mu$: $\mu^* = \operatorname{argmax}_{\mu} \sum_{i=1}^{N} \log\left[\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \mu)^2}{2\sigma^2}\right)\right] = \operatorname{argmax}_{\mu} \sum_{i=1}^{N} -\frac{(y^{(i)} - \mu)^2}{2\sigma^2} = \operatorname{argmin}_{\mu} \sum_{i=1}^{N} \left(y^{(i)} - \mu\right)^2$
Plugging in …
- Plugging in the per-example mean $\mu^{(i)} = f_{\boldsymbol{\beta}}(\mathbf{x}^{(i)})$, i.e., the model's prediction $\hat{y}^{(i)}$:
- $\boldsymbol{\beta}^* = \operatorname{argmin}_{\boldsymbol{\beta}} \sum_{i=1}^{N} \left( y^{(i)} - f_{\boldsymbol{\beta}}(\mathbf{x}^{(i)}) \right)^2$
- Linear regression can therefore be understood as MLE under the assumption that the labels contain additive Gaussian noise.
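A small illustrative check (hypothetical labels and a simple grid search, chosen only for this sketch) that minimizing the Gaussian negative log-likelihood with $\sigma = 1$ picks the same $\mu$ as minimizing the squared error:

```python
import numpy as np

y = np.array([2.1, 1.9, 2.4, 2.0, 2.2])   # hypothetical labels

def neg_log_likelihood(mu, y, sigma=1.0):
    """Negative log-likelihood of y under a Gaussian N(mu, sigma^2)."""
    return np.sum(0.5 * np.log(2 * np.pi * sigma ** 2)
                  + (y - mu) ** 2 / (2 * sigma ** 2))

def sse(mu, y):
    """Sum of squared errors for a constant prediction mu."""
    return np.sum((y - mu) ** 2)

# Grid search over mu: the minimizer of the NLL is the same as the minimizer
# of the squared error, namely the sample mean.
mus = np.linspace(0.0, 4.0, 4001)
mu_mle = mus[np.argmin([neg_log_likelihood(m, y) for m in mus])]
mu_sse = mus[np.argmin([sse(m, y) for m in mus])]
print(mu_mle, mu_sse, y.mean())   # all three agree (up to grid resolution)
```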
Cross Entropy
- $L_{\text{XE}} = -\sum_{i=1}^{N} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$
- $H(P, Q) = E_P\left[-\log Q(X)\right]$
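A minimal sketch of the binary cross-entropy loss $L_{\text{XE}}$ above, with hypothetical labels and predicted probabilities (the eps clipping is an added numerical safeguard, not part of the formula):

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """L_XE = -sum_i [ y_i * log(y_hat_i) + (1 - y_i) * log(1 - y_hat_i) ]."""
    y_prob = np.clip(y_prob, eps, 1 - eps)   # guard against log(0)
    return float(-np.sum(y_true * np.log(y_prob)
                         + (1 - y_true) * np.log(1 - y_prob)))

y_true = np.array([1, 0, 1, 1])            # hypothetical ground-truth labels
y_prob = np.array([0.9, 0.2, 0.7, 0.6])    # hypothetical model outputs Q(Y=1 | x_i, beta)
print(binary_cross_entropy(y_true, y_prob))
```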
Monte Carlo Expectation
- $E_P[f(y)] = \int f(y)\, P(Y = y)\, dy$
- The integral can be approximated by drawing $y^{(1)}, \ldots, y^{(K)}$ from $P(Y)$: $E_P[f(y)] \approx \frac{1}{K} \sum_{i=1}^{K} f\left(y^{(i)}\right)$
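A short Monte Carlo sketch, assuming $f(y) = y^2$ and $P = \mathcal{N}(0, 1)$ purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_P[f(y)] for f(y) = y**2 with P = N(0, 1); the exact value is 1.
f = lambda y: y ** 2
samples = rng.standard_normal(100_000)   # y_1, ..., y_K drawn from P
estimate = np.mean(f(samples))           # (1/K) * sum_i f(y_i)
print(estimate)                          # close to 1
```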
Why the Name?
- Cross entropy: $H(P, Q) = E_P\left[-\log Q(X)\right]$
- $y^{(i)}$ is drawn from the unknown ground-truth distribution $P(Y \mid \mathbf{X})$.
- $\hat{y}^{(i)}$ is the probability $Q(Y = 1 \mid \mathbf{x}^{(i)}, \boldsymbol{\beta})$, so the term $-y^{(i)} \log \hat{y}^{(i)}$ equals $-\delta(y^{(i)} = 1) \log Q(Y = 1 \mid \mathbf{x}^{(i)}, \boldsymbol{\beta})$.
- $1 - \hat{y}^{(i)}$ is the probability $Q(Y = 0 \mid \mathbf{x}^{(i)}, \boldsymbol{\beta})$, so the term $-(1 - y^{(i)}) \log(1 - \hat{y}^{(i)})$ equals $-\delta(y^{(i)} = 0) \log Q(Y = 0 \mid \mathbf{x}^{(i)}, \boldsymbol{\beta})$.
- Summed over samples $y^{(i)} \sim P(Y \mid \mathbf{x}^{(i)})$, the loss is thus a Monte Carlo estimate of the cross entropy $H(P, Q)$, hence the name (see the snippet below).
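An illustrative sketch of this reading of the loss, with hypothetical labels and probabilities: each term selects $-\log Q(Y = y^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\beta})$, and averaging over the samples is a Monte Carlo estimate of $H(P, Q)$.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1])           # samples y_i from the unknown P(Y | x_i)
q_y1   = np.array([0.9, 0.2, 0.7, 0.6])   # hypothetical model probabilities Q(Y=1 | x_i, beta)

# Each summand picks out -log Q(Y = y_i | x_i, beta), depending on whether y_i is 1 or 0.
per_example = np.where(y_true == 1, -np.log(q_y1), -np.log(1 - q_y1))

# Averaging over samples drawn from P is a Monte Carlo estimate of E_P[-log Q] = H(P, Q).
print(per_example.mean())
```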
Information Theoretical Perspective
- The cross-entropy is related to the KL divergence: $H(P, Q) = -E_{P(x)}\left[\log Q(x)\right] = H(P) + KL(P \,\|\, Q)$
- Minimizing the loss therefore minimizes the divergence between the ground-truth distribution $P(y^{(i)} \mid \mathbf{x}^{(i)})$ and the estimated distribution $Q(y^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\beta})$.
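A quick numerical check of the identity $H(P, Q) = H(P) + KL(P \,\|\, Q)$ on two made-up discrete distributions:

```python
import numpy as np

# Two hypothetical discrete distributions over the same three outcomes.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

H_P   = -np.sum(P * np.log(P))        # entropy H(P)
H_PQ  = -np.sum(P * np.log(Q))        # cross entropy H(P, Q) = E_P[-log Q(X)]
KL_PQ =  np.sum(P * np.log(P / Q))    # KL divergence KL(P || Q)

print(np.isclose(H_PQ, H_P + KL_PQ))  # True: H(P, Q) = H(P) + KL(P || Q)
```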
2D Functions
- Loss surfaces in 2D can be visualized with contour diagrams / level sets: $L_a(f) = \{\mathbf{x} \mid f(\mathbf{x}) = a\}$.
- The gradient direction is the direction along which the function value changes the fastest (for a small change of $\mathbf{x}$ in Euclidean norm).
- Along a level set, the function value does not change.
- For a differentiable function $f(\mathbf{x})$, the gradient at any point is either zero or perpendicular to the level set at that point (checked numerically in the sketch below).
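A small numerical sketch (the quadratic $f$ is an arbitrary example chosen for this illustration) checking that the gradient is perpendicular to the level set and that the value does not change along it:

```python
import numpy as np

# f(x) = x1**2 + 2 * x2**2; its level sets are ellipses around the origin.
f = lambda x: x[0] ** 2 + 2 * x[1] ** 2
grad = lambda x: np.array([2 * x[0], 4 * x[1]])

x = np.array([1.0, 0.5])
g = grad(x)

# A tangent direction of the level set at x: rotate the gradient by 90 degrees.
tangent = np.array([-g[1], g[0]])
tangent /= np.linalg.norm(tangent)

eps = 1e-6
print(abs(f(x + eps * tangent) - f(x)))  # ~0: the value barely changes along the level set
print(g @ tangent)                       # 0: the gradient is perpendicular to the level set
```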
Gradient Descent on Convex Functions
- $\boldsymbol{\beta}_t = \boldsymbol{\beta}_{t-1} - \eta \left. \frac{\mathrm{d} f(x, \boldsymbol{\beta})}{\mathrm{d} \boldsymbol{\beta}} \right|_{\boldsymbol{\beta}_{t-1}}$
- The learning rate $\eta$ determines how much we move at each step. We cannot move too far because the gradient is only a local approximation of the function, so the learning rate is usually small.
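A minimal gradient-descent sketch on an assumed convex toy function $f(\beta) = (\beta - 3)^2$ (not from the notes), showing the update rule above:

```python
# Convex toy example: f(beta) = (beta - 3)**2, minimized at beta = 3.
def grad(beta):
    return 2 * (beta - 3.0)

eta = 0.1      # learning rate; too large a value overshoots and can diverge
beta = 0.0     # initial guess
for t in range(100):
    beta = beta - eta * grad(beta)   # beta_t = beta_{t-1} - eta * df/dbeta
print(beta)    # converges close to 3.0
```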
Description
This quiz covers fundamental concepts of machine learning, including the distinction between classification and regression. It explains practical applications such as image classification, tweet emotion recognition, and recommender systems. Test your understanding of these key topics in machine learning.