Questions and Answers
Which of the following is the correct interpretation of $P(X = x)$ in probability notation?
- A probability distribution. (correct)
- A realization of a random variable.
- The support of a probability density function.
- A joint probability.
In the context of probability for machine learning, what does supp(p(x)) = {x ∈ X; p(x) ≠ 0} represent?
- The variance of the random variable X.
- The domain on which the probability density function (PDF) is defined. (correct)
- The cumulative distribution function.
- The expected value of the random variable X.
Given two random variables, X and Y, which formula correctly expresses their covariance, Cov[X, Y]?
- E[X] - E[Y]
- E[X + Y] - E[X]E[Y]
- E[XY] - E[X] + E[Y]
- E[XY] - E[X]E[Y] (correct)
If events A and B are not mutually exclusive, which of the following formulas should be used to calculate $P(A \cup B)$?
Which of the following equations correctly represents Bayes' Theorem?
According to Bayes' Theorem, what do you need to determine $P(A|B)$ if you know $P(B|A)$?
In machine learning, what is the primary reason for using an objective function?
What is another term used for a loss function when minimizing in the context of machine learning?
When deriving the sum of squared residuals, what is the significance of assuming that the noise $\epsilon_i$ follows a normal distribution $N(0, \sigma^2)$?
What is the primary purpose of taking the logarithm of the likelihood function in the context of linear regression?
After obtaining the log-likelihood and multiplying it by -2, what statistical measure is approximated in the context of linear regression?
What key assumption is made about the data when transitioning from the chi-squared statistic to the sum of squared residuals in linear regression?
After deriving a general approach to defining a loss function, what crucial step is often necessary to maximize it?
In the context of defining loss functions, why is it important to start with a statement about the probability distribution you are working with?
Which of the following is the formula for Mean Absolute Error (MAE)?
Which of the following is the formula for Root Mean Squared Error (RMSE)?
Which of these objective functions do not necessarily relate to a probability?
In the context of model evaluation, why might one be interested in $P(M|D)$ rather than $P(D|M)$?
What does $P(D|M)$ represent when using the likelihood to derive the sum of squared residuals?
How can Bayes' theorem be used to find an expression for what we really want ($P(M|D)$)?
When using just likelihood, what key assumption is made?
In the context of penalized loss, what is $P(M)$ typically called?
What is the purpose of including prior information in penalized loss functions?
Which of the following describes what a Ridge Regression loss function aims to do?
In the context of penalized loss functions like Ridge Regression, what is the effect of adding a penalty term to the loss function?
Flashcards
Random Variable
A variable whose value is a numerical outcome of a random phenomenon.
Probability Distribution
A function giving the probability of each possible value (or range of values) of a random variable.
Probability Density Function (PDF)
A function that describes the relative likelihood for a random variable to take on a given value.
Support (of a PDF)
The set supp(p(x)) = {x ∈ X; p(x) ≠ 0}, the domain on which the PDF is defined.
Joint Probability
The probability of two random variables taking values together, written p(x, y).
Conditional Probability
The probability of one random variable given the value of another, written p(x|y).
Cumulative Distribution Function (CDF)
CDF(x) = ∫_{-∞}^{x} p(x') dx', the probability that the random variable takes a value less than or equal to x.
Survival Function
SF(x) = 1 − CDF(x), the probability that the random variable takes a value greater than x.
Expected Value
E[x] = ∫ x p(x) dx, the probability-weighted average value of a random variable.
Variance
Var[x] = E[x²] − E[x]², a measure of spread about the expected value.
Covariance
Cov[X, Y] = E[XY] − E[X]E[Y], a measure of how two random variables vary together.
P(X AND Y)
P(X ∩ Y) = P(X, Y); equal to P(X)P(Y) when X and Y are independent.
P(X OR Y)
P(X ∪ Y) = P(X) + P(Y) − P(X AND Y).
Bayes' Theorem
P(A|B)P(B) = P(B|A)P(A), so P(A|B) = P(B|A)P(A)/P(B); lets a conditional distribution be obtained from its reverse.
Loss Function
An objective function that is minimized to measure how poorly a model fits the data.
Objective Function
A function that assigns a single number to the quality of a model fit.
Log-likelihood
ln P(D|M), the logarithm of the probability of the data given the model.
Chi-squared statistic
−2 ln P(D|M) = Σ ((y_i − (m x_i + c))/σ)² for Gaussian noise.
Homoscedasticity
The assumption that the noise level σ is the same for all data points.
Mean Squared Error (MSE)
MSE = (1/N) Σ (y_i − ŷ_i)².
Mean Absolute Error (MAE)
MAE = (1/N) Σ |y_i − ŷ_i|.
P(M) - Prior
The probability of the model before seeing the data; a way to include prior information.
Penalized Loss
A loss function with an added penalty term, corresponding to a non-uniform prior P(M).
Ridge Regression
Linear regression with loss (1/N) Σ (y_i − ŷ_i)² + λ Σ β_j², which penalizes large coefficients.
Data Generating Process
The process assumed to have produced the observed data, described by a probability distribution.
Study Notes
- Presented by Vandana Das, with slides based on work by R. de Souza and M. Gordovskyy.
Probability Theory Review
- Basic probability concepts cover the definition of probability and its applications.
- Simple probability algebra and Bayes' theorem are covered.
Loss Functions
- The material considers the definition of a loss function.
- It explores the relationship between loss functions and probability.
- It looks at how loss functions are used in machine learning.
- It considers what types of loss functions exist.
Probability Notation
- A random variable is denoted as X.
- A realization of a random variable is denoted as x.
- P(X = x) represents a probability distribution.
- p(X = x) represents a probability density (or mass) function.
- The support of p(x), denoted supp(p(x)) = {x ∈ X; p(x) ≠ 0}, is the domain upon which the PDF is defined.
- Joint probability is notated as p(x, y).
- Conditional probability is notated as p(x|y).
More Concepts
- A cumulative distribution function (CDF) is defined: CDF(x) = ∫_{-∞}^{x} p(x') dx'.
- For any distribution: CDF(∞) = ∫_{-∞}^{∞} p(x) dx = 1.
- A survival function is defined: SF(x) = 1 − CDF(x).
- Expected value is: E[x] = ∫ x p(x) dx.
- Variance is: Var[x] = ∫ x² p(x) dx − E[x]², i.e. Var[x] = E[x²] − E[x]².
- Covariance is: Cov[X,Y] = E[XY] – E[X]E[Y].
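To make these definitions concrete, here is a minimal NumPy sketch (my own illustration, not from the slides; the sample sizes and distributions are arbitrary) that estimates E[x], Var[x], and Cov[X, Y] from random samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical samples: y depends linearly on x plus noise,
# so Cov[X, Y] should be clearly non-zero.
x = rng.normal(loc=2.0, scale=1.0, size=10_000)
y = 3.0 * x + rng.normal(scale=0.5, size=10_000)

e_x = np.mean(x)                            # E[x] ≈ ∫ x p(x) dx
var_x = np.mean(x**2) - e_x**2              # Var[x] = E[x²] − E[x]²
cov_xy = np.mean(x * y) - e_x * np.mean(y)  # Cov[X,Y] = E[XY] − E[X]E[Y]

print(e_x, var_x, cov_xy)                   # ≈ 2.0, ≈ 1.0, ≈ 3.0
```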
Combining Probabilities
- P(X AND Y) = P(X ∩ Y) = P(X, Y), which equals P(X)P(Y) when X and Y are independent.
- P(X OR Y) = P(X ∪ Y) = P(X) + P(Y) – P(X AND Y)
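As a quick sanity check of these rules (an invented example, not from the slides), take two independent fair dice and the event that at least one shows a six; the union rule can be verified by brute-force enumeration.

```python
from fractions import Fraction

p_x = Fraction(1, 6)        # P(first die shows 6)
p_y = Fraction(1, 6)        # P(second die shows 6)
p_and = p_x * p_y           # independence: P(X AND Y) = P(X)P(Y)
p_or = p_x + p_y - p_and    # P(X OR Y) = P(X) + P(Y) − P(X AND Y)

# Brute-force check over all 36 equally likely outcomes.
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]
p_or_enum = Fraction(sum(1 for a, b in outcomes if a == 6 or b == 6), 36)

assert p_or == p_or_enum == Fraction(11, 36)
print(p_or)   # 11/36
```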
Conditional Probability
- P(A ∩ B) = P(A, B) = P(A|B)P(B)
- The relationship is reversible because P(A,B) = P(B,A)
- P(A|B)P(B) = P(B|A)P(A)
- This is Bayes' Theorem.
Bayes' Theorem
- Bayes' Theorem is a useful tool.
- Bayes' Theorem is described as "The chain rule of probability".
- Bayes' Theorem allows a different conditional distribution to be extracted from one that is known.
- If P(B|A) is known, but P(A|B) needs to be found, Bayes' theorem can be used.
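The "reversal" that Bayes' theorem provides can be shown in a few lines of Python; the diagnostic-test numbers below are hypothetical and chosen only for illustration.

```python
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """P(A|B) = P(B|A) P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical example: A = "has condition", B = "test is positive".
p_a = 0.01                      # prior P(A)
p_b_given_a = 0.95              # sensitivity, P(B|A)
p_b_given_not_a = 0.05          # false-positive rate, P(B|¬A)

# P(B) by the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

print(bayes(p_b_given_a, p_a, p_b))   # ≈ 0.161: P(A|B) differs a lot from P(B|A)
```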
Measuring Model Performance
- The material concerns "fitting" a model.
- It asks how to determine what counts as a good fit.
- Objective measures of quality are desired.
- An "objective function" gives a number for quality.
Objective Functions
- Linear regression is revisited, with reference to the sum of squared residuals.
- The quantity being minimized, the sum of squares, is referred to as a loss or a cost.
Origins of Sum of Squared Residuals
- The aim should be to measure how good a model is at predicting data.
- This can be derived from a statement about probability.
- The question is how likely the model is to produce the data.
- P(D|M) is probability of the data, given the model is true.
Formulae relating to Noise
- If the data have noise ε_i ~ N(0, σ²), then P(y_i|M) ~ N(0, σ²) + (m x_i + c), or equivalently P(y_i|M) ~ N(m x_i + c, σ²).
- The defining equation is: P(y_i|M) = (1/√(2πσ²)) exp(−(y_i − (m x_i + c))² / (2σ²)).
Equations for all Data
- The calculation using all the data is P(D|M) = P(y_1|M) × P(y_2|M) × …, i.e. P(D|M) = ∏_{i=1}^{N} P(y_i|M).
- Substituting the Gaussian gives P(D|M) = ∏_{i=1}^{N} (1/√(2πσ²)) exp(−(y_i − (m x_i + c))² / (2σ²)).
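A minimal sketch of this product likelihood, assuming an invented toy dataset and a straight-line model y = m x + c (all numbers are for illustration only).

```python
import numpy as np

# Toy data generated from y = 2x + 1 with Gaussian noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 5, 20)
sigma = 0.5
y = 2.0 * x + 1.0 + rng.normal(scale=sigma, size=x.size)

def likelihood(m: float, c: float) -> float:
    """P(D|M) = prod_i N(y_i; m*x_i + c, sigma^2)."""
    resid = y - (m * x + c)
    per_point = np.exp(-resid**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return np.prod(per_point)

print(likelihood(2.0, 1.0))   # near-true parameters -> larger likelihood
print(likelihood(1.0, 0.0))   # poor parameters -> tiny (may underflow to 0.0)
```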
Working with the Log
- The material turns to a log-based approach.
- ln P(D|M) = Σ_{i=1}^{N} [ −(y_i − (m x_i + c))² / (2σ²) + ln(1/√(2πσ²)) ].
- The constant term is dropped and the equation is rearranged:
- ln P(D|M) = −½ Σ_{i=1}^{N} ((y_i − (m x_i + c)) / σ)².
- This is the "log-likelihood" statistic.
MSE
- This is closer to the MSE but still different; what do we need to do?
- Step 1: multiply by −2, giving −2 ln P(D|M) = Σ_{i=1}^{N} ((y_i − (m x_i + c)) / σ)².
- This is the chi-squared statistic.
Homoscedasticity
- Step 2: assume σ is the same for all data and drop it.
- We are then assuming the data is homoscedastic.
- −2 ln P(D|M) ∝ Σ_{i=1}^{N} (y_i − (m x_i + c))².
- This leads to an equation for the sum of squared residuals!
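The chain from log-likelihood to chi-squared to the sum of squared residuals can be checked numerically. The sketch below (my own, on invented toy linear data with constant σ) verifies that −2 ln P(D|M), with constants dropped, equals the chi-squared statistic and is proportional to the sum of squared residuals.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 5, 20)
sigma = 0.5                                  # same σ for every point (homoscedastic)
y = 2.0 * x + 1.0 + rng.normal(scale=sigma, size=x.size)

m, c = 2.0, 1.0
resid = y - (m * x + c)

log_like = -0.5 * np.sum((resid / sigma) ** 2)   # constants already dropped
chi2 = np.sum((resid / sigma) ** 2)              # −2 × log-likelihood
ssr = np.sum(resid ** 2)                         # sum of squared residuals

assert np.isclose(-2 * log_like, chi2)
assert np.isclose(chi2, ssr / sigma ** 2)        # proportional when σ is constant
print(log_like, chi2, ssr)
```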
Defining Loss Functions
- The text addresses how to define a loss:
- Start from a statement about the probability distribution you are working with, which relates to the process that you think generated your data.
- Try to find a simple expression for that probability that you can work with; it can be something proportional to it.
- If necessary, multiply by −1 so you can maximize.
- The derivation could have been stopped at different points depending on the properties of the data.
- A question is asked about what would happen if things were changed.
Common Loss Functions
- Mean Squared Error (MSE) is defined as MSE = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)².
- Mean Absolute Error (MAE) is defined as MAE = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i|.
- Root Mean Squared Error (RMSE) is defined as RMSE = √((1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²).
- Cross-entropy is defined as −Σ_{i=1}^{N} Σ_{c=0}^{K−1} y_{i,c} log P(Y = c|X_i).
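Minimal NumPy versions of these losses (my own sketch, not from the slides; `y_onehot` and `probs` are names I have chosen for one-hot targets and predicted class probabilities).

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))

def cross_entropy(y_onehot, probs, eps=1e-12):
    # −Σ_i Σ_c y_{i,c} log P(Y = c | X_i); eps guards against log(0)
    return -np.sum(y_onehot * np.log(probs + eps))

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.9, 3.2])
print(mse(y, y_hat), mae(y, y_hat), rmse(y, y_hat))
```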
Objective Functions
- There are many other objective functions around, including:
- predictive accuracy
- precision score
- AUC ROC
- F1 score
- R² score
- These objective functions do not necessarily relate to probability.
Likelihood
- The sum of squared residuals was derived using the likelihood P(D|M).
- This is the probability of the data being generated by the model.
- Is this really what we want?
- Is the real target the probability that the model is the true one, given the data?
- i.e., P(M|D)
- How do we determine this?
Bayes' Theorem in Action
- Bayes' theorem is exploited because P(D|M)P(M) = P(M|D)P(D).
- This is rearranged to give P(M|D) = P(D|M)P(M) / P(D).
- It is asked whether using just the likelihood amounts to assuming all models are equally likely.
Penalized Loss
- P(M) is called the prior.
- This can be used to include prior information, or to penalize some parameter.
- Ridge regression loss is given as RR = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² + λ Σ_j β_j² (a short sketch of this loss appears below).
- What does this loss function do?
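A small sketch of this penalized loss, assuming NumPy arrays, invented toy data, and a hypothetical helper `ridge_loss` that evaluates the loss for a given coefficient vector.

```python
import numpy as np

def ridge_loss(X, y, beta, lam):
    """(1/N) Σ (y_i − ŷ_i)² + λ Σ_j β_j², with ŷ = X @ beta."""
    resid = y - X @ beta
    return np.mean(resid ** 2) + lam * np.sum(beta ** 2)

# Toy illustration: the penalty term grows with λ, so larger λ
# favours models with smaller coefficients.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=50)

for lam in (0.0, 1.0, 10.0):
    print(lam, ridge_loss(X, y, true_beta, lam))
```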