Probability Theory, Loss Functions and Notation


Questions and Answers

Which of the following is the correct interpretation of $P(X = x)$ in probability notation?

  • A probability distribution. (correct)
  • A realization of a random variable.
  • The support of a probability density function.
  • A joint probability.

In the context of probability for machine learning, what does supp(p(x)) = {x ∈ X; p(x) ≠ 0} represent?

  • The variance of the random variable X.
  • The domain on which the probability density function (PDF) is defined. (correct)
  • The cumulative distribution function.
  • The expected value of the random variable X.

Given two random variables, X and Y, which formula correctly expresses their covariance, Cov[X, Y]?

  • E[X] - E[Y]
  • E[X + Y] - E[X]E[Y]
  • E[XY] - E[X] + E[Y]
  • E[XY] - E[X]E[Y] (correct)

If events A and B are not mutually exclusive, which of the following formulas should be used to calculate $P(A \cup B)$?

  • $P(A) + P(B) - P(A \cap B)$ (correct)

Which of the following equations correctly represents Bayes' Theorem?

  • $P(A|B) = P(A)P(B|A) / P(B)$ (correct)

According to Bayes' Theorem, what do you need to determine $P(A|B)$ if you know $P(B|A)$?

  • P(A) and P(B). (correct)

In machine learning, what is the primary reason for using an objective function?

  • To have an objective measure of the quality of fit of a model. (correct)

What is another term used for a loss function when minimizing in the context of machine learning?

  • Cost function. (correct)

When deriving the sum of squared residuals, what is the significance of assuming that the noise $\epsilon_i$ follows a normal distribution $N(0, \sigma^2)$?

  • It leads to a likelihood function that, when maximized, is equivalent to minimizing the sum of squared residuals. (correct)

What is the primary purpose of taking the logarithm of the likelihood function in the context of linear regression?

  • To simplify calculations by converting products into sums and to prevent underflow. (correct)

After obtaining the log-likelihood and multiplying it by -2, what statistical measure is approximated in the context of linear regression?

  • The chi-squared statistic. (correct)

What key assumption is made about the data when transitioning from the chi-squared statistic to the sum of squared residuals in linear regression?

  • The data must be homoscedastic. (correct)

After deriving a general expression for a loss function from a probability, what crucial step is often necessary to convert a quantity to be maximized into one to be minimized?

  • Multiply the expression by -1. (correct)

In the context of defining loss functions, why is it important to start with a statement about the probability distribution you are working with?

  • It relates to the process you think generated your data. (correct)

Which of the following is the formula for Mean Absolute Error (MAE)?

  • $MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$ (correct)

Which of the following is the formula for Root Mean Squared Error (RMSE)?

  • $RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$ (correct)

Which of these objective functions does not necessarily relate to a probability?

  • Precision score (correct)

In the context of model evaluation, why might one be interested in $P(M|D)$ rather than $P(D|M)$?

  • $P(M|D)$ represents the probability of the model being true given the observed data, which is often more directly relevant for model selection. (correct)

What does $P(D|M)$ represent when using the likelihood to derive the sum of squared residuals?

  • The probability of the data given the model. (correct)

How can Bayes' theorem be used to find an expression for what we really want ($P(M|D)$)?

  • $P(M|D) = P(D|M)P(M) / P(D)$ (correct)

When using just likelihood, what key assumption is made?

  • That all models are equally likely. (correct)

In the context of penalized loss, what is $P(M)$ typically called?

  • The prior. (correct)

What is the purpose of including prior information in penalized loss functions?

  • To influence model selection based on prior beliefs or knowledge. (correct)

Which of the following describes what a Ridge Regression loss function aims to do?

  • It penalizes the complexity of the model, preventing overfitting. (correct)

In the context of penalized loss functions like Ridge Regression, what is the effect of adding a penalty term to the loss function?

  • It encourages simpler models, often improving generalization performance. (correct)

Flashcards

Random Variable

A variable whose value is a numerical outcome of a random phenomenon.

Probability Distribution

A function that assigns probabilities to the possible values (or ranges of values) of a random variable.

Probability Density Function (PDF)

A function that describes the relative likelihood for a random variable to take on a given value.

Support (of a PDF)

The set of values for which the probability density function is non-zero.

Joint Probability

The probability of two events occurring together.

Conditional Probability

The probability of an event occurring given that another event has already occurred.

Cumulative Distribution Function (CDF)

A function giving the probability that a random variable is less than or equal to a certain value.

Survival Function

The probability that a random variable exceeds a given value: SF(x) = 1 − CDF(x).

Expected Value

The weighted average of all possible values, with the weights being the probabilities of those values.

Variance

A measure of how much the random variable differs from its expected value.

Covariance

Measure of how much two random variables change together.

P(X AND Y)

Probability of both events occurring

P(X OR Y)

Probability of either event occurring

Bayes' Theorem

A formula that describes how to update the probabilities of hypotheses when given evidence.

Loss Function

A function that quantifies the cost associated with prediction.

Objective Function

A function to be minimized or maximized to obtain the best model parameters.

Log-likelihood

The logarithm of the likelihood of a dataset given a model, turning products of probabilities into sums.

Chi-squared statistic

A measure used to determine the goodness-of-fit between observed and expected values.

Homoscedasticity

A condition where the variances of error terms are constant across all levels of the independent variables.

Mean Squared Error (MSE)

The average of the squared differences between the predicted and actual values.

Mean Absolute Error (MAE)

The average of the absolute differences between the predicted and actual values.

P(M) - Prior

Prior probability of a model.

Penalized Loss

Adding a penalty term to the loss function to prevent overfitting.

Ridge Regression

A technique used to penalize large coefficients.

Data Generating Process

This relates to the process that you think generated your data

Study Notes

  • Presented by Vandana Das, with slides based on work by R. de Souza and M. Gordovskyy.

Probability Theory Review

  • Basic probability concepts cover the definition of probability and its applications.
  • Simple probability algebra and Bayes' theorem are covered.

Loss Functions

  • The material considers the definition of a loss function.
  • It explores the relationship between loss functions and probability.
  • It looks at how loss functions are used in machine learning.
  • Considers what types of loss functions exist.

Probability Notation

  • A random variable is denoted as X.
  • A realization of a random variable is denoted as x.
  • P(X = x) represents a probability distribution.
  • p(X = x) represents a probability density (or mass) function.
  • The support of p(x), denoted supp(p(x)) = {x ∈ X; p(x) ≠ 0}, is the set of values on which the PDF is non-zero, i.e. the domain over which it is effectively defined.
  • Joint probability is notated as p(x, y).
  • Conditional probability is notated as p(x|y).

More Concepts

  • A cumulative distribution function (CDF) is defined: $CDF(x) = \int_{-\infty}^{x} p(t)\,dt$.
  • For any distribution: $CDF(\infty) = \int_{-\infty}^{\infty} p(x)\,dx = 1$.
  • A survival function is defined: $SF(x) = 1 - CDF(x)$.
  • Expected value is: $E[x] = \int x\,p(x)\,dx$.
  • Variance is: $Var[x] = \int x^2 p(x)\,dx - E[x]^2 = E[x^2] - E[x]^2$.
  • Covariance is: $Cov[X, Y] = E[XY] - E[X]E[Y]$.
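These expectation identities can be checked numerically. A minimal sketch using NumPy, with illustrative sample data (the distributions and seed here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)
y = 0.5 * x + rng.normal(size=100_000)

# Var[x] = E[x^2] - E[x]^2
var_manual = np.mean(x**2) - np.mean(x) ** 2
# Cov[X, Y] = E[XY] - E[X]E[Y]
cov_manual = np.mean(x * y) - np.mean(x) * np.mean(y)

# Both agree with NumPy's built-in population estimators.
print(np.isclose(var_manual, np.var(x)))                      # True
print(np.isclose(cov_manual, np.cov(x, y, bias=True)[0, 1]))  # True
```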

Combining Probabilities

  • P(X AND Y) = P(X ∩ Y) = P(X, Y) = P(X)P(Y) (only if X and Y are independent)
  • P(X OR Y) = P(X ∪ Y) = P(X) + P(Y) – P(X AND Y)
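These rules can be verified by brute-force enumeration. A small sketch with two fair dice; the events A and B are invented for illustration:

```python
from itertools import product
from fractions import Fraction

# Enumerate all 36 outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))
A = {o for o in outcomes if o[0] % 2 == 0}  # first die is even
B = {o for o in outcomes if sum(o) > 8}     # sum exceeds 8

P = lambda s: Fraction(len(s), len(outcomes))

# P(X OR Y) = P(X) + P(Y) - P(X AND Y), since A and B overlap.
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A), P(B), P(A & B), P(A | B))  # 1/2 5/18 1/6 11/18
```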

Conditional Probability

  • P(A ∩ B) = P(A, B) = P(A|B)P(B)
  • The relationship is reversible because P(A,B) = P(B,A)
  • P(A|B)P(B) = P(B|A)P(A)
  • This is Bayes' Theorem.

Bayes' Theorem

  • Bayes' Theorem is a useful tool.
  • Bayes' Theorem is described as "The chain rule of probability".
  • Bayes' Theorem is said to allow extraction of a different conditional distribution from one that is known.
  • If P(B|A) is known, but P(A|B) needs to be found, Bayes' theorem can be used.
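As a worked example of that last point (the numbers below are invented for illustration), suppose P(B|A) is known and P(A|B) is wanted:

```python
# Hypothetical numbers: P(B|A) is known, P(A|B) is wanted.
p_A = 0.01             # prior P(A): e.g. 1% of items are defective
p_B_given_A = 0.95     # P(B|A): a test flags 95% of defective items
p_B_given_notA = 0.05  # P(B|not A): false-positive rate

# Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))  # 0.161
```

Note that knowing P(B|A) alone is not enough: both P(A) and P(B) are needed, exactly as the quiz answer states.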

Measuring Model Performance

  • The material concerns "fitting" a model.
  • How to determine what is a good fit.
  • Objective measures of quality are desired.
  • An "objective function" gives a number for quality.

Objective Functions

  • Linear regression is revisited, with reference to the sum of squared residuals.
  • The quantity being minimized (the sum of squares) is called a loss function or cost function.

Origins of Sum of Squared Residuals

  • The aim should be to measure how good a model is at predicting data.
  • This can be framed as a statement about probability.
  • The question becomes: how likely is the model to produce the data?
  • P(D|M) is probability of the data, given the model is true.

Formulae relating to Noise

  • If the data has noise $\epsilon_i \sim N(0, \sigma^2)$, then $P(y_i|M) \sim N(m x_i + c, \sigma^2)$.
  • An equation is then given to define this: $P(y_i|M) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(y_i - (m x_i + c))^2 / 2\sigma^2}$

Equations for all Data

  • Calculation using all the data (assuming independent data points) involves the product $P(D|M) = P(y_1|M) \times P(y_2|M) \times \dots$, i.e. $P(D|M) = \prod_{i=1}^{N} P(y_i|M)$

  • $P(D|M) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(y_i - (m x_i + c))^2 / 2\sigma^2}$

Working with the Log

  • The material turns to a log-based approach.

  • $\ln P(D|M) = -\sum_{i=1}^{N} \frac{(y_i - (m x_i + c))^2}{2\sigma^2} + \sum_{i=1}^{N} \ln \frac{1}{\sqrt{2\pi\sigma^2}}$

  • Constants are dropped and the equation re-arranged:

  • $\ln P(D|M) = -\frac{1}{2} \sum_{i=1}^{N} \left( \frac{y_i - (m x_i + c)}{\sigma} \right)^2$

  • This is the "log-likelihood" statistic.

MSE

  • This is closer to the MSE, but still different. What do we need to do?

  • Step 1: multiply by -2: $-2 \ln P(D|M) = \sum_{i=1}^{N} \left( \frac{y_i - (m x_i + c)}{\sigma} \right)^2$

  • This is the chi-squared statistic.

Homoscedasticity

  • Step 2: assume σ is the same for all data points and drop it.

  • We are then assuming the data is homoscedastic.

  • $-2 \ln P(D|M) \propto \sum_{i=1}^{N} (y_i - (m x_i + c))^2$

  • This leads to the equation for the sum of squared residuals!
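The equivalence can be checked numerically: with constant σ, the −2 log-likelihood is just the sum of squared residuals scaled by 1/σ², so both are minimized by the same (m, c). A minimal sketch on synthetic data, using a grid search purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
x = np.linspace(0, 1, 50)
y = 2.0 * x + 1.0 + rng.normal(0, sigma, size=x.size)  # true m=2, c=1

def neg2_log_lik(m, c):
    # -2 ln P(D|M), constants dropped: the chi-squared statistic.
    return np.sum(((y - (m * x + c)) / sigma) ** 2)

def ssr(m, c):
    # Sum of squared residuals: chi-squared with sigma dropped.
    return np.sum((y - (m * x + c)) ** 2)

# Scaling by the positive factor 1/sigma^2 preserves the ordering,
# so both objectives pick the same parameters from the grid.
grid = [(m, c) for m in np.linspace(1, 3, 41) for c in np.linspace(0, 2, 41)]
best_nll = min(grid, key=lambda p: neg2_log_lik(*p))
best_ssr = min(grid, key=lambda p: ssr(*p))
print(best_nll == best_ssr)  # True: same minimizer
```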

Defining Loss Functions

  • The text addresses defining a loss

    • Start from a statement about the probability distribution you are working with
      • which relates to the process that you think generated your data
    • Try to find a simple expression for that probability that you can work with
      • it can be something proportional to it
  • If necessary, multiply by -1 so that maximizing the probability becomes minimizing a loss.

  • The derivation could have been stopped at different points depending on the properties of the data.

  • Consider what would change if the assumptions (e.g. Gaussian noise, homoscedasticity) were different.

Common Loss Functions

  • Mean Squared Error (MSE) is defined as $MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$.
  • Mean Absolute Error (MAE) is defined as $MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$.
  • Root Mean Squared Error (RMSE) is defined as $RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$.
  • Cross-entropy is defined as $-\sum_{i=1}^{N} \sum_{c=0}^{K-1} y_{i,c} \log P(Y = c|X_i)$.
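The three regression losses above are one-liners in NumPy. A minimal sketch with a small made-up example:

```python
import numpy as np

def mse(y, y_hat):
    # Mean Squared Error: average of squared residuals.
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    # Mean Absolute Error: average of absolute residuals.
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    # Root Mean Squared Error: square root of the MSE.
    return np.sqrt(mse(y, y_hat))

# Illustrative values; residuals are -0.5, 0, 1, -1.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.0, 5.0])
print(mse(y, y_hat), mae(y, y_hat), rmse(y, y_hat))  # 0.5625 0.625 0.75
```

Note the trade-off: MSE (and RMSE) penalize large residuals quadratically, while MAE weights all residuals linearly and is therefore less sensitive to outliers.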

Objective Functions

  • There are many other objective functions around, including:
    • predictive accuracy
    • precision score
    • AUC ROC
    • F1 score
    • R² score
  • These objective functions do not necessarily relate to probability.

Likelihood

  • The sum of squared residuals was derived using the likelihood P(D|M).
    • This is the probability of the data being generated by the model.
  • Is this really what we want?
  • Is the real target the probability that the model is the true one, given the data?
    • i.e., P(M|D)
  • How do we determine this?

Bayes' Theorem in Action

  • Bayes' theorem is exploited because P(D|M)P(M) = P(M|D)P(D).
  • This is rearranged to give P(M|D) = P(D|M)P(M) / P(D).
  • Using just the likelihood implicitly assumes that all models are equally likely.

Penalized Loss

  • P(M) is called the prior.
  • This can be used to include prior information, or to penalize some parameter.
  • Ridge regression loss is given as $RR = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \sum_j \beta_j^2$.
  • What does this loss function do?
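What the loss does can be seen directly: the λ term adds a cost for large coefficients on top of the data-fit term, so models with big weights are penalized even when they fit the data equally well. A sketch, with illustrative numbers:

```python
import numpy as np

def ridge_loss(y, y_hat, beta, lam):
    # RR = (1/N) * sum((y_i - y_hat_i)^2) + lambda * sum(beta_j^2)
    return np.mean((y - y_hat) ** 2) + lam * np.sum(beta ** 2)

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.9, 3.2])       # same predictions in both cases
small = np.array([0.5, 0.5])            # modest coefficients
large = np.array([3.0, -3.0])           # large coefficients

# Identical fit to the data, but the larger coefficients cost more.
print(ridge_loss(y, y_hat, small, lam=0.1) < ridge_loss(y, y_hat, large, lam=0.1))  # True
# With lambda = 0 the penalty vanishes and the loss reduces to the MSE.
print(ridge_loss(y, y_hat, small, lam=0.0) == np.mean((y - y_hat) ** 2))  # True
```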
