Supervised Learning

Questions and Answers

What is the primary purpose of specifying a hypothesis class in machine learning?

  • To ensure the learning algorithm converges quickly.
  • To encode assumptions about the type of problem being learned. (correct)
  • To simplify the optimization process.
  • To guarantee a perfect fit to the training data.

According to the No Free Lunch Theorem, what must every successful machine learning algorithm do?

  • Achieve zero training error.
  • Make assumptions about the data. (correct)
  • Minimize computational complexity.
  • Use a universally optimal hypothesis class.

In the context of machine learning, what does the loss function evaluate?

  • The complexity of the hypothesis class.
  • The size of the training dataset.
  • The performance of a hypothesis on the training data. (correct)
  • The computational cost of the learning algorithm.

Why is the zero-one loss function often unsuitable for guiding optimization procedures?

Answer: It is non-differentiable and non-continuous.

What is a key characteristic of the squared loss function that influences the learning process?

Answer: It penalizes large mispredictions more heavily.

In a regression setting where the label $y$ given an input $x$ is distributed according to $P(y|x)$, what prediction minimizes the squared loss?

Answer: The expected value of $P(y|x)$.

When is the absolute loss function more suitable than the squared loss function?

Answer: When the data is noisy and contains outliers.

In a regression setting where the label $y$ given an input $x$ is distributed according to $P(y|x)$, what prediction minimizes the absolute loss?

Answer: The median of $P(y|x)$.

What is the primary goal when splitting data into training, validation, and test sets?

Answer: To evaluate the model's performance on unseen data and simulate a real-world scenario.

Why should you avoid splitting data alphabetically or by feature values when creating training and test sets?

Answer: It can introduce bias and unrealistic evaluation scenarios.

What is the most common assumption made by machine learning algorithms about the function they are approximating?

Answer: The function is locally smooth.

Given a training dataset $D=\{(\mathbf{x}_1, y_1),\dots,(\mathbf{x}_n, y_n)\}$, what does $\mathbf{x}_i$ represent?

Answer: The $i$-th input instance.

In the context of supervised learning, what is the ultimate goal of learning a function $h$?

Answer: To ensure that $h(\mathbf{x}) = y$ with high probability for new pairs $(\mathbf{x}, y)$ drawn from $\mathcal{P}$.

Which of the following is a critical consideration when selecting a hypothesis class $\mathcal{H}$?

Answer: The choice of $\mathcal{H}$ depends on the data and encodes assumptions about the dataset/distribution $\mathcal{P}$.

What does it mean for a loss function to be normalized by the total number of training samples, $n$?

Answer: The output can be interpreted as the average loss per sample and is independent of $n$.

Given the zero-one loss function $\mathcal{L}_{0/1}(h)=\frac{1}{n}\sum_{i=1}^n\delta_{h(\mathbf{x}_i)\ne y_i}$, what does the term $\delta_{h(\mathbf{x}_i)\ne y_i}$ represent?

Answer: An indicator function that is 1 if the prediction is incorrect and 0 otherwise.

What is the effect of squaring the difference $(h(\mathbf{x}_i) - y_i)$ in the squared loss function?

Answer: It ensures the loss is always non-negative and penalizes large mispredictions more heavily.

Why is it important to split train/test data temporally (predicting the future from the past) when training an email spam filter?

Answer: To simulate the real-life scenario where future emails are classified based on past data.

What is the primary issue with a 'memorizer' hypothesis function $h(x)$ that simply recalls training data?

Answer: It suffers from overfitting and performs poorly on samples not in the training data.

Given a loss function $\mathcal{L}(h)$ and a hypothesis class $\mathcal{H}$, what does the expression $h = \textrm{argmin}_{h\in{\mathcal{H}}}\mathcal{L}(h)$ represent?

Answer: The function $h$ that minimizes the loss within the hypothesis class.

Consider a scenario where $|h(\mathbf{x}_i)-y_i| = 0.001$ when using the squared loss function. What is a likely outcome?

Answer: The squared loss will be tiny ($0.000001$) and may not be fully corrected during training.

What does the training error refer to?

Answer: The fraction of misclassified training samples.

You are building a machine learning model and find that your model has a very low loss on the training data but performs poorly on the test data. What is this issue called?

Answer: Overfitting.

Which of the following is NOT a typical consideration when choosing a loss function?

Answer: The computational complexity of the model.

If you have a dataset with many outliers, which loss function is the better choice?

Answer: Absolute loss.

What is generally the first step in approaching a supervised machine learning problem?

Answer: Selecting an appropriate hypothesis class.

What can be inferred if a loss function evaluates to zero?

Answer: Perfect predictions were made.

Which of the following is a downside of using squared loss?

Answer: When a prediction is very close to correct, the squared loss becomes tiny and the example receives little attention during training.

What is a validation set primarily used for?

Answer: Guiding model selection and hyperparameter tuning.

Why is splitting data alphabetically or by feature values problematic?

Answer: It can introduce unintended biases.

What is the purpose of ensuring the test set simulates a 'real test scenario'?

Answer: To accurately represent real-world model performance.

In the context of machine learning, what is the consequence of ignoring the 'no free lunch' principle?

Answer: The model performs poorly in most settings.

Why might temporal splitting be essential when creating an email spam filter?

Answer: To strictly extrapolate future behavior from past data.

What key idea does the concept of Occam's razor translate to in machine learning?

Answer: To select the simplest model that adequately fits the data.

If a classifier shows 0% error on the training data but performs terribly with new samples, what issue is present?

Answer: Overfitting.

In the equation $\mathcal{L}_{abs}(h)=\frac{1}{n}\sum_{i=1}^n|h(\mathbf{x}_i)-y_i|$, what does $|h(\mathbf{x}_i)-y_i|$ represent?

Answer: The absolute difference between the predicted and actual values.

Which data characteristic makes absolute loss functions particularly useful?

Answer: Datasets with outliers or noise.

What is the advantage of using the mean to make predictions?

Answer: It minimizes the squared loss.

Flashcards

x
Input instance in supervised learning.

y
Label of the input instance in supervised learning.

D
The entire set of training data in supervised learning, consisting of pairs of inputs and labels.

P(X, Y)
Unknown distribution from which data points (x, y) are drawn.

h
A function learned to predict y from x.

Hypothesis Class
The set of possible functions that the learning algorithm can choose from.

No Free Lunch Theorem
States that every successful ML algorithm must make assumptions about the data.

Loss Function
Evaluates a hypothesis on training data, indicating how bad it is.

Zero-One Loss
Counts how many mistakes a hypothesis makes on the training set.

Training Error
Fraction of misclassified training samples.

Squared Loss
Calculates the squared difference between predicted and actual values.

Absolute Loss
Calculates the absolute difference between predicted and actual values.

Memorizer Function
Function that returns known y values for x inputs found in training data, otherwise returns 0. Prone to overfitting.

Overfitting
Occurs when a model fits the training data too well, performing poorly on new data.

Temporal Split
Splitting data so future data is predicted from the past.

Random Split
Splitting data uniformly at random.

Locally Smooth Assumption
The assumption that the function to be approximated is locally smooth.

Study Notes

  • Supervised machine learning uses training data in pairs of inputs $(\mathbf{x}, y)$, where $\mathbf{x}\in\mathcal{R}^d$ is the d-dimensional input instance and $y$ is its label.
  • Training data is denoted as $D=\{(\mathbf{x}_1, y_1),\dots,(\mathbf{x}_n, y_n)\}\subseteq \mathcal{R}^d\times \mathcal{C}$.
  • The data points $(\mathbf{x}_i, y_i)$ are drawn from some unknown distribution $\mathcal{P}(X, Y)$.
  • The goal is to learn a function $h$ such that for a new pair $(\mathbf{x}, y)\sim \mathcal{P}$, we have $h(\mathbf{x})=y$ with high probability (or $h(\mathbf{x})\approx y$).
  • Before finding a function $h$, the type of function, such as an artificial neural network or a decision tree, must be specified.
  • The set of possible functions is called the hypothesis class, encoding assumptions about the problem being solved.
  • The No Free Lunch Theorem states that every successful ML algorithm must make assumptions, implying no single algorithm works for every setting.
  • There are two steps in learning a hypothesis function $h(\cdot)$ (see the sketch after this list):
    • Select an appropriate machine learning algorithm, defining the hypothesis class $\mathcal{H}$.
    • Find the best function within this class, $h\in\mathcal{H}$, which often involves optimization.
  • The learning process involves finding a function $h$ within the hypothesis class that makes the fewest mistakes on the training data, often choosing the "simplest" function.
  • A loss function evaluates a hypothesis $h\in\mathcal{H}$ on training data, indicating how bad it is; a higher loss means worse performance, and zero loss signifies perfect predictions.
  • Loss is commonly normalized by the number of training samples, $n$, to represent the average loss per sample, independent of $n$.
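
A minimal sketch of these two steps in Python, assuming a linear hypothesis class and the squared loss; the dataset, class, and function names are illustrative choices, not taken from the lesson:

```python
import numpy as np

# Step 1: choose a hypothesis class H. Here we assume the class of linear
# functions h(x) = w^T x + b (our illustrative choice).

# Step 2: find the best h in H by minimizing the average squared loss.
# For linear least squares, this minimization has a closed-form solution.
def fit_linear(X, y):
    """Return (w, b) minimizing (1/n) * sum_i (w^T x_i + b - y_i)^2."""
    n = X.shape[0]
    X_aug = np.hstack([X, np.ones((n, 1))])       # append a bias column
    params, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return params[:-1], params[-1]                # weights w, bias b

# Tiny synthetic dataset: y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.normal(size=100)
w, b = fit_linear(X, y)
print(w, b)  # should be close to [2.] and 1.
```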

Loss Functions

  • The zero-one loss counts the number of mistakes a hypothesis function $h$ makes on the training set.
  • It assigns a loss of 1 for mispredicted examples and 0 for correct predictions.
  • The normalized zero-one loss returns the fraction of misclassified training samples, also known as the training error.
  • The zero-one loss is used to evaluate classifiers in multi-class/binary classification, but it is not useful for guiding optimization because it is non-differentiable and non-continuous.
  • Formally, the zero-one loss is: $$\mathcal{L}_{0/1}(h)=\frac{1}{n}\sum_{i=1}^n\delta_{h(\mathbf{x}_i)\ne y_i}, \text{ where } \delta_{h(\mathbf{x}_i)\ne y_i}=\begin{cases} 1, & \text{if } h(\mathbf{x}_i)\ne y_i,\\ 0, & \text{otherwise.} \end{cases}$$
  • The squared loss function is used in regression settings and calculates the loss as $\left(h(\mathbf{x}_i)-y_i\right)^2$.
  • Squaring ensures the loss is non-negative and grows quadratically with the absolute misprediction.
  • This encourages predictions to avoid being too far off, but it can give little attention to predictions that are very close to correct.
  • If the label $y$ is probabilistic according to $P(y|\mathbf{x})$, the optimal prediction to minimize the squared loss is the expected value, $h(\mathbf{x})=\mathbf{E}_{P(y|\mathbf{x})}[y]$. Formally, the squared loss is: $$\mathcal{L}_{sq}(h)=\frac{1}{n}\sum_{i=1}^n(h(\mathbf{x}_i)-y_i)^2.$$
  • The absolute loss function is also used in regression, with penalty $|h(\mathbf{x}_i)-y_i|$.
  • It grows linearly with mispredictions, making it more suitable for noisy data.
  • If $y$ is probabilistic according to $P(y|\mathbf{x})$, the optimal prediction to minimize the absolute loss is the median value, $h(\mathbf{x})=\textrm{MEDIAN}_{P(y|\mathbf{x})}[y]$.
  • Formally, the absolute loss is (see the sketch after this list): $$\mathcal{L}_{abs}(h)=\frac{1}{n}\sum_{i=1}^n|h(\mathbf{x}_i)-y_i|.$$
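
A minimal sketch of these three losses, plus a numerical check that the mean minimizes the squared loss while the median minimizes the absolute loss; the function names and the skewed distribution are illustrative assumptions:

```python
import numpy as np

# Average losses over n training samples.
def zero_one_loss(preds, labels):
    return np.mean(preds != labels)            # fraction of mistakes (training error)

def squared_loss(preds, labels):
    return np.mean((preds - labels) ** 2)

def absolute_loss(preds, labels):
    return np.mean(np.abs(preds - labels))

# Draw labels from a skewed distribution P(y|x), so that mean != median.
rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=100_000)
for guess in (np.mean(y), np.median(y)):
    print(f"guess={guess:.3f}  sq={squared_loss(guess, y):.3f}  "
          f"abs={absolute_loss(guess, y):.3f}")
# The mean achieves the lower squared loss; the median the lower absolute loss.
```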

Minimizing Loss

  • Given a loss function, the goal is to find the function $h$ that minimizes the loss: $h=\textrm{argmin}_{h\in\mathcal{H}}\mathcal{L}(h)$.
  • Machine learning focuses on how to perform this minimization efficiently.
  • A function $h(\cdot)$ with low loss on data $D$ may not generalize well to examples not in $D$, leading to overfitting.
  • An example of overfitting is a "memorizer" function (see the sketch after this list): $$h(\mathbf{x})=\begin{cases} y_i, & \text{if } \exists (\mathbf{x}_i, y_i)\in D \text{ s.t. } \mathbf{x}=\mathbf{x}_i,\\ 0, & \text{otherwise.} \end{cases}$$
  • With the function above, you get $0\%$ error on the training data $D$, but it does horribly with samples not in $D$.
  • Splitting data into train, validation, and test sets must be done carefully.
  • The test set should simulate a real-world test scenario, predicting the future from the past when a temporal component exists.
  • If no temporal component exists, it is best to split uniformly at random, avoiding splitting alphabetically or by feature values.
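
A minimal sketch of the memorizer, evaluated with a uniform random split to expose the overfitting gap; the class name, data, and split sizes are illustrative assumptions:

```python
import numpy as np

# A "memorizer" hypothesis, as defined above: return the stored label for
# inputs seen during training, and 0 otherwise.
class Memorizer:
    def fit(self, X, y):
        self.table = {tuple(x): label for x, label in zip(X, y)}
        return self

    def predict(self, X):
        return np.array([self.table.get(tuple(x), 0) for x in X])

# A uniform random train/test split, assuming no temporal component.
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(200, 2)).astype(float)
y = (X[:, 0] + X[:, 1] > 100).astype(int)      # a simple deterministic label
perm = rng.permutation(len(X))
train, test = perm[:150], perm[150:]

h = Memorizer().fit(X[train], y[train])
train_err = np.mean(h.predict(X[train]) != y[train])
test_err = np.mean(h.predict(X[test]) != y[test])
print(train_err, test_err)  # 0.0 on training data; near 0.5 (chance) on unseen data
```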

Assumptions

  • Every ML algorithm must make assumptions to choose a hypothesis class $\mathcal{H}$; this choice depends on the data and encodes assumptions about the dataset/distribution $\mathcal{P}$.
  • There is no one perfect $\mathcal{H}$ for all problems.
  • Determining the value of $y$ for a given $\mathbf{x}$ is impossible without assumptions.
  • The most common assumption of ML algorithms is that the function to be approximated is locally smooth (see the sketch after this list).
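
One way to see how the local-smoothness assumption is exploited, using a 1-nearest-neighbor predictor as an illustrative hypothesis class (our choice, not prescribed by the lesson):

```python
import numpy as np

# If the target function is locally smooth, nearby inputs carry similar labels,
# so copying the label of the nearest training point is a sensible prediction.
def nn_predict(X_train, y_train, x):
    """1-nearest-neighbor prediction for a single query point x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(dists)]

# Smooth target: y = sin(x). The nearest neighbor's label is close to the truth.
X_train = np.linspace(0, 2 * np.pi, 50).reshape(-1, 1)
y_train = np.sin(X_train[:, 0])
print(nn_predict(X_train, y_train, np.array([1.0])), np.sin(1.0))
```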
