Machine Learning Concepts Quiz
47 Questions

Questions and Answers

What is a consequence of using a polynomial with a high degree in regression?

  • It will simplify the model.
  • It will always underfit the data.
  • It reduces the likelihood of overfitting.
  • It will fit the data exactly. (correct)
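
For intuition on the answer above: a degree-(n − 1) polynomial has n coefficients, so it can interpolate n training points exactly, noise and all. A minimal NumPy sketch (the data and degree here are illustrative):

```python
import numpy as np

# 6 noisy sample points; a degree-5 polynomial has 6 coefficients,
# so it can pass through all of them exactly.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 6)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

coeffs = np.polyfit(x, y, deg=5)         # exact interpolation
residuals = y - np.polyval(coeffs, x)
print(np.max(np.abs(residuals)))         # effectively zero: exact fit
```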

Which condition is necessary for the dual method of ridge regression to be recommended?

  • d > n (correct)
  • d < n
  • d = 0
  • d = n
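
Why d > n favors the dual: the primal solves a d × d linear system while the dual solves an n × n system, and the two give identical solutions (a standard matrix identity). A small NumPy check with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 20, 500, 0.1             # d > n: the dual system is much smaller
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Primal: solve a (d x d) system for the weight vector w.
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual: solve an (n x n) system for coefficients a, then w = X^T a.
a = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
w_dual = X.T @ a

print(np.allclose(w, w_dual))        # True: identical solutions
```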

What is a necessary feature of the kernel function k(x, z) = Φ(x) · Φ(z)?

  • It must be a linear function.
  • It must be positive semidefinite. (correct)
  • It must be negative definite.
  • It must be symmetric only.
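
Positive semidefiniteness means every Gram matrix the kernel produces is symmetric with non-negative eigenvalues. A quick numerical check, using a degree-3 polynomial kernel as an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))

# Gram matrix for the polynomial kernel k(x, z) = (x . z + 1)^3
K = (X @ X.T + 1.0) ** 3

eigvals = np.linalg.eigvalsh(K)      # K is symmetric, so eigvalsh applies
print(eigvals.min() >= -1e-8)        # True: eigenvalues are (numerically) >= 0
```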

What happens if the primal perceptron algorithm terminates?

  • The kernel perceptron algorithm might or might not terminate. (correct)

When using a matrix M for projection in the kernel perceptron, which of the following statements is true?

  • Raw data may be linearly separable, but projected data might not be. (correct)

Which type of splits does a decision tree perform?

  • Binary splits to maximize information gain. (correct)

What characterizes the kernel matrix K when using the feature map Φ(X) = XMᵀ?

  • It is always symmetric. (correct)

What is the implication of using a dual algorithm with a high-dimensional feature set?

  • It increases the risk of overfitting. (correct)

What is the risk of a classification rule r(X) when P(X) is not known?

  • The risk cannot be computed without the data distribution. (correct)

If P(X = x) changes but P(Y = y|X = x) remains the same, what can be concluded about r(X)?

  • r(X) will minimize the risk regardless of the change in P(X). (correct)

Under what conditions can LDA and QDA classifiers be considered identical?

  • When class means µ̂C and µ̂D are equal. (correct)
  • When the covariance matrices Σ̂C and Σ̂D are identical. (correct)

What statement is true regarding the posterior probability P(Y = C|X = x) if LDA and QDA classifiers are identical?

  • It is linear in terms of x. (correct)

What can be inferred about the QDA decision function if the covariance matrices are different?

  • The function becomes nonlinear. (correct)

Which kernel is used in dual ridge regression as described?

  • k(Xi, Xj) = (XiᵀXj + 1)^p (correct)

In dual ridge regression with the polynomial kernel, how does regularization (λ > 0) affect the model?

  • It reduces the risk of overfitting. (correct)
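
Combining the two answers above, a minimal dual (kernelized) ridge regression sketch with the polynomial kernel; the degree, λ, and data below are illustrative:

```python
import numpy as np

def poly_kernel(A, B, p=3):
    """Polynomial kernel (A_i . B_j + 1)^p for all pairs of rows."""
    return (A @ B.T + 1.0) ** p

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=40)

lam = 1e-2                                  # lambda > 0 shrinks the dual
K = poly_kernel(X, X)                       # weights and curbs overfitting
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)

X_test = np.linspace(-1, 1, 5).reshape(-1, 1)
y_pred = poly_kernel(X_test, X) @ alpha     # h(z) = sum_i alpha_i k(z, X_i)
print(y_pred)
```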

What can be concluded if Σ̂C = I (the identity matrix) and Σ̂D = 5I?

  • The classifiers will yield different decision boundaries. (correct)

What does a hard-margin SVM require about the data for it to create a decision boundary?

  • Data must be linearly separable. (correct)

In a soft-margin SVM, what role does the hyperparameter C play?

  • Balances between maximizing margin and minimizing classification error. (correct)
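
The trade-off C controls is explicit in the standard soft-margin objective (written here in the usual textbook form, with slack variables ξi):

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad y_i\,(w^\top X_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0.$$

A large C penalizes slack heavily (fewer training errors, narrower margin); a small C tolerates slack (wider margin, more misclassifications allowed).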

Which of the following characterizes the decision boundary learned by Linear Discriminant Analysis (LDA)?

  • It is always a linear function of the features. (correct)

What is the relationship between a kernel function and a feature map in kernel PCA?

  • The kernel represents inner products of the feature map's transformations. (correct)

Which statement is true regarding the margin in SVM classification?

  • Soft-margin SVM allows for wider margins than hard-margin SVM. (correct)

What is indicated by a nonzero eigenvalue in the context of principal component analysis?

  • The direction is significant and reflects variance in the data. (correct)

Kernel Principal Components Analysis primarily differs from standard PCA in what aspect?

  • It uses a feature map to enable non-linear transformations. (correct)

In the equation $X^TXw = \lambda w$, what does the symbol $\lambda$ represent?

  • A scalar eigenvalue associated with the eigenvector $w$. (correct)
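
A compact numerical illustration of this eigenvector equation as used in PCA (random data; X is centered first, per the usual convention):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.2])
X = X - X.mean(axis=0)                # center the data

# Solve X^T X w = lambda w; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
w1 = eigvecs[:, -1]                   # eigenvector with the largest eigenvalue
print(eigvals[::-1])                  # spread captured along each direction
print(w1)                             # first principal component direction
```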

What is the role of the number k in clustering algorithms?

  • It represents the number of clusters to form. (correct)

Which statement about complete linkage in clustering is accurate?

  • Complete linkage can work with any distance function. (correct)

How does single linkage clustering differ from complete linkage?

  • Single linkage is more sensitive to outliers than complete linkage. (correct)

What characterizes the Fiedler vector in spectral clustering?

  • It corresponds to the second smallest eigenvalue of the Laplacian matrix. (correct)

What does the relaxed optimization problem for partitioning a graph involve?

  • Minimizing the Rayleigh quotient of the Laplacian matrix and an indicator vector. (correct)

Which statement is true regarding the Laplacian matrix in spectral clustering?

  • The Laplacian matrix is never invertible because the all-ones vector 1 is always in the nullspace. (correct)
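
A small spectral-bisection sketch tying the last few answers together (the six-node graph below is made up for illustration):

```python
import numpy as np

# Adjacency matrix of two triangles joined by a single edge (nodes 2-3).
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W        # Laplacian: L = D - W

eigvals, eigvecs = np.linalg.eigh(L)
print(np.isclose(eigvals[0], 0.0))    # True: all-ones vector is in the nullspace
fiedler = eigvecs[:, 1]               # eigenvector of 2nd smallest eigenvalue
print(np.sign(fiedler))               # signs split the two triangles
```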

What is a feature of AdaBoost concerning decision trees?

  • The metalearner in AdaBoost gives a classification while estimating posterior probabilities. (correct)

Which statement is false regarding the application of AdaBoost?

  • AdaBoost is limited to specific types of classifiers. (correct)

What is the inner product representation of the matrix K?

  • K = Φ(X)Φ(X)ᵀ (correct)

Which of the following correctly represents the first principal component direction?

  • The vector w that maximizes wᵀΦ(X)ᵀΦ(X)w (correct)

What are the matrices B and C defined in the context of the generalized Rayleigh quotient?

  • B = K², C = K (correct)

What effect does the balance constraint have on the indicator vector y in the minimum bisection problem?

  • It requires that the sum of the elements equals zero. (correct)

Which of the following matrices represents the Laplacian matrix L for the given graph?

  • L = [[2, -1, 0, 0], [-1, 3, -1, -1], [0, -1, 3, -1], [0, 0, -1, 2]] (correct)

What is the strict binary constraint in the context of the minimum bisection problem?

  • Each element of y must be either 1 or -1. (correct)

Which expression is maximized in the generalized Rayleigh quotient formulation?

  • aᵀBa (correct)

What is the expression for ∇µ1 ℓ based on the provided content?

  • $\frac{1}{n}\tau_i(X_i - \mu_1)$ (correct)

Which of the following correctly represents ∇µ2 ℓ?

  • $\frac{1}{n}(1 - \tau_i)(X_i - \mu_2)$ (correct)

What is the objective of the k-means-like algorithm described?

  • To estimate parameters µ1, µ2, and θ using fixed τi values (correct)

What does the notation τi represent in the algorithm?

  • The membership indicator for data points (correct)

What parameter values must be initialized in the k-means-like algorithm?

  • τ1, τ2, ..., τn to arbitrary values (correct)

What is the relationship between the parameters µ1, µ2, and θ?

  • They can be derived from τi values (correct)

Which method is used to update Gaussian cluster parameters in the algorithm?

  • By fixing the τi and computing new means (correct)

What is the purpose of the repeated steps in the k-means-like algorithm?

  • To converge the τi values towards a maximum likelihood (correct)
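
A sketch of the alternating scheme these questions describe, for a 1-D mixture of two unit-variance Gaussians with mixing weight θ (the initialization, variances, and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])

tau = rng.uniform(size=X.size)        # step 0: arbitrary initial tau_i
for _ in range(50):
    # Step 1: with tau fixed, update theta, mu1, mu2 (maximum likelihood).
    theta = tau.mean()
    mu1 = (tau * X).sum() / tau.sum()
    mu2 = ((1 - tau) * X).sum() / (1 - tau).sum()
    # Step 2: with the parameters fixed, recompute the posteriors tau_i.
    p1 = theta * np.exp(-0.5 * (X - mu1) ** 2)
    p2 = (1 - theta) * np.exp(-0.5 * (X - mu2) ** 2)
    tau = p1 / (p1 + p2)

print(theta, mu1, mu2)                # converges near 0.5, -2, 3
```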

Flashcards

Dual Ridge Regression Overfitting

When n is very large, the dual algorithm is not more likely to overfit than the primal algorithm with degree-p polynomial features: as the next card notes, both compute the same solution.

Primal vs. Dual Ridge Regression

Both primal and dual ridge regression give the same solution, no matter the size of n.

Kernel Matrix Calculation

The kernel matrix is XMᵀMXᵀ, where XMᵀ = Φ(X) is the product of the design matrix X and the transpose of the matrix M.

Kernel vs. Primal Perceptron Termination

If the primal perceptron algorithm terminates, the kernel perceptron algorithm might or might not terminate: the feature map (for example, Φ(X) = XMᵀ) can destroy linear separability, so termination is not guaranteed.
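
For concreteness, a minimal dual (kernel) perceptron loop; it touches the data only through kernel evaluations, and with a linear kernel it mirrors the primal algorithm (the toy data and kernel below are illustrative):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, max_epochs=100):
    """Dual perceptron: a[i] counts the mistakes made on sample i."""
    n = len(y)
    K = kernel(X, X)
    a = np.zeros(n)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            # The prediction uses only kernel values, never Phi(x) itself.
            if y[i] * np.sign((a * y) @ K[:, i]) <= 0:
                a[i] += 1
                mistakes += 1
        if mistakes == 0:             # a full clean pass: terminate
            break
    return a

X = np.array([[1.0, 1], [2, 2], [-1, -1], [-2, -1]])
y = np.array([1, 1, -1, -1])
print(kernel_perceptron(X, y, lambda A, B: A @ B.T))
```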

Decision Tree Splitting

The decision tree algorithm chooses binary splits to maximize information gain at each node.

Binary Decision Tree Splits

Decision tree splits are binary, meaning they split the data into two branches.

Decision Tree Feature Types

Decision trees can handle both categorical and numerical features.

Decision Tree Overfitting

Decision trees can be prone to overfitting, especially when the tree is too deep.

∀y ∈ {0, 1, 2}, ∃x : r(x) = y

For every class label y in the set {0, 1, 2}, there exists some input x to which the classifier r assigns label y; that is, r attains every possible class label.

∀x, r(x) is a class y that maximizes the posterior probability P(Y = y|X = x)

For every input x, the classifier r(x) returns a class label y that maximizes the posterior probability P(Y = y|X = x), i.e., the label that is most likely given the input.

If we don’t have access to the underlying data distribution P(X) or P(Y|X), we cannot exactly compute the risk of r(·)

The risk of a classifier cannot be computed exactly without knowledge of the underlying data distributions P(X) and P(Y|X); the risk is the expected error under exactly those distributions.

If P(X = x) changes but P(Y = y|X = x) remains the same for all x, y, r(X) still minimizes the risk

The classifier r(X) still minimizes the risk even if the distribution of inputs P(X) changes, as long as the conditional probability of class labels given inputs (P(Y = y|X = x)) remains the same for all inputs and classes.

If µ̂C = µ̂D and π̂C = π̂D , then the LDA and QDA classifiers are identical

If the means (µ̂C, µ̂D) and prior probabilities (π̂C, π̂D) of two classes (C, D) are equal in Gaussian Discriminant Analysis (GDA), then the Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) classifiers are identical.

If Σ̂C = Σ̂D , π̂C = 1/6, and π̂D = 5/6, then the LDA and QDA classifiers are identical

If the covariance matrices of two classes (C, D) in Gaussian Discriminant Analysis (GDA) are equal Σ̂C = Σ̂D, and the prior probabilities are π̂C = 1/6 and π̂D = 5/6, the LDA and QDA classifiers are again identical.

If Σ̂C = I (the identity matrix) and Σ̂D = 5I, then the LDA and QDA classifiers are identical

If the covariance matrix of class C is the identity matrix (I) and the covariance matrix of class D is 5I, the LDA and QDA classifiers are not identical: with unequal covariances the quadratic terms in QDA do not cancel, so the two yield different decision boundaries.

If the LDA and QDA classifiers are identical, then the posterior probability P(Y = C|X = x) is linear in x

If LDA and QDA classifiers give identical results, the posterior probability P(Y = C|X = x) is a logistic function of the input (x), not a linear function.

Kij

The inner product of row i of Φ(X) and row j of Φ(X), i.e., Kij = Φ(Xi)ᵀΦ(Xj).

First Principal Component Direction

A vector that maximizes the Rayleigh quotient, which is a measure of how much the data is spread along the direction of the vector.

Generalized Rayleigh Quotient

A ratio of quadratic forms, aᵀBa / aᵀCa, whose maximization generalizes the Rayleigh quotient; in kernel PCA, B = K² and C = K.

Kernel Matrix (K)

The matrix that holds the inner products of all pairs of data points in the original dataset, calculated using the kernel function.

Laplacian Matrix (L)

A matrix that represents the connections between nodes in a graph.

Indicator Vector (y)

A vector that assigns each node of the graph to one of two clusters.

Minimum Bisection

The problem of dividing a graph into two equal-sized clusters while cutting as few edges as possible.

Risk Minimization

The minimization of a function that measures the error of a classifier based on the data distribution.

Principal Component Direction

Every direction that maximizes the variance of the projected data is an eigenvector of the covariance matrix, computed as XᵀX for centered data.

Kernel PCA Applicability

Kernel PCA can only be applied to optimization problems where the solution can be expressed as a linear combination of the data points.

Kernel Matrix Definition

The kernel matrix K is an n × n matrix where each element Kij is the dot product of the feature maps of the ith and jth data points: Kij = Φ(Xi)ᵀΦ(Xj).

Kernel Matrix and Feature Matrix

The kernel matrix K can also be expressed as the product of the feature matrix Φ(X) with its transpose: K = Φ(X)Φ(X)ᵀ.

Kernel Function

The kernel function k(x, z) = Φ(x)ᵀΦ(z) allows calculating similarities between data points in the feature space without explicitly computing the feature maps Φ(x) and Φ(z).

Kernel Matrix Symmetry

The kernel matrix K is a symmetric matrix because the kernel function k(x, z) is symmetric: k(x, z) = k(z, x).

Hard vs. Soft Margin SVM

A hard-margin SVM aims to find a decision boundary that perfectly separates the data points without any errors, while a soft-margin SVM allows for some misclassifications to accommodate noisy data.

SVM Margin

The margin of an SVM is the region around the decision boundary bounded by the training points nearest to it; its width is the distance from the boundary to those points. SVMs aim to maximize the margin to improve generalization.

What does the Fiedler vector represent in spectral clustering?

The smallest eigenvalue of the Laplacian matrix corresponds to the constant (all-ones) vector. The second smallest eigenvalue (the Fiedler value) corresponds to the Fiedler vector, which gives the relaxed solution for splitting the graph into two balanced clusters.

What is the Laplacian matrix?

The Laplacian matrix is a matrix that captures the connections between nodes in a graph. It is always singular (0 is an eigenvalue) because the all-ones vector lies in its nullspace.

Complete linkage clustering

Complete linkage uses the maximum distance between any two elements from the respective clusters as the inter-cluster distance. This leads to tighter, more compact clusters.

What is agglomerative clustering?

Agglomerative clustering is a bottom-up approach, starting with individual data points and merging clusters until a desired number of clusters is reached.

Why can single linkage be susceptible to outliers?

Single linkage can be highly sensitive to outliers as it uses the minimum distance between points in different clusters. Outliers can easily create long, skinny clusters that might not reflect the true underlying structure of the data.
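
A short SciPy comparison of the two linkage rules on synthetic blobs joined by a single bridging point (data and parameters are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(6)
blobs = np.vstack([rng.normal(0, 0.5, (20, 2)),
                   rng.normal(5, 0.5, (20, 2)),
                   [[2.5, 2.5]]])       # one point bridging the two blobs

for method in ("single", "complete"):
    Z = linkage(blobs, method=method)   # agglomerative merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, np.bincount(labels)[1:])   # cluster sizes
```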

What is AdaBoost?

AdaBoost is a powerful ensemble learning algorithm that combines multiple weak learners (e.g., decision trees) to build a strong predictor. It's used for binary classification.

How does AdaBoost handle misclassified data?

AdaBoost assigns weights to each training sample, putting more emphasis on misclassified data points. This helps the algorithm focus on challenging examples and improve its accuracy.
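
The reweighting can be made precise; in the standard AdaBoost formulation, after weak learner $h_t$ with weighted error $\varepsilon_t$:

$$\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}, \qquad w_i \leftarrow \frac{w_i\,\exp(-\alpha_t\, y_i\, h_t(X_i))}{Z_t},$$

where $Z_t$ normalizes the weights to sum to 1. Misclassified points ($y_i h_t(X_i) = -1$) have their weights multiplied by $e^{\alpha_t} > 1$, so the next weak learner concentrates on them.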

Can AdaBoost be used with decision trees?

AdaBoost can be used with decision trees, where each tree is a weak learner. The combined decision from these trees then determines the final classification.

What is the expression for the gradient of the log-likelihood of θ?

The gradient of the log-likelihood of the mixture parameter θ (the probability of belonging to the first cluster) is expressed as a simple function of θ. This function is independent of the cluster means (µ1, µ2) and only depends on the posterior probabilities (τi) assigned to each data point.

How is the gradient of µ1 expressed?

The gradient of the log-likelihood of the first cluster mean (µ1) is a weighted sum of differences between data points and the mean, where the weights are the posterior probabilities (τi) of belonging to the first cluster.

How is the gradient of µ2 expressed?

The gradient of the log-likelihood of the second cluster mean (µ2) is similar to the gradient of µ1, but it uses the complementary probabilities (1 - τi) for each data point.

What is the k-means-like algorithm based on?

The algorithm iteratively updates the cluster parameters (means and mixing probability) and the posterior probabilities using the gradients derived from the log-likelihood function. This repetition continues until convergence, resulting in optimal cluster assignments and parameters.

How does the k-means-like algorithm start?

The algorithm starts by arbitrarily assigning initial values for the posterior probabilities (τi) associated with each data point.

What happens in the first step of the k-means like algorithm?

For fixed posterior probabilities (τi), the algorithm updates the cluster parameters (µ1, µ2, and θ) by maximizing the log-likelihood. The update rules are based on the gradients derived for µ1, µ2, and θ.

What happens in the second step of the k-means like algorithm?

Using the updated cluster parameters (µ1, µ2, and θ), the algorithm re-estimates the posterior probabilities (τi) for each data point. This step essentially revisits the cluster assignment for each data point based on the new cluster definitions.

What is the overall purpose of the k-means-like algorithm?

The k-means-like algorithm uses the gradients of the log-likelihood to refine the cluster parameters (µ1, µ2, and θ) and then re-evaluate the posterior probabilities (τi). This iterative process continues until convergence, resulting in the optimal cluster assignments and the corresponding parameters.

Study Notes

Exam Instructions

  • Do not open the exam before instructed.
  • Electronic devices (phones, iPods, headphones, laptops) are prohibited.
  • Ensure all 12 pages and 6 questions are present.
  • Write initials at top right of each page after the first.
  • Exam is closed-book, closed-notes, two cheat sheets allowed.
  • Exam duration: 3 hours.
  • Answer on exam paper only.
  • Total points: 150.
  • Multiple choice questions (26): 3 points each.
  • Written questions (5): 72 points total.
  • Multiple answer questions: all correct choices must be marked.
  • No partial credit for multiple answer questions.

Exam Details

  • Exam: Introduction to Machine Learning, Spring 2019 final.
  • Students should bring two cheat sheets.
  • A total of 150 points is available.

Description

Test your understanding of critical concepts in machine learning, including regression techniques, kernel functions, and decision trees. This quiz covers theoretical aspects and practical implications of using various algorithms in different scenarios.
