Support Vector Machines (SVM)

Questions and Answers

In Support Vector Machines (SVM), what fundamentally defines a 'hyperplane' in a p-dimensional space?

A curved surface that optimally separates the data points.
The margin that maximizes the separation between classes.
A flat affine subspace of dimension _p_.
A flat affine subspace of dimension _p_ - 1. (correct)

What is the primary limitation of using 'hard margin classification' with Support Vector Machines (SVM)?

It is computationally intensive and requires significant resources.
It requires a large amount of labeled data to achieve good performance.
It is highly sensitive to outliers and only works if data is perfectly linearly separable. (correct)
It tends to overfit the training data, leading to poor generalization.

In the context of Support Vector Machines (SVM), what is the significance of the tuning parameter 'C'?

It determines the width of the margin in the SVM model.
It bounds the sum of slack variables, influencing the tolerance for observations violating the margin. (correct)
It sets the polynomial degree in polynomial kernel SVM.
It controls the kernel type used in the SVM model.

What characterizes observations that are known as 'support vectors' in the context of Support Vector Machines (SVM)?

Observations that lie directly on the margin or on the wrong side of the margin for their class.

Which of the following kernel options in SVM would be most appropriate when there's no clear understanding of the data distribution?

Radial Basis Function (Gaussian) kernel

How does the Radial Basis Function (RBF) kernel handle observations that are far away from each other in the feature space?

It decreases their impact on computing the decision boundary due to the larger Euclidean distance.
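
As a rough illustration of why distant observations matter less, here is a minimal sketch of the RBF kernel value for a near pair and a far pair of points. The function name `rbf_kernel` and the value `gamma=1.0` are assumptions chosen for this example, not part of the lesson.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2); gamma=1.0 is an illustrative choice."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# A nearby pair yields a kernel value close to 1 ...
near = rbf_kernel([0.0, 0.0], [0.1, 0.1])
# ... while a distant pair yields a value close to 0, so it barely
# contributes to the decision function.
far = rbf_kernel([0.0, 0.0], [3.0, 3.0])
```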

Principal Component Analysis (PCA) is an unsupervised method primarily used for what?

Finding a low-dimensional representation of a dataset while retaining as much variance as possible.

What does PCA aim to achieve when projecting observations in a p-dimensional space?

To project observations onto the direction (vector) along which they have the largest variance.

In K-means clustering, what is the primary objective of the algorithm?

To partition data into K clusters by minimizing intra-cluster variance.

What does the K-means algorithm minimize when within-cluster variation is measured with squared Euclidean distance?

The sum of all pairwise squared Euclidean distances between the observations in the _k_th cluster, divided by the total number of observations in the _k_th cluster, summed over all K clusters.
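
The pairwise form above equals twice the sum of squared distances from each observation to its cluster centroid, which is why K-means can work with centroids. A small sketch checking this for one cluster; the function names and the sample points are made up for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pairwise_variation(cluster):
    """Sum of squared distances over all ordered pairs, divided by cluster size."""
    total = sum(sq_dist(a, b) for a in cluster for b in cluster)
    return total / len(cluster)

def centroid_variation(cluster):
    """Twice the sum of squared distances from each point to the centroid."""
    p = len(cluster[0])
    centroid = [sum(pt[j] for pt in cluster) / len(cluster) for j in range(p)]
    return 2 * sum(sq_dist(pt, centroid) for pt in cluster)

cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]  # illustrative data
```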

What is the Silhouette Score used for in the context of K-means clustering?

To determine the optimal number of clusters by measuring the quality of clustering.

Which of the following statements regarding the K-means algorithm is correct?

Results depend on the initial cluster assignment of each observation in Step 1, because the algorithm finds a local rather than a global optimum.

What is the purpose of the `minPts` parameter in the DBSCAN clustering algorithm?

It sets the minimum number of neighboring points within a specified radius for a point to be considered a core point.

Which of the following is an advantage of the DBSCAN algorithm compared to K-means clustering?

DBSCAN automatically determines the number of clusters.

In DBSCAN, what differentiates a 'border point' from a 'core point'?

Border points are within the epsilon distance of a core point but do not have enough neighbors to be considered core points.
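
The core / border / noise distinction can be sketched in a few lines. This is not a full DBSCAN implementation (it does not build clusters), and it assumes a point counts itself among its own neighbors; the function name and the sample points are hypothetical.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' (no cluster building)."""
    def neighbors(i):
        # Assumption: a point is included in its own eps-neighborhood.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    core = {i for i in range(len(points)) if len(neighbors(i)) >= min_pts}
    labels = []
    for i in range(len(points)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors(i)):
            labels.append("border")  # near a core point, but too few neighbors
        else:
            labels.append("noise")   # not within eps of any core point
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (5, 5)]  # illustrative data
labels = classify_points(points, eps=1.0, min_pts=3)
```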

What does interpreting a dendrogram allow one to determine?

How similar observations are to each other.

What is a key difference between agglomerative and divisive hierarchical clustering approaches?

Divisive clustering starts with all data points in one cluster, while agglomerative clustering begins with each data point as its own cluster.

In the context of hierarchical clustering, what does the term 'single linkage' refer to?

The minimum distance between points in two clusters.
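
A minimal sketch of single linkage, with complete linkage (the maximum distance) shown for contrast. The function names and sample clusters are assumptions for illustration.

```python
import math

def single_linkage(cluster_a, cluster_b):
    """Smallest distance between any point in A and any point in B."""
    return min(math.dist(a, b) for a in cluster_a for b in cluster_b)

def complete_linkage(cluster_a, cluster_b):
    """Largest distance between any point in A and any point in B."""
    return max(math.dist(a, b) for a in cluster_a for b in cluster_b)

cluster_a = [(0, 0), (1, 0)]  # illustrative data
cluster_b = [(3, 0), (6, 0)]
```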

Which distance metric is commonly used and is most suitable for continuous variable data for clustering?

Euclidean distance

What is the primary role of activation functions in neural networks?

To introduce non-linearity, control output values, and enable gradient propagation.

Which of the following is a characteristic of the Sigmoid activation function?

It can suffer from the vanishing gradient problem due to saturated neurons.
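
To see what saturation means in numbers, this sketch evaluates the sigmoid's derivative at a small and at a large input. The function names are chosen for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

grad_at_zero = sigmoid_grad(0.0)    # the derivative's maximum, 0.25
grad_saturated = sigmoid_grad(10.0) # tiny: the neuron is saturated
```

Because backpropagation multiplies such derivatives layer by layer, several saturated neurons in a row can shrink the gradient toward zero.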

What is a key advantage of using the ReLU (Rectified Linear Unit) activation function?

It avoids vanishing gradient issues for positive inputs and allows faster convergence.

Which activation function is most suitable for the output layer of multi-class classification models?

Softmax
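
A minimal softmax sketch, showing that the outputs form a probability distribution over the classes. Subtracting the maximum logit is a common numerical-stability trick added here; the input values are made up.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # illustrative scores for three classes
```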

Which of the following describes the 'vanishing gradient' problem?

The loss function gradients become extremely small, preventing weights from updating in early layers.

What is the primary difference between a perceptron and a multi-layer perceptron (MLP)?

An MLP has at least one hidden layer, allowing it to learn non-linear boundaries.

What is an 'epoch' in the training of neural networks?

One complete pass through the entire training dataset.

During the training of a Deep Neural Network (DNN), what is the immediate result of the 'forward-pass'?

Computation of the predictions based on the inputs.

Which loss function is most appropriate for a binary classification problem?

Binary Cross-Entropy
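
A small sketch of binary cross-entropy averaged over a batch. The clipping constant `eps` is an assumption added to avoid `log(0)`; the labels and probabilities are illustrative.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all observations."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

good = binary_cross_entropy([1, 0], [0.9, 0.1])  # confident and correct
bad = binary_cross_entropy([1, 0], [0.1, 0.9])   # confident and wrong
```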

What is the purpose of 'backpropagation' in neural networks?

Used to adjust the weights to reduce the error of prediction.

What is the advantage of using mini-batches rather than processing one observation at a time or using Batch Gradient Descent for training neural networks?

Faster computation compared to Batch algorithm and more frequent weight updates compared to processing data one at a time.

Given that Stochastic Gradient Descent (SGD) with Momentum takes past gradients into account, what is the purpose of its 'velocity' term?

To accelerate convergence by helping the model build speed in consistent directions.
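
The velocity term can be sketched on a one-dimensional problem. This uses one common form of the momentum update (`v = beta * v - lr * grad`); the learning rate, `beta`, step count, and the quadratic loss are all assumptions for illustration.

```python
def sgd_momentum(grad, w0, lr=0.1, beta=0.9, steps=300):
    """Gradient descent with a velocity term that accumulates past gradients."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(w)  # decay the old velocity, add the new gradient step
        w = w + v                    # move along the velocity
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = sgd_momentum(lambda w: 2 * (w - 3), w0=0.0)
```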

In the context of time series analysis, what is the main purpose of smoothing?

Reducing noise and making patterns such as trends or seasonality clearer.

What typically characterizes time series data for which a Simple Moving Average (SMA) works best?

Absence of seasonality or a trend.

How does a larger window size impact the results in Simple Moving Average?

Reduces responsiveness to recent changes.
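
A minimal sketch of a trailing simple moving average, comparing how two window sizes react to a sudden jump at the end of a series. The function name and the data are made up for illustration.

```python
def simple_moving_average(series, window):
    """Mean of the last `window` values, computed at each position where it is defined."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

series = [0, 0, 0, 0, 0, 10]  # flat, then a sudden jump
small_window = simple_moving_average(series, window=2)
large_window = simple_moving_average(series, window=5)
# The last value of the large-window average moves less toward the jump.
```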

What distinguishes Exponential Smoothing from Simple Moving Average?

ES gives greater weight to more recent datapoints.

In exponential smoothing, a high alpha for time series data results in greater what?

Weight to recent data
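
The effect of alpha can be sketched with simple exponential smoothing, where each smoothed value is `alpha * x + (1 - alpha) * previous`. Initializing with the first observation is one common choice and is an assumption here, as are the sample values.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing, initialized with the first observation."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

series = [0, 0, 0, 10]  # flat, then a jump
high_alpha = exponential_smoothing(series, alpha=0.9)  # follows the jump closely
low_alpha = exponential_smoothing(series, alpha=0.1)   # reacts slowly
```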

What could happen if seasonality isn't properly considered in time series data?

Recurring seasonal patterns can be mistaken for anomalies.

What is the main focus of Natural Language Processing (NLP)?

Enabling computers to understand, interpret, and generate human language.

What is the purpose of a Tokenizer?

Breaking down text into smaller units.

Subword tokenization breaks words into subword units in order to handle what?

Rare words.

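When implementing SVM with soft margins, what is the primary effect of increasing the value of the tuning parameter 'C'?

With C treated as a penalty on margin violations (the convention used by scikit-learn), it leads to a narrower margin that fewer observations violate, decreasing bias and increasing variance. With C treated as a budget that bounds the total slack, the effect is reversed.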

In the context of Support Vector Machines (SVM), what is the purpose of 'slack variables'?

To allow individual observations to be on the wrong side of the margin or the hyperplane, thus softening the margin.
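
For reference, one standard way to write the soft-margin problem is the "budget" form below, which matches the earlier description of C as a bound on the sum of the slack variables. The symbols ($M$ for the margin width, $\epsilon_i$ for the slack variables, $\beta_j$ for the coefficients) are notation assumed for this sketch.

```latex
\begin{aligned}
\max_{\beta_0,\ldots,\beta_p,\;\epsilon_1,\ldots,\epsilon_n,\;M} \quad & M \\
\text{subject to} \quad & \sum_{j=1}^{p} \beta_j^2 = 1, \\
& y_i\left(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\right) \ge M(1 - \epsilon_i), \\
& \epsilon_i \ge 0, \qquad \sum_{i=1}^{n} \epsilon_i \le C .
\end{aligned}
```

Here $\epsilon_i = 0$ means observation $i$ is on the correct side of the margin, $0 < \epsilon_i \le 1$ means it violates the margin, and $\epsilon_i > 1$ means it is on the wrong side of the hyperplane.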

When would it be most appropriate to choose a polynomial kernel in Support Vector Machines (SVM)?

When there are known polynomial relationships between the features in the data.

With a small 'C' parameter (where C is the budget bounding the total slack), how does a support vector classifier behave, and what are its potential drawbacks?

It aims for a narrow margin, is highly sensitive to individual observations, and may overfit the data.

What is the impact of the number of dimensions on the computational complexity for unsupervised learning algorithms?

Higher dimensions typically require more computation and time, increasing the computational complexity.

What is a critical challenge specific to unsupervised learning methods when compared to supervised learning?

The objective evaluation of results due to the lack of labeled data (ground truth).

What strategies can be employed to address the challenges posed by high dimensionality in unsupervised learning?

Using dimensionality reduction, feature selection, regularization, and leveraging domain knowledge to reduce data complexity.

What is a limitation of using PCA before applying a supervised learning method?

PCA may not necessarily retain directions that are most useful for effective prediction in a supervised learning context.

While assessing the optimal number of clusters using the Elbow Method, how should the 'elbow' point be best interpreted?

The point where adding more clusters provides diminishing returns in reducing WCSS.

In K-means clustering, under which condition is the result considered to have reached a 'local optimum'?

When the cluster assignments no longer change after iterations.

What is a key limitation of the K-means algorithm regarding cluster shapes?

It assumes clusters are spherical and equally sized.

How does DBSCAN effectively identify clusters of arbitrary shapes compared to K-means?

By grouping together closely packed points, defining clusters based on density rather than distance to centroids.

Within DBSCAN, how is 'density' quantified to determine cluster formation?

By counting the number of points within a specified radius (epsilon) of a given point.

How do you determine the value for epsilon (ε) for DBSCAN?

All of the above

What is the primary implication of the arrangement of observations along the horizontal axis of a dendrogram?

It is arbitrary and does not imply similarity or dissimilarity.

In agglomerative hierarchical clustering, what is the key implication of two observations fusing together closer to the bottom of the dendrogram?

These observations are more similar.

What is a limitation of Hierarchical clustering?

Hierarchical clustering forces a nested hierarchy onto the data, even when the true groups are not well-defined or intuitively nested.

In the context of hierarchical clustering, why might Ward's method be chosen over other linkage methods?

Because it minimizes the variance within clusters, promoting more spherical clusters.

When preparing time-series data for Recurrent Neural Networks (RNNs), what format should the data have to work with an LSTM?

A 3D tensor with dimensions (batch size, time steps, features).
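
A sketch of how a univariate series might be cut into sliding windows with that 3D layout, using plain nested lists in place of a tensor. The function name `make_windows` and the sample series are hypothetical; a real pipeline would typically convert the result to an array for the deep learning framework in use.

```python
def make_windows(series, time_steps):
    """Slice a 1D series into overlapping windows shaped (batch, time_steps, features)."""
    return [
        [[value] for value in series[i : i + time_steps]]  # one feature per time step
        for i in range(len(series) - time_steps + 1)
    ]

windows = make_windows([10, 20, 30, 40, 50], time_steps=3)
shape = (len(windows), len(windows[0]), len(windows[0][0]))
```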

How does transforming text into numerical input with Word2Vec help neural networks in NLP?

By allowing tokens to be mapped to pretrained dense vectors that capture semantic meaning and relationships.

What is the purpose of using cosine similarity with word vectors?

Measuring the angle between two high-dimensional word vectors to quantify how similar, and therefore how related, the words are.
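
A minimal cosine similarity sketch. The vectors below are toy examples, not real word embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated directions
```

Because only the angle matters, two vectors pointing the same way score close to 1 regardless of their lengths.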

In designing a deep neural network (DNN), what distinguishes a network considered 'deep' from one that is not?

Having two or more hidden layers.

Given a dataset with non-linear relationships, what is a primary benefit of using a Deep Neural Network (DNN) over a simpler model?

DNNs can automatically learn intricate, non-linear relationships in the data due to their increased capacity.

When training a deep neural network, what is the effect of using GPUs rather than CPUs?

GPUs enhance the parallel processing capabilities.

Why can neurons that are saturated during the activation function phase lead to the vanishing gradient problem?

All of the above.

When would a tanh activation function be preferable?

When zero-centered outputs are needed for training the network.

What is a limitation when using softmax activation function?

Not suitable for multi-label classification, because the class probabilities are forced to sum to 1, so effectively only one class is selected.

In the context of neural networks, what does binary cross-entropy measure?

How well a predicted probability between 0 and 1 matches the true binary label.

What is the key idea that helps gradient descent in SGD with momentum?

The model builds momentum based on speed and direction of consistent descent by considering past gradients.

What happens to the error during backpropagation?

The error is transmitted backward through the network to improve performance.

In time series analysis, why is it important to account for seasonality?

Recurring seasonal patterns may be mistaken for anomalies, and actual anomalies may stay hidden within them.

What can a Simple Moving Average highlight in noisy data?

The underlying trend, by smoothing out the noise.

In Time Series analysis, why might exponential smoothing be preferred over simple moving average (SMA)?

Exponential smoothing is better for data with little seasonality.

Among linear, exponential, moving-average, logarithmic, and polynomial trend lines, when is a polynomial the best choice for showing the trend in time series data?

For more complex patterns with fluctuations.

What is the use of Seasonal and Trend Decomposition (STL decomposition)?

All of the above.

What role does the integrated (I) component of the ARIMA model play?

It is where the data is differenced to make the series stationary (removing time trends and, with seasonal differencing, seasonality). Its value d indicates the order of differencing.
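
Differencing itself is simple to sketch: each pass replaces the series with the changes between consecutive values. The function name and the sample series are assumptions for illustration; this shows only the differencing step, not a full ARIMA fit.

```python
def difference(series, d=1):
    """Apply first differencing d times: y[t] - y[t-1] at each pass."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

series = [1, 4, 9, 16, 25]          # quadratic trend
once = difference(series, d=1)      # still trending upward
twice = difference(series, d=2)     # constant, the trend is removed
```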

What is TF-IDF often represented in?

All of the above.

What can cause the Embedding process to be limited?

Both A and B.

In the context of Support Vector Machines (SVM), how does the 'large margin classification' principle primarily aim to improve model performance?

By fitting the widest possible 'street' between different classes, thereby enhancing generalization.

When implementing Support Vector Machines (SVM) with soft margins, under what circumstances would it be most strategic to permit certain observations to violate the margin?

When dealing with complex, non-linear datasets or datasets containing outliers, to enhance robustness.

Considering the bias-variance tradeoff in Support Vector Classifiers, how does increasing the value of the tuning parameter 'C' (treated as a penalty on margin violations) influence the classifier's characteristics?

It causes the classifier to prioritize a narrower margin, potentially leading to overfitting and lower bias.

How does the behavior of a Support Vector Classifier change with a very large 'C' value (treated as a penalty on margin violations), and what implications does this have for model performance?

The model behaves similarly to hard margin classification, with potential for overfitting on non-separable data.

When using the Radial Basis Function (RBF) kernel in Support Vector Machines (SVM), how does the kernel implicitly handle the dimensionality of the feature space?

By mapping the data into a higher-dimensional space, implicitly handling infinite dimensions without explicit computation.

Which statement best describes how polynomial kernels in SVM facilitate the classification of non-linear data?

They approximate non-linear relationships by transforming the data into a higher-dimensional space, allowing for linear separation.

What is a primary challenge associated with unsupervised learning methods, particularly concerning the assessment of model performance?

The difficulty in objectively evaluating outcomes due to the lack of a predefined metric or 'ground truth'.

In the context of handling high dimensionality, how does L1 Regularization address the challenges posed by the 'Curse of Dimensionality'?

By automatically selecting more relevant features, thus simplifying the model and improving generalization.

Within K-means clustering, how does the K-means++ initialization method aim to improve the quality of clustering compared to random initialization?

By strategically spreading out the initial centroids, thereby reducing the risk of suboptimal clustering.

When evaluating K-means clustering results, which interpretation of the Silhouette Score indicates a potential issue with the clustering?

Negative Silhouette Scores for numerous data points suggest these points may have been assigned to the wrong clusters.

In DBSCAN, what is the algorithmic approach to classifying points that fall outside dense regions and are not within the epsilon neighborhood of any core points?

Treating these points as noise, effectively excluding them from any cluster.

What key constraint differentiates hierarchical clustering methods from K-means clustering, particularly affecting their applicability in certain scenarios?

Hierarchical clustering does not require pre-specification of the number of clusters, unlike K-means.

When interpreting a dendrogram in hierarchical clustering, under what condition is it impossible to make definitive statements about the similarity of two observations?

When the observations are positioned close to each other along the horizontal axis of the dendrogram.

How does the backpropagation algorithm utilize the chain rule of calculus in training neural networks, and what is the significance of this process?

To efficiently update the parameters of each layer by relating the change in the loss function to the parameters of earlier layers.

What is a critical implication of the saturation of neurons in deep neural networks for the backpropagation process, and why does this present a significant challenge?

Saturated neurons lead to near-zero gradients, hindering the weight update and resulting in slower or stalled learning.

With regards to activation functions, what key advantage does the ReLU activation function provide over sigmoid or tanh functions in deep neural networks, and how does this affect training dynamics?

ReLU mitigates the vanishing gradient problem, enabling more effective training of deep networks by reducing saturation.

In the context of Natural Language Processing (NLP), what is the significance of word embeddings beyond simply converting words into numerical data?

Word embeddings establish explicit semantic relationships between words, enabling models to capture context and meaning.

When applying Simple Moving Average (SMA) to a time series, how does the selection of a larger window size affect the resulting smoothed time series, and what are its practical consequences?

It produces a smoother series less responsive to recent changes, potentially obscuring short-term patterns.

In time series analysis using Exponential Smoothing, how does a smoothing factor (alpha) close to 1 impact the model's sensitivity to recent data points, and what does this imply for forecasting?

It makes the model highly sensitive to recent data, essentially making the forecast strongly influenced by the latest observations.

What is the primary advantage of employing subword tokenization techniques in Natural Language Processing (NLP), especially when dealing with large and diverse text corpora?

Subword tokenization effectively handles rare words and words with similar roots, improving model generalization and vocabulary coverage.

Flashcards

Support Vector Machines (SVM)

A powerful algorithm used for classification and regression, effective in high-dimensional spaces by fitting the widest possible street between classes.

Margin

Distance from the separating hyperplane to the closest data points.

Support vectors

Data points closest to the hyperplane that influence its position and the margin.

Hard Margin Classification

Strictly requiring every observation to lie on the correct side of the hyperplane and outside the margin, with no room for misclassification.