Support Vector Machines and Unsupervised Methods

Questions and Answers

In Support Vector Machines (SVM), what determines whether it is used for classification (SVC) or regression (SVR)?

The intended purpose, either classification or regression. (correct)
The type of kernel used.
The number of dimensions in the data.
The size of the dataset.

What is the geometric interpretation of an SVM classifier?

Fitting the widest possible street between the classes. (correct)
Calculating the average distance between classes.
Finding the smallest circle that encloses all data points.
Fitting a line that minimizes the distance to all points.

In a p-dimensional space, what is a hyperplane?

A flat affine subspace of dimension p - 1. (correct)
A flat affine subspace of dimension p + 1.
A curved surface that separates the data.
A single point that best represents the data.

What are the support vectors in the context of Support Vector Machines?

The data points closest to the hyperplane that influence its position. (D)

What is the formula to classify a test observation $x^*$ using a maximal margin classifier, given coefficients $\beta_0, \beta_1, \dots, \beta_p$?

$f(x^*) = \beta_0 + \beta_1 x_1^* + \beta_2 x_2^* + \dots + \beta_p x_p^*$, classify based on the sign of $f(x^*)$. (A)
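
As a minimal sketch of this rule, the snippet below evaluates the linear function and returns a class label from its sign. The coefficient values and the name `classify` are illustrative assumptions, not values from the lesson.

```python
def classify(x_star, beta0, betas):
    """Return +1 or -1 from the sign of f(x*) = beta0 + sum_j beta_j * x*_j."""
    f = beta0 + sum(b * x for b, x in zip(betas, x_star))
    return 1 if f > 0 else -1

# Hypothetical coefficients for a 2-feature problem (illustrative only).
beta0, betas = -1.0, [2.0, -1.0]
label_a = classify([2.0, 1.0], beta0, betas)  # f = -1 + 4 - 1 = 2
label_b = classify([0.0, 1.0], beta0, betas)  # f = -1 + 0 - 1 = -2
```

The magnitude of the function value also indicates how far the observation lies from the hyperplane, so larger magnitudes suggest more confident assignments.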

What is a significant limitation of hard margin classification?

It only works if the data is linearly separable and is sensitive to outliers. (B)

In the context of Support Vector Machines (SVM), what is the primary motivation for using 'soft margins' instead of 'hard margins'?

To allow for some misclassifications, increasing robustness to outliers and non-separable data. (D)

In the context of soft margins in SVM, what do slack variables ($\epsilon_1, \dots, \epsilon_n$) represent?

The degree to which individual observations are allowed to be on the wrong side of the margin or hyperplane. (A)

What does the tuning parameter C control in a soft margin SVM?

The trade-off between achieving a wide margin and limiting the number of margin violations. (A)
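
For reference, one common way to write the soft margin problem is the budget form, in which C bounds the total slack. This is a sketch of that formulation; the symbols $M$ (margin width), $y_i$ (class labels in $\{-1, +1\}$) and $x_{ij}$ (feature values) are assumed notation.

$$
\begin{aligned}
\max_{\beta_0, \dots, \beta_p,\ \epsilon_1, \dots, \epsilon_n,\ M} \quad & M \\
\text{subject to} \quad & \sum_{j=1}^{p} \beta_j^2 = 1, \\
& y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) \ge M(1 - \epsilon_i), \\
& \epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le C.
\end{aligned}
$$

In this form a larger C tolerates more violations. Libraries such as scikit-learn instead use C as a penalty coefficient on violations, so there a larger C tolerates fewer of them.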

In the context of SVM, what happens when the tuning parameter C is very large?

With C defined as the budget that bounds the sum of the slack variables, the margin is wide and many observations violate the margin, leading to low variance but potentially high bias. (A)

Which of the following kernel options in SVM is generally considered a good default choice when there is no clear understanding of the data distribution?

Radial Basis Function (Gaussian) (A)

What is the effect of a positive constant γ in the Radial Basis Function (RBF) kernel?

Decreases the influence of distant observations. (C)
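
To illustrate, here is a small sketch of the RBF kernel in its usual form, $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$. The function name and the sample points are assumptions chosen for illustration.

```python
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

origin = [0.0, 0.0]
near = rbf_kernel(origin, [0.1, 0.0], gamma=1.0)
# The same distant point under a small and a large gamma.
far_small_gamma = rbf_kernel(origin, [3.0, 0.0], gamma=0.1)
far_large_gamma = rbf_kernel(origin, [3.0, 0.0], gamma=10.0)
```

With the larger gamma the kernel value for the distant point is essentially zero, so that point contributes almost nothing to the decision for an observation at the origin.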

In the context of machine learning, what is the primary goal of unsupervised learning?

To understand the underlying structure of data by creating groupings without prior knowledge of class labels. (B)

Which of the following is a common approach in unsupervised learning?

Clustering (A)

Why is evaluating the success of unsupervised learning models often challenging?

Because the analysis tends to be subjective without predefined metrics or ground truth. (C)

Which of the following is a potential risk associated with unsupervised learning?

Overfitting to noise in the data, especially with methods sensitive to the number of parameters. (D)

Which of the following is a common technique to deal with high dimensionality?

Dimensionality reduction (B)

What is the primary goal of Principal Component Analysis (PCA)?

To find a low-dimensional representation of a dataset that captures as much of the variance as possible. (C)

What is the main objective of K-means clustering?

To partition data into K clusters by minimizing intra-cluster variance. (B)

In the context of K-means clustering, what is the 'Elbow Method' used for?

To determine the optimal number of clusters (K). (D)

What does a higher Silhouette Score indicate in the context of K-means clustering?

A better clustering, as data points are similar to their own cluster and dissimilar to other clusters. (C)

Which of the following statements describes a limitation of K-Means clustering?

K-Means requires the number of clusters (K) to be predefined. (A)

What is a key characteristic of DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm?

It identifies clusters based on data point density and can find arbitrary-shaped clusters. (C)

In DBSCAN, what distinguishes a 'core point' from other points?

It has a minimum number of neighboring points within a specified distance. (B)
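
The core-point test can be sketched as a neighbor count. This example assumes the point itself counts toward the minimum (the convention scikit-learn uses for `min_samples`); the names `is_core_point`, `eps` and `min_pts` and the sample data are illustrative.

```python
import math

def is_core_point(idx, points, eps, min_pts):
    """True if points[idx] has at least min_pts points within distance eps.

    Assumption: the point itself is included in the count.
    """
    count = sum(1 for q in points if math.dist(points[idx], q) <= eps)
    return count >= min_pts

# Four tightly packed points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
core_flags = [is_core_point(i, points, eps=0.2, min_pts=4)
              for i in range(len(points))]
```

The isolated point is not a core point and, having no core point within its neighborhood, would be labeled noise.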

Which of the following is an advantage of DBSCAN over K-Means clustering?

DBSCAN can detect noise and outliers. (A)

In Hierarchical Clustering, what is a dendrogram used for?

To visualize the hierarchy of clusters and the relationships between data points. (D)

In the context of interpreting a dendrogram, what does the vertical axis represent?

The distance or dissimilarity at which clusters are merged. (B)

What is the key difference between agglomerative and divisive hierarchical clustering?

Agglomerative starts with individual points and merges them; divisive starts with all points in one cluster and splits them. (B)

In hierarchical clustering, which linkage criterion minimizes the variance within clusters?

Ward's Method (A)

What characterizes Directed Acyclic Graphs (DAGs)?

Edges with direction and the absence of cycles. (B)

What key function did early neural networks use, also employed in Logistic Regression, to compute a probability between 0 and 1?

Sigmoid Function (D)

For what type of tasks is a perceptron primarily designed?

Binary classification tasks (B)

In the context of perceptron learning, what is backpropagation used for?

Adjusting the weights and biases based on the error between the predicted output and the true label in networks with multiple hidden layers. (B)

What happens to the output of an ELU (Exponential Linear Unit) as its argument becomes more negative?

ELUs saturate to a negative value (C)
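
A minimal sketch of the standard ELU definition shows the saturation. The default `alpha=1.0` is an assumption; with it, the output approaches -1 for very negative inputs.

```python
import math

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

positive = elu(2.0)         # passes through unchanged
mildly_negative = elu(-1.0)
very_negative = elu(-50.0)  # close to -alpha
```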

For a single perceptron, which activation function correctly produces the logical operation A AND B?

Step function (D)
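
As a sketch, a single perceptron with a step activation can reproduce the AND truth table. The weights and bias below are one hand-picked choice that works, not values from the lesson.

```python
def step(z):
    """Step activation: 1 when the weighted sum is non-negative, else 0."""
    return 1 if z >= 0 else 0

def perceptron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)

# Hand-picked parameters: the sum reaches 0 only when both inputs are 1.
weights, bias = [1.0, 1.0], -1.5
and_table = {(a, b): perceptron([a, b], weights, bias)
             for a in (0, 1) for b in (0, 1)}
```

No choice of weights and bias does the same for XOR, because its positive and negative cases cannot be separated by a single line.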

What logical operation cannot be handled with a single perceptron?

XOR (A)

In a neural network, what is the primary purpose of activation functions?

To introduce non-linearity, enabling the model to learn complex relationships. (A)

What distinguishes a Deep Neural Network (DNN) from a regular neural network?

It has two or more hidden layers. (A)

Why are GPUs and TPUs important for deep learning?

They accelerate the training process through parallel processing. (B)

In deep learning, what is the purpose of a loss function?

To measure the difference between the predicted output and the actual target value. (B)

In the context of neural network training, what does 'epoch' refer to?

One complete pass through the entire training dataset. (D)

When training neural networks, what is the role of mini-batches?

To approximate the gradient of the loss function using a subset of the training data. (D)

Which is likely to happen if the learning rate for the model is set very high?

Overshooting or divergence (B)

What is the purpose of "smoothing" in the context of time series analysis?

To reduce noise and make patterns clearer. (D)

What is the primary difference between a Simple Moving Average (SMA) and Exponential Smoothing?

SMA gives equal weight to all data points in the window, while exponential smoothing gives more weight to recent observations. (D)
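
The difference in weighting can be sketched in a few lines. The function names, the window size, the smoothing factor `alpha` and the sample series are assumptions for illustration.

```python
def simple_moving_average(series, window):
    """Unweighted mean of the last `window` values at each position."""
    return [sum(series[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(series))]

def exponential_smoothing(series, alpha):
    """Each output mixes the new value (weight alpha) with the previous output."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

series = [10.0, 12.0, 11.0, 13.0, 30.0]  # the last value is a sudden jump
sma = simple_moving_average(series, window=3)  # [11.0, 12.0, 18.0]
ses = exponential_smoothing(series, alpha=0.5)  # [10.0, 11.0, 11.0, 12.0, 21.0]
```

After the jump to 30, the exponentially smoothed value (21.0) reacts more strongly than the moving average (18.0) because the most recent observation carries the largest weight.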

In time series analysis, what does 'seasonality' refer to?

Recurring patterns or cycles that occur at regular intervals. (B)

In time series analysis, which type of trend line is suitable for growth patterns that level off over time?

Logarithmic (A)

In STL decomposition, what are the three main components into which a time series is broken down?

Trend, Seasonality, Residual (B)

In the context of ARIMA models, what does the 'I' component represent?

Integration, referring to making data stationary. (A)
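
In practice the integration step is usually carried out by differencing the series. A minimal sketch, with a hypothetical helper `difference` and a made-up linear series:

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both exist."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

trend = [3, 5, 7, 9, 11]               # steadily increasing, so not stationary
first_diff = difference(trend)         # [2, 2, 2, 2]
second_diff = difference(first_diff)   # [0, 0, 0]
```

One round of differencing removes the linear trend here, which corresponds to a differencing order of d = 1 in an ARIMA(p, d, q) model.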

In Natural Language Processing (NLP), what does 'tokenization' refer to?

Breaking down text into smaller units such as words or phrases. (B)

In Support Vector Machines (SVM), what is the significance of points that lie directly on the margin or on the wrong side of it?

They are known as support vectors and directly affect the support vector classifier. (A)

In the context of Support Vector Machines (SVM), what is the primary purpose of the tuning parameter C?

To control the bias-variance trade-off of the support vector classifier. (A)

Which statement is correct regarding the impact of a large tuning parameter C in Support Vector Machines (SVM)?

It leads to a classifier with low variance but potentially high bias. (A)

When is it most appropriate to use a Polynomial kernel in a Support Vector Machine (SVM)?

When the data has polynomial relationships between features. (B)

What is the practical implication of using the Radial Basis Function (RBF) kernel with a very large gamma (γ) value in SVM?

The model becomes highly sensitive to each training data point, potentially overfitting the data. (A)

What is a key difference between supervised and unsupervised learning?

Supervised learning requires labeled data for training, while unsupervised learning does not. (B)

Which of the following tasks is best suited for unsupervised learning?

Identifying customer segments based on purchasing behavior. (B)

What is a primary challenge in unsupervised learning compared to supervised learning?

The difficulty in evaluating the results due to a lack of ground truth. (D)

Why can unsupervised learning methods be sensitive to noise and outliers in the data?

Because these algorithms aim to find patterns, and outliers can disproportionately influence the identified patterns. (A)

Which of the following is NOT a common technique for dealing with high dimensionality?

Data Augmentation. (C)

In the context of Principal Component Analysis (PCA), what does projecting observations onto a vector with the largest variance achieve?

It results in projected observations that retain the most variance in the data. (C)

What is the most direct way to describe the central idea behind K-means clustering?

Assigning data points to clusters to minimize the intra-cluster variance. (C)

What do higher K values typically imply when using the Elbow Method to select the optimal number of clusters?

Diminishing returns in reducing the Within-Cluster Sum of Squares (WCSS). (C)

What does the Silhouette Score measure in the context of clustering?

How similar a data point is to its own cluster compared to other clusters. (C)

Which statement best describes a limitation of K-Means clustering?

It struggles with data that is not spherically shaped or evenly sized. (C)

In the DBSCAN algorithm, what role do border points play in cluster formation?

They are within the neighborhood of a core point but do not have enough neighbors to be core points themselves. (D)

Which of the following is a disadvantage of DBSCAN compared to K-Means clustering?

DBSCAN struggles with clusters of varying densities. (D)

In hierarchical clustering, what does the height at which two branches merge in a dendrogram indicate?

The distance or dissimilarity between the two clusters. (A)

What does it mean when observations fuse together at the very top of a dendrogram?

The observations are quite different from each other. (C)

In agglomerative hierarchical clustering, how is the distance between clusters updated after merging two clusters?

Using linkage criteria to define how the distance between the new cluster and other clusters is computed. (C)

Which of the following linkage criteria in hierarchical clustering tends to create elongated clusters?

Single linkage. (B)

What best describes Directed Edges in Directed Acyclic Graphs (DAGs)?

They signify a one-way relationship or dependency between nodes. (B)

What does the term 'acyclic' signify in the context of Directed Acyclic Graphs (DAGs)?

The graph does not contain cycles or closed loops. (C)

What is the role of the weight coefficients in a perceptron?

They multiply the input data to determine its importance. (B)

What must be true for any node in a perceptron to generate an output?

Its activation function must trigger. (B)

What does the perceptron learning rule adjust in a perceptron?

The weights and bias. (A)

In the context of neural networks, what does the term 'non-linearity' refer to?

The use of activation functions to transform linear combinations of inputs. (C)

What is the function of gradient propagation in neural networks?

To optimize the weights during training using backpropagation. (C)

What is a potential disadvantage of Leaky ReLU compared to ReLU?

It is not as standardized and requires tuning an additional hyperparameter. (A)

For what purpose is the Softmax activation function primarily used?

To convert model outputs into probability distributions for multi-class classification. (A)
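
A minimal softmax sketch follows. Subtracting the maximum logit before exponentiating is a common numerical-stability choice and does not change the result; the sample logits are made up.

```python
import math

def softmax(logits):
    """Map raw scores to positive values that sum to 1."""
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three classes
```

The largest score receives the largest probability, and the ordering of the classes is preserved.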

Which of the following techniques can accelerate convergence when training deep learning models?

Zero-centered outputs, which help networks train faster (A)

What does the term 'epoch' refer to in the context of training neural networks?

One complete pass through the entire training dataset. (A)

Why must the mini-batch size be optimized during training?

To balance computational overhead against how well the model uncovers patterns in the data. (C)

What would incorporating momentum do?

Considers the past gradients, adding a velocity term that helps the model build speed in directions of consistent descent (C)
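
The velocity term can be sketched on a one-dimensional problem. This uses the classical momentum update; the learning rate, momentum coefficient `beta`, step count and the toy loss $(w - 3)^2$ are all assumptions for illustration.

```python
def sgd_momentum(grad_fn, w0, lr=0.1, beta=0.9, steps=200):
    """Gradient descent where each step reuses a decayed sum of past gradients."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        velocity = beta * velocity - lr * grad_fn(w)  # accumulate past gradients
        w = w + velocity
    return w

# Toy loss (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Setting `beta=0.0` recovers plain gradient descent, which makes it easy to compare the two on the same problem.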

During neural network training, what does Batch SGD do differently from Stochastic Gradient Descent?

It updates all weights once after each batch rather than after each example. (B)

In a neural network, how is the Chain rule of calculus used to calculate updates?

To compute the gradients of the loss function with respect to the weights of the earlier layers. (C)

What is a general rule regarding the learning rate?

A larger rate gives better initial convergence and helps escape local minima. (B)

What does the term 'multidimensional array' best describe?

Tensor (A)

In time series analysis, what is the purpose of applying smoothing techniques?

To make patterns like trends and seasonality clearer by reducing noise. (D)

For a time series dataset exhibiting non-linear growth patterns that gradually approach a saturation point, which trend line would be the most appropriate?

Logarithmic (D)

What is the typical purpose of STL decomposition in time series analysis?

To break down a time series into trend, seasonality, and residual components. (C)

Within the ARIMA framework, what is the approach for tuning the components of the framework?

All of the choices are correct approaches. (D)

What task is 'Sentiment Analysis' targeting?

Determining the sentiment or emotion behind the text. (A)

In Natural Language Processing (NLP), what role do word embeddings play?

They provide numerical representations of words that capture semantic relationships. (C)

What does a Bag of Words approach show?

Word counts, indicating how frequently each term occurs (D)
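
A minimal Bag of Words sketch using only the standard library is shown here. The tokenizer (lowercasing and keeping runs of letters, digits and apostrophes) is a simplifying assumption; real pipelines often handle punctuation and stop words differently.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(text):
    """Count how often each token occurs; word order is discarded."""
    return Counter(tokenize(text))

bow = bag_of_words("The cat sat on the mat. The mat was flat.")
```

Here `bow["the"]` is 3 and `bow["mat"]` is 2, while any information about where those words appeared in the sentences is lost.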

What best describes an advantage of the skip-gram architecture?

The model is trained on skip-grams, which are n-grams that allow tokens to be skipped (D)

What is represented when applying Cosine Similarity?

The angle between two vectors, expressed as its cosine (A)
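
As a sketch, cosine similarity is the dot product divided by the product of the vector lengths. The function name and the sample vectors are illustrative, and the function assumes neither vector is all zeros.

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||); assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # right angle
```

Because only direction matters, two word or document vectors of very different magnitudes can still have a similarity close to 1.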

In Support Vector Machines (SVM), if you want to allow some misclassifications to achieve a better fit on the majority of the data, which type of margin would be most appropriate?

Soft Margin (D)

In the context of Support Vector Machines (SVM), what is the effect of having a very small value for the tuning parameter C?

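Wider margin, with more tolerance for training errors. This holds when C is a penalty on margin violations, as in scikit-learn; when C is instead the budget bounding the sum of the slack variables, a small C gives a narrower margin with fewer violations. (B)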

When should you choose a Polynomial kernel over a linear kernel in Support Vector Machines (SVM)?

When the relationship between features is suspected to be polynomial (C)

In Support Vector Machines (SVM), if you're dealing with data where the true underlying distribution is unknown, which kernel is generally recommended as a first approach?

Radial Basis Function (RBF) Kernel (C)

What happens to the influence of distant observations in a Support Vector Machine (SVM) using a Radial Basis Function (RBF) kernel as the gamma (γ) parameter increases?

Their influence decreases, focusing on closer observations (A)

In unsupervised learning, what does the term 'lack of labeled data' primarily imply?

All of the above (D)

Which of the following statements captures a key challenge when using unsupervised learning methods on very large datasets?

They may become computationally intensive and impractical due to scalability concerns (C)

What is the 'curse of dimensionality', and how does it specifically impact unsupervised learning techniques?

It describes the phenomenon where data points become sparse, distance metrics become less informative and models are prone to overfitting noise. (D)

Which of the following is NOT a recognized strategy for addressing the challenges posed by high dimensionality in machine learning datasets?

Feature Expansion (A)

What is the primary reason for performing a rotation transformation in Principal Component Analysis (PCA)?

To retain the maximum possible variance in the resulting representation (D)

What does minimizing intra-cluster variance accomplish in K-means clustering?

It creates more compact and distinct clusters (B)

In K-means clustering, which of the following describes the role of the 'Assignment Step'?

Assigning each observation to the cluster with the closest centroid according to Euclidean distance (A)
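
The assignment step, together with the centroid update it alternates with, can be sketched as follows. This is a bare-bones illustration with hand-picked initial centroids, not a production implementation; the function names and data are assumptions.

```python
import math

def assign(points, centroids):
    """Assignment step: index of the nearest centroid (Euclidean) per point."""
    return [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points]

def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:  # keep the centroid of an empty cluster unchanged
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centroids

def k_means(points, centroids, max_iter=100):
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = update(points, labels, centroids)
        if updated == centroids:  # converged: centroids stopped moving
            break
        centroids = updated
    return labels, centroids

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
labels, centroids = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Because the result depends on the starting centroids, practical implementations typically run the algorithm from several initializations and keep the best result.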

Which strategy can directly address the sensitivity of K-means to initial centroid placement?

Running the algorithm multiple times with different initial centroid placements (B)

How does DBSCAN identify clusters of arbitrary shape?

By grouping points based on density (D)

In DBSCAN, what is one of the major parameters, and how is it used?

epsilon (ε), the neighborhood radius within which neighboring points are retrieved (B)

When employing Single Linkage in agglomerative hierarchical clustering, how is the distance between two clusters determined?

By the shortest distance between any two points in the two clusters (A)
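
A short sketch of single linkage, with complete linkage alongside for contrast. The function names and the two example clusters are assumptions for illustration.

```python
import math

def single_linkage(cluster_a, cluster_b):
    """Shortest distance between any pair of points, one from each cluster."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def complete_linkage(cluster_a, cluster_b):
    """Longest distance between any pair of points, one from each cluster."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)

cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (6.0, 0.0)]
d_single = single_linkage(cluster_a, cluster_b)      # 2.0, from (1,0) to (3,0)
d_complete = complete_linkage(cluster_a, cluster_b)  # 6.0, from (0,0) to (6,0)
```

Since single linkage only needs one close pair to merge two clusters, it can chain points together, which is why it tends to produce elongated clusters.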

In hierarchical clustering, what is the main advantage of using Ward's method over other linkage methods?

It minimizes the variance within clusters (B)

In the context of deep learning, what is the vanishing gradient problem and why is it significant?

A problem where gradients become extremely small during backpropagation, hindering weight updates in early layers. (D)

In neural networks, what is the main function of the Softmax activation function, and for which type of layer is it most commonly used?

To convert outputs into a probability distribution for multi-class classification, used in the output layer (A)

What is the primary method for dealing with an underperforming Learning Rate?

Setting the rate higher, to move weights quicker (D)

Flashcards

Supervised learning

A type of machine learning where the model learns from labeled data, that is, from a known dependent (response) variable.

Unsupervised Learning

Machine learning that discovers hidden patterns without human supervision.

Hyperplane

A flat affine subspace of dimension p - 1 in a p-dimensional space; it serves as the boundary used to classify the data.

Margin

Distance from the separating hyperplane (solid line) to the margin boundaries (dashed lines), on which the support vectors lie.

Support Vectors

Data points closest to the hyperplane; they determine the classifier's position.

Hard Margin Classification

Requires every data point to be classified correctly and to lie outside the margin, with no errors allowed.

Soft Margin Classification

Allows some points to violate the margin or be misclassified.

Slack Variables

Variables that permit individual observations to be on the wrong side of the margin.

Tuning Parameter C

Hyperparameter that bounds the sum of the $\epsilon_i$'s and so determines the number and severity of margin violations that are tolerated.

Polynomial Kernel

Use when the data has polynomial relationships between features.

Radial Basis Function (Gaussian)

A good default choice; use when there is no clear understanding of the data distribution.

Sigmoid Kernel

Use when you suspect the data behaves similarly to a neural network.

PCA

Principal Component Analysis; reduces dimensionality while retaining as much of the variance as possible.

Clustering

Method for grouping data into clusters so that the objects within each cluster are similar.

K-Means Clustering

Partitions data into K clusters by minimizing intra-cluster variance.

Elbow Method

Used to find an optimal K.

Silhouette Score

Used to find an optimal K. Measures how well the data is clustered: how similar each point is to its own cluster compared to other clusters.

DBSCAN

Density-based algorithm that identifies clusters as regions of closely packed data points.

MinPts

Minimum number of neighboring points within distance ε required for a point to be a core point.

Dendrogram

Hierarchical representation of the clustering process.

Agglomerative Approach

Starts with each data point as its own cluster (n clusters initially) and repeatedly merges the closest clusters.

Divisive Approach

Starts with all the data in a single cluster and repeatedly splits it into smaller clusters.

Directed Acyclic Graphs

Graph with directed edges and no cycles or loops; used to represent operation nodes.

Perceptron Function

The inputs multiplied by their weight coefficients and summed, plus a bias value.

Deep Neural Network

A network is considered deep if it has multiple hidden layers.

Loss Functions

A measure of the difference between the model's predicted output and the actual target value.

Epoch

One complete pass through the entire training dataset.

Mini-Batch

Small, randomly selected subset of the training data.

Input Layer

Layer that takes input data.

Hidden Layers

Layers between the input and output that apply weights and activation functions to perform the computations.

Activation Functions

Functions that introduce non-linearity, enabling the network to learn complex relationships.

Output Layer

Layer where the final output is calculated from the weighted activations of the preceding layer.

GPUs

Accelerate training through parallel processing.

Learning the Tuning

The tuning to change to a more efficient learning.

Minimizing the Loss

Finding the weights at which the loss is at its minimum.

Small Batching

To improve learning when there are not many observations.

Weight Update Alternative

To change the update steps.

Back Propagation

Where the network propagates the error backward through its layers to update the weights.

Convolutional Neural Networks

Networks that detect image patterns in a way that mimics human vision.

Pool Layers

Compress the input into a smaller image for performance.

Recurrent Neural Networks

Neural network (RNN) that can take a sequence as input.

Embeddings

Transformation of data into numerical representations.

Time Series Smoothing

Technique that reduces noise so that patterns become clearer.

Ideal for Time Series

Clear data when trends are not there.

Exponential Smoothing

Averaging and smoothing technique that gives more weight to recent observations.

Patterns of External Factors.

Patterns for data

STL

Breaks a time series down into three main components: trend, seasonality, and residual.

Series Time Code

Analyzes the series.

Autocorrelations

Check if the data is stationary. Apply differencing.

NLP

Natural Language Processing; a branch of artificial intelligence.

Text tokenizer

Breaks down text into smaller segments.

Model Input

Tool that converts unstructured text into structured input for the ML model.

Study Notes

This lecture covers Support Vector Machines (SVM) and unsupervised methods in Python for data analysis.
This lecture also introduces Deep Learning concepts.