Support Vector Machines (SVM)

Questions and Answers

In Support Vector Machines (SVM), what fundamentally defines a 'hyperplane' in a p-dimensional space?

  • A curved surface that optimally separates the data points.
  • The margin that maximizes the separation between classes.
  • A flat affine subspace of dimension _p_.
  • A flat affine subspace of dimension _p_ - 1. (correct)

What is the primary limitation of using 'hard margin classification' with Support Vector Machines (SVM)?

  • It is computationally intensive and requires significant resources.
  • It requires a large amount of labeled data to achieve good performance.
  • It is highly sensitive to outliers and only works if data is perfectly linearly separable. (correct)
  • It tends to overfit the training data, leading to poor generalization.

In the context of Support Vector Machines (SVM), what is the significance of the tuning parameter 'C'?

  • It determines the width of the margin in the SVM model.
  • It bounds the sum of slack variables, influencing the tolerance for observations violating the margin. (correct)
  • It sets the polynomial degree in polynomial kernel SVM.
  • It controls the kernel type used in the SVM model.

What characterizes observations that are known as 'support vectors' in the context of Support Vector Machines (SVM)?

  • Observations that lie directly on the margin or on the wrong side of the margin for their class. (correct)

Which of the following kernel options in SVM would be most appropriate when there's no clear understanding of the data distribution?

  • Radial Basis Function (Gaussian) kernel (correct)

How does the Radial Basis Function (RBF) kernel handle observations that are far away from each other in the feature space?

  • It decreases their impact on computing the decision boundary due to the larger Euclidean distance. (correct)

Principal Component Analysis (PCA) is an unsupervised method primarily used for what?

  • Finding a low-dimensional representation of a dataset while retaining as much variance as possible. (correct)

What does PCA aim to achieve when projecting observations in a p-dimensional space?

  • To project observations onto the direction along which they have the largest variance. (correct)

In K-means clustering, what is the primary objective of the algorithm?

  • To partition data into K clusters by minimizing intra-cluster variance. (correct)

What does the K-means algorithm minimize when squared Euclidean distance is used?

  • The sum of all pairwise squared Euclidean distances between the observations in the _k_th cluster, divided by the total number of observations in the _k_th cluster. (correct)

What is the Silhouette Score used for in the context of K-means clustering?

  • To determine the optimal number of clusters by measuring the quality of clustering. (correct)

Which of the following statements regarding the K-means algorithm is correct?

  • Results depend on the initial cluster assignment of each observation in Step 1, because the algorithm finds a local rather than a global optimum. (correct)

What is the purpose of the minPts parameter in the DBSCAN clustering algorithm?

  • It sets the minimum number of neighboring points within a specified radius for a point to be considered a core point. (correct)

Which of the following is an advantage of the DBSCAN algorithm compared to K-means clustering?

  • DBSCAN automatically determines the number of clusters. (correct)

In DBSCAN, what differentiates a 'border point' from a 'core point'?

  • Border points are within the epsilon distance of a core point but do not have enough neighbors to be considered core points. (correct)

What does interpreting a dendrogram allow one to determine?

  • How similar observations are to each other. (correct)

What is a key difference between agglomerative and divisive hierarchical clustering approaches?

  • Divisive clustering starts with all data points in one cluster, while agglomerative clustering begins with each data point as its own cluster. (correct)

In the context of hierarchical clustering, what does the term 'single linkage' refer to?

  • The minimum distance between points in two clusters. (correct)

Which distance metric is commonly used and is most suitable for continuous variable data for clustering?

  • Euclidean distance (correct)

What is the primary role of activation functions in neural networks?

  • To introduce non-linearity, control output values, and enable gradient propagation. (correct)

Which of the following is a characteristic of the Sigmoid activation function?

  • It can suffer from the vanishing gradient problem due to saturated neurons. (correct)

What is a key advantage of using the ReLU (Rectified Linear Unit) activation function?

  • It avoids vanishing gradient issues for positive inputs and allows faster convergence. (correct)

Which activation function is most suitable for the output layer of multi-class classification models?

  • Softmax (correct)

Which of the following describes the 'vanishing gradient' problem?

  • The loss function gradients become extremely small, preventing weights from updating in early layers. (correct)

What is the primary difference between a perceptron and a multi-layer perceptron (MLP)?

  • An MLP has at least one hidden layer, allowing it to learn non-linear boundaries. (correct)

What is the role of 'epochs' in the training of neural networks?

  • Each epoch is one complete pass through the entire training dataset. (correct)

During the training of a Deep Neural Network (DNN), what is the immediate result of the 'forward-pass'?

  • Computation of the predictions based on the inputs. (correct)

Which loss function is most appropriate for a binary classification problem?

  • Binary Cross-Entropy (correct)

What is the purpose of 'backpropagation' in neural networks?

  • To adjust the weights so as to reduce the prediction error. (correct)

What is the advantage of using mini-batches rather than processing one observation at a time or using Batch Gradient Descent for training neural networks?

  • More frequent weight updates than Batch Gradient Descent, and more efficient, less noisy computation than processing one observation at a time. (correct)

Considering that SGD with Momentum takes past gradients into account, what is the purpose of its 'velocity' term?

  • To accelerate convergence by helping the model build speed in consistent directions. (correct)

In the context of time series analysis, what is the main purpose of smoothing?

  • Reducing noise and making patterns like trends, or seasonality, clearer. (correct)

What typically characterizes time series data in which Simple Moving Average (SMA) is optimal?

  • Absence of seasonality or a trend. (correct)

How does a larger window size impact the results in Simple Moving Average?

  • Reduces responsiveness to recent changes. (correct)

What distinguishes Exponential Smoothing from Simple Moving Average?

  • Exponential Smoothing gives greater weight to more recent data points. (correct)

In exponential smoothing of time series data, a high alpha results in greater what?

  • Weight to recent data (correct)

What could happen if seasonality isn't properly considered in time series data?

  • Recurring seasonal patterns can be mistaken for anomalies. (correct)

What is the main focus of Natural Language Processing (NLP)?

  • Enabling computers to understand, interpret, and generate human language. (correct)

What is the purpose of a Tokenizer?

  • Breaking down text into smaller units. (correct)

Subword tokenization breaks words into subword units in order to handle what?

  • Rare words. (correct)

When implementing SVM with soft margins, what is the primary effect of increasing the value of the tuning parameter 'C'?

  • With C defined as the budget that bounds the sum of the slack variables, it leads to a wider margin that more observations violate, increasing bias and decreasing variance. (correct)

In the context of Support Vector Machines (SVM), what is the purpose of 'slack variables'?

  • To allow individual observations to be on the wrong side of the margin or the hyperplane, thus softening the margin. (correct)

When would it be most appropriate to choose a polynomial kernel in Support Vector Machines (SVM)?

  • When there are known polynomial relationships between the features in the data. (correct)

With a small 'C' parameter, how does a support vector classifier behave, and what are its potential drawbacks?

  • It aims for a narrow margin, is highly sensitive to individual observations, and may overfit the data. (correct)

What is the impact of the number of dimensions on the computational complexity for unsupervised learning algorithms?

  • Higher dimensions typically require more computation and time, increasing the computational complexity. (correct)

What is a critical challenge specific to unsupervised learning methods when compared to supervised learning?

  • The objective evaluation of results due to the lack of labeled data (ground truth). (correct)

What strategies can be employed to address the challenges posed by high dimensionality in unsupervised learning?

  • Using dimensionality reduction, feature selection, regularization, and leveraging domain knowledge to reduce data complexity. (correct)

What is a limitation of using PCA before applying a supervised learning method?

  • PCA may not necessarily retain directions that are most useful for effective prediction in a supervised learning context. (correct)

While assessing the optimal number of clusters using the Elbow Method, how should the 'elbow' point be best interpreted?

  • The point where adding more clusters provides diminishing returns in reducing WCSS. (correct)

In K-means clustering, under which condition is the result considered to have reached a 'local optimum'?

  • When the cluster assignments no longer change after iterations. (correct)

What is a key limitation of the K-means algorithm regarding cluster shapes?

  • It assumes clusters are spherical and equally sized. (correct)

How does DBSCAN effectively identify clusters of arbitrary shapes compared to K-means?

  • By grouping together closely packed points, defining clusters based on density rather than distance to centroids. (correct)

Within DBSCAN, how is 'density' quantified to determine cluster formation?

  • By counting the number of points within a specified radius (epsilon) of a given point. (correct)

How do you determine the value for epsilon (ε) for DBSCAN?

  • All of the above (correct)

What is the primary implication of the arrangement of observations along the horizontal axis of a dendrogram?

  • It is arbitrary and does not imply similarity or dissimilarity. (correct)

In agglomerative hierarchical clustering, what is the key implication of two observations fusing together closer to the bottom of the dendrogram?

  • These observations are more similar. (correct)

What is a limitation of Hierarchical clustering?

  • Hierarchical clustering imposes a nested hierarchy on the data, even when the true groups are not well-defined or intuitively nested. (correct)

In the context of hierarchical clustering, why might Ward's method be chosen over other linkage methods?

  • Because it minimizes the variance within clusters, promoting more spherical clusters. (correct)

When preparing time-series data for Recurrent Neural Networks (RNNs), what format should the data have to work with an LSTM?

  • A 3D tensor with dimensions (batch size, time steps, features). (correct)

How does transforming text into numerical input with Word2Vec aid neural networks?

  • By mapping tokens to pretrained dense vectors that capture semantic meaning and relationships. (correct)

What is the purpose of comparing vectors with cosine similarity?

  • Measuring the angle between two high-dimensional word vectors to quantify how similar the corresponding words are. (correct)

In designing a deep neural network (DNN), what distinguishes a network considered 'deep' from one that is not?

  • Having two or more hidden layers. (correct)

Given a dataset with non-linear relationships, what is a primary benefit of using a Deep Neural Network (DNN) over a simpler model?

  • DNNs can automatically learn intricate, non-linear relationships in the data due to their increased capacity. (correct)

When training a deep neural network, what is the effect of using GPUs rather than CPUs?

  • GPUs provide greater parallel processing capability, which speeds up training. (correct)

Why can neurons that are saturated during the activation function phase lead to the vanishing gradient problem?

  • All of the above. (correct)

When would a tanh activation function be preferable?

  • When zero-centered outputs are needed for training networks. (correct)

What is a limitation when using softmax activation function?

  • Not suitable for multi-label problems, because it forces the class probabilities to sum to 1 and so favors a single class. (correct)

In the context of neural networks, what does binary cross-entropy measure?

  • How well a predicted probability between 0 and 1 matches the true binary label. (correct)

What is the key idea that helps gradient descent in SGD with momentum?

  • The model builds momentum based on speed and direction of consistent descent by considering past gradients. (correct)

What describes the action of error in back-propagation?

  • The transmission of error back into the network to improve performance. (correct)

In time series analysis, why is it important to account for seasonality?

  • Otherwise, recurring seasonal patterns may be mistaken for actual anomalies when detecting patterns. (correct)

What can a Simple Moving Average highlight?

  • Trends, when the data are noisy. (correct)

In Time Series analysis, why might exponential smoothing be preferred over simple moving average (SMA)?

  • Exponential smoothing is better for data with little seasonality. (correct)

When would a polynomial trend line be the best choice among linear, exponential, moving average, and logarithmic options for time series data?

  • For more complex patterns with fluctuations. (correct)

What is the use of Seasonal and Trend Decomposition (STL decomposition)?

  • All of the above. (correct)

What role does the integrated (I) component of the ARIMA model play?

  • It refers to differencing the data to make the series stationary (removing time trends and seasonality); its value d indicates the order of differencing. (correct)

What is TF-IDF often represented in?

  • All of the above. (correct)

What can cause the Embedding process to be limited?

  • Both A and B. (correct)

In the context of Support Vector Machines (SVM), how does the 'large margin classification' principle primarily aim to improve model performance?

  • By fitting the widest possible 'street' between different classes, thereby enhancing generalization. (correct)

When implementing Support Vector Machines (SVM) with soft margins, under what circumstances would it be most strategic to permit certain observations to violate the margin?

  • When dealing with complex, non-linear datasets or datasets containing outliers, to enhance robustness. (correct)

Considering the bias-variance tradeoff in Support Vector Classifiers, how does increasing the value of the tuning parameter 'C' influence the classifier's characteristics?

  • When C is defined as a penalty on margin violations (the inverse of the budget definition used in the Study Notes), it causes the classifier to prioritize a narrower margin, potentially leading to overfitting and lower bias. (correct)

How does the behavior of a Support Vector Classifier change with a very large 'C' value, and what implications does this have for model performance?

  • With C defined as a penalty on margin violations, the model behaves similarly to hard margin classification, with potential for overfitting on non-separable data. (correct)

When using the Radial Basis Function (RBF) kernel in Support Vector Machines (SVM), how does the kernel implicitly handle the dimensionality of the feature space?

  • By mapping the data into a higher-dimensional space, implicitly handling infinite dimensions without explicit computation. (correct)

Which statement best describes how polynomial kernels in SVM facilitate the classification of non-linear data?

  • They approximate non-linear relationships by transforming the data into a higher-dimensional space, allowing for linear separation. (correct)

What is a primary challenge associated with unsupervised learning methods, particularly concerning the assessment of model performance?

  • The difficulty in objectively evaluating outcomes due to the lack of a predefined metric or 'ground truth'. (correct)

In the context of handling high dimensionality, how does L1 Regularization address the challenges posed by the 'Curse of Dimensionality'?

  • By automatically selecting more relevant features, thus simplifying the model and improving generalization. (correct)

Within K-means clustering, how does the K-means++ initialization method aim to improve the quality of clustering compared to random initialization?

  • By strategically spreading out the initial centroids, thereby reducing the risk of suboptimal clustering. (correct)

When evaluating K-means clustering results, which interpretation of the Silhouette Score indicates a potential issue with the clustering?

  • Negative Silhouette Scores for numerous data points suggest these points may have been assigned to the wrong clusters. (correct)

In DBSCAN, what is the algorithmic approach to classifying points that fall outside dense regions and are not within the epsilon neighborhood of any core points?

  • Treating these points as noise, effectively excluding them from any cluster. (correct)

What key constraint differentiates hierarchical clustering methods from K-means clustering, particularly affecting their applicability in certain scenarios?

  • Hierarchical clustering does not require pre-specification of the number of clusters, unlike K-means. (correct)

When interpreting a dendrogram in hierarchical clustering, under what condition is it impossible to make definitive statements about the similarity of two observations?

  • When the observations are positioned close to each other along the horizontal axis of the dendrogram. (correct)

How does the backpropagation algorithm utilize the chain rule of calculus in training neural networks, and what is the significance of this process?

  • To efficiently update the parameters of each layer by relating the change in the loss function to the parameters of earlier layers. (correct)

What is a critical implication of the saturation of neurons in deep neural networks for the backpropagation process, and why does this present a significant challenge?

  • Saturated neurons lead to near-zero gradients, hindering the weight update and resulting in slower or stalled learning. (correct)

With regards to activation functions, what key advantage does the ReLU activation function provide over sigmoid or tanh functions in deep neural networks, and how does this affect training dynamics?

  • ReLU mitigates the vanishing gradient problem, enabling more effective training of deep networks by reducing saturation. (correct)

In the context of Natural Language Processing (NLP), what is the significance of word embeddings beyond simply converting words into numerical data?

  • Word embeddings establish explicit semantic relationships between words, enabling models to capture context and meaning. (correct)

When applying Simple Moving Average (SMA) to a time series, how does the selection of a larger window size affect the resulting smoothed time series, and what are its practical consequences?

  • It produces a smoother series less responsive to recent changes, potentially obscuring short-term patterns. (correct)

In time series analysis using Exponential Smoothing, how does a smoothing factor (alpha) close to 1 impact the model's sensitivity to recent data points, and what does this imply for forecasting?

  • It makes the model highly sensitive to recent data, essentially making the forecast strongly influenced by the latest observations. (correct)

What is the primary advantage of employing subword tokenization techniques in Natural Language Processing (NLP), especially when dealing with large and diverse text corpora?

  • Subword tokenization effectively handles rare words and words with similar roots, improving model generalization and vocabulary coverage. (correct)

Flashcards

Support Vector Machines (SVM)

A powerful algorithm used for classification and regression, effective in high-dimensional spaces by fitting the widest possible street between classes.

Margin

Distance from the separating hyperplane to the closest data points.

Support vectors

Data points closest to the hyperplane that influence its position and the margin.

Hard Margin Classification

Strictly assigns every observation to a class according to which side of the hyperplane it falls on, with no room for misclassification.

Soft margins

Used to allow observations to be on the incorrect side of the margin, or even the incorrect side of the hyperplane.

Slack Variables

Variables that allow individual observations to be on the wrong side of the margin or the hyperplane.

Tuning Parameter C

A tuning parameter that bounds the sum of the slack variables.

Polynomial kernel

SVM kernel designed for polynomial relationships that maps to a higher dimensional vector space.

Radial Basis Function (RBF) kernel

SVM kernel based on squared Euclidean distance, K(x, x') = exp(-γ‖x - x'‖²), where γ is a positive constant.

Unsupervised Learning

An approach in machine learning used for discovering groupings in data without labeled responses.

Clustering

A technique used to group data points into clusters or groups.

Dimensionality Reduction

Reduces the number of variables in the data while preserving essential information.

Association rule learning

Finding associations between different items or variables in a dataset.

K-means clustering

An unsupervised learning algorithm for partitioning data into K distinct, non-overlapping clusters.

Algorithm Objective

In K-means, to assign each data point to a cluster so that the variation within clusters is as small as possible.

Density Based Spatial Clustering of Applications with Noise (DBSCAN)

Finds groups based on data point density.

Density

In DBSCAN, the number of points within a specified radius (epsilon) of a given point.

Core Points

Points that have at least a minimum number of neighboring points (minPts) within a specified distance (epsilon).

Border Points

Points that fall within epsilon distance of a core point but do not have enough neighbors to be core points themselves.

Epsilon

The radius around a point within which neighboring points are counted.

Noise Points

Points that do not belong to any cluster.

Hierarchical clustering

A method of clustering that builds a hierarchy of clusters.

Agglomerative

The type of hierarchical clustering that builds from the bottom up, starting with individual records and merging them.

Leaf

An individual observation at the bottom of the dendrogram, where merging starts.

Kernel trick

Used to build a higher-dimensional, non-linear transformation of the input features implicitly, without computing the transformed features.

Directed Edges

In a Directed Acyclic Graph, each edge has a direction: it goes from one vertex (node) to another and signifies a one-way relationship.

Acyclic

Indicates that there are no cycles or closed loops within the graph.

Perceptron

A mathematical function where the input data (x) is multiplied by the weight coefficients (w) to produce an output value.

Supervised Data

Labeled data that allows the weights and bias to be adjusted during training.

Non-Linearity

Enables a model to learn non-linear relationships by transforming the outputs.

Output control

Helps constrain the output values to a specific range, as required by the task (for example, probabilities between 0 and 1).

Gradient Propagation

Provides the gradients needed for optimizing the weights during training, specifically during backpropagation.

Saturation

Saturation occurs when the output of an activation is pushed to its extreme values.

Vanishing Gradient

Where the gradients of the loss function become very small, so weights in early layers barely update.

Neural Network

Can be used for classification; large multi-layer networks can learn non-linear separation between classes.

Hidden Layers

A network is considered deep if it has two or more hidden layers.

Increased Capacity

With more layers, a DNN can model non-linear relationships.

SGD with momentum

Considers the past gradients and helps to build speed.

RMSprop

Designed to accelerate convergence by scaling each parameter's learning rate with a moving average of its squared gradients.

Adam

One of the most popular optimization algorithms in deep learning. It combines the benefits of momentum and RMSprop and maintains two moving averages (of the gradients and of the squared gradients).

Nadam

It aims to improve the performance of Adam by incorporating Nesterov momentum.

Loss functions

A measure of how well the model's predictions match the true values.

Epoch

One complete pass through the entire training dataset.

Mini-batch

A small subset of the training data used for each weight update; introduced because processing and computing over the full dataset at once is costly.

Batch Size

The number of observations processed before the weights are updated.

Loss calculation

Computes the total loss from the difference between the predictions and the true values.

Convolutional Networks

CNNs mimic to some degree how humans classify images by recognizing specific features or patterns anywhere in the image that distinguish each particular object class.

Pooling Layers

Used to condense (downsample) images and feature maps.

Recurrent Neural Networks

These allow the network to retain information from earlier steps in a sequence, so that past data can inform later predictions.

Exponential Smoothing

Technique that applies exponentially decreasing weights to older observations.

Study Notes

  • Support Vector Machines (SVM) can perform both classification (SVC) and regression (SVR), although classification is more common.
  • Think of an SVM classifier as fitting the widest possible "street" between different classes, known as large margin classification.
  • A hyperplane is a flat affine subspace with a dimension of p-1 in p-dimensional space.
  • In 2D, it's a line.
  • In 3D, it's a plane.

Definitions

  • Margin refers to the distance between the separating hyperplane (the solid line in the figure) and either margin boundary (the dashed lines).
  • Support vectors are the points that lie on the margin boundaries (the blue and purple points on the dashed lines).
  • The distance between these points and the hyperplane is indicated by arrows.
  • In a maximal margin hyperplane, if the coefficients are β0, β1, ..., βp, then the maximal margin classifier classifies a test observation x* based on the sign of f(x*) = β0 + β1x1* + β2x2* + ... + βpxp*.
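
The sign-based decision rule in the last bullet can be sketched in a few lines of Python. This is only an illustration: the coefficient values and the function names `decision_value` and `classify` are hypothetical, chosen for the example rather than fitted to any data.

```python
def decision_value(beta0, betas, x):
    """Return f(x) = beta0 + beta1*x1 + ... + betap*xp."""
    return beta0 + sum(b * xj for b, xj in zip(betas, x))


def classify(beta0, betas, x):
    """Assign class +1 or -1 according to the sign of f(x)."""
    return 1 if decision_value(beta0, betas, x) > 0 else -1


# Hypothetical hyperplane in 2D (a line): 1 + 2*x1 - 3*x2 = 0
beta0, betas = 1.0, [2.0, -3.0]

print(classify(beta0, betas, [1.0, 0.5]))  # f = 1 + 2 - 1.5 = 1.5, positive side
print(classify(beta0, betas, [0.0, 1.0]))  # f = 1 - 3 = -2, negative side
```

The magnitude of f(x*) also indicates how far the observation lies from the hyperplane, so larger absolute values correspond to more confident assignments.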

Hard Margins

  • Hard margin classification strictly assigns each observation to a class based on which side of the hyperplane it is, without allowing misclassification.
  • Hard margin classification only works if the data is linearly separable.
  • Hard margin classification is sensitive to outliers.
  • Hard margin classification is impractical in real-world applications, because real data usually contain noise, outliers, and overlapping classes that make perfect separation impossible.

Hyperplane and Support Vectors

  • A single observation can significantly change the hyperplane, because the hyperplane's position is determined by the support vectors.

Challenges to Hyperplane Calculation

  • If the optimization problem has no solution with M (margin) > 0, the classes are not separable, and a "soft" margin hyperplane is developed that almost separates the classes.
  • A classifier based on a hyperplane that does not perfectly separate the two classes can be more robust to individual observations.
  • It can also classify most of the training observations better.
  • The maximal margin classifier can't be used if the classes aren't separable by a hyperplane.

Soft Margins

  • Soft margins allow observations to be on the incorrect side of the margin, or even the incorrect side of the hyperplane.
  • An observation might be on the wrong side of the hyperplane, in addition to being on the wrong side of the margin.
  • Observations on the wrong side of the hyperplane are misclassified by the support vector classifier.

C

  • Slack variables ε1,...,εn allow observations to be on the wrong side of the margin or the hyperplane.
  • The tuning parameter C constrains the sum of the εi, which controls how many violations of the margin and hyperplane will be tolerated.
  • If εi = 0, the ith observation is on the correct side of the margin.
  • If εi > 0, the ith observation violates the margin: it lies on the wrong side of the margin.
  • If εi > 1, the ith observation is on the wrong side of the hyperplane.
  • The tuning parameter C is a hyperparameter.
  • In this formulation, where C is a budget for violations, a large C means the margin is wide, many observations violate it, and there are many support vectors. This leads to low variance but potentially high bias.
  • A small C results in a narrow margin that is rarely violated, which leads to low bias but high variance.
  • Observations are called support vectors if they lie directly on the margin or on the wrong side of it, because only these observations affect the support vector classifier.
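The slack-variable cases above can be checked numerically. The sketch below assumes the common textbook formulation in which each observation must satisfy y_i * f(x_i) >= M * (1 - ε_i), so the smallest feasible slack is max(0, 1 - y_i * f(x_i) / M). The function names and the sample values are hypothetical.

```python
def slack(y, f_x, margin):
    """Smallest slack satisfying y * f(x) >= margin * (1 - eps)."""
    return max(0.0, 1.0 - y * f_x / margin)


def describe(eps):
    """Map a slack value to the cases listed in the notes."""
    if eps == 0:
        return "correct side of the margin"
    if eps <= 1:
        return "wrong side of the margin"
    return "wrong side of the hyperplane"


# Hypothetical decision values f(x_i) for three observations of class +1,
# with margin M = 1.
margin = 1.0
values = [2.0, 0.5, -0.5]
slacks = [slack(1, f_x, margin) for f_x in values]

for f_x, eps in zip(values, slacks):
    print(f_x, eps, describe(eps))

# The budget C must be at least the total slack for this fit to be allowed.
print("total slack:", sum(slacks))
```

Note that some software libraries define their C parameter as a penalty on violations rather than a budget, in which case the direction of the bias and variance trade-off is reversed.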

Kernel Options

  • Use a polynomial kernel to model polynomial relationships between features.
  • The Radial Basis Function (Gaussian) kernel is a good default if you are unsure of the data distribution.
  • The sigmoid kernel is related to neural networks, since it applies the same kind of function that a neural network uses as an activation.
  • A polynomial kernel of degree d (with d > 1) implicitly maps the data to a higher-dimensional vector space, which creates a more flexible support vector classifier.
  • In the radial kernel, γ is a positive constant, so the exponent -γ × (squared Euclidean distance) is never positive and the kernel value lies between 0 and 1.
  • If two observations are far away from each other, the Euclidean distance is larger, so the kernel value decreases and the distant observation has little impact on the decision.
  • The enlarged feature space is implicit, and for the radial kernel it is infinite-dimensional. Computation stays feasible because only the kernel values between pairs of observations are needed.
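To make the kernel behavior concrete, here is a minimal sketch of the polynomial and radial kernels in their usual textbook forms, K(x, z) = (1 + x·z)^d and K(x, z) = exp(-γ ||x - z||²). The function names and the sample points are assumptions for illustration.

```python
import math


def polynomial_kernel(x, z, d):
    """Polynomial kernel of degree d: (1 + <x, z>) ** d."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** d


def rbf_kernel(x, z, gamma):
    """Radial kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


origin = [0.0, 0.0]
near = [1.0, 0.0]   # squared distance 1 from the origin
far = [3.0, 4.0]    # squared distance 25 from the origin

print(rbf_kernel(origin, origin, gamma=0.5))  # identical points give 1.0
print(rbf_kernel(origin, near, gamma=0.5))    # nearby point: sizeable value
print(rbf_kernel(origin, far, gamma=0.5))     # distant point: close to 0
print(polynomial_kernel([1.0, 2.0], [3.0, 4.0], d=2))
```

A larger γ makes the kernel value fall off faster with distance, so each training observation influences a smaller neighborhood.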

Unsupervised Methods

  • Unsupervised learning trains models without a dependent (response) variable.
  • The aim is to understand data and create groupings rather than predict a value or class probability.
  • Principal Component Analysis (PCA) is an unsupervised technique that can prepare data for supervised learning.
  • Clustering is the process of finding and assessing data groupings.
  • The 3 main areas of unsupervised learning include:
    • Clustering (e.g., K-means, hierarchical clustering)
    • Dimensionality Reduction (e.g., Principal Component Analysis (PCA), t-SNE)
    • Association rule learning (e.g., Apriori algorithm)
  • Many common learning tasks lack labeled data or ground truth.
    • For example, fraud detection, medical imaging, cybersecurity, Natural Language Processing, and recommender systems have this issue.
  • Analysis becomes subjective, and evaluating models is difficult without ground truth or predefined metrics.
  • It is difficult to assess results or performance, because supervised metrics such as RMSE (regression) or accuracy and precision (classification) are not available.
  • Scalability may be problematic.
    • Some algorithms are computationally intensive with large, high-dimensional data sets.
    • These algorithms can be slow and require a lot of memory.
  • Overfitting is a risk, since some methods are sensitive to their parameters or to model complexity.
  • Unsupervised learning can be sensitive to noise and outliers.
  • Another constraint is that algorithms make assumptions about the data (e.g., cluster shape or distribution).
  • The curse of dimensionality has a number of effects: increased computational complexity, sparse data, overfitting, distance metric issues, visualization challenges, and feature redundancy.
  • Techniques for reducing dimensionality include PCA, t-SNE, and LDA.
  • Feature selection can be done with filter and wrapper methods.
  • Other ways to reduce dimensionality include sampling techniques, feature engineering, and random projection.

PCA

  • Principal Component Analysis finds a low-dimensional representation of the data while retaining as much variance as possible.
  • It is useful when limited knowledge of the data makes it infeasible to select another approach.
  • PCA projects the observations onto the direction of largest variance, which is defined by a vector of loadings.
  • PCA may not always help prediction: because of its unsupervised nature, it only provides the direction that retains the most variance in the data, without regard to the response.
  • The resulting rotation of the data retains the maximum variance.
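The idea of a loading vector pointing along the direction of largest variance can be sketched without any libraries. The example below finds the first principal component by power iteration on the sample covariance matrix. This is one simple technique chosen for illustration (production code would normally use a library routine), and the function name and the toy data are assumptions.

```python
import math


def first_principal_component(data, iterations=100):
    """Return (loadings, proportion of variance explained, scores)."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]

    # Sample covariance matrix of the centered data.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]

    # Power iteration: repeatedly multiply by the covariance and renormalize.
    # Assumes the start vector is not orthogonal to the leading direction.
    v = [1.0] + [0.0] * (p - 1)
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Variance along v, compared with the total variance.
    variance_along_v = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p))
    total_variance = sum(cov[a][a] for a in range(p))

    # Scores are the projections of the centered observations onto v.
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, variance_along_v / total_variance, scores


# Toy data scattered around the line x2 = x1.
data = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
loadings, explained, scores = first_principal_component(data)
print(loadings, explained)
```

For this toy data the two loadings are equal, reflecting that the direction of largest variance runs along the diagonal.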

Clustering

  • Clustering groups similar data points into clusters.
  • The goal is to organize objects so that those in the same cluster are more similar to each other than to those in other clusters.
  • It helps discover patterns, structures, and relationships in data.

K-Means Clustering

  • K-means clustering partitions data into K distinct, non-overlapping clusters.
  • The algorithm attempts to minimize intra-cluster (within-cluster) variance.
  • The number of clusters K must be specified in advance, and each observation is assigned to exactly one of the K clusters.
  • The steps include:
    • Initialization
    • Assignment
    • Update
    • Iterate
  • A good clustering is one for which the within-cluster variation is as small as possible.
  • There are many ways to define within-cluster variation, but it is typically based on squared Euclidean distance.
  • The within-cluster variation of the kth cluster is the sum of all pairwise squared Euclidean distances between the observations in the kth cluster, divided by the number of observations in the kth cluster.
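Written out, the definition above corresponds to the following formula. The notation is an assumption introduced here: C_k is the set of observations in the kth cluster, |C_k| is its size, and x_ij is the value of feature j for observation i.

```latex
W(C_k) = \frac{1}{|C_k|} \sum_{i,\, i' \in C_k} \sum_{j=1}^{p} \left( x_{ij} - x_{i'j} \right)^2,
\qquad
\min_{C_1, \dots, C_K} \; \sum_{k=1}^{K} W(C_k)
```

The second expression states the K-means objective: choose the partition into K clusters that makes the total within-cluster variation as small as possible.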

Algorithm Steps

  • Randomly assign a number from 1 to K to each observation; these serve as the initial cluster assignments.
  • Iterate until the cluster assignments stop changing:
    • For each of the K clusters, compute the cluster centroid, which is the vector of the p feature means for the observations in the kth cluster.
    • Assign each observation to the cluster whose centroid is closest, where closest is defined using Euclidean distance.
  • The result is a local optimum, because the K-means algorithm finds a local rather than a global optimum.
  • The number of clusters is a hyperparameter; methods for choosing it include the “elbow” method and the silhouette score.
  • The elbow method plots the within-cluster sum of squares (WCSS) against values of K; the "elbow" is the point where the improvement slows down.
  • The silhouette score measures how similar a data point is to its own cluster compared with other clusters. It ranges from -1 to 1, and higher values indicate better clustering.
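The algorithm steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, an optional `labels` argument lets the caller fix the initial assignments (otherwise they are random), and an empty cluster is re-seeded with a randomly chosen point, which is one of several possible conventions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, k, labels=None, seed=0, max_iter=100):
    """Return (labels, centroids, wcss) for a simple K-means run."""
    rng = random.Random(seed)
    # Step 1: initial cluster assignments (random unless supplied).
    labels = list(labels) if labels is not None else [
        rng.randrange(k) for _ in points]

    centroids = []
    for _ in range(max_iter):
        # Step 2a: centroid = vector of feature means of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:  # assumption: re-seed an empty cluster
                members = [rng.choice(points)]
            dim = len(members[0])
            centroids.append(tuple(
                sum(m[j] for m in members) / len(members) for j in range(dim)))

        # Step 2b: assign each observation to the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels

    # Within-cluster sum of squares, the quantity plotted by the elbow method.
    wcss = sum(squared_distance(p, centroids[lab])
               for p, lab in zip(points, labels))
    return labels, centroids, wcss


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids, wcss = k_means(points, k=2, labels=[0, 1, 0, 1, 0, 1])
print(labels, wcss)
```

Because the result is only a local optimum, it is common to run the algorithm several times from different initial assignments and keep the run with the smallest WCSS. Repeating the run for several values of K and plotting WCSS against K gives the elbow plot.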

Other Hyperparameters

  • K-means is sensitive to the initial placement of the centroids, which is determined by the initialization method (centroid initialization).
  • Some common strategies:
    • Random initialization: the basic method, which can lead to suboptimal results.
    • K-Means++ initialization: spreads out the initial centroids, which improves clustering and reduces the risk of poor convergence.
  • Distance measurements are typically computed using the Euclidean distance.
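As a rough sketch of how K-means++ spreads out the initial centroids: the first centroid is picked uniformly at random, and each later one is picked with probability proportional to its squared distance from the nearest centroid chosen so far. The function name is hypothetical, and the sketch assumes the data contain at least k distinct points.

```python
import random


def kmeans_plus_plus(points, k, seed=0):
    """Choose k initial centroids that tend to be far apart."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        weights = [
            min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
            for p in points]
        # Points far from every centroid are more likely to be selected;
        # already chosen points have weight 0 and cannot be picked again.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0)]
initial = kmeans_plus_plus(points, k=3)
print(initial)
```

The chosen centroids would then replace the random starting point of the standard K-means iterations.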

DBSCAN

  • Density-based spatial clustering of applications with noise (DBSCAN) is a density-based algorithm that identifies clusters as regions where points are closely packed.
  • The algorithm can find arbitrarily shaped clusters, with outliers identified as noise.

Density

  • Density describes how much mass is packed into a given volume of a substance.
  • In DBSCAN, a dense region is one that contains many more observations than other locations.
  • DBSCAN uses core points, which lie in dense regions, to cluster the dataset.

Key Concepts:

  • Core points have at least a minimum number of neighboring points (minPts) within a specified distance (ε, epsilon).
  • Border points are within distance ε of a core point but do not have enough neighbors to be core points themselves.
  • Noise points belong to no cluster.
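The three point types can be illustrated with a short sketch. It assumes the common convention that a point counts as its own neighbor and that a core point needs at least minPts points in its ε-neighborhood. The function names and the sample data are hypothetical.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def point_types(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    neighbors = [region_query(points, i, eps) for i in range(len(points))]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    types = []
    for i, nb in enumerate(neighbors):
        if is_core[i]:
            types.append("core")
        elif any(is_core[j] for j in nb):
            types.append("border")  # near a core point, but not dense itself
        else:
            types.append("noise")
    return types


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (8, 8)]
types = point_types(points, eps=1.5, min_pts=4)
print(types)
```

In this toy data the four corner points of the unit square are core points, the point at (2.2, 1) is a border point because it is close to only one of them, and the isolated point at (8, 8) is noise.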

Pros and Cons

  • It does not require predetermining the number of clusters, and it can detect noise and outliers.
  • It works well with arbitrary shapes and has only two parameters to tune.
  • Its limitations are that it struggles with clusters of varying density and is sensitive to the parameters ε and minPts.

Applications

  • Geospatial data
  • Anomaly detection
  • Image segmentation

DBSCAN Algorithm Steps

  • Initialize the parameters ε (epsilon) and minPts.
  • For each point in the dataset, skip it if it has already been visited.
  • Find the points surrounding the current point that lie within its ε-neighborhood.
  • Check whether the current point is a core point using the following criteria:
    • If the number of points in the neighborhood (including the point itself) is at least minPts, the point is a core point and starts a new cluster.
    • If the number of points does not reach minPts, the point is marked as noise.

Expansion of Cluster

  • For any core point, recursively visit the points in its neighborhood.
  • If you visit an unvisited point, mark it as part of the existing cluster.
  • The cluster grows by visiting neighbors.
  • Border points are included in the cluster but are not expanded.
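Putting the algorithm steps and the cluster expansion together, a minimal DBSCAN sketch might look like this. It uses a queue instead of recursion for the expansion, treats a point as its own neighbor, and uses the label -1 for noise. These choices, the function name, and the sample data are assumptions for illustration.

```python
import math
from collections import deque

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point (NOISE for noise points)."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def region_query(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:  # skip points that were already visited
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:  # not a core point: mark as noise for now
            labels[i] = NOISE
            continue

        cluster += 1  # core point: start a new cluster
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # only core points are expanded
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (8, 8)]
labels = dbscan(points, eps=1.5, min_pts=4)
print(labels)
```

A point first marked as noise can later be absorbed into a cluster as a border point, which is why the noise label is treated as provisional inside the loop.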

Hyperparameter Tuning

  • Set minPts to at least D + 1, where D is the number of dimensions in the dataset.
    • Increase it so that noise does not affect the result: use a higher number for noisy data so that only truly dense regions form clusters.
    • Use a lower number for small datasets or when small groups need to be found.
  • To choose ε, compute the nearest-neighbor distances:
    • Choose k as minPts - 1.
    • Compute the distance from each point to its kth nearest neighbor.
  • Then plot the sorted distances:
    • Sort the distances, plot them, and look for the elbow; the distance at the elbow is a good starting value for ε.
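The k-distance computation described above can be sketched as follows. The function name and the toy data are assumptions, and the plotting step is left out so the example has no dependencies.

```python
import math


def sorted_k_distances(points, k):
    """Distance from each point to its kth nearest neighbor, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances to every other point (the point itself is excluded).
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(distances[k - 1])
    return sorted(result)


min_pts = 2
points = [(0, 0), (1, 0), (2, 0), (10, 0)]
k_distances = sorted_k_distances(points, k=min_pts - 1)
print(k_distances)
```

Plotting these sorted values would show a flat stretch followed by a sharp rise; the jump marks the elbow, and a value of ε just below it separates the dense points from the isolated one.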

Hierarchical Clustering

  • Organizes the data into a hierarchy of clusters.
  • Has two types:
    • Agglomerative (bottom-up): starts with individual points and merges them.
    • Divisive (top-down): starts with all points in one cluster and splits them.

Hierarchical Clustering Advantages

  • It does not need the number of clusters to be specified in advance, unlike K-means.
  • It is useful for visualizing data.
  • It is applied in areas such as genomics, market segmentation, and image analysis.

Interpreting the Dendrogram

  • Each leaf on a dendrogram is an observation.
  • Moving up the tree, leaves start to fuse.
  • Internal nodes are clusters.
  • Cutting the dendrogram at a height of nine results in two clusters.
  • Cutting at a height of five results in three clusters; the cut height plays the same role as K in K-means, controlling the number of clusters.
  • All of the clusters are nested groups.

Agglomerative Approach

  • Initialization: starts with each point in its own cluster.
  • Computes a distance matrix containing every pairwise distance.
  • This requires a chosen distance metric.

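Linkage Criteria

After each merge, the distance matrix needs to be updated according to a linkage criterion.

  • Single linkage: the distance between the two closest points of two clusters.
  • Complete linkage: the distance between the two farthest points of two clusters.
  • Average linkage: the average distance between points of two clusters; a middle ground that works with spherical clusters.
  • Ward's method: minimizes variance.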
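The linkage criteria and the agglomerative merging loop can be sketched together. This is a naive illustration that recomputes cluster distances at every step instead of maintaining a distance matrix. Ward's method is omitted, and the function names and the sample data are assumptions.

```python
import math


def single_linkage(a, b):
    """Distance between the two closest points of clusters a and b."""
    return min(math.dist(p, q) for p in a for q in b)


def complete_linkage(a, b):
    """Distance between the two farthest points of clusters a and b."""
    return max(math.dist(p, q) for p in a for q in b)


def average_linkage(a, b):
    """Average distance over all pairs of points from a and b."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def agglomerative(points, n_clusters, linkage):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # every point starts in its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


a = [(0, 0), (1, 0)]
b = [(3, 0), (5, 0)]
print(single_linkage(a, b), complete_linkage(a, b), average_linkage(a, b))

points = [(0, 0), (1, 0), (3, 0), (5, 0), (20, 0)]
clusters = agglomerative(points, n_clusters=2, linkage=single_linkage)
print(clusters)
```

Recording the linkage distance at each merge gives the heights at which branches fuse in the dendrogram.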

Divisive Approach

  • The divisive approach is less common because of its higher computational complexity.
  • Initialization: starts with all data points in a single cluster.
  • Splits: identify which cluster to split and divide it, for example with K-means.
  • Repeat the process until a stopping criterion is met.

Considerations

  • The choice of distance metric (e.g., Euclidean, Manhattan) and linkage matters: some choices handle elongated clusters but can result in broken clusters.
  • Average linkage is a balanced choice that works with spherical clusters.

Directed Acyclic Graphs

  • A neural network (NN) is made of nodes and is also a directed acyclic graph (DAG).
  • Directed: the edges have a direction, which signifies a one-way dependency.

Acyclic

The graph does not contain cycles or closed loops: following the directed edges, you cannot traverse back to the node you started from.

A computational graph can be represented as a DAG.
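One way to see the "directed" and "acyclic" properties in action is a topological sort, which orders the nodes so that every node comes after the nodes it depends on. The sketch below uses Kahn's algorithm, a standard technique chosen here for illustration; the node names describe a tiny hypothetical computational graph.

```python
from collections import deque


def topological_order(nodes, edges):
    """Return the nodes in dependency order, or None if the graph has a cycle."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for src, dst in edges:  # each edge is a one-way dependency src -> dst
        children[src].append(dst)
        indegree[dst] += 1

    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # If some nodes were never reached, they are part of a cycle.
    return order if len(order) == len(nodes) else None


# Hypothetical computational graph: out = f(x * w).
nodes = ["x", "w", "mul", "out"]
edges = [("x", "mul"), ("w", "mul"), ("mul", "out")]
print(topological_order(nodes, edges))

# Adding an edge back to the input creates a closed loop, so it is no longer a DAG.
print(topological_order(nodes, edges + [("out", "x")]))
```

The existence of such an ordering is what allows a computational graph to be evaluated one node at a time, with every input ready before it is needed.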

The Perceptron

  • The perceptron is a mathematical function that maps its inputs to an output value.
  • It can be visualized as a single-layer network.
  • It is the basic form of a neural network and is used for binary classification tasks.
  • It uses a step function as its activation.
  • During training, eta (η), the learning rate, controls how much the weights are adjusted at each update.
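A minimal sketch of a perceptron with a step activation and the classic perceptron learning rule, in which each weight changes by eta × error × input. The function names, the choice of eta = 0.5, and the use of the logical AND function as training data are assumptions for illustration.

```python
def step(z):
    """Step activation: output 1 when z is non-negative, else 0."""
    return 1 if z >= 0 else 0


def predict(weights, bias, x):
    """Weighted sum of the inputs followed by the step activation."""
    return step(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_perceptron(samples, eta=0.5, epochs=20):
    """Perceptron learning rule: nudge the weights after every mistake."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + eta * error * xi for w, xi in zip(weights, x)]
            bias += eta * error
    return weights, bias


# Binary classification task: the logical AND of two inputs.
and_samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(and_samples)
predictions = [predict(weights, bias, x) for x, _ in and_samples]
print(predictions)
```

The rule converges here because AND is linearly separable; a single perceptron cannot fit data that no straight line separates.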

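  • Nonlinearity: it enables the model to capture nonlinear patterns in the training data.
  • Output: it helps constrain the output values.
  • Gradient propagation: gradients are used during training.
  • Loss functions measure the difference between the predicted and the actual values.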

Perceptrons

  • A single perceptron cannot handle some operations; these need a more complex solution, a neural network.

Multilayer Perceptron

  • Each layer takes its inputs and applies an activation to produce its outputs.
  • Backpropagation is used to update the weights with gradient descent.

Deep Neural Networks

  • Hidden layers: a network with many hidden layers is a deep network.
  • Increased capacity: additional layers let the network model more complex data.
  • Feature abstraction: each layer learns more abstract features by combining features from the previous layers into more complex ones.
  • Backpropagation

The Need for Power and Speed

  • Parallel processing: hardware built for vector calculations.
  • It helps process large amounts of data.
  • It supports a wide range of learning tasks.
  • It can be used for fast data processing.
  • Custom architectures designed for learning.
  • Tensor operations

Batches and Epochs

  • Batches and epochs describe how the model passes through the data.
  • The model is updated with gradient descent.
  • For numeric targets, the loss is measured with MSE or MAE.
  • Binary targets, which take just two values, use a loss that compares the prediction with the true class.
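A small sketch of the two numeric loss functions named above, together with a helper that splits data into batches. The function names and the sample numbers are assumptions for illustration; one epoch corresponds to one full pass over all of the batches.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def batches(data, batch_size):
    """Yield consecutive slices of the data; the last batch may be smaller."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
print(mse(y_true, y_pred), mae(y_true, y_pred))

data = list(range(5))
for epoch in range(2):  # two epochs: two full passes over the data
    for batch in batches(data, batch_size=2):
        print(epoch, batch)
```

MSE penalizes large errors more heavily than MAE because the differences are squared before averaging.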
