Support Vector Machines and Unsupervised Methods

Questions and Answers

In Support Vector Machines (SVM), what determines whether it is used for classification (SVC) or regression (SVR)?

  • The intended purpose, either classification or regression. (correct)
  • The type of kernel used.
  • The number of dimensions in the data.
  • The size of the dataset.

What is the geometric interpretation of an SVM classifier?

  • Fitting the widest possible street between the classes. (correct)
  • Calculating the average distance between classes.
  • Finding the smallest circle that encloses all data points.
  • Fitting a line that minimizes the distance to all points.

In a p-dimensional space, what is a hyperplane?

  • A flat affine subspace of dimension p - 1. (correct)
  • A flat affine subspace of dimension p + 1.
  • A curved surface that separates the data.
  • A single point that best represents the data.

What are the support vectors in the context of Support Vector Machines?

The data points closest to the hyperplane that influence its position. (D)

What is the formula to classify a test observation $x^*$ using a maximal margin classifier, given coefficients $\beta_0, \beta_1, \ldots, \beta_p$?

$f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$, classify based on the sign of $f(x)$. (A)

What is a significant limitation of hard margin classification?

It only works if the data is linearly separable and is sensitive to outliers. (B)

In the context of Support Vector Machines (SVM), what is the primary motivation for using 'soft margins' instead of 'hard margins'?

To allow for some misclassifications, increasing robustness to outliers and non-separable data. (D)

In the context of soft margins in SVM, what do slack variables ($\epsilon_1, \ldots, \epsilon_n$) represent?

The degree to which individual observations are allowed to be on the wrong side of the margin or hyperplane. (A)

What does the tuning parameter C control in a soft margin SVM?

The trade-off between achieving a wide margin and limiting the number of margin violations. (A)

In the context of SVM, what happens when the tuning parameter C is very large?

The margin is wide and many observations violate the margin, leading to low variance but potentially high bias. (A)

Which of the following kernel options in SVM is generally considered a good default choice when there is no clear understanding of the data distribution?

Radial Basis Function (Gaussian) (A)

What is the effect of a positive constant γ in the Radial Basis Function (RBF) kernel?

Decreases the influence of distant observations. (C)

In the context of machine learning, what is the primary goal of unsupervised learning?

To understand the underlying structure of data by creating groupings without prior knowledge of class labels. (B)

Which of the following is a common approach in unsupervised learning?

Clustering (A)

Why is evaluating the success of unsupervised learning models often challenging?

Because the analysis tends to be subjective without predefined metrics or ground truth. (C)

Which of the following is a potential risk associated with unsupervised learning?

Overfitting to noise in the data, especially with methods sensitive to the number of parameters. (D)

Which of the following is a common technique to deal with high dimensionality?

Dimensionality reduction (B)

What is the primary goal of Principal Component Analysis (PCA)?

To find a low-dimensional representation of a dataset that captures as much of the variance as possible. (C)

What is the main objective of K-means clustering?

To partition data into K clusters by minimizing intra-cluster variance. (B)

In the context of K-means clustering, what is the 'Elbow Method' used for?

To determine the optimal number of clusters (K). (D)

What does a higher Silhouette Score indicate in the context of K-means clustering?

A better clustering, as data points are similar to their own cluster and dissimilar to other clusters. (C)

Which of the following statements describes a limitation of K-Means clustering?

K-Means requires the number of clusters (K) to be predefined. (A)

What is a key characteristic of DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm?

It identifies clusters based on data point density and can find arbitrary-shaped clusters. (C)

In DBSCAN, what distinguishes a 'core point' from other points?

It has at least a minimum number of neighboring points within a specified distance. (B)

Which of the following is an advantage of DBSCAN over K-Means clustering?

DBSCAN can detect noise and outliers. (A)

In Hierarchical Clustering, what is a dendrogram used for?

To visualize the hierarchy of clusters and the relationships between data points. (D)

In the context of interpreting a dendrogram, what does the vertical axis represent?

The distance or dissimilarity at which clusters are merged. (B)

What is the key difference between agglomerative and divisive hierarchical clustering?

Agglomerative starts with individual points and merges them; divisive starts with all points in one cluster and splits them. (B)

In hierarchical clustering, which linkage criterion minimizes the variance within clusters?

Ward's Method (A)

What characterizes Directed Acyclic Graphs (DAGs)?

Edges with direction and the absence of cycles. (B)

What key function did early neural networks use, also employed in Logistic Regression, to compute a probability between 0 and 1?

Sigmoid Function (D)

For what type of tasks is a perceptron primarily designed?

Binary classification tasks (B)

In the context of perceptron learning, what is backpropagation used for?

Adjusting the weights and bias based on the error between the predicted output and the true label in networks with multiple hidden layers. (B)

What happens when the argument gets smaller using ELUs (Exponential Linear Units)?

ELUs saturate to a negative value (C)

For a single perceptron, what activation function will output A AND B logical operation correctly?

Step function (D)

What logical operation cannot be handled with a single perceptron?

XOR (A)

In a neural network, what is the primary purpose of activation functions?

To introduce non-linearity, enabling the model to learn complex relationships. (A)

What distinguishes a Deep Neural Network (DNN) from a regular neural network?

It has two or more hidden layers. (A)

Why are GPUs and TPUs important for deep learning?

They accelerate the training process through parallel processing. (B)

In deep learning, what is the purpose of a loss function?

To measure the difference between the predicted output and the actual target value. (B)

In the context of neural network training, what does 'epoch' refer to?

One complete pass through the entire training dataset. (D)

When training neural networks, what is the role of mini-batches?

To approximate the gradient of the loss function using a subset of the training data. (D)

Which is likely to happen if the learning rate for the model is set very high?

Overshooting or divergence (B)

What is the purpose of "smoothing" in the context of time series analysis?

To reduce noise and make patterns clearer. (D)

What is the primary difference between a Simple Moving Average (SMA) and Exponential Smoothing?

SMA gives equal weight to all data points in the window, while exponential smoothing gives more weight to recent observations. (D)

In time series analysis, what does 'seasonality' refer to?

Recurring patterns or cycles that occur at regular intervals. (B)

In time series analysis, which type of trend line is suitable for growth patterns that level off over time?

Logarithmic (A)

In STL decomposition, what are the three main components into which a time series is broken down?

Trend, Seasonality, Residual (B)

In the context of ARIMA models, what does the 'I' component represent?

Integration, referring to differencing the data to make it stationary. (A)

In Natural Language Processing (NLP), what does 'tokenization' refer to?

Breaking down text into smaller units such as words or phrases. (B)

In Support Vector Machines (SVM), what is the significance of points that lie directly on the margin or on the wrong side of it?

They are known as support vectors and directly affect the support vector classifier. (A)

In the context of Support Vector Machines (SVM), what is the primary purpose of the tuning parameter C?

To control the bias-variance trade-off of the support vector classifier. (A)

Which statement is correct regarding the impact of a large tuning parameter C in Support Vector Machines (SVM)?

It leads to a classifier with low variance but potentially high bias. (A)

When is it most appropriate to use a Polynomial kernel in a Support Vector Machine (SVM)?

When the data has polynomial relationships between features. (B)

What is the practical implication of using the Radial Basis Function (RBF) kernel with a very large gamma (γ) value in SVM?

The model becomes highly sensitive to each training data point, potentially overfitting the data. (A)

What is a key difference between supervised and unsupervised learning?

Supervised learning requires labeled data for training, while unsupervised learning does not. (B)

Which of the following tasks is best suited for unsupervised learning?

Identifying customer segments based on purchasing behavior. (B)

What is a primary challenge in unsupervised learning compared to supervised learning?

The difficulty in evaluating the results due to a lack of ground truth. (D)

Why can unsupervised learning methods be sensitive to noise and outliers in the data?

Because these algorithms aim to find patterns, and outliers can disproportionately influence the identified patterns. (A)

Which of the following is NOT a common technique for dealing with high dimensionality?

Data Augmentation. (C)

In the context of Principal Component Analysis (PCA), what does projecting observations onto a vector with the largest variance achieve?

It results in projected observations that retain the most variance in the data. (C)

What is the most direct way to describe the central idea behind K-means clustering?

Assigning data points to clusters to minimize the intra-cluster variance. (C)

What do higher K values typically imply when using the Elbow Method to select the optimal number of clusters?

Diminishing returns in reducing the Within-Cluster Sum of Squares (WCSS). (C)

What does the Silhouette Score measure in the context of clustering?

How similar a data point is to its own cluster compared to other clusters. (C)

Which statement best describes a limitation of K-Means clustering?

It struggles with clusters that are not spherical or evenly sized. (C)

In the DBSCAN algorithm, what role do border points play in cluster formation?

They are within the neighborhood of a core point but do not have enough neighbors to be core points themselves. (D)

Which of the following is a disadvantage of DBSCAN compared to K-Means clustering?

DBSCAN struggles with clusters of varying densities. (D)

In hierarchical clustering, what does the height at which two branches merge in a dendrogram indicate?

The distance or dissimilarity between the two clusters. (A)

What does it mean when observations fuse together at the very top of a dendrogram?

The observations are quite different from each other. (C)

In agglomerative hierarchical clustering, how is the distance between clusters updated after merging two clusters?

Using linkage criteria to define how the distance between the new cluster and other clusters is computed. (C)

Which of the following linkage criteria in hierarchical clustering tends to create elongated clusters?

Single linkage. (B)

What best describes Directed Edges in Directed Acyclic Graphs (DAGs)?

They signify a one-way relationship or dependency between nodes. (B)

What does the term 'acyclic' signify in the context of Directed Acyclic Graphs (DAGs)?

The graph does not contain cycles or closed loops. (C)

What is the role of the weight coefficients in a perceptron?

They multiply the input data to determine its importance. (B)

What must be true for any node in a perceptron to generate an output?

The activation function must trigger. (B)

What does the perceptron learning rule adjust in a perceptron?

The weights and bias. (A)

In the context of neural networks, what does the term 'non-linearity' refer to?

The use of activation functions to transform linear combinations of inputs. (C)

What is the function of gradient propagation in neural networks?

To optimize the weights during training using backpropagation. (C)

What is a potential disadvantage of Leaky ReLU compared to ReLU?

It is not as standardized and requires tuning an additional hyperparameter. (A)

For what purpose is the Softmax activation function primarily used?

To convert model outputs into probability distributions for multi-class classification. (A)

Which of the following techniques can accelerate convergence when training deep learning models?

Zero-centered outputs, which help networks train faster (A)

What does the term 'epoch' refer to in the context of training neural networks?

One complete pass through the entire training dataset. (A)

Why must the mini-batch size be optimized during training?

To balance the trade-off between computational overhead and better uncovering of patterns in the data. (C)

What would incorporating momentum do?

It considers past gradients, adding a velocity term that helps the model build speed in directions of consistent descent (C)

During neural network training, what is addressed when using Batch SGD instead of Stochastic Gradient Descent?

Updates all weights after the batch (B)

In a neural network, how is the Chain rule of calculus used to calculate updates?

To compute the gradients of the loss function with respect to the weights of the earlier layers. (C)

What is a general rule with Learning Rate?

Better initial convergences and help escape local minima. (B)

What does the term 'multidimensional array' best describe?

Tensor (A)

In time series analysis, what is the purpose of applying smoothing techniques?

To make patterns like trends and seasonality clearer by reducing noise. (D)

For a time series dataset exhibiting non-linear growth patterns that gradually approach a saturation point, which trend line would be the most appropriate?

Logarithmic (D)

What is the typical purpose of STL decomposition in time series analysis?

To break down a time series into trend, seasonality, and residual components. (C)

Within the ARIMA framework, what is the approach for tuning the components of the framework?

All the choices are the correct approach (D)

What task is 'Sentiment Analysis' targeting?

Determining the sentiment or emotion behind the text. (A)

In Natural Language Processing (NLP), what role do word embeddings play?

They provide numerical representations of words that capture semantic relationships. (C)

What do you see with a Bag of Words approach?

A count of words to see frequency of terms (D)

What best describes an advantage of the skip-gram architecture?

The model is trained on skip-grams, which are n-grams that allow tokens to be skipped (D)

What is represented when applying Cosine Similarity?

Angle between two vectors (A)

In Support Vector Machines (SVM), if you want to allow some misclassifications to achieve a better fit on the majority of the data, which type of margin would be most appropriate?

Soft Margin (D)

In the context of Support Vector Machines (SVM), what is the effect of having a very small value for the tuning parameter C?

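Wider margin, with more tolerance for training errors (B). This holds when C is a penalty on margin violations, as in scikit-learn; under the budget formulation used in the study notes, where C bounds the total slack, a small C gives a narrower margin instead.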

When should you choose a Polynomial kernel over a linear kernel in Support Vector Machines (SVM)?

When the relationship between features is suspected to be polynomial (C)

In Support Vector Machines (SVM), if you're dealing with data where the true underlying distribution is unknown, which kernel is generally recommended as a first approach?

Radial Basis Function (RBF) Kernel (C)

What happens to the influence of distant observations in a Support Vector Machine (SVM) using a Radial Basis Function (RBF) kernel as the gamma (γ) parameter increases?

Their influence decreases, focusing on closer observations (A)

In unsupervised learning, what does the term 'lack of labeled data' primarily imply?

All of the above (D)

Which of the following statements captures a key challenge when using unsupervised learning methods on very large datasets?

They may become computationally intensive and impractical due to scalability concerns (C)

What is the 'curse of dimensionality', and how does it specifically impact unsupervised learning techniques?

It describes the phenomenon where data points become sparse, distance metrics become less informative and models are prone to overfitting noise. (D)

Which of the following is NOT a recognized strategy for addressing the challenges posed by high dimensionality in machine learning datasets?

Feature Expansion (A)

What is the primary reason for performing a rotation transformation in Principal Component Analysis (PCA)?

To retain the maximum possible variance in the resulting representation (D)

What does minimizing intra-cluster variance accomplish in K-means clustering?

It creates more compact and distinct clusters (B)

In K-means clustering, which of the following describes the role of the 'Assignment Step'?

Assigning each observation to the cluster with the closest centroid according to Euclidean distance (A)

Which strategy can directly address the sensitivity of K-means to initial centroid placement?

Running the algorithm multiple times with different initial centroid placements (B)

How does DBSCAN identify clusters of arbitrary shape?

By grouping points based on density (D)

In DBSCAN, what is one of the major parameters, and how is it used?

epsilon (ε), the radius of the neighborhood within which to retrieve points (B)

When employing Single Linkage in agglomerative hierarchical clustering, how is the distance between two clusters determined?

By the shortest distance between any two points in the two clusters (A)

In hierarchical clustering, what is the main advantage of using Ward's method over other linkage methods?

It minimizes the variance within clusters (B)

In the context of deep learning, what is the vanishing gradient problem and why is it significant?

A problem where gradients become extremely small during backpropagation, hindering weight updates in early layers. (D)

In neural networks, what is the main function of the Softmax activation function, and for which type of layer is it most commonly used?

To convert outputs into a probability distribution for multi-class classification, used in the output layer (A)

What is the primary method for dealing with an underperforming Learning Rate?

Setting the rate higher, to move weights quicker (D)

Flashcards

Supervised learning

A type of machine learning where the model learns from labeled data, that is, from a known dependent variable.

Unsupervised Learning

Machine learning that discovers hidden patterns without human supervision.

Hyperplane

A flat affine subspace of dimension p - 1 that separates the classes.

Margin

The distance from the hyperplane (solid line) to the dashed lines that pass through the support vectors.

Support Vectors

Data points closest to the hyperplane; affect classifier.

Hard Margin Classification

Requires every data point to be on the correct side of the margin, with no misclassifications.

Soft Margin Classification

Soft margin classification allows some points to be misclassified.

Slack Variables

Variables that permit individual observations to be on the wrong side of the margin.

Tuning Parameter C

Hyperparameter that bounds the sum of the $\epsilon_i$'s and determines the number of margin violations tolerated.

Polynomial Kernel

Use when the data has polynomial relationships between features.

Radial Basis Function (Gaussian)

A good default choice; use when there is no clear understanding of the data distribution.

Sigmoid Kernel

Use when you suspect the data behaves similarly to a neural network.

PCA

Principal Component Analysis; Reduces dimensionality.

Clustering

A method that groups data into clusters of similar objects.

K-Means Clustering

Partitions data into K clusters by minimizing intra-cluster variance.

Elbow Method

Used to find an optimal K.

Silhouette Score

Used to find the optimal K; measures how well the data is clustered.

DBSCAN

Density-based algorithm that identifies clusters of closely packed data points.

MinPts

Minimum number of neighboring points within ε required for a point to be a core point and form a cluster.

Dendrogram

Hierarchical representation of the clustering process.

Agglomerative Approach

Starts with each of the n data points as its own cluster and repeatedly merges the closest clusters.

Divisive Approach

Starts with all the data in a single cluster and repeatedly splits it into smaller clusters.

Directed Acyclic Graphs

A graph with directed edges between operation nodes and no cycles or loops.

Perceptron Function

The inputs multiplied by their weight coefficients, plus a bias value.

Deep Neural Network

A network is considered deep if it has multiple hidden layers.

Loss Functions

A measure of the difference between the predicted output of the model and the actual target value.

Epoch

One complete pass through the entire dataset.

Mini-Batch

Small, randomly selected subset of the training data.

Input Layer

Layer that takes input data.

Hidden Layers

Layers that perform the computations and apply the activation functions.

Activation Functions

Functions that introduce non-linearity, enabling the network to learn complex relationships.

Output Layer

Where the output is calculated after the weights and activation functions have been applied.

GPUs

Parallel processing.

Learning the Tuning

The tuning to change to a more efficient learning.

Minimizing the Loss

Where the loss is to be the minimum.

Small Batching

To improve learning when there are not many observations.

Weight Update Alternative

To change the update steps.

Back Propagation

Where the network propagates the error backward through its layers to update the weights.

Convolutional Neural Networks

Networks that recognize image patterns in a way that mimics human vision.

Pool Layers

Compress into a smaller image for performance.

Recurrent Neural Networks

Neural networks that can take a sequence as input.

Embeddings

Transformation of data into numerical representations.

Time Series Smooth

Tool for small changes to lower noise.

Ideal for Time Series

Clear data when trends are not there.

Exponential Smooth

Tool for an averaging and smoothing.

Patterns of External Factors.

Patterns for data

STL

Breaks a time series down into its main components: trend, seasonality, and residual.

Series Time Code

Analyzes the series.

Autocorrelations

Check if the data is stationary. Apply differencing.

NLP

Is a branch of artificial intelligence.

Text tokenizer

Breaks down text into smaller segments.

Model Input

Tool that converts unstructured text into structured input for the ML model.

Study Notes

  • This lecture covers Support Vector Machines (SVM) and unsupervised methods in Python for data analysis.
  • This lecture also introduces Deep Learning concepts.

Support Vector Machines

  • SVM can be used for both classification (SVC) and regression (SVR).
  • SVM classification is more common than regression.
  • Think of an SVM classifier as fitting the widest possible street between classes; this is called large margin classification.
  • In a p-dimensional space, a hyperplane is a flat affine subspace of dimension p - 1.
  • In two dimensions, a hyperplane is a line.
  • In three dimensions, a hyperplane is a plane.
  • The margin is the distance from the solid line to either of the dashed lines.
  • Support vectors are the points that lie on the dashed lines.
  • The distance from support vectors to the hyperplane is indicated by arrows.
  • If $\beta_0, \beta_1, \ldots, \beta_p$ are the coefficients of the maximal margin hyperplane, then the maximal margin classifier classifies the test observation $x^*$ based on the sign of $f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$.
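
The sign-based classification rule in the last bullet can be sketched in a few lines of Python. This is only an illustration, not a fitted model: the coefficient values and the helper names `decision_value` and `classify` are invented for the example.

```python
def decision_value(beta0, beta, x):
    """Return f(x) = beta0 + beta1*x1 + ... + betap*xp."""
    return beta0 + sum(b * xi for b, xi in zip(beta, x))


def classify(beta0, beta, x):
    """Assign class +1 or -1 from the sign of f(x)."""
    return 1 if decision_value(beta0, beta, x) >= 0 else -1


# Hypothetical coefficients for a hyperplane in two dimensions (a line).
beta0, beta = -1.0, [2.0, -1.0]

print(classify(beta0, beta, [2.0, 1.0]))  # f = -1 + 4 - 1 = 2, so class +1
print(classify(beta0, beta, [0.0, 3.0]))  # f = -1 + 0 - 3 = -4, so class -1
```

The magnitude of `decision_value` also indicates how far the observation lies from the hyperplane, so larger absolute values correspond to more confident class assignments.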

Hard Margins

  • Hard margin classification strictly requires every observation to be on the correct side of the margin, with no room for misclassification.
  • Hard margin classification only works if data is linearly separable.
  • Hard margin classification is sensitive to outliers.
  • Hard margin classification is often impractical because real-world data contains errors, so some misclassifications are unavoidable.

Soft Margins

  • Rather than seeking the largest possible margin so that every observation is not only on the correct side of the hyperplane but also on the correct side of the margin, allow some observations to be on the incorrect side of the margin, or even the incorrect side of the hyperplane.
  • Observations on the wrong side of the hyperplane correspond to training observations that are misclassified by the support vector classifier.
  • $\epsilon_1, \ldots, \epsilon_n$ are slack variables that allow individual observations to be on the wrong side of the margin or the hyperplane.
  • The tuning parameter C bounds the sum of the $\epsilon_i$'s, determining the number and severity of violations to the margin and hyperplane.
  • If $\epsilon_i = 0$, then the ith observation is on the correct side of the margin.
  • If $\epsilon_i > 0$, then the ith observation is on the wrong side of the margin, and we say that the ith observation has violated the margin.
  • If $\epsilon_i > 1$, then the ith observation is on the wrong side of the hyperplane.
  • Observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors.
  • These observations affect the support vector classifier.
  • C controls the bias-variance trade-off of the support vector classifier.
  • When the tuning parameter C is large, the margin is wide and many observations violate it, which creates many support vectors.
  • This classifier has low variance but potentially high bias.
  • When C is small, narrow margins are sought that are rarely violated, leading to a classifier that is highly fit to the data, with low bias but high variance.
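
The three slack cases above can be made concrete with a small sketch. It assumes the standard budget form of the constraint, $y_i f(x_i) \geq M(1 - \epsilon_i)$ with margin width $M$, which the notes imply but do not write out; the function names, the value of `margin`, and the sample observations are all invented for illustration.

```python
def slack(y, f_x, margin):
    """Smallest epsilon_i >= 0 satisfying y * f(x) >= margin * (1 - epsilon_i)."""
    return max(0.0, 1.0 - y * f_x / margin)


def describe(eps):
    """Translate a slack value into the three cases listed in the notes."""
    if eps == 0:
        return "correct side of the margin"
    if eps > 1:
        return "wrong side of the hyperplane"
    return "violates the margin"


margin = 1.0  # hypothetical margin width M
# (label y, decision value f(x)) pairs, invented for illustration
observations = [(+1, 2.5), (+1, 0.5), (-1, 0.5), (-1, -1.0)]

slacks = [slack(y, f, margin) for y, f in observations]
for (y, f), eps in zip(observations, slacks):
    print(y, f, eps, describe(eps))

# Under the budget formulation, this hyperplane is feasible only if C >= total slack.
print("total slack:", sum(slacks))
```

In this toy data the second observation sits inside the margin and the third is misclassified, so a budget C below 2.0 would rule this hyperplane out.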

Kernel Options

  • Polynomial Kernel: Used when the data has polynomial relationships between features.
  • Radial Basis Function (Gaussian) Kernel: A good default choice when there's no clear understanding of the data distribution.
  • Sigmoid Kernel: Used when it's suspected the data behaves similarly to a neural network.

Polynomial Kernel

  • The polynomial kernel has degree d, where d is a positive integer.
  • Using a kernel with d > 1, instead of the standard linear kernel, produces a more flexible decision boundary, because the classifier is implicitly fit in a higher-dimensional feature space.
  • Formula: K(xi, xj) = (xi · xj + c)^d, where xi · xj is the dot product.
  • xi and xj are feature vectors for two data points.
  • c is a constant that controls the offset of the polynomial function.
  • d is the degree of the polynomial.
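A minimal sketch of the polynomial kernel formula, written with plain Python lists. The function name and the example vectors are illustrative assumptions.

```python
def polynomial_kernel(xi, xj, c=1.0, d=2):
    """K(xi, xj) = (xi . xj + c) ** d, with xi . xj the dot product."""
    dot = sum(a * b for a, b in zip(xi, xj))
    return (dot + c) ** d


# Dot product is 1*3 + 2*4 = 11, so K = (11 + 1) ** 2.
print(polynomial_kernel([1.0, 2.0], [3.0, 4.0], c=1.0, d=2))
# With c = 0 and d = 1 the kernel reduces to the plain dot product (linear kernel).
print(polynomial_kernel([1.0, 2.0], [3.0, 4.0], c=0.0, d=1))
```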

RBF Kernel – Radial Basis Function

  • Formula: K(xi, xj) = exp(-γ||xi - xj||^2)
  • xi and xj are feature vectors for two data points.
  • ||xi - xj||^2 is the squared Euclidean distance between the two vectors.
  • γ (gamma) is a positive constant. Because the exponent is negative, a larger γ or a larger distance pushes the kernel value toward 0.
  • When two observations are far apart, the Euclidean distance is larger, which decreases the kernel value.
  • As a result, nearby (local) observations have more impact.
  • The implied feature space is implicit and infinite-dimensional, so the kernel trick is what makes the computation feasible.
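To make the locality argument concrete, here is a small sketch of the RBF kernel. The function name and test points are assumptions for illustration.

```python
import math


def rbf_kernel(xi, xj, gamma=1.0):
    """K(xi, xj) = exp(-gamma * ||xi - xj||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * squared_distance)


origin = [0.0, 0.0]
near = rbf_kernel(origin, [1.0, 0.0])   # squared distance 1
far = rbf_kernel(origin, [3.0, 0.0])    # squared distance 9
print(near, far)  # the far point contributes much less
```

Identical points give a kernel value of exactly 1, and the value decays toward 0 as points move apart, faster for larger γ.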

Unsupervised Learning

  • Focuses on understanding the structure of the data, for example by creating groupings, rather than predicting a value or the probability of a class.
  • PCA (Principal Component Analysis) is an unsupervised approach that can prepare data for supervised learning.
  • Clustering data can discover groups which can be assessed and analyzed.

Approaches in Unsupervised Learning

  • Clustering (e.g., K-means, hierarchical clustering)
  • Dimensionality Reduction (e.g., Principal Component Analysis (PCA), t-SNE)
  • Association rule learning (e.g., Apriori algorithm)

Challenges of Unsupervised Methods

  • There is a lack of labeled data aka ground truth in many domains
  • Examples include fraud detection, medical imaging, cybersecurity, Natural Language Processing, Recommender Systems etc.
  • Without predefined metrics/ground truth, analysis gets subjective, making it challenging to evaluate the model's success.
  • It's difficult to assess results or performance as we did with RMSE (as in regression), Accuracy, Precision (as in classification).
  • Scalability concerns: Some unsupervised learning algorithms can be computationally intensive, especially with large datasets/high-dimensional data.
  • Overfitting risk: There is a risk of overfitting to noise in the data, especially with methods sensitive to the parameters or model complexity.
  • Unsupervised learning methods can be sensitive to noise and outliers in data.
  • Assumptions and constraints: Many unsupervised algorithms come with assumptions about the data (e.g., cluster shape or distribution) which might not always hold true in real-world scenarios.

Curse of Dimensionality

  • Increased Computational Complexity: Higher dimensions require more computation and time.
  • Sparse Data: Data points become sparse, making it harder to find patterns.
  • Overfitting: Models can be overfit to noise due to increased complexity.
  • Distance Metrics Issues: Distance measures become less informative in high dimensions.
  • Visualization Challenges: Difficult to visualize and interpret high-dimensional data.
  • Feature Redundancy: More features can introduce irrelevant or redundant information.

How to Deal with High Dimensionality

  • Dimensionality Reduction: PCA, t-SNE, LDA
  • Feature Selection: Filter Methods (e.g., chi-square), Wrapper Methods (aka selection methods)
  • Regularization: L1 Regularization (LASSO), L2 Regularization (Ridge Regression)
  • Sampling Techniques: Feature Engineering, Random Projection
  • Algorithm Choice: Dimensionality-Aware Algorithms
  • Domain Knowledge: Feature Analysis

PCA

  • Principal Component Analysis finds a low-dimensional representation of a dataset with as much variation as possible.
  • It's effective when you lack domain knowledge or other approaches are not feasible.
  • Observations live in p-dimensional space, but not all dimensions are equally interesting.
  • Project the observations onto the direction (the loading vector) along which they have the largest variance.
  • Projecting the observations onto any other line would yield projected observations with lower variance.
  • PCA is unsupervised, so the direction PCA takes you may not always be helpful for effective prediction.
  • It only provides the direction that retains the most variance in the data.
  • It uses a rotation of the coordinate axes so that the leading axes retain the maximum variance.
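The idea of "the direction with the largest variance" can be sketched without any libraries by running power iteration on the sample covariance matrix. This is only an illustrative approach (real implementations typically use an SVD), and the function name, starting vector, and toy data are assumptions.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loading vector, scores) for the first principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration; assumes the start vector is not orthogonal to the answer.
    v = [1.0 / (j + 1) for j in range(p)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Scores are the projections of the centered observations onto v.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores


# Toy data lying exactly on the line y = x / 2.
data = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0], [8.0, 4.0]]
loading, scores = first_principal_component(data)
print(loading)  # proportional to (2, 1)
```

Because the toy points lie on a single line, the first loading vector points along that line and captures all of the variance.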

Clustering

  • Clustering is a technique used to group similar data points into clusters or groups.

K-Means Clustering

  • K-means clustering is an approach for partitioning a data set into K distinct, non-overlapping clusters.

  • Algorithm objective: Partition the data into K clusters by minimizing the intra-cluster variance (within-cluster sum of squares).

  • Minimize $\sum_{k=1}^{K} W(C_k)$ over the clusters $C_1, \ldots, C_K$.

  • First, specify the number of clusters K; then the K-means algorithm will assign each observation to exactly one of the K clusters.

  • A good clustering is one for which the within-cluster variation is as small as possible.

  • The within-cluster variation for the kth cluster is a measure $W(C_k)$ of how much the observations within the cluster differ from each other.

  • The number of clusters K is predefined by the user.

  • Squared Euclidean distance is commonly used to define $W(C_k)$, which gives the full objective:

  • $\underset{C_1,\ldots,C_K}{\text{minimize}} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right\}$

  • Algorithm Steps:

  • Randomly assign a number, from 1 to K, to each of the observations.

    • These serve as initial cluster assignments for the observations.
  • Iterate until the cluster assignments stop changing:

    • For each of the K clusters, compute the cluster centroid.
    • Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
  • When the result no longer changes, a local optimum has been reached.

  • Because the K-means algorithm finds a local rather than a global optimum, the results obtained will depend on the initial (random) cluster assignment of each observation in Step 1
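The steps above can be sketched in plain Python as follows. The function names, the fixed seed, and the toy points are assumptions for illustration. Re-seeding an empty cluster with a random point is one possible convention, not something the notes specify.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly assign a cluster number to each observation.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Step 2a: compute the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                members = [rng.choice(points)]  # assumption: re-seed empty cluster
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        # Step 2b: assign each observation to the closest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Since the result depends on the random start, a common practice is to run the procedure with several seeds and keep the run with the smallest within-cluster sum of squares.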

Key Hyperparameters - Finding K

  • K (Number of clusters) – This is the most critical hyperparameter and is predefined by the user.
  • Elbow Method: Plot the Within-Cluster Sum of Squares (WCSS) against different K values. The "elbow" point is where the improvement slows down.
  • Silhouette Score: Measures how similar a data point is to its own cluster vs. other clusters (ranges from -1 to 1). Higher values indicate better clustering.

Other Hyperparameters

  • Initialization Method (Centroid Initialization):
  • K-Means is sensitive to the initial placement of centroids
  • Common strategies
    • Random initialization: Default method but may lead to suboptimal results.
    • K-Means++ initialization: Improves clustering by spreading out the initial centroids. Reduces the risk of poor convergence.
  • It's useful to run K-means several times to account for the sensitivity to the initial placement of centroids.
  • Distance/similarity metric: Euclidean is most common

DBSCAN

  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based clustering algorithm used to identify clusters in data by grouping together points that are closely packed.
  • DBSCAN is effective for finding arbitrary-shaped clusters and identifying outliers as noise.
  • In physics, density measures how much mass is packed into a volume, describing how tightly packed matter is within an object. It is calculated by dividing the mass of the object by its volume (Density = Mass/Volume).
  • DBSCAN uses the density of a region to cluster data
  • Key Concepts:
  • Core Points: Points with a minimum number of neighboring points within a specified distance (minPts and epsilon (ε)).
  • Border Points: Points that are within ε distance of a core point but do not have enough neighbors to be considered core points.
  • Noise Points: Points that do not belong to any cluster.
  • Does not require the number of clusters to be predefined.
  • Can detect noise and outliers.
  • Outliers do not affect the model
  • It works well with clusters of arbitrary shapes.
  • There are only two parameters to tune.
  • It's sensitive to the parameters ε and minPts.
  • Algorithm Steps
  • Initialize Parameters: epsilon, minPts
  • For each point in dataset:
    • If point has already been visited, skip.
  • Determine the neighboring points of the current point:
    • Find all points within the ε distance of the current point (its ε-neighborhood).
  • Check if the current point is a core point:
    • If point has fewer than minPts neighbors, mark it as noise

DBSCAN Algorithm Steps (2)

  • Expand the cluster
    • For each core point, recursively visit all the neighboring points in its ɛ-neighborhood
    • If a neighboring point has not been visited, mark it as part of the current cluster.
    • If it's another core point, continue expanding the cluster by visiting its neighbors.
    • Border points (points with fewer neighbors than minPts but still reachable from a core point) are included in the cluster, but do not expand it further.
  • Repeat: Continue the process for unvisited points.
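The procedure above can be sketched as follows. This version uses an explicit work list instead of recursion and counts a point as a member of its own ε-neighborhood, which is a common convention but an assumption here. The names `dbscan`, `region_query`, and `NOISE` are illustrative.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # another core point: keep expanding
                queue.extend(j_neighbors)
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```

The isolated point at (50, 50) has no neighbors within ε, so it is labeled as noise rather than forced into a cluster.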

DBSCAN Hyperparameter Tuning

  • MinPts (Minimum Points)
    • Set MinPts to at least D + 1, where D is the number of dimensions in the dataset.
    • Increase MinPts as dimensionality increases to keep noise from affecting the result.
    • For noisy data, use higher values of MinPts to ensure that only dense regions form clusters.
    • If you expect small clusters, a lower MinPts may work better.
  • Epsilon (ε)
  • Compute the k-nearest-neighbor distances
    • Choose k as MinPts - 1
      • Compute the distance from each point to its kth nearest neighbor
    • Plot the sorted distances
      • Sort the distances and plot them (this is the k-distance plot)
      • Look for an elbow in the plot, which is the point where the distance starts increasing rapidly. This point is a good candidate for ε (epsilon).
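A small sketch of how the values for the k-distance plot could be computed. Plotting is left out so the example stays self-contained, and the function name and sample points are assumptions.

```python
import math


def k_distances(points, k):
    """Sorted distance from each point to its kth nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result)


# With MinPts = 2, the notes suggest k = MinPts - 1 = 1.
points = [(0.0,), (1.0,), (2.0,), (10.0,)]
print(k_distances(points, k=1))
```

In this toy example the last value jumps sharply, which is the kind of elbow the plot is meant to reveal. A value of ε just below the jump keeps the dense points together and leaves the far point as noise.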

Hierarchical Clustering

  • A clustering method that builds a hierarchy of clusters
  • Organizes data into a tree-like structure: a dendrogram
  • Types:
    • Agglomerative (Bottom-Up): Starts with individual points and merges them
    • Divisive (Top-Down): Starts with all points in one cluster and splits them
  • Advantages:
    • No need to specify the number of clusters in advance, unlike K-means
    • Useful for visualizing relationships between data points
  • Applications:
    • Genomics, market segmentation, image analysis

Interpreting the Dendrogram

  • Each leaf represents an observation; internal nodes represent merges of clusters and correspond to groups of similar observations.

  • As you move up the tree, leaves fuse into branches, and branches themselves fuse, either with leaves or with other branches.

  • The earlier (lower in the tree) fusions occur, the more similar the groups of observations are to each other.

  • For any two observations, we can look for the point in the tree where the branches containing those two observations are first fused.

  • Observations that fuse at the very bottom of the tree are quite similar.

  • Observations that fuse close to the top of the tree will tend to be quite different.

  • There are n − 1 points where fusions occur, and at each of them the two fused branches can be swapped without affecting the meaning of the dendrogram.

  • We cannot draw conclusions about the similarity between two observations based on their proximity along the horizontal axis. Rather, we draw conclusions about the similarity between two observations based on the location on the vertical axis where the branches containing those two observations are first fused.

  • In the lesson's example dendrogram, cutting at a height of nine results in two clusters, and cutting at a height of five results in three clusters. Further cuts can be made as one descends the dendrogram to obtain any number of clusters between 1 and n. The height of the cut to the dendrogram serves the same role as the K in K-means clustering: it controls the number of clusters obtained.

  • The term hierarchical refers to the fact that clusters obtained by cutting the dendrogram at a given height are nested within the clusters obtained by cutting the dendrogram at any greater height.

  • Hierarchical clustering generally produces nested groups by design, as clusters are either progressively merged or split in a hierarchical tree structure. However, there are scenarios where the visualization of the clustering process or the nature of the data may not show well-defined or intuitively nested groups.

Agglomerative Approach

  • Initialize: Start with each data point as its own cluster, so there are initially n clusters.
  • Compute Distance Matrix:
    • Calculate the pairwise distance between all clusters using a chosen distance metric (e.g., Euclidean distance).
    • Create a distance matrix (a table showing distances between every pair of clusters).
  • Merge Closest Clusters:
    • Identify the two clusters that are closest to each other based on the distance matrix.
    • Merge these two clusters into a new cluster.
  • Update Distance Matrix:
    • After merging two clusters, update the distance matrix to reflect the distance between the newly formed cluster and the remaining clusters. The way this update is done depends on the linkage criterion:
    • Single Linkage: Distance between the two closest points in the clusters.
    • Complete Linkage: Distance between the two farthest points in the clusters.
    • Average Linkage: Average of all pairwise distances between points in the clusters.
    • Ward's Method: Merges the pair of clusters that gives the smallest increase in within-cluster variance.
  • Repeat the merge and update steps until a single cluster remains.
  • Build Dendrogram: Record each merge and the distance at which it occurred.
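The merge loop can be sketched as below. For brevity this version recomputes cluster distances on every pass instead of maintaining and updating a distance matrix, and it covers single, complete, and average linkage only (not Ward's method). The function names and toy points are assumptions.

```python
import math


def cluster_distance(points, a, b, linkage):
    """Distance between clusters a and b (lists of point indices)."""
    dists = [math.dist(points[i], points[j]) for i in a for j in b]
    if linkage == "single":
        return min(dists)
    if linkage == "complete":
        return max(dists)
    if linkage == "average":
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerative(points, n_clusters, linkage="single"):
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []  # (cluster a, cluster b, merge height), enough for a dendrogram
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(points, clusters[a], clusters[b], linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
clusters, merges = agglomerative(points, n_clusters=2, linkage="single")
print(clusters)
print([height for _, _, height in merges])
```

Stopping at `n_clusters` plays the same role as cutting the dendrogram at a chosen height. Running down to a single cluster would record all n − 1 merges.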

Divisive Approach

  • Start with all the data points in one cluster, then decide how to split it (usually based on some dissimilarity measure). This can be done with techniques such as K-means or by calculating the dissimilarity between data points.

Considerations

  • Distance Metrics:
    • Euclidean distance: Most common and suitable for continuous data
    • Manhattan distance: Suitable for categorical or grid-like data
    • Cosine distance: Common for text or high-dimensional data
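For reference, the three distance measures can be written out as follows. The function names are illustrative, and cosine distance is taken here as one minus the cosine similarity.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cos(angle between a and b); ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(euclidean([0, 0], [3, 4]))        # straight-line distance
print(manhattan([0, 0], [3, 4]))        # sum of axis-aligned steps
print(cosine_distance([1, 0], [2, 0]))  # same direction, different length
```

Cosine distance treats vectors pointing the same way as identical regardless of magnitude, which is one reason it is popular for text, where document length varies widely.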

Introduction to Deep Learning

  • Neural Networks (NNs) are like Directed Acyclic Graphs (DAGs).
    • Directed Edges: Each edge has a direction meaning it goes from one vertex (node) to another. This direction signifies a one-way relationship/dependency between nodes.
    • Acyclic: Indicates that there are no cycles or closed loops.
  • A computational graph extends the DAG idea.
  • In each node (vertex) a computation takes place, which is what happens in neural networks.
  • Early neural networks used the sigmoid function, the same function used in logistic regression, where a linear equation is used to compute a probability between 0 and 1.

The Perceptron

  • The Perceptron is a mathematical function in which the input data (x) is multiplied by the weight coefficients (w) and summed, together with a bias, to give a value z.
  • It is visualized as a single-layer network.
  • A perceptron uses a step function (or Heaviside function) as its activation function, which determines the output.
  • It can be used for binary classification tasks.
  • Perceptron: y = 1 if z > 0, otherwise y = 0.

Perceptron Learning

  • A perceptron is trained using a supervised learning algorithm.
  • Backpropagation is used in multilayer networks (networks with hidden layers) to adjust the weights and bias based on the error.
  • η is the learning rate, controlling how much to adjust the weights.
  • wᵢ = wᵢ + Δwᵢ, where Δwᵢ = η(y_true − y_pred) xᵢ.
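The update rule can be tried out on a tiny problem. The sketch below trains a perceptron on the logical AND truth table. The function names, the zero initialization, and the learning rate of 1.0 (chosen so the arithmetic stays exact) are assumptions, and the bias is updated with the same rule using a constant input of 1.

```python
def predict(weights, bias, x):
    """Step activation: output 1 if z > 0, otherwise 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0


def train(samples, lr=1.0, epochs=20):
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y_true in samples:
            error = y_true - predict(weights, bias, x)
            # delta_w_i = lr * (y_true - y_pred) * x_i
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(and_gate)
print([predict(weights, bias, x) for x, _ in and_gate])
```

Because AND is linearly separable, the rule settles on weights that classify all four rows correctly. The weights only change on rows where the prediction is wrong.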

Activation Functions

  • Enable the model to learn non-linear relationships by transforming the linear combination of inputs into a non-linear output.

Purpose

  • Output Control: They help constrain the output values to a specific range, which can be beneficial for various tasks.
  • Gradient Propagation: They provide gradients needed for optimizing the weights during the training process, especially during backpropagation.

Activation Function Types

  • Sigmoid: Outputs values between 0 and 1, suitable for binary classification. It suffers from vanishing gradients when saturated, is not zero-centered because its outputs are always positive, and relies on the relatively slow exp() function.
  • Formula: σ(x) = 1/(1 + e^-x)
  • Tanh (Hyperbolic tangent): Zero-centered outputs help networks train faster, but it suffers from the vanishing gradient problem when saturated.
  • Formula: tanh(x) = (e^x - e^-x)/(e^x + e^-x)
  • ReLU (Rectified Linear Unit): Avoids vanishing gradient issues for positive inputs, is computationally efficient, and converges faster than sigmoid/tanh, but it is not zero-centered and is prone to a "dying ReLU" problem.
  • Formula: f(x) = max(0, x)
  • Leaky ReLU: All benefits of ReLU; addresses the "dying ReLU" problem by allowing a small gradient for negative inputs, which helps with training deeper networks, but it is not as standardized as ReLU.
    • Formula: f(x) = max(0.01x, x)
  • ELU (Exponential Linear Unit): All benefits of ReLU; zero-centered outputs help networks train faster. ELUs saturate to a negative value, which makes them more robust to noise, and are computationally more expensive due to the exponential operation.
    • Formula: f(x) = x if x >= 0, and f(x) = α(e^x - 1) if x < 0
  • Softmax: Used for classification problems, converting model outputs into a probability distribution.
    • Formula: P(y = j) = e^(z_j) / Σ_k e^(z_k)
    • Not suitable for multi-label classification, as it enforces that only one class can be predicted.
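The formulas above translate directly into code. This is a plain-Python sketch for single values (and a list of logits for softmax). The default slope of 0.01 for Leaky ReLU follows the formula in the notes, while α = 1.0 for ELU and the max-subtraction trick in softmax are assumptions added for illustration and numerical stability.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    return max(slope * x, x)


def elu(x, alpha=1.0):
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)


def softmax(logits):
    # Subtracting the max does not change the result but avoids overflow.
    shifted = [z - max(logits) for z in logits]
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0), relu(-2.0), leaky_relu(-1.0))
print(softmax([1.0, 2.0, 3.0]))
```

The softmax outputs always sum to 1, which is why it suits problems where exactly one class applies.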

Issues With Activation Functions

  • Saturation: Occurs when the output of an activation function is pushed to its extreme values.
  • This leads to a near-zero gradient.
  • When neurons are saturated, small changes in the input lead to very little change in the output.
  • Gradients become close to zero during backpropagation.
  • Vanishing Gradient: The gradient of the loss function gets very small.
  • This prevents the weights in early layers from updating effectively.
  • As gradients backpropagate through layers they diminish exponentially, which slows training.
  • This is problematic in deep networks.
  • Logical AND in perceptrons: both inputs are required, with a threshold of 1, to fire the neuron.
  • Logical OR in perceptrons: either of the two inputs is required, with a threshold of 1, to fire the neuron.

The operation that perceptrons can't handle: XOR

  • Exactly one of A or B has to be 1 (exclusive OR).
  • This is impossible with a single perceptron, so it can't solve XOR.
  • We need a network, a multi-layer perceptron (MLP), which allows us to find non-linear boundaries.
  • The perceptron learning rule updates the weights when the prediction is incorrect.
  • Multi-layer perceptrons (MLPs) use:
    • Activation function: introduces non-linearity
    • Output layer: produces the final prediction
    • Weights and biases: parameters learned during training
    • Loss function: measures prediction error
    • Backpropagation: updates weights using gradient descent
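To see why a hidden layer is enough for XOR, here is a tiny network with hand-picked weights (not learned). One hidden unit behaves like OR, one like AND, and the output fires when OR is on but AND is off. The specific weights and thresholds are assumptions chosen for illustration.

```python
def step(z):
    """Heaviside step activation."""
    return 1 if z > 0 else 0


def xor_mlp(a, b):
    h_or = step(a + b - 0.5)    # fires if at least one input is 1
    h_and = step(a + b - 1.5)   # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # OR but not AND


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))
```

Each hidden unit draws one straight line in the input plane. Combining the two lines yields the non-linear boundary that a single perceptron cannot produce.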

Deep Neural Networks

  • A network is considered deep if it has two or more hidden layers.
  • DNNs model more intricate, non-linear relationships in the data.
  • Each hidden layer in a DNN learns more abstract representations of the input data.
  • Weights of a DNN are updated through backpropagation and gradient descent.

Speed and Power

  • Training needs hardware acceleration to speed up convergence.
  • Use GPUs: they contain thousands of cores built for vector calculations and are supported by many deep learning frameworks.
  • TPUs are designed for machine learning workloads
    • They perform tensor operations efficiently and at high throughput

Loss Functions

  • Loss functions, also called cost functions or objective functions, measure how far our predicted values are from the actual target values.
  • Loss functions also provide a feedback mechanism for the model so the weights can be adjusted. The loss (cost) is a function of the weights W.
  • We also use the term epoch: one complete pass over the dataset. With mini-batches, each forward pass processes one batch of the data at a time.

Value Prediction

  • MSE and MAE: Most commonly used to calculate total loss
    • Mean Squared Error: MSE = (1/n) Σ (ŷᵢ − yᵢ)^2
    • Mean Absolute Error: MAE = (1/n) Σ |ŷᵢ − yᵢ|
    • Log loss: a smooth alternative that is also in use
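A short sketch of the two regression losses. The function names and sample values are illustrative.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
print(mse(y_true, y_pred), mae(y_true, y_pred))
```

The single error of 2 contributes 4 to the squared loss but only 2 to the absolute loss, which shows why MSE penalizes large errors (and outliers) more heavily than MAE.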

Binary Classification

  • Binary cross-entropy: Measures the performance of a classification model whose output has two classes.

Multi-Class Classification

  • The softmax activation function is used to output a probability for each class (j is the class, z is the logit).
  • The categorical cross-entropy is a loss calculated over all labels in the dataset. Minimizing the loss is the goal.
  • To adjust the model parameters, use gradient descent.
  • Each output is computed with the rule weight * x + b.

Mini Batches

  • With mini-batch gradient descent, the weights are updated on a per-batch basis.

Alternatives for the Weight Update Rules

  • Standard gradient descent with momentum
    • Used to accelerate convergence
  • RMSprop (Root Mean Square Propagation): adjusts the learning rate for the weights
  • Adam is one of the best-performing optimizers
  • Nadam is a more advanced variant

Back Propagation

  • Backpropagation transmits the error backward through the network, which allows the network to improve through weight adjustments.
  • Gradient descent uses the gradient of the loss function to update the weights efficiently.

Learning Rate Considerations

  • A high learning rate gives faster initial convergence but risks divergence.
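The trade-off can be seen on a toy loss. The sketch below runs gradient descent on L(w) = w², whose gradient is 2w. The loss function, starting point, and step counts are assumptions chosen only to make the behavior visible.

```python
def gradient_descent(lr, w=1.0, steps=50):
    """Minimize L(w) = w**2 with a fixed learning rate."""
    for _ in range(steps):
        grad = 2.0 * w      # dL/dw
        w -= lr * grad      # standard gradient descent update
    return w


print(gradient_descent(lr=0.1))   # shrinks steadily toward the minimum at 0
print(gradient_descent(lr=1.1))   # overshoots more each step and diverges
```

Each update multiplies w by (1 − 2·lr), so any learning rate above 1.0 makes the magnitude grow on this particular loss instead of shrinking.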

How Is the Number of Parameters Calculated?

(Number of inputs × Number of neurons) + Number of neurons
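For a fully connected layer, the count is one weight per input-neuron pair plus one bias per neuron. The sketch below applies that to a single layer and to a stack of layers. The function names and layer sizes are hypothetical.

```python
def dense_layer_params(n_inputs, n_neurons):
    """Weights (n_inputs * n_neurons) plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons


def mlp_params(layer_sizes):
    """Total parameters for a fully connected network, e.g. [4, 8, 3]."""
    return sum(dense_layer_params(a, b)
               for a, b in zip(layer_sizes, layer_sizes[1:]))


print(dense_layer_params(4, 3))   # 4*3 weights + 3 biases
print(mlp_params([4, 8, 3]))      # (4*8 + 8) + (8*3 + 3)
```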

Convolutional Neural Networks

  • They are modeled on how humans recognize images.
  • Low-level features are fed into higher-level layers that detect features such as eyes.
  • Each feature then contributes to the output.

The Convolution Filter

  • To obtain the convolved image, apply the filter to each submatrix of the image.

  • If a submatrix closely resembles the filter, the result has a larger value.

  • Pooling layers help produce condensed summaries

    • Max pooling: pick the maximum value
    • Average pooling
    • Global pooling
    • L2 pooling
  • Pooling that introduces randomness
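A minimal sketch of applying a filter to every submatrix and then max pooling the result, using nested lists. As is common in deep learning, the "convolution" here does not flip the filter (strictly speaking it is a cross-correlation). The function names, the filter, and the toy images are assumptions.

```python
def convolve2d(image, kernel):
    """Slide the kernel over every submatrix and sum the elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def max_pool(image, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    return [[max(image[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(image[0]) - size + 1, size)]
            for r in range(0, len(image) - size + 1, size)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
diagonal_filter = [[1, 0],
                   [0, 1]]
print(convolve2d(image, diagonal_filter))

grid = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(max_pool(grid, size=2))
```

Pooling shrinks the 4×4 grid to 2×2 while keeping the strongest response in each block, which is the condensed summary mentioned above.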

Recurrent Neural Networks

  • They are a kind of network
    • That takes sequences as input
    • Words in sequence convey meaning
    • The idea is to take advantage of the sequential nature of the data

RNN Advantages

  • They have memory and are ideal in areas where earlier inputs need to be tracked
  • Applications of recurrent networks
    • Documents such as news articles
    • Time series
    • Recorded speech
    • Handwriting

Recurrent Network

  • At each step, a weighted sum of the inputs it has is computed
  • Unlike other neural networks, RNNs use the same set of weights at each step
  • Feedback loops allow information from prior time steps to be included in the current one

Variants

  • LSTM: an RNN variant that addresses vanishing gradients
    • LSTMs have gates that allow the network to keep long-term memory
    • GRUs: computationally cheaper
  • Bidirectional RNNs: use both future and past time steps

Steps

1. Initialization
2. Feed the sequence
3. Compute the update at each step
4. Accumulate the loss
5. Backpropagation
6. Update weights

Embedding

  • Used to embed words into numerical form.
  • Step one is to transform the input into numerical data
  • Text: transform into numerical form
    • Categorical: transform the values into numerical form

Final Input format

  • Batch size
  • Time steps
  • Number of features

More Embedding methods

Word2Vec, GloVe, FastText, ELMo, BERT, GPT, T5, USE, XLNet

Some Limitations of Embeddings

  • Limited context: embeddings suffer when words lack proper context
  • Words may lack distinguishing embeddings if there is insufficient context
  • Homonyms will share an embedding if the context does not provide the difference
  • Simple models do not look at the context of the word
  • Dimensionality: dimensionality reduction helps
  • Training and hyperparameters affect the distinctiveness of the embeddings
  • The same surface form can have different embeddings

Tensors

  • Tensors are multidimensional arrays with four properties
  • Rank: from 0 for a scalar (e.g., a point) up to 3 (e.g., an RGB image)
  • Shape
  • Data type

Time Series - Smoothing

  • Used in time series analysis that will help reduce noise.
  • Makes long-running trends and seasonality clearer
  • It applies various weighting/averaging techniques

Simple Moving Average

  • It averages a fixed number of points that fall within a window
  • Ideal for highlighting long-term trends
  • The window size determines how many points are averaged

Exponential Smoothing

  • A technique with exponentially decreasing weights
  • Suited to series with little trend or seasonality; it gives higher weight to recent observations
Seasonalities

Trend Lines

  • Choosing the right trend line:
    • Linear
    • Exponential: for accelerating or decelerating series

STL

  • It breaks a series into its main components: trend, seasonality, and residual
    • There are additive and multiplicative decompositions
  • To decompose a time series, apply STL decomposition
  • First separate out the seasonal and residual components

ARIMA

  • Autoregressive (AR) term
    • Integrated (I) term: differencing removes the trend from a time series

Selecting

Check that the data is stationary

  • The ACF plot helps identify the MA term
  • The PACF plot helps identify the AR term
  • If the terms are not obvious, a grid search can be used
  • AIC: Akaike Information Criterion
  • BIC: Bayesian Information Criterion

NLP INTRODUCTION

  • NLP combines linguistics with machine learning, deep learning, etc.

NLP Components

  • Tokenization: breaking text down into smaller units
  • Part-of-speech tagging
  • Named Entity Recognition: identifying and classifying entities
  • Sentiment analysis
  • Text classification

Other Use Cases

  • Malicious documents
  • Incidents
  • Vulnerability
  • User behavior

NLP Timeline

  • Early: rule-based systems
  • Then: linguistics-based approaches
  • Shift to approaches that use large amounts of data

Tokenization

  • It converts text into parts/segments
  • Types:
    • Words/segments
    • Character tokens
  • It helps break text into words

Libraries for Tokenization

  • NLTK
  • spaCy
  • Hugging Face

TensorFlow Tokenizer

  • Converts text into a numerical format
