Questions and Answers
In Support Vector Machines (SVM), what fundamentally defines a 'hyperplane' in a p-dimensional space?
- A curved surface that optimally separates the data points.
- The margin that maximizes the separation between classes.
- A flat affine subspace of dimension _p_.
- A flat affine subspace of dimension _p_ - 1. (correct)
What is the primary limitation of using 'hard margin classification' with Support Vector Machines (SVM)?
- It is computationally intensive and requires significant resources.
- It requires a large amount of labeled data to achieve good performance.
- It is highly sensitive to outliers and only works if data is perfectly linearly separable. (correct)
- It tends to overfit the training data, leading to poor generalization.
In the context of Support Vector Machines (SVM), what is the significance of the tuning parameter 'C'?
- It determines the width of the margin in the SVM model.
- It bounds the sum of slack variables, influencing the tolerance for observations violating the margin. (correct)
- It sets the polynomial degree in polynomial kernel SVM.
- It controls the kernel type used in the SVM model.
What characterizes observations that are known as 'support vectors' in the context of Support Vector Machines (SVM)?
Which of the following kernel options in SVM would be most appropriate when there's no clear understanding of the data distribution?
How does the Radial Basis Function (RBF) kernel handle observations that are far away from each other in the feature space?
Principal Component Analysis (PCA) is an unsupervised method primarily used for what?
What does PCA aim to achieve when projecting observations in a p-dimensional space?
In K-means clustering, what is the primary objective of the algorithm?
What is minimized using squared Euclidean distance using the K-means algorithm?
What is the Silhouette Score used for in the context of K-means clustering?
Which of the following statements regarding the K-means algorithm is correct?
What is the purpose of the minPts parameter in the DBSCAN clustering algorithm?
Which of the following is an advantage of the DBSCAN algorithm compared to K-means clustering?
In DBSCAN, what differentiates a 'border point' from a 'core point'?
What does interpreting a dendrogram allow one to determine?
What is a key difference between agglomerative and divisive hierarchical clustering approaches?
In the context of hierarchical clustering, what does the term 'single linkage' refer to?
Which distance metric is commonly used and is most suitable for continuous variable data for clustering?
What is the primary role of activation functions in neural networks?
Which of the following is a characteristic of the Sigmoid activation function?
What is a key advantage of using the ReLU (Rectified Linear Unit) activation function?
Which activation function is most suitable for the output layer of multi-class classification models?
Which of the following describes the 'vanishing gradient' problem?
What is the primary difference between a perceptron and a multi-layer perceptron (MLP)?
What is the role of 'epochs' in the training of neural networks?
During the training of a Deep Neural Network (DNN), what is the immediate result of the 'forward-pass'?
Which loss function is most appropriate for a binary classification problem?
What is the purpose of 'backpropagation' in neural networks?
What is the advantage of using mini-batches rather than processing one observation at a time or using Batch Gradient Descent for training neural networks?
Considering that Stochastic Gradient Descent (SGD) considers the past gradients, what is the purpose of the 'velocity' term in SGD with Momentum?
In the context of time series analysis, what is the main purpose of smoothing?
What typically characterizes time series data in which Simple Moving Average (SMA) is optimal?
How does a larger window size impact the results in Simple Moving Average?
What distinguishes Exponential Smoothing from Simple Moving Average?
If a high alpha is used for time series data, this will result in greater what?
What could happen if seasonality isn't properly considered in time series data?
What is the main focus of Natural Language Processing (NLP)?
What is the purpose of a Tokenizer?
Subword Tokenization is characterized as breaking into subword units to handle what?
When implementing SVM with soft margins, what is the primary effect of increasing the value of the tuning parameter 'C'?
In the context of Support Vector Machines (SVM), what is the purpose of 'slack variables'?
When would it be most appropriate to choose a polynomial kernel in Support Vector Machines (SVM)?
With a small 'C' parameter, how does a support vector classifier behave, and what are its potential drawbacks?
What is the impact of the number of dimensions on the computational complexity for unsupervised learning algorithms?
What is a critical challenge specific to unsupervised learning methods when compared to supervised learning?
What strategies can be employed to address the challenges posed by high dimensionality in unsupervised learning?
What is a limitation of using PCA before applying a supervised learning method?
While assessing the optimal number of clusters using the Elbow Method, how should the 'elbow' point be best interpreted?
In K-means clustering, under which condition is the result considered to have reached a 'local optimum'?
What is a key limitation of the K-means algorithm regarding cluster shapes?
How does DBSCAN effectively identify clusters of arbitrary shapes compared to K-means?
Within DBSCAN, how is 'density' quantified to determine cluster formation?
How do you determine the value for epsilon (ε) for DBSCAN?
What is the primary implication of the arrangement of observations along the horizontal axis of a dendrogram?
In agglomerative hierarchical clustering, what is the key implication of two observations fusing together closer to the bottom of the dendrogram?
What is a limitation of Hierarchical clustering?
In the context of hierarchical clustering, why might Ward's method be chosen over other linkage methods?
When preparing time-series data for Recurrent Neural Networks (RNNs), what format should the data have to work with an LSTM?
How does transforming text into numerical input with NLP aid Neural Networks with Word2Vec?
What is the purpose of choosing a vector with cosine similarity?
In designing a deep neural network (DNN), what distinguishes a network considered 'deep' from one that is not?
Given a dataset with non-linear relationships, what is a primary benefit of using a Deep Neural Network (DNN) over a simpler model?
When training a deep neural network, what is the effect of using GPUs rather than CPUs?
Why can neurons that are saturated during the activation function phase lead to the vanishing gradient problem?
When would a tanh activation function be preferable?
What is a limitation when using softmax activation function?
In the context of neural networks, what does binary cross-entropy measure?
What is the key idea that helps gradient descent in SGD with momentum?
What describes the action of error in back-propagation?
In time series analysis, why is it important to account for seasonality?
What could Simple Moving Average results highlight with the trend?
In Time Series analysis, why might exponential smoothing be preferred over simple moving average (SMA)?
When would the use of a polynomial be the best approach out of: linear, exponential Moving average, and logarithmic to see the trend line in time series data?
What is the use of Seasonal and Trend Decomposition (STL decomposition)?
What role does the integer value of the ARIMA model play?
What is TF-IDF often represented in?
What can cause the Embedding process to be limited?
In the context of Support Vector Machines (SVM), how does the 'large margin classification' principle primarily aim to improve model performance?
When implementing Support Vector Machines (SVM) with soft margins, under what circumstances would it be most strategic to permit certain observations to violate the margin?
Considering the bias-variance tradeoff in Support Vector Classifiers, how does increasing the value of the tuning parameter 'C' influence the classifier's characteristics?
How does the behavior of a Support Vector Classifier change with a very large 'C' value, and what implications does this have for model performance?
When using the Radial Basis Function (RBF) kernel in Support Vector Machines (SVM), how does the kernel implicitly handle the dimensionality of the feature space?
Which statement best describes how polynomial kernels in SVM facilitate the classification of non-linear data?
What is a primary challenge associated with unsupervised learning methods, particularly concerning the assessment of model performance?
In the context of handling high dimensionality, how does L1 Regularization address the challenges posed by the 'Curse of Dimensionality'?
Within K-means clustering, how does the K-means++ initialization method aim to improve the quality of clustering compared to random initialization?
When evaluating K-means clustering results, which interpretation of the Silhouette Score indicates a potential issue with the clustering?
In DBSCAN, what is the algorithmic approach to classifying points that fall outside dense regions and are not within the epsilon neighborhood of any core points?
What key constraint differentiates hierarchical clustering methods from K-means clustering, particularly affecting their applicability in certain scenarios?
When interpreting a dendrogram in hierarchical clustering, under what condition is it impossible to make definitive statements about the similarity of two observations?
How does the backpropagation algorithm utilize the chain rule of calculus in training neural networks, and what is the significance of this process?
What is a critical implication of the saturation of neurons in deep neural networks for the backpropagation process, and why does this present a significant challenge?
With regards to activation functions, what key advantage does the ReLU activation function provide over sigmoid or tanh functions in deep neural networks, and how does this affect training dynamics?
In the context of Natural Language Processing (NLP), what is the significance of word embeddings beyond simply converting words into numerical data?
When applying Simple Moving Average (SMA) to a time series, how does the selection of a larger window size affect the resulting smoothed time series, and what are its practical consequences?
In time series analysis using Exponential Smoothing, how does a smoothing factor (alpha) close to 1 impact the model's sensitivity to recent data points, and what does this imply for forecasting?
What is the primary advantage of employing subword tokenization techniques in Natural Language Processing (NLP), especially when dealing with large and diverse text corpora?
Flashcards
Support Vector Machines (SVM)
A powerful algorithm used for classification and regression, effective in high-dimensional spaces by fitting the widest possible street between classes.
Margin
Distance from the separating hyperplane to the closest data points.
Support vectors
Data points closest to the hyperplane that influence its position and the margin.
Hard Margin Classification
Soft margins
Slack Variables
Tuning Parameter C
Polynomial kernel
Radial Basis Function (RBF) kernel
Unsupervised Learning
Clustering
Dimensionality Reduction
Association rule learning
K-means clustering
Algorithm Objective
Density Based Spatial Clustering of Applications with Noise (DBSCAN)
Density
Core Points
Border Points
Epsilon
Noise Points
Hierarchical clustering
Agglomerative
Leaf
Kernel trick
Directed Edges
Acyclic
Perceptron
Supervised Data
Non-Linearity
Output control
Gradient Propagation
Saturation
Vanishing Gradient
Neural Network
Hidden Layers
Increased Capacity
SGD with momentum
RMSprop
Adam
Nadam
Loss functions
Epoch
Mini-batch
Batch Size
Loss calculation
Convolutional Networks
Pooling Layers
Recurrent Neural Networks
Exponential Smoothing
Study Notes
- Support Vector Machines (SVM) can perform both classification (SVC) and regression (SVR), although classification is more common.
- Think of an SVM classifier as fitting the widest possible "street" between different classes, known as large margin classification.
- A hyperplane is a flat affine subspace with a dimension of p-1 in p-dimensional space.
- In 2D, it's a line.
- In 3D, it's a plane.
Definitions
- Margin refers to the distance between the solid line and either dashed line.
- Support vectors correspond to the blue and purple points on the dashed lines.
- The distance between these points and the hyperplane is indicated by arrows.
- In a maximal margin hyperplane with coefficients β0, β1, ..., βp, the maximal margin classifier classifies a test observation x* based on the sign of f(x*) = β0 + β1x1* + β2x2* + ... + βpxp*.
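The sign rule above can be sketched in a few lines of Python. This is a minimal illustration rather than a fitted model: the coefficients `beta0` and `beta`, the test points, and the function names are all made-up values chosen for the example.

```python
def decision_value(beta0, beta, x):
    """f(x) = beta0 + beta1*x1 + ... + betap*xp"""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def classify(beta0, beta, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if decision_value(beta0, beta, x) > 0 else -1


# Assumed coefficients: the hyperplane x1 + x2 - 1 = 0, which is a line in 2D.
beta0, beta = -1.0, [1.0, 1.0]

label_a = classify(beta0, beta, [2.0, 2.0])  # f = 3 > 0, so class +1
label_b = classify(beta0, beta, [0.0, 0.0])  # f = -1 < 0, so class -1
print(label_a, label_b)
```

The magnitude of `decision_value` also indicates how far a point lies from the hyperplane, so larger absolute values suggest a more confident class assignment.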
Hard Margins
- Hard margin classification strictly assigns each observation to a class based on which side of the hyperplane it is, without allowing misclassification.
- Hard margin classification only works if the data is linearly separable.
- Hard margin classification is sensitive to outliers.
- Hard margin classification is impractical in real-world applications because real data usually contains noise, outliers, and overlapping classes.
Hyperplane and Support Vectors
- A single observation can significantly change the hyperplane because the hyperplane is very sensitive to the support vectors.
Challenges to Hyperplane Calculation
- If the optimization problem has no solution with margin M > 0 (the classes are not separable), a "soft" margin hyperplane that almost separates the classes is used instead.
- A classifier based on a hyperplane that does not perfectly separate the two classes can be more robust to individual observations.
- It can also classify most of the training observations better.
- The maximal margin classifier can't be used if the classes aren't separable by a hyperplane.
Soft Margins
- Soft margins allow some observations to be on the incorrect side of the margin, or even the incorrect side of the hyperplane.
- An observation might be on the wrong side of the hyperplane, in addition to being on the wrong side of the margin.
- The support vector classifier misclassifies training observations that lie on the wrong side of the hyperplane.
C
- Slack variables ε1,...,εn allow observations to be on the wrong side of the margin or the hyperplane.
- The tuning parameter C constrains the sum of the εi values, which controls how many violations of the margin and hyperplane will be tolerated.
- If εi = 0, the ith observation is on the correct side of the margin.
- If εi > 0, the ith observation violates the margin and lies on the wrong side of the margin.
- If εi > 1, the ith observation is on the wrong side of the hyperplane.
- The tuning parameter C is a hyperparameter.
- Large C means the margin is wide, many observations violate it, and there are many support vectors, which leads to low variance but potentially high bias.
- Small C results in a narrow margin that is rarely violated, with few support vectors, which leads to low bias but high variance.
- Observations are called support vectors if they lie directly on the margin or on the wrong side of it, because only these observations affect the support vector classifier.
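The slack rules above can be made concrete with a small Python sketch. It assumes a fixed, hand-picked hyperplane with unit-length coefficients and margin M, and uses the constraint y·f(x) ≥ M(1 − ε); the data points, the helper names `slack` and `describe`, and the budget values are hypothetical.

```python
def slack(beta0, beta, x, y, margin):
    """Slack for one observation, assuming ||beta|| = 1 and y in {-1, +1}."""
    f = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return max(0.0, 1.0 - y * f / margin)


def describe(eps):
    """Translate a slack value into the position of the observation."""
    if eps == 0:
        return "correct side of the margin"
    if eps <= 1:
        return "violates the margin"
    return "wrong side of the hyperplane"


# Assumed hyperplane x1 = 0 with margin M = 1 (beta has unit length).
beta0, beta, M = 0.0, [1.0, 0.0], 1.0
data = [([2.0, 0.0], 1), ([0.5, 0.0], 1), ([-0.5, 0.0], 1), ([-3.0, 0.0], -1)]

slacks = [slack(beta0, beta, x, y, M) for x, y in data]
total_slack = sum(slacks)

for (x, y), eps in zip(data, slacks):
    print(x, y, eps, describe(eps))

# A budget C admits this hyperplane only if the slacks sum to at most C.
print("feasible for C = 2:", total_slack <= 2.0)
print("feasible for C = 1:", total_slack <= 1.0)
```

Here C acts as a budget, so a larger C tolerates more violations. Some libraries, such as scikit-learn, instead define C as a penalty on violations, in which case the relationship is reversed.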
Kernel Options
- Use a polynomial kernel when the relationship between features is polynomial.
- The Radial Basis Function (Gaussian) kernel is a good default if you are unsure of the data distribution.
- Use a sigmoid kernel when the decision boundary is expected to behave like that of a neural network.
- A polynomial kernel of degree d implicitly maps the data to a higher-dimensional space, which produces a more flexible support vector classifier.
- In the RBF kernel, γ is a positive constant, so the exponent is never positive and the kernel value shrinks toward 0 as the distance between observations grows.
- If two observations are far from each other, their Euclidean distance is large, so the kernel value is small and the distant observation has little influence on the classification decision.
- The feature space of the RBF kernel is implicit and infinite-dimensional; computation remains feasible because the kernel never computes that space explicitly.
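To illustrate how kernel values behave, here is a minimal Python sketch. It assumes the common textbook forms K(x, z) = (1 + ⟨x, z⟩)^d for the polynomial kernel and K(x, z) = exp(−γ‖x − z‖²) for the RBF kernel; the notes do not state these formulas, and the points and the γ value are made up.

```python
import math


def polynomial_kernel(x, z, degree):
    """K(x, z) = (1 + <x, z>)^d"""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** degree


def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2), with gamma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


x = [1.0, 2.0]
near = [1.5, 2.0]
far = [6.0, 9.0]

k_same = rbf_kernel(x, x, gamma=0.5)     # identical points give exactly 1.0
k_near = rbf_kernel(x, near, gamma=0.5)  # close points give a value near 1
k_far = rbf_kernel(x, far, gamma=0.5)    # distant points give a value near 0
k_poly = polynomial_kernel(x, near, degree=2)

print(k_same, k_near, k_far, k_poly)
```

Both functions work directly on the original features, which is why the higher-dimensional space never has to be built explicitly.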
Unsupervised Methods
- Unsupervised learning trains models using no dependent variable.
- The aim is to understand data and create groupings rather than predict a value or class probability.
- Principal Component Analysis (PCA) is an unsupervised technique that can prepare data for supervised learning.
- Clustering is the process of finding and assessing data groupings.
- The 3 main areas of unsupervised learning include:
- Clustering (e.g., K-means, hierarchical clustering)
- Dimensionality Reduction (e.g., Principal Component Analysis (PCA), t-SNE)
- Association rule learning (e.g., Apriori algorithm)
- Many real-world tasks lack labeled data or ground truth.
- For example, fraud detection, medical imaging, cybersecurity, Natural Language Processing, and recommender systems have this issue.
- Analysis becomes subjective, which makes evaluating models difficult without ground truth or predefined metrics.
- It is difficult to assess results or performance because there are no standard metrics such as RMSE (regression) or accuracy and precision (classification).
- Scalability may be problematic.
- Some algorithms are computationally intensive with large, high-dimensional data sets.
- These algorithms can be slow and require a lot of memory.
- Overfitting is a risk because methods can be sensitive to parameter choices and model complexity.
- Unsupervised learning can be sensitive to noise and outliers.
- Another constraint is that algorithms make assumptions about the data (e.g. cluster shape or distribution).
- The curse of dimensionality has a number of effects: increased computational complexity, sparse data, overfitting, distance metric issues, visualization challenges, and feature redundancy.
- Techniques for reducing dimensionality include PCA, t-SNE, and LDA.
- Feature selection can be done using filter and wrapper methods.
- Other ways to reduce dimensionality include sampling techniques, feature engineering, and random projection.
PCA
- Principal Component Analysis finds a low-dimensional representation of the data that retains as much variance as possible.
- It is useful when limited knowledge or methods make selecting features by other approaches infeasible.
- PCA projects observations onto the directions of largest variance, each defined by a vector of loadings.
- PCA may not always help prediction: because it is unsupervised, it only provides the directions that retain the most variance in the data, regardless of the response.
- The resulting rotation retains the maximum variance.
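As a rough illustration of finding the direction of largest variance, the sketch below estimates the first loading vector by power iteration on the sample covariance matrix. Power iteration is one possible approach, not necessarily the method the notes have in mind, and the toy data and the function name are made up. It assumes the data has non-zero variance.

```python
import math


def first_principal_component(rows, iterations=100):
    """Estimate the first loading vector by power iteration on the covariance matrix."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]

    v = [1.0] * p  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]

    # Scores are the projections of the centered observations onto the loading vector.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores


# Toy data lying exactly on the line x2 = 2 * x1.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loadings, scores = first_principal_component(rows)
print(loadings, scores)
```

For this toy data all the variance lies along the line, so the loading vector points in the direction (1, 2) scaled to unit length, and a single component captures everything.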
Clustering
- Clustering groups similar data points into clusters.
- The goal is to organize objects so that those in the same cluster are more similar to each other than to those in other clusters.
- It helps discover patterns, structures, and relationships in data.
K-Means Clustering
- K-means clustering partitions data into K distinct, non-overlapping clusters.
- The algorithm attempts to minimize intra-cluster variance.
- The number of clusters K must be specified in advance, and each observation is assigned to exactly one of the K clusters.
- The steps include:
- Initialization
- Assignment
- Update
- Iterate
- A good clustering is one for which the within-cluster variation is as small as possible.
- There are many ways to define within-cluster variation, but it is typically based on squared Euclidean distance.
- The within-cluster variation of the kth cluster is the sum of squared Euclidean distances between all pairs of observations in the kth cluster, divided by the number of observations in the kth cluster.
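Written out, the within-cluster variation and the K-means objective might look as follows. The notation is an assumption made for this sketch: C_k is the set of observations in cluster k, |C_k| is its size, x_{ij} is feature j of observation i, and the division is read as being by the number of observations in the cluster.

```latex
W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2,
\qquad
\min_{C_1, \ldots, C_K} \sum_{k=1}^{K} W(C_k)
```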
Algorithm Steps
- A number from 1 to K is randomly assigned to each observation; these serve as the initial cluster assignments.
- Iterate until the cluster assignments stop changing:
- Compute the centroid of each of the K clusters. The kth centroid is the vector of the p feature means for the observations in the kth cluster.
- Assign each observation to the cluster with the closest centroid, where closest is defined using Euclidean distance.
- The result depends on the initial assignment because the K-means algorithm finds a local rather than a global optimum.
- The number of clusters is a hyperparameter; methods for choosing it include the “elbow” method and the “silhouette” score.
- The elbow method plots the within-cluster sum of squares (WCSS) against values of k; the "elbow" is the point where the improvement slows down.
- The silhouette score measures how similar a data point is to its own cluster compared with other clusters. It ranges from -1 to 1, and higher values indicate better clustering.
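The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`sq_dist`, `kmeans`, `wcss`), the re-seeding of empty clusters, and the toy points are all choices made for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly assign a cluster number to every observation.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Step 2a: each centroid is the vector of feature means of its cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                # Assumption of this sketch: re-seed an empty cluster
                # with a randomly chosen observation.
                members = [rng.choice(points)]
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        # Step 2b: reassign every observation to its closest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

def wcss(points, labels):
    """Within-cluster sum of squares, the quantity plotted by the elbow method."""
    total = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum(sq_dist(p, center) for p in members)
    return total

# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print("labels:", labels)

# Elbow method: WCSS for several values of k.
for k in range(1, 5):
    print(k, round(wcss(points, kmeans(points, k)[0]), 3))
```

On data like this the WCSS drops sharply from k = 1 to k = 2 and changes little afterwards, which is the "elbow" the notes describe.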
Other Hyperparameters
- K-Means is sensitive to the initial placement of centroids, which is controlled by the initialization method (centroid initialization).
- Some common strategies:
- Random initialization: the default method, which can lead to suboptimal results.
- K-Means++ initialization: improves clustering by spreading out the initial centroids, which reduces the risk of poor convergence.
- Distance is typically measured using Euclidean distance.
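To make the idea of "spreading out initial centroids" concrete, here is a small sketch of K-Means++ seeding: the first centroid is drawn uniformly, and each later one is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The function name `kmeans_pp_init` and the toy data are assumptions for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, seed=0):
    rng = random.Random(seed)
    # The first centroid is picked uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from every point to its nearest chosen centroid.
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        # Far-away points get a proportionally higher chance of being picked;
        # already chosen points have weight 0 and are never picked again.
        # (This sketch assumes at least k distinct points.)
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

# Hypothetical toy data: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
init = kmeans_pp_init(points, k=2)
print("initial centroids:", init)
```

The returned points would then replace the random starting step of the K-means loop.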
DBSCAN
- Density-based spatial clustering of applications with noise (DBSCAN) identifies clusters where points are closely packed using a density-based algorithm.
- The algorithm can find arbitrarily shaped clusters, with outliers identified as noise.
Density
- Density describes how much mass is packed into a given volume of a substance.
- In DBSCAN, a dense region is one that contains many more observations than other locations.
- Core points leverage this density to cluster the dataset.
Key Concepts:
- Core points have at least a minimum number of neighboring points (minPts) within a specified distance (epsilon, ε).
- Border points are within the ε distance of a core point but do not have enough neighbors to be core points themselves.
- Noise points belong to no cluster.
Pros and Cons
- It does not require predetermining the number of clusters and can detect noise and outliers.
- It works well with arbitrary shapes and has only two parameters to tune.
- Its limitations are that it struggles with clusters of varying density and is sensitive to the parameters ε and minPts.
Applications
- Geospatial data
- Anomaly detection
- Image segmentation
DBSCAN Algorithm Steps
- Initialize the parameters epsilon (ε) and minPts.
- For each point in the dataset, skip it if it has already been visited.
- Determine the points that surround the current point, that is, the points within its ε-neighborhood.
- Check whether the current point is a core point using the following criteria:
- If the number of points in the neighborhood (including the point itself) is at least minPts, the point is a core point and starts a new cluster.
- If the number of points does not reach minPts, the point is marked as noise.
Expansion of Cluster
- For any core point, recursively visit the points in its neighborhood.
- If a visited point was previously unvisited, mark it as part of the existing cluster.
- The cluster grows by visiting neighbors.
- Border points are included in the cluster but not expanded.
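The steps and the expansion rule above can be sketched as follows. This is a simple quadratic-time illustration, assuming the common convention that a point is a core point when its ε-neighborhood (including itself) holds at least minPts points; the function names, the explicit work list in place of recursion, and the toy data are choices made for this example.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:  # skip points that were already visited
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        to_visit = list(neighbors)
        while to_visit:
            j = to_visit.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: included, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # only core points expand
                to_visit.extend(j_neighbors)
    return labels

# Hypothetical toy data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point never joins a cluster and keeps the noise label, while the number of clusters (two here) is discovered rather than specified.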
Hyperparameter Tuning
- Set minPts to at least D + 1, where D is the number of dimensions in the dataset. Use a higher value for noisy data so that only genuinely dense regions form clusters, and a lower value for small datasets or when small groups need to be found.
- To compute the nearest-neighbor distances:
- Choose k as minPts - 1.
- Compute the distance from each point to its kth nearest neighbor.
- Then plot the sorted distances:
- Sort the distances, plot them, and look for the elbow; the distance at the elbow is a good starting value for ε.
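A minimal sketch of the k-distance computation described above is shown here. It only produces the sorted distances; plotting is left out so the example needs nothing beyond the standard library. The function name `k_distances` and the toy data are assumptions for illustration.

```python
import math

def k_distances(points, k):
    """Sorted distance from every point to its kth nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        # Distances to all other points, nearest first.
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(d[k - 1])
    return sorted(result)

# Hypothetical toy data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
min_pts = 3
kd = k_distances(points, k=min_pts - 1)
for rank, d in enumerate(kd):
    print(rank, round(d, 2))
```

In the printed sequence the values stay flat for the points inside dense regions and then jump for the isolated point; an ε just above the flat part would be a reasonable starting value.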
Hierarchical Clustering
- Organizes clusters into a hierarchy.
- Has two types:
- Agglomerative (bottom-up): starts with individual points and merges them.
- Divisive (top-down): starts with all points and splits them.
Hierarchical Clustering Advantages
- It does not require specifying the number of clusters in advance, unlike K-means.
- It is useful for visualizing data.
- Applications include genomics, market segmentation, and image analysis.
Interpreting the Dendrogram
- Each leaf of a dendrogram is an observation.
- Moving up the tree, leaves start to fuse.
- Internal nodes are clusters.
- Cutting the dendrogram at a height of nine results in 2 clusters.
- Cutting at a height of five results in 3 clusters; the cut height plays the role that K plays in K-means, controlling the number of clusters.
- All of the clusters are nested groups.
Agglomerative Approach
- Initialization: each point starts as its own cluster.
- Compute a distance matrix containing every pairwise distance.
- This requires a chosen distance metric.
Linkage Criteria
The distance matrix needs to be updated after each merge, using one of the following criteria:
- Single linkage: the distance between the two closest points of two clusters.
- Complete linkage: the distance between the two farthest points of two clusters.
- Average linkage: the average distance between the points of two clusters; a middle ground that suits spherical clusters.
- Ward's method: minimizes variance.
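The linkage criteria can be compared directly in code. The sketch below implements single, complete, and average linkage with Euclidean distance and uses them in a naive agglomerative loop; Ward's method is left out to keep it short. Function names and data are assumptions for illustration, and the loop recomputes distances on every pass instead of maintaining a distance matrix.

```python
import math

def linkage_distance(a, b, method):
    """Distance between clusters a and b (lists of points)."""
    d = [math.dist(p, q) for p in a for q in b]
    if method == "single":    # two closest points
        return min(d)
    if method == "complete":  # two farthest points
        return max(d)
    if method == "average":   # mean over all cross-cluster pairs
        return sum(d) / len(d)
    raise ValueError("unknown linkage: " + method)

def agglomerative(points, n_clusters, method="average"):
    # Bottom-up: every point starts as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage_distance(
            clusters[ij[0]], clusters[ij[1]], method))
        # Merge cluster j into cluster i.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical toy data.
a = [(0, 0), (0, 1)]
b = [(5, 5), (5, 6)]
for method in ("single", "average", "complete"):
    print(method, round(linkage_distance(a, b, method), 3))

result = agglomerative(a + b + [(20, 20)], n_clusters=2)
print(result)
```

Stopping the loop at `n_clusters` corresponds to cutting the dendrogram at the height where that many branches remain.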
Divisive Approach
The divisive approach is less common because of its higher computational complexity. It starts with all data points in a single cluster and breaks it into smaller clusters.
Splits
- Identify which cluster to split.
- Split it, for example by running K-means on it.
- Repeat the process until a stopping criterion is met.
Considerations
- A distance metric, such as Euclidean or Manhattan, has to be chosen.
- The linkage choice determines how elongated clusters are handled and can result in broken clusters.
- Average linkage offers a balance and suits spherical clusters.
Directed Acyclic Graphs
- The nodes of a neural network (NN) form a directed acyclic graph (DAG).
- Directed: every edge has a direction, which signifies a one-way dependency.
Acyclic
The graph does not contain cycles or closed loops: following the directed edges never leads back to the starting node.
A computational graph can be represented as a DAG.
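One way to make the "no cycles" property concrete is a topological sort: an ordering in which every node comes after the nodes it depends on exists only when the graph is acyclic. The sketch below uses Kahn's algorithm on a tiny, hypothetical computational graph for `x * w + b`; node names and the function name are assumptions for this example.

```python
from collections import deque

def topological_order(nodes, edges):
    """Return the nodes in dependency order, or None if the graph has a cycle."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:  # each edge is one-way: src -> dst
        successors[src].append(dst)
        indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    # If some nodes were never released, they sit on a cycle.
    return order if len(order) == len(nodes) else None

# Hypothetical computational graph for x * w + b.
nodes = ["x", "w", "mul", "b", "add"]
edges = [("x", "mul"), ("w", "mul"), ("mul", "add"), ("b", "add")]
order = topological_order(nodes, edges)
print(order)  # ['x', 'w', 'b', 'mul', 'add']

# Adding a back edge creates a closed loop, so no valid order exists.
cyclic = topological_order(["a", "b"], [("a", "b"), ("b", "a")])
print(cyclic)  # None
```

The order found this way is the order in which a forward pass could evaluate the nodes.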
The Perceptron
The perceptron is a mathematical function that maps its inputs to an output value. It is visualized as a single-layer network and, in this form, handles binary classification tasks.
Perceptron
- It leverages a step function as its activation.
- Backpropagation updates the weights; eta (η), the learning rate, controls the size of each update.
- Non-linearity: enables the model to fit the training data.
- Output: helps constrain the output values.
- Gradient propagation: gradients are used during training.
- Loss functions measure the difference between the predicted and the actual values.
Perceptrons
- A single perceptron cannot handle every operation it is tested on.
- Such cases need a more complex solution: a neural network.
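As a small illustration of the step activation and the learning rate eta, the sketch below trains a single perceptron on the logical AND function with the classic perceptron update rule. The function names, the AND data, and the specific values of `eta` and `epochs` are assumptions chosen for this example.

```python
def step(z):
    """Step activation: outputs 1 when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0

def predict(weights, bias, x):
    return step(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_perceptron(samples, eta=0.1, epochs=100):
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            # eta (the learning rate) scales how far the weights move.
            weights = [w + eta * error * xi for w, xi in zip(weights, x)]
            bias += eta * error
    return weights, bias

# Logical AND: a linearly separable binary classification task.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
predictions = [predict(weights, bias, x) for x, _ in and_samples]
print(predictions)  # [0, 0, 0, 1]
```

Because AND is linearly separable, the updates stop once every sample is classified correctly; the remaining epochs leave the weights unchanged.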
Multilayer Perceptron
- Each layer takes its inputs and applies activations to them.
- Backpropagation updates the weights using gradient descent.
Deep Neural Networks
- Hidden layers: additional hidden layers make the network deeper.
- Increased capacity: more layers give the network more capacity to model the data.
- Feature abstraction: each layer learns more abstract features, which are combined to represent complex patterns.
- Backpropagation
The Need for Power and Speed
- Parallel processing: hardware built for vector calculations.
- It helps process large amounts of data.
- It supports a wide range of learning tasks.
- It can be used to process data quickly.
- Custom architectures and designs built for learning.
- Tensor operations
- high
Batches and Epochs
- The data is split into batches; an epoch is one full pass through the data.
- The model is updated through gradient descent.
- For numeric targets, the loss is measured with MSE and MAE.
-
Binary test are used for just 1 an the are and it comparas that are daptive
They gradients the
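To tie batches, epochs, gradient descent, and the MSE and MAE losses together, here is a small sketch that fits a straight line with mini-batch gradient descent. All names, the toy data (generated from y = 2x + 1), and the hyperparameter values are assumptions made for this example.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def batches(data, batch_size):
    """Yield consecutive slices of the data, one batch at a time."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

def fit_line(data, epochs=500, batch_size=2, lr=0.05):
    w, b = 0.0, 0.0
    for _ in range(epochs):                    # one epoch = one full pass
        for batch in batches(data, batch_size):
            n = len(batch)
            # Gradients of the batch MSE with respect to w and b.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
            w -= lr * grad_w                   # gradient descent update
            b -= lr * grad_b
    return w, b

# Hypothetical noise-free data from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = fit_line(data)
preds = [w * x + b for x, _ in data]
targets = [y for _, y in data]
print("w, b:", round(w, 3), round(b, 3))
print("MSE:", mse(targets, preds), "MAE:", mae(targets, preds))
```

MSE penalizes large errors more heavily than MAE because the differences are squared, which is one reason the two are often reported side by side.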