Questions and Answers
In Support Vector Machines (SVM), what determines whether it is used for classification (SVC) or regression (SVR)?
- The intended purpose, either classification or regression. (correct)
- The type of kernel used.
- The number of dimensions in the data.
- The size of the dataset.
What is the geometric interpretation of an SVM classifier?
- Fitting the widest possible street between the classes. (correct)
- Calculating the average distance between classes.
- Finding the smallest circle that encloses all data points.
- Fitting a line that minimizes the distance to all points.
In a p-dimensional space, what is a hyperplane?
- A flat affine subspace of dimension p - 1. (correct)
- A flat affine subspace of dimension p + 1.
- A curved surface that separates the data.
- A single point that best represents the data.
What are the support vectors in the context of Support Vector Machines?
What is the formula to classify a test observation $x^*$ using a maximal margin classifier, given coefficients $\beta_0, \beta_1, ..., \beta_p$?
What is a significant limitation of hard margin classification?
In the context of Support Vector Machines (SVM), what is the primary motivation for using 'soft margins' instead of 'hard margins'?
In the context of soft margins in SVM, what do slack variables (ε1, ..., εn) represent?
What does the tuning parameter C control in a soft margin SVM?
In the context of SVM, what happens when the tuning parameter C is very large?
Which of the following kernel options in SVM is generally considered a good default choice when there is no clear understanding of the data distribution?
What is the effect of a positive constant γ in the Radial Basis Function (RBF) kernel?
In the context of machine learning, what is the primary goal of unsupervised learning?
Which of the following is a common approach in unsupervised learning?
Why is evaluating the success of unsupervised learning models often challenging?
Which of the following is a potential risk associated with unsupervised learning?
Which of the following is a common technique to deal with high dimensionality?
What is the primary goal of Principal Component Analysis (PCA)?
What is the main objective of K-means clustering?
In the context of K-means clustering, what is the 'Elbow Method' used for?
What does a higher Silhouette Score indicate in the context of K-means clustering?
Which of the following statements describes a limitation of K-Means clustering?
What is a key characteristic of the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm?
In DBSCAN, what distinguishes a 'core point' from other points?
Which of the following is an advantage of DBSCAN over K-Means clustering?
In Hierarchical Clustering, what is a dendrogram used for?
In the context of interpreting a dendrogram, what does the vertical axis represent?
What is the key difference between agglomerative and divisive hierarchical clustering?
In hierarchical clustering, which linkage criterion minimizes the variance within clusters?
What characterizes Directed Acyclic Graphs (DAGs)?
What key function did early neural networks use, also employed in Logistic Regression, to compute a probability between 0 and 1?
For what type of tasks is a perceptron primarily designed?
In the context of perceptron learning, what is backpropagation used for?
What happens when the argument gets smaller using ELUs (Exponential Linear Units)?
For a single perceptron, what activation function will output the A AND B logical operation correctly?
What logical operation cannot be handled by a single perceptron?
In a neural network, what is the primary purpose of activation functions?
What distinguishes a Deep Neural Network (DNN) from a regular neural network?
Why are GPUs and TPUs important for deep learning?
In deep learning, what is the purpose of a loss function?
In the context of neural network training, what does 'epoch' refer to?
When training neural networks, what is the role of mini-batches?
Which is likely to happen if the learning rate for the model is set very high?
What is the purpose of "smoothing" in the context of time series analysis?
What is the primary difference between a Simple Moving Average (SMA) and Exponential Smoothing?
In time series analysis, what does 'seasonality' refer to?
In time series analysis, which type of trend line is suitable for growth patterns that level off over time?
In STL decomposition, what are the three main components into which a time series is broken down?
In the context of ARIMA models, what does the 'I' component represent?
In Natural Language Processing (NLP), what does 'tokenization' refer to?
In Support Vector Machines (SVM), what is the significance of points that lie directly on the margin or on the wrong side of it?
In the context of Support Vector Machines (SVM), what is the primary purpose of the tuning parameter C?
Which statement is correct regarding the impact of a large tuning parameter C in Support Vector Machines (SVM)?
When is it most appropriate to use a Polynomial kernel in a Support Vector Machine (SVM)?
What is the practical implication of using the Radial Basis Function (RBF) kernel with a very large gamma (γ) value in SVM?
What is a key difference between supervised and unsupervised learning?
Which of the following tasks is best suited for unsupervised learning?
What is a primary challenge in unsupervised learning compared to supervised learning?
Why can unsupervised learning methods be sensitive to noise and outliers in the data?
Which of the following is NOT a common technique for dealing with high dimensionality?
In the context of Principal Component Analysis (PCA), what does projecting observations onto a vector with the largest variance achieve?
What is the most direct way to describe the central idea behind K-means clustering?
What do higher K values typically imply when using the Elbow Method to select the optimal number of clusters?
What does the Silhouette Score measure in the context of clustering?
Which statement best describes a limitation of K-Means clustering?
In the DBSCAN algorithm, what role do border points play in cluster formation?
Which of the following is a disadvantage of DBSCAN compared to K-Means clustering?
In hierarchical clustering, what does the height at which two branches merge in a dendrogram indicate?
What does it mean when observations fuse together at the very top of a dendrogram?
In agglomerative hierarchical clustering, how is the distance between clusters updated after merging two clusters?
Which of the following linkage criteria in hierarchical clustering tends to create elongated clusters?
What best describes Directed Edges in Directed Acyclic Graphs (DAGs)?
What does the term 'acyclic' signify in the context of Directed Acyclic Graphs (DAGs)?
What is the role of the weight coefficients in a perceptron?
What must be true for any node in a perceptron to generate an output?
What does the perceptron learning rule adjust in a perceptron?
In the context of neural networks, what does the term 'non-linearity' refer to?
What is the function of gradient propagation in neural networks?
What is a potential disadvantage of Leaky ReLU compared to ReLU?
For what purpose is the Softmax activation function primarily used?
Which of the following techniques can accelerate convergence when training deep learning models?
What does the term 'epoch' refer to in the context of training neural networks?
Why must the mini-batch size be optimized during training?
What would incorporating momentum do?
During neural network training, what is addressed when using Batch SGD instead of Stochastic Gradient Descent?
In a neural network, how is the chain rule of calculus used to calculate updates?
What is a general rule for setting the learning rate?
What does the term 'multidimensional array' best describe?
In time series analysis, what is the purpose of applying smoothing techniques?
For a time series dataset exhibiting non-linear growth patterns that gradually approach a saturation point, which trend line would be the most appropriate?
What is the typical purpose of STL decomposition in time series analysis?
Within the ARIMA framework, what is the approach for tuning the components of the framework?
What task is 'Sentiment Analysis' targeting?
In Natural Language Processing (NLP), what role do word embeddings play?
What do you see with a Bag of Words approach?
What best describes an advantage of the skip-gram architecture?
What is represented when applying Cosine Similarity?
In Support Vector Machines (SVM), if you want to allow some misclassifications to achieve a better fit on the majority of the data, which type of margin would be most appropriate?
In the context of Support Vector Machines (SVM), what is the effect of having a very small value for the tuning parameter C?
When should you choose a Polynomial kernel over a linear kernel in Support Vector Machines (SVM)?
In Support Vector Machines (SVM), if you're dealing with data where the true underlying distribution is unknown, which kernel is generally recommended as a first approach?
What happens to the influence of distant observations in a Support Vector Machine (SVM) using a Radial Basis Function (RBF) kernel as the gamma (γ) parameter increases?
In unsupervised learning, what does the term 'lack of labeled data' primarily imply?
Which of the following statements captures a key challenge when using unsupervised learning methods on very large datasets?
What is the 'curse of dimensionality', and how does it specifically impact unsupervised learning techniques?
Which of the following is NOT a recognized strategy for addressing the challenges posed by high dimensionality in machine learning datasets?
What is the primary reason for performing a rotation transformation in Principal Component Analysis (PCA)?
What does minimizing intra-cluster variance accomplish in K-means clustering?
In K-means clustering, which of the following describes the role of the 'Assignment Step'?
Which strategy can directly address the sensitivity of K-means to initial centroid placement?
How does DBSCAN identify clusters of arbitrary shape?
In DBSCAN, what is one of the major parameters, and how is it used?
When employing Single Linkage in agglomerative hierarchical clustering, how is the distance between two clusters determined?
In hierarchical clustering, what is the main advantage of using Ward's method over other linkage methods?
In the context of deep learning, what is the vanishing gradient problem and why is it significant?
In neural networks, what is the main function of the Softmax activation function, and for which type of layer is it most commonly used?
What is the primary method for dealing with an underperforming learning rate?
Flashcards
Supervised learning
A type of machine learning where the model learns from labeled data (a known dependent variable).
Unsupervised Learning
Machine learning that discovers hidden patterns without human supervision.
Hyperplane
A flat affine subspace of dimension p − 1 used to separate data into classes.
Margin
The distance from the separating hyperplane to the nearest training observations.
Support Vectors
Observations that lie on the margin or violate it; they alone determine the position of the classifier.
Hard Margin Classification
Strictly requires every observation to be on the correct side of the margin; only works when the data is linearly separable.
Soft Margin Classification
Allows some observations to be on the wrong side of the margin (or hyperplane) in exchange for a more robust classifier.
Slack Variables
Variables ε1, ..., εn that allow individual observations to violate the margin or the hyperplane.
Tuning Parameter C
Bounds the sum of the slack variables, controlling the number and severity of margin violations and the bias-variance trade-off.
Polynomial Kernel
$K(x_i, x_j) = (x_i \cdot x_j + c)^d$; used when features have polynomial relationships.
Radial Basis Function (Gaussian)
$K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$; a good default kernel when the data distribution is unknown.
Sigmoid Kernel
Used when the data is suspected to behave similarly to a neural network.
PCA
Principal Component Analysis: finds a low-dimensional representation of the data that retains as much variance as possible.
Clustering
Grouping similar data points into clusters or groups.
K-Means Clustering
Partitions a data set into K distinct, non-overlapping clusters by minimizing within-cluster variance.
Elbow Method
Plots the within-cluster sum of squares against K; the "elbow" where improvement slows suggests the number of clusters.
Silhouette Score
Measures how similar a point is to its own cluster versus other clusters (from −1 to 1); higher indicates better clustering.
DBSCAN
Density-based clustering that groups closely packed points and labels sparse points as noise.
MinPts
DBSCAN parameter: the minimum number of neighbors within ε for a point to count as a core point.
Dendrogram
Tree diagram showing the order and height at which clusters merge in hierarchical clustering.
Agglomerative Approach
Bottom-up hierarchical clustering: start with each point as its own cluster and merge.
Divisive Approach
Top-down hierarchical clustering: start with all points in one cluster and split.
Directed Acyclic Graphs
Graphs whose edges have a direction and which contain no cycles.
Perceptron Function
Multiplies the inputs by weight coefficients and applies a step function to produce a binary output.
Deep Neural Network
A neural network with two or more hidden layers.
Loss Functions
Measure how far predictions are from the actual targets; provide the feedback used to update the weights.
Epoch
One complete pass through the training dataset.
Mini-Batch
A small subset of the training data used for one forward/backward pass and weight update.
Input Layer
The layer that receives the raw input features.
Hidden Layers
Intermediate layers that learn increasingly abstract representations of the input.
Activation Functions
Introduce non-linearity and constrain the range of a neuron's output.
Output Layer
Produces the final prediction.
GPUs
Processors with thousands of cores built for the vector calculations deep learning requires.
Learning the Tuning
Minimizing the Loss
The goal of training: adjust parameters via gradient descent to reduce the loss.
Small Batching
Updating weights after each mini-batch rather than after the full dataset.
Weight Update Alternative
Alternatives to the standard update rule: momentum, RMSProp, Adam, Nadam.
Back Propagation
Transmits the error backward through the network so the weights can be adjusted via the chain rule.
Convolutional Neural Networks
Networks that apply convolution filters so low-level features feed into higher-level features, e.g. for image recognition.
Pool Layers
Condense feature maps into summaries (max, average, global, L2 pooling).
Recurrent Neural Networks
Networks that process sequences, with feedback loops carrying information from prior time steps.
Embeddings
Dense numerical vector representations of words or categories.
Time Series Smooth
Smoothing reduces noise to make long-running trends and seasonality clearer.
Ideal for Time Series
Exponential Smooth
Exponential smoothing: weights decrease exponentially with the age of the observation.
Patterns of External Factors.
STL
Seasonal-Trend decomposition: breaks a series into trend, seasonal, and residual components.
Series Time Code
Autocorrelations
Correlation of a series with lagged copies of itself; ACF/PACF plots help select ARIMA terms.
NLP
Natural Language Processing: combines linguistics with machine learning and deep learning.
Text tokenizer
Breaks text into smaller units such as words, segments, or characters.
Model Input
The final input format for sequence models: (batch, time steps, number of features).
Study Notes
- This lecture covers Support Vector Machines (SVM) and unsupervised methods in Python for data analysis.
- This lecture also introduces Deep Learning concepts.
Support Vector Machines
- SVM can be used for both classification (SVC) and regression (SVR).
- SVM classification is more common than regression.
- Think of an SVM classifier as fitting the widest possible street between classes; called large margin classification.
- In a p-dimensional space, a hyperplane is a flat affine subspace of dimension p - 1.
- In two dimensions, a hyperplane is a line.
- In three dimensions, a hyperplane is a plane.
- The margin is the distance from the separating hyperplane (the solid line) to either of the dashed lines that pass through the nearest observations.
- Support vectors are the points that lie on the dashed margin lines.
- The distance from support vectors to the hyperplane is indicated by arrows.
- If $\beta_0, \beta_1, \dots, \beta_p$ are the coefficients of the maximal margin hyperplane, the maximal margin classifier classifies a test observation $x^*$ based on the sign of $f(x^*) = \beta_0 + \beta_1 x_1^* + \beta_2 x_2^* + \cdots + \beta_p x_p^*$.
Hard Margins
- Strictly imposing every observation is assigned a class; no room for misclassification
- Hard margin classification only works if data is linearly separable.
- Hard margin classification is sensitive to outliers.
- Hard margin classification is impractical because real-world data has misclassifications (errors).
Soft Margins
- Rather than seeking the largest possible margin so that every observation is not only on the correct side of the hyperplane but also on the correct side of the margin, allow some observations to be on the incorrect side of the margin, or even the incorrect side of the hyperplane.
- Observations on the wrong side of the hyperplane correspond to training observations that are misclassified by the support vector classifier.
- ε1, ..., εn are slack variables that allow individual observations to be on the wrong side of the margin or the hyperplane.
- The tuning parameter C bounds the sum of the εi's, determining the number and severity of violations to the margin/hyperplane.
- If εi = 0 then the ith observation is on the correct side of the margin.
- If εi > 0 then the ith observation is on the wrong side of the margin, and we say that the ith observation has violated the margin.
- If εi > 1 then it is on the wrong side of the hyperplane.
- Observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors.
- These observations affect the support vector classifier.
- C controls the bias-variance trade-off of the support vector classifier.
- When the tuning parameter C is large, the margin is wide and many observations violate it, so there are many support vectors.
- This classifier has low variance but potentially high bias.
- When C is small, narrow margins that are rarely violated are sought, leading to a classifier that is highly fit to the data, with low bias but high variance.
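A minimal scikit-learn sketch of soft-margin classification (assumes scikit-learn is installed; note that sklearn's `C` is a penalty on violations, roughly the inverse of the budget C described above, so large values give narrow, tightly fit margins with fewer support vectors):

```python
# Sketch: how the soft-margin penalty C changes the number of support vectors.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # A larger penalty tolerates fewer violations -> fewer support vectors.
    print(f"C={C}: {len(clf.support_)} support vectors")
```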
Kernel Options
- Polynomial Kernel: Used when the data has polynomial relationships between features.
- Radial Basis Function (Gaussian) Kernel: A good default choice when there's no clear understanding of the data distribution.
- Sigmoid Kernel: Used when it's suspected the data behaves similarly to a neural network.
Polynomial Kernel
- The polynomial kernel of degree d, where d is a positive integer.
- Using a kernel with d > 1, instead of the standard linear kernel, produces a more flexible support vector classifier by implicitly mapping the data into a higher-dimensional feature space.
- Formula: $K(x_i, x_j) = (x_i \cdot x_j + c)^d$
- xi and xj are feature vectors for two data points.
- c is a constant that controls the offset of the polynomial function.
- d is the degree of the polynomial.
RBF Kernel – Radial Basis Function
- Formula: K(xi, xj) = exp(-γ||xi – xj||^2)
- xi and xj are feature vectors for two data points.
- ||xi - xj||^2 is the squared Euclidean distance between the two vectors.
- γ (gamma) is a positive constant; because $e$ raised to a large negative number is very small, larger γ shrinks the kernel value more quickly as distance grows.
- When two observations are far away, the Euclidean distance is larger, decreasing the value.
- Local observations have more impact.
- Dimensions are implicit or infinite, therefore the kernel trick makes computation feasible.
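A small NumPy sketch of the two kernel formulas above (function names and test vectors are illustrative):

```python
import numpy as np

def polynomial_kernel(xi, xj, c=1.0, d=3):
    # K(xi, xj) = (xi . xj + c)^d
    return (np.dot(xi, xj) + c) ** d

def rbf_kernel(xi, xj, gamma=0.5):
    # K(xi, xj) = exp(-gamma * ||xi - xj||^2)
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

a, b = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(polynomial_kernel(a, b), rbf_kernel(a, b))
print(rbf_kernel(a, b, gamma=10.0))  # larger gamma: distant points contribute ~0
```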
Unsupervised Learning
- Focuses on understanding what the data means by creating groupings, rather than predicting a value or the probability of a class.
- PCA (Principal Component Analysis) is an unsupervised approach that can prepare data for supervised learning.
- Clustering data can discover groups which can be assessed and analyzed.
Approaches in Unsupervised Learning
- Clustering (e.g., K-means, hierarchical clustering)
- Dimensionality Reduction (e.g., Principal Component Analysis (PCA), t-SNE)
- Association rule learning (e.g., Apriori algorithm)
Challenges of Unsupervised Methods
- There is a lack of labeled data aka ground truth in many domains
- Examples include fraud detection, medical imaging, cybersecurity, Natural Language Processing, Recommender Systems etc.
- Without predefined metrics/ground truth, analysis gets subjective, making it challenging to evaluate the model's success.
- It's difficult to assess results or performance as we did with RMSE (as in regression), Accuracy, Precision (as in classification).
- Scalability concerns: Some unsupervised learning algorithms can be computationally intensive, especially with large datasets/high-dimensional data.
- Overfitting risk: There is a risk of overfitting to noise in the data, especially with methods sensitive to the parameters or model complexity.
- Unsupervised learning methods can be sensitive to noise and outliers in data.
- Assumptions and constraints: Many unsupervised algorithms come with assumptions about the data (e.g., cluster shape or distribution) which might not always hold true in real-world scenarios.
Curse of Dimensionality
- Increased Computational Complexity: Higher dimensions require more computation and time.
- Sparse Data: Data points become sparse, making it harder to find patterns.
- Overfitting: Models can be overfit to noise due to increased complexity.
- Distance Metrics Issues: Distance measures become less informative in high dimensions.
- Visualization Challenges: Difficult to visualize and interpret high-dimensional data.
- Feature Redundancy: More features can introduce irrelevant or redundant information.
How to Deal with High Dimensionality
- Dimensionality Reduction: PCA, t-SNE, LDA
- Feature Selection: Filter Methods (e.g., chi-square), Wrapper Methods (aka selection methods)
- Regularization: L1 Regularization (LASSO), L2 Regularization (Ridge Regression)
- Sampling Techniques: Feature Engineering, Random Projection
- Algorithm Choice: Dimensionality-Aware Algorithms
- Domain Knowledge: Feature Analysis
PCA
- Principal Component Analysis finds a low-dimensional representation of a dataset with as much variation as possible.
- It's effective when you lack domain knowledge or other approaches are not feasible.
- Observations live in p-dimensional space, but not all dimensions are equally interesting.
- Project observations onto a vector (the loadings) chosen so that the projected observations have the largest possible variance.
- Projecting the observations onto any other line would yield projected observations with lower variance.
- PCA is unsupervised, so the direction PCA takes you may not always be helpful to effective prediction.
- It only provides the direction that retains the most variance in the data.
- It uses a rotation transformation so that the first rotated axes retain the maximum variance.
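A minimal PCA sketch with scikit-learn (assumes standardized features; the random data is only for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_proj = pca.fit_transform(X_scaled)   # rotate, then keep the top-2 directions
print(pca.explained_variance_ratio_)   # share of variance each component retains
```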
Clustering
- Clustering is a technique used to group similar data points into clusters or groups.
K-Means Clustering
- K-means clustering is an approach for partitioning a data set into K distinct, non-overlapping clusters.
- Algorithm objective: partition the data into K clusters by minimizing intra-cluster variance (the within-cluster sum of squares), i.e., minimize $\sum_{k=1}^{K} W(C_k)$.
- First, specify the number of clusters K; the K-means algorithm then assigns each observation to exactly one of the K clusters.
- The goal of K-means clustering is a good clustering, one for which the within-cluster variation is as small as possible.
- The within-cluster variation for the kth cluster is a measure $W(C_k)$; squared Euclidean distance is commonly used:
- $\min_{C_1,\dots,C_K} \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$
- The number of clusters K is predefined by the user.
- Algorithm steps:
  - Randomly assign a number, from 1 to K, to each of the observations; these serve as initial cluster assignments.
  - Iterate until the cluster assignments stop changing:
    - For each of the K clusters, compute the cluster centroid.
    - Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
- When the result no longer changes, a local optimum has been reached.
- Because the K-means algorithm finds a local rather than a global optimum, the results obtained depend on the initial (random) cluster assignment of each observation in step 1.
Key Hyperparameters - Finding K
- K (Number of clusters) – This is the most critical hyperparameter and is predefined by the user.
- Elbow Method: Plot the Within-Cluster Sum of Squares (WCSS) against different K values. The "elbow" point is where the improvement slows down.
- Silhouette Score: Measures how similar a data point is to its own cluster vs. other clusters (ranges from -1 to 1). Higher values indicate better clustering.
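A short sketch of both heuristics with scikit-learn (synthetic data for illustration):

```python
# Elbow Method (WCSS/inertia) and Silhouette Score for choosing K.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k,
          round(km.inertia_, 1),                      # WCSS: look for the elbow
          round(silhouette_score(X, km.labels_), 3))  # higher is better
```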
Other Hyperparameters
- Initialization Method (Centroid Initialization):
- K-Means is sensitive to the initial placement of centroids
- Common strategies
- Random initialization: Default method but may lead to suboptimal results.
- K-Means++ initialization: Improves clustering by spreading out the initial centroids. Reduces the risk of poor convergence.
- It's useful to run K-means several times to address the initial placement of centroids.
- Distance/similarity metric: Euclidean is most common
DBSCAN
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based clustering algorithm that identifies clusters by grouping together points that are closely packed.
- DBSCAN is effective for finding arbitrary-shaped clusters and identifying outliers as noise.
- By analogy, density measures how much mass is packed into a volume, i.e. how tightly packed matter is within an object (Density = Mass / Volume).
- DBSCAN uses the density of a region of points to cluster the data.
- Key Concepts:
- Core Points: Points with a minimum number of neighboring points within a specified distance (minPts and epsilon (ε)).
- Border Points: Points that are within ε distance of a core point but do not have enough neighbors to be considered core points.
- Noise Points: Points that do not belong to any cluster.
- Does not require the number of clusters to be predefined.
- Can detect noise and outliers.
- Outliers do not affect the model
- It works well with clusters of arbitrary shapes.
- There are only two parameters to tune.
- It's sensitive to those parameters, ε and minPts.
- Algorithm Steps
- Initialize Parameters: epsilon, minPts
- For each point in dataset:
- If point has already been visited, skip.
- Determine the neighboring points of the current point:
- Find all points within the ε distance of the current point (its ε-neighborhood).
- Check if the current point is a core point:
- If point has fewer than minPts neighbors, mark it as noise
DBSCAN Algorithm Steps (2)
- Expand the cluster
- For each core point, recursively visit all the neighboring points in its ɛ-neighborhood
- If a neighboring point has not been visited, mark it as part of the current cluster.
- If it's another core point, continue expanding the cluster by visiting its neighbors.
- Border points (points with fewer neighbors than minPts but still reachable from a core point) are included in the cluster but do not expand it further.
- Repeat: continue the process for unvisited points.
DBSCAN Hyperparameter Tuning
- MinPts (Minimum Points)
  - Set MinPts to at least D + 1, where D is the number of dimensions in the dataset.
  - Increase it as dimensionality increases, to avoid noise affecting the result.
  - For noisy data, use higher values of MinPts to ensure that only dense regions form clusters.
  - If you expect small clusters, a lower MinPts may work better.
- Epsilon (ε)
  - Compute the k-nearest-neighbor distance: choose k as MinPts − 1 and compute the distance from each point to its kth nearest neighbor.
  - Sort the distances and plot them (this is the k-distance plot).
  - Look for an elbow in the plot, the point where the distance starts increasing rapidly; that point is a good candidate for ε (see the sketch below).
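A sketch of the k-distance heuristic and a DBSCAN fit (scikit-learn; `NearestNeighbors` counts the point itself as its first neighbor, so `n_neighbors=min_pts` matches the k = MinPts − 1 rule above):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

min_pts = 5                                   # >= D + 1 for 2-D data
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, -1])                # distance to the k-th neighbor
# Plotting k_dist and looking for the elbow suggests a value for eps.

labels = DBSCAN(eps=0.2, min_samples=min_pts).fit_predict(X)
print("clusters:", len(set(labels) - {-1}), "| noise points:", (labels == -1).sum())
```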
Hierarchical Clustering
- A clustering method that builds a hierarchy of clusters.
- Organizes data into a tree-like structure: the dendrogram.
- Types:
  - Agglomerative (Bottom-Up): starts with individual points and merges them.
  - Divisive (Top-Down): starts with all points in one cluster and splits them.
- Advantages:
  - No need to specify the number of clusters in advance (unlike K-means).
  - Useful for visualizing relationships between data points.
- Applications: genomics, market segmentation, image analysis.
Interpreting the Dendrogram
- Each leaf represents an observation; internal nodes represent merges of clusters and correspond to groups of similar observations.
- As you move up the tree, points fuse into branches, and branches themselves fuse, either with leaves or with higher branches.
- The earlier (lower in the tree) fusions occur, the more similar the groups of observations are to each other.
- For any two observations, we can look for the point in the tree where the branches containing those two observations are first fused.
- Observations that fuse at the very bottom of the tree are quite similar.
- Observations that fuse close to the top of the tree tend to be quite different.
- There are n − 1 points where fusions occur; at each one, the left-right order of the two fused branches is arbitrary.
- We cannot draw conclusions about the similarity of two observations from their proximity along the horizontal axis; rather, we draw conclusions from the location on the vertical axis where the branches containing those two observations are first fused.
- For example, cutting the dendrogram at a height of nine may result in two clusters, while cutting it at a height of five results in three. Further cuts can be made as one descends the dendrogram to obtain any number of clusters between 1 and n. The height of the cut serves the same role as the K in K-means clustering: it controls the number of clusters obtained.
- "Hierarchical" refers to the fact that clusters obtained by cutting the dendrogram at a given height are nested within the clusters obtained by cutting it at any greater height.
- Hierarchical clustering produces nested groups by design, as clusters are progressively merged or split in a hierarchical tree structure. However, there are scenarios where the visualization of the clustering process, or the nature of the data, may not show well-defined or intuitively nested groups.
Agglomerative Approach
- Initialize: start with each data point as its own cluster; initially there are n clusters.
- Compute the distance matrix: calculate the pairwise distance between all clusters using a chosen distance metric (e.g., Euclidean distance), producing a table of distances between every pair of clusters.
- Merge the closest clusters: identify the two clusters that are closest to each other according to the distance matrix and merge them into a new cluster.
- Update the distance matrix: after merging two clusters, update the matrix to reflect the distance between the newly formed cluster and the remaining clusters. How this update is done depends on the linkage criterion:
  - Single Linkage: distance between the two closest points in the clusters.
  - Complete Linkage: distance between the two farthest points in the clusters.
  - Average Linkage: average of all pairwise distances between points in the clusters.
  - Ward's Method: minimizes the variance within clusters.
- Repeat until one cluster remains, then build the dendrogram (see the sketch below).
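A minimal SciPy sketch of agglomerative clustering with different linkage criteria (synthetic data for illustration):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

Z = linkage(X, method="ward")                     # also: "single", "complete", "average"
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels[:10])
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree.
```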
Divisive Approach
- Start with all data points in one cluster and decide how to split it (usually based on some dissimilarity measure). This can be done with techniques such as K-means, or by calculating the dissimilarity of each data point.
Considerations
- Distance metrics:
  - Euclidean distance: most common; suitable for continuous data.
  - Manhattan distance: suitable for categorical or grid-like data.
  - Cosine distance: common for text or high-dimensional data.
Introduction to Deep Learning
- Neural Networks (NNs) are like Directed Acyclic Graphs (DAGs).
- Directed Edges: Each edge has a direction meaning it goes from one vertex (node) to another. This direction signifies a one-way relationship/dependency between nodes.
- Acyclic: Indicates that there are no cycles or closed loops.
- A computational graph extends a DAG: in each node (vertex), a computation takes place, as we will see in neural networks.
- Early neural networks used the Sigmoid Function, the same function used in Logistic Regression, where a linear equation is passed through it to compute a probability between 0 and 1.
The Perceptron
- The Perceptron is a mathematical function, where input data (x) is multiplied by the weight coefficients (w), resulting in a value.
- It is visualized as a single-layer network.
- A perceptron uses a step function (or Heaviside function) as its activation function, determining the output.
- Can be used for binary classification tasks
- Perceptron: y = 1 if z > 0, otherwise y = 0.
Perceptron Learning
- A perceptron is trained using a supervised learning algorithm.
- Backpropagation is used in multilayer hidden layer networks to adjust the weights and bias based on the error
- η (eta) is the learning rate, controlling how much to adjust the weights.
- $w_i \leftarrow w_i + \Delta w_i$, where $\Delta w_i = \eta \,(y_{\text{true}} - y_{\text{pred}})\, x_i$ (see the sketch below).
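A NumPy sketch of this learning rule, trained on the logical AND task discussed later in these notes:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                    # A AND B

w, b, eta = np.zeros(2), 0.0, 0.1             # weights, bias, learning rate
for _ in range(10):                           # a few passes over the data
    for xi, yi in zip(X, y):
        y_pred = 1 if xi @ w + b > 0 else 0   # step (Heaviside) activation
        w += eta * (yi - y_pred) * xi         # update only when prediction is wrong
        b += eta * (yi - y_pred)

print([1 if xi @ w + b > 0 else 0 for xi in X])   # -> [0, 0, 0, 1]
```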
Activation Functions
- Enable the model to learn non-linear relationships by transforming the linear combination of inputs into a non-linear output.
Purpose
- Output Control: help in constraining the output values to a specific range, which can be beneficial for various tasks.
- Gradient Propagation: They provide gradients needed for optimizing the weights during the training process, especially during backpropagation.
Activation Function Types
- Sigmoid: outputs values between 0 and 1, suitable for binary classification; not zero-centered (outputs are always positive), suffers from the vanishing gradient problem when saturated, and the exp() function is relatively slow.
- Formula: $\sigma(x) = \frac{1}{1 + e^{-x}}$
- Tanh (hyperbolic tangent): zero-centered outputs help networks train faster; still suffers from the vanishing gradient problem when saturated.
- Formula: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
- ReLU (Rectified Linear Unit): Avoids vanishing gradient issues for positive inputs, computationally efficient, faster convergence compared to sigmoid/Tanh, often not zero-centered and prone to a "dying ReLU" problem.
- Formula f(x) = max(0,x)
- Leaky ReLU: All benefits of ReLU, addresses the "dying ReLU" problem by allowing a small gradient for negative inputs, helps with training deeper networks, but not standardized as ReLU.
- Formula: f(x) = max(0.01x, x)
- ELU (Exponential Linear Unit): all the benefits of ReLU; near-zero-centered outputs help networks train faster; ELUs saturate to a negative value as the argument gets smaller, becoming more robust to noise; computationally more expensive due to the exponential operation.
  - Formula: $f(z) = z$ if $z \ge 0$, and $f(z) = \alpha (e^{z} - 1)$ if $z < 0$
- Softmax: used for classification problems, converting model outputs (logits $z$) into a probability distribution.
  - Formula: $P(y = j) = \dfrac{e^{z_j}}{\sum_k e^{z_k}}$
  - Not suitable for multi-label classification, as it enforces that only one class can be predicted.
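Reference NumPy implementations of the activation functions above (a sketch; function names are illustrative):

```python
import numpy as np

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def relu(x):       return np.maximum(0.0, x)
def leaky_relu(x): return np.maximum(0.01 * x, x)
def elu(x, a=1.0): return np.where(x >= 0, x, a * (np.exp(x) - 1))

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()        # outputs sum to 1: a probability distribution

print(softmax(np.array([2.0, 1.0, 0.1])))
```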
Issues With Activation Functions
- Saturation occurs when the output of an activation function is pushed to its extreme values.
- This leads to a near-zero gradient.
- When neurons are saturated, small changes in the input lead to very little change in the output
- Gradients become close to zero during backpropagation.
- Vanishing Gradient - Gradient of loss function gets very small
- Prevents the weights in early layers from updating effectively.
- As gradients backpropagate through layers they exponentially diminish = slower training.
- This is problematic in deep networks.
- Logical AND in perceptrons: both inputs are required, with a threshold such that the neuron fires only when both inputs are 1.
- Logical OR in perceptrons: either of the two inputs is enough to reach the threshold and fire the neuron.
The operation a single perceptron can't handle: XOR
- Either A or B has to be 1, but not both (exclusive OR).
- Impossible with a single perceptron, so a single perceptron can't solve XOR.
- We need a network, the multi-layer perceptron (MLP), which can find non-linear boundaries. The perceptron learning rule updates the weights when the prediction is incorrect.
- Multi-layer perceptrons (MLPs) use:
  - Activation functions: introduce non-linearity.
  - Output layer: produces the final prediction.
  - Weights and biases: parameters learned during training.
  - Loss function: measures prediction error.
- Backpropagation updates the weights using gradient descent.
Deep Neural Networks
- A network is considered deep if it has two or more hidden layers.
- DNNs model more intricate, non-linear relationships in the data.
- Each hidden layer in a DNN learns more abstract representations of the input data.
- Weights of a DNN are updated through backpropagation and gradient descent.
Speed and Power
- To accelerate convergence, use GPUs: they contain thousands of cores built for vector calculations and are supported by most deep learning frameworks.
- TPUs are designed specifically for machine learning workloads and perform efficient, high-throughput tensor operations.
Loss Functions
- Loss functions, also called cost functions or objective functions, measure how far our predicted values are from the actual target values.
- Loss functions also provide a feedback mechanism for the model, so the weights can be updated; the loss is a function of the weights and biases.
- We also use the term epoch: one complete pass over the dataset. Mini-batches split the dataset so that one batch is used per forward/backward pass.
Value Prediction
- MSE and MAE: Most Commonly used to calculate total loss
- Mean Squared Error: $MSE = \frac{1}{n} \sum_{i} (y_i - \hat{y}_i)^2$
- Mean Absolute Error: $MAE = \frac{1}{n} \sum_{i} |y_i - \hat{y}_i|$
- Log-cosh loss is a smooth alternative that is sometimes used.
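A NumPy sketch of the two formulas:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average of the squared residuals.
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean absolute error: average of the absolute residuals.
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5,  0.0, 2.0])
print(mse(y_true, y_pred), mae(y_true, y_pred))   # ~0.1667, ~0.3333
```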
Binary Classification
- Binary cross-entropy: measures the performance of a classification model whose output is a probability over two classes.
Multi-Class Classification
- The Softmax activation function is used to output a probability for each class $j$, where $z_j$ is the logit for class $j$.
- Categorical cross-entropy is a loss calculated over all labels in the dataset; minimizing this loss is the goal.
- To adjust the model parameters, move them against the gradient (gradient descent).
- Updates are computed with the chain rule with respect to the weights and biases.
Mini-Batches
- With mini-batch gradient descent, weight updates are made on a per-batch basis.
Alternatives for the Weight Update Rule
- Momentum: accumulates a moving average of past gradients to accelerate convergence.
- RMSProp (Root Mean Square Propagation): adapts the learning rate for each weight.
- Adam: combines momentum and RMSProp; one of the best default optimizers.
- Nadam: a more advanced variant (Adam with Nesterov momentum).
Back Propagation
- Backpropagation transmits the error backward through the network; this allows the network to improve through weight adjustments.
- Gradient descent uses the gradient of the loss function to update the weights efficiently.
Learning Rate Considerations
- A high learning rate gives faster initial convergence but risks divergence.
How Is the Number of Parameters Calculated?
- For a fully connected layer: (number of inputs × number of neurons) + number of neurons, since each neuron also has a bias (see the sketch below).
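For example (a one-line helper; the 784 → 128 layer is an arbitrary illustration):

```python
def dense_params(n_inputs, n_neurons):
    # Weights (n_inputs x n_neurons) plus one bias per neuron.
    return n_inputs * n_neurons + n_neurons

print(dense_params(784, 128))   # -> 100480
```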
Convolutional Neural Networks
- They model the way humans recognize images: networks of low-level features feed into higher-level features (edges, shapes, eyes, etc.).
- Each feature then contributes what it has detected to the output.
The Convolution Filter
- To obtain the convolved image, the filter is applied to each submatrix of the input image.
- Where the filter closely matches the underlying submatrix, the result is a larger value.
- Pooling layers help create condensed summaries of the feature maps:
  - Max pooling: pick the maximum value.
  - Average pooling
  - Global pooling
  - L2 pooling
- Some pooling variants introduce randomness (stochastic pooling).
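A hedged Keras sketch of a small CNN with convolution and pooling layers (assumes TensorFlow is installed; all shapes are illustrative):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu",
                  input_shape=(28, 28, 1)),   # low-level feature detectors
    layers.MaxPooling2D((2, 2)),              # condensed summary of each map
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),   # class probabilities
])
model.summary()
```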
Recurrent Neural Networks
- They are a kind of network that takes its input as sequences.
- The order of words conveys meaning; the idea is to take advantage of the sequential nature of the data.
RNN Advantages
- They have memory, making them ideal in areas where the model needs to track state over time.
- Applications of recurrence:
  - Document and news classification
  - Time series
  - Speech recognition
  - Handwriting recognition
Recurrent Network
- At each step, the network combines the current input with what it has seen before.
- Unlike other neural networks, RNNs reuse the same set of weights across time steps.
- Feedback loops allow information from prior time steps to be included in the current step.
Variants
- LSTM: solves the vanishing gradient problem; gates allow the network to keep or forget memory.
- GRUs: computationally cheaper.
- Bidirectional networks: use both past and future time steps.
Steps
1. Initialization
2. Feed the sequence
3. Compute the updates
4. Accumulate the loss
5. Backpropagation (through time)
6. Update the weights
Embeddings
- Used with words: embed words into numerical form.
- Step one is to transform the data into numerical form:
  - Text: transform into numerical form.
  - Categorical: transform the values.
Final Input Format
- (Batch, Time steps, Number of features), as in the sketch below.
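A hedged Keras sketch tying these pieces together: an Embedding layer turns token ids into vectors, and an LSTM consumes the resulting (batch, time, features) sequence (assumes TensorFlow; all sizes are illustrative):

```python
from tensorflow.keras import Input, layers, models

model = models.Sequential([
    Input(shape=(100,)),                               # 100 token ids per sequence
    layers.Embedding(input_dim=10000, output_dim=64),  # token id -> 64-d vector
    layers.LSTM(32),                                   # gated memory across time steps
    layers.Dense(1, activation="sigmoid"),             # e.g. a binary label
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```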
More Embedding methods
- Word2Vec, GloVe, FastText, ELMo, BERT, GPT, USE, XLNet
Some Limitations of Embeddings
- Limited context: if the text does not provide proper context, words will not have distinguishing embeddings.
- Homonyms: words will share embeddings if the context does not provide the difference.
- Simple models only look at the surface form of the word.
- Dimensionality reduction helps with redundancy.
- Training data and hyperparameters affect the distinctiveness of the embeddings.
- Different models can produce different embeddings for the same surface form.
Tensors
- A tensor is a multidimensional array, with properties including rank, shape, and data type.
- A scalar has rank 0 (e.g., a point); a rank-1 tensor of length 3 could hold (r, g, b) values.
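A quick NumPy illustration of these properties:

```python
import numpy as np

scalar = np.array(5.0)            # rank 0
pixel  = np.array([255, 0, 0])    # rank 1: (r, g, b)
image  = np.zeros((28, 28, 3))    # rank 3: height x width x channels

for t in (scalar, pixel, image):
    print(t.ndim, t.shape, t.dtype)   # rank, shape, data type
```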
Time Series - Smoothing
- Used in time series analysis to help reduce noise.
- Makes long-running trends and seasonality clearer.
- Applies various weighting/averaging techniques.
Simple Moving Average
- Averages a fixed number of points that fall within a window.
- Ideal for highlighting long-term trends.
- The window: the size of the averaging term.
Exponential Smoothing
- A technique in which weights decrease exponentially with the age of the observation.
- Weights recent observations more heavily, making it more responsive to recent changes (see the sketch below).
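A pandas sketch comparing the two techniques (toy series for illustration):

```python
import pandas as pd

s = pd.Series([3, 5, 4, 6, 8, 7, 9, 12, 11, 13])
sma = s.rolling(window=3).mean()   # equal weights over a fixed window
ewm = s.ewm(alpha=0.5).mean()      # weights decay exponentially with age
print(pd.DataFrame({"raw": s, "SMA(3)": sma, "EWM": ewm}))
```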
Seasonality and Trend Lines
- Choosing the right trend line:
  - Linear: for steady growth or decline.
  - Exponential: for accelerating or decelerating growth.
  - Logarithmic: for growth patterns that level off over time.
STL
- Breaks a series into three main components: Trend, Seasonal, and Residual.
- The decomposition can be additive or multiplicative.
- To decompose a time series, apply STL decomposition: first separate out the seasonal component, then the trend, leaving the residual.
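A statsmodels sketch of STL on a synthetic monthly series (assumes statsmodels is installed):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

idx = pd.date_range("2020-01-01", periods=48, freq="MS")
y = pd.Series(10 + 0.2 * np.arange(48)                       # trend
              + 3 * np.sin(2 * np.pi * np.arange(48) / 12),  # yearly seasonality
              index=idx)

res = STL(y, period=12).fit()
print(res.trend.tail(3), res.seasonal.tail(3), res.resid.tail(3))
```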
ARIMA
- Autoregressive (AR): regresses the series on its own past values.
- Integrated (I), with parameter d: differencing removes trend from a time series to make it stationary.
- Moving Average (MA): models the error as a combination of past forecast errors.
Selecting Terms
- Check whether the data is stationary.
- The ACF plot helps identify the MA term.
- The PACF plot helps identify the AR term.
- A grid search over (p, d, q) can be used when the plots are inconclusive.
- AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) compare candidate models (see the sketch below).
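A statsmodels sketch of fitting one candidate ARIMA(p, d, q) and reading its AIC/BIC (the random-walk data is only for illustration):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(size=200)))   # non-stationary series

model = ARIMA(y, order=(1, 1, 1)).fit()          # AR=1, differencing d=1, MA=1
print(model.aic, model.bic)                      # lower values suggest a better fit
# statsmodels' plot_acf / plot_pacf help choose the MA and AR orders.
```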
NLP INTRODUCTION
- NLP combines linguistics with machine learning and deep learning.
NLP Components
- Tokenization: breaking text down into smaller units.
- Part-of-Speech tagging
- Named Entity Recognition: identifying and classifying named entities.
- Sentiment analysis
- Text classification
Other Use Cases
- Malicious documents
- Incident reports
- Vulnerability analysis
- User behavior
NLP Evolution
- Early systems were rule-based.
- Then came linguistics-based approaches.
- Then a shift to statistical methods and large (deep learning) models.
Tokenization
- Converts text into parts/segments. Types:
  - Word/segment tokens
  - Character tokens
- It helps break text into units a model can take as input.
Libraries for Tokenization
- NLTK
- spaCy, Hugging Face
TensorFlow Tokenizer
- Converts text into a numerical format (see the sketch below).
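A small tokenization sketch (plain Python for the word/character split; the commented part shows the Keras `Tokenizer` mapping tokens to integer ids, assuming TensorFlow is installed):

```python
text = "Natural language processing breaks text into tokens."

word_tokens = text.lower().split()   # naive word-level tokens
char_tokens = list(text[:10])        # character-level tokens
print(word_tokens)
print(char_tokens)

# With TensorFlow available:
# from tensorflow.keras.preprocessing.text import Tokenizer
# tok = Tokenizer()
# tok.fit_on_texts([text])
# print(tok.texts_to_sequences([text]))   # words -> integer ids
```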