Questions and Answers
What is the primary goal of supervised learning?
Clustering is considered a supervised learning task.
False
Name one application of clustering in marketing.
Segment customers according to their similarities.
In clustering, data instances that are similar are grouped into one ______.
Match the following applications with their clustering examples:
What is one key characteristic of unsupervised learning?
Clustering only applies to fields related to technology.
Why is clustering important in the context of online documents?
What is the primary function of the first principal component in PCA?
The proportions of variance explained (PVE) by each principal component sum to more than one.
What is the recommended action if the variables in PCA are in different units?
The first principal component explains _____% of the variance in the data according to the information provided.
How much variance is explained by the next principal component after the first?
Match the following terms related to PCA with their correct descriptions:
The last two principal components explain a high percentage of the variance in the data.
What is the main purpose of principal component analysis (PCA)?
What is a significant disadvantage of K-means clustering?
Agglomerative clustering starts with all data points in one cluster.
What is a dendrogram?
In hierarchical clustering, the __________ method builds the dendrogram from the bottom level.
Match the following clustering types with their descriptions:
In a dendrogram, cutting it at a certain height results in how many clusters?
Divisive clustering results in all data points merged into one cluster.
How does agglomerative clustering determine which clusters to merge?
Which of the following best describes the K-means clustering algorithm?
In K-means clustering, the centroids are recalculated after every assignment of data points.
What is the primary goal of clustering in a data set?
The K-means algorithm aims to minimize the sum of squared error (SSE) across all clusters, defined as SSE = $\sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2$, where $C_j$ is the ________ of the jth cluster.
Match the aspects of clustering with their correct descriptions:
Which of the following is NOT a stopping criterion for K-means clustering?
Clustering algorithms can be used in various fields such as medicine, sociology, and biology.
What characteristic should be maximized and minimized for clustering quality?
What is the time complexity of the k-means algorithm?
K-means is considered a non-linear algorithm.
What is one method to deal with outliers in k-means clustering?
K-means clustering is sensitive to __________.
Which of the following is a strength of k-means clustering?
The k-means algorithm always finds the global optimum regardless of the initialization.
What technique can be used to solve the optimization problem in PCA?
Match the following terms related to k-means with their descriptions:
What user input is necessary for k-means clustering to function?
The first principal component is uncorrelated with the second principal component.
What are the principal component scores for the first principal component generally referred to as?
The principal component directions are the ordered sequence of ______ of the matrix $X^TX$.
Match the following terms with their definitions:
In PCA, how many principal components can be at most generated?
The second principal component can be correlated with the first principal component.
List the three crimes featured in the USArrests data set.
Study Notes
Introduction to Machine Learning AI 305
- Unsupervised learning is a subset of machine learning.
- Clustering is a technique in unsupervised learning.
- Clustering finds similarity groups in data, called clusters.
Supervised Learning vs. Unsupervised Learning
- Supervised learning discovers patterns relating data attributes to a target (class) attribute. These patterns help predict values of the target attribute in future data.
- Unsupervised learning lacks a target attribute. The goal is to explore the data and find inherent structures within it.
Clustering
- Clustering is a method for finding similarity groups in data, creating clusters of similar data points.
- Data points similar to each other are grouped into the same cluster.
- Data points far apart, dissimilar, are grouped into different clusters.
- Clustering is often considered a form of unsupervised learning.
An Illustration
- A dataset can have naturally occurring groups of data points—natural clusters.
What is Clustering For?
- Example 1: Grouping people by size to make different sizes of clothing (small, medium, large).
- Example 2: Marketing segments customers based on similarities, targeting specific subgroups with advertising or product recommendations.
- Example 3: Organizing documents by topic similarities (e.g., news articles) to create a topic hierarchy. This is important for organizing large volumes of text data.
- Clustering is used in many fields like medicine, psychology, and marketing.
Aspects of Clustering
- Clustering Algorithm: Includes partitional methods (like K-means) and hierarchical methods.
- Distance (Similarity or Dissimilarity) function: Determines how similar or dissimilar data points are.
- Clustering Quality: Measures how good the clusters are, typically looking for minimized intra-cluster distance and maximized inter-cluster distances.
K-means Clustering
- A partitional clustering algorithm.
- Divides data points into k clusters.
- Each cluster has a center called a centroid.
- K is specified by the user.
K-means Algorithm Steps
- Step 1: Randomly choose k data points to be the initial centroids (cluster centers).
- Step 2: Assign each data point to the nearest centroid. The dataset is thereby divided (partitioned) into k clusters.
- Step 3: Recalculate the centroids using the current assignments of data points.
- Step 4: Repeat steps 2 and 3 until a convergence criterion is met (no more significant changes in assignments).
- Different seed selections lead to differing cluster formation.
K-means Algorithm: Detailed Steps
- Step 1: Choose k data points (seeds) and set them as the initial centroids.
- Step 2 (iterate): For each data point, compute its distance to each centroid and assign it to the nearest centroid.
- Step 3: Re-compute the centroids from the current assignments; repeat steps 2 and 3 until a stopping criterion is met.
- Stopping criteria: no (or minimal) re-assignment of data points to different clusters, minimal change in the centroids, or minimal decrease in the SSE (sum of squared errors).
- The SSE is defined as $SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2$, where $C_j$ is the jth cluster and $m_j$ its centroid.
Stopping/Convergence Criterion
- No further changes in data point assignments or in the centroid locations.
- Minimum decrease in SSE (sum of squared errors) between iterations.
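As a concrete illustration, here is a minimal NumPy sketch of the algorithm above (the function name `kmeans` and parameters such as `max_iter` and `tol` are illustrative choices, not from the course material):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal k-means: returns centroids, assignments, and the final SSE."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k random data points as the initial centroids (seeds).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each data point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (assumes no cluster becomes empty; a robust version would handle that).
        new_centroids = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        # Stopping criterion: minimal change in the centroids.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    # SSE = sum over clusters of squared distances to the cluster centroid.
    sse = sum(((X[assign == j] - centroids[j]) ** 2).sum() for j in range(k))
    return centroids, assign, sse
```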
Strengths of K-means
- Simple to understand and implement.
- Efficient, with time complexity O(tkn), where t is the number of iterations, k the number of clusters, and n the number of data points; this is linear in n when t and k are relatively small.
Weaknesses of K-means
- Dependent on the correct value of k (number of clusters).
- Sensitive to outliers, i.e., data points that are very far from the rest of the data.
- Dependent on the initial seed selection: different initial centroids can yield different clusterings.
- Not suitable for clusters that are not hyper-spherical or hyper-ellipsoidal in shape.
Selecting 'k'
- Plots for k = 2, k = 3, and k = 4 show how the k-means clustering changes with k. A simulated data set with 150 observations in a 2-dimensional space is shown for illustration.
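One common heuristic for choosing k (standard practice, though not prescribed by these notes) is to run k-means for several values of k and look for an "elbow" where the SSE stops dropping sharply; a sketch using the `kmeans` function above:

```python
# X is assumed to be an (n, 2) NumPy array of observations.
for k in (2, 3, 4, 5):
    _, _, sse = kmeans(X, k)
    print(f"k={k}: SSE={sse:.1f}")  # SSE falls quickly, then levels off near a good k
```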
Weaknesses of K-means: Outliers
- Remove data points that are far from the centroids (see the sketch below).
- Alternatively, monitor possible outliers over several iterations before removing them.
- Use random sampling: clustering a small sample is less likely to pick outliers as seeds.
- Afterwards, assign the remaining (unsampled) data points to clusters by distance or by a classification model.
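A minimal sketch of the first strategy, removing points that lie far from their nearest centroid (the cutoff of three times the median distance is an illustrative choice):

```python
centroids, assign, _ = kmeans(X, k=3)
# Distance from each point to the centroid of its own cluster.
d = np.linalg.norm(X - centroids[assign], axis=1)
# Keep only points within the cutoff; the rest are treated as outliers.
X_filtered = X[d <= 3 * np.median(d)]
```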
Weaknesses of K-means: Initial Seeds
- Random seed selection produces different clustering results on different runs.
- Various methods can be used to help select good seeds (e.g., running the algorithm several times and keeping the result with the lowest SSE).
Clustering Algorithm Summary
- K-means, despite weaknesses, is still popular.
- Other algorithms may be better for specific tasks, needs, and data sets.
- Comparing clustering algorithms is complex and subjective.
Common Ways to Represent Clusters
- Use centroids to represent clusters
- Describe the spread of each cluster by its radius and standard deviation.
Using a Classification Model
- All data points are assumed to share the same class label (cluster ID).
Using Frequent Values to Represent Clusters
- Use for categorical data.
- Often used in text clustering.
Clusters Of Arbitrary Shapes
- Hyper-ellipsoidal or hyper-spherical clusters are easier to represent using centroids and spread measures.
- Irregular shaped clusters are more complex to represent.
Hierarchical Clustering
- Avoids pre-defining 'k'
- Creates a nested sequence of clusters, represented as a tree (dendrogram).
- Different methodologies exist (divisive and agglomerative).
Agglomerative Clustering
- Data points are initially individual clusters (nodes)
- Repeatedly merge the two clusters with the shortest distance between them, until all data points form a single cluster.
Agglomerative Clustering Algorithm Steps
- Start with each data point in the dataset D = {x₁, x₂, ..., xₙ} as its own cluster.
- Compute the distances between all pairs of clusters.
- Merge the two closest clusters into a new cluster.
- Repeat steps 2 and 3 until only a single cluster remains.
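A naive, illustration-only Python sketch of these steps, using the shortest point-to-point distance as the merge criterion (the "single link" rule described below):

```python
import numpy as np

def agglomerative(X):
    """Naive agglomerative clustering; records the sequence of merges."""
    clusters = [[i] for i in range(len(X))]  # step 1: each point is its own cluster
    merges = []
    while len(clusters) > 1:
        # Step 2: find the pair of clusters at the smallest distance,
        # measured here as the shortest distance between member points.
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 3: merge the two closest clusters and repeat.
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges
```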
Measuring Distance Between Clusters
- There are different ways to measure distances between different clusters.
- Different variations include complete link, single link, average link, and centroids.
Types of Hierarchical Clustering Methods
- Agglomerative (bottom-up): Starts with individual data points as clusters and merges them iteratively based on similarity
- Divisive (top-down): Starts with all data points in one cluster and recursively splits them into smaller clusters based on dissimilarity.
Single Link Method
- The distance between two clusters is the shortest distance between any two data points, one from each cluster.
- It tends to produce long "chains" of clusters (the undesirable "chain effect") and is very sensitive to noise and outliers.
Complete Link Method
- The distance between two clusters is the maximum distance between any two data points, one from each cluster.
- Less sensitive to outliers than single linkage but can potentially miss clusters.
Average Link Method
- Compromise between single-link and complete-link.
- The distance between two clusters is the average distance over all pairs of data points, one from each cluster.
Centroid Method
- Distance is computed based on the distance between the centroids.
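All four linkage variants are available in SciPy; a minimal sketch (assuming `X` is an (n, p) NumPy array and matplotlib is available for the plot):

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Build the dendrogram bottom-up; method may be 'single', 'complete',
# 'average', or 'centroid', matching the variants described above.
Z = linkage(X, method='average')
# Cut the dendrogram to obtain a flat clustering with 3 clusters.
labels = fcluster(Z, t=3, criterion='maxclust')
dendrogram(Z)  # visualize the nested sequence of merges
```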
Distance Function
- A notion of "similarity" or "dissimilarity" is central to clustering.
- Distance functions exist for many types of data (nominal, numeric) and for differing applications.
- Common metrics include Euclidean distance (and its squared form), Manhattan distance, and Chebyshev distance.
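For numeric data, the three metrics above can be written directly; a small sketch with made-up points:

```python
import numpy as np

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
euclidean = np.sqrt(((x - y) ** 2).sum())  # sqrt(9 + 4 + 0) = 3.61
manhattan = np.abs(x - y).sum()            # 3 + 2 + 0 = 5
chebyshev = np.abs(x - y).max()            # max(3, 2, 0) = 3
```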
How to Choose a Clustering Algorithm
- No single best algorithm
- Various algorithms exist each with strengths, weaknesses, suitable use cases.
- Trial and error, via testing on the data, is typically used.
- Consider the type of data, desired characteristics of clusters.
Principal Components Analysis (PCA)
- Used for data visualization or as preprocessing for supervised learning.
- Produces a low-dimensional representation of the data (reduced dimensionality).
- Finds a sequence of linear combinations of the variables that have maximal variance and are mutually uncorrelated.
Principal Components Analysis (PCA): Details
- First principal component: the normalized linear combination of the features with the largest variance:
- $Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \dots + \phi_{p1}X_p$
- The loadings of the first principal component are $\phi_{11}, \phi_{21}, \dots, \phi_{p1}$.
PCA: Example
- Illustrates population size and ad spending for 100 different cities.
- First and second Principal Components are depicted visually (in 2d).
Computation of Principal Components
- Assumes the variables have mean zero (the data are centered first).
- Calculate the linear combination with the largest variance.
- SVD (singular value decomposition) or an eigendecomposition can perform the calculation.
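A minimal sketch of this computation via the SVD, assuming `X` is an (n, p) NumPy data matrix:

```python
import numpy as np

X = X - X.mean(axis=0)  # center each variable to mean zero
U, d, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt.T                    # columns are the loading vectors phi_1, phi_2, ...
scores = X @ loadings              # principal component scores Z_1, Z_2, ...
variances = d ** 2 / (len(X) - 1)  # component variances, from the singular values
```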
Geometry of PCA
- Loading vector (Φ₁) represents direction in feature space with maximum variance
- Projecting data points onto this vector generates principal component scores.
Further Components
- Subsequent PCs (principal components) have maximal variance subject to being orthogonal to (and hence uncorrelated with) the previous PCs.
Further Principal Components Continued
- The principal component directions ($\phi_i$) are the ordered sequence of right singular vectors of the matrix X.
- The variances of the components are proportional to the squares of the singular values.
- There are at most min(n − 1, p) principal components.
Illustration - Example
- The example uses the USArrests data set, which records, for each US state, arrests for three crimes (Murder, Assault, Rape) and the percentage of the population living in urban areas (UrbanPop).
Example Details
- The first two principal components of the USArrests data are examined.
- Visualization of these components highlight relationships.
- The first component is primarily related to overall crime rates among states.
- The second component focuses primarily on the urbanization level of states.
Example Details - Continued
- The component scores and the loading-vector projections can be examined together (e.g., in a biplot) to study correlations among states and variables.
Another Interpretation of Principal Components
- The data points live in a p-dimensional space, and there is a hyperplane that is closest to these observations.
- The principal component loading vectors span this closest hyperplane (closest to the n observations, using average squared Euclidean distance as the measure of closeness).
Scaling the Variables
- If the variables are in different units, they should be scaled (typically each to unit variance), so that variables measured on large scales do not dominate the components.
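In the SVD sketch above, this corresponds to standardizing each column before the decomposition (one common convention):

```python
X = (X - X.mean(axis=0)) / X.std(axis=0)  # each variable: mean zero, unit variance
```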
Proportion of Variance Explained (PVE)
- Measures the strength of each component:
- Calculate how much of the variance in the data each principal component explains.
- The PVEs of all the principal components sum to 100%.
Proportion of Variance Explained, continued
- The PVE for a component is the ratio of that component's variance to the total variance.
- Cumulative PVEs can be displayed and examined.
- Often just the first few principal components capture most of the variation in the data.
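Continuing the SVD sketch from the PCA computation section (where `d` holds the singular values), the PVE and cumulative PVE follow directly:

```python
import numpy as np

pve = d ** 2 / (d ** 2).sum()  # PVE of each component; the entries sum to 1 (100%)
cum_pve = np.cumsum(pve)       # cumulative PVE, e.g. for a scree plot
```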
How Many Components Should We Use?
- Determining a sufficient number of principal components is not always straightforward in unsupervised learning.
- Cross-validation is not always possible.
- A "scree plot" can help: look, by visual examination, for an elbow in the plot of the PVE against the component number.
Description
Test your understanding of supervised and unsupervised learning, focusing particularly on clustering and Principal Component Analysis (PCA). This quiz covers key characteristics, applications, and concepts related to these important machine learning techniques.