Questions and Answers
What is the primary goal of cluster analysis?
Which of the following is a type of proximity measure?
What is the main difference between partitional and hierarchical clustering?
What is the purpose of adopting a (dis)similarity measure in clustering?
What is a dendrogram used to represent?
What is a characteristic of the clusters formed in cluster analysis?
What is the initial step in the k-means clustering algorithm?
What is the common issue with k-means clustering in small data sets?
What is the purpose of pre-processing in k-means clustering?
What is the key operation in agglomerative hierarchical clustering?
What is the objective function in k-means clustering?
What is the characteristic of hierarchical clustering?
Study Notes
Clustering with Unsupervised Learning
- Unsupervised learning involves unknown class labels, and the data is plotted to identify natural clusters.
- Cluster analysis aims to divide data into meaningful and/or useful clusters that may or may not correspond to human perception of similarity.
Characteristics of Clusters
- Clusters should comprise objects that are similar to each other and different from those in other clusters.
- A (dis)similarity measure is required, often taken as a proximity measure (e.g., L1, L2, or L∞ norm).
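The three norms mentioned above can be computed directly; a minimal sketch with numpy, using two made-up points purely for illustration:

```python
import numpy as np

# Hypothetical points, chosen only to illustrate the three norms.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

d = x - y                      # component-wise difference: [-3, 2, 0]
l1 = np.sum(np.abs(d))         # L1 (Manhattan) norm: 3 + 2 + 0 = 5
l2 = np.sqrt(np.sum(d ** 2))   # L2 (Euclidean) norm: sqrt(9 + 4) ≈ 3.61
linf = np.max(np.abs(d))       # L∞ (Chebyshev) norm: max(3, 2, 0) = 3
```

The choice of norm changes which objects count as "close", and therefore which clusters emerge.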
Clustering Types
- Clustering can be partitional (flat) or hierarchical.
- Partitional clustering divides data into non-overlapping subsets (clusters) where each data point is in exactly one subset.
- Hierarchical clustering produces nested clusters, often represented by a hierarchical tree or dendrogram.
k-means Clustering (Partitional Clustering)
- Randomly choose k objects from the training set as prototypes.
- Assign all other objects to the nearest prototype to form clusters based on Euclidean distance (or other norm).
- Update the new prototype of each cluster as the centroid of all objects assigned to that cluster.
- Repeat until convergence (i.e., no data point changes clusters, or centroids remain the same).
- k-means clustering is a heuristic algorithm with no guarantee of convergence to the global optimum.
- The result is sensitive to the initial choice of objects as cluster centers, especially for small data sets.
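The four steps above can be sketched in a few lines of numpy. This is an illustrative sketch only (no handling of empty clusters, fixed iteration cap); in practice one would reach for a library implementation such as `sklearn.cluster.KMeans`:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means sketch: choose prototypes, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k objects from the data as initial prototypes.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each object to the nearest prototype (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each prototype as the centroid of its assigned objects.
        # (Note: a cluster left empty would produce NaN here; ignored in this sketch.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Rerunning with a different `seed` can change the result, which is exactly the initialization sensitivity noted above.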
k-means Clustering Algorithm
- The algorithm can be viewed as a greedy algorithm for partitioning n samples into k clusters to minimize an objective function (e.g., sum of squared distances to cluster centers, SSE).
- SSE is calculated by summing the squared errors (i.e., distances to the closest centroid) for each data point.
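The SSE objective described above can be written as a one-line sum over clusters; a minimal sketch, assuming `labels` holds each point's closest-centroid index:

```python
import numpy as np

def sse(X, labels, centroids):
    # Sum of squared distances from each point to its assigned (closest) centroid.
    return sum(np.sum((X[labels == j] - c) ** 2)
               for j, c in enumerate(centroids))
```

Each k-means iteration can only decrease (or leave unchanged) this quantity, which is why the algorithm converges, though possibly to a local rather than global minimum.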
Pre- and Post-processing
- Pre-processing steps can improve the final result, including standardizing (or normalizing) the data and eliminating or reducing the effect of outliers.
- Post-processing can include splitting “loose” clusters and merging “close” clusters.
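The standardization step mentioned above is typically a per-feature zero-mean, unit-variance rescaling; a minimal sketch (libraries such as `sklearn.preprocessing.StandardScaler` do the same with more safeguards):

```python
import numpy as np

def standardize(X):
    # Rescale each feature to zero mean and unit variance so that no single
    # large-scale feature dominates the distance computation.
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

Without this step, a feature measured in large units (e.g., income in dollars) can swamp one measured in small units (e.g., age in years) in the Euclidean distance.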
Agglomerative Hierarchical Clustering
- Each instance starts off as its own (singleton) cluster; at each step the two "nearest" clusters are merged into one.
- The algorithm is a bottom-up technique: the number of clusters decreases, and the clusters grow larger, at each step.
- The key operation is computing the proximity between two clusters, which can be defined in various ways (e.g., single, complete, or average linkage).
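The bottom-up procedure above can be sketched with single linkage (cluster proximity = minimum pairwise distance). This brute-force version is for illustration only; `scipy.cluster.hierarchy.linkage` is the practical choice and also yields the dendrogram:

```python
import numpy as np

def single_linkage(X, k):
    """Agglomerative sketch: start with singletons, repeatedly merge the
    two clusters whose closest members are nearest, until k clusters remain."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-linkage proximity: minimum pairwise distance
                # between any member of cluster a and any member of cluster b.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)   # merge the two nearest clusters
    return clusters
```

Swapping `min` for `max` gives complete linkage, and for the mean gives average linkage; the sequence of merges is what a dendrogram records.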
Description
This quiz explores clustering with unsupervised learning, where data is divided into clusters that are meaningful and useful. It covers the concept of similarity and dissimilarity measures in clustering.