Clustering in Unsupervised Learning

FreedLemur

12 Questions

What is the primary goal of cluster analysis?

To divide data into meaningful clusters

Which of the following is a type of proximity measure?

L1 norm

What is the main difference between partitional and hierarchical clustering?

Whether the clusters are nested or non-nested

What is the purpose of adopting a (dis)similarity measure in clustering?

To determine the similarity between objects

What is a dendrogram used to represent?

Hierarchical clusters

What is a characteristic of the clusters formed in cluster analysis?

In partitional clustering, they are non-overlapping (each data point belongs to exactly one cluster)

What is the initial step in the k-means clustering algorithm?

Randomly choose k objects from the training set as the prototypes

What is the common issue with k-means clustering in small data sets?

The algorithm is sensitive to the initial choice of objects as cluster centers

What is the purpose of pre-processing in k-means clustering?

To standardize or normalize the data and eliminate or reduce the effect of outliers

What is the key operation in agglomerative hierarchical clustering?

Finding the two clusters that are 'closest' in multivariate space

What is the objective function in k-means clustering?

The sum of the squared distances to the cluster centers

What is the characteristic of agglomerative hierarchical clustering?

Each instance starts off as its own cluster, and is subsequently joined to the 'nearest' instance to form a new cluster

Study Notes

Clustering with Unsupervised Learning

  • In unsupervised learning the class labels are unknown; the data are examined (for example, by plotting) to identify natural clusters.
  • Cluster analysis aims to divide data into meaningful and/or useful clusters that may or may not correspond to human perception of similarity.

Characteristics of Clusters

  • Clusters should comprise objects that are similar to each other and different from those in other clusters.
  • A (dis)similarity measure is required, often taken as a proximity measure (e.g., L1, L2, or L∞ norm).
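
The three norms named above can be illustrated with a minimal sketch. The function names are chosen here for illustration and are not from the source; points are assumed to be equal-length tuples of numbers.

```python
def l1_distance(a, b):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean (L2) distance: square root of the sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def linf_distance(a, b):
    """Chebyshev (L-infinity) distance: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


p, q = (0.0, 0.0), (3.0, 4.0)
print(l1_distance(p, q))    # 7.0
print(l2_distance(p, q))    # 5.0
print(linf_distance(p, q))  # 4.0
```

The same pair of points is 7, 5, or 4 units apart depending on the norm, which is why the choice of proximity measure can change which objects end up in the same cluster.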

Clustering Types

  • Clustering can be partitional (flat) or hierarchical.
  • Partitional clustering divides data into non-overlapping subsets (clusters) where each data point is in exactly one subset.
  • Hierarchical clustering produces nested clusters, often represented by a hierarchical tree or dendrogram.

k-means Clustering (Partitional Clustering)

  • Randomly choose k objects from the training set as prototypes.
  • Assign all other objects to the nearest prototype to form clusters based on Euclidean distance (or other norm).
  • Update the prototype of each cluster to the centroid of all objects assigned to that cluster.
  • Repeat until convergence (i.e., no data point changes clusters, or centroids remain the same).
  • k-means clustering is a heuristic algorithm with no guarantee of convergence to the global optimum.
  • The result is sensitive to the initial choice of objects as cluster centers, especially for small data sets.
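
The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the names `k_means`, `seed`, and `max_iter` are assumptions, squared Euclidean distance is used for the assignment step, and an empty cluster simply keeps its previous prototype (one of several possible conventions).

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly choose k objects as the initial prototypes.
    prototypes = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest prototype.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, prototypes[j]))
            for p in points
        ]
        # Step 4: stop once no data point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each prototype to the centroid of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old prototype
                prototypes[j] = centroid(members)
    return prototypes, assignment


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
prototypes, labels = k_means(data, k=2)
print(labels)
```

Running the function with different `seed` values is a simple way to observe the sensitivity to initialization noted above; on data that is less cleanly separated than this toy set, different seeds can yield different final clusters.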

k-means Clustering Algorithm

  • The algorithm can be viewed as a greedy algorithm for partitioning n samples into k clusters to minimize an objective function (e.g., sum of squared distances to cluster centers, SSE).
  • SSE is calculated by summing, over all data points, the squared error of each point (i.e., its squared distance to the closest centroid).
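
As a small worked example of the objective, the sketch below computes the SSE for a given set of centroids. The function name `sse` and the sample data are illustrative assumptions, and squared Euclidean distance is assumed as the error.

```python
def sse(points, centroids):
    """Sum over all points of the squared distance to the closest centroid."""
    total = 0.0
    for p in points:
        total += min(
            sum((x - c) ** 2 for x, c in zip(p, cen)) for cen in centroids
        )
    return total


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
centroids = [(1.0, 0.0), (10.0, 0.0)]
print(sse(points, centroids))  # 2.0
```

Here the first two points are each 1 unit from the centroid at (1, 0) and the third point coincides with its centroid, so the SSE is 1 + 1 + 0 = 2.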

Pre- and Post-processing

  • Pre-processing steps can improve the final result, including standardizing (or normalizing) the data and eliminating or reducing the effect of outliers.
  • Post-processing can include splitting “loose” clusters and merging “close” clusters.
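
One common form of the standardization mentioned above is the z-score, sketched below under the assumption that each row is an object and each column a feature. The name `standardize` is hypothetical, and mapping constant columns to 0.0 is a choice made here to avoid division by zero.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Rescale every feature (column) to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]  # population standard deviation
    return [
        tuple(
            (v - m) / s if s > 0 else 0.0  # assumption: constant columns become 0.0
            for v, m, s in zip(row, means, stds)
        )
        for row in rows
    ]


raw = [(1.0, 100.0), (2.0, 200.0), (3.0, 300.0)]
for row in standardize(raw):
    print(row)
```

Without this step the second feature, whose values are 100 times larger, would dominate any Euclidean distance; after standardization both features contribute equally.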

Agglomerative Hierarchical Clustering

  • Each instance starts off as its own cluster and is subsequently merged with the “nearest” instance or cluster to form a new cluster.
  • The algorithm is a bottom-up technique, where larger clusters are obtained at each step.
  • The key operation is computing the proximity between clusters in order to decide which pair to merge; this proximity can be defined in various ways (e.g., single, complete, or average linkage).
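
The bottom-up procedure can be sketched as a naive loop that repeatedly merges the closest pair of clusters. This is an illustration under stated assumptions: single linkage (the minimum pairwise Euclidean distance) is picked here as one possible proximity definition, and the names `agglomerative`, `single_linkage`, and `num_clusters` are hypothetical.

```python
def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def single_linkage(c1, c2):
    """Cluster proximity as the smallest distance between any two members."""
    return min(euclidean(a, b) for a in c1 for b in c2)


def agglomerative(points, num_clusters, proximity=single_linkage):
    # Every instance starts off as its own cluster.
    clusters = [[p] for p in points]
    merges = []  # record of merged pairs, in order (the nested structure)
    while len(clusters) > num_clusters:
        # Key operation: find the pair of clusters with the smallest proximity.
        i, j = min(
            (
                (i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda ij: proximity(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


data = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters, merges = agglomerative(data, num_clusters=2)
print(clusters)
```

Passing a different function as `proximity` (for example, one based on the maximum pairwise distance) changes the linkage without touching the merge loop. The recorded `merges` list holds the information a dendrogram would display.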

This quiz explores clustering with unsupervised learning, where data is divided into clusters that are meaningful and useful. It covers the concept of similarity and dissimilarity measures in clustering.
