Introduction to Clustering PDF

Summary

This document provides an introduction to clustering, a data mining technique used to group similar data points together. It discusses the core concepts of cluster analysis, including defining clusters and evaluating results, as well as applications like biology, marketing, and city planning. It also outlines the basic steps involved in practical clustering.

Full Transcript

What is Clustering?

Cluster analysis (clustering, segmentation, quantization, …) is the core data mining task of finding clusters.

But what is a cluster? [Esti02]
▶ cannot be precisely defined
▶ many different principles and models have been defined
▶ even more algorithms, with very different results
▶ when is a result “valid”?
▶ results are subjective, “in the eye of the beholder”
▶ no specific definition seems “best” in the general case [Bonn64]

Common themes found in definition attempts:
▶ more homogeneous
▶ more similar
▶ cohesive

What is Clustering? /2

Cluster analysis (clustering, segmentation, quantization, …) is the core data mining task of dividing the data into clusters such that:
▶ similar (related) objects should be in the same cluster
▶ dissimilar (unrelated) objects should be in different clusters
▶ clusters are not defined beforehand (otherwise: use classification)
▶ clusters have (statistical, geometric, …) properties such as:
    ▶ connectivity
    ▶ separation
    ▶ least squared deviation
    ▶ density

Clustering algorithms have different
▶ cluster models (“what is a cluster for this algorithm?”)
▶ induction principles (“how does the algorithm find clusters?”)

Applications of Clustering /2

▶ Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus, and species
▶ Information retrieval: document clustering
▶ Land use: identification of areas of similar land use in an Earth observation database
▶ Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
▶ City planning: identifying groups of houses according to their house type, value, and geographical location
▶ Earthquake studies: observed epicenters should be clustered along continental faults
▶ Climate: understanding Earth's climate, finding patterns of atmospheric and oceanic phenomena
▶ Economic science: market research

Basic Steps for Clustering

Feature selection
▶ select information (about objects) concerning the task of interest
▶ aim at minimal information redundancy
▶ weighting of information

Clustering algorithm and parameters
▶ distance and similarity measure suitable for the problem
▶ cluster quality criterion / cost function / objective
▶ algorithms to use with this distance and quality criterion

Validation and interpretation of the results
▶ validation test
▶ integration with applications
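
The basic steps for clustering can be illustrated with a small Python sketch. It is only one possible instance, not the method from the slides. The house data, the function names (standardize, sq_dist, kmeans), the choice of k-means as a least-squared-deviation cluster model, and the farthest-point initialisation are all assumptions made for this example.

```python
import math

def standardize(rows):
    """Feature weighting: rescale each feature to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [tuple((x - m) / s for x, m, s in zip(row, means, stds))
            for row in rows]

def sq_dist(a, b):
    """Distance measure: squared Euclidean distance (assumed suitable here)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100):
    """Lloyd-style k-means; returns (labels, centers, sum of squared errors)."""
    # Deterministic farthest-point initialisation (an assumption of this sketch).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Move each center to the mean of its members (keep it if empty).
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    # Quality criterion: least squared deviation from the cluster centers.
    sse = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, centers, sse

# Feature selection: hypothetical houses as (value, distance to city centre).
houses = [(100, 1.0), (110, 1.2), (105, 0.9),
          (300, 8.0), (310, 8.5), (295, 7.8)]
features = standardize(houses)

labels, centers, sse = kmeans(features, k=2)
_, _, sse_single = kmeans(features, k=1)

# Simple validation: two clusters should fit far better than one.
print(labels, round(sse, 3), round(sse_single, 3))
```

Comparing the objective for k=2 against k=1 is only a crude validation test. Any real use would also need to interpret the clusters in terms of the application, as the last step above notes.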
