BPD2233: Data Mining Clustering Chap 5 PDF
Document Details
Uploaded by WellManneredSarod9760
Universiti Malaysia Pahang
Tags
Summary
This document discusses the concept of clustering, its applications, and various clustering techniques as part of Data Mining. It covers learning objectives, and examples of real-world applications in different areas like marketing and biology.
Full Transcript
BPD2233: DATA MINING Clustering 1 Learning Objectives ▪ To comprehend the concept of clustering, its application, and features ▪ To understand various distance metrics for clustering of data...
BPD2233: DATA MINING Clustering 1 Learning Objectives ▪ To comprehend the concept of clustering, its application, and features ▪ To understand various distance metrics for clustering of data 2 Clustering ▪ Defined as grouping a set of similar objects into classes or cluster ▪ Key Characteristic: Unsupervised Learning: Clustering is unsupervised because it does not rely on labeled data. The algorithm discovers patterns or groupings without prior knowledge of the output. Similarity/Dissimilarity: Clustering depends on measuring how similar or dissimilar data points are, often using metrics like Euclidean distance, Manhattan distance, or cosine similarity. 3 What is Cluster Analysis? ▪ Given a set of objects, place them in groups such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups Inter-cluster Intra-cluster distances are distances are maximized minimized Application of Cluster Analysis ▪ Understanding Discovered Clusters Industry Group Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN, Group related documents for browsing, 1 Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN, DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN, group genes and proteins that have similar Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down, Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN, Technology1-DOWN functionality, or group stocks with similar Sun-DOWN Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN, price fluctuations 2 ADV-Micro-Device-DOWN,Andrew-Corp-DOWN, Computer-Assoc-DOWN,Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Technology2-DOWN Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN ▪ Summarization Fannie-Mae-DOWN,Fed-Home-Loan-DOWN, Reduce the size of large data sets 3 MBNA-Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP, 4 Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP, Schlumberger-UP Oil-UP Clustering precipitation in Australia Application of Cluster Analysis ▪ Marketing: It helps marketers find out distinctive groups among their customer bases, and this knowledge helps them improve their targeted marketing programs. ▪ Land use: Clustering is used for identifying areas of similar land use from the databases of earth observations. ▪ Insurance: Clustering is helpful for recognizing clusters of insurance policyholders with a high regular claim cost. ▪ City-planning: It also helps in identifying clusters of houses based on house type, geographical location, and value. ▪ Earthquake studies: Clustering is helpful for analysis of earthquakes as it has been noticed hat earthquake epicenters are clustered along continent faults. ▪ Biology studies: Clustering helps in defining plant and animal classifications, identifying genes with similar functionalities, and in gaining insights into structures inherent to populations. ▪ Web discovery: Clustering is helpful in categorizing documents on the web for information discovery. ▪ Fraud detection: Clustering is also helpful in outlier detection applications such as credit card fraud detection. Idea for Cluster Analysis ▪ Scalability: Clustering algorithms should be capable of handling small as well as large datasets smoothly. ▪ Ability to handle different types of attributes: Clustering algorithms should be able to handle different kinds of data such as binary, categorical and interval-based (numerical) data. ▪ Independent of data input order: The clustering results should not be dependent on the ordering of input data. ▪ Identification of clusters with different shapes: The clustering algorithm should be capable of identifying clusters of any shape. ▪ Ability to handle noisy data: Usually, databases consist of noisy, erroneous or missing data, and algorithm must be able to handle these. ▪ High performance: To have a high performance algorithm, it is desirable that the algorithm should need to perform only one scan of the dataset. This capability would reduce the cost of input-output operations. Idea for Cluster Analysis ▪ Interpretability: The results of clustering algorithms should be interpretable, logical and usable. ▪ Ability to stop and resume: For a large dataset, it is desirable to stop and resume the task as it can take a huge amount of time to accomplish the full task and breaks may be necessary. ▪ Minimal user guidance: The clustering algorithm should not expect too much supervision from the analyst, because commonly the analyst has a limited knowledge of the dataset. What is Not Cluster Analysis? ▪ Supervised classification – Have class label information ▪ Simple segmentation – Dividing students into different registration groups alphabetically, by last name ▪ Results of a query – Groupings are a result of an external specification ▪ Graph partitioning – Some mutual relevance and synergy, but areas are not identical Notion of a Cluster can be Ambiguous How many clusters? Six Clusters Two Clusters Four Clusters Types of Clustering ▪ Partition-Based Clustering ▪ Hierarchical clustering ▪ Grid-Based Clustering ▪ Model-Based Clustering ▪ Fuzzy Clustering ▪ Constraint-Based Clustering ▪ Spectral Clustering Partition-Based Clustering ▪ Divides the data into (k) clusters, where (k) is predefined, and each data point belongs to exactly one cluster. ▪ Key Characteristics: Non-overlapping clusters. Optimizes an objective function, such as minimizing the distance to the cluster center. ▪ Example Algorithms: K-Means Clustering: Groups data into (k) clusters around centroids. Iteratively updates centroids until convergence. K-Medoids (PAM): Similar to K-Means, it uses actual data points (medoids) as centers. Applications: Market segmentation, document clustering. Partitional-Based Clustering Original Points A Partitional Clustering Hierarchical-Based Clustering ▪ Creates a hierarchy of clusters represented as a tree-like structure (dendrogram). ▪ Key Characteristics: Can be agglomerative (bottom-up) or divisive (top-down). Does not require the number of clusters to be predefined. ▪ Example Algorithms: Single Linkage: Merges clusters based on the closest pair of points. Complete Linkage: Merges clusters based on the farthest pair of points. ▪ Applications: Gene expression analysis, taxonomy creation. Hierarchical Clustering p1 p3 p4 p2 p1 p2 p3 p4 Traditional Hierarchical Clustering Traditional Dendrogram (diagram representing a tree) p1 p3 p4 p2 p1 p2 p3 p4 Non-traditional Hierarchical Clustering Non-traditional Dendrogram Density-Based Clustering ▪ Forms clusters based on dense regions of data points separated by low-density regions. ▪ Key Characteristics: Can detect clusters of arbitrary shapes. Robust to noise and outliers. ▪ Example Algorithms: ▪ DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Uses a neighborhood radius (ϵ\epsilonϵ) and a minimum number of points (MinPts) to identify dense regions. ▪ OPTICS (Ordering Points to Identify the Clustering Structure): Handles varying density clusters better than DBSCAN. ▪ Applications: ▪ Geospatial data analysis, anomaly detection. Density-Based Clustering Grid-Based Clustering ▪ Divides the data space into a grid of cells and clusters based on dense regions in the grid. ▪ Key Characteristics: Efficient for large datasets. Ideal for spatial data. ▪ Example Algorithms: STING (Statistical Information Grid): Uses hierarchical grids and statistical measures for clustering. CLIQUE: A grid-based and density-based approach for high-dimensional data. ▪ Applications: Spatial data mining, image analysis. Grid-Based Clustering Model-Based Clustering ▪ Assumes data is generated from a mixture of statistical models and fits the data to these models. ▪ Key Characteristics: Provides a probabilistic framework for clustering. Identifies the optimal number of clusters using statistical criteria. ▪ Example Algorithms: Gaussian Mixture Models (GMMs): Assumes each cluster follows a Gaussian distribution. Expectation-Maximization (EM): Iteratively estimates cluster probabilities and model parameters. ▪ Applications: Speech recognition, image segmentation. Model-Based Clustering Fuzzy Clustering ▪ Assigns data points to multiple clusters with degrees of membership. ▪ Key Characteristics: Soft clustering approach. Membership values indicate the degree to which a data point belongs to a cluster. ▪ Example Algorithms: Fuzzy C-Means: Generalizes K-Means by assigning probabilities instead of hard assignments. ▪ Applications: Pattern recognition, medical diagnosis. Fuzzy Clustering Constraint-Based Clustering ▪ Incorporates domain knowledge or constraints into the clustering process. ▪ Key Characteristics: Constraints can include must-link or cannot-link conditions between points. Ensures clusters meet specific rules. ▪ Example Algorithms: COP-KMeans (Clustering with Pairwise Constraints). ▪ Applications: Bioinformatics, market analysis with predefined business rules. Constraint-Based Clustering Spectral Clustering ▪ Uses eigenvalues of a similarity matrix to reduce dimensions and perform clustering. ▪ Key Characteristics: Works well for data that is not linearly separable. Relies on graph theory for clustering. ▪ Example Algorithms: Normalized Cut (Ncut) algorithm. ▪ Applications: Image segmentation, graph clustering. Constraint-Based Clustering Summary Type Key Characteristics Strengths Weaknesses Divides data into kkk non- Requires (k), sensitive to Partition-Based Simple, fast for large datasets. overlapping clusters. initialization. Forms a tree-like hierarchy of Hierarchical No need for kkk, interpretable. Computationally expensive. clusters. Detects arbitrary shapes, handles Sensitive to parameters Density-Based Groups high-density regions. noise. (ϵ\epsilonϵ, MinPts). Grid-Based Clusters dense grid cells. Efficient for large spatial datasets. Limited to spatial data. Computationally intensive, Model-Based Fits data to statistical models. Probabilistic, handles uncertainty. assumes distributions. Soft clustering with degrees of Fuzzy Clustering Handles overlapping clusters. High computational cost. membership. Dependent on quality of Constraint-Based Clusters follow specific rules. Incorporates domain knowledge. constraints. Clustering using graph-based Effective for non-linearly Spectral Sensitive to graph construction. methods. separable data. Types of Clusters ▪ Well-separated clusters ▪ Prototype-based clusters ▪ Contiguity-based clusters ▪ Density-based clusters ▪ Described by an Objective Function Types of Clusters: Well-Separated ▪ A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. 3 well-separated clusters Types of Clusters: Prototype - Based ▪ A cluster is a set of objects such that an object in a cluster is closer (more similar) to the prototype or “center” of a cluster, than to the center of any other cluster The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster 4 center-based clusters Types of Clusters: Contiguity - Based A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. 8 contiguous clusters Types of Clusters: Density - Based A cluster is a dense region of points, which is separated by low-density regions, from other regions of high density. Used when the clusters are irregular or intertwined, and when noise and outliers are present. 6 density-based clusters Types of Cluster: Objectives Function ▪ Clusters Defined by an Objective Function Finds clusters that minimize or maximize an objective function. Enumerate all possible ways of dividing the points into clusters and evaluate the `goodness' of each potential set of clusters by using the given objective function. (NP Hard) Can have global or local objectives. o Hierarchical clustering algorithms typically have local objectives o Partitional algorithms typically have global objectives A variation of the global objective function approach is to fit the data to a parameterized model. o Parameters for the model are determined from the data. o Mixture models assume that the data is a ‘mixture' of a number of statistical distributions. Clustering vs Cluster Aspect Clustering Cluster Definition The resulting group of data The process of grouping data points. points. Nature A technique or algorithm. A subset or output of data. Focus Methodology. Specific group or result. Examples K-Means, Hierarchical Clustering. A group of similar customers. Scope The overall task. A single group within the dataset. Evaluation Cluster Metric ▪ Internal Validation: Silhouette Coefficient: Measures how similar a point is to its own cluster compared to others. Range: [-1, 1]. Higher is better. Dunn Index: Ratio of the minimum distance between clusters to the maximum intra-cluster distance. ▪ External Validation: ▪ Rand Index: ▪ Compares clustering with ground truth (if available). ▪ Adjusted Mutual Information (AMI): ▪ Measures agreement between clustering and ground truth. ▪ Cluster Validation Techniques: Elbow Method: Used in K-Means to find the optimal (k). Gap Statistic: Compares within-cluster dispersion with a reference dataset. Characteristic of the Input Data are Important ▪ Type of proximity or density measure Central to clustering Depends on data and application ▪ Data characteristics that affect proximity and/or density are Dimensionality o Sparseness Attribute type Special relationships in the data For example, autocorrelation Distribution of the data ▪ Noise and Outliers Often interfere with the operation of the clustering algorithm ▪ Clusters of differing sizes, densities, and shapes Test Define the differences between hard and soft clustering [4 marks] Remind me…. 39 ▪ Noisy data refers to data that contains errors, irrelevant information, or inconsistencies that obscure the patterns or insights in a dataset. Noise can distort the true relationships in the data, making analysis, model training, or decision-making more challenging ▪ Example: Text Data: Typographical errors, slang, or irrelevant words in text datasets. Example: "Thsi is anoise" instead of "This is noise". Numerical Data: Outliers or incorrect values, such as a temperature of 1,000°C in a weather dataset. Image Data: Grainy, blurred, or pixelated images. Time Series Data: Sudden spikes or dips in stock prices without a valid reason. ▪ Analogy Think of clustering as cooking, and a cluster as the dish you prepare: Clustering is the method of cooking (e.g., following a recipe, using tools). A cluster is the final dish — one of the outcomes of the cooking process. THANK YOU 41