BPD2233: Data Mining Clustering (Chapter 5)


Universiti Malaysia Pahang


Summary

This document discusses the concept of clustering, its applications, and various clustering techniques as part of data mining. It covers learning objectives and examples of real-world applications in areas such as marketing and biology.

Full Transcript


BPD2233: DATA MINING Clustering

Learning Objectives
▪ To comprehend the concept of clustering, its applications, and features
▪ To understand various distance metrics for the clustering of data

Clustering
▪ Defined as grouping a set of similar objects into classes or clusters.
▪ Key characteristics:
  Unsupervised learning: Clustering is unsupervised because it does not rely on labeled data. The algorithm discovers patterns or groupings without prior knowledge of the output.
  Similarity/dissimilarity: Clustering depends on measuring how similar or dissimilar data points are, often using metrics such as Euclidean distance, Manhattan distance, or cosine similarity (a short numeric illustration follows below).
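To make these metrics concrete, here is a minimal sketch in Python. It assumes NumPy and SciPy are installed; the two sample points are invented purely for illustration.

```python
import numpy as np
from scipy.spatial import distance

# Two example points (hypothetical values, for illustration only)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Euclidean distance: straight-line distance, sqrt of summed squared differences
euclidean = distance.euclidean(a, b)

# Manhattan (city-block) distance: sum of absolute coordinate differences
manhattan = distance.cityblock(a, b)

# Cosine similarity: 1 minus the cosine distance; compares direction, not magnitude
cosine_similarity = 1.0 - distance.cosine(a, b)

print(f"Euclidean:         {euclidean:.3f}")  # sqrt(9 + 4 + 0) = 3.606
print(f"Manhattan:         {manhattan:.3f}")  # 3 + 2 + 0 = 5.000
print(f"Cosine similarity: {cosine_similarity:.3f}")
```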
What is Cluster Analysis?
▪ Given a set of objects, place them in groups such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
▪ Intra-cluster distances are minimized; inter-cluster distances are maximized.

Applications of Cluster Analysis
▪ Understanding: Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations. For example, clusters of stocks discovered from price movements:
  Cluster 1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
  Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
  Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
  Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
▪ Summarization: Reduce the size of large data sets. [Figure: clustering precipitation in Australia]

Applications of Cluster Analysis
▪ Marketing: Helps marketers find distinctive groups within their customer bases; this knowledge helps them improve their targeted marketing programs.
▪ Land use: Identifying areas of similar land use from databases of earth observations.
▪ Insurance: Recognizing clusters of insurance policyholders with a high regular claim cost.
▪ City planning: Identifying clusters of houses based on house type, geographical location, and value.
▪ Earthquake studies: It has been noticed that earthquake epicenters are clustered along continental faults.
▪ Biology studies: Defining plant and animal classifications, identifying genes with similar functionality, and gaining insight into structures inherent in populations.
▪ Web discovery: Categorizing documents on the web for information discovery.
▪ Fraud detection: Outlier detection applications such as credit card fraud detection.

Ideas for Cluster Analysis
▪ Scalability: Clustering algorithms should handle small as well as large datasets smoothly.
▪ Ability to handle different types of attributes: Algorithms should handle different kinds of data, such as binary, categorical, and interval-based (numerical) data.
▪ Independence from data input order: The clustering results should not depend on the ordering of the input data.
▪ Identification of clusters with different shapes: The algorithm should be capable of identifying clusters of any shape.
▪ Ability to handle noisy data: Databases usually contain noisy, erroneous, or missing data, and the algorithm must be able to handle these.
▪ High performance: Ideally, the algorithm should need only one scan of the dataset, which reduces the cost of input/output operations.
▪ Interpretability: The results of clustering algorithms should be interpretable, logical, and usable.
▪ Ability to stop and resume: For a large dataset, it is desirable to be able to stop and resume the task, since it can take a huge amount of time and breaks may be necessary.
▪ Minimal user guidance: The algorithm should not expect too much supervision from the analyst, who commonly has limited knowledge of the dataset.

What is Not Cluster Analysis?
▪ Supervised classification – has class label information.
▪ Simple segmentation – dividing students into different registration groups alphabetically, by last name.
▪ Results of a query – groupings are the result of an external specification.
▪ Graph partitioning – some mutual relevance and synergy, but the areas are not identical.

Notion of a Cluster Can Be Ambiguous
▪ How many clusters? The same set of points may plausibly be grouped into two, four, or six clusters. [Figure: the same points shown as two, four, and six clusters]

Types of Clustering
▪ Partition-based clustering
▪ Hierarchical clustering
▪ Density-based clustering
▪ Grid-based clustering
▪ Model-based clustering
▪ Fuzzy clustering
▪ Constraint-based clustering
▪ Spectral clustering

Partition-Based Clustering
▪ Divides the data into k clusters, where k is predefined, and each data point belongs to exactly one cluster.
▪ Key characteristics: Non-overlapping clusters. Optimizes an objective function, such as minimizing the distance to the cluster center.
▪ Example algorithms:
  K-Means: Groups data into k clusters around centroids; iteratively updates the centroids until convergence (a minimal sketch follows below).
  K-Medoids (PAM): Similar to K-Means, but uses actual data points (medoids) as centers.
▪ Applications: Market segmentation, document clustering.
[Figure: original points vs. a partitional clustering]
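The following is a minimal K-Means sketch, assuming scikit-learn is available; the synthetic blob data and parameter choices are illustrative, not part of the original slides.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 3 natural groupings (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k must be chosen in advance; centroids are recomputed iteratively
# until the point assignments stop changing (convergence)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster sizes:", np.bincount(labels))
print("Centroids:\n", kmeans.cluster_centers_)
```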
Hierarchical Clustering
▪ Creates a hierarchy of clusters represented as a tree-like structure (dendrogram).
▪ Key characteristics: Can be agglomerative (bottom-up) or divisive (top-down). Does not require the number of clusters to be predefined.
▪ Example algorithms:
  Single linkage: Merges clusters based on the closest pair of points.
  Complete linkage: Merges clusters based on the farthest pair of points.
▪ Applications: Gene expression analysis, taxonomy creation.
[Figure: traditional and non-traditional hierarchical clusterings of points p1–p4, each with its dendrogram (a diagram representing a tree)]

Density-Based Clustering
▪ Forms clusters based on dense regions of data points separated by low-density regions.
▪ Key characteristics: Can detect clusters of arbitrary shapes. Robust to noise and outliers.
▪ Example algorithms:
  DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Uses a neighborhood radius (ε) and a minimum number of points (MinPts) to identify dense regions.
  OPTICS (Ordering Points To Identify the Clustering Structure): Handles clusters of varying density better than DBSCAN.
▪ Applications: Geospatial data analysis, anomaly detection.

Grid-Based Clustering
▪ Divides the data space into a grid of cells and clusters based on dense regions in the grid.
▪ Key characteristics: Efficient for large datasets. Ideal for spatial data.
▪ Example algorithms:
  STING (Statistical Information Grid): Uses hierarchical grids and statistical measures for clustering.
  CLIQUE: A grid-based and density-based approach for high-dimensional data.
▪ Applications: Spatial data mining, image analysis.

Model-Based Clustering
▪ Assumes data is generated from a mixture of statistical models and fits the data to these models.
▪ Key characteristics: Provides a probabilistic framework for clustering. Identifies the optimal number of clusters using statistical criteria.
▪ Example algorithms:
  Gaussian Mixture Models (GMMs): Assume each cluster follows a Gaussian distribution.
  Expectation-Maximization (EM): Iteratively estimates cluster probabilities and model parameters.
▪ Applications: Speech recognition, image segmentation.

Fuzzy Clustering
▪ Assigns data points to multiple clusters with degrees of membership.
▪ Key characteristics: A soft clustering approach. Membership values indicate the degree to which a data point belongs to each cluster.
▪ Example algorithms:
  Fuzzy C-Means: Generalizes K-Means by assigning membership degrees instead of hard assignments.
▪ Applications: Pattern recognition, medical diagnosis.

Constraint-Based Clustering
▪ Incorporates domain knowledge or constraints into the clustering process.
▪ Key characteristics: Constraints can include must-link or cannot-link conditions between points. Ensures clusters meet specific rules.
▪ Example algorithms: COP-KMeans (clustering with pairwise constraints).
▪ Applications: Bioinformatics, market analysis with predefined business rules.

Spectral Clustering
▪ Uses eigenvalues of a similarity matrix to reduce dimensions and perform clustering.
▪ Key characteristics: Works well for data that is not linearly separable. Relies on graph theory for clustering.
▪ Example algorithms: Normalized Cut (Ncut).
▪ Applications: Image segmentation, graph clustering.

Summary (several of these families are contrasted in the short sketch after the table below)

| Type | Key Characteristics | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Partition-based | Divides data into k non-overlapping clusters. | Simple, fast for large datasets. | Requires k; sensitive to initialization. |
| Hierarchical | Forms a tree-like hierarchy of clusters. | No need for k; interpretable. | Computationally expensive. |
| Density-based | Groups high-density regions. | Detects arbitrary shapes; handles noise. | Sensitive to parameters (ε, MinPts). |
| Grid-based | Clusters dense grid cells. | Efficient for large spatial datasets. | Limited to spatial data. |
| Model-based | Fits data to statistical models. | Probabilistic; handles uncertainty. | Computationally intensive; assumes distributions. |
| Fuzzy | Soft clustering with degrees of membership. | Handles overlapping clusters. | High computational cost. |
| Constraint-based | Clusters follow specific rules. | Incorporates domain knowledge. | Dependent on quality of constraints. |
| Spectral | Clustering using graph-based methods. | Effective for non-linearly separable data. | Sensitive to graph construction. |
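As a rough illustration of how several of these families behave on the same non-convex dataset, here is a sketch assuming scikit-learn; the half-moon data and all parameter values (eps, MinPts, numbers of clusters) are invented for demonstration, not taken from the slides.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

# Two interleaved half-moons: clusters with non-convex shapes
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# Hierarchical (agglomerative): single linkage merges the closest pairs
hier = AgglomerativeClustering(n_clusters=2, linkage="single").fit(X)

# Density-based: eps is the neighborhood radius, min_samples plays the role of MinPts
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

# Model-based: fits a mixture of 2 Gaussians via Expectation-Maximization
gmm_labels = GaussianMixture(n_components=2, random_state=42).fit_predict(X)

# Spectral: clusters via eigenvectors of a nearest-neighbor similarity graph
spec = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                          random_state=42).fit(X)

# DBSCAN may additionally emit label -1 for noise points
for name, labels in [("hierarchical", hier.labels_), ("dbscan", db.labels_),
                     ("model-based", gmm_labels), ("spectral", spec.labels_)]:
    print(name, "-> labels found:", np.unique(labels))
```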
Types of Clusters
▪ Well-separated clusters
▪ Prototype-based clusters
▪ Contiguity-based clusters
▪ Density-based clusters
▪ Clusters described by an objective function

Types of Clusters: Well-Separated
▪ A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. [Figure: 3 well-separated clusters]

Types of Clusters: Prototype-Based
▪ A cluster is a set of objects such that an object in a cluster is closer (more similar) to the prototype or "center" of its cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of the cluster. [Figure: 4 center-based clusters]

Types of Clusters: Contiguity-Based
▪ A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. [Figure: 8 contiguous clusters]

Types of Clusters: Density-Based
▪ A cluster is a dense region of points, separated from other regions of high density by low-density regions. Used when the clusters are irregular or intertwined, and when noise and outliers are present. [Figure: 6 density-based clusters]

Types of Clusters: Objective Function
▪ Clusters defined by an objective function:
  Find clusters that minimize or maximize an objective function.
  Enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters using the given objective function. (NP-hard)
  Objectives can be global or local:
  o Hierarchical clustering algorithms typically have local objectives.
  o Partitional algorithms typically have global objectives.
  A variation of the global objective function approach is to fit the data to a parameterized model:
  o Parameters for the model are determined from the data.
  o Mixture models assume that the data is a 'mixture' of a number of statistical distributions.

Clustering vs Cluster

| Aspect | Clustering | Cluster |
| --- | --- | --- |
| Definition | The process of grouping data points. | The resulting group of data points. |
| Nature | A technique or algorithm. | A subset or output of the data. |
| Focus | Methodology. | A specific group or result. |
| Examples | K-Means, hierarchical clustering. | A group of similar customers. |
| Scope | The overall task. | A single group within the dataset. |

Cluster Evaluation Metrics
▪ Internal validation:
  Silhouette coefficient: Measures how similar a point is to its own cluster compared to other clusters. Range: [-1, 1]; higher is better.
  Dunn index: Ratio of the minimum distance between clusters to the maximum intra-cluster distance.
▪ External validation:
  Rand index: Compares the clustering with ground truth (if available).
  Adjusted Mutual Information (AMI): Measures agreement between the clustering and ground truth.
▪ Cluster validation techniques:
  Elbow method: Used in K-Means to find the optimal k (see the sketch below).
  Gap statistic: Compares within-cluster dispersion with a reference dataset.
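As an illustration of internal validation, the sketch below combines the elbow method (inspecting inertia, the within-cluster sum of squared errors) with the silhouette coefficient. It assumes scikit-learn; the synthetic data and the range of k values are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 underlying groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Elbow method: inertia keeps dropping as k grows; the "elbow" where
# the improvement levels off suggests a good k. The silhouette score
# (in [-1, 1], higher is better) gives a second opinion.
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"k={k}  inertia={km.inertia_:8.1f}  silhouette={sil:.3f}")
```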
Characteristics of the Input Data Are Important
▪ Type of proximity or density measure: Central to clustering; depends on the data and the application.
▪ Data characteristics that affect proximity and/or density:
  Dimensionality (e.g., sparseness)
  Attribute type
  Special relationships in the data (for example, autocorrelation)
  Distribution of the data
▪ Noise and outliers: Often interfere with the operation of the clustering algorithm.
▪ Clusters of differing sizes, densities, and shapes.

Test
Define the differences between hard and soft clustering. [4 marks] (A brief sketch at the end of this section contrasts the two.)

Remind me…
▪ Noisy data refers to data that contains errors, irrelevant information, or inconsistencies that obscure the patterns or insights in a dataset. Noise can distort the true relationships in the data, making analysis, model training, or decision-making more challenging.
▪ Examples:
  Text data: Typographical errors, slang, or irrelevant words in text datasets, e.g., "Thsi is anoise" instead of "This is noise".
  Numerical data: Outliers or incorrect values, such as a temperature of 1,000°C in a weather dataset.
  Image data: Grainy, blurred, or pixelated images.
  Time series data: Sudden spikes or dips in stock prices without a valid reason.
▪ Analogy: Think of clustering as cooking, and a cluster as the dish you prepare. Clustering is the method of cooking (e.g., following a recipe, using tools); a cluster is the final dish, one of the outcomes of the cooking process.
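One way to approach the test question above is to contrast hard assignments from K-Means with soft membership degrees from a Gaussian mixture. A minimal sketch, assuming scikit-learn; the synthetic data is for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# A tiny synthetic dataset with 2 groups (illustrative only)
X, _ = make_blobs(n_samples=10, centers=2, random_state=0)

# Hard clustering: each point receives exactly one cluster label
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each point receives a membership degree per cluster
soft = GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X)

print("hard labels:", hard)
print("soft memberships (each row sums to 1):\n", np.round(soft, 3))
```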

THANK YOU