Data Mining Topic (7) AMT Clustering PDF
Summary
This document provides an overview of data mining, focusing on the topic of clustering. It details clustering concepts including similar and dissimilar data objects and applications in various fields. It also elucidates the steps in developing a clustering task, such as feature selection, proximity measures, and validation.
# Data Mining Topic (7) AMT Clustering

### Clustering
- **Cluster:** a collection of data objects that are
  - **Similar** (or related) to one another within the same group
  - **Dissimilar** (or unrelated) to the objects in other groups
- **Cluster analysis** (or clustering, data segmentation, ...): finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters.
- **Unsupervised learning:** no predefined classes (learning by observation, as opposed to supervised learning by examples)
- **Typical applications**
  1. As a stand-alone tool to get insight into the data distribution
  2. As a preprocessing step for other algorithms
- **Applications**
  - Data reduction (summarization and compression)
  - Hypothesis generation and testing
  - Prediction based on groups
  - Finding k-nearest neighbors
  - Outlier detection
  - Domains: biology, information retrieval, land use, marketing, city planning, earthquake studies, climate, economic science

### Basic Steps to Develop a Clustering Task
1. Feature selection
2. Proximity measure
3. Clustering criterion
4. Clustering algorithms
5. Validation of the results
6. Interpretation of the results

### A Good Clustering Method Produces High-Quality Clusters
- **High intra-class similarity:** cohesive within clusters
- **Low inter-class similarity:** distinctive between clusters
- The quality of a clustering method depends on:
  - The similarity measure used by the method and its implementation
  - Its ability to discover some or all of the hidden patterns
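The cohesion/separation criteria above can be checked numerically: a minimal Python sketch, using two hypothetical hand-picked 2-D clusters (the helpers `euclidean` and `mean_pairwise` are illustrative names, not from the notes), confirming that the average intra-cluster distance is smaller than the average inter-cluster distance.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two 2-D points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def mean_pairwise(points_a, points_b):
    """Average distance over every pair drawn from the two point sets."""
    pairs = [(a, b) for a in points_a for b in points_b if a != b]
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)

# Two hypothetical, well-separated clusters.
cluster1 = [(1.0, 1.0), (2.0, 1.0), (1.5, 2.0)]
cluster2 = [(8.0, 8.0), (9.0, 8.0), (8.5, 9.0)]

intra = mean_pairwise(cluster1, cluster1)  # cohesion: small is good
inter = mean_pairwise(cluster1, cluster2)  # separation: large is good
print(intra < inter)  # prints True for a good clustering
```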
### Dissimilarity/Similarity Metric
- Distance functions are usually defined quite differently for interval-scaled, Boolean, categorical, ordinal, ratio, and vector variables
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective

### Considerations for Cluster Analysis
- Partitioning criteria (single-level vs. hierarchical)
- Separation of clusters (exclusive vs. non-exclusive)
- Similarity measure
- Clustering space

### Requirements and Challenges
- Scalability
- Dealing with different types of attributes (numerical, binary, categorical, ...)
- Constraint-based clustering
- Interpretability and usability
- Clusters of arbitrary shape
- Dealing with noise
- Incremental clustering
- High dimensionality

### Types of Clustering
1. **Partitioning approach:** construct various partitions and then evaluate them by some criterion, e.g. minimizing the sum of squared errors
   - Typical methods: k-means, k-medoids, CLARANS
2. **Hierarchical approach**
3. **Density-based approach**
4. **Grid-based approach**

### (1) Partitioning Approach
- **Partitioning method:** partition a database D of n objects into a set of k clusters such that the sum of squared distances is minimized (where $c_i$ is the centroid or medoid of cluster $C_i$)
- **Heuristic methods:**
  1. k-means
  2. k-modes (for categorical data)
  3. k-medoids (addresses k-means' sensitivity to outliers, handling a wider range of data)
- **Strengths of k-means:**
  1. Efficient (saves time)
  2. Terminates at a local optimum
- **Weaknesses of k-means:**
  1. Applicable to continuous data only
  2. The number of clusters must be specified in advance
  3. Sensitive to noise (outliers)
  4. Not suitable for non-convex cluster shapes
- **Variations on the k-means method:**
  1. Choice of the initial k means
  2. Dissimilarity calculation method (Manhattan, Euclidean, ...)
  3. Strategy (algorithm) used to calculate the cluster means
- **Objective function:** $E = \sum_{i=1}^k \sum_{p \in C_i} (d(p,c_i))^2$

### Measuring Clustering Quality
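The objective $E$ above sums, over every cluster, the squared distance of each member point to its centroid. A minimal sketch (the function name `sse` and the example partition, taken from the medicines exercise later in these notes, are illustrative):

```python
def sse(clusters, centroids):
    """E = sum over clusters i of sum over points p in C_i of d(p, c_i)^2."""
    total = 0.0
    for points, c in zip(clusters, centroids):
        for p in points:
            # Squared Euclidean distance; no sqrt, since E squares d(p, c_i).
            total += (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
    return total

# A 2-cluster partition with its centroids (values from the worked example).
clusters = [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]
centroids = [(1.5, 1.0), (4.5, 3.5)]
print(sse(clusters, centroids))  # 0.25 + 0.25 + 0.5 + 0.5 = 1.5
```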
1. **External:** supervised; employs criteria not inherent to the dataset. Compares a clustering against prior or expert-specified knowledge (i.e. the ground truth) using some clustering quality measure.
2. **Internal:** unsupervised; criteria are derived from the data itself. Evaluates the goodness of a clustering by how well the clusters are separated and how compact they are, e.g. the silhouette coefficient.
3. **Relative:** directly compares different clusterings, usually those obtained via different parameter settings for the same algorithm.

### Algorithm Steps of K-means
1. Choose k (the number of clusters), then select k random centroids
2. Assign each record to its nearest centroid's cluster
3. Recalculate each centroid as the mean of the records in its cluster
4. Repeat steps 2 and 3 until the centroids (almost) stop changing
- **Final output:**
  1. Cluster centers
  2. Cluster assignments for the data records

### Important Drawings
- Diagram showing 5 clusters, with data points distributed around them
- Diagram showing 3 clusters, with data points in a variety of shapes, some inside and some outside the clusters

### Question
We have 4 medicines as our training data objects, and each medicine has 2 attributes. Each attribute represents a coordinate of the object. Determine which medicines belong to cluster 1 and which belong to cluster 2.
- Number of clusters k = 2; plot the points
- Suppose that we use medicine A and medicine B as the first centroids.
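The four algorithm steps above can be sketched directly in Python. This is a minimal version of the standard assign/recompute loop; the function name `kmeans`, the stopping rule "assignments unchanged", and the toy points in the usage line are choices of this sketch, not from the notes.

```python
import math

def kmeans(points, centroids):
    """Repeat: assign each point to its nearest centroid, then recompute
    each centroid as the mean of its cluster, until assignments stabilize.
    Note: mutates the `centroids` list in place."""
    assignment = None
    while True:
        # Step 2: assign each record to the nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_assignment == assignment:  # Step 4: stop when nothing moves
            return centroids, assignment
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))

centers, labels = kmeans([(0, 0), (0, 1), (5, 5), (5, 6)], [(0, 0), (5, 5)])
print(centers, labels)  # [(0.0, 0.5), (5.0, 5.5)] [0, 0, 1, 1]
```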
The four medicines and their attribute values (as used in the distance computations below):

| Medicine | Attribute 1 (X) | Attribute 2 (Y) |
|----|---|---|
| A | 1 | 1 |
| B | 2 | 1 |
| C | 4 | 3 |
| D | 5 | 4 |

- Initial centroids: C1 = (1, 1), C2 = (2, 1)
- We calculate the distance between each cluster centroid and each object using the Euclidean distance:
  - D(A, C1) = √((1 − 1)² + (1 − 1)²) = 0
  - D(B, C1) = √((2 − 1)² + (1 − 1)²) = 1
  - D(C, C1) = √((4 − 1)² + (3 − 1)²) = 3.61
  - D(D, C1) = √((5 − 1)² + (4 − 1)²) = 5
  - D(A, C2) = √((1 − 2)² + (1 − 1)²) = 1
  - D(B, C2) = √((2 − 2)² + (1 − 1)²) = 0
  - D(C, C2) = √((4 − 2)² + (3 − 1)²) = 2.83
  - D(D, C2) = √((5 − 2)² + (4 − 1)²) = 4.24
- Distance matrix at iteration 0:

D(0) =
|  | A | B | C | D |
|----|---|---|---|----|
| C1 = (1, 1), group 1 | 0 | 1 | 3.61 | 5 |
| C2 = (2, 1), group 2 | 1 | 0 | 2.83 | 4.24 |

- A is assigned to group 1; B, C and D are assigned to group 2:

G(0) =
|  | A | B | C | D |
|---|---|---|---|---|
| Group 1 | 1 | 0 | 0 | 0 |
| Group 2 | 0 | 1 | 1 | 1 |

- Recalculate the means:
  - C1 = (X_A, Y_A) = (1, 1)
  - C2 = ((X_B + X_C + X_D)/3, (Y_B + Y_C + Y_D)/3) = (3.67, 2.67)
- Recalculate the distances using the new C1 and C2:

D(1) =
|  | A | B | C | D |
|----|---|---|---|----|
| C1 = (1, 1), group 1 | 0 | 1 | 3.61 | 5 |
| C2 = (3.67, 2.67), group 2 | 3.14 | 2.36 | 0.47 | 1.89 |

G(1) =
|  | A | B | C | D |
|---|---|---|---|---|
| Group 1 | 1 | 1 | 0 | 0 |
| Group 2 | 0 | 0 | 1 | 1 |

- Recalculate the means:
  - C1 = ((X_A + X_B)/2, (Y_A + Y_B)/2) = (1.5, 1)
  - C2 = ((X_C + X_D)/2, (Y_C + Y_D)/2) = (4.5, 3.5)
- Recalculate the distances using the new C1 and C2:

D(2) =
|  | A | B | C | D |
|----|---|---|---|----|
| C1 = (1.5, 1), group 1 | 0.5 | 0.5 | 3.20 | 4.61 |
| C2 = (4.5, 3.5), group 2 | 4.30 | 3.54 | 0.71 | 0.71 |

G(2) =
|  | A | B | C | D |
|---|---|---|---|---|
| Group 1 | 1 | 1 | 0 | 0 |
| Group 2 | 0 | 0 | 1 | 1 |

- Comparing the grouping of this iteration with the last one, G(1) = G(2): the objects no longer move between groups, so the algorithm terminates.
- **Final answer:**
  - Medicines A and B → Group 1
  - Medicines C and D → Group 2
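The worked example can be replayed end-to-end with a short standalone script (the dictionary layout, the tie-breaking rule "prefer C1 on equal distance", and the 10-iteration safety cap are choices of this sketch) that reproduces the final grouping and centroids:

```python
import math

# The four medicines from the exercise and their two attributes.
points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
c1, c2 = points["A"], points["B"]  # initial centroids, per the example

for _ in range(10):  # safety cap; the example converges in 3 iterations
    # Assign each medicine to the nearer centroid (ties go to C1).
    g1 = [n for n, p in points.items() if math.dist(p, c1) <= math.dist(p, c2)]
    g2 = [n for n in points if n not in g1]
    # Recompute each centroid as the mean of its group's coordinates.
    new_c1 = tuple(sum(points[n][k] for n in g1) / len(g1) for k in (0, 1))
    new_c2 = tuple(sum(points[n][k] for n in g2) / len(g2) for k in (0, 1))
    if (new_c1, new_c2) == (c1, c2):  # centroids stopped changing
        break
    c1, c2 = new_c1, new_c2

print(g1, g2)  # ['A', 'B'] ['C', 'D']
print(c1, c2)  # (1.5, 1.0) (4.5, 3.5)
```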