Data Mining Topic (7) AMT Clustering PDF
Summary
This document provides an overview of data mining, focusing on the topic of clustering. It details clustering concepts including similar and dissimilar data objects and applications in various fields. It also elucidates the steps in developing a clustering task, such as feature selection, proximity measures, and validation.
# Data Mining Topic (7) AMT Clustering

### Clustering
- **Cluster:** a collection of data objects that are
  - **Similar** (or related) to one another within the same group
  - **Dissimilar** (or unrelated) to the objects in other groups
- **Cluster analysis** (or clustering, data segmentation, ...): finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters.
- **Unsupervised learning:** no predefined classes (learning by observation, as opposed to supervised learning by examples)
- **Typical applications**
  1. As a stand-alone tool to get insight into the data distribution
  2. As a preprocessing step for other algorithms
- **Applications**
  - Data reduction (summarization and compression)
  - Hypothesis generation and testing
  - Prediction based on groups
  - Finding k-nearest neighbors
  - Outlier detection
  - Domains: biology, information retrieval, land use, marketing, city planning, earthquake studies, climate, economic science

### Basic Steps to Develop a Clustering Task
1. Feature selection
2. Proximity measure
3. Clustering criterion
4. Clustering algorithms
5. Validation of the results
6. Interpretation of the results

### A Good Clustering Method Produces High-Quality Clusters
- **High intra-class similarity:** cohesive within clusters
- **Low inter-class similarity:** distinctive between clusters
- The quality of a clustering method depends on:
  - The similarity measure used by the method and its implementation
  - Its ability to discover some or all of the hidden patterns
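The cohesion/separation criteria above can be checked numerically: a minimal Python sketch, using two hypothetical hand-picked 2-D clusters (the helpers `euclidean` and `mean_pairwise` are illustrative names, not from the notes), confirming that the average intra-cluster distance is smaller than the average inter-cluster distance.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two 2-D points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def mean_pairwise(points_a, points_b):
    """Average distance over every pair drawn from the two point sets."""
    pairs = [(a, b) for a in points_a for b in points_b if a != b]
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)

# Two hypothetical, well-separated clusters.
cluster1 = [(1.0, 1.0), (2.0, 1.0), (1.5, 2.0)]
cluster2 = [(8.0, 8.0), (9.0, 8.0), (8.5, 9.0)]

intra = mean_pairwise(cluster1, cluster1)  # cohesion: small is good
inter = mean_pairwise(cluster1, cluster2)  # separation: large is good
print(intra < inter)  # prints True for a good clustering
```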
### Dissimilarity/Similarity Metric
- Distance functions are usually defined quite differently for interval-scaled, Boolean, categorical, ordinal, ratio, and vector variables
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective

### Considerations for Cluster Analysis
- Partitioning criteria (single-level vs. hierarchical)
- Separation of clusters (exclusive vs. non-exclusive)
- Similarity measure
- Clustering space

### Requirements and Challenges
- Scalability
- Dealing with different types of attributes (numerical, binary, categorical, ...)
- Constraint-based clustering
- Interpretability and usability
- Clusters of arbitrary shape
- Dealing with noise
- Incremental clustering
- High dimensionality

### Types of Clustering
1. **Partitioning approach:** construct various partitions and then evaluate them by some criterion, e.g. minimizing the sum of squared errors
   - Typical methods: k-means, k-medoids, CLARANS
2. **Hierarchical approach**
3. **Density-based approach**
4. **Grid-based approach**

### (1) Partitioning Approach
- **Partitioning method:** partition a database D of n objects into a set of k clusters such that the sum of squared distances is minimized (where $c_i$ is the centroid or medoid of cluster $C_i$)
- **Heuristic methods:**
  1. k-means
  2. k-modes (for categorical data)
  3. k-medoids (addresses k-means' sensitivity to outliers, handling a wider range of data)
- **Strengths of k-means:**
  1. Efficient (saves time)
  2. Terminates at a local optimum
- **Weaknesses of k-means:**
  1. Applicable to continuous data only
  2. The number of clusters must be specified in advance
  3. Sensitive to noise (outliers)
  4. Not suitable for non-convex cluster shapes
- **Variations on the k-means method:**
  1. Choice of the initial k means
  2. Dissimilarity calculation method (Manhattan, Euclidean, ...)
  3. Strategy (algorithm) used to calculate the cluster means
- **Objective function:** $E = \sum_{i=1}^k \sum_{p \in C_i} (d(p,c_i))^2$

### Measuring Clustering Quality
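The objective $E$ above sums, over every cluster, the squared distance of each member point to its centroid. A minimal sketch (the function name `sse` and the example partition, taken from the medicines exercise later in these notes, are illustrative):

```python
def sse(clusters, centroids):
    """E = sum over clusters i of sum over points p in C_i of d(p, c_i)^2."""
    total = 0.0
    for points, c in zip(clusters, centroids):
        for p in points:
            # Squared Euclidean distance; no sqrt, since E squares d(p, c_i).
            total += (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
    return total

# A 2-cluster partition with its centroids (values from the worked example).
clusters = [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]
centroids = [(1.5, 1.0), (4.5, 3.5)]
print(sse(clusters, centroids))  # 0.25 + 0.25 + 0.5 + 0.5 = 1.5
```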
1. **External:** supervised; employs criteria not inherent to the dataset. Compares a clustering against prior or expert-specified knowledge (i.e. the ground truth) using some clustering quality measure.
2. **Internal:** unsupervised; criteria are derived from the data itself. Evaluates the goodness of a clustering by how well the clusters are separated and how compact they are, e.g. the silhouette coefficient.
3. **Relative:** directly compares different clusterings, usually those obtained via different parameter settings for the same algorithm.

### Algorithm Steps of K-means
1. Choose k (the number of clusters), then select k random centroids
2. Assign each record to its nearest centroid's cluster
3. Recalculate each centroid as the mean of the records in its cluster
4. Repeat steps 2 and 3 until the centroids (almost) stop changing
- **Final output:**
  1. Cluster centers
  2. Cluster assignments for the data records

### Important Drawings
- Diagram showing 5 clusters, with data points distributed around them
- Diagram showing 3 clusters, with data points in a variety of shapes, some inside and some outside the clusters

### Question
We have 4 medicines as our training data objects, and each medicine has 2 attributes. Each attribute represents a coordinate of the object. Determine which medicines belong to cluster 1 and which belong to cluster 2.
- Number of clusters k = 2; plot the points
- Suppose that we use medicine A and medicine B as the first centroids.
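The four algorithm steps above can be sketched directly in Python. This is a minimal version of the standard assign/recompute loop; the function name `kmeans`, the stopping rule "assignments unchanged", and the toy points in the usage line are choices of this sketch, not from the notes.

```python
import math

def kmeans(points, centroids):
    """Repeat: assign each point to its nearest centroid, then recompute
    each centroid as the mean of its cluster, until assignments stabilize.
    Note: mutates the `centroids` list in place."""
    assignment = None
    while True:
        # Step 2: assign each record to the nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_assignment == assignment:  # Step 4: stop when nothing moves
            return centroids, assignment
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))

centers, labels = kmeans([(0, 0), (0, 1), (5, 5), (5, 6)], [(0, 0), (5, 5)])
print(centers, labels)  # [(0.0, 0.5), (5.0, 5.5)] [0, 0, 1, 1]
```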
The four medicines and their attribute values (as used in the distance computations below):

| Medicine | Attribute 1 (X) | Attribute 2 (Y) |
|----|---|---|
| A | 1 | 1 |
| B | 2 | 1 |
| C | 4 | 3 |
| D | 5 | 4 |

- Initial centroids: C1 = (1, 1), C2 = (2, 1)
- We calculate the distance between each cluster centroid and each object using the Euclidean distance:
  - D(A, C1) = √((1 − 1)² + (1 − 1)²) = 0
  - D(B, C1) = √((2 − 1)² + (1 − 1)²) = 1
  - D(C, C1) = √((4 − 1)² + (3 − 1)²) = 3.61
  - D(D, C1) = √((5 − 1)² + (4 − 1)²) = 5
  - D(A, C2) = √((1 − 2)² + (1 − 1)²) = 1
  - D(B, C2) = √((2 − 2)² + (1 − 1)²) = 0
  - D(C, C2) = √((4 − 2)² + (3 − 1)²) = 2.83
  - D(D, C2) = √((5 − 2)² + (4 − 1)²) = 4.24
- Distance matrix at iteration 0:

D(0) =
|  | A | B | C | D |
|----|---|---|---|----|
| C1 = (1, 1), group 1 | 0 | 1 | 3.61 | 5 |
| C2 = (2, 1), group 2 | 1 | 0 | 2.83 | 4.24 |

- A is assigned to group 1; B, C and D are assigned to group 2:

G(0) =
|  | A | B | C | D |
|---|---|---|---|---|
| Group 1 | 1 | 0 | 0 | 0 |
| Group 2 | 0 | 1 | 1 | 1 |

- Recalculate the means:
  - C1 = (X_A, Y_A) = (1, 1)
  - C2 = ((X_B + X_C + X_D)/3, (Y_B + Y_C + Y_D)/3) = (3.67, 2.67)
- Recalculate the distances using the new C1 and C2:

D(1) =
|  | A | B | C | D |
|----|---|---|---|----|
| C1 = (1, 1), group 1 | 0 | 1 | 3.61 | 5 |
| C2 = (3.67, 2.67), group 2 | 3.14 | 2.36 | 0.47 | 1.89 |

G(1) =
|  | A | B | C | D |
|---|---|---|---|---|
| Group 1 | 1 | 1 | 0 | 0 |
| Group 2 | 0 | 0 | 1 | 1 |

- Recalculate the means:
  - C1 = ((X_A + X_B)/2, (Y_A + Y_B)/2) = (1.5, 1)
  - C2 = ((X_C + X_D)/2, (Y_C + Y_D)/2) = (4.5, 3.5)
- Recalculate the distances using the new C1 and C2:

D(2) =
|  | A | B | C | D |
|----|---|---|---|----|
| C1 = (1.5, 1), group 1 | 0.5 | 0.5 | 3.20 | 4.61 |
| C2 = (4.5, 3.5), group 2 | 4.30 | 3.54 | 0.71 | 0.71 |

G(2) =
|  | A | B | C | D |
|---|---|---|---|---|
| Group 1 | 1 | 1 | 0 | 0 |
| Group 2 | 0 | 0 | 1 | 1 |

- Comparing the grouping of this iteration with the last one, G(1) = G(2): the objects no longer move between groups, so the algorithm terminates.
- **Final answer:**
  - Medicines A and B → Group 1
  - Medicines C and D → Group 2
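The worked example can be replayed end-to-end with a short standalone script (the dictionary layout, the tie-breaking rule "prefer C1 on equal distance", and the 10-iteration safety cap are choices of this sketch) that reproduces the final grouping and centroids:

```python
import math

# The four medicines from the exercise and their two attributes.
points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
c1, c2 = points["A"], points["B"]  # initial centroids, per the example

for _ in range(10):  # safety cap; the example converges in 3 iterations
    # Assign each medicine to the nearer centroid (ties go to C1).
    g1 = [n for n, p in points.items() if math.dist(p, c1) <= math.dist(p, c2)]
    g2 = [n for n in points if n not in g1]
    # Recompute each centroid as the mean of its group's coordinates.
    new_c1 = tuple(sum(points[n][k] for n in g1) / len(g1) for k in (0, 1))
    new_c2 = tuple(sum(points[n][k] for n in g2) / len(g2) for k in (0, 1))
    if (new_c1, new_c2) == (c1, c2):  # centroids stopped changing
        break
    c1, c2 = new_c1, new_c2

print(g1, g2)  # ['A', 'B'] ['C', 'D']
print(c1, c2)  # (1.5, 1.0) (4.5, 3.5)
```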