Data Warehousing and Data Mining Clustering & K-Means PDF
Document Details
International University for Science and Technology
Dr. Mohammed Hayyan Alsibai
Summary
This document is a lecture presentation on data warehousing and data mining, specifically focusing on clustering methods. It describes several clustering algorithms and distance measures, and walks through worked examples of the K-means and hierarchical agglomerative clustering processes.
Full Transcript
Dr. Mohammed Hayyan Alsibai, Clustering & K-Means. All slides are distributed under the Creative Commons Attribution-ShareAlike License.

Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often according to some defined distance measure. There are two broad families of clustering algorithms:

Hierarchical algorithms: these find successive clusters using previously established clusters.
◦ Agglomerative ("bottom-up"): agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters.
◦ Divisive ("top-down"): divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

Partitional clustering: partitional algorithms determine all clusters at once. They include:
◦ K-means and derivatives
◦ Fuzzy c-means clustering
◦ QT clustering algorithm

The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters. Common measures include (a short code sketch follows this list):
1. The Euclidean distance (also called 2-norm distance) is given by $d(p,q)=\sqrt{\sum_{i=1}^{n}(p_i-q_i)^2}$.
2. The Manhattan distance (also called taxicab norm or 1-norm) is given by $d(p,q)=\sum_{i=1}^{n}\lvert p_i-q_i\rvert$.
3. The maximum norm is given by $d(p,q)=\max_{1\le i\le n}\lvert p_i-q_i\rvert$.
4. The Mahalanobis distance corrects the data for different scales and correlations in the variables.
5. Inner product space: the angle between two vectors can be used as a distance measure when clustering high-dimensional data.
6. The Hamming distance (sometimes called edit distance) measures the minimum number of substitutions required to change one member into another.
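As a small aside (not part of the original slides), the following Python snippet computes several of the distance measures above for two arbitrary example vectors; the values are chosen only for illustration.

```python
import numpy as np

# Two arbitrary example vectors
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))  # 2-norm distance
manhattan = np.sum(np.abs(x - y))          # 1-norm / taxicab distance
maximum   = np.max(np.abs(x - y))          # maximum norm
# Angle between the two vectors (inner-product-based measure)
angle = np.arccos(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
# Hamming distance between two equal-length strings
hamming = sum(a != b for a, b in zip("karolin", "kathrin"))
# (The Mahalanobis distance would additionally need the covariance matrix of the data.)

print(euclidean, manhattan, maximum, angle, hamming)
```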
The k-means algorithm is an algorithm to cluster n objects based on attributes into k partitions, where k < n. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data, and it assumes that the object attributes form a vector space.

More precisely, it is an algorithm for partitioning (or clustering) N data points into K disjoint subsets S_j so as to minimize the sum-of-squares criterion
$J = \sum_{j=1}^{K} \sum_{n \in S_j} \lVert x_n - \mu_j \rVert^2$,
where $x_n$ is a vector representing the nth data point and $\mu_j$ is the geometric centroid of the data points in $S_j$.

Simply speaking, k-means clustering is an algorithm to classify or group objects based on attributes/features into K groups, where K is a positive integer. The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroids. The algorithm proceeds as follows (a code sketch is given after the steps):

Step 1: Begin with a decision on the value of k = number of clusters.
Step 2: Put any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
1. Take the first k training samples as single-element clusters.
2. Assign each of the remaining (N - k) training samples to the cluster with the nearest centroid. After each assignment, re-compute the centroid of the gaining cluster.
Step 3: Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch it to that cluster and update the centroids of the cluster gaining the new sample and the cluster losing it.
Step 4: Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
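As a rough illustration (not from the slides), the sketch below implements the procedure with NumPy. It is the batch variant, in which all samples are reassigned before the centroids are recomputed, whereas Step 3 above updates the centroids immediately after each individual reassignment; the function and variable names are illustrative.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster the rows of `points` (n x d) into k groups with batch k-means."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k of the data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to the cluster with the nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when a pass causes no new assignments
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update each centroid to the mean (geometric centroid) of its members
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```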
A first example with seven objects (numbered 1 to 7):
Step 1 (Initialization): We randomly choose two centroids (k = 2) for the two clusters. Let the two centroids be m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
Step 2: Using these two centroids, we compute the Euclidean distance of each object to each centroid, as shown in the table (the calculation is written out for object 2 as an example).
Step 3: We thus obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7} and recompute their centroids. Repeating the distance computation with these centroids, the clusters change to {1, 2} and {3, 4, 5, 6, 7}, and the next centroids are m1 = (1.25, 1.5) and m2 = (3.9, 5.1).
Step 4: The clusters obtained are again {1, 2} and {3, 4, 5, 6, 7}, so there is no change in the clusters. The algorithm therefore halts, and the final result consists of the two clusters {1, 2} and {3, 4, 5, 6, 7}.

A second example: We have 4 medicines as our training data points, and each medicine has 2 attributes; each attribute represents a coordinate of the object. We have to determine which medicines belong to cluster 1 and which belong to the other cluster.

Object       Attribute 1 (X): weight index   Attribute 2 (Y): pH
Medicine A   1                               1
Medicine B   2                               1
Medicine C   4                               3
Medicine D   5                               4

Step 1 (initial value of the centroids): Suppose we use medicine A and medicine B as the first centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1, 1) and c2 = (2, 1).

Step 2 (objects-centroids distances): We calculate the distance from each cluster centroid to each object. Using the Euclidean distance, we obtain the distance matrix at iteration 0. Each column in the distance matrix corresponds to one object; the first row holds the distance of each object to the first centroid, and the second row the distance to the second centroid. For example, the distance from medicine C = (4, 3) to the first centroid is $\sqrt{(4-1)^2+(3-1)^2}=\sqrt{13}\approx 3.61$, and its distance to the second centroid is $\sqrt{(4-2)^2+(3-1)^2}=\sqrt{8}\approx 2.83$.

Step 3 (objects clustering): We assign each object based on the minimum distance. Medicine A is assigned to group 1, medicine B to group 2, medicine C to group 2 and medicine D to group 2. The element of the Group matrix is 1 if and only if the object is assigned to that group.

Iteration 1, objects-centroids distances: The next step is to compute the distance of all objects to the new centroids (each group's centroid is first recomputed from its current members). Similarly to step 2, we obtain the distance matrix for iteration 1.

Iteration 1, objects clustering: Based on the new distance matrix, we move medicine B to group 1, while all the other objects remain where they are. The Group matrix is updated accordingly.

Iteration 2, determine centroids: We again calculate the new centroid coordinates based on the clustering of the previous iteration. Group 1 and group 2 now both have two members, so the new centroids are c1 = (1.5, 1) and c2 = (4.5, 3.5).

Repeating step 2 with these centroids gives the distance matrix at iteration 2. Again, we assign each object based on the minimum distance. Comparing the grouping of the last iteration with this iteration reveals that no object moves to a different group any more. Thus the computation of the k-means clustering has reached stability and no more iterations are needed.

We get the final grouping as the result (a code sketch reproducing this example follows the table):

Object       Feature 1 (X): weight index   Feature 2 (Y): pH   Group (result)
Medicine A   1                             1                   1
Medicine B   2                             1                   1
Medicine C   4                             3                   2
Medicine D   5                             4                   2
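To make the arithmetic of the medicine example concrete, here is a small self-contained NumPy sketch (not part of the original slides): starting from the centroids c1 = (1, 1) and c2 = (2, 1), it recomputes distances, groups and centroids until the grouping no longer changes.

```python
import numpy as np

# The four medicines from the example: (weight index, pH)
X = np.array([[1.0, 1.0],   # Medicine A
              [2.0, 1.0],   # Medicine B
              [4.0, 3.0],   # Medicine C
              [5.0, 4.0]])  # Medicine D

# Initial centroids: medicine A and medicine B
centroids = np.array([[1.0, 1.0], [2.0, 1.0]])

for iteration in range(10):
    # Distance matrix: row i holds the distances of every object to centroid i
    D = np.linalg.norm(X[None, :, :] - centroids[:, None, :], axis=2)
    groups = D.argmin(axis=0)  # assign each object to the nearest centroid
    new_centroids = np.array([X[groups == g].mean(axis=0) for g in (0, 1)])
    print(f"iteration {iteration}: groups = {groups + 1}\n{np.round(D, 2)}")
    if np.allclose(new_centroids, centroids):
        break  # no change, the clustering is stable
    centroids = new_centroids

# Final grouping: medicines A and B in group 1, medicines C and D in group 2
```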
Weaknesses of k-means clustering:
1. When the number of data points is small, the initial grouping determines the clusters significantly.
2. The number of clusters, K, must be determined beforehand. A related disadvantage is that the algorithm does not yield the same result on every run, since the resulting clusters depend on the initial random assignments.
3. We never know the real clusters: using the same data, if it is input in a different order it may produce different clusters when the number of data points is small.
4. It is sensitive to the initial conditions. Different initial conditions may produce different clustering results, and the algorithm may be trapped in a local optimum.

On the other hand, k-means is relatively efficient and fast: it computes the result in O(tkn), where n is the number of objects or points, k is the number of clusters and t is the number of iterations.

k-means clustering can be applied in machine learning or data mining. It is used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization or Image Segmentation), and also for choosing color palettes on old-fashioned graphical display devices and for image quantization.

Hierarchical agglomerative clustering (HAC) uses the bottom-up approach. The HAC algorithm starts with every single data point as a single cluster. The most similar clusters are then successively merged until all clusters have been merged into one cluster, and the result is represented in a tree structure called a dendrogram.

Example: suppose we have the following data for five persons P1 to P5.
Step 1: Assign each object to a cluster, so that for N objects we have N clusters, each containing just one object.
Step 2: Calculate the Euclidean distance between every person and each other person; this gives the distance matrix at iteration 0.
Step 3: Find the most similar pair of clusters and merge them into a single cluster, so that we now have one cluster less.
Step 4: Compute the distances between the new cluster and each of the old clusters.
◦ Single-link: in single-link (single-linkage) clustering, the similarity of two clusters is the similarity of their most similar members. Applying this rule, we get the updated distance matrix.
Step 5: Repeat steps 3 and 4 until all items are clustered into a single cluster. After obtaining one cluster containing all elements, we can draw a horizontal threshold through the dendrogram to determine the number of clusters; when the threshold is equal to 52 we get three clusters: (P4, (P2, P5), (P1, P3)). A short code sketch of this single-link procedure follows.
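The slide's actual data table for P1 to P5 is not reproduced in this transcript, so the sketch below (not from the slides) uses made-up one-dimensional values, chosen only so that cutting the dendrogram at 52 also yields three clusters with the same structure as the slide's result (P4, (P2, P5), (P1, P3)).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Made-up 1-D values for five persons P1..P5 (illustrative only)
values = np.array([[30.0], [100.0], [35.0], [200.0], [110.0]])
names = ["P1", "P2", "P3", "P4", "P5"]

# Single-link agglomerative clustering: repeatedly merge the two clusters
# whose most similar members are closest (Steps 3-4 above)
Z = linkage(values, method="single", metric="euclidean")

# Cutting the dendrogram with a horizontal threshold yields the flat clusters
labels = fcluster(Z, t=52, criterion="distance")
print(dict(zip(names, labels)))  # P1 with P3, P2 with P5, P4 alone

# dendrogram(Z, labels=names)    # draws the tree when matplotlib is available
```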
The difference between k-means and hierarchical clustering is that in k-means clustering the number of clusters is pre-defined and denoted by K, whereas in hierarchical clustering the number of clusters can range from one up to the number of data observations, depending on where the dendrogram is cut.

Another difference between the two techniques is that k-means clustering is more effective on much larger datasets than hierarchical clustering, whereas hierarchical clustering works better on small datasets. K-means is also more effective than hierarchical clustering on datasets whose clusters have a spheroidal shape.

Thank you for your attention. Any questions?
Documents related to this class are on: http://bit.ly/DSM_IUST