K-Means Clustering PDF

Document Details


Dr Shaimaa Elmorsy

Tags

k-means clustering, machine learning, clustering algorithms, data analysis

Summary

This document provides an overview of k-means clustering, a popular unsupervised machine learning algorithm. It explains the concept of clustering, the necessary components, distance measures, cluster evaluation, different clustering techniques and algorithms such as hierarchical, partitional and Bayesian clustering. The document includes practical examples and visual representations of k-means clustering.

Full Transcript


Clustering
Dr Shaimaa Elmorsy

What is clustering?
• The organization of unlabeled data into similarity groups called clusters.
• A cluster is a collection of data items which are "similar" to each other and "dissimilar" to data items in other clusters.

What do we need for clustering?
• Distance measures.
• Cluster evaluation (a hard problem):
  – Intra-cluster cohesion (compactness): cohesion measures how near the data points in a cluster are to the cluster centroid. The sum of squared error (SSE) is a commonly used measure.
  – Inter-cluster separation (isolation): separation means that different cluster centroids should be far away from one another.
• In most applications, expert judgments are still the key.

How many clusters?

Clustering techniques
• Main families: hierarchical, partitional, and Bayesian clustering.

K-means clustering
• An unsupervised machine learning algorithm.
• It groups a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to objects in other groups (other clusters).

Example 1
• Given the data points (1,1), (2,1), (4,3), and (5,4), use k-means clustering to partition the data into two clusters.
• The goal is to assign each data point to one of two clusters, so the number of clusters is k = 2. C1 and C2 represent these two clusters.
• Set the initial centroids to C1 = (1,1) and C2 = (2,1).

Iteration 1: compute the Euclidean distance from each point to each centroid, and assign the point to the nearer centroid.

Point  Distance to C1 (1,1)             Distance to C2 (2,1)             Cluster
(1,1)  √((1−1)² + (1−1)²) = 0           √((2−1)² + (1−1)²) = 1           1
(2,1)  √((1−2)² + (1−1)²) = 1           √((2−2)² + (1−1)²) = 0           2
(4,3)  √((1−4)² + (1−3)²) = √13 ≈ 3.61  √((2−4)² + (1−3)²) = 2√2 ≈ 2.83  2
(5,4)  √((1−5)² + (1−4)²) = 5           √((2−5)² + (1−4)²) = 3√2 ≈ 4.24  2

Find the new centroid of each cluster by taking the mean of the points assigned to it.
Assigning each point to the nearer center (by Euclidean distance):
• Point 1 (1,1): C1
• Point 2 (2,1): C2
• Point 3 (4,3): C2
• Point 4 (5,4): C2

Now we calculate the new centroids:
• C1 = (1, 1) (the mean of the single point in cluster 1)
• C2 = 1/3 ((2,1) + (4,3) + (5,4)) = ((2+4+5)/3, (1+3+4)/3) = (3.67, 2.67)

Recompute the distances against the new centroids:

Point  Distance to C1 (1,1)             Distance to C2 (3.67, 2.67)      Cluster
(1,1)  √((1−1)² + (1−1)²) = 0           √((3.67−1)² + (2.67−1)²) ≈ 3.15  1
(2,1)  √((1−2)² + (1−1)²) = 1           √((3.67−2)² + (2.67−1)²) ≈ 2.36  1
(4,3)  √((1−4)² + (1−3)²) = √13 ≈ 3.61  √((3.67−4)² + (2.67−3)²) ≈ 0.47  2
(5,4)  √((1−5)² + (1−4)²) = 5           √((3.67−5)² + (2.67−4)²) ≈ 1.89  2

The updated centroids (shown in red in the slide figure) replace the old ones. We apply another iteration to find a better centroid to represent each cluster.

Iteration 2:
• Point 1 (1,1): C1
• Point 2 (2,1): C1
• Point 3 (4,3): C2
• Point 4 (5,4): C2

Now we calculate the new centroids:
• C1 = 1/2 ((1,1) + (2,1)) = ((1+2)/2, (1+1)/2) = (1.5, 1)
• C2 = 1/2 ((4,3) + (5,4)) = ((4+5)/2, (3+4)/2) = (4.5, 3.5)

Now we examine each point again against the centroids, using Euclidean distance:

Point  Distance to C1 (1.5, 1)           Distance to C2 (4.5, 3.5)        Cluster
(1,1)  √((1.5−1)² + (1−1)²) = 0.5        √((4.5−1)² + (3.5−1)²) ≈ 4.30    1
(2,1)  √((1.5−2)² + (1−1)²) = 0.5        √((4.5−2)² + (3.5−1)²) ≈ 3.54    1
(4,3)  √((1.5−4)² + (1−3)²) ≈ 3.20       √((4.5−4)² + (3.5−3)²) ≈ 0.71    2
(5,4)  √((1.5−5)² + (1−4)²) ≈ 4.61       √((4.5−5)² + (3.5−4)²) ≈ 0.71    2

The assignments no longer change, so the centroids C1 = (1.5, 1) and C2 = (4.5, 3.5) are final.

Why use k-means?
Strengths:
• Simple: easy to understand and to implement.
• Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Since both k and t are small, k-means is considered a linear algorithm.
K-means is the most popular clustering algorithm.
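The iterations above can be sketched in a few lines of pure Python. This is a minimal illustration, not the lecture's own code: the function name `kmeans` and the convergence check are assumptions, but the data points and initial centroids are exactly those of Example 1, and the run reproduces the final centroids (1.5, 1) and (4.5, 3.5).

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def kmeans(points, centroids, max_iters=100):
    """Alternate the two k-means steps until the centroids stop moving."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters

# Example 1 from the slides: k = 2, initial centroids C1 = (1,1), C2 = (2,1).
points = [(1, 1), (2, 1), (4, 3), (5, 4)]
centroids, clusters = kmeans(points, [(1, 1), (2, 1)])
print(centroids)  # [(1.5, 1.0), (4.5, 3.5)], matching iteration 2 above
```

Note that convergence here means the centroids are reproduced exactly, which is why the loop stops after the same two iterations worked through in the tables.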
Weaknesses of k-means
• The user needs to specify k.
• The algorithm is sensitive to outliers:
  – Outliers are data points that are very far away from other data points.
  – Outliers could be errors in the data recording, or special data points with very different values.
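The outlier sensitivity follows directly from the update step: a centroid is an arithmetic mean, so a single far-away point drags it away from the bulk of the cluster. A small sketch, using illustrative data that are not from the slides:

```python
def centroid(points):
    """Mean of a set of 2-D points: the k-means update step for one cluster."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

cluster = [(1, 1), (2, 1), (1, 2), (2, 2)]  # a tight cluster near (1.5, 1.5)
outlier = (100, 100)                         # e.g. a data-recording error

print(centroid(cluster))              # (1.5, 1.5)
print(centroid(cluster + [outlier]))  # (21.2, 21.2): one point moved the centroid
```

A single bad point shifted the centroid by roughly 28 units, far outside the region the other four points occupy, which is why outlier removal or a more robust variant (e.g. using medoids) is often applied before or instead of plain k-means.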
