K-Means Clustering Algorithm
Summary
This document provides a detailed explanation of the K-Means clustering algorithm, an unsupervised machine learning technique, describing its steps, its advantages and disadvantages, and a worked numerical example. It also covers the Naïve Bayes classifier, including Bayes' theorem, worked examples, and the Gaussian, Multinomial, and Bernoulli model types.
K-Means Clustering Model

K-Means Clustering, categorized as an Unsupervised Learning technique, organizes an untagged dataset into distinct clusters. The parameter K determines the number of clusters to be generated during the procedure. For instance, when K is set to 2, two clusters will be formed; for K equal to 3, three clusters will be established; and so on.

The main idea of K-means is to divide the unlabeled dataset into K different clusters in such a way that each data point belongs to exactly one group of points with similar properties. It enables the categorization of data into distinct groups and provides a straightforward means to identify group categories within an unlabeled dataset without requiring any prior training.

This algorithm is centered around centroids, associating each cluster with a centroid. Its primary objective is to minimize the total distance between data points and the centroids of their respective clusters. The process begins with the algorithm taking an unlabeled dataset as input, dividing it into a specified number of clusters (denoted as K), and iterating until optimal clusters are identified. Note that the value of K must be predetermined in this algorithm.

The K-means clustering algorithm undertakes two key tasks:

- It iteratively determines the optimal positions for the K centroids.
- It assigns each data point to the nearest centroid, forming clusters based on proximity.

Consequently, each cluster comprises data points sharing commonalities, distinct from other clusters.

[Diagram: K-means clustering dividing a dataset into K groups]

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the steps below (a runnable sketch follows):

Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as initial centroids. (They need not come from the input dataset.)
Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step-4: Recompute the centroid of each cluster as the mean of its assigned points, and place the new centroid there.
Step-5: Repeat Step-3, i.e., reassign each data point to the new closest centroid.
Step-6: If any reassignment occurred, go to Step-4; otherwise, go to FINISH.
Step-7: The model is ready.

[Flowchart: the K-means iteration loop]
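The steps above translate almost line-for-line into code. Here is a minimal from-scratch sketch in Python with NumPy; the function name, sample points, and the empty-cluster guard are illustrative choices, not part of the source material.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step-3: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its assigned points
        # (keeping the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Steps 5-6: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Illustrative 2-D points (height, weight), as in the numerical example later on.
X = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```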
Let's understand the above steps by considering visual plots. Suppose we have two variables, M1 and M2, shown on an x-y scatter plot:

[Scatter plot of the variables M1 and M2]

- Select a number of clusters K, for instance K = 2, to divide the dataset into two distinct groups.
- Choose K random points or centroids to define the clusters. These can be taken from the dataset or selected independently; in this example, two points outside the dataset are chosen.
- Assign each data point to its closest centroid by computing the distance between points (for example, Euclidean distance). Drawing the median line between the two centroids makes the assignment visible: points to the left of the line are nearer to the blue centroid K1, and points to the right are nearer to the yellow centroid. Color them blue and yellow for clear visualization.
- To refine the clusters, choose new centroids by computing the center of gravity of each cluster's points.
- Reassign each data point to its new closest centroid, repeating the process of drawing the median line. In the example, one yellow point now falls on the left side of the line and two blue points fall on the right side, so these three points are assigned to new clusters. Since reassignment has taken place, we go back to Step-4 and find new centroids.
- Repeat: compute the centers of gravity again, redraw the median line, and reassign. Once no point lies on the wrong side of the line, the model has converged.
- As the model is ready, remove the assumed centroids; the two final clusters remain.

How to choose the value of K in K-means Clustering?

The effectiveness of the K-means clustering algorithm relies on the quality of the clusters it establishes. However, determining the optimal number of clusters is a significant challenge. Various methods exist to identify the ideal number of clusters; one of the most suitable is the Elbow Method.

The Elbow Method is one of the most popular ways to find the optimal number of clusters. It uses the WCSS value. WCSS stands for Within-Cluster Sum of Squares, which measures the total variation within the clusters. The formula for WCSS (for 3 clusters) is:

WCSS = Σ_{Pi in Cluster1} distance(Pi, C1)² + Σ_{Pi in Cluster2} distance(Pi, C2)² + Σ_{Pi in Cluster3} distance(Pi, C3)²

Each term, such as Σ_{Pi in Cluster1} distance(Pi, C1)², is the sum of the squared distances between the data points of a cluster and that cluster's centroid. The distance between data points and centroids can be measured with methods such as Euclidean distance or Manhattan distance.

The elbow method follows these steps (a sketch in code appears after the plot):

- Apply K-means clustering to the dataset for different values of K, typically ranging from 1 to 10.
- For each value of K, compute the WCSS value.
- Plot a curve of the calculated WCSS values against the number of clusters K.
- The optimal value of K lies where the curve bends sharply, resembling the elbow of an arm; this distinct bend gives the method its name.

[Plot: WCSS versus K, with the elbow marking the optimal number of clusters]
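The elbow loop can be sketched with scikit-learn, whose KMeans estimator exposes the WCSS of a fitted model as its inertia_ attribute; the randomly generated data here is purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Illustrative data: 200 random 2-D points.
X = np.random.default_rng(0).normal(size=(200, 2))

ks = range(1, 11)  # K from 1 to 10, as described above
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ is the within-cluster sum of squares

plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```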
Advantages of the K-means Clustering Algorithm:

1. Easy Implementation: K-means clustering is an iterative and relatively straightforward algorithm. It is user-friendly and can even be executed manually, as demonstrated in the numerical example.
2. Scalability: The algorithm is versatile and applicable to datasets ranging from as few as 10 records to as many as 10 million records, delivering reliable results in both scenarios.
3. Convergence: K-means is guaranteed to converge, ensuring that the algorithm produces a result on every execution.
4. Generalization: This algorithm is not confined to specific types of problems. Whether dealing with numerical data or text documents, K-means clustering can be applied to any dataset, accommodating various sizes and distributions.
5. Choice of Centroids: The algorithm allows an easy selection and assignment of centroids, providing flexibility in aligning centroids with the dataset.

Disadvantages of the K-means Clustering Algorithm:

1. Deciding the Number of Clusters: Determining the optimal number of clusters is a challenge in K-means clustering and often requires methods like the elbow method.
2. Choice of Initial Centroids: The efficiency of the algorithm depends on the proper selection of the initial centroids. A careful choice in the initial step is crucial for maximizing effectiveness.
3. Effect of Outliers: Outliers in the dataset can significantly distort the positions of centroids during the clustering process, leading to inaccuracies. Identifying and removing outliers during data cleaning can mitigate this issue.
4. Curse of Dimensionality: As the number of dimensions in the dataset increases, the effectiveness of K-means clustering diminishes, because the distances between data points converge to similar values. Advanced clustering methods such as spectral clustering, or dimensionality reduction techniques applied during preprocessing, can address this challenge.

Example Numerical – Using the K-means clustering algorithm, form two clusters for the given data.

Height  Weight
185     72
170     56
168     60
179     68
182     72
188     77
180     71
180     70
183     84
180     88
180     67
177     76

As per the question we need to form 2 clusters, so we take the first two data points of our data and assign them as the centroids of the two clusters: k1 = (185, 72) and k2 = (170, 56). Every remaining data point is then assigned to one of these clusters based on the Euclidean distance

d = √((X0 − Xc)² + (Y0 − Yc)²)

where (X0, Y0) is the data point and (Xc, Yc) is the centroid of a particular cluster. Consider the next data point, i.e., the 3rd data point (168, 60), and check its distance to the centroid of each cluster: the calculations show it is closer to k2 (cluster 2), so we assign it to k2. We then update the centroid of k2 using the old centroid value and the newly assigned data point, which gives a new k2 centroid of (169, 58); the k1 centroid remains the same, as no new data point was added to cluster k1. This procedure is repeated until all data points have been processed. [The final cluster assignments shown in the source are omitted here; the sketch below reproduces the procedure so you can check your answers.]
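The sequential procedure of this example (seed the centroids with the first two points, then assign each remaining point and update the affected centroid's mean) can be sketched as follows; the variable names are illustrative.

```python
import math

data = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77),
        (180, 71), (180, 70), (183, 84), (180, 88), (180, 67), (177, 76)]

clusters = [[data[0]], [data[1]]]            # members of k1 and k2
centroids = [list(data[0]), list(data[1])]   # initial centroids k1, k2

def dist(p, c):
    # Euclidean distance between data point p and centroid c.
    return math.hypot(p[0] - c[0], p[1] - c[1])

for point in data[2:]:
    # Assign the point to the nearer centroid...
    j = min((0, 1), key=lambda i: dist(point, centroids[i]))
    clusters[j].append(point)
    # ...and recompute that centroid as the mean of its members.
    n = len(clusters[j])
    centroids[j] = [sum(p[0] for p in clusters[j]) / n,
                    sum(p[1] for p in clusters[j]) / n]

for i, (c, members) in enumerate(zip(centroids, clusters), start=1):
    print(f"k{i}: centroid = {c}, points = {members}")
```

After the third data point (168, 60) is processed, the k2 centroid becomes (169, 58), matching the calculation in the text.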
Naïve Bayes Classifier Algorithm

Naive Bayes is a classification method grounded in Bayes' Theorem, assuming independence among predictors. Essentially, it posits that the presence of a specific feature in a class is unrelated to the presence of any other feature. This classifier is frequently employed in supervised machine learning tasks, such as text classification. It falls under the category of generative learning algorithms, which model the input distribution for a given class or category.

The underlying assumption is that, given the class, the features of the input data are conditionally independent, enabling swift and accurate predictions. In statistical terms, naive Bayes classifiers are regarded as straightforward probabilistic classifiers applying Bayes' theorem. Despite assuming independence among input features, an oversimplification in real-world scenarios, these classifiers are widely adopted due to their efficiency and strong performance across diverse applications. It is noteworthy that naive Bayes classifiers, although among the simplest Bayesian network models, can achieve high accuracy when combined with kernel density estimation: a kernel function is used to estimate the probability density function of the input data, enhancing performance in situations where the data distribution is undefined. Consequently, the naive Bayes classifier is a potent tool in machine learning, especially in tasks like text classification, spam filtering, and sentiment analysis.

For instance, consider a fruit identified as an apple based on properties like being red, round, and approximately 3 inches in diameter. Even if these features are interrelated, each contributes independently to the probability of the fruit being an apple, hence the term "Naive". Building a Naive Bayes model is straightforward and particularly advantageous for extensive datasets. Despite its simplicity, Naive Bayes often outperforms more complex classification methods.

Bayes' theorem facilitates the calculation of the posterior probability P(A|B) from P(A), P(B), and P(B|A):

P(A|B) = P(B|A) * P(A) / P(B)

where:
- P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
- P(B|A) is the likelihood: the probability of the evidence B given that hypothesis A is true.
- P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
- P(B) is the marginal probability: the probability of the evidence.

Working of the Naïve Bayes Classifier:

To comprehend the functioning of the Naïve Bayes classifier, consider the following example. Imagine we have a dataset containing information about weather conditions and a corresponding target variable labeled "Play". The objective is to determine whether one should engage in outdoor activities on a specific day based on the prevailing weather conditions. To address this, the following steps are undertaken:

1. Transform the provided dataset into frequency tables.
2. Construct a likelihood table by determining the probabilities associated with the given features.
3. Utilize Bayes' theorem to compute the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: Using the frequency and likelihood tables derived from the dataset (not reproduced here) and applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 ≈ 0.61

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 ≈ 0.41

Since P(Yes|Sunny) > P(No|Sunny), on a sunny day the player can play the game.
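A few lines of Python reproduce this arithmetic; the probabilities are taken directly from the worked example above.

```python
# Probabilities from the weather example.
p_sunny_given_yes = 3 / 10   # P(Sunny | Yes)
p_yes = 0.71                 # P(Yes)
p_sunny_given_no = 2 / 4     # P(Sunny | No)
p_no = 0.29                  # P(No)
p_sunny = 0.35               # P(Sunny)

# Bayes' theorem: P(class | Sunny) = P(Sunny | class) * P(class) / P(Sunny).
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny

print(f"P(Yes|Sunny) = {p_yes_given_sunny:.2f}")  # ~0.61
print(f"P(No|Sunny)  = {p_no_given_sunny:.2f}")   # ~0.41
print("Play" if p_yes_given_sunny > p_no_given_sunny else "Don't play")
```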
Example Numerical – Given 1200 fruits with the feature counts below (the table is reconstructed from the calculations that follow), classify a fruit X with the features {Yellow, Sweet, Long} as Mango, Banana, or Others.

Fruit    Yellow   Sweet   Long   Total
Mango    350      450     0      650
Banana   400      300     350    400
Others   50       100     50     150
Total    800      850     400    1200

Solution: By Bayes' theorem, P(A|B) = P(B|A) * P(A) / P(B).

1. Mango: P(X|Mango) = P(Yellow|Mango) * P(Sweet|Mango) * P(Long|Mango)

1.a) P(Yellow|Mango) = P(Mango|Yellow) * P(Yellow) / P(Mango)
     = ((350/800) * (800/1200)) / (650/1200) = 0.53 → (1)
1.b) P(Sweet|Mango) = P(Mango|Sweet) * P(Sweet) / P(Mango)
     = ((450/850) * (850/1200)) / (650/1200) = 0.69 → (2)
1.c) P(Long|Mango) = P(Mango|Long) * P(Long) / P(Mango)
     = ((0/400) * (400/1200)) / (650/1200) = 0 → (3)

Multiplying (1), (2), and (3): P(X|Mango) = 0.53 * 0.69 * 0 = 0

2. Banana: P(X|Banana) = P(Yellow|Banana) * P(Sweet|Banana) * P(Long|Banana)

2.a) P(Yellow|Banana) = P(Banana|Yellow) * P(Yellow) / P(Banana)
     = ((400/800) * (800/1200)) / (400/1200) = 1 → (4)
2.b) P(Sweet|Banana) = P(Banana|Sweet) * P(Sweet) / P(Banana)
     = ((300/850) * (850/1200)) / (400/1200) = 0.75 → (5)
2.c) P(Long|Banana) = P(Banana|Long) * P(Long) / P(Banana)
     = ((350/400) * (400/1200)) / (400/1200) = 0.875 → (6)

Multiplying (4), (5), and (6): P(X|Banana) = 1 * 0.75 * 0.875 = 0.6562

3. Others: P(X|Others) = P(Yellow|Others) * P(Sweet|Others) * P(Long|Others)

3.a) P(Yellow|Others) = P(Others|Yellow) * P(Yellow) / P(Others)
     = ((50/800) * (800/1200)) / (150/1200) = 0.33 → (7)
3.b) P(Sweet|Others) = P(Others|Sweet) * P(Sweet) / P(Others)
     = ((100/850) * (850/1200)) / (150/1200) = 0.67 → (8)
3.c) P(Long|Others) = P(Others|Long) * P(Long) / P(Others)
     = ((50/400) * (400/1200)) / (150/1200) = 0.33 → (9)

Multiplying (7), (8), and (9): P(X|Others) = 0.33 * 0.67 * 0.33 ≈ 0.074

So, from P(X|Mango) = 0, P(X|Banana) = 0.6562, and P(X|Others) ≈ 0.074, we conclude that the fruit {Yellow, Sweet, Long} is a Banana. (Strictly, one compares P(X|class) * P(class); multiplying by the priors 650/1200, 400/1200, and 150/1200 does not change the ranking here.)

Advantages of the Naïve Bayes Classifier:

- Naïve Bayes is a fast and easy ML algorithm for predicting the class of a dataset.
- It can be used for binary as well as multi-class classification.
- It performs well in multi-class predictions compared to the other algorithms.
- It is the most popular choice for text classification problems.

Disadvantages of the Naïve Bayes Classifier:

- Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationships between features.

Types of Naïve Bayes Model:

There are three types of Naive Bayes model, given below (a usage sketch follows the list):

- Gaussian: The Gaussian model assumes that features follow a normal distribution. This means that if the predictors take continuous values instead of discrete ones, the model assumes these values are sampled from a Gaussian distribution.
- Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., determining which category a particular document belongs to, such as Sports, Politics, or Education. The classifier uses the frequencies of words as the predictors.
- Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present in a document or not. This model is also popular for document classification tasks.
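As an illustration of the three model types, scikit-learn provides GaussianNB, MultinomialNB, and BernoulliNB. The toy arrays below are invented for demonstration and are not taken from the examples above.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([1, 0, 1, 0])  # two classes, four training samples

# Gaussian: continuous features assumed normally distributed per class.
X_cont = np.array([[180.0, 75.0], [170.0, 60.0], [185.0, 80.0], [165.0, 55.0]])
print(GaussianNB().fit(X_cont, y).predict([[178.0, 70.0]]))

# Multinomial: word-count features, as in document classification.
X_counts = np.array([[3, 0, 1], [0, 2, 4], [2, 1, 0], [0, 3, 5]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 2]]))

# Bernoulli: binary word presence/absence features.
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 1]]))
```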