Lesson 7: Detecting Patterns with Unsupervised Learning

Document Details


Kianna Dominique D. Alvarez

Tags

unsupervised learning, machine learning, k-means clustering, data analysis

Summary

This document discusses unsupervised learning techniques, focusing on Lesson 7: Detecting Patterns with Unsupervised Learning. It details K-Means clustering and Gaussian Mixture Models, and their applications. The document is a presentation or lecture outline.

Full Transcript


Lesson 7: Detecting Patterns with Unsupervised Learning
Prepared by: Kianna Dominique D. Alvarez

Introduction to Unsupervised Learning
What is Unsupervised Learning?
Unsupervised learning is a type of machine learning where models are trained on data without labeled responses. The goal is to discover hidden patterns or intrinsic structures in the input data.

Introduction to Unsupervised Learning (cont.)
Supervised vs. Unsupervised Learning
Aspect | Supervised Learning | Unsupervised Learning
Data | Labeled (input-output pairs) | Unlabeled (only input data)
Objective | Predict or classify outputs | Find patterns or clusters
Examples | Regression, Classification | Clustering, Association

Introduction to Unsupervised Learning (cont.)
Real-World Applications
Customer Segmentation: Grouping customers based on similar behaviors or preferences.
Anomaly Detection: Identifying unusual patterns that deviate from the norm (e.g., fraud detection).
Dimensionality Reduction: Simplifying data while preserving significant structure (e.g., PCA).

Clustering Data with the K-Means Algorithm
What is Clustering?
Clustering: Grouping data points based on their similarity.
K-Means: A popular algorithm used to find K distinct clusters in the data.

Clustering Data with the K-Means Algorithm (cont.)
Types of Clustering
Clustering is a type of unsupervised learning wherein data points are grouped into different sets based on their degree of similarity. The various types of clustering are:
Hierarchical clustering
Partitioning clustering
Hierarchical clustering is further subdivided into:
Agglomerative clustering
Divisive clustering
Partitioning clustering is further subdivided into:
K-Means clustering
Fuzzy C-Means clustering

Clustering Data with the K-Means Algorithm (cont.)
Hierarchical Clustering
Hierarchical clustering organizes the data into a tree-like structure (a dendrogram).

Clustering Data with the K-Means Algorithm (cont.)
Hierarchical Clustering (cont.)
Agglomerative clustering takes a bottom-up approach: we begin with each element as a separate cluster and merge them into successively larger clusters.

Clustering Data with the K-Means Algorithm (cont.)
Hierarchical Clustering (cont.)
Divisive clustering takes a top-down approach: we begin with the whole set and proceed to divide it into successively smaller clusters.

Clustering Data with the K-Means Algorithm (cont.)
Partitioning Clustering
Partitioning clustering is split into two subtypes: K-Means clustering and Fuzzy C-Means clustering. In K-Means clustering, the objects are divided into the number of clusters specified by 'K.' So if we say K = 2, the objects are divided into two clusters, c1 and c2.

Clustering Data with the K-Means Algorithm (cont.)
Here, the features or characteristics are compared, and all objects having similar characteristics are clustered together. Fuzzy C-Means is very similar to K-Means in the sense that it clusters objects with similar characteristics together. The difference is that in K-Means clustering a single object cannot belong to two different clusters, whereas in Fuzzy C-Means an object can belong to more than one cluster.
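Before moving on to K-Means, here is a minimal sketch of the agglomerative (bottom-up) approach described above, using SciPy's hierarchical clustering routines; the handful of 2-D points is made up purely for illustration.

# A minimal sketch of agglomerative (bottom-up) hierarchical clustering with SciPy;
# the toy 2-D points below are made up for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six toy points forming two loose groups
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

# Build the merge tree: every point starts as its own cluster,
# and the two closest clusters are merged at each step (Ward linkage)
Z = linkage(X, method='ward')

# Cut the tree so that exactly two clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g. [1 1 1 2 2 2]

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree-like structure mentioned above.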
Clustering Data with the K-Means Algorithm (cont.)
What is K-Means Clustering?
K-means clustering is a way of grouping data based on how similar or close the data points are to each other. Imagine you have a bunch of points, and you want to group them into clusters. The algorithm works by first randomly picking some central points (called centroids) and then assigning every data point to the nearest centroid. Once that's done, it recalculates the centroids based on the new groupings and repeats the process until the clusters make sense. It's a fast and efficient method, but it works best when the clusters are distinct and not too mixed up. One challenge, though, is figuring out the right number of clusters (K) beforehand. Plus, if there's a lot of noise or overlap in the data, K-Means might not perform as well.

Clustering Data with the K-Means Algorithm (cont.)
Objective of K-Means Clustering
K-Means clustering primarily aims to organize similar data points into distinct groups. Its key objectives are:
Grouping Similar Data Points
Minimizing Within-Cluster Distance
Maximizing Between-Cluster Distance

Clustering Data with the K-Means Algorithm (cont.)
Properties of K-Means Clustering
Now, let's look at the key properties that make the K-means clustering algorithm effective in creating meaningful groups.
Similarity Within a Cluster
One of the main things K-Means aims for is that all the data points in a cluster should be similar to each other. Imagine a bank that wants to group its customers based on income and debt. If customers within the same cluster have vastly different financial situations, then a one-size-fits-all approach to offers might not work. For example, a customer with high income and high debt might have different needs compared to someone with low income and low debt. By making sure the customers in each cluster are similar, the bank can create more tailored and effective strategies.

Clustering Data with the K-Means Algorithm (cont.)
Properties of K-Means Clustering (cont.)
Differences Between Clusters
Another important aspect is that the clusters themselves should be as distinct from each other as possible. Going back to our bank example, if one cluster consists of high-income, high-debt customers and another cluster has high-income, low-debt customers, the differences between the clusters are clear. This separation helps the bank create different strategies for each group. If the clusters are too similar, it can be challenging to treat them as separate segments, which can make targeted marketing less effective.

Clustering Data with the K-Means Algorithm (cont.)
Applications of K-Means Clustering
Here are some interesting ways K-means clustering is put to work across different fields:
Distance Measures
K-Means for Geyser Eruptions
Customer Segmentation
Document Clustering
Image Segmentation
Recommendation Engines
K-Means for Image Compression

Clustering Data with the K-Means Algorithm (cont.)
Advantages of K-Means
Simple and easy to implement: The k-means algorithm is easy to understand and implement, making it a popular choice for clustering tasks.
Fast and efficient: K-means is computationally efficient and can handle large datasets with high dimensionality.
Scalability: K-means can handle large datasets with many data points and can be easily scaled to handle even larger datasets.
Flexibility: K-means can be easily adapted to different applications and can be used with varying metrics of distance and initialization methods.
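As a quick illustration of how little code the basic pick-assign-recompute workflow takes, here is a minimal sketch using scikit-learn's KMeans; the toy points are made up for illustration, and a full worked example on a customer dataset appears later in this lesson.

# A minimal sketch of basic K-Means usage with scikit-learn; toy points made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42)
kmeans.fit(X)

print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # final centroid positions
print(kmeans.inertia_)          # within-cluster sum of squared distances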
Clustering Data with the K-Means Algorithm (cont.)
Disadvantages of K-Means
Sensitivity to initial centroids: K-means is sensitive to the initial selection of centroids and can converge to a suboptimal solution.
Requires specifying the number of clusters: The number of clusters k needs to be specified before running the algorithm, which can be challenging in some applications.
Sensitive to outliers: K-means is sensitive to outliers, which can have a significant impact on the resulting clusters.

Clustering Data with the K-Means Algorithm (cont.)
Different Evaluation Metrics for Clustering
When it comes to evaluating how well your clustering algorithm is working, a few key metrics can help you get a clearer picture of your results. Here's a rundown of the most useful ones, with a code sketch after this section.

Silhouette Analysis
Silhouette analysis is like a report card for your clusters. It measures how well each data point fits into its own cluster compared to other clusters. A high silhouette score means that your points fit snugly into their clusters and are quite distinct from points in other clusters. A score close to 1 is a sign that your clusters are well-defined and separated. Conversely, a score close to 0 indicates some overlap, and a negative score suggests that the clustering might need some work.

Clustering Data with the K-Means Algorithm (cont.)
Different Evaluation Metrics for Clustering (cont.)
Inertia
Inertia is a gauge of how tightly packed your data points are within each cluster. It calculates the sum of squared distances from each point to the cluster's center (or centroid). Think of it as measuring how snugly the points are huddled together. Lower inertia means that points are closer to the centroid and to each other, which generally indicates that your clusters are well-formed. For most numeric data, you'll use Euclidean distance, but if your data includes categorical features, Manhattan distance might be better.

Clustering Data with the K-Means Algorithm (cont.)
Different Evaluation Metrics for Clustering (cont.)
Dunn Index
The Dunn Index takes a broader view by considering both the distance within and between clusters. It is calculated as the ratio of the smallest distance between any two clusters (inter-cluster distance) to the largest distance within a cluster (intra-cluster distance). A higher Dunn Index means that clusters are not only tight and cohesive internally but also well-separated from each other. In other words, you want your clusters to be as far apart as possible while being as compact as possible.
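Here is a minimal sketch of these three metrics in code, assuming scikit-learn and SciPy are available; the Dunn index helper is hand-written since scikit-learn does not ship one, and the toy data comes from make_blobs.

# A minimal sketch of silhouette score, inertia, and a hand-written Dunn index;
# the synthetic data is for illustration only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from scipy.spatial.distance import cdist

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42).fit(X)
labels = kmeans.labels_

# Silhouette: close to 1 = well-separated clusters, near 0 = overlap, negative = poor clustering
print("Silhouette:", silhouette_score(X, labels))

# Inertia: sum of squared distances of points to their own centroid (lower = tighter clusters)
print("Inertia:", kmeans.inertia_)

def dunn_index(X, labels):
    """Smallest between-cluster distance divided by largest within-cluster distance."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    # largest distance between two points in the same cluster (cluster "diameter")
    max_intra = max(cdist(c, c).max() for c in clusters)
    # smallest distance between points belonging to two different clusters
    min_inter = min(cdist(clusters[i], clusters[j]).min()
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters)))
    return min_inter / max_intra

print("Dunn index:", dunn_index(X, labels))  # higher = compact and well-separated clusters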
Clustering Data with the K-Means Algorithm (cont.)
How Does K-Means Clustering Work?
The k-means workflow follows a simple loop: pick initial centroids, assign points to the nearest centroid, recompute the centroids, and repeat until nothing changes.

Clustering Data with the K-Means Algorithm (cont.)
We have a data set for a grocery shop, and we want to find out how many clusters it should be spread across. To find the optimum number of clusters, we break it down into the following steps:
Step 1: The Elbow method is the best way to find the number of clusters. The elbow method consists of running K-Means clustering on the dataset for a range of K values. Next, we use the within-cluster sum of squares as a measure to find the optimum number of clusters that can be formed for a given data set. The within-cluster sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid. The WSS is measured for each value of K, and the value of K at which the decrease in WSS levels off (the elbow) is taken as the optimum value.

Clustering Data with the K-Means Algorithm (cont.)
Now, we draw a curve between WSS and the number of clusters, with WSS on the y-axis and the number of clusters on the x-axis. You can see that there is only a very gradual change in the value of WSS as the K value increases from 2. So, you can take the elbow point value as the optimal value of K; it should be either two, three, or at most four. Beyond that, increasing the number of clusters does not dramatically change the value of WSS; it stabilizes.

Clustering Data with the K-Means Algorithm (cont.)
Step 2: Let's assume these are our delivery points. We can randomly initialize two points called the cluster centroids. Here, C1 and C2 are the centroids assigned randomly.

Clustering Data with the K-Means Algorithm (cont.)
Step 3: Now the distance of each location from each centroid is measured, and every data point is assigned to the centroid that is closest to it. This is how the initial grouping is done.

Clustering Data with the K-Means Algorithm (cont.)
Step 4: Compute the actual centroid of the data points in the first group.
Step 5: Reposition the random centroid to the actual centroid.

Clustering Data with the K-Means Algorithm (cont.)
Step 6: Compute the actual centroid of the data points in the second group.
Step 7: Reposition the random centroid to the actual centroid.

Clustering Data with the K-Means Algorithm (cont.)
Step 8: Once the clusters become static, the k-means algorithm is said to have converged. The final clusters with centroids c1 and c2 are then fixed.

Clustering Data with the K-Means Algorithm (cont.)
K-Means Clustering Algorithm
Let's say we have x1, x2, x3, ... as our input data points, and we want to split this data into K clusters.
Step 1: We randomly pick K centroids and name them c1, c2, ..., ck, so that C = {c1, c2, ..., ck}, where C is the set of all centroids.

Clustering Data with the K-Means Algorithm (cont.)
K-Means Clustering Algorithm (cont.)
Step 2: We assign each data point to its nearest centroid by calculating the Euclidean distance: each point x is assigned to argmin over ci in C of dist(ci, x)^2, where dist() is the Euclidean distance. Here, we calculate x1's distance from each c value, find the lowest value, and assign x1 to that particular centroid. Similarly, we find the minimum distance for x2, x3, and so on.

Clustering Data with the K-Means Algorithm (cont.)
K-Means Clustering Algorithm (cont.)
Step 3: We identify the actual centroid by taking the average of all the points assigned to that cluster: ci = (1 / |Si|) * sum of the points xj in Si, where Si is the set of all points assigned to the ith cluster. It means the original point, which we thought was the centroid, shifts to the new position, which is the actual centroid for each of these groups.

Clustering Data with the K-Means Algorithm (cont.)
K-Means Clustering Algorithm (cont.)
Step 4: Keep repeating step 2 and step 3 until convergence is achieved.
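The four steps above can be written out directly. Here is a minimal from-scratch sketch in NumPy, with made-up toy data; it is meant to mirror the formulas, not to replace a library implementation.

# A minimal from-scratch sketch of Steps 1-4 above in NumPy; toy data made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
# two made-up blobs of 2-D points
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
               rng.normal(loc=6.0, scale=1.0, size=(100, 2))])

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K data points at random as the initial centroids C = {c1, ..., ck}
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it
        # (a production version would also handle clusters that end up empty)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: repeat steps 2 and 3 until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

centroids, labels = kmeans(X, k=2)
print(centroids)  # should land near (0, 0) and (6, 6) for this toy data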
Clustering Data with the K-Means Algorithm (cont.)
K-Means Clustering Algorithm (cont.)
How to Choose the Value of K (the Number of Clusters) in K-Means Clustering?
Although many choices are available for choosing the optimal number of clusters, the Elbow Method is one of the most popular and appropriate methods. The Elbow Method uses the idea of the WCSS value, which is short for Within-Cluster Sum of Squares. WCSS measures the total variation within a cluster. This is the formula used to calculate the value of WCSS (for three clusters), provided courtesy of Javatpoint:
WCSS = sum over Pi in Cluster1 of distance(Pi, C1)^2 + sum over Pi in Cluster2 of distance(Pi, C2)^2 + sum over Pi in Cluster3 of distance(Pi, C3)^2

Python Implementation of the K-Means Clustering Algorithm
1. Data Pre-processing
Import the necessary libraries. Load the dataset and select the features for clustering.

# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')

# Extracting the independent variables (Annual Income and Spending Score)
x = dataset.iloc[:, [3, 4]].values

Python Implementation of the K-Means Clustering Algorithm (cont.)
2. Finding the Optimal Number of Clusters Using the Elbow Method
The Elbow method is used to find the optimal number of clusters by plotting the within-cluster sum of squares (WCSS) for different values of k.

# Finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list = []  # Initializing the list for the values of WCSS

# Using a for loop for iterations from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)

# Plotting the Elbow Method graph
plt.plot(range(1, 11), wcss_list)
plt.title('The Elbow Method Graph')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.show()

Python Implementation of the K-Means Clustering Algorithm (cont.)
3. Training the K-Means Algorithm on the Dataset
Choose the number of clusters (e.g., 5) based on the elbow method. Fit the K-Means model to the dataset. The algorithm will assign each data point to one of the 5 clusters.

# Training the K-means model on the dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)

Python Implementation of the K-Means Clustering Algorithm (cont.)
4. Visualizing the Clusters
Plot the clusters on a scatter plot, with each cluster represented by a different color. Plot the centroids of the clusters to show the central point of each cluster. The graph will show how customers are grouped based on their annual income and spending score.

# Visualizing the clusters
plt.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='blue', label='Cluster 1')
plt.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='green', label='Cluster 2')
plt.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=100, c='red', label='Cluster 3')
plt.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=100, c='cyan', label='Cluster 4')
plt.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=100, c='magenta', label='Cluster 5')

# Plotting the centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroid')
plt.title('Clusters of Customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

Gaussian Mixture Models (GMM)
What is a Gaussian Mixture Model (GMM)?
GMM is a probabilistic model that assumes data points are generated from a mixture of several Gaussian distributions. Unlike K-Means, which assigns each data point to a single cluster, GMM assigns a probability to each point belonging to different clusters.
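To see the "probability per cluster" idea in practice, here is a minimal sketch using scikit-learn's GaussianMixture; the synthetic data is made up for illustration.

# A minimal sketch of GMM soft assignments with scikit-learn; toy data for illustration only.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=42)

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)

hard = gmm.predict(X)        # most likely component for each point (like K-Means labels)
soft = gmm.predict_proba(X)  # probability of each point belonging to each component
print(soft[:3].round(3))     # rows like [0.97, 0.02, 0.01] - soft, not all-or-nothing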
Gaussian Mixture Models (GMM) (cont.)
How GMM Differs from K-Means
K-Means assigns each point to exactly one cluster (a hard assignment), while GMM gives each point a probability of belonging to every cluster (a soft assignment) and models each cluster as a Gaussian distribution.

Gaussian Mixture Models (GMM) (cont.)
Probability Distributions in Clustering
Gaussian Distribution: A bell-shaped curve, often used to model data that is symmetrically distributed around a mean. GMM models each cluster as a Gaussian distribution, making it more flexible than K-Means.

Gaussian Mixture Models (GMM) (cont.)
One hint that data might follow a mixture model is that the data looks multimodal, i.e., there is more than one "peak" in the distribution of the data. Trying to fit a multimodal distribution with a unimodal (one "peak") model will generally give a poor fit. Since many simple distributions are unimodal, an obvious way to model a multimodal distribution would be to assume that it is generated by multiple unimodal distributions. For several theoretical reasons, the most commonly used distribution for modeling real-world unimodal data is the Gaussian distribution. Thus, modeling multimodal data as a mixture of many unimodal Gaussian distributions makes intuitive sense. Furthermore, GMMs maintain many of the theoretical and computational benefits of Gaussian models, making them practical for efficiently modeling very large datasets.

Gaussian Mixture Models (GMM) (cont.)
Use Cases of GMM
Speech Recognition: Identifying phonemes in speech by clustering audio signals with similar properties.
Anomaly Detection: Identifying rare or unusual data points by evaluating their likelihood of belonging to any of the Gaussian distributions.

Propagation Model
What Are Propagation Models?
Propagation models simulate how information, influence, or behavior spreads through a network. These models are used in areas like social networks, disease spread, and recommendation systems.

Label Propagation
Label Propagation: A semi-supervised learning algorithm also used for community detection in networks. Each node in the graph starts with its own label, and labels then propagate through the network: each node updates its label to the most frequent label among its neighbors, iterating until convergence.

Propagation Model
Example: Community Detection in Social Networks
In social media platforms, users (nodes) may belong to different communities (e.g., sports fans, tech enthusiasts). Label Propagation can identify these communities by propagating user labels based on their connections and interactions. This helps in discovering groups of users with similar interests or behaviors, as sketched below.
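Here is a minimal sketch of that idea using NetworkX's label propagation community detection; the tiny "social network" graph below is made up for illustration.

# A minimal sketch of label propagation for community detection with NetworkX;
# the toy graph and node names are made up for illustration.
import networkx as nx
from networkx.algorithms.community import label_propagation_communities

# Two tightly knit groups of users, linked by a single weak bridge
G = nx.Graph()
G.add_edges_from([("ana", "ben"), ("ben", "cara"), ("cara", "ana"),   # group 1
                  ("dan", "eli"), ("eli", "fay"), ("fay", "dan"),     # group 2
                  ("cara", "dan")])                                   # bridge edge

# Each node starts with its own label; labels then spread until every node
# carries the label most common among its neighbors.
communities = label_propagation_communities(G)
for community in communities:
    print(sorted(community))  # e.g. ['ana', 'ben', 'cara'] and ['dan', 'eli', 'fay']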

THANK YOU