CS 312 Introduction to Artificial Intelligence - Clustering Algorithms PDF


Summary

These lecture notes cover clustering algorithms, including k-means, hierarchical, and density-based clustering, for a course on Introduction to Artificial Intelligence (CS 312). The notes discuss the principles, methods, and applications of these algorithms in data science and machine learning.

Full Transcript

CS 312 Introduction to Artificial Intelligence – Clustering Algorithms

Unsupervised Learning: k-means clustering

k-means clustering takes the number of clusters k and a dataset of n objects as input, and outputs k clusters with minimized within-cluster variance. The n data objects are split into k clusters that meet two requirements: high similarity between objects in the same cluster, and low similarity between objects in different clusters.

Understanding the k-means clustering algorithm

k-means assigns each data point to one of the k clusters depending on its distance from the centers of the clusters. Conventional k-means requires only a few steps. The first step is to randomly select k centroids, where k equals the number of clusters you chose; centroids are data points representing the center of a cluster.

The main part of the algorithm is a two-step process called expectation-maximization. The expectation step assigns each data point to its nearest centroid. The maximization step then computes the mean of all the points in each cluster and sets that mean as the new centroid. These two steps repeat until the centroids converge, that is, until they match the previous iteration's assignment.

The quality of the cluster assignments is measured by the sum of squared errors (SSE): the sum of the squared Euclidean distances from each point to its closest centroid. Since this is a measure of error, the objective of k-means is to minimize this value.
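The expectation-maximization loop just described can be sketched in a few lines of plain Python. This is a minimal illustrative sketch, not code from the slides; the function name, convergence check, and toy data are assumptions:

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means sketch: random init, then the two-step
    expectation-maximization loop until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: pick k random centroids
    for _ in range(iters):
        # Expectation step: assign each data point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Maximization step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:          # converged: same as last iteration
            break
        centroids = new_centroids
    # SSE: sum of squared distances of each point to its closest centroid.
    sse = sum(min(dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, clusters, sse

# Toy usage: two well-separated pairs of points should form two clusters.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
cents, cls, sse = kmeans(pts, 2)
```

On this toy data the loop converges to centroids (0, 0.5) and (10, 10.5) with SSE = 1.0 regardless of which two points the random initialization picks.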
Choosing the appropriate number of clusters

To determine the best k (number of clusters), you can use methods such as the elbow method or the silhouette score.

1. Elbow method – run k-means several times, incrementing k with each iteration, and record the SSE. When you plot SSE as a function of the number of clusters, you will notice that SSE continues to decrease as k increases: as more centroids are added, the distance from each point to its closest centroid shrinks. There is a sweet spot where the SSE curve starts to bend, known as the elbow point. The x-value of this point is considered a reasonable trade-off between error and number of clusters. (In the slide's example plot, the elbow is at k = 3.)

2. Silhouette coefficient – a measure of cluster cohesion and separation. It quantifies how well a data point fits into its assigned cluster based on two factors: how close the data point is to other points in its own cluster, and how far away it is from points in other clusters. Silhouette coefficient values range between -1 and 1; larger values indicate that samples are closer to their own cluster than to other clusters. (In the slide's example, k = 3 gives the highest silhouette coefficient.)

Unsupervised Learning: Hierarchical clustering

Hierarchical clustering divides a dataset at different layers and forms a tree-like clustering structure. The division may follow a "bottom-up" aggregation policy or a "top-down" splitting policy. The hierarchy of clusters is represented in a tree diagram called a dendrogram: the root is the single cluster containing all samples, and the leaves are clusters of single samples.
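The bottom-up aggregation policy just described can be sketched directly: start with every point as its own leaf cluster and repeatedly merge the two closest clusters until k clusters remain, which is equivalent to cutting the dendrogram at that depth. This is a minimal illustrative sketch (the function, the single-linkage default, and the toy data are assumptions, not from the slides):

```python
from math import dist

def agglomerative(points, k, linkage=min):
    """Bottom-up (agglomerative) clustering sketch: begin with one leaf
    cluster per point and merge the closest pair until k clusters remain.
    linkage=min gives single linkage; linkage=max gives complete linkage."""
    clusters = [[p] for p in points]            # leaves: one point per cluster
    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)          # merge cluster j into cluster i
    return clusters

# Toy usage: stopping at k = 2 recovers the two well-separated pairs.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
groups = agglomerative(pts, 2)
```

The brute-force pair search is O(n³) overall and only meant to make the merge policy visible; library implementations (e.g. SciPy's linkage) are far more efficient.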
Clusters are assigned by cutting the dendrogram at a specified depth, which results in k groups of smaller dendrograms.

Hierarchical clustering: Agglomerative

(Slide illustration of bottom-up agglomerative clustering.)

Hierarchical clustering: Divisive

(Slide illustration of top-down divisive clustering.)

Reporting on next meeting

Important note: at the end of your report, give the links/reading resources you studied for your report.

For 1 assigned reporter:
- Show a sample code using k-means clustering.
- Show what method is used in the code to choose the number of clusters (k).

For 3 assigned reporters:
- Discuss density-based clustering and how it discovers hidden patterns in unlabeled data.
- Compare density-based clustering with other clustering algorithms (k-means clustering and hierarchical clustering).
- Show a sample code for the 3 algorithms using the same dataset.
- Compare and interpret the results.

Thank you
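As a starting point for the reporter task of showing how k is chosen, the silhouette coefficient from the earlier slide can be computed straight from its definition: a is the mean distance from a point to the other points in its own cluster (cohesion), b is the smallest mean distance to the points of any other cluster (separation), and s = (b - a) / max(a, b), averaged over all points. A minimal sketch under those definitions (function name and toy data are illustrative, not from the slides):

```python
from math import dist
from statistics import mean

def silhouette(points, labels):
    """Mean silhouette coefficient, computed from its definition."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:                      # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = mean(dist(points[i], points[j]) for j in own)
        b = min(mean(dist(points[i], points[j]) for j in clusters[m])
                for m in clusters if m != label)
        scores.append((b - a) / max(a, b))
    return mean(scores)

# Toy usage: a good labeling scores near 1, a bad one goes negative.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette(pts, [0, 0, 1, 1])
bad = silhouette(pts, [0, 1, 0, 1])
```

To pick k, compute this score for each candidate k (clustering with k-means first) and keep the k with the largest value, as on the silhouette slide.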
