Lecture 5 - Data Mining Continued

Outline
- Clustering
  - Definition
  - Examples
- K-means method
  - Step by step process
  - Case studies

Clustering
- Looking for underlying patterns (groups) in the data
- Grouping data according to a set of criteria
- No target value
- Also known as unsupervised learning

Clusters in data mining represent groups of similar data objects. Clustering is essentially the division of data into groups (clusters) such that objects in the same cluster are more similar to each other than to objects in other clusters. It is a method of unsupervised learning and a common technique for statistical data analysis used in many fields.

Partitional Clustering

Partitional clustering is a type of clustering method in data mining where the data is divided into non-overlapping subsets (clusters) such that each data object is in exactly one subset. This implies that the set of clusters formed is exhaustive and mutually exclusive. The most common partitional clustering method is the k-means algorithm, where 'k' represents the number of clusters.

Hard Clustering

In hard clustering, each data point either belongs to a cluster completely or not at all: every data point is associated with exactly one cluster and no others. It is a definitive classification of data points into clusters.

Soft Clustering

Unlike hard clustering, soft clustering (also known as fuzzy clustering) allows each data point to belong to multiple clusters with varying degrees of membership. The degree of membership indicates the strength of the association between a data point and a particular cluster. Fuzzy clustering is particularly useful when the boundaries between clusters are not clear or well defined. It provides a more flexible approach and can result in more accurate representations of the relationships in the data.

Hierarchical Clustering

Hierarchical clustering is another method of cluster analysis, which seeks to build a hierarchy of clusters. In its basic form, the method starts with every data point assigned to a cluster of its own; the two nearest clusters are then merged, and the algorithm terminates when only a single cluster is left.

Hierarchical Agglomerative Clustering

Hierarchical agglomerative clustering is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. The advantage of agglomerative clustering is that it can find smaller clusters in the data that other clustering methods might miss.

Hierarchical Divisive Clustering

Hierarchical divisive clustering, on the other hand, is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. This method is more computationally intensive than agglomerative clustering, but it can produce more accurate results if the data contains distinct, non-overlapping clusters.

K-Means Clustering

K-means is a partitional clustering technique that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It starts by randomly selecting k data points as the initial centroids. All data points are then assigned to their nearest centroid, and the centroids are recalculated. This process is repeated until the locations of the centroids become stable.
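To make this loop concrete, here is a minimal sketch using scikit-learn's KMeans; the library choice, the synthetic two-group data, and the parameter values are illustrative assumptions rather than lecture material:

```python
# A minimal K-means run using scikit-learn (illustrative sketch; the data
# below is synthetic and purely for demonstration).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two roughly separated groups of 2-D points.
data = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(4.0, 4.0), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(data)    # cluster index (0 or 1) for each point
print(kmeans.cluster_centers_)       # final centroids: the means of each cluster
```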
Partitioning Around Medoids (PAM) Clustering

Partitioning Around Medoids (PAM) is another partitional clustering method. Instead of using the mean value of a cluster to represent its center, as in K-means, PAM uses actual data points, called medoids, to represent the center of a cluster. The medoid is the most centrally located object in a cluster. PAM is more robust to outliers than K-means because it minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances.

Fuzzy C-Means Clustering

Fuzzy C-Means (FCM) is a clustering method that allows one piece of data to belong to two or more clusters. It works by assigning each data point a degree of membership in each cluster based on the distance between the data point and that cluster's center: the closer a data point is to a cluster center, the higher its degree of membership in that cluster. The process is iterated until the membership coefficients and cluster centers are unchanged or the maximum number of iterations is reached. (A short from-scratch sketch of FCM appears after the summary below.)

Choosing the Number of Clusters

If the number of clusters k is unknown, validity indices can be computed:
- Criteria to assess how good the clusters are
- Defined by considering the data dispersion within and between clusters
- According to decision rules, the best number of clusters may be selected (a silhouette-based sketch appears after the summary below)

Methods: K-means

K-means: The Process

K-means is an iterative algorithm that divides a group of n data points into k non-overlapping subgroups (clusters), where each data point belongs to the cluster with the nearest mean. Here is a simple example of how the K-means algorithm works (a from-scratch sketch of these steps appears after the summary below):

1. Initialization: Suppose we have an unlabelled data set and we want to classify it into two clusters (k = 2). First, we randomly select two data points in the dataset; these points act as the initial centroids.
2. Assignment: Each data point is assigned to the closest centroid. The distance between a point and a centroid is usually measured using the Euclidean distance. After this step, we have two clusters.
3. Update: We recalculate the centroids of the clusters. The new centroid of a cluster is the mean of all data points in the cluster.
4. Repeat the Assignment and Update steps: These steps are repeated iteratively until the centroids no longer move significantly or the maximum number of iterations is reached.
5. At the end of the process, we have two clusters, each with a final centroid, and each data point belongs to the cluster with the nearest centroid.

*Practice questions in relation to this

Summary

Clustering
- Finding groups in data without using predefined labels
- Data in the same cluster share similar characteristics
- Data grouped according to a set of criteria

K-means
- Minimisation of an objective function
- Centres as the means of the cluster data
- Repeat the process until the data is stable in clusters
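As promised above, here is a from-scratch sketch of the five-step K-means process using only NumPy. The function name, tolerance, and initialization details are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def k_means(data, k=2, max_iters=100, tol=1e-6, seed=0):
    """Bare-bones K-means following the five steps above.

    Illustrative sketch: assumes every cluster keeps at least one point,
    so no empty-cluster handling is included.
    """
    rng = np.random.default_rng(seed)
    # 1. Initialization: randomly select k data points as initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. Assignment: each point joins the nearest centroid
        #    (Euclidean distance).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of its cluster's points.
        new_centroids = np.array([data[labels == j].mean(axis=0)
                                  for j in range(k)])
        # 4. Repeat until the centroids no longer move significantly.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    # 5. Final result: each point belongs to the cluster with the
    #    nearest centroid.
    return labels, centroids
```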
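For choosing the number of clusters when k is unknown, one widely used validity index based on within- and between-cluster dispersion is the silhouette coefficient. The sketch below (again assuming scikit-learn; the data and the candidate range for k are illustrative) scores each candidate k and applies a simple decision rule, keeping the highest-scoring one:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Illustrative data: three synthetic groups in 2-D.
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 2))
                  for c in [(0, 0), (4, 4), (0, 4)]])

# Score each candidate k with the silhouette index (higher is better),
# then apply a simple decision rule: keep the best-scoring k.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    scores[k] = silhouette_score(data, labels)

best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)
```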
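Finally, a short from-scratch sketch of the Fuzzy C-Means updates described earlier, using the standard membership formula with fuzzifier m; all names and default values here are illustrative assumptions:

```python
import numpy as np

def fuzzy_c_means(data, c=2, m=2.0, max_iters=100, tol=1e-6, seed=0):
    """Illustrative Fuzzy C-Means sketch (not the lecture's exact algorithm)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Random initial membership matrix; each row sums to 1.
    u = rng.random((n, c))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(max_iters):
        um = u ** m
        # Cluster centres: membership-weighted means of the data.
        centres = (um.T @ data) / um.sum(axis=0)[:, None]
        # Distances from every point to every centre.
        d = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)            # avoid division by zero
        # Membership update: closer centres receive higher degrees.
        inv = d ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        # Stop when the membership coefficients are (nearly) unchanged.
        if np.abs(new_u - u).max() < tol:
            u = new_u
            break
        u = new_u
    return u, centres
```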