Cluster Analysis and Implementation of Clustering with Weka(angelow).pptx

Full Transcript

Cluster Analysis and Implementation of Clustering with Weka and R Language
PRESENTED BY: LAROT, ANGELOW BALORIO; BUCOL, ERICA JANE NUÑEZ; JAYSON, GEOMARIZ MADULARA

Chapter Objectives
- To comprehend the concept of clustering, its applications, and features.
- To understand the various distance metrics used for clustering data.
- To comprehend the process of k-means clustering.
- To comprehend the process of hierarchical clustering algorithms.
- To comprehend the process of the DBSCAN algorithm.

Introduction to Cluster Analysis
Large datasets often lack labels because labeling takes considerable time and effort. Clustering, an unsupervised learning technique, allows unlabeled data to be analyzed by grouping similar objects into classes or clusters. The goal is that records within a cluster are highly similar to one another, while records in different clusters are highly dissimilar. The first humans, Adam and Eve, learned through clustering: they observed objects, classified them based on their properties, and then assigned labels to them, making clustering an essential human activity.

Applications of Cluster Analysis
Cluster analysis is a versatile technique used in many fields, such as marketing, land use, insurance, city planning, earthquake studies, biology, web discovery, and fraud detection, to group similar items together and reveal valuable patterns and insights. For example, it helps marketing by identifying customer segments to target, land use by identifying similar areas, insurance by spotting high-risk groups, and fraud detection by flagging unusual activity.

Desired Features of Clustering
An ideal clustering technique should minimize intra-cluster distances and maximize inter-cluster distances, so that similar data points are grouped close together while different groups stay far apart. It should also be scalable to large datasets, able to handle various types of data, independent of the input order, capable of identifying clusters of different shapes, robust to noisy data, high-performing with a minimal number of scans of the dataset, interpretable, able to stop and resume a clustering task, and should require minimal user guidance. Distance metrics play a vital role in determining how similar objects are to each other, which makes them crucial for clustering.

Distance Metrics
A distance metric is a function d(x, y) that calculates the distance between elements in a set, with zero distance indicating equality. It measures how close or similar elements are to each other, regardless of their type. Different distance metrics are used to assess similarity among objects.

Euclidean distance
Euclidean distance is the most commonly used distance measure. The distance between two points in the plane with coordinates (x, y) and (a, b) according to the Euclidean distance formula is:

Euclidean dist((x, y), (a, b)) = √((x - a)² + (y - b)²)

For example, the Euclidean distance between the points (-2, 2) and (2, -1) is calculated as:

Euclidean dist((-2, 2), (2, -1)) = √((-2 - 2)² + (2 - (-1))²) = √((-4)² + 3²) = √(16 + 9) = √25 = 5
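
Since the chapter also covers implementation in R, here is a minimal sketch (not taken from the slides) that reproduces this calculation with base R's dist() function; the points and variable names are illustrative:

p <- rbind(c(-2, 2), c(2, -1))      # each row of the matrix is one point (x, y)
dist(p, method = "euclidean")       # distance between the two rows: 5
sqrt((-2 - 2)^2 + (2 - (-1))^2)     # same result computed directly from the formula
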
Manhattan distance
Manhattan distance is also called the L1-distance. It is defined as the sum of the lengths of the projections of the line segment between the two points onto the coordinate axes. The distance between two points in the plane with coordinates (x, y) and (a, b) according to the Manhattan distance formula is:

Manhattan dist((x, y), (a, b)) = |x - a| + |y - b|

Using this formula, we can calculate the similarity distance among persons. For example, for two persons described by the attribute pairs (30, 70) and (40, 54), the distance between person 1 and person 2 is:

Manhattan dist((30, 70), (40, 54)) = |30 - 40| + |70 - 54| = |-10| + |16| = 10 + 16 = 26

Chebyshev distance
Chebyshev distance is also called the chessboard distance because, in a game of chess, the minimum number of moves required by a king to go from one square to another equals the Chebyshev distance between the centers of the squares. Chebyshev distance is defined on a vector space, where the distance between two vectors is the maximum absolute difference along any coordinate dimension. The formula for the Chebyshev distance is:

Chebyshev dist((r1, f1), (r2, f2)) = max(|r2 - r1|, |f2 - f1|)

Major Clustering Methods/Algorithms
Clustering algorithms can be grouped into five categories based on their approaches:
1. Partitioning method: starts with random partitions and then optimizes them iteratively based on a certain criterion, dividing the data into distinct groups.
2. Hierarchical method: creates a tree-like structure of clusters based on similarity, allowing a detailed analysis of relationships within the data.
3. Density-based method: focuses on the density and connectivity of data points to form clusters, which is particularly useful for datasets with irregularly shaped clusters.
4. Grid-based method: organizes data into multiple levels of granularity, making it efficient for large datasets.
5. Model-based method: assigns a model to each cluster and searches for the best-fitting model for the data, allowing more complex cluster shapes and structures to be identified.

Partitioning Clustering
Clustering involves dividing a dataset into a few clusters, like grouping items in a grocery store by categories. The grouping can be qualitative (e.g., dairy products) or quantitative (e.g., based on milk percentage). In the partitioning method, objects are clustered into different partitions on the basis of their attributes. K-means clustering, a key partitioning technique, assigns data points to clusters by minimizing their distance to the cluster centers.

k-means clustering
In the k-means clustering algorithm, n objects are clustered into k clusters or partitions on the basis of their attributes, where k < n and k is a positive integer. In simple words, the objects are grouped into k clusters based on their attributes or features: each object is assigned to the cluster with the nearest center, the cluster centers are recomputed, and these two steps are repeated until the assignments stop changing.

Starting values for the k-means algorithm
Normally, we have to specify the starting seeds and the number of clusters k at the start of the k-means algorithm, and the final result can depend on these choices. We can use an iterative approach to overcome this problem. For example, first select three starting seeds randomly and choose three clusters. Once the final clusters have been found, the process may be repeated several times with a different set of seeds. The seed records should be selected so that they are at maximum distance from each other, or as far apart as possible; a small R sketch of this restart idea follows below.
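
The sketch below is illustrative rather than taken from the slides: base R's kmeans() reruns the algorithm with nstart different random seed sets and keeps the solution with the lowest total intra-cluster variance. The use of the built-in iris data and the particular parameter values are assumptions made for demonstration:

set.seed(42)                                # make the random seed choices reproducible
x <- scale(iris[, 1:4])                     # example numeric attributes, standardized
fit <- kmeans(x, centers = 3, nstart = 25)  # 25 different random seed sets, best result kept
fit$centers                                 # final cluster centers
table(fit$cluster)                          # cluster sizes
fit$tot.withinss                            # total intra-cluster (within-cluster) variance
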
If two clusters are identified as being close together during the iterative process, it is appropriate to merge them. Likewise, a large cluster may be partitioned into two smaller clusters if its intra-cluster variance is above some threshold value.

Issues with the k-means algorithm
The k-means algorithm has several limitations: it is sensitive to the initial guess of seeds, vulnerable to outliers that can distort the results, restricted to continuous data because of its use of Euclidean distance, unaware of cluster sizes, unable to handle overlapping clusters, and challenged by clusters of varying sizes and densities.

Scaling and weighting
In clustering, attributes whose values are small compared to those of other attributes may not strongly influence how clusters are determined, because they vary over a much narrower range. For instance, in a wine dataset where attribute values span six orders of magnitude, clustering algorithms may struggle with an attribute like wine density whose values vary by only about 5% of their average, making it hard to cluster effectively on such attributes.

Normalization
With normalization, all attributes are converted to a normalized score in the range (0, 1). The weakness of normalization is its sensitivity to outliers: a single outlier tends to crunch all of the other values down toward zero. To see this, suppose the range of students' marks is 35 to 45 out of 100. Then 35 is mapped to 0, 45 is mapped to 1, and the students are distributed between 0 and 1 depending on their marks. But if there is one student with 90 marks, that student acts as an outlier: now 35 is mapped to 0 and 90 to 1, and most of the values are crunched down toward zero. In this scenario, the solution is standardization.

Standardization
With standardization, the values are spread out so that every attribute has a standard deviation of 1. There is no general rule for when to use normalization versus standardization; however, if your data has outliers, use standardization, otherwise use normalization. Standardization tends to make the values of all attributes fall into similar ranges, since every attribute ends up with the same standard deviation of 1. In the next section, another important clustering technique, hierarchical clustering, is discussed.
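
Before moving on, here is an illustrative R sketch contrasting the two approaches (the marks are made up to mirror the example above, not taken from the slides):

marks <- c(35, 38, 40, 42, 45, 90)                              # 90 is the outlier
normalized <- (marks - min(marks)) / (max(marks) - min(marks))  # rescale to the range [0, 1]
round(normalized, 2)                                            # the outlier crunches most values toward 0
standardized <- as.numeric(scale(marks))                        # mean 0, standard deviation 1
round(standardized, 2)                                          # values spread out on a common scale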
