AI 305 Unsupervised Learning: Clustering (PDF)

Summary

This document is a set of lecture notes for an AI course covering unsupervised learning: clustering (k-means and hierarchical clustering) and principal components analysis (PCA).

Full Transcript


Introduction to Machine Learning, AI 305: Unsupervised Learning and Clustering

Supervised learning vs. unsupervised learning

- Supervised learning: discover patterns in the data that relate data attributes to a target (class) attribute. These patterns are then used to predict the values of the target attribute in future data instances.
- Unsupervised learning: the data have no target attribute. We want to explore the data to find some intrinsic structure in them.

Clustering

- Clustering is a technique for finding similarity groups in data, called clusters: it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different from (far away from) each other into different clusters.
- Clustering is called an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given, as is the case in supervised learning.
- For historical reasons, clustering is often considered synonymous with unsupervised learning.

An illustration

(Figure: a data set with three natural groups of data points, i.e., 3 natural clusters.)

What is clustering for?

Some real-life examples:
- Example 1: group people of similar sizes together to make "small", "medium" and "large" T-shirts. Tailor-made for each person is too expensive; one-size-fits-all does not fit all.
- Example 2: in marketing, segment customers according to their similarities, to do targeted marketing by identifying subgroups of people who might be more receptive to a particular form of advertising, or more likely to purchase a particular product.
- Example 3: given a collection of text documents, organize them according to their content similarities, to produce a topic hierarchy.

Clustering has a long history and is used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc. In recent years, due to the rapid increase of online documents, text clustering has become important.

Aspects of clustering

- A clustering algorithm: partitional clustering, hierarchical clustering, ...
- A distance (similarity, or dissimilarity) function.
- Clustering quality: inter-cluster distance maximized; intra-cluster distance minimized.
- The quality of a clustering result depends on the algorithm, the distance function, and the application.

K-means clustering

- K-means is a partitional clustering algorithm.
- Let the set of data points (or instances) D be {x1, x2, ..., xn}, where xi = (xi1, xi2, ..., xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.
- The k-means algorithm partitions the given data into k clusters. Each cluster has a cluster center, called the centroid.
- k is specified by the user.

K-means algorithm

Given k, the k-means algorithm works as follows (a minimal code sketch of this procedure is given at the end of the k-means section below):
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to step 2.

Stopping/convergence criterion

1. No (or a minimal number of) re-assignments of data points to different clusters,
2. no (or minimal) change of the centroids, or
3. a minimal decrease in the sum of squared error (SSE):

    SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2        (1)

where C_j is the jth cluster, m_j is the centroid of cluster C_j (the mean vector of all the data points in C_j), and dist(x, m_j) is the distance between data point x and centroid m_j.

An example

(Figures: successive iterations of k-means on a small 2-D data set, with centroids marked "+", followed by an example distance function.)

Strengths of k-means

- Simple: easy to understand and to implement.
- Efficient: the time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Since both k and t are small, k-means is considered a linear algorithm.
- K-means is the most popular clustering algorithm.
- Note that it terminates at a local optimum if SSE is used; the global optimum is hard to find due to the complexity of the problem.

Weaknesses of k-means

- The algorithm is only applicable when the mean is defined. For categorical data, k-modes can be used instead, where the centroid is represented by the most frequent attribute values.
- The user needs to specify k.
- The algorithm is sensitive to outliers. Outliers are data points that are very far away from other data points; they may be errors in the data recording or special data points with very different values.

Selecting the value of k

- A simulated data set with 150 observations in 2-dimensional space.
- The panels show the results of applying K-means clustering with different values of K, the number of clusters.
- The color of each observation indicates the cluster to which it was assigned by the K-means algorithm.
- Note that there is no ordering of the clusters, so the cluster coloring is arbitrary.
- These cluster labels were not used in clustering; they are the outputs of the clustering procedure.

Weaknesses of k-means: dealing with outliers

(Figure: problems caused by outliers in k-means.)

- One method is to remove, during the clustering process, data points that are much further away from the centroids than the other data points. To be safe, we may want to monitor these possible outliers over a few iterations before deciding to remove them.
- Another method is to perform random sampling. Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small. The remaining data points are then assigned to the clusters by distance or similarity comparison, or by classification.

Weaknesses of k-means (cont.)

- The algorithm is sensitive to the initial seeds: different seeds can give very different results, and good seeds give good results. There are methods to help choose good seeds.
- The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).
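The procedure above maps directly onto a few lines of NumPy. The following is a minimal sketch (not the course's reference implementation), assuming a data matrix X of shape (n, r); it uses squared Euclidean distance and the SSE-based stopping rule from equation (1).

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, rng=None):
    """Minimal k-means sketch: random seeds, assign, re-compute, repeat."""
    rng = np.random.default_rng(rng)
    # 1) Randomly choose k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    prev_sse = np.inf
    for _ in range(max_iter):
        # 2) Assign each point to its closest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 3) Re-compute each centroid as the mean of its members
        #    (keep the old centroid if a cluster becomes empty).
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
        # 4) Stop when the SSE no longer decreases meaningfully.
        sse = dists[np.arange(len(X)), labels].sum()
        if prev_sse - sse < tol:
            break
        prev_sse = sse
    return labels, centroids, sse

# Example: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(0, 0), (3, 3), (0, 3)]])
labels, centroids, sse = kmeans(X, k=3, rng=0)
print(centroids, sse)
```

Because the result depends on the random seeds, such a sketch is typically run several times and the solution with the lowest SSE is kept.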
K-means summary

- Despite its weaknesses, k-means is still the most popular clustering algorithm because of its simplicity and efficiency, and because other clustering algorithms have their own lists of weaknesses.
- There is no clear evidence that any other clustering algorithm performs better in general, although some may be more suitable for specific types of data or applications.
- Comparing different clustering algorithms is a difficult task: no one knows the correct clusters!

Common ways to represent clusters

- Use the centroid of each cluster to represent the cluster, and compute the radius and standard deviation of the cluster to describe its spread in each dimension. The centroid representation alone works well if the clusters are hyper-spherical; if clusters are elongated or have other shapes, centroids are not sufficient.
- Use a classification model: all the data points in a cluster are regarded as having the same class label, e.g., the cluster ID, and a supervised learning algorithm is run on the data to find a classification model.
- Use frequent values to represent the cluster. This method is mainly for clustering of categorical data (e.g., k-modes clustering). It is also the main method used in text clustering, where a small set of frequent words in each cluster is selected to represent the cluster.

Clusters of arbitrary shapes

- Hyper-elliptical and hyper-spherical clusters are usually easy to represent, using their centroids together with spreads. Irregularly shaped clusters are hard to represent, and they may not be useful in some applications.
- Using centroids is not suitable for irregular clusters (upper figure) in general; k-means clusters may be more useful (lower figure), e.g., for making T-shirts in two sizes.

Hierarchical clustering

- K-means clustering requires us to pre-specify the number of clusters K, which is a disadvantage.
- Hierarchical clustering is an alternative approach which does not require that we commit to a particular choice of K.
- It produces a nested sequence of clusters, a tree, also called a dendrogram.

Types of hierarchical clustering

- Agglomerative (bottom-up) clustering builds the dendrogram (tree) from the bottom level: it merges the most similar (or nearest) pair of clusters and stops when all the data points have been merged into a single cluster (the root cluster).
- Divisive (top-down) clustering starts with all data points in one cluster, the root. It splits the root into a set of child clusters, each child cluster is recursively divided further, and it stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.

Agglomerative clustering

Agglomerative clustering is more popular than divisive methods (a code sketch appears after the dendrogram example below):
- At the beginning, each data point forms its own cluster (also called a node).
- Merge the nodes/clusters that have the least distance between them.
- Continue merging.
- Eventually all nodes belong to one cluster.

(Figure: the agglomerative clustering algorithm.)

Dendrogram example

- 45 observations generated in 2-dimensional space. In reality there are three distinct classes, shown in separate colors, but we treat these class labels as unknown and seek to cluster the observations in order to discover the classes from the data.
- Left: dendrogram obtained from hierarchically clustering the data, with complete linkage and Euclidean distance.
- Center: the same dendrogram, cut at a height of 9 (indicated by a dashed line). This cut results in two distinct clusters, shown in different colors.
- Right: the same dendrogram, now cut at a height of 5. This cut results in three distinct clusters, shown in different colors.
- The colors were not used in clustering; they are used only for display purposes in this figure.
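As a concrete illustration, SciPy's hierarchical-clustering routines implement the agglomerative procedure above. This is a minimal sketch with synthetic data (not the 45-observation set from the slides), using complete linkage with Euclidean distance and a cut that yields three clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Synthetic 2-D data standing in for the 45-observation example.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(15, 2))
               for c in [(0, 0), (4, 0), (2, 3)]])

# Agglomerative clustering: repeatedly merge the closest pair of clusters,
# with 'complete' linkage and Euclidean distance as in the dendrogram example.
Z = linkage(X, method="complete", metric="euclidean")

# Cut the dendrogram to obtain a flat clustering with 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# dendrogram(Z) draws the tree itself (requires matplotlib).
```

Changing method to "single" or "average" switches the linkage criterion discussed in the next section.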
An example: working of the algorithm

(Figure: step-by-step merges performed by the agglomerative algorithm on a small data set.)

Measuring the distance between two clusters

There are several ways to measure the distance between two clusters, resulting in different variations of the algorithm: single link, complete link, average link, centroid methods, and others.

Single link method

- The distance between two clusters is the distance between the two closest data points in the two clusters, one from each cluster (minimal inter-cluster dissimilarity).
- Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.
- It can find arbitrarily shaped clusters, but it may suffer from an undesirable "chain effect" caused by noisy points: single linkage can produce extended, trailing clusters in which single observations are fused one at a time.

Complete link method

- The distance between two clusters is the distance between the two furthest data points in the two clusters (maximal inter-cluster dissimilarity).
- Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.
- It is sensitive to outliers, because they are far away from everything else.

Average link and centroid methods

- Average link is a compromise between the sensitivity of complete-link clustering to outliers and the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects. In this method, the distance between two clusters is the average of all pairwise distances between the data points in the two clusters.
- Centroid method: the distance between two clusters is the distance between their centroids.
- Average and complete linkage tend to yield more balanced clusters.

Distance functions

- The distance function is key to clustering; "similarity" and "dissimilarity" are also commonly used terms.
- There are numerous distance functions for different types of data (numeric data, nominal data) and for different specific applications.

Distance functions for numeric attributes

The most commonly used functions are the Euclidean distance and the Manhattan (city-block) distance. We denote the distance between two data points (vectors) xi and xj by dist(xi, xj). Both are special cases of the Minkowski distance, where h is a positive integer:

    dist(x_i, x_j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ir} - x_{jr}|^h \right)^{1/h}

Euclidean distance and Manhattan distance

- If h = 2, it is the Euclidean distance:
    dist(x_i, x_j) = \sqrt{ (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2 }
- If h = 1, it is the Manhattan distance:
    dist(x_i, x_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ir} - x_{jr}|
- Weighted Euclidean distance:
    dist(x_i, x_j) = \sqrt{ w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \dots + w_r (x_{ir} - x_{jr})^2 }

Squared distance and Chebychev distance

- Squared Euclidean distance: places progressively greater weight on data points that are further apart:
    dist(x_i, x_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2
- Chebychev distance: used when one wants to define two data points as "different" if they differ on any one of the attributes:
    dist(x_i, x_j) = \max( |x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|, \dots, |x_{ir} - x_{jr}| )
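These distance functions are each a few lines of NumPy. The sketch below (illustrative only, not part of the original notes) implements the Minkowski family and the Chebychev distance for two r-dimensional vectors; with h = 2 and weights it reduces to the weighted Euclidean distance above.

```python
import numpy as np

def minkowski(xi, xj, h=2, w=None):
    """Minkowski distance; h=2 gives Euclidean, h=1 gives Manhattan.
    Optional weights w (with h=2) give the weighted Euclidean variant."""
    xi, xj = np.asarray(xi, float), np.asarray(xj, float)
    w = np.ones_like(xi) if w is None else np.asarray(w, float)
    return (w * np.abs(xi - xj) ** h).sum() ** (1.0 / h)

def chebychev(xi, xj):
    """Chebychev distance: the largest per-attribute difference."""
    return np.abs(np.asarray(xi, float) - np.asarray(xj, float)).max()

a, b = [0.0, 3.0, 1.0], [4.0, 0.0, 1.0]
print(minkowski(a, b, h=2))   # Euclidean: 5.0
print(minkowski(a, b, h=1))   # Manhattan: 7.0
print(chebychev(a, b))        # Chebychev: 4.0
```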
How to choose a clustering algorithm

- Clustering research has a long history, and a vast collection of algorithms is available; we have only introduced a few of the main ones.
- Choosing the "best" algorithm is a challenge. Every algorithm has limitations and works well only for certain data distributions, and it is very hard, if not impossible, to know what distribution the application data follow. The data may not fully follow any "ideal" structure or distribution required by the algorithms.
- One also needs to decide how to standardize the data, how to choose a suitable distance function, and how to select other parameter values.
- Because of these complexities, the common practice is to run several algorithms using different distance functions and parameter settings, and then carefully analyze and compare the results.
- The interpretation of the results must be based on insight into the meaning of the original data, together with knowledge of the algorithms used.
- Clustering is highly application dependent and, to a certain extent, subjective (personal preferences).

Summary

- Clustering has a long history and is still an active area: there are a huge number of clustering algorithms, and more appear every year.
- We only introduced several main algorithms. There are many others, e.g., density-based algorithms, sub-space clustering, scale-up methods, neural-network-based methods, fuzzy clustering, co-clustering, etc.
- Clustering is hard to evaluate, but very useful in practice. This partially explains why a large number of clustering algorithms are still being devised every year.
- Clustering is highly application dependent and to some extent subjective.

Principal Components Analysis (PCA)

- PCA is a tool used for data visualization or for data pre-processing before supervised techniques are applied.
- PCA produces a low-dimensional representation of a dataset: it finds a sequence of linear combinations of the variables that have maximal variance and are mutually uncorrelated.
- Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization.

Principal components analysis: details

The first principal component of a set of features X_1, X_2, ..., X_p is the normalized linear combination of the features that has the largest variance:

    Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p

By normalized, we mean that \sum_{j=1}^{p} \phi_{j1}^2 = 1. We refer to the elements \phi_{11}, ..., \phi_{p1} as the loadings of the first principal component; together, the loadings make up the principal component loading vector, \phi_1 = (\phi_{11}, \phi_{21}, ..., \phi_{p1})^T. We constrain the loadings so that their sum of squares equals one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.

PCA: example

- The population size (pop) and ad spending (ad) for 100 different cities are shown as purple circles.
- The green solid line indicates the first principal component direction, and the blue dashed line indicates the second principal component direction.
- The first principal component direction of the data is the one along which the observations vary the most.

(Figure, left panel: the first principal component direction shown in green. It is the dimension along which the data vary the most, and it defines the line that is closest to all n observations; the distances from each observation to the principal component line are shown as black dashed segments, and the blue dot represents the mean. Right panel: the same plot rotated so that the first principal component direction coincides with the x-axis.)
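In practice, a two-variable example like the pop/ad data can be run through scikit-learn's PCA in a few lines. The sketch below uses synthetic data (the actual city values are not reproduced here, and the column names are only illustrative).

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the 100-city (pop, ad) data: two correlated columns.
rng = np.random.default_rng(0)
pop = rng.normal(30, 10, size=100)
ad = 0.6 * pop + rng.normal(0, 3, size=100)
X = np.column_stack([pop, ad])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)          # z_i1, z_i2 for each city
print(pca.components_)                 # rows are the loading vectors phi_1, phi_2
print(pca.explained_variance_ratio_)   # proportion of variance explained by each
```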
(Figure: plots of the first principal component scores z_i1 versus pop and ad show strong relationships; plots of the second principal component scores z_i2 versus pop and ad show weak relationships.)

Computation of principal components

- Suppose we have an n x p data set X. Since we are only interested in variance, we assume that each of the variables in X has been centered to have mean zero (that is, the column means of X are zero).
- We then look for the linear combination of the sample feature values of the form

    z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \dots + \phi_{p1} x_{ip},   for i = 1, ..., n    (2)

  that has the largest sample variance, subject to the constraint that \sum_{j=1}^{p} \phi_{j1}^2 = 1.
- Since each of the x_{ij} has mean zero, so does z_{i1} (for any values of \phi_{j1}). Hence the sample variance of the z_{i1} can be written as \frac{1}{n} \sum_{i=1}^{n} z_{i1}^2.

Computation: continued

- Plugging (2) into this sample variance, the first principal component loading vector solves the optimization problem

    maximize over \phi_{11}, ..., \phi_{p1}:  \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{j1} x_{ij} \Big)^2   subject to  \sum_{j=1}^{p} \phi_{j1}^2 = 1.

- This problem can be solved via a singular value decomposition (SVD) of the matrix X, a standard technique in linear algebra; it can also be solved via an eigen decomposition.
- We refer to Z_1 as the first principal component, with realized values z_{11}, ..., z_{n1}.

Geometry of PCA

- The loading vector \phi_1 with elements \phi_{11}, \phi_{21}, ..., \phi_{p1} defines a direction in feature space along which the data vary the most.
- If we project the n data points x_1, ..., x_n onto this direction, the projected values are the principal component scores z_{11}, ..., z_{n1} themselves.

Further principal components

- The second principal component is the linear combination of X_1, ..., X_p that has maximal variance among all linear combinations that are uncorrelated with Z_1.
- The second principal component scores z_{12}, z_{22}, ..., z_{n2} take the form

    z_{i2} = \phi_{12} x_{i1} + \phi_{22} x_{i2} + \dots + \phi_{p2} x_{ip},

  where \phi_2 is the second principal component loading vector, with elements \phi_{12}, \phi_{22}, ..., \phi_{p2}.
- It turns out that constraining Z_2 to be uncorrelated with Z_1 is equivalent to constraining the direction \phi_2 to be orthogonal (perpendicular) to the direction \phi_1, and so on for subsequent components.
- The principal component directions \phi_1, \phi_2, \phi_3, ... are the ordered sequence of right singular vectors of the matrix X, and the variances of the components are 1/n times the squares of the singular values. Equivalently, the principal component directions are the ordered eigenvectors of the matrix X^T X, and the variances of the components are the corresponding eigenvalues divided by n.
- There are at most min(n - 1, p) principal components.
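The SVD route described above can be written out directly. Here is a minimal NumPy sketch (illustrative, with synthetic data) that recovers the loading vectors, the scores, and the component variances under the 1/n convention used in these notes.

```python
import numpy as np

def pca_svd(X):
    """PCA via SVD of the centered data matrix.
    Returns the loading vectors (columns of phi), the scores Z, and the
    component variances. Signs of the loading vectors are arbitrary."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                      # center each column to mean zero
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    phi = Vt.T                                   # columns are phi_1, phi_2, ... (right singular vectors)
    Z = Xc @ phi                                 # principal component scores z_im
    var = (s ** 2) / n                           # variances: (1/n) * squared singular values
    return phi, Z, var

# Tiny example with p = 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
phi, Z, var = pca_svd(X)
print(phi[:, 0], var)                            # first loading vector and the variances
```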
Illustration: the USArrests data

- For each of the fifty states in the United States, the data set contains the number of arrests per 100,000 residents for each of three crimes: Assault, Murder, and Rape. It also records UrbanPop, the percent of the population in each state living in urban areas.
- The principal component score vectors have length n = 50, and the principal component loading vectors have length p = 4.
- PCA was performed after standardizing each variable to have mean zero and standard deviation one.

USArrests data: PCA plot

- The figure shows the first two principal components for the USArrests data.
- The blue state names represent the scores for the first two principal components.
- The orange arrows indicate the first two principal component loading vectors (with axes on the top and right). For example, the loading for Murder on the first component is 0.54, and its loading on the second principal component is -0.42, i.e., (0.54, -0.42).
- This figure is called a biplot because it displays both the principal component scores and the principal component loadings (the loading vectors \phi_1 and \phi_2) for the USArrests data.

Example details (cont.)

- The first loading vector places approximately equal weight on Assault, Murder, and Rape, but much less weight on UrbanPop. Hence this component roughly corresponds to a measure of the overall rate of serious crimes.
- The second loading vector places most of its weight on UrbanPop and much less weight on the other three features. Hence this component roughly corresponds to the level of urbanization of the state.
- Overall, the crime-related variables (Murder, Assault, and Rape) are located close to each other, while the UrbanPop variable is far from the other three. This indicates that the crime-related variables are correlated with each other (states with high murder rates tend to have high assault and rape rates) and that the UrbanPop variable is less correlated with the other three.
- We can also examine differences between the states via the two principal component score vectors. States with large positive scores on the first component, such as California, Nevada and Florida, have high crime rates, while states like North Dakota, with negative scores on the first component, have low crime rates. California also has a high score on the second component, indicating a high level of urbanization, while the opposite is true for states like Mississippi. States close to zero on both components, such as Indiana, have approximately average levels of both crime and urbanization.

Another interpretation of principal components

- PCA finds the hyperplane closest to the observations.
- The first principal component loading vector has a very special property: it defines the line in p-dimensional space that is closest to the n observations (using average squared Euclidean distance as a measure of closeness).
- This notion extends beyond the first principal component: for instance, the first two principal components of a data set span the plane that is closest to the n observations, in terms of average squared Euclidean distance.

(Figure: 90 observations simulated in three dimensions, displayed in color for ease of visualization. Left: the first two principal component directions span the plane that best fits the data; the plane is positioned to minimize the sum of squared distances to each point. Right: the first two principal component score vectors give the coordinates of the projection of the 90 observations onto the plane.)

Scaling of the variables matters

- If the variables are in different units, scaling each to have standard deviation equal to one is recommended.
- If they are in the same units, you might or might not scale the variables.

Proportion of variance explained

To understand the strength of each component, we are interested in the proportion of variance explained (PVE) by each one. The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as

    \sum_{j=1}^{p} Var(X_j) = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2,

and the variance explained by the mth principal component is

    Var(Z_m) = \frac{1}{n} \sum_{i=1}^{n} z_{im}^2.

It can be shown that \sum_{j=1}^{p} Var(X_j) = \sum_{m=1}^{M} Var(Z_m), with M = min(n - 1, p). Therefore, the PVE of the mth principal component is the quantity between 0 and 1 given by

    PVE_m = \frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}.

The PVEs sum to one, and we sometimes display the cumulative PVEs.
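The PVEs fall out directly from the SVD-based computation. The lines below (assuming the pca_svd function and the matrix X from the earlier sketch) compute the PVE and cumulative PVE of each component.

```python
import numpy as np

# Assumes pca_svd and X from the earlier PCA sketch.
phi, Z, var = pca_svd(X)

pve = var / var.sum()        # proportion of variance explained by each component
cum_pve = np.cumsum(pve)     # cumulative PVEs, as displayed in a scree plot
print(pve, cum_pve)          # the PVEs sum to one
```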
Proportion of variance explained: continued

- In the USArrests example, the first principal component explains 62.0% of the variance in the data, and the second principal component explains 24.7%. Together, the first two principal components explain almost 87% of the variance in the data, while the last two principal components explain only 13%.
- This means that we can provide a fairly accurate summary of the data using just two dimensions.

How many principal components should we use?

- If we use principal components as a summary of our data, how many components are sufficient? There is no simple answer to this question.
- Cross-validation is not available for this purpose. Why not? Because the setting is unsupervised: there is no response to predict, so there is no prediction error to cross-validate.
- We can, however, use cross-validation to select the number of components if we compute principal components for use in a supervised analysis, treating the number of principal components as a tuning parameter.
- The "scree plot" of the PVEs can be used as a guide: we look for an "elbow" (the first two principal components in the previous figure).

End
