CSDS 391 Intro to AI - Unsupervised Learning and Clustering PDF

Summary

This document provides an outline of the CSDS 391 Introduction to AI course material on unsupervised learning and clustering. It covers topics like unsupervised learning, nearest neighbor classification, multivariate Gaussians, and k-means clustering. The document also includes diagrams and examples, potentially for illustrating these concepts.

Full Transcript

CSDS 391 Intro to AI Unsupervised Learning and Clustering Outline unsupervised learning nearest neighbor classi cation multivariate Gaussians k-means clustering deriving learning rules from objective functions Bayesian clustering with Gaussian mixture mode...

CSDS 391 Intro to AI Unsupervised Learning and Clustering Outline unsupervised learning nearest neighbor classi cation multivariate Gaussians k-means clustering deriving learning rules from objective functions Bayesian clustering with Gaussian mixture models cross validation fi Fisher’s Iris data (unlabeled) 2.5 2.0 petal width (cm) 1.5 1.0 0.5 1 2 3 4 5 6 7 petal length (cm) 2.5 Iris virginica 2 petal width (cm) 1.5 1 Iris setosa Iris versicolor 0.5 0 1 2 3 4 5 6 7 petal length (cm) 2.5 Iris virginica 2 petal width (cm) 1.5 Decision Tree: Dim 2 1 Iris setosa Iris versicolor 0.5 Decision Tree: Dim 1 0 1 2 3 4 5 6 7 petal length (cm) Fisher’s Iris data 2.5 In which example would you be more con dent Iris virginica about2 the class? petal width (cm) 1.5 1 Iris setosa Iris versicolor 0.5 0 Decision boundaries 1 2 3 4 5 provide 6 a classi7 cation petal length (cm) but not uncertainty. fi fi The general classi cation problem output ! is a binary classi cation vector: desired output 1 if xi ∈ Ci ≡ class i, y = {y1 ,... , yK } yi = 0 otherwise model model (e.g. a decision tree) is θ = {θ1 ,... , θM } de ned by M parameters. Data input is a set of T observations, D = {x1 ,... , xT } each an N-dimensional vector xi = {x1 ,... , xN }i (binary, discrete, or continuous) Given data, we want to learn a model that How do we can correctly classify novel observations. approach this probabilistically? fi fi fi The answer to all questions of uncertainty Let’s apply Bayes’ rule to infer the most probable class given the observation: p(x|Ck )p(Ck ) p(Ck |x) = p(x) p(x|Ck )p(Ck ) = ! k p(x|Ck )p(Ck ) This is the answer, but what does it mean? How do we specify the terms? - p(Ck) is the prior probability on the different classes - p(x|Ck) is the data likelihood, ie probability of x given class Ck How should we de ne this? fi What classi er would give “optimal” performance? Consider the iris data. p(petal length |C3) How would we minimize the number p(petal 0.9 0.8 length |C2) of future mis-classi cations? 0.7 0.6 We would need to know the true 0.5 distribution of the classes. 0.4 0.3 Assume they follow a Gaussian 0.2 0.1 distribution. 0 The number of samples in each class 2.5 is the same (50), so (assume) p(Ck) is equal for all classes. 2 petal width (cm) Because p(x) is the same for all 1.5 classes we have: 1 p(x|Ck )p(Ck ) p(Ck |x) = p(x) 0.5 ∝ p(x|Ck )p(Ck ) 0 1 2 3 4 5 6 7 petal length (cm) fi fi Where do we put the boundary? p(petal length |C2) p(petal length |C3) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1 2 3 4 5 6 7 Where do we put the boundary? decision boundary p(petal length |C2) p(petal length |C3) 0.9 0.8 0.7 0.6 0.5 0.4 R23 = C2 is misclassi ed as C3 0.3 R32 = C3 is misclassi ed as C2 0.2 0.1 0 1 2 3 4 5 6 7 fi fi Where do we put the boundary? Shifting the boundary trades-off the two errors. p(petal length |C2) p(petal length |C3) 0.9 0.8 0.7 0.6 0.5 0.4 R23 = C2 is misclassi ed as C3 0.3 R32 = C3 is misclassi ed as C2 0.2 0.1 0 1 2 3 4 5 6 7 fi fi Where do we put the boundary? The misclassi cation error is de ned by Z Z p(error) = p(x|C3 )P (C3 )dx + p(x|C2 )P (C2 )dx R32 R23 which in our case is proportional to the data likelihood p(petal length |C2) p(petal length |C3) 0.9 0.8 0.7 0.6 0.5 0.4 R23 = C2 is misclassi ed as C3 0.3 R32 = C3 is misclassi ed as C2 0.2 0.1 0 1 2 3 4 5 6 7 fi fi fi fi Where do we put the boundary? The misclassi cation error is de ned by Z Z p(error) = p(x|C3 )P (C3 )dx + p(x|C2 )P (C2 )dx R32 R23 which in our case is proportional to the data likelihood p(petal length |C2) p(petal length |C3) 0.9 0.8 0.7 0.6 This region would yield 0.5 p(C3 |x) > p(C2 |x) 0.4 but we’re still classifying 0.3 this region as C2! 0.2 0.1 0 1 2 3 4 5 6 7 fi fi The optimal decision boundary The minimal misclassi cation error at the point where p(C3 |x) = p(C2 |x) Optimal decision boundary ⇒ p(x|C3 )p(C3 )/p(x) = p(x|C2 )p(C2 )/p(x) ⇒ p(x|C3 ) = p(x|C2 ) p(C2 | petal length) p(C3 | petal length) 1 0.8 p(petal length |C2) p(petal length |C3) 0.6 0.4 Note: this assumes we have only two classes. 0.2 0 1 2 3 4 5 6 7 fi A different approach to classi cation Nearest neighbor classi cation on the iris dataset Nearby points are likely to be members of the same class. 2.5 Class 3 What if we used the points 2 themselves to classify? classify x in Ck if x is “similar” to 1.5 x 2 a point we already know is in Ck. x Class 2 1 Eg: unclassi ed point x is more 0.5 similar Class 2 than Class 1. Class 1 Issue: How to de ne “similar” ? 0 1 2 3 4 x1 5 6 7 Simplest is Euclidean distance: ! Potential advantages: d(x, y) = (xi − yi )2 i don’t need an explicit model the more examples the better Could de ne other metrics might handle more complex classes depending on application, e.g. easy to implement text documents, images, etc. “no brain on part of the designer” fi fi fi fi fi eB ≤ eKN N ≤ ei+1 B (1 − eB ) k−i + ek−i(1 − e )i+1 B B i i=0 Example: Handwritten digits For more on these bounds, see the book A Probabilistic Theory of Pattern Recognition, by L. Devroye, L. Gyorfi & G. Lugosi (1996). Use Euclidean distance to see which known digit isExample: closest USPSto each class.9 Digits Take 16x16 grayscale images (8bit) of handwritten digits. But not all neighbors Use Euclidean distance in raw pixel are theandsame: space (dumb!) 7-nn. Classification error (leave-one-out): 4.85%. example Example nearest neighbors 7 Nearest Neighbours example from Sam Roweis “k-nearest neighbors”: look at k-nearest neighbors and from LeCun etal, 1998 digit data available at: choose most frequent. http://yann.lecun.com/exdb/mnist/ Cautions: can get expensive to nd neighbors Digits are just represented as a vector. fi The problem of using templates (ie Euclidean distance) Which of these is more like the performance results of various classi ers example? A or B? (from http://yann.lecun.com/exdb/mnist/) error rate on example A B Classi er test data (%) linear 12 k=3 nearest neighbor from Simard etal, 1998 5 (Euclidean distance) 2-layer neural network 4.7 Euclidean distance only cares about (300 hidden units) nearest neighbor how many pixels overlap. 3.1 (Euclidean distance) Could try to de ne a distance metric k-nearest neighbor 1.1 that is insensitive to small deviations (improved distance metric) in position, scale, rotation, etc. convolutional neural net 0.95 Digit example: - 60,000 training images, best (the conv. net with elastic distortions) 0.4 - 10,000 test images - no “preprocessing” humans 0.2 - 2.5 fi fi fi Clustering: Classi cation without labels In many situations we don’t have labeled training data, only unlabeled data. Eg, in the iris data set, what if we were just starting and didn’t know any classes? 2.5 2 petal width (cm) 1.5 1 0.5 0 1 2 3 4 5 6 7 petal length (cm) fi Types of learning supervised unsupervised reinforcement reinforcement desired output {y1 ,... , yn } model output model model model {θ1 ,... , θn } {θ1 ,... , θn } {θ1 ,... , θn } world world world (or data) (or data) (or data) A real example: clustering electrical signals from neurons An application of PCA: Spike sorting amplifier filters software A/D analysis electrode An extracellular waveform with many differ oscilloscope 0 5 10 15 msec How do we sort the different spikes? Basic problem: only information is signal. The true classes are always unknown. ➡ Sorting with level detection on an oscilloscope Sorting withwith Sorting levellevel detection detection −0.5 −0.5 0 0 0.5 0.5 1 1 1.5 1.5 −0.5 0 0.5 1 1.5 msec msec msec Level detection doesn’t always work. Sorting withwith levellevel Sorting Whydetection detection level detection doesn’t work Why level detection doesn’t work −0.5 −0.5 0 0 0.5 0.5 1 1 1.5 1.5 −0.5 0 0.5 1 1.5 msec msec msec peakLevel amplitude: neuron detection 2 always work. doesn’t background peak amplitude: amplitude neuron 1 ➡ PrincipalPrincipal Component Analysis,Analysis, Apr 23, 2001 / Michael 2001 / S. Lewicki, CMU CMU ➡ ➡ ➡ ➡? 7 ➡ amplitude ➡ Component Apr 23, Michael S. Lewicki, ? 7 ➡ A B One dimension is not sufficient to separate the spikes. Using multiple features Idea: try more features max amplitude min amplitude −0.5 0 0.5 1 1.5 msec What other features could we use? Maximum vs minimum Maximum vs minimum 250 200 spike maximum (µV) 150 100 50 0 −200 −150 −100 −50 0 spike minimum (µV) This allows better discrimination than max alone, but is it optimal? Features using Principal Components (not covered) 300 200 2nd PC score 100 0 −100 −200 −100 0 100 200 300 1st PC score k-means clustering Idea: try to estimate k cluster centers by minimizing “distortion” De ne distortion as: N ! ! K D = rnk ! xn − µk !2 n=1 k=1 rnk = 1 if xn ∈ cluster k, 0 otherwise. rnk is 1 for the closest cluster mean to xn. Each point xn is the minimum distance from its closet center. How do we learn the cluster means? Need to derive a learning rule. fi Deriving a learning rule for the cluster means Our objective function is: Here, we can solve for the mean: N ! ! K ! rnk xn D= rnk ! xn − µk !2 µk = ! n n=1 k=1 n rnk Differentiate w.r.t. to the mean (the This is simply a weighted mean for parameter we want to estimate): each cluster. N AAACUXicbZBBaxNBFMdf1mpjojbqsZfBIMRDw26grQiBQD14KhFMUsnGZXYy2wyZmV1mZsWwzCfwA/Xoh9BLjz14bD17czYJtGl8MPDj/97jDb8440wb37+seA92Hj7arT6u1Z88fbbXeP5iqNNcETogKU/VWYw15UzSgWGG07NMUSxiTkfx/KTsj75SpVkqP5lFRicCn0uWMIKNi6LG5zBRmBRhhpVhmKP39pbDWBShyG00tzXURQcdFOpcRIXsBvbLKVKO5ha1yrFvNpLo4M7Gm1rUaPptf1loG4I1NHu977+K64tWP2pchdOU5IJKQzjWehz4mZkU5WcIp7YW5ppmmMzxOR07lFhQPSmWCix67ZIpSlLlnjRomd7dKLDQeiFiNymwmen7vTL8X2+cm+TtpGAyyw2VZHUoyTkyKSp9oilTlBi+cICJYu6viMywc2qc9Y0rsbDOSXDfwDYMO+3gqH340cl5B6uqwj68ghYEcAw9+AB9GACBH/AbbuBP5Wflrweetxr1Kuudl7BRXv0ftlm33g== X @D @µk = 2 rnk (xn µk ) Thus we have a simple estimation n=1 algorithm (k-means clustering) 1. initialize μk to K random points We know the optimum is when 2. assign clusters, i.e. rnk N 3. estimate (update) means ∂D ! =2 rnk (xn − µk ) = 0 4. loop from step 2 until converged ∂µk n=1 convergence (to a local minimum) is guaranteed k-means clustering example 300 Select 3 points at random for cluster means 200 2nd PC score 100 0 −100 −200 −100 0 100 200 300 1st PC score k-means clustering example 300 The update them using the estimate. 200 2nd PC score 100 0 −100 −200 −100 0 100 200 300 1st PC score k-means clustering example 300 And iterate... 200 2nd PC score 100 0 −100 −200 −100 0 100 200 300 1st PC score k-means clustering example 300 200 2nd PC score 100 0 −100 −200 −100 0 100 200 300 1st PC score k-means clustering example 300 Stop when converged, ie no change. 200 2nd PC score 100 0 −100 −200 −100 0 100 200 300 1st PC score An example of a local minimum 300 There can be multiple local minima. 200 2nd PC score 100 0 −100 −200 0 100 200 300 −100 1st PC score A probabilistic interpretation: Gaussian mixture models We’ve already seen a one-dimensional version This example has three classes: neuron 1, neuron 2, and background noise. Each can be modeled as a Gaussian Any given data point comes from just one Gaussian The whole set of data is modeled by a mixture of three Gaussians R58 How do Mwe model this? S Lewicki peak amplitude: neuron 2 background peak amplitude: amplitude neuron 1 amplitude A B Figure 4. The figure illustrates the distribution of amplitudes for the background activity and the The Gaussian mixture model density The likelihood of the data given a particular class ck is given by p(x|ck , µk , ⌃k ) x is the spike waveform, µk and ⌃k are the mean and covariance for class ck. The marginal likelihood is computed by summing over the likelihood of the K classes K X p(x|✓1:K) = p(x|ck , ✓k )p(ck ) k=1 ✓1:K defines the parameters for all of the classes, ✓1:K = {µ1, ⌃1,... , µK , ⌃K }. P p(ck ) is the probability of the kth class, with k p(ck ) = 1. What does this mean in this example? Bayesian classi cation with multivariate Gaussian mixtures How do we determine the class ck from the data x ? Again use Bayes’ rule p(x(n)|ck , ✓k )p(ck ) p(ck |x(n), ✓1:K) = pk,n =P (n) |c , ✓ )p(c ) k p(x k k k This tells is the probability that waveform x(n) came from class ck. fi Estimating the parameters: tting the model density to the data The objective of density estimation is to maximize the likelihood of the data the data If we assume the samples are independent, the data likelihood is just the product of the marginal likelihoods N Y p(x1:N|✓1:K) = p(xn|✓1:K) n=1 The class parameters are determined by optimization. Is far more practical to optimize the log-likelihood. One elegant approach to this is the EM algorithm. fi The Gaussian mixture EM stands for Expectation-Maximization, and involves two steps that are iterated. For the case of a Gaussian mixture model: P 1. E-step: Compute pn,k = p(ck |x , ✓1:K). Let pk = n pi,n (n) 2. M-step: Compute new mean, covariance, and class prior for each class: X µk pn,k x(n)/pk n X ⌃k pn,k (x(n) µk )(x(n) µk )T /pk n p(ck ) pk This is just the sample mean and covariance, weighted by the class conditional probabilities pn,k. Derived by solving setting log-likelihood gradient to zero (i.e. the maximum). 400 Four cluster solution with decision boundaries 300 200 2nd component score 100 0 −100 −200 −300 −400 −200 0 200 400 600 1st component score 400 But wait! Here’s a nine cluster solution 300 Uh oh... 200 2nd component score 100 How many clusters are there really? 0 −100 −200 −300 −400 0 −200 0 200 400 600 1st component score over tting How do we choose k? 10 9 Increasing k, will always decrease our 8 distortion. This will over t the data. 7 - Distortion 6 How can we avoid this? test set error - 5 Or how do we choose the best k? 4 One way: cross validation: 3 minimize over tting by stopping 2 training set error before the error on a test data set 1 1 2 3 4 5 6 7 8 9 10 starts to increase. k = number of clusters Use our distortion metric 300 k=10 clusters N ! ! K 200 D= rnk ! xn − µk !2 2nd PC score 100 n=1 k=1 0 on a test data set — one not used to t the parameters — stop when we −100 reach a minimum. −200 −100 0 100 200 300 1st PC score fi fi fi fi

Use Quizgecko on...
Browser
Browser