Unsupervised Learning Lecture Notes PDF
Summary
These lecture notes cover unsupervised learning techniques, focusing on clustering methods like K-means and hierarchical clustering, and their applications in areas such as market analysis and data visualization.
Full Transcript
Unsupervised learning
Some applications of what we have studied so far:
• Trading. Price prediction. Portfolio construction. Risk analysis.
• Classification techniques are natural to financial decision problems.
• Anomaly detection (probabilistic models).
• Feature selection. Importance of features.
• Credit rating (Random Forest). Default prediction.
• Sentiment analysis. Text analytics.
• Automation. Compliance.
• Forecasting. Regime. Factor selection.
• Sales recommendations. Market analytics. Expert models.
• Robo advisory. Analyst estimates.
• Loan. Insurance. Alternative data.
• Diversification study. Clustered strategies. Trend-following strategies.
• KNN – timing of risk premia strategies.
• SVM – trading vols.
Unsupervised Learning
• There is no Y in unsupervised learning.
• Exploratory: discover interesting things about the data, find an informative way to visualize it, discover subgroups.
• Understanding the underlying structure of the data: summarize and group. Data compression.
• Principal component analysis: a direct import from statistics, used without change, as a tool for data visualization or data pre-processing before supervised techniques are applied.
• Clustering: a broad class of methods for discovering unknown subgroups in data.
• Self-organizing maps.
• Unsupervised learning is more subjective (there is no simple goal for the analysis!), which is sometimes helpful too, e.g. for earnings reports or market commentaries.
• The good part: no labeling is needed.
The curse of dimensionality
• We are so used to living in three dimensions that our intuition fails us when we try to imagine a high-dimensional space. Even a basic 4D hypercube is incredibly hard to picture in our mind, let alone a 200-dimensional ellipsoid bent in a 1,000-dimensional space!
UNSUPERVISED LEARNING: CLUSTERING OVERVIEW
• PCA: principal component analysis performs a linear transformation on the data so that most of the variance, or information, in your high-dimensional dataset is captured by the first few principal components. The first principal component captures the most variance, followed by the second principal component, and so on.
• k-Means: partitions the data into k mutually exclusive clusters. How well a point fits into a cluster is determined by the distance from that point to the cluster's center. Best used when the number of clusters is known and for fast clustering of large data sets.
• PCA idea: find the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis. (A minimal PCA sketch in Python appears after the clustering notes below.)
Clustering
• An unsupervised learning problem.
• Given N unlabeled examples and a desired number of partitions K, group the examples into K "homogeneous" partitions.
• A good clustering achieves high within-cluster similarity and low inter-cluster similarity.
• Clustering only looks at similarities; no labels are given. The first thing to decide is how to define similarity: based on what?
• Flat or partitional clustering: the partitions are independent of each other.
• Hierarchical clustering: the partitions can be visualized using a tree structure (a dendrogram).
• You cannot measure the prediction error, so you cannot optimize the model parameters to minimize a prediction error.
• The model does not have a teacher: it must learn by itself.
• K-means: put N data points into K groups. Pick K centroids, assign each data point to the closest centroid, then update the centroids.
• Hierarchical clustering is like tree clustering.
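To make the PCA description above concrete, here is a minimal sketch in Python using scikit-learn. The synthetic data, the choice of two components, and the standardization step are illustrative assumptions, not something prescribed in the notes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Illustrative data: 500 observations of 10 hypothetical features
# (e.g. daily returns of 10 assets). Replace with your own dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))

# Standardize first: PCA is sensitive to the scale of the features.
X_std = StandardScaler().fit_transform(X)

# Keep the first two principal components, e.g. for visualization or
# as a pre-processing step before a supervised technique.
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)

# Fraction of total variance captured by each component: the first
# component captures the most, the second the next most, and so on.
print(pca.explained_variance_ratio_)
```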
• We seek a partition of the data into distinct groups so that the observations within each group are quite similar to each other.
• To make this concrete, we must define what it means for two or more observations to be similar or different. This is often a domain-specific consideration that must be made based on knowledge of the data being studied.
• Applications: market segmentation, crowding, volatility clustering.
K-means
• The idea behind K-means clustering is that we want to identify homogeneous sub-groups, or clusters, in which observations are very close to each other.
• Simple to implement.
• "Very close" means that we want the intra-cluster variation, or within-cluster variation, to be as small as possible.
• We need to define a measure W(Ck) of the variation among the observations belonging to cluster Ck.
• Can we express K-means clustering as an optimization problem? Yes: the problem finds the K clusters that collectively minimize the within-cluster variation (see the formula spelled out after this list).
• K-means in outline: randomly initialize the clusters, calculate the centroids, assign each point to the centroid at minimum distance, recalculate the centroids, and repeat.
• The most common within-cluster variation function W(Ck) is based on the Euclidean norm: the sum of squared Euclidean distances between all pairs of observations in cluster k, scaled by nk, the number of observations located in cluster k.
• Solving this exactly is an extremely large combinatorial problem.
• We need an algorithm that decreases the objective monotonically at each step, so as to reach a local minimum. The centers of the clusters are found by iteration.
K-means clustering (KMC)
• We have many individual data points, each represented by a vector. Each entry in the vector represents a feature. But these data points are not labelled or classified.
• Our goal is to group these data points in a sensible way. Each group will be associated with its center of mass. But how many groups are there, and where are their centers of mass?
• We group together unlabeled data points according to similarities in their features. The features must have meaningful numerical values.
• Examples: classification of customers according to their purchase history, where each feature might be expenditure on a different type of goods; optimal placement of car parks in a city; grouping similar companies based on their features.
• KMC is a very simple method for assigning individual data points to a collection of groups or clusters.
• Which cluster each data point is assigned to is governed simply by its distance from the centers of the clusters.
• Since it is the machine that decides on the groups, this is an example of unsupervised learning.
• KMC is a very popular clustering technique. It is highly intuitive and visual, and extremely easy to program.
• The technique can be used in a variety of situations, even for dividing up data artificially when there is no obvious grouping.
• We have a dataset of N individuals, each having an associated M-dimensional vector representing M features.
• Each entry in the vectors represents a different numerical quantity. For example, each vector could represent an individual household, with the first element being income, the second the number of cars, and so on, with the Mth being the number of pets.
• We pick a number for K, say three, so we will have three clusters.
• Each of these three clusters will have a center, a point in the M-dimensional feature space.
• Each of the N individual data points is associated with the center that is closest to it. (Like houses and postboxes: the postbox is the center of the cluster, and each house is associated with one postbox.)
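The optimization problem referred to above can be written out explicitly. The notes only reference W(Ck) and nk, so the precise notation below is an assumption; it is the standard squared-Euclidean formulation.

```latex
% Standard K-means formulation (notation assumed; the notes only
% mention W(C_k) and n_k):
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} W(C_k),
\qquad
W(C_k) \;=\; \frac{1}{n_k} \sum_{i,\,i' \in C_k} \; \sum_{j=1}^{M} \bigl(x_{ij} - x_{i'j}\bigr)^2 .
```

Here nk is the number of observations in cluster Ck, M is the number of features, and xij is the j-th feature of observation i. Minimizing the sum of the W(Ck) over all possible assignments of the N points to K clusters is the large combinatorial problem mentioned above.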
• The goal of the method is to find the best positions for the centers of these clusters. (So you could use KMC to tell you where it is best to put the postboxes.)
• Mathematically, the idea is to minimize the intra-cluster (or within-cluster) variation. The intra-cluster variation is just a measure of how far each individual point is from its nearest center. (How far is the house from the nearest postbox?)
• The algorithm is really simple. It involves first guessing the centers and then iterating until convergence. Typically you might then choose a different K and see what effect that has on the distances.
The Algorithm
• Step 0: Scaling. We first scale our data, since we are going to be measuring distances.
• Step 1: Pick some centers. We need to seed the algorithm with centers for the K clusters. Either pick K of the N vectors to start with, or just generate K random vectors.
• Step 2: Find the distance of each data point to the centers. For each data point, measure its distance from the center of each of the K clusters. The measure might be problem dependent, but often we use the obvious Euclidean distance. Each data point, that is each n, is then associated with the nearest cluster/center.
• This is easily visualized as follows. Suppose K is two; there are thus two clusters and two corresponding centers. Call them the red cluster and the blue cluster.
• We take the first data point and measure its distance from each of the two centers. That's two distances. Say the smaller of these turns out to be the distance to the blue center; then we paint our first data point blue. Repeat for all of the other data points, so each data point gets colored.
• Step 3: Find the K centroids. Take all of the data points associated with the first center and calculate the centroid, their center of mass. In the colorized version, just find the centroid of all the red dots. Do the same with all the blue dots, i.e. find the K centroids. These become the cluster centers for the next iteration.
• Go back to Step 2 and repeat until convergence.
Scree plot
• Adding up all the squared distances to the nearest cluster gives us a measure of total distance, or an error. This error is a decreasing function of the number of clusters K. In the extreme case K = N you have one cluster for each data point and the error is zero.
• If you plot this error against K you get a scree plot.
• It will be one of two types of plot. If you get a plot with an elbow, where the error falls dramatically and then levels off, then you probably have data that falls into nice groupings (the triangles in the lecture figure). The number of clusters is then obvious; it is three in that plot.
• If the error only falls gradually, then there is no obvious best K using this methodology.
• One would usually repeat several times with other initial centers.
Example: Volatility
• A simple one-dimensional example.
• Using the S&P 500 index historical time series, calculate a rolling 30-day volatility.
• This KMC analysis completely throws away any time dependence in the behavior of volatility. But there are models used in finance in which volatility jumps from one level to another, from regime to regime.
• Often the volatility is pretty low, and occasionally it is very large. In the jump-volatility model, volatility moves between given levels. That is not quite what we have here: here there is a continuum of volatility levels. We are going to pretend that there are only three clusters, K = 3. (A code sketch of this analysis follows; the results are shown after it.)
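Before the results, here is a rough sketch of how such an analysis might be run in Python with pandas and scikit-learn. The file name, the column names, and the annualization factor are assumptions made for illustration; they are not specified in the notes.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Illustrative: load S&P 500 index levels. The file 'spx.csv' with
# 'Date' and 'Close' columns is an assumption; use your own source.
px = pd.read_csv("spx.csv", parse_dates=["Date"], index_col="Date")["Close"]
rets = np.log(px).diff().dropna()

# Rolling 30-day volatility, annualized (the annualization is a
# presentational choice, not part of the notes).
vol = rets.rolling(30).std().dropna() * np.sqrt(252)
X = vol.to_numpy().reshape(-1, 1)          # one feature: the vol level

# Scree / elbow check: total within-cluster error as a function of K.
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)

# Pretend there are three volatility regimes, K = 3.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = pd.Series(km.labels_, index=vol.index)

# Cluster sizes and mean volatility per cluster (compare with the
# table of results below).
print(labels.value_counts().sort_index())
print(vol.groupby(labels).mean())

# Empirical 30-day transition matrix between clusters: compare each
# label with the label 30 trading days later and count transitions.
cur, nxt = labels.iloc[:-30].to_numpy(), labels.iloc[30:].to_numpy()
counts = pd.crosstab(cur, nxt)
print(counts.div(counts.sum(axis=1), axis=0).round(2))
```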
Results of the clustering:

                     Cluster 1   Cluster 2   Cluster 3
  Number in cluster        586         246          24
  SPX volatility          9.5%       18.8%       44.3%

We can modify this to compute a matrix of transition probabilities (rows: from; columns: to):

  From \ To          Cluster 1   Cluster 2   Cluster 3
  Cluster 1                84%         16%          0%
  Cluster 2                38%         57%          5%
  Cluster 3                 0%         54%         46%

We interpret this as, for example: the probability of jumping from Cluster 1 to Cluster 2 is 16% every 30 days.
• The algorithm converges locally, so you cannot be sure that the clustering you get on any single run is close to the global optimum.
• For best results, run the K-means algorithm several times starting from various initial random clusters: run the algorithm for each choice of initial clusters, compare the results you get for the various runs, and try the algorithm on various subsets of your training data (maybe on all possible subsets) to see how stable your results are.
Selecting the number of clusters K
• This is a crucial but difficult task, as you cannot rely on cross-validation; you need to experiment.
• Try various values for K and analyze the results. Different values of K may result in very different clusters.
• For example, consider a group of people, men and women, whose mother tongue is either English, Chinese, French or Spanish. Gender provides a sensible 2-means clustering; mother tongue provides a sensible 4-means clustering. But the two clusterings are very different! Now, what if we try to find a 3-means clustering?
• Here as well, experiment with various subsets of your training data to see how stable your clusters are. If your clusters are unstable, you may have the wrong K!
• It is almost always a good idea to standardize the features before computing the distances.
Hierarchical Clustering
• The difficulty with K-means clustering is determining what K is.
• Hierarchical clustering is a popular alternative to K-means clustering that does not require choosing the number of clusters.
• Another attractive feature of hierarchical clustering is that it creates a nice tree representation, called a dendrogram.
• The most common type of hierarchical clustering is agglomerative, or bottom-up, clustering.
• In agglomerative hierarchical clustering, each observation starts as a leaf in a tree-like structure.
• As we move up, leaves start to fuse to create branches. As we keep going up, branches fuse with other branches or with leaves to create new clusters, until all the leaves have fused together into a single cluster: the trunk.
• A dendrogram shows many different possible clusterings, from a single cluster all the way to n clusters!
• The algorithm: start with each point in its own cluster; identify the closest two clusters and merge them; repeat; stop when all points are in a single cluster. (A minimal Python sketch follows this list.)
• To understand agglomerative dendrograms, we need to start from the leaves and slowly make our way up.
• In particular, we cannot draw conclusions about the similarity of two observations by taking a horizontal cut of the dendrogram.
• Given n observations, there are n-1 fusion points, and at each fusion point we could draw the two fusing branches, say i and j, in two different ways: i on the left and j on the right, or j on the left and i on the right. So there are a total of 2^(n-1) possible representations of the same tree.
• Horizontal position does not matter! We need to look vertically, from bottom to top, at where observations fuse together to become part of the same branch.
• Any cluster created at a lower level is automatically nested inside any upper cluster.
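As a minimal sketch of agglomerative clustering and dendrogram cutting in Python using scipy: the data here is random stand-in data, and the choice of complete linkage with Euclidean distance is an assumption made for illustration (the dissimilarity and linkage choices are discussed below).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Illustrative data: 9 observations with 2 features, standing in for
# the small dendrogram example discussed below (values are made up).
rng = np.random.default_rng(1)
X = rng.normal(size=(9, 2))

# Agglomerative clustering: start with each point in its own cluster
# and repeatedly merge the two closest clusters. Complete linkage and
# Euclidean distance are just one possible choice.
Z = linkage(X, method="complete", metric="euclidean")

# Draw the dendrogram, with leaves labelled 1..9.
dendrogram(Z, labels=np.arange(1, 10))
plt.show()

# Cutting the tree at a given height returns flat cluster labels;
# the lower the cut, the more clusters we get.
print(fcluster(Z, t=2.0, criterion="distance"))
print(fcluster(Z, t=1.6, criterion="distance"))
```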
Reading the example dendrogram from the lecture figure:
• Observations 1 and 6 are similar to each other.
• However, observations 9 and 2 are not similar to each other. More precisely, observation 2 is not more similar to 9 than observations 8, 5 and 7 are: 9 fuses with all of these observations at the same level.
• Observations 1 and 6, and 5 and 7, fuse together first, creating 2 clusters. Next, observations 5 & 7 fuse with observation 8 to create a new branch, and observations 1 & 6 fuse with observation 4 to create a new branch.
• To identify the clusters, we make a horizontal cut across the dendrogram. The sets of observations below the cut are identified as clusters.
• Cutting at a height of 2 (blue) gives us two clusters: cluster 3-4-1-6 and cluster 9-2-8-5-7.
• Cutting at a height of 1.6 (red) gives us three clusters: cluster 3-4-1-6, cluster 2-8-5-7, and leaf 9.
• The height of the cut controls the number of clusters: the lower we cut, the less distance we are willing to accept, and the more clusters we get.
• The optimal number of clusters is generally determined visually by looking at the dendrogram, or through a comparison between clusters and data to find an interpretable pattern suggesting why observations have been clustered together. Often the choice of where to cut, and how many clusters to use, is not obvious.
Hierarchical clustering algorithm
Dissimilarity Measure
• The two most common dissimilarity measures are: Euclidean distance, which we compute and use just as we did with K-means clustering; and correlation-based distance, under which two observations are similar if their features are highly correlated.
• The choice of dissimilarity measure is extremely important. It is at the heart of the hierarchical clustering algorithm, and different dissimilarity measures are likely to generate very different dendrograms.
Linkage
• Now that we know how to compute the dissimilarity between observations, the next question is how to compute the dissimilarity between clusters.
• The notion of linkage extends the idea of dissimilarity to groups of observations.
• There are four commonly used types of linkage: complete, average, single and centroid.
• Statisticians generally prefer average and complete linkage because they tend to generate more balanced dendrograms than single linkage.
• Centroid linkage is generally avoided because it might create an inversion. An inversion occurs when two clusters are fused at a height below the level at which either of the individual clusters was formed; that is, centroid clustering is not monotonic.
Key questions
1. Should the observations be standardized or rescaled?
2. What dissimilarity measure should be used?
3. What linkage should be used?
4. Where should we cut the dendrogram, and how many clusters will we obtain?
5. Do we understand why the algorithm has produced this particular clustering?
6. Have we checked and "validated" the dendrogram before using it for prediction and decision?
K-means vs hierarchical clustering
• K-means clustering works well with big data compared to hierarchical clustering, because the time complexity of K-means is linear, i.e. O(n), while that of hierarchical clustering is quadratic, i.e. O(n^2).
• In K-means clustering, since we start with a random choice of clusters, the results produced by running the algorithm multiple times might differ, whereas the results of hierarchical clustering are reproducible.
• K-means is found to work well when the shape of the clusters is hyper-spherical (like a circle in 2D or a sphere in 3D).
• K-means clustering requires prior knowledge of K, i.e. the number of clusters you want to divide your data into.
• In hierarchical clustering, by contrast, you can stop at whatever number of clusters you find appropriate by interpreting the dendrogram.
Examples
• Uncovering the structure of data: we may take market returns and try to identify the main drivers of the market. For instance, a successful model may find that, at one point in time, the market is driven by the momentum factor, energy prices, the level of the USD, and a new factor that may be related to liquidity.
• Crowded strategies; similarity of hedge fund strategies.
• Unsupervised learning is important for understanding the variation and grouping structure of a set of unlabeled data, and can be a useful preprocessor for supervised learning (e.g. PCA).
• It is intrinsically more difficult than supervised learning because there is no gold standard (like an outcome variable) and no single objective (like test-set accuracy).
• It is an active field of research, with many recently developed tools.