Optimization for Machine Learning, Lecture 12: Covariance Estimation & PCA
Politecnico di Torino
Giuseppe Calafiore
Summary
Lecture 12 on Covariance Estimation and PCA, from a course on Optimization for Machine Learning at the Politecnico di Torino. The lecture discusses fundamental concepts and theorems related to covariance matrices, eigenvalue decomposition, and principal component analysis.
Full Transcript
LECTURE 12: Covariance Estimation & PCA

Motivations and goals
Covariance matrices are ubiquitous in data science:
- Exploratory data analysis
- Risk analysis
- Financial risk assessment
- Outlier detection
We consider $N$ data points in dimension $d$. In practice, the number of data points is often smaller than the number of dimensions. We examine two covariance estimation methods: a naive one (the sample estimate) and a classical one (the factor model).

Preliminaries: eigenvalue decomposition for symmetric matrices
Theorem 1 (EVD of symmetric matrices). Any symmetric $p \times p$ matrix $S$ can be decomposed as
$$S = U \Lambda U^\top = \sum_{i=1}^{p} \lambda_i u_i u_i^\top,$$
where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$, with $\lambda_1 \ge \dots \ge \lambda_p$ the eigenvalues, and $U = [u_1, \dots, u_p]$ is a $p \times p$ orthogonal matrix ($U^\top U = I_p$) whose columns are the eigenvectors $u_i$ of $S$, that is, $S u_i = \lambda_i u_i$, $i = 1, \dots, p$.

Rayleigh variational representation
Corollary 1 (Rayleigh variational representation). If $S$ is symmetric $p \times p$, then
$$\lambda_{\max} = \lambda_1(S) = \max_{w : \|w\|_2 = 1} w^\top S w, \qquad \lambda_{\min} = \lambda_p(S) = \min_{w : \|w\|_2 = 1} w^\top S w.$$
Moreover, the optima are attained at $w_{\max} = u_1$ and $w_{\min} = u_p$, respectively.

Positive semidefinite (PSD) matrices
A (square) symmetric matrix $S$ is said to be positive semidefinite (PSD) if $x^\top S x \ge 0$ for all $x$. In this case we write $S \succeq 0$. From the EVD theorem, for any symmetric matrix $S$: $S \succeq 0$ if and only if every eigenvalue of $S$ is non-negative. Hence positive semidefiniteness can be checked numerically via the EVD.

Preliminaries: Singular Value Decomposition (SVD)
Theorem 2 (SVD of general matrices). Any non-zero $p \times m$ matrix $A$ can be decomposed as
$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^\top = U \Sigma V^\top, \qquad \Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_r, 0, \dots, 0) \in \mathbb{R}^{p,m},$$
where $\sigma_1 \ge \dots \ge \sigma_r > 0$ are the singular values, and $U = [u_1, \dots, u_p]$, $V = [v_1, \dots, v_m]$ are square orthogonal matrices ($U^\top U = I_p$, $V^\top V = I_m$). The number $r \le \min(p, m)$ of non-zero singular values is called the rank of $A$. The first $r$ columns of $U$ and $V$ contain the left and right singular vectors of $A$, respectively, that is, $A v_i = \sigma_i u_i$, $A^\top u_i = \sigma_i v_i$, $i = 1, \dots, r$.

Links between EVD and SVD
The SVD of a $p \times m$ matrix $A$ is related to the EVD of a symmetric PSD matrix built from $A$. If $A = U \Sigma V^\top$ is the SVD of $A$, then the EVD of $A A^\top$ is $U (\Sigma \Sigma^\top) U^\top$, and the EVD of $A^\top A$ is $V (\Sigma^\top \Sigma) V^\top$. Hence the left (resp. right) singular vectors of $A$ are the eigenvectors of the PSD matrix $A A^\top$ (resp. $A^\top A$).

The Data Covariance Matrix

The sample covariance matrix: motivation
We define the variance of a collection of numbers $z_1, \dots, z_m$ as
$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (z_i - \hat{z})^2,$$
where $\hat{z} = \frac{1}{m}(z_1 + \dots + z_m)$ is the average of the $z_i$'s. How can we extend this notion to higher dimensions (with the $z_i$'s as vectors), and why would we want to do that?

The sample covariance matrix: definition
Given a $d \times N$ data matrix $A = [a^{(1)}, \dots, a^{(N)}]$, the sample covariance matrix is defined as the symmetric $d \times d$ matrix
$$S = \frac{1}{N} \sum_{i=1}^{N} (a^{(i)} - \hat{a})(a^{(i)} - \hat{a})^\top, \qquad \hat{a} = \frac{1}{N} \sum_{i=1}^{N} a^{(i)}.$$
We can express $S$ as $S = \frac{1}{N} A_c A_c^\top$, where $A_c$ is the centered data matrix $A_c = [a^{(1)} - \hat{a}, \dots, a^{(N)} - \hat{a}]$.
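As a quick illustration (a minimal numpy sketch added here, not part of the original slides; data and variable names are arbitrary), the sample covariance can be built from the centered data matrix, checked against the outer-product definition, and verified to be PSD via its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 100                          # dimension and number of data points
A = rng.normal(size=(d, N))            # data matrix, one column per data point

a_hat = A.mean(axis=1, keepdims=True)  # sample mean, d x 1
A_c = A - a_hat                        # centered data matrix
S = (A_c @ A_c.T) / N                  # sample covariance, d x d

# Same matrix from the outer-product definition S = (1/N) sum_i (a_i - a_hat)(a_i - a_hat)^T
S_check = sum(np.outer(A_c[:, i], A_c[:, i]) for i in range(N)) / N
print(np.allclose(S, S_check))

# S is symmetric PSD: its eigenvalues are non-negative (up to round-off)
print(np.linalg.eigvalsh(S).min() >= -1e-12)
```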
Directional scores and variance
Given a direction $v \in \mathbb{R}^d$, $\|v\|_2 = 1$, the score of a datum $x \in \mathbb{R}^d$ along direction $v$ is simply the component of $x$ along $v$, i.e., the coefficient of the projection of $x$ onto $v$, given by $v^\top x$. Given $N$ data points $a^{(1)}, \dots, a^{(N)}$, we can compute the scores (projections) of the centered data points $\tilde{a}^{(i)} = a^{(i)} - \hat{a}$, $i = 1, \dots, N$, along direction $v$:
$$s_i = v^\top (a^{(i)} - \hat{a}), \quad i = 1, \dots, N.$$
This collection of numbers $s_1, \dots, s_N$ gives a picture of the behavior of the data along direction $v$. In particular, the variance of the data along direction $v$ is
$$\sigma_v^2 = \frac{1}{N} \sum_{i=1}^{N} s_i^2 = \frac{1}{N} \sum_{i=1}^{N} v^\top (a^{(i)} - \hat{a})(a^{(i)} - \hat{a})^\top v = v^\top \left( \frac{1}{N} \sum_{i=1}^{N} (a^{(i)} - \hat{a})(a^{(i)} - \hat{a})^\top \right) v = v^\top S v.$$

The sample covariance matrix: link with directional variance
The (sample) variance along direction $v$ is the variance of the projection of the data along $v$:
$$\sigma_v^2 = \mathrm{var}_v(A) = \frac{1}{N} \sum_{i=1}^{N} \big[ v^\top (a^{(i)} - \hat{a}) \big]^2 = v^\top S v = \frac{1}{N} \| A_c^\top v \|_2^2,$$
where $A_c$ is the centered data matrix. Hence:
- The covariance matrix gives information about the variance along any direction $v$, via the quadratic function $v \mapsto v^\top S v$;
- The covariance matrix is always symmetric ($S = S^\top$);
- It is also positive semidefinite (PSD), since $v^\top S v = \mathrm{var}_v(A) \ge 0$ for all $v$.

Application: portfolio risk
Data: consider $n$ assets with returns $r_i$, $i = 1, \dots, n$, over one period (e.g., a day); in general these are not known in advance.
Portfolio: described by a vector $x \in \mathbb{R}^n$, with $x_i \ge 0$ the proportion of total wealth invested in asset $i$.
Portfolio return: $r^\top x$; in general not known.
Expected return: the mean value of the portfolio return, $\mathbb{E}\{r^\top x\} = \hat{r}^\top x$, with $\hat{r} = (\hat{r}_1, \dots, \hat{r}_n)$ the vector of mean returns.
Portfolio risk: assuming the return vector $r$ is random, with mean $\hat{r}$ and covariance matrix $S$, the variance of the portfolio is $\sigma^2(x) = \mathbb{E}\{(r^\top x - \hat{r}^\top x)^2\} = x^\top S x$.

The sample covariance matrix: total variance
For a given sample covariance matrix $S$, we define the total variance as the sum of the variances along the unit vectors $e_i = (0, \dots, 1, \dots, 0)$ (with 1 in the $i$-th position and 0 elsewhere). The total variance is
$$\sum_{i=1}^{d} \mathrm{var}_{e_i}(A) = \sum_{i=1}^{d} e_i^\top S e_i = \sum_{i=1}^{d} S_{ii} = \mathrm{trace}\, S,$$
where trace denotes the sum of the diagonal elements of its matrix argument.

Estimating the covariance matrix of data
Assume a generative model for the data: $x = \bar{x} + \varepsilon$, where $\varepsilon$ is a Normal random vector with zero mean and (unknown) covariance matrix $\Sigma$. Then the sample data covariance matrix $S$ is just an estimate of the true, unknown data covariance matrix $\Sigma$ (it can be proved that it is the maximum likelihood estimate). We will discuss later whether $S$ is an appropriate proxy for $\Sigma$. Note that in practice the number of data points $N$ is often similar to, or even smaller than, the number of dimensions $d$.

What is wrong with the sample covariance?
Assume we draw random data with zero mean and true covariance $\Sigma = I_d$, and look at the eigenvalues of the sample estimate $S$ when both $d$ and $N$ are large. A small simulation of this experiment is sketched below; the figure that follows compares the empirical eigenvalue histogram with the theoretical prediction.
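The following minimal numpy sketch (an added illustration, not from the slides) reproduces the experiment; the predicted support edges $(1 \pm \sqrt{d/N})^2$ come from the Marchenko-Pastur law referenced in the figure below:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 200, 500                      # dimension comparable to the sample size
X = rng.normal(size=(d, N))          # true covariance is the identity I_d

S = (X @ X.T) / N                    # sample covariance (data already has zero mean)
eigvals = np.linalg.eigvalsh(S)

gamma = d / N                        # aspect ratio d/N
print("sample eigenvalue range:", eigvals.min(), eigvals.max())
print("Marchenko-Pastur support:", (1 - gamma**0.5) ** 2, (1 + gamma**0.5) ** 2)
```

Even though every true eigenvalue is 1, the sample eigenvalues spread over roughly $[0.13,\, 2.7]$ for this aspect ratio.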
[Figure: histogram of the eigenvalues of the empirical covariance of the data (dimension 200, 500 samples), compared with the density of the Marchenko-Pastur law.]

The eigenvalues should all be close to 1! This becomes true only when $d$ is fixed and the number of samples $N \to +\infty$. The red curve shows the theoretical result from random matrix theory, which holds in the "large $d$, large $N$" regime.

Estimation problem
In practice, the sample estimate might not work well in high dimensions, so we need to look for better estimates. Problem: given data points $x^{(1)}, \dots, x^{(N)} \in \mathbb{R}^d$, find an estimate $\hat{\Sigma}$ of the covariance matrix. Many methods start with the sample estimate $S$, and then remove "noise" from it.

Detuning the sample covariance
Assume a data generative model of the form $x = Lz + \sigma e$, where:
- $x$ is the observation (the data points);
- $e$ is a noise vector (assume $\mathbb{E}\{e\} = 0$, $\mathbb{E}\{ee^\top\} = I$);
- $z$ contains $k$ "factors" (assume $\mathbb{E}\{z\} = 0$, $\mathbb{E}\{zz^\top\} = I$);
- $L$ is a $d \times k$ loading matrix (usually $k \ll d$).
Under this model, the covariance of $x$ is a rank-$k$ term plus a scaled identity, $\Sigma = LL^\top + \sigma^2 I$.

Fitting factor models
Given the data sample covariance matrix $S \succeq 0$, and assuming $\mathrm{rank}\, S > k$, we can find the closest factor model that explains that covariance by solving for $L$ and $\alpha$:
$$\min_{\alpha \ge 0,\, L} \; \| S - \alpha I - LL^\top \|_F^2.$$
Solution: via the EVD of $S = U \Lambda U^\top$, with $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$, $\lambda_1 \ge \dots \ge \lambda_d \ge 0$. Noting that $LL^\top$ has rank $k$, the problem is related to finding a rank-$k$ matrix $Q(\alpha)$ that approximates
$$S(\alpha) = S - \alpha I = U(\Lambda - \alpha I)U^\top = \sum_{i=1}^{d} (\lambda_i - \alpha) u_i u_i^\top,$$
which, by the Eckart-Young-Mirsky theorem, is
$$Q(\alpha) = \sum_{i=1}^{k} (\lambda_i - \alpha) u_i u_i^\top,$$
and $Q(\alpha) \succeq 0$ if and only if $\alpha \le \lambda_k$.

The problem then reduces to
$$\min_{\lambda_k \ge \alpha \ge 0} \; \| S(\alpha) - Q(\alpha) \|_F^2 = \Big\| \sum_{i=k+1}^{d} (\lambda_i - \alpha) u_i u_i^\top \Big\|_F^2 = \sum_{i=k+1}^{d} (\lambda_i - \alpha)^2.$$
Setting the derivative with respect to $\alpha$ to zero gives
$$\sum_{i=k+1}^{d} \lambda_i = (d - k)\alpha \;\Rightarrow\; \alpha^* = \frac{\sum_{i=k+1}^{d} \lambda_i}{d - k},$$
where we observe that indeed $\alpha^* \ge 0$ and $\alpha^* \le \lambda_k$. The optimal $Q$ is therefore
$$Q^* = Q(\alpha^*) = \sum_{i=1}^{k} (\lambda_i - \alpha^*) u_i u_i^\top.$$
The optimal factor $L$, such that $LL^\top = Q^*$, is then
$$L = \big[ \sqrt{\lambda_1 - \alpha^*}\, u_1 \;\cdots\; \sqrt{\lambda_k - \alpha^*}\, u_k \big].$$

Scaled version
In practice, we may assume that each random variable has its own noise variance, i.e., the model is $x = Lz + \sigma e$ with $\mathbb{E}\{ee^\top\} = D$, with diagonal $D \succeq 0$. Modified problem:
$$\min_{D,\, L} \; \| S - D - LL^\top \|_F^2 \;:\; D \text{ diagonal},\; D \succeq 0.$$
This time there is no easy solution; one can alternate optimization over $D$ (easy) and $L$ (EVD), which results in a local optimum. A sketch of the simpler isotropic fit is given below.
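To make the recipe above concrete, here is a small numpy sketch (added for illustration; `fit_factor_model` is a hypothetical helper, not part of the lecture material) that fits the isotropic factor model $S \approx \alpha^* I + LL^\top$ via the EVD of $S$:

```python
import numpy as np

def fit_factor_model(S, k):
    """Fit S ~ alpha*I + L L^T with rank-k L, via the EVD of S.

    alpha is the mean of the d-k smallest eigenvalues, and L is built
    from the top-k eigenpairs, following the closed-form solution above.
    """
    lam, U = np.linalg.eigh(S)             # ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]         # reorder in descending order
    alpha = lam[k:].mean()                 # alpha* = (sum_{i>k} lambda_i) / (d - k)
    L = U[:, :k] * np.sqrt(np.maximum(lam[:k] - alpha, 0.0))
    return alpha, L

# Example: data generated from a true factor model x = L z + sigma * e
rng = np.random.default_rng(2)
d, k, N, sigma = 20, 3, 1000, 0.5
L_true = rng.normal(size=(d, k))
X = L_true @ rng.normal(size=(k, N)) + sigma * rng.normal(size=(d, N))
S = (X @ X.T) / N                          # sample covariance (zero-mean data)

alpha, L = fit_factor_model(S, k)
print("estimated noise variance:", alpha, "(true sigma^2 =", sigma**2, ")")
print("residual:", np.linalg.norm(S - alpha * np.eye(d) - L @ L.T, "fro"))
```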
Principal Component Analysis (PCA)

Overview
Principal Component Analysis (PCA) originated in psychometrics in the 1930s. It is now widely used in:
- Exploratory data analysis
- Simulation
- Visualization
Application fields include: finance, marketing, economics; biology, medicine; engineering design, signal compression and image processing; search engines, data mining.

Solution principle
PCA finds "principal components" (PCs), i.e., orthogonal directions of maximal variance.
- PCs are computed via the EVD of the data covariance matrix.
- Alternatively, PCs can be found via the SVD of the (centered) data matrix.
- PCA can be interpreted as a "factor model" of the original data matrix.
Data matrix: $A \in \mathbb{R}^{d,N}$, $A = [a^{(1)}, \dots, a^{(N)}]$; centered data matrix: $A_c = A - \hat{a}\mathbf{1}_N^\top$, with $\hat{a} = \frac{1}{N}\sum_i a^{(i)}$; data covariance matrix: $S = \frac{1}{N} A_c A_c^\top$.

Principal component
The directional variance along direction $v$ is $\sigma_v^2 = v^\top S v$, where $S$ is the sample covariance matrix of the data. The principal direction is the direction of maximal variance:
$$v_{\max} = \arg\max_{\|v\|_2 = 1} v^\top S v.$$
Assume the EVD of $S$ is given: $S = \sum_{i=1}^{d} \lambda_i u_i u_i^\top$, with $\lambda_1 \ge \dots \ge \lambda_d \ge 0$, and $U = [u_1, \dots, u_d]$ orthogonal ($U^\top U = I$). Then a solution is $v^* = u_1$, an eigenvector of $S$ corresponding to its largest eigenvalue $\lambda_1$.

Since $S = \frac{1}{N} A_c A_c^\top$, if $A_c = U\Sigma V^\top$ is the SVD of $A_c$, then
$$S = \frac{1}{N} A_c A_c^\top = \frac{1}{N} U\Sigma V^\top V \Sigma^\top U^\top = \frac{1}{N} U (\Sigma\Sigma^\top) U^\top$$
is the EVD of $S$. Hence the principal direction $u_1$ of $S$ can also be found directly via the SVD of the (centered) data matrix $A_c$.

Finding orthogonal directions: deflation
Once we have found a direction with high variance, can we repeat the process and find other ones? Deflation method: project the data points onto the subspace orthogonal to the direction already found, then find a direction of maximal variance for the projected data. The process stops after $d$ steps ($d$ is the dimension of the whole space), but can be stopped earlier, to find only $k$ directions with $k < d$. A solution of
$$\max_{\|v\|_2 = 1,\; v^\top u_1 = 0} v^\top S v$$
is $u_2$, an eigenvector corresponding to the second-largest eigenvalue. After $k$ steps of the deflation process, the directions returned are $u_1, \dots, u_k$. Thus we can compute $k$ directions of largest variance with a single eigenvalue decomposition of the covariance matrix.

Geometry of deflation
Deflation consists in projecting the data onto a subspace orthogonal to the principal subspace found so far, and computing the next principal direction there. If $U_k \in \mathbb{R}^{d,k}$ contains the first $k$ principal directions, and $\Lambda_k$ the corresponding $k$ largest eigenvalues, we consider the deflated covariance matrix
$$S_{k+1} = S - U_k \Lambda_k U_k^\top.$$
Use power iteration on $S_{k+1}$ to find the $(k+1)$-th principal direction $u_{k+1}$, let $U_{k+1} = [U_k \; u_{k+1}]$, and iterate.

Algorithm: Power Iteration (PI)
Given a matrix $M \in \mathbb{R}^{d,N}$, the PI algorithm returns the first input/output principal directions of $M$, namely vectors $u \in \mathbb{R}^d$, $v \in \mathbb{R}^N$ corresponding to the first columns of the $U$, $V$ factors of the SVD of $M$. The PI algorithm starts from random initial vectors $u$, $v$ with unit norm, and repeats:
$$u \leftarrow \frac{Mv}{\|Mv\|_2}, \qquad v \leftarrow \frac{M^\top u}{\|M^\top u\|_2}.$$
This converges (for arbitrary initial $u$, $v$) to the left and right singular vectors $u$, $v$, under mild conditions on $M$. The rate of convergence depends on the ratio $\rho = |\lambda_2|/|\lambda_1|$, where $|\lambda_1|$ is the eigenvalue of largest modulus of $MM^\top$ and $|\lambda_2|$ is the second-largest modulus eigenvalue. Convergence is fast when $\rho \ll 1$.

Measuring quality
How well is the data approximated by its projections on the successive subspaces? Approach: compare the sum of the variances contained in the $k$ principal directions with the total variance. Explained variance ratio:
$$\eta_k = \frac{\lambda_1 + \dots + \lambda_k}{\lambda_1 + \dots + \lambda_d} = \frac{\sigma_1^2 + \dots + \sigma_k^2}{\sigma_1^2 + \dots + \sigma_d^2},$$
where $\lambda_1 \ge \dots \ge \lambda_d$ are the eigenvalues of the covariance matrix, and $\sigma_1 \ge \dots \ge \sigma_d$ are the singular values of the (centered) data matrix.
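A compact numpy sketch (an added illustration, not from the slides) that computes the principal directions and the explained variance ratios $\eta_k$ from the SVD of the centered data matrix, and checks that power iteration recovers the first principal direction:

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 10, 500
scales = np.linspace(3.0, 0.5, d)[:, None]      # different spread per coordinate
A = scales * rng.normal(size=(d, N))            # anisotropic synthetic data

A_c = A - A.mean(axis=1, keepdims=True)         # centered data matrix
U, svals, Vt = np.linalg.svd(A_c, full_matrices=False)

lam = svals**2 / N                              # eigenvalues of S = A_c A_c^T / N
eta = np.cumsum(lam) / lam.sum()                # explained variance ratios eta_1..eta_d
print("eta_k:", np.round(eta, 3))

# Power iteration on A_c converges to the top left/right singular vectors
u = rng.normal(size=d)
v = rng.normal(size=N)
for _ in range(200):
    u = A_c @ v
    u /= np.linalg.norm(u)
    v = A_c.T @ u
    v /= np.linalg.norm(v)
print("aligned with u_1 (up to sign):", np.isclose(abs(u @ U[:, 0]), 1.0, atol=1e-6))
```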
Example: PCA of market data
The plot shows $\eta_k$ as a function of $k$. The first ten components explain 80% of the variance. The largest-magnitude entries of the eigenvector of the first component correspond to the financial sector.
[Figure: data are daily log-returns of 77 Fortune 500 companies, 1/2/2007 to 12/31/2008.]

Low-rank approximation of a matrix
For a given $d \times N$ matrix $A$ and integer $k \le d, N$, the rank-$k$ approximation problem is
$$A^{(k)} = \arg\min_{X} \; \|A - X\|_F \;:\; \mathrm{rank}(X) \le k,$$
where $\|\cdot\|_F$ is the Frobenius norm (the Euclidean norm of the vector formed with all the entries of the matrix). The solution (Eckart-Young-Mirsky theorem) is
$$A^{(k)} = \sum_{i=1}^{k} \sigma_i u_i v_i^\top, \qquad \text{where} \quad A = U\Sigma V^\top = \sum_{i=1}^{r} \sigma_i u_i v_i^\top$$
is an SVD of the matrix $A$.

Low-rank approximation, interpretation: rank-one case
Assume the data matrix $A \in \mathbb{R}^{d,N}$ represents time-series data (each row is a time series). Assume also that $A$ is rank-one, that is, $A = uv^\top \in \mathbb{R}^{d,N}$, where $u$, $v$ are vectors. Then, writing the rows of $A$ as $\alpha_1^\top, \dots, \alpha_d^\top$, we have $A_{jt} = u_j v_t$ for $1 \le j \le d$, $1 \le t \le N$. Thus each time series is a "scaled" copy of the time series represented by $v$, with scaling factors given in $u$. We can think of $v$ as a "factor" that drives all the time series. Geometry: if a data matrix is rank-one, then all the data points lie on a single line.

Low-rank approximation, interpretation: low-rank case
Suppose $A$ has rank $k$, that is, $A = U\Sigma V^\top$ with $U \in \mathbb{R}^{d,k}$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_k) \in \mathbb{R}^{k,k}$, $V \in \mathbb{R}^{N,k}$. Then we can express the $j$-th row of $A$ as
$$\alpha_j^\top = \sum_{i=1}^{k} \sigma_i u_{ij} v_i^\top, \quad 1 \le j \le d.$$
Thus each time series is the sum of scaled copies of the $k$ time series represented by $v_1, \dots, v_k$, with scaling factors given in $u_1, \dots, u_k$. We can think of the $v_i$'s as the few "factors" that drive all the time series.

PCA and low-rank approximations
Consider the rank-$k$ approximation of the centered data matrix $A_c$:
$$\arg\min_{X} \; \|A_c - X\|_F \;:\; \mathrm{rank}(X) \le k.$$
If $A_c = U\Sigma V^\top$ is the SVD of $A_c$, then we know that the solution is
$$X^* = \hat{A}_c^{(k)} = \sum_{i=1}^{k} \sigma_i u_i v_i^\top.$$
Each $v_i$ is a particular factor, and the $u_i$'s contain the scalings. The principal directions $u_i$ are the same as those found via PCA on the data covariance matrix.

Explained variance and approximation error
Recall the explained variance ratio
$$\eta_k = \frac{\lambda_1 + \dots + \lambda_k}{\lambda_1 + \dots + \lambda_d} = \frac{\sigma_1^2 + \dots + \sigma_k^2}{\sigma_1^2 + \dots + \sigma_d^2}.$$
We have that
$$\frac{\|A_c - \hat{A}_c^{(k)}\|_F^2}{\|A_c\|_F^2} = \frac{\sigma_{k+1}^2 + \dots + \sigma_d^2}{\sigma_1^2 + \dots + \sigma_d^2} = 1 - \eta_k,$$
so the squared relative approximation error equals the complement of the explained variance ratio. We can also express this as
$$\frac{\mathrm{trace}(S - S^{(k)})}{\mathrm{trace}\, S} = \frac{\lambda_{k+1} + \dots + \lambda_d}{\lambda_1 + \dots + \lambda_d}.$$
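A short numpy sketch (added here for illustration) of the rank-$k$ approximation via truncated SVD, verifying that the squared relative approximation error equals $1 - \eta_k$:

```python
import numpy as np

rng = np.random.default_rng(4)
d, N, k = 8, 200, 3
A = rng.normal(size=(d, N))
A_c = A - A.mean(axis=1, keepdims=True)          # centered data matrix

U, s, Vt = np.linalg.svd(A_c, full_matrices=False)

# Rank-k approximation (Eckart-Young-Mirsky): keep the top-k singular triplets
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rel_err_sq = np.linalg.norm(A_c - A_k, "fro") ** 2 / np.linalg.norm(A_c, "fro") ** 2
eta_k = (s[:k] ** 2).sum() / (s ** 2).sum()      # explained variance ratio

# The squared relative error equals 1 - eta_k
print(np.isclose(rel_err_sq, 1.0 - eta_k))
```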
Fisher discriminant analysis
We next discuss an approach based on covariance estimation and suitable projections for the purpose of classification. Original reference: R.A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, 1936. The idea is based on projecting the data points onto the one-dimensional subspace spanned by some vector $w$ (to be determined). The projected data result in a collection of scalars (scores) $\alpha = w^\top x$. If we have two classes 1 and 0, respectively with means $\mu_1$, $\mu_0$ and covariances $\Sigma_1$, $\Sigma_0$, then the corresponding projected data have means $m_1$, $m_0$ and variances $\sigma_1^2$, $\sigma_0^2$ given by
$$m_i = w^\top \mu_i, \qquad \sigma_i^2 = w^\top \Sigma_i w, \qquad i = 0, 1.$$
[Figure: examples of good and bad separating directions.]

Fisher defines the between-class variance of the projected data as the squared distance between the projected means:
$$\sigma_{\mathrm{sep}}^2 = \big(w^\top(\mu_1 - \mu_0)\big)^2 = w^\top (\mu_1 - \mu_0)(\mu_1 - \mu_0)^\top w.$$
The separation of the two classes is then defined as the ratio of the between-class variance to the sum of the in-class variances:
$$\Omega(w) = \frac{\sigma_{\mathrm{sep}}^2}{\sigma_{\mathrm{in}}^2} = \frac{w^\top (\mu_1 - \mu_0)(\mu_1 - \mu_0)^\top w}{w^\top (\Sigma_1 + \Sigma_0) w}.$$
We look for a $w$ that maximizes $\Omega(w)$.

Solution
We want to maximize $\Omega(w) = \frac{w^\top \Sigma_{\mathrm{sep}} w}{w^\top \Sigma_{\mathrm{in}} w}$, where $\Sigma_{\mathrm{sep}} = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^\top$ and $\Sigma_{\mathrm{in}} = \Sigma_1 + \Sigma_0$. Let $\Sigma_{\mathrm{in}} = UDU^\top$ and $\tilde{w} = D^{1/2} U^\top w$; then
$$\Omega(\tilde{w}) = \frac{\tilde{w}^\top \tilde{\Sigma}_{\mathrm{sep}} \tilde{w}}{\tilde{w}^\top \tilde{w}}, \qquad \tilde{\Sigma}_{\mathrm{sep}} = D^{-1/2} U^\top \Sigma_{\mathrm{sep}} U D^{-1/2}.$$
By Rayleigh's theorem, the maximum of $\Omega$ is attained when $\tilde{w}$ equals the eigenvector of $\tilde{\Sigma}_{\mathrm{sep}}$ associated with its largest eigenvalue. Since $\tilde{\Sigma}_{\mathrm{sep}}$ has rank one, its principal eigenvector is $\tilde{w}^* = D^{-1/2} U^\top (\mu_1 - \mu_0)$. Going back to the original variable, we obtain the optimal direction
$$w^* = U D^{-1/2} \tilde{w}^* = \Sigma_{\mathrm{in}}^{-1} (\mu_1 - \mu_0).$$

Fisher discriminant
The Fisher approach aims at projecting the data onto a direction with optimal separation, given by $w = (\Sigma_1 + \Sigma_0)^{-1} (\mu_1 - \mu_0)$. To turn the Fisher discriminant into a classifier, we compare the score $w^\top x$ with a threshold $w_0$: if $w^\top x \ge -w_0$ we classify into class 1, and into the other class otherwise. In the Fisher approach, the threshold can be computed, e.g., by examining the empirical ROC curve of the resulting classifier.
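Finally, a minimal numpy sketch (an added illustration, not from the slides) of the Fisher direction on synthetic two-class Gaussian data; the threshold used below is a simple midpoint choice rather than the ROC-based rule mentioned above:

```python
import numpy as np

rng = np.random.default_rng(5)
d, N = 2, 500

# Two Gaussian classes with different means and a common covariance
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
C = np.array([[1.0, 0.6], [0.6, 1.0]])
X0 = rng.multivariate_normal(mu0, C, size=N)
X1 = rng.multivariate_normal(mu1, C, size=N)

# Empirical class means and covariances
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S0 = np.cov(X0, rowvar=False)
S1 = np.cov(X1, rowvar=False)

# Fisher direction: w = (Sigma1 + Sigma0)^{-1} (mu1 - mu0)
w = np.linalg.solve(S1 + S0, m1 - m0)

# Midpoint threshold on the scores (one simple choice among many)
t = 0.5 * (w @ m1 + w @ m0)
correct = (X1 @ w >= t).sum() + (X0 @ w < t).sum()
print("accuracy:", correct / (2 * N))
```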