Lecture Notes: Data Mining & Statistical Analysis

Summary

These are lecture notes on data mining and multidimensional statistical analysis. The notes cover multiple regression, analysis of variance (ANOVA), principal component analysis (PCA), classification, frequent itemset mining, clustering, and document classification. The text also includes R code examples for the statistical methods discussed.

Full Transcript

Version v18.1
Lecture notes in data mining and multidimensional statistical analysis
[email protected], [email protected], [email protected]

Contents

1 Multiple Regression
  1.1 Introduction
  1.2 Least Squares
    1.2.1 Solution of least squares
    1.2.2 Coefficient of determination
  1.3 Gaussian model
    1.3.1 Multivariate Gaussian distributions
    1.3.2 F-test
    1.3.3 t-tests of the regression
    1.3.4 R Example
  1.4 Variable selection
    1.4.1 Bias-variance decomposition
    1.4.2 Split-validation
    1.4.3 Cross-validation
    1.4.4 Akaike information criterion (AIC)
    1.4.5 Optimization algorithms
2 Analysis of variance (ANOVA)
  2.1 Introduction to one-way ANOVA
  2.2 The F-test for one-way ANOVA
  2.3 ANOVA in R
  2.4 Estimation of the effects
  2.5 Two-way ANOVA
3 Principal Component Analysis (PCA)
  3.1 Preliminary notations
    3.1.1 Covariance and correlation matrices
    3.1.2 Eigenvectors and eigenvalues
  3.2 Solution of PCA
  3.3 Variance captured by the principal components
  3.4 Scale matters in PCA
  3.5 PCA in R
4 Classification
  4.1 Principles
    4.1.1 The classification problem
    4.1.2 Optimal/Bayes classifier
    4.1.3 Evaluating the classification error
  4.2 Classic classifiers
    4.2.1 k-nearest neighbor classifier
    4.2.2 Linear discriminant analysis (LDA)
    4.2.3 Quadratic discriminant analysis (QDA)
  4.3 Logistic regression
5 Frequent Itemset Mining
  5.1 Definitions
  5.2 Apriori algorithm
  5.3 Backtracking algorithm
  5.4 Closed frequent itemsets and the LCM algorithm
6 Clustering
7 Document classification
  7.1 Generative models
    7.1.1 Univariate Bernoulli model
    7.1.2 Multinomial model

1 Multiple Regression

1.1 Introduction

We assume that we have p-dimensional input vectors x_i· = (x_i1, x_i2, ..., x_ip), and we want to predict the real-valued outputs Y_i for i = 1, ..., n, where n is the number of datapoints.
The linear regression model has the form

Y_i = \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i,    (1)

where ε_i is a residual term with E[ε_i] = 0 and Var(ε_i) = σ². Here the β_j's are unknown coefficients, and the input variables can come from different sources: quantitative inputs; transformations of quantitative inputs, such as log, square-root or square; interactions between variables, for example x_i3 = x_i1 x_i2.

1.2 Least Squares

1.2.1 Solution of least squares

We assume that we have a set of training data ((x_1·, y_1), ..., (x_n·, y_n)) from which to estimate the parameters β of regression equation (1). Each x_i· = (x_i1, x_i2, ..., x_ip) is a vector of measurements for the i-th individual and y_i is the i-th output. The most popular estimation method is least squares, in which we choose the coefficients β = (β_0, ..., β_p) that minimize the residual sum of squares

RSS(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2.    (2)

We can show that, provided X^T X is invertible, the unique solution is

\hat\beta = (X^T X)^{-1} X^T y,    (3)

where X is the design matrix

X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix},
\qquad
y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.

The regression model can also be expressed in matrix form as

y = X\beta + \varepsilon,    (4)

where ε = (ε_1, ..., ε_n).

1.2.2 Coefficient of determination

We denote by y_i* the fitted values from the regression analysis,

y_i^* = \hat\beta_0 + \sum_{j=1}^{p} x_{ij}\hat\beta_j,

and by y* the vector (y_i*)_{i=1,...,n}. The coefficient of determination is defined as

R^2 = \widehat{\mathrm{Cor}}^2(y^*, y) = \frac{\widehat{\mathrm{Cov}}(y^*, y)^2}{\widehat{\mathrm{Var}}(y^*)\,\widehat{\mathrm{Var}}(y)},

where \widehat{\mathrm{Var}} denotes the sample variance,

\widehat{\mathrm{Cov}}(y^*, y) = \frac{1}{n} \sum_{i=1}^{n} (y_i^* - \bar y^*_n)(y_i - \bar y_n)

the empirical covariance, and \bar y^*_n and \bar y_n the sample means of y* and y, respectively. We can show that

R^2 = \frac{\widehat{\mathrm{Var}}(y^*)}{\widehat{\mathrm{Var}}(y)} = \frac{\widehat{\mathrm{Var}}(y) - \widehat{\mathrm{Var}}(\hat\varepsilon)}{\widehat{\mathrm{Var}}(y)},    (5)

where ε̂ = y − y* is the vector of residuals from the regression. The numerator of equation (5) is called the explained variance. The coefficient of determination R² ∈ [0, 1] is therefore interpreted as the proportion of variance that is explained by the p predictors of the regression. The residual variance is a biased estimator of σ²; however, σ̂² = n \widehat{Var}(ε̂)/(n − p − 1) is an unbiased estimator.
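As a quick illustration of equations (3) and (5), the following R sketch (not part of the original notes; the simulated data and object names are ours) computes β̂ and R² directly and checks them against lm().

## Minimal sketch: equation (3) and equation (5) by hand, checked against lm()
set.seed(1)
n <- 50; p <- 2
x <- matrix(rnorm(n * p), n, p)                 # two quantitative inputs
y <- 1 + 2 * x[, 1] - x[, 2] + rnorm(n)         # simulated outputs, true beta = (1, 2, -1)

X        <- cbind(1, x)                         # design matrix with a column of ones
beta_hat <- solve(t(X) %*% X, t(X) %*% y)       # equation (3)
fit      <- lm(y ~ x)
all.equal(as.numeric(beta_hat), unname(coef(fit)))   # TRUE

y_star <- as.numeric(X %*% beta_hat)            # fitted values
R2     <- 1 - var(y - y_star) / var(y)          # equation (5)
c(R2, summary(fit)$r.squared)                   # the two values agree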
Remark: regression as conditional expectation. For this paragraph only, we assume that the inputs x_ij are not fixed but rather realizations of random variables X_ij. Assuming that Y_i has finite variance, it can be shown that the best approximation

\arg\min_{f}\, E[(Y_i - f(X_{i1}, \dots, X_{ip}))^2]

of Y_i by a function f(X_i1, ..., X_ip) with finite variance is the function E[Y_i | X_i1, ..., X_ip]. Since the functional form of the conditional expectation of Y_i given X_i1, ..., X_ip is unknown, we have to assume a particular parametric form for f. When f belongs to a parametric family {f_β}_{β∈B}, the regression model has the general form

Y_i = f_\beta(x_{i1}, \dots, x_{ip}) + \varepsilon_i,

where ε_i is the residual term with E[ε_i] = 0 and Var(ε_i) = σ². If f is assumed to depend linearly on its parameters β, we obtain the linear regression model of equation (1).

1.3 Gaussian model

So far, we did not assume any parametric model for the residuals. However, if we want to perform statistical hypothesis testing, we have to resort to parametric assumptions. The most standard parametric approach assumes the Gaussian model

\varepsilon \leadsto N(0, \sigma^2 I_n),

i.e. the ε_i are independent N(0, σ²) variables.

1.3.1 Multivariate Gaussian distributions

A random vector Z = (Z_1, ..., Z_n) has a multivariate Gaussian distribution iff its probability density function (PDF) has the form

f_{m,\Sigma}(z) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big( -\frac{1}{2} (z - m)^T \Sigma^{-1} (z - m) \Big),

where m (an n-dimensional vector) and Σ (an n × n symmetric positive definite matrix) are parameters. This is denoted by Z ⇝ N(m, Σ). In this case, E[Z] = m, cov(Z_i, Z_j) = Σ_{i,j} for all (i, j), and Z_i ⇝ N(m_i, Σ_{i,i}). Moreover:

(i) Z_i and Z_j are independent iff Σ_{i,j} = 0;
(ii) for any deterministic matrix A and vector b, AZ + b ⇝ N(Am + b, AΣA^T) (some components of AZ + b may be degenerate Gaussian, in other words deterministic).

In the Gaussian regression model, the least squares estimators have the following properties:

\hat\beta \leadsto N(\beta,\, \sigma^2 (X^T X)^{-1}), \qquad \frac{n-p-1}{\sigma^2}\,\hat\sigma^2 \leadsto \chi^2_{n-p-1},

and the two statistics are independent.

1.3.2 F-test

F-test for a group of variables. Under the Gaussian hypothesis, we can use the F-test to test whether a group of variables, say the last p − p′ variables (the (p′+1)-th to the p-th, with p′ < p), significantly improves the fit of the regression. The null hypothesis is H0: (∀j ∈ {p′+1, ..., p}, β_j = 0) against H1: (∃j ∈ {p′+1, ..., p}, β_j ≠ 0). The test statistic is the F statistic,

F = \frac{(RSS_{p'} - RSS_p)/(p - p')}{RSS_p/(n - p - 1)},    (6)

where RSS_p is the residual sum of squares of the larger model with p + 1 parameters, and RSS_{p′} is the same quantity for the nested model with p′ + 1 parameters. The F statistic measures the decrease in residual sum of squares per additional parameter (the numerator of equation (6)), normalized by an estimate of σ² (the denominator of equation (6)). Under H0, the F statistic follows a Fisher distribution F_{p−p′, n−p−1} with p − p′ and n − p − 1 degrees of freedom.

The result of a statistical test is nowadays reported using P-values. The P-value is the probability, under the null hypothesis, of obtaining a test statistic more atypical (in the direction of the alternative hypothesis) than the observed one. Here, we reject H0 for large enough values of F:

P\text{-value} = P(F \geq f),

where f is the F statistic computed from the data and F ⇝ F_{p−p′, n−p−1}. The null hypothesis is rejected when the P-value is small enough (P < 5%, P < 1%, P < 0.1%, ..., depending on the context).

This test can be generalized to the null hypothesis that β belongs to an affine subspace E of R^{p+1} of dimension p′ + 1, i.e. when β satisfies a given set of linear constraints. As an example, the null hypothesis could be H0: β_1 = 2β_3 against H1: β_1 ≠ 2β_3. The test statistic is still given by equation (6), where RSS_{p′} now refers to the residual sum of squares of the model estimated under the linear constraints. Note that if A is a (p+1) × (p′+1) matrix whose columns represent a basis of E, the least squares estimates that satisfy the linear constraints are provided by equation (3) using XA as the design matrix.

F-test of the regression. One instance of the F-test is of particular interest: the situation where the larger model is the complete model that includes all explanatory variables, whereas the nested model contains the parameter β_0 only. This test is called the F-test of the regression, and it assesses whether the predictors have an effect on the variable to predict, Y. Formally, the null hypothesis is H0: (∀j ⩾ 1, β_j = 0) against H1: (∃j ⩾ 1, β_j ≠ 0). We can show that the F statistic of the regression is

F = \frac{n - p - 1}{p}\,\frac{R^2}{1 - R^2},    (7)

and that it follows a Fisher distribution F_{p, n−p−1} with p and n − p − 1 degrees of freedom under H0. The formula of the F statistic given in equation (7) can be derived from equations (6) and (5) (an exercise for you!). Here, we reject H0 for large enough values of R², i.e. large enough values of F, so that the P-value is equal to P(F ⩾ f), where f is the F statistic computed from the data and F ⇝ F_{p, n−p−1}.
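The F-test for a group of variables can be carried out in R by fitting the two nested models and calling anova(); the short sketch below is ours (not from the notes) and uses the USArrests data of the R example in Section 1.3.4, with an arbitrary choice of nested models.

## Sketch: F-test of equation (6) via two nested models
small <- lm(Murder ~ UrbanPop, data = USArrests)                   # nested model
large <- lm(Murder ~ UrbanPop + Assault + Rape, data = USArrests)  # larger model
anova(small, large)   # F = ((RSS_small - RSS_large)/(p - p')) / (RSS_large/(n - p - 1)), equation (6)

## F-test of the regression (equation (7)): compare with the intercept-only model
null <- lm(Murder ~ 1, data = USArrests)
anova(null, large)    # same F statistic as the one printed by summary(large)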
1.3.3 t-tests of the regression

We test the hypothesis that a particular coefficient is null. The null hypothesis is H0: β_j = 0, and the alternative is H1: β_j ≠ 0. The T statistic is

T = \frac{\hat\beta_j}{\hat\sigma \sqrt{v_j}},

where v_j is the j-th diagonal element of (X^T X)^{-1}. Because of the distributional properties of the estimators given in Section 1.3.1, the T statistic follows, under H0, a Student t distribution with n − p − 1 degrees of freedom.

1.3.4 R Example

In this example, we regress the murder rate, obtained for each of the 50 US states in 1973, on two potential explanatory variables: the assault rate and the percentage of urban population.

Violent Crime Rates by US State. This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percentage of the population living in urban areas. Format: a data frame with 50 observations on 4 variables.

[,1]  Murder    numeric  Murder arrests (per 100,000)
[,2]  Assault   numeric  Assault arrests (per 100,000)
[,3]  UrbanPop  numeric  Percent urban population
[,4]  Rape      numeric  Rape arrests (per 100,000)

> mylm <- lm(Murder ~ UrbanPop + Assault, data = USArrests)
> summary(mylm)

Call:
lm(formula = Murder ~ UrbanPop + Assault, data = USArrests)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.207153   1.740790   1.842   0.0717 .
UrbanPop    -0.044510   0.026363  -1.688   0.0980 .
Assault      0.043910   0.004579   9.590 1.22e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.58 on 47 degrees of freedom
Multiple R-squared:  0.6634,    Adjusted R-squared:  0.6491
F-statistic: 46.32 on 2 and 47 DF,  p-value: 7.704e-12

There is a significant effect of the explanatory variables on the murder rate because the P-value of the F-test is 7.704 × 10⁻¹². Once the percentage of urban population has been accounted for, the assault rate has a significant effect on the murder rate because the P-value of the t-test is 1.22 × 10⁻¹². However, once the assault rate is accounted for, there is no additional effect of the percentage of urban population on the murder rate (P = 0.0980), unless we accept a large enough type I error (e.g. α = 10%).
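To connect the summary() table with the formulas of Sections 1.3.1 and 1.3.3, the following check (ours, not from the notes) recomputes the t values and their P-values from the fitted object mylm defined above.

## Sketch: reproducing the t statistics of the summary table
beta_hat <- coef(mylm)
se       <- sqrt(diag(vcov(mylm)))   # vcov(mylm) = sigma_hat^2 (X^T X)^{-1}, so se_j = sigma_hat * sqrt(v_j)
t_values <- beta_hat / se            # T statistic of Section 1.3.3
p_values <- 2 * pt(abs(t_values), df = df.residual(mylm), lower.tail = FALSE)   # two-sided, df = n - p - 1 = 47
cbind(t_values, p_values)            # matches the t value and Pr(>|t|) columns above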
1.4 Variable selection

Variable selection, or feature selection, consists of selecting a subset of relevant variables (or features) among the p initial variables. There are two main reasons to perform variable selection. The first is prediction accuracy: the least squares estimates often have low bias but large variance (see subsection 1.4.1), especially when the number of variables p is large, of the same order as the sample size n. Prediction accuracy can be improved by variable selection, which consists of choosing a subset of the initial set of predictive variables. The second reason is interpretation. With a large number of predictors, we want to sacrifice some of the small details in order to get the 'big picture' carried by a smaller subset of variables that exhibit the strongest effects.

1.4.1 Bias-variance decomposition

Assume that we want to make a prediction for an input x_0 = (x_01, ..., x_0p). Denoting by f̂ the estimated regression function

\hat f(x_0) = \hat\beta_0 + \sum_{j=1}^{p} x_{0j}\hat\beta_j,

the prediction error is given by

E_{pred}(x_0) = E[(Y - \hat f(x_0))^2]    (8)
              = E[(f(x_0) + \varepsilon - \hat f(x_0))^2]    (9)
              = \sigma^2 + \mathrm{bias}^2(\hat f(x_0)) + \mathrm{Var}(\hat f(x_0))    (10)
              = \text{irreducible error} + \text{bias}^2 + \text{variance},

where bias²(f̂(x_0)) + Var(f̂(x_0)) is the mean squared error of f̂(x_0). When p is large (e.g. a large number of predictors, a large-degree polynomial regression, ...), the model can accommodate complex regression functions (small bias), but because there are many β parameters to estimate, the variance of the estimates may be large. For small values of p, the variance of the estimates gets smaller but the bias increases. There is therefore an optimal number of predictors to find, which achieves the so-called bias-variance tradeoff.

To give a concrete example, assume that we fit a polynomial regression f(x) = β_0 + β_1 x + ··· + β_p x^p with p = n − 1. Since p = n − 1, the regression function will interpolate all of the n points. This is not a desirable feature: the fit suffers from overfitting. Rather than fitting all the points in the training sample, the regression function should provide good generalization, i.e. be able to make good predictions for new points that have not been used during the fit. This is the reason why choosing the model that minimizes the residual squared error computed from the whole dataset is not a reasonable strategy. In subsections 1.4.2 and 1.4.3, we present different methods to estimate the prediction error, so that we can choose a model that achieves good generalization.

1.4.2 Split-validation

If we are in a data-rich situation, a standard approach to estimate the prediction error is to randomly divide the dataset into two parts: a training set with n_1 datapoints and a validation set with n_2 datapoints (n_1 + n_2 = n). The training set is used to fit the models, i.e. to provide estimates of the regression coefficients (β̂_0, ..., β̂_p), and the validation set is used to estimate the prediction error. The prediction error is estimated by averaging the squared errors over the points of the validation set,

E_{pred} = \frac{1}{n_2} \sum_{(x_{i\cdot},\, y_i)\, \in\, \text{Validation Set}} \Big( y_i - \big( \hat\beta_0 + \sum_{j=1}^{p} x_{ij}\hat\beta_j \big) \Big)^2.    (11)

We typically choose the model that yields the smallest prediction error.

1.4.3 Cross-validation

The caveat of the split-validation approach is that the prediction error may depend heavily on the specific choice of the training and validation sets. To overcome this problem, the following cross-validation approaches rotate the roles of the validation and training sets.

K-fold cross-validation. The idea is to split the data into K parts. To start with, K − 1 subsets are used as the training set and the remaining subset is used as a validation set to compute a prediction error E¹_pred as given by equation (11). This procedure is repeated K times, considering each subset in turn as the validation set and the remaining subsets as the training set. The average prediction error is defined as

E_{pred} = \frac{1}{K} \sum_{k=1}^{K} E^{k}_{pred}.

Leave-one-out. This is the same idea as K-fold cross-validation with K = n. The prediction error is computed as

E_{pred} = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \hat f_{-i}(x_{i\cdot}) \big)^2,    (12)

where f̂_{−i} denotes the regression function estimated when the i-th point x_i· has been removed from the training dataset.
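As a small illustration (ours, not from the notes), the leave-one-out error of equation (12) can be used to compare candidate models; here it is applied to two regressions on the USArrests data of Section 1.3.4, with a hypothetical helper function loo_error.

## Sketch: leave-one-out prediction error (equation (12)) for model comparison
loo_error <- function(formula, data) {
  response <- all.vars(formula)[1]
  errs <- sapply(seq_len(nrow(data)), function(i) {
    fit <- lm(formula, data = data[-i, ])                      # fit without the i-th point
    (data[i, response] - predict(fit, newdata = data[i, ]))^2  # squared prediction error at point i
  })
  mean(errs)
}

loo_error(Murder ~ Assault, data = USArrests)
loo_error(Murder ~ Assault + UrbanPop, data = USArrests)   # keep the model with the smaller error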
To show the potential of cross-validation, we consider the following polynomial example,

Y = X - X^2 + X^3 + \varepsilon, \qquad \varepsilon \leadsto N(0, \sigma^2 = 10^2).    (13)

We simulate a dataset with n = 100 using the polynomial model of equation (13) (see the left panel of Figure 1). We then consider a polynomial regression to smooth the data,

Y = \beta_0 + \beta_1 x + \dots + \beta_p x^p + \varepsilon,

and we consider a range of polynomial degrees p = 1, ..., 20. In the right panel of Figure 1, we show the leave-one-out prediction error (equation (12)) and the residual squared error for the entire dataset (RSS(β̂) in equation (2)) as a function of the degree. The residual squared error is a decreasing function of the degree, so that naively minimizing the residual squared error would lead to the selection of the most complex model. The prediction error has a nonmonotonic behavior and points to a polynomial regression of degree 3, which is consistent with the generating mechanism given in equation (13). We provide below the R code that we used to generate Figure 1.

########### Polynomial regressions with different degrees
###### Scatterplot and plot of the 3 polynomial regressions (deg = 1, 3, 20)

[...]

2 Analysis of variance (ANOVA)

2.5 Two-way ANOVA

> model.tables(aov(marks ~ specialization * corrector, data = alldata), type = "means")
Tables of means
Grand mean
    11.30711

 specialization
       ISI m1rmosig   MIF  MMIS   SIF    SLE   TEL
     10.79        9 12.85 11.67 10.25  8.731 10.21
rep  50.00        1 43.00 64.00 14.00 13.000 12.00

 corrector
         1     2     3    4     5     6
     11.78 10.92 11.36 10.6 11.64 11.58
rep  32.00 36.00 32.00 32.0 33.00 32.00

 specialization:corrector
               corrector
specialization      1      2      3      4      5      6
  ISI                                        10.85  10.73
  rep            0.00   0.00   0.00   0.00  26.00  24.00
  m1rmosig       9.00
  rep            1.00   0.00   0.00   0.00   0.00   0.00
  MIF                   18.00  13.14  11.81
  rep            0.00    1.00  29.00  13.00   0.00   0.00
  MMIS          12.33   11.13
  rep           29.00   35.00   0.00   0.00   0.00   0.00
  SIF                                 10.25
  rep            0.00    0.00   0.00  14.00   0.00   0.00
  SLE            6.75           6.50   5.00   9.50  10.00
  rep            2.00    0.00   1.00   1.00   3.00   6.00
  TEL                           8.00   8.87  12.00  11.50
  rep            0.00    0.00   2.00   4.00   4.00   2.00

3 Principal Component Analysis (PCA)

The data consist of n individuals, and for each of these individuals we have p measurements (or variables). The data are given in the following n × p design matrix,

X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}.

Quoting once again Wikipedia (http://en.wikipedia.org/wiki/Principal_component_analysis):

Principal component analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of p possibly correlated variables into a set of values of d uncorrelated variables called principal components (d ≤ p). This transformation is defined in such a way that the first principal component has as high a variance as possible (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to (uncorrelated with) the preceding components.

3.1 Preliminary notations

3.1.1 Covariance and correlation matrices

We denote by c_kl the empirical covariance between the k-th and l-th variables,

c_{kl} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ik} - \bar x_{\cdot k})(x_{il} - \bar x_{\cdot l}).

The covariance matrix of the design matrix X is

\Sigma = \begin{pmatrix} c_{11} & \cdots & c_{1p} \\ \vdots & & \vdots \\ c_{p1} & \cdots & c_{pp} \end{pmatrix}.

Note that the diagonal of the covariance matrix contains the vector of the variances. Next, we denote by X′ the n × p matrix obtained from the design matrix X after centering all of its columns (i.e. all of the columns of X′ have mean 0); then we have

\Sigma = \frac{1}{n-1} X'^{T} X'.

Next, we define the empirical correlation between the k-th and l-th variables,

r_{kl} = \frac{c_{kl}}{s_k s_l}, \qquad \text{where } s_k = \sqrt{c_{kk}}.

Accordingly, the correlation matrix can be written as

R = D^{-1} \Sigma D^{-1},

where

D = \begin{pmatrix} s_1 & 0 & \cdots & 0 \\ 0 & s_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & s_p \end{pmatrix}.
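As a brief sketch (ours, not from the notes), the quantities of this subsection can be computed in R on the USArrests data and checked against the built-in cov(), cor() and cov2cor() functions.

## Sketch: covariance and correlation matrices of Section 3.1.1
X  <- as.matrix(USArrests)
Xc <- scale(X, center = TRUE, scale = FALSE)      # centered columns, the matrix X'
Sigma <- t(Xc) %*% Xc / (nrow(X) - 1)             # Sigma = X'^T X' / (n - 1)
all.equal(Sigma, cov(X))                          # TRUE

D <- diag(sqrt(diag(Sigma)))                      # diagonal matrix of standard deviations
R <- solve(D) %*% Sigma %*% solve(D)              # R = D^{-1} Sigma D^{-1}
all.equal(unname(R), unname(cov2cor(Sigma)))      # TRUE; cov2cor(Sigma) equals cor(X)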
3.1.2 Eigenvectors and eigenvalues

We rank the p eigenvectors of the covariance matrix Σ by decreasing order of their corresponding eigenvalues. The first eigenvector, the column vector a_1, has the largest eigenvalue λ_1; the second eigenvector, the column vector a_2, has the second largest eigenvalue λ_2; and so on. The (p × p) matrix A of the eigenvectors is

A = [a_1\; a_2\; \dots\; a_p],

and the matrix of the eigenvalues is the diagonal matrix

\Sigma_Y = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_p \end{pmatrix}.

By definition of the matrix of the eigenvectors, we have

\Sigma = A\, \Sigma_Y\, A^{T}.

3.2 Solution of PCA

We can show that PCA can be performed by an eigenvalue decomposition of the data covariance matrix Σ. The first principal component corresponds to the eigenvector a_1: a_1 defines the one-dimensional projection that maximizes the variance of the projected values. The second eigenvector a_2 defines the one-dimensional projection that maximizes the variance of the projected values among the vectors orthogonal to a_1, and so on. The (n × p) matrix Y of the projected values on the principal component axes is

Y = X' A.

In the first column of Y, we read the projections of the data on the first principal component; in the second column of Y, the projections on the second principal component; and so on.

3.3 Variance captured by the principal components

We can easily show that 1) the variance-covariance matrix of the projected values Y is given by Σ_Y, and 2) the sum of the variances of the original data, \sum_{j=1}^{p} s_j^2, is equal to the sum of the eigenvalues, \sum_{j=1}^{p} \lambda_j. This means that the k-th principal component captures a fraction \lambda_k / \sum_{j=1}^{p} \lambda_j of the total variance, and that the first k principal components capture a fraction \sum_{j=1}^{k} \lambda_j / \sum_{j=1}^{p} \lambda_j of the total variance.

3.4 Scale matters in PCA

Assume that we record the following information for n sampled individuals: their weight in grams, their height in centimeters, their percentage of body fat, and their age in years. The corresponding design matrix X has n rows and p = 4 columns. If we perform PCA naively using the variance-covariance matrix Σ, it is quite clear that the first principal component will be almost equal to the first column (weight in grams). Indeed, the choice of the unit of measure, here the gram, is such that the variance of the first column is extremely large compared to the variances of the remaining columns. The fact that the result of PCA depends on the choice of scale, or unit of measure, may be unsatisfactory in several settings. To overcome this problem, PCA may be performed using the correlation matrix R rather than the variance-covariance matrix Σ. This is the same as performing PCA after having standardized all of the columns of X (i.e. the columns of the standardized matrix have an empirical variance equal to 1).
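The following sketch (ours, not from the notes) illustrates Sections 3.2 to 3.4 by eigendecomposition on simulated data mimicking the weight/height/body-fat/age example; all variable names and simulated values are ours.

## Sketch: PCA by eigendecomposition, and the effect of scale
set.seed(3)
n      <- 200
height <- rnorm(n, 170, 10)                  # centimeters
weight <- 700 * height + rnorm(n, 0, 8000)   # grams, correlated with height
fat    <- rnorm(n, 25, 6)                    # percentage of body fat
age    <- rnorm(n, 40, 12)                   # years
X      <- cbind(weight, height, fat, age)

## PCA on the covariance matrix: the first axis is essentially the weight column,
## because its variance (in grams^2) dwarfs the others
round(eigen(cov(X))$vectors[, 1], 3)

## PCA on the correlation matrix, i.e. on standardized columns
Xs  <- scale(X)                              # centered, unit-variance columns (X')
eig <- eigen(cor(X))                         # eigenvalues and eigenvectors of R
A   <- eig$vectors                           # principal axes a_1, ..., a_p
Y   <- Xs %*% A                              # projected values, Y = X'A
eig$values / sum(eig$values)                 # fraction of total variance per component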
3.5 PCA in R

We consider the same USArrests dataset as in the section about multiple regression. This dataset contains three crime-rate variables and the percentage of urban population for each of the 50 US states in 1973. Using PCA, we can display the dataset in two dimensions (Figure 2).

## The variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
> res <- prcomp(USArrests, scale. = TRUE)
> summary(res)
Importance of components:
                        PC1   PC2    PC3    PC4
Standard deviation     1.57 0.995 0.5971 0.4164
Proportion of Variance 0.62 0.247 0.0891 0.0434
Cumulative Proportion  0.62 0.868 0.9566 1.0000

## The following function displays the observations on the PC1, PC2 axes
## but also the original variables on the same axes (see Figure 2)
> biplot(res, xlim = c(-.3, .3), ylim = c(-.3, .3), cex = .7)

# To obtain a view of the variables with appropriate metrics (correlations):
P <- res$rotation
for (i in 1:4) P[, i] <- P[, i] * res$sdev[i]
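As a short check (ours, not from the notes) tying the summary(res) output above to Section 3.3: for a PCA on standardized variables, the squared standard deviations of the components are the eigenvalues of the correlation matrix.

## Sketch: variance proportions of Section 3.3 recovered from the prcomp output
res <- prcomp(USArrests, scale. = TRUE)
res$sdev^2                                            # component variances = eigenvalues of R
all.equal(res$sdev^2, eigen(cor(USArrests))$values)   # TRUE
res$sdev^2 / sum(res$sdev^2)                          # the "Proportion of Variance" row
cumsum(res$sdev^2) / sum(res$sdev^2)                  # the "Cumulative Proportion" row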