Multivariate Statistics Made Simple (2019) PDF
2019
K. V. S. Sarma, R. Vishnu Vardhan
Summary
This book, Multivariate Statistics Made Simple, is a focused practical guide to multivariate statistical analysis. The book covers a broad range of topics from an overview of multivariate analysis to different kinds of analysis techniques such as ANOVA, MANOVA, and multiple linear regression. It also discusses classification problems in medical diagnosis, survival analysis, and regression analysis. The book is suitable for those interested in applying statistical methods to real-world problems.
Full Transcript
Multivariate Statistics Made Simple
A Practical Approach

K. V. S. Sarma
R. Vishnu Vardhan

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2019 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Printed on acid-free paper
Version Date: 20180913
International Standard Book Number-13: 978-1-1386-1095-8 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

To my beloved wife, Late Sarada Kumari (1963–2015), who inspired me to inspire data scientists.
K. V. S. Sarma

To my beloved parents, Smt. R. Sunanda & Sri. R. Raghavendra Rao, for having unshakeable confidence in my endeavors.
R. Vishnu Vardhan

Contents

Preface
Authors

1 Multivariate Statistical Analysis—An Overview
  1.1 An Appraisal of Statistical Analysis
  1.2 Structure of Multivariate Problems
  1.3 Univariate Data Description
  1.4 Standard Error (SE) and Confidence Interval (CI)
  1.5 Multivariate Descriptive Statistics
  1.6 Covariance Matrix and Correlation Matrix
  1.7 Data Visualization
  1.8 The Multivariate Normal Distribution
  1.9 Some Interesting Applications of Multivariate Analysis
  Summary
  Do it yourself
  Suggested Reading

2 Comparison of Multivariate Means
  2.1 Multivariate Comparison of Mean Vectors
  2.2 One-sample Hotelling's T² Test
  2.3 Confidence Intervals for Component Means
  2.4 Two-Sample Hotelling's T² Test
  2.5 Paired Comparison of Multivariate Mean Vectors
  Summary
  Do it yourself
  Suggested Reading

3 Analysis of Variance with Multiple Factors
  3.1 Review of Univariate Analysis of Variance (ANOVA)
  3.2 Multifactor ANOVA
  3.3 ANOVA with a General Linear Model
  3.4 Continuous Covariates and Adjustment
  3.5 Non-Parametric Approach to ANOVA
  3.6 Influence of Random Effects on ANOVA
  Summary
  Do it yourself
  Suggested Reading

4 Multivariate Analysis of Variance (MANOVA)
  4.1 Simultaneous ANOVA of Several Outcome Variables
  4.2 Test Procedure for MANOVA
  4.3 Interpreting the Output
  4.4 MANOVA with Age as a Covariate
  4.5 Using Age and BMI as Covariates
  4.6 Theoretical Model for Prediction
  Summary
  Do it yourself
  Suggested Reading

5 Analysis of Repeated Measures Data
  5.1 Experiments with Repeated Measures
  5.2 RM ANOVA Using SPSS
  5.3 RM ANOVA Using MedCalc
  5.4 RM ANOVA with One Grouping Factor
  5.5 Profile Analysis
  Summary
  Do it yourself
  Suggested Reading

6 Multiple Linear Regression Analysis
  6.1 The Concept of Regression
  6.2 Multiple Linear Regression
  6.3 Selection of Appropriate Variables into the Model
  6.4 Predicted Values from the Model
  6.5 Quality of Residuals
  6.6 Regression Model with Selected Records
  Summary
  Do it yourself
  Suggested Reading

7 Classification Problems in Medical Diagnosis
  7.1 The Nature of Classification Problems
  7.2 Binary Classifiers and Evaluation of Outcomes
  7.3 Performance Measures of Classifiers
  7.4 ROC Curve Analysis
  7.5 Composite Classifiers
  7.6 Biomarker Panels and Longitudinal Markers
  Summary
  Do it yourself
  Suggested Reading

8 Binary Classification with Linear Discriminant Analysis
  8.1 The Problem of Discrimination
  8.2 The Discriminant Score and Decision Rule
  8.3 Understanding the Output
  8.4 ROC Curve Analysis of Discriminant Score
  8.5 Extension of Binary Classification
  Summary
  Do it yourself
  Suggested Reading

9 Logistic Regression for Binary Classification
  9.1 Introduction
  9.2 Simple Binary Logistic Regression
  9.3 Binary Logistic Regression with Multiple Predictors
  9.4 Assessment of the Model and Relative Effectiveness of Markers
  9.5 Logistic Regression with Interaction Terms
  Summary
  Do it yourself
  Suggested Reading

10 Survival Analysis and Cox Regression
  10.1 Introduction
  10.2 Data Requirements for Survival Analysis
  10.3 Estimation of Survival Time with Complete Data (No Censoring)
  10.4 The Kaplan–Meier Method for Censored Data
  10.5 Cox Regression Model
  Summary
  Do it yourself
  Suggested Reading

11 Poisson Regression Analysis
  11.1 Introduction
  11.2 General Form of Poisson Regression
  11.3 Selection of Variables and Subset Regression
  11.4 Poisson Regression Using SPSS
  11.5 Applications of the Poisson Regression model
  Summary
  Do it yourself
  Suggested Reading

12 Cluster Analysis and Its Applications
  12.1 Data Complexity and the Need for Clustering
  12.2 Measures of Similarity and General Approach to Clustering
  12.3 Hierarchical Clustering and Dendrograms
  12.4 The Impact of Clustering Methods on the Results
  12.5 K-Means Clustering
  Summary
  Do it yourself
  Suggested Reading

Index

Preface

The twenty-first century is recognized as the era of data science. Evidence-based decision making is the order of the day and the common man is continuously educated about various types of information. Statistics is a science of data analysis and a powerful tool in decision making. The days are gone when only numbers were considered as data; with the advent of sophisticated software, data now includes text, voice, and images in addition to numbers. Practical data will usually be unstructured and sometimes contaminated and hence statistics alone can resolve issues of uncertainty.

Students taking a master's course in statistics by and large learn the mathematical treatment of statistical methods in addition to a few application areas like biology, public health, psychology, marketing etc. However, real-world problems are often multivariate in nature and the tools needed are not as simple as describing data using averages, measures of spread or charts.
Multivariate statistical analysis requires the ability to handle complex mathematical routines such as inverses of matrices, eigenvalues or complicated integrals of density functions and lengthy iterations. Statistical software has transformed analytic tools into simple and user-friendly applications, thereby bridging the gap between science (theory) and the technology of statistics. Expertise in statistical software is indispensable for contemporary statisticians. Easy access to software packages, however, distracts the attention of users away from 'a state of understanding' to a 'cookbook mode' focusing on the end results only. The current tendency to report p-values without rational justification for the context of use is one such situation.

On the other hand, there is a tremendous growth in data warehousing, data mining and machine learning. Data analytic tools like artificial intelligence, image processing, dynamic visualization of data etc., have overshadowed the classical approach to statistical analysis and real-time outputs are available on the dashboards of computer-supported devices.

The motivation for publishing this book was our experience in teaching statistics to post-graduate students over a long period of time as well as interaction with professionals including medical researchers, practicing doctors, psychologists, management experts and engineers. The problems brought to us are often challenging and need a different view when compared with what is taught in a classroom. The selection of suitable statistical tools (like Analysis of Variance) and an implementation tool (like software or an online calculator) plays an important role for the application scientist. Hence it is time to learn and practice statistics by utilizing the power of computers. Our experience in attending to problems in consulting also prompted us to write this book.

Teaching and learning statistics is fruitful when real-time studies are understood. We have chosen medicine and clinical research areas as the platform to explain statistical tools. The discussions and illustrations, however, hold well for other contexts too.

All the illustrations are explained with one or more of the following software packages.

1. IBM SPSS Statistics Version 20 (IBM Corp, Somers, NY, USA)
2. MedCalc Version 15.0 for Windows (MedCalc Software bvba, Belgium)
3. Real Statistics Resource Pack Add-ins to Microsoft Excel
4. R (open source software)

This book begins with an overview of multivariate analysis in Chapter 1, designed to motivate the reader towards a 'holistic approach' for analysis instead of partial study using univariate analysis. In Chapter 2 we have focused on the need to understand several related variables simultaneously, using mean vectors. The concept of Hotelling's T-square and the method of working on it with MS-Excel Add-ins is illustrated. In Chapter 3 we have illustrated the use of a General Linear Model to handle Multifactor ANOVA with interactions and continuous covariates. As an extension to ANOVA we have discussed the method and provided working skills to perform Multivariate ANOVA (MANOVA) in Chapter 4 and illustrated it with practical data sets. Repeated Measures data is commonly used in follow-up studies and Chapter 5 is designed to explain the theory and skills of this tool. In Chapter 6 Multiple Linear Regression and its manifestations are discussed with practical data sets and explained with appropriate software.

An important area in medical research is the design of biomarkers for classification and early detection of diseases. Chapter 7 is devoted to explaining the statistical methods and software support for medical diagnosis and ROC curves. Linear Discriminant Analysis, a popular tool for binary classification, is discussed in Chapter 8 and the procedure is illustrated with clinical data.
The use of Logistic Regression in binary classification is common in predictive models and the same is discussed in Chapter 9 with practical datasets. In Chapter 10 the basic elements of Survival Analysis, the use of Kaplan–Meier test and Cox Regression are presented in view of their wide applications in the management of chronic diseases. Poisson Regression, a tool so far discussed in a few advanced books only, is presented in Chapter 11 with a focus on application and software supported skills. We end the book with Cluster Analysis in Chapter 12 in view of its vast applications in community medicine, public health and hospital management.

At the end of each chapter, we have provided exercises under the caption, Do it Yourself. We believe that the user can perform the required analysis by following the stepwise methods and the software options provided in the text. These problems are not simply numerically focused but contain a real context. A few motivational and thought-provoking exercises are also given.

We acknowledge the excellent support extended by several doctors and clinical researchers from Sri Venkateswara Institute of Medical Sciences (SVIMS), Tirupati and those at Jawaharlal Institute of Post Graduate Medical Education & Research (JIPMER), Pondicherry and the faculty members of Sri Venkateswara University, Tirupati and Pondicherry University, Puducherry. We appreciate their intent in sharing their datasets to support the illustration, in making the book more practical than a mere theoretical volume. We have acknowledged each of them in the appropriate places.

Our special thanks to Dr. Alladi Mohan, Professor and Head, Department of Medicine, Sri Venkateswara Institute of Medical Sciences (SVIMS), Tirupati for lively discussion on various concepts used in this book.
We also wish to place on record our special appreciation to Sai Sarada Vedururu, Jahnavi Merupula and Mohammed Hisham who assisted in database handling, running of specific tools and the major activity of putting the text into LaTeX format.

K. V. S. Sarma
R. Vishnu Vardhan

Authors

Dr. K. V. S. Sarma

K. V. S. Sarma served as a Professor of Statistics at Sri Venkateswara University from 1977 till 2016. He taught various subjects in statistics for 39 years including Operations Research, Statistical Quality Control, Probability Theory, and Computer Programming. His areas of research interests are Inventory Modeling, Sampling Plans and Data Analysis. He has guided 20 Ph.D. theses on diverse topics and published more than 60 research articles in reputed international journals. He has authored a book titled Statistics Made Simple using Excel and SPSS, apart from delivering lectures on software-based analysis and interpretation. He has delivered more than 100 invited talks at various conferences and training programs and he is a life member of the Indian Society for Probability and Statistics (ISPS) and the Operations Research Society of India. He is a consultant to clinical researchers and currently working as Biostatistician cum Research Coordinator at the Sri Venkateswara Institute of Medical Sciences (SVIMS).

Dr. R. Vishnu Vardhan

Rudravaram Vishnu Vardhan is currently working as Assistant Professor in the Department of Statistics, Pondicherry Central University, Puducherry. His areas of research are Biostatistics - Classification Techniques; Multivariate Analysis; Regression Diagnostics and Statistical Computing. He has published 52 research papers in reputed national and international journals. He has guided two Ph.D. students and 39 P.G. projects. He is a recipient of the Ms. Bhargavi Rao and Padma Vibhushan Prof. C. R. Rao Award for best Poster Presentation in an International Conference in the year 2010, Indian Society for Probability and Statistics (ISPS) Young Statistician Award, December 2011 and Young Scientist Award from the Indian Science Congress, February 2014. He has authored a book and also edited a book. He has presented 34 research papers in national and international conferences/seminars, and delivered 39 invited talks at various reputed institutes/universities in India. He has organized three national workshops, one national conference and one international conference. He is a life member of several professional organizations. He serves as referee and also holds the position of editorial member at reputed journals.

Chapter 1
Multivariate Statistical Analysis—An Overview

  1.1 An Appraisal of Statistical Analysis
  1.2 Structure of Multivariate Problems
  1.3 Univariate Data Description
  1.4 Standard Error (SE) and Confidence Interval (CI)
  1.5 Multivariate Descriptive Statistics
  1.6 Covariance Matrix and Correlation Matrix
  1.7 Data Visualization
  1.8 The Multivariate Normal Distribution
  1.9 Some Interesting Applications of Multivariate Analysis
  Summary
  Do it yourself (Exercises)
  Suggested Reading

I only believe in statistics that I doctored myself.
Galileo Galilei (1564 – 1642)

1.1 An Appraisal of Statistical Analysis

Statistical analysis is an essential component in the contemporary business environment as well as in scientific research. With expanding horizons in all fields of research, new challenges are faced by researchers.
The classical statistical analysis, which is mainly focused on drawing inferences on the parameters of a population, requires a conceptual change towards an exploratory study of large and complex data sets, analyzing them and mining the latent features of the data.

Statistical analysis is generally carried out with two purposes:

1. To describe the observed data for an understanding of the facts, and
2. To draw inferences for the target group (called the population) based on sample data.

The former is known as descriptive statistics and the latter is called inferential statistics.

Since it is impossible or sometimes prohibitive to study the entire population, one prefers to take a sample and measure some characteristics or observe attributes like color, taste, preference etc. The sample data should, however, be unbiased and represent the population so that the inferences drawn from the sample can be generalized to the population, usually with some error. The alternative that avoids this error is to examine the entire population (known as a census), which is not always meaningful, like a proposal to pump out the entire blood from a patient's body to calculate the glucose level!

Large volumes of data with hundreds of variables and several thousands of records constitute today's data sets and the solutions are required in real time. Studying the relationships among the variables provides useful information for decision making with the help of multivariate analytical tools, which refer to the "simultaneous analysis of several inter-related variables".

In a different scenario, a biologist may wish to estimate the toxic effect of a treatment on different organs of an animal. With the help of statistically designed experiments, one can estimate the effect of various prognostic factors on the response variables simultaneously by using multivariate analysis.
In general the multivariate approach is helpful to:

- Explore the joint performance of several variables, and
- Estimate the effect of each variable in the presence of the others (marginal effect).

The science of multivariate analysis is mathematically complicated and the computations are too involved to perform with a desktop calculator. However, with the availability of computer software, the burden of computation is minimized, irrespective of the size of the data. For this reason, during the last three decades multivariate tools have reached the application scientist and multivariate analysis is the order of the day.

Data analysts and some researchers use the terms univariate and multivariate analyses, with the first usually followed by the second. A brief description of these approaches is given below.

Univariate analysis:

Analysis of variables one at a time (each one separately) is known as univariate analysis, in which data is described in terms of mean, mode, median, standard deviation and also by way of charts. Inferences are also drawn separately for each variable, such as comparison of mean values of each variable across two or more groups of cases.

Here are some instances where univariate analysis alone is carried out.

- Daily number of MRI scans handled at a hospital
- Arrival pattern of cases at the emergency department of a hospital
- Residual life (in months) after a cancer therapy in a follow-up study
- Waiting time at toll gates
- Pain scores of patients, such as the Visual Analogue Scale (VAS), after some intervention
- Symptomatic assessment of the health condition of a patient such as presence of fever (yes or no) or diabetic status

Each characteristic is denoted by a random variable X having a known statistical distribution.
A distribution is a mathematical formula used to express the pattern of occurrence of values of X in the presence of an uncertain environment, guided by a probability mechanism. Normal distribution is one such probability distribution used to describe a data pattern of values measured on an interval scale. The normal distribution is described by two parameters, viz., mean (µ) and variance (σ²). When the values of the parameters are not known in advance (from prior studies or by belief) they are estimated from sample data.

One or more hypotheses can also be tested for their statistical significance based on the sample data. The term statistical significance is used to mean that the finding from the study is not an occurrence by chance. This area of analysis is called inferential statistics.

Multivariate analysis:

Analysis of data simultaneously on several variables (characteristics) is often called multivariate analysis. In clinical examination, a battery of parameters (like lipid profile or hemogram) is observed on a single patient to understand a health condition. When only two variables are involved, it is called bivariate analysis, in which correlations, associations and simple relationships (like dose-response) are studied. Multivariate analysis is, however, more general and covers bivariate and univariate analyses within.

Here are some instances that fit into a multivariate environment.

- Anthropometry of a patient (height, weight, waist circumference, waist-hip ratio etc.)
- Scores obtained on Knowledge, Adoption and Practice (KAP) measured on agricultural farmers
- Response to several questions regarding pain relief after physiotherapy (having say 50 questions with response on a 5-point scale)
- Health profile of post-menopausal women (a mixture of questions on different scales)
- Repeated measurements like blood sugar levels taken for the same patient at different time points

Understanding several characteristics, in relation to others, is the essence of a multivariate approach. A group of characteristics on an individual taken together convey more information about the features of the individual than each characteristic separately. Hence several variables will be considered as a group or an array and such arrays are compared across study groups in multivariate analysis. Further, data on several related variables are combined in a suitable way to produce a new value like a composite score to estimate an outcome. This, however, needs extra mathematical effort to handle complex data structures and needs a different approach than the univariate method.

In the following section, a brief sketch of multivariate problems is discussed.

1.2 Structure of Multivariate Problems

Suppose there are k variables on which data is observed from an individual and let there be n individuals in the study. Each variable is in fact a random variable, meaning that the true value on the individual is unknown but governed by a random mechanism that can be explained by rules of probability. The data can be arranged as an array or a profile denoted by

$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_k \end{pmatrix}$$

For instance, scores on X = {Knowledge, Adoption, Practice} form a profile. The complete hemogram of a patient is another example of a profile.

The data obtained from n individuals on X contains information on each of the k components and can be arranged as an (n × k) array or matrix shown as

$$A = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1k} \\ X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{nk} \end{pmatrix}$$

Lowercase letters are used to indicate the values obtained from the corresponding variable $X_{ij}$. For instance, $x_{23}$ indicates the value on the variable $X_3$ from individual 2. Consider the following illustration.

Illustration 1.1

A clinical researcher has collected data on several characteristics listed below from 64 patients with a view to comparing the parameters between cases and controls. Thirty-two patients with Rheumatoid Arthritis (RA) are classified as 'cases' and thirty-two age and gender matched healthy subjects are classified as 'controls'. This categorization variable is shown as 'Group'.

The variables used in the study and their descriptions are given below.

Variable  Parameter  Description                                        Type
X1        Group      Case/Control (1/0)                                 Nominal
X2        AS         Atherosclerosis (AS: 1 = Yes, 0 = No)              Nominal
X3        Age        Age of the patient in years                        Scale
X4        Gen        Gender (0 = Male, 1 = Female)                      Nominal
X5        BMI        Body Mass Index (kg/m²)                            Scale
X6        CHOL       Cholesterol (mg/dl)                                Scale
X7        TRIG       Triglycerides (mg/dl)                              Scale
X8        HDL        High-density lipoprotein cholesterol (mg/dl)       Scale
X9        LDL        Low-density lipoprotein cholesterol (mg/dl)        Scale
X10       VLDL       Very-low-density lipoprotein cholesterol (mg/dl)   Scale
X11       CIMT       Carotid Intima-Media Thickness                     Scale

Table 1.1 shows a sample of 20 records from the study. The analysis and discussion is, however, based on the complete data. This data will be referred to as 'CIMT data' for further reference.

There are 11 variables, out of which some (like CHOL and BMI) are measured on an interval scale while some (like Gen and Group) are coded as 0 and 1 on a nominal scale. The measured variables are called continuous variables and data on such variables is often called quantitative data by some researchers.
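The row-and-column convention can be made concrete with a short Python sketch. It stores the first three records of Table 1.1 as a small (n × k) array and reads off x23, the value of X3 (Age) for individual 2. The list-of-lists layout and the helper names `value` and `column` are illustrative assumptions only; the book itself works with SPSS, MedCalc, Excel add-ins and R.

```python
# Short variable names as listed in Illustration 1.1 (X1 ... X11).
VARIABLES = ["Group", "AS", "Age", "Gen", "BMI", "CHOL",
             "TRIG", "HDL", "LDL", "VLDL", "CIMT"]

# First three records of Table 1.1: one row per individual, one column per variable.
A = [
    [1, 1, 56, 1, 29.55, 160, 91, 29, 112.8, 18.2, 0.600],
    [1, 1, 43, 0, 17.33, 190, 176, 31, 123.8, 35.2, 0.570],
    [1, 1, 45, 0, 29.08, 147, 182, 32, 76.6, 36.4, 0.615],
]


def value(matrix, i, j):
    """Return x_ij using the 1-based indices of the text (individual i, variable j)."""
    return matrix[i - 1][j - 1]


def column(matrix, j):
    """Return all observed values of variable X_j (1-based j)."""
    return [row[j - 1] for row in matrix]


n, k = len(A), len(A[0])        # dimensions of this small excerpt
x23 = value(A, 2, 3)            # Age of individual 2
chol = column(A, 6)             # all CHOL values in the excerpt

print(f"n = {n}, k = {k}")
print(f"x23 ({VARIABLES[2]} of individual 2) = {x23}")
print(f"{VARIABLES[5]} column: {chol}")
```

The 1-based helpers are there only to match the subscripts used in the text; ordinary Python indexing is 0-based.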
TABLE 1.1: Illustrative multivariate data on CIMT

S.No  X1  X2  X3  X4  X5     X6   X7   X8  X9     X10   X11
1     1   1   56  1   29.55  160  91   29  112.8  18.2  0.600
2     1   1   43  0   17.33  190  176  31  123.8  35.2  0.570
3     1   1   45  0   29.08  147  182  32  76.6   36.4  0.615
4     1   1   39  0   24.03  168  121  38  105.6  24.4  0.630
5     1   0   31  0   25.63  146  212  38  65.6   42.4  0.470
6     1   1   34  0   20.54  160  138  36  96.4   27.6  0.630
7     1   0   37  1   18.42  128  156  32  64.8   31.2  0.515
8     1   1   54  0   25.53  212  307  36  114.6  61.4  0.585
9     1   1   59  0   22.67  149  89   38  93.2   17.8  0.615
10    1   0   33  0   21.92  147  95   32  96.0   19.0  0.415
11    0   0   45  0   23.44  122  76   32  74.8   15.2  0.460
12    0   0   37  0   27.48  124  84   42  65.2   16.8  0.440
13    0   0   63  0   19.55  267  153  45  190.4  30.6  0.530
14    0   0   41  0   23.43  175  261  38  84.8   52.2  0.510
15    0   0   51  0   24.44  198  193  35  124.4  38.6  0.410
16    0   0   40  0   26.56  160  79   36  108.2  15.8  0.560
17    0   0   42  0   25.53  130  95   32  79.0   19.0  0.480
18    0   0   45  1   26.44  188  261  32  103.6  52.4  0.510
19    0   0   38  0   25.78  120  79   32  71.2   15.8  0.510
20    0   1   43  1   19.70  204  79   42  146.2  15.8  0.680

(Data courtesy: Dr. Alladi Mohan, Department of Medicine, Sri Venkateswara Institute of Medical Sciences (SVIMS), Tirupati.)

The other variables which are not measured on a scale are said to be discrete and the term qualitative data is used to indicate them. It is important to note that means and standard deviations are calculated only for quantitative data while counts and percentages are used for qualitative data.

The rows represent the records and the columns represent the variables. While the records are independent, the variables need not be. In fact, the data on several variables (columns) exhibit related information and multivariate analysis helps summarize the data and draw inferences about sets of parameters (of variables) instead of a single parameter at a time.

Often, in public health studies there will be a large number of variables used to describe one context.
During analysis, some new variables will also be created like a partial sum of scores, derived values like BMI and so on. The size (n × k) of the data array is called the dimensionality of the problem and the complexity of analysis usually increases with k. This type of issue often arises in health studies and also in marketing.

In the following section we outline the tools used for univariate description of variables in terms of average, spread etc.

1.3 Univariate Data Description

For continuous variables like BMI, Weight etc., the simple average (mean) is a good summary provided that there are no extreme values in the data. It is easy to see that the mean is quickly perturbed by very high or very low values in the data. Such high or low values may sometimes be important in the study; however, their presence along with other values not only gives an incorrect average but also inflates the standard deviation (S.D). Some computer packages label it Std. Dev. or S.D.

For instance, the data set 3, 1, 2, 1, 4, 3 has a mean of 2.33 and a standard deviation of 1.21. Suppose the first and last values are modified and the new data set is -2, 1, 2, 1, 4, 8. This data also has the same average of 2.33 but a standard deviation of 3.39, which shows that the second data set has more spread around the mean than the first one.

Normally distributed data will be summarized as mean ± S.D. If Q3 and Q1 represent the third and first quartiles (75th and 25th percentiles) respectively, then IQR = (Q3 - Q1) is the measure of dispersion used in place of S.D. For a non-symmetric data distribution (age or residual life after treatment), data will be summarized as median and IQR.

For qualitative data like gender, diabetes (yes/no), hypertension (yes/no) etc., the values are discrete and hence the concept of mean or median is not used to summarize the data.
Counts and percentages are used to describe such data, which are expressed on an ordinal or nominal scale instead of an interval scale.

In the next section we discuss the concept and utility of the confidence interval in drawing inferences about the population.

1.4 Standard Error (SE) and Confidence Interval (CI)

Suppose $\bar{x}$ denotes the sample mean of a group of n individuals on a parameter and let s be the S.D of the sample. There are several ways of obtaining samples of size n from a given cohort or target group. So every sample gives a mean which is an estimate of the true but unknown mean ($\mu$) of the cohort.

Let $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_m$ be the means of m samples. Then the distribution (pattern) of $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_m$ is called the sampling distribution of the mean and the S.D of this distribution is called the Standard Error, denoted by SE. Some software report this as Std. Error.

For normally distributed data the mean (average) of $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_m$ is an estimate of $\mu$. We say that $\bar{x}$ is an unbiased estimator of $\mu$ and it is called a point estimate. It can be shown that the SE of the mean is estimated as

\[ SE = \frac{s}{\sqrt{n}} \]

It is customary to provide an interval (rather than a single value) around $\bar{x}$ such that the true value of $\mu$ lies in this interval with 100(1-α)% confidence, where α is called the error rate. Such an interval is called the Confidence Interval (CI), given by

\[ \bar{x} \pm Z_{\alpha/2}\,\frac{s}{\sqrt{n}} \]

where $Z_{\alpha/2}$ is the two-sided critical value of the standard normal distribution. In general the CI takes the form [sample value ± $Z_{\alpha/2}$ × SE]. As a convention we take α = 0.05 so that the 95% CI for the mean will be

\[ \bar{x} \pm 1.96\,\frac{s}{\sqrt{n}} \]

because Z = 1.96 for the normal distribution (NORMSINV(0.975) in MS-Excel returns 1.96; NORMSINV(0.025) returns -1.96).

CI plays a pivotal role in statistical inference. It is important to note that every estimate (sample value) contains some error, expressed in terms of SE and CI.

The following section contains methods of describing multivariate data.
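As a quick check of the figures quoted in Sections 1.3 and 1.4, the sketch below recomputes the mean and S.D of the two small data sets and then builds a 95% confidence interval for the first one. It is only an illustration using Python's standard library: the function name `confidence_interval` is made up here, and the normal critical value is applied as in the text even though n = 6 is small (an assumption for illustration, not a recommendation).

```python
import math
import statistics

def confidence_interval(values, confidence=0.95):
    """Return (lower, upper) for the mean using the normal critical value."""
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)  # SE = s / sqrt(n)
    alpha = 1 - confidence
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    return mean - z * se, mean + z * se

first = [3, 1, 2, 1, 4, 3]
second = [-2, 1, 2, 1, 4, 8]

for name, data in (("first", first), ("second", second)):
    # statistics.stdev uses the (n - 1) denominator
    print(name, round(statistics.mean(data), 2), round(statistics.stdev(data), 2))

lower, upper = confidence_interval(first)
print("95% CI for the first data set:", round(lower, 2), "to", round(upper, 2))
```

Both data sets share the same mean, while the second has a much larger S.D, which is the point made in Section 1.3.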
1.5 Multivariate Descriptive Statistics

When the multivariate data is viewed as a collection of variables, each of them can be summarized in terms of descriptive statistics known as the mean vector and the variance-covariance matrix. For the data array X we define the following.

Mean vector:

This is defined as

\[
\bar{\mathbf{x}} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_k \end{pmatrix}
\quad \text{where} \quad
\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij} \quad \text{for } j = 1, 2, \ldots, k \tag{1.1}
\]

(mean of all values for the jth column) is the sample mean of $X_j$. The mean is expressed in the original measurement units like centimeters, grams etc.

Variance-Covariance matrix:

Each variable $X_j$ will have some variance that measures the spread of values around its mean. This is given by the sample variance

\[ s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \quad \text{for } j = 1, 2, \ldots, k \tag{1.2} \]

One might expect the denominator in Equation 1.2 to be n, but (n - 1) is used to correct for the small sample bias (specifically for normally distributed data) and gives what is called an unbiased estimate of the population variance for this variable. Most computer codes use Equation 1.2 since it works for both small and large n.

Variance will have squared units like cm², gm² etc., which are difficult to interpret along with the mean. Hence, another measure called the standard deviation is used, which is expressed in natural units and is always non-negative. When all the data values are the same, the standard deviation is zero. The sample standard deviation of $X_j$ is given by

\[ s_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2} \quad \text{for } j = 1, 2, \ldots, k \tag{1.3} \]

Another measure used in multivariate analysis is the measure of co-variation between each pair of variables, called the covariance. If $X_j$ and $X_l$ constitute a pair of variables then the covariance is defined as

\[ s_{jl} = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{il} - \bar{x}_l) \quad \text{for } j = 1, 2, \ldots, k;\; l = 1, 2, \ldots, k \tag{1.4} \]

The covariance is also called the product moment and measures the simultaneous variation in $X_j$ and $X_l$. There will be $k(k-1)/2$ distinct covariances for the vector X. From Equation 1.4 it is easy to see that $s_{jl}$ is the same as $s_{lj}$, and that the covariance of $X_j$ with itself (l = j) is nothing but the variance of $X_j$.

The variances and covariances can be arranged in the form of a matrix called the variance-covariance matrix as follows.

\[
S = \begin{pmatrix}
s_1^2 & s_{12} & \cdots & s_{1k} \\
s_{21} & s_2^2 & \cdots & s_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
s_{k1} & s_{k2} & \cdots & s_k^2
\end{pmatrix}
\]

We also use the notation $s_j^2 = s_{jj}$ so that every element is viewed as a covariance and we write

\[
S = \begin{pmatrix}
s_{11} & s_{12} & \cdots & s_{1k} \\
s_{21} & s_{22} & \cdots & s_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
s_{k1} & s_{k2} & \cdots & s_{kk}
\end{pmatrix} \tag{1.5}
\]

When each variable is studied separately, it will have only a variance and the concept of covariance does not arise. The overall spread of data around the mean vector can be expressed by a single metric, useful for comparison of multivariate vectors, in either of the following ways.

1. The generalized sample variance, defined as the determinant of the covariance matrix (|S|).
2. The total sample variance, defined as the sum of all diagonal terms in the matrix S. This is also called the trace, given by tr(S) = $s_{11} + s_{22} + \cdots + s_{kk}$.

Correlation matrix:

The important descriptive statistic in multivariate analysis is the sample correlation coefficient, denoted by $r_{jl}$ between $X_j$ and $X_l$, known as Pearson's product-moment correlation coefficient, proposed by Karl Pearson (1857–1936). It is a measure of the strength of the linear relationship between a pair of variables. It is calculated as

\[
r_{jl} = \frac{s_{jl}}{\sqrt{s_{jj}\,s_{ll}}}
= \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{il} - \bar{x}_l)}
{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \,\sum_{i=1}^{n} (x_{il} - \bar{x}_l)^2}} \tag{1.6}
\]

We sometimes write this simply as r without any subscript. The value of r lies between -1 and +1. A higher absolute value of r indicates a stronger linear relationship than a lower one. Further, the correlation coefficient is symmetric in the sense that r between X and Y is the same as r between Y and X.
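The definitions in Equations 1.4 and 1.6 translate almost line for line into code. The following is a minimal sketch in plain Python, with function names chosen here for illustration; it uses the first five rows of columns X6 and X9 from Table 1.1 purely as sample input and checks the two properties stated above, namely that the covariance is symmetric and that the covariance of a variable with itself is its variance.

```python
import math
import statistics

def covariance(x, y):
    """Sample covariance with the (n - 1) denominator (Equation 1.4)."""
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation r = s_xy / sqrt(s_xx * s_yy) (Equation 1.6)."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# First five rows of columns X6 and X9 in Table 1.1
x6 = [160, 190, 147, 168, 146]
x9 = [112.8, 123.8, 76.6, 105.6, 65.6]

s_jl = covariance(x6, x9)
s_lj = covariance(x9, x6)   # same value: the covariance is symmetric
s_jj = covariance(x6, x6)   # equals the sample variance of x6
r = correlation(x6, x9)

print("covariance:", round(s_jl, 3))
print("variance of X6:", round(s_jj, 3))
print("correlation:", round(r, 3))
```

With only five records this is not a meaningful estimate; the point is just to show how the sums in the formulas are formed.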
The correlation coefficient, however, does not indicate any cause and effect relationship but only says that each variable in the pair moves in tandem with the other (r > 0) or opposes the other (r < 0).

When r = +1 it means a perfect positive linear (straight line) relationship and r = -1 is a perfect negative linear relationship. Again r = 0 implies that $X_j$ and $X_l$ are uncorrelated, meaning that they lack a linear relationship between them. It does not, however, mean that these two variables are independent, because Equation 1.6 measures only linear association: there could be a quadratic or other non-linear relationship between them, or there may not be any relationship at all.

The correlations measured from the data can be arranged in the form of a matrix called the correlation matrix, given by

\[
R = \begin{pmatrix}
1 & r_{12} & \cdots & r_{1k} \\
r_{21} & 1 & \cdots & r_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
r_{k1} & r_{k2} & \cdots & 1
\end{pmatrix} \tag{1.7}
\]

This is a symmetric matrix: each element below the diagonal equals the corresponding element above it. Such matrices have importance in multivariate analysis. Even though it is enough to display only the elements above or below the diagonal, some software packages show the complete (k × k) correlation matrix.

The correlation coefficient has an important property whereby it remains unchanged when data is modified with a linear transformation (multiplying by a positive constant or adding a constant; multiplying one variable by a negative constant only reverses the sign of r). In medical diagnosis some measurements are multiplied by a scaling factor like 100 or 1000, or 1/100 or 1/1000, for the purpose of interpretation. Sometimes the data is transformed to a percentage to overcome the effect of units of measurement, but r remains unchanged.

Another important property of the correlation coefficient is that it is a unit-free value or pure number, meaning that with different units of measurement on $X_j$ and $X_l$, Equation 1.6 produces a value without any units.
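The invariance property is easy to confirm numerically. In the sketch below the five (x, y) pairs are hypothetical values invented for the demonstration, and `pearson_r` is an illustrative helper that follows Equation 1.6; rescaling and shifting either variable leaves r unchanged, while a negative multiplier flips its sign.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation computed from sums of deviations (Equation 1.6)."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements, not taken from the book's data
x = [1.0, 2.0, 4.0, 5.0, 7.0]
y = [2.1, 2.9, 5.2, 5.8, 8.1]

r_original = pearson_r(x, y)
# Change of units: x multiplied by 100 and shifted by 5, y divided by 1000
r_rescaled = pearson_r([100 * a + 5 for a in x], [b / 1000 for b in y])
# A negative multiplier reverses the direction of the relationship
r_flipped = pearson_r([-a for a in x], y)

print(round(r_original, 4), round(r_rescaled, 4), round(r_flipped, 4))
```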
This makes it possible to understand the strength of the relationship among several pairs of variables with different measurement units by just comparing the correlation coefficients.

Coefficient of determination:

The square of the correlation coefficient, r², is called the coefficient of determination. The correlation coefficient (r) assumes that there is a linear relationship between X and Y. The larger the absolute value of r, the stronger the linear relationship. While r can be positive or negative, r² is never negative and is used to measure the proportion of variation in one variable explained by the other variable. For instance, when r = 0.60, we get r² = 0.36, which means that only 36% of the variation in one variable is explained by its linear relationship with the other. There could be several reasons for such a low value of r², such as a wide scatter of points around the linear trend, or a nonlinear relationship which the correlation coefficient cannot detect. Consider the following illustration.

Illustration 1.2 Reconsider the data used in Illustration 1.1. We will examine the data to understand its properties and to know the inter-relationships among the variables. Let us consider a profile of measurements with four variables CHOL, TRIG, HDL and LDL.

From the data the following descriptive statistics can be obtained using SPSS → Analyze → Descriptives. One convention is to present the summary as mean ± S.D.

Profile: All cases (n = 64)

Variable  Mean     Std. Dev.
CHOL      165.688  34.444
TRIG      134.672  75.142
HDL       36.281   5.335
LDL       102.488  28.781

The data contains a classification variable Group = 1 or 0. While processing the data we can attach the true labels instead of numeric codes. We can also view the above profile simultaneously for the two groups as shown in Table 1.2.

TABLE 1.2: Descriptive statistics for the profile variables

                   Cases (n = 32)       Controls (n = 32)
Profile variables  Mean     Std. Dev    Mean     Std. Dev
CHOL               164.156  28.356      167.219  40.027
TRIG               126.750  60.690      142.594  87.532
HDL                36.125   6.084       36.438   4.557
LDL                102.419  25.046      102.556  32.497

In the univariate analysis, we are interested in comparing the mean value of a parameter like HDL between the two groups to find whether the difference in the means could be considered significant. This exercise is done for each variable, independent of the others. In the multivariate context we compare the 'mean vector' between the two groups, while in the univariate environment the mean values are compared for one variable at a time.

With the availability of advanced computing tools, profiles can be visualized in two and three dimensions to understand what the data describes. For instance, the individual distributions of LDL and CHOL are shown by histograms separately in Figure 1.1a and Figure 1.1b, which are univariate displays. The joint distribution of LDL and CHOL is shown by a 3D-histogram in Figure 1.2. (Hint: In SPSS we use the commands Graphs → Graphboard Template Chooser → Select CHOL and LDL using Ctrl key → Press Ok. Manage colors.)

Since there are only two variables, visualization is possible in three dimensions and with more than three variables it is not possible to visualize the pattern.

(a) Univariate distribution of LDL. (b) Univariate distribution of CHOL.
FIGURE 1.1: Univariate distribution.

The maximum number of dimensions one can visualize is only three, viz., length, breadth and height/depth. Beyond three dimensions, data is understood by numbers only.

We will see in the following section that the four variables listed above are correlated with each other. In other words, a change in the values of one variable tends to be accompanied by a proportionate change in the other variables.
Therefore, when the profile variables are correlated with each other, independent univariate analysis is not correct and multivariate analysis should be used to compare the entire profile (all four variables simultaneously) between the groups. If the difference is significant, then independent univariate comparisons have to be made as a Post Hoc analysis.

FIGURE 1.2: Bivariate distribution of LDL & CHOL.

In some situations, we need to compare the mean values of a variable at different time points (called longitudinal comparison) to observe changes over time. For instance, the changes in the lipid profile or hemogram will be compared at baseline, after treatment and after a follow-up. Multivariate comparison tests are used in this context.

In the next section we discuss a measure called the variance-covariance matrix to express the spread among the components of a multivariate panel.

1.6 Covariance Matrix and Correlation Matrix

In addition to means and variances, the covariance structure of variables (within a profile) plays an important role in multivariate analysis. Covariance is a measure of the joint variation between two variables as defined in Equation 1.4. For the 4-variable profile given in Illustration 1.2, the matrix of variances and covariances is obtained as shown below. Some software packages call this the covariance matrix instead of the variance-covariance matrix.

      CHOL      TRIG      HDL      LDL
CHOL  1186.409  1007.213  93.042   888.936
TRIG  1007.213  5646.319  -36.224  -68.891
HDL   93.042    -36.224   28.459   70.785
LDL   888.936   -68.891   70.785   828.340

The variances lie along the diagonal and the off-diagonal values are the covariances. The covariance terms represent the joint variation between pairs of variables. For instance the covariance between CHOL and LDL is 888.936. A higher value indicates more covariation.
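One way to sanity-check a covariance matrix such as the one above is to rescale it: the square roots of the diagonal entries should match the standard deviations reported in Illustration 1.2, and dividing each entry by the product of the two standard deviations (Equation 1.6) yields unit-free correlations. The sketch below does this in plain Python; the helper names are illustrative and not taken from any package. It also prints the total sample variance (the trace) defined in Section 1.5.

```python
import math

names = ["CHOL", "TRIG", "HDL", "LDL"]
# Variance-covariance matrix of the four profile variables, as printed above
S = [
    [1186.409, 1007.213, 93.042, 888.936],
    [1007.213, 5646.319, -36.224, -68.891],
    [93.042, -36.224, 28.459, 70.785],
    [888.936, -68.891, 70.785, 828.340],
]

def std_devs(cov):
    """Standard deviations are the square roots of the diagonal entries."""
    return [math.sqrt(cov[j][j]) for j in range(len(cov))]

def to_correlation(cov):
    """Rescale every covariance s_jl by s_j * s_l (Equation 1.6)."""
    sd = std_devs(cov)
    k = len(cov)
    return [[cov[j][l] / (sd[j] * sd[l]) for l in range(k)] for j in range(k)]

def trace(cov):
    """Total sample variance: the sum of the diagonal entries."""
    return sum(cov[j][j] for j in range(len(cov)))

R = to_correlation(S)
print("S.D:", [round(v, 3) for v in std_devs(S)])
for name, row in zip(names, R):
    print(name, [round(v, 3) for v in row])
print("total sample variance:", round(trace(S), 3))
```

Because the rescaling removes the units, the resulting values can be compared across pairs of variables, which raw covariances do not allow.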
For the end user, it is difficult to interpret the covariance because its units are the product of two natural units. Instead of covariance, we can use a related measure called the correlation coefficient, which is a pure number (free of any units). However, the mathematical treatment of several multivariate problems is based on the variance-covariance matrix itself.

The correlation matrix for the profile variables is shown below.

      CHOL   TRIG    HDL     LDL
CHOL  1      0.389   0.506   0.897
TRIG  0.389  1       -0.090  -0.032
HDL   0.506  -0.090  1       0.461
LDL   0.897  -0.032  0.461   1

The correlation coefficients on the diagonal are all equal to 1 and all the values above the diagonal are identical to the corresponding values below it. Since, by definition, the correlation between variables is symmetric, only one triangle of the matrix needs to be shown. Some statistical packages like the Data Analysis ToolPak for MS-Excel show only one triangle of the matrix. Some MS-Excel Add-ins (for instance Real Statistics Add-ins) offer interesting Data Analysis Tools which can be added to MS-Excel. More details on MS-Excel and SPSS for statistical analysis can be found in Sarma (2010).

It can be seen that CHOL has a strong positive relationship with LDL (r = 0.897), which means that when one variable increases, the other one also tends to increase. In contrast, the correlation coefficient between LDL and TRIG (r = -0.032) is very low and negative.

Both the covariance matrix and the correlation matrix play a fundamental role in multivariate analysis.

The next section contains a discussion of methods for and the advantages of data visualization.

1.7 Data Visualization

Data visualization is an important feature in multivariate analysis which helps in understanding the relationships among pairs of variables. This can be done with a tool called a matrix scatter plot that simultaneously shows the plots of all pairs of related variables and also across groups of cases.
Creation and interpretation of such plots are discussed in Illustration 1.1. With the availability of computer software it is now easy to visualize the data on several variables simultaneously in a graph and understand the data pattern. Some important graphs used in data analysis and visualization are discussed below.

Histogram:

This is a graph used to understand the data pattern of a continuous variable with a grouped distribution. The count or frequency of values that fall in an interval (called a bin) is plotted as a vertical rectangle with height proportional to the count and width proportional to the interval. For normally distributed values the histogram will be symmetric with its peak neither high nor low. Data with a long tail of low values will be left skewed and data with a long tail of high values will be right skewed. Consider the following illustration.

Illustration 1.3 Consider the data of Illustration 1.1. Let us examine the histogram of LDL. The histogram can be constructed with the following SPSS options.

a) Graphs → Legacy Dialogs → Histogram.
b) Push LDL to variable.
c) Check display normal curve.
d) Press OK.

The chart is produced with minimum features. With a double click on the chart the required options can be inserted into the chart, which then looks like the one given in Figure 1.3. It is easy to see that the shape of the distribution is more or less symmetric. There are 10 patients in the LDL bin of 60-80. With a double click on the SPSS chart, we get options to change the bin width or the number of bins to show. With every change, the shape of the histogram changes automatically.

FIGURE 1.3: Histogram of LDL with normal distribution.

In well-distributed data, we expect a specific shape called the normal distribution as shown by the embedded curve on the vertical bars.
In brief, the normal distribution has important properties like a) mean = median, b) symmetric around the mean (skewness measure = 0) and c) neither flat nor peaked (kurtosis measure = 3). No real life data shows a perfect normal shape but it often exhibits a pattern close to it. The researcher can use either personal judgement or another graphic procedure called the P-P plot to accept normality. Further, the shape of the histogram depends on the choice of the number of bins or bin width. (Double click on the histogram in the SPSS output and change the width to see the effect on the histogram!)

Why normality? Several statistical tools of inference are based on the assumption that the data can be explained by a theoretical model of normal distribution. In simple terms, normality indicates that about 95% of the individual values lie within 2 standard deviations on either side of the mean. Values outside this window can be considered as abnormal or simply outliers.

Bar chart:

When the data is discrete like gender, case/control or satisfaction level, we do not draw a histogram but a bar chart. In a bar chart the individual bars are separated by a gap (to indicate that they are distinct categories!).

Pie chart:

This chart is used to display the percentage of different cases as segments of a circle marked by different colors or line patterns. The label for each segment could be either the actual number or the percentage. We use this chart only when the components within the circle add up to 100%.

All the above charts can also be drawn by using simple software like MS-Excel, MedCalc, Stata, R and Statgraphics.

Box & Whisker plot:

This is a very useful method for comparing multivariate data. Tukey (1977) proposed this chart for data visualization and it is commonly used in exploratory analysis and business analytics.
The concentration of data is shown as a vertical box and the variation around the median is shown by vertical lines. For symmetrically distributed data (like normal) the two edges of the box will be at equal distances from the middle line (median). The difference (Q3 - Q1) is called the Inter Quartile Range (IQR); it represents the height of the box, which holds 50% of the data. Typical box plots are shown in Figure 1.4.

FIGURE 1.4: Box and Whisker plot for CHOL by gender and group.

The box plot is also used to identify the outliers, which are values that are considered unusual or abnormal and are defined as follows.

Outliers are those values which are either a) 3×IQR or more above the third quartile or b) 3×IQR or more below the first quartile.

Suspected outliers are either 1.5×IQR or more above the third quartile or 1.5×IQR or more below the first quartile.

There will be vertical lines called whiskers terminated by a crosshatch. Whiskers are drawn in two ways.

a) The whiskers are usually drawn from the top of the box to the maximum and from the bottom of the box to the minimum.
b) If either type of outlier is present, the whisker on the appropriate side is taken to 1.5×IQR from the quartile (the "inner fence") rather than the maximum or minimum. The individual outlying data points are displayed as unfilled circles (for suspected outliers) or filled circles (for outliers). (The "outer fence" is 3×IQR from the quartile.)

Lower Whisker = smallest data value that is not below (Q1 - 1.5×IQR) and Upper Whisker = largest data value that is not above (Q3 + 1.5×IQR).

When the data is normally distributed the median and the mean will be identical and further IQR = 1.35×σ, so that the inner fences lie 1.5 × 1.35σ = 2.025σ beyond the quartiles, that is, about 2.7σ from the median of the data.

In Figure 1.4 we observe that the CHOL for men in cases has an outlier labelled as 26.
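The fence rules above can be sketched in a few lines of Python. The function name `box_plot_summary` is illustrative, the input is column X6 of Table 1.1, and the quartiles use the "inclusive" convention of the standard library; other packages (including SPSS) may use a different quartile convention, so the flagged points can differ slightly from one program to another.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, whisker ends and flagged points for a box plot."""
    data = sorted(values)
    # "inclusive" is one of several quartile conventions; packages may differ.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # inner fences
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr      # outer fences
    inside = [v for v in data if inner_low <= v <= inner_high]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whiskers": (min(inside), max(inside)),
        "suspected": [v for v in data
                      if outer_low <= v < inner_low or inner_high < v <= outer_high],
        "outliers": [v for v in data if v < outer_low or v > outer_high],
    }

# Column X6 of Table 1.1 (20 records)
x6 = [160, 190, 147, 168, 146, 160, 128, 212, 149, 147,
      122, 124, 267, 175, 198, 160, 130, 188, 120, 204]
summary = box_plot_summary(x6)
print(summary)

# Normal-theory check: IQR in units of sigma, and the distance of the
# inner fence from the median
z75 = statistics.NormalDist().inv_cdf(0.75)
iqr_in_sigma = 2 * z75
fence_from_median = z75 + 1.5 * iqr_in_sigma
print(round(iqr_in_sigma, 3), round(fence_from_median, 3))
```

Under this convention the value 267 falls between the inner and outer fences and is flagged as a suspected outlier, so the upper whisker stops at 212.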
When the data has outliers, there are several methods of estimating the mean, standard deviation and other parameters. This is known as robust estimation. One practice is to exclude the top 5% and bottom 5% of values and estimate the parameters (provided this does not suppress salient cases of data). For instance the trimmed mean is the mean value that is obtained after trimming the top and bottom 5% extreme values and hence it is more reliable in the presence of outliers. The 'explore' option of SPSS gives this analysis.

Scatter diagram:

Another commonly used chart that helps visualization of correlated data is the scatter diagram. Let X and Y be two variables measured on an interval scale, like BMI, glucose level etc. Let there be n patients for whom data is available as pairs (xi, yi) for i = 1, 2, ..., n. The plot of yi against xi marked as dots produces a shape similar to the scatter of a fluid on a surface and hence the name scatter diagram. A scatter diagram indicates the nature of the relationship between the two variables as shown in Figure 1.5.

(a) Positive relationship. (b) Negative relationship. (c) No relationship.
FIGURE 1.5: Scatter diagram with direction of relationship (Y plotted against X in each panel).

Referring to Figure 1.5a it follows that Y increases as X increases and hence it is a case of a positive relationship. In Figure 1.5b the relationship is in the opposite direction. Here, Y decreases as X increases and hence it is a case of a negative relationship. When no specific pattern appears, as in Figure 1.5c, we observe that there is no clear linear relationship between X and Y.
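Returning to the trimmed mean described under robust estimation earlier in this section, a simple version can be sketched as follows. This is an assumption-laden illustration: `trimmed_mean` is a hypothetical helper that drops whole observations from each end, whereas statistical packages may treat fractional counts differently, so their reported value need not agree exactly. The input is again column X6 of Table 1.1, which contains one unusually high value (267).

```python
import statistics

def trimmed_mean(values, proportion=0.05):
    """Mean after dropping the given proportion of values from each end."""
    data = sorted(values)
    cut = int(len(data) * proportion)  # whole observations removed per end
    kept = data[cut:len(data) - cut] if cut else data
    return statistics.mean(kept)

# Column X6 of Table 1.1 (20 records)
x6 = [160, 190, 147, 168, 146, 160, 128, 212, 149, 147,
      122, 124, 267, 175, 198, 160, 130, 188, 120, 204]

plain = statistics.mean(x6)
trimmed = trimmed_mean(x6)  # drops the single lowest and highest values
print(round(plain, 2), round(trimmed, 2))
```

The trimmed value is a little lower than the ordinary mean because the high value has been set aside along with the lowest one.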
Let us revisit Illustration 1.1 and examine the scatter diagrams of the profile variables with the help of SPSS. The chart is shown in Figure 1.6, which is known as a scatter matrix.

FIGURE 1.6: Scatter matrix for three variables.

It can be seen that a clear positive relationship exists between CHOL and LDL while the other relationships are not clearly positive. This chart helps understand all possible correlations simultaneously. The histograms on the diagonal show the distribution of each variable, to help understand the variation within that variable. This chart is produced with the SPSS option in the Graphboard Template Chooser menu called scatter plot matrix (SPLOM).

In a different version of this chart, the diagonal elements in the matrix are left blank because the same variable is involved in the pair, for which no scatter exists.

When the scatter is too wide, it may indicate abnormal data and the scatter is vague without a trend. In such cases, removal of a few widely scattered points may lead to a recognizable pattern. One can do this type of exercise using MS-Excel for scatter charts directly.

A 3D scatter plot is another tool useful for understanding the multivariate scatter of three variables at a time. This can be worked out with SPSS from the Graphboard Template Chooser but can be presented differently using R software, as shown in Figure 1.7. The following R-code is used to produce the 3D chart.

FIGURE 1.7: A 3D-Scatter chart using R.

R Code:
library(MASS)
# reading a file
cimt.data [...]

[...] 2 since we can at most observe a 3-dimensional plot.
When k = 2 we get the case of the bivariate normal distribution in which only two variables $X_1$ and $X_2$ are present and the parameters $\boldsymbol{\mu}$ and $\Sigma$ of Equation 1.8 take the simple form

\[
\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},
\qquad
\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}
\]

where $\mu_1$ and $\mu_2$ are the means of $X_1$ and $X_2$, $\sigma_1^2$ and $\sigma_2^2$ are the two variances and $\sigma_{12}$ is the covariance between $X_1$ and $X_2$. If $\rho$ denotes the correlation coefficient between $X_1$ and $X_2$, then we can write the covariance as $\sigma_{12} = \rho\sigma_1\sigma_2$. Therefore in order to understand the bivariate normal distribution we need five parameters ($\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$ and $\rho$).

Thus, the bivariate normal distribution has a lengthy but interesting formula for the density function, given by

\[
f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\; e^{-Q}, \qquad -\infty < x_1, x_2 < \infty
\]

where

\[
Q = \frac{1}{2(1-\rho^2)}\left[
\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2
+ \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2
- 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right)
\right] \tag{1.10}
\]

Given the values of the five parameters, it is possible to plot the density function given in Equation 1.10 as a 3D plot. A visualization of the bivariate normal density is given in Figure 1.8.

FIGURE 1.8: Simulated bivariate normal density.

For the case of more than two variables, summary statistics like mean vectors and covariance matrices will be used in the analysis.

In the following section an outline is given of different applications of multivariate data. The importance of these tools in handling datasets with large numbers of variables and cases is also mentioned. Most of these applications are data-driven and need software to perform the calculations.

1.9 Some Interesting Applications of Multivariate Analysis

a) Visual understanding of data: We can use software to view and understand the distribution of several variables graphically.
A histogram or a line chart can be used for this purpose. A multivariate scatter diagram is a special case of interest to understand the mutual relationships among variables.

b) Hidden relationships: There could be several direct and indirect relationships among variables which can be understood with the help of partial correlations and canonical correlations.

c) Multivariate ANOVA (MANOVA): The multivariate analogue of Analysis of Variance (ANOVA) is known as Multivariate ANOVA (MANOVA). Suppose we observe 3 response variables in 3 groups defined by treatment dose, say Placebo, 10mg and 20mg. For each group we get a vector (profile) of 3 means. Using MANOVA we can test for the significance of the difference among the 3 profiles (not individual variables) between groups. Once we confirm that the profiles differ significantly we employ one way ANOVA for each variable separately. If the profiles do not differ significantly, further analysis is not necessary.

d) Repeated Measures Analysis (RMANOVA): Suppose we take the blood sugar level of each of 10 patients at two time points, viz., fasting and post-prandial (PP). To compare the average glucose level at the two time points, we use the paired t-test. Suppose instead the blood sugar is measured at 4 times, say at 9am, 11am, 1pm and 3pm. To compare these values we use a multivariate tool called Repeated Measures Analysis of Variance.

e) Multiple regression: The joint effect of several explanatory variables on one response variable (Y) can be studied with the help of a multiple linear regression model. It is a cause and effect situation and we wish to study the joint as well as the marginal effect of input variables on the response. A commonly used method of regression is known as stepwise regression. It helps in identifying the most influential variables that affect the response.

f) Principal components: Sometimes psychological studies involve hundreds of variables, each representing the response to a question.
It is difficult to handle a large number of variables in a regression model and this poses a problem of dimensionality. Principal component analysis is a tool to reduce dimensionality. It so happens that among all the variables, some of them can be combined in a particular manner so that we produce a mixture of original variables. They indicate 'latent' features that are not directly observable from the study subjects. We can also extract several such mixtures so that they are uncorrelated! The number of such mixtures, called principal components, will be fewer than the number of original variables.

g) Factor analysis: Factor Analysis (FA) is another multivariate tool that works on the method of identifying the latent or hidden factors in the data. They are like the principal components and are new entities called factors. For instance, courtesy could be a hidden factor for the success at a blood bank counter. It cannot be measured directly but can be expressed in different observable quantities like (1) warm receiving at the counter (yes/no), (2) throwing away the request form (yes/no), (3) careless answering (yes/no) or (4) being too talkative (yes/no) etc. These are four observable indicators of the hidden factor, which might have been extracted from a large set of original variables.

h) Classification problems: In these problems we are interested in establishing a multiple variable model to predict a binary outcome like presence or absence of a disease or a health condition. By observing a sufficient number of instances where the event is known, we develop a formula to discriminate between the two groups. This is also called a supervised learning method by data scientists. When there are no predefined groups, the classification is called unsupervised learning and the technique used is cluster analysis.

We end this chapter with the observation that a holistic approach to understanding data is multivariate analysis.
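To make item (b) of the list above a little more concrete, the sketch below computes a first-order partial correlation from three ordinary correlations using the standard textbook formula. The function name is illustrative, and the inputs are the CHOL, LDL and HDL correlations reported in Section 1.6; treating HDL as the variable to hold fixed is an arbitrary choice made only for this example.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between X and Y after removing the linear effect of Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Correlations taken from the correlation matrix in Section 1.6
r_chol_ldl = 0.897
r_chol_hdl = 0.506
r_ldl_hdl = 0.461

partial = partial_correlation(r_chol_ldl, r_chol_hdl, r_ldl_hdl)
print("CHOL-LDL correlation given HDL:", round(partial, 3))
```

The partial value stays close to the ordinary correlation here, which suggests that the CHOL-LDL relationship is not merely a by-product of their shared association with HDL.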
Summary

The conventional approach toward analysis in many studies is univariate. Statistical tests like the t-test are aimed at comparing the mean values among treatment groups. This is usually done for one outcome variable at a time. The measurements made on a patient are usually correlated and this structure conveys more supportive information in decision making. Multivariate analysis is a class of analytical tools (mostly computer intensive) that provide great insight into the problem. We have highlighted the importance of vectors and matrices to represent multivariate data. A correlation matrix and a visual description of the data help to understand the inter-relationships among the variables of study.

Do it yourself (Exercises)

1.1 Consider the following data from 20 patients about three parameters denoted by X1, X2 and X3 on the Bone Mineral Density (BMD).

S.No  Gender  X1     X2     X3
1     Male    0.933  0.820  0.645
2     Female  0.889  0.703  0.591
3     Female  0.937  0.819  0.695
4     Female  0.874  0.733  0.587
5     Male    0.953  0.824  0.688
6     Female  0.671  0.591  0.434
7     Female  0.914  0.714  0.609
8     Female  0.883  0.839  0.646
9     Female  0.749  0.667  0.591
10    Male    0.875  0.887  0.795
11    Male    0.715  0.580  0.433
12    Female  0.932  0.823  0.636
13    Male    0.800  0.614  0.518
14    Female  0.699  0.664  0.541
15    Male    0.677  0.547  0.497
16    Female  0.813  0.613  0.450
17    Male    0.851  0.680  0.569
18    Male    0.888  0.656  0.462
19    Female  0.875  0.829  0.620
20    Male    0.773  0.637  0.585

(a) Find out the mean BMD profile for male and female patients.
(b) Generate the covariance matrix and study the symmetry in the covariance terms.
(c) Obtain the correlation matrix and the matrix scatter plot for the profile without reference to gender.

1.2 The following data refers to four important blood parameters, viz., Hemoglobin (Hgb), ESR, B12 and Ferritin obtained by a researcher in a hematological study from 20 patients.
(a) Construct the mean profile and covariance matrix among the variables.
(b) Draw a Box and Whisker plot for all the variables.
(c) Find the correlation matrix and identify which correlations are high (either positive or negative).

S.No  Age  Hgb   ESR  B12  Ferritin
1     24   12.4  24   101  15.0
2     50   5.8   40   90   520.0
3     16   4.8   110  92   2.7
4     40   5.7   90   97   21.3
5     35   8.5   6    102  11.9
6     69   7.5   120  108  103.0
7     23   3.9   120  115  141.0
8     28   12.0  30   144  90.0
9     39   10.0  70   155  271.0
10    14   6.8   10   159  164.0
11    32   9.1   4    223  3.5
12    28   8.8   20   250  15.0
13    23   7.0   4    257  3.2
14    56   7.9   40   313  25.0
15    35   9.2   60   180  1650.0
16    44   10.2  60   88   11.2
17    46   13.7  10   181  930.0
18    48   10.0  40   252  30.0
19    44   9.0   30   162  44.0
20    22   6.3   40   284  105.0

(Data courtesy: Dr. C. Chandar Sekhar, Department of Hematology, Sri Venkateswara Institute of Medical Sciences (SVIMS), Tirupati.)

1.3 The following table represents the covariance matrix among 5 variables.

          Age      Hgb      ESR       B12        Ferritin
Age       184.193
Hgb       6.619    5.545
ESR       85.729   -35.074  1199.306
B12       230.922  153.070  299.035   62449.460
Ferritin  368.284  163.926  94.734    -8393.780  136183.600

Obtain the correlation matrix by using the formula $r(X,Y) = \mathrm{Cov}(X,Y)/(S_X S_Y)$.

1.4 For the data given in Illustration 1.1, categorize age into meaningful groups and obtain the histogram of CIMT for group = 1.

1.5 Obtain the matrix-scatter plot of Hgb, ESR and B12 using SPSS from the data given in Exercise 1.2.

1.6 A dot plot is another way of visualizing data. MedCalc has some interesting graphs using dot plots. Use the data given in Table 1.1 in the Appendix and obtain the dot plot of (a) BMI and (b) CIMT groupwise and compare them with the box plot.

Suggested Reading

1. Rencher, A.C., & Christensen, W.F. 2012. Methods of Multivariate Analysis. 3rd ed. Brigham Young University: John Wiley & Sons.
2. Johnson, R.A., & Wichern, D.W. 2014.
Applied Multivariate Statistical Analysis. 6th ed. Pearson New International Edition.
3. Ramzan, S., Zahid, F.M., & Ramzan, S. 2013. Evaluating multivariate normality: A graphical approach. Middle-East Journal of Scientific Research. 13(2):254–263.
4. Anderson, T.W. 2003. An Introduction to Multivariate Statistical Analysis. 3rd ed. New York: John Wiley.
5. Bhuyan, K.C. 2005. Multivariate Analysis and Its Applications. India: New Central Book Agency.
6. Tukey, J.W. 1977. Exploratory Data Analysis. Addison-Wesley.
7. Crawley, M.J. 2012. The R Book. 2nd ed. John Wiley & Sons.
8. Sarma, K.V.S. 2010. Statistics Made Simple: Do It Yourself on PC. 2nd ed. Prentice Hall India.

Chapter 2

Comparison of Multivariate Means

2.1 Multivariate Comparison of Mean Vectors
2.2 One-sample Hotelling's T² Test
2.3 Confidence Intervals for Component Means
2.4 Two-Sample Hotelling's T² Test
2.5 Paired Comparison of Multivariate Mean Vectors
Summary
Do it yourself (Exercises)
Suggested Reading

Be approximately right rather than exactly wrong.
John Tukey (1915 – 2000)

2.1 Multivariate Comparison of Mean Vectors

In clinical studies it is often necessary to compare the mean vector of a panel of variables with a hypothetical mean vector to understand whether the observed mean vector is close to the hypothetical vector. Sometimes we may have to compare the panel means of two or more independent groups of patients to understand the similarity among the groups. For instance, in cohort studies, one may wish to compare a panel of health indices among three or more categories of respondents like those based on the Socio Economic Status (SES).
The word 'vector' is used to indicate an array of correlated random variables or their summary values (like the mean). If means are listed in the array, we call it the mean vector.

The univariate method of comparing the mean of each variable of the panel independently and reporting the p-value is not always correct. Since several variables are observed on one individual, the data are usually inter-correlated.

The correct approach to comparing mean vectors is to take into account the covariance structure within the data and develop a procedure that minimizes the rate of false rejection. Multivariate statistical inference is based on this observation.

Let α be the type-I error rate for the comparison of the mean vectors. If there are k variables in the profile and we make k independent comparisons between the groups, the chance of making at least one false rejection will be α′ = 1 − (1 − α)^k, which is much larger than the advertised error rate α.

For instance, with k = 5 and α = 0.05, the overall error rate gets inflated to α′ = 1 − (0.95)^5 = 0.226. It means that about 23% of wrong rejections (even when the null hypothesis is true) are likely to take place by way of univariate tests, against the promised 5%. This phenomenon is known as Rao's paradox, and more details can be found in Healy, M.J.R. (1969). Hummel and Sligo (1971) recommend performing a multivariate test followed by univariate t-tests.

There is a close relationship between testing of hypothesis and the confidence interval (CI). Suppose we take α = 0.05; then the 95% CI contains all plausible values for the true parameter (µ0) specified under H0. If µ0 is contained in the confidence interval we accept the hypothesis at the 5% level of significance; else we reject the hypothesis.

The Hotelling's T² test provides a procedure to draw inferences on mean vectors as mentioned below.

a) Compare the sample mean vector with a hypothetical mean vector (claim of the researcher).
b) If the null hypothesis is not rejected at the α level, stop and conclude that the mean vectors do not differ significantly; else
c) Use 100(1−α)% CIs for each variable of the profile and identify which variable(s) contribute to rejection.

We now discuss the Hotelling's T² test procedure for the one-sample problem and outline a computational procedure to perform the test and to interpret the findings. This will be followed by a two-sample procedure.

2.2 One-sample Hotelling's T² Test

Consider a profile X having k variables X1, X2, ..., Xk and assume that all are measured on a continuous scale. Assume that X follows a multivariate normal distribution with mean vector µ and covariance matrix Σ. Let the data be available from n individuals. The sample mean vector X̄ defined in Equation 1.1 and the covariance matrix S defined in Equation 1.5 can be computed with the help of software like MS-Excel or SPSS.

We wish to test the hypothesis H0: µ = µ0 versus H1: µ ≠ µ0, where µ0 is the hypothetical vector of k means defined as

\[ \mu_0 = \begin{pmatrix} \mu_{01} \\ \mu_{02} \\ \vdots \\ \mu_{0k} \end{pmatrix} \]

The multivariate equivalent of the univariate t-statistic is given by the statistic

\[ Z^2 = n(\bar{X} - \mu_0)' \Sigma^{-1} (\bar{X} - \mu_0) \tag{2.1} \]

This statistic follows what is known as the Chi-square (χ²) distribution with k degrees of freedom, under the assumption that Σ is known. The decision rule is to reject H0 if $Z^2 > \chi^2_{k,\alpha}$ or if the p-value of the test is less than α.

In general Σ is unknown and in its place S is used. Then Z² reduces to the Hotelling's T² statistic (see Anderson (2003)), given by

\[ T^2 = n(\bar{X} - \mu_0)' S^{-1} (\bar{X} - \mu_0) \tag{2.2} \]

The T² statistic is related to the well-known F-statistic by the relation

\[ T^2 = \frac{(n-1)k}{(n-k)} F_{k,\,n-k}. \]

The critical value of the T² test can be easily found from F distribution tables. The p-value of the Hotelling's test can be obtained with the help of an MS-Excel function.

Standard statistical software like SPSS does not have a function to compute T² directly.
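Where no built-in function is available, Equation 2.2 can be evaluated directly. The following is a minimal Python sketch under stated assumptions, not the procedure used in this book: it relies only on the standard library, the function names (`mean_vector`, `covariance_matrix`, `solve`, `hotelling_t2_one_sample`) and the toy data are hypothetical, and it returns T² together with the equivalent F value so that the result can be compared with a table value of F on (k, n−k) degrees of freedom.

```python
def mean_vector(rows):
    """Column means of a list of equal-length observation rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows):
    """Sample covariance matrix S (divisor n - 1)."""
    n = len(rows)
    m = mean_vector(rows)
    k = len(m)
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / (n - 1)
             for j in range(k)] for i in range(k)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * k
    for i in reversed(range(k)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, k))
        x[i] = (aug[i][k] - s) / aug[i][i]
    return x


def hotelling_t2_one_sample(rows, mu0):
    """Return (T2, F) where T2 = n (xbar - mu0)' S^-1 (xbar - mu0)
    and F = (n - k) / ((n - 1) k) * T2."""
    n, k = len(rows), len(mu0)
    if n <= k:
        raise ValueError("need more observations than variables (n > k)")
    xbar = mean_vector(rows)
    s = covariance_matrix(rows)
    d = [xbar[i] - mu0[i] for i in range(k)]
    w = solve(s, d)  # w = S^-1 d, without forming the inverse explicitly
    t2 = n * sum(d[i] * w[i] for i in range(k))
    f = (n - k) / ((n - 1) * k) * t2
    return t2, f


# Hypothetical toy data: n = 4 subjects, k = 2 variables.
toy = [(1, 2), (2, 1), (3, 4), (4, 3)]
t2, f = hotelling_t2_one_sample(toy, [2.0, 2.0])
print(round(t2, 4), round(f, 4))
```

The linear system is solved instead of inverting S, which is a common choice for numerical stability. The p-value is left out on purpose because it needs an F distribution function that the standard library does not provide.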
MS-Excel also has no direct function in the standard fx list, but a few Add-ins of MS-Excel offer array formulas to compute T². For instance, the Real Statistics Add-in of MS-Excel has a direct function to compute the T² statistic (source: http://www.real-statistics.com). It also contains a class of procedures to handle one-sample and two-sample T² tests directly from raw data. Some interesting matrix operations are also available in it.

Suppose the Hotelling's test rejects the null hypothesis. The post-hoc analysis then requires finding out which of the k components of the profile vector contributed to the rejection of the null hypothesis.

For the ith mean µi, the 100(1−α)% CI (often known as the T² interval or simultaneous CI) is given by the relation

\[ \bar{x}_i - \sqrt{\frac{k(n-1)}{(n-k)} F_{k,(n-k),\alpha}\,\frac{S_{ii}}{n}} \;\le\; \mu_i \;\le\; \bar{x}_i + \sqrt{\frac{k(n-1)}{(n-k)} F_{k,(n-k),\alpha}\,\frac{S_{ii}}{n}} \]

where $S_{ii}$ is the variance of the ith variable and $F_{k,(n-k),\alpha}$ denotes the upper-α critical value (the 100(1−α)th percentile) of the F distribution with (k, n−k) degrees of freedom.

We also write this as an interval

\[ \left[ \bar{x}_i - \theta \sqrt{\frac{S_{ii}}{n}},\;\; \bar{x}_i + \theta \sqrt{\frac{S_{ii}}{n}} \right] \tag{2.3} \]

where $\theta = \sqrt{\frac{k(n-1)}{(n-k)} F_{k,(n-k),\alpha}}$ is a constant.

If the hypothetical mean µi of the ith component lies outside this interval, we say that this variable contributes to rejection of the hypothesis and the difference between the observed and hypothetical means is significant.

Consider the following illustration.

Illustration 2.1 In a cardiology study, data is obtained on various parameters from 50 patients with Metabolic Syndrome (MetS) and 30 patients without MetS. We call this MetS data for further reference. A description of variables, parameters and codes is given below.
Variable  Parameter  Description
V1        Group      Without MetS = 1, With MetS = 2
V2        Age        Age of the patient in years
V3        Gender     0 = Male, 1 = Female
V4        Height     Height (inches)
V5        Weight     Weight (Kg)
V6        BMI        Body Mass Index (Kg/m²)
V7        WC         Waist Circumference (cm)
V8        HTN        Hypertension (0 = No, 1 = Yes)
V9        DM         Diabetes Mellitus (0 = No, 1 = Yes, 2 = Pre-diabetic)
V10       TGL        Triglycerides (mg/dl)
V11       HDL        High-density lipoproteins cholesterol (mg/dl)
V12       LVMPI      Left Ventricle Myocardial Performance Index

Table 2.1 shows a portion of the data with 20 records, but the analysis and discussion are carried out on 40 records of the dataset.

The profile variables are Weight, BMI, WC and HDL. The researcher claims that the mean values would be Weight = 75, BMI = 30, WC = 95 and HDL = 40.

TABLE 2.1: MetS data with a sample of 20 records

S.No  V1  V2  V3  V4     V5    V6    V7   V8  V9  V10  V11  V12
1     1   35  1   175.0  76.0  24.8  93   0   0   101  38   0.480
2     1   36  0   160.0  65.0  25.8  91   0   0   130  32   0.550
3     1   40  1   168.0  72.0  25.5  85   0   0   82   32   0.420
4     1   30  0   158.0  59.0  23.6  75   0   0   137  33   0.590
5     1   45  0   156.0  63.0  25.9  75   0   0   171  38   0.470
6     1   54  1   164.0  60.0  22.3  75   0   0   97   38   0.410
7     1   43  0   152.0  39.0  16.9  68   0   0   130  50   0.450
8     1   35  1   176.0  73.0  23.6  95   0   0   70   40   0.430
9     1   32  1   165.0  52.0  19.2  76   0   0   96   42   0.410
10    1   32  1   178.0  72.0  22.7  88   0   0   102  47   0.380
11    2   39  0   152.0  67.0  29.0  94   0   1   205  32   0.730
12    2   39  1   182.0  94.0  28.3  111  0   1   150  32   0.660
13    2   47  1   167.0  83.0  30.0  98   0   1   186  32   0.600
14    2   37  0   153.0  62.0  26.5  85   0   1   183  35   0.520
15    2   36  0   160.0  68.0  23.4  93   0   1   150  37   0.600
16    2   41  1   172.0  67.0  22.6  94   0   1   91   36   0.750
17    2   42  0   162.5  73.0  27.6  100  0   1   115  42   0.500
18    2   43  1   162.5  67.5  25.6  106  1   1   312  31   0.730
19    2   31  1   160.0  76.0  29.6  100  0   0   180  35   0.610
20    2   45  1   167.0  60.0  21.5  90   1   1   233  35   0.520

(Data courtesy: Dr D. Rajasekhar, Department of Cardiology, Sri Venkateswara Institute of Medical Sciences (SVIMS), Tirupati.)
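The mean profile of the displayed records can also be checked outside the spreadsheet. The sketch below, in Python with only the standard library, keys in the Weight (V5), BMI (V6), WC (V7) and HDL (V11) columns of the 20 records shown in Table 2.1 and computes their means and sample variances. The names `records`, `profile_means` and `profile_variances` are illustrative only. Because the analysis in this illustration uses 40 records, these values are not expected to match the results reported for the full dataset.

```python
from statistics import mean, variance

# (Weight, BMI, WC, HDL) = columns V5, V6, V7, V11 of the 20 rows in Table 2.1
records = [
    (76.0, 24.8, 93, 38), (65.0, 25.8, 91, 32), (72.0, 25.5, 85, 32),
    (59.0, 23.6, 75, 33), (63.0, 25.9, 75, 38), (60.0, 22.3, 75, 38),
    (39.0, 16.9, 68, 50), (73.0, 23.6, 95, 40), (52.0, 19.2, 76, 42),
    (72.0, 22.7, 88, 47), (67.0, 29.0, 94, 32), (94.0, 28.3, 111, 32),
    (83.0, 30.0, 98, 32), (62.0, 26.5, 85, 35), (68.0, 23.4, 93, 37),
    (67.0, 22.6, 94, 36), (73.0, 27.6, 100, 42), (67.5, 25.6, 106, 31),
    (76.0, 29.6, 100, 35), (60.0, 21.5, 90, 35),
]
names = ("Weight", "BMI", "WC", "HDL")

# Transpose the rows into one tuple per variable.
columns = list(zip(*records))

# Mean vector and sample variances (divisor n - 1) of the profile.
profile_means = {name: mean(col) for name, col in zip(names, columns)}
profile_variances = {name: variance(col) for name, col in zip(names, columns)}

for name in names:
    print(f"{name:7s} mean = {profile_means[name]:8.3f}  "
          f"variance = {profile_variances[name]:9.3f}")
```

The same two summaries are what Step-2 of the MS-Excel procedure produces with AutoSum, so the sketch is mainly useful as a cross-check on data entry.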
The profile vector X and the hypothesized mean vector µ0 are

\[ X = \begin{pmatrix} \text{Weight} \\ \text{BMI} \\ \text{WC} \\ \text{HDL} \end{pmatrix}, \qquad \mu_0 = \begin{pmatrix} 75.0 \\ 29.0 \\ 95.0 \\ 40.0 \end{pmatrix} \]

We wish to test the hypothesis that the sample profile comes from a population with the mean vector µ0 claimed by the researcher.

Analysis:

The following stepwise procedure can be implemented in MS-Excel (2010 version).

Step-1: Enter the data in an MS-Excel sheet with column headings in the first row.

Step-2: Find the mean of each variable and store it as a column with heading "means". This gives the mean vector. (Hint: Select the cells B42 to E42 and click AutoSum → Average. Then select a different area at the top of the sheet and select 4 'blank' cells vertically. Type the function =TRANSPOSE(B42:E42) and press Control+Shift+Enter.)

Step-3: Enter the hypothetical mean vector (µ0) in cells G3 to G6.

Step-4: Click on Add-ins → Real Statistics → Data Analysis Tools → Multivar → Hotelling T-Square. This gives an option window as shown in Figure 2.1. Choose the option One-sample.

FIGURE 2.1: Hotelling's test options.

Step-5: For the input range1, select the entire data from B1 to E41 (including headings) and for the input range2, select the hypothetical mean vector along with headings (G3:G6).

Step-6: Fix a cell to indicate the output range (for display of results) and press OK.

This gives T² = 3.916 and F = 0.9036 with p = 0.4721. The hypothesis is accepted since p > 0.05. Hence, there is no significant difference between the sample mean vector and the hypothesized mean vector.

Suppose the hypothetical mean vector is changed to

\[ \mu_0 = \begin{pmatrix} 75.0 \\ 30.0 \\ 90.0 \\ 40.0 \end{pmatrix} \]

From the worksheet of the Hotelling's T² test, it is enough to make changes in µ0 and the results get automatically updated (one need not run the above steps again!).
This gives T² = 31.027 with p-value = 0.000238, so the hypothesis is rejected: the difference between the sample mean vector and the hypothesized mean vector is significant at the 0.05 level.

Remarks-1:

When the profile has a mean vector which does not differ from the hypothetical vector, the T² test is conclusive. Otherwise, one needs to find out which component(s) contributed to the rejection.

In the following section we discuss the use of CIs in interpreting the results.

2.3 Confidence Intervals for Component Means

The next job is to generate a CI for the mean of each component and find which component is contributing to the rejection of the hypothesis.

The multiplier in Equation 2.3 is often called the confidence coefficient, and the value of $F_{k,(n-k),\alpha}$ can be calculated using the MS-Excel function FINV() (see Sarma (2010)). This is called the critical value of F, denoted by Fcri. For this example, we get Fcri = 2.6335 (using FINV($L$11,$H$5,$H$6)).

With simple calculations in the MS-Excel sheet we can build the intervals as shown in Figure 2.2. Some of the rows in the MS-Excel sheet are hidden so that the intermediate results are visible.

Since the hypothetical mean of 90.0 for the variable WC falls outside the CI, we infer that WC differs significantly from the belief of the researcher. The other variables can be considered to agree with the hypothetical means.

When the number of comparisons is small, a better method of building CIs than the T² intervals is based on the family-wise error rate. The resulting intervals are known as Bonferroni intervals and are based on the Student's t-test.
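The T² intervals of Equation 2.3 can be sketched in a few lines of Python. This is an illustrative sketch, not the worksheet's own formulas: the function names `t2_interval_multiplier` and `t2_simultaneous_intervals` are hypothetical, and the critical value Fcri is passed in as a number taken from tables or from FINV(), since the Python standard library has no F distribution. The summary values in the example are read from Figure 2.2 (n = 40, k = 4, Fcri = 2.6335).

```python
import math


def t2_interval_multiplier(k, n, f_crit):
    """theta in Equation 2.3: sqrt(k (n - 1) / (n - k) * F_crit)."""
    if n <= k:
        raise ValueError("need more observations than variables (n > k)")
    return math.sqrt(k * (n - 1) / (n - k) * f_crit)


def t2_simultaneous_intervals(means, variances, n, f_crit):
    """Return (lower, upper) for each component mean, following
    Equation 2.3: xbar_i -/+ theta * sqrt(S_ii / n)."""
    theta = t2_interval_multiplier(len(means), n, f_crit)
    intervals = []
    for xbar, s_ii in zip(means, variances):
        half_width = theta * math.sqrt(s_ii / n)
        intervals.append((xbar - half_width, xbar + half_width))
    return intervals


# Summary values as displayed in Figure 2.2.
names = ["Weight", "BMI", "WC", "HDL"]
hyp_means = [75, 30, 90, 40]
means = [76.77, 28.77, 97.18, 39.80]
variances = [290.237, 38.983, 310.866, 53.087]

intervals = t2_simultaneous_intervals(means, variances, n=40, f_crit=2.6335)
for name, mu0, (lo, hi) in zip(names, hyp_means, intervals):
    verdict = "inside" if lo <= mu0 <= hi else "outside"
    print(f"{name:7s} [{lo:8.3f}, {hi:8.3f}]  hypothetical mean {mu0} is {verdict}")
```

Applied in this literal way, Equation 2.3 gives a half-width of about 9.42 for WC, which is wider than the value in the Conf.coeff column of Figure 2.2. The worksheet may be applying the multiplier differently, so it is worth checking the cell formulas against Equation 2.3 before relying on either set of limits.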
Hotelling T-square Test

One-sample test
T2       30.8882693
df1      4
df2      36
F        7.12806216
p-value  0.00024659

Computation of Simultaneous Confidence intervals
n      40    COUNT(B2:B41)
alpha  0.05
k      4     COUNTA(B1:E1)

95% confidence interval
Variable  Hyp.Mean  Sample Mean  Variance  Conf.coeff  Lower   Upper    Significance
Weight    75        76.77        290.237   4.9509      71.817  81.718   Not Sig
BMI       30        28.77        38.983    1.8145      26.951  30.579   Not Sig
WC        90        97.18        310.866   5.1239      92.051  102.299  Sig
HDL       40        39.80        53.087    2.1174      37.683  41.917   Not Sig

Computation of Bonferroni intervals
n           40      COUNT(B2:B41)
alpha       0.05
k           4       COUNTA(B1:E1)
alpha-dash  0.0125

95% confidence interval
Variable  Hyp.Mean  Sample Mean  Variance  Conf.coeff  Lower   Upper    Significance
Weight    75        76.80        290.626   7.7921      69.008  84.592   Not Sig
BMI       30        28.77        38.983    2.8538      25.911  31.619   Not Sig
WC        90        97.18        310.866   8.0588      89.116  105.234  Not Sig
HDL       40        39.80        53.087    3.3303      36.470  43.130   Not Sig

FIGURE 2.2: MS-Excel worksheet to compute post-hoc calculations.

If there are k-