Multivariate Statistical Analysis (SMA 3023)
Summary
This document is a textbook chapter on multivariate statistical analysis, specifically focusing on Principal Component Analysis (PCA). It describes the concept of PCA, its use in dimension reduction, and the steps involved, such as data normalization and covariance matrix computation.
Full Transcript
Applications of Statistical Techniques (SMA 3023)
Chapter 8: Multivariate Statistical Analysis
Textbook: Basic Elements of Computational Statistics

Table of Contents
1. Principal Component Analysis (PCA)

Principal Component Analysis (PCA)
One of the challenges of multivariate analysis is the curse of dimensionality. High correlation between the original variables leads to estimation and inference problems caused by near multicollinearity. This motivates principal component analysis (PCA), a multivariate technique whose central aim is to reduce the dimension of the dataset. The transformation yields a new set of variables, called principal components, which are linear combinations of the original variables. PCA works only with quantitative variables.

Dimension reduction is the process of converting a dataset with a large number of dimensions into one with fewer dimensions, while ensuring that the converted dataset conveys similar information concisely.

Example: Consider two dimensions, x1 and x2, where x1 is the measurement of several objects in cm and x2 is the measurement of the same objects in inches. [The graph plotting x1 against x2 is not reproduced in this transcript.]

Using both dimensions conveys essentially the same information while introducing noise into the system, so it is better to use just one dimension. With a dimension reduction technique, we convert the data from two dimensions (x1 and x2) to one dimension (z1), which makes the data easier to explain.

Dimension reduction offers several benefits:
– It compresses the data and thus reduces storage space requirements.
– It reduces computation time, since fewer dimensions require less computation.
– It eliminates redundant features.
– It improves model performance.

PCA is a well-known dimension reduction technique. It transforms the variables into a new set of variables called principal components. These principal components are linear combinations of the original variables and are orthogonal to one another. The first principal component accounts for as much of the variation in the original data as possible; the second captures as much of the remaining variance as possible, subject to being orthogonal to the first. A two-dimensional dataset has at most two principal components.

The five main steps for computing principal components:

Step 1: Data normalization
Example: Monthly expenses: $300; Age: 27; Rating: 4.5.
– These attributes have different scales, and performing PCA on such data would lead to a biased result. This is where data normalization comes in.
– Normalization ensures that each attribute contributes at the same level, preventing one variable from dominating the others.
– For each variable, normalization is done by subtracting its mean and, where appropriate, dividing by its standard deviation.

Step 2: Covariance matrix
Compute the covariance matrix from the normalized data. It is a symmetric matrix in which element (i, j) is the covariance between variables i and j.

Step 3: Eigenvectors and eigenvalues
Eigenvector – a direction, such as "vertical" or "90 degrees".
Eigenvalue – a number representing the amount of variance present in the data in a given direction.
Each eigenvector has its corresponding eigenvalue.

Step 4: Selection of principal components
There are as many eigenvector–eigenvalue pairs as there are variables in the data.
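As a checkpoint before the eigenpairs are ranked, here is a minimal sketch of Steps 1–3, assuming Python with NumPy (the slides themselves prescribe no software, and the data values below are hypothetical, echoing the expenses/age/rating example from Step 1):

```python
import numpy as np

# Hypothetical data matrix: rows are observations, columns are the
# variables from the Step 1 example (monthly expenses, age, rating).
data = np.array([
    [300.0, 27.0, 4.5],
    [120.0, 35.0, 3.9],
    [450.0, 22.0, 4.8],
    [210.0, 41.0, 4.1],
])

# Step 1: normalization -- subtract each column's mean and divide by
# its standard deviation so that no variable dominates by scale alone.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Step 2: covariance matrix of the normalized data
# (rowvar=False tells NumPy that the columns are the variables).
cov = np.cov(z, rowvar=False)

# Step 3: eigenvectors and eigenvalues; eigh is the right choice
# because the covariance matrix is symmetric.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

print("covariance matrix:\n", cov)
print("eigenvalues:", eigenvalues)
print("eigenvectors (one per column):\n", eigenvectors)
```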
The eigenvector with the highest eigenvalue corresponds to the first principal component; the eigenvector with the second-highest eigenvalue gives the second principal component, and so on.

Step 5: Data transformation in the new dimensional space
This step re-orients the original data onto the new subspace defined by the principal components. The re-orientation is done by multiplying the normalized data by the matrix of previously computed eigenvectors.
Remark: This transformation does not modify the original data itself; it provides a new perspective that better represents the data.

Example: Suppose the dataset consists of two variables, as follows:
X = 1, 3, 6, 9, 12, 15
Y = 2, 4, 6, 8, 10, 12
Calculate the principal components using Principal Component Analysis (PCA).
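The closing exercise can be answered end to end with the five steps above. The following is a sketch, again assuming Python with NumPy (any tool offering an eigen-decomposition would do):

```python
import numpy as np

# The exercise data: six observations of two variables.
X = np.array([1, 3, 6, 9, 12, 15], dtype=float)
Y = np.array([2, 4, 6, 8, 10, 12], dtype=float)
data = np.column_stack([X, Y])

# Step 1: normalize each variable to zero mean and unit standard deviation.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Step 2: covariance matrix of the normalized data.
cov = np.cov(z, rowvar=False)

# Step 3: eigen-decomposition of the symmetric covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 4: rank the eigenpairs by eigenvalue, largest first; the leading
# eigenvector is the direction of the first principal component.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Step 5: re-orient the normalized data onto the new subspace by
# multiplying it by the matrix of eigenvectors.
scores = z @ eigenvectors

print("eigenvalues (variance captured per component):", eigenvalues)
print("first principal component direction:", eigenvectors[:, 0])
print("data in the principal-component coordinates:\n", scores)
```

Because X and Y here are almost perfectly correlated, the first eigenvalue should absorb nearly all of the variance, so the first principal component alone summarizes this two-dimensional dataset well.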