Métodos de Aprendizagem Não Supervisionada
ISCTE - IUL
José G. Dias
Summary
This document provides lecture notes on unsupervised learning methods, focusing on principal component analysis. It covers topics such as covariance, correlation, and data transformations. The materials are suitable for undergraduate-level courses in data analysis or machine learning.
Métodos de Aprendizagem Não Supervisionada (Unsupervised Learning Methods)
José G. Dias, ISCTE-IUL

Lecture 2. Principal component analysis
2.1. Characterizing multivariate data
  Covariance and correlation
  Data transformations
  Example in R

Covariance and correlation

Data set
Objects are represented as a cloud of points in a multidimensional space, with one axis for each variable.
We represent the data set as a matrix $\mathbf{X}$ with $n$ rows (objects) and $p$ columns (variables):
$$\mathbf{X} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}$$
For example, for variable $j$ we have the data vector
$$\mathbf{x}_j = \begin{pmatrix} x_{1j} \\ \vdots \\ x_{nj} \end{pmatrix} = \begin{pmatrix} x_{1j} & \cdots & x_{nj} \end{pmatrix}^{T}$$

Mean, variance, and standard deviation
The centroid (mean) of the points is defined by the mean of each variable:
$$\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}$$
The variance of each variable is the average squared deviation of its values around the mean of that variable:
$$s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$$
The standard deviation is the square root of the variance:
$$s_j = \sqrt{s_j^2}$$

Covariance
Variance: measure of the deviation from the mean for points in one dimension.
Covariance: measure of how much two dimensions vary from their means with respect to each other.
  Positive: both dimensions increase or decrease together.
  Negative: while one increases, the other decreases.

Covariance
Variance:
$$s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ij} - \bar{x}_j)$$
Covariance:
$$s_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$$
Thus:
  $s_{jk} > 0$ → dimensions co-vary in the same direction
  $s_{jk} < 0$ → dimensions co-vary in the opposite direction
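The formulas for the mean, variance, standard deviation, and covariance can be checked by hand on a tiny example. The sketch below is only illustrative: the vectors `x` and `y` are made-up values, not lecture data, and the variable names are assumptions chosen here. It computes each quantity from its definition and compares it with the corresponding base R function.

```r
# Two small illustrative vectors (made-up values, not from the lecture data)
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)
n <- length(x)

x_bar <- sum(x) / n                                   # mean
s2_x  <- sum((x - x_bar)^2) / (n - 1)                 # sample variance
s_x   <- sqrt(s2_x)                                   # standard deviation
s_xy  <- sum((x - x_bar) * (y - mean(y))) / (n - 1)   # sample covariance

# Compare with the built-ins, which also divide by n - 1
all.equal(x_bar, mean(x))
all.equal(s2_x, var(x))
all.equal(s_x, sd(x))
all.equal(s_xy, cov(x, y))

s_xy > 0   # positive covariance: x and y tend to move in the same direction
```

Here the deviations of `x` and `y` from their means mostly share the same sign, so the covariance is positive.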
Covariance matrix
The original set of variables $\mathbf{X}$ is characterized by a $p \times p$ variance-covariance matrix, denoted by $\mathbf{S}$.
The diagonal elements of $\mathbf{S}$ are the variances, i.e., $s_{jj} = s_j^2$.
The off-diagonal elements of $\mathbf{S}$ are the covariances, i.e., $s_{jk}$, $j \neq k$.
In matrix notation,
$$\mathbf{S} = \begin{pmatrix} s_{11} & \cdots & s_{1p} \\ \vdots & \ddots & \vdots \\ s_{p1} & \cdots & s_{pp} \end{pmatrix}$$
It is a symmetric matrix, i.e., $\mathbf{S} = \mathbf{S}^{T}$ ($s_{jk} = s_{kj}$).
It is a positive semi-definite matrix.

Correlation coefficient
Covariance determines whether a relation is positive or negative, but it does not measure the degree to which the variables are related, because its size depends on the units of measurement.
The Pearson correlation is given by
$$r_{jk} = \frac{s_{jk}}{s_j s_k}$$
It standardizes the covariance by dividing it by the two standard deviations, so that $-1 \leq r_{jk} \leq +1$.
In matrix notation,
$$\mathbf{R} = \begin{pmatrix} 1 & \cdots & r_{1p} \\ \vdots & \ddots & \vdots \\ r_{p1} & \cdots & 1 \end{pmatrix}$$

Correlation coefficient
In addition to whether variables are positively or negatively related, correlation also tells the degree to which the variables are related to each other.
[Figure panels: perfect negative correlation (r = -1.0), high negative (r = -0.8), low negative (r = -0.3), no correlation (r = 0.0), low positive (r = 0.3), high positive (r = 0.8), perfect positive (r = 1.0)]

Correlation and independence
Events $A$ and $B$ are independent if and only if
$$P(A \cap B) = P(A)\,P(B) \quad \text{or} \quad P(A \mid B) = P(A)$$
Independence means there is no relation between variables, i.e., we cannot learn about one variable from the other one:
$$P(X \mid Y) = P(X)$$
In these examples [figure not preserved], are $X$ and $Y$ independent?
Pearson correlation measures only linear association; thus, in all cases $r \approx 0$, which by itself does not establish independence.

Data transformations

Data centering
Data centering means subtracting the mean of each variable from its values.
Centered values are
$$\tilde{x}_{ij} = x_{ij} - \bar{x}_j$$
In this case, the new variables have mean 0.
Indeed, writing $\tilde{x}_{ij} = x_{ij} - \bar{x}_j$ for the centered values,
$$\frac{1}{n} \sum_{i=1}^{n} \tilde{x}_{ij} = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j) = \frac{1}{n} \sum_{i=1}^{n} x_{ij} - \frac{1}{n} \sum_{i=1}^{n} \bar{x}_j = \bar{x}_j - \bar{x}_j = 0$$

Data centering
The variance and covariance values are not affected by the mean value.
Subtracting the mean makes variance and covariance calculation easier by simplifying their equations.
For the centered matrix
$$\tilde{\mathbf{X}} = \begin{pmatrix} x_{11} - \bar{x}_1 & \cdots & x_{1p} - \bar{x}_p \\ \vdots & \ddots & \vdots \\ x_{n1} - \bar{x}_1 & \cdots & x_{np} - \bar{x}_p \end{pmatrix}$$
the sample covariance matrix is
$$\mathbf{S} = \frac{1}{n-1} \tilde{\mathbf{X}}^{T} \tilde{\mathbf{X}}$$

Data standardization
Using covariances between variables only makes sense if they are measured in the same units.
If variables have very heterogeneous dispersions/variances, we standardize them.
The standardized variables are
$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$
Notice this is a linear transformation:
$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j} = \underbrace{-\frac{\bar{x}_j}{s_j}}_{\text{intercept}} + \underbrace{\frac{1}{s_j}}_{\text{slope}} x_{ij}$$

Data standardization
In this case, all variables have the same mean and variance/standard deviation:
  Mean is 0
  Variance is 1
As a result, all variables have the same weight.
Covariances between the standardized variables are correlations, i.e., the covariance matrix coincides with the correlation matrix.

Sample vs Population: Remark
At this stage we are not using probabilistic models.
We focus on data/samples.
Thus, most of the time we avoid $E[X]$, $\mathrm{Var}[X]$, $\mathrm{Cov}(X_j, X_k)$, etc., as they are population parameters (characteristics of random variables).

Measure              Parameter (population)                               Statistic (sample)
Mean                 $\mu = E[X]$                                         $\bar{x}$
Mean vector          $\boldsymbol{\mu} = E[\mathbf{X}]$                   $\bar{\mathbf{x}}$
Variance             $\sigma^2 = \mathrm{Var}[X]$                         $s^2$
Covariance           $\sigma_{jk} = \mathrm{Cov}(X_j, X_k)$               $s_{jk}$
Covariance matrix    $\boldsymbol{\Sigma} = \mathrm{Var}[\mathbf{X}]$     $\mathbf{S}$
Correlation          $\rho_{jk} = \mathrm{Cor}(X_j, X_k)$                 $r_{jk}$
Correlation matrix   $\mathbf{P} = [\rho_{jk}]$                           $\mathbf{R} = [r_{jk}]$

Example in R

Crime data set - Characteristics
Data set: crime rates per 100,000 population, by state, United States, 2017
Observations: 50 States + DC
The 7 variables are:
1. MURDER: Murder and nonnegligent manslaughter
2. RAPE: Rape
3. ROBBERY: Robbery
4. ASSAULT: Aggravated assault
5. BURGLARY: Burglary
6. LARCENY: Larceny theft
7. AUTO: Motor vehicle theft

Crime data set – View
# Open usacrimerates2017
> dataset <- ...
> head(dataset)
       STATE ACRONYM MURDER  RAPE ROBBERY ASSAULT BURGLARY LARCENY  AUTO
1    Alabama      AL    8.3  41.6    86.5   387.8    645.7  2048.1 263.4
2     Alaska      AK    8.4 116.7   128.5   575.4    563.8  2402.7 575.6
3    Arizona      AZ    5.9  51.0   106.0   345.0    536.3  2107.0 271.6
4   Arkansas      AR    8.6  68.3    64.4   413.6    727.7  2109.5 241.4
5 California      CA    4.6  37.2   143.2   264.2    446.9  1623.9 425.9
6   Colorado      CO    3.9  68.8    68.4   226.9    406.9  1904.9 389.9

Crime data set – Summary and covariances
# Summary and covariance matrix
> data <- dataset[, 3:9]   # keep the 7 crime-rate columns (drops STATE and ACRONYM)
> summary(data)
     MURDER            RAPE            ROBBERY           ASSAULT         BURGLARY        LARCENY          AUTO
 Min.   : 1.000   Min.   : 16.70   Min.   : 11.40   Min.   : 65.3   Min.   :176.3   Min.   :1078   Min.   : 31.1
 1st Qu.: 2.750   1st Qu.: 36.75   1st Qu.: 50.75   1st Qu.:171.8   1st Qu.:307.5   1st Qu.:1374   1st Qu.:154.5
 Median : 5.000   Median : 43.40   Median : 75.40   Median :240.5   Median :412.7   Median :1734   Median :230.0
 Mean   : 5.239   Mean   : 46.09   Mean   : 84.26   Mean   :253.8   Mean   :437.3   Mean   :1743   Mean   :234.7
 3rd Qu.: 6.900   3rd Qu.: 53.65   3rd Qu.:102.60   3rd Qu.:313.9   3rd Qu.:533.6   3rd Qu.:2062   3rd Qu.:279.8
 Max.   :16.700   Max.   :116.70   Max.   :378.00   Max.   :575.4   Max.   :858.1   Max.   :3650   Max.   :575.6
> round(var(data), 2)
            MURDER      RAPE   ROBBERY   ASSAULT  BURGLARY   LARCENY      AUTO
MURDER        9.44     10.57    139.55    263.16    256.08    976.53    138.42
RAPE         10.57    259.06    119.02   1077.30    647.50   3100.89   1116.42
ROBBERY     139.55    119.02   3426.32   4167.59   1887.15  17861.98   3214.10
ASSAULT     263.16   1077.30   4167.59  14199.76  10324.97  37885.48   8310.28
BURGLARY    256.08    647.50   1887.15  10324.97  27778.26  38628.70  10625.93
LARCENY     976.53   3100.89  17861.98  37885.48  38628.70 223527.60  38872.09
AUTO        138.42   1116.42   3214.10   8310.28  10625.93  38872.09  14301.11
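The covariance matrix returned by `var()` can be reproduced from the centered-matrix formula, the cross-product of the centered matrix with itself divided by n - 1. The following is a minimal sketch under one loud assumption: since the usacrimerates2017 file is not available here, it uses `USArrests`, a different data set that ships with base R, purely as a stand-in. The object names `X`, `Xc`, and `S` are chosen for this illustration.

```r
# Stand-in data: USArrests ships with base R (50 states, 4 numeric variables).
# It is NOT the usacrimerates2017 file used in the lecture.
X <- as.matrix(USArrests)
n <- nrow(X)

# Center each column: subtract the column mean from every value
Xc <- sweep(X, 2, colMeans(X))

# Sample covariance matrix from the centered matrix
S <- crossprod(Xc) / (n - 1)   # crossprod(Xc) computes t(Xc) %*% Xc

max(abs(colMeans(Xc)))         # numerically zero: centered columns have mean 0
all.equal(S, var(X))           # agrees with the built-in covariance matrix
```

With the lecture's own data, the same steps applied to `as.matrix(data)` should reproduce the `var(data)` output, assuming `data` holds only the seven numeric columns.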
Crime data set – Scatterplot
# scatterplot matrix
> pairs(data, pch = 19, lower.panel = NULL)

Crime data set – Correlation matrix
# correlation matrix
> correlation <- cor(data)
> round(correlation, 3)
          MURDER   RAPE ROBBERY ASSAULT BURGLARY LARCENY   AUTO
MURDER     1.000  0.214   0.776   0.719    0.500   0.672  0.377
RAPE       0.214  1.000   0.126   0.562    0.241   0.407  0.580
ROBBERY    0.776  0.126   1.000   0.597    0.193   0.645  0.459
ASSAULT    0.719  0.562   0.597   1.000    0.520   0.672  0.583
BURGLARY   0.500  0.241   0.193   0.520    1.000   0.490  0.533
LARCENY    0.672  0.407   0.645   0.672    0.490   1.000  0.688
AUTO       0.377  0.580   0.459   0.583    0.533   0.688  1.000

Crime data set – Correlation plot
# corrplot
# install.packages("corrplot")
> library(corrplot)
> par(oma = c(2, 2, 2, 2))   # space around for text
> corrplot.mixed(correlation,
+                order = "hclust",    # order of variables
+                tl.pos = "lt",       # text left + top
+                upper = "ellipse")
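Two claims from the earlier slides lend themselves to a quick numerical check: that the covariance matrix of standardized variables coincides with the correlation matrix, and that a Pearson correlation near zero does not imply independence. The sketch below again assumes the base R data set `USArrests` as a stand-in for the lecture's crime file, and the vectors `x` and `y` are made-up values for illustration.

```r
# Stand-in data (USArrests from base R), not the lecture's crime file
X <- as.matrix(USArrests)

# Standardize: subtract column means, divide by column standard deviations
Z <- scale(X)                  # center = TRUE and scale = TRUE by default

round(colMeans(Z), 10)         # means of the standardized variables
apply(Z, 2, var)               # variances of the standardized variables

# Covariance of the standardized data equals the correlation of the raw data
all.equal(var(Z), cor(X))

# cov2cor() rescales a covariance matrix: r_jk = s_jk / (s_j * s_k)
all.equal(cov2cor(var(X)), cor(X))

# Near-zero correlation without independence: y is fully determined by x
x <- seq(-3, 3, by = 0.5)
y <- x^2
cor(x, y)                      # essentially 0, although y depends on x
```

The last lines use a symmetric range for `x` on purpose: the positive and negative halves cancel in the covariance, so the linear measure misses a perfectly deterministic relation.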