Covariance and Correlation PDF
Summary
This document, a PDF, discusses covariance and correlation measures, exploring how they help assess relationships between variables in statistical analysis. It explains the calculation and interpretation of covariance and correlation in terms of data analysis, utilizing the R programming language for investigation. It is a valuable resource for those studying statistical analysis.
Full Transcript
Covariance and Correlation

In our statistical work, it is often of interest to us to assess the relationship between variables within our sample data. Say we have two numeric, normally distributed random variables X and Y which we have measured within our sample. It is often useful to understand how changes in the value of X are associated with changes in the value of Y, and vice versa. The covariance of X and Y and the correlation of X and Y provide two metrics by which we can measure the association between these two variables. In this text, we will discuss what covariance and correlation are, how they are calculated, what they can tell us, and how we can use R to investigate them in our data.

Covariance

Covariance is a measure of how two random variables X and Y vary together. For now, we will discuss measuring covariance between two numeric variables X and Y. For example, let X be age and Y be height. Among adolescents (say, people aged 18 and younger), as age (X) increases, so does height (Y). As age varies (i.e., as adolescents get older), height varies in a corresponding pattern (older adolescents are taller than younger adolescents). Likewise, if we encounter a taller adolescent, it is fair to guess they are older than shorter adolescents. In this situation, there is a positive covariance between age and height, because increases in the value of one are associated with increases in the value of the other.

Now suppose instead that X is age and Y is height among people aged 18 through 65. Assuming people typically stop growing in height by age 18, we can see that among adults no relationship exists between age (X) and height (Y). Knowing how old someone is provides no information about how tall they are and, likewise, knowing how tall someone is provides no information about how old they are. In such a situation, the covariance of X and Y is approximately 0.

Let's consider one last example. Again, let X be age and Y be height, and imagine we are now looking at individuals aged 65 and older. It turns out that as people get into their golden years, they tend to get a bit shorter. In this case, as elders' age increases, height decreases: within this sample, older people tend to be shorter and shorter people tend to be older. Here there is a negative covariance between age and height, because increases in the value of one are associated with decreases in the value of the other.

Measuring Covariance

Mathematically, we refer to the covariance of X and Y with the expression cov(X, Y). Assume we have a sample of n individuals and we have measured X and Y for all n participants. If we let x̄ equal the mean of our n measures of X and ȳ equal the mean of our n measures of Y, then we can define the covariance between X and Y like so:

cov(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) \cdot (y_i - \bar{y})}{n - 1}

What is this equation measuring, precisely? For each participant i, we calculate (x_i − x̄)(y_i − ȳ), where x_i is the value of X for participant i and y_i is the value of Y for participant i. The term (x_i − x̄) contains two important pieces of information: 1) how far the i-th participant is from the mean value of X, x̄, and 2) whether the value of x_i is less than or greater than that mean.
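To make the formula concrete, here is a minimal sketch in R that applies the definition directly; the age and height values are made up purely for illustration, and the result is compared against R's built-in cov() function.

## Hypothetical ages (years) and heights (inches) for five adolescents
age    <- c(5, 8, 9, 12, 18)
height <- c(40, 50, 52, 58, 68)

## Apply the definition: sum the products of deviations from the means,
## then divide by n - 1
n <- length(age)
manual_cov <- sum((age - mean(age)) * (height - mean(height))) / (n - 1)

manual_cov          ## covariance computed from the formula (positive, as expected)
cov(age, height)    ## R's built-in function should return the same value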
We receive the same type of information from (y_i − ȳ).

Positive Covariance

Earlier we discussed how positive covariance captures when two variables increase together, i.e., when greater values of X are associated with greater values of Y (and vice versa). We refer to it as positive because, when this is the case, cov(X, Y) > 0. Why is that so? Consider our example of age (X) and height (Y) among youth aged 0 through 18, where we intuitively understand that cov(X, Y) > 0. Say the average age in our sample is x̄ = 9 years and the average height is ȳ = 52 inches, and say that participant i is 5 years old and 40 inches tall. We can then calculate for this participant:

(x_i − x̄)(y_i − ȳ) = (5 − 9)(40 − 52) = (−4)(−12) = 48

Notice that when we multiply two negative values, we get a positive value. This makes sense: we want to calculate a positive value when lower values of X are associated with lower values of Y. Both terms are negative because both measurements fall below their respective means. Likewise, we want to calculate a positive value when higher values of X are associated with higher values of Y. Suppose we sample someone else who is 18 years old and 68 inches tall. For this participant:

(x_i − x̄)(y_i − ȳ) = (18 − 9)(68 − 52) = (9)(16) = 144

Our equation for covariance captures that people with above-average age also tend to have above-average height in our sample, and that people with below-average height also tend to have below-average age. Often, we primarily care whether cov(X, Y) is less than 0 (negative covariance), equal or close to 0 (no covariance), or greater than 0 (positive covariance).

Negative Covariance

Likewise, it is good to see how our equation for measuring covariance captures negative covariance. Negative covariance is when higher values of X are associated with lower values of Y and, thus, higher values of Y are associated with lower values of X. We refer to it as negative because, in such cases, cov(X, Y) < 0. Consider our example of age (X) and height (Y) among the elderly, where we intuitively understand that cov(X, Y) < 0. Say the average age in our sample is x̄ = 75 years and the average height is ȳ = 64 inches, and say that participant i is 68 years old and 66 inches tall. We can then calculate for this participant:

(x_i − x̄)(y_i − ȳ) = (68 − 75)(66 − 64) = (−7)(2) = −14

In this instance, we get a negative value because participant i has below-average age for the sample and above-average height, so we end up multiplying one negative value by one positive value. Likewise, we could sample an older participant whose age is 92 and whose height is 59 inches. We would then calculate:

(x_i − x̄)(y_i − ȳ) = (92 − 75)(59 − 64) = (17)(−5) = −85

Since this participant has above-average age and below-average height, the calculated value is negative. If this pattern emerges across the full sample, it will result in cov(X, Y) < 0. Of course, there are likely to be individuals who are above average in both age and height, or below average in both, producing a positive value. That is okay. Our measure of covariance is trying to capture a pattern in the relationship between X and Y; not every participant will fall perfectly into that pattern.
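These per-participant products are easy to check directly in R. The snippet below simply evaluates the four worked examples above, using the sample means stated in the text.

## Youth sample: mean age 9 years, mean height 52 inches
(5 - 9)  * (40 - 52)   ## both below average -> positive product (48)
(18 - 9) * (68 - 52)   ## both above average -> positive product (144)

## Elder sample: mean age 75 years, mean height 64 inches
(68 - 75) * (66 - 64)  ## below-average age, above-average height -> negative (-14)
(92 - 75) * (59 - 64)  ## above-average age, below-average height -> negative (-85)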
Covariance is a Measure of Association

In the examples provided, there likely is a causal relationship between age and height: as children get older, they get taller. But we only know this is causal because we have a specific understanding of how age influences height; we typically do not think of changes in height causing changes in age. (Or, to be a bit of a philosophy troll: it is possible that our idea of age as a construct is cultural, and it is entirely possible that, in another culture, height may be a more salient measure of the aging process than a measure of time, meaning that causality is extremely challenging to identify.) Importantly, covariance is a measure of the association of two variables X and Y. Covariance does not identify why X and Y vary together; it simply identifies that they do, and to what extent.

Measuring and Visualizing Covariance in R

Calculating the covariance between two variables X and Y in R is easy: we can just use the cov() function. We are going to use the mtcars data.frame that is included with R. It is a data.frame where each observation is a different car, and we have a range of variables about each car, such as the miles per gallon it gets (mpg) and its horsepower (hp). I would guess that cars with greater horsepower get fewer miles to the gallon; I imagine a powerful pickup truck gets worse gas mileage than a small sedan. However, I'd like to confirm this in the data. My understanding is that there should be a negative covariance between a car's miles per gallon and its horsepower. I can calculate the covariance in R like so:

## Take the covariance
cov(mtcars$mpg, mtcars$hp)

## -320.7321

As we see, we get a value of approximately -321, indicating a negative covariance between these two variables.

Visualizing Covariance with Scatterplots

-320.7321 is not easy to intuitively understand. Often, it is easier to understand covariance visually. To do so, we can use a scatterplot, which plots values of our variable X against our variable Y; each participant's observations (x_i, y_i) are plotted on the graph. Here, we will introduce the ggplot2 package in R. This is a very powerful and useful package for generating visualizations in R. It can be a bit tricky, but we will learn how to use it through experience. Let us start by generating a basic scatterplot of miles per gallon (X) and horsepower (Y) by pairing the ggplot() function with the geom_point() function:

## First we need to load the ggplot2 package
library(ggplot2)

## Now we will create our visualization
## The first argument is our data.frame
## The second argument defines which variable is on the x-axis and which is on the y-axis
## Finally, by adding "+ geom_point()" we are telling the computer to plot it as a scatterplot
ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point()

As we can see, cars with higher horsepower tend to have lower miles per gallon (points are concentrated in the top left of the plot), and cars with higher miles per gallon tend to have lower horsepower. This visually displays negative covariance. However, our plot does not look that nice.
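One optional addition, not used in the original example, is to overlay a fitted line so the downward trend is explicit; geom_smooth(method = "lm") from ggplot2 draws an ordinary least-squares line through the points.

## Same scatterplot, with a straight trend line overlaid
library(ggplot2)

ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)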
To make the plot itself look nicer, I am going to use some additional features of the ggplot2 package:

## We are going to use the labs() function to set the name of the x-axis, y-axis, and title
## We are going to change the appearance using the theme_minimal() function
ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  labs(
    x = "Miles Per Gallon",
    y = "Horsepower",
    title = "Relationship Between MPG and Horsepower"
  ) +
  theme_minimal()

You can customize many aspects of your scatterplot. One feature I want to point out is that ggplot2 has a series of theme functions that can be used to change the appearance. I am a fan of theme_minimal(), but there are many to explore, for example:

## theme_classic() is a classic!
ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  labs(
    x = "Miles Per Gallon",
    y = "Horsepower",
    title = "Relationship Between MPG and Horsepower"
  ) +
  theme_classic()

## Or we can load themes from the ggthemes library
library(ggthemes)

## Warning: package 'ggthemes' was built under R version 4.1.2

## The solarized theme gives some fun flair!
ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  labs(
    x = "Miles Per Gallon",
    y = "Horsepower",
    title = "Relationship Between MPG and Horsepower"
  ) +
  theme_solarized()

The ggplot2 package will come in handy whenever we want to create plots because it allows us to customize many of the features in our plot. In practice, you will learn how to use ggplot2 by applying it in your work; there are many features, and what you want to plot will shape which features you end up using. The important takeaway here is that scatterplots are a powerful tool for visually inspecting the relationship between two variables and identifying covariance.

Let's plot an example to visually see positive covariance. Below I will plot the relationship between the weight of the vehicle (in 1,000 lbs, the unit used in mtcars) and its horsepower. A heavier car likely needs more horsepower to drive, and heavier cars (such as trucks) are often built with greater horsepower for tasks like towing:

ggplot(mtcars, aes(x = wt, y = hp)) +
  geom_point() +
  labs(
    x = "Weight (in 1,000 lbs)",
    y = "Horsepower",
    title = "Relationship Between Weight and Horsepower"
  ) +
  theme_solarized()

Here we can see that lighter cars tend to have lower horsepower (concentrated in the bottom left) and heavier cars tend to have higher horsepower. Positive covariance can be identified visually by a trend in the points from the bottom left corner to the top right corner.

A scatterplot is also useful for identifying no (or weak) covariance. The figure at the link below shows three scatterplots: one displaying positive covariance, one displaying negative covariance, and the last displaying weak covariance:

https://careerfoundry.com/en/blog/data-analytics/covariance-vs-correlation/

With positive covariance we visually see a pattern where the data "travels" from the lower left to the upper right. This pattern arises when greater values of X are associated with greater values of Y. With negative covariance, we see a pattern where the data "travels" from the upper left to the lower right. This pattern arises when greater values of X are associated with smaller values of Y. Finally, with weak covariance, we are unable to visually detect a pattern, indicating that X and Y behave approximately independently of one another.
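The three patterns in that figure can also be reproduced with a quick simulation. The sketch below is not from the original text; it draws random data with rnorm() so that one outcome rises with x, one falls with x, and one is unrelated to x, and then checks the sign of each covariance.

## Simulated illustration of positive, negative, and (approximately) zero covariance
set.seed(1)                          ## make the random draws reproducible
x <- rnorm(200)

y_pos  <- x + rnorm(200, sd = 0.5)   ## tends to rise as x rises
y_neg  <- -x + rnorm(200, sd = 0.5)  ## tends to fall as x rises
y_none <- rnorm(200)                 ## unrelated to x

cov(x, y_pos)    ## clearly greater than 0
cov(x, y_neg)    ## clearly less than 0
cov(x, y_none)   ## close to 0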
Covariance Matrix

Often, though, we have many variables in our dataset and we are interested in identifying pairings of variables that vary together. We can generate what is called a covariance matrix: a two-dimensional array where each row represents a variable and each column represents a variable. To do so, we simply use the cov() function and supply our data.frame as the argument. This tells the computer to take the covariance of every pairing of variables in our data.frame. I use the round() function to tell R to round all the values to one decimal place, for the sake of readability:

round(cov(mtcars), 1)

##          mpg   cyl    disp     hp  drat     wt  qsec    vs    am  gear  carb
## mpg     36.3  -9.2  -633.1 -320.7   2.2   -5.1   4.5   2.0   1.8   2.1  -5.4
## cyl     -9.2   3.2   199.7  101.9  -0.7    1.4  -1.9  -0.7  -0.5  -0.6   1.5
## disp  -633.1 199.7 15360.8 6721.2 -47.1  107.7 -96.1 -44.4 -36.6 -50.8  79.1
## hp    -320.7 101.9  6721.2 4700.9 -16.5   44.2 -86.8 -25.0  -8.3  -6.4  83.0
## drat     2.2  -0.7   -47.1  -16.5   0.3   -0.4   0.1   0.1   0.2   0.3  -0.1
## wt      -5.1   1.4   107.7   44.2  -0.4    1.0  -0.3  -0.3  -0.3  -0.4   0.7
## qsec     4.5  -1.9   -96.1  -86.8   0.1   -0.3   3.2   0.7  -0.2  -0.3  -1.9
## vs       2.0  -0.7   -44.4  -25.0   0.1   -0.3   0.7   0.3   0.0   0.1  -0.5
## am       1.8  -0.5   -36.6   -8.3   0.2   -0.3  -0.2   0.0   0.2   0.3   0.0
## gear     2.1  -0.6   -50.8   -6.4   0.3   -0.4  -0.3   0.1   0.3   0.5   0.3
## carb    -5.4   1.5    79.1   83.0  -0.1    0.7  -1.9  -0.5   0.0   0.3   2.6

As we can see, each row corresponds to a variable and each column corresponds to a variable, and each value is the covariance of the two corresponding variables. You will notice that the entries along the main diagonal (from the top left down to the bottom right) display the covariance of a variable with itself; for example, in the top-left entry both the row and the column are "mpg". This value is the variance, s², which we have discussed in prior chapters.

Correlation

As we can see, though, covariance is not easy to interpret, and it depends on the scale of what we are measuring. For example, in the mtcars dataset we measured the covariance between miles per gallon and horsepower and found that cov(mpg, hp) ≈ −321. However, if we had measured fuel efficiency in miles per liter, our measure of covariance would change, because (even though miles per liter conveys the same information as miles per gallon) miles per liter is measured on a different scale (per liter) than miles per gallon. Thus, it is useful to have a standardized way to measure the association between two numeric variables. This is where we introduce the concept of correlation.

Correlation is a standardized measure of covariance which measures the strength and direction of the association between two variables X and Y. At the population level, we denote the correlation between X and Y with ρ_{X,Y} (the Greek letter "rho", pronounced "row"), and it is equal to:

\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \cdot \sigma_Y}

where σ_X is the standard deviation of X and σ_Y is the standard deviation of Y. By dividing the covariance by the product of σ_X and σ_Y, the correlation is always a value that falls between -1 and 1.

Computing the Correlation Coefficient for a Sample

Typically, we do not have the population values of the covariance and standard deviations, so we need a formula by which we can calculate the correlation from a sample.
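Before writing that formula out, note that its pieces are quantities we have already seen: on a sample, the covariance divided by the product of the sample standard deviations already gives the standardized measure. A quick sketch on mtcars (the resulting value, roughly -0.776, is the same one the cor() function returns later in this text); the miles-per-liter conversion factor of 3.785 liters per gallon is an approximation used only for illustration.

## Sample covariance standardized by the sample standard deviations
cov(mtcars$mpg, mtcars$hp) / (sd(mtcars$mpg) * sd(mtcars$hp))

## For comparison: the covariance itself changes if we change units,
## but the standardized value does not
mpl <- mtcars$mpg / 3.785                        ## approximate miles per litre
cov(mpl, mtcars$hp)                              ## a different covariance
cov(mpl, mtcars$hp) / (sd(mpl) * sd(mtcars$hp))  ## same standardized value as above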
The most common form of correlation is the Pearson correlation coefficient, r_{x,y}, and we can calculate it like so:

r_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \cdot \sum_{i=1}^{n} (y_i - \bar{y})^2}}

This is equivalent to taking cov(x, y) and dividing it by the product of the measured standard deviations of X and Y, s_x and s_y.

Computing the Coefficient in R

That is another intimidating equation. The important thing to know is that the correlation takes the covariance of X and Y and standardizes it by dividing it by the product of the standard deviations of X and Y. This process of standardization comes up a lot, doesn't it?! It is quite simple to compute the correlation between two variables in R by using the cor() function. Let's take the correlation of miles per gallon and horsepower in the mtcars dataset, like so:

cor(mtcars$mpg, mtcars$hp)

## -0.7761684

This results in a standardized value of -0.776. Because correlation can only range from -1 to 1, we can see that this represents a relatively strong negative correlation between mpg and hp! However, it is good to now discuss how to interpret the correlation coefficient.

Interpreting the Correlation Coefficient

Whereas covariance can take any value from −∞ to ∞, correlation can only take values from -1 to 1. If ρ_{X,Y} = 1, this indicates a perfect positive linear relationship between X and Y. What is a perfect linear relationship? Remember that in middle school we learned that a line can be defined using the equation y = mx + b, where m is the slope of the line and b is the y-intercept (where the line crosses the y-axis). When ρ_{X,Y} = 1, the points in a scatterplot of X and Y fall perfectly on a line, and the slope of that line is positive (meaning that greater values of X are associated with greater values of Y). Let us plot some points to display this. I will generate two vectors that have a perfect linear correlation, like so:

X = c(1,2,3,4,5)
Y = c(2,4,6,8,10)

…

For a pair of observations i and j: if 1) x_i > x_j and y_i > y_j, or 2) x_i < x_j and y_i < y_j, then we consider the pair concordant. In this case, one observation has both the greater x and the greater y value. A discordant pair is therefore any pair in which one observation has the greater x value and the other has the greater y value. Let C be the number of concordant pairs and D be the number of discordant pairs. Let's define two additional quantities as well: 1) let T_0 be the number of pairs whose x values are equal (but whose y values are not); 2) let U_0 be the number of pairs whose y values are equal (but whose x values are not). We represent this version of Kendall's rank coefficient with τ_b (the Greek letter "tau", pronounced the way you would say "Owwww!" when you stub your toe). The formula is as follows:

\tau_b = \frac{C - D}{\sqrt{(C + D + T_0) \cdot (C + D + U_0)}}

Logically, we can see that the numerator captures the balance of concordance and discordance. If every pairing is concordant, then all the observations are in perfect positive rank order. If every pairing is discordant, then all the observations are in perfect negative rank order. If we have the same number of concordant and discordant pairs, C − D = 0, which indicates the data are pretty "jumbled together" and no pattern is apparent. The denominator divides by the total number of pairings, with an adjustment to account for ties.
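To see how the pieces of this formula fit together, here is a small sketch with made-up, tie-free data: it counts concordant and discordant pairs with a double loop and compares the hand-built τ_b with cor(..., method = "kendall").

## Tiny example with no ties, so T0 = U0 = 0
x <- c(1, 2, 3, 4)
y <- c(2, 1, 4, 3)

C <- 0
D <- 0
for (i in 1:(length(x) - 1)) {
  for (j in (i + 1):length(x)) {
    s <- sign(x[j] - x[i]) * sign(y[j] - y[i])
    if (s > 0) C <- C + 1    ## concordant pair
    if (s < 0) D <- D + 1    ## discordant pair
  }
}

(C - D) / sqrt((C + D + 0) * (C + D + 0))   ## tau-b from the formula above
cor(x, y, method = "kendall")               ## should agree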
We can easily run this correlation in R using the cor() function like so:

## Let's define a vector X from 0 to 10
X = c(0,1,2,3,4,5,6,7,8,9,10)

## And define Y = X^2
Y = X^2

## Check the correlation, using Kendall's method
cor(X, Y, method = "kendall")

## 1

Correlation Matrix

The last thing we will discuss in this text is the correlation matrix. Like the covariance matrix, we can generate a table that shows the correlation between all the variables in our dataset. This can be incredibly useful when we are examining our data and trying to identify whether we have any variables that are highly correlated. That becomes very important when we get into fitting regression models, where predictor variables need to be independent: we typically do not want to include two highly correlated variables as predictors in the same model, as this indicates the sampled values are not independent. The cor() function can be used by simply supplying it with a data.frame, like so:

## This will take the Pearson correlation
cor(mtcars[,c(1:5)])

##             mpg        cyl       disp         hp       drat
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.6999381
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.7102139
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.4487591
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.0000000

## We can also specify alternative correlation methods
cor(mtcars[,c(1:5)], method = c("kendall"))

##             mpg        cyl       disp         hp       drat
## mpg   1.0000000 -0.7953134 -0.7681311 -0.7428125  0.4645488
## cyl  -0.7953134  1.0000000  0.8144263  0.7851865 -0.5513178
## disp -0.7681311  0.8144263  1.0000000  0.6659987 -0.4989828
## hp   -0.7428125  0.7851865  0.6659987  1.0000000 -0.3826269
## drat  0.4645488 -0.5513178 -0.4989828 -0.3826269  1.0000000

As you'll notice, the entries on the main diagonal of the matrix (top left to bottom right) all equal 1. This is because a variable is perfectly correlated with itself. You can also see that the matrix is symmetric, as each comparison appears twice (once below the main diagonal and once above). However, this is not fun to read! I want something that will visually help me detect which variables are highly correlated with one another. We are going to use the corrplot library to generate a correlogram, like so:

## load the library
library(corrplot)

## corrplot 0.90 loaded

## save our correlation matrix
cormat
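As a complement to a correlogram, a correlation matrix can also be scanned programmatically for strongly related pairs. The sketch below is only an illustration: the 0.8 cutoff is arbitrary, and it reuses the first five mtcars columns from the example above.

## Flag pairs of variables whose absolute Pearson correlation exceeds 0.8
cormat <- cor(mtcars[, c(1:5)])
high   <- which(abs(cormat) > 0.8 & upper.tri(cormat), arr.ind = TRUE)

data.frame(var1 = rownames(cormat)[high[, 1]],
           var2 = colnames(cormat)[high[, 2]],
           r    = cormat[high])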