Covariance and Correlation in R

Assessing the relationship between variables within a sample data set is a common statistical task
Covariance and correlation are two metrics used to measure the association between two variables
R can be used to investigate covariance and correlation in data

Covariance

Covariance measures how two random variables 𝑋 and 𝑌 vary together
Focus is given to measuring covariance between two numeric variables
Positive covariance exists when increases in the value of one variable are associated with increases in the value of the other
For people aged 18 and younger, as age (𝑋) increases, height (𝑌) also increases, demonstrating positive covariance
For people aged 18 through 65, there is no relationship between age (𝑋) and height (𝑌), so the covariance of 𝑋 and 𝑌 is approximately 0
Negative covariance exists when increases in the value of one variable are associated with decreases in the value of the other
For individuals aged 65 and older, as age increases, height decreases, demonstrating negative covariance

Measuring Covariance

The covariance of 𝑋 and 𝑌 is expressed as 𝑐𝑜𝑣(𝑋, 𝑌)
Assuming a sample of 𝑛 individuals with measurements of 𝑋 and 𝑌, the covariance between 𝑋 and 𝑌 can be defined mathematically
$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
This equation calculates $(x_i - \bar{x})(y_i - \bar{y})$ for each participant $i$
$x_i$ is the value of $X$ for participant $i$, and $y_i$ is the value of $Y$ for participant $i$
$(x_i - \bar{x})$ indicates how far the $i$th participant is from the mean value of $X$, $\bar{x}$, and whether $x_i$ is less than or greater than that mean
The same information is indicated for $Y$ by $(y_i - \bar{y})$
Positive covariance indicates that greater values of 𝑋 are associated with greater values of 𝑌 (and vice versa), resulting in 𝑐𝑜𝑣(𝑋, 𝑌) > 0
When below-average values of 𝑋 are paired with below-average values of 𝑌, both deviations are negative and their product is positive, which also contributes to a positive covariance
Negative covariance is captured when higher values of 𝑋 are associated with lower values of 𝑌, resulting in 𝑐𝑜𝑣(𝑋, 𝑌) < 0
The primary concern is the sign: whether 𝑐𝑜𝑣(𝑋, 𝑌) is less than 0 (negative covariance), equal or close to 0 (no covariance), or greater than 0 (positive covariance)
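The formula above can be checked by hand in R. The following is a minimal sketch, assuming the built-in mtcars data set with mpg as 𝑋 and hp as 𝑌; the names dev_x, dev_y, and manual_cov are illustrative and not from the original notes.

```r
# Illustrative choice of variables: miles per gallon (X) and horsepower (Y)
x <- mtcars$mpg
y <- mtcars$hp
n <- length(x)

# Deviation of each observation from its variable's mean
dev_x <- x - mean(x)
dev_y <- y - mean(y)

# Sum the products of the paired deviations and divide by n - 1
manual_cov <- sum(dev_x * dev_y) / (n - 1)

# The hand calculation should agree with the built-in cov()
all.equal(manual_cov, cov(x, y))
```

Products of deviations with the same sign push the sum up, and products with opposite signs pull it down, which is why the sign of the result reflects the direction of the association.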

Additional Notes on Covariance

If a pattern emerges across the full sample where higher values of one variable typically correspond to lower values of the other (even though individual values can vary), then 𝑐𝑜𝑣(𝑋, 𝑌) < 0
Covariance is a measure of the association between two variables 𝑋 and 𝑌
Covariance identifies that 𝑋 and 𝑌 vary together and to what extent, but does not specify why

Measuring and Visualizing Covariance in R

The cov() function can be used to calculate covariance between two variables, 𝑋 and 𝑌
The mtcars data.frame in R contains variables about different cars, such as miles per gallon (mpg) and horsepower (hp)
The covariance between a car's miles per gallon and its horsepower can be calculated to determine whether the two vary in opposite directions
The result is approximately -321, which is a negative covariance
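As a short sketch of the calculation described above, the call below passes the two mtcars columns to cov(). The name mpg_hp_cov is only an illustrative variable name.

```r
# Covariance between miles per gallon and horsepower in mtcars
mpg_hp_cov <- cov(mtcars$mpg, mtcars$hp)

# Print the value; the sign shows the direction of the association
mpg_hp_cov
```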
Scatterplots can be used to visually understand covariance by plotting values of 𝑋 against values of 𝑌, where each participant’s observations (𝑥𝑖, 𝑦𝑖) are plotted on the graph
The ggplot2 package in R can be used to generate visualizations, including a basic scatter plot of miles per gallon (𝑋) and horsepower (𝑌)
library(ggplot2)
ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point()
The labs() function sets the X-axis label, Y-axis label, and plot title, and the theme_minimal() function alters the plot's appearance

Visualizing Covariance with ggplot2

The following ggplot2 code adds axis labels, a title, and a cleaner theme to the scatter plot
ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point() + labs( x = "Miles Per Gallon", y = "Horsepower", title = "Relationship Between MPG and Horsepower" ) + theme_minimal()
theme_minimal() works well, but ggplot2 offers many other themes
theme_classic() gives the plot a classic look, with axis lines and no gridlines
The ggthemes library includes themes such as theme_solarized()

Scatterplots

Scatterplots visually display covariance
Lighter cars tend to have lower horsepower and heavier cars tend to have higher horsepower
Positive Correlation: Points trend from bottom left to top right, with greater values of 𝑋 associated with greater values of 𝑌
Negative Correlation: Points trend from upper left to lower right, with greater values of 𝑋 associated with smaller values of 𝑌
Weak Correlation: No clear pattern is detected, suggesting little or no linear association between 𝑋 and 𝑌

Covariance Matrix

A covariance matrix is a two-dimensional array where each row and column represents a variable
The cov() function and the sample data can be used to generate a covariance matrix
The round() function can be used to round all the covariance values
Each value represents the covariance of the two corresponding variables, and the values along the main diagonal display the covariance of each variable with itself, which is its variance, denoted σ²
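A minimal sketch of building a covariance matrix follows. The choice of the mpg, hp, and wt columns from mtcars and the names vars and cov_matrix are assumptions made for illustration; any set of numeric columns works the same way.

```r
# Keep a few numeric columns so the matrix stays readable
vars <- mtcars[, c("mpg", "hp", "wt")]

# Passing a data frame to cov() returns the full covariance matrix
cov_matrix <- round(cov(vars), 2)
cov_matrix

# The main diagonal holds each variable's variance
diag(cov(vars))
sapply(vars, var)
```

The matrix is symmetric because the covariance of 𝑋 with 𝑌 equals the covariance of 𝑌 with 𝑋.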

Correlation

Covariance is difficult to interpret because its value depends on the units in which the variables are measured
Example: The covariance between miles per gallon and horsepower in the mtcars dataset is approximately -321
A standardized measure of the association between two numeric variables is more useful, which is why correlation is introduced
Correlation: a standardized measure of covariance that measures the strength and direction of the association between two variables 𝑋 and 𝑌
At the population level, the correlation between 𝑋 and 𝑌 is denoted $\rho_{X,Y}$
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$
where $\sigma_X$ is the standard deviation of 𝑋 and $\sigma_Y$ is the standard deviation of 𝑌
The correlation is always a value that falls between -1 and 1

Computing the Correlation Coefficient for a Sample

For a sample, the Pearson correlation coefficient $r_{x,y}$ is calculated using sample estimates in place of the population values
$$r_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
This is equivalent to dividing $\mathrm{cov}(x, y)$ by the product of the sample standard deviations of 𝑋 and 𝑌, $s_X$ and $s_Y$
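The equivalence stated above can be verified numerically. This is a sketch under the assumption that mpg and hp from mtcars play the roles of 𝑋 and 𝑌; r_from_cov and r_from_sums are illustrative names.

```r
x <- mtcars$mpg
y <- mtcars$hp

# Covariance divided by the product of the sample standard deviations
r_from_cov <- cov(x, y) / (sd(x) * sd(y))

# The same quantity written as sums of deviations
r_from_sums <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Both forms should match each other and the built-in cor()
all.equal(r_from_cov, r_from_sums)
all.equal(r_from_cov, cor(x, y))
```

The two forms agree because the n - 1 terms in the covariance and in the two standard deviations cancel.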

Computing in R

The cor() function can be used to compute the correlation between two variables
For miles per gallon and horsepower, the correlation is -0.7761684
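A short sketch of the call that produces this value is shown below. cor() uses the Pearson method by default; mpg_hp_cor is an illustrative variable name.

```r
# Pearson correlation between miles per gallon and horsepower
mpg_hp_cor <- cor(mtcars$mpg, mtcars$hp)
mpg_hp_cor
```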

Interpreting the Correlation Coefficient

Covariance: Can vary from negative infinity to infinity
Correlation: Ranges from -1 to 1
When $r_{X,Y} = 1$, there is a perfect positive linear relationship between 𝑋 and 𝑌
In that case the points fall exactly on a line with positive slope, meaning that greater values of 𝑋 are associated with greater values of 𝑌
When $r_{X,Y} = -1$, there is a perfect negative linear relationship: a line y = mx + b describes the relationship between 𝑋 and 𝑌 exactly, and its slope m is negative
In practice, what matters most is understanding how to interpret values that fall between these extremes
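The two extreme cases can be illustrated with made-up data that lie exactly on a line. The vectors below are hypothetical and exist only to show the behavior of cor() at the boundaries.

```r
# Hypothetical data lying exactly on a line
x <- 1:10
y_up <- 2 * x + 3     # positive slope
y_down <- -2 * x + 3  # negative slope

r_up <- cor(x, y_up)
r_down <- cor(x, y_down)

# Compare against 1 and -1 with a tolerance for floating-point error
all.equal(r_up, 1)
all.equal(r_down, -1)
```

Note that the size of the slope does not matter here: any positive slope gives 1 and any negative slope gives -1.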

Positive Correlation

When $r_{x,y}$ is between 0 and 1, greater values of 𝑋 are still associated with greater values of 𝑌, but the points do not fall perfectly on a line
A simple linear regression line can be added to the scatterplot using geom_smooth() to visualize the trend
Note that this line is a visual aid from a linear regression, not a method for calculating correlation
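A sketch of such a plot follows, assuming the ggplot2 package is installed. The choice of weight (wt) against horsepower (hp) is an assumption made to show a positive association, and the axis labels are illustrative.

```r
library(ggplot2)

# Scatter plot with a fitted straight line; se = FALSE hides the confidence band
p <- ggplot(mtcars, aes(x = wt, y = hp)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Horsepower") +
  theme_minimal()
p

# The correlation itself is still computed separately with cor()
cor(mtcars$wt, mtcars$hp)
```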
Stronger associations give values closer to 1, and weaker associations give values closer to 0
- Less than 0.3 indicates no correlation
- 0.3 to 0.5 indicates low correlation
- 0.5 to 0.7 indicates moderate correlation
- 0.7 to 1 indicates high correlation
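The thresholds above can be turned into a small lookup. The function describe_correlation() below is a hypothetical helper, not part of R or of the original notes. It assumes the thresholds apply to the absolute value of the coefficient and that each boundary value belongs to the higher band; both are assumptions, since the list does not say.

```r
# Hypothetical helper: maps a correlation to the rough labels listed above
describe_correlation <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.3, 0.5, 0.7, 1),
      labels = c("no correlation", "low", "moderate", "high"),
      right = FALSE,          # bands are [0, 0.3), [0.3, 0.5), [0.5, 0.7), [0.7, 1]
      include.lowest = TRUE)  # with right = FALSE this keeps 1 in the top band
}

# Label the correlation between miles per gallon and horsepower
describe_correlation(cor(mtcars$mpg, mtcars$hp))
```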

Factors to Consider

Guidelines are not hard and fast rules
These guidelines can help, for example in multiple regression; however, interpretation can still be challenging
The correlation coefficient does not measure the magnitude of the slope, but rather how closely 𝑋 and 𝑌 follow a linear dependency
For Pearson's correlation coefficient to perform well, the following conditions should hold:
- The relationship between 𝑋 and 𝑌 is linear in nature
- There are no severe outliers in the data
- 𝑋 and 𝑌 are normally distributed variables

Alternatives

Pearson's correlation measures linear dependency, so it can be misleading for non-linear relationships
Linear dependency: there exists a line y = mx + b that describes the relationship; however, X and Y may instead vary together according to a non-linear function
If X and Y vary together perfectly according to a non-linear function, their Pearson correlation may still not be 1
Spearman's Rank Correlation Coefficient and Kendall's Rank Coefficient are two alternatives
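The point about non-linear relationships can be seen with a small made-up example. The data below are hypothetical: y is a perfect, strictly increasing, but non-linear function of x.

```r
# Hypothetical data: y increases with x, but not along a straight line
x <- 1:10
y <- x^3

pearson <- cor(x, y)                        # linear association only
spearman <- cor(x, y, method = "spearman")  # based on ranks
kendall <- cor(x, y, method = "kendall")    # based on pair orderings

c(pearson = pearson, spearman = spearman, kendall = kendall)
```

Because the ordering of x and y agrees perfectly, both rank-based measures reach 1 while the Pearson coefficient stays below 1.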

Spearman's Rank Correlation Coefficient

Spearman's Rank Correlation Coefficient, $r_s$, is a rank-based alternative to Pearson's Correlation Coefficient
Instead of comparing the values of X and Y against one another, the ranks of each variable R(X) and R(Y) are compared
This checks if changes in the rank of X vary with changes in those of Y
A positive rank correlation would indicate that higher ranks of X, R(X), correspond with higher ranks of Y, R(Y)
A negative rank correlation would indicate that higher ranks of X, R(X), correspond with lower ranks of Y, R(Y)
For a strong rank correlation, the rank ordering of X must be associated with the rank ordering of Y
To calculate R(X), each value in X is assigned a rank according to its position in sorted order, with the lowest value receiving rank 1

Procedure

With X = {4, 12, 0, -2, -5}
The resulting ranks are R(X) = {4, 5, 3, 2, 1}
The lowest value in X, -5, receives rank 1, and the greatest value, 12, receives the greatest rank, 5
When values are tied, each of them receives the average of the ranks they would otherwise occupy
For example, if the second and third smallest values of a sorted sample are tied, the result would equal R(X) = {1, 2.5, 2.5, 4, 5}
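The ranking procedure can be reproduced with R's rank() function, which averages tied ranks by default. The first vector is the one from the example above; the vector tied is a hypothetical sample chosen so that its second and third values are equal.

```r
# Values from the example above
x <- c(4, 12, 0, -2, -5)
ranks_x <- rank(x)
ranks_x  # 4 5 3 2 1

# Hypothetical sample with a tie between the second and third values
tied <- c(1, 2, 2, 3, 4)
ranks_tied <- rank(tied)
ranks_tied  # 1.0 2.5 2.5 4.0 5.0
```

The two tied values would occupy ranks 2 and 3, so each receives their average, 2.5.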

Implementation

Spearman's coefficient is obtained by applying Pearson's Correlation to the rank variables R(X) and R(Y) instead of the original values of X and Y
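A minimal sketch of this idea, assuming mpg and hp from mtcars as the two variables: ranking each variable and then calling cor() on the ranks should match cor() with method = "spearman". The names spearman_by_ranks and spearman_builtin are illustrative.

```r
x <- mtcars$mpg
y <- mtcars$hp

# Pearson correlation applied to the ranks (ties get average ranks)
spearman_by_ranks <- cor(rank(x), rank(y))

# Built-in Spearman method for comparison
spearman_builtin <- cor(x, y, method = "spearman")

all.equal(spearman_by_ranks, spearman_builtin)
```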

Kendall's Rank Coefficient

Kendall's Rank Coefficient takes a different approach
Start from the paired observations of x and y and compare every pair of observations with every other pair

Procedure

Consider two pairs of observations, $(x_i, y_i)$ and $(x_j, y_j)$
If $x_i > x_j$ and $y_i > y_j$, or $x_i < x_j$ and $y_i < y_j$, the two pairs are concordant; if the orderings of x and y disagree, they are discordant
If $x_i = x_j$ or $y_i = y_j$, the two pairs are tied and are neither concordant nor discordant
Determine the number of concordant and discordant pairs
The number of pairs tied on x and the number of pairs tied on y are also counted
Calculate $\tau_b$ from these counts
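The counting steps above can be sketched in R. The function kendall_tau_b() below is a hypothetical helper written for illustration; it assumes the usual tau-b form, in which the difference between concordant and discordant counts is divided by a denominator adjusted for ties, and it compares every pair of observations directly, so it is only practical for small samples.

```r
# Hypothetical helper: Kendall's tau-b from concordant, discordant, and tied pairs
kendall_tau_b <- function(x, y) {
  # Sign of the difference between every pair of observations
  dx <- sign(outer(x, x, "-"))
  dy <- sign(outer(y, y, "-"))
  upper <- upper.tri(dx)  # count each pair once
  dx <- dx[upper]
  dy <- dy[upper]

  concordant <- as.numeric(sum(dx * dy > 0))
  discordant <- as.numeric(sum(dx * dy < 0))
  tied_x_only <- as.numeric(sum(dx == 0 & dy != 0))
  tied_y_only <- as.numeric(sum(dx != 0 & dy == 0))

  # Each factor counts the pairs that are not tied on one of the variables
  (concordant - discordant) /
    sqrt((concordant + discordant + tied_x_only) *
         (concordant + discordant + tied_y_only))
}

# Compare with R's built-in Kendall method
tau_manual <- kendall_tau_b(mtcars$mpg, mtcars$hp)
tau_builtin <- cor(mtcars$mpg, mtcars$hp, method = "kendall")
all.equal(tau_manual, tau_builtin)
```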
This correlation can be computed in R using the cor() function with method = "kendall"
Spearman's coefficient can be computed the same way with method = "spearman"
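A short sketch of both calls, assuming mpg and hp from mtcars as the variables of interest; kendall_tau and spearman_rho are illustrative names.

```r
# Kendall's rank coefficient
kendall_tau <- cor(mtcars$mpg, mtcars$hp, method = "kendall")

# Spearman's rank correlation coefficient
spearman_rho <- cor(mtcars$mpg, mtcars$hp, method = "spearman")

c(kendall = kendall_tau, spearman = spearman_rho)
```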

Correlation Matrix

The correlation matrix is a table showing the correlation between each pair of variables in the dataset
These are mostly used to identify highly correlated variables
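A minimal sketch of a correlation matrix follows. The selection of the mpg, hp, and wt columns and the 0.7 cutoff for a high correlation are assumptions made for illustration.

```r
# Keep a few numeric columns so the matrix stays readable
vars <- mtcars[, c("mpg", "hp", "wt")]

# Passing a data frame to cor() returns the full correlation matrix
cor_matrix <- round(cor(vars), 2)
cor_matrix

# Locate pairs with |r| >= 0.7, looking only above the diagonal
high <- which(abs(cor_matrix) >= 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
high
```

The diagonal is always 1 because every variable is perfectly correlated with itself.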



