Questions and Answers
How does the interpretation of covariance differ when examining age and height in adolescents versus the elderly?
In adolescents, a positive covariance between age and height indicates that as age increases, height tends to increase as well. However, in the elderly, a negative covariance may be observed, as height tends to decrease with age.
Explain how the covariance equation captures the relationship between people with above-average age and height in a sample.
The covariance equation calculates the product of the differences between each person's age and the average age, and their height and the average height. For people with above-average age and height, both differences will be positive, resulting in a positive product.
How would you describe the relationship between miles per gallon (mpg) and horsepower (hp) based on the covariance value of approximately -321?
A covariance of approximately -321 indicates a negative relationship between mpg and hp, suggesting that as horsepower increases, miles per gallon tend to decrease.
Explain why the Pearson correlation coefficient might not be ideal for determining the relationship between $X$ and $Y$ if their association is best described by a non-linear function.
In terms of rank-ordering, what does a Spearman correlation of -1 indicate between variables $X$ and $Y$?
Flashcards
What is Covariance?
A measure of how two random variables change together.
What is positive covariance?
If higher values of one variable are associated with higher values of another.
What is negative covariance?
If higher values of one variable are associated with lower values of the other variable.
What is Correlation?
What does a correlation of 1 mean?
Study Notes
- Assessing the relationship between variables within a sample data set is a common statistical task
- Covariance and correlation are two metrics used to measure the association between two variables
- R can be used to investigate covariance and correlation in data
Covariance
- Covariance measures how two random variables 𝑋 and 𝑌 vary together
- Focus is given to measuring covariance between two numeric variables
- Positive covariance exists when increases in the value of one variable are associated with increases in the value of the other
- For people aged 18 and younger, as age (𝑋) increases, height (𝑌) also increases, demonstrating positive covariance
- For people aged 18 through 65, there is no relationship between age (𝑋) and height (𝑌), so the covariance of 𝑋 and 𝑌 is approximately 0
- Negative covariance exists when increases in the value of one variable are associated with decreases in the value of the other
- For individuals aged 65 and older, as age increases, height decreases, demonstrating negative covariance
Measuring Covariance
- The covariance of 𝑋 and 𝑌 is expressed as 𝑐𝑜𝑣(𝑋, 𝑌)
- Assuming a sample of 𝑛 individuals with measurements of 𝑋 and 𝑌, the covariance between 𝑋 and 𝑌 can be defined mathematically
- $cov(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
- This equation calculates $(x_i - \bar{x})(y_i - \bar{y})$ for each participant, $i$
- $x_i$ is the value of $X$ for participant $i$, and $y_i$ is the value of $Y$ for participant $i$
- $(x_i - \bar{x})$ indicates how far the $i$th participant is from the mean value of $X$ ($\bar{x}$), and whether $x_i$ is less than or greater than the mean value
- $(y_i - \bar{y})$ gives the same information for $Y$
- Positive covariance indicates that greater values of 𝑋 are associated with greater values of 𝑌 (and vice versa), resulting in 𝑐𝑜𝑣(𝑋, 𝑌) > 0
- When lower values of 𝑋 are associated with lower values of 𝑌, multiplying two negative values results in a positive value, indicating a positive covariance
- Whether 𝑐𝑜𝑣(𝑋, 𝑌) is less than 0 (negative covariance), equal or close to 0 (no covariance), or greater than 0 (positive covariance) is a primary concern
- Negative covariance is captured when higher values of 𝑋 are associated with lower values of 𝑌, resulting in 𝑐𝑜𝑣(𝑋, 𝑌) < 0
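The covariance formula above can be sketched directly in R and checked against the built-in cov() function. This is a minimal illustration, not the notes' own code: `manual_cov` is a hypothetical helper name, and the ages and heights are made-up values for five adolescents.

```r
# Sample covariance computed step by step from the formula.
# `manual_cov` is a hypothetical helper used only for illustration.
manual_cov <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) > 1)
  dx <- x - mean(x)  # deviation of each x_i from the mean of X
  dy <- y - mean(y)  # deviation of each y_i from the mean of Y
  sum(dx * dy) / (length(x) - 1)
}

# Made-up ages (years) and heights (cm), not real measurements
age    <- c(10, 12, 14, 16, 18)
height <- c(138, 149, 160, 168, 172)

manual_result  <- manual_cov(age, height)
builtin_result <- cov(age, height)

# Both approaches should agree, and the sign should be positive here
print(c(manual = manual_result, builtin = builtin_result))
```

Because every older-than-average participant in this toy sample is also taller than average, each product of deviations is non-negative and the covariance comes out positive.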
Additional Notes on Covariance
- If a pattern emerges across the full sample where higher values of one variable typically correspond to lower values of the other (though individual values can vary), 𝑐𝑜𝑣(𝑋, 𝑌) < 0
- Covariance is a measure of the association between two variables 𝑋 and 𝑌
- Covariance identifies that 𝑋 and 𝑌 vary together and to what extent, but does not specify why
Measuring and Visualizing Covariance in R
- The 𝑐𝑜𝑣() function can be used to calculate covariance between two variables, 𝑋 and 𝑌
- The mtcars data.frame in R contains variables about different cars, such as miles per gallon (mpg) and horsepower (hp)
- The covariance between a car's miles per gallon and its horsepower can be calculated to determine whether the two vary together
- The result is approximately -321, a negative covariance
- Scatterplots can be used to visually understand covariance by plotting values of 𝑋 against values of 𝑌, where each participant’s observations (𝑥𝑖, 𝑦𝑖) are plotted on the graph
- The ggplot2 package in R can be used to generate visualizations, including a basic scatter plot of miles per gallon (𝑋) and horsepower (𝑌)
- ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point()
- The labs() function sets the X-axis label, Y-axis label, and plot title; the plot's appearance can be altered using the theme_minimal() function
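As a quick check of the covariance value quoted in these notes, the calculation can be run on the built-in mtcars data. The variable name `mpg_hp_cov` is just an illustrative choice.

```r
# mtcars ships with base R (datasets package), so no extra packages are needed
mpg_hp_cov <- cov(mtcars$mpg, mtcars$hp)

# Rounds to -321: higher horsepower goes with fewer miles per gallon
print(round(mpg_hp_cov))
```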
Visualizing Covariance with ggplot2
- The following ggplot2 code produces a more polished plot
- ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point() + labs( x = "Miles Per Gallon", y = "Horsepower", title = "Relationship Between MPG and Horsepower" ) + theme_minimal()
- theme_minimal() is great, but there are many other options for ggplot2 themes
- theme_classic() is another built-in theme that changes the plot's appearance
- The ggthemes library includes themes such as theme_solarized()
Scatterplots
- Scatterplots visually display covariance
- Lighter cars tend to have lower horsepower and heavier cars tend to have higher horsepower
- Positive Correlation: Points trend from bottom left to top right, with greater values of 𝑋 associated with greater values of 𝑌
- Negative Correlation: Data "travels" from upper left to lower right, with greater values of 𝑋 associated with smaller values of 𝑌
- Weak Correlation: No clear pattern is detected, indicating little or no linear association between 𝑋 and 𝑌
Covariance Matrix
- A covariance matrix is a two-dimensional array where each row and column represents a variable
- The 𝑐𝑜𝑣() function can be applied to the sample data to generate a covariance matrix
- The round() function can be used to round all the covariance values
- Each value represents the covariance of the two corresponding variables; the values along the main diagonal give the covariance of each variable with itself, which is its variance, denoted σ²
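A covariance matrix along these lines can be sketched with base R. The choice of the three mtcars columns here is an assumption made for brevity, not something specified in the notes.

```r
# Pick a few numeric columns from mtcars (an arbitrary illustrative subset)
vars <- mtcars[, c("mpg", "hp", "wt")]

# cov() on a data frame returns the full covariance matrix
cov_matrix <- round(cov(vars), 2)
print(cov_matrix)

# The main diagonal holds each variable's variance
diag_matches_var <- all.equal(unname(diag(cov(vars))),
                              unname(sapply(vars, var)))
print(diag_matches_var)
```

The matrix is symmetric, since the covariance of mpg with hp is the same as the covariance of hp with mpg.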
Correlation
- Covariance is difficult to interpret because its value depends on the units in which the variables are measured
- Example: Covariance between miles per gallon and horsepower in the 𝑚𝑡𝑐𝑎𝑟𝑠 dataset is approximately -321
- A standardized measure of the association between two numeric variables is more useful, so correlation is introduced
- Correlation: a standardized measure of covariance that measures the strength and direction of the association between two variables 𝑋 and 𝑌
- At the population level, the correlation between $X$ and $Y$ is denoted $\rho_{X,Y}$
- $\rho_{X,Y} = \frac{cov(X, Y)}{\sigma_X \sigma_Y}$, where $\sigma_X$ is the standard deviation of $X$ and $\sigma_Y$ is the standard deviation of $Y$
- The correlation is always a value that falls between -1 and 1
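To see why standardizing helps, the sketch below changes the units of horsepower and compares the two measures. The conversion factor of roughly 0.7457 kW per horsepower is an approximation introduced for this illustration, and sample standard deviations stand in for the population values.

```r
mpg <- mtcars$mpg
hp  <- mtcars$hp
kw  <- hp * 0.7457  # approximate conversion from horsepower to kilowatts

# Covariance changes when the units change
cov_hp <- cov(mpg, hp)
cov_kw <- cov(mpg, kw)

# Dividing by the standard deviations removes the dependence on units
cor_hp <- cov_hp / (sd(mpg) * sd(hp))
cor_kw <- cov_kw / (sd(mpg) * sd(kw))

print(c(cov_hp = cov_hp, cov_kw = cov_kw))
print(c(cor_hp = cor_hp, cor_kw = cor_kw))
```

The two covariances differ by the conversion factor, while the two standardized values are the same and lie between -1 and 1.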
Computing the Correlation Coefficient for a Sample
- Population parameters are usually unknown, so for a sample the Pearson correlation coefficient $r_{x,y}$ is calculated from sample estimates such as the sample means
- $r_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
- This is equivalent to dividing $cov(x, y)$ by the product of the sample standard deviations of $X$ and $Y$, $s_X$ and $s_Y$
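The equivalence between the two ways of writing the sample coefficient can be checked numerically. This is a small sketch using mtcars; the variable names are illustrative.

```r
x <- mtcars$mpg
y <- mtcars$hp

dx <- x - mean(x)  # deviations from the sample mean of X
dy <- y - mean(y)  # deviations from the sample mean of Y

# Pearson's r written out from the sums of deviations
r_formula <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The same quantity as covariance over the product of standard deviations
r_from_cov <- cov(x, y) / (sd(x) * sd(y))

print(c(r_formula = r_formula, r_from_cov = r_from_cov))
```

The n - 1 terms in the covariance and in the two standard deviations cancel, which is why the two expressions agree.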
Computing in R
- The 𝑐𝑜𝑟() function can be used to compute the correlation coefficient in R
- For miles per gallon and horsepower, the correlation is -0.7761684
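The call that produces this value can be sketched as follows, assuming the built-in mtcars data; `r_mpg_hp` is an illustrative variable name.

```r
# cor() defaults to the Pearson correlation coefficient
r_mpg_hp <- cor(mtcars$mpg, mtcars$hp)
print(r_mpg_hp)
```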
Interpreting the Correlation Coefficient
- Covariance: Can vary from negative infinity to infinity
- Correlation: Ranges from -1 to 1
- $r_{X,Y} = 1$ indicates a perfect positive linear relationship between $X$ and $Y$
- When $r_{X,Y} = 1$, the points fall perfectly on a line with a positive slope, meaning that greater values of $X$ are associated with greater values of $Y$
- When $\rho_{X,Y} = -1$, there is a perfect negative linear relationship between $X$ and $Y$: a line $y = mx + b$ perfectly describes the relationship, and the slope $m$ is negative
- What matters most in practice is understanding how to interpret values between 0 and 1
Positive Correlation
- When $r_{x,y}$ is between 0 and 1, greater values of $X$ are still associated with greater values of $Y$, but the points do not fall perfectly on a line
- This can be visualized by adding a simple linear regression line to the scatter plot with geom_smooth()
- Note that this regression line is only a visual aid; it is not how the correlation is calculated
- The stronger the association, the closer the value is to 1; the weaker the association, the closer it is to 0
- Below 0.3 indicates no correlation
- 0.3 to 0.5 indicates low correlation
- 0.5 to 0.7 indicates moderate correlation
- 0.7 to 1 indicates high correlation
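The rule-of-thumb bands above can be expressed as a small lookup. This is only a sketch: `describe_correlation` is a hypothetical helper, and applying the bands to the absolute value of r (so that negative correlations are graded the same way) is an assumption not stated in the notes.

```r
# Hypothetical helper mapping a correlation to the guideline labels.
# Assumption: negative values are graded by their absolute value.
describe_correlation <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) <= 1))
  bands <- cut(abs(r),
               breaks = c(0, 0.3, 0.5, 0.7, 1),
               labels = c("none", "low", "moderate", "high"),
               right = FALSE,          # intervals are [lower, upper)
               include.lowest = TRUE)  # so that exactly 1 falls in "high"
  as.character(bands)
}

labels_example <- describe_correlation(c(0.1, 0.3, 0.6, -0.7761684, 1))
print(labels_example)
```

How the boundary values (for example, exactly 0.3) are assigned is a design choice made here; the notes do not say which band a boundary value belongs to.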
Factors to Consider
- These guidelines are not hard and fast rules
- They also help when working with multiple regression, although interpretation can be challenging
- The correlation coefficient does not measure the magnitude of the relationship (such as the slope of the line), but rather the existence of a linear dependency
- For Pearson's Correlation Coefficient to perform well, the following conditions should hold:
- The relationship between 𝑋 and 𝑌 is linear in nature
- There are no severe outliers in the data
- 𝑋 and 𝑌 are normally distributed variables
Alternatives
- Pearson's correlation measures linear dependency, so it can understate associations that are non-linear
- Linear dependency means there exists a line $y = mx + b$ that describes the relationship; $X$ and $Y$ may instead vary together according to a non-linear function
- If $X$ and $Y$ vary together perfectly according to a non-linear function, their correlation may still not be 1
- Spearman's Rank Correlation Coefficient and Kendall's Rank Coefficient are two alternatives
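A small simulated example may make the limitation concrete. The exponential relationship below is an arbitrary choice for illustration: Y is completely determined by X, yet the relationship is not a straight line.

```r
# Y is a perfectly monotonic but non-linear function of X
x <- 1:20
y <- exp(x / 2)

pearson  <- cor(x, y)                       # linear association only
spearman <- cor(x, y, method = "spearman")  # association between ranks

print(c(pearson = pearson, spearman = spearman))
```

Pearson's coefficient falls short of 1 because no single line fits the points, while the rank-based coefficient reaches 1 because the ordering of X and Y agrees exactly.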
Spearman's Rank Correlation Coefficient
- Spearman's Rank Correlation Coefficient, $r_s$, is an alternative to Pearson's Correlation Coefficient
- Instead of comparing the values of X and Y against one another, the ranks of each variable R(X) and R(Y) are compared
- This checks if changes in the rank of X vary with changes in those of Y
- A positive rank correlation indicates that greater values of X (higher ranks R(X)) correspond with greater values of Y (higher ranks R(Y))
- A negative rank correlation indicates that greater values of X (higher ranks R(X)) correspond with lower values of Y (lower ranks R(Y))
- In both cases, the rank order of X must be associated with the rank order of Y
- To calculate R(X), each value in X is assigned a rank according to its position when sorted, with the lowest value receiving rank 1
Procedure
- With X = {4, 12, 0, -2, -5}
- The result is R(X) = {4, 5, 3, 2, 1}
- The lowest value in X, -5, receives rank 1, and the greatest value, 12, receives the greatest rank, 5
- If several observations share the same value, each receives the average of the ranks they would otherwise occupy
- For sorted data in which the second and third values are tied, the result would be R(X) = {1, 2.5, 2.5, 4, 5}
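The ranking procedure can be reproduced with base R's rank() function, which averages tied ranks by default. The vector `tied` is made-up data chosen so that the second and third values are equal.

```r
# The example from the notes
x <- c(4, 12, 0, -2, -5)
ranks_x <- rank(x)
print(ranks_x)  # 4 5 3 2 1

# Hypothetical data with one tie; rank() uses ties.method = "average" by default
tied <- c(1, 3, 3, 7, 9)
ranks_tied <- rank(tied)
print(ranks_tied)  # 1.0 2.5 2.5 4.0 5.0
```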
Implementation
- Spearman's coefficient is obtained by applying Pearson's Correlation Coefficient to the rank variables R(X) and R(Y) instead of the original x and y values
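This relationship between the two coefficients can be checked on mtcars. The sketch below ranks both variables and then applies the ordinary Pearson calculation; the variable names are illustrative.

```r
# Pearson's correlation applied to the ranks of each variable
spearman_by_ranks <- cor(rank(mtcars$mpg), rank(mtcars$hp))

# The built-in Spearman option of cor()
spearman_builtin <- cor(mtcars$mpg, mtcars$hp, method = "spearman")

print(c(by_ranks = spearman_by_ranks, builtin = spearman_builtin))
```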
Kendall's Rank Coefficient
- Kendall's Rank Coefficient takes a different approach
- Consider the observations of x and y as pairs of measurements $(x_i, y_i)$
Procedure
- Take two pairs, $(x_i, y_i)$ and $(x_j, y_j)$
- If $x_i > x_j$ and $y_i > y_j$, or $x_i < x_j$ and $y_i < y_j$, the pair is concordant
- If the orderings disagree (one observation has the greater x but the lower y), the pair is discordant; if $x_i = x_j$ or $y_i = y_j$, the pair is tied
- Determine the number of concordant and discordant pairs
- The number of pairs tied in x and the number of pairs tied in y are also counted
- Calculate $\tau_b$ from these counts
- This correlation can easily be computed in R using the 𝑐𝑜𝑟() function with method = "kendall"
- Spearman's coefficient can be computed the same way with method = "spearman"
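The pair-counting idea can be sketched in R and compared with the built-in calculation. The notes do not give the formula for the coefficient, so the sketch assumes the standard tau-b definition: concordant minus discordant pairs, divided by the square root of (pairs not tied in x) times (pairs not tied in y). `kendall_tau_b` is a hypothetical helper name.

```r
# Hypothetical helper: Kendall's tau-b by explicit pair counting.
# Assumes the standard tau-b definition, which the notes do not spell out.
kendall_tau_b <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) > 1)
  pairs <- combn(length(x), 2)  # every pair of observation indices (i < j)
  sx <- sign(x[pairs[1, ]] - x[pairs[2, ]])  # ordering of the pair in x
  sy <- sign(y[pairs[1, ]] - y[pairs[2, ]])  # ordering of the pair in y

  concordant <- sum(sx * sy > 0)  # orderings agree
  discordant <- sum(sx * sy < 0)  # orderings disagree
  untied_x   <- sum(sx != 0)      # pairs not tied in x
  untied_y   <- sum(sy != 0)      # pairs not tied in y

  (concordant - discordant) / sqrt(untied_x * untied_y)
}

tau_manual  <- kendall_tau_b(mtcars$mpg, mtcars$hp)
tau_builtin <- cor(mtcars$mpg, mtcars$hp, method = "kendall")
spearman    <- cor(mtcars$mpg, mtcars$hp, method = "spearman")

print(c(tau_manual = tau_manual, tau_builtin = tau_builtin, spearman = spearman))
```

Looping over all pairs grows quadratically with the sample size, so for anything beyond small data sets the built-in cor() call is the practical choice.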
Correlation Matrix
- The correlation matrix is a table showing the correlation between each pair of variables in our dataset
- These are mostly used to identify highly correlated variables
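A correlation matrix can be built by passing several columns to cor(). In the sketch below, the choice of mtcars columns and the 0.7 cutoff for flagging strongly correlated pairs are assumptions made for illustration.

```r
# An arbitrary subset of numeric mtcars columns
vars <- mtcars[, c("mpg", "hp", "wt", "disp")]

# Pairwise Pearson correlations, rounded for readability
cor_matrix <- round(cor(vars), 2)
print(cor_matrix)

# List each pair once (upper triangle) where |r| is at least 0.7
high <- which(abs(cor_matrix) >= 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  r    = cor_matrix[high]
)
print(high_pairs)
```

The diagonal is always 1, since every variable is perfectly correlated with itself, so only the off-diagonal entries are informative.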