Correlation PDF
Summary
This document provides an overview of correlation analysis, including types of correlation coefficients. It explains the concept of correlation and its use in psychological assessment, and gives examples of positive and negative correlations.
Full Transcript
Slides 4 → Correlation
What’s a Correlation?
Measures and describes the relationship (or association) between two variables.
○ Q: Does change in one variable consistently relate to change in the other variable?
○ For example: Does an increase in mindfulness practice relate to a decrease in stress levels?
Tells us the strength (a numeric value) and direction (+ or -) of the relationship; does NOT tell us about cause and effect. Causal relationships can only be established via a true experiment (i.e., manipulation of one variable and control of other variables).
Terms of Correlation
Correlation coefficient: A numerical index that reflects the relationship between two variables.
Bivariate correlation: A correlation between two variables.
Pearson product-moment correlation: Examines the relationship between two variables when both are continuous in nature.
○ The ‘fanciest’ term for a correlation. We will use ‘correlation coefficient’.
Pearson Correlation Coefficient
The most famous of the correlations (there are many…).
Widely used in psychological assessment, such as evaluating the reliability of psychometric tests.
Measures the direction and strength of a linear relationship between two variables (-1 to 1).
○ Focus is on continuous variables (not discrete).
Very useful for:
○ Understanding relationships between variables.
○ Validity: e.g., correlation between a new measure and an existing measure (concurrent validity).
○ Reliability: test-retest, inter-rater, internal consistency all rely on correlations.
A Perfect Correlation
A perfect correlation occurs when the correlation coefficient is exactly +1 or -1.
○ Perfect positive correlation (+1): As the number of hours spent in cognitive behavioral therapy increases, scores on measures of self-efficacy increase at a constant rate.
○ Perfect negative correlation (-1): As social media usage time increases, face-to-face social interaction time decreases at a constant rate.
How do Correlations work?
Some things to remember… ○ A correlation can range in value from -1.00 to +1.00 ○ The absolute value of the coefficient reflects the strength of the correlation. So, a correlation of -.70 is stronger than a correlation of +.50 ○ To calculate a correlation, you need exactly two variables and at least two people. ○ Another easy mistake is to assign a value judgement to the sign of the correlation. Many students assume that a negative relationship is not good and a positive one is good. ○ The Pearson correlation coefficient is represented by the small letter r with a subscript representing the variables that are being correlated. ○ What Are Correlations All About? Types of Correlation Coefficients: ○ Direct correlation: positive. ○ Indirect correlation: negative. ○ Types of correlations and the relationship between variables. As X goes up, Y goes up (positive, range 0 to +1) As X goes down, Y goes down (positive, range 0 to +1) As Y goes up, X goes down (negative, range -1 to 0) As Y goes down, X goes up (negative, range -1 to 0) What Are Correlations All About? As X goes up, Y goes up (positive, range 0 to +1) ○ Ex. The more time you spend studying, the higher your test score will be. As X goes down, Y goes down (positive, range 0 to +1) ○ Ex. The less money you put in the bank, the less interest you will earn. As Y goes up, X goes down (negative, range -1 to 0) ○ Ex. The more you exercise, the less you will weigh. As Y goes down, X goes up (negative, range -1 to 0) ○ Ex. The less time you take to complete a test, the more you’ll get wrong. The correlation coefficient is a value between -1 and 1. The Pearson (product-moment) correlation coefficient is represented by the small letter r The further from zero in either direction (i.e., closer to either -1 or 1), the stronger the relationship A positive coefficient indicates that the variables tend to change in the same direction (e.g., as one increases, so does the other). 
A negative coefficient indicates an inverse relationship (e.g., as one increases, the other decreases). ○ An easy mistake is to assign a value judgment to the sign of the correlation. To calculate a correlation, you need exactly two variables and at least two people. Zero correlations are possible Outliers and Restricted Range Outliers ○ Data points that are significantly different from other observations. Can strongly influence the correlation coefficient. Example: In a study on the relationship between hours of sleep and cognitive performance, one participant with insomnia might show extremely low sleep hours with unexpectedly high performance. Restricted Range ○ When the range of values for one or both variables is limited. Can reduce the strength of the observed correlation. Example: Studying the correlation between IQ and job performance in a highly selective company where all employees have IQs above 120. Some things to consider… Correlation coefficient reflects amount of variability: ○ The correlation coefficient reflects the amount of variability that is shared between two variables and what they have in common. For example, you can expect an individual’s height to be correlated with an individual’s weight because these two variables share many of the same characteristics, such as the individual’s nutritional and medical history, general health, and genetics. Zero correlation: ○ If one variable does not change in value and therefore has nothing to share, then the correlation between it and another variable is zero. For example, if you computed the correlation between age and number of years of school completed, and everyone was 25 years old, there would be no correlation between the two variables because there is literally no information (no variability) in age available to share. 
Impact of restricting range of one variable: If you constrain or restrict the range of one variable, the correlation between that variable and another variable will be less than if the range is not constrained.

Understanding What the Correlation Coefficient Means
Size of the Correlation Coefficient    General Interpretation
.7 to 1.0                              Very strong relationship
.5 to .7                               Strong relationship
.4                                     Moderate to strong relationship
.3                                     Moderate relationship
.2                                     Weak to moderate relationship
0 to .1                                Weak or no relationship

The Correlation Matrix
Branches of Correlations: The Correlation Matrix
○ A set of correlation coefficients.
○ It’s used to see correlations when there are more than two variables.

             Income    Education    Attitude    Vote
Income       1.00      .574         −.08        −.291
Education    .574      1.00         −.149       −.199
Attitude     −.08      −.149        1.00        −.169
Vote         −.291     −.199        −.169       1.00

Chapter 14: Correlation
Detailed Concepts:
1. Correlation Overview
○ Definition: Correlation is a statistical measure that quantifies the degree to which two variables are related. It is a key statistical tool used in various fields to understand the relationship between variables without intervening or altering their natural occurrence.
○ Application: Used widely in research to determine relationships, such as between socioeconomic status and academic performance, or health behaviors and outcomes.
2. Characteristics of Correlation
○ Direction of the Relationship:
Positive Correlation: Indicates that as one variable increases, the other variable also increases. For example, the more hours students spend studying, the higher their grades tend to be.
Negative Correlation: Indicates that as one variable increases, the other decreases. For instance, an increase in the use of smartphones might correlate with a decrease in academic performance.
○ Form of the Relationship: The correlation is typically linear, suggesting a straight-line relationship between two variables, which can be observed through scatter plots.
Non-linear relationships might involve more complex interactions and might require different statistical techniques like polynomial regression for analysis. ○ Strength or Consistency of the Relationship: The correlation coefficient, denoted as 'r', ranges from -1.00 to +1.00. The closer the coefficient is to +1 or -1, the stronger the linear relationship. Perfect Correlations: An 'r' of +1.00 or -1.00 signifies a perfect linear relationship, where all data points lie exactly on a line. Zero Correlation: An 'r' of 0 implies no linear relationship, with data points scattered without any discernible pattern. Visual Tools: Scatter Plots: Essential for visualizing the relationship between two variables. The positioning of data points can illustrate the direction and strength of a correlation. Lines of best fit or regression lines are often added to scatter plots to depict the average trend. Examples and Practical Implications: Positive Correlation Example: The relationship between physical exercise and heart health, where increases in the amount of exercise correlate with improvements in heart function. Negative Correlation Example: The relationship between the number of hours spent watching TV and physical fitness, where more hours watching TV could correlate with lower levels of physical fitness. Analyzing Scatter Plots: Trends: By examining the distribution of data points in a scatter plot, one can infer whether the correlation is positive, negative, or nonexistent. Envelope Shape: The shape formed by enclosing all data points in a scatter plot can indicate the correlation's strength. A football-shaped envelope often indicates a moderate correlation (around ±0.7), while narrower shapes indicate stronger correlations. Section 14-2: The Pearson Correlation Overview of Pearson Correlation: Definition: The Pearson correlation, also known as the Pearson product-moment correlation, quantifies the degree and direction of a linear relationship between two variables. 
Symbol: The Pearson correlation for a sample is denoted by 'r', and for the entire population by the Greek letter rho (ρ).
Conceptual Basis: The Pearson correlation measures how closely data points fit a straight line, indicating the linear dependency of one variable on another.
Calculation Details:
1. Sum of Products of Deviations (SP):
○ Purpose: SP measures the covariability between two variables, similar to how the sum of squared deviations (SS) measures the variability of a single variable.
○ Definitional Formula: Calculate SP by finding the deviations of X and Y from their means, multiplying these deviations for each pair, and summing these products: $SP = \sum (X - \bar{X})(Y - \bar{Y})$.
○ Computational Formula: Provides an easier method to compute SP using the original scores directly, which is especially helpful when the means are not whole numbers: $SP = \sum XY - \frac{\sum X \sum Y}{n}$.
2. Pearson Correlation Formula:
○ Structure: The Pearson 'r' is calculated as the ratio of the sum of products of deviations (SP) to the square root of the product of the sums of squared deviations of X and Y (SSx and SSy): $r = \frac{SP}{\sqrt{SS_x SS_y}}$.
○ Interpretation: A high absolute value of 'r' (close to 1) indicates a strong linear relationship. The sign of 'r' (+ or -) indicates the direction of the relationship.
The Pattern of Data Points and Correlation:
Data Transformation Effects: Adding a constant to the X or Y scores, subtracting a constant from them, or multiplying them by a positive constant does not change the pattern of data points or the correlation value. However, multiplying by a negative constant changes the sign of the correlation, reflecting a mirror image of the data pattern.
Pearson Correlation and z-Scores:
Link to z-Scores: Correlation can also be expressed in terms of z-scores, which standardize individual scores within their respective distributions. This formulation shows how each individual's position in the X distribution relates to their position in the Y distribution.
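The calculation described above can be sketched in a few lines of Python. This is only an illustration, assuming the standard form r = SP / sqrt(SSx · SSy). The five pairs of scores are invented, and the function names (`sum_of_products`, `pearson_r`, `pearson_r_from_z`) are hypothetical helpers, not part of any library.

```python
import math

def sum_of_products(xs, ys):
    """Definitional formula: SP = sum of (X - mean_X) * (Y - mean_Y)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

def sum_of_products_computational(xs, ys):
    """Computational formula: SP = sum(XY) - (sum X)(sum Y) / n."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

def pearson_r(xs, ys):
    """r = SP / sqrt(SS_x * SS_y); SS is the SP of a variable with itself."""
    ss_x = sum_of_products(xs, xs)
    ss_y = sum_of_products(ys, ys)
    return sum_of_products(xs, ys) / math.sqrt(ss_x * ss_y)

def pearson_r_from_z(xs, ys):
    """Same r as the mean product of z-scores (population SD, divide by n)."""
    n = len(xs)

    def z_scores(values):
        mean = sum(values) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
        return [(v - mean) / sd for v in values]

    return sum(zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))) / n

# Invented scores for five people
x = [0, 10, 4, 8, 8]
y = [2, 6, 2, 4, 6]

r = pearson_r(x, y)                                   # SP = 28, SS_x = 64, SS_y = 16
r_shifted = pearson_r([v + 5 for v in x], y)          # adding a constant to X
r_flipped = pearson_r([-2 * v for v in x], y)         # multiplying X by a negative constant

print(f"SP = {sum_of_products(x, y):.1f}, r = {r:.3f}")
print(f"after adding 5 to X: r = {r_shifted:.3f}")
print(f"after multiplying X by -2: r = {r_flipped:.3f}")
print(f"from z-scores: r = {pearson_r_from_z(x, y):.3f}")
```

With these scores SP = 28, SSx = 64 and SSy = 16, so r = 28 / 32 = 0.875. Shifting X leaves r unchanged, multiplying X by a negative constant flips its sign, and the z-score version gives the same value.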
Z-Score Formula: Transforming X and Y scores into z-scores and then calculating the Pearson correlation provides a normalized view of the relationship, making it independent of the original units of X and Y. Practical Implications: Real-World Application: The Pearson correlation is extensively used in fields like psychology, finance, and biology to determine the strength and direction of relationships between variables. Interpretive Note: It is crucial to remember that correlation does not imply causation; a high correlation between two variables does not mean that one variable causes changes in the other. Section 14-3: Using and Interpreting the Pearson Correlation Key Applications and Interpretations of Correlation: 1. Prediction: ○ Correlation allows for the prediction of one variable based on the known value of another. This is commonly used in settings like college admissions, where SAT scores may predict college success. ○ The process involves regression, which will be elaborated on later in this chapter. 2. Validity: ○ Correlations help establish the validity of new psychological tests by correlating test scores with other established measures of the same construct, like intelligence. 3. Reliability: ○ Correlation assesses the reliability of measurements. A measurement is considered reliable if it produces consistent results under consistent conditions. 4. Theory Verification: ○ Psychological theories often predict relationships between variables, which can be tested through correlations to validate the theory. Critical Considerations in Interpreting Correlations: 1. Correlation Does Not Imply Causation: ○ It is crucial to understand that correlation between two variables does not establish a cause-and-effect relationship. Many factors might influence correlated variables, and further experimental methods are needed to establish causality. 2. 
Influence of Restricted Range:
○ Correlations calculated from a limited range of data may not represent the true relationship as it would appear across the full range of possible values.
3. Impact of Outliers:
○ Outliers can significantly affect the strength and direction of a correlation. An extreme outlier can drastically change the correlation coefficient, misleading the interpretation of data.
4. Understanding the Coefficient of Determination (r²):
○ The square of the correlation coefficient, known as the coefficient of determination, indicates the proportion of variance in one variable that is predictable from the other variable. It gives a more interpretable measure of the strength of the relationship than r alone.
○ Example: If r = 0.80, then $r^2 = 0.64$, indicating that 64% of the variance in one variable is predictable from the other.
Detailed Explanation of Coefficient of Determination:
Role in Prediction: The squared correlation coefficient quantifies how much of the variance in one variable is accounted for by the other variable. It is essential for predicting outcomes and understanding the strength of relationships.
Effect Size Evaluation: The $r^2$ value is also used to measure effect size in research, providing a scale for judging the size of a correlation (small, medium, large).
Practical Examples:
IQ and College GPA: A correlation between IQ scores and GPA may not be perfect but can still provide useful predictive information, illustrating the concept of partial predictability.
Salary Predictions: In cases of perfect correlation (r = 1.00), predictions can be made with 100% accuracy, such as predicting yearly salary from monthly earnings.
Further Discussion Points: The text further emphasizes the importance of considering the full context of the data, including the potential effects of data range and outliers on correlation results.
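A small Python sketch can make the effects of range and outliers concrete. The ten pairs of scores below are invented for illustration only, and `pearson_r` is a hypothetical helper written here, assuming the usual form r = SP / sqrt(SSx · SSy).

```python
import math

def pearson_r(xs, ys):
    """Pearson r computed as SP / sqrt(SS_x * SS_y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

# Invented scores for ten people
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

r_full = pearson_r(x, y)        # full range of X
r_squared = r_full ** 2         # coefficient of determination

# Restricted range: keep only the people whose X score is 7 or higher
kept = [(a, b) for a, b in zip(x, y) if a >= 7]
r_restricted = pearson_r([a for a, _ in kept], [b for _, b in kept])

# Outlier: add one extreme person (X = 30, Y = 1) to the full data set
r_outlier = pearson_r(x + [30], y + [1])

print(f"full range:       r = {r_full:.2f}, r^2 = {r_squared:.2f}")
print(f"restricted range: r = {r_restricted:.2f}")
print(f"with one outlier: r = {r_outlier:.2f}")
```

In this made-up data set the full-range correlation is about 0.94, restricting X to 7 and above lowers it to 0.60, and a single extreme point turns it slightly negative. The numbers depend entirely on the invented scores; the point is only the direction of each effect.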
Additionally, it stresses the need to visualize data through scatter plots to understand the true nature of the relationship before drawing conclusions based solely on the correlation coefficient.
Section 14-4: Hypothesis Tests with the Pearson Correlation
Conceptual Framework:
1. The Role of Hypothesis Testing:
○ Hypothesis tests with the Pearson correlation assess whether the observed relationship between two variables in a sample can be generalized to the population. This process is a cornerstone of inferential statistics, allowing researchers to make data-driven decisions about population parameters.
2. Setting Up Hypotheses:
○ Null Hypothesis (H0): No correlation exists in the population (ρ = 0). This hypothesis posits that any observed correlation in the sample arises purely from sampling error.
○ Alternative Hypothesis (H1): A nonzero correlation exists in the population (ρ ≠ 0). This may also be directional (positive or negative) based on prior research or theoretical expectations.
3. Statistical Testing:
○ The t-test for the Pearson correlation is designed to determine how likely it is that the observed sample correlation could occur if the null hypothesis were true.
Detailed Statistical Methodology:
1. t-Statistic for Correlation:
○ The formula for the t-statistic in correlation tests is derived from the sample correlation (r), the hypothesized population correlation (usually ρ = 0), and the standard error of the correlation. The formula quantifies the deviation of the observed correlation from the hypothesized correlation relative to the variability in the data.
○ Formula: $t = \frac{r \sqrt{n-2}}{\sqrt{1-r^2}}$, where $n$ is the sample size.
2. Degrees of Freedom:
○ The degrees of freedom for the t-test with the Pearson correlation are $n - 2$. Two degrees of freedom are lost because any two data points fit a straight line perfectly; the sample correlation is free to vary only when there are more than two points.
3.
Critical Values and Decision Making: ○ Using statistical tables or software, researchers compare the calculated t-value to critical values based on the chosen significance level (commonly α = 0.05). If the t-value exceeds the critical value, the null hypothesis is rejected, suggesting the correlation in the population is significantly different from zero. Practical Examples and Common Pitfalls: 1. Interpretation of Results: ○ A common error in interpreting correlation is overlooking the size of the sample and the variability of data. Larger samples tend to provide more reliable estimates, reducing the impact of outliers or anomalous data points. 2. Visualizing Data: ○ Scatter plots are crucial for visually assessing the relationship between variables before performing hypothesis tests. These plots can reveal patterns, outliers, or non-linear relationships that might not be evident through correlation coefficients alone. 3. Real-World Application: ○ Example 14.8 illustrates the application of hypothesis testing in determining the significance of a correlation in educational research, where a researcher might investigate the relationship between students' academic achievements and their extracurricular engagement. Enhanced Understanding: Role of Sampling Error: ○ Even in the absence of a true correlation in the population, a sample may show a small but non-zero correlation purely due to random variation among selected data points. Outliers and Their Impact: ○ Outliers can significantly skew the results of a correlation analysis, leading to erroneous conclusions about the population. Analyzing scatter plots can help identify outliers and assess their impact. Section 14-5: Alternatives to the Pearson Correlation Overview: While the Pearson correlation is suitable for linear relationships with interval or ratio data, other types of correlations are useful for non-linear relationships and different scales of data. 
This section covers the Spearman correlation, point-biserial correlation, and phi-coefficient. 1. The Spearman Correlation Purpose: Used when data are on an ordinal scale or when the relationship between variables is monotonic but not necessarily linear. Calculation: ○ Convert the original scores to ranks. ○ Apply the Pearson correlation formula to these ranks. ○ The Spearman correlation measures the strength and direction of a monotonic relationship. Application Scenarios: ○ Useful when dealing with rank-ordered data. ○ Appropriate when expecting a non-linear but consistent relationship between two variables. Example: ○ A study on the relationship between practice amounts and skill improvement, where improvements are large at the beginning but taper off with increased practice. 2. Ranking Tied Scores Procedure: ○ Assign the average of the ranks that would have been assigned if the tied values were ranked separately. ○ This ensures that the presence of ties does not artificially inflate or deflate the correlation statistic. 3. The Point-Biserial Correlation Purpose: Measures the relationship between a continuous variable and a dichotomous variable. Calculation: ○ One variable is continuous (e.g., test scores), and the other is dichotomous (e.g., pass/fail). ○ Apply the Pearson correlation formula after coding the dichotomous variable as 0 or 1. Application Scenarios: ○ Assessing the relationship between a binary categorical variable and a metric variable, such as examining the correlation between gender (male/female) and test scores. 4. The Phi-Coefficient Purpose: Used when both variables are dichotomous. Calculation: ○ Code each dichotomous variable as 0 or 1. ○ Apply the Pearson correlation formula to these coded values. Application Scenarios: ○ Suitable for situations where both variables of interest are binary, such as the presence or absence of two different traits or conditions. 
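The three alternatives above share one idea: transform the data (to ranks or to 0/1 codes) and then apply the ordinary Pearson formula. The sketch below illustrates that idea with invented data. The helper names (`pearson_r`, `average_ranks`, `spearman`) are hypothetical and written for this example only.

```python
import math

def pearson_r(xs, ys):
    """Pearson r computed as SP / sqrt(SS_x * SS_y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

def average_ranks(values):
    """Rank from 1 (smallest); tied scores share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson formula applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Spearman: practice vs. skill, with gains that taper off (invented data)
practice = [1, 2, 3, 4, 5]
skill = [1, 8, 12, 14, 15]
r_raw = pearson_r(practice, skill)    # below 1: the trend is not a straight line
r_ranks = spearman(practice, skill)   # exactly 1: the trend is perfectly monotonic

# Ranking tied scores: the two 20s share ranks 2 and 3
tied_ranks = average_ranks([10, 20, 20, 30])

# Point-biserial: test scores vs. pass/fail coded 0/1 (invented data)
scores = [55, 60, 65, 80, 85, 90]
passed = [0, 0, 0, 1, 1, 1]
r_pb = pearson_r(scores, passed)

# Phi: two dichotomous variables, both coded 0/1 (invented data)
trait_a = [1, 1, 0, 0, 1, 0]
trait_b = [1, 1, 0, 0, 0, 1]
phi = pearson_r(trait_a, trait_b)

print(f"Pearson on raw scores: {r_raw:.3f}; Spearman on ranks: {r_ranks:.3f}")
print(f"ranks with ties: {tied_ranks}")
print(f"point-biserial: {r_pb:.3f}; phi: {phi:.3f}")
```

Which category is coded 0 and which is coded 1 is arbitrary; swapping the codes only flips the sign of the point-biserial or phi value.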
Practical Examples:
Example 14.9 and 14.10:
○ Demonstrate how to convert raw scores to ranks and compute the Spearman correlation.
○ Illustrate how a consistent but non-linear relationship in raw scores translates into a perfect linear relationship when ranks are used.
Important Considerations:
Non-linearity: The Spearman correlation is the appropriate choice when the relationship between variables is consistent (monotonic) but does not fit a linear model.
Tied Scores: Special attention is needed when dealing with tied ranks to ensure accurate computation of the Spearman correlation.
Dichotomous Data: Both the point-biserial correlation and the phi-coefficient provide mechanisms to handle analyses where at least one variable is binary, broadening the scope of correlation analysis beyond continuous data.
Section 14-6: Introduction to Linear Equations and Regression
Overview: This section builds on the concept of the Pearson correlation by introducing linear regression, a method used to describe and predict relationships between two variables using a linear equation.
1. Linear Relationship and Equation
Definition: A linear relationship between two variables X and Y can be represented by the equation Y = a + bX, where:
○ a (Y-intercept) is the value of Y when X is 0.
○ b (slope) indicates how much Y changes for a one-unit increase in X.
Graphical Representation: The line of best fit through the data points on a scatter plot helps visualize the relationship and provides a tool for prediction.
2. Regression Equation
Purpose: To find the best-fitting line that minimizes the discrepancies (errors) between predicted Y values and observed Y values.
Calculation:
○ Slope (b): Calculated as $b = \frac{SP}{SS_x}$, where SP is the sum of products of deviations and $SS_x$ is the sum of squares for X.
○ Y-intercept (a): Calculated as $a = \bar{Y} - b\bar{X}$, where $\bar{X}$ and $\bar{Y}$ are the means of X and Y respectively.
Interpretation: The slope tells us the change in Y for a one-unit increase in X, and the intercept provides the starting value of Y when X is zero.
3. Standard Error of Estimate
Definition: Measures the average distance between the observed values and the values predicted by the regression equation.
Purpose: To quantify the precision of the predictions made by the regression model.
Calculation: $\text{SEE} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}$, where $\hat{Y}$ are the predicted Y values from the regression line.
4. Significance of Regression
Analysis of Regression: Similar to ANOVA, it uses an F-ratio to test if the variance explained by the regression model is significantly greater than the variance due to error.
F-Ratio: $F = \frac{MS_{regression}}{MS_{error}}$, where $MS_{regression}$ and $MS_{error}$ are the mean squares for regression and error, respectively.
Practical Applications:
Prediction: Regression equations are widely used for predicting outcomes. For instance, predicting a student's GPA based on their SAT scores.
Validation: It provides a method to validate theoretical predictions quantitatively.
Important Considerations:
Assumptions: Assumes linearity, homoscedasticity (constant variance of error terms), and independence of errors.
Limitations: Predictions are only reliable within the range of data from which the model was developed. Extrapolating beyond this range can lead to unreliable predictions.
Example Illustrations:
Math SAT Scores and GPA: Demonstrates how a regression line can be used to predict college GPA based on SAT scores.
Gym Membership Cost: Uses a linear equation to model the relationship between the number of months of membership and total cost.
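The regression quantities in this section can be worked through on a tiny data set. The sketch below uses five invented pairs of scores and plain Python. It assumes the usual degrees of freedom for a single predictor (1 for regression, n − 2 for error), which these notes do not state explicitly.

```python
import math

# Invented scores for five people
x = [0, 10, 4, 8, 8]
y = [2, 6, 2, 4, 6]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))   # sum of products
ss_x = sum((a - mean_x) ** 2 for a in x)
ss_y = sum((b - mean_y) ** 2 for b in y)

# Regression equation: Y-hat = a + bX
slope = sp / ss_x                       # b = SP / SS_x
intercept = mean_y - slope * mean_x     # a = mean_Y - b * mean_X
y_hat = [intercept + slope * a for a in x]

# Standard error of estimate: typical distance between Y and Y-hat
ss_error = sum((obs - pred) ** 2 for obs, pred in zip(y, y_hat))
see = math.sqrt(ss_error / (n - 2))

# Analysis of regression: F = MS_regression / MS_error
ss_regression = ss_y - ss_error
ms_regression = ss_regression / 1       # assumed df = 1 (one predictor)
ms_error = ss_error / (n - 2)           # assumed df = n - 2
f_ratio = ms_regression / ms_error

# With one predictor, F equals the square of the t-statistic for r
r = sp / math.sqrt(ss_x * ss_y)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(f"Y-hat = {intercept:.3f} + {slope:.4f} X")
print(f"standard error of estimate = {see:.3f}")
print(f"F = {f_ratio:.2f}, t^2 = {t ** 2:.2f}")
```

For these scores the line is Ŷ = 1.375 + 0.4375X, the standard error of estimate is about 1.118, and F = 9.8, which matches t² from the correlation test. Whether that F is significant still has to be checked against a critical value for df = 1 and n − 2.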
Mastery training Ch 14
Alternatives to the Pearson Correlation
Spearman correlation: relationship between two variables when both are measured on ordinal scales
point-biserial correlation: relationship between two variables, one consisting of regular scores and the second having two values
phi coefficient: relationship between two variables when both variables measured for each individual are dichotomous
correlation: statistical technique used to measure and describe the relationship between two variables
positive correlation: relationship in which two variables tend to change in the same direction
negative correlation: correlation in which two variables tend to go in opposite directions
perfect correlation: relationship of 1.00, indicating an exactly consistent relationship
Pearson correlation: measure of the degree and the direction of the linear relationship between two variables
linear relationship: indicator of how well the data points fit a straight line
sum of products of deviations: measure of the amount of covariability between two variables
outlier: extreme data point
restricted range: set of scores that do not represent the full range of possible values
coefficient of determination: measure of the proportion of variability in one variable determined from the relationship with another variable
correlation matrix: table of results from multiple relationships
monotonic relationship: consistently one-directional relationship between two variables
dichotomous variable: variable with only two values
linear relationship: expressed by the equation Y = bX + a
slope: value which determines how much the Y variable changes when X is increased by one point
Y-intercept: value which determines the value of Y when X = 0
regression: statistical technique for finding the best-fitting straight line for a set of data
least-squared-error solution: best-fitting line with the smallest total squared error
regression equation for Y: linear equation
standard error of estimate: measure of the standard distance between predicted Y values on the regression line and actual Y values
analysis of regression: process of testing the significance of a regression equation

Analysis Of Regression: The process of testing the significance of a regression equation is called analysis of regression. The regression analysis uses an F-ratio to determine whether the variance predicted by the regression equation is significantly greater than would be expected if there were no relationship between X and Y.
Coefficient Of Determination: The value r^2 is called the coefficient of determination because it measures the proportion of variability in one variable that can be determined from the relationship with the other variable.
Correlation: A statistical technique that is used to measure and describe the relationship between two variables.
Correlation Matrix: The results from multiple correlations are most easily reported in a table called a correlation matrix, using footnotes to indicate which correlations are significant.
Dichotomous Variable: A variable with only two values is called a dichotomous variable or a binomial variable.
Least-Squared-Error Solution: The best-fitting line is defined as the one that has the smallest total squared error. For obvious reasons, the resulting line is commonly called the least-squared-error solution.
Linear Equation: With this information, the total cost for 1 year can be computed using a linear equation that describes the relationship between the total cost (Y) and the number of videos and games rented (X).
Linear Relationship: How well the data points fit on a straight line.
Negative Correlation: In a negative correlation, the two variables tend to go in opposite directions. As the X variable increases, the Y variable decreases.
That is, it is an inverse relationship.
Partial Correlation: Measures the relationship between two variables while controlling the influence of a third variable by holding it constant.
Pearson Correlation: The Pearson correlation measures the degree and the direction of the linear relationship between two variables.
Perfect Correlation: A perfect correlation always is identified by a correlation of 1.00 (+1.00 or −1.00) and indicates a perfectly consistent relationship.
Phi-Coefficient: When both variables (X and Y) measured for each individual are dichotomous, the correlation between the two variables is called the phi-coefficient.
Point-Biserial Correlation: Used to measure the relationship between two variables in situations in which one variable consists of regular, numerical scores, but the second variable has only two values.
Positive Correlation: In a positive correlation, the two variables tend to change in the same direction: as the value of the X variable increases from one individual to another, the Y variable also tends to increase; when the X variable decreases, the Y variable also decreases.
Predicted Variability: Example: if r = 0.80, then the predicted variability is r^2 = 0.64 (64%) of the total variability for the Y scores.
Regression: The statistical technique for finding the best-fitting straight line for a set of data is called regression.
Regression Equation For Y: The regression equation for Y is the linear equation.
Regression Line: The resulting straight line is called the regression line.
Restricted Range: The correlation within this restricted range could be completely different from the correlation that would be obtained from a full range of IQ scores.
Slope: In the general linear equation, the value of b is called the slope. The slope determines how much the Y variable changes when X is increased by 1 point.
Spearman Correlation: When the Pearson correlation formula is used with data from an ordinal scale (ranks), the result is called the Spearman correlation.
The Spearman correlation is used to measure the relationship between X and Y when both variables are measured on ordinal scales. The Spearman correlation can be used as a valuable alternative to the Pearson correlation, even when the original raw scores are on an interval or ratio scale.
Standard Error Of Estimate: The standard error of estimate gives a measure of the standard distance between the predicted Y values on the regression line and the actual Y values in the data.
Sum Of Products (SP): The sum of products of deviations, or SP. This new value is similar to SS, which is used to measure variability for a single variable. We use SP to measure the amount of covariability between two variables. The value for SP can be calculated with either a definitional formula or a computational formula.
Unpredicted Variability: The remaining 36% (1 − r^2) is the unpredicted variability.
Y-Intercept: The value of a in the general equation is called the Y-intercept because it determines the value of Y when X = 0.