The Importance of Reliability: Psychological Measurement PDF

Document Details

Uploaded by ZippyHeliotrope9386

Háskóli Íslands

Tags

psychological measurement test reliability measurement error behavioral research

Summary

This document discusses the importance of reliability in psychological measurement, highlighting its significance for research in behavioral sciences and applied practices. The text explores how reliability impacts test scores, point estimates, confidence intervals, and the results of behavioral research. The concepts covered include applied behavioral practice, evaluation of an individual's test score, and point estimates of true scores.

Full Transcript

## 7 THE IMPORTANCE OF RELIABILITY

This book has consistently emphasized that psychological measurement is crucial for research in behavioral science and for the application of behavioral science. As a cornerstone of a test's psychometric quality, reliability is fundamental to understanding and evaluating psychological measurement. The previous two chapters detailed the conceptual basis of reliability and the procedures used to estimate the reliability of test scores. This chapter articulates the important roles that reliability plays in the applied practice of behavioral science, in behavioral research, and in test construction and refinement.

### APPLIED BEHAVIORAL PRACTICE: EVALUATION OF AN INDIVIDUAL'S TEST SCORE

Psychological test scores are used by psychologists and others to make decisions that shape people's lives. For example, as mentioned in the first chapter, intelligence test scores can be used by courts to determine eligibility for the death sentence for convicted murderers. This is an extreme example of how test scores can affect our lives, but it illustrates the importance of having reliable scores. It would be tragic, to say the least, if someone were sentenced to death based on an unreliable intelligence test score.

There are uncounted other, albeit less dramatic, ways in which the reliability of test scores can impact the lives of ordinary people. Children are often removed from standard academic classrooms and assigned to special classes based on intelligence and achievement test scores. Similarly, tests such as the SAT and Graduate Record Examination (GRE) are used to inform decisions about college admissions, and employers often use tests to make hiring and promotion decisions. Classroom instructors may not give reliability much thought when they give class examinations, but scores on those examinations can influence students' futures.
The reliability of a test's scores has crucial implications for the quality of decisions made on the basis of those scores. Recall that we can never know an individual's "true" level on an unobservable psychological construct. For example, we can never know a person's true level of intelligence or capacity for college achievement. Thus, we use psychological test scores to indicate or estimate an individual's true level of some psychological attribute. Because test scores are only estimates of people's actual psychological characteristics and because decisions about persons' lives are often based partly on these scores, we must evaluate the precision of the test score obtained by any particular individual. That is, we would like to gauge the precision or accuracy of an individual's test score as an estimate of her level of the underlying psychological attribute. As we will see, the reliability of test scores can be used to calculate information that reveals the precision of particular test scores. Two important pieces of information help us evaluate an individual's test score. First, a point estimate is a specific value that is interpreted as the "best estimate" of an individual's standing on a particular psychological attribute. As we will discuss, there are two ways of obtaining a point estimate for an individual. The second source of information that helps us evaluate an individual's test score is a confidence interval. A confidence interval reflects a range of values that is often interpreted as a range in which the true score is likely to fall. The logic of a confidence interval is based on the recognition that an observed score is simply an estimate of a true score. Because of measurement error, the observed score may not be exactly equal to the true score. The confidence interval around a particular score reflects the score's accuracy or precision as an estimate of a true score. 
If an individual's score has a narrow confidence interval, then we know that the score is a fairly precise point estimate of the individual's true score. However, if instead an individual's score has a wide confidence interval, then we know that the score is an imprecise or inaccurate point estimate of the individual's true score. We will see that these values - point estimates and confidence intervals - are directly affected by reliability.

### Point Estimates of True Scores

Two kinds of point estimates can be derived from an individual's observed test score, representing the best single estimate of the individual's true score. The most widely used point estimate is based solely on an individual's observed test score. When an individual takes a test at a given point in time, his or her observed score can itself be used as a point estimate. For example, if you give someone a self-esteem test, then his or her score on the test can be seen as a point estimate of his or her true self-esteem score.

The second type of point estimate, sometimes called an adjusted true score estimate, takes measurement error into account. Once again, recall that an individual's observed score on a test is affected by measurement error. Because testing is never perfect, an individual's observed test score may be somewhat inflated or deflated by momentary factors, such as fatigue, distraction, and so on. Therefore, an individual's test score at one time is likely to be artificially high or low compared with the score that the individual would obtain if she or he took the test a second time. Indeed, if an individual took the same test on two different occasions, then she or he would likely obtain two observed scores that are at least slightly different from each other. Both of those observed test scores could be considered point estimates of the individual's true score.
With an understanding of reliability and the nature of measurement error, we can use an individual's observed score from one testing occasion to estimate the score that they would get if we tested the individual repeatedly. This produces an adjusted true score estimate, which reflects an effect called regression to the mean.

Regression to the mean applies to scores that are relatively extreme at a first testing occasion. Specifically, if a respondent has a relatively extreme score at a first testing, then her or his score is likely to be less extreme (i.e., closer to the group mean) at a second testing. That is, if an individual's observed score is far above the mean on the first testing occasion, then they are likely to score somewhat lower (i.e., closer to the mean) on the second testing occasion. Similarly, if an individual's observed score is far below the mean on the first testing occasion, then they are likely to score somewhat higher (i.e., closer to the mean) on the second testing occasion. This prediction is again based on the logic of classical test theory (CTT) and random measurement error. In Chapter 5, we learned that measurement error is random and likely to affect all test scores to some degree - artificially inflating some scores (that end up relatively high) and artificially deflating some scores (that end up relatively low).

This is illustrated in **Figure 7.1**. This figure plots the scores of 20 hypothetical people who took a test twice, with each person's pair of scores connected by a line. Each person's observed score at testing occasion 1 was created according to CTT, by adding a true score and a random error value. Similarly, each person's observed score at testing occasion 2 was created by using the same true score but a different random error value. The solid lines in **Figure 7.1** reflect people who had relatively extreme test scores at occasion 1 (i.e., scores relatively far above or below the mean).
As these lines show, people who had relatively extreme test scores at occasion 1 tended to have less extreme scores at occasion 2. That is, the solid lines tend to converge toward the mean at the second testing occasion. Even more specifically, of the five people with the highest observed scores at occasion 1, four had lower (i.e., closer to the mean) test scores at occasion 2. Similarly, of the six people with the lowest observed scores at occasion 1, five had higher (closer to the mean) scores at occasion 2. This pattern is due to the effects of random measurement error, and it is the basis of the adjusted true score estimate.

**Figure 7.1 Illustrating Regression to the Mean** (Line graph: x-axis "Test Occasion" with values 1 and 2; y-axis "Test Score" ranging from 2 to 12; 20 lines, one per person, most of them converging toward the mean.)

The adjusted true score estimate is intended to reflect the discrepancy in an individual's observed scores that is likely to arise across repeated testing occasions, due to regression to the mean. The size and direction of this discrepancy is a function of three factors: (1) the reliability of the test scores, (2) the extremity of the individual's observed score (i.e., size of the difference between the individual's observed test score and the mean of those scores), and (3) the direction of the difference between the observed score and the mean of those scores. These factors can be used to calculate the adjusted true score estimate through the following equation:

(7.1) $X_{est} = \overline{X} + R_{xx}(X_o - \overline{X})$

where $X_{est}$ is the adjusted true score estimate, $\overline{X}$ is the test's mean observed score, $R_{xx}$ is the reliability of the test scores, and $X_o$ is the individual's observed score. For example, imagine that you have scores from a multiple-choice exam given to a class.
There are 40 questions on the exam, and the mean score is 30. Assume that the exam scores have an estimated reliability of .90 (which would be a very high reliability for most class examinations). If a student had a score of 38 on the exam, then his or her estimated true score would be

$X_{est} = 30 + .90(38 - 30) = 37.2$

Notice that the estimated true score (37.2) is closer to the mean (30) than was the initial observed score (38). Thus, the adjusted true score attempts to account for the likely occurrence of regression to the mean.

Note two important points about the adjusted true score estimate, in relation to the observed score. First, reliability influences the difference between the estimated true score and the observed score. Specifically, as reliability decreases (i.e., worsens), the difference between the adjusted true score estimate and the observed score increases. That is, poorer reliability produces bigger discrepancies between the estimated true score and the observed score. This reflects the fact that regression to the mean is more likely to occur (or is likely to be more substantial) when a test's scores are affected heavily by measurement error. For example, assume that the class test scores have a reliability of only .50 (rather than .90, as illustrated earlier) and we computed the adjusted true score estimate for an individual with an observed score of 38:

$X_{est} = 30 + .50(38 - 30) = 34$

Thus, for an individual with a test score of 38, the predicted effect of regression to the mean is 4 points (38 - 34 = 4) for test scores with poor reliability. Contrast this 4-point discrepancy with the earlier example, in which the discrepancy was less than 1 point (38 - 37.2 = .8) for test scores with strong reliability.

A second important point regarding the adjusted true score estimate is that the observed score's extremity influences the difference between the estimated true score and the observed score.
Specifically, the difference will be larger for relatively extreme observed scores (high or low) than for moderate scores. For example, consider the adjusted true score estimate for an individual with a relatively extreme observed score of 22 (i.e., an observed score that is 8 points below the mean of 30) on a test with scores having a reliability of .90:

$X_{est} = 30 + .90(22 - 30) = 22.8$

Note that the adjusted true score estimate is .8 points closer to the mean than the observed score in this case. Now, consider the adjusted true score estimate for an individual with a more moderate observed score of 27 (i.e., an observed score that is only 3 points below the mean of 30):

$X_{est} = 30 + .90(27 - 30) = 27.3$

For this moderate observed score, the adjusted true score estimate is only .3 points closer to the mean than the observed score. Thus, the adjustment was more substantial for the relatively extreme observed score (i.e., 22 vs. 22.8) than it was for the less extreme observed score (i.e., 27 vs. 27.3).

Although the ideas of adjusted true score estimates and regression to the mean are common when evaluating individual scores on a test (e.g., see Wechsler, 2003a, 2003b), there are reasons to approach these ideas with caution. First, except when we intend to predict a person's score on a subsequent test or when we wish to form a confidence interval (discussed next), there may be little reason to correct observed scores by adjusting them for regression to the mean. Indeed, Nunnally and Bernstein (1994) state that "one rarely estimates true scores [in the "adjusted true score" sense that we discuss] in the applied assessment of static constructs" and that "it is easier to interpret the individual's obtained score" (p. 260). Second, although most psychologists seem to think that regression to the mean is, in the long run, a mathematical certainty, it might not always occur (Rogosa, 1995).
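
The four worked examples of Equation 7.1 can be reproduced with a few lines of code. The following is only a minimal sketch: the function name `adjusted_true_score` and its parameter names are illustrative choices, not part of the text, and the inputs are the exam values used above (mean of 30, reliabilities of .90 and .50).

```python
def adjusted_true_score(observed, mean, reliability):
    """Equation 7.1: X_est = mean + R_xx * (X_o - mean).

    The function and parameter names are hypothetical labels for the
    symbols in the text (X_o, the mean observed score, and R_xx).
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return mean + reliability * (observed - mean)


# (observed score, reliability) pairs from the worked examples; exam mean = 30.
examples = [(38, 0.90), (38, 0.50), (22, 0.90), (27, 0.90)]

for observed, reliability in examples:
    estimate = adjusted_true_score(observed, 30, reliability)
    shift = abs(observed - estimate)
    print(f"observed={observed}, R_xx={reliability:.2f}: "
          f"X_est={estimate:.1f}, moved {shift:.1f} points toward the mean")
```

Running the loop shows both points made above: holding the observed score at 38, lowering reliability from .90 to .50 increases the adjustment from .8 to 4 points, and holding reliability at .90, the more extreme score (22) is adjusted more than the moderate one (27).
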
Nevertheless, as we will see when discussing true score confidence intervals, it is common practice to convert observed scores to adjusted true score estimates.

### Confidence Intervals

In applied testing situations, point estimates of an individual's score are usually reported along with confidence intervals. Roughly speaking, confidence intervals reflect the accuracy or precision of the point estimate as an indicator of an individual's score. For example, we might administer the Wechsler Intelligence Scale for Children (WISC) to a child and find that the child obtains a score of 106. Taking this observed score as an estimate of the child's true score, we might calculate a confidence interval and conclude that we are "95% confident that the individual's true IQ score falls in the range of 100-112" (Wechsler, 2003b, p. 37). The width of a confidence interval (e.g., a 12-point range) reflects the precision of the point estimate. You will probably not be surprised to learn that this precision is closely related to reliability - when test scores have higher reliability, they provide more precise estimates of individuals' true scores.

The link between reliability and the precision of confidence intervals is made through the standard error of measurement $(s_{em})$. As discussed in Chapter 5, the $s_{em}$ represents the average size of the error scores that affect observed scores. The larger the $s_{em}$, the greater the average difference between observed scores and true scores. Thus, the $s_{em}$ can be seen as an index of measurement error, and it is closely linked to reliability. In fact, Equation 5.16 presented the exact link between the standard error of measurement $(s_{em})$, reliability $(R_{xx})$, and the standard deviation of a test's observed scores $(s_o)$:

$s_{em} = s_o\sqrt{1 - R_{xx}}$

After we estimate the standard error of measurement for a set of test scores, we can compute a confidence interval around an individual's estimated true score.
For a 95% confidence interval around that score, we use the following equation:

(7.2) 95% confidence interval = $X_{est} \pm (1.96)(s_{em})$

where $X_{est}$ is the adjusted true score estimate (i.e., a point estimate of the individual's true score) and $s_{em}$ is the standard error of measurement of the test scores. The final component of this equation (1.96) reflects the fact that we are interested in a 95% confidence interval rather than a 90% interval or any other "degree of confidence" (we will address alternate "degrees of confidence" later). Some readers - particularly those who have a background in statistical significance testing - might recognize this value (i.e., 1.96) as being associated with a probability of .95 from the standard normal distribution.

For example, imagine an individual's observed score is 38, based on a test with a mean observed score of $\overline{X}=30$, a standard deviation (of observed scores) of $s_o = 6$, and an estimated reliability of $R_{xx} = .90$. From calculations earlier, we know that her adjusted true score estimate is $X_{est} = 37.2$. Using Equation 5.16, we estimate the standard error of measurement as

$s_{em} = 6\sqrt{1 - .90} = 1.90$

Based on Equation 7.2, for our test, the 95% confidence interval is 33.5 to 40.9:

95% confidence interval = 37.2 ± (1.96)(1.90) = 37.2 ± 3.7 = 33.5 to 40.9

Using the logic expressed by the above quote from Wechsler (2003b), we might interpret this result as indicating that we are 95% confident that the individual's true score falls in the range of 33.5 to 40.9.

As mentioned earlier, the precision of a true score estimate is tied to reliability. Briefly put, highly reliable tests produce narrower confidence intervals than less reliable tests. We just saw that for a test with highly reliable scores $(R_{xx} = .90)$, the $s_{em}$ was 1.90 and the confidence interval had a range of 7.4 points (40.9 - 33.5 = 7.4).
The size of this range reflects the precision of the confidence interval - the smaller or narrower the interval, the more precise the observed score is as an estimate of the true score. Although high reliability produces narrow intervals, low reliability produces wider (i.e., larger) confidence intervals, reflecting a less precise estimate of the true score. For example, imagine that our test had the same observed score standard deviation as our previous example $(s_o = 6)$ but a lower reliability (say $R_{xx} = .50$). In this case, the standard error of measurement would be 4.2:

$s_{em} = 6\sqrt{1 - .50} = 4.24$

Note that this $s_{em}$ is larger than in the previous example, in which reliability was .90 and the $s_{em}$ was only 1.90. As we have seen, the $s_{em}$ has a direct effect on the confidence interval. So in the case of low reliability ($R_{xx} = .50$), the 95% confidence interval around an adjusted true score estimate of 37.2 is a relatively wide range of 28.9 to 45.5:

95% confidence interval = 37.2 ± (1.96)(4.24) = 37.2 ± 8.3 = 28.9 to 45.5

Thus, as shown in **Figure 7.2**, when test scores have poor reliability, confidence intervals are much less precise (i.e., wider) than when test scores have high reliability. Specifically, test scores with $R_{xx} = .50$ produced a relatively wide interval of 16.6 points (45.5 - 28.9 = 16.6), whereas test scores with $R_{xx} = .90$ produced a relatively narrow interval of only 7.4 points. It is a much stronger and more precise statement to say that "we are 95% confident that an individual's true score lies between 33.5 and 40.9" than it is to say that "we are 95% confident that the individual's true score lies anywhere all the way from 28.9 to 45.5."

**Figure 7.2 Confidence Intervals Based on Different Levels of Reliability** (Horizontal bar graph; the x-axis shows the score, ranging from 26 to 48. One bar is labeled "Reliability = .90, 95% CI = 33.5 to 40.9"; a second, much longer bar is labeled "Reliability = .50, 95% CI = 28.9 to 45.5".)

### Debate and Alternatives

The previous section outlined one way to report individual scores and confidence intervals around those scores. However, you might well encounter other ways. Perhaps more commonly, you will see observed scores reported, you will see confidence intervals (if they are reported at all) computed around those observed scores, and you will see the intervals interpreted as above.

There is considerable debate and variation in the ways in which confidence intervals are computed and integrated with true score estimates. Confidence intervals can be computed for various degrees of confidence (e.g., 99%, 90%, or 68% instead of 95%), they can be computed by using either the standard error of measurement or a related value called the standard error of estimate (which is also affected by reliability), and they can be applied to either observed score estimates of true scores or adjusted true score estimates (as described in the previous section). These various alternatives, in turn, have implications for the exact interpretation of the confidence intervals.

According to true score theory, observed scores are distributed normally around true scores. Because an observed score is the best estimate of a true score, the observed score represents the mean of this distribution. In our example, an adjusted true score estimate of 37.2 may lie within a 95% confidence interval that ranges from 33.5 to 40.9, but what does it mean to say that the score is in this confidence interval? Perhaps the most widely offered answer to this question is, as we illustrate, that "there is a 95% chance that the true score falls within the confidence interval."
Here is another way to say the same thing: "The probability is .95 that the confidence interval contains the true score." These statements might be interpreted in two different ways. They might mean that there is a 95% chance that a person's true score will fall in the interval on repeated testing with the same or parallel tests, or it might mean that if you had many people with the same true score take the same test, 95% of their observed scores would fall in the interval.

However, disagreement exists over such interpretations. For example, referring to the typical computation of confidence intervals, Dudek (1979) objects to interpretations such as "There is a 95% chance that the true score falls within the confidence interval" because answers of this type imply that true scores are deviating around an observed score. He suggests such interpretations require use of the adjusted true score estimate, along with a different version of the standard error. This view might well be correct in technical terms, but in most cases, when confidence intervals are computed as illustrated above, they are interpreted (again for better or for worse) in a way that suggests that true scores are falling somewhere in the confidence interval.

Although such variations emerge in some applications of psychological testing, details of these variations are well beyond the scope of our current discussion. Interested readers are encouraged to refer to other sources for more details, including Atkinson (1991), Dudek (1979), and Nunnally and Bernstein (1994, especially pp. 237-240 and 258-260).

### Summary

For our purposes, the most important general message from this section is that reliability affects the confidence, accuracy, or precision with which an individual's true score is estimated. That is, reliability affects the standard error of measurement, which affects the width of a confidence interval around an individual's estimated true score.
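
The chain from reliability to standard error of measurement to interval width can be traced numerically. The sketch below applies Equations 5.16 and 7.2 to the example values used earlier ($X_{est} = 37.2$, $s_o = 6$, reliabilities of .90 and .50). The function names are illustrative, and the code carries unrounded intermediate values, so its results match the text's figures only after rounding to one decimal place.

```python
import math


def standard_error_of_measurement(sd_observed, reliability):
    """Equation 5.16: s_em = s_o * sqrt(1 - R_xx)."""
    return sd_observed * math.sqrt(1.0 - reliability)


def confidence_interval_95(point_estimate, sd_observed, reliability):
    """Equation 7.2: point estimate +/- 1.96 * s_em (95% interval only)."""
    margin = 1.96 * standard_error_of_measurement(sd_observed, reliability)
    return point_estimate - margin, point_estimate + margin


# Same point estimate and observed-score SD, two levels of reliability.
for reliability in (0.90, 0.50):
    sem = standard_error_of_measurement(6, reliability)
    low, high = confidence_interval_95(37.2, 6, reliability)
    print(f"R_xx={reliability:.2f}: s_em={sem:.2f}, "
          f"95% CI = {low:.1f} to {high:.1f} (width {high - low:.1f})")
```

Only the reliability changes between the two passes of the loop, which isolates its effect: the interval roughly doubles in width when reliability drops from .90 to .50.
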
Poor reliability produces scores that are imprecise reflections of individuals' true psychological traits, skills, abilities, attitudes, and so on. In contrast, good reliability produces scores that are more precise reflections of individuals' true psychological attributes. If one uses psychological tests to make decisions about individuals, then those tests should have strong reliability.

Again, the issues associated with estimated true scores and true score intervals might seem abstract and esoteric, but they can have important consequences when test scores inform decisions about the lives of individual people. For example, children are often classified as having mental retardation if they have an intelligence test score below 70. We know, however, that any IQ score will have some degree of unreliability (although the reliability of scores on standard, individually administered intelligence tests is very high). The degree of test score unreliability should influence your interpretation of an observed score - to what extent does an observed score reflect a child's true score? Imagine that a child has a tested IQ score of 69. How confident would you be that the child's true score is below 70? And how likely is it that, on a second testing, the child's tested score might be greater than 70? If this child is given a second intelligence test, then the second IQ score might be greater than 70 because of regression to the mean. At what point do we take these factors into consideration, and how do we do so when making a decision about the child's intellectual status?

People who make these types of decisions need to recognize the problems associated with the interpretation of psychological test scores. Hopefully, this section has helped you recognize the problem and appreciate the fact that reliability has a fundamental role in it.

### BEHAVIORAL RESEARCH

Reliability has important implications for conducting and interpreting research in the behavioral sciences.
The quality and meaningfulness of any research hinges on the quality of the measurement procedures used in that research. This section explains how reliability and measurement error affect the results of behavioral research. Awareness of these effects is crucial for interpreting behavioral research accurately and for conducting behavioral research in a productive way.

### Reliability, True Associations, and Observed Associations

Earlier in this book, we discussed the importance of understanding associations between psychological variables (see Chapter 3). That is, a fundamental goal of research is to discover how important variables are related to each other. For example, researchers might want to know whether SAT scores are associated with academic performance, whether personality similarity is associated with relationship satisfaction, or whether "dosage of medication" is associated with decreases in depressive affect. Knowing the direction and magnitude of the associations between variables is a central part of scientific research.

Behavioral science usually relies on several basic ways of quantifying the association between variables. In terms of psychometrics, the most common way of doing this is through a correlation coefficient (again, see Chapter 3). Thus, the following discussion focuses mainly on the correlation when explaining reliability's effects on behavioral research. However, researchers often use other statistics to reflect the association between variables. For example, experimental psychologists are more likely to use statistics such as Cohen's $d$ or $\eta^2$ than they are to use correlations. We will touch on these statistics briefly as well.
According to CTT, the correlation between observed scores on two measures (i.e., $r_{XY}$) is determined by two factors: (1) the correlation between the true scores of the two psychological constructs being assessed by the measures (i.e., $r_{X_tY_t}$) and (2) the reliabilities of the two measures (i.e., $R_{XX}$ and $R_{YY}$). Specifically,

(7.3) $r_{XY}=r_{X_tY_t}\sqrt{R_{XX}R_{YY}}$

Equation 7.3 is the key element of this section, with important implications for research and applied measurement. Before moving on to those implications, interested readers might wish to understand how Equation 7.3 follows logically from CTT (including the assumption that error scores are random and therefore uncorrelated with true scores and other sets of error scores, but cf. Charles, 2005; Nimon et al., 2012).

Recall from Chapter 3 (Equation 3.6) that the correlation between two variables $(r_{XY})$ is the covariance divided by two standard deviations:

$r_{XY} = \frac {C_{XY}} {S_XS_Y}$

In terms of observed scores, the correlation between scores from two measures is

(7.4) $r_{XY} = \frac {C_{X_OY_O}} {S_{X_O}S_{Y_O}}$

Consider the numerator of this equation for a moment. As explained in Chapter 5 and as according to CTT, observed scores are composite variables (i.e., $X_O = X_t + X_e$ and $Y_O = Y_t + Y_e$). Therefore, the covariance between two sets of observed scores (i.e., observed scores on X and observed scores on Y) can be seen as the covariance between two composite variables.
Following Chapter 3's discussion (e.g., Equation 3.8) of the covariance between composite variables, the covariance between X and Y (i.e., $C_{X_OY_O}$) is

$C_{X_OY_O}= C_{X_tY_t} + C_{X_tY_e} + C_{X_eY_t} + C_{X_eY_e}$

where $C_{X_tY_t}$ is the covariance between true scores on test X and true scores on test Y, $C_{X_tY_e}$ is the covariance between true scores on test X and error scores on test Y, $C_{X_eY_t}$ is the covariance between error scores on test X and true scores on test Y, and $C_{X_eY_e}$ is the covariance between error scores on test X and error scores on test Y.

As has been discussed, CTT assumes that error scores are random. Therefore, error scores are uncorrelated with true scores, and error scores on test X are uncorrelated with error scores on test Y. Consequently, the three covariances that include error scores are equal to 0, and the covariance between observed scores thus reduces to the covariance between true scores ($C_{X_OY_O} = C_{X_tY_t}$). Returning to Equation 7.4, the correlation between two sets of observed scores is then:

(7.5) $r_{XY}= \frac {C_{X_tY_t}} {S_{X_O}S_{Y_O}}$

Consider next the denominator of this equation. Recall from Chapter 5 that variability in a test's observed scores (e.g., $S_{X_O}$ and $S_{Y_O}$) is related to the test's reliability.
Specifically, reliability can be defined as the ratio of true score variance to observed score variance:

$R_{XX} = \frac {S^2_{X_t}} {S^2_{X_O}}$ and $R_{YY} = \frac {S^2_{Y_t}} {S^2_{Y_O}}$

Rearranging these, the observed standard deviations can be seen in terms of reliability and standard deviations of true scores:

(7.6a) $S_{X_O} = \frac {S_{X_t}} {\sqrt {R_{XX}}}$

and

(7.6b) $S_{Y_O} = \frac {S_{Y_t}} {\sqrt {R_{YY}}}$

Entering Equations 7.6a and 7.6b into the denominator of Equation 7.5 and rearranging:

$r_{XY} = \sqrt {R_{XX}}\sqrt {R_{YY}}\,\frac {C_{X_tY_t}} {S_{X_t}S_{Y_t}}$

Once again recall that a correlation is a covariance divided by standard deviations (again, see Chapter 3's Equation 3.6). In this case, we divide the covariance between two sets of true scores (i.e., $C_{X_tY_t}$) by the standard deviations of those true scores (i.e., $S_{X_t}$ and $S_{Y_t}$), producing the correlation between true scores (i.e., $r_{X_tY_t}$). This simplifies the equation to

$r_{XY}=r_{X_tY_t}\sqrt{R_{XX}R_{YY}}$

This brings us back to Equation 7.3. Thus, CTT implies directly that the correlation between two measures (i.e., between observed scores) is determined by the correlation between psychological constructs and by the reliabilities of the measures.

To illustrate this, imagine that we examine the association between self-esteem and academic achievement, with participants completing a self-esteem questionnaire and a measure of academic achievement. Imagine that the true correlation between the constructs is .40 (i.e., $r_{X_tY_t} = .40$). Of course, we would not actually know this true correlation; in fact, the entire point of conducting a study is to uncover or estimate this correlation. In addition, imagine that both measures have poor reliability - say, reliability is .50 for the self-esteem questionnaire and .40 for the academic achievement test.
According to Equation 7.3, the correlation between the two measures will be

$r_{X_OY_O} = r_{X_tY_t}\sqrt{R_{XX}R_{YY}} = .40\sqrt{(.50)(.40)} = .40(.447) = .18$

Note that the correlation between observed scores on the two measures is smaller than the correlation between the two constructs. Specifically, the correlation between the two constructs is .40, but the correlation that we would actually obtain in our study is only .18. This discrepancy can be seen visually in the scatterplots in Figure 7.3, and it is a result of measurement error, as explained next.

**Figure 7.3** Scatterplots Showing the True-Score Association and the Observed-Score Association Between Self-Esteem and Academic Achievement When Reliabilities Are .50 and .40, Respectively

The image contains two very similar scatterplots. The first is labeled "True scores / Correlation = .40"; the second is labeled "Observed scores / Correlation = .18".

### Measurement Error (Low Reliability) Attenuates the Observed Associations Between Measures

As summarized in Figure 7.4, the discrepancy between observed associations and true associations reflects important implications of Equation 7.3 (again, if the assumptions of CTT hold; see Loken & Gelman, 2017; Nimon et al., 2012). This section describes and illustrates these implications.

**Figure 7.4** Implications of the Connection Between Reliability and Observed Correlations

The image is a table listing four implications:

1. Observed correlations will be weaker than true-score correlations.
2. The amount of attenuation is determined by reliability.
3. Reliability/error constrains the maximum association that could be found between the measures.
4. It is possible to estimate the true-score correlation.
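As a numerical check on the self-esteem example, the following minimal sketch computes Equation 7.3 directly and then compares it with a small simulation. The function names, the sample size, the seed, and the choice of normally distributed, standardized true scores are all illustrative assumptions, not part of the text; the simulation simply adds random error to correlated true scores so that the ratio of true to observed variance equals the stated reliability.

```python
import math

import numpy as np


def attenuated_correlation(r_true, rel_x, rel_y):
    """Equation 7.3: observed correlation implied by CTT."""
    return r_true * math.sqrt(rel_x * rel_y)


def simulate_observed_correlation(r_true, rel_x, rel_y, n=200_000, seed=0):
    """Simulate observed scores as true score + random error (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    # Standardized true scores (variance 1) that correlate at r_true.
    cov = [[1.0, r_true], [r_true, 1.0]]
    true_scores = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    # With true variance 1, reliability R = 1 / (1 + error variance),
    # so error variance = (1 - R) / R.
    err_sd_x = math.sqrt((1.0 - rel_x) / rel_x)
    err_sd_y = math.sqrt((1.0 - rel_y) / rel_y)
    # Errors are drawn independently of true scores and of each other.
    x_obs = true_scores[:, 0] + rng.normal(0.0, err_sd_x, n)
    y_obs = true_scores[:, 1] + rng.normal(0.0, err_sd_y, n)
    return float(np.corrcoef(x_obs, y_obs)[0, 1])


# Values from the self-esteem / academic achievement example.
expected = attenuated_correlation(0.40, 0.50, 0.40)
simulated = simulate_observed_correlation(0.40, 0.50, 0.40)
print(f"Equation 7.3: {expected:.3f}")
print(f"Simulated:    {simulated:.3f}")
```

The first value reproduces the .18 (more precisely, about .179) obtained in the worked example. The simulated value should fall close to it, differing only by sampling error, because the errors are generated to satisfy the CTT assumption that they are uncorrelated with true scores and with each other.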
First, in research, observed associations (i.e., between measures) will always be weaker than true associations (i.e., between psychological constructs). This arises from two facts of life in measurement. One is that measurement is never perfect. Although scientists might develop precise measurement tools, measures will always be affected by measurement error to some degree. That is, measures are not perfectly reliable. A second fact of life in measurement is that imperfect measurement weakens, or "attenuates," observed associations. For example, as shown in Equation 7.3, any time that reliabilities are less than perfect, an observed correlation will be weaker than the true correlation [...]. Thus, even the slightest imperfections in measurement will attenuate observed associations. In sum, given that measurement is never perfect and that imperfect measurement attenuates observed associations, CTT implies that observed associations will always be weaker than true associations.

A second important implication of Equation 7.3 is that the degree of attenuation is determined by the reliabilities of the measures [...]

### Reliability, Effect Sizes, and Statistical Significance

The fact that measurement error (i.e., low reliability) attenuates observed associations has several implications for interpreting [...]