# Empirical Estimates of Reliability

## Summary

This document focuses on empirical estimates of reliability in behavioral measurement. It outlines methods such as alternate forms, test-retest, and internal consistency, explaining their application and limitations. Key concepts include the assumptions of classical test theory and the importance of consistency in assessing reliability.

## Full Transcript

## 6 EMPIRICAL ESTIMATES OF RELIABILITY

Chapter 5 described the conceptual basis of reliability. As that chapter acknowledged, though, a gap lies between the theory of reliability and the practical examination of reliability in behavioral measurement. Indeed, as we discussed, reliability is a theoretical property of test scores and cannot be computed directly in real testing situations. It is defined in terms of true scores and measurement error, which we can never actually know. Thus, reliability can only be estimated from real data. This chapter will show how, given the assumptions of classical test theory (CTT), we can use observed (empirical) test scores to estimate reliability and measurement error.

As summarized in Figure 6.1, there are at least three general methods for estimating reliability. The methods produce estimates that can be interpreted as described in the previous chapter (e.g., as the proportion of observed score variance that is attributable to true score variance). However, the three methods differ in terms of the kind of data that are available and in the assumptions on which they rest.

**Description of Figure 6.1**

Figure 6.1 is a summary of three methods of estimating reliability:

* **Alternate Forms:** Reliability is estimated based on consistency of scores across two versions of the test.
* **Test-Retest:** Reliability is estimated based on consistency of scores across two times of testing.
* **Internal Consistency:** Reliability is estimated based on consistency of scores across "parts" of the test.

**Figure 6.1** Three General Methods of Estimating Reliability, Emphasizing That Each Method Requires Two or More "Testings"

This chapter outlines these methods, providing examples and interpretations of each. This information is important because it allows test developers and test users to examine the reliability of their tests. This is a "practical" chapter, in that it presents a "how-to" for the real-world examination of reliability. The next chapter extends this discussion by detailing the implications of reliability for behavioral research, applied behavioral practice, and test development. In addition to discussing basic methods of estimating the reliability of test scores, the current chapter also discusses the reliability of "difference scores," which can be used to study phenomena such as cognitive growth, symptom reduction, personality change, person-environment fit, overconfidence, and the accuracy of first impressions. Despite their intuitive appeal, difference scores are notorious for poor psychometric quality, although this reputation might be a bit undeserved (Rogosa, 1995).

There are three general observations worth noting before detailing each method of estimating reliability. The first is that no single method provides completely accurate estimates of reliability under all conditions. As mentioned in the previous chapter and as discussed in this chapter, the accuracy of each method depends on a variety of assumptions about the participants, the testing procedures, and the psychometric properties of the test(s). If these assumptions are not valid, then the reliability estimates will not be totally accurate. Indeed, data sometimes indicate that one or more assumptions are, in fact, not valid.
In such cases, we might need to use a different method for estimating reliability, or we might simply need to acknowledge that our estimate of reliability might not be very accurate.

As implied by Figure 6.1, the second initial observation is that every method requires at least two "testings" in order to generate an estimate of reliability. That is, each respondent must complete at least two responses related to the test. The difference between methods lies, in part, in the nature of those testings. One method requires each respondent to take two highly similar forms of the test being examined. Another method requires each respondent to take the exact same test at two different times. A third general method is applicable for tests that have two or more "parts" (e.g., two or more items), and it requires each respondent to respond to each of those "parts" of the test.

The third initial observation is that consistency is the basis of estimating reliability for every method (again, see Figure 6.1). Because each method requires respondents to take two or more "testings," we can compare their responses across those testings. For example, we might compare respondents' scores on the test taken at Time 1 to their scores on the test taken at Time 2. More specifically, we examine the consistency of those scores: are the respondents' scores on the test at Time 1 consistent with their scores on the test at Time 2? If consistency across testings is high, then this is taken as evidence of reliability. The general logic of evaluating "consistency across testings" lies at the heart of every method of estimating reliability.

## ALTERNATE FORMS METHOD OF ESTIMATING RELIABILITY

The alternate forms method (sometimes called parallel forms reliability) is one method for estimating the reliability of test scores. By obtaining scores from two different forms of a test, test users can compute the correlation between the two forms and may be able to interpret the correlation as an estimate of reliability (e.g., see Table 5.4 from Chapter 5). To the degree that differences in the observed scores from one form are consistent with differences in the observed scores from the other form, the test is reliable.
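Computationally, the alternate forms estimate is nothing more than a Pearson correlation between the two sets of observed scores. As a minimal sketch (the scores and variable names here are our own illustrative assumptions, not data from the text):

```python
import numpy as np

# Hypothetical observed scores for six respondents on two forms of a test.
form1 = np.array([14, 17, 11, 10, 14, 9])
form2 = np.array([13, 16, 12, 11, 15, 8])

# If (and only if) the two forms are truly parallel, this correlation
# can be interpreted as an estimate of reliability, R_xx.
estimate = np.corrcoef(form1, form2)[0, 1]
print(f"Alternate forms reliability estimate: {estimate:.2f}")
```

The computation itself is trivial; whether its result can legitimately be read as $R_{xx}$ depends entirely on the parallel-tests assumptions discussed next.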
The ability to interpret the correlation between alternate forms as an estimate of reliability is appropriate only if the two test forms are parallel (see Chapter 5). Looking back at Table 5.3, you might recall that two tests are considered parallel only if (among other things) they measure identical true scores (i.e., $X_{t1} = X_{t2}$) and have the same amount of error variance (i.e., $s_{e1}^{2} = s_{e2}^{2}$). In addition, you might recall that the correlation between two parallel tests is exactly equal to the reliability of the test scores (i.e., $r_{o1o2} = R_{11} = R_{22}$). If the two alternate forms of a test meet the strict criteria for parallel tests, then this method produces accurate estimates of reliability. If the two forms do not meet these criteria, then the alternate forms method produces inaccurate estimates.

Despite the theoretical logic of parallel tests and the statistical foundations linking parallel tests to reliability, there is a serious practical problem, as summarized in Figure 6.2. Specifically, we can never be entirely confident that alternate forms of a test are truly parallel, because we can never truly know whether two forms of a test meet the very strict assumptions of CTT and of parallel tests.

One key problem is that, in reality, we can never be certain that the alternate forms of a test reflect the same psychological attribute. More specifically, we can never be sure that the true scores as measured by the first form of a test are equal to the true scores as measured by the second form (i.e., that $X_{t1} = X_{t2}$). This problem arises in part because different forms will, by definition, include different content. Because of this differing content, the different forms might not assess the same psychological construct. For example, we might generate two forms of a self-esteem test, and we would like to assume that they are parallel. However, the first form might include several items regarding self-esteem in relation to other people, whereas the second form might include only one such item. In that case, the two forms of the test might actually be assessing slightly different psychological constructs (i.e., a socially derived self-esteem vs. a nonsocial self-esteem). Therefore, the respondents' true scores on the first form are not strictly equal to their true scores on the second form, and the two forms are not truly parallel.

As noted in Figure 6.2, a more subtle problem with alternate forms of tests is the potential for carryover or contamination effects due to repeated testing. The act of taking one form of a test might affect responses on the second form: respondents' memory for test content, their attitudes, or their immediate mood states might affect test performance across both forms of the test. Such effects could, in turn, cause the error scores on one form to be correlated with error scores on the second form. This is a problem because, as discussed in Chapter 5, a fundamental assumption of CTT is that the error affecting any test is random. Recall from that discussion (and from Figure 5.1) that an important implication of the randomness assumption is that error scores on one test are uncorrelated with error scores on a second test (i.e., $r_{e1e2} = 0$). Unfortunately, if two forms of a test are completed simultaneously, then some of the error affecting responses to one form might carry over and also affect responses to the other form. This would violate a fundamental assumption of CTT, and it would mean that the two forms are not truly parallel tests.

**Description of Figure 6.2**

Figure 6.2 is a summary of the three methods of estimating reliability and their limitations:

* **Alternate Forms:** Difficulties of producing parallel versions of the test and carryover effects.
* **Test-Retest:** Possible changes in true scores between testings and carryover effects.
* **Internal Consistency:** Carryover effects and difficulties of having respondents participate twice.

**Figure 6.2** The Three General Methods of Estimating Reliability, With Some of Their Limitations

Table 6.1 presents a hypothetical example illustrating this carryover problem. Imagine that six people respond to two forms of a test. Table 6.1 presents their observed scores on the two forms, along with their true scores and their error scores (again, we pretend to be omniscient when we imagine that we know participants' true scores and error). Notice first that the two forms meet several assumptions of CTT, in general, and of parallel tests, in particular. For example, each observed score is an additive function of true scores and error scores (i.e., $X_o = X_t + X_e$).
In addition, the true scores are completely identical across the two forms (i.e., $X_{t1} = X_{t2}$), the average error score is 0 for each form (i.e., $\bar{X}_{e1} = \bar{X}_{e2} = 0$), the true scores are uncorrelated with error scores (i.e., $r_{t1e2} = r_{e1t2} = .00$), and the error variances are equal for the two forms (i.e., $s_{e1}^{2} = s_{e2}^{2} = 4.67$). As shown in Table 6.1, these qualities ensure that the two forms are equally reliable: for both forms, the ratio of true score variance to observed score variance is $R_{xx} = .38$. Thus, in our omniscient state, we know that the reliability of both sets of scores is in fact .38. If all of the assumptions of CTT and parallel tests hold true for these data, then the correlation between the two forms' observed scores should be exactly equal to .38 (as we saw in the previous chapter's Table 5.4).

Unfortunately, the data in Table 6.1 violate a fundamental assumption about the nature of error scores. Again, error scores are assumed to affect tests as if they are random, which implies that the error scores from the two forms should be uncorrelated with each other. That is, if error is indeed random within each form, then $r_{e1e2} = 0$. However, the two sets of error scores in Table 6.1 are in fact very strongly correlated ($r_{e1e2} = .93$). As mentioned earlier, this correlation could emerge from carryover effects, such as mood state or memory.

**Table 6.1** Example of Carryover Effects on Alternate Forms Estimate of Reliability

| Respondent | Form 1 Observed | Form 1 True | Form 1 Error | Form 2 Observed | Form 2 True | Form 2 Error |
| :--------: | :-------------: | :---------: | :----------: | :-------------: | :---------: | :----------: |
| 1          | 14              | 15          | -1           | 13              | 15          | -2           |
| 2          | 17              | 14          | 3            | 17              | 14          | 3            |
| 3          | 11              | 13          | -2           | 12              | 13          | -1           |
| 4          | 10              | 12          | -2           | 11              | 12          | -1           |
| 5          | 14              | 11          | 3            | 14              | 11          | 3            |
| 6          | 9               | 10          | -1           | 8               | 10          | -2           |
| Mean       | 12.50           | 12.50       | .00          | 12.50           | 12.50       | .00          |
| Variance   | 7.58            | 2.92        | 4.67         | 7.58            | 2.92        | 4.67         |

(Each observed score is the sum of the true score and the error score: $X_o = X_t + X_e$.)

|        | $R_{xx}$ | $r_{ot}$ | $r_{te}$ |
| :----: | :------: | :------: | :------: |
| Form 1 | .38      | .62      | .00      |
| Form 2 | .38      | .62      | .00      |

*Correlations across tests: $r_{t1t2} = 1.00$, $r_{t1e2} = .00$, $r_{e1t2} = .00$, $r_{e1e2} = .93$, $r_{o1o2} = .96$*

The fact that the two sets of error scores are correlated with each other, in turn, influences the correlation between the two sets of observed scores. In fact, the correlation between the observed scores from the two forms is quite strong ($r_{o1o2} = .96$). Thus, the correlation between the alternate forms in this example ($r_{o1o2} = .96$) is grossly inaccurate as an estimate of reliability, which our omniscience reveals to be .38. In this example, anyone who is unaware of the key assumptions underlying parallel tests could dramatically overestimate the reliability of the test, believing it to be $R_{xx} = .96$.
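The quantities in Table 6.1 are easy to verify. The sketch below is our own illustration (it uses population variances, dividing by $N$, which is how the table's values were computed):

```python
import numpy as np

# Data from Table 6.1; we "omnisciently" know the true and error scores.
true = np.array([15, 14, 13, 12, 11, 10])   # identical across both forms
err1 = np.array([-1, 3, -2, -2, 3, -1])     # Form 1 error scores
err2 = np.array([-2, 3, -1, -1, 3, -2])     # Form 2 error scores
obs1, obs2 = true + err1, true + err2       # X_o = X_t + X_e

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# True reliability: true score variance / observed score variance.
print(round(np.var(true) / np.var(obs1), 2))   # 0.38
# The error scores are strongly correlated across forms (carryover) ...
print(round(corr(err1, err2), 2))              # 0.93
# ... so the observed-score correlation grossly overestimates reliability.
print(round(corr(obs1, obs2), 2))              # 0.96
```

Without omniscient access to the true and error scores, nothing in the observed data alone would reveal that .96 is an overestimate, which is precisely what makes the carryover problem so troublesome in practice.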
Although we can never be certain that two test forms are truly parallel, we might have two test forms that seem to fit several criteria for being parallel. As described in Chapter 5, a consequence of two assumptions of parallel tests (i.e., the true scores are the same, and the error variance is the same) is that parallel tests will have identical observed score means and standard deviations (i.e., $\bar{X}_{o1} = \bar{X}_{o2}$ and $s_{o1}^{2} = s_{o2}^{2}$). If we have two test forms that have similar means and standard deviations, and if we feel fairly confident in assuming that they measure the same construct, then we might feel that the forms are "close enough" to meeting the criteria for being parallel. And if we feel that the two forms are close enough to being parallel, then we might feel comfortable using the correlation between the test forms as an estimate of reliability. We would have an alternate forms estimate of reliability.

## TEST-RETEST METHOD OF ESTIMATING RELIABILITY

The test-retest method of estimating reliability avoids some problems with the alternate forms method, and it is potentially quite useful for measures of stable psychological constructs, such as intelligence or extroversion. As mentioned in the previous section, an important concern about the alternate forms method is that alternate forms of a test have different content and therefore might actually measure different constructs. This would violate an important assumption about parallel tests, thereby invalidating the use of the correlation between forms as an estimate of reliability.

As summarized in Figure 6.1, the test-retest method requires the same people to take the same test on more than one occasion (i.e., a first test occasion and a "retest" occasion). If you can safely make several assumptions, then the correlation between the first test scores and the retest scores can be interpreted as an estimate of the test's reliability. To the degree that observed scores from one testing occasion are consistent with observed scores from the second occasion, the test is reliable.

Similar to the alternate forms method, the test-retest method rests on the assumption that the two testings meet the criteria for parallel tests. Again, as summarized in Chapter 5's Table 5.3, one key element of this is that the participants' true scores are stable across the two testing occasions (i.e., $X_{t1} = X_{t2}$). That is, you must be confident that the respondents' true scores do not change from the first time they take the test to the second time. The second assumption is that the error variance of the first testing equals the error variance of the second testing (i.e., $s_{e1}^{2} = s_{e2}^{2}$). Among other implications, these two assumptions essentially mean that the two testing occasions produce scores that are equally reliable. If these assumptions are legitimate, then the correlation between scores from the two test occasions is an accurate estimate of the scores' reliability.

Take a moment to consider the confidence that we might have in these two assumptions in the context of the test-retest method. Beginning with the second assumption (i.e., the equality of error variances), this assumption might be reasonable if care is taken in the testing process. Recall that measurement error (and thus error variance) is strongly affected by temporary elements within the immediate testing situation: noise, distractions, the presence or absence of other people, and so on. Such elements of the testing situation can affect responses in apparently random ways that might mask the differences among respondents' true scores. However, under the right circumstances, you might be able to create two testing situations that are reasonably comparable with each other.
If you carefully set up the testing situations, controlling for the many extraneous variables that might affect test scores, then you might have confidence that the two testing situations are identical. For example, you could have participants complete both tests in the same room, during approximately the same time of day, and in the same interpersonal context (i.e., in a large group, a small group, or alone). By making the two testing occasions as similar as possible, you might have reasonable confidence that responses are affected by error to the same degree.

As noted in Figure 6.2, it might be more difficult, however, to be confident in the first assumption: that the true scores of people taking your test remain stable between the first and second testing occasions. Although the test-retest procedure avoids the problem of differing content that arises with the alternate forms procedure, another problem arises. Specifically, we must assume that participants' true scores have remained completely stable and unchanged across the two occasions; however, respondents might experience psychological change between occasions, thus producing changes in their true scores. In fact, there are at least three factors affecting our confidence in the stability assumption.

First, some psychological attributes are unstable across time. Transient or statelike characteristics are less stable than more traitlike characteristics. For example, imagine that you have a test of mood state: an individual's level of mood at an exact moment in time. Generally, mood state is considered a psychological attribute that can fluctuate from day to day, hour to hour, or even moment to moment. Thus, it would probably not make sense to assume that a person's true score on a mood state test is stable during a test-retest interval of any significant length. Furthermore, changes in mood state are likely to result from various factors that affect mood in different ways for different people. For example, during the test-retest interval, one person might experience physical distress of some kind, which depresses that person's mood. In contrast, another person might receive good news of some kind, which elevates that person's mood. As a result, the individuals' mood states during the first assessment might be quite different from their mood states during the second assessment. That is, their true construct levels are not stable across the two testings. For such statelike constructs, the test-retest method provides a poor estimate of test reliability. Notice that the mood test might be quite reliable in the sense that differences in observed test scores at any single testing occasion accurately reflect differences in true scores at that occasion. However, the test-retest method might provide a very low estimate of reliability because moods have changed across the two testing occasions.

On the other hand, the test-retest procedure may provide reasonable estimates of reliability for measures of traitlike psychological attributes. For example, intelligence is generally conceived as a stable psychological characteristic. There is good theoretical rationale and strong empirical evidence suggesting that intelligence is highly stable from middle childhood through adulthood. Thus, for a measure of intelligence, we might reasonably assume that true scores do not change during a test-retest interval.
If this assumption is correct, then changes in observed scores across two testings will represent measurement error, which will be reflected by the size of the test-retest reliability coefficient.
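This contrast between statelike and traitlike attributes is easy to demonstrate by simulation. The sketch below is our own illustration (the population variances, sample size, and random seed are arbitrary assumptions): it builds a test whose true reliability is $7/(7+3) = .70$ and shows that the test-retest correlation recovers that value only when true scores are stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # simulated respondents

# True reliability = var(true) / (var(true) + var(error)) = 7 / 10 = .70
true_var, err_var = 7.0, 3.0
true_t1 = rng.normal(0, np.sqrt(true_var), n)  # true scores at Time 1

def retest_corr(true_t2):
    """Test-retest correlation, given the true scores at Time 2."""
    obs1 = true_t1 + rng.normal(0, np.sqrt(err_var), n)
    obs2 = true_t2 + rng.normal(0, np.sqrt(err_var), n)
    return np.corrcoef(obs1, obs2)[0, 1]

# Traitlike attribute (e.g., intelligence): true scores are stable, so the
# test-retest correlation recovers the reliability (about .70).
print(round(retest_corr(true_t1), 2))

# Statelike attribute (e.g., mood): true scores shift between occasions.
# The mixing below keeps var(true) constant but makes r_t1t2 about .71, so
# the test-retest correlation drops to roughly .71 * .70, or about .49 --
# even though the test is just as reliable at each single occasion.
shift = rng.normal(0, np.sqrt(true_var), n)
true_t2 = (true_t1 + shift) / np.sqrt(2)
print(round(retest_corr(true_t2), 2))
```

In the second case, the low correlation reflects genuine change in the construct rather than poor measurement, which is exactly why the test-retest method is appropriate only for attributes that can be assumed stable over the retest interval.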