Lesson 08: Test Reliability

Summary

This document provides an overview of test reliability, focusing on concepts, types, and sources of error that impact assessment accuracy. Topics include test-retest, parallel forms, and internal consistency reliability methods. It is a useful resource for psychology students.

Full Transcript

PSYCHOLOGICAL ASSESSMENT BS PSYCHOLOGY | P-201 LESSON 08: TEST RELIABILITY  Scorers and scoring systems are potential sources of error variance What is Reliability?...

PSYCHOLOGICAL ASSESSMENT | BS PSYCHOLOGY | P-201
LESSON 08: TEST RELIABILITY

What is Reliability?

Reliability
• Refers to consistency in measurement.

Reliability coefficient
• An index of reliability; a proportion that indicates the ratio between the true score variance on a test and the total variance.

Concept of Reliability
• A score on a test is presumed to reflect not only the testtaker's true score on the ability being measured but also error.

Variance from true differences is true variance, and variance from irrelevant, random sources is error variance.

A systematic error source does not change the variability of the distribution or affect reliability.
• A systematic source of error would not affect score consistency.

Sources of Error Variance

1. Test construction
Item sampling or content sampling
• Refer to variation among items within a test as well as to variation among items between tests.
• The extent to which a testtaker's score is affected by the content sampled.

2. Test administration
Test environment: the room temperature, the level of lighting, and the amount of ventilation and noise.
Testtaker variables: emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication.
Examiner-related variables: the examiner's physical appearance and demeanor, even the presence or absence of an examiner.

3. Test scoring and interpretation
Scorers and scoring systems are potential sources of error variance.

4. Other sources of error
• Forgetting, failing to notice abusive behavior, and misunderstanding instructions regarding reporting.
• Underreporting or overreporting.

Reliability Estimates

Three approaches to the estimation of reliability:
1. Test-retest
2. Alternate or parallel forms
3. Internal or inter-item consistency

NOTE: The method or methods employed will depend on a number of factors, such as the purpose of obtaining a measure of reliability.

Test-Retest Reliability
• An estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
• Appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait.
• May be most appropriate in gauging the reliability of tests that employ outcome measures such as reaction time or perceptual judgments (brightness, loudness).

The passage of time can be a source of error variance: the longer the interval between administrations, the greater the likelihood that the reliability coefficient will be lower. When the interval between testings is greater than six months, the estimate of test-retest reliability is often referred to as the coefficient of stability.
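Computationally, a test-retest coefficient is simply the Pearson r between the two sets of scores. Below is a minimal sketch in Python; the scores, the function name, and the use of NumPy are illustrative assumptions, not part of the lesson.

```python
# A minimal sketch of a test-retest reliability estimate: the Pearson r
# between two administrations of the same test to the same people.
import numpy as np

def test_retest_reliability(scores_time1, scores_time2):
    """Pearson r between paired scores from two administrations."""
    return np.corrcoef(scores_time1, scores_time2)[0, 1]

# Hypothetical scores for ten testtakers at two points in time.
t1 = np.array([12, 15, 9, 22, 18, 14, 20, 11, 16, 19])
t2 = np.array([13, 14, 10, 21, 17, 15, 19, 12, 18, 20])
print(round(test_retest_reliability(t1, t2), 2))  # a high coefficient (~.97)
```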
Parallel-Forms and Alternate-Forms Reliability
• The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, which is often termed the coefficient of equivalence.

Parallel forms
• Exist when, for each form of the test, the means and the variances of observed test scores are equal.

Alternate forms
• Are simply different versions of a test that have been constructed so as to be parallel.
• Although they do not meet the requirements for the legitimate designation "parallel," alternate forms of a test are typically designed to be equivalent with respect to variables such as content and level of difficulty.

Similarities with the test-retest method:
1. Two test administrations with the same group are required.
2. Test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, or therapy.

Advantage:
1. It minimizes the effect of memory for the content of a previously administered form of the test. Certain traits are presumed to be relatively stable in people over time, and we would expect tests measuring those traits (intelligence tests, for example) to reflect that stability.

Disadvantages:
1. Developing alternate forms of tests can be time-consuming and expensive.
2. Error due to item sampling, that is, the selection of items for inclusion in the test.

Internal Consistency

Split-Half Reliability
• Obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
• A useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice.

Three steps (sketched in code after this section):
1. Divide the test into equivalent halves.
2. Calculate a Pearson r between scores on the two halves of the test.
3. Adjust the half-test reliability using the Spearman-Brown formula.

Ways to split a test:
1. Randomly assign items to one or the other half of the test.
2. Assign odd-numbered items to one half of the test and even-numbered items to the other half; referred to as odd-even reliability.
3. Divide the test by content so that each half contains items equivalent with respect to content and difficulty.

Simply dividing the test in the middle is not recommended because this procedure would likely spuriously raise or lower the reliability coefficient.

Spearman-Brown formula
• Allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test.
• Because the reliability of a test is affected by its length, a formula is necessary for estimating the reliability of a test that has been shortened or lengthened.

The general Spearman-Brown formula is

$r_{SB} = \frac{n \, r_{xy}}{1 + (n - 1) \, r_{xy}}$

Where:
1. r_SB is the reliability adjusted by the Spearman-Brown formula.
2. r_xy is the Pearson r in the original-length test.
3. n is the number of items in the revised version divided by the number of items in the original version.

By determining the reliability of one half of a test, a test developer can use the Spearman-Brown formula to estimate the reliability of the whole test. Because a whole test is two times longer than half a test, n becomes 2 in the Spearman-Brown formula for the adjustment of split-half reliability:

$r_{SB} = \frac{2 \, r_{hh}}{1 + r_{hh}}$

where r_hh stands for the Pearson r of scores on the two half-tests.

The Spearman-Brown formula may also be used to estimate the effect of shortening a test on its reliability, and to determine the number of items needed to attain a desired level of reliability.
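The split-half procedure above can be sketched in a few lines of Python. The example assumes a small hypothetical matrix of 0/1 item responses (rows are testtakers, columns are items) and uses the odd-even split; all names and data are illustrative.

```python
# A minimal sketch of odd-even split-half reliability with the
# Spearman-Brown adjustment (n = 2, since the whole test is twice
# as long as each half).
import numpy as np

def spearman_brown(r, n):
    """General Spearman-Brown adjustment: n*r / (1 + (n - 1)*r)."""
    return n * r / (1 + (n - 1) * r)

def odd_even_reliability(items):
    odd_scores = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
    even_scores = items[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...
    r_hh = np.corrcoef(odd_scores, even_scores)[0, 1]
    return spearman_brown(r_hh, 2)

# Hypothetical right/wrong (1/0) responses of six testtakers to six items.
items = np.array([[1, 1, 1, 1, 1, 0],
                  [1, 1, 1, 0, 0, 0],
                  [1, 0, 1, 1, 0, 0],
                  [0, 0, 1, 0, 0, 0],
                  [1, 1, 1, 1, 1, 1],
                  [1, 0, 0, 0, 0, 0]])
print(round(odd_even_reliability(items), 2))  # ~0.98 for this toy data
```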
[Table: Odd-Even Reliability Coefficients before and after the Spearman-Brown Adjustment]

Increasing Test Reliability

In adding items to increase test reliability to a desired level, the rule is that the new items must be equivalent in content and difficulty so that the longer test still measures what the original test measured.
• If the reliability of the original test is relatively low, then it may be impractical to increase the number of items to reach an acceptable level of reliability.

Another alternative is to abandon the relatively unreliable instrument and locate, or develop, a suitable alternative. The reliability of the instrument could also be raised by creating new items, clarifying the test's instructions, or simplifying the scoring rules.

Other Methods of Estimating Internal Consistency

Inter-item consistency
• Refers to the degree of correlation among all the items on a scale.
• Calculated from a single administration of a single form of a test.
• Useful in assessing the homogeneity of the test.
• Tests are said to be homogeneous if they contain items that measure a single trait.

Homogeneity
• As an adjective used to describe test items, it is the degree to which a test measures a single factor.
• In other words, homogeneity is the extent to which items in a scale are unifactorial.

Heterogeneity
• In contrast to test homogeneity, describes the degree to which a test measures different factors.

[NOTE: The more homogeneous a test is, the more inter-item consistency it can be expected to have. Because a homogeneous test samples a relatively narrow content area, it can be expected to contain more inter-item consistency than a heterogeneous test.]

Test homogeneity is desirable because it allows relatively straightforward test-score interpretation.
• Testtakers with the same score on a homogeneous test probably have similar abilities in the area tested.
• Testtakers with the same score on a more heterogeneous test may have quite different abilities.

The Kuder-Richardson formulas

Kuder-Richardson formula 20, or KR-20, is so named because it was the twentieth formula developed in a series. KR-20 is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong (such as multiple-choice items).

Where test items are highly homogeneous, KR-20 and split-half reliability estimates will be similar. If test items are more heterogeneous, KR-20 will yield lower reliability estimates than the split-half method.

The KR-20 formula is

$r_{KR20} = \frac{k}{k - 1}\left(1 - \frac{\sum pq}{\sigma^2}\right)$

Where:
r_KR20 stands for the Kuder-Richardson formula 20 reliability coefficient
k is the number of test items
σ² is the variance of total test scores
p is the proportion of testtakers who pass the item
q is the proportion of testtakers who fail the item
Σpq is the sum of the pq products over all items
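A minimal sketch of the KR-20 computation, assuming a hypothetical matrix of right/wrong (1/0) responses; using the sample variance (ddof=1) for the total scores is an implementation choice, not something the lesson specifies.

```python
# A minimal sketch of KR-20 for dichotomously scored (0/1) items.
import numpy as np

def kr20(items):
    k = items.shape[1]                         # number of test items
    p = items.mean(axis=0)                     # proportion passing each item
    q = 1 - p                                  # proportion failing each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total test scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

# Hypothetical responses of five testtakers to four multiple-choice items.
items = np.array([[1, 1, 1, 0],
                  [1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [1, 1, 1, 1],
                  [0, 0, 0, 0]])
print(round(kr20(items), 2))  # ~0.91 for this toy data
```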
An approximation of KR-20 can be obtained by the use of the twenty-first formula in the series developed by Kuder and Richardson. The KR-21 formula may be used if there is reason to assume that all the test items have approximately the same degree of difficulty, an assumption that is seldom justified. Formula KR-21 has become outdated in an era of calculators and computers; way back when, KR-21 was sometimes used to estimate KR-20 only because it required many fewer calculations. Numerous modifications of the Kuder-Richardson formulas have been proposed through the years.

The one variant of the KR-20 formula that has received the most acceptance and is in widest use today is a statistic called coefficient alpha, sometimes referred to as coefficient α-20, an expression that incorporates both the Greek letter alpha (α) and the number 20, the latter a reference to KR-20.

Coefficient Alpha (Cronbach's Alpha)
• The preferred statistic for obtaining an estimate of internal consistency reliability.
• Widely used because it requires only one administration of the test.
• Appropriate for use on tests containing non-dichotomous items.
• Typically ranges in value from 0 to 1 and answers questions about how similar sets of data are.

The formula for coefficient alpha is

$r_{\alpha} = \frac{k}{k - 1}\left(1 - \frac{\sum \sigma_i^2}{\sigma^2}\right)$

Where:
r_α is coefficient alpha
k is the number of items
σᵢ² is the variance of one item
Σσᵢ² is the sum of the variances of each item
σ² is the variance of the total test scores

A myth about alpha is that "bigger is always better."
• A value greater than .90 may indicate redundancy in the items.

All indexes of reliability, coefficient alpha among them, provide an index that is a characteristic of a particular group of test scores, not of the test itself. Measures of reliability are estimates, and estimates are subject to error.
• The precise amount of error inherent in a reliability estimate will vary with the sample of testtakers from which the data were drawn.
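Coefficient alpha generalizes KR-20 by replacing Σpq with the sum of the item variances, which is why it also accommodates non-dichotomous (for example, Likert-type) items. A minimal sketch with hypothetical ratings follows; the data and function name are illustrative.

```python
# A minimal sketch of coefficient (Cronbach's) alpha.
import numpy as np

def cronbach_alpha(items):
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-point ratings of four items by five respondents.
ratings = np.array([[4, 5, 4, 4],
                    [3, 3, 2, 3],
                    [5, 5, 5, 4],
                    [2, 2, 1, 2],
                    [4, 3, 4, 4]])
print(round(cronbach_alpha(ratings), 2))  # ~0.95 for this toy data
```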
Measures of Inter-Scorer Reliability
• The degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.
• If the reliability coefficient is high, the prospective test user knows that test scores can be derived in a systematic, consistent way by various scorers with sufficient training.

Coefficient of inter-scorer reliability
• A coefficient of correlation that determines the degree of consistency among scorers in the scoring of a test.

Using and Interpreting a Coefficient of Reliability

"How high should the coefficient of reliability be?" If a test score carries with it life-or-death implications, then we need to hold that test to some high standards. If a test score is routinely used in combination with many other test scores and typically accounts for only a small part of the decision process, that test will not be held to the highest standards of reliability.

As a rule of thumb, reliability may parallel many grading systems:
1. The .90s rates a grade of A (with a value of .95 or higher for the most important types of decisions).
2. The .80s rates a B (with below .85 being a clear B).
3. The .65 to .70s rates a weak and unacceptable grade.

The Purpose of the Reliability Coefficient

If a specific test of employee performance is designed for use at various times over the course of the employment period, it would be reasonable to expect the test to demonstrate reliability across time; it would thus be desirable to have an estimate of the instrument's test-retest reliability. For a test designed for a single administration only, an estimate of internal consistency would be the reliability measure of choice.

If the purpose of determining reliability is to break down the error variance into its parts, as shown in Figure 5-1, then a number of reliability coefficients would have to be calculated.

[Note that the various reliability coefficients do not all reflect the same sources of error variance. Thus, an individual reliability coefficient may provide an index of error from test construction, test administration, or test scoring and interpretation. A coefficient of inter-rater reliability, for example, provides information about error as a result of test scoring. Specifically, it can be used to answer questions about how consistently two scorers score the same test items.]

Sources of Variance in a Hypothetical Test

In this hypothetical situation, 5% of the variance has not been identified by the test user. It could be accounted for by transient error, attributable to variations in the testtaker's feelings, moods, or mental state over time; it may also be due to other factors that are yet to be identified.

The Nature of the Test

Considerations:
1. Whether the test items are homogeneous or heterogeneous in nature.
2. Whether the characteristic, ability, or trait being measured is presumed to be dynamic or static.
3. Whether the range of test scores is or is not restricted.
4. Whether the test is a speed or a power test.
5. Whether the test is or is not criterion-referenced.
6. Some tests present special problems regarding the measurement of their reliability.

Homogeneity versus heterogeneity of test items
• Tests designed to measure one factor (ability or trait) are expected to be homogeneous in items, resulting in a high degree of internal consistency.
• By contrast, in a heterogeneous test, an estimate of internal consistency might be low relative to a more appropriate estimate of test-retest reliability.

Dynamic versus static characteristics
• A dynamic characteristic is a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences.
• Example: the anxiety (dynamic) manifested by a stockbroker throughout a business day versus his intelligence (static).

Restriction or Inflation of Range
• If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower.
• If the variance of either variable in a correlational analysis is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher.
• Also of critical importance is whether the range of variances employed is appropriate to the objective of the correlational analysis.
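Restriction of range is easy to demonstrate by simulation. The sketch below assumes made-up, normally distributed test scores and a correlated criterion, then recomputes the correlation after sampling only the high scorers; all numbers are illustrative.

```python
# A minimal simulated sketch of restriction of range: the correlation
# computed on a range-restricted sample is lower than in the full sample.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(100, 15, 5000)                  # hypothetical test scores
y = 0.6 * (x - 100) + rng.normal(0, 12, 5000)  # a correlated criterion

full_r = np.corrcoef(x, y)[0, 1]
keep = x > 110                                 # sample only the high scorers
restricted_r = np.corrcoef(x[keep], y[keep])[0, 1]
print(round(full_r, 2), round(restricted_r, 2))  # the restricted r is smaller
```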
Speed tests versus power tests

Power test
• When a time limit is long enough to allow testtakers to attempt all items, and if some items are so difficult that no testtaker is able to obtain a perfect score.

Speed test
• Generally contains items of uniform level of difficulty (typically uniformly low) so that, when given generous time limits, all testtakers should be able to complete all the test items correctly.
• The time limit on a speed test is established so that few if any of the testtakers will be able to complete the entire test.
• Score differences on a speed test are therefore based on performance speed.

A reliability estimate of a speed test should be based on performance from two independent testing periods, using one of the following: (1) test-retest reliability, (2) alternate-forms reliability, or (3) split-half reliability from two separately timed half tests.

To understand why the KR-20 or split-half reliability coefficient will be spuriously high, consider the following example. When a group of testtakers completes a speed test, almost all the items completed will be correct. If reliability is examined using an odd-even split, and if the testtakers completed the items in order, then testtakers will get close to the same number of odd as even items correct. A testtaker completing 82 items can be expected to get approximately 41 odd and 41 even items correct. A testtaker completing 61 items may get 31 odd and 30 even items correct. When the numbers of odd and even items correct are correlated across a group of testtakers, the correlation will be close to 1.00. Yet this impressive correlation coefficient actually tells us nothing about response consistency.

Under the same scenario, a Kuder-Richardson reliability coefficient would yield a similar coefficient that would be equally useless. Recall that KR-20 reliability is based on the proportion of testtakers correct (p) and the proportion of testtakers incorrect (q) on each item. In the case of a speed test, it is conceivable that p would equal 1.0 and q would equal 0 for many of the items. Toward the end of the test, when many items would not even be attempted because of the time limit, p might equal 0 and q might equal 1.0. For many if not a majority of the items, then, the product pq would equal or approximate 0. When 0 is substituted in the KR-20 formula for Σpq, the reliability coefficient is 1.0 (a meaningless coefficient in this instance).
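The spuriously high odd-even correlation on a speed test can be reproduced with a small simulation. The sketch below assumes each testtaker answers every attempted item correctly and simply stops when time runs out; the completion counts are made up, echoing the 82- and 61-item examples above.

```python
# A minimal simulated sketch of why an odd-even split is spuriously high
# on a speed test: scores reflect only how many items were attempted.
import numpy as np

n_items = 100
completed = np.array([82, 61, 75, 90, 55, 68, 80, 70, 63, 85])
items = np.array([[1] * n + [0] * (n_items - n) for n in completed])

odd = items[:, 0::2].sum(axis=1)   # odd-numbered items (1, 3, 5, ...)
even = items[:, 1::2].sum(axis=1)  # even-numbered items (2, 4, 6, ...)
r = np.corrcoef(odd, even)[0, 1]
print(round(r, 3))  # ~1.0, yet it says nothing about response consistency
```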
Criterion-referenced tests
• Designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective.
• Unlike norm-referenced tests, criterion-referenced tests tend to contain material that has been mastered in hierarchical fashion.
• Traditional techniques of estimating reliability employ measures that take into account scores on the entire test.
• Such traditional procedures of estimating reliability are usually not appropriate for use with criterion-referenced tests.

To understand why, recall that reliability is defined as the proportion of total variance (σ²) attributable to true variance (σ²_tr). Total variance in a test score distribution equals the sum of the true variance plus the error variance (σ²_e):

$\sigma^2 = \sigma_{tr}^2 + \sigma_e^2$

A measure of reliability, therefore, depends on the variability of the test scores: how different the scores are from one another. In criterion-referenced testing, and particularly in mastery testing, how different the scores are from one another is seldom a focus of interest. The critical issue for the user of a mastery test is whether or not a certain criterion score has been achieved. As individual differences (and the variability) decrease, a traditional measure of reliability would also decrease.

Alternatives to the True Score Model

Generalizability theory
• The 1950s saw the development of an alternative theoretical model, one originally referred to as domain sampling theory and better known today in one of its many modified forms as generalizability theory.
• In domain sampling theory, a test's reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample (Thorndike, 1985). A domain of behavior, or the universe of items that could conceivably measure that behavior, can be thought of as a hypothetical construct: one that shares certain characteristics with (and is measured by) the sample of items that make up the test.
• In theory, the items in the domain are thought to have the same means and variances as those in the test that samples from the domain. Of the three types of estimates of reliability, measures of internal consistency are perhaps the most compatible with domain sampling theory.

GENERALIZABILITY THEORY
• Developed by Lee J. Cronbach (1970) and his colleagues (Cronbach et al., 1972), generalizability theory is based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation.
• Instead of conceiving of all variability in a person's scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation or universe leading to a specific test score. This universe is described in terms of its facets, which include things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration.
• Given the exact same conditions of all the facets in the universe, the exact same test score should be obtained. This test score is the universe score, and it is, as Cronbach noted, analogous to a true score in the true score model.

Cronbach illustrated the idea as follows:

"The person will ordinarily have a different universe score for each universe. Mary's universe score covering tests on May 5 will not agree perfectly with her universe score for the whole month of May.... Some testers call the average over a large number of comparable observations a 'true score'; e.g., 'Mary's true typing rate on 3-minute tests.' Instead, we speak of a 'universe score' to emphasize that what score is desired depends on the universe being considered. For any measure there are many 'true scores,' each corresponding to a different universe. When we use a single observation as if it represented the universe, we are generalizing. We generalize over scorers, over selections typed, perhaps over days. If the observed scores from a procedure agree closely with the universe score, we can say that the observation is 'accurate,' or 'reliable,' or 'generalizable.' And since the observations then also agree with each other, we say that they are 'consistent' and 'have little error variance.' To have so many terms is confusing, but not seriously so. The term most often used in the literature is 'reliability.' The author prefers 'generalizability' because that term immediately implies 'generalization to what?'... There is a different degree of generalizability for each universe. The older methods of analysis do not separate the sources of variation. They deal with a single source of variance, or leave two or more sources entangled." (Cronbach, 1970, pp. 153-154)

Item Response Theory (IRT) (Lord, 1980)
• Item response theory procedures model the probability that a person with X amount of a particular personality trait will exhibit Y amount of that trait on a personality test designed to measure it.
• Also called latent-trait theory, because the psychological or educational construct being measured is so often physically unobservable (stated another way, is latent) and because the construct being measured may be a trait (or an ability).
• IRT refers to a family of theories and methods, with many other names used to distinguish specific approaches.
Examples of two characteristics of items within an IRT framework are the difficulty level of an item and the item's level of discrimination; items may be viewed as varying in terms of these, as well as other, characteristics.

Difficulty
• "Difficulty" in this sense refers to the attribute of not being easily accomplished, solved, or comprehended. In a mathematics test, for example, a test item tapping basic addition ability will have a lower difficulty level than a test item tapping basic algebra skills.
• The characteristic of difficulty as applied to a test item may also refer to physical difficulty, that is, how hard or easy it is for a person to engage in a particular activity.

Discrimination
• Signifies the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever it is that is being measured.
• Consider two ADLQ items: item 4, "My mood is generally good," and item 5, "I am able to walk one block on flat ground." Which of these two items do you think would be more discriminating in terms of the respondent's physical abilities?

A number of different IRT models exist to handle data resulting from the administration of tests with various characteristics and in various formats. For example, there are IRT models designed to handle data resulting from the administration of tests with:
1. Dichotomous test items: test items or questions that can be answered with only one of two alternative responses, such as true-false, yes-no, or correct-incorrect questions.
2. Polytomous test items: test items or questions with three or more alternative responses, where only one is scored correct or scored as being consistent with a targeted trait or other construct.

Important Differences Between Latent-Trait Models and Classical "True Score" Test Theory

In classical TST theory, no assumptions are made about the frequency distribution of test scores. By contrast, such assumptions are inherent in latent-trait models.
• Some IRT models have very specific and stringent assumptions about the underlying distribution. In one group of IRT models developed by Rasch, each item on the test is assumed to have an equivalent relationship with the construct being measured by the test.
• The psychometric advantages of item response theory have made this model appealing, especially to commercial and academic test developers and to large-scale test publishers. It is a model that in recent years has found increasing application in standardized tests, professional licensing examinations, and questionnaires used in the behavioral and social sciences.
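As a concrete illustration of these ideas, below is a minimal sketch of a one-parameter logistic (Rasch-type) item response function, giving the probability of a correct response as a function of a person's latent trait level and an item's difficulty. The specific parameterization and values are illustrative assumptions, not equations from the lesson.

```python
# A minimal sketch of a one-parameter logistic (Rasch-type) item response
# function: P(correct) rises as trait level theta exceeds item difficulty b.
import math

def p_correct(theta, b):
    """Probability that a person at trait level theta passes an item of difficulty b."""
    return 1 / (1 + math.exp(-(theta - b)))

# An easy item (b = -1) versus a hard item (b = +2) for an average person (theta = 0).
print(round(p_correct(0.0, -1.0), 2))  # ~0.73
print(round(p_correct(0.0, 2.0), 2))   # ~0.12
```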
Reliability and Individual Scores

The Standard Error of Measurement
• The standard error of measurement, often abbreviated as SEM, provides a measure of the precision of an observed test score.
• It provides an estimate of the amount of error inherent in an observed score or measurement.
• In general, the relationship between the SEM and the reliability of a test is inverse: the higher the reliability of a test (or individual subtest within a test), the lower the SEM.
• The usefulness of the reliability coefficient does not end with test construction and selection. By employing the reliability coefficient in the formula for the standard error of measurement, the test user has another descriptive statistic relevant to test interpretation, this one useful in estimating the precision of a particular test score.

To be hired at a company TRW as a word processor, a candidate must be able to word-process accurately at the rate of 50 words per minute. The personnel office administers a total of seven brief word-processing tests to Mary over the course of seven business days. In words per minute, Mary's scores on each of the seven tests are as follows: 52, 55, 39, 56, 35, 50, 54. Which is her "true" score? The standard error of measurement is the tool used to estimate or infer the extent to which an observed score deviates from a true score.

Standard Error of Measurement
• The standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests.
• Also known as the standard error of a score and denoted by the symbol σ_meas, the standard error of measurement is an index of the extent to which one individual's scores vary over tests presumed to be parallel.

Assumption: If the individual were to take a large number of equivalent tests, scores on those tests would tend to be normally distributed, with the individual's true score as the mean. Because the standard error of measurement functions like a standard deviation in this context, we can use it to predict what would happen if an individual took additional equivalent tests:

1. Approximately 68% (actually, 68.26%) of the scores would be expected to occur within ±1σ_meas of the true score.
2. Approximately 95% (actually, 95.44%) of the scores would be expected to occur within ±2σ_meas of the true score.
3. Approximately 99% (actually, 99.74%) of the scores would be expected to occur within ±3σ_meas of the true score.

The best estimate available of the individual's true score on the test is the test score already obtained. Thus, if a student achieved a score of 50 on one spelling test and if the test had a standard error of measurement of 4, then, using 50 as the point estimate, we can be:

1. 68% (actually, 68.26%) confident that the true score falls within 50 ± 1σ_meas (between 46 and 54, inclusive);
2. 95% (actually, 95.44%) confident that the true score falls within 50 ± 2σ_meas (between 42 and 58, inclusive);
3. 99% (actually, 99.74%) confident that the true score falls within 50 ± 3σ_meas (between 38 and 62, inclusive).

The standard error of measurement, like the reliability coefficient, is one way of expressing test reliability. It can be computed from the test's standard deviation and its reliability coefficient:

$\sigma_{meas} = \sigma \sqrt{1 - r_{xx}}$

If the standard deviation of a test is held constant, then the smaller the σ_meas, the more reliable the test will be; as r_xx increases, σ_meas decreases. For example, when a reliability coefficient equals .64 and σ equals 15, the standard error of measurement equals 9:

$\sigma_{meas} = 15\sqrt{1 - .64} = 9$

With a reliability coefficient equal to .96 and σ still equal to 15, the standard error of measurement decreases to 3:

$\sigma_{meas} = 15\sqrt{1 - .96} = 3$

In practice, the standard error of measurement is most frequently used in the interpretation of individual test scores.
• If the cutoff score for mental retardation is 70, how should scores that are close to the cutoff value of 70 be treated?
• How high above 70 must a score be for us to conclude confidently that the individual is unlikely to be retarded?

The standard error of measurement provides such an estimate. Further, the standard error of measurement is useful in establishing a confidence interval: a range or band of test scores that is likely to contain the true score.

Consider an application of a confidence interval with one hypothetical measure of adult intelligence. Suppose a 22-year-old testtaker obtained a FSIQ of 75. The test user can be 95% confident that this testtaker's true FSIQ falls in the range of 70 to 80. This is so because the 95% confidence interval is set by taking the observed score of 75, plus or minus 1.96, multiplied by the standard error of measurement. In the test manual we find that the standard error of measurement of the FSIQ for a 22-year-old testtaker is 2.37. With this information in hand, the 95% confidence interval is calculated as follows:

$75 \pm 1.96(2.37) = 75 \pm 4.645$

The calculated interval of 4.645 is rounded to the nearest whole number, 5. We can therefore be 95% confident that this testtaker's true FSIQ on this particular test of intelligence lies somewhere in the range of the observed score of 75 plus or minus 5, or somewhere in the range of 70 to 80.
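The SEM formula and the confidence-interval logic can be checked with a few lines of Python, using the numbers from the examples above (σ = 15 with reliabilities of .64 and .96, and the FSIQ example with an SEM of 2.37); the function names are illustrative.

```python
# A minimal sketch of the standard error of measurement and a 95%
# confidence interval around an observed score.
import math

def sem(sd, reliability):
    """Standard error of measurement: sd * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - reliability)

print(round(sem(15, 0.64), 2))  # 9.0
print(round(sem(15, 0.96), 2))  # 3.0

def ci95(observed, sem_value):
    """95% confidence interval: observed +/- 1.96 * SEM."""
    margin = 1.96 * sem_value
    return observed - margin, observed + margin

low, high = ci95(75, 2.37)      # FSIQ example: 75 +/- 4.645
print(round(low), round(high))  # 70 80
```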
The Standard Error of the Difference Between Two Scores

• Error related to any of the number of possible variables operative in a testing situation can contribute to a change in a score achieved on the same test, or a parallel test, from one administration of the test to the next.
• The amount of error in a specific test score is embodied in the standard error of measurement.
• True differences in the characteristic being measured can also affect test scores.

In the field of psychology, if the probability is more than 5% that the difference occurred by chance, then, for all intents and purposes, it is presumed that there was no difference. A more rigorous standard is the 1% standard: applying it, no statistically significant difference would be deemed to exist unless the observed difference could have occurred by chance alone less than one time in a hundred.

The standard error of the difference between two scores can be the appropriate statistical tool to address three types of questions:
1. How did this individual's performance on test 1 compare with his or her performance on test 2?
2. How did this individual's performance on test 1 compare with someone else's performance on test 1?
3. How did this individual's performance on test 1 compare with someone else's performance on test 2?

As you might have expected, when comparing scores achieved on different tests, it is essential that the scores be converted to the same scale.

The formula for the standard error of the difference between two scores is

$\sigma_{diff} = \sqrt{\sigma_{meas1}^2 + \sigma_{meas2}^2}$

Where:
σ_diff is the standard error of the difference between two scores
σ²_meas1 is the squared standard error of measurement for test 1
σ²_meas2 is the squared standard error of measurement for test 2

If we substitute reliability coefficients for the standard errors of measurement of the separate scores, the formula becomes

$\sigma_{diff} = \sigma \sqrt{2 - r_1 - r_2}$

Where:
r_1 is the reliability coefficient of test 1
r_2 is the reliability coefficient of test 2
σ is the standard deviation

Note that both tests would have the same standard deviation because they must be on the same scale (or be converted to the same scale) before a comparison can be made.

The standard error of the difference between two scores will be larger than the standard error of measurement for either score alone because the former is affected by measurement error in both scores.

The value obtained by calculating the standard error of the difference is used in much the same way as the standard error of the mean. If we wish to be 95% confident that the two scores are different, we would want them to be separated by 2 standard errors of the difference.
A separation of only 1 standard error of the difference would give us 68% confidence that the two true scores are different.

As an illustration of the use of the standard error of the difference between two scores, consider the situation of a corporate personnel manager who is seeking a highly responsible person for the position of vice president of safety. The personnel officer in this hypothetical situation decides to use a new published test we will call the Safety-Mindedness Test (SMT) to screen applicants for the position. After placing an ad in the employment section of the local newspaper, the personnel officer tests 100 applicants for the position using the SMT. The personnel officer narrows the search for the vice president to the two highest scorers on the SMT: Moe, who scored 125, and Larry, who scored 134.

Assuming the measured reliability of this test to be .92 and its standard deviation to be 14, should the personnel officer conclude that Larry performed significantly better than Moe? To answer this question, first calculate the standard error of the difference:

$\sigma_{diff} = 14\sqrt{2 - .92 - .92} = 14\sqrt{.16} = 5.6$

Note that in this application of the formula, the two test reliability coefficients are the same because the two scores being compared are derived from the same test.

What does this standard error of the difference mean? For any standard error of the difference, we can be:
1. 68% confident that two scores differing by 1σ_diff represent true score differences.
2. 95% confident that two scores differing by 2σ_diff represent true score differences.
3. 99.7% confident that two scores differing by 3σ_diff represent true score differences.

Applying this information to the standard error of the difference just computed for the SMT, we see that the personnel officer can be:
1. 68% confident that two scores differing by 5.6 represent true score differences.
2. 95% confident that two scores differing by 11.2 represent true score differences.
3. 99.7% confident that two scores differing by 16.8 represent true score differences.

The difference between Larry's and Moe's scores is only 9 points, not a large enough difference for the personnel officer to conclude with 95% confidence that the two individuals actually have true scores that differ on this test.

Stated another way: If Larry and Moe were to take a parallel form of the SMT, then the personnel officer could not be 95% confident that, at the next testing, Larry would again outperform Moe. The personnel officer in this example would have to resort to other means to decide whether Moe, Larry, or someone else would be the best candidate for the position.
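The SMT arithmetic can be verified with a short sketch; the figures come from the example above, and the function name is illustrative.

```python
# A minimal sketch of the standard error of the difference, using the
# SMT example's figures: reliability .92 for both scores, sd = 14.
import math

def se_difference(sd, r1, r2):
    """Standard error of the difference: sd * sqrt(2 - r1 - r2)."""
    return sd * math.sqrt(2 - r1 - r2)

sd_diff = se_difference(14, 0.92, 0.92)
print(round(sd_diff, 1))             # 5.6
print(round(2 * sd_diff, 1))         # 11.2 points needed for 95% confidence
print(abs(134 - 125) < 2 * sd_diff)  # True: a 9-point gap is not significant
```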
