HAU Psychology Society Psychological Assessment Midterm Reviewer (PDF)
Summary
This is a midterm reviewer for a psychological assessment course. Lesson 1 covers test reliability: true variance, error variance, sources of error variance, and the major types of reliability coefficients (test-retest, parallel-forms and alternate-forms, internal consistency, and inter-scorer), along with the standard error of measurement. Lesson 2 covers test validity: content, criterion-related, and construct validity, decision theory, and test bias and fairness.
HAU PSYCHOLOGY SOCIETY PSYCHOLOGICAL ASSESSMENT MIDTERM REVIEWER | FIRST SEMESTER

Note: The reviewers created by the HAU Psychology Society ensure consistency and quality during your review process. Be reminded that the content of the reviewers is based ONLY ON THE GIVEN MODULES provided by the subject's instructor. Thank you, and good luck on your exam!

Laus Deo Semper!

Lesson 1: Test Reliability

Reliability
- Refers to consistency in measurement.

Reliability Coefficient
- An index of reliability.
- A proportion that indicates the ratio between the true score variance on a test and the total variance.

Concept of Reliability
- A score on a test is presumed to reflect not only the testtaker's true score on the ability being measured but also error.

True Variance
- The variance from true differences.
- X = T + E (an observed score is the sum of the true score and error).

Error Variance
- The variance from irrelevant, random sources.
- σ² = σ²_true + σ²_error (the total variance of test scores is the sum of true variance and error variance).
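To see the X = T + E decomposition in action, here is a minimal Python sketch with simulated scores (the means, standard deviations, and sample size are invented purely for illustration; they are not from the modules):

```python
import numpy as np

rng = np.random.default_rng(0)

n_people = 1000
true_scores = rng.normal(100, 15, n_people)   # T: stable individual differences
error = rng.normal(0, 5, n_people)            # E: random measurement error
observed = true_scores + error                # X = T + E

total_var = observed.var()   # approximately true_var + error_var (T, E independent)
true_var = true_scores.var()
error_var = error.var()

# Reliability coefficient: the proportion of total variance that is true variance.
reliability = true_var / total_var
print(f"total {total_var:.0f} ~ true {true_var:.0f} + error {error_var:.0f}")
print(f"reliability ~ {reliability:.2f}")   # roughly 15^2 / (15^2 + 5^2) = .90
```

The ratio printed at the end is exactly the reliability coefficient described above: true variance divided by total variance.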
Systematic Error Source
- Does not change the variability of the distribution or affect reliability.
- Would not affect score consistency.

Sources of Error Variance

1. Test Construction
   o Item Sampling or Content Sampling
     ▪ Refers to the variation among items within a test as well as to variation among items between tests.
     ▪ The extent to which a testtaker's score is affected by the content sampled.

2. Test Administration
   o Test Environment: the room temperature, the level of lighting, and the amount of ventilation and noise.
   o Testtaker Variables: emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication.
   o Examiner-Related Variables: the examiner's physical appearance and demeanor (even the presence or absence of an examiner).

3. Test Scoring and Interpretation
   o Scorers and scoring systems are potential sources of error variance.

4. Other Sources of Error
   o Forgetting, failing to notice abusive behavior, and misunderstanding instructions regarding reporting.
   o Underreporting or overreporting.

Reliability Estimates
- There are three approaches to the estimation of reliability.
- The method or methods employed will depend on a number of factors, such as the purpose of obtaining a measure of reliability.

1. Test-Retest Reliability
- An estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
- Appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait.
- May be most appropriate in gauging the reliability of tests that employ outcome measures such as reaction time or perceptual judgments (brightness, loudness).
- Certain traits are presumed to be relatively stable in people over time, and we would expect tests measuring those traits to reflect that stability (e.g., intelligence tests).

Passage of Time
- Can be a source of error variance.
- The longer the time that passes, the greater the likelihood that the reliability coefficient will be lower.

Coefficient of Stability
- The estimate of test-retest reliability when the interval between testings is greater than six months.

2. Parallel-Forms and Alternate-Forms Reliability

Coefficient of Equivalence
- The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability.

Parallel Forms
- Exist when, for each form of the test, the means and the variances of observed test scores are equal.

Alternate Forms
- Simply different versions of a test that have been constructed so as to be parallel.
- Although they do not meet the requirements for the legitimate designation "parallel," they are typically designed to be equivalent with respect to variables such as content and level of difficulty.

Similarity of the Alternate-Forms and Parallel-Forms Methods to the Test-Retest Method:
1. Two test administrations with the same group are required.
2. Test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, or therapy.

Advantage
- Minimizes the effect of memory for the content of a previously administered form of the test.

Disadvantage
- Developing alternate forms of tests can be time-consuming and expensive.
- Error due to item sampling (the selection of items for inclusion in the test).

3. Internal or Inter-Item Consistency

Split-Half Reliability
- Obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
- A useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice.
- Simply dividing the test in the middle is not recommended because this procedure would likely spuriously raise or lower the reliability coefficient.

Three Steps to Follow:
1. Divide the test into equivalent halves.
2. Calculate a Pearson r between scores on the two halves of the test.
3. Adjust the half-test reliability using the Spearman-Brown formula.

Ways to Split a Test:
A. Randomly assign items to one or the other half of the test.
B. Assign odd-numbered items to one half of the test and even-numbered items to the other half (odd-even reliability).
C. Divide the test by content so that each half contains items equivalent with respect to content and difficulty.

Spearman-Brown Formula
- Allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test.
- A formula is necessary for estimating the reliability of a test that has been shortened or lengthened, because the reliability of a test is affected by its length.
- General Spearman-Brown formula:

  r_SB = (n · r_xy) / [1 + (n − 1) · r_xy]

  Where:
  r_SB = the reliability adjusted by the Spearman-Brown formula
  r_xy = the Pearson r in the original-length test
  n = the number of items in the revised version divided by the number of items in the original version

- By determining the reliability of one half of a test, a test developer can use the Spearman-Brown formula to estimate the reliability of a whole test.
- Because a whole test is two times longer than half a test, n becomes 2 in the Spearman-Brown formula for the adjustment of split-half reliability. The symbol r_hh stands for the Pearson r of scores on the two half-tests:

  r_SB = (2 · r_hh) / (1 + r_hh)

- May also be used to estimate the effect of shortening a test on its reliability.
- Also used to determine the number of items needed to attain a desired level of reliability.
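Here is a short Python sketch of the three split-half steps, using the odd-even split and the Spearman-Brown adjustment. The item matrix is simulated, hypothetical data:

```python
import numpy as np

def split_half_reliability(items: np.ndarray) -> float:
    """Odd-even split-half reliability with Spearman-Brown correction.

    `items` is a (people x items) matrix of item scores.
    """
    odd = items[:, 0::2].sum(axis=1)     # scores on odd-numbered items
    even = items[:, 1::2].sum(axis=1)    # scores on even-numbered items
    r_hh = np.corrcoef(odd, even)[0, 1]  # Step 2: Pearson r between halves
    return 2 * r_hh / (1 + r_hh)         # Step 3: adjust to full length (n = 2)

def spearman_brown(r_xy: float, n: float) -> float:
    """General formula: reliability of a test lengthened or shortened by factor n."""
    return n * r_xy / (1 + (n - 1) * r_xy)

rng = np.random.default_rng(1)
ability = rng.normal(size=500)
# 20 dichotomous items, each weakly driven by the same ability (invented data).
items = (ability[:, None] + rng.normal(scale=2.0, size=(500, 20)) > 0).astype(float)

print(round(split_half_reliability(items), 2))  # corrected split-half estimate
print(round(spearman_brown(0.70, 2), 2))        # doubling a test with r = .70
```

The last line shows the general form in use: doubling a test whose reliability is .70 yields an estimated reliability of about .82.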
Increasing Test Reliability
- In adding items to increase test reliability to a desired level, the rule is that the new items must be equivalent in content and difficulty so that the longer test still measures what the original test measured.
- If the reliability of the original test is relatively low, then it may be impractical to increase the number of items to reach an acceptable level of reliability.
- Another alternative is to abandon the relatively unreliable instrument and locate, or develop, a suitable alternative.
- The reliability of the instrument could also be raised by creating new items, clarifying the test's instructions, or simplifying the scoring rules.

4. Inter-Item Consistency
- Refers to the degree of correlation among all the items on a scale.
- Calculated from a single administration of a single form of a test.
- Useful in assessing the homogeneity of the test.
- Tests are said to be homogeneous if they contain items that measure a single trait.

Homogeneity
- As an adjective used to describe test items, it is the degree to which a test measures a single factor.
- The extent to which items in a scale are unifactorial.

Heterogeneity
- Describes the degree to which a test measures different factors.

The more homogeneous a test is, the more inter-item consistency it can be expected to have.
- Because a homogeneous test samples a relatively narrow content area, it can be expected to contain more inter-item consistency than a heterogeneous test.

Test Homogeneity
- Desirable because it allows relatively straightforward test-score interpretation.
- Testtakers with the same score on a homogeneous test probably have similar abilities in the area tested.
- Testtakers with the same score on a more heterogeneous test may have quite different abilities.

OTHER METHODS OF ESTIMATING INTERNAL CONSISTENCY:

The Kuder-Richardson formulas (KR-20)
- The statistic of choice for determining the inter-item consistency of dichotomous items, primarily items that can be scored right or wrong (such as multiple-choice items).
- Kuder-Richardson formula 20, or KR-20, is so named because it was the twentieth formula developed in a series.
- In its standard form:

  KR-20 = [k / (k − 1)] · [1 − (Σpq) / σ²]

  Where:
  k = the number of items
  p = the proportion of testtakers who pass each item
  q = the proportion of testtakers who fail each item (1 − p)
  σ² = the variance of total test scores

- Where test items are highly homogeneous, KR-20 and split-half reliability estimates will be similar.
- If test items are more heterogeneous, KR-20 will yield lower reliability estimates than the split-half method.

KR-21
- An approximation of KR-20 can be obtained by the use of the twenty-first formula in the series developed by Kuder and Richardson.
- The KR-21 formula may be used if there is reason to assume that all the test items have approximately the same degree of difficulty. This assumption is seldom justified.
- KR-21 has become outdated in an era of calculators and computers. Way back when, KR-21 was sometimes used to estimate KR-20 only because it required many fewer calculations.
- Numerous modifications of the Kuder-Richardson formulas have been proposed through the years.
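Assuming a (people x items) matrix of 0/1 item scores, KR-20 (and its KR-21 approximation, in its standard mean-based form) can be computed directly from the formulas above; a sketch with invented data:

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """Kuder-Richardson formula 20 for dichotomous (0/1) items."""
    k = items.shape[1]
    p = items.mean(axis=0)                      # proportion passing each item
    q = 1 - p
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

def kr21(items: np.ndarray) -> float:
    """KR-21 approximation: assumes all items are about equally difficult."""
    k = items.shape[1]
    totals = items.sum(axis=1)
    m, var = totals.mean(), totals.var(ddof=1)
    return (k / (k - 1)) * (1 - m * (k - m) / (k * var))

rng = np.random.default_rng(7)
ability = rng.normal(size=400)
items = (ability[:, None] + rng.normal(scale=1.5, size=(400, 25)) > 0).astype(int)
print(round(kr20(items), 2), round(kr21(items), 2))
```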
Coefficient alpha (Cronbach's alpha)
- The one variant of the KR-20 formula that has received the most acceptance and is in widest use today.
- Appropriate for use on tests containing non-dichotomous items.
- The preferred statistic for obtaining an estimate of internal consistency reliability.
- Widely used because it requires only one administration of the test.
- In its standard form:

  α = [k / (k − 1)] · [1 − (Σσ²_i) / σ²]

  Where:
  k = the number of items
  σ²_i = the variance of each individual item
  σ² = the variance of total test scores

- You may even hear it referred to as coefficient α-20. This expression incorporates both the Greek letter alpha (α) and the number 20, the latter a reference to KR-20.
- Typically ranges in value from 0 to 1.
  o Answers questions about how similar sets of data are.
- A myth about alpha is that "bigger is always better."
  o A value greater than .90 may indicate redundancy in the items.
- All indexes of reliability, coefficient alpha among them, provide an index that is a characteristic of a particular group of test scores, not of the test itself.
- Measures of reliability are estimates, and estimates are subject to error.
  o The precise amount of error inherent in a reliability estimate will vary with the sample of testtakers from which the data were drawn.

Measures of Inter-Scorer Reliability
- The degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.
- If the reliability coefficient is high, the prospective test user knows that test scores can be derived in a systematic, consistent way by various scorers with sufficient training.

Coefficient of Inter-Scorer Reliability
- A coefficient of correlation that determines the degree of consistency among scorers in the scoring of a test.
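Below is a sketch of coefficient alpha computed from its standard formula, together with a simple inter-scorer correlation. Both use hypothetical score arrays invented for illustration:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for a (people x items) score matrix.

    Unlike KR-20, items need not be dichotomous (e.g., 1-5 Likert ratings).
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

def inter_scorer_r(scorer_a: np.ndarray, scorer_b: np.ndarray) -> float:
    """Inter-scorer reliability as the Pearson r between two scorers' ratings."""
    return np.corrcoef(scorer_a, scorer_b)[0, 1]

rng = np.random.default_rng(8)
trait = rng.normal(size=300)
# Ten 1-5 Likert items all driven by the same trait (hypothetical data).
likert = np.clip(np.round(3 + trait[:, None] + rng.normal(size=(300, 10))), 1, 5)
print(round(cronbach_alpha(likert), 2))

# Two raters scoring the same five testtakers (hypothetical ratings).
print(round(inter_scorer_r(np.array([3, 5, 2, 4, 4]),
                           np.array([2, 5, 3, 4, 5])), 2))
```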
USING AND INTERPRETING A COEFFICIENT OF RELIABILITY

How high should the coefficient of reliability be?
- If a test score carries with it life-or-death implications, then we need to hold that test to some high standards.
- If a test score is routinely used in combination with many other test scores and typically accounts for only a small part of the decision process, that test will not be held to the highest standards of reliability.
- As a rule of thumb, it may parallel many grading systems:
  o .90s rates a grade of A (with a value of .95 or higher for the most important types of decisions)
  o .80s rates a B (with below .85 being a clear B)
  o .65 to .70s rates a weak and unacceptable grade

The purpose of the reliability coefficient

If a specific test of employee performance is designed for use at various times over the course of the employment period, it would be reasonable to expect the test to demonstrate reliability across time. It would thus be desirable to have an estimate of the instrument's test-retest reliability. For a test designed for a single administration only, an estimate of internal consistency would be the reliability measure of choice.

If the purpose of determining reliability is to break down the error variance into its parts, as shown in Figure 5-1, then a number of reliability coefficients would have to be calculated. Note that the various reliability coefficients do not all reflect the same sources of error variance. Thus, an individual reliability coefficient may provide an index of error from test construction, test administration, or test scoring and interpretation.

A coefficient of inter-rater reliability, for example, provides information about error as a result of test scoring. Specifically, it can be used to answer questions about how consistently two scorers score the same test items.

[Table: Summary of reliability types]

[Figure: Sources of variance in a hypothetical test]
- In this hypothetical situation:
  o 5% of the variance has not been identified by the test user. It could be accounted for by transient error attributable to variations in the testtaker's feelings, moods, or mental state over time.
  o It may also be due to other factors that are yet to be identified.

The nature of the test

Considerations:
a. whether the test items are homogeneous or heterogeneous in nature
b. whether the characteristic, ability, or trait being measured is presumed to be dynamic or static
c. whether the range of test scores is or is not restricted
d. whether the test is a speed or a power test
e. whether the test is or is not criterion-referenced
f. whether the test presents special problems regarding the measurement of its reliability

Homogeneity versus heterogeneity of test items
- Tests designed to measure one factor (an ability or a trait) are expected to be homogeneous in items, resulting in a high degree of internal consistency.
- By contrast, in a heterogeneous test, an estimate of internal consistency might be low relative to a more appropriate estimate of test-retest reliability.

Dynamic versus static characteristics
- A dynamic characteristic is a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences.
- Example: the anxiety (dynamic) manifested by a stockbroker throughout a business day versus his intelligence (static).

Restriction or inflation of range
- If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower.
- If the variance of either variable in a correlational analysis is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher.
- Also of critical importance is whether the range of variances employed is appropriate to the objective of the correlational analysis.

[Figure: Two scatterplots illustrating unrestricted and restricted range]
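The effect of range restriction on a correlation coefficient can be demonstrated with simulated data. In this hypothetical sketch, keeping only the top quarter of scorers noticeably lowers r:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two correlated variables, e.g., test scores and a criterion (invented data).
x = rng.normal(size=5000)
y = 0.6 * x + 0.8 * rng.normal(size=5000)

r_full = np.corrcoef(x, y)[0, 1]

# Restrict the range: keep only people scoring in the top quarter on x.
keep = x > np.quantile(x, 0.75)
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print(f"unrestricted r ~ {r_full:.2f}, restricted r ~ {r_restricted:.2f}")
# The restricted correlation comes out noticeably lower, as described above.
```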
Speed tests versus power tests

Power test
- When a time limit is long enough to allow testtakers to attempt all items, and some items are so difficult that no testtaker is able to obtain a perfect score.

Speed test
- Generally contains items of a uniform level of difficulty (typically uniformly low) so that, when given generous time limits, all testtakers should be able to complete all the test items correctly.
- The time limit on a speed test is established so that few if any of the testtakers will be able to complete the entire test.
- Score differences on a speed test are therefore based on performance speed.
- A reliability estimate of a speed test should be based on performance from two independent testing periods, using one of the following: (1) test-retest reliability, (2) alternate-forms reliability, or (3) split-half reliability from two separately timed half-tests.

Criterion-referenced tests
- Designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective.
- Unlike norm-referenced tests, criterion-referenced tests tend to contain material that has been mastered in hierarchical fashion.
- Traditional techniques of estimating reliability employ measures that take into account scores on the entire test.
- Such traditional procedures of estimating reliability are usually not appropriate for use with criterion-referenced tests.
- A measure of reliability, therefore, depends on the variability of the test scores: how different the scores are from one another.
- In criterion-referenced testing, and particularly in mastery testing, how different the scores are from one another is seldom a focus of interest.
  o The critical issue for the user of a mastery test is whether or not a certain criterion score has been achieved.
- As individual differences (and the variability) decrease, a traditional measure of reliability would also decrease.

On "universe scores" (Cronbach):

"The person will ordinarily have a different universe score for each universe. Mary's universe score covering tests on May 5 will not agree perfectly with her universe score for the whole month of May.... Some testers call the average over a large number of comparable observations a 'true score'; e.g., 'Mary's true typing rate on 3-minute tests.' Instead, we speak of a 'universe score' to emphasize that what score is desired depends on the universe being considered. For any measure there are many 'true scores,' each corresponding to a different universe.

"When we use a single observation as if it represented the universe, we are generalizing. We generalize over scorers, over selections typed, perhaps over days. If the observed scores from a procedure agree closely with the universe score, we can say that the observation is 'accurate,' or 'reliable,' or 'generalizable.' And since the observations then also agree with each other, we say that they are 'consistent' and 'have little error variance.' To have so many terms is confusing, but not seriously so. The term most often used in the literature is 'reliability.' The author prefers 'generalizability' because that term immediately implies 'generalization to what?'... There is a different degree of generalizability for each universe. The older methods of analysis do not separate the sources of variation. They deal with a single source of variance or leave two or more sources entangled." (Cronbach, 1970, pp. 153-154)

ALTERNATIVES TO THE TRUE SCORE MODEL

The 1950s saw the development of an alternative theoretical model, one originally referred to as domain sampling theory and better known today, in one of its many modified forms, as generalizability theory.

Domain sampling theory
- A test's reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample (Thorndike, 1985).
- A domain of behavior, or the universe of items that could conceivably measure that behavior, can be thought of as a hypothetical construct: one that shares certain characteristics with (and is measured by) the sample of items that make up the test.
- In theory, the items in the domain are thought to have the same means and variances as those in the test that samples from the domain.
- Of the three types of estimates of reliability, measures of internal consistency are perhaps the most compatible with domain sampling theory.

Generalizability theory
- Developed by Lee J. Cronbach (1970) and his colleagues (Cronbach et al., 1972), generalizability theory is based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation.
- Instead of conceiving of all variability in a person's scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation, or universe, leading to a specific test score. This universe is described in terms of its facets, which include things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration.
- According to generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained. This test score is the universe score, and it is, as Cronbach noted, analogous to a true score in the true score model.

Item response theory (IRT) (Lord, 1980)
- Item response theory procedures model the probability that a person with X amount of a particular personality trait will exhibit Y amount of that trait on a personality test designed to measure it.
- Also called latent-trait theory, because the psychological or educational construct being measured is so often physically unobservable (stated another way, is latent) and because the construct being measured may be a trait (or an ability).
- IRT refers to a family of theories and methods, with many other names used to distinguish specific approaches.
- Two examples of item characteristics within an IRT framework are the difficulty level of an item and the item's level of discrimination; items may be viewed as varying in terms of these, as well as other, characteristics.

Difficulty
- Refers to the attribute of not being easily accomplished, solved, or comprehended.
  o In a mathematics test, for example, a test item tapping basic addition ability will have a lower difficulty level than a test item tapping basic algebra skills. The characteristic of difficulty as applied to a test item may also refer to physical difficulty, that is, how hard or easy it is for a person to engage in a particular activity.

Discrimination
- Signifies the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever it is that is being measured.
  o Consider two more ADLQ items: item 4, "My mood is generally good"; and item 5, "I am able to walk one block on flat ground."
  o Which of these two items do you think would be more discriminating in terms of the respondent's physical abilities?
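The module does not give an IRT equation, but one common parameterization that uses exactly these two item characteristics is the two-parameter logistic (2PL) model. The sketch below uses invented parameter values:

```python
import numpy as np

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic (2PL) IRT model.

    theta: the testtaker's latent trait level
    a: item discrimination (how sharply the item separates trait levels)
    b: item difficulty (the trait level at which P(correct) = .50)
    """
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# An easy, highly discriminating item vs. a hard, weakly discriminating one.
for theta in (-1.0, 0.0, 1.0):
    easy = p_correct_2pl(theta, a=2.0, b=-1.0)
    hard = p_correct_2pl(theta, a=0.5, b=1.0)
    print(f"theta={theta:+.1f}: P(easy)={easy:.2f}, P(hard)={hard:.2f}")
```

Notice that the high-discrimination item's probability rises steeply as theta increases, while the low-discrimination item's curve is much flatter: this is the "differentiates among people" idea in numerical form.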
A number of different IRT models exist to handle data resulting from the administration of tests with various characteristics and in various formats. For example, there are IRT models designed to handle data resulting from the administration of tests with:
  o Dichotomous test items: test items or questions that can be answered with only one of two alternative responses, such as true-false, yes-no, or correct-incorrect questions.
  o Polytomous test items: test items or questions with three or more alternative responses, where only one is scored correct or scored as being consistent with a targeted trait or other construct.

IMPORTANT DIFFERENCES BETWEEN LATENT-TRAIT MODELS AND CLASSICAL "TRUE SCORE" TEST THEORY
- In classical true score test theory, no assumptions are made about the frequency distribution of test scores. By contrast, such assumptions are inherent in latent-trait models.
- Some IRT models have very specific and stringent assumptions about the underlying distribution. In one group of IRT models, developed by Rasch, each item on the test is assumed to have an equivalent relationship with the construct being measured by the test.
- The psychometric advantages of item response theory have made this model appealing, especially to commercial and academic test developers and to large-scale test publishers.
- It is a model that in recent years has found increasing application in standardized tests, professional licensing examinations, and questionnaires used in the behavioral and social sciences.

RELIABILITY AND INDIVIDUAL SCORES

Standard Error of Measurement (SEM)
- The standard error of measurement, often abbreviated as SEM, provides a measure of the precision of an observed test score.
- It provides an estimate of the amount of error inherent in an observed score or measurement.
- In general, the relationship between the SEM and the reliability of a test is inverse: the higher the reliability of a test (or of an individual subtest within a test), the lower the SEM.
- In its standard form, the SEM is calculated from the test's standard deviation and its reliability coefficient:

  σ_meas = σ · √(1 − r_xx)

- The usefulness of the reliability coefficient does not end with test construction and selection.
  o By employing the reliability coefficient in the formula for the standard error of measurement, the test user has another descriptive statistic relevant to test interpretation, this one useful in estimating the precision of a particular test score.
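A minimal sketch of the SEM formula, showing the inverse relationship between reliability and the SEM (the standard deviation of 15 is an arbitrary, IQ-like choice):

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: sigma_meas = sd * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - reliability)

# Higher reliability -> lower SEM, with the standard deviation held at 15.
for r in (0.70, 0.85, 0.95):
    print(f"r = {r:.2f}  ->  SEM = {sem(15, r):.2f}")
```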
Standard Error of Measurement
- The standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests.
- Also known as the standard error of a score and denoted by the symbol σ_meas, the standard error of measurement is an index of the extent to which one individual's scores vary over tests presumed to be parallel.

Assumption:
If the individual were to take a large number of equivalent tests, scores on those tests would tend to be normally distributed, with the individual's true score as the mean.

The best estimate available of the individual's true score on the test is the test score already obtained. Because the standard error of measurement functions like a standard deviation in this context, we can use it to predict what would happen if the individual took additional equivalent tests. Thus, if a student achieved a score of 50 on one spelling test and the test had a standard error of measurement of 4, then, using 50 as the point estimate, we can be:
- 68% confident that the true score falls within 50 ± 1 σ_meas (that is, between 46 and 54)
- 95% confident that the true score falls within 50 ± 2 σ_meas (that is, between 42 and 58)
- 99.7% confident that the true score falls within 50 ± 3 σ_meas (that is, between 38 and 62)

Example: To be hired at the company TRW as a word processor, a candidate must be able to word-process accurately at the rate of 50 words per minute. The personnel office administers a total of seven brief word-processing tests to Mary over the course of seven business days. In words per minute, Mary's scores on the seven tests are:

52  55  39  56  35  50  54

Which is her "true" score? The standard error of measurement is the tool used to estimate or infer the extent to which an observed score deviates from a true score. The standard error of measurement, like the reliability coefficient, is one way of expressing test reliability.

In practice, the standard error of measurement is most frequently used in the interpretation of individual test scores.
- If the cutoff score for a diagnosis of mental retardation is 70, how should scores that are close to the cutoff value of 70 be treated?
- How far above 70 must a score be for us to conclude confidently that the individual is unlikely to be retarded?

The standard error of measurement provides such an estimate. Further, the standard error of measurement is useful in establishing a confidence interval: a range or band of test scores that is likely to contain the true score.
Consider an application of a confidence interval with one hypothetical measure of adult intelligence.
- Suppose a 22-year-old testtaker obtained a FSIQ of 75. The test user can be 95% confident that this testtaker's true FSIQ falls in the range of 70 to 80. This is so because the 95% confidence interval is set by taking the observed score of 75, plus or minus 1.96 multiplied by the standard error of measurement. In the test manual we find that the standard error of measurement of the FSIQ for a 22-year-old testtaker is 2.37. With this information in hand, the 95% confidence interval is calculated as follows:

  75 ± 1.96(2.37) = 75 ± 4.645

- The calculated interval of 4.645 is rounded to the nearest whole number, 5. We can therefore be 95% confident that this testtaker's true FSIQ on this particular test of intelligence lies somewhere in the range of the observed score of 75 plus or minus 5, or somewhere in the range of 70 to 80.
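The FSIQ computation above can be reproduced in a few lines; a sketch of the generic observed-score confidence interval, using the values from the example:

```python
def confidence_interval(observed: float, sem: float, z: float = 1.96):
    """Confidence band around an observed score: observed +/- z * SEM."""
    half_width = z * sem
    return observed - half_width, observed + half_width

lo, hi = confidence_interval(75, 2.37)       # the FSIQ example from the text
print(f"75 +/- {1.96 * 2.37:.3f} -> about {round(lo)} to {round(hi)}")
# 1.96 * 2.37 = 4.645, rounded to 5, giving the reported range of 70 to 80.
```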
The Standard Error of the Difference Between Two Scores
- Error related to any of the number of possible variables operative in a testing situation can contribute to a change in a score achieved on the same test, or on a parallel test, from one administration of the test to the next.
- The amount of error in a specific test score is embodied in the standard error of measurement.
- True differences in the characteristic being measured can also affect test scores.

In the field of psychology, if the probability is more than 5% that a difference occurred by chance, then, for all intents and purposes, it is presumed that there was no difference. A more rigorous standard is the 1% standard: applying it, no statistically significant difference would be deemed to exist unless the observed difference could have occurred by chance alone less than one time in a hundred.

The standard error of the difference between two scores can be the appropriate statistical tool to address three types of questions:
1. How did this individual's performance on test 1 compare with his or her performance on test 2?
2. How did this individual's performance on test 1 compare with someone else's performance on test 1?
3. How did this individual's performance on test 1 compare with someone else's performance on test 2?

As you might have expected, when comparing scores achieved on different tests, it is essential that the scores be converted to the same scale.

The formula for the standard error of the difference between two scores is:

  σ_diff = √(σ²_meas1 + σ²_meas2)

  Where:
  σ_diff = the standard error of the difference between two scores
  σ²_meas1 = the squared standard error of measurement for test 1
  σ²_meas2 = the squared standard error of measurement for test 2

If we substitute reliability coefficients for the standard errors of measurement of the separate scores, the formula becomes:

  σ_diff = σ · √(2 − r₁ − r₂)

  Where:
  r₁ = the reliability coefficient of test 1
  r₂ = the reliability coefficient of test 2
  σ = the standard deviation of the scores

Note that both tests would have the same standard deviation, because they must be on the same scale (or be converted to the same scale) before a comparison can be made. The standard error of the difference between two scores will be larger than the standard error of measurement for either score alone, because the former is affected by measurement error in both scores.

The value obtained by calculating the standard error of the difference is used in much the same way as the standard error of the mean. If we wish to be 95% confident that two scores are different, we would want them to be separated by 2 standard errors of the difference. A separation of only 1 standard error of the difference would give us 68% confidence that the two true scores are different.

As an illustration of the use of the standard error of the difference between two scores, consider the situation of a corporate personnel manager who is seeking a highly responsible person for the position of vice president of safety. The personnel officer in this hypothetical situation decides to use a newly published test we will call the Safety-Mindedness Test (SMT) to screen applicants for the position. After placing an ad in the employment section of the local newspaper, the personnel officer tests 100 applicants for the position using the SMT. The personnel officer narrows the search for the vice president to the two highest scorers on the SMT: Moe, who scored 125, and Larry, who scored 134.

Assuming the measured reliability of this test to be .92 and its standard deviation to be 14, should the personnel officer conclude that Larry performed significantly better than Moe? To answer this question, first calculate the standard error of the difference:

  σ_diff = 14 · √(2 − .92 − .92) = 14 · √.16 = 5.6

Note that in this application of the formula, the two test reliability coefficients are the same because the two scores being compared are derived from the same test.

What does this standard error of the difference mean? For any standard error of the difference, we can be:
- 68% confident that two scores differing by 1 σ_diff represent true score differences
- 95% confident that two scores differing by 2 σ_diff represent true score differences
- 99.7% confident that two scores differing by 3 σ_diff represent true score differences

Applying this information to the standard error of the difference just computed for the SMT, we see that the personnel officer can be:
- 68% confident that two scores differing by 5.6 represent true score differences
- 95% confident that two scores differing by 11.2 represent true score differences
- 99.7% confident that two scores differing by 16.8 represent true score differences

The difference between Larry's and Moe's scores is only 9 points, not a large enough difference for the personnel officer to conclude with 95% confidence that the two individuals actually have true scores that differ on this test. Stated another way: if Larry and Moe were to take a parallel form of the SMT, then the personnel officer could not be 95% confident that, at the next testing, Larry would again outperform Moe. The personnel officer in this example would have to resort to other means to decide whether Moe, Larry, or someone else would be the best candidate for the position.
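The SMT computation, reproduced as a short sketch of the reliability-based formula:

```python
import math

def se_difference(sd: float, r1: float, r2: float) -> float:
    """Standard error of the difference between two scores on the same scale:
    sigma_diff = sd * sqrt(2 - r1 - r2)."""
    return sd * math.sqrt(2 - r1 - r2)

# The SMT example from the text: reliability .92, SD 14, same test twice.
sigma_diff = se_difference(14, 0.92, 0.92)
print(f"sigma_diff = {sigma_diff:.1f}")          # 14 * sqrt(.16) = 5.6
print(f"95% threshold ~ {2 * sigma_diff:.1f}")   # 11.2; Larry - Moe = 9 falls short
```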
Lesson 2: Validity

Classical Concept of Validity: The Trinitarian View
1. Content validity
   - Face validity
2. Criterion-related validity
   - Concurrent validity
   - Predictive validity
3. Construct validity
   - The "umbrella validity," since every other variety of validity falls under it
   - Convergent validity
   - Discriminant validity

Validity
- As applied to a test, validity is a judgment or estimate of how well a test measures what it purports to measure in a particular context.
- More specifically, it is a judgment based on evidence about the appropriateness of inferences drawn from test scores.
- Characterizations of the validity of tests and test scores are frequently phrased in terms such as "acceptable" or "weak."

Inference
- A logical result or deduction.

Validation
- The process of gathering and evaluating evidence about validity.
- It is the test developer's responsibility to supply validity evidence in the test manual.

Local Validation Studies
- May yield insights regarding a particular population of testtakers as compared to the norming sample.
- Local validation studies are necessary when the test user plans to:
  o alter in some way the format, instructions, language, or content of the test; or
  o use the test with a population of testtakers that differs in some significant way from the population on which the test was standardized.

TYPES OF VALIDITY

Content validity
- Describes a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample.
- Example: a test of assertiveness must adequately represent the wide range of assertive behaviors.

Educational achievement tests
- Considered content-valid measures when the proportion of material covered by the test approximates the proportion of material covered in the course.

Test blueprint
- A plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, and the organization of the items in the test.
- Also called a Table of Specifications (TOS).
- For an employment test to be content-valid, its content must be a representative sample of the job-related skills required for employment.
- Behavioral observation is one technique frequently used in blueprinting the content areas to be covered in certain types of employment tests.
Culture and the relativity of content validity
- Tests are often thought of as either valid or not valid. A history test, for example, either does or does not accurately measure one's knowledge of historical facts. Yet what counts as an accurate account may vary with culture and perspective.
- Example discussed in class: Martial Law and the Marcos family, where accounts of the same historical events differ.

Criterion-related validity
- A judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest, the measure of interest being the criterion.

Two types of validity evidence fall under criterion-related validity:
1. Concurrent validity: an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently).
2. Predictive validity: an index of the degree to which a test score predicts some criterion measure.

Criterion
- The standard against which a test or a test score is evaluated.
  o Example: if a test purports to measure the trait of athleticism, we might employ "membership in a health club" or any generally accepted measure of physical fitness as a criterion.
- There are no hard-and-fast rules for what constitutes a criterion.
- It can be a test score, a specific behavior or group of behaviors, an amount of time, a rating, a psychiatric diagnosis, a training cost, an index of absenteeism, an index of alcohol intoxication, and so on.

A criterion must be:
- Relevant
- Valid
- Uncontaminated

Example: Consider a study of how accurately a test called the MMPI-2-RF predicted psychiatric diagnosis in the psychiatric population of the Minnesota state hospital system. If someone informs the researchers that the diagnosis for every patient in that system was determined, at least in part, by an MMPI-2-RF test score, should they still proceed with their analysis? No: the criterion would be contaminated by the predictor.

Concurrent validity
- If test scores are obtained at about the same time that the criterion measures are obtained, measures of the relationship between the test scores and the criterion provide evidence of concurrent validity.
- Such measures indicate the extent to which test scores may be used to estimate an individual's present standing on a criterion.
  o Example: the concurrent validity of a particular test (Test A) is explored with respect to another test (Test B). Given that prior research has satisfactorily demonstrated the validity of Test B, the question becomes: "How well does Test A compare with Test B?" Here, Test B is used as the validating criterion. In some studies, Test A is either a brand-new test or a test being used for some new purpose, perhaps with a new population.
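A hypothetical sketch of concurrent validation: a validity coefficient is simply the correlation between scores on the new test and the criterion measure (here, an already-validated Test B), both obtained at about the same time. All scores below are simulated:

```python
import numpy as np

rng = np.random.default_rng(6)

construct = rng.normal(size=200)                      # the underlying trait
test_b = construct + rng.normal(scale=0.6, size=200)  # established, validated test
test_a = construct + rng.normal(scale=0.6, size=200)  # the new test under study

# Concurrent validity evidence: correlate Test A with the criterion (Test B).
validity_coefficient = np.corrcoef(test_a, test_b)[0, 1]
print(f"validity coefficient ~ {validity_coefficient:.2f}")
```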
Predictive validity
- Test scores may be obtained at one time and the criterion measures obtained at a future time, usually after some intervening event has taken place.
- Measures of the relationship between the test scores and a criterion measure obtained at a future time provide an indication of the predictive validity of the test, that is, how accurately scores on the test predict some criterion measure.
  o Example: measures of the relationship between college admissions tests and freshman grade point averages provide evidence of the predictive validity of the admissions tests.
- A test's high predictive validity can be a useful aid to decision makers who must select successful students, productive workers, or good parole risks.
- Whether a test result is valuable in decision making depends on how well the test results improve selection decisions over decisions made without knowledge of test results.
- Judgments of criterion-related validity, whether concurrent or predictive, are based on two types of statistical evidence: the validity coefficient and expectancy data.

Intervening event
- Training, experience, therapy, medication, or the passage of time.

How high should a validity coefficient be for a user or a test developer to infer that the test is valid?
- There are no rules for determining the minimum acceptable size of a validity coefficient. In fact, Cronbach and Gleser (1965) cautioned against the establishment of such rules.
- Validity coefficients need to be large enough to enable the test user to make accurate decisions within the unique context in which the test is being used.
- They should be high enough to result in the identification and differentiation of testtakers with respect to the target attribute(s).

Incremental validity
- The value of including more than one predictor depends on a couple of factors:
  o Each measure used as a predictor should have criterion-related predictive validity.
  o Additional predictors should possess incremental validity, defined here as the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use.

Expectancy data
- Provide information that can be used in evaluating the criterion-related validity of a test. Using a score obtained on some test(s) or measure(s), expectancy tables illustrate the likelihood that the testtaker will score within some interval of scores on a criterion measure, an interval that may be seen as "passing," "acceptable," and so on.

Expectancy table
- Shows the percentage of people within specified test-score intervals who subsequently were placed in various categories of the criterion (for example, placed in the "passed" category or the "failed" category).
- Example: an expectancy table showing the relationship between scores on a subtest of the Differential Aptitude Test (DAT) and course grades in American history for eleventh-grade boys.

Taylor-Russell Tables
- Provide an estimate of the extent to which inclusion of a particular test in the selection system will actually improve selection.
- The tables provide an estimate of the percentage of employees hired by the use of a particular test who will be successful at their jobs, given different combinations of three variables:
  o the test's validity (the computed validity coefficient)
  o the selection ratio used
  o the base rate

Selection ratio
- A numerical value that reflects the relationship between the number of people to be hired and the number of people available to be hired.
  o For instance, if there are 50 positions and 100 applicants, then the selection ratio is 50/100, or .50.
- As used here, the base rate refers to the percentage of people hired under the existing system for a particular position. If, for example, a firm employs 25 computer programmers and 20 are considered successful, the base rate would be .80. With knowledge of the validity coefficient of a particular test along with the selection ratio, reference to the Taylor-Russell tables provides the personnel officer with an estimate of how much using the test would improve selection over existing methods.

DECISION THEORY AND TEST UTILITY

Base rate
- The extent to which a particular trait, behavior, characteristic, or attribute exists in the population (expressed as a proportion).
- Due consideration must be given to the base rate of a targeted attribute in the sample of people being studied in predictive validity research as compared to the base rate of that same attribute in the population at large.

Hit rate
- May be defined as the proportion of people a test accurately identifies as possessing or exhibiting a particular trait, behavior, characteristic, or attribute.
- For example, hit rate could refer to the proportion of people accurately predicted to be able to perform work at the graduate school level, or to the proportion of neurological patients accurately identified as having a brain tumor.

Miss rate
- May be defined as the proportion of people the test fails to identify as having, or not having, a particular characteristic or attribute.
- A miss amounts to an inaccurate prediction.

The category of misses may be further subdivided:
1. False positive: a miss wherein the test predicted that the testtaker did possess the particular characteristic or attribute being measured when in fact the testtaker did not.
2. False negative: a miss wherein the test predicted that the testtaker did not possess the particular characteristic or attribute being measured when the testtaker actually did.
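These decision-theory quantities are straightforward to compute from predicted and actual outcomes; a sketch with invented data for ten screened applicants:

```python
import numpy as np

# Hypothetical screening outcomes: predicted vs. actual success (1 = success).
predicted = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])
actual    = np.array([1, 0, 1, 0, 1, 1, 0, 0, 1, 0])

hit_rate        = (predicted == actual).mean()              # accurate predictions
false_positives = ((predicted == 1) & (actual == 0)).mean() # predicted yes, was no
false_negatives = ((predicted == 0) & (actual == 1)).mean() # predicted no, was yes
miss_rate       = false_positives + false_negatives         # all inaccurate calls

base_rate = actual.mean()       # proportion actually possessing the attribute
selection_ratio = 50 / 100      # e.g., 50 openings for 100 applicants

print(hit_rate, miss_rate, false_positives, false_negatives,
      base_rate, selection_ratio)
```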
Construct validity
- A judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct.

Construct
- An informed, scientific idea developed or hypothesized to describe or explain behavior.
- If the test is a valid measure of the construct, then high scorers and low scorers will behave as predicted by the theory.
- If they do not, an alternative explanation could lie in the theory that generated the hypotheses about the construct:
  o the theory may need to be reexamined in light of
  o the contrary evidence.
- Construct validity has been viewed as the unifying concept for all validity evidence (American Educational Research Association et al., 1999).

The researcher investigating a test's construct validity must formulate hypotheses about the expected behavior of high scorers and low scorers on the test. These hypotheses give rise to a tentative theory about the nature of the construct the test was designed to measure.

Evidence of construct validity. Various techniques of construct validation may provide evidence that:
- The test is homogeneous, measuring a single construct.
- Test scores increase or decrease as a function of age, the passage of time, or an experimental manipulation, as theoretically predicted.
- Test scores obtained after some event or the mere passage of time (that is, posttest scores) differ from pretest scores, as theoretically predicted.
- Test scores obtained by people from distinct groups vary as predicted by the theory.
- Test scores correlate with scores on other tests in accordance with what would be predicted from a theory that covers the manifestation of the construct in question.

EVIDENCE OF HOMOGENEITY

Homogeneity
- Refers to how uniform a test is in measuring a single concept.
- A test developer can increase test homogeneity in several ways.
  o For example, consider a test of academic achievement that contains subtests in areas such as mathematics, spelling, and reading comprehension. The Pearson r could be used to correlate average subtest scores with the average total test score.
- Correlations between subtest scores and the total test score are generally reported in the test manual as evidence of homogeneity.
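A sketch of this kind of homogeneity evidence: correlating each subtest with the total score. All scores below are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical achievement battery: three subtests for 300 examinees,
# all partially driven by a shared achievement factor.
math_s   = rng.normal(50, 10, 300)
spelling = 0.5 * math_s + rng.normal(25, 8, 300)
reading  = 0.5 * math_s + rng.normal(25, 8, 300)

subtests = np.column_stack([math_s, spelling, reading])
total = subtests.sum(axis=1)

# Subtest-total correlations, the kind reported in manuals as homogeneity evidence.
for name, col in zip(["math", "spelling", "reading"], subtests.T):
    print(name, round(np.corrcoef(col, total)[0, 1], 2))
```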
EVIDENCE OF CHANGES WITH AGE

Some constructs lend themselves more readily than others to predictions of change over time, for example:
- Giftedness or intelligence
- Reading ability
- Marital satisfaction

Measures of marital satisfaction may be less stable over time, or more vulnerable to situational events, than is reading ability. Evidence of change over time, like evidence of test homogeneity, does not in itself provide information about how the construct relates to other constructs.

EVIDENCE OF PRETEST-POSTTEST CHANGES

Evidence that test scores change as a result of some experience between a pretest and a posttest can be evidence of construct validity. Typical intervening experiences responsible for changes in test scores are formal education, a course of therapy or medication, and on-the-job experience. Reading an inspirational book, watching a TV talk show, undergoing surgery, serving a prison sentence, or the mere passage of time may each prove to be a potent intervening variable.

- Example: a marital satisfaction scale showed significant change between pretest and posttest. A second posttest given eight weeks later showed that scores remained stable (suggesting the instrument was reliable), whereas the pretest-posttest measures were still significantly different. Such changes in scores in the predicted direction after the treatment program contribute to evidence of the construct validity of this test.

EVIDENCE FROM DISTINCT GROUPS
- Also referred to as the method of contrasted groups.
- One way of providing evidence for the validity of a test is to demonstrate that scores on the test vary in a predictable way as a function of membership in some group.
- The rationale here is that if a test is a valid measure of a particular construct, then groups of people who would be presumed to differ with respect to that construct should have correspondingly different test scores.
  o Consider a test of depression. We would expect individuals psychiatrically hospitalized for depression to score higher on this measure than a random sample of people.

Convergent evidence (convergent validity)
- Evidence for the construct validity of a particular test may converge from a number of sources, such as other tests or measures designed to assess the same (or a similar) construct.
- If scores on the test undergoing construct validation tend to correlate highly, in the predicted direction, with scores on older, more established, and already validated tests designed to measure the same construct, this would be an example of convergent evidence.
- Convergent evidence for validity may come not only from correlations with tests purporting to measure an identical construct but also from correlations with measures purporting to measure related constructs.
  o Consider a new test designed to measure test anxiety. Generally speaking, we might expect high positive correlations between this new test and older, more established measures of test anxiety. However, we might also expect more moderate correlations between this new test and measures of general anxiety.

Discriminant evidence (discriminant validity)
- A validity coefficient showing little (that is, a statistically insignificant) relationship between test scores and other variables with which scores on the test being construct-validated should not, in theory, be correlated provides discriminant evidence of construct validity.
  o Examples: gratitude and aggression; transcendence and explosive behavior.
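A hypothetical sketch of convergent and discriminant evidence side by side: a new measure should correlate highly with an established measure of the same construct and near zero with a theoretically unrelated one. All scores are simulated:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400

anxiety = rng.normal(size=n)                        # the target construct
new_test = anxiety + rng.normal(scale=0.5, size=n)  # new test-anxiety measure
old_test = anxiety + rng.normal(scale=0.5, size=n)  # established measure, same construct
gratitude = rng.normal(size=n)                      # theoretically unrelated construct

r_convergent   = np.corrcoef(new_test, old_test)[0, 1]   # expect high
r_discriminant = np.corrcoef(new_test, gratitude)[0, 1]  # expect near zero

print(f"convergent r ~ {r_convergent:.2f}, discriminant r ~ {r_discriminant:.2f}")
```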
Factor analysis
- A shorthand term for a class of mathematical procedures designed to identify factors: specific variables that are typically attributes, characteristics, or dimensions on which people may differ.
- Frequently employed as a data reduction method in which several sets of scores and the correlations between them are analyzed.
- The purpose of factor analysis is to identify the factor or factors in common between test scores on subscales within a particular test, or the factors in common between scores on a series of tests.
- Factor analysis is conducted on either an exploratory or a confirmatory basis.
  o Exploratory factor analysis (EFA) typically entails "estimating, or extracting factors; deciding how many factors to retain; and rotating factors to an interpretable orientation" (Floyd & Widaman, 1995, p. 287).
  o Confirmatory factor analysis (CFA): a factor structure is explicitly hypothesized and is tested for its fit with the observed covariance structure of the measured variables (Floyd & Widaman, 1995, p. 287).

Factor loading
- A factor loading in a test conveys information about the extent to which the factor determines the test score or scores.
  o A new test purporting to measure bulimia, for example, can be factor-analyzed with other known measures of bulimia, as well as with other kinds of measures (such as measures of intelligence, self-esteem, general anxiety, anorexia, or perfectionism).
- High factor loadings by the new test on a "bulimia factor" would provide convergent evidence of construct validity.
- Moderate to low factor loadings by the new test with respect to measures of other eating disorders, such as anorexia, would provide discriminant evidence of construct validity.

Naming factors
- Naming the factors that emerge from a factor analysis has more to do with knowledge, judgment, and verbal abstraction ability than with mathematical expertise.
- There are no hard-and-fast rules. Factor analysts exercise their own judgment about what factor name best communicates the meaning of the factor.
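An exploratory factor analysis can be sketched with scikit-learn's FactorAnalysis. The six "subtests" and two latent factors below are simulated, and naming the resulting factors remains the analyst's judgment call:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(5)
n = 500

# Two hypothetical latent factors generating six observed measures.
verbal = rng.normal(size=n)
spatial = rng.normal(size=n)
data = np.column_stack([
    verbal + rng.normal(scale=0.6, size=n),   # vocabulary
    verbal + rng.normal(scale=0.6, size=n),   # comprehension
    verbal + rng.normal(scale=0.6, size=n),   # analogies
    spatial + rng.normal(scale=0.6, size=n),  # block design
    spatial + rng.normal(scale=0.6, size=n),  # mazes
    spatial + rng.normal(scale=0.6, size=n),  # rotation
])

# Extract two factors and rotate them to an interpretable orientation.
fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0).fit(data)
print(np.round(fa.components_, 2))  # loadings: each row is one factor
# The first three measures should load on one factor, the last three on the other;
# calling them "verbal" and "spatial" is the naming judgment described above.
```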
VALIDITY, BIAS, AND FAIRNESS

Test bias
- Bias is a factor inherent in a test that systematically prevents accurate, impartial measurement.
- Psychometricians have developed the technical means to identify and remedy bias, at least in the mathematical sense.
  o As a simple illustration, consider the "flip-coin test" (FCT). The "equipment" needed to conduct this test is a two-sided coin. One side ("heads") has the image of a profile and the other side ("tails") does not. The FCT would be considered biased if the instrument (the coin) were weighted so that either heads or tails appears more frequently than by chance alone. If the test in question were an intelligence test, the test would be considered biased if it were constructed so that people who had brown eyes consistently and systematically obtained higher scores than people with green eyes.

Rating error
- A judgment resulting from the intentional or unintentional misuse of a rating scale.

Rating
- A numerical or verbal judgment (or both) that places a person or an attribute along a continuum identified by a scale of numerical or word descriptors known as a rating scale.

Leniency error
- Also called the "generosity error."
- An error in rating that arises from the tendency on the part of the rater to be lenient in scoring, marking, and/or grading.

Severity error
- A less-than-accurate rating, or error in evaluation, due to the rater's tendency to be overly critical.

Central-tendency error
- The rater exhibits a general and systematic reluctance to give ratings at either the positive or the negative extreme. Consequently, all of this rater's ratings would tend to cluster in the middle of the rating continuum.
- One way to overcome what might be termed restriction-of-range rating errors (central-tendency, leniency, and severity errors) is to use rankings, a procedure that requires the rater to measure individuals against one another instead of against an absolute scale.

Halo effect
- A tendency to give a particular ratee a higher rating than he or she objectively deserves because of the rater's failure to discriminate among conceptually distinct and potentially independent aspects of a ratee's behavior.

Test fairness
- In contrast to questions of test bias, which may be thought of as technically complex statistical problems, issues of test fairness tend to be rooted more in thorny issues involving values (Halpern, 2000).

Fairness
- In a psychometric context, fairness may be defined as the extent to which a test is used in an impartial, just, and equitable way.
- Ideally, the test developer strives for fairness in the test development process and in the test's manual and usage guidelines. The test user strives for fairness in the way the test is actually used. Society strives for fairness in test use by means of legislation, judicial decisions, and administrative regulations.

This reviewer is not for sale.