Psychological Testing and Assessment: Of Tests and Testing
Cohen−Swerdlik
Summary
This chapter provides an introduction to the science of psychological testing and measurement. It highlights the importance of tests in various critical life decisions and explores some basic assumptions about psychological traits and states. The overview will help in understanding and assessing the characteristics of individuals in a wide array of contexts.
Chapter 4: Of Tests and Testing

Is this person competent to stand trial? Who should be hired, transferred, promoted, or fired? Who should gain entry to this special program or be awarded a scholarship? Which parent shall have custody of the children?

Every day, throughout the world, critically important questions like these are addressed through the use of tests. The answers to these kinds of questions are likely to have a significant impact on many lives. If they are to sleep comfortably at night, assessment professionals must have confidence in the tests and other tools of assessment they employ. They need to know, for example, what does and does not constitute a "good test." Our objective in this chapter is to overview the elements of a good test. As background, we begin by listing some basic assumptions about assessment. Aspects of these fundamental assumptions will be elaborated later in this chapter as well as in subsequent chapters.

JUST THINK... What's a "good test"? Outline some elements or features that you believe are essential to a good test before reading on.

Some Assumptions about Psychological Testing and Assessment

Assumption 1: Psychological Traits and States Exist

A trait has been defined as "any distinguishable, relatively enduring way in which one individual varies from another" (Guilford, 1959, p. 6). States also distinguish one person from another but are relatively less enduring (Chaplin et al., 1988). The trait term that an observer applies, as well as the strength or magnitude of the trait presumed to be present, is based on observing a sample of behavior. Samples of behavior may be obtained in a number of ways, ranging from direct observation to the analysis of self-report statements or pencil-and-paper test answers.

The term psychological trait, much like the term trait alone, covers a wide range of possible characteristics. Thousands of psychological trait terms can be found in the English language (Allport & Odbert, 1936). Among them are psychological traits that relate to intelligence, specific intellectual abilities, cognitive style, adjustment, interests, attitudes, sexual orientation and preferences, psychopathology, personality in general, and specific personality traits. New concepts or discoveries in research may bring new trait terms to the fore. For example, a trait term seen in the professional literature on human sexuality is androgynous (referring to an absence of primacy of male or female characteristics). Cultural evolution may bring new trait terms into common usage, as it did in the 1960s when people began speaking of the degree to which women were liberated (or freed from the constraints of gender-dependent social expectations). A more recent example is the trait term New Age, used in the popular culture to refer to a particular nonmainstream orientation to spirituality and health.

Few people deny that psychological traits exist. Yet there has been a fair amount of controversy regarding just how they exist.
For example, do traits have a physical existence, perhaps as a circuit in the brain? Although some have argued in favor of such a conception of psychological traits (Allport, 1937; Holt, 1971), compelling evidence to support such a view has been difficult to obtain. For our purposes, a psychological trait exists only as a construct—an informed, scientific concept developed or constructed to describe or explain behavior. We can't see, hear, or touch constructs, but we can infer their existence from overt behavior. In this context, overt behavior refers to an observable action or the product of an observable action, including test- or assessment-related responses. A challenge facing test developers is to construct tests that are at least as telling as observable behavior such as that illustrated in Figure 4–1.

The phrase relatively enduring in our definition of trait is a reminder that a trait is not expected to be manifested in behavior 100% of the time. Thus, it is important to be aware of the context or situation in which a particular behavior is displayed. Whether a trait manifests itself in observable behavior, and to what degree it manifests, is presumed to depend not only on the strength of the trait in the individual but also on the nature of the situation. Stated another way, exactly how a particular trait manifests itself is, at least to some extent, situation-dependent. For example, a violent parolee may be prone to behave in a rather subdued way with her parole officer and much more violently in the presence of her family and friends. John may be viewed as dull and cheap by his wife but as charming and extravagant by his business associates, whom he keenly wants to impress.

JUST THINK... Give another example of how the same behavior in two different contexts may be viewed in terms of two different traits.

The context within which behavior occurs also plays a role in helping us select appropriate trait terms for observed behavior. Consider how we might label the behavior of someone who is kneeling and talking to God. Such behavior might be viewed as either religious or deviant, depending on the context in which it occurs. A person who is kneeling and talking to God inside a church or upon a prayer rug may be described as religious, whereas another person engaged in the exact same behavior in a public restroom might be viewed as deviant or paranoid.

JUST THINK... Is the strength of a particular psychological trait the same across all situations or environments? What are the implications of one's answer to this question for assessment?

The definitions of trait and state we are using also refer to a way in which one individual varies from another. The attribution of a trait or state term is a relative phenomenon. For example, in describing one person as shy, or even in using terms such as very shy or not shy, most people are making an unstated comparison with the degree of shyness they could reasonably expect the average person to exhibit under the same or similar circumstances. In psychological assessment, assessors may also make such
comparisons with respect to the hypothetical average person. Alternatively, assessors may make comparisons among people who, because of their membership in some group or for any number of other reasons, are decidedly not average. As you might expect, the reference group with which comparisons are made can greatly influence one's conclusions or judgments. For example, suppose a psychologist administers a test of shyness to a 22-year-old male who earns his living as an exotic dancer. The interpretation of the test data will almost surely differ as a function of the reference group with which the testtaker is compared—that is, other males in his age group or other male exotic dancers in his age group.

Figure 4–1 Measuring Sensation Seeking. The psychological trait of sensation seeking has been defined as "the need for varied, novel, and complex sensations and experiences and the willingness to take physical and social risks for the sake of such experiences" (Zuckerman, 1979, p. 10). A 22-item Sensation-Seeking Scale (SSS) seeks to identify people who are high or low on this trait. Assuming the SSS actually measures what it purports to measure, how would you expect a random sample of people lining up to bungee jump to score on the test as compared with another age-matched sample of people shopping at the local mall? What are the comparative advantages of using paper-and-pencil measures, such as the SSS, and using more performance-based measures, such as the one pictured here?

Assumption 2: Psychological Traits and States Can Be Quantified and Measured

Having acknowledged that psychological traits and states do exist, we must carefully define the specific traits and states to be measured and quantified. Test developers and researchers, much like people in general, have many different ways of looking at and defining the same phenomenon. Just think, for example, of the different ways a term such as aggressive is used. We speak of an aggressive salesperson, an aggressive killer, and an aggressive waiter, to name but a few contexts. In each of these different contexts, aggressive carries with it a different meaning. If a personality test yields a score purporting to provide information about how aggressive a testtaker is, a first step in understanding the meaning of that score is understanding how aggressive was defined by the test developer. More specifically, what types of behaviors are presumed to be indicative of someone who is aggressive as defined by the test?

Once having defined the trait, state, or other construct to be measured, a test developer considers the types of item content that would provide insight into it. From a universe of behaviors presumed to be indicative of the targeted trait, a test developer has a world of possible items that can be written to gauge the strength of that trait in testtakers.[1] For example, if the test developer deems knowledge of American history to be one component of adult intelligence, then the item Who was the second president of the United States? may appear on the test.
Similarly, if social judgment is deemed to be indicative of adult intelligence, then it might be reasonable to include the item Why should guns in the home always be inaccessible to children?

Suppose we agree that an item tapping knowledge of American history and an item tapping social judgment are both appropriate for an adult intelligence test. One question that arises is: Should both items be given equal weight? That is, should we place more importance on—and award more points for—an answer keyed "correct" to one or the other of these two items? Perhaps a correct response to the social judgment question should earn more credit than a correct response to the American history question. Weighting the comparative value of a test's items comes about as the result of a complex interplay among many factors, including technical considerations, the way a construct has been defined for the purposes of the test, and the value society (and the test developer) attaches to the behaviors evaluated.

JUST THINK... On an adult intelligence test, what type of item should be given the most weight? What type of item should be given the least weight?

Measuring traits and states by means of a test entails developing not only appropriate test items but also appropriate ways to score the test and interpret the results. For many varieties of psychological tests, some number representing the score on the test is derived from the examinee's responses. The test score is presumed to represent the strength of the targeted ability or trait or state and is frequently based on cumulative scoring.[2] Inherent in cumulative scoring is the assumption that the more the testtaker responds in a particular direction as keyed by the test manual as correct or consistent with a particular trait, the higher that testtaker is presumed to be on the targeted ability or trait. You were probably first introduced to cumulative scoring early in elementary school when you observed that your score on a weekly spelling test had everything to do with how many words you spelled correctly or incorrectly. The score reflected the extent to which you had successfully mastered the spelling assignment for the week. On the basis of that score, we might predict that you would spell those words correctly if called upon to do so. And in the context of such prediction, consider the next assumption.

[1] In the language of psychological testing and assessment, the word domain is substituted for world in this context. Assessment professionals speak, for example, of domain sampling, which may refer to either (1) a sample of behaviors from all possible behaviors that could conceivably be indicative of a particular construct or (2) a sample of test items from all possible items that could conceivably be used to measure a particular construct.
[2] Other models of scoring are discussed in Chapter 8.
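The logic of cumulative scoring described above can be made concrete with a brief sketch. The items, answer key, and weights below are invented for illustration only; they are not drawn from any published test, and real tests arrive at item weights through the kind of deliberation just discussed.

```python
# Illustrative sketch of cumulative scoring (hypothetical items, key, and weights).
# Each response that matches the keyed direction adds its weight to the total;
# a higher total is taken to indicate more of the targeted trait or ability.

def cumulative_score(responses, key, weights=None):
    """Sum the credit earned by responses that match the keyed answers."""
    if weights is None:
        weights = {item: 1 for item in key}  # equal weighting by default
    return sum(weights[item] for item, answer in responses.items()
               if key.get(item) == answer)

# A hypothetical five-item test keyed True/False.
answer_key = {"item1": True, "item2": False, "item3": True,
              "item4": True, "item5": False}
testtaker = {"item1": True, "item2": True, "item3": True,
             "item4": False, "item5": False}

print(cumulative_score(testtaker, answer_key))  # 3: three responses match the key
print(cumulative_score(testtaker, answer_key,
                       weights={"item1": 1, "item2": 1, "item3": 2,
                                "item4": 2, "item5": 1}))  # 4: item3 earns double credit
```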
Assumption 3: Test-Related Behavior Predicts Non-Test-Related Behavior

Many tests involve tasks such as blackening little grids with a No. 2 pencil or simply pressing keys on a computer keyboard. The objective of such tests typically has little to do with predicting future grid-blackening or key-pressing behavior. Rather, the objective of the test is to provide some indication of other aspects of the examinee's behavior. For example, patterns of answers to true–false questions on one widely used test of personality are used in decision making regarding mental disorders. The tasks in some tests mimic the actual behaviors that the test user is attempting to understand. By their nature, however, such tests yield only a sample of the behavior that can be expected to be emitted under nontest conditions. The obtained sample of behavior is typically used to make predictions about future behavior, such as work performance of a job applicant.

JUST THINK... In practice, tests have proven to be good predictors of some types of behaviors and not-so-good predictors of other types of behaviors. For example, tests have not proven to be as good at predicting violence as had been hoped. Why do you think it is so difficult to predict violence by means of a test?

In some forensic (legal) matters, psychological tests may be used not to predict behavior but to postdict it—that is, to aid in the understanding of behavior that has already taken place. For example, there may be a need to understand a criminal defendant's state of mind at the time of the commission of a crime. It is beyond the capability of any known testing or assessment procedure to reconstruct someone's state of mind. Still, behavior samples may shed light, under certain circumstances, on someone's state of mind in the past. Additionally, other tools of assessment—such as case history data or the defendant's personal diary during the period in question—might be of great value in such an evaluation.

Assumption 4: Tests and Other Measurement Techniques Have Strengths and Weaknesses

Competent test users understand a great deal about the tests they use. They understand, among other things, how a test was developed, the circumstances under which it is appropriate to administer the test, how the test should be administered and to whom, and how the test results should be interpreted. Competent test users understand and appreciate the limitations of the tests they use as well as how those limitations might be compensated for by data from other sources. All of this may sound quite commonsensical, and it probably is. Yet this deceptively simple assumption—that test users know the tests they use and are aware of the tests' limitations—is emphasized repeatedly in the codes of ethics of associations of assessment professionals.

Assumption 5: Various Sources of Error Are Part of the Assessment Process

In everyday conversation, we use the word error to refer to mistakes, miscalculations, and the like. In the context of assessment, error need not refer to a deviation, an oversight, or something that otherwise violates expectations. To the contrary, error traditionally refers to something that is more than expected; it is actually a component of the measurement process. More specifically, error refers to a long-standing assumption that factors other than what a test attempts to measure will influence performance on the test. Test scores are always subject to questions about the degree to which the measurement process includes error. For example, an intelligence test score could be subject to debate concerning the degree to which the obtained score truly reflects the examinee's intelligence and the degree to which it was due to factors other than intelligence.
Because error is a variable that must be taken account of in any assessment, we often speak of error variance, that is, the component of a test score attributable to sources other than the trait or ability measured. There are many potential sources of error variance. Whether or not an assessee has the flu when taking a test is a source of error variance. In a more general sense, then, assessees themselves are sources of error variance. Assessors, too, are sources of error variance. For example, some assessors are more professional than others in the extent to which they follow the instructions governing how and under what conditions a test should be administered. In addition to assessors and assessees, measuring instruments themselves are another source of error variance. Some tests are simply better than others in measuring what they purport to measure.

Instructors who teach the undergraduate measurement course will occasionally hear a student refer to error as "creeping into" or "contaminating" the measurement process. Yet measurement professionals tend to view error as simply an element in the process of measurement, one for which any theory of measurement must surely account. In what is referred to as the classical or true score theory of measurement, an assumption is made that each testtaker has a true score on a test that would be obtained but for the random action of measurement error.
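The classical (true score) model just mentioned is commonly summarized as X = T + E: an observed score X is the sum of a true score T and random error E. The short simulation below is only a sketch with invented numbers, not material from the text. It illustrates the sense in which error variance is a component of score variance: when error is random, the variance of observed scores is approximately the true-score variance plus the error variance.

```python
# Minimal simulation of the classical true score model: observed = true + error.
# All values are invented for illustration.
import random
import statistics

random.seed(0)
true_scores = [random.gauss(100, 15) for _ in range(5000)]  # hypothetical true scores
errors = [random.gauss(0, 5) for _ in range(5000)]          # random measurement error
observed = [t + e for t, e in zip(true_scores, errors)]

print(round(statistics.variance(true_scores)))  # roughly 225 (true-score variance)
print(round(statistics.variance(errors)))       # roughly 25 (error variance)
print(round(statistics.variance(observed)))     # roughly 250, about the sum of the two
```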
Assumption 6: Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner

If we had to pick the one of these seven assumptions that is more controversial than the remaining six, this one is it. Decades of court challenges to various tests and testing programs have sensitized test developers and users to the societal demand for fair tests used in a fair manner. Today, all major test publishers strive to develop instruments that are fair when used in strict accordance with guidelines in the test manual. However, despite the best efforts of many professionals, fairness-related questions and problems do occasionally arise. One source of fairness-related problems is the test user who attempts to use a particular test with people whose background and experience are different from the background and experience of people for whom the test was intended. Some potential problems related to test fairness are more political than psychometric. For example, heated debate on selection, hiring, and access or denial of access to various opportunities often surrounds affirmative action programs. In many cases, the real question for debate is not "Is this test or assessment procedure fair?" but rather "What do we as a society wish to accomplish by the use of this test or assessment procedure?" In all questions about tests with regard to fairness, it is important to keep in mind that tests are tools. And just like other, more familiar tools (hammers, ice picks, wrenches, and so on), they can be used properly or improperly.

JUST THINK... Do you believe that testing can be conducted in a fair and unbiased manner?

Assumption 7: Testing and Assessment Benefit Society

At first glance, the prospect of a world devoid of testing and assessment might seem appealing, especially from the perspective of a harried student preparing for a week of midterm examinations. Yet a world without tests would most likely be more a nightmare than a dream. In such a world, people could present themselves as surgeons, bridge builders, or airline pilots regardless of their background, ability, or professional credentials. In a world without tests or other assessment procedures, personnel might be hired on the basis of nepotism rather than documented merit. In a world without tests, teachers and school administrators could arbitrarily place children in different types of special classes simply because that is where they believed the children belonged. In a world without tests, there would be a great need for instruments to diagnose educational difficulties in reading and math and point the way to remediation. In a world without tests, there would be no instruments to diagnose neuropsychological impairments. In a world without tests, there would be no practical way for the military to screen thousands of recruits with regard to many key variables.

JUST THINK... How else might a world without tests or other assessment procedures be different from the world today?

Considering the many critical decisions that are based on testing and assessment procedures, we can readily appreciate the need for tests, especially good tests. And that, of course, raises one critically important question...

What's a "Good Test"?

Logically, the criteria for a good test would include clear instructions for administration, scoring, and interpretation. It would also seem to be a plus if a test offered economy in the time and money it took to administer, score, and interpret it. Most of all, a good test would seem to be one that measures what it purports to measure. Beyond simple logic, there are technical criteria that assessment professionals use to evaluate the quality of tests and other measurement procedures. Test users often speak of the psychometric soundness of tests, two key aspects of which are reliability and validity.

Reliability

A good test or, more generally, a good measuring tool or procedure is reliable. As we will explain in Chapter 5, the criterion of reliability involves the consistency of the measuring tool: the precision with which the test measures and the extent to which error is present in measurements. In theory, the perfectly reliable measuring tool consistently measures in the same way.

To exemplify reliability, visualize three digital scales labeled A, B, and C. To determine if they are reliable measuring tools, we will use a standard 1-pound gold bar that has been certified by experts to indeed weigh 1 pound and not a fraction of an ounce more or less. Now, let the testing begin. Repeated weighings of the 1-pound bar on Scale A register a reading of 1 pound every time. No doubt about it, Scale A is a reliable tool of measurement. On to Scale B. Repeated weighings of the bar on Scale B yield a reading of 1.3 pounds. Is this scale reliable? It sure is!
It may be consistently inaccurate by three-tenths of a pound, but there's no taking away the fact that it is reliable. Finally, Scale C. Repeated weighings of the bar on Scale C register a different weight every time. On one weighing, the gold bar weighs in at 1.7 pounds. On the next weighing, the weight registered is 0.9 pound. In short, the weights registered are all over the map. Is this scale reliable? Hardly. This scale is neither reliable nor accurate. Contrast it to Scale B, which also did not record the weight of the gold standard correctly. Although inaccurate, Scale B was consistent in terms of how much the registered weight deviated from the true weight. By contrast, the weight registered by Scale C deviated from the true weight of the bar in seemingly random fashion.

Whether we are measuring gold bars, behavior, or anything else, unreliable measurement is to be avoided. We want to be reasonably certain that the measuring tool or test that we are using is consistent. That is, we want to know that it yields the same numerical measurement every time it measures the same thing under the same conditions. Psychological tests, like other tests and instruments, are reliable to varying degrees.
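The gold bar example lends itself to a small numerical sketch. The repeated readings below are invented to mimic Scales A, B, and C; the only point is that consistency (a small spread across repeated measurements of the same thing) is a different question from accuracy (closeness to the true 1-pound value).

```python
# Invented repeated weighings of a certified 1-pound bar on three hypothetical scales.
import statistics

true_weight = 1.0
scales = {
    "A": [1.0, 1.0, 1.0, 1.0, 1.0],  # consistent and accurate
    "B": [1.3, 1.3, 1.3, 1.3, 1.3],  # consistent but inaccurate
    "C": [1.7, 0.9, 1.4, 0.6, 1.2],  # neither consistent nor accurate
}

for name, readings in scales.items():
    spread = statistics.pstdev(readings)            # 0.00 means perfectly consistent
    bias = statistics.mean(readings) - true_weight  # average departure from the true weight
    print(f"Scale {name}: spread = {spread:.2f}, average bias = {bias:+.2f}")
```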
As you might expect, however, reliability is a necessary but not sufficient element of a good test. In addition to being reliable, tests must be reasonably accurate. In the language of psychometrics, tests must be valid.

Validity

A test is considered valid for a particular purpose if it does, in fact, measure what it purports to measure. In the gold bar example cited earlier, the scale that consistently indicated that the 1-pound gold bar weighed 1 pound is a valid scale. Likewise, a test of reaction time is a valid test if it accurately measures reaction time. A test of intelligence is a valid test if it truly measures intelligence. Well, yes, but...

Although there is relatively little controversy about the definition of a term such as reaction time, a great deal of controversy exists about the definition of intelligence. Because there is controversy surrounding the definition of intelligence, the validity of any test purporting to measure this variable is sure to come under close scrutiny by critics. If the definition of intelligence on which the test is based is sufficiently different from the definition of intelligence on other accepted tests, then the test may be condemned as not measuring what it purports to measure.

Questions regarding a test's validity may focus on the items that collectively make up the test. Do the items adequately sample the range of areas that must be sampled to adequately measure the construct? Individual items will also come under scrutiny in an investigation of a test's validity. How do individual items contribute to or detract from the test's validity?

The validity of a test may also be questioned on grounds related to the interpretation of resulting test scores. What do these scores really tell us about the targeted construct? How are high scores on the test related to testtakers' behavior? How are low scores on the test related to testtakers' behavior? How do scores on this test relate to scores on other tests purporting to measure the same construct? How do scores on this test relate to scores on other tests purporting to measure opposite types of constructs?

JUST THINK... Why might a test shown to be valid for use for a particular purpose with members of one population not be valid for use for that same purpose with members of another population?

We might expect one person's score on a valid test of introversion to be inversely related to that same person's score on a valid test of extraversion; that is, the higher the introversion test score, the lower the extraversion test score, and vice versa. As we will see when we discuss validity in greater detail in Chapter 6, questions concerning the validity of a particular test may be raised at every stage in the life of a test. From its initial development through the life of its use with members of different populations, assessment professionals may raise questions regarding the extent to which a test is measuring what it purports to measure.
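Relationships of that kind are typically examined by correlating scores. The sketch below uses invented introversion and extraversion scores simply to show the inverse (negative) correlation the passage describes; it is an illustration, not data from any actual test.

```python
# Invented scores illustrating an inverse relationship between two measures.
from statistics import correlation  # Pearson r; available in Python 3.10+

introversion = [12, 18, 25, 31, 36, 44, 50, 57]
extraversion = [55, 49, 46, 38, 30, 27, 20, 14]

r = correlation(introversion, extraversion)
print(round(r, 2))  # close to -1.0: higher introversion scores go with lower extraversion scores
```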
Other Considerations

A good test is one that trained examiners can administer, score, and interpret with a minimum of difficulty. A good test is a useful test, one that yields actionable results that will ultimately benefit individual testtakers or society at large. In "putting a test to the test," there are a number of different ways to evaluate just how good a test really is (see this chapter's Everyday Psychometrics).

If the purpose of a test is to compare the performance of the testtaker with the performance of other testtakers, a good test is one that contains adequate norms. Also referred to as normative data, norms provide a standard with which the results of measurement can be compared. Let's explore the important subject of norms in a bit more detail.

EVERYDAY PSYCHOMETRICS
Putting Tests to the Test

For experts in the field of testing and assessment, certain questions occur almost reflexively in evaluating a test or measurement technique. As a student of assessment, you may not be expert yet, but consider the questions that follow when you come across mention of any psychological test or other measurement technique.

Why Use This Particular Instrument or Method?

A choice of measuring instruments typically exists when it comes to measuring a particular psychological or educational variable, and the test user must therefore choose from many available tools. Why use one over another? Answering this question typically entails raising other questions, such as: What is the objective of using a test and how well does the test under consideration meet that objective? Who is this test designed for use with (age of testtakers? reading level? etc.) and how appropriate is it for the targeted testtakers? How is what the test measures defined? For example, if a test user seeks a test of "leadership," how is "leadership" defined by the test developer (and how close does this definition match the test user's definition of leadership for the purposes of the assessment)? What type of data will be generated from using this test, and what other types of data will it be necessary to generate if this test is used? Do alternate forms of this test exist? Answers to questions about specific instruments may be found in published sources of information (such as test catalogues, test manuals, and published test reviews) as well as unpublished sources (correspondence with test developers and publishers and with colleagues who have used the same or similar tests). Answers to related questions about the use of a particular instrument may be found elsewhere—for example, in published guidelines. This brings us to another question to "put to the test."

Are There Any Published Guidelines for the Use of This Test?

Measurement professionals make it their business to be aware of published guidelines from professional associations and related organizations for the use of tests and measurement techniques. Sometimes, a published guideline for the use of a particular test will list other measurement tools that should also be used along with it. For example, consider the case of psychologists called upon to provide input to a court in the matter of a child custody decision. More specifically, the court has asked the psychologist for expert opinion regarding an individual's parenting capacity. Many psychologists who perform such evaluations use a psychological test as part of the evaluation process. However, the psychologist performing such an evaluation is—or should be—aware of the guidelines promulgated by the American Psychological Association's Committee on Professional Practice and Standards. These guidelines describe three types of assessments relevant to a child custody decision: (1) the assessment of parenting capacity, (2) the assessment of psychological and developmental needs of the child, and (3) the assessment of the goodness of fit between the parent's capacity and the child's needs. According to these guidelines, an evaluation of a parent—or even of two parents—is not sufficient to arrive at an opinion regarding custody. Rather, an educated opinion about who should be awarded custody can be arrived at only after evaluating (1) the parents (or others seeking custody), (2) the child, and (3) the goodness of fit between the needs and capacity of each of the parties.

In this example, published guidelines inform us that any instrument the assessor selects to obtain information about parenting capacity must be supplemented with other instruments or procedures designed to support any expressed opinion, conclusion, or recommendation. In everyday practice, these other sources of data will be derived using other tools of psychological assessment such as interviews, behavioral observation, and case history or document analysis. Published guidelines and research may also provide useful information regarding how likely the use of a particular test or measurement technique is to meet the Daubert or other standards set by courts (see, for example, Yañez & Fremouw, 2004).
Is This Instrument Reliable?

Earlier, we introduced you to the psychometric concept of reliability and noted that it concerned the consistency of measurement. Research to determine whether a particular instrument is reliable starts with a careful reading of the test's manual and of published research on the test, test reviews, and related sources. However, it does not necessarily end with such research.

Measuring reliability is not always a straightforward matter. As an example, consider one of the tests that might be used in the evaluation of parenting capacity, the Bricklin Perceptual Scales (BPS; Bricklin, 1984). The BPS was designed to explore a child's perception of father and mother. A measure of one type of reliability, referred to as test-retest reliability, would indicate how consistent a child's perception of father and mother is over time. However, the BPS test manual contains no reliability data because, as Bricklin (1984, p. 42) opined, "There are no reasons to expect the measurements reported here to exhibit any particular degree of stability, since they should vary in accordance with changes in the child's perceptions." This assertion has not stopped others (Gilch-Pesantez, 2001; Speth, 1992) and even Bricklin himself many years later (Bricklin & Halbert, 2004) from exploring the test-retest reliability of the BPS. Whether or not one accepts Bricklin's opinion as found in the original test manual, such opinions illustrate the great complexity of reliability questions. They also underscore the need for multiple sources of data to strengthen arguments regarding the confirmation or rejection of a hypothesis.

Is This Instrument Valid?

Validity, as you have learned, refers to the extent that a test measures what it purports to measure. And as was the case with questions concerning a particular instrument's reliability, research to determine whether a particular instrument is valid starts with a careful reading of the test's manual as well as published research on the test, test reviews, and related sources. Once again, as you might have anticipated, there will not necessarily be any simple answers at the end of this preliminary research.

As with reliability, questions related to the validity of a test can be complex and colored more in shades of gray than black or white. For example, even if data from a test such as the BPS were valid for the purpose of gauging children's perceptions of their parents, the data would be invalid as the sole source on which to base an opinion regarding child custody (Brodzinsky, 1993; Heinze & Grisso, 1996). The need for multiple sources of data on which to base an opinion stems not only from the ethical mandates published in the form of guidelines from professional associations but also from the practical demands of meeting a burden of proof in court. In sum, what starts as research to determine the validity of an individual instrument for a particular objective may end with research as to which combination of instruments will best achieve that objective.

Is This Instrument Cost-Effective?

During the First and Second World Wars, a need existed for the military to screen hundreds of thousands of recruits quickly for intelligence. It may have been desirable to individually administer a Binet intelligence test to each recruit, but it would have taken a great deal of time—too much time, given the demands of the war—and it would not have been very cost-effective. Instead, the armed services developed group measures of intelligence that could be administered quickly and that addressed its needs more efficiently than an individually administered test. In this instance, it could be said that group tests had greater utility than individual tests. The concept of test utility is discussed in greater depth in Chapter 7.
What Inferences May Reasonably Be Made from This Test Score, and How Generalizable Are the Findings?

In evaluating a test, it is critical to consider the inferences that may reasonably be made as a result of administering that test. Will we learn something about a child's readiness to begin first grade? Whether one is harmful to oneself or others? Whether an employee has executive potential? These represent but a small sampling of critical questions for which answers must be inferred on the basis of test scores and other data derived from various tools of assessment.

Intimately related to considerations regarding the inferences that can be made are those regarding the generalizability of the findings. As you learn more and more about test norms, for example, you will discover that the population of people used to help develop a test has a great effect on the generalizability of findings from an administration of the test. Many other factors may affect the generalizability of test findings. For example, if the items on a test are worded in such a way as to be less comprehensible by members of a specific group, then the use of that test with members of that group could be questionable. Another issue regarding the generalizability of findings concerns how a test was administered. Most published tests include explicit directions for testing conditions and test administration procedures that must be followed to the letter. If a test administration deviates in any way from these directions, the generalizability of the findings may be compromised.

Although you may not yet be an expert in measurement, you are now aware of the types of questions experts ask when evaluating tests. It is hoped that you can now appreciate that simple questions such as "What's a good test?" don't necessarily have simple answers.

Norms

We may define norm-referenced testing and assessment as a method of evaluation and a way of deriving meaning from test scores by evaluating an individual testtaker's score and comparing it to scores of a group of testtakers. In this approach, the meaning of an individual test score is understood relative to other scores on the same test. A common goal of norm-referenced tests is to yield information on a testtaker's standing or ranking relative to some comparison group of testtakers.

Norm in the singular is used in the scholarly literature to refer to behavior that is usual, average, normal, standard, expected, or typical. Reference to a particular variety of norm may be specified by means of modifiers such as age, as in the term age norm. Norms is the plural form of norm, as in the term gender norms. In a psychometric context, norms are the test performance data of a particular group of testtakers that are designed for use as a reference when evaluating or interpreting individual test scores. As used in this definition, the "particular group of testtakers" may be defined broadly (for example, "a sample representative of the adult population of the United States") or narrowly (for example, "female inpatients at the Bronx Community Hospital with a primary diagnosis of depression").
A normative sample is that group of people whose performance on a particular test is analyzed for reference in evaluating the performance of individual testtakers. Whether broad or narrow in scope, members of the normative sample will all be typical with respect to some characteristic(s) of the people for whom the particular test was designed. A test administration to this representative sample of testtakers yields a distribution (or distributions) of scores. These data constitute the norms for the test and typically are used as a reference source for evaluating and placing into context test scores obtained by individual testtakers. The data may be in the form of raw scores or converted scores.
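One familiar kind of converted score expresses a raw score relative to the normative distribution, for example as a z score or a percentile standing. The sketch below is only an illustration with an invented normative sample; published tests base such conversions on much larger standardization samples and on the conversion tables in their manuals.

```python
# Converting a raw score to a z score and a percentile rank against invented norms.
import statistics

normative_scores = [38, 41, 44, 45, 47, 48, 50, 50, 52, 53,
                    55, 56, 58, 60, 61, 63, 64, 66, 69, 72]  # hypothetical normative sample

def z_score(raw, norms):
    """Express a raw score in standard-deviation units relative to the norms."""
    return (raw - statistics.mean(norms)) / statistics.stdev(norms)

def percentile_rank(raw, norms):
    """Percentage of the normative sample scoring at or below the raw score."""
    return 100 * sum(score <= raw for score in norms) / len(norms)

raw = 62
print(round(z_score(raw, normative_scores), 2))  # standing in standard-deviation units
print(percentile_rank(raw, normative_scores))    # percentage of the sample at or below 62
```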
Perez’s day care center, or all housewives with primary responsibility for house- hold shopping who have purchased over-the-counter headache remedies within the last two months. To obtain a distribution of scores, the test developer could have the test adminis- tered to every person in the targeted population. If the total targeted population con- sists of something like the 16 boys and girls in Mrs. Perez’s day care center, it may well be feasible to administer the test to each member of the targeted population. However, for tests developed to be used with large or wide-ranging populations, it is usually impossible, impractical, or simply too expensive to administer the test to everyone, nor is it necessary. The test developer can obtain a distribution of test responses by administering the test to a sample of the population—a portion of the universe of people deemed to be rep- resentative of the whole population. The size of the sample could be as small as one per- son, though samples that approach the size of the population reduce the possible sources of error due to insufficient sample size. The process of selecting the portion of the uni- verse deemed to be representative of the whole population is referred to as sampling. Subgroups within a defined population may differ with respect to some characteris- tics, and it is sometimes essential to have these differences proportionately represented in the sample. Thus, for example, if you devised a public opinion test and wanted to sample the opinions of Manhattan residents with this instrument, it would be desir- able to include in your sample people representing different subgroups (or strata) of the population, such as Blacks, Whites, Asians, other non-Whites, males, females, the poor, the middle class, the rich, professional people, busi- ◆ ness people, office workers, skilled and unskilled laborers, J UST THIN K... the unemployed, homemakers, Catholics, Jews, members Truly random sampling is relatively rare. of other religions, and so forth—all in proportion to the Why do you think this is so? current occurrence of these strata in the population of peo- ple who reside on the island of Manhattan. Such sampling, termed stratified sampling, would help prevent sampling bias and ultimately aid in the interpretation of the findings. If such sampling were random (that is, if every member of the population had the same chance of being included in the sample), then the proce- dure would be termed stratified-random sampling. Two other types of sampling procedures are purposive sampling and incidental sam- pling. If we arbitrarily select some sample because we believe it to be representative of the population, then we have selected what is referred to as a purposive sample. 112 Part 2: The Science of Psychological Measurement Cohen−Swerdlik: II. The Science of 4. Of Tests and Testing © The McGraw−Hill 125 Psychological Testing and Psychological Companies, 2010 Assessment: An Measurement Introduction to Tests and Measurement, Seventh Edition C L O S E - U P How “Standard” is Standard in Measurement? T he foot, a unit of distance measurement in the United States, probably had its origins in the length of a British king’s foot used as a standard—one that measured about 12 inches, give or take. It wasn’t so very long ago that different localities throughout the world all had different “feet” to measure by. We have come a long way since then, especially with regard to standards and standardization in measurement... haven’t we? Perhaps. 
Two other types of sampling procedures are purposive sampling and incidental sampling. If we arbitrarily select some sample because we believe it to be representative of the population, then we have selected what is referred to as a purposive sample.

CLOSE-UP
How "Standard" is Standard in Measurement?

The foot, a unit of distance measurement in the United States, probably had its origins in the length of a British king's foot used as a standard—one that measured about 12 inches, give or take. It wasn't so very long ago that different localities throughout the world all had different "feet" to measure by. We have come a long way since then, especially with regard to standards and standardization in measurement... haven't we? Perhaps. However, in the field of psychological testing and assessment, there's still more than a little confusion when it comes to the meaning of terms like standard and standardization. Questions also exist concerning what is and is not standardized. To address these and related questions, a close-up look at the word standard and its derivatives seems very much in order.

The word standard can be a noun or an adjective, and in either case it may have multiple (and quite different) definitions. As a noun, standard may be defined as that which others are compared to or evaluated against. One may speak, for example, of a test with exceptional psychometric properties as being "the standard against which all similar tests are judged." An exceptional textbook on the subject of psychological testing and assessment—take the one you are reading, for example—may be judged "the standard against which all similar textbooks are judged." Perhaps the most common use of standard as a noun in the context of testing and assessment is in the title of that well-known manual that sets forth ideals of professional behavior against which any practitioner's behavior can be judged: The Standards for Educational and Psychological Testing, usually referred to simply as The Standards.

Figure 1 Ben's Cold Cut Preference Test (CCPT). Ben owns a small "deli boutique" that sells ten varieties of private label cold cuts. Ben read somewhere that if a test has clearly specified methods for test administration and scoring then it must be considered "standardized." He then went on to create his own "standardized test": the Cold Cut Preference Test (CCPT). The CCPT consists of only two questions: "What would you like today?" and a follow-up question, "How much of that would you like?" Ben scrupulously trains his only employee (his wife—it's literally a "mom and pop" business) on "test administration" and "test scoring" of the CCPT. So, just think: Does the CCPT really qualify as a "standardized test"?

As an adjective, standard often refers to what is usual, generally accepted, or commonly employed. One may speak, for example, of the standard way of conducting a particular measurement procedure, especially as a means of contrasting it to some newer or experimental measurement procedure. For example, a researcher experimenting with a new, multimedia approach to conducting a mental status examination might conduct a study to compare the value of this approach to the standard mental status examination interview.

In some areas of psychology, there has been a need to create a new standard unit of measurement in the interest of better understanding or quantifying particular phenomena. For example, in studying alcoholism and associated problems, many researchers have adopted the concept of a standard drink. The notion of a "standard drink" is designed to facilitate communication and to enhance understanding regarding alcohol consumption patterns (Aros et al., 2006; Gill et al., 2007), intervention strategies (Hwang, 2006; Podymow et al., 2006), and costs associated with alcohol consumption (Farrell, 1998). Regardless of whether it is beer, wine, liquor, or any other alcoholic beverage, reference to a "standard drink" immediately conveys information to the knowledgeable researcher about the amount of alcohol in the beverage.
The verb "to standardize" refers to making or transforming something into something that can serve as a basis of comparison or judgment. One may speak, for example, of the efforts of researchers to standardize an alcoholic beverage that contains 15 milliliters of alcohol as a "standard drink." For many of the variables commonly used in assessment studies, there is an attempt to standardize a definition. As an example, Anderson (2007) sought to standardize exactly what is meant by "creative thinking." Well known to any student who has ever taken a nationally administered achievement test or college admission examination is the standardizing of tests. But what does it mean to say that a test is "standardized"? Some "food for thought" regarding an answer to this deceptively simple question can be found in Figure 1.

Test developers standardize tests by developing replicable procedures for administering the test and for scoring and interpreting the test. Also part of standardizing a test is developing norms for the test. Well, not necessarily... whether or not norms for the test must be developed in order for the test to be deemed "standardized" is debatable. It is true that almost any "test" that has clearly specified procedures for administration, scoring, and interpretation can be considered "standardized." So even Ben the deli guy's CCPT (described in Figure 1) might be deemed a "standardized test" according to some. This is so because the test is "standardized" to the extent that the "test items" are clearly specified (presumably along with "rules" for "administering" them and rules for "scoring and interpretation"). Still, many assessment professionals would hesitate to refer to Ben's CCPT as a "standardized test." Why?

Traditionally, assessment professionals have reserved the term standardized test for those tests that have clearly specified procedures for administration, scoring, and interpretation in addition to norms. Such tests also come with manuals that are as much a part of the test package as the test's items. The test manual, which may be published in one or more booklets, will ideally provide potential test users with all of the information they need to use the test in a responsible fashion. The test manual enables the test user to administer the test in the "standardized" manner in which it was designed to be administered; all test users should be able to replicate the test administration as prescribed by the test developer. Ideally, there will be little deviation from examiner to examiner in the way that a standardized test is administered, owing to the rigorous preparation and training that all potential users of the test have undergone prior to administering the test to testtakers.

If a standardized test is designed for scoring by the test user (in contrast to computer scoring), the test manual will ideally contain detailed scoring guidelines. If the test is one of ability that has correct and incorrect answers, the manual will ideally contain an ample number of examples of correct, incorrect, or partially correct responses, complete with scoring guidelines. In like fashion, if it is a test that measures personality, interest, or any other variable that is not scored as correct or incorrect, then ample examples of potential responses will be provided along with complete scoring guidelines. We would also expect the test manual to contain detailed guidelines for interpreting the test results, including samples of both appropriate and inappropriate generalizations from the findings.
Also from a traditional perspective, we think of standardized tests as having undergone a standardization process. Conceivably, the term standardization could be applied to "standardizing" all the elements of a standardized test that need to be standardized. Thus, for a standardized test of leadership, we might speak of standardizing the definition of leadership, standardizing test administration instructions, standardizing test scoring, standardizing test interpretation, and so forth. Indeed, one definition of standardization as applied to tests is "the process employed to introduce objectivity and uniformity into test administration, scoring and interpretation" (Robertson, 1990, p. 75). Another and perhaps more typical use of standardization, however, is reserved for that part of the test development process during which norms are developed. It is for this very reason that the term test standardization has been used interchangeably by many test professionals with the term test norming.

Assessment professionals develop and use standardized tests to benefit testtakers, test users, and/or society at large. Although there is conceivably some benefit to Ben in gathering data on the frequency of orders for a pound or two of bratwurst, this type of data gathering does not require a "standardized test." So, getting back to Ben's CCPT... although there are some writers who would staunchly defend the CCPT as a "standardized test" (simply because any two questions with clearly specified guidelines for administration and scoring would make the "cut"), practically speaking this is simply not the case from the perspective of most assessment professionals.

There are a number of other ambiguities in psychological testing and assessment when it comes to the use of the word standard and its derivatives. Consider, for example, the term standard score. Some test manuals and books reserve the term standard score for use with reference to z scores. Raw scores (as well as z scores) linearly transformed to any other type of standard scoring system—that is, transformed to a scale with an arbitrarily set mean and standard deviation—are differentiated from z scores by the term standardized. For these authors, a z score would still be referred to as a "standard score" whereas a T score, for example, would be referred to as a "standardized score."
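The arithmetic behind that distinction can be shown in a few lines. The raw scores below are invented; the sketch simply converts them to z scores and then to T scores, taking the conventional T scale (mean of 50, standard deviation of 10) as an example of a linearly transformed, "standardized" score.

```python
# Converting invented raw scores to z scores and to T scores (mean 50, SD 10).
import statistics

raw_scores = [35, 42, 47, 50, 55, 61, 68]
mean = statistics.mean(raw_scores)
sd = statistics.stdev(raw_scores)

for raw in raw_scores:
    z = (raw - mean) / sd  # "standard score" in the narrow sense
    t = 50 + 10 * z        # one common linearly transformed ("standardized") score
    print(f"raw {raw:>2}  z {z:+.2f}  T {t:5.1f}")
```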
For the purpose of tackling another "nonstandard" use of the word standard, let's digress for just a moment to images of the great American pastime of baseball. Imagine, for a moment, all of the different ways that players can be charged with an error. There really isn't one type of error that could be characterized as standard in the game of baseball. Now, back to psychological testing and assessment—where there also isn't just one variety of error that could be characterized as "standard." No, there isn't one... there are lots of them! One speaks, for example, of the standard error of measurement (also known as the standard error of a score), the standard error of estimate (also known as the standard error of prediction), the standard error of the mean, and the standard error of the difference. A table briefly summarizing the main differences between these terms is presented here, although they are discussed in greater detail elsewhere in this book.

Type of "Standard Error"            What Is It?
Standard error of measurement       A statistic used to estimate the extent to which an observed score deviates from a true score
Standard error of estimate          In regression, an estimate of the degree of error involved in predicting the value of one variable from another
Standard error of the mean          A measure of sampling error
Standard error of the difference    A statistic used to estimate how large a difference between two scores should be before the difference is considered statistically significant
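As a rough companion to the table above, the sketch below spells out the formulas most commonly given in measurement texts for these four statistics; they are the conventional textbook formulas rather than anything quoted from a particular source, the reliability, validity, and sample-size figures are invented, and the standard error of the difference is shown for the simple case of two scores with equal standard errors of measurement.

```python
import math

# Hypothetical values, chosen only to make the formulas concrete.
sd_x = 10.0   # standard deviation of observed test scores
r_xx = 0.91   # assumed reliability coefficient of the test
sd_y = 12.0   # standard deviation of the criterion being predicted
r_xy = 0.60   # assumed correlation between test and criterion
n = 100       # assumed sample size

# Standard error of measurement: spread of observed scores around a true score.
sem = sd_x * math.sqrt(1 - r_xx)

# Standard error of estimate: error in predicting one variable from another in regression.
see = sd_y * math.sqrt(1 - r_xy ** 2)

# Standard error of the mean: sampling error of a sample mean.
se_mean = sd_x / math.sqrt(n)

# Standard error of the difference between two scores with equal SEMs.
se_diff = math.sqrt(sem ** 2 + sem ** 2)

print(f"SEM = {sem:.2f}, SE(estimate) = {see:.2f}, "
      f"SE(mean) = {se_mean:.2f}, SE(difference) = {se_diff:.2f}")
```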
We conclude by encouraging the exercise of critical thinking upon encountering the word standard. The next time you encounter the word standard in any context, give some thought to how standard that "standard" really is. Certainly with regard to this word's use in the context of psychological testing and assessment, what is presented as "standard" usually turns out to be not as standard as we might expect.

Manufacturers of products frequently use purposive sampling when they test the appeal of a new product in one city or market and then make assumptions about how that product would sell nationally. For example, the manufacturer might test a product in a market such as Cleveland because, on the basis of experience with this particular product, "how goes Cleveland, so goes the nation." The danger in using such a purposive sample is that the sample, in this case Cleveland residents, may no longer be representative of the nation. Alternatively, this sample may simply not be representative of national preferences with regard to the particular product being test-marketed.

Often, a test user's decisions regarding sampling wind up pitting what is ideal against what is practical. It may be ideal, for example, to use 50 chief executive officers from any of the Fortune 500 companies (that is, the top 500 companies in terms of income) as a sample in an experiment. However, conditions may dictate that it is practical for the experimenter only to use 50 volunteers recruited from the local Chamber of Commerce. This important distinction between what is ideal and what is practical in sampling brings us to a discussion of what has been referred to variously as an incidental sample or a convenience sample.

Ever hear the old joke about a drunk searching for money he lost under the lamppost? He may not have lost his money there, but that is where the light is. Like the drunk searching for money under the lamppost, a researcher may sometimes employ a sample that is not necessarily the most appropriate but is rather the most convenient. Unlike the drunk, the researcher employing this type of sample is not doing so as a result of poor judgment but because of budgetary limitations or other constraints.

An incidental sample or convenience sample is one that is convenient or available for use. You may have been a party to incidental sampling if you have ever been placed in a subject pool for experimentation with introductory psychology students. It's not that the students in such subject pools are necessarily the most appropriate subjects for the experiments; it's just that they are the most available. Generalization of findings from incidental samples must be made with caution.

If incidental or convenience samples were clubs, they would not be considered very exclusive clubs. By contrast, there are many samples that are exclusive, in a sense, since they contain many exclusionary criteria. Consider, for example, the group of children and adolescents who served as the normative sample for one well-known children's intelligence test. The sample was selected to reflect key demographic variables representative of the U.S. population according to the latest available census data. Still, some groups were deliberately excluded from participation. Who?

persons tested on any intelligence measure in the six months prior to the testing
persons not fluent in English or primarily nonverbal
persons with uncorrected visual impairment or hearing loss
persons with upper-extremity disability that affects motor performance
persons currently admitted to a hospital or mental or psychiatric facility
persons currently taking medication that might depress test performance
persons previously diagnosed with any physical condition or illness that might depress test performance (such as stroke, epilepsy, or meningitis)

◆ J U S T T H I N K . . .
Why do you think each of these groups of people was excluded from the standardization sample of a nationally standardized intelligence test?

Our general description of the norming process for a standardized test continues in what follows and, to varying degrees, in subsequent chapters. A highly recommended way to supplement this study and gain a great deal of firsthand knowledge about norms for intelligence tests, personality tests, and other tests is to peruse the technical manuals of major standardized instruments. By going to the library and consulting a few of these manuals, you will discover not only the "real life" way that normative samples are described but also the many varied ways that normative data can be presented.

Developing norms for a standardized test Having obtained a sample, the test developer administers the test according to the standard set of instructions that will be used with the test. The test developer also describes the recommended setting for giving the test. This may be as simple as making sure that the room is quiet and well lit or as complex as providing a specific set of toys to test an infant's cognitive skills. Establishing a standard set of instructions and conditions under which the test is given makes the test scores of the normative sample more comparable with the scores of future testtakers. For example, if a test of concentration ability is given to a normative sample in the summer with the windows open near people mowing the grass and arguing about whether the hedges need trimming, then the normative sample probably won't concentrate well.
If a testtaker then completes the concentration test under quiet, comfortable conditions, that person may well do much better than the normative group, resulting in a high standard score. That high score would not be very helpful in understanding the testtaker's concentration abilities because it would reflect the differing conditions under which the tests were taken. This example illustrates how important it is that the normative sample take the test under a standard set of conditions, which are then replicated (to the extent possible) on each occasion the test is administered.

After all the test data have been collected and analyzed, the test developer will summarize the data using descriptive statistics, including measures of central tendency and variability. In addition, it is incumbent on the test developer to provide a precise description of the standardization sample itself. Good practice dictates that the norms be developed with data derived from a group of people who are presumed to be representative of the people who will take the test in the future. In order to best assist future users of the test, test developers are encouraged to "describe the population(s) represented by any norms or comparison group(s), the dates the data were gathered, and the process used to select the samples of testtakers" (Code of Fair Testing Practices in Education, 1988, p. 3).

In practice, descriptions of normative samples vary widely in detail. Not surprisingly, test authors wish to present their tests in the most favorable light possible. Accordingly, shortcomings in the standardization procedure or elsewhere in the process of the test's development may be given short shrift or totally overlooked in a test's manual. Sometimes, although the sample is scrupulously defined, the generalizability of the norms to a particular group or individual is questionable. For example, a test carefully normed on school-age children who reside within the Los Angeles school district may be relevant only to a lesser degree to school-age children who reside within the Dubuque, Iowa, school district. How many children in the standardization sample were English speaking? How many were of Hispanic origin? How does the elementary school curriculum in Los Angeles differ from the curriculum in Dubuque? These are the types of questions that must be raised before the Los Angeles norms are judged to be generalizable to the children of Dubuque. Test manuals sometimes supply prospective test users with guidelines for establishing local norms (discussed shortly), one of many different ways norms can be categorized.
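To give a sense of what summarizing normative data with descriptive statistics can look like in practice, here is a minimal sketch using an invented ten-person normative sample; an actual standardization sample would be far larger and would be described demographically as well.

```python
from statistics import mean, median, pstdev

# Invented raw scores standing in for a normative sample.
norm_sample = [34, 41, 45, 47, 50, 52, 55, 58, 61, 67]

print(f"n = {len(norm_sample)}")
print(f"mean = {mean(norm_sample):.1f}")                  # central tendency
print(f"median = {median(norm_sample):.1f}")              # central tendency
print(f"standard deviation = {pstdev(norm_sample):.1f}")  # variability
```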
One note on terminology is in order before moving on. When the people in the normative sample are the same people on whom the test was standardized, the phrases normative sample and standardization sample are often used interchangeably. Increasingly, however, new norms for standardized tests for specific groups of testtakers are developed some time after the original standardization. That is, the test remains standardized based on data from the original standardization sample; it's just that new normative data are developed based on an administration of the test to a new normative sample. Included in this new normative sample may be groups of people who were underrepresented in the original standardization sample data. For example, if there had been a large influx of potential testtakers from the Czech Republic since original standardization, the new normative sample might well include a sample of Czech Republic nationals. In such a scenario, the normative sample for the new norms clearly would not be identical to the standardization sample, so it would be inaccurate to use the terms standardization sample and normative sample interchangeably.

Types of Norms

Some of the many different ways we can classify norms are as follows: age norms, grade norms, national norms, national anchor norms, local norms, norms from a fixed reference group, subgroup norms, and percentile norms. Percentile norms are the raw data from a test's standardization sample converted to percentile form. To better understand them, let's backtrack for a moment and review what is meant by percentiles.

Percentiles In our discussion of the median, we saw that a distribution could be divided into quartiles where the median was the second quartile (Q2), the point at or below which 50% of the scores fell and above which the remaining 50% fell. Instead of dividing a distribution of scores into quartiles, we might wish to divide the distribution into deciles, or ten equal parts. Alternatively, we could divide a distribution into 100 equal parts—100 percentiles. In such a distribution, the xth percentile is equal to the score at or below which x% of scores fall. Thus, the 15th percentile is the score at or below which 15% of the scores in the distribution fall. The 99th percentile is the score at or below which 99% of the scores in the distribution fall. If 99% of a particular standardization sample answered fewer than 47 questions on a test correctly, then we could say that a raw score of 47 corresponds to the 99th percentile on this test. It can be seen that a percentile is a ranking that conveys information about the relative position of a score within a distribution of scores. More formally defined, a percentile is an expression of the percentage of people whose score on a test or measure falls below a particular raw score.

Intimately related to the concept of a percentile as a description of performance on a test is the concept of percentage correct. Note that percentile and percentage correct are not synonymous. A percentile is a converted score that refers to a percentage of testtakers. Percentage correct refers to the distribution of raw scores—more specifically, to the number of items that were answered correctly multiplied by 100 and divided by the total number of items.

Because percentiles are easily calculated, they are a popular way of organizing all test-related data, including standardization sample data. Additionally, they lend themselves to use with a wide range of tests.
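Because the percentile/percentage-correct distinction is easy to blur, the short sketch below computes both for a hypothetical 50-item test and an invented 20-person norm group. (Conventions differ on whether a percentile counts scores strictly below or at-or-below a given raw score; the function notes the choice it makes.)

```python
def percentile_rank(raw_score, norm_scores):
    """Percentage of scores in the norm group at or below raw_score.
    (Some texts count only scores strictly below; conventions vary.)"""
    at_or_below = sum(1 for s in norm_scores if s <= raw_score)
    return 100 * at_or_below / len(norm_scores)

def percentage_correct(items_correct, total_items):
    """Number of items answered correctly, multiplied by 100 and divided by the total."""
    return 100 * items_correct / total_items

# Hypothetical 20-person norm group on a 50-item test.
norm_scores = [21, 24, 26, 28, 29, 30, 31, 31, 32, 33,
               34, 35, 36, 37, 38, 40, 41, 43, 45, 47]
print(percentile_rank(43, norm_scores))   # relative standing in the norm group
print(percentage_correct(43, 50))         # proportion of the item pool answered correctly
```

Run on these invented numbers, the same raw score of 43 corresponds to the 90th percentile but to only 86% correct, which illustrates why the two statistics should not be treated as interchangeable.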
Of course, every rose has its thorns. A problem with using percentiles with normally distributed scores is that real differences between raw scores may be minimized near the ends of the distribution and exaggerated in the middle of the distribution. This distortion may even be worse with highly skewed data. In the normal distribution, the highest frequency of raw scores occurs in the middle. That being the case, the differences between all those scores that cluster in the middle might be quite small, yet even the smallest differences will appear as differences in percentiles. The reverse is true at the extremes of the distributions, where differences between raw scores may be great, though we would have no way of knowing that from the relatively small differences in percentiles.

Age norms Also known as age-equivalent scores, age norms indicate the average performance of different samples of testtakers who were at various ages at the time the test was administered. If the measurement under consideration is height in inches, for example, then we know that scores (heights) for children will gradually increase at various rates as a function of age up to the middle to late teens. With the graying of America, there has been increased interest in performance on various types of psychological tests, particularly neuropsychological tests, as a function of advancing age.

Carefully constructed age norm tables for physical characteristics such as height enjoy widespread acceptance and are virtually noncontroversial. This is not the case, however, with respect to age norm tables for psychological characteristics such as intelligence. Ever since the introduction of the Stanford-Binet to this country in the early twentieth century, the idea of identifying the "mental age" of a testtaker has had great intuitive appeal. The child of any chronological age whose performance on a valid test of intellectual ability indicated that he or she had intellectual ability similar to that of the average child of some other age was said to have the mental age of the norm group in which his or her test score fell. The reasoning here was that, irrespective of chronological age, children with the same mental age could be expected to read the same level of material, solve the same kinds of math problems, reason with a similar level of judgment, and so forth.

Increasing sophistication about the limitations of the mental age concept has prompted assessment professionals to be hesitant about describing results in terms of mental age. The problem is that "mental age" as a way to report test results is too broad and too inappropriately generalized. To understand why, consider the case of a 6-year-old who, according to the tasks sampled on an intelligence test, performs intellectually like a 12-year-old. Regardless, the 6-year-old is likely not to be very similar at all to the average 12-year-old socially, psychologically, and in many other key respects. Beyond such obvious faults in mental age analogies, the mental age concept has also been criticized on technical grounds.3

Grade norms Designed to indicate the average test performance of testtakers in a given school grade, grade norms are developed by administering the test to representative samples of children over a range of consecutive grade levels (such as first through sixth grades). Next, the mean or median score for children at each grade level is calculated. Because the school year typically runs from September to June—ten months—fractions in the mean or median are easily expressed as decimals.
Thus, for example, a sixth-grader performing exactly at the average on a grade-normed test administered during the fourth month of the school year (December) would achieve a grade-equivalent score of 6.4. Like age norms, grade norms have great intuitive appeal. Children learn and develop at varying rates but in ways that are in some aspects predictable. Perhaps because of this fact, grade norms have widespread application, especially to children of elementary school age.

Now consider the case of a student in twelfth grade who scores "6" on a grade-normed spelling test. Does this mean that the student has the same spelling abilities as the average sixth-grader? The answer is no. What this finding means is that the student and a hypothetical, average sixth-grader answered the same fraction of items correctly on that test. Grade norms do not provide information as to the content or type of items that a student could or could not answer correctly. Perhaps the primary use of grade norms is as a convenient, readily understandable gauge of how one student's performance compares with that of fellow students in the same grade.

One drawback of grade norms is that they are useful only with respect to years and months of schooling completed. They have little or no applicability to children who are not yet in school or to children who are out of school. Further, they are not typically designed for use with adults who have returned to school.

◆ J U S T T H I N K . . .
Some experts in testing have called for a moratorium on the use of grade-equivalent as well as age-equivalent scores because such scores may so easily be misinterpreted. What is your opinion on this issue?

Both grade norms and age norms are referred to more generally as developmental norms, a term applied broadly to norms developed on the basis of any trait, ability, skill, or other characteristic that is presumed to develop, deteriorate, or otherwise be affected by chronological age, school grade, or stage of life. A Piagetian theorist might, for example, develop a test of the Piagetian concept of accommodation and then norm this test in terms of stage of life as set forth in Jean Piaget's writings. Such developmental norms would be subject to the same sorts of limitations we have previously described for other developmental norms that are not designed for application to physical characteristics. The accommodation test norms would further be subject to additional scrutiny and criticism by potential test users who do not subscribe to Piaget's theory.

3. For many years, IQ (intelligence quotient) scores on tests such as the Stanford-Binet were calculated by dividing mental age (as indicated by the test) by chronological age. The quotient would then be multiplied by 100 to eliminate the fraction. The distribution of IQ scores had a mean set at 100 and a standard deviation of approximately 16. A child of 12 with a mental age of 12 had an IQ of 100 (12/12 × 100 = 100). The technical problem here is that IQ standard deviations were not constant with age. At one age, an IQ of 116 might be indicative of performance at 1 standard deviation above the mean, whereas at another age an IQ of 121 might be indicative of performance at 1 standard deviation above the mean.
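The arithmetic behind these developmental scores is simple enough to sketch. The grade-equivalent calculation below follows the grade-plus-tenths convention described above, and the ratio IQ follows the mental-age formula recounted in footnote 3; the particular ages and scores used are invented for illustration.

```python
def grade_equivalent(grade, month_of_school_year):
    """Grade level plus one tenth for each month of the ten-month school year.
    For example, sixth grade, fourth month -> 6.4."""
    return grade + month_of_school_year / 10

def ratio_iq(mental_age, chronological_age):
    """Ratio IQ: mental age divided by chronological age, multiplied by 100."""
    return 100 * mental_age / chronological_age

print(grade_equivalent(6, 4))   # 6.4
print(ratio_iq(12, 12))         # 100.0, as in footnote 3
print(ratio_iq(12, 10))         # a 10-year-old performing like the average 12-year-old
```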
National norms As the name implies, national norms are derived from a normative sample that was nationally representative of the population at the time the norming study was conducted. In the fields of psychology and education, for example, national norms may be obtained by testing large numbers of people representative of different variables of interest such as age, gender, racial/ethnic background, socioeconomic strata, geographical location (such as North, East, South, West, Midwest), and different types of communities within the various parts of the country (such as rural, urban, suburban). If the test were designed for use in the schools, norms might be obtained for students in every grade to which the test aim