Psychological Assessment PDF
Ronald Cohen
Summary
This document provides an overview of psychological assessment, defining it as the gathering and integration of psychology-related data for psychological evaluation. It covers different assessment methods like tests, interviews, case studies, and behavioral observations, highlighting the importance of proper administration and interpretation. The text also touches on various settings for psychological assessment, such as educational, clinical, and business environments, along with the responsibilities of test users.
Full Transcript
Psychological Assessment Ronald Cohen

Chapter 1 Psychological Testing and Assessment

The use of testing to denote everything from test administration to test interpretation can be found in postwar textbooks as well as in various test-related writings for decades thereafter. However, by World War II a semantic distinction between testing and a more inclusive term, assessment, began to emerge.

We define psychological assessment as the gathering and integration of psychology-related data for the purpose of making a psychological evaluation that is accomplished through the use of tools such as tests, interviews, case studies, behavioral observation, and specially designed apparatuses and measurement procedures.

We define psychological testing as the process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior.

The process of assessment

In general, the process of assessment begins with a referral for assessment from a source such as a teacher, a school psychologist, a counselor, a judge, a clinician, or a corporate human resources specialist. Some examples of referral questions are: “Can this child function in a regular classroom?”; “Is this defendant competent to stand trial?”; and “How well can this employee be expected to perform if promoted to an executive position?”

The assessor prepares for the assessment by selecting the tools of assessment to be used. For example, if the assessment is in a corporate or military setting and the referral question concerns the assessee’s leadership ability, the assessor may wish to employ a measure (or two) of leadership. Subsequent to the selection of the instruments or procedures to be employed, the formal assessment will begin. After the assessment, the assessor writes a report of the findings that is designed to answer the referral question.
Other assessors view the process of assessment as more of a collaboration between the assessor and the assessee. For example, in one approach to assessment, referred to (logically enough) as collaborative psychological assessment, the assessor and assessee may work as “partners” from initial contact through final feedback.

Another approach to assessment that seems to have picked up momentum in recent years, most notably in educational settings, is referred to as dynamic assessment. While the term dynamic may at first glance suggest to some a psychodynamic or psychoanalytic approach to assessment, as used in this context it refers to the interactive, changing, or varying nature of the assessment.

The Tools of Psychological Assessment

The Test

A test may be defined simply as a measuring device or procedure. When the word test is prefaced with a modifier, it refers to a device or procedure designed to measure a variable related to that modifier. The term psychological test refers to a device or procedure designed to measure variables related to psychology (for example, intelligence, personality, aptitude, interests, attitudes, and values). Psychological tests and other tools of assessment may differ with respect to a number of variables such as content, format, administration procedures, scoring and interpretation procedures, and technical quality.

The Interview

If the interview is conducted face-to-face, then the interviewer is probably taking note of not only the content of what is said but also the way it is being said. More specifically, the interviewer is taking note of both verbal and nonverbal behavior. The interviewer may also take note of the way that the interviewee is dressed.
In an interview conducted by telephone, the interviewer may still be able to gain information beyond the responses to questions by being sensitive to variables such as changes in the interviewee’s voice pitch or the extent to which particular questions precipitate long pauses or signs of emotion in response. Interviews may be conducted by various electronic means, as would be the case with online interviews, e-mail interviews, and interviews conducted by means of text messaging. In its broadest sense, then, we can define an interview as a method of gathering information through direct communication involving reciprocal exchange.

The Portfolio

Students and professionals in many different fields of endeavor ranging from art to architecture keep files of their work products. These work products, whether retained on paper, canvas, film, video, audio, or some other medium, constitute what is called a portfolio. As samples of one’s ability and accomplishment, a portfolio may be used as a tool of evaluation.

Case History Data

Case history data refers to records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information, official and informal accounts, and other data and items relevant to an assessee.

Behavioral Observation

Behavioral observation, as it is employed by assessment professionals, may be defined as monitoring the actions of others or oneself by visual or electronic means while recording quantitative and/or qualitative information regarding the actions. Sometimes researchers venture outside of the confines of clinics, classrooms, workplaces, and research laboratories in order to observe behavior of humans in a natural setting, that is, the setting in which the behavior would typically be expected to occur. This variety of behavioral observation is referred to as naturalistic observation.
Role-Play Tests

Role play may be defined as acting an improvised or partially improvised part in a simulated situation. A role-play test is a tool of assessment wherein assessees are directed to act as if they were in a particular situation. Assessees may then be evaluated with regard to their expressed thoughts, behaviors, abilities, and other variables.

Computers as Tools

We have already made reference to the role computers play in contemporary assessment in the context of generating simulations. But perhaps the more obvious role as a tool of assessment is their role in test administration, scoring, and interpretation.

The acronym CAPA refers to the term computer assisted psychological assessment. By the way, here the word assisted typically refers to the assistance computers provide to the test user, not the testtaker.

Another acronym you may come across is CAT, which stands for computer adaptive testing. The adaptive in this term is a reference to the computer’s ability to tailor the test to the testtaker’s ability or testtaking pattern. So, for example, on a computerized test of academic abilities, the computer might be programmed to switch from testing math skills to English skills after three consecutive failures on math items.

Who Are the Parties?

The test developer

Test developers and publishers create tests or other methods of assessment. Test developers and publishers appreciate the significant impact that test results can have on people’s lives. Accordingly, a number of professional organizations have published standards of ethical behavior that specifically address aspects of responsible test development and use.

The test user

Psychological tests and assessment methodologies are used by a wide range of professionals, including clinicians, counselors, school psychologists, human resources personnel, consumer psychologists, experimental psychologists, social psychologists, ...; the list goes on.
The testtaker

In the broad sense in which we are using the term testtaker, anyone who is the subject of an assessment or an evaluation can be a testtaker or an assessee. A psychological autopsy may be defined as a reconstruction of a deceased individual’s psychological profile on the basis of archival records, artifacts, and interviews previously conducted with the deceased assessee or with people who knew him or her.

In What Types of Settings Are Assessments Conducted, and Why?

Educational settings

As mandated by law, tests are administered early in school life to help identify children who may have special needs. In addition to school ability tests, another type of test commonly given in schools is an achievement test, which evaluates accomplishment or the degree of learning that has taken place.

Clinical settings

The tests employed in clinical settings may be intelligence tests, personality tests, neuropsychological tests, or other specialized instruments, depending on the presenting or suspected problem area. The hallmark of testing in clinical settings is that the test or measurement technique is employed with only one individual at a time.

Counseling settings

Assessment in a counseling context may occur in environments as diverse as schools, prisons, and government or privately owned institutions. Regardless of the particular tools used, the ultimate objective of many such assessments is the improvement of the assessee in terms of adjustment, productivity, or some related variable.

Geriatric settings

Wherever older individuals reside, they may at some point require psychological assessment to evaluate cognitive, psychological, adaptive, or other functioning. At issue in many such assessments is the extent to which assessees are enjoying as good a quality of life as possible.
Business and military settings

A wide range of achievement, aptitude, interest, motivational, and other tests may be employed in the decision to hire as well as in related decisions regarding promotions, transfer, job satisfaction, and eligibility for further training.

Governmental and organizational credentialing

Before they are legally entitled to practice medicine, physicians must pass an examination. Law-school graduates cannot present themselves to the public as attorneys until they pass their state’s bar examination. Psychologists, too, must pass an examination before adopting the official title of “psychologist”.

How Are Assessments Conducted?

Responsible test users have obligations before, during, and after a test or any measurement procedure is administered. Test users have the responsibility of ensuring that the room in which the test will be conducted is suitable and conducive to the testing. To the extent that it is possible, distracting conditions such as excessive noise, heat, cold, interruptions, glaring sunlight, crowding, inadequate ventilation, and so forth should be avoided. It is important that attempts to establish rapport with the testtaker not compromise any rules of the test administration instructions.

After a test administration, test users have many obligations as well. These obligations range from safeguarding the test protocols to conveying the test results in a clearly understandable fashion.

Assessment of people with disabilities

People with disabilities are assessed for exactly the same reasons that people with no disabilities are assessed: to obtain employment, to earn a professional credential, to be screened for psychopathology, and so forth. In the context of psychological testing and assessment, accommodation may be defined as the adaptation of a test, procedure, or situation, or the substitution of one test for another, to make the assessment more suitable for an assessee with exceptional needs.
Alternate assessment is an evaluative or diagnostic procedure or process that varies from the usual, customary, or standardized way a measurement is derived either by virtue of some special accommodation made to the assessee or by means of alternative methods designed to measure the same variable(s).

Where to Go for Authoritative Information: Reference Sources

Test catalogues

As you might expect, however, publishers’ catalogues usually contain only a brief description of the test and seldom contain the kind of detailed technical information that a prospective user might require. Moreover, the catalogue’s objective is to sell the test. For this reason, highly critical reviews of a test are seldom, if ever, found in a publisher’s test catalogue.

Test manuals

Detailed information concerning the development of a particular test and technical information relating to it should be found in the test manual, which is usually available from the test publisher.

Reference volumes

The Buros Institute of Mental Measurements provides “one-stop shopping” for a great deal of test-related information. The initial version of what would evolve into the Mental Measurements Yearbook was compiled by Oscar Buros in 1933. At this writing, the latest edition of this authoritative compilation of test reviews is the 17th Annual Mental Measurements Yearbook published in 2007 (though the 18th cannot be far behind).

Journal articles

Articles in current journals may contain reviews of the test, updated or independent studies of its psychometric soundness, or examples of how the instrument was used in either research or an applied context. In addition to articles relevant to specific tests, journals are a rich source of information on important trends in testing and assessment.

Online databases

One of the most widely used bibliographic databases for test-related publications is that maintained by the Educational Resources Information Center (ERIC). Funded by the U.S.
Department of Education and operated out of the University of Maryland, the ERIC Web site at www.eric.ed.gov contains a wealth of resources and news about tests, testing, and assessment. The American Psychological Association (APA) maintains a number of databases useful in locating psychology-related information in journal articles, book chapters, and doctoral dissertations.

The world’s largest private measurement institution is Educational Testing Service (ETS). This company, based in Princeton, New Jersey, maintains a staff of some 2,500 people, including about 1,000 measurement professionals and education specialists. These are the folks who bring you the Scholastic Aptitude Test (SAT) and the Graduate Record Exam (GRE), among many other tests.

Chapter 2 Historical, Cultural, and Legal/Ethical Considerations

A Historical Perspective

Antiquity to the Nineteenth Century

It is believed that tests and testing programs first came into being in China as early as 2200 b.c.e. Testing was instituted as a means of selecting who, of many applicants, would obtain government jobs. In general, proficiency in endeavors such as music, archery, horsemanship, writing, and arithmetic was examined. Also important were subjects such as agriculture, geography, revenue, civil law, and military strategy. Knowledge and skill with respect to the rites and ceremonies of public and social life were also evaluated. During the Song dynasty, emphasis was placed on knowledge of classical literature. Testtakers who demonstrated their command of the classics were perceived as having acquired the wisdom of the past; they were therefore entitled to a government position.

In 1859, a book was published entitled On the Origin of Species by Means of Natural Selection by Charles Darwin. In this important, far-reaching work, Darwin argued that chance variation in species would be selected or rejected by nature according to adaptivity and survival value.
Indeed, Darwin’s writing on individual differences kindled interest in research on heredity in his half cousin, Francis Galton. In the course of his efforts to explore and quantify individual differences between people, Galton became an extremely influential contributor to the field of measurement. Galton aspired to classify people “according to their natural gifts” and to ascertain their “deviation from an average”. Along the way, Galton would be credited with devising or contributing to the development of many contemporary tools of psychological assessment including questionnaires, rating scales, and self-report inventories.

Galton’s initial work on heredity was done with sweet peas, in part because there tended to be fewer variations among the peas in a single pod. In this work, Galton pioneered the use of a statistical concept central to psychological experimentation and testing: the coefficient of correlation. Although Karl Pearson developed the product-moment correlation technique, its roots can be traced directly to the work of Galton.

From heredity in peas, Galton’s interest turned to heredity in humans and various ways of measuring aspects of people and their abilities. At an exhibition in London in 1884, Galton displayed his Anthropometric Laboratory where, for three or four pence (depending on whether you were already registered or not), you could be measured on variables such as height (standing), height (sitting), arm span, weight, breathing capacity, strength of pull, strength of squeeze, swiftness of blow, keenness of sight, memory of form, discrimination of color, and steadiness of hand.

Assessment was also an important activity at the first experimental psychology laboratory, founded at the University of Leipzig in Germany by Wilhelm Max Wundt, a medical doctor whose title at the university was professor of philosophy.
Wundt and his students tried to formulate a general description of human abilities with respect to variables such as reaction time, perception, and attention span. In contrast to Galton, Wundt focused on questions relating to how people were similar, not different. In fact, individual differences were viewed by Wundt as a frustrating source of error in experimentation.

In spite of the prevailing research focus on people’s similarities, one of Wundt’s students at Leipzig, an American named James McKeen Cattell, completed a doctoral dissertation that dealt with individual differences; specifically, individual differences in reaction time. After receiving his doctoral degree from Leipzig, Cattell returned to the United States, teaching at Bryn Mawr and then at the University of Pennsylvania before leaving for Europe to teach at Cambridge. At Cambridge, Cattell came in contact with Galton, whom he later described as “the greatest man I have known”. Inspired by his interaction with Galton, Cattell returned to the University of Pennsylvania in 1888 and coined the term mental test in an 1890 publication.

In 1921, Cattell was instrumental in founding the Psychological Corporation, which named 20 of the country’s leading psychologists as its directors. The goal of the corporation was the “advancement of psychology and the promotion of the useful applications of psychology”.

Other students of Wundt at Leipzig included Charles Spearman, Victor Henri, Emil Kraepelin, E. B. Titchener, G. Stanley Hall, and Lightner Witmer. Spearman is credited with originating the concept of test reliability as well as building the mathematical framework for the statistical technique of factor analysis. Victor Henri is the Frenchman who would collaborate with Alfred Binet on papers suggesting how mental tests could be used to measure higher mental processes. Psychiatrist Emil Kraepelin was an early experimenter with the word association technique as a formal test.
Lightner Witmer received his Ph.D. from Leipzig and went on to succeed Cattell as director of the psychology laboratory at the University of Pennsylvania. Witmer has been cited as the “little-known founder of clinical psychology”, owing at least in part to his being challenged to treat a “chronic bad speller” in March of 1896. Later that year, Witmer founded the first psychological clinic in the United States at the University of Pennsylvania. In 1907, Witmer founded the journal Psychological Clinic. The first article in that journal was entitled “Clinical Psychology”.

The Twentieth Century

The measurement of intelligence

As early as 1895, Alfred Binet and his colleague Victor Henri published several articles in which they argued for the measurement of abilities such as memory and social comprehension. Ten years later, Binet and collaborator Theodore Simon published a 30-item “measuring scale of intelligence” designed to help identify mentally retarded Paris schoolchildren.

In 1939, David Wechsler, a clinical psychologist at Bellevue Hospital in New York City, introduced a test designed to measure adult intelligence. For Wechsler, intelligence was “the aggregate or global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment”. Originally christened the Wechsler-Bellevue Intelligence Scale, the test was subsequently revised and renamed the Wechsler Adult Intelligence Scale (WAIS). The WAIS has been revised several times since then, and versions of Wechsler’s test have been published that extend the age range of testtakers from young children through senior adulthood.

A natural outgrowth of the individually administered intelligence test devised by Binet was the group intelligence test. Group intelligence tests came into being in the United States in response to the military’s need for an efficient method of screening the intellectual ability of World War I recruits.
The measurement of personality

World War I had brought with it not only the need to screen the intellectual functioning of recruits but also the need to screen for recruits’ general adjustment. A government Committee on Emotional Fitness chaired by psychologist Robert S. Woodworth was assigned the task of developing a measure of adjustment and emotional stability that could be administered quickly and efficiently to groups of recruits. The committee developed several experimental versions of what were, in essence, paper-and-pencil psychiatric interviews. To disguise the true purpose of one such test, the questionnaire was labeled as a “Personal Data Sheet.” Draftees and volunteers were asked to indicate yes or no to a series of questions that probed for the existence of various kinds of psychopathology. For example, one of the test questions was “Are you troubled with the idea that people are watching you on the street?”

After the war, Woodworth developed a personality test for civilian use that was based on the Personal Data Sheet. He called it the Woodworth Psychoneurotic Inventory. This instrument was the first widely used self-report test of personality, a method of assessment that would soon be employed in a long line of succeeding personality tests.

A projective test is one in which an individual is assumed to “project” onto some ambiguous stimulus his or her own unique needs, fears, hopes, and motivation. The ambiguous stimulus might be an inkblot, a drawing, a photograph, or something else.

Culture and Assessment

Culture may be defined as “the socially transmitted behavior patterns, beliefs, and products of work of a particular population, community, or group of people”.

Evolving Interest in Culture-Related Issues

Soon after Alfred Binet introduced intelligence testing in France, the U.S. Public Health Service began using such tests to measure the intelligence of people seeking to immigrate to the United States. Henry H.
Goddard, who had been highly instrumental in getting Binet’s test adopted for use in various settings in the United States, was the chief researcher assigned to the project.

Despite an impressive list of career accomplishments, the light of history has not shone favorably on Henry Goddard. Goddard’s recommendation for segregation of the mentally deficient and his calls for their sterilization tend to be viewed, at best, as misguided. He focused on the nature side of the nature–nurture controversy not because he was an ardent eugenicist at heart but rather because the nature side of the coin was where researchers at the time all tended to focus.

One way that early test developers attempted to deal with the impact of language and culture on tests of mental ability was, in essence, to “isolate” the cultural variable. So-called culture-specific tests, or tests designed for use with people from one culture but not from another, soon began to appear on the scene.

Test-user qualifications

Level A: Tests or aids that can adequately be administered, scored, and interpreted with the aid of the manual and a general orientation to the kind of institution or organization in which one is working (for instance, achievement or proficiency tests).

Level B: Tests or aids that require some technical knowledge of test construction and use and of supporting psychological and educational fields such as statistics, individual differences, psychology of adjustment, personnel psychology, and guidance (e.g., aptitude tests and adjustment inventories applicable to normal populations).

Level C: Tests and aids that require substantial understanding of testing and supporting psychological fields together with supervised experience in the use of these devices (for instance, projective tests, individual mental tests).
Testing people with disabilities

Specifically, these difficulties may include (1) transforming the test into a form that can be taken by the testtaker, (2) transforming the responses of the testtaker so that they are scorable, and (3) meaningfully interpreting the test data.

Computerized test administration, scoring, and interpretation

For assessment professionals, some major issues with regard to computer-assisted psychological assessment (CAPA) are as follows.

1. Access to test administration, scoring, and interpretation software. Despite purchase restrictions on software and technological safeguards to guard against unauthorized copying, software may still be copied. Unlike test kits, which may contain manipulatable objects, manuals, and other tangible items, a computer-administered test may be easily copied and duplicated.

2. Comparability of pencil-and-paper and computerized versions of tests. Many tests once available only in a paper-and-pencil format are now available in computerized form as well. In many instances, the comparability of the traditional and the computerized forms of the test has not been researched or has only insufficiently been researched.

3. The value of computerized test interpretations. Many tests available for computerized administration also come with computerized scoring and interpretation procedures. Thousands of words are spewed out every day in the form of test interpretation results, but the value of these words in many cases is questionable.

4. Unprofessional, unregulated “psychological testing” online. A growing number of Internet sites purport to provide, usually for a fee, online psychological tests. Yet the vast majority of the tests offered would not meet a psychologist’s standards. Assessment professionals wonder about the long-term effect of these largely unprofessional and unregulated “psychological testing” sites. Might they, for example, contribute to more public skepticism about psychological tests?
The Rights of Testtakers

The right of informed consent

Testtakers have a right to know why they are being evaluated, how the test data will be used, and what (if any) information will be released to whom. With full knowledge of such information, testtakers give their informed consent to be tested. The disclosure of the information needed for consent must, of course, be in language the testtaker can understand. Thus, for a testtaker as young as 2 or 3 years of age or an individual who is mentally retarded with limited language ability, a disclosure before testing might be worded as follows: “I’m going to ask you to try to do some things so that I can see what you know how to do and what things you could use some more help with”.

If a testtaker is incapable of providing an informed consent to testing, such consent may be obtained from a parent or a legal representative. Consent must be in written rather than oral form. The written form should specify (1) the general purpose of the testing, (2) the specific reason it is being undertaken in the present case, and (3) the general type of instruments to be administered. A full disclosure and debriefing would be made after the testing.

Various professional organizations have created policies and guidelines regarding deception in research. For example, the APA Ethical Principles of Psychologists and Code of Conduct provides that psychologists (a) do not use deception unless it is absolutely necessary, (b) do not use deception at all if it will cause participants emotional distress, and (c) fully debrief participants.

The right to be informed of test findings

In a bygone era, the inclination of many psychological assessors, particularly many clinicians, was to tell testtakers as little as possible about the nature of their performance on a particular test or test battery. In no case would they disclose diagnostic conclusions that could arouse anxiety or precipitate a crisis.
This orientation was reflected in at least one authoritative text that advised testers to keep information about test results superficial and focus only on “positive” findings. This was done so that the examinee would leave the test session feeling “pleased and satisfied”. But all that has changed, and giving realistic information about test performance to examinees is not only ethically and legally mandated but may be useful from a therapeutic perspective as well. Testtakers have a right to be informed, in language they can understand, of the nature of the findings with respect to a test they have taken.

The right to privacy and confidentiality

The concept of the privacy right “recognizes the freedom of the individual to pick and choose for himself the time, circumstances, and particularly the extent to which he wishes to share or withhold from others his attitudes, beliefs, behavior, and opinions”. When people in court proceedings “take the Fifth” and refuse to answer a question put to them on the grounds that the answer might be self-incriminating, they are asserting a right to privacy provided by the Fifth Amendment to the Constitution. The information withheld in such a manner is termed privileged; it is information that is protected by law from disclosure in a legal proceeding.

State statutes have extended the concept of privileged information to parties who communicate with each other in the context of certain relationships, including the lawyer-client relationship, the doctor-patient relationship, the priest-penitent relationship, and the husband-wife relationship. In most states, privilege is also accorded to the psychologist-client relationship. Stated another way, it is for society’s good if people feel confident that they can talk freely to their attorneys, clergy, physicians, psychologists, and spouses.
Professionals such as psychologists who are parties to such special relationships have a legal and ethical duty to keep their clients’ communications confidential. Confidentiality may be distinguished from privilege in that, whereas “confidentiality concerns matters of communication outside the courtroom, privilege protects clients from disclosure in judicial proceedings”.

Privilege is not absolute. There are occasions when a court can deem the disclosure of certain information necessary and can order the disclosure of that information. Should the psychologist or other professional so ordered refuse, he or she does so under the threat of going to jail, being fined, and other legal consequences.

Privilege in the psychologist-client relationship belongs to the client, not the psychologist. The competent client can direct the psychologist to disclose information to some third party (such as an attorney or an insurance carrier), and the psychologist is obligated to make the disclosure. In some rare instances, the psychologist may be ethically (if not legally) compelled to disclose information if that information will prevent harm either to the client or to some endangered third party. Clinicians may have a duty to warn endangered third parties not only of potential violence but of potential AIDS infection from an HIV-positive client as well as other threats to their physical well-being.

Another ethical mandate with regard to confidentiality involves the safekeeping of test data. Test users must take reasonable precautions to safeguard test records. If these data are stored in a filing cabinet, then the cabinet should be locked and preferably made of steel. If these data are stored in a computer, electronic safeguards must be taken to ensure only authorized access.

The right to the least stigmatizing label

The Standards advise that the least stigmatizing labels should always be assigned when reporting test results.
Chapter 3 A Statistics Refresher Scales of Measurement If, for example, research subjects were to be categorized as either female or male, the categorization scale would be said to be discrete because it would not be meaningful to categorize a subject as anything other than female or male. In contrast, a continuous scale exists when it is theoretically possible to divide any of the values of the scale. Measurement always involves error. In the language of assessment, error refers to the collective influence of all of the factors on a test score or measurement beyond those specifically measured by the test or measurement. Measurement using continuous scales always involves error. Most scales used in psychological and educational assessment are continuous and therefore can be expected to contain this sort of error. The number or score used to characterize the trait being measured on a continuous scale should be thought of as an approximation of the “real” number. Nominal Scales Nominal scales are the simplest form of measurement. These scales involve classification or categorization based on one or more distinguishing characteristics, where all things measured must be placed into mutually exclusive and exhaustive categories. Ordinal Scales Like nominal scales, ordinal scales permit classification. However, in addition to classification, rank ordering on some characteristic is also permissible with ordinal scales. Although he may have never used the term ordinal scale, Alfred Binet, a developer of the intelligence test that today bears his name, believed strongly that the data derived from an intelligence test are ordinal in nature. He emphasized that what he tried to do in the test was not to measure people, as one might measure a person’s height, but merely to classify (and rank) people on the basis of their performance on the tasks. 
Interval Scales In addition to the features of nominal and ordinal scales, interval scales contain equal intervals between numbers. Each unit on the scale is exactly equal to any other unit on the scale. But like ordinal scales, interval scales contain no absolute zero point. Scores on many tests, such as tests of intelligence, are analyzed statistically in ways appropriate for data at the interval level of measurement. The difference in intellectual ability represented by IQs of 80 and 100, for example, is thought to be similar to that existing between IQs of 100 and 120. However, if an individual were to achieve an IQ of 0 (something that is not even possible, given the way most intelligence tests are structured), that would not be an indication of zero (the total absence of) intelligence. Because interval scales contain no absolute zero point, a presumption inherent in their use is that no testtaker possesses none of the ability or trait (or whatever) being measured. Ratio Scales In addition to all the properties of nominal, ordinal, and interval measurement, a ratio scale has a true zero point. All mathematical operations can meaningfully be performed because there exist equal intervals between the numbers on the scale as well as a true or absolute zero point. In psychology, ratio-level measurement is employed in some types of tests and test items, perhaps most notably those involving assessment of neurological functioning. One example is a test of hand grip, where the variable measured is the amount of pressure a person can exert with one hand. Another example is a timed test of perceptual-motor ability that requires the testtaker to assemble a jigsaw-like puzzle. In such an instance, the time taken to successfully complete the puzzle is the measure that is recorded. 
Because there is a true zero point on this scale (that is, 0 seconds), it is meaningful to say that a testtaker who completes the assembly in 30 seconds has taken half the time of a testtaker who completed it in 60 seconds. Measurement Scales in Psychology The ordinal level of measurement is most frequently used in psychology. As Kerlinger put it: “Intelligence, aptitude, and personality test scores are, basically and strictly speaking, ordinal. These tests indicate with more or less accuracy not the amount of intelligence, aptitude, and personality traits of individuals, but rather the rank-order positions of the individuals.” Describing Data A distribution may be defined as a set of test scores arrayed for recording or study. The 25 scores in this distribution are referred to as raw scores. As its name implies, a raw score is a straightforward, unmodified accounting of performance that is usually numerical. Frequency Distributions The data from the test could be organized into a distribution of the raw scores. One way the scores could be distributed is by the frequency with which they occur. In a frequency distribution, all scores are listed alongside the number of times each score occurred. The scores might be listed in tabular or graphic form. Often, a frequency distribution is referred to as a simple frequency distribution to indicate that individual scores have been used and the data have not been grouped. In a grouped frequency distribution, test-score intervals, also called class intervals, replace the actual test scores. The number of class intervals used and the size or width of each class interval (i.e., the range of test scores contained in each class interval) are for the test user to decide. Frequency distributions of test scores can also be illustrated graphically. A graph is a diagram or chart composed of lines, points, bars, or other symbols that describe and illustrate data. 1. A histogram is a graph with vertical lines drawn at the true limits of each test score (or class interval), forming a series of contiguous rectangles. It is customary for the test scores (either the single scores or the midpoints of the class intervals) to be placed along the graph’s horizontal axis (also referred to as the abscissa or X-axis) and for numbers indicative of the frequency of occurrence to be placed along the graph’s vertical axis (also referred to as the ordinate or Y-axis). 2. In a bar graph, numbers indicative of frequency also appear on the Y-axis, and reference to some categorization (e.g., yes/no/maybe, male/female) appears on the X-axis. Here the rectangular bars typically are not contiguous. 3. Data illustrated in a frequency polygon are expressed by a continuous line connecting the points where test scores or class intervals (as indicated on the X-axis) meet frequencies (as indicated on the Y-axis). Measures of Central Tendency A measure of central tendency is a statistic that indicates the average or midmost score between the extreme scores in a distribution. The arithmetic mean The arithmetic mean, denoted by the symbol $\bar{X}$ (pronounced “X bar”), is equal to the sum of the observations (or test scores in this case) divided by the number of observations. Symbolically written, the formula for the arithmetic mean is $\bar{X} = \frac{\sum X}{n}$, where n equals the number of observations or test scores. The arithmetic mean is typically the most appropriate measure of central tendency for interval or ratio data when the distributions are believed to be approximately normal. An arithmetic mean can also be computed from a frequency distribution. The formula for doing this is $\bar{X} = \frac{\sum (fX)}{n}$, where f is the frequency with which each score X occurs. The median The median, defined as the middle score in a distribution, is another commonly used measure of central tendency. 
We determine the median of a distribution of scores by ordering the scores in a list by magnitude, in either ascending or descending order. The median is an appropriate measure of central tendency for ordinal, interval, and ratio data. The median may be a particularly useful measure of central tendency in cases where relatively few scores fall at the high end of the distribution or relatively few scores fall at the low end of the distribution. The mode The most frequently occurring score in a distribution of scores is the mode. If adjacent scores occur equally often and more often than other scores, custom dictates that the mode be referred to as the average. These scores are said to have a bimodal distribution because there are two scores (51 and 66) that occur with the highest frequency (of two). Except with nominal data, the mode tends not to be a very commonly used measure of central tendency. Unlike the arithmetic mean, which has to be calculated, the value of the modal score is not calculated; one simply counts and determines which score occurs most frequently. In fact, it is theoretically possible for a bimodal distribution to have two modes each of which falls at the high or the low end of the distribution—thus violating the expectation that a measure of central tendency should be... well, central (or indicative of a point at the middle of the distribution). The mode is useful in analyses of a qualitative or verbal nature. For example, when assessing consumers’ recall of a commercial by means of interviews, a researcher might be interested in which word or words were mentioned most by interviewees. Because the mode is not calculated in a true sense, it is a nominal statistic and cannot legitimately be used in further calculations. Measures of Variability Variability is an indication of how scores in a distribution are scattered or dispersed. 
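As a minimal sketch of the measures of central tendency described above, the following Python fragment builds a simple frequency distribution and then computes the mean (once from the raw scores and once from the frequencies), the median, and the mode. The eight scores are hypothetical and were chosen only so that 51 and 66 tie for the highest frequency; the variable names are likewise assumptions made for illustration.

```python
from collections import Counter
from statistics import median, multimode

# Hypothetical raw scores, invented for illustration only.
scores = [51, 51, 60, 62, 66, 66, 70, 74]
n = len(scores)

# Simple frequency distribution: each score alongside the number of times it occurs.
frequency = Counter(scores)

# Arithmetic mean from raw scores: sum of the scores divided by n.
mean_raw = sum(scores) / n

# The same mean from the frequency distribution: sum of (f * X) divided by n.
mean_from_freq = sum(f * x for x, f in frequency.items()) / n

# Median: the middle score once the scores are ordered by magnitude
# (with an even number of scores, the two middle scores are averaged).
median_score = median(scores)

# Mode: the most frequently occurring score; multimode returns every tied score.
modes = multimode(scores)

print(mean_raw, mean_from_freq, median_score, modes)
```

Because two scores tie for the highest frequency, `multimode` reports both of them, which is what is meant by a bimodal distribution.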
Two or more distributions of test scores can have the same mean even though differences in the dispersion of scores around the mean can be wide. Statistics that describe the amount of variation in a distribution are referred to as measures of variability. Some measures of variability include the range, the interquartile range, the semi-interquartile range, the average deviation, the standard deviation, and the variance. The range The range of a distribution is equal to the difference between the highest and the lowest scores. The range is the simplest measure of variability to calculate, but its potential use is limited. Because the range is based entirely on the values of the lowest and highest scores, one extreme score (if it happens to be the lowest or the highest) can radically alter the value of the range. As a descriptive statistic of variation, the range provides a quick but gross description of the spread of scores. When its value is based on extreme scores in a distribution, the resulting description of variation may be understated or overstated. The interquartile and semi-interquartile ranges A distribution of test scores (or any other data, for that matter) can be divided into four parts such that 25% of the test scores occur in each quarter. As illustrated below, the dividing points between the four quarters in the distribution are the quartiles. There are three of them, respectively labeled Q1, Q2, and Q3. Note that quartile refers to a specific point whereas quarter refers to an interval. An individual score may, for example, fall at the third quartile or in the third quarter (but not “in” the third quartile or “at” the third quarter). It should come as no surprise to you that Q2 and the median are exactly the same. And just as the median is the midpoint in a distribution of scores, so are quartiles Q1 and Q3 the quarter-points in a distribution of scores. 
The interquartile range is a measure of variability equal to the difference between Q3 and Q1. Like the median, it is an ordinal statistic. A related measure of variability is the semi-interquartile range, which is equal to the interquartile range divided by 2. Knowledge of the relative distances of Q1 and Q3 from Q2 (the median) provides the seasoned test interpreter with immediate information as to the shape of the distribution of scores. In a perfectly symmetrical distribution, Q1 and Q3 will be exactly the same distance from the median. If these distances are unequal then there is a lack of symmetry. This lack of symmetry is referred to as skewness, and we will have more to say about that shortly. The average deviation Another tool that could be used to describe the amount of variability in a distribution is the average deviation, or AD for short. Its formula is $AD = \frac{\sum |x|}{n}$. The lowercase italic x in the formula signifies a score’s deviation from the mean. The value of x is obtained by subtracting the mean from the score (X - mean = x). The bars on each side of x indicate that it is the absolute value of the deviation score (ignoring the positive or negative sign and treating all deviation scores as positive). All the deviation scores are then summed and divided by the total number of scores (n) to arrive at the average deviation. The average deviation is rarely used. Perhaps this is so because the deletion of algebraic signs renders it a useless measure for purposes of any further operations. Why, then, discuss it here? The reason is that a clear understanding of what an average deviation measures provides a solid foundation for understanding the conceptual basis of another, more widely used measure: the standard deviation. 
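The interquartile range, the semi-interquartile range, and the average deviation can be sketched in a few lines of Python. The nine scores below are hypothetical, and the quartiles are located with the "inclusive" interpolation rule of the standard library's `statistics.quantiles`; that rule is an assumption made here, since other conventions for locating quartiles exist and can give slightly different values.

```python
from statistics import mean, median, quantiles

# Hypothetical, perfectly symmetrical set of scores (illustration only).
scores = [40, 45, 50, 55, 60, 65, 70, 75, 80]
n = len(scores)

# The three quartiles; the "inclusive" method is one convention among several.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")

interquartile_range = q3 - q1
semi_interquartile_range = interquartile_range / 2

# Deviation scores: x = X - mean. Their signed sum is zero, which is why
# the average deviation works with absolute values instead.
m = mean(scores)
deviations = [x - m for x in scores]
signed_sum = sum(deviations)

# Average deviation: sum of |x| divided by n.
average_deviation = sum(abs(x) for x in deviations) / n

print(q1, q2, q3, interquartile_range, semi_interquartile_range)
print(signed_sum, average_deviation)
```

In this symmetrical example Q2 coincides with the median, and Q1 and Q3 lie the same distance from it; unequal distances would point to skewness.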
The standard deviation Recall that, when we calculated the average deviation, the problem of the sum of all deviation scores around the mean equaling zero was solved by employing only the absolute value of the deviation scores. In calculating the standard deviation, the same problem must be dealt with, but we do so in a different way. Instead of using the absolute value of each deviation score, we use the square of each deviation score. With each deviation score squared, the sign of any negative deviation becomes positive. Because all the deviation scores are squared, we know that our calculations won’t be complete until we go back and obtain the square root of whatever value we reach. We may define the standard deviation as a measure of variability equal to the square root of the average squared deviations about the mean. More succinctly, it is equal to the square root of the variance. The variance is equal to the arithmetic mean of the squares of the differences between the scores in a distribution and their mean. The formula used to calculate the variance ($s^2$) using deviation scores is $s^2 = \frac{\sum x^2}{n}$. The formula for the standard deviation is $s = \sqrt{\frac{\sum x^2}{n}}$. Skewness Distributions can be characterized by their skewness, or the nature and extent to which symmetry is absent. A distribution has a positive skew when relatively few of the scores fall at the high end of the distribution. Positively skewed examination results may indicate that the test was too difficult. More items that were easier would have been desirable in order to better discriminate at the lower end of the distribution of test scores. A distribution has a negative skew when relatively few of the scores fall at the low end of the distribution. Negatively skewed examination results may indicate that the test was too easy. In this case, more items of a higher level of difficulty would make it possible to better discriminate between scores at the upper end of the distribution. 
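Returning briefly to the variance and the standard deviation defined above, a short Python sketch may help make the steps concrete. It follows the definition given in the text, dividing the summed squared deviations by n (the number of scores); the data are hypothetical, and the cross-check uses `statistics.pvariance` and `statistics.pstdev`, which also divide by n. Formulas that divide by n - 1 to estimate a population value from a sample are a different convention and are not shown here.

```python
import math
from statistics import mean, pstdev, pvariance

# Hypothetical scores chosen so the arithmetic is easy to follow.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(scores)

# Step 1: deviation scores, x = X - mean.
m = mean(scores)
deviations = [x - m for x in scores]

# Step 2: square each deviation score so that negative signs disappear.
squared = [d ** 2 for d in deviations]

# Step 3: the variance is the arithmetic mean of the squared deviations.
variance = sum(squared) / n

# Step 4: the standard deviation is the square root of the variance.
standard_deviation = math.sqrt(variance)

print(variance, standard_deviation)
# Cross-check against the standard library's divide-by-n versions.
print(pvariance(scores), pstdev(scores))
```

Taking the square root in the last step returns the measure to the original score units, which is why the standard deviation is usually easier to interpret than the variance.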
Kurtosis The term testing professionals use to refer to the steepness of a distribution in its center is kurtosis. Distributions are generally described as platykurtic (relatively flat), leptokurtic (relatively peaked), or, somewhere in the middle, mesokurtic. The Normal Curve Development of the concept of a normal curve began in the middle of the eighteenth century with the work of Abraham DeMoivre and, later, the Marquis de Laplace. At the beginning of the nineteenth century, Karl Friedrich Gauss made some substantial contributions. Through the early nineteenth century, scientists referred to it as the “Laplace-Gaussian curve.” Karl Pearson is credited with being the first to refer to the curve as the normal curve, perhaps in an effort to be diplomatic to all of the people who helped develop it. Theoretically, the normal curve is a bell-shaped, smooth, mathematically defined curve that is highest at its center. From the center it tapers on both sides approaching the X-axis asymptotically (meaning that it approaches, but never touches, the axis). In theory, the distribution of the normal curve ranges from negative infinity to positive infinity. The curve is perfectly symmetrical, with no skewness. Because it is symmetrical, the mean, the median, and the mode all have the same exact value. The Area under the Normal Curve 50% of the scores occur above the mean and 50% of the scores occur below the mean. Approximately 34% of all scores occur between the mean and 1 standard deviation above the mean. Approximately 34% of all scores occur between the mean and 1 standard deviation below the mean. Approximately 68% of all scores occur within ±1 standard deviation of the mean. Approximately 95% of all scores occur within ±2 standard deviations of the mean. A normal curve has two tails. The area on the normal curve between 2 and 3 standard deviations above the mean is referred to as a tail. 
The area between -2 and -3 standard deviations below the mean is also referred to as a tail. As a general rule (with ample exceptions), the larger the sample size and the wider the range of abilities measured by a particular test, the more the graph of the test scores will approximate the normal curve. In terms of mental ability as operationalized by tests of intelligence, performance that is approximately two standard deviations from the mean (i.e., IQ of 70–75 or lower or IQ of 125–130 or higher) is one key element in identification. Success at life’s tasks, or its absence, also plays a defining role, but the primary classifying feature of both gifted and retarded groups is intellectual deviance. These individuals are out of sync with more average people, simply by their difference from what is expected for their age and circumstance. Standard Scores Simply stated, a standard score is a raw score that has been converted from one scale to another scale, where the latter scale has some arbitrarily set mean and standard deviation. Raw scores may be converted to standard scores because standard scores are more easily interpretable than raw scores. With a standard score, the position of a testtaker’s performance relative to other testtakers is readily apparent. z Scores In essence, a z score is equal to the difference between a particular raw score and the mean divided by the standard deviation. T Scores If the scale used in the computation of z scores is called a zero plus or minus one scale, then the scale used in the computation of T scores can be called a fifty plus or minus ten scale; that is, a scale with a mean set at 50 and a standard deviation set at 10. The T score was devised by W. A. McCall and named in honor of his professor E. L. Thorndike. This standard score system is composed of a scale that ranges from 5 standard deviations below the mean to 5 standard deviations above the mean. 
Thus, for example, a raw score that fell exactly at 5 standard deviations below the mean would be equal to a T score of 0, a raw score that fell at the mean would be equal to a T of 50, and a raw score 5 standard deviations above the mean would be equal to a T of 100. One advantage in using T scores is that none of the scores is negative. By contrast, in a z score distribution, scores can be positive and negative; this can make further computation cumbersome in some instances. Other Standard Scores Stanine Numerous other standard scoring systems exist. Researchers during World War II developed a standard score with a mean of 5 and a standard deviation of approximately 2. Divided into nine units, the scale was christened a stanine, a term that was a contraction of the words standard and nine. Stanine scoring may be familiar to many students from achievement tests administered in elementary and secondary school, where test scores are often represented as stanines. The 5th stanine indicates performance in the average range, from 1/4 standard deviation below the mean to 1/4 standard deviation above the mean, and captures the middle 20% of the scores in a normal distribution. The 4th and 6th stanines are also 1/2 standard deviation wide and capture the 17% of cases below and above (respectively) the 5th stanine. Standard score for SAT and GRE Another type of standard score is employed on tests such as the Scholastic Aptitude Test (SAT) and the Graduate Record Examination (GRE). Raw scores on those tests are converted to standard scores such that the resulting distribution has a mean of 500 and a standard deviation of 100. Deviation IQ For most IQ tests, the distribution of raw scores is converted to IQ scores, whose distribution typically has a mean set at 100 and a standard deviation set at 15. 
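The standard score systems described above (z scores, T scores, SAT- and GRE-type scores, and deviation IQs) differ only in the mean and standard deviation assigned to the new scale. The Python sketch below illustrates this under simple assumptions: the raw score, mean, and standard deviation are hypothetical, and `z_score` and `rescale` are illustrative helper names rather than functions from any testing library.

```python
def z_score(raw, mean, sd):
    """Difference between a raw score and the mean, divided by the standard deviation."""
    return (raw - mean) / sd


def rescale(z, new_mean, new_sd):
    """Place a z score on a scale with an arbitrarily set mean and standard deviation."""
    return new_mean + new_sd * z


# Hypothetical test with a mean of 50 and a standard deviation of 15.
raw, mean, sd = 65, 50, 15

z = z_score(raw, mean, sd)             # zero plus or minus one scale
t_score = rescale(z, 50, 10)           # fifty plus or minus ten scale
sat_type_score = rescale(z, 500, 100)  # mean 500, standard deviation 100
deviation_iq = rescale(z, 100, 15)     # mean 100, standard deviation 15

print(z, t_score, sat_type_score, deviation_iq)
```

Because each conversion here is a straight rescaling of z, differences between the converted scores stay proportional to differences between the raw scores; normalized standard scores would require a different, nonlinear procedure.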
The typical mean and standard deviation for IQ tests result in approximately 95% of deviation IQs ranging from 70 to 130, which is 2 standard deviations below and above the mean. A standard score obtained by a linear transformation is one that retains a direct numerical relationship to the original raw score. The magnitude of differences between such standard scores exactly parallels the differences between corresponding raw scores. A nonlinear transformation may be required when the data under consideration are not normally distributed yet comparisons with normal distributions need to be made. In a nonlinear transformation, the resulting standard score does not necessarily have a direct numerical relationship to the original, raw score. As the result of a nonlinear transformation, the original distribution is said to have been normalized. Normalized standard scores Many test developers hope that the test they are working on will yield a normal distribution of scores. Yet even after very large samples have been tested with the instrument under development, skewed distributions result. What should be done? One alternative available to the test developer is to normalize the distribution. Conceptually, normalizing a distribution involves “stretching” the skewed curve into the shape of a normal curve and creating a corresponding scale of standard scores, a scale that is technically referred to as a normalized standard score scale. Chapter 4 Of Tests and Testing Some Assumptions about Psychological Testing and Assessment Assumption 1: Psychological Traits and States Exist A trait has been defined as “any distinguishable, relatively enduring way in which one individual varies from another”. States also distinguish one person from another but are relatively less enduring. The term psychological trait, much like the term trait alone, covers a wide range of possible characteristics. 
Although some have argued in favor of such a conception of psychological traits, compelling evidence to support such a view has been difficult to obtain. For our purposes, a psychological trait exists only as a construct: an informed, scientific concept developed or constructed to describe or explain behavior. We can’t see, hear, or touch constructs, but we can infer their existence from overt behavior. In this context, overt behavior refers to an observable action or the product of an observable action, including test- or assessment-related responses. The phrase relatively enduring in our definition of trait is a reminder that a trait is not expected to be manifested in behavior 100% of the time. Thus, it is important to be aware of the context or situation in which a particular behavior is displayed. Whether a trait manifests itself in observable behavior, and to what degree it manifests, is presumed to depend not only on the strength of the trait in the individual but also on the nature of the situation. Assumption 2: Psychological Traits and States Can Be Quantified and Measured If a personality test yields a score purporting to provide information about how aggressive a testtaker is, a first step in understanding the meaning of that score is understanding how aggressive was defined by the test developer. Once having defined the trait, state, or other construct to be measured, a test developer considers the types of item content that would provide insight into it. From a universe of behaviors presumed to be indicative of the targeted trait, a test developer has a world of possible items that can be written to gauge the strength of that trait in testtakers. The test score is presumed to represent the strength of the targeted ability or trait or state and is frequently based on cumulative scoring. 
Inherent in cumulative scoring is the assumption that the more the testtaker responds in a particular direction as keyed by the test manual as correct or consistent with a particular trait, the higher that testtaker is presumed to be on the targeted ability or trait. Assumption 3: Test-Related Behavior Predicts Non-Test-Related Behavior Patterns of answers to true–false questions on one widely used test of personality are used in decision making regarding mental disorders. The tasks in some tests mimic the actual behaviors that the test user is attempting to understand. By their nature, however, such tests yield only a sample of the behavior that can be expected to be emitted under nontest conditions. The obtained sample of behavior is typically used to make predictions about future behavior, such as work performance of a job applicant. Assumption 4: Tests and Other Measurement Techniques Have Strengths and Weaknesses Competent test users understand and appreciate the limitations of the tests they use as well as how those limitations might be compensated for by data from other sources. All of this may sound quite commonsensical, and it probably is. Yet this deceptively simple assumption (that test users know the tests they use and are aware of the tests’ limitations) is emphasized repeatedly in the codes of ethics of associations of assessment professionals. Assumption 5: Various Sources of Error Are Part of the Assessment Process To the contrary, error traditionally refers to something that is more than expected; it is actually a component of the measurement process. More specifically, error refers to a long-standing assumption that factors other than what a test attempts to measure will influence performance on the test. Because error is a variable that must be taken account of in any assessment, we often speak of error variance, that is, the component of a test score attributable to sources other than the trait or ability measured. 
In a more general sense, then, assessees themselves are sources of error variance. Assessors, too, are sources of error variance. For example, some assessors are more professional than others in the extent to which they follow the instructions governing how and under what conditions a test should be administered. In addition to assessors and assessees, measuring instruments themselves are another source of error variance. Some tests are simply better than others in measuring what they purport to measure. In what is referred to as the classical or true score theory of measurement, an assumption is made that each testtaker has a true score on a test that would be obtained but for the random action of measurement error. Assumption 6: Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner Decades of court challenges to various tests and testing programs have sensitized test developers and users to the societal demand for fair tests used in a fair manner. Today, all major test publishers strive to develop instruments that are fair when used in strict accordance with guidelines in the test manual. Assumption 7: Testing and Assessment Benefit Society In a world without tests or other assessment procedures, personnel might be hired on the basis of nepotism rather than documented merit. In a world without tests, teachers and school administrators could arbitrarily place children in different types of special classes simply because that is where they believed the children belonged. In a world without tests, there would be a great need for instruments to diagnose educational difficulties in reading and math and point the way to remediation. In a world without tests, there would be no instruments to diagnose neuropsychological impairments. In a world without tests, there would be no practical way for the military to screen thousands of recruits with regard to many key variables. What’s a “Good Test”? 
Logically, the criteria for a good test would include clear instructions for administration, scoring, and interpretation. It would also seem to be a plus if a test offered economy in the time and money it took to administer, score, and interpret it. Most of all, a good test would seem to be one that measures what it purports to measure. Beyond simple logic, there are technical criteria that assessment professionals use to evaluate the quality of tests and other measurement procedures. Test users often speak of the psychometric soundness of tests, two key aspects of which are reliability and validity. Reliability A good test or, more generally, a good measuring tool or procedure is reliable. As we will explain in Chapter 5, the criterion of reliability involves the consistency of the measuring tool: the precision with which the test measures and the extent to which error is present in measurements. In theory, the perfectly reliable measuring tool consistently measures in the same way. As you might expect, however, reliability is a necessary but not sufficient element of a good test. In addition to being reliable, tests must be reasonably accurate. In the language of psychometrics, tests must be valid. Validity A test is considered valid for a particular purpose if it does, in fact, measure what it purports to measure. Because there is controversy surrounding the definition of intelligence, the validity of any test purporting to measure this variable is sure to come under close scrutiny by critics. If the definition of intelligence on which the test is based is sufficiently different from the definition of intelligence on other accepted tests, then the test may be condemned as not measuring what it purports to measure. Other Considerations A good test is one that trained examiners can administer, score, and interpret with a minimum of difficulty. 
A good test is a useful test, one that yields actionable results that will ultimately benefit individual testtakers or society at large. If the purpose of a test is to compare the performance of the testtaker with the performance of other testtakers, a good test is one that contains adequate norms. Norms We may define norm-referenced testing and assessment as a method of evaluation and a way of deriving meaning from test scores by evaluating an individual testtaker’s score and comparing it to scores of a group of testtakers. A common goal of norm-referenced tests is to yield information on a testtaker’s standing or ranking relative to some comparison group of testtakers. Norm in the singular is used in the scholarly literature to refer to behavior that is usual, average, normal, standard, expected, or typical. Reference to a particular variety of norm may be specified by means of modifiers such as age, as in the term age norm. Norms is the plural form of norm, as in the term gender norms. In a psychometric context, norms are the test performance data of a particular group of testtakers that are designed for use as a reference when evaluating or interpreting individual test scores. A normative sample is that group of people whose performance on a particular test is analyzed for reference in evaluating the performance of individual testtakers. The verb to norm, as well as related terms such as norming, refer to the process of deriving norms. Norming may be modified to describe a particular type of norm derivation. For example, race norming is the controversial practice of norming on the basis of race or ethnic background. Norming a test, especially with the participation of a nationally representative normative sample, can be a very expensive proposition. 
For this reason, some test manuals provide what are variously known as user norms or program norms, which “consist of descriptive statistics based on a group of testtakers in a given period of time rather than norms obtained by formal sampling methods.”

Sampling to Develop Norms

The process of administering a test to a representative sample of testtakers for the purpose of establishing norms is referred to as standardization or test standardization. A test is said to be standardized when it has clearly specified procedures for administration and scoring, typically including normative data.

Sampling

In the process of developing a test, a test developer has targeted some defined group as the population for which the test is designed. This population is the complete universe or set of individuals with at least one common, observable characteristic. The test developer can obtain a distribution of test responses by administering the test to a sample of the population, that is, a portion of the universe of people deemed to be representative of the whole population. The process of selecting the portion of the universe deemed to be representative of the whole population is referred to as sampling. Sampling that includes relevant subgroups in the same proportions in which they occur in the population, termed stratified sampling, would help prevent sampling bias and ultimately aid in the interpretation of the findings. If such sampling were random (that is, if every member of the population had the same chance of being included in the sample), then the procedure would be termed stratified-random sampling. If we arbitrarily select some sample because we believe it to be representative of the population, then we have selected what is referred to as a purposive sample. An incidental sample or convenience sample is one that is convenient or available for use. Generalization of findings from incidental samples must be made with caution.
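The idea of stratified-random sampling described above can be illustrated with a minimal sketch. The population, the urban/rural strata, and the function name `stratified_random_sample` are hypothetical and chosen only for illustration; proportional allocation by rounding is one simple assumption among several possible allocation rules.

```python
import random
from collections import defaultdict

def stratified_random_sample(population, stratum_of, sample_size, seed=0):
    """Draw a random sample whose strata proportions mirror the population's."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    strata = defaultdict(list)
    for person in population:
        strata[stratum_of(person)].append(person)
    sample = []
    for members in strata.values():
        # Allocate slots in proportion to the stratum's share of the population.
        # Rounding can make the total differ slightly from sample_size in general.
        k = round(sample_size * len(members) / len(population))
        # Within a stratum, every member has the same chance of selection.
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 600 urban and 400 rural testtakers.
population = [("urban", i) for i in range(600)] + [("rural", i) for i in range(400)]
sample = stratified_random_sample(population, lambda p: p[0], sample_size=50)
urban_count = sum(1 for p in sample if p[0] == "urban")
print(len(sample), urban_count)
```

With these made-up numbers, the sample keeps the 60/40 split found in the population, which is the property that helps guard against sampling bias.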
Developing norms for a standardized test

When the people in the normative sample are the same people on whom the test was standardized, the phrases normative sample and standardization sample are often used interchangeably.

Types of Norms

Percentiles

Instead of dividing a distribution of scores into quartiles, we might wish to divide the distribution into deciles, or ten equal parts. Alternatively, we could divide a distribution into 100 equal parts, or 100 percentiles. In such a distribution, the xth percentile is equal to the score at or below which x% of scores fall. Thus, the 15th percentile is the score at or below which 15% of the scores in the distribution fall. The 99th percentile is the score at or below which 99% of the scores in the distribution fall. If 99% of a particular standardization sample answered fewer than 47 questions on a test correctly, then we could say that a raw score of 47 corresponds to the 99th percentile on this test. It can be seen that a percentile is a ranking that conveys information about the relative position of a score within a distribution of scores. More formally defined, a percentile is an expression of the percentage of people whose score on a test or measure falls below a particular raw score. Intimately related to the concept of a percentile as a description of performance on a test is the concept of percentage correct. Note that percentile and percentage correct are not synonymous. A percentile is a converted score that refers to a percentage of testtakers. Percentage correct refers to the distribution of raw scores; more specifically, to the number of items that were answered correctly multiplied by 100 and divided by the total number of items.

Age norms

Also known as age-equivalent scores, age norms indicate the average performance of different samples of testtakers who were at various ages at the time the test was administered.
If the measurement under consideration is height in inches, for example, then we know that scores (heights) for children will gradually increase at various rates as a function of age up to the middle to late teens.

The child of any chronological age whose performance on a valid test of intellectual ability indicated that he or she had intellectual ability similar to that of the average child of some other age was said to have the mental age of the norm group in which his or her test score fell. For many years, IQ (intelligence quotient) scores on tests such as the Stanford-Binet were calculated by dividing mental age (as indicated by the test) by chronological age. The quotient would then be multiplied by 100 to eliminate the fraction. The distribution of IQ scores had a mean set at 100 and a standard deviation of approximately 16.

Grade norms

Designed to indicate the average test performance of testtakers in a given school grade, grade norms are developed by administering the test to representative samples of children over a range of consecutive grade levels (such as first through sixth grades). Like age norms, grade norms have great intuitive appeal. Children learn and develop at varying rates but in ways that are in some aspects predictable. Perhaps because of this fact, grade norms have widespread application, especially to children of elementary school age. Perhaps the primary use of grade norms is as a convenient, readily understandable gauge of how one student’s performance compares with that of fellow students in the same grade. One drawback of grade norms is that they are useful only with respect to years and months of schooling completed. They have little or no applicability to children who are not yet in school or to children who are out of school. Further, they are not typically designed for use with adults who have returned to school.
Both grade norms and age norms are referred to more generally as developmental norms, a term applied broadly to norms developed on the basis of any trait, ability, skill, or other characteristic that is presumed to develop, deteriorate, or otherwise be affected by chronological age, school grade, or stage of life.

National norms

As the name implies, national norms are derived from a normative sample that was nationally representative of the population at the time the norming study was conducted. In the fields of psychology and education, for example, national norms may be obtained by testing large numbers of people representative of different variables of interest such as age, gender, racial/ethnic background, socioeconomic strata, geographical location (such as North, East, South, West, Midwest), and different types of communities within the various parts of the country (such as rural, urban, suburban).

National anchor norms

Suppose we select a reading test designed for use in grades 3 to 6, which, for the purposes of this hypothetical example, we call the Best Reading Test (BRT). Suppose further that we want to compare findings obtained on another national reading test designed for use with grades 3 to 6, the hypothetical XYZ Reading Test, with the BRT. An equivalency table for scores on the two tests, or national anchor norms, could provide the tool for such a comparison. Just as an anchor provides some stability to a vessel, so national anchor norms provide some stability to test scores by anchoring them to other test scores. Using the equipercentile method, the equivalency of scores on different tests is calculated with reference to corresponding percentile scores. Thus, if the 96th percentile corresponds to a score of 69 on the BRT and if the 96th percentile corresponds to a score of 14 on the XYZ, then we can say that a BRT score of 69 is equivalent to an XYZ score of 14.
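The equipercentile method just described can be sketched in a few lines. The norm data below are invented for the hypothetical BRT and XYZ tests, and the helper names are assumptions made for illustration; operational equating typically involves large samples and smoothing, which this sketch leaves out.

```python
def percentile_rank(scores, raw):
    """Percentage of scores in the norm group at or below `raw`."""
    return 100.0 * sum(1 for s in scores if s <= raw) / len(scores)

def equipercentile_equivalent(raw_a, norms_a, norms_b):
    """Lowest score on test B whose percentile rank reaches that of raw_a on test A."""
    target = percentile_rank(norms_a, raw_a)
    for candidate in sorted(set(norms_b)):
        if percentile_rank(norms_b, candidate) >= target:
            return candidate
    return max(norms_b)

# Hypothetical normative data for the two reading tests (ten testtakers each).
brt_norms = [40, 45, 50, 52, 55, 58, 60, 63, 66, 69]
xyz_norms = [3, 4, 5, 6, 7, 8, 9, 10, 12, 14]

top_equivalent = equipercentile_equivalent(69, brt_norms, xyz_norms)
mid_equivalent = equipercentile_equivalent(55, brt_norms, xyz_norms)
print(top_equivalent, mid_equivalent)
```

In this toy data set, scores are treated as equivalent because they occupy the same percentile position in their respective norm groups, not because the raw numbers resemble each other.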
When two tests are normed from the same sample, the norming process is referred to as co-norming.

Subgroup norms

A normative sample can be segmented by any of the criteria initially used in selecting subjects for the sample. What results from such segmentation are more narrowly defined subgroup norms. Thus, for example, suppose criteria used in selecting children for inclusion in the XYZ Reading Test normative sample were age, educational level, socioeconomic level, geographic region, community type, and handedness (whether the child was right-handed or left-handed). A community school board member might find the regional norms to be most useful, whereas a psychologist doing exploratory research in the area of brain lateralization and reading scores might find the handedness norms most useful.

Local norms

Typically developed by test users themselves, local norms provide normative information with respect to the local population’s performance on some test. Individual high schools may wish to develop their own school norms (local norms) for student scores on an examination that is administered statewide.

Fixed Reference Group Scoring Systems

Norms provide a context for interpreting the meaning of a test score. Another type of aid in providing a context for interpretation is termed a fixed reference group scoring system. Here, the distribution of scores obtained on the test from one group of testtakers, referred to as the fixed reference group, is used as the basis for the calculation of test scores for future administrations of the test. Perhaps the test most familiar to college students that exemplifies the use of a fixed reference group scoring system is the SAT. As an example, suppose John took the SAT in 1995 and answered 50 items correctly on a particular scale. And let’s say Mary took the test in 2008 and, just like John, answered 50 items correctly.
Although John and Mary may have achieved the same raw score, they would not necessarily achieve the same scaled score. If, for example, the 2008 version of the test was judged to be somewhat easier than the 1995 version, then scaled scores for the 2008 testtakers would be calibrated downward. This would be done so as to make scores earned in 2008 comparable to scores earned in 1995. Conceptually, the idea of a fixed reference group is analogous to the idea of a fixed reference foot, the foot of the English king that also became immortalized as a measurement standard.

Norm-Referenced versus Criterion-Referenced Evaluation

One way to derive meaning from a test score is to evaluate the test score in relation to other scores on the same test. As we have pointed out, this approach to evaluation is referred to as norm-referenced. Another way to derive meaning from a test score is to evaluate it on the basis of whether or not some criterion has been met. We may define a criterion as a standard on which a judgment or decision may be based. Criterion-referenced testing and assessment may be defined as a method of evaluation and a way of deriving meaning from test scores by evaluating an individual’s score with reference to a set standard. In norm-referenced interpretations of test data, a usual area of focus is how an individual performed relative to other people who took the test. In criterion-referenced interpretations of test data, a usual area of focus is the testtaker’s performance: what the testtaker can do or not do; what the testtaker has or has not learned; whether the testtaker does or does not meet specified criteria for inclusion in some group, access to certain privileges, and so forth. Because criterion-referenced tests are frequently used to gauge achievement or mastery, they are sometimes referred to as mastery tests.
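The contrast between norm-referenced and criterion-referenced interpretation, and the earlier distinction between a percentile and percentage correct, can be made concrete with a small sketch. The norm group, the 50-item test length, and the cut score of 40 are all hypothetical values chosen for illustration, and the "at or below" counting rule is one of the two percentile definitions given in the text.

```python
def percentile_rank(norm_scores, raw):
    """Norm-referenced: percentage of the norm group scoring at or below `raw`."""
    return 100.0 * sum(1 for s in norm_scores if s <= raw) / len(norm_scores)

def percentage_correct(items_correct, total_items):
    """Items answered correctly, multiplied by 100 and divided by total items."""
    return 100.0 * items_correct / total_items

def meets_criterion(raw, cut_score):
    """Criterion-referenced: has the set standard been met?"""
    return raw >= cut_score

# Hypothetical raw scores of a ten-person norm group on a 50-item test.
norm_group = [22, 25, 27, 30, 31, 33, 35, 36, 38, 40]
raw, total_items, cut_score = 36, 50, 40

rank = percentile_rank(norm_group, raw)       # standing relative to others
pct = percentage_correct(raw, total_items)    # share of items answered correctly
mastered = meets_criterion(raw, cut_score)    # comparison with a fixed standard
print(rank, pct, mastered)
```

Here the same raw score of 36 looks strong relative to the norm group yet falls short of the mastery standard, which is why the two kinds of interpretation answer different questions.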
Correlation and Inference

A coefficient of correlation (or correlation coefficient) is a number that provides us with an index of the strength of the relationship between two things.

The Concept of Correlation

Simply stated, correlation is an expression of the degree and direction of correspondence between two things. The magnitude of a correlation coefficient can be any number between -1 and +1. If a correlation coefficient has a value of -1 or +1, then the relationship between the two variables being correlated is perfect, that is, without error in the statistical sense. If two variables simultaneously increase or simultaneously decrease, then those two variables are said to be positively (or directly) correlated. A negative (or inverse) correlation occurs when one variable increases while the other variable decreases. If a correlation is zero, then absolutely no relationship exists between the two variables. And some might consider “perfectly no correlation” to be a third variety of perfect correlation; that is, a perfect noncorrelation. After all, just as it is nearly impossible in psychological work to identify two variables that have a perfect correlation, so it is nearly impossible to identify two variables that have a zero correlation. Most of the time, two variables will be fractionally correlated. The fractional correlation may be extremely small but seldom “perfectly” zero. It must be emphasized that a correlation coefficient is merely an index of the relationship between two variables, not an index of the causal relationship between two variables.

The Pearson r

Many techniques have been devised to measure correlation. The most widely used of all is the Pearson r, also known as the Pearson correlation coefficient and the Pearson product-moment coefficient of correlation.
Devised by Karl Pearson, r can be the statistical tool of choice when the relationship between the variables is linear and when the two variables being correlated are continuous (that is, they can theoretically take any value). The formula used to calculate a Pearson r from raw scores is

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\left[\sum (X - \bar{X})^2\right]\left[\sum (Y - \bar{Y})^2\right]}}$$

Another formula for calculating a Pearson r is

$$r = \frac{N\sum XY - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{N\sum X^2 - \left(\sum X\right)^2}\,\sqrt{N\sum Y^2 - \left(\sum Y\right)^2}}$$

History records, however, that it was actually Sir Francis Galton who should be credited with developing the concept of correlation. Galton experimented with many formulas to measure correlation, including one he labeled r. Pearson, a contemporary of Galton’s, modified Galton’s r, and the rest, as they say, is history. The Pearson r eventually became the most widely used measure of correlation. The next logical question concerns what to do with the number obtained for the value of r. The answer is that you ask even more questions, such as “Is this number statistically significant given the size and nature of the sample?” or “Could this result have occurred by chance?” At this point, you will need to consult tables of significance for Pearson r, tables that are probably in the back of your old statistics textbook. In those tables you will find, for example, that a Pearson r of .899 with an N = 10 is significant at the .01 level (using a two-tailed test). You will recall from your statistics course that significance at the .01 level tells you, with reference to these data, that a correlation such as this could have been expected to occur merely by chance only one time or less in a hundred if X and Y are not correlated in the population. Significance at the .05 level means that the result could have been expected to occur by chance alone five times or less in a hundred.

The value obtained for the coefficient of correlation can be further interpreted by deriving from it what is called a coefficient of determination, or r².
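As a rough numerical check on the Pearson r, the sketch below computes it two ways, once from deviations about the means and once from raw-score sums, and then squares the result to obtain the coefficient of determination. The five paired scores are made up, the function names are assumptions for illustration, and both computations use the standard textbook forms of the Pearson formula.

```python
from math import sqrt

def pearson_r_deviation(xs, ys):
    """Pearson r from deviations about the means of X and Y."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)

def pearson_r_raw(xs, ys):
    """Pearson r from raw-score sums (computational form)."""
    n = len(xs)
    numerator = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    denom_x = sqrt(n * sum(x * x for x in xs) - sum(xs) ** 2)
    denom_y = sqrt(n * sum(y * y for y in ys) - sum(ys) ** 2)
    return numerator / (denom_x * denom_y)

# Hypothetical paired scores for five testtakers.
x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 4, 5, 4, 5]

r_dev = pearson_r_deviation(x_scores, y_scores)
r_raw = pearson_r_raw(x_scores, y_scores)
r_squared = r_dev ** 2  # coefficient of determination
print(round(r_dev, 4), round(r_raw, 4), round(r_squared, 2))
```

The two forms are algebraically equivalent, so they agree up to floating-point rounding; whether a value obtained this way is statistically significant still has to be judged against the sample size, as discussed above.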
The coefficient of determination is an indication of how much variance is shared by the X- and the Y-variables. The calculation of r² is quite straightforward. Simply square the correlation coefficient and multiply by 100; the result is equal to the percentage of the variance accounted for. If, for example, you calculated r to be .9, then r² would be equal to .81. The number .81 tells us that 81% of the variance is accounted for by the X- and Y-variables. The remaining variance, equal to 100(1 - r²), or 19%, could presumably be accounted for by chance, error, or otherwise unmeasured or unexplainable factors. Let’s address a logical question sometimes raised by students when they hear the Pearson r referred to as the product-moment coefficient of correlation. Why is it called that? The answer is a little complicated, but here goes. In the language of psychometrics, a moment describes a deviation about a mean of a distribution. Individual deviations about the mean of a distribution are referred to as deviates. Deviates are referred to as the first moments of the distribution. The second moments of the distribution are the moments squared. The third moments of the distribution are the moments cubed, and so forth. One way of conceptualizing standard scores is as the first moments of a distribution. This is because standard scores are deviates about a mean of zero. A formula that entails the multiplication of two corresponding standard scores can therefore be conceptualized as one that entails the computation of the product of corresponding moments.

The Spearman Rho

One commonly used alternative statistic is variously called a rank-order correlation coefficient, a rank-difference correlation coefficient, or simply Spearman’s rho.
Developed by Charles Spearman, a British psychologist, this coefficient of correlation is frequently used when the sample size is small (fewer than 30 pairs of measurements) and especially when both sets of measurements are in ordinal (or rank-order) form. Charles Spearman is best known as the developer of the Spearman rho statistic and the Spearman-Brown prophecy formula, which is used to “prophesize” the accuracy of tests of different sizes. Spearman is also credited with being the father of a statistical method called factor analysis.

Graphic Representations of Correlation

One type of graphic representation of correlation is referred to by many names, including a bivariate distribution, a scatter diagram, a scattergram, or (our favorite) a scatterplot. A scatterplot is a simple graphing of the coordinate points for values of the X-variable (placed along the graph’s horizontal axis) and the Y-variable (placed along the graph’s vertical axis). Scatterplots are useful because they provide a quick indication of the direction and magnitude of the relationship, if any, between the two variables. Scatterplots are useful in revealing the presence of curvilinearity in a relationship. As you may have guessed, curvilinearity in this context refers to an “eyeball gauge” of how curved a graph is. Remember that a Pearson r should be used only if the relationship between the variables is linear. If the graph does not appear to take the form of a straight line, the chances are good that the relationship is not linear. When the relationship is nonlinear, other statistical tools and techniques may be employed. A graph also makes the spotting of outliers relatively easy. An outlier is an extremely atypical point located at a relatively long distance (an outlying distance) from the rest of the coordinate points in a scatterplot. In some cases, outliers are simply the result of administering a test to a very small sample of testtakers.
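Returning briefly to Spearman's rho, the name "rank-difference correlation coefficient" suggests how it can be computed: convert each set of measurements to ranks and work with the differences between paired ranks. The sketch below uses the standard textbook rank-difference formula, which the text itself does not state, and it assumes there are no tied scores. The data and function names are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (lowest) upward; this sketch assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Rank-difference form: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical paired measurements for five testtakers.
x_values = [10, 20, 30, 40, 50]
y_values = [1, 3, 2, 5, 4]

rho = spearman_rho(x_values, y_values)
print(round(rho, 3))
```

Because only the ordering of the scores matters, the statistic suits the ordinal data mentioned above; with tied scores a tie-corrected procedure would be needed instead of this shortcut.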
Regression

Regression may be defined broadly as the analysis of relationships among variables for the purpose of understanding how one variable may predict another. Simple regression involves one independent variable (X), typically referred to as the predictor variable, and one dependent variable (Y), typically referred to as the outcome variable. Simple regression analysis results in an equation for a regression line. The regression line is the line of best fit: the straight line that, in one sense, comes closest to the greatest number of points on the scatterplot of X and Y. Its equation is $Y' = a + bX$, where $Y'$ is the predicted value of Y. In the formula, a and b are regression coefficients; b is equal to the slope of the line, and a is the intercept, a constant indicating where the line crosses the Y-axis. Consider a group of students who all earned the same score on an entrance exam: each would be predicted to get the same GPA, but in fact they earned different GPAs. This is what is meant by error in prediction. This error in the prediction of Y from X is represented by the standard error of the estimate. As you might expect, the higher the correlation between X and Y, the greater the accuracy of the prediction and the smaller the standard error of the estimate.

Multiple regression

The use of more than one score to predict Y requires the use of a multiple regression equation. The multiple regression equation takes into account the intercorrelations among all the variables involved. Predictors that correlate highly with the predicted variable are generally given more weight. This means that their regression coefficients (referred to as b-values) are larger. No surprise there. We would expect test users to pay the most attention to predictors that predict Y best. If many predictors are used and if one is not correlated with any of the other predictors but is correlated with the predicted score, then that predictor may be given relatively more weight because it is providing unique information.
In contrast, if two predictor scores are highly correlated with each other then they could be providing redundant information. If both were kept in the regression equation, each might be given less weight so that they would “share” the prediction of Y. More predictors are not necessarily better. If two predictors are providing the same information, the person using the regression equation may decide to use only one of them for the sake of efficiency. If, for example, a dental school dean observed that dental school admission test scores and scores on a test of fine motor skills were highly correlated with each other and that each of these scores correlated about the same with GPA, the dean might decide to use only one predictor because nothing was gained by the addition of the second predictor.

Inference from Measurement

Meta-Analysis

The term meta-analysis refers to a family of techniques used to statistically combine information across studies to produce single estimates of the statistics being studied.

Culture and Inference

It is incumbent upon responsible test users not to lose sight of culture as a factor in test administration, scoring, and interpretation. So, in selecting a test for use, the responsible test user does some advance research on the test’s available norms to check on how appropriate they are for use with the targeted testtaker population.

Chapter 5 Reliability

In everyday conversation, reliability is a synonym for dependability or consistency. We speak of the train that is so reliable you can set your watch by it. If we are lucky, we have a reliable friend who is always there for us in a time of need. Broadly speaking, in the language of psychometrics reliability refers to consistency in measurement. A reliability coefficient is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance.
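The definition of the reliability coefficient as a ratio of variances can be illustrated with a toy calculation. In practice true scores are never observed directly, so the true scores and error values below are purely hypothetical; the sketch assumes the classical decomposition of an observed score into true score plus error, with errors chosen to be uncorrelated with true scores.

```python
from statistics import pvariance

# Hypothetical true scores and random errors for five testtakers.
true_scores = [10, 12, 14, 16, 18]
errors = [1, -1, 0, -1, 1]  # mean zero and uncorrelated with the true scores
observed = [t + e for t, e in zip(true_scores, errors)]

true_variance = pvariance(true_scores)
total_variance = pvariance(observed)

# Reliability: proportion of total variance attributable to true variance.
reliability = true_variance / total_variance

# A constant (systematic) shift applied to everyone leaves the variance unchanged.
shifted = [x - 5 for x in observed]
shifted_variance = pvariance(shifted)

print(round(reliability, 3), total_variance, shifted_variance)
```

With these numbers the true variance is 8 and the total variance is 8.8, so about 91% of the variability in observed scores reflects true differences among testtakers.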
The Concept of Reliability

Recall from our discussion of classical test theory that a score on an ability test is presumed to reflect not only the testtaker’s true score on the ability being measured but also error. In its broadest sense, error refers to the component of the observed test score that does not have to do with the testtaker’s ability. If we use X to represent an observed score, T to represent a true score, and E to represent error, then the fact that an observed score equals the true score plus error may be expressed as follows:

$$X = T + E$$

A statistic useful in describing sources of test score variability is the variance (σ²), the standard deviation squared. Variance from true differences is true variance, and variance from irrelevant, random sources is error variance. The term reliability refers to the proportion of the total variance attributed to true variance. The greater the proportion of the total variance attributed to true variance, the more reliable the test. Let’s emphasize here that a systematic source of error would not affect score consistency. If a measuring instrument such as a weight scale consistently underweighed everyone who stepped on it by 5 pounds, then the relative standings of the people would remain unchanged. A systematic error source does not change the variability of the distribution or affect reliability.

Sources of Error Variance

Test construction

One source of variance during test construction is item sampling or content sampling, terms that refer to variation among items within a test as well as to variation among items between tests. From the perspective of a test creator, a challenge in test development is to maximize the proportion of the total variance that is true variance and to minimize the proportion of the total variance that is error variance.

Test administration

Sources of error variance that occur during test administration may influence the testtaker’s attention or motivation.
The testtaker’s reactions to those influences are the source of one