This document introduces basic concepts of testing and psychological assessment. It covers the evolution of testing, different types of tests, and their applications in various fields. It includes information on ability tests, personality tests, and the various parties involved in the testing process.
CHAPTER 1-INTRODUCTION TO TESTING
BPSYCH8 Psychological Assessment
Learning outcomes:
1. Discuss the evolution of testing and its significant achievements until the present.
UNIT 1-OVERVIEW OF TESTING
Basic Concepts (Cohen-Swerdlik, 2018)
A test - a measurement device or technique used to quantify behavior or to aid in the understanding and prediction of behavior (e.g., a spelling test measures how well someone spells words).
An item - a specific stimulus to which a person responds overtly; this response can be scored or evaluated (e.g., classified, graded on a scale, or counted).
Psychological test - a systematic procedure for obtaining samples of behavior relevant to cognitive, affective, or interpersonal functioning, and for scoring and evaluating those samples according to standards. Also called "psychological measurement" or psychometrics.
Psychological assessment - the integration of information from multiple sources in order to describe, predict, explain, diagnose, and make decisions. It uses many techniques and tools of psychology to learn facts about another person, either to inform others of how the person is currently functioning or to predict his or her behavior and functioning in the future.
Taxonomy of Psychological Assessment (Source: Coaley, Keith, 2010)
As the diagram above shows, psychological assessment encompasses psychological testing as part of the whole process of assessment. Assessment uses both tools that can be quantified and tools that cannot (qualitative ones); information about the client gathered through observation and interviews is also part of the assessment process. Psychological testing is just one of the tools used in assessing an individual, and it is more quantitative because it relies on measurement through tests that can be quantified.
Testing vs. Assessment (Cohen-Swerdlik, 2018)
Objective
- Testing: typically, to obtain some gauge, usually numerical in nature, with regard to an ability or attribute.
- Assessment: typically, to answer a referral question, solve a problem, or arrive at a decision through the use of tools of evaluation.
Process
- Testing: may be individual or group in nature. After test administration, the tester typically adds up "the number of correct answers or the number of certain types of responses... with little if any regard for the how or mechanics of such content."
- Assessment: typically individualized. In contrast to testing, assessment more typically focuses on how an individual processes rather than simply the results of that processing.
Role of Evaluator
- Testing: the tester is not key to the process; practically speaking, one tester may be substituted for another without appreciably affecting the evaluation.
- Assessment: the assessor is key to the process of selecting tests and/or other tools of evaluation, as well as in drawing conclusions from the entire evaluation.
Skill of Evaluator
- Testing: typically requires technician-like skills in administering and scoring a test as well as in interpreting a test result.
- Assessment: typically requires an educated selection of tools of evaluation, skill in evaluation, and thoughtful organization and integration of data.
Outcome
- Testing: typically yields a test score or series of test scores.
- Assessment: typically entails a logical problem-solving approach that brings to bear many sources of data designed to shed light on a referral question.
Basic Elements of the Definition of Psychological Tests (Urbina, 2014)
1. Psychological tests are systematic procedures. Explanation: they are characterized by planning, uniformity, and thoroughness. Rationale: tests must be demonstrably objective and fair to be of use.
2. Psychological tests are samples of behavior. Explanation: they are small subsets of a much larger whole. Rationale: sampling behavior is efficient because the time available is usually limited.
3. The behaviors sampled by tests are relevant to cognitive, affective, or interpersonal functioning. Explanation: the samples are selected for their empirical or practical psychological significance. Rationale: tests, unlike mental games, exist to be of use; they are tools.
4. Test results are evaluated and scored according to pre-established rules. Explanation: some numerical or category system is applied to test results. Rationale: there should be no question about what the results of tests are.
Implications on the Use of Psychological Tests (Urbina, 2014)
1. Tests should be standardized in order to have objectivity in the testing process.
- Standardized means there is uniformity of procedure in all important aspects of the administration, scoring, and interpretation of tests. Note: the time and place that a test is administered or taken, as well as the circumstances under which it is completed and the mode of administration used, can and do affect test results.
- The second meaning of standardization concerns the use of standards for evaluating test results. These standards are often the norms derived from a group of individuals (also known as the normative or standardization sample) in the process of developing the test.
2. Psychological tests are tools: "they are always a means to an end and never an end in themselves."
- Psychological tests are tools designed to help in drawing inferences about individuals or groups, and they should be used appropriately and skillfully because they are key components in the practice and science of psychology.
- When test results are misinterpreted or misused, they can harm people by labeling them in unjustified ways, unfairly denying them opportunities, or simply discouraging them. Hence, strict guidelines and policies are part and parcel of the use of these tests.
3. Psychological tests are products. There is a commercial side to psychological tests: they are advertised through publications and catalogs targeted to the professionals who use them. Many, if not most, psychological tests are conceived, developed, marketed, and sold for applied purposes in education, business, forensic, or mental health settings, and they must make a profit for those who produce them. But just like any other commercial products, psychological tests can be used in ways that do damage.
Types of Tests (Kaplan, 2018)
Individual tests - an examiner gives a test to one person at a time.
Group tests - one examiner administers a test to a group of examinees at a time.
Categories of tests
a. Ability tests - tests that sample the knowledge, skills, or cognitive functions of a person; these are tests of a person's capacity or potential. Examples:
- Achievement tests measure previous learning (e.g., an examiner is interested in how many words a person can read correctly on a reading achievement test).
- Aptitude tests measure a person's potential for acquiring a certain or specific skill (e.g., how many words a person can read correctly after being given a certain amount of training).
- Intelligence tests refer to a person's general potential to solve problems, adapt to changing circumstances, think abstractly, and profit from experience.
b. Personality tests - measure typical behavior (traits, temperaments, and dispositions); they are related to the overt and covert dispositions of the individual, for example, the tendency of a person to show a particular behavior or response in a given situation.
- Structured (objective) - a test that provides self-report statements to which the person responds (e.g., responses can be "Yes" or "No"; agree, strongly agree, etc.)
- Unstructured/Projective - a test that provides an ambiguous test stimulus; the response requirements are left unclear.
Summary of the kinds and categories of tests (Source: Kaplan, et al., 2018)
CURRENT USES OF PSYCHOLOGICAL TESTS
1. Decision Making - this involves value judgments on the part of one or more decision makers who need to determine the bases upon which to select, place, classify, diagnose, or otherwise deal with individuals, groups, organizations, or programs. NOTE: testing should be merely a part of a thorough and well-planned decision-making strategy that takes into account the particular context in which the decisions are made, the limitations of the tests, and other sources of data in addition to tests.
2. Psychological Research - tests are used in research in the fields of differential, developmental, abnormal, educational, social, and vocational psychology. They provide a well-recognized method of studying the nature, development, and interrelationships of cognitive, affective, and behavioral traits.
3. Self-Understanding and Personal Development - tests provide clients with information that promotes self-understanding and positive growth. This use has also evolved into a therapeutic model of assessment.
Parties Involved in Testing (Urbina, 2014)
Test authors and developers - they conceive, prepare, and develop tests. They also find a way to disseminate their tests by publishing them either commercially or through professional publications such as books or periodicals.
Test publishers - they publish, market, and sell tests, thus controlling their distribution.
Test reviewers - they prepare evaluative critiques of tests based on their technical and practical merits.
Test users - they select or decide to take a specific test off the shelf and use it for some purpose. They may also participate in other roles, for example, as examiners or scorers.
Test sponsors - institutional boards or government agencies
who contract test developers or publishers for various testing services.
Test administrators or examiners - they administer the test either to one individual at a time or to groups.
Test takers - they take the test by choice or necessity.
Test scorers - they tally the raw responses of the test taker and transform them into test scores, through objective or mechanical scoring or through the application of evaluative judgments.
Test score interpreters - they interpret test results to their ultimate consumers, who may be individual test takers or their relatives, other professionals, or organizations of various kinds.
UNIT 2-HISTORY OF TESTING
History of Testing (Kaplan, 2018)
Early antecedents
The origins of testing are said to lie in China, which already had a sophisticated civil service testing program more than 4,000 years ago (in some books, around 2,000 years ago). The examinations encompassed demonstrations of proficiency in music, archery, and horsemanship, among other things. Every third year, oral examinations were given to help determine work evaluations and promotion decisions.
In the Han Dynasty (206 B.C.E.–220 C.E.), the use of test batteries (two or more tests used in conjunction) was quite common. These early tests covered diverse topics such as civil law, military affairs, agriculture, revenue, and geography.
During the Ming Dynasty (1368–1644 C.E.), testing had become quite well developed: there was a national multistage testing program involving local and regional testing centers equipped with special testing booths. Those who did well on the tests at the local level went on to provincial capitals for more extensive essay examinations, and those who did well there went on to the nation's capital for a final round. Only those who passed this third set of tests were eligible for public office. This system for selecting government officials ended in 1905 and was replaced with selection based on university studies.
However, it served as an inspiration for the civil service exams developed in Britain in the 1850s, which, in turn, stimulated the creation of the U.S. Civil Service Examination in the 1860s.
The works of Charles Darwin
Charles Darwin's The Origin of Species (1859) introduced the concept of individual differences. According to Darwin's theory, higher forms of life evolved partially because of differences among individual forms of life within a species. He also believed that those with the best or most adaptive characteristics survive at the expense of those who are less fit, and that the survivors pass their characteristics on to the next generation. Through this process, he argued, life has evolved to its currently complex and intelligent levels.
Sir Francis Galton, a relative of Darwin, soon began applying Darwin's theories to the study of human beings. Given the concepts of survival of the fittest and individual differences, Galton set out to show that some people possessed characteristics that made them more fit than others, a theory he articulated in his book Hereditary Genius, published in 1869. Galton subsequently began a series of experimental studies to document the validity of his position. He demonstrated that individual differences exist in human sensory and motor functioning, such as reaction time, visual acuity, and physical strength.
Galton's work was extended by the U.S. psychologist James McKeen Cattell, who coined the term mental test (Cattell, 1890). Cattell's doctoral dissertation was based on Galton's work on individual differences in reaction time. As such, Cattell perpetuated and stimulated the forces that ultimately led to the development of modern tests.
The Experimental Psychologists
Before experimental psychology was founded and before psychology became a science, mathematical models of the mind had been developed, in particular those of J. E.
Herbart, who used these models as the basis for educational theories that strongly influenced 19th-century educational practices. Following Herbart, E. H. Weber attempted to demonstrate the existence of a psychological threshold, and G. T. Fechner then devised the law stating that the strength of a sensation grows as the logarithm of the stimulus intensity.
Wilhelm Wundt, who set up a laboratory at the University of Leipzig in 1879, is credited with founding the science of psychology, following in the tradition of Weber and Fechner. Wundt was succeeded by E. B. Titchener, whose student, G. Whipple, recruited L. L. Thurstone. Whipple provided the basis for immense changes in the field of testing by conducting a seminar at the Carnegie Institute in 1919 attended by Thurstone, E. Strong, and other early prominent U.S. psychologists. From this seminar came the Carnegie Interest Inventory and, later, the Strong Vocational Interest Blank. Thus, psychological testing developed from at least two lines of inquiry: one based on the work of Darwin, Galton, and Cattell on the measurement of individual differences, and the other (more theoretically relevant and probably stronger) based on the work of the German psychophysicists Herbart, Weber, Fechner, and Wundt.
- With experimental psychology came a great interest in developing apparatus and standardized procedures for mapping out the range of human capabilities in the realm of sensation and perception. The first experimental psychologists were interested in discovering general laws governing the relationship between the physical and psychological worlds. They had little or no interest in individual differences (the main item of interest in differential psychology and psychological testing), which they, in fact, tended to view as a source of error.
Nevertheless, their emphasis on the need for accuracy in measurement and for standardized conditions in the laboratory proved to be an important contribution to the emerging field of psychological testing.
- Implication: from this work also came the idea that testing, like an experiment, requires rigorous experimental control.
Francis Galton
While other psychologists were running their own laboratories, Francis Galton became interested in the measurement of psychological functions from an entirely different perspective. He was convinced that intellectual ability was a function of the keenness of one's senses in perceiving and discriminating stimuli, which he in turn believed was hereditary in nature. Hence, he set up his laboratory to investigate this and collected data on a number of physical and physiological characteristics, such as arm span, height, weight, vital capacity, strength of grip, and sensory acuity, on thousands of individuals and families. He made use of cross-tabulations of these anthropometric data, hoping to find characteristics and interrelationships that showed concordance across individuals with different degrees of familial ties (Fancher, 1996, as cited by Urbina, 2014).
In the process, Galton made significant contributions to the fields of statistics and psychological measurement. While charting data comparing parents and their offspring, for instance, he discovered the phenomena of regression and correlation, which provided the groundwork for much subsequent psychological research and data analysis.
- However, Galton did not succeed in his ultimate objective: finding a way of assessing the intellectual capacity of children and adolescents through tests, so as to identify the most gifted individuals early and encourage them to produce many offspring. His work was continued by James Cattell.
Other contributions followed. One of the earliest tests was the Seguin Form Board Test, developed by Edouard Seguin in an effort to educate and evaluate the mentally disabled. Emil Kraepelin (1895), a German psychiatrist, was interested primarily in the clinical examination of psychiatric patients. He prepared a long series of tests to measure what he regarded as basic factors in the characterization of an individual. The tests, employing chiefly simple arithmetic operations, were designed to measure practice effects, memory, and susceptibility to fatigue and to distraction.
Hermann Ebbinghaus (1897), a German psychologist, administered tests of arithmetic computation, memory span, and sentence completion to schoolchildren. Sentence completion was the only one that showed correspondence with scholastic achievement. He wanted to study the effects of fatigue on children's mental ability, so he devised a technique that called for children to fill in the blanks in text passages from which words or word fragments had been omitted.
Intelligence Tests and Achievement Tests
In the 20th century, the French minister of public instruction appointed a commission to study ways of identifying intellectually subnormal individuals in order to provide them with appropriate educational experiences. He commissioned Alfred Binet, who was working at that time with the French physician T. Simon, to develop the first major general intelligence test. Binet's early effort launched the first systematic attempt to evaluate individual differences in human intelligence.
- The first version of the test, known as the Binet-Simon Scale, was published in 1905. This instrument contained 30 items of increasing difficulty and was designed to identify intellectually subnormal individuals. The Binet-Simon Scale of 1905 had a standardization sample consisting of 50 children who had been given the test under standard conditions, that is, with precisely the same instructions and format.
- By 1908, the Binet-Simon Scale had been substantially improved. It was revised to include nearly twice as many items as the 1905 scale. Even more significantly, the size of the standardization sample was increased to more than 200. The 1908 Binet-Simon Scale also determined a child's mental age, thereby introducing a historically significant concept; the mental age concept was one of the most important contributions of the revised 1908 scale.
- In 1911, the Binet-Simon Scale received a minor revision, and intelligence testing became popular. By 1916, L. M. Terman of Stanford University had revised the Binet test for use in the United States. Terman's revision, known as the Stanford-Binet Intelligence Scale (Terman, 1916), was the only American version of the Binet test that flourished.
World War I
The testing movement grew enormously in the United States because of the demand for a quick, efficient way of evaluating the emotional and intellectual functioning of thousands of military recruits in World War I. However, the Binet test was an individual test. The U.S. Army requested the assistance of Robert Yerkes, who was then the president of the American Psychological Association (see Yerkes, 1921). Yerkes headed a committee of distinguished psychologists who soon developed two structured group tests of human abilities: the Army Alpha and the Army Beta. The Army Alpha required reading ability, whereas the Army Beta measured the intelligence of illiterate adults.
World War I fueled the widespread development of group tests. About this time, the scope of testing also broadened to include tests of achievement, aptitude, interest, and personality. The creation of group tests added momentum to the testing movement. Shortly after the appearance of the 1916 Stanford-Binet Intelligence Scale and the Army Alpha test, schools, colleges, and industry began using tests.
Achievement Tests
One of the important developments after World War I was the development of standardized achievement tests. In contrast to essay tests, standardized achievement tests provide multiple-choice questions that are standardized on a large sample to produce norms against which the results of new examinees can be compared. Because of their ease of administration and scoring and their objectivity, standardized achievement tests were often used in school settings; they allowed broader coverage of content and were less expensive and more efficient than essays. In 1923, the development of standardized achievement tests culminated in the publication of the Stanford Achievement Test by T. L. Kelley, G. M. Ruch, and L. M. Terman. By the 1930s, it was widely held that the objectivity and reliability of these new standardized tests made them superior to essay tests.
The Rise of Intelligence Tests
In the 1930s, researchers saw the weaknesses and limitations of existing tests, whose utility and accuracy were constantly criticized. Thus, before the end of the 1930s, developers began to reestablish the respectability of tests. In 1937, the Stanford-Binet was revised again. David Wechsler published the first version of the Wechsler intelligence scales, the Wechsler-Bellevue Intelligence Scale (W-B) (Wechsler, 1939). The Stanford-Binet test had long been criticized for its emphasis on language and verbal skills, which made it inappropriate for many individuals, such as those who cannot speak or who cannot read. Wechsler's inclusion of a nonverbal scale helped overcome some of the practical and theoretical weaknesses of the Binet test. In 1986, the Binet test was drastically revised to include performance subtests. More recently, it was overhauled again in 2003.
Personality Tests: 1920–1940
Before and after World War II, personality tests began to be developed. The earliest personality tests were structured paper-and-pencil group tests.
These tests provided multiple-choice and true-false questions that could be administered to a large group. Although they were met with enthusiasm, researchers scrutinized, analyzed, and criticized the early structured personality tests, just as they had done with the ability tests. Interest then began to grow in projective tests, which provide a relatively ambiguous test stimulus with unclear response alternatives; furthermore, the scoring of projective tests is often subjective.
One popular projective test is the Rorschach inkblot test, first published by Hermann Rorschach of Switzerland in 1921 and later introduced in the U.S. by David Levy. The Thematic Apperception Test (TAT), by Henry Murray and Christiana Morgan, was developed in 1935. Whereas the Rorschach test contained completely ambiguous inkblot stimuli, the TAT was more structured: its stimuli consisted of ambiguous pictures depicting a variety of scenes and situations.
New Approaches to Personality Testing
In 1943, the Minnesota Multiphasic Personality Inventory (MMPI) began a new era for structured personality tests. The idea behind the MMPI, to use empirical methods to determine the meaning of a test response, helped revolutionize structured personality tests. The MMPI, along with its updated companion the MMPI-2 (Butcher, 1989, 1990), is currently the most widely used and referenced personality test. Its emphasis on the need for empirical data has stimulated the development of tens of thousands of studies.
In the early 1940s, J. P. Guilford made the first serious attempt to use factor analytic techniques in the development of a structured personality test. By the end of that decade, R. B. Cattell had introduced the Sixteen Personality Factor Questionnaire (16PF); despite its declining popularity, it remains one of the most well-constructed structured personality tests and an important example of a test developed with the aid of factor analysis.
(Source: Kaplan & Saccuzzo, 2018)
The Rapid Changes in the Status of Testing
The 1940s saw growth not only in psychological testing but also in the applied aspects of psychology. The role and significance of tests paved the way for the training of clinically oriented psychologists and the opening of clinical psychology programs in universities. Other applied branches of psychology, such as industrial, counseling, educational, and school psychology, soon began to blossom. One of the major functions of the applied psychologist was providing psychological testing.
A position paper of the American Psychological Association published 7 years later (APA, 1954) affirmed that the domain of the clinical psychologist included testing. It formally declared, however, that the psychologist would conduct psychotherapy only in "true" collaboration with physicians. Thus, psychologists could conduct testing independently, but not psychotherapy. As long as psychologists assumed the role of testers, they played a complementary but often secondary role vis-à-vis medical practitioners.
Testing Today
Beginning in the 1980s and through the present, several major branches of applied psychology emerged and flourished: neuropsychology, health psychology, forensic psychology, and child psychology. Because each of these important areas of psychology makes extensive use of psychological tests, psychological testing again grew in status and use.
Psychological Testing in the Philippines
Read the pdf.
CHAPTER 2-BASIC STATISTICS IN TESTING AND INTERPRETATION
Learning Outcomes
1. Examine the different statistics applied in psychological testing.
2. Identify the different frames of reference in testing (score interpretation).
UNIT 1-BASIC STATISTICS
Why Do We Need Statistics in Testing?
For purposes of description - numbers provide convenient summaries and allow us to evaluate some observations relative to others. (Ex.
Is a score of 60 below the average, or the same as it?)
To make inferences - these are logical deductions about events that cannot be observed directly. (Ex. One can infer the percentage of people who watched a certain TV show through a simple survey.)
Statistics and principles of measurement lie at the center of the modern science of psychology. Scientific statements are usually based on careful study, and such systematic study requires some numerical analysis.
Types of Statistics
Descriptive statistics - numbers and graphs used to describe, condense, or represent data.
- Frequency distributions
- Grouped frequency distributions - help organize scores into a still more compact form.
- Graphs - frequency distributions can be transformed into pie charts, bar graphs (for discrete or categorical data), and histograms or frequency polygons.
Inferential statistics - methods used to make inferences from observations of a small group, known as a sample, to a larger group of individuals, known as a population.
Variables and Constants (Urbina, 2014)
- Variable - anything that varies.
- Continuous variables, such as time, distance, and temperature, have infinite ranges and really cannot be counted.
- Discrete variables are those with a finite range of values, or a potentially infinite but countable range of values.
- Dichotomous variables are discrete variables that can assume only two values, such as true-false items or the outcomes of coin tosses.
- Polytomous variables are discrete variables that can assume more than two values, such as marital status, race, and so on.
- Constant - anything that does not vary.
The Meaning of Numbers
Because numbers can be used in a multitude of ways, there is a need for a system for classifying different levels of measurement on the basis of the relationships between the numbers and the objects or events to which they are applied.
These classification systems depend on the types of statistical operations that are logically feasible given how the numbers are used.
What is Measurement?
- It is the assignment of numbers to properties or attributes of people, objects, or events using a set of rules.
Characteristics of Measurement:
- It focuses on attributes of people, objects, or events, not on the actual people, objects, or events.
- It uses a set of rules to quantify these attributes; the rules are standardized, clear, understandable, and easy to apply.
- It consists of scaling and classification. Scaling deals with the assignment of numbers so as to quantify attributes; classification refers to defining whether people, events, or objects fall into the same or different categories.
Properties of Scales
Magnitude - the property of "moreness." A scale has the property of magnitude if we can say that a particular instance of the attribute represents more, less, or equal amounts of the given quantity than does another instance. (On a scale of height, for example, if we can say that John is taller than Fred, then the scale has the property of magnitude.)
Equal intervals - when a scale has the property of equal intervals, the relationship between the measured units and some outcome can be described by a straight line or a linear equation.
Absolute zero (0) - obtained when nothing of the property being measured exists. For example, if you are measuring heart rate and observe that your patient has a rate of 0 and has died, then you would conclude that there is no heart rate at all.
Types of Scales (Urbina, 2014; Kaplan & Saccuzzo, 2018)
1. NOMINAL scale (identity only)
- not really a scale but a way of naming or describing things (e.g., occupation, ethnic group); numbers may be assigned instead of words
- used only for classification (e.g., gender)
- values cannot be compared quantitatively; the amount of difference between things may not be known
2. ORDINAL scale (identity + rank)
- more precise measurement than nominal because it classifies but has the property of order or magnitude as well
- the variable being measured is ranked or ordered along some dimension without regard for differences in the distance between scores (e.g., finishing order in a 100m sprint)
3. INTERVAL scale (identity + rank + equality of units)
- classifies variables and ranks them, but also represents the differences between them
- does not have a true zero; instead, it uses constant units of measurement so that differences on a characteristic can be stated and compared (e.g., intelligence tests)
- interval scales have equal-unit scales
4. RATIO scale (identity + rank + equality of units + additivity)
- the highest or ideal level of measurement, because ratio scales have a true value of zero (0)
- an interval scale in which people's distances are given relative to a rational zero (e.g., a person's income level, reaction time to a particular stimulus)
Relevance of scales of measurement to psychological testing: they help keep the relativity in the meaning of numbers in proper perspective; the limitations in the meaning of scores have to be understood, as do the inaccurate inferences that are likely to be made on the basis of these scores, because the measurement process is inexact. (Urbina, 2014)
Permissible Operations
The level of measurement is important because it defines which mathematical operations one can apply to numerical data.
For nominal data, each observation can be placed in only one mutually exclusive category. For example, you are a member of only one gender. One can use nominal data to create frequency distributions, but no mathematical manipulations of the data are permissible.
Ordinal measurements can be manipulated using arithmetic; however, the result is often difficult to interpret because it reflects neither the magnitudes of the manipulated observations nor the true amounts of the property that have been measured.
For example, if the heights of 15 children are rank ordered, knowing a given child's rank does not reveal how tall he or she stands. Averages of these ranks are equally uninformative about height.
With interval data, one can apply any arithmetic operation to the differences between scores. The results can be interpreted in relation to the magnitudes of the underlying property. Interval data cannot be used to make statements about ratios. For example, if IQ is measured on an interval scale, one cannot say that an IQ of 160 is twice as high as an IQ of 80. This mathematical operation is reserved for ratio scales, for which any mathematical operation is permissible.

Frequency distribution- displays scores on a variable or a measure to reflect how frequently each value was obtained. With a frequency distribution, one defines:
- all the possible scores and determines how many people obtained each of those scores
- usually, scores are arranged on the horizontal axis from the lowest to the highest value
- the vertical axis reflects how many times each of the values on the horizontal axis was observed
For most distributions of test scores, the frequency distribution is bell shaped, with the greatest frequency of scores toward the center of the distribution and decreasing frequencies as the values move toward the extremes.
A single test score means more if one relates it to other test scores. A distribution of scores summarizes the scores for a group of individuals. In testing, there are many ways to record a distribution of scores.

Percentile ranks vs percentiles
Percentile ranks replace simple ranks when we want to adjust for the number of scores in a group. A percentile rank answers the question, "What percent of the scores fall below a particular score (Xi)?" In other words, it indicates what percentage of scores fall below a particular score.
Percentiles are the specific scores or points within a distribution.
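As a quick illustration of the percentile-rank idea just defined, the Python sketch below counts the scores that fall below a given score of interest and converts the count to a percentage. The function name and the sample scores are illustrative, not taken from the text:

```python
def percentile_rank(scores, x):
    """Percent of values in `scores` that fall below the score of interest x."""
    below = sum(1 for s in scores if s < x)  # how many cases fall below x
    return below / len(scores) * 100         # proportion below x, as a percent

scores = [2, 4, 6, 8, 10]
print(percentile_rank(scores, 8))  # 3 of the 5 scores fall below 8 -> 60.0
```

Note that, following the definition above, the count is of scores strictly below the score of interest; some textbooks instead count scores "at or below," which gives slightly different values.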
Percentiles divide the total frequency for a set of observations into hundredths. A percentile indicates the particular score below which a defined percentage of scores falls.
To calculate a percentile rank, you need only follow these simple steps: (1) determine how many cases fall below the score of interest, (2) determine how many cases are in the group, (3) divide the number of cases below the score of interest (Step 1) by the total number of cases in the group (Step 2), and (4) multiply the result of Step 3 by 100. The formula is:

Pr = (B / N) x 100

where Pr is the percentile rank, B is the number of scores below the score of interest, and N is the total number of scores. (Source: Kaplan & Saccuzzo, 2018)

Steps
1. Arrange data in ascending order—that is, the lowest score first, the second lowest score second, and so on.
2. Determine the number of cases with worse rates than the score of interest.
3. Determine the number of cases in the sample.
4. Divide the number of scores worse than the score of interest (Step 2) by the total number of scores (Step 3).
5. Multiply the result by 100. (Kaplan & Saccuzzo, 2018)

Describing Distributions
Measures of Central Tendency
Statistics are used to summarize data. If you consider a set of scores, the mass of information may be too much to interpret all at once. That is why we need numerical conveniences to help summarize the information.
Mean- the arithmetic average score in a distribution:

x̄ = Σx / N

Mode- the most frequently occurring value in a distribution; used primarily when dealing with qualitative or categorical variables. Strictly speaking, there can be only one mode or—if there is no variability in a distribution—no mode at all. However, if two or more values in a distribution are tied with the same maximum frequency, the distribution is said to be bimodal or multimodal.
Median (Mdn)- the value that divides a distribution that has been arranged in order of magnitude into two halves.
If the number of values (n) in the distribution is odd, the median is simply the middle value; if n is even, the median is the midpoint between the two middle values.

Measures of Variability
Variability describes how much dispersion, or scatter, there is in a set of data. When added to information about central tendency, measures of variability help us to place any given value within a distribution and enhance the description of a data set. (Source: Cohen & Swerdlik, 2018)

Variance
The variance is the sum of the squared differences or deviations between each value (X) in a distribution and the mean of that distribution (M), divided by n. The variance is the average of the sum of squares (SS). The sum of squares represents the total amount of variability in a score distribution, and the variance (SS/n) represents its average variability. Due to the squaring of the deviation scores, however, the variance is not in the same units as the original distribution.

σ² (or S²) = Σ(x − x̄)² / N

Standard Deviation
This is an approximation of the average deviation around the mean: the square root of the variance. The standard deviation is a gauge of the average variability in a set of scores, expressed in the same units as the scores. For a sample,

s = √[ Σ(x − x̄)² / (N − 1) ]   or, computationally,   s = √[ (Σx² − (Σx)² / N) / (N − 1) ]

(Kaplan & Saccuzzo, 2018)
In the example given by Kaplan and Saccuzzo, three sets of scores each have a mean of 4. The mean alone cannot give much information about the scores of the three sets, so we need to compute the variability of the three sets. The three distributions of scores appear quite different but have the same mean, so it is important to consider other characteristics of the distribution of scores besides the mean. The difference between the three sets lies in variability: there is no variability in Set 1, a small amount in Set 2, and a lot in Set 3.
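Since the Kaplan and Saccuzzo figure is not reproduced here, the sketch below uses three hypothetical sets that each have a mean of 4 (the actual values in the textbook figure may differ) and computes the population variance and standard deviation with Python's statistics module:

```python
from statistics import mean, pvariance, pstdev

# Hypothetical score sets, each with mean 4: no spread, a little, a lot
set1 = [4, 4, 4, 4, 4]
set2 = [3, 4, 4, 4, 5]
set3 = [0, 2, 4, 6, 8]

for s in (set1, set2, set3):
    # pvariance and pstdev use the population formulas (divide by N);
    # statistics.stdev would use the sample formula (divide by N - 1)
    print(mean(s), pvariance(s), round(pstdev(s), 2))
```

Identical means, very different spreads: the standard deviations come out as 0, about 0.63, and about 2.83, which is exactly the point the three-set example makes.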
NOTE: The standard deviation is thus the square root of the average squared deviation around the mean. Although the standard deviation is not an average deviation, it gives a useful approximation of how much a typical score is above or below the average score. Because of their mathematical properties, the variance and the standard deviation have many advantages. For example, knowing the standard deviation of a normally distributed batch of data allows us to make precise statements about the distribution.

The Importance of Variability
The psychological testing enterprise depends on variability across individuals. Without individual differences there would be no variability, and tests would be useless in helping us to make determinations or decisions about people.

The Normal Curve Model
Definition
Also known as the bell curve. Its baseline, equivalent to the X-axis of the distribution, shows the standard deviation (σ) units; its vertical axis, or ordinate, usually does not need to be shown because the normal curve is not a frequency distribution of data but a mathematical model of an ideal or theoretical distribution.

Properties of the Normal Curve Model
It is bell shaped, as its nickname indicates.
It is bilaterally symmetrical, which means its two halves are identical (if we split the curve into two, each half contains 50% of the area under the curve).
It has tails that approach but never touch the baseline, and thus its limits extend to ± infinity (±∞), a property that underscores the theoretical and mathematical nature of the curve.
It is unimodal; that is, it has a single point of maximum frequency or maximum height.
It has a mean, median, and mode that coincide at the center of the distribution, because the point where the curve is in perfect balance, which is the mean, is also the point that divides the curve into two equal halves, which is the median, and the most frequent value, which is the mode.
68.27% of the observations or data fall within ±1 SD of the mean.
95.45% of the data or observations fall within ±2 SD.
99.73% of the data fall within the range of ±3 SD.
(Source: Kaplan & Saccuzzo, 2018)

Uses of the Normal Curve Model
1. Descriptive Uses - The proportions of the area under the standard normal curve that lie above and below any point of the baseline, or between any two points of the baseline, are pre-established—and easy to find in the tables of areas of the normal curve.
2. Inferential Uses
- estimating population parameters, and
- testing hypotheses about differences.
The normal curve has two tails. The area on the normal curve between 2 and 3 standard deviations above the mean is referred to as a tail. The area between −2 and −3 standard deviations below the mean is also referred to as a tail.

Skewness (Cohen & Swerdlik, 2018)
Skewness is an indication of how the measurements in a distribution are distributed. A distribution has a positive skew when relatively few of the scores fall at the high end of the distribution. Positively skewed examination results may indicate that the test was too difficult. A distribution has a negative skew when relatively few of the scores fall at the low end of the distribution. Negatively skewed examination results may indicate that the test was too easy. In this case, more items of a higher level of difficulty would make it possible to better discriminate between scores at the upper end of the distribution.

UNIT 2- INTERPRETATION OF SCORES
TEST SCORE INTERPRETATION (Kaplan & Saccuzzo, 2018)
Raw score- a number that summarizes or captures some aspect of a person's performance in the carefully selected behavior samples that make up psychological tests.
(Urbina, 2014)
- Direct numerical report of a person's test performance
- By itself, a raw score does not convey any meaning
Some tests of cognitive ability—in particular, some neuropsychological instruments—are scored in terms of number of errors or speed of performance, so that the higher the score, the less favorable the result. Moreover, we do not know how high a raw score is without some kind of frame of reference.

FRAMES OF REFERENCE FOR TEST SCORE INTERPRETATION
1. Norms or norm-referenced test interpretation
2. Criterion-referenced tests

1. Norms or Norm-referenced test interpretation
Norm- refers to the test performance or typical/average behavior of one or more reference groups. Norms are empirically established by determining what the persons in a representative group actually do in the test.
- The normative sample is that group of people whose performance on a particular test is analyzed for reference in evaluating the performance of individual test takers. Whether broad or narrow in scope, members of the normative sample will all be typical with respect to some characteristic(s) of the people for whom the particular test was designed.
- Norming- refers to the process of deriving norms. Norming may be modified to describe a particular type of norm derivation. For example, race norming is the controversial practice of norming on the basis of race or ethnic background.
Norm-referenced- a method of evaluation and a way of deriving meaning from test scores by evaluating an individual test taker's score and comparing it to the scores of a group of test takers. In this approach, the meaning of an individual test score is understood relative to other scores on the same test.
- Uses standards based on specific groups of people to provide information for interpreting scores.

Sampling to Develop Norms
- Standardization or test standardization- the process of administering a test to a representative sample of test takers for the purpose of establishing norms.
Sampling- In the process of developing a test, a test developer targets some defined group as the population for which the test is designed. This population is the complete universe or set of individuals with at least one common, observable characteristic. Sampling approaches include:
- purposive sampling
- stratified sampling
- incidental (or convenience) sampling, etc.
- Having obtained a sample, the test developer administers the test according to the standard set of instructions that will be used with the test.
- Establishing a standard set of instructions and conditions under which the test is given makes the test scores of the normative sample more comparable with the scores of future test takers.
- After all the test data have been collected and analyzed, the test developer will summarize the data using descriptive statistics, including measures of central tendency and variability.
- In addition, it is incumbent on the test developer to provide a precise description of the standardization sample itself. Good practice dictates that the norms be developed with data derived from a group of people who are presumed to be representative of the people who will take the test in the future. After all, if the normative group is very different from future test takers, the basis for comparison becomes questionable at best.

Types of Norms
1. Developmental Norms
a. Ordinal Scales Based on Behavioral Sequences
Human development is characterized by sequential processes in a number of behavioral realms. A classic example is the sequence that normal motor development follows during infancy. In the first year of life, most babies progress from the fetal posture at birth, through sitting and standing, to finally walking alone. Whenever a universal sequence of development involves an orderly progression from one behavioral stage to another—more advanced—stage, the sequence itself can be converted into an ordinal scale and used normatively.
In this case, the frame of reference for test score interpretation is derived from observing and noting certain uniformities in the order and timing of behavioral attainments across many individuals, to indicate how far along the developmental path the individual has progressed (the test score).
The pioneer in the development of this type of scale was Arnold Gesell, a psychologist and pediatrician who published the Gesell Developmental Schedules in 1940, based on a series of longitudinal studies conducted by him and his associates at Yale over a span of several decades (Ames, 1989).
Ex. Gesell Developmental Schedules

b. The Provence Birth-to-Three Developmental Profile
The Provence Birth-to-Three Developmental Profile (Provence Profile) is part of the Infant-Toddler Developmental Assessment (IDA). The IDA is an integrated system designed to help in the early identification of children who are developmentally at risk and possibly in need of monitoring or intervention. Through naturalistic observation and parental reports, the Provence Profile provides information about the timeliness with which a child attains developmental milestones in eight domains, in relation to the child's chronological age. The developmental domains are Gross Motor Behavior, Fine Motor Behavior, Relationship to Inanimate Objects, Language/Communication, Self-Help, Relationship
(Urbina, 2014)

c. Theory-Based Ordinal Scales
Some theories, such as Jean Piaget's proposed stages of cognitive development from infancy to adolescence or Lawrence Kohlberg's theory of moral development, posit an orderly and invariant sequence or progression derived at least partly from behavioral observations. Some of them have generated ordinal scales designed to evaluate the level that an individual has attained within the proposed sequence; these tools are used primarily for purposes of research rather than for individual assessment.
Ex. Jean Piaget's stages of cognitive development, e.g.
sensory discrimination, linguistic communication; also Lawrence Kohlberg's theory of moral development

d. Mental Age Scores
The oldest approach to norming; mental age scores were computed on the basis of the child's performance, which earned credits in terms of years and months, depending on the number of chronologically arranged tests that were passed.
Use of the Ratio IQ
Mental age (MA) is defined as the level of development in mental ability, expressed as equivalent to the chronological age (CA) at which the average group of individuals reaches that level.

IQ = (MA / CA) x 100

Ex. Tests passed by the majority of 7-year-olds in the standardization sample are placed at the 7-year level.

Pros and cons of Mental Age Norms
PROS
- MA scores are easily comprehended and simple.
- They are informative, since they provide an index of a child's general intellectual ability.
CONS
- There is a technical difficulty: the chronological age has to be considered along with the mental age; otherwise the mental age cannot be given an appropriate interpretation.
- MA units are not equal: they do not remain constant with age but tend to shrink with advancing years.
- In light of these difficulties, the ratio IQ based on mental age scores has been abandoned. However, several current tests still provide norms that are presented as age-equivalent scores and are based on the average raw score performance of children of different age groups in the standardization sample.

e. Age-equivalent scores, also known as test ages or test-age equivalents, simply represent a way of equating the test taker's performance on a test with the average performance of the normative age group with which it corresponds. For example, if a child's raw score equals the mean raw score of 9-year-olds in the normative sample, her or his test-age equivalent score is 9 years.
Pros and cons
Whatever the procedures used to obtain scores labeled in terms of age, their use remains problematic: not only do they suggest there is such a thing as "normal" performance for a given age, but the rate of development also varies widely within age groups, and the differences in behavioral attainments that can be expected with each passing year diminish greatly from infancy and early childhood to adolescence and adulthood.
If the meaning of a test age is extended to realms other than the specific behavior sampled by the test—for example, when an adolescent who gets a test-age score of 8 years is described as having "the mind of an 8-year-old"—the use of such scores can be quite misleading.

f. Grade Equivalent Scores
These are derived by locating the performance of test takers within the norms of the students at each grade level in the standardization sample.
Ex. If a child has scored at the 7th grade in reading and the 5th grade in arithmetic, it means her/his performance on the reading test matches the average performance of the 7th graders in the standardization sample, and that of the 5th graders on arithmetic.
Pros
- Grade equivalents offer convenient units for plotting profiles of student achievement, emphasizing areas of underachievement and overachievement.
Cons
- The content of curricula and quality of instruction vary across schools, etc.; hence, grade equivalents do not provide a uniform standard.
- The advance expected in elementary school, in terms of academic achievement, is much greater than it is in middle or high school.

g. National norms
As the name implies, national norms are derived from a normative sample that was nationally representative of the population at the time the norming study was conducted.
For example, national norms may be obtained by testing large numbers of people representative of different variables of interest such as age, gender, racial/ethnic background, socioeconomic strata, geographical location (such as North, East, South, West, Midwest), and different types of communities within the various parts of the country (such as rural, urban, suburban).

Sub-group norms
- A normative sample can be segmented by any of the criteria initially used in selecting subjects for the sample. Norms can be reported or separated into sub-group norms; provided the sub-groups are of sufficient size and fairly representative of their categories, such norms can be collected after a test has been standardized.
Ex. MMPI adult norms vs. MMPI adolescent norms

Local norms
- Test users may evaluate scores on the basis of reference groups drawn from a specific geographical or institutional setting. In this case, test users may choose to develop local norms for members of a more narrowly defined population.
Example: employees of a particular company, students of a certain university, etc.

Scores Used for Expressing Within-Group Norms (Normative Groups)
1. Percentile Rank
- Most direct and ubiquitous (pervasive) method used to convey norm-referenced test results.
- It represents the percentage of persons in the reference group who scored at or below a given raw score.

Differences between percentile rank and percentage
- Percentile rank: reflects the rank/position of an individual's performance on a test in comparison to a reference group.
- Percentage: reflects the number of correct responses that an individual obtains out of the total possible number of correct responses on a test.
- Frame of reference for the percentile rank: other people. Frame of reference for the percentage: the entire content of the test.

Pros (Percentile rank)
- Easy to calculate
- Easy to understand, even for untrained people
Cons
- Score units are markedly unequal; the inequality is most pronounced for the most extreme scores in a distribution.

Test ceiling- If a test taker reaches the highest score attainable on an already standardized test, this implies that the test ceiling, or maximum difficulty of the test, is insufficient.
Test floor- If a person fails all the items presented in a test, or scores lower than any of the people in the normative sample, there is insufficient test floor.

2. Z-scores
In order to solve the problem of the inequality of percentile units and still convey the meaning of test scores compared to a normative or reference group, we can transform raw scores into scales that express the position of the scores, relative to the mean, in standard deviation units.
Standard score- a raw score that has been converted from one scale to another scale, where the latter scale has some arbitrarily set mean and standard deviation. Raw scores may be converted to standard scores because standard scores are more easily interpretable than raw scores. With a standard score, the position of a test taker's performance relative to other test takers is readily apparent.
- Linear transformation- changes the units in which scores are expressed while leaving the interrelationships among them intact (the shape of a linearly derived scale score distribution for a given group of test takers is the same as that of the original raw score distribution).
Z-score- the result of converting a raw score into a number indicating how many standard deviation units the raw score is below or above the mean of the distribution.
- A z score indicates the relative location of a score within a distribution. A z of 0 is always located at the mean, which indicates that 50% of the scores fall below it and 50% of the scores fall above it.
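These definitions can be sketched in a few lines of Python. The helper names are mine, the example numbers (a raw score of 48 in a class with mean 40 and SD 8) are illustrative, and the normal-curve area comes from the standard library's NormalDist (Python 3.8+) rather than a printed table:

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Position of raw score x in standard-deviation units from the mean."""
    return (x - mean) / sd

def to_scale(z, new_mean, new_sd):
    """Linear transformation onto a derived scale: z * (new SD) + new mean."""
    return z * new_sd + new_mean

z = z_score(48, 40, 8)          # one SD above the class mean -> +1.00
pr = NormalDist().cdf(z) * 100  # percent of the normal curve below z

print(z)                     # 1.0
print(round(pr, 2))          # 84.13 (percentile rank of z = +1.00)
print(to_scale(z, 100, 15))  # 115.0 on a deviation-IQ scale (M=100, SD=15)
print(to_scale(z, 50, 10))   # 60.0 on a T-score scale (M=50, SD=10)
```

The same to_scale helper covers any of the derived standard scales discussed in this unit, since each is just a choice of new mean and new SD.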
Z scores have a mean of 0 and a standard deviation of 1.0.
Example: If someone's Z score is +1.00, it is 1 SD above the mean.
Example: To calculate the percentile rank for a Z score of +1.00, we need to find the percentage of scores from the bottom of the curve up to the Z score of +1.00. We do this by adding 34.13%, the percentage of scores between the mean and 1 SD above the mean, to 50%, the percentage of all scores that fall below the mean: +1 Z = 34.13% + 50% = 84.13%. We now know that the person who had a Z score of +1.00 scored the same as or better than 84.13% of the people who took this test. So this person's percentile rank is 84.13.
In addition to providing a convenient context for comparing scores on the same test, standard scores provide a convenient context for comparing scores on different tests.

Simon's math test: X = 48; class M = 40, SD = 8
z = (X − M)/SD = (48 − 40)/8 = +1.00

Simon's English test: X = 58; class M = 62, SD = 4
z = (X − M)/SD = (58 − 62)/4 = −1.00

Interpretation: Simon's raw score in English is higher than his raw score in math, yet the z scores show that, relative to his classmates, Simon did much better on his math test than on his English test. Simon scored at the 84.13th percentile in math compared to his classmates (PR for +1 Z = 50% + 34.13% = 84.13%). However, he scored only at the 15.87th percentile in English (PR for −1 Z = 50.00 − 34.13 = 15.87%). Simon needs help in English; on this test he did better than only 15.87% of his classmates. 3.
Deviation IQ
Introduced by David Wechsler in his intelligence scales (e.g., the Wechsler Adult Intelligence Scale, WAIS).
To differentiate it from the ratio IQ: deviation IQs (DIQs) are obtained by converting raw scores on the various subtests into Wechsler scale scores, adding these scores, and locating the sum in the appropriate normative table.
The mean is 100 and the SD is 15.
Example: how to transform a z score into a DIQ or other derived standard score. For a z score of +1.00 converted to an IQ scale with mean = 100 and SD = 15:

New standard score = (z score)(new SD) + new mean
IQ score = (+1.00)(15) + 100 = 115

Other Derived Standard Scores
- T-score (M = 50, SD = 10), used in many personality tests such as the MMPI
- CEEB (College Entrance Examination Board) scores (M = 500, SD = 100), traditionally used by the College Board's SAT
- Wechsler scale subtest scores (M = 10, SD = 3), used for all the subtests of the Wechsler scales
- Otis-Lennon School Ability Indices (M = 100, SD = 16)

Normalized Standard Scores- Nonlinear Transformation
Not all score transformations are linear. Nonlinear transformations are those that convert a raw score distribution into a distribution that has a different shape than the original. This can be done through methods that afford test developers greater flexibility in dealing with raw score distributions than linear conversions do. For instance, the transformation of normally distributed raw scores into percentile rank scores, which we have already considered, is a nonlinear conversion. It is accomplished by transforming each raw score into a z-score and locating the z-score in the Table of Areas of the Normal Curve to derive the proportion or percentage of the area of the normal curve that is below that point.

Types of nonlinearly derived standard scores (Urbina, 2014)
Steps (from raw score and cumulative percent to a normalized score):
1. The raw score and cumulative percent are located in the distribution.
2.
The cumulative proportion (cp) is the cumulative percent divided by 100.
3. A normalized z score is obtained from the Table of Areas of the Normal Curve, by finding the area of the curve that comes closest to the cumulative proportion for a given score. For scores with cumulative proportions above 0.50, the areas in the larger portion must be used to obtain the normalized z scores.

Raw Score   Cumulative %   cp      Normalized z score
49          98.3           0.983   +2.12
40          53.3           0.533   +0.08
39          45.0           0.450   -0.13
36          23.3           0.233   -0.73
29          1.7            0.017   -2.12
(Source: Urbina, 2014)

Stanine- The standard nine, or stanine, scale transforms all the scores in a distribution into single-digit numbers from 1 to 9. This device has the distinct advantage of reducing the time and effort needed to enter scores on a computer for storage and further processing. Stanine transformations also make use of cumulative frequency and cumulative percentage distributions, and the scores are allocated on the basis of the percentage of cases at given score ranges.

STANINE:                         1   2   3   4   5   6   7   8   9
% of cases within each stanine:  4   7  12  17  20  17  12   7   4
Cumulative % at each stanine:    4  11  23  40  60  77  89  96 100

STEN- similar to the stanine; called the "standard ten."
- This system provides five standard-score units on each side of the mean, each being one-half standard deviation in width, except for the sten values of 1 and 10, which correspond to the z-values of −2.00 and +2.00, respectively.
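Both nonlinear conversions described above can be sketched with the standard library: NormalDist().inv_cdf plays the role of the Table of Areas of the Normal Curve, and the cumulative percentages serve as stanine (and sten) cut points. The helper functions and the handling of boundary percentiles are my own illustrative choices:

```python
from statistics import NormalDist
from bisect import bisect_left

nd = NormalDist()

# Normalized z score: the z whose cumulative normal-curve area
# matches the score's cumulative proportion
for cp in (0.983, 0.533, 0.450, 0.233, 0.017):
    print(round(nd.inv_cdf(cp), 2))  # reproduces the table: +2.12 ... -2.12

# Upper cumulative-% bounds of stanines 1-8 and stens 1-9
# (the top category absorbs everything above the last bound)
STANINE_CUM = [4, 11, 23, 40, 60, 77, 89, 96]
STEN_CUM = [2, 7, 16, 31, 50, 69, 84, 93, 98]

def stanine(percentile):
    """Single-digit stanine (1-9) for a percentile rank."""
    return bisect_left(STANINE_CUM, percentile) + 1

def sten(percentile):
    """Standard-ten score (1-10) for a percentile rank."""
    return bisect_left(STEN_CUM, percentile) + 1
```

For example, a percentile rank of 60 falls at the top of stanine 5, and a percentile rank of 50 falls at the top of sten 5, matching the cumulative-percentage rows of the tables.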
STEN:                         1   2   3   4   5   6   7   8   9  10
% of cases within each sten:  2   5   9  15  19  19  15   9   5   2
Cumulative % at each sten:    2   7  16  31  50  69  84  93  98 100

INTERTEST COMPARISON
Norm-referenced test scores cannot be compared unless they are obtained from the same test, using the same normative distribution. An additional reason for the lack of comparability of test scores stems from differences in scale units, like the various sizes of SD units discussed earlier in connection with deviation IQs. Furthermore, even when the tests, the norms, and the scale units employed are the same, test scores do not necessarily have the same meaning. When test scores are used in the context of individual assessment, it must be kept in mind that many other factors extraneous to the test may also enter into test results (e.g. the test taker's background and motivation, the influence of the examiners, and the circumstances under which the tests were taken).

2. Criterion-Referenced Tests
Also known as domain-referenced, content-referenced, objective-referenced, or competency testing. This approach describes the specific types of skills, tasks, or knowledge that the test taker can demonstrate, such as mathematical skills. Instead of comparing a person's performance to that of others, the performance of an individual, or a group, is compared to a predetermined criterion or standard.
- The criterion may refer either to knowledge of a specific content domain or to competence in some kind of endeavor.
The standards by which criterion-referenced tests are evaluated are typically defined in terms of specified levels of knowledge or expertise necessary to pass a course, obtain a degree, or get a professional license; they may also involve a demonstration of sufficient competence to do a job or to create a product. The validity of the inferences made on the basis of scores needs to be established through empirical links between test scores and performance on the criterion.
NORM-REFERENCED vs. CRITERION-REFERENCED TEST INTERPRETATION
- Norm-referenced tests seek to locate the performance of one or more individuals, with regard to the construct the tests assess, on a continuum created by the performance of a reference group. In norm-referenced test interpretation, the frame of reference is always people.
- Criterion-referenced tests seek to evaluate the performance of individuals in relation to standards related to the construct itself. The frame of reference may be (a) knowledge of a content domain as demonstrated in standardized, objective tests, or (b) the level of competence displayed in the quality of a performance or of a product.
The term criterion-referenced testing is sometimes also applied to describe test interpretations that use the relationship between the scores and expected levels of performance or standing on a criterion as a frame of reference.

Examples of Domain-Referenced Test Objectives and Items
I. Domain: Arithmetic
A. Content area to be assessed: Multiplication of fractions
B. Objectives to be assessed:
1. Knowledge of the steps involved in multiplying fractions
2. Understanding of the basic principles involved in multiplying fractions
3. Application of principles in solving fraction multiplication problems
C. Sample test items for each objective:
Item 1. List the steps involved in multiplying fractions.
Item 2. Draw a diagram to show 1/4 of 1/2 of a pie.
Item 3. How much is 3/4 × 1/2?

Performance Assessment
This is utilized for the purpose of making decisions in the workplace, and in the realm of education as well, where there is often a need to ascertain or certify competence in the performance of tasks that are more realistic, more complex, more time-consuming, or more difficult to evaluate (than those typical of content- or domain-referenced testing).
It calls for evaluating performance through work samples, work products, or some other behavioral display of competence and skill in situations that simulate real-life settings. The criterion in criterion-referenced test interpretation is the quality either of the performance itself or of the product that results from applying a skill.

Evaluation and Scoring in the Assessment of Performance
The assessment of performance tends to rely more heavily on subjective judgment. An exception to this rule occurs when criteria can be quantified in terms of speed of performance, number of errors, units produced, or some other objective standard (e.g. typing in a clerical job).
The usual methods for evaluating qualitative criteria involve rating scales or scoring rubrics (i.e., scoring guides) that describe and illustrate the rules and principles to be applied in scoring the quality of a performance or product (e.g. the scoring of athletic performances by designated expert judges in events such as figure skating or diving competitions).

Mastery Testing
Procedures that evaluate test performance on the basis of whether the individual test taker does or does not demonstrate a pre-established level of mastery are known as mastery tests. Many of these tests yield all-or-none scores, such as pass or fail, based on some criterion level that separates mastery from non-mastery.
Example: the driving tests many states require for the issuance of a driver's license. Demonstrated mastery is especially critical for high-stakes performances—such as landing an airplane on an aircraft carrier or performing brain surgery.

Predicting Performance
In predicting performance, the criterion is an outcome to be estimated or predicted by means of a test. This type of information constitutes the basis for establishing the predictive validity of tests: "What level of criterion performance can one expect from a person who obtains this score?
or “Is the test taker’s performance on the test sufficient to assure the desired level of criterion performance in a given endeavor?”
NOTE: Relationship Among the Frames of Reference
The distinctions between the two frames of reference for test interpretation, as well as among the varieties within each frame, are matters of emphasis. Even though scores can be expressed in a variety of ways, all testing ultimately relies on a normative framework in the broadest sense of that term. The standards used in criterion-referenced test score interpretations must be based on expectations that are realistic or feasible for the population of test takers for whom the test is meant.
CHAPTER III - CHARACTERISTICS OF PSYCHOLOGICAL TESTS
Learning outcomes:
1. Analyze the psychometric properties of tests.
2. Demonstrate the reliability and validity of psychological tests.
RELIABILITY
What is a “Good” Test?
Unlike the physical sciences, which can measure with relative precision, measuring a phenomenon in the field of psychology is more complicated. Statistics has helped significantly in describing and making inferences about psychological phenomena, which eventually led to the development of methods of measurement. Measurement in the form of psychological tests has been a technique for assessing and understanding human behavior, and it is vital to understand how these measurements are developed and used. It is also imperative that the psychometric properties of psychological tests are clearly analyzed and understood, so that tests can be properly viewed as measurement tools.
Reliability
One of the properties of a good test is reliability. Reliability is based on the consistency and precision of the results of the measurement process, which suggests trustworthiness.
To have some degree of confidence in, or trust of, scores, test users require evidence that the scores obtained from tests would be consistent if the tests were repeated on the same individuals or groups, and that the scores are reasonably precise.
The Concept of “Error”
Lack of reliability implies inconsistency and imprecision, both of which are equated with measurement error. Measurement error is defined as any fluctuation in scores that results from factors related to the measurement process that are irrelevant to what is being measured. No measure is completely accurate: measurements are always subject to some degree of error and fluctuation. In the social and behavioral sciences, measurements are much more prone to error because of the elusive nature of the constructs being assessed, and because the behavioral data through which those constructs are assessed can be affected by many more intractable factors than other types of data. Psychological test scores, in particular, are especially susceptible to influences from a variety of sources, including the test taker, the examiner, and the context in which testing takes place, all of which may result in variability that is extraneous to the purpose of the test. Knowing about reliability enables us to calculate that margin of error.
Two Kinds of Measurement Error
1. Random error is a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process. Sometimes referred to as “noise,” this source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores.
Examples of random error that could conceivably affect test scores range from unanticipated events happening in the test environment (such as a lightning strike or a spontaneous “occupy the university” rally) to unanticipated physical events happening within the test taker (such as a sudden and unexpected surge in the test taker’s blood sugar or blood pressure).
2. Systematic error refers to a source of error in measuring a variable that is typically constant, or proportionate to what is presumed to be the true value of the variable being measured. Often it is the measuring instrument itself that is the source of systematic error. Once a systematic error becomes known, it becomes predictable, as well as fixable.
Theories of Reliability
1. Classical Test Theory (CTT)
Classical test score theory assumes that each person has a true score that would be obtained if there were no errors in measurement. However, because measuring instruments are imperfect, the score observed for each person almost always differs from the person’s true ability or characteristic. The difference between the true score and the observed score results from measurement error:
X = T + E
where X is the observed score, T is the true score, and E is the error score. Equivalently, the difference between the score we obtain and the score we are really interested in is the error of measurement:
X − T = E
CTT assumes that measurement errors are random. Basic sampling theory tells us that the distribution of random errors is bell shaped. Thus, the center of the distribution should represent the true score, and the dispersion around the mean of the distribution should display the distribution of sampling errors. (Source: Kaplan, 2018)
2. Item Response Theory (IRT)
More sophisticated procedures based on mathematical models are increasingly replacing the traditional equating techniques just described.
Using IRT, the computer focuses on the range of item difficulty that helps assess an individual’s ability level. For example, if the person gets several easy items correct, the computer might quickly move to more difficult items. If the person gets several difficult items wrong, the computer moves back to the area of item difficulty where the person gets some items right and some wrong. Then, this level of ability is intensively sampled. The overall result is that a more reliable estimate of ability is obtained using a shorter test with fewer items. But there are difficulties with applications of IRT. For instance, the method requires a bank of items that have been systematically evaluated for level of difficulty (Templin, 2016). Considerable effort must go into test development, and complex computer software is required. (Source: Statistics How To, www.statisticshowto.com)
Models of Reliability
The need for high standards of reliability is important in the development of psychological tests. One can determine the reliability of a test through the reliability coefficient. Most reliability coefficients are correlations; however, it is sometimes more useful to define reliability as its mathematically equivalent ratio. The reliability coefficient is the ratio of the variance of the true scores on a test to the variance of the observed scores:
r = σ²T / σ²X
where
r = the theoretical reliability of the test
σ²T = the variance of the true scores
σ²X = the variance of the observed scores
The ratio of true score variance to observed score variance can be thought of as a percentage: the percentage of the observed variance σ²X that is attributable to variation in the true score. If we subtract this ratio from 1.0, we have the percentage of variation attributable to random error.
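This ratio can be illustrated with a small simulation under classical test theory. The sample size, means, and standard deviations below are invented for illustration; the point is only that the empirical ratio var(T)/var(X) converges on the theoretical reliability σ²T / (σ²T + σ²E):

```python
import random

random.seed(42)

# Simulate classical test theory: X = T + E.
# True scores (T) vary across people; error (E) is random noise.
n_people = 10_000
true_sd, error_sd = 15.0, 10.0  # hypothetical values, chosen for illustration

true_scores = [random.gauss(100, true_sd) for _ in range(n_people)]
observed = [t + random.gauss(0, error_sd) for t in true_scores]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Theoretical reliability: r = var(T) / var(X) = var(T) / (var(T) + var(E))
theoretical = true_sd**2 / (true_sd**2 + error_sd**2)  # 225 / 325, about 0.69
empirical = variance(true_scores) / variance(observed)

print(f"theoretical r = {theoretical:.3f}")
print(f"empirical   r = {empirical:.3f}")
# The remainder, 1 - r (about 0.31 here), is the proportion of observed
# score variance attributable to random error.
```

With 10,000 simulated test takers the empirical ratio lands very close to the theoretical value; with small samples it fluctuates, which is one reason reliability estimates reported in test manuals depend on large normative groups.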
Example: Suppose you are given a test that will be used to select people for a particular job, and the reliability of the test is .40. When the employer gets the test results back and begins comparing applicants, 40% of the variation or difference among the people will be explained by real differences among people, and 60% must be ascribed to random or chance factors. Now you can see why the government needs to insist on high standards of reliability. If all the test score variance were true variance, score reliability would be perfect (1.00).
A reliability coefficient may be viewed as a number that estimates the proportion of the variance in a group of test scores that is accounted for by error stemming from one or more sources. From this perspective, the evaluation of score reliability involves a two-step process: (a) determining what possible sources of error may enter into test scores, and (b) estimating the magnitude of those errors.
Sources of Error
1. Time Sampling Error: The Test–Retest Method
Time sampling error refers to the variability inherent in test scores as a function of the fact that they are obtained at one point in time rather than at another. This concept hinges on two related notions: (a) whatever construct or behavior a test evaluates is liable to fluctuate over time, and (b) some of the constructs and behaviors assessed through tests are either less subject to change, or change at a much slower pace, than others. Test–retest reliability estimates are used to evaluate the error associated with administering a test at two different times. This type of analysis is of value only when we measure “traits” or characteristics that do not change over time. For instance, we usually assume that an intelligence test measures a consistent general ability.
As such, if an IQ test administered at two points in time produces different scores, we might conclude that the lack of correspondence is the result of random measurement error; usually we do not assume that a person got more or less intelligent in the time between tests. There is also the possibility of a carryover effect, which occurs when the first testing session influences scores from the second session. Practice effects are one important type of carryover effect: some skills improve with practice, so when a test is given a second time, test takers score better because they have sharpened their skills by having taken the test the first time.
To generate estimates of the amount of time sampling error liable to affect the scores of a given test, it is customary to administer the same test on two different occasions, separated by a certain time interval, to one or more groups of individuals. The correlation between the scores obtained from the two administrations is a test–retest reliability (or stability) coefficient (rtt) and may be viewed as an index of the extent to which scores are likely to fluctuate as a result of time sampling error. When you find a test–retest correlation in a test manual, you should pay careful attention to the interval between the two testing sessions. A well-evaluated test will have many retest correlations associated with different time intervals between testing sessions. Most often, you want to be assured that the test is reliable over the time interval of your own study. You should also consider what events occurred between the original testing and the retest.
2. Item Sampling Error: Parallel Forms Method
Parallel forms reliability compares two equivalent forms of a test that measure the same attribute. The two forms use different items; however, the rules used to select items of a particular difficulty level are the same.
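Computationally, both the test–retest (stability) coefficient just described and the parallel-forms (equivalence) coefficient come down to a Pearson correlation between two sets of scores from the same people. A minimal sketch, using made-up scores for five test takers (two administrations, or equally two forms):

```python
import math

# Hypothetical scores for the same five test takers on two occasions
# (test-retest) or on two forms (parallel forms); numbers are invented.
first  = [88, 92, 75, 83, 97]
second = [85, 95, 78, 80, 99]

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r_tt = pearson_r(first, second)
print(f"stability/equivalence coefficient r = {r_tt:.3f}")
```

In practice the two score columns would hold many more test takers, and a test manual would report such coefficients separately for each retest interval or form pairing.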
When two forms of the test are available, one can compare performance on one form versus the other. Some textbooks refer to this process as equivalent forms reliability, whereas others call it simply parallel forms. Sometimes the two forms are administered to the same group of people on the same day. The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, which is often termed the coefficient of equivalence. The Pearson product-moment correlation coefficient is used as an estimate of the reliability. When both forms of the test are given on the same day, the only sources of variation are random error and the difference between the forms of the test. (The order of administration is usually counterbalanced to avoid practice effects.) The method of parallel forms provides one of the most rigorous assessments of reliability commonly in use. However, test developers often find it burdensome to develop two forms of the same test, and practical constraints make it difficult to retest the same group of individuals. Instead, many test developers prefer to base their estimate of reliability on a single form of a test.
3. Split-Half Method
In split-half reliability, a test is given and divided into halves that are scored separately. The results of one half of the test are then compared with the results of the other. The two halves of the test can be created in a variety of ways. If the test is long, the best method is to divide the items randomly into two halves. Another acceptable way to split a test is to assign odd-numbered items to one half of the test and even-numbered items to the other half; this method yields an estimate of split-half reliability that is also referred to as odd-even reliability. For ease in computing scores for the different halves, however, some people prefer to calculate a score for the first half of the items and another score for the second half.
Step 1. Divide the test into equivalent halves.
Step 2. Calculate a Pearson r between scores on the two halves of the test.
Step 3. Adjust the half-test reliability using the Spearman–Brown formula.
The Spearman–Brown formula allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test. It is a specific application of a more general formula to estimate the reliability of a test that is lengthened or shortened by any number of items. Because the reliability of a test is affected by its length, a formula is necessary for estimating the reliability of a test that has been shortened or lengthened. The general Spearman–Brown formula is:
rSB = (n × rxy) / (1 + (n − 1) × rxy)   (Source: Cohen, 2018)
where
rSB = the reliability adjusted by the Spearman–Brown formula
rxy = the Pearson r in the original-length test
n = the number of items in the revised version divided by the number of items in the original version
By determining the reliability of one half of a test, a test developer can use the Spearman–Brown formula to estimate the reliability of the whole test. Because a whole test is twice as long as half a test, n becomes 2 in the Spearman–Brown formula for the adjustment of split-half reliability. With rhh standing for the Pearson r of scores on the two half-tests:
rSB = 2rhh / (1 + rhh)   (Source: Cohen, 2018)
KR20 Formula
Kuder and Richardson (1937) greatly advanced reliability assessment by developing methods for evaluating reliability within a single test administration. Their KR20 formula avoids potential problems of split-half reliability, such as the two halves having different variances and the need to score each half separately, which creates additional work. Still, the KR20 formula is not appropriate for evaluating internal consistency in some cases.
The KR20 formula requires that you find the proportion of people who got each item “correct.” There are many types of tests, though, for which there are no right or wrong answers, such as many personality and attitude scales. For example, on an attitude questionnaire, you might be presented with a statement such as, “I believe premarital affairs are immoral.” You must indicate whether you strongly disagree, disagree, are neutral, agree, or strongly agree. None of these choices is incorrect, and none is correct; rather, your response indicates where you stand on the continuum between agreement and disagreement. To extend the Kuder–Richardson method to this sort of item, Cronbach developed a formula that estimates the internal consistency of tests in which the items are not scored as 0 or 1 (right or wrong). In doing so, Cronbach developed a more general reliability estimate, which he called coefficient alpha, or α. The formula for coefficient alpha is:
rα = (k / (k − 1)) × (1 − Σσ²i / σ²)
where
rα = coefficient alpha
k = the number of items
σ²i = the variance of one item
Σσ²i = the sum of the variances of each item
σ² = the variance of the total test scores
(Source: Cohen, 2018)
4. Inter-Scorer Reliability
Also referred to as scorer reliability, judge reliability, observer reliability, and inter-rater reliability, inter-scorer reliability is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure. If the reliability coefficient is high, the prospective test user knows that test scores can be derived in a systematic, consistent way by various scorers with sufficient training. Inter-rater consistency may be promoted by providing raters with the opportunity for group discussion, along with practice exercises and information on rater accuracy.
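The split-half procedure (Steps 1 through 3 above) and coefficient alpha can both be sketched on one small item-score matrix. The six-item, five-person data below are invented for illustration; 0/1 scoring is used, but alpha applies equally to scaled items:

```python
# One row per test taker, one column per item; data invented for illustration.
scores = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1],
]
k = len(scores[0])  # number of items

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / (ssx * ssy) ** 0.5

# Steps 1-2: odd-even split, then Pearson r between the two half scores.
odd_half  = [sum(row[0::2]) for row in scores]
even_half = [sum(row[1::2]) for row in scores]
r_hh = pearson_r(odd_half, even_half)

# Step 3: Spearman-Brown adjustment for the full-length (2x) test.
r_sb = (2 * r_hh) / (1 + r_hh)

# Coefficient alpha: (k / (k - 1)) * (1 - sum of item variances / total variance)
item_vars = [variance([row[i] for row in scores]) for i in range(k)]
totals = [sum(row) for row in scores]
alpha = (k / (k - 1)) * (1 - sum(item_vars) / variance(totals))

print(f"split-half r_hh = {r_hh:.3f}, Spearman-Brown r_SB = {r_sb:.3f}")
print(f"coefficient alpha = {alpha:.3f}")
```

Note that the adjusted r_SB is larger than r_hh, reflecting the general principle that, other things being equal, a longer test is more reliable than a shorter one.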
For example, if the problem is a lack of clarity in scoring criteria, the remedy might be to rewrite the scoring criteria section of the manual to include clearly written scoring rules.
[Summary table of reliability types and how to address sources of error (Cohen, 2018)]
Other Sources of Error
Test construction - One source of variance during test construction is item sampling or content sampling, terms that refer to variation among items within a test as well as to variation among items between tests. The extent to which a test taker's score is affected by the content sampled on a test, and by the way the content is sampled (that is, the way in which the items are constructed), is a source of error variance.
Test administration - Sources of error variance that occur during test administration may influence the test taker's attention or motivation. Examples of untoward influences during administration of a test include factors related to the test environment: room temperature, level of lighting, and amount of ventilation and noise, for instance.
- Test taker variables such as pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication can all be sources of error variance.
- Examiner-related variables, such as the examiner's physical appearance and demeanor, and even the presence or absence of an examiner, are also factors for consideration here. Some examiners in some testing situations might knowingly or unwittingly depart from the procedure prescribed for a particular test.
Test scoring and interpretation - Not all tests can be scored from grids blackened by no. 2 pencils.
Individually administered intelligence tests, some tests of personality, tests of creativity, various behavioral measures, essay tests, portfolio assessments, situational behavior tests, and countless other tools of assessment still require scoring by trained personnel.
Quantifying Errors in Test Use (Urbina, 2014)
Standard Error of Measurement (SEM)
The standard error of measurement is the tool used to estimate or infer the extent to which an observed score deviates from a true score; that is, it tells us how much "true score" there is in a measurement. We may define the standard error of measurement as the standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests. It is also known as the standard error of a score and is denoted by the symbol σmeas. In general, the relationship between the SEM and the reliability of a test is inverse;