Week 6.1: What Is A Good Test? PDF
Document Details
Caraga State University
Loressa Joy D. Paguta, Laira Dee A. Baroquillo
Summary
This Caraga State University presentation outlines the key psychometric properties of tests, including reliability (consistency of measurement), validity (what a test measures), and utility (practical value). The document details sources of error, such as random and systematic errors.
Full Transcript
What’s a Good Test?
Loressa Joy D. Paguta, MA, RPm; Laira Dee A. Baroquillo, RPm
Department of Psychology

Psychometric Properties
Psychometric properties are the criteria employed to judge the usefulness of tests for any particular purpose; they describe the technical quality of a test or other assessment tool.
- Reliability – the consistency of measurement
- Validity – what a test measures
- Utility – usefulness or practical value

Reliability

The Concept of Reliability
The criterion of reliability involves the consistency of the measuring tool: the precision with which the test measures and the extent to which error is present in measurements. A reliability coefficient is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance, ranging from 0 (not at all reliable) to 1 (perfectly reliable).
Error refers to the component of the observed test score that does not have to do with the testtaker’s ability. If we use X to represent an observed score, T to represent a true score, and E to represent measurement error, then the fact that an observed score equals the true score plus error may be expressed as follows:
X = T + E

Measurement Error
Measurement error refers, collectively, to all of the factors associated with the process of measuring some variable, other than the variable being measured.
Random error, sometimes referred to as “noise,” is a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process. For example: unanticipated events in the test environment (such as a lightning strike or a spontaneous “occupy the university” rally) or unanticipated physical events within the testtaker (such as a sudden and unexpected surge in the testtaker’s blood sugar or blood pressure).
Systematic error is a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured. For example: a 12-inch ruler may be found to be, in actuality, a tenth of an inch longer than 12 inches; it is the measuring instrument itself that has been found to be a source of systematic error. Bias refers to the degree to which systematic error influences the measurement.
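To make the classical true score model X = T + E concrete, here is a minimal simulation sketch that is not part of the original slides; the sample size and the true-score and error variances are arbitrary assumptions. It illustrates that the reliability coefficient is the ratio of true score variance to total observed variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true scores T and random measurement error E (assumed values).
n = 10_000
T = rng.normal(loc=100, scale=15, size=n)   # true scores
E = rng.normal(loc=0, scale=5, size=n)      # random error, uncorrelated with T

X = T + E                                    # observed score: X = T + E

# Reliability coefficient: ratio of true score variance to total variance.
reliability = T.var() / X.var()
print(f"Var(T) = {T.var():.1f}, Var(X) = {X.var():.1f}, reliability = {reliability:.2f}")
# With these assumed variances, reliability is about 15^2 / (15^2 + 5^2) = 0.90.
```

Under these assumed values the coefficient works out to roughly 225 / (225 + 25) = 0.90; as the error variance shrinks toward zero, the coefficient approaches 1 (perfectly reliable).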
Sources of Error Variance
1. Test construction
Item sampling or content sampling are terms that refer to variation among items within a test as well as to variation among items between tests. Error variance here reflects the extent to which a testtaker’s score is affected by the content sampled on a test and by the way the content is sampled (i.e., the way in which the items are constructed).
2. Test administration
The testtaker’s reactions to influences that occur during test administration may affect the testtaker’s attention or motivation.
- Factors related to the test environment: room temperature, level of lighting, amount of ventilation and noise, the instrument used to enter responses (e.g., a pencil with a dull or broken point), the writing surface, and the events of the day in a global sense (e.g., whether the country is at war or at peace).
- Factors related to testtaker variables: pressing emotional problems, physical discomfort, lack of sleep, the effects of drugs or medication, formal learning experiences, casual life experiences, therapy, illness, changes in mood or mental state, and changes in body weight.
- Factors related to examiner variables: appearance and demeanor; the examiner might knowingly or unwittingly depart from the procedure prescribed for a particular test, or might convey information about the correctness of a response (e.g., by nodding or eye movements).
3. Test scoring and interpretation
The advent of computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences. However, in some tests of personality, examinees are asked to supply open-ended responses to stimuli such as pictures, words, sentences, and inkblots, and it is the examiner who must then quantify or qualitatively evaluate the responses. If subjectivity is involved in scoring, then the scorer (or rater) can be a source of error variance.

How do we assess reliability?
1. Test-Retest Reliability Estimates
Test-retest reliability is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test. The test-retest method is appropriate when evaluating the reliability of a test that purports to measure something relatively stable over time, such as a personality trait. In general, the shorter the time gap between administrations, the higher the correlation; the longer the time gap, the lower the correlation.
2. Parallel-Forms and Alternate-Forms Reliability Estimates
Parallel-forms reliability uses one set of questions divided into two equivalent sets (“forms”), where both sets contain questions that measure the same construct, knowledge, or skill. The two sets are given to the same sample of people within a short period of time, and an estimate of reliability is calculated from the two sets of scores. Alternate forms are simply different versions of a test that have been constructed so as to be parallel.
3. Internal Consistency Estimates
- Split-half reliability – obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once; makes use of the Spearman-Brown formula.
- Inter-item consistency – the degree of correlation among all the items on a scale; makes use of the Kuder-Richardson formulas (especially for dichotomous items, e.g., yes/no).
- Coefficient alpha – the mean of all possible split-half correlations, developed by Cronbach; appropriate for use on tests containing non-dichotomous item responses (e.g., Likert scales).
- McDonald’s (1978) omega – used when the test’s item loadings are unequal.
4. Inter-Scorer Reliability
The degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure. Inter-scorer reliability is often used when coding nonverbal behavior. The simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation, referred to as a coefficient of inter-scorer reliability.
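As an illustration of two of these estimates, the sketch below uses made-up data (the sample sizes, item count, and score distributions are assumptions, not from the lecture) to compute a test-retest correlation and coefficient alpha.

```python
import numpy as np

rng = np.random.default_rng(1)

# --- Test-retest reliability: correlate scores from two administrations. ---
# Hypothetical scores for 50 testtakers at time 1 and a noisier retest at time 2.
time1 = rng.normal(50, 10, size=50)
time2 = time1 + rng.normal(0, 4, size=50)
test_retest_r = np.corrcoef(time1, time2)[0, 1]

# --- Coefficient alpha (Cronbach) for a multi-item scale. ---
def cronbach_alpha(items):
    """items: 2-D array, rows = testtakers, columns = item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Simulate 8 hypothetical Likert-like items that all reflect one underlying trait.
true_level = rng.normal(0, 1, size=(200, 1))
items = true_level + rng.normal(0, 1, size=(200, 8))

print(f"test-retest r = {test_retest_r:.2f}")
print(f"coefficient alpha = {cronbach_alpha(items):.2f}")
```

With these assumed data, alpha comes out near 0.89, reflecting the strong inter-item correlations built into the simulation rather than any property of a real scale.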
Validity

The Concept of Validity
Validity is a judgment or estimate of how well a test measures what it purports to measure in a particular context. It is a judgment based on evidence about the appropriateness of inferences drawn from test scores. No test or measurement technique is “universally valid” for all time, for all uses, with all types of testtaker populations.
Validation is the process of gathering and evaluating evidence about validity. Local validation studies are necessary when the test user plans to alter in some way the format, instructions, language, or content of the test. Local validation studies would also be necessary if a test user sought to use a test with a population of testtakers that differed in some significant way from the population on which the test was standardized.

Types of Validity
- Face validity: a judgment concerning how relevant the items appear to be.
- Content validity: a measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test.
- Criterion-related validity: a measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures.
  - Predictive validity: Does the test predict later performance on a related criterion?
  - Concurrent validity: Does the test relate to an existing similar measure?
- Construct validity: a measure of validity based on whether the test measures the theoretical constructs it was intended to measure.
  - Convergent validity: the degree to which a test “converges” on other tests that should be measuring the same thing.
  - Divergent validity: the degree to which a test “diverges” from other tests that should be measuring different things.

Face Validity
A judgment concerning how relevant the items appear to be. If a test appears to measure what it purports to measure “on the face of it,” then it could be said to be high in face validity. Judgments about face validity are frequently thought of from the perspective of the testtaker, not the test user. A test’s lack of face validity could contribute to a lack of confidence in the perceived effectiveness of the test, with a consequential decrease in the testtaker’s cooperation or motivation to do their best.

Content Validity
A judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample.
Example: A content-valid, paper-and-pencil test of assertiveness would be one that is adequately representative of the wide range of situations in which assertiveness may be called for: items sampling hypothetical situations at home (such as whether the respondent has difficulty in making their views known to fellow family members), on the job (such as whether the respondent has difficulty in asking subordinates to do what is required of them), and in social situations (such as whether the respondent would send back a steak not done to order in a fancy restaurant).
Criterion-Related Validity
A judgment of how adequately a test score can be used to infer an individual’s most probable standing on some measure of interest, the measure of interest being the criterion. A criterion may be defined, just a bit more narrowly, as the standard against which a test or a test score is evaluated.
Example: If a test purports to measure the trait of athleticism, we might expect to employ “membership in a health club” or any generally accepted measure of physical fitness as a criterion in evaluating whether the athleticism test truly measures athleticism.
Concurrent validity is an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently). For example, if scores (or classifications) made on the basis of a psychodiagnostic test were to be validated against a criterion of already diagnosed psychiatric patients, then the process would be one of concurrent validation.
Predictive validity is an index of the degree to which a test score predicts some criterion measure. For example, measures of the relationship between college admissions tests and freshman grade point averages provide evidence of the predictive validity of the admissions tests.
The validity coefficient is a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure (for example, between a score or classification on a psychodiagnostic test and the criterion score or classification assigned by psychodiagnosticians). The Pearson correlation coefficient is typically used; however, depending on variables such as the type of data, the sample size, and the shape of the distribution, other correlation coefficients may be used. For example, in correlating self-rankings of performance on some job with rankings made by job supervisors, the formula for the Spearman rho rank-order correlation would be employed.

Construct Validity
A judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct. A construct is an informed, scientific idea developed or hypothesized to describe or explain behavior.
Examples of constructs: job satisfaction, personality, bigotry, clerical aptitude, depression, motivation, self-esteem, emotional adjustment, creativity.

Evidence of Construct Validity
A number of procedures may be used to provide different kinds of evidence that a test has construct validity. The various techniques of construct validation may provide evidence, for example, that:
- the test is homogeneous, measuring a single construct;
- test scores increase or decrease as a function of age, the passage of time, or an experimental manipulation, as theoretically predicted;
- test scores obtained after some event or the mere passage of time (that is, posttest scores) differ from pretest scores as theoretically predicted;
- test scores obtained by people from distinct groups vary as predicted by the theory;
- test scores correlate with scores on other tests in accordance with what would be predicted from a theory that covers the manifestation of the construct in question.
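The following sketch illustrates a validity coefficient on simulated data; the admissions-test scores, GPA values, and job-ranking example are hypothetical assumptions used only to show the Pearson and Spearman computations, not results from any real validation study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical predictive-validity study: admissions test scores collected now,
# freshman GPA (the criterion) collected later.
test_scores = rng.normal(500, 100, size=120)
gpa = 2.0 + 0.002 * test_scores + rng.normal(0, 0.35, size=120)

pearson_r, p_value = stats.pearsonr(test_scores, gpa)
print(f"validity coefficient (Pearson r) = {pearson_r:.2f}, p = {p_value:.3g}")

# For ranked data (e.g., self-rankings vs. supervisor rankings of job performance),
# the Spearman rho rank-order correlation is the appropriate choice.
self_rankings = rng.permutation(np.arange(1, 21))
supervisor_rankings = np.argsort(np.argsort(self_rankings + rng.normal(0, 3, 20))) + 1
rho, _ = stats.spearmanr(self_rankings, supervisor_rankings)
print(f"Spearman rho = {rho:.2f}")
```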
Evidence of Construct Validity
Convergent validity: if scores on the test undergoing construct validation tend to correlate highly in the predicted direction with scores on older, more established, and already validated tests designed to measure the same (or a similar) construct, this is evidence of convergent validity.
Divergent/discriminant validity: a validity coefficient showing little (that is, a statistically insignificant) relationship between test scores and other variables with which scores on the test being construct-validated should not theoretically be correlated.
Factor analysis is a shorthand term for a class of mathematical procedures designed to identify factors, or specific variables that are typically attributes, characteristics, or dimensions on which people may differ.
- Exploratory factor analysis typically entails estimating or extracting factors, deciding how many factors to retain, and rotating factors to an interpretable orientation.
- In confirmatory factor analysis, researchers test the degree to which a hypothetical model (which includes factors) fits the actual data.
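As a rough illustration of exploratory factor analysis, the sketch below fits a two-factor model to simulated item responses with scikit-learn; the two-factor structure, the item groupings, and the varimax rotation option (available in scikit-learn 0.24 or later) are assumptions made for demonstration, not content from the slides.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
n = 300

# Simulate responses to 6 hypothetical items driven by two latent factors
# (say, items 1-3 tap a "verbal" factor and items 4-6 a "quantitative" factor).
verbal = rng.normal(size=(n, 1))
quant = rng.normal(size=(n, 1))
items = np.hstack([
    verbal + rng.normal(scale=0.5, size=(n, 3)),
    quant + rng.normal(scale=0.5, size=(n, 3)),
])

# Extract two factors and rotate them to an interpretable orientation.
fa = FactorAnalysis(n_components=2, rotation="varimax")  # rotation requires scikit-learn >= 0.24
fa.fit(items)

# Loadings: each row is a factor, each column an item; items 1-3 should load
# mainly on one factor and items 4-6 mainly on the other.
print(np.round(fa.components_, 2))
```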
Utility

Test Utility
Test utility refers to the practical value of using a test to aid in decision making. Questions of utility include:
- What is the comparative utility of this test? That is, how useful is this test as compared to another test?
- What is the treatment utility of this test? That is, is the use of this test followed by better intervention results?
- What is the diagnostic utility of this neurological test? That is, how useful is it for classification purposes?
- Does the use of this medical school admissions test allow us to select better applicants from our applicant pool?
- How useful is the addition of another test to the test battery already in use for screening purposes?
- Is the time and money it takes to administer, score, and interpret this personnel promotion test battery worth it as compared to simply asking the employee’s supervisor for a recommendation as to whether the employee should be promoted?
- Does using this test save us time, money, and resources we would otherwise need to spend?

Utility Analysis
A utility analysis may be broadly defined as a family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment. Test scores are said to have utility if their use in a particular situation helps us to make better decisions; better, that is, in the sense of being more cost-effective.

Norms

Norms
Norm-referenced testing and assessment is a method of evaluation and a way of deriving meaning from test scores by evaluating an individual testtaker’s score and comparing it to the scores of a group of testtakers. Norms are the test performance data of a particular group of testtakers that are designed for use as a reference when evaluating or interpreting individual test scores. A normative sample is that group of people whose performance on a particular test is analyzed for reference in evaluating the performance of individual testtakers.
Norming refers to the process of deriving norms. Race norming is the controversial practice of norming on the basis of race or ethnic background. User norms or program norms “consist of descriptive statistics based on a group of testtakers in a given period of time rather than norms obtained by formal sampling methods” (Nelson, 1994, p. 283).

Sampling to Develop Norms
Standardization, or test standardization, is the process of administering a test to a representative sample of testtakers for the purpose of establishing norms. A sample of the population is a portion of the universe of people deemed to be representative of the whole population. Sampling is the process of selecting that portion of the universe.

Types of Norms
- Percentile norms are the raw data from a test’s standardization sample converted to percentile form. A percentile is an expression of the percentage of people whose score on a test or measure falls below a particular raw score. Percentage correct refers to the number of items that were answered correctly, multiplied by 100 and divided by the total number of items.
- Age-equivalent scores, or age norms, indicate the average performance of different samples of testtakers who were at various ages at the time the test was administered.
- Grade norms are designed to indicate the average test performance of testtakers in a given school grade.
- Developmental norms is a term applied broadly to norms developed on the basis of any trait, ability, skill, or other characteristic that is presumed to develop, deteriorate, or otherwise be affected by chronological age, school grade, or stage of life.
- National norms are derived from a normative sample that was nationally representative of the population at the time the norming study was conducted.
- National anchor norms provide some stability to test scores by anchoring them to other test scores.
- Subgroup norms are used when a normative sample is segmented by any of the criteria initially used in selecting subjects for the sample.
- Local norms provide normative information with respect to the local population’s performance on some test.

Fixed Reference Group Scoring Systems
The distribution of scores obtained on the test from one group of testtakers, referred to as the fixed reference group, is used as the basis for the calculation of test scores for future administrations of the test.
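A small sketch with hypothetical data may help keep percentile and percentage correct apart; the normative sample, raw score, and test length below are assumed values, not figures from the lecture.

```python
import numpy as np

# Hypothetical raw scores from a standardization (normative) sample.
norm_sample = np.array([12, 15, 18, 18, 20, 22, 25, 27, 29, 33, 35, 38, 40, 42, 45])

def percentile_rank(raw_score, norm_scores):
    """Percentage of people in the normative sample scoring below raw_score."""
    return 100.0 * np.mean(np.asarray(norm_scores) < raw_score)

def percentage_correct(items_correct, total_items):
    """Number of items answered correctly, times 100, divided by total items."""
    return 100.0 * items_correct / total_items

raw_score = 29     # a testtaker's raw score (assumed)
total_items = 50   # assumed test length

print(f"percentile: {percentile_rank(raw_score, norm_sample):.0f}")                # norm-referenced
print(f"percentage correct: {percentage_correct(raw_score, total_items):.0f}%")    # content-based, not a percentile
```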
Norm-Referenced vs. Criterion-Referenced Evaluation
A criterion, in this context, is a standard on which a judgment or decision may be based. Criterion-referenced testing and assessment (also called domain- or content-referenced testing and assessment) may be defined as a method of evaluation and a way of deriving meaning from test scores by evaluating an individual’s score with reference to a set standard.
Example: To be licensed as a psychologist, the applicant must achieve a score that meets or exceeds the score mandated by the state on the licensing test.

References
Cohen, R. J., & Swerdlik, M. E. (2018). Psychological testing and assessment (9th ed.). New York, NY: McGraw-Hill Education.
Kaplan, R. M., & Saccuzzo, D. P. (2018). Psychological testing: Principles, applications, and issues (9th ed.). Boston, MA: Cengage Learning.