Chapter III: Reliability and Validity
Summary
This document discusses the concepts of reliability and validity in psychological testing and assessment. It provides a theoretical overview, including the definition and importance of reliability, measurement error, and the sources of variance that affect test scores. The chapter also introduces classical test score theory, highlighting the relationship among true score, observed score, and measurement error.
Chapter III. Introduction

Imagine you are the kind of person who likes to text a lot, especially at night, as you live alone in your house. You have a friend with whom you regularly have text conversations at night. Even though it is already midnight, this friend of yours is willing to stay awake so he can respond to your messages. One day, someone breaks into your house at midnight, but you manage to hide in a corner of your room where the burglar cannot see you. You have your phone in hand with your friend's number stored on it. Even though you are aware that it is late and this friend of yours may be fast asleep at that time of night, you quickly dial his number. He answers, you tell him your situation, and he wakes everybody in his house and brings people with him to your location. The armed burglar sees that he cannot defeat your friend and his company, so he runs away, and you are now safe.

You are safe because you could rely on your friend to deliver. He gave you a sense of security and assurance that said, "Whenever you need my assistance, just call once and I will be there to deliver," which made you put your confidence and trust in him. But what if the worst scenario happened, and you kept trying about 17 times only to realize that calling this friend is a waste of time? The burglar robs you at gunpoint and takes everything, leaving you in nothing but your pants. It is late, no one is passing by to save you, and by the time someone finds you in the morning, you are already far gone in the land of the dead. This friend's lack of reliability has resulted in your untimely death, which could easily have been avoided had he been reliable.

When an individual or an item is reliable, you can put your confidence and trust in it to deliver for you consistently whenever you call upon it. This idea is the focus of this chapter, only this time we will be referring to the reliability and validity of tests. This chapter is designed to provide students with a knowledge base on the psychometric properties observed in psychological testing and assessment.

Learning outcome/s:
1. Define reliability and validity

Expected output:
1. Chapter Quiz

Time allotment: 10 hours (2 weeks)

References:
Cohen, R. J., & Swerdlik, M. E. (2018). Psychological testing and assessment: An introduction to tests and measurement. McGraw-Hill Education.
Kaplan, R. M., & Saccuzzo, D. P. (2017). Psychological testing: Principles, applications, and issues. Nelson Education.

Lesson proper:

Reliability refers to CONSISTENCY in MEASUREMENT.

Reliability Coefficient
- An index of reliability; a proportion that indicates the ratio between the true score variance on a test and the total variance

The Concept of Reliability
- True Variance: variance from true differences
- Error Variance: variance from irrelevant, random sources

Thus, Reliability
- Refers to the proportion of the total variance attributed to true variance
- The greater the proportion of the total variance attributed to true variance, the more reliable the test

Measurement Error
- All of the factors associated with the process of measuring some variable, other than the variable being measured

Random error/Noise
- Source of error in measuring a targeted variable, caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process

Systematic error
- Source of error in measuring a variable that is typically constant, or proportionate to what is presumed to be the true value of the variable being measured
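To make the variance ratio concrete, here is a minimal simulation sketch (an illustration, not part of the chapter; all numbers and variable names are invented) of reliability as the proportion of total variance attributable to true variance. It also shows why random error lowers reliability while a constant systematic error does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n_people = 10_000

true_scores = rng.normal(loc=100, scale=15, size=n_people)  # T: true individual differences
random_error = rng.normal(loc=0, scale=5, size=n_people)    # E: random noise
systematic_error = 3.0                                      # constant bias (e.g., a miscalibrated scorer)

observed = true_scores + random_error + systematic_error    # X = T + E, plus a constant bias

# Reliability = true variance / total variance
reliability = true_scores.var() / observed.var()
print(f"reliability ~ {reliability:.3f}")  # expected ~ 15**2 / (15**2 + 5**2) = 0.90

# The constant systematic error shifts every score by the same amount, so it
# changes the mean but not the variance, and therefore not the reliability.
```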
Sources of Error Variance

A. Test Construction
- The extent to which the score is affected by the content sampled in the test and by the way the content is sampled (item sampling or content sampling)

B. Test Administration
- Test environment: room temperature, level of lighting, amount of ventilation, and noise
- Test-taker variables: pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medications
- Examiner-related variables: the examiner's physical appearance and demeanor, or even the presence or absence of an examiner

C. Test Scoring and Interpretation
- NOT ALL tests can be scored by computer
- Scorers and scoring systems are potential sources of error variance
- If subjectivity is involved in scoring, then the scorer (rater) can be a source of error variance
- Subjectivity in scoring can enter even into behavioral assessment

True Score Model of Measurement and Its Alternatives

A. Classical Test Score Theory
- Classical test score theory assumes that each person has a true score that would be obtained if there were no errors in measurement. However, because measuring instruments are imperfect, the score observed for each person almost always differs from the person's true ability or characteristic. The difference between the true score and the observed score results from measurement error.

X (Observed Score) = T (True Score) + E (Error)

- A major assumption in classical test theory is that errors of measurement are random.
- It assumes that the true score for an individual will not change with repeated applications of the same test.
- Because of random error, however, repeated applications of the same test can produce different scores. Theoretically, the standard deviation of the distribution of errors for each person tells us about the magnitude of measurement error.
- Standard error of measurement: Because we usually assume that the distribution of random errors will be the same for all people, classical test theory uses the standard error of measurement as the basic measure of error.
- Classical test theory requires that exactly the same test items be administered to each person.

B. Item Response Theory (IRT)
- The computer is used to focus on the range of item difficulty that helps assess an individual's ability level. For example, if the person gets several easy items correct, the computer might quickly move to more difficult items. If the person gets several difficult items wrong, the computer moves back to the area of item difficulty where the person gets some items right and some wrong.
- The overall result is that a more reliable estimate of ability is obtained using a shorter test with fewer items.
- The method requires a bank of items that have been systematically evaluated for level of difficulty.
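The adaptive logic described above can be sketched as a simple staircase rule. This is only an illustration of the idea, not a full IRT model; the item bank, the step rule, the ability estimate, and the simulated test-taker are all assumptions:

```python
import random

def adaptive_test(item_bank, answers_correctly, n_items=10):
    """item_bank: item difficulties sorted from easiest to hardest.
    answers_correctly(difficulty) -> bool simulates the test-taker."""
    idx = len(item_bank) // 2              # start at medium difficulty
    administered = []
    for _ in range(n_items):
        difficulty = item_bank[idx]
        administered.append(difficulty)
        if answers_correctly(difficulty):
            idx = min(idx + 1, len(item_bank) - 1)   # correct: move to harder items
        else:
            idx = max(idx - 1, 0)                    # wrong: move back to easier items
    # Estimate ability from the difficulty band where the answers settled,
    # i.e., where the person gets some items right and some wrong.
    return sum(administered[-5:]) / 5

random.seed(1)
bank = [i / 20 for i in range(21)]         # difficulties 0.00 .. 1.00
# Hypothetical test-taker: usually right below difficulty 0.6, usually wrong above it.
respond = lambda d: random.random() < (0.9 if d < 0.6 else 0.2)
print(f"estimated ability level ~ {adaptive_test(bank, respond):.2f}")
```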
C. Generalizability Theory
- Based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation
- "Given the exact same conditions of all the facets in the universe, the exact same test score should be obtained"
- Generalizability study: examines how generalizable scores from a particular test are if the test is administered in different situations
- Decision study: developers examine the usefulness of test scores in helping the test user make decisions
- From the perspective of generalizability theory, a test's reliability is very much a function of the circumstances under which the test is developed, administered, and interpreted.

Reliability Estimates/Types of Reliability

A. Test-Retest Reliability Estimates/Time Sampling

Test-Retest Method
- Using the same instrument to measure the same thing at two points in time
- The result of this evaluation is called test-retest reliability
- One group; two administrations
- Appropriate for measuring something that is relatively stable over time, such as personality
- Coefficient of stability: the estimate obtained when the interval between testings is longer than six months
- Test-retest is also appropriate for measures of reaction time and perceptual judgment

B. Parallel Forms and Alternate Forms Reliability Estimates/Item Sampling

Parallel Forms
- Uses one set of questions divided into two equivalent sets ("forms"), where both sets contain questions that measure the same construct, knowledge, or skill. The two sets of questions are given to the same sample of people within a short period of time, and an estimate of reliability is calculated from the two sets.

Alternate Forms
- A measure of reliability between two different forms of the same test. Two equivalent (but different) tests are administered, scores are correlated, and a reliability coefficient is calculated.
- One group; two administrations
- Applicable to relatively stable traits

C. Internal Consistency
- Single administration; single form
- Assesses the HOMOGENEITY of a test
- Homogeneous: the items in a scale are unifactorial
- Heterogeneous: the scale is composed of items that measure more than one trait
- The MORE HOMOGENEOUS the test, the MORE inter-item consistency

C.1. Split-Half Reliability
- Obtained by correlating two pairs of scores from equivalent halves of a single test administered once

Step 1. Divide the test into equivalent halves.
Step 2. Calculate a Pearson r between scores on the two halves of the test.
Step 3. Adjust the half-test reliability using the Spearman-Brown formula.

Do's
1. Randomly assign items to the two halves of the test
2. Assign odd-numbered items to one half of the test and even-numbered items to the other half (odd-even reliability)
3. Divide the test by content

Don'ts
- Don't split the test in the middle (first half vs. second half)

In general, a primary objective in splitting a test in half for the purpose of obtaining a split-half reliability estimate is to create what might be called "mini-parallel-forms," with each half equal to the other, or as nearly equal as humanly possible, in format, stylistic, statistical, and related aspects.

Spearman-Brown Formula
- Allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test
- Estimates the reliability of a test that is lengthened or shortened by any number of items
- Usually, but not always, reliability increases as test length increases

Spearman-Brown Prophecy
- Used to determine the number of items needed to attain a desired level of reliability (a code sketch of this and the other internal-consistency formulas appears at the end of this section)

Other Measures of Internal Consistency

1. Coefficient Alpha
- Developed by Cronbach
- WHEN TO USE?
  - Homogeneous items
  - Non-dichotomous items (e.g., Likert-scale items)
- The preferred statistic for obtaining an estimate of internal consistency reliability
- Values range from 0.00 (no similarity) to 1.00 (perfectly identical)
- Note: a value of .90 or above indicates redundancy of items

2. Kuder-Richardson Formulas
- Developed by G. Frederic Kuder and M. W. Richardson

KR-20
- Named KR-20 because it was the twentieth formula developed in a series
- WHEN TO USE?
  - Homogeneous items
  - Dichotomous items ("right or wrong" items)
- Note: if items are more HETEROGENEOUS, KR-20 will yield LOWER reliability estimates than the split-half method

KR-21
- Used for a test whose items are all of about the same difficulty

3. Average Proportional Distance (APD)
- A relatively new measure for evaluating the internal consistency of a test
- A measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores
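Below is a hedged sketch of the internal-consistency formulas named in this section: the Spearman-Brown adjustment and prophecy, coefficient alpha, and KR-20. The formulas are the standard textbook ones; the array shapes, function names, and example values are assumptions for illustration:

```python
import numpy as np

def spearman_brown(r_half, n=2):
    """Spearman-Brown: estimated reliability of a test lengthened n-fold.
    With n=2, adjusts a split-half correlation up to full-test length."""
    return n * r_half / (1 + (n - 1) * r_half)

def spearman_brown_prophecy(r_current, r_desired):
    """Prophecy: factor by which a test must be lengthened to reach r_desired."""
    return r_desired * (1 - r_current) / (r_current * (1 - r_desired))

def cronbach_alpha(scores):
    """Coefficient alpha for non-dichotomous (e.g., Likert) items.
    scores: 2-D array, rows = people, columns = items."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def kr20(scores):
    """KR-20 for dichotomous (0/1, right-or-wrong) items."""
    k = scores.shape[1]
    p = scores.mean(axis=0)                      # proportion passing each item
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)

print(spearman_brown(0.70))                      # half-test r of .70 -> ~.82 for the full test
print(spearman_brown_prophecy(0.70, 0.90))       # lengthen ~3.9x to reach .90

rng = np.random.default_rng(0)
likert = rng.integers(1, 6, size=(200, 10))      # 200 people x 10 independent random Likert items
print(cronbach_alpha(likert))                    # near 0: random items share no common trait
print(kr20((likert > 3).astype(int)))            # same idea on a dichotomized version
```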
D. Inter-Scorer Reliability/Kappa Statistics
- The degree of agreement or consistency between two or more scorers (judges or raters) with regard to a particular measure
- Cohen's kappa: 2 raters
- Fleiss' kappa: 3 or more raters
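A minimal sketch of Cohen's kappa for two raters (the ratings and category labels are invented for illustration): observed agreement is corrected for the agreement expected by chance alone, based on each rater's marginal frequencies:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical ratings."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's marginal frequencies:
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Two hypothetical raters classifying ten responses as pass/fail:
a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]
print(f"kappa = {cohens_kappa(a, b):.2f}")   # 8/10 raw agreement corrects down to ~0.47
```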
Sources of Errors
A. Time sampling: The same test given at different points in time may produce different scores, even if given to the same test takers.
B. Item sampling: The same construct or attribute may be assessed using a wide pool of items.
C. Internal consistency: Refers to the intercorrelations among items within the same test.
D. Different observers recording the same behavior: Even though they have the same instructions, different judges observing the same event may record different numbers.

Standard Errors
Remember that psychologists working with unreliable tests are like carpenters working with rubber yardsticks that stretch or contract and misrepresent the true length of a board. However, just as all rubber yardsticks are not equally inaccurate, all psychological tests are not equally inaccurate. The standard error of measurement allows us to estimate the degree to which a test provides inaccurate readings; that is, it tells us how much "rubber" there is in a measurement. The larger the standard error of measurement, the less certain we can be about the accuracy with which an attribute is measured. Conversely, a small standard error of measurement tells us that an individual score is probably close to the measured value.

How Reliable Is Reliable?
- The answer depends on the use of the test.
- Reliability estimates in the range of .70 to .80 are good enough for most purposes in basic research.
- Reliabilities greater than .95 are not very useful because they suggest that all of the items are testing essentially the same thing and that the measure could easily be shortened.
- A test of skill at using the multiplication tables for one-digit numbers would be expected to have an especially high reliability. Tests of complex constructs, such as creativity, might be expected to be less reliable.
- In clinical settings, high reliability is extremely important. When tests are used to make important decisions about someone's future, evaluators must be certain to minimize any error in classification; evaluators should attempt to find a test with a reliability greater than .95.
- Standard error of measurement: the wider the confidence interval, the lower the reliability of the score. Using the standard error of measurement, we can say that we are 95% confident that a person's true score falls between two values. (A code sketch appears after the correction for attenuation below.)

What to Do About Low Reliability
- Increase the length of the test
- Throw out items that run down the reliability
- Estimate what the true correlation would have been if the test did not have measurement error
- Tests are most reliable if they are unidimensional. This means that one factor should account for considerably more of the variance than any other factor. Items that do not load on this factor might be best omitted.

Discriminability Analysis
- An item analysis that examines the correlation between each item and the total score for the test
- A low correlation indicates that the item drags down the estimate of reliability and should be excluded

Correction for Attenuation
- Estimates what the correlation between two measures would have been if they had not been measured with error
- A method that allows researchers to estimate the relationship between two constructs as if they were measured perfectly reliably, free from the random errors that occur in all observed measures
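The following sketch pulls together the standard error of measurement (with its 95% confidence band, as discussed under "How Reliable Is Reliable?") and the correction for attenuation. The formulas are the standard textbook ones; the example numbers are invented:

```python
import math

def standard_error_of_measurement(sd, reliability):
    """How much 'rubber' is in a score: larger SEM = less accuracy."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval_95(observed_score, sd, reliability):
    """95% band for the true score around an observed score."""
    margin = 1.96 * standard_error_of_measurement(sd, reliability)
    return observed_score - margin, observed_score + margin

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimated correlation between two measures had both been measured
    with perfect reliability (no random measurement error)."""
    return r_xy / math.sqrt(r_xx * r_yy)

# An IQ-style score of 110 on a test with SD = 15 and reliability = .90:
lo, hi = confidence_interval_95(110, 15, 0.90)
print(f"95% confident the true score falls between {lo:.1f} and {hi:.1f}")

# An observed correlation of .40 between tests with reliabilities .70 and .75:
print(f"corrected r = {correct_for_attenuation(0.40, 0.70, 0.75):.2f}")  # ~ .55
```

Note how a lower reliability widens the confidence band, which is exactly the sense in which "the wider the interval, the lower the reliability of the score."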
The Concept of Validity

Validity
- The agreement between a test score or measure and the quality it is believed to measure
- Defined as the answer to the question, "Does the test measure what it is supposed to measure?"
- An estimate of how well a test measures what it purports to measure in a particular context
- Interpretation: "acceptable or weak"
- NO TEST is universally valid for all time; the validity of a test must be proven again from time to time
- It is the evidence for inferences made about a test score. There are three types of evidence: (1) construct-related, (2) criterion-related, and (3) content-related.

Validation
- The process of gathering and evaluating evidence about validity

Trinitarian View of Validity
- Content validity
- Criterion-related validity
- Construct validity

Face Validity
- The mere appearance that a measure has validity. We often say a test has face validity if the items seem to be reasonably related to the perceived purpose of the test.
- The test "looks like" it is valid. These appearances can help motivate test takers because they can see that the test is relevant.

A. Content Validity
- A judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample
- Determines whether a test has been constructed adequately; logical rather than statistical
- Construct underrepresentation describes the failure to capture important components of a construct.
- Construct-irrelevant variance occurs when scores are influenced by factors irrelevant to the construct.

B. Criterion Validity
- Measures how well one measure predicts an outcome on another measure
- Criterion: the standard against which the test is compared

Predictive validity: the type or form of criterion validity evidence reflecting the forecasting function of tests. We assess the operationalization's ability to predict something it should theoretically be able to predict.

Concurrent validity: an assessment of the simultaneous relationship between the test and the criterion; applies when the test and the criterion can be measured at the same time.

C. Construct Validity
- A judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct
- "The umbrella validity"; viewed as the unifying concept for all validity evidence
- Construct: an informed, scientific idea developed or hypothesized to describe or explain behavior. Constructs are unobservable, presupposed (underlying) traits that a test developer may invoke to describe test behavior or criterion performance. Examples: intelligence, self-esteem, motivation, etc.

Types of Construct Validity

1. Convergent Validity
- Takes two measures that are supposed to be measuring the same construct and shows that they are related.
- Convergent evidence is obtained in one of two ways. In the first, we show that a test measures the same things as other tests used for the same purpose. In the second, we demonstrate specific relationships that we can expect if the test is really doing its job.

2. Discriminant Validity (Divergent Validity)
- A demonstration of uniqueness: shows that two measures that are not supposed to be related are, in fact, unrelated.
- To demonstrate discriminant evidence for validity, a test should have low correlations with measures of unrelated constructs; this is evidence for what the test does not measure.
- Indicates that the measure does not represent a construct other than the one for which it was devised.

Statistical Evidences
1. Validity coefficient: A correlation that provides a measure of the relationship between test scores and scores on the criterion measure. One rarely sees a validity coefficient larger than .60; coefficients in the range of .30 to .40 are commonly considered high.
2. Incremental validity: The degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use.

Relationship Between Reliability and Validity
- Attempting to define the validity of a test is futile if the test is not reliable.
- We can have reliability without validity; however, it is logically impossible to demonstrate that an unreliable test is valid.
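To close, here is a hedged sketch of the two statistical evidences listed above: a validity coefficient as a test-criterion correlation, and incremental validity read as the gain in explained criterion variance (delta R-squared) when a second predictor is added. That reading of incremental validity, along with all the data and variable names, is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
ability = rng.normal(size=n)                    # predictor already in use
conscientiousness = rng.normal(size=n)          # additional predictor
# Hypothetical criterion (e.g., job performance) driven by both, plus noise:
performance = 0.5 * ability + 0.3 * conscientiousness + rng.normal(scale=0.8, size=n)

# Validity coefficient: correlation between test scores and the criterion.
print(f"validity coefficient = {np.corrcoef(ability, performance)[0, 1]:.2f}")

def r_squared(predictors, criterion):
    """Proportion of criterion variance explained by a linear model."""
    X = np.column_stack([np.ones(len(criterion))] + predictors)
    beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
    residuals = criterion - X @ beta
    return 1 - residuals.var() / criterion.var()

r2_old = r_squared([ability], performance)
r2_new = r_squared([ability, conscientiousness], performance)
print(f"incremental validity (delta R^2) = {r2_new - r2_old:.3f}")   # ~0.09 here
```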