W2 Lecture slides.pdf
Transcript
PSY2041 Semester 2, 2023
Week 2: Reliability and validity
Daniel Bennett ([email protected])
image: detail from Wet Sand, Anglesea (1929) by Clarice Beckett

Lecture learning outcomes
1. Define 'reliability' as a psychometric property, and articulate the importance of test reliability in psychological testing and assessment.
2. List, describe, and differentiate the different ways that a test's reliability can be assessed.
3. Define 'validity' as a psychometric property, describe its importance in psychological testing and assessment, and explain how it is related to test reliability.
4. List, describe, and differentiate the different kinds of test validity.

Weekly reading
Shum et al., Chapter 4: pages 71-84
Shum et al., Chapter 5: pages 85-93 and 98-101

Overview of this week's videos
Mini-lecture 1: A case study
Mini-lecture 2: Reliability
Mini-lecture 3: Validity

Mini-lecture 1: A case study

The 'Implicit Association Test'
Ø In 1998, Greenwald et al. published a highly influential paper on measuring 'implicit bias'
Ø The paper has been cited more than 16,000 times and has influenced much public and corporate policy
Ø The paper proposed the Implicit Association Test (IAT), a measure of people's implicit racial prejudice
Ø In other words: a way of behaviourally measuring the strength of people's unconscious prejudices
Ø Unconscious prejudices certainly exist and cause very real harms
Ø Implicit bias is an unconscious association, belief, or attitude towards a particular social group
Ø Such biases are one reason that people stereotype others and behave in prejudiced ways
Ø Yet the IAT has recently come in for heavy scientific criticism. Why?
(Arkes & Tetlock, 2004; Gawronski, 2019; Mitchell, 2010; Schimmack, 2021)

The 'Implicit Association Test': phase 1
https://implicit.harvard.edu/implicit/

The 'Implicit Association Test': phase 2
Ø Implicit bias is measured by how much slower and more error-prone people are when asked to categorise African-American faces and 'good' words with the same button on the keyboard
Ø This is a psychological test of implicit bias

Criticisms of the reliability of the IAT
Ø One criticism of the IAT is that its measurements of implicit bias are not stable over time
Ø i.e., it has poor reliability: the same person receives different scores at different times of taking the IAT
Ø Without stable measurements, there is no guarantee that people who receive a high score on the test actually have high levels of the underlying construct

Scale 1:
Ø The same person receives consecutive measurements of 69.3, 89.2, 74.3 kg
Ø A new person steps on and is measured as 85.5 kg
Ø It is quite unlikely that they do actually weigh 85.5 kg

Scale 2:
Ø The same person receives consecutive measurements of 67.5, 67.4, 67.4 kg
Ø A new person steps on and is measured as 85.5 kg
Ø It is quite likely that they do actually weigh ~85.5 kg

Criticisms of the reliability of the IAT
Ø One criticism of the IAT is that its measurements of implicit bias are not stable over time
Ø i.e., it has poor reliability
Ø Without stable measurements, there is no guarantee that people who receive a high score on the test actually have high levels of the underlying construct
Ø One way of measuring stability over time is with test-retest reliability
Ø Ranges between 0 (very unstable) and 1 (very stable)
Ø Tests are typically required to have scores greater than 0.7 or 0.8 to be usable
Ø The IAT has an estimated test-retest reliability of 0.44 ("unacceptable")

Criticisms of the validity of the IAT
Ø A second criticism of the IAT is that it may not actually measure what it says it is measuring
Ø i.e., it has poor validity
Ø Scores on the IAT are weak and inconsistent predictors of people's behaviour
Ø e.g., some studies have shown that white police officers with high anti-black IAT scores are faster to shoot at African-Americans (Swencionis & Goff, 2017)
Ø But other studies have shown the exact opposite (James, James & Vila, 2016)
Ø Some have suggested that the IAT measures awareness of racial stereotypes, not endorsement of racial stereotypes

Why is this important?
Ø None of this means that implicit bias is not real, harmful, or important to address
Ø It does call into question whether the IAT is a good tool for measuring implicit bias
Ø This is not merely an academic question!
Ø IAT scores have been used to identify individuals who should receive further racial sensitivity training
Ø IAT scores have been the target metric for interventions to reduce racism in US police forces
Ø Procedures designed to change IAT scores typically don't have effects that last longer than a day

Mini-lecture 2: Reliability

What is reliability?
Ø In general, the reliability of a test is a measure of how consistent that test is
Ø Does the test consistently give the same score?
Ø This is a very important psychometric property
Ø If a test does not give reliable measurements, then it is not suitable for use as a measurement tool

Unreliable scales:
Ø The same person receives consecutive measurements of 69.3, 89.2, 74.3 kg
Ø A new person steps on and is measured as 85.5 kg
Ø It is quite unlikely that they do actually weigh 85.5 kg

Reliable scales:
Ø The same person receives consecutive measurements of 67.5, 67.4, 67.4 kg
Ø A new person steps on and is measured as 85.5 kg
Ø It is quite likely that they do actually weigh ~85.5 kg

Four kinds of reliability
Ø We will consider four ways of determining whether a test gives consistent results:
Test-retest reliability: does the test give consistent results over time?
Interrater reliability: does the test give consistent results between different test administrators?
Internal consistency: does the test give consistent results between the different items of the test?
Equivalent forms reliability: are different versions of the test consistent with each other?

Test-retest reliability
Test-retest reliability: does the test give consistent results over time?
Ø To measure test-retest reliability, the same test is administered to the same people at two different points in time
Ø If the test has good reliability, then those who score high at time 1 should also score high at time 2, and vice versa
Ø Test-retest reliability is usually scored as the correlation between test scores at test and retest (a short computational sketch follows the interrater reliability slide below)
Ø 0 = no correlation, 1 = perfect correlation
Ø Standard guidelines are that a test should have a test-retest reliability above 0.7 or 0.8 to be used
Ø 0.6 to 0.7 = questionable, 0.5 to 0.6 = poor, below 0.5 = unacceptable
[Figure: scatterplots of scores at test vs retest for reliability = 0 (no consistent relationship), reliability = 0.5, and reliability = 0.9]

Test-retest reliability
Ø It only makes sense to assess test-retest reliability when we expect the underlying construct to be stable over the time period that elapses between tests
Ø If the underlying construct relates to a stable trait of the individual (e.g., height), then it is meaningful to assess test-retest reliability
Ø If the underlying construct is unstable or relates to a transient state of the individual (e.g., current mood), then test-retest reliability should not be assessed
Ø Other practical considerations may also get in the way of test-retest reliability:
Ø Practice effects: people learn the answers, or their performance improves after the first time they take the test
Ø Interventions between test and retest (e.g., starting a medication)
Ø Time-of-day or seasonal effects (e.g., taking the test before versus after lunch)

Interrater reliability
Interrater reliability: does the test give consistent results between different test administrators?
Ø Two different test administrators both use a test to score the same individual's behaviour or performance
Ø Good interrater reliability: the two administrators give similar scores to the individual
Ø Poor interrater reliability: the two administrators give very different scores to the individual
Ø Interrater reliability is most relevant when there is some subjectiveness in the test scoring process
Ø The way it is scored depends on the data type
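To make the correlation-based scoring concrete, here is a minimal computational sketch (not part of the original slides; the scores are invented for illustration, and Python with numpy is assumed):

```python
import numpy as np

# Hypothetical scores for 8 test-takers at two timepoints (invented data).
time1 = np.array([12, 19, 25, 31, 38, 44, 50, 57])
time2 = np.array([14, 17, 27, 30, 41, 40, 52, 55])

# Test-retest reliability = Pearson correlation between the two administrations.
r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest r = {r_test_retest:.2f}")

# Conventional interpretation thresholds (from the lecture):
#   >= 0.7-0.8 usable; 0.6-0.7 questionable; 0.5-0.6 poor;
#   < 0.5 unacceptable (cf. the IAT's estimated 0.44).
# For interrater reliability with categorical ratings, an agreement
# statistic such as Cohen's kappa would be used instead of Pearson's r.
```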
Internal consistency
Internal consistency: does the test give consistent results between the different items of the test?
Below: 10 items from the BFAS measuring emotional volatility
Ø In many situations, psychological tests will be made up of a number of different items
Ø In a well-designed test, the different items all measure the same underlying construct
Ø Internal consistency can be measured in several different ways:
Ø Split-half reliability
Ø Cronbach's alpha

Measuring internal consistency with split-half reliability
Ø One way of measuring internal consistency is to calculate the split-half reliability:
Ø Divide the items in half (an 'A' half and a 'B' half)
Ø Calculate the score separately for each half
Ø Assess the consistency between the two halves of the test: high consistency between halves = good split-half reliability
Ø We need to take into account that, in general, scores will be a bit noisier and more error-prone when tests are shorter, so the half-length correlation is corrected using the Spearman-Brown correction

Measuring internal consistency with Cronbach's alpha
Ø But what is so special about the first half and the second half? Why didn't we divide things up other ways instead?
Ø Cronbach (1951) proposed a measure that gets around this limitation
Ø Cronbach's alpha calculates the average split-half reliability across all possible splits of the data (each corrected by the Spearman-Brown formula)
Ø Gives a score between 0 and 1 that can be interpreted in a similar way to a correlation coefficient: high values = good internal consistency, low values = poor internal consistency
Ø A short computational sketch of both measures follows below
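The two internal-consistency measures just described can be sketched computationally. The following is a minimal illustration (not from the lecture; the response data are invented, and Python with numpy is assumed) of split-half reliability with the Spearman-Brown correction, and of Cronbach's alpha:

```python
import numpy as np

# Invented data: 20 respondents x 10 items, each scored 1-5.
rng = np.random.default_rng(1)
latent = rng.normal(size=(20, 1))
items = np.clip(np.round(3 + latent + rng.normal(scale=0.8, size=(20, 10))), 1, 5)

# --- Split-half reliability with the Spearman-Brown correction ---
half_a = items[:, 0::2].sum(axis=1)   # 'A' half: odd-numbered items
half_b = items[:, 1::2].sum(axis=1)   # 'B' half: even-numbered items
r_half = np.corrcoef(half_a, half_b)[0, 1]
# Each half is only 5 items long, so r_half underestimates the reliability
# of the full 10-item test; the Spearman-Brown formula corrects for this:
r_split_half = 2 * r_half / (1 + r_half)

# --- Cronbach's alpha ---
# alpha = k/(k-1) * (1 - sum of item variances / variance of total score);
# per the lecture, this equals the average Spearman-Brown-corrected
# split-half reliability across all possible splits of the items.
k = items.shape[1]
item_vars = items.var(axis=0, ddof=1).sum()
total_var = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars / total_var)

print(f"split-half (Spearman-Brown) = {r_split_half:.2f}")
print(f"Cronbach's alpha            = {alpha:.2f}")
```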
Equivalent forms reliability
Equivalent forms reliability: are different versions of the test consistent with each other?
Ø Only relevant in cases where more than one version of a test exists
Ø e.g., weekly MCQ tests in which students are randomly allocated to either question set A or question set B
Ø If a test has good equivalent-forms reliability, then the different versions of the test will give consistent scores for the same person
Ø Means we don't need to worry about which version of a test a person completed when interpreting their score

Four kinds of reliability
Ø We will consider four ways of determining whether a test gives consistent results:
Test-retest reliability: does the test give consistent results over time?
Interrater reliability: does the test give consistent results between different test administrators?
Internal consistency: does the test give consistent results between the different items of the test?
Ø Can be measured using split-half measures or Cronbach's alpha
Equivalent forms reliability: are different versions of the test consistent with each other?

Generalisability theory
Ø Which of these ways of measuring reliability should be used for a given test?
Ø It depends on what kind of generalisation we would like to make from our test
Ø The logic is that we care about tests because their scores generalise to a wider universe of behaviour
Ø Generalisation across different assessors/judges: interrater reliability
Ø Generalisation across different versions of a test: equivalent forms reliability
Ø Generalisation from one item in the test to other items in the test: internal consistency
Ø Generalisation from one timepoint to another timepoint: test-retest reliability
Ø Which of these you assess depends on what you are planning to use the test for

How much is enough?
https://www.youtube.com/watch?v=-R-HrDmY0dM
Ø Standard guidelines are that a test should have a test-retest reliability above 0.7 or 0.8 to be used
Ø 0.6 to 0.7 = questionable, 0.5 to 0.6 = poor, below 0.5 = unacceptable
Ø Very high reliability may not be a good thing
Ø Cronbach's alpha scores greater than .95 can indicate redundancy in items
Ø On the other hand, very high reliability is crucial in clinical settings when making decisions for a person
Ø e.g., assessments of capacity to drive following an injury
Ø Reliability is primarily important for measuring individual differences. If all a researcher cares about is group differences, having high reliability may not be important at all

Mini-lecture 3: Validity

What is validity?
Ø In general, the validity of a test is a measure of how well the test measures what it aims to measure
Ø Ishihara plates are a valid test of red-green colour blindness
Ø Measuring the length of a person's big toe is not a valid test of colour blindness
Ø This is a complex and subtle question for many psychological tests
Ø Psychological tests often deal with 'latent constructs' (e.g., anxiety, empathy, self-esteem, intelligence) that cannot be directly observed
Ø These are theoretical concepts conceptualised from a specific viewpoint

What is validity?
Ø How can we determine if a test measures what it claims to measure?
Ø There is no perfect way of doing this
Ø Instead, different 'tests' of validity allow us to gather evidence that a test measures what it claims to measure
Ø But there is always the chance that a new piece of information will change our view
Ø The answer is unlikely to be a binary 'yes' or 'no'; the better question is: to what extent is a test valid in a given testing context?

Testing the validity of an intelligence test
Ø Binet sought to use several practical criteria to demonstrate the validity of his intelligence tests
Ø If the test truly measured intelligence, then children whom teachers identified as 'bright' should receive higher scores than those identified as 'dull'
Ø If the test truly measured intelligence, then older children should score higher on average than younger children
Ø Binet only included test items that met both these criteria (even if he and his co-workers did not see any other merit in them)
[image: Alfred Binet (1857-1911)]

Four kinds of validity
Ø We will discuss four ways of establishing the validity of a psychological test:
Face validity: does the test appear (to the test-taker) to assess the relevant construct?
Content validity: does the test adequately represent all the components of the construct?
Predictive validity: are scores on the test predictive of other (external) indicators of the construct?
Construct validity: are the test's assumptions about the construct it is measuring theoretically justified?
Ø Construct validity can be assessed using convergent evidence and discriminant evidence

Face validity
Face validity: does the test appear (to the test-taker) to assess the relevant construct?
Ø A spelling test that requires test-takers to spell words aloud has high face validity
Ø A spelling test that requires test-takers to list their favourite TV shows has low face validity
Ø Face validity is the weakest form of validity evidence we will consider
Ø A test can be invalid despite having good face validity (e.g., the Myers-Briggs personality test)
Ø A test can be valid despite having poor face validity (e.g., finger-tapping speed is a relatively valid measure of concussion severity)
Ø Good face validity can establish the credibility of a test in the eyes of test-takers, which encourages confidence and performance motivation

Content validity
Content validity: does the test adequately represent all the components of the construct?
Ø Many constructs have multiple different components, so a good test of the construct should assess all of its components
Ø 'Generalised anxiety disorder' is one such construct:
Ø Feelings of anxiety
Ø Worry that is difficult to control
Ø Restlessness or feeling on edge
Ø Being easily fatigued
Ø Difficulty concentrating
Ø Irritability
Ø Muscle tension
Ø Sleep disturbance
Ø A test of generalised anxiety disorder that only asked about anxiety feelings would lack content validity
Ø Even though feelings of anxiety are one of the most 'central' symptoms of generalised anxiety disorder!
Predictive validity
Predictive validity: are scores on the test predictive of other (external) indicators of the construct?
Ø A test has good predictive validity if scores on the test allow us to estimate (or 'predict') scores on some criterion that is external to the test itself
Ø A good driving test is one that allows us to predict whether an individual will drive safely
Ø A good test of marital satisfaction predicts divorce rates 12 months later
Ø An ATAR score in Year 12 is used to predict performance at university
Ø Scores on a new anxiety scale you developed should be correlated with a clinician's simultaneous ratings
Ø Note: some researchers use the term predictive validity only in cases where the test is delivered before the criterion (as in the driving, marital satisfaction, and ATAR examples). When the test is measured at the same time as the criterion (as in the anxiety scale example), these researchers use the term concurrent validity

What makes for a good criterion?
Ø The answer differs according to the construct, but a good criterion should have these features:
1. Reliable
Ø A criterion is not useful if it cannot itself be assessed reliably
Ø The validity coefficient is limited by the reliability of both the test and the criterion (see the sketch following this list)
2. Theoretically appropriate
Ø An ideal criterion is the behaviour itself; otherwise, a theoretically supported measure should be chosen
Ø We should have good theoretical reasons to expect the construct we are measuring to be related to the criterion; if we don't think the two are linked, the criterion is not appropriate
Ø e.g., divorce rates 12 months later would be a good criterion for marital satisfaction now, but for a marital satisfaction survey delivered in the US in the 1950s, when divorce was heavily restricted in many states, divorce rates would not be a good criterion, because there are theoretical reasons why they might not be strongly related to marital satisfaction
3. Not contaminated by the test itself
Ø If the criterion measure contains similar items to those on the test itself, estimates of validity will be artificially inflated, because of the overlap between the test and the criterion
Ø e.g., developing a new depression test that is a re-worded (via thesaurus) version of the BDI, and then using the BDI as a criterion
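The point that the validity coefficient is limited by the reliability of both test and criterion can be stated exactly with the classical attenuation bound. A minimal sketch (an illustration under classical test theory assumptions, not part of the slides):

```python
import math

# Classical test theory: measurement error in the test (reliability r_xx)
# and in the criterion (reliability r_yy) attenuates the observed
# test-criterion correlation. Even if the underlying constructs were
# perfectly correlated, the observed validity coefficient cannot exceed:
#     r_xy_max = sqrt(r_xx * r_yy)

def max_validity(r_xx: float, r_yy: float) -> float:
    """Upper bound on the observable test-criterion correlation."""
    return math.sqrt(r_xx * r_yy)

# e.g., a test with reliability 0.80 validated against a criterion with
# reliability 0.70 cannot show a validity coefficient above ~0.75.
print(round(max_validity(0.80, 0.70), 3))  # 0.748
```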
Known-groups validity
Ø Known-groups validity is a specific name for measures of predictive validity in which the external criterion is group membership
Ø A test with good predictive validity should be able to distinguish between groups of people known to differ on the construct of interest
Ø For example, imagine that we are developing a new measure of chronic pain severity
Ø If the measure has good predictive validity, then people receiving treatment for chronic back pain should get higher scores than healthy control participants

Construct validity
Construct validity: are the test's assumptions about the construct it is measuring theoretically justified?
Ø What is a latent construct?
Ø Theoretical, intangible states or traits on which individuals differ. Inferred from behaviour, never observed directly
Ø Assessing construct validity therefore depends on appraising the test with respect to the underlying theory
Ø The criticism that the IAT may measure awareness of stereotypes rather than endorsement of stereotypes is a criticism of its construct validity
Ø Cronbach and Meehl encouraged test developers to measure construct validity in terms of both convergent evidence and discriminant evidence
Ø Convergent evidence: do test scores correlate with other variables that we expect them to correlate with? (Does the test converge with other measures of related constructs?)
Ø Discriminant evidence: do test scores not correlate with variables that we don't expect them to correlate with? (Does the test discriminate between measures of unrelated constructs?)
Ø (A toy computational example follows the recap slide below)

Construct validity
Ø Convergent evidence: does the measure correlate with other measures we expect it to?
Ø Diagnostic criteria for depression include low mood, anhedonia, feelings of guilt and worthlessness, sleep disturbance, appetite disturbance, etc.
Ø Therefore a psychological test of depression symptoms should correlate with other measures of low mood, anhedonia, strength of guilt feelings, etc.
Ø i.e., show that the test is related to the things it should theoretically be related to
Ø Discriminant evidence: does the measure not correlate with measures it should be unrelated to?
Ø e.g., a colour-blindness test should not correlate strongly with general visual acuity (clarity of vision; the ability to see fine detail)

Four kinds of validity
Ø We will discuss four ways of establishing the validity of a psychological test:
Face validity: does the test appear (to the test-taker) to assess the relevant construct?
Content validity: does the test adequately represent all the components of the construct?
Predictive validity: are scores on the test predictive of other (external) criteria that indicate the construct?
Ø e.g., known-groups validity: does our test predict membership of groups that it should predict?
Construct validity: are the test's assumptions about the construct it is measuring theoretically justified?
Ø Can be measured using convergent evidence and discriminant evidence
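As a toy illustration of convergent and discriminant evidence (invented simulated data, not from the lecture; Python with numpy assumed), one might check that a new depression scale correlates with a related measure but not with an unrelated one:

```python
import numpy as np

# Hypothetical data: 100 test-takers complete a new depression scale, an
# established low-mood measure (expected to correlate: convergent evidence),
# and a visual-acuity test (expected NOT to correlate: discriminant evidence).
rng = np.random.default_rng(0)
n = 100
depression_true = rng.normal(size=n)

new_scale = depression_true + rng.normal(scale=0.5, size=n)   # measures depression
low_mood = depression_true + rng.normal(scale=0.6, size=n)    # related construct
visual_acuity = rng.normal(size=n)                            # unrelated construct

r_convergent = np.corrcoef(new_scale, low_mood)[0, 1]
r_discriminant = np.corrcoef(new_scale, visual_acuity)[0, 1]

print(f"convergent r (should be high): {r_convergent:.2f}")    # ~0.7-0.8
print(f"discriminant r (should be ~0): {r_discriminant:.2f}")  # ~0.0
```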
How are reliability and validity related?
Ø A test must be reliable to be valid
Ø An unreliable test does not measure anything consistently, so it certainly can't measure what it says it is measuring
Ø However, a test can be reliable without being valid
Ø e.g., a test of colour-blindness that involves measuring the length of a person's big toe: the measurements are highly consistent, but they do not measure colour-blindness
Ø Another way of saying this is that reliability is necessary but not sufficient for validity

Lecture learning outcomes
1. Define 'reliability' as a psychometric property, and articulate the importance of test reliability in psychological testing and assessment.
2. List, describe, and differentiate the different ways that a test's reliability can be assessed.
3. Define 'validity' as a psychometric property, describe its importance in psychological testing and assessment, and explain how it is related to test reliability.
4. List, describe, and differentiate the different kinds of test validity.

Weekly reading
Shum et al., Chapter 4: pages 71-84
Shum et al., Chapter 5: pages 85-93 and 98-101