
ITRIP Lecture 3 - Reliability and Validity 2024 PDF


Summary

This ITRIP lecture provides an overview of reliability and validity in psychological research. It discusses the different types of reliability and validity, with examples and explanations, and includes notes on operationalisation, measurement error, and how to design and interpret research studies.

Full Transcript


SCIENTIFIC FOUNDATIONS OF PSYCHOLOGICAL SCIENCE
Lecture 3: Reliability and Validity of Measurement
Dr Melanie Murphy ([email protected])

Reading: Navarro DJ and Foxcroft DR (2022). learning statistics with jamovi: a tutorial for psychology students and other beginners (Version 0.75). Section 2.3 (Assessing the reliability of a measurement) and Section 2.6 (Assessing the validity of a study).
Optional reading: Howitt, D., & Cramer, D. (2011). Introduction to research methods in psychology. Pearson/Prentice Hall (pp. 266-279); Field (2017), Chapter 1, Sections 1.6.3 and 1.6.4.

ASSESSMENT 1 PRE-CONCEPT CHECK
Primes knowledge for Proposing a Research Study (the group presentation task). Worth 5% of the SFP grade; 10 multiple-choice questions. Think about the following points:
- What are the elements, and what is the purpose, of the Introduction?
- What is operationalisation, and what is a hypothesis?
- Methods: what are its three sections?
- What is the importance of citing research?
- What are the key elements of an effective team? (Tutorial 1)
- What are goal-setting theory and SMART goals? (Tutorial 3)
Revision: the assignment instructions; Mel's lecture slides (Weeks 2-3) and associated readings; Annukka's Week 2 philosophy lecture (slides 12 and 19); tutorial slides (Weeks 1-3).

MEASUREMENT OF A CONCEPT
- Scale: how is it classified or quantified?
- Reliability: is it reproducible?
- Validity: does it behave as we expect?
Reading: Field, A. (2017). Chapter 1, Section 1.6.2 (starts p. 10).

LEVELS OF MEASUREMENT
Nominal (or categorical), Ordinal (or ranked), Interval, Ratio: the scales "NOIR".
[Figure 3.1: the different types of scales of measurement and their major characteristics]
Christensen, L., Johnson, R. B., & Turner, L. (2015). Research methods, design, and analysis (global edition). Pearson Education (p. 153).

WEEK 2 SUMMARY
Definition of concepts is important for scientific communication: a shared understanding of phenomena and, based on that understanding, a set of agreed-upon characteristics. Definitions should be comprehensive, and we strive for precision. Constructs are concepts operationalised. More precise operationalisation allows for more precise measurement: objective observation, replication, and observable, quantifiable data. Different scales of measurement (nominal, ordinal, interval, ratio) provide different amounts of information.

FROM THEORY TO HYPOTHESIS
Operationalisation links theoretical constructs to the real world, yielding a testable hypothesis of the form y = f(x). Note: it is important that operational measures are reliable and valid representations of the theoretical constructs. You could consider a hypothesis to be an 'educated guess' about what a study might find.

CONSTRUCTING A HYPOTHESIS
A hypothesis should be quite specific, so it can set parameters on the research design, i.e., what kind of study design should be used (or what design is not appropriate). Usually a hypothesis is built from what past research on a similar question has found. A hypothesis must also be testable, meaning the study must create a situation where the researcher can objectively determine whether the 'educated guess' is supported or not supported (the correct terminology, rather than 'correct' or 'incorrect'). As such, hypotheses should be phrased as statements, not questions, and are typically directional so that they can be tested.

OPERATIONALISATION
Operationalisation of a hypothesis is the process by which we select or create quantifiable representations of concepts. These operationalisations must be reliable and valid. A concept (e.g., anxiety) may be operationalised in different ways (e.g., galvanic skin response, heart rate, eye movements, simple Q&A, and multi-item questionnaires).

MEASUREMENT ERROR
[Figure: completed 1-5 rating sheets of the same items (e.g., "manages time effectively", "organises information effectively"), illustrating how repeated ratings of the same thing differ and how an observed score decomposes into a true score plus error.]
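The true-score model of measurement error can be illustrated with a minimal simulation. All numbers here are hypothetical, chosen only for demonstration; this is a sketch of the general idea, not anything from the lecture's datasets.

```python
# Illustrative simulation of the classical true-score model:
# each observed score is a participant's true score plus random error.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

true = rng.normal(loc=50, scale=10, size=n)   # hypothetical true anxiety scores (sd = 10)
error = rng.normal(loc=0, scale=5, size=n)    # random measurement error (sd = 5)
observed = true + error                       # Observed Score = True Score + Error

# Reliability = proportion of observed-score variance that is true-score variance.
reliability = true.var() / observed.var()
print(round(reliability, 2))  # theoretically 10^2 / (10^2 + 5^2) = 0.80
```

Note the design choice: error is random with mean zero, so it washes out of the average but still inflates the variance of the observed scores, which is exactly why noisy measures make real effects harder to detect.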
Observed Score = True Score + Error

WHY WORRY ABOUT ERROR?
This notion of 'measurement error' links to our discussion of the philosophy of science. Nothing is ever definitively proven; research operates on the basis of probability ("on the balance of probability, the hypothesis is supported"). Link this to Popper's idea of falsification. When we analyse the results of an experiment, we are asking: "What is the likelihood of this result happening by chance in the real world?" (when we try to take error into account). We try our best to minimise error when we design our studies.

DANGERS OF MEASUREMENT ERROR
Misinterpretations of a hospital pain scale (from http://hyperboleandahalf.blogspot.com.au/2010/02/boyfriend-doesnt-have-ebola-probably.html):
0: Haha! I'm not wearing any pants!
2: Awesome! Someone just offered me a free hot dog!
4: Huh. I never knew that about giraffes.
6: I'm sorry about your cat, but can we talk about something else now? I'm bored.
8: The ice cream I bought barely has any cookie dough chunks in it. This is not what I expected and I am disappointed.
10: You hurt my feelings and now I'm crying!
A revised scale, from the same source:
0: Hi. I am not experiencing any pain at all. I don't know why I'm even here.
1: I am completely unsure whether I am experiencing pain or itching or maybe I just have a bad taste in my mouth.
2: I probably just need a Band Aid.
3: This is distressing. I don't want this to be happening to me at all.
4: My pain is not playing around.
5: Why is this happening to me??
6: Ow. Okay, my pain is super legit now.
7: I see Jesus coming for me and I'm scared.
8: I am experiencing a disturbing amount of pain. I might actually be dying. Please help.
9: I am almost definitely dying.
10: I am actively being mauled by a bear.
11: Blood is going to explode out of my face at any moment.

MEASURING CONSTRUCTS IN RESEARCH
Christensen, L., Johnson, R. B., & Turner, L. (2015). Research methods, design, and analysis (global edition). Pearson Education (p. 151).

RELIABILITY AND VALIDITY – TRUST ISSUES IN RESEARCH
http://thesciencepost.com/i-just-know-replaces-systematic-reviews-at-top-of-evidence-pyramid/
Higgins, P. A., & Straub, A. J. (2006). Understanding the error of our ways: mapping the concepts of validity and reliability. Nursing Outlook, 54(1), 23-29.

RELIABILITY
In quantitative research, good reliability means that if someone else repeats the study, they would obtain the same results we did. One key way to maximise reliability is to ensure the precision of the tools and measures you use. This becomes even more important for practice: if we are to employ new assessment tools or interventions in the clinic, we need them to be as accurate and effective as possible to best serve our patients/clients. Key things to keep in mind:
- Repeatability: can this study be conducted using the same procedure again?
- Consistency: using a standardised procedure, can the same results be obtained again?
- Agreement: how well do the measures match across situations, researchers, and time?

What is Reliability?
It is the extent to which a score is consistent (i.e., reproducible) across time and between observers.
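Consistency across time and across items can be quantified. Below is a minimal sketch of two standard reliability estimates (test-retest correlation, and Cronbach's alpha for internal consistency) computed on simulated data; all means, SDs, and item counts are hypothetical.

```python
# Two standard reliability estimates on simulated (hypothetical) data:
# 1) test-retest: correlate scores from two administrations of the same test;
# 2) internal consistency: Cronbach's alpha over items of one administration.
import numpy as np

rng = np.random.default_rng(0)
n = 500

# --- Test-retest reliability ---
trait = rng.normal(25, 5, n)           # stable underlying trait
week1 = trait + rng.normal(0, 2, n)    # Week 1 administration (+ occasion error)
week2 = trait + rng.normal(0, 2, n)    # Week 2 administration (+ occasion error)
retest_r = np.corrcoef(week1, week2)[0, 1]

# --- Cronbach's alpha ---
def cronbach_alpha(items):
    """items: array of shape (n_respondents, n_items)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

latent = rng.normal(0, 1, (n, 1))              # shared construct (e.g., self-esteem)
items = latent + rng.normal(0, 0.7, (n, 10))   # 10 items = construct + item noise
alpha = cronbach_alpha(items)

print(round(retest_r, 2), round(alpha, 2))
```

Because the simulated trait is stable and the items share a common construct, both coefficients come out comfortably above the .70 rule of thumb discussed in this lecture.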
Sources of error may be random:
- individual variations (e.g., Time 1 to Time 2)
- situational variations (e.g., fatigue, anxiety, mood)
Sources of error may be systematic:
- method variations (e.g., assessor, training)
- the way we pose a question (e.g., like / dislike)

Reliability coefficient
The reliability coefficient is the percentage (%) of the observed score that is true score:
Reliability = True score / (True score + Error)
Example: the Spielberger State Anxiety Inventory has a reliability coefficient of .80. This means that 80% of the score can be relied upon to measure the true extent of anxiety (20% would be error).

Establishing Reliability from two administrations of the test
- Test-retest: correlation between scores obtained across time (e.g., 1 or 2 weeks).
- Parallel-form: correlation between scores obtained on two versions of the test across time (e.g., 1 or 2 weeks).
- Inter-rater: correlation between scores given by two observers.

Test-retest reliability: the correlation between scores obtained over a short time (e.g., Week 1 to Week 2). Over a much longer interval (e.g., Week 52) the question becomes one of stability versus change. Note: even if a test is reliable, its score may change over a longer time simply because the measured quantity (e.g., height) had, in fact, changed.

Parallel-form reliability: if the test is easy to remember (e.g., an IQ test or a memory test), we can use an equivalent test (i.e., a parallel form) in subsequent (re)testing. We often counterbalance the administrations (Form A first for half the participants, Form B first for the other half) to avoid any unforeseen differences between forms.

Inter-rater reliability: do individual observers produce similar scores? E.g., how many tantrums did the child throw today? Observer 1 counts 5; does Observer 2 also count 5? Agreement can be expressed as a correlation, or as % agreement per object or event.

Establishing Reliability from a single administration of the test
- Split-half: correlation between two halves of the same test.
- Internal consistency (Cronbach alpha): the averaged correlation between all possible two halves of the same test.

Split-half reliability: do scores from one half correlate with scores from the other half? Example items (Rosenberg Self-Esteem Scale):
- On the whole, I am satisfied with myself.
- At times I think I am no good at all.
- I feel that I have a number of good qualities.
- I am able to do things as well as most other people.
- I feel I do not have much to be proud of.
- I feel that I'm a person of worth, at least on an equal plane with others.
- I certainly feel useless at times.
- I wish I could have more respect for myself.
- All in all, I am inclined to feel that I am a failure.
- I take a positive attitude toward myself.

Internal consistency (Cronbach a): the Cronbach alpha (a) index of internal consistency is the average of all possible split-half correlations. It ranges from 0 to 1. If a questionnaire has a Cronbach a above .70, it is considered reliable; that is, its items are homogeneous in meaning (i.e., they measure the same concept).

IMPROVING RELIABILITY
- Improving the measurement instrument: increasing the number of items, clearly written items, an objective scoring technique.
- Standardising the test situation: clear instructions, a standardised setting.
- Stating the limitations and/or adjusting for them: e.g., age norms, gender norms, education norms.

What is Validity?
The extent to which the score 'behaves' as expected from theory.
[Diagram: a theory and its constructs mapped onto operationalised measures X and Y]

VALIDITY
A researcher needs to keep validity in mind when designing a study so they can eliminate as many of the potential biases as possible; not doing so could make interpreting the results difficult. Types of validity fall into three broad categories: measurement validity, internal validity, and external validity.

MEASUREMENT VALIDITY
How good are the measures being used?
Questions to consider about the design:
- Has this measure been used previously in peer-reviewed research?
- Is this measure appropriate for the research question?
- Is this measure appropriate for the sample being investigated?
- Is this measure actually measuring what it is supposed to measure?
- Does the measure fit in with other types of measures assessing the same thing?
- Does the measure comprehensively cover the question it is supposed to measure, or does it miss something?

INTERNAL VALIDITY
How sure are we that one factor causes a particular outcome? For example: how sure are we that studying during semester leads to better exam performance? You could think of threats to internal validity as arising from what happened while the experiment was running that makes it harder to be confident in the results. For example: maybe you studied hard during the semester but then got sick. This would mean your exam performance might not really reflect your true ability.

Internal validity can be compromised by many factors:
- Events affecting participants during data collection.
- Bias in allocating participants to groups (can be avoided by using an RCT).
- Long-term changes in participants' responses (an issue for longitudinal studies).
- Interaction between testing and participants' refusal to continue: there may be characteristic differences between the people who stay in a study and those who drop out.
- Differences in testing conditions or procedures: always try to use the same testing situations and measures/equipment.
- Experimenter expectancy: just as testing can affect performance, the participant can be influenced by their perception of the experimenter's expectations.

EXTERNAL VALIDITY
How well can the results of the study be applied (generalised) to other similar situations, with different people, at different times? In other words: to what populations, settings, treatment variables, and measurement variables can this effect be generalised? To maximise external validity, a researcher should ask the following:
- How representative are our participants? Are they a good snapshot of the wider population of interest?
- How representative are our variables? Are the variables being tested the best ones to examine in order to answer the question? Do they cover all bases?
- How representative is our test situation? Is the situation where data is being collected suitable for gaining realistic measures that could be generalised?

Ecological Validity
We need evidence that use of the test is valid for its context (e.g., can it be generalised to real-life situations in practice?).

Is this test valid?
Examine the items of the Rosenberg Self-Esteem Scale (listed above under split-half reliability) to determine face and content validity.

KINDS OF VALIDITY
- Face validity: 'looks' like a measure of self-esteem?
- Content validity: do the items cover the various aspects of self-esteem?
- Construct validity:
  - convergent validity (i.e., correlates with other measures of self-esteem)
  - discriminant validity (i.e., does not correlate with measures of a different trait, such as IQ)
  - known groups validity (i.e.
distinguishes between groups that have previously been shown to differ)

Convergent Validity
Convergent validity is demonstrated by moderate to high correlations between measures of the same trait (e.g., two measures of self-esteem, or of self-efficacy): convergent (shared) measurement of the same construct. Note that scores from similar methods used to measure a trait tend to correlate more highly than scores from dissimilar methods. For example, the Beck Anxiety Questionnaire would correlate more highly with the Spielberger Anxiety Questionnaire than with physiological measures of anxiety (heart rate, finger tremor), because both are questionnaires:

Correlations    | Spielberger Anxiety Q | Heart Rate
Beck Anxiety Q  | High                  | Moderate
Finger Tremor   | Moderate              | Moderately High

Discriminant Validity
Discriminant validity is demonstrated by low correlations between measures of different traits (e.g., self-esteem and memory, or aggression and mindfulness), even when the methods employed are the same (e.g., both pencil-and-paper questionnaires). Researchers and clinicians need to be sure they are not measuring something else! When designing an experiment, a researcher must always keep in mind the balance between internal and external validity. It can be a delicate balance.

CONCEPTS IN ACTION: MONTREAL COGNITIVE ASSESSMENT (MoCA)
[Reproduced form: the MoCA test sheet (visuospatial/executive, naming, memory, attention, language, abstraction, delayed recall and orientation items, including a Memory Index Score (MIS); total score /30), © Z. Nasreddine MD, https://www.mocatest.org]
See also: https://www.nytimes.com/article/trump-cognitivetest.html
Nasreddine, Z. S., Phillips, N. A., Bédirian, V., Charbonneau, S., Whitehead, V., Collin, I., ... & Chertkow, H. (2005). The Montreal Cognitive Assessment, MoCA: a brief screening tool for mild cognitive impairment. Journal of the American Geriatrics Society.
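The convergent/discriminant pattern in the correlation table above can be sketched with a toy simulation. All variables and values here are hypothetical and chosen only to illustrate the logic, not taken from any of the questionnaires named in the lecture.

```python
# Convergent vs discriminant validity: two measures of the same trait should
# correlate highly; a measure of a different trait should not.
import numpy as np

rng = np.random.default_rng(7)
n = 2000

anxiety = rng.normal(0, 1, n)                  # latent anxiety
beck = anxiety + rng.normal(0, 0.5, n)         # questionnaire measure 1
spielberger = anxiety + rng.normal(0, 0.5, n)  # questionnaire measure 2
iq = rng.normal(100, 15, n)                    # an unrelated trait

convergent = np.corrcoef(beck, spielberger)[0, 1]   # should be high
discriminant = np.corrcoef(beck, iq)[0, 1]          # should be near zero
print(round(convergent, 2), round(discriminant, 2))
```

Swapping the questionnaire noise for a dissimilar method (e.g., larger, method-specific error for a physiological measure) would lower the convergent correlation, which is the similar-methods point made above.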
ASSESSING RELIABILITY AND VALIDITY FOR THE MoCA
Reliability:
- Test-retest reliability: participants tested approx. 30 days after the first assessment (r = .92).
- Parallel forms: different but equivalent tasks to test within each domain.
- Internal consistency: large Cronbach's alpha (.83).
- Inter-rater reliability: training required for administration and scoring.
Validity:
- Face validity: the items look like they assess cognitive skills.
- Content validity: the items cover a range of different skills associated with cognitive ability.
- Known groups validity: shown to distinguish between different conditions (MCI and AD from controls).
- Convergent validity: a high degree of agreement with a similar, previously validated and accepted cognitive assessment (the MMSE).
- Discriminant validity: not specifically reported in the paper for other constructs, but the pattern of results suggests a degree of discriminant validity for diagnostic classification.
- Ecological validity: works well in hospital/allied health settings and in different languages.

MINI-MENTAL STATE EXAMINATION (MMSE)
[Reproduced form: the Standardised Mini-Mental State Examination (SMMSE) score sheet. Items include orientation to time and place (year, season, month, date, day; country, state, city, street address, building, room, floor), registration and delayed recall of three objects (e.g., ball, car, man), spelling WORLD backwards, naming a wristwatch and a pencil, repeating "No ifs, ands, or buts", a three-stage command (take this paper in your non-dominant hand, fold it in half, put it on the floor), reading and obeying "CLOSE YOUR EYES", writing a complete sentence, and copying a design of intersecting figures. Total score /30. The SMMSE tool and guidelines are provided for use in Australia by the Independent Hospital Pricing Authority under a licence agreement with the copyright owner, Dr D. William Molloy.]
Molloy, D. W., Alemayehu, E., & Roberts, R. (1991). Reliability of a standardized Mini-Mental State Examination compared with the traditional Mini-Mental State Examination. American Journal of Psychiatry, 148, 102-105.

LET'S LOOK AT SOME RESEARCH OUTCOMES
The top figure compares the average scores of participants administered the MMSE and the MoCA:
- NC (controls): the highest scores on both assessments.
- MCI (mild cognitive impairment): a smaller difference between these scores for the MMSE compared to the MoCA. This is why the authors argue the MoCA is more sensitive for identifying this condition.
- AD (Alzheimer's disease): this group shows the lowest scores on both, but the greatest difference on the MoCA.
The bottom figure looks at the relationship between scores on the MMSE and the MoCA for the different groups (conditions). It is another way of showing that there is more overlap between control and MCI scores for the MMSE compared to the MoCA (look for clusters of dots with triangles), whereas the AD group has more scores at the lower end of the scale (look for squares). Importantly, if we consider approximately 25 as typical cognitive ability, this graph shows that the MMSE is more likely to classify someone already known to have MCI or AD (less severe) within the range of typical function: an indication of the 'sensitivity and specificity' of the scales.
Nasreddine, Z. S., Phillips, N. A., Bédirian, V., Charbonneau, S., Whitehead, V., Collin, I., ... & Chertkow, H. (2005). The Montreal Cognitive Assessment, MoCA: a brief screening tool for mild cognitive impairment.
Journal of the American Geriatrics Society.

SUMMARY
- Measurement error: arises due to discrepancies between the construct we intend to measure and how well we actually capture it.
- Reliability: the extent to which a score is consistent (i.e., reproducible) across time and between observers.
- Validity: the extent to which the score is consistent with theoretical expectations about how the construct should behave.

NEXT WEEK
Experimental Research
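As a closing illustration of the known-groups validity idea from the MoCA discussion: a valid cognitive screen should separate groups already known to differ. The sketch below uses hypothetical group means and SDs, not the values reported by Nasreddine et al. (2005).

```python
# Known-groups validity sketch: do simulated controls and an MCI group
# differ on a MoCA-like score? Effect size quantifies the separation.
import numpy as np

rng = np.random.default_rng(3)

controls = rng.normal(27, 2, 90)  # hypothetical MoCA-like scores, /30
mci = rng.normal(22, 3, 90)       # hypothetical mild cognitive impairment group

# Cohen's d: standardised difference between the two group means.
pooled_sd = np.sqrt((controls.var(ddof=1) + mci.var(ddof=1)) / 2)
d = (controls.mean() - mci.mean()) / pooled_sd
print(round(d, 2))  # a large d indicates the test distinguishes the groups
```

A test with more overlap between groups (as the lecture describes for the MMSE with MCI) would yield a smaller d and, correspondingly, poorer sensitivity at any single cut-off score.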
