PSYF231 Chapter 3: Reliability and Validity PDF

Summary

This document discusses concepts related to reliability and validity in psychological testing. It covers several types of validity, such as face validity and criterion validity. The content also explores the importance of reliability coefficients and how to enhance reliability in research.

Full Transcript


Kappa Coefficient
The kappa coefficient accounts for chance agreement in inter-rater reliability. It takes into account that raters can agree on the presence or absence of the same behavior, but that one observer can also claim that the behavior was present while the other failed to see it. This translates into four possibilities: both observers agree the behavior was present, both agree it was absent, observer A thinks the behavior was seen but B disagrees, and vice versa. Rater training is essential to achieve a kappa coefficient of at least 0.8. Video recording behaviors can help maintain reliability and provide a permanent record for review.

4-Cronbach Alpha (internal consistency reliability)
To determine how useful each item is for measuring the overall construct, the test statistic to be computed is internal consistency, also referred to as Cronbach’s alpha. Alpha coefficients, by definition, fall between 0 and 1.0. Not surprisingly, coefficients approaching 1.0 are considered more desirable than low ones. A test whose items average an internal consistency score greater than 0.8 is considered very good. On the other hand, internal consistency of less than 0.6 typically means that a test is considered “noisy” and problematic.

While it is desirable to have high internal consistency, achieving a perfect internal consistency of 1.0 (i.e., what looks like a perfect score) is actually a problem. Why is that? Well, if somebody composed a test of irritability and used the hypothetical test item “I get easily ticked off” 10 times in a row, then the internal consistency would probably be perfect. However, in this case not much good has been achieved; quite the reverse is true. You wasted a lot of the test takers’ time by making them respond to redundant test items.
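The two statistics above can be computed in a few lines. This is a minimal sketch; all counts and item scores below are invented purely for illustration, and the functions use the textbook formulas (kappa corrects observed agreement for chance agreement; alpha compares summed item variances to the variance of total scores):

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa from a 2x2 agreement table:
    a = both observers say present, d = both say absent,
    b and c = the two kinds of disagreement."""
    n = a + b + c + d
    p_observed = (a + d) / n
    # Chance agreement: P(both say "present") + P(both say "absent")
    p_chance = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
    return (p_observed - p_chance) / (1 - p_chance)

def cronbach_alpha(item_scores):
    """item_scores: one list of scores per item (respondents in the same order)."""
    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    k = len(item_scores)
    totals = [sum(person) for person in zip(*item_scores)]
    return k / (k - 1) * (1 - sum(var(item) for item in item_scores) / var(totals))

# Hypothetical counts: 20 agree-present, 15 agree-absent, 5 + 10 disagreements
print(round(cohens_kappa(20, 5, 10, 15), 2))   # prints 0.4 (below the 0.8 target)

# Hypothetical 3-item test answered by 4 respondents
items = [[2, 4, 3, 5], [3, 4, 3, 5], [2, 5, 4, 5]]
print(round(cronbach_alpha(items), 2))         # prints 0.94
```

Note how the redundancy problem shows up here: duplicating the same item list three times drives alpha toward 1.0 without adding any information.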
The ideal test is short and still has very high, but not perfect, internal consistency.

Quick reminder
Test-retest reliability measures consistency over time, while inter-rater reliability assesses agreement between observers. Reliability coefficients such as Cronbach's alpha and intraclass correlation are commonly used to quantify reliability.

Training Observers
Observer training is crucial for achieving inter-rater reliability. Training helps observers understand what to look for and how to use the coding system. Consistent training ensures that observers interpret behaviors similarly. Ongoing training may be necessary to maintain reliability, especially with new coding systems.

Use of Video Recording
Video recording behaviors is a cost-effective way to ensure reliability. It allows multiple observers to review behaviors at different times. Video recording captures subtle behaviors that may be missed during live observations, and reviewing recordings can help resolve disagreements between observers. In order to have a meaningful test, researchers need to define in behavioral terms what observers are expected to watch for; this is the foundation of a reliable, structured coding system. Asking a coder whether or not a child behaved “aggressively” is potentially problematic because “aggressive” can mean different things to different observers and can be culture-specific, all of which can lead to low inter-rater reliability.

Self-Evaluations
Self-evaluations in psychological tests can be more reliable than observer ratings. Individuals provide information about themselves, reducing observer bias. Self-evaluations can be conducted using standardized test formats. Test-retest reliability is important for self-evaluations to ensure consistent results over time.

Quick RECAP
Establishing reliability is essential for psychological tests. Reliability ensures consistency in test results over time and between observers.
Training, clear definitions, and use of video recording can help maintain reliability. Reliability is a key component of test validity and usefulness in clinical practice and research.

Validity
The concept of validity in psychological testing is essential to ensure that tests accurately measure what they claim to measure. In addition to reliability, which focuses on consistency in test results, validity addresses whether a test truly assesses the intended construct. For example, an intelligence test should genuinely measure intelligence, and a depression test should effectively identify individuals experiencing gloominess and negativity.

1-Face validity (Görünüş geçerliliği)
Face validity is about whether a test appears to measure what it is supposed to measure. When test-takers easily recognize the purpose of a test, they may consciously or unconsciously bias their responses to align with expectations. So while high face validity can simplify interpretation in some contexts, it can also lead to biased responses. Examples: a mathematical test visibly contains numerical data; a geography test visibly covers world maps, mountains, seas, oceans, etc.

2-Content validity (Kapsam geçerliliği)
Content validity is crucial for ensuring that a test adequately covers the content domain it intends to measure. Test items should target relevant aspects of the construct, based on input from experts in the field. For example, when developing a questionnaire on delusional thoughts, input from psychiatrists and psychologists familiar with delusional patients helps ensure content validity. Questions need to be prepared in a balanced way. QQ: What do you think about your exams’ content validity? (:

3-Construct validity (Yapı geçerliliği)
Construct validity examines the theoretical framework underlying a test and its relationship with other psychological constructs. A test with good construct validity demonstrates meaningful connections with related concepts and accurately represents the targeted construct.
For example, construct validity implies that someone who scores high on environmental awareness should not throw garbage on the ground. Likewise, if a student who does not study gets a high score on an exam, the exam's construct validity is low.

Convergent validity (Uyum geçerliliği) [subtype of construct validity]
Convergent validity, sometimes called congruent validity, is the extent to which responses on a test or instrument exhibit a strong relationship with responses on conceptually similar tests or instruments. Not only should a construct correlate with related variables, but it should not correlate with dissimilar and unrelated ones. To demonstrate convergent validity for a survey instrument, the researcher might compare the responses to correlated measures such as optimism and contentment. If the responses on both tests are similar (that is, strongly correlated), it suggests that the instrument is measuring the same construct and therefore has convergent validity.

Ways to assess construct validity include: factor analysis; comparison with an established scale; interviewing respondents; and comparing different groups (people who know the subject vs. those who do not).

4-Criterion validity (Ölçüt geçerliliği)
A good test may help to answer questions like: (a) Is this patient well-adjusted and functioning, or is he suicidal and such a risk to himself that he needs the protection of a hospital? Or (b) Is this prison inmate who is being considered for release on parole likely to re-offend and represent a risk to the public? Let us presume that a previous study had shown that high scores on such a test can differentiate who is at risk of harming himself, or which offender has re-offended in the past; in that case, the test has criterion validity. The test scores serve as criteria to help us with real-world decision-making. Criterion validity (or criterion-related validity) measures how well one measure predicts an outcome for another measure.
A test has this type of validity if it is useful for predicting performance or behavior in another situation (past, present, or future). For example: a job applicant takes a performance test during the interview process; if this test accurately predicts how well the employee will perform on the job, the test is said to have criterion validity. A graduate student takes the GRE; the GRE has been shown to be an effective tool (i.e., it has criterion validity) for predicting how well a student will perform in graduate studies.

Predictive validity (Yordama geçerliliği) [subtype of criterion validity]
Predictive validity is particularly valuable for making accurate predictions about future outcomes based on test scores. For example, a test that effectively predicts which individuals are at risk of self-harm or likely to re-offend provides crucial information for decision-making in clinical and legal settings.

Concurrent validity (Eşzamanlı/Örtüşmeli geçerlilik) [a form of criterion validity]
Concurrent validity assesses the degree to which a new test correlates with other established tests measuring similar constructs. While high concurrent validity may seem desirable, it can indicate redundancy if the new test merely replicates existing measures. However, it can be valuable if the new test provides equivalent results in a shorter format or offers unique insights. A test of generalized anxiety would be expected to correlate with other tests of anxiety; when that is the case, the test is described as possessing good concurrent validity. Concurrent criterion validity is established by demonstrating that a measure correlates with an external criterion measured at the same time. For example, concurrent criterion validity could be shown if scores on a math test correlate highly with scores on another math test administered at the same time.
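Criterion and concurrent validity claims ultimately rest on a correlation coefficient between the test and its criterion. A minimal sketch of that check, using entirely hypothetical scores on a new math test and an established one:

```python
def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores: a new math test vs. an established math test,
# both administered to the same six students at the same time
new_test = [55, 62, 70, 48, 81, 66]
established = [58, 60, 75, 50, 78, 69]
print(round(pearson_r(new_test, established), 2))  # prints 0.96
```

A correlation this high with a criterion measured at the same time is what the text calls concurrent criterion validity; computed against a criterion measured later (e.g., job performance), the same coefficient would speak to predictive validity.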
How Should Tests Be Described with Respect to Their Reliability and Validity?
“We used test X to measure intelligence; test X is reliable and valid.” Having seen a more detailed description of how many ways there are to establish reliability and validity, you now have a clearer sense of why such a simplistic description of test properties is inappropriate, or even misleading. Instead of the categorical (and highly inadequate) phrase “Test X is reliable and valid,” a more informative description would read as follows:

To determine the prevalence of depression in our sample, we used the ABC test of depression developed by Down and In-the-Dumps (1986). The ABC is a 25-item self-report scale of depressed mood, each item using a 1–5 scale where a larger number indicates higher depression. Scores obtained with the ABC have been shown to have a test-retest reliability of r = .91 over a 2-week interval and r = .74 over a 6-month interval, determined in an Australian college student population. The ABC has also been shown to have an internal consistency coefficient of .86, which is generally considered very good. Test items were written by the researchers and were then validated using college students and adults living in the community. To avoid unnecessary length, the original 96-item list was reduced to 24 parsimonious items via factor-analytic approaches, which confirmed that the ABC measures a single factor, named depression. Test scores derived with the ABC have been shown to have criterion validity in that they are sensitive to change in individuals undergoing psychological therapy, and they are able to differentiate recently diagnosed depressive individuals from those who have not recovered.
https://www.researchgate.net/publication/371676671_Psychometric_Properties_and_Factor_Structure_of_the_Turkish_Version_of_the_Short_Form_of_Behavioural_Activation_for_Depression_Scale_BADS-SF_in_Non-Clinical_Adults

Measuring Change in Therapy
A very important question raised in the profession of psychology is whether or not our interventions are effective. Psychologists in private practice working with individual patients have an interest in knowing how much their patients improve and which of their interventions is particularly critical for this improvement. A hospital administrator wants to see that patients seen by the hospital psychology service are improving to such a degree that the existence of the psychology department and the associated budget are justifiable to the taxpayer, insurance companies, or government officials involved in health care. Researchers continue to carefully test which therapies are best suited for which kind of patient and seek to create a knowledge foundation to assist practitioners and answer questions such as: How many therapy hours are needed before patients start to make substantial improvements? How much therapist training is needed to create a pool of skilled therapists that can do the job in the most cost-efficient manner?

Methods Used to Learn About Therapy Outcome: Case Study
Practicing psychologists often engage in discussions about complex cases to seek advice, share successes, or contribute to professional development. Case conferences, where ongoing cases are discussed, are common, and supervisors and students often participate in these as part of their training. Particularly interesting cases may be presented in hospital settings, or published in journals to help others develop protocols for similar cases.
The study of individual cases can take two main formats. One approach involves presenting a structured narrative, similar to telling a story, which allows therapists to share interesting cases in a format similar to a written intake report. Another approach treats each patient as an opportunity to conduct an experiment, often seen in behavioral therapies. This approach involves systematically studying the effect of various interventions, such as behavioral changes, and recording the outcomes to inform future strategies. For example, parents may conduct their own experimental case study by trying different strategies, like yelling at their children or offering rewards, to stop sibling fights. If these strategies are tested in a structured format and carefully studied, the results can inform future parenting strategies. Similarly, psychotherapists may try innovative approaches with new presenting problems, recording the outcomes for future reference and potentially sharing their findings with other therapists through conferences or publications. Observation of individual patients and systematic case studies are essential in psychotherapy research, often leading to innovative approaches and contributing to the development of therapeutic techniques.

https://www.researchgate.net/publication/365881913_The_Use_of_Dynamic_Cognitive_Behavioural_Therapy_DCBT_in_Social_Anxiety_Disorder_SAD_A_Theoretical_Integration_Initiative

Therapy Outcome Research Based on Groups
The trustworthiness of evidence in psychological therapy relies on high-quality studies with similar treatments for comparable problems. Consider a single-group, pre-post treatment design: patients are assessed before and after treatment, and if they improve, it is tempting to credit the therapy. Unfortunately, this is not a safe interpretation at all, because this type of design cannot rule out many alternative explanations.
It is possible that a group of depressed patients who had been assessed in January, then received 4 months of treatment, and were reassessed in May improved because their depression was affected by the lack of light (a confounding variable). Confounding makes it impossible to differentiate a variable's effects in isolation from its effects in conjunction with other variables. For example, in a study of high-school student achievement, the type of school (e.g., private vs. public) that a student attended prior to high school and their prior academic achievements in that context are confounds (APA dictionary, 2018).

So, single-group, pre-post treatment designs can't reliably attribute improvements solely to therapy, due to various alternative explanations like placebo effects or external factors. Randomised controlled trials (RCTs) with active treatment and control groups are preferred to address these issues, ensuring similar starting conditions and minimising biases.

The control group in an experimental design is crucial for determining the true effects of a treatment. Comparing the outcomes of an active treatment group with those of a no-treatment control group helps assess whether the treatment's success is due to the treatment itself or to external factors. A wait-list control group, where patients expect treatment later, helps control for the effects of expectancy and is ethically appealing, ensuring all participants receive treatment eventually. This approach reflects the equipoise principle, ensuring fairness and offering comparable treatments or services to all participants.

Using a placebo control treatment in psychotherapy research is challenging because the placebo concept is more straightforward in drug studies. In drug studies, a placebo is an empty pill that looks, tastes, and feels the same as the active drug.
This approach is used in single-blind studies, where the patient doesn't know which treatment they're receiving, to prevent bias in the results. In double-blind studies, neither the physician nor the patient knows which treatment is being administered, ensuring unbiased results. A single blind is a procedure in which participants are unaware of the experimental conditions under which they are operating; a double blind is a procedure in which both the participants and the experimenters interacting with them are unaware of the particular experimental conditions; a triple blind is a procedure in which the participants, experimenters, and data analysts are all unaware of the particular experimental conditions (APA dictionary, 2018).

However, in psychotherapy research it is impossible to blind therapists to the treatment they're providing, and patients are usually aware of the therapy they're receiving. Researchers sometimes address this issue by using less effective comparison treatments, but this raises ethical concerns.

Psychotherapy researchers not only assess whether treatments produce desirable effects compared to no treatment, but also aim to determine whether new treatments are superior to existing ones and whether they work for the intended reasons. This involves testing treatment specificity, such as demonstrating that cognitive therapy affects negative thought patterns in depression. Statistical tests are conducted to determine whether observed changes are statistically significant and beneficial to patients' quality of life. However, statistical significance alone may not always translate to real-world significance, as illustrated by the example below of an educational program's negligible impact on employment despite statistically significant improvements in friendliness scores. Hence, success in psychotherapy is defined by a combination of statistical significance and changes in outcomes relevant to society.
The government invested $5,000,000 in an educational program to help 1,000 unemployed individuals with a history of schizophrenia acquire job-hunting skills. Even 6 months after the end of the program, only 2 of the 1,000 participants had found work (and this difference is not statistically different from zero). Still, their scores on a self-report friendliness test improved from 5.7 on a 10-point scale to 6.1. This change is statistically significant with p = .04. From the perspective of the government that funded the program, these results are not clinically significant because the main variable the program wanted to influence was unemployment. No politician will dare to go back to taxpayers and tell them that for $5,000,000 the unemployed now feel slightly friendlier. Again, REMEMBER, the take-home message is that success is usually defined by a blend of statistical significance and change in an outcome that has value to society.

Quality Criteria for Therapy Outcome
In the last two or three decades, researchers have conducted thousands of therapy outcome studies, and the standards for a high-quality study have grown substantially. Today, a really good study is one that meets all the requirements listed in Table 3.4. After several therapy outcome studies are published, they undergo review by writers or committees who assess their quality and reliability using a rating system. This system, similar across different organizations, helps evaluate the evidence's trustworthiness. For example, the one used by the Association for Applied Psychophysiology and Biofeedback and the American Psychological Association includes different levels of evidence quality:

Level 1: Not empirically supported - Based solely on anecdotal reports or case studies in non-peer-reviewed sources.
Level 2: Possibly efficacious - Supported by at least one study with sufficient statistical power and well-defined outcome measures, but lacking randomized assignment to a control group.

Level 3: Probably efficacious - Supported by multiple observational studies, clinical trials, wait-list controlled studies, and replication studies demonstrating effectiveness.

(Possible means "able to be done; able to happen or exist." Probable means "likely to happen or be true, but not certain." If something is possible, it can happen; but possible does not mean that something will happen for certain, or even that it is very likely to happen. If there is a 10% chance of rain today, it is possible that it will rain: it could rain, but there is a 90% chance that it will not. If something is probable, there is a good chance that it will happen, but it is not certain. If there is a 90% chance of rain today, it is probable [= likely] that it will rain.)

Level 4: Efficacious:
a. In a comparison with a no-treatment control group, alternative treatment group, or placebo control utilizing randomized assignment, the investigational treatment is shown to be statistically significantly superior to the control condition, or it is equivalent to a treatment with established efficacy in a study with sufficient power to detect moderate differences.
b. The studies have been conducted with a population treated for a specific problem, for whom inclusion criteria are delineated in a reliable, operationally defined manner.
c. The study used valid and clearly specified outcome measures related to the problem being treated.
d. The data are subjected to appropriate analysis.
e. The diagnostic and treatment variables and procedures are clearly defined in a manner that permits replication of the study by independent researchers.
f. The superiority or equivalence of the investigational treatment has been shown in at least two independent research settings.
Level 5: Efficacious and specific - The investigational treatment is shown to be statistically superior to a placebo control treatment, or to an alternative bona fide (genuine, honest) treatment, in at least two independent research settings.

Meta-Analysis in Psychotherapy Research
In essence, meta-analysis is a quantitative review method that selects similar studies from the literature (e.g., all studies using psychological treatments for "fear of flying") and then extracts the same information about mean change and variability of change from each study.
1. Meta-analysis combines individual study results to reveal meaningful effects.
2. Critical considerations include publication bias and retrieval bias.
3. It also requires clear definitions of target populations and randomization.
4. Meta-analysis must handle confounds and ensure treatment integrity.

Publication Bias
Studies with positive results are more likely to be published. This bias can distort the overall perception of treatment effectiveness and is a significant concern in meta-analysis. Strategies to address publication bias are essential for accurate conclusions.

Retrieval Bias
Researchers may ignore or criticize studies not supporting their hypotheses. This bias affects the selection of studies included in meta-analyses and can lead to an incomplete or biased view of the evidence. Awareness of retrieval bias is crucial for interpreting meta-analytic results.

Study Comparability
1. Meta-analysis requires studies with similar participant demographics.
2. It also requires studies with similar problem severity levels.
3. Ensuring comparability strengthens the validity of meta-analytic findings.
4. Meta-analysts must carefully select studies that meet these criteria.

RECAP: Therapy Outcome Research Based on Groups
1) Assessment of evidence: The quality and quantity of studies using similar treatments for comparable problems gauge the trustworthiness of evidence.
High-quality studies are expensive and laborious to conduct.
2) Single-group, pre-post treatment design: This design assesses patients before and after treatment. However, it cannot rule out alternative explanations for improvement, such as seasonal changes or the introduction of new medications.
3) Randomized controlled trial (RCT): This study design involves randomly assigning patients to at least two groups: an active treatment group and a control group. This design helps control for factors like expectancy effects and confounding treatments.
4) Placebo effect: Patients can show improvement simply by believing they are in active treatment. This belief is a potent component of successful therapies.
5) Blinding in psychotherapy research: Double-blinding, a standard in drug studies, is not feasible in psychotherapy research. Patients and therapists cannot be blind to the treatment being administered.
6) Treatment specificity: Researchers seek to determine whether a treatment works for the reasons it is presumed to work. For example, in cognitive therapy, researchers must show that changes in cognition lead to mood improvement.
7) Statistical analysis: Researchers use statistical tests to determine the significance of change in psychotherapy. However, statistical significance does not always translate to clinical significance.
8) Quality of evidence: Evidence is categorized into levels based on study design and statistical power, ranging from anecdotal reports to randomized controlled trials.
9) Meta-analysis: This method aggregates (combines) data from multiple studies to draw conclusions. It helps overcome limitations of individual studies but has its own set of limitations, such as publication bias and retrieval bias.
10) Considerations for meta-analysis: Factors such as publication bias, randomization, blinding, drop-out analysis, and treatment integrity must be carefully considered in meta-analysis to arrive at meaningful conclusions.
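The pooling step at the heart of meta-analysis can be sketched as an inverse-variance weighted average of per-study effect sizes (the fixed-effect model, one of several pooling approaches). All effect sizes and variances below are invented purely for illustration:

```python
def pooled_effect(effects, variances):
    """Fixed-effect (inverse-variance) pooling of per-study effect sizes.
    Precise studies (small variance) get large weights; returns the
    weighted mean effect and its standard error."""
    weights = [1 / v for v in variances]
    w_sum = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / w_sum
    se = (1 / w_sum) ** 0.5
    return pooled, se

# Hypothetical standardized mean differences (Cohen's d) from three
# fear-of-flying treatment studies, with their sampling variances
effects = [0.50, 0.30, 0.80]
variances = [0.04, 0.02, 0.10]
d, se = pooled_effect(effects, variances)
print(round(d, 3), round(se, 3))  # prints 0.418 0.108
```

Note that this arithmetic is only as good as the studies fed into it: publication bias and retrieval bias distort which effect sizes enter the list in the first place, which is why the considerations above matter.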
