Replication Studies in Psychology
Summary
This document discusses the replication crisis in psychology: the difficulty of repeating published findings in different labs using the same or similar methods. It examines the causes of the crisis and the solutions that have been proposed.
Full Transcript
PSYCH 306 Research Methods in Psychology: The Replication Crisis
https://today.ucsd.edu/story/a-new-replication-crisis-research-that-is-less-likely-be-true-is-cited-more

The Replication Crisis
Replicability: whether a published study's findings can be repeated in a different lab, using the same or similar research methods.
The crisis: a failure of published research findings to be repeated in other labs that followed the same or similar research methods.
The gold standard: a scientific finding's replicability is often considered the best possible evidence for the accuracy of the finding.

Reproducibility
Replicability is not the same as reproducibility: the ability of a different researcher to reproduce another researcher's published analyses, given the original dataset and the computer code for the statistics used.
According to a U.S. National Science Foundation (NSF) subcommittee: "reproducibility refers to the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results...."
Goodman, S. N., Fanelli, D., & Ioannidis, J. P. (2016). What does research reproducibility mean? Science Translational Medicine, 8(341), 341ps12.

Testing a previous finding:
- Replicability: new researcher, new data
- Reproducibility: new researcher, same (previous) data
Replicability therefore requires more effort and time than reproducibility.
Essawy et al. (2020), Environmental Modelling & Software

Why can reproducibility fail?
Two possible causes:
- Process reproducibility failure: the original analysis cannot be repeated because of the unavailability of data, code, information needed to recreate the code, or necessary software or tools.
- Outcome reproducibility failure: the reanalysis obtains a different result than the one reported originally. This can occur because of an error in either the original study or the reproduction study.

Artner and colleagues (2021) reanalyzed psychological research from 2012 journal articles (Emotion; Experimental & Clinical Psychopharmacology; Journal of Abnormal Psychology; Psychology & Aging), 232 findings in total.
Their method: attempt to reproduce the statistical outcomes, starting from the raw data provided by the authors and closely following the Method section of each article. (Many scientific journals today require authors to upload their data, often with analysis code, to an accessible internet address prior to publication.)
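When the raw data are shared, a reproduction attempt of the kind Artner and colleagues describe can be sketched in a few lines. The following is a minimal illustration, not their actual pipeline; the file name, column names, test, and reported value are all hypothetical placeholders.

```python
# Minimal outcome-reproducibility check: re-run the reported analysis on
# the authors' shared raw data and compare against the published value.
# The file, columns, and REPORTED_T below are hypothetical placeholders.
import pandas as pd
from scipy import stats

df = pd.read_csv("shared_raw_data.csv")  # data deposited by the original authors
treatment = df.loc[df["condition"] == "treatment", "score"]
control = df.loc[df["condition"] == "control", "score"]

t, p = stats.ttest_ind(treatment, control)  # the test named in the Method section

REPORTED_T = 2.41  # t-value printed in the original article (hypothetical)
if abs(t - REPORTED_T) < 0.01:
    print(f"Reproduced: t = {t:.2f}, p = {p:.3f}")
else:
    print(f"Outcome reproducibility failure: got t = {t:.2f}, reported t = {REPORTED_T}")
```

If the script cannot be run at all because the data, code, or required software are unavailable, that is a process reproducibility failure instead.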
Artner et al.'s (2021) reproducibility outcomes:
- Only 70% of the findings could be reproduced.
- 18 of those were reproduced only after deviating from the analysis reported in the original paper (a process reproducibility failure).
- 13 findings that claimed to reach significance generated non-significant outcomes on reanalysis (an outcome reproducibility failure).
Conclusion: authors are not providing enough information about the methods of data collection, the analysis steps, or both.

Replication failures: the Open Science Collaboration (Science, 2015)
100 published studies were repeated in different labs; on average, only 36% of the replications reproduced the major findings.
Replication rates by research area:
- Cognitive psychology: 50%
- Social psychology: 26%

Replication Crisis in Other Science Domains
Baker (Nature, 2016) surveyed 1,500 scientists. 70% reported having failed to replicate the results of others' studies:
- 87% of chemists
- 69% of physicists and engineers
- 77% of biologists
- 64% of environmental and earth scientists
- 67% of medical researchers
- 62% of all other respondents
50% had failed to replicate one of their own experiments.
[Figure: Baker (2016) survey responses to "Is there a replication crisis?" and "Have you failed to replicate a published experiment?" (% failure rate by field).]

Factors that scientists identified as contributing to replication failures:
- publication pressures
- insufficient data / analyses
- insufficient research methods

Have you established procedures to enhance replicability? Common methods:
- better documentation of the research methods used
- running the study again
- asking a lab member to replicate the study
How do psychology experiments differ? The use of human participants can introduce significant variability.

Why the Replication Crisis Continues
Serra-Garcia et al. (2021, Science) examined psychology publications that had been included in replication studies and compared the total citation counts of replicated versus non-replicated findings: non-replicated findings have more citations than replicated findings.
[Figure: total number of citations (cumulative probability).]
Non-replicated publications continue to receive more annual citations than replicated studies, even after the failure to replicate is published.
[Figure: yearly citations plotted against the year the replication study was published.]

Sample Failures to Replicate: Social Psych (summarized by ChatGPT)
1. The Stanford Prison Experiment: Philip Zimbardo's Stanford Prison Experiment has been widely criticized on ethical grounds and has faced skepticism regarding the generalizability of its findings. Subsequent replications and re-evaluations have raised doubts about the validity of the original study.
2. The Bystander Effect: the finding that individuals are less likely to help in emergencies when others are present has shown mixed replication results; situational and contextual factors may play a significant role in whether bystanders intervene.
3. Stereotype Threat: the finding that individuals from stigmatized groups underperform due to fear of confirming stereotypes has shown mixed replication results. Some studies support the phenomenon; others have failed to find consistent effects, particularly in real-world settings.
Sample Failures to Replicate: Cognitive Psych (summarized by ChatGPT)
1. Spotlight Attention: the concept of a "spotlight" of attention focused on a specific area of the visual field has faced challenges in replication; some studies find that attention is more distributed and flexible than originally theorized.
2. The Dual-Process Model of Memory: the model posits two separate memory systems, one implicit (unconscious) and one explicit (conscious). While influential, it has faced scrutiny, and some researchers have proposed alternative models of memory.
3. The Mirror Neuron System: believed to play a role in understanding and imitating the actions of others, it has faced challenges in replication; some replication studies report inconsistencies in the neural responses associated with mirror neurons.

Sample Failures to Replicate: Developmental Psych (summarized by ChatGPT)
1. The "Critical Period" Hypothesis for Language Acquisition: the idea that there is a specific window during which language learning is optimal has faced challenges; some studies report individual variation and exceptions to the hypothesis.
2. Attachment Theory and Attachment Types: attachment theory, developed by John Bowlby, describes different attachment styles in children (e.g., secure, anxious, avoidant). Some replication studies suggest that attachment styles are more fluid and context-dependent than originally proposed.
3. The Mozart Effect on Infant Intelligence: the claim that listening to classical music, particularly Mozart, can boost children's cognitive abilities has faced replication challenges; subsequent studies have not consistently found long-term cognitive enhancement.

Sample Failures to Replicate: Clinical Psych (summarized by ChatGPT)
1. The Dodo Bird Verdict (Rosenzweig, 1930s): the idea that all psychotherapies are equally effective, and that it is the therapeutic relationship (the "common factors") that accounts for positive outcomes; "Everybody has won and all must have prizes" (Alice in Wonderland). In fact, specific therapeutic approaches, such as cognitive-behavioral therapy or psychodynamic therapy, can have different effectiveness for certain conditions.
2. The "Power of Positive Thinking" in Health Outcomes: the idea that a positive attitude and optimistic thinking lead to improved physical health outcomes has faced challenges; replication studies have not always produced consistent results.
3. The Clinical Efficacy of Memory Recovery Techniques: techniques such as hypnosis and guided imagery have been used in therapy to help individuals recall repressed memories, but replication studies have shown that the accuracy of such recovered memories can be unreliable.

Sample Failures to Replicate: Neuroscience (summarized by ChatGPT)
1. The Amygdala's Role in Fear Processing: while the amygdala is traditionally associated with fear processing and emotional responses, some replication studies report variation in the amygdala's involvement in fear-related tasks.
2. The Left Brain-Right Brain Distinction: the notion that the left hemisphere is primarily responsible for logical and analytical thinking while the right hemisphere handles creativity and emotion has faced challenges; neuroscience research shows that both hemispheres are involved in a wide range of cognitive functions.
3. The "Brain-Training" Effect: some studies suggested that cognitive training exercises can improve cognitive function and memory in older adults, but replication efforts have produced mixed results, with some studies failing to replicate the benefits of brain-training programs.

Causes of the Replication Crisis
1) Ignoring or misunderstanding statistics (the items shown in red on the slides are covered today):
- Misunderstanding (A) null hypotheses (Meehl, 1978, 1990, 1997) and (B) the meaning of p-values (Cohen, 1992)
- Small sample sizes (Button et al., 2013)
- Effect size and power (Cohen, 1969, 1988, 1992; Kühberger, 2014)
2) Publication bias: the way we conduct, publish, distribute, and fund our science (Ioannidis, 2012)
3) Falsifying data:
- Diederik Stapel (2011), psychology professor at Tilburg University (Netherlands): 55 cases of fabricated data in social psychology
- Marc Hauser (2012), psychology professor at Harvard: accused of faking results on morality and cognition
- Karen Ruggiero (2001), psychology professor at the University of Texas: accused of faking results in discrimination research
4) Quality of replication: failure to follow the original procedures

1A: Poor Hypothesis Practices
HARKing (Hypothesizing After the Results are Known): formulating or changing hypotheses after analyzing the data. Researchers may first explore their data without a specific hypothesis and then generate a hypothesis based on what they find. This can lead to confirmation bias, because it gives the appearance that the results were predicted when they were actually discovered post hoc.
Note: HARKing runs against the basic research strategy of showing that the IV causes (precedes) the change in the DV. Example: IV1 does not generate a change in the DV. BAD: look for another variable (IV2) that does result in a change in the DV, and propose after the fact that IV2 produced the change.

Assigned reading: Hollenbeck et al. (2017), "Harking, Sharking, and Tharking: Making the Case for Post Hoc Analysis of Scientific Data."
SHARKing: secretly HARKing in the Introduction section of a scientific report: "publicly presenting in the Introduction section of an article hypotheses that emerged from post hoc analyses and treating them as if they were a priori." Instead of basing hypotheses on the results of all existing studies (standard practice), the researcher hypothesizes after knowing the results from the data at hand. Never justified in science.

THARKing: transparently (openly) HARKing in the Discussion section: "clearly and transparently presenting new hypotheses that were derived from post hoc results in the Discussion section of an article." THARKing can promote the effectiveness and efficiency of science, and is ethically required in some cases. THARKing is justified in science, according to the authors. Why?
(Read the highlighted sections of the article: Hollenbeck et al., 2017.)

1A: Poor Hypothesis Practices
Two case studies (Hollenbeck, 2017):
1) A researcher desperate to get a job takes 30 of the shortest and most easily obtained survey measures and creates a pair of long questionnaires. They run a new survey, find some significant correlations, and publish them as a priori hypotheses. No one can replicate the findings.
2) Epidemiologists test 100 patients on a new drug that protects against a virus. The correlation between treatment (drug) and survival rate is r = .1 (small). Some researchers notice that females react differently to the drug than males. They re-evaluate the findings by participants' peak-estrogen age and publish a short report, labeled as a post hoc analysis. Others replicate the findings.

1B: Meaning of p-values
Null Hypothesis Significance Testing (NHST) rests on the assumption that the null hypothesis means there is no difference between the groups or conditions being compared (e.g., Condition A yields the same outcome as Condition B).
But with a large enough N, virtually every study would yield a significant result:
- "It is highly unlikely that any psychologically discriminable stimulation which we apply to an experimental subject would exert literally zero effect upon any aspect of their performance." (Meehl)
- "It is foolish to ask 'Are the effects of A and B different?' They are always different—for some decimal place." (Tukey)
Example: mean = .0561 vs. mean = .0562, as in the simulation sketch below.
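A minimal simulation of Meehl's and Tukey's point, assuming two conditions whose true means differ only in the fourth decimal place (the .0561 vs. .0562 example above). With a million observations per group, the difference is reliably "significant" despite being practically meaningless; all numbers besides the two means are illustrative.

```python
# With a large enough N, a trivially small true difference yields p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000_000  # observations per condition

a = rng.normal(loc=0.0561, scale=0.01, size=n)  # Condition A
b = rng.normal(loc=0.0562, scale=0.01, size=n)  # Condition B: mean shifted by 0.0001

t, p = stats.ttest_ind(a, b)
print(f"t = {t:.2f}, p = {p:.2g}")  # significant, yet practically meaningless
```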
1B: Meaning of p-values (continued)
P-hacking: the unethical and questionable practice of manipulating, or "hacking", statistical analyses in order to achieve statistically significant results. Researchers may try various techniques (multiple outcome measures, multiple analyses, or selectively excluding data points) until a statistically significant result is obtained. This can lead to false-positive findings and a misrepresentation of the true state of affairs.
Questionable p-hacking examples (a simulation of example 2 appears below):
1. Stop collecting data as soon as p < .05
2. Analyze many measures, but report only those with p < .05
3. Collect and analyze many conditions, but report only those with p < .05
4. Add covariates to reach p < .05
5. Exclude participants to reach p < .05
6. Transform the data to reach p < .05

Example: compare three study conditions (read; listen to a recording; take notes) on a memory test.
Original hypothesis: taking notes outperforms reading or listening to a recording. The three sets of memory scores yield a p-value that is not < .05 (no significant difference across the three conditions).
- BAD: the researcher decides post hoc that perhaps taking notes > listening to a recording, and tests just two of the three means.
- BETTER: the researcher decides in advance (based on the hypothesis) what the order of conditions should be, and tests only those comparisons.

Same example, with outliers:
- BAD: the researcher notes extreme outliers (a 0 value on the memory score) in the "taking notes" condition and removes them from that condition only.
- BETTER: define outliers in advance and remove them from all conditions.
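A minimal simulation of example 2 from the list above (analyze many measures, report only the significant ones). Every null hypothesis here is true, yet reporting whichever of six measures happens to reach p < .05 produces a "significant" result in roughly a quarter of experiments. All numbers are illustrative.

```python
# P-hacking via multiple outcome measures: the nominal 5% false-positive
# rate climbs to about 1 - 0.95**6 = 0.26 when only the best p is reported.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_measures, n_per_group = 2_000, 6, 30

false_positives = 0
for _ in range(n_experiments):
    # Two groups, six outcome measures, no true effect on any of them.
    p_values = [
        stats.ttest_ind(rng.normal(size=n_per_group),
                        rng.normal(size=n_per_group)).pvalue
        for _ in range(n_measures)
    ]
    if min(p_values) < 0.05:  # report only the "significant" measure
        false_positives += 1

print(f"False-positive rate: {false_positives / n_experiments:.2f}")  # ~0.26
```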
1B: Other Poor Data Practices
Cherry-Picking Data: researchers may selectively report only the data or results that support their hypotheses, while disregarding data that contradict the expected outcomes. This practice can lead to a biased representation of the research findings.
[Figure: "Global Warming Temperatures" vs. "Global Warming - Selective Data".]

Data Fabrication and Falsification: intentionally creating or altering research data to support desired outcomes. This is a clear violation of research ethics and can have severe consequences, including academic and professional repercussions.
Example: a study of grammar learning in cotton-top tamarins (Hauser, 2002). A graph contained fabricated data on the monkeys' responses to the "same grammar" and "different grammar" conditions on which they were trained; the videotaped responses contained no evidence for some of the plotted bars (the monkeys were not given those conditions). The journal article was retracted in 2010 following allegations of misconduct; the author claimed oversight and resigned from his position.
Does the competitive academic environment encourage poor practices?

2: Publication Biases
1. The File Drawer Problem: studies with non-significant or null results are less likely to be published or reported than studies with positive or significant findings (a simulation of this bias appears after this list).
One solution: publish meta-analyses that combine data from all studies on a topic, including outcomes that show no effect, whether published or not.
2. Selective Reporting: researchers and academic institutions tend to favor the publication of studies that reveal a significant effect or a novel discovery, leading to an imbalance in the scientific literature.
One solution: shift the balance toward journals that agree to publish null results when the research methods are judged in advance to be strong.
3. Incomplete Knowledge: when non-significant results are not published or made publicly available, the scientific community may hold an incomplete or biased view of a research question, because only a portion of the available data is accessible.
One solution: larger sample sizes, and reducing problems of internal validity and confounding variables.
4. Replication Challenges: researchers attempting to replicate published findings may encounter difficulties if unpublished studies with null results sit in the "file drawer" and have not been considered in the replication effort.
One solution: open science practices (registration of research studies, publication of null findings) have been advocated to improve the transparency and reliability of scientific research.
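A minimal simulation of the file drawer problem from item 1 above: when only studies reaching p < .05 are published, the published literature overstates a small true effect. The effect size, sample size, and study count are illustrative.

```python
# File-drawer bias: publishing only significant studies inflates the
# average published effect well above the true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_effect, n_per_group, n_studies = 0.2, 30, 500

published = []
for _ in range(n_studies):
    a = rng.normal(true_effect, 1.0, n_per_group)  # treatment group
    b = rng.normal(0.0, 1.0, n_per_group)          # control group
    t, p = stats.ttest_ind(a, b)
    if p < 0.05:                                   # only these get published
        published.append(a.mean() - b.mean())      # observed effect (sd = 1)

print(f"True effect: {true_effect}")
print(f"Mean published effect: {np.mean(published):.2f}")  # inflated, roughly 0.5-0.6
```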
Proposed Solutions for the Replication Crisis
1) Pre-registration of research methods: a detailed plan for the research methods of a given study, filed online (openly) ahead of data collection. The plan states your hypotheses, the specific methodologies you will use, and the analyses you will conduct. Once filed, these are set in stone (unchangeable).
2) Registered report: a detailed plan for research methods, filed online, that undergoes peer review prior to data collection. High-quality protocols for research methods are then provisionally accepted for publication if the authors follow through with the registered methodology.

1) Pre-Registration of Research Methods
- There is no review prior to data collection.
- With pre-registration, you get a DOI (Digital Object Identifier) that you can refer to in the final paper. A DOI is a unique and permanent string of letters and numbers that identifies each article filed online; every published (and some unpublished) articles have one.
- You can also pre-register a study that uses existing or secondary data.
Where to register: templates are available online for each scientific field, e.g., https://www.psycharchives.org/en

The Psychological Research Preregistration-Quantitative (PRP-QUANT) template (CC BY 4.0; psycharchives.org). Its main sections are similar to an APA-style research paper (title page; introduction; method: sampling; IVs, DVs, design; materials; procedure):
- Title and title page: title; contributors, affiliations, and persistent IDs; date of preregistration; estimated duration of the project; IRB (Institutional Review Board / Research Ethics Board) status; conflict-of-interest statement; data accessibility statement and planned repository
- Introduction (no word limit): theoretical background; objectives and research question(s); hypotheses (H1, H2, ...)
- Method: time point of registration; use of pre-existing data (re-analysis or secondary data analysis); sampling (participants; sampling procedure and data collection; sample size, power, and precision; participant recruitment, selection, and compensation; how participant drop-out will be handled; masking of participants and researchers); planned analyses (data cleaning and screening; how missing data will be handled); conditions and design (type of study and study design; randomization of participants and/or experimental materials; measured variables, manipulated variables, covariates); study materials; study procedures

2) Registered Reports
Registered report = peer review of the research methods prior to data collection. High-quality protocols for research methods are then provisionally accepted for publication if the authors follow through with the registered methodology.
In a registered report, reviewers evaluate the research methods themselves (Stage 1) and later whether the authors used the methods they promised (Stage 2); the traditional review of contents is, in effect, Stage 2 peer review only, when it is too late for the authors to adjust their research methods.
https://www.cos.io/initiatives/registered-reports

Registered reports are on the rise.
[Figure: growth in registered-report adoption (Open Science Framework).]

The registered-report pipeline (Chambers & Tzavella, 2022): Stage 1 review → In-Principle Acceptance → data collection → Stage 2 review → final paper accepted.

Pre-registered & Registered Reports: Differences
- Pre-registered: no Stage 1 review; no In-Principle Acceptance.
- Registered: Stage 1 review; In-Principle Acceptance.
- Many more studies are pre-registered than registered.
https://www.aje.com/arc/pre-registration-vs-registered-reports/

Will Registration Solve the Crisis? As of now:
- Pre-registration: evidence suggests that it alone does not improve replication rates (Szollosi et al., 2020), though it seems to improve perceptions of research quality.
- Registered reports: there is evidence that RRs improve replication rates; RRs are more likely to be reproducible (on reanalysis) and more likely to fail to reject the null hypothesis than regular articles. They reduce, but do not solve, publication bias (Chambers & Tzavella, 2022).

Szollosi, A., Kellen, D., Navarro, D. J., Shiffrin, R., van Rooij, I., Van Zandt, T., & Donkin, C. (2020). Is preregistration worthwhile? Trends in Cognitive Sciences, 24(2), 94-95.
Chambers, C. D., & Tzavella, L. (2022). The past, present and future of registered reports. Nature Human Behaviour, 6(1), 29-42.

Role of Social / Cultural Norms in Replication
People change slowly; scientific policies must change too. (Nosek et al., 2022, Annual Review of Psychology)

Community, Policy, Structural Change (Korbmacher et al., 2023, Science)
- Community: big-team science; open scholarship
- Policy: systematic reviews; meta-analyses; statistical assessments

Role of Meta-Analyses
Definition: the statistical combination of the results of multiple studies addressing a similar research question (a worked sketch of the pooling arithmetic appears at the end of this transcript).
Rationale: a single study can never definitively answer a question; meta-analysis can provide a solution. Meta-analyses are built on multiple replications across several studies.
[Figure: data-inclusion process for a meta-analysis (Schmidt et al., 2016, Archives of Scientific Psychology).]

Role of Systematic Reviews
Key components:
- a clearly stated topic with pre-defined eligibility criteria for inclusion of studies
- a systematic search to identify all studies that meet those eligibility criteria
- an assessment of the validity of the findings of the included studies
Distinction from meta-analyses: systematic reviews involve no combined statistical analysis.
https://kib.ki.se/en/search-evaluate/systematic-reviews

Next time: research methods with open-source and web-based data
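As noted in the meta-analysis section above, here is a worked sketch of the "statistical combination" a meta-analysis performs, using fixed-effect, inverse-variance pooling (one common approach among several). The per-study effect sizes and standard errors are invented for illustration.

```python
# Fixed-effect, inverse-variance meta-analysis: pool per-study effects
# into one estimate, weighting each study by its precision (1 / SE^2).
import numpy as np

effects = np.array([0.42, 0.10, 0.35, -0.05, 0.28])  # hypothetical effect sizes
ses = np.array([0.20, 0.15, 0.25, 0.30, 0.10])       # hypothetical standard errors

weights = 1 / ses**2
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))

print(f"Pooled effect = {pooled:.2f}, 95% CI = [{pooled - 1.96 * pooled_se:.2f}, "
      f"{pooled + 1.96 * pooled_se:.2f}]")
```

Note that the most precise study (smallest SE) dominates the pooled estimate; this is how a meta-analysis built on multiple replications can address a question that no single study settles.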