HSCI 190 Module 06: Correlation, Regression & Health Sciences PDF
Queen's University
Summary
This document is a companion guide for module 6 of the Introduction to Statistics for the Health Sciences course (HSCI 190) at Queen's University. It covers topics including correlation, regression, and their applications in healthcare. The document outlines key concepts, provides examples, and includes questions for further understanding.
Full Transcript
HSCI 190
INTRODUCTION TO STATISTICS FOR THE HEALTH SCIENCES
MODULE 06
CORRELATION, REGRESSION, AND OTHER STATISTICAL APPLICATIONS IN HEALTH SCIENCES

Please note: This course was designed to be interacted and engaged with using the online modules. This Module Companion Guide is a resource created to complement the online slides. If there is a discrepancy between this guide and the online module, please refer to the module.

How can you help protect the integrity and quality of your Queen’s University course? Do not distribute this Module Companion Guide to any students who are not enrolled in HSCI 190 as it is a direct violation of the Academic Integrity Policy of Queen’s University. Students found in violation can face sanctions. For more information, please visit https://www.queensu.ca/academic-calendar/health-sciences/bhsc/.

MODULE 06 COMPANION GUIDE HSCI 190

TABLE OF CONTENTS

INTRODUCTION ... 5
  Video: Introduction to Module 06 ... 5
  Module Learning Outcomes ... 5
  Module 06 Assessments ... 5
    Module Homework ... 5
    Module Quiz ... 6
    Final Exam ... 6
  Course Icons ... 6
  Module Outline ... 7
SECTION 01: Correlation ... 8
  Introduction to Correlation ... 8
  Scatter Plots ... 8
  Scatter Plots and the Line of Best Fit ... 9
  Key Correlation Concepts ... 10
  Scatter Plots & Correlations ... 10
  Question: Scatter Plots & Correlations ... 12
  Types of Correlation Analyses ... 13
  Pearson Correlation ... 13
  Linearity ... 14
  Homoscedasticity ... 15
  Calculating the Pearson Correlation ... 16
  Interpreting Correlation Analyses ... 17
  Guidelines for Pearson Correlation Coefficients ... 18
  Significance Values ... 19
  Video: Sample Problem: Pearson Correlation ... 19
  Spearman’s Correlation ... 20
  Calculating & Interpreting Spearman’s Correlation ... 20
  Video: Sample Problem: Spearman’s Correlation ... 21
  Intraclass Correlation Coefficient (ICC) ... 21
  Calculating & Interpreting ICCs ... 23
  Video: Sample Problem: ICCs ... 24
  Factors Impacting Correlations ... 24
  Question: Correlation versus Causation ... 25
  Clinical Importance of Correlations ... 26
  Section 01: Summary ... 27
SECTION 02: Introduction to Regression ... 28
  Introduction to Regression ... 28
  Response & Explanatory Variables ... 28
  Question: Response & Predictor Variables ... 28
  Plotting Your Data ... 29
  Linear Relationship ... 30
  Combining Concepts ... 31
  Residuals & The Regression Equation ... 32
  Least Squares Method ... 33
  Calculating Y Intercept & Slope ... 34
  Regression Assumptions ... 35
  Question: Regression versus Pearson Correlation ... 35
  Additional Notes: Least Squares Method ... 36
  Conducting and Interpreting Regression Analyses ... 37
  Interpreting the Coefficient of Determination ... 38
  Video: Sample Problem: Simple Linear Regression ... 39
  Prediction & Extrapolation ... 40
  Advanced Statistical Techniques Using Regression ... 40
  Regression in Healthcare: Ottawa Ankle Rules ... 40
  Video: Machine Learning in Healthcare ... 41
  Video: Surveillance Studies ... 42
  Section 02: Summary ... 43
SECTION 03: Statistical Considerations for the Health Sciences ... 44
  Introduction to Statistical Considerations for the Health Sciences ... 44
  Question: Study Design & Methods ... 44
  Feedback: Study Design & Statistics ... 44
  Question: Sample Size ... 45
  Feedback: Sample Size ... 45
  Data Quality & Machine Learning ... 46
  Meaning of Significance ... 47
  Statistical versus Clinical versus Biological Significance ... 47
  Example: Statistical versus Clinical Significance ... 48
  Perspectives on Statistical Significance ... 49
  Data Reproducibility ... 49
  Video: The Replication Crisis ... 50
  The Reproducibility Project: Cancer Biology ... 50
  Question: The Reproducibility Crisis ... 51
  Causes of the Replication Crisis ... 51
  Replication Crisis: Solutions ... 53
  Replication Crisis: Take Home Points ... 55
  Section 03: Conclusion ... 55
CONCLUSION ... 56
  Module 06: Conclusion ... 56
  Video: Course Conclusion ... 56
  Module Complete ... 56
  Credits ... 56

INTRODUCTION

Please see the online learning module for the full experience of interactions within this document.

VIDEO: INTRODUCTION TO MODULE 06
This content was retrieved from Introduction Slide 1 of 5 of the online learning module.
In Module 04, you learned how to use t tests to compare the means of two groups and chi square tests to compare the proportions between two groups. In Module 05, you learned how to use ANOVAs to compare the means of more than two groups. Module 06 builds on your knowledge of inferential statistics and introduces you to two final statistical tests: correlations and regressions. This module also discusses some final statistical considerations such as data reproducibility.

Watch the video for an introduction to Module 06 from Dr. Natalie Wagner. (0:52)

Page Link:
https://player.vimeo.com/video/527357958

MODULE LEARNING OUTCOMES
This content was retrieved from Introduction Slide 2 of 5 of the online learning module.

By the end of Module 06, students will be able to:
1. Explain the assumptions of a Pearson correlation, Spearman Rank correlation, and Intraclass correlations to apply the correct statistical test.
2. Describe linear regression and its relationship with correlation analyses to understand advanced statistical applications in health sciences.
3. Discuss data reproducibility and the difference between clinical, biological, and statistical significance, and why they are important in health sciences research.

MODULE 06 ASSESSMENTS
This content was retrieved from Introduction Slide 3 of 5 of the online learning module.

These assessments must be completed as part of, or include information from, Module 06. View details.
Module Homework - Refer to pages 5-6
Module Quiz - Refer to page 6
Final Exam - Refer to page 6

MODULE HOMEWORK
Subpage of Introduction Slide 3 of 5 – Module Homework 1/1

In each module, you will be presented with a series of homework questions. You must submit your responses prior to the module tutorial. The homework questions will be reviewed as a group; you will then have the chance to update your homework and resubmit.
For more details about the module homework, visit the assessment page in your online learning environment.

MODULE QUIZ
Subpage of Introduction Slide 3 of 5 – Module Quiz 1/1

You will complete six quizzes throughout the course, one at the end of each module. The quizzes will consist of multiple choice and short answer questions designed to test your comprehension of module content. For more details about the module quizzes, visit the assessment page in your online learning environment.

FINAL EXAM
Subpage of Introduction Slide 3 of 5 – Final Exam 1/1

Throughout the course, all modules and assignments build on each other. Material in the modules, homework questions, and multimedia content is testable unless otherwise noted. The final exam will consist of multiple choice and short answer questions.

COURSE ICONS
This content was retrieved from Introduction Slide 4 of 5 of the online learning module.

As you navigate the HSCI 190 modules, watch for these course icons. Learn about each icon’s function in the course.

Listen Up!
This icon indicates the presence of an audio clip on the slide from your instructor or other content experts. To play the audio clip, click the play button. Full transcripts and closed captions are available.

Calculator
This icon lives in the sidebar of your module. Clicking this icon will reveal relevant equations.

Additional Information
Clicking this icon will present additional facts, information, and resources to aid in your studying and recall of the material. This content will not be tested.

Reference
This icon lives in the sidebar of the slide. Clicking it will reveal the references for content and/or images on the slide.

MODULE OUTLINE
This content was retrieved from Introduction Slide 5 of 5 of the online learning module.
Section 01: Correlation – Refer to page 8
Section 02: Introduction to Regression – Refer to page 28
Section 03: Statistical Considerations for the Health Sciences – Refer to page 44

End of Introduction

SECTION 01: CORRELATION

INTRODUCTION TO CORRELATION
This content was retrieved from Section 01 Slide 2 of 26 of the online learning module.

Correlation refers to the relatedness between two continuous variables. Correlations are often used in health sciences to examine how variables relate to one another, for example how exercise is related to vascular calcification, or how antidepressant dose is related to mood state. In this section, you will learn about correlation analyses and how they can be used to explore the relationship between two variables.

References:
GraphPad Software, L. (n.d.). GraphPad PRISM 9 Statistics guide - key concepts: Correlation. Retrieved February 2021, from: https://www.graphpad.com/guides/prism/latest/statistics/stat_key_concepts_correlation.htm
Mukaka, M. M. (2012). Statistics corner: A guide to appropriate use of correlation coefficient in medical research. Malawi Medical Journal: the journal of Medical Association of Malawi, 24(3), 69-71.

SCATTER PLOTS
This content was retrieved from Section 01 Slide 3 of 26 of the online learning module.

Recall from Module 02 the discussion surrounding scatter plots*. Scatter plots are one of the most useful techniques for gaining insight into the relationship between two variables, for example the relationship between the weight of a chemical compound and the number of experimental days. Review the components of a scatter plot.

Independent Variable: In a scatter plot, the independent variable (i.e. what is being manipulated) is presented on the x axis.
Dependent Variable: The dependent variable (i.e. the outcome measurement) is presented on the y axis.
Unit of Observation: Every experimental subject, unit, or observation in the study is represented as a dot or point in 2D space in the centre of the figure.

Definition*: Scatter plot: A graph where the values are plotted as dots along two axes.

SCATTER PLOTS AND THE LINE OF BEST FIT
This content was retrieved from Section 01 Slide 4 of 26 of the online learning module.

In Module 02 you also learned about the line of best fit, which can be superimposed on a scatter plot to show the general pattern of the relationship between the dependent and independent variables. The line of best fit is also known as the regression line. In effect, correlation measures the degree to which the data points cluster around the regression line.

KEY CORRELATION CONCEPTS
This content was retrieved from Section 01 Slide 5 of 26 of the online learning module.

Correlation analyses use correlation coefficients* to measure the direction and magnitude of a relationship between an independent variable and a dependent variable. Review the key principles of correlation coefficients.

Magnitude of Association: Correlation coefficients range on a scale of -1 to +1. A correlation coefficient of -1 or +1 means a perfect linear association between the two variables; a correlation coefficient of 0 means no linear association.
Direction of Association: Whether a correlation coefficient is positive or negative indicates the directionality of the relationship (i.e. if it is a positive or negative correlation). The sign says nothing about the strength of the association.

Definition*: Correlation coefficient: A measure of relatedness.

SCATTER PLOTS & CORRELATIONS
This content was retrieved from Section 01 Slide 6 of 26 of the online learning module.

Correlation coefficients are a numeric representation of what you see in a scatterplot.
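To make the link between a scatter plot and its coefficient concrete, here is a minimal sketch in Python. It assumes the Pearson product-moment form of the coefficient, which this section introduces later; the function name `pearson_r` and all data values are hypothetical and are not taken from the course.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of each point's deviations from the two means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations for each variable.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (not from the course): experimental day on the x axis.
days = [1, 2, 3, 4, 5]
perfect_negative = [10, 8, 6, 4, 2]  # every point lies on a falling line
perfect_positive = [2, 4, 6, 8, 10]  # every point lies on a rising line
noisy_positive = [2, 5, 3, 8, 7]     # rising trend with scatter around the line

r_neg = pearson_r(days, perfect_negative)
r_pos = pearson_r(days, perfect_positive)
r_noisy = pearson_r(days, noisy_positive)
print(round(r_neg, 2), round(r_pos, 2), round(r_noisy, 2))  # prints -1.0 1.0 0.81
```

The sign of each result matches the direction of the trend, and the magnitude drops below 1 as soon as the points stop falling exactly on a line.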
View a series of scatter plots and their correlation coefficients. Focus on how the direction and magnitude of the correlation coefficient change with the scatterplots.

[Figure: six scatter plots with their correlation coefficients]
Negative: Correlation Coefficient = -1.0
Moderate Negative: Correlation Coefficient = -0.43
Zero: Correlation Coefficient = 0
Strong Positive + Outlier: Correlation Coefficient = 0.71
Strong Positive: Correlation Coefficient = 0.80
Maximum Positive: Correlation Coefficient = 1.0

QUESTION: SCATTER PLOTS & CORRELATIONS
This content was retrieved from Section 01 Slide 7 of 26 of the online learning module.

Answer the question using what you have learned about scatter plots and correlations.

Question: Based on the figure shown on the slide, which description and correlation coefficient are most correct?
Zero, 0
Moderate Negative, -0.43
Strong Positive and Outlier, 0.71
Strong Negative, -0.98
Maximum Positive, 1.0

Feedback: Correct response: Moderate Negative, -0.43

Continue to the next page to learn about three different types of correlation analyses that can be used to derive the correlation coefficient.

TYPES OF CORRELATION ANALYSES
This content was retrieved from Section 01 Slide 8 of 26 of the online learning module.

There are a number of different statistical tests that can be used to derive a correlation coefficient. The analysis you will use depends on the type of data you are working with. Throughout the rest of this section, you will be introduced to three types of correlation analyses:
1. Pearson Product Moment Correlation
2. Spearman’s Rank Correlation
3. Intraclass Correlations

PEARSON CORRELATION
This content was retrieved from Section 01 Slide 9 of 26 of the online learning module.

The Pearson Product Moment Correlation, commonly referred to as the Pearson correlation, is widely used in statistics to measure the degree of the relationship between linearly related variables. However, similar to other statistical tests you have learned about, the Pearson correlation has assumptions that must be met for it to produce reliable results. Learn about the assumptions of a Pearson correlation.

Scale Measurements: Pearson correlation assumes both variables are scale measurements. For example, weight and number of days.
Normal Distribution: Pearson correlation assumes both variables are normally distributed. Recall from Module 04 that the normality of each variable can be checked using a visualization (e.g. a histogram) or the Shapiro-Wilk normality test.
Paired Variables: The two variables should be paired, meaning that each case (e.g. participant) has two values, one for each variable. The cases must be independent from one another, meaning the values that one participant has have no effect on other participants’ scores.
No Outliers: Outliers should be removed from the data. Having one data point that lies far from the rest can significantly impact the analysis. Refer to Module 02 to review how to identify outliers.
Linearity: The variables must also exhibit linearity. Linearity means there is a linear relationship between one variable and another. You will learn more about linearity in the coming slides.
Homoscedasticity: Homoscedasticity means that the data have equal variance throughout the plot. Homoscedasticity is determined by visualizing the data. You will learn more about homoscedasticity in the coming slides.

The Pearson test is named after Karl Pearson.
Recall from Module 04 that Karl Pearson is the statistician with whom William Gosset studied when he was discovering the Student’s t distribution.

LINEARITY
This content was retrieved from Section 01 Slide 10 of 26 of the online learning module.

Linearity is a straight line relationship between two variables. In other words, as X increases, Y decreases or increases in a consistent manner. Linearity is a special case of monotonicity*: every straight line relationship is monotonic, although a monotonic relationship does not have to follow a straight line. Both can be checked using a visualization such as a scatter plot. Switch between two graphs demonstrating monotonic and non-monotonic relationships.

[Figure: Monotonic]
[Figure: Non-Monotonic]

Note: With monotonic relationships, the two variables need to increase or decrease together in general; however, the degree of increase and decrease can vary between data points.

Definition*: Monotonicity: When variables increase or decrease in value together.

Reference:
Laerd Statistics. (n.d.). Pearson product-moment correlation. Retrieved February 2021, from: https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php

HOMOSCEDASTICITY
This content was retrieved from Section 01 Slide 11 of 26 of the online learning module.

Homoscedasticity refers to an equal spread of data around the line of best fit. Similar to linearity, homoscedasticity can be determined by visualizing your data. Review a visualization of homoscedastic and heteroscedastic data.

Homoscedastic: Homoscedastic data has an equal spread around the line of best fit. In this diagram, the data points are consistently spread out around the line.
Heteroscedastic: Heteroscedastic data has an unequal spread around the line of best fit. In this diagram, the data points start close to the line (i.e. have little variability), then spread out at the top of the line (i.e. increase their variability).

Now that you have learned the assumptions of the Pearson correlation, continue to the next page to learn how to conduct the analysis.

CALCULATING THE PEARSON CORRELATION
This content was retrieved from Section 01 Slide 12 of 26 of the online learning module.

The Pearson correlation coefficient is denoted by the letter ‘r’. It is calculated from how far each value of the independent variable ($x_i$) lies from the independent mean ($\bar{x}$), and how far each value of the dependent variable ($y_i$) lies from the dependent mean ($\bar{y}$).

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Note: In this course, you will use statistical software to calculate the Pearson correlation coefficient. This formula is presented for your interest and future learning.

INTERPRETING CORRELATION ANALYSES
This content was retrieved from Section 01 Slide 13 of 26 of the online learning module.

As you learned earlier in this section, correlation coefficients tell you the direction and magnitude of the relationship. The values of the correlation coefficients range from -1 to +1, where -1 is a perfectly negative correlation, +1 is a perfectly positive correlation, and 0 indicates no correlation. Correlation coefficients do not have units associated with them.

[Figure: three scatter plots]
Correlation Coefficient = -1.0
Correlation Coefficient = 0
Correlation Coefficient = 1.0

GUIDELINES FOR PEARSON CORRELATION COEFFICIENTS
This content was retrieved from Section 01 Slide 14 of 26 of the online learning module.

The guidelines for interpreting Pearson correlation coefficients may vary slightly by research field. In this course, you should use the guidelines derived from Evans (1996).
Correlation Coefficient    Strength of Correlation
.00 to .19                 Very weak
.20 to .39                 Weak
.40 to .59                 Moderate
.60 to .79                 Strong
.80 to 1.0                 Very strong

Listen to Dr. Wagner discuss interpreting Pearson correlation coefficients. (1:51)

Start of Audio Transcript:

Correlation coefficients of zero, negative one, and positive one are fairly easy to interpret: Zero means zero relationship. Positive one means a perfectly positive relationship, or as one variable increases so does the other. Negative one means a perfectly negative relationship, meaning as one variable increases, the other decreases. But what do we do about in-between values like 0.3 or 0.4? To help interpret these in-between values there have been a number of guidelines published. The guideline shown here is by Evans in a 1996 paper for the behavioural sciences.

Something to be mindful of, is that these guidelines are somewhat field specific. Meaning that different research disciplines may consider 0.6 to be strong, others might consider it to be good, or moderate correlation coefficient. This is in part due to the nature of the data in different research fields - so whether data is inherently homogeneous (meaning very similar) or heterogeneous (meaning very different). Recall from an earlier module that cell lines tend to be fairly homogeneous, human behaviour less so. So, guidelines really depend on the type of data we are dealing with; also what the data is going to be used for.

If you ever want to qualify your correlation coefficient, or one of those in-between variables, or results with a descriptor and be able to say something like it’s moderate or it’s strong correlation coefficient, it’s really important to provide the reference to what guideline you are using that says that the specific number you got would be strong, and you also want to ensure those guidelines were derived from a similar area of research to your research question that you are looking at.

End of Audio Transcript.
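The Evans (1996) bands can be written as a small lookup, sketched here in Python. This is an illustration only: the function names `evans_strength` and `describe` are hypothetical, the first band is assumed to read .00 to .19, the bands are assumed to apply to the absolute value of the coefficient, and values that fall between two printed bands (such as .195) are assumed to belong to the lower band.

```python
def evans_strength(r):
    """Return the Evans (1996) strength descriptor for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    magnitude = abs(r)  # assumption: strength ignores the sign, which only gives direction
    if magnitude < 0.20:
        return "very weak"
    if magnitude < 0.40:
        return "weak"
    if magnitude < 0.60:
        return "moderate"
    if magnitude < 0.80:
        return "strong"
    return "very strong"

def describe(r):
    """Combine the strength descriptor with the direction given by the sign of r."""
    if r == 0:
        return "no linear association"
    direction = "positive" if r > 0 else "negative"
    return f"{evans_strength(r)} {direction}"

print(describe(-0.43))  # prints moderate negative
print(describe(0.71))   # prints strong positive
```

As the audio transcript stresses, a descriptor produced this way should be reported together with the guideline it came from, because other fields draw the bands differently.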
Note: You will not need to memorize the specific guidelines shown in this course, but you should get in the habit of using guidelines from your field when reporting correlations.

Reference:
Based on guidelines from: Evans, J. D. (1996). Straightforward statistics for the behavioral sciences. Pacific Grove, CA: Brooks/Cole Publishing Company.

SIGNIFICANCE VALUES
This content was retrieved from Section 01 Slide 15 of 26 of the online learning module.

While the main statistical output from the Pearson correlation is the correlation coefficient, Pearson correlations also provide you with p values that can be used for hypothesis testing. Recall from earlier modules that p values are compared to a predetermined significance level (e.g. α = .05) to determine whether you fail to reject or reject the null hypothesis. The p values generated by correlation analyses are interpreted the same way. Review the null and alternative hypotheses for a Pearson correlation analysis.

Null Hypothesis: In correlation analyses, the null hypothesis is that there is no correlation between the two variables. In other words, the correlation coefficient in the population is 0.
Alternative Hypothesis: In correlation analyses, the alternative hypothesis is that a correlation exists between the two variables. In other words, the correlation coefficient in the population does not equal 0.

While the correlation coefficient tells you the magnitude and direction of the relationship, the p value tells you how likely it is that a correlation at least as strong as the one you found would arise by chance alone if the null hypothesis were true. Both are valuable pieces of information.

VIDEO: SAMPLE PROBLEM: PEARSON CORRELATION
This content was retrieved from Section 01 Slide 16 of 26 of the online learning module.
Consider the scenario: A researcher interested in heart disease wants to know whether weight is related to LDL cholesterol. Since the researcher wants to know the relationship between these two scale measurements, a Pearson correlation is conducted.

Watch the video to learn how to conduct and interpret a Pearson correlation analysis. (4:57)

Additional Information
For your interest, learn more about the relationship between LDL and weight.
LDL and Weight

Page Links:
https://player.vimeo.com/video/527537559
https://www.cdc.gov/cholesterol/prevention.htm

Reference:
Problem adapted from work by Montoye, H., Epstein, F., & Kjelsberg, M. (1966). Relationship Between Serum Cholesterol and Body Fatness. The American Journal of Clinical Nutrition, 18(6), 397-406. Retrieved February 2021, from: https://proxy.queensu.ca/login?url=https://doi.org/10.1093/ajcn/18.6.397

SPEARMAN’S CORRELATION
This content was retrieved Section 01 Slide 17 of 26 of the online learning module.
Like other statistical tests, the Pearson correlation has a non-parametric equivalent, called the Spearman’s Rank Order correlation, that can be used if the assumptions of the Pearson correlation are not met. As the name suggests, Spearman’s correlation ranks* data to explore the relationship between two variables.

Learn about Spearman’s correlation.

Any Distribution
Because Spearman’s correlation uses ranks rather than the raw values (similar to other nonparametric tests you have learned about, like Wilcoxon rank sum and Kruskal Wallis), it does not require variables to come from a normal distribution.

Scale or Ordinal Data
The two variables must be measured at the ordinal or scale levels. Ordinal data can be used since Spearman’s correlation works on ranks.

Monotonic Relationship
Whereas the Pearson correlation requires a linear relationship, Spearman’s correlation only requires a monotonic relationship between the variables, meaning that as one variable increases, the other consistently increases or consistently decreases, though not necessarily along a straight line.
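To make the idea of ranking concrete, here is a minimal sketch that computes a Pearson coefficient by hand and then obtains a Spearman coefficient by applying the same calculation to the ranks of the data. The function names and the data are hypothetical and for illustration only; tied values are given the average of the ranks they occupy, which is a common convention but an assumption here. In this course the calculation is done with statistical software instead.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient for paired observations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least 2 pairs")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return ss_xy / sqrt(ss_x * ss_y)


def average_ranks(values):
    """Rank values from 1 (smallest) upward; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: an ordinal rating (0 to 5) and a scale measurement.
rating = [2, 3, 3, 4, 5]
measurement = [10.5, 14.0, 13.2, 19.8, 30.1]
print(average_ranks(rating))  # the two tied ratings share rank 2.5
print(round(spearman_rs(rating, measurement), 3))
```

Because only the ranks enter the calculation, the size of the gaps between values and the shape of each variable's distribution do not affect the result.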
CALCULATING & INTERPRETING SPEARMAN’S CORRELATION
This content was retrieved Section 01 Slide 18 of 26 of the online learning module.
To calculate a Spearman’s correlation, each variable is ranked and the difference between the ranked pairs is used to calculate the correlation coefficient (rₛ). Similar to the Pearson correlation, Spearman’s correlation coefficient ranges from -1 to +1, and indicates the magnitude and direction of the relationship between variables. Spearman’s correlation coefficients can be interpreted using the same guidelines you learned about earlier in the section.

Note: You will not be required to calculate Spearman’s correlations by hand, but will use statistical software. This formula is presented for your interest and future learning.

$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

d_i - difference between the ranks in pair i; n - number of pairs of ranks. This form of the formula applies when there are no tied ranks.

Reference:
Laerd Statistics. (n.d.). Spearman's rank-order correlation. Retrieved February 2021, from: https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php

VIDEO: SAMPLE PROBLEM: SPEARMAN’S CORRELATION
This content was retrieved Section 01 Slide 19 of 26 of the online learning module.
Consider the scenario: A kinesiologist is interested in comparing the results from two strength tests for patients with tennis elbow. The two strength tests include a manual muscle test that rates strength on a scale of 0 to 5, and an isometric strength test using a dynamometer. Since the researcher wants to know the relationship between two measurements, but one variable is an ordinal measurement, a Spearman correlation is the appropriate test.

Watch the video to learn how to conduct and interpret a Spearman’s correlation analysis.
(3:55)

Page Link:
https://player.vimeo.com/video/527540210

INTRACLASS CORRELATION COEFFICIENT (ICC)
This content was retrieved Section 01 Slide 20 of 26 of the online learning module.
The last correlation analysis you will learn about in this section is the intraclass correlation coefficient (ICC). ICCs were first introduced by Ronald A. Fisher, a famous statistician and geneticist, as a modification of the Pearson correlation coefficient. ICCs are now widely used in healthcare to evaluate inter-rater, test-retest, and intra-rater reliability on many forms of clinical assessment.

Learn more about inter-rater, test-retest, and intra-rater reliability.

Inter-Rater Reliability
Inter-rater reliability is the variation between two or more raters who measure the same event. For example, if two raters were to assess a patient’s well-being using the SCAT5* measurement tool, inter-rater reliability would measure whether the raters agreed in their scoring.

Test-Retest Reliability
Test-retest reliability is the variation in two measurements taken with the same instrument, subject, rater, and conditions. For example, if an individual rated their own cognitive state more than once in the same conditions, test-retest reliability would assess how similar the two measurements were.

Intra-Rater Reliability
Intra-rater reliability is the variation within one rater across two or more trials. For example, if one rater assessed a patient’s well-being using the SCAT5 twice under the same conditions, intra-rater reliability would look at the agreement between their scores.

Definition*:
SCAT5: The Sport Concussion Assessment Tool (SCAT) is a standardized scale used by healthcare professionals to assess suspected concussions.

References:
Koo, T. K., & Li, M. Y. (2016).
A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research. Journal of Chiropractic Medicine, 15(2), 155-163. Retrieved February 2021, from: https://proxy.queensu.ca/login?url=https://doi.org/10.1016/j.jcm.2016.02.012
Topp, C. W., Østergaard, S. D., Søndergaard, S., & Bech, P. (2015). The WHO-5 Well-Being Index: a systematic review of the literature. Psychotherapy and Psychosomatics, 84(3), 167-176.

CALCULATING & INTERPRETING ICCS
This content was retrieved Section 01 Slide 21 of 26 of the online learning module.
ICCs differ from other correlations you have learned about as they operate on data structured as groups, rather than data structured as paired observations. Additionally, how ICCs are calculated depends on the type of data you are dealing with. Since there are so many variations of the ICC formula and you do not need to calculate it by hand in this course, the formulas have been omitted.

Similar to Pearson and Spearman’s correlations, ICCs are typically interpreted using field-specific guidelines.

ICC             Level of Agreement
< .40           Poor
.40 to .59      Fair
.60 to .74      Good
.75 to 1.00     Excellent

Listen to Dr. Wagner discuss calculating and interpreting ICCs in health sciences. (01:03)

Start of Audio Transcript:
Similar to the Pearson correlation coefficient that you learnt about earlier in the module, intraclass correlation coefficients or ICCs also have guidelines that can be used to help interpret their values. In this case, you're looking at guidelines from a 1994 paper by Cicchetti et al. that really looks at categorizing or qualifying the ICC within the psychology field. So this is often used in human behaviour research. And again, it really goes back to different research fields having their own guidelines that can help you interpret the strength of an ICC or correlation in terms of what's typically accepted in your research field.
So it's really important to note what guideline you're using if you want to say it's an excellent or good correlation. And then you also want to make sure that you're using guidelines that are from your research field.
End of Audio Transcript.

Reference:
Based on guidelines from: Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6(4), 284-290.

VIDEO: SAMPLE PROBLEM: ICCS
This content was retrieved Section 01 Slide 22 of 26 of the online learning module.
Consider the scenario: To increase adoption of the SCAT globally, researchers in a recent study had native Cantonese and Mandarin-speaking multidisciplinary expert panels translate the SCAT to the Chinese context. To ensure the translated SCAT remained a reliable measure of concussion, two healthcare professionals administered the translated SCAT to a rugby team and compared their ratings for each athlete. Since the researchers want to explore inter-rater reliability in this scenario, an ICC is the correct statistical analysis.

Watch the video to learn how to conduct and interpret an ICC analysis. (6:42)

Note: For this course the differences between consistency and absolute agreement are not important. You should use consistency for your homework.

Additional Information
For your interest, view the SCAT5 assessment tool.

Page Links:
https://bjsm.bmj.com/content/bjsports/early/2017/04/26/bjsports-2017-097506SCAT5.full.pdf
https://player.vimeo.com/video/527776326

Reference:
Problem adapted from: Yeung, E. W., Sin, Y. W., Lui, S. R., Tsang, T. W., Ng, K. W., Ma, P. K.,... & Ma, T. M. (2018). Chinese translation and validation of the Sport Concussion Assessment Tool 3 (SCAT3). BMJ Open Sport & Exercise Medicine, 4(1).
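For interest only, the following is a rough sketch of one common ICC variant: the two-way, single-rater, consistency form, often labelled ICC(3,1). This choice is an assumption, since the guide deliberately omits the formulas and the statistical software used in the course may report a different variant. The function name and the ratings are hypothetical.

```python
def icc_consistency(ratings):
    """Single-rater consistency ICC from a subjects-by-raters table.

    ratings: list of rows, one row per subject, one column per rater.
    Assumption: this is the two-way consistency form, ICC(3,1),
    computed from the mean squares of a two-way ANOVA table.
    """
    n = len(ratings)                      # number of subjects
    k = len(ratings[0]) if n else 0       # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 subjects and 2 raters, no gaps")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    subject_means = [sum(row) / k for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((v - grand_mean) ** 2 for row in ratings for v in row)
    ss_subjects = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_raters = n * sum((m - grand_mean) ** 2 for m in rater_means)
    ss_error = ss_total - ss_subjects - ss_raters

    ms_subjects = ss_subjects / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    denominator = ms_subjects + (k - 1) * ms_error
    if denominator == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_subjects - ms_error) / denominator


# Hypothetical scores from two raters for four athletes.
scores = [[1, 2], [2, 1], [3, 4], [4, 3]]
print(round(icc_consistency(scores), 2))
```

In the consistency form, a rater who always scores a fixed amount higher than the other does not lower the ICC, because the systematic difference between raters is removed before agreement is assessed.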
FACTORS IMPACTING CORRELATIONS
This content was retrieved Section 01 Slide 23 of 26 of the online learning module.
Now that you have learned about three of the most common correlation analyses, there are a few things to consider when dealing with them.

Learn about a few considerations for correlation analyses.

Restricting Data Range
In most cases, you want to include as much data as possible in your analysis. However, in some cases, restricting the range of data included in your analysis can be advantageous. For example, if you were interested in how reading ability increases linearly with age, you would likely want to restrict the age range to the window in which reading development typically occurs (i.e. between 4 and 20 years old, or some other reasonable limit). In this case, restricting the range of data will increase the correlation (r), as any non-linear portions of the relationship are removed.

Heterogeneous Samples
Another important thing to consider is how heterogeneous* or homogeneous* your data is. Heterogeneous samples can mask underlying patterns. For example, consider the relationship between height and weight in males and females. When you combine data from both sexes, the relationship between height and weight appears strikingly strong. However, if you look at the two groups separately, the correlation between height and weight falls to .60 for males and .49 for females. This is because males are, on average, taller and heavier than females, so the difference between the groups inflates the combined correlation.

Outliers
Recall learning about outliers in Module 02 and how they can significantly impact descriptive statistics and the shape of a sampling distribution. Outliers can also significantly alter a correlation. Return to Module 02 to review the strategies you can use to deal with outliers.

Definitions*:
Heterogeneous: A lot of variability amongst data.
Homogeneous: Little variability amongst data.
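The effect of combining two different groups can be illustrated with a small sketch. All numbers below are made up for illustration (they are not the data behind the .60 and .49 values quoted above), and the helper `pearson_r` is a hypothetical hand-written function rather than a call into any statistics package.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient for paired observations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least 2 pairs")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return ss_xy / sqrt(ss_x * ss_y)


# Hypothetical heights (cm) and weights (kg) for two groups whose
# averages differ, but whose within-group pattern is identical.
group_a_height = [155, 160, 165, 170]
group_a_weight = [55, 62, 56, 63]
group_b_height = [172, 177, 182, 187]
group_b_weight = [75, 82, 76, 83]

r_a = pearson_r(group_a_height, group_a_weight)
r_b = pearson_r(group_b_height, group_b_weight)
r_pooled = pearson_r(group_a_height + group_b_height,
                     group_a_weight + group_b_weight)

print("group A:", round(r_a, 2))
print("group B:", round(r_b, 2))
print("pooled: ", round(r_pooled, 2))
```

With these made-up numbers the pooled coefficient comes out larger than either within-group coefficient, because much of the pooled relationship reflects the gap between the two group averages rather than the pattern inside each group.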
QUESTION: CORRELATION VERSUS CAUSATION
This content was retrieved Section 01 Slide 24 of 26 of the online learning module.
Most importantly, correlation does not mean causation. You need to be aware of the possibility of hidden or intervening variables.

Answer the question using what you have learned so far in this module.

Question: A scatter plot and correlation analysis reveals a very strong relationship between reading ability and foot length (r = .88, n = 54, p = .003). Does this mean individuals with larger feet are better readers? What do you think is going on here?

Feedback: Listen to Dr. Wagner discuss correlation versus causation. (1:51)

Start of Audio Transcript:
If you were to interpret this correlation coefficient and p value alone, you might be tempted to think that foot size dictates reading ability. However, there are a number of reasons why this isn’t true. First, correlations go both ways, so this would mean foot size would dictate reading ability or that reading ability dictates foot size, which simply doesn’t make sense. Second, consider what else might be associated with foot size, such as age. If you consider age, foot size, and reading ability, you might think that this correlation between foot size and reading is actually misleading, since the real driver of the relationship is age. Because of this, we would call age a confounding variable, which means a variable that influences both the independent and dependent variable, causing a false or misleading correlation. In this example, this might have been fairly easy to tease apart; however, it’s often not so easy, and correlations are often wrongly interpreted as a cause-and-effect relationship. For example, consider the weight and LDL cholesterol example you learned about earlier in the section.
We were able to say the two variables are related, but what we don’t know is whether high weight causes LDL to increase, or high LDL causes weight to increase. Perhaps there is something else, a different confounding variable, like diet, driving this relationship. As a takeaway for this course, remember that correlation does not equal causation. Correlation tells you if the two are related but not which causes the other. Continue to the next page for some summary thoughts on the clinical implications of correlations. End of Audio Transcript. CLINICAL IMPORTANCE OF CORRELATIONS This content was retrieved Section 01 Slide 25 of 26 of the online learning module. Correlation analyses are fundamental to healthcare. First, correlations are critical to the development of measurement tools and procedures that are used to assess patients. A measure must be reliable (i.e. have high inter-rater, intra-rater, and test-retest reliability) to give an accurate measurement. Second, healthcare providers rely on correlations to support clinical decision making. Listen to Dr. Kash Visram, a urologist, discuss how correlations can be used to assist with clinical decision making in the case of kidney dysfunction. (1:04) Note: You will not be tested on the specifics of this example. Start of Audio Transcript: Correlations are critical to healthcare delivery. Correlations allow us to create and test new clinical assessment tools to ensure that they are giving us accurate measurements. They also impact medical decision making on a regular basis. For example, when a patient comes in with a symptom that may be indicative of a disease or condition, such as high glucose, this may suggest to us that the patient is diabetic. 
However, before we can make that decision, we need to check for other conditions that are typically associated with diabetes to ensure that this is the correct diagnosis. In this scenario, if the patient had high glucose, we may also check for other symptoms related to diabetes. This could include nephropathy (which is kidney dysfunction), retinopathy (which is eyesight dysfunction), or neuropathy (which is nerve dysfunction). If blood work confirms any of these things, it would help us gain confidence that we are dealing with diabetes. If a patient only had high glucose and none of these other symptoms, we may look for an alternate diagnosis.
End of Audio Transcript.

References:
Lachin, J. (2004). The role of measurement reliability in clinical trials. Clinical Trials (London, England), 1(6), 553-566. Retrieved February 2021, from: https://proxy.queensu.ca/login?url=https://doi.org/10.1191/1740774504cn057oa
Homewood Health. (2020). Assessments. Retrieved February 2021, from: https://homewoodhealth.com/corporate/services/return-to-work/assessments

SECTION 01: SUMMARY
This content was retrieved Section 01 Slide 26 of 26 of the online learning module.
In this section, you were introduced to correlation analysis as a way to explore the relationship between two variables. Specifically, you learned about three types of correlation analyses: Pearson correlation, Spearman’s correlation, and intraclass correlations. You learned how to conduct these correlations using statistical software, how to interpret the results, including p values and correlation coefficients, and which additional factors to be aware of in correlations, such as the data range, heterogeneity, outliers, and causation.
End of Section 01

SECTION 02: INTRODUCTION TO REGRESSION

INTRODUCTION TO REGRESSION
This content was retrieved Section 02 Slide 2 of 22 of the online learning module.
In Section 01, you learned about correlation analyses, which quantify the degree of relatedness between two variables. Regression is the next step up after correlation. Similar to correlations, regression explores the relationship between variables. The primary difference between correlation and regression is that regression allows you to investigate the influence of one variable on another.

Compare the definitions of correlation and regression.

Correlation
Explores the relationship between two continuous variables. For example, the relationship between weight and LDL.

Regression
Explores how an explanatory variable (X) impacts a response variable (Y). For example, how weight affects LDL.

You will learn more about response and explanatory variables on the next slide.

RESPONSE & EXPLANATORY VARIABLES
This content was retrieved Section 02 Slide 3 of 22 of the online learning module.
The response and explanatory variables are labelled based on which variable is being used to predict the other.

Learn about response and explanatory variables.

Response Variable
A response variable (Y) is the variable that you are predicting (i.e. the outcome or dependent variable).

Explanatory Variable
An explanatory variable (X) is the variable that you are using to predict Y (i.e. the independent variable).

If you have one explanatory variable (X), you will be doing a simple linear regression. If there is more than one explanatory variable, it is called a multiple linear regression. This section focuses on simple linear regressions.
QUESTION: RESPONSE & PREDICTOR VARIABLES
This content was retrieved Section 02 Slide 4 of 22 of the online learning module.
Consider the scenario: You want to explore how height predicts weight. You randomly select 100 individuals and take their measurements.

Select the correct terms from the drop down menu using your knowledge of regression terminology.

Question: Of height and weight, which is the response and which is the explanatory variable?
Options: Explanatory, Response
Height: 
Weight: 

Feedback:
Height: Explanatory
Weight: Response
In this case, the stem tells you that height is being used to predict weight. Therefore, height is the explanatory variable (X) and weight is the response variable (Y). You would collect height and weight from all participants, and label your observations in pairs (e.g. X1, Y1).

PLOTTING YOUR DATA
This content was retrieved Section 02 Slide 5 of 22 of the online learning module.
Recall from Section 01 that one of the first steps in a correlation analysis is to plot your data to see what you are working with. The same is true for a regression analysis. Regression analyses summarize the relationship between variables using a line of best fit. In the example discussed on the previous slide, the line of best fit can be used to determine whether height can be used to predict weight.

Listen to Dr. Wagner discuss correlation, regression, and lines of best fit. (1:24)

Start of Audio Transcript:
You have now learned about some of the key differences between correlation and regression. Correlation looks at the relationship between two variables, while regression looks at whether you can use one variable to predict another variable. In both cases, a first step is plotting the data using a scatter plot.
This scatterplot shows the predictor variable, height, on the X axis and the response variable, weight, on the Y axis. Regression analyses then try to summarize the relationship between the variables using a line of best fit, a term you were introduced to earlier in the module. The main difference here is that regression uses the line of best fit for prediction. In other words, to determine whether height can predict weight. At this point, I imagine you may have questions, like:
How do you come up with a good fitting line?
Does a straight line truly summarize the pattern of data?
Is the relationship strong enough to use this line to predict other data?
Is there strong evidence to suggest this is a real effect? In other words, is it likely that this relationship would be due to random chance alone? Or is something else going on?
How can we measure the strength of the relationship between X and Y using this line?
Continue to the next page to learn about how we can answer some of these questions.
End of Audio Transcript.

LINEAR RELATIONSHIP
This content was retrieved Section 02 Slide 6 of 22 of the online learning module.
In this course, you will focus on linear relationships between response (Y) and predictor (X) variables. When you have a linear relationship, you can use a formula to predict Y for any given X. You are likely already familiar with the formula y = mx + b. The linear equation for regression follows a similar format.

Break down the regression equation for a linear relationship between X and Y.

$$\mu_{Y|X} = \beta_0 + \beta_1 X$$

$\mu_{Y|X}$: The true mean of Y for a given value of X.
$\beta_0$: The Y intercept, meaning the value of the response variable (Y) when the predictor variable (X) is 0.
$\beta_1$: The slope of the regression line, meaning the change in Y that corresponds to a one unit increase in X.
$X$: The predictor variable X.
Reference:
Kleinbaum, Kupper, Nizam, & Rosenberg. Applied Regression Analysis and Other Multivariable Methods (5th Edition). Cengage Learning. 2014. Boston, USA. Chp 3: Basic Statistics a Review.

COMBINING CONCEPTS
This content was retrieved Section 02 Slide 7 of 22 of the online learning module.
What this equation is saying is: if you know the value of X, you can use this equation, along with the line of best fit, to predict the value of Y. In fact, the predicted Y value is the point on the line of best fit directly above that value of X.

[Figure: line of best fit for height and weight, with two example readings: X = 1.5 m gives Y = 75 kg; X = 1.8 m gives Y = 95 kg]

RESIDUALS & THE REGRESSION EQUATION
This content was retrieved Section 02 Slide 8 of 22 of the online learning module.
However, in reality, observed values of Y will often vary from the predicted value. In other words, they do not all fall perfectly on the line of best fit, but vary around the line. To account for this variability, the equation for the line of best fit is adapted to include an error term, epsilon (ε). Epsilon represents the residual*.

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

Listen to Dr. Wagner discuss residuals and the regression equation. (1:56)

Start of Audio Transcript:
The line of best fit shows the predicted or expected Y values; however, the observed values often don’t fall precisely on this line. Recall from Module 04 that you learned about observed and expected values in the chi square analysis. It is a similar idea here. In regressions, the vertical distance between the predicted and observed values is the distance between the data point and the line of best fit on the graph. This is calculated as the observed value minus the predicted value. The resulting value or distance between those two points is called the residual and is represented by the epsilon, or “error term”.
Data points below the line have negative residuals, and data points above the line have positive residuals. A few additional notes regarding the regression line equation: You might have noticed that the true mean of Y for a given value of X (which is written as μ with the subscript Y|X) was replaced with just a simple Y on the previous slide. These two representations of the “predicted true mean of Y” are often interchanged, as they are referring to a similar thing. Another thing to be aware of is that your beta coefficients, so beta zero and beta one, are population parameters, which means they are typically unknown. Thus, you use sample statistics to estimate these values. A common method for estimating your beta coefficients is called the least squares method, and you can continue to the next slide to learn a little bit more about least squares.
End of Audio Transcript.

Definition*:
Residual: The distance an observed Y lies from the regression line.

LEAST SQUARES METHOD
This content was retrieved Section 02 Slide 9 of 22 of the online learning module.
You just learned that a residual is the observed value minus the predicted value. The least squares method chooses the values of the Y intercept (β₀) and slope (β₁) that minimize the sum of the squared residuals. Recall from Module 05 that you learned how the sum of squares can be used to estimate variance between or within groups in an ANOVA. The sum of squares in a regression also estimates variance.

Learn about how the sum of squares is used in regression analyses.

SSX
The sum of squares of X (SSX) is calculated by taking each observed value of X, subtracting the mean of X, squaring that difference, and summing the results:

$$SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2$$

The variance of X is equal to the sum of squares of X, divided by the degrees of freedom (n - 1).
SSY
The sum of squares of Y (SSY) is calculated by taking each observed value of Y, subtracting the mean of Y, squaring that difference, and summing the results:

$$SSY = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

The variance of Y is equal to the sum of squares of Y, divided by the degrees of freedom (n - 1).

SSXY
The sum of squares of XY (SSXY) is calculated by multiplying, for each pair of observations, the deviation of X from the mean of X by the deviation of Y from the mean of Y, and summing these products:

$$SSXY = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$

The covariance* of X and Y is equal to SSXY, divided by the degrees of freedom (n - 1).

Definition*:
Covariance: Joint variability of two random variables.

Reference:
Kleinbaum, Kupper, Nizam, & Rosenberg. Applied Regression Analysis and Other Multivariable Methods (5th Edition). Cengage Learning. 2014. Boston, USA. Chp 3: Basic Statistics a Review.

CALCULATING Y INTERCEPT & SLOPE
This content was retrieved Section 02 Slide 10 of 22 of the online learning module.
Minimizing the sum of the squared residuals means that the overall difference between the data points and the line of best fit is as small as possible.

Learn how to calculate the Y intercept (β₀) and slope (β₁) using the least squares method.

Y intercept (β₀)
The estimate of the Y intercept (β₀) is calculated as the mean of Y, minus the estimated value of the slope multiplied by the mean of X. Note the “hats” in the equation indicate that the variables are estimates.

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

Slope (β₁)
The estimate of the slope (β₁) is calculated by dividing the sum of squares of XY (SSXY) by the sum of squares of X (SSX). Similar to the Y intercept, the “hats” indicate that the variables are estimates.

$$\hat{\beta}_1 = \frac{SSXY}{SSX}$$

Note: In this course, you will use statistical software to calculate these variables. You will not need to calculate them by hand.

Reference:
Kleinbaum, Kupper, Nizam, & Rosenberg. Applied Regression Analysis and Other Multivariable Methods (5th Edition). Cengage Learning. 2014. Boston, USA. Chp 3: Basic Statistics a Review.
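The least squares calculations described in this part of the guide can be sketched in a few lines. This is an illustration only: the function name, the result type, and the data are hypothetical, and in this course the estimates are produced by statistical software rather than by hand.

```python
from typing import NamedTuple


class LeastSquaresFit(NamedTuple):
    ss_x: float        # sum of squares of X
    ss_y: float        # sum of squares of Y
    ss_xy: float       # sum of cross-products of X and Y
    slope: float       # estimate of beta 1
    intercept: float   # estimate of beta 0


def least_squares(x, y):
    """Estimate the slope and Y intercept of a simple linear regression."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least 2 pairs")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if ss_x == 0:
        raise ValueError("all X values are identical, so no slope can be fitted")
    slope = ss_xy / ss_x                    # SSXY divided by SSX
    intercept = mean_y - slope * mean_x     # mean of Y minus slope times mean of X
    return LeastSquaresFit(ss_x, ss_y, ss_xy, slope, intercept)


# Hypothetical paired observations (X1, Y1), (X2, Y2), ...
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 5, 4, 5]
fit = least_squares(x_values, y_values)
n = len(x_values)
print("variance of X:", fit.ss_x / (n - 1))
print("covariance of X and Y:", fit.ss_xy / (n - 1))
print("slope:", round(fit.slope, 2), "intercept:", round(fit.intercept, 2))
```

For these made-up values the mean of X is 3 and the mean of Y is 4, so the fitted line Y = 2.2 + 0.6X gives exactly 4 when X = 3.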
REGRESSION ASSUMPTIONS
This content was retrieved Section 02 Slide 11 of 22 of the online learning module.
Like other statistical tests, in order to conduct a simple linear regression using the least squares method, there are a number of assumptions that must be met.

Learn the assumptions of a simple linear regression.

Scale Data
In simple linear regression, both the response and predictor variables must be scale data.

Normal Distribution
The residuals of the regression line must be approximately normally distributed. The normality of residuals can be checked using visualizations. You will learn more about this during the sample problem.

No Outliers
Similar to the Pearson correlation you learned about in Section 01, outliers should be removed from the data.

Linearity
There must be a linear relationship between the two variables. This can be checked using a visualization.

Homoscedasticity
The data must also be homoscedastic. Recall from Section 01 that homoscedasticity refers to equal variance along the line of best fit. Homoscedasticity can also be checked using visualizations.

Reference:
Pagano, M., & Gauvreau, K. (2018). Principles of biostatistics (2nd ed.). Boca Raton, FL: CRC Press.

QUESTION: REGRESSION VERSUS PEARSON CORRELATION
This content was retrieved Section 02 Slide 12 of 22 of the online learning module.
Recall the discussion surrounding the Pearson correlation from Section 01.

Answer the question using your knowledge of statistical assumptions.

Question: What is the main difference between the assumptions of the Pearson correlation and simple linear regression?

Feedback: The assumptions of these two tests are very similar, in that they both require scale data, homoscedasticity, and linearity. In both cases, you also want to remove outliers and want to have paired observations (i.e. X1, Y1).
The main difference is that regression analyses require the residuals to be normally distributed, whereas correlations require each variable to be normally distributed. This has implications for how you check normality.

ADDITIONAL NOTES: LEAST SQUARES METHOD
This content was retrieved Section 02 Slide 13 of 22 of the online learning module.
In addition to the assumptions, there are a few additional rules about the least squares regression model that you should be aware of.

Learn about these rules.

Sum of Residuals
The sum of the residuals is always equal to 0. For example, if you were to take the vertical distances between each data point and the line of best fit, and add them together, the positive and negative distances would cancel each other out.

[Figure: positive distances, negative distances, and the sum of residuals]

Line of Best Fit
The line of best fit always passes through the mean of X and the mean of Y. For example, if you were to find the mean of X and the mean of Y from this height and weight data set, and plot them together as a single point, the line of best fit would pass through that point.

CONDUCTING AND INTERPRETING REGRESSION ANALYSES
This content was retrieved Section 02 Slide 14 of 22 of the online learning module.
Once you have checked your assumptions, you can use statistical software to help run the analysis. For this course, the important part is knowing how to interpret the data.

Learn about the statistical output for regression analyses.

Coefficient of Determination
While the r value gives you the simple correlation between variables, the coefficient of determination (r²) tells you the amount of variation in weight (Y) that is explained by height (X).
In this case, r² = .57, which means 57% of the variation in Y is explained by X.

ANOVA Table
The ANOVA table gives you information on how well your line “fits” the data (i.e. how well X predicts Y). In this case, p < .01, thus the model significantly predicts Y. Note: This may remind you of the ANOVA tables from Module 05. It should, as they are very similar!

β Coefficients
The coefficients table provides information on your Y intercept and slope. You can also use this information to complete your regression equation. Note: The Y intercept is also called the constant. In this example, B0 = -77.28 and B1 = 3.33; therefore, the regression equation is: Y = -77.28 + 3.33X.

INTERPRETING THE COEFFICIENT OF DETERMINATION
This content was retrieved from Section 02, Slide 15 of 22 of the online learning module.

Similar to correlation analyses, the coefficient of determination (r²) indicates the strength of the relationship between X and Y. Learn the important features of the coefficient of determination.

Positive Values
Since the coefficient of determination (r²) is a squared value, it cannot be negative.

Range 0 to +1
The closer r² is to 0, the less variation in Y is determined by X. The greatest value of r² is 1, meaning all of the variability in Y is determined by X.

Common Misconceptions
Commonly, it is thought that r² measures the magnitude of the slope of the regression line, or how well the line “fits” the data. Both of these are false. Switch between images to see that the magnitude of r² does not tell you the fit or magnitude of the line of best fit.

[Figure: two panels, one with r² = 0 and one with a high r².]

Significance
There are no clear guidelines for interpreting r².
Thus, its interpretation will often depend on what is being studied or your field of study. For example, in clinical research it would be widely accepted that an r² of .03 would mean the independent variable is not a useful predictor, as it only explains 3% of the variance in Y. However, an r² of .75, which explains 75% of the variation, would be considered quite useful.

Reference: Kleinbaum, Kupper, Nizam, & Rosenberg. Applied Regression Analysis and Other Multivariable Methods (5th Edition). Cengage Learning. 2014. Boston, USA. Chp 3: Basic Statistics: A Review.

VIDEO: SAMPLE PROBLEM: SIMPLE LINEAR REGRESSION
This content was retrieved from Section 02, Slide 16 of 22 of the online learning module.

Consider the scenario: A researcher is interested in the relationship between vitamin D and calcium, specifically whether daily vitamin D supplements affect blood calcium concentrations (mg/dL). Watch the video to learn how to conduct and interpret a simple linear regression analysis. (10:26)

Page Link: https://player.vimeo.com/video/528322966

PREDICTION & EXTRAPOLATION
This content was retrieved from Section 02, Slide 17 of 22 of the online learning module.

In Section 01, you learned that correlation is not causation. Regression goes a step further than correlation, as you are exploring whether one variable predicts another in a given data set; however, prediction on its own does not establish causation. With simple regression, it is also dangerous to extrapolate your findings beyond the range of observed values for X. Returning to the example of using height to predict weight, if a height of 2.0 meters is outside the range of data collected, it would not be appropriate to use this regression line to predict weight for that height, as the relationship between height and weight may be different outside this range.
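
The ideas from the last few slides (the least squares line, the sum of residuals, the line passing through the means, r², and the warning about extrapolation) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the height and weight values are invented for this example and are not the course data set, and the function names fit_least_squares, r_squared, and predict are hypothetical helpers, not part of any statistical package used in the course.

```python
from statistics import mean

# Hypothetical height (cm) and weight (kg) pairs, invented for illustration only.
heights = [150.0, 160.0, 165.0, 170.0, 175.0, 180.0, 190.0]
weights = [52.0, 58.0, 63.0, 66.0, 72.0, 77.0, 85.0]


def fit_least_squares(x, y):
    """Return (b0, b1): the Y intercept and slope of the least squares line."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar  # forces the line through (mean of X, mean of Y)
    return b0, b1


def r_squared(x, y, b0, b1):
    """Proportion of the variation in Y that is explained by X."""
    y_bar = mean(y)
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def predict(x_new, x_observed, b0, b1):
    """Predict Y for x_new, refusing to extrapolate beyond the observed X range."""
    lowest, highest = min(x_observed), max(x_observed)
    if not lowest <= x_new <= highest:
        raise ValueError(
            f"{x_new} is outside the observed range [{lowest}, {highest}]; "
            "the fitted line should not be used here."
        )
    return b0 + b1 * x_new


b0, b1 = fit_least_squares(heights, weights)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(heights, weights)]
r2 = r_squared(heights, weights, b0, b1)

print(f"Regression equation: Y = {b0:.2f} + {b1:.2f}X")
print(f"Sum of residuals (zero up to rounding): {sum(residuals):.10f}")
print(f"Line at mean of X: {b0 + b1 * mean(heights):.2f}, mean of Y: {mean(weights):.2f}")
print(f"r squared: {r2:.2f}")
print(f"Predicted weight at 170 cm: {predict(170.0, heights, b0, b1):.2f}")
```

Calling predict(200.0, heights, b0, b1) raises a ValueError in this sketch, because 200 cm (2.0 meters) lies outside the invented data range. Raising an error is just one possible design choice; real statistical software will usually return a number for any X, so checking the observed range remains your responsibility.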
ADVANCED STATISTICAL TECHNIQUES USING REGRESSION
This content was retrieved from Section 02, Slide 18 of 22 of the online learning module.

Now that you have learned about simple linear regression, you will be introduced to other advanced statistical techniques using regression. There are entire courses devoted to regression, so this information is meant only to touch lightly on the topic. You will not be asked to run these analyses. You should know these techniques exist, what they are in a broad sense, and how they differ from simple linear regression. Learn about different advanced statistical techniques using regression.

Linear Regression
Linear regression analysis is used when there is one explanatory and one response variable, and both are scale measurements. For example, studying the effect of height (X) on weight (Y).

Multiple Linear Regression
Multiple linear regression analysis is used when there is one response variable and more than one explanatory variable, and all are scale measurements. For example, studying the effect of height (X1) and age (X2) on weight (Y).

Logistic Regression
Logistic regression analysis is used when there is one explanatory variable and one response variable, but the response variable is dichotomous. For example, studying the effect of number of cigarettes smoked daily on cancer status (cancer versus no cancer).

Reference: Alexopoulos E. C. (2010). Introduction to multivariate regression analysis. Hippokratia, 14(Suppl 1), 23-28.

REGRESSION IN HEALTHCARE: OTTAWA ANKLE RULES
This content was retrieved from Section 02, Slide 19 of 22 of the online learning module.

In Section 01, you learned that correlations can be used to develop new assessment tools by testing inter-rater reliability, or can be used in practice by healthcare professionals to look for associated conditions.
In a similar manner, regression analyses are used in healthcare to identify signs and symptoms that are predictive of a certain condition or disease state. Listen to Dr. Andrew McGuire discuss a famous example of how regression analyses can be used to create powerful diagnostic tools in healthcare. (1:00)

Start of Audio Transcript: As an emergency physician at the University of Ottawa, Dr. Stiell noticed a large number of x-rays being ordered for injuries that were afterwards found to be normal. With this observation, over the past 20 years Dr. Stiell has conducted a unique series of studies to develop clinical decision rules and risk scales for emergency departments that are internationally recognized and implemented. Essentially, Dr. Stiell found what variables were most predictive of ankle fractures, and built guidelines around those risk factors. These guidelines, called the Ottawa Ankle Rules, are used internationally in ankle fracture management, have saved the healthcare system an enormous amount of money, and have saved patients time waiting in an emergency department for a sprained ankle that just needs rest, while also limiting patients' exposure to harmful radiation. It also gives you a taste of how powerful regression analyses can be and how they can be used in healthcare. End of Audio Transcript.

Reference: Stiell, I. (n.d.). The Ottawa Ankle Rules. Retrieved February 2021, from: http://www.theottawarules.ca/ankle_rules

VIDEO: MACHINE LEARNING IN HEALTHCARE
This content was retrieved from Section 02, Slide 20 of 22 of the online learning module.

Regression also opens the door to far more advanced and exciting analytics that are used to predict variables. For example, regression forms the basis for techniques such as machine learning and artificial intelligence. In the last several years, great attention has been placed on how these techniques can be used to advance healthcare. Watch the video of Dr. Kiret Dhindsa, a brain-computer interface and machine learning expert, discussing his own work in image diagnostics. (8:04)

As you watch: Consider how machine learning relates to the statistics you learned about in this course and how these advanced techniques can be applied in healthcare.

Note: the specifics of this video will not be tested, but you should understand how machine learning relates to the concepts you learned about in this course.

Reference: Dhindsa, K., Smail, L., McGrath, M., Braga, L., Becker, S., & Sonnadara, R. (2018). Grading Prenatal Hydronephrosis from Ultrasound Imaging Using Deep Convolutional Neural Networks. 2018 15th Conference on Computer and Robot Vision (CRV), 80-87. Retrieved February 2021, from: https://proxy.queensu.ca/login?url=https://doi.org/10.1109/CRV.2018.00021

VIDEO: SURVEILLANCE STUDIES
This content was retrieved from Section 02, Slide 21 of 22 of the online learning module.

Another field that utilizes advanced analytics for prediction is data surveillance. Data surveillance refers to the process of pulling data into systems that store, combine, and analyze the data to identify patterns and trends, to inform policies, governance, marketing, etc. This involves data from websites, social media platforms, cameras, credit cards, mobile applications, and more. In 2020, a large focus of data surveillance became tracking and predicting COVID-19 outbreaks to guide government reopening policies. While this type of analysis can be quite powerful, it can also have serious consequences if used inappropriately. Watch the video by Dr. David Lyon, the director of the Surveillance Studies Centre at Queen’s, to learn about data surveillance and how it impacts daily life. (5:03)

Interesting Fact
If you’re interested, check out these resources on surveillance studies.
Queen’s Surveillance Studies Centre, for their current research projects.
Environics Analytics Prizm Postal Code Look Up, for detailed information that has been gathered by surveillance studies. You can look up your postal code and view information about the people who live in your neighbourhood, including interests, hobbies, cars, etc.

Note: the specifics of this video will not be tested; this information is provided to show you how machine learning can be applied more broadly in healthcare and to highlight work happening here at Queen’s.

Page Links:
https://www.youtube.com/embed/xtAa-f-1rTg?start=17
https://www.sscqueens.org/
https://prizm.environicsanalytics.com/

References:
Government of Canada. (March 8, 2021). National surveillance for Coronavirus disease (COVID-19). Retrieved February 2021, from: https://www.canada.ca/en/public-health/services/diseases/2019-novel-coronavirus-infection/health-professionals/interim-guidance-surveillance-human-infection.html
Public Health Ontario. (January 25, 2021). Ontario COVID-19 Data Tool. Retrieved February 2021, from: https://www.publichealthontario.ca/en/data-and-analysis/infectious-disease/covid-19-data-surveillance/covid-19-data-tool?tab=summary

SECTION 02: SUMMARY
This content was retrieved from Section 02, Slide 22 of 22 of the online learning module.

In this section, you learned about regression analysis. Specifically, you learned about the differences between correlations and regressions, lines of best fit, residuals, the least squares method for calculating the Y intercept and slope, and how to interpret regression output and the coefficient of determination.
At the end of the section, you were introduced to other advanced regression analyses and learned about some examples of how healthcare professionals and researchers are applying and building on these statistical techniques in practice, including the development of clinical assessment tools, the use of machine learning in healthcare and image diagnostics, and the field of data surveillance.

Continue to Section 03, the final section in Module 06, to learn more about statistical applications and considerations and to wrap up the course.

End of Section 02

SECTION 03: STATISTICAL CONSIDERATIONS FOR THE HEALTH SCIENCES

INTRODUCTION TO STATISTICAL CONSIDERATIONS FOR THE HEALTH SCIENCES
This content was retrieved from Section 03, Slide 2 of 19 of the online learning module.

In Module 01, you began the course by learning about the Problem, Plan, Data, Analysis, and Conclusion (PPDAC) cycle. Since then, you have learned about descriptive statistics, visualizations, outliers, sampling, probability, z scores, t tests, chi square tests, ANOVAs, correlations, and regression. In this section, you will reflect on the PPDAC cycle as a way to tie together course concepts. You will also learn about some final statistical considerations, such as data quality, significance, and data reproducibility.

QUESTION: STUDY DESIGN & METHODS
This content was retrieved from Section 03, Slide 3 of 19 of the online learning module.

The PPDAC cycle highlights that research is a cyclical and interconnected process; what happens in one phase greatly impacts the next. Answer the question using your knowledge from the course.

Question: Reflecting on the various statistical tests and assumptions covered in this course, describe a scenario where study design would directly impact your statistical methods.

Feedback: Navigate to the next page for feedback on this question from your instructor.
FEEDBACK: STUDY DESIGN & STATISTICS
This content was retrieved from Section 03, Slide 4 of 19 of the online learning module.

Certain research questions will require different types of analyses. Different analyses require data to be collected in certain ways (e.g. ordinal versus scale data). Because of this, it is imperative to ensure your statistics and methods are aligned. No amount of statistical sophistication can rescue a poorly designed study. Listen to Dr. Wagner discuss examples of how study design can impact statistics. (2:19)

Start of Audio Transcript: Statistics and methods are extremely intertwined. Through the PPDAC cycle we can see that what you do in one stage greatly impacts the next. For example, what the problem is influences what research question you want to ask; the research question you want to ask impacts the data you collect; the data you collect impacts your analyses; your analysis impacts your conclusions; and then your conclusions inevitably lead to future questions. If there is a disconnect between any of these steps, the research quality will suffer. In this course, everything you learned about fits somewhere in this PPDAC cycle. For example, in Module 01 you learned about different levels of measurement and how those levels impact the descriptive statistics and graphs you can use. In Module 02 you learned about sampling strategies, sampling biases, and that planning your study to collect a randomly selected sample is the best way to ensure a sample can be used to estimate a population. In Modules 03 to 06 you learned that the type of data collected and the research question determine the type of statistical analysis you will want to conduct. For example, the number of groups you have, the type and level of measurement (i.e. counts versus measures, scale versus ordinal data), and whether you want to explore the difference between groups, relatedness between groups, or whether one group predicts another, all come together to determine the best statistical test. I think you can continue to build your knowledge in upper year research courses such as HSCI 270 and HSCI 383 and hope that this course is just starting to get you involved in research and methods, and this type of work. End of Audio Transcript.

QUESTION: SAMPLE SIZE
This content was retrieved from Section 03, Slide 5 of 19 of the online learning module.

Building on this idea of methods and statistics being intertwined, in Modules 02 and 03 you learned that increasing your sample size (i.e. having more data) will result in a better approximation of the true population. Answer the question using your knowledge from the course.

Is more data always better?

Feedback: Navigate to the next page for feedback on this question from your instructor.

FEEDBACK: SAMPLE SIZE
This content was retrieved from Section 03, Slide 6 of 19 of the online learning module.

Much of statistics, including the central limit theorem, assumes that you are working with random observations. You also learned that there are a number of assumptions that must be met for each statistical test. If you are working with non-random or biased data, or your assumptions are not met, the statistical tests cannot be applied reliably. Therefore, it is not just the quantity but also the quality of data that matters. More bad data is still bad data. In statistics, this is referred to as the “Garbage In, Garbage Out” (GIGO) phenomenon. It doesn’t matter how much data you have if it is poor quality data to begin with.

Continue to the next page to learn about data quality issues in healthcare.

Reference: Simpson, M. (2019, January 28). CURMUDGUCATION: AI: Bad data, bad results.
Retrieved February 2021, from: http://bigeducationape.blogspot.com/2019/11/curmudgucation-ai-bad-data-bad-results.html

DATA QUALITY & MACHINE LEARNING
This content was retrieved from Section 03, Slide 7 of 19 of the online learning module.

At the start of the course you were introduced to Electronic Medical Records (EMR) systems and how they are used to collect patient information. With advancements in machine learning, there is interest in using EMR data to identify patterns between certain risk factors and health conditions. However, the utility of these machine learning algorithms relies on the quality of information that is inputted into the EMR in the first place. Listen to Dr. Kiret Dhindsa discuss issues of data quality and machine learning in healthcare. (3:38)

Start of Audio Transcript: So the way I think about this is that power and flexibility, when it comes to machine learning and pattern recognition in general, is sort of a double-edged sword. Essentially what we see is that just like humans, machine learning can confuse correlation for causation. And more interestingly, it can be fooled into seeing patterns that aren't really there. And we know that human brains, while also being really advanced in pattern recognition, have the same problems. I'll give two examples of how this happens that are specifically related to healthcare. So commonly, something that we have in healthcare is that missing data makes it difficult to establish correct correlations between variables. So this happens pretty often when we have some patients who don't agree to take a certain test or go through a certain procedure, or who just don't go to a follow up appointment. Now, the reasons that these data are missing are not actually random, and that means that they have non-random or meaningful effects on the correlations among variables and sometimes those can be clinically relevant.
But the machine learning model has no way of assessing what those correlations might actually look like. And so the correlations that it extrapolates to the outcome variables can also be distorted. A second factor is that there are large inconsistencies in data collection, and machine learning cannot always tell whether the different inconsistencies in data are clinically important or not. So crucially in healthcare, different clinics and institutions collect and store data in their own ways and have different data standards. And that creates differences in datasets to the point that it's often easier for a machine learning algorithm to detect which clinic a patient is from, rather than what their medical condition is. And this, essentially, acts as a major confounding variable for machine learning solutions. So a solution that we need, and many people now are advocating for in healthcare, is that we need to standardize data collection and data quality metrics and standards across the healthcare system so that we can actually have a more complete picture of each patient, and at the same time pool larger patient populations together so that we can explore more of the intricacies and nuances of the correlations between clinical variables and different outcome variables that we're interested in. So I hope this gives you a little bit of insight into artificial intelligence, its connection to statistics, and its role in healthcare. I would just, you know, hope that people keep in mind that whatever your role ends up being in healthcare, it's increasingly likely that artificial intelligence and data science will play a growing role in it.
So it's important to be able to tell the difference between fact and fiction, and to separate the hype from the reality of what machine learning and artificial intelligence actually can do for healthcare, particularly for a field that's still struggling to separate itself from the way it's been presented in science fiction. So good luck with the rest of your term. End of Audio Transcript.

MEANING OF SIGNIFICANCE
This content was retrieved from Section 03, Slide 8 of 19 of the online learning module.

Now that you have learned about data quality, the next statistical concept you will revisit is the idea of statistical significance. Originally, the term significant meant that the computation signified or showed something. Now, the term has a number of meanings, some of which are technical and others less so. In this course, you learned to identify a significance level (usually α = .05) prior to an analysis, and then compare your calculated p value to your predetermined significance level. If p > α you would fail to reject the null, and if p < α you would reject the null. While this is a critical component of inferential statistics, it is also important to realize the limitations of statistical significance.

Continue to the next page to learn about an important distinction between statistical significance and other types of significance.

STATISTICAL VERSUS CLINICAL VERSUS BIOLOGICAL SIGNIFICANCE
This content was retrieved from Section 03, Slide 9 of 19 of the online learning module.

Just because something is statistically significant does not mean it is clinically or biologically significant (or vice versa). Learn more about statistical vs. clinical vs. biological significance.

Statistical Significance
Statistical significance reflects how likely an observed result would be if it were due to random chance alone.
This is determined by comparing a probability value derived in a statistical analysis to a predetermined significance level (e.g. α = .05).

Clinical Significance
Clinical significance is when an event or difference is meaningful for a clinical reason. For example, whether the change makes a real difference to patients’ lives. This has nothing to do with statistics, but is based on expert opinion and the literature.

Biological Significance
In lab-based research, biological significance is similar to clinical significance. It refers to whether the finding has biological relevance. This, again, has nothing to do with statistics, but is informed by expert opinion on biological processes.

Reference: Ranganathan, P., Pramesh, C. S., & Buyse, M. (2015). Common pitfalls in statistical analysis: Clinical versus statistical significance. Perspectives in clinical research, 6(3), 169-170. Retrieved February 2021, from: https://proxy.queensu.ca/login?url=https://doi.org/10.4103/2229-3485.159943

EXAMPLE: STATISTICAL VERSUS CLINICAL SIGNIFICANCE
This content was retrieved from Section 03, Slide 10 of 19 of the online learning module.

A good example of statistical versus clinical significance is biomechanical studies exploring ACL reconstruction*. Specifically, whether there is a difference in tensile strength* with different types of grafts, such as hamstring grafts, quadriceps grafts, patellar tendon grafts, or allografts*. Studies have found that there are statistically significant differences in tensile strength between ACL graft types; however, which graft is used does not seem to impact patient outcomes. Therefore, while there are statistically significant differences, there are no clinically significant differences. This point is very important for surgeons selecting which type of graft to use.

Definitions*:
ACL Reconstruction: Common surgery to repair a torn anterior cruciate ligament (ACL).
Tensile Strength: How much tension can be placed on the material before it breaks/tears.
Allografts: Tendon graft from a cadaver.

References:
Widner, M., Dunleavy, M., & Lynch, S. (2019). Outcomes Following ACL Reconstruction Based on Graft Type: Are all Grafts Equivalent? Current Reviews in Musculoskeletal Medicine, 12(4), 460-465. Retrieved February 2021, from: https://proxy.queensu.ca/login?url=https://doi.org/10.1007/s12178-019-09588-w
Dewan, A. K. (n.d.). Anterior Cruciate Ligament Tears - Choosing the Best ACL Graft. Retrieved February 2021, from: https://www.drdewan.com/blog/anterior-cruciate-ligament-tears-choosing-the-best-acl-graft

PERSPECTIVES ON STATISTICAL SIGNIFICANCE
This content was retrieved from Section 03, Slide 11 of 19 of the online learning module.

Too often, individuals will pay attention to research that is statistically significant, and disregard that which is not. This should not be the case. It is very important to recognize that statistical, clinical, and biological significance are not the same thing.

Continue to the next page to learn about data reproducibility and its relationship with the emphasis on statistical significance.

Reference: Clinicwise. (2016, March 13). Significance. Retrieved February 2021, from: https://clinicwise.org/beginners-guide-reading-massage-therapy-research/significance/

DATA REPRODUCIBILITY
This content was retrieved from Section 03, Slide 12 of 19 of the online learning module.

Data reproducibility* has long been considered an important part of the scientific method. Throughout this course, a large emphasis was placed on describing your statistical methods and findings in great detail. One reason reproducibility was emphasized is that the scientific field is currently experiencing an issue called the replication crisis*, and some of this issue can be traced back to statistics and methodology training.
The remainder of this section will focus on the replication crisis and how it relates to what you have learned in this course regarding statistics. Throughout these slides, you will hear Dr. Anita Acai, a faculty member at McMaster University, describe her experiences with data reproducibility and the replication crisis.

Definitions*:
Data Reproducibility: The ability to reproduce, or replicate, findings.
Replication Crisis: The inability to reproduce scientific results.

References:
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251). Retrieved February 2021, from: https://proxy.queensu.ca/login?url=https://doi.org/10.1126/science.aac4716
Ioannidis, J. P. (2005). Why most published research findings are false. PLOS Medicine, 2(8). Retrieved February 2021, from: https://proxy.queensu.ca/login?url=https://doi.org/10.1371/journal.pmed.0020124 INTRODUCT