Statistics & Data Analysis Lecture 2 2024 PDF
Document Details
University of Surrey
2024
Youngchan Kim
Summary
This document contains a lecture on statistics and data analysis, specifically for the Analytical and Clinical Biochemistry course (BMS2043) at the University of Surrey. The lecture, from Spring 2024, covers inferential statistics, hypothesis testing, and various statistical tests.
Full Transcript
Statistics & Data Analysis
Analytical and Clinical Biochemistry (BMS2043), Spring 2024
Lecture 2
Youngchan Kim, PhD
Lecturer in Quantum Biology, University of Surrey
[email protected] | 01AZ04

Inferential Statistics, Part 1
Inferential statistics
Relationships between variables

Inferential statistics
Inferential statistics is based on a hypothesis. Its basic concept is hypothesis testing: using sample data to test an assumption about a population parameter.
A formal statistical test is needed to guide policy making, health care choices, etc., e.g. cancer screening in the general population or vaccinations (Covid-19).
Tests, depending on the nature of the data:
  χ² test
  t-test
  Mann-Whitney U-test
  Analysis of variance (ANOVA)
  Linear and logistic regression
  Several others, e.g. survival analysis (beyond the scope of the lectures)
The test results are evaluated in terms of statistical significance.
BMS2043 – Statistics and Data Analysis, 2024

Steps in performing a statistical test
1. Formulate the null and alternative hypotheses (H0 vs. H1)
2. Evaluate the data and choose an appropriate statistical test for the data
3. Perform the statistical test
4. Obtain the test statistic and P-value
5. Evaluate the statistical significance of the result
6. Reject the null hypothesis or fail to reject it (often loosely called "accepting" it)

Hypothesis
Null hypothesis H0
Alternative hypothesis H1
Direction of the effect
One-tailed vs two-tailed test
Example: One might believe that no person in the UK is taller than 2.30 m. After measuring 100, 1,000 or 100,000 people and verifying that they are all shorter than 2.30 m, it is still not proven that none of the remaining individuals is taller than 2.30 m. However, as soon as the first individual taller than 2.30 m is found, the hypothesis is rejected.
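The asymmetry in this example can be sketched in a few lines of Python. This is only an illustration: the heights are made-up values rather than real measurements, and the function name check_claim is hypothetical.

```python
# Hypothetical heights in metres; illustrative values only, not real data.
LIMIT_M = 2.30


def check_claim(heights_m, limit_m=LIMIT_M):
    """Test the claim 'nobody is taller than limit_m' against a sample.

    One counterexample is enough to reject the claim. Finding none does
    not prove it, because unmeasured people may still exceed the limit.
    """
    for height in heights_m:
        if height > limit_m:
            return "rejected"
    return "not rejected"


sample_a = [1.62, 1.75, 1.88, 2.01]   # no counterexample in this sample
sample_b = sample_a + [2.31]          # a single counterexample added

print(check_claim(sample_a))
print(check_claim(sample_b))
```

However large sample_a becomes, the function can only ever report "not rejected"; it never reports "proven".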
So, rejecting a hypothesis is often a more feasible task than proving it.
A null hypothesis is the statement that is considered to be true unless the data provide sufficient evidence to reject it.

One-tailed vs two-tailed test
A test conducted to show whether the sample mean differs significantly from the population mean in either direction (greater than or less than) is a two-tailed test.
A test set up to show a difference in one pre-specified direction only (the sample mean is higher than the population mean, or alternatively lower) is a one-tailed test.
How do we determine if it is a one-tailed or two-tailed test?
A one-tailed test looks for an increase, or for a decrease, in a parameter (one specified direction).
A two-tailed test looks for a change, which could be a decrease or an increase.

Example 1: FTO and BMI
Based on previous research, a scientist has an a priori hypothesis that body mass index (BMI) in Europeans is increased due to variations in the FTO gene. The scientist has data on the UK population to study the effects of FTO on BMI.
Which is the null hypothesis here?
BMI in Europeans is either decreased or not changed due to variations in the FTO gene.
Which is the alternative hypothesis?
BMI in Europeans is increased due to variations in the FTO gene.
Is the test one or two-tailed?
The alternative hypothesis contains an "increase". This means that instead of performing a two-tailed test, we perform a right-sided (upper-tailed) one-tailed test.
After testing the effect of variation in the FTO gene on BMI, the scientist gets a test statistic with an associated P-value = 0.004. What is your conclusion?

Example 2: FTO and BMI
Based on previous research, a scientist has an a priori hypothesis that body mass index (BMI) in Europeans is not changed due to variations in the FTO gene. The scientist has data on the UK population to study the effects of FTO on BMI.
Which is the null hypothesis here?
BMI in Europeans is not changed due to variations in the FTO gene.
Which is the alternative hypothesis?
BMI in Europeans is changed due to variations in the FTO gene.
Is the test one or two-tailed?
The alternative hypothesis states only that BMI "is changed", without specifying a direction. This means we have to use a two-tailed test.

Statistical significance and P-value
Statistical significance: the observed result is unlikely to be due to chance alone.
NB! Biological/clinical significance?
P-value: the probability of observing the result (test statistic), or a more extreme one, given that the null hypothesis is true.

P-value           Usual interpretation
p < 0.05          a statistically significant difference
p < 0.01          a very significant difference
p < 0.001         a highly significant difference
0.05 < p < 0.1    a "borderline significant" difference: there is not sufficient evidence of a real difference, but it may be recommended to increase the sample size to get a clearer picture

The winner is the one with the smallest P-value

What is wrong with this definition?
Harris and Taylor: Medical statistics made easy, 2003

Example 1: FTO and BMI
Based on previous research, a scientist has an a priori hypothesis that body mass index (BMI) in Europeans is increased due to variations in the FTO gene. The scientist has data on the UK population to study the effects of FTO on BMI.
Which is the null hypothesis here?
BMI in Europeans is either decreased or not changed due to variations in the FTO gene.
Which is the alternative hypothesis?
BMI in Europeans is increased due to variations in the FTO gene.
After testing the effect of variation in the FTO gene on BMI, the scientist gets a test statistic with an associated P-value = 0.004. What is your conclusion?
Is the test one or two-tailed?
The alternative hypothesis contains an "increase".
This means that instead of performing a two-tailed test, we perform a right-sided (upper-tailed) one-tailed test.

Cautious notes about the use of P-value
A large P-value never proves the absence of an effect; it only allows us to conclude that there is insufficient evidence that the effect is present!
The cut-off of 0.05 is quite arbitrary (Sir Ronald Fisher, 1890-1962):
Cook, Chad. "Five per cent of the time it works 100 per cent of the time: the erroneousness of the P value." Journal of Manual & Manipulative Therapy 18.3 (2010): 123-125. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3109681/

Statistical test
Depends on:
  The type of data (frequencies vs continuous, normal vs non-normal, etc.)
  Research question
  Sample size

Test of correlation
Two commonly used measures:
1. Pearson correlation: quantitative traits, linear relationship
2. Spearman correlation:
   Quantitative or ordinal data, e.g. Likert scale (1, 2, 3, 4, 5)
   Does not require a linear relationship; however, the data must follow a monotonic relationship
   Good for non-normal data
   Based on ranks of the data
Both give a correlation coefficient r, -1 ≤ r ≤ 1
A P-value can be calculated for the correlation coefficient. (NB! With large sample sizes, even weak correlations may be statistically significant.)

Test of frequencies: χ² test
The test statistic χ² (chi-square) represents how the observed (O) data/frequencies deviate from the expected (E).
χ² follows another probability distribution and ranges from 0 to ∞.
$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$, summed over all cells i of the table
1. Subtract each expected number from each observed number
2. Square the difference
3. Divide each squared difference by the expected number for that cell
4. Sum the resulting values over all cells
Observed frequencies
          HTN, no   HTN, yes   Total
Male      2035      509        2544
Female    2625      148        2773
Total     4660      657        5317

Expected frequencies
          HTN, no   HTN, yes   Total
Male      2230      314        2544
Female    2430      343        2773
Total     4660      657        5317

*HTN = Hypertension
For instance, in the table of expected values, 2230 ≈ (2544/5317) × 4660, 314 ≈ (2544/5317) × 657, and so on (rounded to whole numbers).

χ² is the sum of (O − E)²/E:
χ² = (2035 − 2230)²/2230 + (509 − 314)²/314 + (2625 − 2430)²/2430 + (148 − 343)²/343 = 264.6585
χ² = 264.6585 with df = 1 gives a P-value < 2.2e-16
NB! A simple rule is that the degrees of freedom (df) is
df = (number of rows − 1) × (number of columns − 1) = (2 − 1) × (2 − 1) = 1

[Figure: χ² distribution. Source: Wikipedia]

Fisher’s exact test
To be used when sample size is small (cell sizes
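As a check on the worked χ² example for the hypertension (HTN) table above, the expected counts and the test statistic can be recomputed with a short standard-library Python sketch. The function names are hypothetical, and the P-value line uses a closed form that holds only for df = 1 (it is not a general χ² P-value routine).

```python
import math

# Observed counts from the HTN table: rows = Male, Female; columns = HTN no, HTN yes.
observed = [[2035, 509],
            [2625, 148]]


def expected_counts(table):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(obs, exp):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(obs, exp)
               for o, e in zip(obs_row, exp_row))


expected = expected_counts(observed)
# The slide rounds the expected counts to whole numbers before computing chi-square.
expected_rounded = [[round(e) for e in row] for row in expected]

chi2_rounded = chi_square(observed, expected_rounded)  # slide's calculation
chi2_exact = chi_square(observed, expected)            # unrounded expected counts

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Valid for df = 1 only: P(chi2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2_rounded / 2))

print(expected_rounded)
print(round(chi2_rounded, 4), round(chi2_exact, 2), df, p_value)
```

With the rounded expected counts this reproduces the slide's value of about 264.66. Using unrounded expected counts gives a slightly smaller statistic (about 263.7), and the conclusion is unchanged.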