Introduction To Statistics PDF
Dave MacLean
Summary
This document provides an introduction to basic statistical concepts and methods, suitable for undergraduate biology students. It covers topics like data collection, descriptive statistics, and analyses of different data types.
Full Transcript
Introduction to statistics
Dave MacLean

[Diagram: the scientific cycle. Observation or question, hypothesis, test experimentally, conclusions, fame & fortune.]

How tall are pine trees?
Measure every pine tree... and you get the population mean ($\mu$).

It is much more practical to measure a sample of trees and extrapolate from the few to the many (i.e. from the sample mean to the population mean). This is the essence of inferential statistics: estimating the 'truth' about a population from an experimentally accessible subset.

The distinction is important, so the sample statistics get their own symbols:
- Population mean: $\mu$. Sample mean: $\bar{x}$.
- Population standard deviation: $\sigma$. Sample standard deviation: $s$.

[Figure: panels labelled 100k, 4k, 400, 40 and 4]

Sample mean (sum divided by count): $\bar{x} = \frac{\sum_i x_i}{N}$
Sample standard deviation: $s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{N - 1}}$

How good is your estimate of the population mean? How close is $\bar{x}$ to $\mu$?
Confidence interval: $CI = \bar{x} \pm z \frac{s}{\sqrt{N}}$
Standard error of the mean: $s.e.m. = \frac{s}{\sqrt{N}}$

As N increases:
- Sample mean -> population mean
- Sample standard deviation -> population standard deviation
- CI and SEM decrease

When do I use SD versus SEM (or CI)?
- If you want to show the variation in your data, use SD.
- If you want to show how precisely you have estimated the mean, use SEM.
- Best practice is probably to show the individual data points anyway!

[Figure: example distributions. Bimodal distribution, normal distribution, skewed distribution.]

Dealing with outliers
[Figure: example data set; population mean is 800 µM]
- Is it a mistake in data entry? Then fix it.
- Did anything happen during the experiment that explains it? Then remove the point.
  Apply the same rule to the 'good' points too.
- Are you sure the data are normally distributed? Try a lognormal distribution.
- Could there actually be some interesting biology going on?
- Is this sample drawn from the same population as the rest?

Outlier test
- Calculate the mean and SD using the other data points. Is this point more than 3 SD from their mean?
- If so, removing it is likely justified, because the outlier will unreasonably distort your estimate of the population.

Average: $\bar{x} = \frac{\sum_i x_i}{N}$
s.d.: $s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{N - 1}}$
s.e.m.: $s.e.m. = \frac{s}{\sqrt{N}}$

Biological replicates measure biologically distinct samples to capture the mean and dispersion of the biological population.
Technical replicates are repeated measurements of the same biological sample. They help remove instrument noise and improve the accuracy of the measurement of that one sample.

Distinguishing biological from technical replicates
You hypothesize that treating mice with a new drug reduces serum albumin levels. You obtain three independent blood samples from a treated mouse and from a control mouse. You measure serum albumin in each of these samples 5 times. What is 'n' for each group?

Hypothesis: pine trees and bonsai trees are different sizes
Hypothesis: pine trees and maple trees are different sizes
Pine tree population mean $\neq$ maple tree population mean

- A new, unknown tree is 150 ft tall.
- You measure it 3 times (3 technical replicates) just to be sure.
- About 95% of pine trees are taller than 150 ft, so it might not be a pine tree, but there is a chance it could be.
- You measure more of them...

As we measure more of these unknown trees, they seem to be smaller, on average, than pine trees. But it is still possible that they are pine trees and we just got a sample with an unusually small mean. What are the chances of that? This is what a t-test does.
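The question "what are the chances of that?" can be explored with a few lines of Python. The sketch below is an illustration under stated assumptions, not the procedure from the slides: the tree heights are made up, the function names `t_statistic` and `permutation_p_value` are hypothetical, and the usual t-distribution lookup is swapped for a permutation test (shuffle the group labels many times and count how often the shuffled data look at least as extreme as the real data). The t statistic itself is the standard pooled-variance form, which assumes roughly equal variances.

```python
import random
import statistics


def t_statistic(a, b):
    """Two-sample t statistic with pooled variance (assumes equal variances)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    standard_error = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def permutation_p_value(a, b, n_shuffles=10_000, seed=0):
    """Fraction of random relabellings whose |t| is at least the observed |t|."""
    rng = random.Random(seed)
    observed = abs(t_statistic(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_shuffles):
        # Pretend both groups come from one population: shuffle and re-split.
        rng.shuffle(pooled)
        if abs(t_statistic(pooled[:len(a)], pooled[len(a):])) >= observed:
            hits += 1
    return hits / n_shuffles


# Made-up heights in feet, for illustration only.
pine = [212, 198, 205, 220, 191, 208]
unknown = [150, 171, 162, 158, 177, 166]

t = t_statistic(pine, unknown)
p = permutation_p_value(pine, unknown)
print(f"t = {t:.2f}, permutation p = {p:.4f}")
```

A small p here means that relabelled data rarely look as extreme as the real data, which is evidence against the two groups coming from the same population. It is not proof that they differ.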
A t-test assumes that pine trees = unknown blue trees, then asks: 'How often would we get these kinds of results, just by chance, from the same population?' If that is very unlikely, then we have evidence that these trees are different.

Stats never prove things are different! They only give the probability of getting results like these if the samples came from the same population.

Hypothesis: pine trees and bine trees are different sizes
Remember it's ALL trees, not just the ones in your sample.

[Figure: panels labelled "Statistically significant" and "Substantial", each marked with *]

Assumptions of a t-test
- Samples are independent biological replicates
- Samples are drawn from normally distributed populations
- Sample populations have approximately equal variance

Useful things
These are very helpful for general stats (including t-test) and ANOVA:
https://www.youtube.com/watch?v=kyjlxsLW1Is&list=PL_MN5auU2qk2XSw_WmyhpAn3upzFtIpXJ&index=1&t=2123s
https://www.youtube.com/watch?v=ITf4vHhyGpc
MIT course: https://www.youtube.com/watch?v=VPZD_aij8H0&t=2746s
Prism guide: https://www.graphpad.com/guides/prism/latest/statistics/stat_---_principles_of_statistics_-.htm
Small glimpse of big data, principal component analysis: https://www.youtube.com/watch?v=_UVHneBUBW0&t=648s
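As a recap of the summary statistics and the 3 SD outlier rule of thumb covered in this talk, here is a minimal Python sketch. The data are made up, the helper names `summarize` and `is_outlier` are hypothetical, and z = 1.96 is assumed for an approximate 95% confidence interval. The outlier check follows the rule as stated: the mean and SD are computed from the other points, not from the point being tested.

```python
import math
import statistics


def summarize(values, z=1.96):
    """Return mean, sample SD (N - 1), SEM and a z-based confidence interval."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # divides by N - 1
    sem = sd / math.sqrt(n)
    return {"n": n, "mean": mean, "sd": sd, "sem": sem,
            "ci": (mean - z * sem, mean + z * sem)}


def is_outlier(values, index, threshold=3.0):
    """Check one point against the mean and SD of all the other points."""
    others = values[:index] + values[index + 1:]
    distance = abs(values[index] - statistics.mean(others))
    return distance > threshold * statistics.stdev(others)


# Made-up concentrations in uM, for illustration only.
data = [790, 805, 812, 798, 801, 795, 1200]

print(summarize(data))
print(is_outlier(data, 6))

# Keep only the points that pass the check, then summarize again.
cleaned = [v for i, v in enumerate(data) if not is_outlier(data, i)]
print(summarize(cleaned))
```

Comparing the two summaries shows how much a single extreme point can move the mean and inflate the SD and SEM. The check is only a numerical aid; the questions about data entry, experimental problems and interesting biology still come first.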