Statistics Lecture Notes PDF
Document Details
Joseph Trigiante
Summary
These lecture notes cover various aspects of statistics, including descriptive and inferential statistics, sampling methods, and the concept of correlation. The notes also explain the importance of understanding the different types of variables for effective statistical analysis.
Full Transcript
STATISTICS LS4003
Joseph Trigiante

PREVIOUSLY…
In the last lecture we saw the bases of DESCRIPTIVE STATISTICS. All real-world measurements are affected by noise (or error), and this noise almost always takes the form of a normal distribution.

[Figure: frequency histogram of the 16 possible left/right step sequences (LLLL, LLLR, …, RRRR) grouped by final position, from far left (6%) through centre (39%) to far right (6%), illustrating how accumulated random steps produce a bell-shaped curve.]

The normal distribution has a measure of centrality (the mean) and a measure of dispersion around it (the standard deviation). We learned about the Z parameter and how to use it to determine the probability of a measurement occurring by chance. We also saw non-normal distributions and why they exist, and we identified the median as the measure of centrality and the interquartile range (IQR) as the measure of dispersion.

[Figure: a skewed distribution annotated with the mean, median, standard deviation (SD), quartiles Q1 and Q3, and the IQR.]

TODAY
In this lecture we will:
- Complete descriptive statistics with the concepts of population and sample
- Begin inferential statistics with its general concepts
- Talk about correlation, one of the statistical tests

SAMPLING

POPULATION
In statistics a population is the group of people (or objects) whose properties you are interested in. Each of these population properties is called a parameter. Examples are:
- The income (parameter) of the UK population
- The proportion of gray squirrels (parameter) among the squirrel population
- The temperatures (parameter) of the days of the year (population)

To measure the value of a parameter we should poll the whole population, but often this is impossible because:
- It's impractical
- It's too expensive
- It's destructive

Example: to know the percentages of political parties within the population we would have to interview tens of millions of people. This is done only at general elections; it would be too expensive otherwise.

Or suppose you are making bullets and want to know how many are faulty. How would you do it? Of course you could fire them all and check, and you'd get the exact value. But then you'd have nothing to sell. The same goes for pills: to analyse them you need to destroy them, so you can't test the whole population.

SAMPLING
This is where the concept of sampling arises. We take a bit of the population and measure the parameter there, assuming the value will be the same if the sample is properly taken. The parameter measured within the sample is called a statistic. Example: opinion polls (including exit polls) gather a given opinion from a sample of people, assuming it will be the same as that of the whole population.

It is important that the sample is truly representative of the population, that is, unbiased: the sample choice must have nothing to do with what you want to measure. Example of BAD sampling: "Do you have a phone?" "Er, yes." Conclusion: "100% of Brits now have a phone."

SAMPLING: STANDARD ERROR OF THE MEAN
If we take n measurements of the parameter, each with its own mean and standard deviation, we can combine them to reduce the overall standard deviation. The "reduced" standard deviation is called the standard error of the mean (SEM) and its value is

SEM = σ / √n

Remember, you need n measurements, each with its own standard deviation; this is the only case in which you can use the SEM. It is not a cheap way to knock down the standard deviation.

INFERENTIAL STATISTICS
This is the part of statistics where we draw conclusions from comparing sets of data. As we saw last time, we want to tell whether there is a signal or whether what we see is just noise.
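A minimal sketch of the SEM formula in R, using simulated data (the numbers are invented for illustration, not from the lecture): the spread of the mean of n noisy measurements is smaller than the spread of a single measurement by a factor of √n.

```r
# Minimal sketch (invented data): the SEM shrinks measurement noise by sqrt(n).
set.seed(1)
measurements <- rnorm(25, mean = 100, sd = 4)  # n = 25 noisy measurements

n   <- length(measurements)
s   <- sd(measurements)  # sample standard deviation (estimate of sigma)
sem <- s / sqrt(n)       # SEM = sigma / sqrt(n)

s    # spread of an individual measurement (about 4)
sem  # spread of the mean of all 25 (about 4/5 = 0.8)
```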
Before going into the detail of the tests, let's cover some general concepts which apply in all cases.

VARIABLES
Every statistical analysis involves two kinds of variables:
- A predictor (or independent) variable is the parameter that you think may be causing an effect (signal).
- An outcome (or dependent) variable is the parameter that you think may be showing this effect.
This is important: predictor and outcome are your assumptions; statistics won't tell you which one causes which.

Examples:
- Taking aspirin (predictor) will cut the number of heart attacks (outcome) in the long run
- Living in Australia (predictor) will make skin cancer more likely (outcome)
- A diet low in carbohydrates (predictor) will cause weight loss (outcome)

We will normally see the predictor variable on the X axis and the outcome on the Y axis.

VARIABLES: CONTINUOUS AND CATEGORICAL
Each of the two variables can be of two kinds:
- Continuous: a number which can take many values and can be ordered (e.g. blood pressure, temperature, age)
- Categorical: a label which can take few values and which has no order (e.g. sex, country, smoking status)
Each combination of the two variables as predictor and outcome requires its own statistical test, and we will examine them individually later.

[Figures: example plots for the combinations: categorical predictor with continuous outcome; continuous predictor with continuous outcome (two examples); categorical predictor with categorical outcome.]

HYPOTHESES
Another concept common to all of inferential statistics is the pair of hypotheses. These are basically the possible answers to the main question: is there a signal?

H0, the null hypothesis: no joy, it's just noise, no effect, tough luck, insignificant, no signal, coincidence, nothing there.

H1, the alternative hypothesis: it's not noise, there is an effect, there is a signal, significant, lucky me, yes!, no coincidence, it's true.

So we can rephrase the main goal of inferential statistics: determining which hypothesis (H0, the null, or H1, the alternative) we can accept and which we must reject.

P-VALUE
The last concept common to all tests is the p-value. This is calculated by all the tests we will see and is the probability of H0 being true, or in other words the probability that the effect we see is a result of chance. The lower the p-value, the more confident we are of H1 and of a signal.

P-VALUE: MEANING
Because p is the probability of you being wrong (false positive), 1 - p is the probability of you being right (true positive). Example: p = 0.05 means you're 95% (0.95) likely to be right in seeing a signal.

P-VALUE: SIGNIFICANCE
The p-value is quantitative: it's a number between 0 and 1. But we need to make a qualitative decision: accepting or rejecting H0 (or H1). So we need to set an (arbitrary) rejection threshold. Usually this is taken as 0.05 (5%). So:
- If p < 0.05, we reject H0 and accept H1 (significant)
- If p ≥ 0.05, we accept H0 (insignificant)

[Figure: example scatter of N points with an insignificant correlation.]

CORRELATION
This test applies when both variables are continuous: it produces the correlation coefficient r and a p-value that depends on r and on the number of points N. In your next R workshop you will see how easy it is to calculate the r coefficient and the p-value from it using the functions cor() and cor.test().

CORRELATION IS NOT CAUSATION
Once you've established there is a correlation between two variables, it is important NOT to jump to the conclusion that one variable is causing the other to change (causation), because it could be the other way around or, more commonly, a third variable is influencing both.
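The functions cor() and cor.test() are standard R; the height/weight numbers below are hypothetical values invented for illustration. A minimal sketch of what the workshop will cover:

```r
# Hypothetical data: predictor and outcome both continuous.
height <- c(160, 165, 170, 172, 175, 178, 180, 183, 186, 190)  # cm
weight <- c( 55,  59,  64,  66,  70,  72,  75,  80,  83,  88)  # kg

cor(height, weight)       # Pearson's r coefficient alone
cor.test(height, weight)  # r together with the p-value for H0 (no correlation)
```

cor.test() prints the r estimate, the p-value and a confidence interval; comparing that p-value with the 0.05 threshold gives exactly the accept/reject decision described above.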
"Predictor" and "outcome" are only your definitions.

[Diagram: if variables A and B are correlated, three causal structures are possible: A causes B; B causes A; or a third variable C drives both A and B.]

CORRELATION IS NOT CAUSATION
Let's see some examples, together with the statistical hypothesis we can make.

H1: height correlates with weight. r = 0.85, N = 29, p-value = 0.00001: H1 accepted. Height relates to weight. In this case we accept causation because we have a logical explanation. (Data: www.statology.org)

H1: blood pressure correlates with age. [Figure: maximum blood pressure (mmHg, 80-130) against age (years, 20-45).] r = 0.988, N = 7, p-value = 0.00003: H1 accepted. Blood pressure rises with age. Here also we have a logical explanation, so A causes B.

A typical error in conclusions: let's replot that. H1: age correlates with blood pressure. [Figure: the same data with the axes swapped, age (years) against blood pressure (mmHg).] r = 0.988, N = 7, p-value = 0.00003: H1 accepted. Does high blood pressure age you? Certainly not in that sense! Just because a parameter is on the X axis doesn't mean it's the cause of Y.

A more complex example. H1: ice cream consumption (Var A) correlates with drowning deaths (Var B). r = 0.75, N = 12, p-value = 0.005: H1 accepted. But eating ice cream won't drown you, nor will drowning make you eat ice cream! Both are caused by a third variable (Var C): summertime.

H1: coffee consumption correlates with IQ. r = 0.05, N = 28, p-value = 0.800: H1 rejected. An example of no correlation whatsoever. And that also means intelligent people don't necessarily choose coffee.

CORRELATION IS NOT CAUSATION
Correlation is only a mathematical parameter and is either there or not, but it implies nothing about causation. Let's see some examples from this zany site (https://www.tylervigen.com/spurious-correlations):

H1: popularity of the name "Andrea" correlates with cottage cheese consumption. r = 0.977, p-value < 0.01: H1 accepted.

H1: divorce rate in Maine correlates with margarine consumption. r = 0.992, p-value < 0.01: H1 accepted.

CORRELATION
Now let's do some testing ourselves. Let's find out if it's true that the world is heating up. We need to state the question in a statistical way: H1 vs H0.
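One way to see the replotting point concretely: Pearson's r is symmetric in its two arguments, so swapping predictor and outcome changes nothing in the test. A quick sketch with made-up numbers in the spirit of the blood-pressure example:

```r
# Correlation is symmetric: the test alone cannot tell cause from effect.
age      <- c(20, 24, 28, 32, 36, 40, 44)       # years (invented data)
pressure <- c(92, 97, 101, 106, 112, 118, 124)  # mmHg (invented data)

cor(age, pressure)  # r for "pressure depends on age"
cor(pressure, age)  # exactly the same r for "age depends on pressure"
```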
H1: there is a positive correlation between time and temperatures.
H0: there is no correlation between time and temperatures.

CORRELATION
[Figure: world mean yearly temperature differences (Delta T, vs the 1901-2000 average) from 1850 to the 2020s.] This is the chart of mean yearly temperature differences relative to the twentieth-century (1901-2000) average.

[Figure: the same Delta T series for the period 1850-1940.] Let's start with the period up to World War II (1940): r = 0.17, N = 91, p = 0.11. H1 rejected: no significant warming.

[Figure: the Delta T series from 1940 to today.] Now the period from World War II (1940) to today: r = 0.85, N = 84, p < 0.00001. H1 accepted: very significant warming.

THE SPEARMAN CORRELATION COEFFICIENT
Pearson's r is very good at detecting linear correlations and will work in most cases. There is a broader test available for correlations of any monotonic shape (e.g. an exponential or quadratic trend): the Spearman rank correlation coefficient (rho). It works in exactly the same way; use whichever works better for your data. [Figures: a linear correlation, where both tests work, and a nonlinear correlation, where Spearman works better.]

CORRELATION
Now you can have fun scientifically proving any correlation you want. Next time we will look at the most relevant case in biomedicine: continuous vs categorical. Don't forget the R workshop!
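A small sketch of the Pearson/Spearman comparison on invented data with a curved but monotonic trend; cor() takes a method argument selecting which coefficient to compute.

```r
# Invented data: y rises with x, but along a curve rather than a line.
set.seed(2)
x <- 1:20
y <- x^2 + rnorm(20, sd = 15)

cor(x, y, method = "pearson")   # assumes a straight-line relationship
cor(x, y, method = "spearman")  # rank-based: near 1 for any monotonic trend
```

cor.test() accepts the same method argument, so the p-value for Spearman's rho is obtained exactly as for Pearson's r.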