Kingston

Joseph Trigiante

Tags

statistics normal distribution descriptive statistics data analysis

Summary

These lecture notes introduce the concept of statistics and provide a comprehensive overview of the normal distribution and descriptive statistics. The notes explain how errors in measurements are related to the normal distribution and how to interpret statistical probabilities. Practical examples demonstrate the application of these statistical concepts.

Full Transcript

STATISTICS LS4003
Joseph Trigiante

INTRODUCTION

A long time ago, London was a very different place: overcrowded, dirty and dangerous. Well-to-do folks would not live in it.

It wasn’t only Jack the Ripper & friends you had to worry about. Sanitation was so poor that infectious diseases were rife, particularly cholera. This disease leads to severe diarrhoea and dehydration and can be fatal. It spreads through dirty water which, needless to say, was pretty abundant back then.

In 1854 London suffered a severe cholera outbreak, caused by the bacterium Vibrio cholerae, which resulted in more than 600 deaths. Germ theory is obvious to all of us today, but it was not then. Diseases were still thought to originate from “bad smells” in the air.

But not for this doctor. John Snow (not from GOT) decided to get to the bottom of this mystery. He had no clue about what caused cholera but, like Sherlock Holmes, decided to use evidence and logic to guide him.

He plotted the sightings of the “culprit”, i.e. the cholera cases, on a map. The cases clustered around a single water pump. He had the handle removed from this “pump of death”, and all of a sudden the cases stopped. A replica of the pump is still on display in Broad Street.

THE POWER OF STATISTICS

Unbeknownst to him, Snow was using statistics. Wikipedia offers a formal definition. We will use a less academic one: statistics is a tool to separate SIGNALS from NOISE.

In a perfect world we would not need statistics. In the real world we do, because of noise.

THE NATURE OF NOISE

NOISE is any factor which can impact your measurement and does so in a random direction. Noise can be reduced but never eliminated in science.

Examples:
In a clinical trial, patients not complying with the drug regimen
Patients who metabolise or absorb the drug at different rates
People moving away from the “death pump” after drinking
Poorly calibrated instruments
Uncontrolled parameters such as temperature, pressure etc.

Noise can both mask a signal which is there and simulate one which is not. Either way, it is critical for scientists to be able to see the truth. Statistics is the tool to achieve this.

We now have a goal: separate signal from noise. The rest of the lectures will explain HOW.

SUMMARY OF THE LECTURES

We will examine statistics systematically, taking it from the beginning. This is a summary of what we will explore:
The shape of error: the normal distribution
Examining one set of data: descriptive statistics
Comparing sets of data: inferential statistics
General concepts: variables, H0 and H1 hypotheses, and the p-value
The three main cases and the tests required: continuous/categorical vs each other

TODAY’S MENU

The shape of error: the normal distribution
Examining one set of data: descriptive statistics

THE SHAPE OF ERROR

The noise we see on a measurement is the result of many elementary causes. Each one shifts our measurement one way or the other at random.

To understand how this works we will use a Plinko board. In this game you drop a ball down a board with several studs in it. The ball randomly bounces one way or the other at each stud, and you win depending on where it finally lands.

One stud is the same as one elementary error. If we drop many balls, we’ll see that 50% go right and 50% go left.

Now let’s add another level of studs. There are four equally likely paths: RR = right, RL = centre, LR = centre, LL = left. With 2 levels of studs the distribution changes to 25% left, 50% centre and 25% right, because the centre is twice as likely to get hit.

As we increase the number of stud levels the ball passes, the distribution of hits around the centre takes on a particular shape. With 4 levels there are 16 equally likely paths:

Slot        Paths                                 Frequency
Far left    LLLL                                  about 6%
Left        LLLR, LLRL, LRLL, RLLL                25%
Centre      RRLL, RLRL, LRRL, RLLR, LRLR, LLRR    37.5%
Right       RRRL, RRLR, RLRR, LRRR                25%
Far right   RRRR                                  about 6%

If we take the centre as the “true fall position” of the ball (where it would fall without studs), then the other positions are “noise” introduced by the studs.
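
The Plinko reasoning above can be checked by brute force. The following is a minimal Python sketch, not part of the original slides: the function name `plinko_distribution` and the choice to label each slot by "rights minus lefts" are assumptions made for illustration. It lists every possible left/right path and counts how often each slot is hit.

```python
from collections import Counter
from itertools import product


def plinko_distribution(levels):
    """Return {slot: probability} for a board with `levels` rows of studs.

    Every path is a sequence of 'L'/'R' bounces, all equally likely.
    A slot is labelled by (number of R) - (number of L), so 0 is the centre.
    """
    counts = Counter()
    for path in product("LR", repeat=levels):
        counts[path.count("R") - path.count("L")] += 1
    total = 2 ** levels  # number of equally likely paths
    return {slot: counts[slot] / total for slot in sorted(counts)}


for levels in (1, 2, 4):
    print(levels, plinko_distribution(levels))
```

With 2 levels the centre slot collects half of the paths, and with 4 levels the centre collects 6 of the 16 paths. Raising `levels` further makes the bar chart look more and more like a bell.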

The “bell curve” we found represents how likely we are to make a certain error in our measurement. The centre of the curve is the true value and the spread around it is the noise (error).

THE NORMAL DISTRIBUTION

Any curve plotting measured values on the X axis and their likelihood (probability, or number of measurements with that value) on the Y axis is called a DISTRIBUTION.

This particular bell-shaped distribution is extremely important in statistics because most errors look like this. It is therefore called the “normal distribution” of error, or simply the normal distribution.

Let’s look at the 2 fundamental properties of this distribution.

The first one is the mean value. This is the centre, and it is the most likely measurement we would get if we repeated the measurement many times.

The second is “how broad” the distribution is around the mean. This represents the precision of the data: the narrower, the better. It is not obvious how to measure the breadth of a curve that never ends.

Statisticians have come up with a parameter that measures the typical error on a set of measurements: the standard deviation (SD) or σ (sigma).

$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{N - 1}}$$

where $x$ are the individual measurements, $\bar{x}$ is the mean and $N$ is the number of measurements. The more the measurements are scattered from the mean ($x - \bar{x}$ large), the higher σ is and the broader our distribution.

The standard deviation has a meaning in terms of the area under the curve. The area under the distribution curve between the mean and mean + 1 SD is 34% of the total. This is always true, regardless of what your SD or mean actually are!

But why do we care about the area under the curve? Remember the Plinko plot with 4 levels of studs? If I ask you what the odds are of the ball falling in the centre OR right slot, what would you say?

That’s right: you sum the bars and get 37.5% + 25% = 62.5%.

If we have a large number of values, the sum of the bars becomes the area under the curve. So the area under a distribution between 2 values is the probability of measuring a value in that range by chance.

So there is a 34% chance of measuring a value between the mean and mean + 1 SD by chance. Between the mean and mean + 2 SD the probability is higher, 47.5%, and so on.

But this also means that the odds of measuring a value far from the mean are low. You only have about a 2.5% chance of a value more than 2 SD above the mean. So if the SD is small, your measurements will be close to the mean. This is why SD is a measure of precision.

Of course these probabilities only hold if chance alone is involved. If there is something else at play (like trickery) the odds are different. We can thus use the normal distribution probability in reverse: if we do see an out-of-range measurement, then maybe it’s not by chance. This is the basis of inferential statistics, which we will examine later.

THE STANDARD NORMAL DISTRIBUTION

Calculating probabilities of observations falling within certain ranges is critical in science. In a normal distribution these probabilities depend only on how many SDs away from the mean we are. This parameter, which doesn’t depend on the particular distribution you are looking at, is so important that it is called z:

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is the observation, $\mu$ the mean and $\sigma$ the standard deviation. The numerator tells us how far away from the mean we are, and dividing by $\sigma$ expresses that distance as a number of standard deviations.

If we take a distribution with mean = 0 and SD = 1, z is simply the measured value itself:

$$z = \frac{x - 0}{1} = x$$

This is called the standard normal distribution.

By turning your measurements into z you can easily find the probability of any observation from any normal distribution you have. All you have to do is look up the z values on a table. Let’s see some examples.

Example: people’s heights. The distribution is normal, with a mean of 175 cm and an SD of 7 cm. What are the odds of bumping into someone between 1.90 and 1.95 m tall?

Note that I said “between 1.90 and 1.95”: you always need a range. The odds of finding someone exactly 1.95 m tall are zero because that would mean 1.95000000000… If you want a probability for “1.95 m” you should say “between 1.95 and 1.96”.

Let’s calculate the two z values:

$$z_1 = \frac{190 - 175}{7} = 2.143$$

$$z_2 = \frac{195 - 175}{7} = 2.857$$

Now let’s look them up on the P table, which gives the area between z = 0 and z. The table gives p ≈ 0.484 for z1 and p ≈ 0.498 for z2, and the difference between the two is the probability we want. That’s it! About 1.4% of people should be between 1.90 and 1.95 m tall. People that tall are actually quite rare.

If the range is open, the limit for z = +∞ is P = 0.5 (and for z = -∞ it is P = -0.5).

Example: what are the odds of bumping into someone taller than 2 metres?

$$z_1 = \frac{200 - 175}{7} = 3.571 \qquad z_2 = \infty$$

Now let’s look them up. I used an online calculator because 3.571 is off most charts: p = 0.4998 for z1 and p = 0.5000 for z2. That’s it! About 0.02% of people are above 2 m (pretty rare they are).

If z is negative, just put a minus sign in front of the value you find on the chart.

Example: how many people are shorter than 165 cm?

$$z_1 = \frac{165 - 175}{7} = -1.429 \qquad z_2 = -\infty$$

Looking them up on the P table gives p ≈ -0.42 for z1 and p = -0.5 for z2, a difference of about 0.08. So about 8% of people should be shorter than 165 cm.

Alternative convention for p: some tables report p not from z = 0 (maximum value 0.5) but from z = -∞ (maximum value 1). If you read numbers higher than 0.5 this is probably the case: just subtract 0.5 to obtain your P.
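
The table look-ups in the height examples can also be reproduced in a few lines. This is a minimal Python sketch under stated assumptions: the helper names `table_p` and `prob_between` are made up for illustration, and the table is assumed to follow the "area from z = 0" convention, computed here with the standard library's error function instead of a printed chart.

```python
from math import erf, inf, sqrt


def table_p(z):
    """Area under the standard normal curve between 0 and z.

    Mimics a table that starts at z = 0: the value is negative for
    negative z and tends to +0.5 / -0.5 as z goes to +inf / -inf.
    """
    return 0.5 * erf(z / sqrt(2))


def prob_between(low, high, mean, sd):
    """Probability that a normal observation falls between low and high."""
    z1 = (low - mean) / sd
    z2 = (high - mean) / sd
    return table_p(z2) - table_p(z1)


# Heights: mean 175 cm, SD 7 cm (values taken from the examples above)
print(round(prob_between(190, 195, 175, 7), 4))   # between 1.90 and 1.95 m
print(round(prob_between(200, inf, 175, 7), 5))   # taller than 2 m
print(round(prob_between(-inf, 165, 175, 7), 3))  # shorter than 165 cm
```

Passing `inf` or `-inf` for an open range works because the area from 0 to infinity is exactly 0.5, which is the same rule used when reading the table by hand.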

For our next example we are going to meet Chris, the world’s most intelligent man. No kidding! Let’s check him out using statistics!

He says his IQ is between 190 and 210. The human IQ mean is 100 and the SD is 15.

Let’s calculate the two z values for Chris:

$$z_1 = \frac{190 - 100}{15} = 6.000$$

$$z_2 = \frac{210 - 100}{15} = 7.333$$

These are HUGE z-values which are not on tables, but I could find an online calculator to do it:

$$p(z = 6.000) = 0.4999999990$$

$$p(z = 7.333) = 0.5000000000$$

$$p(z \text{ between these 2}) \approx 1 \times 10^{-9}$$

So the probability of an IQ score between 190 and 210 is about $1 \times 10^{-9}$. This planet counts about 8.2 billion people (children included!). If you were to check them all you would expect to find only about 8 people with that IQ.

Fishy! Any proof??? How convenient!

NON-NORMAL DISTRIBUTIONS

GENERIC DISTRIBUTIONS

The normal distribution of error is the most common in science, but there are others to consider. These arise when our hypothesis of “many random elementary errors” doesn’t hold. We will now deal with the general case and with the parameters used to describe non-normal distributions.

Reasons for non-normality:
Too few samples were taken, so we do not have enough data to fit the normal curve
There is a minimum in the observable, in which case the curve has to stop there
Your observations are the sum of not one but two or more normal sources, each with its own mean

Example: test scores. There are two distinct groups of students, each with its own mean, which together yield a non-normal (“bimodal”) distribution.

Example: income (source: UK Office for National Statistics). Income has an absolute minimum (£0), so the curve has to be non-normal.

For non-normal distributions you can still calculate the mean and standard deviation, but they don’t have the same
meaning. The mean is not necessarily the most likely value, and there is no fixed relationship between SD and probability.

Example: number of eyes per person. The mean is 1.99999…, but nobody has 1.9999 eyes.

GENERIC DISTRIBUTIONS: MEDIAN

We need different parameters to measure the “central tendency” and the “scattering” of a generic probability distribution.

The first is called the median. It is the value which splits the curve in half: half the observations are above it and half below.

To find the median of a set of data, simply write the values in ascending order in a row and pick the middle value. If there are two in the middle, take their mean.

Example: 2, 4, 7, 10, 10, 13, 15, 20, 21
The middle number is 10: that’s the median.

Example: 2, 4, 7, 10, 10, 13, 15, 20, 21, 24
The middle numbers are 10 and 13, so the median is 11.5.

The median is usually closer to the most likely value (the mode) than the mean is.

Example: number of eyes per person. The sorted list is 0, 0, 0, …, 1, 1, 1, 1, …, 2, 2, 2, 2, 2, 2, … with the vast majority of entries equal to 2. The median will obviously be somewhere in the “2” group, and that’s also the most likely value.

GENERIC DISTRIBUTIONS: SKEWNESS

Non-normal distributions are often skewed. This means they are not symmetrical but imbalanced towards one side of the peak (the mode). Based on which side “bulges” they are called positively or negatively skewed.

A negatively skewed distribution is more abundant on the left side of the peak. A distribution that is instead more abundant on the right side of the peak is positively skewed.

The mean always follows the skew. Positively skewed: mean > median (from left to right: mode, median, mean).
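
The median rule and the “mean follows the skew” statement can be tried out numerically. This is a minimal Python sketch: the function name `median_of` and the `skewed` sample are invented for illustration, while the two shorter lists are the median examples given earlier.

```python
def median_of(values):
    """Median by the rule above: sort, then take the middle value
    (or the mean of the two middle values when the count is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_of([2, 4, 7, 10, 10, 13, 15, 20, 21]))      # odd count
print(median_of([2, 4, 7, 10, 10, 13, 15, 20, 21, 24]))  # even count

# Hypothetical positively skewed sample: a few large values on the right
skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 18]
sample_mean = sum(skewed) / len(skewed)
print(sample_mean, median_of(skewed))  # the mean is pulled towards the tail
```

In the made-up sample the two large values drag the mean upwards while the median stays with the bulk of the data, which is the behaviour described for a positive skew.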

Negatively skewed: mean < median (from left to right: mean, median, mode).

GENERIC DISTRIBUTIONS: PERCENTILES

Having sorted the “central tendency” with the median, let’s take care of probabilities by replacing the standard deviation with something else.

A percentile is a value that splits the curve at a certain percentage of the total. A quartile is a value that splits the curve into a 25%/75% (or 75%/25%) split.

GENERIC DISTRIBUTIONS: QUARTILES

The lower quartile (Q1) is the value that splits the curve into a 25%/75% split.
The upper quartile (Q3) is the value that splits the curve into a 75%/25% split.
So 25% of the observations lie below Q1, 50% lie between Q1 and Q3, and 25% lie above Q3.

To find the quartiles of a set of data, simply:
1. Write the values in ascending order in a row
2. Find the median and remove that number (or the 2 numbers in the middle)
3. Find the medians of the two remaining halves

Example: 1, 2, 4, 7, 10, 10, 13, 15, 20, 21, 22, 30
There are 12 numbers. The median is between 10 and 13. I remove both:
1, 2, 4, 7, 10 | 15, 20, 21, 22, 30
The remaining medians are 4 and 21. These are the lower and upper quartiles.

Another example: 59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90, 95, 98
There are 19 numbers. The median is 75 (the 10th value). I remove it:
59, 60, 65, 65, 68, 69, 70, 72, 75 | 76, 77, 81, 82, 84, 87, 90, 95, 98
The remaining medians are 68 and 84. These are the lower and upper quartiles.

Quartiles are useful to represent non-normal data because they behave like SD: they tell us how scattered our points are. The more distant the quartiles are, the greater the scattering. This distance is called the interquartile range (IQR).
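
The three-step quartile recipe above can be sketched in Python as follows. The function names `median_of` and `quartiles` are hypothetical, and the code follows the lecture's own procedure (drop the middle value, or the two middle values, then take the median of each half). Statistical software may use other quartile conventions and can return slightly different numbers.

```python
def median_of(values):
    """Middle value of the sorted data (mean of the two middle values if even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q3) using the remove-the-median procedure from the lecture."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least 3 values")
    half = n // 2
    if n % 2 == 1:
        # odd count: drop the single middle value
        lower, upper = ordered[:half], ordered[half + 1:]
    else:
        # even count: drop both middle values
        lower, upper = ordered[:half - 1], ordered[half + 1:]
    return median_of(lower), median_of(upper)


first = [1, 2, 4, 7, 10, 10, 13, 15, 20, 21, 22, 30]
second = [59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77,
          81, 82, 84, 87, 90, 95, 98]

for data in (first, second):
    q1, q3 = quartiles(data)
    print("Q1 =", q1, "Q3 =", q3, "distance between quartiles =", q3 - q1)
```

Running it on the two worked examples gives the same quartiles as the hand calculation, and the last printed number is the distance between the quartiles for each data set.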

IQR = upper quartile - lower quartile = Q3 - Q1

GENERIC DISTRIBUTIONS: SUMMARY

So now we have parameters to measure the “centre” and “scattering” of a non-normal distribution, corresponding to the mean and SD of the normal one:

              Normal    Non-normal
Centre        Mean      Median
Scattering    SD        IQR (Q3 - Q1)

GENERIC DISTRIBUTIONS: BOX PLOTS

You will often find a compact way to represent a non-normal distribution and its parameters on a single chart: the box (and whiskers) plot. This simply puts together all the parameters we saw in a single easy-to-understand format.
https://datavizcatalogue.com/methods/box_plot.html

R WORKSHOPS

This part of the module is accompanied by a brief course in data representation using R. R is one of the most commonly used languages for biomedical data analysis. These workshops will show you how to play with the concepts we discussed in lectures via R in an easy way. AND they will improve your CV, so please attend!

NEXT TIME

Inferential statistics: spotting the signal amidst the noise
