Probability and Theoretical Distributions
Charles Marks

Mathematical Notation, Probability, & Distributions

Before we dive into learning and applying statistical tools, we need to learn about some important concepts. This week, we will discuss how statistics is all about assessing probability. Further, we will formally introduce some mathematical language we can use to discuss probability, and we will introduce theoretical distributions and how we can use them to think about probability. In this text, we will specifically focus on introducing and understanding the normal distribution, one of the most important distributions in the field of statistics.

Statistics: Identifying the Signal From the Noise

One of the primary goals in this class is to help get you comfortable with inferential statistics techniques. Inferential statistics are methods where we take data from a sample of 𝑛 people and then try to learn something about the broader population of people we think they represent. When we talk about an arbitrary number of people, we will often use the letter 𝑛 when we don't know what that number actually is (or when the exact amount doesn't matter). A population represents a broad category of people we want to learn about, and a sample represents a subset of that population that we recruited for a study.

The reason inferential statistics are so important is that usually you cannot survey every person in a population of interest. Perhaps you want to do a study focusing on people who vape in the United States - that is quite literally millions of people. Surveying millions of people is impractical or impossible. However, it is much more practical to recruit 𝑛 = 1,000 or 𝑛 = 10,000 people who vape. With inferential statistics methods, we can take information about our sample of 𝑛 people who vape and try to make conclusions about the broader population of people who vape. This is our main mission when employing inferential statistics techniques - take information from a sample to learn about a corresponding population.

These inferential "conclusions" are essentially probabilistic guesses! These methods allow us to ask, "How probable do we think it is that the signal we have observed in our sample is representative of the broader population?" Here, signal just refers to any effect or difference that we may observe. For example, in our (hypothetical) study of people who vape, we might find that twice as many men in the sample report using their vape while at work compared to women in the sample. This difference between men and women in the sample represents a signal - the objective of inferential statistical methods is to help us determine how probable we think it is that the signal within the sample represents a meaningful, non-random pattern within the overall population. In the case of our example, how probable do we think it is that men who vape are actually more likely than women who vape to vape while they are at work?

One of the hardest parts about trying to identify these signals is wading through the noise in our data. We can think of noise as variability within our data that is simply the result of randomness. Even if there exists a signal in our overall population, when we run a study we intend to recruit a random sample - a sample is considered random when every person in the population has the same probability of being chosen. This randomness introduces variability into the data that may obscure signals we might observe.
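To make the idea of a random sample concrete, here is a minimal R sketch - it is not from the original text, and the population size and the 30% vaping-at-work rate are invented purely for illustration. It builds a hypothetical population, uses sample() to draw a random sample of 𝑛 = 1,000 people (every person has the same chance of being chosen), and shows that the sample estimate lands near, but not exactly on, the population value. That gap is noise.

## A hypothetical population of 2 million people who vape; each element records
## whether that person vapes at work (TRUE/FALSE). The 30% rate is an assumption.
set.seed(1)
population = sample(c(TRUE, FALSE), size = 2e6, replace = TRUE, prob = c(0.3, 0.7))

## Draw a random sample of n = 1000 people - each has an equal chance of selection
n = 1000
study_sample = sample(population, size = n)

## The population proportion and the sample proportion will differ slightly (noise)
mean(population)
mean(study_sample)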
For example, let's say we want to know how likely it is that if we flip a coin we get heads. Intuitively, we imagine it is 50-50: 50% of all coin flips should be heads and 50% should be tails. So, we flip a coin once and get heads. We flip it again and get…heads?!?! In fact, we flip 5 heads in a row before we get a tails! We flip the coin 100 times and we end up flipping heads 72 times. Now, I promise you, the reader, that there is a 50% chance of getting heads (assuming this isn't some trick coin or some quantum thought experiment); however, because every coin flip is independent (i.e., not dependent on any other coin flip) and random, the reality is that we have flipped heads 72% of the time! This represents our signal. Because we have taken a random sample of coin flips, there is natural variation in the results we have observed - this variation is the noise. Noise, natural variation in our sample data as a result of randomness, obscures the signals that can tell us about our population.

So, our goal in inferential statistics is to take a sample and try to learn something about the population we believe it represents. In order to do so, we must wade through noisy data in search of meaningful, probable signals. While by no means a rigorous equation, our ability to make conclusions about the population depends on the ratio of signal to noise, or: 𝑠𝑖𝑔𝑛𝑎𝑙/𝑛𝑜𝑖𝑠𝑒. With a fraction, a bigger numerator (i.e., the top) means the overall number is bigger (i.e., 4/3 > 2/3), and a bigger denominator (i.e., the bottom) means the overall number is smaller (i.e., 2/5 < 2/3). Often we assume that no signal exists at the population level and then try to assess how probable our observed data is under this assumption. So, we can understand that the bigger (or "stronger") our signal in our sample, the less probable we think it is that no signal exists at the population level (or, the more probable we think it is that a signal does exist). Noise, on the other hand, can be understood to obscure (i.e., weaken) signals we may observe in our data.

Statistics and Falsification

So, statistics represents a broad set of methods through which we can take data from a sample, search for signals amidst the random noise, and assess how probable we think it is that the signal we are observing is representative of the broader population! Importantly, statistics isn't about figuring out what is true or false. Statistical methods were developed because we really cannot know what is true - instead, we can reflect on what is probable. If something turns out to be improbable, we can assume some alternate reality is true - a sort of logic of contradiction. We refer to the act of determining that something is improbable as falsification. For example, we could never prove that men vape at work more than women do, but we could observe data and feel confident that it is probable that men and women don't vape the same amount at work - we can seek to falsify that possibility by examining our data. In statistics, we do this by assuming a signal does not exist and then asking how probable our observed data is under this assumption - a confirmation by contradiction, of sorts. For example, we could assume that men and women who vape, vape the same amount at work.
Then we could collect data and ask, "If we assume that men and women vape the same amount at work, how likely is the data we have observed?" Let's say we collect data about vaping at work and find that men vape at work 2% more than women - if we assume that men and women, generally, vape the same amount at work, does this signal in our sample seem probable? Sure! 2% doesn't seem like a very big difference. Maybe we just happened to interview a couple of men who vape at work a lot (i.e., noise). Or, what if in our sample, men vape at work 300% more than women? This observation seems far more unlikely if we assume no true signal, and we feel more confident rejecting our assumption that men and women vape the same amount at work!

This is the logic we will be employing when we use inferential statistics techniques - we will assume that a signal doesn't exist, then observe data and assess how probable our observed data is under our assumption of no signal. This represents a logic of contradiction. We make an assumption and then ask how likely our observations are under that assumption. If the observations are unlikely, this provides evidence that our assumption may be wrong, and we reject our assumption.

Almost every inferential statistical test involves three primary steps: the first is to assume that a signal does not exist in the overall population (e.g., assume that men do not vape at work more than women do); the second is to measure and quantify the signal and the noise within the sample; and the final step is to ask how probable it is that we could observe the patterns in our data assuming that no such signal exists. To apply this to our coin flipping scenario: first, we assume that there is a 50% chance of flipping heads on any flip; then we flip a coin 𝑛 times and calculate how often we got heads; and, finally, based on the data, we ask how probable it is that we could have observed the sample (i.e., the coin flips) assuming that the chance of flipping heads was 50-50. So, if we flip a coin 100 times and get heads 72 times, do we still feel confident that there is a 50% chance of getting heads? (A short R sketch of this calculation appears at the end of this section.)

Statistics: The Art of Making Educated Guesses

Perhaps this was all a long-winded way of saying that statistics is the art of making educated guesses based on the data we have available to us. This is, in fact, a very human activity! We do it all the time! Every day we make probabilistic decisions based on the information available to us. A classic example: which way should I drive home to avoid traffic? You usually can't know the best way, but from experience (your sample), you decide on a route (usually dependent on the time of day and which routes are available). You choose the route you think has the highest probability of being fastest.

Now, statistics can feel scary because it is often presented as a bunch of mathematical equations and weird distributions, and there are lots of Greek letters and tables. I definitely don't whip out a calculator and do mathematical calculations to decide which way to drive home (even Google Maps cannot predict a car crash before it happens). In this chapter, I want to go over some of this "scary" math stuff because it's all just ways of presenting probability in formal and testable terms. So, as we dive in, I want to assure you that you are already familiar with the logic behind probability - understanding how we represent probability and probabilistic decision-making in statistics will actually make understanding the statistical methods way, way easier.
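As promised above, here is a short R sketch of the coin-flip reasoning. It is a minimal illustration rather than anything from the original text: it simulates 100 flips of a fair coin with rbinom(), and then uses pbinom() to ask how probable a result as extreme as 72 heads would be if the chance of heads really were 50%.

## Step 1: assume no signal - the coin is fair, so P(heads) = 0.5
## Step 2: "collect" data by simulating 100 independent flips (1 = heads, 0 = tails)
set.seed(42)
flips = rbinom(n = 100, size = 1, prob = 0.5)
sum(flips)   ## how many heads this particular random sample produced

## Step 3: ask how probable data like ours is under the fair-coin assumption.
## If we observed 72 heads, the probability of getting 72 or more heads out of
## 100 flips of a fair coin is:
1 - pbinom(71, size = 100, prob = 0.5)   ## less than 1 in 100,000

Under the assumption of a fair coin, 72 heads is extremely improbable, so we would feel confident rejecting the 50-50 assumption - exactly the logic of contradiction described above.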
Intro to Probability and the Normal Distribution

Statistics is all about assessing the probability of our observed data given some assumption. As such, we need ways to formally think through the concept of probability.

Defining Probability Mathematically

We need a way to express the following question mathematically: "What is the probability that [insert phenomenon] will occur?" For example, we might wish to ask: what is the probability that the result of our next coin flip will be heads? If we let 𝐴 represent our next coin flip, we then want to ask, "What is the probability that 𝐴 = ℎ𝑒𝑎𝑑𝑠?" We can use 𝑃(𝑥) notation to express this statement mathematically. 𝑃(𝑥) can simply be translated as "the probability of 𝑥…". So, if we were to write 𝑃(𝐴 = ℎ𝑒𝑎𝑑𝑠), we would read that as "the probability that our next coin flip will be heads is…". Intuitively, we know that 𝑃(𝐴 = ℎ𝑒𝑎𝑑𝑠) = .5 = 50%. We would read this as saying "the probability that our next coin flip is heads equals 50%."

We can also chain together multiple phenomena and ask how likely a combination of outcomes is. For example, we could ask 𝑃(𝐴 = ℎ𝑒𝑎𝑑𝑠 𝑂𝑅 𝐴 = 𝑡𝑎𝑖𝑙𝑠). Here we are just asking what the probability is that the coin flip will be either heads or tails - and since those are the only possibilities for a normal coin, 𝑃(𝐴 = ℎ𝑒𝑎𝑑𝑠 𝑂𝑅 𝐴 = 𝑡𝑎𝑖𝑙𝑠) = 1 = 100%.

Conditional Probability

At the heart of inferential statistics is the concept of conditional probability. Conditional probability comes in handy when we want to ask, "Assuming that 𝐴 is true, what is the probability of 𝐵 occurring?" Here 𝐴 and 𝐵 simply represent phenomena or circumstances. We can write this mathematically like so: 𝑃(𝐵|𝐴). We would read this as "the probability of 𝐵, given that 𝐴 is true…". Now, 𝐴 does not actually have to be true; it can be entirely hypothetical. For example, someone could ask you if the freeway is the fastest route to get to your house - let 𝐵 represent the freeway being the fastest route to your home. Now, (hypothetically) you know that the freeway is the fastest way except during rush hour; during rush hour, you have found that the freeway is fastest only 1/3 of the time. So, let 𝐴 represent whether or not it is rush hour. We could ask 𝑃(𝐵|𝐴 = 𝑛𝑜𝑡 𝑟𝑢𝑠ℎ ℎ𝑜𝑢𝑟). From experience we have found that the freeway is always fastest when it is not rush hour, so 𝑃(𝐵|𝐴 = 𝑛𝑜𝑡 𝑟𝑢𝑠ℎ ℎ𝑜𝑢𝑟) = 1 = 100%. Likewise, we could ask 𝑃(𝐵|𝐴 = 𝑟𝑢𝑠ℎ ℎ𝑜𝑢𝑟). From experience, we have found that 𝑃(𝐵|𝐴 = 𝑟𝑢𝑠ℎ ℎ𝑜𝑢𝑟) = 1/3 ≈ 33.3%. (A short simulation of this example appears at the end of this subsection.)

Why is conditional probability so important in inferential statistics? In a quantitative study, we will observe some data (i.e., our study sample). We will also assume that a signal does not exist (e.g., men and women who vape, vape the same amount at work). So, we will try to ask 𝑃(𝑑𝑎𝑡𝑎|𝑛𝑜 𝑠𝑖𝑔𝑛𝑎𝑙) - that is, what is the probability that we observed our data, assuming that no signal exists? That is the foundation of every single inferential statistical method that we will employ. For example: if we want to know if smoking cigarettes leads to lung cancer, we would 1) assume that smoking cigarettes and lung cancer are not related, 2) observe data from a sample (perhaps ask people if they smoked and if they had lung cancer), and 3) then assess how probable our data is given our assumption that smoking cigarettes and lung cancer are not related.
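To make the 𝑃(𝐵|𝐴) notation concrete, here is a small R sketch of the freeway example - the code, the trip counts, and the variable names are illustrative assumptions, not anything from the original text. It simulates many commutes, records whether each one happened during rush hour and whether the freeway was fastest, and then estimates the two conditional probabilities by filtering on the condition before taking a proportion.

## Simulate 10,000 hypothetical commutes (all numbers are illustrative assumptions)
set.seed(7)
n_trips = 10000
rush_hour = sample(c(TRUE, FALSE), size = n_trips, replace = TRUE)

## Match the intuition above: the freeway is always fastest outside of rush hour,
## but only fastest about 1/3 of the time during rush hour
freeway_fastest = ifelse(rush_hour,
                         rbinom(n_trips, size = 1, prob = 1/3) == 1,
                         TRUE)

## P(B | A = not rush hour): keep only non-rush-hour trips, then take the proportion
mean(freeway_fastest[!rush_hour])   ## equals 1, i.e., 100%

## P(B | A = rush hour): keep only rush-hour trips, then take the proportion
mean(freeway_fastest[rush_hour])    ## close to 1/3

The key move is conditioning: we restrict attention to the cases where 𝐴 holds and then ask how often 𝐵 occurs within that subset.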
The "probability of data" seems like a funny concept. Let us say we assume that men and women who vape do so the same amount at work. We recruit a sample, ask how often they vape at work, and then compare the responses of men and women. If we assume that men and women vape the same amount at work, then we imagine it is quite probable that men and women report vaping at work at similar rates. However, it is almost certain that men and women in our sample won't have identical vaping patterns at work - thus, it is important that we be able to capture the probabilities of discrepancies between our observed data and our assumption. For example, if we think that men and women vape the same amount at work, it seems quite probable that in our sample we would find that men, on average, vape at work 0.2 times more per workday than women - perhaps we randomly sampled a couple of men who vape more than others. Whereas it might be quite improbable, assuming men and women vape the same amount, if we found that men vape 10 times more per workday than women - such an improbable finding might force us to question whether our initial assumption was correct. So, we need a way to assess the probability of our data given some underlying assumption.

Introducing Theoretical Probability Distributions

We do so by employing theoretical probability distributions. A probability distribution is a mathematical function that identifies the probability of a given outcome occurring. While distributions are sometimes a primary point of confusion in statistics, the reality is that distributions are just a way of capturing how we think about probabilities - they first come from our intuition about the world; the math is just a way of making that intuition rigorous. Essentially, a probability distribution is a tool by which we can make assumptions about the behavior of a given variable.

For example, let's imagine we are playing a game where we are guessing the height of the next person to walk in the room - we don't have any information prior to making our guess. Well, our best guess is probably the average height of all people…let's say we are pretty sure the average person is 5 foot 8 inches tall. We are also pretty sure that there are just as many people shorter than 5'8" as there are taller than 5'8" (i.e., the distribution of height is symmetrical around the mean). Further, we are quite positive that most people's heights are close to 5'8" - we feel confident that most people are between 5'2" and 6'2". It is quite rare for someone to be shorter or taller than that range.

So, we have constructed a theory of the distribution of height in order to play this game. We think that if someone walks into the room (i.e., a random observation), they are most likely to be of average height (5'8") or close to that height (whether taller or shorter). Further, we think that heights far away from 5'8" (way shorter or way taller) are the least likely to be observed. We can actually capture this theory by plotting it as a mathematical function. We will have the x-axis be height (in inches) and the y-axis represent the hypothetical probability (density) of observing that height if someone walked in the door:

## Let's create our x-axis, ranging from 4'4" (52 inches) to 7'0" (84 inches)
x = seq(52, 84, by = 0.1)   ## the step size of 0.1 is an arbitrary, fine-grained choice
## The y-axis: the normal distribution centered at 5'8" (68 inches). The standard
## deviation of 3 inches is an assumed, illustrative value - it puts most heights
## between 5'2" and 6'2", matching the intuition described above.
y = dnorm(x, mean = 68, sd = 3)

Plotting y against x traces out a bell-shaped curve that is highest at 68 inches and tails off toward very short and very tall heights - exactly the theory we described. For a curve like this, the probability of observing a height within a given range corresponds to the area under the curve over that range (and the total area under the curve is 1 = 100%). Now we can ask: what is 𝑃(𝑥 ≥ 5′8″ | ℎ𝑒𝑖𝑔ℎ𝑡 𝑖𝑠 𝑛𝑜𝑟𝑚𝑎𝑙𝑙𝑦 𝑑𝑖𝑠𝑡𝑟𝑖𝑏𝑢𝑡𝑒𝑑)? In other words, what is the probability someone is 5'8" or taller, assuming that height is normally distributed around 5'8"? We can use the same principle as above to make this calculation.
We start by shading in the area under the curve corresponding to 5'8" and taller, like so:

## We will now plot the normally distributed data as a line (type = "l")
plot(x, y, type = "l", xlab = "Height in Inches", ylab = "Density")
## We want all of the values of x and y where x is at least 68 (5'8").
## The following three lines of code do this:
index_vals = which(x >= 68)   ## positions of the heights that are 68 inches or more
poly_x = x[index_vals]        ## the heights in that range
poly_y = y[index_vals]        ## the corresponding values of the curve
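To finish the example, here is a hedged sketch of one way the shading and the calculation could be completed - the polygon() call, the color choice, and the pnorm() computation are illustrative assumptions building on the code above, not steps prescribed by the text. We fill in the selected region and then compute its area, which is exactly the probability we were after.

## Shade the region under the curve from 68 inches (5'8") upward
polygon(c(68, poly_x, max(x)), c(0, poly_y, 0), col = "lightblue", border = NA)

## The shaded area is P(x >= 68 | height ~ Normal(mean = 68, sd = 3)).
## Since 68 inches is exactly the mean of a symmetric distribution, the area is 0.5:
1 - pnorm(68, mean = 68, sd = 3)   ## = 0.5, i.e., a 50% chance of 5'8" or taller

This matches the intuition from the height-guessing game: half of the curve's area sits at or above the average height, so the probability that the next person through the door is 5'8" or taller is 50%.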