Statistical Methods of Data Science


Summary

This document is an introductory text on the statistical methods of data science. It covers outcomes, events, and the axiomatic definition of probability; conditional probability and independence; Bayes' Theorem and its applications; and random variables with their discrete and continuous distributions.

Full Transcript


Statistical Methods of Data Science
Mattia Mungo
September 2024

Chapter 1: Outcomes, Events & Probability

1.1 Symmetry, Frequency, and Subjective Probabilities

Symmetry-based probability: This is the classical definition of probability, where all outcomes are assumed to be equally likely. The probability of an event A is given by

    P(A) = (number of favorable outcomes) / (total number of possible outcomes).

For example, when flipping a fair coin, the probability of heads is P(Heads) = 1/2.

Frequency-based probability: This is based on the proportion of times an event occurs over many repeated trials. The probability of an event A is the limit of its relative frequency as the number of trials n approaches infinity:

    P(A) = lim_{n→∞} n_A / n,

where n_A is the number of times event A occurs in n trials. For instance, if you observe 100 coin flips and 60 of them are heads, the frequency-based estimate of the probability of heads is 60/100.

Subjective probability: This type of probability reflects personal belief or confidence in the occurrence of an event, expressed as a number between 0 and 1. It is often used when no statistical data are available. For example, a doctor may estimate the probability of a patient recovering based on experience, without statistical data to back it up.

1.2 Axiomatic Definition of Probability (Kolmogorov)

The axiomatic definition of probability consists of three key axioms:

1. 0 ≤ P(A) ≤ 1 for any event A.
2. P(Ω) = 1, where Ω is the sample space.
3. If two events A and B are disjoint (mutually exclusive), then P(A ∪ B) = P(A) + P(B).

1.3 Outcomes and Events via Set Theory

In probability, outcomes and events are expressed through set theory.

Sample space (Ω): the set of all possible outcomes of an experiment. For example, the sample space for throwing a die is Ω = {1, 2, 3, 4, 5, 6}.

Events: subsets of the sample space. For instance, the event of throwing an even number is the subset A = {2, 4, 6}.
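The set-theoretic formulation lends itself to direct computation: events are literal subsets, and the classical probability is a ratio of cardinalities. A minimal sketch in Python (the variable names are illustrative, not from the text):

```python
from fractions import Fraction

# Sample space for one throw of a fair die
omega = {1, 2, 3, 4, 5, 6}

# Event A: the result is even (a subset of the sample space)
A = {w for w in omega if w % 2 == 0}

# Classical (symmetry-based) probability: favorable / possible
p_A = Fraction(len(A), len(omega))
print(p_A)  # 1/2
```

Using exact fractions rather than floats keeps the counting argument visible in the result.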
Events will be denoted with capital letters: A, B, etc. Two particularly important events are

    Ω = the certain event,    ∅ = the impossible event.

We use the following operations on sets (events):

- Union: A ∪ B represents the event that either A or B (or both) occur.
- Intersection: A ∩ B represents the event that both A and B occur.
- Complement: A^c represents the event that A does not occur.

[Figure 1.1: Set theory]

Remark: For any A ⊆ Ω, we have the following properties:

    A ∪ A^c = Ω,  A ∪ Ω = Ω,  A ∪ ∅ = A,  A ∩ Ω = A,  A ∩ ∅ = ∅.

In addition, for A, B ⊆ Ω: A ⊆ B if and only if all the elements of A also belong to B. In this case, we say that A "implies" B, because the occurrence of A implies the occurrence of B.

1.4 De Morgan's Laws

De Morgan's Laws relate the complement of unions and intersections of sets:

    (A ∪ B)^c = A^c ∩ B^c,    (A ∩ B)^c = A^c ∪ B^c.

These laws are useful in calculating probabilities of more complex events.

1.5 Event Decomposition

It is often useful to express an event as the union of two disjoint events. For example, consider a car undergoing an emissions test. Let F represent the event "the car fails the test" and let I represent the event "the car pollutes." The event F can be decomposed into

    F = (F ∩ I) ∪ (F ∩ I^c),

where F ∩ I represents the event that the car fails because it pollutes, and F ∩ I^c represents the event that the car fails despite not polluting. This approach generalizes to partitions of the sample space into more than two disjoint events.

Chapter 2: Conditional Probability & Independence

2.1 Conditional Probability

Conditional probability helps us refine our predictions when we have additional information. It is defined as

    P(A|B) = P(A ∩ B) / P(B),    where P(B) > 0.

This represents the probability that event A occurs, given that event B is known to have occurred. The idea is that the sample space is now restricted to B, so the probability is normalized by the likelihood of B.
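For equally likely outcomes, the definition reduces to counting within the restricted sample space B. A small sketch (the die example here is illustrative, not from the text):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}               # fair die
B = {w for w in omega if w % 2 == 0}     # conditioning event: result is even
A = {6}                                  # event of interest: result is 6

# P(A|B) = P(A ∩ B) / P(B); with equally likely outcomes this is
# |A ∩ B| / |B|, i.e. counting inside the restricted space B
p_cond = Fraction(len(A & B), len(B))
print(p_cond)  # 1/3
```

Compare with the unconditional P(A) = 1/6: learning that the result is even doubles the probability of a 6.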
2.2 Remark: Conditional Probability is a Legitimate Measure

For any fixed event B with P(B) > 0, the map A ↦ P(A|B) is a legitimate probability measure that satisfies all the usual probability axioms. One important property is that, for the complement of A,

    P(A^c|B) = 1 − P(A|B).

2.2.1 Proof

We start by decomposing the event B in terms of the event A. Specifically, we use the fact that

    B = B ∩ Ω = B ∩ (A ∪ A^c) = (A ∩ B) ∪ (A^c ∩ B),

where A^c denotes the complement of A, and the union is disjoint. By the additivity of probability for disjoint events,

    P(B) = P(A ∩ B) + P(A^c ∩ B).

Since P(A^c|B) is the probability of the complement of A given B, we can express it as

    P(A^c|B) = P(A^c ∩ B) / P(B).

Substituting P(A^c ∩ B) = P(B) − P(A ∩ B), we get

    P(A^c|B) = (P(B) − P(A ∩ B)) / P(B) = 1 − P(A ∩ B)/P(B) = 1 − P(A|B).

Thus, we have shown that P(A^c|B) = 1 − P(A|B).

2.2.2 Basic Example

Suppose you're playing a card game, and you know the card dealt is a face card (Jack, Queen, or King). What is the probability that it is specifically a King? Define the events:

- K: the card is a King.
- F: the card is a face card.

Then

    P(K|F) = P(K ∩ F) / P(F) = (1/13) / (3/13) = 1/3.

The probability of the card being a King has increased now that we know it is a face card.

2.3 The Multiplication Rule and Chain Rule

Using the definition of conditional probability, we can express the probability of the intersection of events (the joint probability):

    P(A ∩ B) = P(A|B) · P(B),

which can also be written as P(B|A) · P(A). In general, of course, P(A|B) ≠ P(B|A). For more than two events, the chain rule extends this idea:

    P(A1 ∩ A2 ∩ ··· ∩ An) = P(A1|A2 ∩ ··· ∩ An) · P(A2|A3 ∩ ··· ∩ An) ··· P(An−1|An) · P(An).

2.3.1 Example: Drawing Cards

Consider drawing two cards from a deck. What is the probability of drawing an Ace first and a King second?
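This probability can also be estimated by simulation and compared with the exact value derived below. A minimal Monte Carlo sketch, assuming a standard 52-card deck (the deck encoding, seed, and trial count are illustrative):

```python
import random

random.seed(0)

# Build a 52-card deck as (rank, suit) pairs
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
deck = [(r, s) for r in ranks for s in "SHDC"]

trials = 200_000
hits = 0
for _ in range(trials):
    first, second = random.sample(deck, 2)  # ordered draw without replacement
    if first[0] == "A" and second[0] == "K":
        hits += 1

print(hits / trials)  # close to 4/663 ≈ 0.006
```

`random.sample` draws without replacement, which models the fact that the first card is not returned to the deck.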
Define the events:

- A1: the first card is an Ace.
- K2: the second card is a King.

Since the second event depends on the outcome of the first (one card is already removed), we compute

    P(A1 ∩ K2) = P(K2|A1) · P(A1) = (4/51) · (4/52) = 4/663 ≈ 0.006.

2.4 Independence of Events

Two events A and B are called independent if the occurrence of one does not affect the probability of the other. Formally,

    P(A ∩ B) = P(A) · P(B),

or equivalently, P(A|B) = P(A) and P(B|A) = P(B).

2.4.1 Example: Coin Flips

If you flip a fair coin 10 times and get heads every time, what is the probability of getting heads on the next flip? Since the flips are independent,

    P(Heads on 11th flip) = 1/2.

The result of previous flips has no effect on future flips if the coin is fair.

2.5 Pairwise vs Mutual Independence

For three or more events, pairwise independence means that any two of the events are independent, but this does not imply mutual independence (where all events are independent simultaneously). For example, events A, B, and C can be pairwise independent, but not mutually independent.

2.5.1 Example: Rolling a Tetrahedron

Consider rolling a balanced tetrahedron with sides numbered 2, 3, 5, and 30. Define the events:

- A: the result is even.
- B: the result is divisible by 3.
- C: the result is divisible by 5.

These events are pairwise independent but not mutually independent: 30 is the only outcome in each pairwise intersection, so for instance P(A ∩ B) = 1/4 = P(A) · P(B), yet 30 is also the only number satisfying all three conditions, so P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A) · P(B) · P(C).

2.6 Conditional Independence

Sometimes two events may not be independent, but they can be conditionally independent given a third event. We say that A and B are conditionally independent given C if

    P(A ∩ B|C) = P(A|C) · P(B|C).

Conditional independence is a useful concept when additional information (the event C) changes the relationship between A and B.

2.6.1 Example: Admission to Programs

Consider two events:

- DS: accepted to a Data Science program.
- Stat: accepted to a Statistics program.
Without any other information, these two events are dependent (being accepted to one might suggest a higher chance of being accepted to the other). However, given a student's GPA (Grade Point Average), the events might become conditionally independent: if both admissions are based solely on GPA, knowing a student's GPA provides all the relevant information, making DS and Stat conditionally independent given GPA.

2.7 The Law of Total Probability

The Law of Total Probability allows us to compute the overall probability of an event by considering a partition of the sample space:

    P(A) = P(A|B) · P(B) + P(A|B^c) · P(B^c),

where B and B^c form a partition of the sample space.

2.7.1 Example: Drawing a Marble

Suppose you are drawing a marble from one of two boxes:

- Box 1 contains 4 red and 6 green marbles.
- Box 2 contains 7 red and 3 green marbles.

You flip a coin to choose which box to draw from; the probability of the coin landing heads (and selecting Box 1) is 0.8. What is the probability of drawing a green marble? Define the events:

- G: drawing a green marble.
- H: the coin lands heads.

Using the Law of Total Probability:

    P(G) = P(G|H) · P(H) + P(G|H^c) · P(H^c) = (6/10) · 0.8 + (3/10) · 0.2 = 0.48 + 0.06 = 0.54 = 54%.

Chapter 3: Around Bayes' Theorem

3.1 Introduction

This chapter mainly focuses on Bayes' Theorem, a fundamental result in probability theory. It also covers:

- The formula of Bayes' Theorem.
- Interpretation and explanation of its practical usage.
- Simple applications, including the famous Monty Hall problem.
- Bayesian folklore and its impact on modern applications.

3.2 Card Example and Prediction

Let's introduce a simple, yet insightful, problem involving the prediction of the color of a hidden card face using Bayesian reasoning.

3.2.1 Setup

You are given three cards:

- CR: both sides are red.
- CB: both sides are blue.
- CRB: one side is red, the other is blue.

You shuffle the cards in a bag and randomly pick one.
Without looking at the other side of the card, you place it on the table with one visible face. The question is: "What is the probability that the hidden side of the card has the same color as the visible side?"

3.2.2 Solution

This problem is solved by recognizing that we are dealing with the faces of the cards, not the cards themselves. There are three equally likely ways to observe a specific face after randomly picking a card, and two of these possibilities come from cards with identical sides (the CR and CB cards). Hence

    P(Same Color) = 2/3.

Explanation: since two of the three faces of the observed color belong to cards with matching sides, the probability that the card has identical sides is higher than the naive guess of 1/2. This reasoning is clarified by Bayes' Theorem, which shows how prior information updates our beliefs about uncertain events.

3.3 Bayes' Theorem: Formula and Explanation

Bayes' Theorem is used to calculate the probability of a hypothesis H given some observed evidence E. The formula is derived from the multiplication rule in probability:

    P(H|E) = P(E|H) · P(H) / P(E),

where:

- P(H) is the prior probability of the hypothesis.
- P(E|H) is the likelihood, i.e. the probability of observing evidence E given H.
- P(E) is the total probability of the evidence, calculated as
      P(E) = P(E|H) · P(H) + P(E|H^c) · P(H^c).

This allows us to calculate P(H|E), the posterior probability, which updates our belief about H after observing the evidence E.

3.4 Application: Diagnostic Testing Example

The next example discusses Bayesian inference in a diagnostic setting, particularly a medical test for a rare disease. Suppose the disease affects 0.1% of the population, and the test has 99% sensitivity (true positive rate) and 99% specificity (true negative rate).
The question is: "What is the probability that a person has the disease given a positive test result?" Using Bayes' Theorem,

    P(Disease|Positive Test) = P(Positive Test|Disease) · P(Disease) / P(Positive Test),

where:

- P(Positive Test|Disease) = 0.99 (the test's sensitivity).
- P(Disease) = 0.001 (the prevalence of the disease in the population).
- P(Positive Test) = 0.99 × 0.001 + 0.01 × 0.999 = 0.01098.

Thus, applying Bayes' Theorem:

    P(Disease|Positive Test) = (0.99 × 0.001) / 0.01098 ≈ 0.09 = 9%.

Despite a positive test result, the probability of actually having the disease is only 9%.

3.5 Iterated Bayes: Repeated Testing

Bayes' Theorem can be applied iteratively. Suppose you take a second, independent test at a different lab, which also returns positive. In this case, the posterior probability from the first test becomes the prior for the second test. Reapplying Bayes' Theorem increases the probability of having the disease substantially, as each piece of evidence is incorporated into the analysis.

3.6 Monty Hall Problem

The famous Monty Hall Problem is introduced to further explain Bayesian reasoning. In this game show scenario, you choose one of three doors, behind one of which is a prize. Monty Hall, the host, opens one of the two remaining doors to reveal that the prize is not behind it. You are then given the option to stick with your original choice or switch to the other unopened door. The question is: "Is it better to stick with your initial choice or switch doors?"

Using Bayes' Theorem, the solution reveals that switching doors increases your probability of winning the prize from 1/3 to 2/3, as the likelihood of winning by switching is higher after Monty provides additional information. Let Hi denote the hypothesis that the prize is behind door i. We make the following assumptions:

1. The three hypotheses are equiprobable a priori, that is,
       P(H1) = P(H2) = P(H3) = 1/3.

2.
The data we receive, after choosing door 1, is one of D = 3 or D = 2, meaning that Monty opens door 3 or door 2.

3. These two outcomes have the following probabilities:

       P(D = 2|H1) = 1/2,   P(D = 3|H1) = 1/2,
       P(D = 2|H2) = 0,     P(D = 3|H2) = 1,
       P(D = 2|H3) = 1,     P(D = 3|H3) = 0.

In other words: if the prize is behind door 1, Monty has a free choice, and we assume he selects at random between D = 2 and D = 3; otherwise, the choice is forced, and the probabilities are 0 or 1.

3.7 Solution

Using Bayes' Theorem, we evaluate the posterior probabilities of the hypotheses:

    P(Hi|D = 3) = P(D = 3|Hi) · P(Hi) / P(D = 3),

that is,

    P(H1|D = 3) = (1/2 · 1/3) / P(D = 3),
    P(H2|D = 3) = (1 · 1/3) / P(D = 3),
    P(H3|D = 3) = (0 · 1/3) / P(D = 3) = 0.

The denominator P(D = 3) is the normalizing constant of this posterior distribution:

    P(D = 3) = 1/2 · 1/3 + 1 · 1/3 + 0 · 1/3 = 1/2.

So:

    P(H1|D = 3) = 1/3,   P(H2|D = 3) = 2/3,   P(H3|D = 3) = 0.

Conclusion: the contestant should switch to door 2 to have the best chance of winning the prize.

3.8 Bayesian Learning and the Replicator Dynamic

We can also compare Bayesian learning to the discrete replicator dynamic from evolutionary biology. Both involve updating probabilities (or population proportions) based on new evidence or environmental fitness landscapes; Bayes' Theorem is a special case of this replicator dynamic, where probabilities are adjusted in response to new data. In both contexts:

- The prior beliefs (or initial population state) are updated as new information is gained.
- The posterior beliefs (or new population state) reflect the updated knowledge after considering the new evidence.

Chapter 4: Random Objects

This chapter explores the notion of random variables and how they link sample spaces and events to measurable data. This concept is central to both data science and statistics. The main points covered include:

- Discrete and continuous random variables.
- Probability Mass Function (PMF) and Probability Density Function (PDF).
- Cumulative Distribution Function (CDF).
- Quantile Function.
- Statistical folklore and the basic probability spaces.

4.1 Random Variables and Measurements

Random variables are functions that assign real numbers to the outcomes of an experiment. Formally, given a probability space (Ω, P), a random variable X is a mapping

    X : Ω → R

that assigns a real number X(ω) to each outcome ω ∈ Ω. The same concept extends to random vectors: given a probability space (Ω, P), a random vector X is a mapping

    X : Ω → R^p,

which assigns a p-dimensional vector X(ω) to each outcome ω ∈ Ω.

[Figure 4.1: Random Variable]

4.2 Mapping Outcomes to Probabilities

To work with probabilities, we need to understand how the random variable transports the probability measure defined on the sample space Ω to the real values it can assume in R. We do this using the preimage of a subset E ⊆ R:

    P(X ∈ E) = P({ω ∈ Ω : X(ω) ∈ E}).

In other words, the probability of the event X ∈ E is the probability of all outcomes ω in Ω that are mapped into the subset E by the random variable X.

4.2.1 Preimage Example

For a random variable X and a subset E ⊆ R, the preimage of E is

    X^{-1}(E) = {ω ∈ Ω : X(ω) ∈ E} ⊆ Ω.

Thus, the probability of X taking values in E is

    P(X ∈ E) = P({ω ∈ X^{-1}(E)}) = P(X^{-1}(E)).

4.3 Example: Coin Toss Experiment

Consider the following example: the experiment is flipping a fair coin n = 5 times. The sample space Ω_5 is the set of all possible sequences of heads (H) and tails (T), and the random variable X counts the number of tails (T) observed in the five flips. The probability of any specific outcome ω ∈ Ω_5 is

    P(ω) = 1/2^5 = 1/32.

The random variable X can take values in {0, 1, 2, 3, 4, 5}, corresponding to the number of tails observed.
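The distribution of X can be computed directly from this definition, by enumerating the 32 outcomes and counting preimages. A small sketch (the helper names are illustrative):

```python
from itertools import product
from fractions import Fraction
from math import comb

# Sample space of 5 coin flips: 2^5 = 32 equally likely sequences
omega5 = list(product("HT", repeat=5))

# The random variable X maps each outcome to its number of tails
def X(omega):
    return omega.count("T")

# P(X = k) = P(X^{-1}({k})) = |preimage| / |Ω_5|
for k in range(6):
    preimage = [w for w in omega5 if X(w) == k]
    p = Fraction(len(preimage), len(omega5))
    assert p == Fraction(comb(5, k), 2**5)  # agrees with the binomial pmf
    print(k, p)
```

The assertion checks, for every k, that counting preimages reproduces the binomial probabilities derived in the next subsection.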
4.3.1 Probability Distribution for X

The probability of observing X = k tails is given by the binomial distribution:

    P(X = k) = C(5, k) · (1/2)^5,

where C(5, k) is the binomial coefficient.

4.4 Discrete vs Continuous Random Variables

Random variables can be either discrete or continuous:

- Discrete random variables take a countable number of possible values. For example, the number of tails in 5 coin flips is discrete.
- Continuous random variables take values in an uncountable set, like distances or times.

4.4.1 Discrete Random Variables (PMF)

A random variable X is discrete if it takes at most countably many values {x1, x2, ...}. Its probability mass function (PMF) is defined as

    p_X(x) = P(X = x).

Properties of the PMF:

1. p_X(x) ≥ 0 for all x ∈ R.
2. The probabilities sum to 1: Σ_i p_X(x_i) = 1.

4.4.2 Continuous Random Variables (PDF)

A random variable X is continuous if there exists a probability density function (PDF) f_X(x) such that

    P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx.

Properties of the PDF:

1. f_X(x) ≥ 0 for all x ∈ R.
2. The integral over all of R equals 1: ∫_{−∞}^{+∞} f_X(x) dx = 1.

4.5 Cumulative Distribution Function (CDF)

The cumulative distribution function (CDF) of a random variable X is defined as

    F_X(x) = P(X ≤ x).

Properties of the CDF:

1. F_X(x) is non-decreasing.
2. F_X(x) → 0 as x → −∞ and F_X(x) → 1 as x → +∞.
3. For a continuous random variable, the CDF is related to the PDF by
       F_X(x) = ∫_{−∞}^x f_X(t) dt.

4.6 Quantile Function

The quantile function Q(p) is the (generalized) inverse of the CDF, defined as

    Q(p) = inf{x : F_X(x) ≥ p}.

For continuous random variables this function is often a true inverse. When F_X(x) is not strictly monotonic, the CDF has plateaus, and the quantile function returns the smallest x for which F_X(x) ≥ p.

4.6.1 Example: Pareto Distribution

The Pareto distribution has CDF

    F_X(x) = 1 − (1/x)^α   for x ≥ 1.

For the quantile function, we solve p = F_X(q_p) = 1 − (1/q_p)^α for q_p, giving

    q_p = (1/(1 − p))^{1/α}.
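The closed-form quantile above can be sanity-checked numerically: the round trip F_X(Q(p)) = p should hold for any p in (0, 1). A minimal sketch (the function names are illustrative):

```python
# Pareto(alpha) cdf on x >= 1 and its quantile (inverse cdf)
def pareto_cdf(x, alpha):
    return 1 - (1 / x) ** alpha

def pareto_quantile(p, alpha):
    return (1 / (1 - p)) ** (1 / alpha)

alpha = 2.0
# Round-trip check: the quantile function inverts the cdf
for p in (0.1, 0.5, 0.9):
    q = pareto_quantile(p, alpha)
    assert q >= 1.0                               # Pareto support is [1, ∞)
    assert abs(pareto_cdf(q, alpha) - p) < 1e-12  # F_X(Q(p)) = p
print("round trip OK")
```

For instance, with α = 1 the median is Q(0.5) = 1/(1 − 0.5) = 2, which the formula reproduces directly.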
