CS 70 Discrete Mathematics and Probability Theory, Fall 2024 Course Notes, Note 19

Geometric and Poisson Distributions

Recall our basic probabilistic experiment of tossing a biased coin n times. This is a very simple model, yet surprisingly powerful. Many important probability distributions that are widely used to model real-world phenomena can be derived from looking at this basic coin tossing model.

The first example is the Binomial(n, p) distribution, introduced earlier. This is the distribution of the number of Heads, S_n, in n tosses of a biased coin with probability p of Heads. Recall that the distribution of S_n is

    P[S_n = k] = \binom{n}{k} p^k (1 − p)^{n−k},   for k ∈ {0, 1, ..., n}.

The expected value is E[S_n] = np and the variance is Var(S_n) = np(1 − p). The binomial distribution is frequently used to model the number of successes in a repeated experiment.

1 Geometric Distribution

Consider repeatedly tossing a biased coin with Heads probability p. Let X denote the number of tosses until the first Head appears. Then X is a random variable that takes values in Z^+, the set of positive integers. The event that X = i is equal to the event of observing Tails for the first i − 1 tosses and getting Heads on the i-th toss, which occurs with probability (1 − p)^{i−1} p. Such a random variable is called a geometric random variable. Note that X is unbounded: i.e., for any positive integer i (no matter how large), there is some positive probability that X = i.

The geometric distribution frequently occurs in applications because we are often interested in how long we have to wait before a certain event happens: how many runs before the system fails, how many shots before one is on target, how many poll samples before we find a Democrat, how many retransmissions of a packet before successfully reaching the destination, etc.

Definition 19.1 (Geometric Distribution). A random variable X for which

    P[X = i] = (1 − p)^{i−1} p,   for i = 1, 2, 3, ...,

is said to have the geometric distribution with parameter p. This is abbreviated as X ∼ Geometric(p) or X ∼ Geom(p).

As a quick check, we can verify that the total probability of X is equal to 1:

    ∑_{i=1}^∞ P[X = i] = ∑_{i=1}^∞ (1 − p)^{i−1} p = p ∑_{i=1}^∞ (1 − p)^{i−1} = p × 1/(1 − (1 − p)) = 1,

where in the second-to-last step we have used the formula for the sum of a geometric series. If we plot the distribution of X (i.e., the values P[X = i] against i), we get a curve that decreases monotonically by a factor of 1 − p at each step, as illustrated in Figure 1.

[Figure 1: Illustration of the Geometric(p) distribution for p = 0.1 and p = 0.3; each panel plots P[X = k] against k.]

1.1 Mean and Variance of a Geometric Random Variable

Let us now compute the expectation E[X]. Applying the definition of expected value directly gives us:

    E[X] = ∑_{i=1}^∞ i × P[X = i] = p ∑_{i=1}^∞ i (1 − p)^{i−1}.

However, the final summation is a little tricky to evaluate (though it can be done). Instead, we will use the following alternative formula for expectation that simplifies the calculation and is useful in its own right.

Theorem 19.1 (Tail Sum Formula). Let X be a random variable that takes values in {0, 1, 2, ...}. Then

    E[X] = ∑_{i=1}^∞ P[X ≥ i].

Proof. For notational convenience, let's write p_i = P[X = i], for i = 0, 1, 2, ....
From the definition of expectation, we have

    E[X] = (0 × p_0) + (1 × p_1) + (2 × p_2) + (3 × p_3) + (4 × p_4) + · · ·
         = p_1 + (p_2 + p_2) + (p_3 + p_3 + p_3) + (p_4 + p_4 + p_4 + p_4) + · · ·
         = (p_1 + p_2 + p_3 + p_4 + · · ·) + (p_2 + p_3 + p_4 + · · ·) + (p_3 + p_4 + · · ·) + (p_4 + · · ·) + · · ·
         = P[X ≥ 1] + P[X ≥ 2] + P[X ≥ 3] + P[X ≥ 4] + · · · .

In the third line, we have regrouped the terms into convenient infinite sums, and each infinite sum is exactly the probability that X ≥ i for the corresponding i. You should check that you understand how the fourth line follows from the third.

Let us repeat the proof more formally, this time using more compact mathematical notation:

    E[X] = ∑_{j=1}^∞ j × P[X = j] = ∑_{j=1}^∞ ∑_{i=1}^{j} P[X = j] = ∑_{i=1}^∞ ∑_{j=i}^∞ P[X = j] = ∑_{i=1}^∞ P[X ≥ i],

where the third equality follows from interchanging the order of summation.

We can now use Theorem 19.1 to compute E[X] more easily.

Theorem 19.2. For X ∼ Geometric(p), we have E[X] = 1/p.

Proof. The key observation is that for a geometric random variable X,

    P[X ≥ i] = (1 − p)^{i−1}   for i = 1, 2, ....    (1)

We can obtain this simply by summing P[X = j] for j ≥ i. Another way of seeing this is to note that the event "X ≥ i" means at least i tosses are required, which is equivalent to saying that the first i − 1 tosses are all Tails; the probability of this event is precisely (1 − p)^{i−1}. Now, plugging (1) into Theorem 19.1, we get

    E[X] = ∑_{i=1}^∞ P[X ≥ i] = ∑_{i=1}^∞ (1 − p)^{i−1} = 1/(1 − (1 − p)) = 1/p,

where we have used the formula for geometric series.

So, the expected number of tosses of a biased coin until the first Head appears is 1/p. Thus for a fair coin, the expected number of tosses until the first Head is 2 (but of course the actual number of tosses can be any positive integer).

Let us now compute the variance of X.

Theorem 19.3. For X ∼ Geometric(p), we have Var(X) = (1 − p)/p².

Proof. We use the formula

    Var(X) = E[X²] − (E[X])² = E[X²] − 1/p².    (2)

Thus we just have to compute E[X²], which we can do using a calculus trick as follows. Starting from

    ∑_{i=0}^∞ (1 − p)^i = 1/p

and differentiating with respect to p, we get

    ∑_{i=1}^∞ i (1 − p)^{i−1} = 1/p².    (3)

Here we dropped the i = 0 term, since its derivative is zero, and multiplied both sides by −1. (Note incidentally that (3) gives an alternative proof of the fact that E[X] = 1/p; why?)

Now we multiply both sides of (3) by (1 − p) and differentiate again to get

    ∑_{i=1}^∞ i² (1 − p)^{i−1} = (2 − p)/p³.

Since the left-hand side here is exactly (1/p) E[X²], we conclude that

    E[X²] = (2 − p)/p².

Finally, plugging this into (2) gives us

    Var(X) = (2 − p)/p² − 1/p² = (1 − p)/p²,

as claimed.
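The formulas E[X] = 1/p and Var(X) = (1 − p)/p² are easy to check empirically. The short Python sketch below is an illustration we have added (it is not part of the original note); the function name sample_geometric and the parameter choices are ours. It simulates the coin-tossing definition of a Geometric(p) random variable and compares the sample mean and variance with the theoretical values.

    import random

    def sample_geometric(p):
        """Toss a coin with Heads probability p until the first Head; return the number of tosses."""
        tosses = 1
        while random.random() >= p:   # with probability 1 - p the toss is Tails, so keep going
            tosses += 1
        return tosses

    p, trials = 0.3, 200_000
    samples = [sample_geometric(p) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((x - mean) ** 2 for x in samples) / trials
    print(f"mean: {mean:.3f}  vs  1/p = {1/p:.3f}")
    print(f"var:  {var:.3f}  vs  (1-p)/p^2 = {(1 - p)/p**2:.3f}")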
1.2 Memoryless Property

The geometric distribution has a special property called memorylessness. (This property is shared by the exponential distribution, the continuous analog of the geometric distribution, which we will cover in a future note; indeed, these are the only two distributions that are memoryless.) To get some intuition for this property, suppose we revisit our earlier motivational example of repeatedly tossing a biased coin with heads probability p, and let X denote the number of tosses until the first head appears. Thus X ∼ Geometric(p). As we've seen, the probability that we need more than n tosses before getting the first head is P[X > n] = (1 − p)^n, since the first n tosses must all have been tails. What if we've already tossed the coin m times without getting a head? What is the probability that we need more than n additional tosses before getting our first head?

We can calculate this probability as:

    P[X > n + m | X > m] = P[X > n + m] / P[X > m]
                         = (1 − p)^{n+m} / (1 − p)^m
                         = (1 − p)^n
                         = P[X > n],

where the first equality holds because the event {X > n + m} is contained in the event {X > m}. This gives us the formal statement of the memoryless property of geometric distributions:

    P[X > n + m | X > m] = P[X > n].

The geometric distribution is memoryless in the sense that the length of time we have already been waiting does not affect the additional time we have to wait until we get our first head. (This, of course, defies the "common wisdom" that if we've tossed a coin ten times and it's come up tails every time, then surely it's more likely to come up heads on the 11th toss!)

1.3 Application: Coupon Collecting

Recall the coupon collecting problem from a previous note: we are trying to collect a set of n different baseball cards by buying boxes of cereal, each of which contains exactly one random card (uniformly distributed among the n cards). How many boxes do we need to buy until we have collected at least one copy of every card?

Let S_n denote the number of boxes we need to buy in order to collect all n cards. The distribution of S_n is difficult to compute directly (try it for n = 3!), and we did some calculations in a previous note to estimate the likely values of S_n. But if we are only interested in the expected value E[S_n], then we can evaluate it easily using linearity of expectation and what we have just learned about the geometric distribution.

We start by writing

    S_n = X_1 + X_2 + · · · + X_n    (4)

for simpler random variables X_i, where X_i is the number of boxes we buy while trying to get the i-th new card (starting immediately after we have gotten the (i − 1)-st new card). With this definition, make sure you believe (4) before proceeding.

What does the distribution of X_i look like? Well, X_1 is trivial: no matter what happens, we always get a new card in the first box (since we have none to start with). So P[X_1 = 1] = 1, and thus E[X_1] = 1.

How about X_2? Each time we buy a box, we will get the same old card with probability 1/n, and a new card with probability (n − 1)/n. So we can think of buying boxes as flipping a biased coin with Heads probability p = (n − 1)/n; then X_2 is just the number of tosses until the first Head appears. So X_2 has the geometric distribution with parameter p = (n − 1)/n, and

    E[X_2] = n/(n − 1).

How about X_3? This is very similar to X_2 except that now we only get a new card with probability (n − 2)/n (since there are now two old ones). So X_3 has the geometric distribution with parameter p = (n − 2)/n, and

    E[X_3] = n/(n − 2).

Arguing in the same way, we see that, for i = 1, 2, ..., n, X_i has the geometric distribution with parameter p = (n − i + 1)/n, and hence that

    E[X_i] = n/(n − i + 1).

Finally, applying linearity of expectation to (4), we get

    E[S_n] = ∑_{i=1}^n E[X_i] = n/n + n/(n − 1) + · · · + n/2 + n/1 = n ∑_{i=1}^n 1/i.    (5)

This is an exact expression for E[S_n]. We can obtain a tidier form by noting that the sum in (5) actually has a very good approximation (another of the little tricks you might like to carry around in your toolbox), namely

    ∑_{i=1}^n 1/i ≈ ln n + γ,

where γ = 0.5772... is known as Euler's constant.

Thus, the expected number of cereal boxes needed to collect n cards is about n(ln n + γ). This is an excellent approximation to the exact formula (5) even for quite small values of n. So for example, for n = 100, we expect to buy about 518 boxes.
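As a sanity check on formula (5), here is a small simulation we have added (it is not part of the original note); the function name boxes_to_collect_all and the trial counts are our own choices. It estimates E[S_n] for n = 100 by direct simulation and compares the result with the exact harmonic sum and with the approximation n(ln n + γ).

    import math
    import random

    def boxes_to_collect_all(n):
        """Buy boxes, each containing a uniformly random card in {0, ..., n-1}, until all n cards appear."""
        seen, boxes = set(), 0
        while len(seen) < n:
            seen.add(random.randrange(n))
            boxes += 1
        return boxes

    n, trials = 100, 2000
    avg = sum(boxes_to_collect_all(n) for _ in range(trials)) / trials
    exact = n * sum(1 / i for i in range(1, n + 1))            # formula (5): n times the harmonic sum
    approx = n * (math.log(n) + 0.5772)                        # n (ln n + gamma)
    print(round(avg, 1), round(exact, 1), round(approx, 1))    # all roughly 518 for n = 100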
2 Poisson Distribution

Consider the number of clicks of a Geiger counter, which measures radioactive emissions. The average number of such clicks per unit time, λ, is a measure of radioactivity, but the actual number of clicks fluctuates according to a certain distribution called the Poisson distribution. What is remarkable is that the average value, λ, completely determines the probability distribution on the number of clicks X.

Definition 19.2 (Poisson Distribution). A random variable X for which

    P[X = i] = (λ^i / i!) e^{−λ},   for i = 0, 1, 2, ...,    (6)

is said to have the Poisson distribution with parameter λ. This is abbreviated as X ∼ Poisson(λ).

To make sure this is a valid definition, let us check that (6) is in fact a distribution, i.e., that the probabilities sum to 1. We have

    ∑_{i=0}^∞ (λ^i / i!) e^{−λ} = e^{−λ} ∑_{i=0}^∞ λ^i / i! = e^{−λ} × e^λ = 1.

In the second-to-last step, we used the Taylor series expansion e^x = 1 + x + x²/2! + x³/3! + · · ·.

The Poisson distribution is a very widely accepted model for so-called "rare events," such as disconnected phone calls, radioactive emissions, crossovers in chromosomes, the number of cases of disease, the number of births per hour, etc. This model is appropriate whenever the occurrences can be assumed to happen randomly with some constant density λ in a continuous region (of time or space), such that occurrences in disjoint subregions are independent. One can then show that the number of occurrences in a region of unit size should obey the Poisson distribution with parameter λ.

Example. Suppose when we write an article, we make an average of 1 typo per page. We can model this with a Poisson random variable X with λ = 1. So the probability that a page has 5 typos is

    P[X = 5] = (1^5 / 5!) e^{−1} = 1/(120e) ≈ 1/326.

Now suppose the article has 200 pages. If we assume the numbers of typos on each page are independent, then the probability that there is at least one page with exactly 5 typos is

    P[∃ a page with exactly 5 typos] = 1 − P[every page has ≠ 5 typos]
                                     = 1 − ∏_{k=1}^{200} P[page k has ≠ 5 typos]
                                     = 1 − ∏_{k=1}^{200} (1 − P[page k has exactly 5 typos])
                                     = 1 − (1 − 1/(120e))^{200} ≈ 0.46,

where in the last step we have used our earlier calculation for P[X = 5].

2.1 Mean and Variance of a Poisson Random Variable

Let us now calculate the expectation and variance of a Poisson random variable.

Theorem 19.4. For a Poisson random variable X ∼ Poisson(λ), we have E[X] = λ and Var(X) = λ.

Proof. We can calculate E[X] directly from the definition of expectation:

    E[X] = ∑_{i=0}^∞ i × P[X = i]
         = ∑_{i=1}^∞ i × (λ^i / i!) e^{−λ}          (the i = 0 term is equal to 0, so we omit it)
         = λ e^{−λ} ∑_{i=1}^∞ λ^{i−1} / (i − 1)!
         = λ e^{−λ} e^λ                             (since e^λ = ∑_{j=0}^∞ λ^j / j!, with j = i − 1)
         = λ.

Similarly, we can calculate E[X²] as follows:

    E[X²] = ∑_{i=0}^∞ i² × P[X = i]
          = ∑_{i=1}^∞ i² × (λ^i / i!) e^{−λ}        (the i = 0 term is equal to 0, so we omit it)
          = λ e^{−λ} ∑_{i=1}^∞ i λ^{i−1} / (i − 1)!
          = λ e^{−λ} [ ∑_{i=2}^∞ λ^{i−1} / (i − 2)! + ∑_{i=1}^∞ λ^{i−1} / (i − 1)! ]    (replacing i by (i − 1) + 1)
          = λ e^{−λ} [ λ e^λ + e^λ ]                (using e^λ = ∑_{j=0}^∞ λ^j / j! twice, with j = i − 2 and j = i − 1)
          = λ² + λ.

Therefore, Var(X) = E[X²] − E[X]² = λ² + λ − λ² = λ, as desired.

A plot of the Poisson distribution reveals a curve that rises monotonically to a single peak and then decreases monotonically. The peak is as close as possible to the expected value, i.e., at i = ⌊λ⌋. Figure 2 illustrates the Poisson(λ) distribution for λ = 1, 2, 5, 20.
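The arithmetic in the typo example is easy to reproduce. The following Python lines are an added illustration (not from the original note); the helper poisson_pmf is our own name for the PMF in (6).

    import math

    def poisson_pmf(i, lam):
        """P[X = i] for X ~ Poisson(lam), as in (6)."""
        return lam**i / math.factorial(i) * math.exp(-lam)

    p5 = poisson_pmf(5, 1.0)           # probability a single page has exactly 5 typos
    print(p5, 1 / (120 * math.e))      # both are about 0.00307, i.e. roughly 1/326
    print(1 - (1 - p5) ** 200)         # at least one of 200 pages has exactly 5 typos: about 0.46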
2.2 Sum of Independent Poisson Random Variables

As illustrated in Figure 2, the Poisson(λ) distribution becomes more symmetric and resembles a "bell curve" as λ increases. As we will learn later, this phenomenon is closely related to the following useful fact regarding a sum of independent Poisson random variables.

Theorem 19.5. Let X ∼ Poisson(λ) and Y ∼ Poisson(µ) be independent Poisson random variables. Then X + Y ∼ Poisson(λ + µ).

Proof. For all k = 0, 1, 2, ..., we have

    P[X + Y = k] = ∑_{j=0}^k P[X = j, Y = k − j]
                 = ∑_{j=0}^k P[X = j] P[Y = k − j]
                 = ∑_{j=0}^k (λ^j / j!) e^{−λ} × (µ^{k−j} / (k − j)!) e^{−µ}
                 = e^{−(λ+µ)} (1/k!) ∑_{j=0}^k (k! / (j! (k − j)!)) λ^j µ^{k−j}
                 = e^{−(λ+µ)} (λ + µ)^k / k!,

where the second equality follows from independence, and the last equality from the binomial theorem.

[Figure 2: Illustration of the Poisson(λ) distribution for λ = 1, 2, 5, 20; each panel plots P[X = k] against k.]

By induction, we conclude that if X_1, X_2, ..., X_n are independent Poisson random variables with parameters λ_1, λ_2, ..., λ_n, respectively, then

    X_1 + X_2 + · · · + X_n ∼ Poisson(λ_1 + λ_2 + · · · + λ_n).
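Theorem 19.5 can also be checked numerically by convolving two Poisson PMFs. The snippet below is an added sketch (not part of the original note); poisson_pmf and the parameter values λ = 2 and µ = 3.5 are our own choices.

    import math

    def poisson_pmf(i, lam):
        """P[X = i] for X ~ Poisson(lam)."""
        return lam**i / math.factorial(i) * math.exp(-lam)

    lam, mu = 2.0, 3.5
    for k in range(8):
        # Convolution over the ways X + Y can equal k, using independence.
        conv = sum(poisson_pmf(j, lam) * poisson_pmf(k - j, mu) for j in range(k + 1))
        print(k, round(conv, 6), round(poisson_pmf(k, lam + mu), 6))   # the two columns agree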
2.3 Poisson as Limit of Binomial

To see a concrete example of how the Poisson distribution arises, suppose we want to model the number of cell phone users initiating calls in a network during a time period, of duration (say) 1 minute. There are many customers in the network, and all of them can potentially make a call during this time period. However, only a very small fraction of them actually will. Under this scenario, it seems reasonable to make two assumptions:

  • The probability of having more than 1 customer initiating a call in any small time interval is negligible.
  • The initiations of calls in disjoint time intervals are independent events.

Then, if we divide the one-minute time period into n disjoint intervals, the number of calls X in that time period can be modeled as a Binomial(n, p) random variable, where p is the probability of having a call initiated in a time interval of length 1/n. But what should p be in terms of the relevant parameters of the problem? If calls are initiated at an average rate of λ calls per minute, then E[X] = λ and so np = λ, i.e., p = λ/n. So X ∼ Binomial(n, λ/n). As we shall see below, as we let n tend to infinity, this distribution tends to the Poisson distribution with parameter λ. This explains why the Poisson distribution is a model for "rare events": it counts the number of heads in a very long sequence (n → ∞) of coin flips, where the expected number of heads is a small finite number.

Now we will prove the claim above that Poisson(λ) is the limit of Binomial(n, λ/n) as n tends to infinity.

Theorem 19.6. Let X ∼ Binomial(n, λ/n), where λ > 0 is a fixed constant. Then for every i = 0, 1, 2, ...,

    P[X = i] → (λ^i / i!) e^{−λ}   as n → ∞.

That is, the probability distribution of X converges to the Poisson distribution with parameter λ.

Proof. Fix i ∈ {0, 1, 2, ...}, and assume n ≥ i (because we will let n → ∞). Then, because X has the binomial distribution with parameters n and p = λ/n,

    P[X = i] = \binom{n}{i} p^i (1 − p)^{n−i} = (n! / (i! (n − i)!)) (λ/n)^i (1 − λ/n)^{n−i}.

Let us collect the factors into

    P[X = i] = (λ^i / i!) × (n! / ((n − i)! n^i)) × (1 − λ/n)^n × (1 − λ/n)^{−i}.    (7)

For any fixed i, the first parenthesis above becomes, as n → ∞,

    n! / ((n − i)! n^i) = n (n − 1) · · · (n − i + 1) / n^i = (n/n) × ((n − 1)/n) × · · · × ((n − i + 1)/n) → 1.

Since λ is a fixed constant, the second parenthesis in (7) becomes, as n → ∞,

    (1 − λ/n)^n → e^{−λ}.

And since i is fixed, the third parenthesis in (7) becomes, as n → ∞,

    (1 − λ/n)^{−i} → (1 − 0)^{−i} = 1.

Substituting these results back into (7) gives us

    P[X = i] → (λ^i / i!) × 1 × e^{−λ} × 1 = (λ^i / i!) e^{−λ},

as desired.
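To see Theorem 19.6 in action numerically, the added sketch below (not part of the original note; binomial_pmf and poisson_pmf are our own helper names) compares P[X = 3] under Binomial(n, λ/n) with the Poisson(λ) limit for λ = 2 and increasing n.

    import math

    def binomial_pmf(i, n, p):
        """P[X = i] for X ~ Binomial(n, p)."""
        return math.comb(n, i) * p**i * (1 - p) ** (n - i)

    def poisson_pmf(i, lam):
        """P[X = i] for X ~ Poisson(lam)."""
        return lam**i / math.factorial(i) * math.exp(-lam)

    lam, i = 2.0, 3
    for n in (10, 100, 1000, 10000):
        print(n, round(binomial_pmf(i, n, lam / n), 6))    # approaches the Poisson value below
    print("Poisson limit:", round(poisson_pmf(i, lam), 6))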