Probability: Univariate Models PDF
Document Details
Uploaded by PanoramicHeisenberg
Kevin P. Murphy
Tags
Summary
This document introduces the concepts of probability theory, including frequentist and Bayesian interpretations. It explores the fundamental concepts and types of uncertainty which are useful in a variety of applications. The document also delves into probability as an extension of logic and provides examples to illustrate several concepts.
Full Transcript
2 Probability: Univariate Models 2.1 Introduction In this chapter, we give a brief introduction to the basics of probability theory. There are many good books that go into more detail, e.g., [GS97; BT08; Cha21]. 2.1.1 What is probability? Probability theory is nothing but common...
2 Probability: Univariate Models 2.1 Introduction In this chapter, we give a brief introduction to the basics of probability theory. There are many good books that go into more detail, e.g., [GS97; BT08; Cha21]. 2.1.1 What is probability? Probability theory is nothing but common sense reduced to calculation. — Pierre Laplace, 1812 We are all comfortable saying that the probability that a (fair) coin will land heads is 50%. But what does this mean? There are actually two di!erent interpretations of probability. One is called the frequentist interpretation. In this view, probabilities represent long run frequencies of events that can happen multiple times. For example, the above statement means that, if we flip the coin many times, we expect it to land heads about half the time.1 The other interpretation is called the Bayesian interpretation of probability. In this view, proba- bility is used to quantify our uncertainty or ignorance about something; hence it is fundamentally related to information rather than repeated trials [Jay03; Lin06]. In the Bayesian view, the above statement means we believe the coin is equally likely to land heads or tails on the next toss. One big advantage of the Bayesian interpretation is that it can be used to model our uncertainty about one-o! events that do not have long term frequencies. For example, we might want to compute the probability that the polar ice cap will melt by 2030 CE. This event will happen zero or one times, but cannot happen repeatedly. Nevertheless, we ought to be able to quantify our uncertainty about this event; based on how probable we think this event is, we can decide how to take the optimal action, as discussed in Chapter 5. We shall therefore adopt the Bayesian interpretation in this book. Fortunately, the basic rules of probability theory are the same, no matter which interpretation is adopted. 2.1.2 Types of uncertainty The uncertainty in our predictions can arise for two fundamentally di!erent reasons. The first is due to our ignorance of the underlying hidden causes or mechanism generating our data. This is 1. Actually, the Stanford statistician (and former professional magician) Persi Diaconis has shown that a coin is about 51% likely to land facing the same way up as it started, due to the physics of the problem [DHM07]. 34 Chapter 2. Probability: Univariate Models called epistemic uncertainty, since epistemology is the philosophical term used to describe the study of knowledge. However, a simpler term for this is model uncertainty. The second kind of uncertainty arises from intrinsic variability, which cannot be reduced even if we collect more data. This is sometimes called aleatoric uncertainty [Hac75; KD09], derived from the Latin word for “dice”, although a simpler term would be data uncertainty. As a concrete example, consider tossing a fair coin. We might know for sure that the probability of heads is p = 0.5, so there is no epistemic uncertainty, but we still cannot perfectly predict the outcome. This distinction can be important for applications such as active learning. A typical strategy is to query examples for which H(p(y|x, D)) is large (where H(p) is the entropy, discussed in Section 6.1). However, this could be due to uncertainty about the parameters, i.e., large H(p(ω|D)), or just due to inherent variability of the outcome, corresponding to large entropy of p(y|x, ω). In the latter case, there would not be much use collecting more samples, since our uncertainty would not be reduced. See [Osb16] for further discussion of this point. 2.1.3 Probability as an extension of logic In this section, we review the basic rules of probability, following the presentation of [Jay03], in which we view probability as an extension of Boolean logic. 2.1.3.1 Probability of an event We define an event, denoted by the binary variable A, as some state of the world that either holds or does not hold. For example, A might be event “it will rain tomorrow”, or “it rained yesterday”, or “the label is y = 1”, or “the parameter ω is between 1.5 and 2.0”, etc. The expression Pr(A) denotes the probability with which you believe event A is true (or the long run fraction of times that A will occur). We require that 0 → Pr(A) → 1, where Pr(A) = 0 means the event definitely will not happen, and Pr(A) = 1 means the event definitely will happen. We write Pr(A) to denote the probability of event A not happening; this is defined to be Pr(A) = 1 ↑ Pr(A). 2.1.3.2 Probability of a conjunction of two events We denote the joint probability of events A and B both happening as follows: Pr(A ↓ B) = Pr(A, B) (2.1) If A and B are independent events, we have Pr(A, B) = Pr(A) Pr(B) (2.2) For example, suppose X and Y are chosen uniformly at random from the set X = {1, 2, 3, 4}. Let A be the event that X ↔ {1, 2}, and B be the event that Y ↔ {3}. Then we have Pr(A, B) = Pr(A) Pr(B) = 12 · 14. 2.1.3.3 Probability of a union of two events The probability of event A or B happening is given by Pr(A ↗ B) = Pr(A) + Pr(B) ↑ Pr(A ↓ B) (2.3) “Probabilistic Machine Learning: An Introduction”. Online version. November 23, 2024 2.2. Random variables 35 If the events are mutually exclusive (so they cannot happen at the same time), we get Pr(A ↗ B) = Pr(A) + Pr(B) (2.4) For example, suppose X is chosen uniformly at random from the set X = {1, 2, 3, 4}. Let A be the event that X ↔ {1, 2} and B be the event that X ↔ {3}. Then we have Pr(A ↗ B) = 24 + 14. 2.1.3.4 Conditional probability of one event given another We define the conditional probability of event B happening given that A has occurred as follows: Pr(A, B) Pr(B|A) ↭ (2.5) Pr(A) This is not defined if Pr(A) = 0, since we cannot condition on an impossible event. 2.1.3.5 Independence of events We say that event A is independent of event B if Pr(A, B) = Pr(A) Pr(B) (2.6) 2.1.3.6 Conditional independence of events We say that events A and B are conditionally independent given event C if Pr(A, B|C) = Pr(A|C) Pr(B|C) (2.7) This is written as A ↘ B|C. Events are often dependent on each other, but may be rendered independent if we condition on the relevant intermediate variables, as we discuss in more detail later in this chapter. 2.2 Random variables Suppose X represents some unknown quantity of interest, such as which way a dice will land when we roll it, or the temperature outside your house at the current time. If the value of X is unknown and/or could change, we call it a random variable or rv. The set of possible values, denoted X , is known as the sample space or state space. An event is a set of outcomes from a given sample space. For example, if X represents the face of a dice that is rolled, so X = {1, 2,... , 6}, the event of “seeing a 1” is denoted X = 1, the event of “seeing an odd number” is denoted X ↔ {1, 3, 5}, the event of “seeing a number between 1 and 3” is denoted 1 → X → 3, etc. 2.2.1 Discrete random variables If the sample space X is finite or countably infinite, then X is called a discrete random variable. In this case, we denote the probability of the event that X has value x by Pr(X = x). We define the Author: Kevin P. Murphy. (C) MIT Press. CC-BY-NC-ND license 36 Chapter 2. Probability: Univariate Models (a) (b) Figure 2.1: Some discrete distributions on the state space X = {1, 2, 3, 4}. (a) A uniform distribution with p(x = k) = 1/4. (b) A degenerate distribution (delta function) that puts all its mass on x = 1. Generated by discrete_prob_dist_plot.ipynb. probability mass function or pmf as a function which computes the probability of events which correspond to setting the rv to each possible value: p(x) ↭ Pr(X = x) (2.8) ! The pmf satisfies the properties 0 → p(x) → 1 and x→X p(x) = 1. If X has a finite number of values, say K, the pmf can be represented as a list of K numbers, which we can plot as a histogram. For example, Figure 2.1 shows two pmf’s defined on X = {1, 2, 3, 4}. On the left we have a uniform distribution, p(x) = 1/4, and on the right, we have a degenerate distribution, p(x) = I (x = 1), where I () is the binary indicator function. Thus the distribution in Figure 2.1(b) represents the fact that X is always equal to the value 1. (Thus we see that random variables can also be constant.) 2.2.2 Continuous random variables If X ↔ R is a real-valued quantity, it is called a continuous random variable. In this case, we can no longer create a finite (or countable) set of distinct possible values it can take on. However, there are a countable number of intervals which we can partition the real line into. If we associate events with X being in each one of these intervals, we can use the methods discussed above for discrete random variables. Informally speaking, we can represent the probability of X taking on a specific real value by allowing the size of the intervals to shrink to zero, as we show below. 2.2.2.1 Cumulative distribution function (cdf ) Define the events A = (X → a), B = (X → b) and C = (a < X → b), where a < b. We have that B = A ↗ C, and since A and C are mutually exclusive, the sum rules gives Pr(B) = Pr(A) + Pr(C) (2.9) and hence the probability of being in interval C is given by Pr(C) = Pr(B) ↑ Pr(A) (2.10) “Probabilistic Machine Learning: An Introduction”. Online version. November 23, 2024 2.2. Random variables 37 ,/2 ,/2 )-1(,/2) 0 )-1(1-,/2) (a) (b) Figure 2.2: (a) Plot of the cdf for the standard normal, N (0, 1). Generated by gauss_plot.ipynb. (b) Corresponding pdf. The shaded regions each contain ω/2 of the probability mass. Therefore the nonshaded region contains 1 → ω of the probability mass. The leftmost cuto! point is !→1 (ω/2), where ! is the cdf of the Gaussian. By symmetry, the rightmost cuto! point is !→1 (1 → ω/2) = →!→1 (ω/2). Generated by quantile_plot.ipynb. In general, we define the cumulative distribution function or cdf of the rv X as follows: P (x) ↭ Pr(X → x) (2.11) (Note that we use a capital P to represent the cdf.) Using this, we can compute the probability of being in any interval as follows: Pr(a < X → b) = P (b) ↑ P (a) (2.12) Cdf’s are monotonically non-decreasing functions. See Figure 2.2a for an example, where we illustrate the cdf of a standard normal distribution, N (x|0, 1); see Section 2.6 for details. 2.2.2.2 Probability density function (pdf ) We define the probability density function or pdf as the derivative of the cdf: d p(x) ↭ P (x) (2.13) dx (Note that this derivative does not always exist, in which case the pdf is not defined.) See Figure 2.2b for an example, where we illustrate the pdf of a univariate Gaussian (see Section 2.6 for details). Given a pdf, we can compute the probability of a continuous variable being in a finite interval as follows: " b Pr(a < X → b) = p(x)dx = P (b) ↑ P (a) (2.14) a As the size of the interval gets smaller, we can write Pr(x < X → x + dx) ≃ p(x)dx (2.15) Intuitively, this says the probability of X being in a small interval around x is the density at x times the width of the interval. Author: Kevin P. Murphy. (C) MIT Press. CC-BY-NC-ND license 38 Chapter 2. Probability: Univariate Models 2.2.2.3 Quantiles If the cdf P is strictly monotonically increasing, it has an inverse, called the inverse cdf, or percent point function (ppf), or quantile function. If P is the cdf of X, then P ↑1 (q) is the value xq such that Pr(X → xq ) = q; this is called the q’th quantile of P. The value P ↑1 (0.5) is the median of the distribution, with half of the probability mass on the left, and half on the right. The values P ↑1 (0.25) and P ↑1 (0.75) are the lower and upper quartiles. For example, let ! be the cdf of the Gaussian distribution N (0, 1), and !↑1 be the inverse cdf. Then points to the left of !↑1 (ε/2) contain ε/2 of the probability mass, as illustrated in Figure 2.2b. By symmetry, points to the right of !↑1 (1 ↑ ε/2) also contain ε/2 of the mass. Hence the central interval (!↑1 (ε/2), !↑1 (1 ↑ ε/2)) contains 1 ↑ ε of the mass. If we set ε = 0.05, the central 95% interval is covered by the range (!↑1 (0.025), !↑1 (0.975)) = (↑1.96, 1.96) (2.16) If the distribution is N (µ, ϑ 2 ), then the 95% interval becomes (µ ↑ 1.96ϑ, µ + 1.96ϑ). This is often approximated by writing µ ± 2ϑ. 2.2.3 Sets of related random variables In this section, we discuss distributions over sets of related random variables. Suppose, to start, that we have two random variables, X and Y. We can define the joint distribution of two random variables using p(x, y) = p(X = x, Y = y) for all possible values of X and Y. If both variables have finite cardinality, we can represent the joint distribution as a 2d table, all of whose entries sum to one. For example, consider the following example with two binary variables: p(X, Y ) Y = 0 Y = 1 X=0 0.2 0.3 X=1 0.3 0.2 If two variables are independent, we can represent the joint as the product of the two marginals. If both variables have finite cardinality, we can factorize the 2d joint table into a product of two 1d vectors, as shown in Figure 2.3. Given a joint distribution, we define the marginal distribution of an rv as follows: # p(X = x) = p(X = x, Y = y) (2.17) y where we are summing over all possible states of Y. This is sometimes called the sum rule or the rule of total probability. We define p(Y = y) similarly. For example, from the above 2d table, we see p(X = 0) = 0.2 + 0.3 = 0.5 and p(Y = 0) = 0.2 + 0.3 = 0.5. (The term “marginal” comes from the accounting practice of writing the sums of rows and columns on the side, or margin, of a table.) We define the conditional distribution of an rv using p(X = x, Y = y) p(Y = y|X = x) = (2.18) p(X = x) We can rearrange this equation to get p(x, y) = p(x)p(y|x) (2.19) “Probabilistic Machine Learning: An Introduction”. Online version. November 23, 2024 2.2. Random variables 39 P(Y) P(X, Y) = P(X) Figure 2.3: Computing p(x, y) = p(x)p(y), where X ↑ Y. Here X and Y are discrete random variables; X has 6 possible states (values) and Y has 5 possible states. A general joint distribution on two such variables would require (6 ↓ 5) → 1 = 29 parameters to define it (we subtract 1 because of the sum-to-one constraint). By assuming (unconditional) independence, we only need (6 → 1) + (5 → 1) = 9 parameters to define p(x, y). This is called the product rule. By extending the product rule to D variables, we get the chain rule of probability: p(x1:D ) = p(x1 )p(x2 |x1 )p(x3 |x1 , x2 )p(x4 |x1 , x2 , x3 )... p(xD |x1:D↑1 ) (2.20) This provides a way to create a high dimensional joint distribution from a set of conditional distributions. We discuss this in more detail in Section 3.6. 2.2.4 Independence and conditional independence We say X and Y are unconditionally independent or marginally independent, denoted X ↘ Y , if we can represent the joint as the product of the two marginals (see Figure 2.3), i.e., X ↘ Y ⇐⇒ p(X, Y ) = p(X)p(Y ) (2.21) In general, we say a set of variables X1 ,... , Xn is (mutually) independent if the joint can be written as a product of marginals for all subsets {X1 ,... , Xm } ⇑ {X1 ,... , Xn }: i.e., m $ p(X1 ,... , Xm ) = p(Xi ) (2.22) i=1 For example, we say X1 , X2 , X3 are mutually independent if the following conditions hold: p(X1 , X2 , X3 ) = p(X1 )p(X2 )p(X3 ), p(X1 , X2 ) = p(X1 )p(X2 ), p(X2 , X3 ) = p(X2 )p(X3 ), and p(X1 , X3 ) = p(X1 )p(X3 ).2 Unfortunately, unconditional independence is rare, because most variables can influence most other variables. However, usually this influence is mediated via other variables rather than being direct. We therefore say X and Y are conditionally independent (CI) given Z i! the conditional joint can be written as a product of conditional marginals: X ↘ Y | Z ⇐⇒ p(X, Y |Z) = p(X|Z)p(Y |Z) (2.23) 2. For further discussion, see https://github.com/probml/pml-book/issues/353#issuecomment-1120327442. Author: Kevin P. Murphy. (C) MIT Press. CC-BY-NC-ND license 40 Chapter 2. Probability: Univariate Models We can write this assumption as a graph X ↑ Z ↑ Y , which captures the intuition that all the dependencies between X and Y are mediated via Z. By using larger graphs, we can define complex joint distributions; these are known as graphical models, and are discussed in Section 3.6. 2.2.5 Moments of a distribution In this section, we describe various summary statistics that can be derived from a probability distribution (either a pdf or pmf). 2.2.5.1 Mean of a distribution The most familiar property of a distribution is its mean, or expected value, often denoted by µ. For continuous rv’s, the mean is defined as follows: " E [X] ↭ x p(x)dx (2.24) X If the integral is not finite, the mean is not defined; we will see some examples of this later. For discrete rv’s, the mean is defined as follows: # E [X] ↭ x p(x) (2.25) x→X However, this is only meaningful if the values of x are ordered in some way (e.g., if they represent integer counts). Since the mean is a linear operator, we have E [aX + b] = aE [X] + b (2.26) This is called the linearity of expectation. For a set of n random variables, one can show that the expectation of their sum is as follows: % n & n # # E Xi = E [Xi ] (2.27) i=1 i=1 If they are independent, the expectation of their product is given by % n & n $ $ E Xi = E [Xi ] (2.28) i=1 i=1 2.2.5.2 Variance of a distribution The variance is a measure of the “spread” of a distribution, often denoted by ϑ 2. This is defined as follows: " ' ( V [X] ↭ E (X ↑ µ)2 = (x ↑ µ)2 p(x)dx (2.29) " " " ' ( = x2 p(x)dx + µ2 p(x)dx ↑ 2µ xp(x)dx = E X 2 ↑ µ2 (2.30) “Probabilistic Machine Learning: An Introduction”. Online version. November 23, 2024 2.2. Random variables 41 from which we derive the useful result ' ( E X 2 = ϑ 2 + µ2 (2.31) The standard deviation is defined as ) std [X] ↭ V [X] = ϑ (2.32) This is useful since it has the same units as X itself. The variance of a shifted and scaled version of a random variable is given by V [aX + b] = a2 V [X] (2.33) If we have a set of n independent random variables, the variance of their sum is given by the sum of their variances: % n & n # # V Xi = V [Xi ] (2.34) i=1 i=1 The variance of their product can also be derived, as follows: % n & % & % & $ $ $ 2 V Xi = E ( Xi ) ↑ (E Xi ) 2 (2.35) i=1 i i % & $ $ =E Xi2 ↑( E [Xi ])2 (2.36) i i $ ' ( $ = E Xi2 ↑ (E [Xi ])2 (2.37) i i $ $ = (V [Xi ] + (E [Xi ])2 ) ↑ (E [Xi ])2 (2.38) i i $ $ = (ϑi2 + µ2i ) ↑ µ2i (2.39) i i 2.2.5.3 Mode of a distribution The mode of a distribution is the value with the highest probability mass or probability density: x↓ = argmax p(x) (2.40) x If the distribution is multimodal, this may not be unique, as illustrated in Figure 2.4. Furthermore, even if there is a unique mode, this point may not be a good summary of the distribution. 2.2.5.4 Conditional moments When we have two or more dependent random variables, we can compute the moments of one given knowledge of the other. For example, the law of iterated expectations, also called the law of total expectation, tells us that E [X] = EY [E [X|Y ]] (2.41) Author: Kevin P. Murphy. (C) MIT Press. CC-BY-NC-ND license 42 Chapter 2. Probability: Univariate Models Figure 2.4: Illustration of a mixture of two 1d Gaussians, p(x) = 0.5N (x|0, 0.5) + 0.5N (x|2, 0.5). Generated by bimodal_dist_plot.ipynb. To prove this, let us suppose, for simplicity, that X and Y are both discrete rv’s. Then we have % & # EY [E [X|Y ]] = EY x p(X = x|Y ) (2.42) x % & # # # = x p(X = x|Y = y) p(Y = y) = xp(X = x, Y = y) = E [X] (2.43) y x x,y To give a more intuitive explanation, consider the following simple example.3 Let X be the lifetime duration of a lightbulb, and let Y be the factory the lightbulb was produced in. Suppose E [X|Y = 1] = 5000 and E [X|Y = 2] = 4000, indicating that factory 1 produces longer lasting bulbs. Suppose factory 1 supplies 60% of the lightbulbs, so p(Y = 1) = 0.6 and p(Y = 2) = 0.4. Then the expected duration of a random lightbulb is given by E [X] = E [X|Y = 1] p(Y = 1) + E [X|Y = 2] p(Y = 2) = 5000 ⇓ 0.6 + 4000 ⇓ 0.4 = 4600 (2.44) There is a similar formula for the variance. In particular, the law of total variance, also called the conditional variance formula, tells us that V [X] = EY [V [X|Y ]] + VY [E [X|Y ]] (2.45) ( ' To see this, let us define the conditional moments, µX|Y = E [X|Y ], sX|Y = E X 2 |Y , and ϑX|Y = V [X|Y ] = sX|Y ↑ µ2X|Y , which are functions of Y (and therefore are random quantities). 2 Then we have ' ( ' ( * ' (+2 V [X] = E X 2 ↑ (E [X])2 = EY sX|Y ↑ EY µX|Y (2.46) , - , - * ' (+2 2 2 = EY ϑX|Y + EY µX|Y ↑ EY µX|Y (2.47) = EY [V [X|Y ]] + VY [µX|Y ] (2.48) To get some intuition for these formulas, consider a mixture of K univariate Gaussians. Let Y be the hidden indicator variable that specifies which mixture component we are using, and let 3. This example is from https://en.wikipedia.org/wiki/Law_of_total_expectation, but with modified notation. “Probabilistic Machine Learning: An Introduction”. Online version. November 23, 2024 2.2. Random variables 43 Dataset: I Dataset: II Dataset: III Dataset: IV 10 10 10 10 y y y y 5 5 5 5 0 0 0 0 0 10 20 0 10 20 0 10 20 0 10 20 x x x x (a) (b) (c) (d) Figure 2.5: Illustration of Anscombe’s quartet. All of these datasets have the same low order summary statistics. Generated by anscombes_quartet.ipynb. !K X = y=1 ϖy N (X|µy , ϑy ). In Figure 2.4, we have ϖ1 = ϖ2 = 0.5, µ1 = 0, µ2 = 2, ϑ1 = ϑ2 = 0.5. Thus E [V [X|Y ]] = ϖ1 ϑ12 + ϖ2 ϑ22 = 0.25 (2.49) 2 2 2 2 V [E [X|Y ]] = ϖ1 (µ1 ↑ µ) + ϖ2 (µ2 ↑ µ) = 0.5(0 ↑ 1) + 0.5(2 ↑ 1) = 0.5 + 0.5 = 1 (2.50) So we get the intuitive result that the variance of X is dominated by which centroid it is drawn from (i.e., di!erence in the means), rather than the local variance around each centroid. 2.2.6 Limitations of summary statistics * Although it is common to summarize a probability distribution (or points sampled from a distribution) using simple statistics such as the mean and variance, this can lose a lot of information. A striking example of this is known as Anscombe’s quartet [Ans73], which is illustrated in Figure 2.5. This shows 4 di!erent datasets of (x, y) pairs, all of which have identical mean, variance and correlation coe"cient ϱ (defined in Section 3.1.2): E [x] = 9, V [x] = 11, E [y] = 7.50, V [y] = 4.12, and ϱ = 0.816.4 However, the joint distributions p(x, y) from which these points were sampled are clearly very di!erent. Anscombe invented these datasets, each consisting of 10 data points, to counter the impression among statisticians that numerical summaries are superior to data visualization [Ans73]. An even more striking example of this phenomenon is shown in Figure 2.6. This consists of a dataset that looks like a dinosaur5 , plus 11 other datasets, all of which have identical low order statistics. This collection of datasets is called the Datasaurus Dozen [MF17]. The exact values of the (x, y) points are available online.6 They were computed using simulated annealing, a derivative free optimization method which we discuss in the sequel to this book, [Mur23]. (The objective 4. The maximum likelihood estimate for the variance in Equation (4.36) di!ers from the unbiased estimate in Equation (4.38). For the former, we have V [x] = 10.00, V [y] = 3.75, for the latter, we have V [x] = 11.00, V [y] = 4.12. 5. This dataset was created by Alberto Cairo, and is available at http://www.thefunctionalart.com/2016/08/ download-datasaurus-never-trust-summary.html 6. https://www.autodesk.com/research/publications/same-stats-different-graphs. There are actually 13 datasets in total, including the dinosaur. We omitted the “away” dataset for visual clarity. Author: Kevin P. Murphy. (C) MIT Press. CC-BY-NC-ND license 44 Chapter 2. Probability: Univariate Models Dataset: dino Dataset: h lines Dataset: v lines Dataset: x shape 100 50 y 0 Dataset: star Dataset: high lines Dataset: dots Dataset: circle 100 50 y 0 Dataset: bullseye Dataset: slant up Dataset: slant down Dataset: wide lines 100 50 y 0 0 50 100 0 50 100 0 50 100 0 50 100 x x x x Figure 2.6: Illustration of the Datasaurus Dozen. All of these datasets have the same low order summary statistics. Adapted from Figure 1 of [MF17]. Generated by datasaurus_dozen.ipynb. function being optimized measures deviation from the target summary statistics of the original dinosaur, plus distance from a particular target shape.) The same simulated annealing approach can be applied to 1d datasets, as shown in Figure 2.7. We see that all the datasets are quite di!erent, but they all have the same median and inter-quartile range as shown by the central shaded part of the box plots in the middle. A better visualization is known as a violin plot, shown on the right. This shows (two copies of) the 1d kernel density estimate (Section 16.3) of the distribution on the vertical axis, in addition to the median and IQR markers. This visualization is better able to distinguish di!erences in the distributions. However, the technique is limited to 1d data. 2.3 Bayes’ rule Bayes’s theorem is to the theory of probability what Pythagoras’s theorem is to geometry. — Sir Harold Je!reys, 1973 [Jef73]. In this section, we discuss the basics of Bayesian inference. According to the Merriam-Webster dictionary, the term “inference” means “the act of passing from sample data to generalizations, usually with calculated degrees of certainty”. The term “Bayesian” is used to refer to inference methods that “Probabilistic Machine Learning: An Introduction”. Online version. November 23, 2024 2.3. Bayes’ rule 45 Figure 2.7: Illustration of 7 di!erent datasets (left), the corresponding box plots (middle) and violin box plots (right). From Figure 8 of https: // www. autodesk. com/ research/ publications/ same-stats-different-graphs. Used with kind permission of Justin Matejka. represent “degrees of certainty” using probability theory, and which leverage Bayes’ rule7 , to update the degree of certainty given data. Bayes’ rule itself is very simple: it is just a formula for computing the probability distribution over possible values of an unknown (or hidden) quantity H given some observed data Y = y: p(H = h)p(Y = y|H = h) p(H = h|Y = y) = (2.51) p(Y = y) This follows automatically from the identity p(h|y)p(y) = p(h)p(y|h) = p(h, y) (2.52) which itself follows from the product rule of probability. In Equation (2.51), the term p(H) represents what we know about possible values of H before we see any data; this is called the prior distribution. (If H has K possible values, then p(H) is a vector of K probabilities, that sum to 1.) The term p(Y |H = h) represents the distribution over the possible outcomes Y we expect to see if H = h; this is called the observation distribution. When we evaluate this at a point corresponding to the actual observations, y, we get the function p(Y = y|H = h), which is called the likelihood. (Note that this is a function of h, since y is fixed, but it is not a probability distribution, since it does not sum to one.) Multiplying the prior distribution p(H = h) by the likelihood function p(Y = y|H = h) for each h gives the unnormalized joint distribution p(H = h, Y = y). We can convert this into a normalized distribution by dividing by p(Y = y), which is known as the marginal likelihood, since it is computed by marginalizing over the unknown H: # # p(Y = y) = p(H = h↔ )p(Y = y|H = h↔ ) = p(H = h↔ , Y = y) (2.53) h→ →H h→ →H 7. Thomas Bayes (1702–1761) was an English mathematician and Presbyterian minister. For a discussion of whether to spell this as Bayes rule, Bayes’ rule or Bayes’s rule, see https://bit.ly/2kDtLuK. Author: Kevin P. Murphy. (C) MIT Press. CC-BY-NC-ND license 46 Chapter 2. Probability: Univariate Models Observation 0 1 0 TNR=Specificity=0.975 FPR=1-TNR=0.025 Truth 1 FNR=1-TPR=0.125 TPR=Sensitivity=0.875 Table 2.1: Likelihood function p(Y |H) for a binary observation Y given two possible hidden states H. Each row sums to one. Abbreviations: TNR is true negative rate, TPR is true positive rate, FNR is false negative rate, FPR is false positive rate. Normalizing the joint distribution by computing p(H = h, Y = y)/p(Y = y) for each h gives the posterior distribution p(H = h|Y = y); this represents our new belief state about the possible values of H. We can summarize Bayes rule in words as follows: posterior ⇔ prior ⇓ likelihood (2.54) Here we use the symbol ⇔ to denote “proportional to”, since we are ignoring the denominator, which is just a constant, independent of H. Using Bayes rule to update a distribution over unknown values of some quantity of interest, given relevant observed data, is called Bayesian inference, or posterior inference. It can also just be called probabilistic inference. Below we give some simple examples of Bayesian inference in action. We will see many more interesting examples later in this book. 2.3.1 Example: Testing for COVID-19 Suppose you think you may have contracted COVID-19, which is an infectious disease caused by the SARS-CoV-2 virus. You decide to take a diagnostic test, and you want to use its result to determine if you are infected or not. Let H = 1 be the event that you are infected, and H = 0 be the event you are not infected. Let Y = 1 if the test is positive, and Y = 0 if the test is negative. We want to compute p(H = h|Y = y), for h ↔ {0, 1}, where y is the observed test outcome. (We will write the distribution of values, [p(H = 0|Y = y), p(H = 1|Y = y)] as p(H|y), for brevity.) We can think of this as a form of binary classification, where H is the unknown class label, and y is the feature vector. First we must specify the likelihood. This quantity obviously depends on how reliable the test is. There are two key parameters. The sensitivity (aka true positive rate) is defined as p(Y = 1|H = 1), i.e., the probability of a positive test given that the truth is positive. The false negative rate is defined as one minus the sensitivity. The specificity (aka true negative rate) is defined as p(Y = 0|H = 0), i.e., the probability of a negative test given that the truth is negative. The false positive rate is defined as one minus the specificity. We summarize all these quantities in Table 2.1. (See Section 5.1.3.1 for more details.) Following https://nyti.ms/31MTZgV, we set the sensitivity to 87.5% and the specificity to 97.5%. Next we must specify the prior. The quantity p(H = 1) represents the prevalence of the disease in the area in which you live. We set this to p(H = 1) = 0.1 (i.e., 10%), which was the prevalence in New York City in Spring 2020. (This example was chosen to match the numbers in https://nyti.ms/31MTZgV.) “Probabilistic Machine Learning: An Introduction”. Online version. November 23, 2024 2.3. Bayes’ rule 47 Now suppose you test positive. We have p(Y = 1|H = 1)p(H = 1) p(H = 1|Y = 1) = (2.55) p(Y = 1|H = 1)p(H = 1) + p(Y = 1|H = 0)p(H = 0) TPR ⇓ prior = (2.56) TPR ⇓ prior + FPR ⇓ (1 ↑ prior) 0.875 ⇓ 0.1 = = 0.795 (2.57) 0.875 ⇓ 0.1 + 0.025 ⇓ 0.9 So there is a 79.5% chance you are infected. Now suppose you test negative. The probability you are infected is given by p(Y = 0|H = 1)p(H = 1) p(H = 1|Y = 0) = (2.58) p(Y = 0|H = 1)p(H = 1) + p(Y = 0|H = 0)p(H = 0) FNR ⇓ prior = (2.59) FNR ⇓ prior + TNR ⇓ (1 ↑ prior) 0.125 ⇓ 0.1 = = 0.014 (2.60) 0.125 ⇓ 0.1 + 0.975 ⇓ 0.9 So there is just a 1.4% chance you are infected. Nowadays COVID-19 prevalence is much lower. Suppose we repeat these calculations using a base rate of 1%; now the posteriors reduce to 26% and 0.13% respectively. The fact that you only have a 26% chance of being infected with COVID-19, even after a positive test, is very counter-intuitive. The reason is that a single positive test is more likely to be a false positive than due to the disease, since the disease is rare. To see this, suppose we have a population of 100,000 people, of whom 1000 are infected. Of those who are infected, 875 = 0.875 ⇓ 1000 test positive, and of those who are uninfected, 2475 = 0.025 ⇓ 99, 000 test positive. Thus the total number of positives is 3350 = 875 + 2475, so the posterior probability of being infected given a positive test is 875/3350 = 0.26. Of course, the above calculations assume we know the sensitivity and specificity of the test. See [GC20] for how to apply Bayes rule for diagnostic testing when there is uncertainty about these parameters. 2.3.2 Example: The Monty Hall problem In this section, we consider a more “frivolous” application of Bayes rule. In particular, we apply it to the famous Monty Hall problem. Imagine a game show with the following rules: There are three doors, labeled 1, 2, 3. A single prize (e.g., a car) has been hidden behind one of them. You get to select one door. Then the gameshow host opens one of the other two doors (not the one you picked), in such a way as to not reveal the prize location. At this point, you will be given a fresh choice of door: you can either stick with your first choice, or you can switch to the other closed door. All the doors will then be opened and you will receive whatever is behind your final choice of door. For example, suppose you choose door 1, and the gameshow host opens door 3, revealing nothing behind the door, as promised. Should you (a) stick with door 1, or (b) switch to door 2, or (c) does it make no di!erence? Author: Kevin P. Murphy. (C) MIT Press. CC-BY-NC-ND license 48 Chapter 2. Probability: Univariate Models Door 1 Door 2 Door 3 Switch Stay Car - - Lose Win - Car - Win Lose - - Car Win Lose Table 2.2: 3 possible states for the Monty Hall game, showing that switching doors is two times better (on average) than staying with your original choice. Adapted from Table 6.1 of [PM18]. Intuitively, it seems it should make no di!erence, since your initial choice of door cannot influence the location of the prize. However, the fact that the host opened door 3 tells us something about the location of the prize, since he made his choice conditioned on the knowledge of the true location and on your choice. As we show below, you are in fact twice as likely to win the prize if you switch to door 2. To show this, we will use Bayes’ rule. Let Hi denote the hypothesis that the prize is behind door i. We make the following assumptions: the three hypotheses H1 , H2 and H3 are equiprobable a priori, i.e., 1 P (H1 ) = P (H2 ) = P (H3 ) =. (2.61) 3 The datum we receive, after choosing door 1, is either Y = 3 and Y = 2 (meaning door 3 or 2 is opened, respectively). We assume that these two possible outcomes have the following probabilities. If the prize is behind door 1, then the host selects at random between Y = 2 and Y = 3. Otherwise the choice of the host is forced and the probabilities are 0 and 1. P (Y = 2|H1 ) = 12 P (Y = 2|H2 ) = 0 P (Y = 2|H3 ) = 1 (2.62) P (Y = 3|H1 ) = 12 P (Y = 3|H2 ) = 1 P (Y = 3|H3 ) = 0 Now, using Bayes’ theorem, we evaluate the posterior probabilities of the hypotheses: P (Y = 3|Hi )P (Hi ) P (Hi |Y = 3) = (2.63) P (Y = 3) P (H1 |Y = 3) = (1/2)(1/3) P (Y =3) P (H2 |Y = 3) = P(1)(1/3) (Y =3) P (H3 |Y = 3) = P(0)(1/3) (Y =3) (2.64) The denominator P (Y = 3) is P (Y = 3) = 1 6 + 1 3 = 12. So 1 2 P (H1 |Y = 3) = P (H2 |Y = 3) = P (H3 |Y = 3) = 0. (2.65) 3 3 So the contestant should switch to door 2 in order to have the biggest chance of getting the prize. See Table 2.2 for a worked example. Many people find this outcome surprising. One way to make it more intuitive is to perform a thought experiment in which the game is played with a million doors. The rules are now that the contestant chooses one door, then the game show host opens 999,998 doors in such a way as not to reveal the prize, leaving the contestant’s selected door and one other door closed. The contestant may now stick or switch. Imagine the contestant confronted by a million doors, of which doors 1 and 234,598 have not been opened, door 1 having been the contestant’s initial guess. Where do you think the prize is? “Probabilistic Machine Learning: An Introduction”. Online version. November 23, 2024 2.4. Bernoulli and binomial distributions 49 Figure 2.8: Any planar line-drawing is geometrically consistent with infinitely many 3-D structures. From Figure 11 of [SA93]. Used with kind permission of Pawan Sinha. 2.3.3 Inverse problems * Probability theory is concerned with predicting a distribution over outcomes y given knowledge (or assumptions) about the state of the world, h. By contrast, inverse probability is concerned with inferring the state of the world from observations of outcomes. We can think of this as inverting the h ↖ y mapping. For example, consider trying to infer a 3d shape h from a 2d image y, which is a classic problem in visual scene understanding. Unfortunately, this is a fundamentally ill-posed problem, as illustrated in Figure 2.8, since there are multiple possible hidden h’s consistent with the same observed y (see e.g., [Piz01]). Similarly, we can view natural language understanding as an ill-posed problem, in which the listener must infer the intention h from the (often ambiguous) words spoken by the speaker (see e.g., [Sab21]). To tackle such inverse problems, we can use Bayes’ rule to compute the posterior, p(h|y), which gives a distribution over possible states of the world. This requires specifying the forwards model, p(y|h), as well as a prior p(h), which can be used to rule out (or downweight) implausible world states. We discuss this topic in more detail in the sequel to this book, [Mur23]. 2.4 Bernoulli and binomial distributions Perhaps the simplest probability distribution is the Bernoulli distribution, which can be used to model binary events, as we discuss below. 2.4.1 Definition Consider tossing a coin, where the probability of event that it lands heads is given by 0 → ω → 1. Let Y = 1 denote this event, and let Y = 0 denote the event that the coin lands tails. Thus we are assuming that p(Y = 1) = ω and p(Y = 0) = 1 ↑ ω. This is called the Bernoulli distribution, and can be written as follows Y ↙ Ber(ω) (2.66) Author: Kevin P. Murphy. (C) MIT Press. CC-BY-NC-ND license 50 Chapter 2. Probability: Univariate Models θ=0.250 θ=0.900 0.35 0.4 0.3 0.35 0.3 0.25 0.25 0.2 0.2 0.15 0.15 0.1 0.1 0.05 0.05 0 0 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 (a) (b) Figure 2.9: Illustration of the binomial distribution with N = 10 and (a) ε = 0.25 and (b) ε = 0.9. Generated by binom_dist_plot.ipynb. where the symbol ↙ means “is sampled from” or “is distributed as”, and Ber refers to Bernoulli. The probability mass function (pmf) of this distribution is defined as follows:. 1 ↑ ω if y = 0 Ber(y|ω) = (2.67) ω if y = 1 (See Section 2.2.1 for details on pmf’s.) We can write this in a more concise manner as follows: Ber(y|ω) ↭ ωy (1 ↑ ω)1↑y (2.68) The Bernoulli distribution is a special case of the binomial distribution. To explain this, suppose we observe a set of N Bernoulli trials, denoted yn ↙ Ber(·|ω), for n = 1 : N. Concretely, think of !N tossing a coin N times. Let us define s to be the total number of heads, s ↭ n=1 I (yn = 1). The distribution of s is given by the binomial distribution: / N s Bin(s|N, ω) ↭ ω (1 ↑ ω)N ↑s (2.69) s where / N N! ↭ (2.70) k (N ↑ k)!k! is the number of ways to choose k items from N (this is known as the binomial coe!cient, and is pronounced “N choose k”). See Figure 2.9 for some examples of the binomial distribution. If N = 1, the binomial distribution reduces to the Bernoulli distribution. 2.4.2 Sigmoid (logistic) function When we want to predict a binary variable y ↔ {0, 1} given some inputs x ↔ X , we need to use a conditional probability distribution of the form p(y|x, ω) = Ber(y|f (x; ω)) (2.71) “Probabilistic Machine Learning: An Introduction”. Online version. November 23, 2024 2.4. Bernoulli and binomial distributions 51 Heaviside function 1.0 0.8 0.6 0.4 0.2 0.0 4 3 2 1 0 1 2 3 4 (a) (b) Figure 2.10: (a) The sigmoid (logistic) function ϑ(a) = (1 + e→a )→1. (b) The Heaviside function I (a > 0). Generated by activation_fun_plot.ipynb. 1 ex ϑ(x) ↭ ↑x = (2.72) 1+e 1 + ex d ϑ(x) = ϑ(x)(1 ↑ ϑ(x)) (2.73) dx 1 ↑ ϑ(x) = ϑ(↑x) (2.74) / p ↑1 ϑ (p) = log ↭ logit(p) (2.75) 1↑p ϑ+ (x) ↭ log(1 + ex ) ↭ softplus(x) (2.76) d ϑ+ (x) = ϑ(x) (2.77) dx Table 2.3: Some useful properties of the sigmoid (logistic) and related functions. Note that the logit function is the inverse of the sigmoid function, and has a domain of [0, 1]. where f (x; ω) is some function that predicts the mean parameter of the output distribution. We will consider many di!erent kinds of function f in Part II–Part IV. To avoid the requirement that 0 → f (x; ω) → 1, we can let f be an unconstrained function, and use the following model: p(y|x, ω) = Ber(y|ϑ(f (x; ω))) (2.78) Here ϑ() is the sigmoid or logistic function, defined as follows: 1 ϑ(a) ↭ (2.79) 1 + e↑a where a = f (x; ω). The term “sigmoid” means S-shaped: see Figure 2.10a for a plot. We see that it Author: Kevin P. Murphy. (C) MIT Press. CC-BY-NC-ND license 52 Chapter 2. Probability: Univariate Models Figure 2.11: Logistic regression applied to a 1-dimensional, 2-class version of the Iris dataset. Generated by iris_logreg.ipynb. Adapted from Figure 4.23 of [Gér19]. maps the whole real line to [0, 1], which is necessary for the output to be interpreted as a probability (and hence a valid value for the Bernoulli parameter ω). The sigmoid function can be thought of as a “soft” version of the heaviside step function, defined by H(a) ↭ I (a > 0) (2.80) as shown in Figure 2.10b. Plugging the definition of the sigmoid function into Equation (2.78) we get 1 ea p(y = 1|x, ω) = = = ϑ(a) (2.81) 1 + e↑a 1 + ea 1 e↑a 1 p(y = 0|x, ω) = 1 ↑ ↑a = = = ϑ(↑a) (2.82) 1+e 1 + e↑a 1 + ea p The quantity a is equal to the log odds, log( 1↑p ), where p = p(y = 1|x; ω). To see this, note that / / a p e 1 + ea log = log = log(ea ) = a (2.83) 1↑p 1 + ea 1 The logistic function or sigmoid function maps the log-odds a to p: 1 ea p = logistic(a) = ϑ(a) ↭ ↑a = (2.84) 1+e 1 + ea The inverse of this is called the logit function, and maps p to the log-odds a: / p a = logit(p) = ϑ (p) ↭ log ↑1 (2.85) 1↑p See Table 2.3 for some useful properties of these functions. 2.4.3 Binary logistic regression In this section, we use a conditional Bernoulli model, where we use a linear predictor of the form f (x; ω) = wT x + b. Thus the model has the form p(y|x; ω) = Ber(y|ϑ(wT x + b)) (2.86) “Probabilistic Machine Learning: An Introduction”. Online version. November 23, 2024 2.5. Categorical and multinomial distributions 53 In other words, 1 p(y = 1|x; ω) = ϑ(wT x + b) = (2.87) 1 + e↑(wT x+b) This is called logistic regression. For example consider a 1-dimensional, 2-class version of the iris dataset, where the positive class is “Virginica” and the negative class is “not Virginica”, and the feature x we use is the petal width. We fit a logistic regression model to this and show the results in Figure 2.11. The decision boundary corresponds to the value x↓ where p(y = 1|x = x↓ , ω) = 0.5. We see that, in this example, x↓ ≃ 1.7. As x moves away from this boundary, the classifier becomes more confident in its prediction about the class label. It should be clear from this example why it would be inappropriate to use linear regression for a (binary) classification problem. In such a model, the probabilities would increase above 1 as we move far enough to the right, and below 0 as we move far enough to the left. For more detail on logistic regression, see Chapter 10. 2.5 Categorical and multinomial distributions To represent a distribution over a finite set of labels, y ↔ {1,... , C}, we can use the categorical distribution, which generalizes the Bernoulli to C > 2 values. 2.5.1 Definition The categorical distribution is a discrete probability distribution with one parameter per class: C $ Cat(y|ω) ↭ ωcI(y=c) (2.88) c=1 In other words, p(y = c|ω) = ωc. Note that the parameters are constrained so that 0 → ωc → 1 and !C c=1 ωc = 1; thus there are only C ↑ 1 independent parameters. We can write the categorical distribution in another way by converting the discrete variable y into a one-hot vector with C elements, all of which are 0 except for the entry corresponding to the class label. (The term “one-hot” arises from electrical engineering, where binary vectors are encoded as electrical current on a set of wires, which can be active (“hot”) or not (“cold”).) For example, if C = 3, we encode the classes 1, 2 and 3 as (1, 0, 0), (0, 1, 0), and (0, 0, 1). More generally, we can encode the classes using unit vectors, where ec is all 0s except for dimension c. (This is also called a dummy encoding.) Using one-hot encodings, we can write the categorical distribution as follows: C $ Cat(y|ω) ↭ ωcyc (2.89) c=1 The categorical distribution is a special case of the multinomial distribution. To explain this, suppose we observe N categorical trials, yn ↙ Cat(·|ω), for n = 1 : N. Concretely, think of rolling a C-sided dice N times. Let us define y to be a vector that counts the number of times each face Author: Kevin P. Murphy. (C) MIT Press. CC-BY-NC-ND license