ProbML Intro (Murphy) - Univariate Probabilities PDF
Kevin P. Murphy
Summary
This document introduces univariate probabilities from a book on probabilistic machine learning. It outlines different interpretations of probability, including frequentist and Bayesian approaches, and explores types of uncertainty like epistemic/model uncertainty and aleatoric/data uncertainty. The basic rules of probability are also reviewed, presenting probability as an extension of Boolean logic and defining concepts like conditional probability and independence.
Full Transcript
2 Probability: Univariate Models

2.1 Introduction

In this chapter, we give a brief introduction to the basics of probability theory. There are many good books that go into more detail, e.g., [GS97; BT08; Cha21].

2.1.1 What is probability?

"Probability theory is nothing but common sense reduced to calculation." — Pierre Laplace, 1812

We are all comfortable saying that the probability that a (fair) coin will land heads is 50%. But what does this mean? There are actually two different interpretations of probability. One is called the frequentist interpretation. In this view, probabilities represent long run frequencies of events that can happen multiple times. For example, the above statement means that, if we flip the coin many times, we expect it to land heads about half the time. [Footnote 1: Actually, the Stanford statistician (and former professional magician) Persi Diaconis has shown that a coin is about 51% likely to land facing the same way up as it started, due to the physics of the problem [DHM07].]

The other interpretation is called the Bayesian interpretation of probability. In this view, probability is used to quantify our uncertainty or ignorance about something; hence it is fundamentally related to information rather than repeated trials [Jay03; Lin06]. In the Bayesian view, the above statement means we believe the coin is equally likely to land heads or tails on the next toss.

One big advantage of the Bayesian interpretation is that it can be used to model our uncertainty about one-off events that do not have long term frequencies. For example, we might want to compute the probability that the polar ice cap will melt by 2030 CE. This event will happen zero or one times, but cannot happen repeatedly. Nevertheless, we ought to be able to quantify our uncertainty about this event; based on how probable we think this event is, we can decide how to take the optimal action, as discussed in Chapter 5. We shall therefore adopt the Bayesian interpretation in this book. Fortunately, the basic rules of probability theory are the same, no matter which interpretation is adopted.

2.1.2 Types of uncertainty

The uncertainty in our predictions can arise for two fundamentally different reasons. The first is due to our ignorance of the underlying hidden causes or mechanism generating our data. This is called epistemic uncertainty, since epistemology is the philosophical term used to describe the study of knowledge. However, a simpler term for this is model uncertainty. The second kind of uncertainty arises from intrinsic variability, which cannot be reduced even if we collect more data. This is sometimes called aleatoric uncertainty [Hac75; KD09], derived from the Latin word for "dice", although a simpler term would be data uncertainty. As a concrete example, consider tossing a fair coin. We might know for sure that the probability of heads is p = 0.5, so there is no epistemic uncertainty, but we still cannot perfectly predict the outcome.

This distinction can be important for applications such as active learning. A typical strategy is to query examples for which H(p(y|x, D)) is large (where H(p) is the entropy, discussed in Section 6.1). However, this could be due to uncertainty about the parameters, i.e., large H(p(θ|D)), or just due to inherent variability of the outcome, corresponding to large entropy of p(y|x, θ). In the latter case, there would not be much use collecting more samples, since our uncertainty would not be reduced. See [Osb16] for further discussion of this point.
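To make the distinction concrete, here is a small sketch (my own illustration, not from the book) for the coin example: as we observe more tosses of a fair coin, the posterior uncertainty about θ (epistemic) shrinks, while the uncertainty about the next outcome (aleatoric) stays at its maximum. It uses a Beta posterior over θ, a distribution introduced in Section 2.7.4.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

def entropy_bits(p):
    """Entropy (in bits) of a Bernoulli distribution with parameter p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

for n in [10, 100, 10_000]:
    heads = rng.binomial(n, 0.5)           # simulate n tosses of a fair coin
    post = beta(1 + heads, 1 + n - heads)  # posterior over theta (uniform prior)
    print(f"n={n:>6}: posterior std of theta = {post.std():.4f}, "
          f"outcome entropy = {entropy_bits(post.mean()):.4f} bits")

# The posterior std (epistemic uncertainty) goes to 0 as n grows, while the
# entropy of the predicted outcome (aleatoric uncertainty) stays near 1 bit.
```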
2.1.3 Probability as an extension of logic

In this section, we review the basic rules of probability, following the presentation of [Jay03], in which we view probability as an extension of Boolean logic.

2.1.3.1 Probability of an event

We define an event, denoted by the binary variable A, as some state of the world that either holds or does not hold. For example, A might be the event "it will rain tomorrow", or "it rained yesterday", or "the label is y = 1", or "the parameter θ is between 1.5 and 2.0", etc. The expression Pr(A) denotes the probability with which you believe event A is true (or the long run fraction of times that A will occur). We require that 0 ≤ Pr(A) ≤ 1, where Pr(A) = 0 means the event definitely will not happen, and Pr(A) = 1 means the event definitely will happen. We write Pr(Ā) to denote the probability of event A not happening; this is defined to be Pr(Ā) = 1 − Pr(A).

2.1.3.2 Probability of a conjunction of two events

We denote the joint probability of events A and B both happening as follows:
$$\Pr(A \wedge B) = \Pr(A, B) \tag{2.1}$$
If A and B are independent events, we have
$$\Pr(A, B) = \Pr(A)\Pr(B) \tag{2.2}$$
For example, suppose X and Y are chosen uniformly at random from the set 𝒳 = {1, 2, 3, 4}. Let A be the event that X ∈ {1, 2}, and B be the event that Y ∈ {3}. Then we have Pr(A, B) = Pr(A) Pr(B) = (1/2) · (1/4).

2.1.3.3 Probability of a union of two events

The probability of event A or B happening is given by
$$\Pr(A \vee B) = \Pr(A) + \Pr(B) - \Pr(A \wedge B) \tag{2.3}$$
If the events are mutually exclusive (so they cannot happen at the same time), we get
$$\Pr(A \vee B) = \Pr(A) + \Pr(B) \tag{2.4}$$
For example, suppose X is chosen uniformly at random from the set 𝒳 = {1, 2, 3, 4}. Let A be the event that X ∈ {1, 2} and B be the event that X ∈ {3}. Then we have Pr(A ∨ B) = 2/4 + 1/4 = 3/4.

2.1.3.4 Conditional probability of one event given another

We define the conditional probability of event B happening given that A has occurred as follows:
$$\Pr(B|A) \triangleq \frac{\Pr(A, B)}{\Pr(A)} \tag{2.5}$$
This is not defined if Pr(A) = 0, since we cannot condition on an impossible event.

2.1.3.5 Independence of events

We say that event A is independent of event B if
$$\Pr(A, B) = \Pr(A)\Pr(B) \tag{2.6}$$

2.1.3.6 Conditional independence of events

We say that events A and B are conditionally independent given event C if
$$\Pr(A, B|C) = \Pr(A|C)\Pr(B|C) \tag{2.7}$$
This is written as A ⊥ B | C. Events are often dependent on each other, but may be rendered independent if we condition on the relevant intermediate variables, as we discuss in more detail later in this chapter.
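As a quick sanity check of these rules, the following sketch (my own, not from the book) enumerates the uniform sample space 𝒳 = {1, 2, 3, 4} and verifies equations (2.2), (2.3), and (2.5) for the events used above:

```python
from itertools import product
from fractions import Fraction

space = [1, 2, 3, 4]
# All (x, y) outcomes are equally likely.
outcomes = list(product(space, space))

def prob(pred):
    """Probability of an event, as the fraction of outcomes satisfying pred."""
    return Fraction(sum(pred(x, y) for x, y in outcomes), len(outcomes))

A = lambda x, y: x in {1, 2}
B = lambda x, y: y == 3
A_and_B = lambda x, y: A(x, y) and B(x, y)
A_or_B = lambda x, y: A(x, y) or B(x, y)

assert prob(A_and_B) == prob(A) * prob(B)                  # (2.2): independence
assert prob(A_or_B) == prob(A) + prob(B) - prob(A_and_B)   # (2.3): union rule
assert prob(A_and_B) / prob(A) == prob(B)                  # (2.5): Pr(B|A) = Pr(B) here
print(prob(A), prob(B), prob(A_and_B))                     # 1/2 1/4 1/8
```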
2.2 Random variables

Suppose X represents some unknown quantity of interest, such as which way a die will land when we roll it, or the temperature outside your house at the current time. If the value of X is unknown and/or could change, we call it a random variable or rv. The set of possible values, denoted 𝒳, is known as the sample space or state space. An event is a set of outcomes from a given sample space. For example, if X represents the face of a die that is rolled, so 𝒳 = {1, 2, ..., 6}, the event of "seeing a 1" is denoted X = 1, the event of "seeing an odd number" is denoted X ∈ {1, 3, 5}, the event of "seeing a number between 1 and 3" is denoted 1 ≤ X ≤ 3, etc.

2.2.1 Discrete random variables

If the sample space 𝒳 is finite or countably infinite, then X is called a discrete random variable. In this case, we denote the probability of the event that X has value x by Pr(X = x). We define the probability mass function or pmf as a function which computes the probability of events which correspond to setting the rv to each possible value:
$$p(x) \triangleq \Pr(X = x) \tag{2.8}$$
The pmf satisfies the properties $0 \le p(x) \le 1$ and $\sum_{x \in \mathcal{X}} p(x) = 1$.

If X has a finite number of values, say K, the pmf can be represented as a list of K numbers, which we can plot as a histogram. For example, Figure 2.1 shows two pmf's defined on 𝒳 = {1, 2, 3, 4}. On the left we have a uniform distribution, p(x) = 1/4, and on the right, we have a degenerate distribution, p(x) = I(x = 1), where I() is the binary indicator function. Thus the distribution in Figure 2.1(b) represents the fact that X is always equal to the value 1. (Thus we see that random variables can also be constant.)

[Figure 2.1: Some discrete distributions on the state space 𝒳 = {1, 2, 3, 4}. (a) A uniform distribution with p(x = k) = 1/4. (b) A degenerate distribution (delta function) that puts all its mass on x = 1. Generated by discrete_prob_dist_plot.ipynb.]
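As a small illustration (a sketch in NumPy of my own, not the book's notebook), we can represent the two pmf's of Figure 2.1 as arrays and check the properties stated after equation (2.8):

```python
import numpy as np

states = np.array([1, 2, 3, 4])               # state space X
uniform = np.full(4, 0.25)                    # p(x) = 1/4
degenerate = (states == 1).astype(float)      # p(x) = I(x = 1)

for name, pmf in [("uniform", uniform), ("degenerate", degenerate)]:
    assert np.all((0 <= pmf) & (pmf <= 1))    # 0 <= p(x) <= 1
    assert np.isclose(pmf.sum(), 1.0)         # probabilities sum to one
    print(name, dict(zip(states.tolist(), pmf.tolist())))
```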
2.2.2 Continuous random variables

If X ∈ ℝ is a real-valued quantity, it is called a continuous random variable. In this case, we can no longer create a finite (or countable) set of distinct possible values it can take on. However, there are a countable number of intervals which we can partition the real line into. If we associate events with X being in each one of these intervals, we can use the methods discussed above for discrete random variables. Informally speaking, we can represent the probability of X taking on a specific real value by allowing the size of the intervals to shrink to zero, as we show below.

2.2.2.1 Cumulative distribution function (cdf)

Define the events A = (X ≤ a), B = (X ≤ b) and C = (a < X ≤ b), where a < b. We have that B = A ∨ C, and since A and C are mutually exclusive, the sum rule gives
$$\Pr(B) = \Pr(A) + \Pr(C) \tag{2.9}$$
and hence the probability of being in interval C is given by
$$\Pr(C) = \Pr(B) - \Pr(A) \tag{2.10}$$
In general, we define the cumulative distribution function or cdf of the rv X as follows:
$$P(x) \triangleq \Pr(X \le x) \tag{2.11}$$
(Note that we use a capital P to represent the cdf.) Using this, we can compute the probability of being in any interval as follows:
$$\Pr(a < X \le b) = P(b) - P(a) \tag{2.12}$$
Cdf's are monotonically non-decreasing functions. See Figure 2.2a for an example, where we illustrate the cdf of a standard normal distribution, N(x|0, 1); see Section 2.6 for details.

[Figure 2.2: (a) Plot of the cdf for the standard normal, N(0, 1). Generated by gauss_plot.ipynb. (b) Corresponding pdf. The shaded regions each contain α/2 of the probability mass, so the nonshaded region contains 1 − α of the mass. The leftmost cutoff point is Φ⁻¹(α/2), where Φ is the cdf of the Gaussian. By symmetry, the rightmost cutoff point is Φ⁻¹(1 − α/2) = −Φ⁻¹(α/2). Generated by quantile_plot.ipynb.]

2.2.2.2 Probability density function (pdf)

We define the probability density function or pdf as the derivative of the cdf:
$$p(x) \triangleq \frac{d}{dx}P(x) \tag{2.13}$$
(Note that this derivative does not always exist, in which case the pdf is not defined.) See Figure 2.2b for an example, where we illustrate the pdf of a univariate Gaussian (see Section 2.6 for details).

Given a pdf, we can compute the probability of a continuous variable being in a finite interval as follows:
$$\Pr(a < X \le b) = \int_a^b p(x)\,dx = P(b) - P(a) \tag{2.14}$$
As the size of the interval gets smaller, we can write
$$\Pr(x < X \le x + dx) \approx p(x)\,dx \tag{2.15}$$
Intuitively, this says the probability of X being in a small interval around x is the density at x times the width of the interval.

2.2.2.3 Quantiles

If the cdf P is strictly monotonically increasing, it has an inverse, called the inverse cdf, or percent point function (ppf), or quantile function. If P is the cdf of X, then P⁻¹(q) is the value x_q such that Pr(X ≤ x_q) = q; this is called the q'th quantile of P. The value P⁻¹(0.5) is the median of the distribution, with half of the probability mass on the left, and half on the right. The values P⁻¹(0.25) and P⁻¹(0.75) are the lower and upper quartiles.

For example, let Φ be the cdf of the Gaussian distribution N(0, 1), and Φ⁻¹ be the inverse cdf. Then points to the left of Φ⁻¹(α/2) contain α/2 of the probability mass, as illustrated in Figure 2.2b. By symmetry, points to the right of Φ⁻¹(1 − α/2) also contain α/2 of the mass. Hence the central interval (Φ⁻¹(α/2), Φ⁻¹(1 − α/2)) contains 1 − α of the mass. If we set α = 0.05, the central 95% interval is covered by the range
$$(\Phi^{-1}(0.025), \Phi^{-1}(0.975)) = (-1.96, 1.96) \tag{2.16}$$
If the distribution is N(µ, σ²), then the 95% interval becomes (µ − 1.96σ, µ + 1.96σ). This is often approximated by writing µ ± 2σ.

2.2.3 Sets of related random variables

In this section, we discuss distributions over sets of related random variables. Suppose, to start, that we have two random variables, X and Y. We can define the joint distribution of two random variables using p(x, y) = p(X = x, Y = y) for all possible values of X and Y. If both variables have finite cardinality, we can represent the joint distribution as a 2d table, all of whose entries sum to one. For example, consider the following example with two binary variables:

p(X, Y)   Y = 0   Y = 1
X = 0     0.2     0.3
X = 1     0.3     0.2

If two variables are independent, we can represent the joint as the product of the two marginals. If both variables have finite cardinality, we can factorize the 2d joint table into a product of two 1d vectors, as shown in Figure 2.3.

Given a joint distribution, we define the marginal distribution of an rv as follows:
$$p(X = x) = \sum_y p(X = x, Y = y) \tag{2.17}$$
where we are summing over all possible states of Y. This is sometimes called the sum rule or the rule of total probability. We define p(Y = y) similarly. For example, from the above 2d table, we see p(X = 0) = 0.2 + 0.3 = 0.5 and p(Y = 0) = 0.2 + 0.3 = 0.5. (The term "marginal" comes from the accounting practice of writing the sums of rows and columns on the side, or margin, of a table.)

We define the conditional distribution of an rv using
$$p(Y = y|X = x) = \frac{p(X = x, Y = y)}{p(X = x)} \tag{2.18}$$
We can rearrange this equation to get
$$p(x, y) = p(x)\,p(y|x) \tag{2.19}$$
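The sum rule and the definition of conditioning are easy to check numerically. The following sketch (my own, not from the book) applies equations (2.17) and (2.18) to the 2d table above:

```python
import numpy as np

# Joint p(X, Y): rows index X in {0, 1}, columns index Y in {0, 1}.
joint = np.array([[0.2, 0.3],
                  [0.3, 0.2]])
assert np.isclose(joint.sum(), 1.0)

p_x = joint.sum(axis=1)            # (2.17): marginalize out Y -> [0.5, 0.5]
p_y = joint.sum(axis=0)            # marginalize out X -> [0.5, 0.5]

# (2.18): conditional p(Y | X = x) = p(x, y) / p(x), one row per x.
p_y_given_x = joint / p_x[:, None]
print(p_x, p_y)
print(p_y_given_x)                 # each row sums to one
```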
This is called the product rule.

[Figure 2.3: Computing p(x, y) = p(x)p(y), where X ⊥ Y. Here X and Y are discrete random variables; X has 6 possible states (values) and Y has 5 possible states. A general joint distribution on two such variables would require (6 × 5) − 1 = 29 parameters to define it (we subtract 1 because of the sum-to-one constraint). By assuming (unconditional) independence, we only need (6 − 1) + (5 − 1) = 9 parameters to define p(x, y).]

By extending the product rule to D variables, we get the chain rule of probability:
$$p(\mathbf{x}_{1:D}) = p(x_1)p(x_2|x_1)p(x_3|x_1, x_2)p(x_4|x_1, x_2, x_3) \cdots p(x_D|\mathbf{x}_{1:D-1}) \tag{2.20}$$
This provides a way to create a high dimensional joint distribution from a set of conditional distributions. We discuss this in more detail in Section 3.6.

2.2.4 Independence and conditional independence

We say X and Y are unconditionally independent or marginally independent, denoted X ⊥ Y, if we can represent the joint as the product of the two marginals (see Figure 2.3), i.e.,
$$X \perp Y \iff p(X, Y) = p(X)p(Y) \tag{2.21}$$
In general, we say a set of variables X₁, ..., Xₙ is (mutually) independent if the joint can be written as a product of marginals for all subsets {X₁, ..., Xₘ} ⊆ {X₁, ..., Xₙ}, i.e.,
$$p(X_1, \ldots, X_m) = \prod_{i=1}^m p(X_i) \tag{2.22}$$
For example, we say X₁, X₂, X₃ are mutually independent if the following conditions hold: p(X₁, X₂, X₃) = p(X₁)p(X₂)p(X₃), p(X₁, X₂) = p(X₁)p(X₂), p(X₂, X₃) = p(X₂)p(X₃), and p(X₁, X₃) = p(X₁)p(X₃). [Footnote 2: For further discussion, see https://github.com/probml/pml-book/issues/353#issuecomment-1120327442.]

Unfortunately, unconditional independence is rare, because most variables can influence most other variables. However, usually this influence is mediated via other variables rather than being direct. We therefore say X and Y are conditionally independent (CI) given Z iff the conditional joint can be written as a product of conditional marginals:
$$X \perp Y \mid Z \iff p(X, Y|Z) = p(X|Z)p(Y|Z) \tag{2.23}$$
We can write this assumption as a graph X − Z − Y, which captures the intuition that all the dependencies between X and Y are mediated via Z. By using larger graphs, we can define complex joint distributions; these are known as graphical models, and are discussed in Section 3.6.

2.2.5 Moments of a distribution

In this section, we describe various summary statistics that can be derived from a probability distribution (either a pdf or pmf).

2.2.5.1 Mean of a distribution

The most familiar property of a distribution is its mean, or expected value, often denoted by µ. For continuous rv's, the mean is defined as follows:
$$\mathbb{E}[X] \triangleq \int_{\mathcal{X}} x\,p(x)\,dx \tag{2.24}$$
If the integral is not finite, the mean is not defined; we will see some examples of this later.

For discrete rv's, the mean is defined as follows:
$$\mathbb{E}[X] \triangleq \sum_{x \in \mathcal{X}} x\,p(x) \tag{2.25}$$
However, this is only meaningful if the values of x are ordered in some way (e.g., if they represent integer counts).

Since the mean is a linear operator, we have
$$\mathbb{E}[aX + b] = a\mathbb{E}[X] + b \tag{2.26}$$
This is called the linearity of expectation.

For a set of n random variables, one can show that the expectation of their sum is as follows:
$$\mathbb{E}\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n \mathbb{E}[X_i] \tag{2.27}$$
If they are independent, the expectation of their product is given by
$$\mathbb{E}\left[\prod_{i=1}^n X_i\right] = \prod_{i=1}^n \mathbb{E}[X_i] \tag{2.28}$$

2.2.5.2 Variance of a distribution

The variance is a measure of the "spread" of a distribution, often denoted by σ². This is defined as follows:
$$\mathbb{V}[X] \triangleq \mathbb{E}\left[(X - \mu)^2\right] = \int (x - \mu)^2 p(x)\,dx \tag{2.29}$$
$$= \int x^2 p(x)\,dx + \mu^2 \int p(x)\,dx - 2\mu \int x\,p(x)\,dx = \mathbb{E}\left[X^2\right] - \mu^2 \tag{2.30}$$
from which we derive the useful result
$$\mathbb{E}\left[X^2\right] = \sigma^2 + \mu^2 \tag{2.31}$$
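These identities are easy to verify by Monte Carlo. A quick sketch of my own (not from the book) draws samples from an arbitrary distribution and checks linearity of expectation (2.26) and the identity (2.31):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=3.0, size=1_000_000)  # any distribution works

a, b = 2.5, -1.0
print(np.mean(a * x + b), a * np.mean(x) + b)        # (2.26): both ~ 14.0
print(np.mean(x**2), np.var(x) + np.mean(x)**2)      # (2.31): both agree
```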
The standard deviation is defined as
$$\mathrm{std}[X] \triangleq \sqrt{\mathbb{V}[X]} = \sigma \tag{2.32}$$
This is useful since it has the same units as X itself.

The variance of a shifted and scaled version of a random variable is given by
$$\mathbb{V}[aX + b] = a^2\mathbb{V}[X] \tag{2.33}$$
If we have a set of n independent random variables, the variance of their sum is given by the sum of their variances:
$$\mathbb{V}\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n \mathbb{V}[X_i] \tag{2.34}$$
The variance of their product can also be derived, as follows:
$$\mathbb{V}\left[\prod_{i=1}^n X_i\right] = \mathbb{E}\left[\Big(\prod_i X_i\Big)^2\right] - \left(\mathbb{E}\left[\prod_i X_i\right]\right)^2 \tag{2.35}$$
$$= \mathbb{E}\left[\prod_i X_i^2\right] - \left(\prod_i \mathbb{E}[X_i]\right)^2 \tag{2.36}$$
$$= \prod_i \mathbb{E}\left[X_i^2\right] - \prod_i (\mathbb{E}[X_i])^2 \tag{2.37}$$
$$= \prod_i \left(\mathbb{V}[X_i] + (\mathbb{E}[X_i])^2\right) - \prod_i (\mathbb{E}[X_i])^2 \tag{2.38}$$
$$= \prod_i \left(\sigma_i^2 + \mu_i^2\right) - \prod_i \mu_i^2 \tag{2.39}$$

2.2.5.3 Mode of a distribution

The mode of a distribution is the value with the highest probability mass or probability density:
$$\mathbf{x}^* = \operatorname*{argmax}_{\mathbf{x}}\, p(\mathbf{x}) \tag{2.40}$$
If the distribution is multimodal, this may not be unique, as illustrated in Figure 2.4. Furthermore, even if there is a unique mode, this point may not be a good summary of the distribution.

[Figure 2.4: Illustration of a mixture of two 1d Gaussians, p(x) = 0.5 N(x|0, 0.5) + 0.5 N(x|2, 0.5). Generated by bimodal_dist_plot.ipynb.]

2.2.5.4 Conditional moments

When we have two or more dependent random variables, we can compute the moments of one given knowledge of the other. For example, the law of iterated expectations, also called the law of total expectation, tells us that
$$\mathbb{E}[X] = \mathbb{E}_Y[\mathbb{E}[X|Y]] \tag{2.41}$$
To prove this, let us suppose, for simplicity, that X and Y are both discrete rv's. Then we have
$$\mathbb{E}_Y[\mathbb{E}[X|Y]] = \mathbb{E}_Y\left[\sum_x x\,p(X = x|Y)\right] \tag{2.42}$$
$$= \sum_y \left[\sum_x x\,p(X = x|Y = y)\right] p(Y = y) = \sum_{x,y} x\,p(X = x, Y = y) = \mathbb{E}[X] \tag{2.43}$$
To give a more intuitive explanation, consider the following simple example. [Footnote 3: This example is from https://en.wikipedia.org/wiki/Law_of_total_expectation, but with modified notation.] Let X be the lifetime duration of a lightbulb, and let Y be the factory the lightbulb was produced in. Suppose E[X|Y = 1] = 5000 and E[X|Y = 2] = 4000, indicating that factory 1 produces longer lasting bulbs. Suppose factory 1 supplies 60% of the lightbulbs, so p(Y = 1) = 0.6 and p(Y = 2) = 0.4. Then the expected duration of a random lightbulb is given by
$$\mathbb{E}[X] = \mathbb{E}[X|Y = 1]p(Y = 1) + \mathbb{E}[X|Y = 2]p(Y = 2) = 5000 \times 0.6 + 4000 \times 0.4 = 4600 \tag{2.44}$$
There is a similar formula for the variance. In particular, the law of total variance, also called the conditional variance formula, tells us that
$$\mathbb{V}[X] = \mathbb{E}_Y[\mathbb{V}[X|Y]] + \mathbb{V}_Y[\mathbb{E}[X|Y]] \tag{2.45}$$
To see this, let us define the conditional moments $\mu_{X|Y} = \mathbb{E}[X|Y]$, $s_{X|Y} = \mathbb{E}[X^2|Y]$, and $\sigma^2_{X|Y} = \mathbb{V}[X|Y] = s_{X|Y} - \mu^2_{X|Y}$, which are functions of Y (and therefore are random quantities). Then we have
$$\mathbb{V}[X] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \mathbb{E}_Y[s_{X|Y}] - (\mathbb{E}_Y[\mu_{X|Y}])^2 \tag{2.46}$$
$$= \mathbb{E}_Y[\sigma^2_{X|Y}] + \mathbb{E}_Y[\mu^2_{X|Y}] - (\mathbb{E}_Y[\mu_{X|Y}])^2 \tag{2.47}$$
$$= \mathbb{E}_Y[\mathbb{V}[X|Y]] + \mathbb{V}_Y[\mu_{X|Y}] \tag{2.48}$$
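The lightbulb example (2.44) and the decomposition (2.45) can be checked with a few lines of NumPy. In this sketch of my own, only the means and mixing weights come from the text; the per-factory variances are made-up values added so the variance formula has something to decompose:

```python
import numpy as np

p_y = np.array([0.6, 0.4])            # factory proportions
mu = np.array([5000.0, 4000.0])       # E[X | Y]
var = np.array([100.0**2, 150.0**2])  # V[X | Y] (hypothetical values)

e_x = np.sum(p_y * mu)                                   # (2.44): 4600.0
v_x = np.sum(p_y * var) + np.sum(p_y * (mu - e_x)**2)    # (2.45)
print(e_x, v_x)

# Monte Carlo check of the same quantities.
rng = np.random.default_rng(0)
y = rng.choice(2, p=p_y, size=1_000_000)
x = rng.normal(mu[y], np.sqrt(var[y]))
print(x.mean(), x.var())              # ~4600 and ~v_x
```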
To get some intuition for these formulas, consider a mixture of K univariate Gaussians. Let Y be the hidden indicator variable that specifies which mixture component we are using, and let $p(X) = \sum_{y=1}^K \pi_y \mathcal{N}(X|\mu_y, \sigma_y)$. In Figure 2.4, we have π₁ = π₂ = 0.5, µ₁ = 0, µ₂ = 2, σ₁ = σ₂ = 0.5. Thus
$$\mathbb{E}[\mathbb{V}[X|Y]] = \pi_1\sigma_1^2 + \pi_2\sigma_2^2 = 0.25 \tag{2.49}$$
$$\mathbb{V}[\mathbb{E}[X|Y]] = \pi_1(\mu_1 - \mu)^2 + \pi_2(\mu_2 - \mu)^2 = 0.5(0 - 1)^2 + 0.5(2 - 1)^2 = 0.5 + 0.5 = 1 \tag{2.50}$$
So we get the intuitive result that the variance of X is dominated by which centroid it is drawn from (i.e., difference in the means), rather than the local variance around each centroid.

2.2.6 Limitations of summary statistics *

Although it is common to summarize a probability distribution (or points sampled from a distribution) using simple statistics such as the mean and variance, this can lose a lot of information. A striking example of this is known as Anscombe's quartet [Ans73], which is illustrated in Figure 2.5. This shows 4 different datasets of (x, y) pairs, all of which have identical mean, variance and correlation coefficient ρ (defined in Section 3.1.2): E[x] = 9, V[x] = 11, E[y] = 7.50, V[y] = 4.12, and ρ = 0.816. [Footnote 4: The maximum likelihood estimate for the variance in Equation (4.36) differs from the unbiased estimate in Equation (4.38). For the former, we have V[x] = 10.00, V[y] = 3.75; for the latter, we have V[x] = 11.00, V[y] = 4.12.] However, the joint distributions p(x, y) from which these points were sampled are clearly very different. Anscombe invented these datasets, each consisting of 11 data points, to counter the impression among statisticians that numerical summaries are superior to data visualization [Ans73].

[Figure 2.5: Illustration of Anscombe's quartet (datasets I–IV). All of these datasets have the same low order summary statistics. Generated by anscombes_quartet.ipynb.]

An even more striking example of this phenomenon is shown in Figure 2.6. This consists of a dataset that looks like a dinosaur [Footnote 5: This dataset was created by Alberto Cairo, and is available at http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html], plus 11 other datasets, all of which have identical low order statistics. This collection of datasets is called the Datasaurus Dozen [MF17]. The exact values of the (x, y) points are available online. [Footnote 6: https://www.autodesk.com/research/publications/same-stats-different-graphs. There are actually 13 datasets in total, including the dinosaur. We omitted the "away" dataset for visual clarity.] They were computed using simulated annealing, a derivative free optimization method which we discuss in the sequel to this book, [Mur23]. (The objective function being optimized measures deviation from the target summary statistics of the original dinosaur, plus distance from a particular target shape.)

[Figure 2.6: Illustration of the Datasaurus Dozen (datasets "dino", "h lines", "v lines", "x shape", "star", "high lines", "dots", "circle", "bullseye", "slant up", "slant down", and "wide lines"). All of these datasets have the same low order summary statistics. Adapted from Figure 1 of [MF17]. Generated by datasaurus_dozen.ipynb.]

The same simulated annealing approach can be applied to 1d datasets, as shown in Figure 2.7. We see that all the datasets are quite different, but they all have the same median and inter-quartile range as shown by the central shaded part of the box plots in the middle. A better visualization is known as a violin plot, shown on the right.
This shows (two copies of) the 1d kernel density estimate (Section 16.3) of the distribution on the vertical axis, in addition to the median and IQR markers. This visualization is better able to distinguish differences in the distributions. However, the technique is limited to 1d data.

[Figure 2.7: Illustration of 7 different datasets (left), the corresponding box plots (middle) and violin box plots (right). From Figure 8 of https://www.autodesk.com/research/publications/same-stats-different-graphs. Used with kind permission of Justin Matejka.]

2.3 Bayes' rule

"Bayes's theorem is to the theory of probability what Pythagoras's theorem is to geometry." — Sir Harold Jeffreys, 1973 [Jef73]

In this section, we discuss the basics of Bayesian inference. According to the Merriam-Webster dictionary, the term "inference" means "the act of passing from sample data to generalizations, usually with calculated degrees of certainty". The term "Bayesian" is used to refer to inference methods that represent "degrees of certainty" using probability theory, and which leverage Bayes' rule [Footnote 7: Thomas Bayes (1702–1761) was an English mathematician and Presbyterian minister. For a discussion of whether to spell this as Bayes rule, Bayes' rule or Bayes's rule, see https://bit.ly/2kDtLuK.] to update the degree of certainty given data.

Bayes' rule itself is very simple: it is just a formula for computing the probability distribution over possible values of an unknown (or hidden) quantity H given some observed data Y = y:
$$p(H = h|Y = y) = \frac{p(H = h)\,p(Y = y|H = h)}{p(Y = y)} \tag{2.51}$$
This follows automatically from the identity
$$p(h|y)p(y) = p(h)p(y|h) = p(h, y) \tag{2.52}$$
which itself follows from the product rule of probability.

In Equation (2.51), the term p(H) represents what we know about possible values of H before we see any data; this is called the prior distribution. (If H has K possible values, then p(H) is a vector of K probabilities that sum to 1.) The term p(Y|H = h) represents the distribution over the possible outcomes Y we expect to see if H = h; this is called the observation distribution. When we evaluate this at a point corresponding to the actual observations, y, we get the function p(Y = y|H = h), which is called the likelihood. (Note that this is a function of h, since y is fixed, but it is not a probability distribution, since it does not sum to one.)

Multiplying the prior distribution p(H = h) by the likelihood function p(Y = y|H = h) for each h gives the unnormalized joint distribution p(H = h, Y = y). We can convert this into a normalized distribution by dividing by p(Y = y), which is known as the marginal likelihood, since it is computed by marginalizing over the unknown H:
$$p(Y = y) = \sum_{h' \in \mathcal{H}} p(H = h')\,p(Y = y|H = h') = \sum_{h' \in \mathcal{H}} p(H = h', Y = y) \tag{2.53}$$
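For a discrete hidden variable, Bayes' rule (2.51) is a few lines of code. This sketch (my own, not from the book) takes a prior vector and a likelihood vector and returns the normalized posterior:

```python
import numpy as np

def bayes_rule(prior, likelihood):
    """Posterior p(H=h | Y=y) from prior p(H=h) and likelihood p(Y=y | H=h)."""
    unnorm = prior * likelihood      # unnormalized joint p(h, y)
    marginal = unnorm.sum()          # (2.53): marginal likelihood p(y)
    return unnorm / marginal         # (2.51): normalize

# Example: 3 hypotheses with a uniform prior.
prior = np.array([1/3, 1/3, 1/3])
likelihood = np.array([0.5, 1.0, 0.0])   # p(y | h) for the observed y
print(bayes_rule(prior, likelihood))     # [1/3, 2/3, 0]
```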
Normalizing the joint distribution by computing p(H = h, Y = y)/p(Y = y) for each h gives the posterior distribution p(H = h|Y = y); this represents our new belief state about the possible values of H.

We can summarize Bayes rule in words as follows:
$$\text{posterior} \propto \text{prior} \times \text{likelihood} \tag{2.54}$$
Here we use the symbol ∝ to denote "proportional to", since we are ignoring the denominator, which is just a constant, independent of H.

Using Bayes rule to update a distribution over unknown values of some quantity of interest, given relevant observed data, is called Bayesian inference, or posterior inference. It can also just be called probabilistic inference. Below we give some simple examples of Bayesian inference in action. We will see many more interesting examples later in this book.

2.3.1 Example: Testing for COVID-19

Suppose you think you may have contracted COVID-19, which is an infectious disease caused by the SARS-CoV-2 virus. You decide to take a diagnostic test, and you want to use its result to determine if you are infected or not. Let H = 1 be the event that you are infected, and H = 0 be the event you are not infected. Let Y = 1 if the test is positive, and Y = 0 if the test is negative. We want to compute p(H = h|Y = y), for h ∈ {0, 1}, where y is the observed test outcome. (We will write the distribution of values, [p(H = 0|Y = y), p(H = 1|Y = y)], as p(H|y), for brevity.) We can think of this as a form of binary classification, where H is the unknown class label, and y is the feature vector.

First we must specify the likelihood. This quantity obviously depends on how reliable the test is. There are two key parameters. The sensitivity (aka true positive rate) is defined as p(Y = 1|H = 1), i.e., the probability of a positive test given that the truth is positive. The false negative rate is defined as one minus the sensitivity. The specificity (aka true negative rate) is defined as p(Y = 0|H = 0), i.e., the probability of a negative test given that the truth is negative. The false positive rate is defined as one minus the specificity. We summarize all these quantities in Table 2.1. (See Section 5.1.3.1 for more details.) Following https://nyti.ms/31MTZgV, we set the sensitivity to 87.5% and the specificity to 97.5%.

[Table 2.1: Likelihood function p(Y|H) for a binary observation Y given two possible hidden states H. Each row sums to one.

Truth    Observation Y = 0            Observation Y = 1
H = 0    TNR = Specificity = 0.975    FPR = 1 − TNR = 0.025
H = 1    FNR = 1 − TPR = 0.125        TPR = Sensitivity = 0.875

Abbreviations: TNR is true negative rate, TPR is true positive rate, FNR is false negative rate, FPR is false positive rate.]

Next we must specify the prior. The quantity p(H = 1) represents the prevalence of the disease in the area in which you live. We set this to p(H = 1) = 0.1 (i.e., 10%), which was the prevalence in New York City in Spring 2020. (This example was chosen to match the numbers in https://nyti.ms/31MTZgV.)

Now suppose you test positive. We have
$$p(H = 1|Y = 1) = \frac{p(Y = 1|H = 1)\,p(H = 1)}{p(Y = 1|H = 1)\,p(H = 1) + p(Y = 1|H = 0)\,p(H = 0)} \tag{2.55}$$
$$= \frac{\text{TPR} \times \text{prior}}{\text{TPR} \times \text{prior} + \text{FPR} \times (1 - \text{prior})} \tag{2.56}$$
$$= \frac{0.875 \times 0.1}{0.875 \times 0.1 + 0.025 \times 0.9} = 0.795 \tag{2.57}$$
So there is a 79.5% chance you are infected.

Now suppose you test negative. The probability you are infected is given by
$$p(H = 1|Y = 0) = \frac{p(Y = 0|H = 1)\,p(H = 1)}{p(Y = 0|H = 1)\,p(H = 1) + p(Y = 0|H = 0)\,p(H = 0)} \tag{2.58}$$
$$= \frac{\text{FNR} \times \text{prior}}{\text{FNR} \times \text{prior} + \text{TNR} \times (1 - \text{prior})} \tag{2.59}$$
$$= \frac{0.125 \times 0.1}{0.125 \times 0.1 + 0.975 \times 0.9} = 0.014 \tag{2.60}$$
So there is just a 1.4% chance you are infected.

Nowadays COVID-19 prevalence is much lower. Suppose we repeat these calculations using a base rate of 1%; now the posteriors reduce to 26% and 0.13% respectively.
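A short sketch of my own reproduces all four posteriors using equations (2.55)-(2.60):

```python
def posterior_infected(prior, tpr=0.875, fpr=0.025, positive=True):
    """p(H=1 | test result) via Bayes' rule."""
    if positive:
        like_inf, like_not = tpr, fpr            # p(Y=1 | H=1), p(Y=1 | H=0)
    else:
        like_inf, like_not = 1 - tpr, 1 - fpr    # p(Y=0 | H=1), p(Y=0 | H=0)
    joint_inf = like_inf * prior
    joint_not = like_not * (1 - prior)
    return joint_inf / (joint_inf + joint_not)

for prior in [0.1, 0.01]:
    print(prior,
          round(posterior_infected(prior, positive=True), 4),   # 0.795, 0.2611
          round(posterior_infected(prior, positive=False), 4))  # 0.014, 0.0013
```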
The fact that you only have a 26% chance of being infected with COVID-19, even after a positive test, is very counter-intuitive. The reason is that a single positive test is more likely to be a false positive than due to the disease, since the disease is rare. To see this, suppose we have a population of 100,000 people, of whom 1,000 are infected. Of those who are infected, 875 = 0.875 × 1000 test positive, and of those who are uninfected, 2475 = 0.025 × 99,000 test positive. Thus the total number of positives is 3350 = 875 + 2475, so the posterior probability of being infected given a positive test is 875/3350 = 0.26.

Of course, the above calculations assume we know the sensitivity and specificity of the test. See [GC20] for how to apply Bayes rule for diagnostic testing when there is uncertainty about these parameters.

2.3.2 Example: The Monty Hall problem

In this section, we consider a more "frivolous" application of Bayes rule. In particular, we apply it to the famous Monty Hall problem. Imagine a game show with the following rules: There are three doors, labeled 1, 2, 3. A single prize (e.g., a car) has been hidden behind one of them. You get to select one door. Then the gameshow host opens one of the other two doors (not the one you picked), in such a way as to not reveal the prize location. At this point, you will be given a fresh choice of door: you can either stick with your first choice, or you can switch to the other closed door. All the doors will then be opened and you will receive whatever is behind your final choice of door.

For example, suppose you choose door 1, and the gameshow host opens door 3, revealing nothing behind the door, as promised. Should you (a) stick with door 1, or (b) switch to door 2, or (c) does it make no difference?

[Table 2.2: The 3 possible states for the Monty Hall game, assuming the contestant initially picks door 1, showing that switching doors is two times better (on average) than staying with your original choice. Adapted from Table 6.1 of [PM18].

Door 1   Door 2   Door 3   Switch   Stay
Car      -        -        Lose     Win
-        Car      -        Win      Lose
-        -        Car      Win      Lose]

Intuitively, it seems it should make no difference, since your initial choice of door cannot influence the location of the prize. However, the fact that the host opened door 3 tells us something about the location of the prize, since he made his choice conditioned on the knowledge of the true location and on your choice. As we show below, you are in fact twice as likely to win the prize if you switch to door 2.

To show this, we will use Bayes' rule. Let Hᵢ denote the hypothesis that the prize is behind door i. We make the following assumptions: the three hypotheses H₁, H₂ and H₃ are equiprobable a priori, i.e.,
$$P(H_1) = P(H_2) = P(H_3) = \frac{1}{3} \tag{2.61}$$
The datum we receive, after choosing door 1, is either Y = 3 or Y = 2 (meaning door 3 or 2 is opened, respectively). We assume that these two possible outcomes have the following probabilities. If the prize is behind door 1, then the host selects at random between Y = 2 and Y = 3. Otherwise the choice of the host is forced and the probabilities are 0 and 1.
$$P(Y = 2|H_1) = \tfrac{1}{2} \qquad P(Y = 2|H_2) = 0 \qquad P(Y = 2|H_3) = 1$$
$$P(Y = 3|H_1) = \tfrac{1}{2} \qquad P(Y = 3|H_2) = 1 \qquad P(Y = 3|H_3) = 0 \tag{2.62}$$
Now, using Bayes' theorem, we evaluate the posterior probabilities of the hypotheses:
$$P(H_i|Y = 3) = \frac{P(Y = 3|H_i)\,P(H_i)}{P(Y = 3)} \tag{2.63}$$
$$P(H_1|Y = 3) = \frac{(1/2)(1/3)}{P(Y = 3)} \qquad P(H_2|Y = 3) = \frac{(1)(1/3)}{P(Y = 3)} \qquad P(H_3|Y = 3) = \frac{(0)(1/3)}{P(Y = 3)} \tag{2.64}$$
The denominator is P(Y = 3) = 1/6 + 1/3 = 1/2. So
$$P(H_1|Y = 3) = \frac{1}{3} \qquad P(H_2|Y = 3) = \frac{2}{3} \qquad P(H_3|Y = 3) = 0 \tag{2.65}$$
So the contestant should switch to door 2 in order to have the biggest chance of getting the prize. See Table 2.2 for a worked example.

Many people find this outcome surprising. One way to make it more intuitive is to perform a thought experiment in which the game is played with a million doors. The rules are now that the contestant chooses one door, then the game show host opens 999,998 doors in such a way as not to reveal the prize, leaving the contestant's selected door and one other door closed. The contestant may now stick or switch. Imagine the contestant confronted by a million doors, of which doors 1 and 234,598 have not been opened, door 1 having been the contestant's initial guess. Where do you think the prize is?
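A Monte Carlo simulation (a quick sketch of my own, not from the book) confirms the 1/3 vs 2/3 split:

```python
import random

random.seed(0)
wins_stay = wins_switch = 0
trials = 100_000

for _ in range(trials):
    prize = random.randint(1, 3)
    pick = 1                                  # contestant always picks door 1
    # Host opens a door that is neither the pick nor the prize.
    opened = random.choice([d for d in (1, 2, 3) if d not in (pick, prize)])
    switched = next(d for d in (1, 2, 3) if d not in (pick, opened))
    wins_stay += (pick == prize)
    wins_switch += (switched == prize)

print(wins_stay / trials, wins_switch / trials)   # ~0.333 and ~0.667
```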
2.3.3 Inverse problems *

Probability theory is concerned with predicting a distribution over outcomes y given knowledge (or assumptions) about the state of the world, h. By contrast, inverse probability is concerned with inferring the state of the world from observations of outcomes. We can think of this as inverting the h → y mapping.

For example, consider trying to infer a 3d shape h from a 2d image y, which is a classic problem in visual scene understanding. Unfortunately, this is a fundamentally ill-posed problem, as illustrated in Figure 2.8, since there are multiple possible hidden h's consistent with the same observed y (see e.g., [Piz01]). Similarly, we can view natural language understanding as an ill-posed problem, in which the listener must infer the intention h from the (often ambiguous) words spoken by the speaker (see e.g., [Sab21]).

[Figure 2.8: Any planar line-drawing is geometrically consistent with infinitely many 3-D structures. From Figure 11 of [SA93]. Used with kind permission of Pawan Sinha.]

To tackle such inverse problems, we can use Bayes' rule to compute the posterior, p(h|y), which gives a distribution over possible states of the world. This requires specifying the forwards model, p(y|h), as well as a prior p(h), which can be used to rule out (or downweight) implausible world states. We discuss this topic in more detail in the sequel to this book, [Mur23].

2.4 Bernoulli and binomial distributions

Perhaps the simplest probability distribution is the Bernoulli distribution, which can be used to model binary events, as we discuss below.

2.4.1 Definition

Consider tossing a coin, where the probability of the event that it lands heads is given by 0 ≤ θ ≤ 1. Let Y = 1 denote this event, and let Y = 0 denote the event that the coin lands tails. Thus we are assuming that p(Y = 1) = θ and p(Y = 0) = 1 − θ. This is called the Bernoulli distribution, and can be written as follows:
$$Y \sim \mathrm{Ber}(\theta) \tag{2.66}$$
where the symbol ∼ means "is sampled from" or "is distributed as", and Ber refers to Bernoulli. The probability mass function (pmf) of this distribution is defined as follows:
$$\mathrm{Ber}(y|\theta) = \begin{cases} 1 - \theta & \text{if } y = 0 \\ \theta & \text{if } y = 1 \end{cases} \tag{2.67}$$
(See Section 2.2.1 for details on pmf's.) We can write this in a more concise manner as follows:
$$\mathrm{Ber}(y|\theta) \triangleq \theta^y (1 - \theta)^{1-y} \tag{2.68}$$
The Bernoulli distribution is a special case of the binomial distribution. To explain this, suppose we observe a set of N Bernoulli trials, denoted yₙ ∼ Ber(·|θ), for n = 1 : N. Concretely, think of tossing a coin N times. Let us define s to be the total number of heads, $s \triangleq \sum_{n=1}^N \mathbb{I}(y_n = 1)$. The distribution of s is given by the binomial distribution:
$$\mathrm{Bin}(s|N, \theta) \triangleq \binom{N}{s} \theta^s (1 - \theta)^{N-s} \tag{2.69}$$
where
$$\binom{N}{k} \triangleq \frac{N!}{(N - k)!\,k!} \tag{2.70}$$
is the number of ways to choose k items from N (this is known as the binomial coefficient, and is pronounced "N choose k"). See Figure 2.9 for some examples of the binomial distribution. If N = 1, the binomial distribution reduces to the Bernoulli distribution.

[Figure 2.9: Illustration of the binomial distribution with N = 10 and (a) θ = 0.25 and (b) θ = 0.9. Generated by binom_dist_plot.ipynb.]

2.4.2 Sigmoid (logistic) function

When we want to predict a binary variable y ∈ {0, 1} given some inputs x ∈ 𝒳, we need to use a conditional probability distribution of the form
$$p(y|\mathbf{x}, \boldsymbol{\theta}) = \mathrm{Ber}(y|f(\mathbf{x}; \boldsymbol{\theta})) \tag{2.71}$$
where f(x; θ) is some function that predicts the mean parameter of the output distribution. We will consider many different kinds of function f in Part II–Part IV.

To avoid the requirement that 0 ≤ f(x; θ) ≤ 1, we can let f be an unconstrained function, and use the following model:
$$p(y|\mathbf{x}, \boldsymbol{\theta}) = \mathrm{Ber}(y|\sigma(f(\mathbf{x}; \boldsymbol{\theta}))) \tag{2.78}$$
Here σ() is the sigmoid or logistic function, defined as follows:
$$\sigma(a) \triangleq \frac{1}{1 + e^{-a}} \tag{2.79}$$
where a = f(x; θ). The term "sigmoid" means S-shaped: see Figure 2.10a for a plot. We see that it maps the whole real line to [0, 1], which is necessary for the output to be interpreted as a probability (and hence a valid value for the Bernoulli parameter θ). The sigmoid function can be thought of as a "soft" version of the Heaviside step function, defined by
$$H(a) \triangleq \mathbb{I}(a > 0) \tag{2.80}$$
as shown in Figure 2.10b.

[Figure 2.10: (a) The sigmoid (logistic) function σ(a) = (1 + e⁻ᵃ)⁻¹. (b) The Heaviside function I(a > 0). Generated by activation_fun_plot.ipynb.]

[Table 2.3: Some useful properties of the sigmoid (logistic) and related functions. Note that the logit function is the inverse of the sigmoid function, and has a domain of [0, 1].
$$\sigma(x) \triangleq \frac{1}{1 + e^{-x}} = \frac{e^x}{1 + e^x} \tag{2.72}$$
$$\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x)) \tag{2.73}$$
$$1 - \sigma(x) = \sigma(-x) \tag{2.74}$$
$$\sigma^{-1}(p) = \log\left(\frac{p}{1 - p}\right) \triangleq \mathrm{logit}(p) \tag{2.75}$$
$$\sigma_+(x) \triangleq \log(1 + e^x) \triangleq \mathrm{softplus}(x) \tag{2.76}$$
$$\frac{d}{dx}\sigma_+(x) = \sigma(x) \tag{2.77}$$]
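The identities in Table 2.3 can be verified numerically. A small sketch of my own (not the book's notebook):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

x = np.linspace(-4, 4, 9)
assert np.allclose(1 - sigmoid(x), sigmoid(-x))   # (2.74)
assert np.allclose(logit(sigmoid(x)), x)          # (2.75): logit inverts sigmoid

# (2.73): check the derivative by central finite differences.
eps = 1e-6
num_grad = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
assert np.allclose(num_grad, sigmoid(x) * (1 - sigmoid(x)), atol=1e-6)
print("all sigmoid identities hold")
```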
Plugging the definition of the sigmoid function into Equation (2.78) we get
$$p(y = 1|\mathbf{x}, \boldsymbol{\theta}) = \frac{1}{1 + e^{-a}} = \frac{e^a}{1 + e^a} = \sigma(a) \tag{2.81}$$
$$p(y = 0|\mathbf{x}, \boldsymbol{\theta}) = 1 - \frac{1}{1 + e^{-a}} = \frac{e^{-a}}{1 + e^{-a}} = \frac{1}{1 + e^a} = \sigma(-a) \tag{2.82}$$
The quantity a is equal to the log odds, log(p/(1 − p)), where p = p(y = 1|x; θ). To see this, note that
$$\log\left(\frac{p}{1 - p}\right) = \log\left(\frac{e^a}{1 + e^a} \cdot \frac{1 + e^a}{1}\right) = \log(e^a) = a \tag{2.83}$$
The logistic function or sigmoid function maps the log-odds a to p:
$$p = \mathrm{logistic}(a) = \sigma(a) \triangleq \frac{1}{1 + e^{-a}} = \frac{e^a}{1 + e^a} \tag{2.84}$$
The inverse of this is called the logit function, and maps p to the log-odds a:
$$a = \mathrm{logit}(p) = \sigma^{-1}(p) \triangleq \log\left(\frac{p}{1 - p}\right) \tag{2.85}$$
See Table 2.3 for some useful properties of these functions.

2.4.3 Binary logistic regression

In this section, we use a conditional Bernoulli model, where we use a linear predictor of the form f(x; θ) = wᵀx + b. Thus the model has the form
$$p(y|\mathbf{x}; \boldsymbol{\theta}) = \mathrm{Ber}(y|\sigma(\mathbf{w}^{\mathsf{T}}\mathbf{x} + b)) \tag{2.86}$$
In other words,
$$p(y = 1|\mathbf{x}; \boldsymbol{\theta}) = \sigma(\mathbf{w}^{\mathsf{T}}\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^{\mathsf{T}}\mathbf{x} + b)}} \tag{2.87}$$
This is called logistic regression.

For example, consider a 1-dimensional, 2-class version of the iris dataset, where the positive class is "Virginica" and the negative class is "not Virginica", and the feature x we use is the petal width. We fit a logistic regression model to this and show the results in Figure 2.11. The decision boundary corresponds to the value x* where p(y = 1|x = x*, θ) = 0.5. We see that, in this example, x* ≈ 1.7. As x moves away from this boundary, the classifier becomes more confident in its prediction about the class label.

[Figure 2.11: Logistic regression applied to a 1-dimensional, 2-class version of the Iris dataset. Generated by iris_logreg.ipynb. Adapted from Figure 4.23 of [Gér19].]

It should be clear from this example why it would be inappropriate to use linear regression for a (binary) classification problem. In such a model, the probabilities would increase above 1 as we move far enough to the right, and below 0 as we move far enough to the left. For more detail on logistic regression, see Chapter 10.
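Here is a minimal sketch of this experiment using scikit-learn (my own approximation of what iris_logreg.ipynb does; the exact settings in the book's notebook may differ):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 3:4]                  # petal width, shape (150, 1)
y = (iris.target == 2).astype(int)     # 1 = Virginica, 0 = not Virginica

clf = LogisticRegression().fit(X, y)

# Decision boundary: p(y=1|x*) = 0.5, i.e., w*x + b = 0.
w, b = clf.coef_[0, 0], clf.intercept_[0]
print("decision boundary x* =", -b / w)   # roughly 1.7, as in Figure 2.11
```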
2.5 Categorical and multinomial distributions

To represent a distribution over a finite set of labels, y ∈ {1, ..., C}, we can use the categorical distribution, which generalizes the Bernoulli to C > 2 values.

2.5.1 Definition

The categorical distribution is a discrete probability distribution with one parameter per class:
$$\mathrm{Cat}(y|\boldsymbol{\theta}) \triangleq \prod_{c=1}^C \theta_c^{\mathbb{I}(y = c)} \tag{2.88}$$
In other words, p(y = c|θ) = θ_c. Note that the parameters are constrained so that $0 \le \theta_c \le 1$ and $\sum_{c=1}^C \theta_c = 1$; thus there are only C − 1 independent parameters.

We can write the categorical distribution in another way by converting the discrete variable y into a one-hot vector with C elements, all of which are 0 except for the entry corresponding to the class label. (The term "one-hot" arises from electrical engineering, where binary vectors are encoded as electrical current on a set of wires, which can be active ("hot") or not ("cold").) For example, if C = 3, we encode the classes 1, 2 and 3 as (1, 0, 0), (0, 1, 0), and (0, 0, 1). More generally, we can encode the classes using unit vectors, where e_c is all 0s except for dimension c. (This is also called a dummy encoding.) Using one-hot encodings, we can write the categorical distribution as follows:
$$\mathrm{Cat}(\mathbf{y}|\boldsymbol{\theta}) \triangleq \prod_{c=1}^C \theta_c^{y_c} \tag{2.89}$$
The categorical distribution is a special case of the multinomial distribution. To explain this, suppose we observe N categorical trials, yₙ ∼ Cat(·|θ), for n = 1 : N. Concretely, think of rolling a C-sided die N times. Let us define y to be a vector that counts the number of times each face shows up, i.e., $y_c = N_c \triangleq \sum_{n=1}^N \mathbb{I}(y_n = c)$. Now y is no longer one-hot, but is "multi-hot", since it has a non-zero entry for every value of c that was observed across all N trials. The distribution of y is given by the multinomial distribution:
$$\mathcal{M}(\mathbf{y}|N, \boldsymbol{\theta}) \triangleq \binom{N}{y_1 \ldots y_C} \prod_{c=1}^C \theta_c^{y_c} = \binom{N}{N_1 \ldots N_C} \prod_{c=1}^C \theta_c^{N_c} \tag{2.90}$$
where θ_c is the probability that side c shows up, and
$$\binom{N}{N_1 \ldots N_C} \triangleq \frac{N!}{N_1!\,N_2!\cdots N_C!} \tag{2.91}$$
is the multinomial coefficient, which is the number of ways to divide a set of size $N = \sum_{c=1}^C N_c$ into subsets with sizes N₁ up to N_C. If N = 1, the multinomial distribution becomes the categorical distribution.

2.5.2 Softmax function

In the conditional case, we can define
$$p(y|\mathbf{x}, \boldsymbol{\theta}) = \mathrm{Cat}(y|f(\mathbf{x}; \boldsymbol{\theta})) \tag{2.92}$$
which we can also write as
$$p(y|\mathbf{x}, \boldsymbol{\theta}) = \mathcal{M}(y|1, f(\mathbf{x}; \boldsymbol{\theta})) \tag{2.93}$$
We require that $0 \le f_c(\mathbf{x}; \boldsymbol{\theta}) \le 1$ and $\sum_{c=1}^C f_c(\mathbf{x}; \boldsymbol{\theta}) = 1$.

To avoid the requirement that f directly predict a probability vector, it is common to pass the output from f into the softmax function [Bri90], also called the multinomial logit. This is defined as follows:
$$\mathrm{softmax}(\mathbf{a}) \triangleq \left[\frac{e^{a_1}}{\sum_{c'=1}^C e^{a_{c'}}}, \ldots, \frac{e^{a_C}}{\sum_{c'=1}^C e^{a_{c'}}}\right] \tag{2.94}$$
This maps ℝᶜ to [0, 1]ᶜ, and satisfies the constraints that $0 \le \mathrm{softmax}(\mathbf{a})_c \le 1$ and $\sum_{c=1}^C \mathrm{softmax}(\mathbf{a})_c = 1$. The inputs to the softmax, a = f(x; θ), are called logits, and are a generalization of the log odds.

The softmax function is so-called since it acts a bit like the argmax function. To see this, let us divide each a_c by a constant T called the temperature. [Footnote 8: This terminology comes from the area of statistical physics. The Boltzmann distribution is a distribution over states which has the same form as the softmax function.] Then as T → 0, we find
$$\mathrm{softmax}(\mathbf{a}/T)_c = \begin{cases} 1.0 & \text{if } c = \operatorname{argmax}_{c'} a_{c'} \\ 0.0 & \text{otherwise} \end{cases} \tag{2.95}$$
In other words, at low temperatures, the distribution puts most of its probability mass in the most probable state (this is called winner takes all), whereas at high temperatures, it spreads the mass uniformly. See Figure 2.12 for an illustration.

[Figure 2.12: Softmax distribution softmax(a/T), where a = (3, 0, 1), at temperatures of T = 100, T = 2 and T = 1. When the temperature is high (left), the distribution is uniform, whereas when the temperature is low (right), the distribution is "spiky", with most of its mass on the largest element. Generated by softmax_plot.ipynb.]
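The temperature effect shown in Figure 2.12 is easy to reproduce (a sketch of my own using the same logits a = (3, 0, 1)):

```python
import numpy as np

def softmax(a):
    a = a - a.max()               # stabilize; see the log-sum-exp trick below
    e = np.exp(a)
    return e / e.sum()

a = np.array([3.0, 0.0, 1.0])
for T in [100.0, 2.0, 1.0, 0.1]:
    print(f"T={T:>5}: {np.round(softmax(a / T), 3)}")
# High T -> nearly uniform; low T -> nearly one-hot on the largest logit.
```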
2.5.3 Multiclass logistic regression

If we use a linear predictor of the form f(x; θ) = Wx + b, where W is a C × D matrix, and b is a C-dimensional bias vector, the final model becomes
$$p(y|\mathbf{x}; \boldsymbol{\theta}) = \mathrm{Cat}(y|\mathrm{softmax}(\mathbf{W}\mathbf{x} + \mathbf{b})) \tag{2.96}$$
Let a = Wx + b be the C-dimensional vector of logits. Then we can rewrite the above as follows:
$$p(y = c|\mathbf{x}; \boldsymbol{\theta}) = \frac{e^{a_c}}{\sum_{c'=1}^C e^{a_{c'}}} \tag{2.97}$$
This is known as multinomial logistic regression.

If we have just two classes, this reduces to binary logistic regression. To see this, note that
$$\mathrm{softmax}(\mathbf{a})_0 = \frac{e^{a_0}}{e^{a_0} + e^{a_1}} = \frac{1}{1 + e^{a_1 - a_0}} = \sigma(a_0 - a_1) \tag{2.98}$$
so we can just train the model to predict a = a₁ − a₀. This can be done with a single weight vector w; if we use the multi-class formulation, we will have two weight vectors, w₀ and w₁. Such a model is over-parameterized, which can hurt interpretability, but the predictions will be the same.

We discuss this in more detail in Section 10.3. For now, we just give an example. Figure 2.13 shows what happens when we fit this model to the 3-class iris dataset, using just 2 features. We see that the decision boundaries between each class are linear. We can create nonlinear boundaries by transforming the features (e.g., using polynomials), as we discuss in Section 10.3.1.

[Figure 2.13: Logistic regression on the 3-class, 2-feature version of the Iris dataset. Adapted from Figure 4.25 of [Gér19]. Generated by iris_logreg.ipynb.]

2.5.4 Log-sum-exp trick

In this section, we discuss one important practical detail to pay attention to when working with the softmax distribution. Suppose we want to compute the normalized probability p_c = p(y = c|x), which is given by
$$p_c = \frac{e^{a_c}}{Z(\mathbf{a})} = \frac{e^{a_c}}{\sum_{c'=1}^C e^{a_{c'}}} \tag{2.99}$$
where a = f(x; θ) are the logits. We might encounter numerical problems when computing the partition function Z. For example, suppose we have 3 classes, with logits a = (0, 1, 0). Then we find Z = e⁰ + e¹ + e⁰ = 4.71. But now suppose a = (1000, 1001, 1000); we find Z = ∞, since on a computer, even using 64-bit precision, np.exp(1000)=inf. Similarly, suppose a = (−1000, −999, −1000); now we find Z = 0, since np.exp(-1000)=0. To avoid numerical problems, we can use the following identity:
$$\log \sum_{c=1}^C \exp(a_c) = m + \log \sum_{c=1}^C \exp(a_c - m) \tag{2.100}$$
This holds for any m. It is common to use m = max_c a_c, which ensures that the largest value you exponentiate will be zero, so you will definitely not overflow, and even if you underflow, the answer will be sensible. This is known as the log-sum-exp trick. We use this trick when implementing the lse function:
$$\mathrm{lse}(\mathbf{a}) \triangleq \log \sum_{c=1}^C \exp(a_c) \tag{2.101}$$
We can use this to compute the probabilities from the logits:
$$p(y = c|\mathbf{x}) = \exp(a_c - \mathrm{lse}(\mathbf{a})) \tag{2.102}$$
We can then pass this to the cross-entropy loss, defined in Equation (5.41).

However, to save computational effort, and for numerical stability, it is quite common to modify the cross-entropy loss so that it takes the logits a as inputs, instead of the probability vector p. For example, consider the binary case. The CE loss for one example is
$$\mathcal{L} = -\left[\mathbb{I}(y = 0)\log p_0 + \mathbb{I}(y = 1)\log p_1\right] \tag{2.103}$$
where
$$\log p_1 = \log\left(\frac{1}{1 + \exp(-a)}\right) = \log(1) - \log(1 + \exp(-a)) = 0 - \mathrm{lse}([0, -a]) \tag{2.104}$$
$$\log p_0 = 0 - \mathrm{lse}([0, +a]) \tag{2.105}$$
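A direct implementation of equations (2.100)-(2.102) follows (a sketch of my own; scipy.special.logsumexp provides the same functionality):

```python
import numpy as np

def lse(a):
    """Numerically stable log-sum-exp, equation (2.100) with m = max(a)."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def log_softmax(a):
    return a - lse(a)                            # (2.102), in log space

for a in [np.array([0.0, 1.0, 0.0]),
          np.array([1000.0, 1001.0, 1000.0]),    # naive softmax overflows here
          np.array([-1000.0, -999.0, -1000.0])]: # naive softmax underflows here
    print(np.exp(log_softmax(a)))                # same finite answer in all cases
```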
2.6 Univariate Gaussian (normal) distribution

The most widely used distribution of real-valued random variables y ∈ ℝ is the Gaussian distribution, also called the normal distribution (see Section 2.6.4 for a discussion of these names).

2.6.1 Cumulative distribution function

We define the cumulative distribution function or cdf of a continuous random variable Y as follows:
$$P(y) \triangleq \Pr(Y \le y) \tag{2.106}$$
(Note that we use a capital P to represent the cdf.) Using this, we can compute the probability of being in any interval as follows:
$$\Pr(a < Y \le b) = P(b) - P(a) \tag{2.107}$$
Cdf's are monotonically non-decreasing functions.

The cdf of the Gaussian is defined by
$$\Phi(y; \mu, \sigma^2) \triangleq \int_{-\infty}^y \mathcal{N}(z|\mu, \sigma^2)\,dz \tag{2.108}$$
See Figure 2.2a for a plot. Note that the cdf of the Gaussian is often implemented using $\Phi(y; \mu, \sigma^2) = \frac{1}{2}\left[1 + \mathrm{erf}(z/\sqrt{2})\right]$, where z = (y − µ)/σ and erf(u) is the error function, defined as
$$\mathrm{erf}(u) \triangleq \frac{2}{\sqrt{\pi}} \int_0^u e^{-t^2}\,dt \tag{2.109}$$
The parameter µ encodes the mean of the distribution; in the case of a Gaussian, this is also the same as the mode. The parameter σ² encodes the variance. (Sometimes we talk about the precision of a Gaussian, which is the inverse variance, denoted λ = 1/σ².) When µ = 0 and σ = 1, the Gaussian is called the standard normal distribution.

If P is the cdf of Y, then P⁻¹(q) is the value y_q such that Pr(Y ≤ y_q) = q; this is called the q'th quantile of P. The value P⁻¹(0.5) is the median of the distribution, with half of the probability mass on the left, and half on the right. The values P⁻¹(0.25) and P⁻¹(0.75) are the lower and upper quartiles.

For example, let Φ be the cdf of the Gaussian distribution N(0, 1), and Φ⁻¹ be the inverse cdf (also known as the probit function). Then points to the left of Φ⁻¹(α/2) contain α/2 of the probability mass, as illustrated in Figure 2.2b. By symmetry, points to the right of Φ⁻¹(1 − α/2) also contain α/2 of the mass. Hence the central interval (Φ⁻¹(α/2), Φ⁻¹(1 − α/2)) contains 1 − α of the mass. If we set α = 0.05, the central 95% interval is covered by the range
$$(\Phi^{-1}(0.025), \Phi^{-1}(0.975)) = (-1.96, 1.96) \tag{2.110}$$
If the distribution is N(µ, σ²), then the 95% interval becomes (µ − 1.96σ, µ + 1.96σ). This is often approximated by writing µ ± 2σ.

2.6.2 Probability density function

We define the probability density function or pdf as the derivative of the cdf:
$$p(y) \triangleq \frac{d}{dy}P(y) \tag{2.111}$$
The pdf of the Gaussian is given by
$$\mathcal{N}(y|\mu, \sigma^2) \triangleq \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y - \mu)^2} \tag{2.112}$$
where $\sqrt{2\pi\sigma^2}$ is the normalization constant needed to ensure the density integrates to 1 (see Exercise 2.12). See Figure 2.2b for a plot.

Given a pdf, we can compute the probability of a continuous variable being in a finite interval as follows:
$$\Pr(a < Y \le b) = \int_a^b p(y)\,dy = P(b) - P(a) \tag{2.113}$$
As the size of the interval gets smaller, we can write
$$\Pr(y < Y \le y + dy) \approx p(y)\,dy \tag{2.114}$$
Intuitively, this says the probability of Y being in a small interval around y is the density at y times the width of the interval. One important consequence of the above result is that the pdf at a point can be larger than 1. For example, N(0|0, 0.1) = 3.99.

We can use the pdf to compute the mean, or expected value, of the distribution:
$$\mathbb{E}[Y] \triangleq \int_{\mathcal{Y}} y\,p(y)\,dy \tag{2.115}$$
For a Gaussian, we have the familiar result that E[N(·|µ, σ²)] = µ. (Note, however, that for some distributions, this integral is not finite, so the mean is not defined.)

We can also use the pdf to compute the variance of a distribution. This is a measure of the "spread", and is often denoted by σ². The variance is defined as follows:
$$\mathbb{V}[Y] \triangleq \mathbb{E}\left[(Y - \mu)^2\right] = \int (y - \mu)^2 p(y)\,dy \tag{2.116}$$
$$= \int y^2 p(y)\,dy + \mu^2 \int p(y)\,dy - 2\mu \int y\,p(y)\,dy = \mathbb{E}\left[Y^2\right] - \mu^2 \tag{2.117}$$
from which we derive the useful result
$$\mathbb{E}\left[Y^2\right] = \sigma^2 + \mu^2 \tag{2.118}$$
The standard deviation is defined as
$$\mathrm{std}[Y] \triangleq \sqrt{\mathbb{V}[Y]} = \sigma \tag{2.119}$$
(The standard deviation can be more interpretable than the variance since it has the same units as Y itself.) For a Gaussian, we have the familiar result that std[N(·|µ, σ²)] = σ.
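All of these quantities are available in scipy.stats (a quick sketch of my own):

```python
from scipy.stats import norm

# (2.110): central 95% interval of the standard normal.
print(norm.ppf(0.025), norm.ppf(0.975))   # -1.96, 1.96

# The pdf at a point can exceed 1 when the distribution is narrow:
# a Gaussian with sigma = 0.1 has density ~3.99 at its mode.
print(norm.pdf(0, loc=0, scale=0.1))

# P(a < Y <= b) = P(b) - P(a), equation (2.113).
print(norm.cdf(1.96) - norm.cdf(-1.96))   # ~0.95
```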
2.6.3 Regression

So far we have been considering the unconditional Gaussian distribution. In some cases, it is helpful to make the parameters of the Gaussian be functions of some input variables, i.e., we want to create a conditional density model of the form
$$p(y|\mathbf{x}; \boldsymbol{\theta}) = \mathcal{N}(y|f_\mu(\mathbf{x}; \boldsymbol{\theta}), f_\sigma(\mathbf{x}; \boldsymbol{\theta})^2) \tag{2.120}$$
where f_µ(x; θ) ∈ ℝ predicts the mean, and f_σ(x; θ)² ∈ ℝ₊ predicts the variance.

It is common to assume that the variance is fixed, and is independent of the input. This is called homoscedastic regression. Furthermore it is common to assume the mean is a linear function of the input. The resulting model is called linear regression:
$$p(y|\mathbf{x}; \boldsymbol{\theta}) = \mathcal{N}(y|\mathbf{w}^{\mathsf{T}}\mathbf{x} + b, \sigma^2) \tag{2.121}$$
where θ = (w, b, σ²). See Figure 2.14(a) for an illustration of this model in 1d, and Section 11.2 for more details on this model.

However, we can also make the variance depend on the input; this is called heteroscedastic regression. In the linear regression setting, we have
$$p(y|\mathbf{x}; \boldsymbol{\theta}) = \mathcal{N}(y|\mathbf{w}_\mu^{\mathsf{T}}\mathbf{x} + b, \sigma_+(\mathbf{w}_\sigma^{\mathsf{T}}\mathbf{x})) \tag{2.122}$$
where θ = (w_µ, w_σ) are the two forms of regression weights, and
$$\sigma_+(a) = \log(1 + e^a) \tag{2.123}$$
is the softplus function, which maps from ℝ to ℝ₊, to ensure the predicted standard deviation is non-negative. See Figure 2.14(b) for an illustration of this model in 1d.

[Figure 2.14: Linear regression using Gaussian output with mean µ(x) = b + wx and (a) fixed variance σ² (homoscedastic) or (b) input-dependent variance σ(x)² (heteroscedastic). Generated by linreg_1d_hetero_tfp.ipynb.]

Note that Figure 2.14 plots the 95% predictive interval, [µ(x) − 2σ(x), µ(x) + 2σ(x)]. This is the uncertainty in the predicted observation y given x, and captures the variability in the blue dots. By contrast, the uncertainty in the underlying (noise-free) function is represented by $\sqrt{\mathbb{V}[f_\mu(\mathbf{x}; \boldsymbol{\theta})]}$, which does not involve the σ term; now the uncertainty is over the parameters θ, rather than the output y. See Section 11.7 for details on how to model parameter uncertainty.
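As a rough sketch of the heteroscedastic model (2.122) on synthetic data (my own, not the book's TensorFlow Probability notebook; I also add a bias term b_sigma to the noise model, which equation (2.122) omits), we can fit both sets of weights by minimizing the Gaussian negative log likelihood:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 1.0 + 0.5 * x + rng.normal(0, 0.1 + 0.15 * x)   # noise grows with x

def softplus(a):
    """Numerically stable softplus, equation (2.123)."""
    return np.log1p(np.exp(-np.abs(a))) + np.maximum(a, 0)

def nll(params):
    w_mu, b, w_sigma, b_sigma = params
    mu = w_mu * x + b
    sigma = softplus(w_sigma * x + b_sigma)
    # Negative log likelihood of y under N(mu, sigma^2).
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + 0.5 * ((y - mu) / sigma)**2)

res = minimize(nll, x0=np.zeros(4))
w_mu, b, w_sigma, b_sigma = res.x
print("mean weights:", w_mu, b)   # roughly 0.5 and 1.0
print("sigma at x=0 and x=10:",
      softplus(b_sigma), softplus(w_sigma * 10 + b_sigma))
```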
2.6.4 Why is the Gaussian distribution so widely used?

The Gaussian distribution is the most widely used distribution in statistics and machine learning. There are several reasons for this. First, it has two parameters which are easy to interpret, and which capture some of the most basic properties of a distribution, namely its mean and variance. Second, the central limit theorem (Section 2.8.6) tells us that sums of independent random variables have an approximately Gaussian distribution, making it a good choice for modeling residual errors or "noise". Third, the Gaussian distribution makes the least number of assumptions (has maximum entropy), subject to the constraint of having a specified mean and variance, as we show in Section 3.4.4; this makes it a good default choice in many cases. Finally, it has a simple mathematical form, which results in easy to implement, but often highly effective, methods, as we will see in Section 3.2.

From a historical perspective, it's worth remarking that the term "Gaussian distribution" is a bit misleading, since, as Jaynes [Jay03, p241] notes: "The fundamental nature of this distribution and its main properties were noted by Laplace when Gauss was six years old; and the distribution itself had been found by de Moivre before Laplace was born". However, Gauss popularized the use of the distribution in the 1800s, and the term "Gaussian" is now widely used in science and engineering. The name "normal distribution" seems to have arisen in connection with the normal equations in linear regression (see Section 11.2.2.2). However, we prefer to avoid the term "normal", since it suggests other distributions are "abnormal", whereas, as Jaynes [Jay03] points out, it is the Gaussian that is abnormal in the sense that it has many special properties that are untypical of general distributions.

2.6.5 Dirac delta function as a limiting case

As the variance of a Gaussian goes to 0, the distribution approaches an infinitely narrow, but infinitely tall, "spike" at the mean. We can write this as follows:

\[ \lim_{\sigma \to 0} \mathcal{N}(y \mid \mu, \sigma^2) \to \delta(y - \mu) \tag{2.124} \]

where δ is the Dirac delta function, defined by

\[ \delta(x) = \begin{cases} +\infty & \text{if } x = 0 \\ 0 & \text{if } x \neq 0 \end{cases} \tag{2.125} \]

where

\[ \int_{-\infty}^{\infty} \delta(x)\,dx = 1 \tag{2.126} \]

A slight variant of this is to define

\[ \delta_y(x) = \begin{cases} +\infty & \text{if } x = y \\ 0 & \text{if } x \neq y \end{cases} \tag{2.127} \]

Note that we have

\[ \delta_y(x) = \delta(x - y) \tag{2.128} \]

The delta function distribution satisfies the following sifting property, which we will use later on:

\[ \int_{-\infty}^{\infty} f(y)\, \delta(x - y)\,dy = f(x) \tag{2.129} \]
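As an informal numerical illustration (ours, not the book's), we can approximate δ(x − y) by a Gaussian with a small standard deviation, per (2.124), and evaluate the sifting integral (2.129) on a grid:

```python
import numpy as np
from scipy.stats import norm

# Approximate delta(x - y) by a narrow Gaussian N(y | x, sigma^2) and check
# the sifting property: int f(y) delta(x - y) dy = f(x).
f = np.cos
x = 0.3
ys, dy = np.linspace(-10, 10, 200_001, retstep=True)

for sigma in [1.0, 0.1, 0.01]:
    integral = np.sum(f(ys) * norm.pdf(ys, loc=x, scale=sigma)) * dy
    print(sigma, integral)  # converges to f(x) as sigma -> 0

print(f(x))  # cos(0.3) ~ 0.955
```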
Figure 2.15: (a) The pdfs for a N(0, 1), T(µ = 0, σ = 1, ν = 1), T(µ = 0, σ = 1, ν = 2), and Laplace(0, 1/√2). The mean is 0 and the variance is 1 for both the Gaussian and Laplace. When ν = 1, the Student is the same as the Cauchy, which does not have a well-defined mean and variance. (b) Log of these pdfs. Note that the Student distribution is not log-concave for any parameter value, unlike the Laplace distribution. Nevertheless, both are unimodal. Generated by student_laplace_pdf_plot.ipynb.

2.7 Some other common univariate distributions *

In this section, we briefly introduce some other univariate distributions that we will use in this book.

2.7.1 Student t distribution

The Gaussian distribution is quite sensitive to outliers. A robust alternative to the Gaussian is the Student t-distribution, which we shall call the Student distribution for short.⁹ Its pdf is as follows:

\[ \mathcal{T}(y \mid \mu, \sigma, \nu) \propto \left[ 1 + \frac{1}{\nu} \left( \frac{y - \mu}{\sigma} \right)^2 \right]^{-\left( \frac{\nu + 1}{2} \right)} \tag{2.130} \]

where µ is the mean, σ > 0 is the scale parameter (not the standard deviation), and ν > 0 is called the degrees of freedom (although a better term would be the degree of normality [Kru13], since large values of ν make the distribution act like a Gaussian).

9. This distribution has a colorful etymology. It was first published in 1908 by William Sealy Gosset, who worked at the Guinness brewery in Dublin, Ireland. Since his employer would not allow him to use his own name, he called it the "Student" distribution. The origin of the term t seems to have arisen in the context of tables of the Student distribution, used by Fisher when developing the basis of classical statistical inference. See http://jeff560.tripod.com/s.html for more historical details.

Figure 2.16: Illustration of the effect of outliers on fitting Gaussian, Student and Laplace distributions. (a) No outliers (the Gaussian and Student curves are on top of each other). (b) With outliers. We see that the Gaussian is more affected by outliers than the Student and Laplace distributions. Adapted from Figure 2.16 of [Bis06]. Generated by robust_pdf_plot.ipynb.

We see that the probability density decays as a polynomial function of the squared distance from the center, as opposed to an exponential function, so there is more probability mass in the tails than with a Gaussian distribution, as shown in Figure 2.15. We say that the Student distribution has heavy tails, which makes it robust to outliers.

To illustrate the robustness of the Student distribution, consider Figure 2.16. On the left, we show a Gaussian and a Student distribution fit to some data with no outliers. On the right, we add some outliers. We see that the Gaussian is affected a lot, whereas the Student hardly changes. We discuss how to use the Student distribution for robust linear regression in Section 11.6.2.

For later reference, we note that the Student distribution has the following properties:

\[ \text{mean} = \mu, \quad \text{mode} = \mu, \quad \text{var} = \frac{\nu \sigma^2}{\nu - 2} \tag{2.131} \]

The mean is only defined if ν > 1. The variance is only defined if ν > 2. For ν ≫ 5, the Student distribution rapidly approaches a Gaussian distribution and loses its robustness properties. It is common to use ν = 4, which gives good performance in a range of problems [LLT89].

2.7.2 Cauchy distribution

If ν = 1, the Student distribution is known as the Cauchy or Lorentz distribution. Its pdf is defined by

\[ \mathcal{C}(x \mid \mu, \gamma) = \frac{1}{\gamma \pi} \left[ 1 + \left( \frac{x - \mu}{\gamma} \right)^2 \right]^{-1} \tag{2.132} \]

This distribution has very heavy tails compared to a Gaussian. For example, 95% of the values from a standard normal are between -1.96 and 1.96, but for a standard Cauchy they are between -12.7 and 12.7. In fact the tails are so heavy that the integral that defines the mean does not converge.

The half Cauchy distribution is a version of the Cauchy (with µ = 0) that is "folded over" on itself, so all its probability density is on the positive reals. Thus it has the form

\[ \mathcal{C}_+(x \mid \gamma) \triangleq \frac{2}{\pi \gamma} \left[ 1 + \left( \frac{x}{\gamma} \right)^2 \right]^{-1} \tag{2.133} \]

This is useful in Bayesian modeling, where we want to use a distribution over positive reals with heavy tails, but finite density at the origin.
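These tail claims are easy to verify with SciPy's inverse cdfs. (An illustrative sketch, not code from the book.)

```python
from scipy.stats import norm, t, cauchy

# Central 95% intervals widen as the tails get heavier.
for name, dist in [("Gaussian", norm), ("Student (nu=2)", t(df=2)),
                   ("Cauchy", cauchy)]:
    print(name, dist.ppf(0.025), dist.ppf(0.975))
# Gaussian: +/-1.96; Student(nu=2): +/-4.30; Cauchy: +/-12.71, as in the text.

# A Student with nu = 1 is exactly the Cauchy.
print(t(df=1).ppf(0.975), cauchy.ppf(0.975))  # both approx 12.706
```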
2.7.3 Laplace distribution

Another distribution with heavy tails is the Laplace distribution¹⁰, also known as the double sided exponential distribution. This has the following pdf:

\[ \mathrm{Laplace}(y \mid \mu, b) \triangleq \frac{1}{2b} \exp\left( -\frac{|y - \mu|}{b} \right) \tag{2.134} \]

See Figure 2.15 for a plot. Here µ is a location parameter and b > 0 is a scale parameter. This distribution has the following properties:

\[ \text{mean} = \mu, \quad \text{mode} = \mu, \quad \text{var} = 2b^2 \tag{2.135} \]

In Section 11.6.1, we discuss how to use the Laplace distribution for robust linear regression, and in Section 11.4, we discuss how to use the Laplace distribution for sparse linear regression.

10. Pierre-Simon Laplace (1749–1827) was a French mathematician, who played a key role in creating the field of Bayesian statistics.

2.7.4 Beta distribution

The beta distribution has support over the interval [0, 1] and is defined as follows:

\[ \mathrm{Beta}(x \mid a, b) = \frac{1}{B(a, b)} x^{a-1} (1 - x)^{b-1} \tag{2.136} \]

where B(a, b) is the beta function, defined by

\[ B(a, b) \triangleq \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)} \tag{2.137} \]

where Γ(a) is the Gamma function, defined by

\[ \Gamma(a) \triangleq \int_0^{\infty} x^{a-1} e^{-x}\,dx \tag{2.138} \]

See Figure 2.17a for plots of some beta distributions. We require a, b > 0 to ensure the distribution is integrable (i.e., to ensure B(a, b) exists). If a = b = 1, we get the uniform distribution. If a and b are both less than 1, we get a bimodal distribution with "spikes" at 0 and 1; if a and b are both greater than 1, the distribution is unimodal.

Figure 2.17: (a) Some beta distributions. If a < 1, we get a "spike" on the left, and if b < 1, we get a "spike" on the right. If a = b = 1, the distribution is uniform. If a > 1 and b > 1, the distribution is unimodal. Generated by beta_dist_plot.ipynb. (b) Some gamma distributions. If a ≤ 1, the mode is at 0; otherwise the mode is away from 0. As we increase the rate b, we reduce the horizontal scale, thus squeezing everything leftwards and upwards. Generated by gamma_dist_plot.ipynb.

For later reference, we note that the distribution has the following properties (Exercise 2.8):

\[ \text{mean} = \frac{a}{a+b}, \quad \text{mode} = \frac{a-1}{a+b-2}, \quad \text{var} = \frac{ab}{(a+b)^2 (a+b+1)} \tag{2.139} \]

Note that the above equation for the mode assumes a > 1 and b > 1; if a < 1 and b ≥ 1, the mode is at 0, and if a ≥ 1 and b < 1, the mode is at 1.

2.7.5 Gamma distribution

The gamma distribution is a flexible distribution for positive real valued rv's, x > 0. It is defined in terms of two parameters, called the shape a > 0 and the rate b > 0:

\[ \mathrm{Ga}(x \mid \text{shape} = a, \text{rate} = b) \triangleq \frac{b^a}{\Gamma(a)} x^{a-1} e^{-xb} \tag{2.140} \]

Sometimes the distribution is parameterized in terms of the shape a and the scale s = 1/b:

\[ \mathrm{Ga}(x \mid \text{shape} = a, \text{scale} = s) \triangleq \frac{1}{s^a \Gamma(a)} x^{a-1} e^{-x/s} \tag{2.141} \]

See Figure 2.17b for some plots of the gamma pdf. For reference, we note that the distribution has the following properties:

\[ \text{mean} = \frac{a}{b}, \quad \text{mode} = \max\left( \frac{a-1}{b}, 0 \right), \quad \text{var} = \frac{a}{b^2} \tag{2.142} \]

There are several distributions which are just special cases of the Gamma, which we discuss below.

Exponential distribution. This is defined by

\[ \mathrm{Expon}(x \mid \lambda) \triangleq \mathrm{Ga}(x \mid \text{shape} = 1, \text{rate} = \lambda) \tag{2.143} \]

This distribution describes the times between events in a Poisson process, i.e., a process in which events occur continuously and independently at a constant average rate.

Chi-squared distribution. This is defined by

\[ \chi^2_\nu(x) \triangleq \mathrm{Ga}\left( x \,\middle|\, \text{shape} = \frac{\nu}{2}, \text{rate} = \frac{1}{2} \right) \tag{2.144} \]

where ν is called the degrees of freedom. This is the distribution of the sum of squared Gaussian random variables. More precisely, if Z_i ∼ N(0, 1) and S = Σ_{i=1}^{ν} Z_i², then S ∼ χ²_ν.

Inverse Gamma distribution. This is defined as follows:

\[ \mathrm{IG}(x \mid \text{shape} = a, \text{scale} = b) \triangleq \frac{b^a}{\Gamma(a)} x^{-(a+1)} e^{-b/x} \tag{2.145} \]

The distribution has these properties:

\[ \text{mean} = \frac{b}{a-1}, \quad \text{mode} = \frac{b}{a+1}, \quad \text{var} = \frac{b^2}{(a-1)^2 (a-2)} \tag{2.146} \]

The mean only exists if a > 1. The variance only exists if a > 2. Note: if X ∼ Ga(shape = a, rate = b), then 1/X ∼ IG(shape = a, scale = b). (Note that b plays two different roles in this case.)
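The special-case relationships above are easy to verify by simulation. The following sketch (ours, not the book's) uses scipy.stats, which parameterizes the gamma family by a shape a and scale = 1/rate:

```python
import numpy as np
from scipy.stats import gamma, chi2, invgamma

rng = np.random.default_rng(0)

# Chi-squared with nu dof = distribution of a sum of nu squared std normals.
nu = 3
s = (rng.standard_normal((100_000, nu)) ** 2).sum(axis=1)
print(s.mean(), chi2(df=nu).mean())             # both approx nu = 3
print(np.mean(s <= 2.0), chi2(df=nu).cdf(2.0))  # empirical vs exact cdf

# If X ~ Ga(shape=a, rate=b), then 1/X ~ IG(shape=a, scale=b).
a, b = 3.0, 2.0
x = gamma(a, scale=1/b).rvs(size=100_000, random_state=rng)
print(np.mean(1/x), invgamma(a, scale=b).mean())  # both approx b/(a-1) = 1
```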
2.7.6 Empirical distribution

Suppose we have a set of N samples D = {x^{(1)}, ..., x^{(N)}}, derived from a distribution p(X), where X ∈ R. We can approximate the pdf using a set of delta functions (Section 2.6.5) or "spikes", centered on these samples:

\[ \hat{p}_N(x) = \frac{1}{N} \sum_{n=1}^{N} \delta_{x^{(n)}}(x) \tag{2.147} \]

This is called the empirical distribution of the dataset D. An example of this, with N = 5, is shown in Figure 2.18(a).

The corresponding cdf is given by

\[ \hat{P}_N(x) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{I}\left( x^{(n)} \le x \right) = \frac{1}{N} \sum_{n=1}^{N} u_{x^{(n)}}(x) \tag{2.148} \]

where u_y(x) is a step function at y defined by

\[ u_y(x) = \begin{cases} 1 & \text{if } x \ge y \\ 0 & \text{if } x < y \end{cases} \tag{2.149} \]

This can be visualized as a "staircase", as in Figure 2.18(b), where the jumps of height 1/N occur at every sample.

Figure 2.18: Illustration of the (a) empirical pdf and (b) empirical cdf derived from a set of N = 5 samples. From https://bit
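In code, the empirical cdf (2.148) is a one-liner. Here is a minimal sketch (ours), with illustrative sample values:

```python
import numpy as np

samples = np.array([0.5, 1.1, 1.1, 2.3, 4.0])  # N = 5, hypothetical values

def empirical_cdf(x):
    # P_hat_N(x) = (1/N) sum_n I(x_n <= x), eq. (2.148): a staircase with a
    # jump of height 1/N at each sample (repeats give taller jumps).
    return np.mean(samples <= x)

for x in [0.0, 1.0, 1.1, 3.0, 5.0]:
    print(x, empirical_cdf(x))  # 0.0, 0.2, 0.6, 0.8, 1.0
```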