Statistical Inference - Lecture 1 PDF
Document Details
University of Limerick
Dr. Kevin Burke
Summary
This lecture reviews probability: basic definitions and rules (complement, addition, multiplication), conditional probability, the law of total probability and Bayes' rule, random variables (distributions, expectation, moments, variance, sums and transformations), multiple random variables, the central limit theorem, the delta method and Markov's inequality.
Full Transcript
Statistical Inference – Lecture 1: Review of Probability
Dr. Kevin Burke

1 Basic Definitions

The sample space, S, is the set of all possible outcomes from an experiment or data-generating process.

An event, A ⊆ S, is a set of outcomes from the sample space.

The probability of an event, Pr(A) = #(A)/#(S) ∈ [0, 1], is a measure of how likely this event is.

Example 1.1. Flipping a coin twice
S = {HH, HT, TH, TT}.
A = "the first coin is a head" = {HH, HT}.
Pr(A) = #(A)/#(S) = 2/4 = 1/2.

For two events A and B, the conditional probability of A given the information that B has occurred is:

Pr(A | B) = #(A ∩ B) / #(B).

The above arises from restricting our attention to the portion of A that exists within B. Note that dividing above and below by #(S) gives

Pr(A | B) = [#(A ∩ B)/#(S)] / [#(B)/#(S)] = Pr(A ∩ B) / Pr(B).

Similarly,

Pr(B | A) = Pr(A ∩ B) / Pr(A),

i.e., divide by the component being conditioned on.

2 Rules of Probability

2.1 Complement Rule

Pr(Aᶜ) = 1 − Pr(A).

2.2 Addition Rule

Any two events:

Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).

Mutually exclusive events (i.e., cannot occur together):

Pr(A ∪ B) = Pr(A) + Pr(B),

which extends to more than two mutually exclusive events:

Pr(A1 ∪ ··· ∪ An) = Pr(A1) + ··· + Pr(An) = Σ_{i=1}^{n} Pr(Ai).

2.3 Multiplication Rule

Any two events:

Pr(A ∩ B) = Pr(A) Pr(B | A).

Note: we can change the order of multiplication,

Pr(A ∩ B) = Pr(B) Pr(A | B),

i.e., the event appearing in the first term is the conditioning event in the second term.

Independent events (i.e., do not affect each other):

Pr(A ∩ B) = Pr(A) Pr(B),

i.e., Pr(B | A) = Pr(B) since the presence of A does not alter B. This extends to more than two independent events:

Pr(A1 ∩ ··· ∩ An) = Pr(A1) × ··· × Pr(An) = Π_{i=1}^{n} Pr(Ai).

2.4 Law of Total Probability

If the event B can be split up into n mutually exclusive sets B ∩ A1, ..., B ∩ An, then

Pr(B) = Pr(B ∩ A1) + ··· + Pr(B ∩ An)
      = Pr(A1) Pr(B | A1) + ··· + Pr(An) Pr(B | An)
      = Σ_{i=1}^{n} Pr(Ai) Pr(B | Ai).

2.5 Bayes' Rule

Bayes' rule is a formula for reversing conditional probabilities:

Pr(A | B) = Pr(A) Pr(B | A) / Pr(B),

which comes from combining the conditional probability formula and the multiplication rule. By incorporating the law of total probability, Bayes' rule can be written as follows:

Pr(Aj | B) = Pr(Aj) Pr(B | Aj) / Σ_{i=1}^{n} Pr(Ai) Pr(B | Ai).

Example 2.1. Rare Disease
Assume that 1% of the population have a rare disease. A test has been developed such that:
- If the individual has the disease, the test always detects it.
- If the individual does not have the disease, the test incorrectly detects the disease 10% of the time.

Define the following events: D = "the individual has the disease" and T = "the test says the individual has the disease". Thus, the above information can be converted into probability statements:

Pr(D) = 0.01, Pr(T | D) = 1, Pr(Dᶜ) = 0.99, Pr(T | Dᶜ) = 0.1.

Therefore, if the test says you have the disease, the probability that you do in fact have the disease is:

Pr(D | T) = Pr(D) Pr(T | D) / [Pr(D) Pr(T | D) + Pr(Dᶜ) Pr(T | Dᶜ)]
          = 0.01(1) / [0.01(1) + 0.99(0.1)]
          = 0.01 / 0.109 = 0.0917,

i.e., just over a 9% chance.
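The posterior probability in Example 2.1 can be checked numerically. The following is a minimal sketch, assuming Python with NumPy (the lecture itself uses no software); it applies Bayes' rule directly and then confirms the answer by simulating a large population and conditioning on a positive test.

```python
import numpy as np

# Probabilities from Example 2.1 (rare disease).
p_D      = 0.01   # Pr(D)
p_T_D    = 1.0    # Pr(T | D)
p_T_notD = 0.10   # Pr(T | D^c)

# Bayes' rule, with the law of total probability in the denominator.
p_T   = p_D * p_T_D + (1 - p_D) * p_T_notD
p_D_T = p_D * p_T_D / p_T
print(round(p_D_T, 4))                 # 0.0917

# Monte Carlo version: simulate individuals, then look only at positive tests.
rng = np.random.default_rng(0)
n = 1_000_000
disease = rng.random(n) < p_D
test = rng.random(n) < np.where(disease, p_T_D, p_T_notD)
print(round(disease[test].mean(), 4))  # close to 0.0917
```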
3 Random Variables

A random variable is a numerical quantity whose value is determined by a probabilistic (i.e., random) process. Such random variables may take discrete (integer) or continuous (real) values.

Typically, uppercase letters (e.g., X) denote the random variable whereas lowercase letters (e.g., x) denote a specific realisation of X, i.e., Pr(X = x) is "the probability that the random variable X takes the value x" and Pr(x1 ≤ X ≤ x2) is "the probability that X lies between x1 and x2".

Example 3.1. Flipping a coin twice
Let X be a random variable representing the number of heads in two flips of a coin. Thus, X ∈ {0, 1, 2} with corresponding probabilities: Pr(X = 0) = 1/4, Pr(X = 1) = 2/4 and Pr(X = 2) = 1/4.

3.1 Probability Distribution

The probability distribution is the mechanism governing the behaviour of X, i.e., how likely the different values of X are. It is characterised by the cumulative distribution function (cdf):

F(x) = Pr(X ≤ x),

which is a non-decreasing function going from 0 to 1.

If X is discrete the cdf is given by

F(x) = Σ_{xi ≤ x} f(xi) = Σ_{xi ≤ x} Pr(X = xi),

where f(x) is called the probability mass function (pmf); often p(x) is used instead of f(x) in this case.

If X is continuous the cdf is given by

F(x) = ∫_{−∞}^{x} f(t) dt,

where f(x) ≥ 0 is called the probability density function (pdf). In this case f(x) = d/dx F(x).

Note: In both cases we call F(x) the cdf.

3.2 Expected Value

The expected value is the mean of the probability distribution.

Discrete case:

EX = Σ_{x ∈ 𝒳} x f(x),

where 𝒳 represents the set of all possible values of X.

Continuous case:

EX = ∫_{−∞}^{∞} x f(x) dx.

The expected value of a function of X is:

E g(X) = ∫_{−∞}^{∞} g(x) f(x) dx,

e.g., E(sin X) = ∫_{−∞}^{∞} sin(x) f(x) dx.

Important property (linear function of X):

E(aX + b) = a EX + b,

for constants a and b.

Note: E g(X) ≠ g(EX) in general; equality holds only when g(X) = aX + b, i.e., a linear function.

Question 3.1. Flipping a coin twice
Let X = "the number of heads in two flips of a coin". Calculate E(X), E(X²) and E(3^X).

3.3 Moments

EX is known as the first moment of the probability distribution. EX² is the second moment, EX³ is the third, etc.

The moment generating function (mgf) of X is given by

M_X(t) = E(e^{tX}).

Note: The subscript X highlights that the mgf pertains to X. However, the function depends on t only.

This function generates moments as follows:

d/dt M_X(0) = EX
d²/dt² M_X(0) = EX²
...
d^k/dt^k M_X(0) = EX^k,

i.e., differentiate M_X(t) k times and then set t = 0 to get the kth moment.

Property of mgf (linear function of X):

M_{aX+b}(t) = e^{tb} M_X(at).

Example 3.2. Flipping a coin twice
Let X = "the number of heads in two flips of a coin". The mgf is

M_X(t) = E(e^{tX}) = e^0 (1/4) + e^t (2/4) + e^{2t} (1/4) = (1/4)(1 + 2e^t + e^{2t}).

Differentiating M_X(t) twice gives

M_X′(t) = (1/4)(2e^t + 2e^{2t}),
M_X″(t) = (1/4)(2e^t + 4e^{2t}),

and the first two moments are

E(X) = M_X′(0) = (1/4)(2e^0 + 2e^0) = 4/4 = 1,
E(X²) = M_X″(0) = (1/4)(2e^0 + 4e^0) = 6/4 = 1.5,

as previously calculated.

Question 3.2. Exponential mgf
Consider the exponential variable X ∈ [0, ∞) with pdf given by f(x) = λe^{−λx}. Find the moment generating function.
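The moment calculations in Example 3.2 can be reproduced symbolically. A minimal sketch, assuming Python with SymPy (not part of the lecture), differentiates the mgf and evaluates the derivatives at t = 0:

```python
import sympy as sp

t = sp.symbols('t')
# mgf of X = "number of heads in two coin flips" (Example 3.2)
M = (1 + 2*sp.exp(t) + sp.exp(2*t)) / 4

EX  = sp.diff(M, t, 1).subs(t, 0)   # first moment
EX2 = sp.diff(M, t, 2).subs(t, 0)   # second moment
print(EX, EX2)       # 1, 3/2
print(EX2 - EX**2)   # 1/2, anticipating the variance formula of Sect. 3.4
```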
3.4 Variance

Variance measures the degree of spread of a distribution around its mean:

Var X = E[(X − EX)²],

i.e., the average squared deviation of X from its mean, EX. Note that Var X ≥ 0.

It is easy to show that

Var X = EX² − (EX)².

Important property (linear function of X):

Var(aX + b) = a² Var X,

for constants a and b.

Note: the standard deviation is √(Var X).

3.5 Sums of Random Variables

Below are some further important properties. Let Sn = X1 + ··· + Xn be a sum of random variables.

It is always true that

E(Sn) = E(X1 + ··· + Xn) = EX1 + ··· + EXn.

If the variables are independent we also have that

Var(Sn) = Var(X1 + ··· + Xn) = Var X1 + ··· + Var Xn,

and

M_{Sn}(t) = M_{X1}(t) × ··· × M_{Xn}(t) = Π_{i=1}^{n} M_{Xi}(t).

Note: If X1, ..., Xn all have the same distribution, the above three results clearly become:

E(Sn) = n EX,
Var(Sn) = n Var X,
M_{Sn}(t) = [M_X(t)]^n.

Note: Random variables which are independent and all have the same distribution are called independent and identically distributed, or iid.

Question 3.3. Mean and variance of sample mean
Let X1, ..., Xn be a sample of iid variables from a distribution with mean µ = EX and variance σ² = Var X. Find the mean and variance for the sample mean: X̄ = (X1 + ··· + Xn)/n = Sn/n.

Example 3.3. Sum of exponential variables
Let X1, ..., Xn ~ iid Exp(λ), i.e., they all have an exponential distribution with parameter λ. The mgf of Sn = X1 + ··· + Xn is therefore

M_{Sn}(t) = [λ/(λ − t)]^n,

which is the mgf of a Gamma(n, λ) distribution, i.e., the sum of iid exponential variables has a Gamma distribution.

3.6 Transformations

Let X be continuous with pdf f_X(x). Assume we are interested in the distribution of Y = g(X), a transformed variable, where g is a one-to-one function. The pdf of Y is given by

f_Y(y) = f_X(x(y)) |dx/dy|.

Question 3.4. Normal linear transformation
Let X be a normal random variable, i.e., X ~ N(µ, σ²) where µ and σ² are the mean and variance respectively. Thus, the pdf of X is

f(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}.

Show that Y = aX + b ~ N(aµ + b, a²σ²).
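The change-of-variables formula of Sect. 3.6 can be checked numerically. The sketch below assumes Python with NumPy and SciPy, and an illustrative transformation g(x) = eˣ applied to X ~ N(0, 1) (this particular example is not from the lecture); it compares a probability computed from the transformed pdf with the corresponding simulated proportion.

```python
import numpy as np
from scipy.integrate import quad

# Assumed example: X ~ N(0, 1) and Y = g(X) = exp(X), which is one-to-one,
# so Sect. 3.6 gives f_Y(y) = f_X(x(y)) |dx/dy| with x(y) = log(y), |dx/dy| = 1/y.
def f_X(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def f_Y(y):
    return f_X(np.log(y)) / y

# Pr(1 <= Y <= 2) via the transformed pdf...
p_formula, _ = quad(f_Y, 1.0, 2.0)

# ...and via simulation of Y = exp(X).
rng = np.random.default_rng(0)
y = np.exp(rng.normal(size=1_000_000))
p_sim = np.mean((y >= 1.0) & (y <= 2.0))

print(round(p_formula, 4), round(p_sim, 4))   # both approximately 0.256
```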
4 Multiple Random Variables

We will focus on the case where there are two random variables X and Y. We also only consider the continuous case; the results apply equally to the discrete case (simply replace ∫ with Σ).

4.1 Joint, Conditional and Marginal Distributions

The joint density function is denoted by f(x, y), where probabilities are calculated in the usual way, i.e.,

Pr(x1 ≤ X ≤ x2 ∩ y1 ≤ Y ≤ y2) = ∫_{y1}^{y2} ∫_{x1}^{x2} f(x, y) dx dy.

The expected value of g(X, Y), a function of X and Y (e.g., X + Y or XY), is given by

E g(X, Y) = ∫∫ g(x, y) f(x, y) dx dy,

where the lack of limits of integration here is intended to mean integrate over all values of X and Y respectively.

Analogous to the multiplication rule, we have that

f(x, y) = f(x) f(y | x) = f(y) f(x | y),

and, if X and Y are independent, f(x, y) = f(x) f(y).

The conditional density functions f(y | x) and f(x | y) are clearly given by

f(x | y) = f(x, y) / f(y)   and   f(y | x) = f(x, y) / f(x),

satisfying the usual conditional probability formula.

The conditional expectation is given by

E(X | y) = ∫ x f(x | y) dx,

where the lowercase y in E(X | y) means this is the expected value of X given that Y = y.

The conditional variance is

Var(X | y) = E(X² | y) − [E(X | y)]².

In this context f(x) and f(y) are referred to as marginal density functions. It can be shown that

f(x) = ∫ f(x, y) dy = ∫ f(y) f(x | y) dy.

This procedure is known as marginalising over Y, or integrating Y out. Similarly,

f(y) = ∫ f(x, y) dx = ∫ f(x) f(y | x) dx.

Note: Compare with the law of total probability (Sect. 2) where we summed over A1, ..., An to get Pr(B).

Bayes' Rule can be applied to random variables, for example,

f(x | y) = f(x) f(y | x) / f(y) = f(x) f(y | x) / ∫ f(x) f(y | x) dx.

This is easily derived by combining the above definitions of conditional, joint and marginal distributions. Compare with Bayes' Rule in Sect. 2.

Question 4.1. Joint distribution
Define a joint distribution

f(x, y) = 6xy²  for 0 < x < 1 and 0 < y < 1,  and 0 otherwise.

(a) Show that ∫_{0}^{1} ∫_{0}^{1} f(x, y) dx dy = 1.
(b) Calculate f(y) and hence f(x | y).

4.2 Law of Total Expectation

(a.k.a. the tower law and the law of iterated expectation)

EX = E(E(X | Y)).

Proof.

EX = ∫ x f(x) dx
   = ∫∫ x f(x, y) dy dx
   = ∫∫ x f(y) f(x | y) dy dx
   = ∫ f(y) [ ∫ x f(x | y) dx ] dy
   = ∫ f(y) E(X | y) dy
   = E(E(X | Y)).

The outer expectation is taken over the distribution of Y and the inner expectation is taken over the conditional distribution of X | y. Thus, for clarity, it is sometimes written

E_X(X) = E_Y( E_{X|Y}(X | Y) ).

4.3 Law of Total Variance

Var X = E[Var(X | Y)] + Var[E(X | Y)].

Proof.

Var X = E[X²] − [EX]²
      = E[ E(X² | Y) ] − [ E(E(X | Y)) ]²
      = E[ E(X² | Y) − (E(X | Y))² ] + E[ (E(X | Y))² ] − [ E(E(X | Y)) ]²
      = E[Var(X | Y)] + Var[E(X | Y)],

where the first term is E[Var(X | Y)] and the remaining two terms together form Var[E(X | Y)].

4.4 Covariance and Correlation

Covariance measures the linear relationship between two variables X and Y and is defined by

Cov(X, Y) = ∫∫ (x − EX)(y − EY) f(x, y) dx dy = E[(X − EX)(Y − EY)].

Unlike variance (which is non-negative), covariance may be any real number. If X and Y are independent then Cov(X, Y) = 0 (but the converse is not true).

It is easy to show that

Cov(X, Y) = E(XY) − (EX)(EY).

Properties of covariance:

Cov(X, Y) = Cov(Y, X)
Cov(X, X) = Var(X)
Cov(aX + b, cY + d) = ac Cov(X, Y)
Cov(X + Y, U + V) = Cov(X, U) + Cov(X, V) + Cov(Y, U) + Cov(Y, V)
Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)

Correlation, given by

Corr(X, Y) = Cov(X, Y) / √(Var X · Var Y),

is a scaled measure such that Corr(X, Y) ∈ [−1, 1].
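The laws of total expectation and total variance (Sects 4.2 and 4.3) can be illustrated by simulation. The sketch below assumes Python with NumPy and an assumed hierarchical example, Y ~ Exp(1) with X | Y ~ N(Y, 2²), which is not taken from the lecture.

```python
import numpy as np

# Assumed hierarchical model: Y ~ Exp(1), X | Y ~ N(Y, 2^2).
rng = np.random.default_rng(0)
n = 1_000_000
y = rng.exponential(1.0, n)
x = rng.normal(loc=y, scale=2.0)

# Here E(X | Y) = Y and Var(X | Y) = 4, so the two laws give
#   EX   = E[E(X | Y)] = E(Y) = 1
#   VarX = E[Var(X | Y)] + Var[E(X | Y)] = 4 + Var(Y) = 5
print(round(x.mean(), 3))   # close to 1
print(round(x.var(), 3))    # close to 5
```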
5 Central Limit Theorem

The central limit theorem (CLT) is one of the most important results in statistics.

Let X1, ..., Xn be a sample of iid variables from a distribution with mean µ = EX and variance σ² = Var X. The distribution of the sample mean of these variables (i.e., X̄ = (X1 + ··· + Xn)/n = Sn/n) when n is large is

X̄ → N(µ, σ²/n),

i.e., the sample mean is approximately normally distributed for n large. Typically n ≥ 30 is enough. However, if X is discrete, much larger samples may be needed, e.g., n ≥ 100. If the variables X1, ..., Xn are themselves normal, then the result holds for any value of n (this can be shown easily using the mgf).

We also have that

Z = (X̄ − µ) / (σ/√n) → N(0, 1),

which forms the basis of hypothesis testing and confidence interval construction.

Note: We already found that EX̄ = µ and Var X̄ = σ²/n in Question 3.3. However, the CLT tells us about the distribution of X̄.

Clearly the distribution of the sum X1 + ··· + Xn (for large n) is:

X1 + ··· + Xn = nX̄ → N(nµ, nσ²).

The central limit theorem provides an explanation for the ubiquity of the normal distribution in practice. Various quantities can be viewed as the average result of many other variables, which leads to normality, e.g., the weight or height of an animal is the net effect of many biological and environmental variables.

Example 5.1. Bernoulli Trials
The Bernoulli distribution is used to model situations with two outcomes (e.g., head/tail, yes/no, win/lose). The mean and variance for the Bernoulli distribution are:

EX = p,   Var X = p(1 − p).

Thus, the sample mean of n Bernoulli trials has a normal distribution (when n is large):

X̄ → N( p, p(1 − p)/n ),

and, furthermore,

Z = (X̄ − p) / √( p(1 − p)/n ) → N(0, 1).

6 Other Useful Results

6.1 The Delta Method

The delta method provides an extension of the CLT:

g(X̄) → N( g(µ), [g′(µ)]² σ²/n ),

for some differentiable function g(x).

Proof.
The Taylor expansion of g(X̄) around the point µ is:

g(X̄) ≈ g(µ) + g′(µ)(X̄ − µ).

Rearranging terms gives

g(X̄) = g′(µ) X̄ + [ g(µ) − g′(µ)µ ] = a X̄ + b,

i.e., a linear transformation of a normal variable. Hence

g(X̄) → N( g(µ), [g′(µ)]² σ²/n )

(recall the result of Question 3.4) since

aµ + b = g′(µ)µ + g(µ) − g′(µ)µ = g(µ).

Note: Taylor's Theorem: g(x) = g(x0) + g′(x0)(x − x0) + ½ g″(x0)(x − x0)² + ···

Question 6.1. Delta method
For large n, what is the distribution of 1/X̄?

6.2 Markov's Inequality

Pr( g(X) ≥ ε ) ≤ E g(X) / ε,

for any ε > 0, provided g(x) is a non-negative function.

Proof.

E g(X) = ∫_{−∞}^{∞} g(x) f(x) dx
       = ∫_{x : g(x) ≥ ε} g(x) f(x) dx + ∫_{x : g(x) < ε} g(x) f(x) dx
       ≥ ∫_{x : g(x) ≥ ε} g(x) f(x) dx
       ≥ ε ∫_{x : g(x) ≥ ε} f(x) dx
       = ε Pr( g(X) ≥ ε ),

and dividing both sides by ε gives the result.
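The CLT (Sect. 5) and the delta method (Sect. 6.1) can be illustrated by simulation. The sketch below assumes Python with NumPy; the exponential distribution with λ = 2, the sample size n = 50 and the function g(x) = eˣ are illustrative choices, not taken from the lecture.

```python
import numpy as np

# Simulate many sample means of a skewed (exponential) distribution and
# standardise them as Z = (Xbar - mu) / (sigma / sqrt(n)), as in Sect. 5.
rng = np.random.default_rng(0)
lam, n, reps = 2.0, 50, 100_000
mu, sigma = 1 / lam, 1 / lam                # Exp(lambda): mean 1/lambda, sd 1/lambda

xbar = rng.exponential(mu, size=(reps, n)).mean(axis=1)
z = (xbar - mu) / (sigma / np.sqrt(n))

# If Z is approximately N(0, 1), its mean/variance are near 0/1 and roughly
# 95% of the Z values fall inside (-1.96, 1.96).
print(round(z.mean(), 3), round(z.var(), 3))   # ~0, ~1
print(round(np.mean(np.abs(z) < 1.96), 3))     # ~0.95

# Delta method with the assumed g(x) = exp(x), so g'(mu) = exp(mu):
# Var[g(Xbar)] is approximately [g'(mu)]^2 sigma^2 / n (first-order approximation).
print(round(np.exp(xbar).var(), 5), round(np.exp(2 * mu) * sigma**2 / n, 5))
```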