Stochastic Processes Lecture Notes 2024-2025
University of Trento
2024
Sonia Mazzucchi
Summary
Lecture notes on Stochastic Processes for a 1-module course at the University of Trento for the academic year 2024-2025. The notes cover preliminaries of probability theory, focusing on probability spaces, σ-algebras, and related concepts. They also discuss various properties of probability measures and random variables.
Lecture Notes on Stochastic Processes (1mod). AY 2024-25 November 26, 2024 Preface The present notes are a concise account of the course Stochastic processes (1 mod) for the mas- ter degree in mathematics of the University of Trento. For further details as well as additional exercises and examples we refer to the references quoted in the bibliography. I am grateful to all students who improved these notes by means of their remarks and suggestions. In particular, I am grateful to Leonardo Errati for his accurate work of proofreading and to Antonio Lorenzin for some figures. Sonia Mazzucchi 1 Chapter 1 Preliminaries The aim of the present chapter is to introduce some basic definitions and results of probability theory which will be applied in the subsequent chapters. Probability spaces Definition 1. A measurable space is a couple (Ω, F) where Ω is a set and F is a σ- algebra of subsets of Ω, i.e. a family of sets satisfying the following conditions i. Ω ∈ F ii. if A ∈ F then Ac ∈ F iii. if {An } ⊂ F is a sequence of subsets in F, then ∪n An ∈ F. Remark 1. In definition 1 the triple of conditions i,ii, and iii, can be replaced by the triple i’, ii and iii, where i’. F 6= ∅ Moreover, by De Morgan laws, it is possible to prove that a σ-algebra is closed under countable intersections. Definition 2. Let (Ω, F) be a measurable space. A probability measure P on F is a map P : F → [0, +∞) satisfying the following conditions: Normalization: P(Ω) = 1 σ-additivity: for any countable family of sets {En } ⊂ F such that En ∩ Em = ∅ whenever n 6= m, the following holds X P(∪n En ) = P(En ) (1.1) Definition 3. A probability space is a triple (Ω, F, P), where (Ω, F) is a measurable space and P : F → [0, +∞) is a probability measure on F. In the following Ω will be called sample space while the sets E ∈ F will be called events. 2 Construction of σ-algebras Trivial examples of σ-algebras on a set Ω are F = P(Ω) (the power set of Ω) or the trivial σ-algebra F = {Ω, ∅}. Non trivial examples of σ-algebras can be constructed by starting by a generic collection of subsets C ⊂ P(Ω) and by considering the smallest σ-algebra containing them, according to the following definition. Definition 4. Let C ⊂ P(Ω) be a collection of subsets. The σ-algebra generated by C is defined as the intersection of all σ-algebras containing C and it is denoted with the symbol σ(C). In fact σ(C) is a σ-algebra and has the following properties C ⊂ σ(C) if C ⊂ F and F is a σ-algebra, then σ(C) ⊂ F if C is a σ-algebra then C = σ(C) if C1 ⊂ C2 then σ(C1 ) ⊂ σ(C2 ) if C1 ⊂ C2 ⊂ σ(C1 ) then σ(C1 ) = σ(C2 ) The proof is left as an exercise. Example 1. if C = ∅ and Ω 6= ∅, then σ(C) = {∅, Ω} Example 2. If C = {E}, with E ⊂ Ω and E 6= Ω, ∅, then σ(C) = {E, E c , Ω, ∅}. Example 3. σ-algebra generated by a partition: Let (Ω, F) be a measurable space. A countable collection of measurable sets C = {Ej }j∈J ⊂ F, with J a countable set (in most cases J = {1,... , N } or J = N) is called a partition of Ω if: 1. the sets of the collection are pairwise disjoint: En ∩ Em = ∅ forall n 6= m; 2. their union yields the sample space itself Ω = ∪j∈J Ej In order to avoid trivialities, we will assume that all the sets {Ej } are not empty. It is easy to verify that the σ-algebra generated by the partition C is the collection of sets σ(C) = {∪j∈K Ej , K ⊂ J} Indeed, it is easy to verify that: the collection of the sets of the form ∪j∈K Ej , for some K ⊂ J is a σ-algebra (prove it as an exercise!) 
Any σ-algebra F 0 that contains the partition C = {Ej }j∈J must contain all the sets of the form ∪j∈K Ej , for some K ⊂ J (since F 0 is closed under countable union). Example 4. Let Ω be a topological space and let τ be the collection of open sets. The σ-algebra generated by τ is called the Borel σ-algebra of Ω. Example 5. If Ω = R and τ is the Euclidean topology then the Borel σ-algebra is denoted B(R). In fact, it is possible to prove that B(R) can be also obtained as the σ-algebra generated by the following collections of sets: 3 C = {(−∞, t], t ∈ R} C = {(a, b), a, b ∈ R} C = {(a, b], a, b ∈ R} Definition 5. Let (X, FX ) and (Y, FY ) be two measurable spaces. The product σ-algebra FX ×FY is the σ-algebra on the set X ×Y generated by the collection C of product sets, namely C = {I ×J, I ∈ FX , J ∈ FY }. In particular, if (X, FX ) = (Y, FY ) = (R, B(R) one gets FX × FY = B(R2 ). Properties of probability measures If P is a probability measure on a σ-algebra F, it is rather simple to prove the following properties: P(∅) = 0; if A, B ∈ F and A ⊂ B then P(A) ≤ P(B), moreover P(A \ B) = P(A) − P(B), A, B ∈ F, P(A ∪ B) = P(A) + P(B) − P(A ∩ B). In fact the properties above follow easily from the finite additivity property of P 1 : N X P(∪N n=1 En ) = P(En ) ∀N ∈ N, {E1 ,... , EN } ⊂ F, such that En ∩ Em = ∅ if n 6= m. (1.2) n=1 The σ-additivity (1.1) is a stronger property and yields the following continuity result. Theorem 1. Let P : F → [0, 1] be a σ-additive probability measure on a σ-algebra F. Then it has the following properties: i. Continuity from below: for any sequence {En } ⊂ F such that En ⊂ En+1 , the following holds P(∪n En ) = lim P(En ) n→∞ ii. Continuity from above: for any sequence {En } ⊂ A such that En+1 ⊂ En , the following holds P(∩n En ) = lim P(En ) n→∞ Proof: i. Given a sequence {En } ⊂ F of sets in F such that En ⊂ En+1 , let us construct the sequence {Ẽn } ⊂ F of disjoint sets of F, defined as: Ẽ1 := E1 , Ẽn := En \ En−1 1 Actually property (1.2) is equivalent to the following, apparently weaker, property ∀A, B ∈ F , A ∩ B = ∅ P(A ∪ B) = P(A) + P(B), as one can easily prove by an inductive argument. 4 It is easy to see that Ẽn ∈ F ∀n ∈ N, Ẽn ∩ Ẽm = ∅ and ∪n En = ∪n Ẽn. Hence: N X P(∪n En ) = P(∪n Ẽn ) = lim P(Ẽn ) = lim P(EN ) N →+∞ N →+∞ n=1 where the last equality follows from P(Ẽn ) = P(En ) − P(En−1 ). ii. Let {En } ⊂ F be a sequence of sets in F such that En+1 ⊂ En. Let us construct a sequence {Ẽn } ⊂ F of sets in F such that Ẽn ⊂ Ẽn+1. Indeed, if we define Ẽn = Enc we have that Ẽn ∈ F ∀n ∈ N. Moreover ∪n Enc = (∩n En )c. By property i we have: P(∩n En ) = P((∪n Ẽn )c ) = 1 − P(∪n Ẽn ) = 1 − lim P(Ẽn ) = lim P(En ) n→∞ n→∞ Random variables Definition 6. Let (Ω, F) and (Ω0 , F 0 ) be two measurable spaces. A map T : Ω → Ω0 is said measurable if for any set A ∈ F 0 the set T −1 (A) := {ω ∈ Ω : T (ω) ∈ A} belongs to F. Definition 7. Let (Ω, F, P) be a probability space. A real random variable X on (Ω, F, P) is measurable map from (Ω, F) into (R, B(R)). It is worthwhile to recall some important properties of measurable functions. Indeed, if f, g are measurable functions from (Ω, F) to (R, B(R)) and h : R2 → R is Borel measurable, then h(f, g) is measurable from (Ω, F) to (R, B(R)). In particular, f + g, f − g and f g are measurable functions. This shows that linear combinations and products of random variables are random variables as well. The following result is a useful tool in the proof of the measurability of a map. Theorem 2. 
Let T : Ω → Ω0 be a function between two measurable spaces (Ω, F) and (Ω0 , F 0 ). Let C ⊂ P(Ω0 ) be a family of subsets of Ω0 such that σ(C) = F 0. If for any A ∈ C the set T −1 (A) belongs to F, then T is measurable. Proof: Let us consider the family C ⊂ P(Ω0 ) such that σ(C) = F 0. Moreover, let G ⊂ P(Ω0 ) be defined by: G = {E ⊂ Ω0 : T −1 (E) ∈ F}. By assumption C ⊂ G. Moreover G is a σ-algebra, indeed: Ω0 ∈ G since T −1 (Ω0 ) = Ω ∈ F. c if E ∈ G then E c ∈ G. Indeed T −1 (E c ) = T −1 (E). given a countable family {En }n ⊂ G then ∪n En ∈ G , since T −1 (∪n En ) = ∪n T −1 (En ). 5 By definition σ(C) is the intersection of all σ-algebras containing C, hence σ(C) ⊂ G. In the particular case where X is a real random variable, the previous theorem states that X : (Ω, F) → (R, B(R)) is measurable iff for any t ∈ R the set X −1 ((−∞, t]) belongs to F, since the sets of the form (−∞, t], t ∈ R generate the Borel σ-algebra B(R). Another important notion in the study of random variables is the following. Definition 8. Let (Ω, F, P) be a probability space and X : (Ω, F) → (S, FS ) a random variable. The σ-algebra FX ⊂ P(Ω) generated by X is defined as FX := {A ⊂ Ω : A = X −1 (I), I ∈ FS } It is rather simple to prove that the family of sets FX is a σ-algebra contained in F. Moreover it is the smallest σ-algebra which makes X measurable. This means that X : (Ω, F 0 ) → (S, FS ) is measurable iff FX ⊆ F 0. Example 6. Let (Ω, F, P) be a probability space and let E ∈ F be an event, with E 6= Ω, ∅. Let X : Ω → R be the real random variable defined as the indicator function of the set E: 1 ω∈E X(ω) = 1E (ω) = 0 ω∈ /E Then FX = Ω, ∅, E, E C , i.e. FX coincides with the σ-algebra generated by the set E. Indeed, given a Borel set I ∈ B(R): X −1 (I) = {w ∈ Ω | X(w) ∈ I} = {w ∈ Ω | 1E (w) ∈ I} and we see that X −1 (I) = ∅ if neither 0 nor 1 are in I; if both 0, 1 ∈ I, then X −1 (I) = Ω; If 0 ∈ I and 1 6∈ I, then X −1 (I) = E C ; If 0 6∈ I and 1 ∈ I, then X −1 (I) = E. Example 7. σ-algebra generated by a discrete random variable Let (Ω, F, P) be a probability space. Let X : Ω → R be a discrete random variable, i.e. a random variable that can attain only a discrete set of (distinct) values {xj }j∈J ⊂ R, with J a countable set (in most cases J = {1,... , N } or J = N). Typical examples of discrete random variables are Bernoulli random variables (which can attain only two values {xj }j∈J = {0, 1}) and Poisson random variables (which can attain only positive integer values {xj }j∈J = {j}j∈N ). For any possible attainable value xj , j ∈ J, let us define the measurable set Ej := X −1 ({xj }) = {ω ∈ Ω : X(ω) = xj } It is easy to verify that the collection C = {Ej }j∈J forms a partition of Ω. Moreover the σ-algebra FX generated by X coincides with the σ-algebra generated by the partition C. Indeed, for any Borel set I ∈ B(R) the set X −1 (I) = {ω ∈ Ω : X(ω) ∈ I} is a countable disjoint union of suitable sets of the partition. More precisely: X −1 (I) = ∪xj ∈I {ω ∈ Ω : X(ω) = xj } = ∪xj ∈I Ej 6 Example 8. Random variables measurable with respect to the σ-algebra generated by a partition. Let (Ω, F, P) be a probability space and let C = {Ej }j∈J ⊂ F be a partition of Ω. A random variable X : Ω → R is measurable with respect to the σ-algebra σ(C) generated by the partition (i.e. for any Borel set I ∈ B(R) the set X −1 (I) ∈ σ(C)) iff it is constant on the sets of the partition. 
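To make Examples 3, 7 and 8 concrete, here is a small Python sketch (a toy finite sample space with made-up values, not taken from the notes): it builds the level sets E_j = X^{-1}({x_j}) of a discrete random variable, generates σ(C) as the family of all unions of blocks, and checks that every pullback X^{-1}(I) belongs to it.

from itertools import chain, combinations

# Toy sample space and a discrete random variable on it (hypothetical values).
omega = frozenset(range(8))
X = {0: -1.0, 1: -1.0, 2: 0.0, 3: 2.5, 4: 2.5, 5: 2.5, 6: 0.0, 7: -1.0}

# Level sets E_j = X^{-1}({x_j}) form a partition of omega (Example 7).
blocks = [frozenset(w for w in omega if X[w] == x) for x in sorted(set(X.values()))]

# sigma(C) = { union of the E_j over j in K : K subset of J } (Example 3).
index_subsets = chain.from_iterable(combinations(range(len(blocks)), r) for r in range(len(blocks) + 1))
sigma_C = {frozenset().union(*(blocks[j] for j in K)) for K in index_subsets}

assert omega in sigma_C and frozenset() in sigma_C
assert all(omega - A in sigma_C for A in sigma_C)               # closed under complement
assert all(A | B in sigma_C for A in sigma_C for B in sigma_C)  # closed under finite unions

# F_X: the pullback X^{-1}(I) of any Borel set I is a union of blocks, hence lies in sigma(C).
def preimage(I):
    return frozenset(w for w in omega if X[w] in I)

assert preimage({-1.0, 2.5}) in sigma_C
assert preimage({0.0}) in sigma_C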
More precisely, a random variable X : Ω → R is σ(C)-measurable if and only if there exist real constants {cj }j∈J such that cj 1Ej (w) X X(ω) = (1.3) j∈J Indeed, if X is of the form (1.3) than it is σ(C)-measurable since it is a linear combination of σ(C)-measurable maps (the indicator functions of sets Ej , j ∈ J). Conversely, if X is not of the form (1.3) then there is a set Ej ∗ of the partition and two points ω1 , ω2 ∈ Ej ∗ such that X(ω1 ) = x1 6= X(ω2 ) = x2. In this case it is easy to verify that X cannot be σ(C)-measurable. Indeed, it is sufficient to consider the Borel set I = {x1 } ⊂ R and the corresponding set X −1 (I) ⊂ F. As discussed in example 3 the sets belonging to σ(C) have the form ∪j∈K Ej for some K ⊂ J. If X −1 ({x1 }) = ∪j∈K Ej then there are only two possibilities: the index j ∗ belongs to the set K ∈ J. This means Ej ∗ ⊂ ∪j∈K Ej hence X(ω) = x1 ∀ω ∈ Ej ∗. This is not possible since ω2 ∈ Ej ∗ and X(ω2 ) 6= x1. the index j ∗ doesn’t belong to the set K ∈ J. This means Ej ∗ ⊂ (∪j∈K Ej )c hence X(ω) 6= x1 ∀ω ∈ Ej ∗. This is not possible since ω1 ∈ Ej ∗ and X(ω1 ) = x1. We can now state an important result which describes the most general form of the functions which are measurable with respect to the σ-algebra generated by a random variable. Let us consider on a probability space (Ω, F, P) and a real random variable X on it. Let FX be the σ-algebra generated by X. Since composition of measurable maps is a measurable map , given a function g : R → R which is Borel measurable it is easy to see that the map Y : Ω → R defined by Y := g ◦ X is FX - measurable. X g (Ω, FX ) (R, B(R)) (R, B(R) Y In fact this formula describes the form of any FX −measurable random variable. Theorem 3 (Doob). Let (Ω, F, P) be a probability space, X : (Ω, F) → (R, B(R)) a random variable Y : Ω → R a FX -measurable random variable. Then there exists a Borel function g : (R, B(R)) → (R, B(R)) such that Y = g ◦ X. For a general proof see Appendix A.1. In fact the theorem can be generalized to the case where (R, B(R)) is replaced by a measurable space (S, FS ), where S is a complete separable metric space and FS is the Borel σ-algebra on E (see, e.g. ). We only remark here that in the case where X is a discrete random variable the statement of Doob’s theorem follows directly by examples 7 and 8. Indeed, if X is a discrete random variable with values {xj }j∈J then FX = {∪j∈K X −1 ({xj })}. Moreover all maps Y : Ω → R which are FX -measurable are of the form cj 1X −1 ({xj }) (w) X Y (ω) = j∈J 7 This clearly shows that the value of Y (ω) can be computed by knowing the value of X(ω) (to X(ω) = xj corresponds Y (ω) = cj ). This relation can be easily described in terms of the Borel measurable map g : R → R given by cj 1xj (u), X g(u) = u ∈ R. j∈J Dynkin lemma and first applications Definition 9. A class of sets P ⊂ P(Ω) is called π−system if it satisfies the following condition: π. if A, B ∈ P then A ∩ B ∈ P. In other words, a π−system is a class of sets closed under finite intersections. Definition 10. A class of sets L ⊂ P(Ω) is called λ−system if it satisfies the following conditions: λ1. Ω ∈ L; λ2. if A ∈ L then Ac ∈ L. λ3. if {An }n∈N ⊂ L is a countable family of sets in L such that An ∩ Am = ∅ if n 6= m, then ∪n An ∈ L. Remark 2. The set of properties λ1 , λ2 , λ3 is equivalent to the set of properties λ1 , λ02 , λ3 where: λ02. if A, B ∈ L, with B ⊂ A then A \ B ∈ L. Furthermore, the following result follows easily by the definitions above. Lemma 1. 
If C ⊂ P(Ω) is both a π-system and a λ-system, then C is a σ-algebra. The proof is left as an exercise The following theorem finds several applications in different settings. In particular it often allows to prove that if the sets in a class C ∈ P(Ω) have a particular property, then all the sets belonging to σ(C) have the same property. Lemma 2 (Dynkin). Let P be a π-system and L a λ-system. If P ⊂ L then σ(P) ⊂ L. For a proof, see appendix A.2. An interesting application is the following theorem which states that whenever two probability measures coincides on a π−system P, then they coincide on the σ-algebra σ(P). Theorem 4. Let (Ω, F) be a measurable space and µ and ν two probability measures on (Ω, F). Let C ⊂ F be a family of subsets stable with respect to finite intersection and generating F. If µ and ν coincide on C, then they coincide on F. Proof: Let L ⊂ F the family of sets defined by L := {A ∈ F : µ(A) = ν(A)}. L is a λ−system, indeed: 8 Ω ∈ L (since µ(Ω) = ν(Ω) = 1). if A ∈ L then Ac ∈ L. Indeed µ(Ac ) = 1 − µ(A) = 1 − ν(A) = ν(Ac ). if {An } ⊂ L is a sequence P of setsPin L such that An ∩ Am = ∅ for n 6= m, then ∪n An ∈ L. Indeed µ(∪n An ) = n µ(An ) = n ν(An ) = ν(∪n An ). Moreover, by assumption, C ⊂ L and it is a π-system. By Dynkin lemma σ(C) ⊂ L , hence the measures µ and ν coincide on σ(C) = F. Independence Definition 11 (Image measure). Let (Ω, F) and (Ω0 , F 0 ) be measurable spaces and let T : Ω → Ω0 be a measurable map. Let µ : F → [0, +∞] be a measure on (Ω, F). The image measure (or pushforward measure) of µ is the measure µT : F 0 → [0, +∞] defined by: µT (E) := µ(T −1 (E)), E ∈ F0 (1.4) Def. 50 is a good definition. Indeed, the set function defined by the r.h.s. of (B.6) is a σ additive measure on F 0. Further, µT is a probability measure if µ is. Moreover a measurable function f : Ω0 → R is integrable with respect to µT iff the function f ◦ T : Ω → R is measurable with respect to µ and in this case the integrals coincide, namely: Z Z f (y)dµT (y) = f (T (x))dµ(x). (1.5) Ω0 Ω Marginal distributions Given a probability measure ν on (Rn , B(Rn )), we can define its marginals νi , i = 1,... n, i.e. the probability measures on (R, B(R)) defined as the pushforward measure of µ under the projection map πi : Rn → R defined as πi (x1 ,... , xn ) := xi , (x1 ,... , xn ) ∈ Rn (which is Borel measurable since it is continuous). More specifically νi is defined on Borel sets I ∈ B(R) by: νi (I) := ν(πi−1 (I)) = ν({(x1 ,... , xn ) ∈ Rn : xi ∈ I}) = ν(R × · · · × I × · · · R) Random variables and their distributions A random variable X with state space (S, FS ), i.e. a measurable function between (Ω, F, P) and (S, FS ), induces a probability measure µX on (S, FS ), which is called the distribution of the random variable X and it is defined as the image measure of P under the map X, namely: µX (A) := P(X −1 (A)) = P(X ∈ A), A ∈ FS In the following we shall focus on the cases (S, FS ) = (R, B(R)) and (S, FS ) = (Rn , B(Rn )) 9 Definition 12. Let (Ω, F, P) be a probability space and X : (Ω, F) → (R, B(R)) a real random variable. The distribution of X is the probability measure µX on (R, B(R)) defined as µX (I) := P(X −1 (I)), I ∈ B(R). (1.6) In particular, given a Borel measurable map f : R → R, the expected value of f (X) is denoted E[f (X)] and defined as Z E[f (X)] = f (X(ω))dP(ω) Ω whenever the map f ◦ X is summable (i.e. f ◦ X ∈ L1 (ω, P)). In particular, by Eq. (1.5) the integral above can be computed according to the following formula: Z Z f (X(ω))dP(ω) = f (x)dµX (x). 
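As a sanity check of this formula in a discrete setting, the following sketch (a fair-die example of my own, not from the notes) compares the integral of f∘X over Ω with the integral of f against the image measure μ_X.

from fractions import Fraction
from collections import defaultdict

# A fair die as (Omega, F, P), with a random variable X and a Borel function f (hypothetical choices).
P = {w: Fraction(1, 6) for w in range(1, 7)}
X = lambda w: w % 3
f = lambda x: x ** 2

# Left-hand side: integrate f(X(.)) over Omega against P.
lhs = sum(f(X(w)) * P[w] for w in P)

# Right-hand side: build the image measure mu_X = P o X^{-1} and integrate f against it.
mu_X = defaultdict(Fraction)
for w, p in P.items():
    mu_X[X(w)] += p
rhs = sum(f(x) * m for x, m in mu_X.items())

assert lhs == rhs   # both equal 5/3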
Ω R Joint distribution of n random variables A random variable X : (Ω, F) → (Rn , B(Rn )) is also called vector random variable or random vector. Introducing the projections πi : Rn → R defined by πi (x1 ,..., xn ) := xi , the components Xi : Ω → R, i = 1,..., n, of X are the real random variables defined as Xi (ω) = πi ◦ X(ω). We then have X(ω) = (X1 (ω),..., Xn (ω)), i.e. to any random vector a n−tuple of real random variable is naturally associated. Conversely, given a n−tuple X1 ,..., Xn of real random variables, the vector- valued map X(ω) := (X1 (ω),..., Xn (ω)) defines a measurable map from (Ω, F) to (Rn , B(Rn )). In order to show this, it is sufficient to recall that the Borel σ-algebra B(Rn ) is generated by the family of sets E ⊂ Rn of the form E = I1 ×... × In , (1.7) with I1 ,..., In ∈ B(R), and, by theorem 2, it is sufficient to check that X−1 (E) ∈ F. Indeed: X−1 (E) = X−1 (I1 ×... × In ) = {ω ∈ Ω : X1 (ω) ∈ I1 } ∩... ∩ {ω ∈ Ω : X1 (ω) ∈ I1 } = ∩nj=1 {ω ∈ Ω : Xj (ω) ∈ Ij }, and, by the measurability of Xi for any i = 1,..., n, the set X−1 (E) is a finite intersection of sets in F. Let (Ω, F, P) be a probability space and X1 ,... , Xn real random variables defined on (Ω, F, P). Definition 13. Given n real random variables X1 ,..., Xn on a probability space (Ω, F, P), the joint distribution of the random variables X1 ,..., Xn is defined as the probability measure µX1 ,...,Xn on (Rn , B(Rn )) defined as the distribution of the vector random variable X(ω) := (X1 (ω), · · · , Xn (ω)) µX1 ,...,Xn (A) := P(X−1 (A)), A ∈ B(Rn ) (1.8) n In particular, by taking A ∈ B(R ) of the form A = B1 × · · · × Bn , with B1 ,... , Bn ∈ B(R), (1.8) becomes: µX1 ,...,Xn (B1 × · · · × Bn ) = P(X1 ∈ B1 ,... , Xn ∈ Bn ). In particular, given n real random variables X1 ,... , Xn on a probability space (Ω, F, P), one can easily verify that their distributions µXi , i = 1,... , n, can be obtained as the marginals of their joint distribution µX1 ,...,Xn : µXi (I) = µX1 ,...,Xn ({(x1 ,... , xn ) ∈ Rn : xi ∈ I}) 10 indeed: µX1 ,...,Xn ({(x1 ,... , xn ) ∈ Rn : xi ∈ I}) = µX1 ,...,Xn (R × · · · × I × · · · R) = P({ω ∈ Ω : (X1 (Ω),... , Xn (ω)) ∈ R × · · · × I × · · · R}) = P({ω ∈ Ω : X1 (Ω) ∈ R,... , Xi (ω) ∈ I,... , Xn (Ω) ∈ R}) = P({ω ∈ Ω : Xi (ω) ∈ I}) = µXi (I) Indipendence Definition 14. n sub σ-algebras F1 ,..., Fn ⊂ F in a probability space (Ω, F, P) are independent if ∀E1 ∈ F1 ,..., En ∈ Fn the following holds: P(E1 ∩... ∩ En ) = P(E1 ) · · · · · P(En ) Remark 3. n events E1 , E2 ,... , En are called independent if for all k = 2,... , n, {Ei1 ,... , Eik } ⊂ {E1 ,... , En } we have \ k Y k P Eie = P(Eie ). e=1 e=1 An equivalent way to say that n events E1 ,... , En are independent is saying that the corresponding σ-algebras Fi := σ({Ei }) are independent.. Definition 15. Let A an index set and {Fα }α∈A a collections of sub σ-algebras in a probability space (Ω, F, P). The σ-algebras {Fα }α∈A are said to be independent if for any n ∈ N, n ≥ 1, and for any choice of (non-coinciding) indexes α1 ,... , αn ∈ A the sub σ-algebras Fα1 ,... , Fαn are independent in the sense of definition 14. Definition 16. n random variables X1 ,..., Xn on a probability space (Ω, F, P) are independent if the corresponding generated σ-algebras FXi , i = 1,..n, are independent. According to definition 16, n real random variables X1 ,..., Xn are independent if for any I1 ,..., In ∈ B(R) the following holds: P({X1 ∈ I1 ,..., Xn ∈ In }) = P({X1 ∈ I1 }) · · · P({Xn ∈ In }). 
(1.9) We remark that this fact can be actually formulated in terms of properties of the joint distri- bution of the variables X1 ,..., Xn. Indeed, from the definition of σ-algebra generated by a random variable, for all j = 1,... , n we’ve got Ej ∈ FXj , Ej = Xj −1 (Ij ) for some appropriate Ij ∈ B(R). Then for all I1 ,... , IN ∈ B(R) N \ N Y P Xj−1 (Ij )) = P(Xj−1 (Ij )), (1.10) j=1 j=1 By definition, the left hand side of (1.10) is equal to µX1 ,...,Xn (I1 × · · · × In ) 11 QN while the right hand side of (1.10) is equal to j=1 µXj (Ij ). Since the product sets of the form I1 × · · · × In , with I1 ,... In ∈ B(R) generate the Borel σ-algebra of Rn and are a family of sets closed under intersection, they are actually a π−system generating B(Rn ) and the identity N Y µX1 ,...,Xn (I1 × · · · × In ) = µXj (Ij ) j=1 uniquely defines the joint distribution of n independent real random variables X1 ,..., Xn. Similarly to the case of sub σ-algebras, considered an index set A and a collection {Xα }α∈A of random variables on a probability space (Ω, F, P), we say that the random variables are independent if for any n ∈ N, n ≥ 1, and for any choice of (non-coinciding) indexes α1 ,... , αn ∈ A the srandom variables Xα1 ,... , Xαn are independent in the sense of definition 16. Another interesting application of the Dynkin theorem is the following lemma. Lemma 3. Let (Ω, F, P) be a probability space, F1 , F2 ⊂ F two sub-σ-algebras and P1 and P2 two π-systems generating F1 and F2 respectively. Then F1 , F2 are independent iff P(A1 ∩ A2 ) = P(A1 )P(A2 ) (1.11) for all A1 ∈ P1 and A2 ∈ P2. Proof: The proof is an application of Dynkin lemma (see lemma 2). Indeed, assume that (1.11) holds for all A1 ∈ P1 and A2 ∈ P2. Fixed an A2 ∈ P2 and define L1 ⊂ F1 as L1 := {A ∈ F1 : P(A ∩ A2 ) = P(A)P(A2 )} It is easy to prove that L1 is a λ-system. Since A1 ⊂ L1 , by lemma 2 we have that σ(P1 ) = F1 ⊂ L1 , namely P(A ∩ A2 ) = P(A)P(A2 ) for all A ∈ F1. Again, define L2 ⊂ F2 as L2 := {A ∈ F2 : P(A1 ∩ A) = P(A1 )P(A) ∀A1 ∈ F1 } It is simple to prove that L2 is a λ-system as well and, by the previous step, P2 ⊂ L2. By lemma 2 we have that σ(P2 ) = F2 ⊂ L2 which means P(A1 ∩ A2 ) = P(A1 )P(A2 ), ∀A1 ∈ F1 , A2 ∈ F2. 12 Chapter 2 Elements of information theory 2.1 Information and entropy Let us consider a probability space (Ω, F, P). We are going to introduce a function I : F → R∗ (R∗ denoting the extended real line) called information. For any E ∈ F the quantity I(E) will quantify how much unlikely the event E is or, equivalently, how much information we gain when we get to know that the event E has occurred. In particular, we shall construct I as a decreasing function of the probability, i.e. I(E) = g(P(E)) (2.1) for some real valued function g : Dg ⊂ R → R∗ , where Dg denotes the domain of g. Clearly, we will require [0, 1] ⊂ Dg , otherwise Eq. (2.1) is ill-defined. As we shall see below, the fundamental requirement that characterises the function I : F → R∗ or, equivalently, the mapping g : Dg → R∗ is the following relation I(E1 ∩ E2 ) = I(E1 ) + I(E2 ). (2.2) which is required to hold whenever E1 , E2 ∈ F are independent. In terms of the function g this identity assumes the following form g(xy) = g(x) + g(y). (2.3) We shall also require g to be continuous and (by convention) non negative on [0, 1]. In particular, condition 2.3 gives 1 n g(xn ) = ng(x), g(x1/m ) = g(x), g(xn/m ) = g(x), m m whenever x, xn , x1/m ∈ Dg , n, m ∈ N. 
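Anticipating the choice g(x) = −log2(x) singled out just below, the following minimal check (with arbitrarily chosen probabilities) verifies numerically the functional equation g(xy) = g(x) + g(y) and the resulting additivity (2.2) of information over independent events.

import math

g = lambda x: -math.log2(x)   # the solution selected below (K = -1, alpha = 2)

# Functional equation (2.3): g(xy) = g(x) + g(y) for x, y in (0, 1].
x, y = 0.3, 0.8
assert abs(g(x * y) - (g(x) + g(y))) < 1e-12

# Additivity of information (2.2) on independent events (hypothetical probabilities):
# if P(E1 ∩ E2) = P(E1) P(E2), then I(E1 ∩ E2) = I(E1) + I(E2).
P_E1, P_E2 = 0.5, 0.25
P_both = P_E1 * P_E2                 # independence
assert abs(g(P_both) - (g(P_E1) + g(P_E2))) < 1e-12
print(g(P_E1), g(P_E2), g(P_both))   # 1.0, 2.0, 3.0 bits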
In the following, we shall look for a mapping g such that its domain includes the half- real line R+. In this regard, it is important to remark that we require that the interval R+ is included in Dg and if x ∈ R+ then xn and x1/m belong to R+ for any choice of n, m ∈ N. Hence, if R+ ∈ Dg then for any q ∈ Q+ , with q = n/m we have g(xq ) = qg(x) (2.4) 13 whenever x ∈ R+. Let us fix a constant α ∈ R+ and represent any x ∈ R+ as x := αlogα x. In particular, by (2.4) we have that the map y 7→ g(αy ) − yg(α) is indentically equal to 0 on Q+ and, by the assumption of continuity of g, it will be identically 0 on R+. In particular we have g(x) = g(αlogα x ) = g(α) logα x = K logα x The usual convention is to choose K = −1 and α = 2, giving I(E) := − log2 P(E). In the following we will always adopt the convention log ≡ log2. The quantity I(E) = − log P(E) is called (binary) information of the event E. In particular if E = Ω than I(Ω) = 0 while I(∅) = +∞. Given a partition P = {Ej }j∈J ⊂ F of Ω, i.e. ∪j∈J Ej = Ω and Ej ∩ Ek = ∅ for j 6= k, let us define the Shannon entropy of the partition, or simply the entropy of the partition, as the average information content of the sets of the partition: X X H(P) = P(Ej )I(Ej ) = − P(Ej ) log(P(Ej )) j j with the convention P(Ej ) log(P(Ej )) ≡ 0 if P(Ej ) = 0. It is interesting to point out that, taking a finer partition, the entropy increases. Indeed, if one of the sets EP∈ P is decomposed into the disjoint union of smaller sets, i.e. E = ∪k Ẽk , with pk := P(Ek ) and k pk = P(E) := p, then it is easy to check by using the monotonicity of the map u 7→ log u that X X X (pk log pk ) ≤ (pk ) log( pk ). k k k Let us consider now a discrete random variable X on a probability space (Ω, F, P) attaining a finite or infinite countable number of values {xj }j∈J. The variable X naturally induces a partition PX = {Ej }j∈J ⊂ F of Ω, where Ej := X −1 ({xj }) = {ω ∈ Ω : X(ω) = xj }. Let us introduce the notation p(j) = P(Ej ), j ∈ J. We define the (Shannon) entropy of the random variable X as the entropy of the corresponding partition: X X H(X) = − P(Ej ) log(P(Ej )) = − p(j) log(p(j)), j j (with the usual convention p(j) log(p(j)) ≡ 0 if p(j) = 0. The number H(X) quantifies the amount of randomness present in the random variable X. For example, if X is a Bernoulli random variable P(X = 1) = p, P(X = 0) = 1 − p the H(p) = −p log p − (1 − p) log(1 − p). The maximum of this function is attained at p = 1/2, while H(X) = 0 if p = 0, 1. Let X be a discrete random variable and f : Df → R a function whose domain contains the set {xj }j∈J of possible values of X. Let Y be the random variable obtained as Y := f ◦ X. Then the following holds H(Y ) ≤ H(X). Indeed it is easy to verify that the partition PX induced by X is finer than the partition PY induced by Y and they coincide if and only if the map f is injective on the set {xj }j∈J. Remark 4. It is interesting to remark that in information theory different definitions of entropy have been proposed. In particular, for any α ≥ 0, α 6= 1, it is possible to define the Renyi entropy 14 1.0 0.8 0.6 0.4 0.2 0.2 0.4 0.6 0.8 1.0 Figure 2.1: Entropy of a Bernoulli random variable as a function of the parameter p of order α of a discrete random variable X with probability density {pj }j∈J as 1 X Hα (X) := log p(j)α . 
1−α j∈J Particularly interesting for our purposes are the two limiting cases: For α → 1 the Renyi entropy converges to the Shannon entropy: X lim Hα (X) = H(X) = − p(j) log(p(j)) α→1 j For α → ∞ the Renyi entropy converges to the so-called min-entropy Hmin (X): lim Hα (X) = Hmin (X) = min{− log(p(j))} = − log max p(j) α→∞ j j which gives the minimum information content of the random variable X. It is related to the guessing probability pg := maxj p(j), which gives the probability of the best guess of the value that X attains. This parameter plays a particular role in the context of randomness extractors. Properties of Shannon entropy If X is a discrete random variable attaining a finite number of values {x1 ,..., xn } with probabilities p(j), j = 1,...n, then its entropy H(X) has the following properties: H(X) ≥ 0 and H(X) = 0 iff p(j) = 1 for some j = 1,...n and p(k) = 0 for k 6= j. Indeed it is easy to check that, by definition, H(X) is given by the sum of nonnegative P terms and H(X) = 0 iff all the terms p(j) log p(j) = 0. By the normalization condition j p(j) = 1 we have that p(j) log p(j) = 0 ∀j iff there exists a j such that p(j) = 1 and p(k) = 0 for k 6= j. 15 H(X) ≤ log n and H(X) = log n iff p(j) = 1/n for all j = 1,..., n. This property follows from the Jensen inequality. If f : I ⊂ R → R is a strict convex function (I ⊂ R being an interval of the real line) and Y is a random variable we have f (E[Y ]) ≤ E[f (Y )] and the equality is attained iff Y is concentrated on a point, i.e. P(Y = y) = 1 for some y ∈ I. conversely, if f is strictly concave we have f (E[Y ]) ≥ E[f (Y )] and the equality is attained iff Y is concentrated on a point, i.e. P(Y = y) = 1 for some y ∈ I. We can now apply this inequality to the case where f (x) = log(x) (which is strictly concave function) and Y is the random variable attaining the values 1/p(j) with probability p(j). We obtain X 1 f (E[Y ]) = log p(j) = log(n) j p(j) X X E[f (Y )] = p(j) log(1/p(j)) = − p(j) log(p(j)) = H(X) j j Hence H(X) ≤ log n and the equality is attained iff the random variable Y attains a unique value, i.e. if 1/p(j) is a constant not depending on j which yields p(j) = 1/n for all j. Interpretation of Shannon entropy As remarked above, the Shannon entropy of a random variable X gives a measure of the information content, or equivalently of the randomness, in the random variable X. Determining the value of a random variable X with n possible values is the same as determining the outcome of an experiment with n possible outcomes. Further, the same can be rephrased in terms of determining which set of the partition {Ej }j induced by X a generic element ω ∈ Ω belongs to. We are going to show now that the value H(X) provides a lower bound to the expected number of yes-no questions necessary to determine the outcome of a random experiment with n possible outcomes, or equivalently the value of the random variable X. Suppose that we want to devise a scheme with only yes-no questions aimed to determine which set {Ej }J=1,...,n of the partition P induced by X a general element ω ∈ Ω belongs to. The first step consists in creating a partition P1 = {F1 , F2 } made of two sets, each obtained as union of sets {Ej }j=1,...,n of the partition P (or, equivalently, belonging to the σ-algebra FP generarted by P), and ask the question ”is ω in F1 ?”. Clearly, if the answer is ”yes” then ω ∈ F1 , if it is ”no” then ω ∈ F2. 
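The next sketch (distributions chosen arbitrarily) checks numerically the bounds 0 ≤ H(X) ≤ log n, with the maximum attained by the uniform distribution, together with the two limits of the Rényi entropy quoted above (α → 1 recovers the Shannon entropy, α → ∞ the min-entropy).

import math

def shannon(p):
    return -sum(q * math.log2(q) for q in p if q > 0)   # convention 0 log 0 = 0

def renyi(p, alpha):
    return math.log2(sum(q ** alpha for q in p if q > 0)) / (1 - alpha)

# 0 <= H(X) <= log2(n), with equality on the right iff the distribution is uniform.
n = 4
uniform = [1.0 / n] * n
skewed = [0.5, 0.25, 0.125, 0.125]
degenerate = [1.0, 0.0, 0.0, 0.0]
assert abs(shannon(uniform) - math.log2(n)) < 1e-12
assert 0.0 < shannon(skewed) < math.log2(n)
assert shannon(degenerate) == 0.0

# Renyi limits: alpha -> 1 recovers Shannon, alpha -> infinity the min-entropy.
print(renyi(skewed, 1 + 1e-6), shannon(skewed))        # ~1.75 vs 1.75
print(renyi(skewed, 500.0), -math.log2(max(skewed)))   # ~1.0  vs 1.0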
If we want to minimize the total number of questions necessary to determine the final outcome, the best strategy is to choose F1 and F2 is such a way that their probability is as close as possible to 1/2. In this way the answer removes in the average as much randomness as possible. If the set Fi , i = 1, 2, identified through the first question coincides with a particular one of the original partition P = {Ej }j=1,...,n we can stop since the outcome of the experiment (=the value of X) is uniquely determined. Otherwise we further partition Fi into two subsets {Fi1 , Fi2 } ⊂ FP and ask the question ”is ω in Fi1 ? ”. As before, if the answer is ”yes” then ω ∈ Fi1 , if it is ”no” then ω ∈ Fi2 and the best strategy which minimizes the average number of questions necessary 16 to determine the value of X is to choose the sets Fi1 , Fi2 is such a way that the two conditional probabilities P(Fi1 |Fi ), P(Fi2 |Fi ) are as close as possible to 1/2. We point out that this second step leads to a new partition P2 of Ω, which is finer that P1 Proceeding in this way, at the k-th step we obtain a new partition Pk ⊂ FP of Ω that is finer than the previous one Pk−1 ⊂ FP. The scheme ends when the new partition coincides with the original one P, which means that the value of X is uniquely determined. This procedure can be represented graphically by means of a ”decision tree”, i.e. by a graph where each node corresponds to a question, whose two possible answers are associated with two edges. Let us consider the simple example of a discrete random variable X with 4 possible outcomes {x1 , x2 , x3 , x4 } and the associated partition P = {E1 , E2 , E3 , E4 }. Let us consider a first partition P1 = {F1 , F2 } with F1 = E1 ∪ E2 ∪ E3 , F2 = E4 The partition associated to the second step will be made of three sets, namely P2 = {F11 , F12 , F2 }, with F11 = E1 , F12 = E2 ∪ E3 At the third, and in this case final, step we have P3 = {F11 , F121 , F122 , F2 }, where F121 = E2 , F122 = E3 This scheme is associated to the following decision tree: ω ∈ E4 , X = x4 no is ω ω ∈ E3 , X = x3 no in F1 ? is ω in yes no F121 ? is ω in yes ω ∈ E2 , X = x2 F11 ? yes ω ∈ E 1 , X = x1 For a given scheme, let l denote the maximum number of questions necessary to determine the value of X. For any particular outcome xi , let Ni denote the number of questions necessary to determine it. Let us denote with E[N ] the average number of questions, i.e. X E[N ] = p(i)Ni. i In the example above we have E[N ] = 2p(1) + 3p(2) + 3p(3) + p(4). By construction, each decision scheme devised in this way satisfies the following identity n X 1 N =1 (2.5) i=1 2 i 17 Indeed, if we denote with ak the number of outcomes that can be determined through k ques- tions, we have: n l X 1 X ak Ni = i=1 2 2k k=1 On the other hand, in a decision scheme with at most l questions, the final partition Pl ≡ P can contain at most 2l sets. Generally we have n < 2l since some outcomes need less than l questions to be determined. In particular each result requiring k questions eliminates from the ”maximal final partition” exactly 2l−k sets, hence: l−1 X al = 2l − ak 2l−k k=1 Pl The final equality easily yields k=1 a2kk = 1. We can now prove that in any decision scheme, H(X) provides a lower bound for E[N ]. The proof relies on the following lemma. Lemma 4 (Gibbs inequality). Let p(1),... , p(n) and q(1),... , q(n) be probability distributions on a finite set with n points. 
Then the following holds X X − p(i) log p(i) ≤ − p(i) log q(i) (2.6) i i and the equality holds iff p(i) = q(i) for all i = 1,... , n. For the proof see Appendix A.3. Theorem 5. X E[N ] ≥ − p(i) log p(i) i Proof. Let us consider Gibbs inequality with q(i) = 2N1 i. This is a well defined probability distri- bution since Eq. (2.5) holds. We have: X X X H(X) = − p(i) log p(i) ≤ − p(i) log q(i) = p(i)Ni = E[N ] i i i Remark 5. By lemma (4) we can see that E[N ] = H(X) in those decision schemes where pi = 2−Ni. 2.2 Joint entropy, conditional entropy and mutual informa- tion Let X and Y be two discrete random variables on a probability space (Ω, F, P) attaining a finite number of values {x1 ,..., xn } and {y1 ,..., ym } respectively. Let us denote with p(i, j) the joint probabilities p(i, j) = P(X = xi , Y = yj ) and with p(i) = P(X = xi ) and p(j) = P(Y = yj ) the marginals. 18 The joint entropy H(X, Y ) of X and Y is the entropy of their joint distribution X H(X, Y ) := − p(i, j) log p(i, j) i,j Example 9. 1. If X and Y are i.i.d. Bernoulli random variables with parameter p = 1/2 then H(X, Y ) = log 4 = 2. 2. If X is a Bernoulli random variables with parameter p = 1/2 and Y = X then H(X, Y ) = log 2 = 1. In both cases the entropy of the random variables X and Y is equal to 1, i.e. H(X) = H(Y ) = 1, but in case 2. the correlation between X and Y led to a reduction of the joint entropy. Let us fix a particular value xi of the random variable X and denote with pi (j), j = 1,...m, the conditional distribution of Y with respect to the event {X = xi }: pi (j) = P(Y = yj |X = xi ) The conditional entropy of Y given X = xi is denoted H(Y |X = xi ) ≡ Hi (Y ) and defined as the entropy of the conditional distribution pi (j): X Hi (Y ) := − pi (j) log pi (j) j (with the usual convention that pi (j) log pi (j) ≡ 0 whenever pi (j) = 0). Moreover, if p(i) = 0 hence pi (j) is undefined then we set Hi (Y ) ≡ H(Y ). This quantity gives a measure of the uncertainty still present in Y when we know that the event {X = xi } has occurred. In the two cases described in example 9 we have 1. H(Y |X = 1) = 1 2. H(Y |X = 1) = 0 Analogously, taking the average of Hi (Y ) over all possible values of X we obtain the conditional entropy of Y given X, denoted H(Y |X) or HX (Y ) and defined as X HX (Y ) := p(i)Hi (Y ) i In is easy to prove that it can be computed in terms of the following equality X HX (Y ) = − p(i, j) log pi (j) i,j In the two cases described in example 9 we have 1. HX (Y ) = 1 2. HX (Y ) = 0 19 Moreover the following identity holds H(X, Y ) = H(X) + HX (Y ) (2.7) which has an interesting intuitive interpretation. The left hand side H(X, Y ) gives the information content of the two random variables, while the right hand side is the sum of H(X), i.e. the information content of X, and HX (Y ), i.e. the information content of Y not contained in X. Remark 6. If X and Y are independent then, since pi (j) = p(j) for all i, j, we have: HX (Y ) = H(Y ) and H(X, Y ) = H(X) + H(Y ) The mutual information I(X, Y ) of X and Y is defined as I(X, Y ) := H(Y ) − HX (Y ) (2.8) It gives the information content of Y which is contained in X, since the right hand side of (2.8) is the difference between H(Y ), the total information content of Y , and HX (Y ), i.e. the information content of Y not contained in X. 
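A short numerical companion (joint tables written down by hand) for the definitions just given: it computes H(X,Y), H_X(Y) and I(X,Y) = H(Y) − H_X(Y) from a joint table, checks the identity (2.7) H(X,Y) = H(X) + H_X(Y), and reproduces the two cases of Example 9.

import math
from collections import defaultdict

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropies(joint):
    """joint: {(x, y): p(x, y)}. Returns H(X), H(Y), H_X(Y), H(X, Y)."""
    pX, pY = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        pX[x] += p
        pY[y] += p
    H_Y_given_X = -sum(p * math.log2(p / pX[x]) for (x, y), p in joint.items() if p > 0)
    return H(pX.values()), H(pY.values()), H_Y_given_X, H(joint.values())

# Example 9, case 1: X, Y i.i.d. Bernoulli(1/2)  ->  H(X,Y) = 2, H_X(Y) = 1, I(X,Y) = 0.
case1 = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
# Example 9, case 2: Y = X with X Bernoulli(1/2) ->  H(X,Y) = 1, H_X(Y) = 0, I(X,Y) = 1.
case2 = {(0, 0): 0.5, (1, 1): 0.5}

for joint in (case1, case2):
    H_X, H_Y, H_Yx, H_XY = entropies(joint)
    assert abs(H_XY - (H_X + H_Yx)) < 1e-12   # identity (2.7)
    print(H_XY, H_Yx, H_Y - H_Yx)             # joint entropy, conditional entropy, mutual information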
It is easy to prove the following identity X p(i, j) I(X, Y ) = p(i, j) log (2.9) i,j p(i)p(j) from which it is easy to verify the invariance of I(X, Y ) under the exchange of X, Y : I(X, Y ) = I(Y, X) hence, in particular H(Y ) − HX (Y ) = H(X) − HY (X). Remark 7. The mutual information of two random variables has an interesting interpretation in terms of the Kullbach-Leibner divergence of the joint distribution (i, j) 7→ pX,Y (i, j) and the product distribution of the marginals (i, j) 7→ pX (i)pY (j). In general, given two probability distributions p and q on a discrete set {k}, the Kullbach-Leibner divergence of p and q is denoted D(p|q) and defined as: X p(k) D(p|q) := p(k) log. (2.10) q(k) k Even if it is not a distance (since it is not symmetric and triangular inequality doesn’t hold) it enjoys the following properties: D(p|q) ≥ 0, D(p|q) = 0 iff p = q. (2.11) By formula (2.9) it is easy to check that I(X, Y ) is equal to the KL-divergence of the two prob- ability distributions (on the set of ordered pairs (i, j) ≡ (xi , yj )) p(i, j) ≡ pXY (i, j) and q(i, j) ≡ pX (i)pY (j). From the properties of the KL-divergence we can then prove that I(X, Y ) is always non-negative and it can vanish iff pXY (i, j) = pX (i)pY (j) for all (i, j), i.e. iff X and Y are independent. 20 Example 10 ( X and Y not correlated but I(X, Y ) 6= 0). Let X be the discrete random variable attaining the 4 values +1, −1, +2, −2 with equal probability 1/4. Let Y = X 2. It is easy to prove that X and Y are not correlated, since Cov(X, Y ) = E[XY ] − E[X]E[Y ] = 0 but the mutual information is positive, indeed: I(X, Y ) = 1 2.3 Entropy and mutual information for continuous random variables The definitions of Shannon entropy, conditional entropy and mutual information given so far apply only to discrete random variables. In this section we are going to investigate whether these concepts admit a generalization to the case of continuous random variables. Let us start by considering a generic random variable X defined on a probability space (Ω, F, P). Let P = {Ij }j∈J be a generic partition of the real line made of intervals Ij = (aj , bj ] and let us denote with [X]P be the discretized version of X associated to the partition P, i.e. the discrete random variable with values in the index set J and discrete probability density {pj }j∈J defined as pj = P([X]P = j) := P(X ∈ Ij ). P The Shannon entropy of [X]P will be given by H([X]P ) = − j pj log pj. We point out that by refining the partition, the value of H increases. Indeed, if P 0 = {Ik0 }k∈J 0 is a refinement of P (i.e P 0 is a partition of R such that any set of P 0 is a subset of P), then [X]P will be a deterministic function of [X]P 0 and the corresponding entropies will satisfy the following bound1 H([X]P ) ≤ H([X]P 0 ) (2.12) We can now define the entropy of a generic random variable as H(X) := sup H([X]P )) (2.13) where the supremum is taken over all possible partitions of R. By (2.12) this correspond to compute the limiting value of H([X]P ) when the partitions P become increasingly finer. It is interesting to investigate the value of the right hand side of (2.13) in two particular cases. 1. If X is a discrete random variable with values {xi }, then the supremum on the right side of (2.13) gives exactly the entropy of X. To see this, just note that when the partition P = {Ij }j∈J is sufficiently fine so that each set Ij contains at most a point xi , then no further refinement can increase the value of H([X]P. 2. 
Let X be an absolutely continuous random variable with a distribution of the form Z P(X ∈ I) = p(x)dx, (2.14) I 1 we are using the fact that if X, Y are discrete random variables and Y = f (X) for some Borel function f : R ⊂ R → R, then H(Y ) ≤ H(X). The proof is left as an exercise. 21 with p : R → R a continuous function. By the mean value theorem, for any interval Ij = (aj , bj ] of the partition there exists a point xj ∈ [aj , bj ] such that Z p(x)dx = p(xj )(bj − aj ). Ij Hence X H([X]P ) = − p(xj )(bj − aj ) log (p(xj )(bj − aj )) j X X =− p(xj ) log p(xj )(bj − aj ) − p(xj )(bj − aj ) log (bj − aj ) j j R The first term above coincide with the Riemann sum approximation of the integral − R p(x) log p(x)dx, to which it converges when the mesh |P| of the partition P, defined as |P| = supj |bj − aj |, converges to 0. If the partition is sufficiently fine in such a way that bj − aj ≤ 1 for all j ∈ J, then second term can be estimated as X X |− p(xj )(bj − aj ) log (bj − aj ) | ≥ − log |P| p(xj )(bj − aj ) j j P R When |P| → 0 the term j p(xj )(bj − aj ) converges to R p(x)dx = 1, but the term − log |P| converges to +∞. In summary for a continuous random variableR with distribution (2.14) the supremum supP H([X]P ) is equal to the differential entropy − R p(x) log p(x)dx plus an infinite quantity. The same reasoning can be applied to the definition of the mutual information of two random variables X and Y. Similarly as above, we consider two partitions P = {Ij }j∈J and Q = {I˜k }k∈K of the real line, with Ij = (aj , bj ] and I˜k = (ck , dk ], and the corresponding discretised versions [X]P and [Y ]Q of X and Y. We can than compute the mutual information I([X]P , [Y ]Q ) of [X]P and [Y ]Q. It still holds (even if the proof of this property requires some work) that if P 0 is a refinement of P and Q0 is a refinement of Q, then the following inequality holds I([X]P 0 , [Y ]Q0 ) ≥ I([X]P , [Y ]Q ). (2.15) Eq. (2.15) justifies the definition of mutual information of X and Y as I(X, Y ) = sup I([X]P , [Y ]Q ) , (2.16) P,Q where the supremum is taken over all pairs of partitions P, Q of the real line. Again, two particular cases are meaningful. 1. If both X and Y are discrete random variables with values {xj }j and {yk }k , then the supre- mum on the right side of (2.16) is equal to the mutual information of X and Y. As in the case of H(X), if the partition P = {Ij }j∈J is sufficiently fine so that each set Ij contains at most a point xi and the partition Q = {I˜k }k∈K is sufficiently fine so that each set I˜k contains at most a point yk , then no further refinement can increase the value of I([X]P , [Y ]Q ). 22 2. Let X and Y be real valued random variables with a joint distribution of the form Z Z P(X ∈ I, Y ∈ J) = p(x, y)dxdy, (2.17) I J where p : R2 → R continuous function, and let Z Z pX (x) = p(x, y)dy, pY (y) = p(x, y)dx R R By the mean value theorem, the mutual entropy I([X]P , [Y ]Q ) is equal to ! X p(xj , yk )(bj − aj )(dk − ck ) I([X]P , [Y ]Q ) = p(xj , yk )(bj − aj )(dk − ck ) log pX (x0j )pY (yk0 )(bj − aj )(dk − ck ) j,k ! X p(xj , yk ) = p(xj , yk ) log (bj − aj )(dk − ck ) pX (x0j )pY (yk0 ) j,k where xj , x0j ∈ (aj , bj ] and yk , yk0 ∈ (ck , dk ]. The term in the last line is the Riemann sum approximation of the integral Z Z p(x, y) p(x, y) log dxdy R2 pX (x)pY (y) to which I([X]P , [Y ]Q ) converges when the partitions P and Q become increasingly finer, giving Z Z p(x, y) sup I([X]P , [Y ]Q ) = p(x, y) log dxdy. 
P,Q R2 pX (x)pY (y) 2.4 Communication: a stochastic approach In this section we provide a mathematical model for the transmission of information across a communication channel. The main ingredients are the source of information, the receiver and the channel. The source is modelled by a random variable X attaining a finite number of values {x1 ,..., xn } with probabilities pi = P(X = xi ), i = 1,..., n. The set {x1 ,..., xn } is called the source alphabet. A message sent by the source is a (finite) sequence of symbols of the source alphabet. We have to take into account the presence of noise across the channel, causing distortion of the message sent by the source. The receiver is modelled as a random variable Y attaining a finite number of values {y1 ,..., ym } with probabilities qj = P(Y = yj ), j = 1,..., m. We point out that in general the sets of symbols sent and the set of symbols received do not coincide. For instance, if the source sends binary symbols, i.e. {x1 , x2 } = {0, 1}, then we can take into account the presence of losses along the channel or of distortions, which do not allow to distinguish clearly between the sent symbols, by adding the additional symbol ∅ to the receiver alphabet, i.e. {y1 , y2 , y3 } = {0, 1, ∅}. Finally, the distorting effects of the channel are modelled by the set of conditional probabilities pi (j) := P(Y = yj |X = xi ), i = 1,..., n, j = 1,..., m. 23 The binary symmetric channel The simplest non trivial example of a communication channel is the binary symmetric channel. In this case, both the source and the receiver alphabet coincide with the binary alphabet {0, 1}. The transition probabilities are given by pi (j) = (1 − e)δij + e(1 − δij ) where δij is the Kronecker symbol and e ∈ [0, 1] is the probability of a bit-flip. The diagram describing the channel is the following one: 1−e 0 0 e e 1 1 1−e 2.4.1 Mutual information and channel capacity The figure of merit modelling the flux of information across the channel, taking into account the distortion effects of the noise, is the mutual information of the source and the receiver random variables X and Y , i.e. I(X, Y ) = H(X) − HY (X) = H(Y ) − HX (Y ). The capacity C of a transmission channel is the maximum attainable mutual information I(X, Y ), defined as C := max I(X, Y ) (2.18) where the maximum is evaluated over the set of possible distributions of the input random variable X. Example 11. Let us consider the case of the binary symmetric channel. Both source and receiver are Bernoulli random variables. If P(X = 0) = p and P(Y = 0) = q, by using the conditional probabilities pi (j) describing the distortion effects of the channel, we can relate the parameter q to the parameter p: q = P(Y = 0) = P(Y = 0|X = 0)P(X = 0)+P(Y = 0|X = 1)P(X = 1) = (1−e)p+e(1−p) = p+e−2ep Hence, H(Y ) = HB (q) = HB (p + e − 2ep), where HB gives the entropy of a Bernoulli random variable. Similarly HX (Y ) = −P(X = 0, Y = 0) log P(Y = 0|X = 0) − P(X = 0, Y = 1) log P(Y = 1|X = 0) − P(X = 1, Y = 0) log P(Y = 0|X = 1) − P(X = 1, Y = 1) log P(Y = 1|X = 1) = −p(1 − e) log(1 − e) − pe log e − (1 − p)e log e − (1 − p)(1 − e) log(1 − e) = −pHB (e) − (1 − p)HB (e) = HB (e) 24 Hence, if the distribution of the source X is Bernoulli of parameter p the mutual information I(X, Y ) e equal to I(X, Y ) = HB (p + e − 2ep) − HB (e). (2.19) In order to compute the capacity of the channel, we have to take the maximum of (2.19) over all possible values of the parameter p ∈ [0, 1]. 
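The maximization can also be carried out numerically; the following sketch (with the arbitrary choice e = 0.1) evaluates I(X,Y) = H_B(p + e − 2ep) − H_B(e) on a grid of values of p and confirms that the optimum is at p = 1/2, with value 1 − H_B(e).

import math

def H_B(p):                                    # entropy of a Bernoulli(p) random variable
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_information(p, e):                  # Eq. (2.19)
    return H_B(p + e - 2 * e * p) - H_B(e)

e = 0.1
best_p = max((k / 1000 for k in range(1001)), key=lambda p: mutual_information(p, e))
print(best_p)                                      # 0.5
print(mutual_information(best_p, e), 1 - H_B(e))   # both ~0.531: the capacity C
print(mutual_information(0.5, 0.5))                # 0.0: the capacity vanishes at e = 1/2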
Since the term HB (e) does not explicitly depend on p, it is sufficient to take the value of p maximazing the term HB (p + e − 2ep). Since the entropy HB (q) of a Bernoulli random variable attains its maximum when q = 1/2, we have to impose the condition p + e − 2ep = 1/2, which yields p = 1/2. Eventually we get: C = max (HB (p + e − 2ep) − HB (e)) = HB (1/2) − HB (e) = 1 − HB (e) p∈[0,1] In particular C = 0 if and only if e = 1/2. Further, the distribution of the input random variable X which maximizes the transmission of information is the uniform one. 2.4.2 Codes and Shannon’s noiseless coding theorem In the previous section we have discovered that the transmission of information across a channel is bounded by its capacity. Further, mutual information reaches the capacity of the channel provided that the source alphabet has a particular distribution. In practical application, the distribution of the source alphabet is given and cannot be changed, hence there is the necessity to translate into another language which has better features regarding transmission issues. The transformation of a message composed in the source alphabet into a different message, composed in a different alphabet, is called coding. Coding theory is extensively developed and we are going to give just a few elements, in particular those related with entropy and communication theory. A code alphabet is a set {c1 ,..., cr } of symbols, called code symbols. A codeword is a finite sequence ci1...cil of code symbols, where l ≥ 1 is the length of the codeword and i1 ,..., il ∈ {1,..., r}. A typical example of code alphabet is the binary one, where r = 2, {c1 , c2 } = {0, 1}. The aim of a code is to map each letter {x1 ,..., xn } of the source alphabet into a codeword. A natural requirement is the injectivity of this map: different letters of the source alphabet must be associated to different codewords. Moreover, in the case the lenght of codewords is not constant and during the transmission there is no an explicit symbol telling where a codeword ends and another begins, there can be ambiguity in decoding the message. Let us consider for example the case where the source alphabet consists of 5 symbols A, B, C, D, E corresponding respectively to the codewords 0, 01, 11, 011, 101. It is not difficult to construct examples of messages that are not uniquely decodable, such as, e.g., 011101, which can be decoded either as 01 11 01 (corresponding to BCB) or 011 101 (corresponding to DE). In order to avoid this ambiguities we have to consider prefix-free codes. Definition 17. If a sequence ci1...cil of code symbols is a codeword and for some k < m the subsequence ci1...cik of the first k letters is a codeword as well, then ci1...cik is called a prefix of ci1...cil. A code is called prefix free if no codeword is a prefix for any other. In the code mentioned above the codeword 01 is a prefix for the codeword 011. An example of prefix-free code for the source alphabet A, B, C, D, E is 0, 10, 110, 1110, 1111. Every prefix-free code can be associated to a tree diagram, also called ”coding” (or ”decoding”) tree. Figure 2.2 describes the code of the previous example. 25 A 0 B 1 0 C 0 1 D 0 1 1 E Figure 2.2: The prefix-free code 0, 10, 110,1110, 1111 In general, one can construct different prefix-free codes. Figure 2.3 provides an alternative example corresponding to the code 0, 100, 101, 1110, 1111. Important parameters in a code are the lengths of codewords. 
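A few lines of code (a utility of my own, not part of the notes) make the prefix-free condition of Definition 17 easy to test on the codes discussed above.

def is_prefix_free(codewords):
    """True iff no codeword is a prefix of a different codeword."""
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)

print(is_prefix_free(["0", "01", "11", "011", "101"]))      # False: "0" is a prefix of "01"
print(is_prefix_free(["0", "10", "110", "1110", "1111"]))   # True  (the code of Figure 2.2)
print(is_prefix_free(["0", "100", "101", "1110", "1111"]))  # True  (the code of Figure 2.3)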
Given a source alphabet with n symbols x1 ,..., xn and a code alphabet with r symbols c1 ,..., cr , we shall denote with li , the length of the codeword associated to the symbol xi. The following result gives a necessary and sufficient condition for the existence of a prefix free code with lengths li , i = 1,..., n. Theorem 6 (Kraft-McMillan inequality). Given a source alphabet with n symbols and a code alphabet with r symbols, a prefix-free code with codewords of lengths li exists iff n X 1 li ≤ 1. (2.20) i=1 r For the proof see Appendix A.4. Given a source alphabet with n symbols x1 ,..., xn with corresponding probabilities p1 ,..., pn , and a code alphabet with r symbols, there are infinitely many prefix-free codes. For each of them we can consider the length li of the codewords associated to the symbols xi. They can be considered as the possible Pn values of a discrete random variable L with distribution pi , i = 1,..., n. Its average E[L] = i=1 li pi gives the expected value of the lengths of codewords. Since the transmission of a codeword across a communication channel has a cost and needs time, it is important to minimize E[L]. The following theorem relates the entropy of the source H(X) to a lower bound for the average length of a prefix-free code, where n X H(X) = − pi log pi (2.21) i=1 Theorem 7 (Shannon’s noiseless coding theorem). Given a code alphabet with r symbols and a source alphabet with entropy H(X) 26 A 0 B 0 1 0 1 C D 0 1 1 1 E Figure 2.3: The prefix-free code 0, 100, 101, 1110, 1111 1. for any prefix-free code the following holds H(X) E[L] ≥ log r 1 where the equality holds iff pi = r li for all i = 1,..., n; 2. there exists a prefix-free code such that H(X) H(X) ≤ E[L] ≤ + 1. (2.22) log r log r Proof: r −li 1. If we apply Gibbs inequality (see Lemma 4) with qi = Pn −lj we get j=1 r n n X X r−li − pi log pi ≤ − pi log Pn −lj i=1 i=1 j=1 r n X n X H(X) ≤ log r pi li + log( r−lj ) i=1 j=1 n X −lj H(X) ≤ log rE[L] + log( r ) j=1 Pn Pn By the Kraft-McMillan inequality(2.20) we have j=1 r−lj ≤ 1, hence log( j=1 r−lj ) ≤ 0 and Xn H(X) ≤ log rE[L] + log( r−lj ) ≤ log rE[L] j=1 27 Pn −li By Lemma 4, the equality H(X) = log rE[L] + log( r−lj ) holds iff pi = qi = Pnr r−lj j=1 j=1 Pn for all i = 1,..., n. Moreover, the equality H(X) = log rE[L] holds iff j=1 r−lj = 1, leading to the condition pi = qi = r−li. 2. If for any i = 1,..., n we choose the length li of the codeword associated to the source symbol xi as the unique integer such that r−li ≤ pi < r−(li −1). Actually this amounts to choosing li = b− logr pi c, since the inequality above is equivalent to − logr pi ≤ li < − logr pi + 1. It is easy to verify that the Kraft-McMillan inequality is satisfied, hence there exists a cor- responding prefix-free code. Moreover by taking the log we get −li log r ≤ log pi < −(li − 1) log r. multiplying by pi and summing over i = 1,.., n we get − log rE[L] ≤ −H(X) < − log r(E[L] − 1) obtaining (2.22). The upper bound on the right hand side of the inequality (2.22) can be improved by considering the so-called block codes, i.e. by encoding sequences of source symbols of length m > 1 instead of just letters. If the source alphabet consists of n symbols {x1 ,..., xn }, then there will be nm possible words of length m. They can be regarded the possible values of a random vector X m = (X1 ,..., Xm ) with components independent and identically distributed as the source random variable X. In particular, the entropy of the random vector X m will be given by H(X m ) = mH(X). 
(2.23) A code will map each m alphabet word xi1...xim in a corresponding codeword. We shall denote m ] the average length of the nm codewords with the symbol E[Lm ]. The ratio E[L m will give the average length of codewords per source symbol. By applying (2.22) and (2.23) we have: mH(X) mH(X) ≤ E[Lm ] ≤ +1 log r log r Hence, the average length per source symbol satisfies the inequality H(X) E[Lm ] H(X) 1 ≤ ≤ + log r m log r m 2.4.3 Transmission with noise and Shannon’s main theorem In this section we are going to show that, given a transmission channel, there exists a code allowing, on the one hand, to minimize the probability of errors in decoding and, on the other hand, to approach the rate of transmission of information to a level arbitrary close to the capacity of the channel. In this section we shall restrict ourselves to the binary symmetric channel, but the result is true for general transmission channels. 28 Error probability Due to the presence of noise there is a positive probability that the symbols transmitted by the source will be decoded erroneously. This is quantified by the error probability, P(E) defined as X P(E) := P (an error in decoding occurs |wi sent)P(wi sent) (2.24) wi source codewords If we transmit across the binary symmetric channel the binary symbols 0, 1, we have: P(E) = P (an error in decoding occurs |0 sent)P(0 sent) + P (an error in decoding occurs |1 sent)P(1 sent) = eP(0 sent) + eP(1 sent) = e The probability of error can be reduced by introducing redundancy in the transmission. For instance, we can decide to encode the symbol 0 with the codeword 000 and the symbol 1 with the codeword 111. Due to transmission error there will be the possibility of one or more bit-flips across the channel. In this respect, we can decide to decode the received sequence y as the codewords having the higher number of bits equal to the bits of y. For instance, if the received sequence is 100 we shall decode it as 000, while in the case the received sequence is 110 we shall decode it as 111. This decoding criterion allows to correct the transmission errors where there is at most one bit flip and give erroneous result if during the transmission the bit flips are greater or equal than 2. In particular the error probability is given by P (an error in decoding occurs |wi sent) = P(2 bit-flips ) + P(3 bit-flips ) = 3e2 (1 − e) + e3 Hence P (E) = 3e2 (1 − e) + e3. This procedure reduces the error probability of the case a single symbol is sent if e < 1/2, indeed it is easy to see that under this condition 3e2 (1 − e) + e3 < e. Transmission rate If we can send across the channel one symbol per unit time (for instance a symbol per second), the redundancy introduced by sending words of length 3 instead of 1 will slow down the transmission of information. The number of bits of information transmitted across the channel per unit time is called trans- mission rate R. In the case of the example above, let H(X) denote the entropy of the Bernoulli random variable associated to the source symbols 0,1. If we send one symbol per second, the rate information transmitted across the channel is equal to H(X) bits per second. On the other hand, if we transmit the codewords 000 and 111 to reduce errors, the transmission rate is reduced to H(X) 3 bits per second. This examples seem to suggest that we should expect a tradeoff between transmission rate and reliability, i.e. small error probability. 
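A quick check (exact formula plus a small Monte Carlo, with arbitrary values of e) of the reliability side of the trade-off just mentioned: the 3-bit repetition code with majority decoding has error probability 3e^2(1 − e) + e^3, smaller than e for every e < 1/2, at the price of reducing the rate to H(X)/3.

import random

def error_formula(e):
    return 3 * e**2 * (1 - e) + e**3            # at least 2 bit flips out of 3

def error_simulated(e, trials=200_000, rng=random.Random(0)):
    fails = sum(sum(rng.random() < e for _ in range(3)) >= 2 for _ in range(trials))
    return fails / trials

for e in (0.01, 0.1, 0.3):
    assert error_formula(e) < e                 # redundancy helps whenever e < 1/2
    print(e, error_formula(e), error_simulated(e))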
Remarkably, Shannon's theorem, which we are going to state below, proves the existence of a code able to keep the error probability small while keeping the transmission rate close to the channel capacity. Before stating the result we have to introduce some notions.

Decision rules and Hamming distance

We point out that the value of the error probability P(E) (Eq. (2.24)) depends on the decoding strategy chosen. Let {w_1, ..., w_M} be the set of all possible codewords that can be transmitted and let {y_1, ..., y_N} be the set of all possible received words. A decoding rule is a decision strategy which assigns a unique codeword to any received word. One of the main decision rules is the maximum likelihood rule. It consists in choosing the codeword w_i for which the conditional probability P(w_i sent | y received) attains its maximum value. In other words, if y is the received word, we decode it with that particular w_i such that

P(w_i \ \text{sent} \mid y \ \text{received}) \geq P(w_k \ \text{sent} \mid y \ \text{received}), \quad \forall k \neq i.

If more than one codeword attains the maximum of the conditional probability, we choose randomly and uniformly among them. If all codewords are equally likely, this criterion is equivalent to maximizing the conditional probabilities

P(y \ \text{received} \mid w_i \ \text{sent}).   (2.25)

In the case of binary codewords, a useful notion is the Hamming distance. Let w_1 and w_2 be two binary codewords of length n, w_k = c^k_1 \cdots c^k_n with k = 1, 2 and c^k_1, ..., c^k_n ∈ {0, 1}; their Hamming distance H(w_1, w_2) is defined as the number of digits of w_1 that differ from the corresponding digits of w_2. For instance, if n = 5, w_1 = 00101 and w_2 = 01001, then H(w_1, w_2) = 2.

A possible decision rule when dealing with binary codewords is to minimize the Hamming distance, decoding a received word y with that particular codeword w_i such that

H(y, w_i) \leq H(y, w_k), \quad \forall k \neq i.

If more than one codeword attains the minimum of the Hamming distance, we choose randomly and uniformly among them.

In the case of a binary symmetric channel with bit-flip probability e less than 1/2, the two decision rules lead to the same results, i.e. the maximization of the conditional probabilities (2.25) is equivalent to the minimization of the Hamming distance between the received sequence y and the codewords w_i. Indeed, the conditional probability of receiving an n-sequence y of binary symbols when an n-codeword w_i was sent is given by

P(y \ \text{received} \mid w_i \ \text{sent}) = e^{d} (1-e)^{n-d} = (1-e)^{n} \Big(\frac{e}{1-e}\Big)^{d},

where d = H(y, w_i). Since \frac{e}{1-e} < 1 if e < 1/2, we have that P(y received | w_i sent) attains its maximum value when d = H(y, w_i) attains its minimum.
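As an illustration, the following sketch (not part of the notes; the codewords are a hypothetical example) implements the Hamming distance and minimum-distance decoding with uniform tie-breaking, which, by the argument above, coincides with maximum-likelihood decoding for a binary symmetric channel with e < 1/2.

```python
# A minimal sketch of minimum-Hamming-distance decoding.
# Codewords are represented as equal-length 0/1 strings (hypothetical example code).
import random

def hamming(w1, w2):
    """Number of positions where the two words differ."""
    return sum(a != b for a, b in zip(w1, w2))

def decode(y, codewords):
    """Return a codeword at minimum Hamming distance from y (ties broken uniformly)."""
    dmin = min(hamming(y, w) for w in codewords)
    return random.choice([w for w in codewords if hamming(y, w) == dmin])

codewords = ["00000", "01101", "10110", "11011"]     # hypothetical code
print(hamming("00101", "01001"))                     # 2, as in the example above
print(decode("10100", codewords))                    # "10110" (the unique codeword at distance 1)
```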
Shannon's main theorem

Theorem 8. Let us consider a binary symmetric channel with capacity C > 0. Let us assume that the probability e of a bit flip is less than 1/2.² Given ε_1 > 0 and ε_2 > 0 arbitrarily small, there exists a code such that the transmission rate R and the error probability P(E) satisfy the following inequalities:

1. R ≥ C − ε_1
2. P(E) ≤ ε_2

² There is no loss of generality in assuming e < 1/2. Indeed, if e > 1/2 it is sufficient to switch the roles of 0 and 1 at the receiver end of the communication channel.

The proof is partially constructive and gives a description of the main features of a code fulfilling properties 1 and 2 in the statement. It can be divided into several steps.

Length of codewords and their distribution. We shall consider the binary alphabet {0, 1} and codewords of fixed length n (to be chosen later), selecting just M < 2^n out of all the 2^n possible ordered sequences of n binary symbols. Moreover, we shall put the uniform probability on the set of M chosen codewords, in such a way that each of them has probability 1/M. Hence the entropy transmitted across the channel is that of a uniform random variable with M values, equal to \log_2 M, while the transmission rate is R = \frac{\log_2 M}{n} bits per unit time. Given the capacity C of the channel and 0 < ε_1 < C, if we choose the number M of codewords equal³ to M = 2^{n(C - ε_1)} we obtain

R = \frac{\log_2 M}{n} = \frac{\log_2 2^{n(C - ε_1)}}{n} = \frac{n(C - ε_1)}{n} = C - ε_1.

³ More generally, if 2^{n(C − ε_1)} is not an integer, we shall choose M as the smallest integer greater than or equal to 2^{n(C − ε_1)}.

Random choice of codewords. Given n and M < 2^n, how do we choose the M codewords out of the 2^n possible ones? Shannon's idea is to use random coding. Let us consider the set W_n of all possible codewords with the discrete uniform distribution on it, in such a way that the probability of each word is equal to 2^{-n}. Let us construct the sequence of M codewords by choosing them randomly (with replacement) in W_n. It turns out that each of the possible codes has probability 2^{-nM}. Moreover, two or more codewords may coincide, but this is very unlikely if n is large, as we shall see later.

Decoding strategy. If w_i = c_{i1} \cdots c_{in} is a codeword transmitted across the binary symmetric channel and each symbol is transmitted independently of the others, the number N of bit flips occurring during the transmission is a binomial random variable B(n, e):

P(N = k) = \binom{n}{k} e^{k} (1-e)^{n-k}, \quad k = 0, ..., n,

hence the expected number of bit flips occurring during the transmission is E[N] = ne. This means that the average Hamming distance between a transmitted codeword w and the corresponding received word y is equal to ne. In particular, if we denote by B_w(r) the sphere in the Hamming distance with center w and radius r,

B_w(r) = \{w' : H(w, w') < r\},

we expect that the received word y will belong to the sphere B_{w_i}(r), with w_i the sent codeword and r = n(e + ρ), ρ > 0 being a small parameter to be set later. Conversely, if we receive a word y, it is likely that the sent codeword w_i belongs to the sphere B_y(n(e + ρ)). This line of thought leads to the following decoding strategy, depending on the free parameter ρ that will be set later (a schematic implementation is sketched after the list below). Given a received word y we consider the corresponding sphere B_y(n(e + ρ)) and apply the following set of rules:

– If there is a unique codeword w_i belonging to B_y(n(e + ρ)), we decode y as w_i.
– If there is more than one codeword in B_y(n(e + ρ)), we decline to guess and decide that an error that cannot be corrected occurred.
– If there are no codewords in B_y(n(e + ρ)), we decide that an error that cannot be corrected occurred.
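The following sketch (not part of the notes; the code and parameter values are hypothetical) implements this decoding rule literally: it returns the unique codeword inside the Hamming sphere B_y(n(e + ρ)) and declares an uncorrectable error otherwise.

```python
# A minimal sketch of the Hamming-sphere decoding rule used in the proof.
def hamming(w1, w2):
    """Number of positions where the two words differ."""
    return sum(a != b for a, b in zip(w1, w2))

def sphere_decode(y, codewords, e, rho):
    """Decode y only if a unique codeword lies in B_y(n(e + rho)); else return None."""
    radius = len(y) * (e + rho)
    inside = [w for w in codewords if hamming(y, w) < radius]
    if len(inside) == 1:
        return inside[0]        # unique codeword in the sphere
    return None                 # zero or several codewords: uncorrectable error

codewords = ["0000000", "1111111"]                            # hypothetical length-7 code
print(sphere_decode("0010000", codewords, e=0.1, rho=0.1))    # "0000000" (distance 1 < 1.4)
print(sphere_decode("0011100", codewords, e=0.1, rho=0.1))    # None (no codeword within radius 1.4)
```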
Error probability. A decoding error occurs if the correct sent codeword w_i does not belong to the Hamming sphere B_y(n(e + ρ)) centered in the received word y, or if the sphere contains more than one of the M possible codewords. In particular, the error probability is computed as

P(\text{an error in decoding occurs}) = \sum_{w_i} P(\text{an error in decoding occurs} \mid w_i \ \text{sent}) \, P(w_i \ \text{sent}) = \sum_{w_i} \frac{1}{M} \, P(\text{an error in decoding occurs} \mid w_i \ \text{sent}).

In the following, for notational simplicity, we shall denote the conditional error probability P(an error in decoding occurs | w_i sent) with the symbol P(E), dropping the explicit dependence on the sent codeword w_i. In particular, P(E) can be decomposed into the sum of two terms:

P(E) = P(\{w_i \notin B_y(n(e + ρ))\} \cap E) + P(\{w_i \in B_y(n(e + ρ))\} \cap E).

Clearly \{w_i \notin B_y(n(e + ρ))\} \subset E, while \{w_i \in B_y(n(e + ρ))\} \cap E = \{w_i \in B_y(n(e + ρ))\} \cap \bigcup_{w_j \neq w_i} \{w_j \in B_y(n(e + ρ))\}, hence P(E) is given by two contributions:

P(E) = P(w_i \notin B_y(n(e + ρ))) + P\Big(\{w_i \in B_y(n(e + ρ))\} \cap \bigcup_{w_j \neq w_i} \{w_j \in B_y(n(e + ρ))\}\Big).   (2.26)

In the following we shall choose the parameters n, ρ in such a way that both terms in (2.26) are smaller than ε_2/2.

The first term takes into account both the case where there are no codewords in B_y(n(e + ρ)) and the case where the correct w_i does not belong to B_y(n(e + ρ)) but other codewords do. In particular, P(w_i \notin B_y(n(e + ρ))) gives the probability that during the transmission the number of bit flips exceeds n(e + ρ). Since the number of bit flips is a binomial random variable N, we can write

P(w_i \notin B_y(n(e + ρ))) = P(N \geq n(e + ρ)).

By the Chebyshev inequality we have

P(N \geq n(e + ρ)) = P(N - ne \geq nρ) \leq P(|N - ne| \geq nρ) \leq \frac{E[|N - ne|^2]}{n^2 ρ^2} = \frac{ne(1-e)}{n^2 ρ^2} = \frac{e(1-e)}{n ρ^2}.

Given e, ρ and ε_2 > 0, it is sufficient to choose n sufficiently large in such a way that

\frac{e(1-e)}{n ρ^2} < \frac{ε_2}{2}   (2.27)

and we have

P(w_i \notin B_y(n(e + ρ))) < \frac{ε_2}{2}.

Concerning the second term, we have

P\Big(\{w_i \in B_y(n(e + ρ))\} \cap \bigcup_{w_j \neq w_i} \{w_j \in B_y(n(e + ρ))\}\Big) \leq P\Big(\bigcup_{w_j \neq w_i} \{w_j \in B_y(n(e + ρ))\}\Big) \leq \sum_{w_j \neq w_i} P(w_j \in B_y(n(e + ρ))).   (2.28)

This term depends on the particular codewords present in the code. Clearly, if all the codewords are very close in the Hamming distance, this term gives a non-negligible contribution. On the other hand, if the M codewords are far apart in the Hamming distance (which is rather likely if M is much smaller than 2^n and the codewords are chosen at random), this term is small. Indeed, averaging over the random choice of the code, each codeword w_j with j ≠ i is uniformly distributed over the 2^n binary words, so that P(w_j \in B_y(n(e + ρ))) = |B_y(n(e + ρ))| / 2^n \leq 2^{n(H_B(e + ρ) - 1)}, where H_B denotes the binary entropy function and we used the bound |B_y(nα)| \leq 2^{n H_B(α)} valid for α < 1/2. Summing over the M - 1 \leq 2^{n(C - ε_1)} codewords w_j \neq w_i and using C = 1 - H_B(e), the average of the right-hand side of (2.28) over the random codes is bounded by 2^{n(H_B(e + ρ) - H_B(e) - ε_1)}. By continuity of H_B we can choose ρ > 0 small enough that H_B(e + ρ) - H_B(e) - ε_1 < 0. Once such a ρ is chosen, we can eventually choose n sufficiently large in such a way that condition (2.27) together with

2^{n(H_B(e + ρ) - H_B(e) - ε_1)} < \frac{ε_2}{2}

are fulfilled. Eventually, we have proved that by averaging over all possible random codes, by choosing ρ in a suitable way and n sufficiently large, the average of the error probability ⟨P(E)⟩ satisfies the following bound:

⟨P(E)⟩ < ε_2.   (2.31)

Final result and discussion. The inequality (2.31) actually gives the existence of a code with the properties stated in the theorem. Indeed, if we denote by C the set of all possible codes with M words of length n considered in the theorem, and by µ the uniform probability on the σ-algebra F = P(C), we can regard the error probability of each code as a random variable c ↦ P_c(E). The inequality (2.31) reads

E[P(E)] = \int_{C} P_c(E) \, dµ(c) < ε_2,

which actually gives the existence of a subset A ⊂ C with strictly positive probability µ(A) = \frac{\#A}{2^{nM}} such that P_c(E) < ε_2 for all c ∈ A. Indeed, if this were not true and P_c(E) ≥ ε_2 for all c ∈ C, then we would have E[P(E)] ≥ ε_2. Hence there must exist at least one code c ∈ C such that P_c(E) < ε_2.
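The averaging argument can be illustrated numerically. The sketch below (not from the notes; all parameter values are hypothetical and far smaller than those required by the actual proof) draws several random codes, transmits each codeword through a binary symmetric channel, decodes with the Hamming-sphere rule described above, and estimates the average error probability ⟨P(E)⟩.

```python
# A minimal sketch of the random-coding experiment behind (2.31), with toy parameters.
import random

def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))

def sphere_decode(y, codewords, radius):
    """Return the unique codeword in B_y(radius), or None (uncorrectable error)."""
    inside = [w for w in codewords if hamming(y, w) < radius]
    return inside[0] if len(inside) == 1 else None

def estimate_error(n, M, e, rho, codes=50, trials_per_word=40):
    """Empirical average, over random codes, of the decoding error probability."""
    errors = trials = 0
    for _ in range(codes):
        code = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(M)]
        for w in code:
            for _ in range(trials_per_word):
                y = tuple(b ^ (random.random() < e) for b in w)   # binary symmetric channel
                if sphere_decode(y, code, n * (e + rho)) != w:
                    errors += 1
                trials += 1
    return errors / trials

# Toy parameters: n = 20, M = 8 codewords, e = 0.05, rho = 0.1.
# The estimate is roughly 0.08 here, dominated by P(N >= n(e + rho));
# larger n (with rho and M chosen as in the proof) drives it below any eps_2.
print(estimate_error(n=20, M=8, e=0.05, rho=0.1))
```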
Chapter 3

Stochastic processes

3.1 Main definitions

Let us now introduce the definition of a stochastic process, an object that provides a mathematical model for the description of the time evolution of a random phenomenon (see ).

Definition 18. A stochastic process is an object of the form (Ω, F, (F_t)_{t∈T}, (X_t)_{t∈T}, P) where:

– (Ω, F, P) is a probability space,
– T is a subset of R_+,
– (F_t)_{t∈T} is a filtration, i.e. an increasing family of sub-σ-algebras of F:

∀ s, t ∈ T, \quad s ≤ t \implies F_s ⊂ F_t ⊂ F,

– (X_t)_{t∈T} is a family of random variables on (Ω, F, P) with state space (E, E), such that for any t ∈ T the r.v. X_t is F_t-measurable. If this property is satisfied we say that the process is adapted to the filtration (shortly, it is F_t-adapted).

The parameter t ∈ T represents the time variable and the random variable X_t represents the state of a system at time t. Usually T = N (in this case we shall speak about discrete time) or T = R_+ (in this case we shall speak about continuous time).

Example 12. Let us describe N tosses of a fair coin, assuming that all tosses are mutually independent and that the two possible outcomes of each toss have the same probability 1/2. In this case the set of times is finite, in particular T = {1, ..., N}. First of all, let us construct the probability space. By convention, let us assign to the outcome "head" the numeric value +1 and to the outcome "tail" the value -1. We can describe the set of all possible sequences of outcomes of the N tosses in terms of the N-tuples ω = (ω_1, ..., ω_N) where ω_i ∈ {+1, -1} for all i = 1, ..., N. Hence:

Ω = \{(ω_1, ..., ω_N) : ω_i ∈ \{+1, -1\}, \ i = 1, ..., N\}.

We can choose F = P(Ω) and P defined on the singletons as

P(\{(ω_1, ..., ω_N)\}) = \frac{1}{2^N},

hence, for all events E ∈ P(Ω), their probability can be simply computed as P(E) = \frac{\#E}{2^N}.

The second step is the construction of the filtration (F_n)_{n∈T}. Let us consider N independent identically distributed discrete random variables ξ_1, ..., ξ_N on (Ω, F, P) defined as

ξ_i(ω_1, ..., ω_N) := ω_i, \quad i = 1, ..., N.

In fact the random variable ξ_i gives the result of the i-th toss: ξ_i = +1 if head and ξ_i = -1 otherwise. One can easily verify (as an exercise) that the random variables ξ_1, ..., ξ_N are independent and identically distributed. Let us construct the family of sub-σ-algebras F_1, ..., F_N as F_n := σ(ξ_i, i ≤ n):

F_1 := σ(ξ_1) = \{Ω, ∅, \{ξ_1 = +1\}, \{ξ_1 = -1\}\}
F_2 := σ(ξ_1, ξ_2) = σ(\{\{ξ_1 = +1\}, \{ξ_1 = -1\}, \{ξ_2 = +1\}, \{ξ_2 = -1\}\})
···
F_n := σ(ξ_1, ξ_2, ..., ξ_n)