Summary

This textbook, 'Probability Theory' by Alexandr A. Borovkov, published by Springer, offers a comprehensive exploration of probability theory. It delves into key concepts such as random variables, limit theorems, and Markov chains, providing a rigorous treatment suitable for advanced learners. The book includes substantial revisions from previous editions.

Full Transcript


Universitext Alexandr A. Borovkov Probability Theory Universitext Universitext Series Editors: Sheldon Axler San Francisco State University, San Francisco, CA, USA Vincenzo Capasso Università degli Studi di Milano, Milan, Italy Carles Casacuberta Universitat de Barcelona, Barcelona, Spain Angus MacIntyre Queen Mary, University of London, London, UK Kenneth Ribet University of California, Berkeley, Berkeley, CA, USA Claude Sabbah CNRS, École Polytechnique, Palaiseau, France Endre Süli University of Oxford, Oxford, UK Wojbor A. Woyczynski Case Western Reserve University, Cleveland, OH, USA Universitext is a series of textbooks that presents material from a wide variety of mathematical disciplines at master’s level and beyond. The books, often well class-tested by their author, may have an informal, personal, even experimental approach to their subject matter. Some of the most successful and established books in the series have evolved through several editions, always following the evolution of teaching curricula, into very polished texts. Thus as research topics trickle down into graduate-level teaching, first textbooks written for new, cutting-edge courses may make their way into Universitext. For further volumes: www.springer.com/series/223 Alexandr A. Borovkov Probability Theory Edited by K.A. Borovkov Translated by O.B. Borovkova and P.S. Ruzankin Alexandr A. Borovkov Sobolev Institute of Mathematics and Novosibirsk State University Novosibirsk, Russia Translation from the 5th edn. of the Russian language edition: ‘Teoriya Veroyatnostei’ by Alexandr A. Borovkov © Knizhnyi dom Librokom 2009 All Rights Reserved. 1st and 2nd edn. © Nauka 1976 and 1986 3rd edn. © Editorial URSS and Sobolev Institute of Mathematics 1999 4th edn. © Editorial URSS 2003 ISSN 0172-5939 ISSN 2191-6675 (electronic) Universitext ISBN 978-1-4471-5200-2 ISBN 978-1-4471-5201-9 (eBook) DOI 10.1007/978-1-4471-5201-9 Springer London Heidelberg New York Dordrecht Library of Congress Control Number: 2013941877 Mathematics Subject Classification: 60-XX, 60-01 © Springer-Verlag London 2013 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. 
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Printed on acid-free paper. Springer is part of Springer Science+Business Media (www.springer.com)

Foreword

The present edition of the book differs substantially from the previous one. Over the period of time since the publication of the previous edition the author has accumulated quite a lot of ideas concerning possible improvements to some chapters of the book. In addition, some new opportunities were found for an accessible exposition of new topics that had not appeared in textbooks before but which are of certain interest for applications and reflect current trends in the development of modern probability theory. All this led to the need for one more revision of the book. As a result, many methodological changes were made and a lot of new material was added, which makes the book more logically coherent and complete. We will list here only the main changes in the order of their appearance in the text.

Section 4.4 “Expectations of Sums of a Random Number of Random Variables” was significantly revised. New sufficient conditions for Wald’s identity were added. An example is given showing that, when summands are non-identically distributed, Wald’s identity can fail to hold even in the case when its right-hand side is well-defined. Later on, Theorem 11.3.2 shows that, for identically distributed summands, Wald’s identity is always valid whenever its right-hand side is well-defined.

In Sect. 6.1 a criterion of uniform integrability of random variables is constructed, which simplifies the use of this notion. For example, the criterion directly implies uniform integrability of weighted sums of uniformly integrable random variables.

Section 7.2, which is devoted to inversion formulas, was substantially expanded and now includes assertions useful for proving integro-local theorems in Sect. 8.7.

In Chap. 8, integro-local limit theorems for sums of identically distributed random variables were added (Sects. 8.7 and 8.8). These theorems, being substantially more precise assertions than the integral limit theorems, do not require additional conditions and play an important role in investigating large deviation probabilities in Chap. 9.

A new chapter was written on probabilities of large deviations of sums of random variables (Chap. 9). The chapter provides a systematic and rather complete exposition of the large deviation theory both in the case where the Cramér condition (rapid decay of distributions at infinity) is satisfied and where it is not. Both integral and integro-local theorems are obtained. The large deviation principle is established.

Assertions concerning the case of non-identically distributed random variables were added in Chap. 10 on “Renewal Processes”. Among them are renewal theorems as well as the law of large numbers and the central limit theorem for renewal processes. A new section was written to present the theory of generalised renewal processes.

An extension of the Kolmogorov strong law of large numbers to the case of non-identically distributed random variables having the first moment only was added to Chap. 11.
A new subsection on the “Strong law of large numbers for generalised renewal processes” was written.

Chapter 12 on “Random walks and factorisation identities” was substantially revised. A number of new sections were added: on finding factorisation components in explicit form, on the asymptotic properties of the distribution of the suprema of cumulated sums and generalised renewal processes, and on the distribution of the first passage time.

In Chap. 13, devoted to Markov chains, a section on “The law of large numbers and central limit theorem for sums of random variables defined on a Markov chain” was added.

Three new appendices (6, 7 and 8) were written. They present important auxiliary material on the following topics: “The basic properties of regularly varying functions and subexponential distributions”, “Proofs of theorems on convergence to stable laws”, and “Upper and lower bounds for the distributions of sums and maxima of sums of independent random variables”.

As has already been noted, these are just the most significant changes; there are also many others. A lot of typos and other inaccuracies were fixed. The process of creating new typos and misprints in the course of one’s work on a book is random and can be well described mathematically by the Poisson process (for the definition of Poisson processes, see Chaps. 10 and 19). An important characteristic of the quality of a book is the intensity of this process. Unfortunately, I am afraid that in the two previous editions (1999 and 2003) this intensity perhaps exceeded a certain acceptable level. Not renouncing his own responsibility, the author still admits that this may be due, to some extent, to the fact that the publication of these editions took place at the time of a certain decline of the publishing industry in Russia related to the general state of the economy at that time (in the 1972, 1976 and 1986 editions there were far fewer such defects).

Before starting to work on the new edition, I asked my colleagues from our laboratory at the Sobolev Institute of Mathematics and from the Chair of Probability Theory and Mathematical Statistics at Novosibirsk State University to prepare lists of any typos and other inaccuracies they had spotted in the book, as well as suggested improvements of exposition. I am very grateful to everyone who provided me with such information. I would like to express special thanks to I.S. Borisov, V.I. Lotov, A.A. Mogul’sky and S.G. Foss, who also offered a number of methodological improvements.

I am also deeply grateful to T.V. Belyaeva for her invaluable assistance in typesetting the book with its numerous changes. Without that help, the work on the new edition would have been much more difficult.

A.A. Borovkov

Foreword to the Third and Fourth Editions

This book has been written on the basis of the Russian version (1986) published by “Nauka” Publishers in Moscow. A number of sections have been substantially revised and several new chapters have been introduced. The author has striven to provide a complete and logical exposition and simpler and more illustrative proofs. The 1986 text was preceded by two earlier editions (1972 and 1976). The first one appeared as an extended version of lecture notes of the course the author taught at the Department of Mechanics and Mathematics of Novosibirsk State University. Each new edition responded to comments by the readers and was completed with new sections which made the exposition more unified and complete.
The readers are assumed to be familiar with a traditional calculus course. They would also benefit from knowing elements of measure theory and, in particular, the notion of integral with respect to a measure on an arbitrary space and its basic properties. However, provided they are prepared to use a less general version of some of the assertions, this lack of additional knowledge will not hinder the reader from successfully mastering the material. It is also possible for the reader to avoid such complications completely by reading the respective Appendices (located at the end of the book) which contain all the necessary results.

The first ten chapters of the book are devoted to the basics of probability theory (including the main limit theorems for cumulative sums of random variables), and it is best to read them in succession. The remaining chapters deal with more specific parts of the theory of probability and could be divided into two blocks: random processes in discrete time (or random sequences, Chaps. 12 and 14–16) and random processes in continuous time (Chaps. 17–21).

There are also chapters which remain outside the mainstream of the text as indicated above. These include Chap. 11 “Factorisation Identities”. The chapter not only contains a series of very useful probabilistic results, but also displays interesting relationships between problems on random walks in the presence of boundaries and boundary problems of complex analysis. Chapter 13 “Information and Entropy” and Chap. 19 “Functional Limit Theorems” also deviate from the mainstream. The former deals with problems closely related to probability theory but very rarely treated in texts on the discipline. The latter presents limit theorems for the convergence of processes generated by cumulative sums of random variables to the Wiener and Poisson processes; as a consequence, the law of the iterated logarithm is established in that chapter.

The book has incorporated a number of methodological improvements. Some parts of it are devoted to subjects to be covered in a textbook for the first time (for example, Chap. 16 on stochastic recursive sequences playing an important role in applications).

The book can serve as a basis for third year courses for students with a reasonable mathematical background, and also for postgraduates. A one-semester (or two-trimester) course on probability theory might consist (there could be many variants) of the following parts: Chaps. 1–2, Sects. 3.1–3.4, 4.1–4.6 (partially), 5.2 and 5.4 (partially), 6.1–6.3 (partially), 7.1, 7.2, 7.4–7.6, 8.1–8.2 and 8.4 (partially), 10.1, 10.3, and the main results of Chap. 12. For a more detailed exposition of some aspects of Probability Theory and the Theory of Random Processes, see for example [2, 10, 12–14, 26, 31].

While working on the different versions of the book, I received advice and help from many of my colleagues and friends. I am grateful to Yu.V. Prokhorov, V.V. Petrov and B.A. Rogozin for their numerous useful comments which helped to improve the first variant of the book. I am deeply indebted to A.N. Kolmogorov whose remarks and valuable recommendations, especially of methodological character, contributed to improvements in the second version of the book. In regard to the second and third versions, I am again thankful to V.V. Petrov who gave me his comments, and to P. Franken, with whom I had a lot of useful discussions while the book was translated into German.
In conclusion I want to express my sincere gratitude to V.V. Yurinskii, A.I. Sakhanenko, K.A. Borovkov, and other colleagues of mine who also gave me their comments on the manuscript. I would also like to express my gratitude to all those who contributed, in one way or another, to the preparation and improvement of the book.

A.A. Borovkov

For the Reader’s Attention

The numbering of formulas, lemmas, theorems and corollaries consists of three numbers, of which the first two are the numbers of the current chapter and section. For instance, Theorem 4.3.1 means Theorem 1 from Sect. 3 of Chap. 4. Section 6.2 means Sect. 2 of Chap. 6. The sections marked with an asterisk may be omitted in the first reading. The symbol □ at the end of a paragraph denotes the end of a proof or an important argument, when it should be pointed out that the argument has ended. The symbol :=, systematically used in the book, means that the left-hand side is defined to be given by the right-hand side. The relation =: has the opposite meaning: the right-hand side is defined by the left-hand side. The reader may find it useful to refer to the Index of Basic Notation and the Subject Index, which can be found at the end of this book.

Introduction

1. It is customary to set the origins of Probability Theory at the 17th century and relate them to combinatorial problems of games of chance. The latter can hardly be considered a serious occupation. However, it is games of chance that led to problems which could not be stated and solved within the framework of the then existing mathematical models, and thereby stimulated the introduction of new concepts, approaches and ideas. These new elements can already be encountered in writings by P. Fermat, B. Pascal, C. Huygens and, in a more developed form and somewhat later, in the works of J. Bernoulli, P.-S. Laplace, C.F. Gauss and others. The above-mentioned names undoubtedly decorate the genealogy of Probability Theory which, as we saw, is also related to some extent to the vices of society. Incidentally, as it soon became clear, it is precisely this last circumstance that can make Probability Theory more attractive to the reader.

The first text on Probability Theory was Huygens’ treatise De Ratiociniis in Ludo Aleae (“On Ratiocination in Dice Games”, 1657). A bit later, in 1663, the book Liber de Ludo Aleae (“Book on Games of Chance”) by G. Cardano was published (in fact it was written earlier, in the mid 16th century). The subject of these treatises was the same as in the writings of Fermat and Pascal: dice and card games (problems within the framework of Sect. 1.2 of the present book). As if Huygens foresaw future events, he wrote that if the reader studied the subject closely, he would notice that one was not dealing just with a game here, but rather that the foundations of a very interesting and deep theory were being laid. Huygens’ treatise, which is also known as the first text introducing the concept of mathematical expectation, was later included by J. Bernoulli in his famous book Ars Conjectandi (“The Art of Conjecturing”; published posthumously in 1713). To this book is related the notion of the so-called Bernoulli scheme (see Sect. 1.3), for which Bernoulli gave a cumbersome (cf. our Sect. 5.1) but mathematically faultless proof of the first limit theorem of Probability Theory, the Law of Large Numbers.
By the end of the 19th and the beginning of the 20th centuries, the natural sciences led to the formulation of more serious problems which resulted in the development of a large branch of mathematics that is nowadays called Probability Theory. This subject is still going through a stage of intensive development. To a large extent, Probability Theory owes its elegance, modern form and a multitude of achievements to the remarkable Russian mathematicians P.L. Chebyshev, A.A. Markov, A.N. Kolmogorov and others.

The fact that increasing our knowledge about nature leads to further demand for Probability Theory appears, at first glance, paradoxical. Indeed, as the reader might already know, the main object of the theory is randomness, or uncertainty, which is due, as a rule, to a lack of knowledge. This is certainly so in the classical example of coin tossing, where one cannot take into account all the factors influencing the eventual position of the tossed coin when it lands.

However, this is only an apparent paradox. In fact, there are almost no exact deterministic quantitative laws in nature. Thus, for example, the classical law relating the pressure and temperature in a volume of gas is actually a result of a probabilistic nature that relates the number of collisions of particles with the vessel walls to their velocities. The fact is, at typical temperatures and pressures, the number of particles is so large and their individual contributions are so small that, using conventional instruments, one simply cannot register the random deviations from the relationship which actually take place. This is not the case when one studies more sparse flows of particles—say, cosmic rays—although there is no qualitative difference between these two examples.

We could move in a somewhat different direction and name here the uncertainty principle stating that one cannot simultaneously obtain exact measurements of any two conjugate observables (for example, the position and velocity of an object). Here randomness is not entailed by a lack of knowledge, but rather appears as a fundamental phenomenon reflecting the nature of things. For instance, the lifetime of a radioactive nucleus is essentially random, and this randomness cannot be eliminated by increasing our knowledge.

Thus, uncertainty was there at the very beginning of the cognition process, and it will always accompany us in our quest for knowledge. These are rather general comments, of course, but it appears that the answer to the question of when one should use the methods of Probability Theory and when one should not will always be determined by the relationship between the degree of precision we want to attain when studying a given phenomenon and what we know about the nature of the latter.

2. In almost all areas of human activity there are situations where some experiments or observations can be repeated a large number of times under the same conditions. Probability Theory deals with those experiments of which the result (expressed in one way or another) may vary from trial to trial. The events that refer to the experiment’s result and which may or may not occur are usually called random events.

For example, suppose we are tossing a coin. The experiment has only two outcomes: either heads or tails show up, and before the experiment has been carried out, it is impossible to say which one will occur.
As we have already noted, the reason for this is that we cannot take into account all the factors influencing the final position of the coin. A similar situation will prevail if you buy a ticket for each lottery draw and try to predict whether it will win or not, or, observing the operation of a complex machine, you try to determine in advance if it will have failed before or after a given time. In such situations, it is very hard to find any laws when considering the results of individual experiments. Therefore there is little justification for constructing any theory here.

However, if one turns to a long sequence of repetitions of such an experiment, an interesting phenomenon becomes apparent. While individual results of the experiments display a highly “irregular” behaviour, the average results demonstrate stability. Consider, say, a long series of repetitions of our coin tossing experiment and denote by nh the number of heads in the first n trials. Plot the ratio nh/n versus the number n of conducted experiments (see Fig. 1; the plot corresponds to the outcome sequence htthtthhhthh, where h stands for heads and t for tails, respectively). We will then see that, as n increases, the polygon connecting the consecutive points (n, nh/n) very quickly approaches the straight line nh/n = 1/2.

[Fig. 1: The plot of the relative frequencies nh/n corresponding to the outcome sequence htthtthhhthht in the coin tossing experiment.]

To verify this observation, G.L. Leclerc, comte de Buffon,1 tossed a coin 4040 times. The number of heads was 2048, so that the relative frequency nh/n of heads was 0.5069. K. Pearson tossed a coin 24,000 times and got 12,012 heads, so that nh/n = 0.5005. It turns out that this phenomenon is universal: the relative frequency of a certain outcome in a series of repetitions of an experiment under the same conditions tends towards a certain number p ∈ [0, 1] as the number of repetitions grows. It is an objective law of nature which forms the foundation of Probability Theory.

1 The data is borrowed from.

It would be natural to define the probability of an experiment outcome to be just the number p towards which the relative frequency of the outcome tends. However, such a definition of probability (usually related to the name of R. von Mises) has proven to be inconvenient. First of all, in reality, each time we will be dealing not with an infinite sequence of frequencies, but rather with finitely many elements thereof. Obtaining the entire sequence is unfeasible. Hence the frequency (let it again be nh/n) of the occurrence of a certain outcome will, as a rule, be different for each new series of repetitions of the same experiment. This fact led to intense discussions and a lot of disagreement regarding how one should define the concept of probability.

Fortunately, there was a class of phenomena that possessed certain “symmetry” (in gambling, coin tossing etc.) for which one could compute in advance, prior to the experiment, the expected numerical values of the probabilities. Take, for instance, a cube made of a sufficiently homogeneous material. There are no reasons for the cube to fall on any of its faces more often than on some other face. It is therefore natural to expect that, when rolling a die a large number of times, the frequency of each of its faces will be close to 1/6. Based on these considerations, Laplace believed that the concept of equiprobability is the fundamental one for Probability Theory.
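The stabilisation of relative frequencies is easy to reproduce numerically. The following short simulation (Python; an added sketch, not part of the original text, whose only assumption is a fair coin with heads probability 1/2) tracks nh/n as the number of tosses grows, in the spirit of the Buffon and Pearson counts quoted above.

```python
import random

random.seed(0)          # fix the seed so the run is reproducible
n_heads = 0
checkpoints = {10, 100, 1000, 4040, 24000}

for n in range(1, 24001):
    if random.random() < 0.5:   # a fair coin: heads with probability 1/2
        n_heads += 1
    if n in checkpoints:
        # the relative frequency n_h / n settles near 1/2 as n grows
        print(f"n = {n:6d}   n_h/n = {n_heads / n:.4f}")
```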
Following Laplace, the probability of an event would be defined as the ratio of the number of “favourable” outcomes to the total number of possible outcomes. Thus, the probability of getting an odd number of points (i.e. 1, 3 or 5) when rolling a die once was declared to be 3/6 (i.e. the number of faces with an odd number of points was divided by the total number of all faces). If the die were rolled ten times, then one would have 6^10 in the denominator, as this number gives the total number of equally likely outcomes, and calculating probabilities reduces to counting the number of “favourable outcomes” (the ones resulting in the occurrence of a given event).

The development of the mathematical theory of probabilities began from the instance when one started defining probability as the ratio of the number of favourable outcomes to the total number of equally likely outcomes, and this approach is nowadays called “classical” (for more details, see Chap. 1).

Later on, at the beginning of the 20th century, this approach was severely criticised for being too restrictive. The initiator of the critique was R. von Mises. As we have already noted, his conception was based on postulating stability of the frequencies of events in a long series of experiments. That was a confusion of physical and mathematical concepts. No passage to the limit can serve as justification for introducing the notion of “probability”. If, for instance, the values nh/n were to converge to the limiting value 1/2 in Fig. 1 too slowly, that would mean that nobody would be able to find the value of that limit in the general (non-classical) case. So the approach is clearly vulnerable: it would mean that Probability Theory would be applicable only to those situations where frequencies have a limit. But why the frequencies should have a limit remained unexplained and was not even discussed.

In this connection, R. von Mises’ conception has in turn been criticised by many mathematicians, including A.Ya. Khinchin, S.N. Bernstein, A.N. Kolmogorov and others. Somewhat later, another approach was suggested that proved to be fruitful for the development of the mathematical theory of probabilities. Its general features were outlined by S.N. Bernstein in 1908. In 1933 a rather short book “Foundations of Probability Theory” by A.N. Kolmogorov appeared that contained a complete and clear exposition of the axioms of Probability Theory. The general construction of the concept of probability based on Kolmogorov’s axiomatics removed all the obstacles for the development of the theory and is nowadays universally accepted.

The creation of an axiomatic Probability Theory provided a solution to the sixth Hilbert problem (which concerned, in particular, Probability Theory) that had been formulated by D. Hilbert at the Second International Congress of Mathematicians in Paris in 1900. The problem was on the axiomatic construction of a number of physical sciences, Probability Theory being classified as such by Hilbert at that time. An axiomatic foundation separates the mathematical aspect from the physical: one no longer needs to explain how and where the concept of probability comes from. The concept simply becomes a primitive one, its properties being described by axioms (which are essentially the axioms of Measure Theory). However, the problem of how the probability thus introduced is related (and can be applied) to the real world remains open.
But this problem is mostly removed by the remarkable fact that, under the axiomatic construction, the desired fundamental property that the frequencies of the occurrence of an event converge to the probability of the event does take place and is a precise mathematical result. (For more details, see Chaps. 2 and 5.)2 We will begin by defining probability in a somewhat simplified situation, in the so-called discrete case. 2 Much later, in the 1960s A.N. Kolmogorov attempted to develop a fundamentally different ap- proach to the notions of probability and randomness. In that approach, the measure of randomness, say, of a sequence 0, 1, 0, 0, 1,... consisting of 0s and 1s (or some other symbols) is the complex- ity of the algorithm describing this sequence. The new approach stimulated the development of a number of directions in contemporary mathematics, but, mostly due to its complexity, has not yet become widely accepted. Contents 1 Discrete Spaces of Elementary Events................. 1 1.1 Probability Space.......................... 1 1.2 The Classical Scheme........................ 4 1.3 The Bernoulli Scheme........................ 6 1.4 The Probability of the Union of Events. Examples......... 9 2 An Arbitrary Space of Elementary Events............... 13 2.1 The Axioms of Probability Theory. A Probability Space...... 13 2.2 Properties of Probability....................... 20 2.3 Conditional Probability. Independence of Events and Trials.... 21 2.4 The Total Probability Formula. The Bayes Formula........ 25 3 Random Variables and Distribution Functions............. 31 3.1 Definitions and Examples...................... 31 3.2 Properties of Distribution Functions. Examples........... 33 3.2.1 The Basic Properties of Distribution Functions....... 33 3.2.2 The Most Common Distributions.............. 37 3.2.3 The Three Distribution Types................ 39 3.2.4 Distributions of Functions of Random Variables...... 42 3.3 Multivariate Random Variables................... 44 3.4 Independence of Random Variables and Classes of Events..... 48 3.4.1 Independence of Random Vectors.............. 48 3.4.2 Independence of Classes of Events............. 50 3.4.3 Relations Between the Introduced Notions......... 52 3.5 On Infinite Sequences of Random Variables............ 56 3.6 Integrals............................... 56 3.6.1 Integral with Respect to Measure.............. 56 3.6.2 The Stieltjes Integral..................... 57 3.6.3 Integrals of Multivariate Random Variables. The Distribution of the Sum of Independent Random Variables...................... 59 xix xx Contents 4 Numerical Characteristics of Random Variables........... 65 4.1 Expectation............................. 65 4.2 Conditional Distribution Functions and Conditional Expectations............................. 70 4.3 Expectations of Functions of Independent Random Variables... 74 4.4 Expectations of Sums of a Random Number of Random Variables............................... 75 4.5 Variance............................... 83 4.6 The Correlation Coefficient and Other Numerical Characteristics............................ 85 4.7 Inequalities.............................. 87 4.7.1 Moment Inequalities..................... 87 4.7.2 Inequalities for Probabilities................. 89 4.8 Extension of the Notion of Conditional Expectation........ 91 4.8.1 Definition of Conditional Expectation............ 91 4.8.2 Properties of Conditional Expectations........... 95 4.9 Conditional Distributions...................... 
99 5 Sequences of Independent Trials with Two Outcomes......... 107 5.1 Laws of Large Numbers....................... 107 5.2 The Local Limit Theorem and Its Refinements........... 109 5.2.1 The Local Limit Theorem.................. 109 5.2.2 Refinements of the Local Theorem............. 111 5.2.3 The Local Limit Theorem for the Polynomial Distributions......................... 114 5.3 The de Moivre–Laplace Theorem and Its Refinements....... 114 5.4 The Poisson Theorem and Its Refinements............. 117 5.4.1 Quantifying the Closeness of Poisson Distributions to Those of the Sums Sn.................... 117 5.4.2 The Triangular Array Scheme. The Poisson Theorem... 120 5.5 Inequalities for Large Deviation Probabilities in the Bernoulli Scheme................................ 125 6 On Convergence of Random Variables and Distributions....... 129 6.1 Convergence of Random Variables................. 129 6.1.1 Types of Convergence.................... 129 6.1.2 The Continuity Theorem................... 134 6.1.3 Uniform Integrability and Its Consequences........ 134 6.2 Convergence of Distributions.................... 140 6.3 Conditions for Weak Convergence................. 147 7 Characteristic Functions......................... 153 7.1 Definition and Properties of Characteristic Functions........ 153 7.1.1 Properties of Characteristic Functions............ 154 7.1.2 The Properties of Ch.F.s Related to the Structure of the Distribution of ξ....................... 159 7.2 Inversion Formulas.......................... 161 Contents xxi 7.2.1 The Inversion Formula for Densities............ 161 7.2.2 The Inversion Formula for Distributions.......... 163 7.2.3 The Inversion Formula in L2. The Class of Functions that Are Both Densities and Ch.F.s................ 164 7.3 The Continuity (Convergence) Theorem.............. 167 7.4 The Application of Characteristic Functions in the Proof of the Poisson Theorem........................... 169 7.5 Characteristic Functions of Multivariate Distributions. The Multivariate Normal Distribution................ 171 7.6 Other Applications of Characteristic Functions. The Properties of the Gamma Distribution....................... 175 7.6.1 Stability of the Distributions α,σ 2 and Kα,σ........ 175 7.6.2 The Ŵ-distribution and its properties............ 176 7.7 Generating Functions. Application to Branching Processes. A Problem on Extinction...................... 180 7.7.1 Generating Functions.................... 180 7.7.2 The Simplest Branching Processes............. 180 8 Sequences of Independent Random Variables. Limit Theorems... 185 8.1 The Law of Large Numbers..................... 185 8.2 The Central Limit Theorem for Identically Distributed Random Variables............................... 187 8.3 The Law of Large Numbers for Arbitrary Independent Random Variables............................... 188 8.4 The Central Limit Theorem for Sums of Arbitrary Independent Random Variables.......................... 199 8.5 Another Approach to Proving Limit Theorems. Estimating Approximation Rates........................ 209 8.6 The Law of Large Numbers and the Central Limit Theorem in the Multivariate Case.......................... 214 8.7 Integro-Local and Local Limit Theorems for Sums of Identically Distributed Random Variables with Finite Variance......... 216 8.7.1 Integro-Local Theorems................... 216 8.7.2 Local Theorems....................... 219 8.7.3 The Proof of Theorem 8.7.1 in the General Case...... 
222 8.7.4 Uniform Versions of Theorems 8.7.1–8.7.3 for Random Variables Depending on a Parameter............ 225 8.8 Convergence to Other Limiting Laws................ 227 8.8.1 The Integral Theorem.................... 230 8.8.2 The Integro-Local and Local Theorems........... 235 8.8.3 An Example......................... 236 9 Large Deviation Probabilities for Sums of Independent Random Variables................................. 239 9.1 Laplace’s and Cramér’s Transforms. The Rate Function...... 240 xxii Contents 9.1.1 The Cramér Condition. Laplace’s and Cramér’s Transforms.......................... 240 9.1.2 The Large Deviation Rate Function............. 243 9.2 A Relationship Between Large Deviation Probabilities for Sums of Random Variables and Those for Sums of Their Cramér Transforms. The Probabilistic Meaning of the Rate Function.... 250 9.2.1 A Relationship Between Large Deviation Probabilities for Sums of Random Variables and Those for Sums of Their Cramér Transforms..................... 250 9.2.2 The Probabilistic Meaning of the Rate Function...... 251 9.2.3 The Large Deviations Principle............... 254 9.3 Integro-Local, Integral and Local Theorems on Large Deviation Probabilities in the Cramér Range.................. 256 9.3.1 Integro-Local and Integral Theorems............ 256 9.3.2 Local Theorems....................... 261 9.4 Integro-Local Theorems at the Boundary of the Cramér Range... 264 9.4.1 Introduction......................... 264 9.4.2 The Probabilities of Large Deviations of Sn in an o(n)-Vicinity of the Point α+ n; the Case ψ ′′ (λ+ ) < ∞... 264 9.4.3 The Class of Distributions ER. The Probability of Large Deviations of Sn in an o(n)-Vicinity of the Point α+ n for Distributions F from the Class ER in Case ψ ′′ (λ+ )=∞.. 266 9.4.4 On the Large Deviation Probabilities in the Range α > α+ for Distributions from the Class ER............. 269 9.5 Integral and Integro-Local Theorems on Large Deviation Probabilities for Sums Sn when the Cramér Condition Is not Met. 269 9.5.1 Integral Theorems...................... 270 9.5.2 Integro-Local Theorems................... 271 9.6 Integro-Local Theorems on the Probabilities of Large Deviations of Sn Outside the Cramér Range (Under the Cramér Condition).. 274 10 Renewal Processes............................ 277 10.1 Renewal Processes. Renewal Functions............... 277 10.1.1 Introduction......................... 277 10.1.2 The Integral Renewal Theorem for Non-identically Distributed Summands.................... 280 10.2 The Key Renewal Theorem in the Arithmetic Case......... 285 10.3 The Excess and Defect of a Random Walk. Their Limiting Distribution in the Arithmetic Case................. 290 10.4 The Renewal Theorem and the Limiting Behaviour of the Excess and Defect in the Non-arithmetic Case............... 293 10.5 The Law of Large Numbers and the Central Limit Theorem for Renewal Processes.......................... 298 10.5.1 The Law of Large Numbers................. 298 10.5.2 The Central Limit Theorem................. 299 Contents xxiii 10.5.3 A Theorem on the Finiteness of the Infimum of the Cumulative Sums...................... 300 10.5.4 Stochastic Inequalities. The Law of Large Numbers and the Central Limit Theorem for the Maximum of Sums of Non-identically Distributed Random Variables Taking Values of Both Signs..................... 302 10.5.5 Extension of Theorems 10.5.1 and 10.5.2 to Random Variables Assuming Values of Both Signs.......... 304 10.5.6 The Local Limit Theorem.................. 
306 10.6 Generalised Renewal Processes................... 307 10.6.1 Definition and Some Properties............... 307 10.6.2 The Central Limit Theorem................. 309 10.6.3 The Integro-Local Theorem................. 311 11 Properties of the Trajectories of Random Walks. Zero-One Laws.. 315 11.1 Zero-One Laws. Upper and Lower Functions............ 315 11.1.1 Zero-One Laws....................... 315 11.1.2 Lower and Upper Functions................. 318 11.2 Convergence of Series of Independent Random Variables..... 320 11.3 The Strong Law of Large Numbers................. 323 11.4 The Strong Law of Large Numbers for Arbitrary Independent Variables............................... 326 11.5 The Strong Law of Large Numbers for Generalised Renewal Processes............................... 330 11.5.1 The Strong Law of Large Numbers for Renewal Processes. 330 11.5.2 The Strong Law of Large Numbers for Generalised Renewal Processes...................... 331 12 Random Walks and Factorisation Identities.............. 333 12.1 Factorisation Identities........................ 333 12.1.1 Factorisation......................... 333 12.1.2 The Canonical Factorisation of the Function fz (λ) = 1 − zϕ(λ)...................... 335 12.1.3 The Second Factorisation Identity.............. 336 12.2 Some Consequences of Theorems 12.1.1–12.1.3.......... 340 12.2.1 Direct Consequences.................... 340 12.2.2 A Generalisation of the Strong Law of Large Numbers... 343 12.3 Pollaczek–Spitzer’s Identity. An Identity for S = supk≥0 Sk.... 344 12.3.1 Pollaczek–Spitzer’s Identity................. 345 12.3.2 An Identity for S = supk≥0 Sk................ 347 12.4 The Distribution of S in Insurance Problems and Queueing Theory................................ 348 12.4.1 Random Walks in Risk Theory............... 348 12.4.2 Queueing Systems...................... 349 12.4.3 Stochastic Models in Continuous Time........... 350 xxiv Contents 12.5 Cases Where Factorisation Components Can Be Found in an Explicit Form. The Non-lattice Case................ 351 12.5.1 Preliminary Notes on the Uniqueness of Factorisation... 351 12.5.2 Classes of Distributions on the Positive Half-Line with Rational Ch.F.s........................ 354 12.5.3 Explicit Canonical Factorisation of the Function v(λ) in the Case when the Right Tail of the Distribution F Is an Exponential Polynomial................... 355 12.5.4 Explicit Factorisation of the Function v(λ) when the Left Tail of the Distribution F Is an Exponential Polynomial.. 361 12.5.5 Explicit Canonical Factorisation for the Function v0 (λ).. 362 12.6 Explicit Form of Factorisation in the Arithmetic Case....... 364 12.6.1 Preliminary Remarks on the Uniqueness of Factorisation. 365 12.6.2 The Classes of Distributions on the Positive Half-Line with Rational Generating Functions............. 366 12.6.3 Explicit Canonical Factorisation of the Function v(z) in the Case when the Right Tail of the Distribution F Is an Exponential Polynomial................... 367 12.6.4 Explicit Canonical Factorisation of the Function v(z) when the Left Tail of the Distribution F Is an Exponential Polynomial.......................... 370 12.6.5 Explicit Factorisation of the Function v0 (z)......... 371 12.7 Asymptotic Properties of the Distributions of χ± and S...... 372 12.7.1 The Asymptotics of P(χ+ > x | η+ < ∞) and P(χ−0 < −x) in the Case Eξ ≤ 0...................... 373 12.7.2 The Asymptotics of P(S > x)................ 
376 12.7.3 The Distribution of the Maximal Values of Generalised Renewal Processes...................... 380 12.8 On the Distribution of the First Passage Time............ 381 12.8.1 The Properties of the Distributions of the Times η±.... 381 12.8.2 The Distribution of the First Passage Time of an Arbitrary Level x by Arithmetic Skip-Free Walks........... 384 13 Sequences of Dependent Trials. Markov Chains............ 389 13.1 Countable Markov Chains. Definitions and Examples. Classification of States........................ 389 13.1.1 Definition and Examples................... 389 13.1.2 Classification of States.................... 392 13.2 Necessary and Sufficient Conditions for Recurrence of States. Types of States in an Irreducible Chain. The Structure of a Periodic Chain............................ 395 13.3 Theorems on Random Walks on a Lattice.............. 398 13.3.1 Symmetric Random Walks in Rk , k ≥ 2........... 400 13.3.2 Arbitrary Symmetric Random Walks on the Line...... 401 13.4 Limit Theorems for Countable Homogeneous Chains....... 404 Contents xxv 13.4.1 Ergodic Theorems...................... 404 13.4.2 The Law of Large Numbers and the Central Limit Theorem for the Number of Visits to a Given State..... 412 13.5 The Behaviour of Transition Probabilities for Reducible Chains.. 412 13.6 Markov Chains with Arbitrary State Spaces. Ergodicity of Chains with Positive Atoms......................... 414 13.6.1 Markov Chains with Arbitrary State Spaces......... 414 13.6.2 Markov Chains Having a Positive Atom.......... 420 13.7 Ergodicity of Harris Markov Chains................. 423 13.7.1 The Ergodic Theorem.................... 423 13.7.2 On Conditions (I) and (II).................. 429 13.8 Laws of Large Numbers and the Central Limit Theorem for Sums of Random Variables Defined on a Markov Chain......... 436 13.8.1 Random Variables Defined on a Markov Chain....... 436 13.8.2 Laws of Large Numbers................... 437 13.8.3 The Central Limit Theorem................. 443 14 Information and Entropy........................ 447 14.1 The Definitions and Properties of Information and Entropy.... 447 14.2 The Entropy of a Finite Markov Chain. A Theorem on the Asymptotic Behaviour of the Information Contained in a Long Message; Its Applications...................... 452 14.2.1 The Entropy of a Sequence of Trials Forming a Stationary Markov Chain........................ 452 14.2.2 The Law of Large Numbers for the Amount of Information Contained in a Message................... 453 14.2.3 The Asymptotic Behaviour of the Number of the Most Common Outcomes in a Sequence of Trials........ 454 15 Martingales................................ 457 15.1 Definitions, Simplest Properties, and Examples........... 457 15.2 The Martingale Property and Random Change of Time. Wald’s Identity................................ 462 15.3 Inequalities.............................. 477 15.3.1 Inequalities for Martingales................. 477 15.3.2 Inequalities for the Number of Crossings of a Strip..... 481 15.4 Convergence Theorems....................... 482 15.5 Boundedness of the Moments of Stochastic Sequences....... 487 16 Stationary Sequences........................... 493 16.1 Basic Notions............................ 493 16.2 Ergodicity (Metric Transitivity), Mixing and Weak Dependence.. 497 16.3 The Ergodic Theorem........................ 502 17 Stochastic Recursive Sequences..................... 507 17.1 Basic Concepts............................ 507 17.2 Ergodicity and Renovating Events. 
Boundedness Conditions.... 508 xxvi Contents 17.2.1 Ergodicity of Stochastic Recursive Sequences....... 508 17.2.2 Boundedness of Random Sequences............ 514 17.3 Ergodicity Conditions Related to the Monotonicity of f...... 516 17.4 Ergodicity Conditions for Contracting in Mean Lipschitz Transformations........................... 518 18 Continuous Time Random Processes.................. 527 18.1 General Definitions......................... 527 18.2 Criteria of Regularity of Processes................. 532 19 Processes with Independent Increments................ 539 19.1 General Properties.......................... 539 19.2 Wiener Processes. The Properties of Trajectories.......... 542 19.3 The Laws of the Iterated Logarithm................. 545 19.4 The Poisson Process......................... 549 19.5 Description of the Class of Processes with Independent Increments.............................. 552 20 Functional Limit Theorems....................... 559 20.1 Convergence to the Wiener Process................. 559 20.2 The Law of the Iterated Logarithm................. 568 20.3 Convergence to the Poisson Process................. 572 20.3.1 Convergence of the Processes of Cumulative Sums..... 572 20.3.2 Convergence of Sums of Thinning Renewal Processes... 575 21 Markov Processes............................. 579 21.1 Definitions and General Properties................. 579 21.1.1 Definition and Basic Properties............... 579 21.1.2 Transition Probability.................... 581 21.2 Markov Processes with Countable State Spaces. Examples..... 583 21.2.1 Basic Properties of the Process............... 583 21.2.2 Examples........................... 589 21.3 Branching Processes......................... 591 21.4 Semi-Markov Processes....................... 593 21.4.1 Semi-Markov Processes on the States of a Chain...... 593 21.4.2 The Ergodic Theorem.................... 594 21.4.3 Semi-Markov Processes on Chain Transitions....... 597 21.5 Regenerative Processes....................... 600 21.5.1 Regenerative Processes. The Ergodic Theorem....... 600 21.5.2 The Laws of Large Numbers and Central Limit Theorem for Integrals of Regenerative Processes........... 601 21.6 Diffusion Processes......................... 603 22 Processes with Finite Second Moments. Gaussian Processes..... 611 22.1 Processes with Finite Second Moments............... 611 22.2 Gaussian Processes......................... 614 22.3 Prediction Problem......................... 616 Contents xxvii Appendix 1 Extension of a Probability Measure.............. 619 Appendix 2 Kolmogorov’s Theorem on Consistent Distributions.... 625 Appendix 3 Elements of Measure Theory and Integration........ 629 3.1 Measure Spaces........................... 629 3.2 The Integral with Respect to a Probability Measure......... 630 3.2.1 The Integrals of a Simple Function............. 630 3.2.2 The Integrals of an Arbitrary Function........... 631 3.2.3 Properties of Integrals.................... 634 3.3 Further Properties of Integrals.................... 635 3.3.1 Convergence Theorems................... 635 3.3.2 Connection to Integration with Respect to a Measure on the Real Line......................... 636 3.3.3 Product Measures and Iterated Integrals........... 638 3.4 The Integral with Respect to an Arbitrary Measure......... 640 3.5 The Lebesgue Decomposition Theorem and the Radon–Nikodym Theorem............................... 643 3.6 Weak Convergence and Convergence in Total Variation of Distributions in Arbitrary Spaces.................. 
649 3.6.1 Weak Convergence..................... 649 3.6.2 Convergence in Total Variation............... 652 Appendix 4 The Helly and Arzelà–Ascoli Theorems........... 655 Appendix 5 The Proof of the Berry–Esseen Theorem........... 659 Appendix 6 The Basic Properties of Regularly Varying Functions and Subexponential Distributions...................... 665 6.1 General Properties of Regularly Varying Functions......... 665 6.2 The Basic Asymptotic Properties.................. 668 6.3 The Asymptotic Properties of the Transforms of R.V.F.s (Abel-Type Theorems)........................ 672 6.4 Subexponential Distributions and Their Properties......... 674 Appendix 7 The Proofs of Theorems on Convergence to Stable Laws.. 687 7.1 The Integral Limit Theorem..................... 687 7.2 The Integro-Local and Local Limit Theorems............ 699 Appendix 8 Upper and Lower Bounds for the Distributions of the Sums and the Maxima of the Sums of Independent Random Variables... 703 8.1 Upper Bounds Under the Cramér Condition............ 703 8.2 Upper Bounds when the Cramér Condition Is Not Met....... 704 8.3 Lower Bounds............................ 713 Appendix 9 Renewal Theorems....................... 715 References................................... 723 Index of Basic Notation............................ 725 Subject Index.................................. 727

Chapter 1 Discrete Spaces of Elementary Events

Abstract Section 1.1 introduces the fundamental concept of probability space, along with some basic terminology and properties of probability when it is easy to do, i.e. in the simple case of random experiments with finitely or at most countably many outcomes. The classical scheme of finitely many equally likely outcomes is discussed in more detail in Sect. 1.2. Then the Bernoulli scheme is introduced and the properties of the binomial distribution are studied in Sect. 1.3. Sampling without replacement from a large population is considered, and convergence of the emerging hypergeometric distributions to the binomial one is formally proved. The inclusion-exclusion formula for the probabilities of unions of events is derived and illustrated by some applications in Sect. 1.4.

1.1 Probability Space

To mathematically describe experiments with random outcomes, we will first of all need the notion of the space of elementary events (or outcomes) corresponding to the experiment under consideration. We will denote by Ω any set such that each result of the experiment we are interested in can be uniquely specified by the elements of Ω.

In the simplest experiments we usually deal with finite spaces of elementary outcomes. In the coin tossing example we considered above, Ω consists of two elements, “heads” and “tails”. In the die rolling experiment, the space Ω is also finite and consists of 6 elements. However, even for tossing a coin (or rolling a die) one can arrange such experiments for which finite spaces of elementary events will not suffice. For instance, consider the following experiment: a coin is tossed until heads shows for the first time, and then the experiment is stopped. If t designates tails in a toss and h heads, then an “elementary outcome” of the experiment can be represented by a sequence (tt... th). There are infinitely many such sequences, and all of them are different, so there is no way to describe unambiguously all the outcomes of the experiment by elements of a finite space.
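As a concrete picture of this infinite but countable outcome space, the following sketch (Python; an added illustration, not part of the book) lists the first outcomes of the form (t...th) together with the probabilities 2^{-n} that the text assigns to them a little further on, and checks that these probabilities accumulate to 1.

```python
from fractions import Fraction

# Outcomes of "toss until the first heads": h, th, tth, ...
# The outcome with n-1 tails followed by heads gets probability 2^{-n},
# as assigned to this experiment in the next paragraphs.
outcomes = [("t" * (n - 1) + "h", Fraction(1, 2 ** n)) for n in range(1, 11)]

for seq, prob in outcomes[:5]:
    print(f"{seq:>10s}  P = {prob}")

# The total probability of the first 10 outcomes is already 1 - 2^{-10}
print("sum of first 10 probabilities:", sum(prob for _, prob in outcomes))
```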
Consider finite or countably infinite spaces of elementary events Ω. These are the so-called discrete spaces. We will denote the elements of a space Ω by the letter ω and call them elementary events (or elementary outcomes).

The notion of the space of elementary events itself is mathematically undefinable: it is a primitive one, like the notion of a point in geometry. The specific nature of Ω will, as a rule, be of no interest to us.

Any subset A ⊆ Ω will be called an event (the event A occurs if any of the elementary outcomes ω ∈ A occurs).

The union or sum of two events A and B is the event A ∪ B (which may also be denoted by A + B) consisting of the elementary outcomes which belong to at least one of the events A and B. The product or intersection AB (which is often denoted by A ∩ B as well) is the event consisting of all elementary events belonging to both A and B. The difference of the events A and B is the set A − B (also often denoted by A \ B) consisting of all elements of A not belonging to B. The set Ω is called the certain event. The empty set ∅ is called the impossible event. The set Ā = Ω − A is called the complementary event of A. Two events A and B are mutually exclusive if AB = ∅.

Let, for instance, our experiment consist in rolling a die twice. Here one can take the space of elementary events to be the set consisting of 36 elements (i, j), where i and j run from 1 to 6 and denote the numbers of points that show up in the first and second roll respectively. The events A = {i + j ≤ 3} and B = {j = 6} are mutually exclusive. The product of the events A and C = {j is even} is the event (1, 2). Note that if we were interested in the events related to the first roll only, we could consider a smaller space of elementary events consisting of just 6 elements i = 1, 2, ..., 6.

One says that the probabilities of elementary events are given if a nonnegative real-valued function P is given on Ω such that Σ_{ω∈Ω} P(ω) = 1 (one also says that the function P specifies a probability distribution on Ω).

The probability of an event A is the number

P(A) := Σ_{ω∈A} P(ω).

This definition is consistent, for the series on the right hand side is absolutely convergent.

We note here that specific numerical values of the function P will also be of no interest to us: this is just an issue of the practical value of the model. For instance, it is clear that, in the case of a symmetric die, for the outcomes 1, 2, ..., 6 one should put P(1) = P(2) = · · · = P(6) = 1/6; for a symmetric coin, one has to choose the values P(h) = P(t) = 1/2 and not any others. In the experiment of tossing a coin until heads shows for the first time, one should put P(h) = 1/2, P(th) = 1/2^2, P(tth) = 1/2^3, .... Since Σ_{n=1}^∞ 2^{-n} = 1, the function P given in this way on the outcomes of the form (t... th) will define a probability distribution on Ω. For example, to calculate the probability that the experiment stops on an even step (that is, the probability of the event composed of the outcomes (th), (ttth), ...), one should consider the sum of the corresponding probabilities, which is equal to

Σ_{n=1}^∞ 2^{-2n} = (1/4) × (4/3) = 1/3.
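Both small calculations in this passage can be checked mechanically. The sketch below (Python; an added illustration, not from the book) verifies the set relations in the two-dice example and the value 1/3 for the probability that the coin tossing experiment stops at an even step.

```python
from fractions import Fraction

# The two-dice example: Omega = {(i, j): 1 <= i, j <= 6}
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
A = {(i, j) for (i, j) in omega if i + j <= 3}
B = {(i, j) for (i, j) in omega if j == 6}
C = {(i, j) for (i, j) in omega if j % 2 == 0}

print("A and B mutually exclusive:", A & B == set())   # True
print("A * C =", A & C)                                 # {(1, 2)}

# Tossing a coin until the first heads: the experiment stops at an even
# step with probability sum_{n>=1} 2^{-2n} = 1/3 (truncated series below).
p_even = sum(Fraction(1, 4) ** n for n in range(1, 60))
print("P(stop at an even step) ~", float(p_even))       # ~0.3333
```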
In the experiments mentioned in the Introduction, where one had to guess when a device will break down—before a given time (the event A) or after it, quantitative estimates of the probability P(A) can usually only be based on the results of the experiments themselves. The methods of estimating unknown probabilities from observation results are studied in Mathematical Statistics, the subject-matter of which will be exemplified somewhat later by a problem from this chapter.

Note further that by no means can one construct models with discrete spaces of elementary events for all experiments. For example, suppose that one is measuring the energy of particles whose possible values fill the interval [0, V], V > 0, but the set of points of this interval (that is, the set of elementary events) is continuous. Or suppose that the result of an experiment is a patient’s electrocardiogram. In this case, the result of the experiment is an element of some functional space. In such cases, more general schemes are needed.

From the above definitions, making use of the absolute convergence of the series Σ_{ω∈A} P(ω), one can easily derive the following properties of probability:

(1) P(∅) = 0, P(Ω) = 1.
(2) P(A + B) = Σ_{ω∈A∪B} P(ω) = Σ_{ω∈A} P(ω) + Σ_{ω∈B} P(ω) − Σ_{ω∈A∩B} P(ω) = P(A) + P(B) − P(AB).
(3) P(Ā) = 1 − P(A).

This entails, in particular, that, for disjoint (mutually exclusive) events A and B,

P(A + B) = P(A) + P(B).

This property of the additivity of probability continues to hold for an arbitrary number of disjoint events A1, A2, ...: if A_i A_j = ∅ for i ≠ j, then

P(⋃_{k=1}^∞ A_k) = Σ_{k=1}^∞ P(A_k).   (1.1.1)

This follows from the equality P(⋃_{k=1}^n A_k) = Σ_{k=1}^n P(A_k) and the fact that P(⋃_{k>n} A_k) → 0 as n → ∞. To prove the last relation, first enumerate the elementary events. Then we will be dealing with the sequence ω_1, ω_2, ...; ⋃_k ω_k = Ω, P(⋃_{k>n} ω_k) = Σ_{k>n} P(ω_k) → 0 as n → ∞. Denote by n_k the number of the event A_j such that ω_k ∈ A_j = A_{n_k}; put n_k = 0 if ω_k ∉ A_j for all j. If n_k ≤ N < ∞ for all k, then the events A_j with j > N are empty and the desired relation is obvious. If N_s := max_{k≤s} n_k → ∞ as s → ∞, then one has ⋃_{j>n} A_j ⊂ ⋃_{k>s} ω_k for n > N_s, and therefore

P(⋃_{j>n} A_j) ≤ P(⋃_{k>s} ω_k) = Σ_{k>s} P(ω_k) → 0 as s → ∞.

The required relation is proved.

For arbitrary A and B, one has P(A + B) ≤ P(A) + P(B). A similar inequality also holds for the sum of an arbitrary number of events:

P(⋃_{k=1}^∞ A_k) ≤ Σ_{k=1}^∞ P(A_k).

This follows from (1.1.1) and the representation of ⋃ A_k as the union ⋃ A_k B̄_k of disjoint events A_k B̄_k, where B_k = ⋃_{j<k} A_j.

In the classical scheme of Sect. 1.2, where a sample of size k is drawn without replacement from a population of n balls of which n1 are black, the number of samples of size k which differ in composition and contain exactly k1 black balls is C(n1, k1) C(n − n1, k − k1), where C(n, k) = n!/(k!(n − k)!) denotes the number of ways of choosing k items out of n. Thus the desired probability is equal to

P_{n1,n}(k1, k) = C(n1, k1) C(n − n1, k − k1) / C(n, k).

The collection of numbers P_{n1,n}(0, k), P_{n1,n}(1, k), ..., P_{n1,n}(k, k) forms the so-called hypergeometric distribution. From the derived formula it follows, in particular, that, for any 0 < n1 < n,

Σ_{k1=0}^{k} C(n1, k1) C(n − n1, k − k1) = C(n, k).

Example 1.2.1 In the 1980s, a version of a lottery called “Sportloto 6 out of 49” became rather popular in Russia. A gambler chooses six from the totality of 49 sports (designated just by numbers). The prize amount is determined by how many sports he guesses correctly from another group of six sports, to be drawn at random by a mechanical device in front of the public.
Example 1.2.1 In the 1980s, a version of a lottery called "Sportloto 6 out of 49" had become rather popular in Russia. A gambler chooses six from the totality of 49 sports (designated just by numbers). The prize amount is determined by how many sports he guesses correctly from another group of six sports, to be drawn at random by a mechanical device in front of the public.

What is the probability that the gambler correctly guesses all six sports? A similar question could be asked about five sports, and so on.

It is not difficult to see that this is nothing else but a problem on the hypergeometric distribution where the gambler has labelled as "white" six items in a general population consisting of 49 items. Therefore the probability that, of the six items chosen at random, $k_1$ will turn out to be "white" (i.e. will coincide with those labelled by the gambler) is equal to $P_{6,49}(k_1, k)$, where the sample size k equals 6. For example, the probability of guessing all six sports correctly is
$$P_{6,49}(6, 6) = \binom{49}{6}^{-1} \approx 7.2\times 10^{-8}.$$

In connection with the hypergeometric distribution, one could comment on the nature of problems in Probability Theory and Mathematical Statistics. Knowing the composition of the general population, we can use the hypergeometric distribution to find out what chances different compositions of the sample would have. This is a typical direct problem of probability theory. However, in the natural sciences one usually has to solve inverse problems: how to determine the nature of general populations from the composition of random samples. Generally speaking, such inverse problems form the subject matter of Mathematical Statistics.

1.3 The Bernoulli Scheme

Suppose one draws a sample with replacement of size r from a general population consisting of two elements {0, 1}. There are $2^r$ such samples. Let p be a number in the interval [0, 1]. Define a nonnegative function P on the set Ω of all samples in the following way: if a sample ω contains exactly k ones, then $P(\omega) = p^k(1-p)^{r-k}$.

To verify that P is a probability, one has to prove the equality P(Ω) = 1. It is easy to see that k ones can be arranged in r places in $\binom{r}{k}$ different ways. Therefore there is the same number of samples containing exactly k ones. Now we can compute the probability of Ω:
$$P(\Omega) = \sum_{k=0}^{r}\binom{r}{k}p^k(1-p)^{r-k} = \bigl(p + (1-p)\bigr)^r = 1.$$
The second equality here is just the binomial formula. At the same time we have found that the probability P(k, r) that the sample contains exactly k ones is
$$P(k, r) = \binom{r}{k}p^k(1-p)^{r-k}.$$
This is the so-called binomial distribution. It can be considered as the distribution of the number of "successes" in a series of r trials with two possible outcomes in each trial: 1 ("success") and 0 ("failure"). Such a series of trials with probability P(ω) defined as $p^k(1-p)^{r-k}$, where k is the number of successes in ω, is called the Bernoulli scheme. It turns out that the trials in the Bernoulli scheme have the independence property, which will be discussed in the next chapter.

It is not difficult to verify that the probability of having 1 at a fixed place in the sample (say, at position s) equals p. Indeed, having removed the item number s from the sample, we obtain a sample from the same population, but of size r − 1. We will find the desired probability if we multiply the probabilities of these truncated samples by p and sum over all "short" samples. Clearly, we will get p. This is why the number p in the Bernoulli scheme is often called the success probability. Arguing in the same way, we find that the probability of having 1 at k fixed positions in the sample equals $p^k$.
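A small Python sketch of the Bernoulli scheme just described (the names and the parameter values r = 6, p = 0.3 are our illustrative choices). It builds the probability $P(\omega) = p^k(1-p)^{r-k}$ for every sample ω ∈ {0, 1}^r, checks that P(Ω) = 1, recovers the binomial probabilities P(k, r), and confirms that the probability of a success at a fixed position equals p.

```python
from itertools import product
from math import comb, isclose

p, r = 0.3, 6

def prob(omega) -> float:
    """P(omega) = p^k (1-p)^(r-k), where k is the number of ones in omega."""
    k = sum(omega)
    return p ** k * (1 - p) ** (r - k)

samples = list(product((0, 1), repeat=r))            # the 2^r elementary outcomes
assert isclose(sum(prob(w) for w in samples), 1.0)   # P(Omega) = 1

# Binomial probabilities P(k, r) obtained by summing over outcomes with k ones.
for k in range(r + 1):
    p_k = sum(prob(w) for w in samples if sum(w) == k)
    assert isclose(p_k, comb(r, k) * p ** k * (1 - p) ** (r - k))

# Probability of a "success" (a one) at a fixed position s equals p.
s = 2
print(sum(prob(w) for w in samples if w[s] == 1))    # 0.3
```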
Now consider how the probabilities P(k, r) of various outcomes behave as k varies. Let us look at the ratio
$$R(k, r) := \frac{P(k,r)}{P(k-1,r)} = \frac{p}{1-p}\,\frac{r-k+1}{k} = \frac{p}{1-p}\Big(\frac{r+1}{k}-1\Big).$$
It clearly monotonically decreases as k increases, the value of the ratio being greater than 1 for k/(r + 1) < p and less than 1 for k/(r + 1) > p. This means that the probabilities P(k, r) first increase and then, for k > p(r + 1), decrease as k increases.

The above enables one to estimate, using the quantities P(k, r), the probabilities
$$Q(k, r) = \sum_{j=0}^{k} P(j, r)$$
that the number of successes in the Bernoulli scheme does not exceed k. Namely, for k < p(r + 1),
$$Q(k, r) = P(k, r)\Big(1 + \frac{1}{R(k,r)} + \frac{1}{R(k,r)R(k-1,r)} + \cdots\Big) \le P(k, r)\,\frac{R(k,r)}{R(k,r)-1} = P(k, r)\,\frac{(r+1-k)p}{(r+1)p-k}.$$
It is not difficult to see that this bound will be rather sharp if the numbers k and r are large and the ratio k/(pr) is not too close to 1. In that case the sum
$$1 + \frac{1}{R(k,r)} + \frac{1}{R(k,r)R(k-1,r)} + \cdots$$
will be close to the sum of the geometric series
$$\sum_{j=0}^{\infty} R^{-j}(k, r) = \frac{R(k,r)}{R(k,r)-1},$$
and we will have the approximate equality
$$Q(k, r) \approx P(k, r)\,\frac{(r+1-k)p}{(r+1)p-k}. \qquad (1.3.1)$$
For example, for r = 30, p = 0.7 and k = 16 one has rp = 21 and P(k, r) ≈ 0.023. Here the ratio $(r+1-k)p/((r+1)p-k)$ equals 15 × 0.7/5.7 ≈ 1.84. Hence the right hand side of (1.3.1) estimating Q(k, r) is approximately equal to 0.023 × 1.84 ≈ 0.042. The true value of Q(k, r) for the given values of r, p and k is 0.040 (correct to three decimals). Formula (1.3.1) will be used in the example in Sect. 5.2.

Now consider a general population composed of n items, of which $n_1$ are of the first type and $n_2 = n - n_1$ of the second type. Draw from it a sample without replacement of size r.

Theorem 1.3.1 Let n and $n_1$ tend to infinity in such a way that $n_1/n \to p$, where p is a number from the interval [0, 1]. Then the following relation holds true for the hypergeometric distribution:
$$P_{n_1,n}(r_1, r) \to P(r_1, r).$$

Proof Divide both the numerator and denominator in the formula for $P_{n_1,n}(r_1, r)$ (see Sect. 1.2) by $n^r$. Putting $r_2 = r - r_1$ and $n_2 := n - n_1$, we get
$$P_{n_1,n}(r_1, r) = \frac{r!(n-r)!}{n!}\,\frac{n_1!}{r_1!(n_1-r_1)!}\,\frac{n_2!}{r_2!(n_2-r_2)!}
= \frac{r!}{r_1!\,r_2!}\,\frac{\frac{n_1}{n}\big(\frac{n_1}{n}-\frac{1}{n}\big)\cdots\big(\frac{n_1}{n}-\frac{r_1-1}{n}\big)\cdot\frac{n_2}{n}\big(\frac{n_2}{n}-\frac{1}{n}\big)\cdots\big(\frac{n_2}{n}-\frac{r_2-1}{n}\big)}{\big(1-\frac{1}{n}\big)\cdots\big(1-\frac{r-1}{n}\big)}
\to \binom{r}{r_1}p^{r_1}(1-p)^{r_2} = P(r_1, r)$$
as n → ∞. The theorem is proved.

For sufficiently large n, $P_{n_1,n}(r_1, r)$ is close to $P(r_1, r)$ by the above theorem. Therefore the Bernoulli scheme can be thought of as sampling without replacement from a very large general population consisting of items of two types, the proportion of items of the first type being p.
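The numbers quoted in the example for formula (1.3.1) above (r = 30, p = 0.7, k = 16) are easy to reproduce. The following Python sketch (the helper name is our own) computes P(k, r), the exact value of Q(k, r) and the right-hand side of (1.3.1).

```python
from math import comb

def binom_pmf(k: int, r: int, p: float) -> float:
    """P(k, r) = C(r, k) p^k (1-p)^(r-k)."""
    return comb(r, k) * p ** k * (1 - p) ** (r - k)

r, p, k = 30, 0.7, 16
P_kr = binom_pmf(k, r, p)                                # ~ 0.023
Q_exact = sum(binom_pmf(j, r, p) for j in range(k + 1))  # ~ 0.040
Q_approx = P_kr * (r + 1 - k) * p / ((r + 1) * p - k)    # ~ 0.042, formula (1.3.1)

print(round(P_kr, 3), round(Q_exact, 3), round(Q_approx, 3))
```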
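Theorem 1.3.1 can also be observed numerically: keeping r and $r_1$ fixed while n grows with $n_1/n \to p$, the hypergeometric probabilities approach the binomial ones. A hedged Python sketch (the helper names and the particular values of p, r, $r_1$ are ours):

```python
from math import comb

def hypergeom_pmf(r1: int, r: int, n1: int, n: int) -> float:
    return comb(n1, r1) * comb(n - n1, r - r1) / comb(n, r)

def binom_pmf(r1: int, r: int, p: float) -> float:
    return comb(r, r1) * p ** r1 * (1 - p) ** (r - r1)

p, r, r1 = 0.3, 10, 4
for n in (50, 500, 5000, 50000):
    n1 = int(p * n)                      # so that n1 / n -> p
    print(n, hypergeom_pmf(r1, r, n1, n), binom_pmf(r1, r, p))
# The hypergeometric column converges to the binomial value as n grows.
```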
In conclusion we will consider two problems.

Imagine n bins in which we place at random r enumerated particles. Each particle can be placed in any of the n bins, so that the total number of different allocations of r particles to n bins will be $n^r$. Allocation of particles to bins can be thought of as drawing a sample with replacement of size r from a general population of n items. We will assume that we are dealing with the classical scheme, where the probability of each outcome is $1/n^r$.

(1) What is the probability that there are exactly $r_1$ particles in the k-th bin? The remaining $r - r_1$ particles which did not fall into bin k are allocated to the remaining n − 1 bins. There are $(n-1)^{r-r_1}$ different ways in which these $r - r_1$ particles can be placed into n − 1 bins. Of the totality of r particles, one can choose the $r - r_1$ particles which did not fall into bin k in $\binom{r}{r-r_1}$ different ways. Therefore the desired probability is
$$\binom{r}{r-r_1}\frac{(n-1)^{r-r_1}}{n^r} = \binom{r}{r-r_1}\Big(\frac{1}{n}\Big)^{r_1}\Big(1-\frac{1}{n}\Big)^{r-r_1}.$$
This probability coincides with $P(r_1, r)$ in the Bernoulli scheme with p = 1/n.

(2) Now let us compute the probability that at least one bin will be empty. Denote this event by A. Let $A_k$ mean that the k-th bin is empty; then
$$A = \bigcup_{k=1}^{n} A_k.$$
To find the probability of the event A, we will need a formula for the probability of a sum (union) of events. We cannot make use of the additivity of probability, for the events $A_k$ are not disjoint in our case.

1.4 The Probability of the Union of Events. Examples

Let us return to an arbitrary discrete probability space.

Theorem 1.4.1 Let $A_1, A_2, \ldots, A_n$ be events. Then
$$P\Big(\bigcup_{i=1}^{n} A_i\Big) = \sum_{i=1}^{n} P(A_i) - \sum_{i<j} P(A_iA_j) + \cdots$$

For any ε > 0 and any semi-interval [a, b), one can always find an embedded interval [a, b − δ), δ > 0, such that P([a, b − δ)) ≥ P([a, b)) − ε. This follows directly from property F3: F(b − δ) → F(b) as δ ↓ 0. Hence, for a given ε > 0 and set $B_n$, there exist $\delta_i^n > 0$, $i = 1, \ldots, k_n$, such that
$$\widetilde B_n = \bigcup_{i=1}^{k_n}\bigl[a_i^n,\, b_i^n - \delta_i^n\bigr) \subset B_n, \qquad P(\widetilde B_n) > P(B_n) - \varepsilon 2^{-n}.$$
Now add the right end points of the semi-intervals to the set $\widetilde B_n$ and consider the closed bounded set
$$K_n = \bigcup_{i=1}^{k_n}\bigl[a_i^n,\, b_i^n - \delta_i^n\bigr].$$
Clearly,
$$\widetilde B_n \subset K_n \subset B_n, \qquad K = \bigcap_{n=1}^{\infty} K_n = \emptyset, \qquad P(B_n - K_n) = P(B_n\overline K_n) \le \varepsilon 2^{-n}.$$
It follows from the relation K = ∅ that $\bigcap_{n=1}^{n_0} K_n = \emptyset$ for all sufficiently large $n_0$. Indeed, all the sets $K_n$ belong to the closure $[C_N] = [-N, N]$, which is compact. The sets $\{\Delta_n = [C_N] - K_n\}_{n=1}^{\infty}$ form an open covering of $[C_N]$, since
$$\bigcup_n \Delta_n = [C_N]\bigcup_n \overline K_n = [C_N]\,\overline{\bigcap_n K_n} = [C_N].$$
Thus, by the Heine–Borel lemma there exists a finite subcovering $\{\Delta_n\}_{n=1}^{n_0}$, $n_0 < \infty$, such that $\bigcup_{n=1}^{n_0}\Delta_n = [C_N]$ or, which is the same, $\bigcap_{n=1}^{n_0} K_n = \emptyset$. Therefore
$$P(B_{n_0}) = P\Big(B_{n_0} - \bigcap_{n=1}^{n_0} K_n\Big) = P\Big(B_{n_0}\,\overline{\bigcap_{n=1}^{n_0} K_n}\Big) = P\Big(\bigcup_{n=1}^{n_0} B_{n_0}\overline K_n\Big) \le \sum_{n=1}^{n_0} P(B_n\overline K_n) \le \sum_{n=1}^{n_0}\varepsilon 2^{-n} < \varepsilon.$$
Thus, for a given ε > 0 we found an $n_0$ (depending on ε) such that $P(B_{n_0}) < \varepsilon$. This means that $P(B_n) \to 0$ as n → ∞. We proved that axiom P3 holds. So we have constructed a probability space. It remains to take ξ to be the identity mapping of R onto itself. Then $F_\xi(x) = P(\xi < x) = P((-\infty, x)) = F(x)$.

The model of the sample probability space based on the assertion just proved is often used in studies of distribution functions.

Definition 3.2.1 A probability space $\langle\Omega, \mathfrak F, \mathbf F\rangle$ is called a sample space for a random variable ξ(ω) if Ω is a subset of the real line R and ξ(ω) ≡ ω. The probability $\mathbf F = \mathbf F_\xi$ is called, in accordance with Definition 3.1.1 from Sect. 3.1, the distribution of ξ. We will write this as
$$\xi \subset= \mathbf F. \qquad (3.2.1)$$
It is obvious that constructing a sample probability space is always possible. It suffices to put $\Omega = \mathbb R$, $\mathfrak F = \mathfrak B$, $\mathbf F(B) = P(\xi \in B)$. For integer-valued variables ξ the space $\langle\Omega, \mathfrak F\rangle$ can be chosen in a more "economical" way by taking Ω = {…, −1, 0, 1, …}.

Since by Theorem 3.2.1 the distribution function F(x) of a random variable ξ uniquely specifies the distribution $\mathbf F$ of this random variable, along with (3.2.1) we will also write ξ ⊂= F.

Now we will give examples of some of the most common distributions.

3.2.2 The Most Common Distributions

1. The degenerate distribution $\mathbf I_a$. The distribution $\mathbf I_a$ is defined by
$$\mathbf I_a(B) = \begin{cases} 1 & \text{if } a \in B,\\ 0 & \text{if } a \notin B.\end{cases}$$
This distribution is concentrated at the point a: if ξ ⊂= $\mathbf I_a$, then P(ξ = a) = 1. The distribution function of $\mathbf I_a$ has the form
$$F(x) = \begin{cases} 0 & \text{for } x \le a,\\ 1 & \text{for } x > a.\end{cases}$$
The next two distributions were described in Examples 3.1.1 and 3.1.2 of Sect. 3.1.

2. The binomial distribution $\mathbf B_p^n$. By the definition, ξ ⊂= $\mathbf B_p^n$ (n > 0 is an integer, p ∈ (0, 1)) if
$$P(\xi = k) = \binom{n}{k}p^k(1-p)^{n-k}, \qquad 0 \le k \le n.$$
The distribution $\mathbf B_p^1$ will be denoted by $\mathbf B_p$.

3. The uniform distribution $\mathbf U_{a,b}$. If ξ ⊂= $\mathbf U_{a,b}$, then
$$P(\xi \in B) = \frac{\mu(B \cap [a, b])}{\mu([a, b])},$$
where μ is the Lebesgue measure. We saw that this distribution has distribution function (3.1.1).

The next distribution plays a special role in probability theory, and we will encounter it many times.

4. The normal distribution $\Phi_{\alpha,\sigma^2}$ (the normal or Gaussian law). We will write ξ ⊂= $\Phi_{\alpha,\sigma^2}$ if
$$P(\xi \in B) = \Phi_{\alpha,\sigma^2}(B) = \frac{1}{\sigma\sqrt{2\pi}}\int_B e^{-(u-\alpha)^2/(2\sigma^2)}\,du. \qquad (3.2.2)$$
The distribution $\Phi_{\alpha,\sigma^2}$ depends on two parameters: α and σ > 0. If α = 0, σ = 1, the normal distribution is called standard. The distribution function of $\Phi_{0,1}$ is equal to
$$\Phi(x) = \Phi_{0,1}\bigl((-\infty, x)\bigr) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-u^2/2}\,du.$$
The distribution function of $\Phi_{\alpha,\sigma^2}$ is obviously equal to Φ((x − α)/σ), so that the parameters α and σ have the meaning of the "location" and "scale" of the distribution.

The fact that formula (3.2.2) defines a distribution follows from Theorem 3.2.1 and the observation that the function Φ(x) (or Φ((x − α)/σ)) satisfies properties F1–F3, since Φ(−∞) = 0, Φ(∞) = 1, and Φ(x) is continuous and monotone. One could also directly use the fact that the integral in (3.2.2) is a countably additive set function (see Sect. 3.6 and Appendix 3).

5. The exponential distribution $\Gamma_\alpha$. The relation ξ ⊂= $\Gamma_\alpha$ means that ξ is nonnegative and
$$P(\xi \in B) = \Gamma_\alpha(B) = \alpha\int_{B\cap(0,\infty)} e^{-\alpha u}\,du.$$
The distribution function of ξ ⊂= $\Gamma_\alpha$ clearly has the form
$$P(\xi < x) = \begin{cases} 1 - e^{-\alpha x} & \text{for } x \ge 0,\\ 0 & \text{for } x < 0.\end{cases}$$
The exponential distribution is a special case of the gamma distribution $\Gamma_{\alpha,\lambda}$, to be considered in more detail in Sect. 7.7.

6. A discrete analogue of the exponential distribution is called the geometric distribution. It has the form
$$P(\xi = k) = (1-p)p^k, \qquad p \in (0, 1),\ k = 0, 1, \ldots$$

7. The Cauchy distribution $\mathbf K_{\alpha,\sigma}$. As was the case with the normal distribution, this distribution depends on two parameters α and σ which are also location and scale parameters. If ξ ⊂= $\mathbf K_{\alpha,\sigma}$ then
$$P(\xi \in B) = \frac{1}{\pi\sigma}\int_B \frac{du}{1 + ((u-\alpha)/\sigma)^2}.$$
The distribution function K(x) of $\mathbf K_{0,1}$ is
$$K(x) = \frac{1}{\pi}\int_{-\infty}^{x}\frac{du}{1+u^2}.$$
The distribution function of $\mathbf K_{\alpha,\sigma}$ is equal to K((x − α)/σ). All the remarks made for the normal distribution continue to hold here.

Example 3.2.1 Suppose that there is a source of radiation at a point (α, σ), σ > 0, on the plane. The radiation is registered by a detector whose position coincides with the x-axis. An emitted particle moves in a random direction distributed uniformly over the circle. In other words, the angle η between this direction and the vector (0, −1) has the uniform distribution $\mathbf U_{-\pi,\pi}$ on the interval [−π, π]. Observation results are the coordinates $\xi_1, \xi_2, \ldots$ of the points on the x-axis where the particles interacted with the detector. What is the distribution of the random variable ξ = ξ₁?

To find this distribution, consider a particle emitted at the point (α, σ) given that the particle hit the detector (i.e. given that η ∈ [−π/2, π/2]). It is clear that the conditional distribution of η given the last event (of which the probability is P(η ∈ [−π/2, π/2]) = 1/2) coincides with $\mathbf U_{-\pi/2,\pi/2}$. Since (ξ − α)/σ = tan η, one obtains that
$$P(\xi < x) = P(\alpha + \sigma\tan\eta < x) = P\Big(\frac{\eta}{\pi} < \frac{1}{\pi}\arctan\frac{x-\alpha}{\sigma}\Big) = \frac{1}{2} + \frac{1}{\pi}\arctan\frac{x-\alpha}{\sigma}.$$
Recalling that $(\arctan u)' = 1/(1+u^2)$, we have
$$\arctan x = \int_0^x\frac{du}{1+u^2} = \int_{-\infty}^{x}\frac{du}{1+u^2} - \frac{\pi}{2}, \qquad
P(\xi < x) = \frac{1}{\pi}\int_{-\infty}^{(x-\alpha)/\sigma}\frac{du}{1+u^2} = K\Big(\frac{x-\alpha}{\sigma}\Big).$$
Thus the coordinates of the traces on the x-axis of the particles emitted from the point (α, σ) have the Cauchy distribution $\mathbf K_{\alpha,\sigma}$.
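Example 3.2.1 can also be checked by simulation. Here is a hedged Python sketch (the parameter values, names and sample size are our own choices, not the book's): it draws directions uniformly, keeps only the particles that hit the detector, and compares the empirical distribution function of the traces with K((x − α)/σ).

```python
import math
import random

alpha, sigma = 1.0, 2.0          # source at the point (alpha, sigma)
rng = random.Random(1)

traces = []
while len(traces) < 100_000:
    eta = rng.uniform(-math.pi, math.pi)        # direction angle, U_{-pi,pi}
    if -math.pi / 2 < eta < math.pi / 2:        # the particle hits the detector
        traces.append(alpha + sigma * math.tan(eta))

def cauchy_cdf(x: float) -> float:
    """K((x - alpha)/sigma) = 1/2 + (1/pi) arctan((x - alpha)/sigma)."""
    return 0.5 + math.atan((x - alpha) / sigma) / math.pi

for x in (-3.0, 0.0, 1.0, 4.0):
    empirical = sum(t < x for t in traces) / len(traces)
    print(x, round(empirical, 3), round(cauchy_cdf(x), 3))   # the two columns agree
```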
8. The Poisson distribution $\Pi_\lambda$. We will write ξ ⊂= $\Pi_\lambda$ if ξ assumes nonnegative integer values with probabilities
$$P(\xi = m) = \frac{\lambda^m}{m!}e^{-\lambda}, \qquad \lambda > 0,\ m = 0, 1, 2, \ldots$$
The distribution function, as in Example 3.1.1, has the form of a sum:
$$F(x) = \sum_{m < x}\frac{\lambda^m}{m!}e^{-\lambda} \quad \text{for } x > 0, \qquad F(x) = 0 \quad \text{for } x \le 0.$$

3.2.3 The Three Distribution Types

All the distributions considered in the above examples can be divided into two types.

I. Discrete Distributions

Definition 3.2.2 The distribution of a random variable ξ is called discrete if ξ can assume only finitely or countably many values $x_1, x_2, \ldots$ so that
$$p_k = P(\xi = x_k) > 0, \qquad \sum_k p_k = 1.$$

A discrete distribution $\{p_k\}$ can obviously always be defined on a discrete probability space. It is often convenient to characterise such a distribution by a table:

Values:        x₁   x₂   x₃   ...
Probabilities: p₁   p₂   p₃   ...

The distributions $\mathbf I_a$, $\mathbf B_p^n$, $\Pi_\lambda$, and the geometric distribution are discrete. The derivative of the distribution function of such a distribution is equal to zero everywhere except at the points $x_1, x_2, \ldots$ where F(x) is discontinuous, the jumps being $F(x_k + 0) - F(x_k) = p_k$.

An important class of discrete distributions is formed by lattice distributions.

Definition 3.2.3 We say that a random variable ξ has a lattice distribution with span h if there exist a and h such that
$$\sum_{k=-\infty}^{\infty} P(\xi = a + kh) = 1. \qquad (3.2.3)$$
If h is the greatest number satisfying (3.2.3) and the number a lies in the interval [0, h), then these numbers are called the span and the shift, respectively, of the lattice. If a = 0 and h = 1 then the distribution is called arithmetic. The same terms will be used for random variables.

Obviously the greatest common divisor (g.c.d.) of all possible values of an arithmetic random variable equals 1.

II. Absolutely Continuous Distributions

Definition 3.2.4 The distribution $\mathbf F$ of a random variable ξ is said to be absolutely continuous² if, for any Borel set B,
$$\mathbf F(B) = P(\xi \in B) = \int_B f(x)\,dx, \qquad (3.2.4)$$
where $f(x) \ge 0$, $\int_{-\infty}^{\infty} f(x)\,dx = 1$.

The function f(x) in (3.2.4) is called the density of the distribution. It is not hard to derive from the proof of Theorem 3.2.1 (to be more precise, from the theorem on uniqueness of the extension of a measure) that the above definition of absolute continuity is equivalent to the representation
$$F_\xi(x) = \int_{-\infty}^{x} f(u)\,du$$
for all x ∈ R. Distribution functions with this property are also called absolutely continuous.

² The definition refers to absolute continuity with respect to the Lebesgue measure. Given a measure μ on ⟨R, 𝔅⟩ (see Appendix 3), a distribution $\mathbf F$ is called absolutely continuous with respect to μ if, for any B ∈ 𝔅, one has
$$\mathbf F(B) = \int_B f(x)\,\mu(dx).$$
In this sense discrete distributions are also absolutely continuous, but with respect to the counting measure m. Indeed, if one puts $f(x_k) = p_k$ and m(B) = {the number of points from the set $(x_1, x_2, \ldots)$ which are in B}, then
$$\mathbf F(B) = \sum_{x_k\in B} p_k = \sum_{x_k\in B} f(x_k) = \int_B f(x)\,m(dx)$$
(see Appendix 3).
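For an absolutely continuous distribution, representation (3.2.4) can be checked numerically. Below is a minimal Python sketch for the exponential distribution $\Gamma_\alpha$ (the choice α = 2 and the grid size are illustrative assumptions): it integrates the density $f(u) = \alpha e^{-\alpha u}$ over [0, x] by the trapezoidal rule and compares the result with $F(x) = 1 - e^{-\alpha x}$.

```python
import math

alpha = 2.0

def density(u: float) -> float:
    """Density of the exponential distribution: f(u) = alpha * exp(-alpha*u) for u >= 0."""
    return alpha * math.exp(-alpha * u) if u >= 0 else 0.0

def cdf_by_integration(x: float, steps: int = 100_000) -> float:
    """Trapezoidal approximation of F(x) = integral of f over (-inf, x] (here [0, x])."""
    if x <= 0:
        return 0.0
    h = x / steps
    total = 0.5 * (density(0.0) + density(x)) + sum(density(i * h) for i in range(1, steps))
    return h * total

for x in (0.5, 1.0, 3.0):
    print(x, round(cdf_by_integration(x), 6), round(1 - math.exp(-alpha * x), 6))
# Both columns agree, illustrating F_xi(x) = int_{-inf}^{x} f(u) du.
```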
The function f(x) is determined by the above equalities up to its values on a set of Lebesgue measure 0. For this function, the relation $f(x) = \frac{dF(x)}{dx}$ holds³ almost everywhere (with respect to the Lebesgue measure).

The distributions $\mathbf U_{a,b}$, $\Phi_{\alpha,\sigma^2}$, $\mathbf K_{\alpha,\sigma}$ and $\Gamma_\alpha$ are absolutely continuous. The density of the normal distribution with parameters α and σ is equal to
$$\phi_{\alpha,\sigma^2}(x) = \frac{1}{\sqrt{2\pi}\,\sigma}e^{-(x-\alpha)^2/(2\sigma^2)}.$$
From their definitions, one could easily derive the densities of the distributions $\mathbf U_{a,b}$, $\mathbf K_{\alpha,\sigma}$ and $\Gamma_\alpha$ as well. The density of $\mathbf K_{\alpha,\sigma}$ has a shape resembling that of the normal density, but with "thicker tails" (it vanishes more slowly as |x| → ∞).

We will say that a distribution $\mathbf F$ has an atom at point $x_1$ if $\mathbf F(\{x_1\}) > 0$. We saw that any discrete distribution consists of atoms but, for an absolutely continuous distribution, the probability of hitting a set of zero Lebesgue measure is zero. It turns out that there exists yet a third class of distributions which is characterised by the negation of both mentioned properties of discrete and absolutely continuous distributions.

III. Singular Distributions

Definition 3.2.5 A distribution $\mathbf F$ is said to be singular (with respect to Lebesgue measure) if it has no atoms and is concentrated on a set of zero Lebesgue measure.

Because a singular distribution has no atoms, its distribution function is continuous. An example of such a distribution function is given by the famous Cantor function, of which the whole variation is concentrated on the interval [0, 1]: F(x) = 0 for x ≤ 0, F(x) = 1 for x ≥ 1. It can be constructed as follows (the construction process is shown in Fig. 3.1).

Fig. 3.1 The plot shows the result of the first three steps in the construction of the Cantor function

Divide the segment [0, 1] into three equal parts [0, 1/3], [1/3, 2/3], and [2/3, 1]. On the inner segment put F(x) = 1/2. The remaining two segments are again divided into three equal parts each, and on the inner parts one sets F(x) to be 1/4 and 3/4, respectively. Each of the remaining segments is divided in turn into three parts, and F(x) is defined on the inner parts as the arithmetic mean of the two already defined neighbouring values of F(x), and so on. At the points which do not belong to such inner segments F(x) is defined by continuity. It is not hard to see that the total length of such "inner" segments on which F(x) is constant is equal to
$$\frac{1}{3} + \frac{2}{9} + \frac{4}{27} + \cdots = \frac{1}{3}\sum_{k=0}^{\infty}\Big(\frac{2}{3}\Big)^k = \frac{1}{3}\,\frac{1}{1-2/3} = 1,$$
so that the function F(x) grows on a set of measure zero but has no jumps.

From the construction of the Cantor distribution we see that dF(x)/dx = 0 almost everywhere.

It turns out that these three types of distribution exhaust all possibilities. More precisely, there is a theorem belonging to Lebesgue⁴ stating that any distribution function F(x) can be represented in a unique way as a sum of three components: discrete, absolutely continuous, and singular. Hence an arbitrary distribution function cannot have more than a countable number of jumps (which can also be observed directly: we will count all the jumps if we first enumerate all the jumps which are greater than 1/2, then the jumps greater than 1/3, then greater than 1/4, etc.). This means, in particular, that F(x) is everywhere continuous except perhaps at a countable or finite set of points.

³ The assertion about the "almost everywhere" uniqueness of the function f follows from the Radon–Nikodym theorem (see Appendix 3).
⁴ See Sect. 3.5 in Appendix 3.
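The construction of the Cantor function described above is easy to carry out numerically. Here is a hedged Python sketch (the recursion depth and the evaluation via ternary digits are our own choices): scanning the ternary expansion of x, a digit 0 contributes a binary digit 0, a digit 2 contributes a binary digit 1, and the first digit 1 ends the scan, since x then lies in one of the middle thirds where F is constant (1/2, 1/4, 3/4, ...).

```python
def cantor(x: float, depth: int = 40) -> float:
    """Cantor function F on [0, 1], computed from the ternary expansion of x."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    value, weight = 0.0, 0.5
    for _ in range(depth):
        x *= 3
        digit = int(x)
        x -= digit
        if digit == 1:
            return value + weight       # middle third: F is constant here
        value += weight * (digit // 2)  # digit 0 -> 0, digit 2 -> weight
        weight /= 2
    return value

print(cantor(0.5))    # 0.5   (first middle third [1/3, 2/3])
print(cantor(2 / 9))  # 0.25  (a second-level middle third)
print(cantor(8 / 9))  # 0.75
```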
In conclusion of this section we will list several properties of distribution functions and densities that arise when forming new random variables.

3.2.4 Distributions of Functions of Random Variables

For a given function g(x), to find the distribution of g(ξ) we have to impose some measurability requirements on the function. The function g(x) is called Borel if the inverse image
$$g^{-1}(B) = \{x : g(x) \in B\}$$
of any Borel set B is again a Borel set. For such a function g, the distribution function of the random variable η = g(ξ) equals
$$F_{g(\xi)}(x) = P\bigl(g(\xi) < x\bigr) = P\bigl(\xi \in g^{-1}((-\infty, x))\bigr).$$
If g(x) is continuous and strictly increasing on an interval (a, b) then, on the interval (g(a), g(b)), the inverse function $y = g^{(-1)}(x)$ is defined as the solution to the equation g(y) = x. Since g is a monotone mapping we have
$$\{g(\xi) < x\} = \{\xi < g^{(-1)}(x)\} \quad \text{for } x \in (g(a), g(b)).$$
Thus we get the following representation for $F_{g(\xi)}$ in terms of $F_\xi$: for x ∈ (g(a), g(b)),
$$F_{g(\xi)}(x) = P\bigl(\xi < g^{(-1)}(x)\bigr) = F_\xi\bigl(g^{(-1)}(x)\bigr). \qquad (3.2.5)$$
Putting $g = F_\xi$ we obtain, in particular, that if $F_\xi$ is continuous and strictly increasing on (a, b) and F(a) = 0, F(b) = 1, then the random variable $F_\xi(\xi)$ is uniformly distributed on [0, 1].
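Relation (3.2.5) also suggests the standard "inverse transform" simulation recipe (not stated explicitly above, so the application is our own): if η ⊂= $\mathbf U_{0,1}$ and $g = F^{(-1)}$ is the inverse of a continuous strictly increasing distribution function F, then $F_{g(\eta)}(x) = F_\eta(F(x)) = F(x)$. A hedged Python sketch for the exponential distribution $\Gamma_\alpha$ (the value of α and the sample size are illustrative):

```python
import math
import random

alpha = 1.5                      # parameter of the exponential distribution Gamma_alpha
rng = random.Random(7)

def inverse_cdf(u: float) -> float:
    """Solves F(y) = u for F(y) = 1 - exp(-alpha*y), i.e. y = -ln(1 - u) / alpha."""
    return -math.log(1.0 - u) / alpha

sample = [inverse_cdf(rng.random()) for _ in range(100_000)]

# Empirical check: the fraction of sample points below x should be close to F(x).
for x in (0.2, 1.0, 2.5):
    empirical = sum(s < x for s in sample) / len(sample)
    print(x, round(empirical, 3), round(1 - math.exp(-alpha * x), 3))
```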
