Statistics - Probability and Descriptive Statistics Course Book PDF

Summary

This course book provides a foundational understanding of mathematical probability and descriptive statistics, preparing students for further statistical inference and data science courses. It covers key probability concepts like random experiments, sample spaces, events, and probability axioms, along with random variables, probability distributions, and joint distributions. Students will learn how to review, analyze, and draw conclusions from data using various statistical tools.

Full Transcript

STATISTICS – PROBABILITY AND DESCRIPTIVE STATISTICS DLBDSSPDS01-01 STATISTICS – PROBABILITY AND DESCRIPTIVE STATISTICS MASTHEAD Publisher: IU Internationale Hochschule GmbH IU International University of Applied Sciences Juri-Gagarin-Ring 152 D-99084 Erfurt Mailing address: Albert-Proeller-Straße 15-19 D-86675 Buchdorf [email protected] www.iu.de DLBDSSPDS01-01 Version No.: 001-2024-1127 N.N. Cover image: Adobe Stock. © 2024 IU Internationale Hochschule GmbH This course book is protected by copyright. All rights reserved. This course book may not be reproduced and/or electronically edited, duplicated, or dis- tributed in any kind of form without written permission by the IU Internationale Hoch- schule GmbH (hereinafter referred to as IU). The authors/publishers have identified the authors and sources of all graphics to the best of their abilities. However, if any erroneous information has been provided, please notify us accordingly. 2 TABLE OF CONTENTS STATISTICS – PROBABILITY AND DESCRIPTIVE STATISTICS Introduction Signposts Throughout the Course Book............................................. 6 Basic Reading.................................................................... 7 Further Reading.................................................................. 8 Learning Objectives.............................................................. 10 Unit 1 Probability 13 1.1 Definitions.................................................................. 14 1.2 Independent Events.......................................................... 22 1.3 Constant, the the random Conditional Probability.............................. 23 1.4 Bayesian Statistics........................................................... 25 Unit 2 Random Variables 33 2.1 Random Variables........................................................... 34 2.2 Probability Mass Functions and Distribution Functions.......................... 37 2.3 Important Discrete Random Variables.......................................... 42 2.4 Important Continuous Random Variables...................................... 59 Unit 3 Joint Distributions 85 3.1 Joint Distributions........................................................... 86 3.2 Marginal Distributions........................................................ 99 3.3 Independent Random Variables.............................................. 103 3.4 Conditional Distributions.................................................... 109 Unit 4 Expectation and Variance 117 4.1 Expectation of a Random Variable............................................ 120 4.2 Variance and Covariance.................................................... 132 4.3 Expectations and Variances of Important Probability Distributions............... 139 4.4 Central Moments........................................................... 151 4.5 Moment Generating Functions............................................... 156 3 Unit 5 Inequalities and Limit Theorems 167 5.1 Probability Inequalities...................................................... 168 5.2 Inequalities and Expectations................................................ 178 5.3 The Law of Large Numbers................................................... 183 5.4 The Central Limit Theorem.................................................. 188 Backmatter List of References............................................................... 202 List of Tables and Figures........................................................ 
203 4 INTRODUCTION WELCOME SIGNPOSTS THROUGHOUT THE COURSE BOOK This course book contains the core content for this course. Additional learning materials can be found on the learning platform, but this course book should form the basis for your learning. The content of this course book is divided into units, which are divided further into sec- tions. Each section contains only one new key concept to allow you to quickly and effi- ciently add new learning material to your existing knowledge. At the end of each section of the digital course book, you will find self-check questions. These questions are designed to help you check whether you have understood the con- cepts in each section. For all modules with a final exam, you must complete the knowledge tests on the learning platform. You will pass the knowledge test for each unit when you answer at least 80% of the questions correctly. When you have passed the knowledge tests for all the units, the course is considered fin- ished and you will be able to register for the final assessment. Please ensure that you com- plete the evaluation prior to registering for the assessment. Good luck! 6 BASIC READING Downey, A.B. (2014). Think stats (2nd ed.). O’Reilly. http://search.ebscohost.com.pxz.iubh. de:8080/login.aspx?direct=true&db=cat05114a&AN=ihb.28838&site=eds-live&scope=s ite Kim, A. (2019). Exponential Distribution - Intuition, Derivation, and Applications. (Available online) Rohatgi, V. K., & Saleh, A. K. E. (2015). An introduction to probability and statistics. John Wiley & Sons, Incorporated. http://search.ebscohost.com.pxz.iubh.de:8080/login.aspx ?direct=true&db=cat05114a&AN=ihb.45506&site=eds-live&scope=site Triola , M.F. (2013). Elementary statistics. Pearson Education. http://search.ebscohost.com. pxz.iubh.de:8080/login.aspx?direct=true&db=cat05114a&AN=ihb.45501&site=eds-live &scope=site Wagaman, A.S & Dobrow, R.P. (2021). Probability: With applications and R. Wiley. http://sea rch.ebscohost.com.pxz.iubh.de:8080/login.aspx?direct=true&db=edsebk&AN=294773 4&site=eds-live&scope=site 7 FURTHER READING UNIT 1 Downey, A.B. (2014). Think stats (2nd ed.). O’Reilly. http://search.ebscohost.com.pxz.iubh. de:8080/login.aspx?direct=true&db=cat05114a&AN=ihb.28838&site=eds-live&scope=s ite Rohatgi, V. K., & Saleh, A. K. E. (2015). An introduction to probability and statistics. John Wiley & Sons, Incorporated. (Chapter 1). http://search.ebscohost.com.pxz.iubh.de:808 0/login.aspx?direct=true&db=cat05114a&AN=ihb.45506&site=eds-live&scope=site Wagaman, A.S & Dobrow, R.P. (2021). Probability: With applications and R. Wiley http://sear ch.ebscohost.com.pxz.iubh.de:8080/login.aspx?direct=true&db=edsebk&AN=2947734 &site=eds-live&scope=site UNIT 2 Downey, A.B. (2014). Think Bayes. Sebastopol, CA: O’Reilly. (Chapters 3, 4, 5, and 6) http://s earch.ebscohost.com.pxz.iubh.de:8080/login.aspx?direct=true&db=cat05114a&AN=ih b.28839&site=eds-live&scope=site Rohatgi, V. K., & Saleh, A. K. E. (2015). An introduction to probability and statistics. John Wiley & Sons, Incorporated. (Chapter 5). http://search.ebscohost.com.pxz.iubh.de:808 0/login.aspx?direct=true&db=cat05114a&AN=ihb.45506&site=eds-live&scope=site Wagaman, A.S & Dobrow, R.P. (2021). Probability: With applications and R. Wiley (Chapter 3,6-8). http://search.ebscohost.com.pxz.iubh.de:8080/login.aspx?direct=true&db=eds ebk&AN=2947734&site=eds-live&scope=site UNIT 3 Downey, A.B. (2014). Think Bayes. Sebastopol, CA: O’Reilly. 
(Chapter 7) http://search.ebsco host.com.pxz.iubh.de:8080/login.aspx?direct=true&db=cat05114a&AN=ihb.28839&sit e=eds-live&scope=site Rohatgi, V. K., & Saleh, A. K. E. (2015). An introduction to probability and statistics. John Wiley & Sons, Incorporated. (Chapter 4). http://search.ebscohost.com.pxz.iubh.de:808 0/login.aspx?direct=true&db=cat05114a&AN=ihb.45506&site=eds-live&scope=site 8 UNIT 4 Rohatgi, V. K., & Saleh, A. K. E. (2015). An introduction to probability and statistics. John Wiley & Sons, Incorporated. (Chapter 7). http://search.ebscohost.com.pxz.iubh.de:808 0/login.aspx?direct=true&db=cat05114a&AN=ihb.45506&site=eds-live&scope=site Triola, M. F. (2013). Elementary statistics. Pearson Education. (Chapter 11). http://search.eb scohost.com.pxz.iubh.de:8080/login.aspx?direct=true&db=cat05114a&AN=ihb.45501& site=eds-live&scope=site Wagaman, A.S. & Dobrow, R.P. (2021). Probability: With applications and R. Wiley (Chapter 9). http://search.ebscohost.com.pxz.iubh.de:8080/login.aspx?direct=true&db=edsebk &AN=2947734&site=eds-live&scope=site UNIT 5 Triola, M. F. (2013). Elementary statistics. Pearson Education. (Chapter 6). http://search.ebs cohost.com.pxz.iubh.de:8080/login.aspx?direct=true&db=cat05114a&AN=ihb.45501&s ite=eds-live&scope=site Wagaman, A.S. & Dobrow. R.P. (2021). Probability: With applications and R. Wiley (Chapter 10). http://search.ebscohost.com.pxz.iubh.de:8080/login.aspx?direct=true&db=edseb k&AN=2947734&site=eds-live&scope=site 9 LEARNING OBJECTIVES Welcome to Statistics - Probability and Descriptive Statistics! This course will provide you with a foundation in mathematical probability, preparing you for further courses in statistical inference and data science. The statistical tools that you will be introduced to in this course will enable you to review, analyze, and draw conclusions from data. You will become familar with the key terms and concepts that are at the core of probability theory, including random experiments, sample spaces, events, and the axioms of proba- bility. You will learn to classify events as mutually exclusive and independent, and how to compute the probability of unions and joint events. You will also learn how to interpret and use conditional probability and apply Bayes’ theorem to selected applications. A random variable is a numerical description of the outcome of a statistical experiment. As a mathematical formalization it quantifies random events. When studying a given data set, we generally consider the data points as an observation of a random occurrence, which can be described by the underlying distribution of a random variable. You will learn to define a random variable and express and interpret its distribution using probability mass functions (PMFs), probability density functions (PDFs), and cumulative distribution functions (CDFs). You will learn about important probability distributions, their characteristics, and how they are used to model real-world experiments. Sometimes data comes in the form of pairs of triples or random variables. The variables in these tuples may be independent or dependent. You will learn how to express the ran- domness of these tuples using joint distributions, PMFs and PDFs. Marginal and condi- tional distributions play a key role in isolating the distribution of one variable from the tuple in different ways. You will be provided with examples that will help you to learn how to compute and interpret such distributions. 
The average and standard deviation are the most popular summaries we can compute from numerical data. These ideas are extended using general notions of the expected value of a random variable as well as other expectation quantities. You will learn how to compute means, variances, general moments, and central moments. More importantly, you will be able to describe certain characteristics of distributions, such as skewness and kurtosis, using these quantities. Finally, you will be introduced to important inequalities and limit theorems. These inequalities and theorems are at the very foundation of the methods of statistical infer- ence, providing a sound framework for drawing conclusions about scientific truths from data. Furthermore, they will be used to define and evaluate performance metrics of learn- ing algorithms in your further studies. Note 10 Given the main focus of this course (on fundamental theories and applications of statis- tics), it would be preferable for students to have some prior knowledge of basic topics of mathematical analysis (i.e., integral and differential calculus), as well as properties of functions. However, for the sake of completeness, the tools of analysis that are most important for this course will be briefly introduced and discussed at relevant points throughout the course book. 11 UNIT 1 PROBABILITY STUDY GOALS On completion of this unit, you will be able to... – understand the key terms outcome, event, and sample space and how these terms are used to define and compute probabilities. – identify the three fundamental axioms of probability measures. – compute and interpret probabilities involving mutually exclusive events. – compute and interpret probabilities of two independent events and conditional proba- bilities. – compute probabilities of two events that are not necessarily independent. – compute probabilities of two events that are not necessarily mutually exclusive. – understand the concept of partitioning a sample space and how it frames the state- ment of the total law of probability. – apply Bayes’ theorem to real-world examples. 1. PROBABILITY Introduction Probability is the primary tool we use when we are dealing with random experiments: that is, experiments where the outcome cannot be determined with complete certainty (see Wackerly, Mendenhall & Schaeffer, 2008; Wasserman, 2004). Consider rolling a pair of fair 6-sided dice. The outcome of any such roll cannot be determined with absolute certainty. How many possible outcomes are there? Is a sum of five or eight more likely? What is the most likely sum? What is the least likely sum? The tools we discuss in this unit will help address these and other questions. Perhaps you have heard of the phrase “lucky number 7”. The origin of this statement lies in the fact that when a pair of fair dice are rolled, seven is the most likely sum. On completion of this unit, you will be able to quantify this fact. Furthermore, you will be able to develop the relevant concepts much further in order to answer more complex questions. 1.1 Definitions Although we cannot predict the outcome of random experiments with absolute certainty, we can write down all the possible outcomes the experiment could have. For the coin toss random experiment, the possible outcomes, also called elements (see Klenke, 2014), are H (heads) or T (tails). The set containing all the possible outcomes is called the sample Sample space space of the experiment. 
We say that an outcome a is an element of Ω and write This is a set containing all possible outcomes of a random experiment. It is a ∈ Ω. usually denoted by Ω or S. Now consider the experiment of tossing two coins. One possible outcome could be to observe heads on the first coin and tails on the second coin. We can summarize this out- come as HT. Using this notation, the sample space can thus be written as Ω = HH, HT , T H, T T Outcome Each element is an outcome of the random experiment. In general, we can denote the This is a single result from outcome of an experiment by ωi, where i ∈ ℕ is just the index of the outcome. In this nota- a trial of a random experi- ment. tion, the sample space can be denoted as Ω = ω1, ω2, …, ωn for n ∈ ℕ and for a finite sample space and Ω = ω1, ω2, … for a countably infinite sample space. In some applications we are interested in a single outcome and want to calculate the probability of that outcome, but sometimes we are interested in a group of outcomes. Therefore, the next term we will define is an event. An event of a random experiment is a set of outcomes. The following notation is used to denote an event A, which is contained in Ω, 14 A ⊆ Ω. Event This is a collection of zero or more outcomes of a An event is also called a set (see Klenke, 2014). The following notation random experiment. Events are usually A ⊂ Ω, denoted using capital let- ters: A, B, C, … means that the event A is contained in Ω and at least one outcome exists which is not contained in A, but in Ω. For the two-coin toss experiment, perhaps we are interested in the outcomes where the result for the two coins match. In this case, we are talking about the event A = HH, T T. Note that the order of the elements in a set does not matter, so HH, T T = T T , HH. Finally, we can have an event that contains a single outcome: B = HT. Finally, we will introduce two fundamental operations for any events in the sample space Ω. For two events A, B ⊆ Ω, the union of A and B, which is denoted by Union The union of the events A and B is also an event A ∪ B = x ∈ Ω x ∈ A or x ∈ B containing all outcomes of A and all outcomes of is the event of all outcomes contained in A or in B. B. In addition, the intersection of A and B, denoted by Intersection The intersection of the events A and B is also an A ∩ B = x ∈ Ω x ∈ A and x ∈ B , n event containing all out- comes of A, which are is the event in which all outcomes are common to both A and B. also contained in B. We can also say the intersec- tion of A and B is the Figure 1: A Venn Diagram of Three Events event, which contains all outcomes of B, which are also contained in A. Source: George Dekermenjian (2019). 15 Special Events There are two special events that require a mention here. At one extreme, an event may contain nothing, in which case we have the null event or the empty set: ∅ =. At the other extreme, we have the whole sample space itself which, of course, contains all the possible outcomes. Axioms of Probability Now that we have an understanding of the fundamental terms, we are ready to talk about probability. The probability of an event measures the likelihood of observing that event when, for example, a random experiment is performed. For a given sample space Ω every Probability measure probability measure P , which maps an event of Ω to a real number, has to satisfy the This is used to assign following three axioms of probability: probabilities to events of a given sample space. 1. For any event A ⊆ Ω it holds that P A ≥ 0, 2. 
For mutually exclusive events A1,A2,A3,… ⊆ Ω, 3. P Ω = 1. ∞ P A1 ∪ A2 ∪ A3 ∪ … = ∑i = 1 P Ai , Mutually exclusive Two events (sets) are mutually exclusive if they have no common outcomes (elements). Two events are called mutually exclusive if their intersection yields an For non-mutually exclusive events we can deduce, according to the axioms of probability, empty set. that P A ∩ B + P A ∪ B = P A + P B for any events A, B ⊆ Ω. Example 1.1 Consider the random experiment of tossing two coins. We will assume that the probability of each outcome is equally likely so that singleton events (events with only one outcome) have equal probability. Since there are four outcomes, the probability of each singleton 1 event is. 4 1 P HH =P HT =P TH =P TT = 4 1 In practice, if an event contains one element, we can just write P HT = , excluding the 4 brackets. Classical Probability There are two approaches to defining probability: the classical (frequentist) approach and the Bayesian approach. We will discuss the classical approach first and then move on to a discussion of the Bayesian approach. 16 Consider a random experiment with n ∈ ℕ equally likely outcomes. In other words, the sample space contains n outcomes Ω = ω1, ω2, …, ωn The probability of an event A = ωi1, ωi2, …, ωim for m, im ∈ ℕ of this experiment is the ratio of the number of outcomes in A to the size of the sample space. We will denote the number of outcomes in A by A so that A = m. A m P A = =. Ω n Suppose a bag contains seven red marbles denoted by r1, r2, …, r7 and three blue mar- bles denoted by b1, b2 and b3. We will draw one marble out of this bag at random. The sam- ple space for this experiment is Ω = r1, r2, r3, r4, r5, r6, r7, b1, b2, b3. We are interested in computing the probability that the marble drawn is blue. The event corresponding to drawing a blue marble is A = b1, b2, b3. The event contains A = 3 outcomes and the sample space contains Ω = 10 outcomes. Therefore, the probability of drawing a blue A 3 marble is P A = =. Ω 10 Let us now verify that this formulation is a valid probability measure. In other words, we need to verify that the axioms of probability are satisfied. ∅ 0 Ω n 1. P ∅ = = = 0 and P Ω = = = 1. Ω n Ω n A 2. If A is an event, then 0 ≤ A ≤ n. Dividing by Ω = n gives 0 ≤ ≤ 1. In other Ω words, we have 0 ≤ P A ≤ 1 as required. 3. Now suppose that A and B are mutually exclusive events. Then the number of ele- ments in the event A or B is the union A ∪ B. Since they are mutually exclusive, it must hold that A ∪ B = A + B, because a marble cannot be in A and B simultaneously. Dividing by Ω we obtain A ∪ B A B = +. In other words, it holds that Ω Ω Ω P A ∪ B =P A +P B as required. We do not have to deal with the case of infinitely mutually exclusive events since our sam- ple space is finite i.e., it consists of 10 marbles. That means if we assume mutually exclu- sive events such that A1, A2, A3, … ⊆ Ω then only finite events can contain at least one marble. The rest of the sets must be empty sets. Thus, we reduced the problem to finite mutually disjoint events, which can be discussed in the same way as in the case of two mutually disjoint events. 17 Since the classical definition of probability satisfies all probability axioms, it is a valid probability measure. Example 1.2 Consider the random experiment of tossing three coins. Find the probability of observing at least one H. Solution 1.2 Recall that the sample space is Ω = T T T , T T H, T HH, HT H, T HT , HT T , HHT , HHH. 
The event of observing at least one H is exactly the event A = T T H, T HH, HT H, T HT , HT T , HHT , HHH. This event contains A = 7 out- comes. Furthermore, the sample space contains Ω = 8 outcomes. Therefore, the proba- bility of observing at least one H is A 7 P at least one H = P A = = = 0. 875. Ω 8 Example 1.3 Consider the experiment of rolling a 6-sided die. a) Write down the sample space. b) Write down the event of observing an even number. c) Calculate the probability of observing an even number. Solution 1.3 a) Ω = 1,2, 3,4, 5,6. b) A = 2,4, 6. A 3 1 c) P A = = = = 0. 5 = 50 %. Ω 6 2 Consider the experiment of rolling a pair of 6-sided dice. For each die, we can observe a number from 1 to 6. If we paired the observations from each die, we would have a single observation from the pair. For example, if the first die lands on 2 and the second lands on 5, we can write down this outcome as (2,5). The sample space S of this experiment is shown in the table below. Table 1: Sample Space of Rolling a Pair of 6-Sided Dice (1,1) (1,2) (1,3) (1,4) (1,5) (1,6) (2,1) (2,2) (2,3) (2,4) (2,5) (2,6) 18 (3,1) (3,2) (3,3) (3,4) (3,5) (3,6) (4,1) (4,2) (4,3) (4,4) (4,5) (4,6) (5,1) (5,2) (5,3) (5,4) (5,5) (5,6) (6,1) (6,2) (6,3) (6,4) (6,5) (6,6) Source: George Dekermenjian (2019). The sample space consists of Ω = 36 outcomes. Using this information, let us explore some questions related to this experiment. Example 1.4 Using the information provided in the table above: a) Write down the event of observing the same number on both dice. b) Write down the event of observing numbers that sum to 4. c) Calculate the probability of each of these events. Solution 1.4 a) A = 1,1 , 2,2 , 3,3 , 4,4 , 5,5 , 6,6. b) B = 1,3 , 2,2 , 3,1. A 6 1 c) P A = = =. Ω 36 6 B 3 1 d) P B = = =. Ω 36 12 Do we have to write down the outcomes? The formula for probability that we are using only makes use of the number of outcomes. As you can imagine, for more complex experi- ments, the size of the sample space can become very large, and it would not be wise to write down all the possible outcomes. However, to compute the probability we still need to be able to count the number of outcomes, whether it is in the sample space or for another event. To this end, we will take a short departure from our main topic and review some basic counting techniques that will be useful in answering probability questions. Counting All of the formulas we will discuss here are based on one simple principle: the multiplica- tion principle of counting. If there are N 1 ways of performing task 1 and N 2 ways of per- forming task 2, then there are N 1 · N 2 ways of performing both tasks. This principle is easily extended to more than two tasks. Suppose a pizza parlor offers its patrons the option of customizing their own pizzas. They are offered three types of crusts, two types of sauces, and can also choose one from a selection of five toppings. To count the number of different pizzas one can order at this pizza parlor, we can break down making a pizza into three tasks: (i) there are N 1 = 3 ways 19 of choosing a crust, (ii) there are N 2 = 2 ways of choosing a sauce, and (iii) N 3 = 5 ways of choosing a topping. Therefore, there are N 1 · N 2 · N 3 = 3 · 2 · 5 = 30 ways of making a pizza. Permutations Suppose you have four different books, and you want to arrange them on a shelf. We want to count the total number of arrangements possible. There are four tasks. At first there are four books to set in place. 
After placing the first book, there are three books to set in place, then two books, and, finally, after placing these, there is one book left to place on the shelf. Therefore, using the multiplication principle, there are 4 · 3 · 2 · 1 = 24 ways of arranging the four books on the shelf. This is an example of a permutation. Using factorial notation, we can write this computation as 4! = 4 · 3 · 2 · 1. In general, if there are n ∈ ℕ different objects to arrange, there are n! = n n − 1 n − 2 · … · 3 · 2 · 1 permutations (arrangements) possible. Now suppose that we have n = 10 objects, but we want to select and arrange k = 3 of them. We can choose the first objects (10 choices), then the second objects (9 choices), and, finally, the third object (8 choices). Therefore, the total number of arrangements is 10 · 9 · 8 = 720. In general, if we have n distinct objects, the number of permutations of k of these objects is n! n−k ! Combinations Suppose there are 10 people at a dinner party and each person shakes the hands of every other person. We want to work out how many handshakes there would be. Using the mul- tiplication rule, we can argue that observing the event of a handshake involves two tasks: (i) the first person in the handshake (10 people available) and (ii) the second person in the handshake (9 people available). So far, we have 10 · 9. However, the order of the people does not matter. If John shakes hands with Mary and Mary shakes hands with John, the handshake is the same. Therefore, we do not need to count these handshakes twice. We 10 ⋅ 9 divide the expression to get = 45 handshakes. 2 This is an example of a combination, which is similar to a permutation, but order does not matter. In general, if we have n ∈ ℕ distinct objects, the number of ways of choosing k ∈ ℕ of them is given by n n! =. k n − k !k! 20 n The expression is read as “n choose k”. For the handshake example, this formula k indeed gives the correct answer: 10 10! 10! = = = 45. 2 10 − 2 !2! 8!2! Now that we have some efficient tools for counting, we are equipped to tackle some more probability questions. Below is one such example. Example 1.5 Suppose there are five women and four men. We will randomly choose three people. a) Calculate the size of the sample space for this experiment. b) How many ways are there of choosing two women and one man? c) What is the probability of choosing two women and one man? Solution 1.5 a) The sample space consists of all possible groups of three people from nine different people. The order does not matter here. The number of ways is “9 choose 3” or 9 9! 9! 9⋅8⋅7 Ω = = = = = 84 3 9 − 3 !3! 6!3! 3 ⋅2⋅1 b) Choosing two women and one man is actually two tasks. We will count the number of ways of performing each task and then multiply them (using the multiplication princi- ple). Task 1: Choosing two women from five. There are 5 = 10 ways. 2 Task 2: Choosing one man from four. There are 4 = 4 ways. 1 According to the multiplication rule, there are 10 · 4 = 40 ways of choosing two women and one man. c) Let us call the event of choosing two women and one man A. We found that A = 40. Therefore, the probability of this event is 40 P A = ≈ 0.4762 = 47.62 %. 84 Complementary Events The complement of an event, just like the complement of a set, is the event of not observ- ing the included outcomes. For example, in a dice roll experiment we have the sample space Ω = 1,2, 3,4, 5,6. If A is the event of observation 1 or 2 A = 1,2 , then the c complement of A is A = 3,4, 5,6. 
21 2 c 4 The probability of A is P A = , and the probability of its complement is P A =. 6 6 c 2 4 Indeed, we have P A + P A = + = 1. 6 6 This means that for a given sample Ω it holds that c P A +P A = 1 for any A ⊆ Ω. 1.2 Independent Events Consider the experiment of tossing a fair coin and then rolling a fair 6-sided die. The prob- 1 1 1 ability of observing the joint event H, 2 is ⋅ =. That is, we multiply the probabili- 2 6 12 ties. This is because the tossing of a fair coin does not influence the result of rolling a die. The two events are independent. More formally, for a given sample space Ω, two events, Independence of two A ⊆ Ω and B ⊆ Ω, are said to be independent if events Two events are independ- ent if the probability of P A∩B =P A ·P B. their intersection yields the same as the product Example 1.6 of each probability. Suppose we draw two cards at random with replacement from a standard deck of 52 Standard deck of 52 cards. That is, we draw the first card, place it back in the deck, and then draw another cards card. What is the probability that both cards are spades? This consists of four sets with 13 cards. Each of these four sets is depicted Solution 1.6 with a different suit. The four possible suits are: clovers or clubs, dia- The event of the first card being a spade is independent from the second card being a monds, hearts and spade. Therefore, the probability of both being spades is spades for a total of 52 cards (13 times 4). The 13 cards are numbered from 13 13 1 · = = 0. 0625 = 6.25 %. 1 (also called ace) to 13 52 52 16 (also called King). Suppose the two events A and B with P A , P B > 0 are disjoint (mutually exclusive). Can they be independent? If the two events are mutually exclusive, then they cannot both occur at the same time, so the probability of the joint event is P A ∩ B = 0. Therefore, the two events are not independent, since 0 = P A ∩ B ≠ P A · P B > 0. Example 1.7 Suppose a fair coin is tossed five times. What is the probability of observing at least one tail? 22 Solution 1.7 Note that each of the tosses is independent. Furthermore, it is easier to work with the complement of this event. Let A be the event of observing at least one tail. Then, the com- c plement event A is the event of observing no tails (that is, observing heads on each of the five tosses). Let Hi denote the event of observing heads on the itℎtoss for i ∈ ℕ, and then use the formula for the probability of complements. We then have c P A =1 − P A =1 − P H1H2H3H4H5 =1 − P H1 P H2 P H3 P H4 P H5 1 5 31 =1 − = ≈ 0.9688 = 96.88 %. 2 32 We have used independence in the third equality. In the fourth equality, we used the fact 1 that the probability of observing heads in any toss is. 2 Example 1.8 A bag contains three red marbles and five blue marbles. Two marbles are drawn, one after the other, without replacement. Is the event of observing a red marble on the first draw and a blue marble on the second draw independent? Why or why not? Solution 1.8 The two events are not independent. The result of the first event will change the number of available marbles in the bag, since there is one marble missing for the second draw. We will see how to calculate the probability of joint events that are dependent in the fol- lowing section. 1.3 Constant, the the random Conditional Probability Conditional probability is a way of calculating the probability of an event using prior infor- mation. The notation P A B is read as the “probability of A given that we have already observed B”. 
In other words, while P A is the (unconditional) probability of observing A, P A B is the conditional probability of A conditioned on B. Suppose we have three red marbles and five blue marbles in a bag. We draw two marbles at random without replace- ment. Let A denote the event of observing a red marble. Let B denote the event of observ- ing a blue marble. The probability of B given that we have already observed A is written as P B A. After observing A, a red marble, there are only seven marbles left in the bag: 23 5 5 two red and five blue. Therefore, the probability P B A =. In contrast, P B =. 7 8 Thus for a given sample space Ω, we say that the conditional probability of A ⊆ Ω condi- tioned on B ⊆ Ω with P B > 0, is defined by P A∩B P A B =. P B Example 1.9 Suppose that the probability of a randomly chosen person having cancer is 1%, and that if a person has cancer, a medical test will yield a positive result with a probability of 98%. What is the probability that the person has cancer and the medical test result shown is positive? Solution 1.9 Let A denote the event that a person has cancer. We know that P A = 0.01. Now let B denote the event that the medical test yields a positive result. We want to find P A ∩ B. We know the conditional probability P B A = 0.98. Using the formula for conditional probability we have P B∩A P B A =. P A Rewriting this formula, we have P A ∩ B = P B A · P A = 0.98 · 0.01 = 0.0098 = 0.98 %. If the two events A and B are independent, then observing one of the events gives us no information about the other event. In other words, P A B = P A. Indeed, we can show this using the result of independent events and the formula for conditional probability as follows: P A∩B P A ⋅P B P A∣B = = =P A for P B > 0. P B P B Let us revisit the experiment of drawing two cards. Example 1.10 Suppose two cards are drawn out of a deck of 52 cards, one after the other, without replacement. What is the probability that both are spades? Solution 1.10 Let S1 denote the event that the first card is a spade and let S2 denote the event that the second card is a spade. Since these two events are dependent, think about why this is the case. We can use the conditional probability formula in the form 24 P S1 ∩ S2 = P S2 ∣ S1 ⋅ P S1. The left-hand side is the probability that we observe a spade on both draws. The first fac- tor on the right denotes the probability that the second card is a spade given that the first card was a spade. The last factor is the probability that the first card is a spade. Since there 13 are 13 spades out of 52 cards, we have P S1 =. After having observed a spade, there 52 are only 12 spades left in the deck of a total of 51 cards. 12 Therefore, P S1 S2 =. 51 Therefore, 12 13 1 P S1 ∩ S2 = ⋅ = ≈ 0.0588 = 5.88 %. 51 52 17 Compare this answer with Example 1.6. Does the result surprise you? 1.4 Bayesian Statistics In contrast to classical statistics, Bayesian statistics is all about modifying conditional probabilities – it uses prior distributions for unknown quantities which it then updates to posterior distributions using the laws of probability. Let us revisit the medical cancer test in Example 1.9. Let us say a randomly chosen person tests positive for cancer. What is the probability that they actually have cancer? Biomedi- cal tests are never perfect; there is typically a small false positive and false negative rate. 
In the setting of the example, recall that A represents the event that a person has cancer and B represents the event that the medical test returns a positive result. We were given the prevalence of the disease in the general population, which was 1%, so that P A = 0.01. The test is 98% accurate for people who actually have the disease—that is, P B A = 0.98. Finally, suppose the test gives a false positive 20% of the time. We are now interested in finding out P A B. This is the subject of Bayes’ theorem. Before dis- cussing Bayes’ theorem, let us first write down a preliminary result. To motivate the result, suppose that we partition the sample space Ω into disjoint events Partition of an event A1, A2, and A3. That is, these events are mutually exclusive, and together they contain all Let A be an event. When the union of two or more the outcomes in the sample space, meaning mutually exclusive events is A, the group of events is called a partition of A. A1 ∪ A2 ∪ A3 = Ω. Now consider another event. Then it holds B = A1 ∩ B ∪ A2 ∩ B ∪ A3 ∩ B 25 meaning that the events A1 ∩ B, A2 ∩ B, and A3 ∩ B partition the event B. In other words, these events are mutually disjoint and together they contain all of B. See the figure below for an illustration. Figure 2: Partitions Source: George Dekermenjian (2019). Theorem: The Law of Total Probability Let A1, A2, A3, … be a countably infinite collection that partitions the sample space Ω. In other words, the events A1, A2, A3, … are pairwise mutually exclusive and ∞ ∪ Ai. i=1 Let B ⊆ Ω be another event. Then it follows that ∞ P B = ∑P Ai ∩ B. i=1 or, equivalently, ∞ P B = ∑P B Ai P Ai. i=1 26 We are now ready to state one of the most important theorems in modern probability theory. Theorem: Bayes’ Theorem Let A1, A2, A3, … ⊆ Ω be a countably infinite set of a partition of a sample space Ω such that P Ai > 0 for all i ∈ ℕ. Then for fixed Aj with j ∈ ℕ and B ⊆ Ω such that P B > 0, it holds that P B Aj P Aj P Aj B = ∞. ∑P B Ai P Ai i=1 Proof We know for the conditional probability formula yields P Aj B P B = P B Aj P Aj. Dividing by P B and we use the Law of Total Probability for the event B ⊆ Ω yielding the result. Note that as a special case of Bayes’ theorem, we can apply the results with just two c events, A and B, and use the two events A and A as the partition. In this case, the result is reduced to P B AP A P A B = c c. P B A P A +P B A P A Example 1.11 Suppose that the probability of a randomly chosen person having developed cancer is 1% given that if a person has cancer, a medical test will yield a positive result with a probabil- ity of 98%. Also, given that if a person does not have cancer, the test will yield a negative result with a probability of 0.80. Now, suppose a randomly chosen person tests positive. What is the probability they have actually developed cancer? Solution 1.11 Let A denote the event that a person has cancer. We know that P A = 0.01 and c P A = 0.99. Let B denote the event that the test returns a positive result. We know that c P B A = 0.98 and P Bc A = 0.80. We want to find P A B. Note that c P B A = 1 – 0.80 = 0.20. Now, using Bayes’ theorem, we have 27 P B AP A 0.98 ⋅ 0.01 P A B= = P B A P A +P B A P A c c 0.98 ⋅ 0.01 + 0.20 ⋅ 0.99 0.0098 = ≈ 0.0472 = 4.72%. 0.0098 + 0.198 Note that the result in Solution 1.11 is very low and applies to biomedical tests with signif- icant false positive and false negative rates, making population-wide screening programs of relatively rare diseases with such tests pointless. 
Tree diagrams and two-way tables help us understand how the total law of probability, Bayes’ theorem, and applications such as the one in Example 1.11 work. Below is an example of such a probability tree together with the associated two-way table. Figure 3: The Probability Tree Diagram from Example 1.11 Source: George Dekermenjian (2019). Table 2: Table of Probabilities from Example 1.11 True Diagnosis Total Cancer No Cancer Positive 0.0098 0.1980 0.2078 Medical test result Negative 0.0002 0.7920 0.7922 Total 0.01 0.99 1 28 Source: George Dekermenjian (2019). Now consider a sample with a size of 10,000. The natural frequencies corresponding to the probabilities help us get a “feel” for how these types of probabilities impact a real-world data set. Below is a tree diagram with natural frequencies followed by the corresponding two-way table. Figure 4: The Tree Diagram of Natural Frequencies from Example 1.11 Source: George Dekermenjian (2019). Table 3: Table of Natural Frequencies for a Sample Size of 10,000 from Example 1.11 True Diagnosis Total Cancer No Cancer Positive 98 1980 2078 Medical test result Negative 2 7920 7922 Total 100 9900 10,000 Source: George Dekermenjian (2019). 29 In Bayes’ theorem, P A is interpreted as the prior probability while P A B is the poste- rior probability. So, for the example above, before knowing the test result, we could say with 1% probability that the person has cancer, but after getting the result of the test, we could say that, based on the new information, the probability that the person has cancer is almost 5%. SUMMARY Several fundamental concepts were introduced in this unit, including random experiment, outcome, event, sample space, probability axioms, and counting techniques. We used these concepts to compute probabilities of certain events for simple experiments. Mutually exclusive events and the sum of probabili- ties axiom were used to compute probabilities of unions of events: P A ∪ B = P A + P B for mutually exclusive events A, B ⊆ Ω. When events are not mutually exclusive, a general sum of probabilities rule gives P A ∪ B = P A + P B – P A ∩ B for any events A, B ⊆ Ω. The joint event A ∩ B led to a discussion of independent events in which case we have the product of probabilities rule P A ∩ B = P A · P B for any independent events A, B ⊆ Ω. When two events A and B are not independent, P A and P A B are not the same. Therefore, we introduced the conditional probability of A ⊆ Ω conditioned on B ⊆ Ω by P A∩B P A B = where P B > 0. P B This definition can be interpreted as a general product of probabilities for events that are not necessarily independent. Bayes’ rule is central to understanding Bayesian probability. We dis- cussed instances where a collection of events partitions a sample space and how such a collection induces a partition of any event. These ideas 30 led to an important theorem known as the law of total probability. Finally, building on this theorem, we introduced Bayes’ theorem and discussed a number of applications. 31 UNIT 2 RANDOM VARIABLES STUDY GOALS On completion of this unit, you will be able to... – describe and compare the properties of discrete and continuous random variables. – understand the roles of PMFs and CDFs for discrete distributions and their properties. – understand the roles of PDFs and CDFs for continuous distributions and their proper- ties. – apply PMFs, PDFs, and CDFs to answer probability questions. – identify important discrete distributions and important continuous distributions. 2. 
RANDOM VARIABLES Introduction In real-world applications of data analysis and statistics we work with numerical data. In order to describe the occurrence of data points, a mathematical model or formalization, Random variable called a random variable, is necessary. From a scientific point of view, we assume that This is a rule (function) the data points are realizations of random variables. Each random variable has a specific which assigns outcomes of a given sample space sample space, probability measure and therefore, distribution, which describes the fre- to a real number. The quency of occurrence of our data points. A random variable is different from traditional sample space is equipped variables in terms of the value it takes. It is a function which performs the mapping of the with a probability meas- ure such that the out- outcomes of a random process to a numeric value. Given their importance, the main sub- comes or events have a ject of this unit will be random variables (see Wackerly, Mendenhall & Schaeffer, 2008) and defined likelihood. their mathematical properties. Random variables have many real-world applications and are used, for example, to model stock charts, the temperature, customer numbers, and the number of traffic accidents that occur in a given timeframe or location. 2.1 Random Variables Informally, a random variable is a rule that assigns a real number to each outcome of the sample space (see Wasserman, 2004). We usually denote random variables using the capi- tal letters X, Y , Z. When appropriate, we sometimes also use subscripts: X1, X2 and so on. Consider the random experiment of tossing a fair coin four times. Let X be the random variable that counts the number of heads. For the outcome HHT H, we have X HHT H = 3 and for another outcome T T T H, we have X T T T H = 1. Now consider the random experiment of rolling two fair 6-sided dice. Let Y denote the random variable that adds the numbers observed from each of the dice. For example, Y 1,2 = 3 and Y 4,4 = 8. Finally, the same random experiment can have many different random vari- ables. For example, for the experiment of rolling two 6-sided dice, let M denote the ran- dom variable that gives the maximum of the numbers from the dice. For example, M 1,2 = 2 and M 5,2 = 5. Since the values of a random variable depend on the outcome, which is random, we know that the values of a random variable are random numbers. So far, we have made the connection between the value of a random variable and an out- come of the random experiment. Before moving onto events, we will make this relation- ship more formal. For a given sample space Ω, equipped with a probability measure P , a random variable X is a mapping from the sample space Ω to the set of real numbers ℝ that assigns for each outcome ω ∈ Ω a real number x ∈ ℝ. In standard notation, this is written as 34 X :Ω ℝ, ω x=X ω. This is the most abstract definition of a random variable we can encounter. For our pur- poses we will restrict ourselves to discrete and piecewise continuous random variables and describe a wide range of random variables. For random variables with a finite sample space, we can write down the possible values of a given random variable. Consider the random experiment of tossing a coin three times; let X denote the random variable which counts the number of tails. The table below gives the values of each of the outcomes. 
Table 4: Values of the Random Variable Counting Tails When a Coin is Tossed Three Times ω X ω HHH 0 HHT 1 HTH 1 THH 1 HTT 2 THT 2 TTH 2 TTT 3 Source: George Dekermenjian (2019). Now we want to establish the connection of random variables with events. Consider the equation X ω = 1 from the table above. There are three outcomes ω that fit this equa- tion. If we put these three outcomes in a set, it becomes an event. More formally, the event X ω = 1 corresponds to the event HHT , HT H, T HH. We can also write this rela- tionship as X–1 1 = HHT , HT H, T HH. Here, X–1 denotes the inverse relation. It takes a value (from ℝ) to an event (in Ω). 35 Figure 5: The Random Variable as a Mapping from the Sample Space to Real Numbers Source: George Dekermenjian (2019). Figure 6: The Inverse Mapping Source: George Dekermenjian (2019). It is standard practice to use shorthand notation when describing events using random variables. Formally, in the example above, the event X ω = 1 describes the event ω ∈ Ω X ω = 1. However, in practice we usually write this event as X = 1 , meaning that we have the following notation for the event that the random variable X equals one X ω =1 = ω∈ΩX ω =1= X=1. Since our sample space Ω is equipped with a probability measure P , we can ask ourselves how likely this event is. Consequently, the symbolic form of writing a probability such as “the probability of observing one tail in a sequence of three tosses of a coin” would be written as P X = 1. When we talk about the probability of all such (single value) events, we are describing a probability mass function. We will look at these functions in the next two sections. 36 Sometimes we are interested in events corresponding to multiple values of a random vari- able. The event of observing 0 or 1 tail can be written as 0 ≤ X ω ≤ 1 = X−1 0,1 = ω ∈ Ω 0 ≤ X ω ≤ 1 , which is written in shorthand as 0 ≤ X ≤ 1. Figure 7: The Inverse Mapping of a Set of Values Source: George Dekermenjian (2019). An important range of values that comes up in the study of probability distributions is the range of values up to and including a specified number, such as X ≤ 1 or X ≤ 2. For our example above, the former is equivalent to 0 ≤ X ≤ 1 and the latter is equivalent to 0 ≤ X ≤ 2. When we speak about the probability of such events, we are describing a dis- tribution function. This is the subject of the next section. 2.2 Probability Mass Functions and Distribution Functions In the previous section, we defined a random variable as a rule that connects each out- come of a sample space to a real number. In this section, we will continue building the connection by taking the values of a random variable and connecting them to probabili- ties. Probability Mass Functions For a given sample space Ω and its corresponding probability measure P , we consider a random variable X: Ω x1, x2, x3, … , where x1, x2, … are real numbers. This random variable is called a discrete random variable, because the set Discrete random variable A random variable which of possible values is countable infinite or finite. Now we consider a function takes only finite or count- f : x1, x2, x3, … 0,1. If the function f satisfies the following properties ably infinite values. P ω ∈ Ω X ω = i = P X = xi = f xi for all i ∈ ℕ, 37 ∞ ∑f xi = 1, i=1 Probability mass f is then called a probability mass function (PMF). 
In that case the support of the PMF function consists of all xi ∈ ℝ such that A series which determines the likelihood that a dis- crete random variable P ω ∈ Ω X ω = i = P X = i = f xi > 0. takes a value. When we are working with multiple random variables, we can write fX instead of f to specify the random variable to which the PMF refers. Example 2.1 Consider the experiment of tossing a fair two-sided coin three times. Let X denote the ran- dom variable that counts the number of tails. Write down the PMF f of X defined by fX x = P X = x. Solution 2.1 The possible values of X are 0, 1, 2, 3. The table below summarizes the PMF. Table 5: Values, Events, and PMF of Tossing a Fair Coin Three Times x X=x fX x = P X = x 0 HHH 1/8 1 HHT , HT H, T HH 3/8 2 HT T , T HT , T T H 3/8 3 TTT 1/8 Source: George Dekermenjian (2019). Note that each value f x is non-negative and f 0 + f 1 + f 2 + f 3 = 1. Therefore, f is indeed a valid PMF. 38 Figure 8: A Plot of the PMF from Example 2.1 Source: George Dekermenjian (2019). Example 2.2 Suppose that f is a PMF defined using the table below. What is f 3 ? Table 6: A Discrete PMF with a Missing Value x f x 1 0.2 2 0.05 3 ? 4 0.39 5 0.01 6 0.05 7 0.05 8 0.10 Source: George Dekermenjian (2019). 39 Solution 2.2 Since we are told that this is a probability mass function, we know that f 3 ≥ 0 and f 1 + f 2 + f 3 + … + f 8 = 1. Therefore, the second equation reduces to 0.85+f 3 = 1 which gives f 3 = 0.15. Probability mass functions can be represented graphically as point plots with the horizon- tal axis containing the values of the random variable and the vertical axis containing the values of f. Below is a plot of the PMF for Example 2.2. Figure 9: A Plot of the PMF from Example 2.2 Source: George Dekermenjian (2019). Cumulative Distribution Function In this section, we consider events corresponding to values of the random variable in the Cumulative distribution form X ≤ x. Given a random variable X, the cumulative distribution function (CDF) is function defined by A CDF of a random varia- ble X is a function which measures the probability F X x = P X ≤ x for any x∈R. that X will take a value less or equal to x for fixed x. Formally, a CDF is a function F X such that F X : ℝ 0,1. We also write F instead of F X if the random variable is clear from the context. In addition, we can prove that for any CDF the following three properties must hold: F is normalized: lim F x =0 x −∞ and 40 lim F x = 1 x ∞ F is non-decreasing: F x1 ≤ F x2 for x1 < x2 F is right-continuous: F x = lim F t. t x, t > x We can verify the three properties in an intuitive way. For the first property, when x tends to negative infinity, then the set of outcomes that are in the event X ≤ x become the empty set. Also, as x tends to positive infinity, the set of outcomes in the event become the whole sample space, in which case the probability is 1. For the second point, notice that if an outcome is in the event X ≤ x1 and x1 ≤ x2, then automatically, this same out- come must be in X ≤ x2 , which basically means X ≤ x1 ⊆ X ≤ x2 for x1 ≤ x2. Therefore, the former event is a subset of the latter one. Hence, the former event is, at most, as probable as the latter. For the final property, take t > x, then we have F t −F x =P X ≤t −P X ≤x =P x 0. The Uniform Distribution We have already looked at the discrete uniform distribution; the (continuous) uniform dis- tribution is its continuous analog. 
For such that a < b random variable X is said to follow a uniform distribution over the interval a, b if its PDF is given by 1 a ≤ x ≤ b, f x = b−a 0 otherwise. In other words, the density is constant (uniform) across the range a, b. Consequently, the CDF has a constant slope on this interval. We can write down the CDF from a PDF as fol- lows: ∫ x F x = f t dt −∞ This particular PDF is 0 for x < a. For a ≤ x < b we have ∫ x 1 x−a F x = dt = a b−a b−a and for x ≥ b, it holds F x = 1. Altogether, the CDF of X Uniform a, b is given by 60 0 x < a, x−a F x = a ≤ x < b, b−a 1 x ≥ b. A Short Introduction to Integration For fixed a, b ∈ ℝ such that a < b and a given (piecewise) continuous function f : a, b ℝ the integral of f on the interval a, b , which is denoted by ∫ f x dx, b a is, in its most intuitive form, the blue-colored area below the curve of the following pic- ture. Figure 22: The Meaning of an Integral Source: Bartosch Ruszkowski (2022). In the picture, the function f is assumed to be non-negative for all x ∈ ℝ, which is suffi- cient for our purposes, hence we consider PDFs of random variables, which satisfy the non-negativity. Recall that for a given PDF f : ℝ ℝ the integral must satisfy 61 ∫ ∞ f x dx = 1, −∞ meaning that the area is one, which is the same as saying that the total probability is one. Remember the Fundamental Theorem of Calculus, which is one of the most powerful algebraic tools to compute integrals. For fixed a, b ∈ ℝ such that a < b and a given (piece- wise) continuous function f : a, b ℝ the function F : a, b ℝ is called primitive func- tion of f , if it satisfies F ’ x = f x for all x ∈ a, b , then, it holds ∫ f x dx = F b b −F a. a As an example, let us consider the function f x = 3x2 for x ∈ 0,1. The function F x = x3 for x ∈ 0,1 satisfies F ′ x = f x for all x ∈ 0,1 , which means that the following integral is algebraic computable in the following sense ∫ 3x 1 1 2 dx = x3 0 = 13 − 03 = 1. 0 We see in this example that we constructed a PDF, since the integral yields one. In detail, this example shows that the function 3x2 for x ∈ 0,1 , 0 otherwise, is a PDF and its CDF is given by 0 for x < 0, F x =P X≤x = x3 for x ∈ 0,1 , 1 for x > 1. A parachutist lands at a random point between two markers (A and B) that are 50 meters 1 apart. Find the probability that the distance to marker A is less than of the distance to 2 marker B. We can model the parachutist’s landing point as 62 X distance from marker A dA. We have X Uniform 0,50 , where 0 is the loca- tion of marker A and 50 is the location of marker B. The PDF is given by 1 1 f x = = for 0 ≤ x ≤ 50 and 0 otherwise. The CDF is given by 50 − 0 50 0 x < 0, x F x = 0 ≤ x < 50, 50 1 x ≥ 50. Below are the graphs of the PDF and CDF respectively. Figure 23: Plots of the PDF and CDF of a Uniform Distribution (x=50/3) Source: George Dekermenjian (2019). The probability that the parachutist lands at a distance from A that is less than half the distance from B is 1 dA < 50 − dA 2 50 50 50 3 1 P X< =F = = 3 3 50 3 Now let us calculate the probability that the parachutist lands at least 20 meters from marker A and at least 10 meters from marker B. This corresponds to the event 20 ≤ X ≤ 40 and so P 20 ≤ X ≤ 40 = P X ≤ 40 – P X < 20 = F 40 – F 20. Furthermore, from the fundamental theorem of calculus, we know that 63 ∫ 40 F 40 − F 20 = f t dt 20 Therefore, we have two ways of computing probabilities. 
If we have the CDF in simple form, we can plug in the relevant numbers as follows: 40 20 20 P 20 ≤ X ≤ 40 = F 40 − F 20 = − = = 0.4 or 40 %. 50 50 50 In some cases, we will not be able to write down the CDF in simple form. For such distribu- tions, it will be necessary to write down the integral expression using the PDF and, if nec- essary, approximate this integral as follows: ∫ ∫ 40 40 1 20 P 20 ≤ X ≤ 40 = f t dt = dt =. 20 20 50 50 Figure 24: Plots of the PDF and CDF of a Uniform Distribution (x=20 and x=40) Source: George Dekermenjian (2019). If you examine the figure above, you will find that this probability is the change of the y- coordinate in the CDF graph from x = 20 to x = 40. Additionally, it is also the area under the PDF curve between these two x values. The mathematical relationship between these two computations and the graphs is crucial to understanding the nature of PDFs and CDFs and to using them correctly in relevant applications. 64 The Normal Distribution Arguably the most widely known distribution both among academics and non-academics is the bell-shaped distribution, known as the Normal (Gaussian) Distribution. Indeed, many natural quantities have a shape that is approximately bell-shaped. A random varia- ble X following a normal distribution with parameters μ ∈ ℝ and σ > 0 is written as X N μ, σ2 and has the PDF 2 1 x−μ f x = e− 2 · σ2 2πσ2 The location parameter μ is called the mean and the scale parameter σ is called the stand- ard deviation. It is easier to work with the quantity σ2, which we call the variance. Below is a graph of several PDFs that have a unit variance σ2 = 1 and various means μ. Figure 25: Plots of PDFs of Normal Distributions with Various μ Source: George Dekermenjian (2019). For each of these PDFs, the density peaks at the mean (center) and quickly vanishes away from it. This means that for a normally distributed random variable, the values most likely to be observed are near the center. Values much larger or much lower than the center value are also less likely. Consider one of these distributions, say X N 0,1. We examine the probability of unit length intervals in the table below. Notice that as the interval moves away from the center (zero) the probability decreases significantly. 65 Table 7: Standard Normal Probabilities of Various Unit Intervals Unit-length interval a, b P a≤X≤b (0,1) 0.3413 (0.5,1.5) 0.2417 (1,2) 0.1359 (1.5,2.5) 0.0606 Source: George Dekermenjian (2019). Next, let us take a look at graphs of PDFs that have the same mean (center) at μ = 0 but different scales σ. Figure 26: Plots of PDFs of Normal Distributions with Various σ Source: George Dekermenjian (2019). These PDFs show that the density is spread over a wider range of values around the center for larger values of σ and vice-versa for smaller values (a narrower range of values). In other words, if we consider the same interval across various scales, the distributions that are spread out more will give a smaller probability. To explore this idea, we evaluate the probability P 0 ≤ Xi ≤ 1 forXi N 0, σi2 for σ12 = 1, σ22 = 1.5, and σ32 = 2. 66 Table 8: Normal Probabilities on [0,1] with Various σ X N 0, σi2 P 0≤X≤1 X N 0,1 0.3413 X N 0,1.52 0.2475 X N 0,22 0.1915 Source: George Dekermenjian (2019). The CDF of a normally distributed random variable cannot be written in closed form, indeed the integral 2 t−μ ∫ ∫ x 1 x F x =P X≤x = f t dt = e− 2 · σ2 dt −∞ 2πσ2 −∞ is the best we can do. 
Below are graphs of CDFs of normally distributed random variables with mean 0 and varying σ.

Figure 27: Plot of CDFs of Normal Distributions with Various σ

Source: George Dekermenjian (2019).

Note that for distributions with larger variance, it takes greater values of x to accumulate the same probability than for ones with smaller variance.

The following graph shows CDFs of normally distributed random variables with unit variance and various means μ.

Figure 28: Plot of CDFs of Normal Distributions with Various μ

Source: George Dekermenjian (2019).

For the CDFs with the same variance, note that the shapes are identical; the mean just shifts the graph. Additionally, note how each of the distributions accumulates 50% of the probability up to its respective mean. This means that the mean of a normally distributed random variable is also its median, which is typical of symmetric distributions.

Among all the different normal distributions, one deserves special attention: the standard normal distribution. A random variable Z has a standard normal distribution if it has a normal distribution with mean 0 and standard deviation 1, which means Z ~ N(0, 1). The PDF of Z is given by

$$f_{Z}(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^{2}}{2}}$$

and the CDF is given by

$$F_{Z}(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{t^{2}}{2}}\,dt.$$

The PDF of the standard normal distribution is symmetric with respect to the vertical axis. This means that the area to the left of the center, zero, is exactly 1/2, and the area to the right of zero is also 1/2. We can exploit this symmetry to compute certain probabilities. Probabilities are computed by finding the area under the curve; for example,

$$P(Z \le 1) = \int_{-\infty}^{1} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{t^{2}}{2}}\,dt.$$

A useful reference value obtained from such an area computation is P(−2 ≤ Z ≤ 2) ≈ 0.9545.

Example 2.10

Let Z ~ N(0, 1). Compute the following probabilities: a) P(0 ≤ Z ≤ 2); b) P(Z ≤ 2); c) P(Z > 2); d) P(Z < −2).

Solution 2.10

a) Using the previously mentioned fact that P(−2 ≤ Z ≤ 2) ≈ 0.9545, together with the symmetry, we know that

$$P(0 \le Z \le 2) = \tfrac{1}{2}\, P(-2 \le Z \le 2) \approx \frac{0.9545}{2} \approx 0.4773.$$

b) The event Z ≤ 2 is the disjoint union of the events Z < 0 and 0 ≤ Z ≤ 2. Therefore, by the sum property of probabilities, we have

$$P(Z \le 2) = P(Z < 0) + P(0 \le Z \le 2) \approx 0.5 + 0.4773 \approx 0.9773.$$

c) The event Z > 2 is the complement of the event Z ≤ 2. Therefore, by the complement rule we have

$$P(Z > 2) = 1 - P(Z \le 2) \approx 1 - 0.9773 \approx 0.0227.$$

d) Using symmetry, we have P(Z < −2) = P(Z > 2) ≈ 0.0227.

The probabilities in parts c) and d) of Example 2.10 are called tail probabilities. Such quantities come up quite often in statistical inference. Here are some frequently used tail probabilities.

Table 10: Standard Normal Tail Probabilities

P(Z > 2.576)  ≈ 0.0050
P(Z > 2.3264) ≈ 0.0100
P(Z > 1.6449) ≈ 0.0500
P(Z > 1.2816) ≈ 0.1000

Source: George Dekermenjian (2019).

We have devoted a substantial amount of time to discussing the standard normal distribution. What about all the other normal distributions? It turns out that computing with the standard normal distribution is enough because of the following fact: if X ~ N(μ, σ²), then

$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1).$$

In other words, if we want to compute probabilities with non-standard normal distributions, we can work with the related standard normal distribution by using the above transformation.
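Example 2.10 and the tail probabilities above are easy to check numerically. The following Python sketch is illustrative only (it assumes SciPy is available); it recomputes the probabilities with the built-in standard normal CDF and quantile functions:

# Example 2.10 and Table 10 revisited with scipy.stats.norm.
from scipy import stats

Z = stats.norm(loc=0, scale=1)

print(Z.cdf(2) - Z.cdf(0))   # a) P(0 <= Z <= 2), approximately 0.4772
print(Z.cdf(2))              # b) P(Z <= 2),      approximately 0.9772
print(1 - Z.cdf(2))          # c) P(Z > 2),       approximately 0.0228
print(Z.cdf(-2))             # d) P(Z < -2),      approximately 0.0228

# Tail probabilities: isf inverts the survival function 1 - F(z),
# recovering the thresholds listed in Table 10.
for p in [0.005, 0.01, 0.05, 0.10]:
    print(p, Z.isf(p))       # thresholds approximately 2.576, 2.326, 1.645, 1.282

The small differences from Solution 2.10 (0.4772 versus 0.4773, and so on) are only rounding effects of using the approximation P(−2 ≤ Z ≤ 2) ≈ 0.9545.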
Example 2.11

It is believed that IQ is normally distributed with a mean of 100 and a standard deviation of 15. Compute the probability that a randomly chosen person has an IQ between 85 and 115.

Solution 2.11

Let X denote the IQ of a randomly selected person. We are given that X ~ N(100, 15²). Using the transformation above, we have

$$P(85 \le X \le 115) = P\left(\frac{85 - 100}{15} \le \frac{X - 100}{15} \le \frac{115 - 100}{15}\right).$$

If we set Z = (X − 100)/15, then Z ~ N(0, 1). Therefore, the above probability is the same as

$$P(-1 \le Z \le 1) \approx 0.6827.$$

The reverse transformation also works. Suppose Z ~ N(0, 1); then the transformed random variable X = μ + σZ follows a normal distribution with mean μ and standard deviation σ, i.e., X ~ N(μ, σ²).

Example 2.12

Continuing from the previous example, find the IQ score that separates the top 5% from the rest. (Hint: use the fact that P(Z > 1.64485) ≈ 0.0500.)

Solution 2.12

Let X denote the IQ score of a randomly selected person. We want to find x₉₅ such that P(X ≤ x₉₅) = 0.9500. Following the hint, together with the complementary event, we know that

$$P(Z \le 1.64485) = 1 - P(Z > 1.64485) \approx 0.9500.$$

Furthermore, the transformed random variable 100 + 15Z follows a normal distribution with a mean of 100 and a standard deviation of 15, so we can take X = 100 + 15Z. Therefore,

$$0.9500 \approx P(100 + 15Z \le 100 + 15 \cdot 1.64485) = P(X \le 124.673).$$

In other words, the 95th percentile of X is x₉₅ ≈ 124.673. In our context, an IQ score of 124.673 would be higher than 95% of the population (since IQ is reported as an integer, this number would be rounded up to 125).

Student's T Distribution

The Student's T distribution comes up in statistical inference when the sample size is not sufficiently large. It behaves very much like the standard normal distribution, but its tails are "heavier" or "thicker". The mean (center) of the T distribution is always 0, but the standard deviation changes with the degrees-of-freedom parameter ν > 0. If X follows a T distribution with ν degrees of freedom, we write X ~ T(ν). The PDF of X is given by

$$f_{X}(x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\pi\nu}\;\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^{2}}{\nu}\right)^{-\frac{\nu+1}{2}} \quad \text{for any } x \in \mathbb{R}.$$

Here, Γ(y) is called the gamma function; it generalizes the factorial to all positive real numbers. In fact, if n ≥ 0 is an integer, then n! = Γ(n + 1). The gamma function is defined by

$$\Gamma(x) = \int_{0}^{\infty} t^{x-1} e^{-t}\,dt \quad \text{for any } x > 0.$$

(We will discuss the gamma distribution, which is closely related to the gamma function, later on.)

The graphs below show PDFs of random variables that follow the T distribution with different degrees of freedom. Additionally, the graph of the standard normal PDF is included as a reference for comparing the thickness of the tails.

Figure 32: Plots of PDFs of Various Student T Distributions Together with the Standard Normal Distribution

Source: George Dekermenjian (2019).

As you may have noticed, the larger the degrees of freedom, the closer the PDF is to the standard normal distribution. In statistical inference, the degrees of freedom are tied to the sample size. Therefore, if the sample size is large enough, the standard normal distribution can be substituted in the calculations without much loss of accuracy. As a matter of fact, it can be shown that T(ν) converges to N(0, 1) in the limit ν → ∞. To illustrate this fact, we compare the probabilities P(X ≤ −1) and P(Z ≤ −1), where X ~ T(ν) and Z ~ N(0, 1). Notice that as the degrees of freedom get larger, the two probabilities get very close to one another.

Figure 33: Convergence of Probability from Student T to Standard Normal

Source: George Dekermenjian (2019).

The CDF of the T distribution does not have a simple closed form. However, most statistical software packages have an implementation of this CDF.
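For instance, the comparison illustrated in Figure 33 can be reproduced with SciPy's implementation of the T distribution. The sketch below is illustrative only (it assumes SciPy is available) and evaluates P(X ≤ −1) for increasing degrees of freedom next to the standard normal value:

# The T CDF approaches the standard normal CDF as the degrees of freedom grow.
from scipy import stats

for v in [1, 2, 5, 10, 30, 100, 1000]:
    print(v, stats.t(df=v).cdf(-1))    # P(X <= -1) for X ~ T(v)

print("normal", stats.norm.cdf(-1))    # P(Z <= -1), approximately 0.1587

The same module also provides the quantile function, so the percentile from Example 2.12 could be obtained directly as stats.norm(100, 15).ppf(0.95), which returns approximately 124.67.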
The Exponential Distribution

Exponential distributions are used, among other things, to model the interarrival times between two successive events where the number of events follows a Poisson distribution (see Kim, 2019a). If X is exponentially distributed at a rate of λ > 0, we write X ~ Exponential(λ). Notice that this distribution has only one parameter. We will see in the next section that λ is related to the mean of the distribution. The PDF is given by

$$f_{X}(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0, \\ 0 & \text{otherwise}. \end{cases}$$

The following graphic shows plots of PDFs of different exponential distributions:

Figure 34: Plots of PDFs of Various Exponential Distributions

Source: George Dekermenjian (2019).

The CDF of an exponential distribution can be derived by simple integration:

$$F_{X}(x) = \int_{-\infty}^{x} f_{X}(t)\,dt = \int_{0}^{x} \lambda e^{-\lambda t}\,dt = 1 - e^{-\lambda x} \quad \text{for any } x \ge 0,$$

and for x < 0 we have F_X(x) = 0.

Figure 35: Plots of CDFs of Various Exponential Distributions

Source: George Dekermenjian (2019).

Example 2.13

A battery has a lifespan in hours that is exponentially distributed with a parameter rate of λ = 1/2500. Find the probability that a randomly selected battery will die out before 3000 hours.

Solution 2.13

Let X be the lifespan of a randomly selected battery. We know that X ~ Exponential(1/2500). The probability that this battery dies out before 3000 hours is the same as saying that the lifespan of the battery is less than 3000. Therefore, we compute

$$P(X < 3000) = F_{X}(3000) = 1 - e^{-\frac{3000}{2500}} \approx 0.6988.$$

Therefore, the probability is about 69.88%.

The Beta Distribution

The beta distribution can be used to model the behavior of a random variable whose range is a finite interval. The PDF of any beta distribution is supported on the closed interval [0, 1]. In data science, the beta distribution comes up in Bayesian inference when we want to incorporate prior knowledge from data into the modeling of unknown parameter(s) of Bernoulli, binomial, geometric, and negative binomial distributions. Historically, the reason for its popularity comes from the fact that the posterior is in the same distribution family as the prior when using the beta distribution. A well-known application of the beta distribution in education is to model the true test score of students (see Sinharay, 2010). The beta distribution has two parameters, α > 0 and β > 0. Both of these parameters are interpreted as shape parameters. If the random variable X follows a beta distribution with parameters α and β, we write X ~ Beta(α, β). The PDF is given by

$$f_{X}(x) = \begin{cases} \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1} & 0 \le x \le 1, \\ 0 & \text{otherwise}. \end{cases}$$

Figure 36: Plots of PDFs of Various Beta Distributions

Source: George Dekermenjian (2019).

As you can see in the figure above, the beta family of distributions is quite diverse. One member of the family even reduces to the uniform distribution (with parameters α = β = 1). Some of the PDFs have a maximum, while others have a minimum. Some PDF
