Statistics Textbook PDF

Chapter 1 Overview and descriptive statistics

1.1 Populations, samples, and processes

1.1.1 Population vs sample
Population: the entire collection of individuals or objects to be considered or studied. We usually use $N$ to denote the size of a finite population.
Observation: a single individual entity, or an individual measurement.
Variable: a characteristic of an individual or object in a population of interest.
Census: taken when every unit in the population is measured or surveyed.
Sample: a subset of the entire population; a small selection of individuals or objects taken from the entire population.

© 2023 Yan Hua Tian @ York University. All Rights Reserved. November 24, 2024

Example 1.1 We are interested in the height of Canadian children aged 5 to 9 years in 2018 (approximately 2,000,000 children). In this example,
The population is:
The variable is:
The observation is:
Describe a census for this example.
Design one sample with sample size $n = 100$.
Why create a sample?

1.1.2 (Simple) Random sample vs convenience sample
Random sample: the sample is randomly chosen from the population; every individual/object has the same chance of being selected.
Convenience sample: the sample is chosen by convenience.
Example 1.2 We are interested in grade 9 math scores in Toronto.
Sample 1: choose one class from one high school.
Sample 2: randomly draw from a box containing the IDs of all grade 9 students in Toronto.

1.1.3 Probability vs statistics
Probability problem: the properties/characteristics of the population are assumed to be known. We can then answer questions about a sample from that population.
(Inferential) Statistics problem: we assume little about a population; we use the information contained in a sample to make a generalization about the population and to answer questions concerning the population.
Relation between probability and statistical inference:
Example
1.
A box contains 7 red balls and 3 blue balls. Randomly select one ball from the box: is the color more likely to be red or blue?
2. A box contains red balls and blue balls. Randomly select one ball from the box, record the color, then put it back. Repeat the selection 10 times. The chosen balls' colors are: red, blue, red, blue, blue, blue, red, red, red, red. Estimate the proportion of red balls in the box.

1.1.4 Types of data
1. Categorical variable: the possible values are non-numerical and can be grouped into categories.
   Ordinal: categorical variable whose set of possible values has an order.
   Nominal: categorical variable whose set of possible values has no order.
2. Numerical variable: the values are numerical.
   Discrete: numerical variable with finitely many, or countably infinitely many, possible values.
   Continuous: numerical variable whose set of possible values is an interval of numbers.
Example 1.3 The population is Canadian children (9-10). We choose a random sample of 100 children.
1. We measure and record their height.
2. We count and record their number of teeth.
3. We record their gender.
4. We record their overall GPA.

1.2 Measures of location and variability
Summation notation:
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n.$$
Example 3.1 P74: Let $x_1 = 5$, $x_2 = 9$, $x_3 = 12$, $x_4 = -6$, $x_5 = 17$, $x_6 = -2$. Calculate:
a. $\left(\sum_{i=1}^{6} x_i\right)^2$
b. $\sum_{i=1}^{3} x_i^2$
c. $\sum_{i=2}^{5} (x_i - 7)^2$

1.2.1 Sample mean
A set of observations is denoted by $x_1, x_2, \dots, x_n$. The sample mean is defined as
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_n}{n}.$$
Note: We denote the population mean by $\mu$, which is a fixed constant, while $\bar{x}$ varies from sample to sample.
Example 1.4 One company randomly selected 10 employees; their annual incomes are listed below. Calculate the sample mean.
45000, 47650, 51000, 55000, 62000, 62500, 65000, 66000, 67500, 270000
1.2.2 Measures of variability
Measures of the center of a data set describe only one characteristic of the data. Measures of variability describe another.
Example 1.5 We record and compare the daily maximum temperature in Toronto and Vancouver. Two random samples:
Toronto: -27, -11, 0, 12, 15, 15, 28, 33, 36
Vancouver: 3, 5, 5, 8, 9, 12, 13, 19, 21, 22

1.2.3 Sample variance $s^2$ and sample standard deviation $s$
The $i$th deviation about the mean is $x_i - \bar{x}$.
Sample variance:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n-1} = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)^2\right).$$
Sample standard deviation: $s = \sqrt{s^2}$.
Q1: Why square $x_i - \bar{x}$?
Q2: $s^2$ and $s$ are similar; why do we need both?
Example 1.6 We record and compare the daily maximum temperature in Toronto and Vancouver. Two random samples:
Toronto: -27, -11, 0, 12, 15, 15, 28, 33, 36
Vancouver: 3, 5, 5, 8, 9, 12, 13, 19, 21, 22
Note:
1. A large variance implies large variability in the data.
2. The sample standard deviation $s$ has the same unit as the original data.
3. $s = 0$,
4. The population variance, a measure of the variability of an entire population, is denoted by $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$, and the population standard deviation is denoted by $\sigma = \sqrt{\sigma^2}$.
5. Unlike the population variance $\sigma^2$, the sample variance uses the divisor $n - 1$. This is because the $x_i$'s tend to be closer to their average $\bar{x}$ than to $\mu$. To compensate for this, we use the smaller divisor $n - 1$ rather than the sample size $n$.
Summary of symbols:
              mean    variance    standard deviation
population
sample

Chapter 2 Probability

2.1 Sample space and event
Experiment: an activity in which there are at least two possible outcomes and the result cannot be predicted with absolute certainty.
Sample space, $S$: a listing of all the possible outcomes of an experiment, using set notation.
Event: any collection (or set) of outcomes from an experiment.

2.1.1 Set operations
Let $A$, $B$ be two events associated with a sample space $S$.
Complement, $A'$: all outcomes in the sample space $S$ that are not in $A$. Read as "not A".
Union, $A \cup B$: all outcomes that are in $A$ or $B$ or both. Read as "A or B".
Intersection, $A \cap B$: all outcomes in both $A$ and $B$. Read as "A and B".
Disjoint or mutually exclusive: $A \cap B = \{\}$ or $A \cap B = \emptyset$; the events have no outcomes in common. Here $\{\}$ and $\emptyset$ both denote the empty set, and $P(\emptyset) = 0$.
Exhaustive: $A \cup B = S$; together the events include all outcomes of the sample space.
Mutually exclusive and exhaustive: $A \cap B = \{\}$ and $A \cup B = S$; the events have no outcomes in common and together include all outcomes of the sample space.
Venn diagram examples.
Example 2.1 Three components are connected to form a system as shown below. Components 2 and 3 are connected in parallel, hence the system functions if component 1 functions and at least one of components 2, 3 functions. The experiment records the condition, success (S) or failure (F), of all three components in the order 1, 2, 3.
1. List all outcomes in the sample space $S$.
2. Write the outcomes for the following events:
   A = two out of the three components function.
   B = at least two of the three components function.
   C = the system functions.
   D = only one component functions.
3. Are any events mutually exclusive (disjoint)?
4. Are any events mutually exclusive and exhaustive?
5. Find $C'$, $A \cup C$, $A \cap C$, $B \cup C$, and $B \cap C$.
2.2 Axioms, interpretations and properties of probability
Relative frequency of occurrence of an event $A$: the number of times the event occurs divided by the total number of times the experiment is conducted.
$$\text{relative frequency} = \frac{n(A)}{N}$$
Probability of an event $A$, $P(A)$: the limiting relative frequency, that is, the proportion of times the event $A$ occurs in the long run.
$$P(A) = \lim_{N \to \infty} \frac{n(A)}{N}$$
Example 2.2 Toss a fair coin $N$ times and count the number of heads.
N                     10    100    1000    10000    100000
n(H)                   6     54     489     5013     49979
Relative frequency
Axioms of probability:
A1. $P(A) \ge 0$.
A2. If $S$ is the sample space, then $P(S) = 1$.
A3. If $A_1, A_2, \dots$ are infinitely many disjoint events, then $P(A_1 \cup A_2 \cup \dots) = \sum_{i=1}^{\infty} P(A_i)$.
Other properties of probability:
1. $A \cap B = B \cap A$, hence $P(A \cap B) = P(B \cap A)$.
2. $A \cup B = B \cup A$, hence $P(A \cup B) = P(B \cup A)$.
3. Let $\emptyset$ be the empty set; then $P(\emptyset) = 0$.
4. $P(A) + P(A') = 1$.
5. $P(A) \le 1$.
6. If $A$, $B$ are disjoint events, then $P(A \cap B) = 0$.
7. For any two events $A$, $B$: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
   $P(A \cup B) =$
   If $A$, $B$ are disjoint events, then $P(A \cup B) = P(A) + P(B)$.
8. For any three events $A$, $B$, $C$:
   $P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$.
9. For an event $A$, $P(A)$ is the sum of the probabilities of all the outcomes in $A$. Let $E_1, E_2, \dots$ denote the simple events of an experiment; then $\sum_i P(E_i) = 1$ and $P(A) = \sum_{E_i \in A} P(E_i)$.
Example 2.3 $S = \{1, 3, 5, 8\}$, $p(1) = 0.2$, $p(3) = 0.3$, $p(5) = 0.4$, $p(8) = 0.1$. If $A = \{1, 5\}$, then $P(A) = ?$
Example 2.4 (Example 2.14, P62) Given the following information: 60% of families have TV cable; 80% of families have internet cable; 50% of families have both TV and internet cable. Determine the probability that a randomly selected family
1. has TV cable or internet cable or both.
2. has neither of the two cables.
3. has only TV cable.
4. has only internet cable.
Example 2.5 (exercise 15, P65) Clothes dryers come in two types: electric and gas. Consider five independent customers at one store.
1. If the probability that at most one of these customers purchases an electric dryer is 0.428, what is the probability that at least two purchase an electric dryer?
2. If $P(\text{all five purchase gas}) = 0.116$ and $P(\text{all five purchase electric}) = 0.005$, what is the probability that at least one of each type is purchased?

An equally likely outcome experiment is one in which all outcomes have the same chance of occurring. In such an experiment, the probability of any event $A$ is the number of outcomes in $A$ divided by the total number of outcomes in the sample space $S$:
$$P(A) = \sum_{E_i \in A} P(E_i) = \frac{N(A)}{N(S)}$$
Example 2.6 Toss a fair four-sided die twice. The numbers on the sides are 1, 2, 3, 4. Define the events:
A = at least one of the tosses gives 4.
B = the sum of the numbers from the two tosses is 5.
C = both tosses give 1.
Calculate $P(A)$, $P(B)$, $P(C)$.
Example 2.7 One drink was poured into three cups labelled A, B, C. An individual is asked to taste all three cups and rank them by preference.
1. Write the sample space.
2. What is the probability that A is ranked first?
3. What is the probability that A is ranked first or B is ranked first?
4. What is the probability that A is ranked first and B is ranked last?

2.3 Conditional probability
Conditional probability of $A$ given that the event $B$ has occurred:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0.$$
Example 2.8 Roll a fair six-sided die. $S = \{1, 2, 3, 4, 5, 6\}$. $A = \{1\}$ = roll a 1. $B = \{1, 3, 5\}$ = roll an odd number. Find $P(A)$ and $P(A \mid B)$.
Example 2.9 A survey about how people choose to travel to work.
             Male (M)   Female (F)   Total
Drive (D)        3           7         10
Bus (B)          6           3          9
Total            9          10         19
1. Find the probability that an individual drives to work and is male.
2. Find the probability that an individual drives to work given that the individual is male.
3. Find the probability that an individual is male given that the individual takes the bus to work.
Example 2.10 A couple has two children. Assume the gender (girl or boy) of any new birth is random and equally likely.
1. What is the probability that both children are boys?
2. One of the children is a boy; what is the probability that the other is also a boy?
3. The older child is a boy; what is the probability that the other is also a boy?

The probability multiplication rule: For any two events $A$, $B$,
$$P(A \cap B) = P(B) \cdot P(A \mid B) = P(A) \cdot P(B \mid A).$$
Example 2.11 Approximately 18% of shoppers used a cell phone to look for deals. The probability of making a purchase was 0.45 if a person used a cell phone to shop. For a randomly selected shopper, what is the probability that the person used a cell phone and made a purchase?

The law of total probability: Let $A_1, A_2, \dots, A_k$ be mutually exclusive and exhaustive events. Then for any other event $B$,
$$P(B) = \sum_{i=1}^{k} P(B \mid A_i) \cdot P(A_i) = P(B \mid A_1) \cdot P(A_1) + P(B \mid A_2) \cdot P(A_2) + \dots + P(B \mid A_k) \cdot P(A_k).$$
Example 2.12 (Example 2.30, P81) One person has three email accounts: A, B, C.
                                   A      B      C
probability of receiving email    70%    20%    10%
probability of spam                1%     2%     5%
1. Find the probability that a randomly selected email is from A and is spam.
2. Find the probability that a randomly selected email is spam.
2.3.1 Medical screen test terminologies
$D$ = {patient has the disease}, $D'$ = {patient does not have the disease}
$T$ = {test result is positive}, $T'$ = {test result is negative}
Sensitivity of a test: the probability of a positive test result when the patient has the disease, $P(T \mid D)$.
Specificity of a test: the probability of a negative test result when the patient does not have the disease, $P(T' \mid D')$.
Positive predictive value (PPV): $P(D \mid T)$.
Negative predictive value (NPV): $P(D' \mid T')$.
Prevalence: $P(D)$, the proportion of a population that has the disease.
Incidence: the number or proportion of newly diagnosed cases in a specified time period.

Bayes' Rule Suppose events $B_1, B_2, \dots, B_k$ are mutually exclusive and exhaustive, with prior probabilities $P(B_1), P(B_2), \dots, P(B_k)$. If an event $A$ occurs, the posterior probability of $B_i$ given $A$ is
$$P(B_i \mid A) = \frac{P(B_i \cap A)}{P(A)} = \frac{P(A \mid B_i) \cdot P(B_i)}{\sum_{j=1}^{k} P(A \mid B_j) \cdot P(B_j)}.$$
Special cases:
For $k = 2$:
$$P(B_1 \mid A) = \frac{P(A \mid B_1) \cdot P(B_1)}{P(A \mid B_1) \cdot P(B_1) + P(A \mid B_2) \cdot P(B_2)}$$
For an event $B$ and its complement $B'$:
$$P(B \mid A) = \frac{P(A \mid B) \cdot P(B)}{P(A \mid B) \cdot P(B) + P(A \mid B') \cdot P(B')}$$
For $k = 3$:
$$P(B_1 \mid A) = \frac{P(A \mid B_1) \cdot P(B_1)}{P(A \mid B_1) \cdot P(B_1) + P(A \mid B_2) \cdot P(B_2) + P(A \mid B_3) \cdot P(B_3)}$$
Example 2.13 Alicia learned that she had a positive result on a diagnostic test for a certain disease, and wants to find the probability of having the disease given that the test was positive. We know:
1. One in every 1000 people has this disease. The test is not always accurate.
2. The probability of a positive test given that a person has the disease is 0.95.
3. The probability of a negative test given that a person does not have the disease is 0.9.
              test +    test -      total
disease           95         5        100
no disease      9990     89910     99,900
total          10085     89915    100,000
Practice: If Alicia took the test twice (assume the results are independent) and both results are positive, what is the probability that Alicia has the disease?
Is this probability higher than the above result? Why? Answer: 0.0829.

2.4 Independence
$A$, $B$ are independent if and only if $P(A \mid B) = P(A)$.
$A$, $B$ are dependent events if $P(A \mid B) \ne P(A)$.
If $A$, $B$ are independent, then $(A', B)$, $(A, B')$, and $(A', B')$ are also independent:
$P(A \mid B') =$
$P(A' \mid B) =$
$P(A' \mid B') =$
1. Are independent events disjoint (mutually exclusive)?
2. Are disjoint events independent?
The probability multiplication rule (independent): If $A$, $B$ are independent, then $P(A \cap B) = P(A) \cdot P(B)$.
$A_1, A_2, \dots, A_n$ are mutually independent if for every $k = 2, 3, \dots, n$ and every subset of indices $i_1, i_2, \dots, i_k$,
$$P(A_{i_1} \cap A_{i_2} \cap \dots \cap A_{i_k}) = P(A_{i_1}) \cdot P(A_{i_2}) \cdots P(A_{i_k}).$$
For example, with $n = 4$, $A_1, A_2, A_3, A_4$ are mutually independent if:
1. They are pairwise independent (every pair is independent):
   $P(A_1 \cap A_2) = P(A_1) \cdot P(A_2)$, $P(A_1 \cap A_3) = P(A_1) \cdot P(A_3)$, $P(A_1 \cap A_4) = P(A_1) \cdot P(A_4)$,
   $P(A_2 \cap A_3) = P(A_2) \cdot P(A_3)$, $P(A_2 \cap A_4) = P(A_2) \cdot P(A_4)$, $P(A_3 \cap A_4) = P(A_3) \cdot P(A_4)$.
2. Every group of three events satisfies:
   $P(A_1 \cap A_2 \cap A_3) = P(A_1) \cdot P(A_2) \cdot P(A_3)$, $P(A_1 \cap A_3 \cap A_4) = P(A_1) \cdot P(A_3) \cdot P(A_4)$,
   $P(A_1 \cap A_2 \cap A_4) = P(A_1) \cdot P(A_2) \cdot P(A_4)$, $P(A_2 \cap A_3 \cap A_4) = P(A_2) \cdot P(A_3) \cdot P(A_4)$.
3. All four events satisfy: $P(A_1 \cap A_2 \cap A_3 \cap A_4) = P(A_1) \cdot P(A_2) \cdot P(A_3) \cdot P(A_4)$.
Example 2.14 (exercise 71, P87) Two projects, one in Asia and one in Europe. $A$ = {the Asian project is successful}, $B$ = {the European project is successful}. $A$, $B$ are independent, $P(A) = 0.4$, $P(B) = 0.7$.
1. If the Asian project is not successful, what is the probability that the European project is also not successful?
2. What is the probability that at least one of the two projects will be successful?
Example 2.15 If components are connected in parallel, the subsystem works if at least one component works. If components are connected in series, the subsystem works only if all components work. Assume all components are of the same type and mutually independent, with $P(\text{component } i \text{ works}) = 0.9$, $i = 1, 2, \dots, n$. For the following system, calculate $P(\text{system works})$.

Chapter 3 Discrete random variables and probability distributions

3.1 Random variables
Random variable: for a given sample space $S$ of some experiment, a random variable is a rule that assigns a unique numerical value to each outcome in the sample space. In mathematical language, a random variable is a function whose domain is the sample space and whose range is a set of real numbers.
Discrete random variable: the set of possible values is finite, or is a countably infinite sequence.
Continuous random variable: the set of possible values is an interval, or a disjoint union of intervals.
A random variable is usually denoted by a capital letter.
Example 3.1 (Example 3.3, P97) There are two gas stations, each with three pumps. We record how many pumps are in use at each gas station. Define the random variables $X$, $Y$, $U$:
$X$: the total number of pumps in use at the two stations.
$Y$: the difference between the numbers of pumps in use at the two stations (1, 2).
$U$: the maximum of the numbers of pumps in use at the two stations.
If one experiment has outcome (2, 3), find the values of the random variables.

3.2 Probability distributions for discrete random variables
The probability distribution or probability mass function (pmf) of a discrete random variable is defined for every number $x$ by
$$p(x) = P(X = x) = P(\text{all } \omega \in S : X(\omega) = x).$$
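The idea that a random variable is a function on the sample space, and that $p(x)$ collects the probability of all outcomes mapped to $x$, can be sketched with the gas station setting of Example 3.1. This is only an illustration under stated assumptions: the sign convention for $Y$ (station 1 minus station 2) and the equally likely outcomes are assumptions made here, not part of the example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: (pumps in use at station 1, pumps in use at station 2).
# Each station has three pumps, so each count is 0, 1, 2, or 3.
sample_space = list(product(range(4), repeat=2))

# Random variables are functions from outcomes to numbers.
def X(outcome):  # total number of pumps in use
    return outcome[0] + outcome[1]

def Y(outcome):  # ASSUMED sign convention: station 1 minus station 2
    return outcome[0] - outcome[1]

def U(outcome):  # maximum number of pumps in use at either station
    return max(outcome)

def pmf(rv, outcome_prob):
    """p(x) = sum of P(omega) over all outcomes omega with rv(omega) = x."""
    p = Counter()
    for omega, prob in outcome_prob.items():
        p[rv(omega)] += prob
    return dict(p)

# ASSUMPTION for illustration only: all 16 outcomes are equally likely.
outcome_prob = {omega: Fraction(1, 16) for omega in sample_space}
pmf_X = pmf(X, outcome_prob)

print(X((2, 3)), Y((2, 3)), U((2, 3)))
print(pmf_X)
```

Exact fractions are used so that the pmf values can be checked to sum to 1 without rounding error; with real data the outcome probabilities would replace the equally likely assumption.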
The cumulative distribution function (cdf) $F(x)$ of a discrete random variable $X$ with pmf $p(x)$ is defined for every number $x$ by
$$F(x) = P(X \le x) = \sum_{y:\, y \le x} p(y).$$
For any number $x$, $F(x)$ is the probability that the observed value of $X$ will be at most $x$.
For any two numbers $a$, $b$ with $a \le b$ (when $X$ takes integer values and $a$, $b$ are integers):
$$\begin{aligned}
P(a \le X \le b) &= P(X = a \text{ or } a+1 \text{ or } a+2 \text{ or } \dots \text{ or } b-1 \text{ or } b) \\
&= P(X = a) + P(X = a+1) + \dots + P(X = b-1) + P(X = b) \\
&= P(X \le b) - P(X \le a-1) \\
&= F(b) - F(a-1)
\end{aligned}$$
Example 3.2 (exercise 11, P107) Let $X$ be the number of students who show up during office hour. The pmf of $X$ is
x       0      1      2      3      4
p(x)    0.2    0.25   0.3    0.15
1. Find $p(4)$.
2. What is the probability that at least two students show up?
3. What is the probability that at most two students show up?
4. What is the probability that more than two students show up?
5. What is the probability that between one and three students show up (inclusive)?
Example 3.3 We ordered two books online. Both are supposed to arrive on Wednesday, but may arrive late on Thursday or Friday. Suppose the two books arrive independently with the same probabilities $p(\text{Wed}) = 0.6$, $p(\text{Thur}) = 0.3$, $p(\text{Fri}) = 0.1$. Define the rv $Y$ = the number of days later than Wednesday it takes for both books to arrive. Compute the pmf of $Y$.

3.3 Expected value and variance of X
Let $X$ be a discrete rv with set of possible values $D$ and pmf $p(x)$. The expected value or mean value of $X$, denoted by $E(X)$ or $\mu_X$, is
$$E(X) = \mu_X = \sum_{x \in D} x \cdot p(x).$$
The expected value of any function $h(X)$, denoted by $E[h(X)]$ or $\mu_{h(X)}$, is
$$E[h(X)] = \sum_{x \in D} h(x) \cdot p(x).$$
The variance of $X$, denoted by $V(X)$, $\sigma_X^2$, or $\sigma^2$, is
$$V(X) = \sum_{x \in D} (x - E(X))^2 \cdot p(x) = \sum_{x \in D} x^2 \cdot p(x) - E(X)^2 = E(X^2) - [E(X)]^2.$$
The standard deviation (SD) of $X$ is $\sigma_X = \sqrt{\sigma_X^2}$.
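The formulas for $E(X)$, $E[h(X)]$, $V(X)$, and $\sigma_X$ translate directly into a few lines of code. The sketch below uses a hypothetical pmf chosen purely for illustration (values 80, 100, 120 with probabilities 0.2, 0.3, 0.5); the function names are also illustrative.

```python
import math

def expected_value(pmf, h=lambda x: x):
    """E[h(X)] = sum of h(x) * p(x) over the possible values x."""
    return sum(h(x) * p for x, p in pmf.items())

def variance(pmf):
    """V(X) = E(X^2) - [E(X)]^2 (the shortcut formula)."""
    mu = expected_value(pmf)
    return expected_value(pmf, lambda x: x ** 2) - mu ** 2

# Hypothetical pmf chosen for illustration: value -> probability.
pmf = {80: 0.2, 100: 0.3, 120: 0.5}
assert math.isclose(sum(pmf.values()), 1.0)  # a valid pmf sums to 1

mu = expected_value(pmf)   # mean
var = variance(pmf)        # variance
sd = math.sqrt(var)        # standard deviation
print(mu, var, sd)
```

Passing a different `h` to `expected_value` gives $E[h(X)]$ for any function of $X$, which is how the shortcut formula obtains $E(X^2)$.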
Example 3.4 A fair six-sided die has the numbers 1, 2, 3, 4, 5, 6, one number on each side. Toss the die and let the random variable $X$ be the number on the top side. Find the expected value, variance, and standard deviation of $X$.
Question: Compare $\bar{x}$, $\mu$ and $\mu_X$.

3.3.1 Bernoulli random variable
Bernoulli experiment: an experiment with only two possible random outcomes, success and failure. The probability of success is the same every time the experiment is conducted.
Bernoulli random variable: a random variable whose only possible values are 0 and 1.
pmf:
Expected value: practice by yourself.
Variance: practice by yourself.
Standard deviation: practice by yourself.
Example 3.5 Toss a fair coin with two sides: H, T.

3.4 The binomial probability distribution
Binomial experiment:
1. Consists of $n$ trials.
2. Each trial has only two outcomes: success, failure.
3. The outcomes of different trials are independent.
4. The probability of a success, $p$, is constant from trial to trial.
Binomial random variable, $X$: the number of successes in $n$ trials. $X \sim B(n, p)$.
Example: Toss a fair coin 100 times.
Example 3.6 Which of the following is (are) binomial experiment(s)?
1. We record the weather condition (sunny, not sunny) for the next 10 days.
2. We toss five coins, once for each coin, and record the top sides.
3. There are four blue balls and six red balls in one box, all of the same shape and weight. We randomly select three balls with replacement and record the color.
4. 21% of York students work part time. For 20 randomly selected York students, we record whether they work part time.

Binomial random variable, $X$, is defined as the number of successes in $n$ trials, where the probability of success is denoted by $p$. The pmf of $X$: $X \sim B(n, p)$.
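As a hedged illustration, the standard binomial pmf $b(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k}$ can be computed with Python's built-in `math.comb`. The helper names and the fair-coin numbers ($n = 10$, $p = 0.5$) below are assumptions chosen for illustration, not values from the notes.

```python
from math import comb, isclose

def binom_pmf(k, n, p):
    """b(k; n, p) = C(n, k) * p^k * (1 - p)^(n - k) for k = 0, 1, ..., n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_cdf(x, n, p):
    """P(X <= x) as the sum of b(k; n, p) for k = 0, ..., x."""
    return sum(binom_pmf(k, n, p) for k in range(0, x + 1))

# Hypothetical illustration: number of heads in n = 10 tosses of a fair coin.
n, p = 10, 0.5
p_five = binom_pmf(5, n, p)          # P(X = 5)
p_at_most_two = binom_cdf(2, n, p)   # P(X <= 2)
total = binom_cdf(n, n, p)           # all probabilities together
print(p_five, p_at_most_two, total)
```

Summing the pmf over $k = 0, \dots, n$ should give 1, which is a quick sanity check on any hand calculation.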
Mean of X:
Variance of X:
Standard deviation of X:
$$P(X \le x) = \sum_{k=0}^{x} b(k; n, p) = P(X = 0) + P(X = 1) + \dots + P(X = x).$$
Example 3.7 (exercise 65, P125) Customers at a gas station make payment with:
               A: credit card    B: debit card    C: cash
probability         0.5               0.2            0.3
Among the next 100 customers, what are the mean and variance of the number who pay with a debit card? Why?
Example 3.8 (exercise 49, P123) The defective rate of one product at a company is 10%. Assume that whether one product is defective is independent of the others. For six randomly selected products:
1. How likely is it that only one is defective?
2. How likely is it that at least two are defective?
3. Given that at least two are defective, what is the probability that exactly three are defective?
Example 3.9 (exercise 47, P123) Assume 4 of 10 auto accidents involve a single vehicle. Suppose 15 accidents are randomly selected. Use appendix table A.1 to find the probability that
1. at most 4 involve a single vehicle.
2. exactly 4 involve a single vehicle.
3. between 2 and 4, inclusive, involve a single vehicle.
4. at least 2 involve a single vehicle.

Chapter 4 Continuous random variables and probability distributions

4.1 Probability density functions
A continuous probability distribution completely describes the random variable and is used to compute probabilities associated with the random variable.
Probability density function (pdf), $f(x)$:
1. is a function defined for all real numbers, i.e. $x \in (-\infty, +\infty)$.
2. is a smooth curve that describes the probability distribution of a continuous random variable $X$ through the area under the curve. Let $a \le b$; then
$$P(a \le X \le b) = \int_a^b f(x)\,dx.$$
Example 4.1 Density curves of monthly income for four different regions.
[Figure: four density curves of monthly income; the horizontal axes run from 0 to 8K, 14K, 19K, and 19K respectively.]

The cumulative distribution function (cdf) $F(x)$ for a continuous rv $X$ is defined for every number $x$ by
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(y)\,dy.$$
$F(x)$ is the area under the density curve to the left of $x$.
Note: For any numbers $a$, $b$ with $a \le b$:
1. $P(X > a) =$
2. $P(a \le X \le b) =$
3. $f(x) = F'(x)$.
4. The expected value and variance of a continuous rv $X$:
   $\mu_X = E(X) =$
   $\mu_{h(X)} = E[h(X)] =$
   $\sigma_X^2 = V(X) =$
   $\sigma_X =$
Example 4.2 (exercise 1, P146) A rv $X$ has pdf
$$f(x) = \begin{cases} 0.075x + 0.2 & 3 \le x \le 5 \\ 0 & \text{otherwise} \end{cases}$$
1. Graph the pdf and verify the total area under the pdf curve.
2. Calculate $P(X \le 4)$, $P(3.5 \le X \le 4.5)$.
3. Calculate $E(X)$, $V(X)$, $\sigma_X$.

4.1.1 Uniform distribution
The random variable $X$ has a uniform distribution on the interval $[a, b]$ if its pdf is
$$f(x; a, b) = \begin{cases} \dfrac{1}{b-a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$
Example 4.3 (exercise 7, P147) A uniform distribution on the interval [0.2, 4.25].
1. Find the pdf and graph it.
2. Compute $P(X > 3)$, $P(0.5 < X < 2.1)$.

4.2 The normal distribution
The normal distribution has two parameters: $\mu$, $\sigma$ (or $\mu$, $\sigma^2$), with $-\infty < \mu < +\infty$ and $\sigma > 0$. We write $X \sim N(\mu, \sigma^2)$. The pdf is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}.$$
Example 4.4
Height of CA males 40-59 years old: $N(175, 7^2)$
Height of CA females 40-59 years old: $N(162, 5^2)$

Standard normal distribution $N(0, 1)$: that is, $\mu = 0$, $\sigma = 1$. The pdf is
$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}.$$
The cdf:
$$\Phi(z) = P(Z \le z) = \int_{-\infty}^{z} f(y)\,dy = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\,dy$$
Example 4.5 Find the probability or determine the constant $c$.
1. $P(0.2 \le Z \le 1.32)$; $P(Z \ge -0.06)$
2. $\Phi(c) = 0.9838$; $P(-c \le Z \le c) = 0.68$; $P(|Z| \ge c) = 0.05$

Critical value $z_\alpha$: $\alpha$ refers to the area on the right.
Example (exercise 31, P167) Determine $z_\alpha$ for the following $\alpha$.
1. $\alpha = 0.05$
2. $\alpha = 0.95$
Standardizing a normal rv $X$:

Example (exercise 35, P167) The reboot time for a machine follows a normal distribution with mean 8.46 min and standard deviation 0.913 min.
1. What is the probability that the reboot time is at least 10 min? Exceeds 10 min?
2. What is the probability that the reboot time is between 8 and 10 min, inclusive?
3. Find $c$ such that 98% of reboot times are in the interval $(8.46 - c,\ 8.46 + c)$.
4. If four reboot times are randomly and independently selected, what is the probability that at least one of them exceeds 10 min?

Chapter 5 Joint probability distributions and random samples

5.1 Statistic and the distribution
Population parameter: a numerical measure of a population.
Sample statistic: a numerical characteristic of the sample.
A point estimate is a single value of the selected point estimator, computed from a given sample.
            population parameter                            sample statistic                                                 point estimate
mean        $\mu = \frac{\sum X}{N}$                        $\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$                          is a point estimate of
variance    $\sigma^2 = \frac{\sum (X-\mu)^2}{N}$           $s^2 = \frac{\sum_{i=1}^{n} (x_i-\bar{x})^2}{n-1}$                is a point estimate of
sd          $\sigma = \sqrt{\frac{\sum (X-\mu)^2}{N}}$      $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i-\bar{x})^2}{n-1}}$           is a point estimate of
Example 5.1 We are interested in the average height of Canadian children 5-9 years old. The population size is about 2M. We take a sample of 10 children:
109, 109, 109, 105, 115, 110, 109, 114, 120, 103.
To estimate the population mean $\mu$, the average height of all Canadian children (5-9 years old), we can choose among the following point estimators:
Sample mean: $\bar{x} = 110.3$
Sample median: $\tilde{x} = 109$
Midrange: $\frac{x_{\max} + x_{\min}}{2} = \frac{120 + 103}{2} = 111.5$
Mode: 109
Bias of an estimator: $E(\hat{\theta}) - \theta$.
Variability of an estimator: the spread of its distribution.
Sampling distribution: the probability distribution of the statistic.
Example: We are interested in the average height of Canadian children 5-9 years old. The population size is about 2M. We take a sample of 10 children.

5.1.1 Deriving a sampling distribution
1. Apply probability rules.
2. Run a simulation experiment (using a computer).
Example 5.21 P223 A book store sells three books, priced at \$80, \$100, and \$120. Define the random variable $X$ as the book price for a randomly selected purchase. The probability that a customer purchases each book is shown below:
x       80     100    120
p(x)    0.2    0.3    0.5
with $E(X) = 106$, $V(X) = 244$.
Consider two randomly selected purchases. Let $X_1$, $X_2$ be the book prices of the first and second purchase, respectively. The possible $(X_1, X_2)$ pairs are:
$x_1$    $x_2$    $p(x_1 \cap x_2)$    $\bar{x}$    $s^2$
____     ____     ____                 ____         ____
____     ____     ____                 ____         ____
____     ____     0.1                  ____         800
____     ____     0.06                 ____         200
____     ____     0.09                 ____         0
____     ____     0.15                 ____         200
____     ____     0.1                  ____         800
____     ____     0.15                 ____         200
____     ____     0.25                 ____         0

5.2 The distribution of the sample mean
Proposition 1 Let the rv's $X_1, X_2, \dots, X_n$ be a random sample from a distribution with mean value $\mu$ and standard deviation $\sigma$. Let $T_0 = X_1 + X_2 + \dots + X_n$ be the sample total. Then
1. $E(\bar{X}) = \mu$.
2. $V(\bar{X}) = \sigma^2/n$, $\sigma_{\bar{X}} = \sigma/\sqrt{n}$.
3. $E(T_0) = n\mu$, $V(T_0) = n\sigma^2$, $\sigma_{T_0} = \sqrt{n}\,\sigma$.
Proposition 2 Let the rv's $X_1, X_2, \dots, X_n$ be a random sample from a normal distribution with mean value $\mu$ and standard deviation $\sigma$. Then for any $n$,
1. $E(\bar{X}) = \mu$, $V(\bar{X}) = \sigma^2/n$.
2. $\bar{X} \sim N(\mu, \sigma^2/n)$.
3. $T_0 \sim N(n\mu, n\sigma^2)$.
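Proposition 1 can be checked exactly on the bookstore pmf from Section 5.1.1 by enumerating every $(x_1, x_2)$ pair, which is the "apply probability rules" route to a sampling distribution. The sketch below assumes the two purchases are independent, so each pair's probability is the product of the individual probabilities; the variable names are illustrative.

```python
from itertools import product
from math import isclose

# pmf of a single purchase price X (from the bookstore example).
pmf = {80: 0.2, 100: 0.3, 120: 0.5}
n = 2  # two purchases, ASSUMED independent

mu = sum(x * p for x, p in pmf.items())
sigma2 = sum((x - mu) ** 2 * p for x, p in pmf.items())

# Exact sampling distribution of the sample mean: enumerate all (x1, x2) pairs.
xbar_pmf = {}
for pair in product(pmf, repeat=n):
    prob = 1.0
    for x in pair:
        prob *= pmf[x]  # independence: joint probability is the product
    xbar = sum(pair) / n
    xbar_pmf[xbar] = xbar_pmf.get(xbar, 0.0) + prob

e_xbar = sum(v * p for v, p in xbar_pmf.items())
v_xbar = sum((v - e_xbar) ** 2 * p for v, p in xbar_pmf.items())
print(xbar_pmf)
print(e_xbar, mu)          # E(X-bar) compared with mu
print(v_xbar, sigma2 / n)  # V(X-bar) compared with sigma^2 / n
```

Changing `n` to 3 or 4 enumerates larger samples the same way, and the variance of the sample mean shrinks in proportion to $1/n$.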
Central limit theorem (CLT) Let the rv's $X_1, X_2, \dots, X_n$ be a random sample from a distribution with mean value $\mu$ and standard deviation $\sigma$. If $n$ is sufficiently large ($n > 30$), then
1. $E(\bar{X}) = \mu$, $V(\bar{X}) = \sigma^2/n$.
2. $\bar{X}$ is approximately $N(\mu, \sigma^2/n)$.
3. $T_0$ is approximately $N(n\mu, n\sigma^2)$.
Example 5.2
1. The population mean is $\mu = 15$, the population standard deviation is $\sigma = 4$, and the sample size is $n = 36$. Write the distribution of the sample mean $\bar{X}$.
2. The population mean is $\mu = 15$, the population standard deviation is $\sigma = 4$, and the sample size is $n = 16$. Write the distribution of the sample mean $\bar{X}$.
3. A bottle filling machine is set up to fill an average volume of 500 ml; the standard deviation of the filling volume is 5 ml. For a random sample of 50 bottles, what is the distribution of the sample mean filling volume?
4. A bottle filling machine is set up to fill an average volume of 500 ml. Assume the filling volume is normally distributed with standard deviation 5 ml. For a random sample of 10 bottles, what is the distribution of the sample mean filling volume?
Example 5.3 (exercise 51, P237) The time taken by a randomly selected individual to fill out a form has a normal distribution with mean 10 min and standard deviation 2 min. Five individuals fill out a form on one day and six on another. What is the probability that the sample average amount of time taken on each day is at most 11 min?
Example 5.4 (exercise 49, P237) There are 40 students in a class. Assume the time to grade a randomly chosen exam paper is a random variable with an expected value of 6 min and a standard deviation of 6 min. If grading times are independent, and the marker begins grading at 6:50 PM and grades continuously, what is the approximate probability that he finishes grading before 11 PM?
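One possible way to set up Example 5.4 numerically is sketched below, assuming the CLT approximation $T_0 \approx N(n\mu, n\sigma^2)$ for the total grading time is adequate with $n = 40$. The helper `normal_cdf` is a hypothetical name; it evaluates the normal cdf through the standard library's error function instead of a printed table.

```python
from math import erf, sqrt

def normal_cdf(x, mean=0.0, sd=1.0):
    """P(X <= x) for X ~ N(mean, sd^2), via the error function."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))

n = 40          # number of exam papers
mu = 6.0        # expected grading time per paper (minutes)
sigma = 6.0     # standard deviation per paper (minutes)

# Time available from 6:50 PM to 11:00 PM, in minutes.
available = (23 * 60) - (18 * 60 + 50)

# CLT approximation: T0 is approximately N(n * mu, n * sigma^2).
mean_total = n * mu
sd_total = sqrt(n) * sigma
p_finish = normal_cdf(available, mean_total, sd_total)
print(available, round(p_finish, 4))
```

The same three steps (find the mean and standard deviation of the total, standardize, evaluate the normal cdf) apply to any sample total or sample mean covered by the CLT.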
Chapter 6 Confidence intervals with a single sample

Population parameter: a numerical measure of a population.
Sample statistic: a numerical characteristic of the sample.
A point estimate is a single value of the selected point estimator, computed from a given sample.

6.1 Why do we need a confidence interval

Population: 5-9 year old children in Canada.
Interest: the average height of the children.
Random sample: 100 children; the sample mean is 115 cm.

It is highly likely that this single estimate is wrong. Hence we construct a confidence interval: an interval (usually an open interval) used to estimate the true population mean, which is more reliable than a single estimate.

Confidence level 100(1 − α)%: a measure of the degree of reliability. The higher the confidence level, the more confident we are in the estimation. For example, we are more confident about a 99% CI than a 95% CI.

6.2 Derive the confidence interval

We assume:
1. normal population or n > 30;
2. σ² is known.

Let α = 0.05 and derive the 95% confidence interval. Under these assumptions \( Z = (\bar{X} - \mu)/(\sigma/\sqrt{n}) \) is (at least approximately) N(0, 1), so \( P(-1.96 < Z < 1.96) = 0.95 \). Rearranging the inequalities for µ gives the random interval for the population mean:

\( \left( \bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}},\ \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}} \right) \)

Then we define the 95% confidence interval as:

\( \left( \bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}} \right) \)

Example: Average height of 5-9 year old children in Canada. n = 100, x̄ = 115, σ = 20. The 95% CI for the true mean is:

Questions:
1. What is the difference between a random interval and a confidence interval?
2. How do we interpret a confidence interval?
Random interval (recall): the probability that the true mean falls in the random interval is 0.95.
The chance that the true mean falls in the CI is 95%.
Under repeated sampling, 95% of the constructed CIs will contain the true mean.
We are 95% confident that the CI contains the true mean.
3. A wider CI or a narrower CI: which is better? Ex: assume the same confidence level, (111, 119) vs (100, 130).
4. How can we make a confidence interval narrower when the confidence level stays the same?
5. How do we calculate a 99% CI for µ? Given the same sample, which is wider, the 95% CI or the 99% CI?

6.3 Two sided CI with known σ: Z - CI

Assumptions:
1. The underlying population is normal, or the sample is large (n ≥ 30), and
2. the population standard deviation σ is known.

A two-sided 100(1 − α)% confidence interval for the population mean µ:

\( \bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \)

Example 6.1 (exercise 1 P284) Consider a normal population with known σ.
1. What is the confidence level of \( \bar{x} \pm 2.81\,\sigma/\sqrt{n} \)?
2. Find the 99.7% confidence interval formula.

Example 6.2 Suppose the weights of tires are normally distributed with standard deviation σ = 1.25 lbs. With a random sample of 15 tires, the sample mean weight is 18.75 lbs. Find the 95% CI for the true mean weight.

Example 6.3 The historic mean delivery time of one route is 6.5 hours. A new driver was assigned to this route and a random sample was collected: 6.61 6.25 6.4 6.57 6.35 5.95 6.53 6.29. Assume the underlying population is normal with standard deviation σ = 0.2, and calculate the 99% confidence interval for the mean delivery time of the new driver. Is there any evidence to suggest the new driver's mean delivery time is different from 6.5 hours?

6.3.1 The width of Z - CI and sample size

Note: Given one population (σ is fixed), the following factors may affect the CI width.

Given the 100(1 − α)% confidence interval \( \bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n} \), the margin of error is

\( m = z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \)

We can then solve for n, which gives the sample size needed for the desired CI width:
\( n = \left( z_{\alpha/2}\,\frac{\sigma}{m} \right)^2 \)

Example 6.4 (exercise 5 P285) \( X \sim N(\mu, 0.75^2) \).
1. x̄ = 4.56, n = 16. Find the 98% CI for the true average µ.
2. Find n such that the width of the 98% CI is 0.4.

6.4 Confidence interval when σ is unknown: t - CI

6.4.1 The t distribution

Parameter: v, the degrees of freedom, v = 1, 2, ....
Mean: µ = 0.
Variance: \( \sigma^2 = \frac{v}{v-2} \) (for v > 2).

1. It is symmetric about its mean 0 and is bell-shaped.
2. The standard deviation of the t distribution is always greater than 1.
3. When the degrees of freedom v increase, the standard deviation of the t distribution decreases. That is, when v approaches +∞, the standard deviation approaches 1.

Understand the t distribution

Critical value \( t_\alpha(v) \): \( P(T \ge t_\alpha(v)) = \alpha \).

Example 6.5 Find the values of \( t_{0.01}(21) \), \( z_{0.05} \), \( t_{0.05}(2) \), \( t_{0.05}(20) \), \( t_{0.05}(120) \).

6.4.2 Two sided t - CI with unknown σ

Assumptions:
1. The underlying distribution is normal, or the sample size is large (n ≥ 30), and
2. σ² is unknown.

We have

\( Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \)

Replacing σ with S,

\( T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1) \)

The 100(1 − α)% confidence interval is

\( \bar{x} \pm t_{\alpha/2}(n-1)\,\frac{s}{\sqrt{n}} \)

Example (exercise 33, P303) One random sample is given below: 418, 421, 421, 422, 425, 427, 431, 434, 437, 439, 446, 447, 448, 453, 454, 463, 465. Assume the sample is from a normally distributed population.
1. Calculate a two-sided 95% CI for the true mean µ.
2. Does the CI suggest that 440 is a plausible value for the true mean? What about 450?

Example 6.6 A water park claims that its need for water per day is 250 thousand gallons. The city collects a random sample of size n = 81, with x̄ = 236.22 and s = 40.81. Using α = 0.01, is there evidence to suggest the mean water usage is different from 250 thousand gallons?
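The t - CI formula above can be applied to the summary statistics of Example 6.6. This is only a sketch: Python's standard library has no t quantile function, so the critical value \( t_{0.005}(80) \approx 2.639 \) is hard-coded as an assumed t-table value, and the Z critical value is computed alongside it for comparison.

```python
from math import sqrt
from statistics import NormalDist

n, xbar, s = 81, 236.22, 40.81   # summary statistics from Example 6.6
alpha = 0.01
claimed_mean = 250

# Assumption: t_{alpha/2}(n - 1) = t_{0.005}(80), read from a t table.
t_crit = 2.639
# z_{alpha/2}, computed from the standard normal distribution.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

se = s / sqrt(n)                 # estimated standard error of the sample mean
t_ci = (xbar - t_crit * se, xbar + t_crit * se)
z_ci = (xbar - z_crit * se, xbar + z_crit * se)

print("99% t-CI:", t_ci)
print("99% Z-CI:", z_ci)
print("claimed mean inside t-CI:", t_ci[0] <= claimed_mean <= t_ci[1])
```

With n this large the two intervals are close, and the t interval is slightly wider because the t critical value exceeds the z critical value.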
6.4.3 Large sample

Note: when the sample size is large (n > 40), t(n − 1) → N(0, 1). Hence, the t - CI can be approximated by the Z - CI when n > 40.

Example 6.7 A random sample of odometer readings for one model of used car has size n = 51, with x̄ = 45679 and s = 26641.675.
1. Find the 95% t - CI for µ.
2. Find the 95% Z - CI for µ.

6.4.4 Confidence interval summary

|   | population | sample size    | variance σ² | distribution | CI |
|---|------------|----------------|-------------|--------------|----|
| 1 | Normal     | -              | known       | \( \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1) \) | \( \bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \) |
| 2 | -          | Large (n ≥ 30) | known       | \( \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1) \) | \( \bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \) |
| 3 | Normal     | -              | unknown     | \( \frac{\bar{X}-\mu}{s/\sqrt{n}} \sim t(n-1) \) | \( \bar{x} \pm t_{\alpha/2}(n-1)\,\frac{s}{\sqrt{n}} \) |
| 4 | -          | Large (n ≥ 30) | unknown     | \( \frac{\bar{X}-\mu}{s/\sqrt{n}} \sim t(n-1) \) | \( \bar{x} \pm t_{\alpha/2}(n-1)\,\frac{s}{\sqrt{n}} \) |

Note: when the sample size is large (n > 40), the t - CI can be approximated by the Z - CI.

Chapter 7 Hypothesis test for mean - one sample

7.1 Hypothesis tests for mean - Z test

Assumptions:
1. The underlying distribution is normal, or the sample size is large (n ≥ 30), and
2. σ² is known.

Example: The average weight of Canadian males (30-44 years old) in 1953 was 76 kg. People now question this figure and believe the average weight has increased because eating habits and lifestyles have changed, for example through higher consumption of sugar or fat.

Null (initial) hypothesis, H0: µ = 76
Alternative (new) hypothesis, Ha: µ > 76

Question: Is the new hypothesis true? Should we reject the initial hypothesis?
1. Measure the weight of every male (30-44), then calculate the average.
2. Work with a random sample of 100 Canadian males, then estimate the true mean weight.

Solution: Let the sample mean be x̄ = 82. Assume the following: the weight is normally distributed; σ is known, σ = 25 kg.
1. Write down the hypotheses:
2. Assume H0 is true, and decide whether the observed sample mean is rare or reasonable.
3. Set a threshold to define a rare event.
4. Make a conclusion.

Summary: components of a hypothesis test:
1.
2.
3.
4.
5.
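The four solution steps for the weight example can be traced numerically. The sketch below assumes a significance level of α = 0.01 as the threshold for a "rare" event (the steps above leave the threshold open) and uses the standard-library `NormalDist`; names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, xbar = 76.0, 25.0, 100, 82.0   # H0 mean, known sd, sample size, sample mean
alpha = 0.01                                   # assumed threshold for a "rare" event

# Step 2: if H0 is true, Z = (Xbar - mu0) / (sigma / sqrt(n)) ~ N(0, 1).
z = (xbar - mu0) / (sigma / sqrt(n))

# Upper-tailed alternative Ha: mu > 76, so the P-value is P(Z >= z).
p_value = 1 - NormalDist().cdf(z)

# Step 3: reject point z_alpha for an upper-tailed test.
z_crit = NormalDist().inv_cdf(1 - alpha)

# Step 4: conclusion.
reject_h0 = p_value <= alpha
print(f"z = {z:.2f}, P-value = {p_value:.4f}, reject point = {z_crit:.3f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

The P-value rule and the rejection-region rule always agree: the P-value is at most α exactly when z is at least the reject point.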
7.1.1 Three types of alternative hypothesis

Upper one-tailed
Lower one-tailed
Two-tailed

Example 7.1 Per a 1953 report, the weight of Canadian males (30-44 years old) follows a normal distribution with mean weight 76 kg (1953) and population standard deviation σ = 25. Take a random sample of size n = 100 and let α = 0.01. Test the following:
1. The sample mean is 82, and people want to test whether the average weight has increased.
2. The sample mean is 74, and people want to test whether the average weight has decreased.
3. The sample mean is 82, and people want to test whether the average weight is different from 76 kg.

Example 7.2 A new type of capsule that travels in a vacuum tube has an approximate speed of 620 mph. A random sample of 39 trips is selected, with x̄ = 632. Assume the population standard deviation is σ = 35. Is there any evidence to suggest that the true mean speed is larger than 620 mph? Use α = 0.05.

7.2 Hypothesis test with unknown σ - t test

Assumptions:
1. The underlying distribution is normal, or the sample size is large (n ≥ 30), and
2. σ² is unknown.

\( Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1), \qquad T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1) \)

Test statistic (TS):

|              | Ha: µ > µ0 | Ha: µ < µ0 | Ha: µ ≠ µ0 |
|--------------|------------|------------|------------|
| Reject point | \( t_\alpha(n-1) \) | \( -t_\alpha(n-1) \) | \( \pm t_{\alpha/2}(n-1) \) |
| RR           | \( (t_\alpha(n-1), \infty) \) | \( (-\infty, -t_\alpha(n-1)) \) | \( (-\infty, -t_{\alpha/2}(n-1)) \cup (t_{\alpha/2}(n-1), \infty) \) |
| P-value      | \( P(T > t) \) | \( P(T < t) \) | \( 2P(T > \lvert t \rvert) \) |

We reject H0 when:

Example 7.3 A water park claims that its need for water per day is 250 thousand gallons. The city collects a random sample of size n = 81, with x̄ = 236.22 and s = 40.81. Use α = 0.01.
1. Is there evidence that the mean daily water usage is different from 250 thousand gallons?
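2. Is there evidence that the mean daily water usage is less than 250 thousand gallons?

Example 7.4 The historic mean delivery time of one route is 6.5 hours. A new driver was assigned to this route and a random sample was collected: 6.61 6.25 6.4 6.57 6.35 5.95 6.53 6.29. Let α = 0.01 and assume the underlying population is normal. Is there any evidence to suggest the new driver has been able to shorten the delivery time?

7.2.1 Summary of different types of hypothesis tests

Decide Z-test or t-test:

|   | population | sample size    | variance σ² | TS | distribution |
|---|------------|----------------|-------------|----|--------------|
| 1 | Normal     | -              | known       | \( z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} \) | \( \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1) \) |
| 2 | -          | Large (n ≥ 30) | known       | \( z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} \) | \( \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1) \) |
| 3 | Normal     | -              | unknown     | \( t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}} \) | \( \frac{\bar{X}-\mu}{s/\sqrt{n}} \sim t(n-1) \) |
| 4 | -          | Large (n ≥ 30) | unknown     | \( t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}} \) | \( \frac{\bar{X}-\mu}{s/\sqrt{n}} \sim t(n-1) \) |

Note: when the sample size is large (n > 40), t(n − 1) → N(0, 1), so the t-test can be approximated by the Z-test.

H0: µ = µ0

|    | Upper-tailed test | Lower-tailed test | Two-tailed test |
|----|-------------------|-------------------|-----------------|
| Ha | µ > µ0 | µ < µ0 | µ ≠ µ0 |
| Tail area | \( \alpha = P(Z > z_\alpha) \) | \( \alpha = P(Z < -z_\alpha) \) | \( \alpha/2 = P(Z < -z_{\alpha/2}) = P(Z > z_{\alpha/2}) \) |

[Figure: standard normal density curves with the rejection region shaded to the right of \( z_\alpha \) (upper-tailed), to the left of \( -z_\alpha \) (lower-tailed), and in both tails beyond \( \pm z_{\alpha/2} \) (two-tailed).]

Solve using the rejection region (RR): reject H0 when the test statistic falls in the RR.

Z-test: \( Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1) \)

|              | Upper-tailed test | Lower-tailed test | Two-tailed test |
|--------------|-------------------|-------------------|-----------------|
| Reject point | \( z_\alpha \) | \( -z_\alpha \) | \( \pm z_{\alpha/2} \) |
| RR           | \( (z_\alpha, \infty) \) | \( (-\infty, -z_\alpha) \) | \( (-\infty, -z_{\alpha/2}) \cup (z_{\alpha/2}, \infty) \) |

T-test: \( T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim t(n-1) \)

|              | Upper-tailed test | Lower-tailed test | Two-tailed test |
|--------------|-------------------|-------------------|-----------------|
| Reject point | \( t_\alpha(n-1) \) | \( -t_\alpha(n-1) \) | \( \pm t_{\alpha/2}(n-1) \) |
| RR           | \( (t_\alpha(n-1), \infty) \) | \( (-\infty, -t_\alpha(n-1)) \) | \( (-\infty, -t_{\alpha/2}(n-1)) \cup (t_{\alpha/2}(n-1), \infty) \) |

Solve using the P-value: reject H0 when P-value ≤ α.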
Z-test: \( Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1) \)

| Upper-tailed test | Lower-tailed test | Two-tailed test |
|-------------------|-------------------|-----------------|
| P-value \( = P(\bar{X} \ge \bar{x} \mid H_0 \text{ is true}) = P(Z \ge z) \) | P-value \( = P(\bar{X} \le \bar{x} \mid H_0 \text{ is true}) = P(Z \le z) \) | P-value \( = 2P(Z \ge \lvert z \rvert) \) |

T-test: \( T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim t(n-1) \)

| Upper-tailed test | Lower-tailed test | Two-tailed test |
|-------------------|-------------------|-----------------|
| P-value \( = P(\bar{X} \ge \bar{x} \mid H_0 \text{ is true}) = P(T \ge t) \) | P-value \( = P(\bar{X} \le \bar{x} \mid H_0 \text{ is true}) = P(T \le t) \) | P-value \( = 2P(T \ge \lvert t \rvert) \) |

Solve using a confidence interval: reject H0 when µ0 ∉ CI.

Z-test: \( Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1) \)

| Upper-tailed test | Lower-tailed test | Two-tailed test |
|-------------------|-------------------|-----------------|
| \( \left( \bar{x} - z_\alpha \frac{\sigma}{\sqrt{n}},\ +\infty \right) \) | \( \left( -\infty,\ \bar{x} + z_\alpha \frac{\sigma}{\sqrt{n}} \right) \) | \( \left( \bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right) \) |

T-test: \( T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim t(n-1) \)

| Upper-tailed test | Lower-tailed test | Two-tailed test |
|-------------------|-------------------|-----------------|
| \( \left( \bar{x} - t_\alpha(n-1) \frac{s}{\sqrt{n}},\ +\infty \right) \) | \( \left( -\infty,\ \bar{x} + t_\alpha(n-1) \frac{s}{\sqrt{n}} \right) \) | \( \left( \bar{x} - t_{\alpha/2}(n-1) \frac{s}{\sqrt{n}},\ \bar{x} + t_{\alpha/2}(n-1) \frac{s}{\sqrt{n}} \right) \) |

7.3 Examples

Example 8.6 P329 A manufacturer of sprinkler systems used for fire protection claims the true average system-activation temperature is 130°F. A sample of n = 9 systems, when tested, yields a sample average activation temperature of 131.38°F. Suppose the activation temperature follows a normal distribution with standard deviation 1.5°F. Does the data contradict the manufacturer's claim at significance level α = 0.01?

Example (exercise 25 P334) A sample of 52 male enforcement officers exited a vehicle while wearing armor. The sample mean exit time was 1.95 seconds and the sample standard deviation was 0.2 seconds. Does it appear that the true average exit time is less than 2 seconds? α = 0.01.

Example (exercise 35 P345) A random sample of repair times is listed below: 159, 120, 480, 149, 270, 547, 340, 43, 228, 202, 240, 218. Assume the population is normally distributed. The sample mean and sample sd are 249.7 and 145.1. Let α = 0.05. Is there evidence that the true average repair time exceeds 200 mins?
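The rejection-region approach for the last example (exercise 35) can be sketched as follows. The sample mean and standard deviation are recomputed from the listed data; the critical value \( t_{0.05}(11) \approx 1.796 \) is hard-coded as an assumed t-table value, since the Python standard library does not provide t quantiles.

```python
from math import sqrt
from statistics import mean, stdev

times = [159, 120, 480, 149, 270, 547, 340, 43, 228, 202, 240, 218]
mu0 = 200        # H0: mu = 200, Ha: mu > 200 (upper-tailed)

n = len(times)
xbar = mean(times)
s = stdev(times)                 # sample standard deviation (divides by n - 1)

# Test statistic T = (xbar - mu0) / (s / sqrt(n)), compared with t(n - 1).
t_stat = (xbar - mu0) / (s / sqrt(n))

# Assumption: t_{0.05}(11) taken from a t table.
t_crit = 1.796

reject_h0 = t_stat > t_crit
print(f"xbar = {xbar:.1f}, s = {s:.1f}, t = {t_stat:.3f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

Because the observed statistic is compared against a single reject point, only the direction of Ha and the degrees of freedom need to change to reuse this pattern for the other two examples.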
7.4 Hypothesis test errors

Type I error: we reject H0 when H0 is true. P(type I error) = α.
Type II error: we do not reject H0 when Ha is true. P(type II error) = β.

|         | reject H0        | not reject H0    |
|---------|------------------|------------------|
| H0 true | Type I error     | correct decision |
| Ha true | correct decision | Type II error    |

Note:
1. For test efficiency or goodness, we want both errors to be small, i.e. small α and β.
2. For a fixed n, making α small forces β to be large.
3. Most of the time a type I error is more serious, hence we usually control α to be small. But sometimes a type II error is more serious.
Example 7.5
a. A new drug will be approved only if fewer than 20% of users experience side effects. H0: p = 0.2 (p ≥ 0.2), Ha: p < 0.2.
b. A child whose father is allergic to peanuts wants to eat a dessert that contains peanuts. H0: the child is not allergic to peanuts, Ha: the child is allergic to peanuts.
4. The only way to make both α and β small is to increase n.
5. Although we cannot directly control β if we keep α small, we can calculate the probability of a type II error at one specified alternative value of the parameter, θa.

Chapter 8 One-way analysis of variance

8.1 F distribution

F distribution: F(v1, v2). Let X ∼ F(v1, v2).
v1 = 1, 2, 3, ...: the degrees of freedom in the numerator.
v2 = 1, 2, 3, ...: the degrees of freedom in the denominator.

The pdf of X is positively skewed (not symmetric), and gets closer and closer to the x-axis but never touches it. As both degrees of freedom increase, the density curve becomes taller and more compact.

F critical value: \( F_\alpha(v_1, v_2) \)

Example 8.1
\( F_{0.05}(3, 5) = \)
\( F_{0.01}(20, 11) = \)

8.2 One-way ANOVA (ANalysis Of VAriance)

Purpose: determine whether the data came from a single population, or whether at least two samples came from populations with different means.

8.2.1 Notations

k: the number of populations under investigation.
n = n1 + ... + nk: the total sample size.
xij: the jth observation taken from the ith population.
\( \bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij} = \frac{1}{n_i}(x_{i1} + x_{i2} + \cdots + x_{in_i}) \)

\( \bar{x} = \frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij} = \frac{1}{n} \sum_{i=1}^{k} n_i \bar{x}_i \)

| Population          | 1     | 2     | ... | i     | ... | k     |
|---------------------|-------|-------|-----|-------|-----|-------|
| Population mean     | µ1    | µ2    | ... | µi    | ... | µk    |
| Population variance | σ1²   | σ2²   | ... | σi²   | ... | σk²   |
| Sample size         | n1    | n2    | ... | ni    | ... | nk    |
| Sample mean         | x̄1   | x̄2   | ... | x̄i   | ... | x̄k   |
| Sample variance     | s1²   | s2²   | ... | si²   | ... | sk²   |

Total sum of squares (total variation):

\( SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}^2 - n\bar{x}^2 \)

Sum of squares due to factor (between-sample variation):

\( SSA = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2 = \sum_{i=1}^{k} n_i \bar{x}_i^2 - n\bar{x}^2 \)

Sum of squares due to error (within-sample variation):

\( SSE = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 = SST - SSA \)

Mean square due to factor: MSA = SSA/(k − 1)
Mean square due to error: MSE = SSE/(n − k)

8.2.2 One-way ANOVA test

Test assumptions:
1. Normal populations. Check with a normal probability plot (normal QQ-plot) of the sample data; a roughly straight-line pattern suggests the sample is from a normal population.
2. Populations have a common variance. Check the sample standard deviations: assume the samples are from populations with a common variance if \( \max(s_i) \le 2 \min(s_i) \), i = 1, ..., k.
3. Independent samples.

H0: µ1 = µ2 = ... = µk
Ha: not all of the µi are equal.

TS: \( F_0 = \frac{MSA}{MSE} = \frac{SSA/(k-1)}{SSE/(n-k)} \)
P-value: \( P(F \ge F_0) \).
Reject point: \( F_\alpha(k-1, n-k) \)

One-way ANOVA summary table

| Source of variation | Sum of squares | Degrees of freedom | Mean square | F0 | P-value |
|---------------------|----------------|--------------------|-------------|----|---------|
| Factor | SSA | k − 1 | MSA = SSA/(k − 1) | MSA/MSE | p |
| Error  | SSE | n − k | MSE = SSE/(n − k) |         |   |
| Total  | SST | n − 1 |                   |         |   |

Example 8.2 The goal is to investigate whether a high-salt diet in older women increases the risk of breaking a bone. Independent random samples in four different salt intake categories were obtained. The vitamin D blood level was measured in each.
Assume normality and common variance. Is there any evidence to suggest that at least two of the population mean vitamin D blood levels are different? α = 0.05

| Sample    | Observations                   | n | x̄ | s² |
|-----------|--------------------------------|---|----|----|
| Very high | 91.5 77.5 94.5 77.5 92.0       |   |    |    |
| High      | 89.0 92.0 98.2 80.0 86.7       |   |    |    |
| Moderate  | 92.5 100.7 94.0 93.3 106.3     |   |    |    |
| Low       | 100.1 98.0 99.1 103.9 97.6     |   |    |    |

ANOVA summary table:

| SoV    | Sum of squares | Degrees of freedom | Mean square | F0 | P-value |
|--------|----------------|--------------------|-------------|----|---------|
| Factor |                |                    |             |    |         |
| Error  |                |                    |             |    |         |
| Total  |                |                    |             |    |         |

Example 8.3 To study the effect of music on task performance, people were recruited to take a simple arithmetic test. They were randomly divided into three groups: silent, music without lyrics, and music with lyrics. The number of math problems completed in one minute was recorded. We would like to see whether the average number of math problems completed differs among the three groups.

| group                | n   | x̄    | s    |
|----------------------|-----|-------|------|
| silent               | 152 | 19.11 | 4.16 |
| music without lyrics | 149 | 17.34 | 4.08 |
| music with lyrics    | 146 | 17.81 | 4    |

Solution: see R coding.

8.3 Isolating Differences

What if H0 is rejected in ANOVA? Then there is evidence to suggest an overall difference among the means. How do we isolate and determine which pair(s) of means are different?

8.3.1 Tukey's Procedure (the T method)

The simultaneous 100(1 − α)% Tukey confidence intervals:

\( \bar{x}_i - \bar{x}_j \pm Q_{\alpha,k,n-k} \sqrt{\frac{MSE}{J}}, \quad \text{for } i \ne j, \)

where J is the common sample size of each group and \( Q_{\alpha,m,v} \) is the critical value of the Studentized range distribution. If 0 ∉ Tukey CI, then the corresponding pair of means is significantly different.

Example 8.4 Use the previous high-salt diet example, α = 0.05.
k = 4, MSE = 39.31, x̄1 = 86.6, x̄2 = 89.18, x̄3 = 97.36, x̄4 = 99.74

\( Q_{\alpha,k,n-k} = \)
\( Q_{\alpha,k,n-k} \sqrt{MSE/J} = \)

| i − j | x̄i − x̄j | lower | upper |
|-------|-----------|-------|-------|
| 2 − 1 |           |       |       |
| 3 − 1 |           |       |       |
| 4 − 1 | 13.14     | 1.78  | 24.5  |
| 3 − 2 | 8.18      | −3.18 | 19.54 |
| 4 − 2 | 10.56     | −0.8  | 21.92 |
| 4 − 3 | 2.38      | −8.98 | 13.74 |
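The ANOVA quantities of Example 8.2 and the Tukey intervals of Example 8.4 can be reproduced with a short script. This is a sketch under stated assumptions: the critical values \( F_{0.05}(3, 16) \approx 3.24 \) and \( Q_{0.05,4,16} \approx 4.05 \) are hard-coded as assumed table values (the standard library offers neither distribution), and the group labels and variable names are illustrative.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

# Vitamin D blood levels from Example 8.2, one list per salt intake category.
groups = {
    "very high": [91.5, 77.5, 94.5, 77.5, 92.0],
    "high": [89.0, 92.0, 98.2, 80.0, 86.7],
    "moderate": [92.5, 100.7, 94.0, 93.3, 106.3],
    "low": [100.1, 98.0, 99.1, 103.9, 97.6],
}

k = len(groups)
n = sum(len(g) for g in groups.values())
grand_mean = mean(x for g in groups.values() for x in g)

# Between-sample and within-sample sums of squares.
ssa = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
sse = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
msa, mse = ssa / (k - 1), sse / (n - k)
f0 = msa / mse

f_crit = 3.24     # assumption: F_{0.05}(3, 16) from an F table
q_crit = 4.05     # assumption: Q_{0.05,4,16} from a Studentized range table
J = 5             # common sample size per group

half_width = q_crit * sqrt(mse / J)
tukey = {}
for (a, ga), (b, gb) in combinations(groups.items(), 2):
    diff = mean(gb) - mean(ga)          # later-listed group minus earlier one
    tukey[(b, a)] = (diff - half_width, diff + half_width)

print(f"MSA = {msa:.2f}, MSE = {mse:.2f}, F0 = {f0:.2f}, reject H0: {f0 > f_crit}")
for pair, (lo, hi) in tukey.items():
    flag = "different" if not (lo <= 0 <= hi) else "not significantly different"
    print(pair, f"({lo:.2f}, {hi:.2f})", flag)
```

Pairs whose interval excludes 0 are the ones the procedure flags as different; with these data only the "low" versus "very high" comparison has an interval entirely above 0.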
Example 8.5 Use the previous music effect example.

Solution: see R coding.

Chapter 9 Simple linear regression and correlation

9.1 Correlation

Example 9.1 Linear relationship:

| x | y1 | y2 | y3  |
|---|----|----|-----|
| 1 | 11 | 10 | −10 |
| 2 | 20 | 8  | 30  |
| 3 | 34 | 7  | −5  |
| 4 | 39 | 5  | 50  |
| 5 | 51 | 3  | −40 |

9.2 Sample correlation coefficient

A measure of the direction and strength of a linear relationship between two quantitative variables.

\( r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{1}{n-1} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y} \)

\( S_{xy} = \sum x_i y_i - \left( \sum x_i \right)\left( \sum y_i \right)/n \)
\( S_{xx} = \sum x_i^2 - \left( \sum x_i \right)^2/n \)
\( S_{yy} = \sum y_i^2 - \left( \sum y_i \right)^2/n \)

x̄, sx are the mean and standard deviation of the x data.
ȳ, sy are the mean and standard deviation of the y data.

Note:
1. r does not distinguish between x and y. The labelling of x and y does not matter.
2. r has no units of measurement.
3. r ranges from −1 to +1.
4. A positive value means that the association between the two variables is positive.
5. A negative value means that the association between the two variables is negative.
6. r is strongly affected by outliers.
7. r = 1 (−1) iff all (x, y) pairs lie on a straight line with positive (negative) slope.

Example 9.2 Choose the most appropriate value of r for the following data: −0.99, −0.7, −0.3, 0, 0.5, 0.9.

Example 9.3 Given the data set below:
1. Make a scatter plot of y versus x.
2. Find the correlation coefficient between y and x, and describe the relationship.

| x | 20 | 30 | 40 | 50 |
|---|----|----|----|----|
| y | 10 | 30 | 50 | 30 |

9.3 Simple linear regression model

Correlation measures the strength (scatter) and direction of the linear relationship, but it cannot tell which line best describes the data.
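To make the point concrete, the sample correlation coefficient for the data of Example 9.3 can be computed directly from the S-formulas of Section 9.2. This is a minimal sketch in plain Python; the variable names are illustrative.

```python
from math import sqrt

# Data from Example 9.3
x = [20, 30, 40, 50]
y = [10, 30, 50, 30]
n = len(x)

# Computational formulas for S_xy, S_xx, S_yy.
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
s_yy = sum(b * b for b in y) - sum(y) ** 2 / n

r = s_xy / sqrt(s_xx * s_yy)
print(f"S_xy = {s_xy}, S_xx = {s_xx}, S_yy = {s_yy}, r = {r:.4f}")
```

The value of r summarizes direction and strength only; it carries no information about the intercept or slope of any particular line through the points.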
Simple linear regression model: a model with a single independent variable x whose relationship with a response y is a straight line.

\( y = \beta_0 + \beta_1 x + \epsilon \), where \( \epsilon \sim N(0, \sigma^2) \) is a random variable.

Given the n observed values of the independent variable x: x1, x2, ..., xn, and the n observed values of the dependent variable y: y1, y2, ..., yn,

\( y_i = \mu_i + \epsilon_i = \beta_0 + \beta_1 x_i + \epsilon_i \)

1. xi: the ith observed value of the independent variable.
2. yi: the ith observed value of the dependent variable.
3. ϵi: the error term \( \epsilon_i = y_i - \mu_i \), the difference between the observed value and the mean value of the ith dependent variable.
4. β0: the y-intercept, the mean value of the dependent variable when the independent variable is x = 0.
5. β1: the slope, the change in the mean value of the dependent variable associated with a one-unit increase in the independent variable.
6. µi: the mean value of the ith dependent variable for a given value of the independent variable x = xi.

Example 9.4 A large building uses a coal-fueled heating system. The building management office is trying to determine the proper amount of coal to order each week. Past experience shows that the consumption of coal depends on the average hourly temperature. The data collected are shown in the following table.

| Week, i | Average hourly temperature, xi | Weekly fuel consumption, yi |
|---------|--------------------------------|-----------------------------|
| 1       | x1 = 28                        | y1 = 12.4                   |
| 2       | x2 = 28                        | y2 = 11.7                   |
| 3       | x3 = 32.5                      | y3 = 12.4                   |
| 4       | x4 = 39                        | y4 = 10.8                   |
| 5       | x5 = 45.9                      | y5 = 9.4                    |
| 6       | x6 = 57.8                      | y6 = 9.5                    |
| 7       | x7 = 58.1                      | y7 = 8.0                    |
| 8       | x8 = 62.5                      | y8 = 7.5                    |

[Figure: scatter plot of weekly fuel consumption y versus average hourly temperature x.]

Q: β0 = ? β1 = ?
The least squares point estimators of β0, β1:

\( \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} x_i y_i - \left( \sum_{i=1}^{n} x_i \right)\left( \sum_{i=1}^{n} y_i \right)/n}{\sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2/n} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \)

\( \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \)

The fitted (estimated) simple linear regression model is \( \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \).
Point prediction of yi: \( \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \).
The ith residual: \( e_i = y_i - \hat{y}_i \).

Residual sum of squares (SSE): measures how far the estimated values are from the observed values. We call it the ______.

\( SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \)

The least squares point estimators are the values \( \hat{\beta}_0, \hat{\beta}_1 \) that minimize SSE. To derive them, define the function

\( f(\beta_0, \beta_1) = SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2 \)

Compare: \( y_i, \hat{y}_i, \mu_i, \beta_0, \beta_1, \hat{\beta}_0, \hat{\beta}_1, \epsilon_i, e_i \).

[Figure: blank plot with x-axis from 30 to 60 for sketching the comparison.]

Example 9.5 The fuel consumption data for the building.
1. Find the fitted regression line, and interpret the two coefficients.
2. Predict the fuel consumption for x5 = 45.9, and calculate the residual e5.

| xi        | yi        | xi²      | xi yi   |
|-----------|-----------|----------|---------|
| x1 = 28   | y1 = 12.4 | 784      | 347.2   |
| x2 = 28   | y2 = 11.7 | 784      | 327.6   |
| x3 = 32.5 | y3 = 12.4 | 1056.25  | 403     |
| x4 = 39   | y4 = 10.8 | 1521     | 421.2   |
| x5 = 45.9 | y5 = 9.4  | 2106.81  | 431.46  |
| x6 = 57.8 | y6 = 9.5  | 3340.84  | 549.1   |
| x7 = 58.1 | y7 = 8.0  | 3375.61  | 464.8   |
| x8 = 62.5 | y8 = 7.5  | 3906.25  | 468.75  |
| Σ xi = 351.8 | Σ yi = 81.7 | Σ xi² = 16874.76 | Σ xi yi = 3413.11 |

[Figure: scatter plot of y versus x for the fuel consumption data.]

9.4 Model assumptions

A1: Zero mean: The error term ϵ has zero mean.
A2: Constant variance: For any i, the population of all possible values of the corresponding yi (based on xi) has the same variance σ². Equivalently, the error term \( \epsilon_i = y_i - \mu_i \) has the same variance σ².
A3: Normality: For any value xi of the independent variable x, the corresponding population of potential values of the dependent variable has a normal distribution. Or we can express it as \( \epsilon_i \sim N(0, \sigma^2) \).
A4: Independence: Any one value of the dependent variable y is statistically independent of any other value of y.

Note: The independence assumption is most likely to be violated for time series data.

9.5 Justify model assumptions

9.5.1 Residual plot

Residual: \( e_i = y_i - \hat{y}_i \).

Example 9.6 Continue with the fuel consumption example. \( \hat{y} = 15.84 - 0.1279x \)

| xi        | yi        | ŷi = 15.84 − 0.1279 xi | ei           |
|-----------|-----------|------------------------|--------------|
| x1 = 28   | y1 = 12.4 | ŷ1 =                   | e1 =         |
| x2 = 28   | y2 = 11.7 | ŷ2 = 12.256            | e2 = −0.5560 |
| x3 = 32.5 | y3 = 12.4 | ŷ3 = 11.6804           | e3 = 0.7196  |
| x4 = 39   | y4 = 10.8 | ŷ4 = 10.8489           | e4 = −0.0489 |
| x5 = 45.9 | y5 = 9.4  | ŷ5 =                   | e5 =         |
| x6 = 57.8 | y6 = 9.5  | ŷ6 = 8.4440            | e6 = 1.0560  |
| x7 = 58.1 | y7 = 8.0  | ŷ7 = 8.4056            | e7 = −0.4056 |
| x8 = 62.5 | y8 = 7.5  | ŷ8 = 7.8428            | e8 = −0.3428 |

SSE = 2.57

[Figure: residual plot, residuals (from −2 to 2) versus x (from 30 to 60).]

1. Constant variance assumption
Different patterns of residual plot:

2. Normality assumption
Residual normal QQ plot:

9.6 Estimation of σ²

Recall that

\( SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2 = \sum_{i=1}^{n} y_i^2 - \hat{\beta}_0 \sum_{i=1}^{n} y_i - \hat{\beta}_1 \sum_{i=1}^{n} x_i y_i = S_{yy} - \hat{\beta}_1 S_{xy} \)

It can be shown that \( E(SSE) = (n-2)\sigma^2 \). Hence

\( \hat{\sigma}^2 = \frac{SSE}{n-2} \)

is an unbiased estimate of σ². The standard error of regression,

\( \hat{\sigma} = \sqrt{\frac{SSE}{n-2}}, \)

is a point estimate of σ.

Example 9.7 Recall that for the fuel consumption regression model, the least squares point estimates of β0, β1 are \( \hat{\beta}_0 = 15.84 \), \( \hat{\beta}_1 = -0.1279 \). Calculate SSE, σ̂², and the standard error.
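The least squares fit and the estimate of σ² for the fuel consumption data can be sketched in a few lines of plain Python. The code follows the S-formulas used in these notes; the variable names are illustrative and no regression library is assumed.

```python
from math import sqrt

# Fuel consumption data (Example 9.4): temperature x, weekly fuel consumption y
x = [28, 28, 32.5, 39, 45.9, 57.8, 58.1, 62.5]
y = [12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5]
n = len(x)

s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
s_yy = sum(b * b for b in y) - sum(y) ** 2 / n

# Least squares point estimates.
b1 = s_xy / s_xx
b0 = sum(y) / n - b1 * sum(x) / n

# Residuals and residual sum of squares, computed two ways.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
sse_shortcut = s_yy - b1 * s_xy

sigma2_hat = sse / (n - 2)       # unbiased estimate of sigma^2
sigma_hat = sqrt(sigma2_hat)     # standard error of regression

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"SSE = {sse:.4f} (shortcut: {sse_shortcut:.4f})")
print(f"sigma^2 hat = {sigma2_hat:.4f}, sigma hat = {sigma_hat:.4f}")
```

Computing SSE from the residuals and from the shortcut \( S_{yy} - \hat{\beta}_1 S_{xy} \) gives a quick consistency check; small differences from hand calculations come from rounding the coefficients to 15.84 and −0.1279.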
Four parameters of the simple linear regression model:

| Parameters       | β0 | β1 | σ² | σ |
|------------------|----|----|----|---|
| Point estimators |    |    |    |   |

9.6.1 The simple coefficient of determination, R²

Recall the symbols:
yi:
ŷi:
ȳ:

Three ways to measure error:

1. SST: Total sum of squares (the total variation)

\( SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 \)

2. SSE: Residual sum of squares (the variation unexplained by the model):

\( SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} y_i^2 - \hat{\beta}_0 \sum_{i=1}^{n} y_i - \hat{\beta}_1 \sum_{i=1}^{n} x_i y_i = S_{yy} - \hat{\beta}_1 S_{xy} \)

3. SSR: Regression sum of squares (the variation explained by the model):

\( SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = SST - SSE \)

The coefficient of determination R²:

\( R^2 = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{SSE}{SST} \)

Understanding R²:
1. It is a proportion: the proportion of the total variation in the observed y values (the total sum of squares) that is explained by the regression model.
2. 0 ≤ R² ≤ 1.
3. It is a good fit if R² → 1.
4. It is a poor fit if R² → 0.
5. The larger R², the better the fit.

Example 9.8 Consider the fuel consumption example. Calculate and interpret the simple coefficient of determination R². From previous examples: \( \sum_{i=1}^{8} y_i^2 = 859.91 \), ȳ = 10.21, SSE = 2.57.

Chapter 10 Multiple regression Model

The multiple linear regression (MLR) model with p independent variables:

\( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \)

Given the n observed values of the independent variables and the n observed values of the dependent variable y: y1, y2, ..., yn,

\( y_i = \mu_i + \epsilon_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i \)

µi: the mean value of the ith dependent variable when the values of the independent variables x1, x2, ..., xp are xi1, xi2, ..., xip.
β0, β1, β2, ..., βp: parameters of the model relating µi to xi1, xi2, ..., xip, called (partial) regression coefficients.
ϵi: the error term \( \epsilon_i = y_i - \mu_i \), which describes the effects on yi of all factors other than xi1, xi2, ..., xip.

Interpret β0, β1, ..., βp:
1. β0: the mean of the dependent variable when all independent variables take the value zero, provided the range of the data includes x1 = x2 = ... = xp = 0.
2. βj, j = 1, 2, ..., p: the change in the mean of the dependent variable when the jth independent variable xj increases by one unit while all other independent variables remain fixed.

Examples of MLR with two potential independent variables x1, x2:
1. First-order model: \( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon \).
2. Second-order with quadratic terms: \( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \epsilon \).
3. Second-order with interaction: \( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon \).
4. Full second-order model: \( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \epsilon \).
5. \( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2^{\beta_3} + \epsilon \).

Hierarchical rule: The regression model is said to be hierarchical if it contains all terms of the highest order and lower. Most of the time we want our model to be hierarchical.
1. \( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \epsilon \)
2. \( y = \beta_0 + \beta_2 x_2 + \beta_3 x_1^2 + \epsilon \)
3. \( y = \beta_0 + \beta_1 x_1 x_2 + \epsilon \)
4. \( y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^3 + \epsilon \)

10.1 Model selection

A good regression model is:
1. Easy, simple.
2. Precise, a good fit.

For an MLR with p independent variables, \( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \cdots + \beta_p x_p + \epsilon \), there are k = p + 1 parameters: β0, β1, ..., βp.

1. Coefficient of multiple determination:

\( R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \)

The larger R² is, the better the model is. R² can only be used to compare models with the same number of independent variables, because adding any independent variable to a model, even an unimportant one, makes R² larger.
a. \( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \).
b. \( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon \).
c. \( y = \beta_0 + \beta_1 x_1 + \beta_3 x_3 + \epsilon \).

2. Adjusted R²:

\( R_a^2 = \left( R^2 - \frac{k-1}{n-1} \right) \frac{n-1}{n-k} = 1 - \frac{n-1}{n-k} \cdot \frac{SSE}{SST} \)

The larger \( R_a^2 \) is, the better the model is. Adjusted R² takes into account the number of parameters in the model, k, which works as a penalty for too many independent variables. Hence adjusted R² can be used to compare models with different numbers of independent variables.

3. \( \hat{\sigma}^2 = \frac{SSE}{n-k} \), or σ̂

The smaller σ̂² is, the better the model is. It involves k, which works as a penalty for too many independent variables, so it can be used to compare models with different numbers of independent variables.

Example 10.1 Case study: Historical real estate valuation data, from the UCI Machine Learning Repository.

Potential independent variables:
date: the transaction date.
age: the house age (unit: year).
dist: the distance to the nearest MRT station (unit: meter).
cstores: the number of convenience stores in the living circle (integer).

Dependent variable:
price: house price of unit area.

Sample data:

| date     | age  | distance | convenience.stores | price |
|----------|------|----------|--------------------|-------|
| 2012.917 | 32   | 84.87882 | 10                 | 37.9  |
| 2012.917 | 19.5 | 306.5947 | 9                  | 42.2  |
| 2013.583 | 13.3 | 561.9845 | 5                  | 47.3  |
| ...      | ...  | ...      | ...                | ...   |

We try the following models:

|   | model | R² | Ra² | σ̂ |
|---|-------|-----|------|-----|
| 1 | \( y = \beta_0 + \beta_1 dist + \beta_2 age \) | 0.5843 | 0.5817 | 8.94 |
| 2 | \( y = \beta_0 + \beta_1 dist + \beta_2 age + \beta_3 dist \cdot age \) | 0.6122 | 0.6086 | 8.648 |
| 3 | \( y = \beta_0 + \beta_1 dist + \beta_2 age + \beta_3 dist^2 + \beta_4 dist \cdot age \) | 0.6512 | 0.6468 | 8.215 |
| 4 | \( y = \beta_0 + \beta_1 dist + \beta_2 age + \beta_3 dist^2 + \beta_4 age^2 + \beta_5 dist \cdot age \) | 0.652 | 0.6465 | 8.218 |
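The penalty built into adjusted R² can be illustrated with a small calculation. The sketch below implements both algebraic forms of the adjusted R² formula from Section 10.1; the sample size and R² values in the usage lines are hypothetical numbers chosen for illustration, not values from the case study (whose sample size is not stated in these notes).

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 as 1 - (n - 1)/(n - k) * (1 - R^2), with k parameters."""
    return 1 - (n - 1) / (n - k) * (1 - r2)


def adjusted_r2_alt(r2: float, n: int, k: int) -> float:
    """Equivalent form: (R^2 - (k - 1)/(n - 1)) * (n - 1)/(n - k)."""
    return (r2 - (k - 1) / (n - 1)) * (n - 1) / (n - k)


# Hypothetical comparison: a small model versus a larger model whose
# R^2 is only slightly higher.
n = 100
small = adjusted_r2(0.600, n, k=3)   # 2 independent variables + intercept
large = adjusted_r2(0.605, n, k=6)   # 5 independent variables + intercept

print(f"small model: {small:.4f}, large model: {large:.4f}")
print("prefer the small model" if small > large else "prefer the large model")
```

In this hypothetical comparison R² alone would favour the larger model, while adjusted R² favours the smaller one, which is the behaviour the penalty term is meant to produce.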
