Lecture Notes - Inferential Statistics
Exploratory data analysis helped you understand how to discover patterns in data using various techniques and approaches. As you've learnt, EDA is one of the most important parts of the data analysis process, and it is also the part on which data analysts spend most of their time. Sometimes, however, your analysis may require a very large amount of data that would take too much time and too many resources to acquire. In such situations, you are forced to work with a smaller sample of the data instead of the entire dataset.

Situations like these arise all the time at big companies like Amazon. For example, say the Amazon QC department wants to know what proportion of the products in its warehouses are defective. Instead of going through all of its products (which would be a lot!), the Amazon QC team can check a small sample of 1,000 products and find, for this sample, the defect rate (i.e. the proportion of defective products). Then, based on this sample's defect rate, the team can "infer" what the defect rate is for all the products in the warehouses. This process of "inferring" insights from sample data is called "inferential statistics".

Random Variables

Before performing any kind of statistical analysis on a problem, it is advisable to quantify its outcomes using random variables. A random variable X converts the outcomes of an experiment into something measurable by assigning a numerical value to each outcome. For example, recall that we quantified the colours of the balls we would get after playing our game by assigning a value of X to each outcome: we defined X as the number of red balls we would get after playing the game once.

Figure 1 – Quantifying Using Random Variables

Probability Distribution

A probability distribution for X is any form of representation that tells us the probability of every possible value of X. It could be a table, a chart or an equation. A probability distribution looks like the frequency distribution, just with a different scale. For example, here are the probability distribution and the frequency distribution (histogram) for our UpGrad red ball game.

Figure 2 – Frequency Distribution (Left) vs Probability Distribution (Right)

Expected Value

The expected value of a variable X is the value of X we would "expect" to get after performing the experiment once. It is also called the expectation, average or mean value. Mathematically speaking, for a random variable X that can take the values $x_1, x_2, x_3, \dots, x_n$, the expected value (EV) is given by:

$EV(X) = x_1 P(X = x_1) + x_2 P(X = x_2) + \dots + x_n P(X = x_n)$

where $P(X = x_i)$ denotes the probability that the random variable takes the value $x_i$.

For example, suppose you're trying to find the expected value of the number of red balls in our UpGrad game. The random variable X, the number of red balls the player gets after playing the game once, can take the values 0, 1, 2, 3 and 4. So, the expected number of red balls is:

$EV(X) = 0(0.027) + 1(0.160) + 2(0.347) + 3(0.333) + 4(0.133) = 2.385$

Of course, you can never get 2.385 red balls in one game, as the number of balls will always be an integer, like 2 or 3. The expected value, however, does not have to be a value that can actually turn up in the experiment/game: it is the average value of X you would get if you played the game an infinite number of times.
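To make the calculation concrete, here is a minimal Python sketch of the expected value computation, using the observed probabilities quoted above (the variable names are ours, not from the notes):

```python
# Expected value of X, the number of red balls in one play of the game,
# computed from the observed probability distribution quoted in the notes.
values = [0, 1, 2, 3, 4]
probs = [0.027, 0.160, 0.347, 0.333, 0.133]

# EV(X) = sum over all x of x * P(X = x)
ev = sum(x * p for x, p in zip(values, probs))
print(f"EV(X) = {ev:.3f}")  # EV(X) = 2.385
```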
Probability Without Experiment

Using the basic rules of probability, i.e. the addition rule and the multiplication rule, you saw how you could find the probabilities for our UpGrad red ball game without playing the game even once. The probability distribution thus obtained (the theoretical probability distribution) was very similar to the distribution obtained earlier via experiment (the observed probability distribution).

Figure 3 – Observed Probability Distribution (Left) vs Theoretical Probability Distribution (Right)

Notice that the values of P(X = 0) are very close in both graphs, as are the values of P(X = 1), P(X = 2), P(X = 3) and P(X = 4). If more than 75 experiments had been conducted, the values would have been even closer. In fact, for an infinite number of experiments, the values would be exactly the same in both graphs.

Binomial Distribution

The binomial distribution can be used to calculate the probability of an event if the following conditions hold:

1. The total number of trials is fixed.
2. Each trial is binary, i.e. it has only two possible outcomes, success and failure.
3. The probability of success is the same for all trials.

Basically, it should be a series of yes-or-no questions, with the probability of "yes" remaining the same for all questions. Examples of such situations are:

1. Finding the probability of 5 of the next 10 cars having an even-numbered licence plate.
2. Finding the probability of 3 of the next 4 balls picked out of the bag being red (the UpGrad game, where balls are put back after drawing).
3. Finding the probability of 9 of the next 20 coin tosses resulting in heads.

For such a situation, the probability of r successes is given by:

$P(X = r) = \binom{n}{r} p^r (1 - p)^{n - r}$

where n is the total number of trials/questions, p is the probability of success in one trial, and r is the number of successes after n trials.

For example, in our UpGrad game, the total number of trials is n = 4 and the probability of getting a red ball in one trial is p = 0.6. So, the probability of getting r red balls is given by:

$P(X = r) = \binom{4}{r} (0.6)^r (0.4)^{4 - r}$

Using this, we get $P(X = 0) = \binom{4}{0} (0.6)^0 (0.4)^4 = 0.0256$. Also, $P(X = 1) = \binom{4}{1} (0.6)^1 (0.4)^3 = 0.1536$. Similarly, we can find P(X = 2), P(X = 3) and P(X = 4).

Cumulative Probability

The cumulative probability of x, generally denoted by F(x), is the probability of the random variable X taking a value less than or equal to x. Mathematically speaking:

$F(x) = P(X \leq x)$

For example, for our UpGrad game, $F(2) = P(X \leq 2) = P(X = 0) + P(X = 1) + P(X = 2) = 0.0256 + 0.1536 + 0.3456 = 0.5248$.
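The binomial probabilities and the cumulative probability above can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library (the helper name binom_pmf is ours):

```python
from math import comb

# UpGrad game: n = 4 draws (with replacement), p = 0.6 chance of red per draw
n, p = 4, 0.6

def binom_pmf(r: int) -> float:
    """P(X = r) = C(n, r) * p^r * (1 - p)^(n - r)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

for r in range(n + 1):
    print(f"P(X = {r}) = {binom_pmf(r):.4f}")  # 0.0256, 0.1536, 0.3456, 0.3456, 0.1296

# Cumulative probability F(2) = P(X <= 2) = P(0) + P(1) + P(2)
print(f"F(2) = {sum(binom_pmf(r) for r in range(3)):.4f}")  # 0.5248
```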
Probability Density Functions

For a continuous random variable, the probability of getting an exact value is very low, almost zero. Hence, when talking about the probability of continuous random variables, you can only talk in terms of intervals. For example, for a particular company, the probability of an employee's commute time being exactly equal to 35 minutes was zero, but the probability of an employee having a commute time between 35 and 40 minutes was 0.2. Hence, for continuous random variables, probability density functions (PDFs) and cumulative distribution functions (CDFs) are used instead of the bar-chart type of distribution used for discrete random variables. These functions are preferred because they express probability in terms of intervals.

Figure 4 – PDFs vs CDFs (X = commute time)

To find a cumulative probability using a CDF, you just have to read the value off the graph. For example, F(28), i.e. the probability of an employee having a commute time less than or equal to 28 minutes, is given by the value of the CDF at X = 28. In the PDF, it is given by the area under the graph between X = 20 (the lowest value) and X = 28.

Normal Distribution

A very commonly used probability density function is the normal distribution. It is a symmetric distribution, and its mean, median and mode all lie at the centre.

Figure 5 – Normal Distribution

A normally distributed variable also follows the 1-2-3 rule, which states that there is a:

1. 68% probability of the variable lying within 1 standard deviation of the mean;
2. 95% probability of the variable lying within 2 standard deviations of the mean;
3. 99.7% probability of the variable lying within 3 standard deviations of the mean.

Figure 6 – 1-2-3 Rule for Normal Distribution

Standard Normal Distribution

To find the probability for a normal variable, you do not need a separate table for every combination of mean and standard deviation; it is enough to know how many standard deviations away from the mean your value lies. That is given by:

$Z = \frac{X - \mu}{\sigma}$

This is called the Z-score, or the standard normal variable. You can use the Z table to find the cumulative probability for various values of Z. For example, say you want to find the cumulative probability for Z = 0.68 using the Z table.

Figure 7 – Z Table

The intersection of row "0.6" and column "0.08" gives 0.7517, which is our answer.
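These lookups are easy to verify in code. Here is a small sketch using NormalDist from Python's standard library, so no Z table is needed:

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
z = NormalDist(mu=0, sigma=1)

# Cumulative probability for Z = 0.68 (the Z-table example above)
print(f"P(Z <= 0.68) = {z.cdf(0.68):.4f}")  # 0.7517

# The 1-2-3 rule: probability of lying within k standard deviations of the mean
for k in (1, 2, 3):
    print(f"within {k} sd: {z.cdf(k) - z.cdf(-k):.3f}")  # 0.683, 0.954, 0.997
```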
Samples

Instead of finding the mean and standard deviation for the entire population, it is sometimes beneficial to find them for only a small representative sample. You may have to do this because of time and/or money constraints. For example, for an office of 30,000 employees, we wanted to find the average commute time. Instead of asking all the employees, we asked only 100 of them and found that, for this sample, the mean was 36.6 minutes and the standard deviation was 10 minutes.

However, it would not be fair to infer that the population mean is exactly equal to the sample mean, because the sampling process inevitably introduces some error. Hence, the sample mean's value has to be reported with some margin of error. For example, the mean commute time for the office of 30,000 employees might be reported as 36.6 ± 3 minutes, 36.6 ± 1 minute, 36.6 ± 10 minutes, or, for that matter, 36.6 minutes ± some margin of error. To find this margin, it is necessary to understand sampling distributions, as their properties help in finding it.

Sampling Distributions & Central Limit Theorem

The sampling distribution, which is basically the distribution of the sample means of a population, has some interesting properties, collectively called the central limit theorem (CLT). It states that no matter how the original population is distributed, the sampling distribution will follow these three properties:

1. The sampling distribution's mean equals the population mean: $\mu_{\bar{X}} = \mu$.
2. The sampling distribution's standard deviation, called the standard error, is $\sigma / \sqrt{n}$, where $\sigma$ is the population's standard deviation and n is the sample size.
3. For n > 30, the sampling distribution is approximately normal.

To verify these properties, we performed sampling using the data collected for our UpGrad game from Session 1. The values for the sampling distribution thus created ($\mu_{\bar{X}} = 2.348$, S.E. = 0.4248) were pretty close to the values predicted by theory ($\mu_{\bar{X}} = 2.385$, S.E. = 0.44).

To summarise, the notation for samples, populations and sampling distributions:

Population: mean $\mu$, standard deviation $\sigma$
Sample (size n): mean $\bar{X}$, standard deviation $S$
Sampling distribution: mean $\mu_{\bar{X}} = \mu$, standard error $\sigma / \sqrt{n}$

Mean Estimation Using CLT

Using the CLT, you can estimate the population mean from the sample mean and standard deviation. For example, to estimate the mean commute time of the 30,000 employees of an office, you took a sample of 100 employees and found their mean commute time. For this sample, the sample mean was $\bar{X}$ = 36.6 minutes and the sample standard deviation was S = 10 minutes. Using the CLT, you can say that the sampling distribution for the mean commute time will have:

1. Mean = $\mu$ (unknown)
2. Standard error = $\sigma / \sqrt{n} \approx S / \sqrt{n} = 10 / \sqrt{100} = 1$ minute
3. Since n = 100 > 30, an approximately normal shape.

Using these properties, you can claim that the probability that the population mean $\mu$ lies between 34.6 minutes (36.6 − 2) and 38.6 minutes (36.6 + 2) is 95.4%. There is some terminology associated with such a claim:

1. The probability associated with the claim is called the confidence level (here, 95.4%).
2. The maximum error made in the sample mean is called the margin of error (here, 2 minutes).
3. The final interval of values is called the confidence interval (here, the range (34.6, 38.6)).

In fact, you can generalise the entire process. Say you have a sample with sample size n, mean $\bar{X}$ and standard deviation S. The y% confidence interval (i.e. the confidence interval corresponding to a y% confidence level) for $\mu$ is given by the range:

$\text{Confidence Interval} = \left( \bar{X} - \frac{Z^* S}{\sqrt{n}},\ \bar{X} + \frac{Z^* S}{\sqrt{n}} \right)$

where $Z^*$ is the Z-score associated with a y% confidence level. For example, the 90% confidence interval for the mean commute time, with $\bar{X}$ = 36.6 minutes, S = 10 minutes, n = 100 and $Z^*$ = 1.65 ($Z^*$ corresponding to a 90% confidence level), is:

$\mu \in (34.95 \text{ mins}, 38.25 \text{ mins})$
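Here is a minimal Python sketch of this confidence interval calculation. It computes the exact $Z^*$ with inv_cdf rather than the rounded table values (1.65 and 2), so the results differ slightly from those above; the function name confidence_interval is ours:

```python
from math import sqrt
from statistics import NormalDist

# Sample statistics from the commute-time example
x_bar, s, n = 36.6, 10.0, 100

def confidence_interval(confidence: float) -> tuple[float, float]:
    """CI for the population mean, assuming n > 30 so that the
    sampling distribution is approximately normal (CLT)."""
    # Z* puts a central area of `confidence` under the standard normal curve
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z_star * s / sqrt(n)
    return x_bar - margin, x_bar + margin

lo, hi = confidence_interval(0.90)
print(f"90% CI: ({lo:.2f}, {hi:.2f})")    # (34.96, 38.24)
lo, hi = confidence_interval(0.954)
print(f"95.4% CI: ({lo:.2f}, {hi:.2f})")  # (34.60, 38.60)
```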
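Finally, the central limit theorem itself can be checked by simulation. This sketch simulates the UpGrad game from its theoretical parameters (4 draws, p = 0.6); the sample size, number of samples and seed are our own choices, not from the notes:

```python
import random
from statistics import mean, stdev

random.seed(42)  # for reproducibility

# One play of the game: number of red balls in 4 draws, each red with p = 0.6
def play_game() -> int:
    return sum(random.random() < 0.6 for _ in range(4))

# Sampling distribution: draw many samples of size n, record each sample's mean
n, num_samples = 50, 2000
sample_means = [mean(play_game() for _ in range(n)) for _ in range(num_samples)]

# CLT predictions: mean = mu = 4 * 0.6 = 2.4, and
# standard error = sigma / sqrt(n) = sqrt(4 * 0.6 * 0.4) / sqrt(50) = 0.139
print(f"mean of sample means: {mean(sample_means):.3f}")  # close to 2.4
print(f"standard error:       {stdev(sample_means):.3f}")  # close to 0.139
```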