Summary

This document appears to be lecture notes for a STATS 250 course, covering statistical methodology from an analysis-of-data viewpoint. Topics include sampling distributions, hypothesis tests, and regression, with the 1969 American Draft Lottery analyzed as a case study.

Full Transcript


STATS 250
A one-semester course in applied statistical methodology from an analysis-of-data viewpoint

Lecture  Topic
01  Variable Types, Parameters, & Statistics
02  Sampling Distributions of 𝜋̂
03  Normal approximations to sampling distributions
04  Sampling distribution of 𝜇̂
05  Simulation-based hypothesis tests
06  Parametric tests for 𝜋
07  Parametric tests for 𝜇
08  Estimated effect sizes
09  Single sample confidence intervals
10  Mechanics of confidence intervals
11  Understanding confidence levels
12  Independence, association, and generalizability
13  Confounders and causal inference
14  Inferential procedures for a difference in independent means, 𝜇₁ − 𝜇₂
15  Power and other mechanics of hypothesis tests
16  ANOVA
17  Introduction to simple linear regression
18  Inference for simple linear regression
19  Introduction to multiple linear regression
20  Multiple regression with categorical predictors
21  Multiple regression with interaction terms
22  Associations between categorical variables, relative risk
23  𝜒² tests of independence
24  𝜒² Goodness of fit tests

John Keane and Alicia Romero © 2024

Lecture 01: Variable Types, Parameters, & Statistics

□ Describe statistical inference in terms of detecting a signal amid noise.
□ Given a data set, identify observational units (cases) and variables. Classify variables by their type (quantitative or categorical).
□ Distinguish between parameters and statistics, using appropriate notation.

1 The 1969 American Draft Lottery

Almost no STATS 250 students were alive the last time the US government drafted American citizens into the army. The first draft lottery of the Vietnam era was conducted on December 1, 1969. The purpose was to determine which young men (between 19 and 26 years old) would be drafted to serve in the U.S. armed forces, perhaps to end up risking their lives in combat in Vietnam.
Given the life-and-death stakes of the draft, officials felt it was imperative to draft young American men without bias, at random. Thus, the draft lottery was based on birthdays, so as not to give any advantage or disadvantage to certain groups of people. Three hundred and sixty-six capsules were put into a bin, with each capsule containing one of the 366 dates of the year. The capsules were drawn one at a time, with draft number 1 being assigned to the birthday drawn first (which turned out to be September 14), meaning that young men born on that date were the first to be drafted. The table to the right shows the results of the draft lottery.

a. Find the number associated with your birthday. When would you have been drafted?

STATS 250 Lecture Notes, 3

1.1 Was the Lottery Fair?

Consider the scatterplot below, which has each birthday’s sequential date on the x-axis (e.g., January 1 is placed at 𝑥 = 1, February 1 at 𝑥 = 32, and December 31 at 𝑥 = 366) and draft number on the y-axis.

b. What would you expect this graph to look like if the Vietnam draft was a truly fair, random lottery process?

c. Does this graph appear to display the results of a truly fair, random lottery process?

Given that the stakes of this lottery were so high, it is worth digging a little deeper into these data to check whether the draft lottery was conducted in a fair and unbiased manner. Suppose we proceed month by month, calculating the median draft number for each month. The table below shows the same results of the December 1, 1969 draft lottery displayed earlier. This time, however, draft numbers have been sorted from smallest to largest within each month.
Rank  JAN  FEB  MAR  APR  MAY  JUN  JUL  AUG  SEP  OCT  NOV  DEC
   1   17    4   29    2   31   20   13   11    1    5    9    3
   2   52   25   30   14   35   22   15   21    6    7   19   10
   3   58   57   33   32   37   28   23   36    8   24   34   12
   4   59   68  108   62   40   60   27   44   18   38   46   16
   5   77   86  122   74   55   64   42   45   49   72   47   26
   6   92   89  136   81   65   69   50   48   63   79   51   39
   7  101   91  139   83   75   73   67   54   71   87   66   41
   8  118  144  166   90  103   85   88   61   82   94   76   43
   9  121  150  169  124  112  104   93  102  113  117   80   53
  10  140  152  170  147  130  109   98  106  119  125   97   56
  11  159  179  200  148  133  110  115  111  149  138   99   70
  12  164  181  213  191  155  134  120  114  151  171  107   78
  13  186  189  217  208  178  137  153  116  158  176  126   84
  14  194  205  223  218  183  180  172  141  160  192  127   95
  15  199  210  239  219  197  206  187  142  161  196  131   96
  16  211  212  256  231  226  209  188  145  175  201  132  100
  17  215  214  258  252  250  222  190  154  177  202  143  105
  18  221  216  259  253  276  228  193  167  184  220  146  123
  19  224  236  265  260  278  247  227  168  195  229  156  128
  20  235  285  267  262  295  249  248  198  204  234  174  129
  21  238  290  268  269  296  272  270  245  207  237  182  135
  22  251  292  275  271  298  274  277  261  225  241  185  157
  23  280  297  293  273  308  301  279  286  232  243  203  162
  24  305  299  300  312  313  335  284  291  233  244  230  163
  25  306  302  317  316  319  341  287  307  242  254  266  165
  26  318  338  323  336  321  353  289  311  246  264  281  173
  27  325  347  332  340  326  356  303  324  255  283  282  240
  28  329  363  334  345  330  358  322  333  257  288  309  304
  29  337  365  343  346  357  360  327  339  263  294  310  314
  30  349       354  351  361  366  331  344  315  342  348  320
  31  355       362       364       350  352       359       328

d. In the space below, calculate the median draft number for your own birth month. Confer with your neighbors to fill out the median draft number for all months to complete the table below.

Month      Median Draft Number    Month      Median Draft Number
January    211                    July       188
February   210                    August     145
March      256                    September  168
April      225                    October    201
May        226                    November   131.5
June       207.5                  December   100

e. Consider the scatterplot above, which has been updated to include the median draft numbers by month.
Do you see any patterns in these medians, or do they look like random scatter? Yes, there is a clear pattern.

You should notice a very concerning pattern above: the median draft number appears to decline as you read the scatterplot from left to right, suggesting that American men born later in the calendar year were systematically drafted earlier than those born earlier in the year.

How was this bias introduced? You can watch the news report of the lottery in action here.¹ Analyses conducted afterward suggested that the capsules were not properly mixed, leading to biased selection of young Americans into the armed services.² Fortunately, many improvements were made in the process for the following year’s lottery. The capsules were mixed much more thoroughly, and the process included random selection of draft numbers as well as random drawing of birthdates.

¹ https://www.realclearhistory.com/video/2018/11/30/1970_draft_lottery.html
² https://www.science.org/doi/10.1126/science.171.3968.255

1.2 Detecting the Signal Amid Noise

The previous example regarding the 1969 American Draft Lottery shows how statistics (even very simple ones, like the median) can help us detect meaningful patterns among collected data. Many techniques we’ll explore throughout this course can be interpreted through a metaphor commonly employed in statistics and data analysis: our goal as statisticians is primarily to detect the signal amid the noise.

The ‘signal’ represents meaningful information you’re trying to detect about the world around you. In the previous example, the ‘signal’ we were looking to detect was any form of bias advantaging or disadvantaging young American men according to their birthdate.

The ‘noise’ is the random, unwanted variation or fluctuation that interferes with the signal. In the previous example, there was substantial random variation in draft numbers that made it difficult to detect the bias in the lottery process.
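The month-by-month median comparison from the draft lottery example can be sketched in R. This is only a minimal illustration: the three vectors are copied from the January, April, and December columns of the sorted lottery table above, and the object names (`jan`, `apr`, `dec`, `lottery`, `monthly_medians`) are chosen here for illustration and are not part of the course materials.

```r
# Sorted draft numbers for three months, copied from the lottery table.
jan <- c(17, 52, 58, 59, 77, 92, 101, 118, 121, 140, 159, 164, 186, 194, 199,
         211, 215, 221, 224, 235, 238, 251, 280, 305, 306, 318, 325, 329, 337,
         349, 355)
apr <- c(2, 14, 32, 62, 74, 81, 83, 90, 124, 147, 148, 191, 208, 218, 219,
         231, 252, 253, 260, 262, 269, 271, 273, 312, 316, 336, 340, 345, 346,
         351)
dec <- c(3, 10, 12, 16, 26, 39, 41, 43, 53, 56, 70, 78, 84, 95, 96,
         100, 105, 123, 128, 129, 135, 157, 162, 163, 165, 173, 240, 304, 314,
         320, 328)

# Collect the months in a named list and take the median of each one.
lottery <- list(January = jan, April = apr, December = dec)
monthly_medians <- sapply(lottery, median)

print(monthly_medians)
```

With these three columns the medians come out to 211 for January, 225 for April, and 100 for December. A fair lottery would put every month's median near the middle of the range 1 to 366, so a December median this low is the kind of signal the scatterplot of medians makes visible.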
Consider each of the following examples, some of which are not statistical in nature. How would you describe each in terms of a ‘signal’ you are trying to detect amid ‘noise’?

1. You are speaking to your mother on a cell phone while riding on a busy subway train.
2. You and your sibling play a game of ‘Where’s Waldo?’ in which you try to spot a classic cartoon character amid a large crowd of people.
3. You are evaluating evidence for the claim that a new drug provides improved outcomes for patients compared to a pre-existing drug already on the market. You randomly split patients into two groups: one receives the new drug and the other receives the pre-existing drug. You compare the symptoms of each group of patients after they conclude a 2-week treatment plan.

2 Observational Units & Variables

Stats 250 is a course that will develop your statistical thinking abilities. Regardless of your future path with statistics (practitioner or consumer), it is important to know how data can be used to answer research questions. What are the best ways to collect the data needed to answer a research question? How can we summarize and display data in ways that allow us to distinguish the ‘signals’ embedded in them from the ‘noise’? What conclusions can be made from the data? The ideas and techniques that you learn in Stats 250 will start you on your journey of statistical thinking. It would be impossible to learn all statistical techniques in one term, but we endeavor to teach you a few core data collection techniques, some core data analysis ideas, and fundamental concepts of statistical inference. We begin the course by introducing some terminology that will be used throughout the term.

2.1 Observational Units, Variables, and Data Matrices

Data sets don’t come to us all neatly summarized so that we can immediately see the story that the data tell us. We need to organize our data so that we can describe them effectively. Data can be messy!
Cleaning data is both an art and a science. We won’t get into data cleaning in this course, but you may wind up doing it in the future. The first step in most data analyses is to understand the data and to describe (summarize) the data.

Table 1: Five rows from the IMDb data set (Source: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).)

Descriptive statistics refers to methods for summarizing and organizing observed samples of data.

Table 1 displays 5 rows of a data set containing 500 movies randomly selected from the Internet Movie Database (IMDb) website. We will refer to this set of observations as the IMDb data set (this data set is a random sample from a larger data set). Each row in the table represents a movie, or an observational unit; it corresponds to a unique case. The columns represent different characteristics of each movie selected, called variables. For example, the first row represents the movie Avatar, which is an action movie produced by Ingenious Film Partners and is rated PG-13. Avatar reported a revenue of $2,787,965,087 with a runtime of 162 minutes. At the time of the data collection, Avatar had been reviewed by 11,800 viewers, and the average star rating was 7.2 stars. When data are presented to us, it is very important to be sure we know what each variable represents and its units of measurement.

Observational units (or cases) are the specific entities about which information is recorded. Observational units can be concrete (e.g., a person, a petri dish, an elementary school) or more abstract (e.g., a randomly selected week from the year). The number of independent observational units observed in a dataset typically determines your sample size.
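The layout described above, with one row per observational unit and one column per variable, can be sketched as a small R data frame. This is an illustration under stated assumptions, not the course's data file: the five rows are copied from the IMDb excerpt shown in the Try It table of these notes, only five of the eight variables are included, and the column names (`original_title`, `mpaa_rating`, and so on) are made up here.

```r
# Five movies (observational units) and five variables, copied from the
# IMDb excerpt in these notes. Column names are illustrative only.
movies <- data.frame(
  original_title = c("Pulp Fiction", "Toy Story", "Jaws",
                     "The Dark Knight Rises", "Mean Girls"),
  genre          = c("Thriller", "Animation", "Horror", "Action", "Comedy"),
  mpaa_rating    = c("R", "G", "PG", "PG-13", "PG-13"),
  runtime        = c(154, 81, 124, 165, 97),
  vote_average   = c(8.3, 7.7, 7.5, 7.6, 6.9),
  stringsAsFactors = FALSE
)

n_units     <- nrow(movies)   # number of observational units (rows)
n_variables <- ncol(movies)   # number of variables (columns)

print(c(units = n_units, variables = n_variables))
print(sapply(movies, class))  # how R stores each variable
print(mean(movies$runtime))   # a numerical summary of one variable
```

Each call to `nrow()` and `ncol()` reads off the two dimensions of the data matrix: the sample size and the number of recorded characteristics.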
A variable describes any characteristic of an observational unit you might record, which can assume different values for different observational units.

2.2 Types of Variables

Let’s continue working with the IMDb data set. Each of the eight variables is different, but they do share certain characteristics. First, consider two variables from the IMDb data set: runtime and genre.

Try It! Differences between quantitative and categorical variables

     original title         genre      production company    MPAA rating  revenue     runtime  vote average  vote count
1    Pulp Fiction           Thriller   Miramax Films         R            213928762   154      8.3           8428
2    Toy Story              Animation  Walt Disney Pictures  G            373554033   81       7.7           5259
3    Jaws                   Horror     Universal Pictures    PG           470654000   124      7.5           2542
4    The Dark Knight Rises  Action     Legendary Pictures    PG-13        1084939099  165      7.6           9106
5    Mean Girls             Comedy     Paramount Pictures    PG-13        129042871   97       6.9           2320
...  ...                    ...        ...                   ...          ...         ...      ...           ...
500  Frozen                 Animation  Walt Disney Pictures  PG           1274219009  102      7.3           5295

f. Could you compute the “AVERAGE RUNTIME” for these 500 movies? Yes

g. Could you compute the “AVERAGE GENRE” for these 500 movies? No

Runtime and Genre are two different types of variables. Different types of variables provide different kinds of information. The variable type will guide what kinds of summaries (graphs/numerical) are appropriate.

All variables are either quantitative or categorical:

Quantitative variables (also called numerical or measurement variables) take on a wide range of numerical values. If a variable is quantitative, it should be amenable to arithmetic operations (e.g., addition, subtraction, averaging).

Categorical variables (also called qualitative variables) place an individual or item into one of several groups or categories, which are called levels.

Try It! Classifying variables by their type

h. Suppose that the students in today’s class session are the observational units of a survey study.
Classify the answer a student would give to each of the questionnaire items below as quantitative or categorical.

i. How long did you sleep last night? Quantitative
ii. In what zip code were you born? Categorical
iii. How many credits are you taking this semester? Quantitative
iv. Are you left- or right-hand dominant? Categorical

i. Explain why the following are not variables (still considering the students in today’s class session as the observational units).

v. What proportion of students in this class are left-handed?
vi. What is the average amount of sleep in the past 24 hours among all students in today’s class session?

The reason is that the answers to these questions are not characteristics of individual observational units. They would be considered statistics.

3 Defining Populations, Samples, Parameters, and Statistics

Now that we have introduced the notion of quantitative and categorical variables, we next introduce two types of quantities we could use to summarize them: parameters and statistics. To understand the similarities and differences between these two types of quantities, we first need to understand the differences between populations of interest and observed samples of data.

In a statistical study, the population of interest is the group of observational units about which the researcher wants to draw a conclusion.

Populations of interest are often a finite group of observational units (e.g., the population of all students enrolled in STATS 250 this semester). However, populations can also be a bit more abstract, where the number of observational units in the population is ill-defined (e.g., the population of all students who will enroll in STATS 250 in the next decade).

Typically, a population of interest is not fully observed by a researcher.³ Instead, the researcher selects a subset of these observational units called a sample and uses this smaller group of observational units as the basis of their study.
In a statistical study, the sample is the group of observational units the researcher draws and observes from the broader population of interest.

3.1 Parameters and Statistics

Broadly, parameters and statistics are numerical quantities used to describe attributes of data. These two types of quantities share many attributes but are also distinguished by important differences. A parameter is a number that describes an attribute of a population of interest, whereas a statistic is a quantity that describes an attribute of an observed sample. A statistic is often used as a point estimate of a parameter value.

Parameter: A parameter is a numerical characteristic of a population. It is a fixed, usually unknown, quantity that we are trying to learn more about.

Statistic: A statistic is a numerical characteristic of an observed sample. In a statistical study, it is usually a known quantity. But, because it is based on observed data, it is also a variable quantity that can change from sample to sample.

A quantitative mean
RQ: How tall, on average, are women in Ann Arbor?
  Parameter notation: 𝜇
    𝜇 = the mean height of all women in Ann Arbor
  Statistic notation: 𝜇̂
    𝜇̂ = the mean height of a sample of 100 women in Ann Arbor

A quantitative standard deviation
RQ: How much variability do heights of Ann Arbor women display?
  Parameter notation: 𝜎
    𝜎 = the true standard deviation of heights of all women in Ann Arbor
  Statistic notation: 𝜎̂
    𝜎̂ = the observed standard deviation of heights of a sample of 100 women

A categorical proportion
RQ: What proportion of Ann Arbor women are taller than 6 ft?
  Parameter notation: 𝜋
    𝜋 = the true proportion of all women in Ann Arbor who are taller than 6 ft.
  Statistic notation: 𝜋̂
    𝜋̂ = the observed proportion of a sample of 100 women in Ann Arbor who are taller than 6 ft.

³ This can be for a variety of reasons. Observing every observational unit in a population can be too time-intensive, too expensive, or even unethical.
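The distinction between a parameter and a statistic can be sketched in R with a made-up population. Everything numeric here is an assumption for illustration: the population of 50,000 heights is simulated with `rnorm()` using an arbitrary mean and standard deviation, it is not real Ann Arbor data, and the object names are hypothetical. Only the sample size of 100 and the 6 ft cutoff come from the examples above.

```r
set.seed(250)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: simulated heights in inches (not real data).
population <- rnorm(50000, mean = 64.5, sd = 2.5)

# Parameters: computed from the entire population; fixed, usually unknown.
mu     <- mean(population)
sigma  <- sd(population)          # sd() divides by N - 1; negligible for N = 50000
pi_pop <- mean(population > 72)   # proportion taller than 6 ft (72 inches)

# Statistics: computed from one random sample of 100; vary sample to sample.
heights_sample <- sample(population, size = 100)
mu_hat    <- mean(heights_sample)
sigma_hat <- sd(heights_sample)
pi_hat    <- mean(heights_sample > 72)

print(round(c(mu = mu, mu_hat = mu_hat,
              sigma = sigma, sigma_hat = sigma_hat,
              pi = pi_pop, pi_hat = pi_hat), 3))
```

Rerunning only the `sample()` line and the three lines after it gives new values of the statistics each time, while `mu`, `sigma`, and `pi_pop` stay fixed. That contrast is the one the table above draws between the two columns.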
Try It! Identifying Populations, Samples, Parameters, and Statistics

Each of the following scenarios provides a brief description of a statistical study. Summarize each study in terms of a population of interest, observed sample, parameter, and statistic.

j. A medical researcher wants to know how different cardiovascular diseases affect patients’ blood pressure variability. Across the US adult population, the standard deviation of systolic blood pressure is 20 mmHg. The researcher wants to know whether this value is higher among US adults diagnosed with hypertension. She takes a random sample of 45 patients being treated for hypertension and finds that they have a systolic blood pressure standard deviation of 32 mmHg.

Observational units: U.S. adults diagnosed with hypertension
Population: All U.S. adults who have been diagnosed with hypertension
Parameter: 𝜎 = 20 mmHg
Sample: 𝑛 = 45 U.S. adults diagnosed with hypertension
Statistic: 𝜎̂ = 32 mmHg

k. A quality control engineer has been asked to inspect an automated assembly line at a large manufacturing center. Part of his report must detail the rate at which the assembly line produces defective products. If he finds evidence that the defect rate exceeds 2%, the assembly line will be shut down and recalibrated. The engineer samples every 10th product that comes off the assembly line until he has a total of 80 observations. Of the 80 products he sampled, 4 are defective.

Observational units:
Population:
Parameter:
Sample:
Statistic:

l. A labor market analyst is interested in the average duration recent US college graduates spend searching for a first job after completing their undergraduate degree. She polls a random sample of 30 recent graduates from UMich LSA and finds that, on average, they spent 3.2 months job searching before finding their first position.
Observational units:
Population:
Parameter:
Sample:
Statistic:

Lecture 02: Exploring Sampling Distributions: Sample Proportions and Sample Means

□ Visualize and interpret sampling distributions: simulate and plot the sampling distributions of sample proportions.
□ Analyze the shape, center, and variability of sampling distributions for sample proportions.
□ Identify how increasing or decreasing sample size may affect key features of the sampling distribution such as shape, center, and variability.
□ Recognize how sampling distributions help us make predictions and inferences about population parameters based on sample data.

Data Set Descriptions

For several of our examples of random sampling, we will use the institution data set.

institution: The US Department of Education maintains a database of demographic information from all active postsecondary education institutions that participate in Title IV. This data set contains information from 2019 on all 1842 US colleges and universities that grant bachelor’s degrees. We will treat this as population data representing all 4-year US colleges and universities in 2019. Here’s the list of variables.
Name: Name of the school
Region: Region of country (Midwest, Northeast, Southeast, Territory, West)
Type: Type of school (Private or Public)
Locale: Locale (City, Rural, Suburb, Town)
AdmitRate: Admission rate
MidACT: Median ACT scores
AvgSAT: Average combined SAT scores
Enrollment: Undergraduate enrollment
TuitionIn: In-state tuition and fees
TuitionOut: Out-of-state tuition and fees
FacSalary: Average monthly salary for full-time faculty
FirstGen: Percent of first-generation students

1 The Sampling Distribution: Visualizing Patterns in Statistic Values

Albert Einstein is often credited with the following quotation: “The definition of insanity is doing the same thing over and over and expecting different results.” But in statistics, this is not insanity. It is randomness, the driving force behind sampling distributions.

Many statistics scenarios are conceptualized in terms of a statistical model and observed data. Broadly, a statistical model is any process that generates observed data through a random process. A random process, in this context, is defined as any procedure that produces outcomes that are unpredictable in the short term but follow predictable patterns in the long run. This is the foundation of how we think about randomness and its role in generating data.

When we take repeated samples from the same population, it is not uncommon to get slightly different results each time. This variation is due to the inherent randomness in sampling. Even when using the same sampling method, the outcomes can differ because each sample captures different aspects of the underlying population. This variation is captured by a sampling distribution, a concept that helps us understand how sample statistics fluctuate and allows us to make sense of inferential statistics.

Before diving deeper into sampling distributions, let’s revisit the key components of any distribution.
A distribution of a variable summarizes how data points are spread out across possible values (what is possible or probable). In particular, when analyzing distributions, we consider three main attributes:

1. Central tendency
2. Shape
3. Variability (spread, dispersion)

It turns out that statistics, like the sample proportion (symbol 𝜋̂) and the sample mean (symbol 𝜇̂), are our variables of interest, and we are interested in understanding their distributions. Imagine you take a random sample of 10 students in a large introduction to statistics course, record each student’s exam 1 score, and calculate the sample mean score. Now repeat the sampling process multiple times, each time calculating the mean exam 1 score. Each of these observations is a statistic (a variable we collect from each random sample of students), and we can create a plot to visualize the distribution of the statistic values.

The distribution of all possible values of a statistic (obtained from repeated samples of the same size from a population of interest) is called the sampling distribution.

Today, we begin to explore the behavior of statistics. We start with statistics resulting from categorical variables (the sample proportion, 𝜋̂) and end with statistics resulting from quantitative variables (the sample mean, 𝜇̂).

2 Sampling Distribution of Sample Proportions for a Categorical Variable

When working with categorical data that produces counts, we often summarize the data with proportions.
Try It! Bar charts and proportions

Using the institution data set, we created a bar chart for the categorical variable Region.

> addmargins(table(institution$Region))

  Midwest Northeast Southeast Territory      West       Sum
      465       526       425        47       379      1842

Note: The signal here is 𝜋, the true rate at which colleges and universities in the population are located in the Midwest.

a. What proportion of all colleges and universities in the U.S. are in the Midwest? Is this numerical summary a parameter or a statistic? Include the appropriate symbol.

𝜋 = 465/1842 ≈ 0.25

b. Suppose we randomly select 8 colleges/universities from the institution data set and compute the sample proportion (𝜋̂) of institutions that are in the Midwest. Which value do you think is more likely to occur, 𝜋̂ = 0.23 or 𝜋̂ = 0.32? Explain your reasoning.

2.1 Simulation Study

Now that you have calculated the population proportion of all colleges and universities in the U.S. that are in the Midwest, symbol 𝜋, let’s shift our focus to what happens when we take random samples. Imagine selecting a random sample of 8 institutions from our population, computing the sample proportion of institutions that are in the Midwest, and repeating the sampling process over and over. Rather than relying on speculation and our imagination, we can get a clearer picture by conducting a simulation study. In the next activity, you will use the full population data to simulate taking many random samples of 8 institutions and visualize the variability of the sample proportion.

Simulation Study

A simulation study involving categorical variables consists of the following steps:

1. Take a random sample of size 𝑛 from the population of interest.
2. Summarize the sample results by calculating the sample proportion (𝜋̂₁).
3. Plot the computed sample proportion on a number line by marking it with a dot.
Note: This will serve as the foundation for building your sampling distribution.
4. Repeat steps 1-3 many, many, many times (𝜋̂₂, 𝜋̂₃, 𝜋̂₄, …).

These steps allow us to visualize how sample statistics (like the sample proportion) behave across repeated samples.

Let’s illustrate a few random samples to get a sense of the simulation results. We are going to follow the steps of the simulation study and manually record and visualize our results.

2.1.1 Simulation Study - Steps 1 and 2

Using the institution data set and R via the Stats 250 Posit.Cloud workspace, we will collect a random sample of 8 institutions in the U.S., identify their region, and compute the sample proportion of institutions in the Midwest, denoted by the symbol 𝜋̂.

To access Lecture02 in Posit.Cloud and actively engage in the simulation study (available to students who have completed Lab00):

1. Head to: https://rstudio.cloud/
2. Click “Log in with Google” and sign in using your UM credentials.
3. Click the Start button next to the lec02 project.
4. Click the lec02_activity.Rmd in the bottom right corner to open the R Markdown file.

To begin, run the “readData” code chunk to load the data set. Once the data is stored, you can take a random sample of 8 institutions and record the region of each institution by running the “region_sample” code chunk. The output will list the regions of the 8 institutions randomly selected. Now that you have your sample, it is time to calculate and record the sample proportion of institutions in the Midwest.

a. Record the result from the first random sample:
- Number of institutions in the Midwest: 1
- Number of institutions not in the Midwest: 7
𝑛 = 8

b. Compute the observed sample proportion of institutions in the Midwest:
𝜋̂ = 1/8 = 0.125

Take a moment to consider your result.
- How does your sample proportion compare to that of your instructor or the students around you?
- Is it what you expected, or are you surprised?

Reflect on the following questions:

c. Why might the sample proportions differ between you and others, even though you are all sampling from the same population?

d. How much do you think the sample size affects the variability in your results?

2.1.2 Simulation Study - Steps 3 and 4

Now that you have computed your first sample proportion, let’s see what happens when we perform the sampling process multiple times.

Try It! Results of a simulation study

1. Rerun the “region_sample” code chunk: Use the “region_sample” code chunk to take additional random samples of 8 institutions. Each time you run the simulation, compute and record the sample proportion of institutions in the Midwest.
2. Record and plot your results: For each new sample, mark the sample proportion on the provided plot below with a dot. Repeat this process several times until the plot begins to show the overall distribution of sample proportions. Ideally, you will want to collect enough samples to see a clear pattern emerge.

[Dot plot: Simulation Results; x-axis: Sample proportion values. Every dot is a statistic.]

As you and your classmates actively work through the simulation study, a distribution of sample proportions will take shape. Reflect on the following questions as you observe the emerging pattern.

a. What does each dot in the plot above represent?

An observed value of 𝜋̂, the sample proportion of 𝑛 = 8 colleges and universities that are located in the Midwest.

b. Three attributes of a distribution:

Center: Where is the distribution of sample proportions centered? Why do you think that is?

Centered around 𝜋 = 0.25, the true population proportion.

Variability: How spread out are the sample proportions? What factors might contribute to this variability?

𝜋̂ ranges from 0 to 0.625 in a right-skewed, asymmetric manner.

Shape: What does the shape of the distribution look like? Is it symmetric, skewed, or something else?

c. Reflecting on the impact of sample size: Now that you have worked with a sample size of 8 institutions, let’s think about what might happen if we increase the sample size. Do you think the center of the distribution would change if we increased the sample size from 𝑛 = 8 to 20 or 50 institutions? Why or why not? How does the sample size relate to the accuracy of estimating the true population proportion?

If you completed the simulation study, your sampling distribution should look similar to the image below. If you were not able to follow along, use the sampling distribution provided below to answer the “Try It” questions above.

2.2 Three attributes of the sampling distribution of sample proportions

In this section, we will dive into the three key attributes of the sampling distribution: center, variability, and shape. By examining these attributes, we will uncover fundamental ideas about how sample statistics behave when we take repeated samples from the same population.

2.2.1 Center

Let’s start by focusing on the center of the sampling distribution. We will explore this by running three simulations where the true population parameter is set to 𝜋 = 0.25, 𝜋 = 0.50, and 𝜋 = 0.75. As we compare these simulations, notice how the center of the sampling distribution closely aligns with the population parameter in each case.

Key Idea 1: The center of the distribution of all possible values of the sample proportion (denoted by 𝜋̂) is the value of the true population proportion, 𝜋.

Note: The distribution is always centered at your signal. This is always true so long as the sample is random.
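The simulation study from Section 2.1 can be sketched in base R. This is a rough stand-in, not the course's `lec02_activity.Rmd`: since the institution data file is not reproduced in these notes, the Region variable is rebuilt from the counts reported earlier (with the second level labeled Northeast, following the variable description), and the helper names `one_pi_hat` and `simulate_pi_hat`, the seed, and the 5000 repetitions are all choices made here for illustration.

```r
set.seed(250)  # arbitrary seed so the sketch is reproducible

# Rebuild the population's Region variable from the reported counts.
region_counts <- c(Midwest = 465, Northeast = 526, Southeast = 425,
                   Territory = 47, West = 379)
region <- rep(names(region_counts), times = region_counts)

# Parameter: the population proportion of Midwest institutions.
pi_midwest <- mean(region == "Midwest")

# Steps 1-2: draw one random sample of size n and compute pi-hat.
one_pi_hat <- function(n) {
  mean(sample(region, size = n) == "Midwest")
}

# Step 4: repeat many times to approximate the sampling distribution.
simulate_pi_hat <- function(n, reps = 5000) {
  replicate(reps, one_pi_hat(n))
}

sizes <- c(8, 20, 50)
sims <- lapply(sizes, simulate_pi_hat)
names(sims) <- paste0("n=", sizes)

# Center and spread of the simulated pi-hat values at each sample size.
summary_table <- data.frame(
  n      = sizes,
  center = sapply(sims, mean),
  spread = sapply(sims, sd)
)

print(round(pi_midwest, 4))
print(summary_table)

# Step 3 in text form: how often each possible pi-hat occurred for n = 8.
print(table(sims[["n=8"]]))
```

In the printed summary, the `center` column can be compared against `pi_midwest` to check Key Idea 1, and the `spread` column shows how the variability of 𝜋̂ responds when 𝑛 grows from 8 to 20 to 50. The final table is a text version of the dot plot: with 𝑛 = 8, the only possible values of 𝜋̂ are multiples of 1/8.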
2.2.2 Variability and shape
Next, we will investigate how sample size affects the variability and shape of the sampling distribution. We will run three simulation studies where the population parameter is fixed at 𝜋 = 0.25, but we increase the sample size from 𝑛 = 8 to 𝑛 = 20 and then 𝑛 = 50. As we compare these simulations, pay attention to how increasing the sample size reduces the spread of the distribution, leading to less variability in the sample statistics. Additionally, observe how the shape of the distribution becomes more symmetric and unimodal as the sample size grows. These patterns illustrate key ideas about how large samples provide more precise and predictable estimates, with sampling distributions that become tighter and more symmetric and unimodal as sample size increases. (Handwritten note: the center stays the same.)

Key Idea 2: The variability (or spread) of the distribution of all possible values of the sample proportion (denoted by 𝜋̂) decreases as the sample size 𝑛 increases.
Note: The sample must be random. The variability can be anticipated with equations.

Key Idea 3: If the sample size is large enough, the shape of the distribution of all possible values of the sample proportion (denoted by 𝜋̂) is approximately normally distributed (unimodal, symmetric, and bell-shaped).
Note: “Large enough” varies from situation to situation.

Try It! Putting it all together
Work through the following two exercises to practice the key ideas from today’s lecture.
1. According to the University of Michigan website, 72% of undergraduate students live off campus.
Two students have concerns about this value, believe it is higher, and decide to conduct their own studies.
Student A: surveys a random sample of 50 UM undergraduate students and records whether each student lives off campus.
Student B: surveys a random sample of 200 UM undergraduate students and records whether each student lives off campus.
Use this information to make decisions about the relationship between Student A and B for each row in the table below. For each case, select the most appropriate statement from the following choices:
A. Student A’s quantity is greater.
B. Student B’s quantity is greater.
C. The quantities are the same.
D. The relationship cannot be determined without more information.

Student A | Student B | Statement
The median of the sampling distribution of 𝜋̂. | The median of the sampling distribution of 𝜋̂. | o A  o B  o C  o D
The standard deviation of the sampling distribution of the sample proportion, 𝜋̂. | The standard deviation of the sampling distribution of the sample proportion, 𝜋̂. | o A  o B  o C  o D
The percentile corresponding to an observed sample proportion value of 0.76 (𝜋̂ = 0.76). | The percentile corresponding to an observed sample proportion value of 0.76 (𝜋̂ = 0.76). | o A  o B  o C  o D

Lecture 03: Approximating the Sampling Distribution of 𝝅̂ Using a Normal Curve
□ Determine whether the assumptions are met such that a Normal curve can be applied as a model for a sampling distribution of 𝜋̂.
□ Calculate the likelihood of observing specific sample proportion values by applying the Normal approximation to the sampling distribution of 𝜋̂.

1 Normal Models for the Sampling Distribution of 𝝅̂
In recent lectures, we created many sampling distributions to describe how a statistic behaves over repeated sampling.

Sampling Distribution
A sampling distribution is the distribution of sample statistics computed for different samples of the same size from the same population. A sampling distribution shows us how the sample statistic varies from sample to sample.

1.1 Common Characteristics of Sampling Distributions
Review the sampling distributions and null distributions we used to describe how a statistic behaves over repeated sampling. What characteristics did they share?
Characteristic 1: They are centered at the population parameter.
Characteristic 2: They are unimodal, symmetric, and bell-shaped.

These observations represent a very common pattern in sampling distributions: a symmetric and unimodal curve known as the Normal distribution.

2 Normal Distributions
Consider each of the three normal distributions below, plots A, B, and C.
[Figure: three Normal curves, plots A, B, and C. Handwritten labels: N(50, 40), N(30, 8), N(100, 3).]

Basic Facts of the Normal Curve
The Normal distribution can be adjusted using two quantities: the mean and the standard deviation.
Changing the mean of the Normal curve shifts the curve left or right on the x-axis without changing its shape.
Changing the standard deviation of a Normal curve stretches or compresses the curve around its mean value.
The total area under the Normal curve is equal to ____.
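The basic facts above can be checked numerically. The following is a rough sketch using only base R; the particular means and standard deviations are illustrative values, not taken from plots A, B, and C.

```r
# Basic facts of the Normal curve, checked numerically with base R.
mu    <- 30   # illustrative mean
sigma <- 8    # illustrative standard deviation

# Total area under the curve: integrate the density over mean +/- 10 SDs,
# a range that captures essentially all of the area.
area <- integrate(dnorm, lower = mu - 10 * sigma, upper = mu + 10 * sigma,
                  mean = mu, sd = sigma)$value

# Changing the mean shifts the curve: the peak moves to the new mean.
x            <- seq(0, 200, by = 0.5)
peak_shifted <- x[which.max(dnorm(x, mean = 100, sd = sigma))]

# Changing the standard deviation stretches or compresses the curve:
# a smaller SD gives a taller, narrower peak at the same mean.
height_sd8 <- dnorm(mu, mean = mu, sd = 8)
height_sd3 <- dnorm(mu, mean = mu, sd = 3)

print(c(area = area, peak_shifted = peak_shifted))
print(c(height_sd8 = height_sd8, height_sd3 = height_sd3))
```

The numeric integral should come out at essentially 1 regardless of which mean and standard deviation are chosen.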
3 Approximating the Sampling Distribution of 𝝅̂ with a Normal Model
In recent lectures, we explored how the sampling distributions of 𝜋̂ are typically centered around a specific parameter and often take on a normal shape. These characteristics represent a very common pattern in sampling distributions that can often be predicted with a famous statistical theory known as the Central Limit Theorem.

3.1 The Central Limit Theorem
The Central Limit Theorem (CLT) has several versions, each one describing how the sampling distribution of a statistic will behave, so long as two assumptions are met:
1. The sample is collected at random from the population.
This means that the selection of one observation should not influence the selection of another. In practice, this is often achieved if the data are collected randomly or, in the context of a controlled experiment, if we randomly assign individuals to treatment groups.
2. The sample is sufficiently large.
We must gather a sufficiently large sample of data for the Central Limit Theorem to take effect. Just how large is large enough?
For a categorical variable, large enough is 𝑛𝜋 ≥ 10 and 𝑛(1 − 𝜋) ≥ 10.

If assumptions are met, by the CLT, the sampling distribution of many sample statistics can be well approximated by a Normal curve. The CLT is a truly amazing result that gives the Normal curve a central role in statistics and its applications. One of these is being able to approximate the sampling distribution of statistics such as the sample proportion (𝜋̂) with a Normal curve.
For this reason, the Central Limit Theorem is considered the most important theorem in statistics.

3.2 The Normal Distribution for Sampling Distributions of Proportions (𝝅̂)
Let’s return to the institution data set and the categorical variable Region, which records the region of the institution (West, Midwest, South, etc.). The plot below shows the sampling distribution for 𝜋̂ = the sample proportion of institutions that are in the Midwest for samples of size 𝑛 = 40.

Try It! Labeling the Distribution
Label the sampling distribution with the normal curve that best approximates it. (Handwritten note: 𝜋 = true population parameter.)
You are probably finding this challenging because, to label the distributions with a normal curve, we need ____!

Key Idea: The Normal Model for the Sampling Distribution of 𝝅̂
From the previous example, you see that we can use the normal distribution to describe the random behavior of 𝜋̂, a sample proportion representing a sample of size 𝑛 from a statistical model governed by the parameter 𝜋. The model is stated as

$$\hat{\pi} \sim N\left(\pi,\ \sqrt{\frac{\pi(1-\pi)}{n}}\right)$$

Note: This normal distribution only works if we can expect to see at least 10 successes and at least 10 failures in a random sample of size 𝑛. Typically, it is only responsible to apply the normal model to the sampling distribution of a sample proportion 𝜋̂ if the sample is large enough that you expect at least 10 successes and at least 10 failures. This is known as the ____, which is often represented as follows:
𝑛𝜋 ≥ 10 and 𝑛(1 − 𝜋) ≥ 10

Consider the following sampling distribution of 𝜋̂.
[Figure: simulated sampling distribution of 𝜋̂.]

Try It! Approximating the Sampling Distribution with a Normal Model
Use the results of the simulated sampling distribution above to answer the following questions.
a. Using the simulated sampling distribution of 𝜋̂, identify the following:
The sample size 𝑛: 80
The population proportion: 𝜋 = 0.50
b. Confirm that the sample size 𝑛 is large enough to approximate the sampling distribution of 𝜋̂ with a normal distribution.
Answer: 80(0.50) = 40 ≥ 10 and 80(1 − 0.50) = 40 ≥ 10.
c. Provide a complete Normal approximation of the sampling distribution of 𝜋̂.
Answer: 𝜋̂ ~ N(0.5, √(0.5(1 − 0.5)/80)) = N(0.5, 0.0559)
d. Based on the simulated sampling distribution, what fraction of sample proportions are 0.40 or less?
Answer: P(𝜋̂ ≤ 0.4) = 6/200 = 3%

3.3 Finding Proportions
We can use technology to find proportions and percentiles of a Normal curve. In Stats 250, we will use the R Shiny app or R, depending on the scenario:
1. Scenario 1: Find the proportion of data that falls within a certain range under the curve of a Normal distribution.
a. Option 1: R Shiny app supported by the University (https://shiny.stat.lsa.umich.edu/pvals).
b. Option 2: Use the pnorm function in R
> pnorm(cutoff, mean, standard deviation, lower.tail = TRUE or FALSE)
Example: What proportion of observations fall below 𝑧 = 1.8?
> pnorm(q=1.8, mean=0, sd=1, lower.tail = TRUE)
0.9640697

Revisiting the sampling distribution of 𝜋̂
Try It! Finding proportions
The simulated sampling distribution of 𝜋̂ can be approximated with the following normal curve:

$$\hat{\pi} \sim \text{Normal}\left(\pi = 0.5,\ \sqrt{\frac{\pi(1-\pi)}{n}} = 0.0559\right)$$

Use the normal approximation model and the pnorm function in R to calculate the following probabilities.
a. What is the likelihood of observing a sample proportion of 0.40 or less?
Answer: 0.03681475
b. What is the probability of observing a sample proportion of 0.55 or more?
Answer: 0.1855394
c. What is the probability of observing a sample proportion between 0.45 and 0.50?
Answer: 0.3144606
d. How many standard deviations is an observed sample proportion of 0.45 from the population proportion of 0.5?
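The four "Finding proportions" questions can be worked through with pnorm as follows. This is one possible sketch rather than the official solution; the variable names (`sd_pi_hat`, `p_a`, and so on) are made up for illustration, and the standard deviation is computed from 𝜋 = 0.5 and 𝑛 = 80 rather than typed in as the rounded value 0.0559, so the last few digits may differ slightly from values obtained with the rounded input.

```r
# Normal approximation for pi-hat when pi = 0.5 and n = 80 (values from the exercise).
pi_true   <- 0.5
n         <- 80
sd_pi_hat <- sqrt(pi_true * (1 - pi_true) / n)   # about 0.0559

# a. P(pi-hat <= 0.40): area to the left of 0.40
p_a <- pnorm(q = 0.40, mean = pi_true, sd = sd_pi_hat, lower.tail = TRUE)

# b. P(pi-hat >= 0.55): area to the right of 0.55
p_b <- pnorm(q = 0.55, mean = pi_true, sd = sd_pi_hat, lower.tail = FALSE)

# c. P(0.45 <= pi-hat <= 0.50): difference of two left-tail areas
p_c <- pnorm(0.50, mean = pi_true, sd = sd_pi_hat) -
       pnorm(0.45, mean = pi_true, sd = sd_pi_hat)

# d. Number of standard deviations between 0.45 and pi = 0.5 (a z-score)
z_d <- (0.45 - pi_true) / sd_pi_hat

print(round(c(sd = sd_pi_hat, a = p_a, b = p_b, c = p_c, z = z_d), 4))
```

The negative sign of the z-score in part (d) simply records that 0.45 lies below the population proportion.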
Try It! Voter turnout in the U.S.
About two-thirds (66%) of the voting-eligible population turned out for the 2020 presidential election. The value of 0.66 is the true rate for this population of interest (𝜋 = 0.66).
a. Suppose we plan to draw a sample of size 𝑛 = 50. Confirm that the sample size 𝑛 is large enough to approximate the sampling distribution of 𝜋̂ with a normal distribution.
b. Provide a complete Normal approximation of the sampling distribution of 𝜋̂.
c. How likely is it that we observe a sample proportion less than 0.70?
d. Without calculations, how would your answer to part (b) change if we took a random sample of 𝑛 = 30? Explain.

Lecture 04: Exploring Sampling Distributions: Sample Means
□ Visualize and interpret sampling distributions: simulate and plot the sampling distributions of sample means.
□ Analyze the shape, center, and variability of sampling distributions for sample means.
□ Identify how increasing or decreasing sample size may affect key features of the sampling distribution such as shape, center, and variability.
□ Identify the conditions for approximating the sampling distribution of 𝜇̂ with a Normal model. Use the normal approximation to find probabilities of observing certain outcomes.

1 Sampling Distribution of Sample Means for a Quantitative Variable
We have explored the center, variability, and shape of sampling distributions using categorical data with a binary response. Now, let’s shift our focus to quantitative data. The big difference here is that, with quantitative data, we can visualize the distribution of the sample itself, not just the sample statistic.
Using the institution population data set, we create a histogram for the quantitative variable first and use the population distribution as a reference point. Then, as we generate sampling distributions through simulation studies, we can see how the sample means (denoted by 𝜇̂) reflect the population distribution.
We will start by exploring population distributions that are unimodal and symmetric, which provide a straightforward case. From there, we will move on to distributions that are not symmetric or unimodal to see how these more complex shapes influence the sampling distribution. By comparing how sample means and distributions behave across different population shapes, you will gain a deeper understanding of how the population’s characteristics affect the resulting sampling distribution.

1.1 Population distribution is unimodal and symmetric
We will be using the institution data set, but first, we will subset it to include only the 573 public 4-year institutions. Notice that our population of interest is now all public 4-year institutions. Our focus will be on the quantitative variable Tuition_out (out-of-state tuition).
> quantile(public$tuition_out)
   0%   25%   50%   75%  100%
  480 16152 19320 24308 47476
> mean(public$tuition_out)
20520.13
> sd(public$tuition_out)
7598.33
(Handwritten note: these values are parameters; the mean is 𝜇 and the standard deviation is 𝜎.)

Try It! Population distribution
a. Identify the observational units.
Answer: The population of all public colleges/universities in 2019.
b. How would you describe the shape of the distribution of out-of-state tuition for the population of public 4-year institutions in the U.S.?
Answer: The population of out-of-state tuition costs is unimodal, symmetric, and bell-shaped, with a mean of approximately 20,520 and a standard deviation of approximately 7,598.33.
c. Is the given mean value of $20,520.13 a parameter or a statistic? Provide the appropriate symbol.
Answer: Parameter, 𝜇.

In the next activity, you will use population data consisting of all public 4-year institutions to simulate taking many random samples of 10 institutions and visualizing the variability of the sample mean.

Simulation Study
A simulation study involving a quantitative variable consists of the following steps:
1. Take a random sample of size 𝑛 from the population of interest.
2. Summarize the sample results by calculating the sample mean (𝜇̂1).
3. Plot the computed sample mean on a number line by marking it with a dot. Note: This will serve as the foundation for building your sampling distribution.
4. Repeat steps 1-3 many, many, many times (𝜇̂2, 𝜇̂3, 𝜇̂4, …, 𝜇̂∞).
Let’s illustrate a few random samples to get a sense of the simulation results. We are going to follow the steps of the simulation study and manually record and visualize our results.

1.1.1 Simulation Study - Steps 1 and 2
Using R via the Stats 250 Posit.Cloud workspace, we will collect a random sample of 10 public 4-year institutions in the U.S., record their out-of-state tuition, and compute the sample mean, denoted by the symbol 𝜇̂.
To access Lecture04 in Posit.Cloud and actively engage in the simulation study (available to students who have completed Lab00):
1. Head to: https://rstudio.cloud/
2. Click “Log in with Google” and sign in using your UM credentials.
3. Click the Start button next to the lec04 project.
4. Click the lec04_activity.Rmd in the bottom right corner to open the R Markdown file.
To begin, run the “readData” code chunk to load the data set. Once the data is stored, you can take a random sample of 10 institutions and record the out-of-state tuition of each institution by running the “tuition_sample” code chunk. The output will list the out-of-state tuition costs of the 10 institutions randomly selected along with the computed sample mean.
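The four steps of the simulation study can be sketched in base R as shown below. This is not the contents of the “tuition_sample” code chunk, which is not reproduced in these notes. Because the `public` data frame is not available here, the sketch assumes a hypothetical stand-in population of 573 roughly Normal tuition values built from the mean and standard deviation reported above; the seed, the number of repetitions, and the names `population`, `one_sample_mean`, and `mu_hats` are illustrative.

```r
# Assumption: the real `public` data frame is not available here, so we build a
# hypothetical stand-in population of 573 tuition values that is roughly Normal
# with the mean and SD reported in the notes.
set.seed(250)                                   # arbitrary seed for reproducibility
population <- rnorm(573, mean = 20520.13, sd = 7598.33)

n    <- 10      # sample size used in the activity
reps <- 5000    # "many, many times" (illustrative)

# Steps 1-2: take one random sample of size n and compute its sample mean.
one_sample_mean <- function() mean(sample(population, size = n))

# Steps 3-4: repeat and collect the sample means (the simulated sampling distribution).
mu_hats <- replicate(reps, one_sample_mean())

print(c(pop_mean = mean(population),
        center   = mean(mu_hats),
        spread   = sd(mu_hats)))
# hist(mu_hats)  # uncomment to view the simulated sampling distribution
```

Comparing `center` with `pop_mean`, and `spread` with the population standard deviation, gives a first look at how sample means behave relative to the population they come from.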
1.1.2 Simulation Study - Steps 3 and 4
Now that you have computed your first sample mean, let’s see what happens when we perform the sampling process multiple times.

Try It! Results of a simulation study
1. Rerun the “tuition_sample” code chunk: Use the “tuition_sample” code chunk to take additional random samples of 10 institutions. Each time you run the simulation, record the sample mean out-of-state tuition of the institutions randomly selected.
2. Record and plot your results: For each random sample, mark the sample mean on the provided plot below with a dot. Repeat this process several times until the plot begins to show the overall distribution of sample means. Ideally, you will want to collect enough samples to see a clear pattern emerge.

Simulation Results (handwritten sample means; the entry for sample 6 is partly illegible in the source)
Sample | Sample mean
1 | 24823.5
2 | 17732.1
3 | 22903.5
4 | 20327.5
5 | 22813.7
6 | 21?27
7 | 19495.4
8 | 27421.2
9 | 17719.6
10 | 20419.4
11 | 23498.4
12 | 21413.5
13 | 22761.5

As you and your classmates actively work through the simulation study, a distribution of sample means will take shape. Reflect on the following questions as you observe the emerging pattern.
a. What are the observational units in the simulated sampling distribution?
Answer: Statistics, specifically sample averages (𝜇̂).
b. What does each dot in the plot represent?
Answer: The observed mean out-of-state tuition among a random sample of 𝑛 = 10 public colleges/universities.
c. How does the center of the sampling distribution of sample means compare to the population mean?
Answer: Population: centered at 𝜇. Sampling distribution: centered at 𝜇.
d. How does the shape of the sampling distribution for means compare to the original population distribution?
Answer: Population: bell-shaped, unimodal, symmetric. Sampling distribution: ____
Challenge: What do you think the sampling distribution will look like if the sample size is 1?

If you completed the simulation study, your sampling distribution should look similar to the image below.
If you were not able to follow along, use the sampling distribution provided below to answer the “Try It” questions above.

1.1.3 The effect of sample size
Let’s continue to explore the sampling distribution of sample means while varying the sample size 𝑛.

Key Idea: The sample size 𝑛 influences the behavior of a sample statistic. Specifically, as the sample size 𝑛 increases, the variability of a statistic declines.
This is the same as saying statistics produced from larger samples tend to be more precise estimators of population parameters.

1.2 Population distribution is not unimodal and symmetric
We are returning to the full institution data set, where our population of interest is all institutions in the U.S. Our focus will now be on the quantitative variable Enrollment.
> mean(institution$Enrollment)
4647.319
> median(institution$Enrollment)
1830

Try It! Population distribution – Enrollment
a. Identify the observational units.
Answer: Colleges/universities in the U.S. in 2019.
b. How would you describe the shape of the distribution of enrollment for the population of all institutions in the U.S.?
Answer: Unimodal but extremely right-skewed.
c. Is the given mean value of 4,647.319 students a parameter or a statistic? Provide the appropriate symbol.

1.2.1 The effect of sample size
Since we have already gone through the detailed steps of generating a sampling distribution, first with proportions and then with means, we are going to skip the process this time. Instead, we will jump straight to the simulated sampling distribution and focus on drawing conclusions.

Try It! Population distribution influences the sampling distribution
Consider the plots above to answer the following questions.
a. As the sample size increases from 𝑛 = 5 to 𝑛 = 30, what do you notice about the shape of the sampling distribution, even though the population distribution is heavily right-skewed?
Answer: At small sample sizes, the sampling distribution of 𝜇̂ is also right-skewed; the severity of the skew tends to decline as the sample size increases.
b. Based on what you have observed, how confident would you be in using a sample mean to estimate the population mean? What conditions make you more confident in the reliability of your estimate?
c. Which of the following distributions will be approximately unimodal and symmetric if a sample of 100 institutions is collected?
o The distribution of enrollments for all institutions in the U.S.
o The distribution of enrollments for the sample of 100 institutions in the U.S.
o The sampling distribution of the sample mean enrollment for repeated random samples of 100 institutions in the U.S.

1.3 The sampling distribution of 𝝁̂
Now that we have explored how the shape of the population distribution and sample size influence the sampling distribution of the sample mean in Sections 1.1 and 1.2 through simulation studies, it is time to summarize what we have learned. We will focus on three key ideas that capture how the center, spread, and shape of sampling distributions are affected by these factors. These conclusions will help us solidify our understanding of the critical attributes of sampling distributions as we move forward.

Key Idea 1: The center of the distribution of all possible values of the sample mean (denoted by 𝜇̂) is the value of the population mean, 𝜇.
Note: Samples must be drawn at random.

Key Idea 2: The variability (or spread) of the distribution of all possible values of the sample mean (denoted by 𝜇̂) decreases as the sample size 𝑛 increases.
Note: Governed by a specific formula.

Key Idea 3: For the shape of the sampling distribution, we have two results to consider.
Result 1: If the population is unimodal and symmetric, the shape of the distribution of all possible values of the sample mean (denoted by 𝜇̂) is approximately normal.
Note: ____
Result 2: If the population is not unimodal and symmetric, the shape of the distribution of all possible values of the sample mean (denoted by 𝜇̂) is still approximately normally distributed, provided your sample is large enough.
Note: 𝑛 ≥ 25

2 The Normal Distribution for Sampling Distributions of Means (𝝁̂)
Let’s return to the institution data set and the quantitative variables Tuition_out, which records the out-of-state tuition, and Enrollment.

Example 1: Out-of-State Tuition
The plots below show the sampling distribution for 𝜇̂ = the sample mean out-of-state tuition for samples of size 𝑛 = 5, 25, and 80.
Recall the distribution of out-of-state tuition among all public 4-year institutions in the US is unimodal and symmetric.

Example 2: Enrollment
The plots below show the sampling distribution for 𝜇̂ = the sample mean enrollment for samples of size 𝑛 = 25, 40, and 100. Recall the distribution of enrollment among all institutions in the US is heavily skewed to the right.

If you were asked to label each of these sampling distributions with the appropriate normal curve that best approximates them, you would likely find it challenging. That’s because, in order to label the distributions correctly, we need to account for the variability in the means. As we learned from our exploration of sampling distributions, the variability of the sample means decreases as the sample size increases. Therefore, knowing the sample sizes is crucial for determining the appropriate labels.

Key Idea: The Normal Model for the Sampling Distribution of 𝝁̂
From the previous example, you see that we can use the normal distribution to describe the random behavior of 𝜇̂, a sample mean representing a sample of size 𝑛 from a statistical model governed by the parameter 𝜇. The model is stated as

$$\hat{\mu} \sim \text{Normal}\left(\mu,\ \frac{\sigma}{\sqrt{n}}\right)$$

Notes:
Result 1: The normal model will accurately describe the behavior of 𝜇̂ when the population from which the sample is drawn is reasonably symmetric and unimodal.
Result 2: The CLT, in fact, guarantees the normal model will appropriately describe the behavior of 𝜇̂ as the sample size 𝑛 increases, even when the random process that generates the sample is skewed. In general, the more skewed the random process is, the larger the sample size required for the CLT to "take effect" and for the normal curve to model the sampling distribution appropriately.
Rule of thumb: 𝑛 ≥ 25

Try It! Poodle Dogs
The weights of all standard poodle dogs have a Normal distribution with a mean of 46 pounds and a standard deviation of 7 pounds.
a. Suppose we plan to take a random sample of 𝑛 = 4 poodle dogs. Provide a complete sampling distribution of 𝜇̂ when 𝑛 = 4.
Answer (apply Result 1): 𝜇̂ ~ N(46, 7/√4) = N(46, 3.5)
b. How likely is it that, in a random sample of 4 poodle dogs, we observe a sample mean weight that is greater than 50.9 pounds?
c. Change sample size (without calculations): How would your answer to (b) change if we took a random sample of 𝑛 = 20 poodle dogs? Explain.
d. Change population standard deviation (without calculations): How would your answer to (b) change if the standard deviation is 15 pounds (𝜎 = 15)? Explain.
e. Change population mean (without calculations): How would your answer to (b) change if 𝜇 = 45 pounds? Explain.

Lecture 05: Simulation-based hypothesis tests
□ Specify hypotheses based on a question of interest, defining relevant parameters.
□ Recognize that a simulated null distribution shows what is likely to happen by random chance if the null hypothesis is true.
□ Interpret the 𝑝-value as the proportion of simulated samples that would give a statistic as extreme as or more extreme than the observed sample, if the null hypothesis is true.
□ Estimate a 𝑝-value using a simulated null distribution.
□ Recognize that a 𝑝-value quantifies how much evidence there is against the null hypothesis and in support of the alternative hypothesis.
1 Statistical Inference
In the previous lectures, we explored core ideas regarding the variability of sample statistics representing repeated random samples of the same size from a population of interest (Lecture 03) and the normal model that, under certain conditions, adequately describes their sampling distribution (Lecture 04). In Lecture 05, we introduce the idea of statistical inference and, specifically, a class of procedures broadly known as hypothesis tests. In the passages that follow, we’ll introduce the four fundamental steps of hypothesis testing that will serve as the basis for many future lectures in the course. Before we dive into hypothesis testing, however, let’s review key ideas from recent lectures and how they connect to statistical inference.

1.1 Key ideas of sampling distributions
Let’s quickly recap some ideas from our earlier discussions of sampling distributions. The plot below corresponds to a sampling distribution of 𝜋̂ for repeated random samples of size 𝑛 = 26 from a broader population.
a. What does each dot represent in the dot plot above?
b. Based on the dot plot above, approximately what value is the center of this distribution? Is the center of this distribution considered a parameter or a statistic?

Recap of key ideas regarding sampling distributions
Statistics such as 𝜋̂ or 𝜇̂, which summarize random samples drawn from a population of interest, are random quantities that vary from sample to sample. While we cannot predict the value a statistic will take on for any given random sample, their long-run behavior over many repetitions is quite predictable under certain conditions.

The sampling distribution of 𝝅̂. Suppose the rate at which a population of interest takes on a particular categorical outcome is described by the parameter 𝜋.
If you draw a random sample of size 𝑛 from a population of interest and compute the observed statistic, 𝜋̂, and both 𝑛𝜋 ≥ 10 and 𝑛(1 − 𝜋) ≥ 10, then the sampling distribution of 𝜋̂ is

$$\hat{\pi} \sim N\left(\pi,\ \sqrt{\frac{\pi(1-\pi)}{n}}\right)$$

The sampling distribution of 𝝁̂. Suppose the mean value a population of interest takes on a particular quantitative outcome is described by the parameter 𝜇 and that the population varies around this true mean with standard deviation 𝜎. If you draw a random sample of size 𝑛 from a population of interest and compute the observed statistic, 𝜇̂, and either the population is normally distributed or 𝑛 ≥ 25, then the sampling distribution of 𝜇̂ is

$$\hat{\mu} \sim N\left(\mu,\ \frac{\sigma}{\sqrt{n}}\right)$$

1.2 What is statistical inference?
Over the past few lectures, we have emphasized the long-term random behavior of sample statistics (like 𝜋̂ or 𝜇̂) in contexts where the parameters that specify the exact nature of their random behavior are known. As you might imagine, it is more common to compute statistics of samples that are drawn from populations with unknown parameter values. That is to say, we don’t know the true values of 𝜋 when studying a categorical phenomenon or the true values of 𝜇 and 𝜎 when studying a quantitative phenomenon. In these instances, we can employ methods of statistical inference to help understand, with some level of uncertainty, the values of these unknown parameters.
We use sample data and sample statistics to find out about the world around us. How the data were collected matters: properly collected data can give us insight, and poorly collected data tell us nothing useful (and may be misleading). Let’s continue our exploration of data by talking about where the data come from and how to think about whether we can make causal and generalizable claims based on the study design.
Statistical inference refers to methods that use observed sample statistics to learn more about unknown population parameters with some level of uncertainty.
There are two main types of statistical inference we explore in STATS 250: testing methods and estimating methods. In the next few lectures, we will discuss the basics of how to use sample data to make a generalization to the population at large, also known as statistical inference. We will see how to use hypothesis testing to answer questions such as:

- Do people subconsciously apply facial prototypes when they encounter different names?
- Among players who are playing rock-paper-scissors for the first time, is there a tendency to throw scissors less often than random chance?
- Are a majority of STATS 250 students attending the Ann Arbor Art Fair?

In every case, we use data from a sample to answer a specific question about unknown parameters. We will develop a framework to analyze sample data to answer questions of interest.

1.2.1 Sample Statistics as Point Estimates

In previous lectures, we explored a model for describing the sample-to-sample variability of a statistic. One of our key findings is that the average of all the possible statistic values is equal to the parameter value. This finding is always true, so long as the sample is drawn at random from the population of interest. This result is super useful because we can refer to the statistic as an unbiased estimator of our unknown population parameter. For example, we use the sample mean to estimate the population mean and the sample proportion to estimate the population proportion. These estimates of our unknown population parameters are called point estimates.

The point estimates we have encountered so far:

Point estimate                      Parameter
Sample mean, 𝜇̂                      Population mean, 𝜇
Sample standard deviation, 𝜎̂        Population standard deviation, 𝜎
Sample proportion, 𝜋̂                Population proportion, 𝜋

In the next few lectures, we will focus on the sample proportion 𝜋̂, the point estimate of the population proportion 𝜋.
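The idea that a statistic is an unbiased estimator can be illustrated with a short simulation. The sketch below builds a hypothetical population (uniform values between 0 and 100, invented for this example), computes point estimates from one random sample, and then averages 𝜇̂ over many random samples to see that the average lands close to 𝜇. The population, the sample size of 40, and the variable names are all assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(250)

# Hypothetical population of 5,000 quantitative values (not data from the lecture).
population = [rng.uniform(0, 100) for _ in range(5_000)]
mu = statistics.mean(population)  # the parameter; in practice this is unknown

# One random sample gives one set of point estimates.
sample = rng.sample(population, 40)
mu_hat = statistics.mean(sample)      # point estimate of mu
sigma_hat = statistics.stdev(sample)  # point estimate of sigma

# Many random samples: the average of the mu-hat values should be close to mu.
estimates = [statistics.mean(rng.sample(population, 40)) for _ in range(5_000)]
average_estimate = statistics.mean(estimates)

print(f"population mean (mu)         = {mu:.2f}")
print(f"one sample: mu-hat           = {mu_hat:.2f}, sigma-hat = {sigma_hat:.2f}")
print(f"average of 5,000 mu-hat values = {average_estimate:.2f}")
```

Any single 𝜇̂ may miss 𝜇 by a noticeable amount; it is the average over repeated random samples that matches the parameter.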
We will start our statistical inference journey by looking at inference methods where we use a point estimate 𝜋̂ to perform a hypothesis test about an unknown parameter 𝜋.

1.3 Overview of Testing

To get a sense of the logic of a hypothesis test, let’s draw a parallel to the U.S. Court System. We have two competing theories about an unknown truth. To decide between the two theories, we will gather data, and based on the data, we will decide which theory seems more reasonable. When testing theories, we use the approach highlighted in the example below.

Example: Hypothesis Testing in the U.S. Court System

1. Research question and hypotheses: Is the defendant guilty of a crime?
   Theory 1: The defendant is innocent
   Theory 2: The defendant is guilty
2. Collect data: Detectives investigate the crime
3. Analyze data: Prosecution and defense present the result of the investigations in court
4. Evaluate the evidence: A jury deliberates about whether the prosecution has provided evidence that calls into question the innocence of the defendant.

In practice, juries and judges must determine whether there is convincing evidence to conclude that the defendant is guilty. When there is convincing evidence, they find the defendant “guilty.” When there is not convincing evidence, they find the defendant “not guilty.”

Note: A verdict of “not guilty” does not mean that the defendant is innocent; rather, it means that there was not enough evidence to convince the jury or judge that the defendant is guilty.

1.3.1 Formulating Hypothesis Statements: The Null and Alternative Hypotheses

We will be using data to help us decide between two competing claims about an unknown population parameter. We refer to the competing claims about the population parameter as the null hypothesis and the alternative hypothesis.

The null hypothesis (often denoted by 𝐻0) is a “nothing of interest” statement.
It is the claim that any differences we see in the sample results when compared to the status quo are due to chance alone, that is, due to naturally occurring variability.

The alternative hypothesis (often denoted by 𝐻𝑎) is a “something of interest” statement. It is the claim that any difference between the sample results and the status quo is unlikely to be a result of natural sample-to-sample variability and is not due to chance alone.

Note: It is not possible for both the null and alternative hypotheses to be true at the same time.

To illustrate stating competing theories, let’s work through the following example.

1.3.2 Example: Facial Stereotypes of Male Names

Do people subconsciously apply facial prototypes when they encounter different names? To explore this idea, we will present two side-by-side faces to subjects, without context, and ask them to choose which is named Bob and which is named Tim. Let’s convert this idea into two competing claims.

Null hypothesis (𝑯𝟎): The answer is “no”. Subjects randomly assign a name to an image.

Alternative hypothesis (𝑯𝒂): The answer is “yes”. Subjects have some facial stereotypes of male names and do not randomly assign a name to an image.

Notice that the null hypothesis is a claim of no difference, no effect. However, the alternative hypothesis is a statement claiming a difference, an effect.

Think About It! If we say that there are no facial stereotypes of male names, that is, the answer is “no,” what proportion of subjects should we expect to assign the name Bob to the image on the right?

We can rewrite the null and alternative hypotheses by defining a parameter and stating the theories in terms of the unknown population parameter. Let the unknown parameter 𝜋 represent the population proportion.
Null hypothesis (𝑯𝟎): 𝜋 = 0.50
Alternative hypothesis (𝑯𝒂): 𝜋 > 0.50

Here the parameter 𝜋 represents the population proportion of adults who assign the name Bob to the image on the right. The two statements are about the population parameter and are competing ideas; both can’t be right.

Notice that we stated the null and alternative hypotheses before conducting the study, that is, before any data are gathered. The reason why we need to develop a method for testing theories is that the value of the parameter is unknown. Hence, we will perform a hypothesis test by collecting sample data and examining whether the sample data provide enough evidence against the null hypothesis and in support of the alternative hypothesis.

Stating the Null and Alternative Hypotheses

𝑯𝟎 and 𝑯𝒂 are competing claims about the unknown value of a population parameter. In a test of competing hypotheses, 𝐻0 is always by default assumed to be true and the goal of the testing procedure is to generate evidence against it in favor of 𝐻𝑎. In general, the null hypothesis 𝑯𝟎 is a statement of equality (=), while the alternative hypothesis uses notation indicating greater than (>),