OpenIntro Statistics PDF Fourth Edition

OpenIntro Statistics Fourth Edition David Diez Data Scientist OpenIntro Mine Çetinkaya-Rundel Associate Professor of the Practice, Duke Uni...

OpenIntro Statistics Fourth Edition David Diez Data Scientist OpenIntro Mine Çetinkaya-Rundel Associate Professor of the Practice, Duke University Professional Educator, RStudio Christopher D Barr Investment Analyst Varadero Capital Editions 1, 2, and 3 can be found in the book's extra files, which also include tablet-friendly versions of the latest edition. Copyright © 2019. Fourth Edition. Updated: April 12th, 2022. This book may be downloaded as a free PDF at openintro.org/os. This textbook is also available under a Creative Commons license, with the source files hosted on Github. 3 Table of Contents 1 Introduction to data 7 1.1 Case study: using stents to prevent strokes....................... 9 1.2 Data basics......................................... 12 1.3 Sampling principles and strategies............................ 22 1.4 Experiments......................................... 32 2 Summarizing data 39 2.1 Examining numerical data................................. 41 2.2 Considering categorical data................................ 61 2.3 Case study: malaria vaccine................................ 71 3 Probability 79 3.1 Defining probability.................................... 81 3.2 Conditional probability.................................. 95 3.3 Sampling from a small population............................ 112 3.4 Random variables...................................... 115 3.5 Continuous distributions.................................. 125 4 Distributions of random variables 131 4.1 Normal distribution.................................... 133 4.2 Geometric distribution................................... 144 4.3 Binomial distribution.................................... 149 4.4 Negative binomial distribution.............................. 158 4.5 Poisson distribution.................................... 163 5 Foundations for inference 168 5.1 Point estimates and sampling variability......................... 170 5.2 Confidence intervals for a proportion........................... 181 5.3 Hypothesis testing for a proportion............................ 189 6 Inference for categorical data 206 6.1 Inference for a single proportion.............................. 208 6.2 Difference of two proportions............................... 217 6.3 Testing for goodness of fit using chi-square........................ 229 6.4 Testing for independence in two-way tables....................... 240 7 Inference for numerical data 249 7.1 One-sample means with the t-distribution........................ 251 7.2 Paired data......................................... 262 7.3 Difference of two means.................................. 267 7.4 Power calculations for a difference of means....................... 278 7.5 Comparing many means with ANOVA.......................... 285 4 TABLE OF CONTENTS 8 Introduction to linear regression 303 8.1 Fitting a line, residuals, and correlation......................... 305 8.2 Least squares regression.................................. 317 8.3 Types of outliers in linear regression........................... 328 8.4 Inference for linear regression............................... 331 9 Multiple and logistic regression 341 9.1 Introduction to multiple regression............................ 343 9.2 Model selection....................................... 353 9.3 Checking model conditions using graphs......................... 358 9.4 Multiple regression case study: Mario Kart....................... 365 9.5 Introduction to logistic regression............................. 371 A Exercise solutions 384 B Data sets within the text 403 C Distribution tables 408 5 Preface OpenIntro Statistics covers a first course in statistics, providing a rigorous introduction to applied statistics that is clear, concise, and accessible. This book was written with the undergraduate level in mind, but it’s also popular in high schools and graduate courses. We hope readers will take away three ideas from this book in addition to forming a foundation of statistical thinking and methods. Statistics is an applied field with a wide range of practical applications. You don’t have to be a math guru to learn from real, interesting data. Data are messy, and statistical tools are imperfect. But, when you understand the strengths and weaknesses of these tools, you can use them to learn about the world. Textbook overview The chapters of this book are as follows: 1. Introduction to data. Data structures, variables, and basic data collection techniques. 2. Summarizing data. Data summaries, graphics, and a teaser of inference using randomization. 3. Probability. Basic principles of probability. 4. Distributions of random variables. The normal model and other key distributions. 5. Foundations for inference. General ideas for statistical inference in the context of estimating the population proportion. 6. Inference for categorical data. Inference for proportions and tables using the normal and chi-square distributions. 7. Inference for numerical data. Inference for one or two sample means using the t-distribution, statistical power for comparing two groups, and also comparisons of many means using ANOVA. 8. Introduction to linear regression. Regression for a numerical outcome with one predictor variable. Most of this chapter could be covered after Chapter 1. 9. Multiple and logistic regression. Regression for numerical and categorical data using many predictors. OpenIntro Statistics supports flexibility in choosing and ordering topics. If the main goal is to reach multiple regression (Chapter 9) as quickly as possible, then the following are the ideal prerequisites: Chapter 1, Sections 2.1, and Section 2.2 for a solid introduction to data structures and statis- tical summaries that are used throughout the book. Section 4.1 for a solid understanding of the normal distribution. Chapter 5 to establish the core set of inference tools. Section 7.1 to give a foundation for the t-distribution Chapter 8 for establishing ideas and principles for single predictor regression. 6 TABLE OF CONTENTS Examples and exercises Examples are provided to establish an understanding of how to apply methods EXAMPLE 0.1 This is an example. When a question is asked here, where can the answer be found? The answer can be found here, in the solution section of the example! When we think the reader should be ready to try determining the solution to an example, we frame it as Guided Practice. GUIDED PRACTICE 0.2 The reader may check or learn the answer to any Guided Practice problem by reviewing the full solution in a footnote.1 Exercises are also provided at the end of each section as well as review exercises at the end of each chapter. Solutions are given for odd-numbered exercises in Appendix A. Additional resources Video overviews, slides, statistical software labs, data sets used in the textbook, and much more are readily available at openintro.org/os We also have improved the ability to access data in this book through the addition of Appendix B, which provides additional information for each of the data sets used in the main text and is new in the Fourth Edition. Online guides to each of these data sets are also provided at openintro.org/data and through a companion R package. We appreciate all feedback as well as reports of any typos through the website. A short-link to report a new typo or review known typos is openintro.org/os/typos. For those focused on statistics at the high school level, consider Advanced High School Statistics, which is a version of OpenIntro Statistics that has been heavily customized by Leah Dorazio for high school courses and AP® Statistics. Acknowledgements This project would not be possible without the passion and dedication of many more people beyond those on the author list. The authors would like to thank the OpenIntro Staff for their involvement and ongoing contributions. We are also very grateful to the hundreds of students and instructors who have provided us with valuable feedback since we first started posting book content in 2009. We also want to thank the many teachers who helped review this edition, including Laura Acion, Matthew E. Aiello-Lammens, Jonathan Akin, Stacey C. Behrensmeyer, Juan Gomez, Jo Hardin, Nicholas Horton, Danish Khan, Peter H.M. Klaren, Jesse Mostipak, Jon C. New, Mario Orsi, Steve Phelps, and David Rockoff. We appreciate all of their feedback, which helped us tune the text in significant ways and greatly improved this book. 1 Guided Practice problems are intended to stretch your thinking, and you can check yourself by reviewing the footnote solution for any Guided Practice. 7 Chapter 1 Introduction to data 1.1 Case study: using stents to prevent strokes 1.2 Data basics 1.3 Sampling principles and strategies 1.4 Experiments 8 Scientists seek to answer questions using rigorous methods and careful observations. These observations – collected from the likes of field notes, surveys, and experiments – form the backbone of a statistical investigation and are called data. Statistics is the study of how best to collect, analyze, and draw conclusions from data, and in this first chapter, we focus on both the properties of data and on the collection of data. For videos, slides, and other resources, please visit www.openintro.org/os 1.1. CASE STUDY: USING STENTS TO PREVENT STROKES 9 1.1 Case study: using stents to prevent strokes Section 1.1 introduces a classic challenge in statistics: evaluating the efficacy of a medical treatment. Terms in this section, and indeed much of this chapter, will all be revisited later in the text. The plan for now is simply to get a sense of the role statistics can play in practice. In this section we will consider an experiment that studies effectiveness of stents in treating patients at risk of stroke. Stents are devices put inside blood vessels that assist in patient recovery after cardiac events and reduce the risk of an additional heart attack or death. Many doctors have hoped that there would be similar benefits for patients at risk of stroke. We start by writing the principal question the researchers hope to answer: Does the use of stents reduce the risk of stroke? The researchers who asked this question conducted an experiment with 451 at-risk patients. Each volunteer patient was randomly assigned to one of two groups: Treatment group. Patients in the treatment group received a stent and medical manage- ment. The medical management included medications, management of risk factors, and help in lifestyle modification. Control group. Patients in the control group received the same medical management as the treatment group, but they did not receive stents. Researchers randomly assigned 224 patients to the treatment group and 227 to the control group. In this study, the control group provides a reference point against which we can measure the medical impact of stents in the treatment group. Researchers studied the effect of stents at two time points: 30 days after enrollment and 365 days after enrollment. The results of 5 patients are summarized in Figure 1.1. Patient outcomes are recorded as “stroke” or “no event”, representing whether or not the patient had a stroke at the end of a time period. Patient group 0-30 days 0-365 days 1 treatment no event no event 2 treatment stroke stroke 3 treatment no event no event......... 450 control no event no event 451 control no event no event Figure 1.1: Results for five patients from the stent study. Considering data from each patient individually would be a long, cumbersome path towards answering the original research question. Instead, performing a statistical data analysis allows us to consider all of the data at once. Figure 1.2 summarizes the raw data in a more helpful way. In this table, we can quickly see what happened over the entire study. For instance, to identify the number of patients in the treatment group who had a stroke within 30 days, we look on the left-side of the table at the intersection of the treatment and stroke: 33. 0-30 days 0-365 days stroke no event stroke no event treatment 33 191 45 179 control 13 214 28 199 Total 46 405 73 378 Figure 1.2: Descriptive statistics for the stent study. 10 CHAPTER 1. INTRODUCTION TO DATA GUIDED PRACTICE 1.1 Of the 224 patients in the treatment group, 45 had a stroke by the end of the first year. Using these two numbers, compute the proportion of patients in the treatment group who had a stroke by the end of their first year. (Please note: answers to all Guided Practice exercises are provided using footnotes.)1 We can compute summary statistics from the table. A summary statistic is a single number summarizing a large amount of data. For instance, the primary results of the study after 1 year could be described by two summary statistics: the proportion of people who had a stroke in the treatment and control groups. Proportion who had a stroke in the treatment (stent) group: 45/224 = 0.20 = 20%. Proportion who had a stroke in the control group: 28/227 = 0.12 = 12%. These two summary statistics are useful in looking for differences in the groups, and we are in for a surprise: an additional 8% of patients in the treatment group had a stroke! This is important for two reasons. First, it is contrary to what doctors expected, which was that stents would reduce the rate of strokes. Second, it leads to a statistical question: do the data show a “real” difference between the groups? This second question is subtle. Suppose you flip a coin 100 times. While the chance a coin lands heads in any given coin flip is 50%, we probably won’t observe exactly 50 heads. This type of fluctuation is part of almost any type of data generating process. It is possible that the 8% difference in the stent study is due to this natural variation. However, the larger the difference we observe (for a particular sample size), the less believable it is that the difference is due to chance. So what we are really asking is the following: is the difference so large that we should reject the notion that it was due to chance? While we don’t yet have our statistical tools to fully address this question on our own, we can comprehend the conclusions of the published analysis: there was compelling evidence of harm by stents in this study of stroke patients. Be careful: Do not generalize the results of this study to all patients and all stents. This study looked at patients with very specific characteristics who volunteered to be a part of this study and who may not be representative of all stroke patients. In addition, there are many types of stents and this study only considered the self-expanding Wingspan stent (Boston Scientific). However, this study does leave us with an important lesson: we should keep our eyes open for surprises. 1 The proportion of the 224 patients who had a stroke within 365 days: 45/224 = 0.20. 1.1. CASE STUDY: USING STENTS TO PREVENT STROKES 11 Exercises 1.1 Migraine and acupuncture, Part I. A migraine is a particularly painful type of headache, which patients sometimes wish to treat with acupuncture. To determine whether acupuncture relieves migraine pain, researchers conducted a randomized controlled study where 89 females diagnosed with migraine headaches were randomly assigned to one of two groups: treatment or control. 43 patients in the treatment group received acupuncture that is specifically designed to treat migraines. 46 patients in the control group received placebo acupuncture (needle insertion at non-acupoint locations). 24 hours after patients received S174 Neurol Sci (2011) 32 (Suppl 1):S173–S175 acupuncture, they were asked if they were pain free. Results are summarized in the contingency table below.2 identified on the antero-internal part of the antitragus, the Fig. 1 The appropriate area (M) versus the inappropriate anterior part of the lobe and the upper auricular concha, on area (S) used in the treatment the same side of pain. The majority of these points were of migraine attacks Figure from the original pa- effective very rapidly (within 1 min), while the remaining Pain free points produced a slower antalgic response, between 2 and per displaying the appropri- Yes No Total 5 min. The insertion of a semi-permanent needle in these ate area (M) versus the in- zones allowed stable control Group of the Treatment migraine pain, which10 33 43 appropriate area (S) used in Control occurred within 30 min and still persisted 24 h later. 2 44 46 Since the most active site in controlling migraine pain the treatment of migraine at- Total 12 77 89 was the antero-internal part of the antitragus, the aim of tacks. this study was to verify the therapeutic value of this elec- tive area (appropriate point) and to compare it with an area of the ear (representing the sciatic nerve) which is probably inappropriate in terms of giving a therapeutic effect on (a) What percent of patients in migraine attacks, since it has no somatotopic correlation the treatment In group B, the lower group branch ofwere pain free the anthelix was 24 hours after receiving acupuncture? with head pain. (b) What percent were painrepeatedly free in testedthe control group? for about 30 s to with the algometer ensure it was not sensitive. On both the French and Chinese (c) In which group did a higher auricularpercent maps, thisofareapatients corresponds become pain free 24 hours after receiving acupuncture? to the representation Materials and methods (d) Your findings so far might of the sciatic nerve suggest that(Fig. 1, area S) and is is acupuncture specifically used an effective treatment for migraines for all people to treat sciatic pain. Four needles were inserted in this area, who suffer from The study enrolled 94 females, diagnosed as migraine migraines. However, two for each ear. this is not the only possible conclusion that can be drawn based without aura following theon your findings International so far. Classification of WhatIn all is one other patients, possible explanation the ear acupuncture was always per- for the observed difference between the Headache Disorders , who were subsequently percentages of examined patients thatformedare by an experienced pain free 24 acupuncturist. hours after The analysis receivingof acupuncture in the two groups? at the Women’s Headache Centre, Department of Gynae- the diaries collecting VAS data was conducted by an cology and Obstetrics of Turin University. They were all impartial operator who did not know the group each patient included in the study 1.2 during aSinusitis migraine attackand antibiotics, provided that was in. Part I. Researchers studying the effect of antibiotic treatment for acute it started no more than 4 h previously. sinusitis compared According to a to symptomatic The average values of randomly treatments VAS in groupassigned A and B were 166 adults diagnosed with acute sinusitis to predetermined computer-made randomization list, the eli- calculated at the different times of the study, and a statis- one of two groups: treatment gible patients were randomly and blindly assigned to the tical evaluation of the differences between received or control. Study participants the values either a 10-day course of amoxicillin (an following two groups: antibiotic) group A (n or a placebo = 46) (average age similar in appearance obtained in T0, T1, T2, T3 and andtaste. T4 in the The twoplacebo groups consisted of symptomatic treatments 35.93 years, range 15–60), such group B (n = 48) (average agenasal as acetaminophen, studied was performed using decongestants, etc. anAtanalysis the end of variance of the 10-day period, patients were asked if 33.2 years, range 16–58). (ANOVA) for repeated measures followed by multiple Before enrollment, they each experienced patient was askedimprovement to give an t testin symptoms. of Bonferroni Thethedistribution to identify source of variance. of responses is summarized below.3 informed consent to participation in the study. Moreover, to evaluate the difference between group B Migraine intensity was measured by means of a VAS and group A, a t test for unpaired Self-reported data was alwaysimprovement per- before applying NCT (T0). formed for each level of the variable ‘‘time’’. In the case of in symptoms In group A, a specific algometer exerting a maximum proportions, a Chi square test was applied. All analyses pressure of 250 g (SEDATELEC, France) was chosen to Yes were performed using the Statistical Package for the Social No Total identify the tender points with Pain–Pressure Test (PPT). Sciences Treatment (SPSS) software program.66 All values given19 in the 85 Every tender point located within the identified area by the Group followingControl text are reported as 65arithmetic mean (±SEM). 16 81 pilot study (Fig. 1, area M) was tested with NCT for 10 s starting from the auricle, that was ipsilateral, to the side of Total 131 35 166 prevalent cephalic pain. If the test was positive and the Results reduction was at least(a)25% What percent in respect to basis,ofa patients semi- in the treatment group experienced improvement in symptoms? permanent needle (b) (ASP What SEDATELEC,percent France) was experienced Onlyimprovement 89 patients out of the inentire group of 94in symptoms (43the in group control group? inserted after 1 min. On the contrary, if pain did not lessen A, 46 in group B) completed the experiment. Four patients after 1 min, a further(c) In point tender whichwas group challengeddid a higher in the withdrew percentage from the study,of patients because theyexperience experienced animprovement in symptoms? same area and so on. When patients became aware of an unbearable exacerbation of pain in the period preceding the (d) Your findings so far might initial decrease in the pain in all the zones of the head suggest a real difference in effectiveness of antibiotic and placebo treatments last control at 24 h (two from group A and two from group for improving symptoms affected, they were invited to use a specific diary card to of B) andsinusitis. However, were excluded from the this is not statistical thesince analysis only possible conclusion that can be drawn based score the intensity of the pain with aon VASyour following sothey at thefindings far.requested Whattheis removal one other of thepossible needles. One patient explanation for the observed difference between intervals: after 10 min (T1), after 30 min (T2), after from group A did not give her consent to the implant of the the percentages 60 min (T3), after 120 min (T4), and after 24 h (T5). of patients in the antibiotic and placebo semi-permanent needles. In group A, the mean number of treatment groups that experience improvement in symptoms of sinusitis? 123 2 G. Allais et al. “Ear acupuncture in the treatment of migraine attacks: a randomized trial on the efficacy of appropriate versus inappropriate acupoints”. In: Neurological Sci. 32.1 (2011), pp. 173–175. 3 J.M. Garbutt et al. “Amoxicillin for Acute Rhinosinusitis: A Randomized Controlled Trial”. In: JAMA: The Journal of the American Medical Association 307.7 (2012), pp. 685–692. 12 CHAPTER 1. INTRODUCTION TO DATA 1.2 Data basics Effective organization and description of data is a first step in most analyses. This section introduces the data matrix for organizing data as well as some terminology about different forms of data that will be used throughout this book. 1.2.1 Observations, variables, and data matrices Figure 1.3 displays rows 1, 2, 3, and 50 of a data set for 50 randomly sampled loans offered through Lending Club, which is a peer-to-peer lending company. These observations will be referred to as the loan50 data set. Each row in the table represents a single loan. The formal name for a row is a case or observational unit. The columns represent characteristics, called variables, for each of the loans. For example, the first row represents a loan of $7,500 with an interest rate of 7.34%, where the borrower is based in Maryland (MD) and has an income of $70,000. GUIDED PRACTICE 1.2 What is the grade of the first loan in Figure 1.3? And what is the home ownership status of the borrower for that first loan? For these Guided Practice questions, you can check your answer in the footnote.4 In practice, it is especially important to ask clarifying questions to ensure important aspects of the data are understood. For instance, it is always important to be sure we know what each variable means and the units of measurement. Descriptions of the loan50 variables are given in Figure 1.4. loan amount interest rate term grade state total income homeownership 1 7500 7.34 36 A MD 70000 rent 2 25000 9.43 60 B OH 254000 mortgage 3 14500 6.08 36 A MO 80000 mortgage........................ 50 3000 7.96 36 A CA 34000 rent Figure 1.3: Four rows from the loan50 data matrix. variable description loan amount Amount of the loan received, in US dollars. interest rate Interest rate on the loan, in an annual percentage. term The length of the loan, which is always set as a whole number of months. grade Loan grade, which takes values A through G and represents the quality of the loan and its likelihood of being repaid. state US state where the borrower resides. total income Borrower’s total income, including any second income, in US dollars. homeownership Indicates whether the person owns, owns but has a mortgage, or rents. Figure 1.4: Variables and their descriptions for the loan50 data set. The data in Figure 1.3 represent a data matrix, which is a convenient and common way to organize data, especially if collecting data in a spreadsheet. Each row of a data matrix corresponds to a unique case (observational unit), and each column corresponds to a variable. 4 The loan’s grade is A, and the borrower rents their residence. 1.2. DATA BASICS 13 When recording data, use a data matrix unless you have a very good reason to use a different structure. This structure allows new cases to be added as rows or new variables as new columns. GUIDED PRACTICE 1.3 The grades for assignments, quizzes, and exams in a course are often recorded in a gradebook that takes the form of a data matrix. How might you organize grade data using a data matrix?5 GUIDED PRACTICE 1.4 We consider data for 3,142 counties in the United States, which includes each county’s name, the state where it resides, its population in 2017, how its population changed from 2010 to 2017, poverty rate, and six additional characteristics. How might these data be organized in a data matrix?6 The data described in Guided Practice 1.4 represents the county data set, which is shown as a data matrix in Figure 1.5. The variables are summarized in Figure 1.6. 5 There are multiple strategies that can be followed. One common strategy is to have each student represented by a row, and then add a column for each assignment, quiz, or exam. Under this setup, it is easy to review a single line to understand a student’s grade history. There should also be columns to include student information, such as one column to list student names. 6 Each county may be viewed as a case, and there are eleven pieces of information recorded for each case. A table with 3,142 rows and 11 columns could hold these data, where each row represents a county and each column represents a particular piece of information. 14 name state pop pop change poverty homeownership multi unit unemp rate metro median edu median hh income 1 Autauga Alabama 55504 1.48 13.7 77.5 7.2 3.86 yes some college 55317 2 Baldwin Alabama 212628 9.19 11.8 76.7 22.6 3.99 yes some college 52562 3 Barbour Alabama 25270 -6.22 27.2 68.0 11.1 5.90 no hs diploma 33368 4 Bibb Alabama 22668 0.73 15.2 82.9 6.6 4.39 yes hs diploma 43404 5 Blount Alabama 58013 0.68 15.6 82.0 3.7 4.02 yes hs diploma 47412 6 Bullock Alabama 10309 -2.28 28.5 76.9 9.9 4.93 no hs diploma 29655 7 Butler Alabama 19825 -2.69 24.4 69.0 13.7 5.49 no hs diploma 36326 8 Calhoun Alabama 114728 -1.51 18.6 70.7 14.3 4.93 yes some college 43686 9 Chambers Alabama 33713 -1.20 18.8 71.4 8.7 4.08 no hs diploma 37342 10 Cherokee Alabama 25857 -0.60 16.1 77.5 4.3 4.05 no hs diploma 40041.................................... 3142 Weston Wyoming 6927 -2.93 14.4 77.9 6.5 3.98 no some college 59605 Figure 1.5: Eleven rows from the county data set. variable description name County name. state State where the county resides, or the District of Columbia. pop Population in 2017. pop change Percent change in the population from 2010 to 2017. For example, the value 1.48 in the first row means the population for this county increased by 1.48% from 2010 to 2017. poverty Percent of the population in poverty. homeownership Percent of the population that lives in their own home or lives with the owner, e.g. children living with parents who own the home. multi unit Percent of living units that are in multi-unit structures, e.g. apartments. unemp rate Unemployment rate as a percent. metro Whether the county contains a metropolitan area. median edu Median education level, which can take a value among below hs, hs diploma, some college, and bachelors. median hh income Median household income for the county, where a household’s income equals the total income of its occupants who are 15 years or older. Figure 1.6: Variables and their descriptions for the county data set. CHAPTER 1. INTRODUCTION TO DATA 1.2. DATA BASICS 15 1.2.2 Types of variables Examine the unemp rate, pop, state, and median edu variables in the county data set. Each of these variables is inherently different from the other three, yet some share certain characteristics. First consider unemp rate, which is said to be a numerical variable since it can take a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values. On the other hand, we would not classify a variable reporting telephone area codes as numerical since the average, sum, and difference of area codes doesn’t have any clear meaning. The pop variable is also numerical, although it seems to be a little different than unemp rate. This variable of the population count can only take whole non-negative numbers (0, 1, 2,...). For this reason, the population variable is said to be discrete since it can only take numerical values with jumps. On the other hand, the unemployment rate variable is said to be continuous. The variable state can take up to 51 values after accounting for Washington, DC: AL, AK,..., and WY. Because the responses themselves are categories, state is called a categorical variable, and the possible values are called the variable’s levels. Finally, consider the median edu variable, which describes the median education level of county residents and takes values below hs, hs diploma, some college, or bachelors in each county. This variable seems to be a hybrid: it is a categorical variable but the levels have a natural ordering. A variable with these properties is called an ordinal variable, while a regular categorical variable without this type of special ordering is called a nominal variable. To simplify analyses, any ordinal variable in this book will be treated as a nominal (unordered) categorical variable. all variables numerical categorical nominal ordinal continuous discrete (unordered categorical) (ordered categorical) Figure 1.7: Breakdown of variables into their respective types. EXAMPLE 1.5 Data were collected about students in a statistics course. Three variables were recorded for each student: number of siblings, student height, and whether the student had previously taken a statistics course. Classify each of the variables as continuous numerical, discrete numerical, or categorical. The number of siblings and student height represent numerical variables. Because the number of siblings is a count, it is discrete. Height varies continuously, so it is a continuous numerical variable. The last variable classifies students into two categories – those who have and those who have not taken a statistics course – which makes this variable categorical. GUIDED PRACTICE 1.6 An experiment is evaluating the effectiveness of a new drug in treating migraines. A group variable is used to indicate the experiment group for each patient: treatment or control. The num migraines variable represents the number of migraines the patient experienced during a 3-month period. Classify each variable as either numerical or categorical?7 7 The group variable can take just one of two group names, making it categorical. The num migraines variable describes a count of the number of migraines, which is an outcome where basic arithmetic is sensible, which means this is numerical outcome; more specifically, since it represents a count, num migraines is a discrete numerical variable. 16 CHAPTER 1. INTRODUCTION TO DATA 1.2.3 Relationships between variables Many analyses are motivated by a researcher looking for a relationship between two or more variables. A social scientist may like to answer some of the following questions: (1) If homeownership is lower than the national average in one county, will the percent of multi-unit structures in that county tend to be above or below the national average? (2) Does a higher than average increase in county population tend to correspond to counties with higher or lower median household incomes? (3) How useful a predictor is median education level for the median household income for US counties? To answer these questions, data must be collected, such as the county data set shown in Figure 1.5. Examining summary statistics could provide insights for each of the three questions about counties. Additionally, graphs can be used to visually explore data. Scatterplots are one type of graph used to study the relationship between two numerical vari- ables. Figure 1.8 compares the variables homeownership and multi unit, which is the percent of units in multi-unit structures (e.g. apartments, condos). Each point on the plot represents a single county. For instance, the highlighted dot corresponds to County 413 in the county data set: Chat- tahoochee County, Georgia, which has 39.4% of units in multi-unit structures and a homeownership rate of 31.3%. The scatterplot suggests a relationship between the two variables: counties with a higher rate of multi-units tend to have lower homeownership rates. We might brainstorm as to why this relationship exists and investigate each idea to determine which are the most reasonable explanations. 80% Homeownership Rate 60% 40% 20% 0% 0% 20% 40% 60% 80% 100% Percent of Units in Multi−Unit Structures Figure 1.8: A scatterplot of homeownership versus the percent of units that are in multi-unit structures for US counties. The highlighted dot represents Chatta- hoochee County, Georgia, which has a multi-unit rate of 39.4% and a homeowner- ship rate of 31.3%. The multi-unit and homeownership rates are said to be associated because the plot shows a discernible pattern. When two variables show some connection with one another, they are called associated variables. Associated variables can also be called dependent variables and vice-versa. 1.2. DATA BASICS 17 20% Population Change over 7 Years 10% 0% −10% $0 $20k $40k $60k $80k $100k $120k Median Household Income Figure 1.9: A scatterplot showing pop change against median hh income. Owsley County of Kentucky, is highlighted, which lost 3.63% of its population from 2010 to 2017 and had median household income of $22,736. GUIDED PRACTICE 1.7 Examine the variables in the loan50 data set, which are described in Figure 1.4 on page 12. Create two questions about possible relationships between variables in loan50 that are of interest to you.8 EXAMPLE 1.8 This example examines the relationship between a county’s population change from 2010 to 2017 and median household income, which is visualized as a scatterplot in Figure 1.9. Are these variables associated? The larger the median household income for a county, the higher the population growth observed for the county. While this trend isn’t true for every county, the trend in the plot is evident. Since there is some relationship between the variables, they are associated. Because there is a downward trend in Figure 1.8 – counties with more units in multi-unit structures are associated with lower homeownership – these variables are said to be negatively associated. A positive association is shown in the relationship between the median hh income and pop change in Figure 1.9, where counties with higher median household income tend to have higher rates of population growth. If two variables are not associated, then they are said to be independent. That is, two variables are independent if there is no evident relationship between the two. ASSOCIATED OR INDEPENDENT, NOT BOTH A pair of variables are either related in some way (associated) or not (independent). No pair of variables is both associated and independent. 8 Two example questions: (1) What is the relationship between loan amount and total income? (2) If someone’s income is above the average, will their interest rate tend to be above or below the average? 18 CHAPTER 1. INTRODUCTION TO DATA 1.2.4 Explanatory and response variables When we ask questions about the relationship between two variables, we sometimes also want to determine if the change in one variable causes a change in the other. Consider the following rephrasing of an earlier question about the county data set: If there is an increase in the median household income in a county, does this drive an increase in its population? In this question, we are asking whether one variable affects another. If this is our underlying belief, then median household income is the explanatory variable and the population change is the response variable in the hypothesized relationship.9 EXPLANATORY AND RESPONSE VARIABLES When we suspect one variable might causally affect another, we label the first variable the explanatory variable and the second the response variable. explanatory might affect response variable variable For many pairs of variables, there is no hypothesized relationship, and these labels would not be applied to either variable in such cases. Bear in mind that the act of labeling the variables in this way does nothing to guarantee that a causal relationship exists. A formal evaluation to check whether one variable causes a change in another requires an experiment. 1.2.5 Introducing observational studies and experiments There are two primary types of data collection: observational studies and experiments. Researchers perform an observational study when they collect data in a way that does not directly interfere with how the data arise. For instance, researchers may collect information via surveys, review medical or company records, or follow a cohort of many similar individuals to form hypotheses about why certain diseases might develop. In each of these situations, researchers merely observe the data that arise. In general, observational studies can provide evidence of a naturally occurring association between variables, but they cannot by themselves show a causal connection. When researchers want to investigate the possibility of a causal connection, they conduct an experiment. Usually there will be both an explanatory and a response variable. For instance, we may suspect administering a drug will reduce mortality in heart attack patients over the following year. To check if there really is a causal connection between the explanatory variable and the response, researchers will collect a sample of individuals and split them into groups. The individuals in each group are assigned a treatment. When individuals are randomly assigned to a group, the experiment is called a randomized experiment. For example, each heart attack patient in the drug trial could be randomly assigned, perhaps by flipping a coin, into one of two groups: the first group receives a placebo (fake treatment) and the second group receives the drug. See the case study in Section 1.1 for another example of an experiment, though that study did not employ a placebo. ASSOCIATION 6= CAUSATION In general, association does not imply causation, and causation can only be inferred from a randomized experiment. 9 Sometimes the explanatory variable is called the independent variable and the response variable is called the dependent variable. However, this becomes confusing since a pair of variables might be independent or dependent, so we avoid this language. 1.2. DATA BASICS 19 Exercises 1.3 Air pollution and birth outcomes, study components. Researchers collected data to examine the relationship between air pollutants and preterm births in Southern California. During the study air pollution levels were measured by air quality monitoring stations. Specifically, levels of carbon monoxide were recorded in parts per million, nitrogen dioxide and ozone in parts per hundred million, and coarse particulate matter (PM10 ) in µg/m3. Length of gestation data were collected on 143,196 births between the years 1989 and 1993, and air pollution exposure during gestation was calculated for each birth. The analysis suggested that increased ambient PM10 and, to a lesser degree, CO concentrations may be associated with the occurrence of preterm births.10 (a) Identify the main research question of the study. (b) Who are the subjects in this study, and how many are included? (c) What are the variables in the study? Identify each variable as numerical or categorical. If numerical, state whether the variable is discrete or continuous. If categorical, state whether the variable is ordinal. 1.4 Buteyko method, study components. The Buteyko method is a shallow breathing technique devel- oped by Konstantin Buteyko, a Russian doctor, in 1952. Anecdotal evidence suggests that the Buteyko method can reduce asthma symptoms and improve quality of life. In a scientific study to determine the effectiveness of this method, researchers recruited 600 asthma patients aged 18-69 who relied on medication for asthma treatment. These patients were randomly split into two research groups: one practiced the Buteyko method and the other did not. Patients were scored on quality of life, activity, asthma symptoms, and medication reduction on a scale from 0 to 10. On average, the participants in the Buteyko group experienced a significant reduction in asthma symptoms and an improvement in quality of life.11 (a) Identify the main research question of the study. (b) Who are the subjects in this study, and how many are included? (c) What are the variables in the study? Identify each variable as numerical or categorical. If numerical, state whether the variable is discrete or continuous. If categorical, state whether the variable is ordinal. 1.5 Cheaters, study components. Researchers studying the relationship between honesty, age and self- control conducted an experiment on 160 children between the ages of 5 and 15. Participants reported their age, sex, and whether they were an only child or not. The researchers asked each child to toss a fair coin in private and to record the outcome (white or black) on a paper sheet, and said they would only reward children who report white. The study’s findings can be summarized as follows: “Half the students were explicitly told not to cheat and the others were not given any explicit instructions. In the no instruction group probability of cheating was found to be uniform across groups based on child’s characteristics. In the group that was explicitly told to not cheat, girls were less likely to cheat, and while rate of cheating didn’t vary by age for boys, it decreased with age for girls.”12 (a) Identify the main research question of the study. (b) Who are the subjects in this study, and how many are included? (c) How many variables were recorded for each subject in the study in order to conclude these findings? State the variables and their types. 10 B. Ritz et al. “Effect of air pollution on preterm birth among children born in Southern California between 1989 and 1993”. In: Epidemiology 11.5 (2000), pp. 502–511. 11 J. McGowan. “Health Education: Does the Buteyko Institute Method make a difference?” In: Thorax 58 (2003). 12 Alessandro Bucciol and Marco Piovesan. “Luck or cheating? A field experiment on honesty with children”. In: Journal of Economic Psychology 32.1 (2011), pp. 73–78. 20 CHAPTER 1. INTRODUCTION TO DATA 1.6 Stealers, study components. In a study of the relationship between socio-economic class and unethical behavior, 129 University of California undergraduates at Berkeley were asked to identify themselves as having low or high social-class by comparing themselves to others with the most (least) money, most (least) education, and most (least) respected jobs. They were also presented with a jar of individually wrapped candies and informed that the candies were for children in a nearby laboratory, but that they could take some if they wanted. After completing some unrelated tasks, participants reported the number of candies they had taken.13 (a) Identify the main research question of the study. (b) Who are the subjects in this study, and how many are included? (c) The study found that students who were identified as upper-class took more candy than others. How many variables were recorded for each subject in the study in order to conclude these findings? State the variables and their types. 1.7 Migraine and acupuncture, Part II. Exercise 1.1 introduced a study exploring whether acupuncture had any effect on migraines. Researchers conducted a randomized controlled study where patients were randomly assigned to one of two groups: treatment or control. The patients in the treatment group re- ceived acupuncture that was specifically designed to treat migraines. The patients in the control group received placebo acupuncture (needle insertion at non-acupoint locations). 24 hours after patients received acupuncture, they were asked if they were pain free. What are the explanatory and response variables in this study? 1.8 Sinusitis and antibiotics, Part II. Exercise 1.2 introduced a study exploring the effect of antibiotic treatment for acute sinusitis. Study participants either received either a 10-day course of an antibiotic (treatment) or a placebo similar in appearance and taste (control). At the end of the 10-day period, patients were asked if they experienced improvement in symptoms. What are the explanatory and response variables in this study? 1.9 Fisher’s irises. Sir Ronald Aylmer Fisher was an English statistician, evolutionary biologist, and geneticist who worked on a data set that contained sepal length and width, and petal length and width from three species of iris flowers (setosa, versicolor and virginica). There were 50 flowers from each species in the data set.14 (a) How many cases were included in the data? (b) How many numerical variables are included in the data? Indicate what they are, and if they Photo by Ryan Claussen are continuous or discrete. (http://flic.kr/p/6QTcuX) (c) How many categorical variables are included in CC BY-SA 2.0 license the data, and what are they? List the corre- sponding levels (categories). 1.10 Smoking habits of UK residents. A survey was conducted to study the smoking habits of UK residents. Below is a data matrix displaying a portion of the data collected in this survey. Note that “£” stands for British Pounds Sterling, “cig” stands for cigarettes, and “N/A” refers to a missing component of the data.15 sex age marital grossIncome smoke amtWeekends amtWeekdays 1 Female 42 Single Under £2,600 Yes 12 cig/day 12 cig/day 2 Male 44 Single £10,400 to £15,600 No N/A N/A 3 Male 53 Married Above £36,400 Yes 6 cig/day 6 cig/day........................ 1691 Male 40 Single £2,600 to £5,200 Yes 8 cig/day 8 cig/day (a) What does each row of the data matrix represent? (b) How many participants were included in the survey? (c) Indicate whether each variable in the study is numerical or categorical. If numerical, identify as contin- uous or discrete. If categorical, indicate if the variable is ordinal. 13 P.K. Piff et al. “Higher social class predicts increased unethical behavior”. In: Proceedings of the National Academy of Sciences (2012). 14 R.A Fisher. “The Use of Multiple Measurements in Taxonomic Problems”. In: Annals of Eugenics 7 (1936), pp. 179–188. 15 National STEM Centre, Large Datasets from stats4schools. 1.2. DATA BASICS 21 1.11 US Airports. The visualization below shows the geographical distribution of airports in the contiguous United States and Washington, DC. This visualization was constructed based on a dataset where each observation is an airport.16 (a) List the variables used in creating this visualization. (b) Indicate whether each variable in the study is numerical or categorical. If numerical, identify as contin- uous or discrete. If categorical, indicate if the variable is ordinal. 1.12 UN Votes. The visualization below shows voting patterns in the United States, Canada, and Mexico in the United Nations General Assembly on a variety of issues. Specifically, for a given year between 1946 and 2015, it displays the percentage of roll calls in which the country voted yes for each issue. This visualization was constructed based on a dataset where each observation is a country/year pair.17 (a) List the variables used in creating this visualization. (b) Indicate whether each variable in the study is numerical or categorical. If numerical, identify as contin- uous or discrete. If categorical, indicate if the variable is ordinal. 16 Federal Aviation Administration, www.faa.gov/airports/airport safety/airportdata 5010. 17 DavidRobinson. unvotes: United Nations General Assembly Voting Data. R package version 0.2.0. 2017. url: https://CRAN.R-project.org/package=unvotes. 22 CHAPTER 1. INTRODUCTION TO DATA 1.3 Sampling principles and strategies The first step in conducting research is to identify topics or questions that are to be investigated. A clearly laid out research question is helpful in identifying what subjects or cases should be studied and what variables are important. It is also important to consider how data are collected so that they are reliable and help achieve the research goals. 1.3.1 Populations and samples Consider the following three research questions: 1. What is the average mercury content in swordfish in the Atlantic Ocean? 2. Over the last 5 years, what is the average time to complete a degree for Duke undergrads? 3. Does a new drug reduce the number of deaths in patients with severe heart disease? Each research question refers to a target population. In the first question, the target population is all swordfish in the Atlantic ocean, and each fish represents a case. Often times, it is too expensive to collect data for every case in a population. Instead, a sample is taken. A sample represents a subset of the cases and is often a small fraction of the population. For instance, 60 swordfish (or some other number) in the population might be selected, and this sample data may be used to provide an estimate of the population average and answer the research question. GUIDED PRACTICE 1.9 For the second and third questions above, identify the target population and what represents an individual case.18 1.3.2 Anecdotal evidence Consider the following possible responses to the three research questions: 1. A man on the news got mercury poisoning from eating swordfish, so the average mercury concentration in swordfish must be dangerously high. 2. I met two students who took more than 7 years to graduate from Duke, so it must take longer to graduate at Duke than at many other colleges. 3. My friend’s dad had a heart attack and died after they gave him a new heart disease drug, so the drug must not work. Each conclusion is based on data. However, there are two problems. First, the data only represent one or two cases. Second, and more importantly, it is unclear whether these cases are actually representative of the population. Data collected in this haphazard fashion are called anecdotal evidence. ANECDOTAL EVIDENCE Be careful of data collected in a haphazard fashion. Such evidence may be true and verifiable, but it may only represent extraordinary cases. 18 (2) The first question is only relevant to students who complete their degree; the average cannot be computed using a student who never finished her degree. Thus, only Duke undergrads who graduated in the last five years represent cases in the population under consideration. Each such student is an individual case. (3) A person with severe heart disease represents a case. The population includes all people with severe heart disease. 1.3. SAMPLING PRINCIPLES AND STRATEGIES 23 Figure 1.10: In February 2010, some media pundits cited one large snow storm as valid evidence against global warming. As comedian Jon Stewart pointed out, “It’s one storm, in one region, of one country.” Anecdotal evidence typically is composed of unusual cases that we recall based on their striking characteristics. For instance, we are more likely to remember the two people we met who took 7 years to graduate than the six others who graduated in four years. Instead of looking at the most unusual cases, we should examine a sample of many cases that represent the population. 1.3.3 Sampling from a population We might try to estimate the time to graduation for Duke undergraduates in the last 5 years by collecting a sample of students. All graduates in the last 5 years represent the population, and graduates who are selected for review are collectively called the sample. In general, we always seek to randomly select a sample from a population. The most basic type of random selection is equivalent to how raffles are conducted. For example, in selecting graduates, we could write each graduate’s name on a raffle ticket and draw 100 tickets. The selected names would represent a random sample of 100 graduates. We pick samples randomly to reduce the chance we introduce biases. all graduates sample Figure 1.11: In this graphic, five graduates are randomly selected from the popu- lation to be included in the sample. EXAMPLE 1.10 Suppose we ask a student who happens to be majoring in nutrition to select several graduates for the study. What kind of students do you think she might collect? Do you think her sample would be representative of all graduates? Perhaps she would pick a disproportionate number of graduates from health-related fields. Or per- haps her selection would be a good representation of the population. When selecting samples by hand, we run the risk of picking a biased sample, even if their bias isn’t intended. 24 CHAPTER 1. INTRODUCTION TO DATA all graduates sample graduates from health−related fields Figure 1.12: Asked to pick a sample of graduates, a nutrition major might inad- vertently pick a disproportionate number of graduates from health-related majors. If someone was permitted to pick and choose exactly which graduates were included in the sample, it is entirely possible that the sample could be skewed to that person’s interests, which may be entirely unintentional. This introduces bias into a sample. Sampling randomly helps resolve this problem. The most basic random sample is called a simple random sample, and which is equivalent to using a raffle to select cases. This means that each case in the population has an equal chance of being included and there is no implied connection between the cases in the sample. The act of taking a simple random sample helps minimize bias. However, bias can crop up in other ways. Even when people are picked at random, e.g. for surveys, caution must be exercised if the non-response rate is high. For instance, if only 30% of the people randomly sampled for a survey actually respond, then it is unclear whether the results are representative of the entire population. This non-response bias can skew results. population of interest sample population actually sampled Figure 1.13: Due to the possibility of non-response, surveys studies may only reach a certain group within the population. It is difficult, and often times impossible, to completely fix this problem. Another common downfall is a convenience sample, where individuals who are easily ac- cessible are more likely to be included in the sample. For instance, if a political survey is done by stopping people walking in the Bronx, this will not represent all of New York City. It is often difficult to discern what sub-population a convenience sample represents. GUIDED PRACTICE 1.11 We can easily access ratings for products, sellers, and companies through websites. These ratings are based only on those people who go out of their way to provide a rating. If 50% of online reviews for a product are negative, do you think this means that 50% of buyers are dissatisfied with the product?19 19 Answers will vary. From our own anecdotal experiences, we believe people tend to rant more about products that fell below expectations than rave about those that perform as expected. For this reason, we suspect there is a negative bias in product ratings on sites like Amazon. However, since our experiences may not be representative, we also keep an open mind. 1.3. SAMPLING PRINCIPLES AND STRATEGIES 25 1.3.4 Observational studies Data where no treatment has been explicitly applied (or explicitly withheld) is called observa- tional data. For instance, the loan data and county data described in Section 1.2 are both examples of observational data. Making causal conclusions based on experiments is often reasonable. How- ever, making the same causal conclusions based on observational data can be treacherous and is not recommended. Thus, observational studies are generally only sufficient to show associations or form hypotheses that we later check using experiments. GUIDED PRACTICE 1.12 Suppose an observational study tracked sunscreen use and skin cancer, and it was found that the more sunscreen someone used, the more likely the person was to have skin cancer. Does this mean sunscreen causes skin cancer?20 Some previous research tells us that using sunscreen actually reduces skin cancer risk, so maybe there is another variable that can explain this hypothetical association between sunscreen usage and skin cancer. One important piece of information that is absent is sun exposure. If someone is out in the sun all day, she is more likely to use sunscreen and more likely to get skin cancer. Exposure to the sun is unaccounted for in the simple investigation. sun exposure use sunscreen ? skin cancer Sun exposure is what is called a confounding variable,21 which is a variable that is correlated with both the explanatory and response variables. While one method to justify making causal conclusions from observational studies is to exhaust the search for confounding variables, there is no guarantee that all confounding variables can be examined or measured. GUIDED PRACTICE 1.13 Figure 1.8 shows a negative association between the homeownership rate and the percentage of multi- unit structures in a county. However, it is unreasonable to conclude that there is a causal relationship between the two variables. Suggest a variable that might explain the negative relationship.22 Observational studies come in two forms: prospective and retrospective studies. A prospec- tive study identifies individuals and collects information as events unfold. For instance, medical researchers may identify and follow a group of patients over many years to assess the possible influ- ences of behavior on cancer risk. One example of such a study is The Nurses’ Health Study, started in 1976 and expanded in 1989. This prospective study recruits registered nurses and then collects data from them using questionnaires. Retrospective studies collect data after events have taken place, e.g. researchers may review past events in medical records. Some data sets may contain both prospectively- and retrospectively-collected variables. 1.3.5 Four sampling methods Almost all statistical methods are based on the notion of implied randomness. If observational data are not collected in a random framework from a population, these statistical methods – the estimates and errors associated with the estimates – are not reliable. Here we consider four random sampling techniques: simple, stratified, cluster, and multistage sampling. Figures 1.14 and 1.15 provide graphical representations of these techniques. 20 No. See the paragraph following the exercise for an explanation. 21 Also called a lurking variable, confounding factor, or a confounder. 22 Answers will vary. Population density may be important. If a county is very dense, then this may require a larger fraction of residents to live in multi-unit structures. Additionally, the high density may contribute to increases in property value, making homeownership infeasible for many residents. 26 CHAPTER 1. INTRODUCTION TO DATA

OpenIntro Statistics PDF Fourth Edition

Document Details

Tags

Related

Summary

Full Transcript

Upgrade to continue