Introductory Statistics for the Life and Biomedical Sciences PDF
Harvard University
2021
Julie Vu and David Harrington
Summary
This textbook introduces statistics and its application in life sciences and biomedical research. It covers core topics like data, probability, distributions, and inference, with examples relevant to biology and medicine. The book is intended for undergraduate and graduate students in related fields.
Introductory Statistics for the Life and Biomedical Sciences
First Edition

Julie Vu
Preceptor in Statistics
Harvard University

David Harrington
Professor of Biostatistics (Emeritus)
Harvard T.H. Chan School of Public Health
Dana-Farber Cancer Institute

This book can be purchased for $0 on Leanpub by adjusting the price slider. Purchasing includes access to a tablet-friendly version of this PDF where margins have been minimized.

Copyright © 2020. First Edition. Version date: August 8th, 2021.

This textbook and its supplements, including slides and labs, may be downloaded for free at openintro.org/book/biostat.

This textbook is a derivative of OpenIntro Statistics 3rd Edition by Diez, Barr, and Çetinkaya-Rundel, and it is available under a Creative Commons Attribution-ShareAlike 3.0 Unported United States license. License details are available at the Creative Commons website: creativecommons.org.

Source files for this book may be found on Github at github.com/OI-Biostat/oi_biostat_text.

Table of Contents

1 Introduction to data
  1.1 Case study
  1.2 Data basics
  1.3 Data collection principles
  1.4 Numerical data
  1.5 Categorical data
  1.6 Relationships between two variables
  1.7 Exploratory data analysis
  1.8 Notes
  1.9 Exercises

2 Probability
  2.1 Defining probability
  2.2 Conditional probability
  2.3 Extended example
  2.4 Notes
  2.5 Exercises

3 Distributions of random variables
  3.1 Random variables
  3.2 Binomial distribution
  3.3 Normal distribution
  3.4 Poisson distribution
  3.5 Distributions related to Bernoulli trials
  3.6 Distributions for pairs of random variables
  3.7 Notes
  3.8 Exercises

4 Foundations for inference
  4.1 Variability in estimates
  4.2 Confidence intervals
  4.3 Hypothesis testing
  4.4 Notes
  4.5 Exercises

5 Inference for numerical data
  5.1 Single-sample inference with the t-distribution
  5.2 Two-sample test for paired data
  5.3 Two-sample test for independent data
  5.4 Power calculations for a difference of means
  5.5 Comparing means with ANOVA
  5.6 Notes
  5.7 Exercises

6 Simple linear regression
  6.1 Examining scatterplots
  6.2 Estimating a regression line using least squares
  6.3 Interpreting a linear model
  6.4 Statistical inference with regression
  6.5 Interval estimates with regression
  6.6 Notes
  6.7 Exercises

7 Multiple linear regression
  7.1 Introduction to multiple linear regression
  7.2 Simple versus multiple regression
  7.3 Evaluating the fit of a multiple regression model
  7.4 The general multiple linear regression model
  7.5 Categorical predictors with several levels
  7.6 Reanalyzing the PREVEND data
  7.7 Interaction in regression
  7.8 Model selection for explanatory models
  7.9 The connection between ANOVA and regression
  7.10 Notes
  7.11 Exercises

8 Inference for categorical data
  8.1 Inference for a single proportion
  8.2 Inference for the difference of two proportions
  8.3 Inference for two or more groups
  8.4 Chi-square tests for the fit of a distribution
  8.5 Outcome-based sampling: case-control studies
  8.6 Notes
  8.7 Exercises

A End of chapter exercise solutions
B Distribution tables
Index

Foreword

The past year has been challenging for the health sciences in ways that we could not have imagined when we started writing 5 years ago. The rapid spread of the SARS coronavirus (SARS-CoV-2) worldwide has upended the scientific research process and highlighted the need for maintaining a balance between speed and reliability.
Major medical journals have dramatically increased the pace of publication; the urgency of the situation necessitates that data and research findings be made available as quickly as possible to inform public policy and clinical practice. Yet it remains essential that studies undergo rigorous review; the retraction of two high-profile coronavirus studies [1, 2] sparked widespread concerns about data integrity, reproducibility, and the editorial process.

In parallel, deepening public awareness of structural racism has caused a re-examination of the role of race in published studies in health and medicine. A recent review of algorithms used to direct treatment in areas such as cardiology, obstetrics and oncology uncovered examples of race used in ways that may lead to substandard care for people of color. [3] The SARS-CoV-2 pandemic has reminded us once again that marginalized populations are disproportionately at risk for bad health outcomes. Data on 17 million patients in England [4] suggest that Blacks and South Asians have a death rate that is approximately 50% higher than white members of the population.

Understanding the SARS coronavirus and tackling racial disparities in health outcomes are but two of the many areas in which Biostatistics will play an important role in the coming decades. Much of that work will be done by those now beginning their study of Biostatistics. We hope this book provides an accessible point of entry for students planning to begin work in biology, medicine, or public health. While the material presented in this book is essential for understanding the foundations of the discipline, we advise readers to remember that a mastery of technical details is secondary to choosing important scientific questions, examining data without bias, and reporting results that transparently display the strengths and weaknesses of a study.

[1] Mandeep R. Mehra et al. “Retraction: Cardiovascular Disease, Drug Therapy, and Mortality in Covid-19. N Engl J Med. DOI: 10.1056/NEJMoa2007621.” In: New England Journal of Medicine 382.26 (2020), pp. 2582–2582. doi: 10.1056/NEJMc2021225.
[2] Mandeep R Mehra et al. “RETRACTED: Hydroxychloroquine or chloroquine with or without a macrolide for treatment of COVID-19: a multinational registry analysis”. In: The Lancet (2020). doi: https://doi.org/10.1016/S0140-6736(20)31180-6.
[3] Darshali A. Vyas et al. “Hidden in Plain Sight — Reconsidering the Use of Race Correction in Clinical Algorithms”. In: New England Journal of Medicine (2020). doi: 10.1056/NEJMms2004740.
[4] Elizabeth J. Williamson et al. “OpenSAFELY: factors associated with COVID-19 death in 17 million patients”. In: Nature (2020). issn: 1476-4687.

Preface

This text introduces statistics and its applications in the life sciences and biomedical research. It is based on the freely available OpenIntro Statistics, and, like OpenIntro, it may be downloaded at no cost. [5]

In writing Introductory Statistics for the Life and Biomedical Sciences, we have added substantial new material, but also retained some examples and exercises from OpenIntro that illustrate important ideas even if they do not relate directly to medicine or the life sciences. Because of its link to the original OpenIntro project, this text is often referred to as OpenIntro Biostatistics in the supplementary materials.

This text is intended for undergraduate and graduate students interested in careers in biology or medicine, and may also be profitably read by students of public health or medicine. It covers many of the traditional introductory topics in statistics, in addition to discussing some newer methods being used in molecular biology.

Statistics has become an integral part of research in medicine and biology, and the tools for summarizing data and drawing inferences from data are essential both for understanding the outcomes of studies and for incorporating measures of uncertainty into that understanding.
An introductory text in statistics for students who will work in medicine, public health, or the life sciences should be more than simply the usual introduction, supplemented with an occasional example from biology or medical science. By drawing the majority of examples and exercises in this text from published data, we hope to convey the value of statistics in medical and biological research. In cases where examples draw on important material in biology or medicine, the problem statement contains the necessary background information.

Computing is an essential part of the practice of statistics. Nearly everyone entering the biomedical sciences will need to interpret the results of analyses conducted in software; many will also need to be capable of conducting such analyses. The text and associated materials separate those two activities to allow students and instructors to emphasize either or both skills. The text discusses the important features of figures and tables used to support an interpretation, rather than the process of generating such material from data. This allows students whose main focus is understanding statistical concepts not to be distracted by the details of a particular software package. In our experience, however, we have found that many students enter a research setting after only a single course in statistics. These students benefit from a practical introduction to data analysis that incorporates the use of a statistical computing language. The self-paced learning labs associated with the text provide such an introduction; these are described in more detail later in this preface.

The datasets used in this book are available via the R openintro package available on CRAN [6] and the R oibiostat package available via GitHub.

[5] PDF available at https://www.openintro.org/book/biostat/ and source available at https://github.com/OI-Biostat/oi_biostat_text.
[6] Diez DM, Barr CD, Çetinkaya-Rundel M. 2012. openintro: OpenIntro data sets and supplement functions. http://cran.r-project.org/web/packages/openintro.

Textbook overview

The chapters of this book are as follows:

1. Introduction to data. Data structures, basic data collection principles, numerical and graphical summaries, and exploratory data analysis.
2. Probability. The basic principles of probability.
3. Distributions of random variables. Introduction to random variables, distributions of discrete and continuous random variables, and distributions for pairs of random variables.
4. Foundations for inference. General ideas for statistical inference in the context of estimating a population mean.
5. Inference for numerical data. Inference for one-sample and two-sample means with the t-distribution, power calculations for a difference of means, and ANOVA.
6. Simple linear regression. An introduction to linear regression with a single explanatory variable, evaluating model assumptions, and inference in a regression context.
7. Multiple linear regression. General multiple regression model, categorical predictors with more than two values, interaction, and model selection.
8. Inference for categorical data. Inference for single proportions, inference for two or more groups, and outcome-based sampling.

Examples, exercises, and appendices

Examples in the text help with an understanding of how to apply methods:

EXAMPLE 0.1
This is an example. When a question is asked here, where can the answer be found?

The answer can be found here, in the solution section of the example.

When we think the reader would benefit from working out the solution to an example, we frame it as Guided Practice.

GUIDED PRACTICE 0.2
The reader may check or learn the answer to any Guided Practice problem by reviewing the full solution in a footnote. [7]

There are exercises at the end of each chapter that are useful for practice or homework assignments. Solutions to odd-numbered problems can be found in Appendix A.
Readers will notice that there are fewer end of chapter exercises in the last three chapters. The more complicated methods, such as multiple regression, do not always lend themselves to hand calculation, and computing is increasingly important both to gain practical experience with these methods and to explore complex datasets. For students more interested in concepts than computing, however, we have included useful end of chapter exercises that emphasize the interpretation of output from statistical software.

Probability tables for the normal, t, and chi-square distributions are in Appendix B, and PDF copies of these tables are also available from openintro.org for anyone to download, print, share, or modify. The labs and the text also illustrate the use of simple R commands to calculate probabilities from common distributions.

[7] Guided Practice problems are intended to stretch your thinking, and you can check yourself by reviewing the footnote solution for any Guided Practice.

Self-paced learning labs

The labs associated with the text can be downloaded from github.com/OI-Biostat/oi_biostat_labs. They provide guidance on conducting data analysis and visualization with the R statistical language and the computing environment RStudio, while building understanding of statistical concepts. The labs begin from first principles and require no previous experience with statistical software. Both R and RStudio are freely available for all major computing operating systems, and the Unit 0 labs (00_getting_started) provide information on downloading and installing them. Information on downloading and installing the packages may also be found at openintro.org.

The labs for each chapter all have the same structure. Each lab consists of a set of three documents: a handout with the problem statements, a template to be used for working through the lab, and a solution set with the problem solutions.
The handout and solution set are most easily read in PDF format (although Rmd files are also provided), while the template is an Rmd file that can be loaded into RStudio. Each chapter of labs is accompanied by a set of "Lab Notes", which provides a reference guide of all new R functions discussed in the labs.

Learning is best done, of course, if a student attempts the lab exercises before reading the solutions. The "Lab Notes" may be a useful resource to refer to while working through problems.

OpenIntro, online resources, and getting involved

OpenIntro is an organization focused on developing free and affordable education materials. The first project, OpenIntro Statistics, is intended for introductory statistics courses at the high school through university levels. Other projects examine the use of randomization methods for learning about statistics and conducting analyses (Introductory Statistics with Randomization and Simulation) and advanced statistics that may be taught at the high school level (Advanced High School Statistics).

We encourage anyone learning or teaching statistics to visit openintro.org and get involved by using the many online resources, which are all free, or by creating new material. Students can test their knowledge with practice quizzes, or try an application of concepts learned in each chapter using real data and the free statistical software R. Teachers can download the source for course materials, labs, slides, datasets, R figures, or create their own custom quizzes and problem sets for students to take on the website. Everyone is also welcome to download the book’s source files to create a custom version of this textbook or to simply share a PDF copy with a friend or on a website. All of these products are free, and anyone is welcome to use these online tools and resources with or without this textbook as a companion.
Acknowledgements

The OpenIntro project would not have been possible without the dedication of many people, including the authors of OpenIntro Statistics, the OpenIntro team and the many faculty, students, and readers who commented on all the editions of OpenIntro Statistics.

This text has benefited from feedback from Andrea Foulkes, Raji Balasubramanian, Curry Hilton, Michael Parzen, Kevin Rader, and the many excellent teaching fellows at Harvard College who assisted in courses using the book.

The cover design was provided by Pierre Baduel.

Chapter 1
Introduction to data

1.1 Case study
1.2 Data basics
1.3 Data collection principles
1.4 Numerical data
1.5 Categorical data
1.6 Relationships between two variables
1.7 Exploratory data analysis
1.8 Notes
1.9 Exercises

Making observations and recording data form the backbone of empirical research, and represent the beginning of a systematic approach to investigating scientific questions. As a discipline, statistics focuses on addressing the following three questions in a rigorous and efficient manner:

How can data best be collected?
How should data be analyzed?
What can be inferred from data?

This chapter provides a brief discussion on the principles of data collection, and introduces basic methods for summarizing and exploring data.

For labs, slides, and other resources, please visit www.openintro.org/book/biostat

1.1 Case study: preventing peanut allergies

The proportion of young children in Western countries with peanut allergies has doubled in the last 10 years. Previous research suggests that exposing infants to peanut-based foods, rather than excluding such foods from their diets, may be an effective strategy for preventing the development of peanut allergies. The "Learning Early about Peanut Allergy" (LEAP) study was conducted to investigate whether early exposure to peanut products reduces the probability that a child will develop peanut allergies. [1]

The study team enrolled children in the United Kingdom between 2006 and 2009, selecting 640 infants with eczema, egg allergy, or both. Each child was randomly assigned to either the peanut consumption (treatment) group or the peanut avoidance (control) group. Children in the treatment group were fed at least 6 grams of peanut protein daily until 5 years of age, while children in the control group avoided consuming peanut protein until 5 years of age.

At 5 years of age, each child was tested for peanut allergy using an oral food challenge (OFC): 5 grams of peanut protein in a single dose. A child was recorded as passing the oral food challenge if no allergic reaction was detected, and failing the oral food challenge if an allergic reaction occurred. These children had previously been tested for peanut allergy through a skin test, conducted at the time of study entry; the main analysis presented in the paper was based on data from 530 children with an earlier negative skin test. [2]

Individual-level data from the study are shown in Figure 1.1 for 5 of the 530 children; each row represents a participant and shows the participant’s study ID number, treatment group assignment, and OFC outcome. [3]

participant.ID   treatment.group      overall.V60.outcome
LEAP_100522      Peanut Consumption   PASS OFC
LEAP_103358      Peanut Consumption   PASS OFC
LEAP_105069      Peanut Avoidance     PASS OFC
LEAP_994047      Peanut Avoidance     PASS OFC
LEAP_997608      Peanut Consumption   PASS OFC

Figure 1.1: Individual-level LEAP results, for five children.

The data can be organized in the form of a two-way summary table; Figure 1.2 shows the results categorized by treatment group and OFC outcome.

                     FAIL OFC   PASS OFC   Sum
Peanut Avoidance           36        227   263
Peanut Consumption          5        262   267
Sum                        41        489   530

Figure 1.2: Summary of LEAP results, organized by treatment group (either peanut avoidance or consumption) and result of the oral food challenge at 5 years of age (either pass or fail).

[1] Du Toit, George, et al. Randomized trial of peanut consumption in infants at risk for peanut allergy. New England Journal of Medicine 372.9 (2015): 803-813.
[2] Although a total of 542 children had an earlier negative skin test, data collection did not occur for 12 children.
[3] The data are available as LEAP in the R package oibiostat.

The summary table makes it easier to identify patterns in the data. Recall that the question of interest is whether children in the peanut consumption group are more or less likely to develop peanut allergies than those in the peanut avoidance group. In the avoidance group, the proportion of children failing the OFC is 36/263 = 0.137 (13.7%); in the consumption group, the proportion of children failing the OFC is 5/267 = 0.019 (1.9%). Figure 1.3 shows a graphical method of displaying the study results, using either the number of individuals per category from Figure 1.2 or the proportion of individuals with a specific OFC outcome in a group.

Figure 1.3: (a) A bar plot displaying the number of individuals who failed or passed the OFC in each treatment group. (b) A bar plot displaying the proportions of individuals in each group that failed or passed the OFC.

The proportion of participants failing the OFC is 11.8 percentage points higher in the peanut avoidance group than in the peanut consumption group (13.7% versus 1.9%). Another way to summarize the data is to compute the ratio of the two proportions, which is 7.31 when calculated from the unrounded values (36/263)/(5/267), and conclude that the proportion of participants failing the OFC in the avoidance group is more than 7 times as large as in the consumption group; i.e., the risk of failing the OFC was more than 7 times as great for participants in the avoidance group relative to the consumption group.
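The arithmetic behind these two summaries can be checked directly from the counts in Figure 1.2. The following is a minimal Python sketch and is not part of the original text, whose labs use R; the variable and function names are illustrative choices. It computes the proportion failing the OFC in each group, the difference between the two proportions, and their ratio from the unrounded values.

```python
# Counts from Figure 1.2, by treatment group and OFC outcome.
counts = {
    "Peanut Avoidance": {"fail": 36, "pass": 227},
    "Peanut Consumption": {"fail": 5, "pass": 262},
}


def proportion_failing(group):
    """Proportion of children in the group who failed the OFC."""
    cell = counts[group]
    return cell["fail"] / (cell["fail"] + cell["pass"])


p_avoid = proportion_failing("Peanut Avoidance")
p_consume = proportion_failing("Peanut Consumption")

# Difference in proportions (on the 0-1 scale) and ratio of proportions.
difference = p_avoid - p_consume
ratio = p_avoid / p_consume

print(f"Proportion failing, avoidance:   {p_avoid:.3f}")
print(f"Proportion failing, consumption: {p_consume:.3f}")
print(f"Difference in proportions:       {difference:.3f}")
print(f"Ratio of proportions:            {ratio:.2f}")
```

Dividing the rounded proportions instead (0.137/0.019) gives about 7.21, so the ratio of 7.31 reported in the text corresponds to the unrounded values.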
Based on the results of the study, it seems that early exposure to peanut products may be an effective strategy for reducing the chances of developing peanut allergies later in life. It is important to note that this study was conducted in the United Kingdom at a single site of pediatric care; it is not clear that these results can be generalized to other countries or cultures.

The results also raise an important statistical issue: does the study provide definitive evidence that peanut consumption is beneficial? In other words, is the difference of 11.8 percentage points between the two groups larger than one would expect by chance variation alone? The material on inference in later chapters will provide the statistical tools to evaluate this question.

1.2 Data basics

Effective organization and description of data is a first step in most analyses. This section introduces a structure for organizing data and basic terminology used to describe data.

1.2.1 Observations, variables, and data matrices

In evolutionary biology, parental investment refers to the amount of time, energy, or other resources devoted towards raising offspring. This section introduces the frog dataset, which originates from a 2013 study about maternal investment in a frog species. [4] Reproduction is a costly process for female frogs, necessitating a trade-off between individual egg size and total number of eggs produced. Researchers were interested in investigating how maternal investment varies with altitude and collected measurements on egg clutches found at breeding ponds across 11 study sites; for 5 sites, the body size of individual female frogs was also recorded.

      altitude   latitude   egg.size   clutch.size   clutch.volume   body.size
1     3,462.00      34.82       1.95        181.97          177.83        3.63
2     3,462.00      34.82       1.95        269.15          257.04        3.63
3     3,462.00      34.82       1.95        158.49          151.36        3.72
150   2,597.00      34.05       2.24        537.03          776.25          NA

Figure 1.4: Data matrix for the frog dataset.
Figure 1.4 displays rows 1, 2, 3, and 150 of the data from the 431 clutches observed as part of the study. [5] Each row in the table corresponds to a single clutch, indicating where the clutch was collected (altitude and latitude), egg.size, clutch.size, clutch.volume, and body.size of the mother when available. "NA" corresponds to a missing value, indicating that information on an individual female was not collected for that particular clutch. The recorded characteristics are referred to as variables; in this table, each column represents a variable.

variable        description
altitude        Altitude of the study site in meters above sea level
latitude        Latitude of the study site measured in degrees
egg.size        Average diameter of an individual egg to the 0.01 mm
clutch.size     Estimated number of eggs in clutch
clutch.volume   Volume of egg clutch in mm³
body.size       Length of mother frog in cm

Figure 1.5: Variables and their descriptions for the frog dataset.

It is important to check the definitions of variables, as they are not always obvious. For example, why has clutch.size not been recorded as whole numbers? For a given clutch, researchers counted approximately 5 grams’ worth of eggs and then estimated the total number of eggs based on the mass of the entire clutch. Definitions of the variables are given in Figure 1.5. [6]

[4] Chen, W., et al. Maternal investment increases with altitude in a frog on the Tibetan Plateau. Journal of evolutionary biology 26.12 (2013): 2710-2715.
[5] The frog dataset is available in the R package oibiostat.
[6] The data discussed here are in the original scale; in the published paper, some values have undergone a natural log transformation.

The data in Figure 1.4 are organized as a data matrix. Each row of a data matrix corresponds to an observational unit, and each column corresponds to a variable.
A piece of the data matrix for the LEAP study introduced in Section 1.1 is shown in Figure 1.1; the rows are study participants and three variables are shown for each participant. Data matrices are a convenient way to record and store data. If the data are collected for another individual, another row can easily be added; similarly, another column can be added for a new variable.

1.2.2 Types of variables

The Functional polymorphisms Associated with human Muscle Size and Strength study (FAMuSS) measured a variety of demographic, phenotypic, and genetic characteristics for about 1,300 participants. [7] Data from the study have been used in a number of subsequent studies, [8] such as one examining the relationship between muscle strength and genotype at a location on the ACTN3 gene. [9]

The famuss dataset is a subset of the data for 595 participants. [10] Four rows of the famuss dataset are shown in Figure 1.6, and the variables are described in Figure 1.7.

      sex      age   race        height   weight   actn3.r577x   ndrm.ch
1     Female    27   Caucasian     65.0    199.0   CC               40.0
2     Male      36   Caucasian     71.7    189.0   CT               25.0
3     Female    24   Caucasian     65.0    134.0   CT               40.0
595   Female    30   Caucasian     64.0    134.0   CC               43.8

Figure 1.6: Four rows from the famuss data matrix.

variable      description
sex           Sex of the participant
age           Age in years
race          Race, recorded as African Am (African American), Caucasian, Asian, Hispanic or Other
height        Height in inches
weight        Weight in pounds
actn3.r577x   Genotype at the location r577x in the ACTN3 gene.
ndrm.ch       Percent change in strength in the non-dominant arm, comparing strength after to before training

Figure 1.7: Variables and their descriptions for the famuss dataset.

The variables age, height, weight, and ndrm.ch are numerical variables. They take on numerical values, and it is reasonable to add, subtract, or take averages with these values.
In contrast, a variable reporting telephone numbers would not be classified as numerical, since sums, differences, and averages in this context have no meaning. Age measured in years is said to be discrete, since it can only take on numerical values with jumps; i.e., positive integer values. Percent change in strength in the non-dominant arm (ndrm.ch) is continuous, and can take on any value within a specified range.

[7] Thompson PD, Moyna M, Seip, R, et al., 2004. Functional Polymorphisms Associated with Human Muscle Size and Strength. Medicine and Science in Sports and Exercise 36:1132 - 1139.
[8] Pescatello L, et al. Highlights from the functional single nucleotide polymorphisms associated with human muscle size and strength or FAMuSS study, BioMed Research International 2013.
[9] Clarkson P, et al., Journal of Applied Physiology 99: 154-163, 2005.
[10] The subset is from Foulkes, Andrea S. Applied statistical genetics with R: for population-based association studies. Springer Science & Business Media, 2009. The full version of the data is available at http://people.umass.edu/foulkes/asg/data.html.

Figure 1.8: Breakdown of variables into their respective types.

The variables sex, race, and actn3.r577x are categorical variables, which take on values that are names or labels. The possible values of a categorical variable are called the variable’s levels. [11] For example, the levels of actn3.r577x are the three possible genotypes at this particular locus: CC, CT, or TT. Categorical variables without a natural ordering are called nominal categorical variables; sex, race, and actn3.r577x are all nominal categorical variables. Categorical variables with levels that have a natural ordering are referred to as ordinal categorical variables. For example, age of the participants grouped into 5-year intervals (15-20, 21-25, 26-30, etc.) is an ordinal categorical variable.
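To make the distinction between variable types concrete, the sketch below stores the four famuss rows from Figure 1.6 as a small data matrix in Python (a list of rows, each a dictionary keyed by variable name). It is an illustrative sketch rather than anything from the original text: the names are hypothetical, and the age-interval labels beyond 26-30 are an assumption that extends the pattern given in the text. Numerical variables are summarized by averaging, nominal variables by counting levels, and the ordinal age groups by sorting according to the order of their levels.

```python
from collections import Counter
from statistics import mean

# The four rows of the famuss data matrix shown in Figure 1.6.
famuss_rows = [
    {"sex": "Female", "age": 27, "race": "Caucasian", "height": 65.0,
     "weight": 199.0, "actn3.r577x": "CC", "ndrm.ch": 40.0},
    {"sex": "Male", "age": 36, "race": "Caucasian", "height": 71.7,
     "weight": 189.0, "actn3.r577x": "CT", "ndrm.ch": 25.0},
    {"sex": "Female", "age": 24, "race": "Caucasian", "height": 65.0,
     "weight": 134.0, "actn3.r577x": "CT", "ndrm.ch": 40.0},
    {"sex": "Female", "age": 30, "race": "Caucasian", "height": 64.0,
     "weight": 134.0, "actn3.r577x": "CC", "ndrm.ch": 43.8},
]

# Variable types as classified in the text.
NUMERICAL = ["age", "height", "weight", "ndrm.ch"]
NOMINAL = ["sex", "race", "actn3.r577x"]

# Ordered levels for the ordinal example (age grouped into intervals).
# Assumption: levels past "26-30" continue the pattern given in the text.
AGE_LEVELS = ["15-20", "21-25", "26-30", "31-35", "36-40"]


def age_group(age):
    """Map an age in whole years to its ordinal level."""
    for label in AGE_LEVELS:
        low, high = (int(bound) for bound in label.split("-"))
        if age in range(low, high + 1):
            return label
    raise ValueError(f"age {age} falls outside the assumed levels")


# Averages are meaningful for numerical variables.
numeric_means = {v: mean(row[v] for row in famuss_rows) for v in NUMERICAL}

# For nominal variables, count how often each level occurs.
level_counts = {v: Counter(row[v] for row in famuss_rows) for v in NOMINAL}

# Ordinal levels can be sorted by their natural order, not alphabetically.
age_groups = [age_group(row["age"]) for row in famuss_rows]
ordered_groups = sorted(age_groups, key=AGE_LEVELS.index)

print(numeric_means)
print(level_counts)
print(ordered_groups)
```

With only four rows, the summaries say nothing about the full dataset; the point is only that the type of a variable determines which operations on it make sense.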
EXAMPLE 1.1
Classify the variables in the frog dataset: altitude, latitude, egg.size, clutch.size, clutch.volume, and body.size.

The variables egg.size, clutch.size, clutch.volume, and body.size are continuous numerical variables, and can take on all positive values. In the context of this study, the variables altitude and latitude are best described as categorical variables, since the numerical values of the variables correspond to the 11 specific study sites where data were collected. Researchers were interested in exploring the relationship between altitude and maternal investment; it would be reasonable to consider altitude an ordinal categorical variable.

GUIDED PRACTICE 1.2
Characterize the variables treatment.group and overall.V60.outcome from the LEAP study (discussed in Section 1.1). [12]

GUIDED PRACTICE 1.3
Suppose that on a given day, a research assistant collected data on the first 20 individuals visiting a walk-in clinic: age (measured as less than 21, 21 - 65, and greater than 65 years of age), sex, height, weight, and reason for the visit. Classify each of the variables. [13]

[11] Categorical variables are sometimes called factor variables.
[12] These variables measure non-numerical quantities, and thus are categorical variables with two levels.
[13] Height and weight are continuous numerical variables. Age as measured by the research assistant is ordinal categorical. Sex and the reason for the visit are nominal categorical variables.

1.2.3 Relationships between variables

Many studies are motivated by a researcher examining how two or more variables are related. For example, do the values of one variable increase as the values of another decrease? Do the values of one variable tend to differ by the levels of another variable?

One study used the famuss data to investigate whether ACTN3 genotype at a particular location (residue 577) is associated with change in muscle strength.
The ACTN3 gene codes for a protein involved in muscle function. A common mutation in the gene at a specific location changes the cytosine (C) nucleotide to a thymine (T) nucleotide; individuals with the TT genotype are unable to produce any ACTN3 protein. Researchers hypothesized that genotype at this location might influence muscle function. As a measure of muscle function, they recorded the percent change in non-dominant arm strength after strength training; this variable, ndrm.ch, is the response variable in the study.

A response variable is defined by the particular research question a study seeks to address, and measures the outcome of interest in the study. A study will typically examine whether the values of a response variable differ as values of an explanatory variable change, and if so, how the two variables are related. A given study may examine several explanatory variables for a single response variable.[14] The explanatory variable examined in relation to ndrm.ch in the study is actn3.r577x, ACTN3 genotype at location 577.

EXAMPLE 1.4
In the maternal investment study conducted on frogs, researchers collected measurements on egg clutches and female frogs at 11 study sites, located at differing altitudes, in order to investigate how maternal investment varies with altitude. Identify the response and explanatory variables in the study.

The variables egg.size, clutch.size, and clutch.volume are response variables indicative of maternal investment. The explanatory variable examined in the study is altitude. While latitude is an environmental factor that might potentially influence features of the egg clutches, it is not a variable of interest in this particular study. Female body size (body.size) is neither an explanatory nor response variable.
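One simple way to examine whether a response variable differs across the levels of a categorical explanatory variable is to summarize the response separately within each level. The sketch below does this for ndrm.ch by actn3.r577x genotype. The numbers are invented purely for illustration and say nothing about the actual famuss results, and the function name mean_response_by_level is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Invented (explanatory level, response value) pairs: genotype and ndrm.ch.
# These values are made up and do not reflect the famuss data.
observations = [
    ("CC", 40.0), ("CC", 50.0),
    ("CT", 45.0), ("CT", 55.0),
    ("TT", 60.0), ("TT", 70.0),
]


def mean_response_by_level(pairs):
    """Group response values by explanatory level and average each group."""
    groups = defaultdict(list)
    for level, response in pairs:
        groups[level].append(response)
    return {level: mean(values) for level, values in groups.items()}


summary = mean_response_by_level(observations)
for genotype, avg_change in summary.items():
    print(genotype, avg_change)
```

A difference in group means is only a descriptive summary; whether such a difference reflects a real association is a question for the inference methods introduced later.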
GUIDED PRACTICE 1.5
Refer to the variables from the famuss dataset described in Figure 1.7 to formulate a question about the relationships between these variables, and identify the response and explanatory variables in the context of the question.[15]

[14] Response variables are sometimes called dependent variables and explanatory variables are often called independent variables or predictors.
[15] Two sample questions: (1) Does change in participant arm strength after training seem associated with race? The response variable is ndrm.ch and the explanatory variable is race. (2) Do male participants appear to respond differently to strength training than females? The response variable is ndrm.ch and the explanatory variable is sex.

1.3 Data collection principles

The first step in research is to identify questions to investigate. A clearly articulated research question is essential for selecting subjects to be studied, identifying relevant variables, and determining how data should be collected.

1.3.1 Populations and samples

Consider the following research questions:

1. Do bluefin tuna from the Atlantic Ocean have particularly high levels of mercury, such that they are unsafe for human consumption?
2. For infants predisposed to developing a peanut allergy, is there evidence that introducing peanut products early in life is an effective strategy for reducing the risk of developing a peanut allergy?
3. Does a recently developed drug designed to treat glioblastoma, a form of brain cancer, appear more effective at inducing tumor shrinkage than the drug currently on the market?

Each of these questions refers to a specific target population. For example, in the first question, the target population consists of all bluefin tuna from the Atlantic Ocean; each individual bluefin tuna represents a case. It is almost always either too expensive or logistically impossible to collect data for every case in a population.
As a result, nearly all research is based on information obtained about a sample from the population. A sample represents a small fraction of the population. Researchers interested in evaluating the mercury content of bluefin tuna from the Atlantic Ocean could collect a sample of 500 bluefin tuna (or some other quantity), measure the mercury content, and use the observed information to formulate an answer to the research question.

GUIDED PRACTICE 1.6
Identify the target populations for the remaining two research questions.[16]

[16] In Question 2, the target population consists of infants predisposed to developing a peanut allergy. In Question 3, the target population consists of patients with glioblastoma.

1.3.2 Anecdotal evidence

Anecdotal evidence typically refers to unusual observations that are easily recalled because of their striking characteristics. Physicians may be more likely to remember the characteristics of a single patient with an unusually good response to a drug instead of the many patients who did not respond. The dangers of drawing general conclusions from anecdotal information are obvious; no single observation should be used to draw conclusions about a population.

While it is incorrect to generalize from individual observations, unusual observations can sometimes be valuable. E.C. Heyde was a general practitioner from Vancouver who noticed that a few of his elderly patients with aortic-valve stenosis (an abnormal narrowing) caused by an accumulation of calcium had also suffered massive gastrointestinal bleeding. In 1958, he published his observation.[17] Further research led to the identification of the underlying cause of the association, now called Heyde's Syndrome.[18] An anecdotal observation can never be the basis for a conclusion, but may well inspire the design of a more systematic study that could be definitive.

[17] Heyde EC. Gastrointestinal bleeding in aortic stenosis. N Engl J Med 1958;259:196.
[18] Greenstein RJ, McElhinney AJ, Reuben D, Greenstein AJ. Colonic vascular ectasias and aortic stenosis: coincidence or causal relationship? Am J Surg 1986;151:347-51.

1.3.3 Sampling from a population

Sampling from a population, when done correctly, provides reliable information about the characteristics of a large population. The US Centers for Disease Control and Prevention (US CDC) conducts several surveys to obtain information about the US population, including the Behavioral Risk Factor Surveillance System (BRFSS).[19] The BRFSS was established in 1984 to collect data about health-related risk behaviors, and now collects data from more than 400,000 telephone interviews conducted each year. Data from a recent BRFSS survey are used in Chapter 4. The CDC conducts similar surveys for diabetes, health care access, and immunization. Likewise, the World Health Organization (WHO) conducts the World Health Survey in partnership with approximately 70 countries to learn about the health of adult populations and the health systems in those countries.[20]

The general principle of sampling is straightforward: a sample from a population is useful for learning about a population only when the sample is representative of the population. In other words, the characteristics of the sample should correspond to the characteristics of the population.

Suppose that the quality improvement team at an integrated health care system, such as Harvard Pilgrim Health Care, is interested in learning about how members of the health plan perceive the quality of the services offered under the plan. A common pitfall in conducting a survey is to use a convenience sample, in which individuals who are easily accessible are more likely to be included in the sample than other individuals.
If a sample were collected by approaching plan members visiting an outpatient clinic during a particular week, the sample would fail to enroll generally healthy members who typically do not use outpatient services or schedule routine physical examinations; this method would produce an unrepresentative sample (Figure 1.9).

Figure 1.9: Instead of sampling from all members equally, approaching members visiting a clinic during a particular week disproportionately selects members who frequently use outpatient services.

Random sampling is the best way to ensure that a sample reflects a population. In a simple random sample, each member of a population has the same chance of being sampled. One way to achieve a simple random sample of the health plan members is to randomly select a certain number of names from the complete membership roster, and contact those individuals for an interview (Figure 1.10).

[19] https://www.cdc.gov/brfss/index.html
[20] http://www.who.int/healthinfo/survey/en/

Figure 1.10: Five members are randomly selected from the population to be interviewed.

Even when a simple random sample is taken, it is not guaranteed that the sample is representative of the population. If the non-response rate for a survey is high, that may be indicative of a biased sample. Perhaps a majority of participants did not respond to the survey because only a certain group within the population is being reached; for example, if questions assume that participants are fluent in English, then a high non-response rate would be expected if the population largely consists of individuals who are not fluent in English (Figure 1.11). Such non-response bias can skew results; generalizing from an unrepresentative sample is likely to lead to incorrect conclusions about a population.

Figure 1.11: Surveys may only reach a certain group within the population, which leads to non-response bias.
For example, a survey written in English may only result in responses from health plan members fluent in English.

GUIDED PRACTICE 1.7
It is increasingly common for health care facilities to follow up a patient visit with an email providing a link to a website where patients can rate their experience. Typically, fewer than 50% of patients visit the website. If half of those who respond indicate a negative experience, do you think that this implies that at least 25% of patient visits are unsatisfactory?[21]

[21] It is unlikely that the patients who respond constitute a representative sample from the larger population of patients. This is not a random sample, because individuals are selecting themselves into a group, and it is unclear that each person has an equal chance of answering the survey. If our experience is any guide, dissatisfied people are more likely to respond to these informal surveys than satisfied patients.

1.3.4 Sampling methods

Almost all statistical methods are based on the notion of implied randomness. If data are not sampled from a population at random, these statistical methods – calculating estimates and errors associated with estimates – are not reliable. Four random sampling methods are discussed in this section: simple, stratified, cluster, and multistage sampling.

In a simple random sample, each case in the population has an equal chance of being included in the sample (Figure 1.12). Under simple random sampling, each case is sampled independently of the other cases; i.e., knowing that a certain case is included in the sample provides no information about which other cases have also been sampled.

In stratified sampling, the population is first divided into groups called strata before cases are selected within each stratum (typically through simple random sampling) (Figure 1.12). The strata are chosen such that similar cases are grouped together.
Stratified sampling is especially useful when the cases in each stratum are very similar with respect to the outcome of interest, but cases between strata might be quite different.

Suppose that the health care provider has facilities in different cities. If the range of services offered differs by city, but all locations in a given city offer similar services, it would be effective for the quality improvement team to use stratified sampling to identify participants for their study, where each city represents a stratum and plan members are randomly sampled from each city.

Figure 1.12: Examples of simple random and stratified sampling. In the top panel, simple random sampling is used to randomly select 18 cases (circled orange dots) out of the total population (all dots). The bottom panel illustrates stratified sampling: cases are grouped into six strata, then simple random sampling is employed within each stratum.

In a cluster sample, the population is first divided into many groups, called clusters. Then, a fixed number of clusters is sampled and all observations from each of those clusters are included in the sample (Figure 1.13). A multistage sample is similar to a cluster sample, but rather than keeping all observations in each cluster, a random sample is collected within each selected cluster (Figure 1.13).

Unlike with stratified sampling, cluster and multistage sampling are most helpful when there is high case-to-case variability within a cluster, but the clusters themselves are similar to one another. For example, if neighborhoods in a city represent clusters, cluster and multistage sampling work best when the population within each neighborhood is very diverse, but neighborhoods are relatively similar.
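The four sampling methods can be sketched with Python's standard random module. This is a minimal illustration under stated assumptions, not a survey-sampling implementation: the population, the group names, and all four function names are hypothetical, the same six groups are reused as strata and as clusters only to keep the example short, and a fixed seed is used so the run is repeatable.

```python
import random


def simple_random_sample(population, n, rng):
    """Draw n cases, each case in the population equally likely to be chosen."""
    return rng.sample(population, n)


def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample of fixed size within every stratum."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose clusters, then keep every case in the chosen clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]


def multistage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Randomly choose clusters, then draw a random sample within each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(rng.sample(clusters[name], n_per_cluster))
    return sample


# Hypothetical population: 6 groups of 20 members each (120 cases in total).
rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
groups = {f"group{g}": [f"g{g}-m{i}" for i in range(20)] for g in range(1, 7)}
population = [member for members in groups.values() for member in members]

srs = simple_random_sample(population, 18, rng)  # 18 cases overall
strat = stratified_sample(groups, 3, rng)        # 3 cases from each of 6 strata
clus = cluster_sample(groups, 2, rng)            # all cases from 2 clusters
multi = multistage_sample(groups, 3, 5, rng)     # 5 cases from each of 3 clusters

print(len(srs), len(strat), len(clus), len(multi))
```

Note the structural difference: the stratified sample always contains cases from every group, whereas the cluster and multistage samples contain cases only from the groups that happened to be selected.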
Applying stratified, cluster, or multistage sampling can often be more economical than drawing a simple random sample. However, analysis of data collected using such methods is more complicated than when using data from a simple random sample; this text will only discuss analysis methods for simple random samples.

EXAMPLE 1.8
Suppose researchers are interested in estimating the malaria rate in a densely tropical portion of rural Indonesia. There are 30 villages in the area, each more or less similar to the others. The goal is to test 150 individuals for malaria. Evaluate which sampling method should be employed.

A simple random sample would likely draw individuals from all 30 villages, which could make data collection extremely expensive. Stratified sampling is not advisable, since there is not enough information to determine how strata of similar individuals could be built. However, cluster sampling or multistage sampling are both reasonable options. For example, with multistage sampling, half of the villages could be randomly selected, and then 10 people selected from each village (15 villages × 10 people = 150 individuals). This strategy is more efficient than a simple random sample, and can still provide a sample representative of the population of interest.

1.3.5 Introducing experiments and observational studies

The two primary types of study designs used to collect data are experiments and observational studies.

In an experiment, researchers directly influence how data arise, such as by assigning groups of individuals to different treatments and assessing how the outcome varies across treatment groups. The LEAP study is an example of an experiment with two groups, an experimental group that received the intervention (peanut consumption) and a control group that received a standard approach (peanut avoidance). In studies assessing effectiveness of a new drug, individuals in the control group typically receive a placebo, an inert substance with the appearance of the experimental intervention.
The study is designed such that on average, the only difference between the individuals in the treatment groups is whether or not they consumed peanut protein. This allows for observed differences in experimental outcome to be directly attributed to the intervention and constitute evidence of a causal relationship between intervention and outcome.

In an observational study, researchers merely observe and record data, without interfering with how the data arise. For example, to investigate why certain diseases develop, researchers might collect data by conducting surveys, reviewing medical records, or following a cohort of many similar individuals. Observational studies can provide evidence of an association between variables, but cannot by themselves show a causal connection. However, there are many instances where randomized experiments are unethical, such as when exploring whether lead exposure in young children is associated with cognitive impairment.