Fundamentals of Biostatistics (8th ed.) PDF
Document Details
2016
Bernard Rosner
Summary
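This book, "Fundamentals of Biostatistics" (8th edition) by Bernard Rosner, is an introductory biostatistics text. It covers descriptive statistics (including measures of location and spread), probability, estimation, hypothesis testing, nonparametric methods, regression and correlation, analysis of variance, design and analysis of epidemiologic studies, and survival analysis. It is intended for upper-level undergraduate or graduate students interested in medicine or other health-related areas.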
Full Transcript
Fundamentals of Biostatistics

Copyright 2016 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.

Fundamentals of Biostatistics
8th Edition
Bernard Rosner
Harvard University

Australia Brazil Mexico Singapore United Kingdom United States

This is an electronic version of the print textbook. Due to electronic rights restrictions, some third party content may be suppressed. Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. The publisher reserves the right to remove content from this title at any time if subsequent rights restrictions require it.
For valuable information on pricing, previous editions, changes to current editions, and alternate formats, please visit www.cengage.com/highered to search by ISBN#, author, title, or keyword for materials in your areas of interest.

Important Notice: Media content referenced within the product description or the product text may not be available in the eBook version.

Fundamentals of Biostatistics, Eighth Edition
Bernard Rosner

Product Manager: Rita Lombard
Content Developer: Andrew Coppola
Associate Content Developer: Spencer Arritt
Product Assistant: Kathryn Schrumpf
Marketing Manager: Julie Schuster
Content Project Manager: Cheryll Linthicum
Art Director: Vernon Boes
Manufacturing Planner: Doug Bertke
Intellectual Property Analyst: Christina Ciaramella
Intellectual Property Project Manager: Farah Fard
Text and Cover Designer: C. Miller
Cover Image Credit: Abstract background: iStockPhoto.com/Pobytov; Office worker: Pressmaster/Shutterstock.com; financial diagram: iStockPhoto.com/Petrovich9; Test tube: iStockPhoto/HadelProductions; financial diagram: iStockPhoto.com/SergeyTimashov; Lab glass: iStockPhoto.com/isak55.
Production Service and Compositor: Cenveo® Publisher Services

© 2016, 2011, 2006 Cengage Learning
WCN: 02-200-203

ALL RIGHTS RESERVED. No part of this work covered by the copyright herein may be reproduced, transmitted, stored, or used in any form or by any means graphic, electronic, or mechanical, including but not limited to photocopying, recording, scanning, digitizing, taping, Web distribution, information networks, or information storage and retrieval systems, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the publisher.

For product information and technology assistance, contact us at Cengage Learning Customer & Sales Support, 1-800-354-9706.
For permission to use material from this text or product, submit all requests online at www.cengage.com/permissions.
Further permissions questions can be e-mailed to [email protected].

Library of Congress Control Number: 2015941787
ISBN: 978-1-305-26892-0

Cengage Learning
20 Channel Center Street
Boston, MA 02210
USA

Cengage Learning is a leading provider of customized learning solutions with employees residing in nearly 40 different countries and sales in more than 125 countries around the world. Find your local representative at www.cengage.com.
Cengage Learning products are represented in Canada by Nelson Education, Ltd.
To learn more about Cengage Learning Solutions, visit www.cengage.com.
Purchase any of our products at your local college store or at our preferred online store www.cengagebrain.com.

Printed in the United States of America
Print Number: 01 Print Year: 2015

This book is dedicated to my wife, Cynthia, and my children, Sarah, David, and Laura

Contents

Preface

Chapter 1 General Overview

Chapter 2 Descriptive Statistics
2.1 Introduction
2.2 Measures of Location
2.3 Some Properties of the Arithmetic Mean
2.4 Measures of Spread
2.5 Some Properties of the Variance and Standard Deviation
2.6 The Coefficient of Variation
2.7 Grouped Data
2.8 Graphic Methods
2.9 Case Study 1: Effects of Lead Exposure on Neurological and Psychological Function in Children
2.10 Case Study 2: Effects of Tobacco Use on Bone-Mineral Density in Middle-Aged Women
2.11 Obtaining Descriptive Statistics on the Computer
2.12 Summary
Problems
Chapter 3 Probability
3.1 Introduction
3.2 Definition of Probability
3.3 Some Useful Probabilistic Notation
3.4 The Multiplication Law of Probability
3.5 The Addition Law of Probability
3.6 Conditional Probability
3.7 Bayes’ Rule and Screening Tests
3.8 Bayesian Inference
3.9 ROC Curves
3.10 Prevalence and Incidence
3.11 Summary
Problems

Chapter 4 Discrete Probability Distributions
4.1 Introduction
4.2 Random Variables
4.3 The Probability-Mass Function for a Discrete Random Variable
4.4 The Expected Value of a Discrete Random Variable
4.5 The Variance of a Discrete Random Variable
4.6 The Cumulative-Distribution Function of a Discrete Random Variable
4.7 Permutations and Combinations
4.8 The Binomial Distribution
4.9 Expected Value and Variance of the Binomial Distribution
4.10 The Poisson Distribution
4.11 Computation of Poisson Probabilities
4.12 Expected Value and Variance of the Poisson Distribution
4.13 Poisson Approximation to the Binomial Distribution
4.14 Summary
Problems

Chapter 5 Continuous Probability Distributions
5.1 Introduction
5.2 General Concepts
5.3 The Normal Distribution
5.4 Properties of the Standard Normal Distribution
5.5 Conversion from an $N(\mu, \sigma^2)$ Distribution to an $N(0, 1)$ Distribution
5.6 Linear Combinations of Random Variables
5.7 Normal Approximation to the Binomial Distribution
5.8 Normal Approximation to the Poisson Distribution
5.9 Summary
Problems

Chapter 6 Estimation
6.1 Introduction
6.2 The Relationship Between Population and Sample
6.3 Random-Number Tables
6.4 Randomized Clinical Trials
6.5 Estimation of the Mean of a Distribution
6.6 Case Study: Effects of Tobacco Use on Bone-Mineral Density (BMD) in Middle-Aged Women
6.7 Estimation of the Variance of a Distribution
6.8 Estimation for the Binomial Distribution
6.9 Estimation for the Poisson Distribution
6.10 One-Sided Confidence Intervals
6.11 The Bootstrap
6.12 Summary
Problems

Chapter 7 Hypothesis Testing: One-Sample Inference
7.1 Introduction
7.2 General Concepts
7.3 One-Sample Test for the Mean of a Normal Distribution: One-Sided Alternatives
7.4 One-Sample Test for the Mean of a Normal Distribution: Two-Sided Alternatives
7.5 The Relationship Between Hypothesis Testing and Confidence Intervals
7.6 The Power of a Test
7.7 Sample-Size Determination
7.8 One-Sample $\chi^2$ Test for the Variance of a Normal Distribution
7.9 One-Sample Inference for the Binomial Distribution
7.10 One-Sample Inference for the Poisson Distribution
7.11 Case Study: Effects of Tobacco Use on Bone-Mineral Density in Middle-Aged Women
7.12 Derivation of Selected Formulas
7.13 Summary
Problems

Chapter 8 Hypothesis Testing: Two-Sample Inference
8.1 Introduction
8.2 The Paired t Test
8.3 Interval Estimation for the Comparison of Means from Two Paired Samples
8.4 Two-Sample t Test for Independent Samples with Equal Variances
8.5 Interval Estimation for the Comparison of Means from Two Independent Samples (Equal Variance Case)
8.6 Testing for the Equality of Two Variances
8.7 Two-Sample t Test for Independent Samples with Unequal Variances
8.8 Case Study: Effects of Lead Exposure on Neurologic and Psychological Function in Children
8.9 Estimation of Sample Size and Power for Comparing Two Means
8.10 The Treatment of Outliers
8.11 Derivation of Equation 8.13
8.12 Summary
Problems

Chapter 9 Nonparametric Methods
9.1 Introduction
9.2 The Sign Test
9.3 The Wilcoxon Signed-Rank Test
9.4 The Wilcoxon Rank-Sum Test
9.5 Case Study: Effects of Lead Exposure on Neurological and Psychological Function in Children
9.6 Permutation Tests
9.7 Summary
Problems

Chapter 10 Hypothesis Testing: Categorical Data
10.1 Introduction
10.2 Two-Sample Test for Binomial Proportions
10.3 Fisher’s Exact Test
10.4 Two-Sample Test for Binomial Proportions for Matched-Pair Data (McNemar’s Test)
10.5 Estimation of Sample Size and Power for Comparing Two Binomial Proportions
10.6 R × C Contingency Tables
10.7 Chi-Square Goodness-of-Fit Test
10.8 The Kappa Statistic
10.9 Derivation of Selected Formulas
10.10 Summary
Problems

Chapter 11 Regression and Correlation Methods
11.1 Introduction
11.2 General Concepts
11.3 Fitting Regression Lines: The Method of Least Squares
11.4 Inferences About Parameters from Regression Lines
11.5 Interval Estimation for Linear Regression
11.6 Assessing the Goodness of Fit of Regression Lines
11.7 The Correlation Coefficient
11.8 Statistical Inference for Correlation Coefficients
11.9 Multiple Regression
11.10 Case Study: Effects of Lead Exposure on Neurologic and Psychological Function in Children
11.11 Partial and Multiple Correlation
11.12 Rank Correlation
11.13 Interval Estimation for Rank-Correlation Coefficients
11.14 Derivation of Equation 11.26
11.15 Summary
Problems

Chapter 12 Multisample Inference
12.1 Introduction to the One-Way Analysis of Variance
12.2 One-Way ANOVA: Fixed-Effects Model
12.3 Hypothesis Testing in One-Way ANOVA: Fixed-Effects Model
12.4 Comparisons of Specific Groups in One-Way ANOVA
12.5 Case Study: Effects of Lead Exposure on Neurologic and Psychological Function in Children
12.6 Two-Way ANOVA
12.7 The Kruskal-Wallis Test
12.8 One-Way ANOVA: The Random-Effects Model
12.9 The Intraclass Correlation Coefficient
12.10 Mixed Models
12.11 Derivation of Equation 12.30
12.12 Summary
Problems

Chapter 13 Design and Analysis Techniques for Epidemiologic Studies
13.1 Introduction
13.2 Study Design
13.3 Measures of Effect for Categorical Data
13.4 Attributable Risk
13.5 Confounding and Standardization
13.6 Methods of Inference for Stratified Categorical Data: The Mantel-Haenszel Test
13.7 Multiple Logistic Regression
13.8 Extensions to Logistic Regression
13.9 Sample Size Estimation for Logistic Regression
13.10 Meta-Analysis
13.11 Equivalence Studies
13.12 The Cross-Over Design
13.13 Clustered Binary Data
13.14 Longitudinal Data Analysis
13.15 Measurement-Error Methods
13.16 Missing Data
13.17 Derivation of 100% × (1 – α) CI for the Risk Difference
13.18 Summary
Problems

Chapter 14 Hypothesis Testing: Person-Time Data
14.1 Measure of Effect for Person-Time Data
14.2 One-Sample Inference for Incidence-Rate Data
14.3 Two-Sample Inference for Incidence-Rate Data
14.4 Power and Sample-Size Estimation for Person-Time Data
14.5 Inference for Stratified Person-Time Data
14.6 Power and Sample-Size Estimation for Stratified Person-Time Data
14.7 Testing for Trend: Incidence-Rate Data
14.8 Introduction to Survival Analysis
14.9 Estimation of Survival Curves: The Kaplan-Meier Estimator
14.10 The Log-Rank Test
14.11 The Proportional-Hazards Model
14.12 Power and Sample-Size Estimation under the Proportional-Hazards Model
14.13 Parametric Survival Analysis
14.14 Parametric Regression Models for Survival Data
14.15 Derivation of Selected Formulas
14.16 Summary
Problems

APPENDIX Tables
1 Exact binomial probabilities $\Pr(X = k) = \binom{n}{k} p^k q^{n-k}$
2 Exact Poisson probabilities
3 The normal distribution
4 Table of 1000 random digits
5 Percentage points of the t distribution ($t_{d,u}$)
6 Percentage points of the chi-square distribution ($\chi^2_{d,u}$)
7 Confidence limits for the expectation of a Poisson variable (µ)
8 Percentage points of the F distribution ($F_{d_1,d_2,p}$)
9 Critical values for the ESD (Extreme Studentized Deviate) outlier statistic ($ESD_{n,1-\alpha}$, $\alpha = .05, .01$)
10 Two-tailed critical values for the Wilcoxon signed-rank test
11 Two-tailed critical values for the Wilcoxon rank-sum test
12 Fisher’s z transformation
13 Two-tailed upper critical values for the Spearman rank-correlation coefficient ($r_s$)
14 Critical values for the Kruskal-Wallis test statistic ($H$) for selected sample sizes for $k = 3$
15 Critical values for the studentized range statistic $q^*$, $\alpha = .05$

Answers to Selected Problems
Flowchart: Methods of Statistical Inference
Index of Data Sets
Index of Statistical Software
Subject Index
Index of Applications

Preface

This introductory-level biostatistics text is designed for upper-level undergraduate or graduate students interested in medicine or other health-related areas. It requires no previous background in statistics, and its mathematical level assumes only a knowledge of algebra.

Fundamentals of Biostatistics evolved from notes that I have used in a biostatistics course taught to Harvard University undergraduates, Harvard Medical School, and Harvard School of Public Health students over the past 30 years. I wrote this book to help motivate students to master the statistical methods that are most often used in the medical literature. From the student’s viewpoint, it is important that the example material used to develop these methods is representative of what actually exists in the literature. Therefore, most of the examples and exercises in this book are based either on actual articles from the medical literature or on actual medical research problems I have encountered during my consulting experience at the Harvard Medical School.

The Approach

Most introductory statistics texts either use a completely nonmathematical, cookbook approach or develop the material in a rigorous, sophisticated mathematical framework. In this book, however, I follow an intermediate course, minimizing the amount of mathematical formulation but giving complete explanations of all important concepts. Every new concept in this book is developed systematically through completely worked-out examples from current medical research problems. In addition, I introduce computer output where appropriate to illustrate these concepts.
I initially wrote this text for the introductory biostatistics course. However, the field has changed dramatically over the past 30 years; because of the increased power of newer statistical packages, we can now perform more sophisticated data analyses than ever before. Therefore, a second goal of this text is to present these new techniques at an introductory level so that students can become familiar with them without having to wade through specialized (and, usually, more advanced) statistical texts.

To differentiate these two goals more clearly, I included most of the content for the introductory course in the first 12 chapters. More advanced statistical techniques used in recent epidemiologic studies are covered in Chapter 13, “Design and Analysis Techniques for Epidemiologic Studies,” and Chapter 14, “Hypothesis Testing: Person-Time Data.”

Changes in the Eighth Edition

For this edition, I have added three new sections and added new content to three other sections. Features new to this edition include the following:

The data sets are now available on the book’s Companion Website at www.cengage.com/statistics/rosner in an expanded set of formats, including Excel, Minitab®, SPSS, JMP, SAS, Stata, R, and ASCII formats.

Data and medical research findings in Examples have been updated.
New or expanded coverage of the following topics has been added:
The Bootstrap (Section 6.11)
One-sample inference for the Binomial Distribution (Section 7.9)
Permutation Tests (Section 9.6)
Sample size estimation for logistic regression (Section 13.9)
Estimation of survival curves: The Kaplan-Meier Estimator (Section 14.9)
Derivation of selected formulas (Sections 7.12, 8.11, 10.9, 11.14, 12.11, 13.17, 14.15)

The new sections and the expanded sections for this edition have been indicated by an asterisk in the table of contents.

Exercises

This edition contains 1,490 exercises; 171 of these exercises are new. Data and medical research findings in the problems have been updated where appropriate. All problems based on the data sets are included. Problems marked by an asterisk (*) at the end of each chapter have corresponding brief solutions in the answer section at the back of the book. Based on requests from students for more completely solved problems, approximately 600 additional problems and complete solutions are presented in the Study Guide available on the Companion Website accompanying this text. In addition, approximately 100 of these problems are included in a Miscellaneous Problems section and are randomly ordered so that they are not tied to a specific chapter in the book. This gives the student additional practice in determining what method to use in what situation. Complete instructor solutions to all exercises are available at the instructor companion website at cengage.com/statistics/rosner.

Computation Method

The method of handling computations is similar to that used in the seventh edition. All intermediate results are carried to full precision (10+ significant digits), even though they are presented with fewer significant digits (usually 2 or 3) in the text. Thus, intermediate results may seem inconsistent with final results in some instances; this, however, is not the case.
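The effect described under Computation Method can be illustrated with a minimal Python sketch. The function name `z_statistic` and all the numbers below are hypothetical and chosen only for illustration; they are not taken from the book. The sketch compares a statistic computed from a full-precision standard error with the same statistic recomputed from a standard error that was first rounded to two decimals, as a reader working from printed intermediate values might do.

```python
import math


def z_statistic(sample_mean, mu0, sd, n, round_intermediate=False, digits=2):
    """One-sample z statistic (sample_mean - mu0) / (sd / sqrt(n)).

    If round_intermediate is True, the standard error is rounded to
    `digits` decimals before it is used, mimicking a recomputation from
    the rounded value printed in a text. Illustrative helper only.
    """
    standard_error = sd / math.sqrt(n)
    if round_intermediate:
        standard_error = round(standard_error, digits)
    return (sample_mean - mu0) / standard_error


# Hypothetical inputs: mean 1.0, null value 0.0, sd 1.0, n = 7.
full = z_statistic(1.0, 0.0, 1.0, 7)
from_rounded = z_statistic(1.0, 0.0, 1.0, 7, round_intermediate=True)

print(f"standard error shown in text: {round(1.0 / math.sqrt(7), 2)}")
print(f"z from full-precision SE:     {full:.2f}")
print(f"z from rounded SE:            {from_rounded:.2f}")
```

With these inputs the two printed z values differ in the second decimal place even though both follow the same formula, which is the kind of apparent inconsistency the paragraph above refers to.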
Organization

Fundamentals of Biostatistics, Eighth Edition, is organized as follows.

Chapter 1 is an introductory chapter that contains an outline of the development of an actual medical study with which I was involved. It provides a unique sense of the role of biostatistics in medical research.

Chapter 2 concerns descriptive statistics and presents all the major numeric and graphic tools used for displaying medical data. This chapter is especially important for both consumers and producers of medical literature because much information is actually communicated via descriptive material.

Chapters 3 through 5 discuss probability. The basic principles of probability are developed, and the most common probability distributions, such as the binomial and normal distributions, are introduced. These distributions are used extensively in later chapters of the book. The concepts of prior probability and posterior probability are also introduced.

Chapters 6 through 10 cover some of the basic methods of statistical inference.

Chapter 6 introduces the concept of drawing random samples from populations. The difficult notion of a sampling distribution is developed and includes an introduction to the most common sampling distributions, such as the t and chi-square distributions. The basic methods of estimation, including an extensive discussion of confidence intervals, are also presented. In addition, the bootstrap method for obtaining confidence limits is introduced for the first time.
Chapters 7 and 8 contain the basic principles of hypothesis testing. The most elementary hypothesis tests for normally distributed data, such as the t test, are also fully discussed for one- and two-sample problems.

Chapter 9 covers the basic principles of nonparametric statistics. The assumptions of normality are relaxed, and distribution-free analogues are developed for the tests in Chapters 7 and 8. The technique of permutation testing, which is widely used in genetic studies, is introduced for the first time.

Chapter 10 contains the basic concepts of hypothesis testing as applied to categorical data, including some of the most widely used statistical procedures, such as the chi-square test and Fisher’s exact test.

Chapter 11 develops the principles of regression analysis. The case of simple linear regression is thoroughly covered, and extensions are provided for the multiple-regression case. Important sections on goodness-of-fit of regression models are also included. Also, rank correlation is introduced, including methods for obtaining confidence intervals for rank correlation.

Chapter 12 introduces the basic principles of the analysis of variance (ANOVA). The one-way analysis of variance fixed- and random-effects models are discussed. In addition, two-way ANOVA, the analysis of covariance, and mixed effects models are covered. Finally, we discuss nonparametric approaches to one-way ANOVA. Multiple comparison methods, including material on the false discovery rate, are also provided.

Chapter 13 discusses methods of design and analysis for epidemiologic studies. The most important study designs, including the prospective study, the case-control study, the cross-sectional study, and the cross-over design, are introduced. The concept of a confounding variable (that is, a variable related to both the disease and the exposure variable) is introduced, and methods for controlling for confounding, which include the Mantel-Haenszel test and multiple-logistic regression, are discussed in detail. Extensions to logistic regression models, including conditional logistic regression, polytomous logistic regression, and ordinal logistic regression, are discussed. Methods of estimation of sample size for logistic regression models are provided for the first time. This discussion is followed by the exploration of topics of current interest in epidemiologic data analysis, including meta-analysis (the combination of results from more than one study); correlated binary data techniques (techniques that can be applied when replicate measures, such as data from multiple teeth from the same person, are available for an individual); measurement error methods (useful when there is substantial measurement error in the exposure data collected); equivalence studies (whose objective is to establish bioequivalence between two treatment modalities rather than to show that one treatment is superior to the other); and missing-data methods for handling missing data in epidemiologic studies. Longitudinal data analysis and generalized estimating equation (GEE) methods are also briefly discussed.

Chapter 14 introduces methods of analysis for person-time data.
The methods covered in this chapter include those for incidence-rate data, as well as several methods of survival analysis: the Kaplan-Meier survival curve estimator, the log-rank test, and the proportional-hazards model. Methods for testing the assumptions of the proportional-hazards model have also been included. Parametric survival analysis methods are also discussed.

Throughout the text, particularly in Chapter 13, I discuss the elements of study designs, including the concepts of matching; cohort studies; case-control studies; retrospective studies; prospective studies; and the sensitivity, specificity, and predictive value of screening tests. These designs are presented in the context of actual samples. In addition, Chapters 7, 8, 10, 11, 13, and 14 contain specific sections on sample-size estimation for different statistical situations.

There have been two important organizational changes in the presentation of material in the text. First, the derivations of more complex formulas have been moved either after the statement of an equation or to separate derivation sections at the end of the chapter, to enable students to access the main results in the equations more immediately. Second, there are numerous subsections entitled “Using the Computer to Perform a Specific Test” to more clearly highlight use of the computer to implement many of the methods in the text.

A flowchart of appropriate methods of statistical inference (see pages 895–900) is a handy reference guide to the methods developed in this book. Page references for each major method presented in the text are also provided. In Chapters 7 and 8 and Chapters 10–14, I refer students to this flowchart to give them some perspective on how the methods discussed in a given chapter fit with all the other statistical methods introduced in this book.

In addition, I have provided an index of applications, grouped by medical specialty, summarizing all the examples and problems this book covers.
Finally, we provide, for the first time, an index of computer software to identify more clearly the computer commands in specific computer packages that are featured in the text.
Acknowledgments
I am indebted to Debra Sheldon, the late Marie Sheehan, and Harry Taplin for their invaluable help typing the manuscript, to Dale Rinkel for invaluable help in typing problem solutions, and to Marion McPhee for helping to prepare the data sets on the Companion Website. I am also indebted to Roland Matsouaka for updating solutions to problems for this edition, and to Virginia Piaseczny for typing the Index of Applications. In addition, I wish to thank the manuscript reviewers, among them: Shouhao Zhou, Daniela Szatmari-Voicu, Jianying Gu, Raid Amin, Claus Wilke, Glen Johnson, Kara Zografos, and Hui Zhao. I would also like to thank my colleagues Nancy Cook, who was instrumental in helping me develop the part of Section 12.4 on the false-discovery rate, and Robert Glynn, who was invaluable in developing Section 13.16 on missing data and Section 14.11 on testing the assumptions of the proportional-hazards model.
In addition, I wish to thank Spencer Arritt and Jay Campbell, whose input was critical in providing editorial advice and in preparing the manuscript.
I am also indebted to my colleagues at the Channing Laboratory (most notably, the late Edward Kass, Frank Speizer, Charles Hennekens, the late Frank Polk, Ira Tager, Jerome Klein, James Taylor, Stephen Zinner, Scott Weiss, Frank Sacks, Walter Willett, Alvaro Munoz, Graham Colditz, and Susan Hankinson) and to my other colleagues at the Harvard Medical School (most notably, the late Frederick Mosteller, Eliot Berson, Robert Ackerman, Mark Abelson, Arthur Garvey, Leo Chylack, Eugene Braunwald, and Arthur Dempster), who inspired me to write this book. I also wish to express appreciation to John Hopper and Philip Landrigan for providing the data for our case studies.
Finally, I would like to acknowledge Leslie Miller, Andrea Wagner, Ithamar Jotkowitz, Loren Fishman, and Frank Santopietro, without whose clinical help the current edition of this book would not have been possible.
Bernard Rosner
About the Author
Bernard Rosner is Professor of Medicine (Biostatistics) at Harvard Medical School and Professor of Biostatistics in the Harvard School of Public Health. He received a B.A. in Mathematics from Columbia University in 1967, an M.S. in Statistics from Stanford University in 1968, and a Ph.D. in Statistics from Harvard University in 1971.
He has more than 30 years of biostatistical consulting experience with other investigators at the Harvard Medical School. Special areas of interest include cardiovascular disease, hypertension, breast cancer, and ophthalmology. Many of the examples and exercises used in the text reflect data collected from actual studies in conjunction with his consulting experience. In addition, he has developed new biostatistical methods, mainly in the areas of longitudinal data analysis, analysis of clustered data (such as data collected in families or from paired organ systems in the same person), measurement error methods, and outlier detection methods. You will see some of these methods introduced in this book at an elementary level. He was married in 1972 to his wife, Cynthia, and they have three children, Sarah, David, and Laura, each of whom has contributed examples to this book.
Photo courtesy of the Museum of Science, Boston
Chapter 1 General Overview
Statistics is the science whereby inferences are made about specific random phenomena on the basis of relatively limited sample material. The field of statistics has two main areas: mathematical statistics and applied statistics. Mathematical statistics concerns the development of new methods of statistical inference and requires detailed knowledge of abstract mathematics for its implementation. Applied statistics involves applying the methods of mathematical statistics to specific subject areas, such as economics, psychology, and public health. Biostatistics is the branch of applied statistics that applies statistical methods to medical and biological problems. Of course, these areas of statistics overlap somewhat. For example, in some instances, given a certain biostatistical application, standard methods do not apply and must be modified. In this circumstance, biostatisticians are involved in developing new methods.
A good way to learn about biostatistics and its role in the research process is to follow the flow of a research study from its inception at the planning stage to its completion, which usually occurs when a manuscript reporting the results of the study is published. As an example, I will describe one such study in which I participated.
A friend called one morning and in the course of our conversation mentioned that he had recently used a new, automated blood-pressure measuring device of the type seen in many banks, hotels, and department stores. The machine had measured his average diastolic blood pressure on several occasions as 115 mm Hg; the highest reading was 130 mm Hg.
I was very worried, because if these readings were accurate, my friend might be in imminent danger of having a stroke or developing some other serious cardiovascular disease. I referred him to a clinical colleague of mine who, using a standard blood-pressure cuff, measured my friend’s diastolic blood pressure as 90 mm Hg. The contrast in readings aroused my interest, and I began to jot down readings from the digital display every time I passed the machine at my local bank. I got the distinct impression that a large percentage of the reported readings were in the hypertensive range. Although one would expect hypertensive individuals to be more likely to use such a machine, I still believed that blood-pressure readings from the machine might not be comparable with those obtained using standard methods of blood-pressure measurement. I spoke with Dr. B. Frank Polk, a physician at Harvard Medical School with an interest in hypertension, about my suspicion and succeeded in interesting him in a small-scale evaluation of such machines. We decided to send a human observer, who was well trained in blood-pressure measurement techniques, to several of these machines. He would offer to pay participants 50¢ for the cost of using the machine if they would agree to fill out a short questionnaire and have their blood pressure measured by both a human observer and the machine.
At this stage we had to make several important decisions, each of which proved vital to the success of the study.
These decisions were based on the following questions:
(1) How many machines should we test?
(2) How many participants should we test at each machine?
(3) In what order should we take the measurements? That is, should the human observer or the machine take the first measurement? Under ideal circumstances we would have taken both the human and machine readings simultaneously, but this was logistically impossible.
(4) What data should we collect on the questionnaire that might influence the comparison between methods?
(5) How should we record the data to facilitate computerization later?
(6) How should we check the accuracy of the computerized data?
We resolved these problems as follows:
(1) and (2) Because we were not sure whether all blood-pressure machines were comparable in quality, we decided to test four of them. However, we wanted to sample enough subjects from each machine so as to obtain an accurate comparison of the standard and automated methods for each machine. We tried to predict how large a discrepancy there might be between the two methods. Using the methods of sample-size determination discussed in this book, we calculated that we would need 100 participants at each site to make an accurate comparison.
(3) We then had to decide in what order to take the measurements for each person. According to some reports, one problem with obtaining repeated blood-pressure measurements is that people tense up during the initial measurement, yielding higher blood-pressure readings. Thus we would not always want to use either the automated or manual method first, because the effect of the method would get confused with the order-of-measurement effect. A conventional technique we used here was to randomize the order in which the measurements were taken, so that for any person it was equally likely that the machine or the human observer would take the first measurement.
This random pattern could be implemented by flipping a coin or, more likely, by using a table of random numbers similar to Table 4 of the Appendix.
(4) We believed that the major extraneous factor that might influence the results would be body size (we might have more difficulty getting accurate readings from people with fatter arms than from those with leaner arms). We also wanted to get some idea of the type of people who use these machines. Thus we asked questions about age, gender, and previous hypertension history.
(5) To record the data, we developed a coding form that could be filled out on site and from which data could be easily entered into a computer for subsequent analysis. Each person in the study was assigned a unique identification (ID) number by which the computer could identify that person. The data on the coding forms were then keyed and verified. That is, the same form was entered twice and the two records compared to make sure they were the same. If the records did not match, the form was re-entered.
(6) Checking each item on each form was impossible because of the large amount of data involved. Instead, after data entry we ran some editing programs to ensure that the data were accurate. These programs checked that the values for individual variables fell within specified ranges and printed out aberrant values for manual checking.
For example, we checked that all blood-pressure readings were at least 50 mm Hg and no higher than 300 mm Hg, and we printed out all readings that fell outside this range. We also ran programs to detect outliers as discussed later in this book.
After completing the data-collection, data-entry, and data-editing phases, we were ready to look at the results of the study. The first step in this process is to get an impression of the data by summarizing the information in the form of several descriptive statistics. This descriptive material can be numeric or graphic. If numeric, it can be in the form of a few summary statistics, which can be presented in tabular form or, alternatively, in the form of a frequency distribution, which lists each value in the data and how frequently it occurs. If graphic, the data are summarized pictorially and can be presented in one or more figures. The appropriate type of descriptive material to use varies with the type of distribution considered. If the distribution is continuous (that is, if there is essentially an infinite number of possible values, as would be the case for blood pressure), then means and standard deviations may be the appropriate descriptive statistics. However, if the distribution is discrete (that is, if there are only a few possible values, as would be the case for gender), then percentages of people taking on each value are the appropriate descriptive measure. In some cases both types of descriptive statistics are used for continuous distributions by condensing the range of possible values into a few groups and giving the percentage of people that fall into each group (e.g., the percentages of people who have blood pressures between 120 and 129 mm Hg, between 130 and 139 mm Hg, and so on).
In this study we decided first to look at mean blood pressure for each method at each of the four sites. Table 1.1 summarizes this information.
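The data-editing and descriptive steps described above can be sketched in a few lines of Python. This is only an illustration and not the programs used in the study: the readings are made-up values, the function names are hypothetical, and only the 50 to 300 mm Hg range and the 10 mm Hg grouping are taken from the text.

```python
import statistics
from collections import Counter

# Plausible range for a blood-pressure reading (mm Hg), as stated in the text.
LOW, HIGH = 50, 300

def split_by_range(readings, low=LOW, high=HIGH):
    """Separate readings inside [low, high] from aberrant ones to check by hand."""
    valid = [r for r in readings if low <= r <= high]
    aberrant = [r for r in readings if not (low <= r <= high)]
    return valid, aberrant

def grouped_percentages(readings, width=10):
    """Percentage of readings in each interval, e.g. 120-129, 130-139 mm Hg."""
    counts = Counter((int(r) // width) * width for r in readings)
    n = len(readings)
    return {f"{lo}-{lo + width - 1}": 100 * c / n for lo, c in sorted(counts.items())}

# Hypothetical systolic readings (mm Hg); 12 and 1420 mimic keying errors.
readings = [142, 128, 135, 12, 151, 133, 1420, 126, 139, 147]

valid, aberrant = split_by_range(readings)
mean = statistics.mean(valid)
sd = statistics.stdev(valid)  # sample standard deviation (n - 1 denominator)
groups = grouped_percentages(valid)

print("aberrant values for manual checking:", aberrant)
print(f"n = {len(valid)}, mean = {mean:.1f}, sd = {sd:.1f}")
print("percent per group:", groups)
```

The range check only flags values for manual review; deciding whether a flagged value is a keying error or a genuine reading still requires going back to the coding form.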
You may notice from Table 1.1 that we did not obtain meaningful data from all 100 people interviewed at each site. This was because we could not obtain valid readings from the machine for many of the people. This problem of missing data is very common in biostatistics and should be anticipated at the planning stage when deciding on sample size (which was not done in this study).
Table 1.1 Mean blood pressures and differences between machine and human readings at four locations
Systolic blood pressure (mm Hg)
                        Machine             Human               Difference
Location   Number       Mean    Standard    Mean    Standard    Mean    Standard
           of people            deviation           deviation           deviation
A          98           142.5   21.0        142.0   18.1        0.5     11.2
B          84           134.1   22.5        133.6   23.2        0.5     12.1
C          98           147.9   20.3        133.9   18.3        14.0    11.7
D          62           135.4   16.7        128.5   19.0        6.9     13.6
Source: Based on the American Heart Association, Inc.
Our next step in the study was to determine whether the apparent differences in blood pressure between machine and human measurements at two of the locations (C, D) were “real” in some sense or were “due to chance.” This type of question falls into the area of inferential statistics. We realized that although there was a difference of 14 mm Hg in mean systolic blood pressure between the two methods for the 98 people we interviewed at location C, this difference might not hold up if we interviewed 98 other people at this location at a different time, and we wanted to have some idea as to the error in the estimate of 14 mm Hg.
In statistical jargon, this group of 98 people represents a sample from the population of all people who might use that machine. We were interested in the population, and we wanted to use the sample to help us learn something about the population. In particular, we wanted to know how different the estimated mean difference of 14 mm Hg in our sample was likely to be from the true mean difference in the population of all people who might use this machine. More specifically, we wanted to know if it was still possible that there was no underlying difference between the two methods and that our results were due to chance. The 14-mm Hg difference in our group of 98 people is referred to as an estimate of the true mean difference (d) in the population. The problem of inferring characteristics of a population from a sample is the central concern of statistical inference and is a major topic in this text. To accomplish this aim, we needed to develop a probability model, which would tell us how likely it is that we would obtain a 14-mm Hg difference between the two methods in a sample of 98 people if there were no real difference between the two methods over the entire population of users of the machine. If this probability were small enough, then we would begin to believe a real difference existed between the two methods. In this particular case, using a probability model based on the t distribution, we concluded this probability was less than 1 in 1000 for each of the machines at locations C and D. This probability was sufficiently small for us to conclude there was a real difference between the automatic and manual methods of measuring blood pressure for two of the four machines tested.
We used a statistical package to perform the preceding data analyses. A package is a collection of statistical programs that describe data and perform various statistical tests on the data.
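The probability statement above can be checked approximately from the summary figures in Table 1.1. The sketch below assumes the analysis was a one-sample (paired) t test on the machine-minus-human differences, which this overview does not spell out. The function names are hypothetical, and the tail area is obtained by simple numerical integration of the t density rather than by a statistical package.

```python
import math

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t_stat, df, steps=20000):
    """P(|T| >= |t_stat|) via Simpson's rule on [0, |t_stat|] (steps must be even)."""
    a = abs(t_stat)
    h = a / steps
    total = t_density(0.0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_half = total * h / 3  # area under the density between 0 and |t_stat|
    # Very small p-values are only resolved to about machine precision here.
    return max(0.0, 1.0 - 2.0 * central_half)

def paired_t(mean_diff, sd_diff, n):
    """t statistic and two-sided p-value for H0: true mean difference = 0."""
    se = sd_diff / math.sqrt(n)  # standard error of the mean difference
    t_stat = mean_diff / se
    return t_stat, two_sided_p(t_stat, n - 1)

# (n, mean difference, SD of difference) for each location, from Table 1.1.
table = {"A": (98, 0.5, 11.2), "B": (84, 0.5, 12.1),
         "C": (98, 14.0, 11.7), "D": (62, 6.9, 13.6)}

results = {}
for site, (n, mean_diff, sd_diff) in table.items():
    results[site] = paired_t(mean_diff, sd_diff, n)
    t_stat, p = results[site]
    print(f"{site}: t = {t_stat:.2f}, df = {n - 1}, two-sided p = {p:.2g}")
```

Under these assumptions, locations C and D give p-values below 0.001, in line with the conclusion in the text, while locations A and B give no evidence of a difference.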
Currently the most widely used statistical packages are SAS, SPSS, Stata, R, MINITAB, and Excel.
The final step in this study, after completing the data analysis, was to compile the results in a publishable manuscript. Inevitably, because of space considerations, we weeded out much of the material developed during the data-analysis phase and presented only the essential items for publication.
This review of our blood-pressure study should give you some idea of what medical research is about and the role of biostatistics in this process. The material in this text parallels the description of the data-analysis phase of the study. Chapter 2 summarizes different types of descriptive statistics. Chapters 3 through 5 present some basic principles of probability and various probability models for use in later discussions of inferential statistics. Chapters 6 through 14 discuss the major topics of inferential statistics as used in biomedical practice. Issues of study design or data collection are brought up only as they relate to other topics discussed in the text.
Reference
Polk, B. F., Rosner, B., Feudo, R., & Vandenburgh, M. (1980). An evaluation of the Vita-Stat automatic blood pressure measuring device. Hypertension, 2(2), 221–227.
Chapter 2 Descriptive Statistics
2.1 Introduction
The first step in looking at data is to describe the data at hand in some concise way. In smaller studies this step can be accomplished by listing each data point.
In general, however, this procedure is tedious or impossible and, even if it were possible, would not give an overall picture of what the data look like.
Example 2.1 Cancer, Nutrition
Some investigators have proposed that consumption of vitamin A prevents cancer. To test this theory, a dietary questionnaire might be used to collect data on vitamin-A consumption among 200 hospitalized cancer patients (cases) and 200 controls. The controls would be matched with regard to age and gender with the