Statistics in Health Sciences
Universitat Autònoma de Barcelona
2023
Jose Barrera
Summary
This document is a set of lecture notes on measuring health outcomes, covering topics such as prevalence, cumulative incidence, and incidence rate. It includes examples and exercises. It also provides references and R code for performing various statistical tests.
B.Sc. Degree in Applied Statistics
Statistics in Health Sciences
7. Measuring health outcomes
Jose Barrera (a, b)
[email protected]
https://sites.google.com/view/josebarrera
(a) ISGlobal Barcelona Institute for Global Health - Campus MAR
(b) Department of Mathematics (UAB)
This work is licensed under a Creative Commons “Attribution-NonCommercial-ShareAlike 4.0 International” license.

Statistics in Health Sciences
1 Introduction
2 Measuring presence: Prevalence
   Definition and estimation
   Point estimate and confidence interval
   Comments
   Exercise 1
3 Measuring occurrence: Cumulative incidence
   Definition
   Comments
4 Measuring occurrence: Incidence rate
   Definition
   Comments
   Comparison of two incidence rates
   Exercise 2

Jose Barrera (ISGlobal & UAB) Statistics in Health Sciences, 2023/2024

Measuring health outcomes: introduction
Presence vs occurrence
When assessing a health outcome in a given population, we can measure:
• The presence of the health indicator among individuals. It is useful for chronic health indicators (i.e. unchanging characteristics such as, for instance, asthma or diabetes).
   • Measures: prevalence, odds.
• The occurrence of new cases related to the health indicator and its evolution over time. It is useful for acute or transitory health indicators (i.e. time-dependent outcomes such as, for instance, flu, sick leave or COVID-19).
   • Measures: cumulative incidence, incidence rate.
Examples
• How many residents in Barcelona suffer from asthma?
• How many new cases of COVID-19 have been detected during the last 15 days in Spain?
• How fast has the COVID-19 pandemic spread in the world?

Prevalence: definition
Prevalence
• The prevalence (P) of a given disease D is the proportion of individuals affected by D in the population of interest, within a given time interval of length t:
$$P = \frac{X}{N},$$
where X is the number of cases and N is the population size.
• Prevalence measures presence.
It can be seen as the probability that a randomly selected individual from the population is affected by D.
• Prevalence can also be measured as an absolute number (X), although this is rarely done because the count alone is uninformative.
• The value t should be a realistic time interval (e.g. a day, a week, a year), depending on the context.

Prevalence: estimation
Point estimate and confidence interval
• Prevalence can be estimated as $\hat{P} = X_n / n$, where n is the size of a random sample drawn from the population of interest, and $X_n$ is the number of cases in that sample.
• Assuming $X_n \sim \text{Binomial}(n, \pi = P)$, a confidence interval ($\text{CI}_{1-\alpha}$) for π (i.e. P) can be obtained using the binomial distribution.
• An exact CI is a CI obtained under the exact probability distribution (in this case, the binomial distribution).
• The best-known small-sample exact CI, proposed by Clopper and Pearson [1], can be computed with the binom.exact function in the R package epitools. Search for information about the Clopper-Pearson interval and the Agresti-Coull interval.
• However, because of discreteness, for an exact CI the actual probability that the $\text{CI}_{1-\alpha}$ contains the true value of π (a) is ⩾ (1 − α). I.e. the coverage probability is at least the nominal confidence level.
• Further details on small-sample CIs for a proportion can be found in the book by Agresti [2].
(a) The actual probability that the $\text{CI}_{1-\alpha}$ contains the true value of the parameter of interest is known as the coverage probability.

Prevalence: approximated confidence interval
Approximated confidence interval
• For large sample sizes, an approximate $\text{CI}_{1-\alpha}$ can be obtained from the normal approximation to the binomial distribution:
$$\text{CI}_{1-\alpha}(P) = \hat{P} \pm z_{1-\alpha/2} \sqrt{\hat{P}(1-\hat{P})/n}, \qquad (1)$$
where $z_{1-\alpha/2}$ is the $(1-\alpha/2) \cdot 100$-th percentile of the standard normal distribution.
• The approximate CI in (1) can be computed with the binom.approx function in the R package epitools.
• Such an approximate CI can have highly erratic coverage, depending on the values of P and n. It can even produce meaningless bounds if the sample size is not large enough or if the prevalence is extreme enough.

Prevalence: approximated confidence interval
Unrealistic bounds in the approximated confidence interval
• Formula (1) can give an upper bound of the CI higher than 1 or a lower bound of the CI lower than 0, which would be unrealistic.
• Specifically:
$$\text{lower}(\text{CI}_{1-\alpha}(P)) < 0 \longrightarrow \hat{P} - z_{1-\alpha/2}\sqrt{\hat{P}(1-\hat{P})/n} < 0 \longrightarrow \hat{P} < P_c$$
$$\text{upper}(\text{CI}_{1-\alpha}(P)) > 1 \longrightarrow \hat{P} + z_{1-\alpha/2}\sqrt{\hat{P}(1-\hat{P})/n} > 1 \longrightarrow \hat{P} > 1 - P_c,$$
where
$$P_c := \frac{1}{1 + n / z_{1-\alpha/2}^2}.$$
• For example, for α = 0.05 (so $z_{1-\alpha/2} \approx 1.96$) and n = 30, $P_c \approx 0.1135$. Hence, the lower bound of the 95% CI would be negative if $\hat{P} < 0.11$ and the upper bound of the 95% CI would be higher than 1 if $\hat{P} > 0.89$.

Prevalence: approximated confidence interval
Unrealistic bounds in the approximated confidence interval (cont.)
[Figure: prevalence estimate (vertical axis, 0.0 to 1.0) against sample size (horizontal axis, 4 to 150), marking the region where the upper 95% CI bound is > 1, the region where the lower 95% CI bound is < 0, and the example in the previous slide.]

Prevalence
Comments
• Prevalence can be estimated in cross-sectional studies. However, in general this is not possible in cohort studies or case-control studies.
• If the sample is not randomly drawn, bias could be introduced in the estimation of P.
• Prevalence can be compared between two populations using statistical tests for comparing proportions. In R: binom.test and prop.test.
• Prevalence is sensitive to the disease duration.
In general, the longer the disease lasts in the individual, the higher the estimate of the prevalence.
• Prevalence data are not useful to establish causal relationships between a disease and a potential risk factor. Why?

Exercise 1 (Prevalence)
Exercise: prevalence estimation
Suppose we select two independent random samples, A and B, from a given population of people aged 40 years or older. Sample A includes 19 individuals and only one of them has respiratory problems (D). Sample B includes 186 individuals and 19 of them have respiratory problems.
1 Complete the following table of estimates of the prevalence of D, P(D):

   Sample   n   P̂   CI95%(P(D)) exact   CI95%(P(D)) approx.
   A
   B
   A∪B

2 Apply a test to samples A and B to decide if the population prevalence is 0.16.
3 Interpret the results.
Answer: see the slides “Answer to Exercise 1”.

Cumulative incidence: definition
Cumulative incidence
• The cumulative incidence (CI) associated with a disease D is the proportion of new cases within a given time interval, in a given disease-free population:
$$\text{CI} = \frac{I}{N_0},$$
where I is the number of new cases (incident cases) and $N_0$ is the size of the disease-free population.
• CI measures occurrence. It can be interpreted as the probability that an individual randomly selected from the disease-free population becomes affected by D within a given time interval.

Cumulative incidence: comments
Comments
• The choice of the time interval can vary according to, for instance, age, calendar time, time from the beginning of a treatment, etc.
• CI can be estimated in cohort studies. The estimator is the proportion of new cases in the cohort. A confidence interval for the CI can be obtained using the binomial distribution or an approximation using the normal distribution.
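The estimator described in the last comment can be sketched in base R as follows. This is only an illustration: the cohort size `n0` and the count `new_cases` are hypothetical numbers, not data from these notes, and `binom.test` is used here for the exact (Clopper-Pearson) interval instead of the epitools functions.

```r
# Hypothetical cohort (made-up numbers, for illustration only).
n0 <- 500        # disease-free individuals at the start of follow-up
new_cases <- 35  # incident cases observed during the follow-up interval

# Point estimate of the cumulative incidence: proportion of new cases.
ci_hat <- new_cases / n0

# Exact 95% confidence interval based on the binomial distribution.
exact_ci <- binom.test(x = new_cases, n = n0, conf.level = 0.95)$conf.int

# Approximate 95% confidence interval based on the normal distribution.
z <- qnorm(0.975)
approx_ci <- ci_hat + c(-1, 1) * z * sqrt(ci_hat * (1 - ci_hat) / n0)

print(ci_hat)
print(exact_ci)
print(approx_ci)
```

With a cohort of this size and an estimate far from 0 and 1, the two intervals are expected to be close; they may differ noticeably for small cohorts or rare outcomes.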
• CI cannot be estimated in case-control or cross-sectional studies. Why?
• The estimation of the CI can be difficult in a dynamic population in which individuals frequently move in and out.

Incidence rate: definition
Incidence rate
• The incidence rate ($I_r$), or incidence density, associated with a disease D is the number of new cases (I) per unit of person-time at risk:
$$I_r = \frac{I}{\Delta t},$$
where Δt is the total time at risk of the whole population of interest. Hence, to estimate $I_r$ we need to know the time at risk of each individual in the study.
• $I_r$ is not a proportion, so it must not be interpreted as a probability but as the speed at which D spreads.

Incidence rate: example
Example
In a hypothetical cohort study, 242 individuals were followed for 26 months, 288 individuals were followed for 32 months, and 176 individuals were followed for 24 months. When the study started, all individuals were free of the disease of interest, D. At the end of the follow-up, the number of individuals who had developed D was 11, 10 and 7, respectively. The $I_r$ estimates in each of the three groups and in the whole cohort are:
• $\hat{I}_{r,1} = \frac{I_1}{\Delta t_1} = \frac{11}{242 \cdot 26} \approx 0.0017483$ cases per person-month
• $\hat{I}_{r,2} = \frac{I_2}{\Delta t_2} = \frac{10}{288 \cdot 32} \approx 0.0010851$ cases per person-month
• $\hat{I}_{r,3} = \frac{I_3}{\Delta t_3} = \frac{7}{176 \cdot 24} \approx 0.0016572$ cases per person-month
• $\hat{I}_r = \frac{I}{\Delta t} = \frac{\sum_{i=1}^{3} I_i}{\sum_{i=1}^{3} \Delta t_i} = \frac{11 + 10 + 7}{242 \cdot 26 + 288 \cdot 32 + 176 \cdot 24} \approx 0.001419$ cases per person-month
Exercise
Express the previous results in “cases per 100 person-years”. Answer: 2.1, 1.3, 2.0 and 1.7.

Incidence rate: comments
Comments
• The incidence rate can be estimated in cohort studies but not in case-control or cross-sectional studies. Why?
• The estimate of $I_r$ is an average of the value of $I_r$ over each analysed period.
If $I_r$ may vary over time, it can be estimated in different time subperiods in order to show its time variation.
• Assuming that $I_r$ is time invariant, the time to the onset of the illness, T, is exponentially distributed, so that:
   • The cumulative incidence (CI) associated with a given time interval of length δt can be estimated as $\text{CI}(\delta t) = P(T \leqslant \delta t) = 1 - e^{-I_r \delta t}$. Prove that, if the time length δt is small enough, then $\text{CI}(\delta t) \approx I_r \delta t$.
   • The mean time to the onset of the illness is $\bar{T} = 1 / I_r$.
• Assuming that the number of cases follows a Poisson distribution, a confidence interval for $I_r$ can be obtained based on that distribution or on an approximation based on the normal distribution. In R, it can be done with pois.exact and pois.approx in the epitools package.

Incidence rate: comparison of two rates using the Wald test (1/2)
Wald test: statement and test statistic
• If we are interested in comparing the incidence rate in two populations using the hypothesis test
$$H_0: I_{r0} = I_{r1} \quad \text{vs.} \quad H_1: I_{r0} \neq I_{r1},$$
and we are able to obtain the estimates $\hat{I}_{rj} = I_j / \Delta t_j$, j = 0, 1, from two independent samples, then, under $H_0$, the expectation and the variance of the number of cases in each population are (a)
$$E(I_j) = \frac{\Delta t_j}{\Delta t} I, \qquad \text{Var}(I_j) = \frac{\prod_j \Delta t_j}{\Delta t^2} I = \frac{\Delta t_0 \Delta t_1}{\Delta t^2} I, \qquad j = 0, 1,$$
where $I = I_0 + I_1$ and $\Delta t = \Delta t_0 + \Delta t_1$.
• The Wald statistic in this case is
$$X^2 := \frac{(I_j - E(I_j))^2}{\text{Var}(I_j)} = \frac{(I_1 \Delta t_0 - I_0 \Delta t_1)^2}{I \Delta t_0 \Delta t_1}.$$
• Asymptotically, $X^2 \overset{H_0}{\sim} \chi^2_1$ or, equivalently, $X \overset{H_0}{\sim} N(0, 1)$.
(a) See details in Rothman et al. [3], Chapter 14, Section “Person-Time Data: Large-Sample Methods”. http://students.aiu.edu/submissions/profiles/resources/onlineBook/a9c7D5_Modern_Epidemiology_3.pdf

Incidence rate: comparison of two rates using the Wald test (2/2)
Wald test: result and decision
• The p-value of the test is $p\text{-value} = P(\chi^2_1 \geqslant X^2)$.
• The null hypothesis is not rejected if p-value > α, where α is the significance level, which is usually set at 0.05.
• In R, this test can be performed with rateratio and rateratio.wald in the epitools package.

Exercise 2 (Incidence rates)
Exercise: incidence rates and cumulative incidence estimation
Two independent cohorts, C1 and C2, were randomly selected from a population of people older than 25 years. The sample sizes of cohorts C1 and C2 were 240 and 120, respectively. All individuals in both cohorts were disease-free at the study start. All individuals in cohort C1 were followed for 3 years and all individuals in cohort C2 were followed for 5 years. Follow-up of all individuals in both cohorts started at the same time point. There was no loss to follow-up. When the follow-up finished, cohort C1 presented 22 new cases and cohort C2 presented 15 new cases.
1 Apply a hypothesis test to decide on the equality of the incidence rate in the two subpopulations from which the cohorts have been drawn. Set the significance level at 0.05. (Partial answer: p-value of the test ≈ 0.548.)
2 Assuming that both subpopulations are identical regarding the incidence rate of the disease of interest, so that we can aggregate both samples into a single pooled sample which is representative of the whole population, estimate the number of new cases that we expect, both monthly and yearly, if the population has 40 million disease-free inhabitants. (Answer: 93,325 and 1,105,644.)
3 What assumptions have you made to get these results?
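As a minimal sketch (not the author's solution), the Wald statistic from the preceding slides and the relation CI(δt) = 1 − e^(−Ir δt) can be applied to the Exercise 2 data in base R. The variable names and the helper `expected_cases` are illustrative assumptions. Follow-up is expressed in person-years, and the expected counts assume a time-invariant pooled rate applied to a disease-free population of fixed size.

```r
# Data from Exercise 2, with time at risk in person-years.
cases <- c(C1 = 22, C2 = 15)            # new cases I_0 and I_1
ptime <- c(C1 = 240 * 3, C2 = 120 * 5)  # person-years at risk dt_0 and dt_1

# Wald statistic: X^2 = (I_1 * dt_0 - I_0 * dt_1)^2 / (I * dt_0 * dt_1).
wald_stat <- (cases[["C2"]] * ptime[["C1"]] - cases[["C1"]] * ptime[["C2"]])^2 /
  (sum(cases) * prod(ptime))

# p-value = P(chi^2_1 >= X^2).
p_value <- pchisq(wald_stat, df = 1, lower.tail = FALSE)

# Pooled incidence rate (cases per person-year), assuming equal rates.
rate_pooled <- sum(cases) / sum(ptime)

# Hypothetical helper: expected new cases among n0 disease-free people over
# an interval of length dt (in years), using CI(dt) = 1 - exp(-Ir * dt).
expected_cases <- function(rate, dt, n0) n0 * (1 - exp(-rate * dt))

n0 <- 40e6
monthly <- expected_cases(rate_pooled, dt = 1 / 12, n0 = n0)
yearly <- expected_cases(rate_pooled, dt = 1, n0 = n0)

print(round(p_value, 3))
print(round(c(monthly = monthly, yearly = yearly)))
```

Under these assumptions the output should agree with the answers quoted in the exercise (p-value ≈ 0.548; about 93,325 monthly and 1,105,644 yearly cases). The epitools functions rateratio and rateratio.wald are an alternative for the test step.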
Answer to Exercise 1 (1)

```r
### table of results:
tab <- as.data.frame(matrix(nrow = 3, ncol = 8))
names(tab) <- c("sample", "x", "n", "p", "exactCIlow", "exactCIupp",
                "approxCIlow", "approxCIupp")
tab$sample <- c("A", "B", "AB")
### add X, n and P:
xA <- 1
nA <- 19
xB <- 19
nB <- 186
tab$x <- c(xA, xB, NA)
tab$n <- c(nA, nB, NA)
aux <- tab[tab$sample != "AB", c("x", "n")]
tab[tab$sample == "AB", c("x", "n")] <- colSums(aux)
tab$p <- tab$x / tab$n
p <- tab$p
n <- tab$n
tab
##   sample  x   n          p exactCIlow exactCIupp approxCIlow approxCIupp
## 1      A  1  19 0.05263158         NA         NA          NA          NA
## 2      B 19 186 0.10215054         NA         NA          NA          NA
## 3     AB 20 205 0.09756098         NA         NA          NA          NA
```

Answer to Exercise 1 (2)

```r
### asymptotic CI (using R as a calculator):
alpha <- 0.05
z <- qnorm(1 - alpha / 2)
error <- z * sqrt(p * (1 - p) / n)
pmatrix <- matrix(p, nrow = nrow(tab), ncol = 2)
errormatrix <- matrix(error, nrow = nrow(tab), ncol = 1, byrow = TRUE)
auxmatrix <- matrix(c(-1, 1), nrow = 1, ncol = 2, byrow = TRUE)
approxCI <- pmatrix + errormatrix %*% auxmatrix
tab[, c("approxCIlow", "approxCIupp")] <- approxCI
### asymptotic CI (with binom.approx):
tab$alpha <- alpha
library(epitools)
myapprox <- function(y) {
  binom.approx(x = as.numeric(y["x"]), n = as.numeric(y["n"]),
               conf.level = 1 - as.numeric(y["alpha"]))
}
approxCI <- apply(tab, 1, FUN = myapprox)
CInames <- names(approxCI[[1]])
approxCI <- matrix(unlist(approxCI), nrow = nrow(tab), ncol = 6, byrow = TRUE)
approxCI <- as.data.frame(approxCI)
names(approxCI) <- CInames
```

Answer to Exercise 1 (3)

```r
### exact CI (with binom.exact):
myexact <- function(y)
  binom.exact(x = as.numeric(y["x"]), n = as.numeric(y["n"]),
              conf.level = 1 - as.numeric(y["alpha"]))
exactCI <- apply(tab, 1, FUN = myexact)
CInames <- names(exactCI[[1]])
exactCI <- matrix(unlist(exactCI), nrow = nrow(tab), ncol = 6, byrow = TRUE)
exactCI <- as.data.frame(exactCI)
names(exactCI) <- CInames
tab[, c("exactCIlow", "exactCIupp")] <- exactCI[, c("lower", "upper")]
tab
##   sample  x   n          p  exactCIlow exactCIupp approxCIlow approxCIupp alpha
## 1      A  1  19 0.05263158 0.001331629  0.2602807 -0.04777310   0.1530363  0.05
## 2      B 19 186 0.10215054 0.062629996  0.1549198  0.05862805   0.1456730  0.05
## 3     AB 20 205 0.09756098 0.060616661  0.1466503  0.05694301   0.1381789  0.05
```

Answer to Exercise 1 (4)

```r
### LaTeX table:
mytab <- tab
loex <- sprintf("%.3f", mytab$exactCIlow)
upex <- sprintf("%.3f", mytab$exactCIupp)
mytab$ciex <- paste0("(", loex, ", ", upex, ")")
loap <- sprintf("%.3f", mytab$approxCIlow)
upap <- sprintf("%.3f", mytab$approxCIupp)
mytab$ciap <- paste0("(", loap, ", ", upap, ")")
mytab <- mytab[, c("sample", "x", "n", "p", "ciex", "ciap")]
mytab
##   sample  x   n          p           ciex            ciap
## 1      A  1  19 0.05263158 (0.001, 0.260) (-0.048, 0.153)
## 2      B 19 186 0.10215054 (0.063, 0.155)  (0.059, 0.146)
## 3     AB 20 205 0.09756098 (0.061, 0.147)  (0.057, 0.138)
```

Answer to Exercise 1 (5)

```r
### LaTeX table (cont.):
library(xtable)
mytab <- xtable(mytab, digits = c(0, 0, 0, 0, 3, 0, 0))
names(mytab) <- c("Sample", "$x$", "$n$", "$\\hat{P}$",
                  "CI\\textsubscript{95\\%} (exact)",
                  "CI\\textsubscript{95\\%} (approx)")
print(mytab, include.rownames = FALSE, booktabs = TRUE,
      sanitize.text.function = function(x) {x})
```

Sample   x    n     P̂       CI95% (exact)     CI95% (approx)
A        1    19    0.053   (0.001, 0.260)    (-0.048, 0.153)
B        19   186   0.102   (0.063, 0.155)    (0.059, 0.146)
AB       20   205   0.098   (0.061, 0.147)    (0.057, 0.138)

Answer to Exercise 1 (6)

```r
### test for equal prevalences (approximate):
p0 <- 0.16
# Sample A:
id <- tab$sample == "A"
prop.test(x = tab$x[id], n = tab$n[id], p = p0, alternative = "two.sided",
          conf.level = 1 - alpha)
##
##  1-sample proportions test with continuity correction
##
## data:  tab$x[id] out of tab$n[id], null probability p0
## X-squared = 0.92873, df = 1, p-value = 0.3352
## alternative hypothesis: true p is not equal to 0.16
## 95 percent confidence interval:
##  0.002753525 0.281074110
## sample estimates:
##          p
## 0.05263158
```

Answer to Exercise 1 (7)

```r
# Sample B:
id <- tab$sample == "B"
prop.test(x = tab$x[id], n = tab$n[id], p = p0, alternative = "two.sided",
          conf.level = 1 - alpha)
##
##  1-sample proportions test with continuity correction
##
## data:  tab$x[id] out of tab$n[id], null probability p0
## X-squared = 4.211, df = 1, p-value = 0.04016
## alternative hypothesis: true p is not equal to 0.16
## 95 percent confidence interval:
##  0.06422978 0.15714004
## sample estimates:
##         p
## 0.1021505
```

Answer to Exercise 1 (8)

```r
### test for equal prevalences (exact):
# Sample A:
id <- tab$sample == "A"
binom.test(x = tab$x[id], n = tab$n[id], p = p0, alternative = "two.sided",
           conf.level = 1 - alpha)
##
##  Exact binomial test
##
## data:  tab$x[id] and tab$n[id]
## number of successes = 1, number of trials = 19, p-value = 0.3444
## alternative hypothesis: true probability of success is not equal to 0.16
## 95 percent confidence interval:
##  0.001331629 0.260280654
## sample estimates:
## probability of success
##             0.05263158
```

Answer to Exercise 1 (9)

```r
# Sample B:
id <- tab$sample == "B"
binom.test(x = tab$x[id], n = tab$n[id], p = p0, alternative = "two.sided",
           conf.level = 1 - alpha)
##
##  Exact binomial test
##
## data:  tab$x[id] and tab$n[id]
## number of successes = 19, number of trials = 186, p-value = 0.03496
## alternative hypothesis: true probability of success is not equal to 0.16
## 95 percent confidence interval:
##  0.0626300 0.1549198
## sample estimates:
## probability of success
##              0.1021505
```

References
[1] C. J. Clopper and E. S. Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4):404–413, 1934. URL https://doi.org/10.1093/biomet/26.4.404.
[2] Alan Agresti. Categorical Data Analysis (3rd Edition). John Wiley & Sons, Inc., Hoboken, New Jersey, 2013.
[3] K. J. Rothman, S. Greenland, and T. L. Lash. Modern Epidemiology (3rd Edition). Wolters Kluwer/Lippincott Williams & Wilkins, Philadelphia, Pennsylvania, 2008.