Kaplan-Meier Estimator, PDF
Document Details
Uploaded by FondMonkey75
King Khalid University, Abha
Tags
Summary
This document provides a detailed description of the Kaplan-Meier estimator, a popular non-parametric method for estimating survival functions in statistical analysis. It includes formulas, tables, and graphs to illustrate the application of the estimator.
Full Transcript
Kaplan-Meier estimator of the survival function Formula for Kaplan-Meier estimator The Kaplan-Meier (KM) estimator is a very popular non-parametric method to estimate the survival function, S(t). Non parametric means that we are not assuming any particular distribution for the survival times. It...
Kaplan-Meier estimator of the survival function Formula for Kaplan-Meier estimator The Kaplan-Meier (KM) estimator is a very popular non-parametric method to estimate the survival function, S(t). Non parametric means that we are not assuming any particular distribution for the survival times. It has an intuitive formula: st*)=n ti<t eventsi num. at. riski The expression . ! , is simply the proportion of those at risk that survive time point ti. The KM estimator is the r y 1------------num.at.nski ) r 1 r r r product of the proportion that survive each time point ti up to the current time point t. Survival estimates do not change if someone drops due to censoring, although the number at risk will drop. Kaplan-Meier estimation with survfit ( ) We can obtain the KM estimate of S(t) using su rvfi t (). The first argument is a model formula with a Su rvO outcome specification on the left side of To estimate the survival function for the entire data set, we specify 1 after # ~ 1 indicates KM survival, function estimate for whole sample KM <- survfit(Surv(time, status) ~ 1, data=aml) Printing the su rvf 11 object gives a summary: ■ n: total number at risk ■ events: total number of events that occurred ■ medi an: survival time t at which S(t) = .5, or the time at which 50% of those at risk are expected to still be alive ■ 0.95LCL, 0.95UCL: 95% confidence limits for median survival prlnt(KM) ## Call: survfit(formula = Surv(time, status) ~ 1, data = ami) ## ## n events median 0.95LCL 0.95UCL ## [1,] 23 18 27 18 45 # same as print(KM) KM ## Call: survfit(formula = Surv(time, status) ~ 1, data = ami) ## ## n events median 8.95LCL 0.95UCL ## [lj] 23 18 27 18 45 Table of KM survival function The ti dy 0 function from the b room package works with many of the output objects created by the su rvi val package to create tables stored as tibbles (modern data.frames) Using ti dy() on the survfi t() object produces a table of the KM estimate of the survival function S(t) ■ ti me event/censoring times in the data set ■ n. ri sk, n. event, n. censor number at risk, number of events, number censored ■ estimate S(t) survival estimate • = (!-£)= MS • S(8) = (1 - i) (1 - i) = -826 ■ s td. e r ro r, conf .high, conf. 1 ow standard error for 5(t), and 95% confidence interval for S(t) # save KM survival, function as tibbte (modern data,frame) KM.tab <- tidy(KM) KM.tab # same as print(KM.tab) ### A tibble: 18 x 8 ## time n.risk n.event n censor estimate std.error conf.high conf.low <dbl> ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2 1 5 23 0 0.913 0.0643 0.805 ## 2 21 2 0.0957 8 0 0.826 0.996 0.685 tttt 3 9 19 1 0.971 0 0.783 0.110 0.631 ## 4 12 18 1 0 0.739 0.124 0.942 0.580 ## 5 13 17 1 1 0.696 0.138 0.912 0.531 ## 6 16 15 0 1 0.696 0.138 0.912 0.531 tt# 7 14 1 0.157 18 0 0.646 0.878 0.475 ## 8 2 0.547 0.372 23 13 0 0.196 0.803 ## 9 27 11 1 0 0.497 0.218 0.762 0.324 ## 10 28 10 0 1 0.497 0.218 0.762 0.324 ## ## ## ## ## ## ## 11 12 13 14 15 16 17 ## 18 30 31 33 34 43 45 48 161 9 8 7 6 5 4 2 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0.442 0.386 0.331 0.276 0.221 0.166 0.0828 0.0828 0.248 0.282 0.321 0.369 0.432 0.519 0.877 0.877 0.718 0.671 0.622 0.569 0.515 0.458 0.462 0.462 0.272 0.223 0.177 0.134 0.0947 0.0598 0.0148 0.0148 Graphing the survival function We can pl ot the KM survival function and confidence intervals for the entire ami sample. Notice how the KM-estimated S(t) is a step function, where S(t) only changes at timepoints when an event occurs. The true underlying survival curve 5(t) may be smooth, but the smoothness of the KM curve is limited by the number of event times observed in the data. plot(KMj ylab="survival probability", xl ab="months") Comparing survival curves Stratified Kaplan-Meier estimates To estimate separate KM survival functions for different strata, specify one or more strata variables after the ~ in su rvf 110 # stratify by x variabLe KM.x <- survfit(Surv(time, status) ~ x, data=aml) # median surviveL by strata KM.x ## ## # # # Call: survfit(formula = Surv(time, status) ~ x, data = ami) # n events median 0.95LCL 0.95UCL # x=Maintained 11 7 31 18 NA # x=Nonmaintained 12 11 23 8 NA Notice the addition of the strata column in the ti dyO output: # KM estimated survival, functions by strata tidy(KM.x) ## # A tibble: 20 x 9 ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## time n.risk n.event n.censor estimate std.error conf.high conf.low strata <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 9 11 1 0 0.909 0.0953 1 0.754 x=Mainta„ 2 13 10 1 1 0.818 0.142 1 0.619 x=Mainta_ 3 18 8 1 0 0.716 0.195 1 0.488 x=Mainta„ 4 23 7 1 0 0.614 0.249 0.999 0.377 x=Mainta„ 5 28 6 0 1 0.614 0.249 0.999 0.377 x=Mainta„ 6 31 5 1 0 0.491 0.334 0.946 0.255 x=Mainta... 7 34 4 1 0 0.368 0.442 0.875 0.155 x=Mainta„ 8 45 3 0 1 0.368 0.442 0.875 0.155 x=Mainta_ 9 48 2 1 0 0.184 0.834 0.944 0.0359 x=Mainta„ 10 161 1 0 1 0.184 0.834 0.944 0.0359 x=Mainta... 11 5 12 2 0 0.833 0.129 1 0.647 X=Nonmai„ 12 8 10 2 0 0.667 0.204 0.995 0.447 x=Nonmai_ 13 12 8 1 0 0.583 0.244 0.941 0.362 x=Nonmai_. 14 16 7 0 1 0.583 0.244 0.941 0.362 x=Nonmai_ 15 23 6 1 0 0.486 0.305 0.883 0.268 x=Nonmai_. 16 27 5 1 0 0.389 0.378 0.816 0.185 x=Nonmai„ 17 30 4 1 0 0.292 0.476 0.741 0.115 x=Nonmai„ 18 33 3 1 0 0.194 0.627 0.664 0.0569 x=Nonmai_ 19 43 2 1 0 0.0972 0.945 0.620 0.0153 X=Nonmai„ 20 45 1 1 0 0 Inf NA NA x=Nonmai_ Graphing stratified KM estimates of survival pl ot O will only produce confidence intervals for S(t) by default if one curve is plotted J For multiple curves, we must request confidence intervals with conf. 1 nt=T. We also specify two colors to make the graph more readable. # stratified KM curves ulth 95% CI, 2 colors plot(KM.x, ylab="survival probability", xlab=,,months"4 conf.int=T, col=c("red", "blue")) t plotO is a generic function and calls plot. survfi t() when supplied a survfi t object. The option conf.i nt= is an option of pl ot. su rvf i t O • Customizable, informative survival plots with survminer pl ot. su rvf 11 O uses base R graphics and has limited options to customize a plot of survival functions. The survmi ner package leverages the graphical power of the ggplot2 package and adds many of its own features to create highly customizable plots of survival functions. Using ggsurvplotO from survmi ner on a survfit object produces a plot of the KM estimated survival functions. ggsurvplot(KM.x, conf.iirt-T) Strata x=Maintained -+■ x=Nonmaintained Time Notice that ggsurvplotO automatically adds + symbols to denote censored observations, ggsu rvpl ot Q makes adding a risk table very easy. ggsurvplot(KM.x, conf.int=T. risk.table=T) Strata x=Maintained x=Nonmaintained Time Number at risk -2 x=Maintained ■ 11 3 1 1 1 x=Nonmaintained- 12 2 0 0 0 0 40 80 120 160 Time You can pass most arguments to various functions in ggpl ot2 through ggsu rvpl ot (). ggsurvplot(KM.x, conf.int=T, risk.table=T, palette="Accent", # argument to scale_coior_brewer() size=2, # argument to geom_Line() ggtheme = theme_mini»al()) # changing ggtheme Strata x=Maintained x=Nonmaintained 1.00- 0.75- 03 .Q O 0.5003 CO 0.25- o.oo- 80 120 160 -1 1 d 0 0 0 80 120 160 Time Number at risk co x=Maintained 2 CO x=Noni Q d lol d d 1 12 2 () 40 1 Time You can also use traditional ggplot2 syntax by extracting the ggplot object as ggsurvplot()$plot. g <- ggsurvplot(KM.Xj conf.lnt=T, risk.table=T)$plot # this is the ggpLot object g i scale_fill_grey() + scale_color_grey() Survival probability See our ggplot2 seminar to learn how to use the ggpl Ot2 package. Comparing survival functions with survdiff ( ) We often want to test the hypothesis: curves across 2 or more groups are equivalent H^. survival curves across 2 or more groups are not equivalent Hq‘. survival The log-rank statistic is one popular method to evaluate this hypothesis. Under the null, the log-rank statistic is x2 distributed with g — 1 degrees of freedom. The function survdi ff O performs the log-rank test by default. # Log rank test, default is rho=0 survdiff(Surv(time, status) ~ x, data=aml) ## ## ## ## ## ## ## ## Call: survdiff(formula = Surv(time, status) ~ x, data = ami) x=Maintained x=Nonmaintained Chisq= 3.4 N Observed Expected (0-E)A2/E (0-E)A2/V 11 7 10.69 1.27 3.4 12 11 7.31 1.86 3.4 on 1 degrees of freedom, p= 0.07 We see some evidence, not strong, that the survival curves may be different. survdi ff O allows weights in calculation of the x2 statistic with the rho= argument. The weights are defined as S(t)p, where 0<p< 1. ■ rho=0 equal weights, S(f)° = 1, which is the log-rank test and the default ■ rho=l the weights equal the survival estimate itself, 5(f)1 = 5(f), equivalent to the Gehan-Wilcoxon test • earlier timepoints are weighted more heavily (might make sense for death after surgery, for example) ■ rho= values between o and 1 are valid, with values closer to 1 putting more weight on earlier time points # rho=l specifies Peto & Peto modification of Gehan-NiLcoxon, # more ueight put on eartier time points survdiff(Surv(tlme, status) ~ x, data=aml, rho=l) ## ## ## ## ## ## ## ## Call: survdiff(formula = Surv(time, status) ~ x, data = ami, rho = 1) N Observed Expected (O-E)A2/E (O-E)A2/V 0.859 x=Maintained 11 3.85 6.14 2.78 x=Nonmaintained 12 7.18 4.88 1.081 2.78 Chisq= 2.8 on 1 degrees of freedom, p= 0.1 Exercise 1 The veteran data set describes survival times for veterans with lung cancer. Variables: ■ ti me: survival time ■ status: status, o=censored, i=dead ■ trt: treatment, i=standard, 2=test 1. Create a graph and table of the Kaplan-Meier estimated survival function for the entire data set. What is the median survival time? 2. Create a graph of KM estimated survival functions stratified by treatment, adding 95% confidence intervals and coloring the 2 functions. Does survival appear different for the 2 treatment groups? 3. Use the log-rank test to provide more evidence for your assessment of survival of the 2 groups. The Cox proportional hazards model Background Instead of estimating the survival function S(t) directly, the Cox proportional hazards model estimates changes to the hazard function, h(t). The Cox model can estimate the effects of multiple predictors^ on the hazard function. By far the most popular method for survival analysis ■ no distribution assumed for survival times ■ naturally accommodates right-censoring and time-varying covariates ■ can be extended in many ways: • time-varying coefficients • random effect frailties for recurrent events or clustering • competing risks modeling t We use the words predictor and covariate interchangeably throughout this workshop. The Cox proportional hazards model For simplicity, we begin with a Cox model with a single time-constant predictor, Ai: = iEi) = ho^exp^bixi) ■ h(t\Xi — a?i): the hazard at time t for a subject with predictor Xi equal to the value xi ■ ho(t): the baseline hazard at time t, the hazard for a subject with all predictors equal to zero ■ exp(bixi)‘. the hazard ratio comparing the hazard for a subject with Xi = sei to a subject with Xi = 0 Hazard ratio For example, imagine that X± is a treatment variable, with values Xi = 1 for treatment and X± = 0 for control. The hazard at time t for treatment: = 1) = hofyexp^bi * 1) = h0(t)exp(b!) and for control: h(t|Xi = 0) = ho(t)exp(bi * 0) = ho(t)exp(O) - holt) We can compare the hazards for treatment and control at time t as a hazard ratio (HR): h^X, = 1) = 0) _ ho(t)exp(bi) ho(t) — exp(bi) Thus, exp(bi) is the hazard ratio comparing the hazard for treatment to controls. ■ HR — .25 means that treatment has | (25%) the hazard of control, or a 75% decrease. With a lower hazard rate, treatment will have fewer expected events and thus better survival. ■ HR — 2 here means that treatment has twice the hazard of control, or a 100% increase, and thus worse survival. In general, exp(bi) expresses the hazard ratio for a 1-unit increase in the associated covariate. 51 itself is the log-hazard ratio. Proportional hazards The standard Cox model assumes proportional hazards, which means that the effects of covariates are constant over time. For example, in our simple Cox model for a treatment effect, proportional hazards means that the effect of treatment does not change over time. Notice that the expression for the hazard ratio for the treatment effect does not contain time, so it will be the same value no matter the time: HR = expfbi) Note: With proportional hazards, a subject’s hazard function (which we don’t need to know for the Cox model) can change over time. But the hazard ratio comparing that subject’s hazard to another subject’s hazard cannot change over time, provided their covariate values do not change. Visually, this is represented by “parallel” survival curves that should not cross: Proportional hazard functions (left) and corresponding survival functions (right) days hazard function --- control ---- treatment hazard function --- control ---- treatment The parallelism is easier to evaluate if we plot — log(—log(S(t))), the negative log of the negative log of the survival function. If the hazards are proportional, the vertical distance between the curves is constant across time. hazard Proportional hazard functions (left) and corresponding -log(-log(survival)) functions (right) hazard function --- control ---- treatment Violation of proportional hazards suggests that a predictor’s effect changes over time. Hazard and survival functions will often cross: hazard function —■ control ---- treatment Non-proportional hazard functions (left) and corresponding survival functions (right) hazard function --- control ---- treatment hazard function --- control ---- treatment As will graphs of — log(—log(S(t))), where we see that the vertical distance between curves changes and reverses direction overtime: hazard Non-proportional hazard functions (left) and corresponding -log(-log(survival)) functions (right) days hazard function --- control ---- treatment days hazard function --- Failing to account for non-constant hazard ratios threatens validity of Cox model estimates. The Cox model can be extended in various ways to accommodate non-proportional hazards. control ---- treatment Baseline hazard function ho (t) One reason why the Cox model is so popular is that it does not require specification of the baseline hazard function, ho (f), the hazard function for a subject with zero on all covariates. Essentially, this means we can use the Cox model without assuming a particular form of the hazard function or assuming a distribution of survival times. You can fit a Cox model without knowing the shape of the hazard function hazard function constant — decreasing — increasing We thus then do not need to estimate parameters to characterize a hazard function or survival distribution, but only the regression parameters that quantify the covariate effects. The Cox model is thus called semiparametric. No constant/intercept in Cox models. Because ho(t) is left unspecified, the Cox model can not directly estimate either the hazard function or the survival function, but is used to estimate covariate effects on the hazard functions.