Introduction to Survival Analysis in R
Summary
This document is an introduction to survival analysis in R, focusing on the practical aspects of using the "survival" R package. The document encompasses survival analysis basics, associated packages, clear outlines for specific analysis methods and a detailed review of the survival function, hazard function and cumulative hazard function.
UCLA Office of Advanced Research Computing
Statistical Methods and Data Analytics

Purpose

This workshop aims to provide just enough background in survival analysis to be able to use the survival package in R to:
■ estimate survival functions
■ test whether survival functions are different between groups
■ fit a Cox proportional hazards model

Workshop packages

The survival package:
■ provides all tools used in this workshop to estimate survival analysis models and tests
■ was created by Terry Therneau, a researcher and expert in survival analysis, so the package is trustworthy
  • Therneau co-authored Modeling Survival Data: Extending the Cox Model with Patricia Grambsch, a reference book for survival analysis and the survival package
  • Grambsch and Therneau developed some of the methods used to assess the proportional hazards assumption of the Cox model
■ is widely used, so it has inspired many additional packages that extend its functionality, like survminer

We use the survminer package for its ggsurvplot() function, used to create highly customizable plots of survival functions.

We use the broom package for its tidy() function, which cleans up output tables and stores them as data frames.

If you are following in RStudio, go ahead and load the workshop packages now with library():

    library(survival)
    library(survminer) # for customizable graphs of survival function
    library(broom)     # for tidy output
    library(ggplot2)   # for graphing (actually loaded by survminer)

Outline

1. Quick review of survival analysis
2. Setting up data for survival analysis
3. Kaplan-Meier estimator of the survival function
4. Comparing survival curves
5. Cox model introduction
6. Fitting a Cox model with coxph()
7. Predictions from a Cox model
8. Assessing the proportional hazards assumption
9. Time-varying covariates

A very quick review of survival analysis

What is survival analysis?

Survival analysis models how much time elapses before an event occurs.
The outcome variable, the length of time to an event, is often referred to as either survival time, failure time, or time to event.

Example events include:
■ death upon contracting a disease
■ divorce
■ malfunctioning of a machine
■ first job

Events are often referred to as failures. Almost anything can be framed as the event of interest, so survival analysis has broad applications across many fields.

Before the event occurs or the subject’s time is censored, we say that the subject is at risk and a member of the risk set.

Survival function

One of the goals of survival analysis is to estimate the probability that a subject survives without experiencing the event past some time t. We can infer these probabilities from observing how long different subjects remain at risk before failing, i.e., observing their survival times.

Let T be a random variable representing a subject’s true survival time. Sometimes we cannot observe a subject’s true survival time T during the course of a study; this is known as censoring. In general, we say we observe subjects’ follow-up time, which for some will be the true survival time T and for others will be the censoring time.

The survival function, S(t), expresses the probability that a subject’s true survival time T will exceed time t, i.e., that the subject will survive beyond time t.

$$S(t) = \Pr(T > t)$$

[Figure: survival function, S(t)]

For the survival curve above:
■ S(100) = .577, so the probability that a subject survives beyond 100 days is 0.577
■ S(200) = .122, so the probability that a subject survives beyond 200 days is 0.122
Typically, we also assume:
■ S(0) = 1, all subjects survive the very first moment
■ S(∞) = 0, all subjects fail after infinite time

Median survival time is defined as the time t at which 50% of the population is expected to be still surviving.

[Figure: survival curve (survival probability against time) with the median marked. Median survival time is 114.2 days, S(114.2) = .5]

Hazard function

Some survival methods, such as the Kaplan-Meier estimator, focus on estimating the survival function S(t) directly. Other methods, such as the Cox model, focus on the hazard function (also known as the hazard rate), h(t), which is inversely related to S(t).

The hazard function at time t, h(t), is defined as the instantaneous rate of events at time t, given that the subject has survived until time t.

We say instantaneous because h(t) may be changing moment to moment, continuously over time. For example, in the green curve below, h(200) = .0204 events/day, while h(200.1) = .02041 events/day.

With an increase in the hazard function, more events are expected per unit time, and survival is expected to decrease more quickly.

Below we see three examples of hazard functions, 2 of which are changing continuously with time.

[Figure: three example hazard functions; legend: constant, decreasing, increasing]

Note: h(t) ≥ 0, so the hazard function can never be negative.

Cumulative hazard

The cumulative hazard function, H(t), expresses how much hazard a subject has accumulated over time up to time t.

$$H(t) = \int_0^t h(u)\,du$$

The probability that a subject will fail over time increases as the hazard accumulates. Because the hazard function h(t) is never negative, the cumulative hazard H(t) can never decrease with time.
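As a rough numerical illustration of the cumulative hazard definition above, the sketch below integrates three hypothetical hazard functions (constant, decreasing, increasing) with base R's integrate(). The functional forms, the parameter values, and the helper name cum_hazard are assumptions chosen for illustration only; they are not the curves shown in the workshop's figures.

```r
# Three hypothetical hazard functions (illustrative forms, not from the workshop)
h_constant   <- function(t) rep(0.01, length(t))    # same hazard at all times
h_decreasing <- function(t) 0.02 * exp(-0.01 * t)   # hazard falls over time
h_increasing <- function(t) 0.0001 * t              # hazard rises over time

# Cumulative hazard H(t): numerical integral of h(u) from 0 to t
cum_hazard <- function(h, t) {
  sapply(t, function(ti) integrate(h, lower = 0, upper = ti)$value)
}

times <- c(50, 100, 150, 200)
H <- data.frame(
  time       = times,
  constant   = cum_hazard(h_constant, times),
  decreasing = cum_hazard(h_decreasing, times),
  increasing = cum_hazard(h_increasing, times)
)
print(H)

# Because h(t) >= 0, no H(t) column should ever decrease with time
sapply(H[-1], function(col) all(diff(col) >= 0))
```

Under these assumed forms, the constant hazard accumulates linearly, the decreasing hazard accumulates ever more slowly, and the increasing hazard accumulates ever faster, but none of the three cumulative hazards ever goes down.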
[Figure: three hazard functions, h(t), and the corresponding cumulative hazard functions, H(t); legend: constant, decreasing, increasing]

Relationship between the hazard and survival functions

The survival function is inversely related to the cumulative hazard function: as a subject’s cumulative hazard grows, the survival probability decreases.

$$S(t) = \exp(-H(t))$$

[Figure: three cumulative hazard functions, H(t), and the corresponding survival functions, S(t); legend: constant, decreasing, increasing]

Therefore, by modeling either the survival function or the hazard function, we can infer the other.

Censoring

Often the exact time at which the event occurs is unknown, or censored.

Right censoring means that a subject’s actual survival time is greater than their observed time. Common causes:
■ study ends before event occurs
■ subject is lost to follow-up
■ subject is no longer “at risk” for the event after the study begins

[Figure: subjects' follow-up times in days; status legend: censored, death, unobserved death]*

Left censoring means that a subject’s actual survival time is less than their observed time. One common example is when the event is defined as disease infection: positive tests for the infection may be delayed by days or even years.

Interval censoring means that a subject’s survival time is unknown, but known to lie between 2 observed time points.

We will only discuss methods that handle right censoring in this workshop.

Many standard methods, such as linear regression, are not equipped to deal with censored outcomes.

*Image adapted from Kleinbaum and Klein, Survival Analysis: A Self-Learning Text, Third Edition, Springer, 2012.

Assumption of noninformative censoring

Most survival analysis methods, including all those discussed here, assume noninformative censoring:
■ a subject’s censoring time should not be related to the unobserved survival time
■ the distributions of censoring times and survival times are unrelated

[Figure: informative (left) vs noninformative (right) censoring; status legend: censored, death, unobserved death]

Failing to account for informative censoring may result in biased estimates of survival. Below are plots of the Kaplan-Meier survival function estimates of the above data:

[Figure: Kaplan-Meier survival estimates plotted against days. The survival function estimated from data with informative censoring is very different; the survival functions estimated from data with no censoring and from data with noninformative censoring are the same.]

Examples of possible informative censoring and resulting bias if not addressed:
■ Oldest subjects drop out of a study of time to death after surgery
  • Oldest subjects might have the shortest survival times, so survival estimates might be biased upward
■ Travel-loving subjects drop out of a study of time to first marriage
  • Travel-loving subjects may delay marriage to travel more and have longer survival times, so survival estimates might be biased downward

Data set up for survival analysis

Data for survival analysis

The simplest data structure for a typical survival analysis is:
■ a single row per subject
■ a status variable coding whether the subject experienced the event or not (censored)
■ a single time variable measuring time to event (or censoring time, the time of last observation)
■ variables for covariates, assumed to be time-constant in this structure

The aml dataset

We’ll start with the aml dataset in the survival package. These data come from a study looking at time to death for patients with acute myelogenous leukemia, comparing “maintained” chemotherapy treatment to “nonmaintained”.
Variables:
■ time: survival or censoring time
■ status: 0 = censored, 1 = death
■ x: chemotherapy, “maintained” or “nonmaintained”

    time  status  x
       9       1  Maintained
      13       1  Maintained
      13       0  Maintained
      18       1  Maintained
      23       1  Maintained
      28       0  Maintained
      31       1  Maintained
      34       1  Maintained
      45       0  Maintained
      48       1  Maintained
     161       0  Maintained
       5       1  Nonmaintained
       5       1  Nonmaintained
       8       1  Nonmaintained
       8       1  Nonmaintained
      12       1  Nonmaintained
      16       0  Nonmaintained
      23       1  Nonmaintained
      27       1  Nonmaintained
      30       1  Nonmaintained
      33       1  Nonmaintained
      43       1  Nonmaintained
      45       1  Nonmaintained

The Surv() function for survival outcomes

Use Surv() to specify the survival outcome variables. It allows for many different time and event status configurations.

For data with a single time variable indicating time to event or censoring, the Surv() specification will be:

    Surv(time, event)

■ time: survival/censoring time variable
■ event: status variable. To code for censored/event use:
  • 0/1
  • 1/2
  • FALSE/TRUE

Censoring is assumed to be right censoring unless otherwise specified with the type argument.

Surv() specification for start-stop format

Some survival analyses require time to be recorded in 2 variables that mark the beginning and end of time intervals. We need this format to model:
■ time-varying covariates
■ interval censoring
■ recurrent events data

In this format, some or all subjects may have multiple rows of data. This format is sometimes called start-stop format.
The jasa1 data set has this setup, where start and stop are the time variables, and event is the status variable:

    head(jasa1)
    ##     id start stop event transplant        age      year surgery
    ## 1    1     0   49     1          0 -17.155373 0.1232033       0
    ## 2    2     0    5     1          0   3.835729 0.2546201       0
    ## 102  3     0   15     1          1   6.297057 0.2655715       0
    ## 3    4     0   35     0          0  -7.737166 0.4900753       0
    ## 103  4    35   38     1          1  -7.737166 0.4900753       0
    ## 4    5     0   17     1          0 -27.214237 0.6078029       0

To specify the outcome for data in start-stop format, use:

    Surv(time, time2, event)

■ time and time2: beginning and end of time intervals
■ event: the status at the end of the interval
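To see how the two Surv() specifications behave in practice, here is a short sketch that builds one outcome of each kind from the aml and jasa1 data sets shipped with the survival package. The object names y_single and y_startstop are illustrative choices, not names used elsewhere in the workshop.

```r
library(survival)

# Single time variable: right-censored outcome from the aml data
y_single <- Surv(aml$time, aml$status)
y_single[1:6]        # censored times are printed with a trailing "+"

# Start-stop format: one (start, stop] interval per row of the jasa1 data
y_startstop <- with(jasa1, Surv(start, stop, event))
y_startstop[1:6]     # intervals that end without the event carry a "+"

# The "type" attribute records how Surv() interpreted each outcome
attr(y_single, "type")      # "right"
attr(y_startstop, "type")   # "counting"
```

A Surv object stores the time variable(s) and the status together, so it can be passed as a single outcome to the modeling functions covered later in the workshop.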