# Joint Models for Longitudinal and Survival Data

[This article was first published on **iProgn: Interactive Prediction Tools based on Joint Models**, and kindly contributed to R-bloggers].

### What are joint models for longitudinal and survival data?

In this post we will introduce in layman’s terms the framework of joint models for longitudinal and time-to-event data. These models are applied in settings where the sample units are followed up in time; for example, we may be interested in patients suffering from a specific disease who are followed up in time to monitor their progress. In this context, different types of outcomes are typically collected for the sample units at hand. In general, these can be categorized as longitudinal outcomes, i.e., the same outcome repeatedly measured in time for the same sample unit, and event-time outcomes, i.e., the time until a specific event of interest occurs for the sample units. Often our research questions require analyzing each of these outcomes separately. For instance, does a new treatment have a beneficial effect on the survival of the patients, or is there any evidence for a difference in the average longitudinal evolutions between males and females? However, there are also research questions of a more complex nature for which a joint modeling approach of these outcomes is required. Joint models for longitudinal and survival data constitute an attractive paradigm for the analysis of such data, and they are mainly applicable in two settings: first, when the focus is on a survival outcome and we wish to account for the effect of endogenous time-varying covariates measured with error, and second, when the focus is on the longitudinal outcome and we wish to correct for non-random dropout.

#### What are endogenous time-varying covariates (in the setting described above)?

A time-varying covariate is termed endogenous if its value at any time point *t* can be affected by an event occurring at an earlier time point *s* < *t*, and exogenous if its value at *t* is not affected by an earlier event. Examples of exogenous covariates are the nurse who took the blood sample or performed the echo, the period of the year (e.g., winter versus summer) and environmental factors (e.g., pollution levels). On the other hand, all covariates measured on the patient (e.g., biomarkers) are endogenous. To make this distinction clearer, let’s consider two time-varying covariates in the context of a study on asthma: a biomarker for asthma that has been measured during follow-up for the patients in the study, and the air pollution levels in the neighborhood where each patient lives. Suppose that a particular patient has an asthma attack *s* = 5 months after the start of the study. It is directly evident that at a future time point, say *t* = 5.2 months, the level of the biomarker will be affected by the fact that this patient had an asthma attack, whereas the air pollution levels at the same time point *t* = 5.2 months will not be affected by this attack.

#### What is non-random dropout (in the setting described above)?

A common and important problem in the analysis of longitudinal outcomes is missing data. Even though measurements are planned at specific time points, often and for a variety of reasons the subjects under study do not adhere to these scheduled visit times, or they even drop out of the study completely. When the reasons (or, more accurately, the probability) for dropping out depend on unobserved longitudinal responses, the process that generates these missing data cannot be ignored, even if the main focus is on the longitudinal outcome. In this case the dropout process is termed non-random.

### How do joint models work?

The intuitive idea behind joint models for longitudinal and survival outcomes is given in the accompanying figure. In the top panel the hazard process is depicted, which describes how the instantaneous risk (hazard function) of an event changes in time; for example, the hazard of a patient having a relapse. In the bottom panel the asterisks denote the observed longitudinal responses and the green line the underlying longitudinal process; for example, the values of a biomarker for the disease under study. Joint models postulate a relative risk (proportional hazards) model for the event time outcome, which is directly associated with the longitudinal process denoted by the green line. This green line is recovered from the observed data (asterisks) using a mixed effects model. This model contains fixed effects, describing the average longitudinal evolution in time, and random effects that describe how each patient deviates from this average evolution. In their basic form joint models assume that the hazard function at any particular time point *t*, denoted by the vertical dashed line, is associated with the value of the longitudinal process (green line) at the same time point. The blue line represents the assumption behind the time-dependent Cox model, which posits that the value of the longitudinal outcome remains constant in between the observation times. Estimation of the model is based on the joint distribution of the two outcomes and can be done either under maximum likelihood or under a Bayesian approach. The framework of joint models can be used to account for both endogenous time-varying covariates and non-random dropout.
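
In symbols, the basic joint model described above can be sketched as follows. The notation used here ($y_i$, $m_i$, $b_i$, $w_i$, $\alpha$ and so on) is an assumption introduced for illustration; it is the usual textbook form, not something taken from the package output.

$$
\begin{aligned}
y_i(t) &= m_i(t) + \varepsilon_i(t), \qquad \varepsilon_i(t) \sim N(0, \sigma^2),\\
m_i(t) &= x_i^\top(t)\,\beta + z_i^\top(t)\,b_i, \qquad b_i \sim N(0, D),\\
h_i(t) &= h_0(t)\,\exp\{\gamma^\top w_i + \alpha\, m_i(t)\}.
\end{aligned}
$$

Here $y_i(t)$ is the observed longitudinal response of patient $i$ at time $t$ (the asterisks), $m_i(t)$ is the underlying longitudinal process (the green line), $\beta$ are the fixed effects, $b_i$ the patient-specific random effects, $h_0(t)$ the baseline hazard, $w_i$ the baseline covariates of the survival submodel, and $\alpha$ the association parameter linking the two processes. Under the time-dependent Cox model, by contrast, $m_i(t)$ would be replaced by the last observed value of $y_i$ before $t$ (the blue step function).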

### Fitting joint models using package JMbayes

There are several packages available on CRAN for fitting this type of joint model, among others JM, JMbayes, joineR and lcmm. In this post we will show how joint models can be fitted using package JMbayes, which fits joint models under the Bayesian paradigm. For this illustration we will be using the Primary Biliary Cirrhosis (PBC) data set (available in the package and also in the survival package) collected by the Mayo Clinic from 1974 to 1984. For our analysis we will consider 312 patients who have been randomized to D-penicillamine and placebo. During follow-up several biomarkers associated with PBC have been collected for these patients. Here we focus on serum bilirubin, which is considered one of the most important biomarkers associated with disease progression. In package JMbayes the PBC data are available in the data frames pbc2 and pbc2.id, containing the longitudinal and survival information, respectively (i.e., the former is in the long format while the latter contains a single row per patient). We start by loading the JMbayes and lattice packages and defining the indicator status2 for the composite event, namely transplantation or death:

```r
library("JMbayes")
library("lattice")
pbc2$status2 <- as.numeric(pbc2$status != "alive")
pbc2.id$status2 <- as.numeric(pbc2.id$status != "alive")
```

The design followed by package JMbayes requires first fitting separately a linear mixed model for the longitudinal outcome and a Cox model for the survival one. The aim of the linear mixed model is to describe/recover the subject-specific longitudinal trajectories. Close inspection of the shapes of the log serum bilirubin profiles indicates that for some individuals these seem to be nonlinear.
```r
set.seed(1)
## we take a sample of patients with more than six measurements
long_ids <- names(which(table(pbc2$id) > 6))
ids <- sample(long_ids, 16)
xyplot(log(serBilir) ~ year | id, data = pbc2, subset = id %in% ids,
       type = c("p", "smooth"), lwd = 2, layout = c(4, 4))
```

Hence, to allow for flexibility in the specification of these profiles we include natural cubic splines in both the fixed- and random-effects parts of the mixed model. This model can be fitted using the following calls to functions lme() and ns() (the latter from package splines):

```r
## ns() is provided by package splines; we load it explicitly in case
## it is not already attached
library("splines")
lmeFit <- lme(log(serBilir) ~ ns(year, 2), data = pbc2,
              random = ~ ns(year, 2) | id)
```

Analogously, in the Cox model we control for treatment and age, and also allow for their interaction:

```r
coxFit <- coxph(Surv(years, status2) ~ drug * age, data = pbc2.id, x = TRUE)
```

In the call to coxph() argument x is set to TRUE such that the design matrix is also included in the resulting model object. Using as main arguments the lmeFit and coxFit objects, the corresponding joint model is fitted using the code:

```r
jointFit <- jointModelBayes(lmeFit, coxFit, timeVar = "year")
summary(jointFit)
```

```
Call:
jointModelBayes(lmeObject = lmeFit.pbc1, survObject = coxFit.pbc1,
    timeVar = "year")

Data Descriptives:
Longitudinal Process            Event Process
Number of Observations: 1945    Number of Events: 169 (54.2%)
Number of subjects: 312

Joint Model Summary:
Longitudinal Process: Linear mixed-effects model
Event Process: Relative risk model with penalized-spline-approximated
               baseline risk function
Parameterization: Time-dependent value

      LPML      DIC       pD
 -3169.172 6116.463 939.5384

Variance Components:
              StdDev    Corr
(Intercept)   1.0097  (Intr)  n(,2)1
ns(year, 2)1  2.3555  0.3640
ns(year, 2)2  2.2430  0.3609  0.5701
Residual      0.3023

Coefficients:
Longitudinal Process
              Value Std.Err Std.Dev   2.5%  97.5%      P
(Intercept)  0.4918  0.0212  0.0699 0.3676 0.6316 <0.001
ns(year, 2)1 2.4117  0.1032  0.2337 1.9423 2.7918 <0.001
ns(year, 2)2 2.3611  0.0895  0.2906 1.7765 2.8609 <0.001

Event Process
                    Value Std.Err Std.Dev    2.5%   97.5%      P
drugD-penicil     -0.9648  0.1272  0.7543 -2.4530  0.4693  0.199
age                0.0370  0.0018  0.0107  0.0135  0.0571 <0.001
drugD-penicil:age  0.0179  0.0025  0.0145 -0.0096  0.0470  0.211
Assoct             1.4181  0.0062  0.0953  1.2367  1.6041 <0.001
Bs.gammas1        -6.6205  0.1009  0.5992 -7.7762 -5.3132 <0.001
Bs.gammas2        -6.5832  0.1031  0.5962 -7.7639 -5.2468 <0.001
...

MCMC summary:
iterations: 20000
adapt: 3000
burn-in: 3000
thinning: 10
time: 2.6 min
```

Argument timeVar is a character string that specifies the name of the time variable in the mixed model (the scale of time, e.g., days, months or years, must be the same in both the mixed and Cox models). The default call to jointModelBayes() includes in the linear predictor of the relative risk model the subject-specific linear predictor of the mixed model, which in this case represents the average patient-specific log serum bilirubin level. The output of the summary() method is rather self-explanatory. It contains model summary statistics, namely LPML (the log pseudo marginal likelihood value), DIC (deviance information criterion) and pD (the effective number of parameters component of DIC); posterior means for all parameters; and standard errors (effective sample size estimated using time series methodology), standard deviations, 95% credibility intervals and tail probabilities for all regression coefficients in the two submodels. The association parameter, denoted in the output as Assoct, measures how strongly the longitudinal outcome at any particular time point *t* is associated with the hazard of an event at the same time point. The results suggest that serum bilirubin is strongly related to the risk for the composite event, with a doubling of serum bilirubin levels resulting in a 2.7-fold (95% CI: 2.3; 3.1) increase of the risk.
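
The 2.7-fold figure follows from the fact that the mixed model is fitted to log(serBilir), the natural logarithm, so doubling serum bilirubin adds log 2 to the longitudinal process. A short worked calculation, writing $\alpha$ for the association parameter (a symbol assumed here for illustration) and plugging in the posterior mean and 95% limits of Assoct from the printed output:

$$
\frac{h_i(t \mid 2 \times \text{serBilir})}{h_i(t \mid \text{serBilir})}
= \exp(\alpha \log 2) = 2^{\alpha},
\qquad 2^{1.4181} \approx 2.67, \quad 2^{1.2367} \approx 2.36, \quad 2^{1.6041} \approx 3.04 .
$$

The point estimate rounds to the reported 2.7. The limits computed from the printed output round to about 2.4 and 3.0, which is close to the interval quoted in the text; the small difference may be due to rounding or to a different MCMC run, but that is only a guess on our part.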
