# Veterinary Epidemiologic Research: Modelling Survival Data – Parametric and Frailty Models

July 5, 2013
By

(This article was first published on denis haine » R, and kindly contributed to R-bloggers)

Last post on modelling survival data from Veterinary Epidemiologic Research: parametric analyses. The Cox proportional hazards model described in the last post make no assumption about the shape of the baseline hazard, which is an advantage if you have no idea about what that shape might be. With a parametric survival model, the survival time is assumed to follow a known distribution: Weibull, exponential (which is a special case of the Weibull), log-logistic, log-normal, and generalized gamma.

Exponential Model
The exponential model is the simplest, the hazard $h_0(t)$ is constant over time: the rate at which failures are occurring is constant, $h(t) = \lambda$. We use again the pgtrial dataset:

temp <- tempfile()
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)

library(Hmisc)
pgtrial <- upData(pgtrial, labels = c(herd = 'Herd id', cow = 'Cow id',
tx = 'Treatment', lact = 'Lactation number',
thin = 'Body condition', dar = 'Days at risk',
preg = 'Pregnant or censored'),
levels = list(thin = list('normal' = 0, 'thin' = 1),
preg = list('censored' = 0, 'pregnant' = 1)))
pgtrial$herd <- as.factor(pgtrial$herd)

library(survival)
exp.mod <- survreg(Surv(dar, preg == 'pregnant') ~ herd + tx + (lact - 1) +
thin, data = pgtrial, dist = "exponential")
summary(exp.mod)

Call:
survreg(formula = Surv(dar, preg == "pregnant") ~ herd + tx +
(lact - 1) + thin, data = pgtrial, dist = "exponential")
Value Std. Error     z         p
herd1     4.3629     0.1827 23.88 4.66e-126
herd2     4.6776     0.1711 27.34 1.41e-164
herd3     4.3253     0.1617 26.75 1.12e-157
tx       -0.2178     0.1255 -1.74  8.26e-02
lact      0.0416     0.0413  1.01  3.14e-01
thinthin  0.1574     0.1383  1.14  2.55e-01

Scale fixed at 1

Exponential distribution
Loglik(model)= -1459.9   Loglik(intercept only)= -1465.6
Chisq= 11.42 on 5 degrees of freedom, p= 0.044
Number of Newton-Raphson Iterations: 5
n= 319


Interpretation is the same as for a Cox model. Exponentiated coefficients are hazard ratios. R outputs the parameter estimates of the AFT (accelerated failure time) form of the exponential model. If you multiply the estimated coefficients by minus one you get estimates that are consistent with the proportional hazards parameterization of the model. So for tx, the estimated hazard ratio is exp(0.2178) = 1.24 (at any given point in time, a treated cow is 1.24 times more likely to conceive than a non-treated cow). The corresponding accelerating factor for an exponential model is the reciprocal of the hazard ratio, exp(-0.2178) = 0.80: treating a cow accelerates the time to conception by a factor of 0.80.

Weibull Model

In a Weibull model, the hazard function is $h(t) = \lambda p t^{p-1}$ where $p$ and $\lambda$ are > 0. $p$ is the shape parameter and determines the shape of the hazard function. If it’s $> 1$, the hazard increases with time. If $p = 1$, the hazard is constant and the model reduces to an exponential model. If $p < 1$, the hazard decreases over time.

library(car)
pgtrial$parity <- recode(pgtrial$lact, "1 = 1; 2:hi = '2+'")
weib.mod <- survreg(Surv(dar, preg == 'pregnant') ~ herd + tx + parity +
thin, data = pgtrial, dist = "weibull")
summary(weib.mod)

Call:
survreg(formula = Surv(dar, preg == "pregnant") ~ herd + tx +
parity + thin, data = pgtrial, dist = "weibull")
Value Std. Error       z         p
(Intercept)  4.23053     0.1937 21.8412 9.42e-106
herd2        0.36117     0.1947  1.8548  6.36e-02
herd3       -0.00822     0.1980 -0.0415  9.67e-01
tx          -0.23386     0.1438 -1.6262  1.04e-01
parity2+     0.33819     0.1490  2.2698  2.32e-02
thinthin     0.11222     0.1576  0.7119  4.77e-01
Log(scale)   0.13959     0.0509  2.7407  6.13e-03

Scale= 1.15

Weibull distribution
Loglik(model)= -1453.7   Loglik(intercept only)= -1460.7
Chisq= 14 on 5 degrees of freedom, p= 0.016
Number of Newton-Raphson Iterations: 5
n= 319


The shape parameter is the reciprocal of what is called by R the scale parameter. The shape parameter is then 1/1.15 = 0.869.

We can also use a piecewise constant exponential regression model, which is a model allowing the baseline hazard to vary between time periods but forces it to remain constant within time periods. In order to run such a model, we need data in a counting process format with a start and stop time for each interval. However, survreg does not allow for a data in that format. The trick would be to use a glm and fitting a Poisson model, including time intervals. See this post by Stephanie Kovalchik which explains how to construct the data and model. The example below is using the same approach, for a time interval of 40 days:

interval.width <- 40
# function to compute time breaks given the exit time = dar
cow.breaks <- function(dar) unique(c(seq(0, dar, by = interval.width),
dar))
# list of each subject's time periods
the.breaks <- lapply(unique(pgtrial$cow), function(id){ cow.breaks(max(pgtrial$dar[pgtrial$cow == id])) }) # the expanded period of observation: start <- lapply(the.breaks, function(x) x[-length(x)]) # for left time points stop <- lapply(the.breaks, function(x) x[-1]) # for right time points count.per.cow <- sapply(start, length) index <- tapply(pgtrial$cow, pgtrial$cow, length) index <- cumsum(index) # index of last observation for each cow event <- rep(0, sum(count.per.cow)) event[cumsum(count.per.cow)] <- pgtrial$preg[index]

# creating the expanded dataset
pw.pgtrial <- data.frame(
cow = rep(pgtrial$cow[index], count.per.cow), dar = rep(pgtrial$dar[index], count.per.cow),
herd = rep(pgtrial$herd[index], count.per.cow), tx = rep(pgtrial$tx[index], count.per.cow),
lact = rep(pgtrial$lact[index], count.per.cow), thin = rep(pgtrial$thin[index], count.per.cow),
start = unlist(start),
stop = unlist(stop),
event = event
)

# create time variable which indicates the period of observation (offset in Poisson model)
pw.pgtrial$time <- pw.pgtrial$stop - pw.pgtrial$start # length of observation # create a factor for each interval, allowing to have a different rate for each period pw.pgtrial$interval <- factor(pw.pgtrial$start) pw.pgtrial[100:110, ] cow dar herd tx lact thin start stop event time interval 100 61 113 1 1 4 thin 0 40 0 40 0 101 61 113 1 1 4 thin 40 80 0 40 40 102 61 113 1 1 4 thin 80 113 1 33 80 103 62 117 1 0 7 normal 0 40 0 40 0 104 62 117 1 0 7 normal 40 80 0 40 40 105 62 117 1 0 7 normal 80 117 2 37 80 106 63 121 1 1 1 thin 0 40 0 40 0 107 63 121 1 1 1 thin 40 80 0 40 40 108 63 121 1 1 1 thin 80 120 0 40 80 109 63 121 1 1 1 thin 120 121 2 1 120 110 64 122 1 1 3 normal 0 40 0 40 0 # Poisson model pw.model <- glm(event ~ offset(log(time)) + interval + herd + tx + lact + + thin, data = pw.pgtrial, family = "poisson") summary(pw.model) Call: glm(formula = event ~ offset(log(time)) + interval + herd + tx + lact + thin, family = "poisson", data = pw.pgtrial) Deviance Residuals: Min 1Q Median 3Q Max -1.858 -1.373 -1.227 1.357 3.904 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.602545 0.132436 -27.202 < 2e-16 *** interval40 -0.112838 0.106807 -1.056 0.29076 interval80 -0.064105 0.125396 -0.511 0.60920 interval120 -0.007682 0.147919 -0.052 0.95858 interval160 -0.005743 0.191778 -0.030 0.97611 interval200 -0.427775 0.309143 -1.384 0.16644 interval240 0.199904 0.297331 0.672 0.50137 interval280 0.737508 0.385648 1.912 0.05583 . interval320 0.622366 1.006559 0.618 0.53637 herd2 -0.254389 0.114467 -2.222 0.02626 * herd3 0.026851 0.119416 0.225 0.82209 tx 0.219584 0.084824 2.589 0.00963 ** lact -0.023528 0.027511 -0.855 0.39241 thinthin -0.139915 0.093632 -1.494 0.13509 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 2155.6 on 798 degrees of freedom Residual deviance: 2131.1 on 785 degrees of freedom AIC: 2959.1 Number of Fisher Scoring iterations: 7  Log-logistic Model loglog.mod <- survreg(Surv(dar, preg == 'pregnant') ~ herd + tx + lact + thin, data = pgtrial, dist = "loglogistic") summary(loglog.mod) Call: survreg(formula = Surv(dar, preg == "pregnant") ~ herd + tx + lact + thin, data = pgtrial, dist = "loglogistic") Value Std. Error z p (Intercept) 3.9544 0.2531 15.625 4.91e-55 herd2 0.2537 0.2355 1.077 2.81e-01 herd3 -0.1019 0.2437 -0.418 6.76e-01 tx -0.3869 0.1768 -2.189 2.86e-02 lact 0.0612 0.0550 1.112 2.66e-01 thinthin 0.0400 0.1894 0.211 8.33e-01 Log(scale) -0.1260 0.0515 -2.447 1.44e-02 Scale= 0.882 Log logistic distribution Loglik(model)= -1467.2 Loglik(intercept only)= -1472.2 Chisq= 9.85 on 5 degrees of freedom, p= 0.08 Number of Newton-Raphson Iterations: 4 n= 319  Individual Frailty Model In an individual frailty model, we add variance unique to individuals in order to account for additional variability in the hazard (like negative binomial model vs. Poisson model). For example, let’s fit a Weibull model with gamma individual frailty to the prostaglandin dataset: library(parfm) pgtrial$preg.bin <- as.numeric(pgtrial$preg) - 1 indfr.mod <- parfm(Surv(dar, preg.bin) ~ herd + tx + lact + thin, cluster = "cow", data = pgtrial, dist = "weibull", frailty = "gamma") Execution time: 17.872 second(s) indfr.mod Frailty distribution: gamma Baseline hazard distribution: Weibull Loglikelihood: -1455.679 ESTIMATE SE p-val theta 0.000 0.003 rho 0.867 0.044 lambda 0.024 0.006 herd2 -0.289 0.169 0.088 . herd3 0.039 0.175 0.824 tx 0.204 0.125 0.103 lact -0.041 0.041 0.314 thinthin -0.136 0.138 0.323 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1  Shared Frailty Shared frailty is a way to deal with clustered data. We will use the “culling” dataset and fit a shared frailty model with a Weibull distribution and a gamma distributed frailty common to all cows in a herd: temp <- tempfile() download.file( "http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp) load(unz(temp, "ver2_data_R/culling.rdata")) unlink(temp) library(frailtypack) shfrw.mod <- frailtyPenal(Surv(dar, culled) ~ as.factor(lact_c3) + johnes + cluster(herd), hazard = 'Weibull', data = culling, Frailty = TRUE) shfrw.sum <- cbind(shfrw.mod$coef, sqrt(diag(shfrw.mod$varH)), shfrw.mod$coef / sqrt(diag(shfrw.mod$varH)), signif(1 - pchisq((shfrw.mod$coef/sqrt(diag(shfrw.mod$varH)))^2, 1)), exp(shfrw.mod$coef),
exp(shfrw.mod$coef - abs(qnorm((1 - 0.95) / 2)) * sqrt(diag(shfrw.mod$varH))),
exp(shfrw.mod$coef + abs(qnorm((1 - 0.95) / 2)) * sqrt(diag(shfrw.mod$varH))))
row.names(shfrw.sum) <- c("Lactation 2", "Lactation 3+", "Johnes")
colnames(shfrw.sum) <- c("Coef.", "Std. Err.", "z", "p-value", "Hazard Ratio",
"Lower CI", "Upper CI")
shfrw.sum
Coef. Std. Err.        z     p-value Hazard Ratio  Lower CI
Lactation 2  0.2518627 0.1450806 1.736019 8.25605e-02     1.286419 0.9680321
Lactation 3+ 0.7636558 0.1227840 6.219508 4.98717e-10     2.146108 1.6870874
Johnes       0.5914741 0.3045475 1.942141 5.21200e-02     1.806650 0.9945867
Upper CI
Lactation 2  1.709525
Lactation 3+ 2.730017
Johnes       3.281748


That’s it for reproducing the examples from Dohoo’s book, chapter on modelling survival data. Next time I’ll look at mixed models.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...