Survival Analysis With Generalized Models: Part II (time discretization, hazard rate integration and calculation of hazard ratios)

[This article was first published on Statistical Reflections of a Medical Doctor » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In the second part of the series we will consider the time discretization that makes the Poisson GAM approach to survival analysis possible.

Consider a set of sM individual observations at times \mathcal{F}=\left\{ F_{i}\right\} _{i=1}^{M} , with censoring indicators \mathcal{D}=\left\{ \delta_{i}\right\} _{i=1}^{M} assuming the value of 0 if the corresponding observation was censored and 1 otherwise. Under the assumption of non-informative censoring, the likelihood of the sample is given by:

L=\prod_{i=1}^{M}f(F_{i})^{\delta_{i}}S(F_{i})^{1-\delta_{i}}=    \prod_{i=1}^{M}h(F_{i})^{\delta_{i}}\exp\left(-\int_{0}^{F_{i}}h(t)\mathrm{\, d}t\right)

where h(t) is the hazard function. By using an interpolatory quadrature rule, one may substitute the integral with a weighted sum evaluated at a distinct number of nodes.

L=\prod_{i=1}^{M}\prod_{j=1}^{N_{i}}h(t_{i})^{d_{i,j}}\exp\left(-w_{j,j}h(t_{i,j})\right)

where t_{i,j}, w_{i,j}  are the nodes, weights of the integration rule and d_{i,j} is an indicator variable equal to 1 if the corresponding node corresponds to an event time and zero otherwise.  By including additional “pseudo-observations” at the nodes of the quadrature rule, we converted the survival likelihood to the kernel of a Poisson regression with variable exposures (weights).  Conditional on the adoption of an efficient quadrature rule, this is a highly accurate approximation:

A) relationship between (log) MST and the logarithm of the Maximum Hazard rate function for survival distributions with a cubic polynomial log baseline hazard function (B) Box plots of the GL error as a function of the number of nodes in the quadrature rule (C) GL error as a function of the length of the integration interval (taken equal to be equal to the MST for each distribution examined) for different orders of the quadrature rule (D) GL error as a function of the maximum value of the hazard rate for different orders of the quadrature rule.

Bounds of the Gauss Lobatto (GL) approximation error for the integration of survival data (MST=Mean Survival Time).

In order for the construct to work one has to ensure that the corresponding lifetimes are mapped to a node of the integration scheme.  In our paper, this was accomplished by the adoption of the Gauss-Lobatto rule. The nodes and weights of the Gauss-Lobatto rule (which is defined in the interval [-1,1] depend on the Legendre polynomials in a complex way. The following R function will calculate the weights and nodes for the N-th order Gauss Lobatto rule:

GaussLobatto<-function(N)
{
 N1<-N
 N<-N-1
 x=matrix(cos(pi*(0:N)/N),ncol=1)
 x=cos(pi*(0:N)/N)
 P<-matrix(0,N1,N1)
 xold<-2
 while (max(abs(x-xold))>2.22044604925031e-16) {
 xold<-x
 P[,1]<-1 
 P[,2]<-x
 
 for (k in 2:N) {
 P[,k+1]=( (2*k-1)*x*P[,k]-(k-1)*P[,k-1] )/k;
 }
 
 x<-xold-( x*P[,N1]-P[,N] )/( N1*P[,N1] )
 
 }

 w<-2./(N*N1*P[,N1]^2);
 ret<-list(x=rev(x),w=w)
 attr(ret,"order")<-N
 ret
}

which can be called to return a list of the nodes and their weights:

> GaussLobatto(5)
$x
[1] -1.0000000 -0.6546537 0.0000000 0.6546537 1.0000000

$w
[1] 0.1000000 0.5444444 0.7111111 0.5444444 0.1000000

attr(,"order")
[1] 4

To prepare a survival dataset for GAM fitting, one needs to call this function to obtain a Gauss Lobatto rule of the required order. Once this has been obtained, the following R function will expand the (right-censored) dataset to include the pseudo-observations at the nodes of the quadrature rule:


GAMSurvdataset<-function(GL,data,fu,d)
## GL : Gauss Lobatto rule
## data: survival data
## fu: column number containing fu info
## d: column number with event indicator
 {
 ## append artificial ID in the set
 data$id<-1:nrow(data)
 Gllx<-data.frame(stop=rep(GL$x,length(data$id)),
 gam.dur=rep(GL$w,length(data$id)),
 t=rep(data[,fu],each=length(GL$x)),
 ev=rep(data[,d],each=length(GL$x)),
 id=rep(data$id,each=length(GL$x)),
 gam.ev=0,start=0)
 ## Change the final indicator to what
 ## was observed, map node positions,
 ## weights from [-1,1] back to the
 ## study time
 Gllx<-transform(Gllx,
 gam.ev=as.numeric((gam.ev | ev)*I(stop==1)),
 gam.dur=0.5*gam.dur*(t-start),
 stop=0.5*(stop*(t-start)+(t+start)))
 ## now merge the remaining covariate info
 Gllx<-merge(Gllx,data[,-c(fu,d)])
 Gllx
 }

We illustrate the use of these functions on the Primary Biliary Cirrhosis dataset that comes with R:

data(pbc)
> ## Change transplant to alive
> pbc$status[pbc$status==1]<-0
> ## Change event code of death(=2) to 1
> pbc$status[pbc$status==2]<-1
> 
> head(pbc)
 id time status trt age sex ascites hepato spiders edema bili chol albumin copper
1 1 400 1 1 58.76523 f 1 1 1 1.0 14.5 261 2.60 156
2 2 4500 0 1 56.44627 f 0 1 1 0.0 1.1 302 4.14 54
3 3 1012 1 1 70.07255 m 0 0 0 0.5 1.4 176 3.48 210
4 4 1925 1 1 54.74059 f 0 1 1 0.5 1.8 244 2.54 64
5 5 1504 0 2 38.10541 f 0 1 1 0.0 3.4 279 3.53 143
6 6 2503 1 2 66.25873 f 0 1 0 0.0 0.8 248 3.98 50
 alk.phos ast trig platelet protime stage
1 1718.0 137.95 172 190 12.2 4
2 7394.8 113.52 88 221 10.6 3
3 516.0 96.10 55 151 12.0 4
4 6121.8 60.63 92 183 10.3 4
5 671.0 113.15 72 136 10.9 3
6 944.0 93.00 63 NA 11.0 3
> 
> GL<-GaussLobatto(5)
> pbcGAM<-GAMSurvdataset(GL,pbc,2,3)
> head(pbcGAM)
 id stop gam.dur t ev gam.ev start trt age sex ascites hepato spiders
1 1 0.00000 20.0000 400 1 0 0 1 58.76523 f 1 1 1
2 1 69.06927 108.8889 400 1 0 0 1 58.76523 f 1 1 1
3 1 200.00000 142.2222 400 1 0 0 1 58.76523 f 1 1 1
4 1 330.93073 108.8889 400 1 0 0 1 58.76523 f 1 1 1
5 1 400.00000 20.0000 400 1 1 0 1 58.76523 f 1 1 1
6 2 0.00000 225.0000 4500 0 0 0 1 56.44627 f 0 1 1
 edema bili chol albumin copper alk.phos ast trig platelet protime stage
1 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
2 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
3 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
4 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
5 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4
6 0 1.1 302 4.14 54 7394.8 113.52 88 221 10.6 3
> 
> dim(pbc)
[1] 418 20
> dim(pbcGAM)
[1] 2090 24

The original (pbc) dataset has been expanded to include the pseudo-observations at the nodes of the Lobatto rule. There are multiple records (5 per individual in this particular case) as can be seen by examining the data for the first patient (id=1). The corresponding times are found in the variable stop, their associated weights in the variable gam.dur and the event indicators are in the column gam.ev. Note that nodes and weights are expressed on the scale of the survival dataset, not in the scale of the Lobatto rule ([-1,1]). To fit the survival dataset one needs to load the mgcv package and fit a Poisson GAM, using a flexible (penalized spline) for the log-hazard rate function.

The following code will obtain an adjusted (for age and sex) hazard ratio using the PGAM or the Cox model:


library(survival) ## for coxph
> library(mgcv) ## for mgcv
> 
> ## Prop Hazards Modeling with PGAM
> fGAM<-gam(gam.ev~s(stop,bs="cr")+trt+age+sex+offset(log(gam.dur)),
+ data=pbcGAM,family="poisson",scale=1,method="REML")
> 
> ## Your Cox Model here
> f<-coxph(Surv(time,status)~trt+age+sex,data=pbc)
> 
> 
> summary(fGAM)

Family: poisson 
Link function: log

Formula:
gam.ev ~ s(stop, bs = "cr") + trt + age + sex + offset(log(gam.dur))

Parametric coefficients:
 Estimate Std. Error z value Pr(>|z|) 
(Intercept) -10.345236 0.655176 -15.790 < 2e-16 ***
trt 0.069546 0.181779 0.383 0.702 
age 0.038488 0.008968 4.292 1.77e-05 ***
sexf -0.370260 0.237726 -1.558 0.119 
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
 edf Ref.df Chi.sq p-value 
s(stop) 1.008 1.015 4.186 0.0417 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = -0.249 Deviance explained = 2.25%
-REML = 693.66 Scale est. = 1 n = 1560
> f
Call:
coxph(formula = Surv(time, status) ~ trt + age + sex, data = pbc)


 coef exp(coef) se(coef) z p
trt 0.0626 1.065 0.182 0.344 7.3e-01
age 0.0388 1.040 0.009 4.316 1.6e-05
sexf -0.3377 0.713 0.239 -1.414 1.6e-01

Likelihood ratio test=22.5 on 3 df, p=5.05e-05 n= 312, number of events= 125 
 (106 observations deleted due to missingness)

The estimates for log-hazard ratio of the three covariates (trt, age, and female gender) are numerically very close. Any numerical differences reflect the different assumptions made about the baseline hazard: flexible spline (PGAM) v.s. piecewise exponential (Cox).

Increasing the number of nodes of the Lobatto rule does not materially affect the estimates of the PGAM:


GL<-GaussLobatto(10)
> pbcGAM2<-GAMSurvdataset(GL,pbc,2,3)
> fGAM2<-gam(gam.ev~s(stop,bs="cr")+trt+age+sex+offset(log(gam.dur)),
+ data=pbcGAM2,family="poisson",scale=1,method="REML")
> 
> summary(fGAM2)

Family: poisson 
Link function: log

Formula:
gam.ev ~ s(stop, bs = "cr") + trt + age + sex + offset(log(gam.dur))

Parametric coefficients:
 Estimate Std. Error z value Pr(>|z|) 
(Intercept) -10.345288 0.655177 -15.790 < 2e-16 ***
trt 0.069553 0.181780 0.383 0.702 
age 0.038487 0.008968 4.292 1.77e-05 ***
sexf -0.370340 0.237723 -1.558 0.119 
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
 edf Ref.df Chi.sq p-value 
s(stop) 1.003 1.005 4.163 0.0416 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = -0.124 Deviance explained = 1.7%
-REML = 881.67 Scale est. = 1 n = 3120

Nevertheless, the estimates of the “baseline log-hazard” become more accurate (decreased standard errors and significance of the smooth term) as the number of nodes increases.

In simulations (see Fig 3) we show that the estimates of the hazard ratio generated by the GAM are comparable in bias, variance and coverage to those obtained by the Cox model. Even though this is an importance benchmark for the proposed method, it does not provide a compelling reason for replacing  the Cox model with the PGAM. In fact, the advantages of the PGAM will only become apparent once we consider contexts which depend on the baseline hazard function, or problems in which the proportionality of hazards assumption is violated. So stay tuned.


To leave a comment for the author, please follow the link and comment on their blog: Statistical Reflections of a Medical Doctor » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)