[This article was first published on Statistical Reflections of a Medical Doctor » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The ability of PGAMs to estimate the log-baseline hazard rate, endows them with the capability to be used as smooth alternatives to the Kaplan Meier curve. If we assume for the shake of simplicity that there are no proportional co-variates in the PGAM regression, then the quantity modeled  corresponds to the log-hazard of the  survival function. Note that the only assumptions made by the PGAM is that the a) log-hazard is a smooth function, with b) a given maximum complexity (number of degrees of freedom) and c) continuous second derivatives. A PGAM provides estimates of the log-hazard constant, $\beta_{0}$, and the time-varying deviation, $\lambda(t_{i,j})$. These can be used to predict the value of the survival function, $S(t)$, by approximating the integral appearing in the definition of $S(t)$ by numerical quadrature. $S(t_{i})=\exp\left(-\int_{0}^{t_{i}}h(t)\mathrm{\, d}t\right)\approx\exp\left(-\sum_{j=1}^{N_{i}}w_{i,j}\exp(\beta_{0}+\lambda(t_{i,j}))\right)$

From the above definition it is obvious that the value of the survival distribution at any given time point is a non-linear function of the PGAM estimate. Consequently, the predicted survival value, $S_{pred}(t)$, cannot be derived in closed form; as with all non-linear PGAM estimates, a simple Monte Carlo simulation algorithm may be used to derive both the expected value of $\hat{S}_{pred}(t)$ and its uncertainty. For the case of the survival function, the simulation steps are provided in Appendix (Section A3) of our paper. The following R function can be used to predict the survival function and an associated confidence interval at a grid of points. It accepts as arguments a) the vector of time points, b) a PGAM object for the fitted log-hazard function, c) a list with the nodes and weights of a Gauss-Lobatto rule for the integration of the predicted survival, d) the number of Monte Carlo samples to obtain and optionally e) the seed of the random number generation. Of note, the order of the quadrature used to predict the survival function is not necessarily the same as the order used to fit the log-hazard function.

## Calculate survival and confidence interval over a grid of points
## using a GAM
SurvGAM<-Vectorize(function(t,gm,gl.rule,CI=0.95,Nsim=1000,seed=0)
##         t : time at which to calculate relative risk
##        gm : gam model for the fit
##   gl.rule : GL rule (list of weights and nodes)
##        CI : CI to apply
##      Nsim : Number of replicates to draw
##      seed : RNG seed
{
q<-(1-CI)/2.0
## create the nonlinear contrast
pdfnc<-data.frame(stop=t,gam.dur=1)
L<-length(gl.rule$x) start<-0; ## only for right cens data ## map the weights from [-1,1] to [start,t] gl.rule$w<-gl.rule$w*0.5*(t-start) ## expand the dataset df<-Survdataset(gl.rule,pdfnc,fu=1) ## linear predictor at each node Xp <- predict(gm,newdata=df,type="lpmatrix") ## draw samples set.seed(seed) br <- rmvn(Nsim,coef(gm),gm$Vp)
res1<-rep(0,Nsim)
for(i in 1:Nsim){
## hazard function at the nodes
hz<-exp(Xp%*%br[i,])
## cumumative hazard
chz1<-gl.rule$w %*% hz[1:L,] ##survival res1[i]<-exp(-chz1) } ret<-data.frame(t=t,S=mean(res1), LCI=quantile(res1,prob=q), UCI=quantile(res1,prob=1-q)) ret },vectorize.args=c("t")) The function makes use of another function, Survdataset, that expands internally the vector of time points into a survival dataset. This dataset is used to obtain predictions of the log-hazard function by calling the predict function from the mgcv package. ## Function that expands a prediction dataset ## so that a GL rule may be applied ## Used in num integration when generating measures of effect Survdataset<-function(GL,data,fu) ## GL : Gauss Lobatto rule ## data: survival data ## fu: column number containing fu info { ## append artificial ID in the set data$id<-1:nrow(data)
Gllx<-data.frame(stop=rep(GL$x,length(data$id)),
t=rep(data[,fu],each=length(GL$x)), id=rep(data$id,each=length(GL\$x)),
start=0)
## Change the final indicator to what
## was observed, map node positions,
## weights from [-1,1] back to the
## study time
Gllx<-transform(Gllx,
stop=0.5*(stop*(t-start)+(t+start)))
## now merge the remaining covariate info

Gllx<-merge(Gllx,data[,-c(fu)])
nm<-match(c("t","start","id"),colnames(Gllx))
Gllx[,-nm]
}

The ability to draw samples from the multivariate normal distribution corresponding to the model estimates and its covariance matrix is provided by another function, rmvn:

## function that draws multivariate normal random variates with
## a given mean vector and covariance matrix
##    n : number of samples to draw
##   mu : mean vector
##  sig : covariance matrix
rmvn <- function(n,mu,sig) { ## MVN random deviates
L <- mroot(sig);m <- ncol(L);
t(mu + L%*%matrix(rnorm(m*n),m,n))
}

To illustrate the use of these functions we revisit the PBC example from the 2nd part of this blog series. Firstly, let’s obtain a few Gauss-Lobatto lists of weights/nodes for the integration of the survival function:

## Obtain a few Gauss Lobatto rules to integrate the survival
## distribution
GL5<-GaussLobatto(5);
GL10<-GaussLobatto(10);
GL20<-GaussLobatto(20);

Subsequently, we fit the log-hazard rate to the coarsely (5 nodes) and more finely discretized (using a 10 point Gauss Lobatto rule) versions of the PBC dataset, created in Part 2. The third command obtains the Kaplan Meier estimate in the PBC dataset.

fSurv1<-gam(gam.ev~s(stop,bs="cr")+offset(log(gam.dur)),
data=pbcGAM,family="poisson",scale=1,method="REML")
fSurv2<-gam(gam.ev~s(stop,bs="cr")+offset(log(gam.dur)),
data=pbcGAM2,family="poisson",scale=1,method="REML")

KMSurv<-survfit(Surv(time,status)~1,data=pbc)


We obtained survival probability estimates for the 6 combinations of time discretization for fitting (either a 5 or 10th order Lobatto rule) and prediction (a 5th, 10th or 20th order rule):

t<-seq(0,4500,100)
s1<-SurvGAM(t,fSurv1,GL5)
s2<-SurvGAM(t,fSurv1,GL10)
s3<-SurvGAM(t,fSurv1,GL20)
s4<-SurvGAM(t,fSurv2,GL5)
s5<-SurvGAM(t,fSurv2,GL10)
s6<-SurvGAM(t,fSurv2,GL20)

In all cases 1000 Monte Carlo samples were obtained for the calculation of survival probability estimates and their pointwise 95% confidence intervals. We can plot these against the Kaplan Meier curve estimates:

par(mfrow=c(2,3))
plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL5)/Predict(GL5)")
lines(s1[1,],s1[2,],col="blue",lwd=2)
lines(s1[1,],s1[3,],col="blue",lwd=2,lty=2)
lines(s1[1,],s1[4,],col="blue",lwd=2,lty=2)

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL5)/Predict(GL10)")
lines(s2[1,],s2[2,],col="blue",lwd=2)
lines(s2[1,],s2[3,],col="blue",lwd=2,lty=2)
lines(s2[1,],s2[4,],col="blue",lwd=2,lty=2)

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL5)/Predict(GL20)")
lines(s3[1,],s3[2,],col="blue",lwd=2)
lines(s3[1,],s3[3,],col="blue",lwd=2,lty=2)
lines(s3[1,],s3[4,],col="blue",lwd=2,lty=2)

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL10)/Predict(GL5)")
lines(s4[1,],s4[2,],col="blue",lwd=2)
lines(s4[1,],s4[3,],col="blue",lwd=2,lty=2)
lines(s4[1,],s4[4,],col="blue",lwd=2,lty=2)

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL10)/Predict(GL10)")
lines(s5[1,],s5[2,],col="blue",lwd=2)
lines(s5[1,],s5[3,],col="blue",lwd=2,lty=2)
lines(s5[1,],s5[4,],col="blue",lwd=2,lty=2)

plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL10)/Predict(GL20)")
lines(s6[1,],s6[2,],col="blue",lwd=2)
lines(s6[1,],s6[3,],col="blue",lwd=2,lty=2)
lines(s6[1,],s6[4,],col="blue",lwd=2,lty=2)



Overall, there is a close agreement between the Kaplan Meier estimate and the PGAM estimates despite the different function spaces that the corresponding estimators “live”: the space of all piecewise constant functions (KM) v.s. that of the smooth functions with bounded, continuous second derivatives (PGAM). Furthermore, the 95% confidence interval of each estimator (dashed lines) contain the expected value of the other estimator. This suggests that there is no systematic difference between the KM and the PGAM survival estimators. This was confirmed in simulated datasets (see Fig 2 in our PLoS One paper).  