Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Joseph Rickert

When I first went to grad school, the mathematicians advised me cultivate the habit of reading with a pencil. This turned into a lifelong habit and useful skill for reading all sorts of things: literature, reports and newspapers for example; not just technical papers. However, reading statistics and data science papers, or really anything that includes some data, considerably “ups the ante”. For this sort of exercise, I need a tool to calculate, to try some variations that test my intuition and see how well I'm following the arguments. The idea here is not so much to replicate the paper but to accept the author's invitation to engage with the data and work through the analysis. Ideally, I'd want something not much more burdensome than than a pencil (maybe a tablet based implementation of R), but standard R on my notebook comes pretty close to the perfect tool.

Recently, I sat down with Bradley Efron's 1987 paper “Logistic Regression, Survival Analysis, and the Kaplan-Meier Curve”, the paper where he elaborates on the idea of using conditional logistic regression to estimate hazard rates and survival curves. This paper is classic Efron: drawing you in with a great story well before you realize how much work it's going to be to follow it to the end. Efron writes with a fairly informal style that encourages the reader to continue. Struggling to keep up with some of his arguments I nevertheless get the feeling that Efron is doing his best help me follow along, dropping hints every now and then about where to look if I lose the trail.

The basic idea of conditional logistic regression is to group the data into discrete time intervals with ni patients at risk in each interval, i, and then assume that the intervals really are independent and that the si events (deaths or some other measure of “success”) in each interval, follow a binomial distribution with parameters ni and hi where:

hi = Prob(patient i dies during the ith interval | patient i survives until the beginning of the ith interval).

The modest goal of this post was to see if I could reproduce Efron's Figure 3 which shows survival curves for three different models for A arm of a clinical trial examining treatments for head and neck cancer. I figured that getting to Figure 3 represents the minimum amount of comprehension required to begin experimenting with conditional logistic regression.

I entered the data buried in the caption to Efron's Table 1 and was delighted when R's Survival package replicated survival times also in the caption.

# Data for Efron's Table 1
# Enter raw data for arm A
Adays <- c(7, 34, 42, 63, 64, 74, 83, 84, 91,
108, 112, 129, 133, 133, 139, 140, 140, 146,
149, 154, 157, 160, 160, 165, 173, 176, 185,
218, 225, 241, 248, 273, 277, 279, 297, 319,
405, 417, 420, 440, 523, 523, 583, 594,
1101, 1116, 1146, 1226, 1349, 1412, 1417)

Astatus <- rep(1,51)
Astatus[c(6,27,34,36,42,46,48,49,50)] <-0
Aobj <- Surv(time = Adays, Astatus==1)
Aobj
#     7    34    42    63    64    74+   83    84    91   108   112   129   133   133
#   139   140   140   146   149   154   157   160   160   165   173   176   185+  218
#   225   241   248   273   277   279+  297   319+  405   417   420   440   523   523+
#   583   594  1101  1116+ 1146  1226+ 1349+ 1412+ 1417

Doing the same with the data from arm B of the trial led to a set of Kaplan-Meier curves that pretty much match the curves in Efron's Figure 1. All of this was straightforward but I was puzzled that the summary of the Kaplan-Meier curve for arm A (See KM_A in the code below) doesn't match the values for month, ni and si in Efron's Table 1, until I realized these values were for the beginning of the month. To match the table I compute ni by putting 51 in front of the vector KM_A$n.risk, and add a 1 to the end of the vector KM_A$n.event to get si. (See the "set up variables for models section in the code below.)

After this, one more "trick" was required to get to Figure 3. Most of the time, I suppose those of us working with logistic regression to construct machine learning models are accustomed to specifying the outcome as a binary variable of "ones" and "zeros", or as a two level factor. But, how exactly does one specify the parameters ni and si for the binomial models that comprise each outcome? After a close reading of the documentation (See the first paragraph under Details for ?glm) I was very pleased to see that glm() permits the dependent variable to be a matrix.

The formula for Efron's cubic model looks like this:

Y <- matrix(c(si, failure),n,2) # Response matrix
form <- formula(Y ~ t + t2 + t3)

The rest of the code is straight forward and leads to a pretty good reproduction of Figure 3. In this figure, the black line represents the survival curve for a Life Table model where the hazard probabilities are estimated by hi = si / ni. The blue triangles map the survival curve for the cubic model given above, and the red curve with crosses plots a cubic spline model where hi = ti + (ti - 11)2 + (ti - 11)3 .

What a delightful little diagram! In addition to illustrating the technique of using a very basic statistical tool to model time-to-event data, the process leading to Figure 3 reveals something about the intuition and care a professional statistician puts into the exploratory modeling process.

There is much more in Efron's paper. What I have shown here is just the "trailer". Efron presents a careful analysis of the data for both arms of the clinical trial data, goes on to study maximum likelihood estimates for conditional logistic regression models and their standard errors, and proves a result about average ratio of the asymptotic variance between parametric and non-parametric hazard rate estimates.

Enjoy this classic paper and write some "am I reading this right" code of your own!

Here is my code for the models and plots.