Here is an assortment of R Shiny apps that you may find useful for exploration if you are in the process of learning Shiny and looking for something different. Some of these apps are very small and simple whereas others are large and complex. This repository provides full code and any necessary accompanying data sets. The repo also links to the apps hosted online at shinyapps.io so that you can run apps in your browser without having to download the entire collection repo to run apps locally. That and other details can be found at the repo linked above. This isn’t a tutorial or other form of support, but it’s plenty of R code to peruse if that is what you are looking for.
A bit of backstory. If I recall correctly, I began exploring RStudio’s Shiny package when I first heard of it in late 2012. Needless to say, a lot has changed since then, including not only all the alpha-release code-breaking changes I had to adjust to when making my first apps and what features and capabilities Shiny has grown to offer, but also simply how I go about coding apps has changed over time symbiotically with the package’s continued development. None of the apps in this repository are quite that old, though a few are close. Even so, they have been maintained and updated and tweaked since then to keep with the times as necessary.
Most of the apps are newer. But one nice thing about this collection is that it shows a diversity of approaches to coding different features and behavior into apps depending on their purposes and how for me that has changed over time. For example, some apps are heavy on maps. Prior to the robust availability of Leaflet in Shiny, I would tend to have a Shiny app display maps using static (but reactive) plots made with Lattice or ggplot2. There are many ways to do the same thing, and the way that is best in one case is not always the best way.
Across these apps there are many other examples of different ways to implement the same general task, depending on how I want that to be presented to the user in a specific app. In other cases, some approaches have proven more powerful and outright superior to others and have won out and it is equally useful to see these examples of what once was considered to be “good enough” is no longer.
Lastly, if you do happen to stumble upon something that is actually broken, I am unaware of it, so please let me know.
SHINY APPLICATION LAYOUTS-PLOT PLUS COLUMNS
In the second part of our series we will build another small shiny app but use another UI.
More specifically we will present the example of a UI with a plot at the top and columns at the bottom that contain the inputs that drive the plot. For our case we are going to use the diamonds
dataset to create a Diamonds Analyzer App.
This part can be useful for you in two ways.
First of all, you can see different ways to enhance the appearance and the utility of your shiny app.
Secondly you can make a revision on what you learnt in the “Building Shiny App” series as we will build basic shiny stuff in order to present it in the proper way.
Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Lets begin!
Answers to the are available here.
Shiny Installation
In order to create the app we have to install and load the package shiny
.
Exercise 1
Install and load shiny
.
Grid Layout
The sidebarLayout
uses Shiny’s grid layout functions. Rows are created by the fluidRow
function and include columns defined by the column
function. Column widths should add up to 12 within a fluidRow
.
The first parameter to the column
function is it’s width
. You can also change the position of columns to decide the location of UI elements. You can put columns to the right by adding the offset
parameter to the column
function. Each unit of offset increases the left-margin of a column by a whole column.
Now let’s begin to build our UI. First of all we will place the fluidpage
with a title
as below:
#ui.R
library(shiny)
shinyUI(fluidPage(
title = "Diamonds",
h4("Diamonds Analyzer")
))
#server.R
library(shiny)
function(input, output) {}
Exercise 2
Create the initial UI of your app and name it “Diamonds Analyzer”.
You can use the fluidrow
function with the column
function of width
=2 inside of it like this:
#ui.R
library(shiny)
shinyUI(fluidPage(
title = "Diamonds",
h4("Diamonds Analyzer"),
fluidRow(column(2),
column(2),
)
))
Exercise 3
Create a fluidrow
with two columns of width
= 4 inside it. NOTE: Do not expect to see something yet.
Now it is time to fill these columns with some tools that will help us determine the variables that we are going to use for our plot.
In the first 4 columns we will put a selectInput
as the code below.
#ui.R
fluidRow(column(4,
h4("Variable X"),
selectInput('x', 'X', names(diamonds)))
Exercise 4
Put a selectInput
in the first 4 columns of your UI. Name it “Variable X”. HINT:Use names
to get the names of the dataset diamonds
as inputs.
Now let’s move to the next four columns. We are going to put in there another selectInput
and select the second of the dataset’s names as default. We are also going to see what offset
does by setting it to 1 and then deactivating it again like the example below. You can use the code as it is or change the parameters given to understand the logic behind its use.
#ui.R
offset = 1,
selectInput('y', 'Y', names(dataset), names(dataset)[[2]])
Exercise 5
Create a selectInput
from column
5 to column
8. Choose the second of the dataset’s name as default. Name it “Variable Y”. HINT: Use names
to get the names of the dataset diamonds
as inputs.
Exercise 6
Set the offset
parameter to 1 from columns 5 to 8.
Now let’s call our plot
and put it on the top of our UI. Look at the example below.
Exercise 7
Place the plot
on the top of your UI. HINT: Use plotOutput
and hr
. NOTE: You are not going to see the plot
in your UI because you have not created the server side yet.
We are going to create a reactive expression in order to combine the selected variables into a new data frame.Look at the example:
#server.R
selectedData <- reactive({
diamonds[, c(input$x, input$y)]
})
Exercise 8
Create a reactive expression in order to combine the selected variables into a new data frame. HINT: Use reactive
.
Now plot your new data frame like the example:
#server.R
output$plot <- renderPlot({
plot(selectedData())
})
Exercise 9
Plot your data frame. HINT: Use renderPlot
.
As mentioned before the width
of our UI is equal to 12 columns. So what is goint to happen if we a add a new column of width
= 4 next to the other two? You have to find out in order to understand better how it works.
Exercise 10
Create a new selectInput
and try to put it next to “Variable Y”. Try to explain the result. NOTE: You do not have to connect it with your plot
.
I guess we all use it, the good old histogram. One of the first things we are taught in Introduction to Statistics and routinely applied whenever coming across a new continuous variable. However, it easily gets messed up by outliers. Putting most of the data into a single bin or a few bins, and scattering the outliers barely visible over the x-axis. This distribution might look familiar
library(tidyverse)
set.seed(42)
hist_data <- data_frame(x = c(rexp(1000, .5),
runif(50, 0, 500)))
ggplot(hist_data, aes(x)) +
geom_histogram(binwidth = .1, col = "black", fill = "cornflowerblue")
Two strategies that make the above into something more interpretable are taking the logarithm of the variable, or omitting the outliers. Both do not show the original distribution, however. Another way to go, is to create one bin for all the outlier values. This way we would see the original distribution where the density is the highest, while at the same time getting a feel for the number of outliers. A quick and dirty implementation of this would be
hist_data %>%
mutate(x_new = ifelse(x > 10, 10, x)) %>%
ggplot(aes(x_new)) +
geom_histogram(binwidth = .1, col = "black", fill = "cornflowerblue")
Not bad. However, it now suggests incorrectly that many observations are exactly 10. I routinely make these plots for my own information, but they cannot be shared without explaining what happened to the outliers and what there original range was. Since a plot with a manual is not that great either, I recently did a hacking session into the ggplot
object. The resulting gg_outlier_bin
function not only indicates the range of the last bin, it also allows for a different fill color of the bin. Now we are clearly distinguishing the outlier aggregation
gg_outlier_bin(hist_data,
"x",
cut_off_floor = NA,
cut_off_ceiling = 10,
binwidth = 0.1)
It is still a bit experimental, but it seems to work in most situations. Below you find the function code for making histograms with outlier bins. You can also get it by installing the package accompanying this blog devtools::install_github("edwinth/thatssorandom")
. By the way, it works on both floor and ceiling outliers. Like in the following
data_frame(x = c(runif(100, 0, 100), rnorm(1000, 50, 2))) %>%
gg_outlier_bin("x", 45, 55, binwidth = .1)
gg_outlier_bin <- function(x,
var_name,
cut_off_floor,
cut_off_ceiling,
col = "black",
fill = "cornflowerblue",
fill_outlier_bins = "forestgreen",
binwidth = NULL) {
printing_min_max <- x %>% summarise_(sprintf("round(min(%s, na.rm = TRUE), 1)", var_name),
sprintf("round(max(%s, na.rm = TRUE), 1)", var_name))
ceiling_filter <- ifelse(!is.na(cut_off_ceiling),
sprintf("%s < %f", var_name, cut_off_ceiling),
"1 == 1")
floor_filter <- ifelse(!is.na(cut_off_floor),
sprintf("%s > %f", var_name, cut_off_floor),
"1 == 1")
x_regular <- x %>% filter_(ceiling_filter, floor_filter) %>%
select_(var_name)
x_to_roll_ceiling <- x %>% filter_(
sprintf("%s >= %f", var_name, cut_off_ceiling)) %>% select_(var_name)
if (!is.na(cut_off_ceiling)) x_to_roll_ceiling[, 1] <- cut_off_ceiling
x_to_roll_floor <- x %>% filter_(
sprintf("%s <= %f", var_name, cut_off_floor)) %>% select_(var_name)
if (!is.na(cut_off_floor)) x_to_roll_floor[, 1] <- cut_off_floor
plot_obj <- ggplot(x_regular, aes_string(var_name)) +
geom_histogram(col = col, fill = fill, binwidth = binwidth)
if (!is.na(cut_off_ceiling)) {
ticks_for_ceiling <- update_tickmarks_ceiling(plot_obj, cut_off_ceiling,
printing_min_max[1,2])
plot_obj <- plot_obj +
geom_histogram(data = x_to_roll_ceiling, fill = fill_outlier_bins, col = col,
binwidth = binwidth) +
scale_x_continuous(breaks = ticks_for_ceiling$tick_positions,
labels = ticks_for_ceiling$tick_labels)
}
if (!is.na(cut_off_floor)) {
ticks_for_floor <- update_tickmarks_floor(plot_obj, cut_off_floor,
printing_min_max[1,1])
plot_obj <- plot_obj +
geom_histogram(data = x_to_roll_floor, fill = fill_outlier_bins,
col = col, binwidth = binwidth) +
scale_x_continuous(breaks = ticks_for_floor$tick_positions,
labels = ticks_for_floor$tick_labels)
}
return(plot_obj)
}
update_tickmarks_ceiling <- function(gg_obj,
co,
max_print) {
ranges <- suppressMessages(
ggplot_build(gg_obj)$layout$panel_ranges[[1]])
label_to_add <- sprintf("(%s , %s)", round(co, 1), max_print)
tick_positions <- ranges$x.major_source
tick_labels <- ranges$x.labels
if (overlap_ceiling(tick_positions, co)) {
tick_positions <- tick_positions[-length(tick_positions)]
tick_labels <- tick_labels[-length(tick_labels)]
}
return(list(tick_positions = c(tick_positions, co),
tick_labels = c(tick_labels, label_to_add)))
}
overlap_ceiling <- function(positions, cut_off) {
n <- length(positions)
ticks_dif <- positions[n] - positions[n-1]
(cut_off - positions[n]) / ticks_dif < 0.25
}
update_tickmarks_floor <- function(gg_obj,
co,
min_print) {
ranges <- suppressMessages(
ggplot_build(gg_obj)$layout$panel_ranges[[1]])
label_to_add <- sprintf("(%s , %s)", min_print, round(co, 1))
tick_positions <- ranges$x.major_source
tick_labels <- ranges$x.labels
if (overlap_floor(tick_positions, co)) {
tick_positions <- tick_positions[-1]
tick_labels <- tick_labels[-1]
}
return(list(tick_positions = c(co, tick_positions),
tick_labels = c(label_to_add, tick_labels)))
}
overlap_floor <- function(positions, cut_off) {
ticks_dif <- positions[2] - positions[1]
(positions[1] - cut_off) / ticks_dif < 0.25
}
I recently came across a great natural language dataset from Mark Riedel: 112,000 plots of stories downloaded from English language Wikipedia. This includes books, movies, TV episodes, video games- anything that has a Plot section on a Wikipedia page.
This offers a great opportunity to analyze story structure quantitatively. In this post I’ll do a simple analysis, examining what words tend to occur at particular points within a story, including words that characterize the beginning, middle, or end.
As I usually do for text analysis, I’ll be using the tidytext package Julia Silge and I developed last year. To learn more about analyzing datasets like this, see our online book Text Mining with R: A Tidy Approach, soon to be published by O’Reilly. I’ll provide code for the text mining sections so you can follow along. I don’t show the code for most of the visualizations to keep the post concise, but as with all of my posts the code can be found here on GitHub.
I downloaded and unzipped the plots.zip file from the link on the GitHub repository. We then read the files into R, and combined them using dplyr.
We can then use the tidytext package to unnest the plots into a tidy format, with one token per line.
This dataset contains over 40 million words across 112,000 stories.
Joseph Campbell introduced the idea of a “hero’s journey”, that every story follows the same structure. Whether or not you buy into his theory, you can agree it’d be surprising if a plot started with a climactic fight, or ended by introducing new characters.
That structure is reflected quantitatively in what words are used at which point in a story: there are some words you’d expect would appear at the start, and others at the end.
As a simple measure of where a word occurs within a plot, we’ll record the median position of each word, along with the number of times it appears.
We’re not interested in rare words that occurred in only a few plot descriptions, so we’ll filter for ones occurring at least 2,500 times.
For example, we can see that the word “fictional” was used about 2700 times, and that half of its uses were before the 12% mark of the story: it’s highly shifted towards the beginning.
What were were the words most shifted towards the beginning or end of a story?
The words shifted towards the beginning of a story tend to describe a setting: “The story opens on the protagonist, a wealthy young 19th century student recently graduated from the fictional University College in Los Angeles, California.”. Most are therefore nouns and adjectives that can be used to specify and describe a person, location, or time period.
In contrast, the words shifted towards the end of a story are packed with excitement! There are a few housekeeping terms you’d expect to find at the end of a plot description (“ending”, “final”), but also a number of verbs suggestive of a climax. “The hero shoots the villain and rushes to the heroine, and apologizes. The two reunited, they kiss.”
The median gives us a useful summary statistic of where a word appears within a story, but let’s take a closer look at a few. First we’ll divide each story into deciles (first 10%, second 10%, etc), and count the appearances of each word within each decile.
This lets us visualize the frequency of a word across the length of plot descriptions. We may want to look at the most extreme start/end ones:
No word happens exclusively at the start or end of a story. Some, like “happily”, remain steady throughout and then spike up at the end (“lived happily ever after”). Other words, like “truth”, or “apologizes”, show a constant rise in frequency over the course of the story, which makes sense: a character generally wouldn’t “apologize” or “realize the truth” right at the start of the story. Similarly, words that establish settings like “wealthy” become steadily rarer the course of the story, as it becomes less likely the plot will introduce new characters.
One interesting feature of the above graph is that while most words peak either at the beginning or end, words like “grabs”, “rushes”, and “shoots” were most common at the 90% point. This might represent the climax of the story.
Inspired by this examination of words that might occur at a climax, let’s consider what words were most likely to appear at particular points in the middle, rather than being shifted towards the beginning or end.
Each decile of the book (the start, the end, the 30% point, etc) therefore has some some words that peak within it. What words were most characteristic of particular deciles?
We see that the words in the start and the end are the most specific to their particular deciles: for example, almost half of the occurrences of the word “fictional” occurred in the first 10% of the story. The middle sections have words that are more spread out (having, say, 14% of their occurrences in that section rather than the expected 10%), but they still are words that make sense in the story structure.
Let’s visualize the full trend for the words overrepreseted at each point.
Try reading the 24 word story laid out by the subgraph titles. Our protagonist is “attracted”, then “suspicious”, followed by “jealous”, “drunk”, and ultimately “furious”. A shame that once they “confront” the problem, they run into a “trap” and are “wounded”. If you ignore the repetitive words and the lack of syntax, you can see the rising tension of a story just in these sparklines.
As one more confirmation of our hypothesis about rising tension and conflict within a story, we can use sentiment analysis to find the average sentiment within each piece of a story.
Plot descriptions have a negative average AFINN score at all points in the story (which makes sense, since stories focus on conflict. But it might start with a relatively peaceful beginning, before the conflict increases over the course of the plot, until it hits a maximum around the climax, 80-90%. It’s then often followed by a resolution, which contains words like “happily”, “rescues”, and “reunited” that return it to a higher sentiment score.
In short, if we had to summarize the average story that humans tell, it would go something like Things get worse and worse until at the last minute they get better.
This was a pretty simple analysis of story arcs (for a more in-depth example, see the research described here), and it doesn’t tell us too much we wouldn’t have been able to guess.
(Except perhaps that characters are most likely to be drunk right in the middle of a story. How can we monetize that insight?)
What I like about this approach is how quickly you can gain insights with simple quantitative methods (counting, taking the median) applied to a large text dataset. In future posts, I’ll be diving deeper into these plots and showing what else we can learn.
In the last two weeks, I have posted twice about modifying age bias plots and Bland-Altman-like plots for comparing age estimates. From those posts, I have decided that I prefer to
I recently updated the FSA package so that these preferences are the defaults, while still allowing users some flexibility in creating plots that fit their preferences. Here I explain this new functionality.
The functionality described here is available in the current development version of FSA and will eventually (during summer) be on CRAN as version 0.8.13. I welcome any comments or suggestions.
The data used here will again be ages of Lake Whitefish (Coregonus clupeaformis) from Lake Champlain that are available in the WhitefishLC
data.frame in FSA. These analyses will compare consensus (between two readers) otolith (otolithC
) and scale (scaleC
) age estimates and otolith ages between two readers (otolith1
and otolith2
). The consensus otolith age estimates and otolith age estimates from the first reader will be considered as “reference” ages when such a distinction is needed.
The default plot of an ageBias()
object is a modified age bias plot with the difference in age estimates on the y-axis, the reference age estimates on the x-axis, a reference line at a difference in age estimates of zero, the mean and the range of differences in age estimates shown for each reference age estimate, open points representing age estimates where the mean difference in age estimates is significantly different from zero, solid points representing age estimates where the mean difference in age estimates is not significantly different from zero, a marginal histogram at the top that shows the distribution (and sample sizes) of the reference age estimates, and a marginal histogram on the right that shows the distribution of the difference in age estimates. Confidence intervals for the mean differences in age estimates at each reference age estimate may be added with show.CI=TRUE
and individual points can be added with show.pts=TRUE
. Other options are described in the ageBias()
documentation, which includes a number of examples.
The example in Figure 1 shows that age estimates from scales are less than age estimates from otoliths for otolith age estimates greater than about age-6 or 8, though the statistical evidence is less clear at older ages due to low sample sizes and increased variability. The example in Figure 2 illustrates no systematic difference in age estimates from otoliths between two readers. [Note that the y-axis limits here were widened from the defaults so that the bars in the marginal histogram were not cut off.]
Figure 1: Mean (points) and range (intervals) of differences in scale and otolith age estimates at each otolith age estimate for Lake Champlain Lake Whitefish. Open points represent mean differences in scale and otolith age estimates that are significantly different from zero (dashed gray horizontal line). Marginal histograms are for otolith age estimates (top) and differences in scale and otolith age estimates (right).
Figure 2: Mean (points) and range (intervals) of differences in otolith age estimates between two readers at the estimates for the first reader for Lake Champlain Lake Whitefish. Open points represent mean differences in age estimates that are significantly different from zero (dashed gray horizontal line). Marginal histograms are for age estimates of the first reader (top) and differences in age estimates between readers (right).
As discussed in this post, differences between two sets of age estimates can be revealed by plotting individual points with a summary for the relationship between the differences in age estimates and the reference or mean age estimates (whichever is used on the x-axis). These examples show how to create a base plot to which a summary can be added. These examples use the mean of the two age estimates on the x-axis, but the plot from the previous section with the reference age estimates on the x-axis could be used (but with show.pts=TRUE
to show the individual points and show.range=FALSE
to remove the mean and range intervals).
Before making the first example plot, a GAM will be fit to the differences and mean age estimates data. These data are contained in the diff
and mean
variables in the data
object returned in the ageBias()
object.
As shown in this post, the GAM is fit with gam()
using s()
from the mgcv
package. The mean predicted differences in age estimates, and their standard errors, throughout the range of observed mean age estimates are calculated with predict()
using type="response"
and se=TRUE
. Approximate 95% confidence intervals for the predicted mean differences in age estimates are computed from normal theory. The code below fits the GAM, creates a vector of mean age estimates at which to make predictions, makes the predictions, and computes the approximate 95% confidence intervals.
The base plot of individual differences in age estimates plotted against the mean age estimates is constructed by adding xvals="mean"
to plot()
. By default, a histogram for the difference in age estimates is shown on the right. A histogram for the mean age estimates is not shown by default but can be added at the top with xHist=TRUE
. The allowAdd=TRUE
argument is used so that “items”, like the GAM results, can be added to the main plot (i.e., not the marginal histograms). Note that using allowAdd=TRUE
changes the current graphing parameters and that it is good practice to save the current graphing parameters (the first line below) so that they can be reset after finishing the plot (use of par(op)
below).
The GAM results (line at the the predicted means and polygon for the 95% confidence bands) are then added to this plot as described in this post.
Finally, the graphing parameters are returned to their original values.
The example in Figure 3 suggests that the two age estimates generally agree to a mean age of about 5, after which ages estimated from scales are less than ages estimated from otoliths. The example in Figure 4 suggests no difference in age estimates between the two readers for all mean ages.
Figure 3: Differences in scale and otolith age estimates at each mean age estimate for Lake Champlain Lake Whitefish. The dashed gray horizontal line is at 0, which represents no difference between scale and otolith age estimates. The dashed black line and gray polygon represent the mean and 95% confidence band for the predicted mean difference in age estimates from a generalized additive model. The right marginal histogram is for the differences in scale and otolith age estimates.
Figure 4: Differences in otolith age estimates between two readers at each mean age estimate for Lake Champlain Lake Whitefish. The dashed gray horizontal line is at 0, which represents no difference in age estimates between the two readers. The dashed black line and gray polygon represent the mean and 95% confidence band for the predicted mean difference in age estimates from a generalized additive model. The right marginal histogram is for the differences in age estimates between the two readers.
My modification of the traditional age bias plot of Campana et al. (1995) is constructed from the ageBias()
object with plotAB()
(Figure 5). Some simple modifications of this plot are demonstrated in the documentation and examples for plotAB()
.
Figure 5: Mean (points) and 95% confidence intervals of scale age estimates at each otolith age estimate for Lake Champlain Lake Whitefish. The dashed gray line represents age estimates that agree. Open points (with red confidence intervals) represent mean scale age estimates that differ significantly from the corresponding otolith age estimate.
Finally, some users prefer a simple plot that shows the number of individuals at each point (Figure 6). This plot is constructed with plotAB()
using what="nunbers"
.
Figure 6: Number of individuals by each scale and otolith age estimate combination for Lake Champlain Lake Whitefish. The dashed gray line represents age estimates that agree.
Campana, S.E., M.C. Annand, and J.I. McMillan. 1995. Graphical and statistical methods for determining the consistency of age determinations. Transactions of the American Fisheries Society 124:131-138.
Ogle, D.H. 2015. Introductory Fisheries Analyses with R book. CRC Press.
Looking at the Task View on a small screen is a bit like standing too close to a brick wall - left-right, up-down, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As well-organized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. (I would be remiss not to mention that we all owe a great deal of gratitude to Arthur Allignol and Aurielien Latouche, the task view maintainers.) Newcomers, people either new to R or new to survival analysis or both, must find it overwhelming. So, it is with newcomers in mind that I offer the following slim trajectory through the task view that relies on just a few packages: survival, KMsurv, Oisurv and ranger
The survival package, which began life as an S package in the late ’90s, is the cornerstone of the entire R Survival Analysis edifice. Not only is the package itself rich in features, but the object created by the Surv()
function, which contains failure time and censoring information, is the basic survival analysis data structure in R.
KMsurv contains some interesting data sets from John Klein and Melvin Moeschberger’s classic text, Survival Analysis Techniques for Censored and Truncated Data.
My main reason for selecting the OIsurv package was to draw attention the very helpful guide to Survival Analysis in R, produced by the folks at OpenIntro.
ranger might be the surprise in my very short list of survival packages. The ranger()
function is well-known for being a fast implementation of the Random Forests algorithm for building ensembles of classification and regression trees. But it was news to me that ranger()
also builds survival models. Benchmarks indicate that ranger()
is suitable for building time-to-event models with the large, high dimensional data sets important to internet marketing applications. Since ranger()
uses standard Surv()
survival objects, it’s an ideal tool for getting acquainted with survival analysis in the in this machine learning age.
*** Load and transform the data This first block of code loads the required packages along with the bone marrow transplant data frame from the KMsurv package. I chose this because has number of covariates and no missing values. The only data preparation is to make the appropriate variables into factors.
library(survival)
library(dplyr)
library(OIsurv) # Aumatically loads KMsurv
library(ranger)
library(ggplot2)
#------------
data(bmt)
# sapply(bmt,class)
# Prepare new data frame for modeling
bmt2 <- select(bmt,-c(t2,d2,d3))
bmt2 <- mutate(bmt2,
group = as.factor(group),
d1 = as.factor(d1),
da = as.factor(da),
dc = as.factor(dc),
dp = as.factor(dp),
z3 = as.factor(z3),
z4 = as.factor(z4),
z5 = as.factor(z5),
z6 = as.factor(z6),
z8 = as.factor(z8),
z9 = as.factor(z9),
z10 = as.factor(z10)
)
head(bmt2)
## group t1 d1 ta da tc dc tp dp z1 z2 z3 z4 z5 z6 z7 z8 z9 z10
## 1 1 2081 0 67 1 121 1 13 1 26 33 1 0 1 1 98 0 1 0
## 2 1 1602 0 1602 0 139 1 18 1 21 37 1 1 0 0 1720 0 1 0
## 3 1 1496 0 1496 0 307 1 12 1 26 35 1 1 1 0 127 0 1 0
## 4 1 1462 0 70 1 95 1 13 1 17 21 0 1 0 0 168 0 1 0
## 5 1 1433 0 1433 0 236 1 12 1 32 36 1 1 1 1 93 0 1 0
## 6 1 1377 0 1377 0 123 1 12 1 22 31 1 1 1 1 2187 0 1 0
The first thing to do is to use Surv()
to build the standard survival object. The variable t1
records the time to death or the censored time; d1
indicates that the patient died (d1 = 1
) or that the patient survived until the end of the study (d1 = 0
). Note that a “+” after the time in the print out of y_bmt
indicates censoring. The formula y_bmt ~ 1
instructs the survfit()
function to fit a model with intercept only, and produces the Kaplan-Meier estimate.
The confBands()
function from the OIsurv package estimates Hall-Wellner confidence bands. These are large-sample, simultaneous estimates and are plotted in red. They contrast with the point-wise confidence bands rendered as black dashed lines.
# Kaplan Meier Survival Curve
y_bmt <- Surv(bmt$t1, bmt$d1)
y_bmt
## [1] 2081+ 1602+ 1496+ 1462+ 1433+ 1377+ 1330+ 996+ 226+ 1199+ 1111+
## [12] 530+ 1182+ 1167+ 418 417 276 156 781 172 487 716
## [23] 194 371 526 122 1279 110 243 86 466 262 162
## [34] 262 1 107 269 350 2569+ 2506+ 2409+ 2218+ 1857+ 1829+
## [45] 1562+ 1470+ 1363+ 1030+ 860+ 1258+ 2246+ 1870+ 1799+ 1709+ 1674+
## [56] 1568+ 1527+ 1324+ 957+ 932+ 847+ 848+ 1850+ 1843+ 1535+ 1447+
## [67] 1384+ 414 2204 1063 481 105 641 390 288 522 79
## [78] 1156 583 48 431 1074 393 10 53 80 35 1499+
## [89] 704 653 222 1356+ 2640+ 2430+ 2252+ 2140+ 2133+ 1238+ 1631+
## [100] 2024+ 1345+ 1136+ 845+ 491 162 1298 121 2 62 265
## [111] 547 341 318 195 469 93 515 183 105 128 164
## [122] 129 122 80 677 73 168 74 16 248 732 105
## [133] 392 63 97 153 363
fit1_bmt <- survfit(y_bmt ~ 1)
summary(fit1_bmt)
## Call: survfit(formula = y_bmt ~ 1)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1 137 1 0.993 0.00727 0.979 1.000
## 2 136 1 0.985 0.01025 0.966 1.000
## 10 135 1 0.978 0.01250 0.954 1.000
## 16 134 1 0.971 0.01438 0.943 0.999
## 35 133 1 0.964 0.01602 0.933 0.995
## 48 132 1 0.956 0.01748 0.923 0.991
## 53 131 1 0.949 0.01881 0.913 0.987
## 62 130 1 0.942 0.02003 0.903 0.982
## 63 129 1 0.934 0.02117 0.894 0.977
## 73 128 1 0.927 0.02222 0.884 0.972
## 74 127 1 0.920 0.02322 0.875 0.966
## 79 126 1 0.912 0.02415 0.866 0.961
## 80 125 2 0.898 0.02588 0.848 0.950
## 86 123 1 0.891 0.02668 0.840 0.944
## 93 122 1 0.883 0.02744 0.831 0.939
## 97 121 1 0.876 0.02817 0.822 0.933
## 105 120 3 0.854 0.03017 0.797 0.915
## 107 117 1 0.847 0.03078 0.788 0.909
## 110 116 1 0.839 0.03137 0.780 0.903
## 121 115 1 0.832 0.03193 0.772 0.897
## 122 114 2 0.818 0.03300 0.755 0.885
## 128 112 1 0.810 0.03350 0.747 0.879
## 129 111 1 0.803 0.03399 0.739 0.872
## 153 110 1 0.796 0.03445 0.731 0.866
## 156 109 1 0.788 0.03490 0.723 0.860
## 162 108 2 0.774 0.03575 0.707 0.847
## 164 106 1 0.766 0.03615 0.699 0.841
## 168 105 1 0.759 0.03653 0.691 0.834
## 172 104 1 0.752 0.03690 0.683 0.828
## 183 103 1 0.745 0.03726 0.675 0.821
## 194 102 1 0.737 0.03760 0.667 0.815
## 195 101 1 0.730 0.03793 0.659 0.808
## 222 100 1 0.723 0.03825 0.651 0.802
## 243 98 1 0.715 0.03856 0.644 0.795
## 248 97 1 0.708 0.03886 0.636 0.788
## 262 96 2 0.693 0.03943 0.620 0.775
## 265 94 1 0.686 0.03969 0.612 0.768
## 269 93 1 0.678 0.03995 0.604 0.761
## 276 92 1 0.671 0.04019 0.597 0.755
## 288 91 1 0.664 0.04042 0.589 0.748
## 318 90 1 0.656 0.04063 0.581 0.741
## 341 89 1 0.649 0.04084 0.574 0.734
## 350 88 1 0.642 0.04104 0.566 0.727
## 363 87 1 0.634 0.04122 0.558 0.720
## 371 86 1 0.627 0.04140 0.551 0.713
## 390 85 1 0.619 0.04156 0.543 0.706
## 392 84 1 0.612 0.04172 0.535 0.699
## 393 83 1 0.605 0.04186 0.528 0.693
## 414 82 1 0.597 0.04199 0.520 0.686
## 417 81 1 0.590 0.04212 0.513 0.679
## 418 80 1 0.583 0.04223 0.505 0.671
## 431 79 1 0.575 0.04234 0.498 0.664
## 466 78 1 0.568 0.04243 0.490 0.657
## 469 77 1 0.560 0.04252 0.483 0.650
## 481 76 1 0.553 0.04259 0.476 0.643
## 487 75 1 0.546 0.04266 0.468 0.636
## 491 74 1 0.538 0.04271 0.461 0.629
## 515 73 1 0.531 0.04276 0.453 0.622
## 522 72 1 0.524 0.04280 0.446 0.615
## 526 71 1 0.516 0.04282 0.439 0.607
## 547 69 1 0.509 0.04285 0.431 0.600
## 583 68 1 0.501 0.04287 0.424 0.593
## 641 67 1 0.494 0.04288 0.416 0.585
## 653 66 1 0.486 0.04288 0.409 0.578
## 677 65 1 0.479 0.04286 0.402 0.571
## 704 64 1 0.471 0.04284 0.394 0.563
## 716 63 1 0.464 0.04281 0.387 0.556
## 732 62 1 0.456 0.04277 0.380 0.548
## 781 61 1 0.449 0.04272 0.372 0.541
## 1063 52 1 0.440 0.04276 0.364 0.533
## 1074 51 1 0.432 0.04278 0.355 0.524
## 1156 48 1 0.423 0.04282 0.346 0.515
## 1279 42 1 0.413 0.04297 0.336 0.506
## 1298 41 1 0.402 0.04308 0.326 0.496
## 2204 9 1 0.358 0.05696 0.262 0.489
cb <- confBands(y_bmt, type = "hall")
plot(fit1_bmt,
main = 'Kaplan Meyer Plot with confidence bands')
lines(cb, col = "red",lty = 3)
legend(1000, 0.99, legend = c('K-M survival estimate',
'pointwise intervals', 'Hall-Werner conf bands'), lty = 1:3)
Next, I fit a Cox Proportional Hazards model, which makes use of several of the variables contained in the data set. I don’t pretend to have the best model here, or even a very good one. My intention is just to show the survival package’s coxph()
function, and show how the covariates enter the formula. Note, however, that this model does achieve an R2 value of 0.8, and that three variables - ta
, tc
, and dc1
- show up as significant.
# Fit Cox Model
form <- formula(y_bmt ~ group + ta + tc + tp + dc + dp +
z1 + z2 + z3 + z4 + z5 + z6 + z7 + z8 + z10)
cox_bmt <- coxph(form,data = bmt2)
summary(cox_bmt)
## Call:
## coxph(formula = form, data = bmt2)
##
## n= 137, number of events= 81
##
## coef exp(coef) se(coef) z Pr(>|z|)
## group2 0.2540084 1.2891826 0.4342118 0.585 0.5586
## group3 0.5280287 1.6955865 0.4197516 1.258 0.2084
## ta -0.0017962 0.9982054 0.0003736 -4.808 1.52e-06 ***
## tc -0.0060005 0.9940175 0.0010512 -5.708 1.14e-08 ***
## tp -0.0005709 0.9994293 0.0012622 -0.452 0.6510
## dc1 -3.5832062 0.0277865 0.4938320 -7.256 3.99e-13 ***
## dp1 -1.2266468 0.2932744 0.4832251 -2.538 0.0111 *
## z1 0.0109077 1.0109674 0.0230293 0.474 0.6358
## z2 -0.0155304 0.9845895 0.0200202 -0.776 0.4379
## z31 0.0162146 1.0163468 0.2480277 0.065 0.9479
## z41 0.0334753 1.0340419 0.2722755 0.123 0.9021
## z51 0.0792485 1.0824733 0.2856339 0.277 0.7814
## z61 -0.0702048 0.9322029 0.2557463 -0.275 0.7837
## z7 0.0000440 1.0000440 0.0004235 0.104 0.9173
## z81 0.4618598 1.5870228 0.3386689 1.364 0.1726
## z101 0.5145619 1.6729055 0.3330479 1.545 0.1223
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## group2 1.28918 0.7757 0.55044 3.01937
## group3 1.69559 0.5898 0.74478 3.86023
## ta 0.99821 1.0018 0.99747 0.99894
## tc 0.99402 1.0060 0.99197 0.99607
## tp 0.99943 1.0006 0.99696 1.00190
## dc1 0.02779 35.9887 0.01056 0.07314
## dp1 0.29327 3.4098 0.11375 0.75613
## z1 1.01097 0.9892 0.96635 1.05764
## z2 0.98459 1.0157 0.94670 1.02399
## z31 1.01635 0.9839 0.62506 1.65258
## z41 1.03404 0.9671 0.60642 1.76319
## z51 1.08247 0.9238 0.61842 1.89474
## z61 0.93220 1.0727 0.56470 1.53887
## z7 1.00004 1.0000 0.99921 1.00087
## z81 1.58702 0.6301 0.81716 3.08218
## z101 1.67291 0.5978 0.87092 3.21338
##
## Concordance= 0.926 (se = 0.034 )
## Rsquare= 0.806 (max possible= 0.995 )
## Likelihood ratio test= 224.6 on 16 df, p=0
## Wald test = 107.5 on 16 df, p=1.332e-15
## Score (logrank) test = 211.7 on 16 df, p=0
cox_fit_bmt <- survfit(cox_bmt)
# plot(cox_fit_bmt)
To give some idea of the scope of R’s capabilities to work with time to event data, I use the ranger()
function to fit a Random Forests Ensemble model to the data. Note that ranger()
builds a model for each observation in the data set. The next block of code builds the model using the same variables used the the Cox model above, and each of the 137 survival curves computed for the bmt data set, along with a curve of average values.
# ranger model
# ranger model
r_fit_bmt <- ranger(form,
data = bmt2,
importance = "permutation",
seed = 1234)
# Average the survival models
death_times <- r_fit_bmt$unique.death.times
surv_prob <- data.frame(r_fit_bmt$survival)
avg_prob <- sapply(surv_prob,mean)
# Plot the survival models for each patient
plot(r_fit_bmt$unique.death.times,r_fit_bmt$survival[1,], type = "l",
ylim = c(0,1),
col = "red",
xlab = "death times",
ylab = "survival",
main = "Patient Survival Curves")
for(n in c(2:137)){
lines(r_fit_bmt$unique.death.times, r_fit_bmt$survival[n,], type = "l", col = "red")
}
lines(death_times, avg_prob, lwd = 2)
legend(100, 0.2, legend = c('Averages - black'))
Here, we show the ranking of variable importance computed by the permutation method, which is ranger()
’s default for survival data. Note that ta
, tc
, and dc
are the same top three variables flagged in the Cox model.
Also listed is a measure of prediction error calculated from Harrell’s c-index. This index is defined as “… the proportion of all usable patient pairs in which the predictions and outcomes are concordant” (cf. [8] p 370), where predictions for pairs are concordant if predicted survival times are larger for patients who lived longer. Note that Harrell’s c-index may be thought of as a generalization of finding the are under an ROC curve. (For binary outcomes Harrell’s c-index reduces to the Wilcoxon-Mann-Whitney statistic which, in turn, is equivalent to computing the area under the ROC curve.)
vi <- data.frame(sort(round(r_fit_bmt$variable.importance, 4), decreasing = TRUE))
names(vi) <- "importance"
head(vi)
## importance
## ta 0.1259
## tc 0.0688
## dc 0.0190
## tp 0.0117
## dp 0.0092
## z2 0.0046
cat("Prediction Error = 1 - Harrell's c-index = ", r_fit_bmt$prediction.error)
## Prediction Error = 1 - Harrell's c-index = 0.09304771
Finally, we plot the survival curves computed for all three models on the same graph. Note that the “ad hoc” curve of average survival curves computed by the ranger model tracks the Kaplan-Meier curve fairly well.
# Set up for ggplot
km <- rep("KM", length(fit1_bmt$time))
km_df <- data.frame(fit1_bmt$time,fit1_bmt$surv,km)
names(km_df) <- c("Time","Surv","Model")
cox <- rep("Cox",length(cox_fit_bmt$time))
cox_df <- data.frame(cox_fit_bmt$time,cox_fit_bmt$surv,cox)
names(cox_df) <- c("Time","Surv","Model")
rf <- rep("RF",length(r_fit_bmt$unique.death.times))
rf_df <- data.frame(r_fit_bmt$unique.death.times,avg_prob,rf)
names(rf_df) <- c("Time","Surv","Model")
plot_df <- rbind(km_df,cox_df,rf_df)
p <- ggplot(plot_df, aes(x = Time, y = Surv, color = Model))
p + geom_line() + ggtitle("Comparison of Survival Curves")
For a very nice exposition of the sort of predictive survival analysis modeling that can be done with ranger
, be sure to have a look at Manuel Amunategui’s post and video.
This four-package excursion only hints at the Survival Analysis tools that are available in R, but it does illustrate some of the richness of the R platform which has been under continuous development and improvement for nearly twenty years. The use of the Surv()
function shows how open source code allows generations of developers to build on the work of their predecessors. The ranger
packages provides a practical example of how R can incorporate fast C++ code and adapt to the world of machine learning applications, and the incidental use of options such as Hall-Wellner Confidence bands and Harrell’s c-index gives some idea of the statistical depth that underlies almost everything R.
For convenience, I have collected the references used throughout the post here.
[1] Hacking, Ian. (2006) The Emergence of Probability: A Philosophical Study of Early Ideas about Probability Induction and Statistical Inference. Cambridge University Press, 2nd ed. p11
[2] Andersen, P.K., Keiding, N. (1998) Survival analysis (overview) Encyclopedia of Biostatistics 6. Wiley, p 4452-4461
[3] Kaplan, E.L. & Meier, P. (1958). Non-parametric estimation from incomplete observations, J American Stats Assn. 53, 457–481, 562–563.
[4] Cox, D.R. (1972). Regression models and life-tables (with discussion), Journal of the Royal Statistical Society (B) 34, 187–220.
[5] Diez, David. Survival Analysis in R. OpenIntro [6] Klein, John P and Moeschberger, Melvin L. (1997) Survival Analysis Techniques for Censored and Truncated Data, Springer.
[7] Wright, Marvin & Ziegler, Andreas. (2017) * [ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R] (https://www.jstatsoft.org/article/view/v077i01)*, JSS Vol 77, Issue 1.
[8] Harrell, Frank, Lee, Kerry & Mark, Daniel. (1996) Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. Statistics in Medicine, Vol 15, 361-387
[9] Amunategui, Manuel. Survival Ensembles: Survival Plus Classification for Improved Time-Based Predictions in R
With roots dating back to at least 1662 when John Graunt, a London merchant, published an extensive set of inferences based on mortality records, Survival Analysis is one of the oldest subfields of Statistics [1]. Basic life-table methods, including techniques for dealing with censored data, were known before 1700 [2]. In the early eighteenth century, the old masters, de Moivre working on annuities and Daniel Bernoulli studying competing risks for his work on smallpox inoculation, developed the foundations of time-to-event modeling [2]. Today, survival analysis models are important in Engineering, Insurance, Marketing and Medicine and many more application areas. So, it is not surprising that the R Task View on Survival Analysis, a curated, organized and annotated list of relevant R packages and functions, is formidable.
Looking at the Task View on a small screen is a bit like standing too close to a brick wall – left-right, up-down, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As well-organized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. (I would be remiss not to mention that we all owe a great deal of gratitude to Arthur Allignol and Aurielien Latouche, the task view maintainers.) Newcomers, people either new to R or new to survival analysis or both, must find it overwhelming. So, it is with newcomers in mind that I offer the following slim trajectory through the task view that relies on just a few packages: survival, KMsurv, Oisurv and ranger
The survival package, which began life as an S package in the late ’90s, is the cornerstone of the entire R Survival Analysis edifice. Not only is the package itself rich in features, but the object created by the Surv()
function, which contains failure time and censoring information, is the basic survival analysis data structure in R.
KMsurv contains some interesting data sets from John Klein and Melvin Moeschberger’s classic text, Survival Analysis Techniques for Censored and Truncated Data.
My main reason for selecting the OIsurv package was to draw attention the very helpful guide to Survival Analysis in R, produced by the folks at OpenIntro.
ranger might be the surprise in my very short list of survival packages. The ranger()
function is well-known for being a fast implementation of the Random Forests algorithm for building ensembles of classification and regression trees. But it was news to me that ranger()
also builds survival models. Benchmarks indicate that ranger()
is suitable for building time-to-event models with the large, high dimensional data sets important to internet marketing applications. Since ranger()
uses standard Surv()
survival objects, it’s an ideal tool for getting acquainted with survival analysis in the in this machine learning age.
*** Load and transform the data This first block of code loads the required packages along with the bone marrow transplant data frame from the KMsurv package. I chose this because has number of covariates and no missing values. The only data preparation is to make the appropriate variables into factors.
library(survival)
library(dplyr)
library(OIsurv) # Aumatically loads KMsurv
library(ranger)
library(ggplot2)
#------------
data(bmt)
# sapply(bmt,class)
# Prepare new data frame for modeling
bmt2 <- select(bmt,-c(t2,d2,d3))
bmt2 <- mutate(bmt2,
group = as.factor(group),
d1 = as.factor(d1),
da = as.factor(da),
dc = as.factor(dc),
dp = as.factor(dp),
z3 = as.factor(z3),
z4 = as.factor(z4),
z5 = as.factor(z5),
z6 = as.factor(z6),
z8 = as.factor(z8),
z9 = as.factor(z9),
z10 = as.factor(z10)
)
head(bmt2)
## group t1 d1 ta da tc dc tp dp z1 z2 z3 z4 z5 z6 z7 z8 z9 z10
## 1 1 2081 0 67 1 121 1 13 1 26 33 1 0 1 1 98 0 1 0
## 2 1 1602 0 1602 0 139 1 18 1 21 37 1 1 0 0 1720 0 1 0
## 3 1 1496 0 1496 0 307 1 12 1 26 35 1 1 1 0 127 0 1 0
## 4 1 1462 0 70 1 95 1 13 1 17 21 0 1 0 0 168 0 1 0
## 5 1 1433 0 1433 0 236 1 12 1 32 36 1 1 1 1 93 0 1 0
## 6 1 1377 0 1377 0 123 1 12 1 22 31 1 1 1 1 2187 0 1 0
The first thing to do is to use Surv()
to build the standard survival object. The variable t1
records the time to death or the censored time; d1
indicates that the patient died (d1 = 1
) or that the patient survived until the end of the study (d1 = 0
). Note that a “+” after the time in the print out of y_bmt
indicates censoring. The formula y_bmt ~ 1
instructs the survfit()
function to fit a model with intercept only, and produces the Kaplan-Meier estimate.
The confBands()
function from the OIsurv package estimates Hall-Wellner confidence bands. These are large-sample, simultaneous estimates and are plotted in red. They contrast with the point-wise confidence bands rendered as black dashed lines.
# Kaplan Meier Survival Curve
y_bmt <- Surv(bmt$t1, bmt$d1)
y_bmt
## [1] 2081+ 1602+ 1496+ 1462+ 1433+ 1377+ 1330+ 996+ 226+ 1199+ 1111+
## [12] 530+ 1182+ 1167+ 418 417 276 156 781 172 487 716
## [23] 194 371 526 122 1279 110 243 86 466 262 162
## [34] 262 1 107 269 350 2569+ 2506+ 2409+ 2218+ 1857+ 1829+
## [45] 1562+ 1470+ 1363+ 1030+ 860+ 1258+ 2246+ 1870+ 1799+ 1709+ 1674+
## [56] 1568+ 1527+ 1324+ 957+ 932+ 847+ 848+ 1850+ 1843+ 1535+ 1447+
## [67] 1384+ 414 2204 1063 481 105 641 390 288 522 79
## [78] 1156 583 48 431 1074 393 10 53 80 35 1499+
## [89] 704 653 222 1356+ 2640+ 2430+ 2252+ 2140+ 2133+ 1238+ 1631+
## [100] 2024+ 1345+ 1136+ 845+ 491 162 1298 121 2 62 265
## [111] 547 341 318 195 469 93 515 183 105 128 164
## [122] 129 122 80 677 73 168 74 16 248 732 105
## [133] 392 63 97 153 363
fit1_bmt <- survfit(y_bmt ~ 1)
summary(fit1_bmt)
## Call: survfit(formula = y_bmt ~ 1)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1 137 1 0.993 0.00727 0.979 1.000
## 2 136 1 0.985 0.01025 0.966 1.000
## 10 135 1 0.978 0.01250 0.954 1.000
## 16 134 1 0.971 0.01438 0.943 0.999
## 35 133 1 0.964 0.01602 0.933 0.995
## 48 132 1 0.956 0.01748 0.923 0.991
## 53 131 1 0.949 0.01881 0.913 0.987
## 62 130 1 0.942 0.02003 0.903 0.982
## 63 129 1 0.934 0.02117 0.894 0.977
## 73 128 1 0.927 0.02222 0.884 0.972
## 74 127 1 0.920 0.02322 0.875 0.966
## 79 126 1 0.912 0.02415 0.866 0.961
## 80 125 2 0.898 0.02588 0.848 0.950
## 86 123 1 0.891 0.02668 0.840 0.944
## 93 122 1 0.883 0.02744 0.831 0.939
## 97 121 1 0.876 0.02817 0.822 0.933
## 105 120 3 0.854 0.03017 0.797 0.915
## 107 117 1 0.847 0.03078 0.788 0.909
## 110 116 1 0.839 0.03137 0.780 0.903
## 121 115 1 0.832 0.03193 0.772 0.897
## 122 114 2 0.818 0.03300 0.755 0.885
## 128 112 1 0.810 0.03350 0.747 0.879
## 129 111 1 0.803 0.03399 0.739 0.872
## 153 110 1 0.796 0.03445 0.731 0.866
## 156 109 1 0.788 0.03490 0.723 0.860
## 162 108 2 0.774 0.03575 0.707 0.847
## 164 106 1 0.766 0.03615 0.699 0.841
## 168 105 1 0.759 0.03653 0.691 0.834
## 172 104 1 0.752 0.03690 0.683 0.828
## 183 103 1 0.745 0.03726 0.675 0.821
## 194 102 1 0.737 0.03760 0.667 0.815
## 195 101 1 0.730 0.03793 0.659 0.808
## 222 100 1 0.723 0.03825 0.651 0.802
## 243 98 1 0.715 0.03856 0.644 0.795
## 248 97 1 0.708 0.03886 0.636 0.788
## 262 96 2 0.693 0.03943 0.620 0.775
## 265 94 1 0.686 0.03969 0.612 0.768
## 269 93 1 0.678 0.03995 0.604 0.761
## 276 92 1 0.671 0.04019 0.597 0.755
## 288 91 1 0.664 0.04042 0.589 0.748
## 318 90 1 0.656 0.04063 0.581 0.741
## 341 89 1 0.649 0.04084 0.574 0.734
## 350 88 1 0.642 0.04104 0.566 0.727
## 363 87 1 0.634 0.04122 0.558 0.720
## 371 86 1 0.627 0.04140 0.551 0.713
## 390 85 1 0.619 0.04156 0.543 0.706
## 392 84 1 0.612 0.04172 0.535 0.699
## 393 83 1 0.605 0.04186 0.528 0.693
## 414 82 1 0.597 0.04199 0.520 0.686
## 417 81 1 0.590 0.04212 0.513 0.679
## 418 80 1 0.583 0.04223 0.505 0.671
## 431 79 1 0.575 0.04234 0.498 0.664
## 466 78 1 0.568 0.04243 0.490 0.657
## 469 77 1 0.560 0.04252 0.483 0.650
## 481 76 1 0.553 0.04259 0.476 0.643
## 487 75 1 0.546 0.04266 0.468 0.636
## 491 74 1 0.538 0.04271 0.461 0.629
## 515 73 1 0.531 0.04276 0.453 0.622
## 522 72 1 0.524 0.04280 0.446 0.615
## 526 71 1 0.516 0.04282 0.439 0.607
## 547 69 1 0.509 0.04285 0.431 0.600
## 583 68 1 0.501 0.04287 0.424 0.593
## 641 67 1 0.494 0.04288 0.416 0.585
## 653 66 1 0.486 0.04288 0.409 0.578
## 677 65 1 0.479 0.04286 0.402 0.571
## 704 64 1 0.471 0.04284 0.394 0.563
## 716 63 1 0.464 0.04281 0.387 0.556
## 732 62 1 0.456 0.04277 0.380 0.548
## 781 61 1 0.449 0.04272 0.372 0.541
## 1063 52 1 0.440 0.04276 0.364 0.533
## 1074 51 1 0.432 0.04278 0.355 0.524
## 1156 48 1 0.423 0.04282 0.346 0.515
## 1279 42 1 0.413 0.04297 0.336 0.506
## 1298 41 1 0.402 0.04308 0.326 0.496
## 2204 9 1 0.358 0.05696 0.262 0.489
cb <- confBands(y_bmt, type = "hall")
plot(fit1_bmt,
main = 'Kaplan Meyer Plot with confidence bands')
lines(cb, col = "red",lty = 3)
legend(1000, 0.99, legend = c('K-M survival estimate',
'pointwise intervals', 'Hall-Werner conf bands'), lty = 1:3)
Next, I fit a Cox Proportional Hazards model, which makes use of several of the variables contained in the data set. I don’t pretend to have the best model here, or even a very good one. My intention is just to show the survival package’s coxph()
function, and show how the covariates enter the formula. Note, however, that this model does achieve an R^{2} value of 0.8, and that three variables – ta
, tc
, and dc1
– show up as significant.
# Fit Cox Model
form <- formula(y_bmt ~ group + ta + tc + tp + dc + dp +
z1 + z2 + z3 + z4 + z5 + z6 + z7 + z8 + z10)
cox_bmt <- coxph(form,data = bmt2)
summary(cox_bmt)
## Call:
## coxph(formula = form, data = bmt2)
##
## n= 137, number of events= 81
##
## coef exp(coef) se(coef) z Pr(>|z|)
## group2 0.2540084 1.2891826 0.4342118 0.585 0.5586
## group3 0.5280287 1.6955865 0.4197516 1.258 0.2084
## ta -0.0017962 0.9982054 0.0003736 -4.808 1.52e-06 ***
## tc -0.0060005 0.9940175 0.0010512 -5.708 1.14e-08 ***
## tp -0.0005709 0.9994293 0.0012622 -0.452 0.6510
## dc1 -3.5832062 0.0277865 0.4938320 -7.256 3.99e-13 ***
## dp1 -1.2266468 0.2932744 0.4832251 -2.538 0.0111 *
## z1 0.0109077 1.0109674 0.0230293 0.474 0.6358
## z2 -0.0155304 0.9845895 0.0200202 -0.776 0.4379
## z31 0.0162146 1.0163468 0.2480277 0.065 0.9479
## z41 0.0334753 1.0340419 0.2722755 0.123 0.9021
## z51 0.0792485 1.0824733 0.2856339 0.277 0.7814
## z61 -0.0702048 0.9322029 0.2557463 -0.275 0.7837
## z7 0.0000440 1.0000440 0.0004235 0.104 0.9173
## z81 0.4618598 1.5870228 0.3386689 1.364 0.1726
## z101 0.5145619 1.6729055 0.3330479 1.545 0.1223
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## group2 1.28918 0.7757 0.55044 3.01937
## group3 1.69559 0.5898 0.74478 3.86023
## ta 0.99821 1.0018 0.99747 0.99894
## tc 0.99402 1.0060 0.99197 0.99607
## tp 0.99943 1.0006 0.99696 1.00190
## dc1 0.02779 35.9887 0.01056 0.07314
## dp1 0.29327 3.4098 0.11375 0.75613
## z1 1.01097 0.9892 0.96635 1.05764
## z2 0.98459 1.0157 0.94670 1.02399
## z31 1.01635 0.9839 0.62506 1.65258
## z41 1.03404 0.9671 0.60642 1.76319
## z51 1.08247 0.9238 0.61842 1.89474
## z61 0.93220 1.0727 0.56470 1.53887
## z7 1.00004 1.0000 0.99921 1.00087
## z81 1.58702 0.6301 0.81716 3.08218
## z101 1.67291 0.5978 0.87092 3.21338
##
## Concordance= 0.926 (se = 0.034 )
## Rsquare= 0.806 (max possible= 0.995 )
## Likelihood ratio test= 224.6 on 16 df, p=0
## Wald test = 107.5 on 16 df, p=1.332e-15
## Score (logrank) test = 211.7 on 16 df, p=0
cox_fit_bmt <- survfit(cox_bmt)
# plot(cox_fit_bmt)
To give some idea of the scope of R’s capabilities to work with time to event data, I use the ranger()
function to fit a Random Forests Ensemble model to the data. Note that ranger()
builds a model for each observation in the data set. The next block of code builds the model using the same variables used the the Cox model above, and each of the 137 survival curves computed for the bmt data set, along with a curve of average values.
# ranger model
# ranger model
r_fit_bmt <- ranger(form,
data = bmt2,
importance = "permutation",
seed = 1234)
# Average the survival models
death_times <- r_fit_bmt$unique.death.times
surv_prob <- data.frame(r_fit_bmt$survival)
avg_prob <- sapply(surv_prob,mean)
# Plot the survival models for each patient
plot(r_fit_bmt$unique.death.times,r_fit_bmt$survival[1,], type = "l",
ylim = c(0,1),
col = "red",
xlab = "death times",
ylab = "survival",
main = "Patient Survival Curves")
for(n in c(2:137)){
lines(r_fit_bmt$unique.death.times, r_fit_bmt$survival[n,], type = "l", col = "red")
}
lines(death_times, avg_prob, lwd = 2)
legend(100, 0.2, legend = c('Averages - black'))
Here, we show the ranking of variable importance computed by the permutation method, which is ranger()
’s default for survival data. Note that ta
, tc
, and dc
are the same top three variables flagged in the Cox model.
Also listed is a measure of prediction error calculated from Harrell’s c-index. This index is defined as “… the proportion of all usable patient pairs in which the predictions and outcomes are concordant” (cf. [8] p 370), where predictions for pairs are concordant if predicted survival times are larger for patients who lived longer. Note that Harrell’s c-index may be thought of as a generalization of finding the are under an ROC curve. (For binary outcomes Harrell’s c-index reduces to the Wilcoxon-Mann-Whitney statistic which, in turn, is equivalent to computing the area under the ROC curve.)
vi <- data.frame(sort(round(r_fit_bmt$variable.importance, 4), decreasing = TRUE))
names(vi) <- "importance"
head(vi)
## importance
## ta 0.1259
## tc 0.0688
## dc 0.0190
## tp 0.0117
## dp 0.0092
## z2 0.0046
cat("Prediction Error = 1 - Harrell's c-index = ", r_fit_bmt$prediction.error)
## Prediction Error = 1 - Harrell's c-index = 0.09304771
Finally, we plot the survival curves computed for all three models on the same graph. Note that the “ad hoc” curve of average survival curves computed by the ranger model tracks the Kaplan-Meier curve fairly well.
# Set up for ggplot
km <- rep("KM", length(fit1_bmt$time))
km_df <- data.frame(fit1_bmt$time,fit1_bmt$surv,km)
names(km_df) <- c("Time","Surv","Model")
cox <- rep("Cox",length(cox_fit_bmt$time))
cox_df <- data.frame(cox_fit_bmt$time,cox_fit_bmt$surv,cox)
names(cox_df) <- c("Time","Surv","Model")
rf <- rep("RF",length(r_fit_bmt$unique.death.times))
rf_df <- data.frame(r_fit_bmt$unique.death.times,avg_prob,rf)
names(rf_df) <- c("Time","Surv","Model")
plot_df <- rbind(km_df,cox_df,rf_df)
p <- ggplot(plot_df, aes(x = Time, y = Surv, color = Model))
p + geom_line() + ggtitle("Comparison of Survival Curves")
For a very nice exposition of the sort of predictive survival analysis modeling that can be done with ranger
, be sure to have a look at Manuel Amunategui’s post and video.
This four-package excursion only hints at the Survival Analysis tools that are available in R, but it does illustrate some of the richness of the R platform which has been under continuous development and improvement for nearly twenty years. The use of the Surv()
function shows how open source code allows generations of developers to build on the work of their predecessors. The ranger
packages provides a practical example of how R can incorporate fast C++ code and adapt to the world of machine learning applications, and the incidental use of options such as Hall-Wellner Confidence bands and Harrell’s c-index gives some idea of the statistical depth that underlies almost everything R.
For convenience, I have collected the references used throughout the post here.
[1] Hacking, Ian. (2006) The Emergence of Probability: A Philosophical Study of Early Ideas about Probability Induction and Statistical Inference. Cambridge University Press, 2nd ed. p11
[2] Andersen, P.K., Keiding, N. (1998) Survival analysis (overview) Encyclopedia of Biostatistics 6. Wiley, p 4452-4461
[3] Kaplan, E.L. & Meier, P. (1958). Non-parametric estimation from incomplete observations, J American Stats Assn. 53, 457–481, 562–563.
[4] Cox, D.R. (1972). Regression models and life-tables (with discussion), Journal of the Royal Statistical Society (B) 34, 187–220.
[5] Diez, David. Survival Analysis in R. OpenIntro [6] Klein, John P and Moeschberger, Melvin L. (1997) Survival Analysis Techniques for Censored and Truncated Data, Springer.
[7] Wright, Marvin & Ziegler, Andreas. (2017) * [ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R] (https://www.jstatsoft.org/article/view/v077i01)*, JSS Vol 77, Issue 1.
[8] Harrell, Frank, Lee, Kerry & Mark, Daniel. (1996) Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. Statistics in Medicine, Vol 15, 361-387
[9] Amunategui, Manuel. Survival Ensembles: Survival Plus Classification for Improved Time-Based Predictions in R
The knitr package by Yihui Xie is a wonderful tool for reproducible data science. I especially like using it with R Markdown documents, where with some simple markup in an easy-to-read document I can easily combine R code and narrative text to generate an attractive document with words, tables and pictures in HTML, PDF or Word format. Say, something like this:
In that document, the numerical weather records and the chart were generated by R, combined into a document using R Markdown, and then generated as a word file with knitr
. (You can find the R Markdown file to generate that report, and the R script to download the data, in my weather-report repository.)
Another useful tool for reproducible data science is the checkpoint package. It helps you manage the ever-changing ecosystem of R packages on CRAN, by making it easy to "lock in" specific versions of R packages. With a single call to the checkpoint function — say checkpoint("2017-04-25")
, for April 25, 2017 — you can automatically find all the packages used by your current R project (i.e. the current folder) and install them as they used to be on the specified date. A colleague or collaborator can use the same script to get the same versions too, and so be confident of reproducing your results without having to worry a newer package version may have affected the results. By the way, those package versions get installed in a special folder (.checkpoint
, in your home directory), so they won't change the results of any other R projects, either.
RStudio includes a very useful tool for working with R Markdown and knitr
: you can press the "Knit" toolbar button to process the document with a single click. For that to work, it does require certain R packages to be available for use behind the scenes. In normal circumstances RStudio will offer to install them, but the process doesn't work when a checkpoint folder is active. A simple workaround is to include a file in the same folder (I call mine knitr-packages.R
) with the following lines:
library("formatR") library("htmltools") library("caTools") library("bitops") library("base64enc") library("rprojroot") library("rmarkdown") library("evaluate") library("stringi")
Although you never run that file directly, the checkpoint process will discover it and ensure the necessary packages are installed for RStudio to perform its magic. (In my tests this works with recent versions of RStudio including the latest, 1.0.143). All you need to do is make sure you run checkpoint
from the R command line (just press Control-ENTER on the corresponding line in the .Rmd
file) before attempting to knit. Simple!
Uncertainty is the biggest enemy of a profitable business. That is especially true of small business who don’t have enough resources to survive an unexpected diminution of revenue or to capitalize on a sudden increase of demand. In this context, it is especially important to be able to predict accurately the change in the markets to be able to make better decision and stay competitive.
This series of posts will teach you how to use data to make sound prediction. In the last set of exercises, we’ve seen how to make predictions on a random walk by isolating the white noise components via differentiation of the term of the time series. But this approach is valid only if the random components of the time series follow a normal distribution of constant mean and variance and if those components are added together in each iteration to create the new observations.
Today, we’ll see some transformations we can apply on the time series make them stationary, especially how to stabilise variance and how to detect and remove seasonality in a time series.
To be able to do theses exercise, you have to have installed the packages forecast
and tseries
.
Answers to the exercises are available here.
Exercise 1
Use the data()
function to load the EuStockMarkets
dataset from the R library. Then use the diff()
function on EuStockMarkets[,1]
to isolate the random component and plot them. This differentiation is the most used transformation with time series.
We can see that the mean of the random component of the time series seems to stay constant over time, but the variance seems to get bigger near 1997.
Exercise 2
Apply a the log()
function on EuStockMarkets[,1]
and repeat the step of exercise 1. The logarithmic transformation is often used to stabilise non constant variance.
Exercise 3
Use the adf.test()
function from the tseries
package to test if the time series you obtain in the last exercise is stationary. Use a lag of 1.
Exercise 4
Load and plot the co2
dataset from the R library dataset
. Use the lowess()
function to create a trend line and add it to the plot of the time series.
By looking at the last plot, we can see that the time series oscillate from one side to the other of the trend line with a constant period. That characteristic is called seasonally and is often observed in time series. Just think about temperature, which change predictably from season to season.
Exercise 5
To eliminate the upward trend in the data use the diff()
function and save the resulting time series in a variable called diff.co2
. Plot the autocorrelation plot of diff.co2
.
Exercise 6
This last autocorrelation plot has years for unit which is not really intuitive in our scenario. Make another autocorrelation plot where the x axis has months as units. By looking at this plot, can you tell what is the seasonal period of this time series?
Another way to verify if the time series show seasonnality is to use the tbats
function from the forecast package. As his named says, this function fits a tbats model on the time series and return a smooth curve that fit the data. We’ll learn more about that model in a future post.
Exercise 7
Use the tbats
function on the co2
time series and store the result in a variable called tbats.model
. If the time series show sign of seasonality, the $seasonal.periods
value of tbats.model
will store the period value, else the result will be null.
Exercise 8
Use the diff()
function with the appropriate lag to remove the seasonality of the co2
time series, then use it again to remove the trend. Plot the resulting random component and the autocorrelation plot.
Exercise 9
Apply the adf test, the kpss test and the Ljung-Box test on the result of the last exercise to make sure that the random component is stationary.
Exercise 10
An interesting way to analyse a time series is to use the decompose()
function which uses a moving average to estimate the seasonal, random and trend component of a time series. With that in mind, use this function and plot each component of the co2
time series.