(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

**SHINY APPLICATION LAYOUTS-PLOT PLUS COLUMNS**

In the second part of our series we will build another small shiny app but use another UI.

More specifically we will present the example of a UI with a plot at the top and columns at the bottom that contain the inputs that drive the plot. For our case we are going to use the `diamonds`

dataset to create a Diamonds Analyzer App.

This part can be useful for you in two ways.

First of all, you can see different ways to enhance the appearance and the utility of your shiny app.

Secondly you can make a revision on what you learnt in the “Building Shiny App” series as we will build basic shiny stuff in order to present it in the proper way.

Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Lets begin!

Answers to the are available here.

**Shiny Installation**

In order to create the app we have to install and load the package `shiny`

.

**Exercise 1**

Install and load `shiny`

.

**Grid Layout**

The `sidebarLayout`

uses Shiny’s grid layout functions. Rows are created by the `fluidRow`

function and include columns defined by the `column`

function. Column widths should add up to 12 within a `fluidRow`

.

The first parameter to the `column `

function is it’s `width`

. You can also change the position of columns to decide the location of UI elements. You can put columns to the right by adding the `offset`

parameter to the `column`

function. Each unit of offset increases the left-margin of a column by a whole column.

Now let’s begin to build our UI. First of all we will place the `fluidpage`

with a `title`

as below:

#ui.R

`library(shiny)`

shinyUI(fluidPage(

title = "Diamonds",

h4("Diamonds Analyzer")

```
```

`))`

#server.R

`library(shiny)`

function(input, output) {}

**Exercise 2**

Create the initial UI of your app and name it “Diamonds Analyzer”.

You can use the `fluidrow`

function with the `column`

function of `width`

=2 inside of it like this:

#ui.R

`library(shiny)`

shinyUI(fluidPage(

title = "Diamonds",

h4("Diamonds Analyzer"),

fluidRow(column(2),

column(2),

)

```
```

`))`

**Exercise 3**

Create a `fluidrow`

with two columns of `width`

= 4 inside it. **NOTE**: Do not expect to see something yet.

Now it is time to fill these columns with some tools that will help us determine the variables that we are going to use for our plot.

In the first 4 columns we will put a `selectInput`

as the code below.

#ui.R

`fluidRow(column(4,`

h4("Variable X"),

selectInput('x', 'X', names(diamonds)))

**Exercise 4**

Put a `selectInput`

in the first 4 columns of your UI. Name it “Variable X”. HINT:Use `names`

to get the names of the dataset `diamonds`

as inputs.

Now let’s move to the next four columns. We are going to put in there another `selectInput`

and select the second of the dataset’s names as default. We are also going to see what `offset`

does by setting it to 1 and then deactivating it again like the example below. You can use the code as it is or change the parameters given to understand the logic behind its use.

#ui.R

`offset = 1,`

selectInput('y', 'Y', names(dataset), names(dataset)[[2]])

**Exercise 5**

Create a `selectInput`

from `column`

5 to `column`

8. Choose the second of the dataset’s name as default. Name it “Variable Y”. HINT: Use `names`

to get the names of the dataset `diamonds`

as inputs.

**Exercise 6**

Set the `offset`

parameter to 1 from columns 5 to 8.

Now let’s call our `plot`

and put it on the top of our UI. Look at the example below.

**Exercise 7**

Place the `plot`

on the top of your UI. **HINT**: Use `plotOutput`

and `hr`

. **NOTE**: You are not going to see the `plot`

in your UI because you have not created the server side yet.

We are going to create a reactive expression in order to combine the selected variables into a new data frame.Look at the example:

#server.R

`selectedData <- reactive({`

diamonds[, c(input$x, input$y)]

})

**Exercise 8**

Create a reactive expression in order to combine the selected variables into a new data frame. **HINT**: Use `reactive`

.

Now plot your new data frame like the example:

#server.R

`output$plot <- renderPlot({`

plot(selectedData())

})

**Exercise 9**

Plot your data frame. **HINT**: Use `renderPlot`

.

As mentioned before the `width`

of our UI is equal to 12 columns. So what is goint to happen if we a add a new column of `width`

= 4 next to the other two? You have to find out in order to understand better how it works.

**Exercise 10**

Create a new `selectInput`

and try to put it next to “Variable Y”. Try to explain the result. **NOTE**: You do not have to connect it with your `plot`

.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R-exercises**.

R-bloggers.com offers

(This article was first published on ** That’s so Random**, and kindly contributed to R-bloggers)

I guess we all use it, the good old histogram. One of the first things we are taught in *Introduction to Statistics* and routinely applied whenever coming across a new continuous variable. However, it easily gets messed up by outliers. Putting most of the data into a single bin or a few bins, and scattering the outliers barely visible over the x-axis. This distribution might look familiar

```
library(tidyverse)
set.seed(42)
hist_data <- data_frame(x = c(rexp(1000, .5),
runif(50, 0, 500)))
ggplot(hist_data, aes(x)) +
geom_histogram(binwidth = .1, col = "black", fill = "cornflowerblue")
```

Two strategies that make the above into something more interpretable are taking the logarithm of the variable, or omitting the outliers. Both do not show the original distribution, however. Another way to go, is to create one bin for all the outlier values. This way we would see the original distribution where the density is the highest, while at the same time getting a feel for the number of outliers. A quick and dirty implementation of this would be

```
hist_data %>%
mutate(x_new = ifelse(x > 10, 10, x)) %>%
ggplot(aes(x_new)) +
geom_histogram(binwidth = .1, col = "black", fill = "cornflowerblue")
```

Not bad. However, it now suggests incorrectly that many observations are exactly 10. I routinely make these plots for my own information, but they cannot be shared without explaining what happened to the outliers and what there original range was. Since a plot with a manual is not that great either, I recently did a hacking session into the `ggplot`

object. The resulting `gg_outlier_bin`

function not only indicates the range of the last bin, it also allows for a different fill color of the bin. Now we are clearly distinguishing the outlier aggregation

```
gg_outlier_bin(hist_data,
"x",
cut_off_floor = NA,
cut_off_ceiling = 10,
binwidth = 0.1)
```

It is still a bit experimental, but it seems to work in most situations. Below you find the function code for making histograms with outlier bins. You can also get it by installing the package accompanying this blog `devtools::install_github("edwinth/thatssorandom")`

. By the way, it works on both floor and ceiling outliers. Like in the following

```
data_frame(x = c(runif(100, 0, 100), rnorm(1000, 50, 2))) %>%
gg_outlier_bin("x", 45, 55, binwidth = .1)
```

```
gg_outlier_bin <- function(x,
var_name,
cut_off_floor,
cut_off_ceiling,
col = "black",
fill = "cornflowerblue",
fill_outlier_bins = "forestgreen",
binwidth = NULL) {
printing_min_max <- x %>% summarise_(sprintf("round(min(%s, na.rm = TRUE), 1)", var_name),
sprintf("round(max(%s, na.rm = TRUE), 1)", var_name))
ceiling_filter <- ifelse(!is.na(cut_off_ceiling),
sprintf("%s < %f", var_name, cut_off_ceiling),
"1 == 1")
floor_filter <- ifelse(!is.na(cut_off_floor),
sprintf("%s > %f", var_name, cut_off_floor),
"1 == 1")
x_regular <- x %>% filter_(ceiling_filter, floor_filter) %>%
select_(var_name)
x_to_roll_ceiling <- x %>% filter_(
sprintf("%s >= %f", var_name, cut_off_ceiling)) %>% select_(var_name)
if (!is.na(cut_off_ceiling)) x_to_roll_ceiling[, 1] <- cut_off_ceiling
x_to_roll_floor <- x %>% filter_(
sprintf("%s <= %f", var_name, cut_off_floor)) %>% select_(var_name)
if (!is.na(cut_off_floor)) x_to_roll_floor[, 1] <- cut_off_floor
plot_obj <- ggplot(x_regular, aes_string(var_name)) +
geom_histogram(col = col, fill = fill, binwidth = binwidth)
if (!is.na(cut_off_ceiling)) {
ticks_for_ceiling <- update_tickmarks_ceiling(plot_obj, cut_off_ceiling,
printing_min_max[1,2])
plot_obj <- plot_obj +
geom_histogram(data = x_to_roll_ceiling, fill = fill_outlier_bins, col = col,
binwidth = binwidth) +
scale_x_continuous(breaks = ticks_for_ceiling$tick_positions,
labels = ticks_for_ceiling$tick_labels)
}
if (!is.na(cut_off_floor)) {
ticks_for_floor <- update_tickmarks_floor(plot_obj, cut_off_floor,
printing_min_max[1,1])
plot_obj <- plot_obj +
geom_histogram(data = x_to_roll_floor, fill = fill_outlier_bins,
col = col, binwidth = binwidth) +
scale_x_continuous(breaks = ticks_for_floor$tick_positions,
labels = ticks_for_floor$tick_labels)
}
return(plot_obj)
}
update_tickmarks_ceiling <- function(gg_obj,
co,
max_print) {
ranges <- suppressMessages(
ggplot_build(gg_obj)$layout$panel_ranges[[1]])
label_to_add <- sprintf("(%s , %s)", round(co, 1), max_print)
tick_positions <- ranges$x.major_source
tick_labels <- ranges$x.labels
if (overlap_ceiling(tick_positions, co)) {
tick_positions <- tick_positions[-length(tick_positions)]
tick_labels <- tick_labels[-length(tick_labels)]
}
return(list(tick_positions = c(tick_positions, co),
tick_labels = c(tick_labels, label_to_add)))
}
overlap_ceiling <- function(positions, cut_off) {
n <- length(positions)
ticks_dif <- positions[n] - positions[n-1]
(cut_off - positions[n]) / ticks_dif < 0.25
}
update_tickmarks_floor <- function(gg_obj,
co,
min_print) {
ranges <- suppressMessages(
ggplot_build(gg_obj)$layout$panel_ranges[[1]])
label_to_add <- sprintf("(%s , %s)", min_print, round(co, 1))
tick_positions <- ranges$x.major_source
tick_labels <- ranges$x.labels
if (overlap_floor(tick_positions, co)) {
tick_positions <- tick_positions[-1]
tick_labels <- tick_labels[-1]
}
return(list(tick_positions = c(co, tick_positions),
tick_labels = c(label_to_add, tick_labels)))
}
overlap_floor <- function(positions, cut_off) {
ticks_dif <- positions[2] - positions[1]
(positions[1] - cut_off) / ticks_dif < 0.25
}
```

To **leave a comment** for the author, please follow the link and comment on their blog: ** That’s so Random**.

R-bloggers.com offers

Looking at the Task View on a small screen is a bit like standing too close to a brick wall - left-right, up-down, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As well-organized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. (I would be remiss not to mention that we all owe a great deal of gratitude to Arthur Allignol and Aurielien Latouche, the task view maintainers.) Newcomers, people either new to R or new to survival analysis or both, must find it overwhelming. So, it is with newcomers in mind that I offer the following slim trajectory through the task view that relies on just a few packages: survival, KMsurv, Oisurv and ranger

The survival package, which began life as an S package in the late ’90s, is the cornerstone of the entire R Survival Analysis edifice. Not only is the package itself rich in features, but the object created by the `Surv()`

function, which contains failure time and censoring information, is the basic survival analysis data structure in R.

KMsurv contains some interesting data sets from John Klein and Melvin Moeschberger’s classic text, Survival Analysis Techniques for Censored and Truncated Data.

My main reason for selecting the OIsurv package was to draw attention the very helpful guide to Survival Analysis in R, produced by the folks at OpenIntro.

ranger might be the surprise in my very short list of survival packages. The `ranger()`

function is well-known for being a fast implementation of the Random Forests algorithm for building ensembles of classification and regression trees. But it was news to me that `ranger()`

also builds survival models. Benchmarks indicate that `ranger()`

is suitable for building time-to-event models with the large, high dimensional data sets important to internet marketing applications. Since `ranger()`

uses standard `Surv()`

survival objects, it’s an ideal tool for getting acquainted with survival analysis in the in this machine learning age.

*** Load and transform the data This first block of code loads the required packages along with the bone marrow transplant data frame from the KMsurv package. I chose this because has number of covariates and no missing values. The only data preparation is to make the appropriate variables into factors.

```
library(survival)
library(dplyr)
library(OIsurv) # Aumatically loads KMsurv
library(ranger)
library(ggplot2)
#------------
data(bmt)
# sapply(bmt,class)
# Prepare new data frame for modeling
bmt2 <- select(bmt,-c(t2,d2,d3))
bmt2 <- mutate(bmt2,
group = as.factor(group),
d1 = as.factor(d1),
da = as.factor(da),
dc = as.factor(dc),
dp = as.factor(dp),
z3 = as.factor(z3),
z4 = as.factor(z4),
z5 = as.factor(z5),
z6 = as.factor(z6),
z8 = as.factor(z8),
z9 = as.factor(z9),
z10 = as.factor(z10)
)
head(bmt2)
```

```
## group t1 d1 ta da tc dc tp dp z1 z2 z3 z4 z5 z6 z7 z8 z9 z10
## 1 1 2081 0 67 1 121 1 13 1 26 33 1 0 1 1 98 0 1 0
## 2 1 1602 0 1602 0 139 1 18 1 21 37 1 1 0 0 1720 0 1 0
## 3 1 1496 0 1496 0 307 1 12 1 26 35 1 1 1 0 127 0 1 0
## 4 1 1462 0 70 1 95 1 13 1 17 21 0 1 0 0 168 0 1 0
## 5 1 1433 0 1433 0 236 1 12 1 32 36 1 1 1 1 93 0 1 0
## 6 1 1377 0 1377 0 123 1 12 1 22 31 1 1 1 1 2187 0 1 0
```

The first thing to do is to use `Surv()`

to build the standard survival object. The variable `t1`

records the time to death or the censored time; `d1`

indicates that the patient died (`d1 = 1`

) or that the patient survived until the end of the study (`d1 = 0`

). Note that a “+” after the time in the print out of `y_bmt`

indicates censoring. The formula `y_bmt ~ 1`

instructs the `survfit()`

function to fit a model with intercept only, and produces the Kaplan-Meier estimate.

The `confBands()`

function from the OIsurv package estimates Hall-Wellner confidence bands. These are large-sample, simultaneous estimates and are plotted in red. They contrast with the point-wise confidence bands rendered as black dashed lines.

```
# Kaplan Meier Survival Curve
y_bmt <- Surv(bmt$t1, bmt$d1)
y_bmt
```

```
## [1] 2081+ 1602+ 1496+ 1462+ 1433+ 1377+ 1330+ 996+ 226+ 1199+ 1111+
## [12] 530+ 1182+ 1167+ 418 417 276 156 781 172 487 716
## [23] 194 371 526 122 1279 110 243 86 466 262 162
## [34] 262 1 107 269 350 2569+ 2506+ 2409+ 2218+ 1857+ 1829+
## [45] 1562+ 1470+ 1363+ 1030+ 860+ 1258+ 2246+ 1870+ 1799+ 1709+ 1674+
## [56] 1568+ 1527+ 1324+ 957+ 932+ 847+ 848+ 1850+ 1843+ 1535+ 1447+
## [67] 1384+ 414 2204 1063 481 105 641 390 288 522 79
## [78] 1156 583 48 431 1074 393 10 53 80 35 1499+
## [89] 704 653 222 1356+ 2640+ 2430+ 2252+ 2140+ 2133+ 1238+ 1631+
## [100] 2024+ 1345+ 1136+ 845+ 491 162 1298 121 2 62 265
## [111] 547 341 318 195 469 93 515 183 105 128 164
## [122] 129 122 80 677 73 168 74 16 248 732 105
## [133] 392 63 97 153 363
```

```
fit1_bmt <- survfit(y_bmt ~ 1)
summary(fit1_bmt)
```

```
## Call: survfit(formula = y_bmt ~ 1)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1 137 1 0.993 0.00727 0.979 1.000
## 2 136 1 0.985 0.01025 0.966 1.000
## 10 135 1 0.978 0.01250 0.954 1.000
## 16 134 1 0.971 0.01438 0.943 0.999
## 35 133 1 0.964 0.01602 0.933 0.995
## 48 132 1 0.956 0.01748 0.923 0.991
## 53 131 1 0.949 0.01881 0.913 0.987
## 62 130 1 0.942 0.02003 0.903 0.982
## 63 129 1 0.934 0.02117 0.894 0.977
## 73 128 1 0.927 0.02222 0.884 0.972
## 74 127 1 0.920 0.02322 0.875 0.966
## 79 126 1 0.912 0.02415 0.866 0.961
## 80 125 2 0.898 0.02588 0.848 0.950
## 86 123 1 0.891 0.02668 0.840 0.944
## 93 122 1 0.883 0.02744 0.831 0.939
## 97 121 1 0.876 0.02817 0.822 0.933
## 105 120 3 0.854 0.03017 0.797 0.915
## 107 117 1 0.847 0.03078 0.788 0.909
## 110 116 1 0.839 0.03137 0.780 0.903
## 121 115 1 0.832 0.03193 0.772 0.897
## 122 114 2 0.818 0.03300 0.755 0.885
## 128 112 1 0.810 0.03350 0.747 0.879
## 129 111 1 0.803 0.03399 0.739 0.872
## 153 110 1 0.796 0.03445 0.731 0.866
## 156 109 1 0.788 0.03490 0.723 0.860
## 162 108 2 0.774 0.03575 0.707 0.847
## 164 106 1 0.766 0.03615 0.699 0.841
## 168 105 1 0.759 0.03653 0.691 0.834
## 172 104 1 0.752 0.03690 0.683 0.828
## 183 103 1 0.745 0.03726 0.675 0.821
## 194 102 1 0.737 0.03760 0.667 0.815
## 195 101 1 0.730 0.03793 0.659 0.808
## 222 100 1 0.723 0.03825 0.651 0.802
## 243 98 1 0.715 0.03856 0.644 0.795
## 248 97 1 0.708 0.03886 0.636 0.788
## 262 96 2 0.693 0.03943 0.620 0.775
## 265 94 1 0.686 0.03969 0.612 0.768
## 269 93 1 0.678 0.03995 0.604 0.761
## 276 92 1 0.671 0.04019 0.597 0.755
## 288 91 1 0.664 0.04042 0.589 0.748
## 318 90 1 0.656 0.04063 0.581 0.741
## 341 89 1 0.649 0.04084 0.574 0.734
## 350 88 1 0.642 0.04104 0.566 0.727
## 363 87 1 0.634 0.04122 0.558 0.720
## 371 86 1 0.627 0.04140 0.551 0.713
## 390 85 1 0.619 0.04156 0.543 0.706
## 392 84 1 0.612 0.04172 0.535 0.699
## 393 83 1 0.605 0.04186 0.528 0.693
## 414 82 1 0.597 0.04199 0.520 0.686
## 417 81 1 0.590 0.04212 0.513 0.679
## 418 80 1 0.583 0.04223 0.505 0.671
## 431 79 1 0.575 0.04234 0.498 0.664
## 466 78 1 0.568 0.04243 0.490 0.657
## 469 77 1 0.560 0.04252 0.483 0.650
## 481 76 1 0.553 0.04259 0.476 0.643
## 487 75 1 0.546 0.04266 0.468 0.636
## 491 74 1 0.538 0.04271 0.461 0.629
## 515 73 1 0.531 0.04276 0.453 0.622
## 522 72 1 0.524 0.04280 0.446 0.615
## 526 71 1 0.516 0.04282 0.439 0.607
## 547 69 1 0.509 0.04285 0.431 0.600
## 583 68 1 0.501 0.04287 0.424 0.593
## 641 67 1 0.494 0.04288 0.416 0.585
## 653 66 1 0.486 0.04288 0.409 0.578
## 677 65 1 0.479 0.04286 0.402 0.571
## 704 64 1 0.471 0.04284 0.394 0.563
## 716 63 1 0.464 0.04281 0.387 0.556
## 732 62 1 0.456 0.04277 0.380 0.548
## 781 61 1 0.449 0.04272 0.372 0.541
## 1063 52 1 0.440 0.04276 0.364 0.533
## 1074 51 1 0.432 0.04278 0.355 0.524
## 1156 48 1 0.423 0.04282 0.346 0.515
## 1279 42 1 0.413 0.04297 0.336 0.506
## 1298 41 1 0.402 0.04308 0.326 0.496
## 2204 9 1 0.358 0.05696 0.262 0.489
```

```
cb <- confBands(y_bmt, type = "hall")
plot(fit1_bmt,
main = 'Kaplan Meyer Plot with confidence bands')
lines(cb, col = "red",lty = 3)
legend(1000, 0.99, legend = c('K-M survival estimate',
'pointwise intervals', 'Hall-Werner conf bands'), lty = 1:3)
```

Next, I fit a Cox Proportional Hazards model, which makes use of several of the variables contained in the data set. I don’t pretend to have the best model here, or even a very good one. My intention is just to show the survival package’s `coxph()`

function, and show how the covariates enter the formula. Note, however, that this model does achieve an R2 value of 0.8, and that three variables - `ta`

, `tc`

, and `dc1`

- show up as significant.

```
# Fit Cox Model
form <- formula(y_bmt ~ group + ta + tc + tp + dc + dp +
z1 + z2 + z3 + z4 + z5 + z6 + z7 + z8 + z10)
cox_bmt <- coxph(form,data = bmt2)
summary(cox_bmt)
```

```
## Call:
## coxph(formula = form, data = bmt2)
##
## n= 137, number of events= 81
##
## coef exp(coef) se(coef) z Pr(>|z|)
## group2 0.2540084 1.2891826 0.4342118 0.585 0.5586
## group3 0.5280287 1.6955865 0.4197516 1.258 0.2084
## ta -0.0017962 0.9982054 0.0003736 -4.808 1.52e-06 ***
## tc -0.0060005 0.9940175 0.0010512 -5.708 1.14e-08 ***
## tp -0.0005709 0.9994293 0.0012622 -0.452 0.6510
## dc1 -3.5832062 0.0277865 0.4938320 -7.256 3.99e-13 ***
## dp1 -1.2266468 0.2932744 0.4832251 -2.538 0.0111 *
## z1 0.0109077 1.0109674 0.0230293 0.474 0.6358
## z2 -0.0155304 0.9845895 0.0200202 -0.776 0.4379
## z31 0.0162146 1.0163468 0.2480277 0.065 0.9479
## z41 0.0334753 1.0340419 0.2722755 0.123 0.9021
## z51 0.0792485 1.0824733 0.2856339 0.277 0.7814
## z61 -0.0702048 0.9322029 0.2557463 -0.275 0.7837
## z7 0.0000440 1.0000440 0.0004235 0.104 0.9173
## z81 0.4618598 1.5870228 0.3386689 1.364 0.1726
## z101 0.5145619 1.6729055 0.3330479 1.545 0.1223
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## group2 1.28918 0.7757 0.55044 3.01937
## group3 1.69559 0.5898 0.74478 3.86023
## ta 0.99821 1.0018 0.99747 0.99894
## tc 0.99402 1.0060 0.99197 0.99607
## tp 0.99943 1.0006 0.99696 1.00190
## dc1 0.02779 35.9887 0.01056 0.07314
## dp1 0.29327 3.4098 0.11375 0.75613
## z1 1.01097 0.9892 0.96635 1.05764
## z2 0.98459 1.0157 0.94670 1.02399
## z31 1.01635 0.9839 0.62506 1.65258
## z41 1.03404 0.9671 0.60642 1.76319
## z51 1.08247 0.9238 0.61842 1.89474
## z61 0.93220 1.0727 0.56470 1.53887
## z7 1.00004 1.0000 0.99921 1.00087
## z81 1.58702 0.6301 0.81716 3.08218
## z101 1.67291 0.5978 0.87092 3.21338
##
## Concordance= 0.926 (se = 0.034 )
## Rsquare= 0.806 (max possible= 0.995 )
## Likelihood ratio test= 224.6 on 16 df, p=0
## Wald test = 107.5 on 16 df, p=1.332e-15
## Score (logrank) test = 211.7 on 16 df, p=0
```

```
cox_fit_bmt <- survfit(cox_bmt)
# plot(cox_fit_bmt)
```

To give some idea of the scope of R’s capabilities to work with time to event data, I use the `ranger()`

function to fit a Random Forests Ensemble model to the data. Note that `ranger()`

builds a model for each observation in the data set. The next block of code builds the model using the same variables used the the Cox model above, and each of the 137 survival curves computed for the bmt data set, along with a curve of average values.

```
# ranger model
# ranger model
r_fit_bmt <- ranger(form,
data = bmt2,
importance = "permutation",
seed = 1234)
# Average the survival models
death_times <- r_fit_bmt$unique.death.times
surv_prob <- data.frame(r_fit_bmt$survival)
avg_prob <- sapply(surv_prob,mean)
# Plot the survival models for each patient
plot(r_fit_bmt$unique.death.times,r_fit_bmt$survival[1,], type = "l",
ylim = c(0,1),
col = "red",
xlab = "death times",
ylab = "survival",
main = "Patient Survival Curves")
for(n in c(2:137)){
lines(r_fit_bmt$unique.death.times, r_fit_bmt$survival[n,], type = "l", col = "red")
}
lines(death_times, avg_prob, lwd = 2)
legend(100, 0.2, legend = c('Averages - black'))
```

Here, we show the ranking of variable importance computed by the permutation method, which is `ranger()`

’s default for survival data. Note that `ta`

, `tc`

, and `dc`

are the same top three variables flagged in the Cox model.

Also listed is a measure of prediction error calculated from Harrell’s c-index. This index is defined as “… the proportion of all usable patient pairs in which the predictions and outcomes are concordant” (cf. [8] p 370), where predictions for pairs are concordant if predicted survival times are larger for patients who lived longer. Note that Harrell’s c-index may be thought of as a generalization of finding the are under an ROC curve. (For binary outcomes Harrell’s c-index reduces to the Wilcoxon-Mann-Whitney statistic which, in turn, is equivalent to computing the area under the ROC curve.)

```
vi <- data.frame(sort(round(r_fit_bmt$variable.importance, 4), decreasing = TRUE))
names(vi) <- "importance"
head(vi)
```

```
## importance
## ta 0.1259
## tc 0.0688
## dc 0.0190
## tp 0.0117
## dp 0.0092
## z2 0.0046
```

`cat("Prediction Error = 1 - Harrell's c-index = ", r_fit_bmt$prediction.error)`

`## Prediction Error = 1 - Harrell's c-index = 0.09304771`

Finally, we plot the survival curves computed for all three models on the same graph. Note that the “ad hoc” curve of average survival curves computed by the ranger model tracks the Kaplan-Meier curve fairly well.

```
# Set up for ggplot
km <- rep("KM", length(fit1_bmt$time))
km_df <- data.frame(fit1_bmt$time,fit1_bmt$surv,km)
names(km_df) <- c("Time","Surv","Model")
cox <- rep("Cox",length(cox_fit_bmt$time))
cox_df <- data.frame(cox_fit_bmt$time,cox_fit_bmt$surv,cox)
names(cox_df) <- c("Time","Surv","Model")
rf <- rep("RF",length(r_fit_bmt$unique.death.times))
rf_df <- data.frame(r_fit_bmt$unique.death.times,avg_prob,rf)
names(rf_df) <- c("Time","Surv","Model")
plot_df <- rbind(km_df,cox_df,rf_df)
p <- ggplot(plot_df, aes(x = Time, y = Surv, color = Model))
p + geom_line() + ggtitle("Comparison of Survival Curves")
```

For a very nice exposition of the sort of predictive survival analysis modeling that can be done with `ranger`

, be sure to have a look at Manuel Amunategui’s post and video.

This four-package excursion only hints at the Survival Analysis tools that are available in R, but it does illustrate some of the richness of the R platform which has been under continuous development and improvement for nearly twenty years. The use of the `Surv()`

function shows how open source code allows generations of developers to build on the work of their predecessors. The `ranger`

packages provides a practical example of how R can incorporate fast C++ code and adapt to the world of machine learning applications, and the incidental use of options such as Hall-Wellner Confidence bands and Harrell’s c-index gives some idea of the statistical depth that underlies almost everything R.

For convenience, I have collected the references used throughout the post here.

[1] Hacking, Ian. (2006) The Emergence of Probability: A Philosophical Study of Early Ideas about Probability Induction and Statistical Inference. Cambridge University Press, 2nd ed. p11

[2] Andersen, P.K., Keiding, N. (1998) Survival analysis (overview) Encyclopedia of Biostatistics 6. Wiley, p 4452-4461

[3] Kaplan, E.L. & Meier, P. (1958). Non-parametric estimation from incomplete observations, J American Stats Assn. 53, 457–481, 562–563.

[4] Cox, D.R. (1972). Regression models and life-tables (with discussion), Journal of the Royal Statistical Society (B) 34, 187–220.

[5] Diez, David. Survival Analysis in R. OpenIntro [6] Klein, John P and Moeschberger, Melvin L. (1997) Survival Analysis Techniques for Censored and Truncated Data, Springer.

[7] Wright, Marvin & Ziegler, Andreas. (2017) * [ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R] (https://www.jstatsoft.org/article/view/v077i01)*, JSS Vol 77, Issue 1.

[8] Harrell, Frank, Lee, Kerry & Mark, Daniel. (1996) Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. Statistics in Medicine, Vol 15, 361-387

[9] Amunategui, Manuel. Survival Ensembles: Survival Plus Classification for Improved Time-Based Predictions in R

(This article was first published on ** R Views**, and kindly contributed to R-bloggers)

With roots dating back to at least 1662 when John Graunt, a London merchant, published an extensive set of inferences based on mortality records, Survival Analysis is one of the oldest subfields of Statistics [1]. Basic life-table methods, including techniques for dealing with censored data, were known before 1700 [2]. In the early eighteenth century, the old masters, de Moivre working on annuities and Daniel Bernoulli studying competing risks for his work on smallpox inoculation, developed the foundations of time-to-event modeling [2]. Today, survival analysis models are important in Engineering, Insurance, Marketing and Medicine and many more application areas. So, it is not surprising that the R Task View on Survival Analysis, a curated, organized and annotated list of relevant R packages and functions, is formidable.

Looking at the Task View on a small screen is a bit like standing too close to a brick wall – left-right, up-down, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As well-organized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. (I would be remiss not to mention that we all owe a great deal of gratitude to Arthur Allignol and Aurielien Latouche, the task view maintainers.) Newcomers, people either new to R or new to survival analysis or both, must find it overwhelming. So, it is with newcomers in mind that I offer the following slim trajectory through the task view that relies on just a few packages: survival, KMsurv, Oisurv and ranger

The survival package, which began life as an S package in the late ’90s, is the cornerstone of the entire R Survival Analysis edifice. Not only is the package itself rich in features, but the object created by the `Surv()`

function, which contains failure time and censoring information, is the basic survival analysis data structure in R.

KMsurv contains some interesting data sets from John Klein and Melvin Moeschberger’s classic text, *Survival Analysis Techniques for Censored and Truncated Data*.

My main reason for selecting the OIsurv package was to draw attention the very helpful guide to Survival Analysis in R, produced by the folks at OpenIntro.

ranger might be the surprise in my very short list of survival packages. The `ranger()`

function is well-known for being a fast implementation of the Random Forests algorithm for building ensembles of classification and regression trees. But it was news to me that `ranger()`

also builds survival models. Benchmarks indicate that `ranger()`

is suitable for building time-to-event models with the large, high dimensional data sets important to internet marketing applications. Since `ranger()`

uses standard `Surv()`

survival objects, it’s an ideal tool for getting acquainted with survival analysis in the in this machine learning age.

*** Load and transform the data This first block of code loads the required packages along with the bone marrow transplant data frame from the KMsurv package. I chose this because has number of covariates and no missing values. The only data preparation is to make the appropriate variables into factors.

```
library(survival)
library(dplyr)
library(OIsurv) # Aumatically loads KMsurv
library(ranger)
library(ggplot2)
#------------
data(bmt)
# sapply(bmt,class)
# Prepare new data frame for modeling
bmt2 <- select(bmt,-c(t2,d2,d3))
bmt2 <- mutate(bmt2,
group = as.factor(group),
d1 = as.factor(d1),
da = as.factor(da),
dc = as.factor(dc),
dp = as.factor(dp),
z3 = as.factor(z3),
z4 = as.factor(z4),
z5 = as.factor(z5),
z6 = as.factor(z6),
z8 = as.factor(z8),
z9 = as.factor(z9),
z10 = as.factor(z10)
)
head(bmt2)
```

```
## group t1 d1 ta da tc dc tp dp z1 z2 z3 z4 z5 z6 z7 z8 z9 z10
## 1 1 2081 0 67 1 121 1 13 1 26 33 1 0 1 1 98 0 1 0
## 2 1 1602 0 1602 0 139 1 18 1 21 37 1 1 0 0 1720 0 1 0
## 3 1 1496 0 1496 0 307 1 12 1 26 35 1 1 1 0 127 0 1 0
## 4 1 1462 0 70 1 95 1 13 1 17 21 0 1 0 0 168 0 1 0
## 5 1 1433 0 1433 0 236 1 12 1 32 36 1 1 1 1 93 0 1 0
## 6 1 1377 0 1377 0 123 1 12 1 22 31 1 1 1 1 2187 0 1 0
```

The first thing to do is to use `Surv()`

to build the standard survival object. The variable `t1`

records the time to death or the censored time; `d1`

indicates that the patient died (`d1 = 1`

) or that the patient survived until the end of the study (`d1 = 0`

). Note that a “+” after the time in the print out of `y_bmt`

indicates censoring. The formula `y_bmt ~ 1`

instructs the `survfit()`

function to fit a model with intercept only, and produces the Kaplan-Meier estimate.

The `confBands()`

function from the OIsurv package estimates Hall-Wellner confidence bands. These are large-sample, simultaneous estimates and are plotted in red. They contrast with the point-wise confidence bands rendered as black dashed lines.

```
# Kaplan Meier Survival Curve
y_bmt <- Surv(bmt$t1, bmt$d1)
y_bmt
```

```
## [1] 2081+ 1602+ 1496+ 1462+ 1433+ 1377+ 1330+ 996+ 226+ 1199+ 1111+
## [12] 530+ 1182+ 1167+ 418 417 276 156 781 172 487 716
## [23] 194 371 526 122 1279 110 243 86 466 262 162
## [34] 262 1 107 269 350 2569+ 2506+ 2409+ 2218+ 1857+ 1829+
## [45] 1562+ 1470+ 1363+ 1030+ 860+ 1258+ 2246+ 1870+ 1799+ 1709+ 1674+
## [56] 1568+ 1527+ 1324+ 957+ 932+ 847+ 848+ 1850+ 1843+ 1535+ 1447+
## [67] 1384+ 414 2204 1063 481 105 641 390 288 522 79
## [78] 1156 583 48 431 1074 393 10 53 80 35 1499+
## [89] 704 653 222 1356+ 2640+ 2430+ 2252+ 2140+ 2133+ 1238+ 1631+
## [100] 2024+ 1345+ 1136+ 845+ 491 162 1298 121 2 62 265
## [111] 547 341 318 195 469 93 515 183 105 128 164
## [122] 129 122 80 677 73 168 74 16 248 732 105
## [133] 392 63 97 153 363
```

```
fit1_bmt <- survfit(y_bmt ~ 1)
summary(fit1_bmt)
```

```
## Call: survfit(formula = y_bmt ~ 1)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1 137 1 0.993 0.00727 0.979 1.000
## 2 136 1 0.985 0.01025 0.966 1.000
## 10 135 1 0.978 0.01250 0.954 1.000
## 16 134 1 0.971 0.01438 0.943 0.999
## 35 133 1 0.964 0.01602 0.933 0.995
## 48 132 1 0.956 0.01748 0.923 0.991
## 53 131 1 0.949 0.01881 0.913 0.987
## 62 130 1 0.942 0.02003 0.903 0.982
## 63 129 1 0.934 0.02117 0.894 0.977
## 73 128 1 0.927 0.02222 0.884 0.972
## 74 127 1 0.920 0.02322 0.875 0.966
## 79 126 1 0.912 0.02415 0.866 0.961
## 80 125 2 0.898 0.02588 0.848 0.950
## 86 123 1 0.891 0.02668 0.840 0.944
## 93 122 1 0.883 0.02744 0.831 0.939
## 97 121 1 0.876 0.02817 0.822 0.933
## 105 120 3 0.854 0.03017 0.797 0.915
## 107 117 1 0.847 0.03078 0.788 0.909
## 110 116 1 0.839 0.03137 0.780 0.903
## 121 115 1 0.832 0.03193 0.772 0.897
## 122 114 2 0.818 0.03300 0.755 0.885
## 128 112 1 0.810 0.03350 0.747 0.879
## 129 111 1 0.803 0.03399 0.739 0.872
## 153 110 1 0.796 0.03445 0.731 0.866
## 156 109 1 0.788 0.03490 0.723 0.860
## 162 108 2 0.774 0.03575 0.707 0.847
## 164 106 1 0.766 0.03615 0.699 0.841
## 168 105 1 0.759 0.03653 0.691 0.834
## 172 104 1 0.752 0.03690 0.683 0.828
## 183 103 1 0.745 0.03726 0.675 0.821
## 194 102 1 0.737 0.03760 0.667 0.815
## 195 101 1 0.730 0.03793 0.659 0.808
## 222 100 1 0.723 0.03825 0.651 0.802
## 243 98 1 0.715 0.03856 0.644 0.795
## 248 97 1 0.708 0.03886 0.636 0.788
## 262 96 2 0.693 0.03943 0.620 0.775
## 265 94 1 0.686 0.03969 0.612 0.768
## 269 93 1 0.678 0.03995 0.604 0.761
## 276 92 1 0.671 0.04019 0.597 0.755
## 288 91 1 0.664 0.04042 0.589 0.748
## 318 90 1 0.656 0.04063 0.581 0.741
## 341 89 1 0.649 0.04084 0.574 0.734
## 350 88 1 0.642 0.04104 0.566 0.727
## 363 87 1 0.634 0.04122 0.558 0.720
## 371 86 1 0.627 0.04140 0.551 0.713
## 390 85 1 0.619 0.04156 0.543 0.706
## 392 84 1 0.612 0.04172 0.535 0.699
## 393 83 1 0.605 0.04186 0.528 0.693
## 414 82 1 0.597 0.04199 0.520 0.686
## 417 81 1 0.590 0.04212 0.513 0.679
## 418 80 1 0.583 0.04223 0.505 0.671
## 431 79 1 0.575 0.04234 0.498 0.664
## 466 78 1 0.568 0.04243 0.490 0.657
## 469 77 1 0.560 0.04252 0.483 0.650
## 481 76 1 0.553 0.04259 0.476 0.643
## 487 75 1 0.546 0.04266 0.468 0.636
## 491 74 1 0.538 0.04271 0.461 0.629
## 515 73 1 0.531 0.04276 0.453 0.622
## 522 72 1 0.524 0.04280 0.446 0.615
## 526 71 1 0.516 0.04282 0.439 0.607
## 547 69 1 0.509 0.04285 0.431 0.600
## 583 68 1 0.501 0.04287 0.424 0.593
## 641 67 1 0.494 0.04288 0.416 0.585
## 653 66 1 0.486 0.04288 0.409 0.578
## 677 65 1 0.479 0.04286 0.402 0.571
## 704 64 1 0.471 0.04284 0.394 0.563
## 716 63 1 0.464 0.04281 0.387 0.556
## 732 62 1 0.456 0.04277 0.380 0.548
## 781 61 1 0.449 0.04272 0.372 0.541
## 1063 52 1 0.440 0.04276 0.364 0.533
## 1074 51 1 0.432 0.04278 0.355 0.524
## 1156 48 1 0.423 0.04282 0.346 0.515
## 1279 42 1 0.413 0.04297 0.336 0.506
## 1298 41 1 0.402 0.04308 0.326 0.496
## 2204 9 1 0.358 0.05696 0.262 0.489
```

```
cb <- confBands(y_bmt, type = "hall")
plot(fit1_bmt,
main = 'Kaplan Meyer Plot with confidence bands')
lines(cb, col = "red",lty = 3)
legend(1000, 0.99, legend = c('K-M survival estimate',
'pointwise intervals', 'Hall-Werner conf bands'), lty = 1:3)
```

Next, I fit a Cox Proportional Hazards model, which makes use of several of the variables contained in the data set. I don’t pretend to have the best model here, or even a very good one. My intention is just to show the survival package’s `coxph()`

function, and show how the covariates enter the formula. Note, however, that this model does achieve an R^{2} value of 0.8, and that three variables – `ta`

, `tc`

, and `dc1`

– show up as significant.

```
# Fit Cox Model
form <- formula(y_bmt ~ group + ta + tc + tp + dc + dp +
z1 + z2 + z3 + z4 + z5 + z6 + z7 + z8 + z10)
cox_bmt <- coxph(form,data = bmt2)
summary(cox_bmt)
```

```
## Call:
## coxph(formula = form, data = bmt2)
##
## n= 137, number of events= 81
##
## coef exp(coef) se(coef) z Pr(>|z|)
## group2 0.2540084 1.2891826 0.4342118 0.585 0.5586
## group3 0.5280287 1.6955865 0.4197516 1.258 0.2084
## ta -0.0017962 0.9982054 0.0003736 -4.808 1.52e-06 ***
## tc -0.0060005 0.9940175 0.0010512 -5.708 1.14e-08 ***
## tp -0.0005709 0.9994293 0.0012622 -0.452 0.6510
## dc1 -3.5832062 0.0277865 0.4938320 -7.256 3.99e-13 ***
## dp1 -1.2266468 0.2932744 0.4832251 -2.538 0.0111 *
## z1 0.0109077 1.0109674 0.0230293 0.474 0.6358
## z2 -0.0155304 0.9845895 0.0200202 -0.776 0.4379
## z31 0.0162146 1.0163468 0.2480277 0.065 0.9479
## z41 0.0334753 1.0340419 0.2722755 0.123 0.9021
## z51 0.0792485 1.0824733 0.2856339 0.277 0.7814
## z61 -0.0702048 0.9322029 0.2557463 -0.275 0.7837
## z7 0.0000440 1.0000440 0.0004235 0.104 0.9173
## z81 0.4618598 1.5870228 0.3386689 1.364 0.1726
## z101 0.5145619 1.6729055 0.3330479 1.545 0.1223
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## group2 1.28918 0.7757 0.55044 3.01937
## group3 1.69559 0.5898 0.74478 3.86023
## ta 0.99821 1.0018 0.99747 0.99894
## tc 0.99402 1.0060 0.99197 0.99607
## tp 0.99943 1.0006 0.99696 1.00190
## dc1 0.02779 35.9887 0.01056 0.07314
## dp1 0.29327 3.4098 0.11375 0.75613
## z1 1.01097 0.9892 0.96635 1.05764
## z2 0.98459 1.0157 0.94670 1.02399
## z31 1.01635 0.9839 0.62506 1.65258
## z41 1.03404 0.9671 0.60642 1.76319
## z51 1.08247 0.9238 0.61842 1.89474
## z61 0.93220 1.0727 0.56470 1.53887
## z7 1.00004 1.0000 0.99921 1.00087
## z81 1.58702 0.6301 0.81716 3.08218
## z101 1.67291 0.5978 0.87092 3.21338
##
## Concordance= 0.926 (se = 0.034 )
## Rsquare= 0.806 (max possible= 0.995 )
## Likelihood ratio test= 224.6 on 16 df, p=0
## Wald test = 107.5 on 16 df, p=1.332e-15
## Score (logrank) test = 211.7 on 16 df, p=0
```

```
cox_fit_bmt <- survfit(cox_bmt)
# plot(cox_fit_bmt)
```

To give some idea of the scope of R’s capabilities to work with time to event data, I use the `ranger()`

function to fit a Random Forests Ensemble model to the data. Note that `ranger()`

builds a model for each observation in the data set. The next block of code builds the model using the same variables used the the Cox model above, and each of the 137 survival curves computed for the bmt data set, along with a curve of average values.

```
# ranger model
# ranger model
r_fit_bmt <- ranger(form,
data = bmt2,
importance = "permutation",
seed = 1234)
# Average the survival models
death_times <- r_fit_bmt$unique.death.times
surv_prob <- data.frame(r_fit_bmt$survival)
avg_prob <- sapply(surv_prob,mean)
# Plot the survival models for each patient
plot(r_fit_bmt$unique.death.times,r_fit_bmt$survival[1,], type = "l",
ylim = c(0,1),
col = "red",
xlab = "death times",
ylab = "survival",
main = "Patient Survival Curves")
for(n in c(2:137)){
lines(r_fit_bmt$unique.death.times, r_fit_bmt$survival[n,], type = "l", col = "red")
}
lines(death_times, avg_prob, lwd = 2)
legend(100, 0.2, legend = c('Averages - black'))
```

Here, we show the ranking of variable importance computed by the permutation method, which is `ranger()`

’s default for survival data. Note that `ta`

, `tc`

, and `dc`

are the same top three variables flagged in the Cox model.

Also listed is a measure of prediction error calculated from Harrell’s c-index. This index is defined as “… the proportion of all usable patient pairs in which the predictions and outcomes are concordant” (cf. [8] p 370), where predictions for pairs are concordant if predicted survival times are larger for patients who lived longer. Note that Harrell’s c-index may be thought of as a generalization of finding the are under an ROC curve. (For binary outcomes Harrell’s c-index reduces to the Wilcoxon-Mann-Whitney statistic which, in turn, is equivalent to computing the area under the ROC curve.)

```
vi <- data.frame(sort(round(r_fit_bmt$variable.importance, 4), decreasing = TRUE))
names(vi) <- "importance"
head(vi)
```

```
## importance
## ta 0.1259
## tc 0.0688
## dc 0.0190
## tp 0.0117
## dp 0.0092
## z2 0.0046
```

`cat("Prediction Error = 1 - Harrell's c-index = ", r_fit_bmt$prediction.error)`

`## Prediction Error = 1 - Harrell's c-index = 0.09304771`

Finally, we plot the survival curves computed for all three models on the same graph. Note that the “ad hoc” curve of average survival curves computed by the ranger model tracks the Kaplan-Meier curve fairly well.

```
# Set up for ggplot
km <- rep("KM", length(fit1_bmt$time))
km_df <- data.frame(fit1_bmt$time,fit1_bmt$surv,km)
names(km_df) <- c("Time","Surv","Model")
cox <- rep("Cox",length(cox_fit_bmt$time))
cox_df <- data.frame(cox_fit_bmt$time,cox_fit_bmt$surv,cox)
names(cox_df) <- c("Time","Surv","Model")
rf <- rep("RF",length(r_fit_bmt$unique.death.times))
rf_df <- data.frame(r_fit_bmt$unique.death.times,avg_prob,rf)
names(rf_df) <- c("Time","Surv","Model")
plot_df <- rbind(km_df,cox_df,rf_df)
p <- ggplot(plot_df, aes(x = Time, y = Surv, color = Model))
p + geom_line() + ggtitle("Comparison of Survival Curves")
```

For a very nice exposition of the sort of predictive survival analysis modeling that can be done with `ranger`

, be sure to have a look at Manuel Amunategui’s post and video.

This four-package excursion only hints at the Survival Analysis tools that are available in R, but it does illustrate some of the richness of the R platform which has been under continuous development and improvement for nearly twenty years. The use of the `Surv()`

function shows how open source code allows generations of developers to build on the work of their predecessors. The `ranger`

packages provides a practical example of how R can incorporate fast C++ code and adapt to the world of machine learning applications, and the incidental use of options such as Hall-Wellner Confidence bands and Harrell’s c-index gives some idea of the statistical depth that underlies almost everything R.

For convenience, I have collected the references used throughout the post here.

[1] Hacking, Ian. (2006) *The Emergence of Probability: A Philosophical Study of Early Ideas about Probability Induction and Statistical Inference*. Cambridge University Press, 2nd ed. p11

[2] Andersen, P.K., Keiding, N. (1998) *Survival analysis (overview)* Encyclopedia of Biostatistics 6. Wiley, p 4452-4461

[3] Kaplan, E.L. & Meier, P. (1958). *Non-parametric estimation from incomplete observations*, J American Stats Assn. 53, 457–481, 562–563.

[4] Cox, D.R. (1972). *Regression models and life-tables* (with discussion), Journal of the Royal Statistical Society (B) 34, 187–220.

[5] Diez, David. *Survival Analysis in R*. OpenIntro [6] Klein, John P and Moeschberger, Melvin L. (1997) *Survival Analysis Techniques for Censored and Truncated Data*, Springer.

[7] Wright, Marvin & Ziegler, Andreas. (2017) * [ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R] (https://www.jstatsoft.org/article/view/v077i01)*, JSS Vol 77, Issue 1.

[8] Harrell, Frank, Lee, Kerry & Mark, Daniel. (1996) *Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors*. Statistics in Medicine, Vol 15, 361-387

[9] Amunategui, Manuel. *Survival Ensembles: Survival Plus Classification for Improved Time-Based Predictions in R*

To **leave a comment** for the author, please follow the link and comment on their blog: ** R Views**.

R-bloggers.com offers

(This article was first published on ** Revolutions**, and kindly contributed to R-bloggers)

The knitr package by Yihui Xie is a wonderful tool for reproducible data science. I especially like using it with R Markdown documents, where with some simple markup in an easy-to-read document I can easily combine R code and narrative text to generate an attractive document with words, tables and pictures in HTML, PDF or Word format. Say, something like this:

In that document, the numerical weather records and the chart were generated by R, combined into a document using R Markdown, and then generated as a word file with `knitr`

. (You can find the R Markdown file to generate that report, and the R script to download the data, in my weather-report repository.)

Another useful tool for reproducible data science is the checkpoint package. It helps you manage the ever-changing ecosystem of R packages on CRAN, by making it easy to "lock in" specific versions of R packages. With a single call to the checkpoint function — say `checkpoint("2017-04-25")`

, for April 25, 2017 — you can automatically find all the packages used by your current R project (i.e. the current folder) and install them *as they used to be* on the specified date. A colleague or collaborator can use the same script to get the same versions too, and so be confident of reproducing your results without having to worry a newer package version may have affected the results. By the way, those package versions get installed in a special folder (`.checkpoint`

, in your home directory), so they won't change the results of any other R projects, either.

RStudio includes a very useful tool for working with R Markdown and `knitr`

: you can press the "Knit" toolbar button to process the document with a single click. For that to work, it does require certain R packages to be available for use behind the scenes. In normal circumstances RStudio will offer to install them, but the process doesn't work when a checkpoint folder is active. A simple workaround is to include a file in the same folder (I call mine `knitr-packages.R`

) with the following lines:

library("formatR") library("htmltools") library("caTools") library("bitops") library("base64enc") library("rprojroot") library("rmarkdown") library("evaluate") library("stringi")

Although you never run that file directly, the checkpoint process will discover it and ensure the necessary packages are installed for RStudio to perform its magic. (In my tests this works with recent versions of RStudio including the latest, 1.0.143). All you need to do is make sure you run `checkpoint`

from the R command line (just press Control-ENTER on the corresponding line in the `.Rmd`

file) before attempting to knit. Simple!

To **leave a comment** for the author, please follow the link and comment on their blog: ** Revolutions**.

R-bloggers.com offers

(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

Uncertainty is the biggest enemy of a profitable business. That is especially true of small business who don’t have enough resources to survive an unexpected diminution of revenue or to capitalize on a sudden increase of demand. In this context, it is especially important to be able to predict accurately the change in the markets to be able to make better decision and stay competitive.

This series of posts will teach you how to use data to make sound prediction. In the last set of exercises, we’ve seen how to make predictions on a random walk by isolating the white noise components via differentiation of the term of the time series. But this approach is valid only if the random components of the time series follow a normal distribution of constant mean and variance and if those components are added together in each iteration to create the new observations.

Today, we’ll see some transformations we can apply on the time series make them stationary, especially how to stabilise variance and how to detect and remove seasonality in a time series.

To be able to do theses exercise, you have to have installed the packages `forecast`

and `tseries`

.

Answers to the exercises are available here.

**Exercise 1**

Use the `data()`

function to load the `EuStockMarkets`

dataset from the R library. Then use the `diff()`

function on `EuStockMarkets[,1]`

to isolate the random component and plot them. This differentiation is the most used transformation with time series.

We can see that the mean of the random component of the time series seems to stay constant over time, but the variance seems to get bigger near 1997.

**Exercise 2**

Apply a the `log()`

function on `EuStockMarkets[,1]`

and repeat the step of exercise 1. The logarithmic transformation is often used to stabilise non constant variance.

**Exercise 3**

Use the `adf.test()`

function from the `tseries`

package to test if the time series you obtain in the last exercise is stationary. Use a lag of 1.

**Exercise 4**

Load and plot the `co2`

dataset from the R library `dataset`

. Use the `lowess()`

function to create a trend line and add it to the plot of the time series.

By looking at the last plot, we can see that the time series oscillate from one side to the other of the trend line with a constant period. That characteristic is called seasonally and is often observed in time series. Just think about temperature, which change predictably from season to season.

**Exercise 5**

To eliminate the upward trend in the data use the `diff()`

function and save the resulting time series in a variable called `diff.co2`

. Plot the autocorrelation plot of `diff.co2`

.

**Exercise 6**

This last autocorrelation plot has years for unit which is not really intuitive in our scenario. Make another autocorrelation plot where the x axis has months as units. By looking at this plot, can you tell what is the seasonal period of this time series?

Another way to verify if the time series show seasonnality is to use the `tbats`

function from the forecast package. As his named says, this function fits a tbats model on the time series and return a smooth curve that fit the data. We’ll learn more about that model in a future post.

**Exercise 7**

Use the `tbats`

function on the `co2`

time series and store the result in a variable called `tbats.model`

. If the time series show sign of seasonality, the `$seasonal.periods`

value of `tbats.model`

will store the period value, else the result will be null.

**Exercise 8**

Use the `diff()`

function with the appropriate lag to remove the seasonality of the `co2`

time series, then use it again to remove the trend. Plot the resulting random component and the autocorrelation plot.

**Exercise 9**

Apply the adf test, the kpss test and the Ljung-Box test on the result of the last exercise to make sure that the random component is stationary.

**Exercise 10**

An interesting way to analyse a time series is to use the `decompose()`

function which uses a moving average to estimate the seasonal, random and trend component of a time series. With that in mind, use this function and plot each component of the `co2`

time series.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R-exercises**.

R-bloggers.com offers

K-nearest neighbors is a classification algorithm and is perhaps one of the simplest machine learning algorithms.

The exercise we will develop is: "diagnosing breast cancer with the k-NN algorithm", and we will use the Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml.

We will carry out the exercise verbatim as published in the aforementioned reference.

### libraries

library("class")

library("gmodels")

### import data

wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)

### exoplore data structure

str(wbcd)

### remove the id column that will not be needed

wbcd <- wbcd[-1]

### explore how many records are classified as benign and how many ### as malignant

table(wbcd$diagnosis)

### set the levels of the diagnostic variable

wbcd$diagnosis<- factor(wbcd$diagnosis, levels = c("B", "M"),

labels = c("Benign", "Malignant"))

### explore the percentages of the diagnoses

round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)

### explore some other variables

summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])

### define normalizing function

normalize <- function(x) {

return ((x - min(x)) / (max(x) - min(x)))

}

### normalize the data set

wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))

summary(wbcd_n$area_mean)

### data preparation (training and test sets)

wbcd_train <- wbcd_n[1:469, ]

wbcd_test <- wbcd_n[470:569, ]

wbcd_train_labels <- wbcd[1:469, 1]

wbcd_test_labels <- wbcd[470:569, 1]

### training the model

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,

cl = wbcd_train_labels, k = 21)

### evaluating the model

CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,

prop.chisq=FALSE)

### improving model performance?

### the exercise will be repeated using z-score to see if there is a ### better performance in the accuracy of the algorithm

### Z-score

wbcd_z <- as.data.frame(scale(wbcd[-1]))

summary(wbcd_z$area_mean)

### data preparation (training and test sets)

wbcd_train <- wbcd_z[1:469, ]

wbcd_test <- wbcd_z[470:569, ]

wbcd_train_labels <- wbcd[1:469, 1]

wbcd_test_labels <- wbcd[470:569, 1]

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,

cl = wbcd_train_labels, k = 21)

CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,

prop.chisq = FALSE)

Using z-score no improvement in model accuracy is observed.

You can get the exercise and the data set in:

https://github.com/pakinja/Data-R-Value]]>

(This article was first published on ** Data R Value**, and kindly contributed to R-bloggers)

We will develop a well-known k-NN exercise originally published in *“Machine Learning in R” *by Brett Lantz, PACKT publishing 2015.

K-nearest neighbors is a classification algorithm and is perhaps one of the simplest machine learning algorithms.

The exercise we will develop is: “diagnosing breast cancer with the k-NN algorithm”, and we will use the Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml.

We will carry out the exercise verbatim as published in the aforementioned reference.**### libraries**

library(“class”)

library(“gmodels”)

### import data

wbcd <- read.csv(“wisc_bc_data.csv”, stringsAsFactors = FALSE)

**### exoplore data structure **

str(wbcd)

**### remove the id column that will not be needed**

wbcd <- wbcd[-1]

**### explore how many records are classified as benign and how many ### as malignant**

table(wbcd$diagnosis)

**### ****set the levels of the diagnostic variable**

wbcd$diagnosis<- factor(wbcd$diagnosis, levels = c(“B”, “M”),

labels = c(“Benign”, “Malignant”))**### ****explore the percentages of the diagnoses**

round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)

**### explore some other variables**

summary(wbcd[c(“radius_mean”, “area_mean”, “smoothness_mean”)])

**### define normalizing** **function**

normalize <- function(x) {

return ((x – min(x)) / (max(x) – min(x)))

}** ****### normalize the data set**

wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))

summary(wbcd_n$area_mean)

**### data preparation (training and test sets)**

wbcd_train <- wbcd_n[1:469, ]

wbcd_test <- wbcd_n[470:569, ]

wbcd_train_labels <- wbcd[1:469, 1]

wbcd_test_labels <- wbcd[470:569, 1]

**### training the model**

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,

cl = wbcd_train_labels, k = 21)

**### evaluating the model**

CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,

prop.chisq=FALSE)

**### improving model performance?****###**** the exercise will be repeated using z-score to see if there is a ### better performance in the accuracy of the algorithm **

**### Z-score**

wbcd_z <- as.data.frame(scale(wbcd[-1]))

summary(wbcd_z$area_mean)

**### data preparation (training and test sets)**

wbcd_train <- wbcd_z[1:469, ]

wbcd_test <- wbcd_z[470:569, ]

wbcd_train_labels <- wbcd[1:469, 1]

wbcd_test_labels <- wbcd[470:569, 1]

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,

cl = wbcd_train_labels, k = 21)

CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,

prop.chisq = FALSE)

Using z-score no improvement in model accuracy is observed.

You can get the exercise and the data set in:

https://github.com/pakinja/Data-R-Value

To **leave a comment** for the author, please follow the link and comment on their blog: ** Data R Value**.

R-bloggers.com offers

(This article was first published on ** R – Modern Data**, and kindly contributed to R-bloggers)

I’m super excited to announce that plotly 4.6.0 is now on CRAN! Go ahead and install it with:

install.packages("plotly")

This release brings a ton of new features and improvements — some of which are over in the year making. I recently held a webinar introducing the biggest features, including linked views, animation, and crosstalk integration. In addition to the webinar recording, some good resources for getting up-to-speed include new demos (`demo(package = "plotly")`

), help pages (`help(highlight)`

and `help(animation)`

), and the official NEWS file. I also recommend checking out my vimeo page for some more advanced applications of the linked views framework.

My upcoming plotcon workshop will be a great opportunity to learn all these new features over 2 full days of lectures, demos, hands-on exercises, and one-on-one Q&A.

PS. I can’t wait to see what you do with these new features! And I always love hearing from the community; so tweet me if you run into issues/questions, want to share your work, and/or just want to say hi!

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Modern Data**.

R-bloggers.com offers

(This article was first published on ** Mango Solutions » R Blog**, and kindly contributed to R-bloggers)

We’re delighted to announce our line up for the first San Francisco EARL Conference.

As always, we received a large number of excellent abstracts so the Mango team spent many hours deliberating in order to bring the best program to San Francisco:

Prakhar Mehrotra, Senior Data Science Manager in Uber’s Strategic Finance team has joined the Keynote Speaker ranks.

We’re also looking forward to having representatives from RLadies and the R Consortium take to the podium – we’ll bring you more details on this soon.

In addition to the speakers announced a little while ago, we can now divulge the rest of the lineup:

Aaron Hamming ProAssurance | Dr Aleksander Dietrichson Blackboard Inc. | Bharatch Sankaranarayan Microsoft |

Daniela Saldana Amgen |
David Bishop Hitachi Solutions America Ltd. |
David Croushore Pandora Media Inc. |

David Smith Microsoft |
Dennis Salguero VegasCrimeStats.org |
Dr Salvaggio Salvino Qatar Science & Technology Park |

Eduardo Ariño de la Rubia Domino Data Lab |
Elena Schneider InsightRX |
Emily Bailey Uber |

Filip Stachura Appsilon Data Science |
Dr Gabriel Becker Genentech |
Gabriela de Queiroz SelfScore |

Jorge Miranda County of Los Angeles |
Joseph Rickert RStudio |
Kelly O’Briant B23 LLC |

Kyle Allaire Glass Box Research |
Luke Fostvedt Pfizer |
Madhura Raju Microsoft |

Mike Gahan Metabiota |
Shad Thomas Glass Box Research |
Professor Thomas Vaughan Fred Hutchinson Cancer Research Center |

Tyler Cole Amgen |

To find out what our speakers will talk about, take a look at our speakers list on the EARL website.

**Make the most of earlybird pricing before tickets go up to full price on 28 April for San Francisco. Buy tickets now.**

To **leave a comment** for the author, please follow the link and comment on their blog: ** Mango Solutions » R Blog**.

R-bloggers.com offers