Reproducible Research – when your results can’t be reproduced

(This article was first published on Appsilon Data Science - R language, and kindly contributed to R-bloggers)

Even the most sophisticated machine learning methods, the most beautiful visualisations and perfect datasets can be useless if you do your research carelessly and ignore the context of your code execution. In this article, you will read about a few situations that can destroy your reproducibility, and learn how to resolve and avoid them in the future.

3 danger zones

There are three places where things can go wrong if you don’t pay attention:

  1. R session context
  2. Operating System (OS) context
  3. Data versioning

In this blog post, I will focus on the first two. They are quite similar, because both relate to the software context in which research is executed. While more and more data scientists pay attention to what happens in their R session, not many think about the OS context, which can also significantly influence their research.

Data versioning is a huge topic in itself, and I will cover it in a separate article in the future. For now, let’s look at some examples of what can go wrong when the R session context or the operating system configuration differs.

R session context

The basic elements of an R session context that data scientists should be aware of are:

  • R version
  • Package versions
  • Using set.seed() for reproducible randomisation
  • Floating point accuracy

R version

Not many of us monitor the R language change log on a regular basis, but sometimes crucial changes are made that can break your code. One such situation is described here. A change in R 3.2.1 addressing “nested arima model has higher log-likelihood” led to inconsistent results for arima calculations with d >= 1.
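One way to guard against this is to record, and optionally assert, the R version your research was validated against. A minimal sketch (the file name and version check are illustrative):

```r
# Save the full session description (R version, platform, loaded packages)
# next to your results, so discrepancies can be traced later:
writeLines(capture.output(sessionInfo()), "session_info.txt")

# Or fail fast when the script runs on an R version it was not validated on:
stopifnot(getRversion() == "3.2.1")
```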

Package versions

Now, let’s look at an example with a changing package version. I am going to use dplyr in versions 0.4.0 and 0.5.0:

dplyr 0.4.0

df <- data_frame(x = c(1, 1, 1, 2, 2), y = 1:5)
result <- df %>% dplyr::distinct(x) %>% sum
# result == 8 (distinct() keeps all columns, so the y values are summed too)

dplyr 0.5.0

df <- data_frame(x = c(1, 1, 1, 2, 2), y = 1:5)
result <- df %>% dplyr::distinct(x) %>% sum
# result == 3 (distinct() now keeps only the x column)

Whoa! Version 0.5.0 introduced a breaking change – distinct() now drops the columns it does not deduplicate on, unless you pass .keep_all = TRUE – and the result is completely different.
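A simple defence against such silent behaviour changes is to assert the package version your script was written for. A sketch, assuming the code was validated against dplyr 0.5.0:

```r
library(dplyr)

# Stop early if the installed dplyr is not the version this analysis assumes:
if (packageVersion("dplyr") != "0.5.0") {
  stop("This analysis was written for dplyr 0.5.0, found ",
       packageVersion("dplyr"))
}

df <- data_frame(x = c(1, 1, 1, 2, 2), y = 1:5)

# In 0.5.0, .keep_all = TRUE restores the old behaviour of keeping
# all columns while deduplicating on x:
df %>% distinct(x, .keep_all = TRUE)
```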

set.seed()

When your code uses randomisation and you want reproducible results, it is important to remember to set a seed.

> sample(1:10, 5)
[1] 6 7 2 5 9
> sample(1:10, 5) # Different result
[1] 5 7 8 2 3

…and with setting seed:

> set.seed(42)
> sample(1:10, 5)
[1] 10  9  3  6  4
> set.seed(42)
> sample(1:10, 5) # Result is the same
[1] 10  9  3  6  4

The random numbers returned by the generator are in fact a deterministic sequence that can be reproduced. Setting a seed tells the generator where in that sequence to start.
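Because the generator produces a deterministic stream, resetting the seed reproduces not just a single draw but every subsequent one. A minimal sketch:

```r
set.seed(123)
a1 <- sample(1:100, 3)
a2 <- sample(1:100, 3)  # continues the same stream

set.seed(123)
b1 <- sample(1:100, 3)
b2 <- sample(1:100, 3)

identical(a1, b1) && identical(a2, b2)  # TRUE – the whole stream is reproduced
```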

Floating point accuracy

Pay attention to how expressions on floating point numbers are implemented. Let’s assume you want to implement an algorithm described in a recently published scientific paper. You get unexpected results in some cases and start to debug. Eventually you get to this:

> x <- 3
> y <- 2.9
> x - y == 0.1
[1] FALSE

What? This can be solved by testing for “near equality”:

> x <- 3
> y <- 2.9
> all.equal(x - y, 0.1)
[1] TRUE

More about floating point arithmetic can be found here.
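One subtlety worth knowing: when the values differ, all.equal() returns a character description rather than FALSE, so wrap it in isTRUE() before using it in control flow. A short sketch:

```r
x <- 3
y <- 2.9

# all.equal() returns a string like "Mean relative difference: ..." on
# mismatch, so a bare if (all.equal(...)) would throw an error instead of
# taking the else branch:
if (isTRUE(all.equal(x - y, 0.1))) "near equal" else "different"

# An explicit tolerance check is another option:
abs((x - y) - 0.1) < 1e-8  # TRUE
```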

Operating system context

When your research is shared with others, you can’t predict which operating system someone is going to use, and many things can go wrong. A difference in any of the following elements:

  • System package versions
  • System locale
  • Environment variables

can lead to different execution results.
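It can help to capture these elements from within R and store them alongside your results, so OS-level differences are easy to diagnose later. A minimal sketch (the file name is illustrative):

```r
# Snapshot the OS-level context of this run:
os_context <- list(
  platform = Sys.info()[["sysname"]],
  locale   = Sys.getlocale(),
  env      = Sys.getenv(c("LANG", "LC_ALL", "TZ"))
)
saveRDS(os_context, "os_context.rds")
```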

Many R packages depend on system packages. You can see these dependencies in a package’s DESCRIPTION file under the SystemRequirements field. Let’s look at an example with the png package, run on two different operating systems.

Package png DESCRIPTION file:

Package: png
Version: 0.1-7
...
SystemRequirements: libpng
...

Let’s assume we wrote a custom function savePlotToPNG that saves a plot to a file. We want to write a unit test for it, and the idea is to compare the produced result with a previously generated reference image. We generate reference plots on Ubuntu 16.04:

reference.png.path <- file.path("referencePlots", "ref0001.png")
savePlotToPNG(reference.png.path)

…and then we run the tests on Debian 8 Jessie, using exactly the same version of the png package:

reference.png.path <- file.path("referencePlots", "ref0001.png")
test.plot.path <- tempfile(fileext = ".png")

savePlotToPNG(test.plot.path)

identical(x = png::readPNG(test.plot.path),
          y = png::readPNG(reference.png.path))
# FALSE! The libpng system package version differs between the two systems

…the plots don’t match! This is caused by different versions of the libpng system package.
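When bit-for-bit identity across operating systems is unrealistic, one pragmatic fallback is to compare the decoded pixel data with a tolerance instead of requiring identical files. A sketch – the tolerance value is an assumption you would tune per project:

```r
ref  <- png::readPNG(reference.png.path)
test <- png::readPNG(test.plot.path)

# Dimensions must match exactly, but pixel values (in [0, 1]) are allowed
# to differ by a small amount:
identical(dim(ref), dim(test)) && max(abs(ref - test)) < 0.01
```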

Locale and system variables

Ubuntu:

> x <- c("-110", "-1", "-10", "10", "0", "110", "1")
> sort(x)
[1] "0"    "1"    "-1"   "10"   "-10"  "110"  "-110"

Windows:

> x <- c("-110", "-1", "-10", "10", "0", "110", "1")
> sort(x)
[1] "-1"    "-10"    "-110"   "0"   "1"  "10"  "110"

…okayyy 🙂 String sorting in R is based on the locale, which differs between Windows and Linux systems. Read more about setting locales here.
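If your code depends on string ordering, you can pin the collation locale explicitly, so the result no longer depends on the host OS. A minimal sketch:

```r
x <- c("-110", "-1", "-10", "10", "0", "110", "1")

# Force byte-wise ("C") collation so the sort order is the same everywhere:
old <- Sys.getlocale("LC_COLLATE")
Sys.setlocale("LC_COLLATE", "C")
sort(x)
# [1] "-1"   "-10"  "-110" "0"    "1"    "10"   "110"
Sys.setlocale("LC_COLLATE", old)  # restore the original locale
```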

Tools for R session level reproducibility

Data scientists should be aware of the traps described above; these are the foundations of reproducible research. Below is a list of tools that help you manage package versions:

Packrat – a popular package that lets you create and manage a private library attached to your project.

Switchr – a handy abstraction over .libPaths(), lib.loc and other lower-level library settings. Helpful for creating and managing libraries.

Checkpoint – helpful when you need to use older versions of packages, as they were published on CRAN at a given date.
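As an illustration, a typical checkpoint workflow pins every CRAN package to its state on a chosen snapshot date (the date below is just an example):

```r
library(checkpoint)

# Install and load CRAN packages exactly as they were on this date:
checkpoint("2017-02-01")

library(dplyr)  # now comes from the date-stamped snapshot library
```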

Tools for OS level reproducibility

The best approach is to do your research in an isolated environment that can be shared as an image. Inside this environment you can also use any tool for R session level reproducibility, like Packrat, Switchr or Checkpoint.

To create an isolated environment you can use virtualisation or containerisation. There are many tools for that: Vagrant, Docker, rkt and more. I recommend Docker because it’s easy to use and has good community support.

Example structure of your ~/project directory:

Dockerfile
install_libraries.R
R/ <-- contains your research R scripts

Let’s look at a sample Dockerfile that describes the steps necessary for recreation of an isolated environment:

FROM rocker/rstudio:3.3.2 

MAINTAINER Paweł Przytuła "[email protected]"

RUN echo "en_US.UTF-8 UTF-8" >> /etc/locale.gen \
  && locale-gen en_US.utf8 \
  && /usr/sbin/update-locale LANG=en_US.UTF-8

ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8

# Set the CRAN mirror
RUN echo "local({\n  r <- getOption(\"repos\")\n\
  r[\"CRAN\"] <- \
  \"https://cran.r-project.org\"\n\
  options(repos = r)\n\
  })\n" >> /etc/R/Rprofile.site

COPY install_libraries.R /code/install_libraries.R

RUN R -f /code/install_libraries.R

CMD ["/init"]

install_libraries.R contains the code that reproduces the R environment. It can be either some Packrat code or something like this:

install.packages("devtools")
devtools::install_version("dplyr", version = "0.5.0")
devtools::install_version("stringr", version = "1.1.0")
devtools::install_version("forecast", version = "7.3")

After that, it’s enough to run the following command to build an image:

docker build -t myimagename:1.0.0 .

When the image is ready, you should either save it as a file using docker save -o myimagename.img myimagename:1.0.0 or upload it to Docker Hub (steps described here).

Using a previously saved Docker image, you can start a container with the following command:

docker run -d -p 8787:8787 -v ~/project/R:/R myimagename:1.0.0

Because the image is based on rocker/rstudio, you can access an RStudio session at http://localhost:8787 (login rstudio, password rstudio). Your R/ directory will be available inside the running container under the /R path. Awesome, isn’t it?!

With a Docker image like this, you have control over the whole software context needed for reproducible research. You can easily extend it with more packages and share it without fear of losing reproducibility. I recommend this approach for any data science project whose results must be reproduced on more than one machine, and when working in a bigger data science team.
