Implementation of a basic reproducible data analysis workflow


In a previous post, I described the principles of my basic reproducible data analysis workflow. Today, let’s be more practical and see how to implement it.

Note that this is a basic workflow. The goal is to strike a good balance between a minimal reproducible analysis and the ease of deploying it on any platform.

This workflow lets you run a complete analysis based on multiple files (data files, R scripts, Rmd files…) just by launching a single R script.

Summary

This workflow processes raw files to produce reports (in HTML or PDF). There are three main components:

  1. Software:
    • R, of course.
    • The RStudio IDE, optionally. It can save you installation time because it comes with both Pandoc and rmarkdown built in.
    • If not using the RStudio IDE:
      • the rmarkdown R package, to convert R Markdown files to Markdown, HTML and PDF. Just type install.packages("rmarkdown") in R.
      • a recent version of Pandoc to process the Markdown files (a quick check is shown right after this list).
    • Git for version control.
  2. Files organisation (see below).
  3. One R script to rule them all (see below).
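
If you are not using the RStudio IDE, a quick way to check that rmarkdown can find a working Pandoc installation is the small sketch below (assuming rmarkdown is already installed):

# Quick sanity check (sketch): can rmarkdown find Pandoc?
library(rmarkdown)
if (pandoc_available()) {
  message("Pandoc ", pandoc_version(), " found, ready to render.")
} else {
  stop("No Pandoc found: install it, or use the RStudio IDE which bundles it.")
}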

Files organisation

The most important thing is the organisation of the project. By project, I mean a folder containing every file necessary to run the analysis: raw data, intermediate data, scripts and other assets (pictures, XML files…).

My organisation is:

name_of_the_project
|-- assets
|-- functions
    |-- import_html_helper.R
    |-- make_report.R
|-- plot
|-- produced_data
    |-- imported.rds
    |-- model_result.rds
|-- run_all.R     # *** Most important file ***
|-- raw_data
    |-- FinalDataV3.xlsx
    |-- SomeExternalData.csv
    |-- NomenclatureFromWeb.html
|-- reports
    |-- 01-import_data.html
    |-- 02-data_tidying.html
    |-- 03-descriptive.html
    |-- 04-model1.html
    |-- 99-sysinfo.html
|-- rmds
    |-- import_data.Rmd
    |-- data_tidying.Rmd
    |-- descriptive.Rmd
    |-- model1.Rmd
    |-- sysinfo.Rmd
|-- rscripts
    |-- complicated_model.R
|-- name_of_the_project.Rproj # not mandatory but useful
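
If you want to start a new project with the same skeleton, the empty folders can be created in one go. This is just a convenience sketch, not a required part of the workflow:

# Convenience sketch: create the empty folder skeleton described above
folders <- c("assets", "functions", "plot", "produced_data",
             "raw_data", "reports", "rmds", "rscripts")
invisible(lapply(folders, dir.create, showWarnings = FALSE))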

As the directory tree above shows, a few principles apply:

  • Directory names are as explicit as possible, so that anyone getting these files can understand them.
  • Reports are kept apart from the scripts, because I don't produce a report for every script or Rmd (some are child Rmds), and this way it is obvious where people should look for the results (the reports).
  • Reports are numbered with two digits so they sort in order. Data science is also the art of telling a story in the right order.
  • I use R Markdown files to produce my reports, so that the results and my comments stay together. These comments are fundamental: they are the data scientist's interpretation of the results.
  • An R Markdown file, sysinfo.Rmd, produces a report that keeps track of the names and versions of the R packages used (with sessionInfo()) and some extra information about the OS (Sys.info()). In an ideal workflow, these commands would be called at the end of each report. A possible sketch of this file is shown right after this list.
  • Everything lives in a subfolder, except run_all.R (detailed below).
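
For reference, sysinfo.Rmd can be as short as the following sketch; the chunk names and the exact content are just an example, not a fixed requirement:

```{r packages_versions}
sessionInfo()
```

```{r os_information}
Sys.info()
```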

One R script to run them all

This run_all.R script, as its name explicitly tells, runs everything necessary to produce all the reports. Here is an example:

source("functions/make_reports.R")

report("rmds/import_data.Rmd", n_file = "1")
report("rmds/data_tidying.Rmd", "2")
report("rmds/descriptive.Rmd", "3")
report("rmds/model1.Rmd", "4")

It's straightforward: one line per R Markdown file to process.

make_report.R is optional. It's a helper function that produces the reports and sets their file names:

# Clean up the environment
rm(list = ls())

# Load the libraries
library(knitr)
library(rmarkdown)

# Set the root dir because my rmds live in rmds/ subfolder
opts_knit$set(root.dir = '../.')

# By default, don't open the report at the end of processing
default_open_file <- FALSE

# Main function
report <- function(file, n_file = "", open_file = default_open_file,  
                   report_dir = "reports") {

  ### Set the name of the report file ###
  base_name <- sub(pattern = ".Rmd", replacement = "", x = basename(file))

  # Pad n_file so it always has two digits (skip when no number is given)
  if (n_file != "" && as.integer(n_file) < 10) n_file <- paste0("0", n_file)

  file_name <- paste0(n_file, "-", base_name, ".html")
  
  ### Render ###
  render(
    input = file,
    output_format = html_document(
      toc = TRUE,
      toc_depth = 1,
      code_folding = "hide"
    ),
    output_file = file_name,
    output_dir = report_dir,
    envir = new.env()
    )

  ### Under macOS, open the report file  ###
  ### in firefox at the end of rendering ###
  if (open_file && Sys.info()[1] == "Darwin") {
    result_path <- file.path(report_dir, file_name)
    system(command = paste("firefox", result_path))
  }

}

Usage

The core of this workflow is the R Markdown files. Most of the time, I write .Rmd files, mixing comments about what I'm going to do, code, results and comments about those results.

If there is a heavy computation (e.g. models or simulations), I write an R script and save the results in a .rds file. I often use a remote machine for this kind of computation. Then I source the R script in a non-evaluated Rmd chunk and load the results in an evaluated one:

```{r heavy_computation, eval=FALSE}
source("rscripts/complicated_model.R")
```

```{r load_results}
mod_results <- readRDS("produced_data/model_result.rds")
```
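
For completeness, here is a hypothetical sketch of what rscripts/complicated_model.R could look like; the model itself is only a placeholder, the part that matters for the workflow is the final saveRDS() call:

# Hypothetical sketch of rscripts/complicated_model.R
# The model below is a stand-in for the real, long-running computation.
heavy_model <- lm(mpg ~ ., data = mtcars)

# Save the result where the report chunk above expects to find it
saveRDS(heavy_model, file = "produced_data/model_result.rds")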

Because my .Rmd files don't live in the root directory, I add a setup chunk to all of them. This way, I can process my .Rmd files directly or even use RStudio's notebook feature:

```{r setup, include=FALSE}
knitr::opts_knit$set(root.dir = '../.')
```

When I have to rebuild all the reports (e.g. because some raw data changed), I just run the run_all.R script.
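
Concretely, from an R session whose working directory is the project root (for example after opening the .Rproj file in RStudio), that is just:

# Rebuild every report from scratch
source("run_all.R")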

Benefits in real life

This basic workflow works for me because:

  • It's pragmatic: it fulfils all the constraints and goals I set, without any bells and whistles.
  • It uses mainstream tools. Others can easily use it and these tools are not likely to be deprecated in the next decade.
  • It’s easy to implement and deploy.
  • It's straightforward: people who just want to see the reports know where to look, and those who want to reproduce the analysis just have to run the run_all.R file.

Limits

More layers are needed for perfect data analysis reproducibility (as described in this article). The main weakness of my workflow is that it does not keep a record of the exact versions of the software I use (R packages included). I have already been unable to rerun an old data analysis because the packages had changed (e.g. deprecated functions in dplyr or a package disappearing from CRAN). To improve this, I tried to add packrat to my workflow, but that was a long time ago and the package wasn't stable enough for my day-to-day work. Resolution for 2017: give packrat another try!
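
A minimal sketch of what adding packrat could look like (assuming the packrat package installs cleanly from CRAN):

# Minimal packrat sketch: record and restore package versions per project
install.packages("packrat")   # once
packrat::init()               # turn the project into a packrat project
packrat::snapshot()           # record the exact package versions in use
# Later, on another machine or after copying the project:
packrat::restore()            # reinstall the recorded versions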

Other possible improvements: run the scripts in a Docker container or in a virtual machine. But this adds too much overhead to my daily work. Furthermore, it breaks the platform-agnostic and simplicity principles.

Conclusion

Currently, I'm happy with this workflow. It covers my daily data analysis reproducibility needs with little extra work.

In a later post, I will discuss possible extensions of this basic workflow.
