How Do You Organise Your R Project? This Is What We Do.

[This article was first published on r-bloggers – Telethon Kids Institute, and kindly contributed to R-bloggers].

The Biometrics group at Telethon Kids Institute uses a standardised template project directory to manage our biostatistical consultation projects. This approach allows us to streamline our workflow, initiate projects, and produce professional-looking reports directly from the statistical analysis platform, minimising the time spent on the non-analytical aspects of our projects. We use the same project structure, successfully, for both simple and large-scale projects.

Although this workflow has been developed in R, the broader principles discussed here are applicable to any scripting language and to non-coding projects alike. This post is based on recent presentations that I gave in July 2019 at useR! 2019 in Toulouse, France (lightning talk) and at the 40th Annual Conference of the International Society for Clinical Biostatistics (ISCB40) in Leuven, Belgium (winner of a best poster award).

Project Directory Structure

We have developed and refined a skeleton template project containing a series of sub-directories with distinct objectives that cover all steps of a project, from the initial import of raw data to reporting. The template is pre-populated with directories and starter files, which has saved our group time in project initiation, preliminary analysis, analysis, and reporting.

Figure 1. Schematic of our template project. Sub-directories are indicated by squares and template files are shown by circles.

This structure is based on the one described by the “ProjectTemplate” R package (http://projecttemplate.net/index.html). We find it attractive because ProjectTemplate automates project initiation, runs data wrangling scripts, and loads packages and data. Without going into excessive detail, we have several directories including:

  • directories to house data at various stages of rawness,
    cleanliness, and wrangling
  • admin directories for project metadata
  • two scripting directories (ProjectTemplate can be configured to automatically run scripts in
    the munge directory; the R directory can be duplicated or renamed for other
    scripting languages, or changed to a generic “src” folder)
  • the vignettes folder for our reports.

This structure is also useful as it allows us to build our projects as R packages that are easily distributed to our collaborators once the analysis is complete. We chose to build our reports in the vignettes directory rather than locate them in a “reports” sub‑directory since we can share the final work as an installable package – enabling us to make a collection of project reports available to our collaborators by browsing the package vignettes. Packaging projects is also a useful way to share the cleaned data as it is available as an included dataset along with any documentation that was created specifically for the analysis.

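# Tell ProjectTemplate where the custom templates live
options(ProjectTemplate.templatedir = "path/to/templates")
# Create a new project from the "biometrics_project" template
ProjectTemplate::create.project(file.path("path/to/projects", "00_project"), template = "biometrics_project")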

Box 1. It only takes 2 lines of code to initiate a new project. A template directory can contain multiple project templates.
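
Once a project exists, ProjectTemplate's load.project() performs the automated loading and munging mentioned earlier. The following is a minimal sketch, assuming the ProjectTemplate package is installed. It uses the package's built-in default template and a throwaway directory created with tempfile() (the demo_project_ prefix is hypothetical), not our biometrics_project template.

```r
library(ProjectTemplate)

# Hypothetical throwaway location; a real project would live under path/to/projects
proj <- tempfile("demo_project_")
create.project(proj)  # no template argument, so the package default is used

# load.project() operates on the current working directory
old_wd <- setwd(proj)
load.project()        # reads config/global.dcf; by default loads data/ and runs munge/ scripts
setwd(old_wd)

list.files(proj)      # inspect the directories the template created
```

What gets loaded and run is controlled by the settings in the project's config/global.dcf file.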

Reproducible Research

It is important that our analysis is traceable from raw data to final report and that all changes made to the analysis throughout the project life cycle are tracked. I suggest that no modifications are made to the raw data once they are received from the researcher (often as an .xlsx or .csv file). The first thing to do once data are received is to prefix the file name with the current date (YYYYMMDD_) and then make the file read-only. Any further cleaning is then performed in-script, where all changes are documented and can be verified and audited. Subsequent changes to the analysis are tracked via GitLab, which we have installed on our secure servers, or, if the data are not sensitive, via services such as GitHub or Bitbucket.
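
The date-prefix and read-only steps can also be scripted. Below is a minimal sketch in base R. The helper name stamp_raw_file() and the example file survey.csv are hypothetical and are not part of our template. Note that on Windows, Sys.chmod() can only toggle the read-only attribute.

```r
# Hypothetical helper: prefix a raw data file with a date (YYYYMMDD_)
# and make it read-only. Returns the new path invisibly.
stamp_raw_file <- function(path, date = Sys.Date()) {
  stamped <- file.path(
    dirname(path),
    paste0(format(date, "%Y%m%d"), "_", basename(path))
  )
  if (!file.rename(path, stamped)) {
    stop("Could not rename ", path)
  }
  # Read-only for owner, group and others
  Sys.chmod(stamped, mode = "0444", use_umask = FALSE)
  invisible(stamped)
}

# Usage on a throwaway file in the session's temporary directory
raw <- file.path(tempdir(), "survey.csv")
writeLines("id,value", raw)
stamped <- stamp_raw_file(raw, date = as.Date("2019-07-01"))
basename(stamped)  # "20190701_survey.csv"
```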

Reporting with R Markdown

R Markdown is an excellent tool that allows analysts to compose the project narrative and the data analysis output in a single document. R Markdown also helps to maintain the quality of a report by keeping the analysis data frames and the report commentary within the R environment. Inserting the analysis outputs directly into the report removes the possibility of transcription errors. R Markdown is also great for updating reports when minor changes have been made to the underlying data.
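
The point about transcription errors can be illustrated with a small sketch. It assumes only base R and the built-in mtcars data (chosen for illustration, not a dataset from our projects). The number quoted in the commentary is pulled from the fitted model rather than typed by hand.

```r
# Fit a simple model on a built-in dataset (illustrative only)
fit <- lm(mpg ~ wt, data = mtcars)

# Pull the estimate from the model object instead of copying it by hand
slope <- unname(coef(fit)["wt"])

# Build the commentary from the computed value; in an .Rmd file the same
# value could be inserted inline with `r round(slope, 2)`
sentence <- sprintf(
  "Each additional 1000 lb of weight was associated with a change of %.2f mpg.",
  slope
)
print(sentence)
```

If the underlying data change, re-knitting the report updates the sentence without any manual editing.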

To streamline our analysis plans and reporting we have developed a series of R Markdown templates that produce documents conforming to the Telethon Kids Institute’s style guide and online branding. These templates produce beautiful stand-alone HTML documents that we distribute to our collaborators. The documents are built on the Bootstrap CSS libraries, which allow for dynamic, responsive pages that can be viewed on a range of devices.

Figure 2. The default reporting template that we have packaged in https://github.com/telethonkids/biometrics; when the package is installed we can create new reports in RStudio via “File > New File > R Markdown… > From Template”. More information about our templates can be found here.

Toolbox

The following table lists some packages that we use to simplify our reporting. These packages are useful as they allow us to focus on the data without wasting time on the non-analytical parts of the project, such as package citations, caption numbering, and tabulating and visualising model output. A brief description of why we use each package is provided; visit the official documentation for further details.

Table 1. Summary of useful packages that we use during our professional biostatistical consultations.

Package                       Description
tidyverse collection          Data wrangling/summaries/visualisation
captioner                     Cross-reference tables/figures/models
kableExtra                    Nicely format data frames for reporting
stargazer                     Create well-formatted regression tables
broom                         Extract a model’s estimates and statistics
repmis                        Create a bibliography of loaded packages
devtools::build_vignettes()   Knit all vignette .Rmd files
jtools::plot_summs()          Visualise effect estimates and CIs for one or more models
GGally::ggpairs()             Look at your data with a pairwise plot matrix
roxygen2                      Document code by generating .Rd files in the man/ directory
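
As a small illustration of how two of these tools fit together, the sketch below extracts model estimates with broom and formats them with knitr::kable(), the function whose output kableExtra styles. It assumes the broom and knitr packages are installed. The model and the built-in mtcars data are illustrative choices and do not come from one of our projects.

```r
library(broom)
library(knitr)

# Illustrative model on a built-in dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)

# One row per term, with estimates, standard errors and confidence intervals
estimates <- tidy(fit, conf.int = TRUE)

# Format the data frame as a report-ready table
kable(estimates, digits = 3, caption = "Model estimates for mpg ~ wt + hp")
```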

The default project directory structure that we have developed can be seen in action as part of the Telethon Kids rstudio GitHub repository; this repo is an implementation of RStudio within a Docker container (see here and here). You can navigate through the repo on GitHub, or clone it to your local machine. The template project is found in the projects/00_next_project sub-directory. Each of the sub-directories in this template contains a README that briefly describes its purpose.

Disclaimer

I doubt there will be anything in this article that can be called “new”, but unless someone has worked in a place with a clearly defined project structure, it is unlikely they have thought about an efficient way to organise their myriad related files (documents/data/scripts).

Conversations that I had with data scientists, statisticians,
and analysts at both the useR! 2019 and ISCB40 conferences indicate that
organisations are becoming increasingly aware of the importance of
well-structured data projects. This workflow has come from reading many
articles and trying out several packages – very few of which I recorded; thus,
unfortunately, I am using other people’s ideas/concepts without proper
acknowledgement. I don’t claim anything in this article as my original work and
if you know of any authoritative sources on this content then please leave a
comment.

There is a plethora of other sub-directories that could be included in a template project. For example, a figures directory for high-resolution, publication-ready images is a worthy inclusion. I am interested in how YOU structure your projects; leave a comment and let me know what tools you use and how you increase throughput to ease your workload.
