Improving automatic document production with R


In this post, I describe the latest iteration of my automatic document production with R. It improves on the methods used in Rtraining; previous work on this topic can be read under the auto deploying R documentation tag.

I keep banging on about this area because reproducible research and analytical document pipelines are things I have a keen interest in. I see this as a core part of DataOps: it’s vital for ensuring our models and analyses are correct, and it boosts our productivity.

Even after (or because of) a few years of off-and-on development of the process, Rtraining had a number of issues:

  • build time was long because all presentations and their respective dependencies were required
  • if a single presentation broke, any later presentations would not get generated
  • the presentation build step was in “after_success”, so failures didn’t trigger notifications
  • the build script for the presentations did a lot of stuff I thought could be removed
  • the dynamic index page sucks

This post covers how I’m attempting to fix all bar the last problem (more on that in a later post).

With the problems outlined, let’s look at my new base solution and how it addresses these issues.

Structure

I have built a template that can be used to generate multiple presentations and publish them to a docs/ directory for online hosting by GitHub. I can now use this template to produce category repositories, based on the folders in inst/slidedecks/ in Rtraining, and I can always split them out further at a later date.

The new repo is structured like so:

  • Package infrastructure
    • DESCRIPTION – used primarily for managing dependencies
    • R/ – store any utility functions
    • .Rbuildignore – avoid non-package stuff getting checked
  • Presentations
    • pres/ – directory for storing presentation .Rmd files
    • pres/_output.yml – file with render preferences
  • Output directory
    • docs/ – directory for putting generated slides in
  • Document generation infrastructure
    • .travis.yml – used to generate documents every time we push a commit
    • buildpres.sh – shell script doing the git workflow and calling R
    • buildpres.R – R script that performs the render step

Presentations

  • My Rtraining repo contained all presentations in the inst/slidedecks/ directory, with further categories. This meant that if someone installed Rtraining, they’d get all the decks. I think that was a sub-optimal experience for people, especially because it meant installing so many packages, so I’ll be focusing instead on improving the web delivery.
  • Render requirements are now stored in an _output.yml file instead of being hard-coded into the render step, so that I can add more variants later (see the sketch after this list)
  • I’m currently using a modified version of the revealjs package as I’ve built a heavily customised theme. As I’m not planning on any of these presentation packages ever going on CRAN, I can use the Remotes field in DESCRIPTION to describe the location. This simplifies the code significantly.
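To make those two points concrete, here is a minimal sketch of the files involved. The format options and the lockedata/revealjs repository slug are placeholders standing in for my actual setup, not the real values. rmarkdown automatically applies pres/_output.yml to any document rendered from that directory:

# pres/_output.yml -- option values are illustrative placeholders
revealjs::revealjs_presentation:
  theme: night
  transition: fade
  self_contained: true

And the DESCRIPTION gains a Remotes entry pointing at the GitHub copy of the theme package (excerpt; the slug is hypothetical):

Imports:
    rmarkdown,
    revealjs
Remotes: lockedata/revealjs

With the Remotes field in place, devtools::install_deps() knows to fetch that dependency from GitHub rather than CRAN.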

Document generation

[Figure: Automatic document generation with R]

Travis

I use travis-ci to perform the presentation builds. The instructions I provide to Travis are:

language: r
cache: packages
latex: false
warnings_are_errors: false

install:
  - R -e 'install.packages("devtools")'
  - R -e 'devtools::install_deps(dep = T)'
  - R CMD build --no-build-vignettes --no-manual .
  - R CMD check --no-build-vignettes --no-manual *tar.gz
  - Rscript -e 'devtools::install(pkg = ".")'

before_script:
  - chmod +x ./buildpres.sh

script:
  - ./buildpres.sh


One important thing to note here is that I pass the --no-build-vignettes and --no-manual arguments to the package build and check steps, along with setting latex: false, to drastically reduce the build time, as I don’t normally need to produce PDFs.

The install section is the prep work, and then the script section does the important bit. Now if there are errors, I’ll get notified!

Bash

The script that gets executed in my Travis build:

  • changes over to a GITHUB_PAT-based connection to the repo to facilitate pushing changes, and does some other config
  • executes the R render step
  • puts the R execution log in docs/ for debugging
  • commits all the changes using the important prefix [ci skip] so we don’t get infinite build loops
  • pushes the changes
#!/bin/bash
AUTHORNAME="Steph"
AUTHOREMAIL="[email protected]"
GITURL="https://$GITHUB_PAT@github.com/$TRAVIS_REPO_SLUG.git"

git remote set-url origin $GITURL
git pull
git checkout master
git config --global user.name $AUTHORNAME
git config --global user.email $AUTHOREMAIL

R CMD BATCH './buildpres.R'

cp buildpres.Rout docs/

git add docs/
git commit -am "[ci skip] Documents produced in clean environment via Travis $TRAVIS_BUILD_NUMBER"
git push -u --quiet origin master
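One prerequisite worth flagging: GITHUB_PAT isn’t defined anywhere in these scripts, so the personal access token has to be supplied as a secure environment variable in the Travis build, either via the repository settings on travis-ci.org or with the travis CLI (the token value here is a placeholder):

travis encrypt GITHUB_PAT=<your-token> --add env.global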


R

The R step is now very minimal: it works out which presentations to generate, then loops through them and builds each one according to the options specified in _output.yml.

library(rmarkdown)

# find every .Rmd in pres/ and render it into docs/,
# picking up the options in pres/_output.yml
slides <- list.files("pres", pattern = "\\.Rmd$", full.names = TRUE)

for (f in slides) render(f, output_dir = "docs")
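The simple loop does mean a failed render still aborts the remaining decks within a repository. As a possible refinement (a sketch of my own, not part of the scripts above), each render could be wrapped in tryCatch so the rest still build:

# sketch: log a failure and carry on with the remaining decks
for (f in slides) {
  tryCatch(
    render(f, output_dir = "docs"),
    error = function(e) message("Failed to render ", f, ": ", conditionMessage(e))
  )
}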


Next steps for me

This work has substantially mitigated most of the issues I had with my previous Rtraining workflow. I now have to get all my slide decks building under this new process.

I will be writing about making an improved presentation portal and how to build and maintain your own substantially modified revealjs theme at a later date.

The modified workflow and scripts also have implications for my pRojects package, which I’m currently developing along with Jon Calder. I’d be very interested to hear from you if you have thoughts on how to make things more streamlined.
