In this post, I describe the latest iteration of my automatic document production with R. It improves upon the methods used in Rtraining, and previous work on this topic can read by going to the auto deploying R documentation tag.
I keep banging on about this area because reproducible research / analytical document pipelines is an area I’ve a keen interest in. I see it as a core part of DataOps as it’s vital for helping us ensure our models and analysis are correct in data science and boosting our productivity.
Even after (or because of) a few years of off and on again development to the process, Rtraining had a number of issues:
- build time was long because all presentations and their respective dependencies were required
- if a single presentation broke, any later presentations would not get generated
- the presentation build step was in “after_success” so didn’t trigger notifications
- the build script for the presentations did a lot of stuff I thought could be removed
- the dynamic index page sucks
This post covers how I’m attempting to fix all bar the last problem (more on that in a later post).
With the problems outlined, let’s look at my new base solution and how it addresses these issues.
I have built a template that can be used to generate multiple presentations and publish them to a
docs/ directory for online hosting by GitHub. I can now use this template to produce category repositories, based on the folders in
inst/slides/ in Rtraining. I can always split them out further at a later date.
The new repo is structured like so:
- Package infrastructure
DESCRIPTION– used for managing dependencies primarily
R/– store any utility functions
.Rbuildignore– avoid non-package stuff getting checked
pres/– directory for storing presentation .Rmd files
pres/_output.yml– file with render preferences
- Output directory
docs/– directory for putting generated slides in
- Document generation infrastructure
.travis.yml– used to generate documents every time we push a commit
buildpres.sh– shell script doing the git workflow and calling R
buildpres.R– R script that performs the render step
- My Rtraining repo contained all presentations in the
inst/slidedecks/directory with further categories. This meant that if someone installed Rtraining, they’d get all the decks. I think this is a sub-optimal experience for people, especially because it mean installing so many packages, and I’ll be focusing instead on improving the web delivery.
- Render requirements are now stored in an
_output.ymlfile instead of being hard coded into the render step so that I can add more variant later
- I’m currently using a modified version of the
revealjspackage as I’ve built a heavily customised theme. As I’m not planning on any of these presentation packages ever going on CRAN, I can use the
DESCRIPTIONto describe the location. This simplifies the code significantly.
I use travis-ci to perform the presentation builds. The instructions I provide travis are:
- R -e 'install.packages("devtools")'
- R -e 'devtools::install_deps(dep = T)'
- R CMD build --no-build-vignettes --no-manual .
- R CMD check --no-build-vignettes --no-manual *tar.gz
- Rscript -e 'devtools::install(pkg = ".")'
- chmod +x ./buildpres.sh
One important thing to note here is that I used some arguments on my package build and check steps along with
latex: false to drastically reduce the build time as I have no intention of producing PDFs normally.
install section is the prep work, and then the
script section does the important bit. Now if there are errors, I’ll get notified!
The script that gets executed in my Travis build:
- changes over to a GITHUB_PAT based connection to the repo to facilitate pushing changes and does some other config
- executes the R render step
- puts the R execution log in
- commits all the changes using the important prefix
[CI SKIP]so we don’t get infinite build loops
- pushes the changes
git remote set-url origin $GITURL
git checkout master
git config --global user.name $AUTHORNAME
git config --global user.email $AUTHOREMAIL
R CMD BATCH './buildpres.R'
cp buildpres.Rout docs/
git add docs/
git commit -am "[ci skip] Documents produced in clean environment via Travis $TRAVIS_BUILD_NUMBER"
git push -u --quiet origin master
The R step is now very minimal in that it works out what presentations to generate, then loops through them and builds each one according to the options specified in
for (f in slides) render(f,output_dir = "docs")
Next steps for me
This work has substantially mitigated most of the issues I had with my previous Rtraining workflow. I now have to get all my slide decks building under this new process.
I will be writing about making an improved presentation portal and how to build and maintain your own substantially modified revealjs theme at a later date.
The modified workflow and scripts also have implications on my pRojects package that I’m currently developing along with Jon Calder. I’d be very interested to hear from you if you have thoughts on how to make things more streamlined.
The post Improving automatic document production with R appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.