targets: Democratizing Reproducible Analysis Pipelines
Make-like pipelines [1] enhance the integrity, transparency, shelf life, efficiency, and scale of large analysis projects. With pipelines, data science feels smoother and more rewarding, and the results are worthy of more trust.
…looking to get your project/s organised in the new year? hoping just to distract from feelings of impending doom/crushing loss of hope? I promise workflowing will make you feel better… and @wmlandau has made it SO EASY.
— Dr Saras Windecker (@smwindecker) January 8, 2021
{targets} and its predecessors are visionary work. I can’t imagine making pipelines in a linear script ever again.
— Miles McBain (@MilesMcBain) January 8, 2021
targets
    install.packages("targets")

The targets package [2] is a new pipeline toolkit for R.
It recently cleared software review, and it is now on CRAN.
targets is the long-term successor of drake [3], which in turn succeeded Rich FitzJohn’s groundbreaking remake package [4].
A chapter in the user manual explains the future of drake, the advantages of targets, and how to transition.
The reference website explains how to get started, and the overview vignette describes the major features of targets and its user manual.
How it works
In targets, a data analysis pipeline is a collection of target objects that express the individual steps of the workflow, from upstream data processing to downstream R Markdown reports [5].
These targets live in a special script called _targets.R.
    # _targets.R file
    library(targets)
    # replace_na() comes from tidyr, so tidyr must be listed here too.
    tar_option_set(packages = c("biglm", "dplyr", "ggplot2", "readr", "tidyr"))

    # Most workflows have custom functions to support the targets.
    read_clean <- function(path) {
      path %>%
        read_csv(col_types = cols()) %>%
        mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE)))
    }

    fit_model <- function(data) {
      biglm(Ozone ~ Wind + Temp, data)
    }

    create_plot <- function(data) {
      ggplot(data) +
        geom_histogram(aes(x = Ozone), bins = 12) +
        theme_gray(24)
    }

    # List of targets.
    list(
      # airquality dataset in base R:
      tar_target(raw_data_file, "raw_data.csv", format = "file"),
      tar_target(data, read_clean(raw_data_file)),
      tar_target(fit, fit_model(data)),
      tar_target(hist, create_plot(data))
    )
targets inspects your code and constructs a dependency graph.

    # R console
    tar_visnetwork()
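To make the idea of a dependency graph concrete, here is a minimal base R sketch. It is not how targets is implemented: the `commands` list, `find_deps()`, and the scheduling loop are hypothetical stand-ins, and the real package performs a more thorough static analysis that also tracks functions. The sketch only shows that the symbols mentioned in each command are enough to recover a valid run order.

```r
# Hypothetical commands that mirror the four targets in the _targets.R file.
commands <- list(
  raw_data_file = quote("raw_data.csv"),
  data = quote(read_clean(raw_data_file)),
  fit = quote(fit_model(data)),
  hist = quote(create_plot(data))
)

# A target depends on every other target whose name appears in its command.
find_deps <- function(command, target_names) {
  intersect(all.vars(command), target_names)
}
deps <- lapply(commands, find_deps, target_names = names(commands))

# Repeatedly schedule every target whose dependencies are already scheduled.
run_order <- character(0)
while (length(run_order) < length(deps)) {
  waiting <- setdiff(names(deps), run_order)
  ready <- Filter(function(name) all(deps[[name]] %in% run_order), waiting)
  stopifnot(length(ready) > 0) # Guards against circular dependencies.
  run_order <- c(run_order, ready)
}

print(run_order)
```

In this toy version, `fit` and `hist` become ready at the same time because both depend only on `data`, which is the kind of independence a parallel backend can exploit.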
tar_make() runs the correct targets in the correct order.

    # R console
    tar_make()
    #> ● run target raw_data_file
    #> ● run target data
    #> ● run target fit
    #> ● run target hist
    #> ● end pipeline
Alternatives tar_make_clustermq() and tar_make_future() leverage clustermq [6] and future [7], respectively, to distribute targets on traditional schedulers such as SLURM [8].
It is only a matter of time before these backends become capable of sending jobs to the cloud [9].
You can store the results in the _targets/ folder (default) or Amazon S3 buckets.
Either way, loading data back into R is the same.
    # R console
    tar_read(hist) # see also tar_load()
Up-to-date targets do not rerun, which saves countless hours in computationally intense fields like machine learning, Bayesian statistics, and statistical genomics.
    # R console
    tar_make()
    #> ✓ skip target raw_data_file
    #> ✓ skip target data
    #> ✓ skip target fit
    #> ✓ skip target hist
    #> ✓ skip pipeline
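The skipping behavior can be illustrated with a toy cache in base R. This is only a sketch under simplifying assumptions, not the mechanism targets actually uses: `cache`, `fingerprint()`, and `make_target()` are hypothetical names, and the fingerprint here is just the command text plus the input values, compared with `identical()` rather than with hashes stored on disk.

```r
# Toy cache: maps a target name to the record of its last successful run.
cache <- new.env()

# Fingerprint a target by its command text and the values of its inputs.
fingerprint <- function(command, inputs) {
  list(command = deparse(command), inputs = inputs)
}

# Run the command only if the fingerprint differs from the cached one.
make_target <- function(name, command, inputs) {
  new <- fingerprint(command, inputs)
  if (identical(cache[[name]]$fingerprint, new)) {
    return("skip")
  }
  value <- eval(command, inputs)
  assign(name, list(fingerprint = new, value = value), envir = cache)
  "run"
}

command <- quote(mean(x))
status <- c(
  make_target("avg", command, list(x = c(1, 2, 3))),  # First call: runs.
  make_target("avg", command, list(x = c(1, 2, 3))),  # Nothing changed: skips.
  make_target("avg", command, list(x = c(1, 2, 30)))  # Input changed: reruns.
)
print(status)
```

A target reruns when either its command or one of its inputs changes, and everything else is left alone, which is where the time savings come from.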
The next challenge
To help workflows scale, targets adopts the classical, pedantic, function-oriented perspective of the R language [10].
Nearly everything that happens in R results from a function call. Therefore, basic programming centers on creating and refining functions.
— John Chambers
The more often you write your own functions, the nicer your experience becomes.
I’m thinking about why this exists only in R and it may be because:
1) R’s functional approach makes it easier to detect dependencies, and
2) R’s uses lazy evaluation
I tried building a little prototype equivalent in Julia and I think it’s possible, but above my skill level
— Dr. David Neuzerling (@mdneuzerling) December 17, 2020
But if your mind is on the domain knowledge, or if you feel pressure to work fast, then it can be hard to write functions for everything.
Target factories
The best way to write fewer functions is to write less code. To write less code, we need abstraction and automation. Target factories are package functions that return lists of pre-configured target objects, and they make specialized pipelines reusable.
    # script inside example.package

    #' @export
    read_clean <- function(path) {
      path %>%
        read_csv(col_types = cols()) %>%
        mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE)))
    }

    #' @export
    fit_model <- function(data) {
      biglm(Ozone ~ Wind + Temp, data)
    }

    #' @export
    create_plot <- function(data) {
      ggplot(data) +
        geom_histogram(aes(x = Ozone), bins = 12) +
        theme_gray(24)
    }

    #' @title Example target factory.
    #' @description Concise shorthand to express our example pipeline.
    #' @details
    #'   Target factories should use `tar_target_raw()`.
    #'   `tar_target()` is for users, and `tar_target_raw()` is for developers.
    #'   The former quotes its arguments, while the latter evaluates them.
    #' @export
    biglm_factory <- function(file) {
      list(
        tar_target_raw("raw_data_file", as.expression(file), format = "file"),
        tar_target_raw("data", quote(example.package::read_clean(raw_data_file))),
        tar_target_raw("fit", quote(example.package::fit_model(data))),
        tar_target_raw("hist", quote(example.package::create_plot(data)))
      )
    }
With the factory above, our long _targets.R file suddenly collapses down to three lines.

    # _targets.R file
    library(targets)
    library(example.package)
    biglm_factory("raw_data.csv")
And you still have complete freedom to add more targets to the list.
    # _targets.R file
    library(targets)
    library(example.package)
    run_model2 <- function(data) {...}
    list( # Target lists can be arbitrarily nested.
      biglm_factory("raw_data.csv"),
      tar_target(model2, run_model2(data))
    )
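The roxygen comments in the factory note that tar_target() quotes its arguments while tar_target_raw() evaluates them. The base R sketch below illustrates why that distinction matters for factories. Everything in it is hypothetical: `toy_target()`, `toy_target_raw()`, `toy_factory()`, and `fit_predictor()` are stand-ins invented for this example and are not part of targets.

```r
# Developer-facing style: the name is a string and the command is already
# a language object, so both can be assembled programmatically.
toy_target_raw <- function(name, command) {
  list(name = name, command = command)
}

# User-facing style: capture the name and command without evaluating them.
toy_target <- function(name, command) {
  toy_target_raw(deparse(substitute(name)), substitute(command))
}

# Hypothetical factory: one target per predictor, built from plain strings.
toy_factory <- function(predictors) {
  lapply(predictors, function(predictor) {
    command <- bquote(fit_predictor(data, .(as.symbol(predictor))))
    toy_target_raw(paste0("fit_", predictor), command)
  })
}

by_hand <- toy_target(fit_Wind, fit_predictor(data, Wind))
generated <- toy_factory(c("Wind", "Temp"))

# The generated target matches the one a user would type by hand.
same <- identical(by_hand, generated[[1]])
print(same)
```

The quoting version is pleasant to type interactively, but only the evaluating version lets a function paste names together and splice symbols into commands, which is what a factory needs to do.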
The R Targetopia
The R Targetopia [11] is an emerging ecosystem of packages to bring target factories to specific domains of statistics and data science.
stantargets

    library(remotes)
    install_github("wlandau/stantargets")
    library(cmdstanr)
    install_cmdstan()
stantargets [12] abstracts away most of the targets and functions required for a solid Bayesian data analysis with Stan [13].
With a single target factory and a single function to generate data, stantargets can give you an entire sensitivity analysis or an entire simulation-based calibration study [14, 15].
    # _targets.R for simulation-based calibration to validate a Stan model.
    library(targets)
    library(stantargets)

    # n needs a default because the pipeline calls generate_data() with no arguments.
    generate_data <- function(n = 10) {
      true_beta <- stats::rnorm(n = 1, mean = 0, sd = 1)
      x <- seq(from = -1, to = 1, length.out = n)
      y <- stats::rnorm(n, x * true_beta, 1)
      list(n = n, x = x, y = y, true_beta = true_beta)
    }

    list(
      tar_stan_mcmc_rep_summary(
        model,
        "model.stan", # We assume you already have a Stan model file.
        generate_data(), # Runs once per rep.
        batches = 25, # Batching reduces per-target overhead.
        reps = 40, # Number of simulation reps per batch.
        data_copy = "true_beta",
        variables = "beta",
        summaries = list(
          ~posterior::quantile2(.x, probs = c(0.025, 0.975))
        )
      )
    )

    # R console
    tar_visnetwork()
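For readers new to simulation-based calibration, the following base R sketch shows the kind of check the pipeline above automates. It rests on loud assumptions: the Stan model is taken to be y ~ normal(x * beta, 1) with prior beta ~ normal(0, 1), the sample size n = 10 is a guess, and the closed-form conjugate posterior replaces MCMC so that the example runs without Stan. The helpers `posterior_interval()` and `run_rep()` are hypothetical and are not part of stantargets.

```r
set.seed(1)

# Data-generating process matching generate_data() in the pipeline.
generate_data <- function(n = 10) {
  true_beta <- stats::rnorm(n = 1, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- stats::rnorm(n, x * true_beta, 1)
  list(n = n, x = x, y = y, true_beta = true_beta)
}

# Assumed model: y ~ normal(x * beta, 1) with prior beta ~ normal(0, 1).
# The posterior is normal, so exact quantiles stand in for MCMC summaries.
posterior_interval <- function(data, probs = c(0.025, 0.975)) {
  precision <- 1 + sum(data$x^2)
  center <- sum(data$x * data$y) / precision
  stats::qnorm(probs, mean = center, sd = sqrt(1 / precision))
}

# One rep: simulate data, compute the interval, check that it covers the truth.
run_rep <- function() {
  data <- generate_data()
  interval <- posterior_interval(data)
  interval[1] < data$true_beta && data$true_beta < interval[2]
}

# 25 batches of 40 reps each, the same layout as the pipeline.
batch_coverage <- replicate(25, mean(replicate(40, run_rep())))
coverage <- mean(batch_coverage)
print(coverage) # Should land near 0.95 when the model matches the simulation.
```

If the 95% posterior intervals cover the true parameter far more or far less often than 95% of the time, the model or the inference algorithm deserves a closer look.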
tarchetypes
    install.packages("tarchetypes")

The tarchetypes [16] R Targetopia package is far more general than stantargets.
Its target factories include tar_rep() for arbitrary simulation studies, tar_render() for dependency-aware literate programming, and tar_render_rep() for parameterized R Markdown.
tar_plan() is a drake_plan()-like target factory to help drake users transition to targets.
    # _targets.R file
    library(targets)
    library(tarchetypes)
    tar_plan(
      tar_target(raw_data_file, "raw_data.csv", format = "file"),
      data = read_clean(raw_data_file),
      fit = fit_model(data),
      hist = create_plot(data)
    )
You can help!
The R Targetopia has exciting potential for tidymodels [17], mlr3 [18], keras [19], torch [20], PK/PD, spatial statistics, and beyond.
If your field needs a friendly pipeline tool, please consider creating an R Targetopia package of your own.
I am trying to make it easy, and I would be eager to get in touch.
Thanks
Volunteers drive the rOpenSci review process, and each review is an act of altruism.
This was especially true for targets because of COVID-19, the overlap with the holidays, and the unusually copious workload.
Despite the obstacles, everyone delivered incredible feedback that substantially improved targets and its documentation.
Sam Oliver and TJ Mahr served as reviewers, and Mauro Lepore served as editor.
Sam inspired a section on getting started, an overview vignette, more debugging advice, and a new tar_branches() function to show branch provenance.
TJ suggested a new chapter on functions, helped me contrast the two styles of branching, and raised interesting questions about target names.
Mauro was continuously diligent, responsive, thoughtful, and conscientious as he mediated the review process and ensured a successful outcome.
Thanks also to Matt Warkentin, Timing Liu, Miles McBain, Gorka Navarrete, Bruno Carlin, Noam Ross, Kendon Bell, and others who adopted targets early in development, proposed insightful ideas, and influenced the direction and behavior of the package.
My colleague Richard Payne was a serious drake user, and he built a proprietary drake_plan() generator for our team.
His package was the major inspiration for target factories and the R Targetopia.
Everyone who contributed to drake is part of targets.
Four years of pull requests, issues, rOpenSci discussions, RStudio Community posts, and Stack Overflow threads are materializing in this new suite of tools.
Disclaimer
The views in this post do not necessarily reflect those of my employer.
References
1. Stallman, R. (1998). GNU Make, Version 3.77. Free Software Foundation. ISBN: 1882114809
2. Landau, W. M. (2021). The targets R package: a dynamic Make-like function-oriented pipeline toolkit for reproducibility and high-performance computing. Journal of Open Source Software, 6(57), 2959. https://doi.org/10.21105/joss.02959
3. Landau, W. M. (2018). The drake R package: a pipeline toolkit for reproducibility and high-performance computing. Journal of Open Source Software, 3(21), 550. https://doi.org/10.21105/joss.00550
4. FitzJohn, R. (2021). remake: Make-like build management. R package version 0.3.0.
5. Allaire, J. J., Xie, Y., McPherson, J., Luraschi, J., Ushey, K., Atkins, A., Wickham, H., Cheng, J., Chang, W., and Iannone, R. (2021). rmarkdown: Dynamic Documents for R. R package version 2.6.4. https://rmarkdown.rstudio.com
6. Schubert, M. (2019). clustermq enables efficient parallelization of genomic analyses. Bioinformatics, 35(21), 4493–4495. https://doi.org/10.1093/bioinformatics/btz284
7. Bengtsson, H. (2020). A unifying framework for parallel and distributed processing in R using futures. https://arxiv.org/abs/2008.00553
8. Yoo, A. B., Jette, M. A., and Grondona, M. (2003). SLURM: Simple Linux Utility for Resource Management. In: Feitelson, D., Rudolph, L., and Schwiegelshohn, U. (eds), Job Scheduling Strategies for Parallel Processing. JSSPP 2003. Lecture Notes in Computer Science, vol 2862. Springer, Berlin, Heidelberg. https://doi.org/10.1007/10968987_3
9. Amazon Web Services (2020). Overview of Amazon Web Services. https://d1.awsstatic.com/whitepapers/aws-overview.pdf
10. Chambers, J. (2008). “Software for Data Analysis: Programming with R.” In “Programming with R: The Basics”, 37–76. Springer. https://link.springer.com/chapter/10.1007/978-0-387-75936-4_3
11. Landau, W. M. (2021). The R Targetopia: an R package ecosystem for democratized reproducible pipelines at scale. https://wlandau.github.io/targetopia/
12. Landau, W. M. (2021). stantargets: Targets for Stan Workflows. https://wlandau.github.io/stantargets/, https://github.com/wlandau/stantargets
13. Stan Development Team (2012). Stan: a C++ library for probability and sampling. https://mc-stan.org
14. Cook, S. R., Gelman, A., and Rubin, D. B. (2006). “Validation of Software for Bayesian Models Using Posterior Quantiles.” Journal of Computational and Graphical Statistics, 15(3), 675–692. http://www.jstor.org/stable/27594203
15. Talts, S., Betancourt, M., Simpson, D., Vehtari, A., and Gelman, A. (2020). “Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” http://arxiv.org/abs/1804.06788
16. Landau, W. M. (2021). tarchetypes: Archetypes for Targets. https://docs.rOpenSci.org/tarchetypes/, https://github.com/rOpenSci/tarchetypes
17. Kuhn et al. (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org
18. Lang, M., Binder, M., Richter, J., Schratz, P., Pfisterer, F., Coors, S., Au, Q., Casalicchio, G., Kotthoff, L., and Bischl, B. (2019). mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software. https://doi.org/10.21105/joss.01903
19. Allaire, J. J. and Chollet, F. (2020). keras: R Interface to ‘Keras’. R package version 2.3.0.0. https://CRAN.R-project.org/package=keras
20. Falbel, D. and Luraschi, J. (2020). torch: Tensors and Neural Networks with ‘GPU’ Acceleration. R package version 0.2.0. https://CRAN.R-project.org/package=torch