
On target


Here are some notes on getting started with {targets}.

The project I am working on involves several different reports, each at least 30 pages long, with about 20 plots and 20 tables per document.

As well as a myriad of functions, I had 7 very large R scripts doing the data munging and processing.

I thought they were well ordered, but I had to burn everything down a couple of times and it was quite nerve-wracking building it back up. The thought of adding additional phases of the project to this code base made me uncomfortable. I decided I needed to learn {targets} to ensure this project remains reproducible a few years down the line.

The package comes with extensive documentation, but here are some edited highlights and explainers.

If you don’t know what {targets} does: it keeps track of the objects you create, and the relationships between those objects. So if you have a file that feeds into a function, and the file updates, then the function needs to run again. You don’t need to keep track of that in your head; {targets} does the work for you and produces a wonderful network plot showing the current status.
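To make that concrete, here is a minimal sketch of a _targets.R file. The file and function names (raw_data.csv, clean_data(), plot_data()) are made up for illustration, not from my project:

    # _targets.R -- a minimal sketch; all names are hypothetical
    library(targets)
    tar_source() # load the functions defined in the R/ folder

    list(
      tar_target(raw_file, "raw_data.csv", format = "file"), # watch the file itself
      tar_target(clean, clean_data(raw_file)), # reruns if raw_data.csv changes
      tar_target(fig, plot_data(clean))        # reruns if clean changes
    )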

For example, here is a very zoomed-out view of all my targets. It’s hard to tell, but quite a lot are now out of date, as shown by the blue colour.

Here I’ve zoomed in, with particular focus on the localities target, which acts as an input to many other downstream targets.

The next time tar_make is run, the code behind those outdated targets will run, and everything else will be skipped. Having broken everything down into small functions, there is no way I could track all this manually.
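In practice that boils down to two calls:

    targets::tar_outdated() # list the targets that are currently out of date
    targets::tar_make()     # rebuild only those; skip everything that is current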

Note – I’m using dataframe here as a generic term for data.frame, tibble, data.table, or whatever else you might be using.

For example, here I’m tracking a spreadsheet which has a list of desired indicators. If the file changes in any way, then anything that depends on it becomes outdated, and {targets} will know to update those parts of the pipeline.

    tar_target(
      profile2_adult_indicators,                   # target name
      "./01-inputs/profile2_adultindicators.xlsx", # command: path to the file
      format = "file"                              # target file format
    )
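Downstream targets then take the file target as input; because the target returns the file path, anything that reads it gets invalidated when the spreadsheet changes. The read_xlsx call here is an assumption about how the file is read, not code from my pipeline:

    tar_target(
      adult_indicators, # reruns whenever the spreadsheet above changes
      readxl::read_xlsx(profile2_adult_indicators)
    )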
  • If you have a large script that generates several objects, you’re going to need to break it down into functions so that one target is returned per function. It seems a lot of work, but it’s worth it.

  • In general, you will use tar_manifest, tar_visnetwork, and tar_make the most.

    tar_manifest creates a table of the targets and their inputs, and contains lots of info that will help you check that everything is working. If you run tar_make and your pipeline doesn’t work as expected, run tar_manifest and examine the output in detail (you’ll probably want to pipe the output straight into View()).

    You might also want to use tar_invalidate with a specific target to ensure any changes you make, e.g., as a result of a function change, are picked up prior to running tar_make.

    You can also use tar_destroy and set the option to “all” to completely burn everything down and rebuild it. Probably not something to use on a Friday afternoon, unless you’re very confident in your pipeline, or you simply live for the danger.
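    Put together, the inspect-and-reset workflow looks something like this (localities is one of my targets; treat it as a sketch rather than a recipe):

    library(targets)
    tar_manifest() |> View()     # inspect target names and commands in detail
    tar_invalidate(localities)   # force one target (and its dependents) to rerun
    tar_destroy(destroy = "all") # burn the whole data store down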

    Dynamic branching lets a single target definition create one plot per row of a dataframe:

    tar_target(
      fig_three_bar_comparison,                            # target name
      plot_three_bar_comparison(df = combined_populations, # command
                                council = localities$area,
                                areaval = localities$areaname),
      pattern = map(localities), # localities = a 2-column df with area & areaname
      iteration = "list",
      format = "file"            # each branch returns a file path
    )
    

    This code maps over the area and areaname columns in my localities dataframe and creates a plot for each combination, using the plot_three_bar_comparison function, with the existing target combined_populations as an input. The target name is fig_three_bar_comparison.

    Here are the results of this bit of code:

    This is using dynamic branching. Static branching is also available, and I should probably have used it, as I know what I want my file names to be. I’m using static branching to generate each Word document with tar_render. This involves creating a tibble with the column values to map over, and an output vector (a rough sketch is below). That may be the topic of a future post.
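    For what it’s worth, the pattern looks roughly like this, using tarchetypes::tar_map() with tar_render(). Everything here (the Rmd file, column names, and output file names) is made up for illustration:

    library(targets)
    library(tarchetypes)

    # one row per document to render; all values are hypothetical
    values <- tibble::tibble(
      area     = c("A", "B"),
      areaname = c("North", "South"),
      outfile  = c("north.docx", "south.docx")
    )

    # tar_map() substitutes each row of values into the tar_render() call,
    # producing one report target per row
    tar_map(
      values = values,
      tar_render(
        report,
        "report.Rmd",
        output_file = outfile,
        params = list(area = area, areaname = areaname)
      )
    )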

    I have many of these functions, and as a result I already have over 500 individual targets, from original source spreadsheets and CSVs to plots and documents.

    I am much happier now about the foundations of the project. I had a draft phase 2 document up and running in a couple of days – and this is a much larger document with even more tables and plots. Combined with {renv} and git, we are in a good place for our first Reproducible Analytic Pipeline.
