
Happy New Year! I made some updates to the future framework during 2021 that involve overall improvements and essential preparations for some exciting new features that I’m keen to work on during 2022.

The future framework makes it easy to parallelize existing R code – often with only a minor change of code. The goal is to lower the barriers so that anyone can quickly and safely speed up their existing R code in a worry-free manner.

future 1.22.1 was released in August 2021, followed by future 1.23.0 at the end of October 2021. Below, I summarize the updates that came with those two releases.

There were also several updates to the related parallelly and progressr packages, which you can read about in earlier blog posts under the #parallelly and #progressr blog tags.

## New features

### futureSessionInfo() for troubleshooting and issue reporting

Function futureSessionInfo() was added to future 1.22.0. It outputs information useful for troubleshooting problems related to the future framework. It also runs some basic tests to validate that the current future backend works as expected. If you have problems getting futures to work on your machine, please run this function before reporting issues at Future Discussions. Here’s an example:

> library(future)
> plan(multisession, workers = 2)
> futureSessionInfo()
*** Package versions
future 1.23.0, parallelly 1.30.0, parallel 4.1.2, globals 0.14.0, listenv 0.8.0

*** Allocations
availableCores():
system  nproc
8      8
availableWorkers():
$system
[1] "localhost" "localhost" "localhost"
[4] "localhost" "localhost" "localhost"
[7] "localhost" "localhost"

*** Settings
- future.plan=<not set>
- future.globals.maxSize=<not set>
- future.globals.onReference=<not set>
- future.resolve.recursive=<not set>
- future.rng.onMisuse='warning'
- future.wait.timeout=<not set>
- future.wait.interval=<not set>
- future.wait.alpha=<not set>
- future.startup.script=<not set>

*** Backends
Number of workers: 2
List of future strategies:
1. multisession:
- args: function (..., workers = 2, envir = parent.frame())
- tweaked: TRUE
- call: plan(multisession, workers = 2)

*** Basic tests
  worker   pid     r sysname          release
1      1 19291 4.1.2   Linux 5.4.0-91-generic
2      2 19290 4.1.2   Linux 5.4.0-91-generic
                                               version
1 #102~18.04.1-Ubuntu SMP Thu Nov 11 14:46:36 UTC 2021
2 #102~18.04.1-Ubuntu SMP Thu Nov 11 14:46:36 UTC 2021
   nodename machine login  user effective_user
1 my-laptop  x86_64 alice alice          alice
2 my-laptop  x86_64 alice alice          alice
Number of unique PIDs: 2 (as expected)


### Working around UTF-8 escaping on MS Windows

Because of limitations in R itself, UTF-8 symbols outputted on MS Windows parallel workers would be relayed as escaped symbols when using futures. Now, the future framework, and, more specifically, value(), attempts to recover such MS Windows output to UTF-8 before outputting it.

For example, in future (< 1.23.0) you would get the following:

f <- future({ cat("\u2713 Everything is OK") ; 42 })
v <- value(f)
#> <U+2713> Everything is OK


when, and only when, those futures are resolved on an MS Windows machine. In future (>= 1.23.0), we work around this problem by looking for <U+NNNN>-like patterns in the output and decoding them as UTF-8 symbols:

f <- future({ cat("\u2713 Everything is OK") ; 42 })
v <- value(f)
#> ✓ Everything is OK


Comment: From R 4.2.0, R will have native support for UTF-8 also on MS Windows. More testing and validation is needed to confirm this will work out of the box in R (>= 4.2.0) when running R in the terminal, in the R GUI, in the RStudio Console, and so on. If so, future will be updated to only apply this workaround for R (< 4.2.0).
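
To make the idea concrete, here is a minimal base-R sketch of how `<U+NNNN>` patterns could be decoded. The helper `decode_escaped_utf8()` is a hypothetical name used only for illustration; it is not the code that future uses internally, and it assumes every matching pattern really is an escaped symbol.

```r
# Hypothetical helper for illustration; not part of the future package
decode_escaped_utf8 <- function(x) {
  # Match escapes such as <U+2713> (4 to 8 hexadecimal digits)
  pattern <- "<U\\+[0-9A-Fa-f]{4,8}>"
  m <- gregexpr(pattern, x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    # Drop the "<U+" prefix and the ">" suffix, keeping only the hex digits
    hex <- substr(tokens, 4L, nchar(tokens) - 1L)
    # Convert each hexadecimal code point to its UTF-8 character
    vapply(strtoi(hex, base = 16L), intToUtf8, character(1))
  })
  x
}

escaped <- "<U+2713> Everything is OK"
cat(decode_escaped_utf8(escaped), "\n")
```

Strings without any escapes pass through unchanged, which is why a workaround like this can be applied to all relayed output.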

### Harmonization of future(), futureAssign(), and futureCall()

Prior to future 1.22.0, argument seed for futureAssign() and futureCall() defaulted to TRUE, whereas it defaulted to FALSE for future(). This was an oversight. In future (>= 1.22.0), seed = FALSE is the default for all these functions.
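
As a small sketch of what this means in practice, code that draws random numbers should now request parallel-safe RNG explicitly with seed = TRUE in all three functions. The example below uses plan(sequential) only to keep it lightweight; the variable names are arbitrary.

```r
library(future)
plan(sequential)

# future(): request a parallel-safe RNG stream explicitly
f <- future(rnorm(3), seed = TRUE)
x <- value(f)

# futureCall(): same argument, same default (seed = FALSE)
g <- futureCall(rnorm, args = list(n = 3), seed = TRUE)
y <- value(g)

# futureAssign(): assigns the value of the future to variable 'z'
futureAssign("z", rnorm(3), seed = TRUE)
print(z)
```

With the default seed = FALSE, drawing random numbers inside a future is treated as RNG misuse, which is reported according to option future.rng.onMisuse.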

### Protecting against non-exportable results

Analogously to how globals may be scanned for “non-exportable” objects when option future.globals.onReference is set to "error" or "warning", value() will now check for similar problems in the value returned from parallel workers. For example, in future (< 1.23.0) we would get:

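library(future)
plan(multisession, workers = 2)
options(future.globals.onReference = "error")

# The line creating 'f' is missing from this copy of the post; any future
# that returns an xml2 document, such as this one, reproduces the problem
f <- future(xml2::read_xml("<body></body>"))
v <- value(f)
print(v)
#> Error in doc_type(x) : external pointer is not valid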


whereas in future (>= 1.23.0) we get:

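library(future)
plan(multisession, workers = 2)
options(future.globals.onReference = "error")

# The line creating 'f' is missing from this copy of the post; any future
# that returns an xml2 document, such as this one, triggers the check
f <- future(xml2::read_xml("<body></body>"))
v <- value(f)
#> Error: Detected a non-exportable reference ('externalptr') in the value
#> (of class 'xml_document') of the resolved future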


### Finer control of what type of conditions are captured and replayed

Besides specifying which condition classes should be captured and relayed, in future (>= 1.22.0) it is also possible to specify condition classes to be ignored. For example,

f <- future(..., conditions = structure("condition", exclude = "message"))


captures all conditions but message conditions. The default is conditions = "condition", which captures and relays any type of condition.

## Performance improvements

I always prioritize correctness over performance in the future framework. So, whenever optimizing for performance, I have to make sure that I am not breaking things somewhere else. Thankfully, there are now over 200 reverse-dependency packages on CRAN and Bioconductor that I can validate against. They provide a comfy cushion against mistakes on top of what we already get from package unit tests and the future.tests test suite. Below are some recent performance improvements.

### Less latency for multicore, multisession, and cluster futures

In future 1.22.0, the default timeout of resolved() was decreased from 0.20 seconds to 0.01 seconds for multicore, multisession, and cluster futures. This means that less time is now spent on checking for results from these future backends when they are not yet available. After making sure it is safe to do so, we might decrease the default timeout to zero in a later release.
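
The timeout matters mostly for code that polls futures. A minimal sketch of such a polling loop follows; the sleep durations and the counter `n_polls` are arbitrary choices for illustration.

```r
library(future)
plan(multisession, workers = 2)

# A future that takes a little while to resolve
f <- future({ Sys.sleep(0.5); 42 })

# Poll until the result is available; each resolved() call waits at most
# for its timeout before returning FALSE
n_polls <- 0L
while (!resolved(f)) {
  n_polls <- n_polls + 1L
  Sys.sleep(0.05)
}

v <- value(f)
print(v)

# Shut down the parallel workers again
plan(sequential)
```

The shorter the timeout, the less time each unsuccessful resolved() call adds to a loop like this.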

### Less overhead when initiating futures

The overhead of initiating futures was significantly reduced in future 1.22.0. For example, the round-trip time for value(future(NULL)) is about twice as fast for sequential, cluster, and multisession futures. For multicore futures the round-trip speedup is about 20%.

The speedup comes from pre-compiling the future’s R expression into an R expression template, which can then be quickly re-compiled into the final expression to be evaluated. Specifically, instead of calling expr <- base::bquote(tmpl) for each future, which is computationally expensive, we take a two-step approach: we first call tmpl_cmp <- bquote_compile(tmpl) once per session, so that we only have to call the much faster expr <- bquote_apply(tmpl_cmp) for each future.(*) This pre-compile approach makes constructing the final future expression from the original future expression about 10 times faster.

(*) These are internal functions of the future package.
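
For readers unfamiliar with expression templates, the sketch below shows only the plain base::bquote() step, that is, the per-future approach that was replaced. It does not reproduce the internal bquote_compile() and bquote_apply() functions, and the template itself is a made-up, much simplified stand-in for the one future uses.

```r
# The user's future expression, captured unevaluated
expr <- quote(x + 1)

# Fill a (simplified, hypothetical) template: .(expr) is replaced by the
# captured expression, producing the final expression to be evaluated
final_expr <- bquote({
  value <- .(expr)
  list(value = value, pid = Sys.getpid())
})
print(final_expr)

# Evaluate the final expression, as a worker would
x <- 41
res <- eval(final_expr)
print(res$value)
```

Because the template is the same for every future, the work of locating the `.()` placeholders can be done once up front, which is the part the pre-compile step avoids repeating.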

### Environment variables are only used when package is loaded

All R options specific to the future framework have defaults that fall back to corresponding environment variables. For example, the default for option future.rng.onMisuse can be set by environment variable R_FUTURE_RNG_ONMISUSE.

The purpose of the environment variables is to make it possible to configure the future framework before launching R, e.g. in shell startup scripts, or in shell scripts submitted to job schedulers in high-performance compute (HPC) environments. When R is already running, the best practice is to use the R options to configure the future framework.

In order to avoid the overhead from querying and parsing environment variables at runtime, but also to clarify how and when environment variables should be set, starting with future 1.22.0, R_FUTURE_* environment variables are only used when the future package is loaded. Then, if set, they are used for setting the corresponding future.* option.
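
In practice, this gives two places to configure the same setting. The sketch below assumes "error" as the chosen value; the shell line in the comment is what a startup or job script might contain.

```r
# Before launching R, e.g. in a shell startup script or an HPC job script:
#   export R_FUTURE_RNG_ONMISUSE=error
# The environment variable is only consulted when the future package is loaded.

# In an already running R session, set the R option directly instead
options(future.rng.onMisuse = "error")
getOption("future.rng.onMisuse")
```

Changing R_FUTURE_RNG_ONMISUSE with Sys.setenv() after the package has been loaded will therefore have no effect on the option.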

## Cleanups to make room for new features

The values() function is defunct since future 1.23.0 in favor of value(). All CRAN and Bioconductor packages that depend on future were updated a long time ago. If you get the error:

Error: values() is defunct in future (>= 1.20.0). Use value() instead.


make sure to update your R packages. A few users of furrr have run into this error; updating to furrr (>= 0.2.0) solved the problem.

Continuing, to further harmonize how developers use the Future API, we are moving away from odds-and-ends features, especially the ones that are holding us back from adding new features. The goal is to ensure that more code using futures can truly run anywhere, not just on the particular parallel backend that the developer works with.

In this spirit, we are slowly moving away from “persistent” workers. For example, in future (>= 1.23.0), plan(multisession, persistent = TRUE) is no longer supported and will produce an error if attempted. The same will eventually happen also for plan(cluster, persistent = TRUE), but not until we have support for caching “sticky” globals, which is the main use case for persistent workers.

Another example is transparent futures, which are prepared for deprecation in future (>= 1.23.0). If used, plan(transparent) produces a warning, which will soon be upgraded to a formal deprecation warning. In a later release, it will produce an error. Transparent futures were added during the early days in order to simplify troubleshooting of futures. A better approach these days is to use plan(sequential, split = TRUE), which makes interactive troubleshooting tools such as browser() and debug() work.
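
A minimal sketch of that troubleshooting setup is shown below. The function `slow_sqrt()` is a made-up example; in an interactive session you would uncomment the browser() call or use debug(slow_sqrt).

```r
library(future)

# Resolve futures in the current R session, as recommended above
plan(sequential, split = TRUE)

# Hypothetical function to troubleshoot
slow_sqrt <- function(x) {
  # browser()  # uncomment in an interactive session to step through
  sqrt(x)
}

f <- future(slow_sqrt(16))
v <- value(f)
print(v)
```

Once the problem is found, switching back to a parallel backend is just a matter of changing the plan() call.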

## Significant changes preparing for the future

Prior to future 1.22.0, lazy futures were assigned to the currently set future backend immediately when created. For example, if we do:

library(future)
plan(multisession, workers = 2)

f <- future(42, lazy = TRUE)


with future (< 1.22.0), we would get:

class(f)
#> [1] "MultisessionFuture" "ClusterFuture"      "MultiprocessFuture"
#> [4] "Future"             "environment"


Starting with future 1.22.0, lazy futures remain generic futures until they are launched, which means they are not assigned a backend class until they have to be. Now, the above example gives:

class(f)
#> [1] "Future"      "environment"


This change opens up the door for storing futures themselves to file and sending them elsewhere. More precisely, this means we can start working towards a queue of futures, which then can be processed on whatever compute resources we have access to at the moment, e.g. some futures might be resolved on the local computer, others on machines on a local cluster, and when those fill up, we can burst out to cloud resources, or maybe process them via a community-driven peer-to-peer cluster.

There are lots of new features on the roadmap related to the above and other things. I hope to make progress on several of them during 2022. If you’re curious about what’s coming up, see the Project Roadmap, stay tuned on this blog (feed), or follow me on Twitter.

Happy futuring!

Henrik