(A Very) Experimental Threading in R

[This article was first published on R – Random Remarks, and kindly contributed to R-bloggers.]

I’ve been trying to find a way to introduce threads to R. There are many possible reasons to do that: simplified input/output logic, sending tasks to the background (e.g. building a model asynchronously), or running computation-intensive tasks in parallel (e.g. a parallel, chunk-wise var() on a large vector). Finally, it’s just a neat problem to look at. I’m trying to follow an approach similar to Python’s global interpreter lock.

So far it seems that:

  • one can re-enter the interpreter with R_tryEval, which internally calls R_ToplevelExec, which in turn intercepts all long jumps (e.g. errors)
  • there are a few basic checks to verify whether the stack is in good shape, e.g. R_CStackStart, which is used to check C stack usage, and R_PPStackTop, which tracks the objects under PROTECTion
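
To make the "intercepts all long jumps" point concrete, here is a minimal stand-alone C sketch of the pattern. It is not R's actual implementation: every name in it (toy_toplevel_exec, toy_error, current_toplevel, task_ok, task_fail) is hypothetical, and it only assumes that errors are delivered with longjmp and that the wrapper records a setjmp target before calling the function.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-in for the interpreter's innermost jump target. */
jmp_buf *current_toplevel = NULL;

/* Stand-in for an R error: jump back to the innermost top-level wrapper. */
void toy_error(void) {
    assert(current_toplevel != NULL);
    longjmp(*current_toplevel, 1);
}

/* Run fun(data). Returns 1 if it finished normally, 0 if it called
 * toy_error(). The long jump never travels past this function. */
int toy_toplevel_exec(void (*fun)(void *), void *data) {
    jmp_buf here;
    jmp_buf *saved = current_toplevel;  /* remember outer target (nesting) */
    int ok;

    current_toplevel = &here;
    if (setjmp(here) == 0) {
        fun(data);                      /* normal path */
        ok = 1;
    } else {
        ok = 0;                         /* arrived here via longjmp */
    }
    current_toplevel = saved;           /* restore the outer target */
    return ok;
}

/* Two example tasks: one completes, one fails half-way through. */
void task_ok(void *data) {
    *(int *)data += 1;
}

void task_fail(void *data) {
    *(int *)data += 1;
    toy_error();
    *(int *)data += 100;                /* never reached */
}
```

With these definitions, toy_toplevel_exec(task_fail, &n) returns 0 rather than unwinding past its caller. That is roughly the property that makes it plausible to enter the interpreter from a freshly started thread.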

I think that one can run multiple threads in R and maintain a separate interpreter “instance” in each of them. The R interpreter uses the stack for its bookkeeping, and each thread has its own stack. It also counts the objects excluded from garbage collection with PROTECT. Thus, when coming back to a given R interpreter “instance” (after a thread-level context switch), one needs to reset R_PPStackTop to whatever value that thread left it with.
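
The bookkeeping described above can be sketched with plain pthreads. This is a toy model under stated assumptions, not code from the thread package: interp_lock, pp_stack_top, struct toy_thread and the functions below are all hypothetical stand-ins, and the global counter only imitates the index R_PPStackTop (in real R the protect stack is a shared array, which this sketch does not model). Each thread saves the counter when it gives up the lock and restores its own value when it gets the lock back.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Toy stand-ins, not R's real symbols: one global interpreter lock and
 * one global protect-stack index shared by every thread. */
static pthread_mutex_t interp_lock = PTHREAD_MUTEX_INITIALIZER;
static int pp_stack_top = 0;

/* Per-thread bookkeeping. */
struct toy_thread {
    int saved_pp_top;  /* index this thread left behind at its last switch */
    int depth;         /* how many objects this thread "protects" */
    int mismatches;    /* times the restored index was not what it left */
};

/* Take the lock, then restore this thread's view of the protect stack. */
void interp_enter(struct toy_thread *t) {
    pthread_mutex_lock(&interp_lock);
    pp_stack_top = t->saved_pp_top;
}

/* Remember this thread's index, then let another thread run. */
void interp_leave(struct toy_thread *t) {
    t->saved_pp_top = pp_stack_top;
    pthread_mutex_unlock(&interp_lock);
}

void *toy_worker(void *arg) {
    struct toy_thread *t = arg;
    interp_enter(t);
    for (int i = 0; i < 1000; i++) {
        pp_stack_top += t->depth;       /* imitates PROTECT() */
        int expected = pp_stack_top;
        interp_leave(t);                /* explicit context-switch point */
        sched_yield();
        interp_enter(t);
        if (pp_stack_top != expected)
            t->mismatches++;
        pp_stack_top -= t->depth;       /* imitates UNPROTECT() */
    }
    interp_leave(t);
    return NULL;
}

/* Runs two workers; returns 0 when every restore matched and both
 * threads ended with a balanced (zero) index, -1 on thread-creation failure. */
int run_two_threads(void) {
    struct toy_thread a = {0, 3, 0}, b = {0, 7, 0};
    pthread_t ta, tb;
    if (pthread_create(&ta, NULL, toy_worker, &a) != 0)
        return -1;
    if (pthread_create(&tb, NULL, toy_worker, &b) != 0) {
        pthread_join(ta, NULL);
        return -1;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return a.mismatches + b.mismatches + a.saved_pp_top + b.saved_pp_top;
}
```

Without the save and restore in interp_leave() and interp_enter(), each worker would see the other's increments after a switch, which is the kind of imbalance that the interpreter's stack checks would complain about.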

I have put these ideas together in the form of an R package, thread (GitHub). This is what it can do:

  • start a new thread and execute an R function in its own interpreter
  • switch between threads on specific function calls, e.g. thread_join(), thread_print(), thread_sleep()
  • finish thread execution
  • keep track of R_PPStackTop
  • avoid SIGSEGV-faulting the R process?

Here’s an example where two functions are run in parallel R threads (it’s also available via thread::run_r_printing_example()):


library(thread)

thread_runner <- function (data) {
  thread_print(paste("thread", data, "starting\n"))
  for (i in 1:10) {
    timeout <- as.integer(abs(rnorm(1, 500, 1000)))
    thread_print(paste("thread", data, "iteration", i,
                       "sleeping for", timeout, "\n"))
    # thread_sleep() is one of the explicit context-switch points
    thread_sleep(timeout)
  }
  thread_print(paste("thread", data, "exiting\n"))
}

message("starting the first thread")
thread1 <- new_thread(thread_runner, 1)
message("starting the second thread")
thread2 <- new_thread(thread_runner, 2)
message("going to join() both threads")
# wait for both threads (assumes thread_join() takes the thread object)
thread_join(thread1)
thread_join(thread2)

And here’s the output from my Ubuntu 16.10 x64:

starting the first thread
[1] "thread_140737231587072"
starting the second thread
[1] "thread_140737223194368" "thread_140737231587072"
going to join() both threads
thread 1 starting
thread 1 iteration 1 sleeping for 144 
thread 2 starting
thread 2 iteration 1 sleeping for 587 
thread 1 iteration 2 sleeping for 761 
thread 2 iteration 2 sleeping for 1327 
thread 1 iteration 3 sleeping for 360 
thread 1 iteration 4 sleeping for 1802 
thread 2 iteration 3 sleeping for 704 
thread 2 iteration 4 sleeping for 463 
thread 1 iteration 5 sleeping for 368 
thread 2 iteration 5 sleeping for 977 
thread 1 iteration 6 sleeping for 261 
thread 1 iteration 7 sleeping for 323 
thread 1 iteration 8 sleeping for 571 
thread 2 iteration 6 sleeping for 509 
thread 2 iteration 7 sleeping for 2521 
thread 1 iteration 9 sleeping for 298 
thread 1 iteration 10 sleeping for 394 
thread 1 exiting
thread 2 iteration 8 sleeping for 966 
thread 2 iteration 9 sleeping for 533 
thread 2 iteration 10 sleeping for 1795 
thread 2 exiting

How far is this from real thread support in R? Well, there are three major challenges before this is really useful:

  • Context switch happens only when a function from this package is called explicitly
  • Memory allocation needs to be synchronized
  • Error handling runs into R_run_onexits which in turn throws a very nasty error message – this suggests I haven’t covered all features of the interpreter related to switching stacks

The first two issues are related: one cannot leave R (release the R interpreter lock) and enter an arbitrary C function, because it is legal to call allocVector() from any C/C++ code. Allocation in turn needs to happen synchronously: only one thread can execute allocVector() (or, more specifically, allocVector3()) at any given time. I think the best way to address this would be to patch R (main/memory.c) and introduce a pointer to allocVector3, similar to ptr_R_WriteConsole. The thread package would then inject a decorator for allocVector3 with additional synchronization logic.
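
As a rough illustration of that idea, here is a self-contained C sketch of a replaceable allocation hook with a locking decorator. Nothing here exists in R today: toy_alloc, ptr_toy_alloc and the other names are hypothetical, malloc stands in for allocVector3, and the only thing borrowed from R is the general shape of a ptr_R_WriteConsole-style function pointer.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical hook: callers always go through the pointer, so a
 * package can swap the implementation at run time. */
void *toy_default_alloc(size_t n) { return malloc(n); }
void *(*ptr_toy_alloc)(size_t) = toy_default_alloc;

/* What the rest of the code base would call (stand-in for allocVector). */
void *toy_alloc(size_t n) { return ptr_toy_alloc(n); }

/* State owned by the (hypothetical) thread package. */
static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;
static void *(*wrapped_alloc)(size_t) = NULL;
static long locked_calls = 0;

/* Decorator: same signature as the original, but serialized by a mutex. */
void *toy_synchronized_alloc(size_t n) {
    pthread_mutex_lock(&alloc_lock);
    locked_calls++;                     /* counted for demonstration only */
    void *p = wrapped_alloc(n);
    pthread_mutex_unlock(&alloc_lock);
    return p;
}

/* Install the decorator once; repeated calls are no-ops. */
void toy_install_synchronized_alloc(void) {
    if (ptr_toy_alloc == toy_synchronized_alloc)
        return;
    wrapped_alloc = ptr_toy_alloc;      /* keep the original to delegate to */
    ptr_toy_alloc = toy_synchronized_alloc;
}

long toy_locked_calls(void) { return locked_calls; }
```

Once toy_install_synchronized_alloc() has run, every toy_alloc() call goes through the mutex without any change at the call sites. A real patch would also have to cover every other allocation path in main/memory.c.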

The third issue is not clear to me yet, but it also suggests that more attention is needed to the specifics of R code execution.

I’ll be grateful for comments and suggestions. I think R could benefit from native thread support, if only to simplify program logic, but maybe also to run parts of computation-intensive code in a lightweight parallel manner.
