(A Very) Experimental Threading in R

December 11, 2016

(This article was first published on R – Random Remarks, and kindly contributed to R-bloggers)

I’ve been trying to find a way to introduce threads to R. There are several reasons to do that: simplified input/output logic, sending tasks to the background (e.g. building a model asynchronously), and running computation-intensive tasks in parallel (e.g. a parallel, chunk-wise var() on a large vector). Finally, it’s just a neat problem to look at 😉 I’m trying to follow an approach similar to Python’s global interpreter lock.

So far it seems that:

  • one can re-enter the interpreter with R_tryEval, which internally calls R_ToplevelExec, which in turn intercepts all long jumps (e.g. errors)
  • the interpreter performs a few basic checks to verify that the stack is in good shape, e.g. via R_CStackStart, which is used to check C stack usage, and R_PPStackTop, which tracks the objects under PROTECTion

I think that one can run multiple threads in R and maintain a separate interpreter “instance” in each of them. The R interpreter uses the stack for its bookkeeping, and each thread has its own stack. It also keeps count of the objects excluded from garbage collection with PROTECT. Thus, when coming back to a given R interpreter “instance” (after a thread-level context switch), one needs to reset R_PPStackTop to whatever value that thread was left with.
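The save-and-restore bookkeeping described above can be illustrated with a minimal, self-contained C sketch. It does not use R's headers and is not the code of the thread package: pp_stack_top is only a stand-in for R_PPStackTop, and interp_lock, thread_state, interp_enter(), interp_leave() and run_two_threads() are hypothetical names made up for this illustration. Each thread takes a single global lock before touching the "interpreter" state, restores its own protect-stack top on entry, and saves it again on exit.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Stand-in for R's global protect-stack index (R_PPStackTop). */
static int pp_stack_top = 0;

/* The single "interpreter lock": only its holder may touch pp_stack_top. */
static pthread_mutex_t interp_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical per-thread bookkeeping. */
typedef struct {
    int saved_pp_top; /* protect-stack top this thread was left with */
    int protects;     /* simulated PROTECT() calls per time slice */
    int slices;       /* number of times the thread enters the interpreter */
    int mismatches;   /* times the restored top was wrong; should stay 0 */
} thread_state;

/* Acquire the lock and restore this thread's view of the protect stack. */
static void interp_enter(thread_state *ts) {
    pthread_mutex_lock(&interp_lock);
    pp_stack_top = ts->saved_pp_top;
}

/* Save this thread's protect-stack top and release the lock. */
static void interp_leave(thread_state *ts) {
    ts->saved_pp_top = pp_stack_top;
    pthread_mutex_unlock(&interp_lock);
}

static void *worker(void *arg) {
    thread_state *ts = arg;
    int my_depth = 0; /* what this thread believes its stack depth is */
    for (int i = 0; i < ts->slices; i++) {
        interp_enter(ts);
        if (pp_stack_top != my_depth)
            ts->mismatches++;
        pp_stack_top += ts->protects; /* simulated PROTECT() calls */
        my_depth += ts->protects;
        interp_leave(ts); /* explicit switch point */
        sched_yield();    /* give the other thread a chance to run */
    }
    interp_enter(ts);
    pp_stack_top -= my_depth; /* simulated UNPROTECT(my_depth) */
    interp_leave(ts);
    return NULL;
}

/* Runs two workers; returns 0 if no thread ever saw a foreign stack top. */
int run_two_threads(void) {
    thread_state a = {0, 2, 100, 0};
    thread_state b = {0, 3, 100, 0};
    pthread_t ta, tb;
    int rc = pthread_create(&ta, NULL, worker, &a);
    assert(rc == 0);
    rc = pthread_create(&tb, NULL, worker, &b);
    assert(rc == 0);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return a.mismatches + b.mismatches + a.saved_pp_top + b.saved_pp_top;
}
```

If interp_enter() skipped the restore step, each thread would find the other thread's simulated PROTECT() calls on "its" stack and the mismatch counters would become non-zero. The real interpreter has more per-thread state than this one counter, so the sketch only shows the shape of the idea.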

I have put these ideas together in the form of an R package, thread (GitHub). This is what it can do:

  • start a new thread and execute an R function in its own interpreter
  • switch between threads on specific function calls, e.g. thread_join(), thread_print(), thread_sleep()
  • finish thread execution
  • keep track of R_PPStackTop
  • avoid SIGSEGV-faulting the R process 😉

Here’s an example where two functions are run in parallel R threads (it’s also available via thread::run_r_printing_example()):

library(thread)

# Body of each thread: announce itself, then print and sleep ten times.
thread_runner <- function (data) {
  thread_print(paste("thread", data, "starting\n"))
  for (i in 1:10) {
    # random, non-negative integer timeout
    timeout <- as.integer(abs(rnorm(1, 500, 1000)))
    thread_print(paste("thread", data, "iteration", i,
                       "sleeping for", timeout, "\n"))
    # thread_sleep() is one of the calls on which threads are switched
    thread_sleep(timeout)
  }
  thread_print(paste("thread", data, "exiting\n"))
}

message("starting the first thread")
thread1 <- new_thread(thread_runner, 1)
print(ls(threads))

message("starting the second thread")
thread2 <- new_thread(thread_runner, 2)
print(ls(threads))

# wait for both threads to finish
message("going to join() both threads")
thread_join(thread1)
thread_join(thread2)

And here’s the output on my Ubuntu 16.10 x64 machine:

starting the first thread
[1] "thread_140737231587072"
starting the second thread
[1] "thread_140737223194368" "thread_140737231587072"
going to join() both threads
thread 1 starting
thread 1 iteration 1 sleeping for 144 
thread 2 starting
thread 2 iteration 1 sleeping for 587 
thread 1 iteration 2 sleeping for 761 
thread 2 iteration 2 sleeping for 1327 
thread 1 iteration 3 sleeping for 360 
thread 1 iteration 4 sleeping for 1802 
thread 2 iteration 3 sleeping for 704 
thread 2 iteration 4 sleeping for 463 
thread 1 iteration 5 sleeping for 368 
thread 2 iteration 5 sleeping for 977 
thread 1 iteration 6 sleeping for 261 
thread 1 iteration 7 sleeping for 323 
thread 1 iteration 8 sleeping for 571 
thread 2 iteration 6 sleeping for 509 
thread 2 iteration 7 sleeping for 2521 
thread 1 iteration 9 sleeping for 298 
thread 1 iteration 10 sleeping for 394 
thread 1 exiting
thread 2 iteration 8 sleeping for 966 
thread 2 iteration 9 sleeping for 533 
thread 2 iteration 10 sleeping for 1795 
thread 2 exiting

How far is this from real thread support in R? Well, there are three major challenges to address before this is really useful:

  1. A context switch happens only when a function from this package is called explicitly
  2. Memory allocation needs to be synchronized
  3. Error handling runs into R_run_onexits, which in turn throws a very nasty error message; this suggests I haven’t covered all features of the interpreter related to switching stacks

Issues #1 and #2 are related: one cannot leave R (release the R interpreter lock) and enter an arbitrary C function, because it is legal to call allocVector() from any C/C++ code. Allocation in turn needs to happen synchronously: only one thread can execute allocVector() (or, more specifically, allocVector3()) at any given time. I think that the best way to address this would be to patch R (main/memory.c) and introduce a pointer to allocVector3, similar to ptr_R_WriteConsole. The thread package would then inject a decorator for allocVector3 with additional synchronization logic.
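To make the pointer-plus-decorator idea more concrete, here is a minimal C sketch of the pattern. It is not a patch to R: default_alloc() is only a stand-in for allocVector3(), and ptr_alloc, alloc_vector(), locked_alloc(), install_locked_alloc() and run_alloc_demo() are hypothetical names invented for this illustration. The allocator updates shared bookkeeping without any locking, all callers reach it through a function pointer, and a wrapper installed into that pointer adds a mutex around every call.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Shared bookkeeping that the stand-in allocator updates without locking. */
static size_t bytes_allocated = 0;

/* Stand-in for allocVector3(): not safe to call from several threads. */
static void *default_alloc(size_t n) {
    bytes_allocated += n; /* a data race if called concurrently */
    return malloc(n);
}

/* The hook: callers never call default_alloc() directly. */
static void *(*ptr_alloc)(size_t) = default_alloc;

/* Public entry point, playing the role of allocVector(). */
void *alloc_vector(size_t n) {
    return ptr_alloc(n);
}

/* Decorator state: the lock and the function being wrapped. */
static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;
static void *(*wrapped_alloc)(size_t) = NULL;

/* Decorator: same signature, but only one thread allocates at a time. */
static void *locked_alloc(size_t n) {
    pthread_mutex_lock(&alloc_lock);
    void *p = wrapped_alloc(n);
    pthread_mutex_unlock(&alloc_lock);
    return p;
}

/* Install the decorator; call this before any threads are started. */
void install_locked_alloc(void) {
    if (ptr_alloc == locked_alloc)
        return; /* already installed */
    wrapped_alloc = ptr_alloc;
    ptr_alloc = locked_alloc;
}

static void *alloc_worker(void *arg) {
    int count = *(int *)arg;
    for (int i = 0; i < count; i++)
        free(alloc_vector(8)); /* allocate 8 bytes, release immediately */
    return NULL;
}

/* Starts nthreads workers; returns the total number of bytes recorded. */
size_t run_alloc_demo(int nthreads, int per_thread) {
    pthread_t t[8];
    assert(nthreads >= 1 && nthreads <= 8);
    install_locked_alloc();
    bytes_allocated = 0;
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&t[i], NULL, alloc_worker, &per_thread);
        assert(rc == 0);
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    return bytes_allocated;
}
```

With the decorator installed, run_alloc_demo(4, 10000) records exactly 4 * 10000 * 8 bytes; without it, concurrent updates to the counter could be lost. The appeal of the pointer is that code calling alloc_vector() does not need to change, which is why a hook of this kind would let a package add locking without touching every caller.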

Issue #3 is not clear to me yet, but it also suggests that more attention is needed to the specifics of R code execution.

I’ll be grateful for comments and suggestions. I think R could benefit from native thread support, if only to simplify program logic, but maybe also to run parts of computation-intensive code in a lightweight, parallel manner.
