
Have you tried synchronizing R processes? I did and it wasn’t straightforward. In fact, I ended up creating a new package – flock.

One of the improvements I made not too long ago to my R back-testing infrastructure was to start using a database to store the results. This way I can compute all interesting models once (see the "ARMA Models for Trading" series for an example) and store the relevant information (mean forecast, variance forecast, AIC, etc.) in the database. Then I can test whatever I want without further heavy lifting.

Easily said, but it took me some time to implement in practice. My database of choice was SQLite, which doesn't require a database server and stores the entire database in a single file. Choosing SQLite was probably part of the problem: it turned out that SQLite doesn't support simultaneous database updates, i.e. the synchronization is left to the user. Here is a stripped-down version of a program which often FAILS on multi-core systems:

require(RSQLite)
require(parallel)

# Create the database with a single "test" table
db.path <- tempfile()
con <- dbConnect(RSQLite::SQLite(), dbname=db.path)
df <- data.frame(value=0)
dbWriteTable(con, "test", df)
dbDisconnect(con)

write.one.value <- function(val) {
   con <- dbConnect(RSQLite::SQLite(), dbname=db.path)
   dbWriteTable(con, "test", data.frame(value=val), append=TRUE)
   dbDisconnect(con)
}

# Two workers appending to the same database concurrently - this often fails
mclapply(1:100, write.one.value, mc.cores=2)


After some generous debugging, I found that the problem was the parallel writing in write.one.value: two processes trying to update the database at the same time.

My first impulse was to find some scripting solution, using R's system call, but that didn't work very well. My next plan was to add the synchronization support to RSQLite itself, but that seemed both too limiting (synchronization can be useful elsewhere) and complicated. So I moved on to the next solution: execute the write.one.value code within a critical section (take an exclusive lock at the beginning, release it at the end).

Before rolling my own package, I decided to check what's on CRAN, and I discovered the synchronicity package. A bit heavy for my taste (it depends on the C++ Boost libraries), but so what. It seemed to work at first, so I ran some of my long-ish simulations.

A day later, I observed something strange: the number of running processes had dropped, and the simulation seemed to have hung. Some more debugging revealed that the underlying Boost libraries were throwing an exception, and things went sour from there. Fed up with the debugging and the time wasted, I went back to the approach of writing my own package.

The result was the flock package. The working version of the above code follows:

require(RSQLite)
require(parallel)
require(flock)

db.path <- tempfile()
con <- dbConnect(RSQLite::SQLite(), dbname=db.path)
df <- data.frame(value=0)
dbWriteTable(con, "test", df)
dbDisconnect(con)

write.one.value <- function(val, lock.name) {
   # Take an exclusive lock
   ll <- lock(lock.name)

   con <- dbConnect(RSQLite::SQLite(), dbname=db.path)
   dbWriteTable(con, "test", data.frame(value=val), append=TRUE)
   dbDisconnect(con)

   # Release the lock
   unlock(ll)
}

lock.name <- tempfile()
# or lock.name <- "~/file.lock"

mclapply(1:100, write.one.value, lock.name=lock.name, mc.cores=2)
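One caveat with this pattern: if dbWriteTable throws an error, the unlock call is never reached and the lock stays held. A safer variant (a sketch of my own, using only the lock/unlock calls shown above; write.one.value.safe is a hypothetical name) registers the release with on.exit so the lock is freed no matter how the function returns:

```r
require(RSQLite)
require(flock)

# Hypothetical safer variant of write.one.value: on.exit() guarantees
# the lock is released even if the database write fails
write.one.value.safe <- function(val, db.path, lock.name) {
   ll <- lock(lock.name)
   on.exit(unlock(ll))

   con <- dbConnect(RSQLite::SQLite(), dbname=db.path)
   dbWriteTable(con, "test", data.frame(value=val), append=TRUE)
   dbDisconnect(con)
}
```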


With RStudio and Rcpp, the package development was a breeze. Besides database access, I have started using the package to protect logging to files and other similar tasks. It's new, but by the time you are reading this post I will probably have executed millions of synchronizations with it, so it should be pretty stable. The only downside: Windows is not supported. I simply cannot afford the time for that at the moment (it may work out of the box, but that's far from certain).
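To illustrate the logging use case (the function name and paths here are my own, not part of flock): the same lock/unlock pair guards the file append, so lines written from different processes don't interleave.

```r
require(flock)

# Append a timestamped line to a shared log file; the exclusive lock
# prevents concurrent processes from interleaving partial lines
log.message <- function(msg, log.path, lock.name) {
   ll <- lock(lock.name)
   on.exit(unlock(ll))
   cat(format(Sys.time()), msg, "\n", file=log.path, append=TRUE)
}
```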

To install the package:

install.packages("flock", repos="http://R-Forge.R-project.org")