Performance benefits of linking R to multithreaded math libraries
R wasn’t originally designed as a multithreaded application (multiprocessor systems were still rare when the R Project was first conceived in the mid-1990s), so by default R will use only one processor of your dual-core laptop or quad-core desktop machine when doing calculations. For long-running calculations, like big simulations or models of large data sets, it would be nice to put those other processors to work to speed up the computations. There are several parallel processing libraries for R that allow you to explicitly run loops simultaneously (ideally, each iteration on a different processor), but using them does require you to rewrite your code accordingly.
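As a taste of what that rewriting looks like, here is a sketch using the foreach and doSMP packages mentioned later in this post (this assumes both packages are installed; doSMP ships with Revolution R, and the worker count of 4 is an arbitrary example):

```r
# A sketch of explicit parallelism with foreach and the doSMP backend.
library(foreach)
library(doSMP)

workers <- startWorkers(workerCount = 4)  # launch 4 worker processes
registerDoSMP(workers)

# Each iteration can run on a different processor; note the rewritten
# loop syntax (%dopar%) compared to an ordinary for loop.
results <- foreach(i = 1:8) %dopar% sqrt(i)

stopWorkers(workers)
```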
But there is a way to make use of all your processing power for many computations in R without changing a line of code. That’s because R is a statistical computing system, and at the heart of many of the algorithms you use on a daily basis (data restructuring, regressions, classifications, even some graphics functions) is linear algebra. The data are transformed into vector and matrix objects, and the internals of R have been cleverly designed to link to a standard “BLAS” API to perform calculations on vectors and matrices. The binaries provided by the R Core Group on CRAN (with one exception; see below) are linked to an “internal BLAS which is well-tested and will be adequate for most uses of R”, but that BLAS is not multithreaded and so uses only one core. The beauty of linking to the BLAS API, though, is that you can recompile R against a different, multithreaded BLAS library and, voilà, suddenly many computations use all your cores and run much faster.
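You can see this for yourself by timing a matrix operation directly; the elapsed time depends almost entirely on which BLAS library R is linked against (the matrix size below is an arbitrary choice; scale it to your machine):

```r
# crossprod(m) computes t(m) %*% m through a single BLAS call,
# so its speed reflects the BLAS implementation R is linked to.
set.seed(42)
m <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(crossprod(m))  # compare elapsed times across BLAS builds
```

Run the same snippet under an R build linked to the internal BLAS and one linked to a multithreaded BLAS, and the difference in the "elapsed" column tells the whole story.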
The Mac OS port of R on CRAN is linked to ATLAS, a “tuned” BLAS that uses multiple cores for computations. As a result, R on a multi-core Mac (as all new Macs are these days) really zooms. The Windows binaries on CRAN, however, are not linked to an optimized BLAS. It’s possible to compile and link R against one yourself, but it can be tricky.
That’s what we do at Revolution for our Windows and Linux distributions of Revolution R. When we compile R, we link it to the Intel Math Kernel Library (MKL), which includes a high-performance BLAS implementation tuned to multi-core Intel chips. “Tuning” here means using efficient algorithms, optimized assembly code that exploits features of the chipset, and multithreaded algorithms that use all cores simultaneously. As a result, you get some serious speed boosts for many operations in R, especially on a multi-core system. Here are some examples:
[Benchmark chart: elapsed times for several matrix operations under R's standard BLAS vs. Revolution R with 1 and 4 cores]
As you can see, using the Intel MKL libraries on a 4-core machine gives some dramatic speedups (roughly a quarter of the 1-core time, as you might expect). Perhaps more surprisingly, the Intel MKL libraries on a 1-core machine are also faster than R’s standard BLAS library: that gain comes from the optimized algorithms, not from additional computing power. You even get improvements on non-Intel chipsets (like AMD).
[A side note: these calculations were actually all run on an 8-core machine, specifically an Intel Xeon 8-core CPU with 18 GB of system RAM running the Windows Server 2008 operating system. The complete benchmark code is available on this page. The results for Revolution R with 1 core and 4 cores were obtained by restricting the Intel MKL library to 1 thread and 4 threads, using the Revolution R-specific commands setMKLthreads(1) and setMKLthreads(4) respectively. This has the effect of using only the power of the specified number of cores, even when more cores are available. Note: if you’re using Revolution R and are doing explicit parallel programming with doSMP, it’s a good idea to call setMKLthreads(1) first. Otherwise, your parallel loops and the multithreaded linear algebra computations will compete for the same processors and actually degrade performance.]
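In Revolution R, the kind of thread restriction used for these benchmarks looks like the sketch below (setMKLthreads() exists only in Revolution R, and the matrix size is an arbitrary example):

```r
# Compare a BLAS-bound operation with 1 MKL thread vs. 4.
# setMKLthreads() is specific to Revolution R.
m <- matrix(rnorm(2000 * 2000), nrow = 2000)

setMKLthreads(1)
t1 <- system.time(crossprod(m))["elapsed"]

setMKLthreads(4)
t4 <- system.time(crossprod(m))["elapsed"]

c(one.thread = t1, four.threads = t4)
```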
These results are dramatic, but multithreaded BLAS libraries aren’t a panacea. Not all R commands ultimately call BLAS code, even ones you might expect. (For example, lm for regression uses a non-BLAS QR decomposition by default. Edit: as pointed out by Doug Bates, lm and glm end up calling the older LINPACK routines, which use level-1 (vector-vector) BLAS, instead of the newer LAPACK routines based on level-3 (matrix-matrix) BLAS, because of the need to handle certain rank-deficient cases cleanly.) And if your R code ultimately does not involve linear algebra, you can’t expect any improvement at all. (For example, the “Program Control” R benchmarks by Simon Urbanek show only marginal performance gains in Revolution R.) This is where explicit parallel programming is the route to improved performance. We’re also working on dedicated statistical routines for Revolution R Enterprise that are explicitly multithreaded for single machines and can also be distributed across multiple machines in a cluster or in the cloud, but that’s a topic for another post.
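You can observe the lm limitation on your own machine with a sketch like the one below (the dimensions are arbitrary). It times lm.fit’s QR-based fit against solving the normal equations, which routes the heavy lifting through level-3 BLAS; the normal-equations approach is shown only to illustrate where a tuned BLAS helps, as it is less numerically robust than QR for ill-conditioned problems:

```r
n <- 100000; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

# QR fit via the LINPACK path: mostly level-1 BLAS, so little benefit
# from a multithreaded BLAS
system.time(fit.qr <- lm.fit(X, y))

# Normal equations: crossprod() and solve() are level-3 BLAS/LAPACK
# calls, so a tuned, multithreaded BLAS speeds these up substantially
system.time(beta <- solve(crossprod(X), crossprod(X, y)))
```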
Revolution Analytics: Performance Benchmarks