One of the more important pieces of software that powers R is its BLAS, which stands for Basic Linear Algebra Subprograms. This is the suite of programs which, as its name implies, performs basic linear algebra routines such as vector copying, scaling and dot products; linear combinations; and matrix multiplication. It is the engine on which LINPACK, EISPACK, and the now industry-standard LAPACK and its variants are built. As a lower-level software suite, the BLAS can be fine-tuned to different computer architectures, and a tuned BLAS can create a dramatic speedup in many fundamental R routines. There are a number of projects developing tuned BLAS suites such as the open source ATLAS, GotoBLAS2, and OpenBLAS, and the closed source Intel MKL and AMD ACML libraries.
In ‘R’ itself, the BLAS is stored in the shared library file Rblas.dll, and a properly compiled tuned version can just be “dropped in” to the bin subdirectory and used immediately. For people working in Linux, there is significant support for specially tuned BLAS (and LAPACK) files, which can be found in great detail in the R Installation manual—the support for Windows is somewhat less robust, to be charitable. For many years, I was constrained to working in a 32-bit Windows environment, and took advantage of the 32-bit Windows compiled versions that existed in R’s repository of tuned BLAS files. However, at that time, the most recent version was for a Pentium 4, so over the next few years, I struggled and compiled ATLAS-based Rblas’s for Windows 32-bit. I did this for the Core 2 Duo (C2D) and the quad-core SandyBridge (Core i7 called C2i7 in the repository). The speedup was dramatic (see this blog, which uses the Core 2 BLAS I compiled, for an example) However, I found ATLAS to be difficult to compile, subject to any number of issues, and not really being a programmer, I often ran into issues I could not solve. Moreover, I was never able to successfully compile a 64-bit BLAS which passed the comprehensive make check-all test suite.
Recently, I made the complete switch to 64 bit Windows at both work and home so finding a 64-bit Rblas become much more important. There are some pre-compiled 64-bit BLAS files for R, graciously compiled by Dr. Ei-ji Nakama, which can be found here. Using these binaries, I found a dramatic increase in speed over the reference BLAS. However, the most recent processor-specific BLAS in that repository is for the Nehalem architecture, and cannot take advantage of the SSE4 and AVX operations built into the new SandyBridge and IvyBridge cores. Much to my frustration, I had many failures trying to compile a 64-bit ATLAS-based BLAS which, when compiled in R, would pass all the checks. With each try taking between six and a dozen hours, I eventually gave up trying to use ATLAS and resigned myself to living with the GotoBLAS-based files—which honestly was not much of a resignation.
A bit more recently, came across the OpenBLAS project, which claims to have reached near-MKL speeds on the Sandy/IvyBridge architecture due to some elegant hand-coding and optimization, and I was hoping to be able to take advantage of this in R. Unfortunately, there are no pre-compiled binaries to make use of, and so I had to attempt the compilation on my own. What made this a bit more difficult is that officially, R for Windows does not support using Goto or ACML based BLAS routines, and even ATLAS has limited support (see this thread, specifically Dr. Ripley’s response). This called for a lot of trial and error, originally resulting in dismal failure.
Serendipitously, around the time of the 3.0.1 release, there was an OpenBLAS update as well. Trying again, I was finally successful in compiling a single-threaded, OpenBLAS v2.8 based BLAS for the SandyBridge architecture on Windows 64 bit that, when used in the R compilation, created an Rblas that passed make-check all! For those interested in compiling their own, once the OpenBLAS is compiled, it can be treated as an ATLAS blas in R’s Makefile.local, with the only additional change being pointing to the compiled .a file in \src\extra\blas\Makefile.win.
Once I was successful with the SandyBridge-specific file, I compiled an Rblas that was not Sandy-Bridge dependent, but could be used on any i386 machine. I plan on submitting both to Dr. Uwe Ligges at R, and hope that, like the other BLAS’s I submitted, they will be posted eventually.
To demonstrate the increase in speed that using a tuned BLAS can provide in R, I ran a few tests. First, I created two 1000×1000 matrices populated with random normal variables. In specific (I don’t know why the first assignment operator has an extra space, I’ve seen that before when posting to WordPress):
A <- matrix(rnorm(1e6, 1000, 100), 1000, 1000) B <- matrix(rnorm(1e6, 1000, 100), 1000, 1000) write.csv(A, file="C:/R/A.csv", row.names=FALSE) write.csv(B, file="C:/R/B.csv", row.names=FALSE)
I can provide the specific matrices for anyone interested. I then compiled a basically vanilla R 3.0.2 for 64 bit windows, using only
-mtune=corei7-avx -O3 for optimizations, so the code should run on any i386. I followed the compilation steps for a full installation, so it included base, bitmapdll, cairodevices, recommended, vignettes, manuals, and rinstaller for completeness. Using a Dell M4700 (i7-3740 QM 2.7Ghz, Windows 7 Professional 64bit, 8GB RAM) I tested the following BLAS’s:
- GotoBLAS Dynamic
- GotoBLAS Nehalem
- OpenBLAS Dynamic
- OpenBLAS SandyBridge
I updated all packages and installed the
microbenchmark package (all from source). To test the effects of the different BLAS’s I renamed and copied the appropriate blas to Rblas.dll each time. The test ran multiple copies of crossprod, solve, qr, svd, eigen, and lu (the last needs the Matrix package, but it is a recommended package). The actual test code is:
library(microbenchmark) library(Matrix) A <- as.matrix(read.csv(file="C:/R/BLAS/A.csv", colClasses='numeric')) B <- as.matrix(read.csv(file="C:/R/BLAS/B.csv", colClasses='numeric')) colnames(A) <- colnames(B) < - NULL microbenchmark(crossprod(A,B), solve(A), qr(A, LAPACK=TRUE), svd(A), eigen(A), lu(A), times=100L, unit='ms')
The results are illuminating:
Reference BLAS: Unit: milliseconds expr min lq median uq max neval crossprod(A, B) 1173.4761 1184.7500 1190.9430 1198.4409 1291.9620 100 solve(A) 1000.1138 1011.5769 1018.7613 1027.0789 1172.6551 100 qr(A, LAPACK = TRUE) 625.3756 633.2889 638.0123 644.8645 713.7831 100 svd(A) 3045.0855 3074.7472 3093.8418 3138.0180 3621.9590 100 eigen(A) 5491.4986 5563.1129 5586.9327 5632.5695 5836.6824 100 lu(A) 170.8486 172.5362 175.5632 178.7300 247.1116 100
GotoBLAS_Dynamic Unit: milliseconds expr min lq median uq max neval crossprod(A, B) 63.68471 83.16222 92.33265 104.5118 184.4941 20 solve(A) 207.38801 229.81548 250.53656 276.6736 391.5336 20 qr(A, LAPACK = TRUE) 171.93558 175.22074 178.57133 186.3217 996.7584 20 svd(A) 892.90194 920.55781 969.79708 1069.5740 12185.9414 20 eigen(A) 14870.65481 14943.13804 15113.49240 15513.8285 29100.0788 20 lu(A) 70.65831 76.07561 83.52785 146.1552 152.5949 20
GotoBLAS_Nehalem Unit: milliseconds expr min lq median uq max neval crossprod(A, B) 52.86457 64.42491 73.75266 77.89007 82.05677 20 solve(A) 203.96266 209.78135 220.36275 229.09164 298.06924 20 qr(A, LAPACK = TRUE) 171.22182 174.58935 175.90882 181.63144 247.92046 20 svd(A) 895.20834 904.17256 950.78584 970.47127 1057.90653 20 eigen(A) 15429.49102 15470.62708 15552.95074 15627.39700 15759.67103 20 lu(A) 67.13746 72.27017 74.26725 77.52482 85.24281 20
OpenBLAS_Dynamic Unit: milliseconds expr min lq median uq max neval crossprod(A, B) 102.54157 102.99851 104.42070 105.95056 177.8572 100 solve(A) 180.64641 182.99257 186.41507 188.91836 261.6805 100 qr(A, LAPACK = TRUE) 188.08381 194.37999 197.10319 199.66507 281.2404 100 svd(A) 1027.00070 1040.84424 1055.35816 1108.67206 1193.3480 100 eigen(A) 2114.96652 2190.11265 2208.06810 2250.41861 2476.9335 100 lu(A) 57.47394 58.53336 59.80013 62.60323 139.1076 100
OpenBLAS_SandyBridge Unit: milliseconds expr min lq median uq max neval crossprod(A, B) 102.41031 102.71278 104.14695 105.4141 179.8048 100 solve(A) 180.52238 181.97006 184.94760 187.6423 260.5379 100 qr(A, LAPACK = TRUE) 186.89180 191.01798 193.69381 199.2825 286.8123 100 svd(A) 1024.47896 1034.60627 1043.91214 1099.8361 1153.3881 100 eigen(A) 2108.22006 2172.05087 2192.65913 2216.7981 2375.9831 100 lu(A) 57.55384 59.10672 61.08402 62.9213 133.6456 100
All the tuned BLAS results are much better than the reference, with the exception of eigenvalue decomposition for the GotoBLAS-based Rblas. I do not know why that is the case, but the difference was so severe that I had to run it only 20 times to have results in reasonable time. For the other routines, sometimes the OpenBLAS based version is quicker, other times not. I personally will use it exclusively as the minor lag in comparison to some of the GotoBLAS timings is more than compensated for by the eigenvalue speedup, and overall, it is still anywhere between 3 and 10 times faster than the reference BLAS. Regardless, it is clear that using a tuned BLAS can speed up calculations considerably.
Using a tuned BLAS is not the only way to increase R’s speed. For anyone compiling R for themselves, throwing the proper flags for R in its compilation can squeeze a bit more speed out if it, and activating byte-compilation can do a bit more. In my next post, I hope to show similar timing numbers, but this time, using an R compiled for my specific machine (Ivy Bridge) in concert with the tuned BLAS.