R-3.1.0 + OpenBLAS Speed Comparisons
With the recent release of R-3.1.0, and the near-recent release of OpenBLAS 0.2.9.rc2, it was time to recompile Rblas.dll and do some new speed tests. The test computer has an Intel i7-2600K overclocked to 4.6 GHz with 16 GB of RAM, running Windows 7 Home Premium 64-bit. For the first test, a plain-vanilla reference R was used, compiled with no changes to the default optimization settings. The reference BLAS was tested first, followed by single- and multi-threaded (6 threads max) OpenBLAS-based builds of Rblas.dll. The tests use two 1000 x 1000 dense matrices, A and B. Matrix A has been made positive-definite using nearPD, while B is not positive-definite. The tests consist of the following common matrix procedures:
- crossprod(A, B) – R-optimized multiplication
- solve(A) – standard inversion
- chol(A) – regular Cholesky decomposition
- chol(B, pivot = TRUE) – Cholesky decomposition using pivoting
- qr(A, LAPACK = TRUE) – QR decomposition
- svd(A) – singular-value decomposition
- eigen(A, symmetric = FALSE) – eigenvalue decomposition with all real eigenvectors
- eigen(B, symmetric = FALSE) – eigenvalue decomposition with complex eigenvectors
- lu(A) – general triangular decomposition (requires the Matrix package)
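As a sketch of how such a setup could be reproduced (the sizes and names here are illustrative only; the original matrices were 1000 x 1000 and were read from CSV, with microbenchmark doing the timing):

```r
# Illustrative sketch of the test setup; sizes reduced for speed.
# The Matrix package (which ships with R) provides nearPD.
library(Matrix)

set.seed(42)
n <- 200                             # the real tests used n = 1000
B <- matrix(rnorm(n * n), nrow = n)  # not positive-definite
A <- as.matrix(nearPD(B)$mat)        # nearest positive-definite matrix

# Sanity checks before timing anything:
stopifnot(isSymmetric(A))
ev <- eigen(A, symmetric = TRUE, only.values = TRUE)$values
stopifnot(min(ev) > -1e-8)           # all eigenvalues (numerically) positive

# A single timed run of one operation; the actual tests used
# microbenchmark with 100 replications per expression.
system.time(crossprod(A, B))
```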
The results speak for themselves:
#Reference (compiled with no extra flags)
Unit: milliseconds
expr min lq median uq max neval
crossprod(A, B) 699.054 705.897 709.745 725.500 791.222 100
solve(A) 599.579 603.419 608.564 624.725 681.806 100
chol(A) 114.970 115.956 116.074 117.700 175.176 100
chol(B, pivot = TRUE) 1.687 2.445 2.496 2.623 54.076 100
qr(A, LAPACK = TRUE) 414.195 415.801 417.635 423.346 474.681 100
svd(A) 1944.341 1959.048 1981.735 2006.881 2127.501 100
eigen(A, symmetric = FALSE) 2982.008 2995.499 3010.713 3044.907 3225.888 100
eigen(B, symmetric = FALSE) 3959.382 3979.049 4000.352 4027.309 4164.430 100
lu(A) 137.275 138.398 139.865 142.096 219.666 100

#Reference R with single-thread BLAS
Unit: milliseconds
expr min lq median uq max neval
crossprod(A, B) 58.770 60.133 60.223 60.371 111.98 100
solve(A) 105.092 106.103 107.227 108.957 158.80 100
chol(A) 15.241 16.307 16.369 16.439 68.58 100
chol(B, pivot = TRUE) 1.711 2.468 2.528 2.644 55.98 100
qr(A, LAPACK = TRUE) 110.604 113.013 113.985 114.999 164.41 100
svd(A) 542.648 549.262 552.076 595.846 644.62 100
eigen(A, symmetric = FALSE) 972.696 980.050 981.604 984.273 1035.89 100
eigen(B, symmetric = FALSE) 1264.458 1281.789 1284.065 1287.032 1345.26 100
lu(A) 31.741 32.488 33.639 35.337 85.03 100

#Reference R with multi-threaded BLAS
Unit: milliseconds
expr min lq median uq max neval
crossprod(A, B) 22.570 24.134 24.781 27.447 80.523 100
solve(A) 65.360 67.267 69.081 71.637 125.201 100
chol(A) 25.378 26.068 26.673 28.143 79.746 100
chol(B, pivot = TRUE) 2.235 2.843 2.987 3.347 7.422 100
qr(A, LAPACK = TRUE) 122.267 125.120 126.808 133.940 187.129 100
svd(A) 534.880 549.375 565.313 589.973 644.990 100
eigen(A, symmetric = FALSE) 2059.984 2091.719 2127.908 2186.482 2433.343 100
eigen(B, symmetric = FALSE) 5396.165 5644.925 5766.819 5875.664 6238.816 100
lu(A) 35.095 36.547 37.643 39.668 91.663 100
Using an optimized BLAS provides impressive speed gains. Interestingly, for the simple matrix operations the multi-threaded BLAS provided a noticeable further gain in speed, but this was far outweighed by its dismal performance in base R’s eigenvalue routines. Not only did the multi-threaded BLAS perform much worse than the single-threaded BLAS; in the complex case, it performed worse than the reference BLAS! I have no idea why this should be the case, but it does explain why the GotoBLAS-based Rblas from this earlier post performed so poorly in the eigen test: GotoBLAS is also a multi-threaded BLAS.
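For anyone who wants to compare thread counts without swapping in a different Rblas.dll, one possibility (not used for the tests in this post, and assuming an OpenBLAS-linked R) is to pin the thread count, either via the OPENBLAS_NUM_THREADS environment variable before R starts, or at runtime with the RhpcBLASctl package:

```r
# Runtime thread control via the RhpcBLASctl package; this is an
# alternative I did not use for the benchmarks above.  The
# environment-variable route (OPENBLAS_NUM_THREADS=1 set before
# launching R) also works, but must happen before the BLAS initializes.
if (requireNamespace("RhpcBLASctl", quietly = TRUE)) {
  RhpcBLASctl::blas_set_num_threads(1)  # single-threaded was fastest for eigen()
  cat("BLAS can use", RhpcBLASctl::blas_get_num_procs(), "core(s)\n")
}
```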
The next test was to see whether compiling R itself with optimization flags helps at all. I didn’t recompile the reference R, but for the single- and multi-threaded cases I compiled R against those BLAS builds while passing these optimization flags: -march=corei7-avx -O3 -std=gnu++0x --param l1-cache-line-size=64 --param l1-cache-size=64 --param l2-cache-size=256. To these two tests I also added the default multiplication t(A) %*% B for comparison.
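For reference, crossprod(A, B) computes the same product as t(A) %*% B but without materializing the explicit transpose, which is what makes the head-to-head timing comparison meaningful. A quick check with small illustrative matrices confirms the two expressions agree:

```r
# crossprod(A, B) is mathematically t(A) %*% B; it skips the explicit
# transpose and dispatches straight to the BLAS.  Small matrices here
# purely for illustration.
set.seed(1)
A <- matrix(rnorm(400 * 400), nrow = 400)
B <- matrix(rnorm(400 * 400), nrow = 400)
stopifnot(isTRUE(all.equal(crossprod(A, B), t(A) %*% B)))
```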
#Single-Thread (compiled with -march=corei7-avx -O3 -std=gnu++0x --param l1-cache-line-size=64 --param l1-cache-size=64 --param l2-cache-size=256)
Unit: milliseconds
expr min lq median uq max neval
t(A) %*% B 67.955 69.07 69.44 71.697 125.06 500
crossprod(A, B) 59.216 60.13 60.24 61.911 125.02 500
solve(A) 105.052 106.13 108.00 109.554 164.32 500
chol(A) 15.245 16.32 16.38 16.863 71.60 500
chol(B, pivot = TRUE) 1.691 2.48 2.54 2.672 55.60 500
qr(A, LAPACK = TRUE) 111.689 113.59 114.56 117.898 186.68 500
svd(A) 533.361 548.97 552.33 574.644 680.55 500
eigen(A, symmetric = FALSE) 925.477 934.73 939.20 962.548 1095.85 500
eigen(B, symmetric = FALSE) 1163.962 1181.89 1187.34 1211.832 1389.74 500
lu(A) 31.656 32.63 33.76 35.462 98.06 500

#Multi-Thread (compiled with -march=corei7-avx -O3 -std=gnu++0x --param l1-cache-line-size=64 --param l1-cache-size=64 --param l2-cache-size=256)
Unit: milliseconds
expr min lq median uq max neval
t(A) %*% B 30.972 33.085 33.872 36.458 97.72 500
crossprod(A, B) 22.500 23.955 24.344 26.365 83.47 500
solve(A) 62.616 67.140 68.797 70.904 146.53 500
chol(A) 25.300 26.152 26.771 28.338 86.29 500
chol(B, pivot = TRUE) 2.045 2.859 3.014 3.443 59.72 500
qr(A, LAPACK = TRUE) 120.004 124.588 127.109 131.142 199.33 500
svd(A) 529.964 546.345 557.353 585.517 714.93 500
eigen(A, symmetric = FALSE) 2021.299 2047.496 2084.340 2125.387 2493.60 500
eigen(B, symmetric = FALSE) 5242.572 5571.571 5640.380 5732.441 6451.08 500
lu(A) 34.785 36.625 38.006 39.451 98.82 500

For the matrix operations that use the BLAS, the optimization flags do not seem to make much of a difference. The same holds for operations that do not use the BLAS. Using the same matrices A and B as inputs, the following test was run directly upon opening R, with results following:
library(microbenchmark)
A <- as.matrix(read.csv(file = "F:/R/A.csv", colClasses = 'numeric'))
B <- as.matrix(read.csv(file = "F:/R/B.csv", colClasses = 'numeric'))
colnames(A) <- colnames(B) <- NULL
Z <- microbenchmark(A + 2, A - 2, A * 2, A / 2, A + B, A - B, A * B, A / B,
                    A ^ 2, sqrt(A), control = list(order = 'block'), times = 1000L)
#Reference (compiled with no extra flags)
Unit: milliseconds
expr min lq median uq max neval
A + 2 1.779 1.811 1.823 3.730 12.84 1000
A - 2 1.784 1.812 1.825 3.729 12.77 1000
A * 2 1.785 1.818 1.842 3.733 13.47 1000
A/2 3.112 3.126 3.138 4.835 14.15 1000
A + B 2.117 2.142 2.158 4.124 13.10 1000
A - B 2.111 2.140 2.155 4.120 13.42 1000
A * B 2.116 2.144 2.164 4.124 13.87 1000
A/B 5.673 5.692 5.704 7.374 16.72 1000
A^2 1.789 1.812 1.823 3.729 12.76 1000
sqrt(A) 7.939 7.955 7.968 9.630 18.98 1000

#Single-Thread (compiled with -march=corei7-avx -O3 -std=gnu++0x --param l1-cache-line-size=64 --param l1-cache-size=64 --param l2-cache-size=256)
Unit: milliseconds
expr min lq median uq max neval
A + 2 1.760 1.805 1.816 3.741 12.79 1000
A - 2 1.775 1.809 1.825 3.742 13.35 1000
A * 2 1.786 1.810 1.823 3.743 12.82 1000
A/2 3.115 3.124 3.131 4.844 14.15 1000
A + B 2.119 2.140 2.152 4.136 13.18 1000
A - B 2.122 2.141 2.152 4.135 13.90 1000
A * B 2.122 2.141 2.152 4.140 13.15 1000
A/B 5.675 5.684 5.692 7.376 16.70 1000
A^2 1.778 1.815 1.826 3.745 12.79 1000
sqrt(A) 7.931 7.941 7.954 9.625 18.98 1000

The “reference” install of R does pass -mtune=core2 -O2 by default, and the added gains from -O3 over -O2 seem minimal. It is also possible that R’s internal code is not set up to take advantage of the AVX instructions activated by -march=corei7-avx.
The upshot is that substituting the reference Rblas.dll with an optimized one, specifically the single-threaded one, can provide significant increases in speed for matrix operations. I’d like to compile a dynamic architecture version as well, and if both pass the suite of R checks, send them to CRAN to be hosted in the ATLAS section. If that does not work, I may follow Dr. Nakama by hosting them locally as well.