R-3.1.0 + OpenBLAS Speed Comparisons

April 20, 2014

(This article was first published on Strange Attractors » R, and kindly contributed to R-bloggers)

With the recent release of R-3.1.0, and the near-recent release of OpenBLAS 0.2.9.rc2, it was time to recompile Rblas.dll and run some new speed tests. The test computer has an Intel i7-2600K overclocked to 4.6 GHz, with 16 GB RAM, and runs Windows 7 Home Premium 64-bit. For the first test, a plain vanilla reference R will be used, compiled with no changes to the default optimization settings. The reference BLAS will be tested first, followed by single-threaded and multi-threaded (capped at 6 threads) OpenBLAS-based builds of Rblas.dll. The tests use two 1000 x 1000 dense matrices, A and B. Matrix A has been made positive-definite using nearPD from the Matrix package, while B is not positive-definite. The tests consist of the following common matrix procedures:

  1. crossprod(A, B) – R's optimized computation of t(A) %*% B
  2. solve(A) – standard inversion
  3. chol(A) – regular Cholesky decomposition
  4. chol(B, pivot = TRUE) – Cholesky decomposition using pivoting
  5. qr(A, LAPACK = TRUE) – QR decomposition
  6. svd(A) – singular-value decomposition
  7. eigen(A, symmetric = FALSE) – eigenvalue decomposition with all real eigenvalues
  8. eigen(B, symmetric = FALSE) – eigenvalue decomposition with complex eigenvalues
  9. lu(A) – LU (general triangular) decomposition (requires the Matrix package)

The results speak for themselves:

#Reference (compiled with no extra flags)
Unit: milliseconds
                        expr      min       lq   median       uq      max neval
             crossprod(A, B)  699.054  705.897  709.745  725.500  791.222   100
                    solve(A)  599.579  603.419  608.564  624.725  681.806   100
                     chol(A)  114.970  115.956  116.074  117.700  175.176   100
       chol(B, pivot = TRUE)    1.687    2.445    2.496    2.623   54.076   100
        qr(A, LAPACK = TRUE)  414.195  415.801  417.635  423.346  474.681   100
                      svd(A) 1944.341 1959.048 1981.735 2006.881 2127.501   100
 eigen(A, symmetric = FALSE) 2982.008 2995.499 3010.713 3044.907 3225.888   100
 eigen(B, symmetric = FALSE) 3959.382 3979.049 4000.352 4027.309 4164.430   100
                       lu(A)  137.275  138.398  139.865  142.096  219.666   100

#Reference R with single-thread BLAS
Unit: milliseconds
                        expr      min       lq   median       uq     max neval
             crossprod(A, B)   58.770   60.133   60.223   60.371  111.98   100
                    solve(A)  105.092  106.103  107.227  108.957  158.80   100
                     chol(A)   15.241   16.307   16.369   16.439   68.58   100
       chol(B, pivot = TRUE)    1.711    2.468    2.528    2.644   55.98   100
        qr(A, LAPACK = TRUE)  110.604  113.013  113.985  114.999  164.41   100
                      svd(A)  542.648  549.262  552.076  595.846  644.62   100
 eigen(A, symmetric = FALSE)  972.696  980.050  981.604  984.273 1035.89   100
 eigen(B, symmetric = FALSE) 1264.458 1281.789 1284.065 1287.032 1345.26   100
                       lu(A)   31.741   32.488   33.639   35.337   85.03   100

#Reference R with multi-threaded BLAS
Unit: milliseconds
                        expr      min       lq   median       uq      max neval
             crossprod(A, B)   22.570   24.134   24.781   27.447   80.523   100
                    solve(A)   65.360   67.267   69.081   71.637  125.201   100
                     chol(A)   25.378   26.068   26.673   28.143   79.746   100
       chol(B, pivot = TRUE)    2.235    2.843    2.987    3.347    7.422   100
        qr(A, LAPACK = TRUE)  122.267  125.120  126.808  133.940  187.129   100
                      svd(A)  534.880  549.375  565.313  589.973  644.990   100
 eigen(A, symmetric = FALSE) 2059.984 2091.719 2127.908 2186.482 2433.343   100
 eigen(B, symmetric = FALSE) 5396.165 5644.925 5766.819 5875.664 6238.816   100
                       lu(A)   35.095   36.547   37.643   39.668   91.663   100

Using an optimized BLAS provides impressive speed gains. One interesting note is that for the simple matrix operations, the multi-threaded BLAS provided a further noticeable gain over the single-threaded version. However, this was far outweighed by its dismal performance in base R's eigenvalue calculation routines. Not only did the multi-threaded BLAS perform much worse than the single-threaded BLAS; in the complex case, it performed worse than the reference BLAS! I have no idea why this should be the case, but it does explain why the GotoBLAS-based Rblas from this earlier post performed so poorly in the eigen test: it too is a multi-threaded BLAS.
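As an aside, the slow path here is the general (nonsymmetric) LAPACK driver that symmetric = FALSE forces eigen() to use; when a matrix is known to be symmetric, letting eigen() take the symmetric path avoids most of that cost regardless of the BLAS in use. A quick sketch (the matrix construction is assumed, not taken from the post):

```r
library(microbenchmark)

set.seed(1)
# Symmetric positive-definite test matrix (assumed construction)
A <- crossprod(matrix(rnorm(1000 * 1000), nrow = 1000))

# symmetric = TRUE dispatches to the symmetric LAPACK driver, which is
# typically far faster than the general driver used by symmetric = FALSE
microbenchmark(
  eigen(A, symmetric = FALSE),
  eigen(A, symmetric = TRUE),
  times = 10L
)
```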

The next test was to see whether compiling R itself with optimization flags helps at all. I didn't re-compile the reference R, but for the single- and multi-threaded cases I compiled R against those BLAS builds while adding these optimization flags: -march=corei7-avx -O3 -std=gnu++0x --param l1-cache-line-size=64 --param l1-cache-size=64 --param l2-cache-size=256. To these two tests I also added the default multiplication t(A) %*% B for comparison.
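On Windows, R is built from src/gnuwin32, and custom compiler flags are usually supplied through a MkRules.local file; the exact variable the author used is not shown, so the EOPT line below is my assumption of where such flags would go in the standard Windows build machinery, a sketch only:

```makefile
# MkRules.local (assumed; sketch of where the extra flags might be set)
EOPT = -march=corei7-avx -O3 --param l1-cache-line-size=64 --param l1-cache-size=64 --param l2-cache-size=256
```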

#Single-Thread (compiled with -march=corei7-avx -O3 -std=gnu++0x --param l1-cache-line-size=64 --param l1-cache-size=64 --param l2-cache-size=256)
Unit: milliseconds 
                        expr      min      lq  median       uq     max neval
                  t(A) %*% B   67.955   69.07   69.44   71.697  125.06   500
             crossprod(A, B)   59.216   60.13   60.24   61.911  125.02   500
                    solve(A)  105.052  106.13  108.00  109.554  164.32   500
                     chol(A)   15.245   16.32   16.38   16.863   71.60   500
       chol(B, pivot = TRUE)    1.691    2.48    2.54    2.672   55.60   500
        qr(A, LAPACK = TRUE)  111.689  113.59  114.56  117.898  186.68   500
                      svd(A)  533.361  548.97  552.33  574.644  680.55   500
 eigen(A, symmetric = FALSE)  925.477  934.73  939.20  962.548 1095.85   500
 eigen(B, symmetric = FALSE) 1163.962 1181.89 1187.34 1211.832 1389.74   500
                       lu(A)   31.656   32.63   33.76   35.462   98.06   500

#Multi-Thread (compiled with -march=corei7-avx -O3 -std=gnu++0x --param l1-cache-line-size=64 --param l1-cache-size=64 --param l2-cache-size=256)
Unit: milliseconds
                        expr      min       lq   median       uq     max neval
                  t(A) %*% B   30.972   33.085   33.872   36.458   97.72   500
             crossprod(A, B)   22.500   23.955   24.344   26.365   83.47   500
                    solve(A)   62.616   67.140   68.797   70.904  146.53   500
                     chol(A)   25.300   26.152   26.771   28.338   86.29   500
       chol(B, pivot = TRUE)    2.045    2.859    3.014    3.443   59.72   500
        qr(A, LAPACK = TRUE)  120.004  124.588  127.109  131.142  199.33   500
                      svd(A)  529.964  546.345  557.353  585.517  714.93   500
 eigen(A, symmetric = FALSE) 2021.299 2047.496 2084.340 2125.387 2493.60   500
 eigen(B, symmetric = FALSE) 5242.572 5571.571 5640.380 5732.441 6451.08   500
                       lu(A)   34.785   36.625   38.006   39.451   98.82   500

For the matrix operations that call into the BLAS, the optimization flags do not seem to make much of a difference. The same is true of element-wise operations that do not use the BLAS at all. Using the same matrices A and B as inputs, the following test was run directly upon opening R, with results below:

library(microbenchmark)
A <- as.matrix(read.csv(file="F:/R/A.csv", colClasses='numeric'))
B <- as.matrix(read.csv(file="F:/R/B.csv", colClasses='numeric'))
colnames(A) <- colnames(B) <- NULL
Z <- microbenchmark(A + 2, A - 2, A * 2, A / 2, A + B, A - B, A * B, A / B, A ^ 2, sqrt(A), control=list(order = 'block'), times = 1000L)

#Reference (compiled with no extra flags)
Unit: milliseconds
    expr   min    lq median    uq   max neval
   A + 2 1.779 1.811  1.823 3.730 12.84  1000
   A - 2 1.784 1.812  1.825 3.729 12.77  1000
   A * 2 1.785 1.818  1.842 3.733 13.47  1000
     A/2 3.112 3.126  3.138 4.835 14.15  1000
   A + B 2.117 2.142  2.158 4.124 13.10  1000
   A - B 2.111 2.140  2.155 4.120 13.42  1000
   A * B 2.116 2.144  2.164 4.124 13.87  1000
     A/B 5.673 5.692  5.704 7.374 16.72  1000
     A^2 1.789 1.812  1.823 3.729 12.76  1000
 sqrt(A) 7.939 7.955  7.968 9.630 18.98  1000

#Single-Thread (compiled with -march=corei7-avx -O3 -std=gnu++0x --param l1-cache-line-size=64 --param l1-cache-size=64 --param l2-cache-size=256)
Unit: milliseconds
    expr   min    lq median    uq   max neval
   A + 2 1.760 1.805  1.816 3.741 12.79  1000
   A - 2 1.775 1.809  1.825 3.742 13.35  1000
   A * 2 1.786 1.810  1.823 3.743 12.82  1000
     A/2 3.115 3.124  3.131 4.844 14.15  1000
   A + B 2.119 2.140  2.152 4.136 13.18  1000
   A - B 2.122 2.141  2.152 4.135 13.90  1000
   A * B 2.122 2.141  2.152 4.140 13.15  1000
     A/B 5.675 5.684  5.692 7.376 16.70  1000
     A^2 1.778 1.815  1.826 3.745 12.79  1000
 sqrt(A) 7.931 7.941  7.954 9.625 18.98  1000

The “reference” install of R already uses -mtune=core2 -O2, and the added gains from -O3 over -O2 seem minimal. It is also possible that R's internal code is not set up to take advantage of the AVX instructions activated by -march=corei7-avx.

The upshot is that replacing the reference Rblas.dll with an optimized one, specifically the single-threaded build, can provide significant speed increases for matrix operations. I'd like to compile a dynamic-architecture version as well, and if both pass the suite of R checks, submit them to CRAN to be hosted in the ATLAS section. If that does not work, I may follow Dr. Nakama's example and host them locally.
