**Strange Attractors » R**, and kindly contributed to R-bloggers)

With the recent release of R-3.1.0, and the near-recent release of OpenBLAS 0.29rc2, it was time to recompile Rblas.dll and do some new speed tests. The test computer has an Intel i7-2600K, overclocked to 4.6Ghz with 16GB RAM and runs Windows 7 Home Premium 64bit. For the first test, a plain vanilla reference R will be used, compiled with no optimization changes to the defaults. The reference BLAS will be tested, followed by a single- and multi-threaded (6 threads max) OpenBLAS-based Rblas.dll. The tests will use two 1000 x 1000 dense matrices, A and B. Matrix A has been made positive-definite using `nearPD`

while B is not positive-definite. The tests consist of the following common matrix procedures:

`crossprod(A, B)`

– R optimized multiplication`solve(A)`

– standard inversion`chol(A)`

– regular cholesky decomposition`chol(B, pivot = TRUE)`

– cholesky decomposition using pivoting`qr(A, LAPACK=TRUE)`

– QR decomposition`svd(A)`

– singular-value decomposition`eigen(A, symmetric = FALSE)`

– eigenvalue decomposition with all real eigenvectors`eigen(B, symmetric = FALSE)`

– eigenvalue decomposition with complex eigenvectors`lu(A)`

– general triangular decomposition (requires the`Matrix`

package)

The results speak for themselves:

1 2 3 4 5 6 7 8 9 10 11 12 | #Reference (compiled with no extra flags) Unit: milliseconds expr min lq median uq max neval crossprod(A, B) 699.054 705.897 709.745 725.500 791.222 100 solve(A) 599.579 603.419 608.564 624.725 681.806 100 chol(A) 114.970 115.956 116.074 117.700 175.176 100 chol(B, pivot = TRUE) 1.687 2.445 2.496 2.623 54.076 100 qr(A, LAPACK = TRUE) 414.195 415.801 417.635 423.346 474.681 100 svd(A) 1944.341 1959.048 1981.735 2006.881 2127.501 100 eigen(A, symmetric = FALSE) 2982.008 2995.499 3010.713 3044.907 3225.888 100 eigen(B, symmetric = FALSE) 3959.382 3979.049 4000.352 4027.309 4164.430 100 lu(A) 137.275 138.398 139.865 142.096 219.666 100 |

1 2 3 4 5 6 7 8 9 10 11 12 | #Reference R with single-thread BLAS Unit: milliseconds expr min lq median uq max neval crossprod(A, B) 58.770 60.133 60.223 60.371 111.98 100 solve(A) 105.092 106.103 107.227 108.957 158.80 100 chol(A) 15.241 16.307 16.369 16.439 68.58 100 chol(B, pivot = TRUE) 1.711 2.468 2.528 2.644 55.98 100 qr(A, LAPACK = TRUE) 110.604 113.013 113.985 114.999 164.41 100 svd(A) 542.648 549.262 552.076 595.846 644.62 100 eigen(A, symmetric = FALSE) 972.696 980.050 981.604 984.273 1035.89 100 eigen(B, symmetric = FALSE) 1264.458 1281.789 1284.065 1287.032 1345.26 100 lu(A) 31.741 32.488 33.639 35.337 85.03 100 |

1 2 3 4 5 6 7 8 9 10 11 12 | #Reference R with multi-threaded BLAS Unit: milliseconds expr min lq median uq max neval crossprod(A, B) 22.570 24.134 24.781 27.447 80.523 100 solve(A) 65.360 67.267 69.081 71.637 125.201 100 chol(A) 25.378 26.068 26.673 28.143 79.746 100 chol(B, pivot = TRUE) 2.235 2.843 2.987 3.347 7.422 100 qr(A, LAPACK = TRUE) 122.267 125.120 126.808 133.940 187.129 100 svd(A) 534.880 549.375 565.313 589.973 644.990 100 eigen(A, symmetric = FALSE) 2059.984 2091.719 2127.908 2186.482 2433.343 100 eigen(B, symmetric = FALSE) 5396.165 5644.925 5766.819 5875.664 6238.816 100 lu(A) 35.095 36.547 37.643 39.668 91.663 100 |

Using an optimized BLAS provides impressive speed gains. One interesting note is that for the simple matrix operations, the multi-threaded BLAS provided a noticeable gain in speed. However, this was far outweighed by its dismal performance in base R’s eigenvalue calculation routines. Not only did the multi-threaded BLAS perform much more poorly than the single-threaded BLAS, in the complex case, it performed more poorly than the reference BLAS! I have no idea why this should be the case, but this does explain why the GotoBLAS based Rblas from this earlier post performed so poorly in the `eigen`

test—it is a multi-threaded BLAS.

The next test was to see if compiling R with optimization flags helps at all. I didn’t re-compile the reference R, but for the single- and multi-threaded cases, I compiled R itself against those BLAS’s and throwing these optimization flags: `-march=corei7-avx -O3 -std=gnu++0x --param l1-cache-line-size=64 --param l1-cache-size=64 --param l2-cache-size=256`

. To these two tests I added the default multiplication `t(A) %*% B`

for comparison as well.

1 2 3 4 5 6 7 8 9 10 11 12 13 | #Single-Thread (compiled with -march=corei7-avx -O3 -std=gnu++0x --param l1-cache-line-size=64 --param l1-cache-size=64 --param l2-cache-size=256) Unit: milliseconds expr min lq median uq max neval t(A) %*% B 67.955 69.07 69.44 71.697 125.06 500 crossprod(A, B) 59.216 60.13 60.24 61.911 125.02 500 solve(A) 105.052 106.13 108.00 109.554 164.32 500 chol(A) 15.245 16.32 16.38 16.863 71.60 500 chol(B, pivot = TRUE) 1.691 2.48 2.54 2.672 55.60 500 qr(A, LAPACK = TRUE) 111.689 113.59 114.56 117.898 186.68 500 svd(A) 533.361 548.97 552.33 574.644 680.55 500 eigen(A, symmetric = FALSE) 925.477 934.73 939.20 962.548 1095.85 500 eigen(B, symmetric = FALSE) 1163.962 1181.89 1187.34 1211.832 1389.74 500 lu(A) 31.656 32.63 33.76 35.462 98.06 500 |

1 2 3 4 5 6 7 8 9 10 11 12 13 | #Multi-Thread (compiled with -march=corei7-avx -O3 -std=gnu++0x --param l1-cache-line-size=64 --param l1-cache-size=64 --param l2-cache-size=256) Unit: milliseconds expr min lq median uq max neval t(A) %*% B 30.972 33.085 33.872 36.458 97.72 500 crossprod(A, B) 22.500 23.955 24.344 26.365 83.47 500 solve(A) 62.616 67.140 68.797 70.904 146.53 500 chol(A) 25.300 26.152 26.771 28.338 86.29 500 chol(B, pivot = TRUE) 2.045 2.859 3.014 3.443 59.72 500 qr(A, LAPACK = TRUE) 120.004 124.588 127.109 131.142 199.33 500 svd(A) 529.964 546.345 557.353 585.517 714.93 500 eigen(A, symmetric = FALSE) 2021.299 2047.496 2084.340 2125.387 2493.60 500 eigen(B, symmetric = FALSE) 5242.572 5571.571 5640.380 5732.441 6451.08 500 lu(A) 34.785 36.625 38.006 39.451 98.82 500 |

When dealing with the matrix operations that use the BLAS, it does not seem that the optimization flags make that much of a difference. Looking at operations that do not use the BLAS, the extra optimization flags don’t make much of a difference either. Using the same matrices A and B as inputs, the following test was run directly upon opening R with results following:

1 2 3 4 5 | library(microbenchmark) A <- as.matrix(read.csv(file="F:/R/A.csv", colClasses='numeric')) B <- as.matrix(read.csv(file="F:/R/B.csv", colClasses='numeric')) colnames(A) <- colnames(B) <- NULL Z <- microbenchmark(A + 2, A - 2, A * 2, A / 2, A + B, A - B, A * B, A / B, A ^ 2, sqrt(A), control=list(order = 'block'), times = 1000L) |

1 2 3 4 5 6 7 8 9 10 11 12 13 | #Reference (compiled with no extra flags) Unit: milliseconds expr min lq median uq max neval A + 2 1.779 1.811 1.823 3.730 12.84 1000 A - 2 1.784 1.812 1.825 3.729 12.77 1000 A * 2 1.785 1.818 1.842 3.733 13.47 1000 A/2 3.112 3.126 3.138 4.835 14.15 1000 A + B 2.117 2.142 2.158 4.124 13.10 1000 A - B 2.111 2.140 2.155 4.120 13.42 1000 A * B 2.116 2.144 2.164 4.124 13.87 1000 A/B 5.673 5.692 5.704 7.374 16.72 1000 A^2 1.789 1.812 1.823 3.729 12.76 1000 sqrt(A) 7.939 7.955 7.968 9.630 18.98 1000 |

1 2 3 4 5 6 7 8 9 10 11 12 13 | #Single-Thread (compiled with -march=corei7-avx -O3 -std=gnu++0x --param l1-cache-line-size=64 --param l1-cache-size=64 --param l2-cache-size=256) Unit: milliseconds expr min lq median uq max neval A + 2 1.760 1.805 1.816 3.741 12.79 1000 A - 2 1.775 1.809 1.825 3.742 13.35 1000 A * 2 1.786 1.810 1.823 3.743 12.82 1000 A/2 3.115 3.124 3.131 4.844 14.15 1000 A + B 2.119 2.140 2.152 4.136 13.18 1000 A - B 2.122 2.141 2.152 4.135 13.90 1000 A * B 2.122 2.141 2.152 4.140 13.15 1000 A/B 5.675 5.684 5.692 7.376 16.70 1000 A^2 1.778 1.815 1.826 3.745 12.79 1000 sqrt(A) 7.931 7.941 7.954 9.625 18.98 1000 |

The “reference” install of R does throw `-mtune=core2 -O2`

. The added gains from O3 over O2 seem minimal. Also, it is also possible that the internal R code is not set up to take advantage of the AVX operations that are activated by `--march=corei7-avx`

.

The upshot is that substituting the reference Rblas.dll with an optimized one, specifically the single-threaded one, can provide significant increases in speed for matrix operations. I’d like to compile a dynamic architecture version as well, and if both pass the suite of R checks, send them to CRAN to be hosted in the ATLAS section. If that does not work, I may follow Dr. Nakama by hosting them locally as well.

**leave a comment**for the author, please follow the link and comment on his blog:

**Strange Attractors » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...