Performance considerations with sparse matrices in Armadillo

Introduction

Besides outstanding support for dense matrices, the Armadillo library also provides a great way to
manipulate sparse matrices in C++. However, the performance characteristics of dealing with sparse
matrices may be surprising if one is only familiar with dense matrices. This is a collection of
observations on getting best performance with sparse matrices in Armadillo.

All the timings in this article were generated using Armadillo version 8.500. This version adds a
number of substantial optimisations for sparse matrix operations, in some cases speeding things up
by as much as two orders of magnitude.

General considerations: sparsity, row vs column access

Perhaps the most important thing to note is that the efficiency of sparse algorithms can depend
strongly on the level of sparsity in the data. If your matrices and vectors are very sparse (most
elements equal to zero), you will often see better performance even if the nominal sizes of those
matrices remain the same. This isn’t specific to C++ or Armadillo; it applies to sparse algorithms
in general, including the code used in the Matrix package for R. By contrast, algorithms for working
with dense matrices usually aren’t affected by sparsity.
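To see this effect directly, one could time a sparse multiply at different densities. Here is a minimal sketch (not from the original article) using Armadillo's sprandu() generator and wall_clock timer; the function name time_sp_mult is our own.

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
double time_sp_mult(const int n, const double density) {
    // generate two random n x n sparse matrices at the given density
    arma::sp_mat a = arma::sprandu<arma::sp_mat>(n, n, density);
    arma::sp_mat b = arma::sprandu<arma::sp_mat>(n, n, density);
    arma::wall_clock timer;
    timer.tic();
    arma::sp_mat c = a * b;    // sparse-sparse multiply
    return timer.toc();        // elapsed time in seconds
}

Increasing the density from, say, 0.01 to 0.1 should show the denser inputs taking disproportionately longer, even though the matrix dimensions are unchanged.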

Similarly, the pattern of accessing elements, whether by rows or by columns, can have a significant
impact on performance. This is due to caching, which modern CPUs use to speed up memory access:
accessing elements that are already in the cache is much faster than retrieving them from main
memory. If one iterates or loops over the elements of a matrix in Armadillo, one should try to
iterate over columns first, then rows, to maximise the benefits of caching. This applies to both
dense and sparse matrices. (Technically, at least for dense matrices, whether to iterate over rows
or columns first depends on how the data is stored in memory. Both R and Armadillo store matrices in
column-major order, meaning that
elements in the same column are contiguous in memory. Sparse matrices are more complex but the
advice to iterate by columns is basically the same; see below.)
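As an illustration, here is a minimal sketch (not from the original article) of the recommended loop order over a dense matrix; the function name sum_elems is our own.

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
double sum_elems(const arma::mat& x) {
    double result = 0;
    // outer loop over columns, inner loop over rows: successive
    // iterations touch adjacent memory in column-major storage
    for (arma::uword j = 0; j < x.n_cols; j++) {
        for (arma::uword i = 0; i < x.n_rows; i++) {
            result += x(i, j);
        }
    }
    return result;
}

Swapping the two loops gives the same answer but strides through memory, which for large matrices defeats the cache.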

Matrix multiplication

We start with a simple concrete example: multiplying two matrices together. In R, this can be done
using the %*% operator which (via the Matrix package) is able to handle any combination of sparse
and dense inputs. However, let us assume we want to do the multiplication in Armadillo: for example
if the inputs are from other C++ functions, or if we want more precise control of the output.

In Armadillo, the * operator multiplies two matrices together, and this also works for any
combination of sparse and dense inputs. However, the speed of the operation can vary tremendously,
depending on which of those inputs is sparse. To see this, let us define a few simple functions:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::sp_mat mult_sp_sp_to_sp(const arma::sp_mat& a, const arma::sp_mat& b) {
    // sparse x sparse -> sparse
    arma::sp_mat result(a * b);
    return result;
}

// [[Rcpp::export]]
arma::sp_mat mult_sp_den_to_sp(const arma::sp_mat& a, const arma::mat& b) {
    // sparse x dense -> sparse
    arma::sp_mat result(a * b);
    return result;
}

// [[Rcpp::export]]
arma::sp_mat mult_den_sp_to_sp(const arma::mat& a, const arma::sp_mat& b) {
    // dense x sparse -> sparse
    arma::sp_mat result(a * b);
    return result;
}

The outputs of these functions are all the same, but they take different types of inputs: either two
sparse matrices, or a sparse and a dense matrix, or a dense and a sparse matrix (the order
matters). Let us call them on some randomly generated data:

library(Matrix)
set.seed(98765)
n <- 5e3
# 5000 x 5000 matrices, 99% sparse
a <- rsparsematrix(n, n, 0.01, rand.x=function(n) rpois(n, 1) + 1)
b <- rsparsematrix(n, n, 0.01, rand.x=function(n) rpois(n, 1) + 1)

a_den <- as.matrix(a)
b_den <- as.matrix(b)

system.time(m0 <- a %*% b)
   user  system elapsed 
  0.230   0.091   0.322 
system.time(m1 <- mult_sp_sp_to_sp(a, b))
   user  system elapsed 
  0.407   0.036   0.442 
system.time(m2 <- mult_sp_den_to_sp(a, b_den))
   user  system elapsed 
  1.081   0.100   1.181 
system.time(m3 <- mult_den_sp_to_sp(a_den, b))
   user  system elapsed 
  0.826   0.087   0.913 
all(identical(m0, m1), identical(m0, m2), identical(m0, m3))
[1] TRUE

While all the times are within an order of magnitude of each other, multiplying a dense and a sparse
matrix takes about twice as long as multiplying two sparse matrices together, and multiplying a
sparse and dense matrix takes about three times as long. (The sparse-dense multiply is actually one
area where Armadillo 8.500 makes massive gains over previous versions. This operation used to take
much longer due to using an unoptimised multiplication algorithm.)

Let us see if we can improve the performance of the mixed-type functions by creating a temporary
sparse copy of the dense input. This forces Armadillo to use the sparse-sparse version of matrix
multiply, which as seen above is much more efficient. For example, here is the result of tweaking
the sparse-dense multiply. Creating the sparse copy takes some extra time and memory, but not
enough to outweigh the gain from the faster multiplication.

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::sp_mat mult_sp_den_to_sp2(const arma::sp_mat& a, const arma::mat& b) {
    // sparse x dense -> sparse
    // copy dense to sparse, then multiply
    arma::sp_mat temp(b);
    arma::sp_mat result(a * temp);
    return result;
}
system.time(m4 <- mult_sp_den_to_sp2(a, b_den))
   user  system elapsed 
  0.401   0.047   0.448 
identical(m0, m4)
[1] TRUE

This shows that mixing sparse and dense inputs can significantly affect the efficiency of your
code. To avoid unexpected slowdowns, consider sticking to either sparse or dense classes for all
your objects. If you do mix them, remember to test and profile your code.

Row vs column access

Consider another simple computation: multiply the elements of a matrix by their row number, then sum
them up. (The multiply by row number is to make it not completely trivial.) That is, we want to
obtain:

$$\sum_{i=1}^{m} \sum_{j=1}^{n} i \, x_{ij}$$

where $x$ is the $m \times n$ input matrix.

Armadillo lets us extract individual rows and columns from a matrix, using the .row() and .col()
member functions. We can use row extraction to do this computation:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::sp_mat row_slice(const arma::sp_mat& x, const int n) {
    return x.row(n - 1);
}
system.time({
    result <- sapply(1:nrow(a),
        function(i) i * sum(row_slice(a, i)))
    print(sum(result))
})
[1] 1248361320


   user  system elapsed 
  1.708   0.000   1.707 

For a large matrix, this takes a not-insignificant amount of time, even on a fast machine. To speed
it up, we will make use of the fact that Armadillo uses the compressed sparse column (CSC) format
for storing sparse matrices. This means that the data for a matrix is stored as three vectors: the
nonzero elements; the row indices of these elements (ordered by column); and a vector of offsets for
the first row index in each column. Since the vector of row indices is ordered by column, and we
have the starting offsets for each column, it turns out that extracting a column slice is actually
very fast. We only need to find the offset for that column, and then pull out the elements and row
indices up to the next column offset. On the other hand, extracting a row is much more work; we have
to search through the indices to find those matching the desired row.
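To make the storage format concrete, here is a minimal sketch (not from the original article) that sums one column by reading the CSC arrays directly; sp_mat exposes these as the public members values, row_indices and col_ptrs, and the function name sum_col_csc is our own.

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
double sum_col_csc(const arma::sp_mat& x, const int n) {
    double result = 0;
    // the nonzeros of column n-1 sit between two consecutive column offsets
    for (arma::uword k = x.col_ptrs[n - 1]; k < x.col_ptrs[n]; k++) {
        result += x.values[k];   // the matching row index is x.row_indices[k]
    }
    return result;
}

No searching is required: the column offsets take us straight to the right block of elements, which is exactly why column slicing is cheap.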

We can put this knowledge to use on our problem. Row slicing a matrix is the same as transposing it
and then slicing by columns, so let us define a new function to return a column slice. (Transposing
the matrix takes only a tiny fraction of a second.)

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::sp_mat col_slice(const arma::sp_mat& x, const int n) {
    return x.col(n - 1);
}
a_t <- t(a)
system.time({
    result <- sapply(1:nrow(a_t),
        function(i) i * sum(col_slice(a_t, i)))
    print(sum(result))
})
[1] 1248361320


   user  system elapsed 
  0.766   0.000   0.766 

The time taken has come down by quite a substantial margin. This reflects the ease of obtaining
column slices for sparse matrices.

The R–C++ interface

We can take the previous example further. Each time R calls a C++ function that takes a sparse
matrix as input, it makes a copy of the data. Similarly, when the C++ function returns, its sparse
outputs are copied into R objects. When the function itself is very simple—as it is here—all
this copying and memory shuffling can be a significant proportion of the time taken.

Rather than calling sapply in R to iterate over rows, let us do the same entirely in C++:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
double sum_by_row(const arma::sp_mat& x) {
    double result = 0;
    for (size_t i = 0; i < x.n_rows; i++) {
        arma::sp_mat row(x.row(i));
        for (arma::sp_mat::iterator j = row.begin(); j != row.end(); ++j) {
            result += *j * (i + 1);
        }
    }
    return result;
}
system.time(print(sum_by_row(a)))
[1] 1248361320


   user  system elapsed 
  0.933   0.000   0.935 

This is again a large improvement. But what if we do the same with column slicing?

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
double sum_by_col(const arma::sp_mat& x) {
    double result = 0;
    for (size_t i = 0; i < x.n_cols; i++) {
        arma::sp_mat col(x.col(i));
        for (arma::sp_mat::iterator j = col.begin(); j != col.end(); ++j) {
            result += *j * (i + 1);
        }
    }
    return result;
}
system.time(print(sum_by_col(a_t)))
[1] 1248361320


   user  system elapsed 
  0.005   0.000   0.006 

Now the time is down to a few milliseconds, faster than the original row-slicing code by a factor
of nearly 300. The moral of the story is: rather than constantly switching between C++ and R, try
to stay in one environment for as long as possible. If your code involves a loop with a C++ call
inside (including functional maps like lapply and friends), consider writing the loop entirely in
C++ and combining the results into a single object to return to R.

(It should be noted that this interface tax is less onerous for built-in Rcpp classes such as
NumericVector or NumericMatrix, which do not require making copies of the data. Sparse data
types are different, and in particular Armadillo’s sparse classes do not provide constructors that
can directly use auxiliary memory.)
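By contrast, a dense Armadillo matrix can be constructed directly on top of R's memory via the auxiliary-memory constructor, avoiding the copy altogether. A minimal sketch (not from the original article; the function name dense_trace is our own):

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
double dense_trace(Rcpp::NumericMatrix m) {
    // reuse R's memory directly: copy_aux_mem = false means no copy is made
    arma::mat view(m.begin(), m.nrow(), m.ncol(), false);
    return arma::trace(view);   // assumes a square matrix
}

Nothing comparable is available for sp_mat, so sparse inputs and outputs always pay the copying cost.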

Iterators vs elementwise access

Rather than taking explicit slices of the data, let us try using good old-fashioned loops over the
matrix elements. This is easily coded up in Armadillo, and it turns out to be quite efficient,
relatively speaking. It is not as fast as using column slicing, but much better than row slicing.

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
double sum_by_element(const arma::sp_mat& x) {
    double result = 0;
    // loop over columns, then rows: see comments at the start of this article
    for (size_t j = 0; j < x.n_cols; j++) {
        for (size_t i = 0; i < x.n_rows; i++) {
            result += x(i, j) * (i + 1);
        }
    }
    return result;
}
system.time(print(sum_by_element(a)))
[1] 1248361320


   user  system elapsed 
  0.176   0.000   0.176 

However, we can still do better. In Armadillo, the iterators for sparse matrix classes iterate only
over the nonzero elements. Let us compare this to our naive double loop:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
double sum_by_iterator(const arma::sp_mat& x) {
    double result = 0;
    for (arma::sp_mat::const_iterator i = x.begin(); i != x.end(); ++i) {
        result += *i * (i.row() + 1);
    }
    return result;
}
system.time(print(sum_by_iterator(a)))
[1] 1248361320


   user  system elapsed 
  0.002   0.000   0.002 

This is the best time achieved so far, to the extent that system.time might have difficulty
capturing it. The timings are now so low that we should use the microbenchmark package to get
accurate measurements:

library(microbenchmark)
microbenchmark(col=sum_by_col(a_t), 
               elem=sum_by_element(a), 
               iter=sum_by_iterator(a),
               times=20)
Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
  col   4.78921   4.88444   5.05229   4.99184   5.18450   5.50579    20
 elem 172.84830 177.20431 179.87007 179.06447 182.08075 188.11256    20
 iter   1.02268   1.05447   1.12611   1.12627   1.16482   1.30800    20

Thus, using iterators represents a greater than three-order-of-magnitude speedup over the original
row-slicing code.
