Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

It is no secret that sparse matrix operations are faster in C++ than in R. RcppArmadillo and RcppEigen do a great job copying sparse matrices from R to C++ and back again. But note the word “copy”. Every time RcppArmadillo converts an R sparse matrix to an arma::SpMat<T> object, it has to creates a deep copy due to the difference in representation between dense matrices (usually occupying contiguous chunks of memory) and sparse matrices (which do not precisely because of sparsity). Similarly, every time RcppEigen converts an R sparse matrix to an Eigen::SparseMatrix<T> object, it also creates a deep copy.

The price of instantiating sparse matrix object is measurable in both time and memory. But one of the advantages of types such as Rcpp::NumericVector is that they simply re-use the underlying memory of the R object by wrapping around the underlying SEXP representation and its contiguous dense memory layout so no copy is created! Can we create a sparse matrix class using Rcpp::NumericVector and Rcpp::IntegerVector that uses them similarly as references rather than actual deep-copy of each element?

Structure of a dgCMatrix

We will focus on the dgCMatrix type, the most common form of compressed-sparse-column (CSC) matrix in the Matrix package. The dgCMatrix class is an S4 object with several slots:

Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
[email protected] i       : int [1:12] 2 4 0 3 4 0 3 4 2 3 0 2
[email protected] p       : int [1:6] 0 2 5 8 10 12
[email protected] Dim     : int [1:2] 5 5
[email protected] Dimnames:List of 2
.. ..$: NULL .. ..$ : NULL
[email protected] x       : num [1:12] 0.5 -1 0.83 -0.056 2.5 0.24 1.7 1.3 0.55 -1.7 -0.78 1.3
[email protected] factors : list()


Here we have:

• Slot i is an integer vector giving the row indices of the non-zero values of the matrix
• Slot p is an integer vector giving the index of the first non-zero value of each column in i
• Slot x gives the non-zero elements of the matrix corresponding to rows in i and columns delineated by p

Realize that i and p are integer vectors, and x is a numeric vector (stored as a double), corresponding to Rcpp::IntegerVector and Rcpp::NumericVector. That means that we can construct a sparse matrix class in C++ using Rcpp vectors!

A dgCMatrix reference class for Rcpp

A dgCMatrix is simply an S4 object containing integer and double vectors, and Rcpp already has implemented exporters and wrappers for S4 objects, integer vectors, and numeric vectors. That makes class construction easy:

In the above code, first we create a C++ class for a dgCMatrix in the Rcpp namespace with public members corresponding to the slots in an R Matrix::dgCMatrix. Second, we add a constructor for the class that receives an S4 R object (which should be a valid dgCMatrix object). Third, we add a simple forward-only sparse column iterator with read/write access for convenient traversal of non-zero elements in the matrix. Finally, we use Rcpp::as and Rcpp::wrap for seamless conversion between R and C++ and back again.

Using the Rcpp::dgCMatrix class

We can now simply pull a dgCMatrix into any Rcpp function thanks to our handy class methods and the magic of Rcpp::as.

And call the following from R:

[1] TRUE


Awesome!

Passing by reference versus value

Rcpp structures such as NumericVector are wrappers around the existing R objects. This means that if we modify our sparse matrix in C++, it will modify the original R object. For instance:

used in

sum before squaring:  801.059

sum of original object after squaring:  1007.97

sum of assigned object after squaring:  1007.97


Notice how the original object AND the returned object are identical, yet they don’t point to the same memory address (because of copy on write):

[1] "<0x55f9609a3ae0>"

[1] "<0x55f960f449f0>"


You might further inspect memory addresses within the structure using .Internal(inspect()) and indeed, you will see the memory addresses are not the same. What this all means is that we can simply call the function in R and modify the object implicitly without an assignment operator.

sum_before =  793.861

sum_after  =  970.174


This principle of course applies to other Rcpp classes such as NumericVector as well. But especially when working with very large sparse matrices, it is useful to avoid deep copies and pass by reference only. If you do need to operate on a copy of your matrix in C++, there is no reason to be using an Rcpp container when you can be using RcppArmadillo or RcppEigen. These libraries are extremely capable and well-documented—and generally give you access to specific sparse Matrix algorithms.

Sparse iterator class

One of the best ways to read and/or write non-zero values in a sparse matrix is with an iterator. A basic column forward iterator with read/write access was presented in the declaration of our sparse matrix class. We can use this iterator in a very Armadillo-esque manner to compute things like column sums:

On the R end:

[1] TRUE


Great—but is it faster???

Unit: microseconds
expr    min      lq     mean  median      uq    max neval cld
Rcpp_colSums(spmat)  2.510  2.7885  3.32225  3.1355  3.2685 21.842   100  a
Matrix::colSums(spmat) 11.068 11.5380 13.26691 11.7775 12.0525 79.324   100   b


Extending the class

There are many ways we can extend Rcpp::dgCMatrix. For example, we can use const_iterator, row_iterator and iterators that traverse all values in addition to col_iterator. We can also add support for subview copies, dense copies of columns and rows, basic element-wise operations, cross-product, etc.

We have implemented and documented these methods, and hopefully there will be a Rcpp-extending CRAN package in the not-too-distant future that allows you to seamlessly interface with the dgCMatrix class. For now, see the Github repo for RcppSparse for the in-progress package.