# Package bigstatsr: Statistics with matrices on disk (useR 2017)

**blog**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this post, I will talk about my package **bigstatsr**, which I’ve just presented in a lightning talk of 5 minutes at useR!2017. You can listen to me in action there. I should have chosen a longer talk to explain more about this package, maybe next time. I will use this post to give you a more detailed version of the talk I gave in Brussels.

## Motivation behind bigstatsr

I’m a PhD student in predictive human genetics. I’m basically trying to predict someone’s risk of disease based on their DNA mutations. These DNA mutations are in the form of large matrices so that I’m currently working with a matrix of 15K rows and 300K columns. This matrix would take approximately 32GB of RAM if stored as a standard R matrix.

When I began studying this dataset, I had only 8GB of RAM on my computer. I now have 64GB of RAM but it would take only copying this matrix once to make my computer begin swapping and therefore slowing down. I found a convenient solution by using the object `big.matrix`

provided by the R package **bigmemory** (Kane, Emerson, and Weston 2013). With this solution, you can access a matrix that is stored on disk almost as if it were a standard R matrix in memory.

Yet, for these special matrices, some useful statistical functions were missing or not fast enough. So, I implemented these. It was a good experience about programming optimized and parallelized algorithms. I’m aware that there are other packages that come with **bigmemory**, such as **biganalytics** and **bigalgebra**, that already implement some of the features I will talk about in this post. I will discuss why I don’t use these packages. However, I use the work of other packages such as **biglasso** and **RSpectra**.

## Introduction to bigmemory

# loading package bigstatsr (and bigmemory) library(bigstatsr) # initializing some matrix on disk: wrapper to bigmemory::big.matrix() mat <- FBM(backingroot = "matrix-on-disk", descriptor = FALSE)(5e3, 10e3) ## Creating directory "backingfiles" which didn't exist.. dim(mat) ## [1] 5000 10000 mat[1:5, 1:5] ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0 0 0 0 0 ## [2,] 0 0 0 0 0 ## [3,] 0 0 0 0 0 ## [4,] 0 0 0 0 0 ## [5,] 0 0 0 0 0 mat[1, 1] <- 2 mat[1:5, 1:5] ## [,1] [,2] [,3] [,4] [,5] ## [1,] 2 0 0 0 0 ## [2,] 0 0 0 0 0 ## [3,] 0 0 0 0 0 ## [4,] 0 0 0 0 0 ## [5,] 0 0 0 0 0 mat[2:4] <- 3 mat[1:5, 1:5] ## [,1] [,2] [,3] [,4] [,5] ## [1,] 2 0 0 0 0 ## [2,] 3 0 0 0 0 ## [3,] 3 0 0 0 0 ## [4,] 3 0 0 0 0 ## [5,] 0 0 0 0 0 mat[, 2:3] <- rnorm(2 * nrow(mat)) mat[1:5, 1:5] ## [,1] [,2] [,3] [,4] [,5] ## [1,] 2 -0.1611891 3.1218981 0 0 ## [2,] 3 -0.6026680 -1.3935111 0 0 ## [3,] 3 0.4743703 -0.6566549 0 0 ## [4,] 3 0.1514252 0.8247594 0 0 ## [5,] 0 -1.5167186 0.7329260 0 0

What we can see is that big matrices (`big.matrix`

objects) can be accessed (read/write) almost as if they were standard R matrices, but you have to be cautious. For example, doing `mat[1, ]`

isn’t recommended. Indeed, big matrices, as standard R matrices, are stored by column so that it is in fact a big vector with columns stored one after the other, contiguously. So, accessing the first row would access elements that are not stored contiguously in memory, which is slow. One should always access columns rather than rows.

## Apply an R function to a big matrix

An easy strategy to apply an R function to a big matrix would be the split-apply-combine strategy (Wickham 2011). For example, you could access only a block of columns at a time, apply a (vectorized) function to this block, and then combine the results of all blocks. This is implemented in function `big_apply()`

.

# Compute the sums of the first 1000 columns colsums_1 <- colSums(mat[, 1:1000]) # Compute the sums of the second block of 1000 columns colsums_2 <- colSums(mat[, 1001:2000]) # Combine the results colsums_1_2 <- c(colsums_1, colsums_2) # Do this automatically with big_apply() colsums_all <- big_apply(mat, a.FUN = function(X, ind) colSums(X[, ind]), a.combine = 'c')

When the split-apply-combine strategy can be used for a given function, you could use `big_apply()`

to get the results, while accessing only small blocks of columns (or rows) at a time. Package **biganalytics**, by the creators of **bigmemory**, provides a way to apply an R function to margins of a `big.matrix`

. Yet, for example, if the `big.matrix`

has a lot of columns, it would be much slower to loop through all columns rather that applying a vectorized function to blocks of columns. You can find more examples of applications for `big_apply()`

there.

colsums_all2 <- biganalytics::apply(mat, 2, sum) all.equal(colsums_all2, colsums_all) ## [1] TRUE

## Use Rcpp with a big matrix

Using Rcpp with a `big.matrix`

is super easy. Let’s use the previous example, i.e. the computation of the colsums of a `big.matrix`

. We will do it in 3 different ways.

### 1. Using the bigmemory way

// [[Rcpp::depends(bigmemory, BH)]] #include <bigmemory/MatrixAccessor.hpp> #include <Rcpp.h> using namespace Rcpp; // [[Rcpp::export]] NumericVector bigcolsums(const S4& BM) { XPtr<BigMatrix> xpMat = BM.slot("address"); MatrixAccessor<double> macc(*xpMat); int n = macc.nrow(); int m = macc.ncol(); NumericVector res(m); // vector of m zeros int i, j; for (j = 0; j < m; j++) for (i = 0; i < n; i++) res[j] += macc[j][i]; return res; } colsums_all3 <- bigcolsums(mat) all.equal(colsums_all3, colsums_all) ## [1] TRUE

### 2. Using the bigstatsr way

// [[Rcpp::depends(bigstatsr, bigmemory, BH)]] #include <bigstatsr/SubMatAcc.h> // [[Rcpp::export]] NumericVector bigcolsums2(const S4& BM, const IntegerVector& rowInd, const IntegerVector& colInd) { XPtr<BigMatrix> xpMat = BM.slot("address"); // C++ indices begin at 0 IntegerVector rows = rowInd - 1; IntegerVector cols = colInd - 1; // An accessor of only part of the big.matrix SubMatAcc<double> macc(*xpMat, rows, cols); int n = macc.nrow(); int m = macc.ncol(); NumericVector res(m); // vector of m zeros int i, j; for (j = 0; j < m; j++) for (i = 0; i < n; i++) res[j] += macc(i, j); return res; } colsums_all4 <- bigcolsums2(mat, rows_along(mat), cols_along(mat)) all.equal(colsums_all4, colsums_all) ## [1] TRUE

In **bigstatsr**, most of the functions have parameters for subsetting rows and columns because it is often useful. One of the main reasons why I don’t use package **bigalgebra** is its lack of subsetting options.

### 3. Use an already implemented function

str(colsums_all5 <- big_colstats(mat)) ## 'data.frame': 10000 obs. of 2 variables: ## $ sum: num 11 139.6 73.2 0 0 ... ## $ var: num 0.0062 1.0117 1.0164 0 0 ... all.equal(colsums_all5$sum, colsums_all) ## [1] TRUE

## Principal Component Analysis

Let’s begin by filling the matrix with random numbers in a tricky way.

U <- sweep(matrix(rnorm(nrow(mat) * 10), ncol = 10), 2, 1:10, "/") V <- matrix(rnorm(ncol(mat) * 10), ncol = 10) big_apply(mat, a.FUN = function(X, ind) { X[, ind] <- tcrossprod(U, V[ind, ]) + rnorm(nrow(X) * length(ind)) NULL }, a.combine = 'c') ## NULL

Let’s say we want the first 10 PCs of the (scaled) matrix.

system.time( small_svd <- svd(scale(mat[, 1:2000]), nu = 10, nv = 10) ) ## user system elapsed ## 8.348 0.063 8.429 system.time( small_svd2 <- big_SVD(mat, big_scale(), ind.col = 1:2000) ) ## (2) ## user system elapsed ## 1.900 0.083 1.989 plot(small_svd2$u, small_svd$u)

system.time( small_svd3 <- big_randomSVD(mat, big_scale(), ind.col = 1:2000) ) ## user system elapsed ## 0.353 0.002 0.355 plot(small_svd3$u, small_svd$u)

system.time( svd_all <- big_randomSVD(mat, big_scale()) ) ## user system elapsed ## 2.267 0.001 2.268 plot(svd_all)

Function `big_randomSVD()`

uses Rcpp and package **Rpsectra** to implement a fast Singular Value Decomposition for a `big.matrix`

that is linear in all dimensions (standard PCA algorithm is quadratic in the smallest dimension) which makes it very fast even for large datasets (that have both dimensions that are large).

## Some linear models

M <- 100 # number of causal variables set <- sample(ncol(mat), M) y <- mat[, set] %*% rnorm(M) y <- y + rnorm(length(y), sd = 2 * sd(y)) ind.train <- sort(sample(nrow(mat), size = 0.8 * nrow(mat))) ind.test <- setdiff(rows_along(mat), ind.train) mult_test <- big_univLinReg(mat, y[ind.train], ind.train = ind.train, covar.train = svd_all$u[ind.train, ]) library(ggplot2) plot(mult_test) + aes(color = cols_along(mat) %in% set) + labs(color = "Causal?")

train <- big_spLinReg(mat, y[ind.train], ind.train = ind.train, covar.train = svd_all$u[ind.train, ], alpha = 0.5) pred <- predict(train, X. = mat, ind.row = ind.test, covar.row = svd_all$u[ind.test, ]) plot(apply(pred, 2, cor, y = y[ind.test]))

The functions `big_spLinReg()`

, `big_spLogReg()`

and `big_spSVM()`

all use lasso (L1) or elastic-net (L1 & L2) regularizations in order to limit the number of predictors and to accelerate computations thanks to strong rules (R. Tibshirani et al. 2012). The implementation of these functions are based on modifications from packages **sparseSVM** and **biglasso** (Zeng and Breheny 2017). Yet, these models give predictions for a range of 100 different regularization parameters whereas we are only interested in one prediction.

So, that’s why I came up with the idea of Cross-Model Selection and Averaging (CMSA), which principle is:

- This function separates the training set in K folds (e.g. 10).
**In turn**,- each fold is considered as an inner validation set and the others (K - 1) folds form an inner training set,
- the model is trained on the inner training set and the corresponding predictions (scores) for the inner validation set are computed,
- the vector of scores which maximizes
`feval`

is determined, - the vector of coefficients corresponding to the previous vector of scores is chosen.

- The K resulting vectors of coefficients are then combined into one vector.

train2 <- big_CMSA(big_spLinReg, feval = function(pred, target) cor(pred, target), X. = mat, y.train = y[ind.train], ind.train = ind.train, covar.train = svd_all$u[ind.train, ], alpha = 0.5, ncores = parallel::detectCores() / 2) mean(train2 != 0) # percentage of predictors ## [1] 0.1806194 pred2 <- predict(train2, X. = mat, ind.row = ind.test, covar.row = svd_all$u[ind.test, ]) cor(pred2, y[ind.test]) ## [1] 0.311157

## Some matrix computations

For example, let’s compute the correlation of the first 2000 columns.

system.time( corr <- cor(mat[, 1:2000]) ) ## user system elapsed ## 10.822 0.007 10.824 system.time( corr2 <- big_cor(mat, ind.col = 1:2000) ) ## user system elapsed ## 0.729 0.005 0.737 all.equal(corr2, corr) ## [1] TRUE

## Advantages of using big.matrix objects

- you can apply algorithms on 100GB of data,
- you can easily parallelize your algorithms because the data on disk is shared,
- you write more efficient algorithms,
- you can use different types of data, for example, in my field, I’m storing my data with only 1 byte per element (rather than 8 bytes for a standard R matrix).

In a next post, I’ll try to talk about good practices on how to use parallelism in R.

## References

Kane, Michael J, John W Emerson, and Stephen Weston. 2013. “Scalable Strategies for Computing with Massive Data.” *Journal of Statistical Software* 55 (14): 1–19. doi:10.18637/jss.v055.i14.

Tibshirani, Robert, Jacob Bien, Jerome Friedman, Trevor Hastie, Noah Simon, Jonathan Taylor, and Ryan J Tibshirani. 2012. “Strong rules for discarding predictors in lasso-type problems.” *Journal of the Royal Statistical Society. Series B, Statistical Methodology* 74 (2): 245–66. doi:10.1111/j.1467-9868.2011.01004.x.

Wickham, Hadley. 2011. “The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software.” *Journal of Statistical Software* 40 (1): 1–29. doi:10.1039/np9971400083.

Zeng, Yaohui, and Patrick Breheny. 2017. “The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R,” January. http://arxiv.org/abs/1701.05936.

**leave a comment**for the author, please follow the link and comment on their blog:

**blog**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.