In this short post, I explain why I’m moving away from using the apply function.

## With matrices

It’s okay to use apply with a dense matrix, although you can often use an equivalent that is faster.

```r
N <- M <- 8000
X <- matrix(rnorm(N * M), N)
system.time(res1 <- apply(X, 2, mean))
##    user  system elapsed
##    0.73    0.05    0.78
system.time(res2 <- colMeans(X))
##    user  system elapsed
##    0.05    0.00    0.05
stopifnot(isTRUE(all.equal(res2, res1)))
```

“Yeah, there are colSums and colMeans, but what about computing standard deviations?”

The {matrixStats} package provides lots of apply-like functions, including one for column-wise standard deviations.

```r
system.time(res3 <- apply(X, 2, sd))
##    user  system elapsed
##    0.96    0.01    0.97
system.time(res4 <- matrixStats::colSds(X))
##    user  system elapsed
##     0.2     0.0     0.2
stopifnot(isTRUE(all.equal(res4, res3)))
```
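
The same pattern holds for other summaries. Here is a small sketch, assuming {matrixStats} is installed, that checks a few of its functions against their apply equivalents; the matrix `Y` and its dimensions are arbitrary choices for illustration.

```r
library(matrixStats)

set.seed(1)
Y <- matrix(rnorm(200 * 50), 200)  # small illustrative matrix (200 rows, 50 columns)

# column-wise summaries
col_med <- colMedians(Y)
col_var <- colVars(Y)
stopifnot(isTRUE(all.equal(col_med, apply(Y, 2, median))))
stopifnot(isTRUE(all.equal(col_var, apply(Y, 2, var))))

# row-wise summaries
row_sd  <- rowSds(Y)
row_max <- rowMaxs(Y)
stopifnot(isTRUE(all.equal(row_sd, apply(Y, 1, sd))))
stopifnot(isTRUE(all.equal(row_max, apply(Y, 1, max))))
```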

## With data frames

```r
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
apply(head(iris), 2, identity)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 "5.1"        "3.5"       "1.4"        "0.2"       "setosa"
## 2 "4.9"        "3.0"       "1.4"        "0.2"       "setosa"
## 3 "4.7"        "3.2"       "1.3"        "0.2"       "setosa"
## 4 "4.6"        "3.1"       "1.5"        "0.2"       "setosa"
## 5 "5.0"        "3.6"       "1.4"        "0.2"       "setosa"
## 6 "5.4"        "3.9"       "1.7"        "0.4"       "setosa"
```

A DATA FRAME IS NOT A MATRIX (it’s a list).

The first thing apply does is convert the object to a matrix. This copy consumes memory and, in the previous example, turns all the data into strings, because a matrix can hold only one type.
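
This conversion can be checked directly. The sketch below uses only base R and the built-in iris data; the variable names are chosen for illustration.

```r
# a data frame is a list of columns, each with its own type
stopifnot(is.list(iris), !is.matrix(iris))

# as.matrix() has to pick a single type for all columns
iris_mat <- as.matrix(iris)
typeof(iris_mat)
## [1] "character"

# so, through apply(), every column reaches FUN as a character vector
types_apply <- apply(iris, 2, typeof)
stopifnot(all(types_apply == "character"))

# with numeric columns only, the matrix stays numeric
typeof(as.matrix(iris[1:4]))
## [1] "double"
```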

What can you use instead of apply with a data frame?

• If you want to operate on all columns, since a data frame is just a list, you can use sapply instead (or map* if you are a purrrist).

```r
sapply(iris, typeof)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
##     "double"     "double"     "double"     "double"    "integer"
```

• If you want to operate on all rows, I recommend watching this webinar.
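
For simple row-wise computations, one option is to work on the columns with vectorized functions. This is only a sketch in base R, not necessarily the approach the webinar recommends, and the variable names are made up for illustration.

```r
# row-wise maximum of the four numeric columns:
# pmax() is vectorized, and do.call() passes each column as one argument
row_max <- do.call(pmax, iris[1:4])

# row-wise sum, obtained by adding the columns one after the other
row_sum <- Reduce(`+`, iris[1:4])

# arbitrary function of several columns, evaluated once per row
petal_area <- mapply(function(len, wid) len * wid,
                     iris$Petal.Length, iris$Petal.Width)

# same results as apply() on the numeric part of the data
num_mat <- as.matrix(iris[1:4])
stopifnot(isTRUE(all.equal(row_max, unname(apply(num_mat, 1, max)))))
stopifnot(isTRUE(all.equal(row_sum, unname(apply(num_mat, 1, sum)))))
```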

## With sparse matrices

The memory problem matters even more with sparse matrices: apply first converts the sparse matrix to a dense one, which makes it very slow for such data.

```r
library(Matrix)

X.sp <- rsparsematrix(N, M, density = 0.01)

## X.sp is converted to a dense matrix when using apply
system.time(res5 <- apply(X.sp, 2, mean))
##    user  system elapsed
##    0.78    0.46    1.25
system.time(res6 <- Matrix::colMeans(X.sp))
##    user  system elapsed
##    0.01    0.00    0.02
stopifnot(isTRUE(all.equal(res6, res5)))
```
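
To get a feeling for the cost of that conversion, one can compare the size of a sparse matrix with the size of its dense version. This is a rough sketch on a smaller matrix than X.sp; the name `S` and its dimensions are arbitrary.

```r
library(Matrix)

set.seed(1)
S <- rsparsematrix(2000, 2000, density = 0.01)  # about 1% of non-zero values

size_sparse <- object.size(S)
size_dense  <- object.size(as.matrix(S))  # dense copy, similar to what apply() needs

print(size_sparse, units = "MB")
print(size_dense, units = "MB")

# the dense copy stores every zero explicitly, so it is much larger
stopifnot(as.numeric(size_dense) > 10 * as.numeric(size_sparse))
```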

You could implement your own apply-like function for sparse matrices by viewing a sparse matrix as a data frame with 3 columns (i and j storing the positions of the non-zero elements, and x storing their values). Then, you could use a group_by-summarize approach.

For instance, for the previous example, you can do this in base R:

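```r
apply2_sp <- function(X, FUN) {
  # one result per column; columns without any non-zero value stay at 0
  res <- numeric(ncol(X))
  # triplet form: slots i and j hold 0-based positions, x holds the values
  X2 <- as(X, "dgTMatrix")
  # group the non-zero values by column and apply FUN to each group
  tmp <- tapply(X2@x, X2@j, FUN)
  res[as.integer(names(tmp)) + 1] <- tmp
  res
}

system.time(res7 <- apply2_sp(X.sp, sum) / nrow(X.sp))
##    user  system elapsed
##    0.03    0.00    0.03
stopifnot(isTRUE(all.equal(res7, res5)))
```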
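
The same idea extends to rows by grouping on i instead of j. The helper `apply_nz` below is a hypothetical name, and the sketch assumes a general (non-symmetric) sparse matrix such as the one returned by rsparsematrix, which `as(X, "TsparseMatrix")` turns into triplet form.

```r
library(Matrix)

# hypothetical helper: apply FUN to the non-zero values of each row
# (MARGIN = 1) or each column (MARGIN = 2) of a sparse matrix
apply_nz <- function(X, MARGIN, FUN) {
  X2 <- as(X, "TsparseMatrix")               # triplet form with slots i, j, x
  idx <- if (MARGIN == 1) X2@i else X2@j     # 0-based row or column indices
  res <- numeric(dim(X)[MARGIN])             # rows/columns with only zeros stay at 0
  tmp <- tapply(X2@x, idx, FUN)
  res[as.integer(names(tmp)) + 1] <- tmp
  res
}

set.seed(1)
X.small <- rsparsematrix(200, 100, density = 0.05)

row_sums <- apply_nz(X.small, 1, sum)
col_sums <- apply_nz(X.small, 2, sum)
stopifnot(isTRUE(all.equal(row_sums, Matrix::rowSums(X.small))))
stopifnot(isTRUE(all.equal(col_sums, Matrix::colSums(X.small))))
```

Because FUN only sees the non-zero values, this works directly for sum, but a mean or a standard deviation has to account for the zeros separately, as done above by dividing the sum by nrow(X.sp).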

## Conclusion

Using apply with a dense matrix is fine, but try to avoid it if you have a data frame or a sparse matrix.