In this short post, I explain why I’m moving away from using the apply function.

With matrices

It’s okay to use apply with a dense matrix, although you can often use an equivalent that is faster.

N <- M <- 8000
X <- matrix(rnorm(N * M), N)
system.time(res1 <- apply(X, 2, mean))
##    user  system elapsed 
##    0.73    0.05    0.78
system.time(res2 <- colMeans(X))
##    user  system elapsed 
##    0.05    0.00    0.05
stopifnot(isTRUE(all.equal(res2, res1)))
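
The same idea applies row-wise. As a small sketch (the toy matrix `Y` below is made up for illustration, and kept small so it runs instantly), base R's rowMeans and rowSums give the same results as apply over rows:

```r
set.seed(42)
# Small made-up matrix, just for illustration
Y <- matrix(rnorm(20 * 5), nrow = 20)

# apply() over rows (MARGIN = 1) versus the dedicated base R functions
row_means_apply <- apply(Y, 1, mean)
row_means_fast  <- rowMeans(Y)
row_sums_apply  <- apply(Y, 1, sum)
row_sums_fast   <- rowSums(Y)

stopifnot(isTRUE(all.equal(row_means_fast, row_means_apply)))
stopifnot(isTRUE(all.equal(row_sums_fast, row_sums_apply)))
```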

“Yeah, there are colSums and colMeans, but what about computing standard deviations?”

There are lots of apply-like functions in the {matrixStats} package.

system.time(res3 <- apply(X, 2, sd))
##    user  system elapsed 
##    0.96    0.01    0.97
system.time(res4 <- matrixStats::colSds(X))
##    user  system elapsed 
##     0.2     0.0     0.2
stopifnot(isTRUE(all.equal(res4, res3)))

With data frames

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
apply(head(iris), 2, identity)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species 
## 1 "5.1"        "3.5"       "1.4"        "0.2"       "setosa"
## 2 "4.9"        "3.0"       "1.4"        "0.2"       "setosa"
## 3 "4.7"        "3.2"       "1.3"        "0.2"       "setosa"
## 4 "4.6"        "3.1"       "1.5"        "0.2"       "setosa"
## 5 "5.0"        "3.6"       "1.4"        "0.2"       "setosa"
## 6 "5.4"        "3.9"       "1.7"        "0.4"       "setosa"

A DATA FRAME IS NOT A MATRIX (it’s a list).

The first thing apply does is convert the object to a matrix, which consumes memory and, in the previous example, turns all the data into strings (because a matrix can hold only one type).
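
To see the conversion on its own, here is a minimal sketch that calls as.matrix directly (the variable names `m_all` and `m_num` are mine, for illustration only). With the factor column included, everything becomes character; with only the numeric columns, the matrix stays numeric:

```r
# All columns, including the factor Species: the matrix must be character
m_all <- as.matrix(head(iris))
typeof(m_all)

# Only the four numeric columns: the matrix can stay numeric
m_num <- as.matrix(head(iris)[1:4])
typeof(m_num)
```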

What can you use as a replacement for apply with a data frame?

  • If you want to operate on all columns, since a data frame is just a list, you can use sapply instead (or map* if you are a purrrist).

    sapply(iris, typeof)
    ## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
    ##     "double"     "double"     "double"     "double"    "integer"
  • If you want to operate on all rows, I recommend watching this webinar.
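
As a rough sketch of both cases in base R (this is one possible approach, not necessarily the one the webinar recommends; `df`, `col_means` and `labels` are names I made up): vapply is a stricter variant of sapply that checks the type of each result, and mapply can walk over several columns in parallel, so each row is processed without converting the data frame to a matrix.

```r
df <- head(iris)

# Column-wise: keep numeric columns, then compute one mean per column.
# vapply() errors if a result is not a single numeric value.
num_cols  <- vapply(df, is.numeric, logical(1))
col_means <- vapply(df[num_cols], mean, numeric(1))

# Row-wise: iterate over two columns in parallel; each keeps its own type.
labels <- mapply(
  function(len, sp) paste0(sp, ":", len),
  df$Sepal.Length, as.character(df$Species)
)
```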

With sparse matrices

The memory problem is even worse with sparse matrices: apply first converts them to dense matrices, which makes it very slow for such data.

library(Matrix)

X.sp <- rsparsematrix(N, M, density = 0.01)

## X.sp is converted to a dense matrix when using `apply`
system.time(res5 <- apply(X.sp, 2, mean))  
##    user  system elapsed 
##    0.78    0.46    1.25
system.time(res6 <- Matrix::colMeans(X.sp))
##    user  system elapsed 
##    0.01    0.00    0.02
stopifnot(isTRUE(all.equal(res6, res5)))
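
To get a feel for the memory cost of that conversion, here is a small sketch comparing the size of a sparse matrix with its dense equivalent (the dimensions are smaller than above so it runs quickly; `S` and `D` are illustrative names):

```r
library(Matrix)

set.seed(1)
S <- rsparsematrix(1000, 1000, density = 0.01)  # about 1% of entries stored
D <- as.matrix(S)                               # dense copy, stores every entry

size_sparse <- as.numeric(object.size(S))
size_dense  <- as.numeric(object.size(D))

# Ratio of dense to sparse memory footprint
size_dense / size_sparse
```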

You could implement your own apply-like function for sparse matrices by viewing a sparse matrix as a data frame with 3 columns (i and j storing the positions of the non-zero elements, and x storing their values). Then, you could use a group_by-summarize approach.

For instance, for the previous example, you can do this in base R:

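apply2_sp <- function(X, FUN) {
  # Columns without any stored (non-zero) element keep the value 0
  res <- numeric(ncol(X))
  # Triplet form: slots i and j (0-based positions) and x (values)
  X2 <- as(X, "TsparseMatrix")
  # Apply FUN to the non-zero values of each column
  tmp <- tapply(X2@x, X2@j, FUN)
  res[as.integer(names(tmp)) + 1] <- tmp
  res
}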

system.time(res7 <- apply2_sp(X.sp, sum) / nrow(X.sp))
##    user  system elapsed 
##    0.03    0.00    0.03
stopifnot(isTRUE(all.equal(res7, res5)))
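
Note that FUN only sees the non-zero values of each column, which is why the example above computes sums and divides by the number of rows afterwards. As a hedged sketch of how the same grouping idea could extend to column standard deviations (my own illustration, using the identity var = (sum(x^2) - sum(x)^2 / n) / (n - 1); the names `s1`, `s2` and `sds` are made up):

```r
library(Matrix)

set.seed(1)
X <- rsparsematrix(200, 50, density = 0.05)
X2 <- as(X, "TsparseMatrix")  # triplet form: i, j (0-based) and x

n <- nrow(X)
# Group non-zero values by column; explicit levels keep empty columns.
col <- factor(X2@j, levels = seq_len(ncol(X)) - 1L)
s1 <- as.vector(tapply(X2@x,   col, sum))  # per-column sum
s2 <- as.vector(tapply(X2@x^2, col, sum))  # per-column sum of squares
s1[is.na(s1)] <- 0  # empty columns come back as NA
s2[is.na(s2)] <- 0

# Zeros contribute nothing to s1 and s2, but they do count in n.
sds <- sqrt((s2 - s1^2 / n) / (n - 1))

stopifnot(isTRUE(all.equal(sds, apply(as.matrix(X), 2, sd))))
```

This formula can lose precision when the mean is large relative to the spread, so treat it as an illustration rather than a robust implementation.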

Conclusion

Using apply with a dense matrix is fine, but try to avoid it if you have a data frame or a sparse matrix.