In this short post, I explain why I’m moving away from the `apply` function.

## With matrices

It’s okay to use `apply` with a dense matrix, although you can often use an equivalent that is faster.

``````N <- M <- 8000
X <- matrix(rnorm(N * M), N)
system.time(res1 <- apply(X, 2, mean))``````
``````##    user  system elapsed
##    0.73    0.05    0.78``````
``system.time(res2 <- colMeans(X))``
``````##    user  system elapsed
##    0.05    0.00    0.05``````
``stopifnot(isTRUE(all.equal(res2, res1)))``

“Yeah, there are `colSums` and `colMeans`, but what about computing standard deviations?”

There are lots of `apply`-like functions in package {matrixStats}.

``system.time(res3 <- apply(X, 2, sd))``
``````##    user  system elapsed
##    0.96    0.01    0.97``````
``system.time(res4 <- matrixStats::colSds(X))``
``````##    user  system elapsed
##     0.2     0.0     0.2``````
``stopifnot(isTRUE(all.equal(res4, res3)))``
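
If you would rather not add a dependency, column standard deviations can also be computed with vectorized base R functions. The following is only a minimal sketch: `col_sds_base` is a hypothetical helper name (it is not part of base R or {matrixStats}), and it allocates a centered copy of the matrix, so it is not claimed to be as fast or as memory-efficient as `colSds`.

```r
# Column standard deviations using only vectorized base R functions.
# `col_sds_base` is a hypothetical name used for illustration.
col_sds_base <- function(X) {
  centered <- sweep(X, 2, colMeans(X))        # subtract each column's mean
  sqrt(colSums(centered^2) / (nrow(X) - 1))   # sample standard deviation
}

set.seed(1)
X_small <- matrix(rnorm(200 * 50), 200)
sds_base  <- col_sds_base(X_small)
sds_apply <- apply(X_small, 2, sd)
stopifnot(isTRUE(all.equal(sds_base, sds_apply)))
```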

## With data frames

``head(iris)``
``````##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa``````
``apply(head(iris), 2, identity)``
``````##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 "5.1"        "3.5"       "1.4"        "0.2"       "setosa"
## 2 "4.9"        "3.0"       "1.4"        "0.2"       "setosa"
## 3 "4.7"        "3.2"       "1.3"        "0.2"       "setosa"
## 4 "4.6"        "3.1"       "1.5"        "0.2"       "setosa"
## 5 "5.0"        "3.6"       "1.4"        "0.2"       "setosa"
## 6 "5.4"        "3.9"       "1.7"        "0.4"       "setosa"``````

A DATA FRAME IS NOT A MATRIX (it’s a list).

The first thing that `apply` does is convert the object to a matrix. This consumes memory and, in the previous example, turns all the data into strings (because a matrix can hold only one type).
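
To see this conversion in isolation, you can call `as.matrix()` yourself on the data frame. The variable names `m_all` and `m_num` below are just illustrative.

```r
# Mixed column types: every value is coerced to character.
m_all <- as.matrix(head(iris))
typeof(m_all)
## [1] "character"

# Numeric columns only: the matrix stays numeric.
m_num <- as.matrix(head(iris[1:4]))
typeof(m_num)
## [1] "double"
```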

What can you use as a replacement for `apply` with a data frame?

• If you want to operate on all columns, since a data frame is just a list, you can use `sapply` instead (or `map*` if you are a purrrist).

``sapply(iris, typeof)``
``````## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
##     "double"     "double"     "double"     "double"    "integer"``````
• If you want to operate on all rows, I recommend watching this webinar.
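
As a rough illustration of row-wise work without `apply`, here are a few base R idioms that treat the data frame as a list of columns. This is a sketch of one possible approach and not necessarily what the webinar recommends; the names `num`, `row_sum`, `row_max` and `flag` are introduced here for the example.

```r
num <- iris[1:4]  # numeric columns only

# Row sums: add the columns together, element by element.
row_sum <- Reduce(`+`, num)

# Row maxima: pmax() takes the columns as separate arguments.
row_max <- do.call(pmax, num)

# Arbitrary per-row function on columns of different types:
# mapply() walks the columns in parallel, so nothing is coerced to a matrix.
flag <- mapply(
  function(len, sp) if (sp == "setosa" && len > 5) "long setosa" else "other",
  iris$Sepal.Length, as.character(iris$Species)
)
```

The first two idioms stay fully vectorized; `mapply` calls the function once per row, so it is the more flexible but slower option.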

## With sparse matrices

The memory problem is even worse with sparse matrices: `apply` first converts them to dense matrices, which makes it very slow for such data.

``````library(Matrix)

X.sp <- rsparsematrix(N, M, density = 0.01)

## X.sp is converted to a dense matrix when using `apply`
system.time(res5 <- apply(X.sp, 2, mean))  ``````
``````##    user  system elapsed
##    0.78    0.46    1.25``````
``system.time(res6 <- Matrix::colMeans(X.sp))``
``````##    user  system elapsed
##    0.01    0.00    0.02``````
``stopifnot(isTRUE(all.equal(res6, res5)))``
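
To get a rough idea of the memory cost of this conversion, you can compare the size of a sparse matrix with that of its dense counterpart. This is a small sketch on a 2000 x 2000 matrix (smaller than the one above, to keep it quick); `S`, `size_sparse` and `size_dense` are names introduced here for illustration.

```r
library(Matrix)

set.seed(1)
S <- rsparsematrix(2000, 2000, density = 0.01)

# Sizes in bytes of the sparse object and of its dense conversion.
size_sparse <- as.numeric(object.size(S))
size_dense  <- as.numeric(object.size(as.matrix(S)))

# The dense copy stores all 4 million entries as doubles (about 32 MB),
# while the sparse object only stores the ~1% of non-zero entries.
size_dense / size_sparse
```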

You could implement your own `apply`-like function for sparse matrices by seeing a sparse matrix as a data frame with 3 columns (`i` and `j` storing the positions of the non-zero elements, and `x` storing the values of these elements). Then, you could use a `group_by` + `summarize` approach.

For instance, for the previous example, you can do this in base R:

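``````apply2_sp <- function(X, FUN) {
  # Columns without any stored value keep the default 0
  res <- numeric(ncol(X))
  # Triplet form with slots `i`, `j` (0-based positions) and `x` (values);
  # recent versions of {Matrix} deprecate as(X, "dgTMatrix")
  X2 <- as(X, "TsparseMatrix")
  # Apply FUN to the stored values of each column
  tmp <- tapply(X2@x, X2@j, FUN)
  res[as.integer(names(tmp)) + 1] <- tmp
  res
}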

system.time(res7 <- apply2_sp(X.sp, sum) / nrow(X.sp))``````
``````##    user  system elapsed
##    0.03    0.00    0.03``````
``stopifnot(isTRUE(all.equal(res7, res5)))``
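
To make the triplet view more concrete, here is the same idea on a tiny 3 x 3 matrix, written out step by step. It is only a sketch; `S`, `triplets`, `col_sum` and `col_max_stored` are names introduced for this example. It also shows a caveat of this approach: the function only sees the stored (non-zero) values, so it gives the right answer for `sum`, but not necessarily for functions such as `max` where the implicit zeros matter.

```r
library(Matrix)

# Four stored entries; column 2 is empty, column 3 holds only negatives.
S <- sparseMatrix(i = c(1, 3, 2, 3), j = c(1, 1, 3, 3),
                  x = c(2, 4, -1, -5), dims = c(3, 3))

# Triplet view: one row per stored entry (`i` and `j` are 0-based).
S_t <- as(S, "TsparseMatrix")
triplets <- data.frame(i = S_t@i, j = S_t@j, x = S_t@x)

# "Group by" column index, then "summarize" with sum.
col_sum <- numeric(ncol(S))
tmp <- tapply(triplets$x, triplets$j, sum)
col_sum[as.integer(names(tmp)) + 1] <- tmp
stopifnot(isTRUE(all.equal(col_sum, Matrix::colSums(S))))

# Caveat: with max, the implicit zero in column 3 is ignored.
col_max_stored <- tapply(triplets$x, triplets$j, max)
col_max_stored[["2"]]    # -1: maximum of the stored values only
max(as.matrix(S)[, 3])   #  0: true maximum of the column
```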

## Conclusion

Using `apply` with a dense matrix is fine, but try to avoid it if you have a data frame or a sparse matrix.