A Faster Scale Function


Problem Setup

In a recent question on LinkedIn’s R user group, a user asked “How to normalize by the row sums of the variable?”. First, we must define what we mean by “normalizing” a matrix/data.frame.

One way to standardize/normalize a variable is to subtract the minimum and divide by the range (the maximum minus the minimum), which puts the data into the [0, 1] domain. Many times, however, users want to perform z-score normalization by computing:
(x − μ) / σ
where μ and σ are the mean and standard deviation of the variable (typically computed per column).
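For a single numeric vector, these two approaches look like the following (a small illustrative sketch, not part of the original question):

x = c(2, 5, 9)
(x - min(x)) / (max(x) - min(x))   # min-max normalization: values fall in [0, 1]
(x - mean(x)) / sd(x)              # z-score normalization: mean 0, sd 1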

The answer on that forum eventually came down to using the scale command. The scale function is great: it is simple and has only two options besides the input matrix:

  1. Center – should each column have its mean subtracted off?
  2. Scale – should each column be divided by its standard deviation (after centering, if requested)? A short example follows this list.
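For instance, on a toy matrix (a minimal sketch of the two arguments, separate from the data used below):

m = matrix(1:6, nrow = 3)
scale(m)                                 # center and scale each column
scale(m, center = TRUE, scale = FALSE)   # center only
scale(m, center = FALSE, scale = TRUE)   # scale only (see the note in the conclusions)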

Overall, the function is fast, but it can always be faster.

The matrixStats package has some very quick operations for row/column operations and computing statistics along these margins.
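For example, a couple of its column- and row-wise helpers (a brief sketch; this assumes matrixStats is installed):

library(matrixStats)
m = matrix(rnorm(20), nrow = 5)
colSds(m)     # column standard deviations, computed in compiled code
rowMeans2(m)  # row means, a fast analogue of base rowMeans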

Creating Some Data

Here we will create a random normal matrix with each column having a mean of 14 and a standard deviation of 5.

set.seed(1)
mat = matrix(rnorm(1e7, mean = 14, sd = 5),
    nrow = 1e5)
dim(mat)
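A quick sanity check on the simulated data (an addition to this post; printed values are omitted because they depend on the simulation):

mean(colMeans(mat))      # should be very close to 14
mean(apply(mat, 2, sd))  # should be very close to 5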

How fast is scale?

Let’s see how fast the scale function is:

system.time({
    sx = scale(mat)
    })

   user  system elapsed
  0.971   0.285   1.262

That’s pretty fast! There is not much room for improvement on a single matrix of this size, but the speedup becomes relevant if you have many matrices, or matrices much larger than the one defined here in mat.
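Note that scale stores the centering and scaling values as attributes on its result, which we will mimic later (a quick peek; output omitted):

head(attr(sx, "scaled:center"))
head(attr(sx, "scaled:scale"))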

Defining a new function

First, we load the matrixStats package; the only function from it that we really use is colSds.

library(matrixStats)
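colSds computes each column’s standard deviation in compiled code and can reuse precomputed column means through its center argument, which is how the function below calls it (a small sketch; output omitted):

cm = colMeans(mat)
csd = colSds(mat, center = cm)
head(csd)   # each value should be close to 5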

colScale = function(x,
    center = TRUE,
    scale = TRUE,
    add_attr = TRUE,
    rows = NULL,
    cols = NULL) {

    if (!is.null(rows) && !is.null(cols)) {
        x <- x[rows, cols, drop = FALSE]
    } else if (!is.null(rows)) {
        x <- x[rows, , drop = FALSE]
    } else if (!is.null(cols)) {
        x <- x[, cols, drop = FALSE]
    }

  ################
  # Get the column means
  ################
    cm = colMeans(x, na.rm = TRUE)
  ################
  # Get the column sd
  ################
    if (scale) {
        csd = colSds(x, center = cm)
    } else {
        # just divide by 1 if not
        csd = rep(1, length = length(cm))
    }
    if (!center) {
        # just subtract 0
        cm = rep(0, length = length(cm))
    }
    x = t( (t(x) - cm) / csd )
    if (add_attr) {
        if (center) {
            attr(x, "scaled:center") <- cm
        }
        if (scale) {
            attr(x, "scaled:scale") <- csd
        }
    }
    return(x)
}
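The optional arguments let you subset the matrix and control centering/scaling. For example (a hedged usage sketch; output omitted):

# center, but do not scale, only the first 10 columns
sub = colScale(mat, cols = 1:10, scale = FALSE)
dim(sub)   # 100000 x 10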

Let’s break down what the function is doing:

  1. The function takes in a matrix x with options:
    1. subsetting rows or cols
    2. center each column (by the mean) or not
    3. scale each column (by the standard deviation) or not
    4. Add the attributes of center/scale, so they match the scale output.
  2. The function subsets the matrix if the rows/cols options are passed.
  3. Column means are calculated.
  4. Column standard deviations are calculated (reusing the column means) if scale = TRUE, or simply set to 1 if scale = FALSE.
  5. If the data is not to be centered, the centers are set to 0.
  6. The data is transposed, the means are subtracted, and the result is divided by the standard deviations. The data is then transposed back.
    • The reason for the transposes is that R operates column-wise. Let p be the number of columns. The column means/sds are vectors of length p. If we simply subtracted the column means from the matrix, R would recycle those p values down the columns (to reach length n, the number of rows) and subtract element-wise, which is not what we want. See the small demonstration after this list.
  7. The attributes are added to the matrix to match scale output.
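Here is a small demonstration of that recycling pitfall (an illustration added for clarity, not from the original post):

m  = matrix(1:6, nrow = 2)   # 2 x 3 toy matrix
cm = colMeans(m)             # length-3 vector of column means
m - cm                       # wrong: cm is recycled down the columns
t( t(m) - cm )               # right: each column has its own mean removed
sweep(m, 2, cm)              # an equivalent alternative using sweep()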

colScale timing

Now we can see how long the colScale command takes:

system.time({
    csx = colScale(mat)
})

   user  system elapsed
  0.231   0.039   0.271

This is a lot faster than the scale function. First and foremost, let us make sure that these give the same results:

all.equal(sx, csx)

[1] TRUE

Better benchmarking

OK, we found that we can speed up this operation, but maybe this was a one-off event. Let’s use the microbenchmark package to benchmark the two functions over repeated runs:

library(microbenchmark)
mb = microbenchmark(colScale(mat), scale(mat), times = 20, unit = "s")
print(mb)

Unit: seconds
          expr       min        lq      mean    median        uq      max neval cld
 colScale(mat) 0.2738255 0.3426157 0.4682762 0.3770815 0.4872505 1.844507    20  a
    scale(mat) 1.2945400 1.5511671 1.9378106 1.9226087 2.2731682 2.601223    20   b

We can visualize the results using ggplot2 and some violin plots.

library(ggplot2)
g = ggplot(data = mb, aes(y = time / 1e9, x = expr)) +
    geom_violin() +
    theme_grey(base_size = 20) +
    xlab("Method") +
    ylab("Time (seconds)")
print(g)

[Plot: violin plots of run times (seconds) for colScale(mat) vs. scale(mat)]

What about scaling rows!

Note that above we standardized the matrix with respect to the columns, not the rows. We can standardize rows simply by transposing the matrix, running scale, and then transposing back:

system.time({
  scaled_row = t( scale(t(mat)) )
})

   user  system elapsed
  2.165   0.624   3.398

all(abs(rowMeans(scaled_row)) < 1e-15)

[1] TRUE

Again, we can do the same thing with colScale:

system.time({
  colscaled_row = t( colScale(t(mat)) )
})

   user  system elapsed
  0.426   0.097   0.542

all(abs(rowMeans(colscaled_row)) < 1e-15)

[1] TRUE

all.equal(colscaled_row, scaled_row)

[1] TRUE

And we see the results are identical.

Creating rowScale

The above results are good for what we would like to do. Still, we may want to define a rowScale function (below) so that we do not have to transpose and then transpose back, as this may take some extra time.

Again, if we care about speed, this may help.

rowScale = function(x,
    center = TRUE,
    scale = TRUE,
    add_attr = TRUE,
    rows = NULL,
    cols = NULL) {

    if (!is.null(rows) && !is.null(cols)) {
        x <- x[rows, cols, drop = FALSE]
    } else if (!is.null(rows)) {
        x <- x[rows, , drop = FALSE]
    } else if (!is.null(cols)) {
        x <- x[, cols, drop = FALSE]
    }

  ################
  # Get the row means
  ################
    cm = rowMeans(x, na.rm = TRUE)
  ################
  # Get the row sds
  ################
    if (scale) {
        csd = rowSds(x, center = cm)
    } else {
        # just divide by 1 if not
        csd = rep(1, length = length(cm))
    }
    if (!center) {
        # just subtract 0
        cm = rep(0, length = length(cm))
    }
    x = (x - cm) / csd
    if (add_attr) {
        if (center) {
            attr(x, "scaled:center") <- cm
        }
        if (scale) {
            attr(x, "scaled:scale") <- csd
        }
    }
    return(x)
}
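As with colScale, the optional arguments allow subsetting and turning features off. For example (a hedged sketch; output omitted):

# standardize only the first 10 rows, without attaching the attributes
rs10 = rowScale(mat, rows = 1:10, add_attr = FALSE)
dim(rs10)   # 10 x 100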

Now let’s see how we do with rowScale:

system.time({
  rowscaled_row = rowScale(mat)
})

   user  system elapsed
  0.174   0.016   0.206

all(abs(rowMeans(rowscaled_row)) < 1e-15)

[1] TRUE

all.equal(rowscaled_row, scaled_row)

[1] TRUE

Let’s look at the times for this breakdown using microbenchmark:

mb_row = microbenchmark(t( colScale(t(mat)) ),
                        t( scale(t(mat)) ),
                        rowScale(mat),
                        times = 20, unit = "s")
print(mb_row)

Unit: seconds
                expr       min        lq      mean    median        uq       max neval cld
 t(colScale(t(mat))) 0.4009850 0.4653892 0.6303221 0.6659232 0.7422429 0.9008130    20  a
    t(scale(t(mat))) 1.7889625 2.0211590 2.4763732 2.1928348 2.6543272 5.0518348    20   b
       rowScale(mat) 0.1665216 0.1789968 0.2688652 0.2228373 0.3413327 0.5138103    20  a

and visualize the results:

g %+% mb_row

[Plot: violin plots of run times (seconds) for t(colScale(t(mat))), t(scale(t(mat))), and rowScale(mat)]

Conclusions

Overall, normalizing a matrix using a z-score transformation can be very fast and efficient. The scale function is well suited for this purpose, but the matrixStats package allows for faster computation done in C. Note that the scale function behaves differently when center = FALSE, as shown in this excerpt from base::scale.default:

f <- function(v) {
  v <- v[!is.na(v)]
  sqrt(sum(v^2)/max(1, length(v) - 1L))
}
scale <- apply(x, 2L, f)

If the data is not centered (center = FALSE), scale divides each column by its root mean square, that is, the square root of the sum of squares divided by n - 1, rather than by its standard deviation. This may be the desired behavior, but if you want to divide by the standard deviation instead, colScale/rowScale can do that. I will talk to Henrik Bengtsson (the matrixStats author/maintainer) about incorporating these functions into matrixStats for general use. But for now, you can use the above code.
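A small illustration of that difference (added for clarity; the numbers are easy to verify by hand):

v = c(1, 2, 3, 4)
sd(v)                                    # standard deviation, about 1.29
sqrt(sum(v^2) / (length(v) - 1))         # root mean square used by scale(center = FALSE), about 3.16
scale(v, center = FALSE)[, 1]            # each value divided by the root mean square
colScale(cbind(v), center = FALSE)[, 1]  # each value divided by the column standard deviation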

