[This article was first published on Rbloggers – A HopStat and Jump Away, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## Problem Setup

In recent question on LinkedIn’s R user group, a user asked “How to normalize by the row sums of the variable?”. Now first, we must define what we mean by “normalize” a matrix/data.frame.

One way to standardize/normalize a row is to subtract by the mean and divide by the max to put the data into the [0, 1] domain. Many times, however, users want to perform z-score normalization by doing:
(x – μ)/σ
where μ/σ are the mean/standard deviation of the column (typically the column).

The answer on that forum eventually came down to using the scale command. The scale function is great. It is simple and has 2 options other than passing in the matrix:

1. Center – should the columns have their mean subracted off?
2. Scale – should the columns have their standard deviation divided after/not centering?

Overall, the function is fast, but it can always be faster.

The matrixStats package has some very quick operations for row/column operations and computing statistics along these margins.

## Creating Some Data

Here we will create a random normal matrix with each column having a mean of 14 and a standard deviation of 5.

set.seed(1)
mat = matrix(rnorm(1e7, mean = 14, sd = 5),
nrow = 1e5)
dim(mat)


### How fast is scale?

Let’s see how fast the scale function is:

system.time({
sx = scale(mat)
})

user  system elapsed
0.971   0.285   1.262


That’s pretty fast! Overall, there is not much room for improvement in this case, but it may be relevant if you have a lot of matrices or ones bigger than the one defined here in mat.

## Defining a new function

First, we must load in the matrixStats package and the only function we really are using is the colSds.

library(matrixStats)

colScale = function(x,
center = TRUE,
scale = TRUE,
rows = NULL,
cols = NULL) {

if (!is.null(rows) && !is.null(cols)) {
x <- x[rows, cols, drop = FALSE]
} else if (!is.null(rows)) {
x <- x[rows, , drop = FALSE]
} else if (!is.null(cols)) {
x <- x[, cols, drop = FALSE]
}

################
# Get the column means
################
cm = colMeans(x, na.rm = TRUE)
################
# Get the column sd
################
if (scale) {
csd = colSds(x, center = cm)
} else {
# just divide by 1 if not
csd = rep(1, length = length(cm))
}
if (!center) {
# just subtract 0
cm = rep(0, length = length(cm))
}
x = t( (t(x) - cm) / csd )
if (center) {
attr(x, "scaled:center") <- cm
}
if (scale) {
attr(x, "scaled:scale") <- csd
}
}
return(x)
}


Let’s break down what the function is doing:

1. The function takes in a matrix x with options:
1. subsetting rows or cols
2. center each column (by the mean) or not
3. scale each column (by the standard deviation) or not
4. Add the attributes of center/scale, so they match the scale output.
2. The functions subsets the matrix if options are passed.
3. Column means are calculated
4. Column standard deviations are calculated (using the colmeans) if scale = TRUE or simply set to 1 if scale = FALSE.
5. If the data is not to be centered, the centers are set to 0.
6. The data is transposed and the mean is subtracted then the result is divded by the standard deviation. The data is transposed back.
• The reason this is done is because R operates column-wise. Let p be the number of columns. The column means/sds are of length p. If one simply subtracted the column means, R would try to do this to each individual column. For instance, it would recycle the p numbers to get to length n (number of rows), and do that subtraction, which is not what we want.
7. The attributes are added to the matrix to match scale output.

### colScale timing

Now we can see how fast the colScale command would take:

system.time({
csx = colScale(mat)
})

user  system elapsed
0.231   0.039   0.271


This is a lot faster than the scale function. First and foremost, let us make sure that these give the same results:

all.equal(sx, csx)

[1] TRUE


## Better benchmarking

OK, we found that we can speed up this operation, but maybe this was a one-off event. Let’s use the microbenchmark package to

library(microbenchmark)
mb = microbenchmark(colScale(mat), scale(mat), times = 20, unit = "s")
print(mb)

Unit: seconds
expr       min        lq      mean    median        uq      max
colScale(mat) 0.2738255 0.3426157 0.4682762 0.3770815 0.4872505 1.844507
scale(mat) 1.2945400 1.5511671 1.9378106 1.9226087 2.2731682 2.601223
neval cld
20  a
20   b


We can visualize the results using ggplot2 and some violin plots.

library(ggplot2)
g = ggplot(data = mb, aes(y = time / 1e9, x = expr)) + geom_violin() + theme_grey(base_size = 20) + xlab("Method") + ylab("Time (seconds)")
print(g)


If you note above, we did not standardize the matrix with respect to the rows, but rather the columns. We can perform this simply by transposing the matrix, running scale and then transposing the matrix back:

system.time({
scaled_row = t( scale(t(mat)) )
})

user  system elapsed
2.165   0.624   3.398

all(abs(rowMeans(scaled_row)) < 1e-15)

[1] TRUE


Again, we can do the same thing with colScale:

system.time({
colscaled_row = t( colScale(t(mat)) )
})

user  system elapsed
0.426   0.097   0.542

all(abs(rowMeans(colscaled_row)) < 1e-15)

[1] TRUE

all.equal(colscaled_row, scaled_row)

[1] TRUE


And we see the results are identical

### Creating rowScale

The above results are good for what we would like to do. We may want to define the rowScale function (as below), where we do not have to do the transposing and transposing back, as this takes may take some extra time.

Again, if we’re about improving speed, this may help.

rowScale = function(x,
center = TRUE,
scale = TRUE,
rows = NULL,
cols = NULL) {

if (!is.null(rows) && !is.null(cols)) {
x <- x[rows, cols, drop = FALSE]
} else if (!is.null(rows)) {
x <- x[rows, , drop = FALSE]
} else if (!is.null(cols)) {
x <- x[, cols, drop = FALSE]
}

################
# Get the column means
################
cm = rowMeans(x, na.rm = TRUE)
################
# Get the column sd
################
if (scale) {
csd = rowSds(x, center = cm)
} else {
# just divide by 1 if not
csd = rep(1, length = length(cm))
}
if (!center) {
# just subtract 0
cm = rep(0, length = length(cm))
}
x = (x - cm) / csd
if (center) {
attr(x, "scaled:center") <- cm
}
if (scale) {
attr(x, "scaled:scale") <- csd
}
}
return(x)
}


Now let’s see how we do with rowScale:

system.time({
rowscaled_row = rowScale(mat)
})

user  system elapsed
0.174   0.016   0.206

all(abs(rowMeans(rowscaled_row)) < 1e-15)

[1] TRUE

all.equal(rowscaled_row, scaled_row)

[1] TRUE


Let’s look at the times for this breakdown using microbenchmark:

mb_row = microbenchmark(t( colScale(t(mat)) ),
t( scale(t(mat)) ),
rowScale(mat),
times = 20, unit = "s")
print(mb_row)

Unit: seconds
expr       min        lq      mean    median        uq
t(colScale(t(mat))) 0.4009850 0.4653892 0.6303221 0.6659232 0.7422429
t(scale(t(mat))) 1.7889625 2.0211590 2.4763732 2.1928348 2.6543272
rowScale(mat) 0.1665216 0.1789968 0.2688652 0.2228373 0.3413327
max neval cld
0.9008130    20  a
5.0518348    20   b
0.5138103    20  a


and visualize the results:

g %+% mb_row


## Conclusions

Overall, normalizing a matrix using a z-score transformation can be very fast and efficient. The scale function is well suited for this purpose, but the matrixStats package allows for faster computation done in C. The scale function will have different behavior as the code below from base::scale.default:

f <- function(v) {
v <- v[!is.na(v)]
sqrt(sum(v^2)/max(1, length(v) - 1L))
}
scale <- apply(x, 2L, f)


If the data is not centered and center = FALSE, the data will be divided by the squared sum of each column (divided by n-1). This may be the desired behavior, but the user may want to divide by the standard deviation and not this squared sum and colScale/rowScale can do that if necessary. I will talk to Henrik Bengtsson (matrixStats author/maintainer) about incorporating these into matrixStats for general use. But for now, you can use the above code.