[This article was first published on Rbloggers – A HopStat and Jump Away, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## Problem Setup

In recent question on LinkedIn’s R user group, a user asked “How to normalize by the row sums of the variable?”. Now first, we must define what we mean by “normalize” a matrix/data.frame.

One way to standardize/normalize a row is to subtract by the mean and divide by the max to put the data into the [0, 1] domain. Many times, however, users want to perform z-score normalization by doing:
(x – μ)/σ
where μ/σ are the mean/standard deviation of the column (typically the column).

The answer on that forum eventually came down to using the `scale` command. The `scale` function is great. It is simple and has 2 options other than passing in the matrix:

1. Center – should the columns have their mean subracted off?
2. Scale – should the columns have their standard deviation divided after/not centering?

Overall, the function is fast, but it can always be faster.

The `matrixStats` package has some very quick operations for row/column operations and computing statistics along these margins.

## Creating Some Data

Here we will create a random normal matrix with each column having a mean of 14 and a standard deviation of 5.

```set.seed(1)
mat = matrix(rnorm(1e7, mean = 14, sd = 5),
nrow = 1e5)
dim(mat)
```

### How fast is scale?

Let’s see how fast the `scale` function is:

```system.time({
sx = scale(mat)
})

user  system elapsed
0.971   0.285   1.262
```

That’s pretty fast! Overall, there is not much room for improvement in this case, but it may be relevant if you have a lot of matrices or ones bigger than the one defined here in `mat`.

## Defining a new function

First, we must load in the `matrixStats` package and the only function we really are using is the `colSds`.

```library(matrixStats)

colScale = function(x,
center = TRUE,
scale = TRUE,
rows = NULL,
cols = NULL) {

if (!is.null(rows) && !is.null(cols)) {
x <- x[rows, cols, drop = FALSE]
} else if (!is.null(rows)) {
x <- x[rows, , drop = FALSE]
} else if (!is.null(cols)) {
x <- x[, cols, drop = FALSE]
}

################
# Get the column means
################
cm = colMeans(x, na.rm = TRUE)
################
# Get the column sd
################
if (scale) {
csd = colSds(x, center = cm)
} else {
# just divide by 1 if not
csd = rep(1, length = length(cm))
}
if (!center) {
# just subtract 0
cm = rep(0, length = length(cm))
}
x = t( (t(x) - cm) / csd )
if (center) {
attr(x, "scaled:center") <- cm
}
if (scale) {
attr(x, "scaled:scale") <- csd
}
}
return(x)
}
```

Let’s break down what the function is doing:

1. The function takes in a matrix `x` with options:
1. subsetting `rows` or `cols`
2. center each column (by the mean) or not
3. scale each column (by the standard deviation) or not
4. Add the attributes of center/scale, so they match the `scale` output.
2. The functions subsets the matrix if options are passed.
3. Column means are calculated
4. Column standard deviations are calculated (using the colmeans) if `scale = TRUE` or simply set to 1 if `scale = FALSE`.
5. If the data is not to be centered, the centers are set to 0.
6. The data is transposed and the mean is subtracted then the result is divded by the standard deviation. The data is transposed back.
• The reason this is done is because `R` operates column-wise. Let `p` be the number of columns. The column means/sds are of length `p`. If one simply subtracted the column means, `R` would try to do this to each individual column. For instance, it would recycle the `p` numbers to get to length `n` (number of rows), and do that subtraction, which is not what we want.
7. The attributes are added to the matrix to match `scale` output.

### colScale timing

Now we can see how fast the `colScale` command would take:

```system.time({
csx = colScale(mat)
})

user  system elapsed
0.231   0.039   0.271
```

This is a lot faster than the `scale` function. First and foremost, let us make sure that these give the same results:

```all.equal(sx, csx)

 TRUE
```

## Better benchmarking

OK, we found that we can speed up this operation, but maybe this was a one-off event. Let’s use the `microbenchmark` package to

```library(microbenchmark)
mb = microbenchmark(colScale(mat), scale(mat), times = 20, unit = "s")
print(mb)

Unit: seconds
expr       min        lq      mean    median        uq      max
colScale(mat) 0.2738255 0.3426157 0.4682762 0.3770815 0.4872505 1.844507
scale(mat) 1.2945400 1.5511671 1.9378106 1.9226087 2.2731682 2.601223
neval cld
20  a
20   b
```

We can visualize the results using `ggplot2` and some violin plots.

```library(ggplot2)
g = ggplot(data = mb, aes(y = time / 1e9, x = expr)) + geom_violin() + theme_grey(base_size = 20) + xlab("Method") + ylab("Time (seconds)")
print(g)
``` If you note above, we did not standardize the matrix with respect to the rows, but rather the columns. We can perform this simply by transposing the matrix, running scale and then transposing the matrix back:

```system.time({
scaled_row = t( scale(t(mat)) )
})

user  system elapsed
2.165   0.624   3.398

all(abs(rowMeans(scaled_row)) < 1e-15)

 TRUE
```

Again, we can do the same thing with `colScale`:

```system.time({
colscaled_row = t( colScale(t(mat)) )
})

user  system elapsed
0.426   0.097   0.542

all(abs(rowMeans(colscaled_row)) < 1e-15)

 TRUE

all.equal(colscaled_row, scaled_row)

 TRUE
```

And we see the results are identical

### Creating rowScale

The above results are good for what we would like to do. We may want to define the `rowScale` function (as below), where we do not have to do the transposing and transposing back, as this takes may take some extra time.

Again, if we’re about improving speed, this may help.

```rowScale = function(x,
center = TRUE,
scale = TRUE,
rows = NULL,
cols = NULL) {

if (!is.null(rows) && !is.null(cols)) {
x <- x[rows, cols, drop = FALSE]
} else if (!is.null(rows)) {
x <- x[rows, , drop = FALSE]
} else if (!is.null(cols)) {
x <- x[, cols, drop = FALSE]
}

################
# Get the column means
################
cm = rowMeans(x, na.rm = TRUE)
################
# Get the column sd
################
if (scale) {
csd = rowSds(x, center = cm)
} else {
# just divide by 1 if not
csd = rep(1, length = length(cm))
}
if (!center) {
# just subtract 0
cm = rep(0, length = length(cm))
}
x = (x - cm) / csd
if (center) {
attr(x, "scaled:center") <- cm
}
if (scale) {
attr(x, "scaled:scale") <- csd
}
}
return(x)
}
```

Now let’s see how we do with `rowScale`:

```system.time({
rowscaled_row = rowScale(mat)
})

user  system elapsed
0.174   0.016   0.206

all(abs(rowMeans(rowscaled_row)) < 1e-15)

 TRUE

all.equal(rowscaled_row, scaled_row)

 TRUE
```

Let’s look at the times for this breakdown using `microbenchmark`:

```mb_row = microbenchmark(t( colScale(t(mat)) ),
t( scale(t(mat)) ),
rowScale(mat),
times = 20, unit = "s")
print(mb_row)

Unit: seconds
expr       min        lq      mean    median        uq
t(colScale(t(mat))) 0.4009850 0.4653892 0.6303221 0.6659232 0.7422429
t(scale(t(mat))) 1.7889625 2.0211590 2.4763732 2.1928348 2.6543272
rowScale(mat) 0.1665216 0.1789968 0.2688652 0.2228373 0.3413327
max neval cld
0.9008130    20  a
5.0518348    20   b
0.5138103    20  a
```

and visualize the results:

```g %+% mb_row
``` ## Conclusions

Overall, normalizing a matrix using a z-score transformation can be very fast and efficient. The `scale` function is well suited for this purpose, but the `matrixStats` package allows for faster computation done in C. The `scale` function will have different behavior as the code below from `base::scale.default`:

```f <- function(v) {
v <- v[!is.na(v)]
sqrt(sum(v^2)/max(1, length(v) - 1L))
}
scale <- apply(x, 2L, f)
```

If the data is not centered and `center = FALSE`, the data will be divided by the squared sum of each column (divided by `n-1`). This may be the desired behavior, but the user may want to divide by the standard deviation and not this squared sum and `colScale/rowScale` can do that if necessary. I will talk to Henrik Bengtsson (`matrixStats` author/maintainer) about incorporating these into `matrixStats` for general use. But for now, you can use the above code.  