# (Linear Algebra) Do not scale your matrix

**Florian Privé**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this post, I will show you that you generally don’t need to explicitly scale a matrix. Maybe you wanted to know more about WHY matrices should be scaled when doing linear algebra. I will remind about that in the beginning but the rest will focus on HOW to not explicitly scale matrices. We will apply our findings to the computation of Principal Component Analysis (PCA) and then Pearson correlation at the end.

## WHY scaling matrices?

Generally, if you don’t center columns of a matrix before PCA, you end up with the loadings of PC1 being the column means, which is not of must interest.

n <- 100; m <- 10 a <- matrix(0, n, m); a[] <- rnorm(length(a)) a <- sweep(a, 2, 1:m, '+') colMeans(a)

## [1] 0.9969826 1.9922219 3.0894905 3.9680835 4.9433477 6.0007352 ## [7] 6.9524794 8.0898546 8.9506657 10.0354630

pca <- prcomp(a, center = FALSE) cor(pca$rotation[, 1], colMeans(a))

## [1] 0.9999991

Now, say you have centered column or your matrix, do you also need to scale them? That is to say, do they need to have the same norm or standard deviation?

PCA consists in finding an orthogonal basis that maximizes the variation in the data you analyze. So, if there is a column with much more variation than the others, it will probably end up being PC1, which is not of must interest.

n <- 100; m <- 10 a <- matrix(0, n, m); a[] <- rnorm(length(a)) a[, 1] <- 100 * a[, 1] apply(a, 2, sd)

## [1] 90.8562836 0.8949268 0.9514208 1.0437561 0.9995103 1.0475651 ## [7] 0.9531466 0.9651383 0.9804789 0.9605121

pca <- prcomp(a, center = TRUE) pca$rotation[, 1]

## [1] 9.999939e-01 6.982439e-04 -5.037773e-04 -9.553513e-05 8.937869e-04 ## [6] -1.893732e-03 -3.053232e-04 1.811950e-03 -1.658014e-03 9.937442e-04

Hope I convinced you on WHY it is important to scale matrix columns before doing PCA. I will now show you **HOW not to do it explictly**.

## Reimplementation of PCA

In this part, I will show you the basic code if you want to reimplement PCA yourself. It will serve as a basis to show you how to do linear algebra on a scaled matrix, without explicitly scaling the matrix.

# True one n <- 100; m <- 10 a <- matrix(0, n, m); a[] <- rnorm(length(a)) pca <- prcomp(a, center = TRUE, scale. = TRUE) # DIY a.scaled <- scale(a, center = TRUE, scale = TRUE) K <- crossprod(a.scaled) K.eigs <- eigen(K, symmetric = TRUE) v <- K.eigs$vectors PCs <- a.scaled %*% v # Verif, recall that PCs can be opposites between runs plot(v, pca$rotation)

all.equal(sqrt(K.eigs$values), sqrt(n - 1) * pca$sdev)

## [1] TRUE

plot(PCs, pca$x)

## Linear algebra behind the previous implementation

Suppose \(m < n\) (\(m\) is the number of columns and \(n\) is the number of rows). Let us denote \(\tilde{X}\) the scaled matrix. A partial singular value decomposition of \(\tilde{X}\) is \(\tilde{X} \approx U \Delta V^T\) where \(U\) is an \(n \times K\) matrix such that \(U^T U = I_K\), \(\Delta\) is a \(K \times K\) diagonal matrix and \(V\) is an \(m \times K\) matrix such that \(V^T V = I_K\). Taking \(K = m\), you end up with \(\tilde{X} = U \Delta V^T\).

\(U \Delta\) are the scores (PCs) of the PCA and \(V\) are the loadings (rotation coefficients). \(K = \tilde{X}^T \tilde{X} = (U \Delta V^T)^T \cdot U \Delta V^T = V \Delta U^T U \Delta V^T = V \Delta^2 V^T\). So, when doing the eigen decomposition of K, you get \(V\) and \(\Delta^2\) because \(K V = V \Delta^2\). For getting the scores, you then compute \(\tilde{X} V = U \Delta\).

These are exactly the steps implemented above.

## Implicit scaling of the matrix

Do you know the matrix formulation of column scaling? \(\tilde{X} = C_n X S\) where \(C_n = I_n - \frac{1}{n} 1_n 1_n^T\) is the centering matrix and \(S\) is an \(m \times m\) diagonal matrix with the scaling coefficients (typically, \(S_{j,j} = 1 / \text{sd}_j\)).

# Let's verify sds <- apply(a, 2, sd) a.scaled2 <- (diag(n) - tcrossprod(rep(1, n)) / n) %*% a %*% diag(1 / sds) all.equal(a.scaled2, a.scaled, check.attributes = FALSE)

## [1] TRUE

In our previous implementation, we computed \(\tilde{X}^T \tilde{X}\) and \(\tilde{X} V\). We are going to compute them again, without explicitly scaling the matrix.

### Product

Let us begin by something easy: \(\tilde{X} V = C_n X S V = C_n (X (S V))\). So, you can compute \(\tilde{X} V\) without explicitly scaling \(X\). Let us verify:

SV <- v / sds XSV <- a %*% SV CXSV <- sweep(XSV, 2, colMeans(XSV), '-') all.equal(CXSV, PCs)

## [1] TRUE

### Self cross-product

A little more tricky: \(\tilde{X}^T \tilde{X} = (C_n X S)^T \cdot C_n X S = S^T X^T C_n X S\) (\(C_n^2 = C_n\) is intuitive because centering an already centered matrix doesn’t change it).

\(\tilde{X}^T \tilde{X} = S^T X^T (I_n - \frac{1}{n} 1_n 1_n^T) X S = S^T (X^T X - X^T (\frac{1}{n} 1_n 1_n^T) X) S = S^T (X^T X - \frac{1}{n} s_X * s_X^T) S\) where \(s_X\) is the vector of column sums of X.

Let us verify with a rough implementation:

sx <- colSums(a) K2 <- (crossprod(a) - tcrossprod(sx) / n) / tcrossprod(sds) all.equal(K, K2)

## [1] TRUE

## Conclusion

We have recalled some steps about the computation of Principal Component Analysis on a scaled matrix. We have seen how to compute the different steps of the implementation without having to explicitly scale the matrix. This “implicit” scaling can be quite useful if you manipulate very large matrices because you are not copying the matrix nor making useless computation. In my next post, I will present to you my new package that uses this trick to make a lightning partial SVD on very large matrices.

## Appendix: application to Pearson correlation

Pearson correlation is merely a self cross-product on a centered and normalized (columns with unit norm) matrix. Let us just implement that with our new trick.

#includeusing namespace Rcpp; // [[Rcpp::export]] NumericMatrix& correlize(NumericMatrix& mat, const NumericVector& shift, const NumericVector& scale) { int n = mat.nrow(); int i, j; for (j = 0; j < n; j++) { for (i = 0; i < n; i++) { // corresponds to "- \frac{1}{n} s_X * s_X^T" mat(i, j) -= shift(i) * shift(j); // corresponds to "S^T (...) S" mat(i, j) /= scale(i) * scale(j); } } return mat; } cor3 <- function(mat) { sums <- colSums(mat) / sqrt(nrow(mat)) corr <- crossprod(mat) diags <- sqrt(diag(corr) - sums^2) correlize(corr, shift = sums, scale = diags) } a <- matrix(0, 1000, 1000); a[] <- rnorm(length(a)) all.equal(cor3(a), cor(a))

## [1] TRUE

library(microbenchmark) microbenchmark( cor3(a), cor(a), times = 20 )

## Unit: milliseconds ## expr min lq mean median uq max neval ## cor3(a) 39.38138 39.67276 40.68596 40.09893 40.65623 46.46785 20 ## cor(a) 635.74350 637.33605 639.34810 638.09980 639.36110 651.61876 20

**leave a comment**for the author, please follow the link and comment on their blog:

**Florian Privé**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.