“Don’t invert that matrix” – why and how

The first time I read John Cook’s advice “Don’t invert that matrix,” I wasn’t sure how to follow it. I was familiar with manipulating matrices analytically (with pencil and paper) for statistical derivations, but not with implementation details in software. For reference, here are some simple examples in MATLAB and R, showing what to avoid and what to do instead.

If possible, John says, you should just ask your scientific computing software to directly solve the linear system Ax = b. This is often faster and more numerically accurate than computing the matrix inverse of A and then computing x = A^{-1}b.
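
In R, for example, that is a one-line change (a generic sketch, assuming A and b are already defined):
x = solve(A) %*% b   # what to avoid: invert A, then multiply
x = solve(A, b)      # what to do instead: solve the system in one step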

We’ll chug through a computation example below, to illustrate the difference between these two methods. But first, let’s start with some context: a common statistical situation where you may think you need matrix inversion, even though you really don’t.

(Be aware that the lowercase x above, from traditional linear-systems notation, and the uppercase X below, from traditional regression notation, play totally different roles! It’s a shame that these notations conflict.)

Statistical context: linear regression

First of all, I’m used to reading and writing mathematical/statistical formulas using inverted matrices. In statistical theory courses, for example, we derive the equations behind linear regression this way all the time. If your regression model is y = X \beta + \epsilon, with independent errors of mean 0 and variance \sigma^2, then textbooks usually write the ordinary least squares (OLS) solution as \hat{\beta} = (X'X)^{-1}X'y.
Computing \hat{\beta} directly like this in R would be
beta = solve(t(X) %*% X) %*% (t(X) %*% y)
while in MATLAB it would be
beta = inv(X' * X) * (X' * y)

This format is handy for deriving properties of OLS analytically. But, as John says, it’s often not the best way to compute \hat{\beta}. Instead, rewrite this equation as (X'X) \beta = X'y and then get your software to solve for \beta.
In R, this would be
beta = solve(t(X) %*% X, t(X) %*% y)
or in MATLAB, it would be
beta = (X' * X) \ (X' * y)

(Occasionally, we do actually care about the values inside an inverted matrix. Analytically we can show that, in OLS, the variance-covariance matrix of the regression coefficients is Var(\hat{\beta}) = \sigma^2 (X'X)^{-1}. If you really need to report these variances and covariances, I suppose you really will have to invert the matrix. But even here, if you only need them temporarily as input to something else, you can probably compute that “something else” directly without matrix inversion.)
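
For example, suppose you want the variance of a predicted mean at a new predictor vector x_0, which is Var(x_0'\hat{\beta}) = \sigma^2 x_0'(X'X)^{-1}x_0. The piece (X'X)^{-1}x_0 is just the solution of the system (X'X)z = x_0, so you never need the full inverse. A minimal sketch in R, assuming X, a vector x0, and an estimate sigma2 are already defined:
var_pred = sigma2 * sum(x0 * solve(t(X) %*% X, x0))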

Numerical example of problems with matrix inversion

The MATLAB documentation for inv has a nice example comparing computation times and accuracies for the two approaches.

Reddit commenter five9a2 gives an even simpler example in Octave (also works in MATLAB).

Here, I’ll demonstrate five9a2’s example in R, keeping the same notation of solving the system Ax=b (rather than the regression notation above). We’ll let A be a 7×7 Hilbert matrix. Hilbert matrices, with elements H(i,j) = 1/(i+j-1), are known to be poorly conditioned [1] and therefore to cause trouble with matrix inversion.
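
(As an aside, you could also build this matrix straight from that formula, without any package:
A = 1 / outer(1:7, 1:7, function(i, j) i + j - 1)
The code below uses the Matrix package’s Hilbert() instead.)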

Here’s the R code and results, with errors and residuals defined as in the MATLAB example:

# Set up a linear system Ax=b,
# and compare 
# inverting A to compute x=inv(A)b
# vs
# solving Ax=b directly
options(digits = 3)

# Generate the 7x7 Hilbert matrix
# (known to be poorly conditioned)
library(Matrix)
n = 7
(A = as.matrix(Hilbert(n)))
##       [,1]  [,2]  [,3]  [,4]   [,5]   [,6]   [,7]
## [1,] 1.000 0.500 0.333 0.250 0.2000 0.1667 0.1429
## [2,] 0.500 0.333 0.250 0.200 0.1667 0.1429 0.1250
## [3,] 0.333 0.250 0.200 0.167 0.1429 0.1250 0.1111
## [4,] 0.250 0.200 0.167 0.143 0.1250 0.1111 0.1000
## [5,] 0.200 0.167 0.143 0.125 0.1111 0.1000 0.0909
## [6,] 0.167 0.143 0.125 0.111 0.1000 0.0909 0.0833
## [7,] 0.143 0.125 0.111 0.100 0.0909 0.0833 0.0769
# Generate a random x vector from N(0,1)
x = rnorm(n)
# Find the corresponding b
b = A %*% x

# Now solve for x, both ways,
# and compare computation times
system.time({xhat_inverting = solve(A) %*% b})
##    user  system elapsed 
##   0.002   0.000   0.003
system.time({xhat_solving = solve(A, b)})
##    user  system elapsed 
##   0.002   0.000   0.001

# Compare errors: norm of the difference (x - xhat)
(err_inverting = norm(x - xhat_inverting))
## [1] 1.64e-07
(err_solving = norm(x - xhat_solving))
## [1] 3.14e-08

# Compare residuals: norm of the residual (b - A %*% xhat)
(res_inverting = norm(b - A %*% xhat_inverting))
## [1] 1.48e-09
(res_solving = norm(b - A %*% xhat_solving))
## [1] 2.78e-16

As you can see, even with a small Hilbert matrix:

  • inverting takes more time than solving;
  • the error in x when solving Ax=b directly is a little smaller than when inverting; and
  • the residuals in the estimate of b when solving directly are many orders of magnitude smaller than when inverting.

Repeated reuse of QR or LU factorization in R

Finally: what if, as John suggests, you have to solve Ax=b for many different b’s? How do you encode this in R without inverting A?

I don’t know the best, canonical way to do this in R. However, here are two approaches worth trying: the QR decomposition and the LU decomposition. Both factor the matrix A into pieces that make the system Ax=b easier to solve repeatedly.

QR decomposition is included in base R. You use the function qr once to create a decomposition, then use qr.coef to solve for x repeatedly using new b’s.

For the LU decomposition, we can use the matrixcalc package. (Thanks to sample code on the Cartesian Faith blog.)
Imagine rewriting our problem as Ax = LUx = b, then defining y=Ux, so that we can solve it in two stages: Ly = b, Ux = y. We can collapse this in R into a single line, in the form
x = solve(U, solve(L, b))
once we have decomposed A into L and U.

# What if we need to solve Ax=b for many different b's?
# Compute the A=QR or A=LU decomposition once,
# then reuse it with different b's.

# Just once, on the same b as above:

# QR:
system.time({
  qr_decomp = qr(A, LAPACK = TRUE)
  xhat_qr = qr.coef(qr_decomp, b)
})
##    user  system elapsed 
##   0.008   0.000   0.014
(err_qr = norm(x - xhat_qr))
## [1] 2.24e-08
(res_qr = norm(b - A %*% xhat_qr))
## [1] 1.33e-15

# LU (needs the matrixcalc package):
library(matrixcalc)
system.time({
  lu_decomp = lu.decomposition(A)
  xhat_lu = solve(lu_decomp$U, solve(lu_decomp$L, b))
})
##    user  system elapsed 
##   0.004   0.000   0.016
(err_lu = norm(x - xhat_lu))
## [1] 1.25e-08
(res_lu = norm(b - A %*% xhat_lu))
## [1] 4.02e-16

Both the QR and LU decompositions give errors and residuals comparable to those from solve() earlier in this post, and much smaller than those from inverting A directly.

However, what about the timing? Let’s create many new b vectors and time each approach: QR, LU, direct solving from scratch (without factorizing A), and inverting A:

# Reusing decompositions with many new b's:
m = 1000
xs = replicate(m, rnorm(n))
bs = apply(xs, 2, function(xi) A %*% xi)

# QR:
system.time({
  qr_decomp = qr(A, tol = 1e-10)
  xhats_qr = apply(bs, 2,
                   function(bi) qr.coef(qr_decomp, bi))
})
##    user  system elapsed 
##   0.166   0.003   0.177

# LU:
system.time({
  lu_decomp = lu.decomposition(A)
  xhats_lu = apply(bs, 2,
                   function(bi) solve(lu_decomp$U,
                                      solve(lu_decomp$L, bi)))
})
##    user  system elapsed 
##   0.162   0.000   0.171

# Compare to repeated use of solve()
system.time({
  xhats_solving = apply(bs, 2,
                        function(bi) solve(A, bi))
})
##    user  system elapsed 
##   0.079   0.000   0.081

# Compare to inverting A directly
system.time({
  Ainv = solve(A)
  xhats_inverting = apply(bs, 2,
                          function(bi) Ainv %*% bi)
})
##    user  system elapsed 
##   0.010   0.001   0.012

It’s not surprising that inverting A is the fastest… but as we saw above, it’s also the least accurate, by far.

However, I’m surprised that QR and LU are both much slower than direct use of solve(A, b). They are supposed to save you work that solve(A, b) has to redo for every new b.

Please comment if you know why this happened! Did I make a mistake? Or do LU and QR give you speedups over solve(A, b) only for much larger A matrices?
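
One possibility I haven’t ruled out (just a sketch, not timed above): at this tiny matrix size, the apply() loop and per-call overhead may swamp the actual linear algebra. Both solve() and qr.coef() accept a whole matrix of right-hand sides, so all 1000 systems can be handled in one call each:

# Sketch: pass all right-hand sides at once instead of looping over columns
xhats_solving_all = solve(A, bs)        # one call for every column of bs
xhats_qr_all = qr.coef(qr_decomp, bs)   # reuses the QR factors from above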

Finally, see this Revolutions post on R and Linear Algebra for more on matrix manipulation in R. They mention dealing with giant and/or sparse matrices, which is also the last situation described in John Cook’s blog post.

[1] (I usually hear of a “poorly conditioned” matrix, meaning one with a high “condition number,” being defined in terms of the ratio of largest to smallest singular values; for a symmetric positive-definite matrix like the Hilbert matrix, these are just its eigenvalues. However, this nice supplement on Condition Numbers from Lay’s Linear Algebra has a more general definition on p.4: if A is invertible, the condition number is the norm of A times the norm of A’s inverse. This matches the singular-value ratio if you’re using the spectral norm… but the general definition is more interpretable for beginners who haven’t studied eigenvalues or singular values yet, since you can use other, simpler matrix norms instead.)
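
If you want to check a matrix’s conditioning numerically, base R has kappa() and rcond(); a quick sketch using the Hilbert matrix A from above:

kappa(A, exact = TRUE)  # 2-norm condition number; very large for Hilbert matrices
rcond(A)                # reciprocal condition number estimate; near 0 signals trouble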
