nrow, references and copies

[This article was first published on R – Statisfaction, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

 

b35a7afcb155bd9139d7ea13e61f97d6--monet-paintings-impressionist-paintings

Claude Monnet’s paintings have nothing to do with the topic of this post.

 

Hi all,

This post deals with a strange phenomenon in R that I have noticed while working on unbiased MCMC. Reducing the problem to a simple form, consider the following code, which iteratively samples a vector ‘x’ and stores it in a row of a large matrix called ‘chain’ (I’ve kept the MCMC terminology).

dimstate = 100
nmcmc = 1e4
chain = matrix(0, nrow = nmcmc, ncol = dimstate)
for (imcmc in 1:nmcmc){
 if (imcmc == nrow(chain)){ #call to nrow
 }
 x = rnorm(dimstate, mean = 0, sd = 1)
 chain[imcmc,] = x #copying of x in chain
}

If you execute this code, you will see that it is surprisingly slow: it takes close to a minute on my computer. Now, consider the next block, which does exactly the same except that the vector ‘x’ is not copied into the matrix ‘chain’.

dimstate = 100
nmcmc = 1e4
chain = matrix(0, nrow = nmcmc, ncol = dimstate)
for (imcmc in 1:nmcmc){
if (imcmc == nrow(chain)){ #call to nrow
}
x = rnorm(dimstate, mean = 0, sd = 1)
# chain[imcmc,] = x #no more copying
}

This code runs nearly instantaneously. Could it be that just copying a vector in a matrix takes a lot of time? Sounds unlikely. Now consider this third block.

dimstate = 100
nmcmc = 1e4
chain = matrix(0, nrow = nmcmc, ncol = dimstate)
for (imcmc in 1:nmcmc){
if (imcmc == nmcmc){ #no call to nrow
}
x = rnorm(dimstate, mean = 0, sd = 1)
chain[imcmc,] = x #copying of x in chain
}

This code runs nearly instantaneously as well; this time ‘x’ is copied into ‘chain’, but the call to the nrow function is removed….?! What is nrow doing? It is meant to simply return dim(chain)[1], the first dimension of chain. So consider this fourth block.

dimstate = 100
nmcmc = 1e4
chain = matrix(0, nrow = nmcmc, ncol = dimstate)
for (imcmc in 1:nmcmc){
 if (imcmc == dim(chain)[1]){ #call to dim instead of nrow
 }
 x = rnorm(dimstate, mean = 0, sd = 1)
 chain[imcmc,] = x #copying of x in chain
}

This one also runs instantaneously! So replacing nrow(chain) by dim(chain)[1] solves the problem. Why?

The answer comes from R guru and terrific statistician Louis Aslett. I directly quote from an exchange of emails, since he brilliantly explains the phenomenon.

You probably know R stores everything by reference, so if I do:

x <- matrix(0, nrow=1e5, ncol=100)
y <- x

I actually only have one copy of the matrix in memory with two references to it. If I then do:

x[1,1] <- 1

R will first make a copy of the whole matrix, update x to point to that and then change the first element to one. This idea is used when you pass a variable to a standard (i.e. non-core, non-primitive) R function, which nrow is: it creates a reference to the variable you pass so that it doesn’t have to copy and the function call is very fast …. as long as you don’t write to it inside the function, no copy need ever happen. But the “bad design” bit is that R makes a decision whether to copy on write based only on a reference count and crucially that reference count stays increased even after a function returns, irrespective of whether or not the function has touched the variable.

So:

x <- matrix(0, nrow=1e5, ncol=100) # matrix has ref count 1
x[1,1] <- 1 # ref count is 1, so write with no copy
nrow(x) # ref count is 2 even though nothing was touched
x[1,1] <- 1 # ref count still 2, so R copies before writing first element. Now the ref count drops to 1 again
x[2,2] <- 1 # this writes without a copy as ref count got reset on last line
nrow(x) # ref count jumps
x[3,3] <- 1 # copy invoked again! Aaaargh!

So by calling nrow in the loop for the first example, the chain matrix is being copied in full on every iteration. In the second example, chain is never written to so there is no negative side effect to the ref count having gone up. In the third example, chain only ever has ref count 1 so there are no copies and each row is written in-place. I did a quick bit of profiling and indeed in the slow example, the R garbage collector allocates and tidies up nearly 9GB of RAM when executing the loop!

The crazy thing is that dim(chain)[1] works full speed even though that is all that nrow is doing under the hood, but the reason is that dim is a so-called “primitive” core R function which is special because it doesn’t affect the reference counter of its arguments. If you want to dig into this yourself, there’s a function refs() in the pryr package which tells you the current reference count to any variable.

Thanks Louis!


To leave a comment for the author, please follow the link and comment on their blog: R – Statisfaction.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)