Fixing R’s NAMED problems in pqR

July 2, 2013
By

(This article was first published on Radford Neal's blog » R Programming, and kindly contributed to R-bloggers)

In R, objects of most types are supposed to be treated as “values”, that do not change when other objects change. For instance, after doing the following:

  a <- c(1,2,3)
  b <- a
  a[2] <- 0

b[2] is supposed to have the value 2, not 0. Similarly, a vector passed as an argument to a function is not normally changed by the function. For example, with b as above, calling f(b), will not change b even if the definition of f is f <- function (x) x[2] <- 0.

This semantics would be easy to implement by simply copying an object whenever it is assigned, or evaluated as the argument to a function. Unfortunately, this would be unacceptably slow. Think, for example, of passing a 10000 by 10000 matrix as an argument to a little function that just accesses a few elements of the matrix and returns a value computed from them.  The copying would take far longer than the computation within the function, and the extra 800 Megabytes of memory required might also be a problem.

So R doesn’t copy all the time.  Instead, it maintains a count, called NAMED, of how many “names” refer to an object, and copies only when an object that needs to be modified is also referred to by another name.  Unfortunately, however, this scheme works rather poorly.  Many unnecessary copies are still made, while many bugs have arisen in which copies aren’t made when necessary. I’ll talk about this more below, and discuss how pqR has made a start at solving these problems.

One problem with the NAMED scheme in R (all recent versions up to at least R-3.0.1) is that the count can only be 0, 1, or 2, with 2 really meaning “2 or more”.  Furthermore, the count never decreases.

Extending the example above illustrates how this causes unnecessary copying:

  a <- c(1,2,3)
  b <- a
  a[2] <- 0
  b[2] <- 4

Here, no copy is made when b <- a is executed. Instead, both a and b refer to the same object, which has NAMED set to 2. When executing a[2] <- 0, the R interpreter realizes that a copy needs to be made before changing a[2]. After this, a will refer to this new copy, which has NAMED set to 1, while b refers to the original object. Unfortunately, the original object still has NAMED set to 2, so when b[2] <- 4 is executed, another copy is made, unnecessarily. (This copying is fairly trivial for a vector of length three, but of course the same thing happens for vectors or matrices of any size.)

For similar reasons, simply passing something as an argument to a function will cause a copy to be made if it is later modified. For example, when doing the following:

  f <- function (X) sum(X^2)
  Y <- matrix(1.1,100,100)
  print(f(Y))
  Y[1,1] <- 2

R will make a copy of Y before changing Y[1,1], even though there are no other references to Y at this time.

So the NAMED scheme fails to eliminate many unnecessary copies. Despite this, it has also been a source of many bugs in which copies aren’t made when they should have been — for recent ones, see some of the bugs listed as fixed in the pqR NEWS file.

I think that one underlying cause of these bugs is that how NAMED should be used is not documented. Despite the terminology, reflected in my explanation above, NAMED should count not just the number of “names” referring to an object, but also the number of references to it as a member of a list (and maybe also as an attribute?). Except that that’s not exactly how things are actually done — objects with NAMED set to 0 are tolerated as list elements, on the understanding that this really means that NAMED is 1 if the list itself has a non-zero value for NAMED.  Or something like that.

Addressing these performance and reliability issues is an ongoing task in pqR.  Documenting a clear set of conventions is part of the plan, to be followed by reviewing code to fix places that don’t follow these conventions. The problem of unnecessary copying is addressed in pqR by extending NAMED to be a true count of references to an object, with this count now called NAMEDCNT. (For compatibility, the existing NAMED and SET_NAMED functions still exist, implemented in terms of NAMEDCNT.)

NAMEDCNT has a range of 0 to 7, and is intended to be as close as possible to an actual count of names and other references to an object, which will be decremented when references cease to be.  There will still be some situations where NAMEDCNT isn’t decremented when it should be, so the count will remain an upper bound rather than an exact count. (This will certainly be the case if the count reaches the maximum of 7, and hence can no longer be decremented, since how much bigger than 7 it is will be unknown.) But keeping a more accurate count than previous versions of R is possible in many situations.

The benfits of keeping a better count are illustrated by the example I give above with the statement b[2] <- 4. In pqR, this statement does not cause b to be copied — when the preceding statement a[2] <- 0 causes a newly copied object to be assigned to a, pqR realizes that NAMEDCNT for the value previously bound to a can be decremented. Since this is the same object that is bound to b, its NAMEDCNT becomes 1, allowing b[2] <- 4 to be done without an unnecessary second copying operation.

Unfortunately, in the current (2013-06-28) version of pqR, the function call f(Y) in the latter example above still leads to a copy of Y being made unnecessarily when Y[1,1] is changed. So there is more work to be done, both with regard to performance and with regard to ensuring correctness. However, the current version of pqR does avoid the bug in R-3.0.1 involving NAMED that was reported today. (Though pqR does give the wrong answer for a more elaborate version of this bug.)


To leave a comment for the author, please follow the link and comment on his blog: Radford Neal's blog » R Programming.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.