Subset replacement in pqR: Now faster and better

October 13, 2014
By

(This article was first published on Radford Neal's blog » R Programming, and kindly contributed to R-bloggers)

The latest version of pqR that I just released uses a new way of implementing subset replacement operations — such as a[i]<-1 or L$M[1:100,i]<-v. The new approach is much faster, and eliminates some strange behaviour of the previous approach.

This change affects only interpreted code. The bytecode compiler (available since R-2.13.0) introduced a different mechanism, which is also faster than the previous approach used by the interpreter (though it still has some of the strange behaviour). This faster mechanism was one of the main reasons for byte-compiled code to be faster than interpreted code (although it would have been possible to use the new mechanism in the interpreter as well). With pqR’s new implementation of subset replacement, this advantage of byte-compiled over interpreted code is much reduced.

In addition to being faster, pqR’s new approach is also more coherent than the previous approach (still current in the interpreter for R Core releases to at least R-3.1.1), which despite its gross inefficiency and confused semantics has remained essentially unchanged for 18 years. Unfortunately, the new approach in pqR is not as coherent as it might be, because past confusion has resulted in some packages doing “wrong” things, which have to be accommodated, as least in the short term.

Replacement functions. To understand pqR’s new approach, and the problems with the old approach (some not currently fixable), you first need to know how R’s subset replacement operations are defined. The central concept is that every function for extracting part of an object is accompanied by a corresponding function for replacing that part, whose name has “<-” appended. So, for example, the “dimnames” function is accompanied by “dimnames<-“, the “$” operator is accompanied by “$<-“, and “[” is accompanied by “[<-“.

Those three pairs of functions are primitive, but users can define their own pairs of subset and replacement functions. For example, the pair of functions below access or replace those elements of a vector that have odd indexes:

odd_elements <- function (x)
{  x[seq(1,by=2,length=(length(x)+1)%/%2)]  }
`odd_elements<-` <- function (x,value)
{  x[seq(1,by=2,length=(length(x)+1)%/%2)] <- value;  x  }

In general, such functions may take additional arguments that specify which part of the variable should be accessed or modified.

Simple replacements. To see how these replacement functions are used, let’s start with a simple replacement of part of a variable:

x[3:5] <- 13:15

According to the current R Language Definition at r-project.org, the effect of this statement is the same as that of

`*tmp*` <- x
x <- `[<-`(`*tmp*`, 3:5, value=13:15)
rm(`*tmp*`)

This specification is actually incomplete, since it fails to specify the value of the expression x[3:5] <- 13:15 (which might, uncommonly, be used someplace such as the argument of a function call), but it is close to a literal description of what the interpreter in R Core implementations does — this simple assignment to part of a vector really does cause a variable called “*tmp*” to be created in the current environment, to then be modified, and to finally be removed, with all the overhead this implies. You can confirm that this is what’s happening (for example, in R-3.1.1) by typing the following:

`*tmp*`<-9; a<-c(1,2); a[1]<-3; print(`*tmp*`)

You’ll get an error from print, since a[1]<-3 will have removed *tmp*.

In pqR, x[3:5] <- 13:15 is now instead implemented as something close to the following:

x <- `[<-`(x, 3:5, value=13:15)

This has the same effect as the code in the language definition, except that it has much less overhead, and lacks the undesired side effect of deleting any previously existing *tmp* variable. Subset replacement with a user-defined function is done the same way — for example, odd_elements(x)<-0 is translated to

x <- `odd_elements<-`(x, value=0)

Note that although their use in implementing assignments to subsets is the principal purpose of replacement functions, nothing stops them from being called directly. And it can occasionally be useful to write things like the following:

z <- W %*% `odd_elements<-`(x+y, value=1)

Avoiding duplication. If the `[<-` primitive were implemented in the most obvious way, the call `[<-`(x,i,v) would start by making a duplicate copy of x, then replace the elements of this copy that i indexes by v, and finally return this modified copy as its value. But this would be intolerably inefficient when x is a vector of 1000000 elements, that isn’t shared with any other variable, and i indexes just one of these elements.

The right way to solve this is to not duplicate the first argument of [<- if either it is a value that is not stored anywhere (eg, the result of some arithmetic operation), or it is the value of a variable that is not also stored elsewhere and the call of [<- is part of an assignment operation. This would not be hard to do in pqR, using its “variant result” mechanism (see here) to pass to the replacement operator the information on whether it has been called from an assignment operator.

That’s not what is currently done, however. Instead, the primitive replacement operators such as “[<-” duplicate their first argument only if it is stored in two or more variables (or list elements), regardless of the context in which it is called. This violates the usual pass-by-value semantics of R function calls. For example, the call

y <- `[<-` (x, 1, 0)

ought to set y to the value stored in x with the first element changed to zero, while leaving x unchanged. But it (sometimes) does change x, as you can confirm with the following test:

x <- c(10,20,30); y <- `[<-`(x, 1, 0); print(x)

Unfortunately, some code now relies on this behaviour, although this is a very bad idea, both for general reasons, and also because in the following slightly different code, “[<-doesn’t change x:

w <- x <- c(10,20,30); y <- `[<-`(x, 1, 0); print(x)

Worse, the “@<-” and “slot<-” operators for changing the value of a slot in an S4 object have been written to never duplicate their first argument, even if it is shared amongst many variables. To keep this from causing total chaos, the general code for assignment to subsets has to duplicate the value stored in the target variable if it is shared with another variable (even though this is necessary only for “@<-" and “slot<-“), which sometimes results in an extra duplication being done. Unfortunately, this behaviour of “@<-” and “slot<-” is also relied on by some code.

For the moment, pqR accommodates all this bad behaviour, though it would be nice to move to a coherent semantics sometime.

Complex replacements. Assignment operations with more complex replacements are trickier. The R Language Definition defines an assignment such as

L[[2]][3] <- 1

as being equivalent to

`*tmp*` <- L
L<-`[[<-`(`*tmp*`,2,value=`[<-`(`*tmp*`[[2]],3,value=1))
rm(`*tmp*`)

That is, the [[ operator is used to extract the second element (a vector) of L (which has been put in *tmp*), then [<- is used to create a new version of this vector with its second element changed to 1, and finally the [[<- operator is used to put this modified vector back as the second element of L.

The interpreter in R Core implementations (and pqR before the latest release) implement this definition quite literally, actually creating a *tmp* variable, and evaluating index expressions as implied above. This results in strange behaviour. The following code produces the error “cannot change value of locked binding for `*tmp*`”, though it should surely be legal:

L <- list(c(4,7),"x"); b <- c(2,3); L[[ b[1]<-1 ]] [1] <- 9

The following code calls the function f twice, though a programmer writing it would surely expect it to be called only once:

f <- function () { cat("Hi!n"); 1 }
L <- list(c(4,7),"x"); L[[ f() ]] [1] <- 9

This prints “Hi!” twice, in R-3.1.1 and earlier R Core releases (for both interpreted and byte-compiled code).

How pqR implements complex replacements. These strange behaviours are eliminated in the new pqR implementation, which is also much faster.

In pqR, an assignment that does a complex replacement starts by evaluating the expression on the right side, and then calls in succession all the subset extraction functions that appear on the left side, except for the outermost one. For example, names(L[[f()]])[i]<-g() will first evaluate g(), and then evaluate the extraction functions from the inside out, effectively doing something like

tmp1 <- L[[f()]]
tmp2 < names(tmp1)

However, `tmp1` and `tmp2` are not actual R variables — the interpreter just stores the values extracted internally.

So far, this is similar to what the R Core interpreter does, but there are two crucial differences.

First, when evaluating an extraction function, pqR uses its “variant result” mechanism to ask the extraction function whether the value it returns is an unshared subset of the variable it was extracted from, which can safely be modified, and for which modifications will automatically be reflected in changes to that part of the larger variable.

For example, after L <- list("x", c(1,2)), the expression L[[2]] returns an unshared subset of L. However if either M <- L or M < L[[2]]; were then executed, L[[2]] would no longer be an unshared subset, since it would be shared with the value of M. And after v <- 1:100, the expression v[20:30] does not return an unshared subset, because it will return a copy of part of v, not that part of v itself (unlike list elements, parts of numeric vectors are not objects in themselves).

Knowing when the result of an extraction is an unshared subset is crucial to efficiently updating it. When the result of an extraction is not an unshared subset, and it is referenced elsewhere, pqR duplicates it (at the top level) before doing further extractions and replcements.

The second difference from R Core implementations concerns the index arguments of the extraction functions, which are later also arguments of the corresponding replacement functions. When pqR evaluates a call of an extraction function, such as L[[f()]], it creates what (in the terminology of R internals) are called “promises” for index arguments, such as f() in this example. These promises contain the expression to be evaluated, plus an initially empty field for the value of the expression. When (if ever) the extraction function actually references the index value, the expression is evaluated, and this field is filled in. Later references to the index value do not evaluate the expression again, but just use the value stored in this field of the promise. Crucially, in pqR, these promises are kept for later use when the corresponding replacement function is called, usually with their value fields already filled in.

Avoiding re-evaluation of index arguments saves time, and also eliminates the double occurrence of side effects of evaluation, such as “Hi!” being printed twice in the example above when f() is evaluated twice (once for extracting L[[f()]] and once when replacing that element of L by a call of `[[<-` with f() as the index argument).

Once all the extraction functions have been called, the outermost replacement function is called to store the right hand side of the assignment statement into the result from the last extraction function. The next replacement function is then called to store this modified value into the result of the previous extraction function, and so forth, until the last replacement function call produces the new value for the variable being assigned into.

This is again generally similar to R Core implementations. However, pqR is able to skip some of these replacement calls, when it knows that the result of an extraction function is part of the larger variable. In that case, when that part is modified, nothing has to be done to propagate the modification to the larger variable. For example, to perform the replacement operation below:

L <- list(a=c(1,2),b="x"); L$a[1] <- 9

pqR will first extract L$a, and find that this vector is an unshared subset of L. It will then call `[<-` to replace the first element of this vector by 9, at which point it is done — pqR realizes that there is no need to call the `$<-` replacement function, and also that there is no need to store the final result in L, since it is the same as the object already in L. However, if the assignment L$a[1] <- 1+2i is now done, the replacement of the first element of L$a by the complex number 1+2i will produce a new vector of complex type, and pqR will realize that `$<-` needs to be called to store this new vector in L$a.

R Core implementations try to infer whether an extracted value is an unshared subset from how many references there are to it (see the discussion here), which sort of works, but fails when extraction is done with a vector index, as below:

L <- list(1,list(2,list(3,c(4,5,6))))
K <- L
L[[c(2,2,2,3)]] <- 9

The vector index c(2,2,2,3) refers to the 3rd element of the 2nd element of the 2nd element of the 2nd element of L, which is the number 6. When replacing this by 9, the vector c(4,5,6) needs to be duplicated, because the entire object is shared by K and L. However, the reference count for c(4,5,6) should be only one, since it is referenced from only a single list (albeit one that ultimately is itself shared). To get around this problem, recent R Core releases increment reference counts as a result of simply extracting a value with a vector index, which will leave reference counts that are bigger than they should be, and may therefore cause unnecessary copying to be done later. (Earlier R Core releases just give the wrong answer.) In the new pqR implementation, extraction leaves the reference counts unchanged, but if asked, `[[` will say that the result returned in not an unshared subset, which will lead to the appropriate duplications before replacement functions are called.

User-defined replacement functions. As illustrated by the odd_elements and odd_elements<- functions above, users can write their own replacement functions, which can be used just like the built-in ones. Unfortunately, in both R Core and pqR implementations, there is presently no way to avoid duplication of the value in a variable that is updated with a user-defined function, even when the value is not shared with other variables. For example, odd_elements(a) <- 0 will always duplicate the value in a before setting its odd elements to 0. Furthermore, the modified value stored in a after the replacement will always be marked as shared with the variable x within odd_elements<- (even though x is inaccessible), forcing a copy when it is next updated.

The new version of pqR does avoid some unnecessary duplications that are done in R Core implementations, but the basic problems remain. One fundamental question is what should happen if a user-defined replacement function generates an error after partially changing the value being updated. At present, the variable being updated will be left unchanged after the error exit. But any scheme that doesn’t duplicate the variable being updated will have the possibility of leaving a partial update that was cut short by an error. Successfully resolving such issues would allow for much more efficient use of user-defined replacement functions.

To leave a comment for the author, please follow the link and comment on their blog: Radford Neal's blog » R Programming.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)