R Design Flaws #1 and #2: A Solution to Both?

August 25, 2008
By

(This article was first published on Radford Neal's blog » R Programming, and kindly contributed to R-bloggers)

I’ve previously posted about two design flaws in R. The first post was about how R produces reversed sequences from a:b when a>b, with bad consequences in “for” statements (and elsewhere). The second post was about how R by default drops dimensions in expressions like M[i:j,] when i:j is a sequence only one long (ie, when i equals j).

In both posts, I suggested ways of extending R to try to solve these problems. I now think there is a better way, however, which solves both problems with one simple extension to R. This extension would also make R programs run faster and use less memory.

Recall the problems, and the solutions I proposed previously… To stop sequences from reversing, we need a new operator to use rather than 1:n, which suffers from the reversal problem when n is zero (which can’t be changed for compatibility reasons). I suggested 1:>:n. To stop dimenions from being dropped, I suggested using semicolons rather than commas to separate subscripts. My new suggestion is that in the most common case where the indexing vector is a sequence, we could use the same new operator introduced to solve the reversing problem — ie, we write M[1:>:n,] to get the first n rows, and define it so that the result isn’t converted to a vector when n is one.

Now, this may seem like it won’t work. If 1:>:n returns a vector, then when n is one, it’s a vector of length one, which has to (by default) lead to the dimension being dropped if ordinary subscripting by scalars is to work (since scalars in R are really vectors of length one).

The solution is for 1:>:n to not return a vector, but instead return a new data type that just records its two operands. This new data type — perhaps it could be called an indexing pair — would have to be recognized by “for” statements and by the subscripting operator. A dimension indexed by such an indexing pair would never be dropped. One could add an operator :<: as well, for iterating or subscripting in descending sequence, though this is much less common. If this descending operator is omitted, I think .. (two periods) would be a better name than :>: for the ascending indexing pair operator (but it unfortunately has no obviously mnemonic descending counterpart).

One disadvantage of this solution is that it doesn’t address the dropped dimension problem in the general case where the vector index may not be a sequence. But indexing by an ascending sequence is by far the most common case, so having to write drop=FALSE for the others may be OK. (However, perhaps knowledge of the drop=FALSE option would be less common if it is needed less often.) Similary, this solution doesn’t address the problem of reversing sequences outside the context of for loops and subscripting, but those are the most common uses.

One extra advantage of introducing indexing pairs is that they will take up a trivial amount of storage. In contrast, if you use 1:1000000 to iterate a million times in a for loop, R will allocate 4 Megabyes of storage to hold this sequence. Producing this sequence also takes time, of course. It would be possible for R to avoid this cost when using 1:1000000 in a for loop, by treating this combination specially, but it doesn’t (in version 2.4.1 at least):

   > gc()
            used (Mb) gc trigger (Mb) max used (Mb)
   Ncells 234834  6.3     467875 12.5   407500 10.9
   Vcells 104319  0.8     786432  6.0   690698  5.3
   > for (i in 1:1000000) if (i==1000000) print(gc())
            used (Mb) gc trigger (Mb) max used (Mb)
   Ncells 234855  6.3     467875 12.5   467875 12.5
   Vcells 604321  4.7     905753  7.0   720903  5.5
   > gc()
            used (Mb) gc trigger (Mb) max used (Mb)
   Ncells 234904  6.3     467875 12.5   467875 12.5
   Vcells 104328  0.8     786432  6.0   720903  5.5

Notice how the memory usage goes up by 3.9 Megabytes during the for loop.

Another extra advantage of introducing :>: (or ..) is that the precedence of this operator could be made lower than that of any other operator (with no loss, since an indexing pair wouldn’t be a valid operand for any other operator). Expressions like 1:>:n-1 would then do what the programmer meant (unlike 1:n-1).

So, does anyone see any flaws in this solution?


To leave a comment for the author, please follow the link and comment on his blog: Radford Neal's blog » R Programming.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.