Fixing R’s design flaws in a new version of pqR
In particular, the extensions fix the problems that 1:n doesn’t work as intended when n is zero, and that M[1:n,] is a matrix rather than a vector when n is one, or when M has only one column. Since changing the “:” operator would cause too many problems with existing programs, pqR introduces a new “..” operator for generating increasing sequences. Unwanted dimension dropping is also addressed in ways that have minimal effects on existing code.
The new release, pqR-2016-06-24, is available at pqR-project.org. The NEWS file for this release also documents some other language extensions, as well as fixes for various bugs (some of which are also in R-3.3.1).
I’ve written about these design flaws in R before, here and here (and for my previous ideas on a solution, now obsolete, see here). These design flaws have been producing unreliable programs for decades, including bugs in code maintained by R Core. It is long past time that they were fixed.
It is crucial that the fixes make the easy way of writing a program also be the correct way. This is not the case with previous “fixes” like the seq_len function, and the drop=FALSE option, both of which are clumsy, as well as being unknown to many R programmers.
Here’s an example of how the new .. operator can be used:
for (i in 2..nrow(M)-1) for (j in 2..ncol(M)-1) M[i,j] <- 0
This code sets all the elements of the matrix M to zeros, except for those on the edges — in the first or last row or column.
If you replace the “..” operators above with “:”, the code will not work, because “:” has higher precedence than “-”. You need to write 2:(nrow(M)-1). This is a common error, which is avoided with the new “..” operator, which has lower precedence than the arithmetic operators. Fortunately the precedence problem with “:” is mostly just an annoyance, since it leads to the program not working at all, which is usually obvious.
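The precedence issue can be checked in standard R, without any pqR extensions. The following is a minimal sketch; the variable names are illustrative only.

```r
# ":" binds more tightly than binary "-", so 2:n-1 means (2:n) - 1.
n <- 5
unparenthesized <- 2:n-1      # parsed as (2:n) - 1
parenthesized   <- 2:(n-1)    # the sequence that was actually intended

print(unparenthesized)        # 1 2 3 4
print(parenthesized)          # 2 3 4
```
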
The more insidious problem with writing the code above using “:” is that, after fixing the precedence problem, the result will work except when the number of rows or the number of columns in M is less than three. When M has two rows, 2:(nrow(M)-1) produces a sequence of length two, consisting of 2 and 1, rather than the sequence of length zero that is needed for this code to work correctly.
This could be fixed by prefixing the code segment with
if (nrow(M)>2 && ncol(M)>2)
But this requires the programmer to realize that there is a problem, and to not be lazy (with the excuse that they don’t intend to ever use the code with small matrices). And of course the problems with “:” cannot in general be solved with a single check like this.
Alternatively, one could write the program as follows:
for (i in 1+seq_len(nrow(M)-2)) for (j in 1+seq_len(ncol(M)-2)) M[i,j] <- 0
I hope readers will agree that this is not an ideal solution.
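To make the difference concrete, here is a small sketch in standard R that wraps the two loops above in functions (the names zero_interior_colon and zero_interior_seq_len are hypothetical, introduced only for this comparison) and runs them on a 2×2 matrix, which has no interior elements at all.

```r
# Version using ":" with the precedence problem already fixed.
zero_interior_colon <- function(M) {
  for (i in 2:(nrow(M)-1)) for (j in 2:(ncol(M)-1)) M[i,j] <- 0
  M
}

# Version using seq_len, which yields an empty sequence when there is no interior.
zero_interior_seq_len <- function(M) {
  for (i in 1+seq_len(nrow(M)-2)) for (j in 1+seq_len(ncol(M)-2)) M[i,j] <- 0
  M
}

small <- matrix(1:4, 2, 2)
print(zero_interior_colon(small))    # 2:1 counts down, so every element is zeroed
print(zero_interior_seq_len(small))  # loops do not run, matrix is unchanged

big <- matrix(1:16, 4, 4)
print(identical(zero_interior_colon(big), zero_interior_seq_len(big)))
```

Note that seq_len itself raises an error for a negative argument, so a matrix with a single row or column would still need separate handling in the second version.
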
Now let’s consider the problems with R dropping dimensions from matrices (and higher-dimensional arrays). Some of these stem from R usually not distinguishing a scalar from a vector of length one. Fortunately, R actually can distinguish these, since a vector can have a dim attribute that explicitly states that it is a one-dimensional array. Such one-dimensional arrays are presently uncommon, but are easily created — if v is any vector, array(v) will be a one-dimensional array with the same contents. (Note that it will print like a plain vector, though dim(array(v)) will show the difference.)
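The distinction between a plain vector and a one-dimensional array can be inspected in standard R. A brief sketch, with illustrative variable names:

```r
v <- c(10, 20, 30)
a <- array(v)          # same contents, but with an explicit dim attribute

print(dim(v))          # NULL: a plain vector has no dim attribute
print(dim(a))          # 3: a one-dimensional array of length three
print(is.array(v))     # FALSE
print(is.array(a))     # TRUE
```
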
So, the first change in pqR to address the dimension dropping problem is to not drop a dimension of size one if its subscript is a one-dimensional array (excluding logical arrays, or when drop=TRUE is stated explicitly). Here’s an example of how this now works in pqR:
> M <- matrix(1:12,3,4)
> M
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> r <- c(1,3)
> c <- c(2,4)
> M[r,c]
     [,1] [,2]
[1,]    4   10
[2,]    6   12
> c <- 3
> M[r,c]
[1] 7 9
> M[array(r),array(c)]
     [,1]
[1,]    7
[2,]    9
The final command above is the one which now acts differently, not dropping the dimensions even though there is only one column, since array(c) is an explicit one-dimensional array. The use of array(r) similarly guards against only one row being selected, though that has no effect above, where r is of length two.
In this situation, the same result could be obtained with similar ease using M[r,c,drop=FALSE]. But drop=FALSE applies to every dimension, which is not always what is needed for higher-dimensional arrays. For example, in pqR, if A is a three-dimensional array, A[array(u),1,array(v)] will now select the slice of A with second subscript 1, and always return a matrix, even if u or v happened to have length one. There is no other convenient way of doing this that I know of.
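For comparison, one possible workaround in standard R is to use drop=FALSE and then remove the middle dimension by hand. This is only a sketch of that idea, not something taken from pqR; the helper name slice_matrix is hypothetical.

```r
# Hypothetical helper: select A[u, 1, v] and always return a matrix.
slice_matrix <- function(A, u, v) {
  s <- A[u, 1, v, drop = FALSE]   # keep all three dimensions for now
  dim(s) <- dim(s)[c(1, 3)]       # discard only the middle dimension (size one)
  s
}

A <- array(1:24, dim = c(2, 3, 4))
s1 <- slice_matrix(A, 1:2, 1:4)   # 2 x 4 matrix
s2 <- slice_matrix(A, 2, 3)       # still a matrix, 1 x 1
print(dim(s1))
print(dim(s2))
```

Needing a helper like this for such a routine operation illustrates why a per-subscript way of suppressing dropping is attractive.
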
The power of this feature becomes much greater when combined with the new “..” operator, which is defined to return a sequence that is a one-dimensional array, rather than a plain vector. Here’s how this works when continuing the example above:
> n <- 2
> m <- 3
> M[1..n,1..m]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
> m <- 1
> M[1..n,1..m]
     [,1]
[1,]    1
[2,]    2
> n <- 0
> M[1..n,1..m]
     [,1]
>
Note how M[1..n,1..m] is guaranteed to return a matrix, even if n or m is one. A matrix with zero rows or columns is also returned when appropriate, due to the “..” operator being able to produce a zero-length vector. To get the same effect without the “..” operator, one would need to write
M [seq_len(n), seq_len(m), drop=FALSE]
It gets worse if you want to extract a subset that doesn’t start with the first row and first column — the simplest equivalent of M[a..b,x..y] seems to be
M [a-1+seq_len(b-a+1), x-1+seq_len(y-x+1), drop=FALSE]
I suspect that not many R programmers have been writing code like this, which means that a lot of R programs don’t quite work correctly. Of course, the solution is not to berate these programmers for being lazy, but instead to make it easy to write correct code.
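In standard R, the seq_len idiom shown above could be packaged into small helpers. The sketch below assumes that an increasing sequence should be empty whenever the upper bound is below the lower bound; the names upto and block are hypothetical and are not part of R or pqR.

```r
# Hypothetical helper: increasing sequence from 'from' to 'to',
# empty when to < from (this emptiness rule is an assumption of the sketch).
upto <- function(from, to) {
  from - 1 + seq_len(max(0, to - from + 1))
}

# Hypothetical helper: rows a..b and columns x..y of M, always as a matrix.
block <- function(M, a, b, x, y) {
  M[upto(a, b), upto(x, y), drop = FALSE]
}

M <- matrix(1:12, 3, 4)
print(dim(block(M, 2, 3, 2, 4)))   # 2 3
print(dim(block(M, 2, 2, 4, 4)))   # 1 1
print(dim(block(M, 2, 1, 1, 4)))   # 0 4
```
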
Dimensions can also get dropped inappropriately when an empty subscript is used to select all the rows or all the columns of a matrix. If this dimension happens to be of size one, R will reduce the result to a plain vector. Of course, this issue can be combined with the issues above — for example, M[1:n,] will fail to do what is likely intended if n is zero, or if n is one, or if M has only one column.
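Each of these failure cases for M[1:n,] can be reproduced in standard R. A short sketch, with illustrative variable names:

```r
M <- matrix(1:6, nrow = 3, ncol = 2)

n <- 2
two_rows <- M[1:n, ]      # a 2 x 2 matrix, as intended
n <- 1
one_row <- M[1:n, ]       # dimension dropped: plain vector of length 2
n <- 0
zero_rows <- M[1:n, ]     # 1:0 is c(1, 0); the 0 is ignored, so row 1 comes back

C <- matrix(1:3, nrow = 3, ncol = 1)
n <- 2
one_col <- C[1:n, ]       # dimension dropped because C has a single column

# The clumsy but safe form keeps both dimensions, here with zero rows selected.
safe_zero <- M[seq_len(0), , drop = FALSE]
print(dim(safe_zero))     # 0 2
```
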
To solve this problem, pqR now allows “missing” arguments to be specified with an underscore, rather than by leaving the argument empty. The subscripting operator will not drop a dimension with an underscore subscript (unless drop=TRUE is specified explicitly). With this extension, along with “..”, one can rewrite M[1:n,] as M[1..n,_], which will always do the right thing.
Note that it is unfortunately probably not feasible to just never drop a dimension with a missing argument, since there is likely too much existing code that relies on the current behaviour (though there is probably even more code where the existing behaviour produces bugs). Hence the creation of a new way to specify a missing argument. A more explicit “missing” indicator may be desirable anyway, as it seems more readable, and less error-prone, than nothing at all.
It may also be infeasible to extend the rule of not dropping dimensions indexed by one-dimensional arrays to logical subscripts — when a and b are one-dimensional arrays, M[a==0,b==0] may be intended to select a single element of M, not to return a 1×1 matrix — though one-dimensional arrays are rare enough at present that maybe one could get away with this.
The new “..” operator does break some existing code. In order that “..” can conveniently be used without always putting spaces around it, pqR now prohibits names from containing consecutive dots, except at the beginning or the end. So i..j is no longer a valid name (unless quoted with backticks), although ..i.. is still valid (but not recommended). With this restriction, most uses of the “..” operator are unambiguous, though there are exceptions, such as i..(x+y), which is a call of the function i.., and i..-j, which computes i.. minus j. There would be no ambiguities at all if consecutive dots were allowed only at the beginning of names, but unfortunately the ggplot2 package uses names like ..count.. in its API (not just internally).
Also, .. is now a reserved word. This is not actually necessary to avoid ambiguity, but not making it reserved seems error-prone, since many typos would be valid syntax, and fetching from .. would not even be a run-time error, since it is defined as a primitive. A number of CRAN packages use .. as a name, but almost all such uses are typos, with ... being what was intended (many such uses are copied from an example with a typo in help(make.rgb)).
To accommodate packages with incompatible uses of “..”, there is an option to disable parsing of “..” as an operator, allowing packages written without using these new extensions to still be installed.
The new pqR also has other new features, including a new version of the “for” statement. Implementation of these new language features is made possible by the new parser that was introduced in pqR-2015-09-14, which has other advantages as well. I plan to write blog posts on these topics soon.