R annoyances

March 20, 2010

(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)

Readers returning to our blog will know that Win-Vector LLC is fairly “pro-R.” You can take that to mean “in favor of R” or “professionally using R” (both statements are true). Some days we really don’t feel that way.
Consider the following snippet of R code where we create a list with a single element named “x” that refers to a numeric vector. We start with a demonstration of the hard-coded method of pulling the x-value back out using the “$” operator.

> l <- list(x=c(1,2,3))
> l$x
[1] 1 2 3

But suppose we wanted to automate this; that is, to pass in the name of the value we want in a variable. We are, after all, using a computer, so automating a step seems like a reasonable desire. R supplies a notation for this using the “[]” operator. But something slightly different comes out under the “[]” operator than under the “$” operator:

> varName <- 'x'
> l[varName]
$x
[1] 1 2 3

Notice that the printed outputs are slightly different (one echoes "$x" and one does not). Let's use the "class()" function to see what is actually being returned in each case.

> class(l$x)
[1] "numeric"
> class(l['x'])
[1] "list"

The two operators return completely different types: in one case a numeric vector, in the other a general list. These types are not interchangeable.
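
For comparison, R does have a third extraction operator, the double bracket "[[]]", which accepts a name held in a variable and returns the element itself rather than a one-element list. The following is a minimal sketch; the variable names byDollar, bySingle, and byDouble are ours, introduced only for illustration.

```r
# Build the same one-element list as in the transcript above.
l <- list(x = c(1, 2, 3))
varName <- 'x'

byDollar <- l$x           # hard-coded name: the numeric vector
bySingle <- l[varName]    # single bracket: a list of length one
byDouble <- l[[varName]]  # double bracket: the element itself

print(class(bySingle))                # "list"
print(class(byDouble))                # "numeric"
print(identical(byDollar, byDouble))  # TRUE
```

So the automated counterpart of "l$x" is "l[[varName]]", not "l[varName]". That still leaves the user with three extraction operators to keep straight, two of which look nearly alike.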

At this point you may think it is time to turn in our "pro" label and call ourselves "newb" (Internet slang for "newbie" or "idiot"). But let's slow down for a bit. When two views of the same situation disagree (such as the difference in opinion between the authors of R and me over whether the "[]" and "$" operators should return the same type), the most you know is that at least one of those views is wrong. You don't really know whether either view is right or, if one is, which one. I can, however, bring in an additional argument to try to show that the design of R is in fact wrong. That argument is "The Principle of Least Astonishment." This principle roughly says that it is a mistake to introduce unnecessary differences in outcomes (which, to the unprepared user, are unpleasant surprises). There may be some deep (yet obscure) reasons the two operators return different results. But the fact that you would have to find a way to document and explain these differences should make you think that this situation is really a mis-design and that the "explanation" is really an attempt at a workaround. Or to put it more rudely: there may be an explanation, but there is no excuse.

For another example consider creating a 3 by 3 matrix:

> m <- matrix(c(1,2,3,1,1,1,0,0,1),nrow=3,ncol=3)
> m
     [,1] [,2] [,3]
[1,]    1    1    0
[2,]    2    1    0
[3,]    3    1    1

Now select the last two rows of the matrix.

> m[c(FALSE,TRUE,TRUE),]
     [,1] [,2] [,3]
[1,]    2    1    0
[2,]    3    1    1

Now (for the punchline) try to select just the middle row of the matrix.

> m[c(FALSE,TRUE,FALSE),]
[1] 2 1 0

Notice that once again (and without warning) the result is subtly different. I admit that it seems paranoid to worry about such small differences, but when you are debugging a system that should work, these are exactly the killing mistakes you are looking for. In this case the problem is pretty bad. See what happens if you ask for the dimension of each of these differing returns:

> dim(m[c(FALSE,TRUE,TRUE),])
[1] 2 3
> dim(m[c(FALSE,TRUE,FALSE),])
NULL

The first case works fine (reports 2 rows and 3 columns). The second case returns "NULL" (instead of 1 row and 3 columns). In R, NULL is sometimes used as an error value (instead of throwing an exception), and this value will poison any further conditions or calculations it is involved in. The main way to deal with the arbitrary introduction of such NULLs is the incredibly tedious and uncertain defensive coding practice that we argue against in Postel’s Law: Not Sure Who To Be Angry With. Such code weakens both programs and programmers.
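
As a small sketch of how such a NULL spreads, consider code that assumes "dim()" always returns a pair of numbers. The names r, d, n, total, and failed are ours, chosen only for illustration.

```r
m <- matrix(c(1, 2, 3, 1, 1, 1, 0, 0, 1), nrow = 3, ncol = 3)
r <- m[c(FALSE, TRUE, FALSE), ]   # single-row select: a plain vector

d <- dim(r)     # NULL, not c(1, 3)
n <- d[1]       # indexing NULL quietly yields NULL again
total <- n + 1  # arithmetic on NULL yields a zero-length numeric, no error

# The failure only surfaces when the value finally reaches a condition.
failed <- inherits(try(if (n > 0) "has rows", silent = TRUE), "try-error")

print(is.null(n))     # TRUE
print(length(total))  # 0
print(failed)         # TRUE
```

Nothing complains at the point where the NULL is created; the error appears several steps later, in the "if".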

But what is going on in this example? Once again we use the "class()" function to inspect the subtly different results.

> class(m[c(FALSE,TRUE,TRUE),])
[1] "matrix"
> class(m[c(FALSE,TRUE,FALSE),])
[1] "numeric"

The result is disappointing. For a two-row select R returns a matrix (what we would expect). For a single-row select R does us the "favor" of converting the result into a vector. This is a disaster. A single-row matrix is similar to a vector, but even R itself does not support the same set of operations and outcomes on vectors as it does on matrices (for example, "dim()" returns NULL on the vector). It is not safe to calculate further with these results without converting the result by hand back into a single-row matrix (which R can in fact represent). In my case this created crashing bugs deep in a long-running analysis (and was hard to diagnose, as the bug was in an "innocent operation," not in a "risky calculation").
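
One way to guard against this, sketched here on the assumption that the selection is written directly with the "[]" operator, is the operator's "drop" argument: passing "drop=FALSE" asks R to keep the matrix structure even when only one row is selected. The names oneRow, v, and rebuilt below are ours, for illustration only.

```r
m <- matrix(c(1, 2, 3, 1, 1, 1, 0, 0, 1), nrow = 3, ncol = 3)

# drop = FALSE keeps the result a matrix even for a single row.
oneRow <- m[c(FALSE, TRUE, FALSE), , drop = FALSE]
print(dim(oneRow))        # 1 3
print(is.matrix(oneRow))  # TRUE

# Converting an already-dropped vector back by hand also works.
v <- m[c(FALSE, TRUE, FALSE), ]
rebuilt <- matrix(v, nrow = 1)
print(identical(oneRow, rebuilt))  # TRUE
```

The cost is that every such subscript in a program needs the extra argument, which is exactly the kind of defensive coding complained about above.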

All of this has to violate John Chambers' "Prime Directive" for data: "an obligation on all creators of software to program in such a way that the computations can be understood and trusted." Chambers' opinion is relevant because he is the author of the S language (of which R is an open source re-implementation). We continue to recommend R, but we also recommend being exceptionally careful when using it (which unfortunately adds time to projects).
