The subsetting section of Advanced R has a very good discussion on the subsetting and selection operators found in R. In particular it raises the important distinction of two simultaneously valuable but incompatible desiderata: simplification of results versus preservation of results.
The issue is: when you pull a single row or column out of R’s most important structure (the data frame) do you get a data frame, a list, or a vector? Not all code that works on one of these types works equivalently across all of these types, so this can be a serious issue. We have written about this before (see selection in R). But it wasn’t until we got more into teaching (and co-authored the book Practical Data Science with R) that we really appreciated how confusing this can be for the beginner.
Let’s start with an example.
> d <- data.frame(x=c(1,2),y=c(3,4))
> print(d)
  x y
1 1 3
2 2 4
> print(d[1,])
  x y
1 1 3
> print(d[,1])
[1] 1 2
What we see when using the two-argument [,] extract operator on a simple data frame:
- Extracting a single row returns a data frame (confirm with class()).
- Extracting a single column returns a vector (instead of a data frame).
And this is pretty much what a user sitting in front of an interactive system would want: simplification on columns and preservation on rows. And this is compatible with R’s history as an interactive analysis system (versus as a batch programming language, as outlined here).
Where we run into trouble is when we are writing code that we expect to run correctly in all situations (even when we are not watching). Consider the following example.
> selector1 <- c(TRUE,FALSE)
> selector2 <- c(TRUE,TRUE)
> print(d[,selector1])
[1] 1 2
> print(d[,selector2])
  x y
1 1 3
2 2 4
In the first case our boolean selection vector returned a vector, and in the second case it returned a data frame. Believe it or not, this is a problem. If we were reading this code and the values of selector1 and selector2 were set somewhere else (say as the result of a complicated calculation) we would have no way of knowing what type would be returned by d[,selector1]. This holds even if we were lucky enough to have documentation asserting that selector1 and selector2 are logical vectors of the correct length.
At runtime we can see how many positions of selector1 are set to TRUE. But we can’t reliably infer this count from looking at just an isolated code snippet. So we would not know at coding time what code would be safe to apply to the result d[,selector1]. Changing the return type based on mere variation of argument value (not argument type) is a very bad thing in terms of readability. A code reader can’t set simple (non data-dependent) expectations on the code, and can’t use assumed pre-conditions known about the inputs (such as documented type) to establish useful post-conditions (guaranteed behavior of the code).
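The value dependence is easy to check directly. The following is a minimal sketch, assuming a small two-column data frame (the column names x and y and the helper name result_class are illustrative assumptions, not part of any library). It records the class of d[,sel] for two logical selectors of the same type and length.

```r
# Illustrative data frame; the column names are assumptions for this sketch.
d <- data.frame(x = c(1, 2), y = c(3, 4))

# Hypothetical helper: report the class of a column selection.
result_class <- function(df, sel) class(df[, sel])

selector1 <- c(TRUE, FALSE)  # keeps one column
selector2 <- c(TRUE, TRUE)   # keeps both columns

class1 <- result_class(d, selector1)  # simplified to a plain vector
class2 <- result_class(d, selector2)  # stays a data frame

# Passing drop = FALSE removes the dependence on the selector's values.
class1_safe <- class(d[, selector1, drop = FALSE])

print(c(class1, class2, class1_safe))
```

Both selectors have the same documented type, yet only the second call preserves the data frame unless drop is given explicitly.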
Why should we care about prior expectations? Can’t we just consider those uninformed presumptions and teach past them? To my mind this violates some concepts of efficient learning and teaching. In my opinion there is no such thing as passive learning (or completely pure teaching). Students learn by thinking, and they form their expectations for new material by generalizing and regularizing lessons from older material. The more effective students can be at this, the faster they learn.
Also, pity the student who makes a mistake while trying to learn about the square-bracket extraction operator through the R help system. If they accidentally type help('[') instead of help('[.data.frame'), then they see the following confusing help.
Instead of seeing the relevant definition, which is as follows.
Notice the first help implies there is an argument called drop that defaults to TRUE. This is true for matrices (what the help is talking about), but false for data frames (the central class of R; nobody should choose R for the matrix operations). You could (informally) think of [.data.frame as being a specialization of the base [ in the sense of object-oriented inheritance. Except, it is considered very bad form to change the semantics or rules when extending types and operators. The expectations set in the base class (and especially those set in the base-class documentation) should hold in derived classes and methods.
We can confirm [.data.frame,] does not act like either of [.data.frame,,drop=TRUE] or [.data.frame,,drop=FALSE]. It picks its own behavior depending on whether you end up with a single column or not (note: I didn’t say “if you picked a single column or not”). The code below shows some of the variations in behavior.
1 1 3
 1 2
 1 2
1 1 3
Notice how none of the complete results of these three experiments (running without the drop argument, running with it set to TRUE, and running with it set to FALSE) entirely match any of the others.
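One compact way to see the three behaviors side by side is to tabulate the class of each result. This is a sketch under the assumption of a small two-column numeric data frame (the column names and the drop_summary table are illustrative assumptions).

```r
# Illustrative data frame; the column names are assumptions for this sketch.
d <- data.frame(x = c(1, 2), y = c(3, 4))

# Class of a single-row and a single-column extraction under each drop setting.
drop_summary <- data.frame(
  setting = c("default", "drop=TRUE", "drop=FALSE"),
  one_row = c(class(d[1, ]),
              class(d[1, , drop = TRUE]),
              class(d[1, , drop = FALSE])),
  one_col = c(class(d[, 1]),
              class(d[, 1, drop = TRUE]),
              class(d[, 1, drop = FALSE])),
  stringsAsFactors = FALSE
)

print(drop_summary)
```

Each row of the summary differs from the other two in at least one column, which is the sense in which the default matches neither explicit setting.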
Also you can trigger the “only one column causes type conversion” issue even when you are not selecting on columns (in fact even when selecting the entire data frame!):
> d1 <- data.frame(x=c(1,2))
> print(d1)
  x
1 1
2 2
> print(d1[c(TRUE,TRUE),])
[1] 1 2
This is a good point to return to the article about the historic context and influences of R, which gives us the following quote:
Pat begins with how R began as an experimental offshoot from S (there’s an adorable 1990’s-era photo of R’s creators Ross Ihaka and Robert Gentleman in Auckland on page 23, reproduced below), and then evolved into a language used first interactively, and then for programming. The tensions between the two modes of use led to some of the quirkier aspects of R. (Pat’s moral: “if you want to create a beautiful language, for god’s sake don’t make it useful”.)
How would I like R to behave if it evolved anew and didn’t have to support older code? I’d like (but know I can’t have) the following:
- [,] is reserved to select sets of rows and columns and by default guarantees “preserving” behavior in all cases (i.e. all variations of [,] on a data frame return a data frame).
- [[]] is reserved for extracting a single item and is “simplifying”.
- To extract a single column as a vector from a data frame you must use the single argument list operator [[]].
- In all cases [[]] signals an error if you do not select exactly one element.
When I say I want these things, understand this means both that I already know this is not the way they are and that I know (for practical reasons) they cannot be changed to be so. The fact that none of the above statements are currently true will come as a surprise to many R users. For example, it is widely thought that [[]] behaves everywhere as it behaves on lists: properly signaling errors if you try to select more than one element. Notice this does not turn out to be the case. For vectors and lists we have good error-indicating behaviors:
> c(1,2,3)[[c(1,2)]]
Error in c(1, 2, 3)[[c(1, 2)]] : attempt to select more than one element
> list(1,2,3)[[c(1,2)]]
Error in list(1, 2, 3)[[c(1, 2)]] : subscript out of bounds
For data frames we have a less desirable “anything goes” situation:
Remember: a situation that should have signaled an error and did not is worse than a situation with a signaling error. (Note: subset(d1,x==1,select=c('x')) seems to reliably avoid unwanted simplification.)
Data frames are guaranteed to be lists of columns (a publicly exposed implementation detail, a bit obscured by the fact that the derived two-argument operator [,] superficially appears to be row-oriented). So we would expect d[[c(1,2)]] to properly error-out as it does for lists. However, it appears to be behaving more like a two-dimensional index operator. Probably some code is using this, but it is a pretty clear violation of expectations (especially for a new student). Repeating: data frames are lists of columns (you can check this with unclass(d)) and this is not a hidden implementation detail (it is commonly discussed and expected). However, the [[.data.frame operator has extended or overridden behavior that is different from any notional base [[ operator.
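To make the “two-dimensional” reading concrete, here is a minimal sketch of what a length-2 index does on a data frame, assuming the same small illustrative data frame (the column names are assumptions). In the R versions I am aware of, the index vector is applied one level at a time: the first entry picks the column and the second picks the element within it.

```r
# Illustrative data frame; the column names are assumptions for this sketch.
d <- data.frame(x = c(1, 2), y = c(3, 4))

# No error is signaled: the index vector is applied level by level.
v12 <- d[[c(1, 2)]]  # column 1, then element 2 of that column
v21 <- d[[c(2, 1)]]  # column 2, then element 1 of that column

# Compare against the explicitly nested form.
same_as_nested <- identical(v12, d[[1]][[2]]) &&
  identical(v21, d[[2]][[1]])

print(c(v12, v21))
print(same_as_nested)
```

So the call quietly returns a single cell, addressed as (column, row), rather than rejecting the two-element index.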
One of the reasons we need two extraction operators ([] and [[]]) is: R does not expose true scalar types (even the number 3 is a length-1 vector), so we have no convenient way to signal (even using runtime types) whether we thought we were coding a set-based extraction (through a set/vector of indices or a vector of booleans) or a scalar-based extraction (through a single index, the case where simplification is most likely to be desirable). It is likely that the designers understood that return types changing on a mere change in the values of arguments (and not on more fundamental changes in the types of arguments) is confusing and undesirable (as it eliminates any chance at pure type-to-type reasoning), and that this understanding led to S/R having so many extraction/selection operators. They saw the need to isolate and document different behaviors. However, these abstractions turn out to be a bit leaky.
For my part I teach designing your code assuming you had simple regular versions of the above operators, and then implementing defensively (specifying drop, and preferring [[]] for single-item extraction) to ensure you get good regular behavior.
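As a rough illustration of that defensive style, here is a sketch that assumes two small data frames and a hypothetical wrapper named extract_one (my own illustrative helper, not a base R function). Set-style selections always pass drop=FALSE, and single-item extraction goes through [[]] with an explicit length check.

```r
# Illustrative data frames; the column names are assumptions for this sketch.
d  <- data.frame(x = c(1, 2), y = c(3, 4))
d1 <- data.frame(x = c(1, 2))

# Set-style selection: always ask for a data frame back.
rows_kept <- d1[c(TRUE, TRUE), , drop = FALSE]
cols_kept <- d[, c(TRUE, FALSE), drop = FALSE]

# Scalar-style extraction: [[ ]] with a single name returns the column vector.
x_values <- d[["x"]]

# Hypothetical strict wrapper: refuse anything but exactly one index.
extract_one <- function(df, index) {
  stopifnot(length(index) == 1)
  df[[index]]
}

y_values <- extract_one(d, "y")

# A two-element index is now rejected instead of silently returning a cell.
strict_error <- tryCatch(extract_one(d, c(1, 2)),
                         error = function(e) "error")

print(class(rows_kept))
print(class(cols_kept))
print(strict_error)
```

The wrapper is only one possible design; the point is that the caller's intent (a set of columns versus exactly one item) is stated in the code rather than inferred from the data.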