The comments on my post outlining recommended R usage for professional developers were universally scornful, with my proposal recommending
subset receiving the greatest wrath. The main argument against using
subset appeared to be that it goes against existing practice; one comment linked to Hadley Wickham suggesting it is useful in an interactive session (and, by implication, not useful elsewhere).
The commenters appeared to be knowledgeable R users, and I suspect they might have fallen into the trap of thinking that, having invested time in acquiring expertise in a language's intricacies, they ought to use those intricacies. Big mistake: the best way to make use of language expertise is to use it to avoid the intricacies, aiming to write simple, easy-to-understand code.
Some data to work with, which would normally be read from a file:
sample_df = data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))
The following are two of the ways of extracting all rows for which
a >= 4:
subset(sample_df, a >= 4)        # has the same external effect as:
sample_df[sample_df$a >= 4, ]
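With no missing values in the data, the two forms return exactly the same result (one difference worth knowing: subset silently drops rows where the condition evaluates to NA, while the indexing form keeps them as rows of NAs). A quick check:

```r
sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))

# With no NAs present, both forms select rows 4 and 5 and agree exactly
identical(subset(sample_df, a >= 4), sample_df[sample_df$a >= 4, ])
# -> TRUE
```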
The subset approach has the following advantages (each is a pitfall of the indexing form that it avoids):
- The array name, sample_df, only appears once. If this code is cut-and-pasted, or the array name changes, the person editing the code may fail to change the second occurrence.
- Omitting the comma in the array access is an easy mistake to make (and it won't get flagged).
- The person writing the code has to remember that R indexes data in row-column order (it is column-row order in many languages in common use). This might not be a problem for developers who only code in R, but my target audience is likely to be casual R users.
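The first pitfall is easy to demonstrate. A sketch of the cut-and-paste hazard, using a hypothetical second data frame other_df: only one of the two occurrences of the name gets updated, and R silently filters using the wrong data.

```r
sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))
other_df  <- data.frame(a = 6:10, b = 10:6, c = c(2, 4, 2, 5, 3))  # hypothetical

# Intended: other_df[other_df$a >= 4, ], i.e., all five rows (a is 6:10).
# After an incomplete edit the condition still tests sample_df's column,
# so only rows 4 and 5 survive -- no error, no warning:
other_df[sample_df$a >= 4, ]

# The subset form has only one occurrence of the name to get right:
subset(other_df, a >= 4)
```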
The case for
subset is not all positive; there is a use case where it will produce the wrong answer. Let's say I want all the rows where
b has some computed value, and I have chosen to store this computed value in a variable called c:
c = 3
subset(sample_df, b == c)
I get the surprising output:
  a b c
1 1 5 5
5 5 1 1
because the code I have written is actually equivalent to:
sample_df[sample_df$b == sample_df$c, ]
The problem is caused by the data containing a column having the same name as the variable used to hold the computed value that is tested.
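One defensive measure (my suggestion, not part of the original argument): since subset looks names up in the data frame before the calling environment, pick a holding-variable name that cannot clash with any column name.

```r
sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))

wanted <- 3   # hypothetical name; no column is called 'wanted'
subset(sample_df, b == wanted)   # correctly returns the single row where b == 3
```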
Both subset and array indexing are potential sources of problems. Which of the two is likely to cause the most grief?
Unless the files being processed each potentially contain many columns with names unknown at the time the code is written, I think the
subset name-clash problem is much less likely to occur than the array-indexing problems listed earlier.
It's a shame that assignment via subset is not supported (something to consider for a future release), but reading is the common case and that is what we are interested in.
subset is restricted to 2-dimensional objects, but most data is 2-dimensional (at least in my world). Again, concentrate recommendations on the common case.
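For that common 2-dimensional read case, subset also lets the same call pick columns, via its select argument; a minimal sketch:

```r
sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))

# Rows where a >= 4, keeping only columns a and c
subset(sample_df, a >= 4, select = c(a, c))
```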
When a choice is available, developers should pick the construct that is least likely to cause problems, and trivial mistakes are the most common cause of problems.
Does anybody have a convincing argument why array indexing should be preferred over
subset ("it is not common usage" being the argument of last resort for the desperate)?