The comments on my post outlining recommended R usage for professional developers were universally scornful, with my proposal recommending
subset receiving the greatest wrath. The main argument against using
subset appeared to be that it goes against existing practice; one comment linked to Hadley Wickham suggesting it is useful in an interactive session (and, by implication, not useful elsewhere).
The commenters appeared to be knowledgeable R users, and I suspect they might have fallen into the trap of thinking that, having invested time in acquiring expertise in a language's intricacies, they ought to use those intricacies. Big mistake: the best way to make use of language expertise is to use it to avoid the intricacies, aiming to write simple, easy-to-understand code.
Some data to work with, which would normally be read from a file:
sample_df = data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))
The following are two of the ways of extracting all rows for which
a >= 4:
subset(sample_df, a >= 4)        # has the same external effect as:
sample_df[sample_df$a >= 4, ]
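With no missing values in the data, the two forms return exactly the same result (one difference worth knowing: subset silently drops rows where the condition evaluates to NA, while the indexing form keeps them as rows of NAs). A quick check:

```r
sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))

# With no NAs present, both forms select rows 4 and 5 and agree exactly
identical(subset(sample_df, a >= 4), sample_df[sample_df$a >= 4, ])
# -> TRUE
```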
The subset approach has the following advantages (each is a pitfall of the indexing form that it avoids):
- The array name, sample_df, only appears once. If this code is cut-and-pasted, or the array name changes, the person editing the code may fail to change the second occurrence.
- Omitting the comma in the array access is an easy mistake to make (and it won't get flagged).
- The person writing the code has to remember that R indexes data in row-column order (it is column-row order in many languages in common use). This might not be a problem for developers who only code in R, but my target audience is likely to be casual R users.
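The first pitfall is easy to demonstrate. A sketch of the cut-and-paste hazard, using a hypothetical second data frame other_df: only one of the two occurrences of the name gets updated, and R silently filters using the wrong data.

```r
sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))
other_df  <- data.frame(a = 6:10, b = 10:6, c = c(2, 4, 2, 5, 3))  # hypothetical

# Intended: other_df[other_df$a >= 4, ], i.e., all five rows (a is 6:10).
# After an incomplete edit the condition still tests sample_df's column,
# so only rows 4 and 5 survive -- no error, no warning:
other_df[sample_df$a >= 4, ]

# The subset form has only one occurrence of the name to get right:
subset(other_df, a >= 4)
```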
The case for
subset is not all positive; there is a use case where it will produce the wrong answer. Let's say I want all the rows where
b has some computed value, and I have chosen to store this computed value in a variable called c:
c = 3
subset(sample_df, b == c)
I get the surprising output:
  a b c
1 1 5 5
5 5 1 1
because the code I have written is actually equivalent to:
sample_df[sample_df$b == sample_df$c, ]
The problem is caused by the data containing a column having the same name as the variable used to hold the computed value that is tested.
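One defensive measure (my suggestion, not part of the original argument): since subset looks names up in the data frame before the calling environment, pick a holding-variable name that cannot clash with any column name.

```r
sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))

wanted <- 3   # hypothetical name; no column is called 'wanted'
subset(sample_df, b == wanted)   # correctly returns the single row where b == 3
```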
Both subset and array indexing are potential sources of problems. Which of the two is likely to cause the most grief?
Unless the files being processed each potentially contain many columns with names unknown at the time the code is written, I think the
subset name-clash problem is much less likely to occur than the array-indexing problems listed earlier.
It's a shame that assignment via subset is not supported (something to consider for a future release), but reading is the common case and that is what we are interested in.
subset is restricted to 2-dimensional objects, but most data is 2-dimensional (at least in my world). Again, concentrate recommendations on the common case.
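For that common 2-dimensional read case, subset also lets the same call pick columns, via its select argument; a minimal sketch:

```r
sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))

# Rows where a >= 4, keeping only columns a and c
subset(sample_df, a >= 4, select = c(a, c))
```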
When a choice is available, developers should pick the construct that is least likely to cause problems, and trivial mistakes are the most common cause of problems.
Does anybody have a convincing argument why array indexing should be preferred over
subset ("it is not common usage" being the argument of last resort for the desperate)?