R Tip: Use drop = FALSE with data.frames

February 27, 2018

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Another R tip. Get in the habit of using drop = FALSE when indexing (using [ , ] on) data.frames.


Prince Rupert’s drops (img: Wikimedia Commons)

In R, single-column data.frames are often converted to vectors when manipulated. For example:

d <- data.frame(x = seq_len(3))
print(d)
#>   x
#> 1 1
#> 2 2
#> 3 3
# not a data frame!
d[order(-d$x), ]
#> [1] 3 2 1

We were merely trying to re-order the rows, and the result was converted to a vector. This happened because the rules for [ , ] change when there is only one result column. It happens even when there was only one input column to begin with. Another example: d[,] is also a vector in this case.
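A quick way to see the conversion is to check each result with is.data.frame(). This is a small sketch; the two-column frame d2 is a made-up example added here for contrast and is not part of the original post.

```r
d <- data.frame(x = seq_len(3))

# Re-ordering the rows of a one-column data.frame drops to a vector.
r1 <- d[order(-d$x), ]
is.data.frame(r1)
#> [1] FALSE

# Even selecting "everything" drops to a vector.
r2 <- d[, ]
is.data.frame(r2)
#> [1] FALSE

# d2 is a hypothetical two-column frame: same indexing, no conversion.
d2 <- data.frame(x = seq_len(3), y = c("a", "b", "c"))
r3 <- d2[order(-d2$x), ]
is.data.frame(r3)
#> [1] TRUE
```

The same expression gives a different kind of result depending only on how many columns happen to be present.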

The issue is: when we write re-usable code, we are often programming before we know the complete contents of a variable or argument. For a data.frame named “g” supplied as an argument, g[vec, ] can be a data.frame or a vector (or possibly even a list). However, we do know that if g is a data.frame then g[vec, , drop = FALSE] is also a data.frame (assuming vec is a vector of valid row indices or a logical vector; note that NA induces some special cases).
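As a sketch of what this means for re-usable code, consider a small wrapper that takes g and vec as arguments. The function name take_rows and the two test frames are hypothetical, chosen only for illustration.

```r
# take_rows is a hypothetical helper: it always returns a data.frame,
# no matter how many columns g has.
take_rows <- function(g, vec) {
  stopifnot(is.data.frame(g))
  g[vec, , drop = FALSE]
}

one_col <- data.frame(x = seq_len(3))
two_col <- data.frame(x = seq_len(3), y = c("a", "b", "c"))

is.data.frame(take_rows(one_col, c(TRUE, FALSE, TRUE)))
#> [1] TRUE
is.data.frame(take_rows(two_col, 2:3))
#> [1] TRUE

# One of the NA special cases: an NA row index still gives a data.frame,
# but the matching row is filled with NA.
na_res <- take_rows(one_col, c(1, NA))
is.data.frame(na_res)
#> [1] TRUE
```

Without drop = FALSE, the first call would return a plain vector and the second a data.frame, so callers would have to handle both.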

We care because vectors and data.frames have different semantics, so they are not fully substitutable in later code.
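To make the difference in semantics concrete, here is a short sketch comparing how two common functions treat the two kinds of result. The variable names v and f are mine, not from the original example.

```r
d <- data.frame(x = seq_len(3))

v <- d[order(-d$x), ]                # a vector
f <- d[order(-d$x), , drop = FALSE]  # a data.frame

# nrow() has no answer for a plain vector.
nrow(v)
#> NULL
nrow(f)
#> [1] 3

# length() counts elements of a vector, but columns of a data.frame.
length(v)
#> [1] 3
length(f)
#> [1] 1
```

Later code that loops over seq_len(nrow(result)), or that takes length(result) to be a row count, will quietly misbehave when it receives the wrong kind of object.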

The fix is to include drop = FALSE as a third argument to [ , ].

# is a data frame.
d[order(-d$x), , drop = FALSE]
#>   x
#> 3 3
#> 2 2
#> 1 1

To pull out a column I suggest using one of the many good extraction notations (all using the fact that a data.frame is officially a list of columns):

d[["x"]]
#> [1] 1 2 3

d$x
#> [1] 1 2 3

d[[1]]
#> [1] 1 2 3

My overall advice is: get in the habit of including drop = FALSE when working with [ , ] and data.frames. I say do this even when it is obvious that the result does in fact have more than one column.

For example, write “mtcars[, c("mpg", "cyl"), drop = FALSE]” instead of “mtcars[, c("mpg", "cyl")]”. For data.frames both forms should work the same (either selecting a data frame with two columns, or throwing an error if we have mentioned a non-existent column). But the longer drop = FALSE form is safer (it goes further towards ensuring type-stable code) and, more importantly, documents intent (that you wanted a data.frame result).
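The habit pays off once the column names arrive in a variable and we no longer know how many there are. A minimal sketch, where keep_cols is a hypothetical helper name:

```r
# keep_cols is a hypothetical helper: cols may name one column or many.
keep_cols <- function(df, cols) {
  df[, cols, drop = FALSE]
}

two <- keep_cols(mtcars, c("mpg", "cyl"))
one <- keep_cols(mtcars, "mpg")

# With two columns, the long form matches the short form exactly.
identical(two, mtcars[, c("mpg", "cyl")])
#> [1] TRUE

# With one column, only the drop = FALSE form stays a data.frame.
is.data.frame(one)
#> [1] TRUE
is.data.frame(mtcars[, "mpg"])
#> [1] FALSE
```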

One can also try base::subset(), as it has non-dropping defaults.
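A brief sketch of the subset() route, using the same one-column d as earlier. The filter condition x >= 2 is just an illustrative choice.

```r
d <- data.frame(x = seq_len(3))

# Filtering rows of a one-column data.frame keeps the data.frame.
s <- subset(d, x >= 2)
is.data.frame(s)
#> [1] TRUE

# Selecting a single column with select = also keeps the data.frame.
s2 <- subset(d, select = x)
is.data.frame(s2)
#> [1] TRUE
```

Note that subset() filters rows and selects columns but does not re-order rows, so for the re-ordering example the explicit drop = FALSE form is still the one to use.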
