Comparing data frames, data.table and dplyr with random walks

[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Arthur Charpentier was trying to solve an interesting problem with R: given this data set of random walks in the 2-D plane, what is the likely origin of a pathway that ends in the black circle below?

Random walks

It's pretty easy to generate random data like this with a few lines of code in R. And with 2 million trajectories of 80 points each, you have some moderately-sized data to analyze: about 4Gb.

There are several ways to tackle data of this size with R: you can use an ordinary data.frame object (provided you have sufficient RAM to hold it in memory) and use standard R functions to select the corresponding records; you can use functions in the dplyr package to filter the data; or you can use a data.table package and its operations to select the appropriate data. Arthur tried all three methods with the following results:

  • Using ordinary data.frame operations, it took about a minute to extract the necessary data. Even then, Arthur had some challenges with out of memory errors when trying to create temporary columns in the data (which swelled its size to over 6 Gb).
  • Using the dplyr package, Arthur read in the data as a data_frame object and filtered the data using dplyr's group_by, summarise, and left_join operations. This process took about two minutes.
  • Using the data.table package and using its built-in selection syntax and merge operator, the process took around 10 seconds.

Note that all of these techniques are in-memory operations. Arthur doesn't note the size of the system he was using, but it probably has at least 8Gb of RAM to be able to accommodate the data. While dplyr's syntax is (for me) somewhat simpler to use, here data.table wins out on performance, thanks to its optimized operations and the ability to create new variables on the fly (and without requiring additional RAM) with its := syntax. You can see the complete code used for the various methods at the link below. 

Freakonometrics: Working with “large” datasets, with dplyr and data.table

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)