Arthur Charpentier was trying to solve an interesting problem with R: given this data set of random walks in the 2-D plane, what is the likely origin of a pathway that ends in the black circle below?
It's pretty easy to generate random data like this with a few lines of code in R. And with 2 million trajectories of 80 points each, you have some moderately sized data to analyze: about 4 GB.
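As a rough illustration of how such data might be generated, here is a minimal sketch in base R. It is not Arthur's code: the column names (`id`, `step`, `x`, `y`), the Gaussian increments, and the uniform starting square are all assumptions made for the example, and the number of walks is scaled down to 1,000 so it runs instantly.

```r
# Scaled-down sketch: 1,000 walks of 80 points (the post uses 2 million walks).
set.seed(1)
n_walks <- 1000L
n_steps <- 80L

# Hypothetical starting points, drawn uniformly from a 20 x 20 square.
x0 <- runif(n_walks, -10, 10)
y0 <- runif(n_walks, -10, 10)

# One row per point: walk identifier and position of the point within the walk.
walks <- data.frame(
  id   = rep(seq_len(n_walks), each = n_steps),
  step = rep(seq_len(n_steps), times = n_walks)
)

# Gaussian increments, accumulated within each walk and shifted by its start.
walks$x <- rep(x0, each = n_steps) + ave(rnorm(nrow(walks)), walks$id, FUN = cumsum)
walks$y <- rep(y0, each = n_steps) + ave(rnorm(nrow(walks)), walks$id, FUN = cumsum)

# Rough memory footprint of the toy data.
print(object.size(walks), units = "Mb")
```

With this assumed layout (two integer and two double columns, 24 bytes per row), 2 million walks of 80 points come to 160 million rows, or roughly 3.8 GB, which is in line with the figure quoted above.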
There are several ways to tackle data of this size with R: you can use an ordinary data.frame object (provided you have sufficient RAM to hold it in memory) and use standard R functions to select the corresponding records; you can use functions in the dplyr package to filter the data; or you can use the data.table package and its operations to select the appropriate data. Arthur tried all three methods with the following results:
- Using ordinary data.frame operations, it took about a minute to extract the necessary data. Even then, Arthur had some challenges with out-of-memory errors when trying to create temporary columns in the data (which swelled its size to over 6 GB).
- Using the dplyr package, Arthur read in the data as a data_frame object and filtered the data using dplyr's group_by, summarise, and left_join operations. This process took about two minutes.
- Using the data.table package, with its built-in selection syntax and merge operator, the process took around 10 seconds.
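To make the three approaches concrete, here is a hedged sketch of what each might look like on a small toy data set. This is not Arthur's code (that is at the link below): the data layout, the target circle (`cx`, `cy`, `r`), and the choice to treat the first recorded point of each walk as its origin are assumptions for illustration. It requires the dplyr and data.table packages.

```r
library(dplyr)
library(data.table)

# Toy data: 500 walks of 80 points, rows ordered by step within each walk.
set.seed(42)
n_walks <- 500L
n_steps <- 80L
walks <- data.frame(
  id   = rep(seq_len(n_walks), each = n_steps),
  step = rep(seq_len(n_steps), times = n_walks)
)
walks$x <- rep(runif(n_walks, -10, 10), each = n_steps) +
  ave(rnorm(nrow(walks)), walks$id, FUN = cumsum)
walks$y <- rep(runif(n_walks, -10, 10), each = n_steps) +
  ave(rnorm(nrow(walks)), walks$id, FUN = cumsum)

# Hypothetical target circle: centre (cx, cy), radius r.
cx <- 5; cy <- 5; r <- 3

# 1. Base R: find walks whose last point is in the circle, keep their first point.
ends_df <- walks[walks$step == n_steps, ]
hit_ids <- ends_df$id[(ends_df$x - cx)^2 + (ends_df$y - cy)^2 <= r^2]
origins_base <- walks[walks$step == 1L & walks$id %in% hit_ids, c("id", "x", "y")]

# 2. dplyr: summarise each walk's end point, filter, then join back to the data.
hits <- walks %>%
  group_by(id) %>%
  summarise(x_end = x[which.max(step)], y_end = y[which.max(step)]) %>%
  filter((x_end - cx)^2 + (y_end - cy)^2 <= r^2)
origins_dplyr <- hits %>%
  left_join(walks, by = "id") %>%
  filter(step == 1L) %>%
  select(id, x, y)

# 3. data.table: flag each walk by reference with :=, then subset.
#    x[.N] is the last row of each group, which relies on the row ordering above.
dt <- as.data.table(walks)
dt[, hit := (x[.N] - cx)^2 + (y[.N] - cy)^2 <= r^2, by = id]
origins_dt <- dt[hit == TRUE & step == 1L, .(id, x, y)]
```

All three should return the same set of walks. On a toy data set the timing differences are negligible, so the figures quoted above only become visible at the full 2-million-walk scale.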
Note that all of these techniques are in-memory operations. Arthur doesn't note the size of the system he was using, but it probably had at least 8 GB of RAM to be able to accommodate the data. While dplyr's syntax is (for me) somewhat simpler to use, here data.table wins out on performance, thanks to its optimized operations and the ability to create new variables on the fly (and without requiring additional RAM) with its := syntax. You can see the complete code used for the various methods at the link below.
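For readers who haven't used it, the := operator adds or updates columns in an existing data.table by reference rather than building a modified copy of the whole table. The sketch below uses a tiny made-up table (the column names are hypothetical) and data.table's `address()` helper to compare the table's memory address before and after.

```r
library(data.table)

# Tiny made-up table: three walks of four points each.
dt <- data.table(
  id = rep(1:3, each = 4),
  x  = as.numeric(1:12),
  y  = as.numeric(12:1)
)

addr_before <- address(dt)

# Add a column in place: squared distance from (0, 0).
dt[, dist2 := x^2 + y^2]

# Add a per-group column in place: the last x value of each walk,
# recycled to every row of that walk.
dt[, x_end := x[.N], by = id]

addr_after <- address(dt)

# Compare the two addresses to see whether the table object itself was copied.
print(identical(addr_before, addr_after))
print(dt)
```

By contrast, creating temporary columns on an ordinary data.frame can trigger copies, which is consistent with the out-of-memory trouble described for the first approach.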
Freakonometrics: Working with “large” datasets, with dplyr and data.table