This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers.
The method = "radix" option can greatly speed up sorting and ordering tables in R.
For a 1 million row table the speedup is already roughly 35 times (a median of about 9.7 seconds versus about 0.27 seconds). Below is an excerpt from an experiment comparing the default settings with radix sort (full code here).
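library(microbenchmark)
# d (the table) and my_check (the result-equality check) are defined in the full code.
microbenchmark(
  order_default = d[order(d$col_a, d$col_b, d$col_c, d$col_x), ,
                    drop = FALSE],
  order_radix = d[order(d$col_a, d$col_b, d$col_c, d$col_x,
                        method = "radix"), ,
                  drop = FALSE],
  check = my_check,
  times = 10L)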
## Unit: milliseconds
##           expr       min        lq      mean    median        uq        max neval
##  order_default 9531.2865 9653.6827 9759.8929 9690.6702 9833.2170 10329.3520    10
##    order_radix  262.1377  263.3226  278.2547  265.1452  274.2476   382.2544    10
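For readers who want to try the comparison without the full code, here is a minimal sketch on made-up data. The table d below, its column types, and the row count are assumptions chosen for illustration (the experiment above used a 1 million row table), and system.time() stands in for microbenchmark so that only base R is needed. Timings vary by machine, so none are shown.

```r
# Hypothetical stand-in data; the article's real `d` comes from its full code.
set.seed(2018)
n <- 100000L
d <- data.frame(
  col_a = sample(letters, n, replace = TRUE),
  col_b = sample(letters, n, replace = TRUE),
  col_c = sample.int(100L, n, replace = TRUE),
  col_x = runif(n),
  stringsAsFactors = FALSE  # keep characters: factors would already get radix
)

# Default method: with character keys, order() uses locale-aware shell sort.
time_default <- system.time(
  p_default <- order(d$col_a, d$col_b, d$col_c, d$col_x)
)

# Same keys, explicitly requesting radix sort.
time_radix <- system.time(
  p_radix <- order(d$col_a, d$col_b, d$col_c, d$col_x, method = "radix")
)

# Elapsed seconds for each method.
print(c(default = time_default[["elapsed"]],
        radix = time_radix[["elapsed"]]))

# Apply the radix permutation to get the sorted table.
d_sorted <- d[p_radix, , drop = FALSE]
```

The character columns are what make the difference here: with only numeric, integer, logical, or factor keys, the default method already selects radix sort.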
This speedup is possible because Matt Dowle and Arun Srinivasan of the data.table team generously ported their radix sorting code into base-R! Please see help(sort) for details. So data.table is not only the best data manipulation package in R, the team actually works to improve R itself. This is what is meant by "R community" and what is needed to keep R vibrant and alive.
Edit/Note: Iñaki Úcar shared at least two good points in a follow-up article. First, if you are using factors you get radix sort for free (for technical reasons I tend to delay/disable conversion to factors). Second, I didn’t mention the loss of control over collation order: with method = "radix", character vectors are always sorted in C-locale order rather than in the current locale's order. Because of that I am changing the article title from “R tip: Use Radix Sort” to “R Tip: Consider radix Sort”.
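Both points can be seen in a small sketch. The vector x is a hypothetical example. The radix result in the comment follows from C-locale byte order; the default result depends on your locale, so it is not shown.

```r
# Hypothetical example vector mixing upper- and lower-case strings.
x <- c("apple", "Zebra", "banana")

# Default method for character vectors: collation follows the current locale.
sorted_default <- x[order(x)]

# Radix method: character vectors are compared in the C locale (byte order),
# so upper-case ASCII letters come before lower-case ones.
sorted_radix <- x[order(x, method = "radix")]
print(sorted_radix)
# "Zebra" "apple" "banana"

# Factors are ordered by their integer codes, and the default method
# already selects radix sort for them, so asking for it changes nothing.
f <- factor(x)
same_for_factors <- identical(order(f), order(f, method = "radix"))
print(same_for_factors)
# TRUE
```

If locale-aware ordering of strings matters for your results, this is a reason to stay with the default method (or to control the ordering through factor levels).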