data.table is Really Good at Sorting

August 13, 2018
By

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes.

Present 2

The graph is using a log-log scale (so things are very compressed). But data.table is routinely 7 times faster than dplyr. The ratio of run times is shown below.

Present 3

Notice on the above semi-log plot the run time ratio is growing roughly linearly. This makes sense: data.table uses a radix sort which has the potential to perform in near linear time (faster than the n log(n) lower bound known comparison sorting) for a range of problems (also we are only showing example sorting times, not worst-case sorting times).

In fact, if we divide the y in the above graph by log(rows) we get something approaching a constant.

Present 4

The above is consistent with data.table not only being faster than dplyr, but also having a fundamentally different asymptotic running time.

Performance like the above is one of the reasons you should strongly consider data.table for your R projects.

All details of the timings can be found here.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)