Binning Columns in Remote Tables with dplyr and rquery


The Problem

In some recent work, I needed to bin columns in a dplyr remote table. The new dplyr::ntile function was not an option: the database I needed this to work on (MySQL) doesn't support window functions (at least, it didn't until recently), and there was also a use case for user-defined cut points rather than quantile-based cuts. So I needed to roll my own solution.

In the database world, a common approach to this is to join against a table containing only the cut points, drop the rows where the target variable is greater than the cut point, and then take the minimum remaining cut point, grouped by the primary key of the table.
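For reference, a minimal sketch of that pattern in raw SQL, run from R through DBI; the table and column names (src, cuts, id, x, cut_point) and the connection con are hypothetical:

```r
# A sketch of the pure-SQL pattern, run from R via DBI. Table and column
# names are hypothetical; 'con' is an open DBI connection. Note that rows
# above the largest cut point fall out of this inner join.
library(DBI)

sql <- "
  SELECT src.id, MIN(cuts.cut_point) AS bin
  FROM src
  JOIN cuts ON src.x <= cuts.cut_point
  GROUP BY src.id
"
# binned <- dbGetQuery(con, sql)
```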

Another approach available to me, since I'm not writing raw SQL, is to program a CASE WHEN statement and generate the bins from that conditional. dplyr has case_when(), which works on database back ends, making it trivial to construct an arbitrarily long CASE statement.
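For example, with fixed cut points, a remote-table version might look like this (the connection con and the table and column names are hypothetical):

```r
# A minimal illustration with fixed cut points; 'con' and the table/column
# names are hypothetical. On a remote table, dplyr/dbplyr translate
# case_when() into a SQL CASE WHEN expression.
library(dplyr)

remote_tbl <- tbl(con, "src")

remote_tbl %>%
  mutate(bin = case_when(
    x <= 10 ~ 10,
    x <= 20 ~ 20,
    x <= 30 ~ 30,
    TRUE    ~ NA_real_
  ))
```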

Eventually, all this logic is going to get wrapped up into a Shiny app that users can apply to arbitrary columns with arbitrary cut points, so I'm going to need a performant solution.

The Approach

CASE WHEN

Because of their approach to non-standard evaluation, programming against tidyverse functions can be a pain. I need to use a function factory to generate the function I'll apply to a remote table to bin it with dplyr::case_when. This is also a useful approach for benchmarking: by isolating any start-up costs or overhead associated with building the function, I can get a better idea of how the generated SQL performs on its own. Here's the code for the case_when function factory:
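A sketch of one way such a factory could look, assuming the binned column is named x and using rlang to construct the case_when() formulas (not necessarily the original implementation):

```r
# A sketch of a case_when() function factory; the column name 'x' and the
# rlang-based formula construction are assumptions.
library(dplyr)
library(rlang)

make_case_when_binner <- function(cut_points) {
  cut_points <- sort(cut_points)
  # one formula per cut point: x <= cut ~ cut
  clauses <- lapply(cut_points, function(cp) new_formula(expr(x <= !!cp), cp))
  function(data) {
    mutate(data, bin = case_when(!!!clauses))
  }
}

# Usage:
# binner <- make_case_when_binner(c(10, 20, 30))
# binner(remote_tbl)   # adds 'bin' via a generated CASE WHEN expression
```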

This will return a function of data that will compute the new column for me.

A dplyr Join

The join method with dplyr is much more straightforward. I don't need a function factory at all: the only parameters are my (temporary) table of cut points and the data to filter. The rest stays the same:
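A sketch of what that function might look like, assuming a dummy-key cross join and the column names id, x, and cut_point:

```r
# A sketch of join-based binning with dplyr; the column names and the
# dummy-key cross join are assumptions about the setup.
library(dplyr)

bin_by_join <- function(data, cuts) {
  data %>%
    mutate(dummy = 1L) %>%
    inner_join(mutate(cuts, dummy = 1L), by = "dummy") %>%  # cross join
    filter(x <= cut_point) %>%                              # drop rows above each cut
    group_by(id) %>%
    summarise(bin = min(cut_point, na.rm = TRUE))           # smallest qualifying cut
}

# Usage: bin_by_join(remote_tbl, cuts_tbl), where cuts_tbl is a (temporary)
# remote table with a single cut_point column.
```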

Using rquery

If you're not familiar with rquery, it solves a similar problem to dplyr, but in a different way. rquery is based on Codd's relational algebra (Codd of Boyce-Codd normal form), so those who are accustomed to the verbs of dplyr may be lost initially. But with these less familiar verbs we also get the ability to compute on relational data in a way that is more flexible and better suited to query optimizers.

Another interesting feature of rquery is the ability to re-use a chain of operations in other contexts, even in other databases, as long as the table name and the referenced columns keep the same names. I can use this to create an rquery function factory the same way I did for case_when, moving all the setup logic out of the function that gets benchmarked. Here's the rquery function factory:
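A sketch of one way such a factory could look; the table and column names, and the particular rquery verbs used, are my assumptions rather than the original code:

```r
# A sketch of an rquery binning factory; table/column names and the exact
# choice of verbs are assumptions, not necessarily the original code.
library(rquery)
library(wrapr)   # for the dot pipe %.>%

make_rquery_binner <- function(db, data_table = "src", cuts_table = "cuts") {
  # table descriptions: only names and columns are needed, no data is moved
  d    <- mk_td(data_table, c("id", "x"))
  cuts <- mk_td(cuts_table, "cut_point")

  # dummy-key cross join, keep rows at or below each cut, take the smallest cut
  ops <- natural_join(
      extend(d, dummy := 1),
      extend(cuts, dummy := 1),
      by = "dummy", jointype = "INNER") %.>%
    select_rows(., x <= cut_point) %.>%
    project(., bin := min(cut_point), groupby = "id")

  # the op tree plus the connection is everything needed: return a thunk
  function() execute(db, ops)
}

# Usage:
# rquery_binner <- make_rquery_binner(db)   # db: DBI connection or rquery db handle
# rquery_binner()                           # runs the whole chain on the database
```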

Notice that it returns a function with no arguments. The only thing needed to run this chain of operations, once we've built it, is the database connection.

The Benchmark

I want to benchmark a few things. I want to know how each method fares when varying both the number of rows in the remote table and the number of cut points. I also want to know whether putting an index on the column that gets cut will speed things up. The function run_benchmark (see it in this gist) does all the necessary table setup, calls the function factories, and finally benchmarks just the operation itself for dplyr's case_when and for both dplyr and rquery using the join method of binning. I'll be using the fantastic microbenchmark package to do all the benchmarking.
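The timing step itself reduces to something like the following sketch, where the binner functions and remote tables are hypothetical stand-ins for what the factories and setup code produce:

```r
# A sketch of the timing step only (the full run_benchmark setup lives in the
# gist); the three functions and table objects are hypothetical stand-ins.
library(dplyr)
library(microbenchmark)

microbenchmark(
  case_when  = collect(case_when_binner(remote_tbl)),       # collect() forces the
  dplyr_join = collect(bin_by_join(remote_tbl, cuts_tbl)),  # lazy queries to run
  rquery     = rquery_binner(),
  times = 25
)
```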

Varying Table Size

First, let's look at how varying the table size affects each method. With only three cut points, rquery is the clear leader, with dplyr::case_when not far behind. Using a join with dplyr brings up the rear.

Note: there are no units on these plots. I think this one is in milliseconds. It doesn’t really matter though–I’m interested in comparisons, not absolute time.

[Plot: 3 cut points, 100 or 10,000 rows]

For my use case, I can stop here. I'm enforcing a limit on the maximum number of cuts a user can specify, and it'll be close to three. All three methods are reasonably fast, but I'm using case_when because the client wants the code in dplyr and it avoids the overhead of writing a temporary table of cut points, something that is not part of the benchmark here. I am still curious what happens when we bump the number of cut points to 10, 100, and even 1000.

Varying Cut Point Number

I assumed that this is where case_when would really start to shine: a join on 100 or 1000 cut points means every row in the source table gets joined to 100 or 1000 rows in the cut table, and I only have so much memory on my home laptop. But things ended up going the opposite way. rquery with a join is still the fastest solution, followed closely by dplyr's join. It looks like the performance of case_when degrades exponentially as more cases are added.

[Plot: 100 or 1000 cut points, 10,000 rows]

Note: the y-axes above are on different scales.

Indexed Cut Column

Finally, indexing. I expected this to make a big difference, but it looks like it didn't help much. This is again with 10,000 rows and 100 cut points.

[Plot: 1000 rows, 100 cut points, non-indexed and indexed]

Conclusions, Limitations, etc.

rquery with a join is consistently the fastest option here. Based on what I've read of rquery, this is because the SQL it writes is ripe for query optimization. dplyr lags behind even CASE WHEN with only a few cut points but stays pretty much in line with rquery once the number of cuts increases. I wonder if this is overhead in the dplyr code rather than anything on the database end. One way to get to the bottom of this would be to extract the queries that dplyr and rquery write and benchmark those directly.
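A sketch of how those queries could be extracted, assuming a dbplyr lazy table and an rquery op tree plus connection like the ones sketched above:

```r
# One way to pull out the SQL each package generates, so the raw queries
# could be benchmarked directly (e.g. via DBI::dbGetQuery); 'binned_tbl',
# 'ops', and 'db' are hypothetical objects from the earlier sketches.
library(dbplyr)

dplyr_sql  <- sql_render(binned_tbl)   # or show_query(binned_tbl) to print it
rquery_sql <- rquery::to_sql(ops, db)
```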

I was surprised at the poor performance of the CASE expression at first. After reading a little more about how CASE WHEN works though, this makes sense. And lookup tables are a very common database pattern–of course they are going to be heavily optimized!

Thanks to John Mount over at Win-Vector LLC for writing rquery and so many other awesome contributions to the R community.
