New R package: replyr (get a grip on remote dplyr data services)

Posted on November 22, 2016 by John Mount in R bloggers | 0 Comments

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

It is a bit of a shock when R dplyr users switch from using a tbl implementation based on R in-memory data.frames to one based on a remote database or service. A lot of the power and convenience of the dplyr notation is hard to maintain with these more restricted data service providers. Things that work locally can’t always be used remotely at scale. It is emphatically not yet the case that one can practice with dplyr in one modality and hope to move to another back-end without significant debugging and work-arounds. replyr attempts to provide a few helpful work-arounds.

Our new package replyr supplies methods to get a grip on working with remote tbl sources (SQL databases, Spark) through dplyr. The idea is to add convenience functions to make such tasks more like working with an in-memory data.frame. Results still do depend on which dplyr service you use, but with replyr you have fairly uniform access to some useful functions.

Example: the following should work across more than one dplyr back-end (such as RMySQL or RPostgreSQL).

library('replyr')
d <- data.frame(x=c(1,2,2),y=c(3,5,NA),z=c(NA,'a','b'),
                stringsAsFactors = FALSE)
summary(d)
 #         x               y            z            
 #   Min.   :1.000   Min.   :3.0   Length:3          
 #   1st Qu.:1.500   1st Qu.:3.5   Class :character  
 #   Median :2.000   Median :4.0   Mode  :character  
 #   Mean   :1.667   Mean   :4.0                     
 #   3rd Qu.:2.000   3rd Qu.:4.5                     
 #   Max.   :2.000   Max.   :5.0                     
 #                   NA's   :1
replyr_summary(d)
 #    column     class nrows nna nunique min max     mean        sd lexmin lexmax
 #  1      x   numeric     3   0       2   1   2 1.666667 0.5773503   <NA>   <NA>
 #  2      y   numeric     3   1       2   3   5 4.000000 1.4142136   <NA>   <NA>
 #  3      z character     3   1       2  NA  NA       NA        NA      a      b

replyr doesn’t seem to add much until you use a remote data service:

my_db <- dplyr::src_sqlite("replyr_sqliteEx.sqlite3", create = TRUE)
dRemote <- dplyr::copy_to(my_db,d,'d')
summary(dRemote)
 #      Length Class          Mode
 #  src 2      src_sqlite     list
 #  ops 3      op_base_remote list
replyr_summary(dRemote)
 #    column     class nrows nna nunique min max     mean        sd lexmin lexmax
 #  1      x   numeric     3   0       2   1   2 1.666667 0.5773503   <NA>   <NA>
 #  2      y   numeric     3   1       2   3   5 4.000000 1.4142136   <NA>   <NA>
 #  3      z character     3   1       2  NA  NA       NA        NA      a      b

Data types, capabilities, and row-orders all vary a lot as we switch remote data services. But the point of replyr is to provide at least some convenient version of typical functions such as: summary, nrow, unique values, and filter rows by values in a set.

This is a very new package with no guarantees or claims of fitness for purpose. Some implemented operations are going to be slow and expensive (part of why they are not exposed in dplyr itself).

We will probably only ever cover:

Native data.frames (and tbl/tibble)
RMySQL
RPostgreSQL
SQLite
sparklyr 2.0.0

The main useful functions we supply are replyr::replyr_filter and replyr::replyr_inTest which are designed to subset data based on a columns values being in a given set. These allow selection of rows by testing membership in a set (very useful for partitioning data). Example below:

library('dplyr')

values <- c(2)
dRemote %>% replyr::replyr_filter('x',values)
 #  Source:   query [?? x 3]
 #  Database: sqlite 3.8.6 [replyr_sqliteEx.sqlite3]
 #  
 #        x     y     z
 #    <dbl> <dbl> <chr>
 #  1     2     5     a
 #  2     2    NA     b

To install replyr:

# install.packages('devtools')
devtools::install_github('WinVector/replyr')

The project URL is: https://github.com/WinVector/replyr

I would like this to become a bit of a “stone soup” project. If you have a neat function you want to add please contribute a pull request with your attribution and assignment of ownership to Win-Vector LLC (so Win-Vector LLC can control the code, which we are currently distributing under a GPL3 license) in the code comments.

There are a few (somewhat incompatible) goals for replyr:

Providing missing convenience functions that work well over all common dplyr service providers. Examples include replyr_summary, replyr_filter, and replyr_nrow.
Providing a basis for “row number free” data analysis. SQL back-ends don’t commonly supply row number indexing (or even deterministic order of rows), so a lot of tasks you could do in memory by adjoining columns have to be done through formal key-based joins.
Providing emulations of functionality missing from non-favored service providers (such as windowing functions, quantile, sample_n, cumsum; missing from SQLite and RMySQL).
Sheer bull-headedness in emulating operations that don’t quite fit into the pure dplyr formulation.

Good code should fill one important gap and work on a variety of dplyr back ends (you can test RMySQL, and RPostgreSQL using docker as mentioned here and here; sparklyr can be tried in local mode as described here). I am especially interested in clever “you wouldn’t thing this was efficiently possible, but” solutions (which give us an expanded grammar of useful operators), and replacing current hacks with more efficient general solutions. Targets of interest include sample_n (which isn’t currently implemented for tbl_sqlite), cumsum, and quantile.

Right now we have an expensive implementation of quantile based on binary search.

replyr_quantile(dRemote,'x')
 #     0 0.25  0.5 0.75    1 
 #     1    1    2    2    2

Some primitives of interest include:

cumsum or row numbering (interestingly enough if you have row numbering you can implement cumulative sum in log-n rounds using joins to implement pointer chasing/jumping ideas, but that is unlikely to be practical, lag is enough to generate next pointers, which can be boosted to row-numberings).
Random row sampling (like dplyr::sample_n, but working with more service providers).
Inserting random values (or even better random unique values) in a remote column. Most service providers have a pseudo-random source you can use.
Emulating The Split-Apply-Combine Strategy.
Emulating tidyr gather/spread (or pivoting and anti-pivoting).

Note we are deliberately using prefixed names replyr_ and not using common S3 method names to avoid the possibility of replyr functions interfering with basic dplyr functionality.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

New R package: replyr (get a grip on remote dplyr data services)

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)