Updated Apache Drill R JDBC Interface Package {sergeant.caffeinated} With {dbplyr} 2.x Compatibility

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

While the future of the Apache Drill ecosystem is somewhat in-play (MapR — a major sponsoring org for the project — is kinda dead), I still use it almost daily (on my local home office cluster) to avoid handing over any more money to Amazon than I/we already do. The latest (yet-to-be-released) v1.18.0 has some great improvements, including JSON resultset streaming for the REST API. Alas, tweaking {sergeant} (my REST API R package) to handle that is not on the TODO for the foreseeable future, so I’ve been using {sergeant.caffeinated} — https://github.com/hrbrmstr/sergeant-caffeinated — (a RJDBC wrapper for the Drill JDBC interface) for quite a while since it handles large resultsets quite nicely.

I broke out the RJDBC functionality from {sergeant} into this separate package since, despite the fact that it’s 2019/2020, many folks still have/had problems getting {rJava} to work (FWIW it’s a seamless install for me on Windows, Ubuntu, or macOS, even Apple Silicon macOS). The surgery to separate it was fairly hack-ish (one reason it’s not on CRAN) and it finally broke with the recent {dbplyr} 2.x release. I assumed fixing the caffeinated version was easier/quicker than the REST API version, so I dug in and am cautiously tossing it out for wider poking.

An All New Way To Use ????☕

Gone are the days of src_drill_jdbc(), but enter in the new term of more standardized {DBI} and {d[b]plyr} access to Apache Drill. To install this version you can do:

remotes::install_github("hrbrmstr/sergeant-caffeinated")

(more install options using safer and saner social coding sites coming soon).

Let’s load up the package(s) and perform some operations.

library(sergeant.caffeinated)

test_host <- Sys.getenv("DRILL_TEST_HOST", "localhost")

be_quiet()

(con <- dbConnect(drv = DrillJDBC(), sprintf("jdbc:drill:zk=%s", test_host)))
## <DrillJDBCConnection>

The DRILL_TEST_HOST environment variable contains the hostname or IP address of my/your Drill server, defaulting to localhost if none is found.

The be_quiet() function stops the Java engine from yelling at you with “illegal reflective access” warnings. If you see this in other rJava-powered packages it means code in some classes in some Java archive files are doing some sketchy old-school things that newer JVMs aren’t happy about. At some point, these warnings become full-on errors which will break many things. Unfortunately, Drill is still fairly tied to Java 8.x and has tons of introspecting code. The errors are ugly, so if you want to get rid of them, just call this function before doing anything with Drill. (You’ll also notice log4j errors are finally gone!)

Now that we have a Drill JDBC connection, we can do something with it. All the DBI-ish operations work, but it’s 2020 and {d[b]ply} is the bee’s knees, so we’ll just dive right in with that:

(db <- tbl(con, "cp.`employee.json`"))

## # Source:   table<cp.`employee.json`> [?? x 16]
## # Database: DrillJDBCConnection
##    employee_id full_name first_name last_name position_id position_title store_id
##          <dbl> <chr>     <chr>      <chr>           <dbl> <chr>             <dbl>
##  1           1 Sheri No… Sheri      Nowmer              1 President             0
##  2           2 Derrick … Derrick    Whelply             2 VP Country Ma…        0
##  3           4 Michael … Michael    Spence              2 VP Country Ma…        0
##  4           5 Maya Gut… Maya       Gutierrez           2 VP Country Ma…        0
##  5           6 Roberta … Roberta    Damstra             3 VP Informatio…        0
##  6           7 Rebecca … Rebecca    Kanagaki            4 VP Human Reso…        0
##  7           8 Kim Brun… Kim        Brunner            11 Store Manager         9
##  8           9 Brenda B… Brenda     Blumberg           11 Store Manager        21
##  9          10 Darren S… Darren     Stanz               5 VP Finance            0
## 10          11 Jonathan… Jonathan   Murraiin           11 Store Manager         1
## # … with more rows, and 9 more variables: department_id <dbl>, birth_date <chr>,
## #   hire_date <chr>, salary <dbl>, supervisor_id <dbl>, education_level <chr>,
## #   marital_status <chr>, gender <chr>, management_role <chr>

Basically, that’s it: it “just works”.

FIN

If you’ve been a user of {sergeant.caffeinated} and really need src_drill_jdbc() back, drop an issue on GH or a note in the comments, and be sure to file issues if I’ve missed anything as you kick the tyres.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)