Painless ODBC + dplyr Connections to Amazon Athena and Apache Drill with R & odbc
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I spent some time this morning upgrading the JDBC driver (and changing up some supporting code to account for changes to it) for my metis
package which connects R up to Amazon Athena via RJDBC. I’m used to JDBC and have to deal with Java separately from R so I’m also comfortable with Java, JDBC and keeping R working with Java. I notified the #rstats Twitterverse about it and it started this thread (click on the embed to go to it — and, yes, this means Twitter is tracking you via this post unless you’ve blocked their JavaScript):
The (GitHub only for now) #rstats metis package for wiring up R to @amazonathema via RJDBC now uses & includes the new Simba Athena JDBC Driver 2.0.2 JAR https://t.co/wvwV6IxCNd (cc: @dabdine)
— hrbrmstr (@hrbrmstr) April 20, 2018
If you do scroll through the thread you’ll see @hadleywickham suggested using the odbc
package with the ODBC driver for Athena.
I, and others, have noted that ODBC on macOS (and — for me, at least — Linux) never really played well together for us. Given that I’m familiar with JDBC, I just gravitated towards using it after trying it out with raw Java and it worked fine in R.
Never one to discount advice from Hadley, I quickly grabbed the Athena ODBC driver and installed it and wired up an odbc
+ dplyr
connection almost instantly:
library(odbc) library(tidyverse) DBI::dbConnect( odbc::odbc(), driver = "Simba Athena ODBC Driver", Schema = "redacted", AwsRegion = "us-east-1", AuthenticationType = "Default Credentials", S3OutputLocation = "s3://aws-athena-query-results-redacted" ) -> con some_tbl <- tbl(con, "redacted")
Apologies for the redaction and lack of output but we’ve removed the default example databases from our work Athena environment and I’m not near my personal systems, so a more complete example will have to wait until later.
The TLDR is that I can now use 100% dplyr
idioms with Athena vs add one to the RJDBC driver I made for metis
. The metis
package will still be around to support JDBC on systems that do have issues with ODBC and to add other methods that work with the AWS Athena API (managing Athena vs the interactive queries part).
The downside is that I’m now even more likely to run up the AWS bill ;-)
What About Drill?
I also maintain the sergeant
package which provides REST API and REST query access to Apache Drill along with a REST API DBI
driver and an RJDBC interface for Drill. I remember trying to get the MapR ODBC client working with R a few years ago so I made the package (which was also a great learning experience).
I noticed there was a very recent MapR Drill ODBC driver released. Since I was on a roll, I figured why not try it one more time, especially since the RStudio team has made it dead simple to work with ODBC from R.
library(odbc) library(tidyverse) DBI::dbConnect( odbc::odbc(), driver = "/Library/mapr/drill/lib/libdrillodbc_sbu.dylib", ConnectionType = "Zookeeper", AuthenticationType = "No Authentication", ZKCLusterID = "CLUSTERID", ZkQuorum = "HOST:2181", AdvancedProperties = "CastAnyToVarchar=true;HandshakeTimeout=30;QueryTimeout=180;TimestampTZDisplayTimezone=utc; ExcludedSchemas=sys,INFORMATION_SCHEMA;NumberOfPrefetchBuffers=5;" ) -> drill_con (employee <- tbl(drill_con, sql("SELECT * FROM cp.`employee.json`"))) ## # Source: SQL [?? x 16] ## # Database: Drill 01.13.0000[@Apache Drill Server/DRILL] ## employee_id full_name first_name last_name position_id position_title store_id ## <S3: integer> <chr> <chr> <chr> <S3: integer> <chr> <S3: inte> ## 1 1 Sheri Nowmer Sheri Nowmer 1 President 0 ## 2 2 Derrick Whe… Derrick Whelply 2 VP Country Mana… 0 ## 3 4 Michael Spe… Michael Spence 2 VP Country Mana… 0 ## 4 5 Maya Gutier… Maya Gutierrez 2 VP Country Mana… 0 ## 5 6 Roberta Dam… Roberta Damstra 3 VP Information … 0 ## 6 7 Rebecca Kan… Rebecca Kanagaki 4 VP Human Resour… 0 ## 7 8 Kim Brunner Kim Brunner 11 Store Manager 9 ## 8 9 Brenda Blum… Brenda Blumberg 11 Store Manager 21 ## 9 10 Darren Stanz Darren Stanz 5 VP Finance 0 ## 10 11 Jonathan Mu… Jonathan Murraiin 11 Store Manager 1 ## # ... with more rows, and 9 more variables: department_id <S3: integer64>, birth_date <chr>, ## # hire_date <chr>, salary <dbl>, supervisor_id <S3: integer64>, education_level <chr>, ## # marital_status <chr>, gender <chr>, management_role <chr>## count(employee, position_title, sort=TRUE) ## # Source: lazy query [?? x 2] ## # Database: Drill 01.13.0000[@Apache Drill Server/DRILL] ## # Ordered by: desc(n) ## position_title n ## <chr> <S3: integer64> ## 1 Store Temporary Checker 268 ## 2 Store Temporary Stocker 264 ## 3 Store Permanent Checker 226 ## 4 Store Permanent Stocker 222 ## 5 Store Shift Supervisor 52 ## 6 Store Permanent Butcher 32 ## 7 Store Manager 24 ## 8 Store Assistant Manager 24 ## 9 Store Information Systems 16 ## 10 HQ Finance and Accounting 8 ## # ... with more rows##
Apart from having to do that sql(…)
to make the table connection work, it was pretty painless and I had both Athena and Drill working with dplyr
verbs in under ten minutes (total).
You can head on over to the main Apache Drill site to learn all about the ODBC driver configuration parameters and I’ll be updating my ongoing Using Apache Drill with R e-book to include this information. I will also keep maintaining the existing sergeant
package but also be including some additional methods provide ODBC usage guidance and potentially other helpers if there are any “gotchas” that arise.
FIN
The odbc
package is super-slick and it’s refreshing to be able to use dplyr
verbs with Athena vs gosh-awful SQL. However, for some of our needs the hand-crafted queries will still be necessary as they are far more optimized than what would likely get pieced together via the dplyr
verbs. However, those queries can also be put right into sql()
with the Athena ODBC driver connection and used via the same dplyr
verb magic afterwards.
Today is, indeed, a good day to query!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.