I spent some time this morning upgrading the JDBC driver (and changing up some supporting code to account for changes to it) for my
If you do scroll through the thread you’ll see @hadleywickham suggested using the
odbc package with the ODBC driver for Athena.
I, and others, have noted that ODBC on macOS (and — for me, at least — Linux) never really played well together for us. Given that I’m familiar with JDBC, I just gravitated towards using it after trying it out with raw Java and it worked fine in R.
Never one to discount advice from Hadley, I quickly grabbed the Athena ODBC driver and installed it and wired up an
dplyr connection almost instantly:
library(odbc) library(tidyverse) DBI::dbConnect( odbc::odbc(), driver = "Simba Athena ODBC Driver", Schema = "redacted", AwsRegion = "us-east-1", AuthenticationType = "Default Credentials", S3OutputLocation = "s3://aws-athena-query-results-redacted" ) -> con some_tbl <- tbl(con, "redacted")
Apologies for the redaction and lack of output but we’ve removed the default example databases from our work Athena environment and I’m not near my personal systems, so a more complete example will have to wait until later.
The TLDR is that I can now use 100%
dplyr idioms with Athena vs add one to the RJDBC driver I made for
metis package will still be around to support JDBC on systems that do have issues with ODBC and to add other methods that work with the AWS Athena API (managing Athena vs the interactive queries part).
The downside is that I’m now even more likely to run up the AWS bill ;-)
What About Drill?
I also maintain the
sergeant package which provides REST API and REST query access to Apache Drill along with a REST API
DBI driver and an RJDBC interface for Drill. I remember trying to get the MapR ODBC client working with R a few years ago so I made the package (which was also a great learning experience).
I noticed there was a very recent MapR Drill ODBC driver released. Since I was on a roll, I figured why not try it one more time, especially since the RStudio team has made it dead simple to work with ODBC from R.
library(odbc) library(tidyverse) DBI::dbConnect( odbc::odbc(), driver = "/Library/mapr/drill/lib/libdrillodbc_sbu.dylib", ConnectionType = "Zookeeper", AuthenticationType = "No Authentication", ZKCLusterID = "CLUSTERID", ZkQuorum = "HOST:2181", AdvancedProperties = "CastAnyToVarchar=true;HandshakeTimeout=30;QueryTimeout=180;TimestampTZDisplayTimezone=utc; ExcludedSchemas=sys,INFORMATION_SCHEMA;NumberOfPrefetchBuffers=5;" ) -> drill_con (employee <- tbl(drill_con, sql("SELECT * FROM cp.`employee.json`"))) ## # Source: SQL [?? x 16] ## # Database: Drill 01.13.0000[@Apache Drill Server/DRILL] ## employee_id full_name first_name last_name position_id position_title store_id ## <S3: integer> <chr> <chr> <chr> <S3: integer> <chr> <S3: inte> ## 1 1 Sheri Nowmer Sheri Nowmer 1 President 0 ## 2 2 Derrick Whe… Derrick Whelply 2 VP Country Mana… 0 ## 3 4 Michael Spe… Michael Spence 2 VP Country Mana… 0 ## 4 5 Maya Gutier… Maya Gutierrez 2 VP Country Mana… 0 ## 5 6 Roberta Dam… Roberta Damstra 3 VP Information … 0 ## 6 7 Rebecca Kan… Rebecca Kanagaki 4 VP Human Resour… 0 ## 7 8 Kim Brunner Kim Brunner 11 Store Manager 9 ## 8 9 Brenda Blum… Brenda Blumberg 11 Store Manager 21 ## 9 10 Darren Stanz Darren Stanz 5 VP Finance 0 ## 10 11 Jonathan Mu… Jonathan Murraiin 11 Store Manager 1 ## # ... with more rows, and 9 more variables: department_id <S3: integer64>, birth_date <chr>, ## # hire_date <chr>, salary <dbl>, supervisor_id <S3: integer64>, education_level <chr>, ## # marital_status <chr>, gender <chr>, management_role <chr>## count(employee, position_title, sort=TRUE) ## # Source: lazy query [?? x 2] ## # Database: Drill 01.13.0000[@Apache Drill Server/DRILL] ## # Ordered by: desc(n) ## position_title n ## <chr> <S3: integer64> ## 1 Store Temporary Checker 268 ## 2 Store Temporary Stocker 264 ## 3 Store Permanent Checker 226 ## 4 Store Permanent Stocker 222 ## 5 Store Shift Supervisor 52 ## 6 Store Permanent Butcher 32 ## 7 Store Manager 24 ## 8 Store Assistant Manager 24 ## 9 Store Information Systems 16 ## 10 HQ Finance and Accounting 8 ## # ... with more rows##
Apart from having to do that
sql(…) to make the table connection work, it was pretty painless and I had both Athena and Drill working with
dplyr verbs in under ten minutes (total).
You can head on over to the main Apache Drill site to learn all about the ODBC driver configuration parameters and I’ll be updating my ongoing Using Apache Drill with R e-book to include this information. I will also keep maintaining the existing
sergeant package but also be including some additional methods provide ODBC usage guidance and potentially other helpers if there are any “gotchas” that arise.
odbc package is super-slick and it’s refreshing to be able to use
dplyr verbs with Athena vs gosh-awful SQL. However, for some of our needs the hand-crafted queries will still be necessary as they are far more optimized than what would likely get pieced together via the
dplyr verbs. However, those queries can also be put right into
sql() with the Athena ODBC driver connection and used via the same
dplyr verb magic afterwards.
Today is, indeed, a good day to query!