I’ve blogged about how to use Amazon Athena with R before, and if you are a regular Athena user, you’ve likely run into a situation where you prepare a dplyr chain, fire off a collect(), and then wait. Queries that take significant processing time or have large result sets do not play nicely with the provided ODBC and JDBC drivers. This means “hung” R sessions and severe frustration, especially when you can log in to the AWS Athena console and see that the results are right there!
I’ve been crafting SQL by hand or using sql_render() to avoid this (when I remember to), but finally felt sufficient frustration to craft a better way, provided you can install and run rJava-based code (it’s 2018 and that is still not an easy given on many systems, unfortunately).
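For reference, the sql_render() escape hatch looks something like this. This is a minimal sketch using dbplyr’s simulated connections so it runs without a live Athena connection; the events table and its columns are made up for illustration:

```r
library(dplyr)
library(dbplyr)

# Simulated lazy table standing in for tbl(con, "events");
# the table name and columns here are hypothetical
events <- lazy_frame(user = character(), ts = numeric(), con = simulate_odbc())

q <- events %>%
  filter(ts >= 1514764800) %>%
  count(user)

# Render the SQL the chain would execute, instead of collect()ing
# through a driver that may hang on long-running queries
sql_render(q)
```

You can then paste the rendered SQL into the Athena console (or submit it via the SDK) yourself.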
There are two functions below. The collect_async() function is designed to be used like collect(), but uses Athena components from the AWS SDK for Java to execute the SQL query behind the dplyr chain asynchronously. The companion function gather_results() takes the object created by collect_async() and checks to see if the results are ready. If they are, it will use the aws.s3 package to download them. Personally, I’d just aws s3 sync ... from the command line rather than use the aws.s3 package, but that’s not everyone’s cup of tea.
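If you do stay in R, grabbing a single result file with aws.s3 looks roughly like this. Athena writes query results as a CSV named after the query execution ID under the configured S3 output location; the bucket name and execution ID below are placeholders:

```r
library(aws.s3)

# Hypothetical values: your Athena results bucket and a query
# execution ID (collect_async() below returns both)
results_bucket <- "aws-athena-query-results-123456789012-us-east-1"
qex_id <- "00000000-0000-0000-0000-000000000000"

# Athena stores results as <query-execution-id>.csv in the
# results bucket; download it locally
save_object(
  object = sprintf("%s.csv", qex_id),
  bucket = results_bucket,
  file = "athena-results.csv"
)
```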
Once I figure out the best package API for this, I’ll add it to the metis package. There are many AWS idiosyncrasies that need to be accounted for, and I’d rather ship this current set of functions via the blog so folks can use it (and tweak it to their needs) than wait for perfection.
Here’s the code:
library(rJava)
library(awsjavasdk)
library(aws.signature)
library(aws.s3)
library(odbc)
library(tidyverse)
library(dbplyr)

#' Collect Amazon Athena query results asynchronously
#'
#' Long running Athena queries and Athena queries with large result
#' sets can seriously stall a `dplyr` processing chain due to poorly
#' implemented ODBC and JDBC drivers. The AWS SDK for Athena has
#' methods that support submitting a query asynchronously for "batch"
#' processing. All Athena results are stored in CSV files in S3 and it's
#' easy to use the R `aws.s3` package to grab these or perform an
#' `aws s3 sync ...` operation on the command line.
#'
#' @md
#' @param obj the `dplyr` chain
#' @param schema Athena schema (usually matches the `Schema` parameter to the
#'   Simba ODBC connection)
#' @param region Your AWS region. All lower case with dashes (usually matches
#'   the `AwsRegion` parameter to the Simba ODBC connection)
#' @param results_bucket the S3 results bucket where query results are stored
#'   (usually matches the `S3OutputLocation` parameter to the Simba ODBC
#'   connection)
#' @return a `list` with the query execution ID and the S3 bucket. This object
#'   is designed to be passed to the companion `gather_results()` if you
#'   want to use the `aws.s3` package to retrieve the results. Otherwise,
#'   sync the file however you want using the query execution id.
#' @note You may need to change up the authentication provider depending on how
#'   you use credentials with Athena
collect_async
Now, we give it a go:
# Setup the credentials you're using
use_credentials("personal")

# load the AWS Java SDK classes
awsjavasdk::load_sdk()

# necessary for Simba ODBC and the async query ops
aws_region
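Putting the two functions together, the intended workflow looks roughly like the sketch below. The table, schema, and bucket names are placeholders, and I’m assuming gather_results() returns NULL until the query has finished (adjust the polling check to whatever the function actually returns in your tweaked version):

```r
# Hypothetical Athena table and connection; adjust to your setup
mtcars_db <- tbl(con, "sampledb.mtcars")

async_res <- mtcars_db %>%
  filter(cyl == 8) %>%
  collect_async(
    schema = "sampledb",
    region = "us-east-1",
    results_bucket = "s3://aws-athena-query-results-123456789012-us-east-1"
  )

# Poll until the results CSV shows up in S3, then read it in
# (assumes gather_results() returns NULL while the query is running)
res <- gather_results(async_res)
while (is.null(res)) {
  Sys.sleep(5)
  res <- gather_results(async_res)
}
```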
If you do try this out and end up needing to tweak it, feedback on what you had to do (via the comments) would be greatly appreciated.