Alleviating AWS Athena Aggravation with Asynchronous Assistance

[This article was first published on R –, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve blogged about how to use Amazon Athena with R before and if you are a regular Athena user, you’ve likely run into a situation where you prepare a dplyr chain, fire off a collect() and then wait.

And, wait.

And, wait.

And, wait.

Queries that take significant processing time or have large result sets do not play nicely with the provided ODBC and JDBC drivers. This means “hung” R sessions and severe frustration, especially when you can login to the AWS Athena console and see that the results are right there!!

I’ve been crafting SQL by hand or using sql_render() by hand to avoid this (when I remember to) but finally felt sufficient frustration to craft a better way, provided you can install and run rJava-based code (it’s 2018 and that still is not an easy given on many systems unfortunately).

There are two functions below:

  • collect_async(), and
  • gather_results()

The collect_async() function is designed to be used like collect() but uses Athena components from the AWS SDK for Java to execute the SQL query behind the dplyr chain asynchronously. The companion function gather_results() takes the object created by collect_async() and checks to see if the results are ready. If if they are, it will use the aws.s3 package to download them. Personally, I’d just aws s3 sync ... from the command line vs use the aws.s3 package but that’s not everyone’s cup of tea.

Once I figure out the best package API for this I’ll add it to the metis package. There are many AWS idiosyncrasies that need to be accounted for and I’d rather ship this current set of functions via the blog so folks can use it (and tweak it to their needs) before waiting for perfection.

Here’s the code:


#' Collect Amazon Athena query results asynchronously
#' Long running Athena queries and Athena queries with large result
#' sets can seriously stall a `dplyr` processing chain due to poorly
#' implemented ODBC and JDBC drivers. The AWS SDK for Athena has 
#' methods that support submitting a query asynchronously for "batch"
#' processing. All Athena resutls are stored in CSV files in S3 and it's
#' easy to use the R `aws.s3` package to grab these or perform an
#' `aws s3 sync ...` operation on the command line.
#' @md
#' @param obj the `dplyr` chain
#' @param schema Athena schema (usually matches the `Schema` parameter to the 
#'        Simba ODBC connection)
#' @param region Your AWS region. All lower case with dashes (usually matches
#'        the `AwsRegion` parameter to the Simba ODBC connection)
#' @param results_bucket the S3 results bucket where query results are stored 
#'        (usually matches the `S3OutputLocation` parameter to the Simba ODBC
#'        connection)
#' @return a `list` with the query execution ID and the S3 bucket. This object
#'         is designed to be passed to the companion `gather_results()` if you
#'         want to use the `aws.s3` package to retrieve the results. Otherwise,
#'         sync the file however you want using the query execution id.
#' @note You may need to change up the authentication provider depending on how 
#'       you use credentials with Athena

Now, we give it a go:

# Setup the credentials you're using

# load the AWS Java SDK classes

# necessary for Simba ODBC and the async query ops

If you do try this out and end up needing to tweak it, feedback on what you had to do (via the comments) would be greatly appreciated.

To leave a comment for the author, please follow the link and comment on their blog: R – offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)