Interacting With Amazon Athena from R

December 5, 2016
By

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

This is a short post for those looking to test out Amazon Athena with R.

Amazon makes Athena available via JDBC, so you can use RJDBC to query data. All you need is their JAR file and some setup information. Here’s how to get the JAR file to the current working directory:

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.0.jar'
fil <- basename(URL)
if (!file.exists(fil)) download.file(URL, fil)

To avoid putting credentials in code, you can store the AWS key and secret you’re using for the queries in ATHENA_USER and ATHENA_PASSWORD environment variables via ~/.Renviron. You’ll also need an S3 bucket writable by those credentials for the Athena staging directory. With that info in hand, it’s easy to connect:

library(RJDBC)
library(dplyr)

drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir="s3://accessible-bucket",
                                   user=Sys.getenv("ATHENA_USER"),
                                   password=Sys.getenv("ATHENA_PASSWORD"))

Even if you have no data configured in Athena, you can check out the test data available to all:

dbListTables(con)
## [1] "elb_logs"

If that worked, then you should be able to query data (using the fully qualified table name in this case):

dbGetQuery(con, "SELECT * FROM sampledb.elb_logs LIMIT 10") %>% 
  dplyr::glimpse()
## Observations: 10
## Variables: 16
## $ timestamp              "2014-09-27T00:00:25.424956Z", "2014-09-27T00:00:56.439218Z", "2014-09-27T00:01:27.441734Z", "2014-09-27T00:01:58.366715Z", "2014-09-27T00:02:29.446363Z", "2014-09-2...
## $ elbname                "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo"
## $ requestip              "241.230.198.83", "252.26.60.51", "250.244.20.109", "247.59.58.167", "254.64.224.54", "245.195.140.77", "245.195.140.77", "243.71.49.173", "240.139.5.14", "251.192.4...
## $ requestport            27026, 27026, 27026, 27026, 27026, 27026, 27026, 27026, 27026, 27026
## $ backendip              "251.192.40.76", "249.89.116.3", "251.111.156.171", "251.139.91.156", "251.111.156.171", "254.64.224.54", "254.64.224.54", "250.244.20.109", "247.65.176.249", "250.2...
## $ backendport            443, 8888, 8888, 8888, 8000, 8888, 8888, 8888, 8888, 8888
## $ requestprocessingtime  9.1e-05, 9.4e-05, 8.4e-05, 9.7e-05, 9.1e-05, 9.3e-05, 9.4e-05, 8.3e-05, 9.0e-05, 9.0e-05
## $ backendprocessingtime  0.046598, 0.038973, 0.047054, 0.039845, 0.061461, 0.037791, 0.047035, 0.048792, 0.045724, 0.029918
## $ clientresponsetime     4.9e-05, 4.7e-05, 4.9e-05, 4.9e-05, 4.0e-05, 7.7e-05, 7.5e-05, 7.3e-05, 4.0e-05, 6.7e-05
## $ elbresponsecode        "200", "200", "200", "200", "200", "200", "200", "200", "200", "200"
## $ backendresponsecode    "200", "200", "200", "200", "200", "400", "400", "200", "200", "200"
## $ receivedbytes          0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ sentbytes              2, 2, 2, 2, 2, 2, 2, 2, 2, 2
## $ requestverb            "GET", "GET", "GET", "GET", "GET", "GET", "GET", "GET", "GET", "GET"
## $ url                    "http://www.abcxyz.com:80/jobbrowser/?format=json&state=running&user=20g578y", "http://www.abcxyz.com:80/jobbrowser/?format=json&state=running&user=20g578y", "http:/...
## $ protocol               "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1"

And, you can disconnect when done:

dbDisconnect(con)

You should probably store the JAR file in a central location and refer to it that way in “production” scripts.

Now, you can go crazy querying data and racking up AWS charges 😀.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)