Currently there are two key ways in connecting to Amazon Athena from R, using the ODBC and JDBC drivers. To access the ODBC driver R users can use the excellent odbc package supported by Rstudio. To access the JDBC driver R users can either use the RJDBC R package or the helpful wrapper package AWR.Athena which wraps the
RJDBC package to make the connection to Amazon Athena through the JDBC driver simpler. These methods are an excellent way for R to connect to Amazon Athena, however is there another way?
Well glad you asked…yes there is! Ever since the reticulate package was developed (by Rstudio) the interface into Python from R has never been simpler. This makes another route into Athena possible! Amazon has developed a Python software development kit (SDK) called Boto3. By using
boto3 in combination with the R package
reticulate a new method into accessing Athena can be made possible. Introducing the R package RAthena.
- What is RAthena?
Rathenais a R package that creates a DBI (Database Interface) for R, using the R package DBI and the Python package
Boto3as the backend.
- Why was RAthena created when there are already methods for connecting to Athena?
RAthenawas created to provide an extra method to connect to Athena for R users. Nothing more, nothing less.
- Why is RAthena call RAthena?
Isn’t it obvious? Most R packages that interface with a database are called
RPostgreSQL, etc… Plus this package is “roughly” the R equivalent to the superb
Pythonpackage PyAthena. So calling this package
RAthenaseems like the best choice.
Now lets get into how to actually use
RAthena. I am going to skip over the part were you have to set up an Amazon Web Services Account (AWS Acount) and get straight into the good stuff.
Now we have
Python 3+ we now need to install
Boto3. If you installed
Python 3+ with the Anaconda Distribution I believe
Boto3 comes as standard (you can skip
boto3 installation step), but for everyone else you can install
Boto3 either by the
pip command or the inbuilt installation function in
pip install boto3
RAthena boto3 installation function:
Connecting to Athena
Now we have everything that is required we are now ready to connect to
RAthena provides several method to connect to
Athena ranging from hard-coding credentials to using Amazon Resource Name Roles (ARN roles).
This method isn’t recommended as your credentials are hard-coded.
library(DBI) con <- dbConnect(RAthena::athena(), aws_access_key_id = "
", aws_secret_access_key = " ", s3_staging_dir = " ")
s3_staging_dir requires to be in the format of
s3 uri for example “s3://path/to/query/bucket/”
Environment Variable Method:
RAthena supports AWS credentials set into the environment variables to avoid hard-coding. From what I have found out an easy way to set up environment variables (that persists) in R is to use the
file.edit function like so:
And now you can simply add in your environment variables in the file you are editing for example:
Once you have set your environment variables you can connect to Athena in the following method:
library(DBI) con <- dbConnect(RAthena::athena(), s3_staging_dir = "
AWS Profile Names:
Another method is to use AWS Profile Names. AWS profile names can be setup either manually in the
~/.aws directory or by using the AWS Command Line Interface (AWS CLI). Once you have setup your profile name you can connect to
Using Default Profile Name:
library(DBI) con <- dbConnect(RAthena::athena(), s3_staging_dir = "
Using Non-Default Profile Name:
library(DBI) con <- dbConnect(RAthena::athena(), profile_name = "rathena", s3_staging_dir = "
ARN roles are fairly useful if you need to assume a role that can connect to another AWS account and use the
Athena in that account. Or whether you want to create a temporary connection with different permissions than your current role (AWS ARN Documentation).
Assuming ARN role credentials before connecting to Athena:
library(RAthena) library(DBI) assume_role(profile_name = "YOUR_PROFILE_NAME", role_arn = "arn:aws:sts::123456789012:assumed-role/role_name/role_session_name", set_env = TRUE) # Connect to Athena using ARN Role con <- dbConnect(athena(), s3_staging_dir = "s3://path/to/query/bucket/")
Connect to Athena directly using ARN role:
library(DBI) con <- dbConnect(athena(), profile_name = "YOUR_PROFILE_NAME", role_arn = "arn:aws:sts::123456789012:assumed-role/role_name/role_session_name", s3_staging_dir = 's3://path/to/query/bucket/')
Note: ARN Roles have a duration timer before they will expire. To change the default you can increase the
duration_seconds parameter from the default 3600 seconds (1 hour).
Finally you can create temporary credentials before connecting to
library(RAthena) library(DBI) # Create Temporary Credentials duration 1 hour get_session_token("YOUR_PROFILE_NAME", serial_number='arn:aws:iam::123456789012:mfa/user', token_code = "531602", set_env = TRUE) # Connect to Athena using temporary credentials con <- dbConnect(athena(), s3_staging_dir = "s3://path/to/query/bucket/")
Note: This method will work for users who have set up Multi-Factor Authentication (MFA).
Now we have created a connection to
Athena we can ulitise
DBI methods to query
Athena for example:
All available tables in Athena:
Send Query to Athena
res <- dbSendQuery(con, "SELECT * FROM INFORMATION_SCHEMA.COLUMNS") dbFetch(res) dbClearResult(res)
res <- dbExecute(con, "SELECT * FROM INFORMATION_SCHEMA.COLUMNS") dbFetch(res) dbClearResult(res)
dbGetQuery wraps sending, fetching and clearing results in one easy step.
dbGetQuery(con, "SELECT * FROM INFORMATION_SCHEMA.COLUMNS")
Note: You might of noticed that if you have
RAthena will attempt to return the data as a
data.table. This is to improve speed when larger queries are returned from
Athena. If you don’t have
RAthena will return the output as a
Get Column information
res <- dbSendQuery(con, "SELECT * FROM INFORMATION_SCHEMA.COLUMNS") dbColumnInfo(res) dbClearResult(res)
To learn about what
DBI methods have been implemented in
RAthena please refer to: link.
It is all very well querying data from
Athena but what is more useful is to upload data as well.
RAthena has addressed this and implemented a method in
dbWriteTable(con, "mtcars", mtcars, partition=c("TIMESTAMP" = format(Sys.Date(), "%Y%m%d")), s3.location = "s3://mybucket/data/")
Once you have uploaded data into
Athena you can query it in the following:
dbGetQuery(con, "select * from mtcars")
Here are all variable parameters for the
conn: An AthenaConnection object, produced by dbConnect()
name: A character string specifying a table name. Names will be automatically quoted so you can use any sequence of characters, not just any valid bare table name.
value: A data.frame to write to the database.
overwrite: Allow overwriting the destination table. Cannot be ‘TRUE’ if ‘append’ is also ‘TRUE’.
append: Allow appending to the destination table. Cannot be ‘TRUE’ if ‘overwrite’ is also ‘TRUE’.
row.names: Either TRUE, FALSE, NA or a string. If TRUE, always translate row names to a column called “row_names”. If FALSE, never translate row names. If NA, translate rownames only if they’re a character vector. A string is equivalent to TRUE, but allows you to override the default name. For backward compatibility, NULL is equivalent to FALSE.
field.types: Additional field types used to override derived types.
partition: Partition Athena table (needs to be a named list or vector) for example: c(var1 = “2019-20-13”)
s3.location s3 bucket to store Athena table, must be set as a s3 uri for example (“s3://mybucket/data/“)
file.type: What file type to store data.frame on s3, RAthena currently supports [“csv”, “tsv”, “parquet”]. Note: file.type “parquet” is supported by R package
arrowand will need to be installed separately if you wish to upload data.frames in “parquet” format.
…: Other arguments used by individual methods.
RAthena can integrate with the famous R package dplyr.
library(DBI) library(dplyr) con <- dbConnect(RAthena::athena(), profile_name = "rathena", s3_staging_dir = "
") tbl(con, sql("SELECT * FROM INFORMATION_SCHEMA.COLUMNS"))
Or if you have already uploaded a table into
So hopefully this has given you an insight into the up coming package
RAthena and it’s usefulness. This package is not meant to replace any of the other packages that connect into
Athena but give another route into
Athena for R users.