RAthena 1.7.1 and
noctua 1.5.1 package versions have now been released to the CRAN. They both bring along several improvements with the connection to
AWS Athena, noticeably the performance speed and several creature comforts.
These packages have both been designed to reflect one another,even down to how they connect to
AWS Athena. This means that all features going forward will exist in both packages. I will refer to these packages as one, as they basically work in the same way.
Initially the packages utilised
AWS Athena SQL queries. This was to achieve all the functional requirements of the
DBI package framework. However the package would always send a SQL query to
AWS Athena which in turn would have to lift a flat file from
AWS S3, before returning the final result to
R. This means the performance of the packages would be limited and fairly slow compared to other data base backends.
The biggest change is the adoption of more functionality of the SDKs (software development kit) into AWS. The key component that has been adopted is
AWS Glue contains all of
AWS Athena table DDL’s. This means instead of going to
AWS Athena for this information
AWS Glue can be used instead.
AWS Glue, the table meta data (column names, column types, schema hierarchy etc…) can easily be retrieved at a fraction of the time it would of taken to query
AWS Athena. Previously the
dbListTables would send a query to
AWS Athena, this would retrieve all the tables listed in all schemas. This would take over 3 seconds. Now using
AWS Glue to retrieve the same data, it takes less than 0.5 of a second.
AWS Glue is used to collect metadata around a table in
AWS Athena, a performance in
dplyr::tbl can be done. I would like to say thanks to @OssiLehtinen for developing the initial implementation as this improvement would have been overlooked.
dplyr::tbl has two key methods when creating the initial object. The first is called SQL identifiers and this is the method that benefits from the new
AWS Glue functionality. To use SQL identifiers is fairly straight forward.
library(DBI) library(dplyr) library(RAthena) #Or library(noctua) con = dbConnect(athena()) dbWriteTable(con, "iris", iris) ident_iris = tbl(con, "iris")
dplyr can identify the
iris table within the connected schema. When a user uses the SQL identifier method in
AWS Glue is called to retrieve all the meta data for
dplyr. This increases the performance from 3.66 to 0.29 seconds. The second method is called SQL sub query. This unfortunately won’t benefit from the new feature and will run in slower at 3.66 seconds.
subquery_iris = tbl(con, sql("select * from iris"))
Therefore I recommend the use of SQL identifier method when using
AWS Athena Metadata
Due to user feature requests the packages now return more meta data around each query sent to
AWS Athena. Thus the basic level of meta data returned, is the amount of data scanned by
AWS Athena. This is formatted into a readable format depending on the amount of data scanned.
library(DBI) library(RAthena) #Or library(noctua) con = dbConnect(athena()) dbWriteTable(con, "iris", iris) dbGetQuery(con, "select * from iris") Info: (Data scanned: 860 Bytes) sepal_length sepal_width petal_length petal_width species 1: 5.1 3.5 1.4 0.2 setosa 2: 4.9 3.0 1.4 0.2 setosa 3: 4.7 3.2 1.3 0.2 setosa 4: 4.6 3.1 1.5 0.2 setosa 5: 5.0 3.6 1.4 0.2 setosa --- 146: 6.7 3.0 5.2 2.3 virginica 147: 6.3 2.5 5.0 1.9 virginica 148: 6.5 3.0 5.2 2.0 virginica 149: 6.2 3.4 5.4 2.3 virginica 150: 5.9 3.0 5.1 1.8 virginica
However if you set the new parameter
TRUE then all the metadata around that query is printed out like so:
dbGetQuery(con, "select * from iris", statistics = TRUE) $EngineExecutionTimeInMillis  1568 $DataScannedInBytes  860 $DataManifestLocation character(0) $TotalExecutionTimeInMillis  1794 $QueryQueueTimeInMillis  209 $QueryPlanningTimeInMillis  877 $ServiceProcessingTimeInMillis  17 Info: (Data scanned: 860 Bytes) sepal_length sepal_width petal_length petal_width species 1: 5.1 3.5 1.4 0.2 setosa 2: 4.9 3.0 1.4 0.2 setosa 3: 4.7 3.2 1.3 0.2 setosa 4: 4.6 3.1 1.5 0.2 setosa 5: 5.0 3.6 1.4 0.2 setosa --- 146: 6.7 3.0 5.2 2.3 virginica 147: 6.3 2.5 5.0 1.9 virginica 148: 6.5 3.0 5.2 2.0 virginica 149: 6.2 3.4 5.4 2.3 virginica 150: 5.9 3.0 5.1 1.8 virginica
This can also be retrieved by using
res = dbExecute(con, "select * from iris") # return query statistic query_stats = dbStatistics(res) # return query results dbFetch(res) # Free all resources dbClearResult(res)
RJDBC inspired function
I have to give full credit to the package
RJDBC for inspiring me to create this function.
DBI has got a good function called
dbListTables that will list all the tables that are in
AWS Athena. However it won’t return to which schema each individual table is related to. To over come this
RJDBC has a excellent function called
dbGetTables. This function returns all the tables from
AWS Athena as a
data.frame. This has the advantage of detailing schema, table and table type. With the new integration into
AWS Glue this can be returned quickly.
dbGetTables(con) Schema TableName TableType 1: default df_bigint EXTERNAL_TABLE 2: default iris EXTERNAL_TABLE 3: default mtcars2 EXTERNAL_TABLE 4: default nyc_taxi_2018 EXTERNAL_TABLE
This just makes it a little bit easier when working in different IDE’s for example
Backend option changes
This is not really a creature comfort but it still interesting and useful. Both packages are dependent on
data.table to read data into
R. This is down to the amazing speed
data.table offers when reading files into
R. However a new package, with equally impressive read speeds, has come onto the scene called
vroom has been designed to only read data into
R similarly to
data.table is still used for all of the heavy lifting. However if a user wishes to use
vroom as the file parser an
*_options function has been created to enable this:
nocuta_options(file_parser = c("data.table", "vroom")) # Or RAthena__options(file_parser = c("data.table", "vroom"))
By setting the file_parser to
vroom then the backend will change to allow
vroom's file parser to be used instead of
If you aren’t sure whether to use
data.table, I draw your attention to
vroom boasting a whopping 1.40GB/sec throughput.
Statistics taken from vroom’s github readme
Due to the ability of
AWS Glue to retrieve metadata for
AWS Athena at speed, it has now been possible to add the interface into RStudio’s connection tab. When a connection is established:
library(DBI) library(RAthena) #Or library(noctua) con = dbConnect(athena())
The connection icon will as follows:
The AWS region you are connecting to will be reflected in the connection (highlighted above in the red square). This is to help users that are able to connect to multiple different
AWS Athena over different regions.
Once you have connected
AWS Athena, schema hierarchy will be displayed. In my example you can see some of the tables I have created when testing these packages.
For more information around RStudio’s connection tab please check out RStudio preview connections.
To sum up, the
noctua latest versions have been released to cran with all the new goodies they bring. As these packages are based on AWS SDK’s they are highly customisable. Features can easily be added to improve the packages when connecting to
AWS Athena. So please raise any feature requests / bug issues to: https://github.com/DyfanJones/RAthena and https://github.com/DyfanJones/noctua