R and Athena update
[This article was first published on Dyfan Jones Brain Dump HQ, and kindly contributed to R-bloggers.]
Intro:
As it has been a while since RAthena and noctua updates were announced, I thought I would get them all out of the way now. This blog covers the key new features added from version 1.9.0 to 2.3.0.
New Features:
- Big integers: Big integers from AWS Athena can now be returned to R in any of the following supported data types: [integer64, integer, numeric, character].
- Extra AWS Athena data types: Added support for the AWS Athena data types [array, row, map, json, binary, ipaddress]. The conversion methods for json and binary (and the bigint setting) can be changed through RAthena_options() or when creating the connection:
library(DBI)
library(RAthena)

# default conversion methods
con <- dbConnect(RAthena::athena())

# change json conversion method
RAthena_options(json = "character")
RAthena:::athena_option_env$json
# [1] "character"

# change json conversion to custom method
RAthena_options(json = jsonify::from_json)
RAthena:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
#     json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>

# change bigint conversion without affecting custom json conversion methods
RAthena_options(bigint = "numeric")
RAthena:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
#     json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>
RAthena:::athena_option_env$bigint
# [1] "numeric"

# change binary conversion without affecting bigint or json methods
RAthena_options(binary = "character")
RAthena:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
#     json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>
RAthena:::athena_option_env$bigint
# [1] "numeric"
RAthena:::athena_option_env$binary
# [1] "character"

# no conversion for json objects
con2 <- dbConnect(RAthena::athena(), json = "character")

# use custom json parser
con <- dbConnect(RAthena::athena(), json = jsonify::from_json)
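The big integer conversion can also be picked at connection time. The snippet below is only a minimal sketch: it assumes dbConnect() accepts a bigint argument in the same way as the json argument shown above, and the table and column names are purely illustrative.

library(DBI)

# Assumption: dbConnect() exposes a bigint argument (mirroring the json
# argument above) to choose how Athena bigint columns come back to R.
con <- dbConnect(RAthena::athena(), bigint = "integer64")

# Hypothetical query against a table with a bigint column; with the setting
# above it would be returned as a bit64::integer64 vector.
# dbGetQuery(con, "SELECT big_id FROM my_database.my_table LIMIT 5")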
- RStudio connection tab: The RStudio connection tab is now optional, which speeds up connecting when users are working with large Data Lakes.
- Time zone: Added support for the AWS Athena timestamp with time zone data type (a small sketch follows the sqlData example below).
- R list: The data type list is now properly supported when converting data to AWS Athena SQL format:
library(data.table)
library(DBI)

x = 5
dt = data.table(
  var1 = sample(LETTERS, size = x, T),
  var2 = rep(list(list("var3" = 1:3, "var4" = list("var5" = letters[1:5]))), x)
)

con <- dbConnect(noctua::athena())

#> Version: 2.2.0
sqlData(con, dt)
# Registered S3 method overwritten by 'jsonify':
#   method     from
#   print.json jsonlite
# Info: Special characters "\t" has been converted to " " to help with Athena reading file format tsv
#    var1                                                    var2
# 1:    1 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 2:    2 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 3:    3 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 4:    4 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 5:    5 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}

#> Version: 2.1.0
sqlData(con, dt)
# Info: Special characters "\t" has been converted to " " to help with Athena reading file format tsv
#    var1                                        var2
# 1:    1 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 2:    2 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 3:    3 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 4:    4 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 5:    5 1:3|list(var5 = c("a", "b", "c", "d", "e"))
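And a quick sketch for the timestamp with time zone support mentioned above. It only assumes standard Athena (Presto) SQL syntax for a timestamp with time zone literal; the query and alias are made up for illustration.

library(DBI)

con <- dbConnect(noctua::athena())

# Athena (Presto) timestamp with time zone literal; with this update the
# column should be read back into R like any other timestamp column.
dbGetQuery(con, "SELECT timestamp '2012-10-31 08:00 UTC' AS ts_tz")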
- AWS Athena UNLOAD: Added support for AWS Athena UNLOAD, to take advantage of the read/write speed of parquet.
Set up an AWS Athena table (example taken from the AWS Data Wrangler: Amazon Athena Tutorial):
import awswrangler as wr
import getpass

bucket = getpass.getpass()
path = f"s3://{bucket}/data/"

if "awswrangler_test" not in wr.catalog.databases().values:
    wr.catalog.create_database("awswrangler_test")

cols = ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]

# Read 10 files from the 1890 decade (~1GB)
df = wr.s3.read_csv(
    path="s3://noaa-ghcn-pds/csv/189",
    names=cols,
    parse_dates=["dt", "obs_time"])

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    database="awswrangler_test",
    table="noaa"
)

wr.catalog.table(database="awswrangler_test", table="noaa")
Benchmark the unload method using noctua.
library(DBI)

con <- dbConnect(noctua::athena())

# Query ran using CSV output
system.time({
  df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa")
})
# Info: (Data scanned: 80.88 MB)
#    user  system elapsed
#  57.004   8.430 160.567

noctua::noctua_options(cache_size = 1)

# Query ran using UNLOAD Parquet output
system.time({
  df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa", unload = T)
})
# Info: (Data scanned: 80.88 MB)
#    user  system elapsed
#  21.622   2.350  39.232

# Query ran using cache
system.time({
  df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa", unload = T)
})
# Info: (Data scanned: 80.88 MB)
#    user  system elapsed
#  13.738   1.886  11.029
Note: Benchmark ran on an AWS SageMaker ml.t3.xlarge instance.
Finally:
If there are any new features you would like, or any bugs you come across, please raise a ticket on: noctua & RAthena