R Interface to Spark
[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers.]
SparkR
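# Load SparkR from the local Spark installation (SPARK_HOME must be set)
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
# Start a local Spark session
sc <- sparkR.session(master = "local")
# Read the CSV into a SparkDataFrame, inferring column types from the data
df1 <- read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true")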
### SUMMARY TABLE WITH SQL
# Register the SparkDataFrame as a temporary view so it can be queried with SQL
createOrReplaceTempView(df1, "tbl1")
summ <- sql("select month, avg(dep_time) as avg_dep, avg(arr_time) as avg_arr from tbl1 where month in (1, 3, 5) group by month")
head(summ)
#   month  avg_dep  avg_arr
# 1     1 1347.210 1523.155
# 2     3 1359.500 1509.743
# 3     5 1351.168 1502.685
### SUMMARY TABLE WITH AGG()
grp <- groupBy(filter(df1, "month in (1, 3, 5)"), "month")
summ <- agg(grp, avg_dep = avg(df1$dep_time), avg_arr = avg(df1$arr_time))
head(summ)
#   month  avg_dep  avg_arr
# 1     1 1347.210 1523.155
# 2     3 1359.500 1509.743
# 3     5 1351.168 1502.685
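Spark does not guarantee the row order of a grouped result, so the ordering shown above should not be relied on. The following is a minimal sketch, assuming the same local Spark installation and the same nycflights13.csv file as above, that sorts the summary on the Spark side and then copies it into an ordinary R data.frame. The names summ_sorted and summ_local are introduced here for illustration only.

```r
# Assumes SPARK_HOME is set and nycflights13.csv is in the working directory
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
sparkR.session(master = "local")
df1 <- read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true")

# Same aggregation as in the AGG() example
grp <- groupBy(filter(df1, "month in (1, 3, 5)"), "month")
summ <- agg(grp, avg_dep = avg(df1$dep_time), avg_arr = avg(df1$arr_time))

# Sort explicitly so the row order no longer depends on how Spark partitions the data
summ_sorted <- arrange(summ, "month")

# collect() copies the (small) result from Spark into a local R data.frame
summ_local <- collect(summ_sorted)
print(summ_local)

# Release the Spark session when finished
sparkR.session.stop()
```

Collecting is only sensible here because the summary has three rows; a large SparkDataFrame is better left in Spark.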
sparklyr
library(sparklyr)
# Connect to a local Spark instance
sc <- spark_connect(master = "local")
# Load the CSV into Spark and register it as the table "tbl1"
df1 <- spark_read_csv(sc, name = "tbl1", path = "nycflights13.csv", header = TRUE, infer_schema = TRUE)
### SUMMARY TABLE WITH SQL
# The sparklyr connection works as a DBI connection; dbGetQuery() returns a local data.frame
library(DBI)
summ <- dbGetQuery(sc, "select month, avg(dep_time) as avg_dep, avg(arr_time) as avg_arr from tbl1 where month in (1, 3, 5) group by month")
head(summ)
#   month  avg_dep  avg_arr
# 1     5 1351.168 1502.685
# 2     1 1347.210 1523.155
# 3     3 1359.500 1509.743
### SUMMARY TABLE WITH DPLYR
library(dplyr)
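summ <- df1 %>%
  filter(month %in% c(1, 3, 5)) %>%
  group_by(month) %>%
  # na.rm = TRUE matches SQL's avg(), which skips NULLs, and silences the dbplyr warning
  summarize(avg_dep = mean(dep_time, na.rm = TRUE), avg_arr = mean(arr_time, na.rm = TRUE))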
head(summ)
#   month  avg_dep  avg_arr
#   <int>    <dbl>    <dbl>
# 1     5 1351.168 1502.685
# 2     1 1347.210 1523.155
# 3     3 1359.500 1509.743
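The dplyr pipeline is not evaluated in R: sparklyr translates it to Spark SQL and runs it in Spark. The sketch below, assuming the same local connection and the same nycflights13.csv file as above, prints the generated SQL with show_query(), adds arrange() so the row order is deterministic, and uses collect() to bring the result into R. The name summ_local is introduced here for illustration only.

```r
# Assumes a local Spark installation usable by sparklyr and nycflights13.csv in the working directory
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
df1 <- spark_read_csv(sc, name = "tbl1", path = "nycflights13.csv", header = TRUE, infer_schema = TRUE)

summ <- df1 %>%
  filter(month %in% c(1, 3, 5)) %>%
  group_by(month) %>%
  summarize(avg_dep = mean(dep_time, na.rm = TRUE), avg_arr = mean(arr_time, na.rm = TRUE)) %>%
  # Without an explicit sort, Spark may return the groups in any order
  arrange(month)

# Print the SQL that dplyr generated for Spark; nothing has been computed yet
show_query(summ)

# collect() runs the query in Spark and returns a local tibble
summ_local <- collect(summ)
print(summ_local)

# Close the connection when finished
spark_disconnect(sc)
```

Comparing the printed SQL with the hand-written query in the SQL example is a quick way to check that both approaches ask Spark for the same thing.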
To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.