R Interface to Spark
This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers.
SparkR
library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc <- sparkR.session(master = "local")
df1 <- read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true")

### SUMMARY TABLE WITH SQL
createOrReplaceTempView(df1, "tbl1")
summ <- sql("select month, avg(dep_time) as avg_dep, avg(arr_time) as avg_arr from tbl1 where month in (1, 3, 5) group by month")
head(summ)
#   month  avg_dep  avg_arr
# 1     1 1347.210 1523.155
# 2     3 1359.500 1509.743
# 3     5 1351.168 1502.685

### SUMMARY TABLE WITH AGG()
grp <- groupBy(filter(df1, "month in (1, 3, 5)"), "month")
summ <- agg(grp, avg_dep = avg(df1$dep_time), avg_arr = avg(df1$arr_time))
head(summ)
#   month  avg_dep  avg_arr
# 1     1 1347.210 1523.155
# 2     3 1359.500 1509.743
# 3     5 1351.168 1502.685
sparklyr
library(sparklyr)
sc <- spark_connect(master = "local")
df1 <- spark_read_csv(sc, name = "tbl1", path = "nycflights13.csv", header = TRUE, infer_schema = TRUE)

### SUMMARY TABLE WITH SQL
library(DBI)
summ <- dbGetQuery(sc, "select month, avg(dep_time) as avg_dep, avg(arr_time) as avg_arr from tbl1 where month in (1, 3, 5) group by month")
head(summ)
#   month  avg_dep  avg_arr
# 1     5 1351.168 1502.685
# 2     1 1347.210 1523.155
# 3     3 1359.500 1509.743

### SUMMARY TABLE WITH DPLYR
library(dplyr)
summ <- df1 %>%
  filter(month %in% c(1, 3, 5)) %>%
  group_by(month) %>%
  summarize(avg_dep = mean(dep_time), avg_arr = mean(arr_time))
head(summ)
#   month  avg_dep  avg_arr
#   <int>    <dbl>    <dbl>
# 1     5 1351.168 1502.685
# 2     1 1347.210 1523.155
# 3     3 1359.500 1509.743
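Both Spark versions compute their averages with SQL-style avg(), which skips NULLs, whereas mean() in plain R returns NA unless na.rm = TRUE is set. As a rough local cross-check of what the queries compute, here is a minimal base R sketch on a small made-up data frame. The name toy and all of its values are hypothetical and are not taken from nycflights13.

```r
# Hypothetical toy data; the values are made up for illustration only
toy <- data.frame(
  month    = c(1, 1, 3, 3, 5, 5, 7),
  dep_time = c(600, 800, 900, NA, 1200, 1400, 1500),
  arr_time = c(700, 900, 1000, 1100, NA, 1500, 1600)
)

# Keep the same months as the Spark queries
sub <- toy[toy$month %in% c(1, 3, 5), ]

# Per-month means; na.rm = TRUE mimics how SQL avg() skips NULLs
local_summ <- data.frame(
  month   = sort(unique(sub$month)),
  avg_dep = as.numeric(tapply(sub$dep_time, sub$month, mean, na.rm = TRUE)),
  avg_arr = as.numeric(tapply(sub$arr_time, sub$month, mean, na.rm = TRUE))
)
print(local_summ)
```

Without na.rm = TRUE, the groups containing a missing value (month 3 for avg_dep and month 5 for avg_arr) would come back as NA, which is the main difference to keep in mind when comparing a local result with the Spark output.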