Working with big SAS datasets using R and sparklyr

[This article was first published on R – Paolo Eusebi, and kindly contributed to R-bloggers.]

By default, R loads all data into memory, whereas SAS processes data from disk and holds only part of it in memory at any time. This makes SAS better suited to very large datasets that do not fit in RAM.
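
To get a feel for the in-memory cost on the R side, here is a rough sketch in base R. The data frame is made up for illustration (one million rows, two columns) and is not part of the workflow described in this post.

```r
# Hypothetical example data: one million rows, an integer and a numeric column
n <- 1e6
example_df <- data.frame(id = seq_len(n), value = rnorm(n))

# object.size() reports how much memory the object occupies in the R session
size_mb <- as.numeric(object.size(example_df)) / 1024^2
print(format(object.size(example_df), units = "MB"))
```

Scaling this figure up to the number of rows and columns of a real SAS file gives a first estimate of whether it would fit in memory at all.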

I often need to work with large SAS data files that are prepared in the information system of my department. However, I always try to fit everything into my R workflow, because I like to manipulate data with dplyr and to perform statistical analysis with the packages available in R.

For this purpose, I found sparklyr to be the perfect solution.

First of all, we need to load the packages and install a local copy of Spark.

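# Load sparklyr, the SAS reader extension for Spark, and dplyr
library(sparklyr)
library(spark.sas7bdat)
library(dplyr)
# Download and install a local copy of Spark (needed only once)
spark_install(version = "2.0.1", hadoop_version = "2.7")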
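
Since the Spark download is large, it can be worth checking what is already installed before calling spark_install again. A minimal sketch, assuming sparklyr is installed and that spark_installed_versions() reports the versions in a column named spark; the version string simply mirrors the one used above.

```r
library(sparklyr)

# List the Spark versions that sparklyr has already installed locally
installed <- spark_installed_versions()
print(installed)

# Install Spark 2.0.1 only if it is not there yet (same version as above)
wanted <- "2.0.1"
if (!(wanted %in% installed$spark)) {
  spark_install(version = wanted, hadoop_version = "2.7")
}
```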

Then I connect to a local instance of the installed Spark:

sc <- spark_connect(master = "local")
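
To confirm that the connection works, and to close it cleanly at the end of a session, something along these lines should do. This is a sketch that assumes a local Spark installation is available; disconnect only once all the Spark work is done.

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# TRUE when the connection to the local Spark instance is alive
is_open <- spark_connection_is_open(sc)
print(is_open)

# Report the version of Spark behind the connection
print(spark_version(sc))

# ... work with Spark here ...

# Close the connection when finished to release the local Spark process
spark_disconnect(sc)
```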

Finally, it is possible to read the SAS files, manipulate them via dplyr, and store the result in R's memory via the collect command.

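# The file path and table name are placeholders (not given here): replace them with your own
df <- spark_read_sas(sc, path = "<path to your .sas7bdat file>", table = "<table name>")
df_manipulated <- df %>%
  select()  # name the columns to keep inside select()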

df_manipulated_r <- collect(df_manipulated)

The command spark_read_sas returns an object of class tbl_spark, which is a reference to a Spark data frame on which dplyr functions can be executed.
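
A self-contained way to try this out is sketched below. It assumes that the spark.sas7bdat package ships a small sample file called iris.sas7bdat (as its README suggests) and that a local Spark installation is available; the table name "sas_example" is an arbitrary label. Note that spark.sas7bdat is loaded before connecting, which, as far as I understand sparklyr extensions, is what makes the SAS reader available to the Spark session.

```r
library(sparklyr)
library(spark.sas7bdat)
library(dplyr)

sc <- spark_connect(master = "local")

# Assumption: the package ships a small sample file, iris.sas7bdat
sas_file <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")

# "sas_example" is an arbitrary name for the table registered in Spark
df <- spark_read_sas(sc, path = sas_file, table = "sas_example")

# The result is a reference to a Spark data frame, not a local data frame
print(class(df))
is_spark_tbl <- inherits(df, "tbl_spark")

# dplyr verbs run inside Spark; only the first rows are fetched for printing
df %>% head(5)

spark_disconnect(sc)
```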

The collect function runs the dplyr query on the remote Spark source and returns the manipulated result as a local data frame, which is then stored in R's memory.
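
The difference between the lazy Spark reference and the collected result can be sketched as follows. To keep the example independent of any SAS file, it copies the built-in mtcars data to Spark as a stand-in; the table name "mtcars_example" and the chosen columns are purely illustrative.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Hypothetical stand-in for a table read from SAS: copy mtcars to Spark
cars_tbl <- copy_to(sc, mtcars, name = "mtcars_example", overwrite = TRUE)

# Nothing is pulled into R yet; this only describes the query
cars_manipulated <- cars_tbl %>%
  filter(cyl == 4) %>%
  select(mpg, cyl, wt)

# Show the SQL that Spark will run for this pipeline
show_query(cars_manipulated)

# collect() executes the query and returns a local data frame (a tibble)
cars_local <- collect(cars_manipulated)
print(class(cars_local))
print(nrow(cars_local))

spark_disconnect(sc)
```

Filtering and selecting before collect keeps the heavy lifting in Spark, so that only the reduced result has to fit in R's memory.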

This local data frame is the object on which to perform the data analysis and visualization steps.

Here are some resources:

Big data in R

Importing 30GB of data into R with sparklyr

github.com/bnosac/spark.sas7bdat

sparklyr: R interface for Apache Spark
