by Hong Ooi, Sr. Data Scientist, Microsoft
Version 0.90 of the dplyrXdf package has just been released. dplyrXdf is a package that brings dplyr pipelines and data transformation verbs to Microsoft R Server’s xdf files. This version includes several changes, mostly to address performance and efficiency concerns, which I’ll detail below.
The .outFile argument
All dplyrXdf verbs now support a special argument, .outFile, which determines how the output data is handled. If you don’t specify a value for this argument, the data will be saved to a tbl_xdf which will be managed by dplyrXdf. This supports the default behaviour, whereby data files are automatically created and deleted inside a pipeline. There are two other options for .outFile:
If you specify .outFile = NULL, the data will be returned in memory as a data frame.
If .outFile is a character string giving a file name, the data will be saved to an xdf file at that location, and a persistent xdf data source will be returned.
This should improve the efficiency of pipelines with large datasets, by reducing the amount of I/O. Previously, to save the output of a pipeline, you had to call the persist verb at the end:
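A pipeline of that kind might look like the sketch below. It is an illustration under stated assumptions rather than a listing from the release: it assumes Microsoft R Server with RevoScaleR and dplyrXdf installed, reads the AirlineDemoSmall.xdf sample file that ships with RevoScaleR, and uses a made-up column (delayHrs) and output file name. The form of the file-name argument to persist is also assumed. The block is guarded so that it does nothing where the packages are missing.

```r
# Sketch only: needs Microsoft R Server (RevoScaleR) and dplyrXdf.
have_pkgs <- requireNamespace("RevoScaleR", quietly = TRUE) &&
    requireNamespace("dplyrXdf", quietly = TRUE)

if (have_pkgs) {
    library(RevoScaleR)
    library(dplyr)
    library(dplyrXdf)

    # Sample xdf file shipped with RevoScaleR
    airline <- RxXdfData(file.path(rxGetOption("sampleDataDir"),
                                   "AirlineDemoSmall.xdf"))

    # Two steps: mutate writes a temporary tbl_xdf, then persist copies it.
    # "airlineDelay.xdf" and delayHrs are hypothetical names; the file-name
    # argument to persist() is an assumption.
    out <- airline %>%
        mutate(delayHrs = ArrDelay / 60) %>%
        persist("airlineDelay.xdf")
}
```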
In this example, mutate would save a temporary xdf file in dplyrXdf’s working directory, and persist would then copy that file to the final output location. Now, you can save the output directly to the final location as follows:
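A minimal sketch of the direct form, under the same assumptions as any dplyrXdf example (Microsoft R Server with RevoScaleR and dplyrXdf installed, and the AirlineDemoSmall.xdf sample file available). The delayHrs column and the output file name are hypothetical; only the .outFile argument itself comes from the description above.

```r
# Sketch only: needs Microsoft R Server (RevoScaleR) and dplyrXdf.
have_pkgs <- requireNamespace("RevoScaleR", quietly = TRUE) &&
    requireNamespace("dplyrXdf", quietly = TRUE)

if (have_pkgs) {
    library(RevoScaleR)
    library(dplyr)
    library(dplyrXdf)

    # Sample xdf file shipped with RevoScaleR
    airline <- RxXdfData(file.path(rxGetOption("sampleDataDir"),
                                   "AirlineDemoSmall.xdf"))

    # One step: the verb writes its result straight to the named file.
    # "airlineDelay.xdf" and delayHrs are hypothetical names.
    out <- airline %>%
        mutate(delayHrs = ArrDelay / 60, .outFile = "airlineDelay.xdf")
}
```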
This omits a redundant file save and copy, thus speeding things up.
The persist verb remains available, for situations where you have already run a pipeline and want to save its output after the fact.
Setting the dplyrXdf working directory
By default, dplyrXdf will save the data files it creates into the R working directory. On some systems, this may be located on a drive or filesystem that is relatively small; this is rarely an issue with open source R, but can be problematic when working with large xdf files. You can now change the location of the xdf tbl directory.
Similarly, you can view the location of the current xdf tbl directory.
For best performance, you should avoid setting the xdf tbl directory to a remote location/network drive unless you have a fast network connection.
Sometimes it’s useful to be able to extract variables from an Xdf file. With a data frame, you can do this with the $ and [[ operators: for example, iris$Species and iris[["Species"]] both return the Species column (as a vector) from the iris dataset. This update to dplyrXdf implements the same functionality for Xdf files.
By default, the entire column is returned, so you should be careful using these operators when you have very large Xdf files.
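For reference, the data frame behaviour that the Xdf methods mirror can be checked in base R. The sketch below uses only the built-in iris dataset and does not touch an Xdf file; the variable names s1 and s2 are just for illustration.

```r
# Base R: both operators extract a single column as a vector
s1 <- iris$Species
s2 <- iris[["Species"]]

identical(s1, s2)   # do the two forms give the same result?
class(s1)           # Species is stored as a factor
length(s1)          # one element per row of iris
```

Because the whole column comes back in memory, the length of the result equals the number of rows, which is why the caution about very large Xdf files applies.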
The subset verb
In dplyr, subsetting data is handled by two verbs: filter for subsetting by rows, and select for subsetting by columns. This is fine for data frames, where everything runs in memory; and for SQL databases, where the hard work is done by the database. For Xdf files, however, this is suboptimal, as each verb translates into a separate I/O step where the data is read from disk, subsetted, then written out again. This can waste a lot of time with large datasets.
You can get around this by using the .rxArgs argument in a verb to pass commands directly to the underlying RevoScaleR functions. For example, filter(xdf, .rxArgs=list(varsToKeep=*)) would subset by rows, and simultaneously use the varsToKeep parameter to tell rxDataStep to subset by columns. But this is inelegant. It would be better if there were a verb that could natively subset in both dimensions, without having to rely on workarounds.
As it turns out, base R has a subset generic which (as the name says) performs subsetting on both rows and columns. You’ve probably used it with data frames:
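For instance, a minimal base R sketch using the built-in iris dataset (the choice of rows and columns is arbitrary, and irisSubset is just an illustrative name):

```r
# Rows: only the setosa flowers; columns: the two sepal measurements
irisSubset <- subset(iris, Species == "setosa", c(Sepal.Length, Sepal.Width))

head(irisSubset)
dim(irisSubset)
```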
Here, after the data itself, the first argument to subset specifies the rows, and the second argument the columns to return. The subset method for Xdf files works along the same lines:
airSubset <- subset(airline, DayOfWeek == "Monday", c(ArrDelay, CRSDepTime))
head(airSubset)
You can also use the same helper functions to choose columns as you would with select.
In addition to the above, version 0.90 includes the following changes:
The persist verb now uses base R file functions, including file.rename, to copy/move a file, which should improve performance considerably on large datasets.
The code for two-table verbs has been extensively rewritten, and should be much more reliable than before.
The documentation, including the vignettes, has been significantly revised.
Unit testing infrastructure has been added, utilising the testthat package.
Several bugs have been fixed, some found with the aid of the aforementioned unit testing.
The latest version of the dplyrXdf package is available on GitHub at the link below.