
by Hong Ooi, Sr. Data Scientist, Microsoft

Version 0.90 of the dplyrXdf package has just been released. dplyrXdf is a package that brings dplyr pipelines and data transformation verbs to Microsoft R Server’s xdf files. This version includes several changes, mostly to address performance and efficiency concerns, which I’ll detail below.

### The .outFile argument

All dplyrXdf verbs now support a special argument .outFile, which determines how the output data is handled. If you don’t specify a value for this argument, the data will be saved to a tbl_xdf which will be managed by dplyrXdf. This supports the default behaviour, whereby data files are automatically created and deleted inside a pipeline. There are two other options for .outFile:

• If you specify .outFile = NULL, the data will be returned in memory as a data frame.

• If .outFile is a character string giving a file name, the data will be saved to an xdf file at that location, and a persistent xdf data source will be returned.

This should improve the efficiency of pipelines with large datasets, by reducing the amount of I/O. Previously, to save the output of a pipeline, you had to call the persist verb at the end:

xdf %>% filter(...) %>% mutate(...) %>% persist("final/output.xdf")


In this example, mutate would save a temporary xdf file in dplyrXdf’s working directory, and persist would then copy that file to the final output location. Now, you can save the output directly to the final location as follows:

xdf %>% filter(...) %>% mutate(..., .outFile="final/output.xdf")


This omits a redundant file save and copy, thus speeding things up.
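
As a rough side-by-side of the three .outFile modes described above, here is a sketch using the AirlineDemoSmall.xdf sample file that ships with RevoScaleR. The filter condition (ArrDelay > 60), the variable names and the output file name late_flights.xdf are illustrative assumptions, and the block skips itself when RevoScaleR or dplyrXdf is not installed.

```r
# Sketch only: needs Microsoft R Server (RevoScaleR) and dplyrXdf 0.90 or later.
has_pkgs <- requireNamespace("RevoScaleR", quietly = TRUE) &&
    requireNamespace("dplyrXdf", quietly = TRUE)

if (has_pkgs) {
    library(RevoScaleR)
    library(dplyr)
    library(dplyrXdf)

    sampDir <- system.file("sampleData", package = "RevoScaleR")
    airline <- RxXdfData(file.path(sampDir, "AirlineDemoSmall.xdf"))

    # No .outFile: the result is a tbl_xdf whose file is managed by dplyrXdf
    managed <- airline %>% filter(ArrDelay > 60)

    # .outFile = NULL: the result comes back in memory as a data frame
    in_memory <- airline %>% filter(ArrDelay > 60, .outFile = NULL)

    # .outFile = a file name: the result is written straight to that xdf file
    # (hypothetical output location inside the session's temp directory)
    out_path <- file.path(tempdir(), "late_flights.xdf")
    persistent <- airline %>% filter(ArrDelay > 60, .outFile = out_path)
}
```

The in-memory mode is only sensible when the result is small enough to fit comfortably in RAM; for anything larger, one of the two file-backed modes is the safer choice.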

The persist verb remains available, for situations where you have already run a pipeline and want to save its output after the fact.

### Setting the dplyrXdf working directory

By default, dplyrXdf will save the data files it creates into the R working directory. On some systems, this may be located on a drive or filesystem that is relatively small; this is rarely an issue with open source R, but can be problematic when working with large xdf files. You can now change the location of the xdf tbl directory with the setXdfTblDir function:

# set the tbl directory to a network drive (on Windows)
setXdfTblDir("n:/Rtemp")


Similarly, you can view the location of the current xdf tbl directory with getXdfTblDir.

For best performance, you should avoid setting the xdf tbl directory to a remote location/network drive unless you have a fast network connection.

### Extraction operators

Sometimes it’s useful to be able to extract variables from an Xdf file. With a data frame, you can do this with the $ and [[ operators: for example iris$Species and iris[["Species"]] both return the Species column (as a vector) from the iris dataset. This update to dplyrXdf implements the same functionality for Xdf files:

sampDir <- system.file("sampleData", package="RevoScaleR")
airline <- RxXdfData(file.path(sampDir, "AirlineDemoSmall.xdf"))
ArrDelay <- airline$ArrDelay
head(ArrDelay)

## [1]   6  -8  -2   1  -2 -14


By default, the entire column is returned, so you should be careful using these operators when you have very large Xdf files.
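
For reference, the data frame behaviour that these operators mirror can be checked with base R alone. The sketch below uses the built-in iris dataset rather than an Xdf file, so it runs without RevoScaleR; the variable names are illustrative.

```r
# Both operators return the same in-memory vector for a data frame column
by_dollar  <- iris$Species
by_bracket <- iris[["Species"]]
identical(by_dollar, by_bracket)

# [[ also accepts a column name held in a variable, which $ does not
col <- "Sepal.Length"
sepal <- iris[[col]]

# The whole column comes back: one element per row of the data
length(sepal) == nrow(iris)
```

The last check is the reason for caution with large Xdf files: the length of the returned vector scales with the number of rows.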

### The subset verb

In dplyr, subsetting data is handled by two verbs: filter for subsetting by rows, and select for subsetting by columns. This is fine for data frames, where everything runs in memory; and for SQL databases, where the hard work is done by the database. For Xdf files, however, this is suboptimal, as each verb translates into a separate I/O step where the data is read from disk, subsetted, then written out again. This can waste a lot of time with large datasets.

You can get around this by using the .rxArgs argument in a verb to pass commands directly to the underlying RevoScaleR functions. For example, filter(xdf, ..., .rxArgs=list(varsToKeep=...)) would subset by rows, and simultaneously use the varsToKeep parameter to tell rxDataStep to subset by columns. But this is inelegant. It would be better if there were a verb that could natively subset in both dimensions, without having to rely on workarounds.

As it turns out, base R has a subset generic which (as the name says) performs subsetting on both rows and columns. You’ve probably used it with data frames:

subset(iris, Species == "setosa", c(Sepal.Length, Sepal.Width))
## Source: local data frame [50 x 2]
##
##    Sepal.Length Sepal.Width
##           (dbl)       (dbl)
## 1           5.1         3.5
## 2           4.9         3.0
## 3           4.7         3.2
## 4           4.6         3.1
## 5           5.0         3.6
## 6           5.4         3.9
## ..          ...         ...


Here, the first argument to subset specifies the rows, and the second argument the columns to return. The subset method for Xdf files works along the same lines:

airSubset <- subset(airline, DayOfWeek == "Monday", c(ArrDelay, CRSDepTime))
head(airSubset)
## Source: local data frame [6 x 2]
##
##   ArrDelay CRSDepTime
##      (int)      (dbl)
## 1        6   9.666666
## 2       -8  19.916666
## 3       -2  13.750000
## 4        1  11.750000
## 5       -2   6.416667
## 6      -14  13.833333


You can also use the same helper functions to choose columns as you would with select:

airSubset2 <- subset(airline, , starts_with("A"))
names(airSubset2)
## [1] "ArrDelay"
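
To see why a single verb helps, the sketch below uses an ordinary data frame to compare one subset call with the two-step approach (rows first, then columns). In memory the results are the same and the cost difference is negligible; with an Xdf file, each step of the two-step version would correspond to a separate pass over the data on disk. The variable names are illustrative.

```r
# One call: rows and columns are chosen together
one_call <- subset(iris, Species == "setosa", c(Sepal.Length, Sepal.Width))

# Two steps: the intermediate result still carries every column
rows_only <- iris[iris$Species == "setosa", ]
two_step  <- rows_only[, c("Sepal.Length", "Sepal.Width")]

# Same data either way
all.equal(one_call, two_step)

# The intermediate object is wider than the final result
ncol(rows_only)
ncol(two_step)
```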


### Other changes

In addition to the above, version 0.90 includes the following changes:

• The persist verb now uses the base R functions file.copy and file.rename to copy/move a file, which should improve performance considerably on large datasets.

• The code for two-table verbs has been extensively rewritten, and should be much more reliable than before.

• The documentation, including the vignettes, has been significantly revised.

• Unit testing infrastructure has been added, utilising the testthat package.

• Several bugs have been fixed, some found with the aid of the aforementioned unit testing.
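
The copy/move distinction mentioned for persist can be illustrated with base R alone. This is a sketch of the two base functions, not dplyrXdf's internal code, and the .xdf files here are plain-text stand-ins created in the session's temporary directory.

```r
# Create a small stand-in file (not a real xdf file)
src <- tempfile(fileext = ".xdf")
writeLines("placeholder", src)

copied <- tempfile(fileext = ".xdf")
moved  <- tempfile(fileext = ".xdf")

# file.copy duplicates the file; the source stays where it was
copy_ok <- file.copy(src, copied)

# file.rename relocates the file; the source path no longer exists afterwards
move_ok <- file.rename(src, moved)

file.exists(c(src, copied, moved))
## [1] FALSE  TRUE  TRUE
```

A rename within the same filesystem typically does not rewrite the file's contents, which is why moving can be much cheaper than copying for large files.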