by Hong Ooi, Sr. Data Scientist, Microsoft
I’m pleased to announce the release of version 0.62 of the dplyrXdf package, a backend to dplyr that allows the use of pipeline syntax with Microsoft R Server’s Xdf files. This update adds a new verb (persist), fills some holes in support for dplyr verbs, and fixes various bugs.
A side-effect of dplyrXdf handling file management is that passing the output from one pipeline into subsequent pipelines can have unexpected results. Consider the following example:
# pipeline 1
output1 <- flightsXdf %>%
    mutate(delay=(arr_delay + dep_delay)/2)

# use the output from pipeline 1
output2 <- output1 %>%
    group_by(carrier) %>%
    summarise(delay=mean(delay))

# reuse the output from pipeline 1 -- WRONG
output3 <- output1 %>%
    group_by(dest) %>%
    summarise(delay=mean(delay))
The problem with this code is that the second pipeline will overwrite or delete its input, so the third pipeline will fail. This is consistent with dplyrXdf’s philosophy of only saving the most recent output of a pipeline, where a pipeline is defined as all operations starting from a raw xdf file. However, in this case it isn’t what’s desired.
Similarly, dplyrXdf stores its output files in R’s temporary directory, so when you close your R session, these files will be deleted. This saves you having to manually delete files that are no longer in use, but it means that you must copy the output of your pipeline to a permanent location if you want to keep it around.
The persist verb is meant to address these issues. It saves a pipeline’s output to a permanent location and also resets the status of the pipeline, so that subsequent operations will know not to overwrite the data.
# pipeline 1 -- use persist to save the data to the working directory
output1 <- flightsXdf %>%
    mutate(delay=(arr_delay + dep_delay)/2) %>%
    persist("output1.xdf")

# use the output from pipeline 1
output2 <- output1 %>%
    group_by(carrier) %>%
    summarise(delay=mean(delay))

# reuse the output from pipeline 1 -- this works as expected
output3 <- output1 %>%
    group_by(dest) %>%
    summarise(delay=mean(delay))
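For comparison, the manual copy step mentioned earlier can be sketched in base R without dplyrXdf. This is only an illustration of R's session temporary directory: the CSV file stands in for a pipeline's output, and the destination name pipeline_output_copy.csv is hypothetical. A plain copy covers only the "permanent location" half of what persist does; it does not reset any pipeline status.

```r
# Stand-in for a pipeline output: a small file in R's session temp directory.
tmp_file <- tempfile(pattern = "pipeline_output_", fileext = ".csv")
write.csv(data.frame(carrier = c("AA", "UA"), delay = c(5.5, 7.25)),
          tmp_file, row.names = FALSE)

# tempfile() places the file under tempdir(), which R removes at session end.
in_temp_dir <- dirname(normalizePath(tmp_file)) == normalizePath(tempdir())

# Copy the file to a location that outlives the session.
# (Hypothetical destination: the current working directory.)
keep_file <- file.path(getwd(), "pipeline_output_copy.csv")
copied <- file.copy(tmp_file, keep_file, overwrite = TRUE)
```

After this runs, keep_file remains on disk when the session closes, while tmp_file does not.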
Specify levels in a factorise call
You can now specify the levels for a factor created by factorise, using the standard name=value syntax:
factorise(data, x=c("a", "b", "c"))
This will convert the variable x into a factor with levels a, b and c. Any values that don’t match the given levels will be turned into NAs. If x is already a factor, its levels will be changed to match those specified.
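The same level-matching behaviour can be seen on an in-memory vector with base R's factor function. This is an analogue for illustration only, not dplyrXdf's implementation, and the sample values are made up.

```r
# A character vector containing one value ("d") outside the requested levels.
x <- c("a", "b", "c", "d", "a")

# Only "a", "b" and "c" become levels; the unmatched "d" turns into NA.
f <- factor(x, levels = c("a", "b", "c"))
levels(f)   # "a" "b" "c"
is.na(f)    # FALSE FALSE FALSE  TRUE FALSE

# Applying new levels to an existing factor drops the unlisted level the same way.
g <- factor(c("a", "b", "c", "d"))
g2 <- factor(g, levels = c("a", "b", "c"))
levels(g2)  # "a" "b" "c"
```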
The semi_join and anti_join verbs have been implemented. As these types of joins aren’t internally supported by rxMerge, they are done using a combination of other verbs:
# same as semi_join(a, b, by="x")
# select everything in 'a' that matches a value of 'x' in 'b'
semi <- inner_join(a,
    select(b, x) %>% distinct,
    by="x")

# same as anti_join(a, b, by="x")
# select everything in 'a' that doesn't match a value of 'x' in 'b'
anti <- left_join(a,
    transmute(b, x, .ones=rep(1, .rxNumRows)) %>% distinct,
    by="x") %>%
    filter(is.na(.ones))
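To check the equivalence on a small scale, the same construction can be tried with plain dplyr on in-memory data frames. This is a sketch under assumptions: the tables a and b and their values are made up, mutate(.ones = 1) replaces the Xdf-specific .rxNumRows idiom, and the marker column is dropped at the end so the result lines up with anti_join.

```r
suppressPackageStartupMessages(library(dplyr))

# Made-up example tables; 'b' deliberately contains a duplicated key.
a <- data.frame(x = c(1, 2, 3, 4), y = c("p", "q", "r", "s"))
b <- data.frame(x = c(2, 4, 4, 5), z = c(10, 20, 30, 40))

# Semi join: inner join against the distinct keys of 'b',
# so duplicated keys in 'b' cannot duplicate rows of 'a'.
semi_manual <- inner_join(a, distinct(select(b, x)), by = "x")

# Anti join: left join against the distinct keys plus a marker column,
# then keep the rows where the marker is missing (no match in 'b').
anti_manual <- left_join(a, mutate(distinct(select(b, x)), .ones = 1), by = "x") %>%
    filter(is.na(.ones)) %>%
    select(-.ones)

# Built-in filtering joins for comparison.
semi_builtin <- semi_join(a, b, by = "x")
anti_builtin <- anti_join(a, b, by = "x")
```

The distinct step matters: without it, the duplicated key 4 in b would produce two copies of the matching row from a.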
Support unnamed argument for do and doXdf
You can now use unnamed arguments with do and doXdf, like the native dplyr::do. In both cases, the output has to be coercible to a data frame (again, like dplyr::do).
# example of unnamed argument to do
do_unnamed <- flightsXdf %>%
    group_by(carrier) %>%
    do(data.frame(
        quantile=sprintf("%d%%", seq(0, 100, by=25)),
        quant_arr=quantile(.$arr_delay, na.rm=TRUE),
        quant_dep=quantile(.$dep_delay, na.rm=TRUE)))

# example of unnamed argument to doXdf
do_unnamedXdf <- flightsXdf %>%
    group_by(carrier) %>%
    doXdf(rxSummary(~ arr_delay, .)$sDataFrame)
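The data frame requirement can be seen with the native dplyr::do on an ordinary data frame. The sketch below uses R's built-in mtcars dataset rather than an Xdf file, so it shows only the dplyr side of the behaviour; the column names prob and mpg_q are chosen for illustration.

```r
suppressPackageStartupMessages(library(dplyr))

probs <- c(0.25, 0.5, 0.75)

# Unnamed argument: each group returns a data frame, and do() row-binds
# the pieces and adds the grouping column.
quartiles <- mtcars %>%
    group_by(cyl) %>%
    do(data.frame(prob = probs, mpg_q = unname(quantile(.$mpg, probs))))

# An unnamed expression that returns a bare number is rejected,
# because the result is not a data frame.
bad <- tryCatch(
    mtcars %>% group_by(cyl) %>% do(mean(.$mpg)),
    error = function(e) e)
```

Here quartiles has one row per group and probability (three cyl groups times three quantiles), while bad holds the error condition raised by the second call.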
Miscellaneous bug fixes and improvements
A number of bug fixes have been implemented. In particular, joining tables on factor variables should now work even when the factor levels in the two tables aren’t exactly the same. The tally verbs have also been verified to work correctly for Xdf files.
If you encounter any bugs or issues with dplyrXdf, please contact me at [email protected]