Updated dplyrXdf package brings data munging with pipes to Xdf files

March 16, 2016

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Hong Ooi, Sr. Data Scientist, Microsoft

I’m pleased to announce the release of version 0.62 of the dplyrXdf package, a backend to dplyr that allows the use of pipeline syntax with Microsoft R Server’s Xdf files. This update adds a new verb (persist), fills some holes in support for dplyr verbs, and fixes various bugs.

The persist verb

A side-effect of dplyrXdf handling file management is that passing the output from one pipeline into subsequent pipelines can have unexpected results. Consider the following example:

# pipeline 1
output1 <- flightsXdf %>%
    mutate(delay=(arr_delay + dep_delay)/2)

# use the output from pipeline 1
output2 <- output1 %>%
    group_by(carrier) %>%
    summarise(delay=mean(delay))

# reuse the output from pipeline 1 -- WRONG
output3 <- output1 %>%
    group_by(dest) %>%
    summarise(delay=mean(delay))

The problem with this code is that the second pipeline will overwrite or delete its input, so the third pipeline will fail. This is consistent with dplyrXdf’s philosophy of only saving the most recent output of a pipeline, where a pipeline is defined as all operations starting from a raw xdf file. However, in this case it isn’t what’s desired.

Similarly, dplyrXdf stores its output files in R’s temporary directory, so when you close your R session, these files will be deleted. This saves you having to manually delete files that are no longer in use, but it means that you must copy the output of your pipeline to a permanent location if you want to keep it around.

The new persist verb is meant to address these issues. It saves a pipeline’s output to a permanent location and also resets the status of the pipeline, so that subsequent operations will know not to overwrite the data.

# pipeline 1 -- use persist to save the data to the working directory
output1 <- flightsXdf %>%
    mutate(delay=(arr_delay + dep_delay)/2) %>%
    persist("output1.xdf")

# use the output from pipeline 1
output2 <- output1 %>%
    group_by(carrier) %>%
    summarise(delay=mean(delay))

# reuse the output from pipeline 1 -- this works as expected
output3 <- output1 %>%
    group_by(dest) %>%
    summarise(delay=mean(delay))

Specify levels in a factorise call

You can now specify the levels for a factor created by factorise, using the standard name=value syntax:

factorise(data, x=c("a", "b", "c"))

This will convert the variable x into a factor with levels a, b and c. Any values that don’t match the given levels will be turned into NAs. If x is already a factor, its levels will be changed to match those specified.
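
For readers who want to see these semantics on ordinary in-memory data, here is a minimal base R sketch. It assumes that factorise follows the same level-matching rules as base R's factor(), which is what the description above suggests; it does not call dplyrXdf, and the vectors x and g are made up for illustration.

```r
# Base R analogue only: factor() on an in-memory vector, not factorise() on an Xdf file.
# Illustrative data (hypothetical, not from the article).
x <- c("a", "b", "d", "c", "a")

# Values outside the requested levels ("d" here) become NA
f <- factor(x, levels = c("a", "b", "c"))
print(levels(f))
print(is.na(f))

# Re-levelling an existing factor: any level left out ("c" here) becomes NA
g <- factor(c("a", "b", "c", "b"))
g2 <- factor(g, levels = c("a", "b"))
print(g2)
```

Checking for unexpected NAs after a conversion like this is a quick way to spot values that were missing from the supplied levels.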

Support for semi_join and anti_join

The semi_join and anti_join verbs have been implemented. As these types of joins aren’t internally supported by rxMerge, they are done using a combination of other verbs:

# same as semi_join(a, b, by="x")
# select everything in 'a' that matches a value of 'x' in 'b'
semi <- inner_join(a,
                   select(b, x) %>% distinct,
                   by="x")

# same as anti_join(a, b, by="x")
# select everything in 'a' that doesn't match a value of 'x' in 'b'
anti <- left_join(a,
                  transmute(b, x, .ones=rep(1, .rxNumRows)) %>% distinct,
                  by="x") %>% filter(is.na(.ones))
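
The same two constructions can be tried on ordinary data frames with plain dplyr, which makes it easier to see why the distinct step matters. This is a sketch under assumptions: the tables a and b are invented for illustration, and the marker column is created with mutate rather than with .rxNumRows, which is only available inside RevoScaleR transforms.

```r
library(dplyr)

# Illustrative in-memory tables (hypothetical data)
a <- data.frame(x = c(1, 2, 2, 3, 4),
                val = c("p", "q", "r", "s", "t"),
                stringsAsFactors = FALSE)
b <- data.frame(x = c(2, 2, 4, 5),
                other = c("u", "v", "w", "z"),
                stringsAsFactors = FALSE)

# Semi join: inner join against the distinct keys of b.
# Without distinct(), the duplicated key x = 2 in b would duplicate rows of a.
semi_manual <- inner_join(a, distinct(select(b, x)), by = "x")

# Anti join: left join against the distinct keys plus a marker column,
# then keep only the rows where the marker is missing.
b_keys <- distinct(mutate(select(b, x), .ones = 1))
anti_manual <- left_join(a, b_keys, by = "x") %>%
    filter(is.na(.ones)) %>%
    select(-.ones)

# Compare against dplyr's built-in verbs
semi_builtin <- semi_join(a, b, by = "x")
anti_builtin <- anti_join(a, b, by = "x")
print(identical(semi_manual$val, semi_builtin$val))
print(identical(anti_manual$val, anti_builtin$val))
```

On data frames the built-in semi_join and anti_join are the natural choice; the manual versions are shown only to mirror the approach described above.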

Support for unnamed arguments in do and doXdf

You can now use unnamed arguments with do and doXdf, like the native dplyr::do. In both cases, the output has to be coercible to a data frame (again, like dplyr::do).

# example of unnamed argument to do
do_unnamed <- flightsXdf %>%
    group_by(carrier) %>%
    do(data.frame(quantile=sprintf("%d%%", seq(0, 100, by=25)),
                  quant_arr=quantile(.$arr_delay, na.rm=TRUE),
                  quant_dep=quantile(.$dep_delay, na.rm=TRUE)))

# example of unnamed argument to doXdf
do_unnamedXdf <- flightsXdf %>%
    group_by(carrier) %>%
    doXdf(rxSummary(~ arr_delay, .)$sDataFrame)
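
For comparison, the unnamed-argument form can be tried with native dplyr::do on an in-memory data frame, without R Server. The sketch below uses the built-in mtcars dataset as a stand-in for flightsXdf; the column name mpg_q is chosen here for illustration.

```r
library(dplyr)

# Unnamed argument to do(): the expression must return a data frame per group,
# and the per-group results are stacked into a single table.
quantiles_by_cyl <- mtcars %>%
    group_by(cyl) %>%
    do(data.frame(quantile = paste0(seq(0, 100, by = 25), "%"),
                  mpg_q = unname(quantile(.$mpg)),
                  stringsAsFactors = FALSE))

print(quantiles_by_cyl)
```

Each of the three cyl groups contributes five rows, one per quantile, and the grouping column is carried along as a label.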

Miscellaneous bug fixes and improvements

A number of bug fixes have been implemented. In particular, joining tables on factor variables should now work even when the factor levels in the two tables aren’t exactly the same. The mutate_each, summarise_each, count and tally verbs have also been verified to work correctly for Xdf files.

If you encounter any bugs or issues with dplyrXdf, please contact me at [email protected]
