Advisory on Multiple Assignment dplyr::mutate() on Databases

Posted on January 21, 2018 by John Mount in R bloggers | 0 Comments

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I currently advise R dplyr users to take care when using multiple assignment dplyr::mutate() commands on databases.

(image: Kingroyos, Creative Commons Attribution-Share Alike 3.0 Unported License)

In this note I exhibit a troublesome example, and a systematic solution.

First let’s set up dplyr, our database, and some example data.

library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

packageVersion("dplyr")

## [1] '0.7.4'

packageVersion("dbplyr")

## [1] '1.2.0'

db <- DBI::dbConnect(RSQLite::SQLite(), 
                     ":memory:")

d <- dplyr::copy_to(
  db, 
  data.frame(xorig = 1:5, 
             yorig = sin(1:5)),
  "d")

Now suppose somewhere in one of your projects somebody (maybe not even you) has written code that looks somewhat like the following.

d %>%
  mutate(
    delta = 0,
    x0 = xorig + delta,
    y0 = yorig + delta,
    delta = delta + 1,
    x1 = xorig + delta,
    y1 = yorig + delta,
    delta = delta + 1,
    x2 = xorig + delta,
    y2 = yorig + delta
  ) %>%
  select(-xorig, -yorig, -delta) %>%
  knitr::kable()

x0	y0	x1	y1	x2	y2
1	0.8414710	1	0.8414710	1	0.8414710
2	0.9092974	2	0.9092974	2	0.9092974
3	0.1411200	3	0.1411200	3	0.1411200
4	-0.7568025	4	-0.7568025	4	-0.7568025
5	-0.9589243	5	-0.9589243	5	-0.9589243

Notice the above gives an incorrect result: all of the x_i columns are identical, and all of the y_i columns are identical. I am not saying the above code is in any way desirable (though something like it does arise naturally in certain test designs). If this is truly “incorrect dplyr code” we should have seen an error or exception. Unless you can be certain you have no code like that in a database backed dplyr project: you can not be certain you have not run into the problem producing silent data and result corruption.

The issue is: dplyr on databases does not seem to have strong enough order of assignment statement execution guarantees. The running counter “delta” is taking only one value for the entire lifetime of the dplyr::mutate() statement (which is clearly not what the user would want).

The fix is: break up the dplyr::mutate() into a series of smaller mutates that don’t exhibit the problem. It is a trade-off breaking up dplyr::mutate() on a database causes deeper statement nesting, and potential loss of performance. However, correct results should come before speed.

One automated variation of the fix is to use seplyr‘s statement partitioner. seplyr can factor the large mutate in a minimal number of very safe sub-mutates (and use dplyr to execute them).

d %>% 
  seplyr::mutate_se(
    seplyr::quote_mutate(
      delta = 0,
      x0 = xorig + delta,
      y0 = yorig + delta,
      delta = delta + 1,
      x1 = xorig + delta,
      y1 = yorig + delta,
      delta = delta + 1,
      x2 = xorig + delta,
      y2 = yorig + delta
    )) %>%
  select(-xorig, -yorig, -delta) %>%
  knitr::kable()

x0	y0	x1	y1	x2	y2
1	0.8414710	2	1.8414710	3	2.841471
2	0.9092974	3	1.9092974	4	2.909297
3	0.1411200	4	1.1411200	5	2.141120
4	-0.7568025	5	0.2431975	6	1.243197
5	-0.9589243	6	0.0410757	7	1.041076

The above notation is, however, a bit clunky for everyday use. We did not use the more direct seplyr::mutate_nse() as we are (to lower maintenance effort) deprecating the direct non-standard evaluation methods in seplyr in favor of code using seplyr::quote_mutate or wrapr::qae().

One can instead use seplyr as a code inspecting and re-writing tool with seplyr::factor_mutate().

cat(seplyr::factor_mutate(
  delta = 0,
  x0 = xorig + delta,
  y0 = yorig + delta,
  delta = delta + 1,
  x1 = xorig + delta,
  y1 = yorig + delta,
  delta = delta + 1,
  x2 = xorig + delta,
  y2 = yorig + delta
))

Warning in seplyr::factor_mutate(delta = 0, x0 = xorig + delta, y0 = yorig
+ : Mutate should be split into more than one stage.

   mutate(delta = 0) %>%
   mutate(x0 = xorig + delta,
          y0 = yorig + delta) %>%
   mutate(delta = delta + 1) %>%
   mutate(x1 = xorig + delta,
          y1 = yorig + delta) %>%
   mutate(delta = delta + 1) %>%
   mutate(x2 = xorig + delta,
          y2 = yorig + delta)

seplyr::factor_mutate() both issued a warning and produced the factored code snippet seen above. We think this is in fact a different issue than explored in our prior note on dependency driven result corruption, and that fixes for the first issue did not fix this issue last time we looked.

And that why to continue to be careful when using multi assignment dplyr::mutate() statements with database backed data.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Advisory on Multiple Assignment dplyr::mutate() on Databases

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)