Please inspect your dplyr+database code

December 2, 2017
By

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

A note to dplyr with database users: you may benefit from inspecting/re-factoring your code to eliminate value re-use inside dplyr::mutate() statements.

If you are using the R dplyr package with a database or with Apache Spark: I respectfully advise you inspect your code to ensure you are not using any values created inside a dplyr::mutate() statement inside the same dplyr::mutate() statement. This has been my coding advice for some time, and it is a simple and safe re-factoring to break up such statements into safer sequences (simply by introducing more dplyr::mutate()s).

I have since encountered a non-signaling (or silent) result corruption version of the issue. We are now advising code inspection as we now have confirmation that not seeing a thrown error is not a reliable indication of correct execution and correct results.

To keep things in proportion: if you are not writing multi-assignment mutates on a dplyr database-backed system you can’t run into the problem (though, for performance, multi-statement mutates are preferred over database sources such as Apache Spark).

The issue has been reported to the dplyr team. And I presume a fix is in the works. However, one does not want to be distributing incorrect results in the interim. This is the advice I have been giving private clients. After some thought I have come to feel it would be unfair to withhold such advice from the larger R community. This is not meant to make dplyr look bad, but to try and help prevent both dplyr and dplyr users from unnecessarily looking bad.

To be clear: I am a proponent of dplyr plus database development (which is why I ran into this). Also, I am not affiliated with RStudio or affiliated with the dplyr development team.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)