Use Pseudo-Aggregators to Add Safety Checks to Your Data-Wrangling Workflow

October 30, 2018
By

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

One of the concepts we teach in both Practical Data Science with R and in our theory of data shaping is the importance of identifying the roles of columns in your data.

For example, to think in terms of multi-row records it helps to identify:

  • Which columns are keys (together identify rows or records).
  • Which columns are data/payload (are considered free varying data).
  • Which columns are "derived" (functions of the keys).

In this note we will show how to use some of these ideas to write safer data-wrangling code.

In mathematics the statement "y is a function of x" merely means there is an ideal lookup table with which if you knew the value of x, then you could in principle know the value of y.

For example in the following R data.frame: y is a function of x, as all rows that have the same value for x also have the same value for y.

d1 <- data.frame(
  x = c("a", "a", "b", "b", "c"),
  y = c(10, 10, 20, 20, 20), 
  z = c(1, 2, 1, 1, 1),
  stringsAsFactors = FALSE)
print(d1)
##   x  y z
## 1 a 10 1
## 2 a 10 2
## 3 b 20 1
## 4 b 20 1
## 5 c 20 1

Notice if we know the value of x we then, in principle know the value of y. In the same example z is not a function of x, as it does not have this property.

A more concrete example would be: user-name is a function of user-ID. If you know the an individual’s user-ID, then you also (if you have the right lookup table) know the individual’s user-name.

We first taught these concepts in the context of SQL and SQL grouped aggregation. In SQL once you aggregate on one column, then all other columns in your query must either be the grouping columns, or also aggregated. This is easiest to show in code, but we will use the dplyr package for our example.

library("dplyr")

d1 %>%
  group_by(x) %>%
  summarize(y = max(y)) %>%
  ungroup()
## # A tibble: 3 x 2
##   x         y
##    
## 1 a        10
## 2 b        20
## 3 c        20

Notice only grouping columns and columns passed through an aggregating calculation (such as max()) are passed through (the column z is not in the result). Now because y is a function of x no substantial aggregation is going on, we call this situation a "pseudo aggregation" and we have taught this before.

Our wrapr package now supplies a special case pseudo-aggregator (or in a mathematical sense: projection): psagg(). It works as follows.

library("wrapr")

d1 %>%
  group_by(x) %>%
  summarize(y = psagg(y)) %>%
  ungroup()
## # A tibble: 3 x 2
##   x         y
##    
## 1 a        10
## 2 b        20
## 3 c        20

psagg() pretty much worked the same as the earlier max(). However, it documents our belief that y is a function of x (that nothing interesting is going on for this column during aggregation). Where psagg() differs is if our assumption that y is a function of x is violated.

d2 <- data.frame(
  x = c("a", "a", "b", "b", "c"),
  y = c(10, 10, 20, 23, 20), 
  stringsAsFactors = FALSE)
print(d2)
##   x  y
## 1 a 10
## 2 a 10
## 3 b 20
## 4 b 23
## 5 c 20
d2 %>%
  group_by(x) %>%
  summarize(y = psagg(y)) %>%
  ungroup()
## Error in summarise_impl(.data, dots): Evaluation error: wrapr::psagg argument values are varying.

The code caught that our assumption was false and raised on error. This sort of checking can save a lot of time and prevent erroneous results.

And that is part of what we teach:

  • document intent
  • check assumptions
  • double-check results
  • use composable small helper-tools.

psagg() also works well with data.table.

library("data.table")

as.data.table(d1)[
  , .(y = psagg(y)), by = "x"]
##    x  y
## 1: a 10
## 2: b 20
## 3: c 20
as.data.table(d2)[
  , .(y = psagg(y)), by = "x"]
## Error in psagg(y): wrapr::psagg argument values are varying

Of course we don’t strictly need psagg() as we could insert checks by hand (though this would become burdensome if we had many derived columns).

as.data.table(d1)[
  , .(y = max(y), 
      y_was_const = min(y)==max(y)), 
  by = "x"]
##    x  y y_was_const
## 1: a 10        TRUE
## 2: b 20        TRUE
## 3: c 20        TRUE
as.data.table(d2)[
  , .(y = max(y), 
      y_was_const = min(y)==max(y)), 
  by = "x"]
##    x  y y_was_const
## 1: a 10        TRUE
## 2: b 23       FALSE
## 3: c 20        TRUE

Unfortunately, this sort of checking does not currently work for dplyr.

packageVersion("dplyr")
## [1] '0.7.7'
d1 %>%
  group_by(x) %>%
  summarize(y = max(y),
            y_was_const = min(y)==max(y)) %>%
  ungroup()
## # A tibble: 3 x 3
##   x         y y_was_const
##           
## 1 a        10 TRUE       
## 2 b        20 TRUE       
## 3 c        20 TRUE
d2 %>%
  group_by(x) %>%
  summarize(y = max(y),
            y_was_const = min(y)==max(y)) %>%
  ungroup()
## # A tibble: 3 x 3
##   x         y y_was_const
##           
## 1 a        10 TRUE       
## 2 b        23 TRUE       
## 3 c        20 TRUE

Notice the per-group variation in y was not detected. This appears to be a dplyr un-caught result corruption issue.

If we don’t attempt to re-use the variable name we get the correct result.

d2 %>%
  group_by(x) %>%
  summarize(y_for_group = max(y),
            y_was_const = min(y)==max(y)) %>%
  ungroup()
## # A tibble: 3 x 3
##   x     y_for_group y_was_const
##                 
## 1 a              10 TRUE       
## 2 b              23 FALSE      
## 3 c              20 TRUE

However, being forced to rename variables during aggregation is a needless user burden.

Our seplyr package (a package that is a very thin adapter on top of dplyr) does issue a warning in the failing situation.

library("seplyr")

d2 %>%
  group_by_se("x") %>%
  summarize_nse(y = max(y),
                y_was_const = min(y)==max(y)) %>%
  ungroup()
## Warning in summarize_se(res, summarizeTerms, warn = summarize_nse_warn, :
## seplyr::summarize_se possibly confusing column name re-use c('y' =
## 'max(y)', 'y_was_const' = 'min(y) == max(y)')

## # A tibble: 3 x 3
##   x         y y_was_const
##           
## 1 a        10 TRUE       
## 2 b        23 TRUE       
## 3 c        20 TRUE

The result is still wrong, but at least in this situation the user has a better chance at noticing and working around the issue.

The problem may not be common (it may or may not be in any of your code, or code you use), and is of course easy to avoid (once you know the nature of the issue).

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)