The dplyr functions
mutate nowadays are commonly applied to perform
data.frame column operations, frequently combined with magrittrs forward
%>% pipe. While working well interactively, however, these methods often would require additional checking if used in “serious” code, for example, to catch column name clashes.
In principle, the container package provides a
dict-class (resembling Pythons dict type), which allows to cover these issues more easily. In its very recent update, the container package for this reason gained an S3 method interface plus functions to convert back and forth between
data.frame. This can be used to extend the set of
data.frame column operations and in this post I will show how and when they can serve as a useful alternative to
To keep matters simple, we use a tiny data table.
Let’s add a column using
For someone not familar with the tidyverse, this code block might read somewhat odd as the column is added and not mutated. To add a column using
dict simply use
The intended add-operation is stated more clearly, but on the downside we also had to add some overhead. Of course, since this has to be done only at the beginning and at the end of the pipe, it will be less of an issue if multiple dict-operations are performed in between. Next, instead of ID, let’s add another numeric column
y, which happens to “name-clash” with the already existing column.
Ooops – we have accidently overwritten the initial y-column. While this was easy to see here, it may not if the
data.frame has a lot of columns or if column names are created automatically as part of some script. To catch this, usually some overhead is required, too.
Let’s see the dict-operation in comparison.
The name clash is catched by default and the overhead does not look so silly anymore. As a bonus, the error message still provides information about the originally intended add-operation.
If the intend was indeed to overwrite the value, the dict-function
setval can be used.
As we saw above, if a column does not exist,
mutate silently creates it for you. If this is not what you want, which means, you want to make sure something is overwritten, again, a workaround is needed.
Once again, the workaround is already “built-in” in the dict-framework.
After all, the intend of the
mutate function actually would be something like: overwrite a column, or, create it if it does not exist. If desired, this behaviour can be expressed within the dict-framework as well.
A common tidyverse approach to remove a column is based on the
select function. One corresponding dict-function is
Let’s see what happens if the column does not exist in the first place.
Again, we obtain a slightly more informative error message with dict. Assume we want the column to be removed if it exist but otherwise silently ignore the command, for example:
You may have expected this by now – the dict-framework provides a straigh-forward solution, namely, the
The required additional code lines are limited but what about the computational overhead? To examine this, we benchmark some column operations using the famous ‘iris’ data set. As a hallmark reference we will also bring the data.table framework to the competition.
set.seed(123) library(microbenchmark) library(ggplot2) library(data.table) data <- iris n <- nrow(data) head(iris) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa
For the benchmark, we add one, transform one and finally delete one column.
bm <- microbenchmark(control = list(order="inorder"), times = 100, dplyr = data %>% mutate(ID = 1:n) %>% mutate(Species = as.character(Species)) %>% select(-Sepal.Width), dict = data %>% as.dict() %>% add("ID", 1:n) %>% setval(., "Species", as.character(.$get("Species"))) %>% remove("Sepal.Width") %>% as.data.frame(), `[.data.table` = data %>% as.data.table() %>% .[, ID := 1:n] %>% .[, Species := as.character(Species)] %>% .[, Sepal.Width := NULL] ) autoplot(bm) + theme_bw()
bm <- microbenchmark(control = list(order="inorder"), times = 100, tbl <- mutate(data, ID = 1:n), tbl <- mutate(tbl, Species = as.character(Species)), tbl <- select(tbl, -Sepal.Width), dic <- as.dict(data), add(dic, "ID", 1:n), setval(dic, "Species", as.character(dic$get("Species"))), remove(dic, "Sepal.Width"), dic <- as.data.frame(dic), dt <- as.data.table(data), dt[, ID := 1:n], dt[, Species := as.character(Species)], dt[, Sepal.Width := NULL] ) autoplot(bm) + theme_bw()
Apparently, the mutate and select operations are the slowest in comparison, I think, because both the dict and data.table approach work by reference while probably some copying is done in the dplyr pipe. We also see that the dict-approach spends most of the computation time for the transformation back and forth between a
dict and a
data.frame while the actual column operations seem very efficient, even more efficient than that of data.table. This certainly came as a surprise to me, as the focus when developing the container package has never been on speed but rather on providing a concise data structure. Internally the dict simply consists of a named list, so I guess this speaks for the efficiency of base R list operations. Having said that, I found that the data.table code can be further improved by avoiding the overhead of the
[.data.table operator and instead use the built-in
bm <- microbenchmark(control = list(order="inorder"), times = 100, dplyr = data %>% mutate(ID = 1:n) %>% mutate(Species = as.character(Species)) %>% select(-Sepal.Width), dict = data %>% as.dict() %>% add("ID", 1:n) %>% setval(., "Species", as.character(.$get("Species"))) %>% remove("Sepal.Width") %>% as.data.frame(), `[.data.table` = data %>% as.data.table() %>% .[, ID := 1:n] %>% .[, Species := as.character(Species)] %>% .[, Sepal.Width := NULL], data.table.set = data %>% as.data.table() %>% set(., j= "ID", value = 1:n) %>% set(., j = "Species", value = as.character(.[["Species"]])) %>% set(., j = "Sepal.Width", value = NULL) ) autoplot(bm) + theme_bw()
This puts things back into perspective, I guess 🙂 It might also be interesting to know, how much of the computation time is spent on the non-standard evaluation part of the
[.data.table implementation, but that’s probably a topic on its own.
Accidently overwriting existing data columns leads to nasty bugs. The presented workflow allows to increase both reliability and precision of standard data frame column manipulation at very little cost. The intended column operations can be expressed more clearly and, in case of failures, informative error messages are provided by default. As a result, the dict-framework may serve as a useful supplement to “interactive piping”.