My Favorite data.table Feature
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
My favorite R data.table feature is the “by” grouping notation when combined with the := notation.
Let’s take a look at this powerful notation.
First, let’s build an example data.frame.
d <- wrapr::build_frame(
"group" , "value" |
"a" , 1L |
"a" , 2L |
"b" , 3L |
"b" , 4L )
knitr::kable(d)
| group | value |
|---|---|
| a | 1 |
| a | 2 |
| b | 3 |
| b | 4 |
The data is some sort of value with a grouping column telling us which rows are related.
With the data.table “:=,by” notation we can add the per-group totals into each row of the data as follows (the extra [] at the end is just the command to also print the results in addition to adding the column in-place).
library("data.table")
dt <- data.table::as.data.table(d)
dt[, group_sum := sum(value), by = "group"][]
# group value group_sum
# 1: a 1 3
# 2: a 2 3
# 3: b 3 7
# 4: b 4 7
The “by” signals we are doing a per-group calculation, and the “:=” signals to land the results in the original data.table. This sort of window function is incredibly useful in computing things such as what fraction of a group’s mass is in each row. For example.
# build a fresh copy as last command altered dt in place dt <- data.table::as.data.table(d) dt[, fraction := value/sum(value), by = "group"][] # group value fraction # 1: a 1 0.3333333 # 2: a 2 0.6666667 # 3: b 3 0.4285714 # 4: b 4 0.5714286
In base R (or in a more purely relational data system) the obvious way to solve this requires two steps: computing the per-group summaries and then joining them back into the original table rows. This can be done as follows.
sums <- tapply(d$value, d$group, sum) d$fraction <- d$value/sums[d$group] print(d) # group value fraction # 1 a 1 0.3333333 # 2 a 2 0.6666667 # 3 b 3 0.4285714 # 4 b 4 0.5714286
We called the transform a “window function”, as that is the name that SQL uses for the concept. The SQL code to perform this calculation would look like the following.
SELECT group, value, value/sum(value) OVER ( PARTITION BY group ) AS fraction FROM d
And the popular package dplyr uses the following notation for the same problem.
d %>%
group_by(group) %>%
mutate(fraction = value/sum(value)) %>%
ungroup()
And, as always, let’s end with some timings. For a 1000000 row table with 10 additional irrelevant columns, and group ids picked uniformly from 100000 symbols: we see the various solutions take the following times to complete the task.
## solution milliseconds ## datatable_soln 384 ## base_R_lookup_soln 1476 ## dplyr_soln 3988
All packages are the current CRAN releases as of 2019-06-29. Timings are sensitive to number of row, columns, and groups. Note the data.table time includes the time to convert to the data.table class.Details on the timings can be found here.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.