A few weeks ago
padr was introduced on CRAN, allowing you to quickly get datetime data ready for analysis. If you have missed this, see the introduction blog or
vignette("padr") for a general introduction. In v0.2.0 the
pad function is extended with a
group argument, which makes your life a lot easier when you want to do padding within groups.
In the Examples of
padr in v0.1.0 I showed that padding over multiple groups could be done by using
padr in conjunction with
library(dplyr) library(padr) # padding a data.frame on group level day_var <- seq(as.Date('2016-01-01'), length.out = 12, by = 'month') x_df_grp <- data.frame(grp = rep(LETTERS[1:3], each = 4), y = runif(12, 10, 20) %>% round(0), date = sample(day_var, 12, TRUE)) %>% arrange(grp, date) x_df_grp %>% group_by(grp) %>% do(pad(.)) %>% ungroup %>% tidyr::fill(grp)
I quite quickly realized this is an unsatisfactory solution for two reasons:
1) It is a hassle. It is the goal of
padr to make datetime preparation as swift and pain free as possible. Having to manually fill your grouping variable(s) after padding is not exactly in concordance with that goal.
2) It does not work when one or both of the
end_val arguments are specified. The start and/or the end of the time series of a group are then no longer bounded by an original observation, and thus don’t have a value of the grouping variable(s). Forward filling with
tidyr::fill will incorrectly fill the grouping variable(s) as a result.
It was therefore necessary to expand
pad, so the grouping variable(s) do not contain missing values anymore after padding. The
group argument takes a character vector with the column name(s) of the variables to group by. Padding will be done on each of the groups formed by the unique combination of the grouping variables. This is of course just the distinct of the variable, if there is only one grouping variable. The result of the date padding will be exactly the same as before this addition (meaning the datetime variable does not change). However, the returned data frame will no longer have missing values for the grouping variables on the padded rows.
The new version of the section in the Examples of
day_var <- seq(as.Date('2016-01-01'), length.out = 12, by = 'month') x_df_grp <- data.frame(grp1 = rep(LETTERS[1:3], each =4), grp2 = letters[1:2], y = runif(12, 10, 20) %>% round(0), date = sample(day_var, 12, TRUE)) %>% arrange(grp1, grp2, date) # pad by one grouping var x_df_grp %>% pad(group = 'grp1') # pad by two groups vars x_df_grp %>% pad(group = c('grp1', 'grp2'))
Besides the additional argument there were two bug fixes in this version:
paddoes no longer break when datetime variable contains one value only. Returns
xand a warning if
NULL, and will do proper padding when one or both are specified.
fill_function now a meaningful error is thrown, when forgetting to specify at least one column to apply the function on.
Right before posting this blog, Doug Friedman found out that in v0.2.0 the
by argument no longer functioned. This bug was fixed in the patch release v0.2.1.
I hope you (still) enjoy working with
padr, let me know when you find a bug or got ideas for improvement.