I’m very excited to announce dplyr 0.2. It has three big features:
- improved piping courtesy of the magrittr package
a vastly more useful implementation of
five new verbs:
These features are described in more detail below. To learn more about the 35 new minor improvements and bug fixes, please read the full release notes.
dplyr now imports
%>% from the magrittr package by Stefan Milton Bache. I recommend that you use this instead of
%.% because it is easier to type (since you can hold down the shift key) and is more flexible. With you
%>%, you can control which argument on the RHS receives the LHS with the pronoun
.. This makes
%>% more useful with base R functions because they don’t always take the data frame as the first argument. For example you could pipe
mtcars %>% xtabs( ~ cyl + vs, data = .)
dplyr only exports
%>% from magrittr, but magrittr contains many other useful functions. To use them, load magrittr explicitly with
library(magrittr). For more details, see
%.% will be deprecated in a future version of dplyr, but it won’t happen for a while. I’ve deprecated
chain() to encourage a single style of dplyr usage: please use
do() has been completely overhauled, and
do() is now equivalent in power to
plyr::dlply(). There are two ways to use
do(), either with multiple named arguments or a single unnamed arguments. If you use named arguments, each argument becomes a list-variable in the output. A list-variable can contain any arbitrary R object which makes this form of
do() useful for storing models:
library(dplyr) models % group_by(cyl) %>% do(model = lm(mpg ~ wt, data = .)) models %>% summarise(rsq = summary(model)$r.squared)
If you use an unnamed argument, the result should be a data frame. This allows you to apply arbitrary functions to each group.
mtcars %>% group_by(cyl) %>% do(head(., 1))
Note the use of the pronoun
. to refer to the data in the current group.
do() also has an automatic progress bar. It appears if the computation takes longer than 2 seconds and estimates how long the job will take to complete.
sample_n() randomly samples a fixed number of rows from a tbl;
sample_frac() randomly samples a fixed fraction of rows. They currently only work for local data frames and data tables.
mutate_each() make it easy to apply one or more functions to multiple columns in a tbl. These works for all srcs that
mutate() work for.
glimpse() makes it possible to see all the columns in a tbl, displaying as much data for each variable as can be fit on a single line.