Using purrr with dplyr

[This article was first published on - rstats, and kindly contributed to R-bloggers].

Update: Most of the data frame functions in purrr have been deprecated in favour of a new family of functions in dplyr. The intent is to better separate the responsibilities of the tidyverse packages. First, map() now always returns a list; it no longer preserves the data frame type. Second, all slice- and rows-based functions are now deprecated. Mapping over columns is now handled by the colwise family of dplyr functions, e.g. dplyr::mutate_all(), dplyr::summarise_if(), etc. Unlike the _each() variants, which only accept expressions wrapped in funs(), the new colwise family accepts regular functions as well as additional arguments to be passed on to them. The syntax is thus pretty close to purrr's:

mtcars %>% group_by(cyl) %>% mutate_all(scale, center = FALSE)

/Update

purrr was finally released on CRAN last week. This package is focused on working with lists (and, by the same token, data frames). However, it is not a DSL for lists in the way dplyr is a DSL for data frames. It aims at creating a “better standard lib” focused on functional programming. purrr should feel like R programming and bring out the elegance of the language. That said, purrr can be a nice companion to your dplyr pipelines, especially when you need to apply a function to many columns. In this post I show how purrr's functional tools can be applied to a dplyr workflow.

dplyr provides mutate_each() and summarise_each() for the purpose of mapping functions, but I find that they are not as easy to use as the rest of the interface. This is mostly because there is no easy way to map a function to parts of your data frame: it's all columns or nothing. They also introduce a custom notation for lambda functions that can be a bit cumbersome. These are two areas where purrr shines in comparison. And since the interface has been designed with pipes in mind, purrr's functions integrate with dplyr pipelines quite well.

Mapping to columns conditionally

One of my favourite functions in purrr is map_if(). It accepts a predicate function or a logical vector that specifies which columns should be mapped with a function. This makes it easy to apply a function conditionally, as in the following snippet, where we convert every factor column to a character vector:

library("purrr")
library("dplyr")
data(diamonds, package = "ggplot2")

diamonds %>% map_if(is.factor, as.character) %>% str()

#> Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
#>  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
#>  $ cut    : chr  "Ideal" "Premium" "Good" "Premium" ...
#>  $ color  : chr  "E" "E" "E" "I" ...
#>  $ clarity: chr  "SI2" "SI1" "VS1" "VS2" ...
#>  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
#>  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
#>  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
#>  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
#>  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
#>  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
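
Following the update at the top of this post, the same conversion can be sketched with dplyr's colwise functions, which keep the data frame type. This is a minimal sketch rather than the original code: it swaps in the built-in iris data (an assumption made here so that the snippet runs without ggplot2) and uses dplyr::mutate_if().

```r
library("dplyr")

# iris ships with R; its Species column is a factor
converted <- iris %>% mutate_if(is.factor, as.character)

# The result is still a data frame, with Species now a character vector
str(converted)
```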

Mapping to specific columns

While cleaning a dataset, it is common to apply the same transformation to many variables, for example reversing a scale or shifting it to zero. Instead of writing a long mutate() call with those transformations, I prefer to do it in one go.

This can be done with map_at() which takes a vector of column positions or column names. For example, let's assume you have written two functions reverse_scale() and shift_to_zero() that should be applied to specific variables. You record those variables in character vectors just before starting the dplyr/purrr pipeline, and then add the relevant map_at() calls.

to_reverse_vars <- c(
  "cyl", "am", "vs",
  "gear", "carb"
)
to_zero_vars <- c(
  "cyl", "gear", "carb"
)

mtcars %>%
  select(-disp) %>%
  map_at(to_reverse_vars, reverse_scale) %>%
  map_at(to_zero_vars, shift_to_zero)
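
The post leaves reverse_scale() and shift_to_zero() undefined. As a rough sketch, they could look like the following. Both function bodies are hypothetical stand-ins, and the pipeline here uses dplyr::mutate_at() (one of the colwise functions mentioned in the update) so that the result stays a data frame.

```r
library("dplyr")

# Hypothetical helper: mirror values around the midpoint of the observed range
reverse_scale <- function(x) {
  max(x, na.rm = TRUE) + min(x, na.rm = TRUE) - x
}

# Hypothetical helper: shift values so that the observed minimum becomes 0
shift_to_zero <- function(x) {
  x - min(x, na.rm = TRUE)
}

to_reverse_vars <- c("cyl", "am", "vs", "gear", "carb")
to_zero_vars <- c("cyl", "gear", "carb")

cleaned <- mtcars %>%
  select(-disp) %>%
  mutate_at(to_reverse_vars, reverse_scale) %>%
  mutate_at(to_zero_vars, shift_to_zero)
```

Because the variable names live in plain character vectors, adding or removing a column from either transformation is a one-word change.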

Expanding one column to many with lmap()

lmap()'s story starts with the mysterious tweet and the gist that show up when you google “hadley monads”. While I'm not sure I really understand how it is monadic, lmap() is quite useful to extend a data frame without having to deal with binds, merges or having to define new column names.

Let's say you have a numeric variable that you want to discretise for data exploration or modelling (for example, to use as a faceting variable in a ggplot2 plot). There are several ways to cut a vector into pieces. Ideally, the cutpoints should be derived from theory, but that is often impossible or too time-consuming. In this case, I like to create different categorisations and check whether the results are consistent (and investigate when they are not). Let's define two cutting functions: one tries to create categories with equal sample sizes, while the other just uses equal ranges to determine cutpoints.

cut_equal_sizes <- function(x, n = 3, ...) {
  ggplot2::cut_number(x, n, ...)
}

cut_equal_ranges <- function(x, n = 3, ...) {
  cut(x, n, include.lowest = TRUE, ...)
}
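
To see how the two strategies differ, here is a small base R sketch. It is an illustration only: the quantile-based cut is a stand-in for ggplot2::cut_number(), assumed here so that the snippet runs without ggplot2, and the variable names are made up for the example.

```r
x <- mtcars$mpg
n <- 3

# Equal sample sizes: cut at the empirical quantiles
# (a base R stand-in for ggplot2::cut_number(), assumed here for portability)
size_breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1))
by_size <- cut(x, size_breaks, include.lowest = TRUE)

# Equal ranges: let cut() split the observed range into n intervals
by_range <- cut(x, n, include.lowest = TRUE)

# Compare how many observations land in each category
table(by_size)
table(by_range)
```

With a skewed variable, the equal-range categories tend to be unevenly populated, which is one reason to compare both categorisations.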

It'd be nice to "grow" the data frame at specific numeric columns in such a way that two new discretised variables appear just next to them with appropriate column names. lmap() is well suited to this because, instead of applying a function to the vectors contained in a data frame, it applies it to subsets of size 1 of that data frame. This has several advantages:

  • You get the name of the vector as an attribute of the enclosing data frame.

  • The usual mapping tools work on columns, so when you return a list or a data frame of vectors, they'll try to stick these inside a list-column, which is not what we want in this case. By comparison, lmap() passes a data frame to the function and expects a data frame in return, and it has no problem when that data frame has more than one column.
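
To make the difference concrete, base R subsetting can mimic what each kind of mapper hands to its function. This is only an illustration of the idea, not purrr's internals: single brackets give the size-1 subset that lmap() works with, while double brackets give the bare vector that the usual mapping tools see.

```r
# A size-1 subset keeps the data frame wrapper and the column name
piece <- mtcars["mpg"]
is.data.frame(piece)
names(piece)

# Extracting the column itself yields a bare, unnamed vector
vec <- mtcars[["mpg"]]
is.data.frame(vec)
names(vec)
```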

Let's write a function to be mapped in such a way. This function doesn't work with vectors but with vectors enclosed in a data frame. It takes and returns a data frame.

cut_categories <- function(x, n = 3) {
  # Record the name of the enclosed vector
  name <- names(x)

  # Create the new columns
  x$cat_n <- cut_equal_sizes(x[[1]], n)
  x$cat_r <- cut_equal_ranges(x[[1]], n)

  # Adjusting the names of the new columns
  names(x)[2:3] <- paste0(name, "_", n, names(x)[2:3])

  x
}

Then we just add an lmap_at() call to our data cleaning pipeline:

to_discretise_vars <- c(
  "mpg", "disp", "drat",
  "wt", "qsec"
)

mtcars %>% lmap_at(to_discretise_vars, cut_categories) %>% str()

#> Classes 'tbl_df', 'tbl' and 'data.frame':    32 obs. of  21 variables:
#>  $ mpg        : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ mpg_3cat_n : Factor w/ 3 levels "[10.4,16.7]",..: 2 2 3 2 2 2 1 3 3 2 ...
#>  $ mpg_3cat_r : Factor w/ 3 levels "[10.4,18.2]",..: 2 2 2 2 2 1 1 2 2 2 ...
#>  $ cyl        : num  6 6 4 6 8 6 8 4 4 6 ...
#>  $ disp       : num  160 160 108 258 360 ...
#>  $ disp_3cat_n: Factor w/ 3 levels "[71.1,146]","(146,293]",..: 2 2 1 2 3 2 3 2 1 2 ...
#>  $ disp_3cat_r: Factor w/ 3 levels "[70.7,205]","(205,338]",..: 1 1 1 2 3 2 3 1 1 1 ...
#>  $ hp         : num  110 110 93 110 175 105 245 62 95 123 ...
#>  $ drat       : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#>  $ drat_3cat_n: Factor w/ 3 levels "[2.76,3.17]",..: 2 2 2 1 1 1 2 2 3 3 ...
#>  $ drat_3cat_r: Factor w/ 3 levels "[2.76,3.48]",..: 2 2 2 1 1 1 1 2 2 2 ...
#>  $ wt         : num  2.62 2.88 2.32 3.21 3.44 ...
#>  $ wt_3cat_n  : Factor w/ 3 levels "[1.51,2.81]",..: 1 2 1 2 2 2 3 2 2 2 ...
#>  $ wt_3cat_r  : Factor w/ 3 levels "[1.51,2.82]",..: 1 2 1 2 2 2 2 2 2 2 ...
#>  $ qsec       : num  16.5 17 18.6 19.4 17 ...
#>  $ qsec_3cat_n: Factor w/ 3 levels "[14.5,17]","(17,18.6]",..: 1 1 3 3 1 3 1 3 3 2 ...
#>  $ qsec_3cat_r: Factor w/ 3 levels "[14.5,17.3]",..: 1 1 2 2 1 3 1 2 3 2 ...
#>  $ vs         : num  0 0 1 1 0 1 0 1 1 1 ...
#>  $ am         : num  1 1 1 0 0 0 0 0 0 0 ...
#>  $ gear       : num  4 4 4 3 3 3 3 4 4 4 ...
#>  $ carb       : num  4 4 1 1 2 1 4 2 2 4 ...

The data frame comes out of the pipeline with the new discretised variables nicely arranged and named.

Mapping a function within groups

purrr is also able to deal with dplyr groupings. The groups can be defined with either dplyr::group_by() or purrr::slice_rows(). To apply a function to all columns within groups, just combine a mapping function with the by_slice() adverb:

mtcars %>%
  slice_rows("cyl") %>%
  by_slice(map, ~ .x / sum(.x))
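
Since the update at the top of this post notes that the slice- and rows-based functions are deprecated, a roughly equivalent pipeline can be sketched with dplyr alone. This is a sketch under that assumption, using group_by() together with mutate_all() and a lambda formula. It is not the original code.

```r
library("dplyr")

# Within each cyl group, express every other column as a share of its group total
shares <- mtcars %>%
  group_by(cyl) %>%
  mutate_all(~ .x / sum(.x)) %>%
  ungroup()

# Columns whose group total is zero come out as NaN (0 / 0)
head(shares)
```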
