Building a base dplyr with primitives
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
In one of my latest posts, I discussed the idea of turning base R’s get and set operators ([
, [[
, [<-
, [[<-
) into human readable and pipeable functions. It was kindly pointed out in the comments that the get()
and set()
functions I defined in that blog post are actually exported in the magrittr
package as magrittr::extract2()
([[
) and magrittr::inset2()
([[<-
). In fact, there are a whole host of “alias” functions exported by magrittr
, see ?magrittr::extract2
for more. However if we are developing a package, we may not necessarily want to Import: magrittr
, we may only want to Suggest
it as a package that complements our package. This is especially true when the functions we will be importing are simple aliases of other functions that we can just as easily create ourselves. Now sure, a lot of people already have and use magrittr
, in which case they can use it, but not everyone wants it or uses it, so we shouldn’t enforce that dependency on users.
Take for example if we were to create a package that recreates dplyr
’s main verbs, select()
; filter()
; mutate()
; and arrange()
, using base R only. Think of it as a “poor man’s” dplyr, of course I jest - base
is awesome. Oftentimes the main complaint I hear about dplyr
is the sheer number of dependencies it has and the installation times that come with that; not to mention APIs have changed a few times over the years. base
on the other hand already comes pre-installed with R and the API is extremely stable. The reason people like dplyr
, however, is because the API on offer is extremely flexible and easy to understand. This blog post will show how we can recreate these verbs using base R and aliases to R’s operator functions and use them in conjunction with magrittr
.
Select
dplyr::select()
allows the user to subset the columns of a data.frame
and always return a data.frame
. Thus to recreate this function we will need the operator for subsetting columns of a data.frame
which is [
, or more specifically, [.data.frame
. Let’s take a look at the arguments for this function.
args(`[.data.frame`) # function (x, i, j, drop = if (missing(i)) TRUE else length(cols) == # 1) # NULL
We see that it takes:
x
- thedata.frame
i
- the rows to subsetj
- the columns to subsetdrop
- whether to return a vector if only one column is left
We will define our wrapper for the [
function in the same way that magrittr
does.
extract <- `[`
As this is an S3 generic, R will know to dispatch to [.data.frame
when it is passed a data.frame
. Hence, we can now define a select()
function which is similar in functionality to that of dplyr::select()
. Note that we tell R that we wish to subset all of the rows in the i
position by leaving the argument blank.
select <- function(.data, ...) { cols <- vapply(substitute(...()), deparse, NA_character_) extract(.data, , cols, drop = FALSE) }
This function uses a couple of tricks here, so I’ll break them down. To use non-standard evaluation in the same way that dplyr
does, that is to pass non-quoted column names, we must deparse them. We loop over the columns passed via ...
using a vapply()
. The substitute(...())
gives us a list-like object of all the symbols we pass which we can loop over. Using this function, we can now select a single column.
mtcars %>% select(mpg) # mpg # Mazda RX4 21.0 # Mazda RX4 Wag 21.0 # Datsun 710 22.8 # Hornet 4 Drive 21.4 # Hornet Sportabout 18.7 # Valiant 18.1 # Duster 360 14.3 # Merc 240D 24.4 # Merc 230 22.8 # ... 23 rows omitted
Or multiple columns by passing a vector.
mtcars %>% select(mpg, cyl) # mpg cyl # Mazda RX4 21.0 6 # Mazda RX4 Wag 21.0 6 # Datsun 710 22.8 4 # Hornet 4 Drive 21.4 6 # Hornet Sportabout 18.7 8 # Valiant 18.1 6 # Duster 360 14.3 8 # Merc 240D 24.4 4 # Merc 230 22.8 4 # ... 23 rows omitted
And of course, this function works without magrittr
.
select(mtcars, mpg, cyl) # mpg cyl # Mazda RX4 21.0 6 # Mazda RX4 Wag 21.0 6 # Datsun 710 22.8 4 # Hornet 4 Drive 21.4 6 # Hornet Sportabout 18.7 8 # Valiant 18.1 6 # Duster 360 14.3 8 # Merc 240D 24.4 4 # Merc 230 22.8 4 # ... 23 rows omitted
For bonus points, we can write an equivalent of dplyr::pull()
by setting the drop = TRUE
argument and removing the cols
parameter since we are only dealing with one column.
pull <- function(.data, var) { var <- deparse(substitute(var)) stopifnot(length(var) == 1) extract(.data, , var, drop = TRUE) } mtcars %>% pull(mpg) # [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 # [18] 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
Filter
As we saw in the previous section, [.data.frame
takes i
as an argument which represents the rows to filter. Thus we can use a similar method to that used for select()
only in this case, we must build the expressions by which to filter and separate them with an ampersand from which we can parse and evaluate.
filter <- function(.data, ...) { conditions <- paste(vapply(substitute(...()), deparse, NA_character_), collapse = " & ") extract(.data, with(.data, eval(parse(text = conditions))), ) } mtcars %>% filter(cyl == 4) # mpg cyl disp hp drat wt qsec vs am gear carb # Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 # Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 # Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 # Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 # Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 # Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 # Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 # Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 # Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 # ... 2 rows omitted mtcars %>% filter(cyl <= 5 & am > 0) # mpg cyl disp hp drat wt qsec vs am gear carb # Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 # Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 # Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 # Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 # Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 # Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 # Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 # Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 mtcars %>% filter(cyl == 4 | cyl == 8) # mpg cyl disp hp drat wt qsec vs am gear carb # Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 # Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 # Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 # Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 # Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 # Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 # Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 # Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 # Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 # ... 16 rows omitted mtcars %>% filter(!(cyl %in% c(4, 6)), am != 0) # mpg cyl disp hp drat wt qsec vs am gear carb # Ford Pantera L 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4 # Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
We can also get a copy of dplyr::slice()
really cheaply.
slice <- function(.data, ...) { stopifnot(is.numeric(...) || is.integer(...)) extract(.data, ..., ) } mtcars %>% slice(1:3) # mpg cyl disp hp drat wt qsec vs am gear carb # Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 # Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 # Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Arrange
The final function using the extract()
alias that I want to highlight is arrange()
. I want to highlight this function because of the required trick with eval.parent()
(note there are other ways we could achieve this).
arrange <- function(.data, ...) { rows <- eval.parent(substitute(with(.data, order(...)))) extract(.data, rows, , drop = FALSE) }
We use eval.parent()
instead of eval()
, because the eval()
/substitute()
combo doesn’t play well with nested functions. The eval.parent()
trick has been proposed by @MoodyMudskipper as a way to address this problem and allows us to seamlessly use arrange()
inside other functions, including magrittr
pipes:
mtcars %>% arrange(mpg) # mpg cyl disp hp drat wt qsec vs am gear carb # Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 # Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 # Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 # Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 # Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 # Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 # Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 # AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 # Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 # ... 23 rows omitted mtcars %>% arrange(cyl, mpg) # mpg cyl disp hp drat wt qsec vs am gear carb # Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 # Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 # Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 # Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 # Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 # Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 # Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 # Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 # Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 # ... 23 rows omitted
Thanks go to Artem Sokolov for pointing this out.
Mutate
If we wish to create new columns in our dataset, particularly columns created using existing columns in the data, we must use the [<-
operator, specifically, [<-.data.frame
.
args(`[<-.data.frame`) # function (x, i, j, value) # NULL
[<-.data.frame
takes the arguments:
x
- the datai
- the rows to createj
- the columns to createvalue
- the value to give to the rows/columns
We will assign this operator to inset
- the same as magrittr
does.
inset <- `[<-`
Here we lapply()
over each of the conditions to return a list of vectored results of our expressions. We then use the inset()
function to add these vectors as new columns to the data.frame
.
mutate <- function(.data, ...) { conditions <- vapply(substitute(...()), deparse, NA_character_) new_data <- lapply( conditions, function(x, .data) with(.data, eval(parse(text = x))), .data ) inset(.data, , names(conditions), new_data) } mtcars %>% mutate(mpg2 = mpg * 2) # mpg cyl disp hp drat wt qsec vs am gear carb mpg2 # Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 42.0 # Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 42.0 # Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 45.6 # Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 42.8 # Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 37.4 # Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 36.2 # Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 28.6 # Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 48.8 # Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 45.6 # ... 23 rows omitted mtcars %>% mutate(mpg2 = mpg * 2, cyl2 = cyl * 2) # mpg cyl disp hp drat wt qsec vs am gear carb mpg2 cyl2 # Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 42.0 12 # Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 42.0 12 # Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 45.6 8 # Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 42.8 12 # Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 37.4 16 # Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 36.2 12 # Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 28.6 16 # Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 48.8 8 # Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 45.6 8 # ... 23 rows omitted
Note that unlike dplyr::mutate()
, we cannot create columns based on expressions we pass, for example, the following would not work.
mtcars %>% mutate(mpg2 = mpg * 2, mpg3 = mpg2 * 3) # Error in eval(parse(text = x)): object 'mpg2' not found
As a bonus, we can combine our mutate()
function with extract()
to create a copy of dplyr::transmute()
.
transmute <- function(.data, ...) { conditions <- vapply(substitute(...()), deparse, NA_character_) mutated <- mutate(.data, ...) extract(mutated, names(conditions)) } mtcars %>% transmute(mpg2 = mpg * 2, cyl2 = cyl * 2) # mpg2 cyl2 # Mazda RX4 42.0 12 # Mazda RX4 Wag 42.0 12 # Datsun 710 45.6 8 # Hornet 4 Drive 42.8 12 # Hornet Sportabout 37.4 16 # Valiant 36.2 12 # Duster 360 28.6 16 # Merc 240D 48.8 8 # Merc 230 45.6 8 # ... 23 rows omitted
Chaining
As a final note, it should be clear that due to the nature of magrittr
, your standard chaining of functions will still work.
mtcars %>% filter(cyl == 4) %>% select(mpg, cyl, wt, disp) # mpg cyl wt disp # Datsun 710 22.8 4 2.320 108.0 # Merc 240D 24.4 4 3.190 146.7 # Merc 230 22.8 4 3.150 140.8 # Fiat 128 32.4 4 2.200 78.7 # Honda Civic 30.4 4 1.615 75.7 # Toyota Corolla 33.9 4 1.835 71.1 # Toyota Corona 21.5 4 2.465 120.1 # Fiat X1-9 27.3 4 1.935 79.0 # Porsche 914-2 26.0 4 2.140 120.3 # ... 2 rows omitted
Conclusion
The idea behind this blog post was to highlight how we can use more human readable versions of R’s primitive operators to aid in pipeable data manipulation functions. Of course the solutions provided in this blog post are over-engineered and you would probably write them in a different way if you were seriously thinking about releasing them as a package. Also, whilst these functions are available via an import of magrittr
, you may not wish to force the user to import magrittr
and may wish to keep it as a suggestion instead. This reduces the number of dependencies on your package.
For what it’s worth, I have included all of the above code in a package called poorman
on my GitHub account. These functions haven’t been thoroughly tested and there may well be bugs. There are, however, much more detailed and dedicated recreations of dplyr
using base R. If you are interested, check out: bplyr
(note this package uses rlang
) and tbltools
.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.