An R vlookup? Not so silly idea
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
It started out as a joke, but Jenny Bryan recently posted a vlookup implementation in R. Here is the original post as seen on twitter:
Sometimes you just need to party like it’s VLOOKUP time ?? … seriously, sometimes a join doesn’t fit the bill pic.twitter.com/jz8StfQdNg
— Jenny Bryan (@JennyBryan) April 3, 2018
The argument for the creation of this kind of lookup implementation in R, was to help facilitate a join that wasn’t covered by the standard joins. This tweet started out as a joke function that would surely never see the daylight of production code, but there was one thing that stood out about the small little function – it was written in base R and made use of R’s list data objects. What makes this intriguing you ask? Well, combining lists and base usually has a nice consequence – speed.
So that is what I set out to test: How much more efficient would this vlookup be when compared to a dplyr or data.table implementation? We start off by recreating the vlookup function and use the tweet’s example to show its use case:
library(tidyverse)
vlookup <- function(this, df, key, value) {
m <- match(this, df[[key]])
df[[value]][m]
}
vlookup_base <- function(){
c("Luke Skywalker", "Jabba Desilijic Tiure", "Yoda") %>%
vlookup(starwars, "name", "mass") %>%
sum
}
vlookup_base()
## [1] 1452
Next we build the same query, but using dplyr notation to give us the same results (a quick check shows us the sum results are the same):
vlookup_dplyr <- function(){
starwars %>%
filter(name %in% c("Luke Skywalker", "Jabba Desilijic Tiure", "Yoda")) %>%
pull(mass) %>%
sum
}
vlookup_dplyr()
## [1] 1452
Lastly we build the query using data.table. This implementation comes from the comment section of the tweet. We first have to coerce the starwars data into a data.table object for this to work. We also lose the %>% workflow which can be quite a pain. In this case it doesn’t matter, because we all about speed!
library(data.table)
vlookup_dt <- function(){
name_vec <- c("Luke Skywalker", "Jabba Desilijic Tiure", "Yoda")
df <- data.table(dplyr::starwars)
df[.(name_vec), on = 'name', sum(mass)]
}
To test the speed to all of these function, we will use the microbenchmark library. Its one of my favourite libraries in R,thanks to its ease of use and quick API to comparing functions.
library(microbenchmark) microbenchmark( base = vlookup_base(), dplyr = vlookup_dplyr(), data_table = vlookup_dt(), times = 1000L ) ## Unit: microseconds ## expr min lq mean median uq max neval ## base 93.539 116.1030 143.8219 129.642 147.692 4797.951 1000 ## dplyr 780.309 889.0265 1017.7738 943.795 1035.898 8516.107 1000 ## data_table 1030.564 1166.3595 1454.0140 1252.103 1441.641 50518.175 1000 ## cld ## a ## b ## c
Here we see the enormous speed gain we got from using vlookup_base in comparison with the other two frameworks. In all fairness, I feel that I might be handicapping data.table a bit, just to the coercion of the starwars dataset each time. So lets see what happens when I do the coercion outside the function.
df <- data.table(dplyr::starwars)
vlookup_dt <- function(){
name_vec <- c("Luke Skywalker", "Jabba Desilijic Tiure", "Yoda")
df[.(name_vec), on = 'name', sum(mass)]
}
res_mic <- microbenchmark(
base = vlookup_base(),
dplyr = vlookup_dplyr(),
data_table = vlookup_dt(),
times = 1000L
)
res_mic
## Unit: microseconds
## expr min lq mean median uq max neval cld
## base 95.590 118.975 137.5330 128.410 140.1030 603.077 1000 a
## dplyr 784.410 873.026 962.0909 911.180 972.7185 3840.002 1000 c
## data_table 579.693 648.206 740.3640 685.129 746.0520 4667.900 1000 b
We definitely gained some speed by not having to coerce the data.frame over and over, but, the overall speed comparison is still nothing like the old fashion vlookup_base.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.