VLOOKUP in R with Schwartau Beehive Data


I started learning R back in 2016 in college thanks to a couple of my professors who used it to teach statistics: Dr. Grimshaw and Dr. Lawson. Thanks to the R community I’ve learned a lot more since then, but recently I did an embarrassing Google search for “how to do VLOOKUP in r.”

For those of you who don't know, VLOOKUP is an Excel function that takes two ranges, matches a column from one against a column from the other, and returns values from a third column based on the match. There are some nuances, like the requirement that the matching column in the "from" range come first, and the difference between approximate and exact matches, but we'll stay out of the Excel weeds in this post. Suffice it to say that dplyr joins cover 90% of what VLOOKUP gets used for, and they're way faster.
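To make that concrete, here's a minimal sketch with made-up lookup data (the prices and orders tables below are hypothetical, not from the beehive dataset we'll use later):

library(dplyr)

# A "from" range: part numbers and prices (hypothetical example data)
prices <- tibble(part = c("A-1", "B-2", "C-3"), price = c(1.50, 2.25, 0.75))

# The table we want to pull prices into
orders <- tibble(part = c("B-2", "A-1", "D-4"), qty = c(10, 5, 2))

# Excel equivalent: =VLOOKUP(part, prices_range, 2, FALSE)
# left_join() keeps every row of orders and fills NA where nothing matches
orders %>% left_join(prices, by = "part")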

This post is largely insurance to make sure that I never do that Google search again. First, I’ll load the tidyverse (which includes dplyr) and grab some sample data. I’ll also load tibbletime to prepare the data more easily.

library(tidyverse)
library(tibbletime)

For some example data I hopped onto Kaggle and grabbed a recent dataset about some beehives that looked interesting. For the purpose of this exercise I'm going to calculate the hourly average temperature and bee departures so that we can easily merge the two together.

temp <- read_csv("data/temperature_schwartau.csv")
## Parsed with column specification:
## cols(
##   timestamp = col_datetime(format = ""),
##   temperature = col_double()
## )
temp
## # A tibble: 253,430 x 2
##    timestamp           temperature
##    <dttm>                    <dbl>
##  1 2017-01-01 14:10:00        NA  
##  2 2017-01-01 14:15:00        12.3
##  3 2017-01-01 14:20:00        12.3
##  4 2017-01-01 14:25:00        12.3
##  5 2017-01-01 14:30:00        12.4
##  6 2017-01-01 14:35:00        12.4
##  7 2017-01-01 14:40:00        12.5
##  8 2017-01-01 14:45:00        12.5
##  9 2017-01-01 14:50:00        12.5
## 10 2017-01-01 14:55:00        12.5
## # ... with 253,420 more rows
flow <- read_csv("data/flow_schwartau.csv")
## Parsed with column specification:
## cols(
##   timestamp = col_datetime(format = ""),
##   flow = col_double()
## )
flow
## # A tibble: 2,513,836 x 2
##    timestamp            flow
##    <dttm>              <dbl>
##  1 2017-01-01 14:15:00     0
##  2 2017-01-01 14:16:00     0
##  3 2017-01-01 14:17:00     0
##  4 2017-01-01 14:18:00     0
##  5 2017-01-01 14:19:00     0
##  6 2017-01-01 14:20:00     0
##  7 2017-01-01 14:21:00     0
##  8 2017-01-01 14:22:00     0
##  9 2017-01-01 14:23:00     0
## 10 2017-01-01 14:24:00     0
## # ... with 2,513,826 more rows

To get the data ready for this demo I’m going to take the hourly average flow out of the hive (positive values) and the hourly average temperature.

flow <- flow %>%
  drop_na() %>%
  mutate(exit_flow = case_when( # keep only departures (non-negative flow)
    flow >= 0 ~ flow
  )) %>%
  drop_na() %>% # drop all arrival rows, which are now NA
  arrange(timestamp) %>% # tbl_time needs the timestamps to be ordered
  tbl_time(index = timestamp) %>% # convert to tbl_time for ease of grouping by hour
  collapse_by(period = "hourly", clean = TRUE) %>% # "collapse" timestamps to hour boundaries
  group_by(timestamp) %>% # group by the collapsed hourly column
  summarise(mean_exit_flow = mean(exit_flow)) # get the hourly average
flow
## # A time tibble: 20,953 x 2
## # Index: timestamp
##    timestamp           mean_exit_flow
##    <dttm>                       <dbl>
##  1 2017-01-01 15:00:00        0      
##  2 2017-01-01 16:00:00        0      
##  3 2017-01-01 17:00:00        0      
##  4 2017-01-01 18:00:00        0      
##  5 2017-01-01 19:00:00        0.00847
##  6 2017-01-01 20:00:00        0      
##  7 2017-01-01 21:00:00        0      
##  8 2017-01-01 22:00:00        0      
##  9 2017-01-01 23:00:00        0      
## 10 2017-01-02 00:00:00        0      
## # ... with 20,943 more rows
temp <- temp %>%
  drop_na() %>%
  arrange(timestamp) %>%
  tbl_time(index = timestamp) %>%
  collapse_by(period = "hourly", clean = TRUE) %>%
  group_by(timestamp) %>%
  summarise(mean_temp = mean(temperature))
temp
## # A time tibble: 20,953 x 2
## # Index: timestamp
##    timestamp           mean_temp
##    <dttm>                  <dbl>
##  1 2017-01-01 15:00:00      12.4
##  2 2017-01-01 16:00:00      13.4
##  3 2017-01-01 17:00:00      16.7
##  4 2017-01-01 18:00:00      16.4
##  5 2017-01-01 19:00:00      13.7
##  6 2017-01-01 20:00:00      14.0
##  7 2017-01-01 21:00:00      12.4
##  8 2017-01-01 22:00:00      17.3
##  9 2017-01-01 23:00:00      17.8
## 10 2017-01-02 00:00:00      20.0
## # ... with 20,943 more rows
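As an aside, if you'd rather not depend on tibbletime, a roughly equivalent hourly roll-up can be done with lubridate. This is a sketch under my own assumptions: ceiling_date() mimics collapse_by(clean = TRUE)'s end-of-hour timestamps, though readings that land exactly on an hour boundary may be binned slightly differently, and temp_raw is a hypothetical name standing in for the temperature data as originally read in.

library(lubridate)

# Roughly equivalent to the tbl_time/collapse_by pipeline above:
# bucket each reading into the hour it belongs to, then average.
temp_hourly <- temp_raw %>%
  drop_na() %>%
  group_by(timestamp = ceiling_date(timestamp, "hour")) %>%
  summarise(mean_temp = mean(temperature))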

You can see that the timestamp columns of the two tables match exactly, with one row for each hour, while the value columns differ. This is exactly the kind of scenario I might use VLOOKUP for in Excel, especially if the data isn't ordered or if I don't know whether both datasets are complete.

As a side note, the flow dataset has 2.5 million rows and the temperature dataset has 250 thousand rows. Summarizing these by hour into tables of about 21 thousand rows and then combining them would be a trying task for Excel, but for R it’s a breeze.

To combine these two in R, I will use a join function from dplyr. VLOOKUP works like left_join() or right_join() in that it looks for matches to one table in another table; where VLOOKUP returns an #N/A error for a missing match, though, the dplyr joins fill in NA. We also have the option of full_join(), which returns all rows from both tables with NA wherever there isn't a match, and inner_join(), which only returns rows that have a match in both tables.
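Here's a toy illustration of the difference, using two made-up tables:

left_tbl  <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
right_tbl <- tibble(id = c(2, 3, 4), y = c("B", "C", "D"))

left_join(left_tbl, right_tbl, by = "id")  # all rows of left_tbl;  y is NA for id 1
right_join(left_tbl, right_tbl, by = "id") # all rows of right_tbl; x is NA for id 4
inner_join(left_tbl, right_tbl, by = "id") # only ids 2 and 3, which appear in both
full_join(left_tbl, right_tbl, by = "id")  # ids 1 through 4, NA on whichever side is missing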

Since I have created a nice clean dataset with matching timestamp columns, I will get the same result no matter which join function I use. dplyr will try to guess which column I want to join by and print its guess as a message unless I specify the column with by =.

joined_table <- flow %>%
  left_join(temp, by = "timestamp")
joined_table
## # A time tibble: 20,953 x 3
## # Index: timestamp
##    timestamp           mean_exit_flow mean_temp
##    <dttm>                       <dbl>     <dbl>
##  1 2017-01-01 15:00:00        0            12.4
##  2 2017-01-01 16:00:00        0            13.4
##  3 2017-01-01 17:00:00        0            16.7
##  4 2017-01-01 18:00:00        0            16.4
##  5 2017-01-01 19:00:00        0.00847      13.7
##  6 2017-01-01 20:00:00        0            14.0
##  7 2017-01-01 21:00:00        0            12.4
##  8 2017-01-01 22:00:00        0            17.3
##  9 2017-01-01 23:00:00        0            17.8
## 10 2017-01-02 00:00:00        0            20.0
## # ... with 20,943 more rows

Pretty easy! Preparing this dataset took more work than joining it, but there are a lot of cases where the data will lend itself to a join right out of the box.

Since we’ve made it this far, let’s take a look at the relationship between the hourly average number of bees exiting the hive and the hourly average temperature.

joined_table %>%
  ggplot(aes(x = mean_exit_flow, y = mean_temp)) +
  geom_point(alpha = .05) +
  labs(title = "Hourly Average Schwartau Beehive Activity",
       subtitle = "2017-01-01 through 2019-05-31",
       caption = "Data source: kaggle.com/se18m502/bee-hive-metrics") +
  xlab("Average Hourly Exit") +
  ylab("Average Hourly Temperature (C)") +
  theme_bw()

It looks like it needs to be about 10 C (50 F) before these bees want to head outdoors, and above 30 C (86 F) there is a huge spike in activity.

There is a pretty thick line of inactivity at all temperatures, though, which makes me wonder if the beehive has been active the whole time it was being observed. Let’s make a plot of activity over time.

joined_table %>%
  ggplot(aes(x = timestamp, y = mean_exit_flow, color = mean_temp)) +
  geom_point() +
  labs(title = "Hourly Average Schwartau Beehive Activity", 
       subtitle = "2017-01-01 through 2019-05-31", 
       caption = "Data source: kaggle.com/se18m502/bee-hive-metrics",
       color = "Average\nTemp\n(C)") +
  xlab("Time") +
  ylab("Average Hourly Exit") +
  scale_colour_gradient(low = "black", high = "red") +
  theme_bw()

As I expected, the beehive activity is a function of both temperature and time of year. Also, there are some times of year when bees are exiting the hive twenty-four hours a day.
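If I wanted to confirm that around-the-clock activity directly rather than eyeballing the plot, one quick follow-up (my own sketch, not part of the original analysis) would be to average departures by hour of the day within each month:

joined_table %>%
  as_tibble() %>% # drop the tbl_time class since we're leaving the index behind
  mutate(month = lubridate::month(timestamp, label = TRUE),
         hour = lubridate::hour(timestamp)) %>%
  group_by(month, hour) %>%
  summarise(mean_exit_flow = mean(mean_exit_flow)) %>%
  ggplot(aes(x = hour, y = mean_exit_flow)) +
  geom_col() +
  facet_wrap(~ month) +
  labs(title = "Average Departures by Hour of Day and Month") +
  theme_bw()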

There you have it! While there's nothing wrong with using Excel, especially when it's the medium your colleagues are familiar with, R is much faster, more reproducible, and more flexible. I can easily read through the code I just wrote and follow what's happening, but when I get an Excel workbook with similar calculations, it takes quite a bit of time to figure out what's going on.
