Understanding data.table Rolling Joins

June 5, 2016
By

(This article was first published on bRogramming, and kindly contributed to R-bloggers)

Introduction

Rolling joins in data.table are incredibly useful, but not that well documented. I wrote this to help myself figure out how to use them and perhaps it can help you too.
library(data.table)

The Setup

Imagine we have an eCommerce website that uses a third party (like PayPal) to handle payments. We track user sessions on our website and PayPal tracks our payments, and we would like to attribute sales to user sessions. This way we can answer all kinds of questions along the lines of “What types of sessions lead to sales?”
Lets create some example data for a diverse group of customers.
Indecisive Isabel shops a lot before she buys anything.
isabel_website <- data.table(name = rep('Indecisive Isabel', 5),
session_start_time = as.POSIXct(c('2016-01-01 11:01', '2016-01-02 8:59', '2016-01-05 18:18', '2016-01-07 19:03', '2016-01-08 19:01')))
isabel_paypal <- data.table(name = 'Indecisive Isabel', purchase_time = as.POSIXct('2016-01-08 19:10'))
Spendy Sally visits the website once and makes multiple purchases.
sally_website <- data.table(name = 'Spendy Sally', session_start_time = as.POSIXct('2016-01-03 10:00'))
sally_paypal <- data.table(name = rep('Spendy Sally', 2), purchase_time = as.POSIXct(c('2016-01-03 10:06', '2016-01-03 10:15')))
Frequent Francis comes to the site a lot and sometimes buys things.
francis_website <- data.table(name = rep('Frequent Francis', 6),
session_start_time = as.POSIXct(c('2016-01-02 13:09', '2016-01-03 19:22', '2016-01-08 8:44', '2016-01-08 20:22', '2016-01-10 17:36', '2016-01-15 16:56')))
francis_paypal <- data.table(name = rep('Frequent Francis', 3), purchase_time = as.POSIXct(c('2016-01-03 19:28', '2016-01-08 20:33', '2016-01-10 17:46')))
Error-prone Erica has mysteriously managed to make a purchase before ever visiting the site!
erica_website <- data.table(name = rep('Error-prone Erica', 2),
session_start_time = as.POSIXct(c('2016-01-04 19:12', '2016-01-04 21:05')))
erica_paypal <- data.table(name = 'Error-prone Erica', purchase_time = as.POSIXct('2016-01-03 08:02'))
Visitor Vivian visits our website a couple times, but never makes a purchase (so she appears in the website data, but not in the payment data).
vivian_website <- data.table(name = rep('Visitor Vivian', 2),
session_start_time = as.POSIXct(c('2016-01-01 9:10', '2016-01-09 2:15')))
vivian_paypal <- erica_paypal[0] # has 0 rows, but the same column names/classes
And Mom sent money to my PayPal account before my website was up and running (so she appears in the payment data, but not in the website data).
mom_website <- vivian_website[0] # has 0 rows, but the same column names/classes
mom_paypal <- data.table(name = 'Mom', purchase_time = as.POSIXct('2015-12-02 17:58'))
Combine these into two data.tables:
website <- rbindlist(list(isabel_website, sally_website, francis_website, erica_website, vivian_website, mom_website))
paypal <- rbindlist(list(isabel_paypal, sally_paypal, francis_paypal, erica_paypal, vivian_paypal, mom_paypal))
To keep things straight, lets give each website session a unique ID and each payment a unique ID.
website[, session_id:=.GRP, by = .(name, session_start_time)]
paypal[, payment_id:=.GRP, by = .(name, purchase_time)]
Finally, have a look at the data.
website
##                  name  session_start_time session_id
## 1: Indecisive Isabel 2016-01-01 11:01:00 1
## 2: Indecisive Isabel 2016-01-02 08:59:00 2
## 3: Indecisive Isabel 2016-01-05 18:18:00 3
## 4: Indecisive Isabel 2016-01-07 19:03:00 4
## 5: Indecisive Isabel 2016-01-08 19:01:00 5
## 6: Spendy Sally 2016-01-03 10:00:00 6
## 7: Frequent Francis 2016-01-02 13:09:00 7
## 8: Frequent Francis 2016-01-03 19:22:00 8
## 9: Frequent Francis 2016-01-08 08:44:00 9
## 10: Frequent Francis 2016-01-08 20:22:00 10
## 11: Frequent Francis 2016-01-10 17:36:00 11
## 12: Frequent Francis 2016-01-15 16:56:00 12
## 13: Error-prone Erica 2016-01-04 19:12:00 13
## 14: Error-prone Erica 2016-01-04 21:05:00 14
## 15: Visitor Vivian 2016-01-01 09:10:00 15
## 16: Visitor Vivian 2016-01-09 02:15:00 16
paypal
##                 name       purchase_time payment_id
## 1: Indecisive Isabel 2016-01-08 19:10:00 1
## 2: Spendy Sally 2016-01-03 10:06:00 2
## 3: Spendy Sally 2016-01-03 10:15:00 3
## 4: Frequent Francis 2016-01-03 19:28:00 4
## 5: Frequent Francis 2016-01-08 20:33:00 5
## 6: Frequent Francis 2016-01-10 17:46:00 6
## 7: Error-prone Erica 2016-01-03 08:02:00 7
## 8: Mom 2015-12-02 17:58:00 8

The Joins

Before doing any rolling joins, I like to create a separate date/time column in each table to join on because one of the two tables loses it’s date/time field and I can never remember which.
website[, join_time:=session_start_time]
paypal[, join_time:=purchase_time]
Next, set keys on each table. The last key column is the one the rolling join will “roll” on. We want to first join on name and then within each name, match website sessions to purchases. So we key on name first, then on the newly created join_time.
setkey(website, name, join_time)
setkey(paypal, name, join_time)

Rolling Forward

Now let’s answer the question “what website session immediately preceded each payment?”
website[paypal, roll = T] # equivalent to website[paypal, roll = Inf]
##                 name  session_start_time session_id           join_time
## 1: Error-prone Erica NA 2016-01-03 08:02:00
## 2: Frequent Francis 2016-01-03 19:22:00 8 2016-01-03 19:28:00
## 3: Frequent Francis 2016-01-08 20:22:00 10 2016-01-08 20:33:00
## 4: Frequent Francis 2016-01-10 17:36:00 11 2016-01-10 17:46:00
## 5: Indecisive Isabel 2016-01-08 19:01:00 5 2016-01-08 19:10:00
## 6: Mom NA 2015-12-02 17:58:00
## 7: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:06:00
## 8: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:15:00
## purchase_time payment_id
## 1: 2016-01-03 08:02:00 7
## 2: 2016-01-03 19:28:00 4
## 3: 2016-01-08 20:33:00 5
## 4: 2016-01-10 17:46:00 6
## 5: 2016-01-08 19:10:00 1
## 6: 2015-12-02 17:58:00 8
## 7: 2016-01-03 10:06:00 2
## 8: 2016-01-03 10:15:00 3
Notice several things about this result:
  • Each payment is matched to the closest preceding payment. all(purchase_time > session_start_time, na.rm = T) evaluates to TRUE.
  • Payments with no preceding sessions still appear in the result (that is, nrow(result) == nrow(paypal)).
  • Visitor Vivian does not appear in the results because she does not appear in the paypal table.
  • Mom’s “purchase” has no website session associated with it because she has never visited the website at all.
  • Error-prone Erica’s mysterious purchase has no website session associated with it because she never visited the website prior to her purchase.
  • Spendy Sally’s one website session is matched to both of her purchases.

Rolling Backward

Now lets switch the order of the two tables and answer the question “which sessions led to a purchase?” In this case, we want to match payments to website sessions, so long as the payment occurred after the beginning of the website session.
paypal[website, roll = -Inf]
##                  name       purchase_time payment_id           join_time
## 1: Error-prone Erica NA 2016-01-04 19:12:00
## 2: Error-prone Erica NA 2016-01-04 21:05:00
## 3: Frequent Francis 2016-01-03 19:28:00 4 2016-01-02 13:09:00
## 4: Frequent Francis 2016-01-03 19:28:00 4 2016-01-03 19:22:00
## 5: Frequent Francis 2016-01-08 20:33:00 5 2016-01-08 08:44:00
## 6: Frequent Francis 2016-01-08 20:33:00 5 2016-01-08 20:22:00
## 7: Frequent Francis 2016-01-10 17:46:00 6 2016-01-10 17:36:00
## 8: Frequent Francis NA 2016-01-15 16:56:00
## 9: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-01 11:01:00
## 10: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-02 08:59:00
## 11: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-05 18:18:00
## 12: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-07 19:03:00
## 13: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-08 19:01:00
## 14: Spendy Sally 2016-01-03 10:06:00 2 2016-01-03 10:00:00
## 15: Visitor Vivian NA 2016-01-01 09:10:00
## 16: Visitor Vivian NA 2016-01-09 02:15:00
## session_start_time session_id
## 1: 2016-01-04 19:12:00 13
## 2: 2016-01-04 21:05:00 14
## 3: 2016-01-02 13:09:00 7
## 4: 2016-01-03 19:22:00 8
## 5: 2016-01-08 08:44:00 9
## 6: 2016-01-08 20:22:00 10
## 7: 2016-01-10 17:36:00 11
## 8: 2016-01-15 16:56:00 12
## 9: 2016-01-01 11:01:00 1
## 10: 2016-01-02 08:59:00 2
## 11: 2016-01-05 18:18:00 3
## 12: 2016-01-07 19:03:00 4
## 13: 2016-01-08 19:01:00 5
## 14: 2016-01-03 10:00:00 6
## 15: 2016-01-01 09:10:00 15
## 16: 2016-01-09 02:15:00 16
In this result
  • Each website session is match to the nearest following payment. all(session_start_time > purchase_time, na.rm = T) evaluates toTRUE.
  • Mom does not appear because she has no record in the website table.
  • Visitor Vivian’s sessions are not matched to any purchases because she hasn’t purchased anything.
  • Neither of Erica’s website sessions are matched to her purchase because it took place before both sessions.
  • Frequent Francis’s most recent session isn’t matched to a purchase because she hasn’t made a purchase after that session.
  • All of Indecisive Isabel’s sessions are matched to her one purchase. In fact, several purchases appear more than once.

Rolling Windows

What if we wanted to add an additional criteria to the rolling join above: match payments to website sessions, so long as the payment occurred after the beginning of the website session and within 12 hours of the website session?
twelve_hours <- 60*60*20 # 12 hours = 60 sec * 60 min * 12 hours
paypal[website, roll = -twelve_hours]
##                  name       purchase_time payment_id           join_time
## 1: Error-prone Erica NA 2016-01-04 19:12:00
## 2: Error-prone Erica NA 2016-01-04 21:05:00
## 3: Frequent Francis NA 2016-01-02 13:09:00
## 4: Frequent Francis 2016-01-03 19:28:00 4 2016-01-03 19:22:00
## 5: Frequent Francis 2016-01-08 20:33:00 5 2016-01-08 08:44:00
## 6: Frequent Francis 2016-01-08 20:33:00 5 2016-01-08 20:22:00
## 7: Frequent Francis 2016-01-10 17:46:00 6 2016-01-10 17:36:00
## 8: Frequent Francis NA 2016-01-15 16:56:00
## 9: Indecisive Isabel NA 2016-01-01 11:01:00
## 10: Indecisive Isabel NA 2016-01-02 08:59:00
## 11: Indecisive Isabel NA 2016-01-05 18:18:00
## 12: Indecisive Isabel NA 2016-01-07 19:03:00
## 13: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-08 19:01:00
## 14: Spendy Sally 2016-01-03 10:06:00 2 2016-01-03 10:00:00
## 15: Visitor Vivian NA 2016-01-01 09:10:00
## 16: Visitor Vivian NA 2016-01-09 02:15:00
## session_start_time session_id
## 1: 2016-01-04 19:12:00 13
## 2: 2016-01-04 21:05:00 14
## 3: 2016-01-02 13:09:00 7
## 4: 2016-01-03 19:22:00 8
## 5: 2016-01-08 08:44:00 9
## 6: 2016-01-08 20:22:00 10
## 7: 2016-01-10 17:36:00 11
## 8: 2016-01-15 16:56:00 12
## 9: 2016-01-01 11:01:00 1
## 10: 2016-01-02 08:59:00 2
## 11: 2016-01-05 18:18:00 3
## 12: 2016-01-07 19:03:00 4
## 13: 2016-01-08 19:01:00 5
## 14: 2016-01-03 10:00:00 6
## 15: 2016-01-01 09:10:00 15
## 16: 2016-01-09 02:15:00 16
Now Indecisive Isabel’s last session only is associated with a purchase.

The rollends Argument

Recall the first join from above, matching the preceding website session to each payment.
website[paypal, roll = T] # equivalent to website[paypal, roll = T, rollends = c(F, T)]
##                 name  session_start_time session_id           join_time
## 1: Error-prone Erica NA 2016-01-03 08:02:00
## 2: Frequent Francis 2016-01-03 19:22:00 8 2016-01-03 19:28:00
## 3: Frequent Francis 2016-01-08 20:22:00 10 2016-01-08 20:33:00
## 4: Frequent Francis 2016-01-10 17:36:00 11 2016-01-10 17:46:00
## 5: Indecisive Isabel 2016-01-08 19:01:00 5 2016-01-08 19:10:00
## 6: Mom NA 2015-12-02 17:58:00
## 7: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:06:00
## 8: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:15:00
## purchase_time payment_id
## 1: 2016-01-03 08:02:00 7
## 2: 2016-01-03 19:28:00 4
## 3: 2016-01-08 20:33:00 5
## 4: 2016-01-10 17:46:00 6
## 5: 2016-01-08 19:10:00 1
## 6: 2015-12-02 17:58:00 8
## 7: 2016-01-03 10:06:00 2
## 8: 2016-01-03 10:15:00 3
What if we want the rolling join to handle Error-prone Erica’s case differently? Perhaps in cases like hers, where there is a purchase with no preceding session, we prefer the user’s first website session to be matched to the offending purchase. We can use the rollends argument for this. From the data.table documentation (?data.table),
rollends[1]=TRUE will roll the first value backwards if the value is before it
website[paypal, roll = T, rollends = c(T, T)] # equivalent to website[paypal, roll = T, rollends = T]
##                 name  session_start_time session_id           join_time
## 1: Error-prone Erica 2016-01-04 19:12:00 13 2016-01-03 08:02:00
## 2: Frequent Francis 2016-01-03 19:22:00 8 2016-01-03 19:28:00
## 3: Frequent Francis 2016-01-08 20:22:00 10 2016-01-08 20:33:00
## 4: Frequent Francis 2016-01-10 17:36:00 11 2016-01-10 17:46:00
## 5: Indecisive Isabel 2016-01-08 19:01:00 5 2016-01-08 19:10:00
## 6: Mom NA 2015-12-02 17:58:00
## 7: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:06:00
## 8: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:15:00
## purchase_time payment_id
## 1: 2016-01-03 08:02:00 7
## 2: 2016-01-03 19:28:00 4
## 3: 2016-01-08 20:33:00 5
## 4: 2016-01-10 17:46:00 6
## 5: 2016-01-08 19:10:00 1
## 6: 2015-12-02 17:58:00 8
## 7: 2016-01-03 10:06:00 2
## 8: 2016-01-03 10:15:00 3
In this result, Erica’s first session is matched to her purchase, even though the session was after her purchase. Mom’s “purchase” still has no matching session because Mom does not appear in the website table. So all(purchase_time > session_start_time, na.rm = T) no longer evaluates to TRUE.
What if we want to perform the same join as above, but only returning matches for payments with sessions before and after?
website[paypal, roll = T, rollends = c(F, F)] # equivalent to website[paypal, roll = T, rollends = F]
##                 name  session_start_time session_id           join_time
## 1: Error-prone Erica NA 2016-01-03 08:02:00
## 2: Frequent Francis 2016-01-03 19:22:00 8 2016-01-03 19:28:00
## 3: Frequent Francis 2016-01-08 20:22:00 10 2016-01-08 20:33:00
## 4: Frequent Francis 2016-01-10 17:36:00 11 2016-01-10 17:46:00
## 5: Indecisive Isabel NA 2016-01-08 19:10:00
## 6: Mom NA 2015-12-02 17:58:00
## 7: Spendy Sally NA 2016-01-03 10:06:00
## 8: Spendy Sally NA 2016-01-03 10:15:00
## purchase_time payment_id
## 1: 2016-01-03 08:02:00 7
## 2: 2016-01-03 19:28:00 4
## 3: 2016-01-08 20:33:00 5
## 4: 2016-01-10 17:46:00 6
## 5: 2016-01-08 19:10:00 1
## 6: 2015-12-02 17:58:00 8
## 7: 2016-01-03 10:06:00 2
## 8: 2016-01-03 10:15:00 3
In this result, the purchases of Error-prone Erica and Mom are unmatched because they have no preceding sessions, and Spendy Sally’s two purchases are unmatched because they have no following website session.
Note that when roll is set to a negative number, the meaning of the two rollends elements kind of flip-flops:
website[paypal, roll = -Inf, rollends = c(F, T)] # default when roll < 0 is rollends = c(T, F)
##                 name  session_start_time session_id           join_time
## 1: Error-prone Erica NA 2016-01-03 08:02:00
## 2: Frequent Francis 2016-01-08 08:44:00 9 2016-01-03 19:28:00
## 3: Frequent Francis 2016-01-10 17:36:00 11 2016-01-08 20:33:00
## 4: Frequent Francis 2016-01-15 16:56:00 12 2016-01-10 17:46:00
## 5: Indecisive Isabel 2016-01-08 19:01:00 5 2016-01-08 19:10:00
## 6: Mom NA 2015-12-02 17:58:00
## 7: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:06:00
## 8: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:15:00
## purchase_time payment_id
## 1: 2016-01-03 08:02:00 7
## 2: 2016-01-03 19:28:00 4
## 3: 2016-01-08 20:33:00 5
## 4: 2016-01-10 17:46:00 6
## 5: 2016-01-08 19:10:00 1
## 6: 2015-12-02 17:58:00 8
## 7: 2016-01-03 10:06:00 2
## 8: 2016-01-03 10:15:00 3
In this example,
  • Each payment is matched to the nearest following website session (because of roll = -Inf).
  • Error-prone Erica’s purchase is not matched to the following session because there is no previous session (due to rollends[1] = F).
  • Spendy Sally’s purchases are joined to her most recent website session, even though it occurred before her purchases (because ofrollends[2] = T).

To leave a comment for the author, please follow the link and comment on their blog: bRogramming.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)