non-equi joins in data.table

[This article was first published on HighlandR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I have been toying with some of the advent of code challenges (I am way behind though!).

For day 5, I had to create a function, and I’m writing this up, because it’s an example of a non-equi join between two tables.
In this particular sitation, there are are no common columns between the two tables, so my usual data.table hack of copying the columns of interest, renaming themjoin_col, and then keying them both does not work.

Here’s the function:

find_matches <- function(input_dt, lookup_dt) {
 res <-  lookup_dt[input_dt,
            .(V1 = i.V1,
              dest_start = x.dest_start,
              source_start = x.source_start,
              source_end = x.source_end),
            on = .(source_start <= V1,
                   source_end >= V1)
  ][, dest := fcase(is.na(source_start), V1,
                    !is.na(source_start),V1 - source_start + dest_start)
  ][,.(V1 = dest)]
 return(res)

}

I want to join the main table, input_dt with the lookup_dt. Because data.table uses right joins, and removes the joining column from the main table (unless you tell it otherwise), I am being very specific.

This bit is where I specify the column names I want to return, and which table they come from.

i.col_name is the main / large/ right hand side table

x.col_name is the smaller/ left hand side table

Here I’m saying to keep V1 from the main table, and dest_start, source_start and source_end from the smaller table:

            .(V1 = i.V1,
              dest_start = x.dest_start,
              source_start = x.source_start,
              source_end = x.source_end),
           

Here is the non-equi join bit.

on = .(source_start <= V1,
                   source_end >= V1)

The tables should join when V1 is between the source start and source end.

There will be rows where these conditions are not met, these will return NA. This section of code deals with those – any NA’s are replaced with the original value and no further calculation is required.

[, dest := fcase(is.na(source_start), V1,
                    !is.na(source_start),V1 - source_start + dest_start)
  ]

The rest of the code returns the vector of interest.

The main takeway from this is to understand:

  • data.table uses right joins by default
  • the join column in the main table will be removed from the resulting joined table, unless you tell it otherwise
  • Use i.col_name to specify columns from the main table
  • Use x.col_name to specify columns from the smaller / lookup table
  • you can use conditions in your on syntax to create equi or non equi joins
To leave a comment for the author, please follow the link and comment on their blog: HighlandR.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)