Site icon R-bloggers

non-equi joins in data.table

[This article was first published on HighlandR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I have been toying with some of the advent of code challenges (I am way behind though!).

For day 5, I had to create a function, and I’m writing this up, because it’s an example of a non-equi join between two tables.
In this particular sitation, there are are no common columns between the two tables, so my usual data.table hack of copying the columns of interest, renaming themjoin_col, and then keying them both does not work.

Here’s the function:

find_matches <- function(input_dt, lookup_dt) {
 res <-  lookup_dt[input_dt,
            .(V1 = i.V1,
              dest_start = x.dest_start,
              source_start = x.source_start,
              source_end = x.source_end),
            on = .(source_start <= V1,
                   source_end >= V1)
  ][, dest := fcase(is.na(source_start), V1,
                    !is.na(source_start),V1 - source_start + dest_start)
  ][,.(V1 = dest)]
 return(res)

}

I want to join the main table, input_dt with the lookup_dt. Because data.table uses right joins, and removes the joining column from the main table (unless you tell it otherwise), I am being very specific.

This bit is where I specify the column names I want to return, and which table they come from.

i.col_name is the main / large/ right hand side table

x.col_name is the smaller/ left hand side table

Here I’m saying to keep V1 from the main table, and dest_start, source_start and source_end from the smaller table:

            .(V1 = i.V1,
              dest_start = x.dest_start,
              source_start = x.source_start,
              source_end = x.source_end),
           

Here is the non-equi join bit.

on = .(source_start <= V1,
                   source_end >= V1)

The tables should join when V1 is between the source start and source end.

There will be rows where these conditions are not met, these will return NA. This section of code deals with those – any NA’s are replaced with the original value and no further calculation is required.

[, dest := fcase(is.na(source_start), V1,
                    !is.na(source_start),V1 - source_start + dest_start)
  ]

The rest of the code returns the vector of interest.

The main takeway from this is to understand:

To leave a comment for the author, please follow the link and comment on their blog: HighlandR.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version