Visualizing relational joins

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I want to discuss a nice series of figures used to teach relational join semantics in R for Data Science by Garrett Grolemund and Hadley Wickham, O’Reilly 2016. Below is an example from their book illustrating an inner join:

NewImage

Please read on for my discussion of this diagram and teaching joins.

Teaching joins

In the above diagram two tables are laid out at angles, lines are extended from every row in each table, and a subset of the line intersections are marked as obeying the join condition and hence being in the result. It is a great diagram for discussing the meaning of joins. Being able to organize data transforms in terms of joins is a critical data science skill, so there is great value in being able to teach join theory.

I’ve been trying teaching joins as notional expansion followed by selection as shown in my recent Strata Spark Workshop (material developed in cooperation with Garrett Grolemend and others):

Join

However, de-emphasizing the sequence of operations and the rejected join possibilities is an attractive alternative.

Deriving the diagram

The full or outer join operator is denoted as follows:


Huge full outer join.

Many relational joins can be thought of as conditions (often denoted as “theta”) under an outer or full-join operator applied to appropriately augmented tables (adding “no-match” rows to each table before taking the cross product). The join symbol and theory are supposed to evoke the notion of a cross product, so teaching in terms of that seems sensible.

There are many methods of illustrating a set cross product including:

  • As a menu of possible combinations:

    NewImage
    (expanding this was the basis of my diagram).

  • As a complete bipartite graph:

    Bipartite

  • As a matrix or grid:

    Bipartite2

If you flip the grid to an angle where both sets of source nodes have equivalent roles then you are getting back to the diagram of Grolemund and Wickham:

Bipartite3

The idea is that the very many pairs induced by the full cross product are illustrated, but they are decorations on the crossing lines. This makes it easy to believe these induced nodes are notional, and (as is the case with real databases) only the ones needed are actually produced.

Other diagrams

I looked around for a short while for common SQL diagrams.

Venn diagrams are typically over-promoted as join mnemonics:

NewImage

(Also discussed here.)

However there were some interesting illustrations trying both the grid and bipartite graph styles.

Grids

NewImage

NewImage

NewImage

NewImage

Bipartite Graphs

NewImage

Hybrids

NewImage

Conclusion

I like the idea of teaching all joins as filters (or theta-conditions) of the outer join. The Grolemund/Wickham diagrams are a good tool and have a style that reminds me of diagrammatic proofs of classic geometric theorems.

NewImage

NewImage

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)