Exploring Recursive CTEs with sqldf

[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Bob Horton
Sr. Data Scientist at Microsoft

Common table expressions (CTEs, or “WITH clauses”) are a syntactic feature in SQL that makes it easier to write and use subqueries. They act as views or temporary tables that are only available during the lifetime of a single query. A more sophisticated feature is the “recursive CTE”, which is a common table expression that can call itself, providing a convenient syntax for recursive queries. This is very useful, for example, in following paths of links from record to record, as in graph traversal.

This capability is supported in Postgres, and Microsoft SQL Server (Oracle has similar capabilities with a different syntax), but not in MySql. Perhaps surprisingly, it is supported in SQLite, and since SQLite is the default backend for sqldf, this gives R users a convenient way to experiment with recursive CTEs.


This is the example from the Wikipedia article on hierarchical and recursive queries in SQL; you just pass it to sqldf and it works.


sqldf("WITH RECURSIVE temp (n, fact) AS 
(SELECT 0, 1 -- Initial Subquery
 SELECT n+1, (n+1)*fact FROM temp -- Recursive Subquery 
        WHERE n < 9)
SELECT * FROM temp;")
##    n   fact
## 1  0      1
## 2  1      1
## 3  2      2
## 4  3      6
## 5  4     24
## 6  5    120
## 7  6    720
## 8  7   5040
## 9  8  40320
## 10 9 362880

Other databases may use slightly different syntax (for example, if you want to run this query in Microsoft SQL Server, you need to leave out the word RECURSIVE), but the concept is pretty general. Here the recursive CTE named temp is defined in a with clause. As usual with recursion, you need a base case (here labeled “Initial Subquery”), and a recursive case (“Recursive Subquery”) that performs a select operation on itself. These two cases are put together using a UNION statement (basically the SQL equivalent of rbind). The last line in the query kicks off the computation by running a SELECT statement from this CTE.

Family Tree

Let’s make a toy family tree, so we can use recursion to find all the ancestors of a given person.

Exploring Recursive CTEs with sqldf
family <- data.frame(
  person = c("Alice", "Brian", "Cathy", "Danny", "Edgar", "Fiona", "Gregg", "Heidi", "Irene", "Jerry", "Karla"),
    mom = c(rep(NA, 4), c('Alice', 'Alice', 'Cathy', 'Cathy', 'Cathy', 'Fiona', 'Fiona')),
    dad = c(rep(NA, 4), c('Brian', 'Brian', 'Danny', 'Danny', 'Danny', 'Gregg', 'Gregg')),

We can visualize this family tree as a graph:

Exploring Recursive CTEs with sqldf
nodes <- family$person
edges <- apply(family, 1, function(r) {
  r <- r[c("mom", "dad")]
  r <- r[!is.na(r)]
  list(edges=r)  # c(r['mom'], r['dad'])
names(edges) <- names(nodes) <- nodes
g <- graphNEL(nodes=nodes, edgeL=edges, edgemode='directed')

library(Rgraphviz) # from Bioconductor
g <- layoutGraph(g, layoutType="dot", attrs=list(graph=list(rankdir="BT")))


Pointing from child to parents is backwards from how family trees are normally drawn, but this reflects how the table is laid out. I built the table this way because everybody always has exactly two biological parents, regardless of family structure.

SQLite only supports a single recursive call, so we can’t recurse on both the mom and dad columns. To be able to trace back through both parents, I put the table in “long form”; now each parent is entered in a separate row, with ‘mom’ and ‘dad’ being values in a new column called ‘parent’.


long_family <- gather(family, parent, parent_name, -person)

Exploring Recursive CTEs with sqldf
person parent parent_name
Alice mom NA
Brian mom NA
Cathy mom NA
Danny mom NA
Edgar mom Alice
Fiona mom Alice

Now we can use a recursive CTE to find all the ancestors in the database for a given person:

ancestors_sql <- "
WITH ancestors (name, parent, parent_name, level) AS (
  SELECT person, parent, parent_name, 1 FROM long_family WHERE person = '%s'
        UNION ALL
    SELECT A.person, A.parent, A.parent_name, P.level + 1 
        FROM ancestors P
        JOIN long_family A
        ON P.parent_name = A.person)
SELECT * FROM ancestors ORDER BY level, name, parent"

sqldf(sprintf(ancestors_sql, 'Jerry'))
##     name parent parent_name level
## 1  Jerry    dad       Gregg     1
## 2  Jerry    mom       Fiona     1
## 3  Fiona    dad       Brian     2
## 4  Fiona    mom       Alice     2
## 5  Gregg    dad       Danny     2
## 6  Gregg    mom       Cathy     2
## 7  Alice    dad        <NA>     3
## 8  Alice    mom        <NA>     3
## 9  Brian    dad        <NA>     3
## 10 Brian    mom        <NA>     3
## 11 Cathy    dad        <NA>     3
## 12 Cathy    mom        <NA>     3
## 13 Danny    dad        <NA>     3
## 14 Danny    mom        <NA>     3
sqldf(sprintf(ancestors_sql, 'Heidi'))
##    name parent parent_name level
## 1 Heidi    dad       Danny     1
## 2 Heidi    mom       Cathy     1
## 3 Cathy    dad        <NA>     2
## 4 Cathy    mom        <NA>     2
## 5 Danny    dad        <NA>     2
## 6 Danny    mom        <NA>     2
sqldf(sprintf(ancestors_sql, 'Cathy'))
##    name parent parent_name level
## 1 Cathy    dad        <NA>     1
## 2 Cathy    mom        <NA>     1

We can go the other way as well, and find all of a person’s descendants:

descendants_sql <- "
WITH RECURSIVE descendants (name, parent, parent_name, level) AS (
  SELECT person, parent, parent_name, 1 FROM long_family 
    WHERE person = '%s'
    AND parent = '%s'

    SELECT F.person, F.parent, F.parent_name, D.level + 1 
        FROM descendants D
        JOIN long_family F
        ON F.parent_name = D.name)

SELECT * FROM descendants ORDER BY level, name

sqldf(sprintf(descendants_sql, 'Cathy', 'mom'))
##    name parent parent_name level
## 1 Cathy    mom        <NA>     1
## 2 Gregg    mom       Cathy     2
## 3 Heidi    mom       Cathy     2
## 4 Irene    mom       Cathy     2
## 5 Jerry    dad       Gregg     3
## 6 Karla    dad       Gregg     3
Exploring Recursive CTEs with sqldf

If you work with tree- or graph-structured data in a database, recursive CTEs can make your life much easier. Having them on hand in SQLite and usable through sqldf makes it very easy to get started learning to use them.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)