Exploring Recursive CTEs with sqldf

December 1, 2015
By

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Bob Horton
Sr. Data Scientist at Microsoft

Common table expressions (CTEs, or “WITH clauses”) are a syntactic feature in SQL that makes it easier to write and use subqueries. They act as views or temporary tables that are only available during the lifetime of a single query. A more sophisticated feature is the “recursive CTE”, which is a common table expression that can call itself, providing a convenient syntax for recursive queries. This is very useful, for example, in following paths of links from record to record, as in graph traversal.

This capability is supported in Postgres, and Microsoft SQL Server (Oracle has similar capabilities with a different syntax), but not in MySql. Perhaps surprisingly, it is supported in SQLite, and since SQLite is the default backend for sqldf, this gives R users a convenient way to experiment with recursive CTEs.

Factorials

This is the example from the Wikipedia article on hierarchical and recursive queries in SQL; you just pass it to sqldf and it works.






library('sqldf')

sqldf("WITH RECURSIVE temp (n, fact) AS 
(SELECT 0, 1 -- Initial Subquery
  UNION ALL 
 SELECT n+1, (n+1)*fact FROM temp -- Recursive Subquery 
        WHERE n < 9)
SELECT * FROM temp;")
##    n   fact
## 1  0      1
## 2  1      1
## 3  2      2
## 4  3      6
## 5  4     24
## 6  5    120
## 7  6    720
## 8  7   5040
## 9  8  40320
## 10 9 362880







Other databases may use slightly different syntax (for example, if you want to run this query in Microsoft SQL Server, you need to leave out the word RECURSIVE), but the concept is pretty general. Here the recursive CTE named temp is defined in a with clause. As usual with recursion, you need a base case (here labeled “Initial Subquery”), and a recursive case (“Recursive Subquery”) that performs a select operation on itself. These two cases are put together using a UNION statement (basically the SQL equivalent of rbind). The last line in the query kicks off the computation by running a SELECT statement from this CTE.

Family Tree

Let’s make a toy family tree, so we can use recursion to find all the ancestors of a given person.

Exploring Recursive CTEs with sqldf






family <- data.frame(
  person = c("Alice", "Brian", "Cathy", "Danny", "Edgar", "Fiona", "Gregg", "Heidi", "Irene", "Jerry", "Karla"),
    mom = c(rep(NA, 4), c('Alice', 'Alice', 'Cathy', 'Cathy', 'Cathy', 'Fiona', 'Fiona')),
    dad = c(rep(NA, 4), c('Brian', 'Brian', 'Danny', 'Danny', 'Danny', 'Gregg', 'Gregg')),
  stringsAsFactors=FALSE)







We can visualize this family tree as a graph:

Exploring Recursive CTEs with sqldf






library(graph)
nodes <- family$person
edges <- apply(family, 1, function(r) {
  r <- r[c("mom", "dad")]
  r <- r[!is.na(r)]
  list(edges=r)  # c(r['mom'], r['dad'])
})
names(edges) <- names(nodes) <- nodes
g <- graphNEL(nodes=nodes, edgeL=edges, edgemode='directed')

library(Rgraphviz) # from Bioconductor
g <- layoutGraph(g, layoutType="dot", attrs=list(graph=list(rankdir="BT")))
renderGraph(g)

Family_tree

 






Pointing from child to parents is backwards from how family trees are normally drawn, but this reflects how the table is laid out. I built the table this way because everybody always has exactly two biological parents, regardless of family structure.

SQLite only supports a single recursive call, so we can’t recurse on both the mom and dad columns. To be able to trace back through both parents, I put the table in “long form”; now each parent is entered in a separate row, with ‘mom’ and ‘dad’ being values in a new column called ‘parent’.







library(tidyr)

long_family <- gather(family, parent, parent_name, -person)

knitr::kable(head(long_family))

Exploring Recursive CTEs with sqldf






person parent parent_name
Alice mom NA
Brian mom NA
Cathy mom NA
Danny mom NA
Edgar mom Alice
Fiona mom Alice

Now we can use a recursive CTE to find all the ancestors in the database for a given person:







ancestors_sql <- "
WITH ancestors (name, parent, parent_name, level) AS (
  SELECT person, parent, parent_name, 1 FROM long_family WHERE person = '%s'
        UNION ALL
    SELECT A.person, A.parent, A.parent_name, P.level + 1 
        FROM ancestors P
        JOIN long_family A
        ON P.parent_name = A.person)
SELECT * FROM ancestors ORDER BY level, name, parent"

sqldf(sprintf(ancestors_sql, 'Jerry'))







##     name parent parent_name level
## 1  Jerry    dad       Gregg     1
## 2  Jerry    mom       Fiona     1
## 3  Fiona    dad       Brian     2
## 4  Fiona    mom       Alice     2
## 5  Gregg    dad       Danny     2
## 6  Gregg    mom       Cathy     2
## 7  Alice    dad             3
## 8  Alice    mom             3
## 9  Brian    dad             3
## 10 Brian    mom             3
## 11 Cathy    dad             3
## 12 Cathy    mom             3
## 13 Danny    dad             3
## 14 Danny    mom             3







sqldf(sprintf(ancestors_sql, 'Heidi'))







##    name parent parent_name level
## 1 Heidi    dad       Danny     1
## 2 Heidi    mom       Cathy     1
## 3 Cathy    dad             2
## 4 Cathy    mom             2
## 5 Danny    dad             2
## 6 Danny    mom             2







sqldf(sprintf(ancestors_sql, 'Cathy'))







##    name parent parent_name level
## 1 Cathy    dad             1
## 2 Cathy    mom             1







We can go the other way as well, and find all of a person’s descendants:







descendants_sql <- "
WITH RECURSIVE descendants (name, parent, parent_name, level) AS (
  SELECT person, parent, parent_name, 1 FROM long_family 
    WHERE person = '%s'
    AND parent = '%s'


    UNION ALL
    SELECT F.person, F.parent, F.parent_name, D.level + 1 
        FROM descendants D
        JOIN long_family F
        ON F.parent_name = D.name)

SELECT * FROM descendants ORDER BY level, name
"

sqldf(sprintf(descendants_sql, 'Cathy', 'mom'))







##    name parent parent_name level
## 1 Cathy    mom             1
## 2 Gregg    mom       Cathy     2
## 3 Heidi    mom       Cathy     2
## 4 Irene    mom       Cathy     2
## 5 Jerry    dad       Gregg     3
## 6 Karla    dad       Gregg     3

Exploring Recursive CTEs with sqldf






If you work with tree- or graph-structured data in a database, recursive CTEs can make your life much easier. Having them on hand in SQLite and usable through sqldf makes it very easy to get started learning to use them.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)