# Database Reflection using dplyr

January 22, 2014

(This article was first published on anrprogrammer » R, and kindly contributed to R-bloggers)

At work I write a ton of SQL, and I do most of my querying using R.  The workflow goes:

1. Create a string with the SQL in R
2. Plug the string into fetchQuery (see my previous post)

This solution works relatively well, but I'm a bit unhappy writing strings rather than using function calls.

I began working on my own ORM implementation, but it was slow going and would have taken a long time to get anywhere.  Luckily, I was pleasantly surprised to find that Hadley Wickham's new dplyr package implements much of the ORM functionality I was hoping for.
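To give a feel for the difference, here is a rough sketch of a dplyr-style query against a remote table (the connection details and table/column names here are made up for illustration):

```r
library(dplyr)

# Instead of pasting together a SQL string for fetchQuery(), dplyr lets you
# express the same query with function calls:
src <- src_postgres(dbname = "analytics")      # connection details assumed
events <- tbl(src, "user_events")              # lazily refers to the table
recent <- filter(events, created_at >= "2014-01-01")
recent <- select(recent, user_id, created_at)  # builds SQL; nothing runs yet
# collect(recent) would execute the generated SQL and pull the rows into R
```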

One thing I want out of an ORM is the ability to see every table in our databases all at once.  That functionality, while implementable using the dplyr package, would likely take quite a while on tens of thousands of tables.  So I decided to implement the reflection myself.

```r
#' Get the table information for a postgres database
#' @param config the configuration list
#' @return the table names, columns, and column types of all columns in the database
getTableInformation <- function(config = config.gp) {
  tables <- fetchQuery(
    "SELECT table_name, column_name, data_type
     FROM information_schema.columns
     WHERE table_name NOT LIKE '%prt%'
       AND table_name NOT LIKE '%ext%'
       AND table_name NOT LIKE '%tmp%'
     ORDER BY 1, 2",
    config
  )
  tables
}
```
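The query above returns one row per (table, column) pair, which `reflectDatabase` below then splits up by table.  A small stand-alone illustration of that shape, with made-up table and column names:

```r
# Hypothetical illustration of what getTableInformation() returns:
# one row per (table, column), which is then split by table name.
info <- data.frame(
  table_name  = c("user_events", "users", "users"),
  column_name = c("user_id", "email", "id"),
  data_type   = c("integer", "text", "integer"),
  stringsAsFactors = FALSE
)
by_table <- split(info, info$table_name)
names(by_table)             # "user_events" "users"
by_table$users$column_name  # "email" "id"
```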

```r
#' Replacement of the normal update function, you don't need to call this.
update <- function(object, ...) {
  args <- list(...)
  for (nm in names(args)) {
    object[[nm]] <- args[[nm]]
  }
  if (is.null(object$select)) {
    if (is.ident(object$from)) {
      var_names <- object$select
    } else {
      var_names <- qry_fields(object$src$con, object$from)
    }
    vars <- lapply(var_names, as.name)
    object$select <- vars
  }
  object$query <- dplyr:::build_query(object)
  object
}
```
```r
#' Function to reflect a database, generalizable to others beyond postgres
#' by simply changing getTableInformation appropriately
reflectDatabase <- function(config = config.gp, envir.name = "tables",
                            subclass = "postgres") {
  if (!(envir.name %in% search())) {
    envir <- new.env(parent = .GlobalEnv)
  } else {
    envir <- as.environment(envir.name)
    detach(envir.name, character.only = TRUE)
  }
  src <- do.call(src_postgres, config)
  tables <- getTableInformation(config)
  tables <- split(tables, tables$table_name)
  lapply(tables, function(i) {
    nm <- ident(i$table_name[1])
    vars <- lapply(i$column_name, as.name)
    tbl <- dplyr::make_tbl(c(subclass, "sql"), src = src, from = nm,
                           select = vars, summarise = FALSE, mutate = FALSE,
                           where = NULL, group_by = NULL, order_by = NULL)
    tbl <- update(tbl)
    assign(nm, tbl, envir = envir)
  })
  attach(envir, name = envir.name)
}
```

```r
searchTables <- function(str, env = "tables") {
  all.tbls <- ls(env)
  all.tbls[grep(str, all.tbls)]
}
```



To use this function, you can simply call

```r
reflectDatabase()
```


and if you’re using a Postgres database, that should be it!

The fun part now, is that I can do things like

```r
res <- inner_join(my_table_1, my_table_2)
```


where my_table_1 and my_table_2 are simply names of tables in my database. This provides me with auto-complete of table names, searchable table names and columns, etc.
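Because the reflected objects are ordinary dplyr tbls, the other verbs compose with them too.  A rough sketch (my_table_1 and my_table_2 stand in for reflected tables, and the column names are made up):

```r
# Nothing runs against the database until you ask for the rows:
res <- inner_join(my_table_1, my_table_2)
res <- filter(res, status == "active")
res <- select(res, user_id, created_at)
collect(res)  # executes the generated SQL and returns a data frame
```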

For example:

```r
searchTables('user')
```


returns all tables in our database with the string “user” in them.
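Under the hood, searchTables is just grep over ls() of the attached environment.  The same idea on a stand-alone vector of (made-up) table names:

```r
# grep() returns the indices of matching names; indexing recovers the names.
tbls <- c("users", "user_events", "orders", "daily_user_stats")
tbls[grep("user", tbls)]  # "users" "user_events" "daily_user_stats"
```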

These are some things I hope to see or find in dplyr, and may try to build myself if they don’t already exist:
1. Case statements in mutate
2. Creating table indexes
3. Type checking of columns, and more informative error messages when nonsensical joins and filters are performed.
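On the first point, nested ifelse() calls can approximate a SQL CASE statement in the meantime.  A minimal local sketch on a plain data frame (column names are made up):

```r
# Approximating CASE WHEN score >= 90 THEN 'A' WHEN score >= 70 THEN 'B'
# ELSE 'C' END with nested ifelse():
df <- data.frame(score = c(95, 72, 40))
df$grade <- ifelse(df$score >= 90, "A",
            ifelse(df$score >= 70, "B", "C"))
df$grade  # "A" "B" "C"
```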

Overall, this package seems like a lot of fun, and I'm excited to work it into my coding!