Database Reflection using dplyr

January 22, 2014
By

(This article was first published on anrprogrammer » R, and kindly contributed to R-bloggers)

At work I write a ton of SQL, and I do most of my querying using R.  The workflow goes:

  1. Create a string with the SQL in R
  2. Plug the string into fetchQuery (see my previous post)

This solution works relatively well, but i’m a bit unhappy writing strings rather than using function calls.

I began working on my own ORM implementation, but it was very slow-go and it would have taken a lot of time to get anywhere.  Luckily, I was pleasantly surprised that Hadley Wickham’s new dplyr package implements much of the ORM I was hoping for.

One thing I want out of an ORM is the ability to see every table in our databases all at once.  That functionality, while implement-able using the dplyr package, would likely take quite a while on tens of thousands of tables.  So I decided to implement the reflection myself.

 

 

#' Get the table information for a postgres database
#' @param config the configuration list
#' @return the table names, columns, and column types of all columns in the database
getTableInformation <- function(config = config.gp) {
  tables <- fetchQuery(
    "SELECT table_name, column_name, data_type 
    FROM information_schema.columns 
    WHERE table_name NOT LIKE '%prt%' 
      AND table_name NOT LIKE '%ext%' 
      AND table_name NOT LIKE '%tmp%' 
    ORDER BY 1, 2",
    config
  )
}

#' Replacement of the normal update function, you don't need to call this.
update <- function(object, ...) {
    args <- list(...)
    for (nm in names(args)) {
        object[[nm]] <- args[[nm]]
    }
    if (is.null(object$select)) {
        if (is.ident(object$from)) {
            var_names <- object$select
        }
        else {
            var_names <- qry_fields(object$src$con, object$from)
        }
        vars <- lapply(var_names, as.name)
        object$select <- vars
    }
    object$query <- dplyr:::build_query(object)
    object
}
#' Function to reflect a database, generalizable to others beyond postgres 
#' by simply changing getTableInformation appropriately
reflectDatabase <- function(config, envir.name = "tables",
                            subclass = "postgres") {
  if (!(envir.name %in% search())) {
    envir <- new.env(parent = .GlobalEnv)
  } else {
    envir <- as.environment(envir.name)
    detach(envir.name, character.only = TRUE)
  }
  src <- do.call(src_postgres, config)
  tables <- getTableInformation(config)
  tables <- split(tables, tables$table_name)
  lapply(tables, function(i) {
    nm <- ident(i$table_name[1])
    vars <- lapply(i$column_name, as.name)
    tbl <- dplyr::make_tbl(c(subclass, "sql"), src = src, from = nm,
                    select = vars, summarise = FALSE, mutate = FALSE,
                    where = NULL, group_by = NULL, order_by = NULL)
    tbl <- update(tbl)
    assign(
      nm,
      tbl,
      envir = envir
    )
  })
  attach(envir, name = envir.name)
}

searchTables <- function(str, env = "tables") {
  all.tbls <- ls(env)
  all.tbls[grep(str, all.tbls)]
}

To use this function, you can simply call

reflectDatabase()

and if you’re using a Postgres database, that should be it!

The fun part now, is that I can do things like

res <- inner_join(my_table_1, my_table_2)

where my_table_1 and my_table_2 are simply names of tables in my database. This provides me with auto-complete of table names, search-able table names and columns, etc.

For example:

searchTables('user')

returns all tables in our database with the string “user” in them.

These are some things I hope to see or find in dplyr, and may try to build myself if they don’t already exist:
1. Case statements in mutate
2. Creating table indexes
3. type checking of columns, and more informative error messages when un-sensible joins and filters are performed.

Overall this package seems like a lot of fun, and i’m excited to try to work it into my coding!


To leave a comment for the author, please follow the link and comment on his blog: anrprogrammer » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.