Some quirks with R and SQL Server by @ellis2013nz


I’ve been writing on this blog less frequently in the past few months. Mostly this is because I’ve been working on a very intensive and engaging professional project that is much more hands-on (ie coding) than I’ve been at work for a while. So a lot of energy has been going into that. One interesting side effect for me has been diving deep into Structured Query Language (SQL), and the Microsoft-flavoured Transact SQL in particular. I’ve used SQL for a long time, but usually with the attitude of “get it out of the database as quickly as possible”. In a situation where this didn’t make sense, I’ve been pleasantly surprised at how powerful and flexible SQL is for the right sort of jobs.

The bulk of the work project is in SQL, but there is an R flavour with a Shiny front end and a bunch of testing and semi-automation of the build process going on in R (long story). Here are a couple of quirky and useful things relating to using R and SQL Server in combination.

Using R to schedule SQL scripts

Imagine a project that combines SQL and R. For example, SQL is used to do a bunch of heavy lifting data management and complex queries in the database; and R is used for statistical modelling and producing polished outputs. This is actually a very common scenario. Relational databases are very powerful tools with many decades of optimisation embedded in them. They aren’t going anywhere soon.

“Whether your first data analysis language is R, Python, Julia or SAS, your second language should be SQL”

Quote by me – something I just thought of.

It’s good practice to keep any non-trivial SQL queries in their own files with a .sql suffix, and to develop them in a database-aware environment. With SQL Server, this will often mean SQL Server Management Studio. But when you come to the final stage of a reproducible analysis, you don’t want to be flicking between two applications; certainly not in terms of actually running anything. Although SQL Server since 2016 can include R code in a stored procedure, a basically statistical project is still going to have the workflow of “database first, R second”, with the statistical and presentation stage probably developed in RStudio or similar. So it’s very useful to be able to run a bunch of .sql scripts from R. This is commonly done by reading in a script with readLines() and executing it on the database via RODBC or other database connection software.
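In its simplest form, that pattern looks something like the following sketch (the DSN and file name here are made up):

library(RODBC)

# connect via a DSN defined on the system:
ch <- odbcConnect("my_dsn")

# read the whole script in as a single string and send it to the server:
sql <- paste(readLines("transform-data.sql", warn = FALSE), collapse = "\n")
res <- sqlQuery(ch, sql)

That works fine until the script has encoding surprises, GO separators, or multiple batches that need individual error handling.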

I developed an R function sql_execute() to make this process efficient. The version below is available in my pssmisc R package (only on GitHub at this point) which is a grab-bag of multi-use functionality associated with this blog. The original version had a few project-specific features as well as a few cut corners. It also had an accompanying simple function that uses sql_execute() to run all the SQL scripts in a given folder in order.
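That folder-runner isn’t reproduced in this post, but it would amount to something like this sketch (simplified; arguments in ... are passed through to sql_execute(), eg sub_in, sub_out and log_table):

# run all the .sql scripts in a folder, in alphabetical order
sql_execute_all <- function(channel, folder, ...){
  scripts <- sort(list.files(folder, pattern = "\\.sql$", full.names = TRUE))
  for(filename in scripts){
    sql_execute(channel, filename, ...)
  }
}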

The sql_execute() function provides the following benefits:

  • combines the multiple steps of reading in SQL scripts and executing them in a single command
  • handles a common problem where files developed in Management Studio often aren’t saved in an encoding automatically recognised by R
  • allows the use of the GO batch separator, a Microsoft-specific addition to SQL that will cause problems if included in an ODBC query
  • lets you specify a search-and-replace for a string – very useful sometimes if you’re running lots of SQL scripts to be able to say something like “oh by the way, can you change all references to database X to database Y while you’re at it” (see the example call just after this list)
  • lets you specify if an error in one batch should be fatal, or whether to proceed with the rest of the batches from that script
  • logs execution and results in a table in the database.
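For example, a call like the following (connection, file and database names all made up) would run a script that was written against database_x on database_y instead, stopping at the first error:

ch <- RODBC::odbcConnect("some_dsn")
sql_execute(ch, "build-facts.sql", sub_in = "database_x", sub_out = "database_y")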

Dealing with GO was particularly important, because it’s common (and often essential) in T-SQL development. GO divides a single script into batches. The sql_execute() function below embraces this, by splitting the original file into separate queries based on the location of GO, and sending the individual batches one at a time to the server.
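The split itself is a one-liner with stringr. As a toy illustration:

library(stringr)

sql <- "SELECT TOP 5 * FROM some_table\nGO\nSELECT TOP 7 * FROM some_table"

# split wherever a line starts with GO (any case):
str_split(sql, "\\n *[Gg][Oo]", simplify = TRUE)
# returns a 1 x 2 character matrix: one element per batch

Here is the function itself, preceded by a small helper: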

# Helper function to convert the output of Sys.time() into a character string
# without the time zone on it
# 
# @details Not exported.
# @keywords internal
# @param dt an object of class \code{POSIXct}
# @examples
# datetime_ch(Sys.time())
datetime_ch <- function(dt){
  dt <- gsub(" [A-Z]*$", "", as.character(dt))
  dt <- paste0("CAST ('", dt, "' AS DATETIME)")
  return(dt)
}



#' Execute SQL
#'
#' Execute T-SQL in a script, split into batches
#' 
#' @export
#' @importFrom RODBC sqlQuery
#' @importFrom stringr str_split str_length 
#' @details Reads a script of SQL, splits it into separate queries on the basis of any occurrences of \code{GO}
#' in the script, and passes them to the database server for execution one batch at a time.  While the initial use case was for SQL Server, there's no
#' reason why it wouldn't work with other ODBC connections.
#' 
#' The case of \code{GO} is ignored but it has to be the first non-space word on its line of code.
#' 
#' If any batch at any point returns rows of data (eg via a \code{SELECT} statement that does not \code{INSERT} the
#' results into another table or variable on the database), the rest of that batch is not executed.  
#' If that batch was the last batch of SQL
#' in the original file, the results are returned as a data.frame, otherwise it is discarded.
#' 
#' Example SQL code for creating a log suitable for this function:
#' \preformatted{
#' CREATE TABLE some_database.dbo.sql_executed_by_r_log
#' (
#'   log_event_code INT NOT NULL IDENTITY PRIMARY KEY, 
#'   start_time     DATETIME, 
#'   end_time       DATETIME,
#'   sub_out        NVARCHAR(200),
#'   script_name    NVARCHAR(1000),
#'   batch_number   INT,
#'   result         NCHAR(30),
#'   err_mess       VARCHAR(8000),
#'   duration       NUMERIC(18, 2)
#' );
#' }
#' 
#' @param channel connection handle as returned by RODBC::odbcConnect() of class RODBC
#' @param filename file name of an SQL script
#' @param sub_in character string that you want to be replaced with \code{sub_out}.  Useful for a bulk search
#' and replace - for example, if you have a bunch of scripts that you sometimes want
#' to run on one schema and sometimes on another, just automate the search and replace.  Use with caution.
#' @param sub_out character string that you want to replace \code{sub_in} with.
#' @param fixed logical.  If TRUE, \code{sub_in} is a string to be matched as is.  Otherwise it is treated as a regular expression 
#' (eg if fixed = FALSE, then . is a wild card)
#' @param error_action should you stop with an error if a batch gets an error message back from the database?  Any alternative
#' to "stop" means we just keep ploughing on, which may or may not be a bad idea.  Use "stop" unless you know that failure
#' in one part of a script isn't fatal.
#' @param log_table table in the database to record a log of what happened.  Set to NULL if no log table available.  The log_table
#' needs to have (at least) the following columns: start_time, end_time, sub_out, script_name, batch_number, result, err_mess and duration. 
#' See Details for example SQL to create such a log table.
#' @param verbose Logical, gives some control over messages
#' @param ... other arguments to be passed to \code{sqlQuery()}, such as \code{stringsAsFactors = FALSE}.
#' @examples
#' \dontrun{
#' ch <- odbcConnect("some_dsn")
#' sql_execute(ch, "some_file.sql", log_table = "some_database.dbo.sql_executed_by_r_log")
#' }
#' @author Peter Ellis
sql_execute <- function(channel, filename, sub_in = NULL, sub_out = NULL, fixed = TRUE, 
                        error_action = "stop", log_table = NULL, 
                        verbose = TRUE, ...){
  
  # we can't tell in advance what encoding the .sql files are in, so we read it in
  # in two ways (one of which is certain to return gibberish) and choose the version that is recognised as a proper string:
  
  # encoding method 1 (weird Windows encoding):
  file_con <- file(filename, encoding = "UCS-2LE")
  sql1 <- paste(readLines(file_con, warn = FALSE), collapse = "\n")
  close(file_con)
  
  # encoding method 2 (let R work it out - works in most cases):
  file_con <- file(filename)
  sql2 <- paste(readLines(file_con, warn = FALSE), collapse = "\n")
  close(file_con)
  
  # choose between the two encodings, based on which one has a legitimate string length:
  suppressWarnings({
    if(is.na(stringr::str_length(sql2))){
      sql <- sql1
    } else {
      sql <- sql2
    }
  })
  
  # do the find and replace if needed
  if(!is.null(sub_in)){
    sql <- gsub(sub_in, sub_out, sql, fixed = fixed)
  }
  
  # split the SQL into separate commands wherever there is a "GO" at the beginning of a line
  # ("GO" is not ANSI SQL, only works for SQL Server - it indicates the lines above are a batch)
  sql_split <- stringr::str_split(sql, "\\n *[Gg][Oo]", simplify = TRUE)
  
  base_log_entry <- data.frame(
    sub_out         = ifelse(is.null(sub_out), "none", sub_out),
    script_name      = filename,
    stringsAsFactors = FALSE
  )
  
  n_batches <- length(sql_split)
  
  # execute the various separate commands
  for(i in 1:n_batches){
    log_entry              <- base_log_entry
    log_entry$batch_number <- i
    log_entry$result       <- "no error"
    log_entry$err_mess     <- ""
    log_entry$start_time   <- datetime_ch(Sys.time())
    
    if(verbose){message(paste("Executing batch", i, "of", n_batches))}
    
    duration <- system.time({res <- sqlQuery(channel, sql_split[[i]], ...)})
    log_entry$duration <- duration[3]
    
    if(is.data.frame(res)){
      txt <- paste0("Downloaded a data.frame with ", nrow(res), " rows and ",
                    ncol(res), " columns in batch ", i,
                    ". Any commands left in batch ", i, " were not run.")
      if(verbose){message(txt)}
      log_entry$result <- "data.frame"
      
    } 
    if(is.character(res) && length(res) > 0){
      message("\n\nI got this error message:")
      cat(res)
      log_entry$result <- "error"
      log_entry$err_mess <- paste(gsub("'", "", res), collapse = "\n")
      message(paste0("\n\nSomething went wrong with the SQL execution of batch ", i, 
                     " in ", filename, ". \n\nError message from the database is shown above\n\n"))
    }
    
    log_entry$end_time <- datetime_ch(Sys.time())
    
    # Update the log in the database, if we have been given one:
    if(!is.null(log_table)){
      # couldn't get sqlSave to append to a table even when append = TRUE... 
      # see https://stackoverflow.com/questions/36913664/rodbc-error-sqlsave-unable-to-append-to-table
      # so am writing the SQL to update the log by hand:
      log_sql <- with(log_entry, paste0("INSERT INTO ", 
                                        log_table, 
                                        "(start_time, end_time, sub_out, script_name, batch_number, 
                                        result, err_mess, duration)",
                                        " VALUES (", start_time, ", ", end_time, ", '", 
                                        sub_out, "', '", script_name, "', ", batch_number, ", '", result, "', '",
                                        err_mess, "', ", duration, ");"))
      
      log_res <- sqlQuery(channel, log_sql)
      
      
    }
    if(error_action == "stop" && log_entry$result == "error"){
      stop(paste("Stopping due to an error in", filename))
    }
    if(is.data.frame(res)){
      if(i == n_batches){
        return(res)  
      } else {
        warning("Downloaded a data frame from batch ", i, " of SQL, which wasn't the \nlast batch in the file.  This data frame is not kept.")
      }
      
    }
  }
}

One of the things to watch out for in this situation is how running a script via ODBC can give different results from hitting F5 in Management Studio. One key thing to trip up on is what happens when the SQL includes a SELECT statement that doesn’t INSERT the results into another table or variable, but returns them as a table. In this case, ODBC considers its work done and will not execute anything else in that batch beyond the SELECT statement.

To clarify how this works, here is a potentially problematic SQL file:

/*
eg-sql.sql

for testing the sql_execute R function

*/

-- this will return five rows but sql_execute discards them
SELECT TOP 5 * FROM some_table
GO

-- this will return an error
some non-legitimate SQL here that causes an error
go

-- this batch will only get as far as the SELECT TOP 7; the SELECT TOP 10 never runs
SELECT TOP 7 * FROM some_table
SELECT TOP 10 * FROM some_table
GO

If I run that file via sql_execute(ch, "examples/eg-sql.sql"), it does the following:

  • executes the SELECT TOP 5 statement and returns the results as a data frame, which is discarded as it is not the result of the last batch of the script
  • tries to execute the non-legitimate SQL, gets an error and stops.

Alternatively, if I run it via sql_execute(ch, "examples/eg-sql.sql", error_action = "continue") it does the following:

  • executes the SELECT TOP 5 statement and returns the results as a data frame, which is discarded as it is not the result of the last batch of the script
  • tries to execute the non-legitimate SQL, gets an error and prints it to the screen.
  • executes the SELECT TOP 7 statement, returns the results as a data frame, and stops. The SELECT TOP 10 statement is never executed.

An odd quirk with SQL loops cutting short with ODBC

A second quirk that had me puzzled for a while (and indeed I am still puzzled and can’t get a fully reproducible example) seems to relate to the use of SQL WHILE loops in a script executed on the database from R via ODBC. I found many such SQL programs would silently stop after about 20 iterations of the loop under ODBC, even if they worked perfectly in Management Studio. The examples all look like this:

DECLARE @i INT = 1
WHILE @i <= 50
BEGIN
	-- do something that needs looping
	
	
	SET @i = @i + 1
END
GO

BTW, SQL is one of those languages where you avoid loops if you can, and think instead in terms of joining, aggregating and filtering tables. But there are times when it is necessary (for example, performing an action on each table in a database, such as creating a carefully chosen random sample of it in another database - one of the things we had to do in the work project mentioned above).
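As an aside, when the loop really is per-table work, one way to side-step the quirk altogether is to drive the iteration from R: pull the table names back with one query and loop over them client-side. A sketch, assuming an open RODBC handle ch, with the per-table action made up (here just a row count):

# list the user tables in the current database:
tabs <- sqlQuery(ch, "SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES
                      WHERE TABLE_TYPE = 'BASE TABLE'",
                 stringsAsFactors = FALSE)

# do something to each table:
for(j in 1:nrow(tabs)){
  full_name <- paste0(tabs[j, "TABLE_SCHEMA"], ".", tabs[j, "TABLE_NAME"])
  res <- sqlQuery(ch, paste0("SELECT COUNT(1) AS n FROM ", full_name))
  message(full_name, ": ", res$n, " rows")
}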

The solution to this mysterious refusal to go beyond about 20 (it varied) iterations in some loops was to wrap the whole action in a user-defined stored procedure, then execute the procedure. This seems satisfyingly secure in all sorts of ways. The procedure can be kept permanently or blown away depending on what makes sense:

CREATE PROCEDURE do_stuff
AS 
BEGIN
	DECLARE @i INT = 1
	WHILE @i <= 50
	BEGIN
		-- do something that needs looping
	
	
		SET @i = @i + 1
	END
END
GO

EXECUTE do_stuff
DROP PROCEDURE do_stuff
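Note the GO after the procedure definition: CREATE PROCEDURE has to be the first statement in its batch, which is exactly the situation the batch-splitting in sql_execute() is designed to cope with. From R, the wrapped script then runs like any other (file name made up):

sql_execute(ch, "examples/loop-via-procedure.sql")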

Worth noting - T-SQL distinguishes between its functions and stored procedures, whereas R lumps the two types of functionality together. Functions in T-SQL are functions in the strict computer-science sense: they take inputs and return outputs, with strictly no side effects. Stored procedures can have side effects (like creating or modifying tables). R functions, in contrast, can have side effects (and frequently do, eg drawing plots), not just return outputs based on inputs.

No graphic today…
