Site icon R-bloggers

A function to speed up and simplify writing to SQL Server databases in R

[This article was first published on R Blogs – Hutsons-hacks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I had a recent enquiry on our NHS-R community slack channel about which package is best to work with larger datasets, such as 5 million plus records, with high dimensionality. I got to thinking that the DBI and odbc packages would be best for this purpose, as this is still not considered big data, it is more medium data with manageable dimensionality.

The functions can be fiddly to understand, so I thought I would combine all the functions into one lovely function wrapper in R.

Overview of the function

The function below allows you to pass in your connection parameters, for SQL Server trusted connections (I can extend this if you need to specify user name and password for SQL Server databases). The function is made up of below:

library(ggplot)
library(DBI)
library(odbc)
library(dplyr)


write_table_sql <- function(driver, server, db, trusted="True",
                        schema = "dbo", tbl, append=FALSE, 
                        overwrite = TRUE, tbl_name, ...){
  
  params <- list(driver=driver, server=server, 
                 database=db, trusted=trusted, db_schema=schema,
                 db_tbl=tbl, append_bool=append, 
       overwrite_bool=overwrite, table_name=tbl_name, ...)
  
  cat("Establishing connection to:", server, "\n")
  con <- dbConnect(odbc(), 
                   Driver = driver , 
                   Server = server , 
                   Database = db, 
                   Trusted_Connection = "True")
  
  
  if(tbl_name == "" | !is.character(tbl_name) | length(tbl_name)

Stepping through the function we have:

That is the entirety of the function, but this saves on retyping and having to instantiate the connection object before saving to the database.

Using the write_table_sql() function

The function has been created. What I am going to do now is load in the brilliant NHSRdatasets package built by my friend Chris Mainey and we are going to load the ons_mortality() data package. This contains deaths at an aggregate level:

library(ggplot)
mort_ons <- NHSRdatasets::ons_mortality

I have the relevant dataset. I need to store this in my SQL server database, using a trusted connection. This is how simple it is now with the wrapper function:

mort_sql_list <- write_table_sql(driver="SQL Server", 
                              server="localhost\\SQLEXPRESS", 
                              db="RDatabase", 
                              schema="data",
                              tbl=mort_ons, 
                              append=TRUE, 
                              overwrite=FALSE,
                              tbl_name="onsMortality")

Here I pass in:

Running this, the following is returned:

Establishing connection to: localhost\SQLEXPRESS 
Writing to database - please wait... 
Finished writing to database: RDatabase in 0.63 seconds. 
> print(mort_sql_list)
$input_params
$input_params$driver
[1] "SQL Server"

$input_params$server
[1] "localhost\\SQLEXPRESS"

$input_params$database
[1] "RDatabase"

$input_params$trusted
[1] "True"

$input_params$db_schema
[1] "data"

$input_params$db_tbl
# A tibble: 18,803 x 5
   category_1   category_2 counts date       week_no
   < chr>        < chr>       < dbl>        < int>
 1 Total deaths all ages    12968 2010-01-08       1
 2 Total deaths all ages    12541 2010-01-15       2
 3 Total deaths all ages    11762 2010-01-22       3
 4 Total deaths all ages    11056 2010-01-29       4
 5 Total deaths all ages    10524 2010-02-05       5
 6 Total deaths all ages    10117 2010-02-12       6
 7 Total deaths all ages    10102 2010-02-19       7
 8 Total deaths all ages    10295 2010-02-26       8
 9 Total deaths all ages     9981 2010-03-05       9
10 Total deaths all ages     9792 2010-03-12      10
# ... with 18,793 more rows

$input_params$append_bool
[1] TRUE

$input_params$overwrite_bool
[1] FALSE

$input_params$table_name
[1] "onsMortality"

$run_time
[1] 0.63

Accessing individual parameters from the function (params)

To access individual parameters you use the dollar notation to access fields:

mort_sql_list$input_params$table_name
[1] "onsMortality"

Checking the data has made it to SQL Server

I navigate to my SQL Server instance, set on my localhost in this instance, and can find the data under the relevant table name onsMortality:

As you can see the R function has worked. What is great is that I tested this with 50 million records and 12 columns and the DBI function took 24 minutes, whereas with RODBC this take 1 hour plus.

Conclusion

This function is ready to be dropped into your projects and please have a play to add the functionality to the function to get it to work with none trusted connections i.e. those that require a user ID and password.

I hope you find this function useful, and it does really speed the process up.

To find this go to my GitHub page.

To leave a comment for the author, please follow the link and comment on their blog: R Blogs – Hutsons-hacks.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.