I did a tweet earlier which was slightly less nonsense than usual.
Shocker, I know.
I don’t think the idea of having SQL files being kept separate, rather than having a SQL query either embedded or called within an R script, is that controversial. But some folk clearly have not thought to try it before.
So, in R terms, rather than having a connection defined, and then pasting a SQL query into a DBI call, I’d be doing something like this to make the initial connection:
Then, once that ran successfully, I’d run the query with this:
results <- dbGetQuery(con, statement = readr::read_file(here('02-get-results.sql')))
(My query names are more descriptive in real life).
Let’s backtrack - my motivation for doing this was quite simple.
We needed to make a change to query due to a change in reporting requirements. This meant updating a couple of reports, and I got fed up scrolling aound trying to find the queries inside my rmarkdown scripts.
‘What if they were actually outside the script, and called in as needed?’
Perceived benefits of this approach:
Only need to rewrite the query once - it was the same query being used in separate reports.
Single version of the query. So no one can tweak it to suit themselves, even temporarily. We are all now running the same query.
Re-use. I am happy with R and SQL, and Power Query / Excel when the need arises. Some of my colleagues are more familiar with Excel and Power Query, and new to R and SQL.
If you don’t know SQL, then Power Query almost guides you to load an entire table, then drop columns/ rows you don’t need. Unless, you know to write a query up front and use that in the connection panel. Even though they cannot write the SQL themselves, they can grab the working query, paste it in, safe in the knowledge it works, and limit the needless transfer of data. If that query is available in a standalone file, there is more chance of them using it, and less likelihood of things going wrong.
Syntax highlighting. I’ve never liked the way that SQL code looks when pasted into a call to DBI - it’s a see of commented code. In a pure SQL file, everything looks much clearer - and, by the way, kudos to RStudio devs for making it look so good.
Use the glue package and add parameters when the query is run. I tend to declare my variables etc in SQL itself, but can totally see how using glue and having users set variables when they call the query within R would add another layer of flexibility.
Debugging. Grab the query, fire it into SSMS, Visual Studio Code or Azure Data Studio - make sure those latest tweaks work or quickly spot why it’s failed. It’s all just a bit easier than having to copy / paste out of the middle of an R script or markdown document.
By keeping the connection and the query separate, if anything falls over, I know exactly where to start troubleshooting.
Security. With no connection or database info, the remaining R code is pretty meaningless to anyone outside the organisation. It can go into version control quite safely.
There are likely to be more benefits, but these are all seem worthwhile.
If, however, you’ve already tried this and it didn’t work for you, I’d be interested to hear what went wrong, and what your final approach was.
N.B.- This is a very early draft post, in line with my ‘publish it now, or it might never get done’ strategy, and so I invite you to check back on my site to see if it has been updated.
But to reiterate, I definitely want to hear about any perceived advantages/ disadvantages of this approach.