At the useR! conference last month, Jim Hester gave a talk about two packages that provide a modern database interface for R. Those packages are the odbc package (developed by Jim and other members of the RStudio team), and the DBI package (developed by Kirill Müller with support from the R Consortium).
To communicate with databases, a common protocol is ODBC. ODBC is an open, cross-platform standard for interfacing with databases, whatever the database software and variant of the SQL language it implements. R has long had the RODBC package created by Brian Ripley, but the new odbc package provides an updated alternative.
The odbc package is a from-the-ground-up implementation of an ODBC interface for R that provides native support for additional data types (including dates, timestamps, raw binary, and 64-bit integers) and parameterized queries. The odbc package provides connections with any ODBC-compliant database, and has been comprehensively tested on SQL Server, PostgreSQL and MySQL. Benchmarks show that it's also somewhat faster than RODBC: 3.2 times faster for reads, and 1.9 times faster for writes.
With the odbc package (and the DBI package, which provides low-level connectivity to the database), you create a database connection like this:
con <- dbConnect( ... ) # provide server, username, password
With the connection established, you can the use functions like dbListTables (get a list of tables in the database), dbReadTable (return the data from a table), and dbGetQuery (execute a SQL query in the database, and return the results). That's pretty standard stuff. But the real power comes in being able to use high-level functions from the dplyr package and have the data processing run in the database, instead of in the local R session.
salesdata <- tbl(con, "ussales")
It doesn't matter how big the
ussales table is, because no data is brought into R until you perform an operation on the
salesdata object. For example, you can use a dplyr pipe to get the first 20 records from the state of California, like this:
tabledata %>% filter(state == "CA") %>% head(20)
Behind the scenes, this creates a SQL query (which you can inspect with the
show_query function, if you like) to move the computation from within R to the database. This process works for all the dplyr verbs:
BY, etc. In every case (including the example above) all the filtering happens in the database, and only 20 rows of data are actually returned to R.
You can see all of this in action in Jim's talk embedded below. (Jim's slides are also available here.) In the demo starting at the 8:20 mark, Jim connects to SQL Server (here running in a VM on his laptop, in non-demo situations more likely to be on a remote machine). You'll then see him query tables on the database, and perform dplyr operations on the data in the process. You'll even get a lesson on the importance of sanitizing user inputs to prevent SQL injection, and a solution based on parameterized queries.
The odbc package is built on top of the DBI package, which provides a low-level interface to databases and some of the functions used in Jim's demo above. The DBI package has been around for a while, but has recently undergone a major upgrade thanks to Kirill Müller and funding from the R Consortium. If you're interested in the implementation of the DBI package and how it's evolving, check out Kirill's talk from useR! 2017: Improving DBI.
The odbc and DBI packages are available on CRAN now. To use them with your database, you'll also need an ODBC driver installed on the local machine. Most likely you already have one (Windows includes an ODBC driver compatible with SQL Server, for example), but if not RStudio provides a useful guide to finding ODBC drivers (which also come bundled with commercial RStudio products).
Channel9: odbc – A modern database interface