Using Azure as an R datasource, Part 4 – Pulling data from SQL Server to Linux
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
by Gregory Vandenbrouck
Software Engineer, Microsoft
This post is the fourth in a series that covers pulling data from Microsoft SQL Server or MySQL/MariaDB on Azure to an R client on Windows or Linux.
In the previous posts, we covered pulling data from SQL Server to Windows and from MySQL/MariaDB to both Windows and Linux. This time we’ll be pulling data from Microsoft SQL Server to an R client on Linux.
Setting up the Linux client machine
Creating the client VM
We ended creating an Ubuntu Virtual Machine (VM) in Azure, using the same process as the previous post in the series. We named it MyAzureUbuntu.cloudapp.net
and are accessing it using account azureuser
.
Accessing the client VM
To access the client VM from your local machine, you can type ssh [email protected]
from a command prompt if you have an ssh client installed. If that fails, verify that ssh (TCP port 22) is one of the endpoints of the VM.
If your local machine runs Windows, you can use PuTTY or MobaXterm as an ssh client.
Connecting to the database from R on Linux
Using RODBC’s odbcDriverConnect
function
To install R and the tools/libraries to connect to SQL Server via RODBC, type:
sudo apt-get install r-cran-rodbc unixodbc-bin unixodbc odbcinst freetds-bin tdsodbc
Note: if not using Ubuntu, you may have to use an alternative to apt-get
, such as yum, zypper, etc.
To test connecting to SQL Server, you can use the tsql
utility, which is included in the FreeTDS package.
FreeTDS defaults are for another database vendor (Sybase) that uses a different port and TDS version than SQL Server. There are a few complications around the TDS version when connecting to SQL Server using FreeTDS. You need to know:
- that the TDS version needs to be explicitly specified (not obvious since tsql’s help and manual don’t mention it),
- what TDS version should be used, and
- how to specify it (tsql’s manual is silent about this).
In our case, 7.1 ended up being the TDS version that worked (although the documentation suggests instead 7.3) and 1433 is the port used by our SQL Azure instance.
Below we’re successfully connecting to our SQL Server instance using tsql
and getting back the version of SQL Server:
azureuser@MyAzureUbuntu:~$ TDSVER=7.1 tsql -H MySqlAzure.database.windows.net -p 1433 -U MyUser -P MyPassword locale is "en_US.UTF-8" locale charset is "UTF-8" using default charset "UTF-8" 1> version using TDS version 7.1 1> select serverproperty('Edition'), serverproperty('ProductVersion') 2> go SQL Azure 12.0.2000.8 (1 row affected) 1> quit
Notes:
tsql
does not appear to support specifying the default database and SQL Azure does not allow for changing the database once connected; therefore usingtsql
we only have access to themaster
database when connecting to SQL Azure. Not ideal but still useful to test the connection.- When not specifying TDSVER or specifying an incorrect TDSVER, you’ll get an error message similar to
Adaptive Server connection failed
FreeTDS’s ODBC drivers need to be registered in order to connect using RODBC. Below we show how to do it and how to see the contents of the configuration file.
azureuser@MyAzureUbuntu:~$ cd /etc azureuser@MyAzureUbuntu:/etc$ wc -l odbcinst.ini 0 odbcinst.ini azureuser@MyAzureUbuntu:/etc$ sudo odbcinst -i -d -f /usr/share/tdsodbc/odbcinst.ini odbcinst: Driver installed. Usage count increased to 1. Target directory is /etc azureuser@MyAzureUbuntu:/etc$ head -2 odbcinst.ini [FreeTDS] Description=TDS driver (Sybase/MS SQL) azureuser@MyAzureUbuntu:/etc$
Now we can connect to our SQL Azure instance using RODBC’s odbcDriverConnect()
function. We just need to make sure we pass the correct driver name (the name in the square brackets in odbcinst.ini), port number and TDS version in the connection string in addition to the regular settings: server, database, login and password.
azureuser@MyAzureUbuntu:~$ cat test.R library(RODBC) connStr <- "Server=MySqlAzure.database.windows.net;uid=MyUser;pwd=MyPassword;Database=MyDatabase" # regular info connStr <- paste0(connStr, ";Driver=FreeTDS;TDS_Version=8.0;Port=1433") # FreeTDS specific info conn <- odbcDriverConnect(connStr) sqlQuery(conn, "select serverproperty('ProductLevel') as [Product Level], db_name() as [Database name]") close(conn) azureuser@MyAzureUbuntu:~$ R --slave -f test.R Product Level Database name 1 RTM MyDatabase azureuser@MyAzureUbuntu:~$
Using RODBC’s odbcConnect
function
Some of this information can be persisted in a Data Source Name (DSN), allowing for shorter connection strings. Unfortunately, login and password cannot be persisted in the DSN for SQL Server whereas they can in the case of MySQL.
Below we show how to set up a SQL Azure/SQL Server DSN with FreeTDS, and how to test the changes using isql
:
azureuser@MyAzureUbuntu:~$ cat /etc/odbc.ini [Azure-MSSQL-FreeTDS] # Matches the name in odbcinst.ini Driver = FreeTDS Description = Testing MSSQL Azure Trace = No Server = MySqlAzure.database.windows.net Database = MyDatabase # Port must be specified as default corresponds to Sybase PORT=1433 # Sometimes incorrectly documented as "TDS Version" (space instead of underscore) # Silently ignored when incorrect. Must be specified as default corresponds to Sybase TDS_Version=7.1 # UID and PWD can only be specified in connection strings # and are silently ignored in odbc.ini azureuser@MyAzureUbuntu:~$ echo select 1 + 1 as Answer | isql Azure-MSSQL-FreeTDS -b MyUser MyPassword +------------+ | Answer | +------------+ | 2 | +------------+ SQLRowCount returns 1 1 rows fetched azureuser@MyAzureUbuntu:~$
While we’re at it, we use isql
to run a script that will create a table with 2 columns and 10K rows, all with random values between 0 and 1. isql
does not support batch delimiters (such as GO
), therefore each batch must fit on a single line:
azureuser@MyAzureUbuntu:~$ cat createTable.sql set nocount on select 'Creating table ' + db_name() + '..MyTable on machine ' + @@servername+ '...' create table MyTable(x float not null, y float not null) declare @i int = 0; while @i < 10000 begin insert into dbo.MyTable values(rand(), rand()); set @i += 1; end select count(*) as [Total number of rows] from dbo.MyTable azureuser@MyAzureUbuntu:~$ cat createTable.sql | isql Azure-MSSQL-FreeTDS -b MyUser MyPassword | grep -v +- SQLRowCount returns -1 | | | Creating table MyDatabase..MyTable on machine MySqlAzure... | SQLRowCount returns 1 1 rows fetched SQLRowCount returns -1 SQLRowCount returns 1 | Total number of rows| | 10000 | SQLRowCount returns 1 1 rows fetched azureuser@MyAzureUbuntu:~$
Note: script may take a while to complete. Be patient 🙂
We can now query our table using the DNS and RODBC’s odbcConnect()
function. Here we’re querying the x column of MyTable and showing the impact of increasing the sample size: the distribution of the samples mean gets more concentrated around the population mean (0.5).
library(RODBC) conn <- odbcConnect("Azure-MSSQL-FreeTDS", "MyUser", "MyPassword") ds <- sqlQuery(conn, "SELECT x FROM MyTable") close(conn) randomSamples <- function(times, sampleSize) { replicate(n = times, expr = mean(sample(ds$x, size = sampleSize, replace = FALSE)), simplify = TRUE) } par(mfrow=c(2,2)) numberSamples <- 100 mean <- .5 for (i in 2:5) { sampleSize <- i ^ 2 plot(density(randomSamples(numberSamples, sampleSize)), main = paste0("Sample size: ", sampleSize), col = i, xlab = "", xlim = c(.2, .8)) abline(v = mean) } par(mfrow=c(1,1))
Using RJDBC
Setup
To use RJDBC to connect to SQL Server, you need to:
- From a command prompt:
- Install Java:
sudo apt-get install openjdk-8-jdk libjtds-java sqlline
- Register Java in R:
sudo R CMD javareconf
- Install Java:
- From an R terminal:
- Install the R package (no Ubuntu packages via apt-get are provided):
install.packages("RJDBC")
- Install the R package (no Ubuntu packages via apt-get are provided):
In R:
In addition to the url connector prefix, we also need to specify in the call to JDBC
:
- classPath: the location of the jar file. Running
locate *jtds*.jar
from a command prompt may help here. If you don’t get any results, runsudo updatedb
first. - driverClass: the class name. If the one we use below doesn’t work, check the corresponding JDBC driver documentation.
In our particular setup:
library(RJDBC) ## Loading required package: DBI ## Loading required package: rJava drv <- JDBC( driverClass = "net.sourceforge.jtds.jdbc.Driver", # check driver's documentation name classPath = "/usr/share/java/jtds.jar", # use locate and updatedb to find location identifier.quote="`") conn <- dbConnect(drv, "jdbc:jtds:sqlserver://MySqlAzure.database.windows.net/MyDatabase", "MyUser", "MyPassword") # only query 1/4 of the records otherwise we get a solid color graph ds <- dbGetQuery(conn, "select top 25 percent * from MyTable") dbDisconnect(conn) ## [1] TRUE plot(ds, col=rgb(200,0,0,50, maxColorValue=255), main = "Querying SQL Server with RJDBC", pch = 19, cex = 1)
Using RSQLServer
Much easier to use than RJDBC (no url connector, driverClass and classpath to figure out). The prerequisites are the same as for RJDBC (RSSQLServer is a wrapper on top of RJDBC): install and register Java, then install the package.
library(RSQLServer) conn <- dbConnect( RSQLServer::SQLServer(), "MySqlAzure.database.windows.net", "useNTLMv2=false;user=MyUser;Password=MyPassword", database = "MyDatabase") ds <- dbGetQuery(conn, "select * from MyTable") dbDisconnect(conn) ## [1] TRUE plot( density(ds$x*ds$y), col = "salmon", lwd = 3, main = "Querying SQL Server with RSQLServer")
Using Microsoft’s ODBC drivers
Microsoft provides Linux ODBC drivers for SQL Server that can be used with RODBC.
Before you get started
Here are some pros/cons that may help deciding if the pros justify the cons in your specific case:
- Pros:
- Better defaults than FreeTDS so configuration is simplified.
- From our limited testing, better overall performance than FreeTDS and RJDBC/RSQLServer.
- Theoretical feature-set compatibility with SQL Server should be better than FreeTDS’s one.
- Cons:
- These drivers are only supported on a couple specific Linux distributions.
- The process of getting these installed is much more involved than FreeTDS since it is done with scripts instead of packages.
- These drivers depend on a version of unixODBC that is also not available as a package meaning that unixODBC needs to be manually installed, and in turn any other component that depends on unixODBC may have to be manually compiled and installed.
Here are some links that provide some more insights:
- Documentation
- Announcement
- Prerequisites.
Note: the install process will not complain about them and you will end up hitting hard to debug issues such as the one described here. - Download:
- Instructions:
- odbc.ini Documentation
Driver setup
For our testing, we created a machine in Azure using the latest SP of the “SUSE Linux Enterprise 11” images: Microsoft’s ODBC drivers for SQL Server on Linux require Redhat or Suse.
We named this VM MyAzureSuse.cloudapp.net
and are connecting to it with account azureuser
.
Another of the prerequisites is “64-bit UnixODBC 2.3.0”, which unfortunately is not available as a package at this time. To check what version is available, run zypper info UnixODBC
and look at the version field (in our case: 2.2.12-198.17).
Since two versions of unixODBC cannot be installed at the same time, you need to ensure it’s not yet installed (for example by running locate unixodbc
) and un-install it if needed.
After downloading the driver under /tmp
, we did the following from a root prompt started by running sudo bash
(“[…]” means we’ve trimmed the output for brievety):
MyAzureSuse:/tmp # zypper install gcc MyAzureSuse:/tmp # tar zxvf msodbcsql-11.0.2260.0.tar.gz [...] MyAzureSuse:/tmp # cd msodbcsql-11.0.2260.0/ MyAzureSuse:/tmp/msodbcsql-11.0.2260.0 # ./build_dm.sh [...]
The script when successful prompts to run a second command to install unixODBC, in our case: cd /tmp/unixODBC.27735.30044.32204/unixODBC-2.3.0; make install
Once unixODBC 2.3 is successfully installed, we can proceed with the installation of Microsoft’s drivers:
MyAzureSuse:/tmp/msodbcsql-11.0.2260.0 # ./install.sh verify [...] MyAzureSuse:/tmp/msodbcsql-11.0.2260.0 # ./install.sh install [...] MyAzureSuse:/tmp/msodbcsql-11.0.2260.0 # odbcinst -q -d -n "ODBC Driver 11 for SQL Server" [...] MyAzureSuse:/tmp/msodbcsql-11.0.2260.0 # cat /etc/odbcinst.ini [ODBC Driver 11 for SQL Server] Description=Microsoft ODBC Driver 11 for SQL Server Driver=/opt/microsoft/msodbcsql/lib64/libmsodbcsql-11.0.so.2260.0 Threading=1 UsageCount=1
We can now use RODBC’s odbcDriverConnect()
or odbcConnect()
functions.
Below is the appropriate DSN configuration entry to our test server in /etc/odbc.ini
:
[Azure-MSSQL-Microsoft] Driver = ODBC Driver 11 for SQL Server Server = tcp:MySqlAzure.database.windows.net Database = MyDatabase Description = Testing MSSQL Azure using Microsoft's drivers
Note: user/password entries in odbc.ini are silently ignored for SQL Server.
The quick test below with isql
confirms everything is working:
azureuser@MyAzureSuse:~> echo select 'Server=' + @@servername | isql Azure-MSSQL-Microsoft MyUser MyPassword | grep Server | Server=MySqlAzure
Using in R
We installed R from a terminal: sudo zypper install r-base
. For some versions of Suse you may have to look a little to find the package. This page is helpful (click on the “Show other versions” link).
Then we installed RODBC from an R session install.packages("RODBC")
, and verified we have connectivity to SQL Server using the DSN that uses Microsoft’s drivers:
library(RODBC) conn <- odbcConnect("Azure-MSSQL-Microsoft", "MyUser", "MyPassword") sqlQuery(conn, "SELECT COUNT(*) AS CountRows FROM dbo.MyTable") ## CountRows ## 1 10000 close(conn)
Summary
- We’ve tried the RODBC, RJDBC and RSQLServer packages to connect to a SQL Azure/SQL Server database from a Linux client, also hosted in a VM on Azure.
- We also tested Microsoft’s ODBC drivers for Linux, on Suse.
- There’s no clear winner here; it really depends on what matters most:
- Those that favor performance will want Microsoft’s ODBC drivers with RODBC.
- Those that favor portability will use RODBC with a DSN (using odbcConnect) or RSQLServer.
- Those that favor ease of setup and configuration will use RSQLServer.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.