Using Azure as an R data source, Part 1

May 12, 2015
By

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Gregory Vandenbrouck
Software Engineer at Microsoft

This post is the first in a series that covers pulling data from various Windows Azure hosted storage solutions (such as MySQL, or Microsoft SQL Server) to an R client on Windows or Linux.

We’ll start with a relatively simple case of pulling data from SQL Azure to an R client on Windows.

Creating the database

The Azure Management site changes quite often, therefore these instructions are valid “at the time of this writing” :o)

  1. You need an Azure subscription. If you don’t have one, free trials are available here.
  2. Log on to the Azure Management site.
  3. Create a SQL Azure database: select “New”, “Data Services”, “SQL Database”, “Quick Create” and then fill-in the information.
  4. Once the database creation is completed, click on “SQL databases” and then on your database name. There should be a way to view the server name and the connection strings for ODBC, JDBC, etc.

Connecting to the database outside of R

Optional step, but can be useful for troubleshooting. For example to solve firewall, port and credential issues.

(replace MyServer, MyUser, MyPassword and MyDatabase with your values)

Connecting to the database from R on Windows

Viewing and adding ODBC drivers

To view the list of available drivers, open the “ODBC Data Sources” dialog:

  • Type “ODBC” in the Start menu or screen.
  • Select the “ODBC Data Sources” that corresponds to the R console you’ve started: on a 64bit OS there’s both 32bit or 64bit. If you can’t find it, it’s here:
    • 32bit for 64bit OS: %windir%syswow64odbcad32.exe
    • 32bit for 32bit OS, 64bit for 64bit OS: %windir%system32odbcad32.exe
  • Click on the “Drivers” tab:

ODBC1

 

“SQL Server” was present by default with the versions of Windows I tested (8.1 and Server 2012) and was sufficient. If you want to try the other drivers, these are:

Using RODBC’s odbcDriverConnect function

This is the simplest option as ODBC is available by default on Windows.

If the ODBC connection string suggested by the Azure Management site doesn’t work (usually: missing driver) try using one similar to the simplified version below.

library(RODBC)
myServer <- "MyServer.database.windows.net"
myUser <- "MyUser"
myPassword <- "MyPassword"
myDatabase <- "MyDatabase"
myDriver <- "SQL Server" # Must correspond to an entry in the Drivers tab of "ODBC Data Sources"

connectionString <- paste0(
    "Driver=", myDriver, 
    ";Server=", myServer, 
    ";Database=", myDatabase, 
    ";Uid=", myUser, 
    ";Pwd=", myPassword)
# This query simulates a table by generating a rowset with one integer column going from 1 to 1000
sqlQuery <- "
WITH nb AS (SELECT 0 AS n UNION ALL SELECT n+1 FROM nb where n < 9)
SELECT n1.n+10*n2.n+100*n3.n+1 FROM nb n1 CROSS JOIN nb n2 CROSS JOIN nb n3
"
conn <- odbcDriverConnect(connectionString)
df <- sqlQuery(conn, sqlQuery)
close(conn) # don't leak connections !

str(df)
## 'data.frame':    1000 obs. of  1 variable:
##  $ : int  1 2 3 4 5 6 7 8 9 10 ...
summary(df)
##                  
##  Min.   :   1.0  
##  1st Qu.: 250.8  
##  Median : 500.5  
##  Mean   : 500.5  
##  3rd Qu.: 750.2  
##  Max.   :1000.0

Using RODBC’s odbcConnect function

This function allows for using a Data Source Name (DSN). The pros of the DSN is that you can hide some of the complexity from the R script, for example the choice of driver. To create a DSN, open the “ODBC Data Sources” dialog and then click on “Add” under the “User DSN” or “System DSN” tab.

Notes:

  • If connecting to Azure, choose “SQL Server authentication”.
  • The credentials (Login ID and Password) are not persisted, meaning even if you specify them in the dialog, you’ll have to specify them again in R.
  • If you can’t specify a database besides master for the “SQL Server” driver, then don’t specify credentials (you won’t be able to test the connection anymore).

In the example below, I’ve created a “User DSN” (that is: only available to current user) called MySqlAzure:

ODBC2

 

library(RODBC)

conn <- odbcConnect("MySqlAzure", uid = myUser, pwd = myPassword)
sqlQuery(conn, "SELECT POWER(2, 10) AS Answer")
##   Answer
## 1   1024
close(conn)

Using RJDBC

To use RJDBC to connect to SQL Server, you need to:

  • Install:
  • Know the following:
    • driverClass: the class name. See the driver-specific documentation.
    • classPath: the location of the jar file. If you don’t know where it is, try running dir /s /b %systemdrive%*sql*.jar from a command prompt.
    • url connector prefix. Again, driver-specific.

In my specific setup:

library(RJDBC)
drv <- JDBC(
    driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    classPath = "C:/JDBC/Microsoft JDBC Driver 4.1 for SQL Server/sqljdbc_4.1/enu/sqljdbc41.jar")
conn <- dbConnect(drv, "jdbc:sqlserver://MyServer.database.windows.net", "MyUser", "MyPassword")
dbGetQuery(conn, "SELECT 1+1")
dbDisconnect(conn)

Using RSQLServer

RSqlServer also needs the JDK installed. It’s slighter easier to use than RJDBC:

library(RSQLServer)
conn <- dbConnect(
    RSQLServer::SQLServer(),
    "MyServer.database.windows.net",
    "useNTLMv2=false;user=MyUser;Password=MyPassword")
dbGetQuery(conn, "SELECT 1+1")
dbDisconnect(conn)

Summary

  • We’ve tried RODBC, RJDBC and RSQLServer to connect to a SQL Server hosted on Azure.
  • RODBC is the easiest to setup (no install required besides the package) and use (simple connection string).
  • RODBC was faster in our tests, both to connect and to get back the query results.

So you may ask: why use RJDBC or RSQLServer? One answer may be: because it’s much easier to port if you need to run your R scripts on both Windows and Linux, as we’ll see in subsequent posts.

  

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)