by Gregory Vandenbrouck
Software Engineer at Microsoft
This post is the first in a series that covers pulling data from various Windows Azure hosted storage solutions (such as MySQL, or Microsoft SQL Server) to an R client on Windows or Linux.
We’ll start with a relatively simple case of pulling data from SQL Azure to an R client on Windows.
Creating the database
The Azure Management site changes quite often, therefore these instructions are valid “at the time of this writing” :o)
- You need an Azure subscription. If you don’t have one, free trials are available here.
- Log on to the Azure Management site.
- Create a SQL Azure database: select “New”, “Data Services”, “SQL Database”, “Quick Create” and then fill-in the information.
- Once the database creation is completed, click on “SQL databases” and then on your database name. There should be a way to view the server name and the connection strings for ODBC, JDBC, etc.
Connecting to the database outside of R
Optional step, but can be useful for troubleshooting. For example to solve firewall, port and credential issues.
- Install Microsoft® ODBC Driver 11 for SQL Server® – Windows
- Install Microsoft® Command Line Utilities 11 for SQL Server (also available in the Feature Pack )
Below an example of a successful test session in a command prompt (search for “cmd” in the Start menu or screen to start a command prompt):
C:> sqlcmd -S MyServer.database.windows.net -U MyUser -P MyPassword -d MyDatabase -Q "print 'Hello World'" Hello World
(replace MyServer, MyUser, MyPassword and MyDatabase with your values)
Connecting to the database from R on Windows
Viewing and adding ODBC drivers
To view the list of available drivers, open the “ODBC Data Sources” dialog:
- Type “ODBC” in the Start menu or screen.
- Select the “ODBC Data Sources” that corresponds to the R console you’ve started: on a 64bit OS there’s both 32bit or 64bit. If you can’t find it, it’s here:
- 32bit for 64bit OS:
- 32bit for 32bit OS, 64bit for 64bit OS:
- 32bit for 64bit OS:
- Click on the “Drivers” tab:
“SQL Server” was present by default with the versions of Windows I tested (8.1 and Server 2012) and was sufficient. If you want to try the other drivers, these are:
This is the simplest option as ODBC is available by default on Windows.
If the ODBC connection string suggested by the Azure Management site doesn’t work (usually: missing driver) try using one similar to the simplified version below.
library(RODBC) myServer <- "MyServer.database.windows.net" myUser <- "MyUser" myPassword <- "MyPassword" myDatabase <- "MyDatabase" myDriver <- "SQL Server" # Must correspond to an entry in the Drivers tab of "ODBC Data Sources" connectionString <- paste0( "Driver=", myDriver, ";Server=", myServer, ";Database=", myDatabase, ";Uid=", myUser, ";Pwd=", myPassword) # This query simulates a table by generating a rowset with one integer column going from 1 to 1000 sqlQuery <- " WITH nb AS (SELECT 0 AS n UNION ALL SELECT n+1 FROM nb where n < 9) SELECT n1.n+10*n2.n+100*n3.n+1 FROM nb n1 CROSS JOIN nb n2 CROSS JOIN nb n3 " conn <- odbcDriverConnect(connectionString) df <- sqlQuery(conn, sqlQuery) close(conn) # don't leak connections ! str(df)
## 'data.frame': 1000 obs. of 1 variable: ## $ : int 1 2 3 4 5 6 7 8 9 10 ...
## ## Min. : 1.0 ## 1st Qu.: 250.8 ## Median : 500.5 ## Mean : 500.5 ## 3rd Qu.: 750.2 ## Max. :1000.0
This function allows for using a Data Source Name (DSN). The pros of the DSN is that you can hide some of the complexity from the R script, for example the choice of driver. To create a DSN, open the “ODBC Data Sources” dialog and then click on “Add” under the “User DSN” or “System DSN” tab.
- If connecting to Azure, choose “SQL Server authentication”.
- The credentials (Login ID and Password) are not persisted, meaning even if you specify them in the dialog, you’ll have to specify them again in R.
- If you can’t specify a database besides master for the “SQL Server” driver, then don’t specify credentials (you won’t be able to test the connection anymore).
In the example below, I’ve created a “User DSN” (that is: only available to current user) called MySqlAzure:
library(RODBC) conn <- odbcConnect("MySqlAzure", uid = myUser, pwd = myPassword) sqlQuery(conn, "SELECT POWER(2, 10) AS Answer")
## Answer ## 1 1024
To use RJDBC to connect to SQL Server, you need to:
- Know the following:
- driverClass: the class name. See the driver-specific documentation.
- classPath: the location of the jar file. If you don’t know where it is, try running
dir /s /b %systemdrive%*sql*.jarfrom a command prompt.
- url connector prefix. Again, driver-specific.
In my specific setup:
library(RJDBC) drv <- JDBC( driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver", classPath = "C:/JDBC/Microsoft JDBC Driver 4.1 for SQL Server/sqljdbc_4.1/enu/sqljdbc41.jar") conn <- dbConnect(drv, "jdbc:sqlserver://MyServer.database.windows.net", "MyUser", "MyPassword") dbGetQuery(conn, "SELECT 1+1") dbDisconnect(conn)
RSqlServer also needs the JDK installed. It’s slighter easier to use than RJDBC:
library(RSQLServer) conn <- dbConnect( RSQLServer::SQLServer(), "MyServer.database.windows.net", "useNTLMv2=false;user=MyUser;Password=MyPassword") dbGetQuery(conn, "SELECT 1+1") dbDisconnect(conn)
- We’ve tried RODBC, RJDBC and RSQLServer to connect to a SQL Server hosted on Azure.
- RODBC is the easiest to setup (no install required besides the package) and use (simple connection string).
- RODBC was faster in our tests, both to connect and to get back the query results.
So you may ask: why use RJDBC or RSQLServer? One answer may be: because it’s much easier to port if you need to run your R scripts on both Windows and Linux, as we’ll see in subsequent posts.