Text bashing in R for SQL

[This article was first published on Bearded Analytics » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Fairly often, a coworker who is strong in Excel, but weak in writing code will come to me for help in special details about customers in their datasets. Sometimes the reason is to call, email, or snail mail a survey, other times to do some classification grouping on the customer. Whatever the reason, the coworker has a list of ID numbers and needs help getting something out of a SQL database.

When it isn’t as simple as just adding quotes and commas to the cells in Excel before copying all the ID’s into the WHERE clause of a very basic SELECT statement, I often fall back to R and let it do the work of putting together the SELECT statement and querying the data.

Suppose that you’re given an Excel file with 1.2 million ID’s and there’s some transformation that you need to do first. Obviously, you first read the file in using your protocol and package of choice. Since we’re ultimately doing SQL, let’s take advantage of the RODBC package’s cool features.


xlsxFile <- file.path("C:", "MyFile.xlsx")
xlsxChan <- odbcConnectExcel2007(xlsxFile)
xlsxSheets <- sqlTables(xlsxChan)

# From here you can choose the table (worksheet) you need to pull the data from.

xlsxData <- sqlFetch(xlsxChan, 'Sheet1$')

Now that the data have been pulled into R any manipulations that need to be made can be done until you end up with a list of ID's that you need to query the SQL database for. Let's start by writing the basis of the query.

myQuery <- "SELECT ID, Name, Street, City, State, Zip, Phone, Email FROM CustomerTable WHERE ID in (s)"

Notice the use of the s switch. We're ultimately going to be dropping this into sprintf() with the ID's so that we don't have to clutter up the script with a stupidly-long list of hard-coded values. Next we need to make R build that stupidly-long list of values to put into the query.

id <- as.character(xlsxData$ID)
id <- sprintf("'s'", unique(id))
id <- toString(id)

# Now we can put the two together and send the whole query off.
myQuery <- sprintf(myQuery, id)

# Here do what is appropriate for you database flavor.  I'm pretending to use MySQL on Windows.
dbChan <- odbcConnect("CustDb", uid="jeremy", case="tolower")
dbData <- sqlQuery(dbChan, myQuery, stringsAsFactors=FALSE)

What you should end up with in dbData is the list of things that you are after and a happy colleague.

To leave a comment for the author, please follow the link and comment on their blog: Bearded Analytics » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)