How to reliably access network resources in R
It’s frustrating when an application unexpectedly dies due to a network timeout or the unavailability of a network resource. Veterans of distributed systems know not to rely on network-based resources, such as web services or databases, since they can be unpredictable. So what is a data scientist supposed to do when she must use these resources in her analysis or application?
When there is a true network partition, there’s not much you can do since these resources are inaccessible. Most of the time, though, the issue is a timeout due to network latency or an unresponsive server. In these situations, the problem is temporary. It would be nice to recover from the error without having to add a bunch of logic that muddies up your model code. Recovery can be as simple as trying again, eventually failing if a resource is truly unavailable.
The new function ntry in lambda.tools 1.0.5 does just this: call a function up to n times, returning the result of the first successful call.
Here’s an example of how it works. The following function simulates an unreliable resource that fails 75% of the time. Using ntry, the function will be tried over and over until it either succeeds or the limit is reached.
library(lambda.tools)
library(futile.logger)
fn <- function(i) {
  x <- sample(1:4, 1)
  flog.info("x = %s", x)
  if (x < 4) stop('stop') else x
}
Calling the function in isolation will most likely fail:
> fn()
INFO [2015-01-21 18:26:21] x = 2
Error in fn() : stop
This is similar to what happens with a timeout, where sometimes a function will fail. To get around this, normally a loop of some sort is introduced to try a few times until the call succeeds. With ntry it’s simply a matter of wrapping a function in a closure and specifying the number of tries.
> ntry(fn, 6)
INFO [2015-01-21 18:39:21] x = 2
INFO [2015-01-21 18:39:21] x = 4
[1] 4
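For comparison, here is a minimal sketch of the kind of hand-written loop mentioned above, using only base R. The helper name retry_n and the stand-in function flaky are hypothetical, made up for illustration. This is not how lambda.tools implements ntry, just one way such a loop might be written.

```r
# Hypothetical helper (not part of lambda.tools): call fn(i) up to n times
# and return the result of the first call that succeeds.
retry_n <- function(fn, n) {
  stopifnot(n >= 1)
  last_error <- NULL
  for (i in seq_len(n)) {
    result <- tryCatch(
      list(value = fn(i)),   # wrap in a list so a NULL result still counts as success
      error = function(e) e  # on failure, hand back the condition object
    )
    if (!inherits(result, "error")) return(result$value)
    last_error <- result
  }
  # Every attempt failed, so re-raise the last error
  stop(last_error)
}

# Stand-in for an unreliable resource: fails on the first two attempts
flaky <- function(i) if (i < 3) stop("timeout") else i

retry_n(flaky, 6)  # returns 3
```

Repeating this boilerplate around every network call is exactly what a wrapper like ntry avoids. Note that retrying improves the odds but doesn't guarantee success: with a 75% failure rate per call, six tries succeed with probability 1 - 0.75^6, or about 0.82.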
Here’s a real-world example using RPostgreSQL. In a single function, a connection is opened, the query executed, and the connection closed.
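library(RPostgreSQL)

db_execute_query <- function(query) {
  drv <- dbDriver("PostgreSQL")
  # HOST, PORT, DATABASE, USER and PASS are assumed to be defined elsewhere
  con <- dbConnect(drv, host=HOST, port=PORT, dbname=DATABASE,
    user=USER, password=PASS)
  # Register the cleanup only once con exists, so a failed connection
  # attempt doesn't raise a second error on exit
  on.exit(dbDisconnect(con))
  dbGetQuery(con, statement=query)
}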
For this to work with ntry, I use the on.exit function to disconnect, so the connection is closed whether or not the query succeeds. Normally I’d use a tryCatch block, but since ntry will catch the error, I leave this code naked. The DB call is wrapped in a closure and passed to ntry, which calls it with the attempt number as the argument i. This is useful if you want to debug the call. The second parameter is simply the number of tries.
df <- ntry(function(i) db_execute_query(query), 3)
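Since the closure receives the attempt number, it can be logged to see how many tries a call needed. The sketch below assumes lambda.tools is installed and swaps the database call for a hypothetical stand-in, fake_query, which fails on its first call only, so it runs without a database. Whether the attempt count starts from 0 or 1 isn't stated here, so the code just prints whatever value ntry passes in.

```r
library(lambda.tools)

# Hypothetical stand-in for db_execute_query: fails on its first call only,
# then returns a small data frame
calls <- 0
fake_query <- function(query) {
  calls <<- calls + 1
  if (calls == 1) stop("connection timed out")
  data.frame(id = 1:2)
}

df <- ntry(function(i) {
  message("attempt ", i)  # i is supplied by ntry
  fake_query("SELECT 1")
}, 3)
```

The first call fails and the second succeeds, so df ends up holding the data frame and the stand-in has been called twice.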
Access to the database is now a bit more resilient. To try it out yourself, install the latest version of lambda.tools via devtools.
library(devtools)
install_github('zatonovo/lambda.tools')