SVN Version Control, R, and some rambling thoughts on AWS, Rscripts

December 16, 2011

(This article was first published on Command-Line Worldview, and kindly contributed to R-bloggers)

I do a lot of my modelling in RStudio hosted on EC2 instances. If you don’t use it, I would highly recommend it. A brilliant tool. Kudos to the RStudio team. I have made a personal and professional pledge to obsessively use version control. I hope to show a quick example of how to use version control in a modelling context, even if you are not tweaking the Linux kernel. I know RStudio has version control coming as an enhancement very soon and I am eagerly awaiting its release, like a virgin on prom night (perhaps, given the nerdy crowd, that could bring up painful memories).

The first and obvious use of version control is to keep track of scripts over time. This is exactly what it is designed to do. Even if you came to R as a non-programmer (like me), it is good to adopt the best parts of programming practice. If you are at all familiar with version control this should make plenty of sense. If you are not, here are some high-level introductions and tutorials, better than I could write:

http://chronicle.com/blogs/profhacker/a-gentle-introduction-to-version-control/23064

A more technical introduction to SVN:

http://svnbook.red-bean.com/

A great Git intro (I use SVN professionally and have just started with Git; the jump can be a little steep):

http://library.edgecase.com/git_immersion/index.html

Now to why you are reading this article: the R. So you crunch for an afternoon and add some clever backflip to your R project. You should save it, then flip over to the command line and commit to your repo. This will hopefully get even easier with the aforementioned RStudio enhancement. But what about your models, the actual R objects? If you are just exploring and hacking at a data set, then the longevity of your models is not too important: you got the insight and you keep modelling. However, when you are constantly tweaking and comparing similar models, tracking these changes is incredibly important.

library(randomForest)
seed<-123
set.seed(seed)
 
data(iris) #Load Data
 
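start<-Sys.time()
iris_rf <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
end<-Sys.time()
#Elapsed time as a plain number of seconds, so the "Secs" label in the commit comment is accurate
time<-as.numeric(difftime(end, start, units="secs"))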
 
#Save the model
save(iris_rf, file ="~/R/model/iris_rf")
 
#Now we add the file to the SVN repo (~/R/model must already be a checked-out working copy)
system("svn add ~/R/model/iris_rf")
 
#Build Comment and Commit
svn.comment <- paste("RandomForest for Iris Data with seed of: ",seed," and run time of ", time," Secs", sep ="")
 
eval(parse(text =as.expression(
paste("system(\"svn commit -m '",svn.comment,"' ~/R/model/iris_rf\")", sep ="")
)))
 
#Now get SVN Revision Number
model_ver<-system("svn info ~/R/model/iris_rf | awk '/Last Changed Rev: /{print $4}'", intern = TRUE)
#This awk command grabs the revision number;
#'intern = TRUE' captures the output of the system command in your object instead of sending it to stdout,
#a brilliantly useful tool on a Linux system
print(model_ver)
#Now we have our revision number
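
The awk step can also be done in R itself. Here is a minimal sketch, assuming `svn info` prints a line of the form `Last Changed Rev: N` (as the awk pattern above implies). `parse_svn_revision` is a hypothetical helper name, and the sample lines are made up purely to show the shape of the input:

```r
# Hypothetical helper: pull the "Last Changed Rev" number out of svn info output.
parse_svn_revision <- function(info_lines) {
  hit <- grep("^Last Changed Rev: ", info_lines, value = TRUE)
  if (length(hit) == 0) return(NA_integer_)  # no such line found
  as.integer(sub("^Last Changed Rev: ", "", hit[1]))
}

# Made-up sample lines, only for illustration (not real svn output)
sample_info <- c("Path: iris_rf", "Revision: 12", "Last Changed Rev: 11")
rev_num <- parse_svn_revision(sample_info)

# In real use (assumes a working copy at this path):
# info_lines <- system("svn info ~/R/model/iris_rf", intern = TRUE)
# model_ver  <- parse_svn_revision(info_lines)
```

Returning `NA` when the line is missing makes a failed `svn info` call easier to spot than an empty character vector.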

This is clearly a toy example but it illustrates the use. Because I do a lot of database development, something is not useful to me until it is stored in a database. What you store depends on what you are going to use the models for, but I like to store out-of-sample error rates, cross-validation results, the specific parameters used in each model run and, of course, the revision number from the SVN repo. That ties it all together. Now, as a further example, say we want to tweak the model.
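
To make the bookkeeping idea concrete, here is a rough sketch of one metadata row per model run. Everything in it is an assumption for illustration: the helper names `model_record` and `log_model_run`, the column names, and the use of a CSV file as a stand-in for a real database table. The numbers in the usage lines are placeholders, not results.

```r
# Hypothetical helper: one row of metadata per model run.
model_record <- function(model_name, svn_rev, seed, ntree, oob_error) {
  data.frame(
    model_name = model_name,
    svn_rev    = as.integer(svn_rev),
    seed       = as.integer(seed),
    ntree      = as.integer(ntree),
    oob_error  = as.numeric(oob_error),
    logged_at  = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: append the row to a log file.
# The header is written only when the file does not exist yet.
log_model_run <- function(record, log_file) {
  is_new <- !file.exists(log_file)
  write.table(record, log_file, sep = ",", row.names = FALSE,
              col.names = is_new, append = !is_new)
  invisible(record)
}

# Placeholder values; a temp file stands in for the database table
log_file <- tempfile(fileext = ".csv")
rec <- model_record("iris_rf", svn_rev = "11", seed = 123, ntree = 50,
                    oob_error = 0.05)
log_model_run(rec, log_file)
```

With a real fit, the revision would come from `model_ver` and the error rate from the model object; swapping the CSV for a database insert is left to whatever database interface you already use.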

args1 <- commandArgs(TRUE)[1]
library(randomForest)
#Command-line arguments arrive as character strings, so convert before seeding
seed<-as.integer(args1)
set.seed(seed)
 
system("svn update ~/R/model/")
load("~/R/model/iris_rf")
 
data(iris) #Load Data
 
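start<-Sys.time()
iris_rf2 <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
end<-Sys.time()
#Elapsed time as a plain number of seconds, so the "Secs" label in the commit comment is accurate
time<-as.numeric(difftime(end, start, units="secs"))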
 
#we are combining the old model with the new model to add trees to the forest
rf.all <- combine(iris_rf,iris_rf2)
 
#Save the model
save(rf.all, file ="~/R/model/iris_rf")
 
#Now we add the file to the SVN repo
system("svn add ~/R/model/iris_rf")
 
#Build Comment and Commit,
svn.comment <- paste("RandomForest for Iris Data with seed of: ",seed," and run time of ", time," Secs", sep ="")
 
eval(parse(text =as.expression(
paste("system(\"svn commit -m '",svn.comment,"' ~/R/model/iris_rf\")", sep ="")
)))
 
#Now get SVN Revision Number
model_ver<-system("svn info ~/R/model/iris_rf | awk '/Last Changed Rev: /{print $4}'", intern = TRUE)

I have added a little tweak to the top of this script that really opens up the power of this approach.

commandArgs(TRUE)[1]

This function returns the command-line arguments so they can be used within your R script. Say we saved this file as rf_Iris.R. You could call it from the command line or from within a shell script (where this REALLY gets fun; OK, maybe I have a tainted idea of what’s fun) like:

Rscript ./rf_Iris.R 456
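
Since the argument arrives as a string (or as `NA` when nothing is passed), a little validation up front saves confusion later. A minimal sketch, where `parse_seed` and its default of 123 are hypothetical choices of mine, not part of the script above:

```r
# Hypothetical helper: read the seed from the first argument, with a fallback.
parse_seed <- function(args, default = 123L) {
  # No argument supplied: fall back to the default seed
  if (length(args) < 1 || is.na(args[1])) return(default)
  seed <- suppressWarnings(as.integer(args[1]))
  if (is.na(seed)) stop("seed must be an integer, got: ", args[1])
  seed
}

# At the top of rf_Iris.R this would be used as:
# seed <- parse_seed(commandArgs(trailingOnly = TRUE))
# set.seed(seed)
```

Failing loudly on a bad argument is deliberate: in a batch run, a silently wrong seed is much harder to track down than a script that refuses to start.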

Now this example of resetting the seed is a bit of a toy; however, you could pass in the target variable, a model name for rerunning/updating, or the parameters of a SQL call. It gets cool, I promise you. It turns R from an interactive scripting tool into a batch process. I will post some really cool shell scripts that allow you to spin up EC2 machines to remotely execute Rscripts in the cloud. Combining the randomForest combine technique used above with a couple of EC2 machines, you can build a distributed training ‘cluster’, or, as more accurately described by Mike Driscoll, “Bash Reduce”. At minimum you can expand your computing power from the comfort of your own command line. (It is also a really useful story for picking up girls at the bar. Little-known fact: talk of distributed cloud-based modelling really gets the women.)

This framework can be used to dynamically build R statements using command-line arguments. I also like the three-line split because you can run just the paste command to see what would be executed. It is totally kludgey, though: the more complicated the statement, the more difficult it gets to debug. If someone has a more elegant solution I would love to hear it.

eval(parse(text =as.expression(
paste("system(\"svn commit -m '",svn.comment,"' ~/R/model/iris_rf\")", sep ="")
)))
 

A bit of WARNING: this type of concatenated command building should set your DBA/SysAdmin alarms going off. You should not expose any script that is this hacky to a web server or let it be run by a potentially malicious soul on your machine. It is just asking for a string injection attack. If you are planning to access these scripts through rApache or make them public facing, you should NOT be concatenating system commands. If you are skilled enough to do that, then you probably don’t need me to tell you.
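
One way to reduce (not eliminate) that risk is to pass the arguments as a vector to base R's `system2()` and quote each one with `shQuote()`, instead of pasting a command string together. The sketch below is an assumption about how that might look, not a hardened solution; `svn_commit_args` and `svn_commit` are hypothetical names:

```r
# Hypothetical helper: build the argument vector for `svn commit`.
# Each user-supplied piece is shell-quoted so it is passed as a single argument.
svn_commit_args <- function(message, path) {
  c("commit", "-m",
    shQuote(message, type = "sh"),
    shQuote(path.expand(path), type = "sh"))  # expand ~ before quoting
}

# Hypothetical wrapper: run the commit and return svn's exit status.
svn_commit <- function(message, path) {
  system2("svn", svn_commit_args(message, path))
}

# Inspect what would run, without touching any repository:
args <- svn_commit_args("seed 123; rm -rf ~", "/tmp/iris_rf")
cat("svn", args, "\n")
```

This keeps the "see what would run" convenience of the paste approach, since `svn_commit_args()` can be printed on its own. The `path.expand()` call matters because a quoted `~` is no longer expanded by the shell.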

Cheers
