
This is a modified version of a previous R benchmark that was done back in 2011. Click this link to see the original post.

After using R for quite some time, you get to know a little bit about its strengths and weaknesses. It structures data very well and has a huge library of statistical and data processing packages, which makes analysis a breeze. What it lacks is the ability to deal with really large data, and processing SPEED. We’re going to focus on the speed issue, especially since there are some easy ways to improve this.

I’m sure most people have heard of Revolution Analytics. They offer a free, enhanced version of R called Revolution R Open (RRO), which enables multi-threaded math operations (standard R ships with a single-threaded BLAS) and is very easy to set up. There’s definitely some debate about whether RRO really does improve upon R; as you’ll see from the data below, in some cases the improvement is clear and in others it isn’t. We’re also going to look at the difference between running R/RRO locally on Mac OS X and in the cloud on Ubuntu.
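Before comparing builds, it helps to confirm what the current R session can actually use. Here is a minimal sketch using only base R and the bundled parallel package (nothing RRO-specific); the timing itself is the tell: with a multi-threaded BLAS the matrix multiply should run several times faster than with the reference BLAS in standard R.

```r
# Check how many logical cores this R session can see, then time a
# matrix multiply. A multi-threaded BLAS (as shipped with RRO) will
# finish the multiply noticeably faster than the single-threaded
# reference BLAS that standard R uses by default.
library(parallel)

cores <- detectCores()
print(cores)

n <- 500
a <- matrix(rnorm(n * n), n, n)
elapsed <- system.time(b <- a %*% a)["elapsed"]
print(elapsed)
```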

My notebook setup:
• Mac OS X Yosemite 10.10.2
• 7 GHz Intel Core i5 (dual-core)
• 4 GB RAM

Cloud server setup:
• Ubuntu 14.04
• Dual-core CPU
• 4 GB RAM

For both the notebook and the cloud setup, I ran benchmarks for both R and RRO, so 4 different variations in total. The benchmark code that I used is a modification of the benchmark code provided in the link at the top. I added a section for matrix operations since that is one of the categories in which RRO really shines according to their website. See the code below.

# clear workspace
rm(list=ls())

# print system information
R.version
Sys.info()

# install non-core packages
install.packages(c('party', 'rbenchmark', 'earth'))

require(rbenchmark)
require(party)
require(earth)
require(rpart)
require(compiler)

# function from http://dirk.eddelbuettel.com/blog/2011/04/12/
k <- function(n, x=1) for (i in 1:n) x=1/{1+x}

# create random matrix
mat1 <- matrix(data = rexp(3000 * 3000, rate = 10), nrow = 3000, ncol = 3000)
mat2 <- matrix(data = rexp(3000 * 3000, rate = 10), nrow = 3000, ncol = 3000)

# load data set from the UCI Repository
# see: http://archive.ics.uci.edu/ml/datasets/Credit+Approval
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
mydata <- read.csv(url, header = FALSE, na.strings = "?")
mydata$V16 <- as.factor(mydata$V16)  # class label, used as the response below

# run benchmark
results <- benchmark(ct = ctree(V16 ~ ., data = mydata),
                     e  = earth(V16 ~ ., data = mydata),
                     rp = rpart(V16 ~ ., data = mydata),
                     mm = mat1 %*% mat2,
                     k  = k(1e6, x = 1),
                     replications = 20)

results
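The per-test numbers reported below are wall-clock seconds, the same quantity rbenchmark reports in its elapsed column. If rbenchmark isn't installed, the same measurement can be sketched with base system.time() alone (with much smaller matrices here so it runs quickly; the benchmark above uses 3000 x 3000):

```r
# Base-R stand-in for one benchmark entry: 20 replications of a
# matrix multiply, reporting wall-clock ("elapsed") seconds.
small <- matrix(rexp(100 * 100, rate = 10), 100, 100)
timing <- system.time(for (i in 1:20) out <- small %*% small)
print(timing["elapsed"])
```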

Benchmarks – Table

                   ctree (s)  earth (s)  mm (s)  k (s)  rpart (s)
R_OSX_3.1.3           284        155      614      8      0.51
RRO_OSX_3.1.2         297        147       39     10      0.47
R_Ubuntu_3.0.2        182        127      810     15      0.45
RRO_Ubuntu_3.1.2      130        119       28      8      0.42

Conclusion

For the most part, RRO performs significantly faster than standard R, both locally and on the server. RRO does especially well on the matrix operations, as seen in column mm (over 90% faster than standard R); this is probably due to its use of the Intel Math Kernel Library (MKL). Standard R actually did better than RRO on the local machine for the ctree and k tests, which is definitely unexpected after all of the lofty claims made by Revolution Analytics. The difference isn’t huge, though, so maybe we can attribute it to noise in a small sample. Both standard R and RRO perform much better on the Ubuntu server for most tests; this is most likely because the server operating system runs far fewer background processes than a desktop operating system. RRO performs better than standard R in all the tests I ran on the server, making it the clear winner on the server side.
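The percentage improvements quoted above are relative reductions in elapsed time. A tiny helper (hypothetical, not part of the original benchmark code) makes the arithmetic explicit:

```r
# Percent reduction in elapsed time of a new timing versus a
# baseline; e.g. a run that is 10x faster is a 90% reduction.
speedup_pct <- function(t_base, t_new) 100 * (t_base - t_new) / t_base

print(speedup_pct(10, 1))   # 90
print(speedup_pct(200, 100))  # 50
```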

Overall, it looks like cloud computing with a little help from RRO is definitely the way to go. Unfortunately, this setup is not the easiest for the average person to achieve. Good thing I’m working on a little side project to help solve this issue :) … more to come about that in a future post.