**Looking at data**, and kindly contributed to R-bloggers)

The past few days I’ve been going through some R code that I wrote last year, when I was preparing a massive simulation-based power study for some tests for multivariate normality that I’ve been working on. My goal was to reduce the time needed to run the simulation. I wasn’t expecting great improvement, since I’ve always believed that the most common R functions are properly vectorized and optimized for speed. Turns out I was wrong. Very wrong.

The first thing that I did was that I replaced all parentheses ( ) by curly brackets { }. I was inspired to do so by this post (and this, via Xi’Ans Og) over at Radford Neal’s blog. As he pointed out, code that uses parentheses is actually slower than the same code with curly brackets:

> system.time( for(i in 1:1000000) { 1*(1+1) } )

user system elapsed

1.337 0.005 1.349

> system.time( for(i in 1:1000000) { 1*{1+1} } )

user system elapsed

1.072 0.003 1.076

One thing that I found very surprising, and frankly rather disturbing, is that mean(x) takes ten times as long to calculate the mean value of the 50 real numbers in the vector x as the “manual” function sum(x)/50:

> x<-rnorm(50)

> system.time(for(i in 1:100000){mean(x)})

user system elapsed

1.522 0.000 1.523

> system.time(for(i in 1:100000){sum(x)/length(x)})

user system elapsed

0.200 0.000 0.200

> system.time(for(i in 1:100000){sum(x)/50})

user system elapsed

0.167 0.000 0.167

> system.time(for(i in 1:100000){ overn<-rep(1/50,50); x%*%overn })

user system elapsed

0.678 0.000 0.677

> overn<-rep(1/50,50); system.time(for(i in 1:100000){ x%*%overn })

user system elapsed

0.164 0.000 0.164

I guess that the R development core team have been focusing on making R an easy-to-use high level programming language rather than optimizing all functions, but the poor performance of mean is just embarrassing.

Similarly, the var function can be greatly improved upon. Here are some of the many possibilites:

> x <- rnorm(50)

> system.time( for(i in 1:100000) { var(x) } )

user system elapsed

4.921 0.000 4.925

> system.time( for(i in 1:100000) { sum((x-mean(x))^2)/{length(x)-1} } )

user system elapsed

2.322 0.000 2.325

> system.time( for(i in 1:100000) { {sum(x*x)-sum(x)*sum(x)/length(x)}/{length(x)-1} } )

user system elapsed

0.736 0.000 0.737

> system.time( for(i in 1:100000) { {sum(x*x)-sum(x)*sum(x)/50}/49 } )

user system elapsed

0.618 0.000 0.618

> system.time( for(i in 1:100000) { sx<-sum(x); {sum(x*x)-sx*sx/50}/49 } )

user system elapsed

0.567 0.000 0.568

In my case, I’d been a bit sloppy with creating the vectors in my loops in the proper way, so I changed this in my code as well.

> system.time( source(“oldCode.R”) )

user system elapsed

548.045 0.273 548.622

> system.time( source(“newCode.R”) )

In conclusion, it’s definitely possible to speed up your code significantly if you know of the pitfalls of R. I suspect that I’ll be obsessed with finding more pitfalls in the next few weeks, so I’d be thankful for any hints about other weaknesses that R has.

It should probably be mentioned that R is really fast when things are properly vectorized. Last year, a coworker that uses Matlab challenged me to perform a number of matrix computations faster in R than in Matlab. To his great surprise, R won.

As a final remark, I’m now facing a bit of a dilemma. Should I write readable code; a^6; or fast code; a*a*a*a*a*a?**Update:** looking to speed up your R computations even more? See my posts on compiling your code and parallelization.

**leave a comment**for the author, please follow the link and comment on their blog:

**Looking at data**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...