# GNU R vs Julia: is it only a matter of devectorization?

December 28, 2013
By

(This article was first published on R snippets, and kindly contributed to R-bloggers)

Recently I have read a post on benefits of code devectorization in Julia. The examples given there inspired me to perform my own devectorization exercise. I decided to use bootstrapping as a test ground. The results are quite interesting (and not so bad for GNU R).

The task is very simple (and typical):

1. generate 10000 elements sample from uniform distribution
2. 1000 times perform bootstrap sample of the vector and calculate its standard deviation
3. return standard deviation of the bootstrap distribution
4. Perform steps 1-3 four times and record computation time

run  <- function() {
ssize <- 10000
nboot <- 1000
x <- runif(ssize)
y <- replicate(nboot, sd(sample(x, ssize, TRUE)))
sd(y)
}

for (i in 1:4) {
cat(system.time(run())[3], " ")
}
# result: 0.34  0.32  0.31  0.34

The direct translation to Julia gives:
using Distributions

function run()
ssize = 10000
nboot = 1000
x = rand(ssize)
y = Array(Float64,nboot)
for i in 1:nboot
y[i] = std(sample(x, ssize))
end
std(y)
end

for i in 1:4
print("$(@elapsed run()) ") end # result: 0.38965987 0.331032088 0.3208918 0.315452803 So not counting longer Julia start up time - execution time is comparable. Now we can try devectorizing Julia code: function run() ssize = 10000 nboot = 1000 x = rand(ssize) y = Array(Float64,nboot) bx = Array(Float64, ssize) for i in 1:nboot for j in 1:ssize bx[j] = x[(rand(Uint32) % ssize) + 1] end y[i] = std(bx) end std(y) end for i in 1:4 print("$(@elapsed run())  ")
end
# result: 0.072204955  0.082288181  0.072243955  0.073718137

We have over 4-times speedup. So devectorization works perfectly and Julia lives up to the promise of performance gain.

However, this is not a whole story. Remember that GNU R performs extra work behinds the scenes that standard Julia ignores - it handles missing values (NA) out of the box.

Actually we can tell Julia to take missing values into account via DataFrames package. Let us test vectorized and devectorized code.

Vectorized Julia with NAs:
using Distributions
using DataFrames

function run()
ssize = 10000
nboot = 1000
x = DataArray(rand(ssize))
y = DataArray(Float64,nboot)
for i in 1:nboot
y[i] = std(sample(x, ssize))
end
std(y)
end

for i in 1:4
print("$(@elapsed run()) ") end # result: 1.320962821 1.305368346 1.250524339 1.272740111 So w are over 4 times slower than GNU R now. How about devectorized Julia with NAs? Let us try: using DataFrames function run() ssize = 10000 nboot = 1000 x = DataArray(rand(ssize)) y = DataArray(Array(Float64,nboot)) bx = DataArray(Array(Float64, ssize)) for i in 1:nboot for j in 1:ssize bx[j] = x[(rand(Uint32) % ssize) + 1] end y[i] = std(bx) end std(y) end for i in 1:4 print("$(@elapsed run())  ")
end
# result: 1.027682685  0.994636538  1.017903246  1.021942776

There is a speed up, but it is minor and we still are 3 times slower.

Probably Julia code could be tuned, but we can draw some preliminary conclusions (at least for current versions of GNU R and Julia).
If you can use default Julia data types in your analysis Julia should easily beat GNU R in performance, especially after devectorization. However if you require handling of missing values in your code and have to use DataFrames package in Julia you can expect GNU R to be quite well optimized.

PS.
While trying to get a grasp of Julia I have collected some notes on its syntax and features from R-programmers perspective. You can have a peek its preliminary version here.