GNU R vs Julia: is it only a matter of devectorization?

Bogumił Kamiński

8 years ago

[This article was first published on R snippets, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Recently I have read a post on benefits of code devectorization in Julia. The examples given there inspired me to perform my own devectorization exercise. I decided to use bootstrapping as a test ground. The results are quite interesting (and not so bad for GNU R).

The task is very simple (and typical):

generate 10000 elements sample from uniform distribution
1000 times perform bootstrap sample of the vector and calculate its standard deviation
return standard deviation of the bootstrap distribution
Perform steps 1-3 four times and record computation time

Let us start with GNU R implementation:
run <- function() {
ssize <- 10000
nboot <- 1000
x <- runif(ssize)
y <- replicate(nboot, sd(sample(x, ssize, TRUE)))
sd(y)
}

for (i in 1:4) {
cat(system.time(run())[3], ” “)
}
# result: 0.34 0.32 0.31 0.34

The direct translation to Julia gives:
using Distributions

function run()
ssize = 10000
nboot = 1000
x = rand(ssize)
y = Array(Float64,nboot)
for i in 1:nboot
y[i] = std(sample(x, ssize))
end
std(y)
end

for i in 1:4
print(“$(@elapsed run()) “)
end
# result: 0.38965987 0.331032088 0.3208918 0.315452803

So not counting longer Julia start up time – execution time is comparable.

Now we can try devectorizing Julia code:

function run()

ssize = 10000

nboot = 1000

x = rand(ssize)

y = Array(Float64,nboot)

bx = Array(Float64, ssize)

for i in 1:nboot

for j in 1:ssize

bx[j] = x[(rand(Uint32) % ssize) + 1]

end

y[i] = std(bx)

end

std(y)

end

for i in 1:4

print(“$(@elapsed run()) “)

end

# result: 0.072204955 0.082288181 0.072243955 0.073718137

We have over 4-times speedup. So devectorization works perfectly and Julia lives up to the promise of performance gain.

However, this is not a whole story. Remember that GNU R performs extra work behinds the scenes that standard Julia ignores – it handles missing values (NA) out of the box.

Actually we can tell Julia to take missing values into account via DataFrames package. Let us test vectorized and devectorized code.

Vectorized Julia with NAs:

using Distributions

using DataFrames

function run()

ssize = 10000

nboot = 1000

x = DataArray(rand(ssize))

y = DataArray(Float64,nboot)

for i in 1:nboot

y[i] = std(sample(x, ssize))

end

std(y)

end

for i in 1:4

print(“$(@elapsed run()) “)

end

# result: 1.320962821 1.305368346 1.250524339 1.272740111

So w are over 4 times slower than GNU R now. How about devectorized Julia with NAs? Let us try:

using DataFrames

function run()

ssize = 10000

nboot = 1000

x = DataArray(rand(ssize))

y = DataArray(Array(Float64,nboot))

bx = DataArray(Array(Float64, ssize))

for i in 1:nboot

for j in 1:ssize

bx[j] = x[(rand(Uint32) % ssize) + 1]

end

y[i] = std(bx)

end

std(y)

end

for i in 1:4

print(“$(@elapsed run()) “)

end

# result: 1.027682685 0.994636538 1.017903246 1.021942776

There is a speed up, but it is minor and we still are 3 times slower.

Probably Julia code could be tuned, but we can draw some preliminary conclusions (at least for current versions of GNU R and Julia).

If you can use default Julia data types in your analysis Julia should easily beat GNU R in performance, especially after devectorization. However if you require handling of missing values in your code and have to use DataFrames package in Julia you can expect GNU R to be quite well optimized.

PS.

While trying to get a grasp of Julia I have collected some notes on its syntax and features from R-programmers perspective. You can have a peek its preliminary version here.

To leave a comment for the author, please follow the link and comment on their blog: R snippets.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.