A room full of Julians

January 22, 2014
(This article was first published on Robert Grant's stats blog » R, and kindly contributed to R-bloggers)

Despite winter rain, I was delighted to head uptown last week to Skills Matter on the old Goswell Road for the first ever London Julia meetup. The first thing I learnt was that Julia’s friends are called Julians.

If you don’t know it yet, Julia is a pretty new (v0.3 is current) programming language for fast numerical computing. Everything is designed from the ground up for speed, by some very clever people. They claim speeds consistently close to compiled machine code, which is generally the upper limit, like the speed of light. But a few facts make it potentially Revolutionary Computing: you don’t have to compile it before running, you can mess about in a command-line interface to learn it, it’s free and open source, you can directly call C functions from inside normal Julia code (and call Julia from C, so it works vice versa), and the syntax is LISP-ish and light as eiderdown (there are some nice comparative examples of this on the homepage).

Arise, ye Julians

The focus was on getting started, and the room was packed. Personally, I spent some time playing with it last year and then let it lapse, but now with v0.3 out there it seems to be time to get back up to speed.

For stats people, there are a few important packages to install: Distributions, Stats, DataFrames, HypothesisTests, and possibly Optim and MCMC, depending on your own interests. That’s all pretty straightforward, but when you start up Julia or load one of the packages like this:

using HypothesisTests

it takes a noticeable while to get ready. This is an artefact of the just-in-time compiler: almost all of the packages and the standard library are written in Julia itself, so the first time you need a function it has to be compiled, and after that it should be superfast. Apparently a package is on the way to supply a pre-compiled standard library, to increase startup speeds.
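To see the compile-on-first-use behaviour for yourself, here is a minimal sketch that times the same call twice. The function `sumsq` is a made-up helper for illustration, not something from the meetup or from any package; `@elapsed` is the standard Julia timing macro.

```julia
# Hypothetical helper (not from the post): sum of squares of n standard normal draws
function sumsq(n)
    total = 0.0
    for i in 1:n
        total += randn()^2
    end
    return total
end

first_call = @elapsed sumsq(1000)    # this timing includes compiling sumsq
second_call = @elapsed sumsq(1000)   # this call reuses the compiled code

println("first call:  ", first_call, " seconds")
println("second call: ", second_call, " seconds")
```

On most machines the first timing should be noticeably larger than the second, although the exact figures will depend on your hardware and Julia version.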

Here’s a little power simulation I tried out afterwards:

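using HypothesisTests
# Count how many of nsims simulated t-tests are significant at the 5% level.
# Wrapped in a function so that nsig is an ordinary local variable.
function countsig(nsims)
 nsig=0
 for i in 1:nsims
  xx=140 .+ 15 .* randn(10)   # group 1: mean 140, SD 15
  yy=135 .+ 15 .* randn(10)   # group 2: mean 135, SD 15
  nsig += pvalue(EqualVarianceTTest(xx,yy))<0.05 ? 1 : 0
 end
 return nsig
end
starttime=time()
nsig=countsig(100000)
time()-starttime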

This does 100,000 simulations of independent-samples t-tests with sample size 10 per group, means 140 and 135, and SD 15, and took 5.05 seconds on a teeny weeny Samsung N110 ‘netbook’ with 1.6GHz Atom CPU and 1GB RAM (not what you would normally use!) once the package was loaded.

In R, you could do this at least two ways. First a supposedly inefficient looped form:


Sys.time()
nsig<-0
for (i in 1:100000) {
 xx<-rnorm(10,mean=140,sd=15)
 yy<-rnorm(10,mean=135,sd=15)
 # note: t.test() defaults to Welch's unequal-variance test
 if(t.test(xx,yy)$p.value<0.05) {
  nsig<-nsig+1
 }
}
Sys.time()
print(nsig)

Next, a supposedly more efficient vectorized form:


tp<-function(x) {
 # x is a 10 x 2 matrix: one column per group
 return(t.test(x[,1],x[,2])$p.value)
}
Sys.time()

# 100,000 simulations x 10 observations x 2 groups
xx<-array(c(rnorm(1000000,mean=140,sd=15),
 rnorm(1000000,mean=135,sd=15)),
 dim=c(100000,10,2))
pp<-apply(xx,1,tp)
ppsig<-(pp<0.05)
table(ppsig)
nsig<-sum(ppsig)
Sys.time()
print(nsig)

In fact, the first version was slightly quicker at 2 minutes 3 seconds, compared to 2 minutes 35. While we’re about it, let’s run it in Stata too:

clear all
timer on 1
set obs 10
local p = 0
gen x=.
gen y=.
forvalues i=1/1000 {
 qui replace x=rnormal(140,15)
 qui replace y=rnormal(135,15)
 qui ttest x==y, unpaired
 if r(p)<0.05 local p = `p'+1
}
dis `p'
timer off 1
timer list

That took 30 seconds so we’re looking at 50 minutes to do the whole 100,000 simulations, but Stata black belts would complain that the standard language is not the best tool for this sort of heavy duty number-crunching. I asked top clinical trial statistician Dan Bratton for some equivalent code in the highly optimised Mata language:


timer clear 1
timer on 1
mata:
reps = 100000
n = (10 \ 10)
m = (140 , 135)
s = (15 , 15)
pass = 0
for (i=1;i<=reps;i++) {
 // 10 x 2 matrix: one column per group
 X = rnormal(10,1,m,s)
 mhat = mean(X)
 v = variance(X)
 df = n[1]+n[2]-2
 // pooled-variance t statistic
 t = (mhat[1]-mhat[2])/sqrt((1/n[1]+1/n[2])*((n[1]-1)*v[1,1]+(n[2]-1)*v[2,2])/df)
 // two-sided p-value, so use |t|
 p = 2*ttail(df,abs(t))
 if (p<0.05) pass = pass+1
}
pass/reps
end
timer off 1
timer list 1

… which clocked in at 7 seconds. I’m not going to try anything more esoteric because I’m interested in the speed for those very pragmatic simulations such as sample size calculations, which the jobbing statistician must do quite often. (Actually, there is an adequate approximation formula for t-tests that means you would never do this simulation.)
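The post does not say which approximation formula is meant. One common choice, assumed here purely for illustration, is the normal approximation, which treats the SD as known and so will slightly overstate the power of a t-test at small sample sizes. The function name `approx_power` and its arguments are made up for this sketch, and it relies on the Distributions package mentioned earlier.

```julia
using Distributions  # one of the packages listed earlier in the post

# Normal-approximation power for a two-sided, two-sample comparison of means.
# delta: true difference in means, sigma: common SD, n: sample size per group.
# The function name and arguments are illustrative, not from the original post.
function approx_power(delta, sigma, n; alpha=0.05)
    z_crit = quantile(Normal(), 1 - alpha/2)      # about 1.96 for alpha = 0.05
    se = sigma * sqrt(2/n)                        # SE of the difference in means
    return cdf(Normal(), abs(delta)/se - z_crit)  # ignores the small opposite tail
end

# The scenario simulated above: means 140 and 135, SD 15, 10 per group
println(approx_power(140 - 135, 15, 10))   # roughly 0.11
```

Dividing the simulated count of significant results by the number of simulations gives the figure to compare this against.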

That time difference surprised me, to say the least. It means that Julia is an option to take very seriously indeed for heavy-duty statistical calculations. It really isn’t hard to learn. However, I don’t know of any scientific papers published yet that used Julia instead of any more established software. Perhaps the version 0.x would worry editors and reviewers, but surely v1.0 is not far away now.
