Julia versus R – Playing around

(This article was first published on Blag's bag of rants, and kindly contributed to R-bloggers)

So...as time goes by, I'm getting more proficient with Julia...which is something fairly easy as the learning curve is pretty fast...

I decided to load a file with 590,209 records that I got from Freebase...the file in question contains Actors and Actresses from movies...you can have a quick look here...


For this test, I'm using my Linux box on VMWare running on 2 GB of RAM...running Ubuntu 12.04.4 (Precise)

For R, I'm not using any special package...just plain R...version 2.14.1 and for Julia version 0.2.1, I'm using the DataFrames package...

Let's take a look at the R source code first along with its runtime processing...

Actors_Info.R
start.time <- Sys.time()
if(!exists("Actors")){
Actors<-read.csv("Actors_Table.csv", header=TRUE, 
                     stringsAsFactors=FALSE, colClasses="character", na.strings = "")
}
Actors<-unique(Actors)
Actors<-Actors[complete.cases(Actors),]
Actor_Info<-data.frame(Actor_Id=Actors$Actor_Id,Name=Actors$Name,Gender=Actors$Gender)
Actor_Info<-Actor_Info[order(Actor_Info$Gender),]
write.csv(Actor_Info,"Actor_Info_R.csv",row.names=TRUE)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

This source will first ask if the file was loaded already, if not...it will load it...then, it will eliminate the repeated records, delete all the null or NA's and the create a new Data Frame, sort it by "Gender" and then write a new CSV file...time will be taken to measure its speed...we will run it twice...first time the file is not loaded...second time it will...and that should improve greatly the execution time...



As we can see...the times are really good...and the different between the first and second run are pretty obvious...for the record...the generated file contains 105874 records...

Now...let's see the Julia version of the code...

Actors_Info.jl
using DataFrames
start = time()
isdefined(:Actors) || (Actors = readtable("Actors_Table.csv", header=true, nastrings=["","NA"]))
drop_duplicates!(Actors)
complete_cases!(Actors)
Actor_Info = DataFrame(Actor_Id=Actors["Actor_Id"],Name=Actors["Name"],Gender=Actors["Gender"])
sortby!(Actor_Info, [:Gender])
writetable("Actor_Info_Julia.csv", Actor_Info)
finish = time()
println("Time: ", finish-start)

Here...we're doing the same...we load the DataFrames package (But exclude that from the execution time), check if the file is loaded so we don't load it again on the second run...eliminate duplicates, delete all null or NA, create a new DataFrame, sort it by "Gender" and finally write a new CVS file...


Well...the difference between the second and first run is very significative...but of course...way slower than R...

But...let me tell you one simple thing...Julia is still a brand new language...the DataFrames package is not part of the core Julia language, which means...that its even newer...and optimizations are being performed as we speak...I would say that for a young language...18 seconds to process 590,209 records is pretty awesome...and of course...my R experience surpasses greatly my Julia experience...

So...I don't really want to leave you with the impression that Julia is not good or not fast enough...because believe me...it is...and you going to love my next experiment -;)

Let's take a look at the R source code first...

Random_Names.R
start.time <- Sys.time()
names<-c("Anne","Gigi","Blag","Juergen","Marek","Ingo","Lars","Julia",
"Danielle","Rocky","Julien","Uwe","Myles","Mike", "Steven")

last_names<-c("Hardy","Read","Tejada","Schmerder","Kowalkiewicz","Sauerzapf",
"Karg","Satsuta","Keene","Ongkowidjojo","Vayssiere","Kylau",
"Fenlon","Flynn","Taylor")
full_names<-c()
for(i in 1:100000){
name<-sample(1:15, 1)
last_name<-sample(1:15, 1)
full_name<-paste(names[name],last_names[last_name],sep=" ")
full_names<-append(full_names,full_name)
}
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

So this code is fairly simple...we have a couple of vectors with names and last names...then we loop 100000 times and then generate a couple of random numbers simply to read the vectors, create a full name and populate a new vector... with some random funny name combinations...



Well....the different between both runs is not really good...second time was a little bit higher...and 1 minute is kind of a lot...let's see how Julia behaves...

Here's the Julia source code...

Random_Numbers.jl
start = time()
names=["Anne","Gigi","Blag","Juergen","Marek","Ingo","Lars","Julia",
"Danielle","Rocky","Julien","Uwe","Myles","Mike", "Steven"]
last_names=["Hardy","Read","Tejada","Schmerder","Kowalkiewicz","Sauerzapf",
"Karg","Satsuta","Keene","Ongkowidjojo","Vayssiere","Kylau","Fenlon","Flynn","Taylor"]
full_names=String[]
full_name = ""
for i = 1:100000
name=rand(1:15)
last_name=rand(1:15)
full_name = names[name] * " " * last_names[last_name]
push!(full_names,full_name)
end
finish = time()
println("Time: ", finish-start)

So this code as well, creates two arrays with names and last names, do a loop 100000 times, generate a couple of random numbers, mix a name with a last name and then populate a new array with some mixed full names...


Just like in the R code...the second time took Julia a little bit more...but...less than a second?! That's something like...amazingly fast and really took R by storm...

Now...I believe you will start to take Julia more seriously -:D

Hope you liked this blog...

Greetings,

Blag.
Development Culture.

To leave a comment for the author, please follow the link and comment on his blog: Blag's bag of rants.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.