# An R programmer looks at Julia

**A second megabyte of memory**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In January of this year I first saw mention of the Julia language in the release notes for LLVM. I mentioned this to Dirk Eddelbuettel and later we got in contact with Viral Shahregarding a Debian package for Julia.

There are many aspects of *Julia* that are quite intriguing to an R programmer. I am interested in programming languages for “Computing with Data”, in John Chambers’ term, or “Technical Computing”, as the authors of *Julia* classify it. I believe that learning a programming language is somewhat like learning a natural language in that you need to live with it and use it for a while before you feel comfortable with it and with the culture surrounding it.

A common complaint for those learning *R* is finding the name of the function to perform a particular task. In writing a bit of *Julia*code for fitting generalized linear models, as described below, I found myself in exactly the same position of having to search through documentation to find how to do something that I felt should be simple. The experience is frustrating but I don’t know of a way of avoiding it. One word of advice for *R* programmers looking at *Julia*, the names of most functions correspond to the *Matlab*/*octave* names, not the *R* names. One exception is the d-p-q-r functions for distributions, as I described in an earlier posting.

## Evaluating a new language

There is a temptation to approach a new programming language with the idea that it should be exactly like the language I am currently using but better, faster, easier, … This is, of course, ridiculous. If you want *R* syntax, semantics, function names, and packages then use *R*. If you consider another language you should accept that many tasks will be done differently. Some will be done better than in your current language and some will not be done as well. And the majority will just be done differently.

Two primary *Julia* developers, Jeff Bezanson and Stefan Karpinski, recently gave a presentation at the lang.NEXT conference. The slides and videocan help you get a flavor of the language.

So, what does *Julia* provide that *R* doesn’t?

A JIT. The first time that a function is called with a particular argument signature it is compiled to machine code using the LLVM compiler backend. This, naturally, provides a speed boost. It can also affect programming style. In

*R*one instinctively writes vectorized code to avoid a performance hit. In*Julia*non-vectorized code can have similar and, under some circumstances, better performance than vectorized code.A richer type system. Scalars, vectors/arrays, tuples, composite types and several others can be defined in

*Julia*. In*R*the atomic types are actually atomic vectors and the composite types are either lists (which is actually a type of vector in*R*, not a lisp-like list), vectors with attributes (called structures in*R*but not to be confused with*C*‘s structs). The class systems, S3 and S4, are built on top of these structures.Currently it is difficult to work with 64-bit integers in

*R*but not in*Julia*Multiple dispatch. In

*Julia*all functions provide multiple dispatch to handle different signatures of arguments. In some sense, functions are just names used to index into method tables. In*R*terminology all functions are generic functions with S4-like dispatch, not S3-like which only dispatches on the first argument.Parallelism. A multiple process model and distributed arrays are a basic part of the

*Julia*language.A cleaner model for accessing libraries of compiled code. As Jeff and Stefan point out, languages like

*R*and*Matlab*/*octave*typically implement the high-level operations in “glue” code that, in turn, accesses library code. In*Julia*one can short-circuit this process and call the library code from the high-level language. See the earlier blog posting on accessing the Rmath library from Julia. This may not seem like a big deal unless you have written thousands of lines of glue code, as I have.

There are some parts of *R* not present in *Julia* that programmers may miss.

Default arguments of functions and named actual arguments in function calls. The process of matching actual arguments to formal arguments in

*R*function calls is richer. All matching in*Julia*is by position within function signature. Function design with many arguments requires more thought and care in*Julia*than in*R*.The

*R*package system and CRAN. One could also throw in namespaces and function documentation standards as part of the package system. These are important parts of*R*that are not yet available. However, that does not preclude simular facilities being developed for*Julia*. The*R*package system did not spring fully grown from the brow of Zeus.Graphics. One of the big advantages of

*R*over similar languages is the sophisticated graphics available in ggplot2 or lattice. Work is underway on graphics models for*Julia*but, again, it is early days still.

## An example: Generalized Linear Models (GLMs)

Most *R* users are familiar with the basic model-fitting functions like lm() and glm(). These functions call model.frame and model.matrix to form the numerical representation of the model from the formula and data specification, taking into account optional arguments such as subset, na.action and contrasts. After that they call the numerical workhorses, lm.fit and glm.fit. What I will describe is a function like glm.fit (the name in *Julia* cannot include a period and I call it glmFit) and structures like the glm “family” in *R*.

A glm family is a collection of functions that describe the distribution of the response, given the mean, and the link between the linear predictor and the mean response. The link function takes the mean response, mu, to the “linear predictor”, eta, of the form

`X %*% beta # in R`

or

`X * beta # in Julia`

where X is the model matrix and beta is the coefficient vector. In practice we more frequently evaluate the inverse link, from eta to mu, and its derivative.

We define a composite type

`type Link`

name::String # name of the link

linkFun::Function # link function mu -> eta

linkInv::Function # inverse link eta -> mu

muEta::Function # derivative eta -> d mu/d eta

end

with specific instances

`logitLink =`

Link("logit",

mu -> log(mu ./ (1. - mu)),

eta -> 1. ./ (1. + exp(-eta)),

eta -> (e = exp(-abs(eta)); f = 1. + e; e ./ (f .* f)))

logLink =

Link("log",

mu -> log(mu),

eta -> exp(eta),

eta -> exp(eta))

identityLink =

Link("identity",

mu -> mu,

eta -> eta,

eta -> ones(eltype(eta), size(eta)))

inverseLink =

Link("inverse",

mu -> 1. ./ mu,

eta -> 1. ./ eta,

eta -> -1. ./ (eta .* eta))

for the common links. Here I have used the short-cut syntax

`mu -> log(mu ./ (1. - mu))`

to create anonymous functions. The muEta function for the logitLink shows the use of multiple statements within the body of the anonymous function. Like *R*, the last expression evaluated in a function is the value of a *Julia* function.

I also ensure that the functions in the Link object are safe for vector arguments by using the componentwise-forms of multiplicative operators (“./” and “.*”). These forms apply to scalars as well as vector. I use explicit floating-point constants, such as 1., to give the compiler a hint that the result should be a floating-point scalar or vector.

A distribution characterizes the variance as a function of the mean, and the “deviance residuals” used to evaluate the model-fitting criterion. For some distributions, including the Gaussian distribution the model-fitting criterion (residual sum of squares) is different from the deviance itself but minimizing this simpler criterion provides the maximum-likelihood (or minimum deviance) estimates of the coefficients.

When we eventually fit the model we want the value of the deviance itself, which may not be the same as the sum of squared deviance residuals. Other functions are used to check for valid mean vectors or linear predictors. Also each distribution in what is called the “exponential family”, which includes many common distributions like Bernoulli, binomial, gamma, Gaussian and Poisson, has a canonical link function derived from the distribution itself.

The composite type representing the distribution is

`type Dist`

name::String # the name of the distribution

canonical::Link # the canonical link for the distribution

variance::Function # variance function mu -> var

devResid::Function # vector of squared deviance residuals

deviance::Function # the scalar deviance

mustart::Function # derive a starting estimate for mu

validmu::Function # check validity of the mu vector

valideta::Function # check validity of the eta vector

end

with specific distributions defined as

`## utilities used in some distributions`

logN0(x::Number) = x == 0 ? x : log(x)

logN0{T<:Number}(x::AbstractArray{T}) = reshape([ logN0(x[i]) | i=1:numel(x) ], size(x))

y_log_y(y, mu) = y .* logN0(y ./ mu) # provides correct limit at y == 0

BernoulliDist =

Dist("Bernoulli",

logitLink,

mu -> max(eps(Float64), mu .* (1. - mu)),

(y, mu, wt)-> 2 * wt .* (y_log_y(y, mu) + y_log_y(1. - y, 1. - mu)),

(y, mu, wt)-> -2. * sum(y .* log(mu) + (1. - y) .* log(1. - mu)),

(y, wt)-> (wt .* y + 0.5) ./ (wt + 1.),

mu -> all((0 < mu) & (mu < 1)),

eta -> true)

gammaDist =

Dist("gamma",

inverseLink,

mu -> mu .* mu,

(y, mu, wt)-> -2 * wt .* (logN0(y ./ mu) - (y - mu) ./ mu),

(y, mu, wt)-> (n=sum(wt); disp=sum(-2 * wt .* (logN0(y ./ mu) - (y - mu) ./ mu))/n; invdisp(1/disp); sum(wt .* dgamma(y, invdisp, mu * disp, true))),

(y, wt)-> all(y > 0) ? y : error("non-positive response values not allowed for gammaDist"),

mu -> all(mu > 0.),

eta -> all(eta > 0.))

GaussianDist =

Dist("Gaussian",

identityLink,

mu -> ones(typeof(mu), size(mu)),

(y, mu, wt)-> (r = y - mu; wt .* r .* r),

(y, mu, wt)-> (n = length(mu); r = y - mu; n * (log(2*pi*sum(wt .* r .* r)/n) + 1) + 2 - sum(log(wt))),

(y, wt)-> y,

mu -> true,

eta -> true)

PoissonDist =

Dist("Poisson",

logLink,

mu -> mu,

(y, mu, wt)-> 2 * wt .* (y .* logN0(y ./ mu) - (y - mu)),

(y, mu, wt)-> -2 * sum(dpois(y, mu, true) * wt),

(y, mu)-> y + 0.1,

mu -> all(mu > 0),

eta -> true)

Finally the GlmResp type consists of the distribution, the link, the response, the mean and linear predictor. Occasionally we want to add an offset to the linear predictor expression or apply prior weights to the responses and these are included.

`type GlmResp # response in a glm model`

dist::Dist

link::Link

eta::Vector{Float64} # linear predictor

mu::Vector{Float64} # mean response

offset::Vector{Float64} # offset added to linear predictor (usually 0)

wts::Vector{Float64} # prior weights

y::Vector{Float64} # response

end

With just this definition we would need to specify the offset, wts, etc. every time we construct such an object. By defining an outer constructor (meaning a constructor that is defined outside the type definition) we can provide defaults

`## outer constructor - the most common way of creating the object`

function GlmResp(dist::Dist, link::Link, y::Vector{Float64})

n = length(y)

wt = ones(Float64, (n,))

mu = dist.mustart(y, wt)

GlmResp(dist, link, link.linkFun(mu), mu, zeros(Float64, (n,)), wt, y)

end

## another outer constructor using the canonical link for the distribution

GlmResp(dist::Dist, y::Vector{Float64}) = GlmResp(dist, dist.canonical, y)

The second outer constructor allows us to use the canonical link without needing to specify it.

There are several functions that we wish to apply to the glmResp object.

`deviance( r::GlmResp) = r.dist.deviance(r.y, r.mu, r.wts)`

devResid( r::GlmResp) = r.dist.devResid(r.y, r.mu, r.wts)

drsum( r::GlmResp) = sum(devResid(r))

muEta( r::GlmResp) = r.link.muEta(r.eta)

sqrtWrkWt(r::GlmResp) = muEta(r) .* sqrt(r.wts ./ variance(r))

variance( r::GlmResp) = r.dist.variance(r.mu)

wrkResid( r::GlmResp) = (r.y - r.mu) ./ r.link.muEta(r.eta)

wrkResp( r::GlmResp) = (r.eta - r.offset) + wrkResid(r)

Our initial definitions of these functions are actually method definitions because we declare that the argument r must be a GlmResp. However, defining a method also defines the function if it has not already been defined. In *R* we must separately specify S3 generics and methods. An S4 generic is automatically created from a method specification but the interactions of S4 generics and namespaces can become tricky.

Note that the period (‘.’) is used to access components or members of a composite type and cannot be used in a Julia identifier.

The function to update the linear predictor requires both a GlmResp and the linear predictor value to assign. Here we use an abstract type inheriting from Number to allow a more general specification

`updateMu{T<:Number}(r::GlmResp, linPr::AbstractArray{T}) = (r.eta = linPr + r.offset; r.mu = r.link.linkInv(r.eta); drsum(r))`

Updating the linear predictor also updates the mean response then returns the sum of the square deviance residuals. Note that this is not functional semantics in that the object r being passed as an argument is updated in place. Arguments to *Julia* functions are passed by reference. Those accustomed to the functional semantics of *R* (meaning that you can’t change the value of an argument by passing it to a function) should beware.

Finally we create a composite type for a predictor

`type predD # predictor with dense X`

X::Matrix{Float64} # model matrix

beta0::Vector{Float64} # base coefficient vector

delb::Vector{Float64} # increment

end

and an outer constructor

`## outer constructor`

predD(X::Matrix{Float64}) = (zz = zeros((size(X, 2),)); predD(X, zz, zz))

Given the current state of the predictor we create the weighted X’X matrix and the product of the weighted model matrix and the weighted working residuals in the accumulate function

`function accumulate(r::GlmResp, p::predD)`

w = sqrtWrkWt(r)

wX = diagmm(w, p.X)

(wX' * (w .* wrkResp(r))), (wX' * wX)

end

This function doesn’t need to be written separately but I hope that doing so will make it easier to apply these operations to distributed arrays, which, to me, seem like a natural way of applying parallel computing to many statistical computing tasks.

A function to calculate and apply the increment is

`increment(r::GlmResp, p::predD) = ((wXz, wXtwX) = accumulate(r, p); bb = wXtwX \ wXz; p.delb = bb - p.beta0; updateMu(r, p.X * bb))`

and finally we get to the glmFit function

`function glmFit(p::predD, r::GlmResp, maxIter::Uint, minStepFac::Float64, convTol::Float64)`

if (maxIter < 1) error("maxIter must be positive") end

if (minStepFac < 0 || 1 < minStepFac) error("minStepFac must be in (0, 1)") end

cvg = false

devold = typemax(Float64) # Float64 version of Inf

for i=1:maxIter

dev = increment(r, p)

if (dev < devold)

p.beta0 = p.beta0 + p.delb

else

error("code needed to handle the step-factor case")

end

if abs((devold - dev)/dev) < convTol

cvg = true

break

end

devold = dev

end

if !cvg

error("failure to converge in $maxIter iterations")

end

end

glmFit(p::predD, r::GlmResp) = glmFit(p, r, uint(30), 0.001, 1.e-6)

Defining two methods allows for default values for some of the arguments. Here we have a kind of all-or-none approach to defaults. It is possible to create other method signatures for defaults applied to only some of the arguments. In retrospect I think I was being too cute in declaring the maxIter argument as an unsigned int. It only makes sense for this to be positive but it might make life simpler to allow it to be an integer as positivity is checked anyway.

Note the use of string interpolation in the last error message. The numeric value of maxIter will be substituted in the error message.

## Checking that it works

The script

`load("../extras/glm.jl")`

## generate a Bernoulli response

srand(123454321)

n = 10000

X = hcat(fill(1, (n,)), 3. * randn((n, 2)))

beta = randn((3,))

eta = X * beta

mu = logitLink.linkInv(eta)

y = float64(rand(n) < mu)

println("Coefficient vector for simulation: $(beta)")

## establish the glmResp object

rr = GlmResp(BernoulliDist, y)

pp = predD(X)

glmFit(pp, rr)

println("Converged to coefficient vector: $(pp.beta0 + pp.delb)")

println("Deviance at estimates: $(deviance(rr))")

## reinitialize objects for timing

rr = GlmResp(BernoulliDist, y)

pp = predD(X)

println("Elapsed time for fitting binary response with X $(size(X,1)) by $(size(X,2)): $(@elapsed glmFit(pp, rr)) seconds")

produces

`Coefficient vector for simulation: [-0.54105, 0.308874, -1.02009]`

Converged to coefficient vector: [-0.551101, 0.301489, -1.02574]

Deviance at estimates: 6858.466827205941

Elapsed time for fitting binary response with X 10000 by 3: 0.09874677658081055 seconds

**leave a comment**for the author, please follow the link and comment on their blog:

**A second megabyte of memory**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.