RStan: Fast, multilevel Bayesian modeling in R

[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

For the last decade or so, the go-to software for Bayesian statisticians has been BUGS (and later the open-source incarnation, OpenBugs, or JAGS). BUGS is used for multi-level modeling: using a specialized notation, you can define random variables of various distributions, set Bayesian priors for their parameters, and create the network of relationships that describe how the random variables influence each other. To apply these models to data requires some heavy-duty number crunching: hundreds or thousands of simulations of every random variable in the model (of which there can be dozens), all in parallel. This has generally been done using a technique called Markov-Chain Monte Carlo (MCMC) with the Gibbs Sampler. But as Andrew Gelman found, applying this algorithm to complex models was showing the strain:  

“The models we wanted to fit turned out to be a challenge for current general- purpose software to fit. A direct encoding in BUGS or JAGS can grind these tools to a halt. Matt Schofield found his multilevel time-series regression of climate on tree-ring measurements wasn’t converging after hundreds of thousands of iterations.”

To solve this problem, Gelman and collaborators from the University of Columbia announced last night that they have created STAN: new, high-performance open-source software for Bayesian inference on multi-level models. Stan has all the generality and ease of use of BUGS, and can solve the multilevel generalized linear models described in Part II of the book Data Analysis Using Regression and Multilevel/Hierarchical Models. Rather than the traditional Gibbs sampler, Stan uses a variant of Hamiltonian Monte Carlo (HMC) to speed up calculations. Stan supports truncated and/or censored data, and so can be used to fit survival and reliability models with non-standard (or even user-defined) probability distributions. By automatically converting models to compiled C++ code, Stan can solve such complex models with non-standard distributions quickly. (Check out these comparisons of Stan's performance with BUGS and JAGS.)

While Stan has a command-line interface, it's most easily used via the R package interface, RStan. R users will feel right at home with the Stan syntax used to describe multi-level models. For example, the code below describes a model with bivariate covariance matrix in which the variances are known, but the covariance is not:

data {
       int N;
       vector[2] y[N];
       double var1;  double var2;
     }
     transformed data {
       double max_cov;  double min_cov;
       max_cov <- sqrt(var1 * var2);
       min_cov <- -max_cov;
     }
     parameters {
       vector[2] mu;
       real cov;
     }
transformed parameters {
  matrix[2,2] sigma;
  sigma[1,1] <- var1;
  sigma[2,1] <- cov;
  sigma[1,2] <- cov;
  sigma[2,2] <- var2;
} model {
      for (n in 1:N)
        y[n] ~ multi_normal(mu,sigma);
}

For more information about Stan, check out the Stan home page and especially the comprehensive (178-page) manual. R users should check out the page Running Stan from R for information about the RStan package. The discussion group for Stan users is already quite active.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.