by Joseph Rickert
I don’t think that most people find reading the articles in the statistical journals to be easy going. In my experience, the going is particularly rough when trying to learn something completely new, and I don’t expect it could be any other way. There is no getting around the hard work. However, at least in the field of computational statistics, things seem to be getting a little easier. These days, it is very likely that you will find some code included in the supplementary material for a journal article, at least in the Journal of Computational and Graphical Statistics (JCGS). JCGS, which was started in 1992 with the mission of presenting “the very latest techniques on improving and extending the use of computational and graphical methods in statistics and data analysis”, still seems to be the place to publish. (Stanford's Rob Tibshirani published an article with Michael LeBlanc in Volume 1, Issue 1 back in 1992, and another in the most recent issue with Noah Simon, Jerome Friedman and Trevor Hastie.)
Driven by the imperative to produce reproducible research, most authors in this journal include some computer code to facilitate independent verification of their results. Of the 80 non-editorial articles published in the last 6 issues of JCGS, all but 9 included computer code as part of the supplementary materials. The following table lists the counts of each type of software included. (Note that a few articles included code in multiple languages, R and C++ for example, which is why the counts sum to 83 rather than 80.)
                June13  March13  Dec12  Sept12  June12  March12  total_by_code
R                    9        9      5       5       7        7             42
Matlab               6        0      1       3       4        4             18
c                    0        0      1       2       1        0              4
cpp                  0        0      0       1       1        2              4
other                0        1      0       3       0        2              6
none                 0        6      2       0       1        0              9
total_by_month      15       16      9      14      14       15             83
R code accounted for 57% (42 of 74) of the instances of software included in the supplementary materials. I think an important side effect of the inclusion of code is that studying the article is much easier for everyone. Seeing the R code is like walking into a room full of people and spotting a familiar face: you know where to start. At the very least, it seems feasible to “reverse engineer” the article: look at the input data, run the code, see what it produces and map it to the math.
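As a quick check on the arithmetic, the counts in the table can be typed into R and the totals recomputed. This is only a sketch: the variable names (counts, total_by_code, software_instances, r_share) are my own and the numbers are transcribed by hand from the table above.

```r
# Counts transcribed by hand from the table above
# (rows: type of software, columns: JCGS issue).
issues <- c("June13", "March13", "Dec12", "Sept12", "June12", "March12")
counts <- rbind(
  R      = c(9, 9, 5, 5, 7, 7),
  Matlab = c(6, 0, 1, 3, 4, 4),
  c      = c(0, 0, 1, 2, 1, 0),
  cpp    = c(0, 0, 0, 1, 1, 2),
  other  = c(0, 1, 0, 3, 0, 2),
  none   = c(0, 6, 2, 0, 1, 0)
)
colnames(counts) <- issues

# Margins of the table.
total_by_code  <- rowSums(counts)
total_by_month <- colSums(counts)

# "Instances of software" leave out the articles with no code at all.
software_instances <- sum(total_by_code) - total_by_code[["none"]]
r_share <- total_by_code[["R"]] / software_instances

print(total_by_code)
print(total_by_month)
cat(sprintf("R share of software instances: %.0f%%\n", 100 * r_share))
```

The last line should report the same 57% quoted above, since 42 of the 74 instances are R.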
The following code comes from the supplementary material included in the survey article: “Computational Statistical Methods for Social Networks Models” by Hunter, Krivitsky and Schweinberger in the December 2012 issue of JCGS.
    # Some of the code from Appendix of the article:
    # “Computational Statistical Methods for Social Networks Models”
    # by Hunter, Krivitsky and Schweinberger in the December 2012 issue of JCGS.

    # Two-dimensional Euclidean latent space model with three clusters and random
    # receiver effects
    library(latentnet)
    data(sampson)
    monks.d2G3r <- ergmm(samplike ~ euclidean(d=2,G=3)+rreceiver)
    Z <- plot(monks.d2G3r, rand.eff="receiver", pie=TRUE, vertex.cex=2)
    text(Z, label=1:nrow(Z))

    # Three-dimensional Euclidean latent space model with three clusters and
    # random receiver effects
    library(latentnet)
    data(sampson)
    monks.d3G3r <- ergmm(samplike ~ euclidean(d=3,G=3)+rreceiver)
    plot(monks.d3G3r, rand.eff="receiver", use.rgl=TRUE, labels=TRUE)
The first block of code, which fits the two-dimensional model, produces the graph below.
The sampson data set contains social network data that Samuel F. Sampson collected in the late ‘60s when he was a resident experimenter at a monastery in New England. The call to ergmm() fits a “latent space model” by embedding the data in a two-dimensional Euclidean space, clustering it into 3 groups and including a random “receiver” effect. The last four lines of code produce a way cool, interactive three-dimensional plot that you can rotate.
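To get a feel for what a “latent space model” with a receiver effect is saying, here is a minimal base R sketch of one common formulation: the log-odds of a tie from actor i to actor j is an intercept, minus the Euclidean distance between their latent positions, plus an effect for the receiver j. Everything in it is an assumption made for illustration. The positions Z, the intercept beta and the receiver effects delta are made-up numbers, not estimates from the sampson data, and I am not claiming this is exactly how latentnet parameterizes the model internally.

```r
# Hypothetical latent positions for four actors in two dimensions
# (illustrative values only; not estimates from the sampson data).
Z <- rbind(c(0.0, 0.0),
           c(0.5, 0.2),
           c(3.0, 3.0),
           c(3.2, 2.7))

beta  <- 1.0                      # hypothetical intercept
delta <- c(0.0, 0.5, -0.5, 0.0)   # hypothetical receiver effects, one per actor

# Pairwise Euclidean distances between the latent positions.
D <- as.matrix(dist(Z))

# Linear predictor: eta[i, j] = beta - distance(i, j) + delta[j].
# Filling the matrix by row puts delta[j] in column j.
n   <- nrow(Z)
eta <- beta - D + matrix(delta, nrow = n, ncol = n, byrow = TRUE)

# Tie probabilities via the inverse logit; self-ties are not modeled.
P <- plogis(eta)
diag(P) <- NA

print(round(P, 2))
```

Actors that sit close together in the latent space (1 and 2, or 3 and 4) get high tie probabilities, actors that are far apart get low ones, and the receiver effect makes the matrix asymmetric: actor 2 is more likely to receive a tie from actor 1 than the other way around. Fitting the model runs this logic in reverse, estimating the positions and effects from the observed ties.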