
Statistical journals friendship (click for SVG format)

Xian blogged recently on the upcoming RSS read paper: Statistical Modelling of Citation Exchange Between Statistics Journals, by Cristiano Varin, Manuela Cattelan and David Firth. It follows the last JRSS B read paper, by one of us! The data used in the paper (which can be downloaded here) are quite fascinating for us academics, obsessed as we are with academic rankings, for better or for worse (irony intended). They consist of cross-citation counts $C = (C_{ij})$ for 47 statistics journals (see the list and abbreviations on page 5): $C_{ij}$ is the number of citations from articles published in journal $j$ in 2010 to papers published in journal $i$ in the 2001-2010 decade. The choice of the list of journals is discussed in the paper. Major journals missing include Bayesian Analysis (published from 2006) and The Annals of Applied Statistics (published from 2007).

I looked at the ratio of total citations received to total citations made. This is a super simple descriptive statistic which happens to look rather similar to Figure 4, which plots export scores from the Stigler model (I can’t say more about it, as I haven’t read that part in detail). The top five is the same, modulo the swap between Annals of Statistics and Biometrika. Of course, a big difference is that the Cited/Citing ratio isn’t endowed with a measure of uncertainty (below, the left plot is my own, the right one is Fig. 4 in the paper).
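As a rough illustration of that ratio, here is a minimal Python sketch. The 3×3 matrix and the journal labels are made up for the example (they are not the paper's data), and whether self-citations should be counted in the totals is an assumption on my part; here they are included.

```python
# Toy cross-citation matrix (made-up numbers, NOT the paper's data).
# Convention from the post: C[i][j] = citations from journal j to journal i.
journals = ["J1", "J2", "J3"]  # hypothetical labels
C = [
    [10, 4, 6],
    [2, 8, 1],
    [3, 2, 5],
]


def cited_citing_ratio(C):
    """Return, for each journal, citations received / citations made."""
    n = len(C)
    # Row i collects everything that cites journal i: citations received.
    received = [sum(C[i]) for i in range(n)]
    # Column j collects everything journal j cites: citations made.
    made = [sum(C[i][j] for i in range(n)) for j in range(n)]
    return [received[k] / made[k] for k in range(n)]


ratios = cited_citing_ratio(C)
# Journals sorted from highest to lowest ratio.
ranking = sorted(zip(journals, ratios), key=lambda pair: pair[1], reverse=True)
for name, ratio in ranking:
    print(f"{name}: {ratio:.3f}")
```

A ratio above 1 means the journal is cited more than it cites within this closed set of journals.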

I was surprised not to see a graph / network representation of the data in the paper. As it happens, I wanted to try the Gephi software for drawing graphs, used for instance by François Caron and Emily Fox in their sparse graphs paper. I got the above graph, where:

• for the data, I used the citations matrix $C$ renormalized by the total number of citations made, which I denote by $\tilde C$. This is a way to account for the size (number of papers published) of the journal. This is just a proxy though since the actual number of papers published by the journal is not available in the data. Without that correction, CSDA is way ahead of all the others.
• the node size represents the Cited/Citing ratio
• the edge width represents the renormalized $\tilde C_{ij}$. I’m unsure what Gephi does here, since it converts my directed graph into an undirected graph. I suppose that it displays only the larger of the two edges $\tilde C_{ij}$ and $\tilde C_{ji}$.
• for better visibility I kept only the top decile of heaviest edges.
• the clusters identified by four colors are modularity classes obtained by the Louvain method.
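The data preparation described in the list above can be sketched in Python as follows. This is only one reading of the post: I assume $\tilde C_{ij} = C_{ij} / \sum_k C_{kj}$ (each column divided by the total citations made by the citing journal), and that self-citations stay in as self-loops. The toy matrix and the function names are hypothetical, and the layout and Louvain clustering done in Gephi are not reproduced here.

```python
# Toy cross-citation matrix (made-up numbers, NOT the paper's data).
# C[i][j] = citations from journal j (citing) to journal i (cited).
C = [
    [10, 4, 6],
    [2, 8, 1],
    [3, 2, 5],
]


def normalize_by_citing_total(C):
    """Divide each column j by the total number of citations made by j."""
    n = len(C)
    made = [sum(C[i][j] for i in range(n)) for j in range(n)]
    return [
        [C[i][j] / made[j] if made[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def heaviest_edges(C_tilde, fraction=0.1):
    """Keep the given fraction of heaviest directed edges (self-loops included).

    Returns (citing, cited, weight) triples, heaviest first.
    """
    n = len(C_tilde)
    edges = [
        (C_tilde[i][j], j, i)
        for i in range(n)
        for j in range(n)
        if C_tilde[i][j] > 0
    ]
    edges.sort(reverse=True)
    keep = max(1, round(fraction * len(edges)))
    return [(src, dst, w) for w, src, dst in edges[:keep]]


C_tilde = normalize_by_citing_total(C)
# With only 9 toy edges a 10% cut keeps a single edge, so 30% is used here.
top_edges = heaviest_edges(C_tilde, fraction=0.3)
for src, dst, w in top_edges:
    print(f"{src} -> {dst}: {w:.3f}")
```

On the real 47×47 matrix, `fraction=0.1` would correspond to the decile cut mentioned above; the resulting edge list could then be exported to a graph tool.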

Some remarks

The two software journals included in the dataset are clear outliers:

• the Journal of Statistical Software (JSS) is disconnected from the others, meaning it has no normalized citations $\tilde C_{ij}$ in the top decile, except for its self-citations, which are quite large and give it the 4th highest Impact Factor of the whole list in 2010 (and apparently the first in 2015).
• the largest $\tilde C_{ij}$ corresponds to the self-citations of the STATA Journal (StataJ).

Centrality:

• CSDA is the most central journal in the sense of the highest (unweighted) degree.
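For reference, unweighted degree can be computed from an edge list along these lines. The post does not say exactly how the degree was counted, so this sketch assumes the undirected graph, counts distinct neighbours, and ignores self-loops. The edge list and node names are hypothetical.

```python
from collections import defaultdict


def unweighted_degree(edges):
    """Number of distinct neighbours per node, treating edges as undirected.

    Self-loops are skipped; edges may carry extra fields (e.g. a weight).
    """
    neighbours = defaultdict(set)
    for src, dst, *_ in edges:
        if src == dst:
            continue
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    return {node: len(nbrs) for node, nbrs in neighbours.items()}


# Hypothetical kept edges as (citing, cited) pairs.
edges = [
    ("J1", "J2"),
    ("J2", "J1"),  # reverse direction, same undirected neighbour
    ("J1", "J3"),
    ("J4", "J3"),
    ("J1", "J4"),
    ("J1", "J1"),  # self-citation, ignored for the degree
]

degree = unweighted_degree(edges)
most_central = max(degree, key=degree.get)
print(most_central, degree[most_central])
```

Because the weights are dropped, a journal with many thin edges ranks above one with a few very heavy edges, which is worth keeping in mind when reading the centrality remark.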

Some further thoughts

All of this is just for fun. As mentioned by the authors, citation counts are heavy-tailed, meaning that a few papers account for much of a journal's citations while most papers account for few. As a matter of fact, the total number of citations received is mostly driven by a few super-cited papers, and so is the normalized citation matrix $\tilde C$ that I use throughout for building the graph. One reason that could be put forward for why JRSS B does so well is the read papers: for instance, Spiegelhalter et al. (2002), the DIC paper, alone received 11.9% of all JRSS B citations in 2010. Who’d bet on the number of citations this new read paper (JRSS A though) will receive?