(This article was first published on **Freakonometrics - Tag - R-english**, and kindly contributed to R-bloggers)

Following my post on citations in academic journals, I wanted to go one step further in understanding the dynamics of citations. Here, the dataset looks like this: for each article, we have the name of the journal, the year of publication (also the title of the article and the authors, but we do not use them here) and, more interesting, the number of citations in journals (any kind of academic journal) published in 1996, 1997, …, 2011. Of course, articles published in 1999 can only be cited from 1999 onwards.

    base[1000:1002,]
         Publication.Year
    7188             1999
    7191             1999
    7195             1999
                         Document.Title
    7188          Sequential inspection
    7191 On equitable resource approach
    7195           Method for strategic
                                            Authors     ISSN       Journal.Title
    7188                         Yao D.D., Zheng S. 0030364X Operations Research
    7191                                    Luss H. 0030364X Operations Research
    7195 Seshadri S., Khanna A., Harche F., Wyle R. 0030364X Operations Research
         Volume Issue X139 DEV1996 DEV1997 DEV1998 DEV1999 DEV2000 DEV2001 DEV2002
    7188     47     3    0       0       0       0       0       1       0       2
    7191     47     3    0       0       0       0       0       0       2       0
    7195     47     3    0       0       0       0       0       0       0       0
         DEV2003 DEV2004 DEV2005 DEV2006 DEV2007 DEV2008 DEV2009 DEV2010 DEV2011
    7188       0       0       0       1       0       0       0       0       0
    7191       3       4       1       4       4       8       4       6       1
    7195       0       1       2       2       1       0       1       0       0
         X130655 X0 X130794
    7188       4  0       4
    7191      37  0      37
    7195       7  0       7

The first step is to aggregate the data: instead of looking at each article, we look at all papers published in a given year (1999, say). Then we look at the number of citations in the year of publication, the year after, two years after, etc. The result is a triangle, since for articles published in 2010 there is only one possible year for citations (2010, since I removed 2011).

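    # the loop assumes one volume per publication year, running from 1996 to 2011 after rev()
    VOL=rev(unique(base$Volume))
    VOL=VOL[is.na(VOL)==FALSE]
    TRIANGLE=matrix(NA,16,16)
    k=0   # row counter, must be initialized before the loop
    for(v in VOL){
    k=k+1
    # citation columns DEV1996,...,DEV2011 of the articles in volume v
    sb=base[base$Volume==v,9:24]
    sb=sb[is.na(sb[,1])==FALSE,]
    # keep the totals from the publication year onwards, shifted to the left
    TRIANGLE[k,1:(17-k)]=apply(sb,2,sum)[k:16]}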
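
The shifting step in that loop is easier to see on a toy example. The sketch below uses a made-up data frame (the column names `CIT2001`, …, and all counts are hypothetical, not taken from the dataset above): it sums citations by publication year, then shifts each row so that column j holds the citations received j-1 years after publication.

```r
# Toy data: hypothetical articles with yearly citation counts (made-up numbers)
toy <- data.frame(
  Publication.Year = c(2001, 2001, 2002, 2002, 2003),
  CIT2001 = c(1, 0, 0, 0, 0),
  CIT2002 = c(2, 1, 1, 0, 0),
  CIT2003 = c(3, 2, 2, 1, 1)
)
years <- 2001:2003
n <- length(years)

# Total citations received in each calendar year, by publication year
agg <- aggregate(toy[, paste0("CIT", years)],
                 by = list(year = toy$Publication.Year), FUN = sum)

# Shift each row so that column j holds citations j-1 years after publication
triangle <- matrix(NA, n, n,
                   dimnames = list(years, paste0("dev", 0:(n - 1))))
for (k in seq_len(n)) {
  counts <- unlist(agg[agg$year == years[k], -1])
  triangle[k, 1:(n - k + 1)] <- counts[k:n]
}
triangle
```

The lower-right part of `triangle` stays `NA`, because those development years have not been observed yet for the most recent publication years.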

Then, a standard idea (at least in the insurance business, for claims payment development) is to assume that the data are Poisson distributed, and that the number of citations depends on the year of publication of the article (a row effect) and on the development (how many years after publication we are looking, i.e. a column effect). More formally, let $Y_{i,j}$ denote the number of citations of articles published in year $i$ during year $i+j$ (i.e. after $j$ years). We assume that $Y_{i,j}\sim\mathcal{P}(\lambda_{i,j})$ where $\lambda_{i,j}=\exp(\alpha_i+\beta_j)$.

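    # drop the 16th row and the 16th column to get a 15x15 array
    TRIANGLE=TRIANGLE[-16,]
    TRIANGLE=TRIANGLE[,-16]
    # stack column by column: the publication year varies fastest
    Y=as.vector(TRIANGLE)
    YEAR=rep(1996:2010,15)
    DEV =rep(1:15,each=15)
    baseT=data.frame(Y,YEAR,DEV)
    # Poisson regression (log link) with row and column factors; NA cells are dropped
    reg=glm(Y~as.factor(YEAR)+as.factor(DEV),
    data=baseT,family=poisson)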
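
As a sanity check on this kind of regression, here is a minimal sketch on simulated data. The row and column effects `alpha` and `beta` are arbitrary values chosen for illustration, not estimates from the citation data. A triangle is simulated from the Poisson model with a row effect and a column effect, and the fitted column coefficients are compared with the true ones.

```r
set.seed(1)
n <- 10
alpha <- log(seq(200, 650, length.out = n))    # hypothetical row effects (log scale)
beta  <- log(c(1, 3, 5, 6, 6, 5, 4, 3, 2, 1))  # hypothetical column effects (log scale)

# Simulate a full square, then mask the cells that would not be observed yet
sim <- expand.grid(row = 1:n, col = 1:n)
sim$Y <- rpois(nrow(sim), exp(alpha[sim$row] + beta[sim$col]))
sim$Y[sim$row + sim$col > n + 1] <- NA

fit <- glm(Y ~ as.factor(row) + as.factor(col), data = sim, family = poisson)

# Column effects are estimated relative to the first column (the reference level)
est <- c(0, coef(fit)[paste0("as.factor(col)", 2:n)])
round(cbind(true = beta - beta[1], estimated = est), 2)
```

Only differences with respect to the reference level are identified, which is why the true values are shown as `beta - beta[1]`.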

Since those are *incremental* values, in order to look at how citations are distributed over time, we need to sum them along a row. Thus, we can plot the cumulative sums of the exponentiated development coefficients (because we used factors, the first component has been replaced by the constant in the regression), or a normalized version to compare among journals. For instance, we can rescale so that each journal gets 100 citations over 15 years.

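    # exp(intercept) is the first development year; add each DEV coefficient for the others
    DYN=exp(c(reg$coefficients[1],reg$coefficients[1]+
    reg$coefficients[16:29]))
    # normalized cumulative pattern (reaches 1 after 15 years)
    DYNN=cumsum(DYN)/sum(DYN)
    # prepend 0 so that x and y both have 16 points
    plot(0:15,c(0,DYNN))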
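
The positions `16:29` are only right for a 15×15 triangle. A slightly more robust variant, sketched below, picks the development coefficients by name. The helper `dev_pattern` is hypothetical (it is not part of the original analysis) and assumes the development term is written `as.factor(DEV)` in the formula. Since the intercept multiplies every term of `DYN`, it cancels in the normalization, so the helper leaves it out.

```r
# Hypothetical helper: normalized cumulative citation pattern from a Poisson GLM
# whose development factor appears as as.factor(DEV) in the formula
dev_pattern <- function(fit, term = "as.factor(DEV)") {
  b <- coef(fit)
  beta <- c(0, b[startsWith(names(b), term)])  # reference level has effect 0
  incr <- exp(beta)                            # relative yearly citation volume
  cumsum(incr) / sum(incr)                     # share obtained up to each year
}

# Toy triangle with an exactly multiplicative structure (made-up numbers)
toy <- expand.grid(YEAR = 1:6, DEV = 1:6)
toy$Y <- round(100 * c(1, 2, 3, 2, 1, 1)[toy$DEV] * (1 + toy$YEAR / 10))
toy$Y[toy$YEAR + toy$DEV > 7] <- NA

fit <- glm(Y ~ as.factor(YEAR) + as.factor(DEV), data = toy, family = poisson)
pattern <- dev_pattern(fit)
round(pattern, 3)
```

Because the toy counts are exactly multiplicative, the pattern should match `cumsum(c(1, 2, 3, 2, 1, 1)) / 10` up to the numerical tolerance of the fit.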

And this is what we get, for several academic journals.

The patterns are rather different. For instance, in *Health Economics*, citation is a quick process: more than 40% of the citations obtained over 15 years were obtained during the first 4 years. On the other hand, in the *Journal of Finance*, it is much slower: less than 15% of the citations were obtained during the first 4 years (on average). So it means that comparing citation-based indices (namely *g* or *h*) is a difficult exercise, especially with young researchers in different areas. The same *g* or *h* index for a young researcher, publishing either in *Stochastic Processes and their Applications* or in *Annals of Statistics*, means that after 3 years, it can be 50% higher.

Now it is possible to look into more detail, starting with JRSS-B (on applied statistics). Note that here, citations come extremely slowly, so it might not be a good “strategy” (assuming that a researcher’s target is simply to get, quickly, a high citation index) for a young researcher to publish in JRSS-B.

On the other hand, *Biometrika* is much faster (both are on applied statistics, but we’ve seen here that they were not in the same cluster).

We can also observe that *Annals of Probability* and *Stochastic Processes and their Applications* have (almost) similar patterns (SPA might be a bit faster). Anyway, I have been surprised to see that in theoretical journals citations are extremely fast, especially if we compare with the *Journal of Finance*, for instance, where I thought citations were extremely fast. But my interpretation might be incorrect: it might simply mean that in the *Journal of Finance* it is common to cite old papers (published 10 or 15 years ago), maybe more common than in stochastic processes…

Anyway, all suggestions about the interpretation are welcome!
