Scholar indices (h-index and g-index) in PubMed with RISmed

December 7, 2015

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Scholar indices are intended to measure the contributions of authors to their fields of research. Jorge E. Hirsch suggested the h-index in 2005 as an author-level metric intended to measure both the productivity and citation impact of the publications of an author. An author has index h if h of his or her N papers have at least h citations each, and the other (N-h) papers have no more than h citations each.

In response to a comment, we will use our trusty RISmed package and the PubMed database to develop a script for calculating an h-index, as well as two similar metrics, the m-quotient and the g-index. Here is the code to conduct the search; the citation counts are retrieved by calling Cited() on the EUtilsSummary() result.

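library(RISmed)
x <- "Yi-Kuo Yu"
res <- EUtilsSummary(x, type="esearch", db="pubmed", datetype='pdat', mindate=1900, maxdate=2015, retmax=500)
citations <- Cited(res)
# Store the counts in a one-column data frame (the column is named "citations")
citations <- as.data.frame(citations)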


Calculating the h-index is just a matter of cleverly arranging the data. Above, we created a data frame with one column containing all the values of Cited() in our search. We will sort them in descending order, then make a new column with the index values, that is, the rank of each paper. The highest index value that is less than or equal to the number of citations in its row is that author’s h-index. The following code will return that index number.

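# Sort citation counts in descending order (subsetting one column returns a vector)
citations <- citations[order(citations$citations, decreasing=TRUE),]
# Convert back to a data frame; the row names are now the ranks 1, 2, 3, ...
citations <- as.data.frame(citations)
citations <- cbind(id=rownames(citations), citations)
citations$id <- as.character(citations$id)
citations$id <- as.numeric(citations$id)
# h-index: the largest rank whose paper has at least that many citations
hindex <- max(which(citations$id <= citations$citations))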


Here is the data frame we created above. It shows that Dr. Yi-Kuo Yu has an h-index of 12: his 12 most-cited publications have at least 12 citations each, while the 13th has only 10.


id citations
1       181
2        62
3        34
4        31
5        23
6        19
7        19
8        18
9        14
10       14
11       13
12       13
13       10
14        8
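As a quick check of the table above, here is a minimal sketch that recomputes the h-index from the 14 citation counts shown, without querying PubMed. The helper name h_index is an assumption made for illustration; it is not part of RISmed.

```r
# Citation counts copied from the 14 rows of the table above.
cites <- c(181, 62, 34, 31, 23, 19, 19, 18, 14, 14, 13, 13, 10, 8)

# h_index is a hypothetical helper, not a RISmed function.
h_index <- function(cites) {
  cites <- sort(cites, decreasing = TRUE)
  # Ranks whose paper has at least that many citations
  ok <- which(seq_along(cites) <= cites)
  # Return 0 when no paper qualifies (e.g. no citations at all)
  if (length(ok) == 0) 0L else max(ok)
}

h_index(cites)  # 12
```

Unlike a bare max(which(...)), this version returns 0 instead of -Inf when an author has no cited papers.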


Although the h-index is a useful metric to measure an author’s impact, it has some disadvantages. For instance, a long, less impactful career will typically outscore a superstar junior scientist. For these cases, the m-quotient divides the h-index by the number of years since the author’s first publication. In this sense it is a way to normalize by career span.

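# Publication years of the articles returned by the search
y <- YearPubmed(EUtilsGet(res))
low <- min(y)
high <- max(y)
# Career span, taken here as the years between the first and the most recent publication
den <- high-low
# Note: den is 0 when all publications share one year, which makes the quotient Inf
mquotient <- hindex/den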
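The same calculation can be wrapped in a small function that guards against a zero-year span. This is only a sketch: m_quotient and its current_year argument are hypothetical names introduced here, and the years below are illustrative values, not results from the PubMed search.

```r
# m_quotient is a hypothetical helper, not part of RISmed.
# h: an h-index; years: publication years (e.g. from YearPubmed()).
# current_year defaults to the most recent publication year, as in the code above.
m_quotient <- function(h, years, current_year = max(years)) {
  span <- current_year - min(years)
  if (span <= 0) return(NA_real_)  # no measurable career span yet
  h / span
}

# Illustrative values only, not taken from the search above.
m_quotient(h = 10, years = c(2001, 2006, 2011))                       # 10 / 10 = 1
m_quotient(h = 10, years = c(2001, 2006, 2011), current_year = 2021)  # 10 / 20 = 0.5
```

Passing an explicit current_year measures the span up to a chosen date rather than up to the author's latest paper, which is closer to the "years since first publication" wording.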



Another weakness of the h-index is that it doesn’t give extra credit for highly cited publications. Once a paper counts toward the h-index, any citations beyond h make no difference, so an author with a few very highly cited publications can get the same h-index as a relatively obscure author. The g-index was developed to address this situation. The g-index is the largest rank g (where papers are arranged in decreasing order of the number of citations they received) such that the first g papers have, together, at least g^2 citations. Here is code to calculate the g-index.

citations$square <- citations$id^2
citations$sums <- cumsum(citations$citations)
gindex <- max(which(citations$square <= citations$sums))

We made two new columns, one for the squares of the index column and one for the cumulative sum of the citations in descending order. Similar to the h-index, we need the index of the highest squared index value that is less than or equal to the cumulative sum in its row. Our output with the two new columns below shows that Dr. Yu has a g-index of 22: at rank 22 the cumulative sum (512) still exceeds the square (484), while at rank 23 it does not (516 versus 529). This is driven largely by his two most-cited publications.


 id citations square sums
  1       181      1  181
  2        62      4  243
  3        34      9  277
  4        31     16  308
  5        23     25  331
  6        19     36  350
  7        19     49  369
  8        18     64  387
  9        14     81  401
 10        14    100  415
 11        13    121  428
 12        13    144  441
 13        10    169  451
 14         8    196  459
 15         7    225  466
 16         7    256  473
 17         7    289  480
 18         7    324  487
 19         7    361  494
 20         7    400  501
 21         6    441  507
 22         5    484  512
 23         4    529  516
 24         4    576  520
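To double-check the table, here is a minimal sketch that recomputes the g-index from the 24 citation counts shown, without touching PubMed. The helper name g_index is an assumption made for illustration; it is not part of RISmed.

```r
# Citation counts copied from the 24 rows of the table above.
cites <- c(181, 62, 34, 31, 23, 19, 19, 18, 14, 14, 13, 13,
           10, 8, 7, 7, 7, 7, 7, 7, 6, 5, 4, 4)

# g_index is a hypothetical helper, not a RISmed function.
g_index <- function(cites) {
  cites <- sort(cites, decreasing = TRUE)
  ranks <- seq_along(cites)
  # Ranks g for which the top g papers together have at least g^2 citations
  ok <- which(ranks^2 <= cumsum(cites))
  if (length(ok) == 0) 0L else max(ok)
}

g_index(cites)  # 22
```

Note that this sketch, like the code above, can never return a g-index larger than the number of papers in the list.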

Check out the updated Shiny App to let the App do the work for you.
