Scholar indices (h-index and g-index) in PubMed with RISmed

[This article was first published on DataScience+, and kindly contributed to R-bloggers].

Scholar indices are intended to measure the contributions of authors to their fields of research. Jorge E. Hirsch suggested the h-index in 2005 as an author-level metric intended to measure both the productivity and citation impact of the publications of an author. An author has index h if h of his or her N papers have at least h citations each, and the other (N-h) papers have no more than h citations each.
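
To make the definition concrete, here is a minimal sketch on made-up citation counts. The helper name h_index is hypothetical and is not part of RISmed; it only restates the definition above in base R.

```r
# Hypothetical helper: h-index from a plain vector of citation counts.
h_index <- function(cites) {
  # Rank papers from most to least cited
  cites <- sort(cites, decreasing = TRUE)
  # Count the ranks whose citation count is at least the rank itself
  sum(cites >= seq_along(cites))
}

# Made-up example: five papers with 10, 8, 5, 4 and 3 citations
h_index(c(10, 8, 5, 4, 3))  # 4
```

Four papers have at least 4 citations each, and the fifth has only 3, so h is 4.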

In response to a comment, we will use our trusty RISmed package and the PubMed database to develop a script for calculating the h-index, as well as two similar metrics, the m-quotient and the g-index. Here is the code to conduct the search. The citation counts are extracted from the EUtilsSummary() result with Cited().

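library(RISmed)  # load the package used throughout

# Search PubMed for the author and collect the citation count of each article
x <- "Yi-Kuo Yu"
res <- EUtilsSummary(x, type="esearch", db="pubmed", datetype='pdat', mindate=1900, maxdate=2015, retmax=500)
citations <- Cited(res)
citations <- as.data.frame(citations)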

h-index

Calculating the h-index is just a matter of arranging the data. Above, we created a data frame with one column containing all the values of Cited() in our search. We will sort them in descending order, then add a column holding each row's rank (its index). The highest rank that is less than or equal to the number of citations in its row is that author’s h-index. The following code returns that number.

# Sort in descending order; subsetting a one-column data frame returns a vector
citations <- citations[order(citations$citations,decreasing=TRUE),]
citations <- as.data.frame(citations)
# Add each paper's rank as a numeric id column
citations <- cbind(id=seq_len(nrow(citations)),citations)
# h-index: the largest rank that does not exceed its citation count
hindex <- max(which(citations$id<=citations$citations))

hindex
12

The first rows of the data frame we created show why Dr. Yi-Kuo Yu has an h-index of 12: his 12 most-cited publications each have at least 12 citations, while the 13th has only 10.

citations

id citations
1       181
2        62
3        34
4        31
5        23
6        19
7        19
8        18
9        14
10       14
11       13
12       13
13       10
14        8

m-quotient

Although the h-index is a useful metric to measure an author’s impact, it has some disadvantages. For instance, a long, less impactful career will typically outscore a superstar junior scientist. For these cases, the m-quotient divides the h-index by the number of years since the author’s first publication, which normalizes the score by career span. The code below takes that span as the difference between the latest and earliest publication years returned by the search.

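# Publication year of every article returned by the search
y <- YearPubmed(EUtilsGet(res))
low <- min(y)
high <- max(y)
# Career span in years (zero if every article falls in the same year)
den <- high-low
mquotient <- hindex/den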

mquotient
0.92
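
As a rough illustration of the same arithmetic, the sketch below wraps it in a small function. The name m_quotient and the two years are assumptions chosen for the example, not values taken from the search; the guard for a zero-year span is an addition that the script above does not have.

```r
# Hypothetical helper: h-index divided by the span of publication years.
m_quotient <- function(h, years) {
  span <- max(years) - min(years)
  # A single-year record gives a zero span, so the quotient is undefined
  if (span == 0) return(NA_real_)
  h / span
}

# Illustrative input: an h-index of 12 over publications spanning 2002 to 2015
round(m_quotient(12, c(2002, 2015)), 2)  # 0.92
```

Returning NA for a zero span is one possible design choice; it avoids the Inf that a plain division would produce.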

g-index

Another weakness of the h-index is that it doesn’t give extra credit for highly cited publications. An author with a few very highly cited publications can get the same h-index as a relatively obscure author. The g-index was developed to address this situation. The g-index is "the largest rank (where papers are arranged in decreasing order of the number of citations they received) such that the first g papers have (together) at least g^2 citations". Here is code to calculate the g-index.

# Squared rank and running total of citations (rows are already sorted in descending order)
citations$square <- citations$id^2
citations$sums <- cumsum(citations$citations)
# g-index: the largest rank whose square does not exceed the running total
gindex <- max(which(citations$square<=citations$sums))

gindex
22

We made two new columns, one for the squares of the index column and one for the cumulative sum of the citations in descending order. Similar to the h-index, we need the highest index whose square is less than or equal to the cumulative sum. Our output with the two new columns below shows that Dr. Yu has a g-index of 22, driven largely by the many citations of his top two publications.

citations

 id citations square sums
  1       181      1  181
  2        62      4  243
  3        34      9  277
  4        31     16  308
  5        23     25  331
  6        19     36  350
  7        19     49  369
  8        18     64  387
  9        14     81  401
 10        14    100  415
 11        13    121  428
 12        13    144  441
 13        10    169  451
 14         8    196  459
 15         7    225  466
 16         7    256  473
 17         7    289  480
 18         7    324  487
 19         7    361  494
 20         7    400  501
 21         6    441  507
 22         5    484  512
 23         4    529  516
 24         4    576  520
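
The table can be checked without querying PubMed. The sketch below copies the 24 citation counts from the table above into a vector and applies the "at least g^2" rule directly. The function name g_index and the variable yu are hypothetical, and the result is capped at the number of papers, as in the script above.

```r
# Hypothetical helper: g-index from a plain vector of citation counts.
g_index <- function(cites) {
  cites <- sort(cites, decreasing = TRUE)
  ranks <- seq_along(cites)
  # Ranks whose square does not exceed the cumulative citation total
  hits <- which(ranks^2 <= cumsum(cites))
  if (length(hits) == 0) 0 else max(hits)
}

# Citation counts copied from the table above
yu <- c(181, 62, 34, 31, 23, 19, 19, 18, 14, 14, 13, 13,
        10, 8, 7, 7, 7, 7, 7, 7, 6, 5, 4, 4)
g_index(yu)  # 22
```

At rank 22 the squared rank (484) is still below the running total (512), while at rank 23 it is not (529 versus 516), which is where the value of 22 comes from.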

Check out the updated Shiny App to let the App do the work for you.
