All The Right Friends: how does Google Scholar rank co-authors?

[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers.]

On a scientist’s Google Scholar page, there is a list of co-authors in the sidebar. I’ve often wondered how Google determines in what order these co-authors appear.

The list of co-authors on a primary author’s page is not exhaustive. It only includes co-authors who also have a Google Scholar profile. They also have to be suggested to the primary author, who must then accept them before they are listed on the page. The profile page itself displays only the first 20 co-authors, in the sidebar on the right; any further co-authors can be seen by clicking “View All”. As I understand it, there is a limit to the number of co-authors a primary author is allowed to have; I currently have 40 and haven’t yet hit it. Somehow the co-authors are ranked, and that ranking decides which 20 make it onto the profile page.

How does Google Scholar rank these co-authors? Let’s use R to find out!

We’ll make use of the scholar package to get the data. The primary author used in the package vignette is Albert Einstein, but he doesn’t have any co-authors on Google Scholar; so we’ll use my data instead.

library(scholar)
library(dplyr)
library(ggplot2)
library(zoo)

# use a Google Scholar ID here
id <- "PBcP8-oAAAAJ"
# retrieve all the info from the page
l <- get_profile(id)
# retrieve details of all of the primary author's papers
papers <- get_publications(id)
# sadly this only has 6 authors max for each paper, let's get the missing authors
papers$authors <-  papers$author

# we'll use a loop to call get_complete_authors() where required
# we can test for this because papers with missing authors have an ellipsis as the final author
for(i in seq_len(nrow(papers))) {
  if(grepl("...", papers$author[i], fixed = TRUE)) {
    papers$authors[i] <- get_complete_authors(id, papers$pubid[i])
  }
}
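
The `fixed = TRUE` argument matters when testing for the ellipsis. As a regular expression, `"..."` means "any three characters", so it would match almost every author string. Here is a minimal sketch on made-up author strings (the names are invented for illustration, not taken from my profile):

```r
# Toy author strings in the style returned by get_publications();
# the names are hypothetical.
authors <- c("A Smith, B Jones",
             "A Smith, B Jones, C Brown, ...",
             "D Lee")

# Regex match: "..." means "any three characters",
# so every string with three or more characters matches
regex_match <- grepl("...", authors)

# Literal match: only strings that really contain "..." match
truncated <- grepl("...", authors, fixed = TRUE)

print(regex_match)
print(truncated)
```

Only the second string is flagged as truncated by the literal match, so only that paper would trigger a call to `get_complete_authors()`.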

At this point we have the primary author’s info, and a nice data frame of all of the primary author’s papers with number of cites and co-authors per paper.

We need to match up the Scholar co-authors to the authors in the data frame. This involves a bit of manipulation because a Scholar co-author can enter their name in any format!

# authors in the data frame have names like "JD Bloggs"
# we need names like "j bloggs" to match efficiently

# get character vector of all authors from the data frame
all_authors <- unlist(strsplit(papers$authors,", "))
# use this function to remove middle initials that we don't need
simplify_authors <- function(x) {
  s <- unlist(strsplit(x, " "))
  # keep the first letter of the first name and the whole last word
  paste(substr(s[1], 1, 1), s[length(s)])
}
all_authors <- sapply(all_authors, simplify_authors)

# Now, let's count the frequency of each co-author
count_coau <- data.frame(au = tolower(all_authors)) %>% 
  group_by(au) %>% 
  count()
# get the Scholar co-authors from `l`
scholar_coau <- l$coauthors
# here I manually added in the other Scholar co-authors from the View All modal
# and again put them into the correct format
scholar_coau <- sapply(scholar_coau, simplify_authors)
# make a data frame of the Scholar co-authors and their rank
scholar_df <- data.frame(coauthors = tolower(scholar_coau),
                         rank = seq(1,length(scholar_coau)))
# now merge with the data frame with the paper count by author
compare_df <- merge(scholar_df, count_coau, by.x = "coauthors", by.y = "au", sort = FALSE)
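
To see what the name simplification does, and where it can go wrong, here is a small sketch on invented names. It redefines `simplify_authors()` so that it runs on its own. The `scholar_names` and `paper_names` vectors are hypothetical stand-ins for the two sets of names being merged.

```r
# Same logic as simplify_authors() in the post: first letter of the
# first word, then the whole last word
simplify_authors <- function(x) {
  s <- unlist(strsplit(x, " "))
  paste(substr(s[1], 1, 1), s[length(s)])
}

# Invented names in a few of the formats a co-author might use
examples <- c("JD Bloggs", "Joe D Bloggs", "j bloggs", "Anna van der Berg")
simplified <- tolower(unname(sapply(examples, simplify_authors)))
print(simplified)

# merge() does an inner join by default, so a Scholar co-author whose
# simplified name has no exact match in the papers is silently dropped.
# setdiff() lists any such names (hypothetical data).
scholar_names <- c("j bloggs", "a berg", "m o'neil")
paper_names   <- c("j bloggs", "a berg", "m oneil")
unmatched <- setdiff(scholar_names, paper_names)
print(unmatched)
```

Two caveats follow from the way the function is written: multi-word surnames are reduced to their last word, and two different people who share an initial and surname collapse into one entry.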

First, let’s check whether Scholar co-author rank is determined by the number of papers co-authored with the primary author.

# plot the number of papers as a function of rank
ggplot(compare_df, aes(x = rank, y = n)) +
  geom_point() +
  lims(x = c(0,NA), y = c(0,NA)) +
  labs(x = "Scholar Rank", y  = "Co-authored Papers") +
  theme_bw()
# so rank is not determined by number of papers

Number of co-authored papers correlates with rank, but doesn’t determine it.
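
One way to put a number on "correlates, but doesn’t determine" is a Spearman rank correlation, which is -1 only if paper count falls strictly as rank increases. This is a sketch on a made-up data frame with the same `rank` and `n` columns as `compare_df`; the values are invented and are not my data.

```r
# Hypothetical ranks and co-authored paper counts
toy <- data.frame(rank = 1:8,
                  n    = c(12, 9, 10, 4, 6, 3, 3, 1))

# Spearman correlation compares orderings rather than raw values.
# A value of exactly -1 would mean paper count alone fixes the rank.
rho <- cor(toy$rank, toy$n, method = "spearman")
print(rho)
```

A value that is strongly negative but above -1 is what the scatter plot suggests by eye: more papers tends to mean a better rank, with exceptions.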

If rank is not determined (only) by number of co-authored papers, let’s look at the total citations that each co-author shares with the primary author.

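# sum the citations of every paper each Scholar co-author shares with the primary author
compare_df$cites <- 0
for(i in seq_len(nrow(compare_df))) {
  total_cites <- 0
  au <- compare_df$coauthors[i]
  for(j in seq_len(nrow(papers))) {
    # put this paper's authors into the same "j bloggs" format
    aus <- unlist(strsplit(papers$authors[j], ", "))
    aus <- tolower(sapply(aus, simplify_authors))
    # match whole names only; a substring match would also find "s li" inside "s lin"
    if(au %in% aus) {
      total_cites <- total_cites + papers$cites[j]
    }
  }
  compare_df$cites[i] <- total_cites
}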

# plot the total citation as a function of rank
ggplot(compare_df, aes(x = rank, y = cites)) +
  geom_point() +
  lims(x = c(0,NA), y = c(0,NA)) +
  labs(x = "Scholar Rank", y  = "Co-cites") +
  theme_bw()

Co-cites do not match the rank either.
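
One thing to watch when summing co-cites is how names are matched. A substring match can credit an author with papers written by someone whose name merely contains theirs. This sketch, on an invented set of three papers, compares exact matching against substring matching. The helper names `cocites_exact()` and `cocites_substring()` are my own, made up for illustration.

```r
# Hypothetical papers: author strings and citation counts
papers_toy <- data.frame(
  authors = c("A Smith, B Jones", "A Smith, C Lin", "A Smith, C Li, B Jones"),
  cites   = c(10, 5, 2),
  stringsAsFactors = FALSE
)

simplify_authors <- function(x) {
  s <- unlist(strsplit(x, " "))
  paste(substr(s[1], 1, 1), s[length(s)])
}

# One character vector of simplified, lower-case names per paper
author_lists <- lapply(strsplit(papers_toy$authors, ", "), function(a) {
  tolower(unname(sapply(a, simplify_authors)))
})

# Exact matching: the name must equal one of the paper's authors
cocites_exact <- function(au) {
  has_au <- vapply(author_lists, function(a) au %in% a, logical(1))
  sum(papers_toy$cites[has_au])
}

# Substring matching: the name only has to appear somewhere in the string
cocites_substring <- function(au) {
  joined <- vapply(author_lists, paste, character(1), collapse = ",")
  sum(papers_toy$cites[grepl(au, joined, fixed = TRUE)])
}

exact_li     <- cocites_exact("c li")
substring_li <- cocites_substring("c li")
print(c(exact = exact_li, substring = substring_li))
```

Here "c li" is credited with the 5 citations of the paper by "c lin" under substring matching, so the two totals differ.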

Of course, it is possible that the ranking is done by some complex method, e.g. the number of co-authors that the co-author has. But if we assume the ranking is done using only information on the primary author’s page, how can it be done?

Let’s look at a graph of co-cites and number of papers.

ggplot(compare_df, aes(x = n, y = cites, colour = rank)) +
  geom_point() +
  scale_colour_gradient(low = "red", high = "blue") +
  lims(x = c(0,NA), y = c(0,NA)) +
  labs(x = "Papers", y  = "Co-cites") +
  theme_bw()

This graph shows that distance from the origin roughly scales inversely with rank: the top-ranked co-authors (red) sit furthest from the origin.

If we take the log2 transform of the co-cites, we can see this more clearly.

ggplot(compare_df, aes(x = n, y = log2(cites), colour = rank)) +
  geom_point() +
  scale_colour_gradient(low = "red", high = "blue") +
  lims(x = c(0,NA), y = c(0,NA)) +
  labs(x = "Papers", y  = "Co-cites (log2)") +
  theme_bw()

If we add the number of papers to the log2-scaled number of co-cites, i.e. take the Manhattan distance from the origin in the plot above, we get something approximating the ranking!

compare_df$distance <- compare_df$n + log2(compare_df$cites)

ggplot(compare_df, aes(x = rank, y = distance)) +
  geom_point() +
  lims(x = c(0,NA), y = c(0,NA)) +
  labs(x = "Rank", y  = "Distance") +
  theme_bw()
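
To make the arithmetic concrete, here is a worked sketch of the distance on invented numbers. Because the citations are log2-scaled, doubling a co-author’s shared citations adds the same amount to the distance as one extra co-authored paper. Note that a co-author with zero shared citations gets `log2(0) = -Inf`. The `log2(cites + 1)` variant at the end is my own assumption for sidestepping that, not something I know Scholar to do.

```r
# Hypothetical co-authors: papers shared and citations shared
toy <- data.frame(coauthors = c("a", "b", "c", "d"),
                  n         = c(10, 3, 6, 1),
                  cites     = c(512, 2048, 64, 0))

# Distance as defined in the post: papers + log2(co-cites)
toy$distance <- toy$n + log2(toy$cites)
print(toy)

# Assumed guard against zero citations (illustrative only):
# adding 1 before the log keeps every distance finite
toy$distance_safe <- toy$n + log2(toy$cites + 1)

# Order from "closest" co-author to most distant
ordered <- toy$coauthors[order(toy$distance, decreasing = TRUE)]
print(ordered)
```

In this toy example, co-author "b" has far fewer papers than "c" but still ranks above them because of the much larger citation total.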

This approximation works well. It’s not perfect. There are authors whose distance is not in ranked order with their neighbours. On closer inspection it seems that the number of co-authored papers is not accurate, or perhaps zero-cited papers are excluded from the paper count.

I tried this simple strategy on a few other primary authors and could replicate their co-authors’ rank order. I’m not certain this is the algorithm used but it certainly seems simple enough to be readily computed on each profile page.
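
A quick way to check how well the distance reproduces the Scholar order is to turn the distance into a predicted rank and compare the two. This sketch uses invented values in the same shape as `compare_df` (sorted by Scholar rank). The column name `predicted` is hypothetical.

```r
# Hypothetical Scholar ranks and computed distances
toy <- data.frame(rank     = 1:6,
                  distance = c(21.5, 19.0, 19.8, 15.2, 12.1, 12.9))

# Largest distance gets predicted rank 1
toy$predicted <- rank(-toy$distance, ties.method = "first")

# Agreement between Scholar rank and predicted rank
rho <- cor(toy$rank, toy$predicted, method = "spearman")

# Rows whose distance is larger than that of the row ranked just above
# them, i.e. the co-authors who are out of order with their neighbour
out_of_order <- which(diff(toy$distance) > 0) + 1

print(toy)
print(rho)
print(out_of_order)
```

The out-of-order rows are the ones worth inspecting by hand, for example to see whether zero-cited papers explain the mismatch.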

So it seems that the total co-citations and the number of co-authored papers are used to compute the rank of co-authors.

Not every co-author is a Scholar co-author. Some don’t have accounts for example. Knowing how the ranking is done, we can ask which lucky co-author could slot into the top co-author spots on my page, if they made an account!

# get a list of all co-authors (lower case, to match the other data frames)
unique_authors <- unique(tolower(all_authors))
# exclude current Scholar co-authors
unique_authors <- unique_authors[!(unique_authors %in% tolower(scholar_coau))]
# generate a data frame of these authors in the correct format with blank ranking
temp_df <- data.frame(coauthors = unique_authors,
                      rank = 0)
# merge to find the number of co-authored papers
other_df <- merge(temp_df, count_coau, by.x = "coauthors", by.y = "au", sort = FALSE)
# remove single-paper co-authors for ease
other_df <- other_df[other_df$n > 1,]

# retrieve the number of co-citations for each co-author
other_df$cites <- 0
for(i in seq_len(nrow(other_df))) {
  total_cites <- 0
  au <- other_df$coauthors[i]
  for(j in seq_len(nrow(papers))) {
    # put this paper's authors into the same "j bloggs" format
    aus <- unlist(strsplit(papers$authors[j], ", "))
    aus <- tolower(sapply(aus, simplify_authors))
    # match whole names only; a substring match would also find "s li" inside "s lin"
    if(au %in% aus) {
      total_cites <- total_cites + papers$cites[j]
    }
  }
  other_df$cites[i] <- total_cites
}

# get the distance used for ranking
other_df$distance <- other_df$n + log2(other_df$cites)
# bind with the original data frame so that we can see where the new co-authors slot in
all_df <- rbind(compare_df,other_df)
# order by distance
all_df <- all_df[order(all_df$distance, decreasing = TRUE),]
# remove primary author (should have most cites and papers!)
all_df <- all_df[-1,]
# mark out non-Scholar co-authors
all_df$new <- ifelse(all_df$rank == 0, 1, 0)
# make a new column for interpolated rank
all_df$interrank <- ifelse(all_df$rank == 0, NA, all_df$rank)
# na.rm = FALSE keeps any leading/trailing NAs so the column length is unchanged
all_df$interrank <- na.approx(all_df$interrank, na.rm = FALSE)
# plot the result - limit to original top ten
ggplot(all_df, aes(x = interrank, y = distance, colour = as.factor(new))) +
  geom_point() +
  lims(x = c(0,10), y = c(0,NA)) +
  labs(x = "Rank", y  = "Distance") +
  theme_bw() +
  theme(legend.position = "none")
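
The interpolation step gives each non-Scholar co-author a fractional rank between their ranked neighbours. Here is a sketch of the same idea using base R’s `approx()` as a stand-in for `zoo::na.approx()`, on an invented rank column already sorted by distance. Positions before the first or after the last known rank cannot be interpolated and stay `NA`.

```r
# Hypothetical rank column after sorting by distance:
# NA marks a co-author without a Scholar profile
rank_col <- c(NA, 1, 2, NA, 3, NA, NA, 4, NA)

# Positions where the Scholar rank is known
known <- which(!is.na(rank_col))

# Linear interpolation between known ranks; by default approx()
# returns NA outside the range of known positions
interrank <- approx(x = known, y = rank_col[known],
                    xout = seq_along(rank_col))$y
print(interrank)
```

A co-author sitting between ranks 2 and 3 gets 2.5, which is how a newcomer shows up "between" existing places on the plot.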

Scholar co-authors are shown in salmon while co-authors without a Scholar profile are shown in teal. From this plot, we can see that the coveted third place is up for grabs, along with the new 5th place. If everyone made a Scholar account, the person currently in 6th place would be pushed down into 10th.

I don’t imagine for one minute that anyone would be motivated to sign up to make it onto the sidebar of my page, but this exercise was interesting to highlight to me who my “closest” co-authors are.

The post title comes from “All The Right Friends” by R.E.M. The version I have is on a Best Of… compilation. I have many songs with “Friends” in the title but this seemed appropriate since the co-author sidebar is over on the right of the Scholar profile page.
