
Citation Patterns in Ecology

[This article was first published on Climate Change Ecology » R, and kindly contributed to R-bloggers.]

I’m always curious to see who is citing my one paper. Turns out I actually have two papers, and the most cited paper (with 19 citations, which sounds paltry but for me is quite exciting) is certainly not the one I’d have expected. By any stretch of the imagination. But, my second paper has only been in print for about six months. So I began to wonder: How do I, a third year grad student, stack up against other grad students, professors, and ecology rock stars? What is the key to having your work well cited?

Since I’ve been learning Python, I realized I could figure this out pretty easily. I wrote a Python script to run names through Google Scholar, look for a user profile (if one exists), go to the profile page, and extract the data from the “Citations by Year” chart. This chart shows how many times your papers were cited in each year (it’s not cumulative). I’d attach the Python script if I felt like spending the time figuring out how to upload the text file to WordPress (but I don’t).

I ran 20 names through the program and downloaded the number of citations per year for each. I plotted them out in R and ran some super basic analyses, and the result is surprisingly consistent in form yet highly variable among authors: the number of citations per year increased allometrically (i.e., as a power law) for every author, but the slopes varied significantly among authors. In the graphs, values are plotted as lag since the date of first citation (which is set to 1).
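If you want to check the slope differences yourself, a quick way is a log-log linear model. Here’s a minimal sketch (not the exact analysis I ran), assuming citeDat is a dataframe with the name, lag, and cites columns produced by the Python script at the end of the post:

# Allometric (power-law) fits on the log-log scale.
# The +1 and +0.1 offsets avoid log(0), matching the plot below.
fit_pool <- lm(log(cites + 0.1) ~ log(lag + 1), data = citeDat)         # one common line
fit_add  <- lm(log(cites + 0.1) ~ log(lag + 1) + name, data = citeDat)  # author-specific intercepts
fit_int  <- lm(log(cites + 0.1) ~ log(lag + 1) * name, data = citeDat)  # author-specific slopes

# A significant interaction means the slopes vary among authors
anova(fit_add, fit_int)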

FYI: Getting the y-axis to display ticks in exponential form was a bit tricky.

library(ggplot2)
library(scales)  # for log2_trans(), trans_format(), math_format()

# p is the base ggplot object holding the citation data, e.g. p <- ggplot(citeDat)
# with columns name, lag, and cites (see the script at the end of the post).
# The +1 and +0.1 offsets keep zeros plottable on the log scale.
p +
  geom_point(aes(lag + 1, cites + 0.1, color = name), size = 3, show_guide = FALSE) +
  scale_y_continuous(
    trans = log2_trans(),
    breaks = c(0.1, 1, 10, 100, 1000),
    labels = trans_format('log10', math_format(10^.x))
  ) +
  scale_x_continuous(trans = log2_trans()) +
  ylab('Number of Citations per Year') +
  xlab('Years Since First Citation') +
  theme(
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12, color = 'black')
  )

I was curious to know whether there were differences between male and female ecologists, so I separated the data by sex. The allometric relationship still holds for both sexes, but the number of citations per year increases more rapidly for females than it does for males.
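The same trick works for the sex comparison. A sketch (the sex column here is hypothetical; the script doesn’t produce it, so you’d have to code it by hand):

# Test whether the log-log slope differs by sex.
# citeDat$sex is a hand-coded factor ('M'/'F'), not produced by the script.
fit_sex <- lm(log(cites + 0.1) ~ log(lag + 1) * sex, data = citeDat)
summary(fit_sex)  # the log(lag + 1):sex coefficient is the slope difference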

Granted, there are a huge number of problems here. I’ve not standardized by number of publications or any other variable. However, that would only serve to reduce the noise. It seems that the key to being heavily cited is longevity. Of course, extrapolating beyond these lines would be dangerous (at 64 years, one could expect 10,000 citations per year!).
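For what it’s worth, that 64-year number is just the pooled power law pushed way past the data, something like:

# Illustrative only: extrapolate the pooled fit from above out to 64 years.
# predict() returns log(cites + 0.1), so back-transform with exp().
exp(predict(fit_pool, newdata = data.frame(lag = 64))) - 0.1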

Anyway, I thought this was interesting (I was more interested in getting my Python script to work, which it did). Thoughts?

I’m also willing to share the author list with the curious, but I kept it hidden to avoid insulting people who aren’t on the list (it was really a random sample of people I know and people whose papers I’ve read, which explains the slight bias towards marine ecology and insects).

UPDATE: I had a request for the Python script, so here it is.

# Note: this script was written for Python 2.
from urllib import FancyURLopener
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import re

# Make a new opener w/ a browser header so Google allows it
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

myopener = MyOpener()
scholar_url = 'http://scholar.google.com/scholar?q=(query)&btnG=&hl=en&as_sdt=0%2C10'

# Pull the substring of s between the markers 'first' and 'last'
# (used to slice data out of the Google chart URL)
def findBetween(s, first, last):
    start = s.index(first) + len(first)
    end = s.index(last)
    return s[start:end]

# Pull the substring of s from 'chxl=0:' to the end of the string
def getX(s):
    start = s.index('chxl=0:') + len('chxl=0:')
    return s[start:]

# Define the function to actually get the chart data
def scholarCiteGet(link):
    # Navigate to and parse the user profile
    citLink2 = link.get('href')
    s2 = 'http://scholar.google.com' + citLink2
    socket = myopener.open(s2)
    wsource2 = socket.read()
    socket.close()
    soup2 = BeautifulSoup(wsource2)

    # Find the chart image and encode the string URL
    chartImg = soup2.find_all('img')[2]
    chartSrc = chartImg['src'].encode('utf-8')

    # Get the chart y-data from the URL
    chartD = findBetween(chartSrc, 'chd=t:', '&chxl')
    chartD = chartD.split(',')
    chartD = [float(i) for i in chartD]
    chartD = np.array(chartD)

    # Get the chart y-conversion
    ymax = findBetween(chartSrc, '&chxr=', '&chd')
    ymax = ymax.split(',')
    ymax = [float(i) for i in ymax]
    ymax = np.array( ymax[-1] )
    chartY = ymax/100 * chartD

    # Get the chart x-data
    chartX = getX(chartSrc)
    chartX = chartX.split('|')
    chartX = int(chartX[1])
    chartX = np.arange(chartX, 2014)
    chartTime = chartX - chartX[0]

    # put the data together and return a dataframe
    name = soup2.title.string.encode('utf-8')
    name = name[:name.index(' - Google')]
    d = {'name':name, 'year':chartX, 'lag':chartTime, 'cites':chartY}

    citeData = pd.DataFrame(d)

    return(citeData)

def scholarNameGet(name):

    # Navigate to and parse the Google Scholar search page for the given name
    name2 = name.replace(' ', '%20')
    s1 = scholar_url.replace('(query)', name2)
    socket = myopener.open(s1)
    wsource1 = socket.read()
    socket.close()
    soup1 = BeautifulSoup(wsource1)

    # Get the link to the user profile
    citText = soup1.find_all(href=re.compile('/citations?'))

    if 'mauthors' in str(citText):
        citLink = citText[2]
        temp = scholarCiteGet(citLink)
        return temp
    else:
        citLink = citText[1]

    # If the link is to a user profile... get the data
    if 'User profiles' in str(citLink):
        temp = scholarCiteGet(citLink)
        return temp

    # If not, return 'no data'
    else:
        d = {'name': name, 'year': 'No Data', 'lag': 'No Data', 'cites': 'No Data'}
        temp = pd.DataFrame(d, index=[0])
        return temp

# Collect the results for each name into one dataframe
finalDat = pd.DataFrame()

# Insert list of names here
sciNames = []

for name in sciNames:
    a = scholarNameGet(name)
    finalDat = pd.concat([finalDat, a])

# Reshape to one column per author and plot
plotDat = finalDat.pivot(index='lag', columns='name', values='cites')
plotDat = plotDat.replace('No Data', np.nan)
plotDat.plot()
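The script keeps everything in memory; to hand finalDat off to R for the plots above, one option (my addition here, not part of the original script) is a CSV round-trip: add finalDat.to_csv('citations.csv', index=False) at the end of the Python script, then read it in R:

library(ggplot2)

# Read the citation data exported from Python (hypothetical file name).
citeDat <- read.csv('citations.csv', stringsAsFactors = FALSE)

# Drop authors without profiles and make the columns numeric.
citeDat <- subset(citeDat, cites != 'No Data')
citeDat$cites <- as.numeric(citeDat$cites)
citeDat$lag <- as.numeric(citeDat$lag)

# Base ggplot object used in the plotting code above.
p <- ggplot(citeDat)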
