# Citation Patterns in Ecology

April 29, 2013

(This article was first published on Climate Change Ecology » R, and kindly contributed to R-bloggers)

I’m always curious to see who is citing my one paper. Turns out I actually have two papers, and the more cited of the two (with 19 citations, which sounds paltry but for me is quite exciting) is certainly not the one I’d have expected, by any stretch of the imagination. But my second paper has only been in print for about six months. So I began to wonder: how do I, a third-year grad student, stack up against other grad students, professors, and ecology rock stars? What is the key to having your work well cited?

Since I’ve been learning Python, I realized I could figure this out pretty easily. I wrote a Python script to run names through Google Scholar, look for a user profile (if one exists), go to the profile page, and extract the data from the “Citations by Year” chart. This chart shows how many times your papers were cited in each year (it’s not cumulative). I’d attach the Python script if I felt like spending the time figuring out how to upload the text file to WordPress (but I don’t).

I ran 20 names through the program and downloaded the number of citations per year for each. I plotted them out in R, ran some super basic analyses, and found a pattern that is surprisingly consistent in shape yet highly variable in magnitude. The number of citations increased allometrically for all authors, but the slopes varied significantly among them. In the graphs, values are plotted against lag since the date of first citation (which is set to 1).
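The fitting code isn’t shown in the post, but the standard way to estimate an allometric (power-law) relationship cites ≈ a·lag^b is a linear regression on the log-log scale. Here’s a minimal Python sketch on made-up data (the coefficients 3 and 1.5 are synthetic, not the actual fits from the post):

```python
import numpy as np

def power_law_fit(lag, cites):
    """Fit cites = a * lag**b by least squares on the log-log scale.

    The slope of log(cites) vs. log(lag) is the exponent b;
    the intercept is log(a).
    """
    b, log_a = np.polyfit(np.log(lag), np.log(cites), 1)
    return np.exp(log_a), b

# Synthetic author: citations per year grow as 3 * lag^1.5
lag = np.arange(1.0, 11.0)
cites = 3.0 * lag ** 1.5

a, b = power_law_fit(lag, cites)
# recovers a ≈ 3.0, b ≈ 1.5
```

Fitting each author separately and comparing the slopes `b` is one way to quantify the "slopes varied significantly among authors" observation.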

FYI: Getting the y-axis to display ticks in exponential form was a bit tricky.

```r
p +
  geom_point( aes(lag + 1, cites + 0.1, color = name), size = 3, show_guide = F ) +
  scale_y_continuous(
    trans = log2_trans(),
    breaks = c(0.1, 1, 10, 100, 1000),
    labels = trans_format('log10', math_format(10^.x))
  ) +
  scale_x_continuous(
    trans = log2_trans()
  ) +
  ylab('Number of Citations per Year') +
  xlab('Years Since First Citation') +
  theme(
    axis.title = element_text(size = 14),
    axis.text  = element_text(size = 12, color = 'black')
  )
```
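For anyone doing the plotting in Python instead, here’s a loose matplotlib equivalent of the axis trick above (log-scaled axes with ticks rendered as powers of ten); this is a sketch assuming a reasonably recent matplotlib, not code from the post:

```python
import matplotlib
matplotlib.use('Agg')  # draw off-screen; no display needed
import matplotlib.pyplot as plt
from matplotlib.ticker import LogFormatterMathtext

fig, ax = plt.subplots()
ax.set_xscale('log')
ax.set_yscale('log')
# Render the y ticks as powers of ten (10^0, 10^1, ...)
ax.yaxis.set_major_formatter(LogFormatterMathtext())
ax.set_xlabel('Years Since First Citation')
ax.set_ylabel('Number of Citations per Year')
```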


I was curious to know if there were differences between male and female ecologists, so I separated it out based on sex. The allometric relationship still holds for females, but the number of citations per year increases more rapidly for females than it does for males.

Granted, there are a huge number of problems here. I’ve not standardized by number of publications or any other variable. However, that would only serve to reduce the noise. It seems that the key to being heavily cited is longevity. Of course, extrapolating beyond these lines would be dangerous (at 64 years, one could expect 10,000 citations per year!).
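To see how a power law produces that kind of runaway number, plug a large lag into cites = a·lag^b. The coefficients below are made up purely to show the mechanics (they are not the fitted values from the post):

```python
# Hypothetical power-law coefficients, for illustration only
a, b = 2.4, 2.0

def predicted_cites(lag):
    """Citations per year under the power law a * lag**b."""
    return a * lag ** b

predicted_cites(64)  # ~9,830 citations per year at 64 years
```

With an exponent of 2, doubling the lag quadruples the predicted citations per year, which is why extrapolating far beyond the observed range gets absurd quickly.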

Anyway, I thought this was interesting (I was more interested in getting my Python script to work, which it did). Thoughts?

I’m also willing to share the author list with the curious, but I kept it hidden to avoid insulting people who aren’t on the list (it was really a random sample of people I know and people whose papers I’ve read, which explains the slight bias towards marine ecology and insects).

UPDATE: I had a request for the Python script, so here it is.

```python
from urllib import FancyURLopener
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import re

# Google Scholar search URL template; '(query)' gets replaced with the
# author's name. (This definition was missing from the original script;
# the URL below is a plausible stand-in.)
scholar_url = 'http://scholar.google.com/scholar?q=(query)'

# Make a new opener w/ a browser header so Google allows it
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

myopener = MyOpener()

# Find the substring of s between the delimiters 'first' and 'last'
def findBetween(s, first, last):
    start = s.index(first) + len(first)
    end = s.index(last, start)
    return s[start:end]

# Find the text within a string from 'chxl=0:' to the end of the string
def getX(s):
    start = s.index('chxl=0:') + len('chxl=0:')
    return s[start:]

# Define the function to actually get the chart data
def getCites(s2):

    # Navigate to and parse the user profile
    socket = myopener.open(s2)
    wsource2 = socket.read()
    socket.close()
    soup2 = BeautifulSoup(wsource2)

    # Find the "Citations by Year" chart image and encode the string URL
    chartImg = soup2.find_all('img')[2]
    chartSrc = chartImg['src'].encode('utf-8')

    # Get the chart y-data from the URL
    chartD = findBetween(chartSrc, 'chd=t:', '&chxl')
    chartD = np.array([float(i) for i in chartD.split(',')])

    # Get the chart y-conversion: the chart data are percentages of the
    # y-axis maximum, so rescale by ymax/100
    ymax = findBetween(chartSrc, '&chxr=', '&chd')
    ymax = float(ymax.split(',')[-1])
    chartY = ymax / 100 * chartD

    # Get the chart x-data: first labeled year through 2013
    chartX = getX(chartSrc)
    chartX = int(chartX.split('|')[1])
    chartX = np.arange(chartX, 2014)
    chartTime = chartX - chartX[0]

    # Put the data together and return a dataframe
    name = soup2.title.string.encode('utf-8')
    d = {'name': name, 'year': chartX, 'lag': chartTime, 'cites': chartY}
    citeData = pd.DataFrame(d)
    return citeData

def scholarNameGet(name):

    # Navigate and parse the Google Scholar search results for the name
    name2 = name.replace(' ', '%20')
    s1 = scholar_url.replace('(query)', name2)
    socket = myopener.open(s1)
    wsource1 = socket.read()
    socket.close()
    soup1 = BeautifulSoup(wsource1)

    # Get the link to a user profile, if one exists
    citText = soup1.find_all(href=re.compile('/citations?'))

    # (The if/else here was garbled in the original; this reconstructs
    # the apparent intent.)
    if citText and 'mauthors' not in str(citText):

        # If the link is to a user profile... get the data
        s2 = 'http://scholar.google.com' + citText[0]['href']
        temp = getCites(s2)
        return temp

    # If not, return 'No Data'
    else:
        d = {'name': name, 'year': 'No Data', 'lag': 'No Data', 'cites': 'No Data'}
        temp = pd.DataFrame(d, index=[0])
        return temp

# Run scholarNameGet for each name and build up one dataframe
finalDat = pd.DataFrame()

# Insert list of names here
sciNames = []

for name in sciNames:
    a = scholarNameGet(name)
    finalDat = pd.concat([finalDat, a])

# Reshape to one column per author and plot citations against lag
plotDat = finalDat.pivot(index='lag', columns='name', values='cites')
plotDat = plotDat.replace('No Data', np.nan)
plotDat.plot()
```