If I Had a Text File, I’d Hack Regexes in the Morning

February 4, 2009

(This article was first published on "R-bloggers" via Tal Galili in Google Reader, and kindly contributed to R-bloggers)

Yesterday the topic of academic citation counts came up, so I decided to write some tools for exploring them. The first thing I did was build a cheap screen scraper in Ruby that pulls citation counts from Google Scholar. You'll see the ugly hack I produced below.

module CitationTools
  require 'rubygems'
  require 'open-uri'
 
  def get_ten_most_cited_works_for_author(author_name)
    # First, let's clean up the author's name before using it in a URL.
    escaped_author_name = author_name.gsub(/\s+/, '+')
 
    # Let's create a variable we'll place the Google Scholar HTML in.
    page_content = nil
 
    # Let's figure out the right URL for Google Scholar.
    url = "http://scholar.google.com/scholar?q=#{escaped_author_name}"
 
    # Let's access that URL using open-uri and get the HTML from the page.
    # (On Ruby 3.0+, open-uri no longer patches Kernel#open, so call URI.open.)
    URI.open(url) do |page|
      page_content = page.read()
    end
 
    # Let's scan the HTML for the names of this author's works.
    work_titles = page_content.scan(/<p class=g>.*?>([^<]+)(?:<\/a><\/span>)?(?:(?:<font size=-1>)|(?:\s+-\s+<span class=a>)|(?:\s+-\s+<a class=fl))/)
 
    # Let's scan the HTML for the citation counts for each work.
    cite_counts = page_content.scan(/Cited by (\d+)/)
 
    # Let's set aside an array of hashes to store all of this data.
    works = []
 
    # As long as we have the same number of titles and counts, we're good.
    if work_titles.size == cite_counts.size
      work_titles.each_with_index do |title, index|
        # String#scan with a capture group yields one-element arrays; unwrap them.
        works << {:title => title.first, :citation_count => cite_counts[index].first}
      end
      return works
    else
      # Report on STDERR so the message does not end up in the CSV output.
      warn "Failed to process HTML for #{author_name}"
      return nil
    end
 
  end
end
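
To see how the escaping and scanning steps behave without hitting Google Scholar, here is a minimal offline sketch. The HTML fragment is made up for illustration and is far simpler than a real results page, and the helper names escape_author_name and extract_cite_counts are hypothetical. It shows that String#scan with one capture group returns an array of one-element arrays, which is why the results need unwrapping before use.

```ruby
require 'uri'

# Hypothetical, simplified stand-in for a fragment of a results page.
SAMPLE_HTML = <<~HTML
  <p class=g><a href="#">First Paper</a> - <a class=fl href="#">Cited by 120</a>
  <p class=g><a href="#">Second Paper</a> - <a class=fl href="#">Cited by 7</a>
HTML

# Escape an author name for use in a query string (spaces become '+').
def escape_author_name(author_name)
  URI.encode_www_form_component(author_name)
end

# Pull every "Cited by N" count out of the HTML as integers.
def extract_cite_counts(html)
  html.scan(/Cited by (\d+)/).flatten.map(&:to_i)
end

puts escape_author_name('Donald Knuth')   # Donald+Knuth
p SAMPLE_HTML.scan(/Cited by (\d+)/)      # [["120"], ["7"]]
p extract_cite_counts(SAMPLE_HTML)        # [120, 7]
```

Using URI.encode_www_form_component instead of a whitespace substitution also takes care of characters such as & or accented letters in a name, at the cost of one more require.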

With that in hand, I wrote a simple wrapper that reads a list of authors from a file called authors.txt, one name per line, and pulls the citation information for each of them from Google Scholar. The wrapper prints CSV to STDOUT, which can be redirected to a file for later analysis.

# Let's include a mix-in with some methods for parsing Google Scholar data.
# (Assumes the module above is saved as CitationTools.rb next to this script.)
require_relative 'CitationTools'
include CitationTools
 
# Let's pick a haphazard sample of authors.
authors = File.readlines('authors.txt').map {|line| line.chomp}
 
# Let's add a header line to our output.
puts '"Author","Work","Citations"'
 
# And then let's iterate over those authors.
authors.each do |author_name|
  cited_work_data = get_ten_most_cited_works_for_author(author_name)
 
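  if cited_work_data.nil?
    # Report on STDERR so the CSV on STDOUT stays clean, then move on;
    # without `next` the loop below would call `each` on nil and crash.
    warn "Skipping #{author_name}"
    next
  end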
 
  cited_work_data.each do |cited_work|
    puts "\"#{author_name}\",\"#{cited_work[:title]}\",#{cited_work[:citation_count]}"
  end
end
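
One caveat with hand-built CSV lines is that a title containing a double quote will produce a malformed row. A hedged alternative, assuming Ruby's csv library is installed, is to let CSV.generate_line handle the quoting. The helper name csv_row and the sample data below are hypothetical, and unlike the hand-built version this sketch only quotes fields that need it.

```ruby
require 'csv'

# Build one CSV row; the library escapes embedded quotes by doubling them.
def csv_row(author_name, title, citation_count)
  CSV.generate_line([author_name, title, citation_count])
end

row = csv_row('Jane Doe', 'A "Quoted" Title', 42)
print row

# Round-trip check: parsing the row gives back the original fields as strings.
p CSV.parse_line(row)
```

In the wrapper, the puts line inside the inner loop could be swapped for a print of csv_row(author_name, cited_work[:title], cited_work[:citation_count]).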

Then I coded up a simple barplot in R to give you a sense of the citation count for the first few authors that came to mind. The result is below.

citation_values.png

Now I think the goal should be to put these tools to good use.
