Web Scraping Google Scholar (Partial Success)

Posted on November 8, 2011 by Tony Breyal in R bloggers

[This article was first published on Consistently Infrequent » R, and kindly contributed to R-bloggers.]

I wanted to scrape the information returned by a Google Scholar web search into an R data frame as a quick XPath exercise. The following extracts the ‘title’, ‘url’, ‘publication’ and ‘description’ of each result. If any of these fields is not available, as in the case of a citation, the corresponding cell in the data frame will be NA.

# load packages
library(XML)
library(RCurl)

get_google_scholar_df <- function(u, omit.citation = TRUE) {
  # download the search results page
  html <- getURL(u, .encoding = "CE_UTF8")

  # parse HTML into tree structure
  doc <- htmlParse(html)

  # make data frame from available information on page
  df <- data.frame(
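    # title and url are read from the same <h3> nodes so both columns have the same length
    title = xpathSApply(doc, "//html//body//div[@class='gs_r']//h3", function(x) xmlValue(xmlChildren(x)$a)),
    # results without a link (e.g. citations) get NA; xmlGetAttr() reads the named attribute
    url = xpathSApply(doc, "//html//body//div[@class='gs_r']//h3", function(x) { a <- xmlChildren(x)$a; if (is.null(a)) NA else xmlGetAttr(a, "href") }),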
    publication = xpathSApply(doc, "//html//body//div[@class='gs_r']//font//span[@class='gs_a']", xmlValue),
    description = xpathSApply(doc, "//html//body//div[@class='gs_r']//font", xmlValue),
    stringsAsFactors=FALSE)

  # remove redundant information (i.e. publication field) from description.
  df$description <- sapply(seq_len(nrow(df)), function(i) gsub(df$publication[i], "", df$description[i], fixed = TRUE))

  # free doc from memory
  free(doc)

  # optionally drop rows containing NA (e.g. citations, which have no url)
  if (omit.citation) na.omit(df) else df
}

u <- "http://scholar.google.com/scholar?as_q=baldur%27s+gate+2&num=10&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&as_ylo=&as_yhi=&as_sdt=1.&as_sdtp=on&as_sdtf=&as_sdts=5&hl=en&num=10"
df <- get_google_scholar_df(u, omit.citation = TRUE)
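
Since the live page may change or block automated requests, here is a minimal offline sketch of the same idea. The HTML string is hand-written and hypothetical: it only imitates the 'gs_r' / h3 / a layout assumed by the XPath expressions above, and the example.org URL is a placeholder. It shows how a heading without a link ends up as NA in both columns.

```r
library(XML)

# Hypothetical HTML that only imitates the layout assumed above
html <- "<html><body>
  <div class='gs_r'><h3><a href='http://example.org/paper.pdf'>A linked result</a></h3></div>
  <div class='gs_r'><h3>[CITATION] A citation without a link</h3></div>
</body></html>"

doc <- htmlParse(html, asText = TRUE)

# one value per <h3>: the link text, or NA when the heading has no <a> child
titles <- xpathSApply(doc, "//div[@class='gs_r']//h3", function(x) {
  a <- xmlChildren(x)$a
  if (is.null(a)) NA_character_ else xmlValue(a)
})

# one value per <h3>: the href attribute, or NA when there is no link
urls <- xpathSApply(doc, "//div[@class='gs_r']//h3", function(x) {
  a <- xmlChildren(x)$a
  if (is.null(a)) NA_character_ else xmlGetAttr(a, "href")
})

toy <- data.frame(title = titles, url = urls, stringsAsFactors = FALSE)

# free doc from memory
free(doc)
```

Because both columns are built from the same set of h3 nodes, they always have the same length, which is what data.frame() needs.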

The above will produce results as follows:

df$url
# [1] "http://digra.org:8080/Plone/dl/db/06276.04067.pdf"
# [2] "http://books.google.com/books?hl=en&lr=&id=4f5Gszjyb8EC&oi=fnd&pg=PR11&dq=baldur%27s+gate+2&ots=9BRItsQBlc&sig=5WujxIs3fN8W74kw3rYSM4PEw0Y"
# [3] "http://www.itu.dk/stud/projects_f2003/moebius/Burn/Ragelse/Andet/Den%20skriftlige%20opgave/Tekster/Hancock.doc"
# [4] "http://www.aaai.org/Papers/AIIDE/2006/AIIDE06-006.pdf"
# [5] "http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.163.597&rep=rep1&type=pdf"
# [6] "http://www.google.com/patents?hl=en&lr=&vid=USPAT7249121&id=Up-AAAAAEBAJ&oi=fnd&dq=baldur%27s+gate+2&printsec=abstract"

Or the full data frame (using t() for display purposes):

t(df[1:2,])
#             1
# title       "Baldur's gate and history: Race and alignment in digital role playing games"
# url         "http://digra.org:8080/Plone/dl/db/06276.04067.pdf"
# publication "C Warnes - Digital Games Research Conference (DiGRA), 2005 - digra.org"
# description "... It is argued that games like Baldur's Gate I and II cannot be properly understood without\nreference to the fantasy novels that inform them. ... Columbia University Press, New York, 2003.\npp 2-3. 12. 8. Hess, Rhyss. Baldur's Gate and Tales of the Sword Coast. ... \nCited by 8 - Related articles - View as HTML - All 10 versions"
#             3
# title       "AI game programming wisdom"
# url         "http://books.google.com/books?hl=en&lr=&id=4f5Gszjyb8EC&oi=fnd&pg=PR11&dq=baldur%27s+gate+2&ots=9BRItsQDsd&sig=T7MtDRP6trGHhAfupOpoKipEqtg"
# publication "S Rabin - 2002 - books.google.com"
# description "... years. Mark has worked on all the games in the Baldur's Gate series, and was lead\nprogram- mer on Tales of the Sword Coast, Baldur's Gate 2, and Throne ofBhaal.\nChad DawsonÂ—Stainless Steel Studios cd1 f@ yahoo. com ... \nCited by 187 - Related articles   - Library Search - All 8 versions"
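
The row labels 1 and 3 in the display above are consistent with na.omit() having dropped a row in between, presumably a citation with no url. A small sketch with a made-up three-row data frame (all values hypothetical) shows that na.omit() keeps the original row labels:

```r
# Hypothetical three-row frame; the second row mimics a citation with no url
toy <- data.frame(
  title = c("Paper A", "Citation B", "Book C"),
  url   = c("http://example.org/a.pdf", NA, "http://example.org/c"),
  stringsAsFactors = FALSE
)

# rows containing NA are dropped, the surviving rows keep their labels
kept <- na.omit(toy)
labels <- rownames(kept)

# renumber only if consecutive labels are wanted
renumbered <- kept
rownames(renumbered) <- NULL
```

Here labels holds "1" and "3", while renumbered is labelled 1 and 2.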

That was the most information I could pull off a Google Scholar search using XPath, though I have no doubt someone with more knowledge could pull out more elements. My own attempts to extract anything further just didn’t work. Many thanks to John Colby for helping me with my question over on stackoverflow.com, which made the above possible.
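
One piece of extra information is already sitting in the description text: the "Cited by N" footer. The following is a rough sketch of recovering that count with a regular expression rather than XPath. It assumes the footer keeps the "Cited by <number>" wording seen in the output above; the first two strings are shortened copies of those descriptions and the third is a made-up case with no footer.

```r
# Two shortened descriptions from the output above, plus a hypothetical one without a footer
description <- c(
  "... \nCited by 8 - Related articles - View as HTML - All 10 versions",
  "... \nCited by 187 - Related articles   - Library Search - All 8 versions",
  "a description with no footer"
)

# locate "Cited by <digits>"; regexpr() returns -1 where there is no match
m <- regexpr("Cited by [0-9]+", description)

# default to NA, then fill in the count wherever a match was found
cited_by <- rep(NA_integer_, length(description))
cited_by[m > 0] <- as.integer(sub("Cited by ", "", regmatches(description, m)))
```

The result could be attached as an extra column, e.g. df$cited_by, with NA for results that show no citation count.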
