Web Scraping Google Scholar (Partial Success)

November 8, 2011

(This article was first published on Consistently Infrequent » R, and kindly contributed to R-bloggers)

I wanted to scrape the results of a Google Scholar web search into an R data frame as a quick XPath exercise. The following function extracts the ‘title’, ‘url’, ‘publication’ and ‘description’ of each result. If any of these fields is not available, as in the case of a citation, the corresponding cell in the data frame will be NA.
# load packages
library(XML)
library(RCurl)

get_google_scholar_df <- function(u, omit.citation = TRUE) {
  html <- getURL(u, .encoding = "CE_UTF8")

  # parse HTML into tree structure
  doc <- htmlParse(html)

  # make data frame from available information on page
  df <- data.frame(
    title = xpathSApply(doc, "//html//body//div[@class='gs_r']//h3", xmlValue),
    url = xpathSApply(doc, "//html//body//div[@class='gs_r']//h3", function(x) ifelse(is.null(xmlChildren(x)$a), NA, xmlAttrs(xmlChildren(x)$a, 'href'))),
    publication = xpathSApply(doc, "//html//body//div[@class='gs_r']//font//span[@class='gs_a']", xmlValue),
    description = xpathSApply(doc, "//html//body//div[@class='gs_r']//font", xmlValue),
    stringsAsFactors=FALSE)

  # remove redundant information (i.e. publication field) from description.
  df$description <- sapply(seq_len(nrow(df)), function(i) gsub(df$publication[i], "", df$description[i], fixed = TRUE))

  # free doc from memory
  free(doc)

  # optionally drop rows with missing fields (e.g. citations, which have no url)
  if (omit.citation) na.omit(df) else df
}

u <- "http://scholar.google.com/scholar?as_q=baldur%27s+gate+2&num=10&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&as_ylo=&as_yhi=&as_sdt=1.&as_sdtp=on&as_sdtf=&as_sdts=5&hl=en&num=10"
df <- get_google_scholar_df(u, omit.citation = TRUE)
The above will produce results as follows:
df$url
# [1] "http://digra.org:8080/Plone/dl/db/06276.04067.pdf"
# [2] "http://books.google.com/books?hl=en&lr=&id=4f5Gszjyb8EC&oi=fnd&pg=PR11&dq=baldur%27s+gate+2&ots=9BRItsQBlc&sig=5WujxIs3fN8W74kw3rYSM4PEw0Y"
# [3] "http://www.itu.dk/stud/projects_f2003/moebius/Burn/Ragelse/Andet/Den%20skriftlige%20opgave/Tekster/Hancock.doc"
# [4] "http://www.aaai.org/Papers/AIIDE/2006/AIIDE06-006.pdf"
# [5] "http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.163.597&rep=rep1&type=pdf"
# [6] "http://www.google.com/patents?hl=en&lr=&vid=USPAT7249121&id=Up-AAAAAEBAJ&oi=fnd&dq=baldur%27s+gate+2&printsec=abstract"
Or the full data frame (using t() for display purposes):
t(df[1:2,])
#             1
# title       "Baldur's gate and history: Race and alignment in digital role playing games"
# url         "http://digra.org:8080/Plone/dl/db/06276.04067.pdf"
# publication "C Warnes - Digital Games Research Conference (DiGRA), 2005 - digra.org"
# description "... It is argued that games like Baldur's Gate I and II cannot be properly understood without\nreference to the fantasy novels that inform them. ... Columbia University Press, New York, 2003.\npp 2-3. 12. 8. Hess, Rhyss. Baldur's Gate and Tales of the Sword Coast. ... \nCited by 8 - Related articles - View as HTML - All 10 versions"
#             3
# title       "AI game programming wisdom"
# url         "http://books.google.com/books?hl=en&lr=&id=4f5Gszjyb8EC&oi=fnd&pg=PR11&dq=baldur%27s+gate+2&ots=9BRItsQDsd&sig=T7MtDRP6trGHhAfupOpoKipEqtg"
# publication "S Rabin - 2002 - books.google.com"
# description "... years. Mark has worked on all the games in the Baldur's Gate series, and was lead\nprogram- mer on Tales of the Sword Coast, Baldur's Gate 2, and Throne ofBhaal.\nChad Dawson—Stainless Steel Studios cd1 f@ yahoo. com ... \nCited by 187 - Related articles   - Library Search - All 8 versions"
That was the most information I could pull from a Google Scholar search using XPath. My attempts to get more elements out did not work, though I have no doubt someone with more knowledge could extract more! Many thanks to John Colby for helping me out with my question on stackoverflow.com, which made the above possible.
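
For anyone who wants to see the NA handling without querying Google Scholar, here is a minimal offline sketch. It assumes a made-up two-result HTML snippet: the markup, the example.org URL and the `toy` data frame are illustrative only, not real Google Scholar output. It also swaps in xmlGetAttr() for the xmlAttrs() call used in the function above, purely to keep the example short.

```r
library(XML)

# Hypothetical stand-in for a results page: the second entry is a
# citation, so its <h3> has no <a> child and therefore no link.
html <- "<html><body>
  <div class='gs_r'><h3><a href='http://example.org/paper.pdf'>A linked paper</a></h3></div>
  <div class='gs_r'><h3>[CITATION] An unlinked citation</h3></div>
</body></html>"

doc <- htmlParse(html, asText = TRUE)

# xmlValue works on every <h3>, linked or not
titles <- xpathSApply(doc, "//div[@class='gs_r']//h3", xmlValue)

# return NA when the <h3> has no <a> child, otherwise its href attribute
urls <- xpathSApply(doc, "//div[@class='gs_r']//h3", function(x) {
  a <- xmlChildren(x)$a
  if (is.null(a)) NA_character_ else xmlGetAttr(a, "href")
})

toy <- data.frame(title = titles, url = urls, stringsAsFactors = FALSE)

# free doc from memory
free(doc)

# na.omit() then drops the citation row, which is what omit.citation relies on
toy_links_only <- na.omit(toy)
```

Because both columns are built from the same set of h3 nodes, they always have the same length, and the citation shows up as a row with an NA url rather than shifting the other rows.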
