Web Scraping Google Scholar (Partial Success)

November 8, 2011

(This article was first published on Consistently Infrequent » R, and kindly contributed to R-bloggers)

I wanted to scrape the information returned by a Google Scholar web search into an R data frame as a quick XPath exercise. The following will successfully extract  the ‘title’, ‘url’ , ‘publication’ and ‘description’.  If any of these fields are not available, as in the case of a citation, the corresponding cell in the data frame will have NA.

# load packages

get_google_scholar_df <- function(u, omit.citation = TRUE) {
  html <- getURL(u, .encoding = "CE_UTF8"))

  # parse HTML into tree structure
  doc <- htmlParse(html)

  # make data frame from available information on page
  df <- data.frame(
    title = xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3", function(x) xmlValue(xmlChildren(x)$a) ),
    url = xpathSApply(doc, "//html//body//div[@class='gs_r']//h3", function(x) ifelse(is.null(xmlChildren(x)$a), NA, xmlAttrs(xmlChildren(x)$a, 'href'))),
    publication = xpathSApply(doc, "//html//body//div[@class='gs_r']//font//span[@class='gs_a']", xmlValue),
    description = xpathSApply(doc, "//html//body//div[@class='gs_r']//font", xmlValue),

  # remove redundant information (i.e. publication field) from description.
  df$description <- sapply(1:dim(df)[1], function(i) gsub(df$publication[i], "", df$description[i], fixed = TRUE))

  # free doc from memory

  # ensure urls start with "http" to avoid google references to the search page
  ifelse(omit.citation, return(na.omit(df)), return(df))

u <- "http://scholar.google.com/scholar?as_q=baldur%27s+gate+2&num=10&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&as_ylo=&as_yhi=&as_sdt=1.&as_sdtp=on&as_sdtf=&as_sdts=5&hl=en&num=10"
df <- get_google_scholar_df(u, omit.citations = TRUE)

The above will produce results as follows:

# [1] "http://digra.org:8080/Plone/dl/db/06276.04067.pdf"
# [2] "http://books.google.com/books?hl=en&lr=&id=4f5Gszjyb8EC&oi=fnd&pg=PR11&dq=baldur%27s+gate+2&ots=9BRItsQBlc&sig=5WujxIs3fN8W74kw3rYSM4PEw0Y"
# [3] "http://www.itu.dk/stud/projects_f2003/moebius/Burn/Ragelse/Andet/Den%20skriftlige%20opgave/Tekster/Hancock.doc"
# [4] "http://www.aaai.org/Papers/AIIDE/2006/AIIDE06-006.pdf"
# [5] "http://citeseerx.ist.psu.edu/viewdoc/download?doi="
# [6] "http://www.google.com/patents?hl=en&lr=&vid=USPAT7249121&id=Up-AAAAAEBAJ&oi=fnd&dq=baldur%27s+gate+2&printsec=abstract"

Or the full data frame (using t() for display purposes):

#             1
# title       "Baldur's gate and history: Race and alignment in digital role playing games"
# url         "http://digra.org:8080/Plone/dl/db/06276.04067.pdf"
# publication "C Warnes - Digital Games Research Conference (DiGRA), 2005 - digra.org"
# description "... It is argued that games like Baldur's Gate I and II cannot be properly understood without\nreference to the fantasy novels that inform them. ... Columbia University Press, New York, 2003.\npp 2-3. 12. 8. Hess, Rhyss. Baldur's Gate and Tales of the Sword Coast. ... \nCited by 8 - Related articles - View as HTML - All 10 versions"
#             3
# title       "AI game programming wisdom"
# url         "http://books.google.com/books?hl=en&lr=&id=4f5Gszjyb8EC&oi=fnd&pg=PR11&dq=baldur%27s+gate+2&ots=9BRItsQDsd&sig=T7MtDRP6trGHhAfupOpoKipEqtg"
# publication "S Rabin - 2002 - books.google.com"
# description "... years. Mark has worked on all the games in the Baldur's Gate series, and was lead\nprogram- mer on Tales of the Sword Coast, Baldur's Gate 2, and Throne ofBhaal.\nChad Dawson—Stainless Steel Studios cd1 f@ yahoo. com ... \nCited by 187 - Related articles   - Library Search - All 8 versions"

That was the most information I could pull off a Google Scholar search using XPath though I have no doubt someone with more knowledge could pull more elements out! Many thanks to John Colby for helping me out with my question over on stackoverflow.com which made the above possible. Trying to get more elements out just didn’t seem to work for me.


To leave a comment for the author, please follow the link and comment on their blog: Consistently Infrequent » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , , , ,

Comments are closed.


Mango solutions

plotly webpage

dominolab webpage

Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training





CRC R books series

Six Sigma Online Training

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)