# Web Scraping Google Scholar: Part 2 (Complete Success)

November 8, 2011

This is a follow-up to a post I uploaded earlier today about web scraping data off Google Scholar. In that post I was frustrated because I’m not smart enough to use xpathSApply to get the kind of results I wanted. Fast-forward to the evening: over dinner, a friend mentioned in passing that she had finally figured out how to pass a function to another function in R, e.g.

```r
example <- function(x, FUN1, FUN2) {
  a <- sapply(x, FUN1)  # apply the first function to every element
  b <- sapply(a, FUN2)  # then apply the second function to those results
  return(b)
}

example(c(-16,-9,-4,0,4,9,16), abs, sqrt)
# [1] 4 3 2 0 2 3 4
```
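As a small aside on the same idea, here is a minimal sketch of two other ways a function can be handed over: as an anonymous function written inline, or as a character string naming it. The helper name `compose_two` is made up for this illustration; `match.fun` is the base R function that resolves either form to an actual function.

```r
# compose_two is a hypothetical helper, not part of the scraper
compose_two <- function(x, FUN1, FUN2) {
  FUN1 <- match.fun(FUN1)  # accepts a function object or its name as a string
  FUN2 <- match.fun(FUN2)
  FUN2(FUN1(x))            # abs and sqrt are vectorised, so no sapply is needed
}

by_object <- compose_two(c(-16, 4), abs, sqrt)
by_name   <- compose_two(c(-16, 4), "abs", "sqrt")
with_anon <- compose_two(c(-16, 4), abs, function(v) v / 2)
```

The first two calls give the same result; the third swaps the square root for an inline function that halves each value.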

Now that might be a little thing to others, but to me that is amazing because I had never figured it out before! Anyway, using this new piece of knowledge I was able to take another shot at the scraping problem by rolling my own meta version of xpathSApply and was thus able to successfully complete the task!

```r
# One function to rule them all...
library(RCurl)  # getURL()
library(XML)    # htmlParse(), xpathSApply(), xmlValue(), xmlAttrs(), free()

# NOTE: the function header was missing from this listing; the name
# get_google_scholar_df is a stand-in for whatever the function is called.
get_google_scholar_df <- function(u) {
  html <- getURL(u)

  # parse HTML into tree structure
  doc <- htmlParse(html)

  # I hacked my own version of xpathSApply to deal with cases that return NULL which were causing me problems
  GS_xpathSApply <- function(doc, path, FUN) {
    path.base <- "/html/body/div[@class='gs_r']"
    nodes.len <- length(xpathSApply(doc, path.base))
    # build one XPath per search result by indexing the result div: div[...][i]
    paths <- sapply(seq_len(nodes.len), function(i) gsub(path.base, paste(path.base, "[", i, "]", sep = ""), path, fixed = TRUE))
    xx <- sapply(paths, function(xpath) xpathSApply(doc, xpath, FUN), USE.NAMES = FALSE)
    # a result with no match becomes NA so every column keeps one entry per result
    xx[sapply(xx, length) < 1] <- NA
    xx <- as.vector(unlist(xx))
    return(xx)
  }

  # construct data frame
  df <- data.frame(
    footer = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']", xmlValue),
    title = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3", xmlValue),
    type = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3/span", xmlValue),
    publication = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_a']", xmlValue),
    description = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font", xmlValue),
    cited_by = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,'Cited by')]/text()", xmlValue),
    cited_ref = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,'Cited by')]", xmlAttrs),
    title_url = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/div[@class='gs_rt']/h3/a", xmlAttrs),
    view_as_html = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,'View as HTML')]", xmlAttrs),
    view_all_versions = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,' versions')]", xmlAttrs),
    from_domain = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/span[@class='gs_ggs gs_fl']/a", xmlValue),
    related_articles = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,'Related articles')]", xmlAttrs),
    library_search = GS_xpathSApply(doc, "/html/body/div[@class='gs_r']/font/span[@class='gs_fl']/a[contains(.,'Library Search')]", xmlAttrs),
    stringsAsFactors = FALSE)

  # Clean up extracted text
  df$title <- sub(".*\\] ", "", df$title)
  df$description <- sapply(1:dim(df)[1], function(i) gsub(df$publication[i], "", df$description[i], fixed = TRUE))
  df$description <- sapply(1:dim(df)[1], function(i) gsub(df$footer[i], "", df$description[i], fixed = TRUE))
  df$type <- gsub("\\]", "", gsub("\\[", "", df$type))
  # use the full column name: df$cited partially matches both cited_by and cited_ref
  df$cited_by <- as.integer(gsub("Cited by ", "", df$cited_by, fixed = TRUE))

  # remove footer as it is now redundant after doing clean up
  df <- df[,-1]

  # free doc from memory
  free(doc)

  return(df)
}
```
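To see why a plain xpathSApply call causes trouble here: it returns only the nodes that match, so a search result without a "Cited by" link simply drops out and the columns stop lining up. The sketch below reproduces that on a made-up three-result page (the HTML and the class name `gs_c` are invented for illustration) and shows one alternative to rewriting the XPath with an index: evaluating a relative XPath against each result node in turn. It is offered only as another way to keep one entry per result, not as the method used in the function above.

```r
library(XML)

# Invented stand-in for a results page; the second result has no citation link
html <- '<html><body>
<div class="gs_r"><h3>First</h3><span class="gs_c">Cited by 8</span></div>
<div class="gs_r"><h3>Second</h3></div>
<div class="gs_r"><h3>Third</h3><span class="gs_c">Cited by 2</span></div>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Whole-document queries: the missing node is silently skipped
titles <- xpathSApply(doc, "//div[@class='gs_r']/h3", xmlValue)
cites  <- xpathSApply(doc, "//div[@class='gs_r']/span", xmlValue)
length(titles)  # 3
length(cites)   # 2, so these two vectors cannot go into one data frame

# Per-result queries: one value (or NA) for every result node
nodes <- getNodeSet(doc, "//div[@class='gs_r']")
cites_aligned <- sapply(nodes, function(node) {
  hit <- xpathSApply(node, "./span", xmlValue)  # relative to this result only
  if (length(hit) == 0) NA_character_ else hit[1]
})

aligned <- data.frame(title = titles, cited_by = cites_aligned,
                      stringsAsFactors = FALSE)
```

Either way the goal is the same: every field ends up with exactly one entry per search result, with NA where the page has nothing to offer.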

Then, given a Google Scholar URL, we can scrape the following information for each search result:

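```r
u <- "http://scholar.google.com/scholar?as_q=baldur%27s+gate+2&num=20&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&as_ylo=&as_yhi=&as_sdt=1.&as_sdtp=on&as_sdtf=&as_sdts=5&hl=en"

# assumes the scraping function above is bound to the name get_google_scholar_df
df <- get_google_scholar_df(u)

# show the first result, one field per row
t(df[1, ])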

# title             "Baldur's gate and history: Race and alignment in digital role playing games"
# type              "PDF"
# publication       "C Warnes - Digital Games Research Conference (DiGRA), 2005 - digra.org"
# description       "... It is argued that games like Baldur's Gate I and II cannot be properly understood without\nreference to the fantasy novels that inform them. ... Columbia University Press, New York, 2003.\npp 2-3. 12. 8. Hess, Rhyss. Baldur's Gate and Tales of the Sword Coast. ... \n"
# cited_by          "8"
# cited_ref         "/scholar?cites=13835674724285845934&as_sdt=2005&sciodt=0,5&hl=en&oe=ASCII&num=20"
# title_url         "http://digra.org:8080/Plone/dl/db/06276.04067.pdf"
# view_all_versions "/scholar?cluster=13835674724285845934&hl=en&oe=ASCII&num=20&as_sdt=0,5"
# from_domain       "[PDF] from digra.org"
# library_search    NA
```
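The clean-up step can also be checked on its own, without hitting Google Scholar. This is a minimal sketch on a hand-made data frame whose strings are shaped like the raw values behind the output above (the second row is invented to show what happens when a field is missing).

```r
# Hand-made sample; only the "Cited by 8" and "[PDF]" shapes come from the output above
raw <- data.frame(
  title    = c("[PDF] Baldur's gate and history", "A title without a tag"),
  type     = c("[PDF]", NA),
  cited_by = c("Cited by 8", NA),
  stringsAsFactors = FALSE
)

# drop a leading "[TYPE] " tag from the title, if there is one
raw$title <- sub(".*\\] ", "", raw$title)

# strip the square brackets from the type
raw$type <- gsub("\\[|\\]", "", raw$type)

# keep only the number of citations; NA stays NA
raw$cited_by <- as.integer(gsub("Cited by ", "", raw$cited_by, fixed = TRUE))
```

After this, `raw$cited_by` is an integer column, so results can be sorted or filtered by citation count directly.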

I think that’s kind of cool. Everything is wrapped into one function, which I rather like. This could be extended further with a function that constructs a series of Google Scholar URLs with whatever parameters you require, including how many pages of results you want, and then calls the scraper on each URL in a loop. The resulting data frames could then be merged and there you have it: a nice little database to do whatever you want with. Not sure what you might want to do with it, but there it is all the same. This was a fun little XPath exercise and even though I didn’t learn how to achieve what I wanted with xpathSApply, it was nice to meta-hack a version of my own to get what I wanted. Awesome stuff.
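A rough sketch of that extension might look like the following. Everything here is hypothetical: the helper names `build_scholar_urls` and `scrape_all` are made up, the `start` paging parameter is an assumption about how Google Scholar offsets results, and `fake_scraper` is a stand-in so the example runs without touching the network. In real use you would pass the scraping function from this post instead.

```r
# Hypothetical helper: one URL per page of results.
# Assumes paging is controlled by a `start` offset alongside `num`.
build_scholar_urls <- function(query, pages = 2, per_page = 20) {
  starts <- (seq_len(pages) - 1) * per_page
  paste0("http://scholar.google.com/scholar?as_q=",
         URLencode(query, reserved = TRUE),
         "&num=", per_page, "&start=", starts, "&hl=en")
}

# Hypothetical helper: scrape every URL and stack the data frames.
# scrape_one is any function that takes a URL and returns a data frame.
scrape_all <- function(urls, scrape_one) {
  pages <- lapply(urls, scrape_one)
  do.call(rbind, pages)
}

# Stand-in scraper so the sketch runs offline
fake_scraper <- function(u) {
  data.frame(title = paste("result from", u), cited_by = NA_integer_,
             stringsAsFactors = FALSE)
}

urls <- build_scholar_urls("baldur's gate 2", pages = 3)
all_results <- scrape_all(urls, fake_scraper)
```

Passing the scraper in as an argument is the same function-to-function trick from the start of the post, and it makes the loop easy to test with a fake before pointing it at the real site.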
