GScholarXScraper: Hacking the GScholarScraper function with XPath

November 13, 2011

(This article was first published on Consistently Infrequent » R, and kindly contributed to R-bloggers)

Kay Cichini recently wrote a word-cloud R function called GScholarScraper on his blog which when given a search string will scrape the associated search results returned by Google Scholar, across pages, and then produce a word-cloud visualisation.

This was of interest to me because around the same time I posted an independent Google Scholar scraper function  get_google_scholar_df() which does a similar job of the scraping part of Kay’s function using XPath (whereas he had used Regular Expressions). My function worked as follows: when given a Google Scholar URL it will extract as much information as it can from each search result on the URL webpage  into different columns of a dataframe structure.

In the comments of his blog post I figured it’d be fun to hack his function to provide an XPath alternative, GScholarXScraper. Essensially it’s still the same function he wrote and therefore full credit should go to Kay on this one as he fully deserves it – I certainly had no previous idea how to make a word cloud, plus I hadn’t used the tm package in ages (to the point where I’d forgotten most of it!). The main changes I made were as follows:

  • Restructure internal code of GScholarScraper into a series of local functions which each do a seperate job (this made it easier for me to hack because I understood what was doing what and why).
  • As far as possible, strip out Regular Expressions and replace with XPath alternatives (made possible via the XML package). Hence the change of name to GScholarXScraper. Basically, apart from a little messing about with the generation of the URLs I just copied over my get_google_scholar_df() function and removed the Regular Expression alternatives. I’m not saying one is better than the other but f0r me personally, I find XPath shorter and quicker to code but either is a good approach for web scraping like this (note to self: I really need to lean more about regular expressions!) :)
  • Vectorise a few of the loops I saw (it surprises me how second nature this has become to me – I used to find the *apply family of functions rather confusing but thankfully not so much any more!).
  • Make use of getURL from the RCurl package (I was getting some mutibyte string problems originally when using readLines but this approach automatically fixed it for me).
  • Add option to make a word-cloud from either the “title” or the “description” fields of the Google Scholar search results
  • Added steaming via the Rstem package because I couldn’t get the Snowball package to install with my version of java. This was important to me because I was getting word clouds with variations of the same word on it e.g. “game”, “games”, “gaming”.
  • Forced use of URLencode() on generation of URLs to automatically avoid problems with search terms like “Baldur’s Gate” which would otherwise fail.

I think that’s pretty much everything I added. Anyway, here’s how it works (link to full code at end of post):

<div id="LC198"># #EXAMPLE 1: Display word cloud based on the title field of each Google Scholar search result returned</div>
<div id="LC199"># GScholarXScraper(search.str = "Baldur's Gate", field = "title", write.table = FALSE, stem = TRUE)</div>
<div id="LC200">#</div>
<div id="LC201"># # word freq</div>
<div id="LC202"># # game game 71</div>
<div id="LC203"># # comput comput 22</div>
<div id="LC204"># # video video 13</div>
<div id="LC205"># # learn learn 11</div>
<div id="LC206"># # [TRUNC...]</div>
<div id="LC207"># #</div>
<div id="LC208"># #</div>
<div id="LC209"># # Number of titles submitted = 210</div>
<div id="LC210"># #</div>
<div id="LC211"># # Number of results as retrieved from first webpage = 267</div>
<div id="LC212"># #</div>
<div id="LC213"># # Be aware that sometimes titles in Google Scholar outputs are truncated - that is why, i.e., some mandatory intitle-search strings may not be contained in all titles</div>

// image

I think that’s kind of cool and corresponds to what I would expect for a search about the legendary Baldur’s Gate computer role playing game :)  The following is produced if we look at the ‘description’ filed instead of the ‘title’ field:

<div id="LC215"># # EXAMPLE 2: Display word cloud based on the description field of each Google Scholar search result returned</div>
<div id="LC216">GScholarXScraper(search.str = "Baldur's Gate", field = "description", write.table = FALSE, stem = TRUE)</div>
<div id="LC217">#</div>
<div id="LC218"># # word freq</div>
<div id="LC219"># # page page 147</div>
<div id="LC220"># # gate gate 132</div>
<div id="LC221"># # game game 130</div>
<div id="LC222"># # baldur baldur 129</div>
<div id="LC223"># # roleplay roleplay 21</div>
<div id="LC224"># # [TRUNC...]</div>
<div id="LC225"># #</div>
<div id="LC226"># # Number of titles submitted = 210</div>
<div id="LC227"># #</div>
<div id="LC228"># # Number of results as retrieved from first webpage = 267</div>
<div id="LC229"># #</div>
<div id="LC230"># # Be aware that sometimes titles in Google Scholar outputs are truncated - that is why, i.e., some mandatory intitle-search strings may not be contained in all titles</div>


Not bad. I could see myself using the text mining and word cloud functionality with other projects I’ve been playing with such as Facebook, Google+, Yahoo search pages, Google search pages, Bing search pages… could be fun!

Many thanks again to Kay for making his code publicly available so that I could play with it and improve my programming skill set.


Full code for GScholarXScraper can be found here:

Original GSchloarScraper code is here:

Full code for just the XPath scraping function is here:

To leave a comment for the author, please follow the link and comment on their blog: Consistently Infrequent » R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , , , , , ,

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)