Articles by Tony Breyal

R: Stem (Pre-Processed) Text Blocks

August 24, 2014 | Tony Breyal

Objective: I recently needed to stem every word in a block of text, i.e. reduce each word to a root form. Problem: The stemmer I was using would only stem the last word in each block of text. Solution: I wrote a function which splits a block ... [Read more...]

R: Web Scraping R-bloggers Facebook Page

January 6, 2012 | Tony Breyal

Introduction: R-bloggers is a blog aggregator maintained by Tal Galili. It is a great website both for learning about R and for keeping up to date with the latest developments (because someone will probably, and very kindly, post about the status of some R-related feature). There is also an R-bloggers Facebook ... [Read more...]

Unshorten (almost) any URL with R

December 13, 2011 | Tony Breyal

Introduction: I was asked by a friend how to find the full final address of a URL which had been shortened via a shortening service (e.g. Twitter’s, Google’s, Facebook’s, TinyURL, etc.). I ... [Read more...]

Installing Rcpp on Windows 7 for R and C++ integration

December 7, 2011 | Tony Breyal

Introduction: Romain Francois presented an Rcpp solution on his blog to an old r-wiki optimisation challenge, for which I had previously presented R solutions on my blog. The Rcpp package provides a method for integrating R and C++. This allows for faster execution of an R project by recoding ... [Read more...]

outersect(): The opposite of R’s intersect() function

November 29, 2011 | Tony Breyal

The Objective: To find the non-duplicated elements between two or more vectors (i.e. the yellow sections of the diagram above). The Problem: I needed the opposite of R’s intersect() function, an “outersect()”. The closest I found was setdiff(), but the order of the input vectors produces different results, ... [Read more...]
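To make the order dependence of setdiff() concrete, here is a minimal sketch in base R. It is not necessarily the function from the original post: the body of outersect() below and the example vectors a and b are assumptions made for illustration, and the sketch only handles two vectors.

```r
# Hypothetical sketch of an "outersect": the elements that appear in
# exactly one of x or y (the opposite of intersect()).
outersect <- function(x, y) {
  # setdiff() depends on argument order, so take both directions
  # and combine them.
  union(setdiff(x, y), setdiff(y, x))
}

# Illustrative inputs (not from the original post)
a <- c(1, 2, 3, 4)
b <- c(3, 4, 5, 6)

setdiff(a, b)    # 1 2
setdiff(b, a)    # 5 6
outersect(a, b)  # 1 2 5 6
outersect(b, a)  # 5 6 1 2
```

Swapping the arguments changes only the ordering of the result, not which elements are returned.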

htmlToText(): Extracting Text from HTML via XPath

November 18, 2011 | Tony Breyal

Converting HTML to plain text usually involves stripping out the HTML tags whilst preserving the most basic of formatting. I wrote a function to do this which works as follows (code can be found on GitHub): The above uses an XPath approach to achieve its goal. Another approach would ... [Read more...]

Web Scraping Google+ via XPath

November 11, 2011 | Tony Breyal

Google+ just opened up to allow brands, groups, and organizations to create their very own public Pages on the site. This didn’t bother me too much, but I’ve been hearing a lot about Google+ lately so figured it might be fun to set up an XPath scraper to ... [Read more...]

Web Scraping Yahoo Search Page via XPath

November 10, 2011 | Tony Breyal

Seeing as I’m on a bit of an XPath kick as of late, I figured I’d continue on scraping search results, but this time from Yahoo. Rolling my own version of xpathSApply to handle NULL elements seems to have done the trick and so far it’s ... [Read more...]

Facebook Graph API Explorer with R

November 10, 2011 | Tony Breyal

I wanted to play around with the Facebook Graph API using the Graph API Explorer page as a coding exercise. This facility allows one to use the API with a temporary authorisation token. Now, I don’t know how to make an R package for the proper API where you ... [Read more...]

Web Scraping Google Scholar (Partial Success)

November 8, 2011 | Tony Breyal

I wanted to scrape the information returned by a Google Scholar web search into an R data frame as a quick XPath exercise. The following will successfully extract the ‘title’, ‘url’, ‘publication’ and ‘description’. If any of these fields are not available, as in the case of a citation, the ... [Read more...]

Web Scraping Google URLs

November 7, 2011 | Tony Breyal

Google slightly changed the HTML code it uses for hyperlinks on search pages last Thursday, thus causing one of my scripts to stop working. Thankfully, this is easily solved in R thanks to the XML package and the power and simplicity of XPath expressions: Lovely jubbly! P.S. I know ... [Read more...]
