Retrieving RSS Feeds Using Google Reader

January 13, 2012
By Jason Bryer

(This article was first published on Jason Bryer » R, and kindly contributed to R-bloggers)

I have been working on a new package, makeR, to help manage Sweave projects where you wish to create multiple versions of documents based on a single source. For example, I create lots of monthly and quarterly reports using Sweave, and the only differences between versions are a few variables. I have used GNU make and Apache Ant for this but wanted a 100% R solution. I will have more to write about that project in a few weeks. In the meantime I needed an example I could use publicly, which led me to think about analyzing R-Bloggers. I wanted to use the RSS feed to get the frequency of posts and the tags and categories used. However, R-Bloggers, like most blogs, limits its RSS feed to the most recent posts. Google Reader, on the other hand, keeps them all (or at least a lot more). The only downside is that you need a Google Reader account. The source code is hosted as a Gist and provided below.
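As a quick illustration of why Google Reader is needed at all, the following sketch (my own addition, not part of the Gist; it assumes the site serves a standard RSS 2.0 feed with item elements under channel) counts how many entries the native feed exposes:

require(XML)
require(RCurl)

#Count the entries exposed by the site's own feed (typically only the latest posts)
native = getURL("http://r-bloggers.com/feed", followlocation=TRUE)
channel = xmlRoot(xmlTreeParse(native, asText=TRUE))[['channel']]
length(channel[which(names(channel) == "item")])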

 

source('https://raw.github.com/gist/1606595/269d61dfcc7930f5275a212e11f3c43771ab2591/GoogleReader.R')

rbloggers = getRSSFeed(feedURL="http://r-bloggers.com/feed",
                       email="GOOGLE READER EMAIL",
                       passwd="GOOGLE READER PASSWORD",
                       posts=5000)
entries = rbloggers[which(names(rbloggers) == "entry")]
length(entries)
saveXML(rbloggers, file='rbloggers.xml')

#This will create a data frame with some of the information from the RSS feed
posts = data.frame(title=character(0), author=character(0),
                   link=character(0), stringsAsFactors=FALSE)
posts[1:length(entries), 1:ncol(posts)] = NA
posts$published = as.Date(NA)
posts.categories = list()
for(i in 1:length(entries)) {
    entry = entries[[i]]
    posts[i,]$title = unclass(xmlChildren(entry[['title']])$text)$value
    posts[i,]$author = unclass(xmlChildren(entry[['author']][['name']])$text)$value
    posts[i,]$link = xmlAttrs(entry[['link']])[['href']]
    posts[i,]$published = as.Date(substr(unclass(
        xmlChildren(entry[['published']])$text)$value, 1, 10))
    categories = entry[which(names(entry) == "category")]
    posts.categories[[i]] = list()
    if(length(categories) > 1) { #Ignore the first category as it is used for Google Reader
        l = list()
        for(j in 2:length(categories)) {
            l[(j-1)] = xmlAttrs(categories[[j]])[['term']]
        }
        posts.categories[[i]] = l
    }
}
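After the loop, posts has one row per entry and posts.categories holds the tags for each post. As a quick sanity check (my own addition, not part of the Gist), the posts can be tabulated by month:

#Quick sanity check (not in the original Gist): number of posts per month
table(format(posts$published, "%Y-%m"))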


#We'll use Paul Bleicher's calendarHeat function to visualize the number of posts per day
source('https://raw.github.com/tavisrudd/r_users_group_1/master/calendarHeat.R')
cal = as.data.frame(table(posts$published))
cal$Var1 = as.Date(cal$Var1)
calendarHeat(cal$Var1, cal$Freq, color="r2b", varname="Number of Posts on R-Bloggers.com")

#Create a word cloud
require(wordcloud)
ctab = unlist(posts.categories)
ctab = unlist(strsplit(ctab, ' '))
ctab = as.data.frame(table(ctab))
ctab = ctab[ctab$ctab != 'Uncategorized',] #Safer than -which() when 'Uncategorized' is absent
wordcloud(ctab$ctab, ctab$Freq, min.freq=10)
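If you want to keep the figure, one option (not in the original post; the file name is arbitrary) is to write the word cloud to a PNG device:

#Optionally save the word cloud to a file
png('rbloggers-wordcloud.png', width=800, height=800)
wordcloud(ctab$ctab, ctab$Freq, min.freq=10)
dev.off()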

That is the RBloggers.R script. The GoogleReader.R script it sources (also hosted as a Gist on GitHub) follows:
require(XML)
require(RCurl)

#' This function returns an XML tree of the RSS feed from the given URL.
#'
#' This function utilizes the (unofficial) Google Reader API to retrieve RSS
#' feeds. The advantage of accessing RSS feeds through the Google Reader API is
#' that you are not limited by the number of entries a website includes in its
#' feed. That is, Google generally maintains a complete history of entries from
#' the RSS feed.
#'
#' Note that the contents of the results will be limited by what the website
#' provides in its feed. That is, Google does not contain more information per
#' entry than what the website originally provided. If the initial feed
#' contained only excerpts of each article, the feed from Google will also
#' contain only excerpts. Be aware, though, that sites that do provide the
#' complete contents of posts can result in very large downloads.
#'
#' @param feedURL the full URL to the RSS feed.
#' @param email the email address for the Google Reader account
#' @param passwd the password for the Google Reader account
#' @param posts the number of posts to return
#' @return the root \code{XMLNode} for the RSS feed.
#' @seealso \code{\link{xmlRoot}} for the format of the returned XML tree
#' @export
#' @examples
#' \dontrun{
#' rbloggers = getRSSFeed(feedURL="http://r-bloggers.com/feed",
#'     email="[email protected]", passwd="PASSWORD")
#' }
#' @author Jason Bryer <\email{jason@@bryer.org}>
getRSSFeed <- function(feedURL, email, passwd, posts=1000) {
    #Authenticate with Google to obtain an auth token for the Reader API
    curlHandle = getCurlHandle(cookiefile="rcookies", ssl.verifyhost=FALSE, ssl.verifypeer=FALSE)
    x = postForm("https://www.google.com/accounts/ClientLogin",
                 accountType="GOOGLE",
                 service="reader",
                 Email=email,
                 Passwd=passwd,
                 source="makeR",
                 curl=curlHandle)
    gtoken = unlist(strsplit(x, "\n"))
    parsed.gtoken <- unlist(strsplit(gtoken[3], "Auth="))
    if(length(parsed.gtoken) >= 2) {
        auth.token <- parsed.gtoken[[2]]
    } else {
        stop("Authentication failed.")
    }
    google.auth <- paste("GoogleLogin auth=", auth.token, sep='')

    #Function to retrieve up to 1000 posts; c is the continuation token
    #returned by the previous request
    getDoc <- function(n, c=NULL) {
        feedURL = paste("http://www.google.com/reader/atom/feed/", feedURL, "?n=", n,
                        ifelse(is.null(c), "", paste("&c=", c, sep='')),
                        sep='')
        feed = getURL(feedURL, .encoding='UTF-8', followlocation=TRUE,
                      httpheader=c("Authorization"=google.auth),
                      curl=curlHandle)
        doc = xmlTreeParse(feed, asText=TRUE)
        return(xmlRoot(doc))
    }

    #Retrieve the posts in batches of up to 1000, appending the entry nodes
    #from each subsequent batch to the first batch's XML tree
    root = NULL
    continueValue = NULL
    for(i in 1:ceiling(posts / 1000)) {
        r = getDoc(n=ifelse(i == ceiling(posts / 1000), (posts-1) %% 1000 + 1, 1000),
                   c=continueValue)
        if(is.null(root)) {
            root = r
        } else {
            entries = which(xmlSApply(r, xmlName) == 'entry')
            if(length(entries) > 0) {
                root = addChildren(root, kids=r[entries])
            }
        }
        if(is.null(r[['continuation']])) {
            break #No more posts to retrieve
        } else {
            continueValue = unclass(xmlChildren(r[['continuation']])$text)$value
        }
    }
    return(root)
}
