How to Build a Dataset in R using an RSS feed or Web page

April 22, 2011

(This article was first published on Pass the ROC, and kindly contributed to R-bloggers)

I recently wanted to build a dataset from content in an RSS feed – the feed of crimes in Newark provided by SpotCrime.  (They have feeds for lots of US cities, but I just wanted Newark.  Please read their Terms of Service before using this code on their feed.)  After some tinkering, I got it to work using the XML package in R. 
The first step is to read in the RSS feed XML file:


#install.packages("XML")
library(XML)
doc<-xmlTreeParse("http://s3.spotcrime.com/cache/rss/newark.xml")

The xmlTreeParse command “parses an XML or HTML file or string containing XML/HTML content, and generates an R structure representing the XML/HTML tree.”  There are tons of optional arguments, but as you can see, I didn’t use any of them, and frankly, I don’t understand many of them.  But the function did what I wanted.
Next, I used the command xmlRoot to isolate the “top level XMLNode object resulting from parsing an XML document.”  Now is a good time to look at what we have:

> xmlRoot(doc)

 
 
  Spotcrime.com Crime Listing - Newark, NJ
  Crime feed - RSS - 5 incidents. To see more visit http://spotcrime.com
  en-us
  http://spotcrime.com
  180
  ReportSee, Inc.
 
   http://spotcrime.com/crime-report/18002873/robbery+on+easton+avenue%2C+franklin%2C+nj
   http://spotcrime.com/crime-report/18002873/robbery+on+easton+avenue%2C+franklin%2C+nj
   Mon, 18 Apr 2011 00:00:00 -0400
   Robbery on EASTON AVENUE, Franklin, NJ (via spotcrime.com)
   Police are seeking a man who robbed the Financial Resources Federal Credit Union
   40.5242061 -74.495662
  
    40.5242061
    -74.495662
  

 

This is only a portion of the full output; there are more item nodes, one for each crime.
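To experiment with these functions without hitting the live feed, here is a minimal sketch that parses a small hand-written feed from a string. The feed content, the example.com links, and the names feed, toy_doc, root_name and channel_children are made up for illustration; only the XML package functions are real.

```r
library(XML)

# Hypothetical two-item feed, written by hand for illustration only
feed <- paste0(
  '<rss version="2.0"><channel>',
  '<title>Toy crime feed</title>',
  '<item><title>Robbery on MAIN ST</title><link>http://example.com/1</link></item>',
  '<item><title>Theft on ELM ST</title><link>http://example.com/2</link></item>',
  '</channel></rss>'
)

# asText = TRUE says that `feed` holds XML content rather than a file name or URL
toy_doc <- xmlTreeParse(feed, asText = TRUE)

# The top-level node and the names of the nodes inside <channel>
root <- xmlRoot(toy_doc)
root_name <- xmlName(root)
channel_children <- unname(xmlSApply(root[["channel"]], xmlName))

print(root_name)
print(channel_children)
```

The same calls should work on the real feed by passing the URL and dropping asText.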
So the feed starts with a header full of stuff we don’t need, followed by the content in the item node, which is the good stuff: a link to the crime on SpotCrime, the publication date (more on this later), the crime “title,” a description, and the Lat/Lon, in two different formats. How do we get at that meaty stuff, and put it into a friendly R dataframe? We’ll use the xpathApply command:


src<-xpathApply(xmlRoot(doc), "//item")

xpathApply is a “way to find XML nodes that match a particular criterion” using XPath syntax.  XPath is a way to navigate XML trees.  My approach for a project like this is to aim, first and foremost, for code that works, and worry about advanced techniques later.  So I did a simple search for nodes identified as “item,” ignoring all the other possible arguments to xpathApply.  src is now a list with 5 elements, one for each “item” node in the feed (recall that above, I only showed the first item node – four more followed).  We can now iterate through the 5 elements of src and convert the data into a dataframe:
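As a small illustration of the XPath search, the sketch below runs the same "//item" query against a hand-written feed. The feed content and the names feed, toy_doc, items, n_items and titles are hypothetical. It also parses with useInternalNodes = TRUE, which differs from the call used above; that is a choice made for this sketch only.

```r
library(XML)

# Hypothetical two-item feed, written by hand for illustration only
feed <- paste0(
  '<rss version="2.0"><channel>',
  '<title>Toy crime feed</title>',
  '<item><title>Robbery on MAIN ST</title><link>http://example.com/1</link></item>',
  '<item><title>Theft on ELM ST</title><link>http://example.com/2</link></item>',
  '</channel></rss>'
)

# Parse the string into an internal (C-level) document
toy_doc <- xmlTreeParse(feed, asText = TRUE, useInternalNodes = TRUE)

# "//item" matches every item node, wherever it sits in the tree
items <- xpathApply(toy_doc, "//item")
n_items <- length(items)

# A narrower path plus a function pulls out one field per item
titles <- xpathSApply(toy_doc, "//item/title", xmlValue)

print(n_items)
print(titles)
```

Note that "//item/title" skips the channel-level title, because that title is not a child of an item node.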

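# Convert each item node into a one-row data.frame and stack the rows
for (i in seq_along(src)) {
    # Named character vector: one element per subnode of item i
    foo <- xmlSApply(src[[i]], xmlValue)
    if (i == 1) {
        # First item: create the data.frame
        DATA <- data.frame(t(foo), stringsAsFactors = FALSE)
    } else {
        # Later items: append a row at the bottom
        tmp <- data.frame(t(foo), stringsAsFactors = FALSE)
        DATA <- rbind(DATA, tmp)
    }
}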
   
xmlSApply applies a function to the subnodes of an XML node. In this case, the function is xmlValue, which returns the raw contents of a node. So foo becomes a character vector containing all of those nice data bits for crime i. We then transpose foo into a matrix and convert it to a (1 row) data.frame. The stringsAsFactors=FALSE argument prevents R from converting the strings to factors, which makes sense in this case; it might not in yours.
The first time through the loop, we want to create the data.frame; subsequent iterations, we just want to rbind a row on the bottom.  When we’re done, we have what we want: the data from the RSS feed nicely formatted in a data.frame named (descriptively) DATA. 
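For comparison, here is one possible alternative to the explicit loop: build all the one-row data.frames with lapply and bind them in a single rbind call. This is a sketch on a hand-written feed, not a replacement for the code above. The feed content and the names feed, toy_src, rows and toy_data are hypothetical, and it assumes every item has the same subnodes, since rbind needs matching columns.

```r
library(XML)

# Hypothetical two-item feed, written by hand for illustration only
feed <- paste0(
  '<rss version="2.0"><channel>',
  '<item><title>Robbery on MAIN ST</title><link>http://example.com/1</link></item>',
  '<item><title>Theft on ELM ST</title><link>http://example.com/2</link></item>',
  '</channel></rss>'
)

toy_doc <- xmlTreeParse(feed, asText = TRUE, useInternalNodes = TRUE)
toy_src <- xpathApply(toy_doc, "//item")

# One single-row data.frame per item node
rows <- lapply(toy_src, function(node) {
  data.frame(t(xmlSApply(node, xmlValue)), stringsAsFactors = FALSE)
})

# Stack all rows at once instead of growing the data.frame inside a loop
toy_data <- do.call(rbind, rows)

print(toy_data)
```

With only a handful of items the two approaches behave the same; binding once mainly avoids repeated copying when there are many rows.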

Now, returning to the date and time. SpotCrime reports the publication date and time, not the date and time that the crime actually occurred. What can we do? It looks like SpotCrime reports the date and time we want on the webpage for the crime, the link to which was helpfully provided in the RSS feed.

So, let’s read in the HTML for that page, and grab the correct date and time!


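# Loop through the crimes, visit each crime's web page, and grab the actual date and time
date_time <- character(length(src))
for (i in seq_along(src)) {
    # Parse the HTML of the crime's page
    res <- htmlTreeParse(DATA$link[i], useInternalNodes = TRUE)
    title <- xpathApply(xmlRoot(res), "//title")
    # The page's title node holds the date and time of the crime
    date_time[i] <- xmlValue(title[[1]])
}
# Add the vector as a new column (direct assignment keeps it a character column)
DATA$date_time <- date_time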

Here, we used many of the same commands we used for the RSS feed. The real date and time were stored in a node called “title,” so we just grabbed that node for each crime, stuck it into the appropriate slot in a vector, and slapped that vector onto our DATA data.frame.
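The title lookup can be tried offline on an HTML string. In the sketch below the page content, including the wording of its title, is invented for illustration and does not reflect what SpotCrime's pages actually contain; page, res and page_title are likewise hypothetical names.

```r
library(XML)

# Hypothetical page, written by hand; the real page's title format is not shown here
page <- paste0(
  "<html><head><title>Example crime page, 17 Apr 2011</title></head>",
  "<body><p>Details would go here.</p></body></html>"
)

# asText = TRUE parses the string itself instead of treating it as a URL
res <- htmlTreeParse(page, asText = TRUE, useInternalNodes = TRUE)

# Grab the text of the title node in one step
page_title <- xpathSApply(res, "//title", xmlValue)

print(page_title)
```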
With a little string processing to extract and convert lat/lon and date/time to appropriate data types, the data collection code is finished!
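As a rough sketch of that string processing, the base R code below converts the two sample values visible in the feed output above: the combined Lat/Lon string and the publication date. The variable names are hypothetical, and the format of the date and time taken from each crime's web page may well differ, so the format string would need adjusting for that column.

```r
# Sample values copied from the first item in the feed output
point <- "40.5242061 -74.495662"
pub_date <- "Mon, 18 Apr 2011 00:00:00 -0400"

# Split "lat lon" on the space and convert both pieces to numbers
coords <- as.numeric(strsplit(point, " ", fixed = TRUE)[[1]])
lat <- coords[1]
lon <- coords[2]

# Day and month abbreviations are matched in the current locale,
# so switch to the C locale to make sure the English names parse
Sys.setlocale("LC_TIME", "C")

# %z reads the "-0400" offset; the result is stored as a UTC timestamp
parsed <- as.POSIXct(pub_date, format = "%a, %d %b %Y %H:%M:%S %z", tz = "UTC")

print(c(lat, lon))
print(parsed)
```

Applied to whole columns, the same strsplit and as.POSIXct calls can be wrapped in sapply or used directly on a character vector.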
