Huh… I didn’t realize just how similar rvest was to XML until I did a bit of digging.
Ultra Signup: Treasure Trove of Ultra Data
If you’re into ultra running, then you probably know about Ultra Signup and the kinds of data you can find there: current and historical race results, lists of entrants for each upcoming race, results by runner, etc. I’ve done quite a bit of web scraping on their pages and you can see some of the fun things I’ve done with the data over on my running blog.
rvest versus XML
This post discusses the mechanics of using rvest vs. XML to scrape the entrants list for the upcoming Rock/Creek StumpJump 50k.
library(magrittr)
library(RCurl)
library(XML)
library(rvest)

# Entrants Page for the Rock/Creek StumpJump 50k Race
URL <- "http://ultrasignup.com/entrants_event.aspx?did=31114"
Downloading and Parsing the URL
rvest is definitely compact, using only one function. I like it.
rvest_doc <- html(URL)
XML gets its work done with the help of RCurl’s getURL function.
XML_doc <- htmlParse(getURL(URL),asText=TRUE)
And come to find out they return the exact same classed object. I didn’t know that!
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument"
## [4] "XMLAbstractDocument"
all.equal( class(rvest_doc), class(XML_doc) )
## [1] TRUE
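To poke at the parsing step without hitting the network, here is a minimal sketch that feeds an inline HTML string to htmlParse. The `page` string is made up for illustration; it is not the real Ultra Signup markup.

```r
library(XML)

# Hypothetical stand-in for the downloaded page (not the real Ultra Signup HTML)
page <- '<html><body><table class="ultra_grid">
  <tr><th>Name</th><th>Results</th></tr>
  <tr><td>A Runner</td><td>12</td></tr>
</table></body></html>'

# asText = TRUE tells htmlParse that the argument is HTML content,
# not a file name or URL
doc <- htmlParse(page, asText = TRUE)

# Inspect the class vector of the parsed document
class(doc)
inherits(doc, "XMLInternalDocument")
```

This is the same call shape as the getURL version above, just with the download step swapped for a literal string.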
Searching for the HTML Table
rvest seems to pooh-pooh using XPath for selecting nodes in a DOM. Rather, they recommend using CSS selectors instead. Still, the code is nice and compact.
rvest_table_node <- html_node(rvest_doc,"table.ultra_grid")
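As a rough sketch of what the CSS selector matches, the snippet below runs the same selector against a made-up page. It assumes a recent rvest, where read_html() does the job that html() does in this post; the markup and the extra `wide` class are hypothetical.

```r
library(rvest)

# Hypothetical markup; the first table carries two classes on purpose
page <- '<html><body>
<table class="ultra_grid wide"><tr><td>A Runner</td></tr></table>
<table class="other"><tr><td>ignored</td></tr></table>
</body></html>'

# Assumption: recent rvest/xml2, where read_html() parses a literal HTML string
doc <- read_html(page)

# "table.ultra_grid" selects the first table whose class list contains ultra_grid
node <- html_node(doc, "table.ultra_grid")

# Show the full class attribute of the node that was picked
html_attr(node, "class")
```

The selector only cares that `ultra_grid` is one of the classes, so the extra class does not get in the way.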
XML here uses XPath, which I don’t think is that hard to understand once you get used to it. The only other hitch here is that we have to choose the first node returned from getNodeSet.
XML_table_node <- getNodeSet(XML_doc,'//table[@class="ultra_grid"]')[[1]]
But each still returns the exact same classed object.
## [1] "XMLInternalElementNode" "XMLInternalNode"
## [3] "XMLAbstractNode"
all.equal( class(rvest_table_node), class(XML_table_node) )
## [1] TRUE
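One difference worth keeping in mind: the XPath test `@class="ultra_grid"` compares the whole attribute string, while a CSS class selector matches any one class in the list. The sketch below, on made-up markup, shows the exact test missing a table with two classes and a commonly used token-style XPath finding it.

```r
library(XML)

# Hypothetical markup: the table has a second class, "wide"
page <- '<html><body>
<table class="ultra_grid wide"><tr><td>A Runner</td></tr></table>
</body></html>'
doc <- htmlParse(page, asText = TRUE)

# Exact string comparison on the class attribute
exact <- getNodeSet(doc, '//table[@class="ultra_grid"]')

# Token match: pad the class list with spaces and look for " ultra_grid "
token <- getNodeSet(
  doc,
  '//table[contains(concat(" ", normalize-space(@class), " "), " ultra_grid ")]'
)

length(exact)  # number of tables matched by the exact test
length(token)  # number of tables matched by the token test
```

On the entrants page used in this post the exact test was enough, so this only matters if the markup ever grows a second class.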
From HTML Table to Data Frame
rvest returns a nice stringy data frame here.
rvest_table <- html_table(rvest_table_node)
While XML must submit to the camelHumpDisaster of an argument name and the factor-reviling convention of stringsAsFactors=FALSE.
XML_table <- readHTMLTable(XML_table_node, stringsAsFactors=FALSE)
Still, they return almost equal data frames.
## [1] "Component "Results": Modes: numeric, character"
## [2] "Component "Results": target is numeric, current is character"
all.equal( rvest_table$Results, as.integer(XML_table$Results) )
## [1] TRUE
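Note that all.equal does not return FALSE on a mismatch; it returns a character vector describing the differences, which is why wrapping it in isTRUE is the usual way to test it. Here is a small sketch on a made-up table (the names and numbers are invented) showing the character column from readHTMLTable before and after conversion.

```r
library(XML)

# Hypothetical entrants table, not real Ultra Signup data
page <- '<table class="ultra_grid">
  <tr><th>Name</th><th>Results</th></tr>
  <tr><td>A Runner</td><td>12</td></tr>
  <tr><td>B Runner</td><td>7</td></tr>
</table>'
doc  <- htmlParse(page, asText = TRUE)
node <- getNodeSet(doc, "//table")[[1]]

# With stringsAsFactors = FALSE the columns come back as character
tbl <- readHTMLTable(node, stringsAsFactors = FALSE)
class(tbl$Results)

# Comparing integers against strings yields a description, not FALSE
expected <- c(12L, 7L)
cmp <- all.equal(expected, tbl$Results)
isTRUE(cmp)

# After converting the column, the comparison passes
tbl$Results <- as.integer(tbl$Results)
isTRUE(all.equal(expected, tbl$Results))
```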
Magrittr For More Elegance
Adding in the way cool magrittr pipe system makes rvest really shine in compactness.
rvest_table <- html(URL) %>%
  html_node("table.ultra_grid") %>%
  html_table()
While XML is not as elegant, having to use named arguments in getNodeSet and exposing the internal function .subset2.
XML_table <- htmlParse(getURL(URL),asText=TRUE) %>%
  getNodeSet(path='//table[@class="ultra_grid"]') %>%
  .subset2(n=1) %>%
  readHTMLTable(stringsAsFactors=FALSE)
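If the internal function bothers you, magrittr ships extract2 as an alias for `[[`, which reads a little better in a pipe. A minimal sketch of the XML pipeline with that swap, run on an inline string (made-up markup) instead of the live URL:

```r
library(magrittr)
library(XML)

# Hypothetical stand-in for the downloaded page
page <- '<html><body><table class="ultra_grid">
  <tr><th>Name</th><th>Results</th></tr>
  <tr><td>A Runner</td><td>12</td></tr>
</table></body></html>'

XML_table <- htmlParse(page, asText = TRUE) %>%
  getNodeSet(path = '//table[@class="ultra_grid"]') %>%
  extract2(1) %>%                      # same as [[1]]: take the first matching node
  readHTMLTable(stringsAsFactors = FALSE)

XML_table
```

Still not as short as the rvest version, but at least nothing with a leading dot is showing.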
Summing Things Up
rvest is definitely elegant and compact syntactic sugar,
which I’m drawn to these days. But scraping web pages reveals the
dirtiest data among dirty data, and for now I think I’ll stick to the
power of XML over syntactic sugar.
Meh… who am I kidding, I’m just lazy. And old.