Old is New: XML and rvest

May 22, 2015
(This article was first published on Jeffrey Horner, and kindly contributed to R-bloggers)

Huh… I didn’t realize just how similar rvest was to XML until I did a bit of digging.

After my wonderful experience using dplyr and tidyr recently, I decided to revisit some of my old RUNNING code and see if it could use an upgrade by swapping out the XML dependency with rvest.

Ultra Signup: Treasure Trove of Ultra Data

If you’re into ultra running, then you probably know about Ultra Signup and the kinds of data you can find there: current and historical race results, lists of entrants for upcoming races, results by runner, etc. I’ve done quite a bit of web scraping on their pages, and you can see some of the fun things I’ve done with the data over on my running blog.

rvest versus XML

This post will compare the mechanics of using rvest vs. XML to scrape the entrants list for the upcoming Rock/Creek StumpJump 50k.

library(magrittr)
library(RCurl)
library(XML)
library(rvest)

# Entrants Page for the Rock/Creek StumpJump 50k Race
URL <- "http://ultrasignup.com/entrants_event.aspx?did=31114"

Downloading and Parsing the URL

rvest is definitely compact, using only one function. I like it.

rvest_doc <- html(URL)

XML gets its work done with the help of RCurl’s getURL function.

XML_doc   <- htmlParse(getURL(URL), asText = TRUE)

And come to find out, they return exactly the same classed object. I didn’t know that!

class(rvest_doc)
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" 
## [4] "XMLAbstractDocument"
all.equal( class(rvest_doc), class(XML_doc) )
## [1] TRUE

Searching for the HTML Table

rvest seems to pooh-pooh using XPath for selecting nodes in a DOM. Rather, it recommends CSS selectors. Still, the code is nice and compact.

rvest_table_node <- html_node(rvest_doc,"table.ultra_grid")
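For what it’s worth, html_node also accepts an xpath argument, so the rvest query could be written with the same XPath expression used below (a sketch; I’d expect it to return the same node):

```r
library(rvest)

rvest_doc <- html("http://ultrasignup.com/entrants_event.aspx?did=31114")

# html_node also takes an xpath argument; this should locate the
# same node as the CSS selector "table.ultra_grid"
rvest_table_node_xp <- html_node(rvest_doc, xpath = '//table[@class="ultra_grid"]')
```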

XML here uses XPath, which I don’t think is that hard to understand once you get used to it. The only other hitch is that we have to pick the first node returned by getNodeSet.

XML_table_node <- getNodeSet(XML_doc,'//table[@class="ultra_grid"]')[[1]]

But each still returns the exact same classed object.

class(rvest_table_node)
## [1] "XMLInternalElementNode" "XMLInternalNode"       
## [3] "XMLAbstractNode"
all.equal( class(rvest_table_node), class(XML_table_node) )
## [1] TRUE

From HTML Table to Data Frame

rvest returns a nice stringy data frame here.

rvest_table <- html_table(rvest_table_node)

XML, meanwhile, must submit to that camelCase disaster of an argument name, stringsAsFactors=FALSE, to escape the reviled factor-by-default convention.

XML_table <- readHTMLTable(XML_table_node, stringsAsFactors=FALSE)

Still, they return almost equal data frames.

all.equal(rvest_table,XML_table)
## [1] "Component \"Results\": Modes: numeric, character"              
## [2] "Component \"Results\": target is numeric, current is character"
all.equal( rvest_table$Results, as.integer(XML_table$Results) )
## [1] TRUE
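Since only the Results column differs in type, coercing it in place should make the two frames compare equal (a quick sketch, following on from the objects above):

```r
# Coerce the lone mismatched column; the frames should then agree
XML_table$Results <- as.integer(XML_table$Results)
all.equal(rvest_table, XML_table)  # should now be TRUE
```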

Magrittr For More Elegance

Adding in the way cool magrittr pipe system makes rvest really shine in compactness.

rvest_table <- html(URL) %>% html_node("table.ultra_grid") %>% html_table()

XML is not as elegant: it needs a named argument in getNodeSet and exposes the internal function .subset2.

XML_table <- htmlParse(getURL(URL),asText=TRUE) %>% 
                getNodeSet(path='//table[@class="ultra_grid"]') %>%
                .subset2(n=1) %>% 
                readHTMLTable(stringsAsFactors=FALSE)
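If exposing .subset2 feels off, magrittr also exports extract2 as a pipe-friendly alias for `[[`, which reads a little better (a sketch under the same assumptions as the pipeline above):

```r
library(magrittr)
library(RCurl)
library(XML)

URL <- "http://ultrasignup.com/entrants_event.aspx?did=31114"

# extract2 is magrittr's alias for `[[`, so we can avoid .subset2
XML_table <- htmlParse(getURL(URL), asText = TRUE) %>%
                getNodeSet(path = '//table[@class="ultra_grid"]') %>%
                extract2(1) %>%
                readHTMLTable(stringsAsFactors = FALSE)
```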

Summing Things Up

rvest is definitely elegant and compact syntactic sugar, which I’m drawn to these days. But scraping web pages reveals the dirtiest data among dirty data, and for now I think I’ll stick to the power of XML over syntactic sugar.

Meh… who am I kidding, I’m just lazy. And old.
