This second post of my little series on R and the web deals with how to access and process XML data with R. XML is a markup language that is commonly used to exchange data over the Internet. If you access online data through a website's API, you are likely to get it in XML format. So here is a very simple example of how to deal with XML in R.
Duncan Temple Lang wrote a very helpful R package (XML) which makes it quite easy to parse, process and generate XML data with R. I use that package in this example. The XML document (taken from w3schools.com) used in this example describes a fictitious plant catalog. Not that thrilling, I know, but the goal of this post is not to analyze the given data but to show how to parse it and transform it into a data frame. The analysis is up to you…
How to parse/read this XML-document into R?
# install (only needed once) and load the necessary package
install.packages("XML")
library(XML)
# Save the URL of the xml file in a variable
xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"
# Use the xmlTreeParse-function to parse the xml file directly from the web
xmlfile <- xmlTreeParse(xml.url)
# the xml file is now saved as an object you can easily work with in R:
class(xmlfile)
# Use the xmlRoot-function to access the top node
xmltop <- xmlRoot(xmlfile)
# have a look at the XML-code of the first subnodes:
xmltop[1:2]
This should look more or less like:
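$PLANT
<PLANT>
 <COMMON>Bloodroot</COMMON>
 <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
 <ZONE>4</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$2.44</PRICE>
 <AVAILABILITY>031599</AVAILABILITY>
</PLANT>

$PLANT
<PLANT>
 <COMMON>Columbine</COMMON>
 <BOTANICAL>Aquilegia canadensis</BOTANICAL>
 <ZONE>3</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$9.37</PRICE>
 <AVAILABILITY>030699</AVAILABILITY>
</PLANT>

attr(,"class")
[1] "XMLNodeList"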
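If you want to poke around in the parsed tree before reshaping it, a few helper functions from the XML package are handy. The following is only a small sketch: it parses a two-plant excerpt that I typed in by hand (the variable names xml.text, doc, root and first are my own, and the excerpt is an assumption about the catalog's shape, not the downloaded file), so it runs without a network connection.

```r
library(XML)

# Hand-written two-plant excerpt standing in for the real catalog (assumption);
# written without whitespace between tags to keep the node counts obvious
xml.text <- paste0(
  "<CATALOG>",
  "<PLANT><COMMON>Bloodroot</COMMON><BOTANICAL>Sanguinaria canadensis</BOTANICAL><ZONE>4</ZONE></PLANT>",
  "<PLANT><COMMON>Columbine</COMMON><BOTANICAL>Aquilegia canadensis</BOTANICAL><ZONE>3</ZONE></PLANT>",
  "</CATALOG>"
)

# asText = TRUE tells xmlTreeParse that the argument is XML content, not a file name or URL
doc  <- xmlTreeParse(xml.text, asText = TRUE)
root <- xmlRoot(doc)

xmlName(root)                  # name of the top node
xmlSize(root)                  # number of child nodes (one per plant here)

first <- root[[1]]             # the first PLANT node
xmlSApply(first, xmlName)      # names of the tags inside that node
xmlValue(first[["COMMON"]])    # text content of a single tag
```

The same calls should work on xmltop from above; the inline document just makes it easier to see what each function returns.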
One can already guess what this data should look like in a matrix or data frame. The goal is to extract the value inside each XML tag (<COMMON>, <BOTANICAL> and so on) for all $PLANT nodes and save them in a data frame with a row for each plant ($PLANT node) and a column for each tag (variable) describing it. How can you do that?
# To extract the XML-values from the document, use xmlSApply:
plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
# Finally, get the data in a data-frame and have a look at the first rows and columns
plantcat_df <- data.frame(t(plantcat), row.names = NULL)
plantcat_df[1:5, 1:4]
The first rows and columns of that data frame should look like this:
               COMMON              BOTANICAL ZONE        LIGHT
1           Bloodroot Sanguinaria canadensis    4 Mostly Shady
2           Columbine   Aquilegia canadensis    3 Mostly Shady
3      Marsh Marigold       Caltha palustris    4 Mostly Sunny
4             Cowslip       Caltha palustris    4 Mostly Shady
5 Dutchman's-Breeches    Dicentra cucullaria    3 Mostly Shady
This is exactly the format we need to analyze this data in R.
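One thing to keep in mind: xmlValue returns text, so every column of the resulting data frame holds text, even the ones that look like numbers. Here is a rough sketch of the reshaping step plus a type conversion, again on a small hand-written excerpt rather than the real file. The names values and plants, the PRICE tag in the excerpt, and the choice to strip the dollar sign with sub are my own assumptions for illustration.

```r
library(XML)

# Hand-written excerpt standing in for the downloaded catalog (assumption)
xml.text <- paste0(
  "<CATALOG>",
  "<PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE><PRICE>$2.44</PRICE></PLANT>",
  "<PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE><PRICE>$9.37</PRICE></PLANT>",
  "</CATALOG>"
)
root <- xmlRoot(xmlTreeParse(xml.text, asText = TRUE))

# Inner xmlSApply: one value per tag; outer xmlSApply: one column per PLANT node
values <- xmlSApply(root, function(x) xmlSApply(x, xmlValue))
dim(values)   # tags in rows, plants in columns

# Transpose so that each plant becomes a row;
# stringsAsFactors = FALSE keeps the columns as plain text in older R versions too
plants <- data.frame(t(values), row.names = NULL, stringsAsFactors = FALSE)

# Convert the columns that are really numbers
plants$ZONE  <- as.numeric(plants$ZONE)
plants$PRICE <- as.numeric(sub("$", "", plants$PRICE, fixed = TRUE))

str(plants)
```

This only works this neatly because every plant has the same tags in the same order. If some nodes had missing or extra tags, the nested xmlSApply would return a list instead of a matrix and the transpose would need more care.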