# R and the web (for beginners), Part II: XML in R

June 22, 2012
By

(This article was first published on GivenTheData, and kindly contributed to R-bloggers)

This second post of my little series on R and the web deals with how to access and process XML-data with R. XML is a markup language that is commonly used to interchange data over the Internet. If you want to access some online data over a webpage's API you are likely to get it in XML format. So here is a very simple example of how to deal with XML in R.
Duncan Temple Lang wrote a very helpful R-package which makes it quite easy to parse, process and generate XML-data with R. I use that package in this example. The XML document (taken from w3schools.com) used in this example describes a fictive plant catalog. Not that thrilling, I know, but the goal of this post is not to analyze the given data but to show how to parse it and transform it to a data frame. The analysis is up to you...

How to parse/read this XML-document into R?

# install and load the necessary package

install.packages("XML")
library(XML)

# Save the URL of the xml file in a variable

xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"

# Use the xmlTreePares-function to parse xml file directly from the web

xmlfile <- xmlTreeParse(xml.url)

# the xml file is now saved as an object you can easily work with in R:

class(xmlfile)

# Use the xmlRoot-function to access the top node

xmltop = xmlRoot(xmlfile)

# have a look at the XML-code of the first subnodes:

print(xmltop)[1:2]

This should look more or less like:

$PLANT<PLANT> <COMMON>Bloodroot</COMMON> <BOTANICAL>Sanguinaria canadensis</BOTANICAL> <ZONE>4</ZONE> <LIGHT>Mostly Shady</LIGHT> <PRICE>$2.44</PRICE> <AVAILABILITY>031599</AVAILABILITY></PLANT>$PLANT<PLANT> <COMMON>Columbine</COMMON> <BOTANICAL>Aquilegia canadensis</BOTANICAL> <ZONE>3</ZONE> <LIGHT>Mostly Shady</LIGHT> <PRICE>$9.37</PRICE> <AVAILABILITY>030699</AVAILABILITY></PLANT>attr(,"class")[1] "XMLNodeList"

One can already assume how this data should look like in a matrix or data frame. The goal is to extract the XML-values from each XML-tag <> for all $PLANT nodes and save them in a data frame with a row for each plant ($PLANT-node) and a column for each tag (variable) describing it. How can you do that?

# To extract the XML-values from the document, use xmlSApply:

plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))

# Finally, get the data in a data-frame and have a look at the first rows and columns

plantcat_df <- data.frame(t(plantcat),row.names=NULL)
plantcat_df[1:5,1:4]

The first rows and columns of that data frame should look like this:

               COMMON              BOTANICAL ZONE        LIGHT1           Bloodroot Sanguinaria canadensis    4 Mostly Shady2           Columbine   Aquilegia canadensis    3 Mostly Shady3      Marsh Marigold       Caltha palustris    4 Mostly Sunny4             Cowslip       Caltha palustris    4 Mostly Shady5 Dutchman's-Breeches    Dicentra cucullaria    3 Mostly Shady
Which is exactly what we need to analyze this data in R.