R and the web (for beginners), Part II: XML in R

June 22, 2012
[This article was first published on GivenTheData, and kindly contributed to R-bloggers.]

This second post of my little series on R and the web deals with how to access and process XML data with R. XML is a markup language that is commonly used to exchange data over the Internet. If you access online data through a website's API, you are likely to get it in XML format. So here is a very simple example of how to deal with XML in R.
Duncan Temple Lang wrote a very helpful R package, XML, which makes it quite easy to parse, process and generate XML data with R. I use that package in this example. The XML document used here (taken from w3schools.com) describes a fictional plant catalog. Not that thrilling, I know, but the goal of this post is not to analyze the given data but to show how to parse it and transform it into a data frame. The analysis is up to you…

How do you parse/read this XML document into R?

# Install and load the necessary package
install.packages("XML")
library(XML)

# Save the URL of the XML file in a variable
xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"

# Use the xmlTreeParse function to parse the XML file directly from the web
xmlfile <- xmlTreeParse(xml.url)

# The XML file is now saved as an object you can easily work with in R:
class(xmlfile)

# Use the xmlRoot function to access the top node
xmltop <- xmlRoot(xmlfile)

# Have a look at the XML code of the first two subnodes:
print(xmltop[1:2])

This should look more or less like:

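$PLANT
<PLANT>
 <COMMON>Bloodroot</COMMON>
 <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
 <ZONE>4</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$2.44</PRICE>
 <AVAILABILITY>031599</AVAILABILITY>
</PLANT>

$PLANT
<PLANT>
 <COMMON>Columbine</COMMON>
 <BOTANICAL>Aquilegia canadensis</BOTANICAL>
 <ZONE>3</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$9.37</PRICE>
 <AVAILABILITY>030699</AVAILABILITY>
</PLANT>

attr(,"class")
[1] "XMLNodeList"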
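Before extracting everything at once, it can help to poke at single nodes. The following is a minimal sketch, not part of the original post: it parses a two-plant excerpt from a string (so it runs without network access) and walks the tree with xmlName, xmlSize, xmlChildren and xmlValue. The root tag CATALOG, the tags PRICE and AVAILABILITY, and the variable names xml_text, doc, top and first are assumptions made for illustration.

```r
library(XML)

# Two-plant excerpt of the catalog, inlined so the example runs offline.
# Assumption: the root element is CATALOG and the last two tags are
# PRICE and AVAILABILITY, as in the w3schools file.
xml_text <- '<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$2.44</PRICE>
    <AVAILABILITY>031599</AVAILABILITY>
  </PLANT>
  <PLANT>
    <COMMON>Columbine</COMMON>
    <BOTANICAL>Aquilegia canadensis</BOTANICAL>
    <ZONE>3</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$9.37</PRICE>
    <AVAILABILITY>030699</AVAILABILITY>
  </PLANT>
</CATALOG>'

# asText = TRUE tells xmlTreeParse that the input is XML content,
# not a file name or URL
doc <- xmlTreeParse(xml_text, asText = TRUE)
top <- xmlRoot(doc)

xmlName(top)    # name of the top node
xmlSize(top)    # number of child nodes (here: PLANT nodes)

# Pick the first PLANT node and list the tags it contains
first <- top[[1]]
names(xmlChildren(first))

# Extract the text inside a single tag
xmlValue(first[["COMMON"]])
xmlValue(top[[2]][["BOTANICAL"]])
```

The same calls work on the xmltop object from above, for example xmlValue(xmltop[[1]][["COMMON"]]).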

You can already see how this data should look in a matrix or data frame. The goal is to extract the value inside each XML tag (for example <COMMON>) for all PLANT nodes and save the values in a data frame with one row per plant (PLANT node) and one column per tag (variable) describing it. How can you do that?

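# To extract the XML values from the document, use xmlSApply:
# the inner call collects the values of all tags within one plant,
# the outer call repeats this for every plant node
plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))

# The result is a matrix with one column per plant, so transpose it with t().
# Finally, get the data into a data frame and have a look at the first rows and columns
plantcat_df <- data.frame(t(plantcat), row.names = NULL)
plantcat_df[1:5, 1:4]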

The first rows and columns of that data frame should look like this:
 
               COMMON              BOTANICAL ZONE        LIGHT
1           Bloodroot Sanguinaria canadensis    4 Mostly Shady
2           Columbine   Aquilegia canadensis    3 Mostly Shady
3      Marsh Marigold       Caltha palustris    4 Mostly Sunny
4             Cowslip       Caltha palustris    4 Mostly Shady
5 Dutchman's-Breeches    Dicentra cucullaria    3 Mostly Shady

This is exactly the format we need to analyze the data in R.
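One practical caveat: xmlValue returns text, so every column of the data frame holds strings (or factors, depending on your R version) rather than numbers. The sketch below shows one possible way to convert them before an analysis. The data frame plants is a hypothetical stand-in for the first two rows of plantcat_df, and reading AVAILABILITY as a month-day-year date is an assumption about the file, not something the catalog states.

```r
# Hypothetical stand-in for the first two rows of plantcat_df
plants <- data.frame(
  COMMON       = c("Bloodroot", "Columbine"),
  ZONE         = c("4", "3"),
  PRICE        = c("$2.44", "$9.37"),
  AVAILABILITY = c("031599", "030699"),
  stringsAsFactors = FALSE
)

# as.character() first, so the conversion also works if a column is a factor
plants$ZONE <- as.integer(as.character(plants$ZONE))

# Strip the literal dollar sign (fixed = TRUE: no regular expression) and
# convert the rest to a number
plants$PRICE <- as.numeric(sub("$", "", as.character(plants$PRICE), fixed = TRUE))

# Assumption: AVAILABILITY is encoded as mmddyy
plants$AVAILABILITY <- as.Date(as.character(plants$AVAILABILITY), format = "%m%d%y")

str(plants)
mean(plants$PRICE)
```

On the full catalog, ZONE values that are not plain integers would become NA (with a warning) under as.integer, so it is worth checking that column before converting it.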

