Parse and process XML (and HTML) with xml2

[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’m pleased to announced that the first version of xml2 is now available on CRAN. Xml2 is a wrapper around the comprehensive libxml2 C library that makes it easier to work with XML and HTML in R:

  • Read XML and HTML with read_xml() and read_html().
  • Navigate the tree with xml_children(), xml_siblings() and xml_parent(). Alternatively, use xpath to jump directly to the nodes you’re interested in with xml_find_one() and xml_find_all(). Get the full path to a node with xml_path().
  • Extract various components of a node with xml_text(), xml_attrs(), xml_attr(), and xml_name().
  • Convert to list with as_list().
  • Where appropriate, functions support namespaces with a global url -> prefix lookup table. See xml_ns() for more details.
  • Convert relative urls to absolute with url_absolute(), and transform in the opposite direction with url_relative(). Escape and unescape special characters with url_escape() and url_unescape().
  • Support for modifying and creating xml documents in planned in a future version.

This package owes a debt of gratitude to Duncan Temple Lang who’s XML package has made it possible to use XML with R for almost 15 years!


You can install it by running:


(If you’re on a mac, you might need to wait a couple of days – CRAN is busy rebuilding all the packages for R 3.2.0 so it’s running a bit behind.)

Here’s a small example working with an inline XML document:

x <- read_xml("<foo>
  <bar>text <baz id = 'a' /></bar>
  <baz id = 'b' /> 

#> [1] "foo"
#> {xml_nodeset (3)}
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>

# Find all baz nodes anywhere in the document
baz <- xml_find_all(x, ".//baz")
#> {xml_nodeset (2)}
#> [1] <baz id="a"/>
#> [2] <baz id="b"/>
#> [1] "/foo/bar[1]/baz" "/foo/baz"
xml_attr(baz, "id")
#> [1] "a" "b"


Xml2 is still under active development. If notice any problems (including crashes), please try the development version, and if that doesn’t work, file an issue.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)