Cobble XPath Interactively with the xmlview Package

January 13, 2016
By

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

(If you don’t know what XML is, you should probably read a primer before reading this post,)

When working with data, one inevitably comes across things encoded in XML. I’m in the “anti-XML” camp, but deal with my fair share of XML in “cyber” and help out enough people who have to work with XML that I’ve become pretty proficient when slicing & dicing it.

R has two main packages to deal with XML: the original XML package and the more lightweight and modern xml2 package. If you really need all the power of libxml2 (the C library that powers both packages) or are creating XML from R, then you probably know your way around the XML package and are pretty self-sufficient.

Most folks can get by with the xml2 package if their goal is to work with XML data. By “work with” I mean read in files or data from APIs that come in XML format and have to find nuggets of gold in between all those < and > tags. To do so requires finding what you need and that means using a query language called XPath to pinpoint the node(s) you are after. Working with XPath can be pretty daunting for those who went to school to ultimately cure diseases, build high-performing stock portfolios, target advertising to everyone or perform a host of other real work. Becoming an expert in XPath was not something on the bucket list but to work with XML you will need to be familiar with it.

The xmlview package provides a way to visually inspect XML and interactively test out XPath expressions. It’s as simple to use as:

devtools::install_github("ramnathv/htmlwidgets") # we use some bleeding edge features
devtools::install_github("hrbrmstr/xmlview")
library(xml2)
library(xmlview)
 
# plain text XML
xml_view("ToveJaniReminderDon't forget me this weekend!")
 
# read-in XML document
doc <- read_xml("http://www.npr.org/rss/rss.php?id=1001")
xml_view(doc, add_filter=TRUE)

(There’s also an experimental xml_tree_view() in there by @timelyportfolio that we’ll be adding features to at a pretty rapid pace.)

Here’s a screenshot of it in action:

RStudioScreenSnapz003

There are options to change the CSS styling for the formatted code. Yep, it will format and highlight XML for you so it’s easier to work with. There’s an animated gif of a screencast over on github as well.

Once you perfect your XPath expression, hit the “R” button and it will generate the code you can copy back into RStudio. It understands namespaces but try not to stuff a huge XML document in there as browsers don’t work well with large data elements (the viewer is an htmlwidget and is, hence, browser-based).

It works with plain character XML/HTML, and many xml2 data types. I have no current plans for XML package object support but toss up an issue on github if you really need it (or, better yet, a PR). If there are other desired features (especially from educators), please post a request in github issue as well.

Watch for more features in the coming weeks and a CRAN release once the bleeding edge htmlwidgets packages makes it to CRAN.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)