
Why I like XPath, XML and HTML

[This article was first published on Maëlle's R blog on Maëlle Salmon's personal website, and kindly contributed to R-bloggers].

One of my favorite tools is XPath, the query language for exploring XML and HTML trees. In this post, I will highlight a few use cases of this “angle-bracket crunching tool” and hope to convince you that it’s an awesome thing to know about and play with.

Many thanks to Christophe Dervieux for useful feedback on this post! Mille mercis !

Brief intro to XPath in R

Say I have some XML,

my_xml <- xml2::read_xml("<wrapper><thing>blop</thing></wrapper>")

I’m using xml2, by Hadley Wickham, Jim Hester and Jeroen Ooms. This package is recommended over the XML package by e.g. the rOpenSci dev guide.

With XPath I can query the “thing” element:

xml2::xml_find_all(my_xml, ".//thing")
#> {xml_nodeset (1)}
#> [1] <thing>blop</thing>

I can extract its content via xml2::xml_text():

xml2::xml_find_all(my_xml, ".//thing") |>
  xml2::xml_text()
#> [1] "blop"

I could also replace the element.
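For instance, here is a minimal sketch of such a replacement using xml2’s modification functions (the `"blip"` value is just an example):

```r
my_xml <- xml2::read_xml("<wrapper><thing>blop</thing></wrapper>")
thing <- xml2::xml_find_first(my_xml, ".//thing")

# replace the node's text in place
xml2::xml_set_text(thing, "blip")

# (xml2::xml_replace() would let me swap out the whole node instead)

xml2::xml_text(my_xml)
#> [1] "blip"
```

The modification happens in place: `my_xml` itself is updated, since xml2 documents have reference semantics.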

Now, that was an especially simple XPath query. XPath’s strength is to allow you to really take advantage of the structure of the XML or HTML tree. You can extract nodes based on their attributes, on their parents, on their siblings, etc.
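As a small illustration of that structure-awareness (with a made-up two-book document), here is how a query can filter on an attribute, or hop to a sibling:

```r
library(xml2)

doc <- read_xml(
  "<shelf>
     <book lang='en'><title>A book</title></book>
     <book lang='fr'><title>Un livre</title></book>
   </shelf>"
)

# select a node based on one of its attributes
xml_text(xml_find_all(doc, ".//book[@lang='fr']/title"))
#> [1] "Un livre"

# select a node based on its sibling: the book right after the English one
xml_text(xml_find_first(doc, ".//book[@lang='en']/following-sibling::book/title"))
#> [1] "Un livre"
```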

Where to learn XPath? Two good websites to get started are Mozilla Developer Network’s intro to XPath and w3schools’ XPath tutorial.

A primary skill to learn is the name of elements, e.g. nodes, attributes, which will help you type better keywords into search engines when trying to figure out a query. 😉

Note that if you are handling HTML, you might enjoy selectr by Simon Potter that creates XPath filters based on CSS selectors.

Knowing XPath, or even knowing it exists, is really empowering. In the rest of this post, I’ll highlight cases where this is useful.

When life gives you XML or HTML

Web scraping

At the beginning of this blog I liked extracting data from websites. I did that with regular expressions. Now I know better and would wrangle HTML as HTML. Goodbye, stringr::str_detect(), hello, xml2::xml_find_all().

A package that’s especially useful for web scraping is rvest by Hadley Wickham. rvest builds upon selectr, and will write XPath for you.
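A tiny sketch of that, using a toy page built with rvest’s `minimal_html()` (the selector is just an example):

```r
library(rvest)

page <- minimal_html("<div class='main'><h1>Hello</h1></div><h1>Not this one</h1>")

# a CSS selector, translated to XPath for you behind the scenes
html_text(html_elements(page, "div.main h1"))
#> [1] "Hello"
```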

pkgdown

If you use pkgdown to produce the documentation website for your package, please know that part of its magic comes from various “HTML tweaks” that are powered by XPath, see for instance “tweak-reference.R”.

When life gives you something else…

You can still make it XML to handle it as such, with XPath!

Markdown manipulation with commonmark, tinkr

The commonmark package transforms Markdown to XML. This can be extremely handy to get data on R Markdown or Markdown files.
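For instance, here is a sketch of extracting all link destinations from a Markdown string; note that commonmark’s XML output declares a default namespace, which xml2 exposes under the `d1` prefix:

```r
md <- "# A title\n\nSome text with a [link](https://example.com)."

doc <- commonmark::markdown_xml(md) |>
  xml2::read_xml()

# query link nodes through the default namespace prefix that xml2 assigns
xml2::xml_attr(xml2::xml_find_all(doc, ".//d1:link"), "destination")
#> [1] "https://example.com"
```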

Now, say you want to modify the Markdown file as XML then get a Markdown file back. It is also possible, with the tinkr package, started by yours truly, now maintained by Zhian Kamvar. The conversion back to Markdown relies on xslt by Jeroen Ooms, a package that can use XSL stylesheets.

Code tree manipulation with xmlparsedata

Imagine you’re writing a domain-specific language where you let users write something like

str_detect(str_to_lower(itemTitle), 'wikidata')

that you want to somehow translate to:

REGEX(LCASE(?itemTitle),"wikidata")

Yes, that’s a real use case, from the glitter package (SPARQL DSL) maintained by Lise Vaudor.

The way we translate the code is to transform it to an XML tree via Gábor Csárdi’s xmlparsedata, then we can apply different tweaks based on XPath.

parse(
  text = "str_detect(str_to_lower(itemTitle), 'wikidata')",
  keep.source = TRUE
) |> 
  xmlparsedata::xml_parse_data(pretty = TRUE) |> 
  xml2::read_xml() |>
  as.character() |>
  cat()
#> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
#> <exprlist>
#>   <expr line1="1" col1="1" line2="1" col2="47" start="49" end="95">
#>     <expr line1="1" col1="1" line2="1" col2="10" start="49" end="58">
#>       <SYMBOL_FUNCTION_CALL line1="1" col1="1" line2="1" col2="10" start="49" end="58">str_detect</SYMBOL_FUNCTION_CALL>
#>     </expr>
#>     <OP-LEFT-PAREN line1="1" col1="11" line2="1" col2="11" start="59" end="59">(</OP-LEFT-PAREN>
#>     <expr line1="1" col1="12" line2="1" col2="34" start="60" end="82">
#>       <expr line1="1" col1="12" line2="1" col2="23" start="60" end="71">
#>         <SYMBOL_FUNCTION_CALL line1="1" col1="12" line2="1" col2="23" start="60" end="71">str_to_lower</SYMBOL_FUNCTION_CALL>
#>       </expr>
#>       <OP-LEFT-PAREN line1="1" col1="24" line2="1" col2="24" start="72" end="72">(</OP-LEFT-PAREN>
#>       <expr line1="1" col1="25" line2="1" col2="33" start="73" end="81">
#>         <SYMBOL line1="1" col1="25" line2="1" col2="33" start="73" end="81">itemTitle</SYMBOL>
#>       </expr>
#>       <OP-RIGHT-PAREN line1="1" col1="34" line2="1" col2="34" start="82" end="82">)</OP-RIGHT-PAREN>
#>     </expr>
#>     <OP-COMMA line1="1" col1="35" line2="1" col2="35" start="83" end="83">,</OP-COMMA>
#>     <expr line1="1" col1="37" line2="1" col2="46" start="85" end="94">
#>       <STR_CONST line1="1" col1="37" line2="1" col2="46" start="85" end="94">'wikidata'</STR_CONST>
#>     </expr>
#>     <OP-RIGHT-PAREN line1="1" col1="47" line2="1" col2="47" start="95" end="95">)</OP-RIGHT-PAREN>
#>   </expr>
#> </exprlist>

To me, having an XML tree at hand makes it easier to think of, and work with, an “abstract syntax tree”.
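Once the code lives in an XML tree, some questions become one-liners. For instance, listing every function called in the snippet parsed above:

```r
xml <- parse(
  text = "str_detect(str_to_lower(itemTitle), 'wikidata')",
  keep.source = TRUE
) |>
  xmlparsedata::xml_parse_data(pretty = TRUE) |>
  xml2::read_xml()

# one XPath query gives every function name called in the code
xml2::xml_text(xml2::xml_find_all(xml, "//SYMBOL_FUNCTION_CALL"))
#> [1] "str_detect"   "str_to_lower"
```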

XPath for all the things

A tool that I haven’t used, but that sounds intriguing, is rpath by Gabriel Becker, an R package implementing XPath-like functionality for querying R objects.

Data documentation with EML

No matter what format your data is, you can create its metadata using the EML package maintained by Carl Boettiger that creates XML metadata following the Ecological Metadata Language. Sure, you might prefer using dataspice maintained by Bryce Mecum (and get JSON).

When you are creating XML or HTML

If the goal of your code or package is to produce XML or HTML, knowing XPath will help you write unit tests (that you might want to complement with snapshot unit tests).
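A sketch of such a test, where `build_page()` is a hypothetical function standing in for whatever your package generates:

```r
# hypothetical generator, standing in for your package's real output
build_page <- function() "<html><body><h1 class='title'>Hi</h1></body></html>"

html <- xml2::read_html(build_page())

# assert that the page contains exactly one <h1 class="title">
testthat::expect_length(
  xml2::xml_find_all(html, "//h1[@class='title']"),
  1
)
```

The XPath query pins down exactly the structure you care about, which tends to make such tests less brittle than string matching on the raw HTML.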

Conclusion

In this post I’ve explained why I find XPath, XML, HTML so useful. Applications are endless, not limited to the examples from this post: web scraping, HTML tweaks, Markdown manipulation, code tree manipulation…
