
My PhD is in climate science, and climate data is usually rectangular—or some higher-dimensional analogue of rectangular, anyway. (Blocky?) And since rectangular data is R’s bread and butter, I’ve had a pretty good time of things up until now.

But for the last few months I’ve been working on Is It Hot Right Now with my colleagues, Mat and Stefan, and that’s forced me to get to know data formats that are a bit less familiar to R but ubiquitous on the web: namely, XML and JSON.

Luckily, the xml2 and RJSONIO packages make accessing XML and JSON data really easy, and with purrr (part of the tidyverse), we can reduce them down to something useful really quickly. Like a yummy, rich… bolognese of data, I guess?

Let’s look at some basic examples and then ramp it up.

JSON: it’s lists all the way down

JSON is a data format that’s designed to mimic the way JavaScript stores objects. Here’s an example we’re using on Is It Hot Right Now:

JSON can store data in Arrays (denoted by [square brackets]) or Objects (denoted by {curly brackets}). This example doesn’t illustrate it, but you can also nest Arrays and Objects as deeply as you want.

Arrays and Objects both map naturally to R’s lists, since you can access list elements by a numeric position or a name. So when you load JSON data in R using RJSONIO, you just get a hierarchy of lists:

UPDATE: Maëlle Salmon pointed me to the excellent jsonlite package, which works the same way as RJSONIO but can also process JSON arrays and data frame-like structures into native R vectors and data frames. It could save you a lot of time for simple files!
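Here’s a minimal sketch of that difference, using jsonlite on a small made-up station list. The codes, names and coordinates are placeholders I’ve invented for illustration, not the real Is It Hot Right Now file. Passing simplifyVector = FALSE keeps the nested-list hierarchy, while the default collapses an array of similar objects into a data frame.

```r
library(jsonlite)

# Hypothetical station data: codes, names and coordinates are illustrative only
stations_json <- '[
  {"code": "00001", "name": "Sydney",    "lat": -33.86, "lon": 151.21},
  {"code": "00002", "name": "Melbourne", "lat": -37.81, "lon": 144.97},
  {"code": "00003", "name": "Brisbane",  "lat": -27.48, "lon": 153.04}
]'

# Keep the raw hierarchy: an unnamed list holding three named lists
stations <- fromJSON(stations_json, simplifyVector = FALSE)
str(stations[[1]])

# Default behaviour: the same array is simplified into a data frame
stations_df <- fromJSON(stations_json)
stations_df$name
```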

So this is a list of lists. We can look at one of the outer list elements (a single station) and see that it’s a list with a bunch of named elements inside. We can get an element out:

We can also use purrr’s amazing pluck function to dig as far as we need into lists of lists in a more readable way. For example, to get the name from the third station in the list:

(If you haven’t used the pipe operator %>% from magrittr before, it takes the thing on the left and squeezes it into the function on the right before that function’s other arguments. If you want the thing on the left somewhere else, you can use . to put it wherever you’d like.)
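As a rough sketch of those lookups, here is the same element pulled out three ways. The stations list is a hand-built stand-in with invented values, so treat the names and codes as assumptions.

```r
library(purrr)
library(magrittr)

# Hypothetical stand-in for the parsed station list
stations <- list(
  list(code = "00001", name = "Sydney"),
  list(code = "00002", name = "Melbourne"),
  list(code = "00003", name = "Brisbane")
)

# Base R: index by position, then by name
stations[[3]][["name"]]

# purrr: the same lookup, read left to right
pluck(stations, 3, "name")

# With the pipe, stations is slotted in as pluck's first argument
third_name <- stations %>% pluck(3, "name")
third_name
```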

But pulling things out one at a time gets old real quick; we want to automate this. In cases where we trust the input data to be structured in a predictable way (like this example, where we expect that each station in the Array will have the same data in it), we can use map in combination with pluck to get every station’s name:

The map_* functions take lists (or vectors) in, run the supplied function on each element of the input, then return the results of the function in a data structure of your choice. Here, pluck gives us a character vector of length 1 (because, on its own, it deals with one thing at a time). Vanilla map plucks from each object in turn and gives us a list with each character vector back. Since the list we get back just has character vectors in it, we could reduce it all the way down to one vector with unlist():
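A minimal sketch of that, again on a hand-built stations list with invented values. It also shows map_chr, one of the typed map_* variants, which returns a character vector directly.

```r
library(purrr)
library(magrittr)

# Hypothetical stand-in for the parsed station list
stations <- list(
  list(code = "00001", name = "Sydney"),
  list(code = "00002", name = "Melbourne"),
  list(code = "00003", name = "Brisbane")
)

# map returns a list of length-1 character vectors
name_list <- stations %>% map(pluck, "name")

# unlist flattens that list into a single character vector
name_vec <- name_list %>% unlist()

# map_chr does both steps at once, and errors if a result isn't a single string
name_chr <- stations %>% map_chr(pluck, "name")
```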

On its own, we’d call pluck with the list first, as in pluck(station, "name"), but map can pass extra arguments along to the function you want to run. So in this case, map passes "name" on to pluck.

A different way to do this is using what’s called an anonymous function: instead of calling a here’s-one-I-prepared-earlier function, we define one on the spot. map gives a shortcut to do this using the tilde (~):

That looks a bit more complicated. In the first version, map passed each element of the list onto pluck automatically (just like the pipe operator, %>%, does); in the second, we have to do it ourselves using the . pronoun.
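Side by side, the two versions might look like this (a sketch on a made-up stations list; both calls should give the same result):

```r
library(purrr)
library(magrittr)

# Hypothetical stand-in for the parsed station list
stations <- list(
  list(code = "00001", name = "Sydney"),
  list(code = "00002", name = "Melbourne"),
  list(code = "00003", name = "Brisbane")
)

# Version 1: map hands each element to pluck, followed by the extra argument
by_argument <- stations %>% map(pluck, "name")

# Version 2: an anonymous function, where . stands for the current element
by_formula <- stations %>% map(~ pluck(., "name"))

identical(by_argument, by_formula)
```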

But the second version is also more powerful, not just because we can use functions that don’t expect the data from map to go first, but because now we can pipe functions together inside map. For example:

Here, one pipe is stations %>% map() %>% unlist() %>% head(); the other is pluck() %>% paste(). The data pronoun . is passed into map from the outer pipe.
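Here’s a sketch of those two pipes together, on a made-up stations list. The inner pipe is indented one level further so it’s easier to tell apart from the outer one.

```r
library(purrr)
library(magrittr)

# Hypothetical stand-in for the parsed station list
stations <- list(
  list(code = "00001", name = "Sydney"),
  list(code = "00002", name = "Melbourne"),
  list(code = "00003", name = "Brisbane")
)

station_labels <- stations %>%
  map(
    # Inner pipe: runs once per station, with . as the current station
    ~ pluck(., "name") %>%
      paste("station")
  ) %>%
  # Back in the outer pipe: flatten the results and peek at the first few
  unlist() %>%
  head()

station_labels
```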

XML unplucked

To show you some of the more complex things we can do with map and nested pipes, let’s look at a more interesting dataset:

Like JSON, XML can store hierarchies of objects. However, in XML, the objects (or nodes, as XML calls them) are stored like <my_thing some_attribute="some_value">Some contents</my_thing>. So nodes have a name (my_thing), some optional attributes with values, and they have contents (which, like JSON, can be straight-up data or more objects).
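To make those three parts concrete, here’s a small sketch that parses that exact node with xml2 and pulls out its name, attribute and contents:

```r
library(xml2)

# Parse a literal XML string; the root node here is <my_thing>
node <- read_xml('<my_thing some_attribute="some_value">Some contents</my_thing>')

xml_name(node)                    # the node's name
xml_attr(node, "some_attribute")  # the value of one attribute
xml_text(node)                    # the node's contents, as text
```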

Unlike JSON, XML doesn’t map neatly to R’s data structures. So we can’t pluck into XML files, or even use R’s builtin[["accessor"]][["syntax"]]. Instead, we use accessor functions that help us isolate parts of the file:

And then we use functions like xml_attr or xml_text to get at the good bits as pluck did. But, unlike pluck, these XML functions can be given a group of nodes from xml_find_all, and they’ll return the results from each element in the group. No map needed!

(xml2 uses a syntax called XPath to make all sorts of granular selections of nodes. I’m not going to delve into it too deeply, but if you’re interested, MSDN has a good primer on it.)
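A minimal sketch of that vectorised behaviour follows. The XML is a simplified, made-up observation file: the real Bureau of Meteorology layout is more involved, so the node and attribute names here are assumptions.

```r
library(xml2)

# Hypothetical, simplified observation file
obs <- read_xml('<observations>
  <station code="00001" name="Sydney"><air_temperature>24.1</air_temperature></station>
  <station code="00002" name="Melbourne"><air_temperature>19.5</air_temperature></station>
  <station code="00003" name="Brisbane"><air_temperature>27.3</air_temperature></station>
</observations>')

# The XPath //station selects every <station> node, wherever it sits
station_nodes <- xml_find_all(obs, "//station")

# xml_attr works across the whole node set: one result per node
station_names <- xml_attr(station_nodes, "name")

# xml_find_first takes the first matching child of each node, so lengths line up
temps <- as.numeric(xml_text(xml_find_first(station_nodes, "air_temperature")))
```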

I guess we don’t need map after all! Well, not quite…

Nesting pipes

I mentioned before that using map with the tilde syntax allows us to chain functions together and repeat the result across a list of things. But nested pipes can get complicated real fast, as the next example will illustrate.

Let’s say I don’t just want a bunch of station names from my XML file—I want a bunch of useful information, like its latitude and longitude, its timezone and the air temperature.

I’ll probably want to put that into a data frame, and I can! But I only want the stations with codes matching my earlier list:

Now we’re cookin’ with gas! (These are current observations, so your numbers might look a little different.)

Let’s leave the rather convoluted XPath filter aside and focus on the pipes. This code is liberally sprinkled with %>% and ., but there are actually two different pipes going on. (This is a really good argument for consistent indentation style: it helps keep nested pipes straight!)

In the first two lines (18 and 19), we have the sort of pipe we would usually expect: it passes the parsed XML document obs on to the XML selector, xml_find_all, and that passes the selection of nodes on to data_frame. Once we’re inside data_frame, we use the data pronoun . to refer back to the selection from xml_find_all. So the pipe %>% stays outside data_frame, following the first level of indentation, and . stays inside.

But there’s also a pipe operator %>% inside data_frame, on line 26. That’s a new pipe. It’s passing a sub-selection on to xml_text. If we continue piping on outside data_frame, we’re back to the outer pipe, passing the whole data frame on.
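The pattern described above can be sketched on a simplified, made-up observation file. A few assumptions: the node and attribute names are invented, I’ve used tibble() where the post uses data_frame (its older name), and I’ve wrapped the call in braces so magrittr doesn’t also insert the node set as the first argument. The line numbers mentioned above refer to the original listing, not to this sketch.

```r
library(xml2)
library(magrittr)
library(tibble)

# Hypothetical, simplified observation file
obs <- read_xml('<observations>
  <station code="00001" name="Sydney" tz="Australia/Sydney" lat="-33.86" lon="151.21"><air_temperature>24.1</air_temperature></station>
  <station code="00002" name="Newcastle" tz="Australia/Sydney" lat="-32.93" lon="151.78"><air_temperature>22.8</air_temperature></station>
  <station code="00003" name="Wollongong" tz="Australia/Sydney" lat="-34.42" lon="150.89"><air_temperature>21.6</air_temperature></station>
</observations>')

# Only keep stations whose code is in this (made-up) list
wanted <- c("00001", "00003")

# Builds: //station[@code="00001" or @code="00003"]
xpath <- paste0("//station[", paste0('@code="', wanted, '"', collapse = " or "), "]")

station_df <- obs %>%
  xml_find_all(xpath) %>%
  {
    tibble(
      # Inside the braces, . is the node set from the outer pipe
      code = xml_attr(., "code"),
      name = xml_attr(., "name"),
      tz   = xml_attr(., "tz"),
      lat  = as.numeric(xml_attr(., "lat")),
      lon  = as.numeric(xml_attr(., "lon")),
      # A second, inner pipe: sub-select a child node, then take its text
      air_temp = xml_find_first(., "air_temperature") %>%
        xml_text() %>%
        as.numeric()
    )
  }

station_df
```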

There’s just one problem with all of this: my stations are spread across a bunch of different states, and this XML file is for one state, New South Wales. I want a data frame for all of them!

Something-something Inception

Luckily, the Bureau of Meteorology keeps the other states’ observations in the same place, varying only the third letter of the name: IDN60920.xml for New South Wales, IDV60920.xml for Victoria, and so on.
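As a quick sketch, the file names can be built from a vector of state letters. I’ve only included the two letters named above; the rest of the URL isn’t shown here, so this stops at the file name.

```r
# State letters named in the text: N for New South Wales, V for Victoria
state_letters <- c("N", "V")

# paste0 is vectorised, so this yields one file name per letter
obs_files <- paste0("ID", state_letters, "60920.xml")
obs_files
```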

Now, what can we use to repeat a task across a list?

Yep, it’s map! But this time, we’re returning data frames. I mentioned that there are map_* functions for combining the results of our mapping in different ways, and map_dfr can bind data frames you return by row. So we’re going to take our entire last example, and we’re going to jam it into a map_dfr() call using that magical tilde ~:

Okay, that got a little wild. By my count, we have three pipes going here, and two of them have . pronouns referring back:

1. First, we pipe that vector of letter codes into map_dfr (on line 2). Once we’re inside map_dfr, we use . on line 4 to refer to the current letter (because the functions we give to map_* deal with one list element at a time).
2. But we also start a new pipe on lines 4 and 5, inside map_dfr, carrying a node selection into data_frame as we did before. And once we’re inside data_frame, we use . to refer back to the second pipe’s data (lines 8–11, 13).
3. And, on line 13, we start a third pipe (as we did before) to get the text from each air_temperature element.

Each of those data frames we made before gets row bound (glued from top-to-bottom) by map_dfr. Ta-da!
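Here’s a self-contained sketch of those three pipes. To keep it runnable offline, the per-state files are replaced with made-up XML strings in a named list, and the node names are simplified assumptions. I’ve also used tibble() in braces rather than data_frame, and the line numbers in the list above refer to the original listing rather than this one. map_dfr needs dplyr installed, and newer purrr versions mark it as superseded in favour of map() followed by list_rbind().

```r
library(xml2)
library(purrr)
library(magrittr)
library(tibble)

# Hypothetical stand-ins for each state's observation file, keyed by letter
state_xml <- list(
  N = '<observations>
         <station name="Sydney"><air_temperature>24.1</air_temperature></station>
         <station name="Newcastle"><air_temperature>22.8</air_temperature></station>
       </observations>',
  V = '<observations>
         <station name="Melbourne"><air_temperature>19.5</air_temperature></station>
       </observations>'
)

all_obs <- c("N", "V") %>%
  # Pipe 1: the letters go into map_dfr; inside the formula, . is one letter
  map_dfr(
    ~ read_xml(state_xml[[.]]) %>%
      # Pipe 2: carries this state's node set into the braces
      xml_find_all("//station") %>%
      {
        tibble(
          name = xml_attr(., "name"),
          # Pipe 3: sub-select the temperature node and convert its text
          air_temp = xml_find_first(., "air_temperature") %>%
            xml_text() %>%
            as.numeric()
        )
      }
  )

all_obs
```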

The important thing for keeping track of all these pipes is that the pipe operator %>% and the data pronoun . appear on the outside and inside of a piped function respectively. In a couple of places the pronoun from one pipe and the operator from another appear on the same line. But if we’re consistent about our indentation, we can always see which pipes they belong to, remembering that the pronoun . appears one level further in (because it’s inside the piped function).

So now we have a totally automated way to bring together the interesting observations from all stations of interest across a number of files. In fact, we did automate it: for Is It Hot Right Now, we schedule R to run a script with this pipe in it every half hour.

Next stop

One of the best things about map is its flexibility: you can use this approach to deal with just about any structured data in R, whether it’s complex objects like regression models, data structures you’ve built yourself or files brought in using other packages.
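For instance, here’s a small sketch of the regression-model case using R’s built-in mtcars data, which is my choice of example rather than anything from Is It Hot Right Now. It fits one model per cylinder group and pulls a single coefficient out of each.

```r
library(purrr)
library(magrittr)

wt_slopes <- mtcars %>%
  # One data frame per cylinder count, as a named list
  split(.$cyl) %>%
  # Fit a linear model to each piece; . is the current data frame
  map(~ lm(mpg ~ wt, data = .)) %>%
  # Extract the weight coefficient from each model as a named numeric vector
  map_dbl(~ coef(.)[["wt"]])

wt_slopes
```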

If you’re looking for more detail, I, like many others, recommend Jenny Bryan’s incomparable purrr tutorial. It covers a lot of the other sophisticated ways you can use purrr. One particular use case from her tutorial that I didn’t cover is roundtripping a data frame list column with map inside a mutate verb, the way you would with a regular data frame column. That’s mostly because I’ve only done it once and I still only 80% understand it.

Think I could’ve done this better? Got a question? Let me know!

Cover image: Jackhammer by Martinus Scriblerus. Licensed under CC BY-NC 2.0