Better handling of JSON data in R?

March 13, 2014
By

(This article was first published on Quantifying Memory, and kindly contributed to R-bloggers)

What is the best way to read data in JSON format into R? Though really common for almost all modern online applications, JSON is not every R user's best friend. After seeing the slides for my Web Scraping course, in which I somewhat arbitrarily veered between using the packages rjson and RJSONIO, the creator of a third JSON package, Jeroen Ooms, urged me to reconsider my package selection process. So without further ado, is jsonlite any better? Does it get rid of the problem of seemingly infinitely nested lists?

As part of exploring digital data collection we used a range of sources that provide JSON data - from Wikipedia page views to social media sharing stats to YouTube Comments and real-time cricket scores. A persistent annoyance for students was navigating the JSON structure, typically translated into R as a list. Here is what my YouTube stats scraper looks like:

getStats <- function(id) {    url = paste0("https://gdata.youtube.com/feeds/api/videos/", id, "?v=2&alt=json")    raw.data <- readLines(url, warn = "F")    rd <- fromJSON(raw.data)    dop <- as.character(rd$entry$published)    term <- rd$entry$category[[2]]["term"]    label <- rd$entry$category[[2]]["label"]    title <- rd$entry$title    author <- rd$entry$author[[1]]$name duration <- rd$entry$media$group$media$content[[1]]["duration"]    favs <- rd$entry$yt$statistics["favoriteCount"] views <- rd$entry$yt$statistics["viewCount"]    dislikes <- rd$entry$yt$rating["numDislikes"] likes <- rd$entry$yt$rating["numLikes"]    return(list(id, dop, term, label, title, author, duration, favs, views,        dislikes, likes))}(getStats("Ya2elsR5s5s"))

[[1]]

[1] "Ya2elsR5s5s"

[[2]]

[1] "2013-12-17T19:01:44.000Z"

etc.

Now, this is all fine, except that, upon closer inspection, the scraper function burrows into lists to extract the correct field. We use special ticks to accommodate names with dollar-signs in them, to name but one challenge.

Is this any easier using jsonlite?

require(jsonlite)id = "Ya2elsR5s5s"url = paste0("https://gdata.youtube.com/feeds/api/videos/", id, "?v=2&alt=json")raw.data <- readLines(url, warn = "F")rd <- fromJSON(raw.data)term <- rd$entry$category$term[2]label <- rd$entry$category$label[2]title <- rd$entry$titleauthor <- rd$entry$author[1]duration <- rd$entry$media$group$media$content$duration[1]

is this any better? I’m not convinced there's much in it: because of the JSON structure used by the YouTube API, jsonlite can only coerce a few elements into data.frames, and these are still buried deep in the list structure. The object 'rd' contains a mix of named entities and data.frames, and in this case we have to do similar excavation to get at interesting data.

What about social stats, e.g. facebook shares?

Here is my approach from the web scraping tutorials: first we construct the HTTP request, then we read the response using rjson

fqlQuery = "select share_count,like_count,comment_count from link_stat where url=\""url = "http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky"queryUrl = paste0("http://graph.facebook.com/fql?q=", fqlQuery, url, "\"")  #ignoring the callback partlookUp <- URLencode(queryUrl)  #What do you think this does?lookUp
## [1] "http://graph.facebook.com/fql?q=select%20share_count,like_count,comment_count%20from%20link_stat%20where%20url=%22http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky%22"
rd <- readLines(lookUp, warn = "F")require(rjson)dat <- fromJSON(rd)dat
## $data##$data[[1]]## $data[[1]]$share_count## [1] 388## ## $data[[1]]$like_count## [1] 430## ## $data[[1]]$comment_count## [1] 231
dat$data[[1]]["share_count"] ##$share_count## [1] 388

How does jsonlite compare?

require(jsonlite)dat <- fromJSON(rd)dat
## $data## share_count like_count comment_count## 1 388 430 231 dat$data\$share_count
## [1] 388

Is that better? Yes, I think jsonlite in this case offers a significant improvement.

What about writing to JSON?

Not long ago I did a bit of work involving exporting data from R for use in d3 visualisations. This data had to be in a nested JSON format, which I approximated through a (to me) rather complex process using split and lapply. Can jsonlite simplify this at all?

Possibly. Though my gut reaction is that creating nested data.frames is not much simpler than manually creating creating nested lists. I repeatedly used the split function to chop up the data into a nested structure. Once this was done, however, toJSON wrote very nice output:

"9" : {"33" : {  "74" : [    {      "label" : "V155",      "labs" : "Bird Flu and Epidemics"    },    {      "label" : "V415",      "labs" : "Fowl and Meat Industry"    }  ],  "75" : [    {      "label" : "V166",      "labs" : "Academics"    },    {      "label" : "V379",      "labs" : "Places Of Study and Investigation"    }  ],  "76" : [    {      "label" : "V169",      "labs" : "Space Exploration"    },    {      "label" : "V261",      "labs" : "Cosmonauts"    }  ]}

My verdict: jsonlite makes saving a data.frame in JSON very easy indeed, and the fact we can turn a data.frame seamlessly into a 'flat' JSON file is excellent. In many real-world situations the reason for using JSON in the first place (rather than say csv) is that a columns/row structure is either inefficient or plain inappropriate. jsonlite is a welcome addition, though transporting data between R and javascript and applications is not seamless just yet. The bottom-line: great for simple cases; tricky structures remain tricky.

Seriously: does anyone know how to automatically created nested data frames or lists?

To leave a comment for the author, please follow the link and comment on his blog: Quantifying Memory.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.