Better handling of JSON data in R?

Posted on March 13, 2014 by Rolf Fredheim in R bloggers | 0 Comments

[This article was first published on Quantifying Memory, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

What is the best way to read data in JSON format into R? Though really common for almost all modern online applications, JSON is not every R user’s best friend. After seeing the slides for my Web Scraping course, in which I somewhat arbitrarily veered between using the packages rjson and RJSONIO, the creator of a third JSON package, Jeroen Ooms, urged me to reconsider my package selection process. So without further ado, is jsonlite any better? Does it get rid of the problem of seemingly infinitely nested lists?

As part of exploring digital data collection we used a range of sources that provide JSON data – from Wikipedia page views to social media sharing stats to YouTube Comments and real-time cricket scores. A persistent annoyance for students was navigating the JSON structure, typically translated into R as a list. Here is what my YouTube stats scraper looks like:

getStats <- function(id) {
    url = paste0("https://gdata.youtube.com/feeds/api/videos/", id, "?v=2&alt=json")
    raw.data <- readLines(url, warn = "F")
    rd <- fromJSON(raw.data)
    dop <- as.character(rd$entry$published)
    term <- rd$entry$category[[2]]["term"]
    label <- rd$entry$category[[2]]["label"]
    title <- rd$entry$title
    author <- rd$entry$author[[1]]$name
    duration <- rd$entry$`media$group`$`media$content`[[1]]["duration"]
    favs <- rd$entry$`yt$statistics`["favoriteCount"]
    views <- rd$entry$`yt$statistics`["viewCount"]
    dislikes <- rd$entry$`yt$rating`["numDislikes"]
    likes <- rd$entry$`yt$rating`["numLikes"]
    return(list(id, dop, term, label, title, author, duration, favs, views,
        dislikes, likes))
}

(getStats("Ya2elsR5s5s"))

[[1]]

[1] "Ya2elsR5s5s"

[[2]]

[1] "2013-12-17T19:01:44.000Z"

etc.

Now, this is all fine, except that, upon closer inspection, the scraper function burrows into lists to extract the correct field. We use special ticks to accommodate names with dollar-signs in them, to name but one challenge.

Is this any easier using jsonlite?

require(jsonlite)
id = "Ya2elsR5s5s"
url = paste0("https://gdata.youtube.com/feeds/api/videos/", id, "?v=2&alt=json")
raw.data <- readLines(url, warn = "F")
rd <- fromJSON(raw.data)
term <- rd$entry$category$term[2]
label <- rd$entry$category$label[2]
title <- rd$entry$title
author <- rd$entry$author[1]
duration <- rd$entry$`media$group`$`media$content`$duration[1]

is this any better? I’m not convinced there's much in it: because of the JSON structure used by the YouTube API, jsonlite can only coerce a few elements into data.frames, and these are still buried deep in the list structure. The object 'rd' contains a mix of named entities and data.frames, and in this case we have to do similar excavation to get at interesting data.

What about social stats, e.g. facebook shares?

Here is my approach from the web scraping tutorials: first we construct the HTTP request, then we read the response using rjson

fqlQuery = "select share_count,like_count,comment_count from link_stat where url=\""
url = "http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky"
queryUrl = paste0("http://graph.facebook.com/fql?q=", fqlQuery, url, "\"")  #ignoring the callback part
lookUp <- URLencode(queryUrl)  #What do you think this does?
lookUp

## [1] "http://graph.facebook.com/fql?q=select%20share_count,like_count,comment_count%20from%20link_stat%20where%20url=%22http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky%22"

rd <- readLines(lookUp, warn = "F")

require(rjson)
dat <- fromJSON(rd)
dat

## $data
## $data[[1]]
## $data[[1]]$share_count
## [1] 388
## 
## $data[[1]]$like_count
## [1] 430
## 
## $data[[1]]$comment_count
## [1] 231

dat$data[[1]]["share_count"]

## $share_count
## [1] 388

How does jsonlite compare?

require(jsonlite)
dat <- fromJSON(rd)
dat

## $data
##   share_count like_count comment_count
## 1         388        430           231

dat$data$share_count

## [1] 388

Is that better? Yes, I think jsonlite in this case offers a significant improvement.

What about writing to JSON?

Not long ago I did a bit of work involving exporting data from R for use in d3 visualisations. This data had to be in a nested JSON format, which I approximated through a (to me) rather complex process using split and lapply. Can jsonlite simplify this at all?

Possibly. Though my gut reaction is that creating nested data.frames is not much simpler than manually creating creating nested lists. I repeatedly used the split function to chop up the data into a nested structure. Once this was done, however, toJSON wrote very nice output:

"9" : {
"33" : {
  "74" : [
    {
      "label" : "V155",
      "labs" : "Bird Flu and Epidemics"
    },
    {
      "label" : "V415",
      "labs" : "Fowl and Meat Industry"
    }
  ],
  "75" : [
    {
      "label" : "V166",
      "labs" : "Academics"
    },
    {
      "label" : "V379",
      "labs" : "Places Of Study and Investigation"
    }
  ],
  "76" : [
    {
      "label" : "V169",
      "labs" : "Space Exploration"
    },
    {
      "label" : "V261",
      "labs" : "Cosmonauts"
    }
  ]
}

My verdict: jsonlite makes saving a data.frame in JSON very easy indeed, and the fact we can turn a data.frame seamlessly into a 'flat' JSON file is excellent. In many real-world situations the reason for using JSON in the first place (rather than say csv) is that a columns/row structure is either inefficient or plain inappropriate. jsonlite is a welcome addition, though transporting data between R and javascript and applications is not seamless just yet. The bottom-line: great for simple cases; tricky structures remain tricky.

Seriously: does anyone know how to automatically created nested data frames or lists?

To leave a comment for the author, please follow the link and comment on their blog: Quantifying Memory.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Better handling of JSON data in R?

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)