The Apache Drill folks have a nice walk-through tutorial on how to analyze the Yelp Academic Dataset with Drill. It’s a bit out of date (the current Yelp data set structure is different enough that the tutorial will error out at various points), but it’s a great example of how to work with large, nested JSON files as a SQL data source. By ‘large’ I mean around 4GB of JSON data spread across 5 files.
If you have enough memory and want to work with “flattened” versions of the files in R, you could use my ndjson package (there are other JSON “flattener” packages as well, and a new one, corpus::read_ndjson, is even faster than mine, but it fails to read this file). Drill doesn’t necessarily load the entire JSON structure into memory (you can check out the query profiles after the fact to see how much each worker component ended up using), and I’m only mentioning that “R can do this w/o Drill” to stave off some of those types of comments.
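To make “flattened” concrete, here is a minimal sketch using ndjson::flatten() on two made-up records. The field names only loosely mimic the Yelp business file and are assumptions for illustration, not the real schema.

```r
# Two hypothetical ndjson records (one JSON object per line). The fields are
# invented for illustration and are not the actual Yelp schema.
records <- c(
  '{"business_id":"b1","name":"Cafe A","attributes":{"wifi":"free"},"categories":["Coffee","Food"]}',
  '{"business_id":"b2","name":"Bar B","attributes":{"wifi":"no"},"categories":["Bars"]}'
)

# flatten() takes a character vector of JSON lines and returns one row per
# record, with nested fields spread out into their own columns.
flat <- ndjson::flatten(records)

dim(flat)    # rows x columns of the flattened result
names(flat)  # inspect how the nested fields were named
```

For the real files you would point ndjson::stream_in() at a file path instead, at which point the whole flattened result has to fit in memory.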
The main reasons for replicating their Yelp example were both to have a more robust test suite for sergeant (it’s hitting CRAN soon now that dplyr 0.7.0 is out) and to show some Drill SQL to R conversions. Part of the latter reason is also to show how to use SQL calls to create a tbl that you can then manipulate with dplyr verbs.
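A rough sketch of that SQL-to-tbl pattern follows. The workspace path, the column aliases, and the helper names category_sql() and top_categories() are all hypothetical, and it assumes sergeant's dplyr backend accepts a parenthesized query as the tbl source; treat it as an outline rather than the code from the full replication.

```r
# Hypothetical Drill workspace path to the Yelp business file; adjust it to
# wherever the JSON lives in your own Drill storage configuration.
yelp_business <- "dfs.tmp.`/yelp/yelp_academic_dataset_business.json`"

# Build a parenthesized Drill SQL query to use as a tbl source (assumed to be
# accepted by the backend). FLATTEN() is Drill SQL that turns each element of
# a repeated field into its own row.
category_sql <- function(path) {
  sprintf(
    "(SELECT b.name AS name, FLATTEN(b.categories) AS category FROM %s b)",
    path
  )
}

# Sketch: the n most common categories, with the counting pushed to Drill
# through ordinary dplyr verbs and only the small result collected into R.
top_categories <- function(db, path = yelp_business, n = 10) {
  cats <- dplyr::tbl(db, category_sql(path))
  counted <- dplyr::count(cats, category, sort = TRUE)
  dplyr::collect(head(counted, n))
}
```

With a Drill instance running locally, this would be called as top_categories(sergeant::src_drill("localhost")). Keeping the SQL in its own small function makes it easy to print and compare against the original tutorial's query.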
The full tutorial replication is at https://rud.is/rpubs/yelp.html but also iframe’d below.