Site icon R-bloggers

Unravelling an Enormous JSON

[This article was first published on R - datawookie, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I have a challenge: extracting data from an enormous JSON file. The structure of the file is very challenging: it’s a mapping at the top level, which means that for most standard approaches the entire document needs to be loaded before it can be processed. It would have been so much easier if the top level structure was an array. But, alas. It’s almost as if the purveyors of the data have made it intentionally inaccessible.

Getting the Data

The data come from Transparency in Coverage and are in the form of a compressed JSON file.

wget -O united-healthcare-rates.json.gz https://tinyurl.com/united-healthcare-rates

Let’s see how big the file is.

ls -lh united-healthcare-rates.json.gz

-rw-rw-r-- 1 ubuntu ubuntu 4.0G Jul 24 20:27 united-healthcare-rates.json.gz

The compressed JSON is 4 GiB. But that’s just the beginning. Now we need to decompress.

gunzip united-healthcare-rates.json.gz

How big is it now?

ls -lh united-healthcare-rates.json

-rw-rw-r-- 1 ubuntu ubuntu 122G Aug  3 05:36 united-healthcare-rates.json

Whoah! 122 GiB. That’s a serious chunk of data.

The data schema is described here. However, that schema is of little use if we cannot load and parse the data.

Initial Attempts

My immediate reaction was: okay, I’ll process this in Python. Wrong! The file was far too big to read in one chunk (obviously!). What about using an iterative or streaming approach with something like the ijson library? Seems like a good idea and would have worked if the JSON file was an array, but since it’s a mappin streaming is not a viable option.

What about loading the data into a document database like MongoDB and then using NoSQL to investigate the contents? Again, this seems like a reasonable idea. However, there are constraints on the size of documents which you can load into a document database. Another dead end.

Simple Shell Solution

There’s a handy shell command for working with JSON data: jq. Install it if you don’t already have it.

< !-- Useful cheatsheet: https://gist.github.com/olih/f7437fb6962fb3ee9fe95bda8d2c8fa4 -->
sudo apt install jq

Document Keys

Let’s take a look at the mapping keys.

jq 'keys' united-healthcare-rates.json

[
  "in_network",
  "last_updated_on",
  "provider_references",
  "reporting_entity_name",
  "reporting_entity_type",
  "version"
]

That takes a while to run (because there’s a lot of data to parse), but we can immediately see the number and names of the key/value pairs.

Scalar Components

A few of the top level components of the document are simple scalars ("version", "last_updated_on", "reporting_entity_name" and "reporting_entity_type"). I can access these fairly easily using head and grep.

Array Components

The interesting data is in the "in_network" and "provider_references" components. I decided to unpack those into separate files so that I could deal with each individually.

jq ".in_network" united-healthcare-rates.json >in-network.json
jq ".provider_references" united-healthcare-rates.json >provider-references.json

I now have two JSON files to work with. I extracted the first few records from one of them to show the next steps in the analysis.

For the purpose of this post I’m going to use R. You could equally use Python (or one of any other languages able to work with JSON).

library(tidyverse)
library(jsonlite)

PROVIDER_REFERENCES <- "provider-references-sample.json"
IN_NETWORK <- "in-network-sample.json.bz2"

Let’s take a look at the data.

[
  {
    "provider_groups": [
      {
        "npi": [
          1720734973
        ],
        "tin": {
          "type": "ein",
          "value": "870324719"
        }
      },
      {
        "npi": [
          1770627234
        ],
        "tin": {
          "type": "ein",
          "value": "870569774"
        }
      }
    ],
    "provider_group_id": 0
  },
  {
    "provider_groups": [
      {
        "npi": [
          1386945947,
          1588908388
        ],
        "tin": {
          "type": "ein",
          "value": "371705906"
        }
      }
    ],
    "provider_group_id": 1
  }
]

Now load a sample of the provider references data and do some rectangling to convert it into a tidy data frame.

unpack_group <- function(group) {
  with(
        group,
        tibble(
          provider_group_id,
          provider_groups
        )
      ) %>%
  unnest_wider(provider_groups) %>%
  unnest_wider(tin, names_sep = "_") %>%
  unnest_longer(npi) %>%
  select(provider_group_id, tin_type, tin_value, npi)
}

read_json(PROVIDER_REFERENCES) %>%
  map_dfr(unpack_group)

# A tibble: 4 × 4
  provider_group_id tin_type tin_value        npi
              <int> <chr>    <chr>          <int>
1                 0 ein      870324719 1720734973
2                 0 ein      870569774 1770627234
3                 1 ein      371705906 1386945947
4                 1 ein      371705906 1588908388

The structure of the in network data is a little more complicated but can be attacked using a similar approach. Here’s a sample of a subset of the fields.

# A tibble: 3 × 5
  arrangement code  code_type code_type_version name                            
  <chr>       <chr> <chr>     <chr>             <chr>                           
1 ffs         0001A CPT       2022              IMM ADMN SARSCOV2 30MCG/0.3ML D…
2 ffs         0001U CPT       2022              RBC DNA HEA 35 AG 11 BLD GRP WH…
3 ffs         0001  MS-DRG    2022              Heart Transplant or Implant of …

Streaming

Using the read_json() function to load the JSON data will not be practical with larger files. However, since we now have JSON arrays to deal with we can use a streaming approach.

First convert the data to NDJSON format.

jq -c '.[]' provider-references-sample.json >provider-references-sample-streaming.json

Now handle the document one line at a time using stream_in(). This completely avoids having to load the entire document. The code below could be a lot more efficient, but it’s just to illustrate the approach.

Take a look at the resulting CSV file.

cat "provider-references-sample.csv"

0,ein,870324719,1720734973
0,ein,870569774,1770627234
1,ein,371705906,1386945947
1,ein,371705906,1588908388

Conclusion

The key to cracking this problem was using the jq tool to break down the enormous JSON document into smaller parts. The resulting parts were still big, but had a different structure, which meant that I could use streaming techniques to process them.

To leave a comment for the author, please follow the link and comment on their blog: R - datawookie.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.