Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Fathom Data has been doing a lot of work with the HCRIS (Healthcare Cost Report Information System) data. The underlying reports are submitted as a spreadsheet with multiple sheets. The data are then extracted and recorded in a simple tabular format, with each field linked to a worksheet code (wksht_cd), column number (clmn_num) and line number (clmn_num). These three keys are then mapped to a single compound key. The resulting data look something like this:

       key      value
---------------------
a_c1_30    8960498
a_c2_30    2559809
a_c6_30   -1581948
a_c1_200   45448537
a_c2_200  100473288
a_c6_200  -23947062
a7_1_c1_1    2696198
a7_1_c1_6   87655935
a7_1_c2_6    4190056
a7_1_c1_8  168428817
a7_1_c2_8    4391574
a7_1_c1_10  168428817
a7_1_c2_10    4391574

Of course, for these data to make any sense at all it’s necessary to map backwards from the compound key to the location within the original spreadsheet. And once you know where the data belong, you’ll also want to know what the data mean.

Fortunately this relationship is documented and available as a PDF file.

This is useful, but for analytical purposes a PDF is not ideal. We used the {pdftools} package to scrape the text from the PDF and, after a significant bit of wrangling ended up with a nice clean CSV file.

head(hcris_fields)
# A tibble: 6 × 6
key        type    wksht_cd clmn_num line_num label
<chr>      <chr>   <chr>    <chr>    <chr>    <chr>
1 a7_1_c1_1  NUMERIC A700001  00100    00100    Land
2 a7_1_c1_10 NUMERIC A700001  00100    01000    Total
3 a7_1_c1_2  NUMERIC A700001  00100    00200    Land Improvements
4 a7_1_c1_3  NUMERIC A700001  00100    00300    Buildings and Fixtures
5 a7_1_c1_4  NUMERIC A700001  00100    00400    Building Improvements
6 a7_1_c1_5  NUMERIC A700001  00100    00500    Fixed Equipment       

## Resources

Here are the artefacts of this analysis: