analyze the basic stand alone medicare claims public use files (bsapufs) with r and monetdb

[This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

the centers for medicare and medicaid services (cms) took the plunge.  the famous medicare 5% sample has been released to the public, free of charge.  jfyi – medicare is the u.s. government program that provides health insurance to 50 million elderly and disabled americans.  the basic stand alone medicare claims public use files (bsapufs) contain either person- or event-level data on inpatient stays, durable medical equipment purchases, prescription drug fills, hospice users, doctor visits, home health provision, outpatient hospital procedures, skilled nursing facility short-term residents, as well as aggregated statistics for medicare beneficiaries with chronic conditions and medicare beneficiaries living in nursing homes.  oh sorry, there’s one catch: they only provide sas scripts to analyze everything.  cue the villian music.  that bored old game of monopoly ends today.

the initial release of the 2008 bsapufs was accompanied by some major fanfare in the world of health policy, a big win for government transparency.  unfortunately, the final files that cleared the confidentiality hurdles are heavily de-identified and obfuscated.  prime examples:
  • none of the files can be linked to any other file.  not across years, not across expenditure categories
  • costs are rounded to the nearest fifth or tenth dollar at lower values, nearest thousandth at higher values
  • ages are categorized into five year bands
so these files are baldly inferior to the unsquelched, linkable data only available through an expensive formal application process.  any researcher with a budget flush enough to afford a sas license (the only statistical software mentioned in the cms official documentation) can probably also cough up the money to buy the identifiable data through resdac (resdac, btw, rocks).


cms released free public data sets that could only be analyzed with a software package costing thousands of dollars.  so even though the actual data sets were free, researchers still needed deep pockets to buy sas.  meanwhile, the unsquelched and therefore superior data sets are also available for many thousands of dollars.  researchers with funding would (reasonably) just buy the better data.  researchers without any financial resources – the target audience of free, public data – were left out in the cold.  no wonder these bsapufs haven’t been used much.

that ends now.  using r, monetdb, and the personal computer you already own (mine cost $700 in 2009), researchers can, for the first time, seriously analyze these medicare public use files without spending another dime.  woah.  plus hey guess what all you researcher fat-cats with your federal grant streams and your proprietary software licenses: r + monetdb runs one heckuva lot faster than sas.  woah^2.  dump your sas license water wings and learn how to swim.  the scripts below require monetdb.  click here for step-by-step instructions of how to install it on windows and click here for speed tests.  vroom.

since the bsapufs comprise 5% of the medicare population, ya generally need to multiply any counts or sums by twenty.  although the individuals represented in these claims are randomly sampled, this data should not be treated like a complex survey sample, meaning that the creation of a survey object is unnecessary.  most bsapufs generalize to either the total or fee-for-service medicare population, but each file is different so give the documentation a hard stare before that eureka moment.  this new github repository contains three scripts:

2008 – download all csv files.R
  • loop through and download every zip file hosted by cms
  • unzip the contents of each zipped file to the working directory

2008 – import all csv files into monetdb.R
  • create the batch (.bat) file needed to initiate the monet database in the future
  • loop through each csv file in the current working directory and import them into the monet database
  • create a well-documented block of code to re-initiate the monetdb server in the future

2008 – replicate cms publications.R
  • initiate the same monetdb server instance, unsing the same well-documented block of code as above
  • replicate nine sets of statistics found in data tables provided by cms

click here to view these three scripts

for more detail about the basic stand alone medicare claims public use files (bsapufs), visit:


the replication script has oodles of easily-modified syntax and should be viewed for analysis examples.  if you know the name of the data table you want to examine, you can quickly modify these general monetdb analysis examples too.  just run sql queries – sas users, that’s “proc sql;” for you.  never used sql? start fresh with this tutorial.  once you know the sql command you want to run on the data, you’re almost done.  for operations that make changes to the data tables, use dbSendUpdate().  for operations that only read the data tables, use dbGetQuery().

don’t ever use dbReadTable() on the outpatient, carrier, dme, or prescription drug event tables – they’ll likely cause r to crash.

if you need the more advanced statistical functions described on the sqlsurvey homepage but not available in monetdb’s flavor of sql, you could potentially create a taylor-series sqlsurvey() object with a weight column full of twenties and a strata+psu column with all ones.  the statistics should be correct, but if the columns in your analysis include any missing data, the variances might be wider (so more conservative) than those computed with monetdb’s stddev() function.

confidential to sas, spss, stata, and sudaan users: why are you using software that’s twenty years shy of medicare eligibility itself?  time to transition to r.  😀

To leave a comment for the author, please follow the link and comment on their blog: asdfree by anthony damico. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)