the initial release of the 2008 bsapufs was accompanied by some major fanfare in the world of health policy, a big win for government transparency. unfortunately, the final files that cleared the confidentiality hurdles are heavily de-identified and obfuscated. prime examples:
- none of the files can be linked to any other file. not across years, not across expenditure categories
- costs are rounded to the nearest five or ten dollars at lower values, nearest thousand at higher values
- ages are categorized into five year bands
cms released free public data sets that could only be analyzed with a software package costing thousands of dollars. so even though the actual data sets were free, researchers still needed deep pockets to buy sas. meanwhile, the unsquelched and therefore superior data sets are also available for many thousands of dollars. researchers with funding would (reasonably) just buy the better data. researchers without any financial resources - the target audience of free, public data - were left out in the cold. no wonder these bsapufs haven't been used much.
that ends now. using r, monetdb, and the personal computer you already own (mine cost $700 in 2009), researchers can, for the first time, seriously analyze these medicare public use files without spending another dime. woah. plus hey guess what all you researcher fat-cats with your federal grant streams and your proprietary software licenses: r + monetdb runs one heckuva lot faster than sas. woah^2. dump your sas license water wings and learn how to swim.
the scripts below require monetdb. click here for step-by-step instructions on how to install it on windows and click here for speed tests. vroom.
since the bsapufs comprise 5% of the medicare population, ya generally need to multiply any counts or sums by twenty. although the individuals represented in these claims are randomly sampled, this data should not be treated like a complex survey sample, meaning that the creation of a survey object is unnecessary. most bsapufs generalize to either the total or fee-for-service medicare population, but each file is different so give the documentation a hard stare before that eureka moment. this new github repository contains three scripts:
2008 - download all csv files.R
- loop through and download every zip file hosted by cms
- unzip the contents of each zipped file to the working directory
2008 - import all csv files into monetdb.R
- create the batch (.bat) file needed to initiate the monet database in the future
- loop through each csv file in the current working directory and import them into the monet database
- create a well-documented block of code to re-initiate the monetdb server in the future
2008 - replicate cms publications.R
- initiate the same monetdb server instance, using the same well-documented block of code as above
- replicate nine sets of statistics found in data tables provided by cms
click here to view these three scripts
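to give a rough feel for what the download step does, here's a minimal sketch in base r. it is not the code from the repository: the function name `download_and_unzip` and its arguments are made up for illustration, and you'd have to supply the actual zip file addresses from the cms website yourself.

```r
# hypothetical helper: download each zip file, then unzip it into dest_dir.
# zip_urls is a character vector of addresses that you provide.
download_and_unzip <- function(zip_urls, dest_dir = getwd()) {
  extracted <- character(0)
  for (u in zip_urls) {
    tf <- tempfile(fileext = ".zip")
    # mode = "wb" keeps binary files intact on windows
    download.file(u, tf, mode = "wb")
    # unzip() returns the paths of the files it extracted
    extracted <- c(extracted, unzip(tf, exdir = dest_dir))
    file.remove(tf)
  }
  extracted
}

# usage (the address is a placeholder, not a real link):
# csv_files <- download_and_unzip("<zip file address copied from the cms bsapuf page>")
```

the function returns the paths of everything it unzipped, which is handy if a later step needs to loop over the csv files.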
for more detail about the basic stand alone medicare claims public use files (bsapufs), visit:
- the centers for medicare and medicaid services' bsapuf homepage
- a joint academyhealth webinar given by the organizations that partnered to create these files - cms, impaq, norc
the replication script has oodles of easily-modified syntax and should be viewed for analysis examples. if you know the name of the data table you want to examine, you can quickly modify these general monetdb analysis examples too. just run sql queries - sas users, that's "proc sql;" for you. never used sql? start fresh with this tutorial. once you know the sql command you want to run on the data, you're almost done. for operations that make changes to the data tables, use dbSendUpdate(). for operations that only read the data tables, use dbGetQuery().
don't ever use dbReadTable() on the outpatient, carrier, dme, or prescription drug event tables - they'll likely cause r to crash.
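here's a small sketch of how those pieces might fit together. it only builds the sql strings, so it runs without a database. the table names (`inpatient08`, `carrier08`) and the column name (`bene_sex_ident_cd`) are assumptions for illustration, so check them against your own monetdb tables and the cms documentation. the `* 20` applies the multiply-by-twenty rule for a 5% sample.

```r
# hypothetical helper: a grouped count, scaled up from the 5% sample
scaled_count_sql <- function(tablename, by_column, multiplier = 20) {
  paste0(
    "select ", by_column, ", count(*) * ", multiplier, " as weighted_n ",
    "from ", tablename, " group by ", by_column, " order by ", by_column
  )
}

# hypothetical helper: peek at a few rows instead of reading a whole table
preview_sql <- function(tablename, n = 10) {
  paste0("select * from ", tablename, " limit ", n)
}

query <- scaled_count_sql("inpatient08", "bene_sex_ident_cd")
print(query)

# with an open connection `db` to the monetdb server (not created here):
# dbGetQuery(db, query)                     # read-only, returns a data.frame
# dbGetQuery(db, preview_sql("carrier08"))  # ten rows, not the entire table
```

the `limit` query is the safer stand-in for dbReadTable() on the big tables, since only a handful of rows ever come back into r.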
if you need the more advanced statistical functions described on the sqlsurvey homepage but not available in monetdb's flavor of sql, you could potentially create a taylor-series sqlsurvey() object with a weight column full of twenties and a strata+psu column with all ones. the statistics should be correct, but if the columns in your analysis include any missing data, the variances might be wider (so more conservative) than those computed with monetdb's stddev() function.
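the column setup described above could be sketched like this. the helper only generates the sql statements, it does not build the sqlsurvey object itself. the table name `inpatient08` and the column names `wt` and `one` are made up for illustration.

```r
# hypothetical helper: sql to add a constant weight column (all twenties)
# and a constant strata+psu column (all ones) to a table
survey_column_sql <- function(tablename, weight = 20) {
  c(
    paste0("alter table ", tablename, " add column wt int"),
    paste0("update ", tablename, " set wt = ", weight),
    paste0("alter table ", tablename, " add column one int"),
    paste0("update ", tablename, " set one = 1")
  )
}

statements <- survey_column_sql("inpatient08")
print(statements)

# these change the table, so with an open connection `db` (not created here)
# each one would go through dbSendUpdate():
# for (s in statements) dbSendUpdate(db, s)
```

once both columns exist, they can serve as the weight and the strata+psu columns when you create the survey object.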
confidential to sas, spss, stata, and sudaan users: why are you using software that's twenty years shy of medicare eligibility itself? time to transition to r. :D