analyze the medicare current beneficiary survey (mcbs) with r

[This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers].

for over two decades now, researchers at cms have produced the definitive complex sample survey dataset of americans covered by medicare: the medicare current beneficiary survey (mcbs).  i bristle with righteous indignation when healthcare researchers tell me that medicare is boring because it’s pushing fifty.  yeah, listen close – in any nation, who gets sick the most?  older people and disabled people.  oh and who does medicare cover?  older people and disabled people.  your uncle leo might be a snore, but the dynamics of his government-provided health insurance are patently not.  in the world of american healthcare research, medicare is where the action is and mcbs is the richest tool for understanding that program.  here’s what it’s made of.

so why would this survey data with its measly fifteen thousand respondents be superior to the two-million-record chronic condition warehouse (ccw) or even the medicare public use files?  because those behemoths are just administrative claims, not substantive interviews with legit questionnaires.  and why does that matter?  well, as long as both are nationally-representative, i’d rather have a data set with ten thousand observations and one thousand variables (mcbs) than a data set with ten million observations and one hundred variables (ccw).  if the columns in your data are principally medical claims with a few basic identifiers, you’ll be stuck with cool-sounding but goofy-looking variables like ‘what is your race?’ as deduced by algorithm from the person’s last name.  in mcbs, they just ask everyone the actual question.  huzzah.

before you start licking your chops: these data are not (yet) publicly available or free, so you’ll need to submit a research application stating why you want the data and what you plan to use it for, sign some documents stating you’ll comply with privacy laws and then cough up about six hundred dollars to receive an encrypted cd via fedex.

now i just gotta say: i learned everything i know about the medicare current beneficiary survey from my prolific co-worker juliette cubanski.  creating a consolidated file was plainly her invention, and though i steered this ship away from the sas iceberg and into the tropical port that is the r language, she did most of the heavy thinking.  the syntax to create an easy peasy annual dataset – with all record identification code (ric) files bound together – would not exist had i not been able to draw on her expert data stewardship.

this new github repository contains four scripts:

  • scan through each of the mcbs cost and use files that you own, assuming you own some.
  • load each ric file directly into memory using our very own sascii package
  • consolidate everything into a one-record-per-person flat file, save each year as an r data file (.rda)
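
to make the consolidation step concrete, here's a minimal sketch and not the repository's actual code.  it assumes each ric file has already been read into a data.frame (the real scripts do that with the sascii package, skipped here) and that each one is already one-record-per-person.  the three toy tables, every column name other than `baseid`, and the output filename are all invented for illustration.

```r
# toy stand-ins for three ric files from a single year.
# every column except `baseid` is a made-up name.
ric_a <- data.frame( baseid = c( "001" , "002" , "003" ) , age = c( 71 , 66 , 84 ) , stringsAsFactors = FALSE )
ric_b <- data.frame( baseid = c( "001" , "002" , "003" ) , total_spending = c( 1200 , 0 , 30500 ) , stringsAsFactors = FALSE )
ric_x <- data.frame( baseid = c( "001" , "002" , "003" ) , weight = c( 2500 , 3100 , 1900 ) , stringsAsFactors = FALSE )

ric_list <- list( ric_a , ric_b , ric_x )

# assumption: each table holds at most one row per person.
# stop early if that does not hold, since merging would multiply rows
stopifnot( all( sapply( ric_list , function( df ) !any( duplicated( df$baseid ) ) ) ) )

# fold the list into one wide table, keeping every person seen in any file
consolidated <- Reduce( function( x , y ) merge( x , y , by = "baseid" , all = TRUE ) , ric_list )

# save the year as an r data file (.rda); the path is a hypothetical temp location
rda_path <- file.path( tempdir() , "mcbs_toy_year.rda" )
save( consolidated , file = rda_path )
```

any ric file that carries multiple records per person would need to be collapsed to one row per `baseid` before it could join this merge.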

analysis examples.R


multiyear variable crosswalk.R
  • cycle through all of the readme files of the mcbs cost and use files that you already possess
  • determine which variable names are available which years
  • aggregate all of this information into one delightful table that can be easily filtered, so you can quickly see which mcbs columns are trendable – and for how long.
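
the crosswalk idea can be sketched in a few lines of base r.  this is only an illustration: it assumes the variable names have already been pulled out of each year's readme file, and the years and variable names below (other than `baseid`) are invented.

```r
# hypothetical variable names by year, as if already extracted from the readme files
vars_by_year <- list(
    "2008" = c( "baseid" , "var_a" , "var_b" ) ,
    "2009" = c( "baseid" , "var_a" , "var_c" ) ,
    "2010" = c( "baseid" , "var_a" , "var_b" , "var_c" )
)

# long table: one row per variable-year pair
long <- data.frame(
    variable = unlist( vars_by_year , use.names = FALSE ) ,
    year = rep( names( vars_by_year ) , lengths( vars_by_year ) ) ,
    stringsAsFactors = FALSE
)

# wide crosswalk: variables down the side, years across the top,
# TRUE wherever the variable exists in that year
crosswalk <- unclass( table( long$variable , long$year ) ) > 0

# number of years in which each variable appears
years_available <- rowSums( crosswalk )

# variables that appear in every year, so they can be trended across the whole span
trendable <- names( years_available )[ years_available == length( vars_by_year ) ]

print( crosswalk )
print( trendable )
```

a matching name does not guarantee a matching definition, so any variable flagged as trendable still deserves a look at the documentation for each year.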

click here to view these four scripts

for more detail about the medicare current beneficiary survey (mcbs), visit:


although mcbs comes in two flavors – `access to care` and `cost and use` – these scripts only touch the latter.  the access to care data should be thought of as an early version of the cost and use files, so it’s incomplete in some important ways: it does not contain medical utilization or spending, and it excludes anyone without 365 days of coverage (anyone who either gained eligibility or died mid-year).  for any given year, the access to care component gets released about eighteen months earlier than the final cost and use version.  if you’re coveting slightly more recent data, you might find some utility in these files – just be forewarned that any population with a high death rate (like nursing home residents) will look a lot healthier in the `access to care` than they do in the final module.  if you’re tight on cash, buy the cost and use.  but don’t take my word for it, take resdac’s.
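
a toy illustration of that warning, with entirely made-up records and column names: dropping everyone who died mid-year shrinks the share of people in poor health.

```r
# five invented nursing home residents; none of these columns are real mcbs variables
residents <- data.frame(
    baseid = c( "001" , "002" , "003" , "004" , "005" ) ,
    died_midyear = c( FALSE , TRUE , FALSE , TRUE , FALSE ) ,
    poor_health = c( FALSE , TRUE , TRUE , TRUE , FALSE ) ,
    stringsAsFactors = FALSE
)

# share in poor health when everyone is counted
share_everyone <- mean( residents$poor_health )

# share in poor health after excluding anyone without a full year of coverage
full_year_only <- subset( residents , !died_midyear )
share_full_year <- mean( full_year_only$poor_health )

print( c( everyone = share_everyone , full_year_only = share_full_year ) )
```

the numbers themselves mean nothing; the point is only the direction of the gap between the two shares.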

while not well-publicized, this survey does track medicare beneficiaries over three full calendar years, allowing you to construct a neat little panel.  assuming you’re the proud owner of two or three consecutive single-year modules already, send ’em an e-mail requesting the longitudinal weights.  then, instead of using `ricx` (as seen in my scripts), use `ricx3` or `ricx4` – and merge all other year-specific ric files on `baseid`.  since it’s a rolling panel, longitudinal analyses necessitate a sample size hit of about one- or two-thirds for the two- and three-year panels, respectively.  oh.  and be sure to review this methodology document before you attempt anything with the multi-year weights.
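
here's a rough sketch of the merge-on-`baseid` step for a two-year panel.  it uses toy tables throughout: the spending columns, the `long_weight` column and the layout of the longitudinal weights table are all assumptions for illustration, not the real file structure.

```r
# toy single-year files; only `baseid` comes from mcbs, everything else is made up
year_one <- data.frame(
    baseid = c( "001" , "002" , "003" , "004" , "005" , "006" ) ,
    spending = c( 10 , 20 , 30 , 40 , 50 , 60 ) ,
    stringsAsFactors = FALSE
)

year_two <- data.frame(
    baseid = c( "003" , "004" , "005" , "006" , "007" , "008" ) ,
    spending = c( 35 , 45 , 55 , 65 , 75 , 85 ) ,
    stringsAsFactors = FALSE
)

# toy stand-in for a longitudinal weights table covering people present in both years
long_weights <- data.frame(
    baseid = c( "003" , "004" , "005" , "006" ) ,
    long_weight = c( 4000 , 3500 , 5000 , 4200 ) ,
    stringsAsFactors = FALSE
)

# inner merges keep only the people who appear in every table
panel <- merge( year_one , year_two , by = "baseid" , suffixes = c( "_y1" , "_y2" ) )
panel <- merge( panel , long_weights , by = "baseid" )

# share of the first year's sample that drops out of the two-year panel
share_lost <- 1 - nrow( panel ) / nrow( year_one )

print( panel )
print( share_lost )
```

the toy tables were built so that one-third of the first year falls out, mirroring the rolling-panel sample size hit described above.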

confidential to sas, spss, stata, sudaan users: minimize your netscape navigator and put down your crystal pepsi for a second because i have big news for you:  time to transition to r.  😀
