analyze the national plan and provider enumeration system (nppes) with r and monetdb

August 12, 2013

(This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers)

the national plan and provider enumeration system (nppes) contains information about every provider, insurance plan, and clearinghouse actively operating in the united states healthcare industry.  did i just see the ears of all the health workforce researchers in the room perk up?  it’s freely downloadable, courtesy of the department of health and human services’ implementation of the health insurance portability and accountability act of 1996 (hipaa).  in short: the law requires most everyone who ever sends a medical bill to apply for and acquire a unique national provider identifier.  hipaa, hipaa, hooray!

the main dissemination file is a monster – currently weighing in at a beefy four hundred megs.  the structure is one record per covered entity, so don’t mindlessly run these scripts and think you’ve got one record per doctor.  even after you’ve eliminated all of the organizations like hospitals and community health centers, you’ll still be left with nurse practitioners, respiratory therapists, and other non-physician medical providers.  if you prefer to learn about the contents of your data sets amidst a jungle of stock photos, read the cms npi handbook.  this microdata seems most useful to tally or geocode medical specialists by geographic area, but perhaps you can conjure up cool new uses.  so long as you’ve followed my monetdb setup instructions, the code below will grab the latest file and import everything seamlessly.  after that, you can follow my examples and pull specific columns directly into active memory – as always, doing so will not overload a computer with at least four gigabytes of ram.  from there, maybe follow my merge example or construct your own with some missouri census data center (mcdc) geographic files.  whatever you do, promise me you’ll do it well.  this new github repository contains three scripts:

download and import.R

merge taxonomy ids.R

  • construct a multi-level table of medical provider taxonomy ids
  • initiate the same monetdb server instance, using the same well-documented block of code as above
  • pull a subset of columns – a skinny file – directly into working memory
  • merge the nppes with taxonomy id codes, then run a quick crosstab or two

replicate cms state counts.R

click here to view these three scripts
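
the merge-and-crosstab step from the second script can be sketched in base r along these lines.  this is only a toy illustration, not the code from the repository: the two data frames, every column name (`npi`, `entity_type`, `state`, `taxonomy_id`, `classification`), and all of the values are made up for the example, and the assumption that entity type 1 flags individuals and 2 flags organizations should be checked against the official file layout before relying on it.

```r
# toy stand-in for a skinny nppes extract already pulled into memory.
# every column name and value here is hypothetical.
nppes <-
    data.frame(
        npi = c( "1000000001" , "1000000002" , "1000000003" , "1000000004" , "1000000005" ) ,
        entity_type = c( 1 , 1 , 2 , 1 , 1 ) ,   # assumed coding: 1 = individual, 2 = organization
        state = c( "MO" , "MO" , "MO" , "KS" , "KS" ) ,
        taxonomy_id = c( "A01" , "B01" , "C01" , "A01" , "A01" ) ,
        stringsAsFactors = FALSE
    )

# toy stand-in for the taxonomy lookup table (made-up ids and labels)
taxonomy <-
    data.frame(
        taxonomy_id = c( "A01" , "B01" , "C01" ) ,
        classification = c( "family medicine" , "nurse practitioner" , "general acute care hospital" ) ,
        stringsAsFactors = FALSE
    )

# the file is one record per covered entity, so throw out organizations first
individuals <- subset( nppes , entity_type == 1 )

# left join: keep every individual, even one whose taxonomy id has no match
merged <- merge( individuals , taxonomy , by = "taxonomy_id" , all.x = TRUE )

# a one-to-one lookup should never add or drop records
stopifnot( nrow( merged ) == nrow( individuals ) )

# quick crosstab: providers by state and classification
state_by_class <- table( merged$state , merged$classification )

print( state_by_class )
```

even after the organizations are dropped, the nurse practitioner record survives, which is exactly why a tally of individuals is not a tally of doctors.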

for more detail about the national plan and provider enumeration system, visit:


if your analysis won’t be compromised by using county-level instead of provider-level data, also consider hrsa’s area resource file (arf).  the health workforce statistics in the arf come from the american medical association’s physician masterfile.  unfortunately, the ama masterfile is not publicly available, so if your budget is zero dollars, your choices are the nppes (less detail at the individual level) and the arf (more detail, but aggregated to the county level).  here’s what the director of the national center for health workforce analysis told me:

We use the AMA Masterfile for the ARF (and most of our studies involving physicians) because it has extensive data on each physician, including demographic, education/training and practice information. While there are a number of shortcomings of the MF, it is one of the best sources of data available nationally. We use the NPI data for some professions where we don’t have a solid source of national data. While all practitioners who bill Medicare, Medicaid or private insurers should have an NPI, in the case of physicians, the NPI does not have the same depth of data as the AMA MF.

If you have not seen it, I recommend our new Compendium of Federal Data Sources to Support Health Workforce Analysis on our web site that describes 19 sources of data that can be used for health workforce analysis. It describes the data source, how it can be used and accessed and guidance on potential use.

one more trade-off: the nppes is never more than a month old.

confidential to sas, spss, stata, and sudaan users: you are working with the larry, curly, moe, and shemp of statistical languages. time to transition to r.  😀
