analyze the health and retirement study (hrs) with r

January 14, 2013

(This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers)

the hrs is the one and only longitudinal survey of american seniors.  with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992.  unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part.  the survey is paid for by the national institute on aging and administered by the university of michigan's institute for social research.  if you apply for an interviewer job with them, i hope you like werther's original.

figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website.  instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page.  if you start enjoying yourself, here's the whole book.  after that, it's time to register for access to the (free) data.  keep your username and password handy; you'll need them for the top of the download automation r script.  next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle.  but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel.  oh so beautiful.  the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you.

the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010").  my source for that 411? page 13 of the design doc.  wicked.

this new github repository contains five scripts:

1992 - 2010 download HRS microdata.R
  • loop through every year and every file, download, then unzip everything in one big party

import longitudinal RAND contributed files.R
  • create a SQLite database (.db) on the local disk
  • load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram)
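the chunked load described above can be sketched roughly like this.  it's a minimal illustration, not the repository's actual code: the toy csv, the table name `rand_toy`, and both column names are made up, and it assumes the DBI and RSQLite packages are installed.

```r
library(DBI)
library(RSQLite)

# stand-in for a large file: a tiny csv written to a temporary location
# (both column names are hypothetical)
csv_path <- tempfile(fileext = ".csv")
toy <- data.frame(person_id = 1:10, some_value = (1:10) * 1.5)
write.csv(toy, csv_path, row.names = FALSE)

# create a SQLite database file on the local disk
db_path <- tempfile(fileext = ".db")
db <- dbConnect(SQLite(), db_path)

# grab the column names once, then open the file and skip past the header
col_names <- names(read.csv(csv_path, nrows = 1))
con <- file(csv_path, "r")
invisible(readLines(con, n = 1))

chunk_size <- 4   # tiny here; a real import would use many thousands of rows

repeat {
  # an open connection is read from its current position,
  # so each pass picks up where the previous chunk ended
  chunk <- tryCatch(
    read.csv(con, nrows = chunk_size, header = FALSE, col.names = col_names),
    error = function(e) NULL   # read.csv errors once no lines remain
  )
  if (is.null(chunk) || nrow(chunk) == 0) break

  # append this chunk; the table gets created on the first pass
  dbWriteTable(db, "rand_toy", chunk, append = TRUE)
}

close(con)

rows_in_db <- dbGetQuery(db, "SELECT COUNT(*) AS n FROM rand_toy")$n
dbDisconnect(db)
```

only one chunk sits in ram at any moment, which is the whole point of loading in pieces.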

longitudinal RAND - analysis examples.R
  • connect to the sql database created by the 'import longitudinal RAND contributed files' program
  • create two database-backed complex sample survey objects, using a taylor-series linearization design
  • perform a mountain of analysis examples with wave weights from two different points in the panel
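here's a rough sketch of what two designs with weights from different waves look like, assuming the survey package is installed.  to stay self-contained it uses a small in-memory data frame rather than a database-backed design, and every column name (`stratum`, `psu`, `wt_early`, `wt_late`, and so on) is hypothetical rather than an actual rand hrs variable.  dropping the zero-weight rows from the later-wave design is one reasonable choice, not necessarily what the repository's script does.

```r
library(survey)

# toy stand-in for a one-record-per-respondent file (all names hypothetical).
# respondents 3 and 6 are in nursing homes by the later wave,
# so their later-wave weight is zero
toy <- data.frame(
  stratum           = c(1, 1, 1, 1, 2, 2, 2, 2),
  psu               = c(1, 1, 2, 2, 1, 1, 2, 2),
  wt_early          = c(10, 20, 10, 20, 10, 20, 10, 20),
  wt_late           = c(10, 20,  0, 20, 10,  0, 10, 20),
  nursing_home_late = c(0, 0, 1, 0, 0, 1, 0, 0),
  health_score      = c(3, 4, 2, 5, 1, 2, 4, 3)
)

# design that generalizes to the earlier-wave population.
# svydesign uses taylor-series linearization for variances by default
design_early <- svydesign(
  id = ~psu, strata = ~stratum, weights = ~wt_early,
  nest = TRUE, data = toy
)

# design that generalizes to the later-wave population:
# only respondents with a positive weight in that wave
design_late <- svydesign(
  id = ~psu, strata = ~stratum, weights = ~wt_late,
  nest = TRUE, data = subset(toy, wt_late > 0)
)

# "among everyone in the earlier-wave population,
#  what share lived in a nursing home by the later wave?"
share_in_nursing_home <- svymean(~nursing_home_late, design_early)

# a later-wave estimate, which excludes the zero-weight respondents
mean_score_late <- svymean(~health_score, design_late)
```

the first estimate only works with the earlier-wave weights: under the later-wave weights, the nursing home residents count for nothing.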

import example HRS file.R
  • load a fixed-width file directly into ram with SAScii, using only the sas importation script
  • parse through the IF block at the bottom of the sas importation script and blank out a number of variables
  • save the file as an R data file (.rda) for fast loading later
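the general shape of that workflow (fixed-width import, blank out some values, save as .rda) can be sketched with base r alone.  note the swap: this uses `read.fwf` with hand-typed column widths instead of SAScii and a sas importation script, the toy file layout is invented, and the "code 9 means blank" rule is purely an assumption for illustration.

```r
# write a tiny fixed-width file to a temporary location.
# invented layout: id in columns 1-3, age in 4-5, flag in 6
fwf_path <- tempfile(fileext = ".txt")
writeLines(c("001521", "002679", "003489"), fwf_path)

# read it straight into ram using explicit widths
# (SAScii would instead pull the widths out of a sas input block)
x <- read.fwf(
  fwf_path,
  widths = c(3, 2, 1),
  col.names = c("id", "age", "flag")
)

# blank out values under a condition, similar in spirit to an IF block.
# the rule "9 means missing" is a hypothetical example
x$flag[which(x$flag == 9)] <- NA

# save as an R data file for fast loading later
rda_path <- tempfile(fileext = ".rda")
save(x, file = rda_path)

# later on: clear the object and bring it back with a single load()
rm(x)
load(rda_path)
```

loading the .rda skips the parsing step entirely, which is why it's worth saving once the import has been done.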

replicate 2002 regression.R

click here to view these five scripts

for more detail about the health and retirement study (hrs), visit:


exemplary work making it this far.  as a reward, here's the detailed codebook for the main rand hrs file.  note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above.  if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself.

confidential to sas, spss, stata, and sudaan users: a tidal wave is coming.  you can get water up your nose and be dragged out to sea, or you can grab a surfboard.  time to transition to r.  :D
