analyze the survey of consumer finances (scf) with r

[This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers].

the survey of consumer finances (scf) tracks the wealth of american families.  every three years, more than five thousand households answer a battery of questions about income, net worth, credit card debt, pensions, mortgages, even the lease on their cars.  plenty of surveys collect annual income, but only the survey of consumer finances captures such detailed asset data.  responses are at the primary economic unit-level (peu) – the economically dominant, financially interdependent family members within a sampled household.  norc at the university of chicago administers the data collection, but the board of governors of the federal reserve pays the bills and therefore calls the shots.

if you were so brazen as to open up the microdata and run a simple weighted median, you’d get the wrong answer.  the five to six thousand respondents actually gobble up twenty-five to thirty thousand records in the final public use files.  why oh why?  well, those tables contain not one, not two, but five records for each peu.  wherever missing, these data are multiply-imputed, meaning answers to the same question for the same household might vary across implicates.  each analysis must account for all that, lest your confidence intervals be too tight.  to calculate the correct statistics, you’ll need to break the single file into five, necessarily complicating your life.  this can be accomplished with the `meanit` sas macro buried in the 2004 scf codebook (search for `meanit` – you’ll need the sas iml add-on).  or you might blow the dust off this website referred to in the 2010 codebook as the home of an alternative multiple imputation technique, but all i found were broken links.  perhaps it’s time for plan c, and by c, i mean free.  read the imputation section of the latest codebook (search for `imputation`), then give these scripts a whirl.  they’ve got that new r smell.
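for a sense of what accounting for the implicates looks like, here's a minimal sketch of the generic textbook recipe for combining multiply-imputed results (rubin's rules).  it is not the `meanit` macro and it is not the code in the repository.  the function name `combine_implicates` and the numbers fed into it are made up for illustration, and the scf codebook describes its own variance recipe that you should follow for real work.

```r
# combine one statistic that was computed separately on each implicate.
#   estimates: one point estimate per implicate
#   variances: the sampling variance of each of those estimates
# (hypothetical helper implementing generic rubin's rules)
combine_implicates <- function(estimates, variances) {
  m <- length(estimates)
  stopifnot(m > 1, length(variances) == m)

  point   <- mean(estimates)   # final point estimate: average across implicates
  within  <- mean(variances)   # average sampling variance
  between <- var(estimates)    # how much the implicates disagree with each other

  # total variance = within + (1 + 1/m) * between
  total <- within + (1 + 1 / m) * between

  list(
    estimate = point,
    within = within,
    between = between,
    total = total,
    se = sqrt(total)
  )
}

# made-up inputs, one per implicate, purely for illustration
demo_estimates <- c(10, 12, 11, 13, 9)
demo_variances <- rep(4, 5)

result <- combine_implicates(demo_estimates, demo_variances)
print(result$estimate)
print(result$se)
```

the `between` piece is the part a naive single-file analysis throws away, which is why its confidence intervals come out too tight.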

the lion’s share of the respondents in the survey of consumer finances get drawn from a pretty standard sample of american dwellings – no nursing homes, no active-duty military.  then there’s this secondary sample of richer households to even out the statistical noise at the higher end of the income and assets spectrum.  you can read more if you like, but at the end of the day the weights just generalize to civilian, non-institutional american households.  one last thing before you start your engine: read everything you always wanted to know about the scf.  my favorite part of that title is the word always.

this new github repository contains three scripts:

1989-2010 download all microdata.R
  • initiate a function to download and import any survey of consumer finances zipped stata file (.dta)
  • loop through each year specified by the user (starting at the 1989 re-vamp) to download the main, extract, and replicate weight files, then import each into r
  • break the main file into five implicates (each containing one record per peu) and merge the appropriate extract data onto each implicate
  • save the five implicates and replicate weights to an r data file (.rda) for rapid future loading
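the split-merge-save steps above can be sketched on toy data.  this is not the repository's script: every data frame and column ending in `_demo` is invented, and the block assumes the implicate number is the last digit of `y1` (that is, `y1 = yy1 * 10 + implicate`), which you should confirm against the codebook for your survey year.

```r
# toy stand-in for the main file: 3 households x 5 implicates = 15 records.
# assumption: y1 = yy1 * 10 + implicate number (check the codebook)
main_demo <- data.frame(
  yy1 = rep(1:3, each = 5),
  y1 = rep(1:3, each = 5) * 10 + rep(1:5, times = 3),
  answer_demo = seq_len(15)
)

# toy stand-in for the extract file, keyed on the same y1
extract_demo <- data.frame(
  y1 = main_demo$y1,
  networth_demo = seq(100, by = 10, length.out = 15)
)

# under the assumption above, the last digit of y1 is the implicate number
main_demo$implicate <- main_demo$y1 %% 10

# break the single file into five, then merge the extract onto each piece
implicates <- split(main_demo, main_demo$implicate)
implicates <- lapply(
  implicates,
  function(imp) merge(imp, extract_demo, by = "y1")
)

# each implicate should now hold exactly one record per household
stopifnot(all(vapply(
  implicates,
  function(imp) !anyDuplicated(imp$yy1),
  logical(1)
)))

# save to an r data file for rapid future loading
rda_path <- file.path(tempdir(), "scf_demo.rda")
save(implicates, file = rda_path)
```

a later session can bring the list back with `load(rda_path)` and run the same analysis on each of the five elements.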

2010 analysis examples.R

replicate FRB SAS output.R

click here to view these three scripts

for more detail about the survey of consumer finances (scf), visit:


nationally-representative statistics on the financial health, wealth, and assets of american households might not be monopolized by the survey of consumer finances, but there isn’t much competition aside from the assets topical module of the survey of income and program participation (sipp).  on one hand, the scf interview questions contain more detail than sipp.  on the other hand, scf’s smaller sample precludes analyses of narrowly-defined subpopulations.  and for any three-handed martians in the audience, there are also a few biases between these two data sources that you ought to consider.

the survey methodologists at the federal reserve take their job seriously, as evidenced by this working paper trail.  write a thank-you in their guestbook.  one can never receive enough of those.

confidential to sas, spss, stata, and sudaan users: the eighties called.  they want their statistical languages back.  time to transition to r.  😀
