PSID data set builder for R

[This article was first published on plausibel, and kindly contributed to R-bloggers].

Economists frequently use public datasets. One widely used dataset is the Panel Study of Income Dynamics (PSID), maintained by the Institute for Social Research at the University of Michigan.

Here I’m introducing psidR, a small helper package for R that makes constructing panels from the PSID a bit easier.

One potential difficulty with the PSID is constructing a longitudinal dataset, i.e. one where individuals are followed over several survey waves. There are several solutions.

  1. In the so-called data center, users can use drill-down menus to select relevant variables from each wave. If the user wants only recent waves, there is a subsetting mechanism (e.g. only household heads younger than 55). As the required dataset gets larger, this becomes unwieldy: the interface gets slower and slower, and the clicking procedure is rather error-prone. The main motivation for this package is that I’ve spent too many hours clicking on cryptic variable names, only to realize once I was done that I had forgotten a variable. Unacceptable.
  2. Users may download the data and attempt to merge the annual interview files to obtain the desired panel. Though conceptually not very difficult (there is an individual index file, which links individuals across years), it is a cumbersome accounting exercise to find the right variable names for each year and do the right merges.
  3. One can use psidR. The main function is inspired by the Stata add-on package psiduse. Here is the function’s signature.
build.panel(datadir, fam.vars, ind.vars=NULL, fam.files=NULL, ind.file=NULL, heads.only=TRUE, core=TRUE, design="balanced", verbose=FALSE)
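A call might look roughly like the sketch below. It assumes that fam.vars has one row per wave with a year column and one column per variable, and that NA marks a variable that is unavailable in a given wave. The ER codes are shown for illustration only and should be checked against the PSID codebook. The directory path and the wrapper name build_my_panel are hypothetical, and the argument names follow the signature quoted above, so an installed version of psidR may differ.

```r
# Assumed layout of fam.vars: one row per wave, one column per variable.
# The ER... codes are illustrative; confirm them in the PSID codebook.
fam.vars <- data.frame(
  year        = c(2001, 2003),
  age         = c("ER17013", "ER21017"),
  house.value = c(NA, "ER21043"),  # assumption: NA flags "missing in this wave"
  stringsAsFactors = FALSE
)

# Hypothetical wrapper: nothing is read from disk until it is called.
# Arguments mirror the signature quoted in this post.
build_my_panel <- function(datadir) {
  psidR::build.panel(
    datadir    = datadir,
    fam.vars   = fam.vars,
    heads.only = TRUE,
    core       = TRUE,
    design     = "balanced",
    verbose    = TRUE
  )
}

# Usage, once the PSID files sit in a (hypothetical) local directory:
# panel <- build_my_panel("~/datasets/PSID")
```

Wrapping the call in a function keeps the variable mapping and the build options in one place, so adding a forgotten variable means editing one row of fam.vars rather than clicking through the data center again.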



  • There is a default behaviour, where the user only points to a data directory. Otherwise, one can specify custom locations for the family files and the individual index file.
  • You can supply the PSID data as Stata files or as CSV files.
  • The user has to supply a data.frame “fam.vars” which lists the variable names for all required waves.
  • It’s possible to tell the function that a certain variable is missing in a given year (without the variable getting dropped, so you can impute it later on).
  • One can subset the data to household heads only.
  • There is a switch to get only the core sample.
  • There are 3 different sample designs to choose from: balanced panel (all individuals must be present in all waves), k-period panel (individuals must be present in at least k periods) and unbalanced (all included).
  • With “verbose=TRUE” the function prints comments as it goes along.
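The three sample designs can be illustrated on a toy long-format panel. This is a minimal sketch of the definitions in base R, not psidR’s internal code. The data and the helper ids_with_at_least are made up, and "k periods" is taken to mean any k waves, not necessarily consecutive ones.

```r
# Toy long-format panel: person id and the wave in which they were observed.
toy <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3),
  year = c(2001, 2003, 2005, 2001, 2003, 2005)
)

n_waves <- length(unique(toy$year))  # total number of waves in the toy data
n_obs   <- table(toy$id)             # number of waves observed per person

# Hypothetical helper: ids observed in at least k waves.
ids_with_at_least <- function(k) as.numeric(names(n_obs)[n_obs >= k])

balanced   <- toy[toy$id %in% ids_with_at_least(n_waves), ]  # present in all waves
k_period   <- toy[toy$id %in% ids_with_at_least(2), ]        # present in >= 2 waves
unbalanced <- toy                                            # everyone is kept
```

Here person 1 appears in all three waves, person 2 in two and person 3 in one, so the balanced design keeps only person 1, the 2-period design keeps persons 1 and 2, and the unbalanced design keeps all three.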
Memory could be an issue, because the dataset is quite big. I use data.table to keep things manageable, but it’s hard to get around a data.table of 628MB, which is the size of the individual index file. The verbose option prints the memory load at various points, so you may be able to intervene and throw out some things if you hit a limit.
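If memory gets tight, the footprint of an object can also be checked by hand with base R. The sketch below uses a small stand-in table (the names big and small are made up) and is not what the verbose option does internally.

```r
# Stand-in for a large table; the real individual index is far larger.
big <- data.frame(id = seq_len(1e5), x = rnorm(1e5))

# Report the object's footprint in megabytes.
print(object.size(big), units = "MB")

# Keep only the columns needed, drop the original, and let R reclaim memory.
small <- big[, "id", drop = FALSE]
rm(big)
invisible(gc())
```

Dropping unneeded columns as early as possible is usually the cheapest way to stay under a memory limit.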
