analyze the survey of income and program participation (sipp) with r

February 4, 2013

(This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers)

if the census bureau's budget were gutted and only one complex sample survey survived, pray it's the survey of income and program participation (sipp).  it's giant.  it's rich with variables.  it's monthly.  it follows households over three, four, now five year panels.  the congressional budget office uses it for their health insurance simulation.  analysts read that sipp has person-month files, get scurred, and retreat to inferior options.  the american community survey may be the mount everest of survey data, but sipp is most certainly the amazon.  questions swing wild and free through the jungle canopy i mean core data dictionary.  legend has it that there are still species of topical module variables that scientists like you have yet to analyze.  ponce de león would've loved it here.  ponce.  what a name.  what a guy.

the sipp 2008 panel data started from a sample of 105,663 individuals in 42,030 households.  once the sample gets drawn, the census bureau surveys one-fourth of the respondents every four months, over four or five years (panel durations vary).  you absolutely must read and understand pdf pages 3, 4, and 5 of this document before starting any analysis (start at the header 'waves and rotation groups').  if you don't comprehend what's going on, try their survey design tutorial.

since sipp collects information from respondents regarding every month over the duration of the panel, you'll need to be hyper-aware of whether you want your results to be point-in-time, annualized, or specific to some other period.  the analysis scripts below provide examples of each.  at every four-month interview point, every respondent answers every core question for the previous four months.  after that, wave-specific addenda (called topical modules) get asked, but generally only regarding a single prior month.  to repeat: core wave files contain four records per person, topical modules contain one.  if you stacked every core wave, you would have one record per person per month for the duration of the panel.  mmmassive.  ~100,000 respondents x 12 months x ~4 years.  have an analysis plan before you start writing code so you extract exactly what you need, nothing more.  better yet, modify something of mine.  cool?  this new github repository contains eight, you read me, eight scripts:
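that four-records-per-person core wave structure can be sketched in a few lines of base r.  the ids below are made up, purely to show the shape of a stacked person-month file:

```r
# toy sketch of sipp's person-month layout: each core wave carries
# four reference months ( srefmon 1-4 ) for every respondent.
# ssuid values here are made up for illustration.
core_wave <- expand.grid( ssuid = c( "A1" , "A2" ) , srefmon = 1:4 )

# stacking several waves yields one record per person per month
panel <- do.call( rbind , lapply( 1:3 , function( w ) transform( core_wave , swave = w ) ) )

nrow( panel )    # 2 people x 4 months x 3 waves = 24 person-month records
```

now scale that 24 up to ~100,000 respondents and ~48 months and you see why an analysis plan matters.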

1996 panel - download and create database.R
2001 panel - download and create database.R
2004 panel - download and create database.R
2008 panel - download and create database.R

2008 panel - full year analysis examples.R
  • define which waves and specific variables to pull into ram, based on the year chosen
  • loop through each of twelve months, constructing a single-year temporary table inside the database
  • read that twelve-month file into working memory, then save it for faster loading later if you like
  • read the main and replicate weights columns into working memory too, merge everything
  • construct a few annualized and demographic columns using all twelve months' worth of information
  • construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half, again save it for faster loading later, only if you're so inclined
  • reproduce census-published statistics, not precisely (due to topcoding described here on pdf page 19)
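the design-construction bullet above looks roughly like this sketch.  it assumes the survey package, and the column names ( wpfinwgt , plus 108 replicate weights ) mirror sipp conventions, but the records are simulated stand-ins, not real sipp data:

```r
library( survey )

# simulated stand-ins for a sipp extract: a main weight plus 108 replicate weights
set.seed( 42 )
n <- 200
x <- data.frame( thtotinc = rlnorm( n , 8 ) , wpfinwgt = runif( n , 500 , 1500 ) )
rw <- matrix( runif( n * 108 , 500 , 1500 ) , ncol = 108 )

# replicate-weighted complex sample design with a fay's adjustment factor of one-half
des <-
	svrepdesign(
		data = x ,
		weights = ~wpfinwgt ,
		repweights = rw ,
		type = "Fay" ,
		rho = 0.5
	)

# weighted mean of ( simulated ) total income, with replicate-based standard error
svymean( ~thtotinc , des )
```

once built, the design object can be saved with save() and re-loaded far faster than re-constructing it.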

2008 panel - point-in-time analysis examples.R
  • define which wave(s) and specific variables to pull into ram, based on the calendar month chosen
  • read that interview point (srefmon)- or calendar month (rhcalmn)-based file into working memory
  • read the topical module and replicate weights files into working memory too, merge it like you mean it
  • construct a few new, exciting variables using both core and topical module questions
  • construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half
  • reproduce census-published statistics, not exactly cuz the authors of this brief used the generalized variance formula (gvf) to calculate the margin of error - see pdf page 4 for more detail - the friendly statisticians at census recommend using the replicate weights whenever possible.  oh hayy, now it is.
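a point-in-time pull can be sketched against an in-memory sqlite database.  the real scripts query the .db file on disk, and the table and column names below ( w6 , tptotinc ) are illustrative, not guaranteed matches for the actual 2008 panel tables:

```r
library( DBI )
library( RSQLite )

db <- dbConnect( SQLite() , ":memory:" )    # the real scripts point at the on-disk .db

# toy core-wave table: two people, four reference months each
dbWriteTable(
	db , "w6" ,
	data.frame(
		ssuid = rep( c( "A1" , "A2" ) , each = 4 ) ,
		srefmon = rep( 1:4 , 2 ) ,
		rhcalmn = rep( 9:12 , 2 ) ,
		rhcalyr = 2009 ,
		tptotinc = c( 3000 , 3100 , 2900 , 3050 , 1800 , 1750 , 1900 , 1850 )
	)
)

# interview-point extract: keep only the fourth reference month of the wave
pit <- dbGetQuery( db , "SELECT ssuid , tptotinc FROM w6 WHERE srefmon = 4" )
nrow( pit )    # one record per person at the interview point

# calendar-month extract instead: a specific month and year
dec <- dbGetQuery( db , "SELECT ssuid , tptotinc FROM w6 WHERE rhcalmn = 12 AND rhcalyr = 2009" )

dbDisconnect( db )
```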

2008 panel - median value of household assets.R
  • define which wave(s) and specific variables to pull into ram, based on the topical module chosen
  • read the topical module and replicate weights files into working memory too, merge once again
  • construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half
  • reproduce census-published statistics, not exactly due to topcoding (read more about topcoding by searching this and that user guide for, well, `topcoding`).  huh.  so topcoding affects asset statistics.
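the median itself comes out of svyquantile on the replicate design.  again a sketch with simulated records, where `thhtnw` stands in for a household net worth variable:

```r
library( survey )

# simulated stand-ins, same shape as a sipp topical module extract
set.seed( 1 )
n <- 200
x <- data.frame( thhtnw = rlnorm( n , 11 ) , wpfinwgt = runif( n , 500 , 1500 ) )
rw <- matrix( runif( n * 108 , 500 , 1500 ) , ncol = 108 )

des <- svrepdesign( data = x , weights = ~wpfinwgt , repweights = rw , type = "Fay" , rho = 0.5 )

# replicate-weighted median of the ( simulated ) asset variable
med <- svyquantile( ~thhtnw , des , quantiles = 0.5 )
med
```

topcoding clips the upper tail before any of this runs, which is why medians match published tables far more closely than means do.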

replicate census poverty statistics.R


click here to view these eight scripts



for more detail about the survey of income and program participation (sipp), visit:


notes:

sipp is right in the middle of an ultra-long panel; these scripts will update as new files are released.

don't let the deprecated-looking homepage dissuade you.  the survey of income and program participation is happening now, red-hot.  everything you need is available, albeit somewhat hidden.  there's a short introduction, the data release schedule, an official ftp site - with codebooks - advanced user notes, census publications based on sipp - don't miss the table packages - aww cool even questionnaires.  the core variable codebook might not win any beauty pageants, but it'd be a wise use of time to slowly scroll through the first fifty variables.  interview months take place after `srefmon == 4` and actual times of month and year can be determined with the `rhcalmn` + `rhcalyr` variables.
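here's a tiny base r sketch of those timing variables, with illustrative values: srefmon indexes the reference month within a wave, while rhcalmn plus rhcalyr locate it on the calendar:

```r
# one person's four reference months from a single wave ( values made up )
rec <- data.frame( srefmon = 1:4 , rhcalmn = 9:12 , rhcalyr = 2009 )

# pin each reference month to an actual calendar date
rec$calmonth <- as.Date( paste( rec$rhcalyr , rec$rhcalmn , 1 , sep = "-" ) )

# the final reference month, the one the interview immediately follows
subset( rec , srefmon == 4 )$calmonth    # 2009-12-01
```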

perhaps more than any of the other data sets on this website, working with sipp will get more comfortable as you increase your ram.  so long as you manipulate these files with sql commands inside the sqlite database (.db) that my automated-download scripts create, you'll process these data line-by-line and therefore be untethered from any computer hardware limits.  but the moment a dbReadTable or dbGetQuery command pulls something into working memory, you'll begin gobbling up those precious four, eight, or sixteen gigabytes on your local computer.  in practice, this simply requires that you define the columns you need at the start, then limit what gets read in to only those variables.  you'll see it done in my scripts.  if you don't copy that strategy, fair warning, you may hit allocation errors.  maybe keep the performance tab of your windows task manager handy and take out the trash.
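a minimal illustration of that column-limiting strategy, against an in-memory toy table (the principle is identical against the on-disk sipp .db files):

```r
library( DBI )
library( RSQLite )

db <- dbConnect( SQLite() , ":memory:" )

# toy table: two useful variables plus filler columns standing in for
# the hundreds of variables in a real sipp core wave
dbWriteTable( db , "w1" , data.frame( ssuid = 1:5 , tptotinc = 51:55 , extra1 = runif( 5 ) , extra2 = runif( 5 ) ) )

# dbReadTable drags every column into working memory..
full <- dbReadTable( db , "w1" )

# ..while a targeted query reads in only the variables the analysis needs
slim <- dbGetQuery( db , "SELECT ssuid , tptotinc FROM w1" )

c( ncol( full ) , ncol( slim ) )    # 4 columns versus 2

dbDisconnect( db )
```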


confidential to sas, spss, stata, and sudaan users: watch this.  time to transition to r.  :D
