
analyze the united states decennial census public use microdata sample (pums) with r and monetdb

during his tenure as secretary of state, thomas jefferson oversaw the first american census way back in 1790.  some of my countrymen express pride that we’re the oldest democracy, but my heart swells with the knowledge that we’ve got the world’s oldest ongoing census.  you’ll find the terms ‘census’ and ‘enumeration’ scattered throughout article one, section two of our constitution.  long story short: the united states census bureau has been a pioneer in the field of data collection and dissemination since george washington’s first term.  ’tis oft i wonder how he would have felt betwixt r and monetdb.

for the past few decades, the bureau has compiled and released public use microdata samples (pums) from each big decennial census.  these are simply one- and five-percent samples of the entire united states population.  although a microdata file containing five percent of the american population sounds better, the one-percent files might be more valuable for your analysis because fewer fields have to be suppressed or top-coded for respondent privacy (in compliance with title 13 of the u.s. code).

if you’re not sure what kind of census data you want, read the missouri census data center’s description of what’s available.  these public use microdata samples are most useful for very specific analyses, but not of tiny areas (your neighborhood) – the smallest geography identified on a pums file is the public use microdata area, which contains at least 100,000 people.  it’d be wise to review the bureau’s at a glance page to decide whether you can just use one of the summary files that they’ve already constructed – why re-invent the table?

my syntax below only loads the one- and five-percent files from 1990 and 2000.  earlier releases can be obtained from the national archives, the university of minnesota’s magical ipums, or the missouri state library’s informative missouri census data center.  here’s some bad news: it looks like sequestration means the 2010 census data release might not include a pums.  this isn’t as much of a loss as it might sound: the 2010 census dropped the long form – historically, one-sixth of american households were instructed to answer a lot of questions while the other five-sixths of us just had to answer a few.  starting in 2010, everyone just had to answer a few, and the more detailed questions are now asked of roughly one percent of the united states population on an annual (as opposed to decennial) basis with the spanking new american community survey.  read this for more detail.  kinda awesome.  this new github repository contains three scripts:


download and import.R
fair warning: this full script takes a loooong time.  run it friday afternoon, get out of town for the weekend, and if you’ve got a fast processor and speedy internet connection, monday morning it should be ready for action.  otherwise, either download only the years and sizes you need or – if you gotta have ’em all – run it, minimize it, and then don’t disturb it for a week.  once the import finishes, the microdata live inside monetdb and get queried from r – see the short sketch after this script list for the general idea.

2000 analysis examples.R

replicate control counts table.R


click here to view these three scripts
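
to give a flavor of the workflow once the import has finished, here’s a minimal sketch of querying an imported pums table from r with the DBI and MonetDB.R packages.  the database name, port, table name, and column names below are assumptions made up for illustration – check the download and import script for the actual names it creates.

# minimal sketch: weighted person counts by state, aggregated inside monetdb
# so the full microdata file never has to fit in memory.
# dbname, port, table name, and column names are assumptions.
library(DBI)
library(MonetDB.R)

# connect to a locally-running monetdb server
db <- dbConnect( MonetDB.R() , dbname = "pums" , host = "localhost" , port = 50000 )

# the database does the heavy lifting and returns only a tiny result table
weighted.counts <-
	dbGetQuery(
		db ,
		"SELECT state , SUM( pweight ) AS weighted_persons
			FROM pums_2000_1pct
			GROUP BY state
			ORDER BY state"
	)

head( weighted.counts )

dbDisconnect( db )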


for more detail about the united states decennial census public use microdata sample, visit:

notes:

analyzing trends between historical decennial censuses (would that be censii?) and the american community survey is legit.  not only legit.  encouraged.  instead of waiting ten years to analyze long-form respondents, now you and i have access to a new data set every year.  if you like this new design, thank a re-engineer.

so how might one calculate standard errors and confidence intervals in the pums?  there isn’t a good solution.  ipums (again, who i love dearly) has waved its wand and created this impressive strata variable for each of the historical pums data sets.  in a previous post, i advocated for simply doubling the standard errors but then calculating any critically-important standard errors by hand with the official formula (1990 here and 2000 there).  starting with the 2005 american community survey, replicate weights have been added and the survey data world has been at (relative) peace.
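
to make that rule of thumb concrete, here’s a tiny sketch using the survey package on a made-up person-level extract.  the variable names, values, and weights are invented for illustration – and a weight-only design like this one is exactly the situation where the naive standard error understates the truth.

library(survey)

# made-up person-level extract with a person weight (purely illustrative)
pums.extract <-
	data.frame(
		incwage = c( 32000 , 55000 , 0 , 12000 , 81000 , 47000 ) ,
		pweight = c( 98 , 105 , 110 , 95 , 102 , 99 )
	)

# weight-only design: no usable strata or clusters,
# so the standard error below is too small
pums.design <- svydesign( ids = ~ 1 , weights = ~ pweight , data = pums.extract )

( est <- svymean( ~ incwage , pums.design ) )

# conservative shortcut described above: double the naive standard error
# before building a 95% confidence interval
coef( est ) + c( -1.96 , 1.96 ) * 2 * SE( est )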


confidential to sas, spss, stata, and sudaan users: fred flintstone thinks you are old-fashioned.  time to transition to r.  😀

