obsessively-detailed instructions to analyze publicly-available survey data with free tools – the r language, the survey package, and (for big data) sqlsurvey + monetdb.
governments spend billions of dollars each year surveying their populations. if you have a computer and some energy, you should be able to unlock it for free, with transparent, open-source software, using reproducible techniques. we’re in a golden era of public government data, but almost nobody knows how to mine it with technology designed for this millennium. i can change that, so i’m gonna. help. use it.
the computer code for each survey data set consists of three core components:
current analysis examples
- fully-commented, easy-to-modify examples of how to load, clean, configure, and analyze the most current data sets available.
massive ftp download automation
- no-changes-necessary programs to download every microdata file from every survey year onto your local disk as r data files.
replication scripts
- match published numbers exactly to show that r produces the same results as other statistical languages. these are your rosetta stones, so you know the syntax has been translated into r properly.
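the load-clean-configure-analyze pattern described above looks roughly like this. a minimal sketch using the `api` data that ships with the survey package (not one of the government microdata sets in the repository), so it runs anywhere without a download step:

```r
# minimal sketch of the configure -> analyze pattern,
# using the built-in `api` data rather than real government microdata
library(survey)

data( api )   # california academic performance index, included in the package

# configure: build a complex survey design object.
# this one describes a one-stage cluster sample of school districts,
# with sampling weights and a finite population correction
dclus1 <-
	svydesign(
		id = ~dnum ,
		weights = ~pw ,
		data = apiclus1 ,
		fpc = ~fpc
	)

# analyze: survey-weighted statistics with correct standard errors
svymean( ~api00 , dclus1 )
svytotal( ~enroll , dclus1 )
```

every analysis script in the repository follows this same shape; only the download, cleaning, and design-construction details change from survey to survey.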
want a gentler introduction? read this flowchart, or grab some popcorn and watch me talk at the dc r users group.
endorsements, citations, links, words on the street:
- the consumer expenditure survey microdata page, bureau of labor statistics
- the survey of consumer finances microdata page, federal reserve
- the pesquisa nacional por amostra de domicilios – continua page, brazilian census bureau
- the pesquisa nacional por amostra de domicilios page, brazilian census bureau
- the pesquisa de orcamentos familiares microdata page, brazilian census bureau
- the pesquisa mensal de emprego page, brazilian census bureau
- the censo demografico 2010 and 2000 pages, brazilian census bureau
- the resources to help you learn and use r page, ucla institute for digital research and education
- the health services research methods external resources page, academyhealth
- the r survey package homepage, r core contributor dr. thomas lumley
frequently asked questions
what if i would like to offer additional code for the repository, or can’t figure something out, or find a mistake, or just want to say hi?
if it’s related to a data set discussed in a blog post, please write it in the comments section so others might benefit from the response. otherwise, e-mail me directly. i love talking about this stuff, in case you hadn’t noticed.
how do i get started with r?
either watch some of my two-minute tutorial videos or read this post at flowingdata.com.
r isn’t that hard to learn, but you’ve gotta want it.
are you sure r matches other statistical software like sas, stata, and sudaan?
yes. i wrote this journal article outlining how r precisely matches these three languages with complex survey data.
but that journal article only provides comparisons across software for the medical expenditure panel survey. what about other data sets?
along with the download, importation, and analysis scripts, each data set in the repository contains at least one syntax example that exactly replicates the statistics and standard errors of some government publication, so you can be confident that the methods are sound.
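each replication script boils down to this pattern: compute a statistic with the survey package, then confirm both the estimate and its standard error match a published figure. a minimal sketch, again using the survey package's built-in `api` data, with the two "published" targets below standing in for an actual government table:

```r
library(survey)
data( api )

dclus1 <- svydesign( id = ~dnum , weights = ~pw , data = apiclus1 , fpc = ~fpc )

# the estimate and standard error produced by r..
result <- svymean( ~api00 , dclus1 )

# ..get compared against the numbers printed in some official publication.
# these targets are illustrative stand-ins, not from a real government table
published.estimate <- 644.17
published.se <- 23.54

# if either line fails, the syntax has not been translated into r properly
stopifnot( abs( coef( result ) - published.estimate ) < 0.01 )
stopifnot( abs( SE( result ) - published.se ) < 0.01 )
```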
does r have memory limits that prevent it from working with big survey data and big data in general?
sort of, but i’ve worked around them for you. all published analyses get tested on the 32-bit version of r on my personal windows laptop (enforcing a 4gb ram limit) and then on a unix server (ensuring macintosh compatibility as well) hosted by the fantastic monetdb folks at cwi.nl. larger data sets are imported and analyzed with sql commands that run inside the database rather than in r's memory, to accommodate analysts with limited computing resources.
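the big-data scripts in the repository use sqlsurvey + monetdb for this. here's the general idea sketched with sqlite instead, only because sqlite needs no server setup; the principle is the same either way: the heavy lifting happens inside the database engine, and r only ever receives the one-row answer.

```r
# illustrative sketch of database-backed computation -- the table and
# column names here are made up, and sqlite stands in for monetdb
library(DBI)
library(RSQLite)

con <- dbConnect( SQLite() , ":memory:" )

# pretend this table is a multi-gigabyte microdata file
dbWriteTable(
	con ,
	"survey_data" ,
	data.frame( income = c( 10 , 20 , 30 , 40 ) , wt = c( 1 , 2 , 1 , 1 ) )
)

# a weighted mean computed entirely by the database engine --
# r never loads the full table into ram
dbGetQuery(
	con ,
	"SELECT SUM( income * wt ) / SUM( wt ) AS weighted_mean FROM survey_data"
)

dbDisconnect( con )
```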
why does this blog use a github repository as a back-end?
github is designed to host computer syntax that gets updated frequently. blog posts don’t belong there.
why does your github repository use this blog as a front-end?
most survey data sets become available on a regular basis (many are annual, but not all). if you use these scripts, you probably don’t care about every little change that i make to the underlying computer code (which you can view by clicking here).
what is github?
a version control website.
what is version control?
it’s like the track changes feature in microsoft word, only specially-designed for computer code.
what else do i need to analyze this survey data?
just the latest version of r and the latest version of the survey package. every script in the repository gets tested against both (on my 32-bit windows laptop and on the unix server mentioned above) before it’s published.
what is SAScii?
(too) many data sets produced by official agencies include only a fixed-width ascii file and a sas-readable importation script. r is expert at loading in csv, spss, stata, sas transport, even sas7bdat files, but (until SAScii) couldn’t read the block of code written for sas to import fixed-width data. click here to see what others have to say about it.
a few of the importation scripts in the repository use a sql-based variant of SAScii to prevent overloading ram. but don’t worry, everything gets loaded automagically when you run the program.
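here's SAScii end-to-end in miniature: take a fixed-width ascii file plus the sas importation script that describes it, and read both into r. the two files below are made-up two-column miniatures for illustration, not a real agency release:

```r
# illustrative sketch: the data file and sas script are invented miniatures
library(SAScii)

# a fixed-width data file: a two-character state abbreviation, then income
tf.data <- tempfile()
writeLines( c( "al01000" , "ak02500" ) , tf.data )

# the sas importation script an agency might ship alongside it
tf.sas <- tempfile()
writeLines(
	c(
		"INPUT" ,
		"@1 state $ 2." ,
		"@3 income 5." ,
		";"
	) ,
	tf.sas
)

# parse.SAScii() just extracts the column layout from the sas script..
parse.SAScii( tf.sas )

# ..while read.SAScii() uses that layout to import the ascii file
x <- read.SAScii( tf.data , tf.sas )
x
```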
how many questions should a good faq answer?