analyze the american community survey (acs) with r and monetdb
[This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
experimental. think of the american community survey (acs) as the united states’ census for off-years – the ones that don’t end in zero. every year, one percent of all americans respond, making it the largest complex sample administered by the u.s. government (the decennial census has a much broader reach, but since it attempts to contact 100% of the population, it’s not a survey). the acs asks how people live and although the questionnaire only includes about three hundred questions on demography, income, insurance, it’s often accurate at sub-state geographies and – depending how many years pooled – down to small counties. households are the sampling unit, and once a household gets selected for inclusion, all of its residents respond to the survey. this allows household-level data (like home ownership) to be collected more efficiently and lets researchers examine family structure. the census bureau runs and finances this behemoth, of course.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
the downloadable american community survey ships as two distinct household-level and person-level comma-separated value (.csv) files. merging the two just rectangulates the data, since each person in the person-file has exactly one matching record in the household-file. for analyses of small, smaller, and microscopic geographic areas, choose one-, three-, or five-year pooled files. use as few pooled years as you can, unless you like sentences that start with, “over the period of 2006 – 2010, the average american … [insert yer findings here].”
rather than processing the acs public use microdata sample line-by-line, the r language brazenly reads everything into memory by default. to prevent overloading your computer, dr. thomas lumley wrote the sqlsurvey package principally to deal with this ram-gobbling monster. if you’re already familiar with syntax used for the survey package, be patient and read the sqlsurvey examples carefully when something doesn’t behave as you expect it to – some sqlsurvey commands require a different structure (i.e. svyby gets called through svymean) and others might not exist anytime soon (like svyolr). gimme some good news: sqlsurvey uses ultra-fast monetdb (click here for speed tests), so follow the monetdb installation instructions before running this acs code. monetdb imports, writes, recodes data slowly, but reads it hyper-fast. a magnificent trade-off: data exploration typically requires you to think, send an analysis command, think some more, send another query, repeat. importation scripts (especially the ones i’ve already written for you) can be left running overnight sans hand-holding.
the acs weights generalize to the whole united states population including individuals living in group quarters, but non-residential respondents get an abridged questionnaire, so most (not all) analysts exclude records with a relp variable of 16 or 17 right off the bat. this new github repository contains four scripts:
2005-2011 – download all microdata.R
- create the batch (.bat) file needed to initiate the monet database in the future
- download, unzip, and import each file for every year and size specified by the user
- create and save household- and merged/person-level replicate weight complex sample designs
- create a well-documented block of code to re-initiate the monetdb server in the future
fair warning: this full script takes a loooong time. run it friday afternoon, commune with nature for the weekend, and if you’ve got a fast processor and speedy internet connection, monday morning it should be ready for action. otherwise, either download only the years and sizes you need or – if you gotta have ’em all – run it, minimize it, and then don’t disturb it for a week.
2011 single-year – analysis examples.R
- run the well-documented block of code to re-initiate the monetdb server
- load the r data file (.rda) containing the replicate weight designs for the single-year 2011 file
- perform the standard repertoire of analysis examples, only this time using sqlsurvey functions
- run the well-documented block of code to re-initiate the monetdb server
- copy the single-year 2011 table to maintain the pristine original
- add a new age category variable by hand
- add a new age category variable systematically
- re-create then save the sqlsurvey replicate weight complex sample design on this new table
- close everything, then load everything back up in a fresh instance of r
- replicate a few of the census statistics. no muss, no fuss
- run the well-documented block of code to re-initiate the monetdb server
- load the r data file (.rda) containing the replicate weight designs for the single-year 2011 file
- match every nationwide statistic on the census bureau’s estimates page, using sqlsurvey functions
click here to view these four scripts
for more detail about the american community survey (acs), visit:
- the us census bureau’s acs homepage
- the american factfinder homepage
- the american community survey’s wikipedia page
- the census bureau’s acs frequently asked questions page
notes:
if you’re just looking for a couple data points, you ought to give the census bureau’s american factfinder a whirl. it’s a table creator (click here to watch me blab about table creators), so it’s easy-to-use but inflexible. here’s a li’l tip: if you run a statistic using american factfinder and then the same statistic using these scripts, they will be close but won’t match exactly. it’s not a mistake, and both are methodologically correct.
every now and then, grumpy lawmakers threaten to defund the acs because, well, it’s expensive. use it or lose it.
since data types in sql are not as plentiful as they are in the r language, the definition of a monet database-backed complex design object requires a cutoff be specified between the categorical variables and the linear ones. that cut point gets defined using the check.factors argument in the sqlsurvey() and sqlrepsurvey() function calls. check.factors defaults to ten, but can be raised or lowered as needed. here’s how it works:
- if the column would be a character string or factor inside an r data frame, the sql database stores it as a varchar column.
- if the column would be numeric or integer in an r data frame, but has fewer than eleven unique values, the sql database also stores it as a varchar column.
- if the column would be numeric or integer in an r data frame, but has at least eleven unique values, the sql database stores it as a double (that’s sql-speak for numeric).
unless specified by the question’s phrasing, most acs variables should be treated as point-in-time, as opposed to either annualized or ever during the year. this distinction is particularly important for health insurance coverage. think about these three statistics —
- the number of americans who won’t have health insurance at least once during this year
- the number of americans without health insurance right now
- the number of americans who won’t ever have health insurance during this year
although the automated ftp download program for this data set only retrieves files back as far as 2005, a nationwide version of the american community survey has been conducted since 2000. i skipped those years for two reasons —
- the sample size (the true strength of the modern acs) wasn’t very large on the older files (the 2004 and 2011 single-year person-level files are 54mb and 580mb, respectively). there’s no reason to import these files into a monet database.
- the replicate weighted design wasn’t implemented until 2005, so the creation of a complex sample survey object isn’t possible. if you need to calculate standard errors for earlier years, you’ll have to rely on a pita generalized variance formula instead. evidence: the published estimates prior to 2005 don’t include error columns.
confidential to sas, spss, stata, sudaan users: the decennial census is enshrined in our constitution. your statistical software isn’t. time to transition to r. 😀
To leave a comment for the author, please follow the link and comment on their blog: asdfree by anthony damico.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.