analyze the consumer expenditure survey (ce) with r

November 13, 2012

(This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers)

the consumer expenditure survey (ce) is the primo data source to understand how americans spend money.  participating households keep a running diary about every little purchase over the year.  those diaries are then summed up into precise expenditure categories.  how else are you gonna know that the average american household spent $34 (±2) on bacon, $826 (±17) on cellular phones, and $13 (±2) on digital e-readers in 2011? an integral component of the market basket calculation in the consumer price index, this survey recently became available as public-use microdata and they're slowly releasing historical files back to 1996. hooray!

for a taste of what's possible with ce data, look at the quick tables listed on their main page - these tables contain approximately a bazillion different expenditure categories broken down by demographic groups. guess what? i just learned that americans living in households with $5,000 to $9,999 of annual income spent an average of $283 (±90) on pets, toys, hobbies, and playground equipment (pdf page 3).  you can often get close to your statistic of interest from these web tables.  but say you wanted to look at domestic pet expenditure among only households with children between 12 and 17 years old.  another one of the thirteen web tables - the consumer unit composition table - shows a few different breakouts of households with kids, but none matching that exact population of interest.

the bureau of labor statistics (bls) (the survey's designers) and the census bureau (the survey's administrators) have provided plenty of the major statistics and breakouts for you, but they're not psychic.  if you want to comb through this data for specific expenditure categories broken out by a you-defined segment of the united states' population, then let a little r into your life.  fun starts now.

fair warning: only analyze the consumer expenditure survey if you are a nerd to the core.  the microdata ship with two different survey types (interview and diary), each containing five or six quarterly table formats that need to be stacked, merged, and manipulated prior to a methodologically-correct analysis.  the scripts in this repository contain examples to prepare 'em all, but be advised that magnificent data like this will never be no-assembly-required.  the folks at bls have posted an excellent summary of what's available - read it before anything else.  after that, read the getting started guide.  don't skim.
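
to make "stacked, merged, and manipulated" a bit more concrete, here is a minimal base-r sketch on toy data.  it is not taken from the repository's scripts: the table names (`fmli_q1`, `fmli_q2`, `pets`), the `src_qtr` and `pet_cost` columns, and every value are made up for illustration, and the other column names are only assumed to resemble what the real quarterly files contain.

```r
# toy stand-ins for two quarterly interview files
# (assumption: real files are far wider and keyed by a consumer unit id)
fmli_q1 <- data.frame( newid = c( 101 , 102 ) , qintrvyr = 2011 , totexppq = c( 9000 , 12500 ) )
fmli_q2 <- data.frame( newid = c( 103 , 104 ) , qintrvyr = 2011 , totexppq = c( 7600 , 15100 ) )

# stack the quarters into one table, tagging each row with its source quarter
quarters <- list( q1 = fmli_q1 , q2 = fmli_q2 )
for ( q in names( quarters ) ) quarters[[ q ]]$src_qtr <- q
fmly <- do.call( rbind , quarters )
rownames( fmly ) <- NULL

# a second, hypothetical table keyed by the same id
pets <- data.frame( newid = c( 101 , 103 , 104 ) , pet_cost = c( 40 , 0 , 125 ) )

# left join so consumer units without a match are kept
fmly <- merge( fmly , pets , by = "newid" , all.x = TRUE )

# treat a missing match as zero spending
fmly$pet_cost[ is.na( fmly$pet_cost ) ] <- 0
```

the `all.x = TRUE` matters: an inner join would silently drop every consumer unit with no reported spending in the category and inflate the mean.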

a few of the descriptions below refer to sas programs provided by the bureau of labor statistics.  you'll find these in the C:\My Directory\CES\2011\docs directory after you run the download program.  this new github repository contains three scripts:

• loop through every year and download every file hosted on the bls's ce ftp site
• import each of the comma-separated value files into r with read.csv
• depending on user-settings, save each table as an r data file (.rda) or stata-readable file (.dta)
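
the import-and-save steps above might look roughly like this for a single file.  this is a sketch under assumptions, not the download script itself: the file name `fmli_demo.csv`, its three columns, and its values are invented, and the csv is written to a temporary folder so the example runs without touching the bls ftp site.

```r
# write a tiny csv to a temp folder as a stand-in for one downloaded ce file
tmp_dir <- tempdir()
csv_path <- file.path( tmp_dir , "fmli_demo.csv" )
writeLines(
    c( "NEWID,FINLWT21,TOTEXPPQ" , "101,15000.5,9000" , "102,22000.25,12500" ) ,
    csv_path
)

# import with read.csv, then lowercase the column names for easier typing
fmli_demo <- read.csv( csv_path , stringsAsFactors = FALSE )
names( fmli_demo ) <- tolower( names( fmli_demo ) )

# save the table as an r data file (.rda)
rda_path <- file.path( tmp_dir , "fmli_demo.rda" )
save( fmli_demo , file = rda_path )

# later on, load() restores the object under its original name
rm( fmli_demo )
load( rda_path )
```

for the stata-readable route, `write.dta()` in the foreign package is one option.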

2011 fmly intrvw - analysis examples.R
• load the r data files (.rda) necessary to create the 'fmly' table shown in the ce macros program documentation.doc file
• construct that 'fmly' table, using five quarters of interviews (q1 2011 thru q1 2012)
• initiate a replicate-weighted survey design object
• perform some lovely li'l analysis examples
• replicate the %mean_variance() macro found in "ce macros.sas" and provide some examples of calculating descriptive statistics using unimputed variables
• replicate the %compare_groups() macro found in "ce macros.sas" and provide some examples of performing t-tests using unimputed variables
• create an rsqlite database (to minimize ram usage) containing the five imputed variable files, after identifying which variables were imputed based on pdf page 3 of the user's guide to income imputation
• initiate a replicate-weighted, database-backed, multiply-imputed survey design object
• perform a few additional analyses that highlight the modified syntax required for multiply-imputed survey designs
• replicate the %mean_variance() macro found in "ce macros.sas" and provide some examples of calculating descriptive statistics using imputed variables
• replicate the %compare_groups() macro found in "ce macros.sas" and provide some examples of performing t-tests using imputed variables
• replicate the %proc_reg() and %proc_logistic() macros found in "ce macros.sas" and provide some examples of regressions and logistic regressions using both unimputed and imputed variables
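
the two ideas that run through the list above, replicate-weighted variance and multiply-imputed variables, can be sketched by hand in base r.  this is not the script's code, and it builds no survey design object.  the assumptions: variance is the average squared deviation of the replicate estimates around the full-sample estimate (a brr-style formula), there are 44 replicate weights, and the five implicates are combined with standard rubin's rules.  `repw_mean()` and `combine_mi()` are hypothetical helper names, and all data are simulated.

```r
# replicate-weight estimate of a weighted mean and its variance
# (assumption: brr-style variance, averaged over the replicate columns)
repw_mean <- function( x , w , repw ) {
    theta <- sum( w * x ) / sum( w )
    theta_r <- apply( repw , 2 , function( rw ) sum( rw * x ) / sum( rw ) )
    list( est = theta , var = mean( ( theta_r - theta )^2 ) )
}

# rubin's rules for combining estimates from m implicates
combine_mi <- function( ests , vars ) {
    m <- length( ests )
    ubar <- mean( vars )    # within-imputation variance
    b <- var( ests )        # between-imputation variance
    list( est = mean( ests ) , se = sqrt( ubar + ( 1 + 1 / m ) * b ) )
}

# simulated data: 200 consumer units with a full-sample weight
set.seed( 1 )
n <- 200
w <- runif( n , 5000 , 25000 )

# 44 fake replicate weights: double a random half, zero out the rest
repw <- sapply( 1:44 , function( i ) w * 2 * rbinom( n , 1 , 0.5 ) )

# five fake implicates of an imputed income variable
implicates <- lapply( 1:5 , function( i ) rlnorm( n , 10.5 , 0.8 ) )

# estimate within each implicate, then combine across implicates
per_imp <- lapply( implicates , repw_mean , w = w , repw = repw )
combined <- combine_mi(
    sapply( per_imp , `[[` , "est" ) ,
    sapply( per_imp , `[[` , "var" )
)
```

unimputed variables only need the `repw_mean()` step.  imputed variables need both steps, which is why the syntax changes once a design is multiply-imputed.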

replicate integrated mean and se.R
• match each step in the bls-provided sas program "integrated mean and se.sas" but with r instead of sas
• create an rsqlite database when the expenditure table gets too large for older computers to handle in ram
• export a table "2011 integrated mean and se.csv" that exactly matches the contents of the sas-produced "2011 integrated mean and se.lst" text file

for more detail about the consumer expenditure survey (ce), visit:

notes:

throughout this post, i've used the terms consumer unit and household interchangeably.  consumer unit is a precise definition, but household is a reasonable proxy that will make more sense to your audience.  the consumer expenditure survey is a consumer unit-level survey, meaning all weights and results generalize to the average (non-institutional) american consumer unit.  since the unit of analysis is one consumer unit rather than one person, it's trickier to talk about your results.  instead of saying, "in 2011, the average american spent $x on y," you'll have to say, "in 2011, the average american household spent $y on z."  if your boss frowns at you, blame it on me.

if you're hard-pressed to talk about expenses at the individual-level, you could copy what the social security administration did on pdf page 11 of this report and compute per capita expenditures, but i don't recommend it.  if you desperately need to open up that can of worms, run your analytic plan by the folks at bls and get their blessing.  otherwise, stick with household.  small price for wonderful data.

confidential to sas, spss, stata, and sudaan users: why are you still using medieval inventions?  you don't see airplane pilots navigating by the stars.  time to transition to r.  :D