analyze the pesquisa de orcamentos familiares (pof) with r

June 17, 2013

(This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers)

for the unlucky among us born without a portuguese mother tongue, the pesquisa de orcamentos familiares (pof) translates to survey of household budgets.  this data set captures brazilian family consumption habits, allocation of expenses, and income distributions, along with some handy quality-of-life and nutritional-profile characteristics of the population.  my good friends djalma pessoa and andre martins at the brazilian census bureau (ibge) co-authored this post, so send them a thank you for pushing their agency toward the futuristic world of open source statistical analysis.  ibge's main use for the pof is to define the weights of the basket of goods and services underlying the brazilian consumer price index (analogous to the american consumer price index).  hot.

weights are available on each file so users can compute weighted yearly expenses for any product code. look for the microdata variable named `valor_anual_expandido2`.  except for the domicilio and morador files, data tables are generally one-record-per-code.  click here for the full (portuguese language) code listing.  and here’s how the principal microdata files break down:

  1. domicilio, morador: discussed below [one-record-per-household and -per-person]
  2. inventario, despesa de 90 dias, despesa de 12 meses, outras despesas coletivas, servicos domesticos: household maintenance, electricity, maid, etc. [one-record-per-code-per-family]
  3. caderneta de despesa coletiva: food expenditures [one-record-per-code-per-family]
  4. despesas individuais, despesas com veiculos: education fees and tuition, books, etc. [one-record-per-code-per-householder]
  5. rendimentos e deducoes, outros rendimentos: incomes of household members aged 10 and older [one-record-per-code-per-householder]
  6. condicoes de vida: living standards [one-record-per-family]
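
since `valor_anual_expandido2` already incorporates the survey expansion, the weighted yearly expense for any product code is just a sum of that variable within the code.  a minimal sketch with toy data — the variable name comes from the microdata documentation above, but `codigo` and the values here are illustrative, not actual pof records:

```r
# toy one-record-per-code-per-family expense table.
# `valor_anual_expandido2` is the real pof variable name;
# `codigo` and all values are made up for illustration
despesa <-
	data.frame(
		codigo = c( "0110101" , "0110101" , "0110201" ) ,
		valor_anual_expandido2 = c( 1200 , 800 , 500 )
	)

# weighted (expanded) yearly expense for each product code
res <- tapply( despesa$valor_anual_expandido2 , despesa$codigo , sum )
```

the same one-liner scales to the full microdata table once it's loaded, with the real product code column swapped in.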
the two files from each survey year used to construct national estimates are the household-level (domicilio) and person-level (morador) tables.  the domicilio file contains one record per sampled household and the morador file contains one record for every person within each sampled household.  for this entire survey (and all publications produced by ibge), expenses are estimated at the family-level, also called the consumer unit.  (the american consumer expenditure survey does the same thing.)  occasionally, households have multiple consumer units in them – for the same reasons that a household might have multiple biological families.  if you’re unsure that you’re using consumer units correctly, follow the example replication script that re-constructs table 1.1.12, since total consumer units were used in the post-stratification.
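
the family-level aggregation described above can be sketched with base r's `aggregate` and `merge` — note that every column name and number here is an assumption for illustration, not the actual pof variable names:

```r
# toy person-level (morador-style) table with a person-weight,
# and a toy one-record-per-code-per-family expense table.
# all column names here are illustrative assumptions
morador <-
	data.frame(
		familia.id = c( 1 , 1 , 2 ) ,
		peso = c( 1.5 , 1.5 , 2.0 )
	)

despesas <-
	data.frame(
		familia.id = c( 1 , 1 , 2 , 2 ) ,
		valor = c( 10 , 20 , 5 , 15 )
	)

# collapse the expense table to one-record-per-family
fam <- aggregate( valor ~ familia.id , data = despesas , sum )

# merge the family totals onto every person record,
# keeping the person-weight for later weighted analysis
x <- merge( morador , fam , by = "familia.id" )
```

after the merge, each person record carries its family's total expenses alongside the person-weight, which is exactly the shape needed for family-level estimates computed with person-weights.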

the survey weights on the domicilio file generalize to all brazilian households nationwide and the weights on the morador file generalize to all brazilians, so if your goal is to make a statement that begins with either of the phrases "among brazilian households…" or "the average brazilian…" you'll need to aggregate whatever household- or person-level information you want from the other files, then merge everything onto one of those two data sets.  if it's family-level you're after (like the example in table 1.1.12), you'll need to aggregate based on the unique family identifier but still use the person-weight.  this new github repository contains four scripts:

download all microdata.R

  • download the big zipped file, plus documentation, plus post-stratification tables for each available year 
  • hack deep into the single-stream, massive sas importation syntax to extract the instructions necessary to import each individual microdata table directly into r without using my arch-nemesas
  • store quick-to-load copies of each microdata table into year-specific folders for easy access later
2008-2009 analysis examples.R

  • load the person-level and post-stratification tables from a single year of data
  • construct the complex sample survey object, post-stratifying according to ibge specifications
  • run example analyses that calculate perfect means, medians, quantiles, totals
  • replicate the statistics and coefficients of variation found in official table 15, why not?
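
the post-stratification step above (handled in the actual scripts by the `survey` package's `postStratify` function) boils down to ratio-adjusting the sampling weights so they sum to known population totals within each post-stratum.  a minimal base r sketch of that idea, with made-up weights and totals:

```r
# minimal post-stratification sketch: rake each sampling weight by the
# ratio of the known population total to the weighted sample total
# within its post-stratum.  all numbers are made up for illustration
w <- c( 2 , 3 , 5 , 5 )            # sampling weights
ps <- c( "a" , "a" , "b" , "b" )   # post-stratum of each record
pop <- c( a = 10 , b = 20 )        # known population totals

# adjustment factor for every record, indexed by its post-stratum
adj <- pop[ ps ] / tapply( w , ps , sum )[ ps ]

# post-stratified weights
w.ps <- w * adj

# within each post-stratum, the adjusted weights now hit the known totals
tapply( w.ps , ps , sum )
```

the `survey` package does this same adjustment on a full complex-sample design object, so standard errors account for both the sampling design and the post-stratification.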
replicate tabela 1.1.R

  • load the person-level and post-stratification tables from a single year of data
  • construct the complex sample survey object, post-stratifying according to ibge specifications
  • build a monster function that takes the post-stratified design and returns exciting formatted results
  • replicate the statistics and coefficients of variation found in official table 1.1
replicate tabela 1.1.12.R

  • load and configure person- and event-specific tables into family- and family-event-level tables
  • assemble an even monsterer function that (deep in its belly) constructs the post-stratified complex sample survey object with monthly food expenditures variables at the family-level
  • replicate the statistics and coefficients of variation found in official table 1.1.12
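
the coefficients of variation replicated throughout these scripts are just the standard error expressed as a percent of the point estimate (the `survey` package also exposes this directly through its `cv` extractor).  in toy form, with made-up numbers:

```r
# a coefficient of variation is the standard error as a percent
# of the estimate itself.  toy numbers, not actual pof results
estimate <- 250
std.error <- 10

cv <- 100 * std.error / estimate
```

when a cv in your output doesn't match the official table, this ratio is the first thing to check: a mismatched standard error usually means the post-stratification step was skipped or mis-specified.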

click here to view these four scripts

although the download automation script will pull the 2002-2003 files onto your local computer for you, it's currently not recommended that you analyze anything earlier than the 2008-2009 pof.  the 1995-1996 pof microdata are not available on the ibge website.  period.  the weights for the 2002-2003 pof were calibrated, and the total estimates were obtained using greg (generalized regression) estimators.  the calibration was done separately for each state in the country, and some states did not have a metropolitan region, and the capital had no municipality.  the calibration was initially done using both household- and householder-level information, with the household weight assigned to each householder.  however, with the information presently available on the ibge website, it isn't possible to replicate the published estimates from the 2002-2003 microdata, since the twenty-seven files with calibration variables and population totals are not posted.  also, given the change in product codes from 2002-2003 to 2008-2009, a script built for the 2008-2009 files would require a product-code crosswalk – which is also not available.  if you've got a damn good research idea that requires analyzing trends, you're better off asking for advice from the friendly cariocas at ibge.

confidential to sas, spss, stata, sudaan users: as indiana jones would say, “it belongs in a museum.” time to transition to r. 😀
