analyze the pesquisa nacional por amostra de domicilios (pnad) with r

April 7, 2013

(This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers)

think of the pesquisa nacional por amostra de domicilios (pnad) as the brazilian census for off-years – the ones that don’t end in zero.  the principal household survey for the nation of brazil, pnad measures general education, labor, income, and housing characteristics of the population.  pesquisa nacional por amostra de domicilios translates to ‘national household sample survey’ and, like most things, sounds better in the native portuguese (exhibit a).  my first foray into survey data for a country outside of the united states, this comes at the request of djalma pessoa who co-authored both this post and the published scripts.  djalma works as a consultant for the office in charge of statistical survey methodology at the brazilian institute of geography and statistics (ibge), just like a united states census bureau headquartered in rio de janeiro.

pnad has been on the shelves for forty-four years, investigating all the good stuff: migration, fertility, marriage, health, and food security.  microdata are available back to 2001, and since 2004 pnad has included the rural north (the amazon state, among others).  the sample design is self-weighted with three selection stages: primary sampling units are municipalities stratified by population size and selected systematically with probability proportional to size (pps).  secondary and tertiary sampling units are enumeration areas, then households.  the weights also need to be post-stratified to the 2010 official brazilian census.  all in all, a pretty straightforward methodology.  let the code do all the setup for you so you can worry about the more exciting questions and then clock out for the day.  by the way, in brazil, do they call happy hour cappy hour?

this new github repository contains four scripts:

2001-2011 – download all microdata.R

  • download the fixed-width file containing household and person records
  • merge ’em together into a rectangular file at the person-level
  • create an adjusted weight and a new variable – one – in the data table
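
the merge step above can be sketched in base r with toy data.  this is only an illustration, not the published script: the column names (hh_id, state, age) and the values are made up, and the adjusted weight is left out because its formula isn't spelled out in this post.

```r
# toy household records: one row per household
# (column names are hypothetical, not the real pnad variable names)
households <- data.frame(
    hh_id = c(1, 2, 3),
    state = c("rj", "sp", "am"),
    stringsAsFactors = FALSE
)

# toy person records: one row per person, keyed to a household
persons <- data.frame(
    hh_id = c(1, 1, 2, 3, 3, 3),
    age = c(34, 8, 61, 45, 43, 12)
)

# merge household columns onto every person record,
# giving a rectangular person-level table
pnad_toy <- merge(persons, households, by = "hh_id")

# add a column of ones, handy for counting records or summing weights
pnad_toy$one <- 1

# every person should survive the merge
stopifnot(nrow(pnad_toy) == nrow(persons))

print(pnad_toy)
```

the stopifnot() line is a cheap sanity check: if a person record had no matching household, the row count would drop and the script would halt.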

2011 single-year – analysis examples.R

  • connect to the sql database created by the ‘download all microdata’ program
  • create the complex sample survey object, post-stratifying using a custom-built function
  • perform a boatload of analysis examples
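
for a feel of what post-stratifying does to the weights, here is a minimal base r sketch of the arithmetic.  it is not the custom-built function from the scripts and it skips the database entirely; the data frame, the column names, and the population totals are all invented for illustration.

```r
# toy sample: a design weight and a post-stratum for each record
# (all names and numbers are made up for illustration)
smp <- data.frame(
    post_stratum = c("a", "a", "a", "b", "b"),
    design_wgt = c(100, 120, 80, 200, 300),
    stringsAsFactors = FALSE
)

# known population totals for each post-stratum (hypothetical)
pop_totals <- c(a = 330, b = 450)

# sum of the design weights within each post-stratum
wgt_sums <- tapply(smp$design_wgt, smp$post_stratum, sum)

# adjustment factor: population total divided by weighted sample total
adj <- pop_totals[names(wgt_sums)] / as.vector(wgt_sums)

# scale each record's weight by the factor for its own post-stratum
smp$post_wgt <- smp$design_wgt * unname(adj[smp$post_stratum])

# after the adjustment, the weights sum to the population totals
new_sums <- as.vector(tapply(smp$post_wgt, smp$post_stratum, sum))
stopifnot(isTRUE(all.equal(new_sums, as.vector(pop_totals))))

print(smp)
```

the idea is simply that weights inside each post-stratum get stretched or shrunk until they add up to the official count for that group.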

2011 single-year – variable recode example.R

  • connect to the sql database created by the ‘download all microdata’ program
  • recode some numeric variables into a broader categorical variable
  • re-create the complex sample survey object, post-stratifying using a custom-built function
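
as a rough picture of the recode step, here is a small in-memory sketch using cut() from base r.  the published script works against the sql database instead, so treat this as an analogy only; the age vector and the category labels are hypothetical.

```r
# toy ages; the variable is hypothetical, not a real pnad column
age <- c(3, 17, 18, 42, 64, 65, 90)

# recode the numeric variable into broader categories
# right = FALSE closes each interval on the left: [0,18), [18,65), [65,Inf)
age_cat <- cut(
    age,
    breaks = c(0, 18, 65, Inf),
    labels = c("under 18", "18 to 64", "65 and over"),
    right = FALSE
)

# count the records falling in each category
print(table(age_cat))
```

whenever a recode like this changes the underlying table, the survey object has to be re-created so it picks up the new column.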

replicate IBGE estimates – 2011.R

  • connect to the sql database created by the ‘download all microdata’ program
  • create the complex sample survey object, post-stratifying using a custom-built function
  • precisely match the sas-sudaan output provided by analysts at ibge (as seen in the script directory)

click here to view these four scripts

for more detail about the annual pesquisa nacional por amostra de domicilios, visit:


to accommodate smaller computer workstations (with only 4gb of ram), these scripts perform all manipulations inside sqlite and rely on database-backed survey objects.  the post-stratification function in the current implementation of the r survey package does not work on database-backed survey design objects.  therefore, with little fanfare, i’ve written one that does.  you’ll find it getting pulled in at the source_url() line.  exciting.

confidential to sas, spss, stata, sudaan users: yes, and bicycles with training wheels might be easier to ride, but that doesn’t make them a long-term solution.  time to transition to r.  😀
