analyze the censo demografico no brasil (censo) with r and monetdb

[This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers.]

almost a century older than its own capital city, the decennial censo demografico no brasil (censo) forms the basis of every major survey product in the country.  the pesquisa nacional por amostra de domicilios (pnad), the pesquisa de orcamentos familiares (pof), the pesquisa mensal de emprego (pme), and all the other microdata sets released by the instituto brasileiro de geografia e estatistica (ibge) rely on this big-big-survey-every-ten-years for both calibration and inspiration.  a rarely-mined “arca do tesouro” of information, this nationwide census has demographics, labor force activity, bolsa familia participation, even the small stuff like refrigerator possession.

the published history of brazil’s population count might date back to the pre-fordlandia days of the jungle rubber trade (for more detail see: british biopiracy thriller t.t.a.t.e.o.t.w.), but ibge has so far loaded just the 2010 microdata for public download on their ftp site.  aggregated tables for prior censuses are available, but only the 2010 census has twenty million weighting area-level person-records for you to scroll through by hand (think of “weighting areas” as micro-municipalities – and click here to see them mapped.  most weighting areas have between 10,000 and 25,000 residents, though a few have as many as 300,000).  fun fact: if you’d like to analyze samples (partial files) from earlier microdata files, the university of minnesota’s integrated public use microdata series (ipums) has person-level brazilian data back to 1960.  but the ipums 2010 brazilian census sample only has about three million records, while the files downloaded from ibge have more than twenty million respondents.

rather than processing the censo demografico microdata sample line-by-line, the r language by itself would brazenly read everything into memory by default.  to prevent overloading your computer, dr. thomas lumley wrote the sqlsurvey package, and these scripts right here and right now take full advantage of that.  if you’re already familiar with the syntax used for the survey package, be patient and read the sqlsurvey examples carefully when something doesn’t behave as you expect it to – some sqlsurvey commands require a different structure (e.g. svyby gets called through svymean) and others might not exist anytime soon (like svyolr).  gimme some good news: sqlsurvey uses ultra-fast monetdb (click here for speed tests), so follow the monetdb installation instructions before running this censo code.  monetdb imports, writes, and recodes data slowly, but reads it hyper-fast.  a magnificent trade-off: data exploration typically requires you to think, send an analysis command, think some more, send another query, repeat.  importation scripts (especially the ones i’ve already written for you) can be left running overnight sans hand-holding.  this new github repository contains three scripts:

download and import.R
  • create the batch (.bat) file needed to initiate the monet database in the future
  • download, unzip, and import each state’s 2010 census microdata
  • create and save household- and merged/person-level complex sample designs
  • create a well-documented block of code to re-initiate the monetdb server in the future

fair warning: this script takes a long time.  computers less than two years old should successfully import everything overnight.  if you have a less-powerful computer, try running the script friday afternoon and it should finish before monday morning.

analysis examples.R
  • run the well-documented block of code to re-initiate the monetdb server
  • load the r data file (.rda) containing the person-level design object
  • perform the standard repertoire of analysis examples, only this time using sqlsurvey functions

variable recode example.R
  • run the well-documented block of code to re-initiate the monetdb server
  • copy the 2010 table to maintain the pristine original
  • add some new poverty category variables by hand
  • re-create then save the sqlsurvey design on this new table
  • close everything, then load everything back up in a fresh instance of r
  • calculate and then beauuuutifully barplot the poverty rate by state
  • calculate a ratio.  sqlsurvey does not yet contain a svyratio function, so perform the calculations manually with a taylor-series linearization approximation
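
for readers who want to see the shape of that last bullet, here is a minimal base-r sketch of a ratio with a taylor-series linearized standard error.  it is not the code from the recode script: the toy data, the column names (`stratum`, `weight`, `y`, `x`), and the single-stage stratified with-replacement variance formula are all assumptions made for illustration.

```r
# toy data: six sampled persons in two strata (every value here is made up)
toy <- data.frame(
	stratum = c( 1 , 1 , 1 , 2 , 2 , 2 ) ,
	weight  = c( 10 , 10 , 10 , 20 , 20 , 20 ) ,
	y       = c( 1 , 0 , 1 , 0 , 1 , 0 ) ,   # numerator variable
	x       = c( 1 , 1 , 1 , 1 , 1 , 0 )     # denominator variable
)

# weighted totals of the numerator and denominator, then the ratio itself
y.hat <- sum( toy$weight * toy$y )
x.hat <- sum( toy$weight * toy$x )
r.hat <- y.hat / x.hat

# linearized variable: each record's weighted contribution to the ratio's error.
# by construction these contributions sum to zero across the whole sample
toy$z <- toy$weight * ( toy$y - r.hat * toy$x ) / x.hat

# variance of the total of z within each stratum, treating each record
# as its own sampling unit drawn with replacement (an assumption)
var.by.stratum <- sapply(
	split( toy$z , toy$stratum ) ,
	function( z ) {
		n <- length( z )
		n / ( n - 1 ) * sum( ( z - mean( z ) )^2 )
	}
)

# sum the stratum variances, then take the square root for the standard error
se.hat <- sqrt( sum( var.by.stratum ) )

print( c( ratio = r.hat , se = se.hat ) )
```

the trick is that the variance of a ratio gets approximated by the variance of a plain total (the total of `z`), which any design-based variance formula already knows how to handle.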

click here to view these three scripts

for more detail about the censo demografico no brasil (censo), visit:


um.  my friend djalma pessoa at ibge co-authored this post and all of the code.  send him a thank you note.  we don’t get paid for this, we do it for love of data.  and for you.  to the people still reading: make yourself real and say something.

dois.  although sqlsurvey and monetdb will not overload your ram by reading all twenty million records into working memory at once, they’ll still move slower on an older computer.  my personal laptop has eight gigabytes of ram and gets through every command eventually, but if you expect rapid-fire action on complicated linearized survey sample designs with twenty million records, you’d be smart to invest in at least sixteen gigabytes.  otherwise plan to run the “computationally intensive” commands overnight — and if you do happen to overload your computer, you can often simply close the monetdb server window, re-start it, and re-connect to the survey design object without shutting down your r session.  remember: it’s twenty million records.  it works on your laptop.  it’s free.  that’s amazing.  be happy about it.
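
a back-of-the-envelope sketch of why reading the whole file into ram hurts.  the record count comes from this post; the column count is an assumption picked only to make the arithmetic concrete, and the eight bytes per value assumes every column gets stored as a double.

```r
# rough memory footprint of holding the full person file in r at once
n.records       <- 20e6   # about twenty million person-records (from the text)
n.columns       <- 100    # assumption: not the actual width of the ibge file
bytes.per.value <- 8      # r stores a double in eight bytes

# convert bytes to gigabytes (1024^3 bytes each)
gb.needed <- n.records * n.columns * bytes.per.value / 1024^3

print( round( gb.needed , 1 ) )
```

under those assumptions the raw table alone would crowd a sixteen gigabyte machine before any analysis command makes a working copy.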

tres.  since there had been no official recommendation on how to construct a coefficient of variation with the public use file, djalma tested out a bunch of different methods and constructed a linearization-based standard error calculation that you’ll see in both of the analysis scripts – but since his variance estimation method is new, there’s no replication script.  for official estimates, ibge adjusts the design weights using calibration (described on pdf page 636 of their census methodology document), but some of the variables/columns used in that official method are not included in the public use files because of confidentiality.  therefore, instead of using the approximated variance technique explained on pdf page 643 of the same document, microdata users will have to settle for djalma’s method.  but don’t fret!  this is a census after all: there’s lots of sample and your confidence intervals will be tiny, even with the more conservative approach shown in our syntax examples.

quatro.  if either of the previous two points frustrates you, try sticking to sql commands alone.  given the massive sample size, it’s generally safe to conclude that any two numbers that differ by more than a few percentage points are statistically significantly different.  basic sql commands are just otherworldly-fast in monetdb, and if you’re only in the exploratory phase of your analysis, (weighted!) sql queries might satisfy your every data craving.
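
a minimal sketch of what a weighted sql query computes.  the table name `c10` and the columns `pes_wgt` and `poor` are hypothetical placeholders (only `v0001`, the state code, appears in this post), and the toy numbers are made up.  the query string is shown but never sent anywhere; the base-r lines underneath reproduce the same arithmetic so you can check it by hand.

```r
# the kind of query you might send to the database: a weighted share by state.
# with a live DBI connection this string would go through dbGetQuery()
weighted.sql <-
	"SELECT v0001 , SUM( pes_wgt * poor ) / SUM( pes_wgt ) AS poverty_rate
	FROM c10
	GROUP BY v0001"

# toy stand-in for the person table: two states, two records each
toy <- data.frame(
	v0001   = c( 11 , 11 , 12 , 12 ) ,   # state code
	pes_wgt = c( 10 , 30 , 20 , 20 ) ,   # person weight (hypothetical name)
	poor    = c( 1 , 0 , 1 , 1 )         # 0/1 poverty flag (hypothetical name)
)

# same calculation as the query: weighted sum of the flag over the sum of weights
numerator    <- tapply( toy$pes_wgt * toy$poor , toy$v0001 , sum )
denominator  <- tapply( toy$pes_wgt , toy$v0001 , sum )
poverty.rate <- numerator / denominator

print( poverty.rate )
```

forget the weight and you are describing the sample instead of brazil, which is why the "(weighted!)" matters.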

cinco. if you wish to analyze a single brazilian state, simply subset your monetdb-backed survey design the same way you would subset on any other variable.  for example, you could analyze rondonia by itself with the command `pes.rondonia = subset( pes.d , v0001 == 11 )` and then use `pes.rondonia` instead of `pes.d` in your svymean and svyquantile calls.

seis. our scripts will automate the import of the 2010 microdata directly from ibge.  but ipums does not allow automated downloads for their microdata, so if you wish to load prior years, you’ll have to point-and-click through their website, download the census extract as a csv file, import the file with the function, and construct your own sqlsurvey design.  if you’re fluent in both r and survey analysis, budget a day of work.

sete.  a little more explanation on the smaller geographic areas:  setor censitario (enumeration area) is a set of about 400 adjacent households used internally at ibge; it’s generally the geographic area that one single interviewer will be responsible for during the administration of the census.  this identifier is not available in the public use files because of confidentiality; that’s why we approximated the variance estimation by stratifying on the (larger) area de ponderacao.  area de ponderacao (weighting area) is a set of contiguous enumeration areas (setores) that do not cross municipality borders.  for small municipalities, the weighting area coincides with the entire municipality.  bigger cities have multiple weighting areas.  nationwide, there were about 10,000 weighting areas in the 2010 censo demografico; each had at least 10,000 residents (with unweighted sample sizes of at least 500).

confidential to sas, spss, stata, and sudaan users: think of yourself as if you’re in the caterpillar stage of your life cycle.  time to build that chrysalis, and after that..  time to transition to r.  😀
