analyze the new york city housing and vacancy survey (nychvs) with r

[This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers.]

for those interested in the real estate and rental markets of the big apple, the census bureau’s nyc housing and vacancy survey might be your key to the city.  if you care about how many new york residents live at more than one person per room (a lot), how many structures are dilapidated (a few, phew), or what rent prices run these days (cha-ching), start here.  way back in 1965, new york law began requiring the enumeration of the city’s heavily-regulated rental market, establishing this complex sample survey of about twenty thousand housing units, both occupied and vacant.  nowadays it’s triennial, it’s publicly-downloadable, and it’s free n easy to analyze with the r language.

although the census bureau employs the survey administrators and produces the main how-to documents (both faq and overviews), city government actually pays the bill and gets the glory: the preliminary 2011 report with all the fun facts and the older but more complete 2008 report.  the microdata include four exciting files: a person-level file for occupied units, a household-level file for occupied units, a household-level file for vacant units, and a household-level file for units that didn’t yield an interview (solely for adjusting the vacant-unit statistics).  most urban planning and policy wonks line up the occupied and vacant household-level files to calculate a vacancy rate, but depending on your mission, you might need some person-level action as well.  by the way, the preliminary 2011 report is six months older than the latest 2011 microdata, so don’t panic if your stats are off by a whisker.  this new github repository contains three scripts:

2002 – 2011 – download all microdata.R
  • download, import, save each of the four data files into a single year-specific .rda file back to 2002
  • bumper sticker idea for nychvs data users: if you can read this, thank a furman center for the sas import scripts.

2011 analysis examples.R
  • load all available tables for a single year of data
  • construct the complex sample survey object, but it’s fake – see note below.
  • run example analyses that calculate perfect means, medians, quantiles, totals

replicate contract items 2008.R
  • load all available tables for a single year of data
  • construct the complex sample survey object, but it’s fake – see note below.
  • thoroughly explain a back-of-the-envelope calculation for standard errors, confidence intervals, variances
  • print statistics that match exactly – and confidence intervals more conservative than – the target replication table
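
for a feel of what the two analysis scripts do, here’s a minimal sketch, not the repository’s actual code.  it assumes the `survey` package is installed, runs on a tiny made-up household table, and uses hypothetical column names (`hhweight`, `monthly_rent`) in place of the real nychvs variables.  the design is "fake" in the sense that it carries only weights, with no strata or cluster identifiers:

```r
library(survey)

# made-up household-level records standing in for one year of nychvs data
# `hhweight` and `monthly_rent` are hypothetical column names
households <-
  data.frame(
    hhweight = c(100, 200, 100, 200),
    monthly_rent = c(1000, 1500, 2000, 1200)
  )

# weights only, no strata or clusters: the "fake" complex sample design
nychvs_design <-
  svydesign(ids = ~ 1, weights = ~ hhweight, data = households)

# point estimates only need the weights
rent_mean <- svymean(~ monthly_rent, nychvs_design)
rent_total <- svytotal(~ monthly_rent, nychvs_design)
rent_median <- svyquantile(~ monthly_rent, nychvs_design, quantiles = 0.5)

print(rent_mean)
print(rent_total)
print(rent_median)
```

the means, medians, and totals depend only on the weights, so those come out right.  the standard errors printed next to them are the part to distrust.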

click here to view these three scripts
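
as for the vacancy rate that planners get by lining up the occupied and vacant household-level files, the arithmetic looks roughly like this.  a minimal base r sketch on made-up numbers: the column names (`hhweight`, `rental`, `available_for_rent`) are hypothetical, and the rule for which vacant units count as available is an assumption here, so check the census documentation for the real definition:

```r
# made-up stand-ins for the occupied and vacant household-level files
# all column names here are hypothetical, not the real nychvs variables
occupied <-
  data.frame(
    hhweight = c(150, 200, 175, 125),
    rental = c(TRUE, TRUE, FALSE, TRUE)
  )

vacant <-
  data.frame(
    hhweight = c(30, 20, 40),
    available_for_rent = c(TRUE, TRUE, FALSE)
  )

# weighted counts of occupied rentals and vacant-for-rent units
occupied_rentals <- sum(occupied$hhweight[occupied$rental])
vacant_rentals <- sum(vacant$hhweight[vacant$available_for_rent])

# share of the rental stock that is vacant and available
rental_vacancy_rate <- vacant_rentals / (vacant_rentals + occupied_rentals)

print(rental_vacancy_rate)
```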

for more detail about the new york city housing and vacancy survey, visit:


hint for statistical illiterates: if the data point you’re looking for isn’t in the grand report, check the census bureau’s copious online tables too.

as described in detail in the comments of the replication script, it’s impossible to exactly match the census-published confidence intervals.  here’s one snippet of a longer conversation about how users cannot automate the computation of standard errors (discussed at footnote five) with the nychvs: the `segment` variable (mentioned in the e-mail) does not get released due to confidentiality concerns.  that leaves two options.  either calculate standard errors by hand with the infuriating generalized variance formula recommended in each year’s source and accuracy statement (2008, 2011), or use the back-of-the-envelope method i invented that approximates census-published confidence intervals conservatively.

when i learned that users couldn’t automate the matching of census-published numbers, i tried to be a bootstrapping young lad and come up with some fancy standard error computation methodology.  but it turns out that multiplying the un-adjusted standard errors by two gets as close to the right answer as anything else.  if you’re writing the final draft of a research product destined to get heavy exposure, you might have to calculate confidence intervals by hand or pay the census bureau for a custom run.  but for those of us who can live with an occasional false negative in our lives, try it my way.
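
the doubling trick fits in a few lines.  this is a minimal sketch under stated assumptions rather than the replication script itself: it needs the `survey` package, the data are made up, the column names (`hhweight`, `monthly_rent`) are hypothetical, and the 95% confidence level is my choice for illustration:

```r
library(survey)

# made-up household-level records with hypothetical column names
households <-
  data.frame(
    hhweight = c(100, 200, 100, 200, 150, 250),
    monthly_rent = c(1000, 1500, 2000, 1200, 900, 1700)
  )

# weights-only design, since the `segment` variable is not released
fake_design <-
  svydesign(ids = ~ 1, weights = ~ hhweight, data = households)

rent_mean <- svymean(~ monthly_rent, fake_design)

point_estimate <- as.numeric(coef(rent_mean))
unadjusted_se <- as.numeric(SE(rent_mean))

# back-of-the-envelope adjustment: double the un-adjusted standard error
doubled_se <- 2 * unadjusted_se

# normal-approximation intervals (95% level is an assumption)
z <- qnorm(0.975)
naive_ci <- point_estimate + c(-1, 1) * z * unadjusted_se
conservative_ci <- point_estimate + c(-1, 1) * z * doubled_se

print(naive_ci)
print(conservative_ci)
```

the conservative interval is exactly twice as wide as the naive one, which is where the occasional false negative comes from.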

confidential to sas, spss, stata, and sudaan users: i look at you the way new yorkers look at jersey.  time to transition to r. 😀
