analyze the youth risk behavior surveillance system (yrbss) with r

[This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers.]

the youth risk behavior surveillance system is the high school edition of the behavioral risk factor surveillance system (brfss), a scientific study of good kids who do bad things.  questions are mostly about sex, drugs, rock and roll, and populate a veritable bouquet of cdc reports, fact sheets, and journal articles.  want to know how many american teenagers rode with a drunk driver or carried a gun to school or tried ecstasy for the fortieth time?  reading over the questionnaire makes me think rebel without a cause on steroids (steroid use can be found at question 56).  for a more professional introduction, check out the cdc’s yrbss in brief page.  or keep reading.

most states (and even two dozen urban areas) conduct their own yrbsses (that’s plural for yrbss), but state participation is kinda all over the place.  if you need state- or locale-specific data, you can send in a data request form and maybe modify my syntax (perhaps starting the importation with read.dta or read.spss).  but if nationwide estimates are all you’re after, just analyze the cdc’s publicly-available files with the syntax described below.  the yrbss weights generalize to all public and private school students in grades 9-12 in the fifty united states plus dc.  this new github repository contains three scripts:
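for the state- or locale-specific route, the import step might look something like the sketch below.  it round-trips a made-up stata file through the foreign package so it runs without any real data; the column names and the temporary file are assumptions for illustration only, not anything taken from an actual state file.

```r
# a minimal sketch: round-trip a toy stata file so the import step can be seen end to end.
# the column names below are made up for illustration, not taken from any state file.
library(foreign)

toy <- data.frame(
    stratum = c(1, 1, 2, 2),
    psu = c(1, 2, 1, 2),
    weight = c(10.5, 12, 8.25, 9)
)

# stand-in for a file returned from a data request
fn <- tempfile(fileext = ".dta")
write.dta(toy, fn)

# the actual import: swap `fn` for the path to your own .dta file
# (read.spss(fn, to.data.frame = TRUE) plays the same role for .sav files)
state_df <- read.dta(fn)

# lowercase the column names so later code does not depend on the file's capitalization
names(state_df) <- tolower(names(state_df))

print(state_df)
```

once the data frame is in memory, the rest of the analysis syntax should only need its variable names adjusted.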

1991 – 2011 download all microdata.R
  • download two decades' worth of data with no muss and zero fuss.
  • flip a few strings around in the cdc’s sas importation scripts so they’re sascii-compliant
  • save everything to your local drive for easy loading later.
  • look at this page and thank your lucky stars that everything has been automated for you
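the save-now, load-later idea from the list above can be sketched in a few lines of base r.  this is not the download script itself: the data frame, the column name q8, and the file name are all hypothetical, and a temporary directory stands in for your local drive.

```r
# a minimal sketch of the save-now, load-later pattern, using a made-up data frame
x <- data.frame(year = c(2011, 2011, 2011), q8 = c(1, 2, 2))

# hypothetical file name; a temporary directory stands in for your local drive
rda_path <- file.path(tempdir(), "yrbs2011.rda")
save(x, file = rda_path)

# later on: load() restores the object under its original name
# and returns the names of everything it restored
rm(x)
loaded_names <- load(rda_path)

print(loaded_names)
print(nrow(x))
```

an .rda file keeps column types and factor levels intact, so nothing needs to be re-imported or re-parsed on the next run.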

2011 single-year – analysis examples.R
  • load the latest r data file (.rda) created by the download script (above)
  • set up a taylor-series linearization survey design outlined in this document
  • perform enough analysis examples to quench even the most insatiable of statistical appetites
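a taylor-series linearization design in the survey package can be sketched as below.  the data are simulated, the column names psu, stratum, and weight are assumptions about how the design variables are labeled, and q_example is a hypothetical yes/no item, so treat this as an outline of the shape of the code rather than the analysis script itself.

```r
# a minimal sketch of a taylor-series linearization design on made-up data.
# psu, stratum, and weight are assumed column names; q_example is a hypothetical 0/1 item.
library(survey)

set.seed(1)
toy <- data.frame(
    stratum = rep(1:3, each = 40),
    psu = rep(rep(1:2, each = 20), times = 3),
    weight = runif(120, 50, 150),
    q_example = rbinom(120, 1, 0.3)
)

# psu ids restart at 1 within each stratum, so nest = TRUE is needed
y <- svydesign(ids = ~psu, strata = ~stratum, weights = ~weight, data = toy, nest = TRUE)

# weighted proportion with a linearized standard error
est <- svymean(~q_example, y)
print(est)

# 95% confidence interval around that proportion
print(confint(est))

# the same estimate broken out by stratum
print(svyby(~q_example, ~stratum, y, svymean))
```

the point estimate matches a plain weighted mean; what the design object adds is a standard error that respects the clustering and stratification.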

replicate cdc software for analysis of yrbs data publication.R

click here to view these three scripts

for more detail about the youth risk behavior surveillance system, visit:


depending on your propensity for detecting statistical software comparisons, you may or may not have noticed that the centers for disease control and prevention document replicated by my third yrbs script is – some may say – the rosetta stone of complex sample survey statistical analysis.  in fact, dr. thomas lumley (author of the r survey package) wrote an entire extension of an older (2007) version of that document (the current version is from 2009) to prove that r’s survey package is every bit as rough and tumble as any other statistical language out there (it is).  dr. lumley wrote much more detail about how r stacks up against those other languages than you’ll find in my silly little replication script.  if you’re a comparison addict like me, read it like you svymean it.

confidential to sas, spss, stata, and sudaan users: why are you huddled around that space heater for warmth when we can just snuggle instead?  time to transition to r. 😉
