Analyzing Public Health Data in R

June 20, 2016

(This article was first published on R –, and kindly contributed to R-bloggers)

Today’s post is by Thomas Yokota, an epidemiologist in Hawaii. I’ve been corresponding with Thomas via email and telephone for a while. I asked Thomas if he could write an introduction to how R, mapping and open data are used in the public health community. This is his reply.

Bonus: Download the code from this post!

What is public health?

Public health is a broad and sometimes overreaching field. Consequently, I am asked, “what IS public health?” more so than “what school did you go to (in Hawaii)?” I also work for a hospital system where I am often mistaken as someone knowledgeable of clinical care. With that said, I’d like to start off with a quote by C.E. Taylor:

“The focus of public health is the community. The ‘patient’ is a whole population unit. The shift in professional orientation which occurs as the unit of attention moves from the individual to the group must be clearly recognized and explicitly stated because it has led to many misunderstandings in the past.”

Once it’s clearly understood that public health is not brain surgery and – according to the CDC – more about zombie invasions, the next response I usually get is, “so what does public health do if it doesn’t cure diabetes?” Such a loaded question. In a nutshell, the four core functions of public health are as follows:

  • Assessment
  • Policy development
  • Assurance
  • Communication

In other words, public health professionals rely on studies and surveys, such as the Behavioral Risk Factors Surveillance System (BRFSS), to understand health trends. Consequently, campaigns and/or policies are developed to break trends when needed.

So why R?

R is a great asset for making powerful graphs and maps, and can complement current SAS/STATA/SPSS/etc. workflows. Proprietary statistical programs will always be public health’s frenemies in the area of research and complex survey analysis – it’s inevitable that we must sacrifice funding health fair incentives to pay for license fees – but let’s not write off R too soon. R is also desirable for pubic health data analysts (i.e., epidemiologists) who work in community organizations.


In laymen terms, many public health organizations and professionals cite the BRFSS when interested in health risk behaviors, health access and chronic disease prevalence. Introductions and examples have already been created on BRFSS for R, and have been featured on R-bloggers. Anthony Damico maintains a GitHub repository on downloading and storing complex surveys. I recommend his solution for those who are new to R and/or complex survey analysis and because there is a significant speed increase when working on analyzing the BRFSS alongside MonetDB. I also recommend Dr. Thomas Lumley’s book Complex Survey Analysis to understand complex surveys.

Analyzing BRFSS in R

In this example, I am taking a couple of shortcuts to import the data and to test out the srvyr package because… pipes. It also helps me to cut down some of the code and instructions entailed to setting up R and MonetDB – it’s also been done before. It is important to note, however, that  Survey package is doing all the heavy lifting.

load dependencies and data

I use pacman package when handling all additional package loading and installs. I find it to be more convenient than library(). I am also borrowing Anthony’s download script to pull the BRFSS from CDC before loading it into R.

pacman::p_load(RCurl, foreign, downloader, survey, srvyr, ggplot2, dplyr)

source_url("", prompt=F, echo=F)
# download ez-pz brought to you by anthony joseph damico [[email protected]]

tf <- tempfile(); td <- tempdir()
xpt <- ""
download_cached(xpt, tf, mode='wb')
local.fn <- unzip(tf, exdir=td)

brfss14 <- read.xport(local.fn)
save(brfss14, file="brfss14.rda")
# load("brfss14.rda") <- read.csv("", stringsAsFactors=F) <- read.csv("", stringsAsFactors=F)

create the survey object

After importing the data, I use the Survey package to utilize the BRFSS sampling design and weighting so that we can create estimates that represent the population. The raking weighting methodology should be understood before proceeding. In this example, I will look at BMI categories as inferred by the following variable and question: BMI5CAT-Computed body mass index categories. I am re-coding BMI5CAT so that I can create estimates for overweight and/or obese prevalence.

brfss.design14 <- brfss14 %>%
  as_survey_design(ids=X_PSU, weight=X_LLCPWT, nest=TRUE, strata=X_STSTR, variables= c(X_BMI5CAT, X_MRACE1, X_STATE))

brfss.design14 <- brfss.design14 %>%
  mutate(X_BMI5CAT2 = car::recode(X_BMI5CAT, "c(3,4)=1; NA=NA;  else=0"),

state-by-state comparison

Now that we have created the survey design object, we can use R to create estimates from the data. Before doing this, however, it is advised that the reader studies the BRFSS code book. Another good source to learning how to use the BRFSS is to study past reports and case studies. You can always verify your query against the CDC BRFSS Prevalence & Trends Tool.

options(survey.lonely.psu = "certainty")

brfss.design14.0 %>%
  filter(X_STATE==15) %>%
  group_by(X_STATE, X_BMI5CAT) %>%
  summarize(prevalence = survey_mean(na.rm=T, vartype = c("ci")),
            N = survey_total(na.rm=T)) # matches CDC tool

brfss.design14 <- brfss.design14 %>%

BMI5CAT2.1 <- brfss.design14 %>%
  group_by(X_STATE, X_BMI5CAT2) %>%
  summarize(prevalence = survey_mean(na.rm=T),
            N = survey_total(na.rm=T)) %>%
  mutate(X_STATE = as.character(X_STATE)) %>%
  left_join(, by=c("X_STATE"="VALUE"))

Most commonly, BRFSS is used to make state-by-state comparisons. In the past (and sometimes currently), state-by-state comparisons were often ranked. This made for great click-bait news headlines. Most recently, choropleth maps have been used to visualize prevalence rates across the nation.

BMI5CAT2.1 %>%
  filter(X_BMI5CAT2==1) %>%
  ggplot(aes(x=reorder(STATE, prevalence), y=prevalence)) +
  scale_y_continuous(labels=scales::percent) +
  geom_bar(stat="identity") +
  ylab("YES (%)") +
  xlab("STATE") +
  ggtitle("Computed body mass index categories (overweight or obese) by state") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.background = element_rect(fill = 'white' )) +

state overweight and obese

BMI5CAT2.2 <- BMI5CAT2.1 %>%
  filter(X_BMI5CAT2==1) %>%
  select(region=STATE, value=prevalence) %>%
  mutate(region = tolower(region))

choroplethr::state_choropleth(BMI5CAT2.2, title="Computed body mass index categories (overweight or obese) choropleth map",

state overweight and obese map

I’m not a big fan of this map projection, and hope there will be an option in the future to change the projection.

deep dive Hawaii – stratify by race

Before we start dropping leaflets on healthy eating across the Bible Belt, why don’t we dig into the data a bit more. Let’s take my home state for example. Hawaii appears to have fewer obesity and overweight prevalence than most of the nation – California, drool at our sculpted natural beach bodies. Seems like there’s no issues here…

brfss14a <- brfss14 %>%

brfss.design14a <- brfss14a %>%
  as_survey_design(ids=X_PSU, weight=X_LLCPWT, nest=TRUE, strata=X_STSTR, variables= c(X_BMI5CAT, X_MRACE1, X_STATE, X_AGEG5YR))

brfss.design14a <- brfss.design14a %>%
  mutate(X_BMI5CAT2 = car::recode(X_BMI5CAT, "c(3,4)=1; NA=NA;  else=0"),
         X_BMI5CAT2a=car::recode(X_BMI5CAT2, "1='Obese/overweight'; NA=NA; else='Not Obese/overweight'"),
         X_BMI5CAT2a=factor(X_BMI5CAT2a, levels=c('Obese/overweight', 'Not Obese/overweight'), ordered=TRUE),
         X_MRACE1a=car::recode(X_MRACE1, "1='White'; 2='Black'; 3='AIAN'; 4='Asian'; 5='NHOPI';  NA=NA; c(6,7)='other/multiracial'; else=NA"),

BMI5CAT2.3 <- brfss.design14a %>%
  group_by(X_STATE, X_MRACE1a, X_BMI5CAT2a) %>%
  summarize(prevalence = survey_mean(na.rm=T),
            N = survey_total(na.rm=T),
            prevalence_ci = survey_mean(na.rm=T, vartype = c("ci"))) %>%
  mutate(X_STATE = as.character(X_STATE)) %>%
  left_join(, by=c("X_STATE"="VALUE"))

BMI5CAT2.3 %>%
  ggplot(aes(x = X_MRACE1a, y = prevalence, group = X_BMI5CAT2a)) +
    geom_bar(stat = "identity", aes(fill = X_BMI5CAT2a), alpha = 0.3) +
  scale_y_continuous(labels=scales::percent) +
  ylab("Prevalence (%)") +
  xlab("Race") +
  ggtitle("Computed body mass index categories (overweight or obese) in Hawaii by race") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.background = element_rect(fill = 'white')) +
  guides(fill = guide_legend(title = "Categories"))

state of HI overweight and obese by race

Stratifying by race shows us a different picture: NHOPI [ed note: NHOPI stands for (N)ative (H)awaiian & (O)ther (P)acific (I)slander] overweight/obese prevalence is at 80% (CI95%:  73-87%), easily topping all point estimates by state.

deep dive Hawaii – stratify by race and age group

Going one level deeper, we see that onset of overweight/obesity starts earlier for NHOPI and that prevalence across age groups is higher in comparison. The graph suggests that efforts to address overweight/obesity among NHOPI groups is needed in Hawaii; in addition, it also suggests that intervention at earlier stages would be beneficial for this group as well.

BMI5CAT2.4 <- brfss.design14a %>%
  group_by(X_STATE, X_MRACE1a, X_AGEG5YR, X_BMI5CAT2a) %>%
  summarize(prevalence = survey_mean(na.rm=T),
            N = survey_total(na.rm=T),
            prevalence_ci = survey_mean(na.rm=T, vartype = c("ci"))) %>%
  mutate(X_STATE = as.character(X_STATE)) %>%
  left_join(, by=c("X_STATE"="VALUE")) %>%
  mutate(X_STATE = as.character(X_STATE)) %>%
  left_join( %>% rename(X_AGEG5YR2=X_AGEG5YR), by=c("X_AGEG5YR"="VALUE"))

BMI5CAT2.4 %>%
  filter(X_MRACE1a=="NHOPI" | X_MRACE1a=="White" | X_MRACE1a=="Asian", X_AGEG5YR2!="Dont know/Refused/Missing") %>%
  ggplot(aes(X_AGEG5YR2, prevalence)) +
    geom_bar(stat = "identity", aes(fill = X_BMI5CAT2a), col = "gray", alpha = 0.7) +
    facet_grid(~X_MRACE1a) +
    scale_y_continuous(labels=scales::percent) +
    ylab("Prevalence (%)") +
    xlab("Age Group") +
    ggtitle("Computed body mass index categories (overweight or obese) in Hawaii by race and age") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1),
          panel.background = element_rect(fill = 'white' )) +
    guides(fill = guide_legend(title = "Categories"))

state of HI overweight and obese by age and race


Relying on high-level estimates alone can sometimes mislead your agenda. For example, after stratifying by race, we saw that the shared benefits of living in Hawaii were not equally shared by all groups. Further stratifying by age shows that onset by race occurred earlier for NHOPIs. These cuts of the data can help to narrow down public health efforts and hopefully prevent the usual spray-and-pray approach to solving problems.

This example only scratches the surface; there are other methods that can be explored. Analyzing data sets for public health such as BRFSS is more than just getting stuck on statistical models and debate on R vs. SAS vs. STATA. While it’s important that we continue to leverage statistical methods, analyzing data for public health is more than being a data monkey and computer babysitter. In recent years, with the onslaught of articles promoting “big data” and  “data is power”, I’ve experienced this elitism both in the workplace and in various communities. In my wiser years, I’ve realized that power doesn’t come from hogging the ability to analyze the data, but from contributing to the discussion as a team-player.

Bonus: Download the code from this post!

The post Analyzing Public Health Data in R appeared first on

To leave a comment for the author, please follow the link and comment on their blog: R – offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)