Analyzing the 2011-2012 California Health Inteview Survey with R

[This article was first published on quantitate, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The California Health Interview Survey (CHIS) is a remarkable biannual survey of health status, care access, and demographics for California residents. The 2011-2012 public use survey data has recently been released and is freely available after registering at the CHIS site. CHIS currently offers data in SAS, Stata, and SPSS formats. However, thanks to a few great R packages, it’s not very hard to bring this complex and interesting data set into R. In this post, we’ll cover some practical details about using CHIS data. We’ll also take up a more technical topic about how the CHIS weights were build and about how they should and should not be used. Before we begin, here are some key references for CHIS:
  • UCLA maintains extensive documentation of the CHIS methodlogy including important details of how the sample weighting was developed and implemented. Definitions for the CHIS variables can be found in the data dictionary included with the CHIS public use datasets.
  • Thomas Lumley created and maintains the survey package. He has provided a description of how to work with the CHIS survey weights here and also in his book.

Getting The Data

To begin, we first need to register (free) at the CHIS site and download the data. The CHIS is a landline and mobile telephone survey, designed to represent the non-institutionalized resident population of California. The data is divided into three subsets: child, adolescent, and adult. For simplicity, we’ll just work with the adult data set which has about 43k records. To bring the data into R, download the Stata flavor from CHIS, unzip it, and load it into R as follows:
# Read CHIS file
library(foreign)
file <- "~/Desktop/CHIS/STATA/chis11-12_adult_stata/Data/ADULT.dta"  # your file
options(nwarnings=500)  # Bump up number of warnings -- see note below
CHIS   <- read.dta(file, convert.factors = TRUE)
warnings() # print the 257 warning messages
# Note: read.dta generates warnings because some of the Stata 
# value label definitions are not found in the raw Stata file.  Thus,
# some variables will show the raw survey coding and some will print
# with the decoded labels from CHIS.  Consult the data dictionary to translate
# raw survey codes.  To be really careful, we could set convert.factors = FALSE
# to work only with the raw codes for all variables.

Now that we've got the data loaded, we want to take a stab at analyzing it. Browsing through the data dictionary, we see that CHIS records educational attainment in the variable aheduc. How can we tabulate this variable? Observe that CHIS includes 81 variables, rakedw0, rakedw1,...,rakedw80 , which are called replicate weights. For the moment, we'll disregard all of these weights except for rakedw0. The variable rakedw0 is the sample weight and it tells us how much each of the 43k observations should "count" in an analysis. Using R's xtabs function, we can tabulate educational attainment with the weight rakedw0:
# Educational attainment -- counts
# as.data.frame, round are just aesthetic choices.
as.data.frame(round(xtabs(rakedw0 ~ aheduc, CHIS, drop.unused.levels = TRUE),0))
# aheduc    Freq
# 1              GRADE 1-8 2214253
# 2             GRADE 9-11 1929232
# 3  GRADE 12/H.S. DIPLOMA 6747336
# 4           SOME COLLEGE 4155536
# 5      VOCATIONAL SCHOOL  842097
# 6        AA OR AS DEGREE 1936675
# 7        BA OR BS DEGREE 5918519
# 8      SOME GRAD. SCHOOL  284266
# 9        MA OR MS DEGREE 2537555
# 10   PH.D. OR EQUIVALENT  930248
# 11   NO FORMAL EDUCATION  300768

# sum over the table
margin.table(xtabs(rakedw0 ~ aheduc, CHIS, drop.unused.levels = TRUE))
# 27796484

Though we had only 43k records in the sample, the sum over all the categories is about 28M people which is roughly the size of the non-institutionalized adult population of California. This is because the sample weight rakedw0 was designed so that CHIS totals match closely to known CA demographics (primarily from the California Department of Finance's Demographics Unit).

Also, we observe that:
# Male/Female proportion in CHIS using the sample weight (correct)
round(prop.table(xtabs( rakedw0 ~ srsex, CHIS, drop.unused.levels = TRUE))*100,2)
# srsex
# MALE FEMALE 
# 48.72  51.28

# Male/Female proportion in RAW CHIS (wrong!!)
round(prop.table(xtabs(  ~ srsex, CHIS, drop.unused.levels = TRUE))*100,2)
# srsex
# MALE FEMALE 
# 41.57  58.43  <-- (!)

The raw CHIS sample contains many more women than we'd expect from a completely random draw from the California population. The sample weight rakedw0 includes an adjustment to rebalance the over-representation of females. Later in the post, we'll explore the CHIS weighting method in more detail and also explain the reason for the other weights rakedw1,..., rakedw80.

Simple point estimates with CHIS

CHIS is a great resource for investigating how demographic and health outcome factors are related. Let's use the data to assess a simple hypothesis: Californians with lower educational attainment are more likely to be overweight/obese (ovrwt).

When working with CHIS data it is frequently convenient to group and recode categorical variables. There are a number of ways we could do this but the "car" package offers a recoding function with a simple syntax:
# Recode educational attainment
library(car)
recodeSTR = "
c('NOT ASCERTAINED') = 'UNKNOWN';
c('NO FORMAL EDUCATION','GRADE 1-8', 'GRADE 9-11',
'GRADE 12/H.S. DIPLOMA','SOME COLLEGE',
'VOCATIONAL SCHOOL','AA OR AS DEGREE') = 'LESS THAN BA';
c('BA OR BS DEGREE','SOME GRAD. SCHOOL') = 'BA';
c('MA OR MS DEGREE','PH.D. OR EQUIVALENT') = 'MA OR MORE' "
CHIS$college_grad = recode(CHIS$aheduc, recodeSTR)

Here's a simple visual tabulation from CHIS to support our hypothesis:

# make a plot
require(ggplot2)
require(scales)

# Educational attainment (recoded) vs overweight
round(prop.table(xtabs(rakedw0 ~ college_grad + ovrwt, 
                   CHIS, drop.unused.levels=TRUE),1)*100,1)
#college_grad    YES   NO
#  BA           51.2 48.8
#  LESS THAN BA 64.6 35.4
#  MA OR MORE   50.5 49.5

# summary df
df <- as.data.frame(prop.table(xtabs(rakedw0 ~ college_grad + ovrwt, 
                        CHIS, drop.unused.levels=TRUE),1))

# reorder the factors for sake of the plot
df$college_grad = factor(df$college_grad, 
                             levels=c('LESS THAN BA','BA','MA OR MORE')) 

# plot
ggplot(df, aes(x=college_grad, y=Freq, fill=ovrwt)) + geom_bar(stat="identity") +
  scale_y_continuous(labels = percent_format()) +
  ylab("% of Adults") + xlab("Education Level") +
  scale_fill_manual(values = c("blue","yellow"))

Now that we've found a potentially interesting relationship in the dataset, our next step might be to check that this relationship is statistically significant. This means we need to compute standard errors for our cross tabulation. Here's where working with the complex CHIS survey design gets a bit tricky...

Replicate weights and estimator variances

At this point, it's useful take a more detailed look at the weighting methodology employed in CHIS and other complex surveys. This section is a little technical but the key practical consequences are summarized at the end. The CHIS sample weights are designed through a process called raking (which is similar to iterated proportional fitting). To understand this process, imagine that we are conducting a study on smoking and have collected the following sample:
sample <- data.frame(
  gender = c(rep("M",3),rep("F",7)),
  city   = c(rep(c("A","B"),4),"B","B"),
  smoke  = c(rep(c(1,0),3),1,0,1,1)
)
#   gender city smoke
#1       M    A     1
#2       M    B     0
#3       M    A     1
#4       F    B     0
#5       F    A     1
#6       F    B     0
#7       F    A     1
#8       F    B     0
#9       F    B     1
#10      F    B     1

From an external source, such as census data, suppose we know with high-confidence that 52% of the total population (city A + city B) are female and 48% are male. Suppose we also know that the population is distributed so that 55% of people live in A and 45% live in B. These external facts are called usually called control totals. Our sample is clearly not representative of the population as 40% of respondents are from city A and 70% are female. If we want to estimate the joint distribution of smoking rates by gender and city, what can we do?

Consider the strategy of assigning a weight $w_i$ to each respondent in the sample. This weight tells us how much each observation should count for in the cross tabulation. Then, for example, we can estimate: $$ p(\mathrm{smoke}, M ,A) \approx \frac{1}{10} \sum_{i: \mathrm{gender}_i = M, \mathrm{city}_i = A } w_i \cdot \mathrm{smoke}. $$ The key issue is how to choose the weight $w_i$. Raking is one such strategy which can make sense for complex surveys.

In a raking procedure, we begin with one of the control variables (say gender) and choose $w_i$ so that the sample distribution will match the population distribution. For example, to get 48% male representation we'd need to give each male in the sample a weight such that $$.3 \cdot w_i(\mathrm{gender}_i = M) = .48 \qquad \Rightarrow \qquad w_i(\mathrm{gender}_i = M) = \frac{0.48}{0.3} = 1.6.$$ Similarly, to get 52% female representation, we'd need to give each female in the sample a weight such that $$ w_i(\mathrm{gender}_i = F) = \frac{0.52}{0.7} \approx 0.7429.$$ We can then recompute the crosstabs with this initial weight:
sample$w0 = ifelse(sample$gender=="M",.48/.3, .52/.7)

prop.table(xtabs(w0 ~ gender, sample))
#gender
#   F    M 
#0.52 0.48 

prop.table(xtabs(w0 ~ city, sample))
#city
#        A         B 
#0.4685714 0.5314286 

Now gender is in balance but city remains out of balance. To balance the city proportions, we need to weight the weighted data.
# tabulate with current weight
P <- prop.table(xtabs(w0 ~ city, sample))

# update the weight for city
sample$w1 = ifelse(sample$city=="A",.55/P[1], .45/P[2])

# tabulate again
prop.table(xtabs(w0*w1 ~ city, sample))
# city
# A    B 
# 0.55 0.45 
prop.table(xtabs(w0*w1 ~ gender, sample))
# gender
#F         M 
#0.4889064 0.5110936 

So we've balanced city but in doing so we've also unbalanced gender. However, gender is not nearly as out of balance as it was when we begin. This suggests that we can iterate this process to build a weight variable that will balance both gender and city simultaneously. Indeed, here is a simple code that will produce the sample weight for this toy problem:
# controls
pop.gender = c(F=.52,M=.48)
pop.city   = c(A=.55,B=.45)

# set initialize the weight
sample$w <- 1

# run the algorithm (only need a few iterations to converge)
for(i in 1:10){

  # get current weighted proportion for gender
  P0 <- prop.table(xtabs(w ~ gender, sample))
  sample$w0 = ifelse(sample$gender=="F", pop.gender[1]/P0[1], pop.gender[2]/P0[2])

  #update weight
  sample$w = sample$w * sample$w0

  # get current weighted proportion for city
  P1 <- prop.table(xtabs(w ~ city, sample))
  sample$w1 = ifelse(sample$city=="A", pop.city[1]/P1[1], pop.city[2]/P1[2])

  #update weight
  sample$w = sample$w * sample$w1
}
# It works!
prop.table(xtabs(w ~ gender, sample))
# gender
# F    M 
# 0.52 0.48 
prop.table(xtabs(w ~ city, sample))
# city
# A    B 
# 0.55 0.45 

This simplified raking algorithm is a good approximation to how the weights in CHIS (and other complex surveys) are developed. Of course in CHIS, the process is much more complicated and the rebalancing is done over 11 design variables and with many more factor levels.

Raking is an interesting procedure from a technical perspective. Why would we expect this algorithm to converge? When it does converge, why can we use these weights to obtain unbiased estimates within cells? These questions are well outside the scope of this post but there is a fair amount of literature on the topic (this article, for example). Practically speaking, what we need to know for working with CHIS are the following facts:
  • The raked weight RAKEDW0 can be used to obtain unbiased estimators of means and proportion in the usual way, e.g. with xtabs in R.
  • Variances (and hence standard errors) for these estimators cannot be computed directly using RAKEDW0. A nice description of this issue can be found in this article.

In the next section, we'll explain how to correctly compute the standard errors for means and proportions in CHIS.

Standard errors with Replicate Weights

To handle the standard errors, we need to make use of the replicate weights rakedw1,...,rakedw80. Essentially, these weights capture variability across subsamples of the CHIS sample and from the variability within these subsamples we can deduce the correct variance estimates for our point estimators. The details of how the CHIS replicate weights were created and how they are used can be found in section 9.2 of the CHIS methodology manual.

Fortunately, R's survey package was built precisely to handle complex survey designs, including replicate weights. The key feature of the survey package is that all trickiness of the CHIS design can be encapsulated in a survey design object $D$. Once R knows about the design, survey's methods will take care of the details of correctly computing the standard errors behind the scenes. Let's return to our education/obesity example and compute standard errors for our cross tabulations:
# survey analysis with the survey package
require(survey)

# remove empty levels from overwt (aesthetic)
CHIS$ovrwt <- factor(CHIS$ovrwt)

# design structure -- settings are thanks to the references
# by T. Lumley.  The design object captures all the information
# about the replicate weights.
D <- svrepdesign(variables = CHIS[,c("college_grad", "ovrwt")], 
            repweights = CHIS[,grep("rakedw[1-9]",colnames(CHIS))],
            weights = CHIS$rakedw0,
            combined.weights = TRUE,
            type = "other",
            scale = 1,
            rscales = 1)

# compute the means and standard errors
svyby( ~ovrwt, ~college_grad, D, 
                FUN = svymean, keep.names = FALSE)

#  college_grad  ovrwtYES   ovrwtNO         se1         se2
#1           BA 0.5122654 0.4877346 0.007344430 0.007344430
#2 LESS THAN BA 0.6456145 0.3543855 0.005697532 0.005697532
#3   MA OR MORE 0.5051845 0.4948155 0.009186273 0.009186273


# compare with the naive standard error (WRONG!):
n  <- sum(CHIS$rakedw0[CHIS$college_grad == "BA"])
x  <- sum(CHIS$rakedw0[CHIS$college_grad == "BA" & CHIS$ovrwt == "YES"]) 
p  <- x/n   # proportion
SE <- sqrt(p*(1-p)/n)  # standard error in proportion
# 0.0002006993 
# This is much smaller error than what we got with the correct 
# estimate provided by survey! Thus, we would be overly confident
# about our results if used the naive method.


To leave a comment for the author, please follow the link and comment on their blog: quantitate.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)