# Analyzing the 2011-2012 California Health Inteview Survey with R

June 1, 2014
By

(This article was first published on quantitate, and kindly contributed to R-bloggers)

The California Health Interview Survey (CHIS) is a remarkable biannual survey of health status, care access, and demographics for California residents. The 2011-2012 public use survey data has recently been released and is freely available after registering at the CHIS site. CHIS currently offers data in SAS, Stata, and SPSS formats. However, thanks to a few great R packages, it's not very hard to bring this complex and interesting data set into R. In this post, we'll cover some practical details about using CHIS data. We'll also take up a more technical topic about how the CHIS weights were build and about how they should and should not be used. Before we begin, here are some key references for CHIS:
• UCLA maintains extensive documentation of the CHIS methodlogy including important details of how the sample weighting was developed and implemented. Definitions for the CHIS variables can be found in the data dictionary included with the CHIS public use datasets.
• Thomas Lumley created and maintains the survey package. He has provided a description of how to work with the CHIS survey weights here and also in his book.

### Getting The Data

To begin, we first need to register (free) at the CHIS site and download the data. The CHIS is a landline and mobile telephone survey, designed to represent the non-institutionalized resident population of California. The data is divided into three subsets: child, adolescent, and adult. For simplicity, we'll just work with the adult data set which has about 43k records. To bring the data into R, download the Stata flavor from CHIS, unzip it, and load it into R as follows:
# Read CHIS filelibrary(foreign)file <- "~/Desktop/CHIS/STATA/chis11-12_adult_stata/Data/ADULT.dta"  # your fileoptions(nwarnings=500)  # Bump up number of warnings -- see note belowCHIS   <- read.dta(file, convert.factors = TRUE)warnings() # print the 257 warning messages# Note: read.dta generates warnings because some of the Stata # value label definitions are not found in the raw Stata file.  Thus,# some variables will show the raw survey coding and some will print# with the decoded labels from CHIS.  Consult the data dictionary to translate# raw survey codes.  To be really careful, we could set convert.factors = FALSE# to work only with the raw codes for all variables.

Now that we've got the data loaded, we want to take a stab at analyzing it. Browsing through the data dictionary, we see that CHIS records educational attainment in the variable aheduc. How can we tabulate this variable? Observe that CHIS includes 81 variables, rakedw0, rakedw1,...,rakedw80 , which are called replicate weights. For the moment, we'll disregard all of these weights except for rakedw0. The variable rakedw0 is the sample weight and it tells us how much each of the 43k observations should "count" in an analysis. Using R's xtabs function, we can tabulate educational attainment with the weight rakedw0:
# Educational attainment -- counts# as.data.frame, round are just aesthetic choices.as.data.frame(round(xtabs(rakedw0 ~ aheduc, CHIS, drop.unused.levels = TRUE),0))# aheduc    Freq# 1              GRADE 1-8 2214253# 2             GRADE 9-11 1929232# 3  GRADE 12/H.S. DIPLOMA 6747336# 4           SOME COLLEGE 4155536# 5      VOCATIONAL SCHOOL  842097# 6        AA OR AS DEGREE 1936675# 7        BA OR BS DEGREE 5918519# 8      SOME GRAD. SCHOOL  284266# 9        MA OR MS DEGREE 2537555# 10   PH.D. OR EQUIVALENT  930248# 11   NO FORMAL EDUCATION  300768# sum over the tablemargin.table(xtabs(rakedw0 ~ aheduc, CHIS, drop.unused.levels = TRUE))# 27796484

Though we had only 43k records in the sample, the sum over all the categories is about 28M people which is roughly the size of the non-institutionalized adult population of California. This is because the sample weight rakedw0 was designed so that CHIS totals match closely to known CA demographics (primarily from the California Department of Finance's Demographics Unit).

Also, we observe that:
# Male/Female proportion in CHIS using the sample weight (correct)round(prop.table(xtabs( rakedw0 ~ srsex, CHIS, drop.unused.levels = TRUE))*100,2)# srsex# MALE FEMALE # 48.72  51.28# Male/Female proportion in RAW CHIS (wrong!!)round(prop.table(xtabs(  ~ srsex, CHIS, drop.unused.levels = TRUE))*100,2)# srsex# MALE FEMALE # 41.57  58.43  <-- (!)

The raw CHIS sample contains many more women than we'd expect from a completely random draw from the California population. The sample weight rakedw0 includes an adjustment to rebalance the over-representation of females. Later in the post, we'll explore the CHIS weighting method in more detail and also explain the reason for the other weights rakedw1,..., rakedw80.

### Simple point estimates with CHIS

CHIS is a great resource for investigating how demographic and health outcome factors are related. Let's use the data to assess a simple hypothesis: Californians with lower educational attainment are more likely to be overweight/obese (ovrwt).

When working with CHIS data it is frequently convenient to group and recode categorical variables. There are a number of ways we could do this but the "car" package offers a recoding function with a simple syntax:
# Recode educational attainmentlibrary(car)recodeSTR = "c('NOT ASCERTAINED') = 'UNKNOWN';c('NO FORMAL EDUCATION','GRADE 1-8', 'GRADE 9-11','GRADE 12/H.S. DIPLOMA','SOME COLLEGE','VOCATIONAL SCHOOL','AA OR AS DEGREE') = 'LESS THAN BA';c('BA OR BS DEGREE','SOME GRAD. SCHOOL') = 'BA';c('MA OR MS DEGREE','PH.D. OR EQUIVALENT') = 'MA OR MORE' "CHIS$college_grad = recode(CHIS$aheduc, recodeSTR)

Here's a simple visual tabulation from CHIS to support our hypothesis:

# make a plotrequire(ggplot2)require(scales)# Educational attainment (recoded) vs overweightround(prop.table(xtabs(rakedw0 ~ college_grad + ovrwt,                    CHIS, drop.unused.levels=TRUE),1)*100,1)#college_grad    YES   NO#  BA           51.2 48.8#  LESS THAN BA 64.6 35.4#  MA OR MORE   50.5 49.5# summary dfdf <- as.data.frame(prop.table(xtabs(rakedw0 ~ college_grad + ovrwt,                         CHIS, drop.unused.levels=TRUE),1))# reorder the factors for sake of the plotdf$college_grad = factor(df$college_grad,                              levels=c('LESS THAN BA','BA','MA OR MORE')) # plotggplot(df, aes(x=college_grad, y=Freq, fill=ovrwt)) + geom_bar(stat="identity") +  scale_y_continuous(labels = percent_format()) +  ylab("% of Adults") + xlab("Education Level") +  scale_fill_manual(values = c("blue","yellow"))

Now that we've found a potentially interesting relationship in the dataset, our next step might be to check that this relationship is statistically significant. This means we need to compute standard errors for our cross tabulation. Here's where working with the complex CHIS survey design gets a bit tricky...

### Replicate weights and estimator variances

At this point, it's useful take a more detailed look at the weighting methodology employed in CHIS and other complex surveys. This section is a little technical but the key practical consequences are summarized at the end. The CHIS sample weights are designed through a process called raking (which is similar to iterated proportional fitting). To understand this process, imagine that we are conducting a study on smoking and have collected the following sample:
sample <- data.frame(  gender = c(rep("M",3),rep("F",7)),  city   = c(rep(c("A","B"),4),"B","B"),  smoke  = c(rep(c(1,0),3),1,0,1,1))#   gender city smoke#1       M    A     1#2       M    B     0#3       M    A     1#4       F    B     0#5       F    A     1#6       F    B     0#7       F    A     1#8       F    B     0#9       F    B     1#10      F    B     1

From an external source, such as census data, suppose we know with high-confidence that 52% of the total population (city A + city B) are female and 48% are male. Suppose we also know that the population is distributed so that 55% of people live in A and 45% live in B. These external facts are called usually called control totals. Our sample is clearly not representative of the population as 40% of respondents are from city A and 70% are female. If we want to estimate the joint distribution of smoking rates by gender and city, what can we do?

Consider the strategy of assigning a weight $w_i$ to each respondent in the sample. This weight tells us how much each observation should count for in the cross tabulation. Then, for example, we can estimate: $$p(\mathrm{smoke}, M ,A) \approx \frac{1}{10} \sum_{i: \mathrm{gender}_i = M, \mathrm{city}_i = A } w_i \cdot \mathrm{smoke}.$$ The key issue is how to choose the weight $w_i$. Raking is one such strategy which can make sense for complex surveys.

In a raking procedure, we begin with one of the control variables (say gender) and choose $w_i$ so that the sample distribution will match the population distribution. For example, to get 48% male representation we'd need to give each male in the sample a weight such that $$.3 \cdot w_i(\mathrm{gender}_i = M) = .48 \qquad \Rightarrow \qquad w_i(\mathrm{gender}_i = M) = \frac{0.48}{0.3} = 1.6.$$ Similarly, to get 52% female representation, we'd need to give each female in the sample a weight such that $$w_i(\mathrm{gender}_i = F) = \frac{0.52}{0.7} \approx 0.7429.$$ We can then recompute the crosstabs with this initial weight:
sample$w0 = ifelse(sample$gender=="M",.48/.3, .52/.7)prop.table(xtabs(w0 ~ gender, sample))#gender#   F    M #0.52 0.48 prop.table(xtabs(w0 ~ city, sample))#city#        A         B #0.4685714 0.5314286

Now gender is in balance but city remains out of balance. To balance the city proportions, we need to weight the weighted data.
# tabulate with current weightP <- prop.table(xtabs(w0 ~ city, sample))# update the weight for citysample$w1 = ifelse(sample$city=="A",.55/P[1], .45/P[2])# tabulate againprop.table(xtabs(w0*w1 ~ city, sample))# city# A    B # 0.55 0.45 prop.table(xtabs(w0*w1 ~ gender, sample))# gender#F         M #0.4889064 0.5110936

So we've balanced city but in doing so we've also unbalanced gender. However, gender is not nearly as out of balance as it was when we begin. This suggests that we can iterate this process to build a weight variable that will balance both gender and city simultaneously. Indeed, here is a simple code that will produce the sample weight for this toy problem: