**quantitate**, and kindly contributed to R-bloggers)

The California Health Interview Survey (CHIS) is a remarkable biannual survey of health status, care access, and demographics for California residents. The 2011-2012 public use survey data has recently been released and is freely available after registering at the CHIS site. CHIS currently offers data in SAS, Stata, and SPSS formats. However, thanks to a few great R packages, it’s not very hard to bring this complex and interesting data set into R. In this post, we’ll cover some practical details about using CHIS data. We’ll also take up a more technical topic about how the CHIS weights were build and about how they should and should not be used. Before we begin, here are some key references for CHIS:

- UCLA maintains extensive documentation of the CHIS methodlogy including important details of how the sample weighting was developed and implemented. Definitions for the CHIS variables can be found in the data dictionary included with the CHIS public use datasets.
- Thomas Lumley created and maintains the survey package. He has provided a description of how to work with the CHIS survey weights here and also in his book.

### Getting The Data

To begin, we first need to register (free) at the CHIS site and download the data. The CHIS is a landline and mobile telephone survey, designed to represent the non-institutionalized resident population of California. The data is divided into three subsets: child, adolescent, and adult. For simplicity, we’ll just work with the adult data set which has about 43k records. To bring the data into R, download the Stata flavor from CHIS, unzip it, and load it into R as follows:

# Read CHIS file

library(foreign)

file <- "~/Desktop/CHIS/STATA/chis11-12_adult_stata/Data/ADULT.dta" # your file

options(nwarnings=500) # Bump up number of warnings -- see note below

CHIS <- read.dta(file, convert.factors = TRUE)

warnings() # print the 257 warning messages

# Note: read.dta generates warnings because some of the Stata

# value label definitions are not found in the raw Stata file. Thus,

# some variables will show the raw survey coding and some will print

# with the decoded labels from CHIS. Consult the data dictionary to translate

# raw survey codes. To be really careful, we could set convert.factors = FALSE

# to work only with the raw codes for all variables.

Now that we’ve got the data loaded, we want to take a stab at analyzing it. Browsing through the data dictionary, we see that CHIS records educational attainment in the variable *aheduc*. How can we tabulate this variable? Observe that CHIS includes 81 variables, * rakedw0, rakedw1,…,rakedw80 *, which are called replicate weights. For the moment, we’ll disregard all of these weights except for *rakedw0*. The variable *rakedw0* is the sample weight and it tells us how much each of the 43k observations should “count” in an analysis. Using R’s xtabs function, we can tabulate educational attainment with the weight *rakedw0*:

# Educational attainment -- counts

# as.data.frame, round are just aesthetic choices.

as.data.frame(round(xtabs(rakedw0 ~ aheduc, CHIS, drop.unused.levels = TRUE),0))

# aheduc Freq

# 1 GRADE 1-8 2214253

# 2 GRADE 9-11 1929232

# 3 GRADE 12/H.S. DIPLOMA 6747336

# 4 SOME COLLEGE 4155536

# 5 VOCATIONAL SCHOOL 842097

# 6 AA OR AS DEGREE 1936675

# 7 BA OR BS DEGREE 5918519

# 8 SOME GRAD. SCHOOL 284266

# 9 MA OR MS DEGREE 2537555

# 10 PH.D. OR EQUIVALENT 930248

# 11 NO FORMAL EDUCATION 300768

# sum over the table

margin.table(xtabs(rakedw0 ~ aheduc, CHIS, drop.unused.levels = TRUE))

# 27796484

Though we had only 43k records in the sample, the sum over all the categories is about 28M people which is roughly the size of the non-institutionalized adult population of California. This is because the sample weight *rakedw0* was designed so that CHIS totals match closely to known CA demographics (primarily from the California Department of Finance’s Demographics Unit).

Also, we observe that:

# Male/Female proportion in CHIS using the sample weight (correct)

round(prop.table(xtabs( rakedw0 ~ srsex, CHIS, drop.unused.levels = TRUE))*100,2)

# srsex

# MALE FEMALE

# 48.72 51.28

# Male/Female proportion in RAW CHIS (wrong!!)

round(prop.table(xtabs( ~ srsex, CHIS, drop.unused.levels = TRUE))*100,2)

# srsex

# MALE FEMALE

# 41.57 58.43 <-- (!)

The raw CHIS sample contains many more women than we’d expect from a completely random draw from the California population. The sample weight *rakedw0 * includes an adjustment to rebalance the over-representation of females. Later in the post, we’ll explore the CHIS weighting method in more detail and also explain the reason for the other weights *rakedw1,…, rakedw80*.

### Simple point estimates with CHIS

CHIS is a great resource for investigating how demographic and health outcome factors are related. Let’s use the data to assess a simple hypothesis: Californians with lower educational attainment are more likely to be overweight/obese (*ovrwt*).

When working with CHIS data it is frequently convenient to group and recode categorical variables. There are a number of ways we could do this but the “car” package offers a recoding function with a simple syntax:

# Recode educational attainment

library(car)

recodeSTR = "

c('NOT ASCERTAINED') = 'UNKNOWN';

c('NO FORMAL EDUCATION','GRADE 1-8', 'GRADE 9-11',

'GRADE 12/H.S. DIPLOMA','SOME COLLEGE',

'VOCATIONAL SCHOOL','AA OR AS DEGREE') = 'LESS THAN BA';

c('BA OR BS DEGREE','SOME GRAD. SCHOOL') = 'BA';

c('MA OR MS DEGREE','PH.D. OR EQUIVALENT') = 'MA OR MORE' "

CHIS$college_grad = recode(CHIS$aheduc, recodeSTR)

Here’s a simple visual tabulation from CHIS to support our hypothesis:

# make a plot

require(ggplot2)

require(scales)

# Educational attainment (recoded) vs overweight

round(prop.table(xtabs(rakedw0 ~ college_grad + ovrwt,

CHIS, drop.unused.levels=TRUE),1)*100,1)

#college_grad YES NO

# BA 51.2 48.8

# LESS THAN BA 64.6 35.4

# MA OR MORE 50.5 49.5

# summary df

df <- as.data.frame(prop.table(xtabs(rakedw0 ~ college_grad + ovrwt,

CHIS, drop.unused.levels=TRUE),1))

# reorder the factors for sake of the plot

df$college_grad = factor(df$college_grad,

levels=c('LESS THAN BA','BA','MA OR MORE'))

# plot

ggplot(df, aes(x=college_grad, y=Freq, fill=ovrwt)) + geom_bar(stat="identity") +

scale_y_continuous(labels = percent_format()) +

ylab("% of Adults") + xlab("Education Level") +

scale_fill_manual(values = c("blue","yellow"))

Now that we’ve found a potentially interesting relationship in the dataset, our next step might be to check that this relationship is statistically significant. This means we need to compute standard errors for our cross tabulation. Here’s where working with the complex CHIS survey design gets a bit tricky…

### Replicate weights and estimator variances

At this point, it’s useful take a more detailed look at the weighting methodology employed in CHIS and other complex surveys. This section is a little technical but the key practical consequences are summarized at the end. The CHIS sample weights are designed through a process called raking (which is similar to iterated proportional fitting). To understand this process, imagine that we are conducting a study on smoking and have collected the following sample:

sample <- data.frame(

gender = c(rep("M",3),rep("F",7)),

city = c(rep(c("A","B"),4),"B","B"),

smoke = c(rep(c(1,0),3),1,0,1,1)

)

# gender city smoke

#1 M A 1

#2 M B 0

#3 M A 1

#4 F B 0

#5 F A 1

#6 F B 0

#7 F A 1

#8 F B 0

#9 F B 1

#10 F B 1

From an external source, such as census data, suppose we know with high-confidence that 52% of the total population (city A + city B) are female and 48% are male. Suppose we also know that the population is distributed so that 55% of people live in A and 45% live in B. These external facts are called usually called control totals. Our sample is clearly not representative of the population as 40% of respondents are from city A and 70% are female. If we want to estimate the joint distribution of smoking rates by gender and city, what can we do?

Consider the strategy of assigning a weight $w_i$ to each respondent in the sample. This weight tells us how much each observation should count for in the cross tabulation. Then, for example, we can estimate: $$ p(\mathrm{smoke}, M ,A) \approx \frac{1}{10} \sum_{i: \mathrm{gender}_i = M, \mathrm{city}_i = A } w_i \cdot \mathrm{smoke}. $$ The key issue is how to choose the weight $w_i$. Raking is one such strategy which can make sense for complex surveys.

In a raking procedure, we begin with one of the control variables (say gender) and choose $w_i$ so that the sample distribution will match the population distribution. For example, to get 48% male representation we’d need to give each male in the sample a weight such that $$.3 \cdot w_i(\mathrm{gender}_i = M) = .48 \qquad \Rightarrow \qquad w_i(\mathrm{gender}_i = M) = \frac{0.48}{0.3} = 1.6.$$ Similarly, to get 52% female representation, we’d need to give each female in the sample a weight such that $$ w_i(\mathrm{gender}_i = F) = \frac{0.52}{0.7} \approx 0.7429.$$ We can then recompute the crosstabs with this initial weight:

sample$w0 = ifelse(sample$gender=="M",.48/.3, .52/.7)

prop.table(xtabs(w0 ~ gender, sample))

#gender

# F M

#0.52 0.48

prop.table(xtabs(w0 ~ city, sample))

#city

# A B

#0.4685714 0.5314286

Now gender is in balance but city remains out of balance. To balance the city proportions, we need to *weight the weighted data*.

# tabulate with current weight

P <- prop.table(xtabs(w0 ~ city, sample))

# update the weight for city

sample$w1 = ifelse(sample$city=="A",.55/P[1], .45/P[2])

# tabulate again

prop.table(xtabs(w0*w1 ~ city, sample))

# city

# A B

# 0.55 0.45

prop.table(xtabs(w0*w1 ~ gender, sample))

# gender

#F M

#0.4889064 0.5110936

So we’ve balanced city but in doing so we’ve also unbalanced gender. However, gender is not nearly as out of balance as it was when we begin. This suggests that we can *iterate* this process to build a weight variable that will balance both gender and city simultaneously. Indeed, here is a simple code that will produce the sample weight for this toy problem:

# controls

pop.gender = c(F=.52,M=.48)

pop.city = c(A=.55,B=.45)

# set initialize the weight

sample$w <- 1

# run the algorithm (only need a few iterations to converge)

for(i in 1:10){

# get current weighted proportion for gender

P0 <- prop.table(xtabs(w ~ gender, sample))

sample$w0 = ifelse(sample$gender=="F", pop.gender[1]/P0[1], pop.gender[2]/P0[2])

#update weight

sample$w = sample$w * sample$w0

# get current weighted proportion for city

P1 <- prop.table(xtabs(w ~ city, sample))

sample$w1 = ifelse(sample$city=="A", pop.city[1]/P1[1], pop.city[2]/P1[2])

#update weight

sample$w = sample$w * sample$w1

}

# It works!

prop.table(xtabs(w ~ gender, sample))

# gender

# F M

# 0.52 0.48

prop.table(xtabs(w ~ city, sample))

# city

# A B

# 0.55 0.45

This simplified raking algorithm is a good approximation to how the weights in CHIS (and other complex surveys) are developed. Of course in CHIS, the process is much more complicated and the rebalancing is done over 11 design variables and with many more factor levels.

Raking is an interesting procedure from a technical perspective. Why would we expect this algorithm to converge? When it does converge, why can we use these weights to obtain unbiased estimates within cells? These questions are well outside the scope of this post but there is a fair amount of literature on the topic (this article, for example). Practically speaking, what we need to know for working with CHIS are the following facts:

- The raked weight RAKEDW0 can be used to obtain unbiased estimators of means and proportion in the usual way, e.g. with xtabs in R.
- Variances (and hence standard errors) for these estimators
*cannot*be computed directly using RAKEDW0. A nice description of this issue can be found in this article.

In the next section, we’ll explain how to correctly compute the standard errors for means and proportions in CHIS.

### Standard errors with Replicate Weights

To handle the standard errors, we need to make use of the replicate weights *rakedw1,…,rakedw80*. Essentially, these weights capture variability across subsamples of the CHIS sample and from the variability within these subsamples we can deduce the correct variance estimates for our point estimators. The details of how the CHIS replicate weights were created and how they are used can be found in section 9.2 of the CHIS methodology manual.

Fortunately, R’s survey package was built precisely to handle complex survey designs, including replicate weights. The key feature of the survey package is that all trickiness of the CHIS design can be encapsulated in a survey design object $D$. Once R knows about the design, survey’s methods will take care of the details of correctly computing the standard errors behind the scenes. Let’s return to our education/obesity example and compute standard errors for our cross tabulations:

# survey analysis with the survey package

require(survey)

# remove empty levels from overwt (aesthetic)

CHIS$ovrwt <- factor(CHIS$ovrwt)

# design structure -- settings are thanks to the references

# by T. Lumley. The design object captures all the information

# about the replicate weights.

D <- svrepdesign(variables = CHIS[,c("college_grad", "ovrwt")],

repweights = CHIS[,grep("rakedw[1-9]",colnames(CHIS))],

weights = CHIS$rakedw0,

combined.weights = TRUE,

type = "other",

scale = 1,

rscales = 1)

# compute the means and standard errors

svyby( ~ovrwt, ~college_grad, D,

FUN = svymean, keep.names = FALSE)

# college_grad ovrwtYES ovrwtNO se1 se2

#1 BA 0.5122654 0.4877346 0.007344430 0.007344430

#2 LESS THAN BA 0.6456145 0.3543855 0.005697532 0.005697532

#3 MA OR MORE 0.5051845 0.4948155 0.009186273 0.009186273

# compare with the naive standard error (WRONG!):

n <- sum(CHIS$rakedw0[CHIS$college_grad == "BA"])

x <- sum(CHIS$rakedw0[CHIS$college_grad == "BA" & CHIS$ovrwt == "YES"])

p <- x/n # proportion

SE <- sqrt(p*(1-p)/n) # standard error in proportion

# 0.0002006993

# This is much smaller error than what we got with the correct

# estimate provided by survey! Thus, we would be overly confident

# about our results if used the naive method.

**leave a comment**for the author, please follow the link and comment on their blog:

**quantitate**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...