Analysing survey data can be tricky. There’s often a mismatch between the characteristics of the survey respondents and and those of the general population. If the discrepancies are not accounted for then the survey results can (and generally will!) be misleading.
A common approach to this problem is to weight the individual survey responses so that the marginal proportions of the survey are close to those of the population. Raking (also known as proportional fitting, sample-balancing, or ratio estimation) is a technique for generating the required weights.
A fictional population and an imaginary survey are used to illustrate how survey raking works.
Fictional Population Characteristics
A census gives a breakdown of the fictional population according to three categories: mood, sex and age.
> population mood sex age fraction 1 happy female young 0.0750 2 neutral female young 0.1250 3 grumpy female young 0.0500 4 happy male young 0.0750 5 neutral male young 0.1250 6 grumpy male young 0.0500 7 happy female middle 0.0600 8 neutral female middle 0.1000 9 grumpy female middle 0.0400 10 happy male middle 0.0600 11 neutral male middle 0.1000 12 grumpy male middle 0.0400 13 happy female senior 0.0165 14 neutral female senior 0.0275 15 grumpy female senior 0.0110 16 happy male senior 0.0135 17 neutral male senior 0.0225 18 grumpy male senior 0.0090
To apply raking we only need to know the marginal proportions of these categories, which can be derived from the data above. Use the
wpct() function from the
library(weights) target <- with(population, list( mood = wpct(mood, fraction), sex = wpct(sex, fraction), age = wpct(age, fraction) ))
Store the marginal proportions in a
list, where the elements of the list are named according to variable they’ll match in survey data.
> str(target) List of 3 $ mood: Named num [1:3] 0.3 0.5 0.2 ..- attr(*, "names")= chr [1:3] "happy" "neutral" "grumpy" $ sex : Named num [1:2] 0.505 0.495 ..- attr(*, "names")= chr [1:2] "female" "male" $ age : Named num [1:3] 0.5 0.4 0.1 ..- attr(*, "names")= chr [1:3] "young" "middle" "senior"
Imaginary Survey Results
The imaginary survey has ten thousand responses to a binary question. I’m imagining that it’s a survey for an important referendum, gauging the sentiment of the population on a particular proposal.
> nrow(survey)  10000
Here’s a sample of the results.
> head(survey) caseid mood sex age response 1 1 grumpy male young 1 2 2 happy male senior 1 3 3 neutral male senior 1 4 4 happy female young 1 5 5 grumpy male young 1 6 6 happy female young 0
Now let’s take a look at the proportions of the survey characteristics.
> wpct(survey$mood) happy neutral grumpy 0.3308 0.3372 0.3320 > wpct(survey$sex) female male 0.4971 0.5029 > wpct(survey$age) young middle senior 0.3393 0.3341 0.3266
The survey design was such that the levels of each of the survey characteristics were sampled in similar proportion. For sex this works well because the sample proportions are close to those in the population. For mood, however, this means that the proportion of neutral in the survey is less than in the population, while the sample proportion of grumpy is higher than in the population. Similarly, for age the senior category is oversampled in the survey, while the middle and young categories are undersampled.
What’s the result of the survey if we naively just average the responses?
> wpct(survey$response) 0 1 0.4277 0.5723
It appears that 57% of the survey respondents were in favour of the proposal. Of course, we know that this is probably not an accurate reflection of the sentiment of the population, because our survey is not an accurate representation of the population demographics.
Let’s use raking to fix that!
We’ll use the
anesrake package, which implements the ANES (American National Election Study) weighting algorithm. The algorithm, documented by DeBell and Krosnick (Computing Weights for American National Election Study Survey Data, 2009), aims to provide a default approach to survey weighting (there’s no single “right” way to do survey weighting, but this is a decent starting point). It uses an iterative procedure to generate multiplicative weights. The weights are chosen so that the survey marginals agree with the population marginals for a specific set of parameters. In each iteration all parameters are considered in turn. For each parameter weights are adjusted to align the survey marginals with the population marginals. These weights are then used as the starting point for the next parameter.
anesrake() function has three required parameters:
inputter– a list of target values
dataframe– the survey results and
caseid– a unique identifier for each respondent in the survey (normally a part of the survey results).
The remaining parameters are optional. The
type parameters affect the way that parameters are selected for inclusion in the weighting procedure. Ideally you want to choose only those parameters where there is a significant discrepancy between the sample and population proportions.
> raking <- anesrake(target, + survey, + survey$caseid, + cap = 5, # Maximum allowed weight per iteration + choosemethod = "total", # How are parameters compared for selection? + type = "pctlim", # What selection criterion is used? + pctlim = 0.05 # Threshold for selection + )  "Raking converged in 7 iterations"
It returns a list with a number of components. Fortunately there’s a
summary() method which gives a quick overview of the results.
> raking_summary <- summary(raking)
Which variables were used for weighting?
> raking_summary$raking.variables  "mood" "age"
Only mood and age were used. As we observed previously, the sample proportions for sex are already pretty close to those of the population.
Let’s take a look at the specifics for those two variables.
> raking_summary$mood Target Unweighted N Unweighted % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc. happy 0.3 3308 0.3308 3000 0.3 -0.0308 0 -0.0308 neutral 0.5 3372 0.3372 5000 0.5 0.1628 0 0.1628 grumpy 0.2 3320 0.3320 2000 0.2 -0.1320 0 -0.1320 Total 1.0 10000 1.0000 10000 1.0 0.3256 0 0.3256
The summary data for mood gives the number of cases and proportions for each level in both the unweighted and weighted survey. It also tells us
- how the proportions change between the unweighted and weighted surveys (the
Change in %column);
- the discrepancy between the unweighted and target proportions (the
Orig. Disc.) — the discrepancy before weighting; and
- the discrepancy between the weighted and target proportions (the
Resid. Disc.) — the residual discrepancy after weighting.
Obviously we want the residual discrepancy to be as small as possible and here we can see that the mood proportions are perfect after weighting.
> raking_summary$age Target Unweighted N Unweighted % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc. young 0.5 3393 0.3393 5000 0.5 0.1607 1.110223e-16 0.1607 middle 0.4 3341 0.3341 4000 0.4 0.0659 0.000000e+00 0.0659 senior 0.1 3266 0.3266 1000 0.1 -0.2266 -1.387779e-17 -0.2266 Total 1.0 10000 1.0000 10000 1.0 0.4532 1.249001e-16 0.4532
Similarly, for age the proportions are effectively perfect after weighting.
What about sex? Well, although sex was not used in the weighting algorithm, the weightings still have an effect on their proportions.
> raking_summary$sex Target Unweighted N Unweighted % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc. female 0.505 4971 0.4971 5006.939 0.5006939 0.003593906 0.004306094 0.0079 male 0.495 5029 0.5029 4993.061 0.4993061 -0.003593906 -0.004306094 -0.0079 Total 1.000 10000 1.0000 10000.000 1.0000000 0.007187812 0.008612188 0.0158
The discrepancies in the unweighted results are small and the residual discrepancies after weighting are smaller still (it could have gone the other way too!).
Survey Results: Weighted versus Unweighted
It makes sense to inject the weights into the survey data.
> survey$weight <- raking$weightvec
What does that look like?
> head(survey, 10) caseid mood sex age response weight 1 1 grumpy male young 1 0.8790030 2 2 happy male senior 1 0.2732412 3 3 neutral male senior 1 0.4531776 4 4 happy female young 1 1.3322952 5 5 grumpy male young 1 0.8790030 6 6 happy female young 0 1.3322952 7 7 neutral male middle 1 1.8035884 8 8 grumpy male senior 1 0.1802753 9 9 grumpy female young 0 0.8790030 10 10 happy female young 0 1.3322952
Characteristics that were oversampled in the survey (like grumpy young females) get a weight smaller than one, while those that were undersampled (like middle aged neutral males) get a weight that is larger than one. Note that since sex was not included in the weighting algorithm, it does not influence weight: the weight for a grumpy young female is the same as that for a grumpy young male.
What is the range of weights?
> survey %>% select(mood, age, weight) %>% unique() %>% arrange(weight) mood age weight 1 grumpy senior 0.1802753 2 happy senior 0.2732412 3 neutral senior 0.4531776 4 grumpy middle 0.7174723 5 grumpy young 0.8790030 6 happy middle 1.0874649 7 happy young 1.3322952 8 neutral middle 1.8035884 9 neutral young 2.2096458 > raking_summary$weight.summary Min. 1st Qu. Median Mean 3rd Qu. Max. 0.1803 0.4532 0.8790 1.0000 1.3323 2.2096
The weights are always adjusted so that the average weight is precisely one. Large weights are likely to be problematic. The
cap parameter to
anesrake() places an upper limit on the weights in each iteration. The largest weight in our survey falls below the default upper limit of 5.
What was the effect of raking on the survey results?
> wpct(survey$response, survey$weight) 0 1 0.5340386 0.4659614
Wow! The outcome swings completely the other way. A naive interpretation of the survey results would have been completely misleading. Actually around 53% of the population are opposed to the proposal.
Weighting results in an increase in the variance of means or proportions derived from survey results. We can get an idea of how large this effect is by looking at the design effect.
> raking_summary$general.design.effect  1.422105
This indicates that weighting results in a 42% increase in variance. Unfortunately this means that conclusions are less likely to be statistically significant. However, this is probably a worthwhile trade off: they might not be significant, but at least they will be representative of the population!