
# Survey Data: Computing Your Own Weights

The second installment in my series on working with survey data in R explains how to compute your own post-stratification weights to use with survey data. For a more detailed overview on why you might need post-stratification weights, look at my previous post on survey weights.

Working with survey data has been a focal point of the quantitative social sciences for most of the 20th century. The idea behind a survey is to take a sample of the general population and generalize from the sample’s answers to the population. For this, the sample needs to be representative, i.e. reflect the characteristics of the population it was drawn from. Broadly speaking, population characteristics are important because they define strata within the population that behave differently with respect to social science questions. For instance, workers tend to harbor different views than managers, and young people lead different lives than pensioners. Maintaining the population’s mix in the sample is key to coming up with generalizable findings.

What precisely makes a good population characteristic depends on the population, the questionnaire and the research interest. A lot of consideration needs to go into selecting appropriate population characteristics for aligning the sample with the population. Good starters, however, are the distributions of age and gender, or employment status.

Post-stratification weights are actually a very important tool to generalize findings from a sample to a larger population. See for instance Andrew Gelman’s piece in the Washington Post on the subject.

# Computing your own weights

When you conduct a survey yourself, you need to come up with weights. Already when sketching the questionnaire, you need to know which characteristics will later be used to align sample and population. Each one of them needs to be covered by a question in the questionnaire. The idea behind post-stratification is to make sure you have as many pensioners and as many workaholics in your data set as there are in the general population, to extend the example from last time. For this to work, you need the number of workaholics and pensioners in the general population. We call these figures the marginal distributions. Again, which marginal distributions you need depends on your data set and your research questions.

Once you have the marginal distributions, you can use *survey*’s `rake()` function to compute the weights. This is done by an algorithm called Iterative Proportional Fitting (IPF). See Wikipedia’s entry on IPF for all its gory details. For an easier description, consider that this algorithm tries to find weights so that the actual distributions in your data set, when weighted, match the specified marginal distributions of the general population as closely as possible. Sometimes this can get a little out of hand, and weights become too large. In that case you need to trim the weights back to a reasonable range. Having weights between 0.3 and 3 is a good rule of thumb.
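The idea of the iteration can be sketched in a few lines. The following is a minimal, hypothetical Python illustration (the post itself uses R’s `rake()`): the table, its margins and all sample counts are made up, and real IPF implementations handle far more cases (zero cells, more than two margins, convergence diagnostics).

```python
def ipf(counts, row_targets, col_targets, n_iter=100, tol=1e-10):
    """Scale a 2-D contingency table so its row and column sums
    approach the given target margins (Iterative Proportional Fitting)."""
    table = [[float(x) for x in row] for row in counts]
    for _ in range(n_iter):
        # Scale each row so its sum matches the row target.
        for i, target in enumerate(row_targets):
            s = sum(table[i])
            if s > 0:
                table[i] = [x * target / s for x in table[i]]
        # Scale each column so its sum matches the column target.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in table)
            if s > 0:
                for row in table:
                    row[j] *= target / s
        # Stop once the row sums also match after the column step.
        if all(abs(sum(table[i]) - t) < tol
               for i, t in enumerate(row_targets)):
            break
    return table

# Made-up sample counts: sex (rows: M, F) by education (cols: Lo, Mid, Hi),
# n = 100. Population margins scaled to the sample size, as in the post.
sample = [[10, 20, 10], [15, 30, 15]]
fitted = ipf(sample, row_targets=[45, 55], col_targets=[30, 50, 20])

# A cell's weight is its fitted count divided by its observed count.
weights = [[f / s for f, s in zip(frow, srow)]
           for frow, srow in zip(fitted, sample)]
```

After convergence, the fitted table reproduces both population margins, and each respondent receives the weight of their cell.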

There are three steps involved: (a) you need to make a survey design object out of your data, without any weights associated, (b) you need to rake this object, so that you now have weights, and optionally (c), if the weights become too small or too large, you need to trim them.

# Create unweighted survey design object

Let’s use the small artificial data set from the previous post again. In order to have self-contained examples, let’s load the data set (again).

`load(url("http://knutur.at/wsmt/R/RData/small.RData"))`

This data set contains a weight variable, but for now let’s assume it had come without any. This would be the case if, for example, you had conducted the survey yourself.

Now the first step is to create a survey design object without any weights. *survey*’s syntax here is quite straightforward. The `ids` argument is used to tell survey that the data all came from one single primary sampling unit. If we were to survey students nested in classes in schools, or institutions nested in countries, then our respondents would not all come from the same sampling unit. But since our data set `small` is to mimic a simple, run-of-the-mill survey, all respondents were chosen at random from the same list.

In order to create a survey object, let’s first load the survey package:

`library(survey)`

Then we create the unweighted survey object:

`small.svy.unweighted <- svydesign(ids=~1, data=small)`

`## Warning: No weights or probabilities supplied, assuming equal probability`

This just works as before, but now you don’t specify any weights. We still need to compute them (and R tells us that in a warning).

# Rake unweighted survey design

To rake, you first need marginal distributions. Once you have obtained them, you tell the rake function which variables correspond with which distribution. R does everything else.

Specifying these relationships looks a little bit daunting at first, but is actually quite easy. `rake()` uses two arguments for this, `sample.margins` and `population.margins`. `sample.margins` is a list of formulas that tell R where to look for the variables you want to weight with. These formulas can be as simple as `~sex`. But they can also be more complicated, for instance `~sex+edu`, if you have population marginal distributions for education obtained separately for the two sexes.

The argument `population.margins` takes a list of just that, i.e. data frames specifying how often each level occurs in the population. You need one data frame for each of the formulas above.

Here, we want to compute weights based on the distribution of gender (variable `sex`) and education (`edu`). These variables have 2 and 3 levels, respectively. We are going to need the relative frequencies for each of these levels.

So let’s set up the population data frames:

```r
sex.dist <- data.frame(sex = c("M", "F"),
                       Freq = nrow(small) * c(0.45, 0.55))

edu.dist <- data.frame(edu = c("Lo", "Mid", "Hi"),
                       Freq = nrow(small) * c(0.30, 0.50, 0.20))
```

Here, each data frame consists of two vectors: one describing the levels of the associated factor, the other the corresponding frequencies. Note that we multiply the relative frequencies, as obtained from our reference population, with the number of rows in the data set we compute the weights for, in order to obtain absolute frequencies.

In real life, we would not use fantasy distributions but frequencies others have obtained for our population. Where to obtain these frequencies from depends on the reference population. But if you are working with samples that should represent the general population, then checking with your national statistical offices might get you started.

Now it’s time to compute the weights by putting together the marginal distributions from before with our data.

```r
small.svy.rake <- rake(design = small.svy.unweighted,
                       sample.margins = list(~sex, ~edu),
                       population.margins = list(sex.dist, edu.dist))
```

This creates a new survey design object that now contains weights. The argument `sample.margins` contains a list of simple formulas that tell R that the marginal distributions of interest to you are `sex` and `edu`. The argument `population.margins` then specifies the marginal distributions as they are to be found in the general population.

# Trim the weights

As stated above, sometimes it is necessary to trim weights if they have grown too large or too small. This will make your data fit the population marginal distributions less well, but inflating a few cases to too much weight is rarely ever sensible: perhaps that one person who now counts as 50 is somewhat deranged, or otherwise not representative. So it is best to keep an eye on your weights.

```r
summary(weights(small.svy.rake))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.431   0.745   0.945   1.000   1.620   1.630
```

These weights actually fit quite well, with none of them exceeding our rule-of-thumb bounds. If they were smaller than 0.3 or larger than 3, then we should trim the weights. *Survey* provides us with the handy `trimWeights()` function for this. This function takes three arguments: a survey design object, and a lower and upper bound for the weights. It works by setting weights below and above the limits to the limits’ values. The weight that is now “missing” is divided equally among the cases that were not affected by the trimming operation. This can sometimes push cases above the limit. To prevent this, specify the option `strict=TRUE`.

```r
small.svy.rake.trim <- trimWeights(small.svy.rake, lower = 0.3, upper = 3,
                                   strict = TRUE)
```

This creates a new survey design object whose weights have been trimmed to fit into the interval \([0.3, 3]\).
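The trimming logic described above can be sketched in a few lines. The following is a hypothetical Python illustration of the non-strict variant only: clamp the weights to the bounds and spread the trimmed excess equally over the untouched cases (so redistribution may still push a case past a bound; `trimWeights()` with `strict=TRUE` additionally re-trims). The weight vector is made up.

```python
def trim_weights(weights, lower=0.3, upper=3.0):
    """Clamp weights to [lower, upper] and redistribute the trimmed
    excess equally among the cases that were not clamped, so the
    total weight stays constant."""
    clamped = [min(max(w, lower), upper) for w in weights]
    # Total weight removed (or added) by clamping.
    excess = sum(weights) - sum(clamped)
    untouched = {i for i, w in enumerate(weights) if lower <= w <= upper}
    share = excess / len(untouched) if untouched else 0.0
    return [w + share if i in untouched else w
            for i, w in enumerate(clamped)]

# Made-up weights: one below the lower bound, one above the upper bound.
trimmed = trim_weights([0.1, 0.8, 1.0, 5.0])
```

Here the extreme weights 0.1 and 5.0 are pulled to 0.3 and 3.0, and the sum of all weights is unchanged.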

After this, we now have a survey design object with (trimmed) weights applied to it.

With that we conclude this little demonstration on how to compute your own sampling weights. While the main application is certainly with proper random samples, you can also use weights to get potentially biased surveys, like online surveys, to better fit the underlying population. But bear in mind that the only thing weights can do is ensure that your sample composition better mimics the general population’s characteristics. Weights will never help you if the process governing non-response is part of the puzzle you want to solve.

Perhaps noteworthy is also the relation between post-stratified random samples and quota samples. In a random sample, we define a population, draw from that population at random and then compute and apply weights to align the sample with the population. This weighting is necessary because some people originally sampled might be e.g. harder to reach than others, thereby biasing the sample. Once the post-stratification weights have been applied, the random sample is representative of the population it was drawn from. Statistics gives us a method to tell just how accurately the findings from the sample can be generalized.

Quota samples, on the other hand, work differently. There, not only is a population defined, but also how many persons of each stratum are to be sampled. For instance, a quota sample might prescribe that 250 women and 200 men are to be sampled, in a bid to generate a sample that is representative of the population in the first place. At first look, this might make quota samples more attractive, as no post-stratification is required to ensure representativeness. There is a problem, however, and it’s quite a serious one: a quota sample might be representative of the population (if the quotas actually do work, which they do not always do), but it will never satisfy the strict randomness requirements that statistics imposes. Only if we are working with a random sample can we make inferences from the sample to the population. In quota samples, there is not sufficient randomness, as the interviewer actively selects the interviewees. Therefore, quota samples cannot be used to reason about the general population.

In the next post of this series we are going to dive deeper into the `survey` package and compute descriptive statistics for survey-weighted data. Again, if you have any thoughts on computing your own post-stratification weights, please do leave a comment.
