Advanced Survey Design and Application to Big Data

[This article was first published on R – Daniel Oehm | Gradient Descending, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I like to describe Official statistics as the All Bran of statistics, it’s bland and a bit boring but it is good for you. It is key for any government to manage the economy, provide services where they are needed and monitor the growth of the nation. There are many facets to official statistics but where statistics plays a major role is in advanced survey design. Often survey sampling techniques can be applied to other real world scenarios but are often left solely for population and business surveys. I have had success applying these techniques to sampling data from big data sources and creating balanced training sets for rare event estimation and prediction. Many governments are transitioning to open source software such as R and a key package is ‘survey’. This post will give examples of how to implement advanced survey designs using the survey package in R. I’ll provide some commentary around how these techniques can be applied to other problems outside of official statistics, in particular in the application of big data.

Set up test data

Firstly, a frame will be simulated and from which we’ll take a sample and attempt to estimate the population total. The frame will consist of 3 stratification variables, location, age and sex. It is assumed that these initial values are known prior to the sample being selected from the population Census. Two response variables will be simulated, employment indicator (0, 1) and income. Income will be simulated to be correlated with age group but random for the other stratification variables. The frame won’t be simulated to represent the true population, it will simply be an example set of data. Although, it would be quite simple to simulate a frame that represents the Australian population by utilising the data from the ABS website.

# creating stratafication variables
location <- as.factor(1:3)
ploc <- c(0.35, 0.45, 0.2)
age <- c("agegrp15-24", "agegrp25-40", "agegrp41-65", "agegrp65+")
page <- c(0.2, 0.25, 0.25, 0.3)
sex <- c("male", "female")
psex <- c(0.5, 0.5)

strata.vars <- list(location = location, age = age, sex = sex)
strata.probs <- list(ploc, page, psex)

# create population frame
create.pop.frame <- function(N, strata, probs){
  if(length(strata)!= length(probs)) stop("strata and probs not the same length")
  df <- NULL
  for(k in 1:length(strata)){
    temp <- sample(strata[[k]], N, replace = T, prob = probs[[k]])
    df <- cbind(df, temp)
  colnames(df) <- names(strata)
  df <-

  # PSU id
  PSUsize <- c(15, 25, 10)
  df$PSU <- NA
  for(k in 1:3){
    df$PSU[df$location == k] <- paste0(k, sample(1:PSUsize[k], sum(df$location == k), replace = TRUE))

  # make response variables
  ngroups <- tbl_df(df) %>% group_by(location, age, sex) %>% summarise(Nh = length(location))
  pemp <- rbeta(nrow(ngroups), 6, 4)
  emp <- NULL
  for(k in 1:nrow(ngroups)){
    emp <- c(emp, sample(0:1, ngroups$Nh[k], replace = TRUE, prob = c(1-pemp[k], pemp[k])))

  nage <- tbl_df(df) %>% group_by(age) %>% summarise(Nh = length(age))
  mu_income <- rnorm(nrow(nage), c(60000, 80000, 100000, 80000), 2000)
  income <- rep(NA, N)
  for(k in 1:nrow(nage)){
    income["==")(df$age, nage$age[k])] <- rnorm(nage$Nh[k], sqrt(mu_income[k]), sqrt(1000))^2

  # return frame
  df <- data.frame(df, emp, income)
  return(tbl_df(df) %>% arrange(location, age, sex))

frame <- create.pop.frame(10000, strata.vars, strata.probs)

## # A tibble: 10,000 x 6
##    location age         sex    PSU     emp income
##    <fct>    <fct>       <fct>  <chr> <int>  <dbl>
##  1 1        agegrp15-24 female 16        0 54453.
##  2 1        agegrp15-24 female 15        1 58849.
##  3 1        agegrp15-24 female 111       1 88474.
##  4 1        agegrp15-24 female 12        0 71488.
##  5 1        agegrp15-24 female 113       1 94494.
##  6 1        agegrp15-24 female 12        1 71789.
##  7 1        agegrp15-24 female 111       0 47044.
##  8 1        agegrp15-24 female 112       1 39650.
##  9 1        agegrp15-24 female 12        1 77552.
## 10 1        agegrp15-24 female 16        1 65822.
## # ... with 9,990 more rows

The histograms below show the difference between the income distributions for the four age groups.

ggplot(frame, aes(x = income, fill = age)) + geom_histogram() + facet_grid(age ~ .)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot of chunk incme plot

A key part of survey design and survey estimation is calibrating the estimates to meet known benchmark totals. These are often population totals from the most recent Census or totals of turnover for industries provided by the tax office. This is necessary since not household or business selected in a survey will respond causing a non-response bias. For example it is known that employed persons are less likely to respond to the labour force survey than unemployed persons or those not in the labour force, likely due to the fact they are busy and who wants to fill out a survey anyway? Below we’ll set up the benchmark variables from the frame we created. A key economic indicator is the total number of a employed and unemployed persons in a nation, therefore in this case the non-response mechanism is correlated with the variable of interest causing a non-response bias. The only true way to remove this bias is to achieve 100% response rate – which never happens – therefore even more important correct for bias where possible.

benchmark.values <- tbl_df(frame) %>% group_by(location, age, sex) %>% 
  summarise(Nh = length(emp), total_emp = sum(emp), mean_emp = mean(emp), total_income = sum(income), mean_income = mean(income))

benchmark.values$poststrata <- paste0("poststrata", 1:nrow(benchmark.values))

frame <- left_join(frame, benchmark.values[,c("location", "age", "sex", "poststrata", "Nh")], by = c("location" = "location", "age" = "age", "sex" = "sex"))

benchmark.location <- frame %>% group_by(location) %>% summarise(Nh = length(location))
benchmark.age <- frame %>% group_by(age) %>% summarise(Nh = length(age)) <- frame %>% group_by(sex) %>% summarise(Nh = length(sex))

Stratified sampling

A key technique is stratified sampling. By controlling how many units are sampled from defined strata such as location, age group and sex it is easier to achieve a representative responding sample. Using the sampling package it is simple to take a stratified sample from our frame. The estimators for stratified sampling are

    \[ \hat{T} = \sum^H_{h = 1} \sum_{j \in S}{\frac{N_h}{n_h} y_{hj}} = N_h\bar{y}_h \]

where \hat{T} is the survey total and \bar{y}_h is the stratum mean. The initial weights for a stratified sample are given by w_{ih} = \frac{N_h}{n_h} The sample standard deviation is given by

    \[ s^2_h = \sum_{j \in S}\frac{(y_{hj} - \bar{y}_h)^2}{n_h - 1} \]

therefore the variance of the estimate of total is given by

    \[ \text{Var}(\hat{T}) = \sum^H_{h = 1} \left( 1 - \frac{n_h}{N_h} \right) N_h^2 \frac{s^2_h}{n_h} \]

# take stratified sample
frame.summary <- tbl_df(frame) %>% group_by(location, age, sex) %>% summarise(Nh = length(location))
nh <- round(frame.summary$Nh*rbeta(nrow(frame.summary), 5, 95))
stratified.sample <- sampling::strata(frame, stratanames = c("location", "age", "sex"), size =  nh, method = "srswor")

stratified.sample$initial_wt <- 1/stratified.sample$Prob
stratified.sample$emp <- frame$emp[stratified.sample$ID_unit]
stratified.sample$income <- frame$income[stratified.sample$ID_unit]

To replicate a real world scenario non-response will be introduced into the data. The non-response will be correlated with employment as described above. The response rate of those employed is 85% and those unemployed 95%. For household surveys this may seem high however fairly on par with current ABS LFS response rates.

# introduce non-response associated with employment
introduce.nonresponse <- function(df){
  emp.response.rate <- 0.85
  unemp.response.rate <- 0.95
  df$response_flag <- 1
  df$response_flag[df$emp == 1] <- sample(0:1, sum(df$emp == 1), replace = TRUE, prob = c(1-emp.response.rate, emp.response.rate))
  df$response_flag[df$emp != 1] <- sample(0:1, sum(df$emp != 1), replace = TRUE, prob = c(1-unemp.response.rate, unemp.response.rate))
  responding.df <- df[df$response_flag == 1,]

responding.stratified.sample <- introduce.nonresponse(stratified.sample)

The next step is to set the survey design object. This stores the responding sample data, initial weights and advanced survey design information such as the stratification variables. This object is then passed to other survey functions for calibration and estimation. There are some nuances with the survey design object such as the id variable refers to the cluster id which always needs to present given if it is not a cluster sample design and the population (benchmark) vector can be tricky to operate. Here is what works best for me.

# set survey design <- svydesign(id = ~1, strata = ~location + age + sex - 1, weights = ~initial_wt, data = responding.stratified.sample)

Here the strata are location, age and sex as sampled.

The survey design object is then passed to the calibrate function to adjust the initial weights so they sum to the strata benchmark values using GREG. GREG calibration is an important step in survey estimation to limit the bias caused by non-response. It draws strength from correlated auxiliary variables and adjusts the weights such that the final weights sum to benchmark values and minimises the residual variance. This is done by the model

    \[ Y_i | \mathbf{x}_i = \mathbf{x}^T_i\mathbf{\beta} + \epsilon_i \]

where \mathbf{x}_i = (x_{i1}, x_{i2}, … , x_{ip}) and \text{var}(\epsilon_i) = \sigma^2_i. In our example \mathbf{x} corresponds to our stratification variables location, age and sex. The population totals (benchmark values) \mathbf{T_x} = \sum^N_iw_ix_i are known. The goal is to adjust wi such that the total are maintained. The parameters of the model are estimated using weighted least squares

    \[ \mathbf{\beta} = (\mathbf{X}^T\mathbf{W}\mathbf{\Sigma^{-1}\mathbf{X}})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{\Sigma^{-1}} \mathbf{y} \]

where \mathbf{W} is the initial weight vector and \mathbf{\Sigma} = diag(\sigma^2_1, \sigma^2_2, \dots, \sigma^2_n). Therefore the generalised regression estimator of the population total is given by

    \[ \hat{T} = \hat{T} + (\mathbf{T_x} - \mathbf{\hat{T}_x}) \mathbf{\hat{\beta}} \]

The second part to this equation is the adjustment factor. In turn this translates to a weight adjustment as follows

    \[ \begin{array}{r l} \hat{T} &= \hat{T} + (\mathbf{T_x} - \mathbf{\hat{T}_x}) \mathbf{\hat{\beta}} \\ \sum^n_{i=1} a_i y_i &= \sum^n_{i=1}w_i y_i + (\mathbf{T_x} - \mathbf{\hat{T}_x})^T (\mathbf{X}^T\mathbf{W}\mathbf{\Sigma^{-1}\mathbf{X}})^{-1}\mathbf{X}^T \sum^n_{i=1}\frac{ w_i y_i}{\sigma^2_i} \\ &= \sum^n_{i=1}w_i \left(1 + (\mathbf{T_x} - \mathbf{\hat{T}_x})^T(\mathbf{X}^T\mathbf{W}\mathbf{\Sigma^{-1}\mathbf{X}})^{-1} \frac{\mathbf{x_i}}{\sigma^2_i}\right) y_i \\ &= \sum^n_{i=1} w_i g_i y_i \end{array} \]

Here it is easy to see that g_i = 1 + (\mathbf{T_x} - \mathbf{\hat{T}_x})^T(\mathbf{X}^T\mathbf{W}\mathbf{\Sigma^{-1}\mathbf{X}})^{-1} \frac{\mathbf{x_i}}{\sigma^2_i} is the adjustment factor for unit i and the final weights are given by a_i = w_i g_i.

Using the survey package, GREG calibration is conducted using calibrate function as follows

greg.wt <- calibrate(, ~ location + age + sex - 1, population = c(benchmark.location$Nh, benchmark.age$Nh[-1],$Nh[-1]))

Technically these are poststrata and don’t have to be the same as the sample design. Poststrata are obtained from the responding sample which were not known before the collection and have a better chance of correcting the bias. The -1 term in the formula subtracts the intercept off the model matrix. I prefer this so we can simply input the benchmark values for the first strata. The subsequent strata need to have the first value removed, due to how the model matrix is set up. As with regression the model matrix is constructed in an efficient way to store all the necessary information in the smallest space. This can be modified but I find this is OK. It is also important to ensure the benchmark values are in the correct order as there is not check to ensure the right strata are being calibrated to the right population total.

In this form, the benchmarks are actually marginal benchmarks and the weights are calibrated using the raking algorithm. If the post-strata was only a single variable being the interaction between location, age and sex the analytical solution can be used to calculate the weight adjustment and should give the same result. I find using marginal benchmarks easier due to less wrangling and easier to read.

It is good practice to observe the calibrated weights compared with the initial weights to see how much the were adjusted.

# check weight adjustments
wtadj <- weights(greg.wt)/responding.stratified.sample$initial_wt

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.047   1.088   1.124   1.126   1.162   1.207

If there is a large difference between the initial and final weights it may indicate the presence of bias.

Calculate survey estimates from the calibrated weights.

svytotal(~emp, greg.wt)

##      total     SE
## emp 5957.8 243.26

svymean(~emp, greg.wt)

##        mean     SE
## emp 0.59578 0.0243

svyby(~emp, by = ~sex, FUN = svytotal, greg.wt)

##           sex      emp       se
## female female 3041.054 165.9938
## male     male 2916.754 177.8602

svyby(~income, by = ~age, FUN = svymean, greg.wt)

##                     age   income       se
## agegrp15-24 agegrp15-24 64972.20 1442.575
## agegrp25-40 agegrp25-40 83778.26 1608.418
## agegrp41-65 agegrp41-65 98837.92 2615.439
## agegrp65+     agegrp65+ 78083.96 1712.318

svytotal(~income, greg.wt)

##            total      SE
## income 820624776 9688091

svymean(~income, greg.wt)

##         mean     SE
## income 82063 968.81

Stratified Systematic sampling

Systematic sampling can provide some gains when the frame is ordered with respect to a correlated variable. The sampling package handles systematic sampling by user supplied inclusion probabilities. Here the population totals are supplied and therefore will be effectively the same as straight stratified sampling. Ideally a continuous variable is supplied such as total turnover for business surveys.

# take systematic sample
systematic.sample <- sampling::strata(frame, stratanames = c("location", "age", "sex"), size = nh, method = "systematic", pik = frame$Nh)
systematic.sample$initial_wt <- 1/systematic.sample$Prob
systematic.sample$emp <- frame$emp[systematic.sample$ID_unit]
systematic.sample$income <- frame$income[systematic.sample$ID_unit]

# introduce non-response and subset to responding sample.
responding.systematic.sample <- introduce.nonresponse(systematic.sample)

# set survey design <- svydesign(id = ~1, strata = ~location + age + sex - 1, weights = ~initial_wt, data = responding.systematic.sample)
greg.wt.sys <- calibrate(, ~ location + age + sex - 1, population = c(benchmark.location$Nh, benchmark.age$Nh[-1],$Nh[-1]))
wtadj <- weights(greg.wt.sys)/responding.systematic.sample$initial_wt

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.041   1.084   1.103   1.114   1.131   1.215

svytotal(~emp, greg.wt.sys)

##     total     SE
## emp  5906 243.96

svymean(~emp, greg.wt.sys)

##       mean     SE
## emp 0.5906 0.0244

svyby(~emp, by = ~sex, FUN = svytotal, greg.wt.sys)

##           sex      emp       se
## female female 2847.107 167.1539
## male     male 3058.928 178.6840

svyby(~income, by = ~age, FUN = svymean, greg.wt.sys)

##                     age   income       se
## agegrp15-24 agegrp15-24 63371.62 1502.210
## agegrp25-40 agegrp25-40 81780.39 1485.779
## agegrp41-65 agegrp41-65 99731.88 2196.612
## agegrp65+     agegrp65+ 81498.69 1567.948

svytotal(~income, greg.wt.sys)

##            total      SE
## income 825255884 8706153

svymean(~income, greg.wt.sys)

##         mean     SE
## income 82526 870.62

The results are very similar to the stratified sample above.

2-stage cluster sampling

Household surveys are operationally quite challenging. An interviewer needs to travel out to the household to conduct the survey. This is expensive and time consuming. It makes economic sense to survey multiple households in the area. This can be achieved by first sampling primary selection units (PSU’s) or clusters such as a suburb. Each cluster contains a set of potential observations such as households (SSU – secondary selection units) which are sampled by a another sampling scheme such as stratification. For example say you want to know the overall opinion of students on schools uniforms. Rather than selecting a sample of students across multiple schools it is more convenient to first select a sample of schools followed by a sample of students. This offers convenience and cost savings by being able to control the number of schools which are travelled to in order to survey the students.

Cluster sampling is often a compromise between cost and accuracy. Selecting clusters followed by observations restricts the breadth of samples we could potentially take from a more standard stratified sample however the cost savings and convenience can far out-weigh the loss in accuracy.

Considering a simple case where in the first stage PSU’s are selected with equal probability without replacement and at the second stage SSu’s are selected with anotehr SRSWOR scheme the estimates are of the form

    \[ \hat{T} = \frac{N}{n} \sum_{j \in S} M_i \bar{y}_i \]

here M_i is the number of SSU’s in the ith PSU and \bar{y}_i is the response mean of the ith PSU. The variance of the cluster sample estimate of the total consists of 2 parts, the variance in selecting the PSU’s and variance selecting the SSU’s. This is given by

    \[ \text{Var}(\hat{T}) = N^2 \left( 1 - \frac{n}{N} \right)\frac{s^2_T}{n} + \frac{N}{n} \sum_{i \in S} \left(1 - \frac{m_i}{M_i} \right)M^2_i\frac{s^2_i}{m_i} \]


    \[ s^2_T = \frac{1}{n-1} \sum_{i \in S} \left( \hat{T}_i - \frac{\hat{T}}{N} \right)^2 \]


    \[ s^2_i = \frac{1}{m_i-1} \sum_{j \in S} \left( y_{ij} - \bar{y}_i \right)^2 \]

Under a stratified sampleing scheme at the second stage the second component in the variance above will resemble the stratified formula in the first example above.

# take cluster sample PSU
cluster.sample.psu <- sampling::cluster(frame, clustername = c("PSU"), size = 20, method = "srswor")
cluster.sample.psu <- cbind(cluster.sample.psu, frame[cluster.sample.psu$ID_unit, c("location", "age", "sex")])

# take stratified sample of SSU's
cluster.sample <- sampling::strata(cluster.sample.psu, stratanames = c("location", "age", "sex"), size = nh, method = "srswor")
cluster.sample <- cbind(cluster.sample, Prob.psu = cluster.sample.psu[cluster.sample$ID_unit, "Prob"])

# calc initial weights from selection probability
cluster.sample$Prob <- cluster.sample$Prob*cluster.sample$Prob.psu
cluster.sample$initial_wt <- 1/cluster.sample$Prob
cluster.sample <- cbind(cluster.sample, frame[cluster.sample$ID_unit, c("PSU", "emp", "income")])

# introduce non-response and subset to responding sample.
responding.cluster.sample <- introduce.nonresponse(cluster.sample)

# set survey design <- svydesign(id = ~PSU, strata = ~location + age + sex - 1, weights = ~initial_wt, data = responding.cluster.sample, nest = TRUE)

# calibrate weights
greg.wt.clus <- calibrate(, ~ location + age + sex - 1, population = c(benchmark.location$Nh, benchmark.age$Nh[-1],$Nh[-1]))
wtadj <- weights(greg.wt.clus)/responding.cluster.sample$initial_wt

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7327  0.8532  1.1037  1.0974  1.3270  1.4774

# output esetimates
svytotal(~emp, greg.wt.clus)

##      total     SE
## emp 5589.1 224.86

svymean(~emp, greg.wt.clus)

##        mean     SE
## emp 0.55891 0.0225

svyby(~emp, by = ~sex, FUN = svytotal, greg.wt.clus)

##           sex      emp       se
## female female 2622.041 186.9735
## male     male 2967.047 182.5725

svyby(~income, by = ~age, FUN = svymean, greg.wt.clus)

##                     age   income       se
## agegrp15-24 agegrp15-24 82368.31 1822.978
## agegrp25-40 agegrp25-40 82910.40 1749.325
## agegrp41-65 agegrp41-65 86096.58 2048.544
## agegrp65+     agegrp65+ 80276.57 3722.528

svytotal(~income, greg.wt.clus)

##            total       SE
## income 827970351 13458832

svymean(~income, greg.wt.clus)

##         mean     SE
## income 82797 1345.9

As expected the results from the cluster sample are not as accurate as the stratified case however it’s not bad. The use of cluster sampling comes down to convenience and cost savings, and in some cases practicality.

Sampling in the world of big data

Big Data comes in enormous volumes particularly regarding live streams of data such as Twitter feeds, telecommunication networks, etc. Trying to use that amount of data is often referred to as drinking from the fire hose. It is next to impossible to use all of this data for analysis or building predictive models for your business. The reasonable thing to do is to take a sample, develop models and deploy to the production environment. An element of this process which is often overlooked is sampling the data and building a training set. In these cases taking a simple random sample from the most recent entries e.g. last few days or weeks is about as much thought that goes into it. The biggest consideration (for good reason mind you) is how long it will take to extract the data so the data scientists can start building models. Often the result is a highly biased data set from which a model is trained and expected to perform consistently once it’s pushed to production. This is where advance survey design and sampling comes in.

There are many factors which need to be considered with any big data project and large volumes of data such as

  • Are there time/seasonal effects (hour, day, weekend, month, annual)?
  • Are there geographical effects?
  • Are there user effects i.e. different demographics exhibiting different behaviour?
  • Are there natural batches of data that can be sampled in one go i.e. clusters?
  • What is the intended purpose of the model or analytics project?

Ultimately, the overarching question is, is my sampled data representative? This is the crux of advanced survey design and sampling. Once you have understood the context and the purpose of the model the next step is to gather the correct data to support it and to adjust for any biases with in it. By adjusting the biases in the training set a higher accuracy model naturally follows.

A key part for this to be successful is to understand what your population is and have some robust metrics describing it. For example, in business and population surveys this is the Census which usually runs every 5-10 years. In a big data context this may be trickier to answer, but I’d argue nowhere near as expensive! If the context is a service such as telecommunications, stratifying by demographic and geographic information should be simple and knowing the population totals shouldn’t be anymore challenging. A sampling scheme based on this will already outperform a simple random sample. Given how data is often stored it is easier to reference time intervals and grabbing everything in between, or a sample again. This would be akin to taking a cluster sample, or if we think about the fire hose dipping the cup in for a second at randomly selected times. The resulting sample can be calibrated to the known population totals and the final weights used to estimate unknown

This can also be applied in rare event estimation such as credit card fraud detection. These events may only occur once in every 10000 records, in which case it is common to up-sample and down-sample the fraudulent and genuine cases respectively to balance the training set. Down-sampling is something which is often overlooked as well. By down-sampling with a more sophisticated design you can squeeze a higher accuracy our of the predictive model. I have had success with this using the kaggle credit card fraud data set. Firstly the genuine cases were segmented using unsupervised learning and then used the segments were used as strata in a sampling procedure. This reduced the false positive rate 5 fold on the training set. Using unsupervised learning to generate strata for sampling from can be a very effective way of ensuring a representative sample is selected.

Other considerations are simply the data model and how data is stored in the databases. There may be some natural clusters already in the data. In a telecommunications context the mobile phone towers may represent clusters or PSU’s from which individual phone numbers are sampled. The resulting weights can then be used to obtain population estimates or used to adjust the training set and statistical models.

Sampling data in the world of big data is complex even for taking simpe random samples. Exploring more complex sampling techniques requires time in order to get it right but can have great benefits. Sourcing the data is gauranteed to be the most painful part of the journey and everyone is forgiven to take the path of least resistance. If you have the luxury of working in an Agile environment working iteratively to deliver analytical solutions, retraining models using a more representative sample is something that deserves to be on the Kanban board.

The post Advanced Survey Design and Application to Big Data appeared first on Daniel Oehm | Gradient Descending.

To leave a comment for the author, please follow the link and comment on their blog: R – Daniel Oehm | Gradient Descending. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)