[This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Throw this onto the big pile of stats problems that are a lot more subtle than they seem at first glance. This all started when Lauren pointed me at the post Another way to see why mixed models in survey data are hard on Thomas Lumley’s blog. Part of the problem is all the jargon in survey sampling—I couldn’t understand Lumley’s language of estimators and least squares; part of it is that missing data is hard.

The full data model

Imagine we have a a very simple population of $N^{\textrm{pop}}$ items with values normally distributed members with standard deviation known to be 2,

$y_n \sim \textrm{normal}(\mu, 2) \ \textrm{for} \ i \in 1:N^{\textrm{pop}}.$

To complete the Bayesian model, we’ll assume a standard normal prior on $\mu$,

$\mu \sim \textrm{normal}(0, 1).$

Now we’re not going to observe all $y_n$, but only a sample of the $N^{\textrm{pop}}$ elements. If the model is correct, our inferences will be calibrated in expection given a random sample of items $y_n$ from the population.

Missing data

Now let’s assume the sample of $y_n$ we observe is not drawn at random from the population. Imagine instead that we have a subset of $N$ items from the population, and for each item $n$, there is a probability $\pi_n$ that the item will be included in the sample. We’ll take the log odds of inclusion to be equal to the item’s value,

$\pi_n = \textrm{logit}^{-1}(y_n)$.

Now when we collect our sample, we’ll do something like poll $N = 2000$ people from the population, but each person $n$ only has a $\pi_n$ chance of responding. So we only wind up with $N^{\textrm{obs}}$ observations, with $N^{\textrm{miss}} = N - N^{\textrm{obs}}$ observations missing.

This situation arises in surveys, where non-response can bias results without careful adjustment (e.g., see Andrew’s post on pre-election polling, Don’t believe the bounce).

So how do we do the careful adjustment?

Approach 1: Weighted likelihood

A traditional approach is to inverse weight the log likelihood terms by the inclusion probability,

$\sum_{n = 1}^{N^{\textrm{obs}}} \frac{1}{\pi_n} \log \textrm{normal}(y_n \mid \mu, 2).$

Thus if an item has a 20% chance of being included, its weight is 5.

In Stan, we can code the weighted likelihood as follows (assuming pi is given as data).

for (n in 1:N_obs)
target += inv(pi[n]) * normal_lpdf(y[n] | mu, 2);


If we optimize with the weighted likelihood, the estimates are unbiased (i.e., the expectation of the estimate $\hat{\pi}$ is the true value $\pi$). This is borne out in simulation.

Although the parameter estimates are unbiased, the same cannot be said of the uncertainties. The posterior intervals are too narrow. Specifically, this approach fails simulation-based calibration; for background on SBC, see Dan’s blog post You better check yo self before you wreck yo self.

One reason the intervals are too narrow is that we are weighting the data as if we had observed $N$ items when we’ve only observed $N^{\textrm{obs}}$ items. That is, their weights are what we’d expect to get if we’d observed $N$ items.

So my next thought was to standardize. Let’s take the inverse weights and normalize so the sum of inverse weights is equal to $N^{\textrm{obs}}.$ That also fails. The posterior intervals are still too narrow under simulation.

Sure, we could keep fiddling weights in an ad hoc way for this problem until they were better calibrated empirically, but this is clearly the wrong approach. We’re Bayesians and should be thinking generatively. Maybe that’s why Lauren and Andrew kept telling me I should be thinking generatively (even though they work on a survey weighting project!).

Approach 2: Missing data

What is going on generativey? We poll $N$ people out of a population of $N^{\textrm{pop}}$, each of which has a $\pi_n$ chance of responding, leading to a set of responses of size $N^{\textrm{obs}}.$

Given that we know how $\pi$ relates to $y$, we can just model everything (in the real world, this stuff is really hard and everything’s estimated jointly).

Specifically, the $N^{\textrm{miss}} = N - N^{\textrm{obs}}$ missing items each get parameters $y^{\textrm{miss}}_n$ representing how they would’ve responded had they responded. We also model response, so we have an extra term $\textrm{bernoulli}(0 \mid \textrm{logit}^{-1}(y_n^{\textrm{miss}}))$ for the unobserved values and an extra term $\textrm{bernoulli}(1 \mid \textrm{logit}^{-1}(y_n))$ for the observed values.

This works. Here’s the Stan program.

data {
int N_miss;
int N_obs;
vector[N_obs] y_obs;
}
parameters {
real mu;
vector[N_miss] y_miss;
}
model {
// prior
mu ~ normal(0, 1);
// observed data likelihood
y_obs ~ normal(mu, 2);
1 ~ bernoulli_logit(y_obs);
// missing data likelihood and missingness
y_miss ~ normal(mu, 2);
0 ~ bernoulli_logit(y_miss);
}


The Bernoulli sampling statements are vectorized and repeated for each element of y_obs and y_miss. The suffix _logit indicates the argument is on the log odds scale, and could have been written:

for (n in 1:N_miss)
0 ~ bernoulli(y_miss[n] | inv_logit(y_miss[n]))


And here’s the simulation code, including a cheap run at SBC:

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores(), logical = FALSE)

printf <- function(msg, ...) { cat(sprintf(msg, ...)); cat("\n") }
inv_logit <- function(u) 1 / (1 + exp(-u))

printf("Compiling model.")
model <- stan_model('missing.stan')

for (m in 1:20) {

# SIMULATE DATA
mu <- rnorm(1, 0, 1);
N_tot <- 1000
y <- rnorm(N_tot, mu, 2)
z <- rbinom(N_tot, 1, inv_logit(y))
y_obs <- y[z == 1]
N_obs <- length(y_obs)
N_miss <- N_tot - N_obs

# COMPILE AND FIT STAN MODEL
fit <- sampling(model,
data = list(N_miss = N_miss, N_obs = N_obs, y_obs = y_obs),
chains = 1, iter = 5000, refresh = 0)
mu_ss <- extract(fit)\$mu
mu_hat <- mean(mu_ss)
q25 <- quantile(mu_ss, 0.25)
q75 <- quantile(mu_ss, 0.75)
printf("mu = %5.2f in 50pct(%5.2f, %5.2f) = %3s;  mu_hat = %5.2f",
mu, q25, q75, ifelse(q25 

Here's some output with random seeds, with mu, mu_hat and 50% intervals and indicator of whether mu is in the 50% posterior interval.

mu =  0.60 in 50pct( 0.50,  0.60) =  no;  mu_hat =  0.55
mu = -0.73 in 50pct(-0.67, -0.56) =  no;  mu_hat = -0.62
mu =  1.13 in 50pct( 1.00,  1.10) =  no;  mu_hat =  1.05
mu =  1.71 in 50pct( 1.67,  1.76) = yes;  mu_hat =  1.71
mu =  0.03 in 50pct(-0.02,  0.08) = yes;  mu_hat =  0.03
mu =  0.80 in 50pct( 0.76,  0.86) = yes;  mu_hat =  0.81


The only problem I'm having is that this crashes RStan 2.19.2 on my Mac fairly regularly.

Exercise

How would the generative model differ if we polled members of the population at random until we got 1000 respondents? Conceptually it's more difficult in that we don't know how many non-resondents were approached on the way to 1000 respondents. This would be tricky in Stan as we don't have discrete parameter sampling---it'd have to be marginalized out.

Lauren started this conversation saying it would be hard. It took me several emails, part of a Stan meeting, buttonholing Andrew to give me an interesting example to test, lots of coaching from Lauren, then a day of working out the above simulations to convince myself the weighting wouldn't work and code up a simple version that would work. Like I said, not easy. But at least doable with patient colleagues who know what they're doing.