Errors-in-variables models in Stan

In a previous post, I gave a cursory overview of how prior information about covariate measurement error can reduce bias in linear regression.
In the comments, Rasmus Bååth asked about estimation in the absence of strong priors.
Here, I’ll describe a Bayesian approach to estimating and correcting for covariate measurement error, using a latent-variable errors-in-variables model that does not rely on strong prior information.
Recall that this matters because error in covariate measurements tends to bias slope estimates towards zero.
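
To see that attenuation concretely, here is a minimal stand-alone sketch (the objects n_sim, x_true, y_sim, and x_noisy are throwaway names used only here): with a true slope of 1 and measurement error as large as the spread of the covariate itself, the naive slope estimate shrinks to roughly 0.5.

# quick illustration of attenuation bias from covariate measurement error
set.seed(1)
n_sim <- 1000
x_true <- rnorm(n_sim)
y_sim <- 1 * x_true + rnorm(n_sim)        # true intercept 0, slope 1
x_noisy <- x_true + rnorm(n_sim, sd = 1)  # covariate measured with error
coef(lm(y_sim ~ x_true))["x_true"]        # near 1
coef(lm(y_sim ~ x_noisy))["x_noisy"]      # biased toward zero (about 0.5)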

For what follows, we’ll assume a simple linear regression, in which continuous covariates are measured with error.
True covariate values are treated as latent variables, with repeated measurements of a covariate arising from a normal distribution centered on the true value, with measurement error standard deviation $\sigma_x$.
We can represent the latent variables in the model as circles and the observables as boxes in a graphical model diagram, with $\epsilon_x \sim Normal(0, \sigma_x)$ and $\epsilon_y \sim Normal(0, \sigma_y)$.

In other words, we assume that for sample unit $i$ and repeat measurement $j$:

$x^{obs}_{ij} \sim Normal(x_i, \sigma_x)$ and $y_i \sim Normal(\alpha + \beta x_i, \sigma_y)$.

The trick here is to use repeated measurements of the covariates to estimate and correct for measurement error.
In order for this to be valid, the true covariate values cannot vary across repeat measurements.
If the covariate was individual weight, you would have to ensure that the true weight did not vary across repeat measurements (for me, frogs urinating during handling would violate this assumption).

Below, I’ll simulate some data of this type in R. I assume that we randomly select some sampling units for repeated covariate measurement, and that each of these is measured n.reps times in total.

<span class="line">n.reps <span class="o"><-</span> <span class="m">3</span>
</span><span class="line">n.repeated <span class="o"><-</span> <span class="m">10</span>
</span><span class="line">n <span class="o"><-</span> <span class="m">50</span>
</span><span class="line">
</span><span class="line"><span class="c1"># true covariate values</span>
</span><span class="line">x <span class="o"><-</span> runif<span class="p">(</span>n<span class="p">,</span> <span class="m">-3</span><span class="p">,</span> <span class="m">3</span><span class="p">)</span>
</span><span class="line">y <span class="o"><-</span> x <span class="o">+</span> rnorm<span class="p">(</span>n<span class="p">)</span>  <span class="c1"># alpha=0, beta=1, sdy=1</span>
</span><span class="line">
</span><span class="line"><span class="c1"># random subset to perform repeat covariate measurements</span>
</span><span class="line">which.repeated <span class="o"><-</span> sample<span class="p">(</span>n<span class="p">,</span> n.repeated<span class="p">)</span>
</span><span class="line">xsd <span class="o"><-</span> <span class="m">1</span>  <span class="c1"># measurement error</span>
</span><span class="line">xerr <span class="o"><-</span> rnorm<span class="p">(</span>n <span class="o">+</span> <span class="p">(</span>n.repeated <span class="o">*</span> <span class="p">(</span>n.reps <span class="o">-</span> <span class="m">1</span><span class="p">)),</span> <span class="m">0</span><span class="p">,</span> xsd<span class="p">)</span>
</span><span class="line">
</span><span class="line"><span class="c1"># indx assigns measurements to sample units</span>
</span><span class="line">indx <span class="o"><-</span> c<span class="p">(</span><span class="m">1</span><span class="o">:</span>n<span class="p">,</span> rep<span class="p">(</span>which.repeated<span class="p">,</span> each <span class="o">=</span> n.reps <span class="o">-</span> <span class="m">1</span><span class="p">))</span>
</span><span class="line">indx <span class="o"><-</span> sort<span class="p">(</span>indx<span class="p">)</span>
</span><span class="line">nobs <span class="o"><-</span> length<span class="p">(</span>indx<span class="p">)</span>
</span><span class="line">xobs <span class="o"><-</span> x<span class="p">[</span>indx<span class="p">]</span> <span class="o">+</span> xerr
</span><span class="line">plot<span class="p">(</span>x<span class="p">[</span>indx<span class="p">],</span> xobs<span class="p">,</span>
</span><span class="line">    xlab <span class="o">=</span> <span class="s">"True covariate value"</span><span class="p">,</span>
</span><span class="line">    ylab <span class="o">=</span> <span class="s">"Observed covariate value"</span><span class="p">)</span>
</span><span class="line">abline<span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> lty <span class="o">=</span> <span class="m">2</span><span class="p">)</span>
</span><span class="line">segments<span class="p">(</span>x0 <span class="o">=</span> x<span class="p">[</span>indx<span class="p">],</span> x1 <span class="o">=</span> x<span class="p">[</span>indx<span class="p">],</span>
</span><span class="line">    y0 <span class="o">=</span> x<span class="p">[</span>indx<span class="p">],</span> y1 <span class="o">=</span> xobs<span class="p">,</span> col <span class="o">=</span> <span class="s">"red"</span><span class="p">)</span>
</span><span class="line">abline<span class="p">(</span>v <span class="o">=</span> x<span class="p">[</span>which.repeated<span class="p">],</span> col <span class="o">=</span> <span class="s">"green"</span><span class="p">,</span> lty <span class="o">=</span> <span class="m">3</span><span class="p">)</span>
</span>

Here, the discrepancy due to measurement error is shown as a red segment, and the sample units that were measured three times are highlighted with green dashed lines.

I’ll use Stan to estimate the model parameters, because I’ll be refitting the model to new data sets repeatedly below, and Stan is faster than JAGS for these models.

<span class="line"><span class="c1"># write the .stan file</span>
</span><span class="line">cat<span class="p">(</span><span class="s">"</span>
</span><span class="line"><span class="s">data{</span>
</span><span class="line"><span class="s">  int n;</span>
</span><span class="line"><span class="s">  int nobs;</span>
</span><span class="line"><span class="s">  real xobs[nobs];</span>
</span><span class="line"><span class="s">  real y[n];</span>
</span><span class="line"><span class="s">  int indx[nobs];</span>
</span><span class="line"><span class="s">}</span>
</span><span class="line">
</span><span class="line"><span class="s">parameters {</span>
</span><span class="line"><span class="s">  real alpha;</span>
</span><span class="line"><span class="s">  real beta;</span>
</span><span class="line"><span class="s">  real<lower=0> sigmay;</span>
</span><span class="line"><span class="s">  real<lower=0> sigmax;</span>
</span><span class="line"><span class="s">  real x[n];</span>
</span><span class="line"><span class="s">}</span>
</span><span class="line">
</span><span class="line"><span class="s">model {</span>
</span><span class="line"><span class="s">  // priors</span>
</span><span class="line"><span class="s">  alpha ~ normal(0,100);</span>
</span><span class="line"><span class="s">  beta ~ normal(0,100);</span>
</span><span class="line"><span class="s">  sigmay ~ uniform(0,1000);</span>
</span><span class="line"><span class="s">  sigmax ~ uniform(0,1000);</span>
</span><span class="line"><span class="s">  </span>
</span><span class="line"><span class="s">  // model structure  </span>
</span><span class="line"><span class="s">  for (i in 1:nobs){</span>
</span><span class="line"><span class="s">    xobs[i] ~ normal(x[indx[i]], sigmax);</span>
</span><span class="line"><span class="s">  }</span>
</span><span class="line"><span class="s">  for (i in 1:n){</span>
</span><span class="line"><span class="s">    y[i] ~ normal(alpha + beta*x[i], sigmay);</span>
</span><span class="line"><span class="s">  }</span>
</span><span class="line"><span class="s">}</span>
</span><span class="line"><span class="s">  "</span><span class="p">,</span>
</span><span class="line">    file <span class="o">=</span> <span class="s">"latent_x.stan"</span><span class="p">)</span>
</span>

With the model specified, estimate the parameters.

<span class="line">library<span class="p">(</span>rstan<span class="p">)</span>
</span><span class="line">library<span class="p">(</span>modeest<span class="p">)</span>
</span><span class="line">stan_d <span class="o"><-</span> c<span class="p">(</span><span class="s">"y"</span><span class="p">,</span> <span class="s">"xobs"</span><span class="p">,</span> <span class="s">"nobs"</span><span class="p">,</span> <span class="s">"n"</span><span class="p">,</span> <span class="s">"indx"</span><span class="p">)</span>
</span><span class="line">chains <span class="o"><-</span> <span class="m">3</span>
</span><span class="line">iter <span class="o"><-</span> <span class="m">1000</span>
</span><span class="line">thin <span class="o"><-</span> <span class="m">1</span>
</span><span class="line">mod1 <span class="o"><-</span> stan<span class="p">(</span>file <span class="o">=</span> <span class="s">"latent_x.stan"</span><span class="p">,</span> data <span class="o">=</span> stan_d<span class="p">,</span>
</span><span class="line">    chains <span class="o">=</span> chains<span class="p">,</span> iter <span class="o">=</span> iter<span class="p">,</span>
</span><span class="line">    thin <span class="o">=</span> thin<span class="p">)</span>
</span>
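
Before comparing estimates to truth, it’s worth a quick look at convergence; a minimal check using rstan’s built-in methods on the fitted object (parameter names as declared in latent_x.stan):

# check effective sample sizes and Rhat (should be near 1), then traceplots
print(mod1, pars = c("alpha", "beta", "sigmax", "sigmay"))
traceplot(mod1, pars = c("alpha", "beta", "sigmax", "sigmay"))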

How did we do? Let’s compare the true vs. estimated covariate values for each sample unit.

<span class="line">posteriors <span class="o"><-</span> extract<span class="p">(</span>mod1<span class="p">)</span>
</span><span class="line">
</span><span class="line"><span class="c1"># highest density interval helper function (thanks to Joe Mihaljevic)</span>
</span><span class="line">HDI <span class="o"><-</span> <span class="kr">function</span><span class="p">(</span>values<span class="p">,</span> percent <span class="o">=</span> <span class="m">0.95</span><span class="p">)</span> <span class="p">{</span>
</span><span class="line">    sorted <span class="o"><-</span> sort<span class="p">(</span>values<span class="p">)</span>
</span><span class="line">    index <span class="o"><-</span> floor<span class="p">(</span>percent <span class="o">*</span> length<span class="p">(</span>sorted<span class="p">))</span>
</span><span class="line">    nCI <span class="o"><-</span> length<span class="p">(</span>sorted<span class="p">)</span> <span class="o">-</span> index
</span><span class="line">    width <span class="o"><-</span> rep<span class="p">(</span><span class="m">0</span><span class="p">,</span> nCI<span class="p">)</span>
</span><span class="line">    <span class="kr">for</span> <span class="p">(</span>i <span class="kr">in</span> <span class="m">1</span><span class="o">:</span>nCI<span class="p">)</span> <span class="p">{</span>
</span><span class="line">        width<span class="p">[</span>i<span class="p">]</span> <span class="o"><-</span> sorted<span class="p">[</span>i <span class="o">+</span> index<span class="p">]</span> <span class="o">-</span> sorted<span class="p">[</span>i<span class="p">]</span>
</span><span class="line">    <span class="p">}</span>
</span><span class="line">    HDImin <span class="o"><-</span> sorted<span class="p">[</span>which.min<span class="p">(</span>width<span class="p">)]</span>
</span><span class="line">    HDImax <span class="o"><-</span> sorted<span class="p">[</span>which.min<span class="p">(</span>width<span class="p">)</span> <span class="o">+</span> index<span class="p">]</span>
</span><span class="line">    HDIlim <span class="o"><-</span> c<span class="p">(</span>HDImin<span class="p">,</span> HDImax<span class="p">)</span>
</span><span class="line">    <span class="kr">return</span><span class="p">(</span>HDIlim<span class="p">)</span>
</span><span class="line"><span class="p">}</span>
</span><span class="line">
</span><span class="line"><span class="c1"># comparing estimated true x values to actual x values</span>
</span><span class="line">Xd <span class="o"><-</span> array<span class="p">(</span>dim <span class="o">=</span> c<span class="p">(</span>n<span class="p">,</span> <span class="m">3</span><span class="p">))</span>
</span><span class="line"><span class="kr">for</span> <span class="p">(</span>i <span class="kr">in</span> <span class="m">1</span><span class="o">:</span>n<span class="p">)</span> <span class="p">{</span>
</span><span class="line">    Xd<span class="p">[</span>i<span class="p">,</span> <span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">]</span> <span class="o"><-</span> HDI<span class="p">(</span>posteriors<span class="o">$</span>x<span class="p">[,</span> i<span class="p">])</span>
</span><span class="line">    Xd<span class="p">[</span>i<span class="p">,</span> <span class="m">3</span><span class="p">]</span> <span class="o"><-</span> mlv<span class="p">(</span>posteriors<span class="o">$</span>x<span class="p">[,</span> i<span class="p">],</span> method <span class="o">=</span> <span class="s">"shorth"</span><span class="p">)</span><span class="o">$</span>M
</span><span class="line"><span class="p">}</span>
</span><span class="line">
</span><span class="line">lims <span class="o"><-</span> c<span class="p">(</span>min<span class="p">(</span>Xd<span class="p">),</span> max<span class="p">(</span>Xd<span class="p">))</span>
</span><span class="line">plot<span class="p">(</span>x<span class="p">,</span> Xd<span class="p">[,</span> <span class="m">3</span><span class="p">],</span> xlab <span class="o">=</span> <span class="s">"True covariate value"</span><span class="p">,</span>
</span><span class="line">    ylab <span class="o">=</span> <span class="s">"Estimated covariate value"</span><span class="p">,</span>
</span><span class="line">    col <span class="o">=</span> <span class="s">"purple"</span><span class="p">,</span> pch <span class="o">=</span> <span class="m">19</span><span class="p">,</span> ylim <span class="o">=</span> lims<span class="p">)</span>
</span><span class="line">abline<span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> lty <span class="o">=</span> <span class="m">2</span><span class="p">)</span>
</span><span class="line">segments<span class="p">(</span>x0 <span class="o">=</span> x<span class="p">,</span> x1 <span class="o">=</span> x<span class="p">,</span> y0 <span class="o">=</span> Xd<span class="p">[,</span> <span class="m">1</span><span class="p">],</span> y1 <span class="o">=</span> Xd<span class="p">[,</span> <span class="m">2</span><span class="p">],</span> col <span class="o">=</span> <span class="s">"purple"</span><span class="p">)</span>
</span>

Here, the purple points and vertical segments show the posterior modes and 95% highest density intervals for the covariate values, and the dashed black line is the one-to-one line we would expect if the estimates exactly matched the true values.
In addition to estimating the true covariate values, we may wish to check how well we estimated the standard deviation of the measurement error in our covariate.

<span class="line">hist<span class="p">(</span>posteriors<span class="o">$</span>sigmax<span class="p">,</span> breaks <span class="o">=</span> <span class="m">20</span><span class="p">,</span>
</span><span class="line">    main <span class="o">=</span> <span class="s">"Posterior for measurement error"</span><span class="p">,</span>
</span><span class="line">    xlab <span class="o">=</span> <span class="s">"Measurement standard deviation"</span><span class="p">)</span>
</span><span class="line">abline<span class="p">(</span>v <span class="o">=</span> xsd<span class="p">,</span> col <span class="o">=</span> <span class="s">"red"</span><span class="p">,</span> lwd <span class="o">=</span> <span class="m">2</span><span class="p">)</span>
</span><span class="line">legend<span class="p">(</span><span class="s">"topright"</span><span class="p">,</span> legend <span class="o">=</span> <span class="s">"True value"</span><span class="p">,</span> col <span class="o">=</span> <span class="s">"red"</span><span class="p">,</span>
</span><span class="line">    lty <span class="o">=</span> <span class="m">1</span><span class="p">,</span> bty <span class="o">=</span> <span class="s">"n"</span><span class="p">,</span> lwd <span class="o">=</span> <span class="m">2</span><span class="p">)</span>
</span>

How many sample units need repeat measurements?

You may want to know how many sample units need to be repeatedly measured to adequately estimate the degree of covariate measurement error.
For instance, if $\sigma_x = 1$, how does the precision in our estimate of $\sigma_x$ improve as more sample units are repeatedly measured?
Let’s see what happens when we repeatedly measure covariate values for $1, 2, \ldots, n$ randomly selected sampling units.

<span class="line">n.repeated <span class="o"><-</span> <span class="m">1</span><span class="o">:</span>n
</span><span class="line">
</span><span class="line"><span class="c1"># store the HDI and mode for the estimate of sigmax in an array</span>
</span><span class="line">post.sdx <span class="o"><-</span> array<span class="p">(</span>dim <span class="o">=</span> c<span class="p">(</span>length<span class="p">(</span>n.repeated<span class="p">),</span> <span class="m">3</span><span class="p">))</span>
</span><span class="line"><span class="kr">for</span> <span class="p">(</span>i <span class="kr">in</span> n.repeated<span class="p">)</span> <span class="p">{</span>
</span><span class="line">    n.repeats <span class="o"><-</span> i
</span><span class="line">    which.repeated <span class="o"><-</span> sample<span class="p">(</span>n<span class="p">,</span> n.repeats<span class="p">)</span>
</span><span class="line">    xerr <span class="o"><-</span> rnorm<span class="p">(</span>n <span class="o">+</span> <span class="p">(</span>n.repeats <span class="o">*</span> <span class="p">(</span>n.reps <span class="o">-</span> <span class="m">1</span><span class="p">)),</span> <span class="m">0</span><span class="p">,</span> xsd<span class="p">)</span>
</span><span class="line">    indx <span class="o"><-</span> c<span class="p">(</span><span class="m">1</span><span class="o">:</span>n<span class="p">,</span> rep<span class="p">(</span>which.repeated<span class="p">,</span> each <span class="o">=</span> n.reps <span class="o">-</span> <span class="m">1</span><span class="p">))</span>
</span><span class="line">    indx <span class="o"><-</span> sort<span class="p">(</span>indx<span class="p">)</span>
</span><span class="line">    nobs <span class="o"><-</span> length<span class="p">(</span>indx<span class="p">)</span>
</span><span class="line">    xobs <span class="o"><-</span> x<span class="p">[</span>indx<span class="p">]</span> <span class="o">+</span> xerr
</span><span class="line">    stan_d <span class="o"><-</span> c<span class="p">(</span><span class="s">"y"</span><span class="p">,</span> <span class="s">"xobs"</span><span class="p">,</span> <span class="s">"nobs"</span><span class="p">,</span> <span class="s">"n"</span><span class="p">,</span> <span class="s">"indx"</span><span class="p">)</span>
</span><span class="line">    mod <span class="o"><-</span> stan<span class="p">(</span>fit <span class="o">=</span> mod1<span class="p">,</span> data <span class="o">=</span> stan_d<span class="p">,</span> chains <span class="o">=</span> chains<span class="p">,</span>
</span><span class="line">        iter <span class="o">=</span> iter<span class="p">,</span> thin <span class="o">=</span> thin<span class="p">)</span>
</span><span class="line">    posteriors <span class="o"><-</span> extract<span class="p">(</span>mod<span class="p">)</span>
</span><span class="line">    post.sdx<span class="p">[</span>i<span class="p">,</span> <span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">]</span> <span class="o"><-</span> HDI<span class="p">(</span>posteriors<span class="o">$</span>sigmax<span class="p">)</span>
</span><span class="line">    post.sdx<span class="p">[</span>i<span class="p">,</span> <span class="m">3</span><span class="p">]</span> <span class="o"><-</span> mlv<span class="p">(</span>posteriors<span class="o">$</span>sigmax<span class="p">,</span> method <span class="o">=</span> <span class="s">"shorth"</span><span class="p">)</span><span class="o">$</span>M
</span><span class="line"><span class="p">}</span>
</span><span class="line">
</span><span class="line"><span class="c1"># Plot the relationship b/t number of sampling units revisited & sdx</span>
</span><span class="line">plot<span class="p">(</span>x <span class="o">=</span> n.repeated<span class="p">,</span> y <span class="o">=</span> rep<span class="p">(</span>xsd<span class="p">,</span> length<span class="p">(</span>n.repeated<span class="p">)),</span>
</span><span class="line">    type <span class="o">=</span> <span class="s">"l"</span><span class="p">,</span> lty <span class="o">=</span> <span class="m">2</span><span class="p">,</span>
</span><span class="line">    ylim <span class="o">=</span> c<span class="p">(</span><span class="m">0</span><span class="p">,</span> max<span class="p">(</span>post.sdx<span class="p">)),</span>
</span><span class="line">    xlab <span class="o">=</span> <span class="s">"Number of sampling units measured three times"</span><span class="p">,</span>
</span><span class="line">    ylab <span class="o">=</span> <span class="s">"Estimated measurement error"</span><span class="p">)</span>
</span><span class="line">segments<span class="p">(</span>x0 <span class="o">=</span> n.repeated<span class="p">,</span> x1 <span class="o">=</span> n.repeated<span class="p">,</span>
</span><span class="line">    y0 <span class="o">=</span> post.sdx<span class="p">[,</span> <span class="m">1</span><span class="p">],</span> y1 <span class="o">=</span> post.sdx<span class="p">[,</span> <span class="m">2</span><span class="p">],</span>
</span><span class="line">    col <span class="o">=</span> <span class="s">"red"</span><span class="p">)</span>
</span><span class="line">points<span class="p">(</span>x <span class="o">=</span> n.repeated<span class="p">,</span> y <span class="o">=</span> post.sdx<span class="p">[,</span> <span class="m">3</span><span class="p">],</span> col <span class="o">=</span> <span class="s">"red"</span><span class="p">)</span>
</span><span class="line">legend<span class="p">(</span><span class="s">"topright"</span><span class="p">,</span> legend <span class="o">=</span> c<span class="p">(</span><span class="s">"True value"</span><span class="p">,</span> <span class="s">"Posterior estimate"</span><span class="p">),</span>
</span><span class="line">    col <span class="o">=</span> c<span class="p">(</span><span class="s">"black"</span><span class="p">,</span> <span class="s">"red"</span><span class="p">),</span> lty <span class="o">=</span> c<span class="p">(</span><span class="m">2</span><span class="p">,</span> <span class="m">1</span><span class="p">),</span>
</span><span class="line">    pch <span class="o">=</span> c<span class="p">(</span><span class="kc">NA</span><span class="p">,</span> <span class="m">1</span><span class="p">),</span> bty <span class="o">=</span> <span class="s">"n"</span><span class="p">)</span>
</span>

Looking at this plot, you could eyeball the number of sample units that should be remeasured when designing a study.
Realistically, you would want to explore how this number depends on the true amount of measurement error, and also simulate multiple realizations (rather than just one) for each scenario.
Using a similar approach, you might also evaluate whether it’s more efficient to remeasure more sample units, or invest in more repeated measurements per sample unit.
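
Here is one rough sketch of that kind of design exploration, purely illustrative: sdx_levels, n.sims, and sdx_est are made-up names, the scenario values are arbitrary, and it reuses n, n.reps, chains, iter, thin, mod1, and mlv() from above, refitting the compiled model to each simulated data set.

# sketch: vary the true measurement error and replicate each scenario
sdx_levels <- c(0.5, 1, 2)   # hypothetical true measurement error SDs
n.sims <- 10                 # simulated data sets per scenario
n.repeated <- 10             # sample units remeasured in each data set

sdx_est <- array(dim = c(length(sdx_levels), n.sims))
for (s in seq_along(sdx_levels)) {
    for (r in 1:n.sims) {
        # simulate a fresh data set under this scenario
        x <- runif(n, -3, 3)
        y <- x + rnorm(n)
        which.repeated <- sample(n, n.repeated)
        indx <- sort(c(1:n, rep(which.repeated, each = n.reps - 1)))
        nobs <- length(indx)
        xobs <- x[indx] + rnorm(nobs, 0, sdx_levels[s])
        stan_d <- c("y", "xobs", "nobs", "n", "indx")
        fit <- stan(fit = mod1, data = stan_d,
            chains = chains, iter = iter, thin = thin)
        # store the posterior mode of sigmax for this replicate
        sdx_est[s, r] <- mlv(extract(fit)$sigmax, method = "shorth")$M
    }
}
# compare each row of sdx_est to the corresponding value in sdx_levels

Comparing the spread of each row of sdx_est to its true value gives a sense of how precision and bias in the measurement-error estimate change with the true error; the same loop could be wrapped over different numbers of remeasured units or replicates per unit.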
