# Bayesian Modeling of Anscombe’s Quartet

*Originally published on **Publishable Stuff** and contributed to R-bloggers.*


Anscombe’s quartet is a collection of four datasets that look radically different yet result in the same regression line when using ordinary least squares regression. The graph below shows Anscombe’s quartet with fitted regression lines superimposed (taken from the Wikipedia article).
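The identical fits are easy to verify yourself (this check is mine, not part of the original post) using the `anscombe` data frame that ships with base R:

```r
# All four Anscombe datasets yield (almost exactly) the same least squares line.
library(datasets)  # for the built-in anscombe data frame

coefs <- sapply(1:4, function(i) {
  fit <- lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])
  coef(fit)
})
round(coefs, 2)  # every column: intercept ~ 3.00, slope ~ 0.50
```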

While least squares regression is a good choice for dataset 1 (upper left plot), it fails to capture the shape of the other three datasets. In a recent post John Kruschke shows how to implement a Bayesian model in JAGS that captures the shape of both datasets 1 and 3 (lower left plot). Here I will expand that model to capture the shape of all four datasets. If that sounds interesting, start by reading Kruschke’s post; I will continue where that post ended…

OK, using a wide-tailed t distribution it was possible to down-weight the influence of the outlier in dataset 3 while still capturing the shape of dataset 1. The model still fails to capture datasets 2 and 4, however. Looking at dataset 2 (upper right plot), it is clear that we would like to model it as a quadratic curve. What we would like, then, is a model that includes a quadratic term when the data supports it (as dataset 2 does) and refrains from including one when the data supports a linear trend (as datasets 1 and 3 do). A solution is to give the quadratic term (b2) a *spike-and-slab* prior distribution: a mixture of two distributions, one thin (the spike) and one wide (the slab). By centering the spike over zero we introduce a bit of automatic model selection into our model. If there is evidence for a quadratic trend in the data, the slab allows that trend to be included in the model; if there is little evidence for a quadratic trend, the spike makes the estimate of b2 practically equivalent to zero. In JAGS such a prior can be implemented as follows:

```
b2 <- b2_prior[b2_pick + 1]
b2_prior[1] ~ dnorm(0, 99999999999999) # spike
b2_prior[2] ~ dnorm(0, 0.1)            # slab
b2_pick ~ dbern(0.5)
```

The argument to the dbern function indicates our prior belief that the data includes a quadratic term; in this case we think the odds are 50/50. The resulting prior looks like this:
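If you want to see the mixture without firing up JAGS, here is a plain-R sketch of the same prior (mine, not from the original post). Note that JAGS parameterizes dnorm with a precision, so precision 0.1 corresponds to sd = sqrt(1/0.1):

```r
# Simulate draws from the spike-and-slab prior on b2.
set.seed(1)
n <- 10000
b2_pick <- rbinom(n, size = 1, prob = 0.5)  # 0 = spike, 1 = slab
b2 <- ifelse(b2_pick == 0,
             rnorm(n, 0, sd = sqrt(1 / 99999999999999)),  # spike: essentially a point mass at 0
             rnorm(n, 0, sd = sqrt(1 / 0.1)))             # slab: wide around 0
hist(b2, breaks = 100)  # half the mass piled up at zero, the rest spread out
```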

How to model dataset 4? Well, I don’t think it is clear what dataset 4 actually shows. Since there are many data points at x = 8 we should be pretty sure about the mean and SD at that point. The only other data point is the one at x = 19, and by itself it doesn’t tell us much. For me, a good description of dataset 4 would be lots of certainty at x = 8 and lots of uncertainty for any other value of x. One reason ordinary least squares regression doesn’t work here is its assumption that the variance of the data (y) is the same at every x: the many data points at x = 8 also make it very certain about the variance at x = 19. By allowing the variance to differ across x we can implement a model that isn’t overconfident regarding the variance at x = 19. The following JAGS code implements a variance (here reparameterized as a precision) that can differ at different x:

```
tau[i] <- exp( t0_log + t1_log * x[i] )
t0_log ~ dunif(-10, 10)
t1_log ~ dunif(-10, 10)
```
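To see how this lets the standard deviation grow away from the bulk of the data, here is a small numeric illustration in plain R (the parameter values are made up by me for the example; a negative t1_log makes the precision shrink, and hence the sd grow, with x):

```r
# A log-linear precision implies a log-linear spread of the sd across x.
t0_log <- 2
t1_log <- -0.5
tau   <- function(x) exp(t0_log + t1_log * x)  # precision at x
sd_at <- function(x) sqrt(1 / tau(x))          # corresponding standard deviation
sd_at(8)   # modest sd where the data is dense (x = 8 in dataset 4)
sd_at(19)  # much larger sd at the lone far-out point
```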

Putting everything together in one JAGS model and using R and rjags to run it:

```r
library(datasets)
library(rjags)

model_string <- "
model{
  for(i in 1:length(x)) {
    y[i] ~ dt( b0 + b1 * x[i] + b2 * pow(x[i], 2), tau[i], nu )
    tau[i] <- exp( t0_log + t1_log * x[i] )
  }
  nu <- nuMinusOne + 1
  nuMinusOne ~ dexp( 1/29 )
  b0 ~ dnorm(0, 0.0001)
  b1 ~ dnorm(0, 0.0001)
  b2 <- b2_prior[b2_pick + 1]
  b2_prior[1] ~ dnorm(0, 99999999999999)
  b2_prior[2] ~ dnorm(0, 0.1)
  b2_pick ~ dbern(0.5)
  t0_log ~ dunif(-10, 10)
  t1_log ~ dunif(-10, 10)
}
"
```

Now let’s see how this model performs on Anscombe’s quartet. The following runs the model for dataset 1:

```r
m <- jags.model(textConnection(model_string),
                list(x = anscombe$x1, y = anscombe$y1), n.adapt = 10000)
update(m, 10000)
s <- coda.samples(m, c("b0", "b1", "b2", "t0_log", "t1_log", "b2_pick"),
                  n.iter = 20000)
plot(s)
```

Resulting in the following traceplots and parameter estimates:

Looks reasonable enough: there seems to be little evidence for a quadratic trend in the data. Now let’s plot the data with 40 randomly chosen regression lines from the MCMC samples superimposed:

```r
samples_mat <- as.matrix(s)  # "s" holds the coda samples from above
plot(anscombe$x1, anscombe$y1, xlim = c(0, 20), ylim = c(0, 20),
     pch = 16, xlab = "X", ylab = "Y")
for(row_i in sample(nrow(samples_mat), size = 40)) {
  curve(samples_mat[row_i, "b0"] + samples_mat[row_i, "b1"] * x +
          samples_mat[row_i, "b2"] * x^2,
        0, 20, add = TRUE, col = rgb(0.3, 0.3, 1, 0.5))
}
```

This also looks reasonable. Now let’s do the same for the three other datasets. Dataset 2:

Well, the trace plots look a bit strange, but that is because dataset 2 is a perfect quadratic function with no noise, which the model is able to capture: the superimposed regression lines all coincide.
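That dataset 2 really is a noiseless quadratic (up to the rounding in the published numbers) can be checked directly in R; this sanity check is mine, not part of the original post:

```r
# Fit a quadratic to dataset 2; the residuals are essentially zero.
library(datasets)
fit <- lm(y2 ~ x2 + I(x2^2), data = anscombe)
max(abs(resid(fit)))  # only rounding error remains
```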

Now for dataset 3:

This works well too! The superimposed regression lines are not influenced by the outlier just as in Kruschke’s model.

At last, dataset 4:

Well, the trace plots look a bit strange even when I run the model for a large number of samples (the plots above show 1,000,000 samples). The resulting regression lines are also difficult to interpret: many of them go through the point at x = 19, but many others are far off. Still, I believe this better captures how I view dataset 4: there might be a strong increasing trend, as suggested by the data point at x = 19, *but* almost any other trend is also possible.

To conclude: I think it is really cool what crazy models you can build using Bayesian modeling and JAGS. Mind you, the model I’ve presented here is not really *sane* and I would not use it for anything serious. For one thing, it seems to be quite sensitive to the priors (yes, guilty as charged, I had to tamper with the priors a little bit to get it to work…).

From my perspective, the above exercise shows a difference between classical statistics and Bayesian statistics. Using classical statistics it would be quite difficult to implement the model above. “Please use a normal distribution, and maybe transform the data, *if you must*.” – classical statistics. Bayesian statistics is more easygoing: I can use whatever distribution I want, and building a model is like building a LEGO castle. “Hey, why don’t you try using a t brick for the towers? And while you’re at it, put a spiky slab brick on top, it will look like a flagpole!” – Bayesian statistics.
