Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is a relatively brief addendum to last week’s post, where I described how the rvar datatype implemented in the R package posterior makes it quite easy to perform posterior probability checks to assess goodness of fit. In the initial post, I generated data from a linear model and estimated parameters for a linear regression model, and, unsurprisingly, the model fit the data quite well. When I introduced a quadratic term into the data generating process and fit the same linear model (without a quadratic term), equally unsurprising, the model wasn’t a great fit.

Immediately after putting the post up, I decided to make sure the correct model with the quadratic term would not result in extreme p-value (i.e. would fall between 0.02 and 0.98). And, again not surprisingly, the model was a good fit. I’m sharing all this here, because I got some advice on how to work with the rvar data a little more efficiently, and wanted to make sure those who are interested could see that. And while I was at it, I decided to investigate the distribution of Bayesian p-values under the condition that the model and data generating process are the same (i.e. the model is correct).

Just as a reminder, here is the data generation process:

$y \sim N(\mu = 2 + 6*x – 0.3x^2, \ \sigma^2 = 4)$

Here are the necessary libraries:

library(simstudy)
library(data.table)
library(cmdstanr)
library(posterior)
library(bayesplot)
library(ggplot2)

And here is the data generation:

b_quad <- -0.3

ddef <- defData(varname = "x", formula = "0;10", dist = "uniform")
ddef <- defData(ddef, "y", "2 + 6*x + ..b_quad*(x^2)", 4)

set.seed(72612)
dd <- genData(100, ddef)

The Stan model is slightly modified to include the additional term; $$\gamma$$ is the quadratic parameter:

data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}

transformed data{
vector[N] x2;

for (i in 1:N) {
x2[i] = x[i]*x[i];
};

}

parameters {
real alpha;
real beta;
real gamma;
real<lower=0> sigma;
}

model {
y ~ normal(alpha + beta*x + gamma*x2, sigma);
}
fit <- mod$sample( data = list(N = nrow(dd), x = dd$x, y = dd$y), seed = 72651, refresh = 0, chains = 4L, parallel_chains = 4L, iter_warmup = 500, iter_sampling = 2500 ) ## Running MCMC with 4 parallel chains... ## ## Chain 3 finished in 0.5 seconds. ## Chain 1 finished in 0.5 seconds. ## Chain 2 finished in 0.5 seconds. ## Chain 4 finished in 0.5 seconds. ## ## All 4 chains finished successfully. ## Mean chain execution time: 0.5 seconds. ## Total execution time: 0.6 seconds. As before, I am plotting the observed (actual data) along with the 80% intervals of predicted values at each level of $$x$$. The observed data appear to be randomly scattered within the intervals with no apparent pattern: post_rvars <- as_draws_rvars(fit$draws())

mu <- with(post_rvars, alpha + beta * as_rvar(dd$x) + gamma * as_rvar(dd$x^2))
pred <- rvar_rng(rnorm, nrow(dd), mu, post_rvars$sigma) df.80 <- data.table(x = dd$x, y=dd$y, t(quantile(pred, c(0.10, 0.90)))) df.80[, extreme := !(y >= V1 & y <= V2)] The code to estimate the p-value is slightly modified from last time. The important difference is that the lists of rvars (bin_prop_y and bin_prop_pred) are converted directly into vectors of rvars using the do.call function: df <- data.frame(x = dd$x, y = dd$y, mu, pred) df$grp <- cut(df$x, breaks = seq(0, 10, by = 2),include.lowest = TRUE, labels=FALSE) bin_prop_y <- lapply(split(df, df$grp), function(x) rvar_mean(x$y < x$mu))
rv_y <- do.call(c, bin_prop_y)
T_y <- rvar_var(rv_y)

bin_prop_pred <- lapply(split(df, df$grp), function(x) rvar_mean(x$pred < x\$mu))
rv_pred <- do.call(c, bin_prop_pred)
T_pred <- rvar_var(rv_pred)

mean(T_pred > T_y)
## [1] 0.583

In this one case, the p-value is 0.58, suggesting the model is a good fit. But, could this have been a fluke? Looking below at the density plot of p-values based on 10,000 simulated data sets suggests not; indeed, $$P(0.02 < \text{p-value} < 0.98) = 99.8\%.$$ (If you are interested in the code that estimated the density of p-values, I can post it as well.)