# Posterior probability checking with rvars: a quick follow-up

**ouR data generation**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is a relatively brief addendum to last week’s post, where I described how the `rvar`

datatype implemented in the `R`

package `posterior`

makes it quite easy to perform posterior probability checks to assess goodness of fit. In the initial post, I generated data from a linear model and estimated parameters for a linear regression model, and, unsurprisingly, the model fit the data quite well. When I introduced a quadratic term into the data generating process and fit the same linear model (without a quadratic term), equally unsurprising, the model wasn’t a great fit.

Immediately after putting the post up, I decided to make sure the correct model with the quadratic term would not result in extreme p-value (i.e. would fall between 0.02 and 0.98). And, again not surprisingly, the model was a good fit. I’m sharing all this here, because I got some advice on how to work with the `rvar`

data a little more efficiently, and wanted to make sure those who are interested could see that. And while I was at it, I decided to investigate the distribution of Bayesian p-values under the condition that the model and data generating process are the same (i.e. the model is correct).

Just as a reminder, here is the data generation process:

\[y \sim N(\mu = 2 + 6*x – 0.3x^2, \ \sigma^2 = 4)\]

Here are the necessary libraries:

library(simstudy) library(data.table) library(cmdstanr) library(posterior) library(bayesplot) library(ggplot2)

And here is the data generation:

b_quad <- -0.3 ddef <- defData(varname = "x", formula = "0;10", dist = "uniform") ddef <- defData(ddef, "y", "2 + 6*x + ..b_quad*(x^2)", 4) set.seed(72612) dd <- genData(100, ddef)

The `Stan`

model is slightly modified to include the additional term; \(\gamma\) is the quadratic parameter:

data { int<lower=0> N; vector[N] x; vector[N] y; } transformed data{ vector[N] x2; for (i in 1:N) { x2[i] = x[i]*x[i]; }; } parameters { real alpha; real beta; real gamma; real<lower=0> sigma; } model { y ~ normal(alpha + beta*x + gamma*x2, sigma); } mod <- cmdstan_model("code/quadratic_regression.stan") fit <- mod$sample( data = list(N = nrow(dd), x = dd$x, y = dd$y), seed = 72651, refresh = 0, chains = 4L, parallel_chains = 4L, iter_warmup = 500, iter_sampling = 2500 ) ## Running MCMC with 4 parallel chains... ## ## Chain 3 finished in 0.5 seconds. ## Chain 1 finished in 0.5 seconds. ## Chain 2 finished in 0.5 seconds. ## Chain 4 finished in 0.5 seconds. ## ## All 4 chains finished successfully. ## Mean chain execution time: 0.5 seconds. ## Total execution time: 0.6 seconds.

As before, I am plotting the observed (actual data) along with the 80% intervals of predicted values at each level of \(x\). The observed data appear to be randomly scattered within the intervals with no apparent pattern:

post_rvars <- as_draws_rvars(fit$draws()) mu <- with(post_rvars, alpha + beta * as_rvar(dd$x) + gamma * as_rvar(dd$x^2)) pred <- rvar_rng(rnorm, nrow(dd), mu, post_rvars$sigma) df.80 <- data.table(x = dd$x, y=dd$y, t(quantile(pred, c(0.10, 0.90)))) df.80[, extreme := !(y >= V1 & y <= V2)]

The code to estimate the p-value is slightly modified from last time. The important difference is that the lists of `rvars`

(*bin_prop_y* and *bin_prop_pred*) are converted directly into vectors of `rvars`

using the `do.call`

function:

df <- data.frame(x = dd$x, y = dd$y, mu, pred) df$grp <- cut(df$x, breaks = seq(0, 10, by = 2),include.lowest = TRUE, labels=FALSE) bin_prop_y <- lapply(split(df, df$grp), function(x) rvar_mean(x$y < x$mu)) rv_y <- do.call(c, bin_prop_y) T_y <- rvar_var(rv_y) bin_prop_pred <- lapply(split(df, df$grp), function(x) rvar_mean(x$pred < x$mu)) rv_pred <- do.call(c, bin_prop_pred) T_pred <- rvar_var(rv_pred) mean(T_pred > T_y) ## [1] 0.583

In this one case, the p-value is 0.58, suggesting the model is a good fit. But, could this have been a fluke? Looking below at the density plot of p-values based on 10,000 simulated data sets suggests not; indeed, \(P(0.02 < \text{p-value} < 0.98) = 99.8\%.\) (If you are interested in the code that estimated the density of p-values, I can post it as well.)

**leave a comment**for the author, please follow the link and comment on their blog:

**ouR data generation**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.