[This article was first published on Statistical Science & Related Matters on Less Likely, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. The following post provides some of the code that was used to construct the figures and tables from Rafi & Greenland, 2020.1 An enhanced PDF version of the paper can be found here. For further discussion of the computations, see the appendix of the main paper, along with our technical supplement.2

Disclaimer: I am responsible for all the code and mistakes below, and none of them can be attributed to my coauthors or my fellow package developers.

In order to recreate the functions, I would recommend installing the latest version of concurve from CRAN, as it has patched some issues with graphing when the outcome is binary. Use the script below to get the latest version and load the R package. A number of other R packages are also used in this post, which are listed below.

install.packages("concurve")
library("concurve")

## Valid $$P$$-values Are Uniform Under the Null Model

Here we show that valid $$P$$-values have specific properties, when the null model is true. We first generate two variables ($$Y$$, $$X$$) that come from the same normal distribution with a $$\mu$$ of 0 and $$\sigma$$ of 1, each with a total of 1000 observations. We assume that there is no relationship between these two variables. We run a simple t-test between $$Y$$ and $$X$$ and iterate this 100000 times and compute 100000 $$P$$-values to see the overall distribution of the $$P$$-values, which we then plot using a histogram./

RNGkind(kind = "L'Ecuyer-CMRG")
set.seed <- 1031
n.sim <- 100000
t.sim <- numeric(n.sim)
n.samp <- 1000

for (i in 1:n.sim) {
X <- rnorm(n.samp, mean = 0, sd = 1)
Y <- rnorm(n.samp, mean = 0, sd = 1)
df <- data.frame(X, Y)
t <- t.test(X, Y, data = df)
t.sim[i] <- t[]
}

ggplot(NULL, aes(x = t.sim)) +
geom_histogram(bins = 30, col = "black",
fill = "#99c7c7", alpha = 0.25) +
labs(title = "Distribution of P-values Under the Null",
x = "P-value") +
scale_x_continuous(breaks = seq(0, 1, 0.10)) +
theme_bw() This can also be shown using the TeachingDemos R package, which has a function dedicated to showing this phenomenon.

library("TeachingDemos")
RNGkind(kind = "L'Ecuyer-CMRG")
set.seed <- 1031
obs_p <- Pvalue.norm.sim(n = 1000, mu = 0,
mu0 = 0, sigma = 1,
sigma0 = 1, test= "t",
alternative = "two.sided",
alpha = 0.05, B = 100000) ggplot(NULL, aes(x = obs_p)) +
geom_histogram(bins = 30, col = "black",
fill = "#99c7c7", alpha = 0.25) +
labs(title = "Distribution of P-values Under the Null",
x = "P-value") +
scale_x_continuous(breaks = seq(0, 1, 0.10)) +
theme_bw() As you can see, when the null model is true, the distribution of $$P$$-values is uniform. Valid $$P$$-values are uniform under the null hypothesis and their corresponding $$S$$-values are exponentially distributed. We run the same simulation as before, but then convert the obtained $$P$$-values into $$S$$-values, to see how they are distributed.

RNGkind(kind = "L'Ecuyer-CMRG")
set.seed <- 1031
n.sim <- 100000
t.sim <- numeric(n.sim)
n.samp <- 1000

for (i in 1:n.sim) {
X <- rnorm(n.samp, mean = 0, sd = 1)
Y <- rnorm(n.samp, mean = 0, sd = 1)
df <- data.frame(X, Y)
t <- t.test(X, Y, data = df)
t.sim[i] <- t[]
}

ggplot(NULL, aes(x = -log2(t.sim))) +
geom_histogram(bins = 30, col = "black",
fill = "#d46c5b", alpha = 0.5) +
labs(title = "Distribution of S-values Under the Null",
x = "S-value (Bits of Information)") +
theme_bw() ## Posterior Predictive $$P$$-values

Despite the name, posterior predictive $$P$$-values are not considered valid frequentist $$P$$-values because they do not meet the uniformity criterion, instead they are pulled towards the parameter value 0.5. For further discussion, see our technical supplement along with Greenland (2019) for comprehensive theoretical discussions.2, 3

A quick excerpt from our main paper and the technical supplement explains why this is not the case,

As discussed in the Supplement, in Bayesian settings one may see certain $$P$$-values that are not valid frequentist $$P$$-values, the primary example being the posterior predictive $$P$$-value;45; unfortunately, the negative logs of such invalid $$P$$-values do not measure surprisal at the statistic given the model, and so are not valid $$S$$-values.

The decision rule “reject H if $$p\leq$$ $$\alpha$$” will reject H 100$$\alpha$$% of the time under sampling from a model M obeying H (i.e., the Type-1 error rate of the test will be $$\alpha$$) provided the random variable $$P$$ corresponding to $$p$$ is valid (uniform under the model M used to compute it), but not necessarily otherwise6. This is one reason why frequentist writers reject invalid $$P$$-values (such as posterior predictive $$P$$-values, which highly concentrate around 0.50) and devote considerable technical coverage to uniform $$P$$-values;6;5.4 A valid $$P$$-value (“U-value”) translates into an exponentially distributed $$S$$-value with a mean of 1 nat or $$\log_{2}(e)=1.443$$ bits where $$e$$ is the base of the natural logs.

Uniformity is also central to the “refutational information” interpretation of the $$S$$-value used here, for it is necessary to ensure that the $$P$$-value $$p$$ from which $$s$$ is derived is in fact the percentile of the observed value of the test statistic in the distribution of the statistic under M, thus making small $$p$$ surprising under M and making $$s$$ the corresponding degree of surprise. Because posterior predictive $$P$$-values do not translate into sampling percentiles of the statistic under the hypothesis (in fact, they are pulled toward 0.5 from the correct percentiles);5,4 the resulting negative log does not measure surprisal at the statistic given M, and so is not a valid $$S$$-value in our terms.

And indeed, we can show this phenomenon below with simulations. Here we fit a simple Bayesian regression model with a weakly informative prior normal(0, 10) using rstanarm, where both the predictor and response variable come from the same distribution and have the same location and scale parameters ($$\mu = 0$$, $$\sigma = 1$$. We then calculate the observed test statistics, along with their distributions, and convert them into posterior predictive $$P$$-values and plot them using Bayesplot functions. Then, we iterate this 1000 times to examine the distribution of posterior predictive $$P$$-values and compare them to standard $$P$$-values that are known to be uniform.

But first, we’ll generate the distribution of the test statistic from one model.

library("bayesplot")
library("rstan")
library("rstanarm")
rstan_options(auto_write = TRUE)
options(mc.cores = 4)
teal <- c("#99c7c7")
color_scheme_set("teal")
RNGkind(kind = "L'Ecuyer-CMRG")
n.samp <- 1000
X <- rnorm(n.samp, mean = 0, sd = 1)
Y <- rnorm(n.samp, mean = 0, sd = 1)
df <- data.frame(X, Y)
mod1 <- stan_glm(Y ~ X, data = df, chains = 1, cores = 4,
refresh = 0, prior = normal(0, 10))
yrep <- posterior_predict(mod1)

h <- ppc_stat(Y, yrep, stat = "median", freq = TRUE, binwidth = 0.01) +
labs(title = "Distribution of Posterior Test Statistic") +
theme_bw() +
yaxis_text(on = TRUE)

values_all <- h[["plot_env"]][["T_yrep"]]
prob_to_find <- h[["plot_env"]][["T_y"]]
quantInv <- function(distr, value) ecdf(distr)(value)
posterior_p_value <- quantInv(values_all, prob_to_find)

l <- ppc_stat_2d(Y, yrep) +
theme_bw() +
labs(title = "Distribution of Posterior Test Statistic")

plot(h) plot(l) print(prob_to_find)
 0.03618

The above is the posterior predictive test statistic for one model, along with it’s distribution. Now let’s iterate this process, convert them into posterior predictive $$P$$-values, and generate their distribution.

rstan_options(auto_write = TRUE)
options(mc.cores = 4)
RNGkind(kind = "L'Ecuyer-CMRG")
set.seed <- 1031
n.sim <- 1000
ppp <- numeric(n.sim)
n.samp <- 1000

quantInv <- function(distr, value) ecdf(distr)(value)

for (i in 1:length(ppp)) {
X <- rnorm(n.samp, mean = 0, sd = 1)
Y <- rnorm(n.samp, mean = 0, sd = 1)
df <- data.frame(X, Y)
mod1 <- stan_glm(Y ~ X, data = df, chains = 1, cores = 4,
iter = 1000, refresh = 0, prior = normal(0, 10))
yrep <- posterior_predict(mod1)
h <- ppc_stat(Y, yrep, stat = "median")
values_all <- h[["plot_env"]][["T_yrep"]]
prob_to_find <- h[["plot_env"]][["T_y"]]
posterior_p_value <- quantInv(values_all, prob_to_find)
ppp[i] <- posterior_p_value
}
Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess
Warning: The largest R-hat is 1.05, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat
Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess
Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

ggplot(NULL, aes(x = ppp)) +
geom_histogram(bins = 30, col = "black", fill = "#99c7c7", alpha = 0.25) +
labs(title = "Distribution of Posterior Predictive P-values",
subtitle = "From Simulation of 1000 Simple Linear Regressions",
x = "Posterior Predictive P-value") +
scale_x_continuous(breaks = seq(0, 1, 0.10)) +
theme_bw() We can also show this in Stata 16.1, using their new bayesstats ppvalues command which generates posterior predictive $$P$$-values.

clear
set obs 10000
/**
Generate variables from a normal distribution like before
*/
generate x = rnormal(0, 1)
generate y = rnormal(0, 1)
/**
We set up our model here
*/
bayesmh y x, likelihood(normal({var})) prior({var}, normal(0, 10)) ///
prior({y:}, normal(0, 10)) rseed(1031) saving(coutput_pred, replace) mcmcsize(1000)
/**
We use the bayespredict command to make predictions from the model
*/
bayespredict (mean:@mean({_resid})) (var:@variance({_resid})), ///
rseed(1031) saving(coutput_pred, replace)
/**
Then we calculate the posterior predictive P-values
*/
bayesstats ppvalues {mean} using coutput_pred
number of observations (_N) was 0, now 10,000

Burn-in ...
Simulation ...

Model summary
------------------------------------------------------------------------------
Likelihood:
y ~ normal(xb_y,{var})

Priors:
{y:x _cons} ~ normal(0,10)                                               (1)
{var} ~ normal(0,10)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_y.

Bayesian normal regression                       MCMC iterations  =      3,500
Random-walk Metropolis-Hastings sampling         Burn-in          =      2,500
MCMC sample size =      1,000
Number of obs    =     10,000
Acceptance rate  =      .1878
Efficiency:  min =     .05592
avg =     .08647
Log marginal-likelihood = -14154.084                          max =      .1143

------------------------------------------------------------------------------
|                                                Equal-tailed
|      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
-------------+----------------------------------------------------------------
y            |
x |  .0044782   .0103859   .001389   .0033005   -.012539   .0282905
_cons |  .0183664   .0100446    .00094    .017438  -.0008961   .0360122
-------------+----------------------------------------------------------------
var |  .9880526   .0134149    .00142    .988618   .9624949   1.016522
------------------------------------------------------------------------------

file coutput_pred.dta saved

Computing predictions ...

file coutput_pred.dta saved
file coutput_pred.ster saved

Posterior predictive summary   MCMC sample size =     1,000

-----------------------------------------------------------
T |      Mean   Std. Dev.  E(T_obs)  P(T>=T_obs)
-------------+---------------------------------------------
mean | -.0005311   .0093848   .0003538          .47
-----------------------------------------------------------
Note: P(T>=T_obs) close to 0 or 1 indicates lack of fit.

As we can see from the R plot above, the test statistics are are generally concentrated around 0.5 and the distribution (as shown above via the R code) hardly resembles the uniform distribution of $$P$$-value under the null model. Further, the base-2 log transformation of posterior predictive $$P$$-values is not exponentially distributed, making them invalid $$S$$-values in our terms.

ggplot(NULL, aes(x = (-log2(ppp)))) +
geom_histogram(bins = 30, col = "black",
fill = "#d46c5b", alpha = 0.25) +
labs(title = "Distribution of -log2(Posterior Predictive P-values)",
subtitle = "S-values Produced From Posterior Predictive P-values",
x = "-log2(Posterior Predictive P-value)") +
theme_bw() ## Table 1: Some $$P$$-values, $$S$$-values, Maximum Likelihood Ratios, and Likelihood Ratio Statistics

Here, we look at some common $$P$$-values, and how they correspond to other information statistics and likelihood measures.

pvalue <- c(0.99, 0.90, 0.50, 0.25, 0.10,
0.05, 0.025, 0.01, 0.005, 0.0001,
paste("5 sigma (~ 2.9 in 10 million)"),
paste("1 in 100 million (GWAS)"),
paste("6 sigma (~ 1 in a billion)"))

svalue <- round(c(-log2(c(0.99, 0.90, 0.50, 0.25, 0.10,
0.05, 0.025, 0.01, 0.005, 0.0001)), 21.7, 26.6, 29.9), 2)

pvalue[1:10] <- formatC(pvalue[1:10], format = "e", digits = 2)

mlr <- round(c(1.00, 1.01, 1.26, 1.94, 3.87, 6.83, 12.3, 27.6, 51.4, 1935,
(5.2 * (10^5)), (1.4 * (10^7)), (1.3 * (10^8))), 2)

# likelihood ratio statistic

lr <- round(c(0.00016, 0.016, 0.45, 1.32, 2.71, 3.84, 5.02,
6.63, 7.88, 15.1, 26.3, 32.8, 37.4), 2)

table1 <- data.frame(pvalue, svalue, mlr, lr)

colnames(table1) <- c("P-value (compatibility)",
"S-value (bits)",
"Maximum Likelihood Ratio",
"Deviance Statistic 2ln(MLR)")

kbl(table1, format = "html", padding = 45, longtable = TRUE, color = "#777") %>%
row_spec(row = 0, color = "#777",  bold = T) %>%
column_spec(column = 1:4, color = "#777",  bold = F) %>%
kable_classic(full_width = TRUE, html_font = "Cambria",
lightable_options = "basic",
font_size = 17, fixed_thead = list(enabled = F)) %>%
footnote(general_title = "Abbreviations: ", title_format = "underline", general = c("Table 1 $P$-values                  and binary $S$-values, with corresponding maximum-likelihood ratios (MLR) and deviance (likelihood-ratio) statistics for a simple test hypothesis H under background assumptions A")) 
P-value (compatibility) S-value (bits) Maximum Likelihood Ratio Deviance Statistic 2ln(MLR)
0.99 0.01 1.000e+00 0.00
0.9 0.15 1.010e+00 0.02
0.5 1.00 1.260e+00 0.45
0.25 2.00 1.940e+00 1.32
0.1 3.32 3.870e+00 2.71
0.05 4.32 6.830e+00 3.84
0.025 5.32 1.230e+01 5.02
0.01 6.64 2.760e+01 6.63
0.005 7.64 5.140e+01 7.88
1e-04 13.29 1.935e+03 15.10
5 sigma (~ 2.9 in 10 million) 21.70 5.200e+05 26.30
1 in 100 million (GWAS) 26.60 1.400e+07 32.80
6 sigma (~ 1 in a billion) 29.90 1.300e+08 37.40
Abbreviations:
Table 1 $$P$$-values and binary $$S$$-values, with corresponding maximum-likelihood ratios (MLR) and deviance (likelihood-ratio) statistics for a simple test hypothesis H under background assumptions A

For further discussion of 5 and 6 sigma cutoffs, see.7

## Refer to Long-Run Coverage

Here we simulate a study where one group with 100 participants has an $$\mu$$ of 100 with a $$\sigma$$ of 20 and the second group has the same number of participants but an average of 80 and a standard deviation of 20. We compare them using a Welch’s $$t$$-test and generate 95% compatibility intervals several times, specifically 100 times, and then plot them. Since we know the mean difference is 20, we wish to see how often the interval estimates cover this true parameter value.

#' NOTE: Confidence Interval Simulation Function
#'
#' @return
#' @export
#'
#' @examples
sim <- function() {
fake <- data.frame((x <- rnorm(100, 100, 20)),
(y <- rnorm(100, 80, 20)))
intervals <- t.test(x = x, y = y, data = fake,
conf.level = .95)$conf.int[] } RNGkind(kind = "L'Ecuyer-CMRG") set.seed(1031) z <- replicate(100, sim(), simplify = FALSE) df <- data.frame(do.call(rbind, z)) df$studynumber <- (1:length(z))
intrvl.limit <- c("lower.limit", "upper.limit", "studynumber")
colnames(df) <- intrvl.limit
df$point <- ((df$lower.limit + df$upper.limit) / 2) df$covered <- (df$lower.limit <= 20 & 20 <= df$upper.limit)
df$coverageprob <- ((as.numeric(table(df$covered)) / nrow(df) * 100))

ggplot(data = df, aes(x = studynumber, y = point,
ymin = lower.limit, ymax = upper.limit)) +
geom_pointrange(mapping = aes(color = covered),
fill = "#99c7c7", size = .60, alpha = 0.35) +
geom_hline(yintercept = 20, lty = 1, color = "red", alpha = 0.30) +
coord_flip() +
labs(title = "Simulated 95% Frequentist Interval Estimates",
subtitle = "True Mean Difference is 20",
x = "Study Number",
y = "Estimate (Point + Interval)",
caption = "Red Intervals Do Not Cover Population Parameter") +
theme_bw() + # use a white background
theme(legend.position = "none") +
annotate(geom = "text", x = 102, y = 30,
label = "Overall Coverage (%) =", size = 3.5,
color = "#99c7c7", alpha = 0.950) +
annotate(geom = "text", x = 102, y = 35,
label = df\$coverageprob, size = 3.5,
color = "#99c7c7", alpha = 0.950) +
theme(text = element_text(size = 13.5)) +
theme(plot.title = element_text(size = 16),
plot.subtitle = element_text(size = 12),
plot.caption = element_text(size = 10)) This shows that the intervals tend to vary simply as a result of randomness, and that the proper interpretation of the attached percentile is about long run coverage of the true parameter. However, this does not preclude interpretation of a single interval estimate from a study.8 As we write in the Replace unrealistic “confidence” claims with compatibility measures section of our paper.1

The fact that “confidence” refers to the procedure behavior, not the reported interval, seems to be lost on most researchers. Remarking on this subtlety, when Jerzy Neyman discussed his confidence concept in 1934 at a meeting of the Royal Statistical Society, Arthur Bowley replied, “I am not at all sure that the ‘confidence’ is not a confidence trick.”.9 And indeed, 40 years later, Cox and Hinkley warned, “interval estimates cannot be taken as probability statements about parameters, and foremost is the interpretation ‘such and such parameter values are consistent with the data.’”.8 Unfortunately, the word “consistency” is used for several other concepts in statistics, while in logic it refers to an absolute condition (of noncontradiction); thus, its use in place of “confidence” would risk further confusion.

To address the problems above, we exploit the fact that a 95% CI summarizes the results of varying the test hypothesis H over a range of parameter values, displaying all values for which $$p$$ > 0.0510 and hence $$s$$ < 4.32 bits;3.11 Thus the CI contains a range of parameter values that are more compatible with the data than are values outside the interval, under the background assumptions;3.12 Unconditionally (and thus even if the background assumptions are uncertain), the interval shows the values of the parameter which, when combined with the background assumptions, produce a test model that is “highly compatible” with the data in the sense of having less than 4.32 bits of information against it. We thus refer to CI as compatibility intervals rather than confidence intervals;13;3;11 their abbreviation remains “CI.” We reject calling these intervals “uncertainty intervals,” because they do not capture uncertainty about background assumptions.13

Indeed, this also has to do with our paper on deemphasizing the model assumptions that are often behind common statistical outputs, and why it is necessary to treat them as uncertain, rather than given.14 For example, a 95% interval estimate is assumed to be free of bias in its construction, however, it’s coverage claims are no longer so, once there are systematic errors in play. Many of these common, classical statistical procedures are designed to deal with random variation, and less so with bias (a relatively new field, with shiny new methods).

# Brown et al. (2017) Reanalysis

Taken from the Brown et al. data.15, 16

Here we take the reported statistics from the Brown et al. studies in order to run statistical tests of different alternative test hypotheses and use those results to construct various functions.

We calculate the standard errors from the point estimate and the confidence (compatibility) limits.

se <- log(2.59 / 0.997) / 3.92

logUL <- log(2.59)
logLL <- log(0.997)

logpoint <- log(1.61)

logpoint + (1.96 * se)
 0.9536
logpoint - (1.96 * se)
 -0.001097

## Table 2: $$P$$-values, $$S$$-values, and Likelihoods for Targeted Hypotheses About Hazard Ratios for Brown et al.

testhypothesis <- c("Halving of hazard", "No effect (null)",
"Point estimate", "Doubling of hazard",
"Tripling of hazard", "Quintupling of hazard")

hazardratios <- c(0.5, 1, 1.61, 2, 3, 5)
pvals <- c(1.6e-06, 0.05, 1.00, 0.37, 0.01, 3.2e-06)
svals <- round(-log2(pvals), 2)
mlr2 <- c((1.0 * 10^5), 6.77, 1.00, 1.49, 26.2, (5.0 * 10^4))
lr <- round(c(
exp((((log(0.5 / 1.61)) / se)^2) / (2)),
exp((((log(1 / 1.61)) / se)^2) / (2)),
exp((((log(1.61 / 1.61)) / se)^2) / (2)),
exp((((log(2 / 1.61)) / se)^2) / (2)),
exp((((log(3 / 1.61)) / se)^2) / (2)),
exp((((log(5 / 1.61)) / se)^2) / (2))), 3)

LR <- formatC(lr, format = "e", digits = 2)

table2 <- data.frame(testhypothesis, hazardratios,
pvals, svals, mlr2, LR)

colnames(table2) <- c("Test Hypothesis", "HR",
"P-values", "S-values",
"MLR", "LR")

kbl(table2, format = "html", padding = 45, longtable = TRUE, color = "#777") %>%
row_spec(row = 0, color = "#777",  bold = T) %>%
column_spec(column = 1:6, color = "#777",  bold = F) %>%
kable_classic(full_width = TRUE, html_font = "Cambria",
lightable_options = "basic",
font_size = 15, fixed_thead = list(enabled = F)) 
Test Hypothesis HR P-values S-values MLR LR
Halving of hazard 0.50 0.00 19.25 1.00e+05 1.02e+05
No effect (null) 1.00 0.05 4.32 6.77e+00 6.77e+00
Point estimate 1.61 1.00 0.00 1.00e+00 1.00e+00
Doubling of hazard 2.00 0.37 1.43 1.49e+00 1.49e+00
Tripling of hazard 3.00 0.01 6.64 2.62e+01 2.62e+01
Quintupling of hazard 5.00 0.00 18.25 5.00e+04 5.03e+04

## Plot the point estimate and 95% compatibility interval

label <- c("Brown et al. (2017)\n JAMA",
"Brown et al. (2017)\n J Clin Psychiatry")

point <- c(1.61, 1.7)
lower <- c(0.997, 1.1)
upper <- c(2.59, 2.6)

df <- data.frame(label, point, lower, upper)

Here we plot the 95% compatibility interval estimate reported from the high-dimensional propensity score analysis.

ggplot(data = df, mapping = aes(x = label, y = point,
ymin = lower, ymax = upper, group = label)) +
geom_pointrange(color = "#007C7C", size = 1.5, alpha = 0.4) +
geom_hline(yintercept = 1, lty = 3, color = "#d46c5b", alpha = 0.5) +
coord_flip() +
scale_y_log10(breaks = scales::pretty_breaks(n = 10), limits = c(0.80, 4)) +
labs(title = "Association Between Serotonergic Antidepressant Exposure
\nDuring Pregnancy and Child Autism Spectrum Disorder",
subtitle = "Reported 95% Compatibility Intervals From Primary Results",
x = "Study",
y = "Hazard Ratio (JAMA) +
Adjusted Pooled Odds Ratio (J Clin Psychiatry)") +
annotate(geom = "text", x = 1.5, y = 1, size = 5,
label = "Null parameter value of 1",
color = "#d46c5b", alpha = 0.75) +
annotate(geom = "text", x = 2.2, y = 1.8, size = 5,
label = "Point Estimate = 1.61,
Lower Limit = 0.997, Upper Limit = 2.59",
color = "#000000", alpha = 0.75) +
annotate(geom = "text", x = 1.2, y = 1.8, size = 5,
label = "Point Estimate = 1.7,
Lower Limit = 1.1, Upper Limit = 2.6",
color = "#000000", alpha = 0.75) +
theme_light() +
theme(axis.text.y = element_text(angle=90, hjust=1, size = 13)) +
theme(axis.text.x = element_text(size = 13)) +
theme(title = element_text(size = 13)) Once again, the authors reported that15

In the 2837 pregnancies (7.9%) exposed to antidepressants, 2.0% (95% CI, 1.6%-2.6%) of children were diagnosed with autism spectrum disorder. Risk of autism spectrum disorder was significantly higher with serotonergic antidepressant exposure (4.51 exposed vs 2.03 unexposed per 1000 person-years; between-group difference, 2.48 95% CI, 2.33-2.62 per 1000 person-years) in crude (HR, 2.16 95% CI, 1.64-2.86) and multivariable-adjusted analyses (HR, 1.59 95% CI, 1.17-2.17) (Table 2). After inverse probability of treatment weighting based on the HDPS, the association was not significant (HR, 1.61 95% CI, 0.997-2.59) (Table 2).

and concluded with

In children born to mothers receiving public drug coverage in Ontario, Canada, in utero serotonergic antidepressant exposure compared with no exposure was not associated with autism spectrum disorder in the child. Although a causal relationship cannot be ruled out, the previously observed association may be explained by other factors.

We will show using the graphical and tabular summaries below, why a more nuanced summary such as,

“After HDPS adjustment for confounding, a 61% hazard elevation remained; however, under the same model, every hypothesis from no elevation up to a 160% hazard increase had $$p$$ > 0.05; Thus, while quite imprecise, these results are consistent with previous observations of a positive association between serotonergic antidepressant prescriptions and subsequent ASD. Because the association may be partially or wholly due to uncontrolled biases, further evidence will be needed for evaluating what, if any, proportion of it can be attributed to causal effects of prenatal serotonergic antidepressant use on ASD incidence.”

would have been far more appropriate, as we discuss in our paper.1

## $$P$$-value and $$S$$-value Functions

In order to use this information to construct a $$P$$-value function, we can use the concurve functions.

library("concurve")

We enter the reported point estimates and compatibility limits and produce all possible intervals + $$P$$-values + $$S$$-values. This is calculated assuming normal approximations with the curve_rev() function.

curve1 <- curve_rev(point = 1.61, LL = 0.997, UL = 2.59,
measure = "ratio")

It is stored in the object curve1.

We can generate a table of the relevant statistics from this reanalysis, similar to the table from above, but this time using the concurve function, curve_table(), but this time it will give us interval estimates of various percentiles ($$25$$%, $$50$$%, $$75$$%, $$95$$%, etc.). These are in turn used to construct the entire confidence (compatibility) distribution or $$P$$-value function.

kbl(curve_table(curve1[]), format = "html", padding = 45,
longtable = TRUE, color = "#777", row.names = F,
col.names = colnames(curve_table(curve1[]))) %>%
row_spec(row = 0, color = "#777",  bold = T) %>%
column_spec(column = 1:7, color = "#777",  bold = F) %>%
kable_classic(full_width = TRUE, html_font = "Cambria",
lightable_options = "basic",
font_size = 17, fixed_thead = list(enabled = F)) %>%
footnote(general_title = c("Table of Statistics for Various Interval Estimate Percentiles"), general = " ")  
Lower Limit Upper Limit Interval Width Interval Level (%) CDF P-value S-value (bits)
1.490 1.740 0.250 25.0 0.625 0.750 0.415
1.366 1.897 0.531 50.0 0.750 0.500 1.000
1.217 2.131 0.914 75.0 0.875 0.250 2.000
1.178 2.200 1.021 80.0 0.900 0.200 2.322
1.134 2.286 1.152 85.0 0.925 0.150 2.737
1.079 2.403 1.325 90.0 0.950 0.100 3.322
0.999 2.595 1.596 95.0 0.975 0.050 4.322
0.933 2.779 1.846 97.5 0.988 0.025 5.322
0.860 3.015 2.155 99.0 0.995 0.010 6.644
Table of Statistics for Various Interval Estimate Percentiles

## Figure 3: $$P$$-value (Compatibility) Function

Plot the $$P$$-value (Compatibility) function of the Brown et al. data

Now we plot the $$P$$-value function from the data stored in curve1 using the ggcurve() function from concurve.

ggcurve(curve1[], type = "c", measure = "ratio",
nullvalue = c(1), levels = c(0.5, 0.75, 0.95)) +
labs(title = expression(paste(italic(P), "-value (Compatibility) Function")),
subtitle = expression(paste(italic(P),
"-values for a range of hazard ratios (HR)")),
x = "Hazard Ratio (HR)",
y = expression(paste(italic(P),"-value"))) +
geom_vline(xintercept = 1.61, lty = 1, color = "gray", alpha = 0.2) +
geom_vline(xintercept = 2.59, lty = 1, color = "gray", alpha = 0.2) +
theme(plot.title = element_text(size = 12),
plot.subtitle = element_text(size = 11),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
text = element_text(size = 11)) It is practically the same as the published version from Rafi & Greenland, 2020.1 curve1 <- curve_rev(point = 1.61, LL = 0.997, UL = 2.59,
measure = "ratio")

curve2 <- curve_rev(point = 1.7, LL = 1.1, UL = 2.6,
measure = "ratio")

## Cumulative Confidence (Compatibility Distribution)

Although we do not cover this figure in our paper, it is easy to construct using concurve's ggcurve() function by specifying type as “cdf” to see the median estimate within the confidence distribution. The horizontal line that meets at the curve, is approximately the same as the maximum at the $$P$$-value function/confidence (compatibility) curve.

ggcurve(curve1[], type = "cdf", measure = "ratio", nullvalue = c(1)) +
labs(title = "P-values for a range of hazard ratios (HR)",
subtitle = "Association Between Serotonergic Antidepressant Exposure
\nDuring Pregnancy and Child Autism Spectrum Disorder",
x = "Hazard Ratio (HR)",
y = "Cumulative Compatibility") +
theme(plot.title = element_text(size = 12),
plot.subtitle = element_text(size = 11),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
text = element_text(size = 11)) We will also calculate the likelihoods so that we can generate the corresponding likelihood functions.

lik2 <- curve_rev(point = 1.7, LL = 1.1, UL = 2.6,
type = "l", measure = "ratio")

lik1 <- curve_rev(point = 1.61, LL = 0.997, UL = 2.59,
type = "l", measure = "ratio")

We can also see how consistent these results are with previous studies conducted by the same research group, given the overlap of the functions, which can be compared using the plot_compare() function. Let’s compare the relative likelihood functions from both studies from this research group to see how consistent the results are.

plot_compare(data1 = lik1[], data2 = lik2[],
type = "l1", measure = "ratio", nullvalue = TRUE,
title = "Brown et al. 2017. J Clin Psychiatry. vs.
\nBrown et al. 2017. JAMA.",
subtitle = "J Clin Psychiatry: OR = 1.7, 1/6.83 LI: LL = 1.1, UL = 2.6
\nJAMA: HR = 1.61, 1/6.83 LI: LL = 0.997, UL = 2.59",
xaxis = "Hazard Ratio / Odds Ratio") +
theme(plot.title = element_text(size = 12),
plot.subtitle = element_text(size = 11),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
text = element_text(size = 11)) and the $$P$$-value functions.

plot_compare(data1 = curve1[], data2 = curve2[],
type = "c", measure = "ratio", nullvalue = TRUE,
title = "Brown et al. 2017. J Clin Psychiatry. vs.
\nBrown et al. 2017. JAMA.",
subtitle = "J Clin Psychiatry: OR = 1.7, 1/6.83 LI: LL = 1.1, UL = 2.6
\nJAMA: HR = 1.61, 1/6.83 LI: LL = 0.997, UL = 2.59",
xaxis = "Hazard Ratio / Odds Ratio") +
theme(plot.title = element_text(size = 12),
plot.subtitle = element_text(size = 11),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
text = element_text(size = 11)) ## Figure 4: $$S$$-value (Suprisal) Function

Plot the $$S$$-value (Surprisal) function of the Brown et al. data with ggcurve()

ggcurve(data = curve1[], type = "s", measure = "ratio", nullvalue = c(1),
title = "S-value (Surprisal) Function",
subtitle = "S-Values (surprisals) for a range of hazard ratios (HR)",
xaxis = "Hazard Ratio", yaxis1 = "S-value (bits of information)") +
theme(plot.title = element_text(size = 12),
plot.subtitle = element_text(size = 11),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
text = element_text(size = 11)) which is quite close to the figure from our paper. ## Likelihood (Support) Functions

Here we provide the code to show how we constructed the likelihood functions from our paper, which are the supplementary figures found here (S1) and here (S2).

Calculate and Plot Likelihood (Support) Intervals.

Here we use the reported estimates to construct the $$Z$$-scores and standard errors, which are in turn used to compute the likelihoods.

hrvalues <- seq(from = 0.65, to = 3.98, by = 0.01)

se <- log(2.59 / 0.997) / 3.92

zscore <- sapply(hrvalues, function(i) (log(i / 1.61) / se))

## Figure S1: Relative Likelihood Function

We then standardize all the likelihoods by their maximum at the likelihood function,17 to produce relative likelihoods, likelihood intervals, and their corresponding relative likelihood function.18

support <- exp((-zscore^2) / 2)

likfunction <- data.frame(hrvalues, zscore, support)

ggplot(data = likfunction, mapping = aes(x = hrvalues, y = support)) +
geom_line() +
geom_vline(xintercept = 1, lty = 3, color = "#d46c5b", alpha = 0.75) +
geom_hline(yintercept = (1/6.83), lty = 1, color = "#333333", alpha = 0.05) +
geom_ribbon(aes(x = hrvalues, ymin = min(support), ymax = support),
fill = "#d46c5b", alpha = 0.10) +
labs(title = "Relative Likelihood Function",
subtitle = "Relative likelihoods for a range of hazard ratios",
x = "Hazard Ratio (HR)",
y = "Relative Likelihood \n1/MLR") +
annotate(geom = "text", x = 1.65, y = 0.2,
label = "1/6.83 LI = 0.997, 2.59", size = 4, color = "#000000") +
theme_bw() +
theme(plot.title = element_text(size = 12),
plot.subtitle = element_text(size = 11),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
text = element_text(size = 11)) +
scale_x_log10(breaks = scales::pretty_breaks(n = 10)) +
scale_y_continuous(expand = expansion(mult = c(0.01, 0.0125)),
breaks = seq(0, 1, 0.10)) The $$\frac{1}{6.83}$$ likelihood interval corresponds to the 95% compatibility interval, as shown by the figure from our paper. Below we use the calculated $$Z$$-scores to construct the log-likelihood function which is the upward-concave parabola $$\frac{Z^{2}}{2}$$ = $$-ln(MLR)$$

support <- (zscore^2) / 2

likfunction <- data.frame(hrvalues, zscore, support)

ggplot(data = likfunction, mapping = aes(x = hrvalues, y = support)) +
geom_line() +
geom_vline(xintercept = 1, lty = 3, color = "#d46c5b", alpha = 0.75) +
geom_ribbon(aes(x = hrvalues, ymin = support, ymax = max(support)),
fill = "#d46c5b", alpha = 0.10) +
labs(title = "Log-Likelihood Function",
subtitle = "Log-likelihoods for a range of hazard ratios",
x = "Hazard Ratio (HR)",
y = "ln(MLR)") +
theme_bw() +
theme(axis.title.x = element_text(size = 13),
axis.title.y = element_text(size = 13)) +
scale_x_log10(breaks = scales::pretty_breaks(n = 10)) +
scale_y_continuous(expand = expansion(mult = c(0.01, 0.0125)),
breaks = seq(0, 7, 1)) +
theme(text = element_text(size = 15)) +
theme(plot.title = element_text(size = 12),
plot.subtitle = element_text(size = 12)) ## Figure S2: Deviance Function $$2ln(MLR)$$

Known as the deviance function, it is twice the log-likelihood, which maps to the $$S$$-value function.

support <- (zscore^2)

likfunction <- data.frame(hrvalues, zscore, support)

ggplot(data = likfunction, mapping = aes(x = hrvalues, y = support)) +
geom_line() +
geom_vline(xintercept = 1, lty = 3, color = "#d46c5b", alpha = 0.75) +
geom_hline(yintercept = 3.84, lty = 1, color = "#333333", alpha = 0.05) +
geom_ribbon(aes(x = hrvalues, ymin = support, ymax = max(support)),
fill = "#d46c5b", alpha = 0.10) +
annotate(geom = "text", x = 1.65, y = 4.4,
label = "1/6.83 LI = 0.997, 2.59", size = 4, color = "#000000") +
theme_bw() +
theme(plot.title = element_text(size = 12),
plot.subtitle = element_text(size = 11),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
text = element_text(size = 11)) +
scale_x_log10(breaks = scales::pretty_breaks(n = 10)) +
scale_y_continuous(expand = expansion(mult = c(0.01, 0.0125)),
breaks = seq(0, 14, 2)) +
labs(title = "Deviance Function",
subtitle = "Deviance statistics for a range of hazard ratios",
x = "Hazard Ratio (HR)",
y = " Deviance Statistic \n2ln(MLR)") And the version from our paper can be found here. It is important to note that although we have calculated these likelihood functions manually here, they can also be generated easily using the curve_lik() function which takes inputs from the ProfileLikelihood R package. To see a further discussion, please see the following article, which gives several examples for a wide variety of models.

Further, we urge some caution. Although we endorse the construction and presentation of likelihood functions along with $$P$$-value and $$S$$-value functions, along with the tabulations, the use of pure likelihood methods has been highly controversial among some statisticians, with some going as far as to say that likelihood is blind although not all statisticians believe this, and have responded to such criticisms in kind, see discussants.19 Thus, we support providing both $$P$$-value/$$S$$-value and likelihood-based functions for a complete picture.

Indeed, we can see how they all easily map to one another when plotted side by side, with the following script.

library("cowplot")
plot_grid(confcurve, surprisalcurve,
relsupportfunction, deviancefunction,
ncol = 2, nrow = 2)

A clearer plot comparing the four functions is seen below (click on the image to view in full). All errors are ours, and we welcome critical feedback and reporting of errors. To report possible errors in our analyses, please post it as a comment below or as a bug here. We will compile them onto a public errata, if any errors or flaws are reported to us.

# Statistical Package Citations

Please remember to cite the R packages if you use any of the R scripts from above. The citation for our1 can be found below in the References section or it can be downloaded here for a reference manager.

citation("concurve")
citation("TeachingDemos")
citation("ProfileLikelihood")
citation("rstan")
citation("rstanarm")
citation("bayesplot")
citation("knitr")
citation("kableExtra")
citation("ggplot2")
citation("cowplot")
citation("Statamarkdown")

# Environment

The analyses were run on:

R version 4.0.5 (2021-03-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

Random number generation:
RNG:     L'Ecuyer-CMRG
Normal:  Inversion
Sample:  Rejection

locale:
 en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 Statamarkdown_0.5.2 kableExtra_1.3.4    cowplot_1.1.1       ggplot2_3.3.3       concurve_2.7.7      rmarkdown_2.7

loaded via a namespace (and not attached):
 readxl_1.3.1          uuid_0.1-4            backports_1.2.1       systemfonts_1.0.1     plyr_1.8.6
 igraph_1.2.6          splines_4.0.5         crosstalk_1.1.1       rstantools_2.1.1      inline_0.3.17
 digest_0.6.27         htmltools_0.5.1.1     rsconnect_0.8.17      fansi_0.4.2           magrittr_2.0.1
 openxlsx_4.2.3        credentials_1.3.0     RcppParallel_5.1.2    matrixStats_0.58.0    officer_0.3.18
 xts_0.12.1            svglite_2.0.0         askpass_1.1           prettyunits_1.1.1     colorspace_2.0-0
 rvest_1.0.0           haven_2.4.0           xfun_0.22             dplyr_1.0.5           callr_3.7.0
 crayon_1.4.1          bcaboot_0.2-1         jsonlite_1.7.2        lme4_1.1-26           survival_3.2-10
 zoo_1.8-9             glue_1.4.2            survminer_0.4.9       gtable_0.3.0          webshot_0.5.2
 V8_3.4.0              car_3.0-10            pkgbuild_1.2.0        rstan_2.21.1          abind_1.4-5
 scales_1.1.1          DBI_1.1.1             rstatix_0.7.0         miniUI_0.1.1.1        Rcpp_1.0.6
 viridisLite_0.4.0     xtable_1.8-4          foreign_0.8-81        km.ci_0.5-2           stats4_4.0.5
 StanHeaders_2.21.0-7  DT_0.18               htmlwidgets_1.5.3     httr_1.4.2            threejs_0.3.3
 ellipsis_0.3.1        farver_2.1.0          pkgconfig_2.0.3       loo_2.4.1             sass_0.3.1
 utf8_1.2.1            labeling_0.4.2        reshape2_1.4.4        tidyselect_1.1.0      rlang_0.4.10
 later_1.1.0.1         munsell_0.5.0         pbmcapply_1.5.0       cellranger_1.1.0      tools_4.0.5
 cli_2.4.0             generics_0.1.0        broom_0.7.6           ggridges_0.5.3        evaluate_0.14
 stringr_1.4.0         fastmap_1.1.0         yaml_2.2.1            sys_3.4               processx_3.5.1
 knitr_1.32            zip_2.1.1             survMisc_0.5.5        purrr_0.3.4           nlme_3.1-152
 mime_0.10             rstanarm_2.21.1       xml2_1.3.2            shinythemes_1.2.0     compiler_4.0.5
 bayesplot_1.8.0       rstudioapi_0.13       curl_4.3              ggsignif_0.6.1        statmod_1.4.35
 tibble_3.1.1          bslib_0.2.4           stringi_1.5.3         highr_0.9             ps_1.6.0
 blogdown_1.3          forcats_0.5.1         gdtools_0.2.3         lattice_0.20-41       Matrix_1.3-2
 nloptr_1.2.2.2        markdown_1.1          shinyjs_2.0.0         KMsurv_0.1-5          vctrs_0.3.7
 pillar_1.6.0          lifecycle_1.0.0       jquerylib_0.1.3       data.table_1.14.0     ProfileLikelihood_1.1
 flextable_0.6.5       httpuv_1.5.5          R6_2.5.0              bookdown_0.22         promises_1.2.0.1
 gridExtra_2.3         rio_0.5.26            codetools_0.2-18      MASS_7.3-53.1         boot_1.3-27
 colourpicker_1.1.0    gtools_3.8.2          assertthat_0.2.1      openssl_1.4.3         withr_2.4.2
 metafor_2.4-0         shinystan_2.5.0       hms_1.0.0             grid_4.0.5            minqa_1.2.4
 tidyr_1.1.3           carData_3.0-4         ggpubr_0.4.0          shiny_1.6.0           base64enc_0.1-3
 dygraphs_1.1.1.6