[This article was first published on R Psychologist, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't. Earlier this week I read this article about “Why Publishing Everything Is More Effective than Selective Publishing of Statistically Significant Results” by Mercal et al (2014). The authors simulated different meta-analytic scenarios and came to the conclusion that publishing everything is more effective for the scientific collective. This got me thinking about publication bias and its expected effect on effect sizes, particularly Cohen’s d. It is evident that the effect will be overestimated under selective publishing, and that this bias will be larger in small studies, simply because the sampling distribution from small studies will have a larger standard errors. Daniel Simons wrote a similar post about “What effect size would you expect?”. In his post Simons looked at the expectation of the absolute value of d when there is no effect in the population. However, I am not interested in the absolute effect and not so much in the null distribution. So in this post I will both use d‘s theoretical and empirical sampling distribution to show the expected overestimation due to selective publishing. I will look at the overestimation for various sample sizes when the population effect $delta$ is 0, 0.2, 0.5 and 0.8. The conclusion is that you should be weary of effect sizes from small samples, and that the issue is rather with type M (magnitude) errors than type I errors. At least is clinical psychology the pervasive problem is overestimation of effects and not falsely rejecting null hypothesis.

## About the R-code

To make this post easier to follow I have chosen to put all of the R code at the end of this post.

## When the population effect is zero

### Cohen’s d sample distribution

It is known that the sampling distribution for Cohen’s d follows a noncentral t distribution. However, it is also knows that it is asymptotically normal with a standard deviation of $sqrt{frac{n_1 + n_2}{n_1 n_2}}$. What we will do is look at the 25th, 50th and 75th percentile of the upper-half of the sampling distribution, which corresponds to the 62.5th, 75th and 87.5th percentile of the full distribution. This means that we are looking at the expected value of Cohen’s d if researchers selectively publish studies that yield an effect above these percentiles. Stated as fractions the 62.5th percentile corresponds to the top 3 out of 8 studies, 75th percentile is 1 out of 4 studies and 87.5th 1 out of 8 studies. Figure 1

In clinical psychology we often have a hypothesis about the direction of an effect, so we are not interested in the absolute effect. Consequently I will refer to the percentiles of the full distribution, but if you are interested in the absolute value of d you can think of the percentiles as 25th, 50th and 75th percent of the sampling distribution. Figure 2 bellow shows the results from simulating this and from using the quantile function of the standard normal distribution. We see that the analytical methods give slightly different results for small samples. Figure 2

### Using the noncentral t distribution

To get more exact results we will use the noncentral t distribution to find the three percentiles from above. We accomplish this by finding the noncentrality parameters that correspond to the respective percentile and then multiply this by the standard error to get the expected Cohen’s d. This is the same methodology as when the noncentral t distribution is used to construct 95 % confidence intervals for d. The interested reader can look at Cumming (2001) for a short primer on this method. Figure 3 shows that this methods closely matches the simulation results. We also see that when we have 20 participants in each group 1 in 4 studies will estimate d to be larger then 0.2, even though no real effect exists. Figure 3

## When there is a population effect

### A note about the Cohen’s d and Hedges’ g

The terminology regarding Cohen’s d is inconsistent, but in this post I refer to the biased estimate of $delta$ as Cohen’s d and the bias-adjusted estimate as Hedges’ g. The bias is trivial for moderate and large samples. Cohen’s d can be converted to Hedges’ g by multiplying by an adjustment factor. However, since I want to analytically estimate the sample distribution of Cohen’s d, I need to adjust $delta$ to take this bias into account. I simply do this by dividing $delta$ by the adjustment factor. Like this $$g approx d*A Leftrightarrow d approx frac{g}{A}$$ where the adjustment factor $A = left(1-frac{3}{4df-1}right)$

### When d is non-zero

I think it is fair to say that null hypotheses are seldom true. Therefore I examined a scenario were researchers only publish values above some expected cutoff. For instance, say that the true effect is 0.2 but researchers only publish values larger then this, then what is the expected 25th, 50th and 75th percentile of this distribution? Figure 4 Figure 5 Figure 6

The figure below shows the percentage of bias for studies at the 75th percentile. The bias percentage is calculated like this $$left(frac{bar{hat{delta}} – delta}{delta}right)*100$$ Figure 7

Figures 4 show that the overestimation is huge for small samples when the population effect is 0.2. For instance, with 20 subjects in each group 1 in 4 studies is expected to find an effect that is at least 0.4. And 1 in 8 studies will find an effect close to 0.6 or over.

When the population effect is 0.5, Figure 5 shows that with 20 subjects per group 1 in 4 studies is expected to find an effect close to 0.75 or higher. And 1 in 8 studies will find and effect close to 0.9 or higher. Using Cohen’s rule of thumb for interpreting d we would wrongly conclude that there is an “large” effect when in fact there is a “medium” effect in the population. When the population effect is 0.8 the overestimation is slightly smaller, but the overall pattern is the same.

## But are these effect sizes “statistically significant”?

I have not focused on null hypothesis significance testing (NHST) in this post. But in the figure below I show the expected p-values for the effect sizes at the 75th percentile. The p-values are both from independent t-tests and mixed-design ANOVAs. Figure 8

It should be no surprise that, independent of the size of n, p > 0.05 when the null is true. A traditional null hypothesis test at $alpha = 0.05$ will – by definition – never reject values at the 75th percentile. Moreover, we see that p-values are no safeguard against type M errors when the population effect is 0.5 or 0.8. On the contrary, NHST most probably contributes to even larger overestimations of effect sizes. If the population effect is 0.2 and we repeatedly rerun our study until p < 0.05, then the effect will be grossly overestimated. If $n_1 = n_2 = 20$ then d will be about 0.64 when p = 0.05 (two-tailed t-test). If we once again use the noncentral t distribution we can show that if $delta=0.2$ then about 10 % of the studies will estimate d to be equal to or greater then 0.64. This means that the estimate is 220 % biased, but still “statistically significant”. So p-values from underpowered studies clearly perpetuates the problem of overestimation — unscrupulous researchers just have work a little harder.

## Summary and conclusions

• The overestimation due to publication bias is not trivial.
• Selective publishing of the top 25 % of small studies will yield effects that are at least 50–100 % biased, if the true effect is between 0.2 and 0.5.
• Selective reporting of outcomes will give the same effect. However the overestimation depends on the correlation between outcomes.
• Larger studies will give less biased estimates even under publication bias. Larger studies are also more expensive and resource consuming making willful selective publishing more tedious.
• Don’t trust effect sizes from small studies.
• Interpret confidence intervals, not point estimates.
• Focus on replication.
• Think “meta-analytically”, interpret single studies with caution.
• Null hypothesis significance testing will only make the problem worse.

## R-code

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 library(MBESS) library(ggplot2) library(MASS) # # FIGURE 1 ---------------- # # normal plot - percentiles x <- seq(-1,1, by=0.001) y <- dnorm(x, 0, sd=sqrt(sigma^2/25 + sigma^2/25)) sample_dist <- data.frame("x" = x, "y"= y) d_25 <- qnorm(0.625, 0, sqrt(sigma^2/25 + sigma^2/25)) d_50 <- qnorm(0.75, 0, sqrt(sigma^2/25 + sigma^2/25)) d_75 <- qnorm(0.875, 0, sqrt(sigma^2/25 + sigma^2/25)) polygon <- sample_dist[sample_dist$x > 0,] polygon$y1 <- 0 ggplot(sample_dist, aes(x,y, color=NULL)) + geom_line(size=1) + geom_polygon(data=polygon, fill="#c0392b") + geom_vline(aes(xintercept=c(d_25, d_50, d_75)), linetype="dashed") + labs(title="Cohen's d sampling distribution (normal aprox.; n = 25)n Dashed lines showing 62.5th, 75th and 87.5th percentile", y="Density", x="Cohen'd") ggsave("figure1.png", width=10, height=8, dpi=72) # # FIGURE 2 ---------------- # # normal approx, d = 0 wrap_sim_d <- function(n.1, n.2) { sim <- function(n.1, n.2) { group1 <- rnorm(n.1) group2 <- rnorm(n.2) return(smd(group1, group2, Unbiased=F)) } sim_data <- replicate(10000, sim(n.1, n.2)) # get percentiles sigma <- 1 SE <- sqrt((n.1+n.2)/(n.1*n.2)) analytical <- c("25%"=qnorm(0.625, 0, SE), "50%"=qnorm(0.75, 0, SE), "75%"=qnorm(0.875, 0, SE)) mc <- quantile(abs(sim_data), c(0.25, 0.5, 0.75)) output <- data.frame("y"=c(analytical, mc)) output$n <- n.1 output$group <- gl(2,3, labels=c("analytical","MC")) output$percentile <- gl(3,1, labels=c("62.5%","75%","87.5%")) return(output) } result_d <- do.call(rbind, lapply(seq(5,100, by=5), function(i) wrap_sim_d(n.1=i, n.2=i))) ggplot(result_d, aes(x=n, y=y, group=interaction(group, percentile), color=percentile, linetype=group)) + geom_line() + geom_point(shape=1, size=3) + scale_color_manual(values=c("#3498db","black","#3498db")) + scale_x_continuous(breaks = c(5,10,15,20,25,50,100)) + labs(title = "Expected Cohen's d if only positive effects are published and the population parameter is zero", y="Cohen's d") ggsave("figure2.png", dpi=72, width=10, height=8) # # FIGURE 3 ------------- # # code for d using non-central t wrap_sim_d_nct <- function(n.1, n.2) { sim <- function(n.1, n.2) { group1 <- rnorm(n.1) group2 <- rnorm(n.2) return(smd(group1, group2, Unbiased=F)) } sim_data <- replicate(50000, sim(n.1, n.2)) # get percentiles SE <- sqrt((n.1+n.2)/(n.1*n.2)) analytical <- c("25%" = conf.limits.nct(0, (n.1+n.2)-2, alpha.lower=0.05, alpha.upper=1-0.625)$Upper.Limit * SE, "50%" = conf.limits.nct(0, (n.1+n.2)-2, alpha.lower=0.05, alpha.upper=1-0.750)$Upper.Limit * SE, "75%" = conf.limits.nct(0, (n.1+n.2)-2, alpha.lower=0.05, alpha.upper=1-0.875)$Upper.Limit * SE) # alpha.lower is not used for anything here mc <- quantile(abs(sim_data), c(0.25, 0.5, 0.75)) output <- data.frame("y" = c(analytical, mc)) output$n <- n.1 output$group <- gl(2, 3, labels=c("analytical", "MC")) output$percentile <- gl(3, 1, labels=c("62.5%", "75%", "87.5%")) return(output) } result_d_nct <- do.call(rbind, lapply(seq(5, 100, by = 5), function(i) wrap_sim_d_nct(n.1 = i, n.2 = i))) ggplot(result_d_nct, aes(x = n, y = y, group = interaction(group, percentile), color = percentile, linetype = group)) + geom_line() + geom_point(shape = 1, size = 3) + scale_color_manual(values=c("#3498db", "black", "#3498db")) + scale_x_continuous(breaks = c(5, 10, 15, 20, 25, 50, 100)) + labs(title = "Expected Cohen's d under publication bias when the population parameter is zero (noncentral t)", y="Cohen's d") ggsave("figure3.png", dpi = 72, width = 10, height = 8) # # FIGURE 4, 5 & 6 ----------- # code for Cohen's d nct > 0 # # wrap_sim_d_not_zero <- function(n.1, n.2, d) { sim <- function(n.1, n.2, d) { group1 <- rnorm(n.1, d) group2 <- rnorm(n.2) return(smd(group1, group2, Unbiased = F)) } sim_data <- replicate(10000, sim(n.1, n.2, d)) mean(sim_data[2,] < 0) * 2 getp(mean(sim_data[1,]), n.1) # get percentiles SE <- sqrt((n.1+n.2)/(n.1*n.2)) df <- (n.1+n.2)-2 biased_d <- d/(1-3/(4*df-1)) # give biased d ncp <- biased_d * sqrt((n.1 * n.2)/(n.1 + n.2)) analytical <- c("25%" = conf.limits.nct(ncp, (n.1+n.2)-2, alpha.lower=0.05, alpha.upper=1-0.625)$Upper.Limit * SE, "50%" = conf.limits.nct(ncp, (n.1+n.2)-2, alpha.lower=0.05, alpha.upper=1-0.750)$Upper.Limit * SE, "75%" = conf.limits.nct(ncp, (n.1+n.2)-2, alpha.lower=0.05, alpha.upper=1-0.875)$Upper.Limit * SE) mc <- quantile(sim_data[sim_data > d], c(0.25, 0.5, 0.75)) output <- data.frame("y"=c(analytical, mc)) output$n <- n.1 output$group <- gl(2,3, labels=c("analytical","MC")) output$percentile <- gl(3,1, labels=c("62.5%", "75%", "87.5%")) return(output) } d.val <- 0.2 result_d_02 <- do.call(rbind, lapply(seq(5,100, by=5), function(i) wrap_sim_d_not_zero(n.1=i, n.2=i, d=d.val))) plot_d_02 <- ggplot(result_d_02, aes(x=n, y=y, group=interaction(group, percentile), color=percentile, linetype=group)) + geom_line() + geom_point(shape=1, size=3) + scale_color_manual(values=c("#3498db","black","#3498db")) + scale_x_continuous(breaks = c(5,10,15,20,25,50,100)) + labs(title = bquote(Expected ~ "Cohen's" ~ d ~ under ~ publication ~ bias ~ when ~ delta == ~ .(d.val)~"(noncentral t)"), y="Cohen's d") d.val <- 0.5 result_d_05 <- do.call(rbind, lapply(seq(5,100, by=5), function(i) wrap_sim_d_not_zero(n.1=i, n.2=i, d=d.val))) plot_d_05 <- ggplot(result_d_05, aes(x=n, y=y, group=interaction(group, percentile), color=percentile, linetype=group)) + geom_line() + geom_point(shape=1, size=3) + scale_color_manual(values=c("#3498db","black","#3498db")) + scale_x_continuous(breaks = c(5,10,15,20,25,50,100)) + labs(title = bquote(Expected ~ "Cohen's" ~ d ~ under ~ publication ~ bias ~ when ~ delta == ~ .(d.val)~"(noncentral t)"), y="Cohen's d") d.val <- 0.8 result_d_08 <- do.call(rbind, lapply(seq(5,100, by=5), function(i) wrap_sim_d_not_zero(n.1=i, n.2=i, d=d.val))) plot_d_08 <- ggplot(result_d_08, aes(x=n, y=y, group=interaction(group, percentile), color=percentile, linetype=group)) + geom_line() + geom_point(shape=1, size=3) + scale_color_manual(values=c("#3498db","black","#3498db")) + scale_x_continuous(breaks = c(5,10,15,20,25,50,100)) + labs(title = bquote(Expected ~ "Cohen's" ~ d ~ under ~ publication ~ bias ~ when ~ delta == ~ .(d.val)~"(noncentral t)"), y="Cohen's d") plot_d_02; plot_d_05; plot_d_08 ggsave("plot_d_02.png", plot_d_02, width=10, height=8, dpi=72) ggsave("plot_d_05.png", plot_d_05, width=10, height=8, dpi=72) ggsave("plot_d_08.png", plot_d_08, width=10, height=8, dpi=72) # # FIGURE 7 ----------- # # bias for d=0.2, 0.5, 0.8 result_all <- data.frame("y" = c(result_d_02$y[result_d_02$percentile=="75%" & result_d_02$group=="MC"], result_d_05$y[result_d_05$percentile=="75%" & result_d_05$group=="MC"], result_d_08$y[result_d_08$percentile=="75%" & result_d_08$group=="MC"])) result_all$n <- rep(seq(5,100, by=5), 3) result_all$d <- rep(c(0.2, 0.5, 0.8), each=20) result_all$bias <- ((result_all$y - result_all$d)/result_all$d) * 100 # percentage bias plot_all <- ggplot(result_all, aes(x=n, y=bias, group=factor(d), color=factor(d))) + geom_line() + geom_point(shape=1, size=3) + scale_x_continuous(breaks = c(5,10,15,20,25,50,100)) + labs(title = bquote(Expected ~ "Cohen's d under publication bias (>75th perc.) when" ~ delta == "(0.2, 0.5, 0.8)"~"(Monte Carlo estimates)"), y="% Bias") ggsave("plot_d_nonzero_all.png", plot_all, width=10, height=8, dpi=72) # # FIGURE 8 ------------- # # p.vals from t getp <- function(d, n) { df <- (2*n)-2 t <- (d / sqrt(2/n)) 2*pt(-abs(t), df) # two tailed } result_d_0 <- result_d_nct result_d_0$p <- getp(result_d_nct$y, result_d_nct$n) result_d_0$ES <- 0 result_d_02$p <- getp(result_d_02$y, result_d_02$n) result_d_02$ES <- 0.2 result_d_05$p <- getp(result_d_05$y, result_d_05$n) result_d_05$ES <- 0.5 result_d_08$p <- getp(result_d_08$y, result_d_08$n) result_d_08$ES <- 0.8 p_vals <- rbind(result_d_0[result_d_0$group == "MC" & result_d_0$percentile == "75%", ], result_d_02[result_d_02$group == "MC" & result_d_02$percentile == "75%", ], result_d_05[result_d_05$group == "MC" & result_d_05$percentile == "75%", ], result_d_08[result_d_08$group == "MC" & result_d_08$percentile == "75%", ]) p_vals_plot <- ggplot(p_vals, aes(x=n, y=p, group=factor(ES), color=factor(ES))) + geom_line() + geom_point(shape=1, size=3) + scale_x_continuous(breaks = c(5,10,15,20,25,50,100,150)) + geom_hline(yintercept=0.05) + labs(title = bquote("P-values from independent t-tests for" ~ delta== ~ "{0, 0.2, 0.5, 0.8}"), y="p-value (two-tailed)", parse=T) ggsave("p_vals_plot.png", p_vals_plot, width=10, height=8, dpi=72) ## pvals anova getANOVAp <- function(d, n, rho) { if(!require(MASS)) stop("MASS is not installed") rho <- rho cov_matrix <- function(sigmas) { B <- matrix(sigmas, ncol=length(sigmas), nrow=length(sigmas)) sigma <- t(B) * B * rho diag(sigma) <- sigmas^2 sigma } innerFunc <- function(d, n, rho=rho) { delta <- d/2 CBT <- c(3, 3-delta, 3-(2*delta)) WL <- c(3, 3, 3) CBT_i <- mvrnorm(n=n, mu=CBT, Sigma=cov_matrix(c(1,1,1)), empirical=T) WL_i <- mvrnorm(n=n, mu=WL, Sigma=cov_matrix(c(1,1,1)), empirical=T) data <- data.frame("DV" = c(CBT_i, WL_i), "time" = gl(3, n, 2*3*n), "group" = gl(2,3*n, 2*3*n), "subjects" = factor(c(rep(1:n, 3), rep((n+1):(2*n), 3))) ) result<-summary(aov(DV ~ time*group + Error(subjects), data=data)) return(result[][][]) } mapply(function(i,j) innerFunc(i,j), i=d, j=n) } result_d_0_anova <- result_d_nct result_d_0_anova$p <- getANOVAp(result_d_nct$y, result_d_nct$n, 0.5) result_d_0_anova$ES <- 0 result_d_02_anova <- result_d_02 result_d_02_anova$p <- getANOVAp(result_d_02$y, result_d_02$n, 0.5) result_d_02_anova$ES <- 0.2 result_d_05_anova <- result_d_05 result_d_05_anova$p <- getANOVAp(result_d_05$y, result_d_05$n, 0.5) result_d_05_anova$ES <- 0.5 result_d_08_anova <- result_d_08 result_d_08_anova$p <- getANOVAp(result_d_08$y, result_d_08$n, 0.5) result_d_08_anova$ES <- 0.8 p_vals_anova <- rbind(result_d_0_anova[result_d_0_anova$group == "MC" & result_d_0_anova$percentile == "75%", ], result_d_02_anova[result_d_02_anova$group == "MC" & result_d_02_anova$percentile == "75%", ], result_d_05_anova[result_d_05_anova$group == "MC" & result_d_05_anova$percentile == "75%", ], result_d_08_anova[result_d_08_anova$group == "MC" & result_d_08_anova$percentile == "75%", ]) p_vals_plot_anova <- ggplot(p_vals_anova, aes(x=n, y=p, group=factor(ES), color=factor(ES))) + geom_line() + geom_point(shape=1, size=3) + scale_x_continuous(breaks = c(5,10,15,20,25,50,100,150)) + geom_hline(yintercept=0.05) + labs(title = expression(atop("P-values from mixed-design ANOVAS for" ~ delta== ~ "{0, 0.2, 0.5, 0.8} at endpoint","3 time points, two groups, "~rho==0.5)), y="p-value (time x group)", parse=T) ggsave("p_vals_plot_anova.png", p_vals_plot, width=10, height=8, dpi=72) p_vals$group <- "t-test" p_vals_anova$group <- "ANOVA" p_vals_df <- rbind(p_vals, p_vals_anova) p_vals_plot_comb <- ggplot(p_vals_df, aes(x=n, y=p, group=interaction(factor(ES), group), color=factor(ES), linetype=group)) + geom_line() + geom_point(shape=1, size=3) + scale_x_continuous(breaks = c(5,10,15,20,25,50,100,150)) + scale_y_continuous(breaks = c(0, 0.05, 0.2, 0.4, 0.6, 0.8)) + geom_hline(yintercept=0.05) + labs(title = expression(atop("P-values from mixed-design ANOVAs (time x group) and t-tests (two-tailed)", ~ delta== ~ "{0, 0.2, 0.5, 0.8} at endpoint, 3 time points, two groups, "~rho==0.5)), y="p-value", parse=T) ggsave("p_vals_plot_comb.png", p_vals_plot_comb, width=10, height=8, dpi=72)