A comment on “We cannot afford to study effect size in the lab” from the DataColada blog

[This article was first published on Nicebread » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In a recent post on the DataColada blog, Uri Simonsohn wrote about “We cannot afford to study effect size in the lab“. The central message is: If we want accurate effect size (ES) estimates, we need large sample sizes (he suggests four-digit n’s). As this is hardly possible in the lab we have to use other research tools, like onelin studies, archival data, or more within-subject designs.

While I agree to the main point of the post, I’d like to discuss and extend some of the conclusions. As the DataColada blog has no comments section, I’ll comment in my own blog …

 

“Does it make sense to push for effect size reporting when we run small samples? I don’t see how.”

“Properly powered studies teach you almost nothing about effect size.”

It is true that the ES estimate with n=20 will be utterly imprecise, and reporting this ES estimate could misguide readers who give too much importance to the point estimate and do not take the huge CI into account (maybe, because it has not been reported).

Still, and here’s my disagreement, I’d argue that also small-n studies should report the point estimate (along with the CI), as a meta-analysis of many imprecise small-n estimates still can give an unbiased and precise cumulative estimate. This, of course, would require that all estimates are reported, not only the significant ones (van Assen, van Aert, Nuijten, & Wicherts, 2014).

Here’s an example – we simulate a data set with d = 0.8 and 400 participants in each group. Then we look at three scenarios: a) the first 20 participants (a single small-n study), b) all 400 participants, and c) 20 * n=20 studies which are meta-analyzed.

[cc lang=”rsplus” escaped=”true”]

# simulate population with true ES d=0.8
set.seed(0xBEEF1)
library(compute.es)
library(metafor)
library(ggplot2)
X1 <- rnorm(1000000, mean=0, sd=1)
X2 <- rnorm(1000000, mean=0, sd=1) – 0.8

## Compute effect size for …
# … a single n=20 study
x1 <- sample(X1, 20)
x2 <- sample(X2, 20)
(t1 <- t.test(x1, x2))
(ES1 <- tes(t1$statistic, 20, 20, dig=3))
[/cc]

This single small-n study has g = 1.337 [ 0.64 , 2.034] (g is an unbiased estimate of d). Quite biased, but the true ES is in the CI.

[cc lang=”rsplus” escaped=”true”]
# .. a single n=400 study
x1 <- sample(X1, 400)
x2 <- sample(X2, 400)
(t2 <- t.test(x1, x2))
(ES2 <- tes(t2$statistic, 400, 400, dig=3))
[/cc]

This single large-n study has g = 0.76 [ 0.616 , 0.903] . It close to the true ES, and has quite narrow CI (i.e., high precision).

Now, here’s a meta-analyses of 20*n=20 studies:

[cc lang=”rsplus” escaped=”true”]
# … 20x n=20 studies
dat <- data.frame()
for (i in 1:20) {
x1b <- x1[(i*20-19):(i*20)]
x2b <- x2[(i*20-19):(i*20)]
dat <- rbind(dat, data.frame(
study=i,
m1i=mean(x1b),
m2i=mean(x2b),
sd1i=sd(x1b),
sd2i=sd(x2b),
n1i=20, n2i=20
))
}

# Do a fixed-effect model meta-analysis
es <- escalc(“SMD”, m1i=m1i, m2i=m2i, sd1i=sd1i, sd2i=sd2i, n1i=n1i, n2i=n2i, data=dat, append=TRUE)
(meta <- rma(yi, vi, data=es, method=”FE”))
[/cc]

The meta-analysis reveals g = 0.775 [0.630; 0.920]. This has nearly exactly the same CI width as the n=400 study, but a slightly lower ES estimate.

Here’s a plot of the results:

[cc lang=”rsplus” escaped=”true”]
# Plot results
res <- data.frame(
n = factor(c(“a) n=20″, “b) n=400″, “c) 20 * n=20\n(meta-analysis)”), ordered=TRUE),
point_estimate = c(ES1$d, ES2$d, meta$b),
ci.lower = c(ES1$l.d, ES2$l.d, meta$ci.lb),
ci.upper = c(ES1$u.d, ES2$u.d, meta$ci.ub)
)
ggplot(res, aes(x=n, y=point_estimate, ymin=ci.lower, ymax=ci.upper)) + geom_pointrange() + theme_bw() + xlab(“”) + ylab(“Cohen’s d”) + geom_hline(yintercept=0.8, linetype=”dotted”, color=”darkgreen”)[/cc]

Bildschirmfoto 2014-05-07 um 07.14.43

To summarize, a single small-n study hardly teaches something about effect sizes – but many small-n‘s do.

“But just how big an n do we need to study effect size? I am about to show that the answer has four-digits.”

In Uri’s post (and the linked R code) the precision issue is approached from the power side – if you increase power, you also increase precision. But you can also directly compute the necessary sample size for a desired precision. This is called the AIPE-framework (“accuracy in parameter estimation”) made popular by Ken Kelley, Scott Maxwell, and Joseph Rausch (Kelley & Rausch, 2006; Kelley & Maxwell, 2003; Maxwell, Kelley, & Rausch, 2008). The necessary functions are implemented in the MBESS package for R. If you want a CI width of .10 around an expected ES of 0.5, you need 3170 participants:

[cc lang=”rsplus” escaped=”true”]

library(MBESS)
ss.aipe.smd(delta=0.5, conf.level=.95, width=0.10)
[1] 3170
[/cc]

The same point has been made from a Bayesian point of view in a blog post from John Kruschke: notice the sample size on the x-axis.

In our own analysis on how correlations evolve with increasing sample size (Schönbrodt & Perugini, 2013; see also blog post), we conclude that for typical effect sizes in psychology, you need 250 participants to get sufficiently accurate and stable estimates of the ES:

evolDemo

How much precision is needed?

It’s certainly hard to give general guidelines how much precision is sensible, but here are our thoughts we based our stability analyses on. We used a CI-like “corridor of stability” (see Figure) with a half-width of w= .10, w=.15, and w=.20 (everything in the “correlation-metric”).

w = .20 was chosen for following reason: The average reported effect size in psychology is around r = .21 (Richard, Bond, & Stokes-Zoota, 2003). For this effect size, an accuracy of w = .20 would result in a CI that is “just significant” and does not include a reversal of the sign of the effect. Hence, with the typical effect sizes we are dealing with in psychology, a CI with a half-width > .20 would not make much sense.

w = .10 was chosen as it corresponds to a “small effect size” à la Cohen. This is arbitrary, of course, but at least some anchor. And w = .15 is just in between.

Using these numbers and an ES estimate of, say, r = .29, a just tolerable precision would be [.10; .46] (w = .20), a tolerable precision [.15; .42] (w = .15) and a moderate precision [.20; .38] (w = .10).

Regardless of the specific level of precision and method used, however, one thing is clear: Accuracy does not come in cheaply. We need much less participants for an hypothesis test (Is there a non-zero effect or not?) compared to an accurate estimate.

With increasing sample size, unfortunately you have diminishing returns on precision: As you can see in the dotted lines in Figure 2, the CI levels off, and you need disproportionally large sample sizes to squeeze out the last tiny percentages of a shrinking CI. If you follow Pareto’s principle, you should stop somewhen. Probably in scientific progress accuracy will be rather achieved in meta-analyzing several studies (which also gives you an estimate about the ES variability and possible moderators) than doing one mega-study.

Hence: Always report your ES estimate, even in small-n studies.

 

References

Kelley, K., & Maxwell, S. E. (2003). Sample Size for Multiple Regression: Obtaining Regression Coefficients That Are Accurate, Not Simply Significant. Psychological Methods, 8, 305–321. doi:10.1037/1082-989X.8.3.305

Kelley, K., & Rausch, J. R. (2006). Sample size planning for the standardized mean difference: Accuracy in parameter estimation via narrow confidence intervals. Psychological Methods, 11, 363–385. doi:10.1037/1082-989X.11.4.363

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample Size Planning for Statistical Power and Accuracy in Parameter Estimation. Annual Review of Psychology, 59(1), 537–563. doi:10.1146/annurev.psych.59.103006.093735

Richard, F. D., Bond, C. F., & Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7, 331–363. doi:10.1037/1089-2680.7.4.331

van Assen, M. A. L. M., van Aert, R. C. M., Nuijten, M. B., & Wicherts, J. M. (2014). Why publishing everything is more effective than selective publishing of statistically significant results. PLoS ONE, 9, e84896. doi:10.1371/journal.pone.0084896

To leave a comment for the author, please follow the link and comment on their blog: Nicebread » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)