(This article was first published on

**The 20% Statistician**, and kindly contributed to R-bloggers)In the reproducibility project, original effect sizes correlated r=0.51 with the effect sizes of replications. Some researchers find this hopeful.

Less-popularised findings from the “estimating the reproducibility” paper @Eli_Finkel #SPSP2016 pic.twitter.com/8CFJMbRhi8— Jessie Sun (@JessieSunPsych) January 28, 2016

I don’t think we should be interpreting this correlation at all, because it might very well be completely spurious. One important reason why correlations might be spurious is the presence of different subgroups, as introduction to statistics textbooks explain.

When we consider the Reproducibility Project (note: I’m a co-author of the paper) we can assume there are two subsets, one subgroup consisting of experiments that examine true effects, and one subgroup consisting of experiments that examine effects that are not true. This logically implies that for one subgroup, the true effect size is 0, while for the other, the true effect size is an unknown larger value. Different means in subgroups is a classic case where spurious correlations can emerge.

I find the best way to learn to understand statistics is through simulations. So let’s simulate 100 normally distributed effect sizes from original studies that are comparable to the 100 studies included in the Reproducibility Project, and 100 effect sizes for their replications, and correlate these. We create two subgroups. Forty effect sizes will have true effects (e.g., d = 0.4). The original and replication effect sizes will be correlated (e.g., r = 0.5). Sixty of the effect sizes will have an effect size of d = 0, and a correlation between replication and original studies of r = 0. I’m not suggesting this reflects the truth of the studies in the Reproducibility Project – there’s no way to know. The parameters look sort of reasonable to me, but feel free to explore different choices for parameters by running the code yourself.

As you see, the pattern is perfectly expected, under reasonable assumptions, when 60% of the studies is simulated to have no true effect. With a small N (100 studies gives a pretty unreliable correlation, see for yourself by running the code a few times) the spuriousness of the correlation might not be clear. So let’s simulate 100 times more studies.

Now, the spuriousness becomes clear. The two groups differ in their means, and if we calculate the correlation over the entire sample, the

*r*= 0.51 we get is not very meaningful (I cut off original studies at d = 0, to simulate publication bias and make the graph more similar to Figure 1 in the paper, but it doesn’t matter for the current point).So: be careful interpreting correlations when there are different subgroups. There’s no way to know what is going on. The correlation of 0.51 between effect sizes in original and replication studies might not mean anything.

To

**leave a comment**for the author, please follow the link and comment on their blog:**The 20% Statistician**.R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...