(This article was first published on **The 20% Statistician**, and kindly contributed to R-bloggers)

We’ll use R, and the R script at the bottom of this post (or download it from GitHub). Run the first section (sections are separated by # # # #) to install the required packages and change some settings.

Assume the **mean** IQ of the entire population of adults is 100, with a **standard deviation** of 15. This will not be true for every **sample** we draw from the **population**. Let’s get a feel for what the IQ scores from a sample look like. Which IQ scores will people in our sample have?
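As a rough sketch of what such a simulation looks like in base R (the actual script at the bottom of the post may differ in details), we can draw a sample of 10 IQ scores from a normal distribution with a mean of 100 and an SD of 15:

```r
set.seed(42)                           # for reproducibility
n <- 10                                # sample size
x <- rnorm(n, mean = 100, sd = 15)     # draw n IQ scores from the population
mean(x)                                # sample mean: varies around 100
sd(x)                                  # sample SD: varies around 15
hist(x, main = "Simulated IQ scores")  # look at the distribution
```

With only 10 people, the sample mean and SD can be quite far from the population values.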

**Assignment 1**

Let’s simulate a larger sample of 100 participants by changing n = 10 in line 23 of the R script to n = 100 (remember that R code is case-sensitive).

With a larger sample, the distribution of scores starts to resemble a **normal distribution**. This is the well-known bell-shaped curve that represents the distribution of many variables in scientific research (although some other types of distributions are quite common as well). The mean and standard deviation are much closer to the true mean and standard deviation, and this is true for most of the simulated samples. Simulate at least 10 samples with n = 10, and 10 samples with n = 100, and look at the means and standard deviations. Then let’s simulate one really large sample of 1000 people (run the code, changing n = 10 to n = 1000). The picture shows one example.

Not every simulated study of 1000 people will yield the true mean and standard deviation, but this one did. And although the distribution is very close to a normal distribution, even with 1000 people it is not perfect.
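To see how the estimates stabilize as n grows, a small base R sketch (the helper function `sample_means` is my own, not from the post’s script) can draw several samples at each size and compare how far the sample means stray from 100:

```r
set.seed(123)
# draw `reps` samples of size n and return their means
sample_means <- function(n, reps = 10) {
  replicate(reps, mean(rnorm(n, mean = 100, sd = 15)))
}
m10   <- sample_means(10)    # 10 samples of n = 10
m100  <- sample_means(100)   # 10 samples of n = 100
m1000 <- sample_means(1000)  # 10 samples of n = 1000
# the spread of sample means shrinks with sqrt(n): 15/sqrt(n)
sd(m10); sd(m100); sd(m1000)
```

The standard deviation of the sample means (the standard error) drops from roughly 4.7 at n = 10 to roughly 0.5 at n = 1000.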

If we want to estimate the mean IQ of the population with a margin of error of 2 IQ points, we need (1.96 × 15 / 2)² = 216 people (rounded down). Feel free to check by running the code with n = 216 (remember that this is a long-term average!)
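The arithmetic behind that number can be checked directly, using the standard normal-approximation sample size formula n = (z × SD / E)² with z = 1.96 for 95% confidence, SD = 15, and margin of error E = 2:

```r
z  <- 1.96                 # two-sided 95% critical value of the normal distribution
SD <- 15                   # population standard deviation of IQ
E  <- 2                    # desired margin of error in IQ points
n  <- (z * SD / E)^2       # required sample size
n                          # 216.09
floor(n)                   # 216 people, rounded down
```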

In addition to planning for accuracy, you can plan for power. The power of a study is the probability of observing a statistically significant effect, given that there is a true effect to be found. It depends on the effect size, the sample size, and the alpha level.

We can use the code in the section for Assignment 2. Running this code will take a while: it simulates 100,000 experiments in which 10 participants are drawn from a normal distribution with a mean of 110 and an SD of 15. To continue our example, let’s assume the numbers represent measured IQ, which is 110 in our samples. For each simulated sample, we test whether the sample mean differs from an IQ of 100. In other words, we are testing whether our sample is smarter than average.
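A minimal sketch of such a simulation in base R (the post’s actual script may differ in details; I use 10,000 simulations here to keep the runtime short, where the post uses 100,000):

```r
set.seed(1)
nSims <- 10000                          # number of simulated experiments
n     <- 10                             # participants per experiment
p <- replicate(nSims, {
  x <- rnorm(n, mean = 110, sd = 15)    # sample from a population with mean IQ 110
  t.test(x, mu = 100)$p.value           # one-sample t-test against an IQ of 100
})
mean(p < 0.05)                          # power: proportion significant, ~0.47
hist(p, breaks = 20)                    # first bar holds the significant p-values

# analytic check with base R's power function, ~0.47:
pwr_analytic <- power.t.test(n = 10, delta = 10, sd = 15,
                             type = "one.sample")$power
```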

The simulation stores all the *p*-values, and it will return the power, which will be somewhere around 47%. It will also yield a plot of the *p*-value distribution. The first bar is the count of all *p*-values smaller than 0.05, so all statistically significant *p*-values. The percentage of *p*-values in this single bar visualizes the power of the study.

The true effect size in this simulation is Cohen’s *d* = (X − μ)/σ, or (110 − 100)/15 = 0.6667.

**Assignment 2**

Re-run the simulation several times, each time changing one of its parameters (such as the sample size or the true mean). For each variation: how does the power change, and how does the *p*-value distribution change? For the variation in which there is no true effect: how does the *p*-value distribution change? Can we formally speak of ‘power’ in this case? What is a better name in this specific situation?

**Variance in two groups, and their difference.**

Now, assume we have a new IQ training program that will increase people’s IQ scores by 6 points. People in condition 1 are in the control condition; they do not get IQ training. People in condition 2 get IQ training. Let’s simulate 10 people in each group, assuming the IQ in the control condition is 100 and in the experimental condition is 106 (the SD is still 15 in each group), by running the code for Assignment 3.
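A sketch of that simulation in base R (variable names are my own; the script’s code may differ):

```r
set.seed(10)
n <- 10
control  <- rnorm(n, mean = 100, sd = 15)  # no IQ training
training <- rnorm(n, mean = 106, sd = 15)  # after IQ training
mean(training) - mean(control)             # observed difference; varies widely
tt <- t.test(training, control)            # Welch two-sample t-test (R's default)
tt
```

With only 10 people per group, the observed mean difference can be far from the true difference of 6 points.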

**Assignment 3**

Now consider a dependent (paired-samples) *t*-test: with **dependent** samples, the mean in one sample correlates with the mean in the other sample. This reduces the amount of variability in the difference scores. If we perform a power analysis, how do you think this will influence the power of our test?

When the correlation between the two measurements is 0.5, the power of a dependent *t*-test will be identical to the power of a one-sample *t*-test. Let’s perform a power analysis for a dependent *t*-test with a true effect size of 0.6667, and compare the power with the same power analysis for the one-sample *t*-test we performed above:
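The r = 0.5 case can be checked analytically. The SD of difference scores is sqrt(SD1² + SD2² − 2·r·SD1·SD2); with r = 0.5 and both SDs equal to 15, this is again 15, so the paired test on the difference scores is exactly the one-sample test from before (a base R sketch, using a mean difference of 10 so that d = 10/15 = 0.6667):

```r
r   <- 0.5
sd1 <- 15; sd2 <- 15
sd_diff <- sqrt(sd1^2 + sd2^2 - 2 * r * sd1 * sd2)
sd_diff  # 15: same SD as a single measurement

# power for a mean difference of 10 IQ points (d = 0.6667), n = 10 pairs;
# for type = "paired", sd is the SD of the difference scores
p_paired <- power.t.test(n = 10, delta = 10, sd = sd_diff, type = "paired")$power
p_one    <- power.t.test(n = 10, delta = 10, sd = 15, type = "one.sample")$power
c(p_paired, p_one)  # identical
```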

**Variation across studies**

Usually it is the **effects** we are interested in directly. Both correlations and mean differences are effect sizes. Because mean differences are difficult to compare across studies that use different types of measures to examine an effect, or different scales to measure differences on, researchers often use standardized effect sizes whenever multiple effect sizes are compared. In this example, we will focus on **Cohen’s** *d*, which provides the **standardized mean difference**.

Cohen’s *d* is provided by the mean difference divided by the pooled standard deviation:

*d* = (M1 − M2)/SDpooled

The variance of *d* depends only on the sample size and the value of *d* itself.
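A common approximation for that variance in a two-group design (e.g., Borenstein et al., 2009) is Var(d) = (n1 + n2)/(n1·n2) + d²/(2(n1 + n2)), which indeed involves only the sample sizes and d. As a sketch, for d = 0.4 with 80 people per group (the sample size suggested by the df of 158 in the example below):

```r
# variance of Cohen's d for two independent groups (Borenstein et al., 2009)
var_d <- function(d, n1, n2) {
  (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))
}
v <- var_d(d = 0.4, n1 = 80, n2 = 80)
v        # 0.0255
sqrt(v)  # standard error of d, ~0.16
```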

**Single study meta-analysis**

When multiple studies have examined the same effect, their results can be combined in a **meta-analysis**. In essence, you perform an analysis over analyses. You first analyze individual studies, and then analyze the set of effect sizes you calculated from each individual study. To perform a meta-analysis, all you need are the effect sizes and the sample size of each individual study.

We can even perform a meta-analysis on a single study. A *t*-test or correlation will tell you the same thing, but it’s educational to see.

The simulated study yields *t*(158) = 2.80, *p* = 0.007. The effect size Hedges’ *g* = 0.71. This effect size overestimates the true effect size substantially: the true effect size is *d* = 0.4 (calculate this for yourself).
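The ‘calculate this for yourself’ part is just the definition of *d*, and the small-sample correction that turns *d* into Hedges’ *g* is approximately g = d × (1 − 3/(4N − 9)), where N is the total sample size (this correction factor is the standard approximation; it is close to 1 for samples this large):

```r
d_true <- (106 - 100) / 15  # true mean difference divided by the SD
d_true                       # 0.4

# approximate small-sample correction factor from d to Hedges' g
J <- function(N) 1 - 3 / (4 * N - 9)
J(160)  # ~0.995: g is only slightly smaller than d at N = 160
```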

**Assignment 6**

Perform the single-study meta-analysis and compare the results with the *t*-test.

The meta-analysis returns *g* = 0.7144, a 95% CI, and a z-score (2.7178), which is the test statistic for which a *p*-value can be calculated. The *p*-value of 0.0066 is very similar to that observed in the *t*-test.
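The stated *p*-value follows directly from the z-score and the standard normal distribution:

```r
z <- 2.7178
p <- 2 * pnorm(-abs(z))  # two-sided p-value for a z-score
round(p, 4)              # 0.0066
```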

**A small-scale meta-analysis**

Researchers need to choose between a **fixed effect model** or a **random effects model** when performing a meta-analysis.

A fixed effect model assumes **a single true effect size underlies all the studies** included in the meta-analysis. Fixed effect models are therefore only appropriate when all studies in the meta-analysis are practically identical (e.g., use the same manipulation) and when researchers do not want to generalize to different populations (Borenstein, Hedges, Higgins, & Rothstein, 2009).

A random effects model allows **the true effect size to vary from study to study** (e.g., due to differences in the manipulations between studies). Note the difference between fixed *effect* and random *effect***s** (plural, meaning multiple effects). Random effects models are therefore appropriate when a wide range of different studies is examined and there is substantial variance between studies in the effect sizes. Since the assumption that all effect sizes are identical is implausible in most meta-analyses, random effects meta-analyses are generally recommended (Borenstein et al., 2009).

Run the code for **Assignment 7**. We get the following output, where we see four rows (one for each study), the effect sizes and 95% CI for each effect, and the %W (random), which is the relative weight of each study in a random effects meta-analysis.

Below the individual studies, the output gives the meta-analytic effect size estimate, its 95% CI, and a *p*-value. Based on the set of studies we simulated here, we would conclude it looks like there is a true effect.

The output also contains a test for **heterogeneity**. Tests for heterogeneity examine whether there is large enough variation in the effect sizes included in the meta-analysis to assume there might be important moderators of the effect. For example, assume studies examine how happy receiving money makes people. Half of the studies gave people around 10 euros, while the other half gave people 100 euros. It would not be surprising to find that both these manipulations increase happiness, but that 100 euros does so more strongly than 10 euros. Many manipulations in psychological research differ similarly in their strength. If there is substantial heterogeneity, researchers should attempt to examine the underlying reason for it, for example by identifying subsets of studies and then examining the effect in these subsets. In our example, there does not seem to be substantial heterogeneity (the test for heterogeneity, the Q-statistic, is not statistically significant).

**Assignment 7**

**Simulating small studies**

The code for **Assignment 8** is the same as before; we just changed nSims = 1 to nSims = 8.

The **true difference** between both groups is exactly the same in every simulated study. Only 50% of the studies reveal a statistically significant effect, but the meta-analysis provides clear evidence for the presence of a true effect in the fixed effect model (*p* < 0.0001):

```
     SMD             95%-CI %W(fixed) %W(random)
1 -0.0173 [-0.4461; 0.4116]     14.47      13.83
2 -0.0499 [-0.5577; 0.4580]     10.31      11.16
3  0.6581 [ 0.0979; 1.2183]      8.48       9.74
4  0.5806 [ 0.0439; 1.1172]      9.24      10.35
5  0.3104 [-0.1693; 0.7901]     11.56      12.04
6  0.4895 [ 0.0867; 0.8923]     16.40      14.87
7  0.7362 [ 0.3175; 1.1550]     15.17      14.22
8  0.2278 [-0.2024; 0.6580]     14.37      13.78

Number of studies combined: k=8

                      SMD            95%-CI      z  p-value
Fixed effect model 0.3624 [0.1993; 0.5255] 4.3544 < 0.0001
```
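We can verify the fixed effect estimate by hand with inverse-variance weighting: recover each study’s standard error from the width of its 95% CI, weight each effect size by 1/SE², and pool. The numbers below are read off the output above:

```r
g  <- c(-0.0173, -0.0499, 0.6581, 0.5806, 0.3104, 0.4895, 0.7362, 0.2278)
lo <- c(-0.4461, -0.5577, 0.0979, 0.0439, -0.1693, 0.0867, 0.3175, -0.2024)
hi <- c( 0.4116,  0.4580, 1.2183, 1.1172,  0.7901, 0.8923, 1.1550,  0.6580)

se <- (hi - lo) / (2 * qnorm(0.975))  # recover SEs from the CI widths
w  <- 1 / se^2                        # inverse-variance weights

pooled <- sum(w * g) / sum(w)         # fixed effect estimate, ~0.3624
z      <- pooled / sqrt(1 / sum(w))   # z-statistic, ~4.3544
p      <- 2 * pnorm(-abs(z))          # p < 0.0001
round(100 * w / sum(w), 2)            # matches the %W(fixed) column above
```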

**Assignment 8**

**Meta-Analysis, not Miracles**

A common saying applies here: *garbage in, garbage out*. If you calculate the meta-analytic effect size of a bunch of crappy studies, the meta-analytic effect size estimate will also be meaningless. A meta-analysis cannot turn bad data into a good effect size estimate. Similarly, meta-analytic techniques that aim to address publication bias (not discussed in this blog post) can never provide certainty about the unbiased effect size estimate.
