**Computational Psychology - Henrik Singmann**, and kindly contributed to R-bloggers)

**Prelude**: When you start with `R`

and try to estimate a standard ANOVA , which is relatively simple in commercial software like SPSS, `R`

kind of sucks. Especially for unbalanced designs or designs with repeated-measures replicating the results from such software in base `R`

may require considerable effort. For a newcomer (and even an old timer) this can be somewhat off-putting. After I had gained experience developing my first package and was once again struggling with `R`

and ANOVA I had enough and decided to develop `afex`

. If you know this feeling, `afex`

is also for you.

A new version of `afex`

(0.18-0) has been accepted on CRAN a few days ago. This version only fixes a small bug that was introduced in the last version. `aov_ez`

did not work with more than one `covariate`

(thanks to `tkerwin`

for reporting this bug).

I want to use this opportunity to introduce one of the main functionalities of `afex`

. It provides a set of functions that make calculating ANOVAs easy. In the default settings, `afex`

automatically uses appropriate orthogonal contrasts for factors, transforms numerical variables into factors, uses so-called Type III sums of squares, and allows for any number of factors including repeated-measures (or within-subjects) factors and mixed/split-plot designs. Together this guarantees that the ANOVA results correspond to the results obtained from commercial statistical packages such as `SPSS`

or `SAS`

. On top of this, the ANOVA object returned by afex (of class `afex_aov`

) can be directly used for follow-up or post-hoc tests/contrasts using the `lsmeans`

package .

### Example Data

Let me illustrate how to calculate an ANOVA with a simple example. We use data courtesy of Andrew Heathcote and colleagues . The data are lexical decision and word naming latencies for 300 words and 300 nonwords from 45 participants. Here we only look at three factors:

`task`

is a between subjects (or independent-samples) factor: 25 participants worked on the lexical decision task (`lexdec`

; i.e., participants had to make a binary decision whether or not the presented string is a word or nonword) and 20 participants on the naming task (`naming`

; i.e., participant had to say the presented string out loud).`stimulus`

is a repeated-measures or within-subjects factor that codes whether a presented string was a`word`

or`nonword`

.`length`

is also a repeated-measures factor that gives the number of characters of the presented strings with three levels: 3, 4, and 5.

The dependent variable is the response latency or response time for each presented string. More specifically, as is common in the literature we analyze the log of the response times, `log_rt`

. After excluding erroneous responses each participants responded to between 135 and 150 words and between 124 and 150 nonwords. To use this data in an ANOVA one needs to aggregate the data such that only one observation per participant and cell of the design (i.e., combination of all factors) remains. As we will see, `afex`

does this automatically for us (this is one of the features I blatantly stole from `ez`

).

library(afex) data("fhch2010") # load data (comes with afex) mean(!fhch2010$correct) # error rate # [1] 0.01981546 fhch <- droplevels(fhch2010[ fhch2010$correct,]) # remove errors str(fhch2010) # structure of the data # 'data.frame': 13222 obs. of 10 variables: # $ id : Factor w/ 45 levels "N1","N12","N13",..: 1 1 1 1 1 1 1 1 ... # $ task : Factor w/ 2 levels "naming","lexdec": 1 1 1 1 1 1 1 1 1 1 ... # $ stimulus : Factor w/ 2 levels "word","nonword": 1 1 1 2 2 1 2 2 1 2 ... # $ density : Factor w/ 2 levels "low","high": 2 1 1 2 1 2 1 1 1 1 ... # $ frequency: Factor w/ 2 levels "low","high": 1 2 2 2 2 2 1 2 1 2 ... # $ length : Factor w/ 3 levels "4","5","6": 3 3 2 2 1 1 3 2 1 3 ... # $ item : Factor w/ 600 levels "abide","acts",..: 363 121 ... # $ rt : num 1.091 0.876 0.71 1.21 0.843 ... # $ log_rt : num 0.0871 -0.1324 -0.3425 0.1906 -0.1708 ... # $ correct : logi TRUE TRUE TRUE TRUE TRUE TRUE ...

We first load the data and remove the roughly 2% errors. The structure of the `data.frame`

(obtained via `str()`

) shows us that the data has a few more factors than discussed here. To specify our ANOVA we first use function `aov_car()`

which works very similar to base `R aov()`

, but as all `afex`

functions uses `car::Anova()`

(read as function `Anova()`

from package `car`

) as the backend for calculating the ANOVA.

### Specifying an ANOVA

(a1 <- aov_car(log_rt ~ task*length*stimulus + Error(id/(length*stimulus)), fhch)) # Contrasts set to contr.sum for the following variables: task # Anova Table (Type 3 tests) # # Response: log_rt # Effect df MSE F ges p.value # 1 task 1, 43 0.23 13.38 *** .22 .0007 # 2 length 1.83, 78.64 0.00 18.55 *** .008 <.0001 # 3 task:length 1.83, 78.64 0.00 1.02 .0004 .36 # 4 stimulus 1, 43 0.01 173.25 *** .17 <.0001 # 5 task:stimulus 1, 43 0.01 87.56 *** .10 <.0001 # 6 length:stimulus 1.70, 72.97 0.00 1.91 .0007 .16 # 7 task:length:stimulus 1.70, 72.97 0.00 1.21 .0005 .30 # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1 # # Sphericity correction method: GG # Warning message: # More than one observation per cell, aggregating the data using mean (i.e, fun_aggregate = mean)!

The printed output is an ANOVA table that could basically be copied to a manuscript as is. One sees the terms in column `Effect`

, the degrees of freedoms (`df`

), the mean-squared error (`MSE`

, I would probably remove this column in a manuscript), the *F*-value (`F`

, which also contains the significance stars), and the *p*-value (`p.value`

). The only somewhat uncommon column is `ges`

which provides generalized eta-squared, ‘the recommended effect size statistics for repeated measure designs’ . The standard output also reports Greenhouse-Geisser (`GG`

) corrected *df* for repeated-measures factors with more than two levels (to account for possible violations of sphericity). Note that these corrected *df* are not integers.

We can also see a warning notifying us that `afex`

has detected that each participant and cell of the design provides more than one observation which are then automatically aggregated using `mean`

. The warning serves the purpose to notify the user in case this was not intended (i.e., when there should be only one observation per participant and cell of the design). The warning can be suppressed via specifying `fun_aggregate = mean`

explicitly in the call to `aov_car`

.

The formula passed to `aov_car`

basically needs to be the same as for standard `aov`

with a few differences:

- It must have an
`Error`

term specifying the column containing the participant (or unit of observation) identifier (e.g., minimally`+Error(id)`

). This is necessary to allow the automatic aggregation even in designs without repeated-measures factor. - Repeated-measures factors only need to be defined in the
`Error`

term and do not need to be enclosed by parentheses. Consequently, the following call produces the same ANOVA:aov_car(log_rt ~ task + Error(id/length*stimulus), fhch)

In addition to `aov_car`

, `afex`

provides two further function for calculating ANOVAs. These function produce the same output but differ in the way how to specify the ANOVA.

`aov_ez`

allows the ANOVA specification not via a`formula`

but via`character`

vectors (and is similar to`ez::ezANOVA()`

):aov_ez(id = "id", dv = "log_rt", fhch, between = "task", within = c("length", "stimulus"))

`aov_4`

requires a formula for which the`id`

and repeated-measures factors need to be specified as in`lme4::lmer()`

(with the same simplification that repeated-measures factors only need to be specified in the random part):aov_4(log_rt ~ task + (length*stimulus|id), fhch) aov_4(log_rt ~ task*length*stimulus + (length*stimulus|id), fhch)

### Follow-up Tests

A common requirement after the omnibus test provided by the ANOVA is some-sort of follow-up analysis. For this purpose, `afex`

is fully integrated with `lsmeans`

.

For example, assume we are interested in the significant `task:stimulus`

interaction. As a first step we might want to investigate the marginal means of these two factors:

lsmeans(a1, c("stimulus","task")) # NOTE: Results may be misleading due to involvement in interactions # stimulus task lsmean SE df lower.CL upper.CL # word naming -0.34111656 0.04250050 48.46 -0.42654877 -0.25568435 # nonword naming -0.02687619 0.04250050 48.46 -0.11230839 0.05855602 # word lexdec 0.00331642 0.04224522 47.37 -0.08165241 0.08828525 # nonword lexdec 0.05640801 0.04224522 47.37 -0.02856083 0.14137684 # # Results are averaged over the levels of: length # Confidence level used: 0.95

From this we can see naming trials seems to be generally slower (as a reminder, the dv is log-transformed RT in seconds, so values below 0 correspond to RTs bewteen 0 and 1), It also appears that the difference between `word`

and `nonword`

trials is larger in the `naming`

task then in the `lexdec`

task. We test this with the following code using a few different `lsmeans`

function. We first use `lsmeans`

again, but this time using `task`

as the conditioning variable specified in `by`

. Then we use `pairs()`

for obtaining all pairwise comparisons within each conditioning strata (i.e., level of `task`

). This provides us already with the correct tests, but does not control for the family-wise error rate across both tests. To get those, we simply `update()`

the returned results and remove the conditioning by setting `by=NULL`

. In the call to update we can already specify the method for error control and we specify `'holm'`

, because it is uniformly more powerful than Bonferroni.

# set up conditional marginal means: (ls1 <- lsmeans(a1, c("stimulus"), by="task")) # task = naming: # stimulus lsmean SE df lower.CL upper.CL # word -0.34111656 0.04250050 48.46 -0.42654877 -0.25568435 # nonword -0.02687619 0.04250050 48.46 -0.11230839 0.05855602 # # task = lexdec: # stimulus lsmean SE df lower.CL upper.CL # word 0.00331642 0.04224522 47.37 -0.08165241 0.08828525 # nonword 0.05640801 0.04224522 47.37 -0.02856083 0.14137684 # # Results are averaged over the levels of: length # Confidence level used: 0.95 update(pairs(ls1), by=NULL, adjust = "holm") # contrast task estimate SE df t.ratio p.value # word - nonword naming -0.31424037 0.02080113 43 -15.107 <.0001 # word - nonword lexdec -0.05309159 0.01860509 43 -2.854 0.0066 # # Results are averaged over the levels of: length # P value adjustment: holm method for 2 tests

Hmm. These results show that the `stimulus`

effects in both `task`

conditions are independently significant. Obviously, the difference between them must also be significant then, or?

pairs(update(pairs(ls1), by=NULL)) # contrast estimate SE df t.ratio p.value # wrd-nnwrd,naming - wrd-nnwrd,lexdec -0.2611488 0.02790764 43 -9.358 <.0001

They obviously are. As a reminder, the interaction is testing exactly this, the difference of the difference. And we can actually recover the *F*-value of the interaction using `lsmeans`

alone by invoking yet another of its functions, `test(..., joint=TRUE)`

.

test(pairs(update(pairs(ls1), by=NULL)), joint=TRUE) # df1 df2 F p.value # 1 43 87.565 <.0001

These last two example were perhaps not particularly interesting from a statistical point of view, but show an important ability of `lsmeans`

. Any set of estimated marginal means produced by `lsmeans`

, including any sort of (custom) contrasts, can be used again for further tests or calculating new sets of marginal means. And with `test()`

we can even obtain joint *F*-tests over several parameters using `joint=TRUE`

. `lsmeans`

is extremely powerful and one of my most frequently used packages that basically performs all tests following an omnibus test (and in its latest version it directly interfaces with `rstanarm`

so it can now also be used for a lot of Bayesian stuff, but this is the topic of another blog post).

Finally, `lsmeans`

can also be used directly for plotting by envoking `lsmip`

:

lsmip(a1, task ~ stimulus)

Note that `lsmip`

does not add error bars to the estimated marginal means, but only plots the point estimates. There are mainly two reasons for this. First, as soon as repeated-measures factors are involved, it is difficult to decide which error bars to plot. Standard error bars based on the standard error of the mean are not appropriate for within-subjects comparisons. For those, one would need to use a within-subject intervals (see also here or here). Especially for plots as the current one with both independent-samples and repeated-measures factors (i.e., mixed within-between designs or split-plot designs) no error bar will allow comparisons across both dimensions. Second, only ‘if the SE [i.e., standard error] of the mean is exactly 1/2 the SE of the difference of two means — which is almost never the case — it would be appropriate to use overlapping confidence intervals to test comparisons of means’ (lsmeans author Russel Lenth, the link provides an alternative).

We can also use `lsmeans`

in combination with `lattice`

to plot the results on the unconstrained scale (i.e., after back-transforming tha data from the log scale to the original scale of response time in seconds). The plot is not shown here.

lsm1 <- summary(lsmeans(a1, c("stimulus","task"))) lsm1$lsmean <- exp(lsm1$lsmean) require(lattice) xyplot(lsmean ~ stimulus, lsm1, group = task, type = "b", auto.key = list(space = "right"))

### Summary

`afex`

provides a set of functions that make specifying standard ANOVAs for an arbitrary number of between-subjects (i.e., independent-sample) or within-subjects (i.e., repeated-measures) factors easy:`aov_car()`

,`aov_ez()`

, and`aov_4()`

.- In its default settings, the
`afex`

ANOVA functions replicate the results of commercial statistical packages such as SPSS or SAS (using orthogonal contrasts and Type III sums of squares). - Fitted ANOVA models can be passed to
`lsmeans`

for follow-up tests, custom contrast tests, and plotting. - For specific questions visit the new afex support forum: afex.singmann.science (I think we just need someone to ask the first ANOVA question to get the ball rolling).
- For more examples see the vignette or here (blog post by Ulf Mertens) or download the full example R script used here.

As a caveat, let me end this post with some cautionary remarks from Douglas Bates (`fortunes::fortune(184)`

) who explains why ANOVA in `R`

is supposed to not be the same as in other software packages (i.e., he justifies why it ‘sucks’):

You must realize that R is written by experts in statistics and statistical computing who, despite popular opinion, do not believe that everything in SAS and SPSS is worth copying. Some things done in such packages, which trace their roots back to the days of punched cards and magnetic tape when fitting a single linear model may take several days because your first 5 attempts failed due to syntax errors in the JCL or the SAS code, still reflect the approach of “give me every possible statistic that could be calculated from this model, whether or not it makes sense”. The approach taken in R is different. The underlying assumption is that the useR is thinking about the analysis while doing it.

— Douglas Bates (in reply to the suggestion to include type III sums of squares and lsmeans in base R to make it more similar to SAS or SPSS)

R-help (March 2007)

Maybe he is right, but maybe what I have described here is useful to some degree.

### References

**leave a comment**for the author, please follow the link and comment on their blog:

**Computational Psychology - Henrik Singmann**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...