Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Prelude: When you start with R and try to estimate a standard ANOVA , which is relatively simple in commercial software like SPSS, R kind of sucks. Especially for unbalanced designs or designs with repeated-measures replicating the results from such software in base R may require considerable effort. For a newcomer (and even an old timer) this can be somewhat off-putting. After I had gained experience developing my first package and was once again struggling with R and ANOVA I had enough and decided to develop afex. If you know this feeling, afex is also for you.

A new version of afex (0.18-0) has been accepted on CRAN a few days ago. This version only fixes a small bug that was introduced in the last version.  aov_ez did not work with more than one covariate (thanks to tkerwin for reporting this bug).

I want to use this opportunity to introduce one of the main functionalities of afex. It provides a set of functions that make calculating ANOVAs easy. In the default settings, afex automatically uses appropriate orthogonal contrasts for factors, transforms numerical variables into factors, uses so-called Type III sums of squares, and allows for any number of factors including repeated-measures (or within-subjects) factors and mixed/split-plot designs. Together this guarantees that the ANOVA results correspond to the results obtained from commercial statistical packages such as SPSS or SAS. On top of this, the ANOVA object returned by afex (of class afex_aov) can be directly used for follow-up or post-hoc tests/contrasts using the lsmeans package .

### Example Data

Let me illustrate how to calculate an ANOVA with a simple example. We use data courtesy of Andrew Heathcote and colleagues . The data are lexical decision and word naming latencies for 300 words and 300 nonwords from 45 participants. Here we only look at three factors:

• task is a between subjects (or independent-samples) factor: 25 participants worked on the lexical decision task (lexdec; i.e., participants had to make a binary decision whether or not the presented string is a word or nonword) and 20 participants on the naming task (naming; i.e., participant had to say the presented string out loud).
• stimulus is a repeated-measures or within-subjects factor that codes whether a presented string was a word or nonword.
• length is also a repeated-measures factor that gives the number of characters of the presented strings with three levels: 3, 4, and 5.

The dependent variable is the response latency or response time for each presented string. More specifically, as is common in the literature we analyze the log of the response times, log_rt. After excluding erroneous responses each participants responded to between 135 and 150 words and between 124 and 150 nonwords. To use this data in an ANOVA one needs to aggregate the data such that only one observation per participant and cell of the design (i.e., combination of all factors) remains. As we will see, afex does this automatically for us (this is one of the features I blatantly stole from ez).

library(afex)
data("fhch2010") # load data (comes with afex)

mean(!fhch2010$correct) # error rate #  0.01981546 fhch <- droplevels(fhch2010[ fhch2010$correct,]) # remove errors

str(fhch2010) # structure of the data
# 'data.frame': 13222 obs. of  10 variables:
#  $id : Factor w/ 45 levels "N1","N12","N13",..: 1 1 1 1 1 1 1 1 ... #$ task     : Factor w/ 2 levels "naming","lexdec": 1 1 1 1 1 1 1 1 1 1 ...
#  $stimulus : Factor w/ 2 levels "word","nonword": 1 1 1 2 2 1 2 2 1 2 ... #$ density  : Factor w/ 2 levels "low","high": 2 1 1 2 1 2 1 1 1 1 ...
#  $frequency: Factor w/ 2 levels "low","high": 1 2 2 2 2 2 1 2 1 2 ... #$ length   : Factor w/ 3 levels "4","5","6": 3 3 2 2 1 1 3 2 1 3 ...
#  $item : Factor w/ 600 levels "abide","acts",..: 363 121 ... #$ rt       : num  1.091 0.876 0.71 1.21 0.843 ...
#  $log_rt : num 0.0871 -0.1324 -0.3425 0.1906 -0.1708 ... #$ correct  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...


We first load the data and remove the roughly 2% errors. The structure of the data.frame (obtained via str()) shows us that the data has a few more factors than discussed here. To specify our ANOVA we first use function aov_car() which works very similar to base R aov(), but as all afex functions uses car::Anova() (read as function Anova() from package car) as the backend for calculating the ANOVA.

### Specifying an ANOVA

(a1 <- aov_car(log_rt ~ task*length*stimulus + Error(id/(length*stimulus)), fhch))
# Contrasts set to contr.sum for the following variables: task
# Anova Table (Type 3 tests)
#
# Response: log_rt
#                 Effect          df  MSE          F   ges p.value
# 1                 task       1, 43 0.23  13.38 ***   .22   .0007
# 2               length 1.83, 78.64 0.00  18.55 ***  .008  <.0001
# 3          task:length 1.83, 78.64 0.00       1.02 .0004     .36
# 4             stimulus       1, 43 0.01 173.25 ***   .17  <.0001
# 5        task:stimulus       1, 43 0.01  87.56 ***   .10  <.0001
# 6      length:stimulus 1.70, 72.97 0.00       1.91 .0007     .16
# 7 task:length:stimulus 1.70, 72.97 0.00       1.21 .0005     .30
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1
#
# Sphericity correction method: GG
# Warning message:
# More than one observation per cell, aggregating the data using mean (i.e, fun_aggregate = mean)!

The printed output is an ANOVA table that could basically be copied to a manuscript as is. One sees the terms in column Effect, the degrees of freedoms (df), the mean-squared error (MSE, I would probably remove this column in a manuscript), the F-value (F, which also contains the significance stars), and the p-value (p.value). The only somewhat uncommon column is ges which provides generalized eta-squared, 'the recommended effect size statistics for repeated measure designs'  . The standard output also reports Greenhouse-Geisser (GG) corrected df for repeated-measures factors with more than two levels (to account for possible violations of sphericity). Note that these corrected df are not integers.

We can also see a warning notifying us that afex has detected that each participant and cell of the design provides more than one observation which are then automatically aggregated using mean. The warning serves the purpose to notify the user in case this was not intended (i.e., when there should be only one observation per participant and cell of the design). The warning can be suppressed via specifying fun_aggregate = mean explicitly in the call to aov_car.

The formula passed to aov_car basically needs to be the same as for standard aov with a few differences:

• It must have an Error term specifying the column containing the participant (or unit of observation) identifier (e.g., minimally +Error(id)). This is necessary to allow the automatic aggregation even in designs without repeated-measures factor.
• Repeated-measures factors only need to be defined in the Error term and do not need to be enclosed by parentheses. Consequently, the following call produces the same ANOVA:
aov_car(log_rt ~ task + Error(id/length*stimulus), fhch)

In addition to aov_car, afex provides two further function for calculating ANOVAs. These function produce the same output but differ in the way how to specify the ANOVA.

• aov_ez allows the ANOVA specification not via a formula but via character vectors (and is similar to ez::ezANOVA()):
aov_ez(id = "id", dv = "log_rt", fhch, between = "task", within = c("length", "stimulus"))
• aov_4 requires a formula for which the id and repeated-measures factors need to be specified as in lme4::lmer() (with the same simplification that repeated-measures factors only need to be specified in the random part):
aov_4(log_rt ~ task + (length*stimulus|id), fhch)
aov_4(log_rt ~ task*length*stimulus + (length*stimulus|id), fhch)


### Follow-up Tests

A common requirement after the omnibus test provided by the ANOVA is some-sort of follow-up analysis. For this purpose, afex is fully integrated with lsmeans .

For example, assume we are interested in the significant task:stimulus interaction. As a first step we might want to investigate the marginal means of these two factors:

lsmeans(a1, c("stimulus","task"))
# NOTE: Results may be misleading due to involvement in interactions
#  stimulus task        lsmean         SE    df    lower.CL    upper.CL
#  word     naming -0.34111656 0.04250050 48.46 -0.42654877 -0.25568435
#  nonword  naming -0.02687619 0.04250050 48.46 -0.11230839  0.05855602
#  word     lexdec  0.00331642 0.04224522 47.37 -0.08165241  0.08828525
#  nonword  lexdec  0.05640801 0.04224522 47.37 -0.02856083  0.14137684
#
# Results are averaged over the levels of: length
# Confidence level used: 0.95


From this we can see naming trials seems to be generally slower (as a reminder, the dv is log-transformed RT in seconds, so values below 0 correspond to RTs bewteen 0 and 1), It also appears that the difference between word and nonword trials is larger in the naming task then in the lexdec task. We test this with the following code using a few different lsmeans function. We first use lsmeans again, but this time using task as the conditioning variable specified in by. Then we use pairs() for obtaining all pairwise comparisons within each conditioning strata (i.e., level of task). This provides us already with the correct tests, but does not control for the family-wise error rate across both tests. To get those, we simply update() the returned results and remove the conditioning by setting by=NULL. In the call to update we can already specify the method for error control and we specify 'holm',  because it is uniformly more powerful than Bonferroni.

# set up conditional marginal means:
#  stimulus      lsmean         SE    df    lower.CL    upper.CL
#  word     -0.34111656 0.04250050 48.46 -0.42654877 -0.25568435
#  nonword  -0.02687619 0.04250050 48.46 -0.11230839  0.05855602
#
#  stimulus      lsmean         SE    df    lower.CL    upper.CL
#  word      0.00331642 0.04224522 47.37 -0.08165241  0.08828525
#  nonword   0.05640801 0.04224522 47.37 -0.02856083  0.14137684
#
# Results are averaged over the levels of: length
# Confidence level used: 0.95
#  contrast       task      estimate         SE df t.ratio p.value
#  word - nonword naming -0.31424037 0.02080113 43 -15.107  <.0001
#  word - nonword lexdec -0.05309159 0.01860509 43  -2.854  0.0066
#
# Results are averaged over the levels of: length
# P value adjustment: holm method for 2 tests

Hmm. These results show that the stimulus effects in both task conditions are independently significant. Obviously, the difference between them must also be significant then, or?

pairs(update(pairs(ls1), by=NULL))
# contrast                              estimate         SE df t.ratio p.value
# wrd-nnwrd,naming - wrd-nnwrd,lexdec -0.2611488 0.02790764 43  -9.358  <.0001

They obviously are. As a reminder, the interaction is testing exactly this, the difference of the difference. And we can actually recover the F-value of the interaction using lsmeans alone by invoking yet another of its functions, test(..., joint=TRUE).

test(pairs(update(pairs(ls1), by=NULL)), joint=TRUE)
# df1 df2      F p.value
#   1  43 87.565  <.0001

These last two example were perhaps not particularly interesting from a statistical point of view, but show an important ability of lsmeans. Any set of estimated marginal means produced by lsmeans, including any sort of (custom) contrasts, can be used again for further tests or calculating new sets of marginal means. And with test() we can even obtain joint F-tests over several parameters using joint=TRUE. lsmeans is extremely powerful and one of my most frequently used packages that basically performs all tests following an omnibus test (and in its latest version it directly interfaces with rstanarm so it can now also be used for a lot of Bayesian stuff, but this is the topic of another blog post).

Finally, lsmeans can also be used directly for plotting by envoking lsmip:

lsmip(a1, task ~ stimulus) Note that lsmip does not add error bars to the estimated marginal means, but only plots the point estimates. There are mainly two reasons for this. First, as soon as repeated-measures factors are involved, it is difficult to decide which error bars to plot. Standard error bars based on the standard error of the mean are not appropriate for within-subjects comparisons. For those, one would need to use a within-subject intervals  (see also here or here). Especially for plots as the current one with both independent-samples and repeated-measures factors (i.e., mixed within-between designs or split-plot designs) no error bar will allow comparisons across both dimensions. Second, only 'if the SE [i.e., standard error] of the mean is exactly 1/2 the SE of the difference of two means -- which is almost never the case -- it would be appropriate to use overlapping confidence intervals to test comparisons of means' (lsmeans author Russel Lenth, the link provides an alternative).

We can also use lsmeans in combination with lattice to plot the results on the unconstrained scale (i.e., after back-transforming tha data from the log scale to the original scale of response time in seconds). The plot is not shown here.

lsm1 <- summary(lsmeans(a1, c("stimulus","task")))
lsm1$lsmean <- exp(lsm1$lsmean)
require(lattice)
xyplot(lsmean ~ stimulus, lsm1, group = task, type = "b",
auto.key = list(space = "right"))


### Summary

• afex provides a set of functions that make specifying standard ANOVAs for an arbitrary number of between-subjects (i.e., independent-sample) or within-subjects (i.e., repeated-measures) factors easy: aov_car(), aov_ez(), and aov_4().
• In its default settings, the afex ANOVA functions replicate the results of commercial statistical packages such as SPSS or SAS (using orthogonal contrasts and Type III sums of squares).
• Fitted ANOVA models can be passed to lsmeans for follow-up tests, custom contrast tests, and plotting.
• For specific questions visit the new afex support forum: afex.singmann.science (I think we just need someone to ask the first ANOVA question to get the ball rolling).
• For more examples see the vignette or here (blog post by Ulf Mertens) or download the full example R script used here.

As a caveat, let me end this post with some cautionary remarks from Douglas Bates (fortunes::fortune(184)) who explains why ANOVA in R is supposed to not be the same as in other software packages (i.e., he justifies why it 'sucks'):

You must realize that R is written by experts in statistics and statistical computing who, despite popular opinion, do not believe that everything in SAS and SPSS is worth copying. Some things done in such packages, which trace their roots back to the days of punched cards and magnetic tape when fitting a single linear model may take several days because your first 5 attempts failed due to syntax errors in the JCL or the SAS code, still reflect the approach of "give me every possible statistic that could be calculated from this model, whether or not it makes sense". The approach taken in R is different. The underlying assumption is that the useR is thinking about the analysis while doing it.
-- Douglas Bates (in reply to the suggestion to include type III sums of squares and lsmeans in base R to make it more similar to SAS or SPSS)
R-help (March 2007)