Neuroskeptic has just blogged on a new paper by Judd, Westfall and Kenny on Treating stimuli as a random factor in social psychology: A new and comprehensive solution to a pervasive but largely ignored problem. I can’t access the original paper (which is supposed to be available via my University but hasn’t appeared yet …) but I know a little bit about the topic and thought I’d write a few words.
What stimulated me to write was a) a few of the comments on Neuroskeptic’s blog, and b) that I’ve just written a book that covers the topic in some detail. (Yes – this book!).
The basic problem is that standard statistical analyses in psychology treat participants (subjects) as a random factor, but stimuli as a fixed factor. Thus our statistics assume that the goal of inference is to say something about some population that those participants are representative of (rather than just the particular people in our study). By treating stimuli as fixed it is assumed that we’ve exhaustively sampled the population of interest in our study. This limits statistical generalization to those particular stimuli. This is an unattractive property for psycholinguists (because they tend to be interested in, say, all concrete nouns rather than the 30 nouns used in the study). The same issue may apply to lots of other types of stimuli (faces, people, voices, pictures, logic problems and so forth).
The comments fell into several camps, but one response was that this was another case of researchers getting basic stats wrong. I consider this to be unfair because we’re not talking basic stats here. The problem is quite subtle and the solutions are, in statistical terms, far from basic. Furthermore, it is not always an error. There are situations in which you don’t need to worry about the problem and situations in which it is debatable what the correct approach is.
Another response was that psycholinguists have known about this problem for years (true!) and have analyzed their data correctly (false!). The problem came to prominence in a paper by Herb Clark (The language-as-fixed-effect fallacy), but was originally raised by Coleman (1964). Clark noted that running separate ANOVAs treating subjects as the unit of analysis and items as the unit of analysis (by-subject and by-item analyses) does not solve the problem. If either analysis is statistically non-significant the effect fails to generalize, but even if both are statistically significant the correct analysis (one that combines variability across subjects and items) might still be statistically non-significant. His solution was to approximate the correct ANOVA test statistic (quasi F, or F′) with a simple-to-calculate minimum value (min F′). This is known to be conservative (i.e., it produces p values that are slightly too large) but not unreasonably so in practice (see Raaijmakers et al., 1999). Raaijmakers et al. (1999) show that until recently most psycholinguistic researchers still got it wrong (e.g., by reporting separate by-item and by-subject analyses).
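For the curious, min F′ is easy to compute by hand from the two separate ANOVAs. A minimal sketch (the F values and degrees of freedom below are hypothetical, purely for illustration):

```python
def min_f_prime(f1, df1_err, f2, df2_err):
    """Clark's (1973) min F' from the by-subjects and by-items ANOVAs.

    f1, f2: by-subjects (F1) and by-items (F2) test statistics.
    df1_err, df2_err: their error (denominator) degrees of freedom.
    Returns (min F', denominator df); the numerator df is unchanged.
    """
    min_f = (f1 * f2) / (f1 + f2)
    # Denominator df for min F' (Clark, 1973)
    df_denom = (f1 + f2) ** 2 / (f1 ** 2 / df2_err + f2 ** 2 / df1_err)
    return min_f, df_denom

# Hypothetical example: F1(1, 30) = 12.61 and F2(1, 18) = 6.54
mf, dfd = min_f_prime(12.61, 30, 6.54, 18)
# min F'(1, ~36) is around 4.3: smaller than either F, as expected,
# since the combined test is more conservative than either alone.
```

Note that min F′ is always smaller than the smaller of F1 and F2, which is why reporting the two separate analyses can overstate the evidence.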
What is the correct approach? Well, it depends. First, do you need to generalize beyond your stimuli set? This has to do with your research goals. In some applied research you might just need to understand how people respond to a particular set of stimuli. A single stimulus or stimulus set can offer a counterexample to a strong claim (e.g., that X is always the case). Alternatively, it might be reasonable to assume that the stimuli are – for the purposes of the study – very similar to others in the population (i.e., that population variability is negligible). This might be the case for certain mass-produced products (e.g., brands of chocolate bar) or precision-engineered equipment. However, a lot of the time you do want to generalize beyond your sample of stimuli …
That leaves you with the option of altering the design of the study or incorporating the extra variability owing to stimuli into the analysis. The design option was considered by Clark (1973) and by Raaijmakers et al. (1999). Clark pointed out that if each person received a different (ideally random) sample of items from the stimulus population then the F ratio of a conventional ANOVA would be correct. The principle here is quite simple: all relevant sources of variability need to be represented in the analysis. By varying the stimuli between participants the item variability is present and ends up being incorporated into the between-subjects error term.* This is quite a neat method and can be easy to set up in some studies (e.g., if you have a very large pool of words for a computer to sample from). Raaijmakers et al. (1999) also note that you get correct F ratios from certain other designs. This, in my view, is only partly true. Any design that restricts the population sampled from (of participants or stimuli) restricts its variability and therefore restricts generalization to the pool of participants or stimuli being sampled from.
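The sampling-by-design option is trivial to implement if stimulus presentation is computer-controlled. A sketch of the idea (the pool size, sample size and item names are arbitrary assumptions for illustration):

```python
import random

random.seed(1)  # fixed seed so the assignment is reproducible

# Hypothetical pool of 500 candidate items (e.g., concrete nouns)
item_pool = [f"word{i:03d}" for i in range(500)]
n_participants = 20
items_per_person = 30

# Each participant gets an independent random sample (without
# replacement) from the pool, so item-to-item variability varies
# between participants and is absorbed into the between-subjects
# error term of a conventional ANOVA.
assignments = {
    p: random.sample(item_pool, items_per_person)
    for p in range(n_participants)
}
```

The cost, of course, is that item effects are confounded with participants, so you lose the ability to study individual items; the benefit is that the ordinary F ratio generalizes across both populations.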
Recent developments in statistics and software (or at least recent awareness of them in psychology) have brought the language-as-fixed-effect fallacy (more properly, the stimuli-as-fixed-effect fallacy) back to prominence. In principle it is possible to use a multilevel (or linear mixed) model to deal with the problem of multiple random effects (and this has all sorts of other advantages). However, the usual default is a nested model, which implicitly assumes that the stimuli presented to each person are different.
A nice point here is that a nested multilevel repeated measures model fitted with REML (restricted maximum likelihood) and a certain covariance structure (compound symmetry) is pretty much equivalent to repeated measures ANOVA and can be used to derive standard F tests and so on. This confirms Clark’s assertion that a design with stimuli nested within participants produces the correct F ratios.
Baayen et al. (2008) offered a critique of the standard approach and explained how to fit a multilevel model with crossed random factors (i.e., where stimuli are the same for all participants or, equivalently, participants are the same for all stimuli). These models can be fitted in software such as MLwiN or R (but not SPSS**) that allows for cross-classified multilevel models. The lme4 package in R is particularly useful because it fits these models fairly effortlessly.
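To see what “crossed” means concretely, here is a small stdlib-only simulation of data with that structure (all the variance parameters and the baseline RT are arbitrary assumptions, not values from any of the papers discussed):

```python
import random

random.seed(42)

n_subj, n_item = 30, 20

# Random intercepts: each subject and each item gets its own offset,
# drawn from its own population (SDs chosen arbitrarily for the sketch)
subj_effect = {s: random.gauss(0, 50) for s in range(n_subj)}
item_effect = {i: random.gauss(0, 30) for i in range(n_item)}

# Crossed design: every participant responds to every item, so both
# random factors can be estimated simultaneously. In lme4 this is
# roughly:  lmer(rt ~ condition + (1 | subject) + (1 | item))
data = [
    (s, i, 600 + subj_effect[s] + item_effect[i] + random.gauss(0, 25))
    for s in range(n_subj)
    for i in range(n_item)
]
```

Because each (subject, item) pair occurs, the model can separate subject variability, item variability and residual noise, which is exactly what the by-subject and by-item analyses each fail to do on their own.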
This looks to be the solution described by Judd, Westfall and Kenny (as far as I can tell from their abstract), and it is the solution I cover in my book (Baguley, 2012).
* Note that a by-item or by-subject analysis violates this principle because each analysis uses the average response (averaged over the levels of the other random factor), and the variability around this average is unavailable to the analysis.
** UPDATE: Jake Westfall kindly sent me a copy of the paper. I have not read it properly yet, but it looks extremely good. He points out that recent versions of SPSS can run cross-classified models (I’m still on an older version). Their paper includes SPSS, R and SAS code. I would still recommend R over SPSS. One highlight is that they show how to compute the Kenward-Roger approximation in R. Complex multilevel models make it difficult to assess the correct df for effects, and the Kenward-Roger approximation is one of the better solutions. In my book I used parametric bootstrapping or HPD intervals to get round this problem, but this is potentially a very useful addition.
Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390-412.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12, 335-359.
Coleman, E. B. (1964). Generalizing to a language population. Psychological Reports, 14, 219-226.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with “The language-as-fixed-effect fallacy”: Common misconceptions and alternative solutions. Journal of Memory and Language, 41, 416-426.