Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Stan is probably the most interesting development in computational statistics in the last few years, at least for me. The version of Hamiltonian Monte-Carlo (HMC) implemented in Stan (NUTS, ) is extremely efficient and the range of probability distributions implemented in the Stan language allows to fit an extremely wide range of models. Stan has considerably changed which models I think can be realistically estimated both in terms of model complexity and data size. It is not an overstatement to say that Stan (and particularly rstan) has considerable changed the way I analyze data.

One of the R packages that allows to implement Stan models in a very convenient manner and which has created a lot of buzz recently is brms . It allows to specify a wide range of models using the R formula interface. Based on the formula and a specification of the family of the model, it generates the model code, compiles it, and then passes it together with the data to rstan for sampling. Because I usually program my models by-hand (thanks to the great Stan documentation), I have so far stayed away from brms.

However, I recently learned that brms also allows the estimation of the Wiener model (i.e., the 4-parameter diffusion model, ) for simultaneously accounting for responses and corresponding response times for data from two-choice tasks. Such data is quite common in psychology and the diffusion model is one of the more popular cognitive models out there . In a series of (probably 3) posts I provide an example of applying the Wiener model to some published data using brms. This first part shows how to set up and estimate the model. The second part gives an overview of model diagnostics and an assessment of model fit via posterior predictive distributions. The third part shows how to inspect and compare the posterior distributions of the parameters.

In addition to brms and a working C++ compiler, this first part also needs package RWiener for generating the posterior predictive distribution within brms and package rtdists for the data.

library("brms")


## Data and Model

A graphical illustration of the Wiener diffusion model for two-choice reaction times. An evidence counter starts at value alpha*beta and evolves with random increments. The mean increment is delta . The process terminates as soon as the accrued evidence exceeds alpha or deceeds 0. The decision process starts at time tau from the stimulus presentation and terminates at the reaction time. [This figure and caption are taken from Wabersich and Vandekerckhove (2014, The R Journal, CC-BY license).]

I expect the reader to already be familiar with the Wiener model and will only provide the a very brief introduction here, for more see . The Wiener model is a continuous-time evidence accumulation model for binary choice tasks. It assumes that in each trial evidence is accumulated in a noisy (diffusion) process by a single accumulator. Evidence accumulation starts at the start point and continues until the accumulator hits one of the two decision bounds in which case the corresponding response is given. The total response time is the sum of the decision time from the accumulation process plus non-decisional components. In sum, the Wiener model allows to decompose responses to a binary choice tasks and corresponding response times into four latent processes:

• The drift rate (delta) is the average slope of the accumulation process towards the boundaries. The larger the (absolute value of the) drift rate, the stronger the evidence for the corresponding response option.
• The boundary separation (alpha) is the distance between the two decision bounds and interpreted as a measure of response caution.
• The starting point (beta) of the accumulation process is a measure of response bias towards one of the two response boundaries.
• The non-decision time (tau) captures all non-decisional process such as stimulus encoding and response processes.

We will analyze part of the data from Experiment 1 of . The data comes from 17 participants performing a lexical decision task in which they have to decide if a presented string is a word or non-word. Participants made decisions either under speed or accuracy emphasis instructions in different experimental blocks. This data comes with the rtdists package (which provides the PDF, CDF, and RNG for the full 7-parameter diffusion model). After removing some extreme RTs, we restrict the analysis to high-frequency words (frequency = high) and the corresponding high-frequency non-words (frequency = nw_high) to reduce estimation time. To setup the model we also need a numeric response variable in which 0 corresponds to responses at the lower response boundary and 1 corresponds to responses at the upper boundary. For this we transform the categorical response variable response to numeric and subtract 1 such that a word response correspond to the lower response boundary and a nonword response to the upper boundary.

data(speed_acc, package = "rtdists")
speed_acc <- droplevels(speed_acc[!speed_acc$censor,]) # remove extreme RTs speed_acc <- droplevels(speed_acc[ speed_acc$frequency %in%
c("high", "nw_high"),])
speed_acc$response2 <- as.numeric(speed_acc$response)-1
str(speed_acc)
'data.frame':    10462 obs. of  10 variables:
$id : Factor w/ 17 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...$ block    : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$condition: Factor w/ 2 levels "accuracy","speed": 2 2 2 2 2 2 2 2 2 2 ...$ stim     : Factor w/ 1611 levels "1001","1002",..: 1271 46 110 666 422 ...
$stim_cat : Factor w/ 2 levels "word","nonword": 2 1 1 1 1 1 2 1 1 2 ...$ frequency: Factor w/ 2 levels "high","nw_high": 2 1 1 1 1 1 2 1 1 2 ...
$response : Factor w/ 2 levels "word","nonword": 2 1 1 1 1 1 1 1 1 1 ...$ rt       : num  0.773 0.39 0.435 0.427 0.622 0.441 0.308 0.436 0.412 ...
$censor : logi FALSE FALSE FALSE FALSE FALSE FALSE ...$ response2: num  1 0 0 0 0 0 0 0 0 0 ...

## Model Formula

The important decision that has to be made before setting up a model is which parameters are allowed to differ between which conditions (i.e., factor levels). One common constraint of the Wiener model (and other evidence-accumulation models) is that the parameters that are set before the evidence accumulation process starts (i.e., boundary separation, starting point, and non-decision time) cannot change based on stimulus characteristics that are not known to the participant before the start of the trial. Thus, the item-type, in the present case word versus non-word, is usually only allowed to affect the drift rate. We follow this constraint. Furthermore, all four parameters are allowed to vary between speed and accuracy condition as this is manipulated between blocks of trials. Also note that all relevant variables are manipulated within-subjects. Thus, the maximal random-effects structure entails corresponding random-effects parameters for each fixed-effect. To set up the model we need to invoke the bf() function and construct one formula for each of the four parameters of the Wiener model.

formula <- bf(rt | dec(response2) ~ 0 + condition:frequency +
(0 + condition:frequency|p|id),
bs ~ 0 + condition + (0 + condition|p|id),
ndt ~ 0 + condition + (0 + condition|p|id),
bias ~ 0 + condition + (0 + condition|p|id))

The first formula is for the drift rate and is also used for specifying the column containing the RTs (rt) and the response or decision (response2) on the left hand side. On the right hand side one can specify fixed effects as well as random effects in a way similar to lme4. The drift rate is allowed to vary between both variables, condition and frequency (stim_cat would be equivalent), thus we estimate fixed effects as well as random effects for both factors as well as their interaction.

We then also need to set up one formula for each of the other three parameters (which are only allowed to vary by condition). For these formulas, the left hand side denotes the parameter names:

• bs: boundary separation (alpha)
• ndt: non-decision time (tau)
• bias: starting point (beta)

The right hand side again specifies the fixed- and random-effects. Note that one common approach for setting up evidence accumulation models is to specify that one response boundary represent correct responses and one response boundary denotes incorrect responses (in contrast to the current approach in which the response boundaries represent the actually two response options). In such a situation one cannot estimate the starting point and it needs to be fixed to 0.5 (i.e., replace the formula with bias = 0.5).

Two further points are relevant in the formulas. First, I have used a somewhat uncommon parameterization and suppressed the intercept (e.g., ~ 0 + condition instead of ~ condition). The reason for this is that when an intercept is present, categorical variables (i.e., factors) with k levels are coded with k-1 deviation variables that represent deviations from the intercept. Thus, in a Bayesian setting one needs to consider the choice of prior for these deviation variables. In contrast, when suppressing the intercept the model can be setup such that each factor level (or design cells in case of involvement of more than one factor) receives its own parameter, as done here. This essentially allows the same prior for each parameter (as long as one does not expect the parameters to vary dramatically). Furthermore, when programming a model oneself this is a common parameterization. To see the differences between the different parameterizations compare the following two calls (model.matrix is the function that creates the parameterization internally). Only the first creates a separate parameter for each condition.

unique(model.matrix(~0+condition, speed_acc))
##     conditionaccuracy conditionspeed
## 36                  0              1
## 128                 1              0
unique(model.matrix(~condition, speed_acc))
##     (Intercept) conditionspeed
## 36            1              1
## 128           1              0


Note that when more than one factor is involved and one wants to use this parameterization, one needs to combine the factors using the : and not *. This can be seen when running the code below. Also note that when combining the factors with : without suppressing the intercept, the resulting model has one parameter more than can be estimated (i.e., the model-matrix is rank deficient). So care needs to be taken at this step.

unique(model.matrix(~ 0 + condition:frequency, speed_acc))
unique(model.matrix(~ 0 + condition*frequency, speed_acc))
unique(model.matrix(~ condition:frequency, speed_acc))

Second, brms formulas provide a way to estimate correlations among random-effects parameters of different formulas. To achieve this, one can place an identifier in the middle of the random-effects formula that is separated by | on both sides. Correlations among random-effects will then be estimated for all random-effects formulas that share the same identifier. In our case, we want to estimate the full random-effects matrix with correlations among all model parameters, following the “latent-trait approach” . We therefore place the same identifier (p) in all formulas. Thus, correlations will be estimated among all individual-level deviations across all four Wiener parameters. To estimate correlations only among the random-effects parameters of each formula, simply omit the identifier (e.g., (0 + condition|id)). Furthermore, note that brms, similar to afex, supports suppressing the correlations among categorical random-effects parameters via || (e.g., (0 + condition||id)).

The next step is to setup the priors. For this we can invoke the get_prior function. This function requires one to specify the formula, data, as well as the family of the model. family is the argument where we tell brms that we want to use the wiener model. We also use it to specify the link function for the four Wiener parameters. Because the drift rate can take on any value (i.e., from -Inf to Inf), the default link function is "identity" (i.e., no transformation) which we retain. The other three parameters all have a restricted range. The boundary needs to be larger than 0, the non-decision time needs to be larger than 0 and smaller than the smallest RT, and the starting point needs to be between 0 and 1. The default link-functions respect these constraints and use "log" for the first two parameters and "logit" for the bias. This certainly is a possibility, but has a number of drawbacks leading me to use the "identity" link function for all parameters. First, when parameters are transformed, the priors need to be specified on the untransformed scale. Second, the individual-levels deviations (i.e., the random-effects estimates) are assumed to come from a multivariate normal distribution. Parameters transformations would entail that these individual-deviations are only normally distributed on the untransformed scale. Likewise, the correlations of parameter deviations across parameters would also be on the untransformed scale. Both make the interpretation of the random-effects difficult.

When specifying the parameters without transformation (i.e., link = "identity") care must be taken that the priors places most mass on values inside the allowed range. Likewise, starting values need to be inside the allowed range. Using the identity link function also comes with drawbacks discussed at the end. However, as long as parameter outside the allowed range only occur rarely, such a model can converge successfully and it makes the interpretation easier.

The get_prior function returns a data.frame containing all parameters of the model. If parameters have default priors these are listed as well. One needs to define priors either for individual parameters, parameter classes, or parameter classes for specific groups, or dpars. Note that all parameters that do not have a default prior should receive a specific prior.

get_prior(formula,
data = speed_acc,
link_bias = "identity"))

[Two empty columns to the right were removed from the following output.]

                 prior class                               coef group resp dpar
1                          b
2                          b    conditionaccuracy:frequencyhigh
3                          b conditionaccuracy:frequencynw_high
4                          b       conditionspeed:frequencyhigh
5                          b    conditionspeed:frequencynw_high
6               lkj(1)   cor
7                        cor                                       id
8  student_t(3, 0, 10)    sd
9                         sd                                       id
10                        sd    conditionaccuracy:frequencyhigh    id
11                        sd conditionaccuracy:frequencynw_high    id
12                        sd       conditionspeed:frequencyhigh    id
13                        sd    conditionspeed:frequencynw_high    id
14                         b                                               bias
15                         b                  conditionaccuracy            bias
16                         b                     conditionspeed            bias
17 student_t(3, 0, 10)    sd                                               bias
18                        sd                                       id      bias
19                        sd                  conditionaccuracy    id      bias
20                        sd                     conditionspeed    id      bias
21                         b                                                 bs
22                         b                  conditionaccuracy              bs
23                         b                     conditionspeed              bs
24 student_t(3, 0, 10)    sd                                                 bs
25                        sd                                       id        bs
26                        sd                  conditionaccuracy    id        bs
27                        sd                     conditionspeed    id        bs
28                         b                                                ndt
29                         b                  conditionaccuracy             ndt
30                         b                     conditionspeed             ndt
31 student_t(3, 0, 10)    sd                                                ndt
32                        sd                                       id       ndt
33                        sd                  conditionaccuracy    id       ndt
34                        sd                     conditionspeed    id       ndt

Priors can be defined with the prior or set_prior function allowing different levels of control. One benefit of the way the model is parameterized is that we only need to specify priors for one set of parameters per Wiener parameters (i.e., b) and do not have to distinguish between intercept and other parameters.

We base our choice of the priors on prior knowledge of likely parameter values for the Wiener model, but otherwise try to specify them in a weakly informative manner. That is, they should restrict the range to likely values but not affect the estimation any further. For the drift rate we use a Cauchy distribution with location 0 and scale 5 so that roughly 70% of prior mass are between -10 and 10. For the boundary separation we use a normal prior with mean 1.5 and standard deviation of 1, for the non-decision time a normal prior with mean 0.2 and standard deviation of 0.1, and for the bias we use a normal with mean of 0.5 (i.e., no-bias) and standard deviation of 0.2.

prior <- c(
prior("cauchy(0, 5)", class = "b"),
set_prior("normal(1.5, 1)", class = "b", dpar = "bs"),
set_prior("normal(0.2, 0.1)", class = "b", dpar = "ndt"),
set_prior("normal(0.5, 0.2)", class = "b", dpar = "bias")
)

With this information we can use the make_stancode function and inspect the full model code. The important thing is to make sure that all parameters listed in the parameters block have a prior listed in the model block. We can also see, at the beginning of the model block, that none of our parameters is transformed just as desired (a bug in a previous version of brms prevented anything but the default links for the Wiener model parameters).

make_stancode(formula,
data = speed_acc,
prior = prior)

// generated with brms 1.10.2
functions {

/* Wiener diffusion log-PDF for a single response
* Args:
*   y: reaction time data
*   dec: decision data (0 or 1)
*   alpha: boundary separation parameter > 0
*   tau: non-decision time parameter > 0
*   beta: initial bias parameter in [0, 1]
*   delta: drift rate parameter
* Returns:
*   a scalar to be added to the log posterior
*/
real wiener_diffusion_lpdf(real y, int dec, real alpha,
real tau, real beta, real delta) {
if (dec == 1) {
return wiener_lpdf(y | alpha, tau, beta, delta);
} else {
return wiener_lpdf(y | alpha, tau, 1 - beta, - delta);
}
}
}
data {
int N;  // total number of observations
vector[N] Y;  // response variable
int K;  // number of population-level effects
matrix[N, K] X;  // population-level design matrix
int K_bs;  // number of population-level effects
matrix[N, K_bs] X_bs;  // population-level design matrix
int K_ndt;  // number of population-level effects
matrix[N, K_ndt] X_ndt;  // population-level design matrix
int K_bias;  // number of population-level effects
matrix[N, K_bias] X_bias;  // population-level design matrix
// data for group-level effects of ID 1
int J_1[N];
int N_1;
int M_1;
vector[N] Z_1_1;
vector[N] Z_1_2;
vector[N] Z_1_3;
vector[N] Z_1_4;
vector[N] Z_1_bs_5;
vector[N] Z_1_bs_6;
vector[N] Z_1_ndt_7;
vector[N] Z_1_ndt_8;
vector[N] Z_1_bias_9;
vector[N] Z_1_bias_10;
int NC_1;
int dec[N];  // decisions
int prior_only;  // should the likelihood be ignored?
}
transformed data {
real min_Y = min(Y);
}
parameters {
vector[K] b;  // population-level effects
vector[K_bs] b_bs;  // population-level effects
vector[K_ndt] b_ndt;  // population-level effects
vector[K_bias] b_bias;  // population-level effects
vector[M_1] sd_1;  // group-level standard deviations
matrix[M_1, N_1] z_1;  // unscaled group-level effects
// cholesky factor of correlation matrix
cholesky_factor_corr[M_1] L_1;
}
transformed parameters {
// group-level effects
matrix[N_1, M_1] r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';
vector[N_1] r_1_1 = r_1[, 1];
vector[N_1] r_1_2 = r_1[, 2];
vector[N_1] r_1_3 = r_1[, 3];
vector[N_1] r_1_4 = r_1[, 4];
vector[N_1] r_1_bs_5 = r_1[, 5];
vector[N_1] r_1_bs_6 = r_1[, 6];
vector[N_1] r_1_ndt_7 = r_1[, 7];
vector[N_1] r_1_ndt_8 = r_1[, 8];
vector[N_1] r_1_bias_9 = r_1[, 9];
vector[N_1] r_1_bias_10 = r_1[, 10];
}
model {
vector[N] mu = X * b;
vector[N] bs = X_bs * b_bs;
vector[N] ndt = X_ndt * b_ndt;
vector[N] bias = X_bias * b_bias;
for (n in 1:N) {
mu[n] = mu[n] + (r_1_1[J_1[n]]) * Z_1_1[n] + (r_1_2[J_1[n]]) * Z_1_2[n] + (r_1_3[J_1[n]]) * Z_1_3[n] + (r_1_4[J_1[n]]) * Z_1_4[n];
bs[n] = bs[n] + (r_1_bs_5[J_1[n]]) * Z_1_bs_5[n] + (r_1_bs_6[J_1[n]]) * Z_1_bs_6[n];
ndt[n] = ndt[n] + (r_1_ndt_7[J_1[n]]) * Z_1_ndt_7[n] + (r_1_ndt_8[J_1[n]]) * Z_1_ndt_8[n];
bias[n] = bias[n] + (r_1_bias_9[J_1[n]]) * Z_1_bias_9[n] + (r_1_bias_10[J_1[n]]) * Z_1_bias_10[n];
}
// priors including all constants
target += cauchy_lpdf(b | 0, 5);
target += normal_lpdf(b_bs | 1.5, 1);
target += normal_lpdf(b_ndt | 0.2, 0.1);
target += normal_lpdf(b_bias | 0.5, 0.2);
target += student_t_lpdf(sd_1 | 3, 0, 10)
- 10 * student_t_lccdf(0 | 3, 0, 10);
target += lkj_corr_cholesky_lpdf(L_1 | 1);
target += normal_lpdf(to_vector(z_1) | 0, 1);
// likelihood including all constants
if (!prior_only) {
for (n in 1:N) {
target += wiener_diffusion_lpdf(Y[n] | dec[n], bs[n], ndt[n], bias[n], mu[n]);
}
}
}
generated quantities {
corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
vector[NC_1] cor_1;
// take only relevant parts of correlation matrix
cor_1[1] = Cor_1[1,2];
[...]
cor_1[45] = Cor_1[9,10];
}

[The output was slightly modified.]

The last piece we need, before we can finally estimate the model, is a function that generates initial values. Without initial values that lead to an identifiable model for all data points, estimation will not start. The function needs to provide initial values for all parameters listed in the parameters block of the model. Note that many of those parameters have at least one dimension with a parameterized extent (e.g., K). We can use make_standata and create the data set used by brms for the estimation for obtaining the necessary information. We then use this data object (i.e., a list) for generating the correctly sized initial values in function initfun (note that initfun relies on the fact that tmp_dat is in the global environment which is something of a code smell).

tmp_dat <- make_standata(formula,
data = speed_acc, prior = prior)
str(tmp_dat, 1, give.attr = FALSE)
## List of 26
##  $N : int 10462 ##$ Y          : num [1:10462(1d)] 0.773 0.39 0.435  ...
##  $K : int 4 ##$ X          : num [1:10462, 1:4] 0 0 0 0 0 0 0 0 0 0 ...
##  $Z_1_1 : num [1:10462(1d)] 0 0 0 0 0 0 0 0 0 0 ... ##$ Z_1_2      : num [1:10462(1d)] 0 1 1 1 1 1 0 1 1 0 ...
##  $Z_1_3 : num [1:10462(1d)] 0 0 0 0 0 0 0 0 0 0 ... ##$ Z_1_4      : num [1:10462(1d)] 1 0 0 0 0 0 1 0 0 1 ...
##  $K_bs : int 2 ##$ X_bs       : num [1:10462, 1:2] 0 0 0 0 0 0 0 0 0 0 ...
##  $Z_1_bs_5 : num [1:10462(1d)] 0 0 0 0 0 0 0 0 0 0 ... ##$ Z_1_bs_6   : num [1:10462(1d)] 1 1 1 1 1 1 1 1 1 1 ...
##  $K_ndt : int 2 ##$ X_ndt      : num [1:10462, 1:2] 0 0 0 0 0 0 0 0 0 0 ...
##  $Z_1_ndt_7 : num [1:10462(1d)] 0 0 0 0 0 0 0 0 0 0 ... ##$ Z_1_ndt_8  : num [1:10462(1d)] 1 1 1 1 1 1 1 1 1 1 ...
##  $K_bias : int 2 ##$ X_bias     : num [1:10462, 1:2] 0 0 0 0 0 0 0 0 0 0 ...
##  $Z_1_bias_9 : num [1:10462(1d)] 0 0 0 0 0 0 0 0 0 0 ... ##$ Z_1_bias_10: num [1:10462(1d)] 1 1 1 1 1 1 1 1 1 1 ...
##  $J_1 : int [1:10462(1d)] 1 1 1 1 1 1 1 1 1 1 ... ##$ N_1        : int 17
##  $M_1 : int 10 ##$ NC_1       : num 45
##  $dec : num [1:10462(1d)] 1 0 0 0 0 0 0 0 0 0 ... ##$ prior_only : int 0

initfun <- function() {
list(
b = rnorm(tmp_dat$K), b_bs = runif(tmp_dat$K_bs, 1, 2),
b_ndt = runif(tmp_dat$K_ndt, 0.1, 0.15), b_bias = rnorm(tmp_dat$K_bias, 0.5, 0.1),
sd_1 = runif(tmp_dat$M_1, 0.5, 1), z_1 = matrix(rnorm(tmp_dat$M_1*tmp_dat$N_1, 0, 0.01), tmp_dat$M_1, tmp_dat$N_1), L_1 = diag(tmp_dat$M_1)
)
}

Finally, we have all pieces together and can estimate the Wiener model using the brm function. Note that this will take roughly a full day, depending on the speed of your PC also longer. We also already increase the maximal treedepth to 15. We probably should have also increased adapt_delta above the default value of .8 as there are a few divergent transitions, but this is left as an exercise to the reader.

After estimation is finished, we see that there are a few (< 10) divergent transitions. If this were a real analysis and not only an example, we would need to increase adapt_delta to a larger value (e.g., .95 or .99) and rerun the estimation. In this case however, we immediately begin with the second step and obtain samples from the posterior predictive distribution using predict. For this it is important to specify the number of posterior samples (here we use 500). In addition, it is important to set summary = FALSE, for obtaining the actual posterior predictive distribution and not a summary of the posterior predictive distribution, and negative_rt = TRUE. The latter ensures that predicted responses to the lower boundary receive a negative sign whereas predicted responses to the upper boundary receive a positive sign.

fit_wiener <- brm(formula,
data = speed_acc,
prior = prior, inits = initfun,
iter = 1000, warmup = 500,
chains = 4, cores = 4,
control = list(max_treedepth = 15))
NPRED <- 500
pred_wiener <- predict(fit_wiener,
summary = FALSE,
negative_rt = TRUE,
nsamples = NPRED)

Because both steps are quite time intensive (estimation 1 day, obtaining posterior predictives a few hours), we save the results of both steps. Given the comparatively large size of both objects, using the 'xz' compression (i.e., the strongest in R) seems like a good idea.

save(fit_wiener, file = "brms_wiener_example_fit.rda",
compress = "xz")
save(pred_wiener, file = "brms_wiener_example_predictions.rda",
compress = "xz")

The next part will show how to perform model diagnostics and how to asses the model fit.