April 5, 2019
By

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Suppose you have a bunch of customers who make repeat purchases – some more frequenty, some less. There are a few things you might like to know about these customers, such as

• which customers are still active (i.e. not yet churned) and likely to continue purchasing from you?; and
• how many purchases can you expect from each customer?

Modelling this directly is more difficult than it might seem at first. A customer that regularly makes purchases every day might be considered at risk of churning if they haven’t purchased anything in the past week, whereas a customer that regularly puchases once per month would not be considered at risk of churning. That is, churn and frequency of purchasing are closely related. The difficulty is that we don’t observe the moment of churn of any customer and have to model it probabilistically.

There are a number of established models for estimating this, the most well-known perhaps being the SMC model (a.k.a pareto-nbd model). There are already some implementations using maximum likelihood or Gibbs sampling. In this post, we’ll explain how the model works, make some prior predictive simulations, and fit a version implemented in Stan.

## Data Generating Process

Let’s describe the model first by simulation. Suppose we have a company that is 2 years old and a total of 2000 customers, $$C$$, that have made at least one purchase from us. We’ll assume a linear rate of customer acquisition, so that the first purchase date is simply a uniform random variable over the 2 years of the company existance. These assumptions are just to keep the example concrete, and are not so important for understanding the model.

customers <- tibble(id = 1:1000) %>%
mutate(
end = 2 * 365,
start = runif(n(), 0, end - 1),
T = end - start
)

The $$T$$-variable is the total observation time, counted from the date of first joining to the present day.

First the likelihood. Each customer $$c \in C$$ is assumed to have a certain lifetime, $$\tau_c$$, starting on their join-date. During their lifetime, they will purchase at a constant rate, $$\lambda_c$$, so that they will make $$k \sim \dpois(t\lambda_c)$$ purchases over a time-interval $$t$$. Once their lifetime is over, they will stop purchasing. We only observe the customer for $$T_c$$ units of time, and this observation time can be either larger or smaller than the lifetime, $$\tau_c$$. Since we don’t observe $$\tau_c$$ itself, we will assume it follows an exponential distribution, i.e. $$\tau_c \sim \dexp(\mu_c)$$.

The following function generates possible observations given $$\mu$$ and $$\lambda$$.

sample_conditional <- function(mu, lambda, T) {

tau <- rexp(1, mu)

t <- 0
k <- 0

# simulate time till next purchase
wait <- rexp(1, lambda)

# keep purchasing till end of life/observation time
while(t + wait <= pmin(T, tau)) {
t <- t + wait
k <- k + 1
wait <- rexp(1, lambda)
}

# return tabular data
tibble(
mu = mu,
lambda = lambda,
T = T,
tau = tau,
k = k,
t = t
)
}

s <- sample_conditional(0.01, 1, 30) 
Example output from sample_conditional
mu lambda T tau k t
0.01 1 30 49.63373 39 29.21926

Now the priors. Typically, $$\mu$$ and $$\lambda$$ are given gamma priors, which we’ll use too. However, the expected mean lifetime $$\mathbb E (\tau) = \frac{1}{\mu}$$ is easier to reason about than $$\mu$$, so we’ll put an inverse gamma distribution on $$\frac{1}{\mu}$$. The reciprocal of an inverse gamma distribution has a gamma distribution, so $$\mu$$ will still end up with a gamma distribution.

The mean expected lifetime in our simulated example will be ~2 months, with a standard deviation of 30. The mean purchase rate will be once a fortnight, with a standard deviation around 0.05.

set.seed(2017896)

etau_mean <- 60
etau_variance <- 30^2
etau_beta <- etau_mean^3 / etau_variance + etau_mean
etau_alpha <- etau_mean^2 / etau_variance + 2

lambda_mean <- 1 / 14
lambda_variance <- 0.05^2
lambda_beta <- lambda_mean / lambda_variance
lambda_alpha <- lambda_mean * lambda_beta

df <- customers %>%
mutate(
etau = rinvgamma(n(), etau_alpha, etau_beta),
mu = 1 / etau,
lambda = rgamma(n(), lambda_alpha, lambda_beta)
) %>%
group_by(id) %>%
group_map(~sample_conditional(.$mu, .$lambda, .\$T)) 
Sample of customers and their properties
id mu lambda T tau k t
1 0.0241091 0.2108978 295.3119 32.2814622 6 29.46052
2 0.0122084 0.0135551 673.2100 11.5250690 0 0.00000
3 0.0032994 0.0789800 357.1805 4.7921238 0 0.00000
4 0.0227431 0.0980176 270.0511 141.4766791 10 125.60765
5 0.0270742 0.0429184 608.9049 5.7293256 0 0.00000
6 0.0208168 0.0661296 666.1305 0.9481004 0 0.00000

The lifetimes are mostly under 3 months, but also allow some more extreme values up to around a year.

The purchase rates are mostly around once a fortnight, but there are also rates as high as 4 purchases per week and ras low as one per quarter.

## Likelihood

The likelihood is somewhat complicated, so we’ll derive a more concise expression for it. Knowing the lifetime simplifies the probabilities, so we’ll marginalise the liklihood over $$\tau$$.

\begin{align} \mathbb P (k, t \mid \mu, \lambda) &= \int_{\tau = t}^\infty \mathbb P (k, t \mid \mu, \lambda, \tau) \cdot \mathbb P(\tau \mid \mu, \lambda) d\tau \\ &= \int_{\tau = t}^T \mathbb P (k, t \mid \mu, \lambda, \tau) \cdot \mathbb P(\tau \mid \mu, \lambda) + \int_{\tau = T}^\infty \mathbb P (k, t \mid \mu, \lambda, \tau) \cdot \mathbb P(\tau \mid \mu, \lambda) \\ &= \int_{\tau = t}^T \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (\tau-t)\lambda) \cdot \dexp(\tau \mid \mu) d\tau \\ &\hphantom{=} + \int_{\tau = T}^\infty \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (T-t)\lambda) \cdot \dexp(\tau \mid \mu) d\tau \end{align}

The right-hand side is straight forward. The Poisson probabilities can be pulled out of the integral since they are independent of $$\tau$$, turning the remaining integral into the survival function of the exponential distribution.

\begin{align} \text{RHS} &= \int_{\tau = T}^\infty \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (\tau – t)\lambda) \cdot\dexp(\tau \mid \mu) d\tau \\ &= \frac{(t\lambda)^k e^{-t\lambda}}{k!} e^{-(T-t)\lambda}\int_T^\infty \cdot\dexp(\tau \mid \mu) d\tau \\ &= \frac{(t\lambda)^k e^{-T\lambda}}{k!} e^{-T\mu} \\ &= \frac{(t\lambda)^k e^{-T(\lambda + \mu)}}{k!} \end{align}

The left-hand side is a little more involved.

\begin{align} \text{LHS} &= \int_{\tau = t}^T \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (\tau-t)\lambda) \cdot \dexp(\tau \mid \mu) d\tau \\ &= \frac{(t\lambda)^k e^{-t\lambda} }{k!} \int_t^T e^{-(\tau – t)\lambda} \mu e^{-\tau\mu} d\tau \\ &= \frac{(t\lambda)^k e^{-t\lambda} }{k!} e^{t\lambda} \mu \int_t^T e^{-\tau(\lambda + \mu)} d\tau \\ &= \frac{(t\lambda)^k }{k!} \mu \left. \frac{ e^{-\tau(\lambda + \mu)}}{-(\lambda + \mu)} \right|_t^T \\ &= \frac{(t\lambda)^k }{k!} \mu \frac{ e^{-t(\lambda + \mu)} – e^{-T(\lambda + \mu)}}{\lambda + \mu} \end{align}

Adding both expressions gives our final expression for the likelihood

\begin{align} \mathbb P (k, t \mid \mu, \lambda) &= \frac{(t\lambda)^k e^{-T(\lambda + \mu)}}{k!} + \frac{(t\lambda)^k }{k!} \mu \frac{ e^{-t(\lambda + \mu)} – e^{-T(\lambda + \mu)}}{\lambda + \mu} \\ &\propto \lambda^k e^{-T(\lambda + \mu)} + \lambda^k \mu \frac{ e^{-t(\lambda + \mu)} – e^{-T(\lambda + \mu)}}{\lambda + \mu} \\ &= \frac{\lambda^k}{\lambda + \mu} \left( \mu e^{-t(\lambda + \mu)} – \mu e^{-T(\lambda + \mu)} + \mu e^{-T(\lambda + \mu)} + \lambda e^{-T(\lambda + \mu)} \right) \\ &= \frac{\lambda^k}{\lambda + \mu} \left( \mu e^{-t(\lambda + \mu)} + \lambda e^{-T(\lambda + \mu)} \right) , \end{align}

where we dropped any factors independent of the parameters, $$\lambda, \mu$$. This expression agrees with equation 2 in ML07.

Another way to view this likelihood is as a mixture of censored observations, but where the mixture probability $$p(\mu, \lambda) := \frac{\mu}{\lambda + \mu}$$ depends on the parameters. We can write this alternative interpretation as

\begin{align} \mathbb P(k, t \mid \mu, \lambda) &\propto p \dpois(k \mid t\lambda)S(t \mid \mu) \\ &\hphantom{\propto}+ (1 – p) \dpois(k \mid t\lambda)\dpois(0 \mid (T-t)\lambda)S(T \mid \mu) , \end{align}

where $$S$$ is the survival function of the exponential distribution. In other words, either we censor at $$t$$ with probability $$p$$, or we censor at $$T$$ with probability $$(1 – p)$$. Note that

• either decreasing the expected lifetime (i.e. increasing $$\mu$$) or decreasing the purchase rate increases $$p$$;
• if $$t \approx T$$, then the censored distributions are approximately equal. The smaller $$\lambda$$ is, the closer the approximation has to be for this to hold.

To implement this in stan, we’ll need the log-likelihood, which is given by

$\log\mathbb P (k, t \mid \mu, \lambda) = k \log\lambda – \log(\lambda + \mu) + \log\left(\mu e^{-t(\lambda + \mu)} + \lambda e^{-T(\lambda + \mu)} \right) .$

## Stan implementation

Let’s take a look at our Stan implementation. Note that Stan uses the log-likelihood, and we can increment it by incrementing the target variable. We have also used the log_sum_exp for numeric stability, where $$\text{log_sum_exp}(x, y) := \log(e^x + e^y)$$.

pnb <- here('models/pnbd.stan') %>%
stan_model() 
S4 class stanmodel 'pnbd' coded as follows:
data {
int n;       // number of customers
vector[n] t; // time to most recent purchase
vector[n] T; // total observation time
vector[n] k; // number of purchases observed

// user-specified parameters
real etau_alpha;
real etau_beta;
real lambda_alpha;
real lambda_beta;
}

parameters {
vector[n] lambda; // purchase rate
vector[n] etau;   // expected mean lifetime
}

transformed parameters {
vector[n] mu = 1.0 ./ etau;
}

model {
// priors
etau ~ inv_gamma(etau_alpha, etau_beta);
lambda ~ gamma(lambda_alpha, lambda_beta);

// likelihood
target += k .* log(lambda) - log(lambda + mu);
for (i in 1:n) {
target += log_sum_exp(
log(lambda[i]) - (lambda[i] + mu[i]) .* T[i],
log(mu[i]) - (lambda[i] + mu[i]) .* t[i]
);
}
} 

Let’s fit the model to our simulated data, using the correct priors.

pnb_fit <- rstan::sampling(
pnb,
data = compose_data(
df,
etau_alpha = etau_alpha,
etau_beta = etau_beta,
lambda_alpha = lambda_alpha,
lambda_beta = lambda_beta
),
control = list(max_treedepth = 15),
chains = 4,
cores = 4,
warmup = 1000,
iter = 3000
) 

Using the default max_treedepth of 10 shows problems with the energy diagnostic, with the etau parameters seemingly most problematic. However, increasing it to 15 resolved these issues.

pnb_fit %>%
check_hmc_diagnostics()
Divergences:

0 of 8000 iterations ended with a divergence.

Tree depth:

0 of 8000 iterations saturated the maximum tree depth of 15.

Energy:

E-BFMI indicated no pathological behavior.

There are also no problems with the effective sample sizes, although etau typically has the lowest.

pnb_neff <- pnb_fit %>%
neff_ratio() %>%
tibble(
ratio = .,
parameter = names(.)
) %>%
arrange(ratio) %>%
head(5) 
Parameters with the lowest effective sample size
ratio parameter
0.3877002 lp__
0.4838375 etau
0.5157888 etau
0.5722499 etau
0.5803245 etau

The rhat statistic also looks good.

pnb_rhat <- pnb_fit %>%
rhat() %>%
tibble(
rhat = .,
parameter = names(.)
) %>%
summarise(min(rhat), max(rhat)) 
The most extreme rhat values
min(rhat) max(rhat)
0.9995222 1.001541

Around 50% of our 50% posterior intervals contain the true value, which is a good sign.

calibration <- pnb_fit %>%
mean_qi(.width = 0.5) %>%
inner_join(df, by = 'id') %>%
summarise(
mu = mean(mu.lower <= mu.y & mu.y <= mu.upper),
lambda = mean(lambda.lower <= lambda.y & lambda.y <= lambda.upper)
) %>%
gather(parameter, fraction) 
Fraction of 50% posterior intervals containing the true value. These should be close to 50%.
parameter fraction
mu 0.495
lambda 0.498

## Discussion

We described the data generating process behind the Pareto-NBD model, implemented a model in Stan using our derivation of the likelihood, and fit the model to simulated data. The diagnostics didn’t indicate any convergence problems, and around 50% of the 50% posterior intervals contained the true parameter values. However, we used our knowledge of the prior distribution to fit the model. It would be better to use a hierarchical prior to relax this requirement.

As a next step, it would be interesting to extend the model to

• estimate spend per purchase;
• use hierarchical priors on $$\mu$$ and $$\lambda$$;
• allow correlation between $$\mu$$ and $$\lambda$$; and
• allow covariates, such as cohorts.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.