**Brian Callander**, and kindly contributed to R-bloggers)

# Pareto-NBD Customer Lifetime Value

Suppose you have a bunch of customers who make repeat purchases – some more frequenty, some less. There are a few things you might like to know about these customers, such as

- which customers are still active (i.e. not yet churned) and likely to continue purchasing from you?; and
- how many purchases can you expect from each customer?

Modelling this directly is more difficult than it might seem at first. A customer that regularly makes purchases every day might be considered at risk of churning if they haven’t purchased anything in the past week, whereas a customer that regularly puchases once per month would not be considered at risk of churning. That is, churn and frequency of purchasing are closely related. The difficulty is that we don’t observe the moment of churn of any customer and have to model it probabilistically.

There are a number of established models for estimating this, the most well-known perhaps being the SMC model (a.k.a pareto-nbd model). There are already some implementations using maximum likelihood or Gibbs sampling. In this post, we’ll explain how the model works, make some prior predictive simulations, and fit a version implemented in Stan.

## Data Generating Process

Let’s describe the model first by simulation. Suppose we have a company that is 2 years old and a total of 2000 customers, \(C\), that have made at least one purchase from us. We’ll assume a linear rate of customer acquisition, so that the first purchase date is simply a uniform random variable over the 2 years of the company existance. These assumptions are just to keep the example concrete, and are not so important for understanding the model.

```
customers <- tibble(id = 1:1000) %>%
mutate(
end = 2 * 365,
start = runif(n(), 0, end - 1),
T = end - start
)
```

The \(T\)-variable is the total observation time, counted from the date of first joining to the present day.

First the likelihood. Each customer \(c \in C\) is assumed to have a certain lifetime, \(\tau_c\), starting on their join-date. During their lifetime, they will purchase at a constant rate, \(\lambda_c\), so that they will make \(k \sim \dpois(t\lambda_c)\) purchases over a time-interval \(t\). Once their lifetime is over, they will stop purchasing. We only observe the customer for \(T_c\) units of time, and this observation time can be either larger or smaller than the lifetime, \(\tau_c\). Since we don’t observe \(\tau_c\) itself, we will assume it follows an exponential distribution, i.e. \(\tau_c \sim \dexp(\mu_c)\).

The following function generates possible observations given \(\mu\) and \(\lambda\).

```
sample_conditional <- function(mu, lambda, T) {
# lifetime
tau <- rexp(1, mu)
# start with 0 purchases
t <- 0
k <- 0
# simulate time till next purchase
wait <- rexp(1, lambda)
# keep purchasing till end of life/observation time
while(t + wait <= pmin(T, tau)) {
t <- t + wait
k <- k + 1
wait <- rexp(1, lambda)
}
# return tabular data
tibble(
mu = mu,
lambda = lambda,
T = T,
tau = tau,
k = k,
t = t
)
}
s <- sample_conditional(0.01, 1, 30)
```

mu | lambda | T | tau | k | t |
---|---|---|---|---|---|

0.01 | 1 | 30 | 49.63373 | 39 | 29.21926 |

Now the priors. Typically, \(\mu\) and \(\lambda\) are given gamma priors, which we’ll use too. However, the expected mean lifetime \(\mathbb E (\tau) = \frac{1}{\mu}\) is easier to reason about than \(\mu\), so we’ll put an inverse gamma distribution on \(\frac{1}{\mu}\). The reciprocal of an inverse gamma distribution has a gamma distribution, so \(\mu\) will still end up with a gamma distribution.

The mean expected lifetime in our simulated example will be ~2 months, with a standard deviation of 30. The mean purchase rate will be once a fortnight, with a standard deviation around 0.05.

```
set.seed(2017896)
etau_mean <- 60
etau_variance <- 30^2
etau_beta <- etau_mean^3 / etau_variance + etau_mean
etau_alpha <- etau_mean^2 / etau_variance + 2
lambda_mean <- 1 / 14
lambda_variance <- 0.05^2
lambda_beta <- lambda_mean / lambda_variance
lambda_alpha <- lambda_mean * lambda_beta
df <- customers %>%
mutate(
etau = rinvgamma(n(), etau_alpha, etau_beta),
mu = 1 / etau,
lambda = rgamma(n(), lambda_alpha, lambda_beta)
) %>%
group_by(id) %>%
group_map(~sample_conditional(.$mu, .$lambda, .$T))
```

id | mu | lambda | T | tau | k | t |
---|---|---|---|---|---|---|

1 | 0.0241091 | 0.2108978 | 295.3119 | 32.2814622 | 6 | 29.46052 |

2 | 0.0122084 | 0.0135551 | 673.2100 | 11.5250690 | 0 | 0.00000 |

3 | 0.0032994 | 0.0789800 | 357.1805 | 4.7921238 | 0 | 0.00000 |

4 | 0.0227431 | 0.0980176 | 270.0511 | 141.4766791 | 10 | 125.60765 |

5 | 0.0270742 | 0.0429184 | 608.9049 | 5.7293256 | 0 | 0.00000 |

6 | 0.0208168 | 0.0661296 | 666.1305 | 0.9481004 | 0 | 0.00000 |

The lifetimes are mostly under 3 months, but also allow some more extreme values up to around a year.

The purchase rates are mostly around once a fortnight, but there are also rates as high as 4 purchases per week and ras low as one per quarter.

## Likelihood

The likelihood is somewhat complicated, so we’ll derive a more concise expression for it. Knowing the lifetime simplifies the probabilities, so we’ll marginalise the liklihood over \(\tau\).

\[

\begin{align}

\mathbb P (k, t \mid \mu, \lambda)

&=

\int_{\tau = t}^\infty \mathbb P (k, t \mid \mu, \lambda, \tau) \cdot \mathbb P(\tau \mid \mu, \lambda) d\tau

\\

&=

\int_{\tau = t}^T \mathbb P (k, t \mid \mu, \lambda, \tau) \cdot \mathbb P(\tau \mid \mu, \lambda)

+

\int_{\tau = T}^\infty \mathbb P (k, t \mid \mu, \lambda, \tau) \cdot \mathbb P(\tau \mid \mu, \lambda)

\\

&=

\int_{\tau = t}^T \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (\tau-t)\lambda) \cdot \dexp(\tau \mid \mu) d\tau

\\

&\hphantom{=}

+

\int_{\tau = T}^\infty \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (T-t)\lambda) \cdot \dexp(\tau \mid \mu) d\tau

\end{align}

\]

The right-hand side is straight forward. The Poisson probabilities can be pulled out of the integral since they are independent of \(\tau\), turning the remaining integral into the survival function of the exponential distribution.

\[

\begin{align}

\text{RHS}

&=

\int_{\tau = T}^\infty \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (\tau – t)\lambda) \cdot\dexp(\tau \mid \mu) d\tau

\\

&=

\frac{(t\lambda)^k e^{-t\lambda}}{k!} e^{-(T-t)\lambda}\int_T^\infty \cdot\dexp(\tau \mid \mu) d\tau

\\

&=

\frac{(t\lambda)^k e^{-T\lambda}}{k!} e^{-T\mu}

\\

&=

\frac{(t\lambda)^k e^{-T(\lambda + \mu)}}{k!}

\end{align}

\]

The left-hand side is a little more involved.

\[

\begin{align}

\text{LHS}

&=

\int_{\tau = t}^T \dpois(k \mid t\lambda) \cdot \dpois(0 \mid (\tau-t)\lambda) \cdot \dexp(\tau \mid \mu) d\tau

\\

&=

\frac{(t\lambda)^k e^{-t\lambda} }{k!}

\int_t^T e^{-(\tau – t)\lambda} \mu e^{-\tau\mu} d\tau

\\

&=

\frac{(t\lambda)^k e^{-t\lambda} }{k!} e^{t\lambda} \mu

\int_t^T e^{-\tau(\lambda + \mu)} d\tau

\\

&=

\frac{(t\lambda)^k }{k!} \mu

\left.

\frac{ e^{-\tau(\lambda + \mu)}}{-(\lambda + \mu)} \right|_t^T

\\

&=

\frac{(t\lambda)^k }{k!} \mu

\frac{ e^{-t(\lambda + \mu)} – e^{-T(\lambda + \mu)}}{\lambda + \mu}

\end{align}

\]

Adding both expressions gives our final expression for the likelihood

\[

\begin{align}

\mathbb P (k, t \mid \mu, \lambda)

&=

\frac{(t\lambda)^k e^{-T(\lambda + \mu)}}{k!}

+

\frac{(t\lambda)^k }{k!} \mu

\frac{ e^{-t(\lambda + \mu)} – e^{-T(\lambda + \mu)}}{\lambda + \mu}

\\

&\propto

\lambda^k e^{-T(\lambda + \mu)}

+

\lambda^k \mu

\frac{ e^{-t(\lambda + \mu)} – e^{-T(\lambda + \mu)}}{\lambda + \mu}

\\

&=

\frac{\lambda^k}{\lambda + \mu}

\left( \mu e^{-t(\lambda + \mu)} – \mu e^{-T(\lambda + \mu)} + \mu e^{-T(\lambda + \mu)} + \lambda e^{-T(\lambda + \mu)} \right)

\\

&=

\frac{\lambda^k}{\lambda + \mu}

\left( \mu e^{-t(\lambda + \mu)} + \lambda e^{-T(\lambda + \mu)} \right)

,

\end{align}

\]

where we dropped any factors independent of the parameters, \(\lambda, \mu\). This expression agrees with equation 2 in ML07.

Another way to view this likelihood is as a mixture of censored observations, but where the mixture probability \(p(\mu, \lambda) := \frac{\mu}{\lambda + \mu}\) depends on the parameters. We can write this alternative interpretation as

\[

\begin{align}

\mathbb P(k, t \mid \mu, \lambda)

&\propto

p \dpois(k \mid t\lambda)S(t \mid \mu)

\\

&\hphantom{\propto}+ (1 – p) \dpois(k \mid t\lambda)\dpois(0 \mid (T-t)\lambda)S(T \mid \mu)

,

\end{align}

\]

where \(S\) is the survival function of the exponential distribution. In other words, either we censor at \(t\) with probability \(p\), or we censor at \(T\) with probability \((1 – p)\). Note that

- either decreasing the expected lifetime (i.e. increasing \(\mu\)) or decreasing the purchase rate increases \(p\);
- if \(t \approx T\), then the censored distributions are approximately equal. The smaller \(\lambda\) is, the closer the approximation has to be for this to hold.

To implement this in stan, we’ll need the log-likelihood, which is given by

\[

\log\mathbb P (k, t \mid \mu, \lambda)

=

k \log\lambda – \log(\lambda + \mu) + \log\left(\mu e^{-t(\lambda + \mu)} + \lambda e^{-T(\lambda + \mu)} \right)

.

\]

## Stan implementation

Let’s take a look at our Stan implementation. Note that Stan uses the log-likelihood, and we can increment it by incrementing the `target`

variable. We have also used the `log_sum_exp`

for numeric stability, where \(\text{log_sum_exp}(x, y) := \log(e^x + e^y)\).

```
pnb <- here('models/pnbd.stan') %>%
stan_model()
```

```
S4 class stanmodel 'pnbd' coded as follows:
data {
int
``` n; // number of customers
vector[n] t; // time to most recent purchase
vector[n] T; // total observation time
vector[n] k; // number of purchases observed
// user-specified parameters
real etau_alpha;
real etau_beta;
real lambda_alpha;
real lambda_beta;
}
parameters {
vector[n] lambda; // purchase rate
vector[n] etau; // expected mean lifetime
}
transformed parameters {
vector[n] mu = 1.0 ./ etau;
}
model {
// priors
etau ~ inv_gamma(etau_alpha, etau_beta);
lambda ~ gamma(lambda_alpha, lambda_beta);
// likelihood
target += k .* log(lambda) - log(lambda + mu);
for (i in 1:n) {
target += log_sum_exp(
log(lambda[i]) - (lambda[i] + mu[i]) .* T[i],
log(mu[i]) - (lambda[i] + mu[i]) .* t[i]
);
}
}

Let’s fit the model to our simulated data, using the correct priors.

```
pnb_fit <- rstan::sampling(
pnb,
data = compose_data(
df,
etau_alpha = etau_alpha,
etau_beta = etau_beta,
lambda_alpha = lambda_alpha,
lambda_beta = lambda_beta
),
control = list(max_treedepth = 15),
chains = 4,
cores = 4,
warmup = 1000,
iter = 3000
)
```

Using the default `max_treedepth`

of 10 shows problems with the energy diagnostic, with the `etau`

parameters seemingly most problematic. However, increasing it to 15 resolved these issues.

```
pnb_fit %>%
check_hmc_diagnostics()
```

```
Divergences:
0 of 8000 iterations ended with a divergence.
Tree depth:
0 of 8000 iterations saturated the maximum tree depth of 15.
Energy:
E-BFMI indicated no pathological behavior.
```

There are also no problems with the effective sample sizes, although `etau`

typically has the lowest.

```
pnb_neff <- pnb_fit %>%
neff_ratio() %>%
tibble(
ratio = .,
parameter = names(.)
) %>%
arrange(ratio) %>%
head(5)
```

ratio | parameter |
---|---|

0.3877002 | lp__ |

0.4838375 | etau[716] |

0.5157888 | etau[442] |

0.5722499 | etau[367] |

0.5803245 | etau[443] |

The rhat statistic also looks good.

```
pnb_rhat <- pnb_fit %>%
rhat() %>%
tibble(
rhat = .,
parameter = names(.)
) %>%
summarise(min(rhat), max(rhat))
```

min(rhat) | max(rhat) |
---|---|

0.9995222 | 1.001541 |

Around 50% of our 50% posterior intervals contain the true value, which is a good sign.

```
calibration <- pnb_fit %>%
spread_draws(mu[id], lambda[id]) %>%
mean_qi(.width = 0.5) %>%
inner_join(df, by = 'id') %>%
summarise(
mu = mean(mu.lower <= mu.y & mu.y <= mu.upper),
lambda = mean(lambda.lower <= lambda.y & lambda.y <= lambda.upper)
) %>%
gather(parameter, fraction)
```

parameter | fraction |
---|---|

mu | 0.495 |

lambda | 0.498 |

## Discussion

We described the data generating process behind the Pareto-NBD model, implemented a model in Stan using our derivation of the likelihood, and fit the model to simulated data. The diagnostics didn’t indicate any convergence problems, and around 50% of the 50% posterior intervals contained the true parameter values. However, we used our knowledge of the prior distribution to fit the model. It would be better to use a hierarchical prior to relax this requirement.

As a next step, it would be interesting to extend the model to

- estimate spend per purchase;
- use hierarchical priors on \(\mu\) and \(\lambda\);
- allow correlation between \(\mu\) and \(\lambda\); and
- allow covariates, such as cohorts.

**leave a comment**for the author, please follow the link and comment on their blog:

**Brian Callander**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...