**Drunks&Lampposts » R**, and kindly contributed to R-bloggers)

Here’s a post that appears on my new website, ragscripts.com. On-line resources for analysts are often either too general to be of practical use or too specialised to be accessible. The aim of ragscripts.com is to remedy this by providing start to finish directions for complex analytical tasks. The site is under construction at the moment but any feedback is most welcome!

### The problem

You’ve been asked to calculate some figure or other (e.g. end of year revenue, average customer lifetime value) based on numbers supplied from various parts of the business. You know how to make the calculation but what bothers you is that some of the figures going in are clearly approximations or at worst complete guesses. Therefore the final calculation will inherit this uncertainty. A decision will be made based on your final figure and more likely than not the decision makers will treat this number as a certainty. They won’t have any way of knowing how uncertain it is and may well draw the wrong conclusions. How do you get an understanding of the uncertainty around the final calculation and how do you explain it to the decision makers?

The standard approach in finance is to produce a set of best and worst case scenarios. These are tedious to produce and are often disregarded as unlikely outcomes. I think our approach is better as it gives a full picture of the uncertainty around an estimate.

### What you’ll need

- Basic R programming skills (including writing functions)
- An understanding of probability distributions
- About 1-3 hours (depending on the complexity of the model)

### Example

Let’s say you are asked to project the revenue of a business in one year’s time based on the following numbers:

- Customer base at the beginning of the year: 50,000
- Average revenue per customer: £7
- Monthly acquisition rate: 5%
- Monthly churn rate: 2%

You might have thought that these figures could be known with some certainty but believe me it’s actually very rare especially in large businesses where integrating systems to get an overall picture is a tough job. Add to this the fact that we are guessing at how well any current figures are going to hold out over the year and we should agree than there’s a fair bit of uncertainty in the inputs.

With this is mind you ask the suppliers of the data for a lower and upper bound for the figures, i.e. the lowest and highest numbers that they would expect to see without being very surprised. Where they are unable or unwilling to do this you use your common sense. This gives you the following table:

Statistic | Estimate | Lower Bound | Upper Bound |

ARPC | £7 | £4 | £10 |

Acquisition Rate | 5% | 3% | 7% |

Churn Rate | 2% | 1% | 3% |

### Do it

##### Step 1: Setting up the distributions

There’s lots of tools you could do this with but R seems most appropriate so start up a session. We are going to model the our uncertainty about the input values in each case as a normal distribution. We will say that the mean of the distribution is the value the business has estimated and that the 5% and 95% quantiles are the upper bounds and lower bounds respectively.

Yes this is subjective (we could have chosen other values for the bounds) but the point of this exercise is not to make incredibly accurate statistical inferences rather it is to show the consequences that different levels of uncertainty in the inputs have on the output. You can play with these parameters if you wish and say things like * if we were this accurate with our inputs then we would be this accurate with our output *

Let’s start by calculating the parameters (mean, standard deviation) for the three normal distributions we’ll need.

Let be the 95th quantile. Then

. Therefore

#Work out the standard deviation for arpu arpu.sd<-3/1.645 #Plot the resulting distribution x<-seq(0, 15,by=0.5) d<-dnorm(x, 7, arpu.sd) plot(x, d, type='l')

This should give a plot that meets our expectations and it does:

We’ll complete this step by doing the same for acquisition and churn.

#Do the same for acquisition and churn acq.sd<-0.02/1.645 x<-seq(0, 0.2,by=0.001) d<-dnorm(x, 0.05, acq.sd) plot(x, d, type='l') ch.sd<-0.01/1.645 x<-seq(0, 0.2,by=0.001) d<-dnorm(x, 0.02, ch.sd) plot(x, d, type='l')

##### Step 2: A function for calculating revenue

The next piece you need is a function for calculating the output (the statistic of interest) from the inputs. This is what would usually be captured in the formulae of an excel spreadsheet.

For our example it looks like this:

revenue<-function(arpu, acq, ch){ num.cust<-50000 for (m in 1:12){ num.cust<-num.cust+acq*num.cust-ch*num.cust } return(num.cust*arpu) }

##### Step 3: Making random draws from our belief distributions

We will now generate 10 thousand values from our each of normally distributed input variables to create 10 thousand simulations. Each of these simulations represents a possible set of inputs to our model. For example one set might be

- Average revenue per customer: £7.50
- Monthly acquisition rate: 3%
- Monthly churn rate: 1%

However the way we’ve set things up unlikely values for each input are unlikely to be drawn from our distributions. In fact since 10 thousand is a pretty large number the frequency at which each of the possible values is drawn should fairly well reflect our initial beliefs.

Here is our code for simulating the input values:

#Now let's simulate 10k values from each of our distributions sim.arpu<-rnorm(10000, 7, arpu.sd) sim.acq<-rnorm(10000, 0.05, acq.sd) sim.ch<-rnorm(10000, 0.02, ch.sd)

Note a couple of things we haven’t accounted for: we’ve assumed that the our beliefs about our input variables are independent of one another but that might not be the case. We might think for some reason that if churn is higher then acquisition will be lower. If so we want to draw our values from a multivariate distribution incorporating correlation between the variables. However assuming independence is an easy and useful place to start. Also we’ve assumed that our model itself is correct – i.e. that constant rates of churn and acquisition apply and that there are no other effects (for example seasonal or economic.)

##### Step 4: Running them through the function

Now everything is place and it’s a simple matter to apply our function to each of the 10 thousand simulated sets of input variables. We use R’s mapply function.

sim.rev<-mapply(revenue, sim.arpu, sim.acq, sim.ch)

##### Step 5: Examining the results

Since our simulated input variables were distributed in a way that reflects our beliefs about their likely values, our final revenue calculations, based on these simulated inputs, will also have a distribution that reflect these uncertainties. The next few lines of code show us this distribution.

summary(sim.rev) hist(sim.rev) plot(density(sim.rev))

The histogram of the possible values for annual revenue makes the point of doing all of this quite clear. A straight calculation based on the original input values gives a revenue value of 499k. However a quick glance at the histogram shows that revenues as low as 400k are not at all improbable. Any planning should account for this!

### See it

Here’s a quick fiddle to see how inputs affect outputs in the example case. And here’s the source.

### Explain it

So how do you explain what you have done and why you have done it?

Chances are you may run into the analyst’s catch 22 (see skit). You’ve done a much better job because you’ve incorporated uncertainty into your forecast allowing the decision maker to make a more informed decision. However people like certainty and the very fact you have done this may be perceived as incompetence * What do you mean you don’t know what the revenue figure is going to be – go back and do it again! *

So either you compromise your own integrity by glossing over the uncertainty or you’re branded an idiot or at least a boffin who overcomplicates things!

But there are some ways around this. Try to find out what decisions are being made using the data. Some of the frustration will come from being handed something that doesn’t solve the problem and the decision maker may have a point: if the range of accuracy is not enough to make a decision the model may need to be improved in some way.

Also talking about the decision and the problem from the outset is a good way of driving home the value of what you are doing. The very purpose of showing the uncertainty in the estimate is to prevent misinformed decisions being made.

The decision in our example might be whether or not to we can invest in a new product now given that we still need to be in profit by the end of the year. Our model as it is shows that if all other outgoings come to 150k then we’d be taking a risk spending over 50k on the new product, (since there’s a reasonable chance that revenue could be as low as 200k). If 50k was what were planning then we’ve done the job. If however a new product can’t be done for less than 100k then it’s worth spending some time improving our model.

### Fork it

Take my code and adapt it for your own models. You can find the gist here

**leave a comment**for the author, please follow the link and comment on their blog:

**Drunks&Lampposts » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...