[This article was first published on Having Fun and Creating Value With the R Language on Lucid Manager, and kindly contributed to R-bloggers.]

Estimating the cost of a complex project is not a trivial task. Traditional cost estimates are full of assumptions about the future state of the market and the final deliverable. Monte Carlo cost estimates are a tool to better understand the risks in your project and enable better cost control. Monte Carlo simulations are a technique to control your “known unknowns”.

Albert Einstein famously said that “God does not play dice”. While this might or might not be the case, engineers definitely do and embrace the stochastic nature of reality to predict the future. While I have a pet hate against using matrices to manage risk, Monte Carlo simulations are an analytical method for dealing with uncertainty and risk. This article explains the principles of Monte Carlo cost estimates and how to implement them in the R language for statistical computing.

## The principles of cost estimates

The basic principle of cost estimation is deceptively simple. To estimate the cost of each item in a project, multiply the quantity of work $Q_i$ by the rate $R_i$ you will pay for each unit of work. Sum the cost of each item, and you have the total project cost $P$ for a project with $j$ items:

$$P = \sum_{i=1}^j Q_i R_i$$
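As a minimal sketch of this sum in R, assume three work items with made-up quantities and rates. The item names and all numbers below are hypothetical and only illustrate the arithmetic:

```r
# Hypothetical work items; quantities and unit rates are illustrative only
items <- data.frame(
  item     = c("Materials", "Excavation", "Pipe laying"),
  quantity = c(1000, 500, 1000),  # Q_i, units of work
  rate     = c(900, 90, 180)      # R_i, cost per unit of work
)

# Cost per item (Q_i * R_i) and total project cost P
items$cost   <- items$quantity * items$rate
project_cost <- sum(items$cost)
project_cost
```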

The reality is, unfortunately, a bit more inconsistent than this equation can express. Determining the correct quantity and the rate for each item is a fine art that requires knowledge of both engineering and economics.

Cost estimates reflect the state of knowledge at the time of developing them. Many aspects of the project are not known and might only surface once we start digging. As the late Donald Rumsfeld philosophically said:

… there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say, we know that there are some things that we do not know. But there are also unknown unknowns — the ones we don’t know, we don’t know.

### Deterministic

The basic method is helpful, but it assumes that we accurately know both the amount of work and the unit rate. Both parameters are subject to uncertainty and risk because the project has not yet been realised. Only at the end of the project do we know precisely how much it costs.

In deterministic cost estimation, this uncertainty is accounted for by adding a contingency to the estimate. This contingency is simply a percentage of the total. The less information is known about the project (usually at the early stages of development), the higher the percentage.
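The contingency calculation itself is a one-liner. The sketch below assumes a hypothetical base estimate and a 20% contingency rate; neither number comes from a real project:

```r
# Hypothetical base estimate and contingency rate; both are assumptions
base_estimate    <- 1125000
contingency_rate <- 0.20  # a rule-of-thumb percentage, not a measured value

# The contingency is a flat percentage of the total
contingency <- base_estimate * contingency_rate
budget      <- base_estimate + contingency
budget
```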

In the planning phases, 20% or even higher might be suitable, while uncertainty can be much lower after tender award. This uncertainty covers any risk that quantities or rates are higher and includes unforeseen activities that were not costed. The table below shows a typical example of a deterministic cost estimate.

A deterministic method is a blunt tool. Uncertainties are not estimated but based on rules of thumb. In this particular example, we have no way of knowing which part of the project contributes most to the uncertainty or the likelihood that we can stay within budget. Is the project manager simply inflating the budget to reduce the risk of spending more? What is the probability that the project will cost more than the estimated amount? We cannot answer these questions with the deterministic method.

### Probabilistic

Better understanding the financial risk in a project leads to better decisions. Probabilistic cost estimation methods review the uncertainty of each item rather than a percentage on top of the total. Breaking the uncertainty into smaller chunks shows project managers where the financial risk in their projects sits. Assessing smaller aspects of a project also reduces the risk of significant estimating errors.

Both the quantity of work and the rate we pay vary with local conditions, external factors, market rates and so on. We thus need to introduce an error rate $\epsilon_i$ for each item to account for the uncertainty in the estimate. Uncertainty relates to the inaccuracy of each cost item due to a lack of information. Thus, our formula for the cost of a project with $j$ items now becomes:

$$P = \sum_{i=1}^j Q_i R_i \epsilon_i$$
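A minimal sketch of this formula in R follows. It reads $\epsilon_i$ as a multiplier of one plus the relative uncertainty of the item, which is an assumption about how the error rate is applied. The items and percentages are hypothetical:

```r
# Hypothetical items; the uncertainty percentages are illustrative assumptions
items <- data.frame(
  item        = c("Materials", "Excavation", "Pipe laying"),
  quantity    = c(1000, 500, 1000),
  rate        = c(900, 90, 180),
  uncertainty = c(0.05, 0.30, 0.15)  # relative uncertainty per item
)

# Treat epsilon_i as a multiplier: 1 + relative uncertainty (an assumption)
items$epsilon <- 1 + items$uncertainty
items$cost    <- items$quantity * items$rate * items$epsilon

# Allowance per item and its share of the total allowance
items$allowance <- items$quantity * items$rate * items$uncertainty
items$share     <- items$allowance / sum(items$allowance)

items[order(-items$share), c("item", "uncertainty", "cost", "share")]
sum(items$cost)
```

Sorting by share shows which items drive the total allowance, which need not be the items with the highest relative uncertainty.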

The most basic way to implement this concept is to assign a contingency to each item in the estimate and nominate the risk events, in something like this:

Note that we now call it uncertainty instead of contingency. This is more than a semantic difference as we have estimated our uncertainty at the level of the items in the Work-Breakdown-Structure.

This approach provides some more intelligence about where the risks in the project really are, which can help the team place efforts where they are most needed. The fact that the estimate is lower is less interesting than what we can gather from looking at the detail. If this were my project, I would try to get more information about the events with the highest relative uncertainty to lower the risk.

## Monte Carlo Cost Estimate

The probabilistic approach in the table above is an acceptable way to assign uncertainties and risks, but it does not give us any insight into how likely this estimate is to eventuate.

A Monte Carlo Simulation can further refine the cost estimate by digging deeper into the uncertainty. Monte Carlo simulations are a class of methods that use large volumes of randomised numbers within known distributions to simulate reality. They can be used for any situation where deterministic methods fail, such as modelling contact centre call traffic.

To add more information to the estimate, we assign a low, likely and high cost to each item. The low cost ($a$) will rarely be achieved and is your ‘bargain-basement’ assessment. The likely cost ($c$) is what you would typically use in your estimate. Finally, the high cost ($b$) is what you will pay when everything you can think of goes wrong. Statistically, each cost item now has a triangular probability distribution, visualised below.

Our cost estimate would now look something like this:

| Item | Low | Medium | High |
|---|---:|---:|---:|
| Materials | 900000 | 909000 | 950000 |
| Excavation | 33000 | 49500 | 50000 |
| Pipe laying | 156000 | 186000 | 200000 |

For each item we can calculate the average or any percentile. The average cost of an item is $\frac{a+b+c}{3}$. The cost $x_p$ that will not be exceeded with probability $p$ is the quantile of the distribution for that item, given by:

$$x_p = \begin{cases} a + \sqrt{(b-a)(c-a)p} & \text{for } 0 \leq p \leq F(c) \\ b - \sqrt{(b-a)(b-c)(1-p)} & \text{for } F(c) \leq p \leq 1 \end{cases}$$

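Where $F(c)$ is the cumulative distribution evaluated at the likely cost $c$. For $a \leq x \leq c$ the cumulative distribution is $F(x) = \frac{(x-a)^2}{(b-a)(c-a)}$, so $F(c) = \frac{c-a}{b-a}$.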

Note that these formulas only work when $a \leq c \leq b$ and $a < b$.
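The quantile formula translates directly into base R. The sketch below uses two hypothetical helper functions, `qtri()` and `mean_tri()`, which are not part of any package, and applies them to the Excavation item from the table:

```r
# Quantile function of a triangular distribution with
# low cost a, high cost b and likely cost c (requires a <= c <= b and a < b)
qtri <- function(p, a, b, c) {
  stopifnot(a <= c, c <= b, a < b, all(p >= 0 & p <= 1))
  Fc <- (c - a) / (b - a)  # cumulative probability at the likely cost
  ifelse(p <= Fc,
         a + sqrt((b - a) * (c - a) * p),
         b - sqrt((b - a) * (b - c) * (1 - p)))
}

# Average cost of the same distribution
mean_tri <- function(a, b, c) (a + b + c) / 3

# Excavation item: 5th, 50th and 95th percentile, then the average
qtri(c(0.05, 0.50, 0.95), a = 33000, b = 50000, c = 49500)
mean_tri(a = 33000, b = 50000, c = 49500)
```

Because the likely cost of this item sits close to the high cost, the median falls below the likely cost.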

The triangular distribution is the most common method to assess uncertainty. Other distributions are possible, but the principles remain the same. Be mindful, however, that many probability distributions, such as the normal distribution, range from minus to plus infinity, which is not a realistic assumption for cost estimates.
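One bounded alternative, named here only as an illustration and not used elsewhere in this article, is the PERT distribution: a beta distribution rescaled to the same low, likely and high costs. The helper `rpert()` below is a hypothetical base R sketch, applied to the Pipe laying item from the table:

```r
set.seed(1)  # fixed seed so the run is reproducible

# Hypothetical helper: PERT random numbers as a rescaled beta distribution
# with low cost a, high cost b and likely cost c
rpert <- function(n, a, b, c) {
  shape1 <- 1 + 4 * (c - a) / (b - a)
  shape2 <- 1 + 4 * (b - c) / (b - a)
  a + (b - a) * rbeta(n, shape1, shape2)
}

draws <- rpert(10000, a = 156000, b = 200000, c = 186000)

range(draws)  # all draws stay between the low and high cost
mean(draws)   # close to the theoretical mean (a + 4c + b) / 6
```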

If your cost estimate contains only one item, you are done because you can calculate the project cost by plugging a percentile into the formulas. But when the estimate includes multiple items, doing so analytically will be a bit too complex for mere mortals. This is where the Monte Carlo technique comes in.

Monte Carlo simulates reality by calculating thousands of possible outcomes of the distribution. Several libraries are available in R that can calculate triangular distributions. The triangle package provides a set of functions to work with triangular distributions. The rtriangle() function returns a vector of random draws from the chosen distribution, as shown below:

    library(triangle)
    hist(rtriangle(n = 10000, a = 12000, b = 15000, c = 14000),
         breaks = 100,
         main = "Triangular distribution simulation")

When plotting the histogram of these results, the triangular shape becomes evident, albeit a bit rough around the edges. Notice that a Monte Carlo simulation is only ever an estimate. The higher the number of simulations (in this diagram 10,000), the better the result. Change the n parameter in the rtriangle function to see for yourself.
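To see the effect of the number of simulations without plotting, the sketch below compares the simulated average with the exact average $\frac{a+b+c}{3}$ for growing sample sizes. So that it runs in base R alone, it uses a hypothetical helper `rtri()` based on inverse transform sampling instead of the triangle package:

```r
set.seed(42)  # fixed seed so the run is reproducible

# Hypothetical helper: triangular random numbers by inverse transform sampling
rtri <- function(n, a, b, c) {
  u  <- runif(n)
  Fc <- (c - a) / (b - a)
  ifelse(u <= Fc,
         a + sqrt((b - a) * (c - a) * u),
         b - sqrt((b - a) * (b - c) * (1 - u)))
}

low <- 12000; high <- 15000; likely <- 14000
exact_mean <- (low + high + likely) / 3

# Simulated average for an increasing number of draws
sizes     <- c(100, 1000, 10000, 100000)
sim_means <- sapply(sizes, function(n) mean(rtri(n, low, high, likely)))

data.frame(n = sizes,
           simulated_mean = sim_means,
           error = sim_means - exact_mean)
```

The error generally shrinks as the number of draws grows, although any single run can deviate from that pattern.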

To add the probability distribution of all your cost items, you can store all simulations in a matrix with n columns (number of simulations) and j (number of cost items) rows. Then, in the next step, you add all the simulated values and create a histogram or calculate percentiles.

The example below reads a CSV file with the same content as the table above. Then, it creates a results matrix, runs the simulations and calculates the result.

    ## Load the triangle package for rtriangle()
    library(triangle)

    ## Read data: a CSV file with the columns Low, Medium and High
    estimate <- read.csv("estimate.csv")

    ## Simulation settings
    n <- 10000             # number of simulations
    j <- nrow(estimate)    # number of cost items
    mc_sims <- matrix(nrow = j, ncol = n)

    ## Simulation: one row of n random draws per cost item
    for (i in seq_len(j)) {
      mc_sims[i, ] <- rtriangle(n = n,
                                a = estimate$Low[i],
                                b = estimate$High[i],
                                c = estimate$Medium[i])
    }

    ## Total project cost per simulation and the 95th percentile
    mc_results <- colSums(mc_sims)
    p95 <- quantile(mc_results, 0.95)

    ## Visualise
    hist(mc_results, breaks = 100)
    abline(v = p95, col = "red", lwd = 2)

A Monte Carlo simulation outcome is thus never one number but a vector of numbers that you can analyse. The example above uses the 95th percentile as the budget figure.

The addition of triangular distributions will create a new distribution specific to this project. The mc_results vector holds the estimated possible total cost.

To set a budget figure, you need to nominate a percentile. The 50th percentile has an equal chance of being higher or lower than the actual. To have some more certainty that a sufficient budget is available, perhaps you should choose the 95th percentile, or whatever percentile matches your risk appetite.
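The sketch below shows how the choice of percentile changes the budget figure. It is self-contained: it builds the estimate inline with the same numbers as the table in this article, and it draws the random numbers with a hypothetical base R helper `rtri()` instead of the triangle package. The budget of 1.15 million used in the last line is an arbitrary example:

```r
set.seed(2024)  # fixed seed so the run is reproducible

# Same content as the table in this article, built inline
estimate <- data.frame(
  Item   = c("Materials", "Excavation", "Pipe laying"),
  Low    = c(900000, 33000, 156000),
  Medium = c(909000, 49500, 186000),
  High   = c(950000, 50000, 200000)
)

# Hypothetical helper: triangular random numbers by inverse transform sampling
rtri <- function(n, a, b, c) {
  u  <- runif(n)
  Fc <- (c - a) / (b - a)
  ifelse(u <= Fc,
         a + sqrt((b - a) * (c - a) * u),
         b - sqrt((b - a) * (b - c) * (1 - u)))
}

# One column per cost item, one row per simulation
n    <- 10000
sims <- mapply(rtri, a = estimate$Low, b = estimate$High, c = estimate$Medium,
               MoreArgs = list(n = n))
totals <- rowSums(sims)

# Budget figures for a range of risk appetites
budgets <- quantile(totals, probs = c(0.50, 0.80, 0.90, 0.95))
budgets

# Estimated probability of exceeding an arbitrary budget of 1.15 million
mean(totals > 1150000)
```

Note that this sketch stores the simulations with one row per simulation, so it sums over rows rather than columns.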

Monte Carlo cost estimates are a powerful tool to better understand your project. They are, however, like any modelling, subject to the GIGO principle (garbage in, garbage out). Therefore, determining the low, likely and high cost will require domain knowledge. While these three estimates themselves have their own level of uncertainty, they will provide better insights than relying only on the likely estimate.

Probabilistic cost estimates do not guarantee that your project will remain under budget, but they certainly help you achieve this elusive goal.
