An introduction to Monte Carlo Tree Search

[This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers.]


We recently witnessed one of the biggest game AI events in history: AlphaGo became the first computer program to beat the world champion in a game of Go. The publication can be found here. Developers from DeepMind combined different techniques from machine learning and tree search to achieve this result. One of them is the Monte Carlo Tree Search (MCTS) algorithm. This algorithm is fairly simple to understand and, interestingly, has applications outside of game AI. Below, I will explain the concept behind the MCTS algorithm and briefly tell you how it was used at the European Space Agency for planning interplanetary flights.

Perfect Information Games

Monte Carlo Tree Search is an algorithm used when playing a so-called perfect information game. In short, perfect information games are games in which, at any point in time, each player knows every action that has previously taken place. Examples of such games are Chess, Go and Tic-Tac-Toe. But just because every move is known doesn’t mean that every possible outcome can be calculated and extrapolated. For example, the number of possible legal game positions in Go is over $10^{170}$ (source).

Every perfect information game can be represented in the form of a tree data structure in the following way. At first, you have a root which encapsulates the beginning state of the game. For Chess that would be 16 white pieces and 16 black pieces placed in the proper places on the chessboard. For Tic-Tac-Toe it would be simply an empty 3×3 matrix. The first player has some number of possible choices to make. In the case of Tic-Tac-Toe this would be 9 possible places to draw a circle. Each such move changes the state of the game. These outcome states are the children of the root node. Then, for each of the children, the next player has $n$ possible moves to consider, each of them generating another state of the game, that is, another child node. Note that $n$ might differ from node to node. For instance, in Chess you might make a move which forces your opponent to move their king, or consider another move which leaves your opponent with many other options.

The outcome of a play is a path from the root node to one of the leaves. Each leaf contains definite information about which player (or players) won and which lost the game.
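To make the tree representation concrete, here is a minimal sketch in R of the root of the Tic-Tac-Toe tree and its children. The function name `children` and the encoding of the board as a character matrix are assumptions made for this illustration only.

```r
# A game state is a 3x3 character matrix; "" marks an empty cell.
root <- matrix("", nrow = 3, ncol = 3)

# Return the list of states reachable in one move by `player`.
children <- function(state, player) {
  empty <- which(state == "")          # linear indices of the empty cells
  lapply(empty, function(cell) {
    child <- state
    child[cell] <- player
    child
  })
}

first_moves <- children(root, "O")
length(first_moves)                      # number of children of the root
length(children(first_moves[[1]], "X"))  # number of replies to the first move
```

The root has 9 children and each of them has 8, which is the shrinking branching factor described in the text.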

Making a decision based on a tree

There are two main problems we face when making a decision in a perfect information game. The first, and main, one is the size of the tree.

This doesn’t bother us with very limited games such as Tic-Tac-Toe. We have at most 9 child nodes (at the beginning) and this number gets smaller and smaller as we continue playing. It’s a completely different story with Chess or Go. Here the corresponding tree is so huge that you cannot hope to search all of it. The way to approach this is to do a random walk on the tree for some time and obtain a subtree of the original decision tree.

This, however, creates a second problem. If every time we play we just walk randomly down the tree, we don’t care at all about the quality of our moves and do not learn from our previous games. Anyone who has played Chess knows that making random moves on a chessboard won’t get you very far. It might be good for a beginner to get an understanding of how the pieces move. But game after game it’s better to learn how to distinguish good moves from bad ones.

So, is there a way to somehow use the facts contained in the previously built decision trees to reason about our next move? As it turns out, there is.

Multi-Armed Bandit Problem

Imagine that you are at a casino and would like to play a slot machine. You can choose one randomly and enjoy your game. Later that night, another gambler sits next to you and wins more in 10 minutes than you have during the last few hours. You shouldn’t compare yourself to the other guy; it’s just luck. But still, it’s normal to ask whether you can do better next time. Which slot machine should you choose to win the most? Maybe you should play more than one machine at a time?

The problem you are facing is the Multi-Armed Bandit Problem. It was already known during World War II, but the most commonly known version today was formulated by Herbert Robbins in 1952. There are $N$ slot machines, each one with a different expected return value (what you expect to net from a given machine). You don’t know the expected return value of any machine. You are allowed to change machines at any time and play on each machine as many times as you’d like. What is the optimal strategy for playing?

What does “optimal” mean in this scenario? Clearly your best option would be to play only on the machine with the highest return value. An optimal strategy is a strategy for which you do as well as possible compared to the best machine.
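One common way to make "as well as possible compared to the best machine" precise is the notion of regret. The notation below is an assumption introduced for this illustration: $N$ is the number of machines, $\mu_j$ is the expected return of machine $j$, $\mu^* = \max_j \mu_j$, and $T_j(n)$ is the number of times machine $j$ was played during the first $n$ plays.

```latex
R_n \;=\; n\,\mu^{*} \;-\; \sum_{j=1}^{N} \mu_j \,\mathbb{E}\!\left[T_j(n)\right]
```

The regret $R_n$ is what you expect to lose, after $n$ plays, by not having played the best machine every time. A strategy is good when $R_n$ grows slowly with $n$.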

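It was actually proven that you cannot do better than $O(\ln n)$ on average, where $n$ is the total number of plays: the expected loss relative to always playing the best machine grows at least logarithmically. So that’s the best you can hope for. Luckily, it was also proven that you can achieve this bound (again, on average). One way to do this is the following.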

Read this paper if you are interested in the proof.

For each machine $i$ we keep track of two things: how many times we have tried this machine ($n_i$) and what the mean return value ($x_i$) was. We also keep track of how many times ($n$) we have played in general. Then for each $i$ we compute the confidence interval around $x_i$:

$$x_i \pm \sqrt{\frac{2 \ln n}{n_i}}$$

At every step we choose to play on the machine with the highest upper bound for $x_i$ (so “+” in the formula above).
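The strategy described above can be sketched in a few lines of R. This is only an illustration, assuming the standard UCB1 upper bound $x_i + \sqrt{2 \ln n / n_i}$ and simulated machines that pay 1 or 0. The pay-out probabilities in `true_means` are made up, and every machine is played once at the start so that each $n_i$ is positive.

```r
set.seed(42)
true_means <- c(0.2, 0.4, 0.8)    # hypothetical pay-out rates, unknown to the player
n_machines <- length(true_means)

plays <- rep(0, n_machines)        # n_i: how many times machine i was played
mean_return <- rep(0, n_machines)  # x_i: mean pay-out observed on machine i

# Play machine i once; returns 1 (win) or 0 (loss).
pull <- function(i) rbinom(1, size = 1, prob = true_means[i])

for (step in 1:5000) {
  total <- sum(plays)              # n: total number of plays so far
  if (total < n_machines) {
    i <- total + 1                 # initialisation: try every machine once
  } else {
    upper <- mean_return + sqrt(2 * log(total) / plays)
    i <- which.max(upper)          # machine with the highest upper bound
  }
  reward <- pull(i)
  plays[i] <- plays[i] + 1
  # Incremental update of the running mean for machine i
  mean_return[i] <- mean_return[i] + (reward - mean_return[i]) / plays[i]
}

plays
round(mean_return, 3)
```

Most of the plays should end up on the machine with the highest pay-out rate, while the weaker machines are still tried occasionally, because their confidence intervals widen as $n$ grows.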

This is a solution to the Multi-Armed Bandit Problem. Now note that we can use it for our perfect information game. Just treat each possible next move (child node) as a slot machine. Each time we choose to play a move we end up winning, losing or drawing. This is our pay-out. For simplicity, I will assume that we are only interested in winning, so the pay-out is 1 if we have won and 0 otherwise.

Real world application example

MAB algorithms have multiple practical applications in the real world, for example price engine optimization or finding the best online campaign. Let’s focus on the first one and see how we can implement it in R. Imagine you are selling your products online and want to introduce a new one, but are not sure how to price it. You came up with 4 price candidates based on your expert knowledge and experience: 99$, 100$, 115$ and 120$. Now you want to test how those prices perform and which one to choose eventually. During the first day of your experiment, 4000 people visited your shop while the first price (99$) was tested and 368 of them bought the product. For the rest of the prices we have the following outcome:

  • 100$ 4060 visits and 355 purchases,
  • 115$ 4011 visits and 373 purchases,
  • 120$ 4007 visits and 230 purchases.

Now let’s look at the calculations in R and check which price was performing best during the first day of our experiment.



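library(bandit)

visits_day1 <- c(4000, 4060, 4011, 4007)
purchase_day1 <- c(368, 355, 373, 230)
prices <- c(99, 100, 115, 120)

# Simulate draws from the posterior conversion rate of each price
post_distribution <- sim_post(purchase_day1, visits_day1, ndraws = 10000)
# Share of draws in which each price has the highest conversion rate
probability_winning <- prob_winner(post_distribution)
names(probability_winning) <- prices
probability_winning

##     99    100    115    120 
## 0.3960 0.0936 0.5104 0.0000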

We calculated the Bayesian probability that each price performs best, and can see that the price 115$ has the highest probability (about 0.51). On the other hand, 120$ seems a bit too much for the customers.
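To see what such a "probability of being the best" means, here is a base-R sketch that does not use the bandit package. It assumes a uniform Beta(1, 1) prior on each conversion rate, which is an assumption of this illustration and not necessarily what the package does internally. The names `draws`, `winner` and `p_best` are made up for the example.

```r
set.seed(1)
visits    <- c(4000, 4060, 4011, 4007)
purchases <- c(368, 355, 373, 230)
ndraws    <- 10000

# One column of posterior draws per price: Beta(successes + 1, failures + 1)
draws <- sapply(seq_along(visits), function(i) {
  rbeta(ndraws, purchases[i] + 1, visits[i] - purchases[i] + 1)
})

# For every draw, find which price has the highest conversion rate,
# then report how often each price comes out on top.
winner <- max.col(draws)
p_best <- tabulate(winner, nbins = length(visits)) / ndraws
names(p_best) <- c(99, 100, 115, 120)
round(p_best, 3)
```

Since this is a simulation, the numbers change slightly from run to run, but they should land close to the probabilities reported for day 1.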

The experiment continues for a few more days.

Day 2 results:

visits_day2 <- c(8030, 8060, 8027, 8037)
purchase_day2 <- c(769, 735, 786, 420)

post_distribution <- sim_post(purchase_day2, visits_day2, ndraws = 1000000)
probability_winning <- prob_winner(post_distribution)
names(probability_winning) <- prices
probability_winning

##       99      100      115      120 
## 0.308623 0.034632 0.656745 0.000000

After the second day, price 115$ still shows the best results, with 99$ and 100$ showing similar conversion rates (about 9.6% and 9.1%).

Using the bandit package we can also perform a significance analysis, which is handy for comparing the overall proportions using prop.test.

significance_analysis(purchase_day2, visits_day2)
##   successes totals estimated_proportion        lower      upper
## 1       769   8030           0.09576588 -0.004545319 0.01369494
## 2       735   8060           0.09119107  0.030860453 0.04700507
## 3       786   8027           0.09791952 -0.007119595 0.01142688
## 4       420   8037           0.05225831           NA         NA
##   significance rank best       p_best
## 1 3.322143e-01    2    1 3.086709e-01
## 2 1.437347e-21    3    1 2.340515e-06
## 3 6.637812e-01    1    1 6.564434e-01
## 4           NA    4    0 1.548068e-39

At this point we can see that 120$ is still performing badly, so we drop it from the experiment and continue for the next day. According to p_best, the chance that this alternative is the best is negligible (about 1.5e-39).

Day 3 results:

visits_day3 <- c(15684, 15690, 15672, 8037)
purchase_day3 <- c(1433, 1440, 1495, 420)

post_distribution <- sim_post(purchase_day3, visits_day3, ndraws = 1000000)
probability_winning <- prob_winner(post_distribution)
names(probability_winning) <- prices
probability_winning
##       99      100      115      120 
## 0.087200 0.115522 0.797278 0.000000

# Use a new name so the result does not shadow the value_remaining() function
value_left <- value_remaining(purchase_day3, visits_day3)
potential_value <- quantile(value_left, 0.95)
potential_value
##        95% 
## 0.02670002

Day 3 results led us to conclude that 115$ will generate the highest conversion rate and revenue. We are still unsure about the conversion probability for the best price 115$, but whatever it is, one of the other prices might beat it by as much as 2.67% (the 95% quantile of value remaining).
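The bandit calculations above compare conversion rates only. As a quick sanity check of the revenue part of the conclusion, the sketch below multiplies each price by its observed day 3 conversion rate. It uses only the numbers already given and base R; the name `revenue_per_visit` is made up for this illustration.

```r
prices    <- c(99, 100, 115, 120)
visits    <- c(15684, 15690, 15672, 8037)
purchases <- c(1433, 1440, 1495, 420)

conversion <- purchases / visits            # observed conversion rate per price
revenue_per_visit <- prices * conversion    # average revenue per visitor

round(data.frame(price = prices, conversion, revenue_per_visit), 4)
prices[which.max(revenue_per_visit)]        # price with the highest revenue per visitor
```

On the observed data, 115$ has both the highest conversion rate and the highest revenue per visitor, which agrees with the conclusion drawn from the posterior probabilities.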

The histograms below show what happens to the value-remaining distribution (the distribution of improvement amounts that another price might have over the current best price) as the experiment continues. With the larger sample we are much more confident about the conversion rate. Over time, the other prices have lower and lower chances of beating the price of 115$.

If this example was interesting to you, check out our other post about dynamic pricing.

We are ready to learn how the Monte Carlo Tree Search algorithm works.

As long as we have enough information to treat child nodes as slot machines, we choose the next node (move) as we would when solving the Multi-Armed Bandit Problem. This can be done when we have some information about the pay-outs for each child node.

Figure: At the first node it's black's turn. The node with the highest upper bound is chosen. Then it's white's turn and again the node with the highest upper bound is chosen.

At some point we reach a node where we can no longer proceed in this manner, because at least one of its children has no statistics yet. It’s time to explore the tree to get new information. This can be done either completely randomly or by applying some heuristic when choosing child nodes (in practice this might be necessary for games with a high branching factor, like Chess or Go, if we want to achieve good results).

Figure: At some point we cannot apply the Multi-Armed Bandit procedure, because there is a node which has no stats. We explore a new part of the tree.

Finally, we arrive at a leaf. Here we can check whether we have won or lost.

Figure: We arrive at a leaf. This determines the outcome of the game.

It’s time to update the nodes we have visited on our way to the leaf. If the player making a move at the corresponding node turned out to be the winner, we increase the number of wins by one. Otherwise we keep it the same. Whether we have won or not, we always increase the number of times the node was played (in the corresponding picture we can automatically deduce it from the number of losses and wins).

Figure: It's time to update the tree.

That’s it! We repeat this process until some condition is met: a timeout is reached or the confidence intervals we mentioned earlier stabilize (convergence). Then we make our final decision based on the information we have gathered during the search. We can choose the node with the highest upper bound for the pay-out (as we would in each iteration of the Multi-Armed Bandit). Or, if you prefer, choose the one with the highest mean pay-out.

The decision is made, a move was chosen. Now it’s time for our opponent (or opponents). When they’ve finished we arrive at a new node, somewhere deeper in the tree, and the story repeats.
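The four steps described above (select with the bandit rule, explore a new node, play to a leaf, update the statistics) can be put together in a short R sketch. To keep it small it plays a toy game that is not part of this article: a pile of stones from which two players alternately take 1 to 3 stones, and whoever takes the last stone wins. All function names are made up for this illustration, the play-outs are purely random, and each node stores the wins of the player who made the move leading to it.

```r
set.seed(123)

legal_moves <- function(stones) seq_len(min(3, stones))

new_node <- function(stones, player, parent = NULL) {
  node <- new.env()
  node$stones <- stones      # stones left in the pile
  node$player <- player      # player to move at this node (1 or 2)
  node$parent <- parent
  node$children <- list()    # children[[k]] is the state after taking k stones
  node$visits <- 0
  node$wins <- 0             # wins of the player who moved INTO this node
  node
}

# Upper confidence bound of a child, as in the Multi-Armed Bandit rule.
ucb <- function(child, parent_visits) {
  child$wins / child$visits + sqrt(2 * log(parent_visits) / child$visits)
}

mcts <- function(stones, player, iterations = 10000) {
  root <- new_node(stones, player)
  for (it in seq_len(iterations)) {
    node <- root
    # 1. Selection: descend while every child already has statistics.
    repeat {
      moves <- legal_moves(node$stones)
      if (length(moves) == 0) break                     # leaf: game over
      if (length(node$children) < length(moves)) break  # unexplored move left
      scores <- sapply(node$children, ucb, parent_visits = node$visits)
      node <- node$children[[which.max(scores)]]
    }
    # 2. Expansion: add the next unexplored child.
    if (node$stones > 0) {
      move <- length(node$children) + 1
      child <- new_node(node$stones - move, 3 - node$player, node)
      node$children[[move]] <- child
      node <- child
    }
    # 3. Simulation: random moves until the pile is empty.
    stones_left <- node$stones
    mover <- 3 - node$player          # the player who moved last so far
    while (stones_left > 0) {
      mover <- 3 - mover
      stones_left <- stones_left - sample.int(min(3, stones_left), 1)
    }
    winner <- mover                   # whoever took the last stone
    # 4. Update: walk back to the root and refresh the statistics.
    while (!is.null(node)) {
      node$visits <- node$visits + 1
      if (winner != node$player) node$wins <- node$wins + 1
      node <- node$parent
    }
  }
  # Final decision: the move with the highest mean pay-out.
  means <- sapply(root$children, function(ch) ch$wins / ch$visits)
  which.max(means)
}

best_move <- mcts(10, player = 1)
best_move   # number of stones to take from a pile of 10
```

In this toy game the winning strategy is known: always leave your opponent a multiple of 4 stones. From a pile of 10 that means taking 2, so with enough iterations the search should settle on that move. Nothing in the search itself knows this rule; it only uses the rules of the game and the win/loss outcomes.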

Not just for games

As you might have noticed, Monte Carlo Tree Search can be considered a general technique for making decisions in perfect information scenarios. Therefore its use does not have to be restricted to games. The most amazing use case I have heard of is planning interplanetary flights. You can read about it at this website, but I will summarize it briefly.

Think of an interplanetary flight as a trip during which you would like to visit more than one planet. For instance, you are flying from Earth to Jupiter via Mars.

An efficient way to do this is to make use of these planets’ gravitational fields (like they did in the movie ‘The Martian’) so you can take less fuel with you. The question is when the best time is to arrive at and leave each planet’s surface or orbit (for the first planet you only leave, and for the last one you only arrive).

You can treat this problem as a decision tree. If you divide time into intervals, at each planet you make a decision: in which time slot you should arrive and in which you should leave. Each such choice constrains the next. First of all, you cannot leave before you arrive. Second, your previous choices determine how much fuel you have left and, consequently, which places in the universe you can reach.

Each such set of consecutive choices determines where you arrive at the end. If you visited all required checkpoints, you’ve won. Otherwise, you’ve lost. It’s like a perfect information game in which there is no opponent and you make a move by choosing the time slot for departure or arrival. This can be treated using the Monte Carlo Tree Search described above. As you can read here, it fares quite well in comparison with other known approaches to this problem. At the same time, it does not require any domain-specific knowledge to implement.

Write your questions and comments below. We’d love to hear what you think.
