Monte Carlo Tree Search

This class continues topic introduced in the previous lab. This time we will focus on the famous Monte Carlo Tree Search (MCTS). While MiniMax with alpha/beta pruning uses branch and bound approach to search the game tree exhaustively, some game trees are just too big to be explored, e.g. Go game).

MCTS on the other hand is a probabilistic algorithm, only sampling the game tree in the indeterministic manner.

Related info:

Monte Carlo

The Monte Carlo algorithms are named after the famous casions, where you can play the 'one-armed bandit' slot machines. The idea is that one has many different slot machines to choose from in the casino and every machine has different unknown winning chance. Problem to solve: having a fixed amount of game tokens, how to find a machine that maximizes profits?

In games a “slot-machine” is represented by a possible move with unknown impact on the game, and we have to find the best move, having limited amount of computational resources (tokens). The simplest approach would be to choose moves at random and simulate the game (such simulation is called playout) until the end. We do it as long as we can, then we check which move had the best wins to playouts ratio.

as long as we can:
  move = choose_random_move()
  score = payout(move)


  1. check the basic Monte Carlo implementation
  2. run several matches Alpha-Beta vs Monte Carlo:
    • at some simple game, e.g. tic-tac-toe,
    • at something bigger, e.g. checkers
    • make sure that neither of them times out without making a move

Monte Carlo Tree Search

Monte Carlo wasn't the smartest approach to the problem - we can be more intelligent than some random sampling. The idea behind a more advanced Monte Carlo Tree Search is to choose moves, according to a simple scoring formula:

$UCT = \overline{score_i} + c*\sqrt{{n}\over{n_i} $


  • $\overline{score_i} $ the average score of the $i $-th move
  • $n $ is total number of playouts
  • $n_i $ is number of playouts after choosing the $i $-th move
  • $c$ is a constant parameter, from theoretical reasons set traditionally to $\sqrt{2}$

As you see, the UCT assigns high scores to machines that:

  • have high average score ($\overline{score_i} $ is high)
  • haven't been tried too many times (low $\over{n_i}$).

This way we balance both exploration and exploitation of the search space. The algorithm is called Tree Search because in essence it explores tree in a intelligent manner — you can say that it approximates MiniMax by randomized exploration.

The algorithm always starts in the current game state (root of the tree) and consists in four phases:

  1. selection:
    1. if the node has been explored (all possible moves were tried at least once) we use the UCT formula to choose one of them to get another node. We repeat this process until we get a node, that has at least one previously unplayed move.
    • :!: ATTENTION/ACHTUNG/UWAGA :!: like in the MiniMax algorithm, we assume that we play zero-sum game and our opponent would like to minimize our score. So when selecting his moves we have to be smart enough to either minimize the UCT score, or store the scores in an intelligent manner.
  2. growth: we choose one of the unplayed moves (some implementation can choose more, e.g. all unplayed moves from the current node)
  3. simulation: we make a playout starting with the chosen move. Most basic way to implement playout is to do totally random game simulation.
  4. backprogation: we store results of the playout in the nodes belonging to the tree branch we have explored.

We repeat those phases as long as we can - finally we select a move that had the highest number of playouts - we know it's the best because algorithm favors the most promising moves when sampling.

Below you can see illustration of one MCTS cycle. There are two things worth of noting:

  1. this is not first cycle of the algorithm, the tree is already expanded;
  2. gray moves belong to our opponent, so during backpropagation we store our loss as his win.


  1. Based on the, implement the player
    • MCTS assumes that won games has score 1, not 100 as in the ggp framework. Tie also should be equal 0,5.
  2. Run several matches vs at some big game (chess, checkers?). Does any of them have an upper hand?
  3. Repeat experiment with diffrent $c$ values: 0, 1, 2, 5, 42. Which is the most effective?
  4. Modify MCTS by mixing it with the plain Monte Carlo. First run some random playouts (Monte Carlo) for a limited amount of time (e.g. 20% of the total time given for the deliberation). Does it improve results?
en/dydaktyka/ggp/game_tree_2.txt · Last modified: 2019/06/27 15:49 (external edit) Valid CSS Driven by DokuWiki do yourself a favour and use a real browser - get firefox!! Recent changes RSS feed Valid XHTML 1.0