I have been working lately on a multi-threaded search algorithm for my chess program texel and came up with an algorithm that is inspired by both DTS and ABDADA. The goal is to combine the power of DTS with the simplicity of ABDADA. The algorithm is described below.
Algorithm
There is one master thread and N-1 helper threads. The master thread behaves just like in a single threaded search, except that it also creates "hints" about what it is doing during the search. The single threaded search is basically a standard recursive negascout algorithm.
The helper threads examine the "hints" and decide what sub-trees to search. Search results are stored in the shared transposition table. This is the only way that search results are transferred from the helper threads to the master thread. If the helper threads do a good job the master thread finds lots of sub-tree results in the transposition table so it does not have to search them itself.
The "hints" are objects of the class SplitPoint. A SplitPoint object is created for all nodes in the search tree that have depth >= 10 and ply <= 16. The negaScout function creates a SplitPoint object after it has generated the moves to search (texel does not use an incremental move generator) and deletes the SplitPoint object before the function returns. This means that the active SplitPoints for a thread correspond to the top up to 16 recursive calls of the negaScout function.
A SplitPoint object is quite lightweight and contains a position, alpha, beta, ply, the repetition hash list, and references to history and killer tables. Parent and child pointers are also maintained so that the SplitPoint objects form a tree. Finally a SplitPoint object contains a vector of SplitPointMove objects. Each SplitPointMove object contains move, depth, LMR reduction and flags that say if the move is currently being searched by some thread and if the move has been canceled (result no longer needed).
The existence of a SplitPoint object does not mean that any helper thread is actually searching a move in that SplitPoint, it only means that a helper thread can decide to search at that SplitPoint. Therefore "potential split point" may be a more accurate name, but I did not want the class name to be that long.
Each active SplitPoint object is also stored in a priority queue, ordered by the estimated probability that searching the first non-searched move in the SplitPoint would be useful. A helper thread that is idle will constantly monitor the priority queue, and as soon as it becomes non-empty, the first available move in the first SplitPoint is selected and the helper thread starts searching the corresponding sub-tree. A non-idle helper thread periodically checks if the SplitPointMove it is currently searching has been canceled, or if the the estimated probability of the first item in the priority queue is significantly better than the estimated probability for the move currently being searched. If this is the case, the search is aborted and a new SplitPointMove is selected from the priority queue.
A SplitPointMove is canceled when the master thread starts searching it itself, or if a beta cutoff or alpha improvement "earlier" in the search tree causes the result to be no longer needed. An alpha increase that is not a beta cutoff causes the helper thread to cancel its search, putting the move it was searching back in the priority queue. The next time it selects a move to search, it may very well select the same move, but this time it will search with the new alpha value. The idea is that the transposition table makes the abort and restart procedure efficient, since the re-search will get lots of transposition table cutoffs until it reaches the same point in the search tree as before the abort/restart.
When a helper thread starts to search a move, it calls the same negaScout function that the master thread is using for its search, which means that it can also create SplitPoint objects. What is said above about the master thread also applies to such a helper thread that is considered the master of the SplitPointMove it decided to search.
The probability that searching a move is useful is estimated by collecting fail high statistics and alpha increase statistics during the search. When a move usefulness depends an a series of SplitPointMoves, the probability is computed as the product of the individual move probabilities. Move probabilities are recomputed for the affected moves every time the state of a SplitPoint changes.
Statistics is collected for 5 different types of nodes, which are PV, CUT1, CUT2, ALL and NULL. For PV nodes, it is counted how often alpha increases after searching move 0,1,...,14. For the other node types, it is counted how often a beta cutoff happens after searching move 0,1,...,14, and how often the node fails low. CUT1 nodes are nodes where the previous move was not the first move at the parent node. ALL and CUT2 nodes are nodes where the previous move was the first move at the parent node. Whether a node has type ALL or CUT2 depends on if the number of plies up to the previous non-first move is even or odd. NULL nodes are nodes where the previous move was a null move.
No young brothers wait concept is used. A helper thread will search the move with the highest estimated probability of being useful, even if this is for example the second move at an expected cut node, and the first move is currently being searched. The idea is that it is better to do something than nothing, and the mechanism that aborts a helper thread when a move with a significantly higher probability becomes available makes sure that there is no long term penalty for starting to search a move with a low probability.
Implementation
Texel is written in C++11. Multi-threading is implemented using std::thread, std::mutex and std::lock_guard. SplitPoint lifetimes are managed using std::shared_ptr and std::weak_ptr. The priority queue is implemented using std::set. Exceptions are used to abort the recursive search when either time runs out (for the master thread) or when a SplitPoint is canceled (for the helper threads). RAII is used to clean up resources (cancel SplitPoints) when the stack unwinds. Template metaprogramming is used to ensure that the code in the negaScout function that creates and deletes SplitPoint objects does not negatively affect the performance close to the leaves.
The implementation already works but is not quite ready for a new texel release yet. The current source code is available here. The new data structures for the parallel search are in the parallel.[ch]pp files and the negaScout function is in search.cpp.
Advantages
Like in the DTS algorithm, helper threads are free to work anywhere in the search tree and no thread ever has to wait for any other thread to finish searching a sub-tree.
Unlike the DTS algorithm, the required changes to the recursive negaScout function are pretty small. An extra loop over the move list to create the SplitPoint information is basically all that is needed.
The framework for estimating move usefulness probabilities means that no ad-hoc rules for where it is allowed to split the search tree are needed.
Disadvantages
If the master thread is searching move one at a SplitPoint, and a helper thread is searching move two, and the helper thread fails high, the master thread still finishes searching move one before it finds the fail high result for move two in the transposition table and propagates the fail high result up in the search tree. In this case there is an opportunity for the parallel search to search fewer nodes than the serial search would search, by immediately aborting the master thread. I don't know if the DTS algorithm handles this case but I would guess that it does.
If the master thread is searching move one at a SplitPoint, and a helper thread is searching move two, and the master thread fails low on move one, it will start searching move two. When this happens the helper thread is "kicked out" of that branch and has to start searching somewhere else. The idea is that the master thread should quickly get to the same point where the helper thread was interrupted because of transposition table hits. However, there is still some overhead involved. This seems hard to avoid with the current design, but the hope is that the overhead is small.
If the transposition table is too small it is possible that search results computed by helper threads are overwritten before they are needed by the master thread. It would be straightforward to store the search results in the SplitPointMove objects, but I have not implemented this because I don't think it is a problem in practice.
Results
I don't have many results yet, but I have played two matches between the new texel version and the version just before I started to implement the parallel search. The old version is Texel103a3. The new version first played 400 games against the old version, where the new version used two cores, called TexelHead2c in the table below. Then the new version played 400 games against the old version, where the new version used twice the thinking time but only one core, called TexelHead2t in the table below. The time control was 5 seconds plus 1 second increment per move. (The 2t version obviously used 10s+2s time control.)
Code: Select all
Rank Name Elo + - games score oppo. draws
1 TexelHead2t 42 19 19 400 67% -60 44%
2 TexelHead2c 18 19 19 400 63% -60 45%
3 Texel103a3 -60 13 14 800 35% 30 45%
1 TexelHead2t 42 400.0 (266.5 : 133.5)
400.0 (266.5 : 133.5) Texel103a3 -60
2 TexelHead2c 18 400.0 (250.5 : 149.5)
400.0 (250.5 : 149.5) Texel103a3 -60
3 Texel103a3 -60 800.0 (283.0 : 517.0)
400.0 (133.5 : 266.5) TexelHead2t 42
400.0 (149.5 : 250.5) TexelHead2c 18
Te Te Te
TexelHead2t 89 99
TexelHead2c 10 99
Texel103a3 0 0
It is currently unknown how well the algorithm scales to large number of threads. The design is meant to make sure that all threads have useful work to do, but at some point the locking associated with the priority queue manipulation may start to become a bottleneck.
Possible improvements
It should be possible to fix the first disadvantage above by having the master thread periodically check its SplitPoint objects to see if there has been a beta cutoff (or alpha increase). If this is the case, an exception could be thrown and caught at the correct level in the recursion. There the search result from the helper thread could be picked up and returned.
The algorithm could try harder to search at moves with larger remaining search depth. Currently the move with the highest estimated usefulness probability is searched, and in case of a tie the move with the largest depth is chosen. However, it may be better to search a move with a larger depth but with a slightly lower probability, because the overhead associated with starting and stopping the search, including the priority queue manipulation and copying of history and killer tables, is reduced. If statistics were collected regarding overhead as a function of search depth, this could be weighted into the probability calculations.
Another idea is to perform a "wider" search in the helper threads if the search depth is low, so that the overhead is high. The widening could be implemented by ignoring the LMR reduction in this case, and the effect would be that the helper thread would spend more time examining search tree nodes and less time on overhead. I have no idea if this would improve the program elo though.