Lazy SMP algorithm
The general design principles for Texel's old SMP algorithm were "no waiting" and "be asynchronous whenever possible". Typical lazy SMP algorithms also follow these principles, since there is very little synchronization between threads except for the transposition table. Other than that, there is very little similarity between the old and the new algorithm.
Texel's lazy SMP algorithm uses a shared transposition table, which is thread safe and lockless using the
"XOR trick". Other data, such as history tables, killer tables and evaluation caches, are private to each search thread.
The first thread is the master thread and all other threads are helper threads. Whenever the master thread calls the negaScout function from the root search function, it also tells the helper threads to start searching the same position. The helper threads improve the search in two ways. First, they store partial search results in the shared transposition table, where they can be useful for other search threads. Second, if a helper thread finishes its search before the master thread does, the master thread detects this, aborts its own search and uses the result from the helper thread instead.
In pseudo code, the algorithm works basically like this:
Code:
Move iterativeDeepening(Position pos) {
    for (depth = 1; depth <= maxDepth && !timeout; depth++) {
        for (all legal moves m) {
            [alpha, beta] = score_from_previous_iteration -/+ aspiration_window_delta;
            pos.makeMove(m);
            score = -negaScoutRoot(pos, -beta, -alpha, depth);
            pos.unMakeMove(m);
            if (score outside aspiration window)
                widen window and redo search;
            if (score > bestScore) {
                bestMove = m;
                bestScore = score;
            }
        }
    }
    tell all helper threads to stop searching
    return bestMove;
}

int negaScoutRoot(Position pos, int alpha, int beta, int depth) {
    tell all helper threads to start searching with params (pos, alpha, beta, depth)
    try {
        return negaScout(pos, alpha, beta, depth);
    } catch (HelperThreadResult result) {
        return result.getScore();
    }
}

int negaScout(Position pos, int alpha, int beta, int depth) {
    if (time to check for abort) {
        if (timeout)
            throw StopSearch();
        if (helper thread has a search result)
            throw HelperThreadResult(helper_thread_score);
    }
    // Normal recursive negaScout implementation ...
}

void helperThreadMainLoop() {
    while (!quit) {
        while (idle)
            wait for search command
        int extraDepth = thread_number % 2;
        try {
            for ( ; depth + extraDepth <= maxDepth; extraDepth++) {
                negaScout2(pos, alpha, beta, depth + extraDepth);
            }
        } catch (StopSearch) {
        }
    }
}

int negaScout2(Position pos, int alpha, int beta, int depth) { // Like negaScout() except for abort condition
    if (time to check for abort) {
        if (stop command received or new search command received)
            throw StopSearch();
    }
    // Normal recursive negaScout implementation ...
}
Communication between the master thread and helper threads is asynchronous, meaning that when a thread sends a command to another thread, it does not wait for the other thread to acknowledge the command.
Standard C++11 threading primitives such as locks and condition variables are used for the communication.
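As an illustration, the asynchronous command passing between threads could be built on a mutex and a condition variable roughly as in the sketch below. The Command type and the names are made up for this sketch; the essential property is that post() returns immediately without waiting for the receiver.

Code:

#include <condition_variable>
#include <deque>
#include <mutex>

struct Command { int type; /* position, alpha, beta, depth, ... */ };

class CommandQueue {
public:
    // Sender side: enqueue the command and return immediately.
    void post(const Command& cmd) {
        { std::lock_guard<std::mutex> lk(mtx); queue.push_back(cmd); }
        cv.notify_one();
    }

    // Receiver side: block while idle until a command arrives.
    Command waitForCommand() {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [this]{ return !queue.empty(); });
        Command cmd = queue.front();
        queue.pop_front();
        return cmd;
    }

private:
    std::mutex mtx;
    std::condition_variable cv;
    std::deque<Command> queue;
};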
Half of the helper threads search at the same depth as the master thread and the other half search one ply deeper.
When a helper thread finishes its search, it sends the search result to the master thread and then restarts its current search one ply deeper. It is expected that this restarted search will be aborted rather quickly. As soon as the master thread receives the helper thread's search result, it continues its algorithm, which causes it to call negaScoutRoot() again with different parameters, and negaScoutRoot() tells all helper threads to start searching using the new parameters.
The master thread is responsible for all UCI communication and when reporting search depth it uses its own search depth. This means that even if a "depth+1" helper thread finished first, the reported depth is "depth", even though in that case the "quality" of the search is "depth+1".
The master thread does not directly talk to all helper threads. Instead all search threads are arranged in a tree, where the master thread is the tree root node. Each tree node has at most 4 children. The tree is created to have minimal height and children are assigned from left to right. For example if there are 16 search threads they are arranged as follows:
Code:
                 0
                 |
   +---------+---+------+------+
   |         |          |      |
   1         2          3      4
   |         |          |
+-+-+-+  +-+--+--+   +--+--+
| | | |  | |  |  |   |  |  |
5 6 7 8  9 10 11 12  13 14 15
A search thread is only allowed to talk to its neighbors in the tree graph. When a search thread receives a command from a neighbor it acts on the command and also forwards it to its parent or children as appropriate.
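With at most 4 children per node and children assigned from left to right, the parent and children of a search thread follow directly from its thread number. A small sketch of the index arithmetic (not the actual texel code):

Code:

#include <vector>

// Parent of thread i in the 4-ary thread tree; thread 0 is the root.
int parentThread(int i) {
    return (i == 0) ? -1 : (i - 1) / 4;
}

// Children of thread i, given the total number of search threads.
std::vector<int> childThreads(int i, int numThreads) {
    std::vector<int> children;
    for (int c = 4 * i + 1; c <= 4 * i + 4 && c < numThreads; c++)
        children.push_back(c);
    return children;
}

For 16 threads this gives exactly the tree above, for example childThreads(3, 16) = {13, 14, 15} and childThreads(4, 16) is empty.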
This thread tree structure is not strictly necessary for SMP search, since SMP hardware does not have an excessive number of hardware threads, so the master thread could talk directly to all helper threads. For clusters, however, the number of possible hardware threads can be very large (the largest supercomputers have more than one million cores), and in that case the master thread would become a serious communication bottleneck if it tried to talk directly to all helper threads.
The algorithm is NUMA aware. This means that each search thread is locked to a specific NUMA node, and care is taken to allocate thread local memory on the NUMA node where the thread runs. The shared transposition table is allocated on NUMA node 0 if there is enough free memory on that node.
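On Linux this kind of binding could be expressed with libnuma, roughly as in the sketch below. The function and what gets allocated are assumptions for illustration; the actual texel code is more involved.

Code:

#include <numa.h>   // libnuma, link with -lnuma

// Pin the calling search thread to one NUMA node and allocate its
// thread-local tables (history, killers, eval cache) on that node.
void* bindThreadToNode(int node, size_t localTablesBytes) {
    if (numa_available() < 0)
        return nullptr;                 // no NUMA support, caller falls back
    numa_run_on_node(node);             // restrict the thread to CPUs of this node
    return numa_alloc_onnode(localTablesBytes, node);
}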
Lazy cluster algorithm
Support for clusters is implemented using the message passing interface (MPI). An MPI program consists of a set of processes, called "processors" in MPI terminology. Different processors do not share any memory and possibly execute on different computers (aka cluster nodes). All communication between processors is performed using message passing, which for example can be implemented by the MPI library using TCP/IP communication over Ethernet.
The Texel cluster implementation is a so-called "hybrid" MPI implementation, which means that each processor is a multithreaded program and there is only one processor on each computer in the cluster. Inside each processor the threads are arranged in the same tree structure as in the lazy SMP case. Cluster nodes are also connected in a tree structure: the main thread in each processor is connected to the main threads in other processors. This means that the set of all search threads in all cluster nodes forms a tree of trees. For example, if each node has 3 cores and there are 7 nodes in the system, the tree structure looks like this:
Code:
n0_t0 +- n0_t1
  |   |
  |   +- n0_t2
  |
  +----------------+----------------+----------------+
  |                |                |                |
n1_t0 +- n1_t1   n2_t0 +- n2_t1   n3_t0 +- n3_t1   n4_t0 +- n4_t1
  |   |                |                |                |
  |   +- n1_t2         +- n2_t2         +- n3_t2         +- n4_t2
  |
  +----------------+
  |                |
n5_t0 +- n5_t1   n6_t0 +- n6_t1
      |                |
      +- n5_t2         +- n6_t2
Communication between cluster nodes works logically the same way as between threads within a node, but the implementation uses MPI messages instead of C++11 thread primitives.
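A minimal sketch of what the MPI counterpart of the command passing could look like is shown below. The tag, the fixed-size buffers and the function names are assumptions for illustration; the important property is again that sending does not wait for the receiver, and that receiving never blocks the search thread.

Code:

#include <mpi.h>

static const int CMD_TAG = 1;

// Post a command buffer to a neighboring cluster node and return immediately;
// the request can be tested or waited on later.
void postCommand(const void* buf, int count, int destRank, MPI_Request* req) {
    MPI_Isend(buf, count, MPI_CHAR, destRank, CMD_TAG, MPI_COMM_WORLD, req);
}

// Called periodically from the search: pick up a pending command from a
// neighbor, if any, without blocking.
bool pollCommand(void* buf, int count, int srcRank) {
    int flag = 0;
    MPI_Status status;
    MPI_Iprobe(srcRank, CMD_TAG, MPI_COMM_WORLD, &flag, &status);
    if (flag)
        MPI_Recv(buf, count, MPI_CHAR, srcRank, CMD_TAG, MPI_COMM_WORLD, &status);
    return flag != 0;
}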
It is not necessary for each cluster node to have the same number of cores, even though that was the case in the example above. The number of cluster nodes in the system is given when the program starts, but the number of search threads actually used is given by the "Threads" UCI parameter. Normally the number of search threads is set to the total number of cores in the cluster, but a different number can be specified. If there are fewer search threads than cores, cores are allocated in breadth-first order. If there are more search threads than cores, the extra search threads are spread out evenly over the cluster nodes and cores, which makes it possible to utilize hyperthreading cores.
Using the cluster communication described so far improves the search by making it possible for search threads to finish the search before the master thread (on node 0) and report the result up the tree to the master thread. However, since each processor lives in its own address space, the major lazy SMP search improvement caused by transposition table sharing does not work between cluster nodes. To overcome this limitation, a method to distribute transposition table updates to other cluster nodes has been implemented.
Each processor has one transposition table shared between all threads in the process. Whenever a search thread stores a result in the transposition table, it also checks if the stored depth is large enough to make it worthwhile to distribute this information to other cluster nodes. If so, the TT entry is stored in output queues containing entries to be sent to cluster node neighbors. Each cluster neighbor has its own output queue. Spare bandwidth between cluster nodes is used to transmit the queued entries to neighboring nodes. When a node receives a TT entry, it inserts it in its local transposition table, and, if the depth is large enough, also stores it in the output queues for all neighbors except the neighbor the entry came from. Since the cluster nodes form a tree structure, this ensures that each TT entry update can be distributed to all other cluster nodes, but distribution can stop at any cluster node if the TT entry depth is not large enough.
What TT entry depth is considered "large enough" for distribution to other cluster nodes is determined dynamically using a simple procedure. Each output queue is periodically transmitted to the receiving node. If the queue is less than half full when it is about to be sent, the minimum depth for that queue is decreased by 1. If on the other hand the queue becomes full before it was possible to send it to the receiving node, the minimum depth for that queue is increased by 1.
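In outline, the store-and-forward logic and the adaptive depth threshold could look like the sketch below. The types, queue size and initial threshold are assumptions for illustration, not the actual texel values.

Code:

#include <cstdint>
#include <vector>

struct TTEntryData { uint64_t key; uint64_t data; int depth; };

// One output queue per cluster node neighbor.
struct NeighborQueue {
    int minDepth = 10;                   // minimum depth worth distributing
    std::vector<TTEntryData> pending;    // entries waiting to be transmitted
    static constexpr size_t maxSize = 256;
};

// Called after a local TT store (fromNeighbor == -1) or after inserting an
// entry received from a neighbor. The entry is forwarded to all other
// neighbors whose depth threshold it reaches; distribution stops where the
// depth is not large enough.
void forwardTTEntry(const TTEntryData& e, int fromNeighbor,
                    std::vector<NeighborQueue>& neighbors) {
    for (size_t i = 0; i < neighbors.size(); i++) {
        if ((int)i == fromNeighbor)
            continue;
        NeighborQueue& q = neighbors[i];
        if (e.depth >= q.minDepth && q.pending.size() < NeighborQueue::maxSize)
            q.pending.push_back(e);
    }
}

// Called when there is spare bandwidth to the neighbor. The threshold adapts
// so that the queue stays roughly half full.
void sendQueue(NeighborQueue& q, bool becameFullBeforeSend) {
    if (becameFullBeforeSend)
        q.minDepth++;                    // queue overflowed, be pickier
    else if (q.pending.size() < NeighborQueue::maxSize / 2)
        q.minDepth--;                    // spare capacity, distribute more
    // ... transmit q.pending to the neighboring node, then:
    q.pending.clear();
}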
Cluster algorithm properties
From the design it should be clear that speed measured in nodes per second (NPS) will increase proportionally to the number of nodes in the cluster, since all threads will always be searching something as soon as the first "start search" command has reached all search threads.
The reported NPS values in the UCI communication are based on information available in the master thread. Each search thread maintains a local "nodes searched" counter, and counter changes are periodically reported to parent threads, which makes this information propagate up the tree structure towards the master thread. However, at any given point in time there will be counter values that have not yet been propagated to the master thread. Therefore the reported NPS values will be somewhat smaller than the real values, but this error shrinks the longer the search runs, and after around 10 seconds it should be quite small even for very large clusters.
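As a sketch of the per-thread bookkeeping (names made up for this illustration): each thread only ever sends the delta since its last report, so counts are never lost or double counted, only delayed.

Code:

#include <cstdint>

struct NodeCounter {
    uint64_t localNodes = 0;      // incremented for every node searched
    uint64_t lastReported = 0;

    // Called every few thousand nodes; the returned delta is sent to the
    // parent thread, which adds it to its own counter.
    uint64_t takeDelta() {
        uint64_t delta = localNodes - lastReported;
        lastReported = localNodes;
        return delta;
    }
};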
The transposition table design makes TT access from search threads very fast compared to more traditional distributed hash table systems where a thread has to ask another cluster node for the result of a TT probe. This is because no cluster communication is required to access a TT entry. On the other hand the asynchronous TT entry broadcasting mechanism means that there will be a delay between the time when a TT entry is created and the time when it has been distributed to all cluster nodes.
Of course NPS in itself is not important for a chess engine. What really matters is how much the extra hardware contributes to making the program play stronger chess. Tests will have to be performed to measure the strength increase.