This is very similar to what I do in Wasp, except that I have a "main" thread which handles I/O and a timer, and I create a search thread pool when the program is started. Before a search, the main thread copies the root position to all the search threads and sets a flag for them to start searching. When the time limit is reached, the main thread sets a global timeout flag. Each search thread checks this flag frequently (in the move loop at every depth) and returns if it is set, so no longjmp is used. At every iteration (beyond depth 5) each thread checks whether half or more of the other threads are already searching at that depth and, if so, bumps the depth. Also, if a thread is not the first to search at a given depth, I apply a little randomness to the scores used to order the root moves. I don't know if this really helps at all. I have a global structure that contains the PV, depth, and score for the best move, and each thread can update it if it finds a new best move (higher depth, or same depth and higher score). A search thread will discontinue an iteration if it is about to search a new move at the root with a depth lower than the depth of the current best move (but not if the depth is the same). This is very simple and works reasonably well. I haven't yet tried ABDADA.

lucasart wrote:
Interesting. My Lazy SMP is rather different. Here's how it works:

petero2 wrote:
Lazy SMP algorithm
[...]
Half of the helper threads search at the same depth as the master thread and the other half search at a one ply deeper depth.
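That split can be driven by nothing more than the parity of the helper's index. A minimal sketch of the idea (the function name is mine, not necessarily how Texel implements it):

```cpp
#include <cassert>

// Hypothetical sketch: helpers are numbered 1..N. Odd-numbered helpers
// search one ply deeper than the master, even-numbered helpers search
// at the master's depth, so roughly half end up on each depth.
int helperDepth(int masterDepth, int helperIndex) {
    return masterDepth + (helperIndex % 2);
}
```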
When a helper thread finishes its search it will send the search result to the master thread. The helper thread will then restart its current search at a one ply deeper depth. It is expected that this search will be aborted rather quickly. As soon as the master thread receives the helper thread search result, it will continue its algorithm, which causes it to call negaScoutRoot() again with different parameters, and negaScoutRoot() tells all helper threads to start searching using the new parameters.
The master thread is responsible for all UCI communication and when reporting search depth it uses its own search depth. This means that even if a "depth+1" helper thread finished first, the reported depth is "depth", even though in that case the "quality" of the search is "depth+1".
The master thread does not directly talk to all helper threads. Instead all search threads are arranged in a tree, where the master thread is the tree root node. Each tree node has at most 4 children. The tree is created to have minimal height and children are assigned from left to right. For example, if there are 16 search threads they are arranged as follows:

Code:
                  0
                  |
   +----------+-----------+------+
   |          |           |      |
   1          2           3      4
   |          |           |
+-+-+-+   +--+--+--+   +--+--+
| | | |   |  |  |  |   |  |  |
5 6 7 8   9  10 11 12  13 14 15

A search thread is only allowed to talk to its neighbors in the tree graph. When a search thread receives a command from a neighbor it acts on the command and also forwards it to its parent or children as appropriate.
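Since this is a complete 4-ary tree filled left to right, each thread can compute its neighbors arithmetically instead of storing them. A sketch of that arithmetic (helper names are mine, not Texel's actual code):

```cpp
#include <cassert>
#include <vector>

// Complete 4-ary tree: thread 0 is the root and children are assigned
// left to right, so the parent/child relationships are pure arithmetic.
int parentOf(int i) {            // only valid for i > 0 (the root has no parent)
    return (i - 1) / 4;
}

std::vector<int> childrenOf(int i, int numThreads) {
    std::vector<int> c;
    for (int j = 4 * i + 1; j <= 4 * i + 4 && j < numThreads; j++)
        c.push_back(j);
    return c;
}
```

With 16 threads this reproduces the tree above: node 0 owns 1-4, node 1 owns 5-8, node 2 owns 9-12, node 3 owns 13-15 and node 4 is a leaf.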
This thread tree structure is not strictly necessary for SMP search: SMP hardware does not have an excessive number of hardware threads, so the master thread could talk directly to all helper threads. However, for clusters the number of possible hardware threads can be very large (the largest supercomputers have more than one million cores), and in that case it would be very bad if the master thread tried to talk directly to all helper threads.
The algorithm is NUMA aware. This means that each search thread is locked to a specific NUMA node, and care is taken to allocate thread local memory on the NUMA node where the thread runs. The shared transposition table is allocated on NUMA node 0 if there is enough free memory on that node.
* main thread (the one that starts the main() function) does UCI I/O, and is always responsive.
* upon receiving a go command, it spawns a timer thread.
* the timer thread spawns N worker threads.
Timer and Worker(s) only live while the engine is searching. It turns out that spawning a few threads is so fast that the overhead is totally negligible. So you don't need to keep threads alive all the time or deal with condition variables. Simple thread_create() and thread_join() operations are all I ever need.
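The create/join-per-search lifecycle described above can be sketched with std::thread (the names and the spin-wait body are placeholders, not lucasart's actual code, which uses plain C thread wrappers):

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

std::atomic<bool> stop{false};

// Placeholder for the real iterative-deepening search; it just polls
// the stop flag, which is what a worker does between nodes anyway.
void searchRoot(int /*threadId*/) {
    while (!stop.load(std::memory_order_relaxed))
        std::this_thread::yield();
}

// Timer thread: spawns the workers, sleeps for the time budget, then
// signals them to stop and joins them. Workers exist only during the
// search; no condition variables are needed.
void timerThread(int numWorkers, std::chrono::milliseconds budget) {
    stop.store(false, std::memory_order_relaxed);
    std::vector<std::thread> workers;
    for (int i = 0; i < numWorkers; i++)
        workers.emplace_back(searchRoot, i);
    std::this_thread::sleep_for(budget);
    stop.store(true, std::memory_order_relaxed);
    for (auto& w : workers)
        w.join();
}
```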
Already the philosophy is quite different. All workers are equal. There is no concept of main thread among the workers. This simplifies the code, as there is no special case for the main thread.
UCI output (printing completed iterations with depth, score, pv etc.) is centralised in a global object that contains the last completed iteration (highest depth completed so far; we don't care which of the equal workers completed it). The object embeds a mutex, obviously. When a worker completes an iteration, it just tries to update the PV: the update happens only if the new iteration is deeper, and when it does, the result is printed to screen. The mutex guarantees there is no concurrency problem that would cause scrambled output.
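A minimal sketch of such a shared object (field and method names are illustrative, not lucasart's actual code):

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Shared "last completed iteration" record. All workers race to
// update it; the mutex serializes updates and the printing.
struct BestLine {
    std::mutex mtx;
    int depth = 0;
    int score = 0;
    std::string pv;

    // Returns true only if this iteration is deeper than anything
    // reported so far; the caller would then print the UCI info line.
    bool tryUpdate(int d, int s, const std::string& line) {
        std::lock_guard<std::mutex> lock(mtx);
        if (d <= depth)
            return false;   // an equal worker already reported this depth
        depth = d;
        score = s;
        pv = line;
        return true;
    }
};
```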
In the ID loop, before a worker starts an iteration, it checks how many workers are searching this iteration or above. If at least half the workers are searching >= depth, then there is no point in searching this iteration, so the worker tries the next one (and so on, until it finds a useful iteration to work on).
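That depth-selection rule can be sketched as a small pure function (illustrative only; the real code would read the other workers' depths from shared state):

```cpp
#include <cassert>
#include <vector>

// Pick the next iteration depth for one worker, given the depths all
// workers are currently searching. A candidate depth is skipped when
// at least half the workers are already searching at it or above.
int nextDepth(const std::vector<int>& workerDepths, int candidate) {
    const int n = static_cast<int>(workerDepths.size());
    for (;; candidate++) {
        int busy = 0;
        for (int d : workerDepths)
            if (d >= candidate)
                busy++;
        if (2 * busy < n)     // fewer than half busy: still a useful iteration
            return candidate;
    }
}
```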
Search is aborted in two ways:
* The Timer can signal all workers to stop immediately (using longjmp() in C, or an exception if it were C++).
* When a worker completes an iteration, it signals all workers working <= depth to stop immediately. When the affected workers stop, they return in the ID loop and look for a useful job to do, as previously described.
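Both abort paths reduce to setting per-worker flags that the search polls; a sketch under that assumption (structure and names are mine, and the flag-polling stands in for the longjmp described above):

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Each worker polls its own abort flag between nodes. The timer sets
// every flag; a worker that completes iteration `depth` sets only the
// flags of workers still searching that depth or below.
struct Worker {
    std::atomic<int>  searchingDepth{0};
    std::atomic<bool> abortRequested{false};
};

void timerStopAll(std::vector<Worker>& workers) {
    for (auto& w : workers)
        w.abortRequested.store(true, std::memory_order_relaxed);
}

// Called by the worker that just completed iteration `depth`: workers
// at <= depth are now doing obsolete work and should rejoin the ID loop.
void stopObsoleteWorkers(std::vector<Worker>& workers, int depth) {
    for (auto& w : workers)
        if (w.searchingDepth.load(std::memory_order_relaxed) <= depth)
            w.abortRequested.store(true, std::memory_order_relaxed);
}
```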
All this ensures that you have half the workers at each depth d and d+1. But it does so in a rather martial way. Workers are not allowed to just go their own way in the ID loop. We constantly want to keep them in rank and correct towards 1/2 at d, and 1/2 at d+1.
Next steps:
1/ The next thing to try is to use more complicated patterns than 1/2 at d and 1/2 at d+1. This has been shown to be very useful with many threads in Stockfish testing. But I cannot see any gain from varying this for 4 threads, and I do not have big machines at my disposal, so I stay with the simple scheme for now.
2/ The other thing to try is obviously ABDADA. Again, I really doubt I will gain anything on 4 threads above what I have. It probably needs 8 or 16 threads to show any benefit.
John