re-inventing the SMP wheel
Posted: Wed Aug 15, 2007 6:43 pm
Before looking into existing solutions for SMP parallellized search, I have been thinking a little myself how I would like to have a bunch of CPUs search a tree. This in order to be better able to pick a solution I like. Properties I value are simplicity and efficiency.
Actually a number of independent threads searching the tree normally, but sharing the hash table, already behave a lot like I would want:
* Threads do not have to be created and started from arbitrary tree nodes. Each thread starts normally from the root, and differentially updates its private state information (board, piece list, game/tree-path history, hash key). This is likely to be more efficient than copying the data all the time, as I plan to have lots of differentially-updated eval stuff in my engine. (Perhaps attack maps.)
* branches that have already been searched by one thread will cause hash pruning in the others, leading to a natural mechanism to distribute work.
Wat IMO would need improvement, though, is that threads will treat a position currently being searched by another thread in the same way as one that has not been searched at all. It just goes in after a hash miss (or unsatisfactory hit). This is not necessarily bad, as it will simply start helping the threads already in there to finish it quicker. But it is not necessarily optimal either. So I would like to improve on 'shared hash' by taking a more rational decision there. Threads must be able to see from the hash if nodes are currently 'open', and if so, get other info about it (e.g. how many threads are searching it, and perhaps which). Probably this info will come from a small secondary hash table, as only very few nodes will be open at any time, and it would be a waste of memory to reserve space for the required info in all hash entries. As I already do for the rep-draw information.
Now how should a thread ideally react if it enters a node that is already under search? For one, there is no need to generate moves for that position, as the (sorted) move list is already in memory somewhere (made by the other thread). Sharing that move list as read-only data might not be very expensive (if it comes from a shared L2). Then it has a choice: it could intentionally follow a thread that is searching one of the moves from this position (there must be one, or the node would not have been marked 'open'). Or it could intentionally split, and start working on a move that no one is searching yet. Or it could decide to back off from the node, and return to the previous level without a result, to try a split there. And it would face the same set of choices each time it returned from searching (or backing off) a move.
Splitting is a bad idea if one of the moves currently being searched is likely to increase alpha. This is typically the case for the first (or the first few) moves in a PV or cut node. In fact you never know if a move being searched will up alpha, and moves that do typically take a longer time to search. So there always is a risk when splitting. While helping out with the currently searched move is alays safe.
OTOH, if the remaining search depth is too low, and many threads are already searching the tree, there is no meaningful way to go in and help to get the result quicker. E.g. a 3-ply search from an all-node will, for each move, probably take the cut-move reply from the hash with near-zero effort, and find all non-captures futile at 1 ply, only generating captures. There is not much there that can be distributed over two threads there. You don't want multiple threads to end up in a node where the eval will cause a stand-pat cutoff.
Thus it seems wise to prefer splitting over helping below a certain remaining depth, if possible. Where exactly is dependent on the number of availabel threads, you could have rules of the type "do not enter a depth=N node with more than MaxThread[N] threads". That way split points would only move further towards the root if the subtrees near the leaves are already saturated with threads. Note that threads will also be squeezed out of nodes that become super-saturated because the remaining number of moves to be searched drops. By having the split points close to the leaves, you enhance the 'cross fertilization' of hash hits in the local (always-replace) hash-table entries.
First thread returning a score of the last move to be searched can determine the final minimax score of the node, write that score in the hash and close the hash entry. Other threads returning will use the score from the hash, because the stack frame containing the common maximum might already have been deallocated by the thread that entered the node first (if it does not leave it last).
It sounds attractive to have a way to abort searches in progress, if in a splitpoint one of the threads stumbles on an unexpected cut-move. I wonder if this would have a large impact, though. Unexpected fail highs usually take a much longer search than the expected fail lows, so if one move in an expected all-node fails high, the search of all other moves that did fail low is likely to have finished already. The involved threads should preferrably be drawn to the move that is in the process of failing high (rather than back off to a lower level, or start new parallel searches). The maximum number of threads that could be employed usefully for a search of a certain depth will depend on if it is a cutt-move or a fail low. If the original assumption was fail-low, leading to some threads effecting a split and start searching brothers, the fact that the daughter node does not find a cut-move amongst the first 3 moves might be used to set a flag in the parent to indicate that this move is now expected to become a cut-move, and as such become unsaturated. Helping in the search of a node that is being searched in general has priority over searching brothers (as long as it is not saturated yet). Note that this requires the search to be unusual in the sense that it should be possible for a thread to search moves that were skipped before.
All this seems rather straightforward to program, and very efficient.
Actually a number of independent threads searching the tree normally, but sharing the hash table, already behave a lot like I would want:
* Threads do not have to be created and started from arbitrary tree nodes. Each thread starts normally from the root, and differentially updates its private state information (board, piece list, game/tree-path history, hash key). This is likely to be more efficient than copying the data all the time, as I plan to have lots of differentially-updated eval stuff in my engine. (Perhaps attack maps.)
* branches that have already been searched by one thread will cause hash pruning in the others, leading to a natural mechanism to distribute work.
Wat IMO would need improvement, though, is that threads will treat a position currently being searched by another thread in the same way as one that has not been searched at all. It just goes in after a hash miss (or unsatisfactory hit). This is not necessarily bad, as it will simply start helping the threads already in there to finish it quicker. But it is not necessarily optimal either. So I would like to improve on 'shared hash' by taking a more rational decision there. Threads must be able to see from the hash if nodes are currently 'open', and if so, get other info about it (e.g. how many threads are searching it, and perhaps which). Probably this info will come from a small secondary hash table, as only very few nodes will be open at any time, and it would be a waste of memory to reserve space for the required info in all hash entries. As I already do for the rep-draw information.
Now how should a thread ideally react if it enters a node that is already under search? For one, there is no need to generate moves for that position, as the (sorted) move list is already in memory somewhere (made by the other thread). Sharing that move list as read-only data might not be very expensive (if it comes from a shared L2). Then it has a choice: it could intentionally follow a thread that is searching one of the moves from this position (there must be one, or the node would not have been marked 'open'). Or it could intentionally split, and start working on a move that no one is searching yet. Or it could decide to back off from the node, and return to the previous level without a result, to try a split there. And it would face the same set of choices each time it returned from searching (or backing off) a move.
Splitting is a bad idea if one of the moves currently being searched is likely to increase alpha. This is typically the case for the first (or the first few) moves in a PV or cut node. In fact you never know if a move being searched will up alpha, and moves that do typically take a longer time to search. So there always is a risk when splitting. While helping out with the currently searched move is alays safe.
OTOH, if the remaining search depth is too low, and many threads are already searching the tree, there is no meaningful way to go in and help to get the result quicker. E.g. a 3-ply search from an all-node will, for each move, probably take the cut-move reply from the hash with near-zero effort, and find all non-captures futile at 1 ply, only generating captures. There is not much there that can be distributed over two threads there. You don't want multiple threads to end up in a node where the eval will cause a stand-pat cutoff.
Thus it seems wise to prefer splitting over helping below a certain remaining depth, if possible. Where exactly is dependent on the number of availabel threads, you could have rules of the type "do not enter a depth=N node with more than MaxThread[N] threads". That way split points would only move further towards the root if the subtrees near the leaves are already saturated with threads. Note that threads will also be squeezed out of nodes that become super-saturated because the remaining number of moves to be searched drops. By having the split points close to the leaves, you enhance the 'cross fertilization' of hash hits in the local (always-replace) hash-table entries.
First thread returning a score of the last move to be searched can determine the final minimax score of the node, write that score in the hash and close the hash entry. Other threads returning will use the score from the hash, because the stack frame containing the common maximum might already have been deallocated by the thread that entered the node first (if it does not leave it last).
It sounds attractive to have a way to abort searches in progress, if in a splitpoint one of the threads stumbles on an unexpected cut-move. I wonder if this would have a large impact, though. Unexpected fail highs usually take a much longer search than the expected fail lows, so if one move in an expected all-node fails high, the search of all other moves that did fail low is likely to have finished already. The involved threads should preferrably be drawn to the move that is in the process of failing high (rather than back off to a lower level, or start new parallel searches). The maximum number of threads that could be employed usefully for a search of a certain depth will depend on if it is a cutt-move or a fail low. If the original assumption was fail-low, leading to some threads effecting a split and start searching brothers, the fact that the daughter node does not find a cut-move amongst the first 3 moves might be used to set a flag in the parent to indicate that this move is now expected to become a cut-move, and as such become unsaturated. Helping in the search of a node that is being searched in general has priority over searching brothers (as long as it is not saturated yet). Note that this requires the search to be unusual in the sense that it should be possible for a thread to search moves that were skipped before.
All this seems rather straightforward to program, and very efficient.