re-inventing the SMP wheel

hgm · Post by **hgm** » Wed Aug 15, 2007 6:43 pm

Before looking into existing solutions for SMP parallellized search, I have been thinking a little myself how I would like to have a bunch of CPUs search a tree. This in order to be better able to pick a solution I like. Properties I value are simplicity and efficiency.

Actually a number of independent threads searching the tree normally, but sharing the hash table, already behave a lot like I would want:
* Threads do not have to be created and started from arbitrary tree nodes. Each thread starts normally from the root, and differentially updates its private state information (board, piece list, game/tree-path history, hash key). This is likely to be more efficient than copying the data all the time, as I plan to have lots of differentially-updated eval stuff in my engine. (Perhaps attack maps.)
* branches that have already been searched by one thread will cause hash pruning in the others, leading to a natural mechanism to distribute work.

Wat IMO would need improvement, though, is that threads will treat a position currently being searched by another thread in the same way as one that has not been searched at all. It just goes in after a hash miss (or unsatisfactory hit). This is not necessarily bad, as it will simply start helping the threads already in there to finish it quicker. But it is not necessarily optimal either. So I would like to improve on 'shared hash' by taking a more rational decision there. Threads must be able to see from the hash if nodes are currently 'open', and if so, get other info about it (e.g. how many threads are searching it, and perhaps which). Probably this info will come from a small secondary hash table, as only very few nodes will be open at any time, and it would be a waste of memory to reserve space for the required info in all hash entries. As I already do for the rep-draw information.

Now how should a thread ideally react if it enters a node that is already under search? For one, there is no need to generate moves for that position, as the (sorted) move list is already in memory somewhere (made by the other thread). Sharing that move list as read-only data might not be very expensive (if it comes from a shared L2). Then it has a choice: it could intentionally follow a thread that is searching one of the moves from this position (there must be one, or the node would not have been marked 'open'). Or it could intentionally split, and start working on a move that no one is searching yet. Or it could decide to back off from the node, and return to the previous level without a result, to try a split there. And it would face the same set of choices each time it returned from searching (or backing off) a move.

Splitting is a bad idea if one of the moves currently being searched is likely to increase alpha. This is typically the case for the first (or the first few) moves in a PV or cut node. In fact you never know if a move being searched will up alpha, and moves that do typically take a longer time to search. So there always is a risk when splitting. While helping out with the currently searched move is alays safe.

OTOH, if the remaining search depth is too low, and many threads are already searching the tree, there is no meaningful way to go in and help to get the result quicker. E.g. a 3-ply search from an all-node will, for each move, probably take the cut-move reply from the hash with near-zero effort, and find all non-captures futile at 1 ply, only generating captures. There is not much there that can be distributed over two threads there. You don't want multiple threads to end up in a node where the eval will cause a stand-pat cutoff.

Thus it seems wise to prefer splitting over helping below a certain remaining depth, if possible. Where exactly is dependent on the number of availabel threads, you could have rules of the type "do not enter a depth=N node with more than MaxThread[N] threads". That way split points would only move further towards the root if the subtrees near the leaves are already saturated with threads. Note that threads will also be squeezed out of nodes that become super-saturated because the remaining number of moves to be searched drops. By having the split points close to the leaves, you enhance the 'cross fertilization' of hash hits in the local (always-replace) hash-table entries.

First thread returning a score of the last move to be searched can determine the final minimax score of the node, write that score in the hash and close the hash entry. Other threads returning will use the score from the hash, because the stack frame containing the common maximum might already have been deallocated by the thread that entered the node first (if it does not leave it last).

It sounds attractive to have a way to abort searches in progress, if in a splitpoint one of the threads stumbles on an unexpected cut-move. I wonder if this would have a large impact, though. Unexpected fail highs usually take a much longer search than the expected fail lows, so if one move in an expected all-node fails high, the search of all other moves that did fail low is likely to have finished already. The involved threads should preferrably be drawn to the move that is in the process of failing high (rather than back off to a lower level, or start new parallel searches). The maximum number of threads that could be employed usefully for a search of a certain depth will depend on if it is a cutt-move or a fail low. If the original assumption was fail-low, leading to some threads effecting a split and start searching brothers, the fact that the daughter node does not find a cut-move amongst the first 3 moves might be used to set a flag in the parent to indicate that this move is now expected to become a cut-move, and as such become unsaturated. Helping in the search of a node that is being searched in general has priority over searching brothers (as long as it is not saturated yet). Note that this requires the search to be unusual in the sense that it should be possible for a thread to search moves that were skipped before.

All this seems rather straightforward to program, and very efficient.

Aleks Peshkov · Post by **Aleks Peshkov** » Wed Aug 15, 2007 7:26 pm

I can only notice that thread communication is generally slow and you suggest some extra form of synchronization then needed by traditional methods.

Your posts are always innovative, but your writing style is IMHO the most difficult to understand in whole CCC, because it is too verbose.

Michael Sherwin · Post by **Michael Sherwin** » Wed Aug 15, 2007 7:47 pm

This subject is difficult for me. My initial thoughts on SMP was similar to yours, I think. The only thing that I can comment on is the idea of starting each thread at the root--supposedly on consecutive moves(?). The first move at the root on average takes a lot more time than the others, does it not? The remaining root moves are searched quicker, because, of the information garnered from the first. Assigning processors to the remaining moves before the first has been searched will cause millions of extra nodes to be searched. I would suggest using all processors on the first move until it is completed and then only splitting at the root, but, only if the score for the first root move did not fall. If it does fall, then I would use all processors on the next move as well. $.02

bob · Post by **bob** » Wed Aug 15, 2007 9:28 pm

There is a major flaw with this unsynchronized approach that absolutely kills it. What do you do if two or more nodes start at the same position at the same time? They get no hash hit, they all search the same nodes in the same order, they provide no increased performance.

The idea is reasonable to consider, but implementation makes it perform poorly. The more processors you have, the more likely it will be that processors are doing nothing more than replicating the work in progress on other processors. That's why the good algorithms share a move list and make sure that this does not happen very often (it still happens when two different branches transpose to the same position and both processors now do search the same identical tree in parallel, wasting time, when the serial search would get a hash hit from one of the two pathways. But that's bad enough without letting it happen everywhere...

Sharing data is not necessarily bad, as one can control the depth where splits are allowed to limit the synchronization overhead to something that is acceptable.

Pradu · Post by **Pradu** » Wed Aug 15, 2007 10:42 pm

bob wrote:There is a major flaw with this unsynchronized approach that absolutely kills it. What do you do if two or more nodes start at the same position at the same time? They get no hash hit, they all search the same nodes in the same order, they provide no increased performance.

The idea is reasonable to consider, but implementation makes it perform poorly. The more processors you have, the more likely it will be that processors are doing nothing more than replicating the work in progress on other processors. That's why the good algorithms share a move list and make sure that this does not happen very often (it still happens when two different branches transpose to the same position and both processors now do search the same identical tree in parallel, wasting time, when the serial search would get a hash hit from one of the two pathways. But that's bad enough without letting it happen everywhere...

Sharing data is not necessarily bad, as one can control the depth where splits are allowed to limit the synchronization overhead to something that is acceptable.

HG wrote:Splitting is a bad idea if one of the moves currently being searched is likely to increase alpha. This is typically the case for the first (or the first few) moves in a PV or cut node. In fact you never know if a move being searched will up alpha, and moves that do typically take a longer time to search. So there always is a risk when splitting. While helping out with the currently searched move is alays safe.

I believe Prof. Hyatt is right when he says splitting is the better way to go. If you want to protect against wasted work like the "risk of splitting" you talked about, you have to develop a really good splitting strategy. You will always have a risk of wasted work, but if you make that risk to go to zero, you will get no speedup. I believe you mentioned that you have the risk that alpha can be improved. If you want to minimize this type of risk you can try splitting only at nodes where alpha+1 == beta (eg. only on verification windows of the scout searches) where it will be impossible to improve the bounds. However, you always face the risk of getting a cutoff.

Also if you are worried about the excessive copying part when splitting, it isn't necessary to implement DTS by copying the entire search stack for every thread at the split point. Instead you could take a "work-stealing" approach where after the thread that enters a node runs out of work, it can ask one of the helper threads to transfer the search stack it created from the split node and have that helper thread go free. There are many ways to make the same thing work.

hgm · Post by **hgm** » Fri Aug 17, 2007 9:46 am

bob wrote:There is a major flaw with this unsynchronized approach that absolutely kills it. What do you do if two or more nodes start at the same position at the same time? They get no hash hit, they all search the same nodes in the same order, they provide no increased performance.

This is exactly what I prevent by the proposed enhancement over the simple shared-hash approach. Two positions cannot start at the same time, as marking a node as open for search (in the small shared hash-table extension) would be a critical section, and the second program to attempt it would thus be aware it is entering a branch that is already being searched, and would only do that if it thinks that is a good thing to do. (E.g. because that branch is the eldest brother in a deep search.)

Unfortunately I suffer from tendonitis, and should not type if I don't want my arms to fall off. But I will get back to this later.

mjlef · Post by **mjlef** » Fri Aug 17, 2007 10:11 am

It seems whatever people are doing with SMP programs is not as effective as I would have thought. If I recall, a doubling of processor speed should be worth around 60 ELO. No parallel search will be perfect, since search effort is wasted a bit, but typical speedups I have seem are around 1.7 for 2 processors and 3 or so for 4 processors. But look at what program have been actually doing in terms of ELO:

Rybka 2.3.2a
Processors ELO Diff
1 3071 0
2 3100 29
4 3117 46

I would have expected the 2 processor version to have an ELO increase of at least 50, and a 4 processor of 75 or so, but the actual results are less. I hear Zapchess does this better, but I do not have enough data with different processors on the same version to prove this.

Maybe Anthony will write a nice paper on his parallel search implementation for us (or has he already?)

->Mark

Uri Blass · Post by **Uri Blass** » Fri Aug 17, 2007 10:41 am

mjlef wrote:It seems whatever people are doing with SMP programs is not as effective as I would have thought. If I recall, a doubling of processor speed should be worth around 60 ELO. No parallel search will be perfect, since search effort is wasted a bit, but typical speedups I have seem are around 1.7 for 2 processors and 3 or so for 4 processors. But look at what program have been actually doing in terms of ELO:

Rybka 2.3.2a
Processors ELO Diff
1 3071 0
2 3100 29
4 3117 46

I would have expected the 2 processor version to have an ELO increase of at least 50, and a 4 processor of 75 or so, but the actual results are less. I hear Zapchess does this better, but I do not have enough data with different processors on the same version to prove this.

Maybe Anthony will write a nice paper on his parallel search implementation for us (or has he already?)

->Mark

I think that looking at rybka is misleading because we cannot know the real rybka improvement when rybka is only tested against weak opponents and no rybka-rybka games.

It is possible that rybka has problems against weak opponents.

Here are some other results from the same list:

9 Zap!Chess Zanzibar x64 4CPU 2994 27 27 339 56.2 % 2950 46.3 %
18 Zap!Chess Zanzibar x64 2CPU 2930 13 13 1694 56.9 % 2881 42.7 %

+64 elo

16 Naum 2.2 x64 4CPU 2937 25 26 384 48.7 % 2946 46.4 %
32 Naum 2.2 x64 2CPU 2880 16 16 984 52.3 % 2864 42.6 %

+57 elo

21 Hiarcs 11.1 4CPU 2912 24 24 468 47.8 % 2928 43.8 %
34 Hiarcs 11.1 2CPU 2874 15 15 1224 48.1 % 2887 41.0 %

+38 elo

22 Naum 2.1 x64 4CPU 2912 27 27 350 44.7 % 2949 44.9 %
47 Naum 2.1 x64 2CPU 2840 12 12 1941 48.6 % 2850 43.3 %

+72 elo

56 Spike 1.2 Turin 2CPU 2821 11 11 2287 45.3 % 2854 39.8 %
80 Spike 1.2 Turin 2772 11 11 2307 53.3 % 2749 37.8 %

+49 elo

94 Glaurung 1.2.1 x64 2CPU 2744 13 13 1686 47.0 % 2765 34.9 %
133 Glaurung 1.2.1 1CPU 2674 17 17 1079 48.9 % 2681 31.3 %

+70 elo

bob · Post by **bob** » Fri Aug 17, 2007 5:33 pm

hgm wrote:
bob wrote:There is a major flaw with this unsynchronized approach that absolutely kills it. What do you do if two or more nodes start at the same position at the same time? They get no hash hit, they all search the same nodes in the same order, they provide no increased performance.
This is exactly what I prevent by the proposed enhancement over the simple shared-hash approach. Two positions cannot start at the same time, as marking a node as open for search (in the small shared hash-table extension) would be a critical section, and the second program to attempt it would thus be aware it is entering a branch that is already being searched, and would only do that if it thinks that is a good thing to do. (E.g. because that branch is the eldest brother in a deep search.)

Unfortunately I suffer from tendonitis, and should not type if I don't want my arms to fall off. But I will get back to this later.

This "small hash table" would seem to have to be as big as the regular hash table. _every_ "open" node needs to be flagged or this problem will come up everywhere. The problem with this becomes even more critical when you now have to write into this table at the beginning of a node, in addition to probing the real hash table, etc. Would seem that the memory traffic would be unbearable. Not to mention that your "critical section" is going to be hit a _bunch_ (the very reason locking the hash table is a bad idea because of the performance hit)...

hgm · Post by **hgm** » Fri Aug 17, 2007 6:26 pm

Well, with 100 threads each searching upto 40 ply deep, there would only be 4000 open nodes max. And as I would prefer split points close to the leaves, most of the lines would largely coincide, so 400 is a more likely number (not to mention that my number of threads is more likely to be 8 than 100...) My main hash is significantly larger then 400 entries (about 150,000 times?). This 400-entry table easily fits in L2. Sharing L2 data is not so expensive. Of course I would not put any L2-miss processing in a critical section, or atempt busy waiting for a critical section that could take long.

re-inventing the SMP wheel

re-inventing the SMP wheel

Re: re-inventing the SMP wheel

Re: re-inventing the SMP wheel

Re: re-inventing the SMP wheel

Re: re-inventing the SMP wheel

Re: re-inventing the SMP wheel

Re: re-inventing the SMP wheel

Re: re-inventing the SMP wheel

Re: re-inventing the SMP wheel

Re: re-inventing the SMP wheel