SMP and questions

Discussion of chess software programming and technical issues.

Moderator: Ras

syzygy
Posts: 5697
Joined: Tue Feb 28, 2012 11:56 pm

Re: SMP and questions

Post by syzygy »

bob wrote:To properly "update" alpha in a recursive negamax search is difficult, because it has to be replicated downward through the tree, which is not so easy in a recursive implementation. But I tested this in the Cray Blitz code since it was non-recursive, and I found exactly no measurable benefit... In such a case I tend to avoid such code, because it introduces potential race conditions that can be problematic.
Yes, going all the way and updating alpha in subtrees as well seems considerably more invasive.
Not saying it is good or bad, right or wrong, or that my way is the only way to go. I simply reported my test results, which said "not worth anything"...
I'm surprised that it wouldn't help, although Joona has a point when he says that small aspiration windows limit the gain (and I suppose re-searches with larger windows will usually benefit from better move ordering and therefore less search overhead anyway).

Maybe I'll try to measure it for my engine. Of course these things are not easy to measure...
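
For what it's worth, here is a minimal sketch of the cheaper variant of this idea: updating alpha only at the split point itself, with every attached thread re-reading the shared bound before it picks up its next move, instead of pushing the improved bound down into subtrees that are already being searched. All names are hypothetical; this is not Crafty's, Cray Blitz's or any other engine's actual code.

Code: Select all

#include <atomic>
#include <mutex>

// Placeholder for the engine's normal recursive search.
int search(int move, int alpha, int beta, int depth);

struct SplitPoint {
    std::atomic<int> alpha;   // best score found so far by any attached thread
    int beta;
    int depth;
    std::mutex moveLock;      // protects the shared move cursor
    int moves[256];
    int moveCount = 0;
    int nextMove  = 0;
};

// Every thread attached to the split point runs this loop.
void searchSplitPoint(SplitPoint& sp) {
    for (;;) {
        int move;
        {
            std::lock_guard<std::mutex> lk(sp.moveLock);
            if (sp.nextMove >= sp.moveCount)
                return;                          // nothing left to hand out
            move = sp.moves[sp.nextMove++];      // claim one move
        }
        int a = sp.alpha.load();                 // pick up the latest shared bound
        if (a >= sp.beta)
            return;                              // a sibling already failed high
        int score = -search(move, -sp.beta, -a, sp.depth - 1);

        // Raise the shared alpha if this move improved it.
        int cur = sp.alpha.load();
        while (score > cur && !sp.alpha.compare_exchange_weak(cur, score)) {}
    }
}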
syzygy
Posts: 5697
Joined: Tue Feb 28, 2012 11:56 pm

Re: SMP and questions

Post by syzygy »

bob wrote:
Edsel Apostol wrote:Thanks for the input. My NPS speed-up is now somewhat decent though not as good as Crafty or other top programs.

I have not studied DTS yet. In a nutshell, what should be the difference compared to YBW? Why is it not easy to implement in a recursive search framework?
For DTS, you make split decisions anywhere in the tree, as opposed to only at the current node as in my current YBW implementation. It is not easy, if you are at ply=16, to split at ply=8: the data structures for ply=8 are already allocated locally somewhere but need to become global. So it simply gets VERY complicated. An iterated search, as opposed to a recursive search, solves this. The code is much uglier, however, and that's the reason I chose not to do DTS a second time when I wrote Crafty; I liked the cleanliness of the recursive implementation...
What you need is just the moves leading from the root to the candidate split point (plus alpha, beta, depth, a move list counter to keep track of which moves have already been searched, and access to the moves that still need to be searched), and of course you need to know at which plies it makes sense to split. So instead of looking for idle threads when a node is ready to be split, the search can set a flag to mark the node as a candidate split point and copy alpha, beta and depth to a predetermined location. Threads that are idle or become idle later can then join at this node when there is no better candidate available.

What I believe is very complicated (nearly impossible) to get right with a recursive search is to hand over responsibility for backing up the search result to the parent node, so you are restricted to helpful master. But helpful master also has its advantages: each thread is now always searching within a single tree, so only needs one set of data structures.
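
A minimal sketch of the flag-and-copy scheme described above. The slot layout and names are invented for illustration; the point is only that the owning thread publishes alpha, beta, depth and the path from the root, and never has to go looking for idle threads itself.

Code: Select all

#include <atomic>
#include <cstring>

constexpr int MAX_PLY   = 128;
constexpr int NUM_SLOTS = 16;     // e.g. one candidate slot per thread (assumption)

struct Candidate {
    std::atomic<bool> active{false};
    int alpha, beta, depth, ply;
    int movesFromRoot[MAX_PLY];   // path so that an idle thread can reach the node
};

Candidate candidates[NUM_SLOTS];

// Owning thread: publish this node once the "young brothers" have been searched.
void publishCandidate(Candidate& c, int alpha, int beta, int depth,
                      int ply, const int* pathFromRoot) {
    c.alpha = alpha;
    c.beta  = beta;
    c.depth = depth;
    c.ply   = ply;
    std::memcpy(c.movesFromRoot, pathFromRoot, ply * sizeof(int));
    c.active.store(true, std::memory_order_release);
}

// Idle thread: pick the published candidate with the most remaining depth
// (closest to the root), then replay its path and join the search there.
Candidate* pickCandidate() {
    Candidate* best = nullptr;
    for (Candidate& c : candidates)
        if (c.active.load(std::memory_order_acquire)
            && (!best || c.depth > best->depth))
            best = &c;
    return best;
}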
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP and questions

Post by bob »

syzygy wrote:
Edsel Apostol wrote:I have not studied DTS yet. In a nutshell, what should be the difference compared to YBW? Why is it not easy to implement in a recursive search framework?
From what I understand of DTS, it allows the master of a split node (i.e. the thread that originally started searching it) to transfer responsibility for the node to one of the slave threads when the master has run out of work but some of the slave threads are still searching their moves. The slave newly responsible then has to back up the result to the parent node. With a recursive search you cannot really do this, because only the stack of the master allows a simple backing up to the parent node.

With an iterative search you don't use the stack to keep track of where you are in the tree, but some sort of global data structure that the master can pass to the slave.

Without DTS, a master that has run out of work can only be useful as a "helpful master", i.e. as a slave to one of its own slaves. But as long as those have work to do, that doesn't seem so bad.

I'm not completely sure where the supposed advantage of DTS lies. The possibilities are:
- reduced idle time and/or splitting overhead, or
- reduced search overhead.
The most important difference is that YBW splits HERE or not at all, while DTS can split ANYWHERE. That gives you a better opportunity to find a safer place to split the tree, instead of being forced to split HERE or else defer the split until later.


Reduced idle time can only be a significant advantage if there is considerable idle time when using a recursive search. I haven't tried to measure idle time in my implementation, but from my nps measurements I can tell that it must be negligible. My splitting overhead seems negligible for the same reason.

That leaves reduced search overhead. It might be that splitting closer to the root in general reduces search overhead, and in that case allowing idle masters to help closer to the root instead of restricting it to the subtree of one of its slaves would indeed help. But would it help much? I kind of doubt it.

From Hyatt's paper I understand that in DTS, idle threads actively search for a suitable split point "candidate" in one of the trees being searched by the active threads. But I do that as well in my recursive YBW implementation.

Hmm, here is something I definitely don't do (this is about going from the PV node at ply=N to the PV node at ply=N-1):
Since we are using the best-known move ordering heuristics in the alpha/beta tree, it is likely that by the time the first few moves at ply=N have been searched, we know the actual score for this node, as moves further down the list should be worse and not improve the score further. Then, allowing idle processors to back up to ply=N-1 and start searching there before ply=N is completed is probably safe, and this is what DTS does. On the rare occasions when the ply=N search completes, and the last branch produces an even better score, the processors that have backed up to ply=N-1 already are searching with a less efficient bound. DTS notices this and each processor already searching at N-1 is given the new (correct) bound as soon as it is known. This sounds easy and clean, but it can cause some interesting problems, because this new bound might mean most of what a processor has been busy doing can be thrown away since the new bound would have caused a cutoff much sooner than the original bound used.
This seems contrary to YBW, but this is about PV nodes and it might be OK for them. This is an interesting approach to reducing idle time and/or splitting overhead (to avoid threads splitting close to the leaves near the end of the search of the PV node at ply=N), but I'm not convinced that there is much to gain there, and it can increase search overhead. In any case, I think it is possible to do this in a recursive search as well... (tricky, but doing this in an iterative search might be as tricky).

I think it should be possible to implement DTS without most of the copying and messaging between processors that is described in the paper. That would be necessary to make it competitive on today's processors.
Again, the messages and such are generally "small potatoes" in terms of performance impact. The real advantage is that you get to choose a split point anywhere in any search that is currently active. And the split overhead is very low to boot, since all the data is accessed as a large global array rather than via pointers. Think about this: at which of these two choices would you rather split: (a) the current ply, 6 plies from the tip, where you just completed the first move; or (b) a position 4 plies from the root where you have already searched 4 moves without a fail high? I'm more confident splitting in (b), but it is difficult with a recursive search.
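
Whichever framework is used, the preference expressed in (a) versus (b) boils down to a small scoring function that idle threads (or the split code) can use to rank candidate split points. The weights below are invented for illustration and are not taken from Crafty or the DTS paper.

Code: Select all

struct NodeInfo {
    int depth;          // remaining search depth at this node
    int movesSearched;  // moves already completed without a fail high
};

// Higher score = safer place to add helper threads.
int splitScore(const NodeInfo& n) {
    if (n.movesSearched == 0)              // young brothers not searched yet
        return -1;                         // don't split here at all
    return 4 * n.depth + n.movesSearched;  // favour depth, then confirmed ALL-ness
}

// With these made-up weights, choice (a) -- 6 plies of remaining depth, 1 move
// searched -- scores 25, while (b) -- say 12 plies remaining near the root, 4
// moves searched -- scores 52, so (b) wins.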
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP and questions

Post by bob »

syzygy wrote:
bob wrote:
Edsel Apostol wrote:Thanks for the input. My NPS speed-up is now somewhat decent though not as good as Crafty or other top programs.

I have not studied DTS yet. In a nutshell, what should be the difference compared to YBW? Why is it not easy to implement in a recursive search framework?
For DTS, you make split decisions anywhere in the tree, as opposed to only at the current node as in my current YBW implementation. It is not easy, if you are at ply=16, to split at ply=8: the data structures for ply=8 are already allocated locally somewhere but need to become global. So it simply gets VERY complicated. An iterated search, as opposed to a recursive search, solves this. The code is much uglier, however, and that's the reason I chose not to do DTS a second time when I wrote Crafty; I liked the cleanliness of the recursive implementation...
What you need is just the moves leading from the root to the candidate split point (plus alpha, beta, depth, a move list counter to keep track of which moves have already been searched, and access to the moves that still need to be searched), and of course you need to know at which plies it makes sense to split. So instead of looking for idle threads when a node is ready to be split, the search can set a flag to mark the node as a candidate split point and copy alpha, beta and depth to a predetermined location. Threads that are idle or become idle later can then join at this node when there is no better candidate available.

What I believe is very complicated (nearly impossible) to get right with a recursive search is to hand over responsibility for backing up the search result to the parent node, so you are restricted to helpful master. But helpful master also has its advantages: each thread is now always searching within a single tree, so only needs one set of data structures.
The problem lies in the recursive stack. Things are set up with no split being done, so there is no easy way to back up and split at an earlier ply, unless you are willing to do the complicated code to really unwind back to the ply where you want to split, do the split, then to un-unwind back to where you were and continue. Doable, but horribly messy. If I am going to do a split, I need to link in a split block for each child, plus a split block for the parent, but it is not easy to get to that information to change it, since all I currently have is local data on the stack in the call tree.

I agree that the "anyone can back up through a ply" is very difficult in recursive. And while theory says nothing is impossible, common sense says it is VERY difficult, at best.
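
A rough illustration of why the iterated layout makes this so much easier (simplified, not Cray Blitz's or Crafty's actual data structures): when the per-ply state is ordinary, globally visible data instead of C stack frames, attaching a split block to an earlier ply of another thread's search is little more than a pointer assignment.

Code: Select all

constexpr int MAX_PLY   = 128;
constexpr int MAX_MOVES = 256;

struct PlyState {                 // one entry per ply, globally addressable
    int alpha, beta, depth;
    int moves[MAX_MOVES];
    int moveCount, nextMove;
    struct SplitBlock* split;     // nullptr until this ply becomes a split point
};

struct SplitBlock {
    PlyState* node;               // the ply being shared
    int workers;                  // threads currently attached
};

struct SearchState {
    PlyState plies[MAX_PLY];      // the whole "stack" is just data
    int currentPly;
};

// Any thread could now do something like
//     splitBlock->node = &otherSearch.plies[8];   // split at ply 8 of another search
// whereas with a recursive search that state is buried in stack frames and
// cannot be reached, or safely mutated, from outside.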
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: SMP and questions

Post by Don »

bob wrote:
syzygy wrote:
bob wrote:
Edsel Apostol wrote:Thanks for the input. My NPS speed-up is now somewhat decent though not as good as Crafty or other top programs.

I have not studied DTS yet. In a nutshell, what should be the difference compared to YBW? Why is it not easy to implement in a recursive search framework?
For DTS, you make split decisions anywhere in the tree, as opposed to only at the current node as in my current YBW implementation. It is not easy, if you are at ply=16, to split at ply=8: the data structures for ply=8 are already allocated locally somewhere but need to become global. So it simply gets VERY complicated. An iterated search, as opposed to a recursive search, solves this. The code is much uglier, however, and that's the reason I chose not to do DTS a second time when I wrote Crafty; I liked the cleanliness of the recursive implementation...
What you need is just the moves leading from the root to the candidate split point (plus alpha, beta, depth, a move list counter to keep track of which moves have already been searched, and access to the moves that still need to be searched), and of course you need to know at which plies it makes sense to split. So instead of looking for idle threads when a node is ready to be split, the search can set a flag to mark the node as a candidate split point and copy alpha, beta and depth to a predetermined location. Threads that are idle or become idle later can then join at this node when there is no better candidate available.

What I believe is very complicated (nearly impossible) to get right with a recursive search is to hand over responsibility for backing up the search result to the parent node, so you are restricted to helpful master. But helpful master also has its advantages: each thread is now always searching within a single tree, so only needs one set of data structures.
The problem lies in the recursive stack. Things are set up with no split being done, so there is no easy way to back up and split at an earlier ply, unless you are willing to do the complicated code to really unwind back to the ply where you want to split, do the split, then to un-unwind back to where you were and continue. Doable, but horribly messy. If I am going to do a split, I need to link in a split block for each child, plus a split block for the parent, but it is not easy to get to that information to change it, since all I currently have is local data on the stack in the call tree.

I agree that the "anyone can back up through a ply" is very difficult in recursive. And while theory says nothing is impossible, common sense says it is VERY difficult, at best.
There is no reason the search has to be explicitly recursive though.

I wrote a checkers program for the Palm Pilot years ago that had to be re-entrant, and it turned out not to be that bad if you organize the code right. There were very few entry points that you had to worry about. You can also hide a lot of the ugliness with macros and end up with something that actually looks pretty good, although you still lose the beauty and elegance of explicit recursion.
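
For anyone who hasn't tried it, here is a stripped-down sketch of what that looks like: plain alpha-beta driven by an explicit stack of frames instead of C recursion. Move ordering, hashing and mate scores are omitted, and the helper functions are placeholders for the engine's own routines.

Code: Select all

#include <cstddef>
#include <vector>

constexpr int INF = 32000;

int  evaluate();                         // engine-provided placeholders
std::vector<int> generateMoves();
void makeMove(int m);
void unmakeMove(int m);

struct Frame {
    int alpha, beta, depth, best;
    std::size_t next;                    // index of the next move to try
    std::vector<int> moves;
};

int iterativeAlphaBeta(int alpha, int beta, int depth) {
    std::vector<Frame> stack;
    stack.push_back({alpha, beta, depth, -INF, 0, generateMoves()});
    int childScore = 0;
    bool returning = false;

    while (true) {
        Frame& f = stack.back();

        if (returning) {                         // a child frame just finished
            unmakeMove(f.moves[f.next - 1]);
            int score = -childScore;             // negamax sign flip
            if (score > f.best)  f.best  = score;
            if (score > f.alpha) f.alpha = score;
            returning = false;
        }

        bool done = f.depth == 0 || f.next == f.moves.size()
                    || f.alpha >= f.beta;        // beta cutoff
        if (done) {
            childScore = f.depth == 0 ? evaluate() : f.best;
            stack.pop_back();
            if (stack.empty())
                return childScore;               // root frame finished
            returning = true;
            continue;
        }

        int a = f.alpha, b = f.beta, d = f.depth;  // copy before push_back,
        makeMove(f.moves[f.next++]);               // which may reallocate 'stack'
        stack.push_back({-b, -a, d - 1, -INF, 0, generateMoves()});
    }
}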
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
syzygy
Posts: 5697
Joined: Tue Feb 28, 2012 11:56 pm

Re: SMP and questions

Post by syzygy »

bob wrote:The most important difference is that YBW splits HERE or not at all, while DTS can split ANYWHERE. That gives you a better opportunity to find a safer place to split the tree, instead of being forced to split HERE or else defer the split until later.
I believe splitting HERE or not at all is an implementation aspect that is not mandated by YBW. But I agree it is advantageous to be able to split anywhere. The question remains, though, how that advantage leads to a more efficient search: decreased idle time? Decreased search overhead?
Again, the messages and such are generally "small potatoes" in terms of performance impact. The real advantage is that you get to choose a split point anywhere in any search that is currently active. And the split overhead is very low to boot, since all the data is accessed as a large global array rather than via pointers. Think about this: at which of these two choices would you rather split: (a) the current ply, 6 plies from the tip, where you just completed the first move; or (b) a position 4 plies from the root where you have already searched 4 moves without a fail high? I'm more confident splitting in (b), but it is difficult with a recursive search.
I agree that splitting in (b) is preferable, and that is what my engine does using a recursive search.

But I wouldn't call communication and splitting overhead "small potatoes". If the overhead is high, it is either just costing a lot, or you have to increase the minimum split depth, which will increase idle time.
The problem lies in the recursive stack. Things are set up with no split being done, so there is no easy way to back up and split at an earlier ply, unless you are willing to do the complicated code to really unwind back to the ply where you want to split, do the split, then to un-unwind back to where you were and continue. Doable, but horribly messy. If I am going to do a split, I need to link in a split block for each child, plus a split block for the parent, but it is not easy to get to that information to change it, since all I currently have is local data on the stack in the call tree.
There is no need to back up. All you need is the moves leading from the root to the node where you want to split (so those should be global and not (only) on the stack). The idle thread locates a good split point in the tree of one of the active threads, copies the moves from the root to that node, and performs those moves on its own copy of the position. The master thread does not need to be involved in the split operation (it does have to initialise the candidate split point after having searched captures and killers, but that only requires copying alpha, beta, depth).

Of course the master will remain responsible for returning the node's score to the parent node, so the master is limited to helpful master once it runs out of work.
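
A sketch of that join operation (hypothetical names): the idle thread replays the published path on its own copy of the position, so the master is never involved in the split.

Code: Select all

constexpr int MAX_PLY = 128;

struct Position { /* the engine's board representation (placeholder) */ };
void makeMove(Position& pos, int move);
void searchSplitPointMoves(Position& pos, int alpha, int beta, int depth);

struct SplitPoint {
    int alpha, beta, depth;
    int ply;
    int pathFromRoot[MAX_PLY];    // moves from the root to the split node
};

// Executed entirely by the idle (joining) thread.
void joinSplitPoint(Position myPos /* private copy of the root position */,
                    const SplitPoint& sp) {
    for (int i = 0; i < sp.ply; ++i)       // walk down to the split node
        makeMove(myPos, sp.pathFromRoot[i]);
    // From here the thread picks up unsearched moves exactly like the master does.
    searchSplitPointMoves(myPos, sp.alpha, sp.beta, sp.depth);
}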
Joerg Oster
Posts: 976
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany
Full name: Jörg Oster

Re: SMP and questions

Post by Joerg Oster »

mcostalba wrote:
Edsel Apostol wrote: What are the other tricks used to minimize the idle time spent by the threads? I have read about "Active reparenting" from the SF team, but I have not yet studied how this is implemented in SF, or whether it was implemented at all.
The idea is very simple. Implementation, as usual with SMP, is simple only after you have made it work :-).

Once a slave has finished searching, instead of returning to the idle loop to wait to be picked up by a new master, it looks for an active split point, registers itself as one of that split point's slaves and starts searching on it. The idea is that average idle time is greatly reduced in this case.

Unfortunately the patch failed to show an increase and so has been retired:

https://github.com/mcostalba/Stockfish/ ... 95083a3ab9

My guess is that the idle time is already small on average anyhow, so the patch does not change things a lot. One possibility that I have not tested is to enable active reparenting together with an increased minimum split depth. This is because increasing the split depth reduces the SMP overhead but at the same time increases the average slave's idle time. So at first glance it could make sense to increase the former and enable the latter.
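
For illustration, a rough sketch of the reparenting step described in the quote above. This is not the actual Stockfish patch, just the bare idea of a finished slave attaching itself to an existing split point instead of going back to the idle loop.

Code: Select all

#include <mutex>

constexpr int MAX_THREADS = 16;

struct SplitPoint {
    std::mutex lock;
    int slaveCount = 0;
    bool allMovesPicked = false;   // no unsearched moves left to hand out
};

SplitPoint* activeSplitPoints[MAX_THREADS] = {};   // published by split masters

// Called by a thread whose own work is finished. Returns the split point it
// attached to, or nullptr if it should fall back to the ordinary idle loop.
SplitPoint* reparent() {
    for (SplitPoint* sp : activeSplitPoints) {
        if (!sp)
            continue;
        std::lock_guard<std::mutex> lk(sp->lock);
        if (!sp->allMovesPicked) {
            ++sp->slaveCount;      // register as an extra slave of this split point
            return sp;             // caller immediately starts searching its moves
        }
    }
    return nullptr;
}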
And maybe it would also help to reduce MAX_SPLITPOINTS from 8 to 3. I did some tests which indicate that 3 Split Points in parallel are sufficient and that 4 already produced a slight overhead.
Jörg Oster
syzygy
Posts: 5697
Joined: Tue Feb 28, 2012 11:56 pm

Re: SMP and questions

Post by syzygy »

Joerg Oster wrote:And maybe it would also help to reduce MAX_SPLITPOINTS from 8 to 3. I did some tests which indicate that 3 Split Points in parallel are sufficient and that 4 already produced a slight overhead.
I'm not sure why that would be. With how many threads did you run?
Joerg Oster
Posts: 976
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany
Full name: Jörg Oster

Re: SMP and questions

Post by Joerg Oster »

syzygy wrote:
Joerg Oster wrote:And maybe it would also help to reduce MAX_SPLITPOINTS from 8 to 3. I did some tests which indicate that 3 Split Points in parallel are sufficient and that 4 already produced a slight overhead.
I'm not sure why that would be. With how many threads did you run?
IIRC I tested with 4 Threads.
Allowing 1, 2, 3 and 4 Split Points. 3 worked best.
A theoretical consideration: does it really make sense to allow up to 8 Split Points simultaneously on a 4-core machine? Though most likely it will never reach the maximum number, each additional Split Point produces some overhead.

Please be aware that I'm only a layman on this matter. I am just starting to learn ... :D
Jörg Oster
syzygy
Posts: 5697
Joined: Tue Feb 28, 2012 11:56 pm

Re: SMP and questions

Post by syzygy »

Joerg Oster wrote:
syzygy wrote:
Joerg Oster wrote:And maybe it would also help to reduce MAX_SPLITPOINTS from 8 to 3. I did some tests which indicate that 3 Split Points in parallel are sufficient and that 4 already produced a slight overhead.
I'm not sure why that would be. With how many threads did you run?
IIRC I tested with 4 Threads.
Allowing 1, 2, 3 and 4 Split Points. 3 worked best.
A theoretical consideration: does it really make sense to allow up to 8 Split Points simultaneously on a 4-core machine? Though most likely it will never reach the maximum number, each additional Split Point produces some overhead.
Forcibly limiting the number of split points should only force threads to be idle when they could do useful work. I don't see how that could help.

However, having now looked at the Stockfish source, you must be talking about MAX_SPLITPOINTS_PER_THREAD. I guess more than 3 active split points per thread is rare, and limiting this number to 3 will only measurably affect longer searches (in a negative way). I still don't see how it could help. If they are used, that's good. If they are not used, they only take up a bit of memory. They don't cost CPU cycles. (Btw, from the Stockfish commits I understand that the Stockfish source from October 5 to November 3 had a bug that explains a reduction of tree size (and buggy play) resulting from lowering MAX_SPLITPOINTS_PER_THREAD.)

I also wonder a bit about the usefulness of maxThreadsPerSplitPoint. Limiting the number of threads at a split point close to the leaves might help to reduce splitting overhead, but I don't see why this number should be limited close to the root.
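
For reference, a simplified sketch of the two limits under discussion (the real Stockfish data structures differ in detail): each thread owns a fixed array of split point slots, and a split point stops accepting extra slaves once it already has enough helpers.

Code: Select all

constexpr int MAX_SPLITPOINTS_PER_THREAD = 8;   // the per-thread limit discussed above
constexpr int MAX_SLAVES_PER_SPLITPOINT  = 4;   // cf. maxThreadsPerSplitPoint

struct SplitPoint {
    int slaveCount = 0;
};

struct Thread {
    SplitPoint splitPoints[MAX_SPLITPOINTS_PER_THREAD];
    int splitPointsSize = 0;                    // slots currently in use

    // A thread may not open a new split point if all its slots are taken;
    // unused slots only cost a bit of memory, as noted above.
    SplitPoint* tryOpenSplitPoint() {
        if (splitPointsSize >= MAX_SPLITPOINTS_PER_THREAD)
            return nullptr;
        return &splitPoints[splitPointsSize++];
    }
};

// A candidate slave only joins if the split point is not already "full".
bool canJoin(const SplitPoint& sp) {
    return sp.slaveCount < MAX_SLAVES_PER_SPLITPOINT;
}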