SMP and questions

syzygy · Post by **syzygy** » Sun Dec 02, 2012 1:55 am

Edsel Apostol wrote:Running some analysis on Windows though had my implementation at around 10% thread synchronization overhead, mostly spent on waiting for locks on selecting the moves.

I don't have any ideas how to avoid this at the moment, and the overhead tends to increase as the number of threads increases. It might be a good idea to limit the number of threads working in a split point to a minimum, lets say 2, but I have to experiment with it yet.

It sounds to me that your move selection is taking too much time.
I split only after (good) captures and killers, which means that I already have generated the remaining moves. Now I only need to pick the next move in the list (or quickly loop to find the move with the highest history score).

Joerg Oster · Post by **Joerg Oster** » Sun Dec 02, 2012 10:35 am

In this thread http://talkchess.com/forum/viewtopic.ph ... 55&t=31388 Tord gave a quick explanation of the SMP parameters in Stockfish.

syzygy · Post by **syzygy** » Sun Dec 02, 2012 1:26 pm

syzygy wrote:
Edsel Apostol wrote:Running some analysis on Windows though had my implementation at around 10% thread synchronization overhead, mostly spent on waiting for locks on selecting the moves.

I don't have any ideas how to avoid this at the moment, and the overhead tends to increase as the number of threads increases. It might be a good idea to limit the number of threads working in a split point to a minimum, lets say 2, but I have to experiment with it yet.
It sounds to me that your move selection is taking too much time.
I split only after (good) captures and killers, which means that I already have generated the remaining moves. Now I only need to pick the next move in the list (or quickly loop to find the move with the highest history score).

Or are you using a global lock (critical section object in MS language) instead of one lock per split point?

bob · Post by **bob** » Sun Dec 02, 2012 4:33 pm

Don wrote:
bob wrote:
syzygy wrote:
bob wrote:
Edsel Apostol wrote:Thanks for the input. My NPS speed-up is now somewhat decent though not as good as Crafty or other top programs.

I have not studied DTS yet. In a nutshell what should be the difference compared to YBW? Why it is not easy to implement it in a recursive search framework?
For DTS, you make split decisions anywhere in the tree, as opposed to only at the current node as in my current YBW implementation. It is not easy, if you are at ply=16, to split at ply=8, the data structures for ply=8 are already allocated locally somewhere but need to become global. So it simply gets VERY complicated. An iterated search, as opposed to a recursive search, solves this. The code is much uglier, however, and that's the reason I chose to not do DTS a second time when I wrote Crafty, I liked the cleanliness of the recursive implementation...
What you need is just the moves leading from the root to the candidate split point (and alpha, beta, depth, move list counter to keep track of what moves have already been searched and access to the moves that still need to be searched), and of course you need to know at which plies it makes sense to split. So instead of looking for idle threads when a node is ready so be split, the search can set a flag to mark the node as a candidate split point and copy alpha, beta, depth to a predetermined location. Threads that are idle or become idle later can then join at this node when there is no better candidate available.

What I believe is very complicated (nearly impossible) to get right with a recursive search is to hand over responsibility for backing up the search result to the parent node, so you are restricted to helpful master. But helpful master also has its advantages: each thread is now always searching within a single tree, so only needs one set of data structures.
The problem lies in the recursive stack. Things are set up with no split being done, so there is no easy way to back up and split at an earlier ply, unless you are willing to do the complicated code to really unwind back to the ply where you want to split, do the split, then to un-unwind back to where you were and continue. Doable, but horribly messy. If I am going to do a split, I need to link in a split block for each child, plus a split block for the parent, but it is not easy to get to that information to change it, since all I currently have is local data on the stack in the call tree.

I agree that the "anyone can back up through a ply" is very difficult in recursive. And while theory says nothing is impossible, common sense says it is VERY difficult, at best.
There is no reason the search has to be explicitly recursive though.

I wrote a checkers program for the palm pilot years ago that had to be re-entrant and it turned out not to be that bad if you organize the code right. There were very few entry points that you had to worry about. You can also hide a lot of the ugliness with macros and end up with something that actually looks pretty good although you still lose the beauty and elegance of explicit recursion.

That was my point. The recursive search is easier to work on...

bob · Post by **bob** » Sun Dec 02, 2012 4:37 pm

syzygy wrote:
Joerg Oster wrote:
syzygy wrote:
Joerg Oster wrote:And maybe it would also help to reduce MAX_SPLITPOINTS from 8 to 3. I did some tests which indicate, that 3 Split Points in parallel are sufficient and that 4 already produced a slight overhead.
I'm not sure why that would be. With how many threads did you run?
IIRC I tested with 4 Threads.
Allowing 1, 2, 3 and 4 Split Points. 3 worked best.
A theoretical consideration: Does it really make sense to allow up to 8 Split Points simultaneously on a 4-core machine? Though most likely it will never create the max number, each additional Split Point produces some overhead.
Forcibly limiting the number of split points should only force threads to be idle when they could do useful work. I don't see how that could help.

However, having now looking at the stockfish source you must be talking about MAX_SPLITPOINTS_PER_THREAD. I guess more than 3 active splitpoints per thread is rare, and that limiting this number to 3 will only measurably affect longer searches (in a negative way). I still don't see how it could help. If they are used, that's good. If they are not used, they only take up a bit of memory. They don't cost cpu cycles. (Btw, from the stockfish commits I understand that the stockfish source from October 5 to November 3 had a bug that explains a reduction of tree size (and buggy play) resulting from lowering MAX_SPLITPOINTS_PER_THREAD.)

I also wonder a bit about the usefulness of maxThreadsPerSplitPoint. Limiting the number of threads at a split point close to the leaves might help to reduce splitting overhead, but I don't see why this number should be limited close to the root.

I think that idea is a bad one. It makes sense to limit the number of threads at a split point, particularly with significant numbers of threads (8, 16 and up). However, I have plenty of cases where a single thread can have many active split points. Imagine: you split at depth=10. The original thread at that ply runs out of work and then helps at ply=11 (helpful master sort of idea). Then the master of that one runs out of work, and so forth. In some endgames this can happen a lot. Why have a thread sit idle rather than splitting and helping?

Don · Post by **Don** » Sun Dec 02, 2012 4:39 pm

bob wrote:
Don wrote:
bob wrote:
syzygy wrote:
bob wrote:
Edsel Apostol wrote:Thanks for the input. My NPS speed-up is now somewhat decent though not as good as Crafty or other top programs.

I have not studied DTS yet. In a nutshell what should be the difference compared to YBW? Why it is not easy to implement it in a recursive search framework?
For DTS, you make split decisions anywhere in the tree, as opposed to only at the current node as in my current YBW implementation. It is not easy, if you are at ply=16, to split at ply=8, the data structures for ply=8 are already allocated locally somewhere but need to become global. So it simply gets VERY complicated. An iterated search, as opposed to a recursive search, solves this. The code is much uglier, however, and that's the reason I chose to not do DTS a second time when I wrote Crafty, I liked the cleanliness of the recursive implementation...
What you need is just the moves leading from the root to the candidate split point (and alpha, beta, depth, move list counter to keep track of what moves have already been searched and access to the moves that still need to be searched), and of course you need to know at which plies it makes sense to split. So instead of looking for idle threads when a node is ready so be split, the search can set a flag to mark the node as a candidate split point and copy alpha, beta, depth to a predetermined location. Threads that are idle or become idle later can then join at this node when there is no better candidate available.

What I believe is very complicated (nearly impossible) to get right with a recursive search is to hand over responsibility for backing up the search result to the parent node, so you are restricted to helpful master. But helpful master also has its advantages: each thread is now always searching within a single tree, so only needs one set of data structures.
The problem lies in the recursive stack. Things are set up with no split being done, so there is no easy way to back up and split at an earlier ply, unless you are willing to do the complicated code to really unwind back to the ply where you want to split, do the split, then to un-unwind back to where you were and continue. Doable, but horribly messy. If I am going to do a split, I need to link in a split block for each child, plus a split block for the parent, but it is not easy to get to that information to change it, since all I currently have is local data on the stack in the call tree.

I agree that the "anyone can back up through a ply" is very difficult in recursive. And while theory says nothing is impossible, common sense says it is VERY difficult, at best.
There is no reason the search has to be explicitly recursive though.

I wrote a checkers program for the palm pilot years ago that had to be re-entrant and it turned out not to be that bad if you organize the code right. There were very few entry points that you had to worry about. You can also hide a lot of the ugliness with macros and end up with something that actually looks pretty good although you still lose the beauty and elegance of explicit recursion.
That was my point. The recursive search is easier to work on...

But you missed the point. It was you yourself that just described how difficult and complicated recursive code can be for this.

But I was trying to explain that a non-recursive search is not very difficult once you figure it how to get it right.

Don · Post by **Don** » Sun Dec 02, 2012 4:40 pm

Don wrote:
bob wrote:
Don wrote:
bob wrote:
syzygy wrote:
bob wrote:
Edsel Apostol wrote:Thanks for the input. My NPS speed-up is now somewhat decent though not as good as Crafty or other top programs.

I have not studied DTS yet. In a nutshell what should be the difference compared to YBW? Why it is not easy to implement it in a recursive search framework?
For DTS, you make split decisions anywhere in the tree, as opposed to only at the current node as in my current YBW implementation. It is not easy, if you are at ply=16, to split at ply=8, the data structures for ply=8 are already allocated locally somewhere but need to become global. So it simply gets VERY complicated. An iterated search, as opposed to a recursive search, solves this. The code is much uglier, however, and that's the reason I chose to not do DTS a second time when I wrote Crafty, I liked the cleanliness of the recursive implementation...
What you need is just the moves leading from the root to the candidate split point (and alpha, beta, depth, move list counter to keep track of what moves have already been searched and access to the moves that still need to be searched), and of course you need to know at which plies it makes sense to split. So instead of looking for idle threads when a node is ready so be split, the search can set a flag to mark the node as a candidate split point and copy alpha, beta, depth to a predetermined location. Threads that are idle or become idle later can then join at this node when there is no better candidate available.

What I believe is very complicated (nearly impossible) to get right with a recursive search is to hand over responsibility for backing up the search result to the parent node, so you are restricted to helpful master. But helpful master also has its advantages: each thread is now always searching within a single tree, so only needs one set of data structures.
The problem lies in the recursive stack. Things are set up with no split being done, so there is no easy way to back up and split at an earlier ply, unless you are willing to do the complicated code to really unwind back to the ply where you want to split, do the split, then to un-unwind back to where you were and continue. Doable, but horribly messy. If I am going to do a split, I need to link in a split block for each child, plus a split block for the parent, but it is not easy to get to that information to change it, since all I currently have is local data on the stack in the call tree.

I agree that the "anyone can back up through a ply" is very difficult in recursive. And while theory says nothing is impossible, common sense says it is VERY difficult, at best.
There is no reason the search has to be explicitly recursive though.

I wrote a checkers program for the palm pilot years ago that had to be re-entrant and it turned out not to be that bad if you organize the code right. There were very few entry points that you had to worry about. You can also hide a lot of the ugliness with macros and end up with something that actually looks pretty good although you still lose the beauty and elegance of explicit recursion.
That was my point. The recursive search is easier to work on...
But you missed the point. It was you yourself that just described how difficult and complicated recursive code can be for this.

But I was trying to explain that a non-recursive search is not very difficult once you figure it how to get it right.

And by the way - I am not suggesting this is the right way either - it was just a possible way to get around some of the complexity issues and may or may not be a very good thing.

Edsel Apostol · Post by **Edsel Apostol** » Sun Dec 02, 2012 4:57 pm

syzygy wrote:
syzygy wrote:
Edsel Apostol wrote:Running some analysis on Windows though had my implementation at around 10% thread synchronization overhead, mostly spent on waiting for locks on selecting the moves.

I don't have any ideas how to avoid this at the moment, and the overhead tends to increase as the number of threads increases. It might be a good idea to limit the number of threads working in a split point to a minimum, lets say 2, but I have to experiment with it yet.
It sounds to me that your move selection is taking too much time.
I split only after (good) captures and killers, which means that I already have generated the remaining moves. Now I only need to pick the next move in the list (or quickly loop to find the move with the highest history score).
Or are you using a global lock (critical section object in MS language) instead of one lock per split point?

My move generation is fast enough I think, but I loop all over the moves again to score them, maybe it's better to just score them as soon as they are generated.

My latest code has only 1% synchronization overhead. I'm already using one lock per split point. I do have one global lock though in idle loop and inside the function to split remaining moves between threads.

About your method, would splitting after good captures and killers leave a lot of idle time for the threads? What would be the compensation for the bigger idle time?

And your method regarding copying the position in the split point but instead using the moves from the root to that position, how is it faster than just copying the position on that potential split point?

I would like to know the success rate of a potential split point being actually searched by the slave threads, have you measured it in your program? I would like to implement it in Hannibal.

syzygy · Post by **syzygy** » Sun Dec 02, 2012 5:06 pm

Don wrote:
bob wrote:That was my point. The recursive search is easier to work on...
But you missed the point. It was you yourself that just described how difficult and complicated recursive code can be for this.

But I was trying to explain that a non-recursive search is not very difficult once you figure it how to get it right.

But letting idle threads actively create split points is not difficult either with a recursive search, as long as the essential information is not on the stack but global.

I found this old post by the author of Spike that describes a similar recursive implementation.

Incidentally, I have a solution for one of his questions:

Mangar wrote:If you have only one flag to signal "break" how does a thread know when he should stop breaking? There may be a split point below the split point that causes the break.

My per thread abort flag is an int that is 0 if the thread has not been aborted. If it is -1, the search was aborted e.g. due to time running out. If it is N > 0, there was a cut off at ply=N. If there is an abort at ply=M, the flag is set to M if its value is 0 or >M (using atomic compare and swap). A thread at ply=N that notices an abort at ply=M<N will abort any slaves and fall back to ply=N-1 where the same happens.

syzygy · Post by **syzygy** » Sun Dec 02, 2012 5:41 pm

Edsel Apostol wrote:My latest code has only 1% synchronization overhead. I'm already using one lock per split point. I do have one global lock though in idle loop and inside the function to split remaining moves between threads.

Then you seem to have solved the problem already. A global lock for doing the split should be fine (and seems hard to avoid if active threads are picking up idle threads).

About your method, would splitting after good captures and killers leave a lot of idle time for the threads? What would be the compensation for the bigger idle time?

I think I hardly have idle time (given the NPS speed up). I already split at depth=2. The speed up in the first second of searching the initial position is a bit lower (5.4 on 6 cores), which must partly be due to the pv lines that are being spit out and partly due to the many synchronisations inherent to YBW (when you finish the search of the PV node at ply = N, i.e. the search of the first move at the PV node at ply = N-1, YBW implies that all threads but one are idle).

And your method regarding copying the position in the split point but instead using the moves from the root to that position, how is it faster than just copying the position on that potential split point?

It might depend on how big your position data is, but copying the moves takes only very little data (even though I don't store them consecutively in a single cache line, which maybe I should try). After that, the slave thread has all it needs and can release the lock. No need to copy the history of hash keys etc. Also, I keep a small hash table for speeding up draw-by-rep detection, and I don't want to copy that table at splits.

More importantly, if slave threads actively join a candidate split node without the master knowing it, I don't even have a consistent position state for that node. I don't want to interrupt the master, force it to back up its position all the way to the split node, copy the full state, and then let the master go back to where it was. And I also don't want to make a full copy of the state when creating candidate split points, because the overhead would be way too high. So copying the moves is the only option.

I would like to know the success rate of a potential split point being actually searched by the slave threads, have you measured it in your program? I would like to implement it in Hannibal.

I didn't measure it, but most potential split points probably never become a split point. Most potential split points are close to the leaves, and the idle threads look for one close to the root. But the overhead of a node being a potential split point (i.e. having to store some data and take a lock before selecting a move) turns out to be very low in my case.

I might try to collect some data if I figure out what data is useful.

SMP and questions

Re: SMP and questions

Re: SMP and questions

Re: SMP and questions

Re: SMP and questions

Re: SMP and questions

Re: SMP and questions

Re: SMP and questions

Re: SMP and questions

Re: SMP and questions

Re: SMP and questions