SMP and questions

syzygy · Post by **syzygy** » Sun Dec 02, 2012 6:19 pm

bob wrote:It makes sense to limit the number of threads at a split point, particularly with significant numbers of threads (8, 16 and up).

What is the idea behind it? That the node might have fewer moves left than the number of threads joining it? Or is there another reason?

In my implementation where threads actively join potential / candidate split points, an idle thread first picks a move (if there is any) before actually joining.

Joerg Oster · Post by **Joerg Oster** » Sun Dec 02, 2012 6:34 pm

bob wrote:
syzygy wrote:
Joerg Oster wrote:
syzygy wrote:
Joerg Oster wrote:And maybe it would also help to reduce MAX_SPLITPOINTS from 8 to 3. I did some tests which indicate, that 3 Split Points in parallel are sufficient and that 4 already produced a slight overhead.
I'm not sure why that would be. With how many threads did you run?
IIRC I tested with 4 Threads.
Allowing 1, 2, 3 and 4 Split Points. 3 worked best.
A theoretical consideration: Does it really make sense to allow up to 8 Split Points simultaneously on a 4-core machine? Though most likely it will never create the max number, each additional Split Point produces some overhead.
Forcibly limiting the number of split points should only force threads to be idle when they could do useful work. I don't see how that could help.

However, having now looking at the stockfish source you must be talking about MAX_SPLITPOINTS_PER_THREAD. I guess more than 3 active splitpoints per thread is rare, and that limiting this number to 3 will only measurably affect longer searches (in a negative way). I still don't see how it could help. If they are used, that's good. If they are not used, they only take up a bit of memory. They don't cost cpu cycles. (Btw, from the stockfish commits I understand that the stockfish source from October 5 to November 3 had a bug that explains a reduction of tree size (and buggy play) resulting from lowering MAX_SPLITPOINTS_PER_THREAD.)

I also wonder a bit about the usefulness of maxThreadsPerSplitPoint. Limiting the number of threads at a split point close to the leaves might help to reduce splitting overhead, but I don't see why this number should be limited close to the root.
I think that idea is a bad one.

You mean limiting the number of split points. Right?
A first, very crude test showed, that you might be right.

bob wrote:It makes sense to limit the number of threads at a split point, particularly with significant numbers of threads (8, 16 and up).

Well, I think I will do some further tests next week(s).

bob wrote: However, I have plenty of cases where a single thread can have many active split points. Imagine: you split at depth=10. The original thread at that ply runs out of work and then helps at ply=11 (helpful master sort of idea). Then the master of that one runs out of work, and so forth. In some endgames this can happen a lot. Why have a thread sit idle rather than splitting and helping?

jdart · Post by **jdart** » Mon Dec 03, 2012 3:15 am

Synchronization overhead is minimal for me - 10% sounds high.

You can download a trial version of the Intel tools for Windows - you just need to pay if using it for >90 days or something.

--Jon

Joerg Oster · Post by **Joerg Oster** » Mon Dec 03, 2012 3:14 pm

syzygy wrote:
Joerg Oster wrote:
syzygy wrote:
Joerg Oster wrote:And maybe it would also help to reduce MAX_SPLITPOINTS from 8 to 3. I did some tests which indicate, that 3 Split Points in parallel are sufficient and that 4 already produced a slight overhead.
I'm not sure why that would be. With how many threads did you run?
IIRC I tested with 4 Threads.
Allowing 1, 2, 3 and 4 Split Points. 3 worked best.
A theoretical consideration: Does it really make sense to allow up to 8 Split Points simultaneously on a 4-core machine? Though most likely it will never create the max number, each additional Split Point produces some overhead.
Forcibly limiting the number of split points should only force threads to be idle when they could do useful work. I don't see how that could help.

Well, maybe this is exactly the scenario, where Marco's idea of "active reparenting" might work.

syzygy wrote:However, having now looking at the stockfish source you must be talking about MAX_SPLITPOINTS_PER_THREAD. I guess more than 3 active splitpoints per thread is rare, and that limiting this number to 3 will only measurably affect longer searches (in a negative way).

You may be right here.

syzygy wrote:I still don't see how it could help. If they are used, that's good. If they are not used, they only take up a bit of memory. They don't cost cpu cycles. (Btw, from the stockfish commits I understand that the stockfish source from October 5 to November 3 had a bug that explains a reduction of tree size (and buggy play) resulting from lowering MAX_SPLITPOINTS_PER_THREAD.)

I also wonder a bit about the usefulness of maxThreadsPerSplitPoint. Limiting the number of threads at a split point close to the leaves might help to reduce splitting overhead, but I don't see why this number should be limited close to the root.

After some thinking and looking into the sources again, I tend to agree.
Btw, since the default value is 5, this parameter is totally useless on a Dual- or a Quad-core machine.

bob · Post by **bob** » Mon Dec 03, 2012 3:36 pm

syzygy wrote:
bob wrote:It makes sense to limit the number of threads at a split point, particularly with significant numbers of threads (8, 16 and up).
What is the idea behind it? That the node might have fewer moves left than the number of threads joining it? Or is there another reason?

In my implementation where threads actively join potential / candidate split points, an idle thread first picks a move (if there is any) before actually joining.

Basically, yes. If you run on a 16 core box, in the endgame you will find yourself repeatedly splitting, but not having anywhere close to 16 move that need searching. But also there is a NUMA issue. You can control how many split at a single split point so that you can then control WHO splits there as well. You might want to split on remote nodes near the root (where communication / shared memory accesses is limited, and on "close" nodes deep in the tree where the shared information is accessed more frequently...

bob · Post by **bob** » Mon Dec 03, 2012 3:39 pm

Don wrote:
bob wrote:
Don wrote:
bob wrote:
syzygy wrote:
bob wrote:
Edsel Apostol wrote:Thanks for the input. My NPS speed-up is now somewhat decent though not as good as Crafty or other top programs.

I have not studied DTS yet. In a nutshell what should be the difference compared to YBW? Why it is not easy to implement it in a recursive search framework?
For DTS, you make split decisions anywhere in the tree, as opposed to only at the current node as in my current YBW implementation. It is not easy, if you are at ply=16, to split at ply=8, the data structures for ply=8 are already allocated locally somewhere but need to become global. So it simply gets VERY complicated. An iterated search, as opposed to a recursive search, solves this. The code is much uglier, however, and that's the reason I chose to not do DTS a second time when I wrote Crafty, I liked the cleanliness of the recursive implementation...
What you need is just the moves leading from the root to the candidate split point (and alpha, beta, depth, move list counter to keep track of what moves have already been searched and access to the moves that still need to be searched), and of course you need to know at which plies it makes sense to split. So instead of looking for idle threads when a node is ready so be split, the search can set a flag to mark the node as a candidate split point and copy alpha, beta, depth to a predetermined location. Threads that are idle or become idle later can then join at this node when there is no better candidate available.

What I believe is very complicated (nearly impossible) to get right with a recursive search is to hand over responsibility for backing up the search result to the parent node, so you are restricted to helpful master. But helpful master also has its advantages: each thread is now always searching within a single tree, so only needs one set of data structures.
The problem lies in the recursive stack. Things are set up with no split being done, so there is no easy way to back up and split at an earlier ply, unless you are willing to do the complicated code to really unwind back to the ply where you want to split, do the split, then to un-unwind back to where you were and continue. Doable, but horribly messy. If I am going to do a split, I need to link in a split block for each child, plus a split block for the parent, but it is not easy to get to that information to change it, since all I currently have is local data on the stack in the call tree.

I agree that the "anyone can back up through a ply" is very difficult in recursive. And while theory says nothing is impossible, common sense says it is VERY difficult, at best.
There is no reason the search has to be explicitly recursive though.

I wrote a checkers program for the palm pilot years ago that had to be re-entrant and it turned out not to be that bad if you organize the code right. There were very few entry points that you had to worry about. You can also hide a lot of the ugliness with macros and end up with something that actually looks pretty good although you still lose the beauty and elegance of explicit recursion.
That was my point. The recursive search is easier to work on...
But you missed the point. It was you yourself that just described how difficult and complicated recursive code can be for this.

But I was trying to explain that a non-recursive search is not very difficult once you figure it how to get it right.

Having done both for a long time, I personally believe that recursive search is easier to understand and work on. Given the choice of a non-threaded recursive or iterated search, I'd take recursive every time. Adding threading introduces a wrinkle of course...

Don · Post by **Don** » Mon Dec 03, 2012 3:44 pm

bob wrote: Having done both for a long time, I personally believe that recursive search is easier to understand and work on. Given the choice of a non-threaded recursive or iterated search, I'd take recursive every time. Adding threading introduces a wrinkle of course...

I don't have any argument with that. Who would not prefer recursive code?

Don

diep · Post by **diep** » Mon Dec 03, 2012 7:14 pm

Don wrote:
bob wrote: Having done both for a long time, I personally believe that recursive search is easier to understand and work on. Given the choice of a non-threaded recursive or iterated search, I'd take recursive every time. Adding threading introduces a wrinkle of course...
I don't have any argument with that. Who would not prefer recursive code?

Don

There is no reason why a recursive search wouldn't have the same advantages like Diep"s non-recursive search for SMP search.

The stack is not a result of the recursive search, yet a result from storing positions recursive.

There is no need to store the positions data recursive on the stack of the processor. It's just as easy to use a recursive search AND put all properties of a position inside a structure.

If i'd redo Diep's 1998s rewrite from recursive to iterative, i sure would do it different!

Edsel Apostol · Post by **Edsel Apostol** » Sat Dec 08, 2012 1:44 am

syzygy wrote:
Edsel Apostol wrote:My latest code has only 1% synchronization overhead. I'm already using one lock per split point. I do have one global lock though in idle loop and inside the function to split remaining moves between threads.
Then you seem to have solved the problem already. A global lock for doing the split should be fine (and seems hard to avoid if active threads are picking up idle threads).

About your method, would splitting after good captures and killers leave a lot of idle time for the threads? What would be the compensation for the bigger idle time?
I think I hardly have idle time (given the NPS speed up). I already split at depth=2. The speed up in the first second of searching the initial position is a bit lower (5.4 on 6 cores), which must partly be due to the pv lines that are being spit out and partly due to the many synchronisations inherent to YBW (when you finish the search of the PV node at ply = N, i.e. the search of the first move at the PV node at ply = N-1, YBW implies that all threads but one are idle).

And your method regarding copying the position in the split point but instead using the moves from the root to that position, how is it faster than just copying the position on that potential split point?
It might depend on how big your position data is, but copying the moves takes only very little data (even though I don't store them consecutively in a single cache line, which maybe I should try). After that, the slave thread has all it needs and can release the lock. No need to copy the history of hash keys etc. Also, I keep a small hash table for speeding up draw-by-rep detection, and I don't want to copy that table at splits.

More importantly, if slave threads actively join a candidate split node without the master knowing it, I don't even have a consistent position state for that node. I don't want to interrupt the master, force it to back up its position all the way to the split node, copy the full state, and then let the master go back to where it was. And I also don't want to make a full copy of the state when creating candidate split points, because the overhead would be way too high. So copying the moves is the only option.

I would like to know the success rate of a potential split point being actually searched by the slave threads, have you measured it in your program? I would like to implement it in Hannibal.
I didn't measure it, but most potential split points probably never become a split point. Most potential split points are close to the leaves, and the idle threads look for one close to the root. But the overhead of a node being a potential split point (i.e. having to store some data and take a lock before selecting a move) turns out to be very low in my case.

I might try to collect some data if I figure out what data is useful.

I would like to say thanks to you and to all the contributors in this thread. I'll be using this as reference later on when I work with improvements on SMP. As of now, my SMP implementation is a bit stable already (playing in Graham's 8 core tournament) so I'm moving on with other things for now.

SMP and questions

Re: SMP and questions

Re: SMP and questions

Re: SMP and questions

Re: SMP and questions

Re: SMP and questions

Re: SMP and questions

Re: SMP and questions

Re: SMP and questions

Re: SMP and questions