YBWC: Active Reparenting

mcostalba · Post by **mcostalba** » Sun Apr 15, 2012 12:22 pm

diep wrote: the bug you run into when a cpu owns splitpoints is that when this cpu is gonna do other stuff, it loses its splitpoints. So you do need a mechanism that other cpu's can take over your splitpoints, otherwise that 'valuable' splitpoint is lost forever without anyone splitting in there yet.

Yes, to avoid this I allowed "reparenting" only threads without any owned split point. This was the only choice because I reparented to the oldest (nearest to root)......but for the series "Hey look, Ma, no hands!" I have pushed another patch where reparenting is done against the latest split point, this allow to use YBW condition where the master of a split point is allowed to reparent to another split point as long as this belongs to one of its slaves: so now even threads that owns split points are allowed to reparent and no, there is no bug, because is under test since one day and shows no threads hangs whatsoever, actually it seems even stronger than the old code.

diep · Post by **diep** » Sun Apr 15, 2012 4:39 pm

mcostalba wrote:
diep wrote: the bug you run into when a cpu owns splitpoints is that when this cpu is gonna do other stuff, it loses its splitpoints. So you do need a mechanism that other cpu's can take over your splitpoints, otherwise that 'valuable' splitpoint is lost forever without anyone splitting in there yet.
Yes, to avoid this I allowed "reparenting" only threads without any owned split point. This was the only choice because I reparented to the oldest (nearest to root)......but for the series "Hey look, Ma, no hands!" I have pushed another patch where reparenting is done against the latest split point, this allow to use YBW condition where the master of a split point is allowed to reparent to another split point as long as this belongs to one of its slaves: so now even threads that owns split points are allowed to reparent and no, there is no bug, because is under test since one day and shows no threads hangs whatsoever, actually it seems even stronger than the old code.

What's the speedup now of SF - any improvements you measured?

mcostalba · Post by **mcostalba** » Mon Apr 16, 2012 7:29 am

diep wrote: What's the speedup now of SF - any improvements you measured?

I am running a test with real games because the speed-up in SMP case is difficult to measure reliably. After about 5K games on QUAD (4 threads) at 60''+0.1 we are at +4 ELO (no crashes), test is still running...

diep · Post by **diep** » Mon Apr 16, 2012 7:54 pm

mcostalba wrote:
diep wrote: What's the speedup now of SF - any improvements you measured?
I am running a test with real games because the speed-up in SMP case is difficult to measure reliably. After about 5K games on QUAD (4 threads) at 60''+0.1 we are at +4 ELO (no crashes), test is still running...

Oh if you test stuff at 4 cores only, then you definitely should also test what i did do back in 1999-2002, though i'm sure you'll implement it more efficient than what i did do back then. Your 'reparenting' is what i did do after 2003.

What i did do back then in 1999-summer 2002 was basically that all cores searching reported their splitpoints to a splitlist. There is several ways to maintain it, i used a simple linked list, maybe there are better methods now.

I kept it sorted and each idle CPU the only thing it had to do was look regurarly in the splitlist to see whether there still was a split for it.

In this manner you always automatically split at the biggest search depth left, closer to the root.

The only administration you need to do is that each position carries with it of course a label whether it still has itself in that splitlist, in which case you need to clean it up. Though you want it bugfree, if you always check for this in case a bit is set, then even a non-100% bugfree implementation will be able to work bugfree, because if you try to remove it but do not find this position in the splitlist anymore, no big deal.

Each action to the splitlist is on average O ( 1 ). Not worst case, *on average*.

So if each splitpoint in the splitlist has a left and a right, then you need minimal time to keep the list locked.

It's not a true DTS type search, but it does use the basic idea.

Let's call it another flavour of Diepeveen search, as i also mixed it with YBW

Its speedup is MAGNIFICENT GREAT.

Realize at modern processors the cpu's that nonstop try to find a splitpoint in the splitlist, they do only a READ. So that quickly is in their L1 cache, and they factucal don't communicate with the central spot having the splitpoints, so they can nonstop hammer at that variable without problem, as the hardware automatically gives an invalidate of your entry.

So in fact it is not such a bad alternative form of DTS.

If doing it for a few splitpoints gives you elo then i'm sure this gives more elo to your proggie

p.s. this will work great at shared memory systems - only at clusters/supercmoputers you need something else but you could for example just make a small list there with the best splitpoints.

bob · Post by **bob** » Mon Apr 16, 2012 8:53 pm

diep wrote:
mcostalba wrote:
diep wrote: In other words the situation that there is a position where we might be able to add a cpu occurs basically if in a position we already have a bunch of cores search in parallel AND there is more moves to search in that position Q left.
Yes, this is the condition I am looking for in my patch. And I also limit the search to the 'oldest' split points that it means, for each thread, master of a set of split points, I check only the split point highest in the tree so to possibly reparent near the root: this should help getting a longer average life time for the slave.
Marco that's what i have been trying to do some years ago in Diep as well.

Though not in a master slave model - as diep doesn't have master/slave model, entire search is in shared memory.

What i did do is each cpu keeps track of positions where one is allowed to split. other cpu's also having access to the shared ram, could read this if they wanted to, then send a request to the split owner to split it in at that spot that it found itself. splitowner received request and then splitted in the cpu requesting to split at that spot. Idea was to always split in closest to the root.

That's what i did do.

Took me a few years to realize we can already mathematically prove this to work crap as basically our parallel search already runs genius at the moment that we can split in there.

Because what basically needs to happen for this to work is that cpu's that already tried to split in others, there were no others to split in, as all cpu's were already carrying out a parallel search.

Only AFTER all cpu's already are doing a parallel search and machine busy 100% with 0 cpu's not having a job, then you could come along a position where you can't split as no cpu's are available.

Basically it's a DTS derived idea.

So
a) your idea is not new i already tried it and DTS is based upon it

In old Diep code i called this : HelpYourSelf()

b) it's a lot of work to prove it correct and has overhead

c) it only can give splitpoints to cpu's when all cpu's in the first place already are searching perfectly parallel and nothing is idle. So your machine already scales genius by then.

If something already scales well no need to add this feature.

2 guys already tried. One based its entire SMP upon it (Bob) and another (me) forgot that problem C.

p.s. note that i had turned off the feature quickly as in case of Diep i had not done B - it wasn't bugfree. Hadn't proven it on paper back then... ... and as in case of diep ever cpu is the same suddenly introducing splitowners was a bad idea at all.

p.s.2 one of problems i encountered is that in diep all cpu's are the same, so when one cpu would call an AbortMe() basically all splitpoints would be lost, so i also designed a mechanismback then to have other cpu take over those potential splitpoints

Couple of things I didn't quite follow. In Cray Blitz (not Crafty) when a thread went idle, it would look at all the active threads (their search data, actually) and select the best candidate for a split point and then inform that thread so that it could take the necessary actions to effect a split there. Of course, at times, the split point you choose has evaporated by the time the owner gets the request, so you get to repeat that again. But I don't see how the search having already developed good split points (parallel search is efficient) would make a suddenly idle thread waste effort if it just "joins" an already active split point. In Crafty or Cray Blitz, that was a no-overhead operation, the idle thread simply copies the split point search data and starts searching. If it was efficient before, it is still efficient now. Only issue in Crafty is that I end up doing more deep splits (far from the root) than shallow splits, because of YBW and recursion. It is therefore possible that one tries to "join" only to discover that entire sub-tree no longer exists.

As far as aborting and such, I only abort those threads that are working at a "busted" split point, or at some point below that split point (but in a tree that comes from it). For example, if we have a split point at ply=4, and threads A and B start to work there, and then A splits at ply=6 with itself and C, if I discover that the ply=6 split point is a CUT node, I about just C. If I discover the ply=4 split point is a CUT node, I about A, B and C since they are all involved in the same sub-tree. I do this by having all "siblings" working at a split point registered in the split point's parent data structure so that I can always discover who is working at this specific split point. And for each sibling, if they have split elsewhere, I can follow these "split blocks" to identify all that are helping, while ignoring those that are working in an unrelated part of the tree...

bob · Post by **bob** » Mon Apr 16, 2012 9:07 pm

Daniel Shawul wrote:Thank you for your thorough reply.
There is no question about it that if your hardware would be having no penalty for your cpu's to bother other cpu's and/or can freely look in remote memory, that the DTS principle of a CPU finding ITSELF a spot where it can split will give a great speedup.

I don't think the most important result is smaller amount of splitpoints. In Diep for example i'm doing massive amounts of splits the first few seconds of the search. Simply to the limit it can handle.

I tried to implement DTS in 1998 and succeeded (with one bugfix in 2000) and up to a cpu or 4 it had a genius speedup, yet right from the start i combined it with YBW, later more on that as it's a crucial thing. Note DTS scaled bad for Diep, but i didn't have an ultra efficient implementation and things were done centralized. Losing about 0.5 from each 4 cpu's back then roughly.

What i do now splits when it is convenient for searching processors and cpu's not active aren't stealing work as that has problems scaling.

Note that in all cases i assumed following the YBW conditions. So if you would equip DTS with YBW basically you can prove that the method of
basically splitting from the searching processors when it is convenient for them, that this performs similar, besides saving out massive waste of internal bandwidth:
I agree. Bandwidth is expensive especially when you start using more than 4 threads. The paper also mentions about using a "thrashing counter" before sending a help request.
Splitting at larger depths (fewer splits) help to avoid time wasted by splitting un-necessarily. I know you are talking on a larger scale (256 processor f.i) in which keeping processors busy is more important than selecting the best split point.

I would NEVER agree with that. If you select a bad split point, you might as well not split as you get zero gain for searching nodes that don't need to be searched. The more nodes you have, the more important selecting a good split point becomes, or your speedup flattens out very quickly.

Let's count now from the root our plies. So root is 0, after 1 move we are at 1, then 2, 3 plies away from root.

We start with the simple and most important case. Namely start of the search. We start searching with 1 core.

Suppose we are at a depth of n+1 in the PV.
We backtrack to n with processor P0

We can prove now that all other cpu's must be idle, because only 1 cpu can basically return in a YBW concept from a move. If there is idle cpu's now we already split them in at depth n (if it would give a cutoff we would split them in at n-1).
I think some limit the number of processors used at a split point (say max of 8 ). It would be unwise to split with all of them when a few would finish the job quickly. Also with 256 or more processors available some will undoubtedly be idle. The same thing also happens in endgames where wer can have few moves at each ply. But I understand your point.
Now we split in at depth n in this model. All moves get parallel distributed at this position P.

I want to drop next lemma now. When following YBW It's very difficult to find a position Q higher in the tree near the root than position P, which holds true that it follows YBW and no parallel search is currently getting carried out in Q.
It is actually more common than you think. Suppose you split at the root, and now you have two threads searching there. You can still split in either of the two simultaneous sub-trees you are searching if other threads become idle. And you can re-split, and re-split. I posted some data the other day where on an 8 core box, in normal games, I saw over 100 active split points max. That is at one instant in time. Since I don't split in the q-search at all, and since I limit myself to 4 threads per split point, max, there is a LOT of re-splitting going on. And for over 100 split blocks to be active, there must be a large number of different split points scattered around the tree...

Basically we can prove that if we already previously visited position Q in our current iteration, that we already had the opportunity to split there.
You might have had the opportunity, but not the ability. No idle threads. Where after searching a couple of other moves there, a thread working somewhere else finally finishes and now you can YBW split there at any time.

[quote\ And we had this opportunity while we can prove for the PV walkdown that all other cpu's were idle, and therefore could effectively get put to work.

I agree. For cases where we had the opportunity to split but not enough processors to search with, slave processors can check for idle processors at the split point
each time they come back to get a move.

In other words the situation that there is a position where we might be able to add a cpu occurs basically if in a position we already have a bunch of cores search in parallel AND there is more moves to search in that position Q left.

Correct as it was implemented in Cray Blitz and Crafty.

That's the only situation where DTS would in theory have a benefit, assuming your hardware would allow it, over the 'split in as we happen to touch this position' type strategy.

Your statement is way too general. The benefit of DTS is that the instant a thread goes idle, you do not have to split at the current node being searched in any thread, assuming the YBW criteria has been satisfied. With DTS you can split anywhere, no matter where any thread is currently searching. First thing that solves is zero idle time, because you do not have to wait for an active thread to reach a node where YBW criteria has been satisfied, you can split at other nodes, any other node in fact, where either the YBW criteria has been satisfied, or if you find none, you can split at one where you believe it will be satisfied, speculatively.

The bottomline is that basically our parallel search is already going great at the moment that such event would occur.

So DTS doesn't solve some sort of 'worst case problem'. In fact it creates a new worstcase, as it's going to blow all other cpu's bandwidth when just 1 core for a fraction of a second is busy following the YBW property. That doesn't need to be during start of the search only. At this crucial moment that you want to split in other cpu's as quick as possible, which is possible after your single core gets back in a position where it has searched the first move; at that crucial moment with DTS all your machines core are busy hammering and jamming your single core.

How? This is an implementation issue, not a necessary property of DTS...

One worst case scenario mentioned in the paper is re-splitting at deeper depths again and again may lead to overall slow down in some hypothetical situations.
Then to avoid this problem, it is suggested to start a speculative parallel search at N-1 once enough moves are searched at ply N that gaurantee the alpha bound.
It could be a waste of work if the bound gets improved once the rest of the moves are searched at ply N. Well this speculation is one thing that I would hope DTS could give
improvement... But one can also do speculative YBW by starting parallel search at ALL nodes without searching any moves. It gave a little improvement fro Feldmann and co.

So it CAN have an effect that exponentially slows down your YBW search, meanwhile it contributes positively in efficiency (avoiding a few splits) at a moment in time that your parallel search already is going great at a bunch of cores and basically is about to finish an iteration or subbranch.

The benefit versus worstcase is not positive for DTS at modern hardware, when following the YBW property.

I think you have developed some sort of basic misunderstanding about DTS. The basic principle IS "YBW". That is / was the primary criteria for choosing a split point. But DTS is better than pure YBW because in pure YBW, you can only split at a node where some instance of the search is currently about to choose the next move to search and it notices that (a) a thread is idle and (b) the YBW criteria has been satisfied. With DTS you can split at ANY node in ANY active tree that is being searched. Most likely you ALWAYS find a YBW-satisfying node to split. If not, THEN, and only then, does DTS suggest "speculation" rather than sitting idle.

It is mentioned that DTS was developed on cray (high bandwidth) and that this affected decisions of the algorithm. I am not sure which parts of the algorithm are affected most, but it is probably the work stealing as you suggested.

p.s. DTS doesn't need to be implemented in a centralized manner, neither does YBW. We just speak about a fraction of a second that the search is busy at 1 core. The trick is to be able to quickly split in of course - i never made a secret out of that.

Diep is not recursive. So basically to split in another cpu is just setting a few pointers in theory. I didn't even ultra optimize it - that cheap it already is.

Thanks it really is good to hear of your experience.[/quote]

diep · Post by **diep** » Mon Apr 16, 2012 10:55 pm

bob wrote:
Daniel Shawul wrote:Thank you for your thorough reply.
There is no question about it that if your hardware would be having no penalty for your cpu's to bother other cpu's and/or can freely look in remote memory, that the DTS principle of a CPU finding ITSELF a spot where it can split will give a great speedup.

I don't think the most important result is smaller amount of splitpoints. In Diep for example i'm doing massive amounts of splits the first few seconds of the search. Simply to the limit it can handle.

I tried to implement DTS in 1998 and succeeded (with one bugfix in 2000) and up to a cpu or 4 it had a genius speedup, yet right from the start i combined it with YBW, later more on that as it's a crucial thing. Note DTS scaled bad for Diep, but i didn't have an ultra efficient implementation and things were done centralized. Losing about 0.5 from each 4 cpu's back then roughly.

What i do now splits when it is convenient for searching processors and cpu's not active aren't stealing work as that has problems scaling.

Note that in all cases i assumed following the YBW conditions. So if you would equip DTS with YBW basically you can prove that the method of
basically splitting from the searching processors when it is convenient for them, that this performs similar, besides saving out massive waste of internal bandwidth:
I agree. Bandwidth is expensive especially when you start using more than 4 threads. The paper also mentions about using a "thrashing counter" before sending a help request.
Splitting at larger depths (fewer splits) help to avoid time wasted by splitting un-necessarily. I know you are talking on a larger scale (256 processor f.i) in which keeping processors busy is more important than selecting the best split point.
I would NEVER agree with that. If you select a bad split point, you might as well not split as you get zero gain for searching nodes that don't need to be searched. The more nodes you have, the more important selecting a good split point becomes, or your speedup flattens out very quickly.

Let's count now from the root our plies. So root is 0, after 1 move we are at 1, then 2, 3 plies away from root.

We start with the simple and most important case. Namely start of the search. We start searching with 1 core.

Suppose we are at a depth of n+1 in the PV.
We backtrack to n with processor P0

We can prove now that all other cpu's must be idle, because only 1 cpu can basically return in a YBW concept from a move. If there is idle cpu's now we already split them in at depth n (if it would give a cutoff we would split them in at n-1).
I think some limit the number of processors used at a split point (say max of 8 ). It would be unwise to split with all of them when a few would finish the job quickly. Also with 256 or more processors available some will undoubtedly be idle. The same thing also happens in endgames where wer can have few moves at each ply. But I understand your point.
Now we split in at depth n in this model. All moves get parallel distributed at this position P.

I want to drop next lemma now. When following YBW It's very difficult to find a position Q higher in the tree near the root than position P, which holds true that it follows YBW and no parallel search is currently getting carried out in Q.
It is actually more common than you think. Suppose you split at the root, and now you have two threads searching there. You can still split in either of the two simultaneous sub-trees you are searching if other threads become idle. And you can re-split, and re-split. I posted some data the other day where on an 8 core box, in normal games, I saw over 100 active split points max. That is at one instant in time. Since I don't split in the q-search at all, and since I limit myself to 4 threads per split point, max, there is a LOT of re-splitting going on. And for over 100 split blocks to be active, there must be a large number of different split points scattered around the tree...

Basically we can prove that if we already previously visited position Q in our current iteration, that we already had the opportunity to split there.
You might have had the opportunity, but not the ability. No idle threads. Where after searching a couple of other moves there, a thread working somewhere else finally finishes and now you can YBW split there at any time.

[quote\ And we had this opportunity while we can prove for the PV walkdown that all other cpu's were idle, and therefore could effectively get put to work.
I agree. For cases where we had the opportunity to split but not enough processors to search with, slave processors can check for idle processors at the split point
each time they come back to get a move.

In other words the situation that there is a position where we might be able to add a cpu occurs basically if in a position we already have a bunch of cores search in parallel AND there is more moves to search in that position Q left.
Correct as it was implemented in Cray Blitz and Crafty.

That's the only situation where DTS would in theory have a benefit, assuming your hardware would allow it, over the 'split in as we happen to touch this position' type strategy.

Your statement is way too general. The benefit of DTS is that the instant a thread goes idle, you do not have to split at the current node being searched in any thread, assuming the YBW criteria has been satisfied. With DTS you can split anywhere, no matter where any thread is currently searching. First thing that solves is zero idle time, because you do not have to wait for an active thread to reach a node where YBW criteria has been satisfied, you can split at other nodes, any other node in fact, where either the YBW criteria has been satisfied, or if you find none, you can split at one where you believe it will be satisfied, speculatively.

The bottomline is that basically our parallel search is already going great at the moment that such event would occur.

So DTS doesn't solve some sort of 'worst case problem'. In fact it creates a new worstcase, as it's going to blow all other cpu's bandwidth when just 1 core for a fraction of a second is busy following the YBW property. That doesn't need to be during start of the search only. At this crucial moment that you want to split in other cpu's as quick as possible, which is possible after your single core gets back in a position where it has searched the first move; at that crucial moment with DTS all your machines core are busy hammering and jamming your single core.
How? This is an implementation issue, not a necessary property of DTS...

One worst case scenario mentioned in the paper is re-splitting at deeper depths again and again may lead to overall slow down in some hypothetical situations.
Then to avoid this problem, it is suggested to start a speculative parallel search at N-1 once enough moves are searched at ply N that gaurantee the alpha bound.
It could be a waste of work if the bound gets improved once the rest of the moves are searched at ply N. Well this speculation is one thing that I would hope DTS could give
improvement... But one can also do speculative YBW by starting parallel search at ALL nodes without searching any moves. It gave a little improvement fro Feldmann and co.

So it CAN have an effect that exponentially slows down your YBW search, meanwhile it contributes positively in efficiency (avoiding a few splits) at a moment in time that your parallel search already is going great at a bunch of cores and basically is about to finish an iteration or subbranch.

The benefit versus worstcase is not positive for DTS at modern hardware, when following the YBW property.
I think you have developed some sort of basic misunderstanding about DTS. The basic principle IS "YBW". That is / was the primary criteria for choosing a split point. But DTS is better than pure YBW because in pure YBW, you can only split at a node where some instance of the search is currently about to choose the next move to search and it notices that (a) a thread is idle and (b) the YBW criteria has been satisfied. With DTS you can split at ANY node in ANY active tree that is being searched. Most likely you ALWAYS find a YBW-satisfying node to split. If not, THEN, and only then, does DTS suggest "speculation" rather than sitting idle.

It is mentioned that DTS was developed on cray (high bandwidth) and that this affected decisions of the algorithm. I am not sure which parts of the algorithm are affected most, but it is probably the work stealing as you suggested.

p.s. DTS doesn't need to be implemented in a centralized manner, neither does YBW. We just speak about a fraction of a second that the search is busy at 1 core. The trick is to be able to quickly split in of course - i never made a secret out of that.

Diep is not recursive. So basically to split in another cpu is just setting a few pointers in theory. I didn't even ultra optimize it - that cheap it already is.
Thanks it really is good to hear of your experience.

[/quote]

Bob discussing DTS is not interesting here. Basically we are on 2 tracks here.

Daniel was thinking of GPUs when writing it - other architecture. You should see it in that light. AMD 7970 gpu's start at 2048 cores. Nvidia 680 starts at 1536 cores.

Soon both 'big brothers' will release. That's double the number of cores. For Nvidia it will be easiest to release it.

It's not easy to have gpu cores pick their own splitpoints. Everything goes massive parallel there.

As for DTS. When asked exactly 10 years ago how you splitted, as you said you didn't follow YBW with DTS back then. You answerred that you used some chessknowledge to guess whether to split.

Furthermore you copied 64KB for each splitpoint (the stack).

Now that might not have been a problem at Cray

It sure was a worst case of Crafty some years ago which copied a kilobyte or 3. Nalimov then hacked in your code until it got a lot cheaper. I'd suggest you ship Nalimov a 'thank you' for that, as it took care crafty could scale to a lot more cores

He really got it more efficient that crafty SMP code.

Now i don't want to get into a math discussion too much. But if i say the word 'prove' that means a mathematical proof. If i say "Proof by empirical finding", that isn't a 100% proof. If i say : "it will", that's not any of the 2.

Assuming the YBW property we can prove a few things.

We can PROVE that if you have generated splitpoints closer to the root in DTS, that when following the YBW property we already visited that position before and had our CHANCE to split there.

we can PROVE that if this position happens to be in a PV node that we should have splitted already in that position (of course an implementation can spoil it).

We can prove that if this situation occured from a state where just 1 core was left searching that when returning to a position P, with an average there of 40 semi legal moves, that we could already split there. Usually that means we had room for a CPU or 40. Note in Diep i limit it to 32 cpu's a splitpoint max.

So we can PROVE that in 99.9% of the cases we already did do genius splitting.

now i don't know how your datastructure works, but in Diep it is the case that if a bunch of processors split in at position P, that these cores will first use up all moves of P prior to calling AbortMe().

So we can prove that we already have searched quite efficient the vaste majority of the search tree with that 'random splitting'. Giving an exact % is difficult there, but it'll be around a 95%+.

Now it happens to be the case he has a bug in his search algorithm in StockFish, namely for some programming reason they selfintroduced SF can't call AbortMe().

I know a zillion ways how to fix that, but then designing a new datastructure to again split in cpu's that own this position P is not a very clever thing to do from SMP viewpoint, IMHO.

Better spend your effort that you can really abort yourself there and have the others search further.

We can SHOW that not doing this, can create a worst case in which you nonstop keep splitting in at small search depths - which is what you refer to earlier on. Yet that happens because of an implementation shortcoming in StockFish - not because it is a problem in a proper YBW implementation.

Now i'm fully aware you know all this, as in 1998 that's exactly what you told me - so i fixed that for Diep in 1998

The radical solution for Diep back then was to make the program non-recursive, but that doesn't need to be the solution. Also in a recursive program you can easily build this.

Now all this said a few words on DTS. As i posted before at shared memory systems with 1 or 2 sockets, it's not a problem to still implement a DTS in the manner how i solved it back in 1999 (note i fixed a few nasty bugs around october 2000 in implementation).

This should scale ok at 1 or 2 sockets.

Latencies of 4 socket machines is a lot tougher to deal with and also the amount of cores suddenly then is far above 16 making it a lot tougher to get it working well - yet maybe you'll get away there a tad as well.

But back to proving.

We can prove that what some call 'random splitting' in fact optimizes itself very quickly towards the root. It's moving there very quickly.

We can make it plausible that the 'random splitting' needs to do less work to get say a cpu or 32 to work in YBW.

If we have our alfa = -inf, beta = +inf PVS window and get in position P,
then YBW will search its first move with 1 core p0.

If p0 returns in P, then with the direct splitting i get from the idle list in 1 dang 31 cpu's. It directly unlocks the idle list. It ships then the cpu's their 'start' command in P and a move to search and that's basically it.

Now DTS at the same big box.
Let's do simple about the first thing. p0 describes its position P and puts it in the splitpoint list as splittable.

The hardware will very cheap then update to each individual cpu that's idle an invalidate and the cpu's load that. Still we are busy dirt cheap.

some dozens of cpu's overwhelm each other then at this splitpoint as they all at nearly the same atomic time find this splitpoint. They all try to lock it. Just 1 of them gets the spinlock.

Again - this is a big machine not a tiny one.

Spinlock usually works a LITTLE DIFFERENT there.

It will do a few attempts and then the KERNEL will kick all those cpu's from the 'runqueue' and force them to idle. So a forced idle by kernel.

About 10 miliseconds later another one is allowed to unidle and get the split.

In meantime of course p0 already has searched the entire tree below its move and also wants to lock into this splitpoint, but it also gets overwhelmed by all the idle cpu's that try to get in.

The other cpu's are all fighting still for this splitpoint and will suffer another 10 milliseconds penalty. It will take real long until p0 can lock in.

So some splitpoint that should've taken maybe a 100 microseconds, it effectively takes some dozens of milliseconds to search.

Slowdown garantueed of factor 100.

Soon that exponential explodes and the program searches at a 1000 nodes a second or so a core effectively.

The real problem of this 'slowdown of a garantueed factor 100' is that it is already happening when basically 1 or 2 cores search a position P. So at the very start of the iteration it has the biggest impact, and at the very start of an iteration you 'd prefer quickly putting to work all cores...

Now i gave the extreme example of what factual happens at bigger machines. Also at tinier machines you don't want all that fuzz around 1 splitpoint as p0 also has to update it. It just won't be able to quickly lock back in. Idle cpu's are so eager to split in, and they all detected that splitpoint, so they all try to lock in and they all manage to lock in and make your life a mess.

It's what i call a worst case at a criticial spot of your search, namely start of the iteration and each time that just 1 core is searching, this behaviour is there that exponential will slow down your search.

You don't notice this problem much at a machine with fast latencies to the RAM, but it will be a nightmare at bigger machines.

mcostalba · Post by **mcostalba** » Tue Apr 17, 2012 1:30 pm

diep wrote: What i did do back then in 1999-summer 2002 was basically that all cores searching reported their splitpoints to a splitlist. I kept it sorted and each idle CPU the only thing it had to do was look regurarly in the splitlist to see whether there still was a split for it.

In this manner you always automatically split at the biggest search depth left, closer to the root.

Actually I have started with a kind of centralised list/vestor of splitpoints, but very quickly, in less than an afternoon, I found out that required locking scheme was a real mess and getting stuff race free was really hard. Because central splitpoint list has to lock against all the split points, while decentralising the design, where each split poin care only for itself, is much simpler and is much easier to write a race free implementation. OTH the fact that the list is sorted is not really important IMHO, because also if an idle thread scans the list linearly it is not a waste of time considering that the thread is supposed to be 'idle' so does not takes time for search and also the list is usually short.

Currently the thread picks up the first available split point that satisfies the constrains, so no sorting here and no full list scan, but I don't consider this to be a limit because the reparenting rate is small, about 5-10% so it means that most of the time the thread does not find an available split point, so sort them in advance has no sense.

bob · Post by **bob** » Tue Apr 17, 2012 4:54 pm

diep wrote:
bob wrote:
Daniel Shawul wrote:Thank you for your thorough reply.
There is no question about it that if your hardware would be having no penalty for your cpu's to bother other cpu's and/or can freely look in remote memory, that the DTS principle of a CPU finding ITSELF a spot where it can split will give a great speedup.

I don't think the most important result is smaller amount of splitpoints. In Diep for example i'm doing massive amounts of splits the first few seconds of the search. Simply to the limit it can handle.

I tried to implement DTS in 1998 and succeeded (with one bugfix in 2000) and up to a cpu or 4 it had a genius speedup, yet right from the start i combined it with YBW, later more on that as it's a crucial thing. Note DTS scaled bad for Diep, but i didn't have an ultra efficient implementation and things were done centralized. Losing about 0.5 from each 4 cpu's back then roughly.

What i do now splits when it is convenient for searching processors and cpu's not active aren't stealing work as that has problems scaling.

Note that in all cases i assumed following the YBW conditions. So if you would equip DTS with YBW basically you can prove that the method of
basically splitting from the searching processors when it is convenient for them, that this performs similar, besides saving out massive waste of internal bandwidth:
I agree. Bandwidth is expensive especially when you start using more than 4 threads. The paper also mentions about using a "thrashing counter" before sending a help request.
Splitting at larger depths (fewer splits) help to avoid time wasted by splitting un-necessarily. I know you are talking on a larger scale (256 processor f.i) in which keeping processors busy is more important than selecting the best split point.
I would NEVER agree with that. If you select a bad split point, you might as well not split as you get zero gain for searching nodes that don't need to be searched. The more nodes you have, the more important selecting a good split point becomes, or your speedup flattens out very quickly.

Let's count now from the root our plies. So root is 0, after 1 move we are at 1, then 2, 3 plies away from root.

We start with the simple and most important case. Namely start of the search. We start searching with 1 core.

Suppose we are at a depth of n+1 in the PV.
We backtrack to n with processor P0

We can prove now that all other cpu's must be idle, because only 1 cpu can basically return in a YBW concept from a move. If there is idle cpu's now we already split them in at depth n (if it would give a cutoff we would split them in at n-1).
I think some limit the number of processors used at a split point (say max of 8 ). It would be unwise to split with all of them when a few would finish the job quickly. Also with 256 or more processors available some will undoubtedly be idle. The same thing also happens in endgames where wer can have few moves at each ply. But I understand your point.
Now we split in at depth n in this model. All moves get parallel distributed at this position P.

I want to drop next lemma now. When following YBW It's very difficult to find a position Q higher in the tree near the root than position P, which holds true that it follows YBW and no parallel search is currently getting carried out in Q.
It is actually more common than you think. Suppose you split at the root, and now you have two threads searching there. You can still split in either of the two simultaneous sub-trees you are searching if other threads become idle. And you can re-split, and re-split. I posted some data the other day where on an 8 core box, in normal games, I saw over 100 active split points max. That is at one instant in time. Since I don't split in the q-search at all, and since I limit myself to 4 threads per split point, max, there is a LOT of re-splitting going on. And for over 100 split blocks to be active, there must be a large number of different split points scattered around the tree...

Basically we can prove that if we already previously visited position Q in our current iteration, that we already had the opportunity to split there.
You might have had the opportunity, but not the ability. No idle threads. Where after searching a couple of other moves there, a thread working somewhere else finally finishes and now you can YBW split there at any time.

[quote\ And we had this opportunity while we can prove for the PV walkdown that all other cpu's were idle, and therefore could effectively get put to work.
I agree. For cases where we had the opportunity to split but not enough processors to search with, slave processors can check for idle processors at the split point
each time they come back to get a move.

In other words the situation that there is a position where we might be able to add a cpu occurs basically if in a position we already have a bunch of cores search in parallel AND there is more moves to search in that position Q left.
Correct as it was implemented in Cray Blitz and Crafty.

That's the only situation where DTS would in theory have a benefit, assuming your hardware would allow it, over the 'split in as we happen to touch this position' type strategy.

Your statement is way too general. The benefit of DTS is that the instant a thread goes idle, you do not have to split at the current node being searched in any thread, assuming the YBW criteria has been satisfied. With DTS you can split anywhere, no matter where any thread is currently searching. First thing that solves is zero idle time, because you do not have to wait for an active thread to reach a node where YBW criteria has been satisfied, you can split at other nodes, any other node in fact, where either the YBW criteria has been satisfied, or if you find none, you can split at one where you believe it will be satisfied, speculatively.

The bottomline is that basically our parallel search is already going great at the moment that such event would occur.

So DTS doesn't solve some sort of 'worst case problem'. In fact it creates a new worstcase, as it's going to blow all other cpu's bandwidth when just 1 core for a fraction of a second is busy following the YBW property. That doesn't need to be during start of the search only. At this crucial moment that you want to split in other cpu's as quick as possible, which is possible after your single core gets back in a position where it has searched the first move; at that crucial moment with DTS all your machines core are busy hammering and jamming your single core.
How? This is an implementation issue, not a necessary property of DTS...

One worst case scenario mentioned in the paper is re-splitting at deeper depths again and again may lead to overall slow down in some hypothetical situations.
Then to avoid this problem, it is suggested to start a speculative parallel search at N-1 once enough moves are searched at ply N that gaurantee the alpha bound.
It could be a waste of work if the bound gets improved once the rest of the moves are searched at ply N. Well this speculation is one thing that I would hope DTS could give
improvement... But one can also do speculative YBW by starting parallel search at ALL nodes without searching any moves. It gave a little improvement fro Feldmann and co.

So it CAN have an effect that exponentially slows down your YBW search, meanwhile it contributes positively in efficiency (avoiding a few splits) at a moment in time that your parallel search already is going great at a bunch of cores and basically is about to finish an iteration or subbranch.

The benefit versus worstcase is not positive for DTS at modern hardware, when following the YBW property.
I think you have developed some sort of basic misunderstanding about DTS. The basic principle IS "YBW". That is / was the primary criteria for choosing a split point. But DTS is better than pure YBW because in pure YBW, you can only split at a node where some instance of the search is currently about to choose the next move to search and it notices that (a) a thread is idle and (b) the YBW criteria has been satisfied. With DTS you can split at ANY node in ANY active tree that is being searched. Most likely you ALWAYS find a YBW-satisfying node to split. If not, THEN, and only then, does DTS suggest "speculation" rather than sitting idle.

It is mentioned that DTS was developed on cray (high bandwidth) and that this affected decisions of the algorithm. I am not sure which parts of the algorithm are affected most, but it is probably the work stealing as you suggested.

p.s. DTS doesn't need to be implemented in a centralized manner, neither does YBW. We just speak about a fraction of a second that the search is busy at 1 core. The trick is to be able to quickly split in of course - i never made a secret out of that.

Diep is not recursive. So basically to split in another cpu is just setting a few pointers in theory. I didn't even ultra optimize it - that cheap it already is.
Thanks it really is good to hear of your experience.

Bob discussing DTS is not interesting here. Basically we are on 2 tracks here.

Daniel was thinking of GPUs when writing it - other architecture. You should see it in that light. AMD 7970 gpu's start at 2048 cores. Nvidia 680 starts at 1536 cores.

Soon both 'big brothers' will release. That's double the number of cores. For Nvidia it will be easiest to release it.

It's not easy to have gpu cores pick their own splitpoints. Everything goes massive parallel there.

As for DTS. When asked exactly 10 years ago how you splitted, as you said you didn't follow YBW with DTS back then. You answerred that you used some chessknowledge to guess whether to split.

Furthermore you copied 64KB for each splitpoint (the stack).

Now that might not have been a problem at Cray

It sure was a worst case of Crafty some years ago which copied a kilobyte or 3. Nalimov then hacked in your code until it got a lot cheaper. I'd suggest you ship Nalimov a 'thank you' for that, as it took care crafty could scale to a lot more cores

He really got it more efficient that crafty SMP code.

Now i don't want to get into a math discussion too much. But if i say the word 'prove' that means a mathematical proof. If i say "Proof by empirical finding", that isn't a 100% proof. If i say : "it will", that's not any of the 2.

Assuming the YBW property we can prove a few things.

We can PROVE that if you have generated splitpoints closer to the root in DTS, that when following the YBW property we already visited that position before and had our CHANCE to split there.

we can PROVE that if this position happens to be in a PV node that we should have splitted already in that position (of course an implementation can spoil it).

We can prove that if this situation occured from a state where just 1 core was left searching that when returning to a position P, with an average there of 40 semi legal moves, that we could already split there. Usually that means we had room for a CPU or 40. Note in Diep i limit it to 32 cpu's a splitpoint max.

So we can PROVE that in 99.9% of the cases we already did do genius splitting.

now i don't know how your datastructure works, but in Diep it is the case that if a bunch of processors split in at position P, that these cores will first use up all moves of P prior to calling AbortMe().

So we can prove that we already have searched quite efficient the vaste majority of the search tree with that 'random splitting'. Giving an exact % is difficult there, but it'll be around a 95%+.

Now it happens to be the case he has a bug in his search algorithm in StockFish, namely for some programming reason they selfintroduced SF can't call AbortMe().

I know a zillion ways how to fix that, but then designing a new datastructure to again split in cpu's that own this position P is not a very clever thing to do from SMP viewpoint, IMHO.

Better spend your effort that you can really abort yourself there and have the others search further.

We can SHOW that not doing this, can create a worst case in which you nonstop keep splitting in at small search depths - which is what you refer to earlier on. Yet that happens because of an implementation shortcoming in StockFish - not because it is a problem in a proper YBW implementation.

Now i'm fully aware you know all this, as in 1998 that's exactly what you told me - so i fixed that for Diep in 1998

The radical solution for Diep back then was to make the program non-recursive, but that doesn't need to be the solution. Also in a recursive program you can easily build this.

Now all this said a few words on DTS. As i posted before at shared memory systems with 1 or 2 sockets, it's not a problem to still implement a DTS in the manner how i solved it back in 1999 (note i fixed a few nasty bugs around october 2000 in implementation).

This should scale ok at 1 or 2 sockets.

Latencies of 4 socket machines is a lot tougher to deal with and also the amount of cores suddenly then is far above 16 making it a lot tougher to get it working well - yet maybe you'll get away there a tad as well.

But back to proving.

We can prove that what some call 'random splitting' in fact optimizes itself very quickly towards the root. It's moving there very quickly.

We can make it plausible that the 'random splitting' needs to do less work to get say a cpu or 32 to work in YBW.

If we have our alfa = -inf, beta = +inf PVS window and get in position P,
then YBW will search its first move with 1 core p0.

If p0 returns in P, then with the direct splitting i get from the idle list in 1 dang 31 cpu's. It directly unlocks the idle list. It ships then the cpu's their 'start' command in P and a move to search and that's basically it.

Now DTS at the same big box.
Let's do simple about the first thing. p0 describes its position P and puts it in the splitpoint list as splittable.

The hardware will very cheap then update to each individual cpu that's idle an invalidate and the cpu's load that. Still we are busy dirt cheap.

some dozens of cpu's overwhelm each other then at this splitpoint as they all at nearly the same atomic time find this splitpoint. They all try to lock it. Just 1 of them gets the spinlock.

Again - this is a big machine not a tiny one.

Spinlock usually works a LITTLE DIFFERENT there.

It will do a few attempts and then the KERNEL will kick all those cpu's from the 'runqueue' and force them to idle. So a forced idle by kernel.

About 10 miliseconds later another one is allowed to unidle and get the split.

In meantime of course p0 already has searched the entire tree below its move and also wants to lock into this splitpoint, but it also gets overwhelmed by all the idle cpu's that try to get in.

The other cpu's are all fighting still for this splitpoint and will suffer another 10 milliseconds penalty. It will take real long until p0 can lock in.

So some splitpoint that should've taken maybe a 100 microseconds, it effectively takes some dozens of milliseconds to search.

Slowdown garantueed of factor 100.

Soon that exponential explodes and the program searches at a 1000 nodes a second or so a core effectively.

The real problem of this 'slowdown of a garantueed factor 100' is that it is already happening when basically 1 or 2 cores search a position P. So at the very start of the iteration it has the biggest impact, and at the very start of an iteration you 'd prefer quickly putting to work all cores...

Now i gave the extreme example of what factual happens at bigger machines. Also at tinier machines you don't want all that fuzz around 1 splitpoint as p0 also has to update it. It just won't be able to quickly lock back in. Idle cpu's are so eager to split in, and they all detected that splitpoint, so they all try to lock in and they all manage to lock in and make your life a mess.

It's what i call a worst case at a criticial spot of your search, namely start of the iteration and each time that just 1 core is searching, this behaviour is there that exponential will slow down your search.

You don't notice this problem much at a machine with fast latencies to the RAM, but it will be a nightmare at bigger machines.[/quote]

I wasn't thinking GPU at all because the original post was not about a massively-parallel GPU search but was, instead, about the stockfish search.

GPUs have enough bottlenecks that I have simply not been that interested, other than in casually keeping up with what is being done. But they have a very defined target, and are being optimized toward that target, as opposed to true general-purpose computing.

Daniel Shawul · Post by **Daniel Shawul** » Tue Apr 17, 2012 7:37 pm

I am not sure if you are responding to me or not. Your quote is really messed up.

I agree. Bandwidth is expensive especially when you start using more than 4 threads. The paper also mentions about using a "thrashing counter" before sending a help request.
Splitting at larger depths (fewer splits) help to avoid time wasted by splitting un-necessarily. I know you are talking on a larger scale (256 processor f.i) in which keeping processors busy is more important than selecting the best split point.
I would NEVER agree with that. If you select a bad split point, you might as well not split as you get zero gain for searching nodes that don't need to be searched. The more nodes you have, the more important selecting a good split point becomes, or your speedup flattens out very quickly.

Splitting at larger depths reduces the number of split points since each split spends more time to spend. That is what I meant by unnecessary splits. Not the bad split points where we get cutoffs on parallel search. We are selecting among those splits considered good already. Hope that clears it up.
The rest of your post seems a reply to Vincent's posts.

YBWC: Active Reparenting

Re: YBWC: Active Reparenting

Re: YBWC: Active Reparenting

Re: YBWC: Active Reparenting

Re: YBWC: Active Reparenting

Re: YBWC: Active Reparenting

Re: YBWC: Active Reparenting

Re: YBWC: Active Reparenting

Re: YBWC: Active Reparenting

Re: YBWC: Active Reparenting

Re: YBWC: Active Reparenting