bob wrote: There are a few ugly race conditions to handle as well. One thread runs out of work to do, finds an active split point, verifies there is something to do, and then does the usual "copy/work" stuff. But it can then split at a point that is no longer active, because between the moment you check that the split point is active and has work left, and the moment you copy the necessary data to start the parallel search, one of the other processors already searching could have completed everything and cleaned up...
Obviously you didn't read the code. Thanks anyhow for the report on your past experiences. I actually cannot say whether it is good or bad until after testing, but because I think this patch kicks in (for better or worse) only with a high number of cores, I was asking for some test support.
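The window bob describes is a classic check-then-act race. A minimal sketch of one way to close it (hypothetical names, not Crafty's or Stockfish's actual code): re-validate the split point under its own lock before committing to the state copy.

```cpp
#include <mutex>

struct SplitPoint {
    std::mutex lock;
    bool active = false;   // cleared by the last worker during cleanup
    int  movesLeft = 0;    // unsearched moves remaining at this node
    int  workers = 0;      // threads currently attached
};

// Attach to sp only if it is still alive. The unlocked scan that found sp
// may be stale; the decisive check happens under sp.lock, and the state
// copy is finished before the lock is released, so a concurrent cleanup
// cannot invalidate the data after we have committed to searching it.
bool tryJoin(SplitPoint& sp) {
    std::lock_guard<std::mutex> guard(sp.lock);
    if (!sp.active || sp.movesLeft == 0)
        return false;               // died between the scan and now
    ++sp.workers;
    // ... copy position, bounds and move list into thread-local state ...
    return true;
}
```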
YBWC: Active Reparenting
Moderator: Ras
-
mcostalba
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: YBWC: Active Reparenting
-
Rein Halbersma
- Posts: 751
- Joined: Tue May 22, 2007 11:13 am
Re: YBWC: Active Reparenting
mcostalba wrote: In the Young Brothers Wait Concept (YBWC), available slaves are booked by the split point master, then start to search below the assigned split point and, once finished, return to an idle state waiting to be booked by another master.
I have pushed a patch that introduces what I have called "Active Reparenting": when a slave finishes its job on the assigned split point, instead of passively waiting to be booked, it searches for a suitable active split point and reparents itself to that split point, then immediately starts to search below it in exactly the same way as the split point's other slaves. This reduces the time spent waiting in the idle loop to zero and _could_ increase scalability, especially with many (8 or more) cores; at least that is the idea.
Perhaps Don Dailey can comment on this further, but it seems like you have rediscovered "randomized work stealing"! The MIT Cilk project used such a provably optimal scheduler to implement CilkChess. More speculatively: I wouldn't be surprised if even the convoluted DTS-type scheduling of the YBW algorithm were another manifestation of work stealing.
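A rough sketch of the mechanism as described (hypothetical names and data structures; the actual patch is in the Stockfish repository): a slave whose split point has completed scans the active split points and, preferring the one closest to the root, attaches itself instead of going idle.

```cpp
#include <atomic>
#include <vector>

struct SplitPoint {
    std::atomic<bool> active{false};
    std::atomic<int>  movesLeft{0};
    int ply = 0;                    // distance from root; lower = older
    // ... position, bounds, move list, slave bookkeeping ...
};

std::vector<SplitPoint*> activeSplitPoints;  // maintained under a global lock

// Called by a slave whose split point just completed. Prefer the split
// point highest in the tree, so the new parent is likely to live longest.
SplitPoint* pickReparentTarget() {
    SplitPoint* best = nullptr;
    for (SplitPoint* sp : activeSplitPoints)
        if (sp->active.load() && sp->movesLeft.load() > 0
            && (!best || sp->ply < best->ply))
            best = sp;
    return best;        // nullptr -> fall back to the normal idle wait
}
```

Marco's refinement quoted further down - checking, for each master, only its oldest split point - is one cheap way to implement the "highest in the tree" preference.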
-
diep
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: YBWC: Active Reparenting
Rein Halbersma wrote: The MIT Cilk project used such a provably optimal scheduler to implement CilkChess.
Cilk is too slow for realtime software like computer chess - back then it lost a factor of 40 somewhere.
That's too much for today's hardware to lose.
Even losing a factor of 2 would simply be too much; it's not even serious to consider using something that first loses you a factor of 40.
It is like coding a financial trading platform in Java - first losing factors in a business where every nanosecond matters when trading towards the exchanges. The result is that only the platform owner makes a profit, while on average the platform loses trading towards the exchanges, because the platform's choice of programming language was too slow in the first place, whereas most foreign hedge funds use C/C++ for this, effectively looting the Dutch platforms by being faster. Speed is everything there.
p.s. Diep on the same hardware as Cilkchess (an Origin 3800) was factors faster in nps. If you realize that Diep's nps would already double if it counted nodes in the same manner, that Diep is a factor 10-20 slower than the fastest beancounters, that I kept 10% of the cpus free (which with today's SMP I wouldn't do anymore), and furthermore that Diep profits more from today's compilers, as they have PGO nowadays and the SGI compiler back then didn't, despite me asking SGI several times - that tells you something.
-
diep
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: YBWC: Active Reparenting
Rein Halbersma wrote: More speculatively: I wouldn't be surprised if even the convoluted DTS-type scheduling of the YBW algorithm were another manifestation of work stealing.
Note that Cilkchess used YBW, and used the Cilk framework to start new threads to search the moves in parallel; Don showed me the source code of how they did it in Cilk.
Cilk is not clever there and cannot be. If you use a generic framework, we can always prove it will be factors slower than if you code it up manually, because slower splitting has an exponential effect.
If you use a generic framework we can also prove it needs more communication, and every cacheline of communication is really expensive on a cluster. Nowadays even more so than back then: each node is a factor 10-20 faster, while the latency of cluster communication (i.e. blocking forms of communication) is not much better - maybe a factor of 2 at most, in most cases not faster at all.
So we can prove on paper that any generic framework that handles our SMP is always factors slower than doing it manually.
Now, on top of that factor of 4 or so, Cilk of course loses another factor of 10, which might be explained by cpus sitting idle. To reactivate a thread, the minimum action is putting it back in the runqueue, which is a 10 millisecond delay.
All this for something you want to complete preferably within a few microseconds.
That is a factor 1000 difference in latency.
Cilk simply can't communicate enough to get more than 5000 nodes per second per core.
In exponential games like computer chess it simply hits the maximum speed the framework can deliver. It seemed to me that it suffered massively from master-slave problems.
So on today's hardware, if we ran the same programs as back then, Cilk would be another factor of 10 slower, making any sane program using Cilk 400 times slower than it could run on that supercomputer.
Chrilly Donninger initially had a similar problem with Brutus (later called Hydra, when the Sheikh paid him). The Zugzwang framework could deliver a maximum of 1000 nodes per second per cpu.
Now that was with a one-way network ping-pong latency of around 8 microseconds; in short, that means that on today's cluster hardware - a big cluster - it would be roughly 2x faster, delivering 2000 nodes per second per cpu.
Today that is not acceptable.
-
diep
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: YBWC: Active Reparenting
diep wrote: In other words, the situation where there is a position to which we might be able to add a cpu basically occurs only if we already have a bunch of cores searching that position in parallel AND there are more moves left to search in that position Q.
mcostalba wrote: Yes, this is the condition I am looking for in my patch. And I also limit the search to the 'oldest' split points: that means that for each thread that is master of a set of split points, I check only the split point highest in the tree, so as to possibly reparent near the root. This should help getting a longer average lifetime for the slave.
Marco, that's what I tried to do some years ago in Diep as well.
Though not in a master-slave model - Diep doesn't have a master/slave model; the entire search is in shared memory.
What I did is have each cpu keep track of positions where one is allowed to split. Other cpus, which also have access to the shared ram, could read this if they wanted to, then send a request to the split owner asking to be split in at the spot they had found themselves. The split owner received the request and then split the requesting cpu in at that spot. The idea was to always split as close to the root as possible (a sketch of this handshake follows after this post).
That's what I did.
It took me a few years to realize we can already prove mathematically that this works poorly, because basically our parallel search is already running brilliantly at the moment we are able to split in there.
Because what basically needs to happen for this to trigger is that cpus already tried to split into others and there were no others to split into, as all cpus were already carrying out a parallel search.
Only AFTER all cpus are already doing a parallel search, and the machine is 100% busy with 0 cpus lacking a job, can you come across a position where you can't split because no cpus are available.
Basically it's a DTS-derived idea.
So
a) your idea is not new; I already tried it, and DTS is based upon it
In the old Diep code I called this: HelpYourSelf()
b) it's a lot of work to prove it correct, and it has overhead
c) it can only give split points to cpus when all cpus are already searching perfectly in parallel and nothing is idle - so by then your machine already scales brilliantly.
If something already scales well, there is no need to add this feature.
Two guys already tried it: one based his entire SMP upon it (Bob), and the other (me) forgot about problem c).
p.s. Note that I quickly turned the feature off, as in Diep's case I had not done b) - it wasn't bug-free. I hadn't proven it on paper back then... and since in Diep every cpu is the same, suddenly introducing split owners was a bad idea anyway.
p.s.2 One of the problems I encountered is that in Diep all cpus are the same, so when one cpu called AbortMe(), basically all its split points would be lost; so back then I also designed a mechanism to have another cpu take over those potential split points.
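A schematic of the handshake diep describes, assuming a flat shared-memory table (all names except HelpYourSelf are hypothetical): idle cpus scan the advertised splittable positions and ask the owner, through a one-slot mailbox, to split them in.

```cpp
#include <atomic>

constexpr int MAX_CPUS = 64;

// Shared memory: each cpu advertises at most one node where a split is legal.
struct SplitSlot {
    std::atomic<bool> splittable{false};
    std::atomic<int>  ply{0};       // prefer slots closest to the root
};
SplitSlot splitTable[MAX_CPUS];

// One pending join request per owner, polled inside the owner's search loop.
std::atomic<int> joinRequest[MAX_CPUS];  // -1 = empty, otherwise requester id

// HelpYourSelf: an idle cpu finds the best advertised slot and asks its
// owner to split it in at that spot. Losing the CAS race means retrying.
void helpYourSelf(int myId) {
    int bestOwner = -1, bestPly = 1 << 20;
    for (int cpu = 0; cpu < MAX_CPUS; ++cpu)
        if (cpu != myId && splitTable[cpu].splittable.load()
            && splitTable[cpu].ply.load() < bestPly) {
            bestOwner = cpu;
            bestPly   = splitTable[cpu].ply.load();
        }
    if (bestOwner >= 0) {
        int expected = -1;
        joinRequest[bestOwner].compare_exchange_strong(expected, myId);
    }
}
```

The hard parts diep lists are exactly the ones this sketch ignores: proving the advertised slot is still valid when the request arrives (point b), and handing the slots over to another cpu when the owner calls AbortMe().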
-
Daniel Shawul
- Posts: 4186
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: YBWC: Active Reparenting
Rein Halbersma wrote: Perhaps Don Dailey can comment on this further, but it seems like you have rediscovered "randomized work stealing"!
What he did is not randomized work stealing; there is nothing random there. He just checks the existing split points and reattaches the finished thread to one of them.
This will most probably not decrease idle time, because working threads pick up idle threads quickly anyway if you let them, but the selection it makes, i.e. split points higher up in the tree, may help. I actually mentioned the randomized work stealing that I use in my cluster search in my first post.
But this might not be the smartest thing to do, since we could have different network interfaces with varying latency. MPI allows you to group processors for the most efficient work sharing (sketched below).
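For the grouping Daniel mentions, splitting communicators is the standard MPI tool. A minimal sketch (the 8-ranks-per-node mapping is a made-up assumption):

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int worldRank;
    MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);

    // Hypothetical grouping: ranks with the same "color" (say, on the same
    // node or behind the same switch) get their own communicator, so work
    // stealing can prefer victims reachable over the cheapest interface.
    int color = worldRank / 8;           // assume 8 ranks per node
    MPI_Comm nodeComm;
    MPI_Comm_split(MPI_COMM_WORLD, color, worldRank, &nodeComm);

    int nodeRank;
    MPI_Comm_rank(nodeComm, &nodeRank);
    std::printf("world rank %d -> group %d, group rank %d\n",
                worldRank, color, nodeRank);

    MPI_Comm_free(&nodeComm);
    MPI_Finalize();
}
```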
Cilk starts with the overhead of creating and destroying threads, as Vincent explained. I don't know if it keeps a thread pool for that, but it is still expensive anyway. For simple applications like parallel fibonacci, serial versions outperform parallel versions due to this overhead (a sketch follows at the end of this post). OpenMP has added task-based parallelism now, but its performance is poor compared to Cilk. http://myxman.org/dp/node/182
Also, from what I read, the performance of Socrates chess was affected by this overhead too. The authors blamed the poor performance on the algorithm used (Jamboree search), but that seems to me very similar to the YBW algorithm. http://publications.csail.mit.edu/lcs/p ... TM-548.pdf
A speedup of 48 out of 256 is what they get, and I also read somewhere else that not many extensions/reductions were used.
So I have my doubts about Cilk, even though it makes life a lot easier.
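The fibonacci point is easy to reproduce in plain C++, with std::async standing in for Cilk's spawn (the cutoff value is arbitrary; lowering it toward zero makes the spawn overhead dominate completely, which is the effect described above):

```cpp
#include <future>
#include <cstdio>

long fibSerial(int n) {
    return n < 2 ? n : fibSerial(n - 1) + fibSerial(n - 2);
}

// Cilk-style spawn per call, via std::async. Each spawn costs a thread
// creation plus synchronization - vastly more than the addition it guards -
// so without a grain-size cutoff the parallel version loses badly to
// fibSerial on small inputs.
long fibParallel(int n, int cutoff = 12) {
    if (n < 2)
        return n;
    if (n <= cutoff)
        return fibSerial(n);        // below the cutoff, spawning never pays
    auto left  = std::async(std::launch::async, fibParallel, n - 1, cutoff);
    long right = fibParallel(n - 2, cutoff);
    return left.get() + right;
}

int main() {
    std::printf("%ld %ld\n", fibSerial(25), fibParallel(25));
}
```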
-
diep
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: YBWC: Active Reparenting
Daniel Shawul wrote: So I have my doubts about Cilk, even though it makes life a lot easier.
Daniel - the real problem of Cilk is the same as that of all the parallel programs back then on massive supercomputers with some hundreds of cpus.
They first slow their program down by a factor of 40 to 50 in order to claim better scaling and speedup.
That's exactly what Cilk was doing. Run on 1 processor, Cilkchess got 5000 nps - a simple bitboard program with a near-zero evaluation, on a 300MHz Sun processor. Yes, a 64-bit processor.
Without Cilk, on his laptop, it got 200k nps. And that was a much lower-clocked laptop with a mobile chip.
For comparison, on the same Sun processor the Diep of that time got 10k nps - and Diep already had a huge evaluation back then, with all sorts of mobility scans - while Crafty got around 200k nps on a single core.
Now all the 'research' and pdfs, the few that exist, are based upon comparing an n-processor search against a single slowed-down program that suffers the framework slowdown.
It's like claiming your soldiers move fast on the battlefield.
And to prove they move fast, you have them carry 50 kilos - and one of the spectators also gets to carry 50 kilos.
It's a nonsense comparison, as you of course want to compare against a runner without that luggage: in computer chess you have to look strong against programs not using Cilk as well.
On top of that, the big major problem is the hopeless idea of doing things totally centralized within Cilk, which is already an exponential burden, plus threads that get idled and unidled, which is a big death penalty - one you of course hardly notice after first being slowed down by a factor of 40+.
Another bizarre claim is the speedup claim all these guys make - yet of course those searches aren't comparable to today's searches. If you search full-width, then of course it's much easier to split and get a good speedup.
The problem is, when I talked to them at the world championship in 1999 and saw their output, none of them got more than 1 to 2 million nps or so on a 512-processor supercomputer. Cilkchess ran on a 512-processor Origin 3800. Some years later Diep would also run on it and reach 10.1 mln nps. Counted in the same manner as Cilkchess, that would be 20 mln nps - Diep being a chess program with a huge evaluation function scanning all sorts of things in all sorts of loops as well.
My later requests to obtain logfiles from Cilkchess were all ignored.
That already tells you the whole story.
Note that Deep Blue's logfiles also carry no search counter at all.
If you hide something that you HAVE to measure, then you have something to hide...
So on the same hardware it was already roughly a knockout for Diep versus Cilk: a program that is a factor of 10 to 20 slower than a bitboard beancounter was effectively factors faster on the same supercomputer.
Their scaling and speedup numbers for *Socrates and Cilkchess are basically academic fraud in its purest form, and everyone knows it. Chrilly Donninger later joined them in doing that: he compared a single Hydra processor not using hashtables in the last 6 plies or so against the full cluster not using hashtables in the last 6 plies, while the ONLY REASON not to use hashtables in the last 6 plies was that it didn't scale... meanwhile not using them deteriorates performance by factors - around a factor of 10 in my measurements.
In the first place it's pathetic to do the comparison in that manner.
Secondly, it's pure fraud. Third, it's the same fraud the guys before him committed.
If you aren't good enough, and/or your framework isn't good enough, to play exponential games with, don't say it has good scaling and a good speedup, because it hasn't.
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: YBWC: Active Reparenting
mcostalba wrote: Obviously you didn't read the code.
Sure didn't. I was simply commenting on the idea itself, which is not new. DTS did that in Cray Blitz, since there it is simple. But I limited it, because I would rather not have too many threads collect at one split point... BTW, why wouldn't this "kick in" with even 4 cores???
I have seen as many as 100 active split points on a 12-core box in a 30-minute game. For 8 cores the max was about the same, with a peak of 94 in the one logfile I examined...
Unfortunately I don't have any significant 2/4-core games, so I can't give data there, but with so many split points there is plenty of opportunity to do an "out-of-band split" at an already active point...
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: YBWC: Active Reparenting
Daniel Shawul wrote: This will most probably not decrease idle time, because working threads pick up idle threads quickly anyway if you let them...
Here's a key point I did not see mentioned: I have measured my "idle time" many times, and it is extremely low. If that is true for most everyone, this is not going to help much, other than saving the overhead of a real split, as opposed to a "fake split" that only has to do one copy...
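Schematically, the cost difference bob draws is who pays for the copying (hypothetical names): a real split pays a booking round-trip plus one state copy per booked slave up front, while a late join pays exactly one copy, for the thread attaching itself.

```cpp
#include <vector>

struct SearchState { /* position, bounds, move list, ... */ };

struct SplitPoint {
    SearchState shared;     // the master's state, published at the split
    // ... locks, counters, slave list ...
};

// Real split: the master books idle slaves and state is copied per slave.
void splitFull(SplitPoint& sp, std::vector<SearchState*>& bookedSlaves) {
    for (SearchState* slave : bookedSlaves)
        *slave = sp.shared; // one copy per booked slave, plus booking cost
}

// "Fake" split / late join: one finished thread attaches on its own.
void joinLate(SplitPoint& sp, SearchState& me) {
    me = sp.shared;         // exactly one copy, no booking round-trip
}
```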
-
diep
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: YBWC: Active Reparenting
bob wrote: BTW, why wouldn't this "kick in" with even 4 cores???
Of course it kicks in at 4 cpus. However, if you implement it with bugs - I didn't check out the code Marco wrote there - then the bug you run into when a cpu owns split points is that when this cpu goes off to do other work, it loses its split points. So you do need a mechanism by which other cpus can take over your split points; otherwise that 'valuable' split point is lost forever, without anyone ever having split in there.
I did build this mechanism in Diep. Back then I didn't have enough cores to test it. Yet, as I pointed out, we can already prove that we would have split at that spot earlier, except when all other cores were already busy, in which case the machine is fully busy anyway.
That said, it's not a bad idea to split closer to the root; YBW already gives us that opportunity, though.