YBWC: Active Reparenting

Discussion of chess software programming and technical issues.

Moderator: Ras

Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: YBWC: Active Reparenting

Post by Daniel Shawul »

I wasn't thinking GPU at all because the original post was not about a massively-parallel GPU search but was, instead, about the Stockfish search.

GPUs have enough bottlenecks that I have simply not been that interested, other than in casually keeping up with what is being done. But they have a very defined target, and are being optimized toward that target, as opposed to true general-purpose computing.
Wasn't your DTS algorithm originally meant for the Cray, which is a vector processor? Just admit you are too lazy or busy to do GPU computing.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: YBWC: Active Reparenting

Post by Daniel Shawul »

bob wrote:
Daniel Shawul wrote:
Rein Halbersma wrote:
mcostalba wrote:In the Young Brothers Wait Concept (YBWC), available slaves are booked by the split point master, then start to search below the assigned split point and, once finished, return to an idle state, waiting to be booked by another master.

I have pushed a patch that introduces what I have called "Active Reparenting": when a slave finishes its job on the assigned split point, instead of passively waiting to be booked, it looks for a suitable active split point and reparents itself to it. It then immediately starts to search below that split point in exactly the same way as the split point's other slaves. This reduces the time spent waiting in the idle loop to zero and _could_ increase scalability, especially with many (8 or more) cores; at least, that is the idea.
Perhaps Don Dailey can comment on this further, but it seems like you have rediscovered "randomized work stealing"! The MIT Cilk project used such a provably optimal scheduler to implement CilkChess. More speculatively: I wouldn't be surprised if even the convoluted DTS-type scheduling of the YBW algorithm would be another manifestation of work stealing.
What he did is not randomized work stealing; there is nothing random there. He just checks existing split points and reattaches himself to one.
This will most probably not decrease idle time, because idle threads are picked up quickly anyway if you let them, but the selection made by the finishing slave, i.e. reattaching higher up in the tree, may help. I actually mentioned the randomized work stealing that I use in my cluster search in my first post.
But that might not be the smartest thing to do there, since we could have different network interfaces with varying latency. MPI allows you to group processors for the most efficient work sharing.

Cilk begins with the overhead of creating and destroying threads, as Vincent explained. I don't know if it keeps a thread pool for that, but it is still expensive anyway. For simple applications like parallel Fibonacci, serial versions outperform parallel versions due to this overhead. OpenMP has added task-based parallelism now, but its performance is poor compared to Cilk. http://myxman.org/dp/node/182
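The parallel-Fibonacci point can be illustrated with a short sketch. This uses C++ `std::async` as a stand-in for Cilk/OpenMP tasks (the function names and the cutoff value are illustrative assumptions, not anyone's actual code): without a granularity cutoff, per-task overhead dwarfs the tiny amount of work each task does.

```cpp
#include <cstdint>
#include <future>

// Plain serial Fibonacci: the baseline the parallel version must beat.
std::uint64_t fib_serial(int n) {
    return n < 2 ? std::uint64_t(n) : fib_serial(n - 1) + fib_serial(n - 2);
}

// Task-parallel Fibonacci with a granularity cutoff. Below the cutoff we
// fall back to the serial version; without that cutoff the cost of
// creating a task per call dwarfs the single addition it performs, and
// the "parallel" version runs slower than the serial one.
std::uint64_t fib_parallel(int n, int cutoff = 20) {
    if (n < cutoff) return fib_serial(n);
    auto left = std::async(std::launch::async, fib_parallel, n - 1, cutoff);
    std::uint64_t right = fib_parallel(n - 2, cutoff);
    return left.get() + right;
}
```

Lowering the cutoff to 2 makes the function spawn a task for nearly every call, which typically runs far slower than `fib_serial`, which is exactly the overhead effect described above.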

Also, from what I read, the performance of the *Socrates chess program was affected by this overhead too. The authors blamed the poor performance on the algorithm used (Jamboree search), but that seems very similar to the YBW algorithm to me. http://publications.csail.mit.edu/lcs/p ... TM-548.pdf
A speedup of 48 out of 256 is what they got, and I also read somewhere else that not many extensions/reductions were used.
So I have my doubts about Cilk, even though it makes life a lot easier.
Here's a key point I did not see mentioned. I have measured my "idle time" many times. It is extremely low. If that is true for most everyone, this is not going to help any, other than to save the overhead of a real split, as opposed to a "fake split" that only has to do one copy...
I mentioned this in the first line of the post you replied to. I quote again.
This will most probably not decrease idle time, because idle threads are picked up quickly anyway if you let them, but the selection made by the finishing slave, i.e. reattaching higher up in the tree, may help.
It is pretty much confirmed now by the small +4 Elo gain. That improvement is probably due to better selection of split points, unlike the conventional way of working threads picking up idle threads (instead of the other way round).
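The distinction being drawn in this thread (deterministic reattachment to a chosen split point versus Cilk-style randomized victim selection) can be sketched in a few lines. The data structures and the lowest-ply selection rule here are hypothetical illustrations, not Stockfish's actual implementation:

```cpp
#include <array>
#include <random>

// Hypothetical split-point bookkeeping, just enough to contrast the two
// selection policies discussed above.
struct SplitPoint {
    int  ply    = 0;      // depth in the tree where the split occurred
    int  slaves = 0;      // threads currently attached
    bool active = false;  // still has unsearched moves
};

constexpr int MAX_SPLITS = 8;
using SplitTable = std::array<SplitPoint, MAX_SPLITS>;

// "Active reparenting": a thread that just finished scans the table and
// attaches to the active split point highest up in the tree (lowest ply),
// on the theory that shallower splits offer bigger chunks of work.
int pick_reparent_target(const SplitTable& table) {
    int best = -1;
    for (int i = 0; i < MAX_SPLITS; ++i)
        if (table[i].active && (best < 0 || table[i].ply < table[best].ply))
            best = i;
    return best;  // -1 if nothing is active
}

// Randomized work stealing, by contrast, picks a victim uniformly at
// random among the active split points (Cilk-style victim selection).
int pick_random_victim(const SplitTable& table, std::mt19937& rng) {
    int candidates[MAX_SPLITS], n = 0;
    for (int i = 0; i < MAX_SPLITS; ++i)
        if (table[i].active) candidates[n++] = i;
    if (n == 0) return -1;
    return candidates[std::uniform_int_distribution<int>(0, n - 1)(rng)];
}
```

The interpretation of the +4 Elo gain above is that `pick_reparent_target` gets to apply a quality criterion (here, preferring splits higher in the tree), whereas a random victim does not.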
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YBWC: Active Reparenting

Post by bob »

Daniel Shawul wrote:
I wasn't thinking GPU at all because the original post was not about a massively-parallel GPU search but was, instead, about the Stockfish search.

GPUs have enough bottlenecks that I have simply not been that interested, other than in casually keeping up with what is being done. But they have a very defined target, and are being optimized toward that target, as opposed to true general-purpose computing.
Wasn't your DTS algorithm originally meant for the Cray, which is a vector processor? Just admit you are too lazy or busy to do GPU computing.
A vector machine like the Cray has nothing to do with GPUs. The Cray was a true parallel SMP box with up to 32 CPUs, shared memory, and high-bandwidth, low-latency memory.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: YBWC: Active Reparenting

Post by Daniel Shawul »

bob wrote:
Daniel Shawul wrote:
I wasn't thinking GPU at all because the original post was not about a massively-parallel GPU search but was, instead, about the Stockfish search.

GPUs have enough bottlenecks that I have simply not been that interested, other than in casually keeping up with what is being done. But they have a very defined target, and are being optimized toward that target, as opposed to true general-purpose computing.
Wasn't your DTS algorithm originally meant for the Cray, which is a vector processor? Just admit you are too lazy or busy to do GPU computing.
A vector machine like the Cray has nothing to do with GPUs. The Cray was a true parallel SMP box with up to 32 CPUs, shared memory, and high-bandwidth, low-latency memory.
No time for bullshit. Nobody said they are exactly the same. They certainly share a lot, both being vector processors. You have an algorithm that depends on a certain feature of the hardware, just like GPUs do (i.e. high bandwidth), so it is entirely appropriate to discuss the bandwidth aspect, which is what we did.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YBWC: Active Reparenting

Post by bob »

Daniel Shawul wrote:
bob wrote:
Daniel Shawul wrote:
I wasn't thinking GPU at all because the original post was not about a massively-parallel GPU search but was, instead, about the Stockfish search.

GPUs have enough bottlenecks that I have simply not been that interested, other than in casually keeping up with what is being done. But they have a very defined target, and are being optimized toward that target, as opposed to true general-purpose computing.
Wasn't your DTS algorithm originally meant for the Cray, which is a vector processor? Just admit you are too lazy or busy to do GPU computing.
A vector machine like the Cray has nothing to do with GPUs. The Cray was a true parallel SMP box with up to 32 CPUs, shared memory, and high-bandwidth, low-latency memory.
No time for bullshit. Nobody said they are exactly the same. They certainly share a lot, both being vector processors. You have an algorithm that depends on a certain feature of the hardware, just like GPUs do (i.e. high bandwidth), so it is entirely appropriate to discuss the bandwidth aspect, which is what we did.
Let's simply not "go there". The Cray and the GPU boxes have next to nothing in common. The GPUs have "no bandwidth" when compared to the Cray machines; the GPUs cannot do things like 32 128-bit reads per clock cycle and the like. Chess on a GPU and chess on a Cray really have nothing in common, unless you want to claim that chess on a GPU is equivalent to chess on an N-core PC. The Cray was not SIMD...
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: YBWC: Active Reparenting

Post by Daniel Shawul »

bob wrote:
Daniel Shawul wrote:
bob wrote:
Daniel Shawul wrote:
I wasn't thinking GPU at all because the original post was not about a massively-parallel GPU search but was, instead, about the Stockfish search.

GPUs have enough bottlenecks that I have simply not been that interested, other than in casually keeping up with what is being done. But they have a very defined target, and are being optimized toward that target, as opposed to true general-purpose computing.
Wasn't your DTS algorithm originally meant for the Cray, which is a vector processor? Just admit you are too lazy or busy to do GPU computing.
A vector machine like the Cray has nothing to do with GPUs. The Cray was a true parallel SMP box with up to 32 CPUs, shared memory, and high-bandwidth, low-latency memory.
No time for bullshit. Nobody said they are exactly the same. They certainly share a lot, both being vector processors. You have an algorithm that depends on a certain feature of the hardware, just like GPUs do (i.e. high bandwidth), so it is entirely appropriate to discuss the bandwidth aspect, which is what we did.
Let's simply not "go there". The Cray and the GPU boxes have next to nothing in common. The GPUs have "no bandwidth" when compared to the Cray machines; the GPUs cannot do things like 32 128-bit reads per clock cycle and the like. Chess on a GPU and chess on a Cray really have nothing in common, unless you want to claim that chess on a GPU is equivalent to chess on an N-core PC. The Cray was not SIMD...
Again, nobody said it is SIMD. If it were, it would mean you did a sort of GPU chess in the 70s, which you didn't. So enough with the straw men.
My point is that your algorithm, as you stated many times in your paper, is meant for special hardware: the Cray. It is pure hypocrisy to call GPUs specialized while your algorithm is itself specialized for something else. AFAIK people make modifications to DTS to make it suitable for current hardware...
The GPU serves thousands of threads, compared to the 4 or 8 threads that CPUs have, and the maximum theoretical bandwidth for GPUs is hundreds of GB/s.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: YBWC: Active Reparenting

Post by diep »

Daniel Shawul wrote: It is pretty much confirmed now by the small +4 Elo gain. That improvement is probably due to better selection of split points, unlike the conventional way of working threads picking up idle threads (instead of the other way round).
Sorry, I have to laugh at this statement. Even if the Elo gain were real at slower time controls (remember, they just tested it at bullet), it would be explained in a totally different manner.

However, those game-tree search theories one never posts online, because others would then win another 200 Elo through increased insight into game-tree search.

Vincent
Last edited by diep on Wed Apr 18, 2012 1:22 am, edited 1 time in total.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: YBWC: Active Reparenting

Post by diep »

Daniel Shawul wrote:
bob wrote:
Daniel Shawul wrote:
I wasn't thinking GPU at all because the original post was not about a massively-parallel GPU search but was, instead, about the Stockfish search.

GPUs have enough bottlenecks that I have simply not been that interested, other than in casually keeping up with what is being done. But they have a very defined target, and are being optimized toward that target, as opposed to true general-purpose computing.
Wasn't your DTS algorithm originally meant for the Cray, which is a vector processor? Just admit you are too lazy or busy to do GPU computing.
A vector machine like the Cray has nothing to do with GPUs. The Cray was a true parallel SMP box with up to 32 CPUs, shared memory, and high-bandwidth, low-latency memory.
No time for bullshit. Nobody said they are exactly the same. They certainly share a lot, both being vector processors. You have an algorithm that depends on a certain feature of the hardware, just like GPUs do (i.e. high bandwidth), so it is entirely appropriate to discuss the bandwidth aspect, which is what we did.
You're not even close, Daniel.

Today's stuff is all CPUs, OK?

The Cray that Bob refers to is not a CPU.

It's a physical block of electronics, so to speak. That's why they ate so much power as well: think megawatts for the 'equivalent' of a processor. That's also why their performance was totally different from today's CPUs. The crossbars there were real crossbars, from what I understood when I read the Cray manual, let's say 10 years ago.

Building something like that today would be really expensive.

Later, in the 90s, Cray switched to Alpha CPUs. That was a big change.
Again, the latency between the CPUs was far better than elsewhere, because of the huge crossbars they used. Also expensive ones.

NUMA eventually won out, and the fact that the commodity ('sporthal') top-500 machines don't really need low node-to-node latency probably contributed most.

Some chess programs that worked very well on the Alpha Crays did not scale so well on today's clusters. That tells you something about the low latency the Cray already had back then, also for the later DEC-Alpha-equipped Crays.

That is, for computer chess, the big difference between today's CPUs, with their terrible CPU-to-CPU latencies, and the electronic blocks from Cray, where that latency was basically nonexistent.

Of course, that also explains why each Cray had its own water-cooling plant.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: YBWC: Active Reparenting

Post by diep »

bob wrote:
Daniel Shawul wrote:
bob wrote:
Daniel Shawul wrote:
I wasn't thinking GPU at all because the original post was not about a massively-parallel GPU search but was, instead, about the Stockfish search.

GPUs have enough bottlenecks that I have simply not been that interested, other than in casually keeping up with what is being done. But they have a very defined target, and are being optimized toward that target, as opposed to true general-purpose computing.
Wasn't your DTS algorithm originally meant for the Cray, which is a vector processor? Just admit you are too lazy or busy to do GPU computing.
A vector machine like the Cray has nothing to do with GPUs. The Cray was a true parallel SMP box with up to 32 CPUs, shared memory, and high-bandwidth, low-latency memory.
No time for bullshit. Nobody said they are exactly the same. They certainly share a lot, both being vector processors. You have an algorithm that depends on a certain feature of the hardware, just like GPUs do (i.e. high bandwidth), so it is entirely appropriate to discuss the bandwidth aspect, which is what we did.
Let's simply not "go there". The Cray and the GPU boxes have next to nothing in common. The GPUs have "no bandwidth" when compared to the Cray machines; the GPUs cannot do things like 32 128-bit reads per clock cycle and the like. Chess on a GPU and chess on a Cray really have nothing in common, unless you want to claim that chess on a GPU is equivalent to chess on an N-core PC. The Cray was not SIMD...
Bob, don't take the bait: nowadays computer chess is getting overcrowded by some who are busy with things far above their IQ level.

And one has to admit: see how far they have gotten so far by tuning trivial small modifications and testing them well.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: YBWC: Active Reparenting

Post by diep »

Daniel Shawul wrote:
bob wrote:
Daniel Shawul wrote:
bob wrote:
Daniel Shawul wrote:
I wasn't thinking GPU at all because the original post was not about a massively-parallel GPU search but was, instead, about the Stockfish search.

GPUs have enough bottlenecks that I have simply not been that interested, other than in casually keeping up with what is being done. But they have a very defined target, and are being optimized toward that target, as opposed to true general-purpose computing.
Wasn't your DTS algorithm originally meant for the Cray, which is a vector processor? Just admit you are too lazy or busy to do GPU computing.
A vector machine like the Cray has nothing to do with GPUs. The Cray was a true parallel SMP box with up to 32 CPUs, shared memory, and high-bandwidth, low-latency memory.
No time for bullshit. Nobody said they are exactly the same. They certainly share a lot, both being vector processors. You have an algorithm that depends on a certain feature of the hardware, just like GPUs do (i.e. high bandwidth), so it is entirely appropriate to discuss the bandwidth aspect, which is what we did.
Let's simply not "go there". The Cray and the GPU boxes have next to nothing in common. The GPUs have "no bandwidth" when compared to the Cray machines; the GPUs cannot do things like 32 128-bit reads per clock cycle and the like. Chess on a GPU and chess on a Cray really have nothing in common, unless you want to claim that chess on a GPU is equivalent to chess on an N-core PC. The Cray was not SIMD...
Again, nobody said it is SIMD. If it were, it would mean you did a sort of GPU chess in the 70s, which you didn't. So enough with the straw men.
My point is that your algorithm, as you stated many times in your paper, is meant for special hardware: the Cray. It is pure hypocrisy to call GPUs specialized while your algorithm is itself specialized for something else. AFAIK people make modifications to DTS to make it suitable for current hardware...
The GPU serves thousands of threads, compared to the 4 or 8 threads that CPUs have, and the maximum theoretical bandwidth for GPUs is hundreds of GB/s.
Daniel, if you google you'll find all the old Cray manuals. It's a totally different architecture from today's hardware.

On today's hardware it's all about the weak latencies to the RAM and to the other cores.

When I designed Diep in 2002/2003 for the SGI supercomputer, the Origin 3800 with 512 processors, the guess back then was that I wouldn't scale well either, since all supercomputer chess programs before Diep had simply lost a factor of 40 to 50 on the NUMA machines. Then I started to realize how ugly the CPU-to-CPU latency on the 512-CPU Origin 3800 was, as I had written a special test program for it; the supercomputer hadn't been tested at all before by the Dutch government. They hadn't yet figured out that one should first test that something delivers what you paid those dozens of millions for, prior to using it.

Also, it's possible I wrote the first program on the planet to test, on a NUMA system, the latency from one CPU to a remote CPU with all CPUs doing that at the same time.

As I found out, only some one-way ping-pong tests had been performed, with the entire box idle; the routers in fact even optimized their caches for such a test, giving a lower latency. Totally useless, of course, if you want to use all cores.

So I first wrote a bunch of tests, or better, the few tests I could afford to 'waste' supercomputer time on.

The situation hasn't gotten better over the past few years. CPUs got a lot faster, yet the latency from one CPU to another node didn't really improve much.

To give an example calculation: each L5420 core I've got here at home in the cluster today is exactly a factor of 10 faster than the R14000 CPUs in the Origin 3800.

Yet the blocked read latency to the hash table on the Origin 3800 was 5.8 microseconds. I found that really bad back then.

Scaled by that factor of 10, that would be 580 nanoseconds now, and I can assure you no cluster is going to deliver it to you in 580 ns. It's going to be several microseconds, and you bet I'll write another benchmark to measure that exactly :)

So I'll have exact numbers there soon.

But from your previous posts I'm sure you realize the latencies of today's hardware. Around 2005-2007 at Nvidia, when we tested the latency for the first time, the RAM <=> GPU latency wasn't that good either. Actually, one would write it using the word 0.x milliseconds, not microseconds, in fact. FPGAs don't do much better there, by the way; Chrilly couldn't easily get under 10 microseconds. Over PCIe this would be slightly better, but even then it's roughly a microsecond to get through the PCIe bus and back.

0.3 to 0.6 microseconds is what you hear from most of the manufacturers.

But on top of that comes the card and the switch, and sometimes other routers as well (level 2 and level 3). So I doubt it's much better than that 5.8 us still :)
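The kind of latency test described in this thread can be sketched for local RAM. This is an illustrative stand-in, not Vincent's actual benchmark: measuring node-to-node latency on a cluster needs MPI, but the principle of chaining dependent loads so the hardware cannot overlap them is the same.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// A minimal pointer-chasing microbenchmark: each load depends on the
// previous one, so the CPU cannot pipeline them and the average time per
// step approximates the true load-to-use latency of the memory being hit.
double chase_latency_ns(std::size_t n_slots, std::size_t n_steps) {
    // Build a random permutation so the hardware prefetcher cannot guess
    // the next address.
    std::vector<std::size_t> next(n_slots);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::shuffle(next.begin(), next.end(), std::mt19937{12345});

    auto t0 = std::chrono::steady_clock::now();
    std::size_t idx = 0;
    for (std::size_t i = 0; i < n_steps; ++i)
        idx = next[idx];           // serialized dependent loads
    auto t1 = std::chrono::steady_clock::now();

    // Keep `idx` observable so the loop is not optimized away.
    volatile std::size_t sink = idx; (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count()
           / static_cast<double>(n_steps);
}
```

With `n_slots` large enough that the chain spills out of cache, the per-step time approaches DRAM latency, the local analogue of the blocked-read numbers quoted above; run against a remote node's memory, the same dependent-load structure is what exposes the microsecond-scale figures being discussed.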