Parallelization questions, ABDADA or DTS?

Discussion of chess software programming and technical issues.

Moderators: hgm, Harvey Williamson, bob

User avatar
Dragulic
Posts: 53
Joined: Tue Mar 06, 2012 2:28 pm

Re: Parallelization questions, ABDADA or DTS?

Post by Dragulic » Sat Mar 24, 2012 9:26 pm

With these views I, a permanent beginner, concur. But when people think of parallelisation today, they think of 8 cores, 16 threads, that level. OK, maybe a 200-node cluster.

IBM and Intel have indicated the future lies in massive parallelisation: on-chip, limited functionality, but 1,000 or 10,000 cores. And many chips. A machine with 1,000,000 threads could be common within our working lives.

How will these algorithms cope and scale?

diep
Posts: 1780
Joined: Thu Mar 09, 2006 10:54 pm
Location: The Netherlands
Contact:

Re: Parallelization questions, ABDADA or DTS?

Post by diep » Sat Mar 24, 2012 9:52 pm

BeRo wrote:
diep wrote: If you do that then it's more of a lossy YBW already, basically :)
Yes, there you're right.
diep wrote: It's questionable whether you can abort children via the hashtable in a lossless manner. We can prove, of course, that you might always overwrite that specific hashtable entry.

So only a lossy concept would make sense then.

The question I'd have then is: if you already do the administrative work of YBW, why would you need the overhead of additional stores to the hashtable, which only makes things lossy and slower?
I'm implementing the plain classical YBWC stuff (together with the helpful-master concept) as a first step at the moment, and then I'll see what optimizations I can do as a next step. So the chances, as you already said, that it remains plain YBWC with some small optimizations are very high. The original hybrid idea was just a chain of first brainstorm thoughts.
diep wrote: By the way, are you capable of reading the original ABDADA paper?
It's encrypted into Anglish by the original French author! Bon.
No, I'm a German, not a Frenchman, so don't be confused by my French-sounding surname. :)
Yes, I realize that, because the thorough and systematically working Rudolf Huber was the first guy who could decipher the Anglish written by the original ABDADA author. I'm convinced ABDADA would have become popular quite a tad sooner otherwise.

The big difference between ABDADA and all the others is that ABDADA basically gives an implementation, not so much an algorithm.

YBW is just a simple-to-understand algorithm, which leaves open the question of how to implement it efficiently; implementing it efficiently for dozens of cores is *really hard*.
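To make the idea concrete, here is a minimal, illustrative Young Brothers Wait sketch in Python over a synthetic game tree. All names and the toy tree are assumptions for the demo, not Diep's implementation: the eldest brother is searched serially to establish a bound, and only then are the younger brothers handed to helper threads.

```python
from concurrent.futures import ThreadPoolExecutor

INF = 10 ** 9
BRANCH = 4
POOL = ThreadPoolExecutor(max_workers=4)

def h(x):
    # Deterministic pseudo-random hash that generates the synthetic tree.
    return (x * 2654435761 + 12345) & 0xFFFFFFFF

def children(node):
    return [h(node ^ (i * 0x9E3779B9)) for i in range(1, BRANCH + 1)]

def evaluate(node):
    # Static evaluation in [-100, 100], from the side to move's view.
    return h(node) % 201 - 100

def ybw(node, depth, alpha, beta, parallel=True):
    if depth == 0:
        return evaluate(node)
    kids = children(node)
    # Young Brothers Wait: always search the eldest brother first,
    # serially, so we only split with a real alpha bound in hand.
    best = -ybw(kids[0], depth - 1, -beta, -alpha, parallel)
    if best >= beta:
        return best
    alpha = max(alpha, best)
    if parallel and depth >= 3:
        # Split point: the younger brothers go to helper threads.
        futures = [POOL.submit(ybw, c, depth - 1, -beta, -alpha, False)
                   for c in kids[1:]]
        for f in futures:
            best = max(best, -f.result())
    else:
        for c in kids[1:]:
            s = -ybw(c, depth - 1, -beta, -alpha, parallel)
            best = max(best, s)
            alpha = max(alpha, s)
            if alpha >= beta:
                break  # beta cutoff
    return best
```

Because the split waits for the eldest brother, the parallel search returns the same root value as the serial one; the helper threads merely do not share beta cutoffs in this toy version, which is part of why an efficient YBW is hard.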

ABDADA on the other hand, with its implementation limitation, can of course be combined with YBW, because ABDADA in itself isn't an algorithm at all :)

It's just an implementation idea: "oh, let's solve it using lossy storage in the hashtable".

In theory you could get away with setting just 1 bit there; you probably won't split near the leaves anyway, so the odds that you overwrite it using a multiprobe system are already really small, and the odds of a write error, which is a huge factor smaller than an overwrite, are even more negligible.

The theoretical odds of an overwrite are also really small. It's just the chaining lemma, which really does apply; it's not some sort of 'theoretical effect': chaining REALLY happens.

The chaining effect causes a small chance of an overwrite, and you can easily risk that.

So for ABDADA you need really, really little. Yet it's just an implementation idea, and a cheap one; it's not an algorithm from my point of view.
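As an illustration of how little ABDADA needs, here is a hypothetical Python sketch of the lossy "busy" marking and the two-pass move loop. The table layout and all names are my own assumptions, not the paper's code: a fixed-size table where a single busy flag per entry may simply be overwritten.

```python
TABLE_SIZE = 1 << 16  # deliberately small; overwrites are allowed

class Entry:
    __slots__ = ("key", "busy")
    def __init__(self):
        self.key, self.busy = 0, False

TABLE = [Entry() for _ in range(TABLE_SIZE)]

def mark_busy(pos_hash):
    # Lossy store: we may overwrite an unrelated entry, and that is fine.
    e = TABLE[pos_hash % TABLE_SIZE]
    e.key, e.busy = pos_hash, True

def is_busy(pos_hash):
    e = TABLE[pos_hash % TABLE_SIZE]
    return e.busy and e.key == pos_hash

def clear_busy(pos_hash):
    e = TABLE[pos_hash % TABLE_SIZE]
    if e.key == pos_hash:
        e.busy = False

def abdada_move_loop(child_hashes, search_one):
    # Pass 1: search children nobody else is on; defer the busy ones.
    deferred = []
    for hsh in child_hashes:
        if is_busy(hsh):
            deferred.append(hsh)
            continue
        mark_busy(hsh)
        search_one(hsh)
        clear_busy(hsh)
    # Pass 2: search the deferred children regardless.
    for hsh in deferred:
        search_one(hsh)
```

An overwrite only means a processor occasionally searches a child someone else is already on, which costs some duplicated work but never produces a wrong result.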

Starting with a master-slave concept is maybe a good way to start implementing a parallel search, yet in the long run take into account that you want to get rid of it. A central master is always an obstacle you don't want to have; it creates a huge worst case.

Vincent

diep
Posts: 1780
Joined: Thu Mar 09, 2006 10:54 pm
Location: The Netherlands
Contact:

Re: Parallelization questions, ABDADA or DTS?

Post by diep » Sat Mar 24, 2012 10:16 pm

Dragulic wrote:With these views I, a permanent beginner, concur. But when people think of parallelisation today, they think of 8 cores, 16 threads, that level. OK, maybe a 200-node cluster.

IBM and Intel have indicated the future lies in massive parallelisation: on-chip, limited functionality, but 1,000 or 10,000 cores. And many chips. A machine with 1,000,000 threads could be common within our working lives.

How will these algorithms cope and scale?
Dragulic, it's not about the scaling of the algorithms. You must reverse the question here.

I gave a speech in Switzerland for the HPC Advisory Council. My slides are probably online there. If I compare Diep at 2 million nps with Deep Blue, and I turn OFF all algorithmic improvements since then, so I just search full width, then I'm searching 2 ply deeper than Deep Blue, whereas Diep is NOT forward pruning the last few plies and Deep Blue was.

The information that reached me is that Deep Blue used the APHID algorithm from Jonathan Schaeffer, in those days a logical choice if your focus was getting more nps for your boss, regardless of whether those nps were any use.

Deep Blue doesn't count nodes. I just start, of course, from the first move out of book; comparing in endgames the difference is much bigger, as their hashtable simply wasn't there, or worked badly, because of the hardware processors. Algorithmically the Deep Blue team also showed they were not so clever, claiming they had 'hardware hashtables' nearly ready to go; hashtables in hardware would not have mattered much with 4-ply searches in hardware, of course, as the other 479 processors don't communicate with this one. Chrilly Donninger, having had a similar experience with hardware processors, doesn't understand how they could search 4 ply in hardware: "doing 4 ply in hardware is extremely inefficient; I basically do mostly 2-ply searches with Hydra and try to push everything towards a software search."

Deep Blue was already searching in the opponent's time and finishes 10 ply
after 1.Nf3 d5 2.g3 Bg4 3.b3.

Diep in fact doubts there and in the end switches to Bxf3, so it switches its PV; and as we can see, Deep Blue also switches its PV, yet to the wrong move.

Deep Blue needed roughly 167 seconds on 480 hardware processors, and we forgive it that it also searched in the opponent's time, which Diep didn't do here. Now, the marketing department of IBM claimed it got 200 million nps.

The Deep Blue team claims in an official publication, later in 2001, that it was 133 million nps.

APHID is an algorithm that in itself should scale well.

The last improvements to Diep's move ordering were done in 1998; it's not that I have 14 years of advantage there. Also, Diep was never tuned to be really efficient at full width.

If I turn on nullmove, so full-width search + nullmove, not a single other form of forward pruning, then Diep reaches 17 plies.

Just R=3; I already used that back then, see my postings on RGCC.
Diep was just 4 years old in 1998.

The 1998 version will actually use fewer nodes to reach that 10 ply than today's Diep, let me assure you of that.

So Diep needs about 14 million nodes to finish 10 ply, and with nullmove at 2 million nps it would have reached 17 ply back then. An algorithm well known back then and used by everyone, after Frans Morsch was so honest as to publicly state he won the 1995 world title with it. You have to give Frans credit for that.

Deep Blue 1997 gets 10 ply there with 133 million nps. That's what they OFFICIALLY claim in an official publication. Not some vague writing on the internet: an official publication on advances in artificial intelligence.

133 million nps * 167 seconds = 22.2 billion nodes
Diep: 14 million nodes

A factor of 1586 difference.
Searching full width, Diep gets 2.6 million nps.

Searching with nullmove R=3, Diep gets 17 ply at 2.0 million nps.
This is on a 16-processor Barcelona box, with a non-optimized executable.
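To illustrate what a null move with R=3 does to the tree size, here is a hedged Python sketch over a synthetic game tree. The toy tree, the hash trick that simulates passing, and all names are assumptions for the demo, not Diep's code.

```python
INF = 10 ** 9
R = 3          # null-move reduction, as discussed above
BRANCH = 4
NODES = [0]    # visited-node counter

def h(x):
    # Deterministic pseudo-random hash that generates the synthetic tree.
    return (x * 2654435761 + 12345) & 0xFFFFFFFF

def children(node):
    return [h(node ^ (i * 0x9E3779B9)) for i in range(1, BRANCH + 1)]

def evaluate(node):
    # Static evaluation in [-100, 100], from the side to move's view.
    return h(node) % 201 - 100

def search(node, depth, alpha, beta, use_null, can_null=True):
    NODES[0] += 1
    if depth == 0:
        return evaluate(node)
    # Null move: hand the opponent a free move and search R+1 plies
    # shallower with a zero window around beta. If we STILL fail high,
    # assume the real search would too and cut immediately.
    if use_null and can_null and depth > R:
        passed = h(node ^ 0xDEADBEEF)  # toy stand-in for "side to move flips"
        s = -search(passed, depth - 1 - R, -beta, -beta + 1, use_null, False)
        if s >= beta:
            return s
    best = -INF
    for c in children(node):
        s = -search(c, depth - 1, -beta, -alpha, use_null)
        best = max(best, s)
        alpha = max(alpha, s)
        if alpha >= beta:
            break  # beta cutoff
    return best

def count_nodes(use_null, depth=7, root=123456):
    NODES[0] = 0
    search(root, depth, -INF, INF, use_null)
    return NODES[0]
```

On a real engine the saving compounds per iteration, which is where a jump like 10 to 17 ply comes from; on this toy tree the cut is unsound in general (think zugzwang), it only demonstrates the node savings.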

In reality the real question is: why didn't they use nullmove?
The answer is of course obvious: they would have gotten a lower nps,
but even then that's not a good answer.

I don't want to redo that discussion, of course. They weren't bad guys algorithmically, as they invented a few things of their own; yet overall they were pretty clumsy with algorithms, from a lossless-searchdepth viewpoint. Where they did do well is extending their 1988 plan: with expensive hardware and their own CPUs they got a huge number of nps that only 15 years later we can also get. By 1997, though, that was totally outdated because of their clumsiness in searching efficiently.

The hardware did 4-ply searches, which is way too much, and they didn't use killer moves in hardware. We'll forgive them their parallel search, as doing a good parallel search simply is very tough, and by 1997 no one had yet achieved an efficient parallel search for so many processors.

That took until Diep, and I easily profited from the knowledge of others; Bob Hyatt is important there, and Rainer Feldmann, for clearly explaining YBW to me back in 1998 in Paderborn.

Yet Diep on their 30-node RS/6000 machine would have outsearched Deep Blue hands down; realize that Diep is a really SLOW searcher in nps.
If one of the slowest engines is outsearching you, that isn't very good news for you. Know what I mean?

All those algorithms get so totally hammered by the efficiency of well-implemented YBW programs that I would want to rephrase your question on scaling.

The question is not: how well does algorithm X or Y scale?

The real question to you is: "how well can you get YBW to work, and
TAKE CARE that it scales well on many cores?"

And if you don't ask the right question, you'll be a factor 1000+ slower anyway than the guy with a well-functioning YBW, as you don't preserve your branching factor with other forms of parallel search!

Vincent

diep
Posts: 1780
Joined: Thu Mar 09, 2006 10:54 pm
Location: The Netherlands
Contact:

Re: Parallelization questions, ABDADA or DTS?

Post by diep » Sat Mar 24, 2012 10:38 pm

Can you give a link where any company claims thousands of CPU cores (as opposed to manycores, where we already have thousands of cores) with cache coherency and shared RAM?

I don't believe a thing of that will happen any time soon... so I'd like to see the claim there.

Cache coherency doesn't scale very well...

But to answer your question there: it's easier to scale YBW than it is to scale up the cache coherency of hardware processors with CPU cores. And I very clearly don't mean manycores by that. CPUs as we know them have huge problems scaling, for more than just cache-coherency reasons.

So in that sense the software is the smallest problem there for now.

Vincent

diep
Posts: 1780
Joined: Thu Mar 09, 2006 10:54 pm
Location: The Netherlands
Contact:

Re: Parallelization questions, ABDADA or DTS?

Post by diep » Sat Mar 24, 2012 10:52 pm

On the scaling of Diep: after some work during the 2003 world championship, Diep managed to scale very well on 512 processors (note that, on the advice of the sysadmins, I used 460 processors out of the 500 addressable; all sorts of things were done in a centralized manner on those SGI machines back then, something today's clusters don't suffer from).

Today's Diep SMP is far more advanced than the 2003 version.

Basically, for 2000 processors there is just one 'small' thing I would need to address, and that's how to handle the idle list in a better manner. I'm very sure I can get today's Diep to work on 2000 cores with a tad of testing, maybe more.

So even without *ever* having tested on 2000 cores, I don't really see a problem there; I am convinced that if you have enough time to think about it, and test it, it's solvable for 200k cores as well.

Probably you'll lose a tad to overhead there.

Vincent

Daniel Shawul
Posts: 3429
Joined: Tue Mar 14, 2006 10:34 am
Location: Ethiopia
Contact:

Re: Parallelization questions, ABDADA or DTS?

Post by Daniel Shawul » Sat Mar 24, 2012 11:25 pm

I think Vincent answered your question thoroughly :) I would also like to know whether Intel will push Cilk to be the language of choice for that hardware. How is the performance of its scheduler for chess? It was used before by CilkChess, so it should not be bad. But that tree may not have been very selective compared to current, heavily pruned trees.

I have implemented a cluster YBW which I think is on par with cluster-Toga performance-wise. But I didn't test it well, because the cluster I had access to used a fast-ethernet connection (which means slow), and it really did not scale well even on 32 processors. But it works, as long as all processors are active.

One thing I did uniquely is to use a combined SMP-cluster search, where the search takes advantage of "fat nodes" (nodes that are 8-core SMP machines) by starting an SMP search there. It helps with the speedup, but it can introduce load-balancing inefficiencies, as some workers become extra powerful. It also makes the implementation more complicated, since otherwise you could just use MPI to start a process for every core; that is, do message passing even when you know it is an SMP machine. MPI actually optimizes things in that case, but it will not be as good as the SMP algorithm.
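One way to sketch that layout decision: keep one "fat" MPI rank per host and let it run the SMP search with all of the host's cores, instead of one rank per core. This is purely illustrative Python; the function name and scheme are my assumptions, not Scorpio's actual code.

```python
def plan_hybrid_layout(rank_hosts, cores_per_host):
    """Map each MPI rank to a thread count: the first rank seen on a host
    becomes the fat SMP worker for that host; later ranks on the same
    host are folded into it (0 threads, i.e. they may exit)."""
    plan = {}
    fat_rank = {}  # host -> rank chosen as the SMP worker
    for rank, host in enumerate(rank_hosts):
        if host not in fat_rank:
            fat_rank[host] = rank
            plan[rank] = cores_per_host[host]  # this rank spawns N threads
        else:
            plan[rank] = 0  # duplicate rank on an already-claimed host
    return plan
```

For example, five ranks spread over an 8-core host "a" and a 4-core host "b" collapse to two fat workers, which is exactly where the load-balancing asymmetry mentioned above comes from.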

Daniel Shawul
Posts: 3429
Joined: Tue Mar 14, 2006 10:34 am
Location: Ethiopia
Contact:

Re: Parallelization questions, ABDADA or DTS?

Post by Daniel Shawul » Sat Mar 24, 2012 11:31 pm

Trying to figure out the difference between YBW and Jamboree. They look very similar, but there is a "wait for all children" at J12 that may be different. Does anyone know the details of Jamboree? http://supertech.csail.mit.edu/papers/t ... szmaul.pdf

Code: Select all


(J1)  Define jamboree(n, α, β) as
(J2)    If n is a leaf then return static_eval(n).
(J3)    Let ~c ← the children of n, and
(J4)      b ← -jamboree(c0, -β, -α).
(J5)    If b ≥ β then return b.
(J6)    If b > α then set α ← b.
(J7)    In Parallel: For i from 1 below |~c| do:
(J8)      Let s ← -jamboree(ci, -α - 1, -α).
(J9)      If s > b then set b ← s.
(J10)     If s ≥ β then abort-and-return s.
(J11)     If s > α then
(J12)       Wait for the completion of all previous iterations
(J13)         of the parallel loop.
(J14)       Set s ← -jamboree(ci, -β, -α). ;; Re-search for value
(J15)       If s ≥ β then abort-and-return s.
(J16)       If s > α then set α ← s.
(J17)       If s > b then set b ← s.
(J18)     Note the completion of the ith iteration of the parallel loop.
(J19)   enddo
(J20)   return b.
Figure 4-5: Algorithm jamboree
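A direct serial transcription of the figure into Python may help reading it; the synthetic tree and helper names are my assumptions. The "In Parallel" loop at J7 runs sequentially here, so the wait at J12 becomes a no-op and jamboree reduces to PVS/NegaScout, returning the same root value as plain alpha-beta.

```python
INF = 10 ** 9
BRANCH = 4

def h(x):
    # Deterministic pseudo-random hash that generates the synthetic tree.
    return (x * 2654435761 + 12345) & 0xFFFFFFFF

def children(node):
    return [h(node ^ (i * 0x9E3779B9)) for i in range(1, BRANCH + 1)]

def evaluate(node):
    # Static evaluation in [-100, 100], from the side to move's view.
    return h(node) % 201 - 100

def jamboree(n, alpha, beta, depth):
    if depth == 0:                                        # (J2) leaf
        return evaluate(n)
    c = children(n)
    b = -jamboree(c[0], -beta, -alpha, depth - 1)         # (J4) eldest, full window
    if b >= beta:                                         # (J5)
        return b
    alpha = max(alpha, b)                                 # (J6)
    for ci in c[1:]:                                      # (J7) run serially here
        s = -jamboree(ci, -alpha - 1, -alpha, depth - 1)  # (J8) null window
        b = max(b, s)                                     # (J9)
        if s >= beta:                                     # (J10) abort-and-return
            return s
        if s > alpha:                                     # (J11); (J12) wait is a no-op
            s = -jamboree(ci, -beta, -alpha, depth - 1)   # (J14) re-search for value
            if s >= beta:                                 # (J15)
                return s
            alpha = max(alpha, s)                         # (J16)
            b = max(b, s)                                 # (J17)
    return b                                              # (J20)

def alphabeta(n, depth, alpha, beta):
    # Plain fail-soft negamax reference, for comparison.
    if depth == 0:
        return evaluate(n)
    best = -INF
    for c in children(n):
        s = -alphabeta(c, depth - 1, -beta, -alpha)
        best = max(best, s)
        alpha = max(alpha, s)
        if alpha >= beta:
            break
    return best
```

The interesting part for the YBW comparison is exactly J12: before re-searching a child with the full window, Jamboree waits for all earlier siblings to finish, whereas serially that synchronization disappears.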

User avatar
Dragulic
Posts: 53
Joined: Tue Mar 06, 2012 2:28 pm

Re: Parallelization questions, ABDADA or DTS?

Post by Dragulic » Sun Mar 25, 2012 12:21 am

diep wrote:Can you give a link
http://ieeexplore.ieee.org/Xplore/login ... %3D1383246
From there, many Google searches can be inspired, e.g. for gigascale integration.

diep
Posts: 1780
Joined: Thu Mar 09, 2006 10:54 pm
Location: The Netherlands
Contact:

Re: Parallelization questions, ABDADA or DTS?

Post by diep » Sun Mar 25, 2012 12:43 am

Dragulic wrote:
diep wrote:Can you give a link
http://ieeexplore.ieee.org/Xplore/login ... %3D1383246
From there, many Google searches can be inspired, e.g. for gigascale integration.
I see some VLSI stuff from 2005.
We live in 2012 now, so hardware predictions from 2005 are a tad outdated :)

User avatar
Dragulic
Posts: 53
Joined: Tue Mar 06, 2012 2:28 pm

Re: Parallelization questions, ABDADA or DTS?

Post by Dragulic » Sun Mar 25, 2012 12:59 am

diep wrote:outdated :)
Maybe, but maybe not. The cycle from the theorisation of a new approach to the end-user delivery of a practical production item can be 5-10 years, even in this fast-moving field. For example, 3D transistors were postulated there in 2004/5. Delivery is yet to occur, but the next generation (22nm) from Intel is expected to use them.

I stand by my belief that I will see 10^6-thread systems become common. But then, I am young and in good health, with a long life expectancy. :)

Post Reply