Parallelization questions, ABDADA or DTS?


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Parallelization questions, ABDADA or DTS?

Post by bob »

ZirconiumX wrote:DTS gives you the best performance - for an iterative search - and generally.

ABDADA (or Jamboree) will probably give you the best performance for a recursive search.

Personally, I would recommend ABDADA, as Smash uses it - so you have a free reference point.

Matthew:out
ABDADA is not even close to any good PVS/YBW/DTS-type algorithm. Sharing information through the hash table is not a very efficient way of managing a parallel search...
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Parallelization questions, ABDADA or DTS?

Post by diep »

bob wrote:
ZirconiumX wrote:DTS gives you the best performance - for an iterative search - and generally.

ABDADA (or Jamboree) will probably give you the best performance for a recursive search.

Personally, I would recommend ABDADA, as Smash uses it - so you have a free reference point.

Matthew:out
ABDADA is not even close to any good PVS/YBW/DTS-type algorithm. Sharing information through the hash table is not a very efficient way of managing a parallel search...
ABDADA is easy to implement.
PVS is no good at all; it only splits at one spot. It will give some sort of speedup, though.

YBW is the superior algorithm, of course.

Most go for a variation of ABDADA nowadays, because you can start dividing the CPUs in a YBW manner, yet you simply never abort them.

So it's not that close to YBW in performance, yet it's easy to implement bug-free. It doesn't require big math skills from you to get it going, unlike a full YBW implementation.
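
For reference, the whole ABDADA trick fits in a few lines: the shared hashtable keeps a count of how many processors are currently inside a position, and the move loop runs in two passes, deferring moves whose child position is already being searched elsewhere. A minimal sketch in C; the type and helper names (tt_nproc, tt_nproc_enter/leave and the usual engine primitives) are illustrative assumptions, not taken from Smash or any particular engine.

Code: Select all

/* Assumed engine primitives and types (declarations only, illustrative). */
typedef struct Position Position;
typedef struct { int move[256]; int searched[256]; int count; } MoveList;
int  evaluate(Position *pos);
void generate_moves(Position *pos, MoveList *ml);   /* fills move[], count */
void make_move(Position *pos, int move);
void unmake_move(Position *pos, int move);
int  tt_nproc(Position *pos);        /* searchers currently inside pos     */
void tt_nproc_enter(Position *pos);  /* ++ counter in the shared hashtable */
void tt_nproc_leave(Position *pos);  /* -- counter in the shared hashtable */

#define INF_SCORE     32000
#define ON_EVALUATION 32001          /* sentinel outside the score range   */

int abdada_search(Position *pos, int alpha, int beta, int depth, int exclusive)
{
    if (depth == 0)
        return evaluate(pos);

    /* Asked to search this node exclusively, but somebody is already in it:
     * report "busy" so the caller defers this move to its second pass.     */
    if (exclusive && tt_nproc(pos) > 0)
        return ON_EVALUATION;

    tt_nproc_enter(pos);

    MoveList ml = {0};
    generate_moves(pos, &ml);
    int best = -INF_SCORE;

    for (int pass = 0; pass < 2 && alpha < beta; pass++) {
        for (int i = 0; i < ml.count && alpha < beta; i++) {
            if (ml.searched[i])
                continue;
            make_move(pos, ml.move[i]);
            /* Pass 0: first move normally, the rest exclusively, so sibling
             * processors spread out. Pass 1: pick up whatever came back busy. */
            int r = abdada_search(pos, -beta, -alpha, depth - 1,
                                  pass == 0 && i > 0);
            unmake_move(pos, ml.move[i]);

            if (r == ON_EVALUATION)      /* busy elsewhere: retry in pass 1 */
                continue;
            ml.searched[i] = 1;
            int score = -r;
            if (score > best)  best  = score;
            if (score > alpha) alpha = score;
        }
    }

    tt_nproc_leave(pos);
    return best;
}

Because nothing in that counter is relied upon for correctness, an overwrite of the entry only costs a bit of duplicated work - which is exactly why this is so easy to get bug-free.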

Bob is using YBW for Crafty.
Diep is using YBW.

It's not much of a discussion which performs best.

DTS locks into one central, shared search tree and figures out by itself where to attach and get going.

That's not going to work on today's hardware; it will jam your machine completely.

ABDADA + YBW is of course no good at aborting CPUs, but that also removes a big bottleneck.

Much depends upon the effort you are willing to put in.

In Diep I nowadays do an asynchronous YBW. So cores that get aborted might still be busy unwinding the previous search while the new search is already on its way.
On top of all the A4 pages I needed to prove Diep's first parallel search correct, this extension took another 40 or so A4 pages to show that it produces neither deadlocks nor erroneous aborts. Difficult to get to work.
On a big supercomputer, or even a small cluster, this has a huge advantage though, as waiting there for all CPUs to synchronize takes long, especially in between searches...
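
A minimal sketch of one way such an asynchronous abort can be organized - not Diep's actual code, all names are illustrative: every new root search bumps a generation counter, and any result coming back from a core that is still unwinding the aborted search is simply ignored because its generation no longer matches.

Code: Select all

#include <stdatomic.h>

static atomic_uint search_generation;     /* bumped once per new root search */

typedef struct {
    unsigned generation;                  /* generation the helper started in */
    int      score;
} WorkerResult;

/* Master starts a new search without waiting for every helper to
 * acknowledge the abort: the counter alone invalidates their work.          */
void start_new_search(void)
{
    atomic_fetch_add(&search_generation, 1);
}

/* Helpers stamp their results with the generation they started under.       */
unsigned current_generation(void)
{
    return atomic_load(&search_generation);
}

/* A result from a core still busy with the previous search is stale and is
 * dropped; only results from the current generation are folded in.          */
int merge_result(const WorkerResult *r, int best_so_far)
{
    if (r->generation != atomic_load(&search_generation))
        return best_so_far;
    return r->score > best_so_far ? r->score : best_so_far;
}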

Regards,
Vincent
BeRo

Re: Parallelization questions, ABDADA or DTS?

Post by BeRo »

Okay, I'll go for an ABDADA+YBW hybrid approach then, but together with aborting of cut-off children. So in general YBW together with the helpful-master concept, plus some additional extended-TT concept from ABDADA.

Do you have any objections?
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Parallelization questions, ABDADA or DTS?

Post by diep »

BeRo wrote:Okay, I'll go for an ABDADA+YBW hybrid approach then, but together with aborting of cut-off children. So in general YBW together with the helpful-master concept, plus some additional extended-TT concept from ABDADA.

Do you have any objections?
If you do that, then it's basically more of a lossy YBW already :)

It's questionable whether you can abort children via the hashtable in a lossless manner. We can of course prove that that specific hashtable entry can always be overwritten.

So only a lossy concept would make sense then.

The question I'd have then is: if you already do the bookkeeping work of YBW, why would you need the overhead of additional stores to the hashtable, which only make things lossy and slower?

By the way, are you capable of reading the original ABDADA paper?
It's encrypted into Anglish by the original French author! Bon.
BeRo

Re: Parallelization questions, ABDADA or DTS?

Post by BeRo »

diep wrote: If you do that, then it's basically more of a lossy YBW already :)
Yes, you're right there.
diep wrote: It's questionable whether you can abort children via the hashtable in a lossless manner. We can of course prove that that specific hashtable entry can always be overwritten.

So only a lossy concept would make sense then.

The question I'd have then is: if you already do the bookkeeping work of YBW, why would you need the overhead of additional stores to the hashtable, which only make things lossy and slower?
I'm implementing the plain classical YBWC stuff (together with the helpful-master concept) as a first step at the moment, and then I'll see what optimizations I can do as a next step. So the chances that it remains plain YBWC with some small optimizations are, as you already said, very high. The original hybrid idea was just a chain of first brainstorm thoughts.
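
For reference, a minimal sketch of the split-point bookkeeping a classical YBWC search with the helpful-master idea needs; the structure and helper names (split_point_has_work, search_one_move_from, cpu_relax) are illustrative assumptions, not anybody's actual engine code.

Code: Select all

#include <stdatomic.h>

typedef struct SplitPoint {
    struct Position *pos;           /* position at the split node            */
    int alpha, beta, depth;
    atomic_int slaves_active;       /* helpers still searching moves here    */
    atomic_int next_move;           /* index of the next untried move        */
    /* lock for alpha/best updates, parent pointer, etc. omitted             */
} SplitPoint;

/* Assumed primitives (declarations only, illustrative). */
int  split_point_has_work(SplitPoint *sp);            /* untried move left?  */
void search_one_move_from(SplitPoint *sp, int id);    /* search one of them  */
void cpu_relax(void);                                 /* pause / sched_yield */

/* Once the first ("eldest brother") move is done, the remaining moves can be
 * handed to idle helpers. The owner then does NOT block waiting for them: it
 * keeps taking moves from its own split point - the helpful master - so it
 * is never idle while results it needs are still outstanding.               */
void owner_wait_at_split(SplitPoint *sp, int my_id)
{
    while (atomic_load(&sp->slaves_active) > 0) {
        if (split_point_has_work(sp))
            search_one_move_from(sp, my_id);
        else
            cpu_relax();
    }
}
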
diep wrote: By the way, are you capable of reading the original ABDADA paper?
It's encrypted into Anglish by the original French author! Bon.
No, I'm German, not French, so don't be confused by my French-sounding surname. :)
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Parallelization questions, ABDADA or DTS?

Post by jdart »

Bob is correct.

I tried ABDADA a long time ago and found that it performed poorly, because its use of the hash table as a synchronization mechanism interferes with the normal use of the hash table as storage for scores and best moves from previously visited nodes. Consequently, the hash hit rate went down, and this killed performance. Theoretically, with an infinitely large hash this might not be an issue, but with current memory sizes and node rates many programs fill up their hash rapidly.

--Jon
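
One conceivable mitigation - sketched here purely as an illustration, not something Jon describes - is to keep the ABDADA "somebody is searching this position" counts in a tiny separate table instead of inside the main transposition table, so the coordination traffic cannot evict stored scores and best moves. The counts are deliberately approximate, which ABDADA tolerates anyway.

Code: Select all

#include <stdint.h>

#define BUSY_TABLE_SIZE (1u << 16)            /* small enough to stay in cache */

static uint8_t busy_count[BUSY_TABLE_SIZE];   /* approximate searcher counts   */

static unsigned busy_index(uint64_t zobrist_key)
{
    return (unsigned)(zobrist_key & (BUSY_TABLE_SIZE - 1));
}

/* Unsynchronized on purpose: a lost increment or a leftover count only means
 * a move gets deferred unnecessarily, never a wrong search result.           */
void busy_enter(uint64_t key) { if (busy_count[busy_index(key)] < 255) busy_count[busy_index(key)]++; }
void busy_leave(uint64_t key) { if (busy_count[busy_index(key)] > 0)   busy_count[busy_index(key)]--; }
int  busy_other(uint64_t key) { return busy_count[busy_index(key)] > 0; }
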
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Parallelization questions, ABDADA or DTS?

Post by Daniel Shawul »

ABDADA (or Jamboree) will probably give you the best performance for a recursive search.
Jamboree may be one of the better performers, but it is not the same as ABDADA. It is closer to YBW than to ABDADA. The latter relies on the hashtable for its parallel speedup. It may perform really well sometimes; think of embarrassingly parallel algorithms such as randomized best-first minimax (RBFM), UCT, etc. However, YBW performs much better for inherently sequential algorithms like alpha-beta.

A suggestion to the OP: why not implement YBW with a message-passing model?
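
To make that suggestion concrete, here is a minimal sketch of what a YBW-style work hand-off could look like over plain MPI; the message layout, tags and the dummy helper are illustrative assumptions only, and a real helper would of course search the received move instead of returning zero.

Code: Select all

#include <mpi.h>

enum { TAG_WORK = 1, TAG_RESULT = 2 };

typedef struct {
    unsigned long long zobrist_key;   /* identifies the split node            */
    int alpha, beta, depth;
    int move;                         /* packed move the helper should search */
} WorkMsg;

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0 && size > 1) {
        /* Owner of a split point: ship one sibling move to rank 1, then keep
         * searching its own moves until the helper's result comes back.      */
        WorkMsg w = { 0x123456789abcdefULL, -50, 50, 8, 42 };
        int score;
        MPI_Send(&w, sizeof w, MPI_BYTE, 1, TAG_WORK, MPI_COMM_WORLD);
        MPI_Recv(&score, 1, MPI_INT, 1, TAG_RESULT, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        WorkMsg w;
        int score;
        MPI_Recv(&w, sizeof w, MPI_BYTE, 0, TAG_WORK, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        score = 0;        /* placeholder for the helper's actual search       */
        MPI_Send(&score, 1, MPI_INT, 0, TAG_RESULT, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
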
Dragulic
Posts: 53
Joined: Tue Mar 06, 2012 3:28 pm

Re: Parallelization questions, ABDADA or DTS?

Post by Dragulic »

With these views I, a permanent beginner, concur. But with parallelisation as we see it today, people think of 8 cores, 16 threads, that level. OK, maybe a 200-node cluster.

IBM and Intel have indicated that the future lies in massive parallelisation: on-chip, limited functionality, but 1,000 or 10,000 cores. And many chips. A machine with 1,000,000 threads could be common within our working lives.

How will these algorithms cope and scale?
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Parallelization questions, ABDADA or DTS?

Post by diep »

BeRo wrote:
diep wrote: If you do that, then it's basically more of a lossy YBW already :)
Yes, you're right there.
diep wrote: It's questionable whether you can abort children via the hashtable in a lossless manner. We can of course prove that that specific hashtable entry can always be overwritten.

So only a lossy concept would make sense then.

The question I'd have then is: if you already do the bookkeeping work of YBW, why would you need the overhead of additional stores to the hashtable, which only make things lossy and slower?
I'm implementing the plain classical YBWC stuff (together with the helpful-master concept) as a first step at the moment, and then I'll see what optimizations I can do as a next step. So the chances that it remains plain YBWC with some small optimizations are, as you already said, very high. The original hybrid idea was just a chain of first brainstorm thoughts.
diep wrote: By the way, are you capable of reading the original ABDADA paper?
It's encrypted into Anglish by the original French author! Bon.
No, I'm German, not French, so don't be confused by my French-sounding surname. :)
Yes, I realize that, because the gründlich and systematically working Rudolf Huber was the first guy who could decipher the Anglish as written by the original ABDADA author. I'm convinced it would have become popular quite a bit sooner otherwise.

The big difference between ABDADA and all the others is that ABDADA basically gives you an implementation, not so much an algorithm.

YBW is just a simple-to-understand algorithm, which leaves open the question of how to implement it efficiently; implementing it efficiently for dozens of cores is *really hard*.

ABDADA on the other hand, with its implementation restriction, can of course be combined with YBW, because ABDADA in itself isn't an algorithm at all :)

It's just an implementation idea: "oh, let's solve it using lossy storage in the hashtable".

In theory you could get away with setting just one bit there; you probably won't split near the leaves anyway, so with a multiprobe system the odds that you overwrite it are already really small, and the odds of a write error, which is a huge factor smaller again than an overwrite, are even more negligible.

The theoretical odds of an overwrite are also really small - it's just the chaining lemma, which really does apply; it's not some sort of 'theoretical effect', chaining REALLY happens.

The chaining effect causes a small chance of an overwrite - you can easily risk that.

So for ABDADA you need really, really little - yet it's just an implementation idea, and a cheap one; it's not an algorithm, from my point of view.
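
As an illustration of just how little that is, a sketch of a hashtable entry that spends one spare flag bit on the ABDADA "busy" mark; the field names and widths are made up for the example, and an overwrite simply loses the bit - exactly the lossiness described above.

Code: Select all

#include <stdint.h>

typedef struct {
    uint64_t key;
    int16_t  score;
    uint8_t  depth;
    uint8_t  flags;            /* bound type, age ... and one busy bit        */
} TTEntry;

#define TT_BUSY 0x80           /* the single ABDADA bit                       */

static inline void tt_mark_busy(TTEntry *e)     { e->flags |=  TT_BUSY; }
static inline void tt_clear_busy(TTEntry *e)    { e->flags &= ~TT_BUSY; }
static inline int  tt_is_busy(const TTEntry *e) { return (e->flags & TT_BUSY) != 0; }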

Starting with a master-slave concept is maybe a good idea when you start implementing a parallel search - yet in the long run, take into account that you want to get rid of it. A central master is always an obstacle you don't want to have; it creates a huge worst case.

Vincent
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Parallelization questions, ABDADA or DTS?

Post by diep »

Dragulic wrote:With these views I, a permanent beginner, concur. But with parallelisation as we see it today, people think of 8 cores, 16 threads, that level. OK, maybe a 200-node cluster.

IBM and Intel have indicated that the future lies in massive parallelisation: on-chip, limited functionality, but 1,000 or 10,000 cores. And many chips. A machine with 1,000,000 threads could be common within our working lives.

How will these algorithms cope and scale?
Dragulic, it's not about the scaling of the algorithms. You must reverse the question here.

I gave a speech in Switzerland for the HPC Advisory Council; my sheets are probably online there. If I compare Diep at 2 million nps with Deep Blue, and I turn OFF all algorithmic improvements since then, so that I just search full-width,
then I'm searching 2 plies deeper than Deep Blue, whereas Diep is NOT forward pruning the last few plies and Deep Blue was.

The information that reached me is that Deep Blue used the APHID algorithm from Jonathan Schaeffer - in those days a logical choice if your focus was getting more nps for your boss, regardless of whether those nps were any use.

Deep Blue doesn't count nodes. I just start, of course, with the first move out of book - and comparing in endgames, the difference is of course much bigger, as their hashtable simply wasn't there, or worked badly, because of the hardware processors. Algorithmically, too, the Deep Blue team showed they were not so clever, claiming they had 'hardware hashtables' nearly ready to go; hashtables in hardware would not have mattered much if you do 4-ply searches in hardware, of course, as the other 479 processors don't communicate with that one. Chrilly Donninger, who had a similar experience with hardware processors, doesn't understand how they could search 4 plies in hardware: "doing 4 plies in hardware is extremely inefficient; I basically do mostly 2-ply searches with Hydra and try to push everything towards the software search."

Deep Blue was already searching in the opponent's time and finishes 10 plies
after 1.Nf3 d5 2.g3 Bg4 3.b3.

Diep in fact is in doubt there, and in the end switches to Bxf3, so it switches from the PV; and as we can see, Deep Blue also switches from the PV, yet to the wrong move.

Deep Blue needed roughly 167 seconds on 480 hardware processors, and we forgive it that it also searched in the opponent's time, which Diep didn't do here. Now IBM's marketing department claimed it got 200 million nps.

The Deep Blue team claims, in an official publication later in 2001, that it was 133 million nps.

APHID is an algorithm that in itself should scale well.

The last improvements to Diep's move ordering were done in 1998; it's not that I have 14 years of advantage there. Also, Diep was never tuned to be really efficient at full-width search.

If I turn on nullmove, so full-width search + nullmove, with not a single other form of forward pruning, then Diep reaches 17 plies.

Just R=3; I already used that back then, see my postings on RGCC.
Diep was just 4 years old in 1998.

The 1998 version will actually use fewer nodes to reach that 10 plies than today's Diep, let me assure you of that.

So Diep needs about 14 million nodes to finish 10 plies, and with nullmove at 2 million nps it would have reached 17 plies back then. An algorithm well known back then and used by everyone, after Frans Morsch was so honest as to state publicly that he won the 1995 world title with it. You have to give Frans credit for that.

Deep Blue 1997 gets 10 plies there with 133 million nps. That's what they OFFICIALLY claim in an official publication - not some vague writing on the internet, but an official publication on advances in artificial intelligence.

133 million nps * 167 seconds = 22.2 billion nodes
Diep: 14 million nodes

A factor of 1586 difference.
Searching full-width, Diep gets 2.6 million nps.

Searching with nullmove R=3, Diep gets 17 plies at 2.0 million nps.
This is on a 16-processor Barcelona box, with a non-optimized executable.

In reality the real question is: why didn't they use nullmove?
The answer is of course obvious: they would have gotten fewer nps,
but even then that's not a good answer.

I don't want to redo that discussion, of course. They weren't bad guys algorithmically, as they invented a few things of their own - yet overall they were pretty clumsy with algorithms, from a lossless-search-depth viewpoint.
Where they did do well is extending their 1988 plan:
with expensive hardware and their own CPUs they got a huge number of
nps that only 15 years later we can also get. It's just that by 1997 it was totally outdated because of their clumsiness at searching efficiently.

The hardware did 4-ply searches, which is way too much; they didn't use killer moves in hardware; and we'll forgive them their parallel search, as doing a good parallel search is simply very tough, and by 1997 no one had yet achieved an efficient parallel search for so many processors.

That took until Diep, and I easily profited from the knowledge of others; Bob (Hyatt) is important there, and Rainer Feldmann, for clearly explaining YBW to me back in 1998 in Paderborn.

Yet Diep on their 30-node RS/6000 machine would have hands-down outsearched Deep Blue; realize that Diep is a really SLOW searcher in nps.
If one of the slowest engines is outsearching you, that isn't very good news for you. Know what I mean?

All those algorithms get so totally hammered by the efficiency of well-implemented YBW programs that I would want to rephrase your question about scaling.

The question is not: how well does algorithm X or Y scale?

The real question to you is: "how well can you get YBW to work, and
TAKE CARE that it scales well on many cores?"

And if you don't ask the right question, you'll be a factor of 1000+ slower anyway than the guy with a well-functioning YBW, as you don't preserve your branching factor with other forms of parallel search!

Vincent