Lazy SMP and "lazy cluster" experiments


D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Approximate ABDADA

Post by D Sceviour »

Dann Corbit wrote:
D Sceviour wrote:
petero2 wrote:If a helper thread finishes its search before the master thread, the result from the helper thread is used instead of waiting for the master to finish its search.
Assuming both threads are allowed to finish, what do you do if one thread returns a different value than the other thread?
My guess:
A. If one is deeper, use the deepest one.
B. If depth is the same, use the one with the most nodes.
Hello Dann,
A. Makes sense.
B. Not necessarily. Since both threads are sharing the same hash table, perhaps the one with the fewest nodes has the best hash values available - meaning it has the most recent hash entries and cutoffs available. Peter's intention might be that the thread that returns first is the one that governs the score and PV, regardless of any subsequent thread's value. However, it would be nice to have a good list of the possible errors at the return point.
petero2
Posts: 685
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Approximate ABDADA

Post by petero2 »

D Sceviour wrote:
petero2 wrote:If a helper thread finishes its search before the master thread, the result from the helper thread is used instead of waiting for the master to finish its search.
Assuming both threads are allowed to finish, what do you do if one thread returns a different value than the other thread?
Several threads could finish their searches, but the master thread will always use the first value it sees, without checking if more than one result is available.

During the master thread search, it will periodically check for messages from other threads. If one such message contains a result from another thread, the master thread will immediately abort its own search and use the result from the other thread instead.

If the master thread finishes its own search, it will use its own value without checking if there are any received but unprocessed messages that also contain results from helper threads.
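
A minimal sketch of that polling behavior in C++, with hypothetical names (the real code in parallel.cpp is structured differently):

#include <atomic>
#include <optional>

struct SearchResult { int score; int depth; };

// Hypothetical message source; Texel's real ThreadCommunicator differs.
struct Communicator {
    std::optional<SearchResult> tryReceiveResult() { return std::nullopt; } // stub
};

std::atomic<bool> stopSearch{false};

SearchResult iterativeDeepening(Communicator& comm, int maxDepth) {
    SearchResult own{0, 0};
    for (int d = 1; d <= maxDepth; d++) {
        // The real search also polls for messages periodically while
        // searching depth d; only the between-iterations check is shown.
        if (auto res = comm.tryReceiveResult()) {
            stopSearch = true;   // abort our own search immediately
            return *res;         // use the helper's result instead
        }
        own = SearchResult{0 /* score from the depth-d search */, d};
    }
    // We finished first: use our own value without draining any
    // still-unprocessed helper results.
    return own;
}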

There is a communication optimization related to this behavior. As I described before, all search threads are arranged in a tree, and a thread is only allowed to talk to its neighbors in the tree. When a search thread receives a message from some other thread, it may decide not to forward it to other threads if it can determine that they do not need the message. For example:

* If a thread receives a search result from one of its child threads, it will normally forward it to its parent thread. However, if it has already sent a search result to its parent thread, it will simply ignore any further search results it receives. (See WorkerThread::sendReportResult() in parallel.cpp in the source code; a simplified sketch follows this list.)

* If a thread receives a "start search position x" command from its parent thread, it will normally forward this message to its child threads. However if it already has an unprocessed "start search" command in its queue, it will first discard the old "start search" command since that has been obsoleted by the new "start search" command. (See ThreadCommunicator::doSendStartSearch() in parallel.cpp in the source code.) This is mostly useful in the first few iterative deepening iterations where the master thread probably produces many "start search" commands faster than they can be sent to helper threads.

Texel 1.07a29 contains the ABDADA implementation and is available for download here: texel 1.07a29.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Approximate ABDADA

Post by Dann Corbit »

It seems to scale very well on my 64-core machine.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
petero2
Posts: 685
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Approximate ABDADA

Post by petero2 »

Dann Corbit wrote:It seems to scale very well on my 64-core machine.
Nice, although not that surprising given that I previously got a 39 elo increase going from 32 cores on one cluster node to 64 cores on 2 cluster nodes. I would expect SMP/NUMA to give a bigger elo increase than cluster.
petero2 wrote:2 MPI processors running on different computers, each processor having 32 threads running on a 2-socket NUMA hardware, compared to a non-MPI 32-thread search running on one computer having the same NUMA hardware: +39 elo
For the record, that test was run on two r4.16xlarge EC2 instances, each instance having 32 cores.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Approximate ABDADA

Post by Dann Corbit »

petero2 wrote:
Dann Corbit wrote:It seems to scale very well on my 64-core machine.
Nice, although not that surprising given that I previously got a 39 elo increase going from 32 cores on one cluster node to 64 cores on 2 cluster nodes. I would expect SMP/NUMA to give a bigger elo increase than cluster.
petero2 wrote:2 MPI processors running on different computers, each processor having 32 threads running on a 2-socket NUMA hardware, compared to a non-MPI 32-thread search running on one computer having the same NUMA hardware: +39 elo
For the record, that test was run on two r4.16xlarge EC2 instances, each instance having 32 cores.
It also seems to benefit a lot from RAM. Going from 8 GB to 32 GB gave a very nice speed-up in time-to-ply.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Lazy SMP and "lazy cluster" experiments

Post by Daniel Shawul »

Thanks for the tests that confirm ABDADA is clearly better than lazy.

On a constant-depth search, my lazy gives a time-to-depth speedup of 1.11 while ABDADA gives 1.73 using 2 cores. So the advantage of ABDADA for me is quite significant, even though I don't do anything fancy with lazy. YBW is superior to both in my engine, so I still think that is the best algorithm. If you fine-tune your YBW (split depth etc.) and use a longer time control, maybe it will be the best for you too.
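
For reference, time-to-depth speedup at a fixed depth is just T(1 core) / T(N cores): for example, if a fixed-depth search takes 100 s on one core and 90 s on two, the speedup is 100/90 ≈ 1.11.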

Daniel
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Lazy SMP and "lazy cluster" experiments

Post by lucasart »

Daniel Shawul wrote:Thanks for the tests that confirm ABDADA is clearly better than lazy.

On a constant-depth search, my lazy gives a time-to-depth speedup of 1.11 while ABDADA gives 1.73 using 2 cores. So the advantage of ABDADA for me is quite significant, even though I don't do anything fancy with lazy. YBW is superior to both in my engine, so I still think that is the best algorithm. If you fine-tune your YBW (split depth etc.) and use a longer time control, maybe it will be the best for you too.

Daniel
All your tests show is that your implementation of lazy is crap. There's more to lazy than you think, and TTD is not the right thing to measure; elo is.

Of course, you know everything and we're all ignorant, and should learn from you. But ask yourself: why do top engines like Komodo and SF use lazy and beat the shit out of everything else out there?
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Lazy SMP and "lazy cluster" experiments

Post by Daniel Shawul »

lucasart wrote:
Daniel Shawul wrote:Thanks for the tests that confirm ABDADA is clearly better than lazy.

On a constant-depth search, my lazy gives a time-to-depth speedup of 1.11 while ABDADA gives 1.73 using 2 cores. So the advantage of ABDADA for me is quite significant, even though I don't do anything fancy with lazy. YBW is superior to both in my engine, so I still think that is the best algorithm. If you fine-tune your YBW (split depth etc.) and use a longer time control, maybe it will be the best for you too.

Daniel
All your tests show is that your implementation of lazy is crap. There's more to lazy than you think, and TTD is not the right thing to measure; elo is.

Of course, you know everything and we're all ignorant, and should learn from you. But ask yourself: why do top engines like Komodo and SF use lazy and beat the shit out of everything else out there?
Are you on your period?

Peter just tested ABDADA vs Lazy_with_all_its_tricks and showed up to a +20 elo gain.
Instead of commenting on that, you attack me for not being a sheeple, i.e. "whatever Stockfish/Komodo does is the right approach".

Martin also threw out the "Stockfish does it" and the "you are being ignorant of the +100 lazy gives to Stockfish" arguments, so you are not the first one at that.
How about this: I honestly believe that ABDADA is the better algorithm, because all your tricks for lazy can be applied on top of it.

Remember when all this started (way before Stockfish/Komodo considered using it), i.e. when Dan Homan first showed his results,
I was testing ABDADA vs Lazy and formed my opinion. In fact, I also convinced him to try ABDADA, and he found out that it was better than lazy...

Now go back to your cave, create an ABDADA patch for Stockfish, and when it beats your lazy, declare to the whole world that you found a new algorithm... you pathetic troll. Ughh.

Daniel
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Lazy SMP and "lazy cluster" experiments

Post by Dann Corbit »

I wonder also if hybrids are possible.
IOW, the lazy approach makes a bushy tree, but fills in the tactical gaps.
YBW gets deep fast.

Maybe use one algorithm up to depth X, and the other from there forward.
Why not?
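
As a pure sketch of the idea (hypothetical names, not taken from any engine mentioned here), the dispatch could be as simple as:

struct Position { /* board state elided */ };

int ybwSearch(Position& pos, int depth, int alpha, int beta);  // split-point based
int lazySearch(Position& pos, int depth, int alpha, int beta); // shared-TT only

const int HYBRID_DEPTH_X = 8; // the "depth X" from above; would need tuning

// Hypothetical hybrid dispatch: YBW splitting near the root where the
// remaining depth is large, lazy/shared-hash search below the threshold.
int hybridSearch(Position& pos, int depth, int alpha, int beta) {
    if (depth >= HYBRID_DEPTH_X)
        return ybwSearch(pos, depth, alpha, beta);
    return lazySearch(pos, depth, alpha, beta);
}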

I would also be interested to see whether tricks like staggering the ply of the starting point work equally well with the other algorithms.
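
For example, the often-described staggering trick could look roughly like this in a lazy helper loop (hypothetical names, illustrative only):

struct Position { /* board state elided */ };

int searchRoot(Position& pos, int depth); // one full root search at 'depth'

// Hypothetical staggering for lazy SMP helpers: odd-numbered threads start
// one ply deeper, so helpers spread across neighboring iteration depths.
void helperLoop(int threadId, Position& root, int maxDepth) {
    for (int d = 1 + (threadId % 2); d <= maxDepth; d++)
        searchRoot(root, d);
}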

I guess the best thing about Lazy is that it is Lazy. IOW, a beginner can understand and implement it.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Lazy SMP and "lazy cluster" experiments

Post by Laskos »

Daniel Shawul wrote:Thanks for the tests that confirm ABDADA is clearly better than lazy.

On a constant-depth search, my lazy gives a time-to-depth speedup of 1.11 while ABDADA gives 1.73 using 2 cores. So the advantage of ABDADA for me is quite significant, even though I don't do anything fancy with lazy. YBW is superior to both in my engine, so I still think that is the best algorithm. If you fine-tune your YBW (split depth etc.) and use a longer time control, maybe it will be the best for you too.

Daniel
You cannot discard the other data. Stockfish YBW was beating Komodo Lazy on a single core at LTC, and losing badly to Komodo on 16-32 cores at LTC (TCEC). When the Stockfish team eventually applied Lazy decently, it started beating Komodo Lazy on 16-44 cores pretty harshly, just as it beats it on 1 core.

Houdini YBW (versions 3 and 4) was roughly equal to Komodo Lazy on one core at the time, but was losing badly on 16-32 cores. The main change in Houdini 5 was adopting Lazy, and it brought a sudden gain on 16-44 cores (similar to what happened with Stockfish). Houdini 5 in fact beat Komodo Lazy on the way to the final in TCEC, and only lost to Stockfish Lazy.

All those YBW implementations in top engines were crappy, and only yours is good?

On CCRL and CEGT, the known Lazy engines (Stockfish, Komodo, Andscacs, Houdini 5, Cheng) scale from 1 to 4 cores on average at least as well as, and probably better than, the known YBW engines (Crafty 24, 23; Houdini 4, 3, 2; Rybka 4, 3, 2).

Do I have to dismiss this significant data and accept only yours?

Peter's data with Lazy and ABDADA is significant, but I still have to see more empirical data, and at longer TC. Quick metrics like time-to-depth will not help here. To me, as of now, the LTC empirical data shows Lazy as applied in Stockfish and Komodo to be the best, with Peter's results provisional and awaiting confirmation, and your statements irrelevant.