RMO - Randomized Move Order - yet another Lazy SMP derivate


Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Michel »

AndrewGrant wrote: Wed Jan 01, 2020 2:04 pm
I agree with that thought, in that I would not just blindly commit something because it's an Elo gainer. But it's not hard to construct a reasonable argument for why LazySMP might outperform other methods.

You can argue that YBW is just a cheap way of getting a minor effective speedup, and that LazySMP is not really Lazy, but a more complex and evolved method of multithreading.
Complex and evolved??? On the contrary. Lazy is crude and naive. Its only redeeming feature is that it is incredibly fast, and this is apparently enough to overcome more sophisticated algorithms such as YBW.
Recall that in a primitive engine like Toga II, Lazy was already on par with YBW.
YBW increases speed. LazySMP increases quality. At least, that's how I've always seen it, as LazySMP fails to produce massive increases in depth, but with better move ordering and the existence of Singular Extensions we get far higher-quality nodes.
What is your evidence for the claim that YBW does better than Lazy in TTD (time to depth)? This would certainly be an interesting fact, if true.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Michel »

I found this thread

http://www.talkchess.com/forum3/viewtop ... =2&t=61131

where it is claimed that Lazy is worse in TTD. But it is not clear to me whether the analysis is valid, since some of the other threads may be searching deeper than the main thread, which would skew the reported results, I think.

Moreover, AFAICT there is no comparison with a YBW implementation on the same machine (I did not read the thread in great detail). Perhaps YBW would have performed badly in TTD on this machine as well.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by lucasart »

Michel wrote: Wed Jan 01, 2020 11:52 pm I found this thread

http://www.talkchess.com/forum3/viewtop ... =2&t=61131

where it is claimed that Lazy is worse in TTD. But it is not clear to me whether the analysis is valid, since some of the other threads may be searching deeper than the main thread, which would skew the reported results, I think.
Good point. Although it depends on how lazy SMP is implemented. The typical implementation arbitrarily assigns one of the workers the task of UCI updating, and so the reported depth is somewhat misleading.

But Demolito always reports the correct depth. The logic is a bit more complicated: all workers are equal and never write to stdout; instead they send updates to a timer thread, which writes them but only accepts depth-increasing updates. So I have a guarantee that the highest completed depth is reported "immediately", and that (full) depth updates are sequential.

I will post some TTD analysis later today, to set the record straight, both for standard lazy and for improved lazy (d/d+1 equal balancing).
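For what it's worth, here is a minimal sketch of the "only accept depth-increasing updates" part (my own illustration, not Demolito's actual code; the timer thread is elided, the worker prints directly, and report_depth/pvString are hypothetical names):

Code: Select all

#include <stdatomic.h>
#include <stdio.h>

static atomic_int reportedDepth;  /* highest completed depth printed so far */

/* Workers call this when they complete an iteration. Only updates that
   strictly increase the best completed depth get printed, so the printed
   depths are guaranteed to be monotone and sequential. */
void report_depth(int depth, const char *pvString)
{
    int cur = atomic_load(&reportedDepth);

    while (depth > cur) {
        /* Try to claim the new depth; if another thread beats us to this
           depth (or a deeper one), we simply drop the update. */
        if (atomic_compare_exchange_weak(&reportedDepth, &cur, depth)) {
            printf("info depth %d pv %s\n", depth, pvString);
            fflush(stdout);
            return;
        }
    }
}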
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
AndrewGrant
Posts: 1752
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by AndrewGrant »

lucasart wrote: Thu Jan 02, 2020 4:35 am instead they send updates to a timer thread, which writes them but only accepts depth-increasing updates. So I have a guarantee that the highest completed depth is reported "immediately", and that (full) depth updates are sequential.
Is the bestmove that gets reported guaranteed to come from this most recently reported completed depth?

If so, have you shown this scheme to be an Elo gain over, say, letting only thread #1 report best moves?
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Michel »

lucasart wrote: Thu Jan 02, 2020 4:35 am
Michel wrote: Wed Jan 01, 2020 11:52 pm I found this thread

http://www.talkchess.com/forum3/viewtop ... =2&t=61131

where it is claimed that Lazy is worse in TTD. But it is not clear to me whether the analysis is valid, since some of the other threads may be searching deeper than the main thread, which would skew the reported results, I think.
Good point. Although it depends on how lazy SMP is implemented. The typical implementation arbitrarily assigns one of the workers the task of UCI updating, and so the reported depth is somewhat misleading.

But Demolito always reports the correct depth. The logic is a bit more complicated: all workers are equal and never write to stdout; instead they send updates to a timer thread, which writes them but only accepts depth-increasing updates. So I have a guarantee that the highest completed depth is reported "immediately", and that (full) depth updates are sequential.

I will post some TTD analysis later today, to set the record straight, both for standard lazy and for improved lazy (d/d+1 equal balancing).
Thanks. However, "completed depth" (if I understand correctly what you mean by it) might be misleading if other threads have been searching at d+1. It is like comparing an engine with reductions to an engine with extensions: the former will have a smaller TTD than the latter.

Making all threads search at the same depth, as in Toga, would probably be suboptimal with a larger number of threads, since common sense suggests one has to introduce some mechanism for the threads to search different things ("chaos").

The breadcrumbs idea in SF (borrowed from ABDADA?) introduces another mechanism to keep the threads from trampling too much on each other's feet.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by lucasart »

AndrewGrant wrote: Thu Jan 02, 2020 5:00 am
lucasart wrote: Thu Jan 02, 2020 4:35 am instead they send updates to a timer thread, which writes them but only accepts depth-increasing updates. So I have a guarantee that the highest completed depth is reported "immediately", and that (full) depth updates are sequential.
Is the bestmove that gets reported guaranteed to come from this most recently reported completed depth?

If so, have you shown this scheme to be an Elo gain over, say, letting only thread #1 report best moves?
Is the bestmove that gets reported guaranteed to come from this most recently reported completed depth?
=> Yes.

If so, have you shown this scheme to be an Elo gain over, say, letting only thread #1 report best moves?
=> Don't know, I haven't tried the alternative.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by lucasart »

lucasart wrote: Thu Jan 02, 2020 4:35 am
Michel wrote: Wed Jan 01, 2020 11:52 pm I found this thread

http://www.talkchess.com/forum3/viewtop ... =2&t=61131

where it is claimed that Lazy is worse in TTD. But it is not clear to me whether the analysis is valid, since some of the other threads may be searching deeper than the main thread, which would skew the reported results, I think.
Good point. Although it depends on how lazy SMP is implemented. The typical implementation arbitrarily assigns one of the workers the task of UCI updating, and so the reported depth is somewhat misleading.

But Demolito always reports the correct depth. The logic is a bit more complicated: all workers are equal and never write to stdout; instead they send updates to a timer thread, which writes them but only accepts depth-increasing updates. So I have a guarantee that the highest completed depth is reported "immediately", and that (full) depth updates are sequential.

I will post some TTD analysis later today, to set the record straight, both for standard lazy and for improved lazy (d/d+1 equal balancing).
Running a bench of 75 (randomly selected) positions at depth=17 with 16MB hash, using the median of 3 runs each time (SMP results are very noisy).

Code: Select all

threads  time(ms) n/s      ttd speedup max possible %perfection
      8     27930 12851317        2.68         4.74        0.57
      7     26701 12248597        2.81         4.52        0.62
      6     27763 11538908        2.70         4.26        0.63
      5     27291 10793909        2.75         3.98        0.69
      4     28624 10077439        2.62         3.72        0.70
      3     32551  7735406        2.30         2.85        0.81
      2     42240  5223737        1.78         1.93        0.92
      1     74987  2711138        1.00         1.00        1.00
It's important to note that my machine has only 4 cores and 8 hardware threads. So the benchmark for perfect scaling is not just a linear TTD speed-up in the number of threads; it is defined by the NPS scaling (which suffers unavoidable contention once you get into hyper-threading, plus the OS processes running in the background, etc.). The contention from Demolito itself is negligible.
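To make the metric explicit, this is how the columns appear to be derived (my reconstruction, consistent with the numbers above, not the actual bench script):

Code: Select all

#include <stdio.h>

/* Assumed formulas, checked against the table:
     ttd speedup  = time(1 thread) / time(n threads)
     max possible = nps(n threads) / nps(1 thread)
     %perfection  = ttd speedup / max possible       */
int main(void)
{
    const struct { int threads; double ms, nps; } run[] = {
        {8, 27930, 12851317}, {4, 28624, 10077439}, {2, 42240, 5223737},
    };
    const double ms1 = 74987, nps1 = 2711138;  /* 1-thread baseline */

    for (int i = 0; i < 3; i++) {
        double speedup = ms1 / run[i].ms;          /* TTD speed-up      */
        double maxPossible = run[i].nps / nps1;    /* NPS scaling bound */
        printf("%d threads: speedup %.2f, max %.2f, %%perfection %.2f\n",
               run[i].threads, speedup, maxPossible, speedup / maxPossible);
    }
    return 0;
}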

PS: the above is for improved lazy (d/d+1 equal balancing).
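For readers unfamiliar with the term, one common reading of "d/d+1 equal balancing" (an assumption on my part, not necessarily Demolito's exact rule) is that at each iteration half of the threads start at the shared root depth d and the other half at d+1:

Code: Select all

#include <stdio.h>

/* Hypothetical depth assignment: even-indexed workers search the root
   depth d, odd-indexed workers search d+1, so both depths get an equal
   number of threads. */
static int start_depth(int rootDepth, int threadId)
{
    return rootDepth + (threadId & 1);
}

int main(void)
{
    for (int t = 0; t < 4; t++)
        printf("thread %d starts at depth %d\n", t, start_depth(10, t));
    return 0;
}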
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Daniel Shawul »

I did some quick tests on a Ryzen 3900X with 12 physical cores (24 threads, but I used only 12) at tc 40/60.

ABDADA beats plain SHT by a score of 10-3-15, and YBW beats ABDADA by a score of 9-6-9 so far.
Not many games, I know, but it is clear to me that YBW > ABDADA > SHT, even at 12 threads.

Lazy SMP could definitely benefit from ABDADA ideas, i.e. using the transposition table to coordinate work-sharing between threads rather than relying only on the threads searching at 2 depths. Michel mentioned "breadcrumbs" using ABDADA ideas -- I definitely agree that threads working on the same depth could benefit from ABDADA ideas. Also, don't forget that ABDADA is the original real lazy SMP implementation... so all this hype about lazy SMP is because Stockfish uses it, yikes! It happens to "widen" the search somehow, which helped it, but it might not work in other engines that don't prune as heavily as Stockfish. Also, Lazy SMP is rather uninteresting for the parallel search literature because it moves the goalposts to achieve something else... The work should be the same for all algorithms, but Lazy SMP apparently changes selectivity, though I have never tried it myself.
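For reference, here is a bare-bones sketch of the ABDADA deferral idea (the generic textbook version, not my engine's code; the move-loop skeleton and childKey array are hypothetical placeholders):

Code: Select all

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* A shared table counts how many threads are currently inside a position.
   Moves whose child position is already being searched elsewhere are
   deferred to a second pass instead of being searched concurrently. */
#define BUSY_SLOTS (1u << 16)
static atomic_int busy[BUSY_SLOTS];

static unsigned slot(uint64_t key) { return (unsigned)(key & (BUSY_SLOTS - 1)); }
static void enter(uint64_t key)    { atomic_fetch_add(&busy[slot(key)], 1); }
static void leave(uint64_t key)    { atomic_fetch_sub(&busy[slot(key)], 1); }
static bool is_busy(uint64_t key)  { return atomic_load(&busy[slot(key)]) > 0; }

/* Two passes over the moves of one node (at most 256 moves assumed). */
void abdada_move_loop(const uint64_t childKey[], int nMoves)
{
    bool deferred[256] = {false};

    for (int pass = 0; pass < 2; pass++)
        for (int i = 0; i < nMoves; i++) {
            if (pass == 0 && i > 0 && is_busy(childKey[i])) {
                deferred[i] = true;   /* someone else is on it: postpone */
                continue;
            }
            if (pass == 1 && !deferred[i])
                continue;             /* already searched in pass 1 */

            enter(childKey[i]);
            /* ... do_move(), recurse with -search(...), undo_move(),
                   normal alpha/beta bookkeeping ... */
            leave(childKey[i]);
        }
}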
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by lucasart »

lucasart wrote: Thu Jan 02, 2020 1:40 pm
lucasart wrote: Thu Jan 02, 2020 4:35 am
Michel wrote: Wed Jan 01, 2020 11:52 pm I found this thread

http://www.talkchess.com/forum3/viewtop ... =2&t=61131

where it is claimed that Lazy is worse in TTD. But it is not clear to me whether the analysis is valid, since some of the other threads may be searching deeper than the main thread, which would skew the reported results, I think.
Good point. Although it depends on how lazy SMP is implemented. The typical implementation arbitrarily assigns one of the workers the task of UCI updating, and so the reported depth is somewhat misleading.

But Demolito always reports the correct depth. The logic is a bit more complicated: all workers are equal and never write to stdout; instead they send updates to a timer thread, which writes them but only accepts depth-increasing updates. So I have a guarantee that the highest completed depth is reported "immediately", and that (full) depth updates are sequential.

I will post some TTD analysis later today, to set the record straight, both for standard lazy and for improved lazy (d/d+1 equal balancing).
Running a bench of 75 (randomly selected) positions at depth=17 with 16MB hash, using the median of 3 runs each time (SMP results are very noisy).

Code: Select all

threads  time(ms) n/s      ttd speedup max possible %perfection
      8     27930 12851317        2.68         4.74        0.57
      7     26701 12248597        2.81         4.52        0.62
      6     27763 11538908        2.70         4.26        0.63
      5     27291 10793909        2.75         3.98        0.69
      4     28624 10077439        2.62         3.72        0.70
      3     32551  7735406        2.30         2.85        0.81
      2     42240  5223737        1.78         1.93        0.92
      1     74987  2711138        1.00         1.00        1.00
It's important to note that my machine has only 4 cores and 8 hardware threads. So the benchmark for perfect scaling is not just a linear TTD speed-up in the number of threads; it is defined by the NPS scaling (which suffers unavoidable contention once you get into hyper-threading, plus the OS processes running in the background, etc.). The contention from Demolito itself is negligible.

PS: the above is for improved lazy (d/d+1 equal balancing).
And here's the pure SHT version (no d/d+1 equal balancing); a minimal sketch of what "pure SHT" means here follows the table:

Code: Select all

threads  time(ms)      n/s  nps factor  ttd speedup  %perfection
      8     36334 12643581        4.71         2.09         0.44
      6     34757 11262257        4.20         2.18         0.52
      4     34403  9987640        3.72         2.20         0.59
      3     43630  7539371        2.81         1.74         0.62
      2     51360  5110509        1.91         1.48         0.77
      1     75804  2681918        1.00         1.00         1.00
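In "pure SHT" every worker runs the same iterative-deepening loop over the same depths, with the shared hash table as the only coordination. A minimal stubbed illustration (not Demolito's code; search_root is a placeholder):

Code: Select all

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool stopSignal;               /* set by the time manager */

/* Placeholder for the real alpha-beta root search. */
static int search_root(int depth, int threadId)
{
    (void)threadId;
    return depth;                            /* dummy score */
}

/* All workers run this same loop: same depths, no d/d+1 split; the shared
   transposition table (not shown) is their only means of cooperation. */
static void sht_worker(int threadId, int maxDepth)
{
    for (int d = 1; d <= maxDepth && !atomic_load(&stopSignal); d++)
        printf("thread %d finished depth %d score %d\n",
               threadId, d, search_root(d, threadId));
}

int main(void)
{
    sht_worker(0, 5);   /* real code spawns one such worker per thread */
    return 0;
}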
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by lucasart »

Daniel Shawul wrote: Thu Jan 02, 2020 11:41 pm I did some quick tests on a Ryzen 3900X with 12 physical cores (24 threads, but I used only 12) at tc 40/60.

ABDADA beats plain SHT by a score of 10-3-15, and YBW beats ABDADA by a score of 9-6-9 so far.
Not many games, I know, but it is clear to me that YBW > ABDADA > SHT, even at 12 threads.

Lazy SMP could definitely benefit from ABDADA ideas, i.e. using the transposition table to coordinate work-sharing between threads rather than relying only on the threads searching at 2 depths. Michel mentioned "breadcrumbs" using ABDADA ideas -- I definitely agree that threads working on the same depth could benefit from ABDADA ideas. Also, don't forget that ABDADA is the original real lazy SMP implementation... so all this hype about lazy SMP is because Stockfish uses it, yikes! It happens to "widen" the search somehow, which helped it, but it might not work in other engines that don't prune as heavily as Stockfish. Also, Lazy SMP is rather uninteresting for the parallel search literature because it moves the goalposts to achieve something else... The work should be the same for all algorithms, but Lazy SMP apparently changes selectivity, though I have never tried it myself.
ABDADA is good on paper, but breadcrumbs is better in the real world :D

It addresses the problems of ABDADA, which are:
* HT real estate: there is not a bit to waste in hash table entries, so there is no room for nproc. You need another table, preferably a much smaller one (see next point).
* Contention: the HT is used (in a modern engine tuned for Elo, not a toy engine) at every node, including qsearch nodes. So one needs to be careful with contention on hashEntry.nProc, especially as it has to be updated with atomic RMW operations (increment and decrement) with acquire/release ordering. So you want a smaller table of (key, nproc) pairs and use it only at high enough depths, to minimise contention with a large number of threads.
* Move ordering: messing around with move ordering is a very bad idea in modern engines, where move ordering is massively fine-tuned and path dependent. You just cannot mess up move ordering for the sake of SMP. I am ready to accept the superiority of ABDADA in toy engines where move ordering is deterministic (i.e. no path dependence from history, CMH, etc.), but in a real-world modern engine it's a terrible idea. It also messes up the codebase and introduces serious code smell (the logic of re-ordering moves). The breadcrumbs solution of simply increasing the reduction is much superior (see the sketch below). There is power in simplicity.
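To illustrate, a simplified sketch of the breadcrumbs idea (loosely modeled on the Stockfish patch, with the details heavily simplified; names like leave_crumb are my own):

Code: Select all

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* A small side table of (key, owner) marks. If another thread has already
   left its mark on the current position, we don't touch move ordering at
   all: we just search this node with one extra ply of reduction. */
#define CRUMB_SLOTS 1024
typedef struct { _Atomic uint64_t key; _Atomic int owner; } Crumb;
static Crumb crumbs[CRUMB_SLOTS];

/* Returns true if another thread is already searching this position.
   Otherwise claims the slot (best effort; races are harmless). */
bool leave_crumb(uint64_t key, int threadId, bool *claimed)
{
    Crumb *c = &crumbs[key & (CRUMB_SLOTS - 1)];
    *claimed = false;

    if (atomic_load(&c->key) == key)
        return atomic_load(&c->owner) != threadId;

    atomic_store(&c->key, key);
    atomic_store(&c->owner, threadId);
    *claimed = true;
    return false;
}

void clear_crumb(uint64_t key, int threadId, bool claimed)
{
    Crumb *c = &crumbs[key & (CRUMB_SLOTS - 1)];
    if (claimed && atomic_load(&c->key) == key && atomic_load(&c->owner) == threadId)
        atomic_store(&c->key, 0);
}

/* In the search, at high enough depth:
       bool claimed;
       bool otherBusy = leave_crumb(posKey, threadId, &claimed);
       int reduction = baseReduction + (otherBusy ? 1 : 0);
       ... search the node with that reduction ...
       clear_crumb(posKey, threadId, claimed);                        */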
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.