RMO - Randomized Move Order - yet another Lazy SMP derivate

Daniel Shawul · Post by **Daniel Shawul** » Sun Jan 05, 2020 2:27 am

Peter, have you seen my latest results. Stockfish mulit-threaded (8-threads) even when constrained to single-core thrashes 1-thread stockfish.
There is clearly a "bug" in fixed depth tests with LazySMP, because it happens to sneak to depth d+1 is the current most likely theory.
Are you sure you took care of that for your depth-limited tests? You said it is a mix of ABDADA + Lazy so that d/d+1 is a potential problem ...

Also, my tests with my implementation of ABDADA give absolutely zero elo comparing 8-thread vs 1-thread at fixed depth.
I repost here the fixed depth = 14 result

Code: Select all

Score of scorpio-sht-8-thread vs scorpio-sht-1-thread: 78 - 90 - 132  [0.480] 300
Score of scorpio-abdada-8-thread vs scorpio-abdada-1-thread: 111 - 107 - 106  [0.506] 324
Score of scorpio-ybw-8-thread vs scorpio-ybw-1-thread: 92 - 94 - 114  [0.496] 300

On the other hand, my ABDADA implementation beats even YBW on timed games even with 12 cores. Since it doesn't widen search as shown above
the only conclusion is it must have deepened search instead (at least in most critical positions is my guess)!

Code: Select all

Score of scorpio-ybw vs scorpio-sht: 44 - 18 - 70  [0.598] 132
Score of scorpio-ybw vs scorpio-abdada: 32 - 67 - 101  [0.412] 200

Here ABDADA > YBW > SHT without any of them showing any sign of widening.

Daniel Shawul · Post by **Daniel Shawul** » Sun Jan 05, 2020 4:50 am

smatovic wrote: ↑Sat Jan 04, 2020 8:12 am
Despite the discussions on Lazy SMP vs. XY, I am personally more interested in
the last point which Bob mentioned. The break-even point of YBW. There must be
also one for ABDADA. Currently we have max 64 cores per cpu socket, and Moore's
Law has still something in pipe, so it is relative certain that within a decade
we have to switch to something new. I tested different parallel searches with up
to 256 SIMD units, acting as 256 workers, on gpu architecture, but it remains
open how and if my results are applicable to pro cpu engines. Cos of the
architecture I was not able to implement YBW, and I did not compare SHT/Lazy SMP
Elo-wise, but I observed an possible break-even point, >=128 threads, and made
an proposal, RMO, to overcome this. Curious what others will come up with in
some years...

--
Srdja

About the break-even point, I am having difficulty to get YBW to convincingly beat ABDADA even at 2-threads and a 40/60 tc

Code: Select all

Score of scorpio-ybw vs scorpio-abdada: 25 - 31 - 62  [0.475] 118

I have tested 4 and 8 threads too and the gap increases as you go towards 12 where ABDADA shows its superiority.

We often forget that ABDADA _is_ YBW, only that it does not suffer from lock contentions or juggling split blocks as YBW does.
I even have a MIN_SPLIT_DEPTH too for ABDADA but it probably doesn't need it as much.

Michel · Post by **Michel** » Sun Jan 05, 2020 6:37 am

Daniel Shawul wrote: ↑Sun Jan 05, 2020 2:03 am

It looks "ridiclous theory number 1" could be the winner I redid the test at fixed depth=8 but this time constraining all 8 threads to 1 cores using
"taskset -c 0 ./cutechess-cli", and i can see only 100% cpu usage when the previous test was 800% cpu usage.
So both the single thread and multi-thread search get one core.
Code: Select all
Score of stockfish-8-thread vs stockfish-1-thread: 1113 - 467 - 497  [0.656] 2077
So to improve stockfish search, all one has to do is run multi-threaded but with 1-core. So that is my winning patch to stockfish to be sent later
I will do a timed match (not fixed depth) before I sent the patch though lol

That's a clever trick!

So for now it seems that SMP algorithms may widen a little bit (Peter's data) but their main strength comes from searching deeper.

Daniel Shawul · Post by **Daniel Shawul** » Sun Jan 05, 2020 6:58 am

Michel wrote: ↑Sun Jan 05, 2020 6:37 am
Daniel Shawul wrote: ↑Sun Jan 05, 2020 2:03 am

It looks "ridiclous theory number 1" could be the winner I redid the test at fixed depth=8 but this time constraining all 8 threads to 1 cores using
"taskset -c 0 ./cutechess-cli", and i can see only 100% cpu usage when the previous test was 800% cpu usage.
So both the single thread and multi-thread search get one core.
Code: Select all
Score of stockfish-8-thread vs stockfish-1-thread: 1113 - 467 - 497  [0.656] 2077
So to improve stockfish search, all one has to do is run multi-threaded but with 1-core. So that is my winning patch to stockfish to be sent later
I will do a timed match (not fixed depth) before I sent the patch though lol
That's a clever trick!

So for now it seems that SMP algorithms may widen a little bit (Peter's data) but their main strength comes from searching deeper.

Well, Peter has an ABDADA + LazySMP mix so he may have had this "bug" too due to lazy's presence, he has not yet confirmed if he has it or not so lets's wait. My guess is you could easily forget to test the D+1 depth against the search depth limit instead of D.
I have one data point that corresponds with his at depth=14/13, and 8-threads vs 1-threads, he got +30 elo where I got 0.

Note that the widening I got for my YBW/ABDADA/SHT is exactly zilch for all of them!

As a final nail in the coffin for stockfish: 8 threads depth=7 (one less) vs 1-thread depth=8 gets expected results.

Code: Select all

Score of stockfish-8-thread vs stockfish-1-thread: 2903 - 3074 - 2023  [0.489] 8000

Let's immortalize this for those who take stockfish as the bible ( @lucasart I am looking at you

):
"widening effect in LazySMP" = "Possible bug in your fixed depth tests due to D/D+1 search depths"

Moving on to finding the minimum number of cores where YBW will be superior, I can't find it yet but long time control may change this.
2-threads result at 40/60 tc:

Code: Select all

Score of scorpio-ybw vs scorpio-abdada: 48 - 49 - 141  [0.498] 238

lucasart · Post by **lucasart** » Sun Jan 05, 2020 7:02 am

Michel wrote: ↑Sun Jan 05, 2020 6:37 am
Daniel Shawul wrote: ↑Sun Jan 05, 2020 2:03 am

It looks "ridiclous theory number 1" could be the winner I redid the test at fixed depth=8 but this time constraining all 8 threads to 1 cores using
"taskset -c 0 ./cutechess-cli", and i can see only 100% cpu usage when the previous test was 800% cpu usage.
So both the single thread and multi-thread search get one core.
Code: Select all
Score of stockfish-8-thread vs stockfish-1-thread: 1113 - 467 - 497  [0.656] 2077
So to improve stockfish search, all one has to do is run multi-threaded but with 1-core. So that is my winning patch to stockfish to be sent later
I will do a timed match (not fixed depth) before I sent the patch though lol
That's a clever trick!

So for now it seems that SMP algorithms may widen a little bit (Peter's data) but their main strength comes from searching deeper.

Nonesense. Of course 8T SF trashes 1T SF on 1 core. You are comparing apples and pears. SF uses skip depth scheme, so the reported depth is misleading and massively understated. The 8T version of the fixed depth search benefits from HT pruning coming from deeper entries.

Your "winning patch" will work at fixed depth, but that's meaningless. It will certainly fail using time based search limit.

There is no bug in SF… (at least not here)

Daniel Shawul · Post by **Daniel Shawul** » Sun Jan 05, 2020 7:13 am

lucasart wrote: ↑Sun Jan 05, 2020 7:02 am
Michel wrote: ↑Sun Jan 05, 2020 6:37 am
Daniel Shawul wrote: ↑Sun Jan 05, 2020 2:03 am

It looks "ridiclous theory number 1" could be the winner I redid the test at fixed depth=8 but this time constraining all 8 threads to 1 cores using
"taskset -c 0 ./cutechess-cli", and i can see only 100% cpu usage when the previous test was 800% cpu usage.
So both the single thread and multi-thread search get one core.
Code: Select all
Score of stockfish-8-thread vs stockfish-1-thread: 1113 - 467 - 497  [0.656] 2077
So to improve stockfish search, all one has to do is run multi-threaded but with 1-core. So that is my winning patch to stockfish to be sent later
I will do a timed match (not fixed depth) before I sent the patch though lol
That's a clever trick!

So for now it seems that SMP algorithms may widen a little bit (Peter's data) but their main strength comes from searching deeper.
Nonesense. Of course 8T SF trashes 1T SF on 1 core. You are comparing apples and pears. SF uses skip depth scheme, so the reported depth is misleading and massively understated. The 8T version of the fixed depth search benefits from HT pruning coming from deeper entries.

Your "winning patch" will work at fixed depth, but that's meaningless. It will certainly fail using time based search limit.

There is no bug in SF… (at least not here)

You don't even understand what is implied in my post it is so funny. I knew my patch will fail on fixed time tests and was just making fun of how stupid you were to ride this "widening" bullshit for years... Talk about worshiping stockfish blindly.
8T stockfish will NOT trash 1T stockfish on fixed depth test, the fact that it does here means the setup is all wrong! There is no "widening effect" to be observed and you were probably misled after you did fixed depth test as well.

Another of your bullshit was already proven wrong that "ABDADA is a good-on-paper algorithm, if you don't agree you must be a useless academician? How about you being a blind worshiper of whatever-stockfish-does is right?

Alayan · Post by **Alayan** » Sun Jan 05, 2020 9:42 am

Daniel Shawul wrote: ↑Sun Jan 05, 2020 2:27 am Peter, have you seen my latest results. Stockfish mulit-threaded (8-threads) even when constrained to single-core thrashes 1-thread stockfish.
There is clearly a "bug" in fixed depth tests with LazySMP, because it happens to sneak to depth d+1 is the current most likely theory.

You ran fixed depth games. Stockfish with 8 threads on 1 core will take much more time than Stockfish with 8 cores to reach the given depth, but will have a rather similar search tree in the end. Hence, having similar results with 1 or 8 cores using fixed depth for 8 threads is expected, though some parts like breadcrumbs will not behave as expected with such a scheme.

lucasart wrote: ↑Sun Jan 05, 2020 7:02 am SF uses skip depth scheme, so the reported depth is misleading and massively understated.

This is wrong. Depth skipping has been removed from Stockfish over 9 months ago, because it scales awfully at longer TCs. See https://github.com/official-stockfish/S ... /pull/1972

Current Stockfish doesn't do it.

Michel · Post by **Michel** » Sun Jan 05, 2020 11:34 am

Alayan wrote: ↑Sun Jan 05, 2020 9:42 am

You ran fixed depth games. Stockfish with 8 threads on 1 core will take much more time than Stockfish with 8 cores to reach the given depth, but will have a rather similar search tree in the end. Hence, having similar results with 1 or 8 cores using fixed depth for 8 threads is expected, though some parts like breadcrumbs will not behave as expected with such a scheme.

Yes that's true I realize now.

lucasart wrote: ↑Sun Jan 05, 2020 7:02 am SF uses skip depth scheme, so the reported depth is misleading and massively understated.
This is wrong. Depth skipping has been removed from Stockfish over 9 months ago, because it scales awfully at longer TCs. See https://github.com/official-stockfish/S ... /pull/1972

Current Stockfish doesn't do it.

Interesting. So what variant of Lazy does SF use then? d/d+1 or something else? Do the individual threads all respect the depth limit?

lucasart · Post by **lucasart** » Sun Jan 05, 2020 11:39 am

Alayan wrote: ↑Sun Jan 05, 2020 9:42 am
lucasart wrote: ↑Sun Jan 05, 2020 7:02 am SF uses skip depth scheme, so the reported depth is misleading and massively understated.
This is wrong. Depth skipping has been removed from Stockfish over 9 months ago, because it scales awfully at longer TCs. See https://github.com/official-stockfish/S ... /pull/1972

Current Stockfish doesn't do it.

Indeed:

Code: Select all

commit 66818f2e85732644708e23b3f2c2e544abfbc3b0 (HEAD)
Author: CoffeeOne <peter.zsifkovits@gmx.at>
Date:   Wed Mar 20 14:50:41 2019 +0100

    Skip skipping thread scheme (#1972)
    
    
    Several simplification tests (all with the bounds [-3,1]) were run:
    5+0.05 8 threads, failed very quickly:
    http://tests.stockfishchess.org/tests/view/5c439a020ebc5902bb5d3970
    
    20+0.2 8 threads, also failed, but needed a lot more games:
    http://tests.stockfishchess.org/tests/view/5c44b1b70ebc5902bb5d4e34
    
    60+0.6 8 threads passed:
    http://tests.stockfishchess.org/tests/view/5c48bfe40ebc5902bca15325
    
    60+0.6 4 threads passed:
    http://tests.stockfishchess.org/tests/view/5c4b71a00ebc593af5d49904
    
    No functional change.

Saying it scales awfully is a bit exagerated. All we can say is that:
* it clearly helps at short tc
* as tc increases, it helps less and less
* at very long tc, it becomes practically useless (un-mesurable, hard to distinguish signal from noise these sprt(-3,1) results are difficult to interpret, especially in the presence of p-hacking).

But yes, it seems to be asymptotically useless. In fact, all evidence points to this simple conclusion: pure SHT is asymptotically optimal. Yes, this is a very bold claim, a flamewar bait, which I am sure will not fail

In fact, I wouldn't be surprised if all this (ugly) thread voting logic in SF is also asymptotically useless.

I've tried to remove the d/d+1 scheme in demolito, and it failed hard:
http://chess.grantnet.us/viewTest/4170/

So, probably a clear regression, although the test bounds were not decided (in advance of the test) to prove as much.

I'll re-spin it by doubling the tc from 4+0.04 to 8+0.08. Of course, the draw rate will also increase, so I can't just compare elo in 4" to elo in 8", but perpahs comparing bayes elo, or wilo, or Michel's normalized elo metric...

petero2 · Post by **petero2** » Sun Jan 05, 2020 11:55 am

Daniel Shawul wrote: ↑Sun Jan 05, 2020 2:27 am Peter, have you seen my latest results.
...
You said it is a mix of ABDADA + Lazy so that d/d+1 is a potential problem ...

My implementation does not use the classical d/d+1 scheme:

petero2 wrote: ↑Sun Aug 13, 2017 9:52 pm * I disabled the lazy SMP "depth + 1" trick, as I don't know if that makes sense in combination with ABDADA. Disabling it in a pure lazy SMP setting seemed to cost around 5 elo at most, so I hope this is not a big deal.

I do a different kind of extended depth logic though:

petero2 wrote: ↑Sun Aug 06, 2017 11:15 pm When a helper thread finishes its search it will send the search result to the master thread. The helper thread will then restart its current search at a one ply deeper depth. It is expected that this search will be aborted rather quickly. As soon as the master thread receives the helper thread search result, it will continue its algorithm, which causes it to call negaScoutRoot() again with different parameters, and negaScoutRoot() tells all helper threads to start searching using the new parameters.

I modified this logic to make it never exceed the fixed depth limit and ran 8 threads vs 1 thread again at fixed depth 10. The result was +42.5 elo after 5536 games.

Daniel Shawul wrote: ↑Sun Jan 05, 2020 2:27 am Also, my tests with my implementation of ABDADA give absolutely zero elo comparing 8-thread vs 1-thread at fixed depth.
I repost here the fixed depth = 14 result
Code: Select all
Score of scorpio-sht-8-thread vs scorpio-sht-1-thread: 78 - 90 - 132  [0.480] 300
Score of scorpio-abdada-8-thread vs scorpio-abdada-1-thread: 111 - 107 - 106  [0.506] 324
Score of scorpio-ybw-8-thread vs scorpio-ybw-1-thread: 92 - 94 - 114  [0.496] 300

Are you still using the same move ordering in all threads?

Daniel Shawul wrote: ↑Sat Aug 19, 2017 4:32 pm
1. It might be the case that lazy SMP (and SHT) works better if you have a per-thread history table. The late moves in ALL nodes are mostly ordered by the history heuristic, and if the history table is thread local, there is a decent chance that it becomes different in different threads. This would have the effect that different threads often don't search the same moves at the same time, and would therefore be able to pick up results from other threads from the transposition table.
Unfortunately, I share the history and refutation tables between threads, so the move list is searched in the same order in different threads.

If so, that might explain why you see different results compared to Stockfish and Texel.

RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate