RMO - Randomized Move Order - yet another Lazy SMP derivate

Daniel Shawul · Post by **Daniel Shawul** » Sun Jan 05, 2020 10:22 pm

Michel wrote: ↑Sun Jan 05, 2020 9:58 pm
petero2 wrote: ↑Sun Jan 05, 2020 9:39 pm
Michel wrote: ↑Sun Jan 05, 2020 12:52 pm
Code: Select all
  // Iterative deepening loop until requested to stop or the target depth is reached
  while (   (rootDepth += ONE_PLY) < DEPTH_MAX
         && !Threads.stop
         && !(Limits.depth && mainThread && rootDepth / ONE_PLY > Limits.depth))
Perhaps this is the culprit? If I read it correctly then the depth limit is only respected in mainThread.
Yes it is easy to verify using print statements that the depth limit is not respected in fixed depth searches for threads other than the main thread. Fixing this is easy though. Just remove "&& mainThread".

With latest Stockfish from git I get +314 elo for 8 threads vs 1 thread at fixed depth 8. After fixing the above problem I instead get +251 elo. At fixed depth 14 I get +172 elo:
Code: Select all
d     elo    win   draw  loss
8+    314   3501    654   301
8     251   2266    576   314
14    172   1617   1380   169
Wow. So it seems there is indeed substantial widening in SF. As I understand it SF is actually using SHT now (like Toga II) and not Lazy as it is commonly understood.

It appears the amount of widening varies wildly from engine to engine. But there does not seem to be a lot of freedom in implementing SHT. So perhaps in the end there is a mystery....

So there is a bug AND widening wow. I rerun my fixed-depth constrained matche again after fixing the bug,
and now I get acceptable results that can be explained by hyper-threading i think.

Code: Select all

Score of stockfish-8-thread vs stockfish-1-thread: 852 - 695 - 533  [0.538] 2080

My previous result was

Code: Select all

Score of stockfish-8-thread vs stockfish-1-thread: 1113 - 467 - 497  [0.656] 2077

Lets now nail down the real causes -- thread local stuff that could change move ordering a lot ?
So if Stockfish is now SHT, it means there is nothing unique about LazySMP then.

Also, the fact that you don't get that much benefit with the constrained match BUT you do get significant benefit
when threads are running in parallel (non-constrained match) suggests dynamic tables being the culprit.

Edit My result after the fix is not as great as Peter's but it is still there

Code: Select all

Score of stockfish-8-thread vs stockfish-1-thread: 3629 - 1130 - 1380  [0.704] 6139

+151 elo compared to +251 of Peter.
Stockfish git hash I used is from Jul 14, 2019: commit 7090d2561ae9ce3803bbc04319c4c93f

petero2 · Post by **petero2** » Sun Jan 05, 2020 11:36 pm

Daniel Shawul wrote: ↑Sun Jan 05, 2020 10:22 pm Edit My result after the fix is not as great as Peter's but it is still there
Code: Select all
Score of stockfish-8-thread vs stockfish-1-thread: 3629 - 1130 - 1380  [0.704] 6139
+151 elo compared to +251 of Peter.
Stockfish git hash I used is from Jul 14, 2019: commit 7090d2561ae9ce3803bbc04319c4c93f

I tried that version and got +168 elo after 3198 games (w=1972, d=689, l=537), so quite a bit less than the +251 I got using the latest version. I guess the difference between 151 and 168 can be explained by error margins and different opening books.

elcabesa · Post by **elcabesa** » Sun Jan 05, 2020 11:42 pm

sorry, what do you mean by SHT? Shared transpositio table?

lucasart · Post by **lucasart** » Mon Jan 06, 2020 12:38 am

elcabesa wrote: ↑Sun Jan 05, 2020 11:42 pm sorry, what do you mean by SHT? Shared transpositio table?

SHT and Lazy are the same thing.

SHT is the real name, dating back from times immemorial:
https://www.chessprogramming.org/Shared_Hash_Table

Lazy is just a silly name coined in this forum a few years ago, and was refering to a completely misguided idea, that certainly doesn't work or make any sense:
http://www.talkchess.com/forum3/viewtopic.php?t=46597

But the term "lazy" just stuck, becase people like it. Everyone wants to be lazy, and get SMP working in as little effort as possible, without spending weeks of work coding and debugging, an incomprehensible tree splitting algorithm.

Daniel Shawul · Post by **Daniel Shawul** » Mon Jan 06, 2020 2:08 am

petero2 wrote: ↑Sun Jan 05, 2020 11:36 pm
Daniel Shawul wrote: ↑Sun Jan 05, 2020 10:22 pm Edit My result after the fix is not as great as Peter's but it is still there
Code: Select all
Score of stockfish-8-thread vs stockfish-1-thread: 3629 - 1130 - 1380  [0.704] 6139
+151 elo compared to +251 of Peter.
Stockfish git hash I used is from Jul 14, 2019: commit 7090d2561ae9ce3803bbc04319c4c93f
I tried that version and got +168 elo after 3198 games (w=1972, d=689, l=537), so quite a bit less than the +251 I got using the latest version. I guess the difference between 151 and 168 can be explained by error margins and different opening books.

Yeah, i think so too. Why this huge jump from version to version is anybody's guess. I quickly tested the history/refutation idea by disabling update_quiet_stats and update_capture_stats in stockfish. Stockfish made further gains without using history/refutation table updates.
But I must admit that I don't know what I did by putting a return statement at the top these two subroutines (could have side effects I didn't anticipate)

Code: Select all

Score of stockfish-8-thread vs stockfish-1-thread: 5534 - 1209 - 1257  [0.770] 8000

I got access to 20-core machine and run a 1 min search for my YBW and ABDADA, and ABDADA seems to go as deep if not deeper than YBW.
It is just one position though but given that I don't observe any widening in my code, it is understandable that my ABDADA goes deeper.

YBW 12mnps and depth 28

Code: Select all

28 22 4004 406735355  Ng1-f3 Ng8-f6 d2-d4 e7-e6 c2-c4 d7-d5 c4xd5 e6xd5 Nb1-c3 Nb8-c6 a2-a3 h7-h6 e2-e3 Bc8-f5 Bf1-e2 Bf8-d6 Ke1-g1 Ke8-g8 b2-b3 Rf8-e8 Bc1-b2 Qd8-d7 Be2-d3 a7-a6 h2-h3 Qd7-e6 Bd3xf5 Qe6xf5
# splits = 594927 badsplits = 44137 egbb_probes = 0
# nodes = 706700617 <68 qnodes> time = 60040ms nps = 11770496 eps = 12031188  nneps = 0

ABDADA 18mnps and depth 29

Code: Select all

29 31 5487 706625383  d2-d4 Ng8-f6 c2-c4 e7-e6 Nb1-c3 d7-d5 Bc1-g5 Bf8-e7 c4xd5 e6xd5 e2-e3 Ke8-g8 Ng1-f3 h7-h6 Bg5-h4 Nb8-c6 a2-a3 Bc8-f5 Bf1-d3 Nf6-e4 Bh4xe7 Nc6xe7 Ke1-g1 Ne4xc3 b2xc3 Bf5xd3 Qd1xd3 Ne7-c6 Qd3-c2
# splits = 0 badsplits = 0 egbb_probes = 0
# nodes = 1085996479 <66 qnodes> time = 60055ms nps = 18083362 eps = 17143516  nneps = 0

Alayan · Post by **Alayan** » Mon Jan 06, 2020 6:05 am

Daniel Shawul wrote: ↑Sun Jan 05, 2020 5:24 pm Thank you for the confirmation! People here, a main stockfish author, has said both 8-threads and 1-thread will search a similar search tree at a fixed depth meaning there is no "widening".

You misunderstood me.

I wrote than 8-threads @ fixed depth running on a single-core and 8-threads @ fixed depth running on 8 cores will have a similar search tree, though with some differences introduced by no actual concurrent execution happening. Running fixed-depth means that the time difference to reach the given depth depending on how many cores you use is secondary.

Also, while I contributed to some SF patches and testing environment changes, I wouldn't call myself a "main stockfish author", there are many people who did much much more.

petero2 · Post by **petero2** » Mon Jan 06, 2020 5:11 pm

Daniel Shawul wrote: ↑Sun Jan 05, 2020 9:08 pm Ok I have made history and refutation tables thread-local now but it seems there is no change at least for depth=8 will test 14 later.
Code: Select all
Score of scorpio-sht-8-thread vs scorpio-sht-1-thread: 66 - 65 - 68  [0.503] 199
My changes are very simple: make the tables local by uncommenting static, and clear history tables for all threads at the start.
Code: Select all
/*static*/ CACHE_ALIGN int history[14][64];
/*static*/ CACHE_ALIGN MOVE refutation[14][64];
@Peter, if you can a test where you make these tables global (opposite of what i did) in the meantime, we can know for sure
it is coming from history and refutation being local ...

I made my History::ht array static so that all History objects share the same state. I then re-ran the fixed depth 10 test for 8 threads vs 1 thread, first with my standard search, then after disabling the ABDADA-like part. I got:

Code: Select all

type      elo    win   draw  loss
abdada    19.7    805  1604   633
lazy      15.7    816  1520   680

Earlier (i.e. with thread local history table) I got at fixed depth 10:

Code: Select all

type      elo    win   draw  loss
abdada    42.5   1701  2808  1027
lazy      50.2    962  1525   529

So a global history table removes about 50-70% of the widening in this test.

I also measured how the global history table affects playing strength by playing against unmodified Texel using 8 threads and time control 6 seconds + 0.06 seconds / move and got -48 elo (w=373, d=1975, l=806).

Alayan · Post by **Alayan** » Mon Jan 06, 2020 6:05 pm

Vondele did an interesting experiment :

I implemented 5 versions (drafts: https://github.com/vondele/Stockfish/co ... hreadDepth)

v1: Master version: all threads search irrespective of Limits.depth, mainThread stops search when it reaches limits depth.
v2: First thread to reach Limits.depth stops search.
v3: all threads at most Limits.depth, mainThread stops search reaching limit (simple '&& mainThread' removal)
v4: all threads search to exactly limits depth
v5: all threads search to at least limits depth (last one to reach stops search)

and had them play a depth=13 match, with 8 threads. Also in the pool: sequential runs with depth 13 ... 16. Interesting, even v2 is significantly stronger than sequential, and v1 ~ v4:
Code: Select all
Rank Name                          Elo     +/-   Games   Score   Draws
   1 v5                            141      15    1006   69.2%   48.2%
   2 seq16                         117      15    1005   66.2%   49.7%
   3 v4                             37      15    1007   55.3%   52.5%
   4 v1                             35      15    1008   55.0%   52.8%
   5 seq15                          23      15    1008   53.3%   53.9%
   6 v3                             22      15    1007   53.2%   53.3%
   7 seq14                         -86      15    1005   37.8%   49.2%
   8 v2                            -92      15    1008   37.1%   47.8%
   9 seq13                        -211      18    1006   22.9%   36.1%

Daniel Shawul · Post by **Daniel Shawul** » Mon Jan 06, 2020 8:30 pm

petero2 wrote: ↑Mon Jan 06, 2020 5:11 pm
Daniel Shawul wrote: ↑Sun Jan 05, 2020 9:08 pm Ok I have made history and refutation tables thread-local now but it seems there is no change at least for depth=8 will test 14 later.
Code: Select all
Score of scorpio-sht-8-thread vs scorpio-sht-1-thread: 66 - 65 - 68  [0.503] 199
My changes are very simple: make the tables local by uncommenting static, and clear history tables for all threads at the start.
Code: Select all
/*static*/ CACHE_ALIGN int history[14][64];
/*static*/ CACHE_ALIGN MOVE refutation[14][64];
@Peter, if you can a test where you make these tables global (opposite of what i did) in the meantime, we can know for sure
it is coming from history and refutation being local ...
I made my History::ht array static so that all History objects share the same state. I then re-ran the fixed depth 10 test for 8 threads vs 1 thread, first with my standard search, then after disabling the ABDADA-like part. I got:
Code: Select all
type      elo    win   draw  loss
abdada    19.7    805  1604   633
lazy      15.7    816  1520   680
Earlier (i.e. with thread local history table) I got at fixed depth 10:
Code: Select all
type      elo    win   draw  loss
abdada    42.5   1701  2808  1027
lazy      50.2    962  1525   529
So a global history table removes about 50-70% of the widening in this test.

I also measured how the global history table affects playing strength by playing against unmodified Texel using 8 threads and time control 6 seconds + 0.06 seconds / move and got -48 elo (w=373, d=1975, l=806).

Well, it now seems clear local history has a role in widening of Texel. I don't know why it doesn't widen Scorpio's search.
I re-run with more games now. I had a thread_sleep(30ms) before which was slowing down my tests by 10x i think

Depth=8 global history/refutation

Code: Select all

Score of scorpio-sht-8-thread vs scorpio-sht-1-thread: 1115 - 980 - 906  [0.522] 3001

Depth=8 local history/refutation

Code: Select all

Score of scorpio-sht-8-thread vs scorpio-sht-1-thread: 1077 - 1051 - 874  [0.504] 3002

So it seems going to local history tables even decreases the widening for me.
Also for Stockfish that seems to be the case. I redid the experiment by removing all dynamic terms in MovePicker::score() ( also tested a scoring scheme that keeps same order moves were generated) and they both increased the widening effect.
There is something funky in Stockfish anyway with its big elo gains from widening so I will try to focus on that instead of Scorpio.

Daniel Shawul · Post by **Daniel Shawul** » Mon Jan 06, 2020 8:32 pm

Alayan wrote: ↑Mon Jan 06, 2020 6:05 pm Vondele did an interesting experiment :
I implemented 5 versions (drafts: https://github.com/vondele/Stockfish/co ... hreadDepth)

v1: Master version: all threads search irrespective of Limits.depth, mainThread stops search when it reaches limits depth.
v2: First thread to reach Limits.depth stops search.
v3: all threads at most Limits.depth, mainThread stops search reaching limit (simple '&& mainThread' removal)
v4: all threads search to exactly limits depth
v5: all threads search to at least limits depth (last one to reach stops search)

and had them play a depth=13 match, with 8 threads. Also in the pool: sequential runs with depth 13 ... 16. Interesting, even v2 is significantly stronger than sequential, and v1 ~ v4:
Code: Select all
Rank Name                          Elo     +/-   Games   Score   Draws
   1 v5                            141      15    1006   69.2%   48.2%
   2 seq16                         117      15    1005   66.2%   49.7%
   3 v4                             37      15    1007   55.3%   52.5%
   4 v1                             35      15    1008   55.0%   52.8%
   5 seq15                          23      15    1008   53.3%   53.9%
   6 v3                             22      15    1007   53.2%   53.3%
   7 seq14                         -86      15    1005   37.8%   49.2%
   8 v2                            -92      15    1008   37.1%   47.8%
   9 seq13                        -211      18    1006   22.9%   36.1%

Thanks, I think v4 ( all threads searching to same depth) is what we want.
So if I read this right, stockfish widening gives it at least +2 more plies of elo i.e. v4 at depth=13 is lightly better than seq at depth=15.

RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate