RMO - Randomized Move Order - yet another Lazy SMP derivate


jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by jdart »

Most modern alpha-beta search logic is just heuristics that work often enough to give an Elo gain. There are many positions where they will fail. I remember when Toga, for example, was one of the early engines making heavy use of late move pruning. That is just throwing away big chunks of the search tree without even looking at the moves. I found it incredible that this could be good, but in fact it is an effective technique.
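For illustration, a minimal late-move-pruning sketch inside a plain alpha-beta move loop. Every name here (Position, Move, generate_moves, is_quiet, evaluate, the move-count threshold) is a hypothetical placeholder, not code from any real engine; the point is only that quiet moves past the threshold at low depth are never searched at all.

Code:

#include <vector>

// Placeholder types; a real engine supplies its own.
struct Position {};
struct Move {};

// Assumed engine interface (hypothetical names):
int  evaluate(const Position &pos);                   // static eval / qsearch
bool is_quiet(const Move &m);                         // no capture, no promotion
std::vector<Move> generate_moves(const Position &p);  // assumed ordered best-first
void make_move(Position &pos, const Move &m);
void unmake_move(Position &pos, const Move &m);

int search(Position &pos, int depth, int alpha, int beta)
{
    if (depth <= 0)
        return evaluate(pos);

    int moveCount = 0;
    for (const Move &m : generate_moves(pos)) {
        ++moveCount;

        // Late move pruning: at shallow depth, once enough moves have been
        // tried, the remaining quiet moves are simply skipped, i.e. that part
        // of the tree is thrown away without being looked at.
        if (depth <= 3 && moveCount > 4 + depth * depth && is_quiet(m))
            continue;

        make_move(pos, m);
        int score = -search(pos, depth - 1, -beta, -alpha);
        unmake_move(pos, m);

        if (score >= beta)
            return beta;          // fail-hard cutoff
        if (score > alpha)
            alpha = score;
    }
    return alpha;
}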

--Jon
Angrim
Posts: 97
Joined: Mon Jun 25, 2012 10:16 pm
Location: Forks, WA
Full name: Ben Nye

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Angrim »

If this gets better game results despite worse time to depth, then it must be searching moves that are normally pruned incorrectly. And it seems that this incorrect pruning happens often enough to have a measurable impact. So it seems worth researching which moves get searched with this method but discarded (or reduced) by the other methods, and which of those pay off. I would certainly hope that there is a better method than "shuffle the move order at random" for improving the selection of which moves to search.

Ben
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by smatovic »

Angrim wrote: Tue Dec 31, 2019 6:24 pm If this gets better game results despite worse time to depth, then it must be searching moves that are normally pruned incorrectly. And it seems that this incorrect pruning happens often enough to have a measurable impact. So it seems worth researching which moves get searched with this method but discarded (or reduced) by the other methods, and which of those pay off. I would certainly hope that there is a better method than "shuffle the move order at random" for improving the selection of which moves to search.

Ben
Not sure if you are talking about Lucas' Lazy SMP or RMO.

I benchmarked ABDADA and RMO with time to depth only.

But I have to admit that my engine is quite simple, and it runs on a GPU
architecture, so I am not sure how, or whether, my results apply to professional
CPU engines.

Maybe when >128-core CPUs are common someone will try it.

--
Srdja

*** edit ***

PS: I guess that with a certain number of threads ABDADA converges to RMO.
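For readers who do not have the ABDADA deferral step in mind, a rough, simplified sketch follows. All names here are hypothetical, and real ABDADA keeps a busy counter in the shared hash table rather than a separate table and usually exempts the first move. With many threads, most moves at a node are "busy" at some point, so the order in which an individual thread actually searches them drifts away from the static ordering, which is the intuition behind the convergence remark.

Code:

#include <cstdint>
#include <vector>

// Placeholder types; a real engine supplies its own.
struct Position {};
struct Move {};

// Assumed interface (hypothetical names):
uint64_t child_key(const Position &pos, const Move &m); // hash of the child position
bool     mark_busy(uint64_t key);   // true if we claimed it, false if another thread has it
void     clear_busy(uint64_t key);
// Makes the move, searches the child, returns the score from this node's view.
int      search_move(Position &pos, const Move &m, int depth, int alpha, int beta);

int abdada_node(Position &pos, const std::vector<Move> &moves,
                int depth, int alpha, int beta)
{
    std::vector<Move> deferred;

    // First pass: skip moves currently searched by someone else.
    for (const Move &m : moves) {
        uint64_t key = child_key(pos, m);
        if (!mark_busy(key)) {
            deferred.push_back(m);          // defer, try the next move first
            continue;
        }
        int score = search_move(pos, m, depth - 1, alpha, beta);
        clear_busy(key);
        if (score >= beta) return beta;
        if (score > alpha) alpha = score;
    }

    // Second pass: search the deferred moves that still matter.
    for (const Move &m : deferred) {
        int score = search_move(pos, m, depth - 1, alpha, beta);
        if (score >= beta) return beta;
        if (score > alpha) alpha = score;
    }
    return alpha;
}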
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Michel »

Angrim wrote: Tue Dec 31, 2019 6:24 pm If this gets better game results despite worse time to depth, then it must be searching moves that are normally pruned incorrectly. And it seems that this incorrect pruning happens often enough to have a measurable impact. So it seems worth researching which moves get searched with this method but discarded (or reduced) by the other methods, and which of those pay off. I would certainly hope that there is a better method than "shuffle the move order at random" for improving the selection of which moves to search.

Ben
I wonder where the TTD claim actually comes from (i.e. that Lazy SMP is worse in TTD than YBW). One would need an engine that implements both Lazy SMP and YBW to make the comparison. Perhaps the version of SF from when they made the switch?
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by lucasart »

Back to RMO...

The correct way to test is to proceed step by step, testing each component one at a time, not adding them all together, and then comparing to a wrong benchmark (ABDADA). The benchmark should be the shared HT (or possibly its slightly improved version with alternating depths).

Of course, if you can prove that RMO works better than shared HT with a pencil and paper, you can skip the testing. In that case, just write a paper and have it peer reviewed by Michel :lol:
- communication of TT move and TT scores between threads via Shared Hash Table
- all threads follow the same Iterative Deepening loop with same depth
- thread 0 performs a normal search with move ordering etc.
OK, standard shared HT stuff.
- threads >0 randomize the move order up to QSearch
What? Throw away move ordering that is worth hundreds of Elo :shock:
- as soon as one thread finishes its search, all threads should terminate the
current ID iteration and start the next one
Yet another seemingly good idea on paper that fails in testing. I used to do this until recently in Demolito. But proper testing showed that it was clearly regressive.
RMO - exceptions for threads>0:

There are some cases you may not want to randomize the move order:

- Null move
- TT move
- IID move
- Counter move
- Killer move

and some cases where you may want to perform a normal search:

- LMR
- IID
So now you put in a lot of exceptions where we go back to normal, in an attempt to hide the massive Elo loss from random move ordering. Move ordering is worth many hundreds of Elo in modern engines; throwing that away is a non-starter. It should be tested stand-alone, just to see how bad it is.

Even in the diluted form you propose, I am 100% certain that this will lose to a simple shared HT. And by "lose" I mean in proper testing, measuring Elo not TTD, and using a realistic engine, not a toy engine with plain alpha-beta and random ordering.
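For reference, a minimal sketch of the shared-hash-table baseline with alternating depths mentioned above. Position, Move and search_root are hypothetical placeholders; the only assumptions are that search_root probes and stores into one transposition table shared by all threads and safe for concurrent access, and that only the main thread reports a move.

Code:

#include <thread>
#include <vector>

// Placeholder types; a real engine supplies its own.
struct Position {};
struct Move {};
struct SearchResult { Move best; int score; };

// Assumed to probe/store into one shared, concurrently accessible TT.
SearchResult search_root(Position &pos, int depth);

SearchResult lazy_smp(const Position &root, int maxDepth, int numThreads)
{
    std::vector<std::thread> helpers;

    // Helper threads: the same iterative deepening loop on the same position,
    // but odd-numbered helpers start one ply deeper, so that at any moment
    // roughly half of the threads work on depth d and half on d + 1.
    for (int id = 1; id < numThreads; ++id) {
        helpers.emplace_back([&root, maxDepth, id] {
            Position pos = root;                 // thread-local copy
            for (int d = 1 + (id & 1); d <= maxDepth; ++d)
                search_root(pos, d);             // results land in the shared TT
        });
    }

    // Main thread: normal iterative deepening; it alone reports the move.
    Position pos = root;
    SearchResult result{};
    for (int d = 1; d <= maxDepth; ++d)
        result = search_root(pos, d);

    for (std::thread &t : helpers) t.join();
    return result;
}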
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by smatovic »

lucasart wrote: Wed Jan 01, 2020 1:25 am Back to RMO...

The correct way to test is to proceed step by step, testing each component one at a time, not adding them all together, and then comparing to a wrong benchmark (ABDADA). The benchmark should be the shared HT (or possibly its slightly improved version with alternating depths).

Of course, if you can prove that RMO works better than shared HT with a pencil and paper, you can skip the testing. In that case, just write a paper and have it peer reviewed by Michel :lol:
- communication of TT move and TT scores between threads via Shared Hash Table
- all threads follow the same Iterative Deepening loop with same depth
- thread 0 performs a normal search with move ordering etc.
OK, standard shared HT stuff.
- threads >0 randomize the move order up to QSearch
What? Throw away move ordering that is worth hundreds of Elo :shock:
- as soon as one thread finishes its search, all threads should terminate the
current ID iteration and start the next one
Yet another seemingly good idea on paper that fails in testing. I used to do this until recently in Demolito. But proper testing showed that it was clearly regressive.
RMO - exceptions for threads>0:

There are some cases you may not want to randomize the move order:

- Null move
- TT move
- IID move
- Counter move
- Killer move

and some cases where you may want to perform a normal search:

- LMR
- IID
So now you put in a lot of exceptions where we go back to normal, in an attempt to hide the massive Elo loss from random move ordering. Move ordering is worth many hundreds of Elo in modern engines; throwing that away is a non-starter. It should be tested stand-alone, just to see how bad it is.

Even in the diluted form you propose, I am 100% certain that this will lose to a simple shared HT. And by "lose" I mean in proper testing, measuring Elo not TTD, and using a realistic engine, not a toy engine with plain alpha-beta and random ordering.
The only serious point you mention is to test Elo, not time to depth. And you
have a point there: I compared the different parallel searches only by TTD.

You can break the exceptions down to the TT move and the null move, and it should be a
no-brainer why.

I decommissioned my GPU workstation, so I am no longer able to produce any data.

I can only repeat my observation: with >= 128 threads there is not such a big
difference between ABDADA and RMO, at least with my toy engine.

--
Srdja
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by smatovic »

...to make my approach reproducible, here are the corresponding code snippets.

Activate RMO:

Code:

     // lazy smp, randomize move order
      if (
          lmove==MOVENONE                         // no Nullmove and no LMR research
          &&(
             // single slot, randomize
             ((ttindex1>1&&ttindex2<=1)&&gid>0)
             ||
             // RMO, randomize > n
             (RMO&&gid>=RANDWORKERS)
            )
          &&!(localNodeStates[sd]&QS)             // no Quiescence-Search
          &&!(localNodeStates[sd]&KIC)            // no King In Check
          &&!(localNodeStates[sd]&EXT)            // no Extension (kic or pawn promo)
          &&!(localSearchMode[sd]&NULLMOVESEARCH) // not in Nullmove-Search
          &&!(localSearchMode[sd]&LMRSEARCH)      // not in LMR-Search
          &&!(localSearchMode[sd]&IIDSEARCH)      // not in IID-Search
          &&localDepth[sd]>0                      // remaining search depth >0 
          // previous searched moves
          &&localTodoIndex[sd]>=RANDBRO           // RANDBRO set to 1, search oldest son first
          )
      {
        brandomize = true;
      }
    }
Randomize the move score during move picking:

Code:

      // eval move
      // wood count and piece square tables, pto-pfrom   
      tmpscore = EvalPieceValues[GETPTYPE(pto)]
                 +EvalTable[GETPTYPE(pto)*64+((stm)?sqto:FLOP(sqto))]
                 +EvalControl[((stm)?sqto:FLOP(sqto))];

      tmpscore-= EvalPieceValues[GETPTYPE(pfrom)]
                 +EvalTable[GETPTYPE(pfrom)*64+((stm)?lid:FLOP(lid))]
                 +EvalControl[((stm)?lid:FLOP(lid))];
      // MVV-LVA
      tmpscore = (GETPTYPE(pcpt)!=PNONE)?
                  EvalPieceValues[GETPTYPE(pcpt)]*16-EvalPieceValues[GETPTYPE(pto)]
                 :tmpscore;
      // check counter move heuristic
      if (countermove==tmpmove)
      {
        // score as second highest quiet move
        tmpscore = EvalPieceValues[QUEEN]+EvalPieceValues[PAWN];
      }
      // check killer move heuristic
      if (killermove==tmpmove)
      {
        // score as highest quiet move
        tmpscore = EvalPieceValues[QUEEN]+EvalPieceValues[PAWN]*2; 
      }
      // lazy smp, randomize move order
      if (brandomize)
      {
        tmpscore+= (prn%INF);
      }
      // check iid move
      if (localIIDMoves[sd]==tmpmove)
      {
        // score as 2nd highest move
        tmpscore = INFMOVESCORE-200;
        // iid move hit counter
        COUNTERS[gid*64+5]++;
      }
      // check tt move
      if (ttmove==tmpmove)
      {
        // score as highest move
        tmpscore = INFMOVESCORE-100;
        // TThits counter
        COUNTERS[gid*64+3]++;
      }

https://github.com/smatovic/Zeta/blob/v099m/src/zeta.cl

--
Srdja

PS: I named it Randomized Move Order, not Random Move Order
AndrewGrant
Posts: 1750
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by AndrewGrant »

smatovic wrote: Tue Dec 31, 2019 4:00 pm
AndrewGrant wrote: Tue Dec 31, 2019 2:26 pm
This basically means that your whole world of pencil and paper is an illusion. Perhaps a big part of the reason is the chaotic nature of modern search (with all the path-dependent effects like history ordering and its effect on LMR etc.). That, and the chaotic nature of thread scheduling, becoming chaos^2, a chaos feedback loop. Lazy SMP feeds on chaos, which increases with more search time and more threads, and that's why it beats the hell out of YBW on many threads.
Apart from the "This basically means that your whole world of pencil and paper is an illusion" part, does anyone really disagree with this rationale? It's one of the better justifications for why Lazy SMP is king.
What is an Elo increase worth if you cannot figure out why it worked?

--
Srdja
I agree with that thought, in that I would not just blindly commit something because it's an Elo gainer. But it's not hard to construct a reasonable argument for why LazySMP might outperform other methods.

You can argue that YBW is just a cheap way of getting a minor effective speedup, and that LazySMP is not really lazy, but a more complex and evolved method of multithreading. YBW increases speed; LazySMP increases quality. At least, that's how I've always seen it: LazySMP fails to produce massive increases in depth, but with better move ordering and the existence of singular extensions we get far higher quality nodes.
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
Alayan
Posts: 550
Joined: Tue Nov 19, 2019 8:48 pm
Full name: Alayan Feh

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Alayan »

The thought process of people like Bob who think that YBW ought to beat Lazy SMP goes something like this:
- The optimal single-core algorithm needs no more CPU time to reach a given strength than the best multi-threaded algorithm (this is obviously true).
- YBW is a pure speed-up of the single-core algorithm, therefore it has the same asymptotic efficiency.
- LazySMP leads to a different way of shaping the tree, but if that were the optimal one, then the single-core algorithm in use should be updated to shape the tree the same way.

This reasoning falls apart once you consider:
- YBW scales terribly to high thread counts. You can't ignore its overhead.
- There are many, many ways to shape a search tree, and there is no hard frontier between "very important node to search" and "pure waste of time to search". Hence a different shaping might be worse than the one achieved by the single-core algorithm, yet lose much less efficiency than YBW's overhead.
- The best algorithm depends on the resources available. Single-core testing is oriented around bullet TC, where there are massive benefits from increased depth. But if we could feasibly do SPRT testing at much longer TC, we would find plenty of different tricks both to cut down the search tree even more and, importantly, to limit the holes that current techniques tend to generate.

Modern engines prune very, very aggressively, and thus leave plenty of holes to fill by searching somewhat wider with LazySMP.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by dragontamer5788 »

Alayan wrote: Wed Jan 01, 2020 3:34 pm Modern engines prune very, very aggressively, and thus leave plenty of holes to fill by searching somewhat wider with LazySMP.
That's an interesting take for sure.

If randomized move order truly benefits an engine (even across a LazySMP-like algorithm), that suggests that engines are pruning too much these days, and a more methodical search (or, in the case of RMO LazySMP, a less methodical search) could increase Elo.

----------

Another note that confuses me greatly: YBW as written on the Chess Programming Wiki seems suboptimal to me. It misses opportunities for parallelism, while occasionally parallelizing "wasteful" work. In particular: ALL nodes can be parallelized fully, while CUT nodes should visit their children strictly sequentially (not "sequential for the 1st move, parallel for the 2nd and later moves").

After all, if the 2nd move's information causes a beta-cutoff at moves #3, #4, #5 (and so on), then all of the work spent searching #3, #4, and #5 was in vain. That could be a huge amount of wasted work. In practice, searching the children sequentially does not mean losing parallelism: the first child of any CUT node is an ALL node, so the grandchildren reached through the 1st child can always be searched in parallel.

Other search strategies, such as DTS or the "modified YBW" or "modified ABDADA" that I see in Texel and some other engines, do seem to analyze the search tree more carefully and choose split points and sources of parallelism more deliberately.
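To make the proposed policy concrete, a tiny decision function follows; the NodeType prediction and the function name are my own hypothetical framing, not code from any of the engines mentioned.

Code:

// Decide whether the children of a node may be handed to idle threads,
// following the policy sketched above. NodeType is the usual predicted
// node type (PV/CUT/ALL) derived from the parent's type and move index.
enum class NodeType { PV, CUT, ALL };

bool may_split(NodeType node, int moveIndex)
{
    switch (node) {
    case NodeType::ALL:
        return true;          // every child can be searched in parallel
    case NodeType::CUT:
        return false;         // search children strictly one by one; the
                              // parallelism comes from the grandchildren of
                              // the first child, which is an ALL node
    case NodeType::PV:
        return moveIndex > 0; // classic YBW: finish the oldest brother first
    }
    return false;
}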

-----------

Another note: I think communication between threads can be optimized, lowering overhead costs. Under an alpha-beta framework:

1. Younger brothers never have to send a message to their older brother.
2. The older brother broadcasts a single message to its younger brothers when complete: its centipawn evaluation.
3. Only active threads need to get the message immediately. A blocked thread can receive the message later; as long as the message is processed before the blocked thread runs again, you're efficient.

#2 limits our broadcasts to the number of nodes covered by younger brothers. #3 means our broadcast mechanism only needs to scale to the number of cores in the system. Surely there is an algorithm that can provide #2 and #3 with low overhead?
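One possible low-overhead shape for this, sketched under my own assumptions (a per-split-point atomic slot that is polled rather than pushed): the older brother publishes its score exactly once, younger brothers read it whenever they next enter a node, so blocked threads pick it up lazily and nothing is ever sent "upwards".

Code:

#include <atomic>
#include <climits>

struct SplitPoint {
    // INT_MIN means "older brother not finished yet".
    std::atomic<int> olderBrotherScore{INT_MIN};
};

// Point 2: called exactly once by the older brother when its subtree is done.
void publish_older_brother(SplitPoint &sp, int centipawns)
{
    sp.olderBrotherScore.store(centipawns, std::memory_order_release);
}

// Points 1 and 3: younger brothers poll on node entry; they never send
// anything back. Returns true if the published score already cuts off
// this subtree, so the younger brother can abandon it.
bool younger_brother_check(const SplitPoint &sp, int &alpha, int beta)
{
    int s = sp.olderBrotherScore.load(std::memory_order_acquire);
    if (s == INT_MIN)
        return false;        // nothing published yet, keep searching
    if (s > alpha)
        alpha = s;           // tighten the window with the broadcast bound
    return alpha >= beta;
}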