speed up or avoiding move sorting

brtzsnr · Post by **brtzsnr** » Sun Mar 19, 2017 9:18 pm

My engine spends about 8% of time sorting moves. I find this ridiculously high given that hash, killer and counter moves don't need to be sorted and can cause minimax to fail high at current ply without actually generating moves.

Code: Select all

      22ms  8.10%  8.10%    22.20ms  8.17%  bitbucket.org/zurichess/zurichess/engine.(*stack&#41;.sort

Is this normal for your engines, too? How do you sort moves? I have tried a few methods:

1) Use stdlib provided sort function This is too slow because Golang doesn't support generics and has to do lots of calls to compare/swap.
2) Don't sort, but pop the best move. Can be O(n^2) in the worst case, but works with very good move scoring.
3) What I use now: sort using shell sort. Fast O(n log(n)) sorting; high cost upfront, constant time finding the best move repeatedly.

I believe stockfish uses 2), but that was slower for my engine. I also use far fewer phases (4 in zurichess vs 12? in stockfish).

I tried caching moves (and avoid generating/sorting moves for repeated positions), but that only worked in 13-17% of the cases and was a slow down overall.

Any other ideas?

mjlef · Post by **mjlef** » Sun Mar 19, 2017 9:53 pm

Things are likely to vary depending on your engine. In general, a lot of programs do this:

a. capture/promotions: sort by something like most valuable piece subsorted my lest valuable take
b. non-caps: sort by history or other move ordeirng schemes

in qsearch:
a. generate captures, but do not sort. Instead do a selection of the highest scoring remaining move

When in check, it can vary a lot what people do.

At one point, Stockfish "split" the non captures via a one time pass putting the moves with a move ordering score >X (0 for SF) in one pile, and the rest in another. Then they just sorted the higher scoring moves, leaving the lowered valued pieces in whatever order they were already in. There was some depth limit on this.

More advanced move generating techniques can generate attacks to a specific square. For example, they see what is attacking the queen and pull them off one at a time as needed with pawns capturing first up to king capturing. Then moving on to rooks and so on. This has the nice effect of giving you the same order, and not spending time generating a full list when the first capture might fail high anyway.

I am sure there are more ideas, but that is a start.

Ras · Post by **Ras** » Sun Mar 19, 2017 10:30 pm

brtzsnr wrote:My engine spends about 8% of time sorting moves.

Pretty normal if you sort the moves. You can reduce that via a good sorting algorithm. I made a little benchmark in C; the move data structure is 32 bit.

The resulting speeds, on x86:

- C standard library qsort with compare function pointers: 100%
- shellsort using the Cuira sequence: 123%
- shellsort using computed goto jump tables: 174%

If you are interested, I can post the code.

hgm · Post by **hgm** » Sun Mar 19, 2017 10:59 pm

I often generate captures victim by victim, running through the victims in MVV order. Then you only have to sort captures LVA-wise. If you generate by applying 0x88 tests to the pieces in the piece list in LVA order (after checking for Pawn attacks on the board), there only are captures to sort if several equally valuable victims are attacked.

If you organize the capture section of your 'move list' as a bitmap of all potential (attacker, victim) pairs, (256 in Chess), setting the corresponding bit when the attack is generated, and assign the bits so they extract in MVV/LVA order, there also is nothing to sort. You just set the bits during capture generation, and then extract them from the set to search them.

In large games like Chu or Tenjiku Shogi anything of O(n^2) is a no-no.

The real pain is sorting non-captures (e.g. by history). So I often do not use history at all. A reasonable way to not spend more than O(n) on history would be to keep the history counters as type float, but when scoring the moves use their exponent and high mantisse bits as unsigned int, to determine 64 'bins' along a logarithmic scale. For each bin you have a linked list, and moves with a history score in that bin are added to that list by setting two pointers (the start of the list and the successor of the new move). In addition you set a bit in an int64 corresponding to the bin you put a move in. When you binned all moves, you start to extract bits from this 'bin map' to quickly run through all populated bins, and then search all the moves in it. If the bins are narrow, sorting within a bin probably would not be very important.

mar · Post by **mar** » Sun Mar 19, 2017 11:13 pm

Well, I use staged movegenerator.
While it's true that shellsort kicks ass when n > x,
for me insertion sort is still a marginal win so I keep that.

As usual the only conclusion is measure and choose what's fastest for you. As simple as that

Anyway, move sorting is never the real bottleneck so unless you're really desperate there are better ways to gain elo.

Ras · Post by **Ras** » Sun Mar 19, 2017 11:32 pm

mar wrote:Anyway, move sorting is never the real bottleneck

Fair enough, but it feels like something "un-chessy" eating up time.

mar · Post by **mar** » Sun Mar 19, 2017 11:35 pm

Ras wrote:Fair enough, but it feels like something "un-chessy" eating up time.

I don't think evaluation function is "unchessy" unless we're talking beancounters here

mar · Post by **mar** » Sun Mar 19, 2017 11:46 pm

Don't get me wrong, there's more to it of course.

TT probes aren't cheap, I probe in qs which hurts raw nps quite a bit but it was a win for me (in terms of elo) so I kept it.
Also as you make search more clever it also drops raw nps because you add extra logic per node, but if you do it right you win a lot overall.

So in the end, it boils down to balancing all the stuff that's going on. What's good for perft may not be good for search in general, but as usual testing will show.

EDIT: faster movegen will never hurt of course, but making a fraction of total a fraction faster won't gain that much

Ras · Post by **Ras** » Mon Mar 20, 2017 12:30 am

mar wrote:So in the end, it boils down to balancing all the stuff that's going on. What's good for perft may not be good for search in general, but as usual testing will show.

True, but still, I think eval is chess-related. In fact, I like complex eval if that also emulates some strategy because my project is primarily designed to play humans, not computers. It's just that sorting things is somewhat overhead, and that should be minimised if possible.

I've done even uglier hacks than computed gotos to get more performance, not to mention 40% overclocking in the microcontroller to push the thing to its limits.

brtzsnr · Post by **brtzsnr** » Mon Mar 20, 2017 12:37 am

Ras wrote:[
- C standard library qsort with compare function pointers: 100%
- shellsort using the Cuira sequence: 123%
- shellsort using computed goto jump tables: 174%

If you are interested, I can post the code.

I use Cuira sequence, and Golang doesn't support jump tables, but I am interested to see how shellshort with jump tables looks like.

speed up or avoiding move sorting

speed up or avoiding move sorting

Re: speed up or avoiding move sorting

Re: speed up or avoiding move sorting

Re: speed up or avoiding move sorting

Re: speed up or avoiding move sorting

Re: speed up or avoiding move sorting

Re: speed up or avoiding move sorting

Re: speed up or avoiding move sorting

Re: speed up or avoiding move sorting

Re: speed up or avoiding move sorting