The mailbox trials

hgm · Post by **hgm** » Sun May 09, 2021 9:22 am

Yes, when a branch is easily predictable it is practically free, and you save the work of the stuff you skip by the branch. But not all branches are easily predictable, and mispredictions can cost you on the order of 40 instructions. So even 10% misprediction is not worth it for skipping just a few instructions.

I measured a definite speedup from replacing an if(victim != EMPTY) attackers[victim] |= bit; by an unconditional attackers[victim] |= bit; in the critical loop (AddMoves()). Using the unconditional in a loop that only went over on-empty squares (e.g. by bit-extraction, or having from-square-dependent move lists for the leapers) also proved slower. Fastest was just to unconditionally do the attackers[victim] |= bit for all nominal moves of the leaper, no matter whether they hit empty squares or go off-board. So that was a case where the branch could not be predicted very well.

The point is that the compiler cannot know this. It has no clue which branches cause an enormous mispredicion rate, and thus slowdown, and which are a nearly free way of saving work. It must make a blind gamble based on a statistical average of some benchmark suit. Besides; when I do something like

Code: Select all

set = A&B;
C = (set ? C : 0);

there seems very little excuse for introducing branches.

The reason I worry about this is that the times I get for some of the routines when profiling the code seems much longer than what should be achievable at 3-4 instructions per clock, considering what they have to do. And considering the experience with the critical loop, I suspect the branches, of which there were a great many in the code I used. But it is hard to test that, when the compiler subverts my code improvements by putting back branches where the high-level code asks for a conditional move. I might have to switch to coding in assembler directly.

Jakob Progsch · Post by **Jakob Progsch** » Sun May 09, 2021 4:31 pm

I tried forcing CMOV in my example code up there with:

Code: Select all

uint64_t cmov_ternary(uint64_t x, uint64_t y, uint64_t a, uint64_t b) {
    #ifdef __amd64
    asm(
        "cmp     %1, %2\n"
        "cmova   %0, %3"
        :  "+r" (b) : "r" (x), "r" (y), "r" (a): "cc");
    return b;
    #else
    return x>y ? a : b;
    #endif
}

uint64_t bishopdiagonal2_inline_ternary(uint64_t pos) {
    uint64_t x = (7-(pos&UINT64_C(7))) << 3u;
    uint64_t y = pos&UINT64_C(0x38);
    uint64_t diagonal = UINT64_C(0x0102040810204080);
    uint64_t a = diagonal << (x-y);
    uint64_t b = diagonal >> (y-x);
    return cmov_ternary(x, y, a, b); 
}

which does produce the right assembly without branches but confusingly the code is indeed slightly slower (pretty much one cycle slower or about 10%). Im sure if I still had an intel CPU and access to vtune that would maybe shed some light on this. But I found AMDs uprof mostly unusable so far. Of course that version also misses the small optimization of skipping the CMP operation and using one of the SUB operations to populate the flags for CMOV.

maksimKorzh · Post by **maksimKorzh** » Wed Jun 09, 2021 6:10 pm

Mr. Muller(anyone aware of it?) is the source code available in public? I've been reading through the thread but didn't yet find links to the source?

maksimKorzh · Post by **maksimKorzh** » Thu Jun 10, 2021 9:46 am

Just read through the entire thread... OMG...
Compiled and run the code from various development stages...
This is so so impressing!
Mr. Muller, thanks for sharing!

hgm · Post by **hgm** » Thu Jun 10, 2021 11:38 am

I just uploaded the version that so far was the fastest (8Mnps on my 2.2GHz i3 laptop), at http://hgm.nubati.net/mailbox7b.c .

One of these days I should continue this...

Kotlov · Post by **Kotlov** » Thu Jun 10, 2021 1:43 pm

hgm wrote: ↑Thu Jun 10, 2021 11:38 am I just uploaded the version that so far was the fastest (8Mnps on my 2.2GHz i3 laptop), at http://hgm.nubati.net/mailbox7b.c .

One of these days I should continue this...

11.6 Mnps on my AMD Ryzen 3 3200U

Code: Select all

1  -16       481 0 e2a6 b4c3 b2c3 e6d5
 2  -16      1722 0 e2a6 b4c3 b2c3 e6d5
 3  -27      4063 0 e2a6 e6d5 g2h3 b4c3 d2c3 d5e4
 4  -27     53332 0 e2a6 e6d5 g2h3 b4c3 d2c3 d5e4
 5  -37    128431 0 e2a6 e6d5 c3d5 b6d5 e4d5 e7e5
 6  -32    365100 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 7  -32    818090 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 8  -32   4057960 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 9  -96  10569123 0 d5e6 e7e6 e2a6 e6e5 g2h3 b4c3 d2c3 e5h5 f3f4
10  -82  34810485 0 e2a6 e6d5 e5g6 f7g6 c3b5 d5e4 f3g3 e8f8 g2h3 c7c5 g3g6 h8h3
11  -89  86359256 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 e5f7 b8b7 f7h8 g7h8 c2c4
12  -67 358894100 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 e5f7 b8b7 f7h8 g7h8 c2c4 d5b6 g2h3 b6c4
t = 30.944 sec
 358894100 nodes (11.6 Mnps)
 317153584 QS (88.4%)
  44051611 evals (13.9%)
  13678954 stand-pats (4.3%)
 110603905 leaves (34.9%)
   1584064 move gens
         0 map gens
 314736140 moves
  21889962 non-evasions
   1655045 futiles
   4393085 illegals
0.17 pins/move
captures: 84.6%

Great job, mr. HGM!
Waiting for a fully functional engine!

ZirconiumX · Post by **ZirconiumX** » Wed Jul 14, 2021 3:33 am

Perhaps I'm grave digging this thread a little bit, but I want to do two things.

First, thank you very much for this thread HGM, which has provided valuable insight into how to make the most of the board infrastructure in Dorpsgek, leading to a quadrupling of nodes per second.

I think the biggest piece of insight I gained from this thread was incremental evaluation and use of delta pruning to avoid costly makemoves where the opponent would immediately stand pat.

However, I think it would be a useful thought experiment to consider how to expand the evaluation a bit further. You've already talked about mobility, but I think two things to consider for the evaluation would be king safety and pawn structure.

To me, an incremental king safety implementation would be based on tropism: piece moves would be weighted based on proximity to the enemy king (to encourage attack) and pawn moves would be weighted based on proximity to friendly king (to encourage defense). This would be relatively simple to update incrementally, but there would have to be a slow path for a king move. I would assume those are rare in the search tree in practice though.

Pawn structure seems a bit harder to update incrementally though. My initial thought of an array that stores the rank for each file on the board struggles to deal with doubled pawns. It seems clear to me though that the solution should be ordered by file to detect isolation, and have easy access to rank information to detect passedness.

hgm · Post by **hgm** » Wed Jul 14, 2021 10:10 am

The standard solution for Pawn Structure is to use a hash table for it. Even with just 1MB you typically get hit rates around 95%. With such a high hit rate the time needed to calculate the Pawn-structure terms hardly impacts the total performance. Apart from a score the hash entry can also contain information that is helpful for for calculating terms for the relation between Pawns and other pieces. E.g. in KingSlayer I keep track of the value of the King Shield for 4 King locations (b1, g1, b8, g8), assuming that the bonus for squares adjacent to those 'nominal' King locations can be derived from that as well. It contains the (half-)open files to facilitate Rook evaluation, and the number of Pawns on light and dark squares to facilitate Bishop evaluation, the location of most advanced passer, and how many steps from promotion it is, etc.

So the Pawn-Structure evaluation typically involves incrementally updating the Pawn hash, and using it to access the table. Eval terms involving Pawn Structure can then be taken directly from the table, or be easily calculated from info in the table.

The material table, accessed through the incrementally updated material key, basically works the same way. Here the entry could store a centi-Pawn bonus (e.g. for the Bishop pair), flags indicating which players can null move, drawishness factors, pointers for special evaluation functions that should be called instead of the normal evaluation (e.g. for KPK, KBPK, KQKP). If you consider light and dark Bishops separate piece types in the key, it could also detect the case of unlike Bishops.

I don't think there is any shortcut for tropism; like you mention, it requires from-scratch recalculation of the piece-square sum when the King moves. This is still the basis for NNUE evaluations. But fortunately King moves are pretty rare in the tree; in most middle-game positions a King has only 2 moves. A work-around would be to draw pieces not to the actual King location, but to one of the nominal locations b8/g8, depending on to which the King is closest. That would also eliminate the problem of the King stepping to a8/h8 to get further away from the enemy.

I think there are many forms of King Safety around, so whether these could benefit from incremental techniques will vary. E.g. Fruit uses a system that determines which enemy pieces have moves that end adjacent to the King, (note that positions where the King is in check would never be evaluated), and then some non-linear formula to map that set to a score. The latter could also be implemented through a table lookup, indexed by a hash key derived from the set of pieces involved in the 'seige'. So the hard part here is to determine which pieces attack the King neighborhood.

A way to do this from scratch would be to run through the piece list, and test it for each piece individually. In a mailbox representation this test would be similar to a 0x88 attack test: you would have a table indexed by the relative position to the King, indicating through bit flags for each piece type whether that piece would be aligned properly to attack the 3x3 King neighborhood (rather than a single target square). Alignment would not be as rare as with an attack test, but still rare enough. (E.g. Pawns would almost never attack the King neighborhood.) Slider alignment would have to be verified with a ray scan.

The alignment tests could be done incrementally, though; a move with a non-royal could only change the alignment of that non-royal, and that of the piece it captures. So you could keep a set of 32 bit flags, one for each piece, indicating whether that piece is present and aligned. If a piece is moved, you would have to determine its new alignment. If a piece is captured its seige-alignment flag would be unconditionally cleared. The remainder of the test could extract the the aligned sliders (hopefully few), to see whether their seige is blocked or not. For leapers alignment implies attack. You could compare that to the seige status in the parent node. Only if it changed you would have to update the seige key for the participation of the piece, and lookup the new seige bonus using that new key; otherwise you would just copy the old bonus. Only when the King moves you would have to redo the calculation entirely from scratch (again...).

tromp · Post by **tromp** » Mon Jul 19, 2021 9:33 am

> Even with just 1MB you typically get hit rates around 95%.

How do you deal with hash collisions in this table?
Do you use something like a 64-bit pawn structure Zobrist hash?

hgm · Post by **hgm** » Mon Jul 19, 2021 10:50 am

The Pawn key is usually kept in exactly the same way as the main hash key as a 64-bit integer. The only difference is that all the Zobrist keys for non-Pawns are zero for the Pawn key. Like with the TT, it is usually sufficient to keep a 32-bit signature in the table entry, and use the other 32-bit for deriving the table index. Then you would only have a collision once every 4 billion probes, on average.

An always-replace table is probably good enough; since the Pawn hash is basically an evaluation cache, depth is not a meaningful parameter. For a more advanced replacement scheme you could use ' least-recently used'. Pawn-hash entries are typically larger than TT entries (I think KingSlayer uses 32-byte entries), so for efficient caching you would probably not want to have more than 2 slots per bucket. But you could still put a flag in the bucket to indicate which of the two entries was used modt recently, and then replace the other.

The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials