The mailbox trials

Joost Buijs · Post by **Joost Buijs** » Thu Apr 08, 2021 10:09 am

hgm wrote: ↑Wed Apr 07, 2021 4:21 pm [Edit] I uploaded (the still very messy) source code for this test to http://hgm.nubati.net/mailbox4.c . I am not sure whether it would compile on any compiler, because I had to embed some assembly code to be able to use the bit-scan forward instruction. That would probably have to be adapted to a 64-bit version when compiled for x64 architecture. And I am not sure the GNU format for assembly is understood by all compilers.

The uploaded source contains a very nasty bug in the Discover routine. You use the local variable bit without being initialized. As far as I know locals are not initialized at 0 by default. Probably you meant bit = BSF(incoming); ???

[Edit] Sorry, I already see the problem, you use bit in the BSF macro, typically something I would never write this way.

Code: Select all

                            55555555         aaaaaaaa
              presence: @@@@@@@@@@@@@@@@ @@@@@@@@@@@@@@@@
                        PPPPPPPPNNBBRRQK ppppppppnnbbrrqk
16. 000e0204    P@d5    .@.......@:::::. ....@...@@:::::.
17. 100c0000    P@e4    .........@::::@. .........@:::::.
18. 01040000    P@a2    .........@::@::. ..........:::::.
19. 00000000    P@b2    ..........:::::. ..........:::::.
20. 00000000    P@c2    ..........:::::. ..........:::::.
21. 50000000    P@f2    ..........::::@@ ..........:::::.
22. 10008000    P@g2    ..........::::@. .......@..:::::.
23. 04000000    P@h2    ..........:::@:. ..........:::::.
24. 00000000    N@e5    ..........:::::. ..........:::::.
25. 10102040    N@c3    ...@......@:::@. ......@...:::::.
26. 40000000    B@d2    ..........:::::@ ..........:::::.
27. 50840000    B@e2    .........@::::@@ ..........:@:::.
28. 00000000    R@a1    ..........:::::. ..........:::::.
29. 00000000    R@h1    ..........:::::. ..........:::::.
30. 00411000    Q@f3    ......@.@.:@:::. ..........:::::.
31. 05100000    K@e1    ..........@:@@:. ..........:::::.
32. 02000000    p@a7    ..........:::::. ..........::@::.
33. 00000000    p@c7    ..........:::::. ..........:::::.
34. a00b0000    p@d7    ........@.:::::. ........@@::::@@
35. a0010000    p@f7    ........@.:::::. ..........::::@@
36. 200000a1    p@e6    @.........:::::. ..@@......::::@.
37. 00010080    p@g6    ........@.:::::. ...@......:::::.
38. 20000000    p@b4    ..........:::::. ..........::::@.
39. 18001000    p@h3    ......@...::::@. ..........:::@:.
40. 0000000a    n@b6    ..........:::::. @@........:::::.
41. 30200000    n@f6    ..........::::@. ..........@:::@.
42. 00000000    b@g7    ..........:::::. ..........:::::.
43. 00400000    b@a6    ..........:@:::. ..........:::::.
44. 00020000    r@a8    ..........:::::. ........@.:::::.
45. 00200000    r@h8    ..........:::::. ..........@::::.
46. 80000000    q@e7    ..........:::::. ..........:::::@
47. 2a080000    k@e8    ..........:::::. .........@::@@@.
 1  -16       532 0 e2a6 b4c3 b2c3 e6d5
 2  -16      1859 0 e2a6 b4c3 b2c3 e6d5
 3  -27      4421 0 e2a6 e6d5 g2h3 b4c3 d2c3 d5e4
 4  -27     59130 0 e2a6 e6d5 g2h3 b4c3 d2c3 d5e4
 5  -37    142018 0 e2a6 e6d5 c3d5 b6d5 e4d5 e7e5
 6  -32    427596 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 7  -32    956057 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 8  -32   4664824 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 9  -96  12236440 0 d5e6 e7e6 e2a6 e6e5 g2h3 b4c3 d2c3 e5h5 f3f4
10  -82  41679702 0 e2a6 e6d5 e5g6 f7g6 c3b5 d5e4 f3g3 e8f8 g2h3 c7c5 g3g6 h8h3
t =  3.089 sec
  41679702 nodes (13.5 Mnps)
  37242403 QS (89.4%)
   5510077 evals (14.8%)
   1893393 stand-pats (5.1%)
  14445410 leaves (38.8%)
    134111 move gens
captures: 89.8%

D:\Mailbox4\x64\Release\Mailbox4.exe (process 2752) exited with code 0.
Press any key to close this window . . .

This is what I get on my i9-10980XE compiled with CLang11

hgm · Post by **hgm** » Thu Apr 08, 2021 2:41 pm

Thanks for compiling. 13.5Mnps is pretty good, 1.7 times faster than my machine & compile. Was this a 64-bit compile? (Not that this should matter much, as I don't use any 64-bit variables other than the hash key.)

Indeed I did mean "int bit = BSF(incoming);", and in the pseudo-code examples I posted before I wrote it that way. But when I tried to actually implement it, I didn't see how I could keep it that way. The GNU syntax for assembler embedding does not support returning the result as if it were a function; you just have to mention the C variable to where the result must go. I suppose that to enhance readability I could have done

#define BSF(A, OP, B) asm(" bsf %1, %0\n" : "=r" (A) : "r" (B))

and then write in the code

BSF(bit, =, incoming);

but that seems hardly an improvement. As it was I chose the typographically easiest solution; just replace the = before BSF everywhere in my code by a semicolon.

BTW, I implemented a legality test now, and this gives a bit of a speedup (5.21 -> 4.70 sec, so about 10%). Before exposure of the King wouldonly be recognize in the child node, where the capture-generation code starts testing for attacks on the opponent King. And when there are any, of course cutoff with +INF score rather than search it.

But since the attack map makes this such a trivial test, there was no reason to do it already in MakeMove(), at the earliest opportunity during the update of the attack map. (Which is really the place that capture generation for the child moved to.) So I put it after the preliminary update now (recapture by former protectors of the victim, and enemy slider discovery). This brings the attackers set of the moving player's pieces up to date, so the exposure of his King can be tested. And if it is exposed the MakeMove() can be aborted with a -INF core (parent POV) before the more-expensive updating of the attacks on the opponent's pieces is done.

Failure to resolve a pre-existing check can be tested cheaply even without preliminary update of the map. At the start of a node I now save the attackers set of the stm's King, ANDed with the presence mask of the opponent's pieces, as inCheck variable. After a capture MakeMove then tests whether that variable is non-zero. And if it is, it test whether it still is non-zero after removing the bit for the captured piece (which is needed in the update anyway) from it, or whether the moved piece was a King. If neither of that is the case, the capture is illegal. This now has become the first thing that is done for making and searching a capture, even before updating the pstEval and testing for futility, because it is the cheapest test of all:

Code: Select all

  u->victimMask = attacker2bit[u->victim-16];               // save for unmake
  if(u->inCheck && u->inCheck & ~u->victimMask              // we were in check, and did not capture checker,
                && u->piece != COLOR + 15 - stm) return 0;  // nor moved the King => move is illegal

The u->inCheck variable is now also used to suppress null-move search. And because this, together with the test after preliminary update, which detects moving into check or moving of pinned pieces, prevents that Search() can be called with an illegal move, the testing for the MVV there can now start with Queen.

Of course this doesn't work miracles; there just aren't enough nodes where you are in check for that, or enough illegal moves. But it still cut 10% of the time. Of course nps went down because of it, because all the moves where the King could be captured are no longer inflating the count. And those nodes were pretty cheap to handle, as they were of course always cut-nodes.

Joost Buijs · Post by **Joost Buijs** » Thu Apr 08, 2021 5:05 pm

hgm wrote: ↑Thu Apr 08, 2021 2:41 pm Thanks for compiling. 13.5Mnps is pretty good, 1.7 times faster than my machine & compile. Was this a 64-bit compile? (Not that this should matter much, as I don't use any 64-bit variables other than the hash key.)

It is indeed a 64 bit compile. The speed difference is probably caused by a combination of the compiler and somewhat faster hardware. When using a single core the i9-10980XE seems to boost to 4.6 GHz. Personally I think it is less.

I thought that you are using a pretty old 32 bit compiler, however that it doesn't support intrinsics for BSF or TZCNT seems very unlikely, if it doesn't it must be really very old.

hgm · Post by **hgm** » Thu Apr 08, 2021 5:31 pm

Well, perhaps it does, and this is just my ignorance. I never used any intrinsics before, so I would not know how to do it even on a compiler that supports them. My approach has always been to just hand-edit the assembler generated by the compiler. I am using gcc 3.4.4. It says (C) 2004.

Joost Buijs · Post by **Joost Buijs** » Thu Apr 08, 2021 7:32 pm

hgm wrote: ↑Thu Apr 08, 2021 5:31 pm Well, perhaps it does, and this is just my ignorance. I never used any intrinsics before, so I would not know how to do it even on a compiler that supports them. My approach has always been to just hand-edit the assembler generated by the compiler. I am using gcc 3.4.4. It says (C) 2004.

GCC 3.4.4 is 17 years old. I think the function you need is __builtin_ctz(x). The documentation is still here: https://gcc.gnu.org/onlinedocs/gcc-3.4. ... fclzl-2143

hgm · Post by **hgm** » Fri Apr 09, 2021 12:30 am

OK, thanks. I will have a look at it.

I have a version that runs in 4.40 sec now:

Code: Select all

 1  -16       372 0 e2a6 b4c3 b2c3 e6d5
 2  -16      1314 0 e2a6 b4c3 b2c3 e6d5
 3  -27      3126 0 e2a6 e6d5 g2h3 b4c3 d2c3 d5e4
 4  -27     40932 0 e2a6 e6d5 g2h3 b4c3 d2c3 d5e4
 5  -37     97493 0 e2a6 e6d5 c3d5 b6d5 e4d5 e7e5
 6  -32    277881 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 7  -32    623552 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 8  -32   3060559 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 9 -101   7561037 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 e5f7 b8b7 f7h8 g7h8 g2h3 f6e4
10  -82  26016106 0 e2a6 e6d5 e5g6 f7g6 c3b5 d5e4 f3g3 e8f8 g2h3 c7c5 g3g6 h8h3
t =  4.440 sec
  26016106 nodes (5.9 Mnps)
  22704635 QS (87.3%)
   4506472 evals (19.8%)
   1583973 stand-pats (7.0%)
  10303563 leaves (45.4%)
    121425 move gens
captures: 87.9%

The score at 9 ply is different though, which means there could still be a bug. Anyway, the speed improvement is a bit disappointing. What I did was moving the search for the MVV (running top-down through the piece list) basically to the parent node, so that the work can be shared by all children. And can be used to prune those when capturing the MVV is futile. So that the attack map doesn't have to be fully updated for those nodes.

The whole algorithm is a bit cumbersome. The parent node keeps track of the highest piece of the stm that is under attack, or, when this is not yet known, the lowest piece that is certainly not under attack. When a capture from that node is going to be searched, it first does a relatively cheap update of the opponent moves in the attack map (recaptures and pin punishment), and determines what is the highest victim of those. If the parent node cannot exclude that all higher victims were not already under attack, that would be the MVV for the child. If it cannot exclude that, it starts examing other pieces between the lowest non-attacked and the highest newly attacked, to expand the range of 'excluded victims' in the parent node. For this it has to use the old attack map, which fortunately is still around. (The advantage of copy-make!)

If it does find a pre-existing attack there, it records it in the parent node, and this will then remain such for all subsequent moves. Thus all non-attacked pieces above the highest attacked one in the parent node have to be examined only once. Which is otherwise what one of the children would have done. There is never any reason to search below the highest newly attacked piece of the move that is going to be searched, though.

But if we do find a pre-existing attack that tops the highest attack created by the current move, it could no longer be valid after that move. (E.g. the attacker could be captured, or the victim could have moved away.) So we then have to continue the search through the piece list starting from the highest pre-existing attack downward for a piece that has attacks in the semi-updated attack map, until we reach the highest newly created attack. This work cannot be shared between siblings, as each sibbling would get a different attack map. But there still is nothing done that would not have been done in the children if they were reponsible for finding the MVV on their own. It is just that the pieces that can be ruled out because they were not even attacked in the parent (and no new attacks were created on them) don't have to be traversed in each sibbling. And the remaining, node-specific part of looking for a surviving atack amongst the pieces that already were attacked has been moved to before doing the most-expensive part of the attack-map update, so that we can skip the latter if no suitable attacks are found at all.

Yet there isn't very much speedup. This makes me wonder wether I am 'barking up the wrong tree'; the remaining time might not be spend on the things I am trying to optimize now. Perhaps the non-captures (which have gone up to 13%) are already dominating the execution time. These are still doing the stupid from-scratch generation of the attack map. The problem here is that for a non-capture the piece lands on an empty square, for which the map holds no info. (Because it records attacks on pieces, not squares.)

But of course that problem is not insurmountable. The neighbor table also had that problem, and it could be solved by 4 ray scans of the board to find neighbors of an unoccupied square. After that a conventional IsSquareAttacked() routine, looking at the 8 neighbors through the neighbor table, and the two Knights in the piece list, could collect the attacks on that square, to add them to the map. That should be very much faster than creating a map from scratch, even when the IsSquareAttacked() is roughly twice as expensive as the generation of moves of a typical given piece.

hgm · Post by **hgm** » Fri Apr 09, 2021 11:49 am

Finding the MVV

The attack map makes it easy to see whether a piece is attacked, by just consulting its attackers[] element. But the capture generation still has to loop through the piece list top-down for finding pieces that are attacked. Some time could be saved by somehow skipping the test on pieces that are no longer present (e.g. by having a second presence mask from which pieces can be extracted in MVV order). But the main problem is not that there are many pieces that are not present, but that, as you get closer to the leaves of QS, most pieces are no longer attacked.

The code I have now for finding the highest attacked piece looks absolutely awful, like this:

Code: Select all

    int gap = u->alpha - MARGIN - 10 - u->pstEval; // futility gap child will have to bridge with a capture
    if(newMVV < u->threat) {                // newly created exposures might not be the worst
      int attacks, xstm = COLOR - stm;
      int worst = xstm - 1;                 // worst threat (this value is a kludge to represent 'none')
      if(!u->threatVal) {                   // pre-existing threat not yet found
        attackers -= 32;                    // switch back to the map before the move to do the deferred threat detection
        switch(u->threat & 15) {            // skip part of unrolled search loop already done before
          case 14: // Queen
            if(gap > QVAL) break;
            attacks = attackers[xstm+14];
            if(attacks & u->oldPresence) { u->threatVal = QVAL; worst = u->threat = xstm + 14; break; } 
            u->threat = xstm + 13;    // next untested
            if(newMVV >= xstm + 12) break;
          case 13: // Rooks
            if(gap > RVAL) break;
            attacks = attackers[xstm+13] | attackers[xstm+12];
            if(attacks & u->oldPresence) { u->threatVal = RVAL; worst = u->threat = xstm + 13; break; } 
            u->threat = xstm + 11;
            if(newMVV >= xstm + 8) break;
          case 11: // minors
            if(gap > BVAL) break;
            attacks = attackers[xstm+11] | attackers[xstm+10]
                    | attackers[xstm+9]  | attackers[xstm+8];
            if(attacks & u->oldPresence) { u->threatVal = BVAL; worst = u->threat = xstm + 11; break; }
            u->threat = xstm + 7;
            if(newMVV >= xstm) break;
          case 7: // Pawns
            if(gap > PVAL) break;
            attacks = attackers[xstm+7] | attackers[xstm+6]
                    | attackers[xstm+5] | attackers[xstm+4]
                    | attackers[xstm+3] | attackers[xstm+2]
                    | attackers[xstm+1] | attackers[xstm];
            if(attacks & u->oldPresence) { u->threatVal = PVAL; worst = u->threat = xstm + 7; break; }
            // if we get here, nothing was attacked!
            u->threatVal = 1; worst = u->threat = xstm - 1;
        }
        attackers += 32;       // deferred threat detection done for now; continue with updated map
if(PATH) printf("%2d     newMVV=%d threat=%d val=%d worst=%d\n", ply, newMVV, u->threat, u->threatVal, worst), fflush(stdout);
      } else worst = u->threat; // pre-existing gravest threat already known, just use it
[code]
Basically this is two unrolled nested loops, an outer lop over values group and an inner loop over pieces within the value group. For that inner loop I don't even bother to test for a 'partial result', but just OR all attackers sets together and then test if there were any attacks on the group as a whole. That seemed faster than testing them one at the time (which would require an AND operation with the presence mask and a branch for each of them), although for the Pawn value group that could be debatable. But even then, the code is splattered with branches. Because there are so many conditions for exiting it: you could have reached non-futile victims, you could have found an attacked piece, you could have reached a value group that you already know to be attacked by a newly created capture. At least entering the loop is done through a switch, which skips the tests that were already done before (on a previous move in the same node). But even if that is implemented by an indirect jump through a jump table (which I am not sure the compiler will do), it is an almost certain branch misprediction.

How can we do this better? The newly created captures emerge as a 'victim bitmap', where each bit represents a victim (or value group); the process of discovering enemy sliders at the from-square sets these bits, and the recapture is easily added to it too. It would also be possible to record in a bitmap of similar format which pieces have a pre-existing attack on them, and in yet another bitmap which pieces do not have a pre-existing attack on them. (And for pieces not in either map we then would not know whether they are attacked or not.) If we also had a bitmap to indicate which victims were non-futile, we could just do some Boolean logic on these bitmaps, to be left with the pieces that need to be tested for attackers. And then extract the bits from those in MVV order to do the actual tests. That would be a branch-poor way of doing things.

Which victims are futile is determined by the gap between the current (estimated) eval and alpha:

gap = alpha - MARGIN - estimatedEval;

Victims with a piece value > gap are 'non-futile'. Of course we could get these from a table indexed by score, int nonFutile[gap]. The table would have to be big, though, (essentially running from -INF to +INF). This would waste a lot of L1 cache space, which is also not good for performance. Of course we could reduce resolution, and take (say) nonFutile[gap>>4], effectively rounding the scores to multiples of 16 before making the decision. And large parts of the table, namely those for gap < 0 or for gap > QVAL are not very interesting; they have all pieces non-futile, or none at all, respectively. We could test for that and then handle the cases separately, and only use a table for scores between 0 and 1024 (assuming QVAL < 1024). But then we have branches again. And 1024 ints is still 4KB.

To save space we can use a two-step approach:
[code]unsigned char offset[128];
int nonFutile[5*8]; // the actual victim sets, 8x all, 8xNBRQK, 8xRQK, 8xQK, 8x none
mask = nonFutile[offset[gap >> 3 & 127] + (gap & 7)];

So basically the nonFutile[] table is indexed by the low-order 3 bits of gap (gap & 7), but starting at a certain offset. For gaps far away from any piece value, this offset would be 0, 8, 16, 24 or 32, the start of stretches of identical victim sets. Only close to a piece value the offset would be such that within the range determined by (gap & 7) two different victim sets can be seen. E.g. for offset[62], which would be used for gap = 62*8 (= 496) to 503, we would want to see 4x RQK (for gap = 496-499) and 4 times QK (for gap >= 500(=RVAL)-503). So offset[62] = 12.

This would reduce the memory requirement to 5*8*4 = 160 bytes for nonFutile, and 128 for offset[], 288 bytes in total, which seems acceptible. We have not dealt with the out-of-range problem, though: for gap >= 1024 or gap < 0 the & 127 in the offset[] access protects us from out-of-bound accesses, but it returns faulty values. These really should have used offset[127]=32 and offset[0]=0, respectively, rather than what (gap >> 3 & 127) calculated. But this can be achieved in a branchless way with conditional move (CMOV) instructions, the compiler would presumably use for:

Code: Select all

int index = gap >> 3;
index = (gap < 0 ? 0 : index);
index (gap > 1023 ? 127 : index);
mask = nonFutile[offset[index] + (gap & 3)];

With this mask, we can then decide which pieces we would have to test

Code: Select all

int worseThreats = newThreats - 1 & ~newThreats;
worseThreats &= mask; // not interested in non-futile victims
int todo = worseThreats & undetectedThreats; // potential worse threats of interest
while(todo) { // update parent-node's knowledge on pre-existing attacks
  int bitnr = BSF(todo); // nr of trailing zeros
  int victim = bitnr2victim[bit];
  int bitMask = 1 << bitnr;
  if((attackers-32)[victim] & oldPresence) { // test for pre-existing attack using old map (hence the -32)
    existingThreat = bitMask; // we found one!
    undetectedThreats = 0; // not interested in more
    break;
  }
  undetectedTheats -= bitMask;
  todo -= bitMask;
}
potentialThreats = existingThreat | undetectedThreats;
todo = worseThreats & potentialThreats; // mask away confirmed non-attacked pieces
while(todo) { // verify whether pre-existing attacks are still valid after this move
  int bitnr = BSF(todo); // nr of trailing zeros
  int victim = bitnr2victim[bit];
  int bitMask = 1 << bitnr;
  if(attackers[victim] & presence[stm]) { // test based on (partially) updated map
    break; // when attacked piece found, leave its bit in todo set
  }
  todo -= bitMask;
}
todo |= newThreats; // add the newly created attacks
if(!todo) goto overkill; // when no victims found, abort MakeMove for overkill pruning
mvv = bitnr2victim[BSF(todo)]; // otherwise pass this mvv to recursive Search() call.

The first while-loop represents the code shown earlier. Far fewer branches now.

hgm · Post by **hgm** » Fri Apr 09, 2021 2:10 pm

I am still a bit in doubt here, on what to do when the pre-existing threat is no longer valid. We would have to keep searching for a victim amongst the remaining members of the 'worseThreats' victim set. One that is attacked after the move. Which is a different test than for finding a pre-existing threat: it must exclude attacks by the piece that we are going to capture (by using the updated presence mask) and attacks on the piece that was moved (by using the updated attack map). 'newThreats' would already contain the recapture of the latter, if this is possible. If it is attacked, we know for sure that attack must have been pre-existing. (As worseThreats doesn't contain any of the newly created attacks.) But we are really interested in knowing the opposit: if there was no pre-existing attack, we could remove it from existingThreat (which in the code above would include all pieces below the highest attacked one). But not being attacked after the move doesn't prove that, as the move could have evaded the pre-existing attack.

So perhaps it would be better to keep using the old attack map (from before the move) for finding the MVV. On negative test we could then remove it from existingThreats, so that other moves from the same node would automatically skip it. When it tests positive, though, we still would have to redo the test in the situation after the move to verify it as MVV. So we couldn't find an MVV without doing two tests, while a single one in principle would have sufficed. This is an investment we might never earn back, as later moves might not ever require this info. (There might be no later moves, or the current move might give a cutoff, or the later move might all have newly created retaliations against pieces higher than this one, or the higher pre-existing threat was not evaded on those...)

We could exclude the moved piece from being tested by removing it from the victim set before the loop, though. Then the amount of overhead for making the dual test could become very low, and would probably be worth it:

Code: Select all

// initialization in parent:
int undetectedThreats = 0xFFFFFFFF; // or load from an incrementally updated victimPresence[stm] bitmap
int existingThreats = 0;            // nothing found so far

// in MakeMove():
int worseThreats = newThreats - 1 & ~newThreats; // newThreats was obtained in the partial map update
worseThreats &= mask; // weed out non-futile victims
int todo = worseThreats & undetectedThreats; // potential worse threats of interest
todo &= ~bitnr2victim[u->piece];             // pre-existing attacks on the moved piece will have been evaded
while(todo) {                                // update parent-node's knowledge on pre-existing attacks
  int bitnr = BSF(todo);                     // nr of trailing zeros
  int victim = bitnr2victim[bit];
  int bitMask = 1 << bitnr;
  int attacks = (attackers-32)[victim];       // attack map before move (hence -32)
  if(attacks & oldPresence) {                 // there was a pre-existing attack
    existingThreat += bitMask;                // mark it
    undetectedThreats = -= bitMask;           // and mark it as detected
    // ascertain it was not evaded
    if(attacks & presence[stm])               // test without the captured piece as valid attacker
       break;                                 // this is our MVV; stop looking!
  }
  todo -= bitMask;
}
todo |= newThreats;            // add the newly created attacks (in case todo ended empty)
if(!todo) goto overkill;       // when no victims found, abort MakeMove for overkill pruning
mvv = bitnr2victim[BSF(todo)]; // otherwise pass this mvv to recursive Search() call.

[Edit]Since these victim sets will live in the UndoInfo struct that is passed around between Search(), MakMove(), UnMake() and the Search() of the child node, the latter can continue providing knowledge of pre-existing attacks, when it is in the process of searching the captures. At the end of the code in the previous posting, todo & mask will contain the set of all non-futile victims that could be under attack. So the capture-generation process in the child would just have to loop over the victims indicated in that set. It could recognize pieces in 'todo' that are not in newThreats or existingThreats, and do the test for pre-existence (updating existingThreads and clearing it from undetectedThreads) before deciding whether it has to be searched now. Its sibblings could then benefit from that info, by having fewer non-zero bits in their 'todo' victim sets.

hgm · Post by **hgm** » Sat Apr 10, 2021 10:47 am

Although I haven't been able to exploit this yet to achieve further speedup, I want to dwell on this some more. Because I have the feeling I am really on to something.

Moving the code for finding the MVV to the parent node / makemove was originally inspired by the wish to avoid the expensive attack-map update when there is no MVV, or when its capture would be futile. So I did not do it on null move, where the attack map doesn't need any updating at all. But since this method has the additional advantage of being able to share information needed to find the MVV between its children, it could be beneficial to have the null-move search contribute to MVV identification too.

To recapitulate: we pass an argument to Search() (as a member of the UndoInfo struct) that specifies which pieces it should consider capturing. This has the format of a 'victim set', where each of the 32 pieces is represented by a bit, in a way that would extract the bits in order of decreasing value. This should help speeding up its capture generation: if, in the parent, we have figured out that only Pawns are attacked (and a Pawn still is a worthwhile gain), the capture generation in the child can refrain from wasting time on testing whether any of the non-Pawns is attacked. When the parent has no clue as to what is attacked, it simply passes a victim set with all bits set (or at least all bits representing pieces of the right color that are still on the board, info we assume will always be available). Then the child will try to capture all of them, like it would when we didn't have this mechanism.

To sensibly restrict the pieces the children have to target, the parent keeps track of pre-existing attacks on pieces of the side to move ('threats'), also in the format of victim sets. It distinguishes existingThreats and undetectedThreats, where the latter indicates pieces the pieces for which we haven't tested yet whether they were attacked or not. Initially (when entering the node) the set of existingThreats is empty, and every present piece is in the set of undetectedThreats. When we are going to search captures, we will be forced to examine some of the pieces for being attacked, in order to determine the MVV for the child. When we do this, the piece will be removed from the undetectedThreats set, and, depending on whether it was indeed attacked, added to existingThreats.

But we could also have a child return the info of which pieces it succeeded to generate captures on. At the stage where we do null move, all pieces would still be in the undetectedThreats set. If we don't do null move, it will stay that way. If the null move fails high, we are done with the node. But if the null move fails low, the child will have searched some moves, and can report on that in a 'confirmedVictims' victim set. All pieces in that set can then be removed from the undetectedThreats, and added to the existingThreats of the parent. E.g. suppose that current eval is 50cP above beta. So we try null move, but one of our Pawns was hanging, and the null move is refuted by its capture. None of our other pieces was under attack, so the child couldn't generate any captures on them, and the confirmedVictims set it return will only contain that Pawn. So the Pawn goes into existingThreats and out of undetectedThreats. But more importantly, we know that all pieces in the value groups above the Pawns (i.e. NBRQK) are not under attack, and can be removed from undetectedThrats as well. Otherwise the child would have tried those first, and they would be in confirmedVictims. We cannot be sure of other Pawns, though; these might still have attacks on them by higher pieces, so that they came later in the LVA ordering than the capture that refuted the null move. This means none of the captures we must search next has to consider non-Pawns as MVV (unless the capture specifically created a new threat on any of those, which we would know).

After a refuted capture we should be a bit more careful how to interpret the confirmedVictims it reports. But in any case, confirmedVictims that did not have a newly created attack ('newThreats') on them must have had a pre-existing attack on them. So these can be cleared from undetectedThreats and moved to existingThreats. The pieces in the higher value groups than the lowest confirmedVictim that were not confirmedVictims themself apparently were not attacked. But this could have been an effect of the capture that led to the child: it could have captured their attacker, or they could have been moved away. The latter is easy to remedy: just don't draw any conclusions on the moved piece. (Like you cannot draw any conclusions on pre-existing attacks for pieces in newThreats.) There is no shortcut to figure out whether a piece was attacked by the captured piece, though. So the information we can distill from this is much less than what we can get from the null move.

hgm · Post by **hgm** » Sat Apr 10, 2021 3:01 pm

I moved the stuff to Linux, so I could do profiling to see where the time goes. (Un Cygwin profiling never worked for me: it always lists all functions as using 0% of the time!) This gave the following list:

Code: Select all

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ns/call  ns/call  name    
 38.53      1.61     1.61 141169697   11.41    11.41  AddMoves
 31.47      2.93     1.32 32820332    40.08    65.34  SearchCapture
 13.16      3.48     0.55 21424467    25.68    93.67  Search
  4.31      3.66     0.18 30577986     5.89     8.97  PreUpdate
  3.83      3.82     0.16 52002443     3.08     3.08  Discover
  2.39      3.92     0.10  3573624    27.99   341.86  MapFromScratch
  1.68      3.99     0.07  3573688    19.59    19.59  Occupy
  1.56      4.05     0.07  3584257    18.14   380.64  MakeMove
  1.44      4.11     0.06   126616   474.02   474.02  GenNoncapts
  0.96      4.15     0.04  3573624    11.20    11.20  UnMake
  0.48      4.17     0.02                             PrintMaps
  0.24      4.18     0.01                             HalfMapFromScratch
  0.00      4.18     0.00       89     0.00     0.00  Move2text
  0.00      4.18     0.00       66     0.00     0.00  PrintRow
  0.00      4.18     0.00        2     0.00     0.00  InitNeighbor

AddMoves() is the routine that generates captures (including 'friendly fire') by a given piece, and records those in the attack map. It is called twice from SearchCapture(), for the incremental update of that map after a capture. (To remove the moves from the old location of the moved piece, and add those in its new location.) And it is called for every piece on the board in MapFromScratch(), which updates the map after a non-capture. The call graph shows that this takes on average 27.5 calls to AddMoves(), which sounds reasonable (as we start with 32 pieces). That still makes MapFromScratch() responsible for 70% of the calls to AddMoves(), despite the fact that there are 9.2 times as many captures as non-captures.

So it is obvious that most gains can be expected from also making the map update after a non-capture incremental, rather thangenrating the map from scratch. Then it would also require only 2 calls to AddMoves(), like for the captures,reducing the time spent in that by a factor 3.

Another thing I did was put in some counters to see how often the loop for discovery of enemy slider moves has to run in the incremental update. This confirmed my intuition: on average, the only 0.12 enemy sider moves are dicovered per evacuated from-square. So the preliminary update of the attack map, just for enemy moves, is really fast: concluding there were no enemy sliders in the attackers set of the mover, assigning the attackers set of the victim to the mover (minus its own attack), clearing that of the victim, and removing the victim from the presence mask. The expensive part is updating the moves of the moving player; this requires calling of AddMoves().

The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials