OK, progress! (Finally.)
I got the new attack map running with incremental update. I implemented this in steps: first using from-scratch generation of a new map in every node, using the AddMoves(piece, sqr) routines for all pieces not marked as captured in the piece list, in ze freshly zeroed memory area. This mainly to test the new code for map-based capture generation. (Lots of bugs there; the worst that to merge attackers sets for black I use right-shifts, and attackers[] was not defined as unsigned...) After debugging I got it to the point where it gave the same PVs and scores as the neighbor-table based version we tested before, which (for me) runs at 6 Mnps. As expected, the new version was much slower, as from-scratch map generation defeats the purpose. Especially since it generates captures for both players, rather than just for the side to move. Still, it was running at 3.8 Mnps.
I changed the test case to a 10-ply search, to measure timing more accurately. The new reference (with the old code) is now:
Code: Select all
 1  -16       718 0 e2a6 b4c3 b2c3 e6d5
 2  -16      8338 0 e2a6 b4c3 b2c3 e6d5
 3  -27     11701 0 e2a6 e6d5 g2h3 b4c3 d2c3 d5e4
 4  -27     69990 0 e2a6 e6d5 g2h3 b4c3 d2c3 d5e4
 5  -37    147413 0 e2a6 e6d5 c3d5 b6d5 e4d5 e7e5
 6  -32    404904 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 7  -32    850164 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 8  -32   4723911 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 9  -96  12878122 0 d5e6 e7e6 e2a6 e6e5 g2h3 b4c3 d2c3 e5h5 f3f4
10  -82  43774990 0 e2a6 e6d5 e5g6 f7g6 c3b5 e8d8 g2h3 d5e4 f3g3 c7c5 g3g6 h8h3
t =  7.230 sec
  43774990 nodes (6.1 Mnps)
  39674278 QS (90.6%)
   6400416 evals (16.1%)
   2071938 stand-pats (5.2%)
  12773244 leaves (32.2%)
  38515424 move gens
captures: 90.8%
Umm, this is a tougher case, as it is less dominated by QS and captures than the 9-ply search. I did not make any effort to further speed up the non-captures; on the contrary, these now all call the from-scratch attack-map generation in MakeMove(), i.e. the 3.8 Mnps method, in all subsequent tests. So they are significantly slower than before.
Firs I played a bit with the method for clearing the memory area where a new new attack map would be generated. Initially I just used my own loop assigning 0 to all 32 attack-map elements. Then I tried memset(). This wasn't much faster: 3.9 Mnps. Then I tried memcpy, copying the zeros from a zero-initialized fixed memory buffer, to test how much slower memcpy() was compared to memset(). Surprise: no measurable difference! In fact intentionally doubling or quadrupling the amount of copied memory had only a minute effect on speed. Copying is nearly free! That is good news: it means we don't have to worry about restoring the attack map on UnMake at all: just do copy-make.
First I did another test, however: only generating the half of the map we need. I.e. the attacks by the side to move. That brought us up to 5.1 Mnps. Problem is that in this case we have to generate a new map for null moves too. But generating a half-map on a copied memory areas of zeros has to run AddMoves() for all 16 pieces of the stm. While incremental update only has to run it for two pieces. (Actually the same piece, in the old and new location.) So doing that on a copy of the previous map must be faster. The incremental update is complex, and error prone, so first I just added it to the version that did map generation from scratch, to compare it with the incrementally updated one. After much debugging, it could run the complete search without detecting any differences. But disastrously slow, of course: 2.6 Mnps.
But then, throwing out the from-scratch map and the comparison, we finally struck paydirt: 6.5 Mnps. 10% faster than the previous best. Still a bit disappointing. But that was because I had forgotten something: the from-scratch generation was still done with the null move. While the incremental update maintains both halves of the attack map, so that null move doesn't have to update anything at all. After removing this redundant map calculation, the speed went up to 8Mnps:
Code: Select all
 1  -16       532 0 e2a6 b4c3 b2c3 e6d5
 2  -16      1859 0 e2a6 b4c3 b2c3 e6d5
 3  -27      4421 0 e2a6 e6d5 g2h3 b4c3 d2c3 d5e4
 4  -27     59130 0 e2a6 e6d5 g2h3 b4c3 d2c3 d5e4
 5  -37    142018 0 e2a6 e6d5 c3d5 b6d5 e4d5 e7e5
 6  -32    427596 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 7  -32    956057 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 8  -32   4664824 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 9  -96  12236440 0 d5e6 e7e6 e2a6 e6e5 g2h3 b4c3 d2c3 e5h5 f3f4
10  -82  41679702 0 e2a6 e6d5 e5g6 f7g6 c3b5 d5e4 f3g3 e8f8 g2h3 c7c5 g3g6 h8h3
t =  5.210 sec
  41679702 nodes (8.0 Mnps)
  37242403 QS (89.4%)
   5510077 evals (14.8%)
   1893393 stand-pats (5.1%)
  14445410 leaves (38.8%)
    134111 move gens
captures: 89.8%
We see that the number of nodes is not exactly the same as in the old version. This is no surprise, as the move order is not exactly the same. (Especially for HxL captures of minors and Pawns.) But in this case the node count is even smaller. I am still inclined this is mostly noise; a single position cannot be used to judge which move ordering is better.
Anyway, we wen up from 6 to 8Mnps, +33%. And it doesn't have to stop here, as there are still lots of inefficiencies. There is for instance no overkill pruning, and Search() routine is called with the MVV indicator set to King. So that it has to run through the opponent's entire piece list to find the MVV, and the map update has been done in vain when it doesn't find a non-futile one. As apparently happens in 38.8% of the nodes. Imagine how much faster we would be if we could prune these nodes through a cheap test! But that would just reduce execution time, as the nodes that are pruned would then also not be counted.
There was a similar situation with futility pruning in the old version: switching that off nearly tripled the nps, by adding lots of nodes where virtually nothing was done, reached by a dirt-cheap MakeMove(), without adding much execution time. This effect has now largely disappeared, because MakeMove() updates the attack map, and thus is no longer cheap. Just for fun I reran the last test with futility pruning off:
Code: Select all
 1  -16      1282 0 e2a6 b4c3 b2c3 e6d5
 2  -16     15123 0 e2a6 b4c3 b2c3 e6d5
 3  -27     25412 0 e2a6 e6d5 g2h3 b4c3 d2c3 d5e4
 4  -27    179979 0 e2a6 e6d5 g2h3 b4c3 d2c3 d5e4
 5  -37    412220 0 e2a6 e6d5 c3d5 b6d5 e4d5 e7e5
 6  -32   1200942 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 7  -32   2769202 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 8  -32  12329886 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 9  -96  34438204 0 d5e6 e7e6 e2a6 e6e5 g2h3 b4c3 d2c3 e5h5 f3f4
10  -82 110957182 0 e2a6 e6d5 e5g6 f7g6 c3b5 e8d8 g2h3 d5e4 f3g3 c7c5 g3g6 h8h3
t = 12.811 sec
 110957182 nodes (8.7 Mnps)
 106437647 QS (95.9%)
 106437646 evals (100.0%)
  66686879 stand-pats (62.7%)
   5183909 leaves (4.9%)
    275950 move gens
captures: 91.8%
We see the nps is still a bit higher (up another 9% to 8.7Mnps), but the search time went up from 5.2 to 12.8 sec, more than a doubling!
[Edit] I uploaded (the still very messy) source code for this test to 
http://hgm.nubati.net/mailbox4.c . I am not sure whether it would compile on any compiler, because I had to embed some assembly code to be able to use the bit-scan forward instruction. That would probably have to be adapted to a 64-bit version when compiled for x64 architecture. And I am not sure the GNU format for assembly is understood by all compilers.