I have now also started writing a general (symmetry-less) 4-men generator, based on 1-bit representation. It keeps a single won[] bitmap for the wtm positions, and an array of lost[Nmax][] bitmaps for the btm lost-in-N positions. A symmetry-less 4-men has 16M positions, so each bitmap takes 2MB. It only calculates white wins.
The 6 least-significant bits of the index are the black King square. That makes that the bitmaps can also be interpreted as 256K bitboards for the black King, each for a different constellation of the other 3 pieces. The white attack set on such a bitboard (with these 3 pieces) would mark the positions where black is in check. It can be used to reduce the number of mate candidates to very few by shift and AND operations, treating 64 positions (differing in black King location) at once.
So the mate-detection step involves generating these attack sets. Of course standard bitboard techniques could be used for that, but these are often memory intensive, and not particularly fast (if they have to be applied from scratch to every 3-piece constellation, i.e. 256K times). So I wanted to use another method: For the first two pieces I prepare a table attacks[piece][square][blocker], which contains bitboards for the moves of the 'piece' on the 'square', in the presence of only one other piece, at square number 'blocker'. To prepare such a table for a leaper is trivial, as its attack set will be independent of the blocker, and would just be copied to 64 elements of the attacks[piece][square][] array. For sliders you would start doing that too, but you can then start clearing some bits of the bitboards where blocking took place. Anyway, no matter how you generate them, we are only dealing with 4K bitboards here, so it should always be fast. The effect of their mutual blocking can now be calculated during the scan through the EGT after placement of the second piece (i.e. 4K times per scan), through:
Code: Select all
table0 = attacks[0][pos[0]]; // tables to look up effects of blocking
table1 = attacks[1][pos[1]];
at0 = table0[pos[1]];
at1 = table1[pos[0]];
(where pos[n] the square number of piece n), i.e. piece 0 blocks 1 and piece 1 blocks 0. The third piece can be either black (2+2-men, in which case it does contribute to the black King's no-go area only by its presence), or a white leaper (3+1-men, if all else fails, we use the white King). So it can block piece 0 and 1, but will never be blocked itself. When assigning it a location in the course of the EGT scan, (256K times per scan) we calculate the white attack set for the 3-piece constellation as
Code: Select all
nogo = at0 & table0[pos[2]] | at1 & table1[pos[2]] | attacks_2[pos[2]];
with attacks_2[64] a table with attack sets for the third (leaper) piece. With 3 table lookups, (from 512-byte tables), 2 ANDs and 2 ORs that is about as cheap as you can get. The nogo board would go into the won[] bitmap to mark the black-in-check positions as won to white (because illegal). It can then be pre-screened by, say, calculating
Code: Select all
nogo & (nogo<<1 | edge1) & (nogo>>7 | edge2)
which in general would already eliminate all checks as possible mates, because there is either a step left or one diagonally backward that the black King could take as a legal evasion. Only if some set bits were left you would AND with all the other King steps, and what is left would be mate candidates, to be stored in the lost[0][] bitmap for later verification (to see if there are evasion through moves with the third piece, in case this was black). This verification (in needed, in 2+2-men) can take place as soon as we have treated all locations of the third piece, and would be ready to go to the next constellation of the first two (white) pieces. At that time the won[] bitmap would be complete (all illegal wtm positions being marked there) for the current placement of these two pieces, so that we can start moving the third and probe if such a move would lead us to a legal position (i.e. out of check). Note that the data set to be verified at this point is only 512 bytes, and thus certainly still present in L1.
After identifying the mates I use the leap-frog algorithm for the retrograde solving: the won[], lost[N][] and partially complete lost[N+1][] bitmaps are used to complete lost[N+1][] and generate a partially complete lost[N+2]. This will be done alternately with either the first or the second piece kept static. That means all moving pieces (three of them) access a range of 256K positions, 32KB worth of bitmap. So with 4 active bitmaps the working set will be 128KB, which is too big for L1, but easily fits L2 even in Nehalem architecture. The lost[N][] bitmaps will be accompanied by a 64 times smaller directory, which has 1 bit for each bitboard in the bitmap to indicate if there are any bits set in it, to speed up the scan over all positions.
I will report speeds once I have it running.