The mailbox trials

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

OK, if branches is he problem, let's try to eliminate as many of those as possible. The problem of course is that you then have to do unconditionally everything you could ever have to do, always. But the merging of the sets in 64-bit arithmetic did not seem excessively costly, so perhaps this will pay.

So we can generate the four 64-bit capture sets (P x P, other x P, any x B/N, and any x R/Q/K) always. Logically these form a 256-bit capture set. Now modern CPUs do have 256-bit registers, but as far as I know there is no bitscan-forward available for those. So we have to improvise. After we calculated a capture set we will push it on a stack. That is, we will copy it to the top of the stack, but we will only increment the stack pointer if the set was non-empty. And we will remember what set it is by storing a pointer to a decoding table for bitnr -> piece numbers of attacker and victim. Like

Code: Select all

sets[SP] = setPxP;
decoders[SP] = PxPdecoder;
SP += (setPxP != 0 & PVAL > futilityGap);
Whether setPxP is 0 was determined as a side effect of the last operation that created it, and can be transferred as 0 or 1 to a register with a SETcc instruction. Likewise for the result of the PVAL > futilityGap compare. So no branches are involved there, just arithmetic operations of which you can do 3 in a clock. If the futilityGap (alpha - eval - MARGIN) is smaller than PVAL the incrementing of SP is also suppressed, even if it was non-empty, as Pawn captures are futile then, and should not be searched.

Afer all four sets are (optionally) pushed to the sets[] stack this way, we can start a single extraction loop to process them:

Code: Select all

todo = sets[--SP];                     // last set pushed
while(SP) {                            // kludge: sets[0] is a dummy, so if SP==0 we ca stop
  int bit = BSF(todo);                 // note that only non-empty sets are pushed
  undo.move = decoders[SP][bit];       // decode the move and prepare to pass it
  todo &= todo - 1;                    // clear bit for this capture
  SP -= (todo == 0);                   // continue with next set if this one runs empty
  todo = (todo ? todo : sets[SP]);     // pre-load the next set (could be the dummy)
  if(SearchCapture(u)) goto cutoff;
}
This has a bit of a long latency chain through 'todo': todo -> -1 -> & todo -> == 0 -> SP -> sets[] -> todo for the next iteration. This would be a problem in tight loops. But SearchCapture() does a ton of work here (possibly searching an entire tree). So by the time it finishes the calculation of the new 'todo' will have nicely finished, and is ready for the next bit extraction.

BTW, in PV nodes it can happen that alpha gets incremented without causing a cutoff. This would make the pre-calculated futilityGap used during preparation of the move sets invalid. E.g. if {alpha, beta} was {+100, +400}, currentEval = -150, and MARGIN = 40, the initial gap would have been 210. So Pawn captures would be futile, but captureing a Knight would bring us comfortably in window (+175). But suppose a capture of a Rook searched earlier turned out to gain us R for P. That would raise alpha to +250 (currentEval + 400). Under this new alpha capture of a Knight would be futile. But SearchCaptures() tests for this on a case by case basis with the most-recent alpha, and orders an alpha-cutoff when a capture is 'unexpectedly' futile. Because the MVV ordering of the captures guarantees that all later captures then will also be futile.

For the decoders[] table we can tabulate entire moves. I.e. arrays like PxPdecode[] can be arrays of Move structs, which fit in a 32-bit int (for easy copying), but contain four char fields for from, to, (moving) piece and victim. For SearchCapture() the move has to be specified as (piece, victim) pair, and the tabulated from and to in the decoder table are set to 0. They will be overwritten when SearchCapture() looks up the location of the specified pieces in the piece list. So we would need 8 decoder arrays of 64 ints each (one for each color, 2KB in total).
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

So much for my great ideas... :( When I write x *= 0x200000001ull; the compiler optimizes the multiplier away, for a shift and an add! :shock:

I suppose the problem is that there is no load immediate for 64-bit operands in x64 architecture. So you would have to load the multiplier from a memory variable, which indeed defeats the purpose. But if I would need that same multiplier a number of times, loading it into a register once would still be competitive. All 7 of the 8 pairs of attackers sets will have to be interleaved by

set = att64(SOMEPIECE) & presence64[stm];
set *= 0x200000001ull;

Then follows a series of shifts, masks and adds that is different for each of the groups. Only the Queen is solitary. It could still be loaded as a pair, in which case it oairs with King. But the King (whose attackers would be in the high-order 32-bits) cannot have any attacks on it at the capture-generation stage; existence of such attacks would have aborted the search before. So the Queen attackers are (after presence masking) interleaved with zeros. The above code would put the interleaved Rook attackers in the high-order 32-bits of the 64-bit register; clearing the low-order bits (which would still contain the Rook1 attackers) and adding the Queen set would do the job. Problem will be the mask needed to clear the low-order bits is also 64-bit, and thus can also not be included as a constant operand of an instruction, but has to be loaded as a variable. And that mask isn't needed so many times. The silly method of first shifting 32 bits right, and then 32 bits left, to shift the unwanted bits out of the word, would take as many instructions. This would be heavy on the shift unit, though. But perhaps allinterleavings should be followed by >>32 to get the interleaved sets in the low-order 32-bits, with no garbage bits elsewhere, and then work from there.
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

Piece Pair encoding II

To make optimal use of 64-bit architecturs, it is better to change the piece numbering, from the natural N, N, B, B, R, R, Q, K to (N, B), (N, B), (Q, R), (K, R). Where the parentheses indicate pairs whose attackers sets can be loaded together as a 64-bit int. Intel architecture is 'little endian', meaning that after a multi-byte load from memory the data at the lowest memory address will end up in the least-significant bits of the register. So loading the (Q, R) pair would get the Q attackers in the low half of the word, and the R attackers in the high half. This is what we want, because the low bits (i.e. attacks on Q) will be extracted first.

Two pairs can be easily merged, by masking away friendly (and captured) pieces from the attackers sets with presence64[xstm] to create space, and then either take pair1 + 2*pair2 or pair1 >> 1 | pair2 (depening on whether the surviving attackers are in the odd or the even bits. For the majors this would give a combined 'capture set' RR:KQ (I use the notation H:L here to indicate High and Low half of the 64-bit register), and K would never have any attackers at this stage. So we can directly extract captures from that, to get them in MVV/LVA order.

There are 4 Pawn pairs, though. And according to MVV/LVA all the PxP captures should go first. In deviation of MVV/LVA we do all Other x P captures in arbitrary order; the sensible thing would be to do them in SEE order, not MVV/LVA as they are all HxL. But this still means we have to separate the PxP captures. After merging the 4 pairs to 2 capture sets (set1 and set2) this requires the following code:

Code: Select all

uint64_t mask = 0x0000FFFF0000FFFFull;
uint64_t set3 = set1 & mask; // PxP captures on P1-P4
uint64_t set4 = set2 & mask; // PxP captures on P5-P8
set1 ^= set3; // Other x P on P1-P4
set2 ^= set4; // Other x P on P5-P8
set3 += set4 << 16; // all PxP
set1 += set2 << 16; // all Other x P
That is 2 shift, 2 add, 2 xor, 2 and and 2 move (between registers) if set1, set2 and mask are already loaded in a register. I.e. 10 simple instructions, to get two of the four capture sets from which we can extract the moves (when we get to searching captures on a Pawn. That is not too bad.

The minors are more trouble: the combined capture set for them will start in BB:NN format. This is intentionally the wrong order for MVV extraction. (That is, I suppose we want to try captures on a Bishop before these on Knights, because a Bishop can be worth more, and thus gives better prospects for a cutoff.) We could not directly extract anyway even when the B and N were reversed, because then the moves would come in the order PxB, minor x B, major x B, PxN, minor x N, major x N. And MVV/LVA order is PxB, PxN, minor x B, minor x N, major x B, major x N. So we have to shuffle the captures into the right order. The best way I could think of for doing this is given schematically below; each line shows a 64-bit capture set, MSB on the left, where each letter indicates a pair of attacker bits, capitals for attacks on Bishops, lower case for attacks on Knights:

Code: Select all

 0 ........xxxxxxxx........xxxxxxxx  mask
 1 KQRRBBNNPPPPPPPPkqrrbbnnpppppppp
 2 ........PPPPPPPP........pppppppp (0) & (1)
 3 pppppppp........................ (2)<<24
 4 ppppppppPPPPPPPP........pppppppp (2)+(3)
 5 ................ppppppppPPPPPPPP (4)>>32

 6 KQRRBBNN........kqrrbbnn........ (1)^(2)
 7 ....KQRRBBNN........kqrrbbnn.... (6)>>8
 8 ........BBNN............bbnn.... (7)&(0)
 9 ....KQRR............kqrr........ (8)^(7)
10 KQRR............kqrr............ (9)<<8
11 KQRR....BBNN....kqrr....bbnn.... (8)+(10)
12 ....................KQRR....BBNN (11)>>40
13 KQRR....BBNN....kqrrKQRRbbnnBBNN (11)+(12)
14 kqrrKQRRbbnnBBNN................ (13)<<32

15 kqrrKQRRbbnnBBNNppppppppPPPPPPPP (5)+(14)
The final result (15) now has the captures in the desired order, PxB first. LVA is still violated by that K or Q x B go before R x N. But again, the order should be SEE-solved, and LVA is pointless. This is appreciably more work than was needed to re-order the Pawn moves, and this time just for a single set of 64 (potential) captures. I count 4 move, 6 shift, 2 add, 2 and and 2 xor. So 16 instructions. At 4 instructions per clock that would be 4 clocks. Would it really pay to do that pre-emptively, even though the node might never get to capturing minors because capture of a Rook already gave a cutoff? Or, even more stupidly, because there were no minors that could be captured (i.e. we have been doing this bit juggling with words that were all entirely zero), or capture of a minor was futile?

Testing whether there are captures on minors is much simpler: just OR the two (N, B) pairs together, apply the presence64 mask, and we know it (2 load, 1 or, 1 and). In GetMVV(), where we decide whether we should even make the move to the child node, we use that method, and there it probably does pay to do that pre-emptively, to eliminate branches. This gave me the following idea:

In GetMVV() we pre-emptively test for attacks on pieces of the moving player for all value groups, in he attack map as it was before the move, but with an already updated presence mask (to account for that the captured piece is no longer attacking anything). Each tests results in a bit (1= there was an attack, 0 = there wasn't). And we mask away the bits for the futile value groups, based on a calculation of the gap between pstEval and alpha in the upcoming child. There are only 4 value groups: Q, R, BN and P. So that gives 16 possible values.

This value is passed to the child, which can use it in a switch statement with 16 cases. Each case then only calculates he capture sets for the value groups that are known to be attacked.

Code: Select all

uint64_t majors = (att64(R1) | att64(R2) )& oppoPresence; // attacks on Q & R
u->targets = ((gap > QVAL ? 0 : majors) << 32 != 0); // Q attacked?
u->targets = 2*u->targets + ((gap > RVAL ? 0 : majors) >> 32 != 0); // R attacked ?
u->targets = 2*u->targets + (((gap < BVAL ? 0 : att64(N1) | att64(N2)) & oppoPresence) != 0); // attacks on B & N
u->targets = 2*u->targets + (((gap < PVAL ? 0 : att64(P1) | att64(P3) | att64(P5)  | att64(P7)) & oppoPresence) != 0); // attacks on P
u->targets |= u->newThreats; 
return !u->targets; // order overkill pruning if nothing to capture
This examines the entire attackers set (eight 64-bit loads), and in addition does 6 or, 3 and, 2 shift, 4 setcc, 4 cmp, 1 move, 4 cmov and 3 lea. (The latter can do A = B + 2*C.) The code is a bit cryptic, so some explanation: it will first OR all relevant attackers sets together, but then obliterate the result through a conditional move with a 0 when the value group is futile (determined by the preceding compare of 'gap'), like there were no attacks. Then it uses the oppoPresence mask to test for the relevant attack, and a setcc instruction to convert the result of the masking to a truth value 1 or 0, which is then added to u->targets. For the majors, shifts substitute for the masking to measure the 'zeroness' of the relevant part of the attackers set (as it contains two different value groups). That is eight loads and 27 other instructions in total, which in theory could run in 9 clocks.

BW, it would be nice if the work of loading half the attack map into registers could be used (in case the move will not be overkill pruned) owards he copy-make we would then have to do for updating the attack map. If the original values would still be left in registers at that point (which would require 4 extra move instructions for making copies to operate on) we would then only have to store those to the memory area for the attack map for the child, presumably saving 25% of he work there.

Because Search() now ets u->targets passed, it would know exactly which value groups of opponen pieces it has captures on. So it could take he pain of one mispredict o switch on that, and in the various switch cases ignore value groups that the stm doesn't attack. This in particular to save the work of reordering the captures on the minors when there are none.
Raphexon
Posts: 476
Joined: Sun Mar 17, 2019 12:00 pm
Full name: Henk Drost

Re: The mailbox trials

Post by Raphexon »

and an evaluation of PST plus (optionally) mobility.
I've recently been wondering about something more flexible.

Something like:

eval = (value_x * num_pieces) + (value_y * mobility_sliders) + (value_z * mobility_jumpers) + (value_a * attacked_squares^)

The initial values may require some tuning, although they're few so should be simple.
But once set up you should have an eval that can be used for any variant regardless of size or pieces. So super flexible.

Maybe needs a few 1 or 2 extra variables for powerful big board shogi pieces but can hopefully compete with PSTs in normal chess.

^Amount of moves that can capture a piece.
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

Well, details of he evaluation are outside the scope of this project. Except that the basic info on which an evaluation could be based should be available cheaply. That in particular applies to mobility, normally an expensive term to calculate. I did not get to that yet. But what I had in mind was to provide a function SUM(bonus[attacker][victim]), where the SUM runs over all pseuo-legal moves (including friendly captures). So attacker is a colored piece type, and victim a colored piece type or an empty square.

What you propose is (nearly) a special case of that, where the bonus[][] table would have value_y when the victim is an empty square or foe and the attacker is a slider, and likewise value_z when he attacker is a leaper. (BTW what did you have in mind for slider-leaper compounds, like he Archbishop? Would you also award its Knight jumps as if they were slider moves?) And add value_a to the non-empty targets. This would count squares that are twice atacked double, though, and I am not sure that this is what you had in mind.

I am not sure how exaclty you envision the value_x*num_pieces term. Would all pieces get the same value (value_x?), and be discriminaed only by he mobility they have in the current position? That sounds very wrong. The mobility of a Rook in the opening phase is usually much lower than that of the minors. But trading even a single minor for a Rook is an usually a winning advantage, even in that stage. The value of pieces is much dominated by their value in the end-game, i.e. on a sparsely populated board. Because thinking "Oh, now we reach the stage where a Rook is getting more useful than a Knight, so I will quickly trade my Knights for his Rooks, or my opponent would get the upper hand" won't work. By the time the tactical power (e.g. measured as mobility) of your army is about equal to that of your opponent, you are no longer in a position to force anything, in paricular not arbitrary trades of (at that instant) equally powerful pieces.
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

Move Sorting

In a previous posting I 'went through hoops' to re-pack the captures into the desired order in the 64-bit capture set they were originally extracted from. This was pointless and stupid. It would only serve a purpose if we were using a separate extraction loop for each capture set, to minimize the number of loops. But since we are going to use a single 'super loop' for extracting the captures from all sets, it doesn't matter at all over how many sets the captures are distributed. So there is no need to combine sets, once we have separated them. This simplifies matters a lot, for the minors:

Code: Select all

setRQKxBN = set1 & maskRQK;     // maskRQK = 0xFF000000FF000000ull
sets[SP] = setRQKxBN;           // tentatively push on stack
SP = (setPxBN ? SP+1 : SP);     // accept if non-empty
setNBxNB = set1 & maskRQK >> 8; // mask away the attacks by majors
sets[SP] = setNBxBN             // tentatively push on stack
SP = (setNBxBN ? SP+1 : SP);    // accept if non-empty
setPxNB = set1 & maskP;         // maskP = 0x0000FFFF0000FFFFull
sets[SP] = setPxBN              // tentatively push on stack
SP = (setNBxBN ? SP+1 : SP);    // accept if non-empty
The operation that has created the set (an and or subtract) will have tested the zeroness of the result as a side effect, so a conditional-move instruction can decide on the basis of that result whether it should overwrite SP by a (precalculated) SP+1 or not. Calculating SP+1 in another register is possible in a single instruction (lea), so the conditional increment of SP will take two instructions (lea + cmov). Note that for this to work the piece encoding would need to use the order (B1, N1), (B2, N2), (Q, R1), (K, R2). So that on loading a pair as 64-bit int the Bishops (or Queen) would end up in the low-order half of the register, and be extracted first, as MVV ordering prescribes.

Because we don't shift around the captures within the capture sets anymore to get them in the desired order, the mapping of the bits in the capture sets to (attacker, victim) pairs has become much simpler. Each capture set now only contains attacks on 4 different pieces: Pawns 1-4, Pawns 5-8, the minors, or the majors. We just have to remember the piece number of the lowest member of that group together with the set, and can get the actual victim number by adding that to an 'offset' obtained from a single decoding table shared by all capture sets. The super-loop for move extraction then becomes:

Code: Select all

todo = sets[--SP];
while(SP) {
  int bit = BSF(todo);
  undo.piece = stm + bit2attacker[bit];
  undo.victim = baseVictim[SP] + bit2victim[bit];
  todo &= todo - 1;
  SP = (todo ? SP : SP-1);
  todo = (todo ? todo : sets[SP]);
  if(SearchCapture(&undo)) goto cutoff;
}
The two conditional moves that trigger on todo falling empty would pop a next capture set to replace it from the sets[] stack. This is guaranteed to be non-empty, or it would not have been pushed onto that stack in the first place.

Some micro-optimizations are still possible here: the move could have been encoded as a struct {char victim, piece, from, to}, and the bit2attacker[] and bit2victim[] tables could have been merged into one table containg such structures. So that only a single lookup would be required. We could also use different decoding tables for white x black and black x white, to save us the trouble of adding stm to the piece every time. We would just load a pointer to the appropriate table before the loop, based on the side to move.
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

I hink the ultimate code for a branchless overkill detection would be this:

Code: Select all

                                               // zero a register                               1 xor
gap = u->alpha - u->pstEval - MARGIN;          // calculate futility gap                        1 load, 1 lea
new = u->newThreats;                           // new victims of the current move               1 load
set = u->oppoPresence;                         // kludge: start assuming nothing is futile      1 load
presence64 = set*0x100000001ull;               // make 'double-barrel' presence mask            1 move, 1 imul
set = (gap > 0 ? 0 : set);                     // replace by 0 if non-capts futile              1 cmp, 1 cmov
tmp = att64(P1)|att64(P3)|att64(P5)|att64(P7); // attackers of 4 Pawn pairs                     4 load, 3 or
tmp = (new & 0xFF00FF00 ? presence64 : tmp);   // add newly-created attacks on P                1 and, 1 cmov
tmp = (gap > PVAL ? 0 : tmp);                  // replace by 0 if P capture futile              1 cmp, 1 cmov
set |= tmp;                                    // and add to collection                         1 or
tmp = att64(B1)|att64(B2);                     // attackers of 2 B-N pairs                      2 load, 1 or
tmp = (new & 0x00F000F0 ? presence64 : tmp);   // add newly-created attacks on N, B             1 and, 1 cmov
tmp = (gap > BVAL ? 0 : tmp);                  // replace by 0 if minor capture futile          1 cmp, 1 cmov
set |= tmp;                                    // and add to collection                         1 or
tmp = att64(Q)|att64(K);                       // attackers of QR and KR pairs                  2 load, 1 or
tmp2 = tmp >> 32;                              // dump the attackers of K & Q in low-order half 1 move, 1 shift
tmp2 = (new & 0x000C000C ? presence64 : tmp2); // add newly-created attacks on R                1 and, 1 cmov
tmp2 = (gap > RVAL ? 0 : tmp2);                // replace by 0 if Rook capture futile           1 cmp, 1 cmov
set |= tmp2;                                   // and add to collection                         1 or
tmp <<= 32;                                    // dump attackers of Rooks in high-order half    1 shift
tmp = (new & 0x00030003 ? presence64 : tmp);   // add newly-created attacks on Q                1 and, 1 cmov
tmp = (gap > QVAL ? 0 : tmp);                  // replace by 0 if Queen capture futile          1 cmp, 1 cmov
set |= tmp;                                    // and complete the set of dangerous attackers   1 or
set &= presence64;                             // weed out protectors and captured pieces       1 and
if(!set) return 1;                             // overkill prune if nothing to capture          1 branch
If I counted correctly, that makes 47 instructions, amongst which 11 loads, 2 shifts and an imul. So no overloading of specialized execution units (such as load or shift/multiply), so that it should be possible to do this in 12 clock cycles. In the last profile I posted 30M calls to GetMVV() took 0.58 sec, i.e. 19 ns/call, or 42 clock on my 2.2GHz laptop. So this branchless version should be able to take 60% off (assuming it will be inlined in SearchCapture().

This code uses he kludge that it sets the set of non-futile attacks to 'presence64' (all the present pieces) whenever it is sure there must be a non-futile move, but it doesn't know what the attacker is. (Because it came from the newThreats, or is a non-capture/stand-pat.) This is sure to give a non-zero in the final test for emptiness (through set & presence64). The advantage of mindlessly ORing everything together is that we have to mask away the protectors / captured pieces only once, at the very end.
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

Color-Independent Code

So far I had assumed it was unavoidable to have color-dependent code, as the merging of white and black attackers sets need shifing in opposit directions for interleaving their bits. The white attackers are in the even bits, and the black attackers are in the odd bits. So after masking away the protectors of a piece, to combine set1 and set2 requires set1 + 2*set2 for white (left shift), and set1>>1 | set2 for black attackers, i.e. execution of different instructions, raher than use of different data. Specifying a left-shift of -1 doesn't give you a right-shift, in i386 architecure, but a lef-shift of 31.

But there seems to be i386 instruction ROL and ROR ('rotate' left/right). These act like (x >> N) | (x << 32-N). There are no C operators directly specifying those, but I cannot exclude compiler optimizers are smar enough to recognize these expressions, and use a ROR instruction to implement them. Since (like with the plain shifts) the shift will be interpreted modulo 32 (or 64, in a 64-bit register), negative numbers here actually work. As unlike plain shifting, rotating 31 bits left or 1 bit rights is actually the same. Now rotating and shifing is no precisely the same. But in the current application he bit shifted out of the word is always a bit that has just been cleared by masking away the protectors, rotate is the same as a (logical) shift. So I can write color-independent code after all, if needed forcing the use of ROR shough an embedded assembler instruction, just as I do for BSF.

The 'Final' design

Since the design now seems to converge to a near-optimal state, and since in the process of this blog has changed several times, let me reiterate what we arrived at. The engine uses several data structures to represent the game state, with a high degree of redundancy:

The data structures

1) A 16x12 mailbox board with a 2-wide rim of 'edge guards', indexed by square number in 0x88 format (or perhaps offsetted by 0x22 to avoid negative square numbers in the rims). The 8x8 playing area of the board[] array contains (unique) piece numbers, not piece types.
2) A piece list, indexed by piece number. his is implemented as a number of separate arrays, one for each aspect of the piece we want to tabulate / keep track of. Most important are location[], which contains the number of the square the piece is on. Other important (but static) data about the piece is a pointer to a list of move steps in can make (for leapers, as a difference of square numbers) or directions it can move in (numbered 1-8). This basically defines the piece type. There also are pointers to the Piece-Square Table for the piece, and the table of Zobrist keys. And the piece values.
3) An attack map. As this is also indexed by piece number it formally counts as a part of the piece list. But the whole design revolves around it, so I mention it separately. Each piece has a 32-bit int associated with it, recording the pieces that attack it. These elements of the array attackers[] are thus piece sets, where each of the 32 pieces is represented by a single bit. The assignment of pieces to bits is critical: white and black pieces alternate, and going from least-significant bit to mostsignificant bit, the pieces increase in value. This means that extracting bits with the BSF instruction will retrieve the pieces in the set in order of increasing value, Lowest-Valued Attacker first. To be specific, the bit assignment is:

Code: Select all

             (MSB) kKqQrRrRbBbBnNnNpPpPpPpPpPpPpPpP (LSB)
Since the set contains pieces of both colors, only half the pieces are true attackers, the other half are protectors.
4) A neighbor table. This is an auxiliary data structure that for each occupied square holds the number of the nearest occupied square in each of the 8 directions. So it is actually a two-dimensional array, indexed by square number and direction. Rather than neighbor[sqr][direction] the 'row' of 8 directions is put in a struct, which makes it easy to copy it in its enirety (as the eight unsigned char square numbers fit in a 64-bit int). The neighbor can be a rim square. If you hink about it, you will recognize the neighbor table as a collecion of doubly linked lists, (where the cells are the squares), one list for each rank, file or diagonal. That makes it very easy to delete or re-insert squares, when these get evacuated or reoccupied.
5) A capture stack in each node. This is the logical equivalent of a move list for the captures. But it is not an array where the elements each contain a description of a single move. The elements of the stack are also sets, where each bit represents a capture, and the location of that bit in the set determines what capture that is. In this sense it is similar to the attack map, and it basically is a copy of the half of the attack map that holds the attacks on the pieces of the side that is not on move, leaves out the protectors and has reshuffled the bits into MVV/LVA order. Each element of the capture stack is a 'capture set', where each bit represents a capture. The assignment of the bits is reminiscent of that in attack-map elements. Except that it doesn't contain attackers of different color on the same piece, but attackers of the same color on two different pieces. And they concatenate two such patterns, as they use 64-bit word length. So one capture set holds all captures on a group of 4 pieces. So in theory 4 such capture sets can hold all conceivable captures.
6) A conventional move list for the non-captures. A this point we will not pay any attention to that, as there are so few nodes where non-captures are generated that the details for this have no significant impact on speed.

The process of capture generation in this engine consists of two steps: update the attack map of the parent node (through copy-make), for the effect of the latest ply, and then a sorting steps, transforming the attack map to a collection of capture sets on the capture stack. After this the captures will be searched in the order they are extracted from the capture stack, as if the latter was a 512-bit machine word.

The sorting step

Filling the capture stack from the attack map always starts by loading pairs of attack-map elements in a 64-bit access, and masking away the protectors and the already captured true attackers. (Since we have to do a masking anyway to get rid of the protectors, getting rid of the captured pieces is completely free, and allows us to leave the attacks these pieces were making just before they were captured in the map, saving us a lot of work.) Two such masked pairs will then be interleaved to get a capture set. We need to make 4 such sets (as there are 16 opponent pieces to capture). Two of those will only contain captures on Pawns, a third captures on minors, and the fourth captures on majors.(There will not be any captures on a King at this stage.) The piece numbers are assigned to the minors and majors such that Q and both B end up in the least-significant half of the capture sets, while both R and both N end up in the upper half. The merging requires a shift and a + (or |) to interleave the pairs.

The capture set for the majors can be pushed onto the capture stack 'as is'; the captures in those would extract in MVV/LVA order. But for Pawns and minors we have to re-order them. This is done by splitting the captures on minors into three (P x minor, minor x minor, major x minor), and pushing those on the capture stack separately, such that they would be treated in the listed order. (In the original capture set major x B would preceed PxN, which violates MVV/LVA.) Likewise, we split the captures on Pawns (2 sets) each into two (P x P and other x P), and push the four resulting sets onto the capture stack such that the P x P would be treated before the other x P. So we finally end up with up to eight capture sets on the capture stack.

We avoid pushing of empty capture sets to the stack, though, by suppressing the increment of the stack pointer after copying such a sets to the stack through a conditional move. So the entire procedure of filling the capture stack from the attack map is branch free. The downside of that is that we always have to do the entire job: Even if we already know in advance tha capture of a Pawn is futile, or we will almost certainly get a cutoff by capturing a Queen, we will still have been splitting the Pawn sets. But the splitting is very little work, and a single branch misprediction would already be more costly.

The capture search

Extraction of the captures, and searching them, will be done in a single loop, which only has branches to stay looping (a likely mispredict when the loop terminates), and a branch to break out of the loop for a beta (or alpha) cutoff. (But the routine it calls for searching the captures could of course have its own mispredicts.) As the captures extract from the capture stack in MVV/LVA order, the only task of the loop is to 'decode' the bit & word numbers to piece numbers for mover and victim. This can be done through a lookup table (of which an element specifies both attacker and victim). All capture sets on the stack map their bits in the same way to pieces, except that the victim needs an offset that depends on the set (0 or 4 for the Pawn sets, 8 for the minors and 12 for the majors). This offset will be pushed on a stack together with the capture set.

The map update

Logically, the update a map needs after a capture consists of two parts: attacks by pieces of the player that moved, and attacks by pieces of his opponent. The attacks by the opponent are affected much less than his own (as the opponent did not displace any of his pieces). One of the opponent's pieces disappears in the capture, but we take care of that by removing it from the presence mask and setting its own attack-map element (for the attacks on it) to zero. That is just two alterations. We leave the attacks by it in the map, (where they are scattered over many elements), as whenever such an element is used, the masking with the presence mask (from which the piece was removed), which had to be done anyway to select pieces of the right color, causes these attacks to be ignored.

And then there is the discovery of slider moves, which for opponent sliders is 'pin enforcement'. The map helps is own updating here, as it already lists all attacks on the moved piece before the move. So we just select the sliders from that with a mask, and then for each slider we extract from the remaining set, identify it, get its location, get the direction of its attack, get the neighbor in the opposit direction, and apply the slider's attack to the attack-map element of that new downstream target. That is a bit cumbersome, but for enemy sliders there are on average much fewer than one for which you have to do all that. Most pieces are not attacked by enemy sliders.

The other half of the update involves more. Here we really have to generate moves of the moved piece, both from its old location (to remove those from the map), and its new one. But that is still a lot less work than generating moves for all pieces, which would be needed with move generation from scratch. And then we also have to discover friendly slider moves ('discovered attacks'). There tend to be a lot more of those than enemy sliders. Still, you would only have to generate a capture of each slider that hits the piece in one direction. And not every slider will hit the piece. So again this should be much less work than move generation from scratch. To save on loop overhead, discovery of friendly and enemy sliders can be done in a single loop.

Finally we have to account for the attacks on the moved piece. The old attacks now hit thin air. In the new location it inherits the attacks of the piece it captured (minus its own; he must have been attacking it when he captures it!). We have to be careful to do things in the right order, to make sure discovered slider attacks that also hit the piece in its new location get assigned to the right piece.

The move generation

The few capture generations we have to do will be done as follows: for sliders we look up their location in the piece list, as well as a pointer to the list of the 4 or 8 directions they move in. We then loop through that list, and for each direction in it consult the neighbor table at the from-square, to see where the nearest obstacle in that direcion is. And on the board we then find what stands there. We then record (or remove, as the case may be) the attack of that slider in the attack-map element of the hit piece. It cannot be an empty square (the neighbor table guarantees that), and if it is an edge guard we apply the attack to the (dummy) attack-map element for edge guards. (This to avoid having to test and branch. No one will ever look at that dummy element, so it doesn't matter how much we maltreat it.)

If the moved piece is a leaper, we again look up its location in the piece list, as well as the pointer to its list of board steps. We then loop through the 2 or 8 elements in that list, and for each step, calculate where it ends, look on the board what is there, and apply or remove the leaper's attack to the attackers set of what it hit. In this case it can hit empty squares or edge guards as well as pieces, so empty squares have a dummy attack-map element too.

Note that we don't care about the color of the pieces we hit; the attack map contains both true attacks and protections.
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

Check evasions

The attack map makes it very easy to detect whether the stm is in check: just mask the attackers set of King with the present opponent pieces. If that isn't zero, the King is attacked. Rather than just setting an inCheck flag, we use the entire set of valid attacks as such:

Code: Select all

  undo.inCheck = attackers[stm+KING] & presence[COLOR-stm];
When in check most pseudo-legal moves tend to be illegal, for failing to resolve the pre-existing check. I currently test that on a per-move basis: generae the pseuo-legal captures as usual, and then a the start of each SearchCapture() do:

Code: Select all

  if(u->inCheck && u->inCheck & ~u->victimMask              // we were in check, and did not capture checker,
                && u->piece != COLOR+KING - stm) return 0;  // nor moved the King => move is illegal, ignore
When in check this tests whether after the propsed capture there will be attackers left in the King's attackers set (because it was a double check, or another piece than the checker was captured), and whether the King stayed in place to 'enjoy' that attack.

This contains a number of branches, and even though it would not get beyond the first one when not in check (which is the common case, and therefore probably predicted), this is still nasty. It would of course be better if moves that did not satisfy one of the criteria (not in check to begin with, capturing an only checker, or moving the King) would not be generated at all. Then there wouldn't be anything to test. So can we suppress non-evasions during capture generation?

What complicates this is that there are three cases: not in check, single check and double check. Captures by the King will have to be tried in all cases. When in double check these would be the only moves that have to be tried. It is easy to suppress all captures by pieces other than King in the normal move sorting process: just fake that you only have a King. So load the presence[stm] mask through a conditional move, before you use it in he normal procedure:

Code: Select all

presence64 = presence[stm] & 0xC0000000; // only leave King attackers
presence64 = (u->inCheck ? presence46 : presence[stm]); // but take all when not in check
But in the case of a single check we would still have to push the captures of the checker by other pieces onto the capture stack. Unfortunately, figuring out wheher there is a single checker, and which piece this then it, is not as easy as I had hoped. One problem is that BSF doesn't seem give a defined result when acting on 0. And it is not so easy to disinguish the case of a single check from both no-check and double check. I could come up with the expression x - 2*(x & -x) < 0 for that. There (x & -x) is the usual method to isolate the bit for the lowest attacker, and it would not go bonkers if there are no attacks at all, but give a nice 0. Subtracting it from x would leave all the higher attackers. Subtracting the already zeroed lowest attacker a second time would create a borrow, which would only propagate to the sign bit if there are no higher attackers. For x=0 (not in check) this expression would stay 0, and no satisfy the comparison.

A conditional move based on this test could be used to 'approve' the pushing of the attackers set of the checker onto the capture stack, by conditionally incrementing the stack pointer. But we would still have to identify that atackers set. For this we have to extract the bit of the checker with bsf. Which would give an undefined result when the inCheck set was empty. And it is used to index the attackers[] array, and even though the eventual use of the retrieved attackers set will be suppressed, we cannot afford an out-of-bounds access here that might segfault. This can be guarded against by masking the bsf result to force it into the range of valid piece numbers. So something like

Code: Select all

int bit = bsf(u->inCheck) & 31; // force potentially undefined result in range
int checker = bit2attacker[bit];
sets[SP] = set1 = attackers[checker] & presence[stm]; // attacks on the alledged checker
victims[SP] = bit2checker[bit];
set1 = (u->inCheck - 2*(u->inCheck & -u->inCheck) < 0 ? set1 : 0); // remove attacks on checker if not single check
SP = (set1 ? SP+1 : SP);
Together with the code for allowing the King moves this amounst to about 20 instructions, or 5 clocks. But hose would be added to every node, no just to the nodes where you are in check, and in most of those this would be complely wasted time. So I am not sure if it would pay off.

But we could of course take the pain of a mispredicted branch at the start of an in-check node, rather than in the SearchCapture() it calls for every move it finds. Note that the search is structured such that we should nottt get into a node when here isn't at least one move it is going to search. So we could use the above code within an if(u->inCheck) clause, so that it causes near-zero overhead in the common case of not being in check. (Just a test and a correcly predicted branch, 2 instructions = 0.5 clock.) And the code itself could then be simplified, as it is now executed only in situations where a check is known to exist. So the BSF will be valid, and we can drop the &31. And the clause for excluding double checks can be simplified to (u->inCheck & u->inCheck - 1), the usual way for clearing the lowest bit, and seeing whether anything is left.
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

Something even better: the normal (not-in-check) way of building the capture stack is a very inefficient way to just get the captures by a King. We go through the whole rigmarole of sorting the captures on the larger value groups in LVA order, while the only attacker under consideration is a King. So here is no need for re-sorting by LVA, and we could just push the four capture sets (with all pieces other than King masked away) unmodified on the capture stack. When we take the pain of an if-statement anyway, it doesn't cost anything extra to put the normal procedure in the else-part, and have dedicated code for building a check-evasion capture stack.

We can then put the capture of the checker into the set in the proper way, by using the normal presence mask on the attackers set of the checker, while for the capture of the other pieces only the King attacks count. The problem we have to solve here is that he procedure for building the capture stack applies the presence mask (for efficiency reasons) to capture sets containing all attackers of four victims. So we have to figure out in which bits of the capture set, and in which of the four sets, the checker ends up in. But because the capture sets won't be taken apart, the attacks on a given piece will always end up in a place (set number and bit mask) that can be tabulated as a function of the piece number of the checker.

A good way to do this would be to have a global array checkerMasks[] of 4 elements, all entries initialized to 0xC0000000C0000000ull (the King-attack bits in a capture set), and a table checker2set[] indexed by piece number, which tells in which of the 4 capture sets (Pawns1, Pawns2, minors and majors) a given piece belongs. And a table checker2mask[], which would idenitfy the bits used for the attacks on that piece. (E.g. for a Queen ha would be capure set 3 (the majors) and mask 0xC0000000EAAAAAAAull, as the attacks on Queen from the attack map would end up in the odd bits of the low half of the capture set after interleaving the atackers sets of the pairs (Q, R1) and (K, R2).

So when the checker is a Queen, we would then temporarily write 0xC000000EAAAAAAAull to checkerMasks[3], note this also transmits the King attacks on all pieces), and use the checkerMask[] elements to mask the capture sets before we push them on the capture stack. Afterwards we set checkerMask[3] back to 0xC0000000C0000000ull. Someting like

Code: Select all

if(u->inCheck) {
  int bit = bsf(u->inCheck) ; // force potentially undefined result in range
  int checker = bit2attacker[bit];
  int captSet = checker2set[checker];
  uint64_t mask = checker2mask[checker];
  mask = (u->inCheck & u->inCheck - 1 ? 0xC0000000C0000000ull : mask); // use K-only mask with double check
  checkerMasks[captSet] = mask;
  
  sets[SP] = set1 = (att64(QR1)|att64(KR2)) & presence64; // attacks on K, Q, and R
  set1 &= checkerMasks[3]; // mask for majors
  SP = (set1 ? SP+1 : SP);
  sets[SP] = set1 = (att64(B1N1)|att64(B2N2)) & presence64; // attacks on B and N
  set1 = (gap > BVAL ? 0 : set1); // futility pruning of those captures
  set1 &= checkerMasks[2]; // mask for minors
  SP = (set1 ? SP+1 : SP);
  ...
  
  checkerMask[captSet] = 0xC0000000C0000000ull; // restore to default (K-only)
}
P.S. thinking about this made me realize some code earlier conained a bug: the mask presence64 should not be a duplication of presence[stm] (which I achieved by multiplying with 0x100000001ull), but should actually contain presence[stm] four times. As the capture sets interleave the true attack on 4 victims. So the multiplier should be 0x300000003ull. (For stm = WHITE, which uses he even bits, while for black the mask needs to be shifted right 1 bit first. That is, the presence mask should be interleaved wih itself in the same way as two pairs of attackers sets are interleaved.)