More uses for magics?

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Aug 08, 2007 11:53 am

Jacob wrote: Another idea I had was a way to generate pawn moves without separate cases for black and white. The basic principle is simple enough; for white pawns, the pawn bitboard is shifted and compared to the occupancy. For black pawns, the occupancy is shifted by the same amount and compared to the pawn bitboard. I used a 17 element array to store my piece bitboards, which allows me to use some simple algebra to generate the pawn moves.

Hi Jacob,

nice idea to handle the white/black pawn cases. You safe separate biscan traversal loops for black pawns - which is of course nice. But there is a price though. Your loop bodies became much more expensive.

Code: Select all

   while &#40;moves&#41; &#123;
      sqr&#91;side&#93; = PopLSB&#40;&moves&#41;;
      sqr&#91;xside&#93; = sqr&#91;side&#93; - 8;
      AddMove&#40;sqr&#91;0&#93;, sqr&#91;1&#93;, pos&#91;sqr&#91;0&#93;&#93;);
   &#125;

versus loop hoisting with much cheaper loop bodies

Code: Select all

   for &#40;wpmoves = ...; wpmoves; wpmoves &= wpmoves-1&#41; &#123;
      to = bitScan&#40;&#40;wpmoves&#41;;
      fr = to - 8;
      AddMove&#40;fr, to, whitePawn&#41;;
   &#125;
   for &#40;bpmoves = ...; bpmoves; bpmoves &= bpmoves-1&#41; &#123;
      to = bitScan&#40;bpmoves&#41;;
      fr = to + 8;
      AddMove&#40;fr, to, blackpawn&#41;;
   &#125;

Inside the loop body you index sq[2] by side and xside to use sqr[0], sqr[1] in AddMove.
That requires, despite the xside == side ^ 1 variable which takes one register, 2*2 "expensive" memory read/writes with four agu-calculations, much longer code inside the "most critical" inner loop - while disjoint white/black pawn handling has all inside scratch registers.

Code: Select all

   test rdx, rdx        ; set to traverse
   jz   over
loopNext&#58;
   bsf  rax, rdx        ; to
   lea  ecx, &#91;eax-8&#93;    ; from
   ; eg AddMove 16-bit moves
   shl eax, 6             
   lea eax, ...           ; add fr,to,whitepawn together
   mov WORD PTR &#91;r10&#93;, ax ; move 2 list
   add r10d, 2            ; next 16-bit index

   lea  rax, &#91;rdx-1&#93;
   and  rdx, rax
   jnz  loopNext
over&#58;

You may post the generated assembly of your loop body for comparison. Another minor point is that disjoint loops for black/white are likely better to predict conditional branches, as long you have enough entries in your branch target buffer. I appreciate central common code for better maintabilty as well - one should be aware of the costs. Actually i am attired by the approach to flip sides at each move - and to therefor to have only white moves with only black capture targets

Cheers,
Gerd

Aleks Peshkov · Post by **Aleks Peshkov** » Wed Aug 08, 2007 3:21 pm

Gerd Isenberg wrote:Actually i am attired by the approach to flip sides at each move - and to therefor to have only white moves with only black capture targets

Oops, rotated bitboards reincarnation.

C++ templates can help to avoid source code duplication in cases:
1) white/black pieces moves;
2) not in check/quiesearch/check evasion move generators;
3) root/principal variation/zero-window variants of alpha-beta search.

Just a point in C++ vs C war.

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Aug 08, 2007 4:04 pm

Aleks Peshkov wrote:
Gerd Isenberg wrote:Actually i am attired by the approach to flip sides at each move - and to therefor to have only white moves with only black capture targets
Oops, rotated bitboards reincarnation.

Hehe - yes I flip my quad-bitboard, hashkey etc. by some bswaps during make/unmake.
I always have white to move now - which makes some stuff simpler - or at least avoids multiple template incarnations. I use

Code: Select all

zobrist&#40;blackPiece, sq&#41; == bswap &#40;zobrist&#40;whitePiece, sq^56&#41;)

with the drawback that hashkeys of symmetrical (sub)positions are effectiv 32 bit instead of 64, since
byte == byte[i^7] after xoring them

Code: Select all

   &#123;SS,   TT,   UU,   VV,   WW,   XX,   YY,   ZZ   &#125;
 ^ &#123;   ZZ,   YY,   XX,   WW,   VV,   UU,   TT,   SS&#125; 

== &#123;SS^ZZ,TT^YY,UU^XX,VV^WW,VV^WW,UU^XX,TT^YY,SS^ZZ&#125;

Aleks Peshkov wrote: C++ templates can help to avoid source code duplication in cases:
1) white/black pieces moves;
2) not in check/quiesearch/check evasion move generators;
3) root/principal variation/zero-window variants of alpha-beta search.

Just a point in C++ vs C war.

Yes, even consts in inlines are likely to get optimized away...

Jacob · Post by **Jacob** » Fri Aug 10, 2007 1:33 am

Thanks for the responses : ).

Here's the assembly (I think >_<, I'm not used to eclipse, or assembly for that matter)

Code: Select all

0x0000000000403d1b <GenNonCaptures+2690>&#58; jmp    0x403d7e <GenNonCaptures+2789>
0x0000000000403d1d <GenNonCaptures+2692>&#58; mov    2980101&#40;%rip&#41;,%ebx        # 0x6db628 <side>
0x0000000000403d23 <GenNonCaptures+2698>&#58; lea    0xffffffffffffffd8&#40;%rbp&#41;,%rdi
0x0000000000403d27 <GenNonCaptures+2702>&#58; callq  0x4010e0 <PopLSB>
0x0000000000403d2c <GenNonCaptures+2707>&#58; mov    %eax,%edx
0x0000000000403d2e <GenNonCaptures+2709>&#58; movslq %ebx,%rax
0x0000000000403d31 <GenNonCaptures+2712>&#58; mov    %edx,0xffffffffffffffd0&#40;%rbp,%rax,4&#41;
0x0000000000403d35 <GenNonCaptures+2716>&#58; mov    2980081&#40;%rip&#41;,%ecx        # 0x6db62c <xside>
0x0000000000403d3b <GenNonCaptures+2722>&#58; mov    2980071&#40;%rip&#41;,%eax        # 0x6db628 <side>
0x0000000000403d41 <GenNonCaptures+2728>&#58; cltq   
0x0000000000403d43 <GenNonCaptures+2730>&#58; mov    0xffffffffffffffd0&#40;%rbp,%rax,4&#41;,%eax
0x0000000000403d47 <GenNonCaptures+2734>&#58; lea    0xfffffffffffffff8&#40;%rax&#41;,%edx
0x0000000000403d4a <GenNonCaptures+2737>&#58; movslq %ecx,%rax
0x0000000000403d4d <GenNonCaptures+2740>&#58; mov    %edx,0xffffffffffffffd0&#40;%rbp,%rax,4&#41;
0x0000000000403d51 <GenNonCaptures+2744>&#58; mov    0xffffffffffffffd0&#40;%rbp&#41;,%edx
0x0000000000403d54 <GenNonCaptures+2747>&#58; mov    0xffffffffffffffd4&#40;%rbp&#41;,%eax
0x0000000000403d57 <GenNonCaptures+2750>&#58; shl    $0x6,%eax
0x0000000000403d5a <GenNonCaptures+2753>&#58; mov    %edx,%ecx
0x0000000000403d5c <GenNonCaptures+2755>&#58; or     %eax,%ecx
0x0000000000403d5e <GenNonCaptures+2757>&#58; mov    14456&#40;%rip&#41;,%edx        # 0x4075dc <PAWN>
0x0000000000403d64 <GenNonCaptures+2763>&#58; mov    2980030&#40;%rip&#41;,%eax        # 0x6db628 <side>
0x0000000000403d6a <GenNonCaptures+2769>&#58; or     %edx,%eax
0x0000000000403d6c <GenNonCaptures+2771>&#58; shl    $0xc,%eax
0x0000000000403d6f <GenNonCaptures+2774>&#58; or     %ecx,%eax
0x0000000000403d71 <GenNonCaptures+2776>&#58; mov    %eax,%edx
0x0000000000403d73 <GenNonCaptures+2778>&#58; mov    0xffffffffffffffc8&#40;%rbp&#41;,%rax
0x0000000000403d77 <GenNonCaptures+2782>&#58; mov    %edx,(%rax&#41;
0x0000000000403d79 <GenNonCaptures+2784>&#58; addq   $0x4,0xffffffffffffffc8&#40;%rbp&#41;
0x0000000000403d7e <GenNonCaptures+2789>&#58; mov    0xffffffffffffffd8&#40;%rbp&#41;,%rax

Code: Select all

while &#40;moves&#41; &#123;
		sqr&#91;side&#93; = PopLSB&#40;&moves&#41;;
		sqr&#91;xside&#93; = sqr&#91;side&#93; - 8;
		AddMove&#40;sqr&#91;0&#93;, sqr&#91;1&#93;, PAWN|side&#41;;
	&#125;

Code: Select all

#define AddMove&#40;from, to, pc&#41; \
	*list++ = (&#40;from&#41; | (&#40;to&#41; << 6&#41; | (&#40;pc&#41; << 12&#41;);

I might just keep my old move generator that has disjoint loops,... At the very least, I won't have to rewrite my engine to use the different data structures :-/. It was fun playing with the idea, though.

C++ templates can help to avoid source code duplication in cases:
1) white/black pieces moves;

I'm kind of new to C/C++, and my engine is written in C++, Do you have an example of how you might use a template for this?

Gerd Isenberg · Post by **Gerd Isenberg** » Fri Aug 10, 2007 1:41 pm

Jacob wrote: I might just keep my old move generator that has disjoint loops,... At the very least, I won't have to rewrite my engine to use the different data structures :-/. It was fun playing with the idea, though.

C++ templates can help to avoid source code duplication in cases:
1) white/black pieces moves;
I'm kind of new to C/C++, and my engine is written in C++, Do you have an example of how you might use a template for this?

With integer-templates someting like this is possible for "side" as compile time parameter.

Code: Select all

template <int side> 
void genPawnPushs&#40;u64 &moves&#41;
&#123;
   int pc = PAWN|side;
   while &#40;moves&#41; 
   &#123;
      u32 sqr&#91;2&#93;;
      sqr&#91;side&#93;   = PopLSB&#40;moves&#41;;
      sqr&#91;side^1&#93; = sqr&#91;side&#93; - 8;
      AddMove&#40;sqr&#91;0&#93;, sqr&#91;1&#93;, pc&#41;;    
   &#125;
&#125;

Since "side" is compile time invariant - either 0 or 1 for each incarnation - the optimizer can make it disappear.
The drawback is with a side-variable, you can only use it with a conditional branch ...

Code: Select all

if ( side == white )
   genPawnPushs<white>(...);
else
   genPawnPushs<black>(...);

... Which might be nice, if good predictable (likely) - because the templates pay off in smaller code of the loop-bodies.
There are other, probably better solutions for your purpose - to hoist the side-variant outside the loop body without introducing too much register pressure and memory accesses.

Your approach is to traverse different set-semantics, to get rid of the side-condition. You traverse either a set of from-squares or to-squares - and then you do a conditional swap of the from/to-squares via memory inside the loop body - which seems quite expensive according to your posted assembly. Ok, expensive is relative here - the memory is L1 and processor (at least AMD K8, K10) will apply store/load forwarding. Anyway, there are more and longer instructions.

You may try the conditional swap via registers instead of memory ...

Code: Select all

void genPawnPushs&#40;u64 moves, int side /* &#123;0,1&#125; */)
&#123;
   int pc = PAWN|side;
   side = -sise; // &#123;0,-1&#125;
   while &#40;moves&#41; 
   &#123;
      from = PopLSB&#40;&moves&#41;;
      to   =  from - 8;
      swpm = &#40;from ^ to&#41; & side;   // 0 if side == 0, otherwise from ^ to
      from =  from ^ swpm;         // becomes "to" if side == 1
      to   =  to   ^ swpm;         // becomes "from" if side == 1
      AddMove&#40;from, to, pc&#41;;    
   &#125;
&#125;

... but that also looks not that cheap either - four register ops inside the loop body. I suggest the "usual" and imho more consistant semantic of traversing target-sets only - and to calculate from-to delta (eg. +-8) by side as a loop invariant.

Code: Select all

void genPawnPushs&#40;u64 moves, int side /* &#123;0,1&#125; */)
&#123;
   int delta = side ? 8 &#58; -8; // hopefully a branchless cmove
   int pc = PAWN|side;
   while &#40;moves&#41; 
   &#123;
      from = PopLSB&#40;&moves&#41;;
      to   =  from + delta;
      AddMove&#40;from, to, pc&#41;;    
   &#125;
&#125;

or if the compiler doesn't like cmove

Code: Select all

void genPawnPushs&#40;u64 moves, int side /* &#123;0,1&#125; */)
&#123;
   int delta = side * 16 - 8; // &#40;side << 4&#41; - 8 or even delta = pawnPushDelta&#91;side&#93;
   int pc = PAWN|side;
   while &#40;moves&#41; 
   &#123;
      from = PopLSB&#40;&moves&#41;;
      to   =  from + delta;
      AddMove&#40;from, to, pc&#41;;    
   &#125;
&#125;

Some further remarks to your assembly - Gnu ATA-Syntax is a bit hard to read for me

You should inline PopLSB - and I also suggest to use a pure bitscanforward and to reset the found bit explicitly inside the loop by
moves &= moves - 1. This usually produces better and faster code, more parallel computation, and don't takes redundant tests for the continue-condition at the end of the loop. It also helps some compilers to keep the traversed set inside a register - but it is a matter of taste, and also the optimization-ability of the compiler. Of course it only works with bitscanforward but not with bitscanreverse that way.

Code: Select all

#ifdef ...
inline int bitscanforward&#40;u64 moves&#41;
&#123;
     long idx;
     _BitScanForward64&#40;&idx, moves&#41;;
     return &#40;int&#41; idx;
&#125;
#else
...
#endif

void genPawnPushs&#40;u64 moves, int side /* &#123;0,1&#125; */)
&#123;
   int delta = side * 16 - 8; // &#40;side << 4&#41; - 8
   int pc = PAWN|side;
   while &#40;moves&#41; 
   &#123;
      from = bitscanforward&#40;moves&#41;;
      AddMove&#40;from, from + delta, pc&#41;;
      moves &= moves - 1;    
   &#125;
&#125;

Cheers,
Gerd

More uses for magics?

Re: Generalized bitboard pawn moves

Re: Generalized bitboard pawn moves

Re: Generalized bitboard pawn moves

Re: Generalized bitboard pawn moves

Re: Generalized bitboard pawn moves