Check idea + bittwiddler request

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
Gerd Isenberg
Posts: 2127
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: Check idea + bittwiddler request

Post by Gerd Isenberg » Mon Mar 26, 2007 8:06 pm

jwes wrote: The bittwiddling part of this is finding the best way to determine if two squares are on a line, what direction the line is, and get a mask to determine if the squares between the two are vacant.
My bit-twiddling proposal to save the 4KByte (char[64][64]) lookup, to determine common ray and direction with some parallel tasks to logical shift four sign-bits of boolean differences right. A setcc-version has shorter code but worse ipc due to more (flag) dependencies.

Code: Select all

// return: 0 no common ray
//         1 same rank
//         2 same file
//         3 same diagonal
//         4 same antidiagonal
// (10 if same squares)

int isCommonRay(int a, int b) {
   int f = (b & 7) - (a & 7); // delta file
   int r = (b >>3) - (a >>3); // delta rank
   unsigned int x  =  a ^ b ;
   unsigned int y  = (r - f) & 255;
   unsigned int z  = (r + f) & 255;
   return   ( (x   - 8)>>31) // same rank
       + 2* (((x&7)- 1)>>31) // same file
       + 3* ( (y   - 1)>>31) // same diagonal
       + 4* ( (z   - 1)>>31) // same antidiagonal
       ;
}
generated x64 assembly:

Code: Select all

a$ = 8
b$ = 16
?isCommonRay@@YAHHH@Z PROC
  00000	8b c1		 mov	 eax, ecx
  00002	44 8b d1	 mov	 r10d, ecx
  00005	44 8b ca	 mov	 r9d, edx
  00008	83 e0 07	 and	 eax, 7
  0000b	44 33 d2	 xor	 r10d, edx
  0000e	41 83 e1 07	 and	 r9d, 7
  00012	44 2b c8	 sub	 r9d, eax
  00015	8b c1		 mov	 eax, ecx
  00017	44 8b c2	 mov	 r8d, edx
  0001a	c1 f8 03	 sar	 eax, 3
  0001d	41 c1 f8 03	 sar	 r8d, 3
  00021	44 2b c0	 sub	 r8d, eax
  00024	43 8d 04 08	 lea	 eax, DWORD PTR [r8+r9]
  00028	45 2b c1	 sub	 r8d, r9d
  0002b	0f b6 c8	 movzx	 ecx, al
  0002e	41 8b c2	 mov	 eax, r10d
  00031	83 e0 07	 and	 eax, 7
  00034	83 e9 01	 sub	 ecx, 1
  00037	83 e8 01	 sub	 eax, 1
  0003a	c1 e9 1f	 shr	 ecx, 31
  0003d	c1 e8 1f	 shr	 eax, 31
  00040	8d 0c 48	 lea	 ecx, DWORD PTR [rax+rcx*2]
  00043	41 0f b6 c0	 movzx	 eax, r8b
  00047	83 e8 01	 sub	 eax, 1
  0004a	c1 e8 1f	 shr	 eax, 31
  0004d	8d 04 40	 lea	 eax, DWORD PTR [rax+rax*2]
  00050	8d 04 48	 lea	 eax, DWORD PTR [rax+rcx*2]
  00053	41 8d 4a f8	 lea	 ecx, DWORD PTR [r10-8]
  00057	c1 e9 1f	 shr	 ecx, 31
  0005a	03 c1		 add	 eax, ecx
  0005c	c3		 ret	 0
To reduce inbetween-bitboard lookups from 32K to 1/2K with some computation of absolute differences plus left shift, e.g.:

Code: Select all

extern u64 inbetweenByDiff[64];

u64 inBetweenRay(u32 a, u32 b) {
   int xdelta = b ^ a;
   int delta  = b - a;
   u32 sameRank = (xdelta - 8) >> 31;
   u32 negaDiff = delta & (delta >> 31); 
   a += negaDiff; b -= negaDiff; // min; max
   return inbetweenByDiff&#91;b - a + sameRank&#93; << a;
&#125;
Gerd

Gerd Isenberg
Posts: 2127
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: Check idea + bittwiddler request

Post by Gerd Isenberg » Mon Mar 26, 2007 9:15 pm

Cleaner, more consistant source produces shorter code with two less registers used:

Code: Select all

// return&#58; 0 no common ray
//         1 same file
//         2 same rank
//         3 same diagonal
//         4 same antidiagonal
// ( 10 if same squares )

int isCommonRay&#40;int a, int b&#41; &#123;
   unsigned int f = &#40;b& 7&#41; - &#40;a& 7&#41;;
   unsigned int r = &#40;b>>3&#41; - &#40;a>>3&#41;;
   unsigned int g = r - f;
   unsigned int n = r + f;
   return   ( (&#40;r & 15&#41; - 1&#41; >> 31 )
       + 2* ( (&#40;f & 15&#41; - 1&#41; >> 31 )
       + 3* ( (&#40;g & 15&#41; - 1&#41; >> 31 )
       + 4* ( (&#40;n & 15&#41; - 1&#41; >> 31 )
       ;
&#125;

a$ = 8
b$ = 16
?isCommonRay@@YAHHH@Z PROC
  00000	8b c1		 mov	 eax, ecx
  00002	44 8b c2	 mov	 r8d, edx
  00005	c1 f9 03	 sar	 ecx, 3
  00008	83 e0 07	 and	 eax, 7
  0000b	c1 fa 03	 sar	 edx, 3
  0000e	41 83 e0 07	 and	 r8d, 7
  00012	44 2b c0	 sub	 r8d, eax
  00015	2b d1		 sub	 edx, ecx
  00017	42 8d 0c 02	 lea	 ecx, DWORD PTR &#91;rdx+r8&#93;
  0001b	41 8b c0	 mov	 eax, r8d
  0001e	83 e0 0f	 and	 eax, 15
  00021	83 e1 0f	 and	 ecx, 15
  00024	83 e8 01	 sub	 eax, 1
  00027	83 e9 01	 sub	 ecx, 1
  0002a	c1 e8 1f	 shr	 eax, 31
  0002d	c1 e9 1f	 shr	 ecx, 31
  00030	8d 0c 48	 lea	 ecx, DWORD PTR &#91;rax+rcx*2&#93;
  00033	8b c2		 mov	 eax, edx
  00035	83 e2 0f	 and	 edx, 15
  00038	41 2b c0	 sub	 eax, r8d
  0003b	83 ea 01	 sub	 edx, 1
  0003e	83 e0 0f	 and	 eax, 15
  00041	c1 ea 1f	 shr	 edx, 31
  00044	83 e8 01	 sub	 eax, 1
  00047	c1 e8 1f	 shr	 eax, 31
  0004a	8d 04 40	 lea	 eax, DWORD PTR &#91;rax+rax*2&#93;
  0004d	8d 04 48	 lea	 eax, DWORD PTR &#91;rax+rcx*2&#93;
  00050	03 c2		 add	 eax, edx
  00052	c3		 ret	 0

CThinker
Posts: 388
Joined: Wed Mar 08, 2006 9:08 pm
Contact:

Re: Check idea + bittwiddler request

Post by CThinker » Mon Mar 26, 2007 10:39 pm

There are two more complications:

1. En-passant capture.

The from-sq and to-sq has nothing to do with the diagonal that gets opened (the diagonals where the captured pawn is).

2. Casling.

The checking piece is the rook, and not the king.

Gerd Isenberg
Posts: 2127
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: Check idea + bittwiddler request

Post by Gerd Isenberg » Tue Mar 27, 2007 12:44 am

Some further optimization...

Code: Select all

// return&#58; 0 no common ray
//         1 same rank
//         2 same file
//         3 same diagonal
//         4 same antidiagonal
// ( 10 if same squares )

int isCommonRay&#40;int a, int b&#41; &#123;
   int r = &#40;b>>3&#41; - &#40;a>>3&#41;;
   int f = &#40;b &7&#41; - &#40;a &7&#41;;
   a = r - f;    b = r + f;
   return&#40;&#40;16 & (&#40;r&15&#41;-1&#41;)
        + &#40;32 & (&#40;f&15&#41;-1&#41;)
        + &#40;48 & (&#40;a&15&#41;-1&#41;)
        + &#40;64 & (&#40;b&15&#41;-1&#41;)
         ) >> 4;
&#125;


a$ = 8
b$ = 16
?isCommonRay@@YAHHH@Z PROC
  00000	8b c1		 mov	 eax, ecx
  00002	44 8b c2	 mov	 r8d, edx
  00005	83 e1 07	 and	 ecx, 7
  00008	c1 f8 03	 sar	 eax, 3
  0000b	41 c1 f8 03	 sar	 r8d, 3
  0000f	83 e2 07	 and	 edx, 7
  00012	44 2b c0	 sub	 r8d, eax
  00015	2b d1		 sub	 edx, ecx
  00017	42 8d 04 02	 lea	 eax, DWORD PTR &#91;rdx+r8&#93;
  0001b	41 8b c8	 mov	 ecx, r8d
  0001e	41 83 e0 0f	 and	 r8d, 15
  00022	2b ca		 sub	 ecx, edx
  00024	83 e0 0f	 and	 eax, 15
  00027	83 e2 0f	 and	 edx, 15
  0002a	83 e8 01	 sub	 eax, 1
  0002d	83 e1 0f	 and	 ecx, 15
  00030	83 ea 01	 sub	 edx, 1
  00033	83 e0 40	 and	 eax, 64
  00036	83 e9 01	 sub	 ecx, 1
  00039	83 e2 20	 and	 edx, 32
  0003c	83 e1 30	 and	 ecx, 48
  0003f	41 83 e8 01	 sub	 r8d, 1
  00043	03 c1		 add	 eax, ecx
  00045	41 83 e0 10	 and	 r8d, 16
  00049	03 c2		 add	 eax, edx
  0004b	41 03 c0	 add	 eax, r8d
  0004e	c1 f8 04	 sar	 eax, 4
  00051	c3		 ret	 0

Gerd Isenberg
Posts: 2127
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: Check idea + bittwiddler request

Post by Gerd Isenberg » Fri Apr 06, 2007 9:14 am

How to calculate inBetween-bitboards by two squares (sq1, sq2) without any lookup:

Not that I claim it is faster than a 32KByte lookup - at least this routine might be usefull to initialize the lookup-array and probably to use it up and then, to hide some other memory accesses. Vc2005 uses the seven scratch registers for it in 64-bit mode (rax, rcx, rdx, r08..r11) with some parallel execution.

Similar to the isCommonRay-routine it computes {-1,0}-masks (except the least significant nibble) for zero file- and rank-differences, file-difference equals rank-difference and file-difference equals minus rank-difference (sum is zero).

It then ands those masks with appropriate 64-bit ray-constants and adds (ors) all together to one "normalized" ray-mask with six bit set, since all conditions are usually distinct (except in case of both squares equal).

The routine ensures that sq2 >= sq1, thus it multiplies the sq1-bitboard (1ULL << sq1) with the ray-mask to shift it to its desired place. For instance how it works for b3->f7:

both squares

Code: Select all

 0 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0
 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0
ray-mask 0x0040201008040200 due to file-difference ^ rank-differece == 0

Code: Select all

 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 0
 0 0 0 0 0 1 0 0
 0 0 0 0 1 0 0 0
 0 0 0 1 0 0 0 0
 0 0 1 0 0 0 0 0
 0 1 0 0 0 0 0 0
 0 0 0 0 0 0 0 0
ray-mask * sq1BB

Code: Select all

 0 0 0 0 0 0 1 0
 0 0 0 0 0 1 0 0
 0 0 0 0 1 0 0 0
 0 0 0 1 0 0 0 0
 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0
sq2BB - 1

Code: Select all

 0 0 0 0 0 0 0 0
 1 1 1 1 1 0 0 0
 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1
(ray-mask * sq1BB) & (sq2BB - 1) leaves the desired inbetween set, and also ensures that same squares results in an empty set:

Code: Select all

 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0
 0 0 0 1 0 0 0 0
 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0
The routine and assembly:

Code: Select all

u64 inBetween&#40;int sq1, int sq2&#41;
&#123;
    int delta = sq2 - sq1;
    sq1 += delta & &#40;delta >> 31&#41;; // min
    sq2 -= delta & &#40;delta >> 31&#41;; // max

    u32 rankDiff, fileDiff, antiDiff, diaxDiff;
    rankDiff  = &#40;sq2 >>3&#41; - &#40;sq1 >>3&#41;;
    fileDiff  = &#40;sq2 & 7&#41; - &#40;sq1 & 7&#41;;
    antiDiff  = &#40;rankDiff + fileDiff&#41;;
    fileDiff &= 15;
    antiDiff &= 15;
    diaxDiff  = &#40;rankDiff ^ fileDiff&#41;;

    u64 between;
    between   =     &#40;rankDiff-1 & 0x000007E0&#41;  >>  4;
    between  += &#40;u64&#41;fileDiff-1 & 0x0001010101010100;
    between  += &#40;u64&#41;diaxDiff-1 & 0x0040201008040200;
    between  += &#40;u64&#41;antiDiff-1 & 0x0000040810204080;
    between  *= &#40;u64&#41;1 << sq1;
    between  &=(&#40;u64&#41;1 << sq2&#41; - 1;
    return between;
&#125;

Code: Select all

?inBetween@@YA_KHH@Z PROC
  00000	8b c2		 mov	 eax, edx
  00002	44 8b da	 mov	 r11d, edx
  00005	2b c1		 sub	 eax, ecx
  00007	44 8b c0	 mov	 r8d, eax
  0000a	41 c1 f8 1f	 sar	 r8d, 31
  0000e	44 23 c0	 and	 r8d, eax
  00011	45 2b d8	 sub	 r11d, r8d
  00014	45 8d 14 08	 lea	 r10d, DWORD PTR &#91;r8+rcx&#93;
  00018	41 8b cb	 mov	 ecx, r11d
  0001b	41 8b d3	 mov	 edx, r11d
  0001e	83 e1 07	 and	 ecx, 7
  00021	c1 fa 03	 sar	 edx, 3
  00024	41 8b c2	 mov	 eax, r10d
  00027	c1 f8 03	 sar	 eax, 3
  0002a	49 b8 00 02 04
	08 10 20 40 00	 mov	 r8, 0040201008040200H
  00034	2b d0		 sub	 edx, eax
  00036	41 8b c2	 mov	 eax, r10d
  00039	83 e0 07	 and	 eax, 7
  0003c	44 8b ca	 mov	 r9d, edx
  0003f	2b c8		 sub	 ecx, eax
  00041	8d 04 11	 lea	 eax, DWORD PTR &#91;rcx+rdx&#93;
  00044	83 e1 0f	 and	 ecx, 15
  00047	83 e0 0f	 and	 eax, 15
  0004a	4c 33 c9	 xor	 r9, rcx
  0004d	48 83 c1 ff	 add	 rcx, -1
  00051	48 83 e8 01	 sub	 rax, 1
  00055	49 83 e9 01	 sub	 r9, 1
  00059	4d 23 c8	 and	 r9, r8
  0005c	49 b8 80 40 20
	10 08 04 00 00	 mov	 r8, 0000040810204080H
  00066	49 23 c0	 and	 rax, r8
  00069	41 b8 01 00 00
	00		 mov	 r8d, 1
  0006f	4c 03 c8	 add	 r9, rax
  00072	8d 42 ff	 lea	 eax, DWORD PTR &#91;rdx-1&#93;
  00075	49 8b d0	 mov	 rdx, r8
  00078	48 c1 e8 04	 shr	 rax, 4
  0007c	83 e0 7e	 and	 eax, 0000007eH
  0007f	4c 03 c8	 add	 r9, rax
  00082	48 b8 00 01 01
	01 01 01 01 00	 mov	 rax, 0001010101010100H
  0008c	48 23 c1	 and	 rax, rcx
  0008f	41 8b ca	 mov	 ecx, r10d
  00092	48 d3 e2	 shl	 rdx, cl
  00095	49 03 c1	 add	 rax, r9
  00098	41 8b cb	 mov	 ecx, r11d
  0009b	49 d3 e0	 shl	 r8, cl
  0009e	49 83 e8 01	 sub	 r8, 1
  000a2	48 0f af c2	 imul	 rax, rdx
  000a6	49 23 c0	 and	 rax, r8
  000a9	c3		 ret	 0
Cheers,
Gerd

Gerd Isenberg
Posts: 2127
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: Check idea + bittwiddler request

Post by Gerd Isenberg » Fri Apr 06, 2007 6:06 pm

Gerd Isenberg wrote:How to calculate inBetween-bitboards by two squares (sq1, sq2) without any lookup:

Code: Select all

u64 inBetween&#40;int sq1, int sq2&#41;
&#123;
    int delta = sq2 - sq1;
    sq1 += delta & &#40;delta >> 31&#41;; // min
    sq2 -= delta & &#40;delta >> 31&#41;; // max

    u32 rankDiff, fileDiff, antiDiff, diaxDiff;
    rankDiff  = &#40;sq2 >>3&#41; - &#40;sq1 >>3&#41;;
    fileDiff  = &#40;sq2 & 7&#41; - &#40;sq1 & 7&#41;;
    antiDiff  = &#40;rankDiff + fileDiff&#41;;
    fileDiff &= 15;
    antiDiff &= 15;
    diaxDiff  = &#40;rankDiff ^ fileDiff&#41;;

    u64 between;
    between   =     &#40;rankDiff-1 & 0x000007E0&#41;  >>  4;
    between  += &#40;u64&#41;fileDiff-1 & 0x0001010101010100;
    between  += &#40;u64&#41;diaxDiff-1 & 0x0040201008040200;
    between  += &#40;u64&#41;antiDiff-1 & 0x0000040810204080;
    between <<= sq1;
    between  &=(&#40;u64&#41;1 << sq2&#41; - 1;
    return between;
&#125;
oups, of course

Code: Select all

    between  *= &#40;u64&#41;1 << sq1;
can be replaced by...

Code: Select all

    between <<= sq1;

Gerd Isenberg
Posts: 2127
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: Check idea + bittwiddler request

Post by Gerd Isenberg » Tue Apr 10, 2007 8:17 pm

Some further (final?) remarks on those didactical 64*64-lookup replacement-routines to get 1. the kind of ray (if any) of two squares (0..63) and/or 2. the inbetween bitsets. Again they are intended to initialize the lookup arrays or to replace them in high memory traffic situations and/or to relax cache pollution. The routines are based on building boolean {-1,0}-masks of nibble-differences for ...

Code: Select all

fileDifference == 0 -> -1 if same file, else < 15
rankDifference == 0 -> -1 if same rank, else < 15
fileDifference ^ rankDifference == 0 -> -1 if same diagonal, else < 15
fileDifference + rankDifference == 0 -> -1 if same antidiagonal, else < 15.
... and to perform further ~nibble-0-ands to combine (add,xor,or) them to a final result.

There are four independent and,dec,and chains - interesting to see what assembly the compiler produces - a kind of benchmark. VC2005 is sensible to minor source code changes and really acts like a high level assembler ;-)


This seems shortest for 32-bit mode - thanks to the one-byte dec-instruction, which is not available in 64-bit mode. To get the dense 5 (instead of 7) as index for same squares, the strange "+^|" -sequence is used, instead of three times "or".

Code: Select all

// return&#58; 1 if same rank
//         2 if same file
//         3 if same diagonal
//         4 if same antidiagonal
//         5 if same square
//         0 otherwise

u32 onSameRay&#40;u32 sq1, u32 sq2&#41; &#123;
    u32 rankDiff,  fileDiff, 
        antiDiff,  diaxDiff,  rayindex;
    rankDiff  = (&#40;sq2 | 7&#41; -  sq1&#41; >>3;
    fileDiff  = &#40;sq2 &  7&#41; - &#40;sq1 & 7&#41;;
    antiDiff  = rankDiff   +  fileDiff;
    // differences as signed nibbles
    rankDiff  = rankDiff   &        15;
    fileDiff  = fileDiff   &        15;
    antiDiff  = antiDiff   &        15;
    diaxDiff  = rankDiff   ^  fileDiff;
    rayindex  = antiDiff-1 &    4 * 16;
    rayindex += diaxDiff-1 &    3 * 16;
    rayindex ^= fileDiff-1 &    2 * 16;
    rayindex |= rankDiff-1 &    1 * 16;
    return rayindex >> 4;
&#125;
Applying De Morgan results in slightly better scheduling as well as shorter code in 64-bit mode, despite the additional not-instruction ...

Code: Select all

// return&#58; 1 if same rank
//         2 if same file
//         3 if same diagonal
//         4 if same antidiagonal
//         7 if same square
//         0 otherwise

u32 onSameRay&#40;u32 sq1, u32 sq2&#41; &#123;
    u32 rankDiff, fileDiff, 
        antiDiff, diaxDiff, rayindex;
    rankDiff  = (&#40;sq2| 7&#41; - sq1&#41; >>3;
    fileDiff  = &#40;sq2 & 7&#41; -&#40;sq1 & 7&#41;;
    antiDiff  =  rankDiff + fileDiff;
    rankDiff  =  rankDiff &       15;

    fileDiff  =  fileDiff &       15;
    antiDiff  =  antiDiff &       15;
    diaxDiff  =  rankDiff ^ fileDiff;
    rayindex  = -antiDiff |      ~64;
    rayindex &= -diaxDiff |      ~48;
    rayindex &= -fileDiff |      ~32;
    rayindex &= -rankDiff |      ~16;
    return ~rayindex >> 4;
&#125;
The onSameRay-routine might be used to lookup an 8-bitboard array - to calculate the final inbetween-bitboard by shift and mask. Two additional shifts, xors safe calculating min and max of the two squares ...

Code: Select all

u64 inBetween&#40;u32 sq1, u32 sq2&#41;
&#123;
    const u64 o = 1;
    static const u64 rays&#91;8&#93; =
    &#123;
        0x0000000000000000,
        0x000000000000007e, // 1.rank, inner six
        0x0001010101010100, // a-file, inner six
        0x0040201008040200, // a1h8-diagonal, inner six
        0x0000040810204080, // h1a8-antidiagonal, inner six, considering h1 == 7
        0x0000000000000000, 
        0x0000000000000000,
        0x0000000000000000,
    &#125;;
    u64 ray, btw;
    ray = rays&#91;onSameRay&#40;sq1, sq2&#41;&#93;;
    ray = &#40;ray<<sq1&#41;   ^ &#40;ray<<sq2&#41;;
    btw = (&#40;o<<sq1&#41;-o&#41; ^ (&#40;o<<sq2&#41;-o&#41;;
    return ray & btw;
&#125;
how ((1<<sq1)-1) ^ ((1<<sq2)-1) works with b3,f7 to get all bits "between" them including b3:

Code: Select all

 0 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0    0 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 0    1 1 1 1 1 0 0 0    0 0 0 0 0 0 0 0    1 1 1 1 1 0 0 0
 0 0 0 0 0 0 0 0    1 1 1 1 1 1 1 1    0 0 0 0 0 0 0 0    1 1 1 1 1 1 1 1
 0 0 0 0 0 0 0 0    1 1 1 1 1 1 1 1    0 0 0 0 0 0 0 0    1 1 1 1 1 1 1 1
 0 0 0 0 0 0 0 0    1 1 1 1 1 1 1 1    0 0 0 0 0 0 0 0    1 1 1 1 1 1 1 1
 0 1 0 0 0 0 0 0    1 1 1 1 1 1 1 1    1 0 0 0 0 0 0 0    0 1 1 1 1 1 1 1
 0 0 0 0 0 0 0 0    1 1 1 1 1 1 1 1    1 1 1 1 1 1 1 1    0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0    1 1 1 1 1 1 1 1    1 1 1 1 1 1 1 1    0 0 0 0 0 0 0 0
... but applying min/max seems to result in shorter and faster code (vc2005 needs to load four times cl prior to the variable shifts), since it also schedules with the independent code of the inlined onSameRay-routine:

Code: Select all

u64 inBetween&#40;u32 sq1, u32 sq2&#41;
&#123;
    const u64 o = 1;
    static const u64 rays&#91;8&#93; =
    &#123;
        0x0000000000000000,
        0x000000000000007e, // 1.rank, inner six
        0x0001010101010100, // a-file, inner six
        0x0040201008040200, // a1h8-diagonal, inner six
        0x0000040810204080, // h1a8-antidiagonal, inner six, considering h1 == 7
        0x0000000000000000,
        0x0000000000000000,
        0x0000000000000000,
    &#125;;
    u64 ray, btw;
    ray = rays&#91;onSameRay&#40;sq1, sq2&#41;&#93;;

    int delta = sq2 - sq1;
    sq1  += delta & &#40;delta >> 31&#41;; // min
    sq2  -= delta & &#40;delta >> 31&#41;; // max

    btw  = &#40;o << sq2&#41; - o;
    return &#40;ray << sq1&#41; & btw;
&#125;
Direct calculation with immediate 64-bit ray-masks - without any lookups - results in a few more code-bytes (the 10-byte opcodes for the immediate 64-bit loads). It seems better to calculate min/max in advance here. A dependency during startup safes one &15 nibblemask, since rankDiff is always positive. Trying to calculate min/max independently from the ray-Differences to gain more parallelism seems to confuse the optimizer of vc2005 ;-)

Code: Select all

u64 inBetween&#40;u32 sq1, u32 sq2&#41;
&#123;   // ensure sq2 >= sq1
    int delta = sq2 - sq1;
    sq1  += delta & &#40;delta >> 31&#41;; // min
    sq2  -= delta & &#40;delta >> 31&#41;; // max

    // differences as signed nibbles &#40;4 bit&#41;
    //  rankDiff always positive &#40;0..7&#41; sq2>=sq1
    //  fileDiff, antiDiff may negative and require & 15
    u32 rankDiff, fileDiff, antiDiff, diaxDiff;	
    rankDiff  =(&#40;sq2 | 7&#41; -  sq1&#41; >>3,
    fileDiff  = &#40;sq2 & 7&#41; - &#40;sq1 & 7&#41;;
    antiDiff  = &#40;rankDiff + fileDiff&#41;;
    fileDiff &= 15,    antiDiff &= 15;
    diaxDiff  = &#40;rankDiff ^ fileDiff&#41;;

    // if difference == 0 -> -1-mask
    // otherwise -> 0..14 ->  0-mask
    const u64 o = 1, m1 = -1;	u64 between; 
    between   = 2 * (&#40;rankDiff  - 1&#41; >> 26&#41;;
    between  |= &#40;m1 + fileDiff&#41; & 0x0001010101010100,
    between  |= &#40;m1 + diaxDiff&#41; & 0x8040201008040200,
    between  |= &#40;m1 + antiDiff&#41; & 0x0002040810204080;
    between   = &#40;m1 + &#40;o<<sq2&#41;) & ( between << sq1 );
    return between;
&#125;
One may consider the vertical flip of the 7-bit diagonal/antidiagonal masks, using bswap aka _byteswap_uint64 instead of a third immediate load. The one square "longer" rays get masked off anyway...


Some assembly by vc2005:

32-bit mode - hmm, OOE may help a lot.

Code: Select all

?onSameRay@@YAIII@Z PROC
; _sq1$ = edx
; _sq2$ = eax
  00000	8b c8		 mov	 ecx, eax
  00002	83 e0 07	 and	 eax, 7
  00005	83 c9 07	 or	 ecx, 7
  00008	2b ca		 sub	 ecx, edx
  0000a	83 e2 07	 and	 edx, 7
  0000d	2b c2		 sub	 eax, edx
  0000f	c1 e9 03	 shr	 ecx, 3
  00012	56		 push	 esi
  00013	8b f0		 mov	 esi, eax
  00015	8d 14 0e	 lea	 edx, DWORD PTR &#91;esi+ecx&#93;
  00018	83 e6 0f	 and	 esi, 15	
  0001b	83 e1 0f	 and	 ecx, 15	
  0001e	8b c6		 mov	 eax, esi
  00020	33 c1		 xor	 eax, ecx
  00022	48		 dec	 eax
  00023	83 e0 30	 and	 eax, 48	
  00026	83 e2 0f	 and	 edx, 15	
  00029	4a		 dec	 edx
  0002a	83 e2 40	 and	 edx, 64	
  0002d	03 c2		 add	 eax, edx
  0002f	4e		 dec	 esi
  00030	83 e6 20	 and	 esi, 32	
  00033	33 c6		 xor	 eax, esi
  00035	49		 dec	 ecx
  00036	83 e1 10	 and	 ecx, 16	
  00039	0b c1		 or	 eax, ecx
  0003b	c1 e8 04	 shr	 eax, 4
  0003e	5e		 pop	 esi
  0003f	c3		 ret	 0
64-bit mode: longer opcode due to rex-prefix using r8 and one byte increment opcodes are reserved for prefixes and not available - thus neg is one byte shorter opcode than it's one's complement counterpart sub 1.

Code: Select all

sq1$ = 8
sq2$ = 16
?onSameRay@@YAIII@Z PROC
  00000	44 8b c2	 mov	 r8d, edx
  00003	83 e2 07	 and	 edx, 7
  00006	41 83 c8 07	 or	 r8d, 7
  0000a	44 2b c1	 sub	 r8d, ecx
  0000d	83 e1 07	 and	 ecx, 7
  00010	2b d1		 sub	 edx, ecx
  00012	41 c1 e8 03	 shr	 r8d, 3
  00016	42 8d 0c 02	 lea	 ecx, DWORD PTR &#91;rdx+r8&#93;
  0001a	83 e2 0f	 and	 edx, 15
  0001d	41 83 e0 0f	 and	 r8d, 15
  00021	8b c2		 mov	 eax, edx
  00023	83 e1 0f	 and	 ecx, 15
  00026	f7 da		 neg	 edx
  00028	41 33 c0	 xor	 eax, r8d
  0002b	83 ca 10	 or	 edx, 16
  0002e	41 f7 d8	 neg	 r8d
  00031	f7 d8		 neg	 eax
  00033	41 83 c8 ef	 or	 r8d, -17
  00037	f7 d9		 neg	 ecx
  00039	23 c2		 and	 eax, edx
  0003b	83 c9 bf	 or	 ecx, -65
  0003e	41 23 c0	 and	 eax, r8d
  00041	83 c8 cf	 or	 eax, -49
  00044	23 c1		 and	 eax, ecx
  00046	f7 d0		 not	 eax
  00048	c1 e8 04	 shr	 eax, 4
  0004b	c3		 ret	 0

Code: Select all

sq1$ = 8
sq2$ = 16
?inBetween@@YA_KII@Z PROC
  00000	8b c1		 mov	 eax, ecx
  00002	44 8b c2	 mov	 r8d, edx
  00005	44 8b d1	 mov	 r10d, ecx
  00008	83 e0 07	 and	 eax, 7
  0000b	44 8b ca	 mov	 r9d, edx
  0000e	44 8b da	 mov	 r11d, edx
  00011	41 83 e0 07	 and	 r8d, 7
  00015	41 83 c9 07	 or	 r9d, 7
  00019	44 2b c0	 sub	 r8d, eax
  0001c	44 2b c9	 sub	 r9d, ecx
  0001f	8b c2		 mov	 eax, edx
  00021	41 2b c2	 sub	 eax, r10d
  00024	41 c1 e9 03	 shr	 r9d, 3

  00028	8b d0		 mov	 edx, eax
  0002a	43 8d 0c 08	 lea	 ecx, DWORD PTR &#91;r8+r9&#93;
  0002e	41 83 e0 0f	 and	 r8d, 15
  00032	83 e1 0f	 and	 ecx, 15
  00035	c1 fa 1f	 sar	 edx, 31
  00038	41 83 e1 0f	 and	 r9d, 15
  0003c	23 d0		 and	 edx, eax
  0003e	41 8b c0	 mov	 eax, r8d
  00041	f7 d9		 neg	 ecx
  00043	41 33 c1	 xor	 eax, r9d
  00046	83 c9 bf	 or	 ecx, -65
  00049	41 f7 d8	 neg	 r8d
  0004c	f7 d8		 neg	 eax
  0004e	41 83 c8 10	 or	 r8d, 16
  00052	41 f7 d9	 neg	 r9d
  00055	41 23 c0	 and	 eax, r8d
  00058	41 83 c9 ef	 or	 r9d, -17
  0005c	4c 8d 05 00 00
	00 00		 lea	 r8, OFFSET FLAT&#58;?rays@?1??inBetween
  00063	41 23 c1	 and	 eax, r9d
  00066	83 c8 cf	 or	 eax, -49
  00069	23 c1		 and	 eax, ecx
  0006b	42 8d 0c 12	 lea	 ecx, DWORD PTR &#91;rdx+r10&#93;
  0006f	f7 d0		 not	 eax
  00071	48 c1 e8 04	 shr	 rax, 4
  00075	49 8b 04 c0	 mov	 rax, QWORD PTR &#91;r8+rax*8&#93;
  00079	48 d3 e0	 shl	 rax, cl
  0007c	41 8b cb	 mov	 ecx, r11d
  0007f	2b ca		 sub	 ecx, edx
  00081	ba 01 00 00 00	 mov	 edx, 1
  00086	48 d3 e2	 shl	 rdx, cl
  00089	48 83 ea 01	 sub	 rdx, 1
  0008d	48 23 c2	 and	 rax, rdx
  00090	c3		 ret	 0

Code: Select all

sq1$ = 8
sq2$ = 16
?inBetween@@YA_KII@Z PROC
  00000	8b c2		 mov	 eax, edx
  00002	49 bb 80 40 20
	10 08 04 02 00	 mov	 r11, 0002040810204080H
  0000c	2b c1		 sub	 eax, ecx
  0000e	44 8b c0	 mov	 r8d, eax
  00011	41 c1 f8 1f	 sar	 r8d, 31
  00015	44 23 c0	 and	 r8d, eax

  00018	41 2b d0	 sub	 edx, r8d
  0001b	45 8d 14 08	 lea	 r10d, DWORD PTR &#91;r8+rcx&#93;
  0001f	8b ca		 mov	 ecx, edx
  00021	44 8b c2	 mov	 r8d, edx
  00024	83 e1 07	 and	 ecx, 7
  00027	41 83 c8 07	 or	 r8d, 7
  0002b	41 8b c2	 mov	 eax, r10d
  0002e	83 e0 07	 and	 eax, 7
  00031	45 2b c2	 sub	 r8d, r10d
  00034	2b c8		 sub	 ecx, eax
  00036	41 c1 e8 03	 shr	 r8d, 3
  0003a	42 8d 04 01	 lea	 eax, DWORD PTR &#91;rcx+r8&#93;
  0003e	83 e1 0f	 and	 ecx, 15
  00041	45 8b c8	 mov	 r9d, r8d
  00044	83 e0 0f	 and	 eax, 15
  00047	4c 33 c9	 xor	 r9, rcx
  0004a	48 83 c1 ff	 add	 rcx, -1
  0004e	48 83 e8 01	 sub	 rax, 1
  00052	49 83 e9 01	 sub	 r9, 1
  00056	49 23 c3	 and	 rax, r11
  00059	4d 23 cb	 and	 r9, r11
  0005c	49 0f c9	 bswap	 r9
  0005f	4c 0b c8	 or	 r9, rax
  00062	41 8d 40 ff	 lea	 eax, DWORD PTR &#91;r8-1&#93;
  00066	c1 e8 1a	 shr	 eax, 26
  00069	03 c0		 add	 eax, eax
  0006b	4c 0b c8	 or	 r9, rax
  0006e	48 b8 00 01 01
	01 01 01 01 00	 mov	 rax, 0001010101010100H
  00078	48 23 c1	 and	 rax, rcx
  0007b	41 8b ca	 mov	 ecx, r10d
  0007e	49 0b c1	 or	 rax, r9
  00081	48 d3 e0	 shl	 rax, cl
  00084	8b ca		 mov	 ecx, edx
  00086	ba 01 00 00 00	 mov	 edx, 1
  0008b	48 d3 e2	 shl	 rdx, cl
  0008e	48 83 ea 01	 sub	 rdx, 1
  00092	48 23 c2	 and	 rax, rdx
  00095	c3		 ret	 0
Cheers,
Gerd

Gerd Isenberg
Posts: 2127
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: Check idea + bittwiddler request

Post by Gerd Isenberg » Thu Apr 12, 2007 6:55 pm

Another idea is to get the inbetween bits as mentioned

Code: Select all

    btwnbits  = &#40;1 <<sq1&#41; - 1;
    btwnbits ^= &#40;1 <<sq2&#41; - 1;
at the very beginning of the inbetween-routine, which nicely schedules with calculating the differences independently. Instead of shifting raybits left by min(sq1,sq2), one can forgot sq1, sq2 early - to relax registers. The isolated ls1b, (btwnbits & -btwnbits) is already 1 << min(sq1,sq2). The minor drawback is the use of imul instead of shift. But in total this code is more compact, has less dependencies, and takes one register less, compared to the other none lookup approaches. Considering shadow registers and oooe, the vc2005 generated assembly looks quite optimal.

Code: Select all

u64 inBetween&#40;u32 sq1, u32 sq2&#41;
&#123;
    const u64 o = 1, m1 = -1; 
    const u64 a2a6 = 0x0001010101010100;
    const u64 b7h1 = 0x0002040810204080;
    u64 btwnbits, raybits; 
    u32 rankDiff, fileDiff, antiDiff, diaxDiff;	

    btwnbits  = &#40;o <<sq1&#41; - o;
    btwnbits ^= &#40;o <<sq2&#41; - o;
    rankDiff  =(&#40;sq2 | 7&#41; -  sq1&#41; >> 3,
    fileDiff  = &#40;sq2 & 7&#41; -  &#40;sq1 & 7&#41;;
    antiDiff  = rankDiff  +   fileDiff;
    rankDiff  = rankDiff  &   15;
    fileDiff  = fileDiff  &   15;
    antiDiff  = antiDiff  &   15;
    diaxDiff  = rankDiff  ^   fileDiff;
    raybits   = 2*(&#40;rankDiff-1&#41; >> 26&#41;;
    raybits  |= _byteswap_uint64&#40;&#40;m1 + diaxDiff&#41; & b7h1&#41;;
    raybits  |= &#40;m1 + antiDiff&#41; & b7h1;
    raybits  |= &#40;m1 + fileDiff&#41; & a2a6;
    raybits  *= btwnbits  &  -btwnbits;
    return      raybits   &   btwnbits;
&#125;

Code: Select all

sq1$ = 8
sq2$ = 16
?inBetween@@YA_KII@Z PROC
  00000	44 8b c1	 mov	 r8d, ecx
  00003	b8 01 00 00 00	 mov	 eax, 1
  00008	49 ba 80 40 20
	10 08 04 02 00	 mov	 r10, 0002040810204080H
  00012	4c 8b c8	 mov	 r9, rax
  00015	49 d3 e1	 shl	 r9, cl
  00018	8b ca		 mov	 ecx, edx
  0001a	83 e2 07	 and	 edx, 7
  0001d	4c 2b c8	 sub	 r9, rax
  00020	48 d3 e0	 shl	 rax, cl
  00023	83 c9 07	 or	 ecx, 7
  00026	41 2b c8	 sub	 ecx, r8d
  00029	48 83 e8 01	 sub	 rax, 1
  0002d	41 83 e0 07	 and	 r8d, 7
  00031	4c 33 c8	 xor	 r9, rax
  00034	41 2b d0	 sub	 edx, r8d
  00037	c1 e9 03	 shr	 ecx, 3
  0003a	8d 04 0a	 lea	 eax, DWORD PTR &#91;rdx+rcx&#93;
  0003d	83 e1 0f	 and	 ecx, 15
  00040	83 e2 0f	 and	 edx, 15
  00043	83 e0 0f	 and	 eax, 15
  00046	44 8b c1	 mov	 r8d, ecx
  00049	48 83 e8 01	 sub	 rax, 1
  0004d	4c 33 c2	 xor	 r8, rdx
  00050	49 23 c2	 and	 rax, r10
  00053	49 83 e8 01	 sub	 r8, 1
  00057	4d 23 c2	 and	 r8, r10
  0005a	49 0f c8	 bswap	 r8
  0005d	4c 0b c0	 or	 r8, rax
  00060	8d 41 ff	 lea	 eax, DWORD PTR &#91;rcx-1&#93;
  00063	c1 e8 1a	 shr	 eax, 26
  00066	8d 0c 00	 lea	 ecx, DWORD PTR &#91;rax+rax&#93;
  00069	48 b8 00 01 01
	01 01 01 01 00	 mov	 rax, 0001010101010100H
  00073	4c 0b c1	 or	 r8, rcx
  00076	48 8d 4a ff	 lea	 rcx, QWORD PTR &#91;rdx-1&#93;
  0007a	48 23 c1	 and	 rax, rcx
  0007d	49 8b c9	 mov	 rcx, r9
  00080	49 0b c0	 or	 rax, r8
  00083	48 f7 d9	 neg	 rcx
  00086	49 23 c9	 and	 rcx, r9
  00089	48 0f af c1	 imul	 rax, rcx
  0008d	49 23 c1	 and	 rax, r9
  00090	c3		 ret	 0
Gerd

Post Reply