Fastest 32 (64?) bit BishopAttacks()

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Fastest 32 (64?) bit BishopAttacks()

Post by Michael Sherwin »

Yes, it's me again--sorry!

As I stated in another thread, I believe that I had developed the fastest bitboard move generator for either 32 bit or 64 bit machines and that I would post about them. The last bitboard generator that I posted about was to just get a 'read' on that method before I had completly made up my mind on which was fastest.

This is the bishop generator. The rook will follow later.

Code: Select all

typedef union {
  u32   u;
  u08   1;
  u08   2;
  u08   3;
  u08   4;
} split32;

typedef union {
  u64    u;
  u32   lo;
  u32   hi;
} split64;
The shift amounts in sq->bishopShifts:

Code: Select all

17, 19, 19, 17, 17, 18, 18, 16,
17, 17, 17, 17, 18, 18, 16, 17,
17, 17, 17, 17, 17, 15, 17, 17,
16, 16, 16, 16, 16, 16, 18, 18,
16, 18, 16, 16, 16, 16, 19, 20,
15, 17, 15, 17, 17, 17, 19, 17,
18, 18, 20, 20, 17, 17, 17, 17,
16, 17, 16, 16, 17, 17, 17, 17

Code: Select all

u64 BishopAttacks(square * sq, split64 * occ) {
  sq->blockers.u = occ.u & sq->bishopBits;
  sq->index = sq->blockers.lo + (sq->blockers.hi >> 1);
  sq->index |= (sq->index >> sq->bishopShifts);
  return bishopAttacks[sq->bMinimal1[sq->index.1] | sq->bMinimal2[sq->index.2]];
}
Note: no masking of the index is needed, because the shift amounts work out and
also, because of the split 32 union. The bishopAttacks subtable address is also
pre computed and part of the sq->bMinimal1 array.

Features:
* Very fast (maybe fastest) on 64 bit machines as well !
* Code will work as is for both 32 bit and 64 bit machines !!
* Only one dimentional array indexing calculations !!!
* For 32 bit machines, very few 64 bit accesses !!!!
* Most data accessed with one pointer !!!!!
* Very small memory footprint compared to magic !!!!!!
* No need for slow imul machine instructions !!!!!!!

Ofcourse mistakes are possible and optimizations are possible too.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Alessandro Scotti

Re: Fastest 32 (64?) bit BishopAttacks()

Post by Alessandro Scotti »

Great!!!!!!!!!! :D
Pradu
Posts: 287
Joined: Sat Mar 11, 2006 3:19 am
Location: Atlanta, GA

Re: Fastest 32 (64?) bit BishopAttacks()

Post by Pradu »

Michael Sherwin wrote:Yes, it's me again--sorry!

As I stated in another thread, I believe that I had developed the fastest bitboard move generator for either 32 bit or 64 bit machines and that I would post about them. The last bitboard generator that I posted about was to just get a 'read' on that method before I had completly made up my mind on which was fastest.
Quite good, just did a quick trace on square D4:

after "sq->blockers.u = occ.u & sq->bishopBits;," sq->blokcers.u is

Code: Select all

00000000
000000e0
0f000d00
00g0c000
000X0000
00b0h000
0a000i00
00000000
after "sq->index = sq->blockers.lo + (sq->blockers.hi >> 1);," sq->index is

Code: Select all

000X0000
00b0h000
0a000i00
00000000
+
00000000
000000e0
0f000d00
00g0c000 >>1

000X0000
00b0h000
0a000i00
00000000
+
00000000
00000e00
f000d000
0g0c0000

00000000
00b0he00
fa00di00
0g0c0000
after "sq->index |= (sq->index >> sq->bishopShifts);," sq->index is

Code: Select all

00000000
00b0he00
fa00di00
0g0c0000
|
00000000
00b0he00
fa00di00
0g0c0000>>16

00000000
00b0he00
fa00di00
0g0c0000
|
00000000
00000000
00000000
00b0he00

00000000
00b0he00
fa00di00
0gbche00
Then you use the first two rows to index bishopAttacks1[256][256], bishopAttacks2[256][256] arrays to create the attacks bitboard or use the first two rows to index an index for bishopAttacks[variable] like your code does. It will be interesting to compare the speed to magic, I'll make sure all your shifts have no collisions and implement this to compare to magic in 64-bits. One question though. I don't clearly understand how you know index is square independent. EG, this code:

Code: Select all

return bishopAttacks[sq->bMinimal1[sq->index.1] | sq->bMinimal2[sq->index.2]];
I would think it should be bishopAttacks[square][...] but I might have to think about it a bit. For example, the trace for square A1 and H8 would be identical.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Fastest 32 (64?) bit BishopAttacks()

Post by Gerd Isenberg »

Michael Sherwin wrote:Yes, it's me again--sorry!

As I stated in another thread, I believe that I had developed the fastest bitboard move generator for either 32 bit or 64 bit machines and that I would post about them. The last bitboard generator that I posted about was to just get a 'read' on that method before I had completly made up my mind on which was fastest.

This is the bishop generator. The rook will follow later.

Code: Select all

typedef union {
  u32   u;
  u08   1;
  u08   2;
  u08   3;
  u08   4;
} split32;

typedef union {
  u64    u;
  u32   lo;
  u32   hi;
} split64;
The shift amounts in sq->bishopShifts:

Code: Select all

17, 19, 19, 17, 17, 18, 18, 16,
17, 17, 17, 17, 18, 18, 16, 17,
17, 17, 17, 17, 17, 15, 17, 17,
16, 16, 16, 16, 16, 16, 18, 18,
16, 18, 16, 16, 16, 16, 19, 20,
15, 17, 15, 17, 17, 17, 19, 17,
18, 18, 20, 20, 17, 17, 17, 17,
16, 17, 16, 16, 17, 17, 17, 17

Code: Select all

u64 BishopAttacks(square * sq, split64 * occ) {
  sq->blockers.u = occ.u & sq->bishopBits;
  sq->index = sq->blockers.lo + (sq->blockers.hi >> 1);
  sq->index |= (sq->index >> sq->bishopShifts);
  return bishopAttacks[sq->bMinimal1[sq->index.1] | sq->bMinimal2[sq->index.2]];
}
Note: no masking of the index is needed, because the shift amounts work out and
also, because of the split 32 union. The bishopAttacks subtable address is also
pre computed and part of the sq->bMinimal1 array.

Features:
* Very fast (maybe fastest) on 64 bit machines as well !
* Code will work as is for both 32 bit and 64 bit machines !!
* Only one dimentional array indexing calculations !!!
* For 32 bit machines, very few 64 bit accesses !!!!
* Most data accessed with one pointer !!!!!
* Very small memory footprint compared to magic !!!!!!
* No need for slow imul machine instructions !!!!!!!

Ofcourse mistakes are possible and optimizations are possible too.
Nice folding. The unions are not portable due to endian issue (I know, I have posted some bad examples of kindergarten for 32-bit mode as well). Better use shift/and to get shorter types inside one register.

Code: Select all

u64 BishopAttacks(u64 occ, square * sq) {
  occ &= sq->bishopBits;
  u32 index = (u32)(occ + (occ >> 33));
  index |= index >> sq->bishopShifts;
  return bishopAttacks[sq->bMinimal1[index & 255] | sq->bMinimal2[index >> 8]];
}

some handmade assembly
   ; rcx - occ
   ; rdx - *sq
   and   rcx, qword ptr [rdx + offsetBishopBits]
   mov   rax, rcx
   shr   rcx, 33
   add   eax, ecx
   mov   ecx, dword ptr [rdx + offsetbishopShifts]
   mov   r8d, eax
   shr   eax, cl
   or    eax, r8d
   mov   cl, ah  ; more likely compiler will make   mov ecx,eax  shr ecx,8
   and   eax, 0xff
   movzx eax, word ptr [rdx + 2*rax + offsetbMinimal1]
   movzx ecx, word ptr [rdx + 2*rcx + offsetbMinimal2]
   or    eax, ecx 
   mov   rax, qword ptr [bishopAttacks + 8*rax]
Not bad, if it works to get distinct indices over all squares.

Gerd
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Fastest 32 (64?) bit BishopAttacks()

Post by Michael Sherwin »

One question though. I don't clearly understand how you know index is square independent.
Hi Pradu,

Each square has its own structure record. Each structure record has its own set of tables. I hope that this is the correct answear to your question.

Mike
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Fastest 32 (64?) bit BishopAttacks()

Post by Michael Sherwin »

Alessandro Scotti wrote:Great!!!!!!!!!! :D
Hi Alessandro,

I forgot to mention one feature.

Small dependency chains, therefore very good parallel execution !!!!!!!! :D

Mike
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Fastest 32 (64?) bit BishopAttacks()

Post by Michael Sherwin »

And now RookAttacks():

Code: Select all

u64 RookAttacks(u64 occ, square * sq) {
  u32 index; u64 rank;
  rank   = occ & sq->rankMask; // get the rank bits
  occ   ^= rank; // clear the rank bits from occ
  index  = (u32)(occ >> sq->colShift1); // shift colum to 0 bit
  index |= (u32)((occ & 0xffffffffff000000) >> sq->colShift2); // upper half to 1 bit
  index |= ((index & 0x00030000) >> 6); // the third row bits to second row
  index |= sq->rankPart[rank >> sq->rankShift]; // fill in the holes
  return rookAttacks[sq->rMinimal1[index & 63 | rMinimal2[(index >> 8) & 63]];
}


Not as fast as bishopAttacks(), but not that far off either.

Maybe it can be improved.

The most ironic thing is that it requires less data to drive it.

Once again, errors and optimizations possible.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Fastest 32 (64?) bit BishopAttacks()

Post by Gerd Isenberg »

Michael Sherwin wrote:And now RookAttacks():

Not as fast as bishopAttacks(), but not that far off either.

Maybe it can be improved.

The most ironic thing is that it requires less data to drive it.

Once again, errors and optimizations possible.
hmm, that is a bit over my head - anyway, it already looks quite more expensive than Kindergarten-rooks, despite one imul...

Code: Select all

u64 RookAttacks(u64 occ, square * sq) {
  u32 index; u64 rank;
  rank   = occ & sq->rankMask; // get the rank bits
  occ   ^= rank; // clear the rank bits from occ
  index  = (u32)(occ >> sq->colShift1); // shift colum to 0 bit
  index |= (u32)((occ & 0xffffffffff000000) >> sq->colShift2); // upper half to 1 bit
  index |= ((index & 0x00030000) >> 6); // the third row bits to second row
  index |= sq->rankPart[rank >> sq->rankShift]; // fill in the holes
  return rookAttacks[sq->rMinimal1[index & 63] | sq->rMinimal2[(index >> 8) & 63]];
}
3 variable shifts, each with different cl
2 constant shifts
5 ands, 1 with 64-bit immediate
1 xor
4 ors
8 memory reads

plus lea, shift for building the sq* pointer.

Kindergarten-rooks:

Code: Select all

u64 aFileAttacks[64][8];     // 4 KByte
u8  firstRankAttacks2[64*8]; // 0.5 KByte

u64 rookAttacks(u64 occ, u32 sq) {
    u32 k = sq &  7;
    u32 i = sq & 56;
    u32 n = (u32)(occ >> i)    & 2*63;
    u64 d = firstRankAttacks2[4*n+k];
    u64 e = 0x0101010101010101 & (occ >> k);
    u64 r = 0x0080402010080400 *  e   >> 58;
    u64 garten  =   aFileAttacks[r][i >> 3];
    return       (  d << i&#41; | &#40;garten << k&#41;;   
&#125; 
4 variable shifts by two different amounts of cl
2 constant shifts
4 ands with immediate (3 32-bit, 1 64-bit)
1 or
1 64-bit imul with 64-bit immediate
2 lea with scaling(4,8)
2 memory reads

If you come up with a running routine, i am willing to do some performance testing ;-)

Cheers,
Gerd
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Fastest 32 (64?) bit BishopAttacks()

Post by Michael Sherwin »

I'll see if I can do better and then make a complete package for testing. But, Gerd, as far as this being over your head. Tell us another one! :lol:
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Fastest 32 (64?) bit BishopAttacks()

Post by Michael Sherwin »

I conceed that trying to beat Kindergarden RookAttacks() on a 64 bit machine is a waste of time! However, 32 bit is a different story. I want to take one last try at RookAttacks() targeted for 32 bit machines, with:

Code: Select all

U64 RookAttacks&#40;square *sq, u64 *occ&#41; &#123;
  u32 occ1 = &#40;u32&#41;(*&#40;occ + sq->off1&#41;);
  u32 occ2 = &#40;u32&#41;(*&#40;occ + sq->off2&#41;);
  u32 idx  = &#40;occ1 & sq->rankMask&#41; >> sq->rankShift;
      idx |= &#40;occ1 & sq->colMask1&#41; >> sq->colShift1;
      idx |= &#40;occ2 & sq->colMask2&#41; >> sq->colShift2;
      idx |= &#40;idx  &   0xffff0000&#41; >>  sq->idxShift;
      return rookAttacks&#91;sq->rMinimal1&#91;idx & 255&#93; | sq->rMinimal2&#91;&#40;idx >> 8&#41; & 255&#93;&#93;;
&#125;
Notice that the only 64 bit instruction is the final return!

How does this stack up against 32 bit friendly Kindergarden RookAttacks()?

A philosophical note:

During the time that I have been trying to come up with a supper fast bitboard move generator for 32 bit machines, Tord Romstad has written a complete new and very strong chess program. I wonder which one of us has used our time more wisley? I consider his sliding-piece bitboard move generator to be horibly slow and yet he has a new program and I do not. My sliding-piece bitboard move generators will run circles around Tord's, but have no legs with which to do it. Anyway, what percentage of the time is used up by just the sliding-piece generators and does it really matter much how fast they are?

Since I have come this far already I will make both BishopAttacks() and RookAttacks() as 32 bit friendly as possible. Then I will build my next program with them regardless of how fast they are or arn't, just because I made them.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through