TalkChess.com

Posted: **Tue Aug 11, 2009 8:50 am**

For part 2 in our 438 part series, I ask another hypothetical question of possible interest to bit twiddlers everywhere: how to run a bitscan over a bitboard, pulling off bits with an arbitrarily defined priority? For instance, a normal BSF has priority of bit X = 63-X, and BSR has priority =X. More importantly, how to do this fast?

The two ideas I have:

One other possibility is, if the priorities are all grouped in contiguous 16-bit chunks, then you can do a Nalimov-style bitscan. As an example, here's a normal BSF, but with the priorities of the 4 groups reversed (hope this makes sense):

Code: Select all

if &#40;bb & 0xffff000000000000&#41;
  sq = lsb&#91;bb>>48&#93;;
else if &#40;bb & 0x0000ffff00000000&#41;
  sq = lsb&#91;bb>>32 & 0xffff&#93;;
else if &#40;bb & 0x00000000ffff0000&#41;
  sq = lsb&#91;bb>>16 & 0xffff&#93;;
else if &#40;bb & 0x000000000000ffff&#41;
  sq = lsb&#91;bb & 0xffff&#93;;

This would scan bits in the order 48-63, 32-47, 16-31, and then 0-15. This works, but having the orders be restricted to those 16 bit chunks isn't very useful at all. Of course this could be extended to N bit chunks with suitable lookup tables, or arbitrary sets of bits that are transformed into a contiguous set before lookup. Too slow though!

The other is inspired by the second bitscan here (created by Gerd), which does 6 tests, one for each bit of the final result: http://chessprogramming.wikispaces.com/BitScan#toc8
(btw, be sure to check out the FZ video at the bottom, coincidentally right now I am listening to an audience tape of the same song performed about 8 months later

)

Simply take the priorities of the squares, and convert them to a "horizontal" array of 6 bitboards. Bitboard 0 will be the set of all squares with bit 0 set in their priority index, and so on. Then we loop through all the bitboards, except we need some special handling. We have to test the 32 highest priority bits first. If there is a bit in there, then we mask out all other bits and continue (I do this below with a simple -1/0 mask). Otherwise we do nothing, and we know that only bits with priorities <= 31 remain.

Code: Select all

bit5 = &#40;bb & mask&#91;5&#93;) != 0;
bb &= mask&#91;5&#93; | &#40;bit5-1&#41;; // assuming bit5 is a bitboard, probably a cast is smarter
result |= bit5; result <<= 1;

...then do the same for bits 4-0. At the end we have the highest existing priority 0<=p<=63 in result, which we use in a reverse lookup table. This should have about the same instructions of cycles as Gerd's, but with an added sub, or, and and for each bit.

Posted: **Tue Aug 11, 2009 9:47 am**

Zach Wegner wrote:... At the end we have the highest existing priority 0<=p<=63 in result, which we use in a reverse lookup table.

Why do the lookup? By the end of your code, isn't the answer in the variable bb? If so, you dont need the 'result' variable at all.

Cheers,
Steffan

Posted: **Tue Aug 11, 2009 11:52 am**

Just a very quick idea, probably could be further optimized. It compiles, but I have no time now for a complete test.

Code: Select all

int bitScanReverse&#40;uint64_t b&#41; &#123;

    const uint64_t P32 = 0xFFFFFFFF00000000UL;
    const uint64_t P16 = 0xFFFF0000FFFF0000UL;
    const uint64_t P08 = 0xFF00FF00FF00FF00UL;
    const uint64_t P04 = 0xF0F0F0F0F0F0F0F0UL;
    const uint64_t P02 = 0xCCCCCCCCCCCCCCCCUL;
    const uint64_t P01 = 0x8888888888888888UL;

    const uint64_t M32&#91;2&#93; = &#123; P32, ~P32 &#125;;
    const uint64_t M16&#91;2&#93; = &#123; P16, ~P16 &#125;;
    const uint64_t M08&#91;2&#93; = &#123; P08, ~P08 &#125;;
    const uint64_t M04&#91;2&#93; = &#123; P04, ~P04 &#125;;
    const uint64_t M02&#91;2&#93; = &#123; P02, ~P02 &#125;;

    uint64_t mask;
    int ret, n;

    ret = n = b > P32;
    mask = M32&#91;n&#93;;

    n = b > &#40;P16 & mask&#41;;
    mask &= M16&#91;n&#93;;
    ret = &#40;ret << 1&#41; + n;

    n = b > &#40;P08 & mask&#41;;
    mask &= M08&#91;n&#93;;
    ret = &#40;ret << 1&#41; + n;

    n = b > &#40;P04 & mask&#41;;
    mask &= M04&#91;n&#93;;
    ret = &#40;ret << 1&#41; + n;

    n = b > &#40;P02 & mask&#41;;
    mask &= M02&#91;n&#93;;
    ret = &#40;ret << 1&#41; + n;

    return &#40;ret << 1&#41; + &#40;b > &#40;P01 & mask&#41;);
&#125;

Posted: **Tue Aug 11, 2009 2:19 pm**

I don't actually understand what "arbitrary bitscan order" means ;-(
Is it file/rank order related? F.i. to implement a "symmetrical" white versus black bitscan?

Code: Select all

int symmetricalBsf &#40;U64 bb, enumColor color&#41; &#123;
    if ( color == ECWhite )
       return bsf&#40;bb&#41;;          // from a1..h1, a2..., h8
    return bsf&#40;bswap&#40;bb&#41;) ^ 56; // from a8..h8, a7..., h1
&#125;

Posted: **Tue Aug 11, 2009 3:01 pm**

I see the use of arbitrary ordered bitscan for PST-evaluation move generation ordering.

Posted: **Tue Aug 11, 2009 5:14 pm**

steffan wrote:
Zach Wegner wrote:... At the end we have the highest existing priority 0<=p<=63 in result, which we use in a reverse lookup table.
Why do the lookup? By the end of your code, isn't the answer in the variable bb? If so, you dont need the 'result' variable at all.

Cheers,
Steffan

True, if you didn't want to get the index of the bit, but merely isolate it, that would be faster. It might even be competitive to just do a bsf at the end after isolating the bit.

The result accumulation is just 6 lea's, but they should be pipelined. OTOH you still need a lookup table, and the code size is increased a bit. These solutions should be quite similar in speed.

Posted: **Tue Aug 11, 2009 5:27 pm**

mcostalba wrote:Just a very quick idea, probably could be further optimized. It compiles, but I have no time now for a complete test.

Code: Select all

int bitScanReverse&#40;uint64_t b&#41; &#123;

    const uint64_t P32 = 0xFFFFFFFF00000000UL;
    const uint64_t P16 = 0xFFFF0000FFFF0000UL;
    const uint64_t P08 = 0xFF00FF00FF00FF00UL;
    const uint64_t P04 = 0xF0F0F0F0F0F0F0F0UL;
    const uint64_t P02 = 0xCCCCCCCCCCCCCCCCUL;
    const uint64_t P01 = 0x8888888888888888UL;

    const uint64_t M32&#91;2&#93; = &#123; P32, ~P32 &#125;;
    const uint64_t M16&#91;2&#93; = &#123; P16, ~P16 &#125;;
    const uint64_t M08&#91;2&#93; = &#123; P08, ~P08 &#125;;
    const uint64_t M04&#91;2&#93; = &#123; P04, ~P04 &#125;;
    const uint64_t M02&#91;2&#93; = &#123; P02, ~P02 &#125;;

    uint64_t mask;
    int ret, n;

    ret = n = b > P32;
    mask = M32&#91;n&#93;;

    n = b > &#40;P16 & mask&#41;;
    mask &= M16&#91;n&#93;;
    ret = &#40;ret << 1&#41; + n;

    n = b > &#40;P08 & mask&#41;;
    mask &= M08&#91;n&#93;;
    ret = &#40;ret << 1&#41; + n;

    n = b > &#40;P04 & mask&#41;;
    mask &= M04&#91;n&#93;;
    ret = &#40;ret << 1&#41; + n;

    n = b > &#40;P02 & mask&#41;;
    mask &= M02&#91;n&#93;;
    ret = &#40;ret << 1&#41; + n;

    return &#40;ret << 1&#41; + &#40;b > &#40;P01 & mask&#41;);
&#125;

I think this should work, but with an inversion of the P* masks. Testing for a bit in the top 32 bits is equivalent to > 0xffffffff. I think the (bb&mask)!=0 and bb>~mask should be the same speed though, a test and setnz in the first case and a cmp and seta in the second. I like your array idea though, perhaps it could be faster than a calculated mask.

Also, just scrapping the mask and doing b &= M*[n] should be faster.

Posted: **Tue Aug 11, 2009 5:39 pm**

Gerd Isenberg wrote:I don't actually understand what "arbitrary bitscan order" means ;-(
Is it file/rank order related? F.i. to implement a "symmetrical" white versus black bitscan?
Code: Select all
int symmetricalBsf &#40;U64 bb, enumColor color&#41; &#123;
    if ( color == ECWhite )
       return bsf&#40;bb&#41;;          // from a1..h1, a2..., h8
    return bsf&#40;bswap&#40;bb&#41;) ^ 56; // from a8..h8, a7..., h1
&#125;

Yes, that could be one possible order. The priority table in this case, for black and a1=0, would be:

Code: Select all

int table&#91;&#93; = &#123;
  56, 57, 58, 59, 60, 61, 62, 63,
  48, 49, 50, 51, 52, 53, 54, 55,
  ...
  0, 1, 2, 3, 4, 5, 6, 7
&#125;;

Think about it like quad bitboards--the 4 bitboards there vertically encode a piece code 0-16 (I guess I should have said "vertical" originally instead of "horizontally"). Here 6 bitboards vertically encode a priority value 0-63 for each square. The priority is just the order the bits should be pulled off in. If you pass in -1 as a bitboard, it would first pull off the bit with priority 63, then 62...0.

As Aleks says, this could be used to order in a piece-square order. For kings you'd prioritize the back rank, with the two wings being preferable to the center (maybe with a penalty for d1/f1 though). Knights go to the center first, etc.

Posted: **Tue Aug 11, 2009 8:50 pm**

Zach Wegner wrote: Yes, that could be one possible order. The priority table in this case, for black and a1=0, would be:
Code: Select all
int table&#91;&#93; = &#123;
  56, 57, 58, 59, 60, 61, 62, 63,
  48, 49, 50, 51, 52, 53, 54, 55,
  ...
  0, 1, 2, 3, 4, 5, 6, 7
&#125;;
Think about it like quad bitboards--the 4 bitboards there vertically encode a piece code 0-16 (I guess I should have said "vertical" originally instead of "horizontally"). Here 6 bitboards vertically encode a priority value 0-63 for each square. The priority is just the order the bits should be pulled off in. If you pass in -1 as a bitboard, it would first pull off the bit with priority 63, then 62...0.

As Aleks says, this could be used to order in a piece-square order. For kings you'd prioritize the back rank, with the two wings being preferable to the center (maybe with a penalty for d1/f1 though). Knights go to the center first, etc.

I see. How about brute force simd approach (8 xmm-regs) to expand the bitboard to a wordboard alá SSE2-dot-product, to mask the expanded words (0xffff for a binary one, 0x0000 for binary zero) with a word[64]-vector, containing order bytes ( > 0) in the upper half and the bitindex (0..63) in the lower half. Then sort the vector by a small SSE2 bubble sort

Posted: **Wed Aug 12, 2009 12:35 am**

Zach Wegner wrote:I think this should work, but with an inversion of the P* masks. Testing for a bit in the top 32 bits is equivalent to > 0xffffffff. I think the (bb&mask)!=0 and bb>~mask should be the same speed though, a test and setnz in the first case and a cmp and seta in the second. I like your array idea though, perhaps it could be faster than a calculated mask.

Also, just scrapping the mask and doing b &= M*[n] should be faster.

Thanks for your suggestions here is the new version.

Code: Select all

int bitScanReverse&#40;uint64_t b&#41; &#123;

    const uint64_t P32 = 0xFFFFFFFF00000000UL;
    const uint64_t P16 = 0xFFFF0000FFFF0000UL;
    const uint64_t P08 = 0xFF00FF00FF00FF00UL;
    const uint64_t P04 = 0xF0F0F0F0F0F0F0F0UL;
    const uint64_t P02 = 0xCCCCCCCCCCCCCCCCUL;
    const uint64_t P01 = 0x8888888888888888UL;

    const uint64_t M32&#91;2&#93; = &#123; ~P32, P32 &#125;;
    const uint64_t M16&#91;2&#93; = &#123; ~P16, P16 &#125;;
    const uint64_t M08&#91;2&#93; = &#123; ~P08, P08 &#125;;
    const uint64_t M04&#91;2&#93; = &#123; ~P04, P04 &#125;;
    const uint64_t M02&#91;2&#93; = &#123; ~P02, P02 &#125;;

    int ret, n;

    ret = n = &#40;b & P32&#41; != 0;
    b &= M32&#91;n&#93;;

    n = &#40;b & P16&#41; != 0;
    b &= M16&#91;n&#93;;
    ret = &#40;ret << 1&#41; + n;

    n = &#40;b & P08&#41; != 0;
    b &= M08&#91;n&#93;;
    ret = &#40;ret << 1&#41; + n;

    n = &#40;b & P04&#41; != 0;
    b &= M04&#91;n&#93;;
    ret = &#40;ret << 1&#41; + n;

    n = &#40;b & P02&#41; != 0;
    b &= M02&#91;n&#93;;
    ret = &#40;ret << 1&#41; + n;

    return &#40;ret << 1&#41; + (&#40;b & P01&#41; != 0&#41;;
&#125;

TalkChess.com

Bit twiddling question, part 2: arbitrary bitscan order

Bit twiddling question, part 2: arbitrary bitscan order

Re: Bit twiddling question, part 2: arbitrary bitscan order

Re: Bit twiddling question, part 2: arbitrary bitscan order

Re: Bit twiddling question, part 2: arbitrary bitscan order

Re: Bit twiddling question, part 2: arbitrary bitscan order

Re: Bit twiddling question, part 2: arbitrary bitscan order

Re: Bit twiddling question, part 2: arbitrary bitscan order

Re: Bit twiddling question, part 2: arbitrary bitscan order

Re: Bit twiddling question, part 2: arbitrary bitscan order

Re: Bit twiddling question, part 2: arbitrary bitscan order