Bit twiddling question, part 2: arbitrary bitscan order

Zach Wegner · Post by **Zach Wegner** » Wed Aug 12, 2009 2:18 am

Gerd Isenberg wrote:I see. How about brute force simd approach (8 xmm-regs) to expand the bitboard to a wordboard alá SSE2-dot-product, to mask the expanded words (0xffff for a binary one, 0x0000 for binary zero) with a word[64]-vector, containing order bytes ( > 0) in the upper half and the bitindex (0..63) in the lower half. Then sort the vector by a small SSE2 bubble sort

Fascinating! Not sure if it's exactly optimal, for sparsely populated bitboards.

Zach Wegner · Post by **Zach Wegner** » Wed Aug 12, 2009 2:49 am

Speed comparison of five routines: my original proposal, Steffan's proposal (but with a bsf at the end to get the index), Marco's proposal, and a combination of all three. I also tested Steffan's without the bitscan, and with a bb &= ~s; in the loop.

Code: Select all

int bitscan_pri_z(bb_t bb)
{
    int bit, res = 0;
    bb_t r;

    /* Unroll this! */
    for (bit = 5; bit >= 0; bit--) {
        r = (bb & pri_mask[bit]) != 0;
        bb &= pri_mask[bit] | (r - 1);
        res = res << 1 | r;
    }

    return rev_priority[res];
}

bb_t bitscan_pri_so(bb_t bb)
{
    int bit;
    bb_t r;

    /* Unroll this! */
    for (bit = 5; bit >= 0; bit--) {
        r = (bb & pri_mask[bit]) != 0;
        bb &= pri_mask[bit] | (r - 1);
    }

    return (bb);
}

int bitscan_pri_s(bb_t bb)
{
    int bit;
    bb_t r;

    /* Unroll this! */
    for (bit = 5; bit >= 0; bit--) {
        r = (bb & pri_mask[bit]) != 0;
        bb &= pri_mask[bit] | (r - 1);
    }

    return bsfq(bb);
}

int bitscan_pri_m(bb_t bb)
{
    int bit, res = 0, r;

    /* Unroll this! */
    for (bit = 5; bit >= 0; bit--) {
        r = (bb & pri_mask_a[bit][1]) != 0;
        bb &= pri_mask_a[bit][r];
        res = res << 1 | r;
    }

    return rev_priority[res];
}

int bitscan_pri_sm(bb_t bb)
{
    int bit, r;

    /* Unroll this! */
    for (bit = 5; bit >= 0; bit--) {
        r = (bb & pri_mask_a[bit][1]) != 0;
        bb &= pri_mask_a[bit][r];
    }

    return bsfq(bb);
}

Each one was measured in a dumb loop:

Code: Select all

t = clock();
    for (i = 0; i < 10000000; i++) {
        bb_t b = -1;
        x = 0;
        while (b) {
            b &= ~(1ull << x);
            x++;
        }
    }
    int overhead = clock() - t;

    int time;

    t = clock();
    for (i = 0; i < 10000000; i++) {
        bb_t b = -1;
        while (b) {
            int s = bitscan_pri_z(b);
            b &= ~(1ull << s);
        }
    }
    time = clock() - t - overhead;
    printf("zach: overhead=%i, time=%i\n",overhead, time);

And the results (in centiseconds):

Code: Select all

zach: overhead=54, time=1037
steffan orig: overhead=54, time=873
steffan: overhead=54, time=903
marco: overhead=54, time=1389
s/m: overhead=54, time=1769

Surprisingly, the bsfq in Steffan's is very cheap, and the combination of Steffan and Marco's proposals was much slower than the original. Steffan looks like the winner here though.

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Aug 12, 2009 8:16 am

Zach Wegner wrote:
Gerd Isenberg wrote:I see. How about brute force simd approach (8 xmm-regs) to expand the bitboard to a wordboard alá SSE2-dot-product, to mask the expanded words (0xffff for a binary one, 0x0000 for binary zero) with a word[64]-vector, containing order bytes ( > 0) in the upper half and the bitindex (0..63) in the lower half. Then sort the vector by a small SSE2 bubble sort
Fascinating! Not sure if it's exactly optimal, for sparsely populated bitboards.

Propably the bubble sort of 64 shorts is a bit too slow

Expanding bits to bytes (or shorts) is quite effiecient as the dot-product is nowadays for a weighted popcount.

mcostalba · Post by **mcostalba** » Wed Aug 12, 2009 8:28 am

Zach Wegner wrote: Surprisingly, the bsfq in Steffan's is very cheap, and the combination of Steffan and Marco's proposals was much slower than the original. Steffan looks like the winner here though.

Thanks for testing !

Have you tested on a 64 bit system or on a 32bit ?

The little reverse routine of mine could be rewritten in a much faster way for 32 bit, namely all the 'and' operation but the first could be 32 bit.

mcostalba · Post by **mcostalba** » Wed Aug 12, 2009 9:40 am

This is another version, probably slower but fancier

Code: Select all

void rol32(uint64_t& bb) {

    bb = (bb << 32) | (bb >> 32);
}

int bitScanReverse32(uint32_t b) {
    
    const uint32_t U16 = 0xFFFF0000;
    const uint32_t U08 = 0xFF00FF00;
    const uint32_t U04 = 0xF0F0F0F0;
    const uint32_t U02 = 0xCCCCCCCC;
    const uint32_t U01 = 0x88888888;

    const uint64_t L16 = (uint64_t(~U16) << 32) + U16;
    const uint64_t L08 = (uint64_t(~U08) << 32) + U08;
    const uint64_t L04 = (uint64_t(~U04) << 32) + U04;
    const uint64_t L02 = (uint64_t(~U02) << 32) + U02;
    const uint64_t L01 = (uint64_t(~U01) << 32) + U01;

    int ret = 0;

    uint64_t bb = (uint64_t(b) << 32) | b;

    bb &= L16;
    if (unsigned(bb)) ret += 1; else rol32(bb);
    bb = (bb << 32) | unsigned(bb);

    ret <<= 1;
    bb &= L08;
    if (unsigned(bb)) ret += 1; else rol32(bb);
    bb = (bb << 32) | unsigned(bb);

    ret <<= 1;
    bb &= L04;    
    if (unsigned(bb)) ret += 1; else rol32(bb);
    bb = (bb << 32) | unsigned(bb);

    ret <<= 1;
    bb &= L02;    
    if (unsigned(bb)) ret += 1; else rol32(bb);
    bb = (bb << 32) | unsigned(bb);

    return (ret << 1) + (unsigned(bb & L01) != 0);
}

int bitScanReverse(uint64_t b) {

    const uint32_t hb = b >> 32;
    return hb ? 32 + bitScanReverse32(hb) : bitScanReverse32(uint32_t(b));

}

Bit twiddling question, part 2: arbitrary bitscan order

Re: Bit twiddling question, part 2: arbitrary bitscan order

Re: Bit twiddling question, part 2: arbitrary bitscan order

Re: Bit twiddling question, part 2: arbitrary bitscan order

Re: Bit twiddling question, part 2: arbitrary bitscan order

Re: Bit twiddling question, part 2: arbitrary bitscan order