Fastest bitboard compress routine when you can't use ASM

Gerd Isenberg · Post by **Gerd Isenberg** » Fri Jun 01, 2007 6:59 pm

mambofish wrote:Hi Gerd

Thanks for the details. I don't know where you find this stuff, but it is very interesting!

Here's the Robert Harley reference I used for my code.

http://www.df.lth.se/~john_e/gems/gem0039.html

I found a reference to the table-based technique in "Hacker's Delight" which put me on to the idea. A Google search did the rest

Hi Vince,

thanks for the reference. But this is only a 32-bsf replacement similar to a 32-bit DeBruijn multiplication, but with no perfect hashing - that is a 64-byte table using only 32.

Bitscanning was an old hobby for me and there are zillions of approaches, the archives, also in the winboard programming forum are saturated with bitscans.

As a "historical" reference two C-versions from my bitscan repository, Walter Faxon's folding trick with none perfect hashing and a direct calculation by test,setnz,lea chains, I came up with. This versions also clear the found bit:

Code: Select all

extern const BYTE LSB_64_table&#91;154&#93;;  // bit number table
#define LSB_64_adj  -51               // offset to table base
#define LSB_64_magic  ( &#40;UINT32&#41;0x01C5FC81 )  // magic constant

// Author&#58; Walter Faxon &#40;in the 80ies, constant found by brute force&#41;
int bitScanAndReset&#40;BitBoard &bb&#41;
&#123;
    BitBoard t64;
    UINT32 t32;
    t64 = bb - 1;
    bb &= t64;         // omit this line to retain current LSB
    t64 ^= bb;
    t32 = &#40;UINT32&#41;t64 ^ &#40;UINT32&#41;&#40;t64 >> 32&#41;;
    t32 ^= LSB_64_magic;
    t32 += t32 >> 16;
    t32 -= t32 >> 8;
    return LSB_64_table &#91;LSB_64_adj + &#40;BYTE&#41;t32&#93;;
&#125;

const BYTE LSB_64_table&#91;154&#93; =
&#123;
#define __  0
    23,__,__,__,31,__,__,39,19,__, 17,16,18,__,47,10,20, 9, 8,11,
     1, 0, 2,57,56,58, 3,12,__,59, __,__,21,__, 4,__,__,60,__,__,
    __,__,__,13,__,__,__,__,__,__,  5,__,__,61,__,__,__,__,__,__,
    __,__,__,__,22,__,__,__,30,__, __,38,__,__,__,14,__,__,46,__,
    __,__, 6,__,__,62,__,__,__,54, __,__,__,__,__,__,__,__,__,__,
    29,__,__,37,__,__,__,__,__,__, 45,__,__,__,__,__,28,__,__,36,
    __,53,__,__,27,__,44,35,26,24, 25,34,32,33,43,40,41,52,42,15,
    __,50,48,49,__,51, 7,__,__,63, __,__,__,55
#undef __
&#125;;

Code: Select all

// Author&#58; Gerd Isenberg &#40;around 1995&#41;
int bitScanAndReset&#40;BitBoard &bb&#41;
&#123;
    BitBoard lsbb = bb & (-(__int64&#41;bb&#41;;
    bb ^= lsbb;
    UINT32 lsb = LOWBOARD&#40;lsbb&#41; | HIGHBOARD&#40;lsbb&#41;;
    return (((((((((&#40;HIGHBOARD&#40;lsbb&#41;!=0&#41; *2&#41;
                ^(&#40;lsb & 0xffff0000&#41;!=0&#41;)*2&#41;
                ^(&#40;lsb & 0xff00ff00&#41;!=0&#41;)*2&#41;
                ^(&#40;lsb & 0xf0f0f0f0&#41;!=0&#41;)*2&#41;
                ^(&#40;lsb & 0xcccccccc&#41;!=0&#41;)*2&#41;
                ^(&#40;lsb & 0xaaaaaaaa&#41;!=0&#41;;
&#125;

Intel P4 and AMD K8 even in 64 bit mode have very slow implementations of the former (PIII) fast bsf/bsr-instructions. The portable De Bruijn multiplication plus shift and lookup were (and are still for K8 and others) competitive with fast multiplication of K8. For 32-bit systems Matt's routine was the favourite approach.

In the meantime on the c2d bsf/bsr have become fast again - two cycles latency, and one cycle reciprocal throughput. K10 (and intel as well) will have fast lzcnt instruction. x86-64 C-compiler will have intrinsic functions to use it without assembly.

mambofish wrote: I am a little surprised at some of the timings presented. For example, I wouldn't use numberOfTrailingZeros. It relies on this binary chop code to isolate the LSB.
Code: Select all
    public static int numberOfTrailingZeros&#40;long i&#41; &#123;
        // HD, Figure 5-14

	int x, y;
	if &#40;i == 0&#41; return 64;

	int n = 63;
	y = &#40;int&#41;i; if &#40;y != 0&#41; &#123; n = n -32; x = y; &#125; else x = &#40;int&#41;&#40;i>>>32&#41;;
	y = x <<16; if &#40;y != 0&#41; &#123; n = n -16; x = y; &#125;
	y = x << 8; if &#40;y != 0&#41; &#123; n = n - 8; x = y; &#125;
	y = x << 4; if &#40;y != 0&#41; &#123; n = n - 4; x = y; &#125;


	y = x << 2; if &#40;y != 0&#41; &#123; n = n - 2; x = y; &#125;
	return n - (&#40;x << 1&#41; >>> 31&#41;;
    &#125;
Each 1-bit therefore requires 5 tests to identify. I did try this technique and found it very much slower than the magic number method, so I am a little surprised at the timings you report. Nervertheless I'll set up some tests again and confirm.

Regards,
Vince

Of course in Java Long.numberOfTrailingZeros is slow on 32-bit machines - there are faster implementations for 64-bit systems - and it will become the fastest with an approprite 64-bit just in time compiler on a core 2 duo or K10, using the new instructions.

Cheers,
Gerd

Fastest bitboard compress routine when you can't use ASM

Re: Fastest bitboard compress routine when you can't use ASM