Hi Vince,mambofish wrote:Hi Gerd
Thanks for the details. I don't know where you find this stuff, but it is very interesting!
Here's the Robert Harley reference I used for my code.
http://www.df.lth.se/~john_e/gems/gem0039.html
I found a reference to the table-based technique in "Hacker's Delight" which put me on to the idea. A Google search did the rest
thanks for the reference. But this is only a 32-bsf replacement similar to a 32-bit DeBruijn multiplication, but with no perfect hashing - that is a 64-byte table using only 32.
Bitscanning was an old hobby for me and there are zillions of approaches, the archives, also in the winboard programming forum are saturated with bitscans.
As a "historical" reference two C-versions from my bitscan repository, Walter Faxon's folding trick with none perfect hashing and a direct calculation by test,setnz,lea chains, I came up with. This versions also clear the found bit:
Code: Select all
extern const BYTE LSB_64_table[154]; // bit number table
#define LSB_64_adj -51 // offset to table base
#define LSB_64_magic ( (UINT32)0x01C5FC81 ) // magic constant
// Author: Walter Faxon (in the 80ies, constant found by brute force)
int bitScanAndReset(BitBoard &bb)
{
BitBoard t64;
UINT32 t32;
t64 = bb - 1;
bb &= t64; // omit this line to retain current LSB
t64 ^= bb;
t32 = (UINT32)t64 ^ (UINT32)(t64 >> 32);
t32 ^= LSB_64_magic;
t32 += t32 >> 16;
t32 -= t32 >> 8;
return LSB_64_table [LSB_64_adj + (BYTE)t32];
}
const BYTE LSB_64_table[154] =
{
#define __ 0
23,__,__,__,31,__,__,39,19,__, 17,16,18,__,47,10,20, 9, 8,11,
1, 0, 2,57,56,58, 3,12,__,59, __,__,21,__, 4,__,__,60,__,__,
__,__,__,13,__,__,__,__,__,__, 5,__,__,61,__,__,__,__,__,__,
__,__,__,__,22,__,__,__,30,__, __,38,__,__,__,14,__,__,46,__,
__,__, 6,__,__,62,__,__,__,54, __,__,__,__,__,__,__,__,__,__,
29,__,__,37,__,__,__,__,__,__, 45,__,__,__,__,__,28,__,__,36,
__,53,__,__,27,__,44,35,26,24, 25,34,32,33,43,40,41,52,42,15,
__,50,48,49,__,51, 7,__,__,63, __,__,__,55
#undef __
};
Code: Select all
// Author: Gerd Isenberg (around 1995)
int bitScanAndReset(BitBoard &bb)
{
BitBoard lsbb = bb & (-(__int64)bb);
bb ^= lsbb;
UINT32 lsb = LOWBOARD(lsbb) | HIGHBOARD(lsbb);
return ((((((((((HIGHBOARD(lsbb)!=0) *2)
^((lsb & 0xffff0000)!=0))*2)
^((lsb & 0xff00ff00)!=0))*2)
^((lsb & 0xf0f0f0f0)!=0))*2)
^((lsb & 0xcccccccc)!=0))*2)
^((lsb & 0xaaaaaaaa)!=0);
}
In the meantime on the c2d bsf/bsr have become fast again - two cycles latency, and one cycle reciprocal throughput. K10 (and intel as well) will have fast lzcnt instruction. x86-64 C-compiler will have intrinsic functions to use it without assembly.
Of course in Java Long.numberOfTrailingZeros is slow on 32-bit machines - there are faster implementations for 64-bit systems - and it will become the fastest with an approprite 64-bit just in time compiler on a core 2 duo or K10, using the new instructions.mambofish wrote: I am a little surprised at some of the timings presented. For example, I wouldn't use numberOfTrailingZeros. It relies on this binary chop code to isolate the LSB.
Each 1-bit therefore requires 5 tests to identify. I did try this technique and found it very much slower than the magic number method, so I am a little surprised at the timings you report. Nervertheless I'll set up some tests again and confirm.Code: Select all
public static int numberOfTrailingZeros(long i) { // HD, Figure 5-14 int x, y; if (i == 0) return 64; int n = 63; y = (int)i; if (y != 0) { n = n -32; x = y; } else x = (int)(i>>>32); y = x <<16; if (y != 0) { n = n -16; x = y; } y = x << 8; if (y != 0) { n = n - 8; x = y; } y = x << 4; if (y != 0) { n = n - 4; x = y; } y = x << 2; if (y != 0) { n = n - 2; x = y; } return n - ((x << 1) >>> 31); }
Regards,
Vince
Cheers,
Gerd