
### Fastest bitboard compress routine when you can't use ASM

Posted: Thu May 31, 2007 5:22 pm
One of the problems with bitboard move generators is how to efficiently determine which bits are "on" in a 64-bit move map, in order to generate the target squares for a piece when constructing move lists.

Those of you who can drop into assembler can of course use the BSF and BSR opcodes to do this, but I thought you might nonetheless be interested in a technique based on number theory that I have used in ermintrude (which is written in Java and so can't use any asm code), and which is faster than any other non-assembler technique I've tried. It's so good it's almost like magic. Moreover, it runs in time proportional to the number of one-bits - no processing huge numbers of zero-bits just to find the one-bits.

The idea is to extract each one-bit in turn from the 64-bit value, fold it into a 32-bit value, and multiply that by a "magic" number. The top six bits of the result are then used as an index into a small lookup table, whose value gives the bit index (0-63) of the extracted bit - and hence the target square for the move.
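The extraction step rests on two standard identities, illustrated here with a small value of my own (not from the post): for non-zero bb, bb ^ (bb - 1) yields a mask of every bit up to and including the lowest one-bit, while bb & (bb - 1) clears that bit.

``````
long bb = 0b101000L;      // bits 3 and 5 set
long v  = bb ^ (bb - 1);  // 0b001111: bits up to and including the LS1B
bb &= bb - 1;             // 0b100000: LS1B cleared, ready for the next pass
``````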

Here's the code I use in ermintrude:

Code:

``````
public static final int tzc64[] = {
    63, 30,  3, 32, 59, 14, 11, 33, 60, 24, 50,  9, 55, 19, 21, 34,
    61, 29,  2, 53, 51, 23, 41, 18, 56, 28,  1, 43, 46, 27,  0, 35,
    62, 31, 58,  4,  5, 49, 54,  6, 15, 52, 12, 40,  7, 42, 45, 16,
    25, 57, 48, 13, 10, 39,  8, 44, 20, 47, 38, 22, 17, 37, 36, 26
};

public static int compress(long bb, int[] loc) {
    int lo, high;
    long v;
    int j = 0;
    while (bb != 0) {
        v = bb ^ (bb - 1);      // mask of bits up to and including the LS1B
        bb &= (bb - 1);         // clear the LS1B
        lo   = (int)(v >> 32);  // n.b. 'lo' actually holds the high word and
        high = (int) v;         // 'high' the low word; XOR is commutative,
                                // so the fold is unaffected
        loc[j++] = tzc64[(0x3f & ((lo ^ high) * 0x78291ACF) >> 26)];
    }
    loc[j] = -1;                // sentinel terminator
    return j;
}
``````
The multiplication is probably the slowest part of the algorithm, as the lookup table itself is small enough to stay resident in cache. I haven't tried factorising the magic number into a sequence of shifts and adds - this might speed it up slightly.
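For what it's worth, the shift-and-add idea looks like this for simple constants (my illustration only - decomposing 0x78291ACF itself would need far more terms, which is probably why it may not pay off):

``````
// x * 9 == x * (8 + 1)
static int times9(int x)   { return (x << 3) + x; }

// x * 100 == x * (64 + 32 + 4)
static int times100(int x) { return (x << 6) + (x << 5) + (x << 2); }
``````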

Anyway, it's just one of the tricks I've had to resort to in order to get Java to perform respectably.

Vince

### Re: Fastest bitboard compress routine when you can't use ASM

Posted: Thu May 31, 2007 7:47 pm
I'm using an almost identical method in Kiwi, which I took from Gerd Isenberg. I was using BSF and BSR before that, and after replacing _all_ of the ASM with this kind of "magic" stuff my program actually got slightly faster!

### Re: Fastest bitboard compress routine when you can't use ASM

Posted: Thu May 31, 2007 10:31 pm
mambofish wrote:

Code:

``tzc64[(0x3f & ((lo ^ high) * 0x78291ACF) >> 26)];``
I believe this is called Matt Taylor's folded bitscan or something. I have never used it because I only cared about 64 bits. Here's the 64-bit version of this bitscan, first devised by some guys at MIT ... forgot their names already...:

Code:

``````
// X - a bitboard with a single bit active
#define toIndex(X) toIndexDB[(X)*C64(0x07EDD5E59A4E28C2)>>58]
``````
I used to use these magic bitscans because I didn't know and didn't like assembly. Now I use compiler intrinsics where possible, which seem to work faster and are just as simple to use:

Code:

``````
#ifdef _MSC_VER
#pragma message("MSC compatible compiler detected -- turning off warning 4146")
#pragma warning( disable : 4146 )
#ifdef _WIN64
#include <intrin.h>
#pragma intrinsic(_BitScanForward64)
#pragma intrinsic(_BitScanReverse64)
#define USING_INTRINSICS
#endif
#endif

_BitScanForward64(&ret, bb);
_BitScanReverse64(&ret, bb);
``````

### Re: Fastest bitboard compress routine when you can't use ASM

Posted: Fri Jun 01, 2007 12:43 am
Yep, I think Matt Taylor and Robert Harley were the guys who came up with this originally, back in 1996 or thereabouts.

I wasn't aware of the 64-bit magic number though. I'll see if I can get it to work, because the (int) casts in Java are s l o w. It might not work though, because Java doesn't do unsigned arithmetic.

It's a toy language really

### Re: Fastest bitboard compress routine when you can't use ASM

Posted: Fri Jun 01, 2007 12:57 am
mambofish wrote:Yep, I think Matt Taylor and Robert Harley were the guys who came up with this originally, back in 1996 or thereabouts.

I wasn't aware of the 64-bit magic number though. I'll see if I can get it to work, because the (int) casts in Java are s l o w. It might not work though, because Java doesn't do unsigned arithmetic.

It's a toy language really
There are actually many (67108864) 64-bit magic numbers. See: http://www.vpittlik.org/wbforum/viewtop ... 6467#26467

I believe there are only 2 Matt-Taylor magic numbers though.
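A 64-bit constant qualifies as such a magic number exactly when the 64 single-bit inputs map to 64 distinct table slots. A brute-force check (my own sketch, not code from the thread; note that (1L << i) * magic is just magic << i):

``````
static boolean isValidMagic(long magic) {
    boolean[] seen = new boolean[64];
    for (int i = 0; i < 64; i++) {
        int idx = (int)((magic << i) >>> 58); // top 6 bits of (1L << i) * magic
        if (seen[idx]) return false;          // collision: not a valid magic
        seen[idx] = true;
    }
    return true;
}
``````

Gerd's de Bruijn constant 0x03f79d71b4cb0a89L, quoted later in the thread, passes this check.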

### Re: Fastest bitboard compress routine when you can't use ASM

Posted: Fri Jun 01, 2007 8:12 am
mambofish wrote:Yep, I think Matt Taylor and Robert Harley were the guys who came up with this originally, back in 1996 or thereabouts.

I wasn't aware of the 64-bit magic number though. I'll see if I can get it to work, because the (int) casts in Java are s l o w. It might not work though, because Java doesn't do unsigned arithmetic.

It's a toy language really
Java has an explicit logical shift right, >>>
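That is, >> sign-extends while >>> zero-fills, which is what matters when treating a long as an unsigned 64-bit word (toy values of my own):

``````
long b = 0x8000000000000000L;  // only the top bit set (negative as a long)
long arith = b >> 60;          // arithmetic shift: sign bits shifted in, gives -8
long logic = b >>> 60;         // logical shift: zero bits shifted in, gives 8
``````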

Here is Matt's first posting about his folded trick, from June 2003.

I was not aware of Robert Harley. Do you have some references?

Java 1.5 has Long.numberOfTrailingZeros, which might be the best choice, as it might map to a hardware bit-scan instruction on future Intel and AMD processors.
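A typical serialization loop with Long.numberOfTrailingZeros would look like this (my sketch of the obvious usage, not code from this thread):

``````
// collect the indexes of all one-bits in bb, lowest first;
// returns the number of bits found
static int serialize(long bb, int[] squares) {
    int n = 0;
    while (bb != 0) {
        squares[n++] = Long.numberOfTrailingZeros(bb); // index of the LS1B
        bb &= bb - 1;                                  // clear the LS1B
    }
    return n;
}
``````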
Here are some Java bitscans and timing results by Klaus Friedel under various JREs:

Code:

``````
/**
 * @author GIsenberg
 * @param b a 64-bit word to bitscan
 * @return index 0..63 of the LS1B, -1023 if passed zero
 */
static public int bitScanForwardDbl(long b)
{
    double x = (double)(b & -b);  // isolate the LS1B and convert to double
    int exp = (int)(Double.doubleToLongBits(x) >>> 52);  // biased exponent
    return (exp & 2047) - 1023;   // remove the bias: 2^k has exponent k
}
``````

Code:

``````
/**
 * @author Matt Taylor
 * @param b a 64-bit word to bitscan, should not be zero
 * @return index 0..63
 */
static private final int[] foldedTable = {
    63,30, 3,32,59,14,11,33,
    60,24,50, 9,55,19,21,34,
    61,29, 2,53,51,23,41,18,
    56,28, 1,43,46,27, 0,35,
    62,31,58, 4, 5,49,54, 6,
    15,52,12,40, 7,42,45,16,
    25,57,48,13,10,39, 8,44,
    20,47,38,22,17,37,36,26,
};

static public int bitScanForwardMatt(long b) {
    b ^= (b - 1);                               // bits up to and including the LS1B
    int folded = ((int)b) ^ ((int)(b >>> 32));  // fold the two halves into 32 bits
    return foldedTable[(folded * 0x78291ACF) >>> 26];
}
``````

Code:

``````
/**
 * @author Gerd Isenberg, based on
 *         "Using de Bruijn Sequences to Index a 1 in a Computer Word" by
 *         Charles E. Leiserson, Harald Prokop and Keith H. Randall
 * @param b a 64-bit word to bitscan, should not be zero
 * @return index 0..63
 */
static private final long deBruijn = 0x03f79d71b4cb0a89L;
static private final int[] magicTable = {
     0, 1,48, 2,57,49,28, 3,
    61,58,50,42,38,29,17, 4,
    62,55,59,36,53,51,43,22,
    45,39,33,30,24,18,12, 5,
    63,47,56,27,60,41,37,16,
    54,35,52,21,44,32,23,11,
    46,26,40,15,34,20,31,10,
    25,14,19, 9,13, 8, 7, 6,
};

static public int bitScanForwardDeBruijn64(long b) {
    // isolated LS1B times the de Bruijn constant; top 6 bits index the table
    int idx = (int)(((b & -b) * deBruijn) >>> 58);
    return magicTable[idx];
}
``````

Code:

``````
public static void main(String[] args) {
    int n = 1000000, i;
    Date t1 = new Date();
    for (i = 0; i < n; i++) {
        long set = -1L;
        while (set != 0)
            set ^= 1L << bitScanForwardMatt(set);
    }
    Date t2 = new Date();
    for (i = 0; i < n; i++) {
        long set = -1L;
        while (set != 0)
            set ^= 1L << bitScanForwardDeBruijn64(set);
    }
    Date t3 = new Date();
    for (i = 0; i < n; i++) {
        long set = -1L;
        while (set != 0)
            set ^= 1L << bitScanForwardDbl(set);
    }
    Date t4 = new Date();
    for (i = 0; i < n; i++) {
        long set = -1L;
        while (set != 0)
            set ^= 1L << Long.numberOfTrailingZeros(set);
    }
    Date t5 = new Date();

    // n loops x 64 scans each = 64 million scans, so elapsed ms / 64 == ns per scan
    double time1 = (t2.getTime() - t1.getTime()) / 64;
    double time2 = (t3.getTime() - t2.getTime()) / 64;
    double time3 = (t4.getTime() - t3.getTime()) / 64;
    double time4 = (t5.getTime() - t4.getTime()) / 64;

    System.out.println("bitScanForwardMatt       takes " + time1 + " ns");
    System.out.println("bitScanForwardDeBruijn64 takes " + time2 + " ns");
    System.out.println("bitScanForwardDbl        takes " + time3 + " ns");
    System.out.println("numberOfTrailingZeros    takes " + time4 + " ns");
}
``````
Execution times under various JREs on my AMD X2 4200:

Sun JDK 1.5.0_05 Linux i586 (-server):

Code:

``````bitScanForwardMatt       takes 11.1 ns
bitScanForwardDeBruijn64 takes 15.2 ns
bitScanForwardDbl        takes 31.8 ns
numberOfTrailingZeros    takes 23.5 ns
``````
Sun JDK 1.5.0_08 Linux AMD64 (-server):

Code:

``````bitScanForwardMatt       takes 9.1 ns
bitScanForwardDeBruijn64 takes 9.4 ns
bitScanForwardDbl        takes 13.9 ns
numberOfTrailingZeros    takes 13.2 ns
``````
JRockit JDK 1.5.0_6 Linux AMD64:

Code:

``````bitScanForwardMatt       takes 10.9 ns
bitScanForwardDeBruijn64 takes 8.6 ns
bitScanForwardDbl        takes 16.6 ns
numberOfTrailingZeros    takes 16.6 ns
``````
Sun JDK 1.6.0 beta 2 Linux AMD64 (-server):

Code:

``````bitScanForwardMatt       takes 8.9 ns
bitScanForwardDeBruijn64 takes 9.3 ns
bitScanForwardDbl        takes 14.5 ns
numberOfTrailingZeros    takes 13.0 ns
``````
Gerd

### Re: Fastest bitboard compress routine when you can't use ASM

Posted: Fri Jun 01, 2007 11:28 am
mambofish wrote:

Code:

``tzc64[(0x3f & ((lo ^ high) * 0x78291ACF) >> 26)];``
I believe this is called Matt Taylor's folded bitscan or something. I have never used it because I only cared about 64 bits. Here's the 64-bit version of this bitscan, first devised by some guys at MIT ... forgot their names already...:

Code:

``````
// X - a bitboard with a single bit active
#define toIndex(X) toIndexDB[(X)*C64(0x07EDD5E59A4E28C2)>>58]
``````
So, it seems I didn't find anything new...

Anyways, an "isomorphic" method proved fast enough for me, but at some point any improvement in these kinds of routines stopped translating into a faster search; using gprof I actually found out I was working to save a few percent of the time... maybe it's because I am currently evaluating at every node (ohh, yes, veeery slow).
What is, for you guys (in a "real-life" search), the gain in NPS when you switched to your current favourite bit-twiddling routines?

Cheers, Mauro

### Re: Fastest bitboard compress routine when you can't use ASM

Posted: Fri Jun 01, 2007 12:15 pm
Hi Gerd

Thanks for the details. I don't know where you find this stuff, but it is very interesting!

Here's the Robert Harley reference I used for my code.

http://www.df.lth.se/~john_e/gems/gem0039.html

I found a reference to the table-based technique in "Hacker's Delight", which put me on to the idea. A Google search did the rest.

I am a little surprised at some of the timings presented. For example, I wouldn't use numberOfTrailingZeros. It relies on this binary chop code to isolate the LSB.

Code:

``````
public static int numberOfTrailingZeros(long i) {
    // HD, Figure 5-14
    int x, y;
    if (i == 0) return 64;

    int n = 63;
    y = (int)i;  if (y != 0) { n = n - 32; x = y; } else x = (int)(i >>> 32);
    y = x << 16; if (y != 0) { n = n - 16; x = y; }
    y = x <<  8; if (y != 0) { n = n -  8; x = y; }
    y = x <<  4; if (y != 0) { n = n -  4; x = y; }
    y = x <<  2; if (y != 0) { n = n -  2; x = y; }
    return n - ((x << 1) >>> 31);
}
``````
Each 1-bit therefore requires five tests to identify. I did try this technique and found it very much slower than the magic number method, so I am a little surprised at the timings you report. Nevertheless, I'll set up some tests again and confirm.

Regards,
Vince

### Re: Fastest bitboard compress routine when you can't use ASM

Posted: Fri Jun 01, 2007 1:28 pm
yoshiharu wrote:
mambofish wrote:

Code:

``tzc64[(0x3f & ((lo ^ high) * 0x78291ACF) >> 26)];``
I believe this is called Matt Taylor's folded bitscan or something. I have never used it because I only cared about 64 bits. Here's the 64-bit version of this bitscan, first devised by some guys at MIT ... forgot their names already...:

Code:

``````
// X - a bitboard with a single bit active
#define toIndex(X) toIndexDB[(X)*C64(0x07EDD5E59A4E28C2)>>58]
``````
So, it seems I didn't find anything new...

Anyways, an "isomorphic" method proved fast enough for me, but at some point any improvement in these kinds of routines stopped translating into a faster search; using gprof I actually found out I was working to save a few percent of the time... maybe it's because I am currently evaluating at every node (ohh, yes, veeery slow).
What is, for you guys (in a "real-life" search), the gain in NPS when you switched to your current favourite bit-twiddling routines?

Cheers, Mauro
A story from the world of C#

When I first started out programming my engine "Pupsi", I quickly wrote a simple, straightforward algorithm, basically shifting the bits off to the left and updating indexes.

I thought "This is going to fly, once I get this code optimized with some really smart stuff!"

I have tried every trick I could find on the net, but nothing gave any improvement.

I implemented the algorithm suggested by Vince, and I got roughly a 2.5-3% improvement in pure perft. Perft including evaluation gives about a 0.5% speedup. (These are very preliminary tests; more are definitely needed!)

Here is the algorithm I wrote.
Please note: I index the bits from the opposite side to the example given by Vince, so the high-order bits are in mask_low, meaning the lowest indexes.

Code:

``````
public static void getSquaresOld(ulong mask, int[] squares, ref int index) {
    index = 0;
    // high 32 bits first: these carry the lowest square indexes (0..31)
    uint rank_mask = (uint)(mask >> 32);
    int rank = 0;
    int file;
    while (rank < 32) {
        file = 0;
        while (file < 8) {
            if ((rank_mask & 0x80000000) != 0) {
                squares[index++] = rank + file;
            }
            rank_mask <<= 1;
            file += 1;
        }
        rank += 8;
    }

    // then the low 32 bits: square indexes 32..63
    rank_mask = (uint)mask;
    rank = 32;
    while (rank < 64) {
        file = 0;
        while (file < 8) {
            if ((rank_mask & 0x80000000) != 0) {
                squares[index++] = rank + file;
            }
            rank_mask <<= 1;
            file += 1;
        }
        rank += 8;
    }
}
``````
So even though the improvement is very, very small, perhaps I can finally put the old algorithm to rest!

Thanks!

- Jesper

### Re: Fastest bitboard compress routine when you can't use ASM

Posted: Fri Jun 01, 2007 3:26 pm
Hi Jesper,

Having now seen the posts elsewhere by Gerd and Pradu among others, it seems this algorithm is much more widely known than I had appreciated.

Gerd's 64-bit magic number code is neater, but slower than the code I posted. I imagine this is because I'm running on 32-bit hardware, where a 64-bit multiplication takes appreciably longer than a 32-bit one. On 64-bit hardware, Gerd's code might well be faster.

However, in your case I don't think a 2.5-3% speedup will make that much difference, because your program probably isn't spending more than 10-15% of its time in move generation anyway. (As you noticed, once you include the evaluation effort, its effect was reduced to 0.5% in your engine.)

At these tiny differences, I think code size and elegance become more important considerations. In fact I am tempted to try Gerd's algorithm anyway, simply on those criteria, even though it is around 25% slower:

Code:

``````
test value: 805318656 (4 bits on)
bsf64_58: elapsed= 875 ms
bsf64_26: elapsed= 672 ms
test value: 805318656
``````
(I tested over a range of values of course, but this is an indicative result. In fact I suspect it's quite important to base any decision on test values roughly representative of the average number of target squares a piece might have during the game, rather than on an outlying value such as 1 or 19.)

Kind regards
Vince