Fastest bitboard compress routine when you can't use ASM

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

mambofish

Fastest bitboard compress routine when you can't use ASM

Post by mambofish »

One of the problems with bitboard move generators is how to efficiently determine which bits are "on" in a 64-bit move map, in order to generate the target squares for a piece when constructing move lists.

Those of you who can drop into assembler can of course use the BSF and BSR opcodes to do this, but I thought you might nonetheless be interested in a technique based on number theory that I have used in ermintrude (which is written in Java and so can't use any asm code) and that is faster than any other non-assembler technique I've tried. It's so good it's almost like magic. Moreover it runs in time linear in the number of set bits - no processing huge numbers of zero-bits just to find the one-bits ;-)

The idea is to extract each set bit in turn from the 64-bit value, fold it into a 32-bit value, and multiply it by a "magic" number. The top 6 bits of the result are then used as an index into a small lookup table, whose value gives the bit index (0 - 63) of the set bit - and hence the target square for the move.

Here's the code I use in ermintrude:

Code: Select all

public static final int tzc64[] = {
  63, 30,  3, 32, 59, 14, 11, 33, 60, 24, 50,  9, 55, 19, 21, 34,
  61, 29,  2, 53, 51, 23, 41, 18, 56, 28,  1, 43, 46, 27,  0, 35,
  62, 31, 58,  4,  5, 49, 54,  6, 15, 52, 12, 40,  7, 42, 45, 16,
  25, 57, 48, 13, 10, 39,  8, 44, 20, 47, 38, 22, 17, 37, 36, 26
};

public static int compress(long bb, int[] loc) {
    int lo, hi;
    long v;
    int j = 0;
    while (bb != 0) {
        v = bb ^ (bb - 1);      // lowest set bit plus all the zeros below it
        bb &= (bb - 1);         // clear the lowest set bit
        hi = (int) (v >>> 32);  // high 32 bits of the mask
        lo = (int) v;           // low 32 bits of the mask
        loc[j++] = tzc64[0x3f & (((lo ^ hi) * 0x78291ACF) >> 26)];
    }
    loc[j] = -1;
    return j;
}

The multiplication is probably the slowest part of the algorithm, as the lookup table itself is small enough to reside in cache memory. I haven't tried factorising the magic number into a sequence of shifts and adds - that might speed it up slightly.
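For anyone wanting to try it, here is a self-contained sketch with a tiny driver (the table and method above are copied verbatim so it compiles on its own; the class name and example bitboard are mine, purely for illustration):

```java
public class CompressDemo {
    static final int[] tzc64 = {
        63, 30,  3, 32, 59, 14, 11, 33, 60, 24, 50,  9, 55, 19, 21, 34,
        61, 29,  2, 53, 51, 23, 41, 18, 56, 28,  1, 43, 46, 27,  0, 35,
        62, 31, 58,  4,  5, 49, 54,  6, 15, 52, 12, 40,  7, 42, 45, 16,
        25, 57, 48, 13, 10, 39,  8, 44, 20, 47, 38, 22, 17, 37, 36, 26
    };

    static int compress(long bb, int[] loc) {
        int j = 0;
        while (bb != 0) {
            long v = bb ^ (bb - 1);          // lowest set bit plus the zeros below it
            bb &= (bb - 1);                  // clear the lowest set bit
            int hi = (int) (v >>> 32);
            int lo = (int) v;
            loc[j++] = tzc64[0x3f & (((lo ^ hi) * 0x78291ACF) >> 26)];
        }
        loc[j] = -1;                          // sentinel terminator
        return j;
    }

    public static void main(String[] args) {
        int[] loc = new int[65];
        // bitboard with bits 0, 3 and 63 set
        int n = compress((1L << 63) | 0b1001L, loc);
        System.out.println(n + " squares: " + loc[0] + " " + loc[1] + " " + loc[2]);
    }
}
```

Running it should print `3 squares: 0 3 63` - note that the fold of an all-ones mask is zero, which is why tzc64[0] holds 63.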

Anyway, it's just one of the tricks I've had to resort to in order to get Java to perform respectably :-)

Vince
Alessandro Scotti

Re: Fastest bitboard compress routine when you can't use ASM

Post by Alessandro Scotti »

I'm using an almost identical method in Kiwi, which I took from Gerd Isenberg. I was using BSF and BSR before that, and after replacing _all_ of the ASM stuff with this kind of "magic" stuff my program actually got slightly faster! :-)
Pradu
Posts: 287
Joined: Sat Mar 11, 2006 3:19 am
Location: Atlanta, GA

Re: Fastest bitboard compress routine when you can't use ASM

Post by Pradu »

mambofish wrote:

Code: Select all

tzc64[(0x3f & ((lo ^ high) * 0x78291ACF) >> 26)];
I believe this is called Matt Taylor's folded bitscan or something. I have never used it because I only cared about 64 bits. Here's the 64-bit version of this bitscan, which some guys at MIT first came up with ... forgot their names already...:

Code: Select all

X - a bitboard with a single bit active
#define toIndex(X) toIndexDB[(X)*C64(0x07EDD5E59A4E28C2)>>58]
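The lookup table for that constant isn't shown. Assuming the constant really is a valid bitscan multiplier (each single-bit input hits a distinct top-6-bit index), the 64-entry table - called toIndexDB in the macro - can be generated rather than typed in. A sketch in Java (class and method names are mine):

```java
public class DeBruijnTable {
    static final long MAGIC = 0x07EDD5E59A4E28C2L;

    // For each bit position i, record i at the slot that (1L << i) * MAGIC maps to.
    // If MAGIC is a valid bitscan multiplier, every slot is written exactly once.
    static int[] buildTable() {
        int[] table = new int[64];
        for (int i = 0; i < 64; i++) {
            table[(int) (((1L << i) * MAGIC) >>> 58)] = i;
        }
        return table;
    }

    static final int[] toIndexDB = buildTable();

    // 64-bit bitscan: bb must have exactly one bit set
    static int toIndex(long bb) {
        return toIndexDB[(int) ((bb * MAGIC) >>> 58)];
    }

    public static void main(String[] args) {
        for (int i = 0; i < 64; i++) {
            if (toIndex(1L << i) != i) throw new AssertionError("bad magic at bit " + i);
        }
        System.out.println("round-trip OK");
    }
}
```

The round-trip check in main doubles as a validity test for the constant: a collision during table construction would make some bit fail it.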
I used to use these magic bitscans because I didn't know and didn't like assembly. Now I use compiler intrinsics where possible which seem to work faster and are as simple to use:

Code: Select all

#ifdef _MSC_VER
	#pragma message("MSC compatible compiler detected -- turning off warning 4146")
	#pragma warning( disable : 4146)
	#ifdef _WIN64
		#include <intrin.h>
		#pragma intrinsic(_BitScanForward64)
		#pragma intrinsic(_BitScanReverse64)
		#define USING_INTRINSICS
	#endif
#endif
_BitScanForward64(&ret, bb);
_BitScanReverse64(&ret, bb);
mambofish

Re: Fastest bitboard compress routine when you can't use ASM

Post by mambofish »

Yep, I think Matt Taylor and Robert Harley were the guys who came up with this originally, back in 1996 or thereabouts.

I wasn't aware of the 64-bit magic number though. I'll see if I can get it to work, because the (int) casts in Java are s l o w . It might not work though, because Java doesn't do unsigned arithmetic.

It's a toy language really :-)
Pradu
Posts: 287
Joined: Sat Mar 11, 2006 3:19 am
Location: Atlanta, GA

Re: Fastest bitboard compress routine when you can't use ASM

Post by Pradu »

mambofish wrote:Yep, I think Matt Taylor and Robert Harley were the guys who came up with this originally, back in 1996 or thereabouts.

I wasn't aware of the 64-bit magic number though. I'll see if I can get it to work, because the (int) casts in Java are s l o w . It might not work though, because Java doesn't do unsigned arithmetic.

It's a toy language really :-)
There are actually many (67,108,864) 64-bit magic numbers. See: http://www.vpittlik.org/wbforum/viewtop ... 6467#26467

I believe there are only 2 Matt-Taylor magic numbers though.
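A 64-bit constant works as such a magic exactly when the 64 products (1L << i) * magic produce 64 distinct top-6-bit indices. A quick Java sketch of that check (class and method names are mine):

```java
public class MagicCheck {
    // Returns true if every single-bit input maps to a distinct 6-bit index,
    // i.e. the constant works as a 64-bit bitscan multiplier.
    static boolean isValidBitscanMagic(long magic) {
        boolean[] seen = new boolean[64];
        for (int i = 0; i < 64; i++) {
            int idx = (int) (((1L << i) * magic) >>> 58);
            if (seen[idx]) return false;   // collision: two bits share an index
            seen[idx] = true;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidBitscanMagic(0x07EDD5E59A4E28C2L)); // the constant quoted above
        System.out.println(isValidBitscanMagic(1L));                  // clearly not a magic
    }
}
```

Counting the constants that pass this test over all candidates is how figures like the one above are arrived at.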
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Fastest bitboard compress routine when you can't use ASM

Post by Gerd Isenberg »

mambofish wrote:Yep, I think Matt Taylor and Robert Harley were the guys who came up with this originally, back in 1996 or thereabouts.

I wasn't aware of the 64-bit magic number though. I'll see if I can get it to work, because the (int) casts in Java are s l o w . It might not work though, because Java doesn't do unsigned arithmetic.

It's a toy language really :-)
Java has an explicit logical shift right, >>>

Here is Matt's first posting about his folded trick, from June 2003.
http://groups.google.de/group/comp.lang ... a3044afa7c

I was not aware of Robert Harley. Do you have some references?

Java 1.5 has Long.numberOfTrailingZeros, which might be the best choice - the JVM may compile it down to a hardware bit-scan instruction on current and future Intel and AMD processors.
Here are some Java bitscans and timing results by Klaus Friedel under various JREs:

Code: Select all

   /**
     * @author GIsenberg
     * @return index 0..63 of LS1B -1023 if passing zero
     * @param b a 64-bit word to bitscan
     */
    static public int bitScanForwardDbl(long b)
    {
        double x = (double) (b & -b);
        int exp = (int) (Double.doubleToLongBits(x) >>> 52);
        return (exp & 2047) - 1023;
    }

Code: Select all

    /**
     * @author Matt Taylor
     * @return index 0..63
     * @param bb a 64-bit word to bitscan, should not be zero
     */
    static private final int[] foldedTable = {
     63,30, 3,32,59,14,11,33,
     60,24,50, 9,55,19,21,34,
     61,29, 2,53,51,23,41,18,
     56,28, 1,43,46,27, 0,35,
     62,31,58, 4, 5,49,54, 6,
     15,52,12,40, 7,42,45,16,
     25,57,48,13,10,39, 8,44,
     20,47,38,22,17,37,36,26,
    };

    static public int bitScanForwardMatt(long b) {
       b ^= (b - 1);
       int folded = ((int) b) ^ ((int) (b >>> 32));
       return foldedTable[(folded * 0x78291ACF) >>> 26];
    }

Code: Select all

   /**
     * @author Gerd Isenberg based on
     *         "Using de Bruijn Sequences to Index a 1 in a Computer Word" by
     *         Charles E. Leiserson
     *         Harald Prokop
     *         Keith H. Randall
     * @return index 0..63
     * @param bb a 64-bit word to bitscan, should not be zero
     */

    static private final long deBruijn = 0x03f79d71b4cb0a89L;
    static private final int[] magicTable = {
      0, 1,48, 2,57,49,28, 3,
     61,58,50,42,38,29,17, 4,
     62,55,59,36,53,51,43,22,
     45,39,33,30,24,18,12, 5,
     63,47,56,27,60,41,37,16,
     54,35,52,21,44,32,23,11,
     46,26,40,15,34,20,31,10,
     25,14,19, 9,13, 8, 7, 6,
    };

    static public int bitScanForwardDeBruijn64(long b) {
       int idx = (int) (((b & -b) * deBruijn) >>> 58);
       return magicTable[idx];
    }

Code: Select all

public static void main(String[] args) {
    int n = 1000000, i;
    Date t1 = new Date();
    for (i = 0; i < n; i++) {
        long set = -1L;
        while (set != 0)
            set ^= 1L << bitScanForwardMatt(set);
    }
    Date t2 = new Date();
    for (i = 0; i < n; i++) {
        long set = -1L;
        while (set != 0)
            set ^= 1L << bitScanForwardDeBruijn64(set);
    }
    Date t3 = new Date();
    for (i = 0; i < n; i++) {
        long set = -1L;
        while (set != 0)
            set ^= 1L << bitScanForwardDbl(set);
    }
    Date t4 = new Date();
    for (i = 0; i < n; i++) {
        long set = -1L;
        while (set != 0)
            set ^= 1L << Long.numberOfTrailingZeros(set);
    }
    Date t5 = new Date();

    double time1 = (t2.getTime() - t1.getTime()) / 64;
    double time2 = (t3.getTime() - t2.getTime()) / 64;
    double time3 = (t4.getTime() - t3.getTime()) / 64;
    double time4 = (t5.getTime() - t4.getTime()) / 64;

    System.out.println("bitScanForwardMatt       takes " + time1 + " ns");
    System.out.println("bitScanForwardDeBruijn64 takes " + time2 + " ns");
    System.out.println("bitScanForwardDbl        takes " + time3 + " ns");
    System.out.println("numberOfTrailingZeros    takes " + time4 + " ns");
}
Execution times under various JREs on my AMD X2 4200:

Sun JDK 1.5.0_05 Linux i586 (-server):

Code: Select all

bitScanForwardMatt       takes 11.1 ns
bitScanForwardDeBruijn64 takes 15.2 ns
bitScanForwardDbl        takes 31.8 ns
numberOfTrailingZeros    takes 23.5 ns
Sun JDK 1.5.0_08 Linux AMD64 (-server):

Code: Select all

bitScanForwardMatt       takes 9.1 ns
bitScanForwardDeBruijn64 takes 9.4 ns
bitScanForwardDbl        takes 13.9 ns
numberOfTrailingZeros    takes 13.2 ns
JRockit JDK 1.5.0_6 Linux AMD64:

Code: Select all

bitScanForwardMatt       takes 10.9 ns
bitScanForwardDeBruijn64 takes 8.6 ns
bitScanForwardDbl        takes 16.6 ns
numberOfTrailingZeros    takes 16.6 ns
Sun JDK 1.6.0 beta 2 Linux AMD64 (-server):

Code: Select all

bitScanForwardMatt       takes 8.9 ns
bitScanForwardDeBruijn64 takes 9.3 ns
bitScanForwardDbl        takes 14.5 ns
numberOfTrailingZeros    takes 13.0 ns
Gerd
yoshiharu
Posts: 56
Joined: Sat Nov 11, 2006 11:14 pm

Re: Fastest bitboard compress routine when you can't use ASM

Post by yoshiharu »

Pradu wrote:
mambofish wrote:

Code: Select all

tzc64[(0x3f & ((lo ^ high) * 0x78291ACF) >> 26)];
I believe this is called Matt Taylor's folded bitscan or something. I have never used it because I only cared about 64 bits. Here's the 64-bit version of this bitscan, which some guys at MIT first came up with ... forgot their names already...:

Code: Select all

X - a bitboard with a single bit active
#define toIndex(X) toIndexDB[(X)*C64(0x07EDD5E59A4E28C2)>>58]
So, it seems I didn't find anything new...:?
:-D
Anyway, an "isomorphic" method proved fast enough for me, but at some point improvements in this kind of routine stopped translating into a faster search: using gprof I found out I was working to save a few percent of the total time... maybe it's because I am currently evaluating at every node (ohh, yes, veeery slow ;-) ).
What is, for you guys, the gain in NPS (in a "real-life" search) when you switched to your current favourite bit-twiddling routines?

Cheers, Mauro
mambofish

Re: Fastest bitboard compress routine when you can't use ASM

Post by mambofish »

Hi Gerd

Thanks for the details. I don't know where you find this stuff, but it is very interesting!

Here's the Robert Harley reference I used for my code.

http://www.df.lth.se/~john_e/gems/gem0039.html

I found a reference to the table-based technique in "Hacker's Delight" which put me on to the idea. A Google search did the rest :-)

I am a little surprised at some of the timings presented. For example, I wouldn't use numberOfTrailingZeros: it relies on this binary-chop code to isolate the LSB.

Code: Select all

    public static int numberOfTrailingZeros(long i) {
        // HD, Figure 5-14
        int x, y;
        if (i == 0) return 64;

        int n = 63;
        y = (int) i;  if (y != 0) { n = n - 32; x = y; } else x = (int) (i >>> 32);
        y = x << 16;  if (y != 0) { n = n - 16; x = y; }
        y = x << 8;   if (y != 0) { n = n - 8;  x = y; }
        y = x << 4;   if (y != 0) { n = n - 4;  x = y; }
        y = x << 2;   if (y != 0) { n = n - 2;  x = y; }
        return n - ((x << 1) >>> 31);
    }
Each 1-bit therefore requires five tests to identify. I did try this technique and found it very much slower than the magic-number method, so I am a little surprised at the timings you report. Nevertheless, I'll set up some tests again and confirm.
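For what it's worth, here is the sort of standalone re-test one could run (a sketch; the class name, loop counts and naive reference are mine, and JIT warm-up can distort such micro-timings, so the numbers should be taken with a grain of salt):

```java
public class ScanTest {
    // Naive reference: count right-shifts until the low bit is set.
    // Precondition: b must be nonzero.
    static int naiveTzc(long b) {
        int n = 0;
        while ((b & 1L) == 0) { b >>>= 1; n++; }
        return n;
    }

    public static void main(String[] args) {
        // correctness: library call agrees with the naive loop for every single bit
        for (int i = 0; i < 64; i++) {
            long b = 1L << i;
            if (Long.numberOfTrailingZeros(b) != naiveTzc(b))
                throw new AssertionError("mismatch at bit " + i);
        }

        // rough timing: serialize a full 64-bit board a million times
        long sink = 0, t0 = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            long set = -1L;
            while (set != 0) {
                sink += Long.numberOfTrailingZeros(set);
                set &= set - 1;   // clear the lowest set bit
            }
        }
        long t1 = System.nanoTime();
        // sink is printed so the JIT cannot eliminate the loop as dead code
        System.out.println("~" + (t1 - t0) / 64_000_000.0 + " ns per scan (sink=" + sink + ")");
    }
}
```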

Regards,
Vince
jesper_nielsen

Re: Fastest bitboard compress routine when you can't use ASM

Post by jesper_nielsen »

yoshiharu wrote:
Pradu wrote:
mambofish wrote:

Code: Select all

tzc64[(0x3f & ((lo ^ high) * 0x78291ACF) >> 26)];
I believe this is called Matt Taylor's folded bitscan or something. I have never used it because I only cared about 64 bits. Here's the 64-bit version of this bitscan, which some guys at MIT first came up with ... forgot their names already...:

Code: Select all

X - a bitboard with a single bit active
#define toIndex(X) toIndexDB[(X)*C64(0x07EDD5E59A4E28C2)>>58]
So, it seems I didn't find anything new...:?
:-D
Anyway, an "isomorphic" method proved fast enough for me, but at some point improvements in this kind of routine stopped translating into a faster search: using gprof I found out I was working to save a few percent of the total time... maybe it's because I am currently evaluating at every node (ohh, yes, veeery slow ;-) ).
What is, for you guys, the gain in NPS (in a "real-life" search) when you switched to your current favourite bit-twiddling routines?

Cheers, Mauro
A story from the world of C#

When I first started out programming my engine "Pupsi", I quickly made a simple, straightforward algorithm, basically shifting the bits off to the left and updating indexes.

I thought "This is going to fly, once I get this code optimized with some really smart stuff!"

I have tried every trick I could find on the net, but nothing gave any improvement. :-(

I implemented the algorithm suggested by Vince, and I got roughly a 2.5-3% improvement in pure perft. Perft including evaluation gives about a 0.5% speedup. (This is after very preliminary tests. More are definitely needed!)

Here is the algorithm I wrote:
Please note: I index the bits from the opposite side to the example given by Vince, so the high-order bits are in mask_low, meaning the lowest indexes.

Code: Select all

    public static void getSquaresOld(ulong mask, int[] squares, ref int index) {
      index = 0;
      int rank = 0;
      int file;
      uint mask_low = (uint)((mask >> 32) & 0xFFFFFFFF);
      uint mask_high = (uint)(mask & 0xFFFFFFFF);
      uint rank_mask;
      while (mask_low != 0) {
        if ((rank_mask = (mask_low & 0xFF000000)) != 0) {
          file = 0;
          while (rank_mask != 0) {
            if ((rank_mask & 0x80000000) != 0) {
              squares[index++] = rank + file;
            }
            rank_mask <<= 1;
            file += 1;
          }
        }
        mask_low <<= 8;
        rank += 8;
      }

      rank = 32;
      while (mask_high != 0) {
        if ((rank_mask = (mask_high & 0xFF000000)) != 0) {
          file = 0;
          while (rank_mask != 0) {
            if ((rank_mask & 0x80000000) != 0) {
              squares[index++] = rank + file;
            }
            rank_mask <<= 1;
            file += 1;
          }
        }
        mask_high <<= 8;
        rank += 8;
      }
    }
So even though the improvement is very, very small, perhaps I can finally put the old algorithm to rest! :-)

Thanks!

- Jesper
mambofish

Re: Fastest bitboard compress routine when you can't use ASM

Post by mambofish »

Hi Jesper,

Having now seen the posts elsewhere by Gerd and Pradu among others, it seems this algorithm is much more widely known than I had appreciated. :oops:

Gerd's 64-bit magic-number code is neater, but slower than the code I posted. I imagine this is because I'm running on 32-bit hardware, where a 64-bit multiplication takes appreciably longer than a 32-bit one. On 64-bit hardware, Gerd's code might well be faster.

However, in your case I don't think a 2.5-3% speedup will make that much difference, because your program probably isn't spending more than 10-15% of its time in move generation anyway. (As you noticed, once evaluation is included its effect drops to 0.5% in your engine.)

At these tiny differences, I think code size and elegance become more important considerations. In fact I am tempted to try Gerd's algorithm anyway, simply on those criteria, even though it is around 25% slower:

Code: Select all

test value: 805318656 (4 bits on)
bsf64_58: elapsed= 875 ms
bsf64_26: elapsed= 672 ms
test value: 805318656
(I tested over a range of values, of course, but this is an indicative result. In fact I suspect it's quite important to base any decision on test values roughly representative of the average number of target squares a piece might have during the game, rather than on an outlying value such as 1 or 19.)

Kind regards
Vince