Fastest bitboard compress routine when you can't use ASM


mambofish

Fastest bitboard compress routine when you can't use ASM

Post by mambofish »

One of the problems with bitboard move generators is how to efficiently determine which bits are "on" in a 64-bit move map, in order to generate the target squares for a piece when constructing move lists.

Those of you who can drop into assembler can of course use the BSF and BSR opcodes to do this, but I thought you might nonetheless be interested in a technique based on number theory that I have used in ermintrude (which is written in Java and so can't use any asm code) and that is faster than any other non-assembler technique I've tried. It's so good it's almost like magic. Moreover, it runs in time proportional to the number of set bits - no processing huge numbers of zero-bits just to find the one-bits ;-)

The idea is to extract each set bit in turn from the 64-bit value, fold it into a 32-bit value, and multiply that by a "magic" number. The top 6 bits of the result are then used as an index into a small lookup table, whose entry gives the bit index (0 - 63) of the extracted bit - and hence the target square for the move.

Here's the code I use in ermintrude:

Code: Select all

 
     
	public static final int tzc64[] = {
	  63, 30,  3, 32, 59, 14, 11, 33, 60, 24, 50,  9, 55, 19, 21, 34,
	  61, 29,  2, 53, 51, 23, 41, 18, 56, 28,  1, 43, 46, 27,  0, 35,
	  62, 31, 58,  4,  5, 49, 54,  6, 15, 52, 12, 40,  7, 42, 45, 16,
	  25, 57, 48, 13, 10, 39,  8, 44, 20, 47, 38, 22, 17, 37, 36, 26
	};

	public static int compress(long bb, int[] loc) {
		int lo, high;
		long v;
		int j = 0;
		while (bb != 0) {
			v = bb ^ (bb - 1);          // mask of the lowest set bit and everything below it
			bb &= (bb - 1);             // clear the lowest set bit
			high = (int) (v >>> 32);    // upper 32 bits of the mask
			lo   = (int) v;             // lower 32 bits of the mask
			loc[j++] = tzc64[0x3f & (((lo ^ high) * 0x78291ACF) >> 26)];  // fold, multiply, take top 6 bits
		}
		loc[j] = -1;                    // sentinel
		return j;                       // number of squares written
	}

The multiplication is probably the slowest part of the algorithm, as the lookup table itself is small enough to stay in cache memory. I haven't tried factorising the magic number into a sequence of shifts and adds - this might speed it up slightly.
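For anyone who wants to see it in context, here is a minimal usage sketch (the attack bitboard and the move-emission step are placeholders, and it assumes the compress() method and tzc64 table above are in scope):

Code: Select all

	// Usage sketch: walk the target squares of a piece whose attack map is in 'attacks'.
	// 'attacks' is an arbitrary example value; emitting the actual moves is left as a comment.
	long attacks = 0x0000240000240000L;    // example bitboard with a few scattered bits
	int[] targets = new int[65];           // room for 64 squares plus the -1 sentinel
	int count = compress(attacks, targets);
	for (int i = 0; i < count; i++) {
	    int to = targets[i];               // 0..63 index of a set bit = target square
	    // addMove(from, to);              // hypothetical move-list call
	}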

Anyway, it's just one of the tricks I've had to resort to in order to get Java to perform respectably :-)

Vince
Alessandro Scotti

Re: Fastest bitboard compress routine when you can't use ASM

Post by Alessandro Scotti »

I'm using an almost identical method in Kiwi, which I took from Gerd Isenberg. I was using BSF and BSR before that, and after replacing _all_ of the ASM stuff with this kind of "magic" stuff my program actually got slightly faster! :-)
Pradu
Posts: 287
Joined: Sat Mar 11, 2006 3:19 am
Location: Atlanta, GA

Re: Fastest bitboard compress routine when you can't use ASM

Post by Pradu »

mambofish wrote:

Code: Select all

tzc64[(0x3f & ((lo ^ high) * 0x78291ACF) >> 26)];
I believe this is called Matt Taylor's folded bitscan or something. I have never used it because I only cared about 64 bits. Here's the 64-bit version of this bitscan, which was first come up with by some guys at MIT ... forgot their names already...:

Code: Select all

X - a bitboard with a single bit active
#define toIndex(X) toIndexDB[(X)*C64(0x07EDD5E59A4E28C2)>>58]
I used to use these magic bitscans because I didn't know and didn't like assembly. Now I use compiler intrinsics where possible, which seem to be faster and are just as simple to use:

Code: Select all

#ifdef _MSC_VER
	#pragma message("MSC compatible compiler detected -- turning off warning 4146")
	#pragma warning( disable : 4146)
	#ifdef _WIN64
		#include <intrin.h>
		#pragma intrinsic(_BitScanForward64)
		#pragma intrinsic(_BitScanReverse64)
		#define USING_INTRINSICS
	#endif
#endif
unsigned long ret;                 /* receives the bit index */
_BitScanForward64(&ret, bb);       /* index of least significant set bit */
_BitScanReverse64(&ret, bb);       /* index of most significant set bit */
mambofish

Re: Fastest bitboard compress routine when you can't use ASM

Post by mambofish »

Yep, I think Matt Taylor and Robert Harley were the guys who came up with this originally, back in 1996 or thereabouts.

I wasn't aware of the 64-bit magic number though. I'll see if I can get it to work, because the (int) casts in Java are s l o w . It might not work though, because Java doesn't do unsigned arithmetic.

It's a toy language really :-)
Pradu
Posts: 287
Joined: Sat Mar 11, 2006 3:19 am
Location: Atlanta, GA

Re: Fastest bitboard compress routine when you can't use ASM

Post by Pradu »

mambofish wrote:Yep, I think Matt Taylor and Robert Harley were the guys who came up with this originally, back in 1996 or thereabouts.

I wasn't aware of the 64-bit magic number though. I'll see if I can get it to work, because the (int) casts in Java are s l o w . It might not work though, because Java doesn't do unsigned arithmetic.

It's a toy language really :-)
There are actually many (67108864) 64-bit magic numbers. See=> http://www.vpittlik.org/wbforum/viewtop ... 6467#26467

I believe there are only 2 Matt-Taylor magic numbers though.
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Fastest bitboard compress routine when you can't use ASM

Post by Gerd Isenberg »

mambofish wrote:Yep, I think Matt Taylor and Robert Harley were the guys who came up with this originally, back in 1996 or thereabouts.

I wasn't aware of the 64-bit magic number though. I'll see if I can get it to work, because the (int) casts in Java are s l o w . It might not work though, because Java doesn't do unsigned arithmetic.

It's a toy language really :-)
Java has an explicit logical shift right, >>>.

Here is Matt's first posting about his folded trick, from June 2003:
http://groups.google.de/group/comp.lang ... a3044afa7c

I was not aware of Robert Harley. Do you have some references?

Java 1.5 has Long.numberOfTrailingZeros, which might be the best choice - the JIT might map it to a hardware bitscan instruction on future Intel and AMD processors.
Here are some Java bitscans and some timing results by Klaus Friedel under various JREs:

Code: Select all

   /**
     * @author GIsenberg
     * @param  b a 64-bit word to bitscan
     * @return index 0..63 of the LS1B, or -1023 if passed zero
     */
    static public int bitScanForwardDbl(long b)
    {
        double x = (double)(b & -b);                          // isolate LS1B, convert to double
        int exp = (int)(Double.doubleToLongBits(x) >>> 52);   // sign + biased exponent
        return (exp & 2047) - 1023;                           // drop sign bit, remove bias -> bit index
    }

Code: Select all

    /**
     * @author Matt Taylor
     * @return index 0..63
     * @param b a 64-bit word to bitscan, must not be zero
     */
    static private final int[] foldedTable = {
     63,30, 3,32,59,14,11,33,
     60,24,50, 9,55,19,21,34,
     61,29, 2,53,51,23,41,18,
     56,28, 1,43,46,27, 0,35,
     62,31,58, 4, 5,49,54, 6,
     15,52,12,40, 7,42,45,16,
     25,57,48,13,10,39, 8,44,
     20,47,38,22,17,37,36,26,
    };

    static public int bitScanForwardMatt(long b) {
       b ^= (b - 1);
       int folded = ((int)b) ^ ((int)(b >>> 32));
       return foldedTable[(folded * 0x78291ACF) >>> 26];
    }

Code: Select all

   /**
     * @author Gerd Isenberg based on
     *         "Using de Bruijn Sequences to Index a 1 in a Computer Word" by
     *         Charles E. Leiserson
     *         Harald Prokop
     *         Keith H. Randall
     * @return index 0..63
     * @param b a 64-bit word to bitscan, must not be zero
     */

    static private final long deBruijn = 0x03f79d71b4cb0a89L;
    static private final int[] magicTable = {
      0, 1,48, 2,57,49,28, 3,
     61,58,50,42,38,29,17, 4,
     62,55,59,36,53,51,43,22,
     45,39,33,30,24,18,12, 5,
     63,47,56,27,60,41,37,16,
     54,35,52,21,44,32,23,11,
     46,26,40,15,34,20,31,10,
     25,14,19, 9,13, 8, 7, 6,
    };

    static public int bitScanForwardDeBruijn64 (long b) {
       int idx = (int)(((b & -b) * deBruijn) >>> 58);
       return magicTable[idx];
    } 
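As an aside, the 64-entry table doesn't have to be typed in by hand: since the multiplier is a de Bruijn sequence, every single-bit input lands on a distinct 6-bit index, so the table can be generated at start-up. A small self-contained sketch (class and method names are just for illustration):

Code: Select all

public class DeBruijnGen {
    static final long DEBRUIJN = 0x03f79d71b4cb0a89L;   // same constant as above
    static final int[] table = new int[64];
    static {
        // each single-bit board 1L<<sq yields a unique value in the top 6 bits
        for (int sq = 0; sq < 64; sq++)
            table[(int)(((1L << sq) * DEBRUIJN) >>> 58)] = sq;
    }

    static int bitScanForward(long b) {        // b must not be zero
        return table[(int)(((b & -b) * DEBRUIJN) >>> 58)];
    }

    public static void main(String[] args) {
        // sanity check: every square index should round-trip
        for (int sq = 0; sq < 64; sq++)
            if (bitScanForward(1L << sq) != sq)
                System.out.println("mismatch at " + sq);
        System.out.println("bitScanForward(0x100L) = " + bitScanForward(0x100L));  // expect 8
    }
}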

Code: Select all

public static void main(String[] args) {
    int n = 1000000, i;
    Date t1 = new Date();
    for (i = 0; i < n; i++) {
        long set = -1L;
        while (set != 0)
            set ^= 1L<<bitScanForwardMatt(set);
    }
    Date t2 = new Date();
    for (i = 0; i < n; i++) {
        long set = -1L;
        while (set != 0)
            set ^= 1L<<bitScanForwardDeBruijn64 (set);
    }
    Date t3 = new Date();
    for (i = 0; i < n; i++) {
        long set = -1L;
        while (set != 0)
            set ^= 1L<<bitScanForwardDbl(set);
    }
    Date t4 = new Date();
    for (i = 0; i < n; i++) {
        long set = -1L;
        while (set != 0)
            set ^= 1L<<Long.numberOfTrailingZeros(set);
    }
    Date t5 = new Date();

    // n = 1,000,000 outer iterations x 64 scans each = 64,000,000 scans;
    // elapsed milliseconds / 64 therefore equals nanoseconds per scan.
    double time1 = (t2.getTime() - t1.getTime()) / 64.0;
    double time2 = (t3.getTime() - t2.getTime()) / 64.0;
    double time3 = (t4.getTime() - t3.getTime()) / 64.0;
    double time4 = (t5.getTime() - t4.getTime()) / 64.0;

    System.out.println("bitScanForwardMatt       takes " + time1 + " ns");
    System.out.println("bitScanForwardDeBruijn64 takes " + time2 + " ns");
    System.out.println("bitScanForwardDbl        takes " + time3 + " ns");
    System.out.println("numberOfTrailingZeros    takes " + time4 + " ns");
}
Execution times under various JREs on my AMD X2 4200:

Sun JDK 1.5.0_05 Linux i586 (-server):

Code: Select all

bitScanForwardMatt       takes 11.1 ns
bitScanForwardDeBruijn64 takes 15.2 ns
bitScanForwardDbl        takes 31.8 ns
numberOfTrailingZeros    takes 23.5 ns
Sun JDK 1.5.0_08 Linux AMD64 (-server):

Code: Select all

bitScanForwardMatt       takes 9.1 ns
bitScanForwardDeBruijn64 takes 9.4 ns
bitScanForwardDbl        takes 13.9 ns
numberOfTrailingZeros    takes 13.2 ns
JRockit JDK 1.5.0_6 Linux AMD64:

Code: Select all

bitScanForwardMatt       takes 10.9 ns
bitScanForwardDeBruijn64 takes 8.6 ns
bitScanForwardDbl        takes 16.6 ns
numberOfTrailingZeros    takes 16.6 ns
Sun JDK 1.6.0 beta 2 Linux AMD64 (-server):

Code: Select all

bitScanForwardMatt       takes 8.9 ns
bitScanForwardDeBruijn64 takes 9.3 ns
bitScanForwardDbl        takes 14.5 ns
numberOfTrailingZeros    takes 13.0 ns
Gerd
yoshiharu
Posts: 56
Joined: Sat Nov 11, 2006 11:14 pm

Re: Fastest bitboard compress routine when you can't use ASM

Post by yoshiharu »

Pradu wrote:
mambofish wrote:

Code: Select all

tzc64[(0x3f & ((lo ^ high) * 0x78291ACF) >> 26)];
I believe this is called Matt Taylor's folded bitscan or something. I have never used it because I only cared about 64 bits. Here's the 64-bit version of this bitscan, which was first come up with by some guys at MIT ... forgot their names already...:

Code: Select all

X - a bitboard with a single bit active
#define toIndex(X) toIndexDB[(X)*C64(0x07EDD5E59A4E28C2)>>58]
So, it seems I didn't find anything new...:?
:-D
Anyway, an "isomorphic" method proved fast enough for me, but at some point improvements in this kind of routine stopped translating into a faster search; using gprof I found out I was only working to save a few percent of the total time... maybe that's because I currently evaluate at every node (ohh, yes, veeery slow ;-) ).
What was the gain in NPS for you guys (in a "real-life" search) when you switched to your current favourite bit-twiddling routines?

Cheers, Mauro
mambofish

Re: Fastest bitboard compress routine when you can't use ASM

Post by mambofish »

Hi Gerd

Thanks for the details. I don't know where you find this stuff, but it is very interesting!

Here's the Robert Harley reference I used for my code.

http://www.df.lth.se/~john_e/gems/gem0039.html

I found a reference to the table-based technique in "Hacker's Delight" which put me on to the idea. A Google search did the rest :-)

I am a little surprised at some of the timings presented. For example, I wouldn't use numberOfTrailingZeros. It relies on this binary chop code to isolate the LSB.

Code: Select all

    public static int numberOfTrailingZeros(long i) {
        // HD, Figure 5-14
        int x, y;
        if (i == 0) return 64;
        int n = 63;
        y = (int)i;  if (y != 0) { n = n - 32; x = y; } else x = (int)(i >>> 32);
        y = x << 16; if (y != 0) { n = n - 16; x = y; }
        y = x <<  8; if (y != 0) { n = n -  8; x = y; }
        y = x <<  4; if (y != 0) { n = n -  4; x = y; }
        y = x <<  2; if (y != 0) { n = n -  2; x = y; }
        return n - ((x << 1) >>> 31);
    }
Each 1-bit therefore requires five tests to identify. I did try this technique and found it very much slower than the magic-number method, so I am a little surprised at the timings you report. Nevertheless, I'll set up some tests again and confirm.

Regards,
Vince
jesper_nielsen

Re: Fastest bitboard compress routine when you can't use ASM

Post by jesper_nielsen »

yoshiharu wrote:
Pradu wrote:
mambofish wrote:

Code: Select all

tzc64[(0x3f & ((lo ^ high) * 0x78291ACF) >> 26)];
I believe this is called Matt Taylor's folded bitscan or something. I have never used it because I only cared about 64 bits. Here's the 64-bit version of this bitscan, which was first come up with by some guys at MIT ... forgot their names already...:

Code: Select all

X - a bitboard with a single bit active
#define toIndex(X) toIndexDB[(X)*C64(0x07EDD5E59A4E28C2)>>58]
So, it seems I didn't find anything new...:?
:-D
Anyway, an "isomorphic" method proved fast enough for me, but at some point improvements in this kind of routine stopped translating into a faster search; using gprof I found out I was only working to save a few percent of the total time... maybe that's because I currently evaluate at every node (ohh, yes, veeery slow ;-) ).
What was the gain in NPS for you guys (in a "real-life" search) when you switched to your current favourite bit-twiddling routines?

Cheers, Mauro
A story from the world of C#

When I first started out programming my engine "Pupsi", I quickly wrote a simple, straightforward algorithm, basically shifting the bits off to the left and updating indexes.

I thought "This is going to fly, once I get this code optimized with some really smart stuff!"

I have tried every trick I could find on the net, but nothing gave any improvement. :-(

I implemented the algorithm suggested by Vince, and I got roughly a 2.5-3% improvement in pure perft. Perft including evaluation gives about a 0.5% speedup. (This is after very preliminary tests. More is definitely needed!)

Here is the algorithm I wrote.
Please note: I index the bits from the opposite side to the example given by Vince, so the high-order bits are in mask_low, meaning they get the lowest indexes.

Code: Select all

    public static void getSquaresOld(ulong mask, int[] squares, ref int index) {
      index = 0;
      int rank = 0;
      int file;
      uint mask_low = (uint)((mask >> 32) & 0xFFFFFFFF);
      uint mask_high = (uint)( mask & 0xFFFFFFFF);
      uint rank_mask;
      while (mask_low != 0) {
        if ((rank_mask = (mask_low & 0xFF000000)) != 0) {
          file = 0;
          while (rank_mask != 0) {
            if ((rank_mask & 0x80000000) != 0) {
              squares[index++] = rank + file;
            }
            rank_mask <<= 1;
            file += 1;
          }
        }
        mask_low <<= 8;
        rank += 8;
      }

      rank = 32;
      while (mask_high != 0) {
        if ((rank_mask = (mask_high & 0xFF000000)) != 0) {
          file = 0;
          while (rank_mask != 0) {
            if ((rank_mask & 0x80000000) != 0) {
              squares[index++] = rank + file;
            }
            rank_mask <<= 1;
            file += 1;
          }
        }
        mask_high <<= 8;
        rank += 8;
      }
    }
So even though the improvement is very, very small, perhaps I can finally put the old algorithm to rest! :-)

Thanks!

- Jesper
mambofish

Re: Fastest bitboard compress routine when you can't use ASM

Post by mambofish »

Hi Jesper,

Having now seen the posts elsewhere by Gerd and Pradu among others, it seems this algorithm is much more widely known than I had appreciated. :oops:

Gerd's 64-bit magic-number code is neater, but slower than the code I posted. I imagine this is because I'm running on 32-bit hardware, where a 64-bit multiplication takes appreciably longer than a 32-bit one. On 64-bit hardware, Gerd's code might well be faster.

However, in your case I don't think a 2.5-3% speedup will make that much difference, because your program probably isn't spending more than 10-15% of its time in move generation anyway. (As you noticed, once you include the evaluation effort, the effect was reduced to 0.5% in your engine.)

With differences this tiny, I think code size and elegance become more important considerations. In fact, I am tempted to try Gerd's algorithm anyway simply on those criteria, even though for me it is around 25% slower:

Code: Select all

test value: 805318656 (4 bits on)
bsf64_58: elapsed= 875 ms
bsf64_26: elapsed= 672 ms
test value: 805318656
(I tested over a range of values of course, but this is an indicative result. In fact I suspect it's quite important to base any decision on test values that are roughly representative of the average number of target squares a piece might have during the game, rather than on an outlying value such as 1 or 19.)
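To make that concrete, here is a small sketch of one way to build such representative test boards (the class and helper names are just illustrative, and the choice of 8 set bits is only an example - use whatever popcount you consider typical):

Code: Select all

import java.util.Random;

public class RepresentativeBoards {
    // Build a bitboard with exactly 'bits' randomly placed set bits, so that a
    // bitscan benchmark reflects typical mobility rather than a corner case.
    static long randomBoard(Random rnd, int bits) {
        long b = 0;
        while (Long.bitCount(b) < bits)
            b |= 1L << rnd.nextInt(64);
        return b;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        for (int i = 0; i < 4; i++)
            System.out.printf("%016x (8 bits)%n", randomBoard(rnd, 8));
    }
}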

Kind regards
Vince