Page 1 of 2

Introduction and (hopefully) contribution - bitboard methods

Posted: Fri Jun 03, 2011 3:07 pm
by sedicla
I've been lurking here for awhile and first of all I'd like to say thank you to all for posting all this good stuff, I've learned a lot here and now i'd like to maybe contribute to the chess programming community.

I am involved professionally with computer programming for 25 years. And some of my hobbies are programming and chess. So why not combine both :). Well I'm working on my own chess engine and now I'm just reviewing the code in order to make available soon. Hope it will be helpfull to other fellow cc programmers.

Here I want to share what I did with bitboards, that got me some performance improvement over traditional methods. I'm talking about bit_count, first_index and last_index methods.
I was using the usual methods:

Code: Select all

int bb_bit_count(U64 bb)
{
   int count = 0;
   while (bb)
   {
       count++;
       bb &= bb - 1;
   }
   return count;
}

int bb_first_index(U64 bb) 
{
	if	(bb >> 48)
		return 63 - (bb_msb_table[bb >> 48] + 48);
	if	((bb >> 32) & 65535)
		return 63 - (bb_msb_table[(bb >> 32) & 65535] + 32);
	if	((bb >> 16) & 65535)
		return 63 - (bb_msb_table[(bb >> 16) & 65535] + 16);
	return 63 - (bb_msb_table[bb & 65535]);
}

int bb_last_index(U64 bb) 
{
	unsigned int folded;
	if (!bb)
		return -1;
	bb ^= bb - 1;
	folded = (int) bb ^ (bb >> 32);
	return 63 - bb_lsb_table[folded * 0x78291ACF >> 26];
}
Those methods take in consideration that 0 is the leftmost bit and 63 is the rightmost bit. I know I'm using the opposite of most engines.

What I did was to create a structure to split the bitboard in 4 parts and them precompute a table with necessary information for each part. When I need the first/last index I locate which of the 4 parts has the first/last bit and them just lookup the precomputed table.
For bit count I just summ the 4 parts together. Well a code is worth thousand words. See the code bellow.

Code: Select all

typedef union u_bitboard_index
{
    U64             val64;
    unsigned long   val32[2];
    unsigned short  val16[4];
}   BBIX;
This is the structure to access each part of the bitboard. The bitboard value will be at BBIX.val64. And each part is at the val16[4]. Remember that this is a union.

The initialization part consists in creating 3 tables for first_index, last_index and bit_count. Note that I still use the tradicional methods to help the initialization. Of course we can get rid completely of the tradional methods by initializing using a more manual method. This on my TODO list.

Code: Select all

//  Bit tables
char first_index_table[4][65536];
char last_index_table[4][65536];
char bit_count_table[65536];

void bb_init()
{
    // ommited code to init the usual bitboards first/last/count

    // tables for faster first/last/count
    first_index_table[0][0] = -1;
    first_index_table[1][0] = -1;
    first_index_table[2][0] = -1;
    first_index_table[3][0] = -1;
    last_index_table[0][0]  = -1;
    last_index_table[1][0]  = -1;
    last_index_table[2][0]  = -1;
    last_index_table[3][0]  = -1;

    for &#40;j = 1; j < 65536; j++)
    &#123;
        first_index_table&#91;0&#93;&#91;j&#93; = &#40;char&#41;bb_first_index&#40;&#40;U64&#41;j&#41; - 48;
        first_index_table&#91;1&#93;&#91;j&#93; = &#40;char&#41;bb_first_index&#40;&#40;U64&#41;j&#41; - 32;
        first_index_table&#91;2&#93;&#91;j&#93; = &#40;char&#41;bb_first_index&#40;&#40;U64&#41;j&#41; - 16;
        first_index_table&#91;3&#93;&#91;j&#93; = &#40;char&#41;bb_first_index&#40;&#40;U64&#41;j&#41;;
        last_index_table&#91;0&#93;&#91;j&#93;  = &#40;char&#41;bb_last_index&#40;&#40;U64&#41;j&#41; - 48;
        last_index_table&#91;1&#93;&#91;j&#93;  = &#40;char&#41;bb_last_index&#40;&#40;U64&#41;j&#41; - 32;
        last_index_table&#91;2&#93;&#91;j&#93;  = &#40;char&#41;bb_last_index&#40;&#40;U64&#41;j&#41; - 16;
        last_index_table&#91;3&#93;&#91;j&#93;  = &#40;char&#41;bb_last_index&#40;&#40;U64&#41;j&#41;;
    &#125;

    for &#40;j = 0; j < 65536; j++)
    &#123;
        bit_count_table&#91;j&#93; = &#40;char&#41;bb_bit_count&#40;&#40;U64&#41;j&#41;;
    &#125;
&#125;
And then here are the new methods using the BBIX and the precomputed tables.

Code: Select all

//-------------------------------------------------------------------------------------------------
//  Return index of the first "1" bit of BBIX&#58; 0 to 63, or -1 when bitboard is 0
//-------------------------------------------------------------------------------------------------
int first_index&#40;BBIX bbix&#41;
&#123;
    if &#40;bbix.val32&#91;1&#93;)
    &#123;
        if &#40;bbix.val16&#91;3&#93;)
            return first_index_table&#91;0&#93;&#91;bbix.val16&#91;3&#93;&#93;;
        else
            return first_index_table&#91;1&#93;&#91;bbix.val16&#91;2&#93;&#93;;
    &#125;
    else
    &#123;
        if &#40;bbix.val16&#91;1&#93;)
            return first_index_table&#91;2&#93;&#91;bbix.val16&#91;1&#93;&#93;;
        else
            return first_index_table&#91;3&#93;&#91;bbix.val16&#91;0&#93;&#93;;
    &#125;
&#125;

//-------------------------------------------------------------------------------------------------
//  Return index of the last "1" bit of BBIX&#58; 0 to 63, or -1 when bitboard is 0
//-------------------------------------------------------------------------------------------------
int last_index&#40;BBIX bbix&#41;
&#123;
    if &#40;bbix.val32&#91;0&#93;)
    &#123;
        if &#40;bbix.val16&#91;0&#93;)
            return last_index_table&#91;3&#93;&#91;bbix.val16&#91;0&#93;&#93;;
        else
            return last_index_table&#91;2&#93;&#91;bbix.val16&#91;1&#93;&#93;;
    &#125;
    else
    &#123;
        if &#40;bbix.val16&#91;2&#93;)
            return last_index_table&#91;1&#93;&#91;bbix.val16&#91;2&#93;&#93;;
        else
            return last_index_table&#91;0&#93;&#91;bbix.val16&#91;3&#93;&#93;;
    &#125;
&#125;

//-------------------------------------------------------------------------------------------------
//  Number of "1" bits in the BBIX
//-------------------------------------------------------------------------------------------------
int bit_count&#40;BBIX bbix&#41;
&#123;
    return bit_count_table&#91;bbix.val16&#91;0&#93;&#93; +
           bit_count_table&#91;bbix.val16&#91;1&#93;&#93; +
           bit_count_table&#91;bbix.val16&#91;2&#93;&#93; +
           bit_count_table&#91;bbix.val16&#91;3&#93;&#93;;
&#125;
Pretty much the approach is divide and conquer for first/last. I need to know which part has the first/last bit and then lookup the corresponding table. For each bitboard I need 2 if's and one lookup.

I'm using microsoft, so the methods/tables are adjusted for the LSM, MSB microsoft stuff (I hate this part). So it may require some configuration for other plataforms.

On my computer I got better performance that resulted a couple of thousand nodes searched.
It is not worth to simply move a bitboard to BBIX.val64 and then use the new methods, otherwise you have the overhead of moving data. You need to have the data on the BBIX structure.
Here is part of my genmove using the new method:

Code: Select all

        
                BBIX piece, attacks, moves;

	piece.val64 = STM_STATE.piece&#91;ROOK&#93; | STM_STATE.piece&#91;QUEEN&#93;;

	while&#40;piece.val64&#41;
	&#123;
		from = &#40;char&#41;last_index&#40;piece&#41;;

			// rankfile_east
	        attacks.val64 = rankfile_east&#91;from&#93; & board.occupied_squares;
        if &#40;attacks.val64&#41; 
        &#123;
            to = &#40;char&#41;first_index&#40;attacks&#41;;
            if &#40;OPP_STATE.square&#91;to&#93;)
                add_capture&#40;from, to&#41;;
            moves.val64 = from_to_path&#91;from&#93;&#91;to&#93;;
        &#125;
        else 
            moves.val64 = rankfile_east&#91;from&#93;;
        while&#40;moves.val64&#41;
        &#123;
            to = &#40;char&#41;first_index&#40;moves&#41;;
            add_move&#40;from, to&#41;;
            CLRBIT_IX&#40;moves.val64, to&#41;;
        &#125;
	//....
The biggest gain was in the first_index and bit_count.
In my tests I got those numbers on my 32bit windows. Didn't had a chance to test 64bit yet.

first_index new 13,937ms old: 41,793ms
last_index new 14,118ms old: 18,283ms
bit_count new 1,856ms old: 35,833ms

As I said, hope this is useful, and please point out anything you find if you do any tests.

Once again, thanks everyone and hope this is usefull.

Alcides Schulz.

Re: Introduction and (hopefully) contribution - bitboard met

Posted: Fri Jun 03, 2011 3:14 pm
by sedicla
Here is the code I used for tests:

Code: Select all

#define BBTSTSIZE 90000000

void bb_test_index_ok&#40;void&#41;;
void bb_test_fi_perf&#40;void&#41;;
void bb_test_li_perf&#40;void&#41;;
void bb_test_count_ok&#40;void&#41;;
void bb_test_bc_perft&#40;void&#41;;

void bb_test&#40;)
&#123;
    bb_test_index_ok&#40;);
    bb_test_fi_perf&#40;);
    bb_test_li_perf&#40;);
    bb_test_count_ok&#40;);
    bb_test_bc_perft&#40;);
&#125;

void bb_test_index_ok&#40;)
&#123;
	BBIX value;
    unsigned int fi_count&#91;64&#93;;
    unsigned int li_count&#91;64&#93;;
    int i, j;

    memset&#40;fi_count, 0, sizeof&#40;fi_count&#41;);
    memset&#40;li_count, 0, sizeof&#40;li_count&#41;);

    value.val64 = 0;

	printf&#40;"validation begin\n");
    for &#40;i = 0; i < 4; i++)
    &#123;
        value.val64 = 0;
        for &#40;j = 1; j < 65536; j++)
        &#123;
            value.val16&#91;i&#93; = &#40;unsigned short&#41;j;
            if &#40;bb_first_index&#40;value.val64&#41; != first_index&#40;value&#41;)
            &#123;
                printf&#40;"diff first_index old&#58; %d new&#58;%d value=%llu\n", bb_first_index&#40;value.val64&#41;, first_index&#40;value&#41;, value.val64&#41;;
                bb_print&#40;"", value.val64&#41;;
            &#125;
            if &#40;bb_last_index&#40;value.val64&#41; != last_index&#40;value&#41;)
            &#123;
                printf&#40;"diff last_index old&#58; %d new&#58;%d value=%llu\n", bb_last_index&#40;value.val64&#41;, last_index&#40;value&#41;, value.val64&#41;;
                bb_print&#40;"", value.val64&#41;;
            &#125;
            fi_count&#91;first_index&#40;value&#41;&#93;++;
            li_count&#91;last_index&#40;value&#41;&#93;++;
        &#125;
    &#125;
    for &#40;i = 0; i < 64; i++)
    &#123;
        printf&#40;"%2d&#58; fi=%10u li=%10u\n", i, fi_count&#91;i&#93;, li_count&#91;i&#93;);
    &#125;
	printf&#40;"validation end\n");
&#125;

void bb_test_fi_perf&#40;)
&#123;
	BBIX value;
    int new_time;
    int old_time;
    unsigned int index_count&#91;64&#93;;
    int i, j;

    printf&#40;"testing first_index performance...\n");
    // new index
    memset&#40;index_count, 0, sizeof&#40;index_count&#41;);
    new_time = util_get_time&#40;);
    for &#40;i = 0; i < BBTSTSIZE; i++)
    &#123;
        for &#40;j = 0; j < 64; j++)
        &#123;
            value.val64 = 0;
            SETBIT_IX&#40;value.val64, j&#41;;
            index_count&#91;first_index&#40;value&#41;&#93;++;
        &#125;
    &#125;
    new_time = util_get_time&#40;) - new_time;
    printf&#40;"new index counters\n");
    for &#40;i = 0; i < 16; i++)
    &#123;
        printf&#40;"%2d&#58; %10u %2d&#58; %10u ", i, index_count&#91;i&#93;, i+16, index_count&#91;i+16&#93;);
        printf&#40;"%2d&#58; %10u %2d&#58; %10u\n", i+32, index_count&#91;i+32&#93;, i+48, index_count&#91;i+48&#93;);
    &#125;

    // old index
    memset&#40;index_count, 0, sizeof&#40;index_count&#41;);
    old_time = util_get_time&#40;);
    for &#40;i = 0; i < BBTSTSIZE; i++)
    &#123;
        for &#40;j = 0; j < 64; j++)
        &#123;
            value.val64 = 0;
            SETBIT_IX&#40;value.val64, j&#41;;
            index_count&#91;bb_first_index&#40;value.val64&#41;&#93;++;
        &#125;
    &#125;
    old_time = util_get_time&#40;) - old_time;
    printf&#40;"old index counters\n");
    for &#40;i = 0; i < 16; i++)
    &#123;
        printf&#40;"%2d&#58; %10u %2d&#58; %10u ", i, index_count&#91;i&#93;, i+16, index_count&#91;i+16&#93;);
        printf&#40;"%2d&#58; %10u %2d&#58; %10u\n", i+32, index_count&#91;i+32&#93;, i+48, index_count&#91;i+48&#93;);
    &#125;

    printf&#40;"new first_index time&#58; %u msec\n", new_time&#41;;
    printf&#40;"old first_index time&#58; %u msec\n", old_time&#41;;
&#125;

void bb_test_li_perf&#40;)
&#123;
	BBIX value;
    int new_time;
    int old_time;
    unsigned int index_count&#91;64&#93;;
    int i, j;

    printf&#40;"testing last_index performance...\n");
    // new index
    memset&#40;index_count, 0, sizeof&#40;index_count&#41;);
    new_time = util_get_time&#40;);
    for &#40;i = 0; i < BBTSTSIZE; i++)
    &#123;
        for &#40;j = 0; j < 64; j++)
        &#123;
            value.val64 = 0;
            SETBIT_IX&#40;value.val64, j&#41;;
            index_count&#91;last_index&#40;value&#41;&#93;++;
        &#125;
    &#125;
    new_time = util_get_time&#40;) - new_time;
    printf&#40;"new index counters\n");
    for &#40;i = 0; i < 16; i++)
    &#123;
        printf&#40;"%2d&#58; %10u %2d&#58; %10u ", i, index_count&#91;i&#93;, i+16, index_count&#91;i+16&#93;);
        printf&#40;"%2d&#58; %10u %2d&#58; %10u\n", i+32, index_count&#91;i+32&#93;, i+48, index_count&#91;i+48&#93;);
    &#125;

    // old index
    memset&#40;index_count, 0, sizeof&#40;index_count&#41;);
    old_time = util_get_time&#40;);
    for &#40;i = 0; i < BBTSTSIZE; i++)
    &#123;
        for &#40;j = 0; j < 64; j++)
        &#123;
            value.val64 = 0;
            SETBIT_IX&#40;value.val64, j&#41;;
            index_count&#91;bb_last_index&#40;value.val64&#41;&#93;++;
        &#125;
    &#125;
    old_time = util_get_time&#40;) - old_time;
    printf&#40;"old index counters\n");
    for &#40;i = 0; i < 16; i++)
    &#123;
        printf&#40;"%2d&#58; %10u %2d&#58; %10u ", i, index_count&#91;i&#93;, i+16, index_count&#91;i+16&#93;);
        printf&#40;"%2d&#58; %10u %2d&#58; %10u\n", i+32, index_count&#91;i+32&#93;, i+48, index_count&#91;i+48&#93;);
    &#125;

    printf&#40;"new last_index time&#58; %u msec\n", new_time&#41;;
    printf&#40;"old last_index time&#58; %u msec\n", old_time&#41;;
&#125;

void bb_test_count_ok&#40;)
&#123;
	BBIX value;
    unsigned int count&#91;64&#93;;
    int i, j;

    memset&#40;count, 0, sizeof&#40;count&#41;);

    value.val64 = 0;

	printf&#40;"bit_count validation begin\n");
    for &#40;i = 0; i < 4; i++)
    &#123;
        for &#40;j = 0; j < 65536; j++)
        &#123;
            value.val16&#91;i&#93; = &#40;unsigned short&#41;j;
            if &#40;bb_bit_count&#40;value.val64&#41; != bit_count&#40;value&#41;)
            &#123;
                printf&#40;"diff bit_count old&#58; %d new&#58;%d value=%llu\n", bb_bit_count&#40;value.val64&#41;, bit_count&#40;value&#41;, value.val64&#41;;
                bb_print&#40;"", value.val64&#41;;
            &#125;
            count&#91;bit_count&#40;value&#41;&#93;++;
        &#125;
    &#125;
    for &#40;i = 0; i < 16; i++)
    &#123;
        printf&#40;"%2d&#58; %10u %2d&#58; %10u ", i, count&#91;i&#93;, i+16, count&#91;i+16&#93;);
        printf&#40;"%2d&#58; %10u %2d&#58; %10u\n", i+32, count&#91;i+32&#93;, i+48, count&#91;i+48&#93;);
    &#125;
	printf&#40;"bit_count validation end\n");
&#125;

void bb_test_bc_perft&#40;)
&#123;
	BBIX value;
    int new_time;
    int old_time;
    unsigned int index_count&#91;64&#93;;
    int i, j;
    int test_size = 10000000;

    printf&#40;"testing bit_count performance...\n");
    // new
    memset&#40;index_count, 0, sizeof&#40;index_count&#41;);
    new_time = util_get_time&#40;);
    for &#40;i = 0; i < test_size; i++)
    &#123;
        for &#40;j = 0; j < 64; j++)
        &#123;
            value.val64 = &#40;i << j&#41; + i;
            index_count&#91;bit_count&#40;value&#41;&#93;++;
        &#125;
    &#125;

    new_time = util_get_time&#40;) - new_time;
    printf&#40;"new bit_count\n");
    for &#40;i = 0; i < 16; i++)
    &#123;
        printf&#40;"%2d&#58; %10u %2d&#58; %10u ", i, index_count&#91;i&#93;, i+16, index_count&#91;i+16&#93;);
        printf&#40;"%2d&#58; %10u %2d&#58; %10u\n", i+32, index_count&#91;i+32&#93;, i+48, index_count&#91;i+48&#93;);
    &#125;

    // old
    memset&#40;index_count, 0, sizeof&#40;index_count&#41;);
    old_time = util_get_time&#40;);
    for &#40;i = 0; i < test_size; i++)
    &#123;
        for &#40;j = 0; j < 64; j++)
        &#123;
            value.val64 = &#40;i << j&#41; + i;
            index_count&#91;bb_bit_count&#40;value.val64&#41;&#93;++;
        &#125;
    &#125;
    old_time = util_get_time&#40;) - old_time;
    printf&#40;"old bit_count\n");
    for &#40;i = 0; i < 16; i++)
    &#123;
        printf&#40;"%2d&#58; %10u %2d&#58; %10u ", i, index_count&#91;i&#93;, i+16, index_count&#91;i+16&#93;);
        printf&#40;"%2d&#58; %10u %2d&#58; %10u\n", i+32, index_count&#91;i+32&#93;, i+48, index_count&#91;i+48&#93;);
    &#125;

    printf&#40;"new bit_count time&#58; %u msec\n", new_time&#41;;
    printf&#40;"old bit_count time&#58; %u msec\n", old_time&#41;;
&#125;

Re: Introduction and (hopefully) contribution - bitboard met

Posted: Fri Jun 03, 2011 3:38 pm
by mcostalba
sedicla wrote: Here I want to share what I did with bitboards, that got me some performance improvement over traditional methods. I'm talking about bit_count, first_index and last_index methods.
Thanks for sharing your results. Currently we use the following functions:

Code: Select all

static CACHE_LINE_ALIGNMENT
const int BitTable&#91;64&#93; = &#123;
  63, 30,  3, 32, 25, 41, 22, 33, 15, 50, 42, 13, 11, 53, 19, 34, 61, 29,  2,
  51, 21, 43, 45, 10, 18, 47,  1, 54,  9, 57,  0, 35, 62, 31, 40,  4, 49,  5,
  52, 26, 60,  6, 23, 44, 46, 27, 56, 16,  7, 39, 48, 24, 59, 14, 12, 55, 38,
  28, 58, 20, 37, 17, 36, 8
&#125;;

Square first_1&#40;Bitboard b&#41; &#123;

  b ^= &#40;b - 1&#41;;
  uint32_t fold = int&#40;b&#41; ^ int&#40;b >> 32&#41;;
  return Square&#40;BitTable&#91;&#40;fold * 0x783a9b23&#41; >> 26&#93;);
&#125;

int count_1s&#40;Bitboard b&#41; &#123;
  b -= (&#40;b>>1&#41; & 0x5555555555555555ULL&#41;;
  b = (&#40;b>>2&#41; & 0x3333333333333333ULL&#41; + &#40;b & 0x3333333333333333ULL&#41;;
  b = (&#40;b>>4&#41; + b&#41; & 0x0F0F0F0F0F0F0F0FULL;
  b *= 0x0101010101010101ULL;
  return int&#40;b >> 56&#41;;
&#125;


And this 32bit optimized function for counting ones:

Code: Select all

int count_1s_32&#40;Bitboard b&#41; &#123;
  unsigned w = unsigned&#40;b >> 32&#41;, v = unsigned&#40;b&#41;;
  v -= &#40;v >> 1&#41; & 0x55555555; // 0-2 in 2 bits
  w -= &#40;w >> 1&#41; & 0x55555555;
  v = (&#40;v >> 2&#41; & 0x33333333&#41; + &#40;v & 0x33333333&#41;; // 0-4 in 4 bits
  w = (&#40;w >> 2&#41; & 0x33333333&#41; + &#40;w & 0x33333333&#41;;
  v = (&#40;v >> 4&#41; + v&#41; & 0x0F0F0F0F; // 0-8 in 8 bits
  v += ((&#40;w >> 4&#41; + w&#41; & 0x0F0F0F0F&#41;;  // 0-16 in 8 bits
  v *= 0x01010101; // mul is fast on amd procs
  return int&#40;v >> 24&#41;;
&#125;

Re: Introduction and (hopefully) contribution - bitboard met

Posted: Fri Jun 03, 2011 3:38 pm
by Don
There is much on the web and even discussed here over the years on doing popcount and findbit fast. Also there is a lot on the chessprogramming wiki about these issues. For bitboard style programs these 2 operations are critical to speed and you are right to try to optimize them.

I have not taken the time to digest what you have written, but it's worth a look. I would be very surprised if you have come up with anything faster than what we are already using, but I would love to be surprised and I will definitely take a look! If you have improved on the speed of this operation you will be everyones hero!

And thanks for sharing your ideas on this.

Don

sedicla wrote:I've been lurking here for awhile and first of all I'd like to say thank you to all for posting all this good stuff, I've learned a lot here and now i'd like to maybe contribute to the chess programming community.

I am involved professionally with computer programming for 25 years. And some of my hobbies are programming and chess. So why not combine both :). Well I'm working on my own chess engine and now I'm just reviewing the code in order to make available soon. Hope it will be helpfull to other fellow cc programmers.

Here I want to share what I did with bitboards, that got me some performance improvement over traditional methods. I'm talking about bit_count, first_index and last_index methods.
I was using the usual methods:

Code: Select all

int bb_bit_count&#40;U64 bb&#41;
&#123;
   int count = 0;
   while &#40;bb&#41;
   &#123;
       count++;
       bb &= bb - 1;
   &#125;
   return count;
&#125;

int bb_first_index&#40;U64 bb&#41; 
&#123;
	if	&#40;bb >> 48&#41;
		return 63 - &#40;bb_msb_table&#91;bb >> 48&#93; + 48&#41;;
	if	(&#40;bb >> 32&#41; & 65535&#41;
		return 63 - &#40;bb_msb_table&#91;&#40;bb >> 32&#41; & 65535&#93; + 32&#41;;
	if	(&#40;bb >> 16&#41; & 65535&#41;
		return 63 - &#40;bb_msb_table&#91;&#40;bb >> 16&#41; & 65535&#93; + 16&#41;;
	return 63 - &#40;bb_msb_table&#91;bb & 65535&#93;);
&#125;

int bb_last_index&#40;U64 bb&#41; 
&#123;
	unsigned int folded;
	if (!bb&#41;
		return -1;
	bb ^= bb - 1;
	folded = &#40;int&#41; bb ^ &#40;bb >> 32&#41;;
	return 63 - bb_lsb_table&#91;folded * 0x78291ACF >> 26&#93;;
&#125;
Those methods take in consideration that 0 is the leftmost bit and 63 is the rightmost bit. I know I'm using the opposite of most engines.

What I did was to create a structure to split the bitboard in 4 parts and them precompute a table with necessary information for each part. When I need the first/last index I locate which of the 4 parts has the first/last bit and them just lookup the precomputed table.
For bit count I just summ the 4 parts together. Well a code is worth thousand words. See the code bellow.

Code: Select all

typedef union u_bitboard_index
&#123;
    U64             val64;
    unsigned long   val32&#91;2&#93;;
    unsigned short  val16&#91;4&#93;;
&#125;   BBIX;
This is the structure to access each part of the bitboard. The bitboard value will be at BBIX.val64. And each part is at the val16[4]. Remember that this is a union.

The initialization part consists in creating 3 tables for first_index, last_index and bit_count. Note that I still use the tradicional methods to help the initialization. Of course we can get rid completely of the tradional methods by initializing using a more manual method. This on my TODO list.

Code: Select all

//  Bit tables
char first_index_table&#91;4&#93;&#91;65536&#93;;
char last_index_table&#91;4&#93;&#91;65536&#93;;
char bit_count_table&#91;65536&#93;;

void bb_init&#40;)
&#123;
    // ommited code to init the usual bitboards first/last/count

    // tables for faster first/last/count
    first_index_table&#91;0&#93;&#91;0&#93; = -1;
    first_index_table&#91;1&#93;&#91;0&#93; = -1;
    first_index_table&#91;2&#93;&#91;0&#93; = -1;
    first_index_table&#91;3&#93;&#91;0&#93; = -1;
    last_index_table&#91;0&#93;&#91;0&#93;  = -1;
    last_index_table&#91;1&#93;&#91;0&#93;  = -1;
    last_index_table&#91;2&#93;&#91;0&#93;  = -1;
    last_index_table&#91;3&#93;&#91;0&#93;  = -1;

    for &#40;j = 1; j < 65536; j++)
    &#123;
        first_index_table&#91;0&#93;&#91;j&#93; = &#40;char&#41;bb_first_index&#40;&#40;U64&#41;j&#41; - 48;
        first_index_table&#91;1&#93;&#91;j&#93; = &#40;char&#41;bb_first_index&#40;&#40;U64&#41;j&#41; - 32;
        first_index_table&#91;2&#93;&#91;j&#93; = &#40;char&#41;bb_first_index&#40;&#40;U64&#41;j&#41; - 16;
        first_index_table&#91;3&#93;&#91;j&#93; = &#40;char&#41;bb_first_index&#40;&#40;U64&#41;j&#41;;
        last_index_table&#91;0&#93;&#91;j&#93;  = &#40;char&#41;bb_last_index&#40;&#40;U64&#41;j&#41; - 48;
        last_index_table&#91;1&#93;&#91;j&#93;  = &#40;char&#41;bb_last_index&#40;&#40;U64&#41;j&#41; - 32;
        last_index_table&#91;2&#93;&#91;j&#93;  = &#40;char&#41;bb_last_index&#40;&#40;U64&#41;j&#41; - 16;
        last_index_table&#91;3&#93;&#91;j&#93;  = &#40;char&#41;bb_last_index&#40;&#40;U64&#41;j&#41;;
    &#125;

    for &#40;j = 0; j < 65536; j++)
    &#123;
        bit_count_table&#91;j&#93; = &#40;char&#41;bb_bit_count&#40;&#40;U64&#41;j&#41;;
    &#125;
&#125;
And then here are the new methods using the BBIX and the precomputed tables.

Code: Select all

//-------------------------------------------------------------------------------------------------
//  Return index of the first "1" bit of BBIX&#58; 0 to 63, or -1 when bitboard is 0
//-------------------------------------------------------------------------------------------------
int first_index&#40;BBIX bbix&#41;
&#123;
    if &#40;bbix.val32&#91;1&#93;)
    &#123;
        if &#40;bbix.val16&#91;3&#93;)
            return first_index_table&#91;0&#93;&#91;bbix.val16&#91;3&#93;&#93;;
        else
            return first_index_table&#91;1&#93;&#91;bbix.val16&#91;2&#93;&#93;;
    &#125;
    else
    &#123;
        if &#40;bbix.val16&#91;1&#93;)
            return first_index_table&#91;2&#93;&#91;bbix.val16&#91;1&#93;&#93;;
        else
            return first_index_table&#91;3&#93;&#91;bbix.val16&#91;0&#93;&#93;;
    &#125;
&#125;

//-------------------------------------------------------------------------------------------------
//  Return index of the last "1" bit of BBIX&#58; 0 to 63, or -1 when bitboard is 0
//-------------------------------------------------------------------------------------------------
int last_index&#40;BBIX bbix&#41;
&#123;
    if &#40;bbix.val32&#91;0&#93;)
    &#123;
        if &#40;bbix.val16&#91;0&#93;)
            return last_index_table&#91;3&#93;&#91;bbix.val16&#91;0&#93;&#93;;
        else
            return last_index_table&#91;2&#93;&#91;bbix.val16&#91;1&#93;&#93;;
    &#125;
    else
    &#123;
        if &#40;bbix.val16&#91;2&#93;)
            return last_index_table&#91;1&#93;&#91;bbix.val16&#91;2&#93;&#93;;
        else
            return last_index_table&#91;0&#93;&#91;bbix.val16&#91;3&#93;&#93;;
    &#125;
&#125;

//-------------------------------------------------------------------------------------------------
//  Number of "1" bits in the BBIX
//-------------------------------------------------------------------------------------------------
int bit_count&#40;BBIX bbix&#41;
&#123;
    return bit_count_table&#91;bbix.val16&#91;0&#93;&#93; +
           bit_count_table&#91;bbix.val16&#91;1&#93;&#93; +
           bit_count_table&#91;bbix.val16&#91;2&#93;&#93; +
           bit_count_table&#91;bbix.val16&#91;3&#93;&#93;;
&#125;
Pretty much the approach is divide and conquer for first/last. I need to know which part has the first/last bit and then lookup the corresponding table. For each bitboard I need 2 if's and one lookup.

I'm using microsoft, so the methods/tables are adjusted for the LSM, MSB microsoft stuff (I hate this part). So it may require some configuration for other plataforms.

On my computer I got better performance that resulted a couple of thousand nodes searched.
It is not worth to simply move a bitboard to BBIX.val64 and then use the new methods, otherwise you have the overhead of moving data. You need to have the data on the BBIX structure.
Here is part of my genmove using the new method:

Code: Select all

        
                BBIX piece, attacks, moves;

	piece.val64 = STM_STATE.piece&#91;ROOK&#93; | STM_STATE.piece&#91;QUEEN&#93;;

	while&#40;piece.val64&#41;
	&#123;
		from = &#40;char&#41;last_index&#40;piece&#41;;

			// rankfile_east
	        attacks.val64 = rankfile_east&#91;from&#93; & board.occupied_squares;
        if &#40;attacks.val64&#41; 
        &#123;
            to = &#40;char&#41;first_index&#40;attacks&#41;;
            if &#40;OPP_STATE.square&#91;to&#93;)
                add_capture&#40;from, to&#41;;
            moves.val64 = from_to_path&#91;from&#93;&#91;to&#93;;
        &#125;
        else 
            moves.val64 = rankfile_east&#91;from&#93;;
        while&#40;moves.val64&#41;
        &#123;
            to = &#40;char&#41;first_index&#40;moves&#41;;
            add_move&#40;from, to&#41;;
            CLRBIT_IX&#40;moves.val64, to&#41;;
        &#125;
	//....
The biggest gain was in the first_index and bit_count.
In my tests I got those numbers on my 32bit windows. Didn't had a chance to test 64bit yet.

first_index new 13,937ms old: 41,793ms
last_index new 14,118ms old: 18,283ms
bit_count new 1,856ms old: 35,833ms

As I said, hope this is useful, and please point out anything you find if you do any tests.

Once again, thanks everyone and hope this is usefull.

Alcides Schulz.

Re: Introduction and (hopefully) contribution - bitboard met

Posted: Fri Jun 03, 2011 4:02 pm
by mcostalba
sedicla wrote:Here is the code I used for tests:
I'd suggest to test with random numbers, not with a linear loop that increments the target bitboard. Otherwise for the CPU becomes easy to predict the branches inside first_index() and friends and the results are better than what you would achieve in a real game.

Even better to test the speed of the engine, not just the functions alone because they use big tables and result could be different if there is a concurrent pressure to use the L2 cache memory.

Re: Introduction and (hopefully) contribution - bitboard met

Posted: Fri Jun 03, 2011 5:01 pm
by sedicla
mcostalba wrote:
sedicla wrote:Here is the code I used for tests:
I'd suggest to test with random numbers, not with a linear loop that increments the target bitboard. Otherwise for the CPU becomes easy to predict the branches inside first_index() and friends and the results are better than what you would achieve in a real game.

Even better to test the speed of the engine, not just the functions alone because they use big tables and result could be different if there is a concurrent pressure to use the L2 cache memory.
Yes, I know what you mean. I did this initial test and quickly changed the genmove to use the new methods. Still need to change other parts like "in-check", "check_evasions" and "see" functions.

With this change I got about 15k extra nodes per search.

I'm still trying to create a better test for this. I tried to use my zobrish keys random number generator, but it doesnt' generate good data for test, the first and last bits are mostly close to the bitboard extremes. Also I want to test the same data against the new and old functions, so I may have to generate the test data in a table and them perform the tests.
Anyway, I'll post any results when I have.

thanks.

Re: Introduction and (hopefully) contribution - bitboard met

Posted: Fri Jun 03, 2011 5:17 pm
by mcostalba
sedicla wrote: I'm still trying to create a better test for this. I tried to use my zobrish keys random number generator, but it doesnt' generate good data for test, the first and last bits are mostly close to the bitboard extremes.
You could use zobrist to index into the selected target bitboard, adding an indirection level should fix the issue both for the quality of target bitboards and for the reproducibility.

Re: Introduction and (hopefully) contribution - bitboard met

Posted: Fri Jun 03, 2011 6:08 pm
by Gerd Isenberg
sedicla wrote:I've been lurking here for awhile and first of all I'd like to say thank you to all for posting all this good stuff, I've learned a lot here and now i'd like to maybe contribute to the chess programming community.

I am involved professionally with computer programming for 25 years. And some of my hobbies are programming and chess. So why not combine both :). Well I'm working on my own chess engine and now I'm just reviewing the code in order to make available soon. Hope it will be helpfull to other fellow cc programmers.

Here I want to share what I did with bitboards, that got me some performance improvement over traditional methods. I'm talking about bit_count, first_index and last_index methods.
...
Alcides Schulz.
Hi Alcides,
welcome to CCC and thanks for sharing your implementations of bitscanning and popcounting. Those and similar methods were heavily discussed here and elsewhere over the years or even decades. They are well known. 9*64 KByte sounds small compared to let say tables for magic bitboards, but it will in conjunction with other stuff pollute L1-cache. And often, while adding more and more stuff into a chess program or somehow changing memory and cache footprint, the program suddenly exceeds some limit, where L1 and BTB misses rises over proportional and really start to hurt. OK, the next hardware generation and cache technology may fix that as well, but with todays hardware with fast 64-bit intrinsic processor instructions in mind (x86-64, SSE4, no old P4 or K8), I would't like to spend so much heavily used and valuable L1 for that purpose.

Like Marco, I prefer using a portable 32-bit "backup" using less memory, even if it is some cycles slower in a loop test. While little-big endian issues seem a bit relaxed today, such unions might not portable and would require conditional compiles for certain architectures, similar as using certain processor instructions using C/C++ intrinsics (no need to use Assembly nowadays).

Best regards,
Gerd

Re: Introduction and (hopefully) contribution - bitboard met

Posted: Fri Jun 03, 2011 6:59 pm
by sedicla
Gerd Isenberg wrote:
sedicla wrote:I've been lurking here for awhile and first of all I'd like to say thank you to all for posting all this good stuff, I've learned a lot here and now i'd like to maybe contribute to the chess programming community.

I am involved professionally with computer programming for 25 years. And some of my hobbies are programming and chess. So why not combine both :). Well I'm working on my own chess engine and now I'm just reviewing the code in order to make available soon. Hope it will be helpfull to other fellow cc programmers.

Here I want to share what I did with bitboards, that got me some performance improvement over traditional methods. I'm talking about bit_count, first_index and last_index methods.
...
Alcides Schulz.
Hi Alcides,
welcome to CCC and thanks for sharing your implementations of bitscanning and popcounting. Those and similar methods were heavily discussed here and elsewhere over the years or even decades. They are well known. 9*64 KByte sounds small compared to let say tables for magic bitboards, but it will in conjunction with other stuff pollute L1-cache. And often, while adding more and more stuff into a chess program or somehow changing memory and cache footprint, the program suddenly exceeds some limit, where L1 and BTB misses rises over proportional and really start to hurt. OK, the next hardware generation and cache technology may fix that as well, but with todays hardware with fast 64-bit intrinsic processor instructions in mind (x86-64, SSE4, no old P4 or K8), I would't like to spend so much heavily used and valuable L1 for that purpose.

Like Marco, I prefer using a portable 32-bit "backup" using less memory, even if it is some cycles slower in a loop test. While little-big endian issues seem a bit relaxed today, such unions might not portable and would require conditional compiles for certain architectures, similar as using certain processor instructions using C/C++ intrinsics (no need to use Assembly nowadays).

Best regards,
Gerd
Hey Gerd, thanks for you comments. I learned here about this cache thing. Of course I learned that after creating too many tables :roll:
In my engine I tried to reduce computing operations by using tables. Now I'm not sure if this was a good idea after all.
I think I'll leave this way for now. Maybe someday I'll review or write a new engine with fewer tables and compare. As long it is fun, why not.
I also expect that newer cpu's will take care of this.

I am aware of the portability issues, but I see this as part of the game, we just have to deal with this. For now I just use windows, I miss the good old unix environment.

Thanks again, appreciate your feedback.

Re: Introduction and (hopefully) contribution - bitboard met

Posted: Fri Jun 03, 2011 10:09 pm
by wgarvin
Gerd Isenberg wrote: Hi Alcides,
welcome to CCC and thanks for sharing your implementations of bitscanning and popcounting. Those and similar methods were heavily discussed here and elsewhere over the years or even decades. ...
Good terms to search for are "bitscan" and "popcount".

Here's a good thread from two years ago about bitscan functions.

Here's one from a year and a half ago, about popcount.

The state of the art has not changed much since those threads.

All x86 chips and some other common chips have instructions that help with bitscans (BSR/BSF). Very modern x86 chips have a single instruction for popcount, but there's still lots of chips around that don't support it, so its less often used in chess engines. If you're looking for more info about the throughput and latency of BSF, BSR etc. instructions, this x86-timing.pdf is the best collection of timing info I've seen for x86 chips, and another useful resource is the instruction tables at Agner Fog's site.