I did some magic bitboard "science" and mostly learned not to worry about it

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Jakob Progsch
Posts: 40
Joined: Fri Apr 16, 2021 4:44 pm
Full name: Jakob Progsch

I did some magic bitboard "science" and mostly learned not to worry about it

Post by Jakob Progsch »

I started my engine programming efforts recently and initially went for purely computation-based (I'll call it arithmetic from here on) bitboard move generation, because that was "good enough" to get started. Also, conceptually I prefer those approaches since their behavior is less dependent on hard-to-model cache interactions and interference from other memory transactions. I'll add the code I use for the sliding operation at the bottom of the post.

However, reading this forum and the wiki gave me the impression that I'd need magic bitboards to be allowed at the cool kids' table. But I'm also lazy and don't like the idea of having almost-MB-sized, sparsely populated tables (see me worrying about funky memory interactions above). So I wanted to know how much I leave on the table with the alternative.

Throughout my code I use separate sliding move generation per direction for things like pins, discoveries, skewers and so on. This makes the "less magic" and much denser kindergarten bitboards attractive to me, since they are a drop-in replacement, whereas magics would be much more invasive since their benefit is getting both directions at once.
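For reference, the table-lookup idea behind kindergarten bitboards looks roughly like the sketch below, covering only the rank case (the file and diagonal variants additionally gather the masked blockers onto one rank with a multiply before indexing). This is not the code I benchmarked; the names are made up for illustration.

```cpp
#include <cstdint>

// 8 files x 64 inner-occupancy patterns, one byte per entry: 512 bytes total
static uint8_t first_rank_attacks[8][64];

void init_first_rank_attacks() {
    for (int file = 0; file < 8; ++file) {
        for (int occ6 = 0; occ6 < 64; ++occ6) {
            int occ = occ6 << 1; // the two edge squares never block anything useful
            uint8_t att = 0;
            // scan outward from the file until a blocker is hit (inclusive)
            for (int t = file + 1; t < 8; ++t) { att |= uint8_t(1 << t); if (occ & (1 << t)) break; }
            for (int t = file - 1; t >= 0; --t) { att |= uint8_t(1 << t); if (occ & (1 << t)) break; }
            first_rank_attacks[file][occ6] = att;
        }
    }
}

// rook attacks along the rank of sq (0..63), given the full board occupancy
uint64_t rank_slide(unsigned sq, uint64_t block) {
    unsigned base = sq & 56;                             // first square of sq's rank
    unsigned occ6 = unsigned(block >> (base + 1)) & 63;  // the 6 inner bits of that rank
    return uint64_t(first_rank_attacks[sq & 7][occ6]) << base;
}
```

The whole table fits in 512 bytes, which is why kindergarten lookups stay resident in L1 where the big magic rook table usually doesn't.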

I set out to do some synthetic and practical benchmarking of these three approaches. For arithmetic and kindergarten generators I have my own code and for the magics I borrowed the code from Pradu Kannan.


For synthetic testing I use the following loop. It repeatedly XORs the generated moveset into the blocker board.

Code: Select all

unsigned int tsc_aux;
uint64_t start = __rdtscp(&tsc_aux);
for(uint64_t i = 0u;i<N;++i) {
    #pragma unroll
    for(int j = 0;j<M;++j) {
        block[j] ^= movegen_function((i+7u*j)&63u, block[j]);
    }
}
uint64_t end = __rdtscp(&tsc_aux);
The operation performed is meaningless but is useful for this experiment I think. The set of blockers keeps changing in a non-obvious way, so lookup-based approaches don't just get to camp on the same entries, and the overhead outside of the movegen itself is very small. If M=1 there is only a single sequentially dependent chain, exposing the full latency of the move generation function. As M increases, more independent chains exist that the compiler and CPU can reorder to take advantage of ILP. For sufficiently large M it should eventually measure the achievable throughput.
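For the curious, here is a self-contained version of the same harness idea, using std::chrono instead of RDTSC and a cheap stand-in "move generator"; the names (fake_movegen, bench) are hypothetical, not from my engine.

```cpp
#include <cstdint>
#include <chrono>

// stand-in for a real movegen function: cheap, but forms a data dependency on block
static inline uint64_t fake_movegen(unsigned sq, uint64_t block) {
    return ((block * 0x9E3779B97F4A7C15ULL) >> (sq & 63)) | 1u;
}

// returns average ns per outer iteration, running M independent dependency chains
template<int M>
double bench(uint64_t N) {
    uint64_t block[M];
    for (int j = 0; j < M; ++j) block[j] = 0x1234 + j;
    auto t0 = std::chrono::steady_clock::now();
    for (uint64_t i = 0; i < N; ++i)
        for (int j = 0; j < M; ++j)   // M independent chains, unrollable by the compiler
            block[j] ^= fake_movegen(unsigned((i + 7u * j) & 63u), block[j]);
    auto t1 = std::chrono::steady_clock::now();
    // fold the results into a volatile sink so the loop isn't optimized away
    uint64_t acc = 0;
    for (int j = 0; j < M; ++j) acc ^= block[j];
    volatile uint64_t sink = acc; (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / double(N);
}
```

Comparing bench<1>(N) against bench<8>(N)/8 gives the same latency-versus-throughput picture as the tables below, just in nanoseconds instead of TSC ticks.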

In the tables, t = (end-start)/N is the average per-iteration time of the outer loop. Measurements were done on an AMD 3900X. The RDTSC frequency measures at 3.8GHz, which is consistent with the specced base clock, while the actual observed clock speeds during measurements are reported at around 4.35GHz. So if one wanted to know the values in "real clock cycles" they'd have to be multiplied by about 1.15.

Rooks

Code: Select all

            arithmetic         kindergarten           magic
  M         t      t/M          t      t/M          t      t/M

  1       12.36   12.36       15.09   15.09       26.64   26.64
  2       20.59   10.29       16.93    8.46       27.62   13.81
  3       31.18   10.39       24.46    8.15       28.38    9.46
  4       42.07   10.52       31.58    7.89       29.24    7.31
  5       50.76   10.15       39.03    7.81       30.53    6.11
  6       60.57   10.10       47.37    7.89       32.57    5.43
  7       77.19   11.03       56.69    8.10       35.00    5.00
  8       86.91   10.86       65.01    8.13       37.78    4.72
For the rooks we see most of the interesting behavior already. The arithmetic approach saturates the ALUs right from the start and the number of in-flight calculations doesn't really affect the throughput. Its latency is also the lowest. Kindergarten bitboards are competitive at M=1 and then improve slightly. Magics have the highest latency, which is probably caused by the large rook table, which will rarely hit in L1 and fall through to L2. The observed latency of about 30 cycles makes sense, since there are two dependent lookups involved and Agner quotes 14 cycles of latency for L2 hits. Once we have enough work to be throughput-limited they are almost twice as fast as kindergarten, which also makes intuitive sense since they do both directions in one go instead of two.

The arithmetic approach scales directly with the amount of computation, whereas magics are dominated by latency and each additional lookup just costs an extra cycle.

Bishops

Code: Select all

            arithmetic         kindergarten           magic
  M         t      t/M          t      t/M          t      t/M

  1       15.36   15.36       15.13   15.13       10.80   10.80
  2       23.75   11.88       16.05    8.03       11.00    5.50
  3       35.41   11.80       20.81    6.94       11.18    3.73
  4       46.16   11.54       25.33    6.33       11.75    2.94
  5       59.87   11.97       33.62    6.72       13.00    2.60
  6       73.74   12.29       40.30    6.72       15.29    2.55
  7       93.07   13.30       47.94    6.85       17.64    2.52
  8      108.51   13.56       55.19    6.90       20.60    2.57
Bishops are massively in favor of magics, probably because all the previous advantages apply but now the tables fit into L1. The arithmetic approach falls slightly behind its rook performance, since diagonal masks are more expensive to compute (lookups) than file and rank masks (CLZ+shift).

Queens

Code: Select all

            arithmetic         kindergarten           magic
  M         t      t/M          t      t/M          t      t/M

  1       24.12   24.12       19.32   19.32       32.85   32.85
  2       47.78   23.89       26.92   13.46       33.74   16.87
  3       65.85   21.95       39.97   13.32       35.66   11.89
  4       95.15   23.79       55.30   13.82       41.77   10.44
  5      122.70   24.54       68.36   13.67       43.55    8.71
  6      150.33   25.05       90.59   15.10       50.24    8.37
  7      167.43   23.92      102.42   14.63       57.98    8.28
  8      189.12   23.64      116.76   14.60       66.81    8.35
Queens are just here for completeness; nothing much to be learned from this table that we didn't see in the previous ones. With magics you can essentially get a queen for the same latency cost as a rook, since the exposed latency of the rook lookup will mostly cover the cost of the bishop lookup.

At this point someone might object "this is all meaningless and pathological", and they would be right, I guess. From a distance it is difficult to guess whether a specific piece of "real code" will expose the latency or just be throughput-limited. You can't even make that statement globally for the entire engine: move generation might be a "bulk operation" and be throughput-limited, while pin detection might be latency-limited.

All of these measure in the one- to low-two-digit realm of clock cycles. By comparison, this is about as expensive as a 64-bit integer division. A full cache miss from the TT will single-handedly cost you a multiple of any of these operations.

To get the real impact I have to test it within my actual search. As mentioned above, I can't easily integrate magics into my search, but I can just drop in the kindergarten lookups. The "real world" result was that the difference is barely distinguishable from measurement noise, with a tiny advantage (1-3%) to the kindergarten bitboards over the arithmetic ones. My personal conclusion is that magics might get me another single-digit percentage increase in linear speed, which seems insignificant compared to the more "exponential" gains to be had by improving the search algorithm and heuristics.

One last observation: this was all done on an AMD Zen 2 processor, which can execute most of the relevant logic ops (AND, OR, XOR, SHL, SHR, CLZ, POPCNT...) at throughputs of 2-4 per cycle. Intel CPUs of recent generations list lower throughputs (1-2/cycle) for the same family of instructions, so the relative advantage of lookup-based approaches could be more pronounced there? I'll have to rerun this experiment on my Intel laptop eventually, I guess.


And finally, the code for the arithmetic sliding attacks mentioned above. The line masks are obtained by arithmetic for rooks and lookups for bishops.

Code: Select all

// CLZ64 counts leading zeros, e.g. __builtin_clzll on GCC/Clang
uint64_t slide_arithmetic(uint64_t piece, uint64_t line, uint64_t block) {
    // restrict to possible blockers
    block &= line;
    if(!block) {
        // don't bother if there are no blockers
        return line;
    }
    
    // split the line into upper and lower rays
    uint64_t bottom = piece ^ (piece - UINT64_C(1));

    // for the upper part we can use the x^(x-1) trick to fill in from the bottom
    uint64_t masked_up = block & ~bottom;
    uint64_t blocked_up = (masked_up ^ (masked_up-UINT64_C(1)));

    // for the bottom we use CLZ + a shift to fill in from the top
    uint64_t masked_down = block & bottom & ~piece;
    uint64_t blocked_down = (UINT64_C(0x7FFFFFFFFFFFFFFF) >> CLZ64(masked_down|UINT64_C(1)));
    
    // the intersection of the two is the move set after masking with the line
    return (blocked_up ^ blocked_down) & line;
}
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Post by Dann Corbit »

This is quite interesting and a delicious complement to hgm's array based dissertation.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Post by hgm »

It would be interesting if these efforts could be somehow combined, by incorporating the various bitboard generation methods into the 'toy search' on the KiwiPete position that I use to gauge the speed of the various mailbox methods. This would only require replacement of the MoveGen() and MakeMove() / UnMake() routines by their bitboard equivalents.
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Post by smatovic »

Interesting, so it boils down to lookup vs computation.

From Ankan's gpu perft:
- Move generation uses bitboards
-- position represented using a set of 6 bitboards
-- two variations for sliding piece move generation: kogge-stone and magics
--- three variations of magics: plain, fixed shift fancy, and byte-lookup fancy
--- table sizes: 2.3 MB, 760KB, 150KB respectively.
-- Kogge-Stone faster in some cases and fixed-shift fancy faster in others
https://github.com/ankan-ban/perft_gpu

--
Srdja
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Post by Gerd Isenberg »

A variation of the classical approach using x^(x-1) separation and pure calculation for orthogonal lines and tiny lookup for diagonals.
Very nice!
Jakob Progsch
Posts: 40
Joined: Fri Apr 16, 2021 4:44 pm
Full name: Jakob Progsch

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Post by Jakob Progsch »

I went over the arithmetic slide implementation and noticed some things:

The piece mask only has one bit set, so I can just replace this

Code: Select all

uint64_t bottom = piece ^ (piece - UINT64_C(1));
with this

Code: Select all

uint64_t bottom = piece - UINT64_C(1);
since the XOR part is only really there to cancel out any stray bits beyond the first one.
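A quick sanity check of that claim: for a single set bit, the two expressions differ only in the piece bit itself, which the blocker mask no longer contains anyway. The helper name below is made up for illustration.

```cpp
#include <cstdint>

// returns true iff, for every single-bit piece mask,
// piece ^ (piece-1) and piece-1 differ only in the piece bit itself
bool check_bottom_identity() {
    for (int s = 0; s < 64; ++s) {
        uint64_t piece = UINT64_C(1) << s;
        uint64_t with_bit = piece ^ (piece - UINT64_C(1)); // piece bit plus all bits below
        uint64_t without  = piece - UINT64_C(1);           // only the bits below
        if ((with_bit ^ without) != piece) return false;
    }
    return true;
}
```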

Then this part

Code: Select all

    block &= line;
    if(!block) {
        // don't bother if there are no blockers
        return line;
    }
doesn't work as well as I thought. Not because it's "wrong", but because, at least the way I use it, the piece is usually included in the blocker set, so block is never zero. So I changed the first part to

Code: Select all

    block &= line & ~piece;
to "fix" it, so I finally get the benefit of my early out. Which, as it turns out, made performance worse... for reasons. Well, luckily, after just throwing out the conditional the measured performance is better anyway. I left the "& ~piece" part in because it seems cleaner and I otherwise have to do it later anyway.

Then the last thing about instruction ordering in that same line. The way I use these slide functions often looks something like this:

Code: Select all

uint64_t rookmoves(uint64_t square, uint64_t blockers) {
    return 
        slide_arithmetic(square, rookfile(square), blockers) ^
        slide_arithmetic(square, rookrank(square), blockers);
}
and since compilers inline this they can optimize out redundant instructions. For example the rookfile and rookrank functions look like this:

Code: Select all

uint64_t rookfile(uint64_t r) {
    uint64_t pos = CLZ64(r);
    return UINT64_C(0x8080808080808080) >> (pos&UINT64_C(7));
}
uint64_t rookrank(uint64_t r) {
    uint64_t pos = CLZ64(r);
    return UINT64_C(0xFF00000000000000) >> (pos&UINT64_C(0x38));
}
which makes it seem like in rookmoves above you'd be redundantly computing pos twice. But after inlining, the compiler understands this and the assembly will have only one CLZ instruction.

Knowing this, now spot what is wrong with

Code: Select all

uint64_t slide_arithmetic(uint64_t piece, uint64_t line, uint64_t block) {
    block &= line & ~piece;
    //...
This computes "block & (line & ~piece)". But line differs between the two calls in rookmoves, while piece and block stay the same, so nothing in that grouping can be reused. If instead you write

Code: Select all

    block = block & ~piece & line;
the compiler can "share" the value of "block & ~piece". I tested all of this with GCC; other compilers might be smarter or less smart about it. But by making the expressions actually the same, they at least have the best chance of seeing it.

The new version of the arithmetic slide now looks like this:

Code: Select all

uint64_t slide_arithmetic(uint64_t piece, uint64_t line, uint64_t block) {
    // mask blockers
    block = block & ~piece & line;

    // split the line into upper and lower rays
    uint64_t bottom = piece-UINT64_C(1);

    // for the upper part we can use the x^(x-1) trick to fill in from the bottom
    uint64_t masked_up = block & ~bottom;
    uint64_t blocked_up = (masked_up ^ (masked_up-UINT64_C(1)));

    // for the bottom we use CLZ + a shift to fill in from the top
    uint64_t masked_down = block & bottom;
    uint64_t blocked_down = (UINT64_C(0x7FFFFFFFFFFFFFFF) >> CLZ64(masked_down|UINT64_C(1)));
    
    // the intersection of the two is the move set after masking with the line
    return (blocked_up ^ blocked_down) & line;
}
All of these changes improved performance slightly.
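As a spot check that the combination works as described, here is the updated slide_arithmetic together with rookfile/rookrank and rookmoves from above, made self-contained by taking CLZ64 to be __builtin_clzll (a GCC/Clang assumption; the rest is the code from this thread). Note that each slide_arithmetic call includes the rook's own square, and the XOR in rookmoves cancels it:

```cpp
#include <cstdint>

#define CLZ64(x) __builtin_clzll(x)  // GCC/Clang builtin; an assumption, not from the post

uint64_t slide_arithmetic(uint64_t piece, uint64_t line, uint64_t block) {
    // mask blockers, excluding the piece itself
    block = block & ~piece & line;
    // all bits strictly below the piece
    uint64_t bottom = piece - UINT64_C(1);
    // upper ray: x ^ (x-1) fills in from the bottom up to the first blocker
    uint64_t masked_up = block & ~bottom;
    uint64_t blocked_up = masked_up ^ (masked_up - UINT64_C(1));
    // lower ray: CLZ + shift fills in from the top down to the first blocker
    uint64_t masked_down = block & bottom;
    uint64_t blocked_down = UINT64_C(0x7FFFFFFFFFFFFFFF) >> CLZ64(masked_down | UINT64_C(1));
    return (blocked_up ^ blocked_down) & line;
}

uint64_t rookfile(uint64_t r) {
    uint64_t pos = CLZ64(r);
    return UINT64_C(0x8080808080808080) >> (pos & UINT64_C(7));
}
uint64_t rookrank(uint64_t r) {
    uint64_t pos = CLZ64(r);
    return UINT64_C(0xFF00000000000000) >> (pos & UINT64_C(0x38));
}
uint64_t rookmoves(uint64_t square, uint64_t blockers) {
    // each call includes the rook's own square; the XOR cancels it
    return slide_arithmetic(square, rookfile(square), blockers)
         ^ slide_arithmetic(square, rookrank(square), blockers);
}
```

For a rook on d4 with blockers on d7 and f4 (and the rook itself in the occupancy), this yields exactly d1-d3, d5-d7, a4-c4 and e4-f4, with the blockers included and d4 itself cancelled out.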
Rooks

Code: Select all

            arithmetic         kindergarten           magic
  M         t      t/M          t      t/M          t      t/M

  1       10.72   10.72       15.25   15.25       26.67   26.67
  2       19.29    9.65       16.65    8.33       27.70   13.85
  3       27.64    9.21       24.62    8.21       28.51    9.50
  4       35.69    8.92       31.80    7.95       29.41    7.35
  5       45.10    9.02       39.58    7.92       32.13    6.43
  6       55.94    9.32       47.77    7.96       33.19    5.53
  7       64.58    9.23       57.34    8.19       35.44    5.06
  8       73.13    9.14       65.83    8.23       38.45    4.81
Bishops

Code: Select all

            arithmetic         kindergarten           magic
  M         t      t/M          t      t/M          t      t/M

  1       11.44   11.44       15.20   15.20       10.82   10.82
  2       19.12    9.56       16.33    8.16       11.10    5.55
  3       25.17    8.39       20.90    6.97       11.35    3.78
  4       34.13    8.53       25.48    6.37       11.87    2.97
  5       43.32    8.66       33.98    6.80       13.12    2.62
  6       52.36    8.73       40.67    6.78       15.28    2.55
  7       63.07    9.01       48.07    6.87       17.87    2.55
  8       72.60    9.07       55.51    6.94       20.72    2.59

So, about a 20% improvement overall, and even more at M=1. This has brought the arithmetic and kindergarten implementations close enough that the practical performance differences in the search are indistinguishable from measurement noise.
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Post by Dann Corbit »

I am sure that this is ignorant, but what does M stand for?
Jakob Progsch
Posts: 40
Joined: Fri Apr 16, 2021 4:44 pm
Full name: Jakob Progsch

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Post by Jakob Progsch »

It refers to the count of the inner "loop" in the benchmark. That one should get unrolled by the compiler to produce M independent chains of move generation calls. I'll try to explain/justify this better:
I want to characterize the performance of a function f so I time the following loop and divide by N:

Code: Select all

for(i<N) {
    x = f(x);
}
This measures how long f takes from start to finish, aka the latency. This is the M=1 case.

When I want to figure out the throughput at which I can calculate f, I can use it multiple times on independent data, so the compiler and CPU can reorder it into a better sequence, filling in any latencies and empty issue slots by issuing other instructions:

Code: Select all

for(i<N) {
    x = f(x);
    y = f(y);
    z = f(z);
    w = f(w);
}
This would be M=4. For sufficiently large M this will eventually measure the achievable throughput of calls to f.
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Post by Dann Corbit »

OK, thanks, that makes perfect sense.
Compilers today can be incredibly efficient.
Long ago, they could eliminate tail recursion.
Now, they even improve algorithms (no, really).
Jakob Progsch
Posts: 40
Joined: Fri Apr 16, 2021 4:44 pm
Full name: Jakob Progsch

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Post by Jakob Progsch »

hgm wrote: Tue Apr 20, 2021 8:52 am It would be interesting if these efforts could be somehow combined, by incorporating the various bitboard generation methods into the 'toy search' on the KiwiPete position that I use to gauge the speed of the various mailbox methods. This would only require replacement of the MoveGen() and MakeMove() / UnMake() routines by their bitboard equivalents.
I was thinking of doing something like that for perft but it would apply to a search as well I guess. What I'm struggling with a bit is figuring out how to write the generic search/perft without precluding some approaches. A natural starting point would be something like:

Code: Select all

uint64_t perft(board_t &board, int depth) {
    temporary_t tmp;
    
    if(!is_legal(board, tmp)) {
        return 0;
    }
    if(depth==0) {
        return 1;
    }

    init_moves(board, tmp);

    uint64_t count = 0;
    while(has_move(board, tmp)) {
        move_t move = next_move(board, tmp);
        play(board, tmp, move);
        count += perft(board, depth-1);
        un_play(board, tmp, move);
    }

    return count;
}
However, that already makes pretty strong assumptions about how move generation and traversal work. In my current code, for example, I never actually generate a list of moves; rather, I have a function that extracts them on the fly from the bitboard movesets and calls a functor on each one. The code for that would look more like:

Code: Select all

uint64_t perft(board_t &board, int depth) {
    //...

    uint64_t count = 0;
    traverse_moves(board, tmp, [&](move_t move){
        play(board, tmp, move);
        count += perft(board, depth-1);
        un_play(board, tmp, move);    	
    });

    return count;
}
I'm not sure what the most "generic" version is. It would be interesting though.