Magic Move generation

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Feb 03, 2009 8:35 am

wgarvin wrote: Let me brainstorm out loud for a few paragraphs. Lets ignore bishops for now, and just think about rooks.

The post-mask lets you combine the entire set of results for a square, with the entire set of results for another square. But there are limits on which squares you can combine it with, because if the result that goes in that slot for one square happens to have a bit set, then that bit *also* needs to be set in the result from the other square that goes in that slot.

You guys have worked out how to combine two diagonally-adjacent squares in the same bitboard. But since the squares which are "compatible" for merging in this way have different shift amounts, it seems like we might be able to do better?

I'm thinking that there are many result bitboards that will not have bits that reach all the way to the end of the ray (because that requires all of the bits along that ray to have been empty in the occupancy bits that are being hashed, which is not the case for most of the possible occupancies). Indeed, it seems that if you take two magics, then AND together the post-masks for their two squares, only the bits which are set in this result need to be "unambiguous". If the squares involved have these "need to be unambiguous" bits near the end of their rays, then most of the entries from both sets of bitboards will have zero for these bits. So it might be possible to fit them together in some fashion where there are no conflicts. So you could store more than two bitboards there as long as you can somehow enforce the requirement that for each pair of bitboards being stored, the "overlapping" bits from their post-masks are either both ones or both zeros.

If you are hashing N occupancy bits from the original bitboard, there should be lots of different magics that redistribute those keys over a range of 2^N entries, and indeed with the constructive collisions there are usually magics that can remap them to a range of 2^(N-1) entries. I'm guessing that there might be plenty of working magics that remap them into a 2^(N-1) range ? And we get to choose which one we use.

So maybe, through clever or lucky choosing of magics and shift amounts and starting offsets, it is possible to "fit together" more squares to optimize the space they take up.

It doesn't seem very important to me to optimize the space usage of the 5- and 6-bit magics because they are already using small, efficient tables. But the worst rook magics at 11 and 12 bits require a lot of table space (up to 32 KB for one square). So it would be interesting to see whether any savings are possible there, by somehow combining and interleaving the entries of their lookup tables. I.e. having chosen the best magic you can find for a "worst-case" square, try to pick random magics for another square such that you can lay them on top of each other and when the "post-mask" for the two squares crosses, they both have either a 1 or a zero.

I haven't tried this (or thought about the math much), so it might be more or less impossible.
Code: Select all
   A 1 a a a a * *    |    A 1 a a a a a *    A 1 a a a a * *
   1 B 1 b b b * *    |    1 B 1 b b b b *    1 - - - - - c d
   a 1 - - - - c d    |    a 1 - - - - - d    a - - - - - c d
   a b - - - - c d    |    a b - - - - - d    a - - - - - c d
   a b - - - - c d    |    a b - - - - - d    a - - - - - c d
   a b - - - - 1 d    |    a b - - - - - d    a - - - - - 1 d
   * * c c c 1 C 1    |    a b - - - - - 1    * c c c c 1 C 1
   * * d d d d 1 D    |    * * d d d d 1 D    * d d d d d 1 D
If it were possible to fit all four of these squares into the same array data... Only the bits marked with a * would have two obligations that have to be satisfied (and each bit would have only two). That still might be impossibly difficult, but what about the two diagrams on the right? Three bitboards each instead of two, and only 4 bits that need to do double duty, and those bits are right near the ends of the ray in all cases so *most* of the table entries for each square are going to want those bits to be zero.

Interesting brainstorm, I also believe that further combining and interleaving in some clever way may dense the table below 256K. The problem is, it takes some time to backtrack all the magics and offsets. If two concrete attack sets share the same memory location, one has to prove that all intersections with the post-masks leaves equal bits in all union sets.

Pradu · Post by **Pradu** » Tue Feb 03, 2009 9:51 pm

Gerd Isenberg wrote: is your other paper no longer available or renamed?
http://buzz.pradu.us/research/magic/Bitboards.pdf

Thanks,
Gerd

It's here:
http://www.pradu.us/old/Nov27_2008/Buzz ... boards.pdf

Pradu · Post by **Pradu** » Tue Feb 03, 2009 9:55 pm

cms271828 wrote:Well I implemented it, and tried it down the a-file.

for square 0 [a8] (6 bit occ), I found 4 bits was impossible, but haven't tried 5 bits due to time required.

for square 8 [a7] (5 bit occ), best I found was 5 bits, so no change.
for square 16 [a6] (5 bit occ), best I found was 5 bits, so no change.
for square 24 [a5] (5 bit occ), best I found was 5 bits, so no change.
for square 32 [a4] (5 bit occ), best I found was 5 bits, so no change.
for square 40 [a3] (5 bit occ), best I found was 5 bits, so no change.
for square 48 [a2] (5 bit occ), best I found was 5 bits, so no change.

for square 56 [a1] (6 bit occ), I found 4 bits was impossible, but haven't tried 5 bits due to time required.

So I've got zero improvement so far.
Any thoughts?

You are forgetting about carry effects. You will only find magics where the number of bits in the hashed index is equal to the number of bits in the occupancy and you won't find all possible magics that can achieve this either (if you were looking for one that hashed to the smallest index in addition to the smallest number of bits). Carry effects are necessary for making better quality magics.

Pradu · Post by **Pradu** » Tue Feb 03, 2009 10:14 pm

Pradu wrote:You will only find magics where the number of bits in the hashed index is equal to the number of bits in the occupancy.

I take this back, you may find it, but even if one exists, you are not certain that you will find it because you ignore carry effects of some of the bits that don't directly affect the index bits but can do so with another bit letting it carry. Bits that are too high up that get shifted out of the 64-bit range can be set to zero (they are upper-redundant bits). Bits that don't directly affect the index bits but can affect bits within the 64-bit range need further investigation as to whether they will affect the upper bits or not. If you find out that it is impossible for these bits to affect the index bits then they can be set to zero (lower-redundant bits).

cms271828 · Post by **cms271828** » Thu Feb 05, 2009 6:55 am

Thanks for help,

I got about 7% speed up using magic bitboards over rotated bitboards, and got rid of loads of nasty code too.

I tried making my own 4 bit magic for a7 by looking at bits that effect top 5 bits, but I wasn't able to do it.

I tried the magics for a-file, and they work fine, I'll need to come up with some sort of algorithm to get better ones, can't just leave it with a-file done.

Thanks again

Tord Romstad · Post by **Tord Romstad** » Thu Feb 05, 2009 8:40 am

cms271828 wrote:I got about 7% speed up using magic bitboards over rotated bitboards, and got rid of loads of nasty code too.

Seriously? My biggest problem with magic bitboards is that the code looks extremely nasty. I spent an hour or two trying to clean up my code a few weeks back, but couldn't come up with anything better than the code below. The code is 122 lines long, including the precomputed arrays and initialization. The two huge 64-entry arrays of precomputed magic numbers are particularly annoying, but I'm not aware of any sufficiently fast, elegant and simple ways to compute them at startup.

Any suggestions for how to make this code more compact and elegant?

Tord

Code: Select all

const uint64_t RMult[64] = {
  0xa8002c000108020ULL, 0x4440200140003000ULL, 0x8080200010011880ULL,
  0x380180080141000ULL, 0x1a00060008211044ULL, 0x410001000a0c0008ULL,
  0x9500060004008100ULL, 0x100024284a20700ULL, 0x802140008000ULL,
  0x80c01002a00840ULL, 0x402004282011020ULL, 0x9862000820420050ULL,
  0x1001448011100ULL, 0x6432800200800400ULL, 0x40100010002000cULL,
  0x2800d0010c080ULL, 0x90c0008000803042ULL, 0x4010004000200041ULL,
  0x3010010200040ULL, 0xa40828028001000ULL, 0x123010008000430ULL,
  0x24008004020080ULL, 0x60040001104802ULL, 0x582200028400d1ULL,
  0x4000802080044000ULL, 0x408208200420308ULL, 0x610038080102000ULL,
  0x3601000900100020ULL, 0x80080040180ULL, 0xc2020080040080ULL,
  0x80084400100102ULL, 0x4022408200014401ULL, 0x40052040800082ULL,
  0xb08200280804000ULL, 0x8a80a008801000ULL, 0x4000480080801000ULL,
  0x911808800801401ULL, 0x822a003002001894ULL, 0x401068091400108aULL,
  0x4a10a00004cULL, 0x2000800640008024ULL, 0x1486408102020020ULL,
  0x100a000d50041ULL, 0x810050020b0020ULL, 0x204000800808004ULL,
  0x20048100a000cULL, 0x112000831020004ULL, 0x9000040810002ULL,
  0x440490200208200ULL, 0x8910401000200040ULL, 0x6404200050008480ULL,
  0x4b824a2010010100ULL, 0x4080801810c0080ULL, 0x400802a0080ULL,
  0x8224080110026400ULL, 0x40002c4104088200ULL, 0x1002100104a0282ULL,
  0x1208400811048021ULL, 0x3201014a40d02001ULL, 0x5100019200501ULL,
  0x101000208001005ULL, 0x2008450080702ULL, 0x1002080301d00cULL,
  0x410201ce5c030092ULL
};

const int RShift[64] = {
  52, 53, 53, 53, 53, 53, 53, 52, 53, 54, 54, 54, 54, 54, 54, 53,
  53, 54, 54, 54, 54, 54, 54, 53, 53, 54, 54, 54, 54, 54, 54, 53,
  53, 54, 54, 54, 54, 54, 54, 53, 53, 54, 54, 54, 54, 54, 54, 53,
  53, 54, 54, 54, 54, 54, 54, 53, 52, 53, 53, 53, 53, 53, 53, 52
};

uint64_t RMask[64];
int RIndex[64];
uint64_t RAttacks[0x19000];

const uint64_t BMult[64] = {
  0x440049104032280ULL, 0x1021023c82008040ULL, 0x404040082000048ULL,
  0x48c4440084048090ULL, 0x2801104026490000ULL, 0x4100880442040800ULL,
  0x181011002e06040ULL, 0x9101004104200e00ULL, 0x1240848848310401ULL,
  0x2000142828050024ULL, 0x1004024d5000ULL, 0x102044400800200ULL,
  0x8108108820112000ULL, 0xa880818210c00046ULL, 0x4008008801082000ULL,
  0x60882404049400ULL, 0x104402004240810ULL, 0xa002084250200ULL,
  0x100b0880801100ULL, 0x4080201220101ULL, 0x44008080a00000ULL,
  0x202200842000ULL, 0x5006004882d00808ULL, 0x200045080802ULL,
  0x86100020200601ULL, 0xa802080a20112c02ULL, 0x80411218080900ULL,
  0x200a0880080a0ULL, 0x9a01010000104000ULL, 0x28008003100080ULL,
  0x211021004480417ULL, 0x401004188220806ULL, 0x825051400c2006ULL,
  0x140c0210943000ULL, 0x242800300080ULL, 0xc2208120080200ULL,
  0x2430008200002200ULL, 0x1010100112008040ULL, 0x8141050100020842ULL,
  0x822081014405ULL, 0x800c049e40400804ULL, 0x4a0404028a000820ULL,
  0x22060201041200ULL, 0x360904200840801ULL, 0x881a08208800400ULL,
  0x60202c00400420ULL, 0x1204440086061400ULL, 0x8184042804040ULL,
  0x64040315300400ULL, 0xc01008801090a00ULL, 0x808010401140c00ULL,
  0x4004830c2020040ULL, 0x80005002020054ULL, 0x40000c14481a0490ULL,
  0x10500101042048ULL, 0x1010100200424000ULL, 0x640901901040ULL,
  0xa0201014840ULL, 0x840082aa011002ULL, 0x10010840084240aULL,
  0x420400810420608ULL, 0x8d40230408102100ULL, 0x4a00200612222409ULL,
  0xa08520292120600ULL
};

const int BShift[64] = {
  58, 59, 59, 59, 59, 59, 59, 58, 59, 59, 59, 59, 59, 59, 59, 59,
  59, 59, 57, 57, 57, 57, 59, 59, 59, 59, 57, 55, 55, 57, 59, 59,
  59, 59, 57, 55, 55, 57, 59, 59, 59, 59, 57, 57, 57, 57, 59, 59,
  59, 59, 59, 59, 59, 59, 59, 59, 58, 59, 59, 59, 59, 59, 59, 58
};

uint64_t BMask[64];
int BIndex[64];
uint64_t BAttacks[0x1480];

inline uint64_t b_attacks(int s, uint64_t b) {
  return BAttacks[BIndex[s] + (((b&BMask[s]) * BMult[s]) >> BShift[s])];
}

inline uint64_t r_attacks(int s, uint64_t b)  {
  return RAttacks[RIndex[s] + (((b&RMask[s]) * RMult[s]) >> RShift[s])];
}

inline uint64_t q_attacks(int s, uint64_t b) {
  return b_attacks(s, b) | r_attacks(s, b);
}

uint64_t slide(int sq, uint64_t b, int deltas[],
               int fmin, int fmax, int rmin, int rmax) {
  uint64_t result = 0;
  int rk = sq/8, fl = sq%8, r, f, i, dx, dy;
  for(i = 0; i < 4; i++)
    for(dx = ((9+deltas[i]) & 7) - 1, dy = deltas[i] / 7, f = fl+dx, r = rk+dy;
        (!dx || (f>=fmin && f<=fmax)) && (!dy || (r>=rmin && r<=rmax));
        f += dx, r += dy) {
      result |= (1ULL << (f+r*8));
      if(b & (1ULL << (f+r*8))) break;
    }
  return result;
}

void init_slide_attacks(uint64_t attacks[], int aIndex[], uint64_t mask[],
                        const int shift[], const uint64_t mult[],
                        int deltas[]) {
  int i, j, k, l, m, ix;
  uint64_t b, bb;
  for(i = 0, ix = 0; i < 64; i++, ix += j) {
    aIndex[i] = ix;
    mask[i] = slide(i, 0, deltas, 1, 6, 1, 6);
    for(k = 0, j = (1 << (64-shift[i])); k < j; k++) {
      for(l = 0, b = 0, bb = mask[i]; bb; l++) {
        m = pop_1st_bit(&bb);
        if(k & (1<<l)) b |= (1ULL << m);
      }
      attacks[ix+((b*mult[i]) >> shift[i])] = slide(i, b, deltas, 0, 7, 0, 7);
    }
  }
}

void init(void) {
  int dirs[] = {9,7,-7,-9,1,-1,8,-8};
  init_slide_attacks(BAttacks, BIndex, BMask, BShift, BMult, dirs);
  init_slide_attacks(RAttacks, RIndex, RMask, RShift, RMult, dirs+4);
}

jwes · Post by **jwes** » Thu Feb 05, 2009 9:11 am

One idea is to read the numbers from a file at startup and if the file does not exist, calculate the numbers and save them to the file. If you include the file in the distribution, no one (but you, once) will have to wait for them to be created.

Edmund · Post by **Edmund** » Thu Feb 05, 2009 10:03 am

you could calculate RShift on startup or include the shifts in the magic key

Code: Select all

for (int sq=0; sq<64; sq++) RShift[sq] = 54 - (sq & 7 == 0) - (sq & 7 == 7) - (sq & 56 == 0) - (sq & 56 == 56);

Gerd Isenberg · Post by **Gerd Isenberg** » Thu Feb 05, 2009 6:46 pm

Tord Romstad wrote:
cms271828 wrote:I got about 7% speed up using magic bitboards over rotated bitboards, and got rid of loads of nasty code too.
Seriously? My biggest problem with magic bitboards is that the code looks extremely nasty. I spent an hour or two trying to clean up my code a few weeks back, but couldn't come up with anything better than the code below. The code is 122 lines long, including the precomputed arrays and initialization. The two huge 64-entry arrays of precomputed magic numbers are particularly annoying, but I'm not aware of any sufficiently fast, elegant and simple ways to compute them at startup.

Any suggestions for how to make this code more compact and elegant?

Tord

Initialization of Attack-bitboards has to be done for all lookup-methods anyway, whether you use rotated (ok, only line-wise) or magics. I also suggest Grant's incorporating the shift inside the magics, or to format the magics with leading zeros, four in a line.

I would declare the inline attack getter inside a header-file, as well the data declarations as externals. Such initialization code of often ugly, so put it inside a separate module - to keep it distinct from often used code. I prefer to use Kogge-Stone for attack initialization, and would even unroll the whole stuff - without too much parameter passing to "generalize" the initialization for rooks and bishops. This code is only used once and discarded afterwards - if it is 4K or 64K.

Another idea is to delegate the whole attack initialization stuff to your magic finder, which generates a c-file with all const data, not only magics and shifts, but also the attack arrays...

Gerd

Tord Romstad · Post by **Tord Romstad** » Fri Feb 06, 2009 9:45 am

Thanks for all suggestions, everyone!

Gerd Isenberg wrote:Initialization of Attack-bitboards has to be done for all lookup-methods anyway, whether you use rotated (ok, only line-wise) or magics. I also suggest Grant's incorporating the shift inside the magics,

Yes, I should try this. It should both make the code a tiny bit simpler and infinitesimally faster

or to format the magics with leading zeros, four in a line.

I don't understand this one.

I would declare the inline attack getter inside a header-file, as well the data declarations as externals.

That's what I do, actually. The code I posted wasn't taken from Glaurung, but from a little sliding attack benchmarking program, where all the code is in a single .c file, and "extern" and "inline" declarations are therefore not necessary.

]Such initialization code of often ugly, so put it inside a separate module - to keep it distinct from often used code. I prefer to use Kogge-Stone for attack initialization, and would even unroll the whole stuff - without too much parameter passing to "generalize" the initialization for rooks and bishops. This code is only used once and discarded afterwards - if it is 4K or 64K.

I've never looked at Kogge-Stone, perhaps it can be used to simplify things. I'll have a look. Thanks for the suggestion!

Another idea is to delegate the whole attack initialization stuff to your magic finder, which generates a c-file with all const data, not only magics and shifts, but also the attack arrays...

That would make things even worse -- huge arrays of numbers which look meaningless to the reader is exactly what I want to avoid. Moving them from the source code to a separate text or binary file and reading them at startup, as Wesley suggests, doesn't really make any difference. The source code should be self-documenting, and not contain a bunch of magic numbers which do not make sense without a long explanation.

I think what I need to do is to find an efficient, clear and readable algorithm for computing all the magic numbers during startup, which does not involve trying random numbers until I find some that work.

Tord

Magic Move generation

Re: Magic Move generation

Re: Magic Move generation

Re: Magic Move generation

Re: Magic Move generation

Re: Magic Move generation

Re: Magic Move generation

Re: Magic Move generation

Re: Magic Move generation

Re: Magic Move generation

Re: Magic Move generation