The keyword 'efficiently' comes with two hard constraints:
32-bit multiplication only
a maximum of 64 kB of memory
Existing implementations did not meet these targets, and hashing 12 bits at once makes the memory consumption too large. So 4 rays of 6 bits each must be hashed per square.
Doing the math puts the upper limit on memory consumption at 64 * 4 * 64 * 8 bytes = 131072 bytes (128 KiB). This is still too big, but hashing all rays into a single shared array gives nice improvements.
Is it possible to fold the 64-bit occupancy into 32 bits?
Yes it is, with this formula, which folds 64 bits onto 32 while keeping the relevant bits (6 per direction) unique:
Code: Select all
(uint32_t)(input[i].Occ >> 32ull) ^ (static_cast<uint32_t>(input[i].Occ) * 3); // high 32 bits XOR (low 32 bits * 3)
This gives roughly 10x less memory consumption than before, and the whole table now fits nicely into L2 (or into shared memory on CUDA).
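To convince yourself that the fold really keeps a ray's relevant bits apart, here is a minimal sketch (my own, not from the release): it enumerates all 64 occupancy subsets of one example ray mask (the a-file ray from a1, squares a2-a7) with the Carry-Rippler trick, folds each one with the formula above, and counts the distinct results. rayMask and fold32 are names I made up for the example.
Code: Select all
#include <cstdint>
#include <cstdio>
#include <set>

// Example ray: the six relevant squares of the north ray from a1 (a2..a7).
constexpr uint64_t rayMask = 0x0001010101010100ull;

// The fold from above: high 32 bits XOR (low 32 bits * 3).
static uint32_t fold32(uint64_t occ)
{
    return (uint32_t)(occ >> 32) ^ ((uint32_t)occ * 3);
}

int main()
{
    std::set<uint32_t> seen;
    uint64_t sub = 0;
    do {                                    // Carry-Rippler: all 2^6 subsets of rayMask
        seen.insert(fold32(sub));
        sub = (sub - rayMask) & rayMask;
    } while (sub);
    std::printf("distinct folded values: %zu of 64\n", seen.size());
}
If the fold is lossless for this ray, the count comes out as 64; the same check can be run for every ray of every square.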
Also, we are now in 32-bit land, which means that when searching for 'magic numbers' you can exhaustively enumerate all possible 32-bit numbers that reduce all inputs to 6 bits with only valid intersections, and then find the lowest possible fitting offset. You still don't get a perfect solution (because the order in which you insert into the shared list also affects the maximum index), but it is very close.
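As an illustration of that search, here is a sketch under my own simplifications: it sweeps 32-bit multipliers for a single example ray (again the a-file ray from a1), folds each occupancy subset without the per-direction mask trick, and accepts a multiplier if the top 6 bits of the 32-bit product only collide "constructively", i.e. when the two occupancies produce the same attack set. The helper names (fold32, northAttacks, onlyValidIntersections) and the plain CPU sweep are mine; finding the lowest fitting offset afterwards is a separate packing step not shown here.
Code: Select all
#include <cstdint>
#include <cstdio>

constexpr uint64_t rayMask = 0x0001010101010100ull;   // relevant squares a2..a7

static uint32_t fold32(uint64_t occ)
{
    return (uint32_t)(occ >> 32) ^ ((uint32_t)occ * 3);
}

// Attacks of a slider on a1 going north, given blockers on a2..a7.
static uint64_t northAttacks(uint64_t occ)
{
    uint64_t att = 0;
    for (int s = 8; s < 64; s += 8) {
        att |= 1ull << s;
        if (occ & (1ull << s)) break;       // stop at the first blocker
    }
    return att;
}

// A multiplier is valid if 6 bit indices only collide when the attack sets match.
static bool onlyValidIntersections(uint32_t mult)
{
    uint64_t table[64] = {};                // 0 = unused (an attack set here is never empty)
    uint64_t sub = 0;
    do {
        uint32_t idx = (fold32(sub) * mult) >> 26;
        uint64_t att = northAttacks(sub);
        if (table[idx] != 0 && table[idx] != att) return false;
        table[idx] = att;
        sub = (sub - rayMask) & rayMask;
    } while (sub);
    return true;
}

int main()
{
    // The full 2^32 sweep is the part that really wants a GPU; on a CPU it takes a while.
    uint64_t found = 0;
    for (uint64_t m = 1; m < (1ull << 32); ++m) {
        if (onlyValidIntersections((uint32_t)m)) {
            if (found < 5) std::printf("valid multiplier: 0x%08x\n", (uint32_t)m);
            ++found;
        }
    }
    std::printf("total valid multipliers for this ray: %llu\n", (unsigned long long)found);
}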
Having everything in 32 bit also reduces the per-square information by 8 bytes compared to 32-bit optimized black magic! The final algorithm looks like this:
Code: Select all
#include <cstdint>

struct Direction {
    uint64_t mask;      // per-direction mask OR-ed into the occupancy before folding
    uint32_t mult;      // 32 bit multiplier found by the exhaustive search
    uint32_t offset;    // position of this ray's entries in the shared attacks table
};

constexpr uint64_t attacks[5956] = ...
constexpr Direction masks[64][4] = ...

static constexpr uint64_t fold(uint64_t occ, Direction d)
{
    uint64_t o = occ | d.mask;
    // fold the 64 bit occupancy into 32 bits: high half XOR (low half * 3)
    uint32_t folded = (uint32_t)(o >> 32) ^ ((uint32_t)(o) * 3);
    // one 32 bit multiplication, take the top 6 bits, add the ray's offset
    return attacks[((folded * d.mult) >> 26) + d.offset];
}

uint64_t Queen(int sq, uint64_t occ)
{
    // union of the four ray lookups
    return fold(occ, masks[sq][0]) | fold(occ, masks[sq][1])
         | fold(occ, masks[sq][2]) | fold(occ, masks[sq][3]);
}
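Since the attacks and masks initializers are elided above, here is a small, self-contained reference I would use to validate the table-driven lookup once the tables are filled in: a plain ray-walk queen attack generator (QueenReference is my own name, not part of the release), which can be compared square by square against Queen().
Code: Select all
#include <cstdint>
#include <cstdio>

// Slow reference: walk the 8 queen directions square by square until a blocker
// (or the board edge) is reached.
static uint64_t QueenReference(int sq, uint64_t occ)
{
    static const int df[8] = { 1, -1, 0,  0, 1,  1, -1, -1 };
    static const int dr[8] = { 0,  0, 1, -1, 1, -1,  1, -1 };
    uint64_t attacks = 0;
    int f0 = sq & 7, r0 = sq >> 3;
    for (int d = 0; d < 8; ++d) {
        int f = f0 + df[d], r = r0 + dr[d];
        while (f >= 0 && f < 8 && r >= 0 && r < 8) {
            int s = r * 8 + f;
            attacks |= 1ull << s;
            if (occ & (1ull << s)) break;   // include the blocker square, then stop
            f += df[d]; r += dr[d];
        }
    }
    return attacks;
}

int main()
{
    // Example position: queen on d4 (square 27), blockers on d7 (51) and g4 (30).
    uint64_t occ = (1ull << 51) | (1ull << 30);
    std::printf("queen on d4: 0x%016llx\n", (unsigned long long)QueenReference(27, occ));
}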
https://github.com/Gigantua/Chess_Movegen
https://github.com/Gigantua/Chess_Movegen_GPU
I have also started to look at the more classical approach and have generated 2.1 TB of raw valid reduction numbers for all 64 squares for 12 bits and 9 bits (or, as you call them, 'magics').


Seeing how low the number of non-repeated hashes is in the 64-bit range for some squares, one can even derive a very narrow bound on the number of magics per square (very similar to estimating an animal population by tagging and releasing animals and looking at the ratio of already-tagged recaptures).
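For readers unfamiliar with the analogy, this is the classic Lincoln-Petersen capture-recapture estimate; a tiny sketch with made-up numbers (M, C and R are purely illustrative, not measured values from the 2.1 TB run):
Code: Select all
#include <cstdio>

// Lincoln-Petersen: if a first search pass "tags" M distinct magics, and a second
// pass finds C magics of which R were already tagged, the total population is
// roughly N = M * C / R.
int main()
{
    double M = 1200.0;   // distinct magics found in the first pass (example value)
    double C = 1000.0;   // magics found in the second pass (example value)
    double R = 150.0;    // of those, already tagged in the first pass (example value)
    std::printf("estimated number of magics for this square: %.0f\n", M * C / R);
}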
When looking at how many bits are non-variable, it becomes easier to condense and find solutions that need fewer than 88316 elements.
I also created a 'magic number' viewer (which I might release too) where you can see how the input list gets distributed into 12 bits. It's good for understanding what a magic number actually does: distribute the relevant occupancy into an index, ideally self-intersecting and intersecting with the existing list at the lowest possible maximum index.
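A text-only version of what such a viewer shows could look like the sketch below: it takes the classic rook-on-a1 relevant mask, enumerates all 4096 relevant occupancies, maps each one to a 12-bit index with (occ * magic) >> 52, and reports how many index slots are hit and the maximum index. The magic constant here is an arbitrary placeholder, not a real magic, so expect plenty of bad collisions; with a real magic these are exactly the numbers the viewer highlights.
Code: Select all
#include <cstdint>
#include <cstdio>

// Classic 12 relevant bits of a rook on a1: a2..a7 and b1..g1.
constexpr uint64_t relevant = 0x000101010101017Eull;

int main()
{
    // Placeholder constant, NOT a real magic; a real one would distribute the
    // occupancies with only constructive collisions.
    const uint64_t magic = 0x123456789ABCDEF1ull;
    bool used[4096] = {};
    unsigned maxIdx = 0, slots = 0;
    uint64_t occ = 0;
    do {                                        // Carry-Rippler over all 2^12 occupancies
        unsigned idx = (unsigned)((occ * magic) >> 52);
        if (!used[idx]) { used[idx] = true; ++slots; }
        if (idx > maxIdx) maxIdx = idx;
        occ = (occ - relevant) & relevant;
    } while (occ);
    std::printf("slots used: %u of 4096, maximum index: %u\n", slots, maxIdx);
}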
Oh, and CUDA is really the only (fast) way to do this. My 5950X is around 35-40x slower than the same code on the GPU for finding and generating random 64-bit magics.
TLDR:
FoldingHash will be released soon; it is a 48 kB, 32-bit hashing algorithm.
The main new idea is to condense the 64-bit occupancy into 32 bits and thus handle 4 rays with only 4x 32-bit multiplications. This would have been perfect 20 years ago (when 64-bit multiplication was expensive), and it looks very promising for the GPU nowadays.