The keyword 'efficiently' comes with two hard constraints:
32-bit multiplication only
a maximum of 64 kB of memory
Existing implementations did not meet these targets, and hashing 12 bits at once makes the memory consumption too large. So 4 rays of 6 bits each must be hashed per square.
Doing the math puts the upper limit on memory consumption at 64 * 4 * 64 * 8 bytes = 131072 bytes (128 KiB). This is still too big, but hashing all rays into a single shared array gives nice improvements.
Is it possible to fold the 64-bit occupancy into 32 bits?
Yes it is, with this formula, which folds 64 bits onto 32 while keeping the relevant bits (6 per direction) unique:
Code: Select all
(uint32_t)(input[i].Occ >> 32ull) ^ (static_cast<uint32_t>(input[i].Occ) * 3); // high 32 bits XOR (low 32 bits * 3)
This gives roughly 10x less memory consumption than before, and the whole table now fits nicely into L2 (or into shared memory on CUDA).
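To convince yourself that the fold really keeps a ray's relevant bits apart, here is a minimal sketch (my own, not from the release): it enumerates all 64 occupancy subsets of one example ray mask (the a-file ray from a1, squares a2-a7) with the Carry-Rippler trick, folds each one with the formula above, and counts the distinct results. rayMask and fold32 are names I made up for the example.
Code: Select all
#include <cstdint>
#include <cstdio>
#include <set>

// Example ray: the six relevant squares of the north ray from a1 (a2..a7).
constexpr uint64_t rayMask = 0x0001010101010100ull;

// The fold from above: high 32 bits XOR (low 32 bits * 3).
static uint32_t fold32(uint64_t occ)
{
    return (uint32_t)(occ >> 32) ^ ((uint32_t)occ * 3);
}

int main()
{
    std::set<uint32_t> seen;
    uint64_t sub = 0;
    do {                                    // Carry-Rippler: all 2^6 subsets of rayMask
        seen.insert(fold32(sub));
        sub = (sub - rayMask) & rayMask;
    } while (sub);
    std::printf("distinct folded values: %zu of 64\n", seen.size());
}
If the fold is lossless for this ray, the count comes out as 64; the same check can be run for every ray of every square.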
Also, we are now in 32-bit land, which means that when searching for 'magic numbers' you can exhaustively enumerate all possible 32-bit numbers that reduce all inputs to 6 bits with only valid intersections, and then find the lowest possible fitting offset. You still don't get a perfect solution (because the order in which you insert into the shared list also affects the maximum index), but it is very close.
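As an illustration of that search, here is a sketch under my own simplifications: it sweeps 32-bit multipliers for a single example ray (again the a-file ray from a1), folds each occupancy subset without the per-direction mask trick, and accepts a multiplier if the top 6 bits of the 32-bit product only collide "constructively", i.e. when the two occupancies produce the same attack set. The helper names (fold32, northAttacks, onlyValidIntersections) and the plain CPU sweep are mine; finding the lowest fitting offset afterwards is a separate packing step not shown here.
Code: Select all
#include <cstdint>
#include <cstdio>

constexpr uint64_t rayMask = 0x0001010101010100ull;   // relevant squares a2..a7

static uint32_t fold32(uint64_t occ)
{
    return (uint32_t)(occ >> 32) ^ ((uint32_t)occ * 3);
}

// Attacks of a slider on a1 going north, given blockers on a2..a7.
static uint64_t northAttacks(uint64_t occ)
{
    uint64_t att = 0;
    for (int s = 8; s < 64; s += 8) {
        att |= 1ull << s;
        if (occ & (1ull << s)) break;       // stop at the first blocker
    }
    return att;
}

// A multiplier is valid if 6 bit indices only collide when the attack sets match.
static bool onlyValidIntersections(uint32_t mult)
{
    uint64_t table[64] = {};                // 0 = unused (an attack set here is never empty)
    uint64_t sub = 0;
    do {
        uint32_t idx = (fold32(sub) * mult) >> 26;
        uint64_t att = northAttacks(sub);
        if (table[idx] != 0 && table[idx] != att) return false;
        table[idx] = att;
        sub = (sub - rayMask) & rayMask;
    } while (sub);
    return true;
}

int main()
{
    // The full 2^32 sweep is the part that really wants a GPU; on a CPU it takes a while.
    uint64_t found = 0;
    for (uint64_t m = 1; m < (1ull << 32); ++m) {
        if (onlyValidIntersections((uint32_t)m)) {
            if (found < 5) std::printf("valid multiplier: 0x%08x\n", (uint32_t)m);
            ++found;
        }
    }
    std::printf("total valid multipliers for this ray: %llu\n", (unsigned long long)found);
}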
Having everything in 32 bit also reduces the per-square information by 8 bytes compared to 32-bit optimized black magic! The final algorithm looks like this:
Code: Select all
#include <cstdint>

struct Direction {
    uint64_t mask;      // per-direction mask OR-ed into the occupancy before folding
    uint32_t mult;      // 32 bit multiplier found by the exhaustive search
    uint32_t offset;    // position of this ray's entries in the shared attacks table
};

constexpr uint64_t attacks[5956] = ...
constexpr Direction masks[64][4] = ...

static constexpr uint64_t fold(uint64_t occ, Direction d)
{
    uint64_t o = occ | d.mask;
    // fold the 64 bit occupancy into 32 bits: high half XOR (low half * 3)
    uint32_t folded = (uint32_t)(o >> 32) ^ ((uint32_t)(o) * 3);
    // one 32 bit multiplication, take the top 6 bits, add the ray's offset
    return attacks[((folded * d.mult) >> 26) + d.offset];
}

uint64_t Queen(int sq, uint64_t occ)
{
    // union of the four ray lookups
    return fold(occ, masks[sq][0]) | fold(occ, masks[sq][1])
         | fold(occ, masks[sq][2]) | fold(occ, masks[sq][3]);
}
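Since the attacks and masks initializers are elided above, here is a small, self-contained reference I would use to validate the table-driven lookup once the tables are filled in: a plain ray-walk queen attack generator (QueenReference is my own name, not part of the release), which can be compared square by square against Queen().
Code: Select all
#include <cstdint>
#include <cstdio>

// Slow reference: walk the 8 queen directions square by square until a blocker
// (or the board edge) is reached.
static uint64_t QueenReference(int sq, uint64_t occ)
{
    static const int df[8] = { 1, -1, 0,  0, 1,  1, -1, -1 };
    static const int dr[8] = { 0,  0, 1, -1, 1, -1,  1, -1 };
    uint64_t attacks = 0;
    int f0 = sq & 7, r0 = sq >> 3;
    for (int d = 0; d < 8; ++d) {
        int f = f0 + df[d], r = r0 + dr[d];
        while (f >= 0 && f < 8 && r >= 0 && r < 8) {
            int s = r * 8 + f;
            attacks |= 1ull << s;
            if (occ & (1ull << s)) break;   // include the blocker square, then stop
            f += df[d]; r += dr[d];
        }
    }
    return attacks;
}

int main()
{
    // Example position: queen on d4 (square 27), blockers on d7 (51) and g4 (30).
    uint64_t occ = (1ull << 51) | (1ull << 30);
    std::printf("queen on d4: 0x%016llx\n", (unsigned long long)QueenReference(27, occ));
}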
https://github.com/Gigantua/Chess_Movegen
https://github.com/Gigantua/Chess_Movegen_GPU
I have also started to look at the more classical approach and have generated 2.1 TB of raw valid reduction numbers for all 64 squares for 12 bits and 9 bits (or, as you call them, 'magics').


Seeing how low the number of non-repeated hashes is in the 64-bit range for some squares, one can even derive a very narrow bound on the number of magics per square (very similar to estimating an animal population by tagging and releasing animals and looking at the ratio of already-tagged recaptures).
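For readers unfamiliar with the analogy, this is the classic Lincoln-Petersen capture-recapture estimate; a tiny sketch with made-up numbers (M, C and R are purely illustrative, not measured values from the 2.1 TB run):
Code: Select all
#include <cstdio>

// Lincoln-Petersen: if a first search pass "tags" M distinct magics, and a second
// pass finds C magics of which R were already tagged, the total population is
// roughly N = M * C / R.
int main()
{
    double M = 1200.0;   // distinct magics found in the first pass (example value)
    double C = 1000.0;   // magics found in the second pass (example value)
    double R = 150.0;    // of those, already tagged in the first pass (example value)
    std::printf("estimated number of magics for this square: %.0f\n", M * C / R);
}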
When looking at how many bits are non-variable, it becomes easier to condense and find solutions that need fewer than 88316 elements.
I also created a 'magic number' viewer (which I might release too) where you can see how the input list gets distributed into 12 bits. It's good for understanding what a magic number actually does: distribute the relevant occupancy into an index, ideally self-intersecting and intersecting with the existing list at the lowest possible maximum index.
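A text-only version of what such a viewer shows could look like the sketch below: it takes the classic rook-on-a1 relevant mask, enumerates all 4096 relevant occupancies, maps each one to a 12-bit index with (occ * magic) >> 52, and reports how many index slots are hit and the maximum index. The magic constant here is an arbitrary placeholder, not a real magic, so expect plenty of bad collisions; with a real magic these are exactly the numbers the viewer highlights.
Code: Select all
#include <cstdint>
#include <cstdio>

// Classic 12 relevant bits of a rook on a1: a2..a7 and b1..g1.
constexpr uint64_t relevant = 0x000101010101017Eull;

int main()
{
    // Placeholder constant, NOT a real magic; a real one would distribute the
    // occupancies with only constructive collisions.
    const uint64_t magic = 0x123456789ABCDEF1ull;
    bool used[4096] = {};
    unsigned maxIdx = 0, slots = 0;
    uint64_t occ = 0;
    do {                                        // Carry-Rippler over all 2^12 occupancies
        unsigned idx = (unsigned)((occ * magic) >> 52);
        if (!used[idx]) { used[idx] = true; ++slots; }
        if (idx > maxIdx) maxIdx = idx;
        occ = (occ - relevant) & relevant;
    } while (occ);
    std::printf("slots used: %u of 4096, maximum index: %u\n", slots, maxIdx);
}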
Oh, and CUDA is really the only (fast) way to do this. My 5950X is around 35-40x slower than the same code on the GPU for finding and generating random 64-bit magics.
TLDR:
FoldingHash will be released soon; it is a 48 kB, 32-bit hashing algorithm.
The main new idea is to condense the 64-bit occupancy into 32 bits and thus handle 4 rays with only 4x 32-bit multiplications. This would have been perfect 20 years ago (when 64-bit multiplication was expensive), and it looks very promising for the GPU nowadays.