Fancy magic - find me the code for this pattern.

Discussion of chess software programming and technical issues.

Moderator: Ras

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Fancy magic - find me the code for this pattern.

Post by dangi12012 »

I will pay 666 USD via paypal to someone who can find a simple function which is not a lookup table for this code. Valid until end of 1st jan 2024.
Lets say a maximum of 8 instructions for the body of this function:

We wont allow these instructions: MUL, DIV
uint64_t p = function(sq); //sq = 0..63

Code: Select all

uint64_t function(int sq){ 
     //Your code goes here. Optimally only 2-3 constants and a few shifts. 
}
The function should produce, 492583356993536, 246291678496768 etc..

Code: Select all

BEST OVERALL:
0000000000000001110000000000000010000000000001000001000000000000 -  p:492583356993536ull
0000000000000000111000000000000001000000000000100000100000000000 -  p:246291678496768ull
0000000000000000011100000000000000100000000000010000010000000000 -  p:123145839248384ull
0000000000000000001110000000000000010000000000001000001000000000 -  p:61572919624192ull
0000000000000000000111000000000000001000000000000100000100000000 -  p:30786459812096ull
0000000000000000000011100000000000000100000000000010000010000000 -  p:15393229906048ull
0000000000000000000001110000000000000010000000000001000001000000 -  p:7696614953024ull
0000000000000000000000111000000000000001000000000000100000100000 -  p:3848307476512ull
0000000000000000000000011100000000000000100000000000010000000000 -  p:1924153738240ull
0000000000000000000000001110000000000000010000000000001000000000 -  p:962076869120ull
0000000000000000000000000111000000000000001000000000000100000000 -  p:481038434560ull
0000000000000000000000000011100000000000000100000000000010000000 -  p:240519217280ull
0000000000000000000000000001110000000000000010000000000001000000 -  p:120259608640ull
0000000000000000000000000000111000000000000001000000000000100000 -  p:60129804320ull
0000000000000000000000000000011100000000000000100000000000010000 -  p:30064902160ull
0000000000000000000000000000001110000000000000010000000000001000 -  p:15032451080ull
0000000000000100000000000000000111000000000000001000000000000000 -  p:1125907423068160ull
0000000000000010000000000000000011100000000000000100000000000000 -  p:562953711534080ull
0000000000000001000000000000000001110000000000000010000000000000 -  p:281476855767040ull
0000000000000000100000000000000000111000000000000001000000000000 -  p:140738427883520ull
0000000000000000010000000000000000011100000000000000100000000000 -  p:70369213941760ull
0000000000000000001000000000000000001110000000000000010000000000 -  p:35184606970880ull
0000000000000000000100000000000000000111000000000000001000000000 -  p:17592303485440ull
0000000000000000000010000000000000000011100000000000000100000000 -  p:8796151742720ull
0000000000001000000001000000000000000001100000000000001000000000 -  p:2256197885362688ull
0000000000000100000000100000000000000000110000000000000100000000 -  p:1128098942681344ull
0000000000000010000000010000000000000000011000000000000010000000 -  p:564049471340672ull
0000000000000001000000001000000000000000001100000000000001000000 -  p:282024735670336ull
0000000000000000100000000100000000000000000110000000000000100000 -  p:141012367835168ull
0000000000000000010000000010000000000000000011000000000000010000 -  p:70506183917584ull
0000000000000000001000000001000000000000000001100000000000001000 -  p:35253091958792ull
0000000000000000000100000000100000000000000000110000000000000100 -  p:17626545979396ull
0000000000000010000010000000010000000000000000011000000000000000 -  p:571763226411008ull
0000000000000001000001000000001000000000000000001100000000000000 -  p:285881613205504ull
0000000000000000100000100000000100000000000000000110000000000000 -  p:142940806602752ull
0000000000000000010000010000000010000000000000000011000000000000 -  p:71470403301376ull
0000000000000000001000001000000001000000000000000001100000000000 -  p:35735201650688ull
0000000000000000000100000100000000100000000000000000110000000000 -  p:17867600825344ull
0000000000000000000010000010000000010000000000000000011000000000 -  p:8933800412672ull
0000000000000000000001000001000000001000000000000000001100000000 -  p:4466900206336ull
0000000000010000000000100000100000000100000000000000000100000000 -  p:4505833077473536ull
0000000000001000000000010000010000000010000000000000000010000000 -  p:2252916538736768ull
0000000000000100000000001000001000000001000000000000000001000000 -  p:1126458269368384ull
0000000000000010000000000100000100000000100000000000000000100000 -  p:563229134684192ull
0000000000000001000000000010000010000000010000000000000000010000 -  p:281614567342096ull
0000000000000000100000000001000001000000001000000000000000001000 -  p:140807283671048ull
0000000000000000010000000000100000100000000100000000000000000100 -  p:70403641835524ull
0000000000000000001000000000010000010000000010000000000000000010 -  p:35201820917762ull
0000000000000001000100000000001000001000000001000000000000000000 -  p:299075887169536ull
0000000000000000100010000000000100000100000000100000000000000000 -  p:149537943584768ull
0000000000000000010001000000000010000010000000010000000000000000 -  p:74768971792384ull
0000000000000000001000100000000001000001000000001000000000000000 -  p:37384485896192ull
0000000000000000000100010000000000100000100000000100000000000000 -  p:18692242948096ull
0000000000000000000010001000000000010000010000000010000000000000 -  p:9346121474048ull
0000000000000000000001000100000000001000001000000001000000000000 -  p:4673060737024ull
0000000000000000000000100010000000000100000100000000100000000000 -  p:2336530368512ull
0000000000100000000000010001000000000010000010000000010000000000 -  p:9008367519925248ull
0000000000010000000000001000100000000001000001000000001000000000 -  p:4504183759962624ull
0000000000001000000000000100010000000000100000100000000100000000 -  p:2252091879981312ull
0000000000000100000000000010001000000000010000010000000010000000 -  p:1126045939990656ull
0000000000000010000000000001000100000000001000001000000001000000 -  p:563022969995328ull
0000000000000001000000000000100010000000000100000100000000100000 -  p:281511484997664ull
0000000000000000100000000000010001000000000010000010000000010000 -  p:140755742498832ull
0000000000000000010000000000001000100000000001000001000000001000 -  p:70377871249416ull
You can also suggest a different scoring function for the bitpattern above and I will send you 8 valid pattern lists.
Now this is the score function:

Code: Select all

static int scoreOf(uint64_t sol, uint64_t last_sol)
{
    int match = std::popcount(~(sol ^ last_sol));
    int match_left = std::popcount(~(sol ^ (last_sol << 1)));
    int match_right = 2 * std::popcount(~(sol ^ (last_sol >> 1)));
    return match + match_left + match_right;
}
The money goes to whoever solves the code for above pattern for all 64 squares. Or suggests and solved for a different scoring function.
This would produce a fancy magic sliding move generator where we dont need to lookup the magic but can calculate it quicker than the lookup.

Pure bitpattern list:

Code: Select all

_______________xxx______________x____________x_____x____________
________________xxx______________x____________x_____x___________
_________________xxx______________x____________x_____x__________
__________________xxx______________x____________x_____x_________
___________________xxx______________x____________x_____x________
____________________xxx______________x____________x_____x_______
_____________________xxx______________x____________x_____x______
______________________xxx______________x____________x_____x_____
_______________________xxx______________x____________x__________
________________________xxx______________x____________x_________
_________________________xxx______________x____________x________
__________________________xxx______________x____________x_______
___________________________xxx______________x____________x______
____________________________xxx______________x____________x_____
_____________________________xxx______________x____________x____
______________________________xxx______________x____________x___
_____________x_________________xxx______________x_______________
______________x_________________xxx______________x______________
_______________x_________________xxx______________x_____________
________________x_________________xxx______________x____________
_________________x_________________xxx______________x___________
__________________x_________________xxx______________x__________
___________________x_________________xxx______________x_________
____________________x_________________xxx______________x________
____________x________x_________________xx_____________x_________
_____________x________x_________________xx_____________x________
______________x________x_________________xx_____________x_______
_______________x________x_________________xx_____________x______
________________x________x_________________xx_____________x_____
_________________x________x_________________xx_____________x____
__________________x________x_________________xx_____________x___
___________________x________x_________________xx_____________x__
______________x_____x________x_________________xx_______________
_______________x_____x________x_________________xx______________
________________x_____x________x_________________xx_____________
_________________x_____x________x_________________xx____________
__________________x_____x________x_________________xx___________
___________________x_____x________x_________________xx__________
____________________x_____x________x_________________xx_________
_____________________x_____x________x_________________xx________
___________x__________x_____x________x_________________x________
____________x__________x_____x________x_________________x_______
_____________x__________x_____x________x_________________x______
______________x__________x_____x________x_________________x_____
_______________x__________x_____x________x_________________x____
________________x__________x_____x________x_________________x___
_________________x__________x_____x________x_________________x__
__________________x__________x_____x________x_________________x_
_______________x___x__________x_____x________x__________________
________________x___x__________x_____x________x_________________
_________________x___x__________x_____x________x________________
__________________x___x__________x_____x________x_______________
___________________x___x__________x_____x________x______________
____________________x___x__________x_____x________x_____________
_____________________x___x__________x_____x________x____________
______________________x___x__________x_____x________x___________
__________x____________x___x__________x_____x________x__________
___________x____________x___x__________x_____x________x_________
____________x____________x___x__________x_____x________x________
_____________x____________x___x__________x_____x________x_______
______________x____________x___x__________x_____x________x______
_______________x____________x___x__________x_____x________x_____
________________x____________x___x__________x_____x________x____
_________________x____________x___x__________x_____x________x___
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Fancy magic - find me the code for this pattern.

Post by dangi12012 »

For example I propose this scoring function where we put a penalty of glued bits together:

Code: Select all

static uint64_t longestConsecutiveOnes(uint64_t n) {
    int count = 0;
    while (n != 0) {
        // Using bitwise AND operator with n and n shifted 1 bit to the right
        n = (n & (n << 1));
        count++;
    }
    return count;
}

static int scoreOf(uint64_t sol, uint64_t last_sol)
{
    int match = std::popcount(~(sol ^ last_sol));
    int match_left = std::popcount(~(sol ^ (last_sol << 1)));
    int match_right = 2 * std::popcount(~(sol ^ (last_sol >> 1)));
    int cnt = longestConsecutiveOnes(sol);

    int score = match + match_left + match_right;
    return cnt != 1 ? score/2 : score;
}
Yields this (1 = █)

Code: Select all

00000000000000█0█00000000000000█0█0█0000000000000█00000000000000 -  p:703693078937600ull
000000000000000█0█00000000000000█0█0█0000000000000█0000000000000 -  p:35█846539468800ull
0000000000000000█0█00000000000000█0█0█0000000000000█000000000000 -  p:█75923269734400ull
00000000000000000█0█00000000000000█0█0█0000000000000█00000000000 -  p:8796█634867200ull
000000000000000000█0█00000000000000█0█0█0000000000000█0000000000 -  p:439808█7433600ull
0000000000000000000█0█00000000000000█0█0█0000000000000█000000000 -  p:2█9904087█6800ull
00000000000000000000█0█00000000000000█0█0█0000000000000█00000000 -  p:█0995204358400ull
000000000000000000000█0█00000000000000█0█0█0000000000000█0000000 -  p:5497602█79200ull
0000000000000000000000█0█00000000000000█0█00000000000█0000000000 -  p:274880004█984ull
00000000000000000000000█0█00000000000000█0█00000000000█000000000 -  p:█374400020992ull
000000000000000000000000█0█00000000000000█0█00000000000█00000000 -  p:6872000█0496ull
0000000000000000000000000█0█00000000000000█0█00000000000█0000000 -  p:343600005248ull
00000000000000000000000000█0█00000000000000█0█00000000000█000000 -  p:█7█800002624ull
000000000000000000000000000█0█00000000000000█0█00000000000█00000 -  p:8590000█3█2ull
0000000000000000000000000000█0█00000000000000█0█00000000000█0000 -  p:42950000656ull
00000000000000000000000000000█0█00000000000000█0█00000000000█000 -  p:2█475000328ull
0000000000000█0000000000000000█0█00000000000000█0█00000000000000 -  p:██259█0644342784ull
00000000000000█0000000000000000█0█00000000000000█0█0000000000000 -  p:562955322█7█392ull
000000000000000█0000000000000000█0█00000000000000█0█000000000000 -  p:28█47766█085696ull
0000000000000000█0000000000000000█0█00000000000000█0█00000000000 -  p:█40738830542848ull
00000000000000000█0000000000000000█0█00000000000000█0█0000000000 -  p:703694█527█424ull
000000000000000000█0000000000000000█0█00000000000000█0█000000000 -  p:35█847076357█2ull
0000000000000000000█0000000000000000█0█00000000000000█0█00000000 -  p:█75923538█7856ull
00000000000000000000█0000000000000000█0█00000000000000█0█0000000 -  p:8796█76908928ull
000000000000█00000000█0000000000000000█0█00000000000000█00000000 -  p:2256█97902█39648ull
0000000000000█00000000█0000000000000000█0█00000000000000█0000000 -  p:██2809895█069824ull
00000000000000█00000000█0000000000000000█0█00000000000000█000000 -  p:5640494755349█2ull
000000000000000█00000000█0000000000000000█0█00000000000000█00000 -  p:282024737767456ull
0000000000000000█00000000█0000000000000000█0█00000000000000█0000 -  p:█4█0█2368883728ull
00000000000000000█00000000█0000000000000000█0█00000000000000█000 -  p:70506█8444█864ull
000000000000000000█00000000█0000000000000000█0█00000000000000█00 -  p:35253092220932ull
0000000000000000000█00000000█0000000000000000█0█00000000000000█0 -  p:█7626546██0466ull
000000000000000█0000█00000000█0000000000000000█0█000000000000000 -  p:290288249765888ull
0000000000000000█0000█00000000█0000000000000000█0█00000000000000 -  p:█45█44█24882944ull
00000000000000000█0000█00000000█0000000000000000█0█0000000000000 -  p:7257206244█472ull
000000000000000000█0000█00000000█0000000000000000█0█000000000000 -  p:3628603█220736ull
0000000000000000000█0000█00000000█0000000000000000█0█00000000000 -  p:█8█430█56█0368ull
00000000000000000000█0000█00000000█0000000000000000█0█0000000000 -  p:907█507805█84ull
000000000000000000000█0000█00000000█0000000000000000█0█000000000 -  p:4535753902592ull
0000000000000000000000█0000█00000000█0000000000000000█0█00000000 -  p:226787695█296ull
00000000000█00000000000█0000█00000000█0000000000000000█000000000 -  p:45047335658460█6ull
000000000000█00000000000█0000█00000000█0000000000000000█00000000 -  p:2252366782923008ull
0000000000000█00000000000█0000█00000000█0000000000000000█0000000 -  p:██26█8339█46█504ull
00000000000000█00000000000█0000█00000000█0000000000000000█000000 -  p:56309█695730752ull
000000000000000█00000000000█0000█00000000█0000000000000000█00000 -  p:28█545847865376ull
0000000000000000█00000000000█0000█00000000█0000000000000000█0000 -  p:█40772923932688ull
00000000000000000█00000000000█0000█00000000█0000000000000000█000 -  p:7038646█966344ull
000000000000000000█00000000000█0000█00000000█0000000000000000█00 -  p:35█93230983█72ull
00000000000000█0000█00000000000█0000█00000000█000000000000000000 -  p:5805465689█2896ull
000000000000000█0000█00000000000█0000█00000000█00000000000000000 -  p:290273284456448ull
0000000000000000█0000█00000000000█0000█00000000█0000000000000000 -  p:█45█36642228224ull
00000000000000000█0000█00000000000█0000█00000000█000000000000000 -  p:7256832███4██2ull
000000000000000000█0000█00000000000█0000█00000000█00000000000000 -  p:36284█60557056ull
0000000000000000000█0000█00000000000█0000█00000000█0000000000000 -  p:█8█42080278528ull
00000000000000000000█0000█00000000000█0000█00000000█000000000000 -  p:907█040█39264ull
000000000000000000000█0000█00000000000█0000█00000000█00000000000 -  p:4535520069632ull
0000000000█00000000000█0000█00000000000█0000█00000000█0000000000 -  p:90094670█4775808ull
00000000000█00000000000█0000█00000000000█0000█00000000█000000000 -  p:4504733507387904ull
000000000000█00000000000█0000█00000000000█0000█00000000█00000000 -  p:2252366753693952ull
0000000000000█00000000000█0000█00000000000█0000█00000000█0000000 -  p:██26█83376846976ull
00000000000000█00000000000█0000█00000000000█0000█00000000█000000 -  p:56309█688423488ull
000000000000000█00000000000█0000█00000000000█0000█00000000█00000 -  p:28█5458442██744ull
0000000000000000█00000000000█0000█00000000000█0000█00000000█0000 -  p:█40772922█05872ull
00000000000000000█00000000000█0000█00000000000█0000█00000000█000 -  p:7038646█052936ull
Can I solve it? No unfortunately - I already tried for so long and it would make me very happy if someone can articulate this or any other possible pattern into a short C++ snippet that only depends on the index in the list and is not an array lookup.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Fancy magic - find me the code for this pattern.

Post by dangi12012 »

We have a candidate by Twipply from Discord:

Code: Select all

[[nodiscard]] auto lines4(const std::size_t idx) -> std::uint64_t {
    return ((idx <  8) << (12 - idx)) |
           ((idx < 16) << (18 - idx)) |
           ((idx > 23 & idx < 32) << (33 - idx)) |
           ((std::uint64_t)(idx < 24) << (46 - idx)) |
           ((std::uint64_t)(idx < 24) << (31 - idx)) |
           ((std::uint64_t)(idx < 40) << (47 - idx)) |
           ((std::uint64_t)(idx < 48) << (48 - idx)) |
           ((std::uint64_t)(idx > 15) << (66 - idx)) |
           ((std::uint64_t)(idx > 23) << (75 - idx)) |
           ((std::uint64_t)(idx > 31) << (81 - idx)) |
           ((std::uint64_t)(idx > 39) << (92 - idx)) |
           ((std::uint64_t)(idx > 47) << (96 - idx)) |
           ((std::uint64_t)(idx > 55) << (109 - idx));
}
Albeit above 8 instruction limit.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Fancy magic - find me the code for this pattern.

Post by dangi12012 »

We have a cnadidate from GreatGodofFire:

Code: Select all

#include <cstdint>
#include <iostream>

typedef unsigned __int128 uint128_t;

const uint128_t MAGICS = ((uint128_t) 0b0000000000000000001000000000000100010000000000100000100000000100 << 64) | 0b0000000000000001110000000000000010000000000001000001000000000000ULL; 
const uint64_t MASK = (1ULL << 48) - 1;

uint64_t function(uint8_t sq) { 
  uint128_t magics = MAGICS;

  if (sq >= 24) {
    magics ^= 0b10000000000001010000000000000000000000000000000ULL;
  }

  uint8_t file = sq & 0b111;
  uint8_t rank = sq & ~0b111;
  
  return (((((uint64_t) (magics >> rank) & ~0xff) >> 8)) & MASK) << (8 - file);
}
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Fancy magic - find me the code for this pattern.

Post by dangi12012 »

That was faster than expected and I can announce the winner! - GreatGodOfFire

Here is a repo of all code produced for this award.
https://1drv.ms/u/s!AoDIyNlDG2QTg85CCsl ... Q?e=LK2A3b

And here is the explanation by the author: (until the ------ line than its back to myself)
First I noticed that the magics can be split up into blocks of 8 which are all just shifted version of each other.
So they can be represented with just one magic which gets shifted accordingly:

Code: Select all

uint64_t magic[8] {
  0b0000000000000001110000000000000010000000000001000001000000000000,
  0b0000000000000000000000011100000000000000100000000000010000000000,
  0b0000000000000100000000000000000111000000000000001000000000000000,
  0b0000000000001000000001000000000000000001100000000000001000000000,
  0b0000000000000010000010000000010000000000000000011000000000000000,
  0b0000000000010000000000100000100000000100000000000000000100000000,
  0b0000000000000001000100000000001000001000000001000000000000000000,
  0b0000000000100000000000010001000000000010000010000000010000000000,
};

uint64_t function(int sq) { 
  return magic[sq >> 3] >> (sq & 0b111);
}
But these are similar to the previous magic shifted right by 8:

Code: Select all

uint64_t magic[8] {
                                                          0b000000000000000000000001110000000000000010000000000001000001000000000000,
                                                  0b000000000000000000000000000000011100000000000000100000000000010000000000,
                                          0b000000000000000000000100000000000000000111000000000000001000000000000000,
                                  0b000000000000000000001000000001000000000000000001100000000000001000000000,
                          0b000000000000000000000010000010000000010000000000000000011000000000000000,
                  0b000000000000000000010000000000100000100000000100000000000000000100000000,
          0b000000000000000000000001000100000000001000001000000001000000000000000000,
  0b000000000000000000100000000000010001000000000010000010000000010000000000,
};
So you can pack them all into a 128 bit integer:

Code: Select all

const uint128_t MAGICS = ((uint128_t) 0b0000000000000000001000000000000100010000000000100000100000000100 << 64) | 0b0000000000000001110000000000000010000000000001000001000000000000ULL; 
But starting with the 4th index (starting at 1) there are 2 bits which changed. That is what the

Code: Select all

if (sq >= 24) {
  magics ^= 0b10000000000001010000000000000000000000000000000ULL;
}
----------------------------------------------------------------------------
So we end up with this:

Code: Select all

#include <cstdint>
#include <iostream>

typedef unsigned __int128 uint128_t;

const uint128_t MAGICS = ((uint128_t) 0b0000000000000000001000000000000100010000000000100000100000000100 << 64) | 0b0000000000000001110000000000000010000000000001000001000000000000ULL; 
const uint64_t MASK = (1ULL << 48) - 1;

uint64_t function(uint8_t sq) { 
  uint128_t magics = MAGICS;

  if (sq >= 24) {
    magics ^= 0b10000000000001010000000000000000000000000000000ULL;
  }

  uint8_t file = sq & 0b111;
  uint8_t rank = sq & ~0b111;
  
  return (((((uint64_t) (magics >> rank) & ~0xff) >> 8)) & MASK) << (8 - file);
}
But the 8 instructions and more importantly seemingly generic solution was mentioned at first so I think this looks very compelling and produces good ASM compared to uint128_t.

Code: Select all

constexpr uint64_t magic[8] {
  0b0000000000000001110000000000000010000000000001000001000000000000,
  0b0000000000000000000000011100000000000000100000000000010000000000,
  0b0000000000000100000000000000000111000000000000001000000000000000,
  0b0000000000001000000001000000000000000001100000000000001000000000,
  0b0000000000000010000010000000010000000000000000011000000000000000,
  0b0000000000010000000000100000100000000100000000000000000100000000,
  0b0000000000000001000100000000001000001000000001000000000000000000,
  0b0000000000100000000000010001000000000010000010000000010000000000,
};

uint64_t rook_magic(int sq) { 
  return magic[sq >> 3] >> (sq & 0b111);
}
//
rook_magic(int):                        # @rook_magic(int)
        mov     eax, edi
        sar     eax, 3
        cdqe
        lea     rcx, [rip + magic]
        and     dil, 7
        shrx    rax, qword ptr [rcx + 8*rax], rdi
        ret
This already gives rise to some ideas of having a persistent ZMM AVX512 register around for magics since 8x64bit fits into 1 ZMM perfectly.
https://godbolt.org/z/YYxafhG41
Other code in 666.cpp also has nice ideas. I will consolidate this into a fancy magic movegenerator without any lookups - correlationmagic.

Since research has shown and proven now today 25.06.2023 that the magic numbers can correlate so much that we can find 1) denser forms of 64 slot lookup or alternatively 2) produce efficient code to generate them only depending on the square.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Fancy magic - find me the code for this pattern.

Post by dangi12012 »

I am very close to a sensible solution now.

Fancy magic where we can skip magic, mask and offset lookup (with bigger lookup size) seems to be possible too - but first we just correlate the magics as promised.

Epiphany I had this week:
Since we correlated all 64 magics and know the specific bitpattern we can use a similar decomposition for multiplication:

Code: Select all

uint64_t mul(uint64_t y) {
    uint64_t result = (y << 3) + (y << 16) + (y << 44) + (y << 60) + (y << 61);
    return result;
}
Which will be quite a speedup on some machines. (Not on x86-64 tho since clang even reverses this because its a known "non optimisation" and clang even reverses it when applicable which is quite cool: https://godbolt.org/z/8eoYn41Gh)


So we have 3 ideas on the table to be explored.
1) Write and finish a simple "fancy correlation magic" movegen removing mask from the square specific struct
2) Write and finish a "fancy correlation magic" movegen removing offsets and mask (fixed 4096*square offset, fast per square mask gen is already known)
3) Explore the possibility to not shift constants but its spectral decomposition like above where we increment the individual shift amounts.
Only simple if we constrain the magics to popcount 6. (exhaustive search shows that popcount 5 magics do not exist for all squares)

For point 3 we could end up with a fancy correlation magic that looks similar to this:

Code: Select all

Rook(int sq, uint64_t occ){
uint64_t mask = (HO(sq) | VE(sq)) | occ; //"fancy" trick where HO and VE are inversed 
return (RookAttacks + (sq << 12)) //Fixed 4096 offsets
     [
          (mask << (3-sq)) + 
          (mask << (16-sq)) + 
          (mask << (44-sq)) + 
          (mask << (60-sq)) + 
          (mask << (61-sq))  //Instead of IMUL64
     ];
}

Of course shifts by negative amounts would have to be circumvented. But it looks like a removal of one array lookup and IMUL64 is on the table with the possibility of having both implementations (the one with IMUL64) accessing the same table with no downside whatsoever.

I will do and publish number 1) since thats what I paid good money for. The others I will put on the side due to time reasons.

If anyone wants do more research on movegen I am quite happy with the current state. KGSSB, Galoisfield (not yet public) with AVX2 and AVX512 codepaths, and others are already leaps and bounds above what was possible a few years ago performance wise on old machines as well as new ones.

The spectral decomposition is also interesting since IMUL64 is one of the few exceptions that does not exist in AVX at all. So opening the possibility to calculate multiple Pieces at once with 1 function call and the result bitboard is not 1x uint64_t but 4x uint64_t of 2 different rooks or 1 queen or a mix as done in galoisfield movegen currently.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Fancy magic - find me the code for this pattern.

Post by dangi12012 »

Rook magics:

Code: Select all

correlation: 3957/4096
                                                                .....█..█..........█......█.....█................█.............. - 6
                                                               ......█..█..........█......█.....█................█............. - 6
                                                              .......█.............█......█.....█................█.....█...... - 6
                                                             ........█.............█......█.....█................█.....█..... - 6
                                                            .........█.............█......█.....█................█.....█.... - 6
                                                           .........█..............█......█.....█................█.....█... - 6
                                                          ..........█..............█......█.....█...........█....█.....█.. - 7
                                                         .......█..................█......█.....█...........█....█.....█. - 7
                                                        .................█.........█......█.....█................█.....█ - 6
                                                       ..................█.........█......█.....█................█..... - 5
                                                      ...................█.........█......█.....█................█.... - 5
                                                     ....................█.........█......█.....█................█... - 5
                                                    ...................█.█.........█......█.....█................█.. - 6
                                                   ....................█.█.........█......█.....█............█..... - 6
                                                  ....................█..█.........█......█.....█............█.... - 6
                                                 ..................█.....█................█.....█............█... - 5
                                                ...................█.....█................█..............█...█.. - 5
                                               ....................█.....█................█..............█...█. - 5
                                              .....................█.....█................█..............█...█ - 5
                                             ......................█.....█................█..............█... - 4
                                            .......................█.....█................█..............█.. - 4
                                           ........................█.....█................█..............█. - 4
                                          .........................█.....█................█..............█ - 4
                                         ..........................█.....█................█.....█........ - 4
                                        .................█.........█.....█................█............. - 4
                                       ..................█.........█.....█................█............ - 4
                                      ...................█.........█.....█................█........... - 4
                                     ....................█.........█.....█................█.......... - 4
                                    .....................█.........█.....█................█......... - 4
                                   ......................█.........█.....█................█........ - 4
                                  .......................█.........█.....█................█....... - 4
                                 ...............█..................█.....█................█...... - 4
                                ................█..................█......█...............█..... - 4
                               .................█..................█......█...............█.... - 4
                              ..................█..................█......█...............█... - 4
                             ...................█..................█......█...............█.. - 4
                            ....................█..................█......█...............█. - 4
                           .....................█..................█......█...............█ - 4
                          ......................█...............█..█......█......█........ - 5
                         .......................█...............█..██.....█......█....... - 6
                        ..................█.....█...............█..██.....█............. - 6
                       ...................█.....█..................██.....█............ - 5
                      ....................█.....█..................█....█.█........... - 5
                     .....................█.....█..................█....█.█.█........ - 6
                    ......................█.....█..................█....█.█.█....... - 6
                   .......................█.....█..................█......█........ - 4
                  ........................█.....█..................█......█......█ - 5
                 .........................█.....█..................█.....█....... - 4
                .................█........█.....█.................██.....█...... - 6
               ..................█..............█.................██.....█..... - 5
              ...................█..............█.................██.....█.... - 5
             ....................█..............█..................█.....█... - 4
            .....................█..............█..................█.....█.. - 4
           ......................█..............█..................█.....█. - 4
          .......................█..............█...............█..█.....█ - 5
         ........................█..............█.█.............█..█..... - 5
        ..................█......█.....█........█...............█..█...█ - 7
       ...................█......█.....█........█...............█..█..█ - 7
      ....................█......█.....█........█...............█..█.█ - 7
     .....................█......█.....█........█..........█....█..█. - 7
    ......................█......█.....█........█...............█..█ - 6
   .......................██.....█.....█........█...............█.█ - 7
  .........................█.....█.....█........█.█...█.........█. - 7
 ..........................█.....█......█.......█.█...█........█. - 7
Hmm the font does not do it justice - in the console they are beautiful lines.
Maybe we split indices into blocks of 8 for higher correlation. Winner on page 1 showed how we condense into uint128_t or my idea into a single ZMM register anyways.

Code: Select all

       .█.....█...........█......█......█.....█........█............... - 7
      ..█.....█...........█......█......█.....█........█.............. - 7
     ...█.....█...........█......█......█..............█............. - 6
    ....█.....█...........█......█......█..............█............ - 6
   .....█.....█...........█......█......█..............█........... - 6
  ......█.....█...........█......█......█..............█.......... - 6
 .......█.....█...........█......█......█..............█......... - 6
........█.................█......█......█..............█........ - 5
Thats out of all magic number with popcount 7 only:
That is an exhaustive search of ALL uint64_t with popcount = 7 that are valid rook magics.

Code: Select all

       .█.....█...........█......█......█.....█........█............... - 7
      ..█.....█...........█......█......█.....█........█.............. - 7
     █..█.....█...........█......█......█..............█............. - 7
    .█..█.....█...........█......█......█..............█............ - 7
   ..█..█.....█...........█......█......█..............█........... - 7
  ...█..█.....█...........█......█......█..............█.......... - 7
 ....█..█.....█...........█......█......█..............█......... - 7
█....█..█.................█......█......█..............█........ - 7
Exploring popcount 8 next but that is already over 35GB of binary uint64_t's in a single file and even with optimized code this takes a bit to explore.
The search for a perfect magic is ellusive and I want to avoid it and not fall into it like in 2019. Addictive. Dangerous.

If you are interested int the math: NCR(64, 8) = 4426165368 uint64_t's. Storing that yields 35.4GB.
Each of those numbers needs to be expanded into a 4096 slot buffer with self intersection optimisation - where we test if its a valid magic number. Meaning that one way or another we have to touch 145036GB of memory (4kb fits in L1 cache so we are still fast).

Then we sort those numbers by any heuristic - I have found that "64 - popcount(last ^ (current << 1))" is the best by far. It seems to be the natural way magic bits want to align.
Always there seems to be 1-2 bits off a perfect magic, inviting to research more numbers but the bigger batch is always also so very temptingly short of perfection.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Fancy magic - find me the code for this pattern.

Post by dangi12012 »

Explored ALL magic numbers that exist with a popcount of 8 and below. all_candidates.dat is 41.045.276.480 bytes (41GB).
Yielding 15.727.838.576 (14.6gb) of valid magic numbers for all 64 rook squares.

We can already build correlation numbers if we limit ourselfes to 4 or below. That would compress 64magics down to a lookup buffer of 16magics. Totally already possible - but I think 1 step more is doable for me on one machine.

My correlation function is: next = prev >> 1 (which is the best one by a factor of 10)
So the goal is to start with a magic and have the minimum number of bits deviate from the pattern of the correlation function.

I thought to heck with it I dont want any single bit to deviate and it looks possible. Just need to expand further.

This is where we have 8 successive magics always correlate perfectly. For example out of 169268 in the first square slot only 153745 are a perfect match of next = prev >> 1.

To solve this quite quickly I had to make my program 100x faster 2 times in a row (which is totally needed because I have to go to 9 bits = 3E10 candidates next)

The first trick is to not use random sampling via rnd() & rnd() & rnd() but to specifically and rigurously just enumerate all N set bits / 64 from 1 to 8.
This already yields magics much much faster and we dont get any duplicates. (In 2019 I had random sampling into a hashset)

The second trick is to use memory mapped files and the gpu. I already have very optimized C++ but even on my 16 core machines below list would take more than 1 days to produce. So another 100x speedup can be gotten from (paywall information here). Just kidding. memory mapped file + the thrust library with insanely optimized gpu reduction algorithms. Outcompeting any and all workstation cpus with a midrange gpu. The killer is thrust::set_intersection which tries all paths of all magics in one call per square. And ultimately is what produces below list another 100x faster.

So with a performance uplift of 100x100 compared to a naive magic sifter we get this list:

Code: Select all

0 - 169268
1 - 153745
2 - 27680
3 - 5993
4 - 905
5 - 45
6 - 2
7 - 0
8 - 7908833
9 - 6557369
10 - 1301001
11 - 391325
12 - 129917
13 - 42878
14 - 18528
15 - 2774
16 - 7135868
17 - 6060323
18 - 1309438
19 - 378736
20 - 125001
21 - 44374
22 - 20000
23 - 807
24 - 7635451
25 - 6467147
26 - 1261691
27 - 372847
28 - 132039
29 - 47078
30 - 23627
31 - 2640
32 - 7213943
33 - 6150625
34 - 1354437
35 - 393023
36 - 136874
37 - 46141
38 - 17381
39 - 513
40 - 9446858
41 - 7939524
42 - 1669303
43 - 493934
44 - 165091
45 - 63663
46 - 34396
47 - 3375
48 - 13418194
49 - 12402899
50 - 2899303
51 - 844381
52 - 231706
53 - 74164
54 - 32173
55 - 2359
56 - 158036
57 - 144382
58 - 97608
59 - 68876
60 - 13916
61 - 157
62 - 13
63 - 0
0 means that no path exists to form a perfect magic from the previous 1 8x in a row - but it looks probable that popcount 9 can yield results.
I am quite happy here since I am not doing random sampling with rnd() & rnd() & rnd() but actually permuting all bits with a specific popcount so no possible solution is untried!

I think I can live with a few bits that only get turned on for some squares after the following.
Doing the math: For 9 bits = ncr(64, 9) I will get 27540584512 candidates which needs 220GB of space and will probably yield around 75GB of magic numbers for all squares.
To test a candidate we take a uint64_t and run fancy magic against all possible relevant occupancies - logging self intersection count and discarding if its not a magic number (meaning in a buffer of 4096 it does produce wrong overlaps).
So around 300 GB of disk space and around 10hrs of optimized compute.

Sounds doable and if 7 - and 63 - do not produce 0s anymore that means the "Find me the code for this pattern" was obsoleted by "I tried so many candidates that the pattern next = prev >> 1. is 100% correct for buckets of 8.

End result will be a correlation magic movegen with a bucket of 8 numbers instead of 64 - and a big old (binary compatible to normal fancy magic) lookup table. Also quite happy with the outcome of an actual award since that is also motivating for myself.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Fancy magic - find me the code for this pattern.

Post by dangi12012 »

Spectral decomposition can be exploited if we know that all magics have a specific popcount.
I can already pre share what that means exactly by showing the results for the bishop.

Here a blocksize of 8 is already possible with a popcount of 6 for all magics.

Let me show how the bishop is calculated in correlationmagic:

Code: Select all

static constexpr uint64_t bishop_magics[8] = {
     35188734169472ull,
     137455993728ull,
     35185450041728ull,
     17695399512064ull,
     17764118986880ull,
     17661242130432ull,
     17661175009536ull,
     35322350019072ull};

extern uint64_t bishop_masks[64];
extern uint64_t* attack_table;

uint64_t bishop(int sq, uint64_t occ) {
    uint64_t masked_occ = occ | bishop_masks[sq];
    uint64_t* attacks = attack_table + (sq << 12);
    
    uint64_t magic = bishop_magics[sq >> 3] >> (sq & 7);
    uint64_t index = (masked_occ * magic) >> 52;

    return attack_table[index];
}
All the 64 constants that we normally need are condensed into 8 numbers with popcount == 6 each.
To get the magic we shift by sq & 7 - and that is all. The masks can be calculated too and the offset is known (here I picked 4096 which is too much but point is it can be known as a fixed constant). If we find magics such that 16 continuous elements exist or only a few bits differ we can reduce to 4 slots or maybe even 1 magic number finally. (With which we can calculate all 64 magics inline and quickly so that we never need to look them up)

The spectral magic lookup looks also interesting:
We take above 8 correlated magics and the bit indices of it and store it. Instead of arrays we can store them in ZMM or some uint64_t's. We just need 6*6*8 bits for all magic numbers.

If we find bigger buckets than 8 we would need only 4 magic numbers and so on.
But here we have 8 magics with a popcount of 6 so we store the bit indices:

Code: Select all

static constexpr uint8_t bishop_spectralmagic[][6] = {
{ 7, 8, 12, 26, 32, 45 },
{ 7, 8, 9, 18, 24, 37 },
{ 7, 8, 14, 22, 30, 45 },
{ 11, 15, 27, 35, 36, 44 },
{ 7, 15, 27, 35, 37, 44 },
{ 14, 20, 26, 28, 36, 44 },
{ 8, 12, 20, 28, 36, 44 },
{ 9, 13, 21, 29, 37, 45 }};

uint64_t spectral_bishop(int sq, uint64_t occ) {
    uint64_t masked_occ = occ | bishop_masks[sq];
    uint64_t* attacks = attack_table + (sq << 12);
    const uint8_t* s = bishop_spectralmagic[sq >> 3];
    
    uint64_t index = 
    (
    (masked_occ << s[0] << (sq & 7)) + 
    (masked_occ << s[1] << (sq & 7)) + 
    (masked_occ << s[2] << (sq & 7)) + 
    (masked_occ << s[3] << (sq & 7)) + 
    (masked_occ << s[4] << (sq & 7)) + 
    (masked_occ << s[5] << (sq & 7))) >> 52;

    return attack_table[index];
}
In there we calculate the magic number and multply via bitshift addition in 6 steps in one go. This looks interesting on platforms where we dont get fast imul64 implementation.

Here is the preview:
https://godbolt.org/z/oqM3qzfK5

Summary:
Above code makes the price award already obsolete since its shown that the bitpattern can be a perfect magicN = magic[0] >> sq for buckets of 8 at least. What needs to be done is finish rook and verify it too.
The outcome of this will be two movegenerators - correlationmagic and spectral correlationmagic which both work with compressed magic tables and come with and without the need for imul64. Both can work on the same table and are also binary compatible with normal fancy magic. Meaning that you can still populate an array of 64 elements if that is faster on that platform.

Personal note:
Once that bug is out of my head I can finally get back to eval() and publish my fast NNUEv5 repo. After all these years it feels like time and effort is worth more to stop to improve chess movegen alone. Even tho its fun - the effort has greater impact on other parts of the code. I have a movegen on my computer that is twice as fast as the fastest movegen (gigantua) without templates and fully using intrinsics for every aspect of it. Board fitting in one YMM and all. I already described it in the galoisfield post together with a fast eval() its time to port all of it to hardware that can do that massively parallel.
I have this proven to myself in this work. Very optimized handcrafted parallel C++ solves like 0.07 MMagics on 16 cores 5950x but my GPU from 2019 can solve 1.2MMagics. Meaning GPU can solve in 3 minutes what 16 Cores need an hour for. Branchless Integer arithmetic is just that much faster and its time to take advantage of it for all of computerchess. Including AB deeper down the tree, TBProbe and Incremental eval at the end of the tree and a MCTS memory mapped superstructure. Next up: Optimized NNUE running on the GPU.

Future:
After completing the 2 new movegens I am surely done with movegen. That was the last thing that bugged me. Even tho there is more to explore for instance a 4 ray version of magic without multiplication can come back to be competitive yet again since unconditional bitshifts are extremely fast and much faster than IMUL64. On 4 ray magic - the lookup tables become very small since we dont hash HV but H and V and diagonals etc.. and the probability is high that 1 magic is enough (together with a shift) to hash all possible relevant occupancies like above.
An ideal 4 ray spectral correlation magic would take exactly 4 magics, 4 bitshifts and (2-6) additions per line to solve the respective lines which is competitive. I leave that to somebody willing to research further.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Fancy magic - find me the code for this pattern.

Post by dangi12012 »

Update: Needed to optimize more. Now its solving. 25% done.
Issue: Scaling with 40GB vs 200GB is not linear.

With a popcount of 9 we need to test 27540584512 unique possible magic numbers for 64 squres: ~1.8E12 tests.
The scaling issue this this is that there are huge swaths of ranges that are barely not valid magics.
Meaning that when we expand a magic into 4096 indices there is a collision very late into the multiplication process - tanking performance.

So 20% of the whole range can be solved in 12hrs. That is too slow due to many reasons.
So I sat down and with a mix of OpenMP and more use of cudas expanded sharedMemPerBlockOptin as well as running many cuda streams in parallel.
Nowadays even cuda THRUST library supports stream selection which is also very cool with "thrust::cuda::par.on(stream)".

Now I get 3.5MMagics per second. (That is 3.5 Million magics can be solved on 64 squares into 4096 slots).
That is quite a performance uplift of around 50x from very optimized C++ on 16 cores.

It all has to do with faster memory and more cores. It is even awesome that we can even attempt to solve something that big.
So 1 gpu compute minute is almost equal to 1 compute hour for a 16core machine for that particular chess problem.

The goal is described above - but essentially I want to create the list of magics and find a highly correlated path from magic idx 0 down to 7, 8 - 15 and so on.
With magic[i+1] == magic << 1

Solved for bishops even with buffers of 16 - but rooks are much harder.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer