Faster than Fancy magic: Hypercube Slider lookup [TEASER]

dangi12012 · Post by **dangi12012** » Sun Dec 12, 2021 8:47 pm

hgm wrote: ↑Sun Dec 12, 2021 8:12 pm
dangi12012 wrote: ↑Sat Dec 11, 2021 11:30 pmIts broken there: https://www.talkchess.com/forum3/viewto ... es#p140155
That link is wrong: you use https, while TalkChess is on http. As a result the page cannot load anything that is http, which includes the scripts from my website that restore the code sections, or display FENs. If you would use the same URL as http the code would not have been corrupted.

What? https vs http is the transport layer of your html files. If one works but the other doesnt there is something fishy architecturally.
Just put the script into an <script> tag and be done with it

But this has nothing to do with Hypercube Slider lookup - so maybe another topic for https vs http

hgm · Post by **hgm** » Sun Dec 12, 2021 10:34 pm

Browsers seem to care a great deal about this nowadays. But the point was whether you wanted to see the uncorrupted code in the posting you quoted.

dangi12012 · Post by **dangi12012** » Sun Dec 12, 2021 10:38 pm

hgm wrote: ↑Sun Dec 12, 2021 10:34 pm Browsers seem to care a great deal about this nowadays. But the point was whether you wanted to see the uncorrupted code in the posting you quoted.

Yes it works here: http://www.talkchess.com/forum3/viewtop ... es#p140155
Good suggestion

dangi12012 · Post by **dangi12012** » Mon Dec 13, 2021 8:09 pm

Update on current comparison Megalookups/s:

Code: Select all

Exploading:     121.61MOps
Reference:      124.90MOps
KoggeStone:     177.87MOps
RotatedBoard:   169.98MOps
BobMike:        339.81MOps
SlideArithm:    341.30MOps
XorRookSub:     498.89MOps
FancyHash:      657.88MOps
Pext  :         921.04MOps
HyperCube:      954.54MOps

My favourite 2 algos of those: Hypercube which I will release soon -
and the very nice Slider Arithmethic algorithm that just takes 50 LOC with all initialisation:

Code: Select all

#include <array>
#include <cstdint>
#include <bit>

namespace Chess_Lookup::SlideArithm
{
	/* Init */
	consteval bool safe_coord(int f, int r)
	{
		return (0 <= f && f < 8) && (0 <= r && r < 8);
	}

	consteval uint64_t init_mask(int s, int df, int dr)
	{
		uint64_t b{}; int f{}, r{};
		f = s & 7; r = s >> 3;
		while (safe_coord(f += df, r += dr))
			b |= 1ull << (f + r * 8);

		return b;
	}

	consteval std::array<uint64_t, 256> init_array()
	{
		std::array<uint64_t, 256> a{}; int n{};
		for (int s = 0; s < 64; s++)
		{
			a[n++] = init_mask(s, 1, 0) | init_mask(s, -1, 0);
			a[n++] = init_mask(s, 0, 1) | init_mask(s, 0, -1);
			a[n++] = init_mask(s, 1, 1) | init_mask(s, -1, -1);
			a[n++] = init_mask(s, -1, 1) | init_mask(s, 1, -1);
		}
		return a;
	}
	static constexpr std::array<uint64_t, 256> rank_mask = init_array();

	/* Start of code */
	constexpr uint64_t slide_arithmetic(int p, uint64_t line, uint64_t block) {
		uint64_t piece = 1ull << p;

		// mask blockers
		block &= ~piece & line;

		// split the line into upper and lower rays
		uint64_t bottom = piece - 1ull;

		// for the upper part we can use the x^(x-1) trick to fill in from the bottom
		uint64_t masked_up = block & ~bottom;
		uint64_t blocked_up = (masked_up ^ (masked_up - 1ull));

		// for the bottom we use CLZ + a shift to fill in from the top
		uint64_t masked_down = block & bottom;
		uint64_t blocked_down = 0x7FFFFFFFFFFFFFFFull >> std::countl_zero(masked_down | 1ull);

		// the intersection of the two is the move set after masking with the line
		return (blocked_up ^ blocked_down) & line;
	}

	constexpr uint64_t Queen(uint64_t s, uint64_t occ)
	{
		const uint64_t* r = rank_mask.data() + 4 * s;
		return slide_arithmetic(s, r[0], occ)
			 ^ slide_arithmetic(s, r[1], occ)
			 ^ slide_arithmetic(s, r[2], occ)
			 ^ slide_arithmetic(s, r[3], occ);
	}
}

tcusr · Post by **tcusr** » Mon Dec 13, 2021 8:14 pm

take a look at this also
https://github.com/lithander/QBB-Perft/ ... rft.c#L239

dangi12012 · Post by **dangi12012** » Mon Dec 13, 2021 9:58 pm

tcusr wrote: ↑Mon Dec 13, 2021 8:14 pm take a look at this also
https://github.com/lithander/QBB-Perft/ ... rft.c#L239

Code: Select all

Megalookups/s:
Exploading:     119.15MOps
Reference:      126.32MOps
KoggeStone:     170.21MOps
RotatedBoard:   165.94MOps
QBB Algo:       209.68MOps
BobMike:        335.92MOps
SlideArithm:    366.03MOps
XorRookSub:     502.38MOps
FancyHash:      667.64MOps
Pext  :         935.70MOps
HyperCube:      952.90MOps

QBB Algo: 209.68MOps
Middle fast - but the author knew what he was doing in terms of optimizing. One of the very few lookups without any table.
So 0kb lookup table. Btw i compile all with clang++ -flto -O3 -march=native -funroll-loops -std=c++20 Main.cpp -o movegen_compare

Code: Select all

/*
 This sliding lookup implementation is based on QBBEngine by Fabio Gobbato
 Some parts of this source call gcc intrinsic functions. If you are not using gcc you need to
 change them with the functions of your compiler.
*/

#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>
#include <assert.h>
#include <type_traits>

namespace Chess_Lookup::QBB
{
    /* Start of Code */
    constexpr uint64_t bit_bswap_constexpr(uint64_t b) {
        b = ((b >> 8) & 0x00FF00FF00FF00FFULL) | ((b << 8) & 0xFF00FF00FF00FF00ULL);
        b = ((b >> 16) & 0x0000FFFF0000FFFFULL) | ((b << 16) & 0xFFFF0000FFFF0000ULL);
        b = ((b >> 32) & 0x00000000FFFFFFFFULL) | ((b << 32) & 0xFFFFFFFF00000000ULL);
        return b;
    }
    constexpr uint64_t RevBB(uint64_t b) {
        if (std::is_constant_evaluated()) { return bit_bswap_constexpr(b); }
#if defined(_MSC_VER)
        return _byteswap_uint64(b);
#elif defined(__GNUC__)
        return __builtin_bswap64(b);
#else
        return bit_bswap_constexpr(b);
#endif
    }


    constexpr uint64_t MSB(uint64_t value)
    {
        return 63 - std::countl_zero(value);
    }

    constexpr uint64_t LSB(uint64_t value)
    {
        return std::countr_zero(value);
    }

    /* return the bitboard with the rook destinations */
    constexpr uint64_t Rook(uint64_t sq, uint64_t occupation)
    {
        uint64_t piece = 1ULL << sq;
        occupation ^= piece; /* remove the selected piece from the occupation */
        uint64_t piecesup = (0x0101010101010101ULL << sq) & (occupation | 0xFF00000000000000ULL); /* find the pieces up */
        uint64_t piecesdo = (0x8080808080808080ULL >> (63 - sq)) & (occupation | 0x00000000000000FFULL); /* find the pieces down */
        uint64_t piecesri = (0x00000000000000FFULL << sq) & (occupation | 0x8080808080808080ULL); /* find pieces on the right */
        uint64_t piecesle = (0xFF00000000000000ULL >> (63 - sq)) & (occupation | 0x0101010101010101ULL); /* find pieces on the left */
        return (((0x8080808080808080ULL >> (63 - LSB(piecesup))) & (0x0101010101010101ULL << MSB(piecesdo))) |
                ((0xFF00000000000000ULL >> (63 - LSB(piecesri))) & (0x00000000000000FFULL << MSB(piecesle)))) ^ piece;
        /* From every direction find the first piece and from that piece put a mask in the opposite direction.
           Put togheter all the 4 masks and remove the moving piece */
    }

    /* return the bitboard with the bishops destinations */
    constexpr uint64_t Bishop(uint64_t sq, uint64_t occupation)
    {  /* it's the same as the rook */
        uint64_t piece = 1ULL << sq;
        occupation ^= piece;
        uint64_t piecesup = (0x8040201008040201ULL << sq) & (occupation | 0xFF80808080808080ULL);
        uint64_t piecesdo = (0x8040201008040201ULL >> (63 - sq)) & (occupation | 0x01010101010101FFULL);
        uint64_t piecesle = (0x8102040810204081ULL << sq) & (occupation | 0xFF01010101010101ULL);
        uint64_t piecesri = (0x8102040810204081ULL >> (63 - sq)) & (occupation | 0x80808080808080FFULL);
        return (((0x8040201008040201ULL >> (63 - LSB(piecesup))) & (0x8040201008040201ULL << MSB(piecesdo))) |
                ((0x8102040810204081ULL >> (63 - LSB(piecesle))) & (0x8102040810204081ULL << MSB(piecesri)))) ^ piece;
    }

    constexpr uint64_t Queen(uint64_t sq, uint64_t occupation) {
        return Rook(sq, occupation) | Bishop(sq, occupation);
    }
}

I would like to see performance on some old intel 32 bit cpu - to know what was possible 20 years ago when cpus existed but some algos didnt.
Its also funny that ALL algos agree on what index 0 means by chance. All define it by 1ull << idx. While 8000000000000000ull >> idx is equally valid.

tcusr · Post by **tcusr** » Mon Dec 13, 2021 10:25 pm

it's sad to see kogge stone perform so poorly (btw why it's the only algo not sorted correclty?)
where did you take the code for it? one mistake i've seen in other engines (to avoid typing 8 functions) is that they use a general shift function and when they fill one direction they also consider wraps, which they shouldn't since they are already taken care of by the propagator

in my engine 'queen_attacks' is exactly 200 branchless instructions. the funny thing is that i never use it, when i generate attacks for bishops/rook i also add the queen bitboards so i only use rook/bishop attacks.

dangi12012 · Post by **dangi12012** » Mon Dec 13, 2021 10:41 pm

tcusr wrote: ↑Mon Dec 13, 2021 10:25 pm it's sad to see kogge stone perform so poorly (btw why it's the only algo not sorted correclty?)
where did you take the code for it? one mistake i've seen in other engines (to avoid typing 8 functions) is that they use a general shift function and when they fill one direction they also consider wraps, which they shouldn't since they are already taken care of by the propagator

in my engine 'queen_attacks' is exactly 200 branchless instructions. the funny thing is that i never use it, when i generate attacks for bishops/rook i also add the queen bitboards so i only use rook/bishop attacks.

Kogge was from this thread where i was given the sourcecode to Elephant_1_07_remains.
Sourcecode: (Maybe not that I look at it nothing is marked as constexpr..)

Not a single one of the algos has "Queen Attacks". I always add Queen() return Rook | Bishop. Because that tests both codepaths and is more rigurous testing.

You want to send me your engines sliding piece algorithm?

Code: Select all

namespace Chess_Lookup::KoggeStone {
    /*******************************************************************/
#include <stdint.h>
#define C64 uint64_t
#define BitBoard uint64_t

    ///////////////////////////////////////////////////////////////////////////

    BitBoard ShiftUp(BitBoard b) { return  b << 8; }
    BitBoard ShiftDown(BitBoard b) { return  b >> 8; }
    BitBoard ShiftLeft(BitBoard b) { return (b << 1) & C64(0xfefefefefefefefe); }
    BitBoard ShiftRight(BitBoard b) { return (b >> 1) & C64(0x7f7f7f7f7f7f7f7f); }
    BitBoard ShiftUpLeft(BitBoard b) { return (b << 9) & C64(0xfefefefefefefefe); }
    BitBoard ShiftUpRight(BitBoard b) { return (b << 7) & C64(0x7f7f7f7f7f7f7f7f); }
    BitBoard ShiftDownLeft(BitBoard b) { return (b >> 7) & C64(0xfefefefefefefefe); }
    BitBoard ShiftDownRight(BitBoard b) { return (b >> 9) & C64(0x7f7f7f7f7f7f7f7f); }

    ///////////////////////////////////////////////////////////////////////////

    /**
    The routine FillUpOccluded() smears the set bits of bitboard g upwards, but only
    along set bits of p; a reset bit in p is enough to halt a smear. RookAttacksUp()
    uses this routine to find all squares rooks may occupy by either staying still
    or moving up along empty squares (no captures allowed). Shifting this last
    result up by 1 square gives the squares all rooks can attack by moving upwards
    (which _does_ include captures).
    */
    BitBoard FillUpOccluded(BitBoard g, BitBoard p)
    {
        g |= p & (g << 8);
        p &= (p << 8);
        g |= p & (g << 16);
        p &= (p << 16);
        return g |= p & (g << 32);
    }

    BitBoard FillDownOccluded(BitBoard g, BitBoard p)
    {
        g |= p & (g >> 8);
        p &= (p >> 8);
        g |= p & (g >> 16);
        p &= (p >> 16);
        return g |= p & (g >> 32);
    }

    BitBoard FillLeftOccluded(BitBoard g, BitBoard p)
    {
        p &= C64(0xfefefefefefefefe);
        g |= p & (g << 1);
        p &= (p << 1);
        g |= p & (g << 2);
        p &= (p << 2);
        return g |= p & (g << 4);
    }

    BitBoard FillRightOccluded(BitBoard g, BitBoard p)
    {
        p &= C64(0x7f7f7f7f7f7f7f7f);
        g |= p & (g >> 1);
        p &= (p >> 1);
        g |= p & (g >> 2);
        p &= (p >> 2);
        return g |= p & (g >> 4);
    }

    BitBoard FillUpLeftOccluded(BitBoard g, BitBoard p)
    {
        p &= C64(0xfefefefefefefefe);
        g |= p & (g << 9);
        p &= (p << 9);
        g |= p & (g << 18);
        p &= (p << 18);
        return g |= p & (g << 36);
    }

    BitBoard FillUpRightOccluded(BitBoard g, BitBoard p)
    {
        p &= C64(0x7f7f7f7f7f7f7f7f);
        g |= p & (g << 7);
        p &= (p << 7);
        g |= p & (g << 14);
        p &= (p << 14);
        return g |= p & (g << 28);
    }

    BitBoard FillDownLeftOccluded(BitBoard g, BitBoard p)
    {
        p &= C64(0xfefefefefefefefe);
        g |= p & (g >> 7);
        p &= (p >> 7);
        g |= p & (g >> 14);
        p &= (p >> 14);
        return g |= p & (g >> 28);
    }

    BitBoard FillDownRightOccluded(BitBoard g, BitBoard p)
    {
        p &= C64(0x7f7f7f7f7f7f7f7f);
        g |= p & (g >> 9);
        p &= (p >> 9);
        g |= p & (g >> 18);
        p &= (p >> 18);
        return g |= p & (g >> 36);
    }


    /*******************************************************************/

    /**
    Get a bitboard with all positions set to 1 which can be attacked
    from a bishop, rook or queen on the square moving in the direction.
    */
    uint64_t direction_attacks(int square, int dir, uint64_t occ)
    {
        uint64_t seed = 1ull << square;
        uint64_t free = ~occ;

        // 4 3 2
        // 5 0 1
        // 6 7 8
        switch (dir)
        {
        case 1:
            return ShiftRight(FillRightOccluded(seed, free));
        case 5:
            return ShiftLeft(FillLeftOccluded(seed, free));
        case 7:
            return ShiftDown(FillDownOccluded(seed, free));
        case 3:
            return ShiftUp(FillUpOccluded(seed, free));
        case 8:
            return ShiftDownRight(FillDownRightOccluded(seed, free));
        case 4:
            return ShiftUpLeft(FillUpLeftOccluded(seed, free));
        case 2:
            return ShiftUpRight(FillUpRightOccluded(seed, free));
        case 6:
            return ShiftDownLeft(FillDownLeftOccluded(seed, free));
        default:
            return 0;
        }
    }


    /*******************************************************************/

    /**
    Get a bitboard with all positions set to 1 which can be attacked
    from a rook or queen on the square.
    */
    uint64_t attacks_rook(int square, uint64_t occ)
    {
        uint64_t seed = 1ull << square;
        uint64_t free = ~occ;
        return ShiftRight(FillRightOccluded(seed, free))
            | ShiftLeft(FillLeftOccluded(seed, free))
            | ShiftUp(FillUpOccluded(seed, free))
            | ShiftDown(FillDownOccluded(seed, free));
    }


    /*******************************************************************/

    /**
    Get a bitboard with all positions set to 1 which can be attacked
    from a bishop or queen on the square.
    */
    BitBoard attacks_bishop(int square, uint64_t occ)
    {
        uint64_t seed = 1ull << square;
        uint64_t free = ~occ;
        return ShiftUpRight(FillUpRightOccluded(seed, free))
            | ShiftUpLeft(FillUpLeftOccluded(seed, free))
            | ShiftDownLeft(FillDownLeftOccluded(seed, free))
            | ShiftDownRight(FillDownRightOccluded(seed, free));
    }

    uint64_t Queen(int square, uint64_t occ) {
        return attacks_bishop(square, occ) | attacks_rook(square, occ);
    }
}

dangi12012 · Post by **dangi12012** » Mon Dec 13, 2021 10:44 pm

Update: constexpr didnt make kogge stone faster.
Its so slow because of the dependency chain. Modern processors like it when you can reorder the instructions without changing the results. Simple as that. A normal loop is as bad as this kogge stone algo.

Thats why we have the more advanced algorithms that can calculate this without a hard dependency chain and without any branches!

tcusr · Post by **tcusr** » Mon Dec 13, 2021 10:57 pm

dangi12012 wrote: ↑Mon Dec 13, 2021 10:44 pm Update: constexpr didnt make kogge stone faster.
Its so slow because of the dependency chain. Modern processors like it when you can reorder the instructions without changing the results. Simple as that. A normal loop is as bad as this kogge stone algo.

Thats why we have the more advanced algorithms that can calculate this without a hard dependency chain and without any branches!

ok the implementation is fine, thanks for trying!
i thought kogge stone was slow because it was long, i still have a lot to learn about low level programming

btw my point was that queen_attacks is rarely used in a chess engine that uses kogge stone because it's expensive, try to run it again with only rook or bishops attacks

Faster than Fancy magic: Hypercube Slider lookup [TEASER]

Re: Faster than Fancy magic: Hypercube Slider lookup [TEASER]

Re: Faster than Fancy magic: Hypercube Slider lookup [TEASER]

Re: Faster than Fancy magic: Hypercube Slider lookup [TEASER]

Re: Faster than Fancy magic: Hypercube Slider lookup [TEASER]

Re: Faster than Fancy magic: Hypercube Slider lookup [TEASER]

Re: Faster than Fancy magic: Hypercube Slider lookup [TEASER]

Re: Faster than Fancy magic: Hypercube Slider lookup [TEASER]

Re: Faster than Fancy magic: Hypercube Slider lookup [TEASER]

Re: Faster than Fancy magic: Hypercube Slider lookup [TEASER]

Re: Faster than Fancy magic: Hypercube Slider lookup [TEASER]