[Question] Efficiently generate ray masks?

dangi12012 · Post by **dangi12012** » Mon Apr 18, 2022 11:43 pm

tcusr wrote: ↑Mon Apr 18, 2022 10:55 pm i found the pattern
Code: Select all
def find_byte(n):
    for i in range(0, 8):
        if (n >> 8 * i) & 0xff:
            return i + 1

def dir_HO(n):
    b = find_byte(n)

    return (1 << b * 8) - n - (1 << (b - 1) * 8)

for n in range(0, 64):
    print(1 << n, " -> ", dir_HO(1 << n))
the only problem now is that we have to find a fast function that finds the byte in which a bit is placed

Thats very-very cool! Will have to think how to automatically detect these kind of pattern.
Thank you for that solution!

I also did a lot of thinking - of how to skip the countlzero that seems to always pop up.

Possible Solution:
When running any of the many algos with occ = 0 - then consequently the 4 masks will be generated!
If the algorithm itself is not dependent on the 4 masks then its a win.

A prime candidate is https://github.com/Gigantua/Chess_Moveg ... in/QBB.hpp
because of its simplicity and use of shifted squares. (1ull << sq is just he native Bitboard form of sq)

With the sifter I can find alternative forms of many patterns and can exclude countlzero.
So for example these two are equivalent:

Code: Select all

(0x0101010101010101ULL << sq) //sq = 0..64
(0x0101010101010101ULL << popcount((sq_mask - popcount(sq_mask)))) //sq_mask = 0...1 << 64

The above would be IMPOSSIBLE to find.... What is subtracting the popcount from a single set bit - and why does shifting this work at all

So in general I will try to find a solution that accepts 1ull << sq as input and with occ = 0 will simplify to a fast lookup-free simple function.

Two things bugged me in this thread:
That the final bishop solution has a conditional in it.
That the squares have to be in 0..64 form to generate the 4 masks.

I think we are close to a fast solution for both points above. If a solution is found here it will accelerate all Bitboard algorithms that depend on the 4 raymasks (which is incidentally the 3 fastest gpu algorithms and all of the fastest CPU algos after PEXT). So in summary a challenge worth thinking about.

tcusr · Post by **tcusr** » Tue Apr 19, 2022 12:03 am

dangi12012 wrote: ↑Mon Apr 18, 2022 11:43 pm
tcusr wrote: ↑Mon Apr 18, 2022 10:55 pm i found the pattern
Code: Select all
def find_byte(n):
    for i in range(0, 8):
        if (n >> 8 * i) & 0xff:
            return i + 1

def dir_HO(n):
    b = find_byte(n)

    return (1 << b * 8) - n - (1 << (b - 1) * 8)

for n in range(0, 64):
    print(1 << n, " -> ", dir_HO(1 << n))
the only problem now is that we have to find a fast function that finds the byte in which a bit is placed
Thats very-very cool! Will have to think how to automatically detect these kind of pattern.
Thank you for that solution!

I also did a lot of thinking - of how to skip the countlzero that seems to always pop up.

Possible Solution:
When running any of the many algos with occ = 0 - then consequently the 4 masks will be generated!
If the algorithm itself is not dependent on the 4 masks then its a win.

A prime candidate is https://github.com/Gigantua/Chess_Moveg ... in/QBB.hpp
because of its simplicity and use of shifted squares. (1ull << sq is just he native Bitboard form of sq)

With the sifter I can find alternative forms of many patterns and can exclude countlzero.
So for example these two are equivalent:
Code: Select all
(0x0101010101010101ULL << sq) //sq = 0..64
(0x0101010101010101ULL << popcount((sq_mask - popcount(sq_mask)))) //sq_mask = 0...1 << 64
The above would be IMPOSSIBLE to find.... What is subtracting the popcount from a single set bit - and why does shifting this work at all

So in general I will try to find a solution that accepts 1ull << sq as input and with occ = 0 will simplify to a fast lookup-free simple function.

Two things bugged me in this thread:
That the final bishop solution has a conditional in it.
That the squares have to be in 0..64 form to generate the 4 masks.

I think we are close to a fast solution for both points above. If a solution is found here it will accelerate all Bitboard algorithms that depend on the 4 raymasks (which is incidentally the 3 fastest gpu algorithms and all of the fastest CPU algos after PEXT). So in summary a challenge worth thinking about.

keep in mind that if N has only a single bit set then popcnt(N-1) == ctz(N) (if there many bit sets then isolate LSB with N & -N)

dangi12012 · Post by **dangi12012** » Tue Apr 19, 2022 12:28 am

tcusr wrote: ↑Tue Apr 19, 2022 12:03 am keep in mind that if N has only a single bit set then popcnt(N-1) == ctz(N) (if there many bit sets then isolate LSB with N & -N)
my solution becomes really fast if you can use popcnt
Code: Select all
uint64_t dir_HO(uint64_t n)
{
    auto b = std::popcount(n - 1) & 56;
    return (1ull << b) - n - ((1ull << b) >> 8);
}

Ahh thats a good point! The twos complement seems to have the same property maybe? Its getting late I cannot verify - popcnt(-N) = ctz(N)
In many many architectures the popcount is much faster than counting all leading/trailing zeroes! - so popcnt is allowed - and we are definitely onto something very new and generally good for chessprogramming! Not to mention its only SSE41 instead of BMI1
https://www.agner.org/optimize/instruction_tables.pdf

NOW - its getting late I cannot verify now but it seems that I have found the solutions (which should produce the same images as yours with below code)

Code: Select all

for(int i=0;i<64;i++) uint64_t mask = 1ull << i;

Derived from QBB - I get 4 sifted solutions for OCC = 0, sq = 1ull << sq

Code: Select all

//(0x8080808080808080ULL >> popcount(((0x8080808080808080ULL &- (0x8080808080808080ULL &- sq)) - sq)))  #724293417
//(0x8080808080808080ULL >> popcount(((0x8080808080808080ULL & -(0x8080808080808080ULL & -sq)) - sq)))  #724293417
//(0x00000000000000FFULL << (-(popcount(0x0101010101010101ULL)) & -popcount(-(sq))))  #1729588044
//(0x00000000000000FFULL << (-(popcount(0x0101010101010101ULL)) &- popcount(-(sq))))  #1729588044

2,3 are equivalent - so that is a good sign - could someone test or verify tcurs + my solution?

Mask to String:

Code: Select all

static std::string _map(uint64_t value)
{
	static std::string str(64 + 7, 'o');
	for (uint64_t i = 0, c = 0; i < 64; i++)
	{
		uint64_t bitmask = (1ull) << i;

		if ((bitmask & value) != 0) str[c++] = 'X';
		else str[c++] = '.';

		if ((i + 1) % 8 == 0 && i != 63) str[c++] = '\n';
	}
	return str;
}

tcusr · Post by **tcusr** » Tue Apr 19, 2022 12:41 am

dangi12012 wrote: ↑Tue Apr 19, 2022 12:28 am
tcusr wrote: ↑Tue Apr 19, 2022 12:03 am keep in mind that if N has only a single bit set then popcnt(N-1) == ctz(N) (if there many bit sets then isolate LSB with N & -N)
my solution becomes really fast if you can use popcnt
Code: Select all
uint64_t dir_HO(uint64_t n)
{
    auto b = std::popcount(n - 1) & 56;
    return (1ull << b) - n - ((1ull << b) >> 8);
}
Ahh thats a good point! The twos complement seems to have the same property maybe? Its getting late I cannot verify - popcnt(-N) = ctz(N)
In many many architectures the popcount is much faster than counting all leading/trailing zeroes! - so popcnt is allowed - and we are definitely onto something very new and generally good for chessprogramming! Not to mention its only SSE41 instead of BMI1
https://www.agner.org/optimize/instruction_tables.pdf

NOW - its getting late I cannot verify now but it seems that I have found the solutions (which should produce the same images as yours with below code)

i believe we're in the same timezone so i agree it's better to go sleep now

Code: Select all

for(int i=0;i<64;i++) uint64_t mask = 1ull << i;

Derived from QBB - I get 4 sifted solutions for OCC = 0, sq = 1ull << sq

Code: Select all

//(0x8080808080808080ULL >> popcount(((0x8080808080808080ULL &- (0x8080808080808080ULL &- sq)) - sq)))  #724293417
//(0x8080808080808080ULL >> popcount(((0x8080808080808080ULL & -(0x8080808080808080ULL & -sq)) - sq)))  #724293417
//(0x00000000000000FFULL << (-(popcount(0x0101010101010101ULL)) & -popcount(-(sq))))  #1729588044
//(0x00000000000000FFULL << (-(popcount(0x0101010101010101ULL)) &- popcount(-(sq))))  #1729588044

2,3 are equivalent - so that is a good sign - could someone test or verify tcurs + my solution?

Mask to String:

Code: Select all

static std::string _map(uint64_t value)
{
	static std::string str(64 + 7, 'o');
	for (uint64_t i = 0, c = 0; i < 64; i++)
	{
		uint64_t bitmask = (1ull) << i;

		if ((bitmask & value) != 0) str[c++] = 'X';
		else str[c++] = '.';

		if ((i + 1) % 8 == 0 && i != 63) str[c++] = '\n';
	}
	return str;
}

i think you got them mixed up
ctz == std::countr_zero(N) == std::popcount(N - 1)
clz == std::countl_zero(N) == std::popcount(~(N - 1)) // does negating do the same thing?

btw in your response there's code i deleted because it does not work and i don't know why. in chess terms i am getting the rank (i.e. byte) of the square by ANDing with 56 (i also forgot to add 1 because that's how the algorithm works).

dangi12012 · Post by **dangi12012** » Tue Apr 19, 2022 5:11 pm

Release: Generate Ray Masks with Bitwise Squaremask Inputs

These are the 4 rays: sq = bitmask

Code: Select all

uint64_t mask_HO(uint64_t sq)
{
	return ((0x8080808080808080ULL >> (std::countl_zero(0x0101010101010101ULL) & std::countl_zero(sq))) ^ sq);
}

uint64_t mask_VE(uint64_t sq)
{
	return ((0xFF00000000000000ULL >> (std::countl_zero(0x00000000000000FFULL) & std::countl_zero(sq))) ^ sq);
}

uint64_t mask_D1(uint64_t sq)
{
	uint64_t a = (0x8102040810204081ULL << (63 - std::countl_zero((0x80808080808080FFULL & (0x8102040810204081ULL >> std::countl_zero(sq))))));
	uint64_t b = (0x8102040810204081ULL >> (63 - std::countl_zero((0x80808080808080FFULL & (0x8102040810204081ULL >> std::countr_zero(sq))))));
	return (a & b) ^ sq;
}

uint64_t mask_D2(uint64_t sq) 
{ 
	uint64_t a = (0x8040201008040201ULL >> (63 - std::countl_zero((0x01010101010101FFULL & (0x8040201008040201ULL >> std::countr_zero(sq))))));
	uint64_t b = (0x8040201008040201ULL << (63 - std::countl_zero((0x01010101010101FFULL & (0x8040201008040201ULL >> std::countl_zero(sq))))));
	return (a & b) ^ sq;
}

Notice how D1 and D2 dont have any conditionals in them - so this is true branchless arithmetic!
Where is this useful? On Platforms where lookups are expensive and countl_zero ist fast (GPUs mostly). Its not fully the solution I would have liked since this transforms with countl_zero internally - but it still compares favorably to square centric #define solutions solved earlier in this thread because the square needs to be calculated by countl_zero either way.

This is the old solution with sq = 0..64 integer

Code: Select all

template<uint64_t bb>
uint64_t mask_shift(int ranks) {
	return ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
}

#define dir_HO(X) ((0xFFull << (X & 56)) ^ (1ull << X))
#define dir_VE(X) ((0x0101010101010101ull << (X & 7)) ^ (1ull << X))
#define dir_D1(X) ((mask_shift<0x8040201008040201ull>((X & 7) - (X >> 3))) ^ (1ull << X))
#define dir_D2(X) ((mask_shift<0x0102040810204080ull>(7 - (X & 7) - (X >> 3))) ^ (1ull << X))

Proof of equivalence:
https://godbolt.org/z/Pj9qdqzEh

Removal of a the hard branch "ranks > 0" and achievement of a branchless lookupfree mask algorithm that works with the bitmask (not the square) was definitely worth the effort.

[Question] Efficiently generate ray masks?

Re: [Question] Efficiently generate ray masks?

Re: [Question] Efficiently generate ray masks?

Re: [Question] Efficiently generate ray masks?

Re: [Question] Efficiently generate ray masks?

Re: [Question] Efficiently generate ray masks?