Bit twiddling: what do you prefer?

Karlo Bala · Post by **Karlo Bala** » Thu Oct 14, 2021 7:43 pm

Mask:

1: 1 << sq
2: mask[sq]


~Mask:

1: ~(1 << sq)
2: nmask[sq]
3: _rotl64(~1, sq)


Test:

1: (bb >> sq) & 1
2: bb & (1 << sq)
3: bb & mask[sq]
4: _bittest64(bb, sq)


Set:

1: bb | (1 << sq)
2: bb | mask[sq]
3: _bittestandset64(bb, sq)


Reset:

1: bb & ~(1 << sq)
2: bb & nmask[sq]
3: bb & _rotl64(~1, sq)
4: _bittestandreset64(bb, sq)


Reset last:

...
5: bb & (bb - 1)
 
 
Flip:

1: bb ^ (1 << sq)
2: bb ^ mask[sq]
3: _bittestandcomplement64(bb, sq)

Anything else?

R. Tomasi · Post by **R. Tomasi** » Thu Oct 14, 2021 8:16 pm

Karlo Bala wrote: ↑Thu Oct 14, 2021 7:43 pm

Code: Select all

Mask:

1: 1 << sq
2: mask[sq]


~Mask:

1: ~(1 << sq)
2: nmask[sq]
3: _rotl64(~1, sq)


Test:

1: (bb >> sq) & 1
2: bb & (1 << sq)
3: bb & mask[sq]
4: _bittest64(bb, sq)


Set:

1: bb | (1 << sq)
2: bb | mask[sq]
3: _bittestandset64(bb, sq)


Reset:

1: bb & ~(1 << sq)
2: bb & nmask[sq]
3: bb & _rotl64(~1, sq)
4: _bittestandreset64(bb, sq)


Reset last:

...
5: bb & (bb - 1)
 
 
Flip:

1: bb ^ (1 << sq)
2: bb ^ mask[sq]
3: _bittestandcomplement64(bb, sq)

Anything else?

I pick option 1) on all examples. The compiler (at least with clang I verified) picks the appropriate machine instructions (like BT and the like) depending on the target architecture - I therefore assume it's always the fastest. For the mask and ~mask examples I tried 1) and 2) extensively during the most early developement stages of my engine, and the memory-lookup was slower in both cases.

dangi12012 · Post by **dangi12012** » Thu Oct 14, 2021 8:30 pm

Karlo Bala wrote: ↑Thu Oct 14, 2021 7:43 pm

Code: Select all

Mask:

1: 1 << sq
2: mask[sq]


~Mask:

1: ~(1 << sq)
2: nmask[sq]
3: _rotl64(~1, sq)


Test:

1: (bb >> sq) & 1
2: bb & (1 << sq)
3: bb & mask[sq]
4: _bittest64(bb, sq)


Set:

1: bb | (1 << sq)
2: bb | mask[sq]
3: _bittestandset64(bb, sq)


Reset:

1: bb & ~(1 << sq)
2: bb & nmask[sq]
3: bb & _rotl64(~1, sq)
4: _bittestandreset64(bb, sq)


Reset last:

...
5: bb & (bb - 1)
 
 
Flip:

1: bb ^ (1 << sq)
2: bb ^ mask[sq]
3: _bittestandcomplement64(bb, sq)

Anything else?

I have found that you mostly never need to operate a single bit with bitboards in chess. Except when you need a square of a set bit (ie bitboard to square).
Moving one or multiple pieces on a bitboard? board ^= (from | to)
Checking which of 32 pieces are pinned? occ & pinmask
Dont think in terms of squares. Think in terms of bitmaps.

But TLDR: All your options 1) are better than the others. Maybe think about BMI1 intrisics like _blsr_u64 instead of b&(b-1). Your compiler might figure it out but sometimes doesnt. Never ever do mask[sq] instead of 1 << sq. 1 << sq is many times faster.

Mergi · Post by **Mergi** » Thu Oct 14, 2021 9:09 pm

dangi12012 wrote: ↑Thu Oct 14, 2021 8:30 pm I have found that you mostly never need to operate a single bit with bitboards in chess. Except when you need a square of a set bit (ie bitboard to square).
Moving one or multiple pieces on a bitboard? board ^= (from | to)
Checking which of 32 pieces are pinned? occ & pinmask
Dont think in terms of squares. Think in terms of bitmaps.

Converting back-and-forth between bitboards with one bit set and an integer square is very common. You need both representations. The most obvious example would be storing a move as a from and to square. Your example of board ^= (from | to) suddenly becomes board ^= (1 << from_sq) | (1 << to_sq). There's just no way around it, if you want a compact move representation for your transposition table.

dangi12012 · Post by **dangi12012** » Thu Oct 14, 2021 9:18 pm

Mergi wrote: ↑Thu Oct 14, 2021 9:09 pm
dangi12012 wrote: ↑Thu Oct 14, 2021 8:30 pm I have found that you mostly never need to operate a single bit with bitboards in chess. Except when you need a square of a set bit (ie bitboard to square).
Moving one or multiple pieces on a bitboard? board ^= (from | to)
Checking which of 32 pieces are pinned? occ & pinmask
Dont think in terms of squares. Think in terms of bitmaps.
Converting back-and-forth between bitboards with one bit set and an integer square is very common. You need both representations. The most obvious example would be storing a move as a from and to square. Your example of board ^= (from | to) suddenly becomes board ^= (1 << from_sq) | (1 << to_sq). There's just no way around it, if you want a compact move representation for your transposition table.

No there is a way around that. Because you enumerate set bits and not squares. Your slider, knight lookup also returns set bits in the form of a bitmap. Its just perfect.

This method enumerates all set bits in an uint64_t as if you would do 1ull << sq for each filled square on the board

Code: Select all

inline Bit PopBit(uint64_t& val)
{
    uint64_t lsb = _blsi_u64(val);
    //val = _blsr_u64(val); - 3% slower? Todo: Extensive benchmarking
    val ^= lsb;
    return lsb;
}

Mergi · Post by **Mergi** » Thu Oct 14, 2021 9:26 pm

dangi12012 wrote: ↑Thu Oct 14, 2021 9:18 pm
Mergi wrote: ↑Thu Oct 14, 2021 9:09 pm
dangi12012 wrote: ↑Thu Oct 14, 2021 8:30 pm I have found that you mostly never need to operate a single bit with bitboards in chess. Except when you need a square of a set bit (ie bitboard to square).
Moving one or multiple pieces on a bitboard? board ^= (from | to)
Checking which of 32 pieces are pinned? occ & pinmask
Dont think in terms of squares. Think in terms of bitmaps.
Converting back-and-forth between bitboards with one bit set and an integer square is very common. You need both representations. The most obvious example would be storing a move as a from and to square. Your example of board ^= (from | to) suddenly becomes board ^= (1 << from_sq) | (1 << to_sq). There's just no way around it, if you want a compact move representation for your transposition table.
No there is a way around that. Because you enumerate set bits and not squares. Your slider, knight lookup also returns set bits in the form of a bitmap. Its just perfect.

This method enumerates all set bits in an uint64_t as if you would do 1ull << sq for each filled square on the board
Code: Select all
inline Bit PopBit(uint64_t& val)
{
    uint64_t lsb = _blsi_u64(val);
    //val = _blsr_u64(val); - 3% slower? Todo: Extensive benchmarking
    val ^= lsb;
    return lsb;
}

I'm not sure if i understand what you are trying to demonstrate. Of course i know how Lsb and PopLsb works, but you originally said that doing 1 << sq is not needed. However, if you are storing a move as uint16, to get a bitmask of that move, you need to do 1 << from_sq | 1 << to_sq.

dangi12012 · Post by **dangi12012** » Thu Oct 14, 2021 9:30 pm

Mergi wrote: ↑Thu Oct 14, 2021 9:26 pm I'm not sure if i understand what you are trying to demonstrate. Of course i know how Lsb and PopLsb works, but you originally said that doing 1 << sq is not needed. However, if you are storing a move as uint16, to get a bitmask of that move, you need to do 1 << from_sq | 1 << to_sq.

Ah I see. If you need to convert from a square to a bitmap. Then yes. I was thinking the other way around.

What I am trying to tell you is that if you have a full Bitboard implementation you wont need the square in terms of 0..64. Only for slider + king + knight lookup and nowhere else.

R. Tomasi · Post by **R. Tomasi** » Thu Oct 14, 2021 9:36 pm

dangi12012 wrote: ↑Thu Oct 14, 2021 9:30 pm
Mergi wrote: ↑Thu Oct 14, 2021 9:26 pm I'm not sure if i understand what you are trying to demonstrate. Of course i know how Lsb and PopLsb works, but you originally said that doing 1 << sq is not needed. However, if you are storing a move as uint16, to get a bitmask of that move, you need to do 1 << from_sq | 1 << to_sq.
Ah I see. If you need to convert from a square to a bitmap. Then yes. I was thinking the other way around.

What I am trying to tell you is that if you have a full Bitboard implementation you wont need the square in terms of 0..64. Only for slider + king + knight lookup and nowhere else.

Not sure what you're about there. But please understand that most of us here do not do pure perft. If you're writing an engine - be it bitboard based or other - there really is no way around having moves represented more compactly then just the delta-bitboards of that move. You will have to access squares in terms of their indices on many occasions. Converting from 16-bit moves for TT and killer storage being only one example among many in actual engines.

mvanthoor · Post by **mvanthoor** » Sat Oct 16, 2021 11:29 am

dangi12012 wrote: ↑Thu Oct 14, 2021 8:30 pm But TLDR: All your options 1) are better than the others. Maybe think about BMI1 intrisics like _blsr_u64 instead of b&(b-1). Your compiler might figure it out but sometimes doesnt. Never ever do mask[sq] instead of 1 << sq. 1 << sq is many times faster.

"mask[sq]" or "1 << sq" doesn't make a difference in Rust. I actually tested it yesterday.

When the array is initialized with a const fn (similar to constexpr in C++), then mask[sq] is inlined, just as if you used "1 << sq". There's no significant Elo difference in a match of 2000 games between engines that only differ in this detail.

dangi12012 · Post by **dangi12012** » Sat Oct 16, 2021 12:11 pm

mvanthoor wrote: ↑Sat Oct 16, 2021 11:29 am
dangi12012 wrote: ↑Thu Oct 14, 2021 8:30 pm But TLDR: All your options 1) are better than the others. Maybe think about BMI1 intrisics like _blsr_u64 instead of b&(b-1). Your compiler might figure it out but sometimes doesnt. Never ever do mask[sq] instead of 1 << sq. 1 << sq is many times faster.
"mask[sq]" or "1 << sq" doesn't make a difference in Rust. I actually tested it yesterday.

When the array is initialized with a const fn (similar to constexpr in C++), then mask[sq] is inlined, just as if you used "1 << sq". There's no significant Elo difference in a match of 2000 games between engines that only differ in this detail.

Wrong. That is only the case if the compiler can infer what sq is. Like if you use it in a for loop - but its exectly opposite if you use a not compiletime known value - like when you are deep in a calculation - then the compiler will emit that array into some memory location and you have a L1 lookup for [sq] but register calculation for <<.

Its 100% verifyably slower - just test it with random numbers

Bit twiddling: what do you prefer?

Bit twiddling: what do you prefer?

Re: Bit twiddling: what do you prefer?

Re: Bit twiddling: what do you prefer?

Re: Bit twiddling: what do you prefer?

Re: Bit twiddling: what do you prefer?

Re: Bit twiddling: what do you prefer?

Re: Bit twiddling: what do you prefer?

Re: Bit twiddling: what do you prefer?

Re: Bit twiddling: what do you prefer?

Re: Bit twiddling: what do you prefer?