Bob used to do it this way

Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Bob used to do it this way

Post by Mike Sherwin »

Code: Select all

# define AttacksBishop(square, occ) \
(plus7dir[square] ^ plus7dir[LSB(plus7dir[square] & (occ))] | \
plus9dir[square] ^ plus9dir[LSB(plus9dir[square] & (occ))] | \
minus7dir[square] ^ minus7dir[MSB(minus7dir[square] & (occ))] | \
minus9dir[square] ^ minus9dir[MSB(minus9dir[square] & (occ))])
"I think it was maybe 10% slower than rotated when I tested it."
Also:
"This works and is fast. It is nearly as fast as rotated bitboards or magic bitboards. And certainly more cache-friendly. But 'nearly' isn't always 'good enough'."
These quotes are from Oct 28 2009, and the attack-getter macro probably predates that by a few years. I know that back then the LSB and MSB functions were slow compared to hardware bsf/bsr or the MS intrinsics _BitScanForward64 and _BitScanReverse64. So I'm wondering whether this method needs to be reevaluated?
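For reference, a minimal sketch of what the LSB/MSB helpers look like on top of the MSVC intrinsics mentioned above (the wrapper names are chosen to match Bob's macro; passing a zero bitboard is left undefined here, just as with the raw bsf/bsr instructions):

Code: Select all

#include <intrin.h>

/* index of the lowest set bit (bsf / tzcnt) */
static inline int LSB(unsigned long long bb) {
    unsigned long idx;
    _BitScanForward64(&idx, bb);
    return (int)idx;
}

/* index of the highest set bit (bsr / lzcnt) */
static inline int MSB(unsigned long long bb) {
    unsigned long idx;
    _BitScanReverse64(&idx, bb);
    return (int)idx;
}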
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Bob used to do it this way

Post by dangi12012 »

I can say that _pext is 30% faster than magic hashing on modern machines.
If you look at the code, it's also really simple (ignore the inline tables and scroll to the very bottom):
https://github.com/Gigantua/Gigantua/bl ... ovemap.hpp

Code: Select all

struct SliderPext_t
{
	const uint64_t* AttackPtr;
	const uint64_t Mask;
	constexpr SliderPext_t(int offset, uint64_t mask) : AttackPtr(SliderPext + offset), Mask(mask) {
		
	}

	_ForceInline constexpr uint64_t operator[](const uint64_t blocker) const
	{
	     return AttackPtr[_pext_u64(blocker, Mask)];
	}
};
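For context, a hedged sketch (not the Gigantua code) of how such a pext-indexed attack table is typically filled for one square - slow_attacks stands for any reference attack generator, and the Carry-Rippler loop enumerates every subset of the relevant-occupancy mask:

Code: Select all

#include <stdint.h>
#include <vector>
#include <immintrin.h>

std::vector<uint64_t> BuildPextTable(int square, uint64_t mask,
                                     uint64_t (*slow_attacks)(int, uint64_t)) {
    std::vector<uint64_t> table(1ull << __builtin_popcountll(mask));
    uint64_t occ = 0;
    do {
        // pext compresses the blocker bits inside 'mask' into a dense index
        table[_pext_u64(occ, mask)] = slow_attacks(square, occ);
        occ = (occ - mask) & mask;   // Carry-Rippler: next subset of mask
    } while (occ);
    return table;
}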
To answer your question: could you be so kind as to put your code into Compiler Explorer,
using these flags:
--std=c++20 -O3 -march=znver3
The fastest bitboard approach is this (I excluded the square lookup since that is engine-dependent).
Input for Compiler Explorer:

Code: Select all

#include <stdint.h>
#include <immintrin.h>

extern uint64_t* AttackPtr;
extern uint64_t Mask;

uint64_t ATTACK(uint64_t occ) {
  return AttackPtr[_pext_u64(occ, Mask)];
}
Output yields this assembly:

Code: Select all

ATTACK(unsigned long):
        mov     rax, QWORD PTR AttackPtr[rip]
        pext    rdi, rdi, QWORD PTR Mask[rip]
        mov     rax, QWORD PTR [rax+rdi*8]
        ret
So the core lookup is literally 3 instructions.
We can compare the two using Agner Fog's instruction tables for modern and older Intel and AMD CPUs.

Then we have throughput and instruction latency - and together with the L1/L2 cache latencies and the number of instruction pipelines, that should give a definite answer to your question :)
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Bob used to do it this way

Post by dangi12012 »

Could be that you are onto something! Chess code that works exclusively inside a 64 KB cache, or even in the 16 x64 registers, could be a huge step in terms of performance if the instruction count is not too high - but my gut feeling is that pext will stay the fastest approach in chess for the foreseeable future.
4 instructions + one L2 cache visit is not too costly - since the compiler will reorder other instructions so that you never pay the price for an L2 lookup.

Waiting for your code - as I don't have the source for plus9dir etc...
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Bob used to do it this way

Post by Mike Sherwin »

dangi12012 wrote: Mon Nov 01, 2021 9:53 pm Could be that you are onto something! Chess code that works exclusively inside a 64 KB cache, or even in the 16 x64 registers, could be a huge step in terms of performance if the instruction count is not too high - but my gut feeling is that pext will stay the fastest approach in chess for the foreseeable future.
4 instructions + one L2 cache visit is not too costly - since the compiler will reorder other instructions so that you never pay the price for an L2 lookup.

Waiting for your code - as I don't have the source for plus9dir etc...
It is not my code. It is Bob's code.

unsigned long long plus9dir[64]; // etc. are bitboards of all the squares a bishop can reach on an empty board in the appropriate direction
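A minimal sketch of how such ray tables can be filled (my own illustration, not Bob's init code; it assumes square 0 = a1 and square 63 = h8):

Code: Select all

#include <stdint.h>

uint64_t plus7dir[64], plus9dir[64], minus7dir[64], minus9dir[64];

void InitDiagonalRays(void) {
    for (int sq = 0; sq < 64; sq++) {
        int r = sq / 8, f = sq % 8;
        /* +9: up-right, +7: up-left, -9: down-left, -7: down-right */
        for (int i = r + 1, j = f + 1; i < 8 && j < 8; i++, j++)
            plus9dir[sq] |= 1ULL << (i * 8 + j);
        for (int i = r + 1, j = f - 1; i < 8 && j >= 0; i++, j--)
            plus7dir[sq] |= 1ULL << (i * 8 + j);
        for (int i = r - 1, j = f - 1; i >= 0 && j >= 0; i--, j--)
            minus9dir[sq] |= 1ULL << (i * 8 + j);
        for (int i = r - 1, j = f + 1; i >= 0 && j < 8; i--, j++)
            minus7dir[sq] |= 1ULL << (i * 8 + j);
    }
}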
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Bob used to do it this way

Post by dangi12012 »

Mike Sherwin wrote: Mon Nov 01, 2021 11:53 pm
dangi12012 wrote: Mon Nov 01, 2021 9:53 pm Could be that you are onto something! Chess code that works exclusively inside a 64 KB cache, or even in the 16 x64 registers, could be a huge step in terms of performance if the instruction count is not too high - but my gut feeling is that pext will stay the fastest approach in chess for the foreseeable future.
4 instructions + one L2 cache visit is not too costly - since the compiler will reorder other instructions so that you never pay the price for an L2 lookup.

Waiting for your code - as I don't have the source for plus9dir etc...
It is not my code. It is Bob's code.

unsigned long long plus9dir[64]; // etc. are bitboards of all the squares a bishop can reach on an empty board in the appropriate direction
Can you find the source?
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Deberger
Posts: 91
Joined: Sat Nov 02, 2019 6:42 pm
Full name: ɹǝƃɹǝqǝᗡ ǝɔnɹꓭ

Re: Bob used to do it this way

Post by Deberger »

dangi12012 wrote: Tue Nov 02, 2021 10:42 am
Mike Sherwin wrote: Mon Nov 01, 2021 11:53 pm
dangi12012 wrote: Mon Nov 01, 2021 9:53 pm Waiting for your code - as I don't have the source for plus9dir etc...
It is not my code. It is Bob's code.

unsigned long long plus9dir[64]; // etc. are bitboards of all the squares a bishop can reach on an empty board in the appropriate direction
Can you find the source?
https://github.com/MichaelB7/Crafty/blo ... nit.c#L221
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Bob used to do it this way

Post by Mike Sherwin »

dangi12012 wrote: Tue Nov 02, 2021 10:42 am
Mike Sherwin wrote: Mon Nov 01, 2021 11:53 pm
dangi12012 wrote: Mon Nov 01, 2021 9:53 pm Could be that you are onto something! Chess code that works exclusively inside a 64 KB cache, or even in the 16 x64 registers, could be a huge step in terms of performance if the instruction count is not too high - but my gut feeling is that pext will stay the fastest approach in chess for the foreseeable future.
4 instructions + one L2 cache visit is not too costly - since the compiler will reorder other instructions so that you never pay the price for an L2 lookup.

Waiting for your code - as I don't have the source for plus9dir etc...
It is not my code. It is Bob's code.

unsigned long long plus9dir[64]; // etc. are bitboards of all the squares a bishop can reach on an empty board in the appropriate direction
Can you find the source?
All Bob posted was the macro. It was just an example of the way he used to do it.
Here is the full quote. It is all I have.
There's another very simple version I have used. I was once asked to do a version of a chess program for the Nintendo DS. Very restricted platform so no bit rotated arrays. I did this kind of thing:

# define AttacksBishop(square, occ) \
(plus7dir[square] ^ plus7dir[LSB(plus7dir[square] & (occ))] | \
plus9dir[square] ^ plus9dir[LSB(plus9dir[square] & (occ))] | \
minus7dir[square] ^ minus7dir[MSB(minus7dir[square] & (occ))] | \
minus9dir[square] ^ minus9dir[MSB(minus9dir[square] & (occ))])

The 4 arrays plus7dir, etc. are pretty obvious. Just 1-bits starting in the right direction and ending at the edge of the board. You then need to identify the blocker with LSB(plus7dir[square] & occ) (note that you use LSB for two directions and MSB for the other two). Then you use that same "direction" again, but move down to the blocking square and turn off all bits beyond it.

With the above, for rooks, queens and bishops, you need eight 64-element vectors, one for each of the 8 possible directions. I think it was maybe 10% slower than rotated when I tested it, but don't hold me to that. You may well be doing something similar. Nice thing about the macro is that the conversion to rotated doesn't change any other code, just the 3 macros for bishops, rooks and queens... That's why the magic conversion took me all of 30 minutes, total...

Part of that was creating a magic index for my mobility scores, in addition to just looking up an attack vector...
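By analogy (my sketch, not part of Bob's quote), the rook and queen versions would use the orthogonal ray tables in the same way, plus1/minus1 for the rank and plus8/minus8 for the file:

Code: Select all

#define AttacksRook(square, occ) \
(plus1dir[square] ^ plus1dir[LSB(plus1dir[square] & (occ))] | \
plus8dir[square] ^ plus8dir[LSB(plus8dir[square] & (occ))] | \
minus1dir[square] ^ minus1dir[MSB(minus1dir[square] & (occ))] | \
minus8dir[square] ^ minus8dir[MSB(minus8dir[square] & (occ))])

#define AttacksQueen(square, occ) \
(AttacksBishop(square, occ) | AttacksRook(square, occ))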
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Bob used to do it this way

Post by dangi12012 »

Mike Sherwin wrote: Tue Nov 02, 2021 11:26 am
There we go Mike:

Code: Select all

extern uint64_t  plus1dir[64];
extern uint64_t  plus7dir[64];
extern uint64_t  plus8dir[64];
extern uint64_t  plus9dir[64];
extern uint64_t  minus1dir[64];
extern uint64_t  minus7dir[64];
extern uint64_t  minus8dir[64];
extern uint64_t  minus9dir[64];

#define LSB(v)    __builtin_ctzll(v)
#define MSB(v)    (63 - __builtin_clzll(v))

#define AttacksBishop(square, occ) \
(plus7dir[square] ^ plus7dir[LSB(plus7dir[square] & (occ))] | \
plus9dir[square] ^ plus9dir[LSB(plus9dir[square] & (occ))] | \
minus7dir[square] ^ minus7dir[MSB(minus7dir[square] & (occ))] | \
minus9dir[square] ^ minus9dir[MSB(minus9dir[square] & (occ))])


uint64_t rookAttack(uint64_t occ, uint64_t sq) {
    return AttacksBishop(sq, occ);
  //return arrAttacks[arrRookBase[sq] + _pext_u64(occ, arrRookMask[sq])];
}
Yields this assembly:

Code: Select all

rookAttack(unsigned long, unsigned long):
        mov     rdx, QWORD PTR plus7dir[0+rsi*8]
        mov     r8, QWORD PTR plus9dir[0+rsi*8]
        push    rbx
        mov     ecx, 63
        mov     rax, QWORD PTR minus7dir[0+rsi*8]
        mov     rsi, QWORD PTR minus9dir[0+rsi*8]
        mov     r11d, ecx
        mov     r10, rdx
        mov     r9, r8
        mov     rbx, rax
        and     r10, rdi
        and     r9, rdi
        and     rbx, rdi
        and     rdi, rsi
        tzcnt   r10, r10
        tzcnt   r9, r9
        lzcnt   rbx, rbx
        lzcnt   rdi, rdi
        xor     rdx, QWORD PTR plus7dir[0+r10*8]
        xor     r8, QWORD PTR plus9dir[0+r9*8]
        sub     r11d, ebx
        sub     ecx, edi
        pop     rbx
        movsx   r11, r11d
        movsx   rcx, ecx
        xor     rax, QWORD PTR minus7dir[0+r11*8]
        xor     rsi, QWORD PTR minus9dir[0+rcx*8]
        or      rdx, r8
        or      rax, rsi
        or      rax, rdx
        ret
Versus this, which is the entire PEXT lookup:

Code: Select all

        mov     rax, QWORD PTR AttackPtr[rip]
        pext    rdi, rdi, QWORD PTR Mask[rip]
        mov     rax, QWORD PTR [rax+rdi*8]
        ret
The performance question can definitely be answered: the macro needs 8 indirect lookups into L1 plus a hard dependency chain.
Pext only needs a single lookup into L2 and is done - and as I said earlier, this latency will not lead to a stall if the compiler is smart enough... which it is.

If pext is not available, normal magic hashing will still be faster, since it only replaces the single pext with an imul and shifts.
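For comparison, a hedged sketch of the classic "fancy magic" lookup that stands in for the pext instruction (MagicEntry and bishopMagics are illustrative names, not code from this thread):

Code: Select all

#include <stdint.h>

typedef struct {
    const uint64_t *attacks;  /* slice of a shared attack table           */
    uint64_t mask;            /* relevant occupancy mask for this square  */
    uint64_t magic;           /* multiplier found offline by search       */
    unsigned shift;           /* 64 - popcount(mask)                      */
} MagicEntry;

extern MagicEntry bishopMagics[64];

uint64_t MagicBishopAttacks(int sq, uint64_t occ) {
    const MagicEntry *m = &bishopMagics[sq];
    /* and + imul + shift instead of a single pext */
    return m->attacks[((occ & m->mask) * m->magic) >> m->shift];
}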

Greetings - and thank you for asking :)
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Bob used to do it this way

Post by hgm »

Counting instructions can be a bit misleading when memory accesses are involved. Your (magic bitboard) implementation requires access to a huge table. Bob's method only uses a few small tables, which will permanently reside in L1.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Bob used to do it this way

Post by dangi12012 »

hgm wrote: Wed Nov 03, 2021 10:51 am Counting instructions can be a bit misleading when memory accesses are involved. Your (magic bitboard) implementation requires access to a huge table. Bob's method only uses a few small tables, which will permanently reside in L1.
Counting instructions is not merely misleading, it is totally useless. Modern compilers interleave instructions - and the CPU side is not even really x64 anymore - it's just the interface, and the actual microarchitecture does crazy things like uop fusion etc. That's why a 4 GHz processor of today is more than twice as fast in single-threaded applications as a 4 GHz E8400 from 13 years ago, even when memory is not used at all.

What can be said for most architectures - and for CUDA even more than for x64 - is that an indirect lookup is extremely costly when the index needs a lot of calculation, like this:
minus9dir[MSB(minus9dir[square] & (occ))]

Cache accesses will get interleaved with other work, so that you actually never have to pay the latency cost (when there are other independent instructions pending - as in chess, where the pieces are independent).
It can be confirmed with https://quick-bench.com/ that the macro is much slower.
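For instance, a comparison harness on quick-bench (which uses the Google Benchmark API) might look roughly like this - AttacksBishop is the macro from above and PextBishop stands for a pext-based equivalent, both assumed to be defined earlier in the same file; only interior squares with blockers on every ray are used, so LSB/MSB never see a zero argument:

Code: Select all

static void BM_Macro(benchmark::State& state) {
  uint64_t occ = 0x00FF00000000FF00ULL;      // blockers on ranks 2 and 7
  for (auto _ : state)
    for (int sq : {27, 28, 35, 36})          // d4, e4, d5, e5
      benchmark::DoNotOptimize(AttacksBishop(sq, occ));
}
BENCHMARK(BM_Macro);

static void BM_Pext(benchmark::State& state) {
  uint64_t occ = 0x00FF00000000FF00ULL;
  for (auto _ : state)
    for (int sq : {27, 28, 35, 36})
      benchmark::DoNotOptimize(PextBishop(sq, occ));
}
BENCHMARK(BM_Pext);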

That's why L1 vs L2 is a non-issue. It would be different if the algorithm fit inside the 16 registers entirely - that can be an order of magnitude faster. But with these things you had better benchmark on the target architecture.

Greetings
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer