Bob used to do it this way

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Bob used to do it this way

Post by hgm »

dangi12012 wrote: Wed Nov 03, 2021 11:40 pm
hgm wrote: Wed Nov 03, 2021 10:51 am Counting instructions can be a bit misleading when memory accesses are involved. Your (magic bitboard) implementation requires access to a huge table. Bob's method only uses a few small tables, which will permanently reside in L1.
Counting instructions is not merely misleading, it is totally useless. Modern compilers interleave instructions, and the CPU side is not even x64 anymore: x64 is just the interface, and the actual microarchitecture does crazy things like µop fusion. That's why a 4 GHz processor of today is more than twice as fast in single-threaded applications as a 4 GHz E8400 from 13 years ago, even when memory is not used at all.

What can be said for most architectures (and for CUDA even more than for x64) is that an indirect lookup is extremely costly when the index itself requires a lot of calculation, like this:
minus9dir[MSB(minus9dir[square] & (occ))]

Cache accesses get interleaved with other work, so you effectively never pay the latency cost (as long as there are other independent instructions pending, as in chess, where the pieces are independent).
It can be confirmed on https://quick-bench.com/ that the macro is much slower.

That's why L1 vs L2 is a non-issue. It would be different if the algorithm fit entirely inside the 16 registers; that can be an order of magnitude faster. But with these things you had better benchmark on the target architecture.

Greetings
That depends on whether the code is latency-bound or throughput-bound. Bitboard programs use these attack getters very frequently. And even though out-of-order execution works wonders, it cannot get around throughput bottlenecks. In particular, the hardware that resolves cache misses can usually handle only a single miss at a time. If you get a second cache miss while the first is still being processed, the entire CPU might stall until the first miss is resolved. And even if the treatment of cache misses used a scheduling queue, at some point that queue would simply fill up. The entire reorder buffer would fill with instructions directly or indirectly waiting for data being fetched from memory or a higher cache level, and execution would slow down to the rate at which the single unit that processes cache misses can handle them. And even handling an L1 miss takes many clock cycles.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Bob used to do it this way

Post by dangi12012 »

Do we have a better name for this algorithm?
It is called "Improved classical approach" https://www.chessprogramming.org/Classical_Approach
Or "Bob Mike Lookup"

// Robert Hyatt's and Michael Sherwin's classical bitboard approach
Any input from the authors?
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer