If you are reading this: as of 06.03.2022, this is the fastest slider lookup algorithm ever created!
No x86-64 algorithm ever came close, and no known ASIC or FPGA implementation has touched this performance.
We are scratching at the 100 billion lookups/second mark here, which is insane given that the best CPU algorithm currently manages 10 gigalookups on 16 cores.
tcusr wrote: ↑Mon Mar 07, 2022 9:40 pm
how expensive is a branch inside a GPU? why don't you try kogge stone or dumb7fill just for D1/D2?
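For readers following along, the dumb7fill / Kogge-Stone occluded fill mentioned in the question can be sketched like this (north direction only, plain C++, my own naming; the real thing generalizes to all eight ray directions with wrap masks):

```cpp
#include <cstdint>

// Kogge-Stone occluded fill, north direction only.
// gen = sliders, pro = empty squares ("propagators").
static uint64_t northOccluded(uint64_t gen, uint64_t pro) {
    gen |= pro & (gen << 8);
    pro &=       (pro << 8);
    gen |= pro & (gen << 16);
    pro &=       (pro << 16);
    gen |= pro & (gen << 32);
    return gen;
}

// North attacks: shift the fill one more step so the first blocker
// is included and the slider's own square is not.
static uint64_t northAttacks(uint64_t sliders, uint64_t empty) {
    return northOccluded(sliders, empty) << 8;
}
```

For a rook on a1 with a blocker on a4 this yields a2, a3 and a4 - branch-free, setwise for all sliders at once, which is why it maps well to GPUs.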
Kogge-Stone and dumb7fill are in there already and perform quite well. But I always had the bit-reverse intrinsics in the back of my head.
Turns out that was the correct approach - 100 Billion Lookups/s achieved.
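The bit-reverse trick referred to here is, as far as I can tell, the hyperbola-quintessence style line attack: on the GPU the reversal is a single `__brevll` instruction, on the CPU it has to be emulated. A minimal sketch under that assumption (plain C++, my own naming, with a portable loop standing in for the intrinsic):

```cpp
#include <cstdint>

// Portable 64-bit bit reversal; on CUDA this would be the single
// __brevll(x) instruction, which is the whole point of the trick.
static uint64_t reverse64(uint64_t x) {
    uint64_t r = 0;
    for (int i = 0; i < 64; ++i) { r = (r << 1) | (x & 1); x >>= 1; }
    return r;
}

// Hyperbola-quintessence line attack:
//   slider = single-bit board of the sliding piece,
//   occ    = all occupied squares,
//   mask   = the ray line (file, diagonal or anti-diagonal) through
//            the slider's square.
static uint64_t lineAttacks(uint64_t slider, uint64_t occ, uint64_t mask) {
    uint64_t o       = occ & mask;
    uint64_t forward = o - 2 * slider;                // (o - 2s)
    uint64_t rev     = reverse64(reverse64(o) - 2 * reverse64(slider));
    return (forward ^ rev) & mask;                    // ^ reversed direction
}
```

With the file-a mask, a rook on a1 and a blocker on a4, this returns exactly a2, a3, a4 - no table lookup anywhere, just arithmetic and two reversals.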
So what is the surprise with CUDA?
Lookups are expensive: even a constexpr inline array lookup (which is inlined, and not the same as __constant__) is ~8x slower than executing 10-12 instructions!
So I can confirm here 100% that calculating the knight/king moves from scratch is cheaper than looking up one single element in a 64-slot array. Mailbox algorithms are DOA on the current GPU architecture.
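As an illustration of calculating from scratch: the standard setwise knight-attack computation needs only about a dozen shift/mask/or instructions and touches no memory at all. A plain C++ sketch (function name mine):

```cpp
#include <cstdint>

// Setwise knight attacks for every knight on the board at once:
// pure shifts, ands and ors - roughly 10-12 instructions, no memory.
static uint64_t knightAttacks(uint64_t knights) {
    uint64_t l1 = (knights >> 1) & 0x7f7f7f7f7f7f7f7fULL; // one file west
    uint64_t l2 = (knights >> 2) & 0x3f3f3f3f3f3f3f3fULL; // two files west
    uint64_t r1 = (knights << 1) & 0xfefefefefefefefeULL; // one file east
    uint64_t r2 = (knights << 2) & 0xfcfcfcfcfcfcfcfcULL; // two files east
    uint64_t h1 = l1 | r1;
    uint64_t h2 = l2 | r2;
    return (h1 << 16) | (h1 >> 16) | (h2 << 8) | (h2 >> 8);
}
```

A knight on e4 yields the eight squares d2, f2, c3, g3, c5, g5, d6, f6 - with no table in sight, which is exactly why this beats a 64-slot lookup on the GPU.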
To answer your question: most branches get compiled away into branchless code, so branching is very, very cheap. What is not cheap is thread divergence, but that is another topic. Also, I am not using the advanced CUDA scheduling features like graphs or streams yet.
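A hypothetical illustration of the "branches are cheap" point: a data-dependent two-way choice like the one below usually compiles to a single predicated select (SEL on the GPU, cmov on x86) instead of an actual jump; a cost only appears when threads of the same warp take different paths.

```cpp
#include <cstdint>

// Typically compiled branch-free: both operands are cheap register
// values, so the compiler emits a select rather than a jump.
static uint64_t pickAttacks(bool isRook, uint64_t rook, uint64_t bishop) {
    return isRook ? rook : bishop;
}
```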
What is still to do:
I want to see if my AST generator leads to a new algorithm - http://www.talkchess.com/forum3/viewtop ... 80#p922256
But after that I will take a hard look at CUDA cores - not even the new popxor neural network first; I want to see if I can solve sliding pieces with matrix multiplication directly. If you can take the board, multiply it by some matrix, and get the correct result or a unique offset directly, that should be fast as well.
I would not mind having all chess code eaten by linear algebra.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
It's the expected performance - and a great example of why benchmarking is hard. I specifically ran Bitrotation as both the first and the last algorithm so everyone here can see the impact of missing warmup code: 55.99 vs 63.09.
But I would not call that "looking tired" at all.
You probably would not even get close to this performance with a 64-core Ryzen system. The cool thing about "inline calculation" in CUDA:
there is no memory pressure, so this will scale and scale and scale with future architectures and multiple GPUs.
I calculated it today. The 3090 reaches 100 Billion Lookups/s. I am writing results to memory to verify and force the compiler to emit correct code.
With 100 billion lookups/s we need 800 Gbit/s of memory bandwidth, which is roughly 1/8th of the theoretical 936.2 GByte/s.
An algorithm like PEXT or fancy magic would need to read 2-3x that amount of memory before the result is written back.
In any case, the calculation is slow enough not to be memory bound at the moment, which leaves room for algorithmic improvement. But the point I want to make is that this benchmark writes every result to memory, whereas in an actual move generator or engine the result would only be an intermediate value that gets consumed elsewhere. Embedded in another algorithm, this sliding-piece code is therefore faster than the numbers posted.
dangi12012 wrote: ↑Fri Mar 11, 2022 1:35 pm
Its the expected performance - and a great example why benchmarking is hard. I specifically set Bitrotation as first and last algo so everyone here can see the impact when you are missing some warmup code. 55.99 vs 63.09
I had already wondered why you ran Bitrotation twice, so I left the 2nd one out of the results. You seem to use CUDA 11.4; I used CUDA 11.6 - this doesn't make much difference.
What strikes me is that the 'Switch Lookup' seems to be 10 times slower on RTX-2000 devices compared to RTX-3000 devices, I wonder why this is.
The above link is GLSL, which compiles into SPIR-V - the equivalent of PTX.
Vulkan gives a better memory model than CUDA and runs on Android, PlayStation, Xbox, iOS (Apple M1! and macOS), and almost everywhere else.
It will be interesting to see whether the RDNA iGPU of a Ryzen performs faster than the 8 cores of the system itself. (Probably yes.)
If a real unified hardware memory model is available, as on the M1 or on consoles, I am certain that hybrid chess engines can easily offload some work to the iGPU, because there the host-device memory latency is nonexistent (still not single function calls, but at least 1M invocations of bulk work done in parallel).
All in all, I can see that a shift to Vulkan would increase the boilerplate code to 1000x the current state - but there are ways to hide that code in a common header, etc. It is still the right thing to do for computer chess.
One interesting question is whether Vulkan on Nvidia could be faster than a CUDA implementation.
My gut feeling is that there has to be a backend in the Nvidia driver that translates any GPGPU language into the native hardware language of the architecture - but there will be implementation-specific differences.
SPIR-V and PTX are too different to produce the same core assembly and the same performance.
Not to mention that GLSL opens the door to exotic hardware that is not a GPU to begin with. This will be interesting to see - but Vulkan is the better road forward in any case.