smatovic wrote: ↑Sun Jan 16, 2022 5:59 am
Perft on gpu is one thing, a chess playing engine another.
That is the main problem with having the search on the GPU. I don't know how you did Zeta, but you were probably not using
one CUDA thread per standalone search (more likely a warp, or maybe even a block?).
My attempt at implementing MCTS on the GPU for Hex, using one CUDA thread per standalone search, was successful, since the nature
of the game keeps warp divergence low. For chess, however, this approach does not even leave room to store the generated move list.
I don't recall the details, but I believe I used some sort of bitfields for that purpose, and I am not sure register spilling was avoided even then.
Here is my attempt at MCTS on the GPU for chess, from some 10 years ago:
https://github.com/dshawul/GpuHex/blob/chess/hex.cu
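To put rough numbers on the move-list problem, here is a minimal host-side C++ sketch. The 16-bit move layout is a common encoding and is an assumption for illustration, not taken from the linked code. The 256-move capacity and the 255-registers-per-thread cap (recent NVIDIA architectures) are also assumptions.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical 16-bit move layout (an assumption, not the linked code):
// bits 0-5 = from square, bits 6-11 = to square, bits 12-15 = flags.
using PackedMove = std::uint16_t;

constexpr PackedMove pack_move(unsigned from, unsigned to, unsigned flags) {
    return static_cast<PackedMove>((from & 0x3Fu) |
                                   ((to & 0x3Fu) << 6) |
                                   ((flags & 0xFu) << 12));
}

constexpr unsigned move_from(PackedMove m)  { return m & 0x3Fu; }
constexpr unsigned move_to(PackedMove m)    { return (m >> 6) & 0x3Fu; }
constexpr unsigned move_flags(PackedMove m) { return (m >> 12) & 0xFu; }

// Bytes one thread needs for a move list with the given capacity.
constexpr std::size_t move_list_bytes(std::size_t max_moves) {
    return max_moves * sizeof(PackedMove);
}

// Assumed per-thread register budget: 255 registers of 4 bytes each.
constexpr std::size_t kRegisterBytesPerThread = 255 * 4;
```

Under these assumptions, a 256-entry list of packed moves takes 512 bytes per thread, about half of the theoretical register budget before any board state or search variables are counted. A dynamically indexed array would typically end up in local memory anyway, which is consistent with the spilling concern above.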
The best approach is what AlphaGo demonstrated: do the search on the CPU and use the GPU for the heavy lifting of evaluation, especially now that NNs are the standard evaluation. Batching multiple evaluations amortizes the data-transfer latency that is the main bottleneck.
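The batching idea can be sketched roughly as follows, in plain C++. `Position`, `EvalBatcher`, and the `BatchEval` callback are hypothetical names for illustration. The callback stands in for whatever transfers a batch to the GPU and runs the network, and no real inference API is assumed.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an encoded position (a real engine would
// hold the network's input planes here).
struct Position { int id; };

// Collects leaf positions from the CPU search and evaluates them in
// one call per batch, so the transfer cost is paid once per batch.
class EvalBatcher {
public:
    using BatchEval =
        std::function<std::vector<float>(const std::vector<Position>&)>;

    EvalBatcher(std::size_t batch_size, BatchEval eval)
        : batch_size_(batch_size), eval_(std::move(eval)) {}

    // Queue one leaf; evaluate the whole batch once it is full.
    void submit(const Position& p) {
        pending_.push_back(p);
        if (pending_.size() >= batch_size_) flush();
    }

    // Evaluate whatever is queued, even if the batch is not full.
    void flush() {
        if (pending_.empty()) return;
        std::vector<float> values = eval_(pending_);
        ++device_calls_;
        results_.insert(results_.end(), values.begin(), values.end());
        pending_.clear();
    }

    std::size_t device_calls() const { return device_calls_; }
    const std::vector<float>& results() const { return results_; }

private:
    std::size_t batch_size_;
    BatchEval eval_;
    std::vector<Position> pending_;
    std::vector<float> results_;
    std::size_t device_calls_ = 0;
};
```

With a batch size of 4, ten submitted positions followed by a final `flush()` cost three evaluator calls instead of ten. A real engine would also have to route each value back to the search thread that requested it, which this sketch leaves out.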
Hence, there is no point in having a fast GPU move generator unless you have the right search algorithm to go with it. I just don't see how you can avoid warp divergence in chess, even with something GPU-friendly like MCTS (Hex was pretty good, though). Your approach of using a warp/block for a standalone search is probably the more feasible one in this regard, especially if there are enough vector operations to keep a warp engaged most of the time.
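A back-of-the-envelope model of why divergence hurts the one-thread-per-search layout, as a small C++ sketch. It assumes a deliberately simplified lockstep model in which a warp finishes only when its slowest lane does. `warp_utilization` is a hypothetical helper, and the workloads are made-up numbers, not measurements.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Simplified lockstep model (an assumption): all lanes of a warp run
// until the slowest lane is done, so shorter lanes sit idle.
// work[i] = number of steps lane i needs (e.g. playout length).
// Returns the fraction of lane-steps that do useful work.
double warp_utilization(const std::vector<int>& work) {
    if (work.empty()) return 0.0;
    const int longest = *std::max_element(work.begin(), work.end());
    if (longest <= 0) return 0.0;
    const long long useful =
        std::accumulate(work.begin(), work.end(), 0LL);
    return static_cast<double>(useful) /
           (static_cast<double>(longest) *
            static_cast<double>(work.size()));
}
```

If all 32 lanes need 10 steps each, roughly the Hex-like case where every playout does similar work, utilization is 1.0. If half the lanes need 40 steps and half need 10, it drops to 0.625, and wider spreads in move counts or playout lengths make it worse.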