Actually, the search speed does scale linearly with thread count; it is the algorithmic efficiency of AB that doesn't.

j.t. wrote: ↑Mon Nov 08, 2021 1:32 am
> The search speed doesn't scale linearly with thread count. At some point, when you use multiple thousands of threads instead of a few hundred, the algorithmic loss is practically the same as the advantage you get from more cores. That's what I mean by "the longer you postpone evaluating positions, the more you lose the advantages of alpha-beta". Considering that GPUs are slower per thread and have additional overhead compared to CPU threads, there might be no advantage at all to using a GPU instead of hundreds of CPU threads to parallelize alpha-beta (but as I said earlier, this is mostly speculation on my part). Of course, using the GPU's many threads to evaluate a huge neural network is a viable solution. But I don't think that is what the original question was about, as it would be basically the same as what Lc0 does.

Daniel Shawul wrote: ↑Sun Nov 07, 2021 10:02 pm
> This is wrong! Modern AB engines use many threads, 256 threads with Lazy SMP, so you can batch together eval requests from different threads.

j.t. wrote: ↑Sat Nov 06, 2021 12:11 am
> To make any use of the GPU you need to collect a pretty big number of positions that you want to evaluate. The longer you postpone evaluating positions, the more you lose the advantages of alpha-beta. Together with the extreme pruning engines do today, the relatively efficient NNUE algorithm, the overhead that comes from using a GPU, and, last but not least, the complexity of GPU code, I assume people just think it's not worth it.
> But I'll be honest and say that I've never tried this myself (I thought about it, but I never got familiar enough with GPUs to actually try it), so maybe I misjudged the potential of GPUs.
Each thread evaluates exactly one position, and you then effectively have a batch size of 256, which should be enough to fill up a GPU with a decent net size (not NNUE, though, since that is too tiny). I already do this in Scorpio and it "works", but it is darn slow even with a tiny 6x64 net ...
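To make that concrete, here is a minimal sketch of one way to batch per-thread eval requests: the last thread to join a batch flushes it to the GPU, and everyone else blocks until the scores come back. The Position type, the fixed batch size, and gpu_evaluate_batch are placeholders of mine, not Scorpio's actual code:

```cpp
#include <condition_variable>
#include <memory>
#include <mutex>
#include <vector>

constexpr size_t BATCH_SIZE = 256;     // one eval request per search thread

struct Position { };                   // placeholder for the engine's position

// Stub so the sketch links; the real version would run one batched NN
// inference on the GPU over all n positions.
void gpu_evaluate_batch(const Position* const* pos, float* out, size_t n) {
    for (size_t i = 0; i < n; ++i) out[i] = 0.0f;
}

class BatchedEvaluator {
    struct Batch {
        std::vector<const Position*> pos;
        std::vector<float>           score;
        bool                         done = false;
    };
    std::mutex              m;
    std::condition_variable cv;
    std::shared_ptr<Batch>  cur = std::make_shared<Batch>();

public:
    // Each search thread submits exactly one position and blocks until the
    // batch it joined has been evaluated. (A real implementation would also
    // flush on a timeout, in case fewer than BATCH_SIZE threads show up.)
    float evaluate(const Position& p) {
        std::unique_lock<std::mutex> lk(m);
        std::shared_ptr<Batch> b = cur;
        size_t slot = b->pos.size();
        b->pos.push_back(&p);

        if (b->pos.size() == BATCH_SIZE) {        // last thread in flushes
            cur = std::make_shared<Batch>();      // open a fresh batch
            b->score.resize(BATCH_SIZE);
            lk.unlock();                          // run inference unlocked
            gpu_evaluate_batch(b->pos.data(), b->score.data(), BATCH_SIZE);
            lk.lock();
            b->done = true;
            cv.notify_all();                      // wake the other waiters
        } else {
            cv.wait(lk, [&] { return b->done; });
        }
        return b->score[slot];
    }
};
```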
Engines moved from YBW-type parallel search to Lazy SMP / ABDADA exactly because of this scaling problem, even on the CPU.
You don't expect to implement AB using strictly sequential evaluation of positions on the GPU, do you? That would be ridiculous; even YBW has some overhead! Once CPU engines started using Lazy SMP / ABDADA, they became as easy to implement and as scalable as a parallel MCTS search, which is a good thing! There is a harder-to-implement equivalent version of AB that is identical to MCTS except for its backup operator (the alpha-beta rollouts version), but Lazy SMP / ABDADA is already good enough that there is no need for it.
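Since ABDADA keeps coming up, here is a bare-bones sketch of the idea: coordinate the threads through a shared "being searched" marker so they fan out over different moves. The busy set, hashing, and move-generation stubs below are toy stand-ins of mine, not any engine's actual code:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_set>
#include <vector>

// Toy stand-ins so the sketch is self-contained.
struct Position { uint64_t key = 0; };
using Move = int;
std::vector<Move> generate_moves(const Position&) { return {1, 2, 3}; }
void make_move(Position& p, Move m)   { p.key ^= 0x9E3779B97F4A7C15ull * m; }
void unmake_move(Position& p, Move m) { p.key ^= 0x9E3779B97F4A7C15ull * m; }
int  evaluate(const Position&)        { return 0; }

// Shared "currently being searched" set. A real engine folds this flag
// into its transposition-table entries instead of a separate structure.
std::mutex busy_mtx;
std::unordered_set<uint64_t> busy;
bool mark_busy(uint64_t k)   { std::lock_guard<std::mutex> l(busy_mtx); return busy.insert(k).second; }
void unmark_busy(uint64_t k) { std::lock_guard<std::mutex> l(busy_mtx); busy.erase(k); }

// Simplified ABDADA: the first pass skips moves another thread is already
// searching and defers them; the second pass picks them up, by which time
// their results are usually in the transposition table and come back cheap.
int search(Position& pos, int alpha, int beta, int depth) {
    if (depth == 0) return evaluate(pos);

    std::vector<Move> deferred;
    bool first = true;
    for (Move m : generate_moves(pos)) {
        make_move(pos, m);
        if (!first && !mark_busy(pos.key)) {   // someone else is on it
            deferred.push_back(m);
            unmake_move(pos, m);
            continue;
        }
        int score = -search(pos, -beta, -alpha, depth - 1);
        if (!first) unmark_busy(pos.key);
        unmake_move(pos, m);
        first = false;
        if (score >= beta) return score;       // cutoff: deferred moves moot
        if (score > alpha) alpha = score;
    }
    for (Move m : deferred) {                  // second pass
        make_move(pos, m);
        int score = -search(pos, -beta, -alpha, depth - 1);
        unmake_move(pos, m);
        if (score >= beta) return score;
        if (score > alpha) alpha = score;
    }
    return alpha;
}
```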
Also, if you need a batch size of more than 256, you can do "prefetching" of positions from the subtree, which I already do.
So you can easily go above 256 with that. Since the NN is the most time-consuming part of the search, these evaluated positions will be stored in the cache and soon become useful. The downside of using many threads in the Lazy SMP / MCTS way is that they are bound to follow similar lines, and two threads may want to evaluate the same position (i.e. a collision!). You do what you can to avoid it (e.g. virtual loss for MCTS, or using the transposition table to coordinate the search as in ABDADA), and if you still get a collision, which you will, you give up and prefetch a node in the subtree.
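A minimal sketch of that collision handling, assuming a simple tree with virtual-loss counters. The Node layout, expand(), and the selection rule are illustrative stand-ins of mine, not Scorpio's or Lc0's code:

```cpp
#include <atomic>
#include <climits>
#include <vector>

struct Node {
    std::vector<Node*> children;
    std::atomic<int>   visits{0};
    std::atomic<int>   vloss{0};        // virtual-loss counter
    std::atomic<bool>  in_flight{false};// NN eval already claimed by a thread
};

// Toy expansion so the sketch is self-contained; a real engine generates the
// leaf's legal moves here (and would need its own locking around expansion).
void expand(Node* n) {
    for (int i = 0; i < 4; ++i) n->children.push_back(new Node());
}

// Stand-in selection rule; a real engine uses PUCT here, with the virtual
// loss folded in so heavily loaded lines look less attractive.
Node* best_child(Node* n) {
    Node* best = nullptr;
    int lowest = INT_MAX;
    for (Node* c : n->children) {
        int s = c->visits.load() + c->vloss.load();
        if (s < lowest) { lowest = s; best = c; }
    }
    return best;
}

// Returns a node this thread now owns for NN evaluation (nullptr = retry).
// The vloss increments along the path are reverted when the evaluation is
// backed up, which is omitted here.
Node* select_for_eval(Node* root) {
    Node* n = root;
    while (!n->children.empty()) {
        n = best_child(n);
        n->vloss.fetch_add(1);          // steer other threads elsewhere
    }
    bool expected = false;
    if (n->in_flight.compare_exchange_strong(expected, true))
        return n;                       // no collision: evaluate this leaf

    // Collision: someone already claimed this leaf. Give up on it and
    // "prefetch" instead: expand one ply and claim a child, whose eval
    // goes into the cache and will almost certainly be wanted next visit.
    if (n->children.empty()) expand(n);
    for (Node* c : n->children) {
        expected = false;
        if (c->in_flight.compare_exchange_strong(expected, true))
            return c;
    }
    return nullptr;                     // whole neighborhood busy; retry
}
```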
Realize that the MCTS algorithm (Lc0) is not much different from the CPU algorithms once you start using hundreds of threads.
You can find a sweet spot for your neural network size to compensate for the overhead of offloading to the GPU, overlap evaluation with something else (see the sketch below), etc. ... many ideas to try here. I was trying to get a tiny 2x32 conv net working with my AB search on the GPU before NNUE came in and showed it is possible to do that on the CPU directly. A 2x32 conv net is actually not that much slower than a fat NNUE net.
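On the "overlap evaluation with something else" point, one simple pattern is double buffering: the CPU fills the next batch while the GPU evaluates the previous one, so neither side idles. A sketch under stubbed-out helpers (gpu_evaluate_batch, next_position_from_search, and back_up are placeholders of mine):

```cpp
#include <future>
#include <vector>

struct Position { };                          // placeholder

// Stubbed batched inference so the sketch compiles and runs.
std::vector<float> gpu_evaluate_batch(std::vector<Position> batch) {
    return std::vector<float>(batch.size(), 0.0f);
}
Position next_position_from_search() { return {}; }    // stand-in for search
void back_up(const std::vector<Position>&, const std::vector<float>&) { }

int main() {
    constexpr size_t BATCH = 256;
    std::vector<Position> filling, in_flight;
    std::future<std::vector<float>> pending;

    for (int iter = 0; iter < 4; ++iter) {    // a few demo iterations
        while (filling.size() < BATCH)        // CPU fills the next batch...
            filling.push_back(next_position_from_search());

        if (pending.valid())                  // ...while the GPU chews on
            back_up(in_flight, pending.get());//    the previous one

        in_flight = std::move(filling);
        filling.clear();
        pending = std::async(std::launch::async,
                             gpu_evaluate_batch, in_flight);
    }
    if (pending.valid()) back_up(in_flight, pending.get());
}
```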
Stockfish has been increasing its NNUE size since then, and I wouldn't be surprised if the optimum is a network much larger than the current NNUE size, one that runs on the GPU. Stockfish's NNUE still lags far behind a 20-block net in terms of the knowledge it has, so there is that path to try.