Why is there no bulk evaluation happening in Alpha Beta?

Discussion of chess software programming and technical issues.

Moderator: Ras

Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Why is there no bulk evaluation happening in Alpha Beta?

Post by Daniel Shawul »

j.t. wrote: Mon Nov 08, 2021 1:32 am
Daniel Shawul wrote: Sun Nov 07, 2021 10:02 pm
j.t. wrote: Sat Nov 06, 2021 12:11 am To make any use of the GPU you need to collect a pretty large number of positions that you want to evaluate. The longer you postpone evaluating positions, the more you lose the advantages of alpha-beta. Together with the extreme pruning engines do today, the relatively efficient NNUE algorithm, the overhead that comes from using a GPU, and last but not least the complexity of GPU code, I assume that people just think that it's not worth it.

But I'll be honest and say that I've never tried this myself (I thought about it, but I've never got familiar enough with GPUs to actually try it), so maybe I misjudged the potential of GPUs.
This is wrong! Modern AB engines use many threads, e.g. 256 threads with lazy SMP, so you can batch together eval requests from different threads.
Each thread evaluates exactly one position, and you then effectively have a batch size of 256, which should be enough to fill up a GPU with a decent net size (not NNUE, though, since that is too tiny). I already do this in Scorpio and it "works", but it is darn slow even with a tiny 6x64 net ...
The search speed doesn't scale linearly with thread count. At some point, when you use several thousand threads instead of a few hundred, the algorithmic loss is practically the same as the advantage you get from more cores. That's what I mean by "the longer you postpone evaluating positions, the more you lose the advantages of alpha-beta". Considering that GPUs are slower per thread and have additional overhead compared to CPU threads, there might be no advantage at all in using a GPU instead of hundreds of CPU threads to parallelize alpha-beta (but as I said earlier, this is mostly speculation on my part). Of course, using the GPU's many threads to evaluate a huge neural network is a viable solution. But I don't think that this is what the original question was about, as it would be basically the same as what Lc0 does.
Actually the search speed does scale linearly with thread count, but the algorithmic efficiency of AB doesn't.
Engines moved from YBW-type parallel search to lazy SMP / ABDADA exactly because of this, even on the CPU.
You don't expect to implement AB using a strictly sequential evaluation of positions on the GPU, do you? That would be ridiculous; even YBW has some overhead! Once CPU engines started using lazy SMP / ABDADA, they became as easy to implement and as scalable as parallel MCTS, which is a good thing! There is a harder-to-implement equivalent version of AB that is identical to MCTS except for its backup operator (the alpha-beta rollouts version). But lazy SMP / ABDADA is already good enough that there is no need for it.
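To make the contrast concrete: in lazy SMP the only coordination between threads is a shared transposition table - no split points and no thread synchronization as in YBW. A toy sketch of that skeleton (hypothetical names; pruning windows and the lock-free TT tricks real C++ engines use are omitted for brevity):

```python
import threading

# Shared transposition table: in lazy SMP this is the ONLY coordination
# between threads (CPython dict operations are atomic under the GIL).
TT = {}

def children(pos):
    return [pos * 2 + 1, pos * 2 + 2]   # toy game tree, branching factor 2

def evaluate(pos):
    return pos % 7 - 3                  # placeholder static eval

def search(pos, depth):
    key = (pos, depth)
    if key in TT:                       # possibly stored by another helper
        return TT[key]
    if depth == 0:
        score = evaluate(pos)
    else:
        score = max(-search(c, depth - 1) for c in children(pos))
    TT[key] = score
    return score

def lazy_smp(root, depth, n_helpers=4):
    # All helpers search the same root; depth offsets desynchronize them,
    # and each thread's work reaches the others through the shared TT.
    helpers = [threading.Thread(target=search, args=(root, depth + i % 2))
               for i in range(n_helpers)]
    for t in helpers:
        t.start()
    for t in helpers:
        t.join()
    return search(root, depth)          # answered mostly from the TT
```

The point is the absence of structure: a helper can die or fall behind at any moment and nothing breaks, which is why this scheme scales to many threads so much more easily than YBW-style tree splitting.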

Also, if you need a batch size of more than 256, you can do "prefetching" of positions from the subtree, which I already do.
So you can easily go above 256 with that. Since the NN is the most time-consuming part of the search, these evaluated positions will be stored in cache and soon become useful. The downside of using many threads in the lazy SMP / MCTS way is that they are bound to follow similar lines, and two threads may want to evaluate the same position (i.e. a collision!). You do what you can to avoid it (e.g. virtual loss for MCTS, or using the transposition table to coordinate the search as in ABDADA), and if you still get a collision, which you will, you give up and prefetch a node from the subtree.
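The batching described here can be sketched as a thread-safe collector: each search thread submits its position and blocks on a future, and whichever thread fills the batch runs one bulk evaluation (the GPU call) for everybody. A hypothetical minimal version (Scorpio's real code is C++/CUDA and adds the prefetch fallback above; names here are illustrative):

```python
import threading
from concurrent.futures import Future

class BatchEvaluator:
    """Collects eval requests from many search threads and services
    them with a single bulk call, as a GPU-friendly AB engine might."""

    def __init__(self, batch_size, evaluate_batch):
        self.batch_size = batch_size          # e.g. 256 with lazy SMP
        self.evaluate_batch = evaluate_batch  # one call scores a whole list
        self.lock = threading.Lock()
        self.pending = []                     # list of (position, Future)

    def request_eval(self, position):
        fut = Future()
        flush = None
        with self.lock:
            self.pending.append((position, fut))
            if len(self.pending) >= self.batch_size:
                flush, self.pending = self.pending, []
        if flush is not None:                 # this thread runs the bulk eval
            scores = self.evaluate_batch([p for p, _ in flush])
            for (_, f), score in zip(flush, scores):
                f.set_result(score)           # wakes the waiting searchers
        return fut

# From a search thread:
#   score = evaluator.request_eval(position).result()
```

A real implementation also needs a timeout or prefetch path so a partially filled batch doesn't stall the searchers indefinitely, which is exactly the problem the subtree prefetching addresses.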

Realize that the MCTS algorithm (Lc0) is not much different from the CPU algorithms once you start using hundreds of threads.

You can find a sweet spot for your neural network size to compensate for the overhead of offloading to the GPU, overlap evaluation with something else, etc. There are many ideas to try here. I was trying to get a tiny 2x32 conv net working with my AB search on the GPU before NNUE came along and showed it is possible to do that on the CPU directly. A 2x32 conv net is actually not that much slower than a fat NNUE net.

Stockfish has been increasing the NNUE size since then, and I wouldn't be surprised if the optimum is a network much larger than the current NNUE size, one that runs on the GPU. Stockfish NNUE still lags far behind a 20b net in terms of the knowledge it has, so there is that path to try.
User avatar
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: Why is there no bulk evaluation happening in Alpha Beta?

Post by j.t. »

Daniel Shawul wrote: Mon Nov 08, 2021 1:55 am Actually the search speed does scale linearly with thread count, but the algorithmic efficiency of AB doesn't.
Engines moved from YBW-type parallel search to lazy SMP / ABDADA exactly because of this, even on the CPU.
I always thought that lazy SMP is used because it is simpler to implement, and not because it scales better than YBW.

Did you find any improvement in time to depth or playing strength when using the GPU over a CPU? I mean not in the context of bigger neural networks, but with the smaller NNUE evaluations, or even the HCE, that alpha-beta engines are using (in case you've experimented with something like that)? As far as I can see, the original question was about evaluating thousands of different positions at the same time on the GPU, so that's what I've been thinking about while writing these comments. I didn't want to argue that GPUs are not well suited for other uses as well, like evaluating a smaller number of bigger neural networks.
amanjpro
Posts: 883
Joined: Sat Mar 13, 2021 1:47 am
Full name: Amanj Sherwany

Re: Why is there no bulk evaluation happening in Alpha Beta?

Post by amanjpro »

j.t. wrote: Mon Nov 08, 2021 2:57 am
Daniel Shawul wrote: Mon Nov 08, 2021 1:55 am Actually the search speed does scale linearly with thread count, but the algorithmic efficiency of AB doesn't.
Engines moved from YBW-type parallel search to lazy SMP / ABDADA exactly because of this, even on the CPU.
I always thought that lazy SMP is used because it is simpler to implement, and not because it scales better than YBW.

Did you find any improvement in time to depth or playing strength when using the GPU over a CPU? I mean not in the context of bigger neural networks, but with the smaller NNUE evaluations, or even the HCE, that alpha-beta engines are using (in case you've experimented with something like that)? As far as I can see, the original question was about evaluating thousands of different positions at the same time on the GPU, so that's what I've been thinking about while writing these comments. I didn't want to argue that GPUs are not well suited for other uses as well, like evaluating a smaller number of bigger neural networks.

SMP scaling is not really linear: the NPS is, but the time to depth is not. Two threads gain almost 40+ Elo over one thread, but a CPU twice as fast as another is more than 40 Elo stronger than the slower one.
Still, SMP scales really well for very high thread counts, like in TCEC.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Why is there no bulk evaluation happening in Alpha Beta?

Post by dangi12012 »

amanjpro wrote: Mon Nov 08, 2021 4:29 am
j.t. wrote: Mon Nov 08, 2021 2:57 am
Daniel Shawul wrote: Mon Nov 08, 2021 1:55 am Actually the search speed does scale linearly with thread count, but the algorithmic efficiency of AB doesn't.
Engines moved from YBW-type parallel search to lazy SMP / ABDADA exactly because of this, even on the CPU.
I always thought that lazy SMP is used because it is simpler to implement, and not because it scales better than YBW.

Did you find any improvement in time to depth or playing strength when using the GPU over a CPU? I mean not in the context of bigger neural networks, but with the smaller NNUE evaluations, or even the HCE, that alpha-beta engines are using (in case you've experimented with something like that)? As far as I can see, the original question was about evaluating thousands of different positions at the same time on the GPU, so that's what I've been thinking about while writing these comments. I didn't want to argue that GPUs are not well suited for other uses as well, like evaluating a smaller number of bigger neural networks.
SMP scaling is not really linear: the NPS is, but the time to depth is not. Two threads gain almost 40+ Elo over one thread, but a CPU twice as fast as another is more than 40 Elo stronger than the slower one.
Still, SMP scales really well for very high thread counts, like in TCEC.
I was hoping someone already has experience with parallel tree searching:
https://www.chessprogramming.org/Parall ... Alpha-Beta
https://www.chessprogramming.org/Dynamic_Tree_Splitting

Since it's obviously possible, it would be interesting to get new ideas. Maybe alpha-beta isn't well suited because it tries to prune serially - something that isn't in the wiki, maybe?
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Why is there no bulk evaluation happening in Alpha Beta?

Post by Desperado »

dangi12012 wrote: Mon Nov 08, 2021 2:28 pm
I was hoping someone already has experience with parallel tree searching:
https://www.chessprogramming.org/Parall ... Alpha-Beta
https://www.chessprogramming.org/Dynamic_Tree_Splitting
I think there are many people here with a lot of experience in parallel tree search. (Did you read the posts above?)

If you are looking for an asynchronous algorithm, you might be interested in APHID. ( APHID )
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Why is there no bulk evaluation happening in Alpha Beta?

Post by diep »

j.t. wrote: Mon Nov 08, 2021 1:32 am
Daniel Shawul wrote: Sun Nov 07, 2021 10:02 pm
j.t. wrote: Sat Nov 06, 2021 12:11 am To make any use of the GPU you need to collect a pretty large number of positions that you want to evaluate. The longer you postpone evaluating positions, the more you lose the advantages of alpha-beta. Together with the extreme pruning engines do today, the relatively efficient NNUE algorithm, the overhead that comes from using a GPU, and last but not least the complexity of GPU code, I assume that people just think that it's not worth it.

But I'll be honest and say that I've never tried this myself (I thought about it, but I've never got familiar enough with GPUs to actually try it), so maybe I misjudged the potential of GPUs.
This is wrong! Modern AB engines use many threads, e.g. 256 threads with lazy SMP, so you can batch together eval requests from different threads.
Each thread evaluates exactly one position, and you then effectively have a batch size of 256, which should be enough to fill up a GPU with a decent net size (not NNUE, though, since that is too tiny). I already do this in Scorpio and it "works", but it is darn slow even with a tiny 6x64 net ...
The search speed doesn't scale linearly with thread count. At some point, when you use several thousand threads instead of a few hundred, the algorithmic loss is practically the same as the advantage you get from more cores. That's what I mean by "the longer you postpone evaluating positions, the more you lose the advantages of alpha-beta". Considering that GPUs are slower per thread and have additional overhead compared to CPU threads, there might be no advantage at all in using a GPU instead of hundreds of CPU threads to parallelize alpha-beta (but as I said earlier, this is mostly speculation on my part). Of course, using the GPU's many threads to evaluate a huge neural network is a viable solution. But I don't think that this is what the original question was about, as it would be basically the same as what Lc0 does.
I remember sitting at a world championship playing Stefan Meyer-Kahlen and discussing the poor speedups most had on supercomputers. He then replied: "yeah, but even if you get a speedup of 10 out of 500, it's still faster than me on a PC with one CPU".
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Why is there no bulk evaluation happening in Alpha Beta?

Post by diep »

Daniel Shawul wrote: Mon Nov 08, 2021 1:55 am
j.t. wrote: Mon Nov 08, 2021 1:32 am
Daniel Shawul wrote: Sun Nov 07, 2021 10:02 pm
j.t. wrote: Sat Nov 06, 2021 12:11 am To make any use of the GPU you need to collect a pretty large number of positions that you want to evaluate. The longer you postpone evaluating positions, the more you lose the advantages of alpha-beta. Together with the extreme pruning engines do today, the relatively efficient NNUE algorithm, the overhead that comes from using a GPU, and last but not least the complexity of GPU code, I assume that people just think that it's not worth it.

But I'll be honest and say that I've never tried this myself (I thought about it, but I've never got familiar enough with GPUs to actually try it), so maybe I misjudged the potential of GPUs.
This is wrong! Modern AB engines use many threads, e.g. 256 threads with lazy SMP, so you can batch together eval requests from different threads.
Each thread evaluates exactly one position, and you then effectively have a batch size of 256, which should be enough to fill up a GPU with a decent net size (not NNUE, though, since that is too tiny). I already do this in Scorpio and it "works", but it is darn slow even with a tiny 6x64 net ...
The search speed doesn't scale linearly with thread count. At some point, when you use several thousand threads instead of a few hundred, the algorithmic loss is practically the same as the advantage you get from more cores. That's what I mean by "the longer you postpone evaluating positions, the more you lose the advantages of alpha-beta". Considering that GPUs are slower per thread and have additional overhead compared to CPU threads, there might be no advantage at all in using a GPU instead of hundreds of CPU threads to parallelize alpha-beta (but as I said earlier, this is mostly speculation on my part). Of course, using the GPU's many threads to evaluate a huge neural network is a viable solution. But I don't think that this is what the original question was about, as it would be basically the same as what Lc0 does.
Actually the search speed does scale linearly with thread count, but the algorithmic efficiency of AB doesn't.
Engines moved from YBW-type parallel search to lazy SMP / ABDADA exactly because of this, even on the CPU.
You don't expect to implement AB using a strictly sequential evaluation of positions on the GPU, do you? That would be ridiculous; even YBW has some overhead! Once CPU engines started using lazy SMP / ABDADA, they became as easy to implement and as scalable as parallel MCTS, which is a good thing! There is a harder-to-implement equivalent version of AB that is identical to MCTS except for its backup operator (the alpha-beta rollouts version). But lazy SMP / ABDADA is already good enough that there is no need for it.

Also, if you need a batch size of more than 256, you can do "prefetching" of positions from the subtree, which I already do.
So you can easily go above 256 with that. Since the NN is the most time-consuming part of the search, these evaluated positions will be stored in cache and soon become useful. The downside of using many threads in the lazy SMP / MCTS way is that they are bound to follow similar lines, and two threads may want to evaluate the same position (i.e. a collision!). You do what you can to avoid it (e.g. virtual loss for MCTS, or using the transposition table to coordinate the search as in ABDADA), and if you still get a collision, which you will, you give up and prefetch a node from the subtree.

Realize that the MCTS algorithm (Lc0) is not much different from the CPU algorithms once you start using hundreds of threads.

You can find a sweet spot for your neural network size to compensate for the overhead of offloading to the GPU, overlap evaluation with something else, etc. There are many ideas to try here. I was trying to get a tiny 2x32 conv net working with my AB search on the GPU before NNUE came along and showed it is possible to do that on the CPU directly. A 2x32 conv net is actually not that much slower than a fat NNUE net.

Stockfish has been increasing the NNUE size since then, and I wouldn't be surprised if the optimum is a network much larger than the current NNUE size, one that runs on the GPU. Stockfish NNUE still lags far behind a 20b net in terms of the knowledge it has, so there is that path to try.
Daniel - without wanting to change your story too much: we can prove mathematically, and quite simply, that parallel speedup is never fully linear. All you can do is try to flatten the curve as it goes down. In short, the more threads, the smaller your speedup scaling percentage will be. Another thing to consider, which most overlook, is the fixed parallel overhead you have - and whether that goes up with more threads as well.
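One standard way to make that precise is Amdahl's law: if a fraction s of the work is serial (roughly, the fixed parallel overhead), the speedup on n threads is bounded by 1/(s + (1-s)/n), which flattens toward 1/s no matter how many threads you add. A quick illustration:

```python
def amdahl_speedup(n_threads, serial_fraction):
    # Amdahl's law: the serial fraction caps the achievable speedup
    # at 1 / serial_fraction, no matter how many threads you add.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)

# With just 5% serial work, 500 threads yield a speedup below 20x --
# in the spirit of the "10 out of 500" supercomputer anecdote above.
```

Each doubling of the thread count buys a smaller increment than the last; the curve only flattens, it never turns linear again.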