Wouldn't it be nice if C++ GPU

Daniel Shawul
Posts: 3758
Joined: Tue Mar 14, 2006 10:34 am
Location: Ethiopia

Re: Wouldn't it be nice if C++ GPU

Post by Daniel Shawul » Sat Apr 27, 2019 2:25 am

chrisw wrote:
Fri Apr 26, 2019 8:03 pm
Daniel Shawul wrote:
Fri Apr 26, 2019 6:16 pm
Which begs the question: why use small batch sizes at all? I don't use a batch size of less than 128.
Even launching 128 to 256 threads for multi-threaded batching on a 4-core CPU, I see no problems...
Lc0 uses single-threaded batching and defaults to a batch size of 256 -- though a smaller batch size of 32 is used for training ...
Well, if batch=1, you get full NN guidance whenever you want it. If batch=infinity, you get no NN guidance at all. As the batch size increases from 1, you get increasingly less NN guidance at and near leaf nodes. On the other side of the balance, you get faster NN lookups.

The best N for the batch size is easily established by results. So far, so obvious. I guess the interesting bit is how, or whether, one provides useful search guidance in the absence of the NN. The original paper, I think, just did a random selection. Does it make sense to use some kind of handcrafted selector? I’ve been trying with selecting obvious captures over the last few days, but only succeeded in breaking everything, so nothing useful to report right now. Which is basically why I want C++ GPU; fiddling around at a chess level with Python is not good. Need C++.
Are you doing rollouts without the NN? I think even in the original paper they trained a smaller policy network for that purpose. In my case, the search is always guided by the NN policy. I have stopped guiding the search with hand-crafted eval or other heuristic search (qsearch captures, etc.) once I added the policy head.
Did you measure the effect of batch size on playing strength? Nodes per second is not a good measurement of performance. I am using very large batches for self-play game generation, because I can play N self-play games in parallel. But when playing a single game in a tournament, I feel such huge batches may hurt performance, especially when the number of nodes is small. I have not measured this very seriously, though. I will run some tests during the weekend.
Well, going from a batch size of 128 to 16, my nps goes down by a factor of 4, so I haven't really bothered to measure whether the increased selectivity from a smaller batch size can compensate for the loss in nps. However, I have now started using single-threaded search for generating training games. When I do 800-playout training, say with a batch size of 128, each thread builds its own tree and 128 games are produced separately.

For actual search (not training with a small number of playouts), one can make cpuct a function of the batch size to account for the added exploration due to virtual loss. I wonder how A0 got away with a batch size of 8 -- maybe the TPU has a lot less memory transfer overhead than the GPU.
I noticed that batching helps not only because it reduces this latency, but also because the TensorFlow/TensorRT NN evaluation code uses generic std containers (list/vector etc.) that are very slow. I remember the first time I tried to use TensorFlow on CPU, my nps tanked even after I commented out the actual NN evaluation code while keeping the construction of input tensors etc.
Looking forward to your test results.

Daniel
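
As an illustration of the virtual-loss remark above, here is a minimal sketch of PUCT selection when several leaves are collected before any evaluation returns. It is not code from Scorpio or Lc0. The `Node` fields, the convention that a pending evaluation counts as a loss of -1, and especially the `cpuct_for_batch` scaling rule are assumptions made for the example; the post only says that cpuct could depend on the batch size, not how.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    prior: float                      # policy prior P(s, a) from the network
    visits: int = 0                   # completed visits N
    value_sum: float = 0.0            # sum of backed-up values W
    virtual_losses: int = 0           # evaluations still in flight through this node
    children: list = field(default_factory=list)


def puct_score(parent: Node, child: Node, cpuct: float) -> float:
    # Assumption: each pending evaluation counts as a visit that returned -1,
    # which lowers Q and shrinks the exploration term for this child.
    n = child.visits + child.virtual_losses
    q = (child.value_sum - child.virtual_losses) / n if n > 0 else 0.0
    parent_n = parent.visits + parent.virtual_losses
    u = cpuct * child.prior * math.sqrt(max(parent_n, 1)) / (1 + n)
    return q + u


def select_child(parent: Node, cpuct: float) -> Node:
    return max(parent.children, key=lambda c: puct_score(parent, c, cpuct))


def cpuct_for_batch(base_cpuct: float, batch_size: int) -> float:
    # Hypothetical scaling, not taken from the thread: shrink cpuct as the
    # batch grows, because virtual loss already adds exploration.
    return base_cpuct / (1.0 + math.log2(batch_size) / 8.0)


def collect_batch(root: Node, batch_size: int, cpuct: float) -> list:
    # Pick batch_size children of the root before any evaluation returns.
    batch = []
    for _ in range(batch_size):
        child = select_child(root, cpuct)
        child.virtual_losses += 1
        root.virtual_losses += 1
        batch.append(child)
    return batch
```

With two unvisited children whose priors are 0.6 and 0.4, a batch of two picks the 0.6 child first; its virtual loss then pushes the second pick to the 0.4 child. That forced spread is the extra exploration a smaller cpuct would be compensating for.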

Rémi Coulom
Posts: 433
Joined: Mon Apr 24, 2006 6:06 pm

Re: Wouldn't it be nice if C++ GPU

Post by Rémi Coulom » Sat Apr 27, 2019 11:57 am

Daniel Shawul wrote:
Sat Apr 27, 2019 2:25 am
Well, going from a batch size of 128 to 16, my nps goes down by a factor of 4, so I haven't really bothered to measure whether the increased selectivity from a smaller batch size can compensate for the loss in nps. However, I have now started using single-threaded search for generating training games. When I do 800-playout training, say with a batch size of 128, each thread builds its own tree and 128 games are produced separately.

For actual search (not training with a small number of playouts), one can make cpuct a function of the batch size to account for the added exploration due to virtual loss. I wonder how A0 got away with a batch size of 8 -- maybe the TPU has a lot less memory transfer overhead than the GPU.
I noticed that batching helps not only because it reduces this latency, but also because the TensorFlow/TensorRT NN evaluation code uses generic std containers (list/vector etc.) that are very slow. I remember the first time I tried to use TensorFlow on CPU, my nps tanked even after I commented out the actual NN evaluation code while keeping the construction of input tensors etc.
Looking forward to your test results.

Daniel
In my experience, there is very little overhead for transferring data from the CPU to the GPU. But that may be because I am doing everything with cuDNN in C++, and build a half-precision host tensor directly on the CPU for transfer. Sending two half-size batches takes almost the same time as one full-size batch. Also, the overhead might be more noticeable if you have a very small neural network.
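
To make the half-precision point concrete, here is a small sketch using only Python's standard `struct` module. The setup described above is C++ with cuDNN, so this is an illustration of the idea and not that code; `pack_half` and `pack_float` are hypothetical helper names. It shows that encoding the input on the host as 16-bit floats halves the number of bytes to transfer compared with 32-bit floats.

```python
import struct


def pack_half(values: list) -> bytes:
    # 'e' is IEEE 754 binary16 (half precision); '<' fixes little-endian order.
    return struct.pack(f"<{len(values)}e", *values)


def pack_float(values: list) -> bytes:
    # 'f' is IEEE 754 binary32 (single precision), for comparison.
    return struct.pack(f"<{len(values)}f", *values)


planes = [0.0, 1.0, 0.5, 1.0]        # toy input values, exactly representable
half_bytes = pack_half(planes)
float_bytes = pack_float(planes)
```

Input planes that hold only small values such as 0 and 1 lose nothing in the conversion, which is one reason building the half-precision buffer on the CPU can be attractive.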

What makes small batches so slow is mainly what is happening on the GPU. cuDNN's performance is really terrible for small batches, and that's why I want to try to improve it. Smaller batches will also allow better scaling to multiple GPUs.

I will report back with my experimental results. I'll test for shogi, not chess. I am preparing for the World Computer Shogi Championship that will take place next week, and will do some testing to decide on batch size with 4xV100 or 8xV100 configurations.

My intuition is that large batches may be OK for very long searches, but are not for short ones. The optimal solution might be to increase batch size as the search tree becomes bigger.

Rémi
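
The idea of increasing the batch size as the search tree grows could be sketched as a simple schedule. Everything here is an assumption for illustration: the function name, the power-of-two rounding, and the default limits and `nodes_per_slot` ratio are not from the post and would need tuning by playing-strength tests.

```python
def batch_size_for_tree(tree_nodes: int, min_batch: int = 8,
                        max_batch: int = 256, nodes_per_slot: int = 64) -> int:
    """Return a power-of-two batch size that grows with the tree size."""
    # One batch slot per nodes_per_slot tree nodes (hypothetical ratio).
    target = max(tree_nodes // nodes_per_slot, 1)
    # Round down to a power of two.
    batch = 1 << (target.bit_length() - 1)
    # Clamp to the allowed range.
    return max(min_batch, min(batch, max_batch))
```

With these defaults, an empty tree gets the minimum batch of 8, a tree of 3000 nodes gets 32, and a tree of 100000 nodes is capped at 256.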

Daniel Shawul

Re: Wouldn't it be nice if C++ GPU

Post by Daniel Shawul » Sun Apr 28, 2019 12:05 pm

Rémi Coulom wrote:
Sat Apr 27, 2019 11:57 am
In my experience, there is very little overhead for transferring data from the CPU to the GPU. But that may be because I am doing everything with cuDNN in C++, and build a half-precision host tensor directly on the CPU for transfer. Sending two half-size batches takes almost the same time as one full-size batch. Also, the overhead might be more noticeable if you have a very small neural network.

What makes small batches so slow is mainly what is happening on the GPU. cuDNN's performance is really terrible for small batches, and that's why I want to try to improve it. Smaller batches will also allow better scaling to multiple GPUs.
I see. I don't know if that is possible at all using TensorFlow, since you don't get to see what device you have AFAIK, but I could probably do it with TensorRT. However, with batching, the profile shows that 95% of the time is spent in the convolution kernels, with cudaMemcpyHD/DH accounting for <1%, so I never bothered to optimize it. On that note, I agree that if you can optimize the convolution kernels even a little bit, there is a lot to be gained.
I will report back with my experimental results. I'll test for shogi, not chess. I am preparing for the World Computer Shogi Championship that will take place next week, and will do some testing to decide on batch size with 4xV100 or 8xV100 configurations.

My intuition is that large batches may be OK for very long searches, but are not for short ones. The optimal solution might be to increase batch size as the search tree becomes bigger.
Yes, it makes sense to try to lower the batch size when using multiple GPUs. My Scorpio currently barely scales to 4 GPUs in terms of nps using 128 threads per GPU (i.e. a total of 512). I have tried to optimize the performance of parallel MCTS by making it completely lockless, even when allocating tree nodes. Good luck with the shogi championship.

Rémi Coulom

Re: Wouldn't it be nice if C++ GPU

Post by Rémi Coulom » Sun Apr 28, 2019 12:31 pm

Daniel Shawul wrote:
Sun Apr 28, 2019 12:05 pm
Yes, it makes sense to try to lower the batch size when using multiple GPUs. My Scorpio currently barely scales to 4 GPUs in terms of nps using 128 threads per GPU (i.e. a total of 512). I have tried to optimize the performance of parallel MCTS by making it completely lockless, even when allocating tree nodes. Good luck with the shogi championship.
In my program, the tree search runs in a single thread. I parallelize only the leaf calculations: legal move generation, neural-network input preparation, and output decoding. This way I don't have to worry about concurrent access to the tree. In my experience, the main tree thread becomes a bottleneck only when the neural network is very small.
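
The structure described above, one thread owning the tree and workers handling only the leaf calculations, might look roughly like this. It is a sketch under stated assumptions and not the program's actual code: positions are plain strings, `prepare_leaf` and `fake_network` are hypothetical stand-ins for move generation, input encoding and the batched GPU call, and Python threads only show the shape of the design, not its performance.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare_leaf(position: str) -> dict:
    # Stand-in for legal move generation and network input encoding.
    # The "moves" and "planes" produced here are fake placeholder data.
    moves = [f"{position}:m{i}" for i in range(3)]
    planes = [float(ord(c) % 2) for c in position]
    return {"position": position, "moves": moves, "planes": planes}


def fake_network(batch_planes: list) -> list:
    # Stand-in for one batched GPU call: returns one value per input.
    return [sum(p) / max(len(p), 1) for p in batch_planes]


def evaluate_batch(leaves: list, pool: ThreadPoolExecutor) -> dict:
    # Workers prepare the leaves in parallel; map() preserves input order.
    prepared = list(pool.map(prepare_leaf, leaves))
    values = fake_network([p["planes"] for p in prepared])
    # From here on, only the calling (tree) thread touches the results,
    # so the tree itself needs no locks.
    return {p["position"]: (p["moves"], v) for p, v in zip(prepared, values)}
```

The tree thread would call `evaluate_batch` with the leaves it has selected and then expand and back up the results itself, so workers never read or write tree nodes.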

chrisw
Posts: 2113
Joined: Tue Apr 03, 2012 2:28 pm

Re: Wouldn't it be nice if C++ GPU

Post by chrisw » Sun Apr 28, 2019 1:42 pm

Daniel Shawul wrote:
Sat Apr 27, 2019 2:25 am
chrisw wrote:
Fri Apr 26, 2019 8:03 pm
Daniel Shawul wrote:
Fri Apr 26, 2019 6:16 pm
Which begs the question: why use small batch sizes at all? I don't use a batch size of less than 128.
Even launching 128 to 256 threads for multi-threaded batching on a 4-core CPU, I see no problems...
Lc0 uses single-threaded batching and defaults to a batch size of 256 -- though a smaller batch size of 32 is used for training ...
Well, if batch=1, you get full NN guidance whenever you want it. If batch=infinity, you get no NN guidance at all. As the batch size increases from 1, you get increasingly less NN guidance at and near leaf nodes. On the other side of the balance, you get faster NN lookups.

The best N for the batch size is easily established by results. So far, so obvious. I guess the interesting bit is how, or whether, one provides useful search guidance in the absence of the NN. The original paper, I think, just did a random selection. Does it make sense to use some kind of handcrafted selector? I’ve been trying with selecting obvious captures over the last few days, but only succeeded in breaking everything, so nothing useful to report right now. Which is basically why I want C++ GPU; fiddling around at a chess level with Python is not good. Need C++.
Are you doing rollouts without NN?
Training is SL at the moment, so I’m not generating training games.
For play, I use a simple MCTS, and expansion at the edge of the tree is by definition either a random choice or some handcrafted algorithm’s choice, because the NN hasn’t seen the position yet. So far my handcrafted selector is worse than random. It’s possible that random sampling of the unknown search space is better for pseudo-MCTS; I’ll keep testing for a while. E.g., always choosing a SEE-favourable capture risks search tree bias, so I would guess some random sampling, or a very wide-ranging handcrafted predictor, is necessary.
I think even in the original paper they trained a smaller policy network for that purpose. In my case, the search is always guided by the NN policy. I have stopped guiding the search with hand-crafted eval or other heuristic search (qsearch captures, etc.) once I added the policy head.
But with batching there will be occasions when there is no policy yet. If you extend lines beyond the tree, e.g. reply to check, or give favourable checks and so on, there won’t be a policy at hand either.
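
One way to picture the choice between random and handcrafted selection when no policy is available is a fallback prior that is uniform by default and can be tilted towards captures. This is a sketch under assumptions, not code from any engine in the thread: `fallback_priors`, `sample_move`, the `capture_weight` parameter and the string-based capture test are all hypothetical.

```python
import random


def fallback_priors(moves: list, is_capture, capture_weight: float = 1.0) -> dict:
    """Priors for a node whose network policy has not arrived yet.

    capture_weight = 1.0 gives a uniform distribution; larger values bias
    towards captures without ever giving a quiet move zero probability.
    """
    weights = [capture_weight if is_capture(m) else 1.0 for m in moves]
    total = sum(weights)
    return {m: w / total for m, w in zip(moves, weights)}


def sample_move(priors: dict, rng: random.Random) -> str:
    # Random sampling in proportion to the fallback priors.
    moves = list(priors)
    return rng.choices(moves, weights=[priors[m] for m in moves], k=1)[0]
```

Keeping every move at a non-zero weight is the point of the design: a selector that always takes the favourable capture would give the bias described above, while a weighted sample still visits quiet moves.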

Did you measure the effect of batch size on playing strength? Nodes per second is not a good measurement of performance. I am using very large batches for self-play game generation, because I can play N self-play games in parallel. But when playing a single game in a tournament, I feel such huge batches may hurt performance, especially when the number of nodes is small. I have not measured this very seriously, though. I will run some tests during the weekend.
Well, going from a batch size of 128 to 16, my nps goes down by a factor of 4, so I haven't really bothered to measure whether the increased selectivity from a smaller batch size can compensate for the loss in nps. However, I have now started using single-threaded search for generating training games. When I do 800-playout training, say with a batch size of 128, each thread builds its own tree and 128 games are produced separately.

For actual search (not training with a small number of playouts), one can make cpuct a function of the batch size to account for the added exploration due to virtual loss. I wonder how A0 got away with a batch size of 8 -- maybe the TPU has a lot less memory transfer overhead than the GPU.
I noticed that batching helps not only because it reduces this latency, but also because the TensorFlow/TensorRT NN evaluation code uses generic std containers (list/vector etc.) that are very slow. I remember the first time I tried to use TensorFlow on CPU, my nps tanked even after I commented out the actual NN evaluation code while keeping the construction of input tensors etc.
Looking forward to your test results.

Daniel
