Lc0 speedup on 2 GPUs

Werewolf
Posts: 1797
Joined: Thu Sep 18, 2008 10:24 pm

Lc0 speedup on 2 GPUs

Post by Werewolf »

With a normal alpha-beta engine running on a CPU, it is well known that doubling the number of cores carries a search efficiency cost.
Stockfish on 2 cores is only around 1.8x faster than on 1 core (even though the NPS will be about double).
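To make the distinction concrete, here is a rough sketch in Python (illustrative numbers only, not measurements):

Code:
# Effective speed-up = raw NPS scaling x search efficiency of the parallel search.
nps_ratio = 2.0          # assumed: NPS roughly doubles going from 1 to 2 cores
search_efficiency = 0.9  # assumed: ~10% of the extra nodes are wasted by parallel overhead
effective_speedup = nps_ratio * search_efficiency
print(effective_speedup)  # 1.8 -- the "1.8x faster" figure above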

What is the equivalent with Lc0 going from 1 graphics card to 2?
AdminX
Posts: 6340
Joined: Mon Mar 13, 2006 2:34 pm
Location: Acworth, GA

Re: Lc0 speedup on 2 GPUs

Post by AdminX »

On a GTX 1060 using Network ID 10970 I had 3598 NPS; when using 2x, the NPS was between 7000 and 8000, averaging about 7500. Take a look at https://docs.google.com/spreadsheets/d/ ... 1508569046
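For a rough ratio from these numbers (assuming the ~7500 NPS average is representative):

Code:
single_gpu_nps = 3598     # GTX 1060, network ID 10970 (figure quoted above)
dual_setup_nps = 7500     # rough average of the 7000-8000 range reported
print(round(dual_setup_nps / single_gpu_nps, 2))  # ~2.08, i.e. NPS appears to scale fully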
"Good decisions come from experience, and experience comes from bad decisions."
__________________________________________________________________
Ted Summers
Werewolf
Posts: 1797
Joined: Thu Sep 18, 2008 10:24 pm

Re: Lc0 speedup on 2 GPUs

Post by Werewolf »

Thanks - nice table.

However, NPS alone won't answer this question, since it doesn't take into account any search efficiency loss, as I noted above. Two machines can have identical NPS and yet be very different in terms of effective search speed.

I don't know whether this principle holds true for Lc0.
grahamj
Posts: 43
Joined: Thu Oct 11, 2018 2:26 pm
Full name: Graham Jones

Re: Lc0 speedup on 2 GPUs

Post by grahamj »

As I understand it, LC0 sends a batch of positions to the GPU for NN evaluation. The default "Minibatch size for NN inference" is 256. If this is too small, the GPU will not be fully utilised. More powerful GPUs need bigger batches to keep them busy. If the batch size is too big, search becomes less efficient (positions are evaluated unnecessarily). There isn't a simple equivalent for Lc0 going from 1 graphics card to 2 because it will depend on the GPU. It could be that 2 1060s are twice as good as 1, but 2 2080s are not twice as good as 1.
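A toy cost model may make the trade-off clearer (my own sketch in Python; the overhead and per-position constants are pure assumptions, not measured Lc0 or GPU numbers):

Code:
# Toy model: a GPU call costs a fixed launch overhead plus a small per-position cost.
# Small batches are dominated by the overhead, so the GPU sits under-utilised;
# very large batches approach peak throughput, but the search then has to select
# many leaves per step from stale statistics, which is the efficiency loss.
LAUNCH_OVERHEAD_MS = 2.0   # assumed fixed cost per batch submitted to the GPU
PER_POSITION_MS = 0.02     # assumed marginal cost per position in the batch

def positions_per_second(batch_size):
    batch_time_ms = LAUNCH_OVERHEAD_MS + PER_POSITION_MS * batch_size
    return batch_size / batch_time_ms * 1000.0

for b in (32, 256, 1024):
    print(b, round(positions_per_second(b)))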
Graham Jones, www.indriid.com
Werewolf
Posts: 1797
Joined: Thu Sep 18, 2008 10:24 pm

Re: Lc0 speedup on 2 GPUs

Post by Werewolf »

grahamj wrote: Thu Oct 18, 2018 2:04 pm As I understand it, LC0 sends a batch of positions to the GPU for NN evaluation. The default "Minibatch size for NN inference" is 256. If this is too small, the GPU will not be fully utilised. More powerful GPUs need bigger batches to keep them busy. If the batch size is too big, search becomes less efficient (positions are evaluated unnecessarily). There isn't a simple equivalent for Lc0 going from 1 graphics card to 2 because it will depend on the GPU. It could be that 2 1060s are twice as good as 1, but 2 2080s are not twice as good as 1.
Interesting. From what you're saying, it seems we'll only know the speedup by experiment. However, there should be some data on 2x V100, which is close to 2x 2080 Ti.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Lc0 speedup on 2 GPUs

Post by Laskos »

grahamj wrote: Thu Oct 18, 2018 2:04 pm As I understand it, LC0 sends a batch of positions to the GPU for NN evaluation. The default "Minibatch size for NN inference" is 256. If this is too small, the GPU will not be fully utilised. More powerful GPUs need bigger batches to keep them busy. If the batch size is too big, search becomes less efficient (positions are evaluated unnecessarily). There isn't a simple equivalent for Lc0 going from 1 graphics card to 2 because it will depend on the GPU. It could be that 2 1060s are twice as good as 1, but 2 2080s are not twice as good as 1.
Isn't the multi-GPU issue basically reduced to NPS, though, with search efficiency remaining very high apart from that? Maybe I got it wrong, but as I understand it, if NPS is 2.0 times higher, that's very close to the effective speed-up, and when problems appear, they show up in NPS. In short, is NPS scaling the main indicator? But maybe I am wrong.
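One way to test this would be to compare the raw NPS ratio against an effective speed-up inferred from a fixed-time match; a sketch in Python, where the Elo-per-doubling figure is an assumption that would have to be measured for the net in question:

Code:
ELO_PER_DOUBLING = 60.0   # assumed Elo gained per doubling of thinking time (must be measured)

def effective_speedup_from_elo(elo_gain):
    # Convert a measured Elo gain (2 GPUs vs 1 GPU, same time control)
    # into the equivalent speed multiplier under the assumed rate above.
    return 2.0 ** (elo_gain / ELO_PER_DOUBLING)

# E.g. if the 2-GPU setup scores +55 Elo at the same time control,
# the implied effective speed-up is about 1.89x; compare with the NPS ratio.
print(round(effective_speedup_from_elo(55.0), 2))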
smatovic
Posts: 2662
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Lc0 speedup on 2 GPUs

Post by smatovic »

Laskos wrote: Thu Oct 18, 2018 3:10 pm
grahamj wrote: Thu Oct 18, 2018 2:04 pm As I understand it, LC0 sends a batch of positions to the GPU for NN evaluation. The default "Minibatch size for NN inference" is 256. If this is too small, the GPU will not be fully utilised. More powerful GPUs need bigger batches to keep them busy. If the batch size is too big, search becomes less efficient (positions are evaluated unnecessarily). There isn't a simple equivalent for Lc0 going from 1 graphics card to 2 because it will depend on the GPU. It could be that 2 1060s are twice as good as 1, but 2 2080s are not twice as good as 1.
Isn't the multi-GPU issue basically reduced to NPS, though, with search efficiency remaining very high apart from that? Maybe I got it wrong, but as I understand it, if NPS is 2.0 times higher, that's very close to the effective speed-up, and when problems appear, they show up in NPS. In short, is NPS scaling the main indicator? But maybe I am wrong.
I agree with Graham.

Even in MCTS there must be, at some point, an advantage of serial processing over parallel processing.

--
Srdja
Hai
Posts: 598
Joined: Sun Aug 04, 2013 1:19 pm

Re: Lc0 speedup on 2 GPUs

Post by Hai »

Werewolf wrote: Wed Oct 17, 2018 11:36 pm With a normal alpha-beta engine running on a CPU, it is well known that doubling the number of cores carries a search efficiency cost.
Stockfish on 2 cores is only around 1.8x faster than on 1 core (even though the NPS will be about double).

What is the equivalent with Lc0 going from 1 graphics card to 2?
Maybe 50-75 Elo more?