Lc0 speedup on 2 GPUs

Werewolf
Posts: 1797
Joined: Thu Sep 18, 2008 10:24 pm

Lc0 speedup on 2 GPUs

Post by Werewolf »

With a normal alpha-beta engine running on a CPU, it is well known that doubling the number of cores carries a search efficiency cost.
Stockfish on 2 cores is only around 1.8x faster than on 1 core (even though the NPS will be about double).
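To make the distinction concrete, here is a rough sketch in Python (illustrative numbers only, not measurements):

Code:
# Effective speed-up = raw NPS scaling x search efficiency of the parallel search.
nps_ratio = 2.0          # assumed: NPS roughly doubles going from 1 to 2 cores
search_efficiency = 0.9  # assumed: ~10% of the extra nodes are wasted by parallel overhead
effective_speedup = nps_ratio * search_efficiency
print(effective_speedup)  # 1.8 -- the "1.8x faster" figure above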

What is the equivalent with Lc0 going from 1 graphics card to 2?
AdminX
Posts: 6340
Joined: Mon Mar 13, 2006 2:34 pm
Location: Acworth, GA

Re: Lc0 speedup on 2 GPUs

Post by AdminX »

On a GTX 1060 using Network ID 10970 I had 3598 NPS; when using 2x, the NPS was between 7000 and 8000, averaging about 7500. Take a look at https://docs.google.com/spreadsheets/d/ ... 1508569046
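For a rough ratio from these numbers (assuming the ~7500 NPS average is representative):

Code:
single_gpu_nps = 3598     # GTX 1060, network ID 10970 (figure quoted above)
dual_setup_nps = 7500     # rough average of the 7000-8000 range reported
print(round(dual_setup_nps / single_gpu_nps, 2))  # ~2.08, i.e. NPS appears to scale fully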
"Good decisions come from experience, and experience comes from bad decisions."
__________________________________________________________________
Ted Summers
Werewolf
Posts: 1797
Joined: Thu Sep 18, 2008 10:24 pm

Re: Lc0 speedup on 2 GPUs

Post by Werewolf »

Thanks - nice table.

However, NPS alone won't answer this question, since it doesn't take into account any search efficiency loss, as I noted above. Two machines can have identical NPS and yet be very different in terms of effective search speed.

I don't know whether this principle holds true for Lc0.
grahamj
Posts: 43
Joined: Thu Oct 11, 2018 2:26 pm
Full name: Graham Jones

Re: Lc0 speedup on 2 GPUs

Post by grahamj »

As I understand it, LC0 sends a batch of positions to the GPU for NN evaluation. The default "Minibatch size for NN inference" is 256. If this is too small, the GPU will not be fully utilised. More powerful GPUs need bigger batches to keep them busy. If the batch size is too big, search becomes less efficient (positions are evaluated unnecessarily). There isn't a simple equivalent for Lc0 going from 1 graphics card to 2 because it will depend on the GPU. It could be that 2 1060s are twice as good as 1, but 2 2080s are not twice as good as 1.
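A toy cost model may make the trade-off clearer (my own sketch in Python; the overhead and per-position constants are pure assumptions, not measured Lc0 or GPU numbers):

Code:
# Toy model: a GPU call costs a fixed launch overhead plus a small per-position cost.
# Small batches are dominated by the overhead, so the GPU sits under-utilised;
# very large batches approach peak throughput, but the search then has to select
# many leaves per step from stale statistics, which is the efficiency loss.
LAUNCH_OVERHEAD_MS = 2.0   # assumed fixed cost per batch submitted to the GPU
PER_POSITION_MS = 0.02     # assumed marginal cost per position in the batch

def positions_per_second(batch_size):
    batch_time_ms = LAUNCH_OVERHEAD_MS + PER_POSITION_MS * batch_size
    return batch_size / batch_time_ms * 1000.0

for b in (32, 256, 1024):
    print(b, round(positions_per_second(b)))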
Graham Jones, www.indriid.com
Werewolf
Posts: 1797
Joined: Thu Sep 18, 2008 10:24 pm

Re: Lc0 speedup on 2 GPUs

Post by Werewolf »

grahamj wrote: Thu Oct 18, 2018 2:04 pm As I understand it, LC0 sends a batch of positions to the GPU for NN evaluation. The default "Minibatch size for NN inference" is 256. If this is too small, the GPU will not be fully utilised. More powerful GPUs need bigger batches to keep them busy. If the batch size is too big, search becomes less efficient (positions are evaluated unnecessarily). There isn't a simple equivalent for Lc0 going from 1 graphics card to 2 because it will depend on the GPU. It could be that 2 1060s are twice as good as 1, but 2 2080s are not twice as good as 1.
Interesting. From what you're saying, it seems we'll only know the speedup by experiment. However, there should be some data on 2x V100, which is close to 2x 2080 Ti.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Lc0 speedup on 2 GPUs

Post by Laskos »

grahamj wrote: Thu Oct 18, 2018 2:04 pm As I understand it, LC0 sends a batch of positions to the GPU for NN evaluation. The default "Minibatch size for NN inference" is 256. If this is too small, the GPU will not be fully utilised. More powerful GPUs need bigger batches to keep them busy. If the batch size is too big, search becomes less efficient (positions are evaluated unnecessarily). There isn't a simple equivalent for Lc0 going from 1 graphics card to 2 because it will depend on the GPU. It could be that 2 1060s are twice as good as 1, but 2 2080s are not twice as good as 1.
Isn't the multi-GPU issue basically reduced to NPS, though, with search efficiency remaining very high apart from that? Maybe I got it wrong, but as I understand it, if NPS is 2.0 times higher, that's very close to the effective speed-up, and when problems appear, they show up in NPS. In short, is NPS scaling the main indicator? But maybe I am wrong.
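One way to test this would be to compare the raw NPS ratio against an effective speed-up inferred from a fixed-time match; a sketch in Python, where the Elo-per-doubling figure is an assumption that would have to be measured for the net in question:

Code:
ELO_PER_DOUBLING = 60.0   # assumed Elo gained per doubling of thinking time (must be measured)

def effective_speedup_from_elo(elo_gain):
    # Convert a measured Elo gain (2 GPUs vs 1 GPU, same time control)
    # into the equivalent speed multiplier under the assumed rate above.
    return 2.0 ** (elo_gain / ELO_PER_DOUBLING)

# E.g. if the 2-GPU setup scores +55 Elo at the same time control,
# the implied effective speed-up is about 1.89x; compare with the NPS ratio.
print(round(effective_speedup_from_elo(55.0), 2))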
smatovic
Posts: 2662
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Lc0 speedup on 2 GPUs

Post by smatovic »

Laskos wrote: Thu Oct 18, 2018 3:10 pm
grahamj wrote: Thu Oct 18, 2018 2:04 pm As I understand it, LC0 sends a batch of positions to the GPU for NN evaluation. The default "Minibatch size for NN inference" is 256. If this is too small, the GPU will not be fully utilised. More powerful GPUs need bigger batches to keep them busy. If the batch size is too big, search becomes less efficient (positions are evaluated unnecessarily). There isn't a simple equivalent for Lc0 going from 1 graphics card to 2 because it will depend on the GPU. It could be that 2 1060s are twice as good as 1, but 2 2080s are not twice as good as 1.
Isn't the multi-GPU issue basically reduced to NPS, though, with search efficiency remaining very high apart from that? Maybe I got it wrong, but as I understand it, if NPS is 2.0 times higher, that's very close to the effective speed-up, and when problems appear, they show up in NPS. In short, is NPS scaling the main indicator? But maybe I am wrong.
I agree with Graham.

Even in MCTS there must be, at some point, an advantage of serial processing over parallel processing.

--
Srdja
Hai
Posts: 598
Joined: Sun Aug 04, 2013 1:19 pm

Re: Lc0 speedup on 2 GPUs

Post by Hai »

Werewolf wrote: Wed Oct 17, 2018 11:36 pm With a normal alpha-beta engine running on a CPU, it is well known that doubling the number of cores carries a search efficiency cost.
Stockfish on 2 cores is only around 1.8x faster than on 1 core (even though the NPS will be about double).

What is the equivalent with Lc0 going from 1 graphics card to 2?
Maybe 50-75 Elo more?