With a normal alpha beta engine running on a CPU, it is well known that a doubling of cores has a search efficiency cost.
Stockfish on 2 cores is somewhere around 1.8x faster than on 1 core (though the nps will be about double).
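The relationship between raw NPS scaling and effective speedup can be sketched with a toy calculation; the 90% efficiency figure below is just what the 1.8x / 2.0x numbers above imply, not a measured constant:

```python
# Toy illustration: effective speedup is raw NPS scaling discounted
# by the parallel search-efficiency loss.
def effective_speedup(nps_scaling: float, search_efficiency: float) -> float:
    """nps_scaling: how much faster nodes are generated (e.g. 2.0 on 2 cores).
    search_efficiency: fraction of the extra work that is useful (1.0 = no loss)."""
    return nps_scaling * search_efficiency

# Stockfish-like figures from above: ~2.0x NPS but ~1.8x effective speed,
# implying roughly 90% search efficiency when going to 2 cores.
print(effective_speedup(2.0, 0.9))  # 1.8
```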
What is the equivalent with Lc0 going from 1 graphics card to 2?
Lc0 speedup on 2 GPUs
Re: Lc0 speedup on 2 GPUs
On a GTX 1060 using Network ID 10970 I had 3598 NPS; when using 2x I had NPS between 7000 and 8000, averaging about 7500 NPS. Take a look at https://docs.google.com/spreadsheets/d/ ... 1508569046
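For what it's worth, those figures imply a raw NPS scaling of roughly 2.08x, taking 7500 as the two-GPU average:

```python
# Raw NPS scaling implied by the reported GTX 1060 figures.
single_gpu_nps = 3598
dual_gpu_nps = 7500  # midpoint of the reported 7000-8000 range
print(round(dual_gpu_nps / single_gpu_nps, 2))  # 2.08
```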
"Good decisions come from experience, and experience comes from bad decisions."
__________________________________________________________________
Ted Summers
Re: Lc0 speedup on 2 GPUs
Thanks - nice table.
However, NPS won't answer this question since it doesn't take into account any incurred search loss as I showed above. Two machines can have identical NPS and yet be very different in terms of search speed.
I don't know whether this principle holds true for Lc0.
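To make that concrete with a toy example (both the NPS value and the efficiency figures here are invented, not measurements):

```python
# Two hypothetical machines with identical NPS but different search
# efficiency (the fraction of searched nodes that actually help the search).
nps = 7500
for name, efficiency in (("machine A", 1.00), ("machine B", 0.80)):
    useful_nps = nps * efficiency
    print(f"{name}: {nps} NPS, {useful_nps:.0f} useful nodes/s")
```

Identical NPS readings, yet machine B does only 80% as much useful searching per second.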
Re: Lc0 speedup on 2 GPUs
As I understand it, LC0 sends a batch of positions to the GPU for NN evaluation. The default "Minibatch size for NN inference" is 256. If this is too small, the GPU will not be fully utilised. More powerful GPUs need bigger batches to keep them busy. If the batch size is too big, search becomes less efficient (positions are evaluated unnecessarily). There isn't a simple equivalent for Lc0 going from 1 graphics card to 2 because it will depend on the GPU. It could be that 2 1060s are twice as good as 1, but 2 2080s are not twice as good as 1.
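That trade-off can be sketched with a toy model; all constants below are invented for illustration, and this is not Lc0's actual scheduling:

```python
# Hypothetical toy model, not Lc0's real behaviour: bigger NN batches
# raise GPU utilisation but waste more speculative evaluations.
def gpu_utilisation(batch_size, saturation_batch):
    # Utilisation ramps linearly until the GPU saturates.
    return min(1.0, batch_size / saturation_batch)

def search_efficiency(batch_size, penalty=0.0002):
    # Assume each extra position in a batch is a bit more likely wasted.
    return max(0.0, 1.0 - penalty * batch_size)

def effective_throughput(batch_size, saturation_batch):
    return gpu_utilisation(batch_size, saturation_batch) * search_efficiency(batch_size)

# A GPU that saturates at batch 256 vs one that saturates at 1024:
for batch in (128, 256, 512, 1024):
    print(batch,
          round(effective_throughput(batch, 256), 3),
          round(effective_throughput(batch, 1024), 3))
```

Under these assumptions the weaker GPU peaks near batch 256 and then declines, while the stronger one keeps gaining up to 1024, which matches the intuition that more powerful GPUs need bigger batches.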
Graham Jones, www.indriid.com
Re: Lc0 speedup on 2 GPUs
grahamj wrote: ↑Thu Oct 18, 2018 2:04 pm [...]
Interesting. From what you're saying, it seems likely we'll only know the speedup by experiment. However, there should be some data on 2x V100, which is close to 2x 2080 Ti.
Kai Laskos
Re: Lc0 speedup on 2 GPUs
grahamj wrote: ↑Thu Oct 18, 2018 2:04 pm [...]
Isn't the multi-GPU issue, though, basically reduced to NPS, with search efficiency being very high aside from that? Maybe I got it wrong, but as I understand it, if NPS is 2.0 times higher, that is very close to the effective speed-up, and when problems appear, they show up in NPS. In short, is NPS scaling the main indicator? But maybe I am wrong.
Srdja Matovic
Re: Lc0 speedup on 2 GPUs
Laskos wrote: ↑Thu Oct 18, 2018 3:10 pm [...]
I agree with Graham.
Even in MCTS there must, at some point, be an advantage of serial processing over parallel processing.
--
Srdja
Re: Lc0 speedup on 2 GPUs
Werewolf wrote: ↑Wed Oct 17, 2018 11:36 pm [...]
Maybe 50-75 Elo more?