Using LC0 with one or two GPUs - a guide

smatovic
Posts: 2641
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Using LC0 with one or two GPUs - a guide

Post by smatovic »

Since it has come up repeatedly, here is a short guide on what to consider when
using one or two GPUs with LC0.

Hardware:

- CPU or GPU?
LC0 uses neural networks to evaluate chess positions. These are compute and
memory intensive, so they are ideally suited to acceleration by a GPU.
To add a discrete GPU to your PC you will need a free PCI Express slot
and a power supply unit that can handle the additional power consumption.
Note that GPUs commonly need two free slots in your PC case.

- AMD or Nvidia?
LC0 is able to run on CPUs, on GPUs via OpenCL, and on Nvidia GPUs via CUDA
and cuDNN. Currently the Nvidia CUDA and cuDNN backend outperforms the AMD
OpenCL backend by a wide margin. Of course this may change in the future.
See these benchmarks for some numbers:

https://www.phoronix.com/scan.php?page= ... Benchmarks
https://www.phoronix.com/scan.php?page= ... inux&num=9

- Nvidia RTX or GTX?
The Nvidia RTX series has TensorCores on board, which accelerate the neural
network of LC0 significantly, of course at a higher price.

- One or two GPUs?
An additional GPU gives an estimated +50 Elo. You can mix different GPUs with LC0.

- Thermal issues
A high-end GPU dissipates about 300 watts of heat under load, so you may
have to add some fans to your PC case for cooling. An alternative
is a water cooling solution. See also:

viewtopic.php?f=2&t=70097
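
To put the estimated +50 Elo from a second GPU into perspective, the standard
Elo expected-score formula can be used. Below is a minimal Python sketch; the
function name expected_score is only illustrative, and +50 is the estimate
quoted in this guide, not a measurement.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score (win = 1, draw = 0.5) for the side that is elo_diff
    points stronger, using the standard logistic Elo formula."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


if __name__ == "__main__":
    # +50 Elo is the estimate from the guide, used here only as example input.
    print(f"{expected_score(50):.3f}")
```

With an input of 50 this comes out at roughly 57%, i.e. the two-GPU setup would
be expected to score about 57 points out of 100 against the same engine on one GPU.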

Software:

- FP16 (half precision) or FP32 (single precision)?
Neural network inference is currently done via floating point computation.
Some GPUs offer higher instruction throughput at lower precision, so on
these devices FP16 (half precision) can pay off. The Nvidia RTX series, for
example, offers FP16-optimized computation in LC0.

- Which parameters to choose?
LC0 has some tunable parameters to get more nps, for example backend type,
number of threads, nncache or batch size. See this sheet for different
parameters and the resulting nps:

https://docs.google.com/spreadsheets/d/ ... CjBILe6uA/

- Which network to choose?
LC0 is still under development and the network design may change, so there are
a bunch of different networks which give different nps and Elo. Here is an
overview of the different networks LC0 offers:

http://www.lczero.org/networks/
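
As a rough illustration of how the tunable parameters mentioned in this section
fit together, here is a Python sketch that assembles an lc0 command line. The
flag names (--weights, --backend, --threads, --nncache, --minibatch-size) are
given as I understand the lc0 options; please check lc0 --help for your version.
The binary path, the network file name and all values are placeholders, not
recommendations.

```python
import shlex


def build_lc0_command(weights, backend="cudnn-fp16", threads=2,
                      nncache=2000000, minibatch_size=256, binary="./lc0"):
    """Return the lc0 invocation as an argument list.

    All defaults are placeholder values for illustration only; tune them
    against your own nps measurements.
    """
    return [
        binary,
        f"--weights={weights}",
        f"--backend={backend}",
        f"--threads={threads}",
        f"--nncache={nncache}",
        f"--minibatch-size={minibatch_size}",
    ]


if __name__ == "__main__":
    # "network.pb.gz" is a hypothetical file name.
    print(shlex.join(build_lc0_command("network.pb.gz")))
```

Keeping the parameters in one function makes it easy to try several
combinations and compare the resulting nps.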

--
Srdja
Krzysztof Grzelak
Posts: 1525
Joined: Tue Jul 15, 2014 12:47 pm

Re: Using LC0 with one or two GPUs - a guide

Post by Krzysztof Grzelak »

You described a very interesting thing, but unfortunately one thing is missing. You write a lot about the GPU, but you do not write about the CPU at all.
smatovic

Re: Using LC0 with one or two GPUs - a guide

Post by smatovic »

Krzysztof Grzelak wrote: Sat Mar 30, 2019 11:16 am You described a very interesting thing, but unfortunately one thing is missing. You write a lot about the GPU, but you do not write about the CPU at all.
As for running LC0 on a CPU, I have too little experience to write on this topic, but maybe someone else can help out.

Considering the CPU when running LC0 on a GPU:

- Threads per GPU
You may want to run two threads per GPU to be able to utilize it fully, so plan for two CPU cores per GPU.

- The higher the clock rate, the better
I have no numbers for comparison, but the higher the CPU clock is, the faster the kernel calls should be.
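
The two-threads-per-GPU rule of thumb can be written down as a tiny Python
helper. This is only a sketch: the function name suggested_threads is made up,
and capping the result at the number of available CPU cores is my own
assumption, not something measured.

```python
import os


def suggested_threads(num_gpus, threads_per_gpu=2, available_cores=None):
    """Two search threads per GPU (rule of thumb), capped at the core count.

    The cap is an assumption for illustration; available_cores defaults to
    what the operating system reports.
    """
    if available_cores is None:
        available_cores = os.cpu_count() or 1
    return max(1, min(num_gpus * threads_per_gpu, available_cores))


if __name__ == "__main__":
    print(suggested_threads(2, available_cores=8))
```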

--
Srdja
crem
Posts: 177
Joined: Wed May 23, 2018 9:29 pm

Re: Using LC0 with one or two GPUs - a guide

Post by crem »

I'd also add the following points which are often brought up (sorted from most surprising to least surprising):

- No SLI bridge is needed (or useful at all) when using multiple GPUs.
- For RTX cards, default Leela configuration is much slower because cudnn backend is default instead of cudnn-fp16.
- Multiple GPUs also don't automatically work, one has to pass parameters to Lc0 to enable that.
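
To illustrate the last two points, here is a small Python sketch that builds the
backend arguments for several GPUs. The option syntax
(--backend=roundrobin with --backend-opts=(backend=...,gpu=N),...) is written as
I understand it from the Lc0 documentation, so treat it as an assumption and
verify it against your Lc0 version. The function name multi_gpu_args is
hypothetical.

```python
def multi_gpu_args(gpu_ids, per_gpu_backend="cudnn-fp16", combiner="roundrobin"):
    """Build the --backend / --backend-opts pair for several GPUs.

    per_gpu_backend is the backend used on each card (cudnn-fp16 for RTX),
    combiner is the backend that distributes work across the cards.
    """
    if not gpu_ids:
        raise ValueError("need at least one GPU id")
    opts = ",".join(f"(backend={per_gpu_backend},gpu={i})" for i in gpu_ids)
    return [f"--backend={combiner}", f"--backend-opts={opts}"]


if __name__ == "__main__":
    # Two cards with ids 0 and 1.
    print(" ".join(multi_gpu_args([0, 1])))
```

When typing the result into a shell, the --backend-opts value usually needs
quoting because of the parentheses.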
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Using LC0 with one or two GPUs - a guide

Post by Laskos »

crem wrote: Sat Mar 30, 2019 12:29 pm I'd also add the following points which are often brought up (sorted from most surprising to least surprising):

- No SLI bridge is needed (or useful at all) when using multiple GPUs.
- For RTX cards, default Leela configuration is much slower because cudnn backend is default instead of cudnn-fp16.
- Multiple GPUs also don't automatically work, one has to pass parameters to Lc0 to enable that.
How is the scaling with 2 GPUs? First, NPS scaling. Second, effective speed-up scaling. I saw bad scaling even NPS-wise in TCEC, and the effective speed-up must be even worse. How many CPU threads are best for using 2 RTX GPUs? At some point I plan to get a new system with an 8-16 core CPU, and I am not sure whether a second GPU is worth having. I have a well tuned and cooled RTX 2070 which runs fast and flawlessly under very heavy and long loads, and if the scaling is good, I would go for an identical second GPU.
Also, is there any prospect of the Lc0 engine handling speeds above 80-100k NPS? It is a serious bottleneck for me with smaller nets.
Hugo
Posts: 782
Joined: Tue Dec 01, 2009 11:10 am

Re: Using LC0 with one or two GPUs - a guide

Post by Hugo »

Hi all

A few days ago I installed an RTX 2060 in addition to my RTX 2070.
I have benchmarked the system with one of the latest 40 networks.
My benchmark uses go nodes 5000000 in console mode.
A single RTX 2070 gave about 35,000 nps.
A single RTX 2060 gave about 30,000 nps.

Both together (backend=roundrobin) gave about 65,000 nps and still increasing; when using it in a GUI it was well over 70,000 nps after two minutes.
In a real game, the load on both GPUs is not always full. It is more like something between 80% and 98%.
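
The NPS scaling asked about can be read off these figures with a one-line
calculation. A minimal Python sketch follows; the function name is made up, and
"efficiency" here simply means combined nps divided by the sum of the
single-GPU nps values.

```python
def nps_scaling_efficiency(combined_nps, single_nps):
    """Combined nps divided by the sum of the single-GPU nps values
    (1.0 means the GPUs add up perfectly)."""
    total = sum(single_nps)
    if total <= 0:
        raise ValueError("single-GPU nps values must sum to a positive number")
    return combined_nps / total


if __name__ == "__main__":
    # Figures from the benchmark above: RTX 2070 about 35000 nps,
    # RTX 2060 about 30000 nps, both together about 65000 nps.
    print(nps_scaling_efficiency(65000, [35000, 30000]))
```

For the console benchmark this gives 1.0, i.e. the two cards add up without a
visible loss in nps. It says nothing about the effective speed-up in playing
strength.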

regards, C.K.

Laskos

Re: Using LC0 with one or two GPUs - a guide

Post by Laskos »

Hugo wrote: Sat Mar 30, 2019 3:59 pm Hi all

A few days ago I installed an RTX 2060 in addition to my RTX 2070.
I have benchmarked the system with one of the latest 40 networks.
My benchmark uses go nodes 5000000 in console mode.
A single RTX 2070 gave about 35,000 nps.
A single RTX 2060 gave about 30,000 nps.

Both together (backend=roundrobin) gave about 65,000 nps and still increasing; when using it in a GUI it was well over 70,000 nps after two minutes.
In a real game, the load on both GPUs is not always full. It is more like something between 80% and 98%.

regards, C.K.

So what the heck were they doing at TCEC? They had 55-65k NPS with a 2080 Ti + 2080 in the openings and middlegames. Their temperatures were pretty high, and I guess the airflow arrangement was not optimal; maybe they didn't even have a case fan.
Krzysztof Grzelak

Re: Using LC0 with one or two GPUs - a guide

Post by Krzysztof Grzelak »

Thank you very much for the information, smatovic. A request to Laskos, Hugo and crem: please focus a little bit on the CPU rather than the GPU. I understand that you are running the engine on a GPU, but most people will not go straight to the store to buy a graphics card for a few hundred dollars.
smatovic

Re: Using LC0 with one or two GPUs - a guide

Post by smatovic »

A little update for the guide...

- PCIe lanes
A GPU commonly uses 16 PCIe lanes (x16). x8 should be sufficient with current
Lc0; x4 might carry a performance penalty of an estimated 5%.

http://talkchess.com/forum3/viewtopic.p ... 92#p824192

- External GPU
An eGPU commonly uses a Thunderbolt (or the upcoming USB4) connection with 4 PCIe
lanes (x4), so expect an estimated 5% penalty with current Lc0.

- Speedup by RTX TensorCores?
The speedup from Nvidia RTX TensorCores is estimated at about 2x compared to
GPUs without TensorCores but with the same FP16 throughput.

http://talkchess.com/forum3/viewtopic.p ... 47#p870041

- Backends
There are different GPU backends with different releases and NPS throughput, so
you might have to check recent benchmarks of recent backend versions such as
OpenCL, CUDA, cuDNN and DX12. The DX12 backend should also run on upcoming
Intel/AMD GPUs under Windows. The OpenCL backend should also run on ARM/Mali GPUs.
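
The PCIe and eGPU estimates from this update can be applied to a measured nps
figure with a small lookup. This Python sketch only encodes the rough estimates
given here (x16 and x8: no penalty expected, x4: about 5%); the names
PCIE_PENALTY and estimated_nps are made up for illustration, and the numbers are
estimates, not measurements.

```python
# Estimated nps penalties by PCIe link width, taken from the rough figures
# in this post. These are estimates, not measurements.
PCIE_PENALTY = {16: 0.00, 8: 0.00, 4: 0.05}


def estimated_nps(base_nps, lanes):
    """Scale an nps figure measured on a full x16 link by the estimated
    penalty for a narrower link (x4 also covers the eGPU case)."""
    if lanes not in PCIE_PENALTY:
        raise ValueError(f"no estimate for x{lanes}")
    return base_nps * (1.0 - PCIE_PENALTY[lanes])


if __name__ == "__main__":
    # 35000 nps is just an example input.
    print(round(estimated_nps(35000, 4)))
```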

--
Srdja
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Using LC0 with one or two GPUs - a guide

Post by corres »

Laskos wrote: Sat Mar 30, 2019 6:18 pm
So what the heck were they doing at TCEC? They had 55-65k NPS with a 2080 Ti + 2080 in the openings and middlegames. Their temperatures were pretty high, and I guess the airflow arrangement was not optimal; maybe they didn't even have a case fan.
If you have two or more GPUs, you need not only more case fans but also a bigger case (a full tower), or
you should buy "blower type" GPUs, which blow the heated air out of the case.