Dual RTX 2060 for Leela


corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Dual RTX 2060 for Leela

Post by corres »

I started a match between the dual RTX 2060 and a single RTX 2060. The match is now at game 41 and the score so far is: Dual RTX 2060 : Single RTX 2060 = 4 : 0 (36 draws).
TC is 2 min + 2 sec / move, as before.
When the match ends I will report the result.
After that I plan a match with a longer TC too.
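As a rough check of what this interim score means, here is a minimal sketch that converts 4 wins, 0 losses and 36 draws into a score fraction and an Elo difference. It assumes the standard logistic Elo formula, which is not something stated in this thread, and the function names are only illustrative. With 40 games the error margin is large, so the number is indicative at best.

```python
import math


def match_score(wins: int, losses: int, draws: int) -> float:
    """Score fraction from the first player's point of view (draw = half a point)."""
    games = wins + losses + draws
    return (wins + 0.5 * draws) / games


def elo_difference(score: float) -> float:
    """Elo difference implied by a score fraction (standard logistic model, assumed here)."""
    return 400.0 * math.log10(score / (1.0 - score))


# Interim result quoted above: dual GPU 4 wins, 0 losses, 36 draws.
score = match_score(wins=4, losses=0, draws=36)
elo = elo_difference(score)

print(f"score = {score:.3f}")
print(f"Elo difference = {elo:+.0f}")
```

For these numbers the score fraction is 0.55, which corresponds to roughly +35 Elo for the dual GPU setup under that model.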

Hugo,
As I noted, your NNCachesize is too high.
For such fast games 2000000 is more than enough.
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Dual RTX 2060 for Leela

Post by corres »

I made a test with AntiFish_1.9_Mark_313_2400 Net file.
My system:
Ryzen 7 1800x 8x4000 MHz. 2xRTX 2060, Windows 10
Lc0 version 0.21.1
Lc0 parameters:
NNCachesize=10000000
SmartPruningFactor=0.0
Backend=cudnn-fp16, for DUAL GPU=Multiplexing
Other parameters are Default
Measuring with Go Nodes 10000000
Test 1:
SINGLE GPU1
max nps = 20961 (depth 21, nodes 3962809)
Test 2:
SINGLE GPU2
max nps = 20119 (depth 21, nodes 3780518)
Test 3:
DUAL GPU
max nps = 34108 (depth 21, nodes 5200827)

Effectiveness of the DUAL GPU is about 83%

I look forward to independent tests, especially from testers with stronger dual GPUs.
jjoshua2
Posts: 99
Joined: Sat Mar 10, 2018 6:16 am

Re: Dual RTX 2060 for Leela

Post by jjoshua2 »

corres wrote: Thu May 02, 2019 9:27 am
Lc0 parameters:
NNCachesize=10000000
SmartPruningFactor=0.0
Backend=cudnn-fp16, for DUAL GPU=Multiplexing
Other parameters are Default
Maybe you forgot to mention threads, since the default of 2 will be non-optimal. Have you tried threads=3 with round-robin yet? TCEC uses this for a reason: it's the fastest setting for almost all dual-GPU systems, unless perhaps they are heavily imbalanced, like a 2080 + 1080 without RTX cores. CCCC also tested fastest with 3 threads but decided to use 2 threads, because the demux backend with 4 GPUs needs such a large batch size (640) that they worried the slightly higher nps might not bring a strength gain.
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Dual RTX 2060 for Leela

Post by corres »

Corrected:
I made a test with AntiFish_1.9_Mark_313_24000 (!) Net file.
My system:
Ryzen 7 1800x 8x4000 MHz. 2xRTX 2060, Windows 10
Lc0 version 0.21.1
Lc0 parameters:
NNCachesize=10000000
SmartPruningFactor=0.0
Backend=cudnn-fp16, for DUAL GPU=Multiplexing
Threads for SINGLE GPUs=2 (!)
Threads for DUAL GPU=4 (!)
Other parameters are Default
Measuring with Go Nodes 10000000
Test 1:
SINGLE GPU1
max nps = 20961 (depth 21, nodes 3962809)
Test 2:
SINGLE GPU2
max nps = 20119 (depth 21, nodes 3780518)
Test 3:
DUAL GPU
max nps = 34108 (depth 21, nodes 5200827)

Effectiveness of the DUAL GPU is about 83%
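
The 83% figure can be reproduced from the three measurements above. This is a minimal sketch that assumes "effectiveness" means the dual-GPU nps divided by the sum of the two single-GPU nps values; the post does not state the formula, but the numbers match that definition. The function name is only illustrative.

```python
def dual_gpu_efficiency(nps_dual: float, nps_singles: list[float]) -> float:
    """Dual-GPU nps as a fraction of the ideal (sum of the single-GPU nps values)."""
    return nps_dual / sum(nps_singles)


# Max nps values reported in Tests 1 to 3.
single_nps = [20961, 20119]
dual_nps = 34108

eff = dual_gpu_efficiency(dual_nps, single_nps)
speedup = dual_nps / max(single_nps)  # gain over the faster of the two cards

print(f"efficiency = {eff:.1%}")
print(f"speedup over the faster single GPU = {speedup:.2f}x")
```

In other words, the second card adds about 63% on top of the faster single card rather than the ideal 100%.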

I look forward to independent tests, especially from testers with stronger dual GPUs.
-----------------------------------------------------------------------------------------
Thanks for the note; the test description above is now more detailed.
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Dual RTX 2060 for Leela

Post by corres »

jjoshua2 wrote: Thu May 02, 2019 2:48 pm ...
Have you tried threads=3 with round-robin yet? TCEC uses this for a reason: it's the fastest setting for almost all dual-GPU systems, unless perhaps they are heavily imbalanced, like a 2080 + 1080 without RTX cores. CCCC also tested fastest with 3 threads but decided to use 2 threads, because the demux backend with 4 GPUs needs such a large batch size (640) that they worried the slightly higher nps might not bring a strength gain.
Earlier I ran tests with Multiplexing and RoundRobin using different numbers of threads:
Net file = 41812, NNCacheSize = 2000000, MiniBatchSize = 512, SmartPruningFactor = 0.0
RoundRobin:
Threads = 2, max nps = 29476
Threads = 3, max nps = 49923
Threads = 4, max nps = 51185

Multiplexing:
Threads = 2, max nps = 37425
Threads = 3, max nps = 45746
Threads = 4, max nps = 51800

It seems that with Threads = 3 RoundRobin gives the higher nps, but Multiplexing with 4 threads gives the highest nps overall.
I think Multiplexing is less sensitive to heating and to GPU throttling.
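
The comparison above can be checked mechanically. The sketch below only re-uses the six nps values from this post; the dictionary layout and function names are illustrative, not part of Lc0.

```python
# Max nps per backend and thread count, copied from the measurements above.
results = {
    "RoundRobin": {2: 29476, 3: 49923, 4: 51185},
    "Multiplexing": {2: 37425, 3: 45746, 4: 51800},
}


def best_config(results):
    """Return (backend, threads, nps) for the single highest nps value."""
    rows = (
        (backend, threads, nps)
        for backend, by_threads in results.items()
        for threads, nps in by_threads.items()
    )
    return max(rows, key=lambda row: row[2])


def faster_backend_per_thread_count(results):
    """For each thread count, name the backend with the higher nps."""
    thread_counts = sorted(set().union(*(r.keys() for r in results.values())))
    return {t: max(results, key=lambda b: results[b].get(t, 0)) for t in thread_counts}


best = best_config(results)
per_threads = faster_backend_per_thread_count(results)
gap_at_4 = results["Multiplexing"][4] / results["RoundRobin"][4] - 1

print(best)
print(per_threads)
print(f"Multiplexing vs RoundRobin at 4 threads: {gap_at_4:+.1%}")
```

The gap at 4 threads is only about 1%, so run-to-run variation (for example from thermal throttling) could plausibly change which backend comes out ahead.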
Hugo
Posts: 782
Joined: Tue Dec 01, 2009 11:10 am

Re: Dual RTX 2060 for Leela

Post by Hugo »

Hi

I have connected RTX 2060 + RTX 2070.
My setup is:


--backend=multiplexing
--backend-opts=(backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)
--threads=6
--nncache=20000000
Both cards run at clock speeds of about 1900 MHz (±100 MHz).
The network used is 41812.
With 6 threads I get 72529 nps.
With 4 threads I get 68985 nps.

Both nps values were taken after go nodes 10000000 had finished.
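
For other GPU counts, the option strings above can be generated instead of typed by hand. This is a minimal sketch: the helper name `lc0_multi_gpu_args` is hypothetical, and the only flags it emits are the four shown in this post. It also computes the relative gain of 6 threads over 4 from the two nps values reported here.

```python
def lc0_multi_gpu_args(gpu_ids, threads, nncache,
                       backend="multiplexing", sub_backend="cudnn-fp16"):
    """Build the Lc0 option strings shown in the post for a list of GPU ids.

    Hypothetical helper: it only formats strings and does not launch Lc0.
    """
    opts = ",".join(f"(backend={sub_backend},gpu={g})" for g in gpu_ids)
    return [
        f"--backend={backend}",
        f"--backend-opts={opts}",
        f"--threads={threads}",
        f"--nncache={nncache}",
    ]


# Reproduce the setup from the post: two GPUs, 6 threads, 20M cache entries.
args = lc0_multi_gpu_args(gpu_ids=[0, 1], threads=6, nncache=20000000)
print("\n".join(args))

# Relative gain of 6 threads over 4 threads, from the reported nps values.
gain = 72529 / 68985 - 1
print(f"6 vs 4 threads: {gain:+.1%}")
```

On these numbers the two extra threads are worth about 5% more nps.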

Regards, C.K.

corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Dual RTX 2060 for Leela

Post by corres »

Hugo wrote: Thu May 02, 2019 9:06 pm
Hi
I have connected RTX 2060 + RTX 2070.
My setup is:


--backend=multiplexing
--backend-opts=(backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)
--threads=6
--nncache=20000000
Both cards run at clock speeds of about 1900 MHz (±100 MHz).
The network used is 41812.
With 6 threads I get 72529 nps.
With 4 threads I get 68985 nps.
Both nps values were taken after go nodes 10000000 had finished.
Regards, C.K.
Thanks for the report.
The nps values are very good.
A question: what is the core clock of your CPU, in MHz?
If the core clock is low, more threads than the default may help.
I did not test more than 4 threads because my CPU has only 8 physical cores in total.