Dual RTX 2060 for Leela


Post by corres »

I installed two RTX 2060 cards (Gigabyte Windforce OC) into my Ryzen 7 1800X PC (8 cores at 4000 MHz) and ran some tests.
I used Leela version 0.21.1 for the tests.
1. Test: Net 11250
1a. Test: Default parameters
GPU1
setoption name backend value cudnn-fp16
go nodes 1000000
Result: max nps = 22533 (depth 10 time 15406 nodes 347152 hashfull 986)
GPU2
setoption name backend value cudnn-fp16
go nodes 1000000
Result: max nps = 20313 (depth 10 time 19277 nodes 391592 hashfull 1000)
Note: GPU2 is in the second (SLI) slot, which is a PCIe 2.0 x4 slot with 1/8 of the bandwidth.
DUAL GPU
setoption name threads value 4
setoption name backend value multiplexing
setoption name backendoptions value (backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)
go nodes 1000000
Result: max nps = 41481 (depth 10 time 1102 nodes 456797 hashfull 1000)

npsGPU1 + npsGPU2 = 42846, so the efficiency of the dual-GPU setup is about 97%.

1b. Test: Parameters found by Laskos
GPU1
setoption name backend value cudnn-fp16
setoption name minibatchsize value 512
setoption name nncachesize value 2000000
go nodes 5000000
Result: max nps = 28646 (depth 13 time 143931 nodes 4036742 hashfull 919)
GPU2
setoption name backend value cudnn-fp16
setoption name minibatchsize value 512
setoption name nncachesize value 2000000
go nodes 5000000
Result: max nps = 25143 (depth 13 time 145956 nodes 3669798 hashfull 839)
Note: as above
DUAL GPU
setoption name threads value 4
setoption name minibatchsize value 512
setoption name nncachesize value 2000000
setoption name backend value multiplexing
setoption name backendoptions value (backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)
go nodes 5000000
Result: max nps = 51646 (depth 13 time 73566 nodes 3780545 hashfull 876)

npsGPU1 + npsGPU2 = 53789, so the efficiency of the dual-GPU setup is about 96%.
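
For anyone who wants to reproduce these numbers, here is a minimal sketch of a Python script that feeds the same UCI commands to lc0 and reports the highest nps seen in the info output. The binary path "./lc0" and the max_nps helper are only my illustration, not official tooling; adapt the options dictionary to whichever test you want to repeat.

import subprocess

def max_nps(lc0_path, options, nodes):
    # Talk to lc0 over plain UCI via stdin/stdout.
    p = subprocess.Popen([lc0_path], stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, text=True, bufsize=1)
    def send(cmd):
        p.stdin.write(cmd + "\n")
        p.stdin.flush()
    send("uci")
    for name, value in options.items():
        send("setoption name %s value %s" % (name, value))
    send("isready")
    for line in p.stdout:            # wait until the engine is ready
        if line.strip() == "readyok":
            break
    send("go nodes %d" % nodes)
    best = 0
    for line in p.stdout:            # track the highest nps reported
        tokens = line.split()
        if "nps" in tokens:
            best = max(best, int(tokens[tokens.index("nps") + 1]))
        if line.startswith("bestmove"):
            break
    send("quit")
    p.wait()
    return best

# Example: the dual-GPU run from test 1a above.
options = {
    "Threads": 4,
    "Backend": "multiplexing",
    "BackendOptions": "(backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)",
}
print(max_nps("./lc0", options, 1000000))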

(continued)

Re: Dual RTX 2060 for Leela

Post by Laskos »

corres wrote: Fri Apr 19, 2019 12:05 pm ...
Thanks, looks good! That is probably the most cost-efficient setup. For $800, the price of an RTX 2080, you get speeds significantly above a 2080 Ti ($1300) and 40% above a 2080. In the future I will build a similar setup with 2 x RTX 2070, but the 2060 seems the most cost-efficient solution. I am curious whether two GPUs scale well strength-wise, but I guess that if the NPS numbers are good, the effective speed-up is not far behind.

Re: Dual RTX 2060 for Leela

Post by corres »

corres wrote: Fri Apr 19, 2019 12:05 pm ...
2. test: Net 41800 (TCEC)
2a. Test: Default parameters
GPU1
setoption name backend value cudnn-fp16
go nodes 1000000
Result: max nps = 23595 (depth 14 time 26760 nodes 631425 hashfull 1000)
GPU2
setoption name backend value cudnn-fp16
go nodes 1000000
Result: max nps = 21505 (depth 14 time 29358 nodes 631355 hashfull 1000)
Note: as above
DUAL GPU
setoption name threads value 4
setoption name backend value multiplexing
setoption name backendoptions value (backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)
go nodes 1000000
Result: max nps = 43489 (depth 14 time 14529 nodes 631862 hashfull 1000)

npsGPU1 + npsGPU2 = 45100, so the efficiency of the dual-GPU setup is about 96%.

2b. Test: Parameters found by Laskos
GPU1
setoption name backend value cudnn-fp16
setoption name minibatchsize value 512
setoption name nncachesize value 2000000
go nodes 5000000
Result: max nps = 34213 (depth 17 time 87071 nodes 2978872 hashfull 469)
GPU2
setoption name backend value cudnn-fp16
setoption name minibatchsize value 512
setoption name nncachesize value 2000000
go nodes 5000000
Result: max nps = 31108 (depth 17 time 95821 nodes 2980840 hashfull 471)
Note: as above
DUAL GPU
setoption name threads value 4
setoption name minibatchsize value 512
setoption name nncachesize value 2000000
setoption name backend value multiplexing
setoption name backendoptions value (backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)
go nodes 5000000
Result: max nps = 53130 (depth 17 time 56260 nodes 2989143 hashfull 476)

npsGPU1 + npsGPU2 = 65321, so the efficiency of the dual-GPU setup is about 81%.
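
For clarity, the efficiency figures above are simply the dual-GPU nps divided by the sum of the two single-GPU nps. A quick sketch with the numbers reported in these four tests:

# Dual-GPU efficiency = dual nps / (nps GPU1 + nps GPU2),
# using the figures reported in tests 1a, 1b, 2a and 2b above.
tests = {
    "1a net 11250, default": (22533, 20313, 41481),
    "1b net 11250, tuned":   (28646, 25143, 51646),
    "2a net 41800, default": (23595, 21505, 43489),
    "2b net 41800, tuned":   (34213, 31108, 53130),
}
for name, (nps1, nps2, dual) in tests.items():
    print("%-22s %.1f%%" % (name, 100.0 * dual / (nps1 + nps2)))
# prints roughly 96.8%, 96.0%, 96.4% and 81.3%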

Re: Dual RTX 2060 for Leela

Post by Laskos »

corres wrote: Fri Apr 19, 2019 1:37 pm ...
Hello, can you use these for each card:

Backend value cudnn-fp16
MinibatchSize value 512
NNCacheSize value 10000000

then

WeightsFile value .\weights_run1_41687.pb.gz

and then observe the speeds reached immediately after the 10 million nodes mark?

Also, could you first try "multiplexing" and then another run with "roundrobin"?

EDIT: Also, I am not sure about the number of CPU threads. They seemed to use 3 threads in TCEC on dual GPUs; I am not sure what the reason was. Maybe the MCTS parallelization is crappy.
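
To be concrete, the run I am asking for could be scripted as the UCI sequence below, once per backend. This is only a sketch; I am assuming "roundrobin" accepts the same BackendOptions string that "multiplexing" uses in your tests.

# Emit the requested settings as UCI commands, once per backend.
# Assumption: "roundrobin" takes the same BackendOptions string as "multiplexing".
settings = {
    "BackendOptions": "(backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)",
    "MinibatchSize": 512,
    "NNCacheSize": 10000000,
    "WeightsFile": r".\weights_run1_41687.pb.gz",
}
for backend in ("multiplexing", "roundrobin"):
    print("setoption name Backend value %s" % backend)
    for name, value in settings.items():
        print("setoption name %s value %s" % (name, value))
    print("go nodes 10000000")   # observe nps just after the 10M nodes mark
    print()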

Re: Dual RTX 2060 for Leela

Post by corres »

Laskos wrote: Fri Apr 19, 2019 1:32 pm ...
I agree.
But there are some issues with a dual-GPU setup:
It needs more room in the PC case, it draws more power (an RTX 2080 needs ~220 W, an RTX 2080 Ti ~250 W, and the dual RTX 2060 ~320 W), and naturally it produces more heat.

Re: Dual RTX 2060 for Leela

Post by Laskos »

corres wrote: Fri Apr 19, 2019 1:52 pm ...
Yes, but it is probably better overclockable with good cooling. Use 2-3 case fans; I used 2 for my smaller setup and reduced temperatures on both the CPU and the GPU by some 10-14 °C, so I managed to overclock my 2070 for a 10% NPS speed-up, with the GPU temperature never going above 70 °C. The case should be spacious enough (I changed mine).

Re: Dual RTX 2060 for Leela

Post by Laskos »

Laskos wrote: Fri Apr 19, 2019 1:49 pm ...
I am getting the following with the settings above, just after the 10 million nodes mark, but with my stably overclocked RTX 2070:

setoption name Backend value cudnn-fp16
setoption name MinibatchSize value 512
setoption name NNCacheSize value 10000000
setoption name WeightsFile value .\weights_run1_41687.pb.gz

info depth 19 seldepth 57 time 220147 nodes 10789318 score cp 44 hashfull 250 nps 49009 tbhits 0 pv d2d4

This is about a 3-4 minute search, or tournament time control. I am curious what your dual-2060 setup shows (although the Lc0 engine seems to have problems digesting speeds above 70-80k NPS).
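
For reference, the numbers in an info line like the one above are easy to pull out with a few lines of Python. A minimal sketch (the parse_info helper is mine), using the exact line quoted above:

# Parse the numeric fields of an lc0 UCI "info" line into a dict.
def parse_info(line):
    tokens = line.split()
    out = {}
    for key in ("depth", "seldepth", "time", "nodes", "hashfull", "nps", "tbhits"):
        if key in tokens:
            out[key] = int(tokens[tokens.index(key) + 1])
    return out

line = ("info depth 19 seldepth 57 time 220147 nodes 10789318 "
        "score cp 44 hashfull 250 nps 49009 tbhits 0 pv d2d4")
print(parse_info(line))
# {'depth': 19, 'seldepth': 57, 'time': 220147, 'nodes': 10789318,
#  'hashfull': 250, 'nps': 49009, 'tbhits': 0}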

Re: Dual RTX 2060 for Leela

Post by corres »

Laskos wrote: Fri Apr 19, 2019 1:49 pm ...
"Roundrobin" is a trick to invert the GPUs and it has sense if you use different GPUs only.
If it is used nncachesize 10000000 instead of nncachesize 2000000 the max nps will grow in some minimal measure.
I do not understand .\weights_run1_41687.pb.gz. Obviously 41687 is an another Net file but what is the good of it?

Re: Dual RTX 2060 for Leela

Post by Laskos »

corres wrote: Fri Apr 19, 2019 3:24 pm ...
No, it is just to have the same net, as every net shows a different speed behavior. I am not sure about your description of the "roundrobin" option.

Re: Dual RTX 2060 for Leela

Post by crem »

It's at least intended that roundrobin works best when all GPUs are the same; for multiplexing it's not that strict.

I wrote a blog post about backends recently: http://blog.lczero.org/2019/04/backend- ... l?m=1#more
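
As a side note, the BackendOptions string used throughout this thread generalizes to any number of identical GPUs. A small sketch, assuming the same comma-separated syntax shown earlier:

# Build a BackendOptions string for N identical GPUs, in the same
# "(backend=...,gpu=0),(backend=...,gpu=1)" form used earlier in the thread.
def backend_options(num_gpus, backend="cudnn-fp16"):
    return ",".join("(backend=%s,gpu=%d)" % (backend, i) for i in range(num_gpus))

print(backend_options(2))
# (backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)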